All posts by Terry Choo

Get list of dates using recursive Common Table Expression (CTE)

Many times we need a list of consecutive dates or times at runtime, mainly for a recursive ETL process. This code produces the list almost instantaneously, with the start and end dates as the only inputs. The output is in Date format but is easily extended to DateTime.

For best results, use this as a CTE within your solution so that it does not take up space in the database.

The MAXRECURSION option ensures that the recursion does not go on forever; note, however, that the Per Minute and Per Second versions will easily exceed a typical MAXRECURSION limit.

Note: The code for each version is similar; the only difference is the first parameter of the DATEADD function, which determines the interval. A minimal sketch of the Daily version follows the list below.

Daily
Monthly
Yearly
Hourly
Per Minute
Per Second
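
Here is a minimal sketch of the Daily version, assuming only a start date and an end date as inputs; swap the first DATEADD argument (day, month, year, hour, minute, second) for the other variants. The variable names are illustrative only.

DECLARE @StartDate DATE = '2014-01-01';
DECLARE @EndDate   DATE = '2014-01-31';

WITH DateList AS
(
    SELECT @StartDate AS DateValue
    UNION ALL
    SELECT DATEADD(day, 1, DateValue)
    FROM DateList
    WHERE DATEADD(day, 1, DateValue) <= @EndDate
)
SELECT DateValue
FROM DateList
OPTION (MAXRECURSION 0); -- 0 lifts the default limit of 100 recursion levels
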
Continue reading

Top 10 Tips when synchronising your database and MailChimp

Okay, sooner or later someone is going to need to synchronise all those people who registered on your company website with an email service provider to do a bit of marketing. In my case, the latest email provider to come my way is MailChimp. MailChimp is a very popular email service provider that was used primarily by small businesses which could not afford the typical email giants like Experian and CheetahMail. However, the industry has changed and MailChimp now services many major companies as well.

To do so, you will need to use the MailChimp API. The latest as of the time of writing is v2.0 and can be found here: http://apidocs.mailchimp.com/api/2.0/

On to the tips!

1. Try not to update incrementally using dates
2. Do a full list compare between the MailChimp lists and your own database
3. Re-subscribing requires 2 actions
4. Use 1 or 0 for boolean parameters
5. Remember localisation
6. Remove leading and trailing spaces and tab characters from email addresses before uploading or comparing them (see the sketch after this list)
7. Do not forget to handle stray emails
8. Use Webhooks to unsubscribe people in real time
9. Remember to close the loop by unsubscribing people from your database too
10. Keep records of invalid emails – and do something about them
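
For tip 6, here is a minimal T-SQL sketch. The Subscribers table and Email column are hypothetical names used purely for illustration.

SELECT
    LTRIM(RTRIM(REPLACE(Email, CHAR(9), ''))) AS CleanEmail -- strip tab characters, then leading/trailing spaces
FROM Subscribers;
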

Continue reading

Top 11 Qualities of a Data Warehouse Implementer

Top 11th Quality: Curious to explore different designs

There are many ways to achieve the same thing. Before actually getting your hands dirty, think through and sketch a few high-level designs, then choose the best one. They need not be perfect, and the beauty of it is that the more experience you have, the faster you come up with designs. The downside of experience is the tendency to stick to the same design every time without asking whether you can do it better.

Continue reading

Building a simple forecast solution on SQL Server

Pricing forecasting is a marketing function that requires a lot of data crunching.

Normally there are a number of factors that the Marketing team wants to adjust in order to find the optimum pricing for their products.

Accordingly, the forecast solution needs to be fairly flexible yet still powerful enough to perform the necessary, and sometimes complex, calculations. Otherwise, if the design is too rigid and requirements change even slightly, major design changes have to be made, which in turn means whole new rounds of testing.
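
As one example of what "flexible" can mean here (a sketch of my own, not necessarily the design described in the full post), the adjustable factors can be kept in a parameter table so Marketing can change them without a redesign. The tables and columns below are hypothetical.

-- Hypothetical parameter table of adjustable factors
CREATE TABLE dbo.ForecastFactor
(
    FactorName  VARCHAR(50)   NOT NULL PRIMARY KEY,
    FactorValue DECIMAL(9, 4) NOT NULL
);

INSERT INTO dbo.ForecastFactor (FactorName, FactorValue)
VALUES ('Seasonality', 1.0500),
       ('Discount',    0.9500);

-- Apply the factors at query time; dbo.Product (ProductId, BasePrice) is assumed to exist
SELECT p.ProductId,
       p.BasePrice * s.FactorValue * d.FactorValue AS ForecastPrice
FROM dbo.Product AS p
CROSS JOIN dbo.ForecastFactor AS s
CROSS JOIN dbo.ForecastFactor AS d
WHERE s.FactorName = 'Seasonality'
  AND d.FactorName = 'Discount';
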

Continue reading

Client Pressures and Code Quality

In my experience, code quality always trumps urgency, because when code of questionable quality is released, bugs inevitably appear and the project ends up taking longer than it should.

In my case, because the code belongs to a third-party vendor, this frustrates the developers, the business users and ultimately the end customers.

This is where the negotiating skills and experience of the vendor Project Manager are so important. The client PM (in my case, one of us) will always apply pressure to get the code as soon as possible, mainly because of company pressure. And sometimes the vendor PM, on consulting with his peers, agrees to push the code through, even bypassing tried and true in-house processes to get things moving along. However, the result is almost always that the code disappoints when the client developers try to use it. This hurts the vendor's reputation even more.

Sometimes it is better to commit to following the tried-and-tested process so that code quality can be guaranteed. In other words: a little more time, a lot fewer problems, and far more satisfactory results.

SQL Query Result Order

Every time an SQL query is run without the ORDER BY clause, the sort order of the results returned is not guaranteed.

SQL Server decides the best order in which to return the results, based on factors such as whether indexes are used and how fragmented the data is on the physical drive, because it will attempt to return the result in the most efficient way.

If you run the same query a few times in the same environment, you may find that the sort order stays the same. That does not mean the sort order is guaranteed, only that SQL Server happened to process the data in the same way each time.

The same query may not return the results in the same order if indexes have been rebuilt, the database has been moved to another server, or for many other reasons.
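
A minimal illustration, using the same orders table as in the examples further down: the first query may return rows in whatever order the engine finds most efficient, while the second is guaranteed to come back sorted.

SELECT customer, orderdate FROM orders;                     -- order not guaranteed
SELECT customer, orderdate FROM orders ORDER BY orderdate;  -- order guaranteed
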

Just a point to note.

SQL Logical Query Processing Order

This probably applies more to beginners, or to people who have been using SQL for a while but have been wondering why the query sometimes works in a certain way and sometimes doesn't.

Before I read the book for the exam 70-461, I found it relatively easy to just plonk in some code in SQL Server Management Studio and Voila! it works. Well, not all the time.

For example, this works:

SELECT customer, YEAR(orderdate) AS OrderYear FROM orders WHERE YEAR(orderdate) = 2013

But this doesn’t:

SELECT customer, YEAR(orderdate) AS OrderYear FROM orders WHERE OrderYear = 2013

Most of us will normally just go, "Hey, it's no biggie, I'll just use the former and then find out why later", but most often we will be too busy to do so. 🙂

Here is the “keyed-in order” of a standard SQL Query:

  1. SELECT
  2. FROM
  3. WHERE
  4. GROUP BY
  5. HAVING
  6. ORDER BY

This is how it is supposed to be typed so that it does not throw a syntax error.

This is how SQL Server processes the query:

  1. FROM
  2. WHERE
  3. GROUP BY
  4. HAVING
  5. SELECT
  6. ORDER BY

See the difference?

Let's go back to the second example, the one that fails:

SELECT customer, YEAR(orderdate) AS OrderYear FROM orders WHERE OrderYear = 2013

Since the query is processed in that order, SQL Server processes the WHERE clause before the SELECT clause, so it has no knowledge of the alias "OrderYear" – hence it fails.
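
Two common workarounds, sketched here against the same orders table: either repeat the expression in the WHERE clause, or define the alias in a derived table so it already exists by the time the outer WHERE is processed.

-- Option 1: repeat the expression in the WHERE clause
SELECT customer, YEAR(orderdate) AS OrderYear FROM orders WHERE YEAR(orderdate) = 2013;

-- Option 2: define the alias in a derived table first
SELECT customer, OrderYear
FROM (SELECT customer, YEAR(orderdate) AS OrderYear FROM orders) AS o
WHERE OrderYear = 2013;
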

Another common way to write queries with the ORDER BY clause is by using the alias:

SELECT customer, YEAR(orderdate) AS OrderYear FROM orders ORDER BY OrderYear

Again referring back to the Logical Query Processing Order, this query works because ORDER BY is processed after the SELECT clause, so SQL Server is fully aware of the alias "OrderYear".

What are the benefits?

In my experience, I often re-format a datetime field into something more readable, but this makes the query longer. Therefore, to make the code more elegant, I use the alias:

SELECT customer, CONVERT(VARCHAR(10), orderdate, 112) AS NewOrderDate FROM orders ORDER BY NewOrderDate

I find that knowing this makes me more efficient, as I code with it in mind and make fewer mistakes.

If this helps anyone, do let me know. Thank you.

SQL and Relational Theory

Do you know the foundation of SQL?

According to the book Querying Microsoft SQL Server 2012 Training Kit, published by Microsoft Press, T-SQL is Microsoft's flavour of the more standard ANSI/ISO SQL, which is based on the relational model.

The relational model is in turn built on two branches of mathematics – set theory and predicate logic.

Set theory says that you interact with a set not through its individual elements but with the set of elements as a whole. A set consists of a collection of distinct, definite objects.

Predicate logic is concerned with expressions (predicates) that evaluate to true or false.

The combination of set theory and predicate logic gives you the foundation of SQL.

How is this useful?

Set theory implies that all elements are unique (distinct) and that their order does not matter. Therefore, when constructing T-SQL queries, consider the set of elements as a whole when manipulating data; this already puts you on the right track. Then decide whether you are trying to get results quickly for a one-off query or optimising the query for repeated use (as in business reports). Only deviate from the relational model when it finally comes to presentation – for example, by adding the ORDER BY clause to sort the data.
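
A small illustration of that last point, using the same hypothetical orders table as earlier: the body of the query treats the rows as a set (distinct, with no inherent order), and ORDER BY is added only for presentation.

SELECT DISTINCT customer
FROM orders
ORDER BY customer; -- sorting is a presentation concern, applied last
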

The right terminology for the right audience

For those who have had the privilege of being in the technology consulting industry for some time and started out on the technical side of things, remember those books or courses that teach methodology or best practices?

For example, for project management you need to specify a requirements gathering stage (done by the Business Analyst), a specifications stage, a solutioning stage, a project planning stage, a development stage, a unit testing stage, an integration stage, iterating between development, unit testing and integration again and again, a user acceptance testing stage, go-live, application support, a re-look at the lifecycle…. zzzzzzz…

The thing is, most of the time I find that business users don't really care for these terms. I recently simplified the stages (a lot!), and my boss still removed the words "user acceptance" from the testing stage. Eyes start to glaze over when I explain the various stages to business users, thinking they would appreciate knowing how structured I was. (Guess they weren't that interested after all.)

If you are a developer and your audience are technical integrators or project managers, then by all means expand on these terms. Otherwise, users want to hear things in words from their world – things like YoY, MoM, revenue, PPI and KPI for finance and operations, or production performance and keyword search results for others.

Point is: tailor your communication to your audience. (And this applies to developers too – especially developers who love to use developer-world terms. 🙂 ) Basically, use whatever is useful and gets the point across clearly and concisely.

Address Clean-up for Mail Merge using SSIS

Remember the good ol’ days of going through a list of addresses and manually cleaning up hundreds if not thousands of addresses to prepare for a snail mail merge? What? You’re still doing it now in the 21st century? Just kidding.

Chances are, you will need to be able to detect duplicates. I find that perhaps the hardest task is to manually eyeball the data for duplicates, because duplicates seldom mean that the data is THE SAME. More likely, the data is SIMILAR but different in subtle little ways. For example, the address has been split into Address1, Address2… and so on. Or someone decides to register again, but adds a middle name this time.

Whatever the case, I find that using logic (in my head) and SSIS, I can do a more thorough clean-up of the data. Here's my Data Flow model:

And here's the Control Flow diagram, showing that I read data from a flat file (.csv) and then update the database every time I run this SSIS package. I suggest not disabling the update tasks even when there may not be changes, because if there are changes and you forget to re-enable the tasks, those changes will not be applied!

Proper Case transformation for text.

There isn't a built-in Proper Case transformation, but the function is available in Visual Basic, so you just need to include Microsoft.VisualBasic in your references.

/* Microsoft SQL Server Integration Services Script Component
 * Write scripts using Microsoft Visual C# 2008.
 * ScriptMain is the entry point class of the script. */

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using Microsoft.VisualBasic;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    public override void PreExecute()
    {
        base.PreExecute();
        // Add your code here for preprocessing or remove if not needed
    }

    public override void PostExecute()
    {
        base.PostExecute();
        // Add your code here for postprocessing or remove if not needed
        // You can set read/write variables here, for example:
        // Variables.MyIntVar = 100;
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Convert each input column to Proper Case using the Visual Basic StrConv function
        Row.DerivedFirstName = Strings.StrConv(Row.FirstName, VbStrConv.ProperCase, 0);
        Row.DerivedName = Strings.StrConv(Row.Name, VbStrConv.ProperCase, 0);
        Row.DerivedAddress1 = Strings.StrConv(Row.AddressLine1, VbStrConv.ProperCase, 0);
        Row.DerivedAddress2 = Strings.StrConv(Row.AddressLine2, VbStrConv.ProperCase, 0);
        Row.DerivedCity = Strings.StrConv(Row.City, VbStrConv.ProperCase, 0);
        Row.DerivedState = Strings.StrConv(Row.State, VbStrConv.ProperCase, 0);
        Row.DerivedPostCode = Strings.StrConv(Row.Postcode, VbStrConv.ProperCase, 0);
        Row.DerivedCountry = Strings.StrConv(Row.Country, VbStrConv.ProperCase, 0);
    }
}

Happy mail merging!