Address Clean-up for Mail Merge using SSIS

Remember the good ol’ days of going through a list of addresses and manually cleaning up hundreds if not thousands of addresses to prepare for a snail mail merge? What? You’re still doing it now in the 21st century? Just kidding.

Chances are, you will need to be able to detect duplicates. I find that perhaps the hardest task is manually eyeballing the data for duplicates, because duplicates seldom mean that the data is THE SAME. More likely, the data is SIMILAR but different in subtle little ways. For example, Address has been split into Address1, Address2… and so on. Or someone decides to register again but adds a middle name this time.
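To illustrate the idea outside SSIS (the field names and matching rules below are my own assumptions for the sketch, not a fixed recipe), one common trick is to build a normalised key per record – join the split address lines, lowercase, strip punctuation, and keep only the first and last words of the name so an added middle name does not defeat the match:

```python
import re

def normalized_key(record):
    """Build a fuzzy-match key from a record (a dict of strings)."""
    name = record.get("name", "").split()
    # Keep only the first and last word of the name, so "John Smith"
    # and "John Q. Smith" produce the same key.
    name_key = " ".join([name[0], name[-1]]) if len(name) >= 2 else " ".join(name)
    address = " ".join([record.get("address1", ""), record.get("address2", "")])
    key = f"{name_key} {address} {record.get('postcode', '')}"
    key = re.sub(r"[^\w\s]", "", key.lower())      # drop punctuation
    return re.sub(r"\s+", " ", key).strip()        # collapse whitespace

def find_duplicates(records):
    """Return pairs of records that share a normalised key."""
    seen = {}
    dups = []
    for rec in records:
        k = normalized_key(rec)
        if k in seen:
            dups.append((seen[k], rec))
        else:
            seen[k] = rec
    return dups
```

In SSIS itself, the Fuzzy Grouping transformation plays a similar role, but a key like this is often enough to flag the obvious "similar but not identical" cases for review.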

Whatever the case, I find that using logic (in my head) and SSIS, I can prepare a more thorough cleaning of the data. Here’s my Data Flow Model:

And here’s the Control Flow diagram showing that I read data from a flat file (.csv) and then update the database every time I run this SSIS package. I suggest not disabling the update tasks even when there are no changes: if there are changes and you forget to re-enable the tasks, the changes will not be applied!

Proper Case transformation for text.

There isn’t a built-in Proper Case transformation, but the function is built into Visual Basic, so you just need to add a reference to Microsoft.VisualBasic in your Script Component.


/* Microsoft SQL Server Integration Services Script Component
 * Write scripts using Microsoft Visual C# 2008.
 * ScriptMain is the entry point class of the script. */

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using Microsoft.VisualBasic;

public class ScriptMain : UserComponent
{
    public override void PreExecute()
    {
        base.PreExecute();
        // Add your code here for preprocessing or remove if not needed.
    }

    public override void PostExecute()
    {
        base.PostExecute();
        // Add your code here for postprocessing or remove if not needed.
        // You can set read/write variables here, for example:
        // Variables.MyIntVar = 100;
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // StrConv with VbStrConv.ProperCase capitalises the first letter
        // of each word and lowercases the rest.
        Row.DerivedFirstName = Strings.StrConv(Row.FirstName, VbStrConv.ProperCase, 0);
        Row.DerivedName = Strings.StrConv(Row.Name, VbStrConv.ProperCase, 0);
        Row.DerivedAddress1 = Strings.StrConv(Row.AddressLine1, VbStrConv.ProperCase, 0);
        Row.DerivedAddress2 = Strings.StrConv(Row.AddressLine2, VbStrConv.ProperCase, 0);
        Row.DerivedCity = Strings.StrConv(Row.City, VbStrConv.ProperCase, 0);
        Row.DerivedState = Strings.StrConv(Row.State, VbStrConv.ProperCase, 0);
        Row.DerivedPostCode = Strings.StrConv(Row.Postcode, VbStrConv.ProperCase, 0);
        Row.DerivedCountry = Strings.StrConv(Row.Country, VbStrConv.ProperCase, 0);
    }
}
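For readers outside SSIS, the same transformation can be sketched in plain code. This is only my approximation of what VbStrConv.ProperCase does, and it shares the usual caveat: mechanical title-casing flattens names such as “McDonald”.

```python
def proper_case(text):
    """Capitalise the first letter of each word and lowercase the rest,
    roughly mimicking VbStrConv.ProperCase."""
    return " ".join(w[:1].upper() + w[1:].lower() for w in text.split())

# Caveat: like StrConv, this turns "McDonald" into "Mcdonald" -
# genuinely mixed-case names need an exception list.
```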


Happy mail merging!



What’s on your mind? Do you see my solution?

Hello, it’s been a while since I last blogged, so I’ll get right into it.

What’s in your client’s mind when you present a version of the solution? Do new things come up that were not mentioned before? Do you go “I told you so” in your mind?

Well, I recently had all three thoughts come to me. As technologists, which is what BI developers generally are, we know there are several roads that lead to Rome, and you choose the most logical and straightforward one to begin with. In my case, I knew that we needed two different entities for the project: a “client” and a “case”. Therefore, it made sense to separate the two because of the 1:N relationship – a client can have more than one case attached to them.

However, I was told that the client and the case are one – because traditionally, only one ID was assigned to represent both. So I acknowledged that it should now be a 1:1 relationship between client and case and designed the database schema as such.

At the next meeting, I presented my schema and lo and behold, there could be more than one case attached to a client. …

I think the client often very much wants to keep what’s working and not change anything too much – in this case, the ID they are working with. However, logically it just didn’t make sense, and as a BI developer it is my duty to say so. In the end we decided to adopt a combination of ClientID and CaseID.

The good thing is that nothing is built yet. I am a proponent of sorting these things out on paper rather than in code, so all’s well.

The next step is to include other users in the process so that I am not caught out by my design.


Different mindsets

In Business Intelligence, one generally goes down either the development route or the analysis route.

In development, one is usually skilled in at least one tool. Also, one is aware of development methodologies. The main mindset is to solve problems / puzzles. The aim is to fulfill the requirements. Once accomplished, the result is usually what one would expect – subject to the requirement gathering stage being done properly.

As for data analysis, the analyst’s objective is to allow the consumer of the data to make use of it to take some action or make some improvement. The tools of data interpretation are measurement approaches that define metrics, and modelling methods used to analyse the data.

Usually in a big organisation, there are specialists in these roles so normally the data management professional is different from the data analyst.

However, in a smaller organisation where one finds oneself to be one of the few people dealing directly with data, sometimes both need to be done. The thing is that the way a developer thinks is fundamentally different from the way a data analyst thinks. Switching from one mindset to the other does require some effort.

Let me explain.

The developer’s objective is to solve the problem at hand, so he makes sure that the data is accurate and is delivered in a timely manner.

The analyst’s objective is to create some useful sense of the data for the organisation’s benefit, hence, the starting point isn’t at the root of the problem, but rather at the heart of the business – What is the company doing? How does the company know it’s going in the right direction? What needs to be improved?

Data accuracy to the developer is paramount. Data accuracy to the analyst is a matter of the degree required. The developer looks into the problem; the analyst comes from the angle of the organisation. Hence, if you are playing a dual role, you need to switch from the “What is the problem?” mentality to the “How is the company doing?” mentality. This requires being more involved with the business, which means stepping out of the “zone” and speaking to business users – which to some might be a little outside their comfort zone.

Likewise, for an analyst to become a developer is, in my opinion, against his or her personality, as a development mindset requires one to focus totally on the problem and ignore the chaff. But sometimes it’s this chaff that provides information of importance to the analyst. I don’t normally see analysts becoming developers. I think it’s more the other way round, but if you disagree, do feel free to voice it here.

Why do users ask for more just after the BI Solution is Live?

A good question.

It didn’t come up during requirements gathering. Neither did it occur during reiterative development. And it certainly did not come up in user acceptance testing. So why, why, why do users ask for more features and changes AFTER the solution has gone live?

One of the differences between a BI project and any other technical project is that the “Oliver Twist” syndrome is inherent in providing data at all. When data is first revealed, the human mind processes it in an effort to answer the questions in the mind. This inadvertently creates more questions, and in most cases it will take more data, or a different view of the data, to answer them.

This sort of request cannot be captured during any earlier stage, just as you can’t find a cure until the disease manifests itself.

What can be done is to analyse the objectives and goals and run a pre-implementation study, where data is queried on an ad hoc basis to answer questions. This is best done ad hoc so that it can be extremely flexible, and it allows the users to plan ahead. Afterwards, some of these ad hoc reports will probably end up as regular reports, and the main bulk of the information is used for monitoring and tracking, and fed back for strategy planning purposes.

In any case, don’t be surprised by the “I want more” mentality. Ultimately it is a good thing that cogs in the brains are moving, isn’t it?

The “Ta-Da” Syndrome

Here’s a typical IT solutions scenario. (Well, at least in a smaller company anyway)

Someone emails you a requirement. (“Hey, I need sales figures from way back in the middle ages – 1990 to now – as I want to see who has purchased from us before and contact them. Hehe”)

The IT/BA guy says “Okay great, here’s work to be done.” and goes to the system, pulls out the data and then formats it in Excel (Hmm… Arial font-size 11 seems to look pretty) and then sends it off. “TA-DA”

An hour later, the user calls, says that’s not what he wants, and concludes that you don’t understand the business. Sound familiar?

That’s because at some point in our lives we have been guilty of doing this. (Yes I can see it in your faces).

As I learnt from a talk in church (strangely enough), handing someone the gift of the finished solution isn’t really it. Sometimes you need to listen to the user, understand what he wants, even gently lead him to see the best way to solve his problem, and then work out the solution together. After a few iterations, and some more discussions with the user, the most appropriate solution is presented (“TA-DA”). And you’ll be appreciated so much more, even if the final solution is probably the same as you first envisaged.