All posts by Terry Choo

Top 11 Qualities of a Data Warehouse Implementer

Top 11th Quality: Curious to explore different designs

There are many ways to achieve the same thing. Before actually getting your hands dirty, think and sketch out some high-level designs, then choose the best one. They need not be perfect, and the beauty of it is that the more experience you have, the faster you come up with designs. The downside of experience is the tendency to stick to the same design all the time without asking whether you can do it better.

Building a simple forecast solution on SQL Server

Pricing forecasting is a marketing function that requires a lot of data crunching.

Normally there are factors that the Marketing team wants to adjust to find the optimum pricing for their products.

Accordingly, the forecast solution needs to be fairly flexible yet powerful enough to perform the necessary, and sometimes complex, calculations. If the design is too rigid and the requirements change even slightly, major design changes are needed, which in turn means whole new rounds of testing.

Client Pressures and Code Quality

In my experience, code quality always trumps urgency, because when code of questionable quality is released, bugs inevitably appear and the project ends up taking longer than it should.

In my case, the code belongs to a third-party vendor, so the resulting problems frustrate the developers, the business users and ultimately the end customers.

This is where the negotiating skills and experience of the vendor Project Manager are so important. The client PM (in my case, one of us) will always push to get the code as soon as possible, mainly because of company pressure. And sometimes the vendor PM, after consulting with his peers, agrees to push the code through, even bypassing the tried and tested in-house processes to keep things moving along. However, the result is almost always that the code disappoints when the client developers try to use it. This hurts the vendor’s reputation even more.

Sometimes it is better to commit to following the tried and tested process so that code quality can be guaranteed. In other words: a little more time, a lot fewer problems, and far more satisfactory results.

SQL Query Result Order

Every time an SQL query is run without the ORDER BY clause, the sort order of the results returned is not guaranteed.

SQL Server decides the best order in which to return the results, based on factors such as whether indexes are used and how fragmented the data is on the physical drive, because it will always attempt to return the result in the most efficient way.

If you run the same query a few times in the same environment, you may find that the sort order remains the same. That does not mean the sort order is guaranteed, only that SQL Server happened to process the data in the same way each time.

The same query may not return the results in the same order if indexes have been rebuilt, the database has been moved to another server, or any number of other factors have changed.
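
As a minimal illustration (using the same hypothetical orders table that appears in the posts below), the only way to make the order deterministic is to ask for it explicitly:

-- Without ORDER BY, the rows come back in whatever order SQL Server finds
-- most efficient; that order may change after an index rebuild or on another server.
SELECT customer, orderdate
FROM orders;

-- With ORDER BY, the sort order of the results is guaranteed.
SELECT customer, orderdate
FROM orders
ORDER BY orderdate, customer;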

Just a point to note.

SQL Logical Query Processing Order

This probably applies more to beginners, or to people who have been using SQL for a while but have wondered why a query sometimes works in a certain way and sometimes doesn’t.

Before I read the book for exam 70-461, I found it relatively easy to just plonk some code into SQL Server Management Studio and voila! it works. Well, not all the time.

For example, this works:

SELECT customer, YEAR(orderdate) AS OrderYear FROM orders WHERE YEAR(orderdate) = 2013

But this doesn’t:

SELECT customer, YEAR(orderdate) AS OrderYear FROM orders WHERE OrderYear = 2013

Most of us will normally just go, “Hey, it’s no biggie, I’ll just use the former and find out why later”, but more often than not we are too busy to do so. 🙂

Here is the “keyed-in order” of a standard SQL Query:

  1. SELECT
  2. FROM
  3. WHERE
  4. GROUP BY
  5. HAVING
  6. ORDER BY

This is how it is supposed to be typed so that it does not throw a syntax error.

This is how SQL Server processes the query:

  1. FROM
  2. WHERE
  3. GROUP BY
  4. HAVING
  5. SELECT
  6. ORDER BY

See the difference?

Let’s refer back to the second example, the one that fails:

SELECT customer, YEAR(orderdate) AS OrderYear FROM orders WHERE OrderYear = 2013

Since the query is processed in that order, SQL Server processes the WHERE clause before the SELECT clause, so it has no knowledge of the alias “OrderYear” – hence it fails.
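
As a side note, and purely as a sketch of one common workaround: compute the alias in a derived table (or CTE) first. The inner query’s SELECT is processed before the outer query’s WHERE clause, so the alias is visible there:

SELECT customer, OrderYear
FROM (
    SELECT customer, YEAR(orderdate) AS OrderYear
    FROM orders
) AS o
WHERE OrderYear = 2013;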

Another common way to write queries with the ORDER BY clause is by using the alias:

SELECT customer, YEAR(orderdate) AS OrderYear FROM orders ORDER BY OrderYear

Again referring back to the Logical Query Processing Order, this query works because ORDER BY is processed after the SELECT clause, so SQL Server is fully aware of the alias “OrderYear”.

What are the benefits?

In my experience, I often re-format a datetime field into something more readable, but that makes the query longer. So, to keep the code more elegant, I use the alias:

SELECT customer, CONVERT(VARCHAR(10), orderdate, 112) AS NewOrderDate FROM orders ORDER BY NewOrderDate

I find that knowing this makes me more efficient, as I code with the processing order in mind and make fewer mistakes.

If this helps anyone, do let me know. Thank you.

SQL and Relational Theory

Do you know the foundation of SQL?

According to the book Querying Microsoft SQL Server 2012 Training Kit from Microsoft Press, T-SQL is Microsoft’s flavour of the more standard ANSI/ISO SQL, which in turn is based on the Relational Model, a model rooted in mathematics.

The Relational Model draws on two branches of mathematics – Set Theory and Predicate Logic.

Set Theory says that you interact with a set not through its individual elements but with the set of elements as a whole. A set is a collection of distinct, definite objects.

Predicate Logic deals with expressions (predicates) that evaluate to either true or false.

The combination of Set Theory and Predicate Logic gives you the foundation of SQL.

How is this useful?

Set theory implies that all elements are unique (distinct) and that order does not matter. Therefore, when constructing T-SQL queries, you consider the set of elements as a whole when manipulating data, which means you are already walking in the right direction. Next, decide whether you are trying to get results quickly for a one-off query, or optimising the query for repeated use (as in business reports). Only deviate from the relational model when it finally comes to presentation – for example, by adding the ORDER BY clause to sort the data.
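
As a small illustration (a sketch only, reusing the hypothetical orders table from the earlier posts), everything up to the final presentation is set-based, and sorting is added only at the end:

-- The result of a query is a set: distinct elements, no inherent order.
SELECT DISTINCT customer
FROM orders
WHERE YEAR(orderdate) = 2013   -- Predicate Logic: keep rows where the predicate evaluates to true
ORDER BY customer;             -- presentation only: the one deviation from the relational model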

The right terminology for the right audience

For those who have had the privilege of being in the technology consulting industry for some time and started out on the technical side of things, remember those books or courses that teach methodology or best practices?

For example, for project management you need a requirements gathering stage (done by the Business Analyst), a specifications stage, a solutioning stage, a project planning stage, a development stage, a unit testing stage, an integration stage, then iterating between development, unit testing and integration again and again, user acceptance testing, go-live, application support, re-looking at the lifecycle…. zzzzzzz…

The thing is, most of the time I find that business users don’t really care for these terms. I recently simplified it (a lot!), and my boss still removed the words “user acceptance” from the testing stage. Eyes started to glaze over when I explained the various stages to business users, thinking they would appreciate knowing how structured I was. (Guess they weren’t that interested after all.)

If you are a developer and your audience is technical integrators or project managers, then by all means expand on these terms. Otherwise, users want to hear things in words from their world. (Things like YoY, MoM, Revenue, PPI and KPI for finance and operations, or production performance and keyword search results for others.)

The point is: tailor your communication to your audience. (This applies to developers too – especially developers who love to use their developer-world terms. 🙂 ) Basically, use whatever is useful and gets the point across clearly and concisely.

Address Clean-up for Mail Merge using SSIS

Remember the good ol’ days of going through a list of addresses and manually cleaning up hundreds if not thousands of addresses to prepare for a snail mail merge? What? You’re still doing it now in the 21st century? Just kidding.

Chances are, you will need to be able to detect duplicates. I find that perhaps the hardest task is manually eyeballing the data for duplicates, because duplicates seldom mean the data is THE SAME. More likely, the data is SIMILAR but noticeably different in subtle little ways. For example, Address has been split into Address1, Address2… and so on. Or someone decides to register again, but adds a middle name this time.
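
As a rough sketch of the idea (using a hypothetical staging table named contacts – this is not part of the actual SSIS package), lightly normalising the fields before grouping helps surface the “similar but not identical” candidates:

-- Compare a lightly normalised key instead of the raw values to flag candidate duplicates.
SELECT
    UPPER(LTRIM(RTRIM(Name)))     AS NormName,
    UPPER(LTRIM(RTRIM(Postcode))) AS NormPostcode,
    COUNT(*)                      AS Occurrences
FROM contacts
GROUP BY
    UPPER(LTRIM(RTRIM(Name))),
    UPPER(LTRIM(RTRIM(Postcode)))
HAVING COUNT(*) > 1;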

Whatever the case, I find that using logic (in my head) and SSIS, I can prepare a more thorough cleaning of the data. Here’s my Data Flow Model:

And here’s the Control Flow diagram, showing that I read data from a flat file (.csv) and then update the database every time I run this SSIS package. I suggest not disabling the update tasks even when there may be no changes, because if there are changes and you forget to re-enable the tasks, the changes will not be applied!

Proper Case transformation for text.

There isn’t a built-in Proper Case transformation, but the function is available in Visual Basic, so you just need to include Microsoft.VisualBasic in your references.

/* Microsoft SQL Server Integration Services Script Component
 * Write scripts using Microsoft Visual C# 2008.
 * ScriptMain is the entry point class of the script. */

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using Microsoft.VisualBasic;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    public override void PreExecute()
    {
        base.PreExecute();
        // Add your code here for preprocessing, or remove if not needed.
    }

    public override void PostExecute()
    {
        base.PostExecute();
        // Add your code here for postprocessing, or remove if not needed.
        // You can set read/write variables here, for example:
        // Variables.MyIntVar = 100;
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Convert each input column to Proper Case using the Visual Basic StrConv function.
        Row.DerivedFirstName = Strings.StrConv(Row.FirstName, VbStrConv.ProperCase, 0);
        Row.DerivedName = Strings.StrConv(Row.Name, VbStrConv.ProperCase, 0);
        Row.DerivedAddress1 = Strings.StrConv(Row.AddressLine1, VbStrConv.ProperCase, 0);
        Row.DerivedAddress2 = Strings.StrConv(Row.AddressLine2, VbStrConv.ProperCase, 0);
        Row.DerivedCity = Strings.StrConv(Row.City, VbStrConv.ProperCase, 0);
        Row.DerivedState = Strings.StrConv(Row.State, VbStrConv.ProperCase, 0);
        Row.DerivedPostCode = Strings.StrConv(Row.Postcode, VbStrConv.ProperCase, 0);
        Row.DerivedCountry = Strings.StrConv(Row.Country, VbStrConv.ProperCase, 0);
    }
}

Happy mail merging!

What is in your mind? Do you see my solution?

Hello, it’s been a while since I last blogged, so I’ll get right into it.

What’s in your client’s mind when you present a version of the solution? Do new things come up that were not mentioned before? Do you go “I told you so” in your mind?

Well, I recently had all three thoughts come to me. As technologists, which is what BI developers generally are, we know there are several roads that lead to Rome, and we choose the most logical and straightforward one to begin with. In my case I knew that we needed two different entities for the project, a “client” and a “case”. Therefore it made sense to separate the two because of the 1:N relationship – a client can have more than one case attached to him.

However, I was told that the client and the case are one – traditionally, there was only one ID assigned to represent both. So I acknowledged that it should now be a 1:1 relationship between client and case and designed the database schema as such.

At the next meeting, I presented my schema and lo and behold, there could be more than one case attached to a client. …

I think the client often very much wants to keep what’s working and not change anything too much – in this case, the ID they are working with. Logically, however, it just didn’t make sense, and as a BI developer it is my duty to explain so. In the end we decided to adopt a combination of ClientID and CaseID.
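
For illustration only, here is a minimal sketch of what that 1:N schema might look like, with hypothetical table and column names (the actual design was still on paper at this point):

CREATE TABLE Client (
    ClientID   INT           NOT NULL PRIMARY KEY,
    ClientName NVARCHAR(100) NOT NULL
);

CREATE TABLE CaseFile (   -- named CaseFile because CASE is a reserved word in T-SQL
    ClientID   INT  NOT NULL REFERENCES Client (ClientID),
    CaseID     INT  NOT NULL,
    OpenedDate DATE NOT NULL,
    PRIMARY KEY (ClientID, CaseID)   -- the combination of ClientID and CaseID identifies a case
);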

The good thing is that nothing has been built yet. I am a proponent of sorting these things out on paper rather than in code, so all’s well.

The next step is to include other users in the process so that I am not caught out by my design.

Different mindsets

In Business Intelligence, one generally goes down either the development route or the analysis route.

In development, one is usually skilled in at least one tool and aware of development methodologies. The main mindset is to solve problems and puzzles; the aim is to fulfil the requirements. Once that is accomplished, the result is usually what one would expect – provided the requirements gathering stage was done properly.

As for data analysis, the analyst’s objective is to let the consumer of the data make use of it to take some action or make some improvement. The tools of data interpretation are measurement approaches that define metrics, and modelling methods used to analyse the data.

Usually in a big organisation there are specialists in these roles, so the data management professional is normally a different person from the data analyst.

However, in a smaller organisation, where one may be among the few people dealing directly with data, sometimes both need to be done. The thing is, the way a developer thinks is fundamentally different from the way a data analyst thinks, and switching from one mindset to the other does require some effort.

Let me explain.

The developer’s objective is to solve the problem at hand, so he makes sure that the data is accurate and is delivered in a timely manner.

The analyst’s objective is to create some useful sense of the data for the organisation’s benefit, hence, the starting point isn’t at the root of the problem, but rather at the heart of the business – What is the company doing? How does the company know it’s going in the right direction? What needs to be improved?

Data accuracy to the developer is paramount. Data accuracy to the analyst is a matter of the degree required. The developer looks into the problem; the analyst comes from the angle of the organisation. Hence, if you are playing a dual role, you need to switch from the “What is the problem?” mentality to the “How is the company doing?” mentality. This requires being more involved with the business, which means stepping out of the “zone” and speaking to business users – something that might be a little outside some people’s comfort zone.

Likewise, for an analyst to become a developer goes, in my opinion, against his or her personality, as a development mindset requires one to focus totally on the problem and ignore the chaff. But sometimes it is this chaff that provides information of importance to the analyst. I don’t normally see analysts becoming developers; I think it’s more the other way round, but if you disagree, do feel free to voice it here.

Why do users ask for more just after the BI Solution is Live?

A good question.

It didn’t come up during requirements gathering. Neither did it occur during iterative development. And it certainly did not come up in user acceptance testing. So why, why, why do users ask for more features and changes AFTER the solution has gone live?

One of the differences between a BI project and any other technical project is that the “Oliver Twist” syndrome – asking for more – is inherent in what providing data is all about. When data is first revealed, the human mind processes it in an effort to answer the questions already in the mind. This inadvertently creates more questions, and in most cases answering them requires more data, or a different view of the data.

These sorts of requests cannot be captured during the earlier stages, just as you can’t find a cure until the disease manifests itself.

What can be done is to analyse the objectives and goals and run a pre-implementation study, where data is queried on an ad hoc basis to answer questions. Doing it ad hoc keeps it extremely flexible and allows the users to plan ahead. Afterwards, some of these ad hoc reports will probably end up as regular reports, and the main bulk of the information is used for monitoring and tracking, and fed back for strategy planning purposes.

In any case, don’t be surprised by the “I want more” mentality. Ultimately it is a good thing that the cogs in the brain are moving, isn’t it?