Wednesday, 30 July 2014

CheckSum, HashBytes and Slowly Changing Dimensions

A recent requirement for a DW was to implement a Type 2 Slowly Changing Dimension across all attributes in the dimension.  Unfortunately, the dimension had many, many attributes (30+), which meant that a comparison of all attribute values would be needed for each new record that arrived.

To recap:

Type 2 Slowly Changing Dimension (SCD2)
SCDs are methodologies for tracking historical changes to a data row in a dimension table over time. While SCD1 is just an overwrite when there is a change (no history), SCD2 involves entering a new record into the table with the new changed data.  A new surrogate key is generated for the new record, while maintaining the natural key to allow the new record to be linked to the old one.  Typically, there would also be either an IsCurrent flag, to identify which is the current version of the data, and/or Start and End Date columns to indicate the order and time span of the changes:

Surrogate_KeyNatural_KeyCustNameCountryStart_DateEnd_Date
100375GuyScotland01-Jan-200231-Jul-2005
101375GuyEngland01-Aug-200531-Dec-9999

In the above example the Country has changed and so a new record with the new country and new surrogate key has been inserted.  As each new row arrives, it's country is checked against the existing records country to see if it has changed or not.

This is straightforward if there is only the one attribute being tracked.  A simple string comparison will quickly identify the change.  However, if there are many attributes being compared then the query doing this comparison can become quite long and perform poorly,  particularly if you have large numbers of records arriving.

In order to solve this problem it was decided to hashcode the relevant (tracked) columns together and save the output to a new column.  If we hashcode the incoming records on the same columns as well, then all we have to do is compare the two hash columns to see if there the record has changed. 

There are a number of ways to do this:

CHECKSUM()

From BOL:
 CHECKSUM applied over any two lists of expressions returns the same value if the corresponding elements of the two lists have the same type and are equal when compared using the equals (=) operator...... If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change.
In fact a quick test using CHECKSUM() to hashcode 100,000 GUIDs resulted in up to 3 repeated hashcodes ( "collisions").  The CHECKSUM is very fast, but this rate of collision is not tolerable in this scenario.

HASHBYTES()

HASHBYTES() performs a similar function but the rate of collision is much much lower. It's not zero, but to date I have never been able to generate a collision.
To use this function, an algorithm needs to be specified. Thomas Kesjer has done a detailed comparison of the various algorithms' performance, and produced this graph:



The performance of the HASHBYTES algorithms are largely similar (poorer than CHECKSUM), with the exception of MD2, which is much slower.  Typically I use MD5.  The other point to note is that HASHBYTES requires a single string parameter to be passed, not multiple columns eg:

SELECT HASHBYTES('MD5', Country) FROM DummyTable

In order to hashcode multiple columns together we can either concatenate them together as strings, or an even neater way would be to convert it to an XML string using FOR XML:

SELECT HASHBYTES('MD5', (SELECT CustName, Country FOR XML RAW))

Using this technique we can hashcode all the existing records in the table, and as each new record arrives we can look at its hashcode and compare it (joining on the Natural Key).  If the code is different then the record has changed, and is inserted. If not, it is ignored.

This should be much simpler to maintain and perform much better than doing individual comparisons of the all the changing fields.





No comments:

Post a Comment