Wednesday 30 July 2014

CheckSum, HashBytes and Slowly Changing Dimensions

A recent requirement for a DW was to implement a Type 2 Slowly Changing Dimension across all attributes in the dimension.  Unfortunately, the dimension had many, many attributes (30+), which meant that a comparison of all attribute values would be needed for each new record that arrived.

To recap:

Type 2 Slowly Changing Dimension (SCD2)
SCDs are methodologies for tracking historical changes to a data row in a dimension table over time. While SCD1 simply overwrites the row when there is a change (no history), SCD2 involves inserting a new record into the table with the changed data.  A new surrogate key is generated for the new record, while the natural key is retained to allow the new record to be linked to the old one.  Typically, there would also be an IsCurrent flag to identify which is the current version of the data, and/or Start and End Date columns to indicate the order and time span of the changes:

Surrogate_Key | Natural_Key | CustName | Country  | Start_Date  | End_Date
------------- | ----------- | -------- | -------- | ----------- | -----------
100           | 375         | Guy      | Scotland | 01-Jan-2002 | 31-Jul-2005
101           | 375         | Guy      | England  | 01-Aug-2005 | 31-Dec-9999

In the above example the Country has changed, and so a new record with the new country and a new surrogate key has been inserted.  As each new row arrives, its country is checked against the existing record's country to see whether it has changed.
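
As a minimal sketch of how such a change might be applied (assuming a hypothetical DimCustomer table with the columns above, and an IDENTITY surrogate key):

-- Expire the current version of the changed customer
UPDATE DimCustomer
SET    End_Date = '31-Jul-2005'
WHERE  Natural_Key = 375
AND    End_Date = '31-Dec-9999'

-- Insert the new version; the surrogate key is generated automatically
INSERT INTO DimCustomer (Natural_Key, CustName, Country, Start_Date, End_Date)
VALUES (375, 'Guy', 'England', '01-Aug-2005', '31-Dec-9999')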

This is straightforward if there is only the one attribute being tracked: a simple string comparison will quickly identify the change.  However, if there are many attributes being compared, the query doing the comparison can become quite long and perform poorly, particularly if you have large numbers of records arriving.
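
To give a flavour of the problem, a column-by-column comparison (here against a hypothetical Staging table holding the incoming records) ends up looking something like this:

SELECT s.Natural_Key
FROM   Staging s
JOIN   DimCustomer d
  ON   d.Natural_Key = s.Natural_Key
 AND   d.End_Date = '31-Dec-9999'
WHERE  s.CustName <> d.CustName
   OR  s.Country  <> d.Country
   -- ...and so on for each of the 30+ tracked attributes, with
   -- additional ISNULL handling for any columns that allow NULLs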

In order to solve this problem it was decided to hashcode the relevant (tracked) columns together and save the output to a new column.  If we hashcode the incoming records on the same columns as well, then all we have to do is compare the two hash columns to see whether the record has changed.

There are a number of ways to do this:

CHECKSUM()

From BOL:
CHECKSUM applied over any two lists of expressions returns the same value if the corresponding elements of the two lists have the same type and are equal when compared using the equals (=) operator... If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change.
In fact a quick test using CHECKSUM() to hashcode 100,000 GUIDs resulted in up to 3 repeated hashcodes ("collisions").  CHECKSUM is very fast, but this rate of collision is not tolerable in this scenario.
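
A sketch of that kind of test (the temp table name is illustrative):

-- Generate 100,000 distinct GUIDs
SELECT TOP 100000 NEWID() AS GuidVal
INTO   #GuidTest
FROM   sys.all_columns a CROSS JOIN sys.all_columns b

-- Since every GUID is distinct, any checksum shared by
-- more than one row is a collision
SELECT   CHECKSUM(GuidVal) AS HashVal, COUNT(*) AS Occurrences
FROM     #GuidTest
GROUP BY CHECKSUM(GuidVal)
HAVING   COUNT(*) > 1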

HASHBYTES()

HASHBYTES() performs a similar function, but the rate of collision is much, much lower. It's not zero, but to date I have never been able to generate a collision.
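
Repeating the GUID test with HASHBYTES (note that HASHBYTES only accepts varchar, nvarchar or varbinary input, so the GUID needs converting) has, consistent with the above, never returned a collision for me:

SELECT   HASHBYTES('MD5', CONVERT(varbinary(16), GuidVal)) AS HashVal, COUNT(*) AS Occurrences
FROM     #GuidTest
GROUP BY HASHBYTES('MD5', CONVERT(varbinary(16), GuidVal))
HAVING   COUNT(*) > 1
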
To use this function, an algorithm needs to be specified. Thomas Kejser has done a detailed comparison of the various algorithms' performance, and produced this graph:

[Graph: relative performance of CHECKSUM and the HASHBYTES algorithms]

The performance of the HASHBYTES algorithms is largely similar (poorer than CHECKSUM), with the exception of MD2, which is much slower.  Typically I use MD5.  The other point to note is that HASHBYTES requires a single string parameter to be passed, not multiple columns, e.g.:

SELECT HASHBYTES('MD5', Country) FROM DummyTable

In order to hashcode multiple columns together, we can either concatenate them as strings or, even neater, convert them to an XML string using FOR XML:

SELECT HASHBYTES('MD5', (SELECT d.CustName, d.Country FOR XML RAW))
FROM DummyTable d
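
An MD5 hash is 16 bytes, so the hash column on the dimension can be declared as binary(16). A sketch of adding and populating it for the existing rows (the Row_Hash column name is illustrative):

ALTER TABLE DimCustomer ADD Row_Hash binary(16)
GO

-- Hash the tracked columns of every existing row
UPDATE d
SET    Row_Hash = HASHBYTES('MD5', (SELECT d.CustName, d.Country FOR XML RAW))
FROM   DimCustomer d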

Using this technique we can hashcode all the existing records in the table, and as each new record arrives we can compute its hashcode and compare it to the current record's (joining on the Natural Key).  If the hashcodes differ, the record has changed and is inserted; if not, it is ignored.
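
As a sketch of that comparison, again using the hypothetical Staging and DimCustomer tables and assuming Surrogate_Key is an IDENTITY column:

-- Insert a new version of any record whose hash has changed
-- (Staging.Row_Hash computed with the same HASHBYTES expression)
INSERT INTO DimCustomer (Natural_Key, CustName, Country, Start_Date, End_Date, Row_Hash)
SELECT s.Natural_Key, s.CustName, s.Country, GETDATE(), '31-Dec-9999', s.Row_Hash
FROM   Staging s
JOIN   DimCustomer d
  ON   d.Natural_Key = s.Natural_Key
 AND   d.End_Date = '31-Dec-9999'
WHERE  s.Row_Hash <> d.Row_Hash

(In a full load the old version would also be end-dated, as described above.)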

This should be much simpler to maintain, and perform much better, than doing individual comparisons of all the tracked fields.





Tuesday 8 July 2014

SQL to Drop All Non-System Objects From Master DB

We've all been there - well I have anyway. You've just created your script to generate all the required objects for your database.  You run the script in the new environment.  But wait, you forgot to set the database!  All the objects have been created in the master db instead!

The number of times this has happened to me and to others I have worked with is infuriating.  I've sat there watching people spend ages deleting the db objects one by one in Management Studio, stumbling when there are dependencies, and generally wasting a lot of time.

To solve this problem I have put together a short script to generate all the required drop statements. Fortunately, the sys.objects catalog view contains a handy "is_ms_shipped" column that makes life much simpler:

SELECT
  'IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N''[dbo].[' + name + ']''))' + CHAR(10)
  + 'DROP '
  + CASE WHEN type = 'U'  THEN 'TABLE '
         WHEN type = 'P'  THEN 'PROCEDURE '
         WHEN type = 'FN' THEN 'FUNCTION '
         WHEN type = 'V'  THEN 'VIEW '
    END
  + name + CHAR(10) + 'GO' + CHAR(10)
FROM master.sys.objects
WHERE is_ms_shipped <> 1
AND type IN ('U', 'P', 'FN', 'V')
ORDER BY                        -- drop in dependency-friendly order:
  CASE WHEN type = 'P'  THEN 1  -- procedures first,
       WHEN type = 'V'  THEN 2  -- then views,
       WHEN type = 'FN' THEN 3  -- then functions,
       WHEN type = 'U'  THEN 4  -- tables last
  END

This script will generate all the DROP statements for the user-defined objects in the master db.  These need to be copied into a new query window (easiest if you have output the results to text rather than to grid in SSMS) and executed.
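
For example, for a stored procedure the generated statement will look like this (object name illustrative):

IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[usp_LoadCustomer]'))
DROP PROCEDURE usp_LoadCustomer
GO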

If there are any foreign key constraints, however, the above script will generate an error.  To handle this, either execute all the drop statements repeatedly until the errors cease (each pass will drop additional tables as the tables referencing them are removed), or, prior to executing the above script, execute the following:

SELECT 'ALTER TABLE [' + SCHEMA_NAME(schema_id) + '].[' + OBJECT_NAME(fk.parent_object_id) + '] DROP CONSTRAINT ' + fk.name
  + CHAR(10) + 'GO' + CHAR(10)
FROM sys.foreign_keys fk
WHERE is_ms_shipped = 0

This will generate the statements needed to drop all the foreign key constraints first. Copy the output to a new query window and execute it.
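
Again the output is a series of statements along these lines (names illustrative):

ALTER TABLE [dbo].[OrderLine] DROP CONSTRAINT FK_OrderLine_Order
GO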