Friday, 10 June 2011

SQL 2008 Data Profiler

I am currently working on the early stages of a SQL 2008 Data Warehousing project.  I wanted to dive into the data sources for the DW I will be building, to determine if any of them could be eliminated up front.  This could be for a variety of reasons, including for not providing the required info for the DW or not containing valid data.  Its essential to conduct this test early on, and previously that would have involved using a 3rd party tool such as Trillium or a range of others to analyse the content and give some statistical feedback on the content.

Great to see that Microsoft (and its competitors) are finally catching on to the significance of this area, and in SQL Server 2008 Microsoft have provided an SSIS task and separate Date Profiler application to do just that.



The Data Profiling Task is available in the SSIS Control Flow and is very quick to set up.  It can be customised or, if you prefer, you can click the "Quick Profile" button to speedily set up and execute.  Under the Quick Profile option it takes just a few seconds to simply select the datasource (table, view etc) and then select a variety of checkboxes on what stats you'd like to see:

Column Null Ratio
Column Statistics
Column Value Distribution
Column Length Distribution
Column Pattern
Candidate Key



An additional option is available if customising, but is not available under the Quick Profile option:

Value Inclusion Profile

Then choose an XML file to output to.  That's all there is to it.  It's now ready to execute and output the results.

In order to actually make sense of the XML output file the Data Profile Viewer ( \Microsoft SQL Server 2008\Integration Services\Data Profile Viewer) is needed.  Open the xml file and lo and behold all the results become available!



Use the tree on the left pane to select the profle request (options checked earlier in the SSIS task) results desired, and on the right pane view the results - the distribution of data in each column, min/max values, null distributions, candidates for the primary key and more.

I have found this to be a very useful tool for quickly diving into the source systems to get a feel for the data and quickly highlight any major data quality issues.  Much better to find them up front that wait to see what you get after you've spent days building your ETL structures!

No comments:

Post a Comment