Friday, 25 October 2013

Truncate vs Delete - Myth vs Reality

I have been asked many times (often in interviews) the following question:
Q. What is the difference between Truncate and Delete in SQL Server?
The answer that, in my experience, is most expected and most often given is that Delete allows you roll back the data you have wiped, while Truncate does not - Truncate is irreversible.  A quick search on google gives a number of posts that say as much.

Simple.  But wrong!  Just try this simple test.

  1. Insert records into a table. 
  2. Open a transaction. 
  3. Truncate the table. 
  4. View the contents of the table.
  5. Rollback the transaction.
  6. View the contents of the table.

 The code below is a very simple example:
CREATE TABLE MyTable (col1 CHAR(1), col2 INT)
GO

INSERT INTO MyTable (col1, col2)
SELECT 'a', 1 UNION ALL SELECT 'b', 2 UNION ALL SELECT 'c', 3
GO

SELECT * FROM MyTable --(3 row(s) affected)

BEGIN TRAN

TRUNCATE TABLE MyTable
GO
SELECT * FROM MyTable --(0 row(s) affected)

ROLLBACK

SELECT * FROM MyTable --(3 row(s) affected)

Before the transaction is rolled back, there are no rows in the table, but after the rollback all three records have returned, despite having used Truncate to remove them.  Clearly Truncate is reversible!

So what actually is the difference between Truncate and Delete?  There are a number of general differences, but the relevant one here is the following:

When rows are DELETE'd the operation is logged for each row removed. TRUNCATE logs only the deallocation of the data pages rather than the rows themselves, making it a much faster operation.  
There are many other differences related to permission levels, seeding, triggers, contraints etc. That's for a future post.  For now, the point to note is that TRUNCATE can indeed be rolled back, and the myth has been busted! 




Tuesday, 8 October 2013

SSAS: MDX True Average VS AverageOfChildren

Recently I was tasked with working on some slow performing calculations in a cube.  In the Calculation script a developer had used the DECENDANTS function, with the LEAVES flag to navigate all the way down to the bottom of several dimensions' hierarchies. He had then used AVG across the returned set for a specified measure, effectively aggregating all the way back up again.

All the developer actually wanted to do was to have a measure that returned the average value for the current member selection.  By going down to the leaf level and then backup, however, the cube is having to scan through a potentially huge volume of data to produce the answer. Inevitably this will result in degraded query performance for users.

In fact this can be a lot simpler to do than people realise.  True, SSAS 2008R2 still does not supply an inbuilt function to do this, which surprises many new users. One day maybe it will,  but in the meantime here is how I do it.

First, it should be pointed out that there is a certain "type" of averaging that SSAS does support out of the box.  If you look at the options in the AggregationFunction property for a measure you will see that, although by default set to Sum, there is also one option called AverageOfChildren. This name is very misleading.  It is not a true average as I would understand.  It is actually an average only across the Time dimension.  If you have two dates selected and are looking at the Sales Value measure you will see it averaged across those two days.  However, if you are looking at a single day, but across two sales areas, you will not see it averaged across those areas, but summed.  Confusing, but that is just what it does. Note that to enable this correctly your time dimension needs to have its Type property set to Time

If, however, you want to do a true average, you will need to create a calculated member in the calculations tab of your cube.  The calculation is very simple, for what is an average? It is the sum of the values, divided by the number of values.  The sum we have as the base measure from the fact table, say Sales Value. When we create a new measure group in the cube it automatically generates a new measure suffixed "Count".  This measure is none other than the count of values (or fact table rows) in the measure group.

Therefore the average calculation is the one measure divided by the other:

CREATE MEMBER CURRENTCUBE.[Measures].AvgSales
AS Measures.Sales / [Measures].[SalesFactCount]

Since both measures are essentially straight from the fact table, their values are stored during processing time, and benefit from aggregation design too.  As a result the AvgSales measure ought to perform very fast.