Wednesday, 18 January 2017

SQL Server 2016 Analysis Services - PowerPivot, Tabular or Multidimensional?

Having downloaded SQL Server 2016 Developer Edition (it’s free, click here to find out how), for the first time in years I went about a full install of the latest version of Analysis Services. 

There are three modes in which SSAS can be installed, so I thought I’d give a brief overview of each of them:



Multidimensional and Data Mining (often referred to just as Multidimensional), Tabular Mode and finally PowerPivot Mode are the three options given during the installation process.

Having different modes allows SSAS to offer a more customised solution to closely meet the needs of a wider range of businesses and users. For example, Multidimensional is well suited for large-scale use - many users, large volumes of data, low latency etc - while PowerPivot is aimed more at the individual user, and Tabular mode could be said to sit somewhere in between.

Multidimensional Mode

This is the “classic” mode for Analysis Services to be installed in, available in its current form since SQL Server 2005, and is a mature product on an enterprise scale. If you’re already familiar with building OLAP databases then you can skip ahead to the next mode. Otherwise read on.

In Multidimensional mode, cubes, measures, dimensions etc are designed in Visual Studio and deployed to the SSAS (Multidimensional mode) server. Typically this is a set of tasks for dedicated developers - you would not expect business users to design the OLAP database.

Measure groups can be heavily partitioned to support very large volumes of data, only limited by the resources available. Parallel processing of partitions can be implemented to reduce data latency and improve performance, and security can be customised right down to the cell level. A lot of power and complexity can be built into the model to handle different kinds of relationships and aggregations, and additional enhancements can be made using MDX. Many previous posts on this blog have been written covering some of the more advanced features of Multidimensional mode. Also note that MDX is the primary language for querying databases designed in this mode.
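
For example, a simple MDX query against a hypothetical sales cube might look like the following (the cube, measure and hierarchy names here are purely illustrative, not from any particular sample):

-- Sales by product category for calendar year 2016 (illustrative names)
SELECT [Measures].[Sales Amount] ON COLUMNS,
       [Product].[Category].Members ON ROWS
FROM   [Sales]
WHERE  [Date].[Calendar Year].&[2016]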

PowerPivot Mode

To quote from MSDN:

“An Analysis Services server in Power Pivot mode provides server hosting of Power Pivot data in a SharePoint farm”

So that is basically the use-case for installing SSAS in PowerPivot mode - if you intend to use PowerPivot (an Excel add-in) to create models and then want to share those models, or schedule automated refreshes of them (say, to support regular reporting), then you need an SSAS PowerPivot instance. In addition, this instance will need to be registered with an Office Online Server (the SharePoint 2016 replacement for what was previously called Excel Services).

PowerPivot models are comparatively quick and easy to design, with far less complexity than the other two modes. Therefore power users on the business side (as opposed to developers) could be expected to develop and deploy these models themselves.

However, if you don't intend to distribute or perform server-style tasks (scheduling, security etc) on your models and they are really just for local use, it is unlikely you’ll need SSAS in PowerPivot mode.  You could instead just use PowerPivot as it comes - as a downloadable Add-in for Excel. 

PowerPivot is an end-user tool where users can source data from multiple sources and define their own relationships and calculations.

Important Note: While PowerPivot can support significant data compression, there is still a file size limitation of 2Gb, regardless of whether it is being used locally or uploaded and managed in SharePoint. So if you are expecting to handle large volumes of data, this might not be the best choice for you. You would do well then to consider Tabular Mode.

 Tabular Mode

Introduced in SQL 2012, Tabular mode has similar functionality to PowerPivot, except that it is more powerful and can handle significantly larger volumes of data. While Multidimensional mode is considered mature, the list of new and improved components in this release of SSAS consists almost entirely of updates to Tabular mode.  Expect more features to be added in future releases.

Tabular mode uses column storage (as opposed to the row storage of Multidimensional). It therefore compresses data using an entirely different engine, making higher compression rates possible. This can make a significant difference to things like distinct counts, which have often been a sore point in traditional Multidimensional models but can be used to real advantage in Tabular mode.

In Tabular mode the data is held in memory, so instead of the 2Gb hard limit of PowerPivot you are limited only by how much memory is available to you (rather than by disk space, as in Multidimensional mode), and compression rates are high.  While models are quick to design, the power of Tabular (and therefore its complexity) lies in the DAX language used to create formulas and relationships and to link data together quickly.  Multidimensional (and therefore MDX) is very fast at aggregating data but poor when it comes to accessing leaf-level data; DAX can handle leaf-level data with relative ease and good performance, but designing aggregations and other features such as Custom Rollups are simply not possible. Other Multidimensional features such as parallel processing are also not available with Tabular, which could impact data latency and performance as the database grows in size.
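
To give a flavour of DAX (the table and column names below are hypothetical, and the second measure assumes a proper date table exists in the model), measures are simply formulas defined over the model's tables:

// Basic aggregation over a fact table column (illustrative names)
Total Sales := SUM ( Sales[SalesAmount] )

// Time intelligence over an assumed date table
Sales Last Year := CALCULATE ( [Total Sales], SAMEPERIODLASTYEAR ( 'Date'[Date] ) )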

Note that as Tabular mode is still maturing, it can be expected that Microsoft will work to close the gap between Multidimensional and Tabular modes with each new release.

In deciding which mode is best for you, you will want to consider areas such as expected data volumes, data latency/refresh times, required features and the hardware resources available.

For a more detailed feature comparison between Multidimensional and Tabular check out the MSDN article here:


Monday, 2 January 2017

Download SQL 2016 Developer Edition for free

With the release of SQL Server 2016, folks like me get very excited to get their hands on the new software.  However, in the past it has always been quite expensive, or very time-limited for an individual to get hold of a full Developer license.

Not so anymore - as part of the Visual Studio Dev Essentials program, a full developer license for SQL Server 2016 SP1 is now available for free. This includes all the tools such as SSMS and SSDT, and a host of other products too, including Microsoft R Server, with even a few quid's worth of Azure credit thrown in.

In order to access the downloads you'll need to sign up for the program first (free), and then download away:

https://www.visualstudio.com/dev-essentials/

Also note that a new product such as SQL 2016 requires new sample databases - say goodbye to AdventureWorks and hello to the brand new Wide World Importers!  Download all the latest samples from GitHub here:

https://github.com/Microsoft/sql-server-samples/tree/master/samples/databases/wide-world-importers

Interestingly, however, there does not appear to be an SSAS solution built on the new samples.  I guess very little has changed in SSAS in this release, so MS felt they need not bother.  A bit irritating in my opinion, as it would be nice to have a consistent set of samples and not have to switch back to old solutions to demo SSAS - it gives the impression that SSAS is being less and less encouraged/supported. But maybe that is deliberate....

Happy downloading, and Happy New Year!

Monday, 14 December 2015

SSAS - How to Terminate Long Running Queries

Despite how well you may design your SSAS database or how well you educate your users, in my experience eventually someone will issue queries that take way too much time to complete.  It's usually down to a user forgetting to apply a filter somewhere.  They'll likely be using a front end tool like Excel that allows them to easily drag and drop measures and attributes without too much thought, and may "accidentally" query data for all time rather than just the desired dates. If the cube is partitioned by date (eg by day or month) and contains many years of data then this query could be scanning every single partition in the measure group, leaving the user to go get a coffee while waiting for the results.

(Incidentally, if a trace reveals that more partitions than expected are being scanned, then there may be an issue with the Slice property.)

Different front end applications will handle this waiting time in different ways, but many will simply appear to hang or freeze.  Cue a call from an agitated user: "Every time I run my query the application dies".  This is obviously unacceptable, not to mention the less visible drain on resources it may be causing.  But what steps can we take to resolve this?

While it's important to ensure that users are aware of certain front-end best practices of querying the cube (eg apply filters before nesting dimensions), there are a number of approaches we can take to quickly resolve the problem and stop the applications hanging.

1. Identify and Kill the Query
Just like a SQL relational database, SSAS also has DMVs that provide insights into server activity, including connection information.  Run the below MDX query:

SELECT SESSION_SPID,
       SESSION_USER_NAME,
       SESSION_CURRENT_DATABASE,
       SESSION_LAST_COMMAND,
       SESSION_CPU_TIME_MS,
       SESSION_ELAPSED_TIME_MS
FROM   $system.discover_sessions

From these results, we can identify which session_id is the troublemaker.  Once identified, we can investigate further, and, if necessary, manually kill the session.  Killing the session can be done with the following XMLA command:
<Cancel xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
       <ConnectionID>[CID]</ConnectionID>
       <SessionID>[SID]</SessionID>
       <SPID>[SPID]</SPID>
</Cancel>

Note: this command can be used to kill the connection, the session or the SPID - delete as required.

Executing the above command will likely cause an error to be thrown in the user's application and in some cases may require them to restart the program.  However, in most circumstances this is preferable to leaving the application hanging indefinitely.
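
On a busy server the full session list can be long, so a filtered version of the DMV query helps to narrow things down to genuinely long running sessions (the one-minute threshold below is purely illustrative):

-- Sessions that have been running for more than 60 seconds
SELECT SESSION_SPID,
       SESSION_USER_NAME,
       SESSION_LAST_COMMAND,
       SESSION_ELAPSED_TIME_MS
FROM   $system.discover_sessions
WHERE  SESSION_ELAPSED_TIME_MS > 60000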

2. Dynamically Kill Long Running Queries With Scheduled Jobs
While the above method is fine when manual intervention is an option, a much slicker approach would be to automate the process.  It would be simple enough to create a SQL Agent job or SSIS package to scan the DMVs and detect any queries running for longer than a predefined time.  The SPIDs could then be passed dynamically to the XMLA command and killed.  An advantage of this approach is that it allows a great deal of flexibility - for example, if certain users, certain times of year (eg month-end) or certain databases are known to demand unavoidably long running queries, then the cut-off time for killing these sessions could vary based on those parameters.
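
As a rough sketch of the detection step (the linked server name SSAS_OLAP and the ten-minute cut-off are assumptions for illustration, not part of any standard setup), a job could query the DMV through a linked server to the SSAS instance and generate a Cancel command for each offending session; the generated XMLA could then be executed by an "SQL Server Analysis Services Command" job step or a tool such as ascmd:

-- Detection sketch: SSAS_OLAP is an assumed linked server pointing at the SSAS instance,
-- and the 10-minute threshold is illustrative only.
DECLARE @CutoffMs INT = 600000;

SELECT  SESSION_SPID,
        SESSION_USER_NAME,
        SESSION_ELAPSED_TIME_MS,
        -- Build a Cancel command for each long running session
        '<Cancel xmlns="http://schemas.microsoft.com/analysisservices/2003/engine"><SPID>'
            + CAST(SESSION_SPID AS VARCHAR(20)) + '</SPID></Cancel>' AS CancelXmla
FROM    OPENQUERY(SSAS_OLAP,
            'SELECT SESSION_SPID, SESSION_USER_NAME, SESSION_ELAPSED_TIME_MS
             FROM $system.discover_sessions')
WHERE   SESSION_ELAPSED_TIME_MS > @CutoffMs;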

3. Dynamically Kill Long Running Queries using SSAS Server Properties
By far the most common approach for handling this problem is to use the SSAS server properties.  However, I have seen many occasions on which these properties are misunderstood, which is actually what inspired this post.  These properties can be accessed either by right-clicking on the SSAS server in SSMS and selecting Properties (tick the Advanced box at the bottom), or directly in the server file msmdsrv.ini:



ExternalCommandTimeout



I have often seen this setting used to handle this particular problem; however, that is a common mistake.  This setting relates to OLAP processing performance, not query performance.  If a processing query - ie a query sent from SSAS to a source system in order to process dimensions or partitions - has not completed its execution on the source system within the threshold set by this property, then the cube process command will be forced to fail.

ServerTimeout


This is the property, towards the end of the Advanced Properties list, that is relevant for us.  Any MDX query on any OLAP database on the server that does not complete within the time set by this property will be terminated.

There's no flexibility with ServerTimeout - one setting affects all MDX queries on the server regardless of origin, destination or anything else.  99% of the time this is a sufficient and effective solution for terminating long running queries with minimal fuss.
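
For reference, both of these properties also appear as elements in msmdsrv.ini. The fragment below is illustrative only - the values are in seconds, and the exact defaults and layout can vary between versions:

<ConfigurationSettings>
  ...
  <ExternalCommandTimeout>3600</ExternalCommandTimeout>
  ...
  <ServerTimeout>3600</ServerTimeout>
  ...
</ConfigurationSettings>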


Wednesday, 16 September 2015

SSAS - Slow Running Query? Unnecessary Partition Scans? Solve it with a Slice!

While there are many possible reasons why an MDX query of an OLAP cube may not perform as well as expected - the cube dimension hierarchies may not be optimised for best use, the MDX query itself may require optimisation etc - an often overlooked area is the measure group partition definition and, in particular, the partition Slice property.


 In a typical SSAS project I would normally design an "initial" partition inside SSDT (formerly BIDS):


I would deploy this solution and create a scheduled job to create additional partitions when necessary, depending on the partition strategy (eg every night/week/month etc).  The Create Partition XMLA script might look something like this:
  
<Create xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
<ParentObject>
<DatabaseID>MyDb</DatabaseID>
<CubeID>MyDb</CubeID>
<MeasureGroupID>MyMeasureGroup</MeasureGroupID>
</ParentObject>
<ObjectDefinition>
<Partition xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:ddl2="http://schemas.microsoft.com/analysisservices/2003/engine/2"
xmlns:ddl2_2="http://schemas.microsoft.com/analysisservices/2003/engine/2/2"
xmlns:ddl100_100="http://schemas.microsoft.com/analysisservices/2008/engine/100/100"
xmlns:ddl200="http://schemas.microsoft.com/analysisservices/2010/engine/200"
xmlns:ddl200_200="http://schemas.microsoft.com/analysisservices/2010/engine/200/200"
xmlns:ddl300="http://schemas.microsoft.com/analysisservices/2011/engine/300"
xmlns:ddl300_300="http://schemas.microsoft.com/analysisservices/2011/engine/300/300"
xmlns:ddl400="http://schemas.microsoft.com/analysisservices/2012/engine/400"
xmlns:ddl400_400="http://schemas.microsoft.com/analysisservices/2012/engine/400/400">
<ID>20150102</ID>
<Name>20150102</Name>
<Source xsi:type="QueryBinding">
<DataSourceID>MyDs</DataSourceID>
<QueryDefinition>SELECT * FROM MyTable WHERE MyDate = 20150102</QueryDefinition>
</Source>
<StorageMode>Molap</StorageMode>
<ProcessingMode>Regular</ProcessingMode>
<ProactiveCaching>
<SilenceInterval>-PT1S</SilenceInterval>
<Latency>-PT1S</Latency>
<SilenceOverrideInterval>-PT1S</SilenceOverrideInterval>
<ForceRebuildInterval>-PT1S</ForceRebuildInterval>
<Source xsi:type="ProactiveCachingInheritedBinding" />
</ProactiveCaching>
<EstimatedRows>5000000</EstimatedRows>
<AggregationDesignID>AggregationDesign</AggregationDesignID>
</Partition>
</ObjectDefinition>
</Create>

Once deployed and processed, this solution may work well. Users find the cube responsive and fast.  Over time, however, users start to complain that the same daily query/report that used to run almost instantaneously is now painfully slow.  No design changes have been deployed, no hardware modifications have been made - nothing has been altered.  So how is it that the same query now takes significantly longer?

An often overlooked area is the slice setting for partitions.  As an example of what can happen, let's look at the AdventureWorks cube from the SQL 2012 samples (available here http://msftdbprodsamples.codeplex.com/releases/view/55330).

A good starting point for debugging queries is often to run a SQL Profiler trace on the SSAS database and see what's happening when this query is executed.  Let's take a simple query:

SELECT [Measures].[Internet Sales Amount] ON 0,
       [Geography].[City].Members ON 1
FROM   [Adventure Works]
WHERE  [Date].[Calendar].[Date].&[20080607]

This will return the "Internet Sales Amount" measure by City on 7th June 2008.  If we use the Object Browser in Management Studio to examine the Internet Sales measure group we see this:


  
The measure group contains four partitions, one for each year.  Therefore we would expect our query above to hit only a single partition, Internet_Sales_2008.  Let's confirm by viewing the SQL Profiler trace results:


The trace shows that SSAS is actually scanning three partitions, not one - 2005, 2006 and 2008.  But we know the result set lies only in 2008, so why this extra scanning?  We may want to confirm that the results are only going to be in the 2008 partition by examining the partition query - the SQL query used as the source for the partition data.  In this case it's been checked and they are all set up with the correct queries and WHERE clauses. 

The reason for the extra partition scans is down to the Slice property of the partition not being set.  The slice property is used by SSAS to determine which partition(s) contains the desired data.  If SSAS knows up front where the data lies it can avoid wasting time and resources scanning partitions that will not return any data. Let's take a look at the properties of one of the partitions in SSMS:


We can see here that the Slice property is blank.  When left blank, SSAS uses its own default method of determining which partitions may contain the data.  While this may sometimes be sufficient, the trace above shows it is hardly foolproof.  Far better to set this property explicitly to the value we know to be valid, enabling the SSAS engine to be directed quickly to the partition containing the results.

Valid values for a slice property are an MDX member, set or tuple.  In this example the slices should be set to the relevant year, eg:

[Date].[Calendar Year].&[2008]
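
A set spanning more than one year would also be a valid slice, for example (the years shown are illustrative):

{ [Date].[Calendar Year].&[2007], [Date].[Calendar Year].&[2008] }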

If we now set this for each of the partitions and rerun the same query we see the following in the trace file:


The engine is going directly to the correct partition and ignoring all others.  The result is a much faster running query.  I have seen examples where there were 100+ partitions, all without the slice set.  The result was that queries were taking up to 30 seconds to complete.  On setting the slices appropriately, query time reduced to around 1 second.

Note that we can set the slice either via properties in SSDT at design time, in SSMS properties post deployment, or better yet, in the original XMLA we used at the start of this post to automatically create the partitions on a schedule:

…<StorageMode>Molap</StorageMode>
<ProcessingMode>Regular</ProcessingMode>
<Slice>[Date].[Calendar Year].&amp;[2008]</Slice>
<ProactiveCaching>….

So the lesson is: Always set the Slice property :)


Sunday, 14 June 2015

SQL Server Triggers – Which fields have been updated? – Part 2

In the last post (SQL Server Triggers – Which fields have been updated? – Part 1) we explored various methods to determine which columns in a table were being updated.  All those methods relied on a SQL trigger being fired.  Once fired, if a specific column in the table was updated we could carry out an appropriate action, such as inserting an audit record into another table.  However, what if the UPDATE statement was executed against the table, but the actual values of the fields were not changed?  Consider the following example:

CREATE TABLE MyTable (Col1 INT IDENTITY(1,1), Col2 varchar(10), Col3 INT)
GO

CREATE TRIGGER dbo.tr_MyTable ON dbo.MyTable  AFTER UPDATE
AS
BEGIN
       PRINT 'An Update has occurred'
END
GO

INSERT INTO MyTable (Col2, Col3)
SELECT 'a',7 UNION ALL
SELECT 'b',8 UNION ALL
SELECT 'c',9 UNION ALL
SELECT 'd',10
GO

UPDATE MyTable
SET Col3 = 10 WHERE Col2 = 'd'

The UPDATE statement above has overwritten Col3 with the same value as it had before.  The table data has not actually been changed.  Nonetheless, because an UPDATE statement has been executed, the AFTER UPDATE trigger fires and outputs the PRINT statement.  In fact, even if the UPDATE statement had not affected a single row, the trigger would still have fired, outputting the PRINT statement.

This is often not the desired behaviour – I don’t want to be inserting audit records if nothing has actually changed.  Depending on what logic has been implemented in the trigger and how often it is fired, this could have an adverse impact on performance.  Therefore the first improvement we can make is to ensure nothing happens if no records have been updated.  To do this we use the @@ROWCOUNT global variable (for more on this variable see my post on SQL Server Global Variables):

ALTER TRIGGER dbo.tr_MyTable ON dbo.MyTable  AFTER UPDATE
AS
BEGIN
       IF @@ROWCOUNT = 0 RETURN;

       PRINT 'An Update has occurred'
       --Perform UPDATE logic here
END
GO

This trigger will now immediately return, doing nothing, if no records have been updated.

Next we need to ensure we only take action if field values have actually changed.  To do this we can make use of the special inserted and deleted tables.  These are temporary tables created and managed automatically by SQL Server.  While we cannot perform any DDL operations on these tables, they contain columns identical to those of the table the trigger is created on, and we can refer directly to these columns in the code.  When a DML trigger fires, these two tables are populated with the deleted and inserted records from the underlying table respectively.  If records are updated then the original, pre-update record arrives in the deleted table and the post-update record in the inserted table.  Therefore, if we want to know whether previous values have actually been updated with new, different values, we can compare the contents of these two tables:

ALTER TRIGGER dbo.tr_MyTable ON dbo.MyTable  AFTER UPDATE
AS
BEGIN
       IF @@ROWCOUNT = 0 RETURN;

       IF EXISTS ( SELECT *
                   FROM   inserted i
                   INNER JOIN deleted d
                           ON i.Col1 = d.Col1
                   WHERE  d.Col2 <> i.Col2
                      OR  d.Col3 <> i.Col3 )
       BEGIN        
              PRINT 'An Update has occurred'
       END
END

In the above trigger, inside the IF EXISTS check, we join the inserted and deleted tables together on Col1, the identity column.  Because these tables are effectively the “before” and “after” states of the data, we are only interested in rows where the values in the remaining fields do not match, indicating that data has changed in those fields.  This logic is carried out by the WHERE clause.  Only if such rows exist do we then perform the desired UPDATE logic, in this case the PRINT statement.


The above trigger will now only do anything if data has actually changed in the underlying table.  This has the potential to save a lot of time and unnecessary processing on those occasions when an UPDATE has occurred but no actual change has taken place.
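
One caveat worth noting, as an extension to the example above rather than something covered in the original: the straight <> comparison will not spot a change from NULL to a value (or vice versa), because comparisons involving NULL never evaluate to true.  If the columns are nullable, a set-based comparison using EXCEPT, which treats NULLs as equal, closes that gap:

ALTER TRIGGER dbo.tr_MyTable ON dbo.MyTable AFTER UPDATE
AS
BEGIN
       IF @@ROWCOUNT = 0 RETURN;

       -- EXCEPT returns rows from inserted that have no exact match in deleted,
       -- treating NULLs as equal, so NULL-to-value changes are detected too
       IF EXISTS ( SELECT Col1, Col2, Col3 FROM inserted
                   EXCEPT
                   SELECT Col1, Col2, Col3 FROM deleted )
       BEGIN
              PRINT 'An Update has occurred'
       END
END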

Monday, 8 June 2015

SQL Server Triggers – Which fields have been updated? – Part 1


When updating a table it can often be very useful to know not just that an update has taken place, but exactly which columns have been updated. For example, in an HR department you may have an employee table that contains employee names and bank account numbers.  If the bank account number for an employee changes, whether through a front-end application, a DML query or any other method, it should be audited – by which we mean a corresponding record should be inserted into an audit table.  If an employee's middle name is changed, on the other hand, an audit is not required.

Below we’ll look at a number of ways for testing for specific column updates, and in Part 2 we’ll look at determining if values have indeed actually changed or not – they could have been updated with the same value as before.


UPDATE()

One way of determining which column has been updated is by using an AFTER UPDATE trigger.  Within this trigger we can use the UPDATE() function to detect whether a given column has been updated and apply the audit logic accordingly:

CREATE TRIGGER tr_Employee ON Employee AFTER UPDATE
AS
       IF UPDATE(BankAccountNo)
       BEGIN
              INSERT INTO EmployeeAudit (Column1, Column2, ..., ModifiedBy, ModifiedDate)
              SELECT inserted.Column1, inserted.Column2, ..., USER_NAME(), GETDATE()
              FROM inserted
       END
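
For completeness, a hypothetical shape for the audit table the trigger writes to might look like the following - the real column list and types would mirror whichever Employee columns you want to capture:

CREATE TABLE EmployeeAudit
(
       AuditID       INT IDENTITY(1,1),   -- surrogate key for the audit row
       Column1       VARCHAR(100),        -- copies of the audited Employee columns
       Column2       VARCHAR(100),
       ModifiedBy    SYSNAME,             -- captured via USER_NAME()
       ModifiedDate  DATETIME             -- captured via GETDATE()
)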
  
But what happens if we have additional columns, some requiring audit (such as SortCode, NationalInsuranceNo etc), and some not (eg marital status)?  Or perhaps the non-sensitive columns don’t require audit but do require some other logic, eg an additional table to be amended in line with this one.  We could append multiple IF UPDATE() statements:

IF UPDATE(BankAccountNo)
BEGIN
       --apply audit logic
END

IF UPDATE (MaritalStatus)
BEGIN
       --apply other logic
END
--next IF…

Or we could combine the groups of logic together into fewer statements:

IF (UPDATE(BankAccountNo) OR UPDATE(SortCode))
BEGIN
       --apply audit logic
END

IF (UPDATE(MaritalStatus) OR UPDATE(MiddleName))
BEGIN
       --apply other logic
END

COLUMNS_UPDATED()

If you like working with binary and bitwise logic, COLUMNS_UPDATED() can be used to achieve similar results with much less code. COLUMNS_UPDATED() returns a varbinary bit pattern indicating which columns in a table have been updated. From BOL:

COLUMNS_UPDATED returns one or more bytes that are ordered from left to right, with the least significant bit in each byte being the rightmost. The rightmost bit of the leftmost byte represents the first column in the table; the next bit to the left represents the second column, and so on. COLUMNS_UPDATED returns multiple bytes if the table on which the trigger is created contains more than eight columns, with the least significant byte being the leftmost.

So for a simple four-column table, when COLUMNS_UPDATED() is used in the trigger it may return a binary value of 1010.  This would indicate that columns two and four (as listed in syscolumns) have been updated, but one and three have not.  We can therefore use the value returned by this function in our trigger to determine whether a specific combination of columns in our table has been updated.  However, in order to do this we first need to convert the binary number into decimal.

As each column (right to left) represents an incremental increase in the power of 2 (ie the rightmost column is 2^0, the next to the left is 2^1, then 2^2 and so on), a binary value of 1010 can be translated to decimal as follows:

Column Name            Col4    Col3    Col2    Col1
Column Position        4       3       2       1
COLUMNS_UPDATED()      1       0       1       0
Bin to dec conversion  2^3     2^2     2^1     2^0
Decimal value          8       0       2       0


Binary 1010 = ( 0 x 2^0) + (1 x 2^1) + (0 x 2^2) + (1 x 2^3) =  0 + 2 + 0 + 8 = 10

When working with binary we need to use the SQL Server bitwise operators, in this case ‘&’ (bitwise AND), to compare two values, or we may get unexpected results.  The ‘&’ operator looks at each bit (that is, each 1/0) in the first binary number and compares it to the corresponding bit in the second number.  If the first bit in the first number is 1 AND the first bit in the second number is 1, then the first bit of the output is 1; if either bit is 0 then the output bit is 0. It then moves on to the second bit in both numbers and repeats. The output number can be converted back to decimal again.  In the first table below, the binary number 1010 is ‘AND-ed’ with itself and the result is shown. In the second table it is ‘AND-ed’ with a different number.



                   Col4    Col3    Col2    Col1    Decimal
Binary Value 1     1       0       1       0       10
Binary Value 2     1       0       1       0       10
Output of ‘AND’    1       0       1       0       10
    

                   Col4    Col3    Col2    Col1    Decimal
Binary Value 1     1       0       1       0       10
Binary Value 2     0       1       1       0       6
Output of ‘AND’    0       0       1       0       2

The above tables show that, using the bitwise AND operator, 10 & 10 = 10 and 10 & 6 = 2.  Therefore we can say that in our table, if COLUMNS_UPDATED() & 10 = 10, we know that columns two and four have been updated.
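
This is easy to verify directly in T-SQL (a throwaway check, not part of the trigger itself):

SELECT 10 & 10 AS [10 AND 10],   -- returns 10
       10 & 6  AS [10 AND 6]     -- returns 2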

The resulting trigger for this example would look like this:

CREATE TRIGGER tr_Employee ON Employee AFTER UPDATE
AS
IF (COLUMNS_UPDATED() & 10 = 10)
BEGIN
       --apply audit logic
END

Note that the IF statement above will evaluate to TRUE whenever columns two and four are among the updated columns, even if other columns have been updated as well – not only when two and four alone are updated.

The advantage of this approach is that, for a very wide table, the number of IF UPDATE() statements (or the number of conditions within them) could be huge and time-consuming to write, whereas COLUMNS_UPDATED() reduces the check to a single short line.  However, using this function means the behaviour of the code depends heavily on the order of the columns in the table: if the table is dropped and recreated with the columns reordered, the trigger may no longer function as expected.  In addition, while concise, this code is more difficult to understand and debug.  I would therefore recommend the earlier approaches unless COLUMNS_UPDATED() is genuinely necessary.

In Part 2  we’ll look at how to analyse the records that have been updated and determine if any update has actually been made at all – an Update trigger can be fired even if no values have actually changed.