
Wednesday, 16 September 2015

SSAS - Slow Running Query? Unnecessary Partition Scans? Solve it with a Slice!

While there are many possible reasons why an MDX query against an OLAP cube may not perform as well as expected (the cube dimension hierarchies may not be optimised, the MDX query itself may need tuning, etc.), an often overlooked area is the measure group partition definitions and, in particular, the partition Slice property.


 In a typical SSAS project I would normally design an "initial" partition inside SSDT (formerly BIDS):


I would deploy this solution and create a scheduled job to create additional partitions when necessary, depending on the partitioning strategy (e.g. every night/week/month).  The Create Partition XMLA script might look something like this:
  
<Create xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <ParentObject>
    <DatabaseID>MyDb</DatabaseID>
    <CubeID>MyDb</CubeID>
    <MeasureGroupID>MyMeasureGroup</MeasureGroupID>
  </ParentObject>
  <ObjectDefinition>
    <Partition xmlns:xsd="http://www.w3.org/2001/XMLSchema"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xmlns:ddl2="http://schemas.microsoft.com/analysisservices/2003/engine/2"
               xmlns:ddl2_2="http://schemas.microsoft.com/analysisservices/2003/engine/2/2"
               xmlns:ddl100_100="http://schemas.microsoft.com/analysisservices/2008/engine/100/100"
               xmlns:ddl200="http://schemas.microsoft.com/analysisservices/2010/engine/200"
               xmlns:ddl200_200="http://schemas.microsoft.com/analysisservices/2010/engine/200/200"
               xmlns:ddl300="http://schemas.microsoft.com/analysisservices/2011/engine/300"
               xmlns:ddl300_300="http://schemas.microsoft.com/analysisservices/2011/engine/300/300"
               xmlns:ddl400="http://schemas.microsoft.com/analysisservices/2012/engine/400"
               xmlns:ddl400_400="http://schemas.microsoft.com/analysisservices/2012/engine/400/400">
      <ID>20150102</ID>
      <Name>20150102</Name>
      <Source xsi:type="QueryBinding">
        <DataSourceID>MyDs</DataSourceID>
        <QueryDefinition>SELECT * FROM MyTable WHERE MyDate = 20150102</QueryDefinition>
      </Source>
      <StorageMode>Molap</StorageMode>
      <ProcessingMode>Regular</ProcessingMode>
      <ProactiveCaching>
        <SilenceInterval>-PT1S</SilenceInterval>
        <Latency>-PT1S</Latency>
        <SilenceOverrideInterval>-PT1S</SilenceOverrideInterval>
        <ForceRebuildInterval>-PT1S</ForceRebuildInterval>
        <Source xsi:type="ProactiveCachingInheritedBinding" />
      </ProactiveCaching>
      <EstimatedRows>5000000</EstimatedRows>
      <AggregationDesignID>AggregationDesign</AggregationDesignID>
    </Partition>
  </ObjectDefinition>
</Create>
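One way to schedule a script like this is a SQL Server Agent job step using the ANALYSISCOMMAND subsystem, which sends the XMLA to the SSAS instance. A sketch (the job, server and partition names here are hypothetical; in practice the XMLA would usually be generated dynamically for the new date):

```sql
-- Hypothetical Agent job step: run the Create Partition XMLA above against SSAS.
-- Assumes a job named 'Create SSAS Partition' already exists.
EXEC msdb.dbo.sp_add_jobstep
     @job_name  = N'Create SSAS Partition',
     @step_name = N'Create partition via XMLA',
     @subsystem = N'ANALYSISCOMMAND',
     @server    = N'MySsasServer',          -- hypothetical SSAS instance name
     @command   = N'<Create xmlns="http://schemas.microsoft.com/analysisservices/2003/engine"> ... </Create>';
```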

Once deployed and processed, this solution may work well. Users find the cube responsive and fast.  Over time, however, users start to complain that the same daily query/report that used to run almost instantaneously is now painfully slow.  No design changes have been deployed, no hardware modifications have been made - nothing has been altered.  So how is it that the same query now takes significantly longer?

An often overlooked area is the slice setting for partitions.  As an example of what can happen, let's look at the AdventureWorks cube from the SQL Server 2012 samples (available here: http://msftdbprodsamples.codeplex.com/releases/view/55330).

A good starting point for debugging queries is often to run a SQL Profiler trace on the SSAS database and see what's happening when this query is executed.  Let's take a simple query:

SELECT [Measures].[Internet Sales Amount] ON 0,
       [Geography].[City].Members ON 1
FROM [Adventure Works]
WHERE [Date].[Calendar].[Date].&[20080607]

This will return the "Internet Sales Amount" measure by City on 7th June 2008.  If we use the Object Browser in Management Studio to examine the Internet Sales measure group we see this:


  
The measure group contains four partitions, one for each year.  Therefore we would expect our query above to hit only a single partition, Internet_Sales_2008.  Let's confirm by viewing the SQL Profiler trace results:


The trace shows that SSAS is actually scanning three partitions, not one: 2005, 2006 and 2008.  But we know the result set lies only in 2008, so why the extra scanning?  To confirm that the results can only be in the 2008 partition, we can examine each partition query (the SQL query used as the source for the partition data).  In this case they have all been checked and are set up with the correct queries and WHERE clauses.

The reason for the extra partition scans is down to the Slice property of the partition not being set.  The slice property is used by SSAS to determine which partition(s) contains the desired data.  If SSAS knows up front where the data lies it can avoid wasting time and resources scanning partitions that will not return any data. Let's take a look at the properties of one of the partitions in SSMS:


We can see here that the Slice property is blank.  When left blank, SSAS uses its own default method of determining which partitions may contain the data.  While this may sometimes be sufficient, the trace above shows it is hardly foolproof.  Far better to set this property explicitly to the value we know to be valid, enabling the SSAS engine to go directly to the partition containing the results.

Valid values for the Slice property are an MDX member, set or tuple.  In this example the slices should be set to the relevant year, e.g.:

[Date].[Calendar Year].&[2008]

If we now set this for each of the partitions and rerun the same query we see the following in the trace file:


The engine is going directly to the correct partition and ignoring all others.  The result is a much faster running query.  I have seen examples where there were 100+ partitions, all without the slice set.  The result was that queries were taking up to 30 seconds to complete.  On setting the slices appropriately, query time reduced to around 1 second.
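Incidentally, a slice need not be a single member; a set is equally valid. For example, a partition holding two years of data might use a slice like the following (member keys as in the AdventureWorks example above):

```mdx
{ [Date].[Calendar Year].&[2007], [Date].[Calendar Year].&[2008] }
```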

Note that we can set the slice via the Properties window in SSDT at design time, via the partition properties in SSMS post-deployment, or, better yet, in the original XMLA we used at the start of this post to create the partitions automatically on a schedule:

…
<StorageMode>Molap</StorageMode>
<ProcessingMode>Regular</ProcessingMode>
<Slice>[Date].[Calendar Year].&amp;[2008]</Slice>
<ProactiveCaching>
…

So the lesson is: Always set the Slice property :)


Saturday, 25 April 2015

SSAS 2012 Remove Zeros Using NullProcessing and Improve Query Times

In Analysis Services, optimal query response time is key to a successful project.  The cube may contain all sorts of clever logic, and processing times may be lightning fast, but if users have to wait on long-running queries each time they access the SSAS database, they will soon tire of using it.


As fact and dimension tables grow over time, query performance can slow down.  There are many techniques for improving query performance - good partition design for measure groups, implementing natural hierarchies for dimensions etc.  However, there are also other techniques that often get overlooked.  One such technique is to look at the source data and decide if it is really necessary to import it into the cube. A smaller cube will generally outperform a larger one (all other factors being equal).

I recently worked on an SSAS database with a dimension that had grown to over 30 million members at the leaf level.  On any given day only a subset of these members, around fifty thousand, actually had data in the cube, yet query performance was massively impacted: around 3 minutes were required to complete a typical query.

On examining the data returned by the SSAS queries, most of the measure values for these members were zero.  Only a few hundred records per day contained non-zero values.  In the source system's relational tables the values were null, but in the cube they were displayed as zero.  As a result, to resolve the queries SSAS was having to read and aggregate all these zero values when in actual fact they were of no interest to the users.  If these zeros could be removed we could eliminate a huge chunk of data from the cube and avoid the overhead of processing, reading and returning it.

It may seem odd that the cube would convert null values in the source tables into zeros in the cube, but the key to this behaviour lies in the NullProcessing property.  As a simple example, let’s first create some dimension and fact data:

CREATE TABLE dimCustomer (custId INT, CustName VARCHAR(25), CustCountry VARCHAR(25))
CREATE TABLE dimDate (DateId INT, CalendarDate DATE)
CREATE TABLE factMeasures (dimCustId INT, dimDateId INT, Sales INT)

INSERT INTO dimCustomer
SELECT 1, 'Dave', 'England' UNION ALL
SELECT 2, 'Bob', 'Scotland' UNION ALL
SELECT 3, 'Jenny', 'Ireland' UNION ALL
SELECT 4, 'Jill', 'Scotland' UNION ALL
SELECT 5, 'Donny', 'Ireland'


DECLARE @t SMALLDATETIME = GETDATE()
INSERT INTO dimDate
SELECT CAST(CONVERT( VARCHAR(12), @t-1, 112) AS INT),CAST(@t-1 AS DATE) UNION ALL
SELECT CAST(CONVERT( VARCHAR(12), @t-2, 112) AS INT),CAST(@t-2 AS DATE) UNION ALL
SELECT CAST(CONVERT( VARCHAR(12), @t-3, 112) AS INT),CAST(@t-3 AS DATE) UNION ALL
SELECT CAST(CONVERT( VARCHAR(12), @t-4, 112) AS INT),CAST(@t-4 AS DATE) UNION ALL
SELECT CAST(CONVERT( VARCHAR(12), @t-5, 112) AS INT),CAST(@t-5 AS DATE) UNION ALL
SELECT CAST(CONVERT( VARCHAR(12), @t-6, 112) AS INT),CAST(@t-6 AS DATE) UNION ALL
SELECT CAST(CONVERT( VARCHAR(12), @t-7, 112) AS INT),CAST(@t-7 AS DATE)

INSERT INTO factMeasures
SELECT 1, CAST(CONVERT( VARCHAR(12), @t-1, 112) AS INT), 10 UNION ALL
SELECT 1, CAST(CONVERT( VARCHAR(12), @t-2, 112) AS INT), 20 UNION ALL
SELECT 1, CAST(CONVERT( VARCHAR(12), @t-3, 112) AS INT), 20 UNION ALL
SELECT 2, CAST(CONVERT( VARCHAR(12), @t-1, 112) AS INT), 30 UNION ALL
SELECT 2, CAST(CONVERT( VARCHAR(12), @t-2, 112) AS INT), 50 UNION ALL
SELECT 3, CAST(CONVERT( VARCHAR(12), @t-1, 112) AS INT), NULL UNION ALL
SELECT 3, CAST(CONVERT( VARCHAR(12), @t-2, 112) AS INT), NULL UNION ALL
SELECT 3, CAST(CONVERT( VARCHAR(12), @t-3, 112) AS INT), NULL UNION ALL
SELECT 3, CAST(CONVERT( VARCHAR(12), @t-4, 112) AS INT), NULL UNION ALL
SELECT 4, CAST(CONVERT( VARCHAR(12), @t-1, 112) AS INT), 10 UNION ALL
SELECT 4, CAST(CONVERT( VARCHAR(12), @t-2, 112) AS INT), NULL UNION ALL
SELECT 4, CAST(CONVERT( VARCHAR(12), @t-3, 112) AS INT), NULL UNION ALL
SELECT 5, CAST(CONVERT( VARCHAR(12), @t-1, 112) AS INT), NULL UNION ALL
SELECT 5, CAST(CONVERT( VARCHAR(12), @t-2, 112) AS INT), NULL UNION ALL
SELECT 5, CAST(CONVERT( VARCHAR(12), @t-3, 112) AS INT), NULL UNION ALL
SELECT 5, CAST(CONVERT( VARCHAR(12), @t-4, 112) AS INT), NULL

SELECT * FROM factMeasures




As the results show, we now have a simple schema with a fact table containing Sales values.  Note that for CustID 5, Donny, all values are null.

Next let’s build a simple SQL Server Analysis Services 2012 cube on top of this schema, using SQL Server Data Tools (SSDT).  The cube design looks like this:



After deploying and processing the cube, we’re now ready to browse the data by customer and date:



Note that Donny is listed with 4 days of data, each with a sales value of zero.  The source system showed nulls, but the cube has converted them to zeros, in effect creating values where none existed.  In our example we would prefer these cells in the cube to be empty.  There may be a genuine need for the record in the source system; for example, while this measure column is null for these rows, an additional measure column might hold valid figures for Donny.  As things stand, however, a NON EMPTY query will return these zeros, when we would expect Donny not to be listed at all.

To get the desired behaviour we need to examine the NullProcessing property for the measure.  In SSDT open the cube and go to the Cube Structure tab.  In the measures list on the left select the relevant measure and open the Properties window (F4).  At the bottom of the list of properties expand the Source property:



By default NullProcessing is set to Automatic, which rather ambiguously means the Analysis Services server decides what to do.  Changing the setting to Preserve will preserve the nulls from the source data.  Using the above example, setting this to Preserve and reprocessing produces the following results for the same query as before:





Donny no longer appears in the list, and the result set has reduced from 48 cells to 18.  Although not apparent on such a small dataset, in the original example (50k records taking 3 minutes to return) applying the NullProcessing change reduced query time to less than 5 seconds, returning only a few hundred rows.
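For reference, the same setting appears in the measure's XMLA definition as a NullProcessing element on the measure's Source. A sketch based on the example schema above (IDs and exact element order should be taken from your own scripted cube definition rather than this fragment):

```xml
<Measure>
  <ID>Sales</ID>
  <Name>Sales</Name>
  <Source>
    <DataType>Integer</DataType>
    <NullProcessing>Preserve</NullProcessing>
    <Source xsi:type="ColumnBinding">
      <TableID>factMeasures</TableID>
      <ColumnID>Sales</ColumnID>
    </Source>
  </Source>
</Measure>
```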





Tuesday, 30 December 2014

How to Execute SQL Script Asynchronously

SQL code generally runs sequentially.  One line of code must complete before the next line is executed.  We can redirect the path that is followed through the code using conditional logic or GOTO statements, but ultimately we cannot progress to the next step until the previous one has completed.

However, there are situations where we may want the code to continue to step 2 without waiting for a particularly time-consuming step 1 to complete.  If step 1 and step 2 do not impact each other then it might be beneficial to have the SQL execute asynchronously or concurrently, saving overall execution time.

For an example let's create a table to insert data into for step 1, and a stored procedure that will insert data into it:

CREATE TABLE [dbo].[InsertSomething](
       id                   INT IDENTITY(1,1) NOT NULL,
       Col1                 VARCHAR(32) NULL,
       InsertedDate         DATETIME NULL
)
GO

CREATE PROC DoSomething @Data VARCHAR(32) = ''
AS
INSERT INTO InsertSomething (Col1, InsertedDate) VALUES (@Data + ' - Start', GETDATE())
WAITFOR DELAY '00:00:10'
INSERT INTO InsertSomething (Col1, InsertedDate) VALUES (@Data + ' - End', GETDATE())
GO

The stored proc DoSomething will insert two rows into the table InsertSomething.  After the first row is inserted there will be a delay of ten seconds before the second row is inserted.  Therefore the time for the procedure to complete is approximately ten seconds, which in this example represents a long running piece of code.


EXEC DoSomething 'Test 1'

INSERT INTO InsertSomething (Col1, InsertedDate) VALUES ('Test 2', GETDATE())

SELECT * FROM InsertSomething

Results
id          Col1                             InsertedDate
----------- -------------------------------- -----------------------
1           Test 1 - Start                   2014-12-30 12:38:40.023
2           Test 1 - End                     2014-12-30 12:38:50.030
3           Test 2                           2014-12-30 12:38:50.037

We can see from the timestamps that the "Test 2" record was inserted only after the stored procedure had finished inserting the Test 1 records, i.e. the SQL script is executing sequentially, as expected.

In the above example we would prefer the "Test 2" INSERT statement not to have to wait for the stored procedure to complete; we would like the two SQL commands to run concurrently.  To do this we create a second stored procedure that calls the first on a new thread:

CREATE PROC [dbo].[AsyncInsert] @Data VARCHAR(32)
AS

DECLARE @rc          INT
DECLARE @object      INT
DECLARE @osql_cmd    VARCHAR(1000)
DECLARE @ServerName  VARCHAR(256) = (SELECT @@SERVERNAME)

DECLARE @ExecStmt    VARCHAR(MAX) = '"EXEC DoSomething @Data=''' + @Data + '''"'

EXEC @rc = sp_oacreate 'wscript.shell', @object OUT

SET @osql_cmd = 'osql -E -dMyDB -S' + @ServerName + ' -Q' + @ExecStmt

EXEC @rc = sp_oamethod @object, 'run', NULL, @osql_cmd

EXEC sp_oadestroy @object

GO
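Note that sp_oacreate and the other OLE Automation procedures are disabled by default, so if they have not already been enabled on the instance, something like the following is needed first (requires ALTER SETTINGS permission):

```sql
-- Enable OLE Automation procedures (off by default)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'Ole Automation Procedures', 1;
RECONFIGURE;
```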
  

Test the new procedure by running the following code:

EXEC [AsyncInsert] @Data = 'Test 1 Async'

WAITFOR DELAY '00:00:01'

INSERT INTO InsertSomething (Col1, InsertedDate) VALUES('Test 2 Async', GETDATE())

WAITFOR DELAY '00:00:10'

SELECT * FROM InsertSomething

Results
id          Col1                             InsertedDate
----------- -------------------------------- -----------------------
1           Test 1 - Start                   2014-12-30 12:38:40.023
2           Test 1 - End                     2014-12-30 12:38:50.030
3           Test 2                           2014-12-30 12:38:50.037
4           Test 1 Async - Start             2014-12-30 12:39:01.967
5           Test 2 Async                     2014-12-30 12:39:02.930
6           Test 1 Async - End               2014-12-30 12:39:11.987


In the results above the timestamps show that the "Test 2 Async" insert occurred while the stored procedure (Test 1) was still running: after the "Test 1 Async - Start" insert completed but before the "Test 1 Async - End" insert was executed.


We have spawned a separate process to execute the stored procedure, so we no longer have to wait for it to complete before continuing.  The SQL script is now effectively running asynchronously.