
Rob Farley

- Owner/Principal with LobsterPot Solutions (a MS Gold Partner consulting firm), Microsoft Certified Master, Microsoft MVP (SQL Server), APS/PDW trainer and leader of the SQL User Group in Adelaide, Australia. Rob is a former director of PASS, and provides consulting and training courses around the world in SQL Server and BI topics.

  • PASS Board elections coming up

It’s now four years since I ran for election to the PASS Board, for a two-year term which ended two years ago.

The two other people who got elected when I did were Denise McInerney and Adam Jorgensen. Two years ago, Denise and Adam became the two Vice Presidents, and Adam is about to begin a stint as President. As for me, I didn’t run a second time. Two years ago, Jen Stirrup, Amy Lewis, and Tim Ford got elected when Rushabh Mehta, Douglas McDowell, and I left the board.

    And so we come around to election time again, and Jen, Amy, and Tim are all about to finish their first two years on the board. Amy is not re-running, but Jen and Tim are, along with two new candidates – Argenis Fernandez and Ryan Adams.

    I’m excited, because PASS can’t lose here. The current board members seem to be doing an excellent job (although I didn’t get to serve alongside either Jen or Tim, I know them both well and am sure they’re doing great), but I also know that Argenis and Ryan will do great.

    Argenis is a passionate guy who will stand for important things. I’ve had many conversations with him, and know him to be wise, caring, and dedicated to seeing good things happen. He raises money for charity, and is not afraid to put his reputation on the line for important things. He is one of my most excellent friends.

Ryan is also a great guy. He works hard for PASS, and has almost won the PASSion award for his dedication to the community. I haven’t had as many conversations with him as I have with the other candidates, but I know that he will work tirelessly for the PASS community – that word ‘tireless’ seems to describe Ryan more than anyone else I can think of in the community, and I am a little jealous of his energy levels.

    I can’t tell you who to vote for (except for Jen – as the only person running for the EMEA seat I expect everyone to vote for her), but do take the time to get to know the candidates and make your vote count – voting opens later this week.


  • Superhero powers for managing the enterprise

    This month’s T-SQL Tuesday is hosted by Jen McCown of @midnightdba fame. She wants us to write about strategies for managing the enterprise, and as a database consultant, I find myself giving advice on this kind of thing to customers remarkably often.

    No, I’m not going to do stories about LobsterPot customers. We don’t do that. What happens between us and the customer stays between us and the customer. However, I can talk about the kinds of things that we look for when we talk to a customer.

The thing that I want to look at in this post is that twitch you get when something doesn’t feel right – the ‘spider-sense’ feeling that Peter Parker gets when there’s something that’s not quite right.

Experience is what helps people know what ‘normal’ looks like. I’ve heard stories that people who are trained to spot counterfeit bank notes don’t spend time with fake notes – they spend time with the real ones (on loan, presumably). They learn what ‘normal’ looks like, and get really familiar with it. That way, when something isn’t quite right, they can spot it and raise an alarm.

DBAs taking care of an environment, whether small or large, will have learned what ‘normal’ looks like. They should have benchmarks that tell them how various metrics perform, and for those things that they don’t have formal metrics on, they should still be familiar enough to recognise when something isn’t right, even if they can’t tell exactly what. Their ‘spider-sense’ should tingle.

If you don’t have benchmarks, then get them. Every time you find something that doesn’t seem right, you will wish you had a benchmark on that thing to be able to quantify your suspicion. Feeling like something isn’t right is great, but it won’t be long until someone asks “How do you know this thing needs attention?” and “I just know” probably won’t cut it. If you’re a consultant, you can probably get away with “In my experience...”, because that’s what they’re paying you for, but having supporting evidence – actual numbers – can help, particularly if you’re the main person responsible and need to persuade someone to find the money for a CapEx.
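To make that concrete: one low-effort way to start collecting benchmark numbers is to poll sys.dm_os_performance_counters on a schedule and store the results. This is only a minimal sketch – dbo.Benchmarks is a hypothetical table you would create first, and which counters matter depends on your environment:

    -- Capture a couple of headline counters into a (hypothetical) baseline table
    insert dbo.Benchmarks (captured_at, object_name, counter_name, instance_name, cntr_value)
    select getdate(), object_name, counter_name, instance_name, cntr_value
    from sys.dm_os_performance_counters
    where counter_name in ('Page life expectancy', 'Batch Requests/sec');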

    Having the numbers handy is useful for a lot of situations, but there are a whole bunch of tools also available to look at too. A while back I wrote about how DBAs could use the same kinds of tools that other industries hire data professionals to provide, in a post called “DBAs and the Internet of Things”. If you take this kind of approach and start analysing the readings from all kinds of things that affect your database, then you can get ahead of the game. Feed this stuff into something like Azure ML for predictive analytics, and you might be able to have an even-better spider-sense, where the prediction isn’t just based on your own opinions, but on what has caused failures in the past.

Too often, the significant thing is some small detail that most people wouldn’t notice, but because of your experience and expertise, you can spot it and work out whether it’s significant or not. Then, if you don’t have that particular thing benchmarked, or analysed by other tools, you can include it to see what’s going on.

    ...and develop superhero powers for managing your enterprise. It’s something we regularly recommend to DBAs.


  • SQL 2014 queries under a strobe light

    I just wrote a post about Live Query Statistics. Let me show you how this technology in SQL Server 2014 can be used for some amazingly cool (nerdy cool, at least) stuff.

    Behind the scenes, LQS uses a DMV called sys.dm_exec_query_profiles. When you run a query with SET STATISTICS PROFILE ON, the engine puts data into this DMV for every operator in the plan that produces data (which doesn’t include the SELECT operator, or Compute Scalar, for example). What SSMS does while you’re watching a long-running query is poll this DMV over and over to get the progress data, so that you can see it in those moments between kicking off the query and its completion.
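If you want to peek at it yourself while a profiled query runs in another session, a query of this shape works – the columns shown are just a selection from the DMV:

    -- One row per data-producing operator, per session being profiled
    select session_id, node_id, physical_operator_name, row_count, estimate_row_count
    from sys.dm_exec_query_profiles
    order by session_id, node_id;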

When you use LQS on a relatively quick query, say, one that completes in a single second, you can’t exactly watch this information come through. You might be able to use a debugger, and pause operation of your server for a moment while you step through it, but this is far from ideal. And yet a query that completes in a single second might actually need some work. What if this query needs to be able to run many times per second, and you’re looking for strategies to tune every last bit out of it?

    Clearly LQS is going to be no use.

    But the workings behind it... that’s another matter. Let me show you.

When entomologists want to study what happens in the flight of an insect, they take lots of photographs, often using a strobe light to capture a moment with as little blur as possible. This allows for incredibly detailed images, like the ones you can see in this article from the Daily Mail in the UK.

    (Image only linked from source – please let me know if it’s not there, but also go and look at the amazing pictures that this guy takes)

    I don’t know how many times this insect flapped its wings between each image that was captured, but I think you’ll agree that with enough images, it would be possible to learn a lot about the way that the wing-flapping takes place.

    This is the same as what I’m doing with queries in SQL Server 2014.

    Suppose I run a query over and over, with statistics profile turned on. The code here runs my sub-second query over and over for six minutes.

    --set statistics profile on

    declare @t datetime = dateadd(minute,6,getdate());

    while (@t > getdate())

    select p.Name, sum(d.OrderQty) as Qty
    from Sales.SalesOrderDetail d
    join Production.Product p
    on p.ProductID = d.ProductID
    group by p.Name;


    I used an SSMS window for this, and told SSMS to discard the results. I only need to run it once to see the results – I just care about the profile stats. Incidentally, it returns 266 rows.

    Now, I know that the plan that’s being used here is:


    The Node_IDs of these operators are 1 (the Join), 2 (the Aggregate), 3 (the Scan on SalesOrderDetail), and 7 (the Scan on Product). These numbers have gaps just because of the inner workings of the Query Optimizer. And they go from left to right because that’s how a plan runs. The SELECT calls Node 1, which calls Nodes 2 and 7, and Node 2 calls Node 3.

    So during five of the six minutes that my query was running over and over and over, I went into a different window and polled the DMV every half a second.

set nocount on;

declare @t datetime = dateadd(minute,5,getdate());

while (@t > getdate())
begin
    -- poll the DMV every half a second for five minutes
    insert dbo.rf_strobe
    select getdate() as myquerytime, *, 'Hash Match'
    from sys.dm_exec_query_profiles;

    waitfor delay '0:00:00.5';
end


I had made a table which matched the DMV, with a couple of extra columns thrown in: one to record when I captured the moment (so that I could differentiate between moments), and one (called ‘comment’) to record which query I was monitoring. The DMV contains the sql_handle and plan_handle, so I could’ve differentiated between them later, but I wanted to be able to differentiate between them more easily than that.
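If you want to recreate this, the table can be built straight from the DMV’s shape – a sketch (the WHERE 1 = 0 means no rows are inserted; it just creates an empty table with matching columns):

    -- Create an empty table matching sys.dm_exec_query_profiles, plus the two extra columns
    select getdate() as myquerytime, qp.*, cast('Hash Match' as varchar(100)) as comment
    into dbo.rf_strobe
    from sys.dm_exec_query_profiles qp
    where 1 = 0;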

    This data gave me 597 different values for ‘myquerytime’, 597 different moments captured. I don’t know how many different times my query ran in that time – probably far more than 597, although I wouldn’t have really cared if it were less. These 597 moments each had up to 4 rows, showing how each operator was going in its cycle.

There are columns in the DMV for all kinds of attributes, but the one that I was most interested in was row_count, which tells me how many rows the operator has produced. There are columns about CPU too, but as I’m looking at a query which runs in a small number of milliseconds, I’m more interested in the number of rows it’s produced.

Here’s some of the data that’s in my table.


    You can see that when I polled the DMV at 19:28:20.303, Node 3 (the Clustered Index Scan on SalesOrderDetail) had pushed 25241 rows, and no rows had been outputted by the other three. Half a second later, the snapshot showed 91044 rows from that node. Another half a second, and it was 54279, and in the poll at 19:28:21.807, the scan had served up all its rows, and there was output from the other three operators.

    Each of these is like a strobe image, capturing a moment in the life of the query.

    To recreate how the query runs, we need to piece them together. To do this, I’m going to assume that the outputted rows are done in the same order each time (which is a reasonable assumption when I’m running the query over and over in quick succession, with no changing parameters or conditions). By summing the row_count across all the operators in each poll, I can order the polls. A quick bit of pivotting...

select row_number() over (order by OverallRowCount) as rownum, *
from (
    select myquerytime,
        count(*) + sum(row_count) as OverallRowCount,
        max(case when node_id = 1 then row_count end) as HashMatchJoin,
        max(case when node_id = 2 then row_count end) as HashMatchAgg,
        max(case when node_id = 3 then row_count end) as CIXScan,
        max(case when node_id = 7 then row_count end) as IXScan
    from dbo.rf_strobe
    where comment = 'Hash Match'
    group by myquerytime
) s
order by rownum;

    ...and I can see the query start:


    ..and the moment when the Clustered Index Scan stops outputting more rows:


    ...and the end of the query.


    Notice that almost all the time we polled the DMV to see how the query was going, data was still being pulled out of SalesOrderDetail. It was only 96.1% (574/597) of the way into the query that data started to be outputted from the blocking Hash Aggregate. And because a Hash Match Join blocks until the hash table has been completed, we have to wait even longer before we eventually start pulling data from the Product table, when we pull up to 504 rows to find the 266 rows of interest.

    I’m sure you can get a feel for how this query is running from this information.

    But let’s compare this to what happens if I force a Nested Loop join instead of the Hash Match Join. This plan:


    I repeated the collection of data, running the same query over and over but with OPTION (LOOP JOIN), and a different comment in the polled data. I also decided to use dynamic SQL to query my strobe table to save rewriting the pivot for each plan.

declare @comment varchar(100) = 'Forced Loop Join';
declare @qry nvarchar(max) =
'select row_number() over (order by OverallRowCount) as rownum, *
from (
    select myquerytime,
        count(*) + sum(row_count) as OverallRowCount'
    + (select ', max(case when node_id = ' + cast(node_id as varchar(10)) + '
                        then row_count end) as [' +
                        cast(node_id as varchar(10)) + ': ' + physical_operator_name + ']'
            from dbo.rf_strobe
            where comment = @comment
            group by node_id, physical_operator_name
            order by node_id
            for xml path(''),type).value('.','nvarchar(max)')
    + '
    from dbo.rf_strobe
    where comment = @comment
    group by myquerytime
) s
order by rownum;';
exec sp_executesql @qry, N'@comment varchar(100)', @comment = @comment;

    It started very similarly, but was very different towards the end.

    The Scan seemed to go for a longer portion of the plan – 98.5% (587/596), but as the Hash Match Aggregate started producing rows, the Nested Loop was pulling the row from the Seek and returning it to the SELECT operator before pulling the next row in. You can see the row_count going up equally across the three operators, which is very different to what we saw with the Hash Match Join.


A factor I hadn’t considered before became evident at the end. In the Hash Match Join example, we saw a bunch of moments when all the operators had produced their data, which aren’t there in the Nested Loop example. The result of the query is no different, but the tear-down time is much quicker with the Nested Loop – presumably because the Hash Table used for the join doesn’t need to be dropped. I didn’t expect this to be as significant as it seems to be, but we certainly managed to catch six images – about 1% of them – when the operators had all stopped returning data, but the query was still hanging around as far as the DMV was concerned. With everything else being identical, I can only assume it’s down to the Hash Table having more to do to shut down than the Nested Loop.

    Just for fun, I tested the pivot query itself, giving 582 strobe images.



    The first thing to notice is that the Compute Scalar operators didn’t report anything, as expected.


    The next is that the right-most Sort was blocking, and had a significant pause after the Scan finished – about 35 rows or 6% of the query time.


    The Stream Aggregate doesn’t block, but the left-most Sort, which has called the Stream Aggregate (via two Compute Scalars) does, following which there’s another pause (but smaller – fewer rows), after which the Segment and Sequence Project operators don’t block.


    At the end of the query we have about 80 rows – well over 10% of the query time – after the Sequence Project has outputted its last row.



Now – there is more that I will learn about this still, and I have been making some assumptions about whether the DMV polling gives a sufficiently random moment. But from the things I’ve seen, there is definitely information about queries here that I haven’t seen before, and which requires some further research.

Finally, I spent a bit of time looking at visualisation options for this. I immediately thought of the Power View play axis that’s on scatter charts, but sadly I didn’t have enough luck coming up with an effective visualisation very quickly. I felt like Hans Rosling with his scatter chart about birth rates, and used LAG and MAX() OVER() to come up with a speed of row production compared to the total, but I kept getting the “representative sample” message, which got in the way. I’m sure it won’t be long before this visualisation would be easy, and for the purposes of analysis, I was more interested in exploring the data rather than making it look pretty.

    Strobe photography is very impressive. I just never thought it would apply to T-SQL queries.


  • Live Query Statistics in SQL 2014, not just SQL 2016

    One of the coolest things in SQL Server 2016 is Live Query Statistics (LQS). But did you know that it’s available in SQL 2014? (Edit: Needs SP1)

The thing is that we haven’t been able to view it effectively before now – not until SQL Server Management Studio 2016 became available.

    LQS provides the ability to watch an execution plan while the query is still running.


In this image above, you can see that I have a fairly long-running query, and I’ve taken a screen shot while the query is running. I can tell it’s still running because of the dotted lines. At this point, the Hash Match has asked the Index Scan for all the rows it’s going to ask for, and the Index Scan has been marked as 100% done. The Hash Match is now asking the Merge Join for rows. The Merge Join is pulling data out of the Sort, which has already pulled in the rows it needed, and so on.

    It’s cool stuff.

    And significantly, this is against SQL Server 2014 (SP1). I had to use SSMS 2016, but then the button appeared at the top to include Live Query Statistics...


    ...and then I could see them. When I connected to an earlier version, such as SQL Server 2012, the box was disabled and I couldn’t see them.

    So why not install SSMS 2016 (it’s only the client tool – I’m not suggesting you put it on a server), and see how some of your queries look?

    You shouldn’t do this against a production machine, because it takes some effort for SQL Server to produce the data used for this. But see how you go. It’s definitely a useful performance-tuning feature which is available right now.


  • Can you still use DBCC PAGE to access data in dropped tables if TDE is turned on?





    Oh, ok... I’ll explain some more.

    Transparent Data Encryption encrypts data at rest. That’s the stuff that’s on disk – the encryption happens when the data is written to disk, and the decryption happens as the data is loaded into RAM from the disk. The engine handles this so that it’s invisible to the user, applications, and so on. Without it, you can open an MDF/NDF file in a hex editor and read the contents. With it, you can’t.
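(For context, enabling TDE follows a standard shape. This is only a sketch with illustrative names – not the setup for the databases shown here:)

    use master;
    create master key encryption by password = 'UseAStr0ngPassword!';  -- illustrative password
    create certificate TDECert with subject = 'TDE certificate';
    use MyDb;
    create database encryption key
        with algorithm = AES_256
        encryption by server certificate TDECert;
    alter database MyDb set encryption on;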

    Here’s an example with a database that’s not encrypted:


    And here’s an example that is:


    I searched for some of the text that I could see – successfully in the first, unsuccessfully in the second.

    I also used SQL Server to show me the contents of a page using DBCC PAGE, and could do this successfully (once I’d closed the files in the hex editor and brought the databases back online).


    ...which also worked in both databases.
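For reference, the call involved looks like this – the database name, file id, and page id are illustrative, and trace flag 3604 routes DBCC PAGE output to the client:

    dbcc traceon(3604);
    dbcc page('MyEncryptedDb', 1, 256, 3);  -- database, file id, page id, print option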


I had hoped this would work okay, because I figured that DBCC PAGE would have to pull the data into RAM again (remember this system was offline – the pages weren’t in RAM before), and that it would decrypt the data as it did so. But I wondered if DBCC PAGE might be slightly lower-level, and bypass the decryption somehow. I argued with myself that if TDE was indeed Transparent, it shouldn’t care – what if my application relied on using DBCC PAGE? It’s a known feature, even if it is officially undocumented (which is where my doubts set in).

    But as you see, it worked okay.

    But what if I dropped the table first? Would SQL then go “Hang on – this page isn’t one that I have control over any more...” and refuse to decrypt it?

    No – it works just the same.

    Even if you drop an object, you can still access the pages that it used until they get overwritten. You won’t be able to read them with a hex editor, but DBCC PAGE will still read them in decrypted form, letting you pull that data out.

And yes, you can even use DBCC WRITEPAGE to overwrite the bytes in their unencrypted form, so that you can use that (dangerous) method of fixing corruption, even in an encrypted database. I’ve just redone my fix for Steve Stedman’s 10th Corruption Challenge, and it worked just fine on an encrypted version of the database.

    It’s still T-SQL Tuesday on the topic of encryption, so I’m throwing this one into the mix for that as well.


  • Always Encrypted with SQL 2016

    Transparent Data Encryption has been around for some time now, making sure that data in SQL Server as stored on disk is encrypted. When it was announced, this was incredibly exciting.

    You see, by default, SQL Server data is not encrypted. If you open up the pages within a data file, you can read the data that’s in there. Numbers are stored as hex, varchar strings are stored as readable values – it can be quite surprising to realise this. The first time you ever salvage an MDF file from a broken server, open up that file, and just start reading the data, you realise that the data in most databases is only as secure as the physical security of the disks – you can read data from that table whether or not your Windows login has access to it.

    With Transparent Data Encryption, that MDF file is encrypted. It’s only decrypted when it’s pulled into RAM, where access to it is controlled by database permissions. If your access to a particular table has been denied, you’re not getting to read that data out of RAM. When the power goes out, the decrypted data in RAM disappears – that’s what RAM’s like. It’s a good thing. But TDE is across the whole database, whether you like it or not. And everyone who has access to the table can read it unencrypted, as if it’s never been encrypted.

It’s Transparent. It doesn’t feel like it’s encrypted, and as far as any application can see, it’s not. This only protects against the disk (or backup) being compromised.

    And then there’s the distrust of DBAs.

    As someone who deals with sensitive data regularly, I can assure you that this is a very real concern. I’ve had police checks, to help make sure that I’m trustworthy – and I make a point of never looking myself up in customer databases (I know I must exist in some of my customers’ databases – I receive bills from some of them even). I also have confidence that my employees at LobsterPot Solutions are all completely trustworthy, and that I’d throw the book at them if they ever broke that trust.

    I’ve certainly been asked “How do I stop the DBA from being able to read the data?”

    And that’s where the problem is. The DBA is the guardian of your data. They’re the person who makes sure that the database doesn’t have corruption, who can defrag indexes, tune queries, and so on. The DBA should be trustworthy with the data. But what if that data contains the lottery numbers? What if that data involves national secrets? At what point is the sensitivity of the data way too great to be able to let Archbishop Desmond Tutu look after it, let alone Barry the DBA?

    Now, I understand that a DBA might feel insulted that they’re not trusted to read the data. They need to look after it, but they can’t look at it. My take is that if I can be removed from suspicion if there’s a leak, then great. I know I’m okay. The client should know I’m okay. But would a criminal investigation be able to remove me from suspicion? Not if I’m one of the only people that could read the data and know how to cover my tracks.

    Transparent Data Encryption doesn’t stop me from being able to read the data. Not at all. If I have access to query the table, it looks unencrypted to me.

Always Encrypted, in SQL Server 2016, does stop me though, if I don’t have access to the certificate which protects the column master key. Always Encrypted stores encrypted data in the database file. It’s encrypted in RAM, and while it’s being sent across the network, only being decrypted by the application that has access to the certificate. Now, I know that as an administrator, I might be able to jump through a few hoops to be able to get access to it, but this could be handled by other people, who could deny my access to it.

    This decryption is still transparent to the applications that use it – so the applications don’t need to change, apart from telling the connection string to look out for it. The .Net 4.6 environment will handle the decryption without developers needing to code it specifically. Access to those applications can be controlled in other ways.
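(For what it’s worth, the documented connection-string opt-in is Column Encryption Setting=enabled.) On the T-SQL side, the encryption details live in the column definition. This is only a sketch with illustrative names, assuming a column master key and column encryption key (MyCEK) have already been created:

    create table dbo.Customers
    (
        CustomerID int identity(1,1) primary key,
        TaxFileNumber char(9) collate Latin1_General_BIN2  -- deterministic encryption needs a _BIN2 collation
            encrypted with (
                column_encryption_key = MyCEK,
                encryption_type = deterministic,
                algorithm = 'AEAD_AES_256_CBC_HMAC_SHA_256'
            ) not null
    );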

    If you’re needing to search by values that are encrypted, it might not be ideal for you. But for that data that you need to make sure even you can’t read, this could be a really nice option. Roll on 2016!


PS: Another option is to build encryption functions into client applications, so that the applications encrypt everything, and handle everything well away from the database. We’ve been able to do this since applications first began, but when there are multiple applications, the coordination of keys can become problematic. What you get with SQL 2016 is the ability to centralise control over the encryption and decryption.

    PPS: This post was brought to you for the 69th T-SQL Tuesday, hosted by Ken Wilson (@_KenWilson)


  • Corruption, corruption, more corruption

    No, not in my systems.

    Corruption does happen from time to time. At LobsterPot Solutions we get calls from people now and then who have corruption in their databases, that want help getting them repaired. Generally, it’s not too much of a problem, although I’ve seen some doozies over time. (And if you need help, get in touch!)

    Interestingly, I don’t think many DBAs practise solving corruption issues very often. They might test restores, and even test getting data back from a salvaged LDF file, but rarely would they test any of the stranger corruption scenarios. I’ve never put it on a high priority – I know lots of ways to get data out of tables, and know how to get data out of pages using DBCC PAGE... but I wouldn’t say that I practise solving corruption very often. There are so many different ways that a database can become corrupted, and 99% of the time, the data can be salvaged using some very standard approaches.

    So it was good to see Steve Stedman (@SQLEmt) run a series of Corruption Challenges – ten corrupt databases, all of which could be repaired without data loss, but never using the standard DBCC CHECKDB repair option. I noticed the first one part way into the allotted time, and thought it would be a useful exercise to see if I could solve them all. Let’s face it – when someone gets in touch because they have a problem, they want to be reassured that they’re dealing with someone who knows what they’re doing.

    All ten challenges have now appeared, and not only have I solved every one of them – a feat only shared by Andre Kamman of the Netherlands and Neil Abrahams of the UK – but I’ve also won three of the events, more than anyone else, allowing me to finish top of the scoreboard!

    Winning the competition was never my intention – I’m just pleased to have been able to know that there were no challenges that beat me, which hopefully means that my customers are still in capable hands.

    My challenge to you is to grab the ten databases that Steve has made available, and without looking at the solutions that are all posted, see how you go. At the very least it will be good practice for when you have to deal with the real thing.


  • Check the settings when installing SQL Server

    Next, next, next, next, next... you know the drill.

    Except that when installing SQL, it’s simply not good enough. The defaults might not work for you, and that makes this post qualify for this month’s T-SQL Tuesday, hosted by Andy Yun (@sqlbek).


    Most things are fine, but there is one page which you really shouldn’t ignore. And it’s not even obvious. But if you’re just clicking Next, next, next, then you might easily miss it.

    When you reach this page:


    please make sure you don’t just hit Next without looking at the Collation tab. It looks like this:


    Now, I don’t particularly care what you decide to put here. But you shouldn’t just set it blindly. If you are installing a new server and you need it to be consistent with what you’ve used before, go and look at the collation setting on your old instance. This is ridiculously important.
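Checking is easy, by the way. Something like this shows the instance collation alongside the current database’s collation:

    select serverproperty('Collation') as InstanceCollation,
           databasepropertyex(db_name(), 'Collation') as DatabaseCollation;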

    You see, people are lazy. And when I say people, I mean developers. And when I say developers, I mean bad developers. (There was a thing recently that said that “Women’s Soccer” should now be called simply “Soccer”. I agree with this. There shouldn’t have to be a differentiation between a game played by a particular type of people, except perhaps “American Football”, which is obviously called “football” because they use their feet so much. Oh right, about developers. I hope as time passes, developers become better. But for now, when I say “Developers”, I mean bad developers.)

    So when a developer creates a temporary table, they do something like:

    CREATE TABLE #resultset (col1 varchar(20) NOT NULL, col2 int NOT NULL);

    And then they populate it with some data, and then they use it in another query. Like this:

    SELECT t.*
    FROM dbo.SomeTable t
    JOIN #resultset r ON t.code = r.col1;

    This is code that has worked for years. But if you have ignored the collation setting and the instance collation is different to the database collation, you will get an error. The system won’t know whether two strings are supposed to be the same or not. Is ‘abc’ the same as ‘àbç’, if one tells you to ignore accents and the other says not to? The system can’t decide. It’s even more complex than that, because two strings might be identical, but it won’t know how to look them up if the alphabet orders letters differently. It can’t decide and gives an error. It goes in the too-hard basket.

    Of course, a nice responsible developer will have created the temporary table like this, and then the problem never occurs:

    CREATE TABLE #resultset (col1 varchar(20) COLLATE DATABASE_DEFAULT NOT NULL, col2 int NOT NULL);

    But let’s face it – this is rare. Most people write database code without thinking about the collation settings, and that’s a problem. Most of us are bad developers.
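If you’re already stuck with a mismatch, individual queries can also be patched by forcing a collation at the comparison itself – a small sketch using the tables from above:

    SELECT t.*
    FROM dbo.SomeTable t
    JOIN #resultset r ON t.code = r.col1 COLLATE DATABASE_DEFAULT;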

The error you get is the one described in a Connect item which I raised in January 2008 (7.5 years ago), which has 47 upvotes, and which was closed in 2011 as “Won’t Fix”. It was looking likely for a while (at least when using tempdb), but then they must’ve realised it wasn’t particularly important. Ultimately, it could give unexpected results if you’re not confident about which language you’re using, and it’s potentially better to give an explicit error than to let your query work but give the wrong results.

    SELECT *
    FROM dbo.Sports
    WHERE Name = 'football';

    Msg 468, Level 16, State 9, Line 1
    Cannot resolve the collation conflict... ;)


  • DBAs and the Internet of Things

    Ever since data moved to the cloud, database administrators have wondered what it will mean for their jobs.

It doesn’t matter whether you are thinking about Infrastructure as a Service, Platform as a Service, or Database as a Service – there is some aspect of what a database administrator does that may not exist in the future.

    And then there’s the situation around Big Data, and the Internet of Things. We see videos of how organisations are using devices that produce data to help control things as large as the London Underground, and as personal as your fitness routine. It doesn’t matter whether the Thing is recording your heart beat or the vibrations of a tube train – the fact is that data is being produced to help our lives.

For a database administrator, this presents a new set of concerns. Does the DBA of today need to be able to design and tune a system that leverages Windows Azure SQL Database? What about Parallel Data Warehouse? Hadoop? Apache Storm? StreamInsight? There are so many things that make it seem to the DBA that their skill set must adapt for them to survive.

    But what I want to suggest is that the DBA should not be seeing these technologies as their domain to administer and tune, but their domain to consume and leverage.

    Currently, database administrators in the SQL Server world use Performance Monitor counters to keep an eye on how disk, CPU, I/O, RAM, cache, etc., etc., are going. This data is polled regularly and stored, while SQL Server Agent alerts are configured to holler if there’s something noteworthy, such as a Transaction Log filling up. Extended Events allow all kinds of events to be monitored, with data about them examined, stored if necessary, reacted to, and more.

Once, we used SQL Server Trace, and a tool called Profiler. Profiler was frowned upon somewhat – its voracity for consuming data meant that using it to run a trace live would put an unnecessary load on the SQL instance being monitored. The better option was to configure a trace, and then use Profiler after the fact to see what was happening. Extended Events (XE) is seen in a different kind of light. An XE session runs, keeping an eye on things in a very decoupled way, and the stream of events that is produced can be hooked into using a number of different methods.
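To give you an idea of how lightweight an XE session is to set up, here’s a minimal sketch – the event, predicate, and target here are illustrative choices of mine, not a recommendation:

    create event session WatchSlowQueries on server
    add event sqlserver.sql_statement_completed
        (where duration > 1000000)   -- duration is in microseconds, so this is > 1 second
    add target package0.event_file (set filename = N'WatchSlowQueries.xel');

    alter event session WatchSlowQueries on server state = start;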

From the perspective of the DBA, these XE sessions become an “Internet of Things”-style environment. Imagine that each configured event is a smart device, sitting somewhere, producing data in some semi-structured form. Of course, we know the structure, but each event might have a different configuration, with different kinds of properties being available. The DBA consumes them using the same mechanisms as IoT technologies, and suddenly becomes a new kind of DBA-superhero. It’s not unheard of today to hear of people looking after large numbers of databases.

    The Internet of Things is here for database administrators. Extended Events are part of it.


    PS: This post about Extended Events is part of the latest T-SQL Tuesday event, hosted by my friend Jes Borland (@grrl_geek).


  • Monitoring skew in PDW

    When you have data stored across several servers, skew becomes very significant.

In SQL Server Parallel Data Warehouse (PDW), part of the Analytics Platform System (APS), data is stored in one of two ways – distributed or replicated. Replicated data is copied in full across every compute node (those servers which actually store user data), while distributed data is spread as evenly as possible across all the compute nodes, with the decision about where each row of data is stored dependent on the hash of a column’s value. This column is known as the distribution key.

For performance reasons, though, data warehouse designers will typically choose a distribution key that avoids data movement, by choosing a column which is commonly used in joins or aggregations. This performance is traded off against skew, which itself can have a performance impact. If a disproportionate amount of data is stored on a single compute node (which is really a question of ‘distributions’, which correspond to filegroups – eight per compute node), then storage is effectively used up more quickly (after all, you only need one disk to fill up to have problems), and queries run slower because one distribution has to churn through more data than the others.

And so skew should be monitored. This fits in nicely with this month’s T-SQL Tuesday, on the topic of monitoring, hosted by Cathrine Wilhelmsen (@cathrinew).

    It’s not even just monitoring it that is important, but trending it. You see, your data can change over time.

    Imagine you sell products, to customers, in stores, over time. You could distribute your data by store, product, customer, or day (or even by some surrogate key, but then you need to generate that surrogate key). And whatever you choose might start off as just fine – until your business changes a little... such as a product comes out that outsells the others by a long way, or when you start selling online, to anonymous users, and offer sales on certain days of the year. Skew might not be there at the start, but it might become a factor over time.

    So how can you measure skew? Well, using DBCC PDW_SHOWSPACEUSED.

    You might have created a table called dbo.Sales, but if you have six compute nodes, then your data is actually stored across 48 tables. DBCC PDW_SHOWSPACEUSED("dbo.Sales") will provide the number of rows, and the space used and reserved for each of these 48 tables, letting me see what the skew is like.

    If I have 480 million rows, then ideally I have 10 million rows in each distribution. But if my data shows that 4 of my distributions have 15 million rows, and the other 44 have about 9.5 million, then it might not seem like much skew – most of them are 95% of what I had hoped for. But if the disks storing those 15 million-row tables are starting to fill up, then I’m essentially missing out on the opportunity to store 5.5 million rows on each of the 44 emptier ones – 240 million rows’ worth! To put it another way – if they were all as full as my fullest distribution, I could’ve had 48 * 15 million rows, which is 720 million rows! This is not efficient use of my storage.

    Applying a bit of maths like this can help me work out the wastage – looking to see what the maximum number of rows (or storage allocated) is, and multiplying that by the number of distributions that I have, to see what I could be storing if I didn’t have skew. This wastage could be a stored as a percentage easily enough, and this can be tracked over time.
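If you capture the DBCC output into a table over time, the wastage calculation is straightforward. A sketch, assuming a hypothetical dbo.SpaceUsedHistory table holding one row per distribution per capture:

    -- Wastage: how far short of "every distribution as full as the fullest" we are
    select table_name, captured_at,
           sum(row_count) as ActualRows,
           max(row_count) * count(*) as CapacityAtFullestDistribution,
           100.0 * (max(row_count) * count(*) - sum(row_count))
                 / (max(row_count) * count(*)) as WastagePercent
    from dbo.SpaceUsedHistory
    group by table_name, captured_at;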

    You could very easily run this command across your distributed tables on a regular basis, storing the results in a table, and produce a report which shows both the overall skew across the appliance and individually across each table. This can then become part of your capacity planning process, as I’m sure you’re considering across all your SQL Server instances.


  • Tuning slow spatial queries in SQL Server

    Even with the right indexes in place, spatial queries in SQL Server are often too slow – but they needn’t be.

    Two of the most commonly found patterns of query in the spatial world are when you’re looking for the nearest thing to where you are (which I’ve written about before), and when you’re looking for the points that are within a particular boundary. This latter example is where I want to spend most of my time in this post.

    A spatial index splits the world (for geography, or area for geometry) into a series of grids. The index essentially stores whether each point is inside each grid section or not, or for more complex types, whether it completely covers the grid section, whether it’s not in the grid section at all, or whether it’s partially in it. If we know that a particular grid section is completely covered by a polygon / multipolygon (I’m going to refer to polygons from here on, but it all applies to multipolygons too), then every point that is within that grid section must also be in that polygon. The inverse applies to a grid section which doesn’t overlap with a polygon at all – none of the points in that grid can be in that polygon.
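(Incidentally, those grid densities are configurable when the index is created. A sketch, with assumed table and column names:)

    create spatial index ixSomePointsGrid on dbo.SomePoints(pt)
    using geography_grid
    with (grids = (level_1 = medium, level_2 = medium, level_3 = high, level_4 = high),
          cells_per_object = 16);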

    More work is required if the overlap of the polygon and grid section is only partial.

    Let’s assume we’ve got to the finest level of the grid system, and the polygon that we’re considering is partially overlapping the grid section. Now, we need to do some calculations to see if each point is within the polygon (or at least, the intersection of the grid section and the polygon). These calculations involve examining all the sections of the polygon’s boundary to see if each point is inside or outside.

Imagine trying to see if a particular point is within the coastline of Australia or not, relying on our eyes alone to judge whether an answer is ‘easy’ or ‘hard’.


    If we’re zoomed out this much, we can easily tell that a point near the border of South Australia and Northern Territory (near where it says ‘Australia’ in red) is within the coastline, or that a point half way between Tasmania and Victoria is not. To know whether some arbitrary point near Adelaide is within the coastline or not – I can’t tell right now. I’m going to need to zoom in more.


Now, if my point of interest is near Murray Bridge, or in the sea between Kangaroo Island and the mainland, then I can tell my answer. But for somewhere slightly northwest of Adelaide, again we need to zoom. Notice, though, that we can now see a lot more detail about the coastline than we could before. Parts of the coastline that seemed relatively simple now show a complexity we hadn’t started to consider.

    If we zoom again, we see that the coastline to the northwest of Adelaide is frustratingly complex.


    And this doesn’t stop as we keep zooming.





    The area around the Port River here is full of man-made structures, which can be defined relatively easily at some zoom level. A natural coastline is trickier (go back a few images to see the zoomed out version of this next one).


    Here we see a complexity that could just keep going. Each of these bays would have their own level of complexity, and as we zoom, we would see that the complexity just keeps going. The crinkliness of coastlines becomes like a fractal, that can keep zooming further and further, like in the animated GIF here (which is on a loop, but I think demonstrates the point well enough). It’s supposedly an illusion, but it’s really a basic property of fractals.


    Benoit B. Mandelbrot’s work in fractals shows a pattern which is probably a more familiar ‘infinite zoom’ concept.

(Image: ‘Mandelbrot color zoom.gif’)

Coastlines don’t actually let us zoom in infinitely, because we can’t go far past the detail of atoms, but we do need to consider how much detail we choose to define in our polygons.

    If I consider a dataset I have of South Australian spatial data, and I query the UnionAggregate of the different postcode areas, I get a geography value of the whole state.


    There are over 26,000 points defining this area (I didn’t count them, I used the STNumPoints method). If I convert it into a string (using .ToString()), I get a result which is over a million characters long. This is going to take a lot of effort to use in real queries.
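(The measurements themselves are one-liners, assuming @SApolygon holds the unioned state geography:)

    select @SApolygon.STNumPoints() as NumPoints,
           len(@SApolygon.ToString()) as WKTLength;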

    But how much detail do I actually need?

    One of the spatial methods is called Reduce(). It simplifies the shapes involved.

    If I use Reduce(50), then it reduces the number of points by about half. But at the zoom level we can see at this point, we might not care at all.


    If I Reduce(250), the number of points comes down to about 4000.


    This isn’t necessarily good though. Reducing even further can start to distort the borders so much that we end up making mistakes. A point that is within the Reduced polygon might not be in our original one, and vice versa.

    At Reduce(10000) I have under 400 points.


    At Reduce(50000) you can clearly see changes. Kangaroo Island has been simplified too much. The peninsulas have become remarkably triangular, and we would be making mistakes.


Mistakes will come as soon as we Reduce the number of points at all. Despite the fact that at the level shown above we could see no discernible difference between the original set and Reduce(50), if we zoom in we can see that even Reduce(50) is a problem.


    You can clearly see that the crinkliness has been reduced, but that this could make the results of any query change. It would perform better, but it would be wrong.

    However – it would only be wrong near the borders. We can use this...

Let me tell you about three other methods – STBuffer(@d), BufferWithCurves(@d), and BufferWithTolerance(@d, @t, @r). These put a buffer around our polygon. If we have buffered enough, and then we Reduce, we can make sure that we get no false negatives from our check – anything that is outside this area is definitely outside the original.

    BufferWithCurves produces a buffer shape using curves rather than points, which can offer some nice optimisations, but buffering coastlines using BufferWithCurves is very memory-intensive, and you may even find you are unable to produce this buffer.

    BufferWithTolerance is like having a Reduce built into the buffer – which sounds ideal, but I found that even with a very high tolerance I was still getting too many points for my liking. With a buffer of 1000, and tolerance of 250, I still had nearly 20,000 points. The things that look like curves in the image below are actually made up of many points.

    So I’m choosing a Reduced STBuffer instead.

With a large enough value for STBuffer, and a sufficiently small value for Reduce, I can make a simple border, and can use STDifference to confirm that it completely surrounds my original polygon. Without much effort, I can find some values that work well enough. Notice that there are only seven points in the image below, whereas the BufferWithTolerance version has hundreds.

    select @SApolygon union all select @SApolygon.STBuffer(500).Reduce(500)


    The query below confirms that no part of the original land is outside my buffered and reduced area, showing zero points in the STDifference, and is what I used to find appropriate values for buffering and reducing.

    select @SApolygon.STDifference(@SApolygon.STBuffer(@reducefactor).Reduce(@reducefactor)).STNumPoints();


    Having defined an area that can have no false negatives, we can flip it around to eliminate false positives. We do this by buffering and reducing the inverse (which we find using ReorientObject()). I actually found that buffering by 1000 was needed to make sure there was no overlap (the second query below confirms it – the first query is just to overlay the two images).

    select @SApolygon union all select @SApolygon.ReorientObject().STBuffer(1000).Reduce(500).ReorientObject();

select @SApolygon.ReorientObject().STBuffer(1000).Reduce(500).ReorientObject().STDifference(@SApolygon).STNumPoints(); --must be zero


    With a zero for the second query above, we have defined an area for which there are no false positives.

    Now I know that points within the ‘no false positives’ area are definitely within my original polygon, and that points within the ‘no false negatives’ area are definitely not. The area between the two is of more interest, and needs checking against the complex coastline.


    Now let’s play with some better queries to show how to do this.

    Let’s create a table of areas. I’m going to use computed columns (which must be persisted to be useful) for the ‘no false negatives’ and ‘no false positives’ areas, although I could’ve done this manually if I wanted to experiment with buffers and tolerances on a per-area basis. Let’s also create a table of points, which we’ll index (although I’ll also let you know how long the queries took without the index).

create table dbo.areas
(id int not null identity(1,1) primary key,
area geography,
nofalseneg as area.STBuffer(500).Reduce(500) persisted,
nofalsepos as area.ReorientObject().STBuffer(1000).Reduce(500).ReorientObject() persisted
);
create table dbo.points
(id int not null identity(1,1) primary key,
point geography
);
create spatial index ixSpatialPoints on dbo.points(point);

    And we’ll put some data in. Our SA outline in the areas, and the points from AdventureWorks2012’s list of addresses – 19614 points.

    insert dbo.areas (area)
    select geography::UnionAggregate(geom)
    from postcodes.dbo.postcodes
    insert dbo.points (point)
    select SpatialLocation
    from AdventureWorks2012.person.Address;

Putting the data in took some time. The 19,614 points took about 5 seconds to get inserted, but the 1 area took 25 seconds. STBuffer() takes a while, and this is why having these computed columns persisted can help so much.

    The query that we want to solve is:

    select p.*
    from areas a
    join points p on p.point.STIntersects(a.area) = 1;

    This returns 46 rows, and it takes 4 seconds (or 3.5 minutes without the index).

    ...whereas this query takes less than a second (or 20 seconds without the index):

select p.*
from areas a
join points p on p.point.STIntersects(a.nofalsepos) = 1;

    But it only returns 44 rows, because we’re only including the points that are in the “no false positives” area.

    This query returns 47 rows. We’re counting the points that are in the “no false negatives” area, which is a larger area than we want.

    select p.*
    from areas a
    join points p on p.point.STIntersects(a.nofalseneg) = 1;

    It also runs in under a second (or 20 seconds without the index).

By now, hopefully you see what I’m getting at...

What I need is the list of points that are inside ‘nofalsepos’, plus the ones that are inside ‘nofalseneg’ but not inside ‘nofalsepos’, and that have also been checked against the proper area.

    select p.*
    from areas a
    join points p on p.point.STIntersects(a.nofalsepos) = 1
    union all
    select p.*
    from areas a
    join points p on
                    not p.point.STIntersects(a.nofalsepos) = 1
                and p.point.STIntersects(a.nofalseneg) = 1
                and p.point.STIntersects(a.area) = 1;

    This query takes approximately 1 second (or 40 without the index).

But it’s complex – and that’s unfortunate. We can hide this complexity using a view, though.

    create view dbo.PointAreaOverlap as
select as pointid, p.point, as areaid, a.area
    from areas a
    join points p on p.point.STIntersects(a.nofalsepos) = 1
    union all
select as pointid, p.point, as areaid, a.area
    from areas a
    join points p on
                    not p.point.STIntersects(a.nofalsepos) = 1
                and p.point.STIntersects(a.nofalseneg) = 1
                and p.point.STIntersects(a.area) = 1;

    , which we can query using much simpler code such as:

    select pointid, point
    from dbo.PointAreaOverlap
    where areaid = 1;

    Coastlines, or any complex spatial values, can make for poor performance, even with indexes in place. To improve performance, you could consider Reduce() to have it run quicker, although you’ll need to take some additional steps to guarantee correctness.

    And in case you were wondering what the “B.” in “Benoit B. Mandelbrot” stands for... it stands for “Benoit B. Mandelbrot”.


  • SHOWPLAN permission denied even if the database isn’t actually used

    To view a query plan, you need SHOWPLAN permission on the database level at least. You have this if you have CONTROL DATABASE, or CONTROL SERVER, or if you have ALTER TRACE at the instance level. I know this last one because it’s mentioned in Books Online on the ‘Database Permissions’ page, not because it’s particularly intuitive.

    As a consultant, I sometimes deal with customers who are reluctant to grant me the kind of access level that I would like to have to work on queries. SHOWPLAN is something that I will almost always request, though, and generally it’s considered harmless enough. I want to be able to show plans, so SHOWPLAN is part of what I like to have when writing any kind of query. Actually, I often find myself requesting ALTER TRACE, because it covers SHOWPLAN across all databases. Without it, you can find yourself in a situation where you sometimes get this error


, because a view, function, or stored procedure accesses a database that you haven’t been granted access to. Maybe it contains sensitive information – maybe you don’t have security clearance, for example, but there is a table in that database that is referenced as part of a process you need to look into. I’m not going to get into the why, or the reasons why you could request better access, or anything like that – that’s not the point of this post. The point of this post is to talk about something which I learned about SHOWPLAN across databases that aren’t actually used in a query. And it’s part of this month’s T-SQL Tuesday, hosted by Mike Donnelly (@SQLMD).

I was thinking about this situation though – having cross-database queries and not having SHOWPLAN on all of the referenced databases – and about the fact that views often contain more information than you necessarily need. This got me back to my Redundant Joins material (which I should blog about again, as I haven’t written about it properly here), and the fact that the Query Optimizer can simplify out joins which aren’t actually used at all.

Something occurred to me which I didn’t know the answer to, so I did a bit of research and found the answer – making it something I wanted to write about for this T-SQL Tuesday about new things learned.

    Imagine a view, a table-valued function, a sub-query, just some table expression, which references (joins to) a lookup table but doesn’t need to. If we’re not interested in the data in the lookup table, this join is only needed if it’s matching multiple rows, or being used as a filter (which can’t happen if it’s a left outer join), or if it’s a right outer or full outer join (and therefore wanting to return all the rows in the lookup table, even those not mentioned in the left-hand set). If it’s not used, it’s redundant, and won’t be part of the query plan.

    Annoyingly, the simplification phase, when redundant joins are removed, is done AFTER permissions are checked. This is easy to demonstrate. Let’s consider a user which has VIEW DEFINITION rights on a table, but not SELECT rights. This user can run sp_help, and see all the metadata associated with the table. This user can query sys.columns and see the rows there, one for each column in the table. But to run the query SELECT TOP (0) * FROM dbo.someTable; , which is purely metadata, permission is denied.

The reason I know it’s only metadata is because running as a more-privileged user, the query plan shows me this (as shown here, using AdventureWorks2012.Production.Product instead of dbo.someTable).


    This query does not select data from the table. If it did, we’d see a Seek or a Scan here. This query never needs to access the table. It is explicitly told to fetch no rows from it. The only thing we use here is metadata, and we do have permission to get that.

    And yet the less-privileged user can’t run this query. Metadata isn’t a problem, but the permissions are tested first, and the query is rejected.


    Permissions are checked once the query has been parsed. If an object is used in the query, then SELECT permission is required. If an object is updated, then UPDATE permission is needed, even if it’s logically impossible to update any actual rows (try WHERE 1=2 if you need to check).

    Now once a plan is in cache, VIEW SERVER STATE is needed to be able to view it. And if you have VIEW SERVER STATE, then you can view the plans that are in cache, even if you don’t have permissions to run the queries.

    ...which brings me to SHOWPLAN.

    SHOWPLAN is different to VIEW SERVER STATE – it doesn’t apply to the plan cache. The plan cache is an instance-level thing, and a database permission like SHOWPLAN isn’t going to cut it.

    To view the plan of a query that’s not in cache, you need SHOWPLAN permission. And you need to be able to run the query – even if the query isn’t actually going to touch the tables. I wouldn’t mind being able to look at plans to offer tuning advice without having to have permission to run the query, but this is just one of those things.

    Sadly, it extends to databases. If a database is referenced by a query, even if it’s not used, then you need to have SHOWPLAN permission on that database (or ALTER TRACE at the instance level, as I mentioned earlier).

So if a view references another database for a lookup, and your query uses that view, you won’t be able to see the query plan of any query that uses it. You can have SHOWPLAN permission in the database where your data is, and, with another user, you could verify that your plan doesn’t even touch the other database. But if the query mentions it at all, you need SHOWPLAN on that database.

    The script below will let you reproduce this if you want.



    create login test with password = 'test', check_policy = off; -- check_policy off in case your password policy rejects 'test'
    create database test;
    go
    use test;
    go
    create user test for login test;
    alter role db_owner add member test;

    create table dbo.test (test int);

    grant showplan to test;
    go
    use AdventureWorks2012;
    go
    create user test for login test;
    grant select on Production.Product to test;
    deny showplan to test;
    go
    use test;
    go

    -- The less-privileged user can run the query...
    execute as login = 'test';
    go
    select t.*
    from dbo.test t
    left join (select top (1) 1 from AdventureWorks2012.Production.Product) t2(c) on t.test = t2.c;
    go

    -- ...but can't get the plan, because SHOWPLAN is denied in AdventureWorks2012
    -- (note that SET SHOWPLAN_XML must be alone in its batch):
    set showplan_xml on;
    go
    select t.*
    from dbo.test t
    left join (select top (1) 1 from AdventureWorks2012.Production.Product) t2(c) on t.test = t2.c;
    go
    set showplan_xml off;
    go
    revert;
    go

    -- The original user can see the plan, and verify it never touches AdventureWorks2012:
    set showplan_xml on;
    go
    select t.*
    from dbo.test t
    left join (select top (1) 1 from AdventureWorks2012.Production.Product) t2(c) on t.test = t2.c;
    go
    set showplan_xml off;
    go

    -- Clean up:
    --use AdventureWorks2012;
    --drop user test;
    --use master;
    --drop database test;
    --drop login test;

  • T-SQL v MDX: Pick the hierarchy and populate the report parameter values

    When dealing with report parameters, you want the experience to be as seamless as possible for the end user. In SQL Server Reporting Services, we have the tremendously useful feature of having dynamic parameters, in which you can set a particular report parameter based on the choice of a previous one. For example, if you have parameters such as Country and City, you can set it up so that you pick the Country as Australia, and get the City choices filtered down to Adelaide, Melbourne, and Canberra, but if you choose the UK as your Country, your City choices become London, Brighton, and Dover.

    In T-SQL, we do this by making the City parameter use (for its Available Values set) a dataset which is dependent on the Country parameter. Something like “select City from Cities where Country = @country” – it’s quite straightforward.

    But when you want to be able to pick something entirely different, the story becomes more complex.

    Suppose you want to be able to filter a report either by Size or by Type. Typically you’d do this with two parameters, with an Any or All option near the top of each list. This makes a lot of sense, and even caters for the scenario of filtering the report by both Size and Type. But sometimes this doesn’t quite suit the situation.

    Imagine you’re filtering a report by date. You can either filter by the Financial Year, or by the Calendar Year. Or in fact, by the Week, Month, or Quarter. One option is to just pick the Start Date and End Date, and let the user work out when the Financial Year should be – another option is to let the user choose the type of period, and then populate the second parameter with a bunch of choices accordingly.

    In T-SQL, we could have a query which leverages a date dimension table (dbo.DimDate here) to produce a bunch of reporting periods based on the type. Something like this:

    DECLARE @periodtype VARCHAR(50) = 'calweek';

    SELECT EnglishMonthName + ' ' + CAST(CalendarYear AS CHAR(4)) AS PeriodName
         , MIN(DateKey) AS PeriodStart, MAX(DateKey) AS PeriodEnd
    FROM dbo.DimDate
    GROUP BY CalendarYear, MonthNumberOfYear, EnglishMonthName
    HAVING @periodtype = 'month'
    UNION ALL
    SELECT CAST(CalendarYear AS CHAR(4)) AS PeriodName
         , MIN(DateKey) AS PeriodStart, MAX(DateKey) AS PeriodEnd
    FROM dbo.DimDate
    GROUP BY CalendarYear
    HAVING @periodtype = 'calyear'
    UNION ALL
    SELECT CAST(FiscalYear AS CHAR(4)) AS PeriodName
         , MIN(DateKey) AS PeriodStart, MAX(DateKey) AS PeriodEnd
    FROM dbo.DimDate
    GROUP BY FiscalYear
    HAVING @periodtype = 'fiscalyear'
    UNION ALL
    SELECT 'Week ' + CAST(WeekNumberOfYear AS VARCHAR(2)) + ', ' + CAST(CalendarYear AS CHAR(4)) AS PeriodName
         , MIN(DateKey) AS PeriodStart, MAX(DateKey) AS PeriodEnd
    FROM dbo.DimDate
    GROUP BY CalendarYear, WeekNumberOfYear
    HAVING @periodtype = 'calweek'
    ORDER BY PeriodStart;

    In this scenario, we basically choose which of the queries we’re going to run using a variable, and a query parameter in SSRS. It works nicely, and thanks to Startup Expression Predicates in Filter operators, only actually runs one of the queries.


    The results for ‘month’ would look something like this:


    You may be thinking about the fact that SSRS will only let us have a single value for a parameter entry – not two. This can be overcome by concatenating the two values, and then splitting them in the query, or by having these values stored in a separate table. To concatenate them, just use:

    , CAST(MIN(DateKey) AS CHAR(8)) + CAST(MAX(DateKey) AS CHAR(8)) AS PeriodValue

    And then in the actual report query, splitting the value back apart – something like this (a sketch, assuming the parameter arrives as @PeriodValue, the date table is aliased d, and DateKey is an 8-digit integer):
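
    WHERE d.DateKey BETWEEN CAST(SUBSTRING(@PeriodValue, 1, 8) AS INT) -- first 8 characters: PeriodStart
                        AND CAST(SUBSTRING(@PeriodValue, 9, 8) AS INT) -- last 8 characters: PeriodEnd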


    But in Analysis Services, things are done a little differently. We have hierarchies instead, and need to refer to them somehow. We need to be able to pull out items such as [Transaction Date].[Fiscal Calendar].[Fiscal Month].&[201501], or [Transaction Date].[By Week].[Week].&[20141122].

    We’re not supposed to create MDX tuples using T-SQL. You could, but it’s not the right approach. It’s almost never the right approach.

    I’m going to start by letting the user select a @ReportPeriodType, which is going to be one of the following values:

    [Transaction Date].[Fiscal Calendar].[Fiscal Month]
    [Transaction Date].[Fiscal Calendar].[Fiscal Quarter]
    [Transaction Date].[Fiscal Calendar].[Fiscal Year]
    [Transaction Date].[Calendar].[Year]
    [Transaction Date].[By Week].[Week]

    And then use a query like this:

    WITH
    MEMBER Measures.ParamName AS STRTOVALUE(@ReportPeriodType + ".Hierarchy.CurrentMember.Name")
    MEMBER Measures.ParamUniq AS STRTOVALUE(@ReportPeriodType + ".Hierarchy.CurrentMember.Unique_Name")
    SELECT
    { Measures.ParamName, Measures.ParamUniq } ON COLUMNS,
    STRTOSET(@ReportPeriodType + ".Members") ON ROWS
    FROM [MyCube]

    And if I want to filter this to only things from the last 2 years, I could use:

    WHERE (STRTOMEMBER("[Transaction Date].[Date].&[" + @today + "]").LAG(730) : STRTOMEMBER("[Transaction Date].[Date].&[" + @today + "]"))

    Let’s look at what I’m doing here. I’m passing in a parameter which is a string – the level of a hierarchy – and I’m going to use the members of that level on the Rows axis. To construct that, STRTOSET(@ReportPeriodType + ".Members") does the trick nicely. I could leave it at that, and use the MEMBER_CAPTION and MEMBER_UNIQUE_NAME properties of each dimension member, but the field name would change according to the level, and I don’t want to have to work out the impact of that in SSRS. It’s much easier to have a couple of calculated members to grab the Level’s Hierarchy’s CurrentMember’s Name and Unique_Name, which I throw onto the Columns axis, and I’m sorted. My WHERE clause – there to filter the list of parameters down to a manageable size – works just fine, so long as I’m not grabbing data out of the [Date] hierarchy. If I had a hidden hierarchy, I could easily pass that in, confident that my report writers aren’t going to try to ask for hierarchy members from it.

    Notice that the main part of my query has no mention of the dimension name, or hierarchy name, or anything. This can be used for the members of ANY hierarchy, in ANY dimension. I simply need to pass in the Dimension.Hierarchy.Level string that I want. Here are some of the results if I pass in [Transaction Date].[Fiscal Calendar].[Fiscal Month].


    The dynamic nature of cubes is remarkable, and can be leveraged to make reusable data sets very easily.


  • Tuning Parallel Data Warehouse Queries

    Performance tuning in regular SQL Server can be addressed in a number of ways. This can involve looking at what’s going on with the disk configuration, the memory configuration, the wait stats, the parallelism settings, indexing, and so much more. But if you have a Parallel Data Warehouse (PDW) environment, then there are a lot of things that are taken out of your hands.

    When you buy the Analytics Platform System (APS) appliance, which hosts PDW, you’re buying a complete system. It’s an appliance, in the same way that you buy appliances for the kitchen. When you go camping and you want to cook some food, you have to build a fire, and find some way of suspending the food over the fire. Building a fire isn’t exactly easy in itself – you need to consider what you’re going to do with kindling, you need larger pieces of wood for building heat, you need a spark – and then working out how to suspend the food at an appropriate height is another matter again! There are plenty of opportunities to tune this environment, and I’m sure many people have this down to a fine art. For most of us, though, we would rather just have an appliance (an oven) and just put the food in and turn it on.

    The PDW environment is like the oven rather than the campfire. Disk comes preconfigured. Network comes preconfigured. Placement of files and filegroups is preconfigured, as is the configuration of tables and databases across these filegroups. High availability decisions are made for us, and we are left to design the database.

    But the design of the tables within the database can dramatically affect the performance of the system – in ways that regular SQL Server environments don’t even start to have to consider. And this is because of the concept of RAM across multiple servers.

    To be able to see a row of data, ready to be able to do something with it – whether that be to return it to the user who ran the query, to manipulate it in some way, or to join it to another row from another table – it must exist in RAM. DBAs understand this, and make sure that their SQL environments have as much RAM as possible, to avoid having to pull that row into RAM from disk mid-query. But when two rows that need to be joined exist on different servers, the memory spaces in RAM that can hold those rows exist on different servers too. In PDW, we have just that situation – lots of servers, with the data spread out across them. It looks like a single server, but is actually made up of lots of servers in a scaled-out architecture.

    Therefore, to join data in PDW, we have to be confident that any two rows that need to be joined exist on the same server. This can be done in one of three ways:

    1. At least one of the rows can be stored on every server.

    This is done by configuring one of the tables as a Replicated table rather than a Distributed table – a very common approach for dimension tables that don’t change particularly often. If the entire list of Stores is stored (ha!) on every server, then our Sales data can be easily joined to it without fear that the Store in question for a particular Sales row might not be on the same server as said Sales row.

    2. The two tables can be distributed using the same key, with that key being part of the join conditions.

    Another common method. Suppose we have a large number of Products, which are frequently being changed and versioned, and we have a large amount of Sales. Both are good candidates for being Distributed rather than Replicated – we probably wouldn’t want a full copy of our large and quickly-growing Products table on every server. But if both tables are distributed on ProductKey, and this is part of the join conditions (p.ProductKey = s.ProductKey), then the system will know that any two rows that are a successful match according to the join conditions must also be on the same server (there’s a sketch of both options after this list).

    3. Data can be moved mid-query.

    In regular SQL, if a row isn’t in RAM already, we need to pull a copy of it into RAM off the disk. Similarly, in PDW, if the system can’t be sure that the row exists on the right server, it can pull a copy of it onto the right server. This is known as Data Movement, and just as regular DBAs try to minimise the amount of disk activity for performance tuning, PDW DBAs try to minimise Data Movement.

    Data Movement comes in a variety of flavours, but I want to mention two of them here. One is broadcasting, which is turning the smaller of the two tables (or at least, the data which is identified as being of interest once other row and column filters have been applied) into a temporary replicated table, putting a copy of that data on every server. This turns our scenario into situation 1, and the query can continue. Another is shuffling, which is distributing some of the data on a different key, which is part of the join conditions – putting us into situation 2.
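
    To give a feel for how options 1 and 2 are declared, here’s a minimal DDL sketch – the table and column names are hypothetical, and the WITH (DISTRIBUTION = ...) clause is the PDW-specific part:

    -- Option 1: a dimension small enough to copy onto every server.
    CREATE TABLE dbo.DimStore
    ( StoreKey INT NOT NULL
    , StoreName NVARCHAR(100) NOT NULL
    )
    WITH (DISTRIBUTION = REPLICATE);

    -- Option 2: two large tables distributed on the join key, so rows
    -- that match on ProductKey are guaranteed to be on the same server.
    CREATE TABLE dbo.DimProduct
    ( ProductKey INT NOT NULL
    , ProductName NVARCHAR(100) NOT NULL
    )
    WITH (DISTRIBUTION = HASH(ProductKey));

    CREATE TABLE dbo.FactSales
    ( SaleKey BIGINT NOT NULL
    , ProductKey INT NOT NULL
    , StoreKey INT NOT NULL
    , Amount MONEY NOT NULL
    )
    WITH (DISTRIBUTION = HASH(ProductKey));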

    To satisfy a single query, there might be a lot of data movement required.

    If you consider a typical data warehouse query, there are a lot of joins, across a lot of tables, and the join conditions are different for every join. Unless you start to think about PDW queries in a very different way from how you think about non-PDW queries, you can very easily suffer from poor performance simply because of unnecessary data movement.

    Thinking outside the box can provide some useful mechanisms for tuning PDW, which can see the performance improve massively.

    Denormalising is a method that many data warehouse designers are very familiar with, but when you start to consider the pain of joins in a PDW environment, it becomes a lot more significant. It’s often not appropriate to denormalise too much, but if you have been keen on normalising heavily in the past, this is a method worth reconsidering.

    Redundant predicates are something which many developers don’t seem to appreciate. I’ve heard developers say that any piece of redundant code should be removed, because it creates extra work. But in the PDW world, these predicates can tell the PDW Query Optimizer that data movement isn’t actually necessary, and could save you a large amount of effort.
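
    For example, here’s a sketch (the tables are hypothetical; assume both are distributed on ProductKey, and that a product version always belongs to exactly one product):

    -- The natural join key is ProductVersionKey, which on its own would
    -- force data movement. Because a version belongs to exactly one product,
    -- the extra ProductKey predicate is logically redundant - but it tells
    -- PDW that any matching rows already live on the same server.
    SELECT s.SaleKey, pv.VersionName
    FROM dbo.FactSales s
    JOIN dbo.DimProductVersion pv
      ON pv.ProductVersionKey = s.ProductVersionKey
     AND pv.ProductKey = s.ProductKey; -- redundant, but avoids a shuffle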

    Views are often maligned in the regular SQL world, as people write views that query views that query views. I’m not suggesting that you fall into that trap, but you should certainly consider views to encapsulate join logic, encouraging the use of those redundant predicates that I just mentioned.

    Table splitting seems counter-intuitive as a way of avoiding joins, but splitting horizontally (having some rows in one table and others in another) gives extra options around how to distribute the data. Some data could be distributed according to one column, and other data could be distributed on another. This has the added benefit of avoiding skew, and works well with views when accessing these tables, since PDW does not support check constraints. Splitting tables vertically is rarer, but can allow dimensions to get wider, allowing additional denormalisation before hitting the maximum row width of 32kB (in-row data is limited to 8kB as usual, but data movement buffers are only 32kB, which makes that a much harder limit).

    These methods don’t have the same kind of impact in your ordinary SQL Server environments, but can have a massive effect in PDW. You can see examples of how these things can come into play by looking at a previous post of mine, where you should be able to spot my use of all of these methods.

    Oh, and this piece about performance tuning fits into this month’s T-SQL Tuesday topic, hosted by Russ Thomas (@sqljudo) on the topic of Performance.



  • Why isn’t my filtered index being used?

    Quite often, people have filtered indexes but find they’re not being used as often as they’d like. I was reminded of this recently when I read a good post by Kendra Little about filtered indexes. In it, Kendra talks about how the WHERE clause of a filtered index allows an IN clause, but not an OR clause (to quote someone from Microsoft: “We currently do not support the "OR" clause due to the optimizer matching difficulties for such clauses”). Going back a few years, Tim Chapman wrote a post about the pains of filtered indexes.

    Anyway, both of these posts hint that filtered indexes aren’t always used. Tim addresses it directly, and Kendra mentions needing a hint to make sure the index is used. I thought I’d explore the question a little more. I’m confident that both Tim and Kendra know this information – they are two of the top index experts in the world. This post isn’t for them, but for those people who are trying to find out why their filtered indexes aren’t being used.

    To be used, the filtered index must be able to satisfy the query.

    This should be fairly obvious, but it goes a little deeper than you might think at first glance. Let’s explore Tim’s examples to show what I mean.

    Tim creates a filtered index:

    CREATE INDEX FIDX_SalesOrderDetail_ProductID
    ON Sales.SalesOrderDetail (ProductID)
    WHERE ProductID = 870;

    ...and then shows that it’s used successfully for the query:

    SELECT ProductID
    FROM Sales.SalesOrderDetail
    WHERE ProductID = 870;

    (The image here is from Tim’s blog post, and belongs to Microsoft)

    No surprise here – if the system knows that ProductID is 870, then it can use an index which only includes rows that satisfy that.

    Tim then tries to use a variable instead of 870 – although he still passes in the value of 870.

    DECLARE @ProductID INT;
    SET @ProductID = 870;

    SELECT ProductID
    FROM Sales.SalesOrderDetail
    WHERE ProductID = @ProductID;

    (Image from Tim’s blog again)

    No luck – the system doesn’t use the filtered index. He can’t even use a hint to force it to use it – SQL replies saying that it can’t create a plan for it using that index hint.

    So what’s going on?

    The problem is not with the Query Optimizer seeing that the value is going to be 870 – the problem is with the plan cache. You see, when SQL runs a query, it assumes the query won’t be run in isolation, and it puts the plan into the cache. But the version that goes into the cache is a general one, which doesn’t consider the values that are passed in. Because it needs a plan that will work regardless of what the parameter is set to, using the filtered index here would be inappropriate.

    Tim shows one way around this, and ‘hints’ at another in an edit – although sadly, things added in blog post edits are easy for readers to miss.

    The way that Tim gets around this is to use Dynamic SQL, but I’m not a fan.

    DECLARE @ProductID INT;
    DECLARE @SQL NVARCHAR(200);
    SET @ProductID = 870;
    SET @SQL = N'SELECT ProductID
    FROM Sales.SalesOrderDetail
    WHERE ProductID = ' + CAST(@ProductID AS VARCHAR(10));
    EXECUTE sp_executesql @SQL;

    Yeah – I’m not a fan of this. Go read my post on SQL Injection for why.

    Tim does mention a better method in the ‘edit’ bit in his post, and I want to look at that. He says: “In many cases, a way to get around the local variable problem is to recompile the statement.” – but I want to make it very clear that the point is not actually to recompile the statement, but to use a statement that isn’t going to be put into the cache (which you do by using the OPTION (RECOMPILE) hint, which makes it sound like you’re recompiling the statement).

    When you use OPTION (RECOMPILE), the main impact is not that it recompiles, but that the plan doesn’t get cached (and it won’t find the query in the cache beforehand, either). Because it won’t be cached, the Query Optimizer doesn’t need to consider a generalised version – it has the confidence of knowing the plan won’t be reused with different values, so it can use the filtered index!


    So the better option than using Dynamic SQL is to use OPTION (RECOMPILE).
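
    Applied to Tim’s example, that looks like this:

    DECLARE @ProductID INT;
    SET @ProductID = 870;

    SELECT ProductID
    FROM Sales.SalesOrderDetail
    WHERE ProductID = @ProductID
    OPTION (RECOMPILE); -- not cached, so the plan can assume @ProductID = 870 and use the filtered index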

    So that’s one reason why your filtered index might not be used – but there’s another too:

    Using the filtered index might be too much work.

    Let’s consider Kendra’s example. She had a query that used IN. Her example was:

    CREATE INDEX IX_Votes_filter ON dbo.Votes (PostId) WHERE (VoteTypeId IN (1,2));

    But as I don’t have a Votes table, I’m going to use an equivalent with AdventureWorks:

    CREATE INDEX FIDX_Product2Color ON Production.Product (ProductID) WHERE (Color IN ('Red', 'Black'));

    Kendra uses SELECT COUNT(*) queries to test it. As in Kendra’s example, this works well when IN is used, and when OR is used – but when just one of the options is used, it needs a hint before it will use the index, and then it needs a Lookup to satisfy it.
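
    The tests look something like this (my AdventureWorks versions of the kind of queries Kendra used):

    SELECT COUNT(*) FROM Production.Product WHERE Color IN ('Red', 'Black'); -- uses the filtered index
    SELECT COUNT(*) FROM Production.Product WHERE Color = 'Red' OR Color = 'Black'; -- so does this
    SELECT COUNT(*) FROM Production.Product WHERE Color = 'Red'; -- ignores it...
    SELECT COUNT(*) FROM Production.Product WITH (INDEX (FIDX_Product2Color))
    WHERE Color = 'Red'; -- ...and when hinted, needs a Lookup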


    What Kendra doesn’t point out is why the filtered index needed the hint in the single-option examples (which is fair enough – her post was about OR v IN, not about why the filtered index wasn’t being used).

    The reason is the Lookup that’s needed. It’s so expensive that it’s cheaper for the Query Optimizer to do a clustered index scan instead. It’s standard ‘tipping point’ stuff, but we don’t normally see this with COUNT(*) queries, because COUNT(*) is just counting rows, not returning extra columns.

    ...except that there is a column that isn’t included in the filtered index, which our query needs – Color.

    Subtly, even though our filtered index only contains rows that have the Color either Red or Black (or in Kendra’s example, VoteTypeID either 1 or 2), it doesn’t store which rows are Red and which rows are Black. We know that every row in the index is either Red or Black, so we can use this index as a starting point, but we would need to do a Lookup to get the actual Color value.

    To fix this, we should INCLUDE the Color column in the filtered index.

    CREATE INDEX FIDX_Product2Color2 ON Production.Product (ProductID) INCLUDE (Color) WHERE (Color IN ('Red', 'Black'));

    Now we don’t need to hint at the index to use, and we see a residual predicate being used to make sure that we only pull Red rows out of the index.


    So we see two reasons here for filtered indexes not being used, and two ways to help encourage them to be used more often. The first is to consider using OPTION (RECOMPILE) to help persuade the Query Optimizer not to consider generalising the query across different parameter values, and the second is to INCLUDE the columns that are used in the filter, in case the database engine needs those values later in the query.

    Do try to use filtered indexes, and be understanding about those times when they’re not used. There’s quite probably a simple explanation.

