SQLblog.com – The SQL Server blog spot on the web

Rob Farley

- Owner/Principal with LobsterPot Solutions (a MS Gold Partner consulting firm), Microsoft Certified Master, Microsoft MVP (SQL Server), APS/PDW trainer and leader of the SQL User Group in Adelaide, Australia. Rob is a former director of PASS, and provides consulting and training courses around the world in SQL Server and BI topics.

  • Can you still use DBCC PAGE to access data in dropped tables if TDE is turned on?

    Yes.

    #ShortestPostEver

    @rob_farley 

     

    Oh, ok... I’ll explain some more.

    Transparent Data Encryption encrypts data at rest. That’s the stuff that’s on disk – the encryption happens when the data is written to disk, and the decryption happens as the data is loaded into RAM from the disk. The engine handles this so that it’s invisible to the user, applications, and so on. Without it, you can open an MDF/NDF file in a hex editor and read the contents. With it, you can’t.

    Here’s an example with a database that’s not encrypted:

    [image: hex editor view of a database that’s not encrypted – the text is readable]

    And here’s an example that is:

    [image: hex editor view of an encrypted database – the same search finds nothing]

    I searched for some of the text that I could see – successfully in the first, unsuccessfully in the second.

    I also used SQL Server to show me the contents of a page using DBCC PAGE, and could do this successfully (once I’d closed the files in the hex editor and brought the databases back online).
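
    For reference, the pattern is the usual (undocumented) one – trace flag 3604 sends the DBCC output to the client rather than the error log. The database name and page number below are just placeholders:

    DBCC TRACEON(3604);                -- send DBCC output to the client
    DBCC PAGE('TDE_demo', 1, 264, 3);  -- database, file id, page id, print option (3 = include row data)
    DBCC TRACEOFF(3604);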

    [image]

    ...which also worked in both databases.

    [image]

    I had hoped this would work okay, because I figured that DBCC PAGE would have to pull the data into RAM again (remember this system was offline – the pages weren’t in RAM before), and that it would decrypt the data as it did so. But I wondered whether DBCC PAGE might be slightly lower-level, and bypass the decryption somehow. I argued with myself that if TDE was indeed Transparent, it shouldn’t care... and what if my application relied on using DBCC PAGE? It’s a known feature, even if it is officially undocumented (which is where my doubts set in).

    But as you see, it worked okay.

    But what if I dropped the table first? Would SQL then go “Hang on – this page isn’t one that I have control over any more...” and refuse to decrypt it?

    No – it works just the same.

    Even if you drop an object, you can still access the pages that it used until they get overwritten. You won’t be able to read them with a hex editor, but DBCC PAGE will still read them in decrypted form, letting you pull that data out.

    And yes, you can even use DBCC WRITEPAGE to overwrite the bytes in their unencrypted form, so that you can use that (dangerous) method of fixing corruption, even in an encrypted database. I’ve just redone my fix for Steve Stedman’s 10th Corruption Challenge, and it worked just fine on an encrypted version of the database.

    It’s still T-SQL Tuesday on the topic of encryption, so I’m throwing this one into the mix for that as well.

    [image: T-SQL Tuesday logo]

  • Always Encrypted with SQL 2016

    Transparent Data Encryption has been around for some time now, making sure that data in SQL Server as stored on disk is encrypted. When it was announced, this was incredibly exciting.

    You see, by default, SQL Server data is not encrypted. If you open up the pages within a data file, you can read the data that’s in there. Numbers are stored as hex, varchar strings are stored as readable values – it can be quite surprising to realise this. The first time you ever salvage an MDF file from a broken server, open up that file, and just start reading the data, you realise that the data in most databases is only as secure as the physical security of the disks – you can read data from that table whether or not your Windows login has access to it.

    With Transparent Data Encryption, that MDF file is encrypted. It’s only decrypted when it’s pulled into RAM, where access to it is controlled by database permissions. If your access to a particular table has been denied, you’re not getting to read that data out of RAM. When the power goes out, the decrypted data in RAM disappears – that’s what RAM’s like. It’s a good thing. But TDE is across the whole database, whether you like it or not. And everyone who has access to the table can read it unencrypted, as if it’s never been encrypted.

    It’s Transparent. It doesn’t feel like it’s encrypted, and as far as any application can see, it’s not. It only protects against the disk (or backup) being compromised.

    And then there’s the distrust of DBAs.

    As someone who deals with sensitive data regularly, I can assure you that this is a very real concern. I’ve had police checks, to help make sure that I’m trustworthy – and I make a point of never looking myself up in customer databases (I know I must exist in some of my customers’ databases – I receive bills from some of them even). I also have confidence that my employees at LobsterPot Solutions are all completely trustworthy, and that I’d throw the book at them if they ever broke that trust.

    I’ve certainly been asked “How do I stop the DBA from being able to read the data?”

    And that’s where the problem is. The DBA is the guardian of your data. They’re the person who makes sure that the database doesn’t have corruption, who can defrag indexes, tune queries, and so on. The DBA should be trustworthy with the data. But what if that data contains the lottery numbers? What if that data involves national secrets? At what point is the sensitivity of the data way too great to be able to let Archbishop Desmond Tutu look after it, let alone Barry the DBA?

    Now, I understand that a DBA might feel insulted that they’re not trusted to read the data. They need to look after it, but they can’t look at it. My take is that if I can be removed from suspicion if there’s a leak, then great. I know I’m okay. The client should know I’m okay. But would a criminal investigation be able to remove me from suspicion? Not if I’m one of the only people that could read the data and know how to cover my tracks.

    Transparent Data Encryption doesn’t stop me from being able to read the data. Not at all. If I have access to query the table, it looks unencrypted to me.

    Always Encrypted, in SQL Server 2016, does stop me though, if I don’t have the ability to access the certificate which protects the column master key. Always Encrypted stores encrypted data in the database file. It stays encrypted in RAM and while it’s being sent across the network, and is only decrypted by the application that has access to the certificate. Now, I know that as an administrator, I might be able to jump through a few hoops to be able to get access to it, but this could be handled by other people, who could deny my access to it.

    This decryption is still transparent to the applications that use it – so the applications don’t need to change, apart from telling the connection string to look out for it. The .Net 4.6 environment will handle the decryption without developers needing to code it specifically. Access to those applications can be controlled in other ways.
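
    As a rough sketch of what the table side looks like (the table, column and key names here are made up, and the column encryption key would typically already have been created – via SSMS or PowerShell – with its certificate kept on the application side):

    -- Sketch only: assumes a column encryption key called MyCEK already exists,
    -- protected by a column master key whose certificate lives with the application.
    CREATE TABLE dbo.Customers
    (
        CustomerID int IDENTITY(1,1) PRIMARY KEY,
        TaxFileNumber char(9) COLLATE Latin1_General_BIN2   -- deterministic encryption needs a BIN2 collation on string columns
            ENCRYPTED WITH
            (
                COLUMN_ENCRYPTION_KEY = MyCEK,
                ENCRYPTION_TYPE = DETERMINISTIC,
                ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
            ) NOT NULL
    );

    The application side then just adds Column Encryption Setting=Enabled to its connection string, and the .NET 4.6 provider takes care of the encrypting and decrypting.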

    If you’re needing to search by values that are encrypted, it might not be ideal for you. But for that data that you need to make sure even you can’t read, this could be a really nice option. Roll on 2016!

    @rob_farley 

    PS: Another option is to build encryption functions into client applications, so that the applications encrypt everything, and handle everything well away from the database. We’ve been able to do this since applications first began, but when there are multiple applications, the coordination of keys can become problematic. What you get with SQL 2016 is centralised control over the encryption and decryption.

    PPS: This post was brought to you for the 69th T-SQL Tuesday, hosted by Ken Wilson (@_KenWilson)

    [image: T-SQL Tuesday logo]

  • Corruption, corruption, more corruption

    No, not in my systems.

    Corruption does happen from time to time. At LobsterPot Solutions we get calls from people now and then who have corruption in their databases, that want help getting them repaired. Generally, it’s not too much of a problem, although I’ve seen some doozies over time. (And if you need help, get in touch!)

    Interestingly, I don’t think many DBAs practise solving corruption issues very often. They might test restores, and even test getting data back from a salvaged LDF file, but rarely would they test any of the stranger corruption scenarios. I’ve never put it on a high priority – I know lots of ways to get data out of tables, and know how to get data out of pages using DBCC PAGE... but I wouldn’t say that I practise solving corruption very often. There are so many different ways that a database can become corrupted, and 99% of the time, the data can be salvaged using some very standard approaches.

    So it was good to see Steve Stedman (@SQLEmt) run a series of Corruption Challenges – ten corrupt databases, all of which could be repaired without data loss, but never using the standard DBCC CHECKDB repair option. I noticed the first one part way into the allotted time, and thought it would be a useful exercise to see if I could solve them all. Let’s face it – when someone gets in touch because they have a problem, they want to be reassured that they’re dealing with someone who knows what they’re doing.

    All ten challenges have now appeared, and not only have I solved every one of them – a feat only shared by Andre Kamman of the Netherlands and Neil Abrahams of the UK – but I’ve also won three of the events, more than anyone else, allowing me to finish top of the scoreboard!

    Winning the competition was never my intention – I’m just pleased to have been able to know that there were no challenges that beat me, which hopefully means that my customers are still in capable hands.

    My challenge to you is to grab the ten databases that Steve has made available, and without looking at the solutions that are all posted, see how you go. At the very least it will be good practice for when you have to deal with the real thing.

    @rob_farley

  • Check the settings when installing SQL Server

    Next, next, next, next, next... you know the drill.

    Except that when installing SQL, it’s simply not good enough. The defaults might not work for you, and that makes this post qualify for this month’s T-SQL Tuesday, hosted by Andy Yun (@sqlbek).

    [image: T-SQL Tuesday logo]

    Most things are fine, but there is one page which you really shouldn’t ignore. And it’s not even obvious. But if you’re just clicking Next, next, next, then you might easily miss it.

    When you reach this page:

    [image: SQL Server setup – the Server Configuration page]

    please make sure you don’t just hit Next without looking at the Collation tab. It looks like this:

    [image: the Collation tab]

    Now, I don’t particularly care what you decide to put here. But you shouldn’t just set it blindly. If you are installing a new server and you need it to be consistent with what you’ve used before, go and look at the collation setting on your old instance. This is ridiculously important.
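
    A quick way to compare the two (the database name here is just a placeholder):

    SELECT SERVERPROPERTY('Collation')                        AS InstanceCollation,
           DATABASEPROPERTYEX('YourOldDatabase', 'Collation') AS DatabaseCollation;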

    You see, people are lazy. And when I say people, I mean developers. And when I say developers, I mean bad developers. (There was a thing recently that said that “Women’s Soccer” should now be called simply “Soccer”. I agree with this. There shouldn’t have to be a differentiation between a game played by a particular type of people, except perhaps “American Football”, which is obviously called “football” because they use their feet so much. Oh right, about developers. I hope as time passes, developers become better. But for now, when I say “Developers”, I mean bad developers.)

    So when a developer creates a temporary table, they do something like:

    CREATE TABLE #resultset (col1 varchar(20) NOT NULL, col2 int NOT NULL);

    And then they populate it with some data, and then they use it in another query. Like this:

    SELECT t.*
    FROM dbo.SomeTable t
    JOIN #resultset r ON t.code = r.col1;

    This is code that has worked for years. But if you have ignored the collation setting and the instance collation is different to the database collation, you will get an error. The system won’t know whether two strings are supposed to be the same or not. Is ‘abc’ the same as ‘àbç’, if one tells you to ignore accents and the other says not to? The system can’t decide. It’s even more complex than that, because two strings might be identical, but it won’t know how to look them up if the alphabet orders letters differently. It can’t decide and gives an error. It goes in the too-hard basket.

    Of course, a nice responsible developer will have created the temporary table like this, and then the problem never occurs:

    CREATE TABLE #resultset (col1 varchar(20) COLLATE DATABASE_DEFAULT NOT NULL, col2 int NOT NULL);

    But let’s face it – this is rare. Most people write database code without thinking about the collation settings, and that’s a problem. Most of us are bad developers.

    The error you get is the one described here: https://connect.microsoft.com/SQLServer/feedback/details/324910/collation-error-behaviour-option. This is a Connect item which I raised in January 2008 (7.5 years ago), which has 47 upvotes, and which was closed in 2011 as “Won’t Fix”. It was looking likely for a while (at least when using tempdb), but then they must’ve realised it wasn’t particularly important. Ultimately, it could give unexpected results if you’re not confident about which language you’re using, and it’s potentially better to give an explicit error than to let your query work but give the wrong results.

    SELECT *
    FROM dbo.Sports
    WHERE Name = 'football';

    Msg 468, Level 16, State 9, Line 1
    Cannot resolve the collation conflict... ;)

    @rob_farley

  • DBAs and the Internet of Things

    Ever since data moved to the cloud, database administrators have wondered what it will mean for their jobs.

    It doesn’t matter whether you are thinking about Infrastructure as a Service, Platform as a Service, or Database as a Service – there is some aspect of what a database administrator does that may not exist in the future.

    And then there’s the situation around Big Data, and the Internet of Things. We see videos of how organisations are using devices that produce data to help control things as large as the London Underground, and as personal as your fitness routine. It doesn’t matter whether the Thing is recording your heart beat or the vibrations of a tube train – the fact is that data is being produced to help our lives.

    For a database administrator, this presents a new set of concerns. Does the DBA of today need to be able to design and tune a system that leverages Windows Azure SQL Database? What about Parallel Data Warehouse? Hadoop? Apache Storm? StreamInsight? There are so many things that make it seem to the DBA that their skill set must adapt for them to survive.

    But what I want to suggest is that the DBA should not be seeing these technologies as their domain to administer and tune, but their domain to consume and leverage.

    Currently, database administrators in the SQL Server world use Performance Monitor counters to keep an eye on how disk, CPU, I/O, RAM, cache, etc., etc., are going. This data is polled regularly and stored, while SQL Server Agent alerts are configured to holler if there’s something noteworthy, such as a Transaction Log filling up. Extended Events allow all kinds of events to be monitored, with data about them examined, stored if necessary, reacted to, and more.

    Once, we used SQL Server Trace, and a tool called Profiler. Profiler was frowned upon somewhat – its voracity for consuming data meant that using it to run a trace live would put an unnecessary load on the SQL instance being monitored. The better option was to configure a trace, and then use Profiler after the fact to see what was happening. Extended Events (XE) is seen in a different kind of light. An XE session runs, keeping an eye on things in a very decoupled way, and the stream of events that is produced can be hooked into using a number of different methods.

    From the perspective of the DBA, these XE sessions become an “Internet of Things”-style environment. Imagine that each configured event is a smart device, sitting somewhere, producing data in some semi-structured form. Of course, we know the structure, but each event might have a different configuration, with different kinds of properties being available. The DBA consumes them using the same mechanisms as IoT technologies, and suddenly becomes a new kind of DBA-superhero. It’s not unusual today to hear of people looking after large numbers of databases this way.
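
    To make that concrete, here’s a minimal sketch of one such ‘device’ – the session name, file path and threshold are all made up – quietly streaming slow statements out to a file, ready to be consumed like any other feed:

    CREATE EVENT SESSION WhatsHappening ON SERVER
    ADD EVENT sqlserver.sql_statement_completed
        (ACTION (sqlserver.database_name, sqlserver.sql_text)
         WHERE (duration > 1000000))   -- microseconds: only statements over a second
    ADD TARGET package0.event_file
        (SET filename = N'C:\XE\WhatsHappening.xel');
    GO
    ALTER EVENT SESSION WhatsHappening ON SERVER STATE = START;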

    The Internet of Things is here for database administrators. Extended Events are part of it.

    @rob_farley 

    PS: This post about Extended Events is part of the latest T-SQL Tuesday event, hosted by my friend Jes Borland (@grrl_geek).

    [image: T-SQL Tuesday logo]

  • Monitoring skew in PDW

    When you have data stored across several servers, skew becomes very significant.

    In SQL Server Parallel Data Warehouse (PDW), part of the Analytics Platform System (APS), data is stored in one of two ways – distributed or replicated. Replicated data is copied in full across every compute node (those servers which actually store user data), while distributed data is spread as evenly as possible across all the compute nodes, with the decision about where each row of data is stored being dependent on the hash of a column’s value. This column is known as the distribution key.

    For performance reasons, though, data warehouse designers will typically choose a distribution key that avoids data movement, by choosing a column which is typically used in joins or aggregations. This performance is traded off against skew, which itself can have a performance impact. If a disproportionate amount of data is stored on a single compute node (which is really a question of ‘distributions’ which correspond to filegroups – eight per compute node), then storage is effectively used up more quickly (after all, you only need one disk to fill up to have problems), and queries run slower because one distribution has to churn through more data than the others.

    And so skew should be monitored. This fits in nicely with this month’s T-SQL Tuesday, on the topic of monitoring, and hosted by Cathrine Wilhelmsen (@cathrinew).

    [image: T-SQL Tuesday logo]

    It’s not even just monitoring it that is important, but trending it. You see, your data can change over time.

    Imagine you sell products, to customers, in stores, over time. You could distribute your data by store, product, customer, or day (or even by some surrogate key, but then you need to generate that surrogate key). And whatever you choose might start off as just fine – until your business changes a little... such as when a product comes out that outsells the others by a long way, or when you start selling online, to anonymous users, and offer sales on certain days of the year. Skew might not be there at the start, but it might become a factor over time.

    So how can you measure skew? Well, using DBCC PDW_SHOWSPACEUSED.

    You might have created a table called dbo.Sales, but if you have six compute nodes, then your data is actually stored across 48 tables. DBCC PDW_SHOWSPACEUSED("dbo.Sales") will provide the number of rows, and the space used and reserved for each of these 48 tables, letting me see what the skew is like.

    If I have 480 million rows, then ideally I have 10 million rows in each distribution. But if my data shows that 4 of my distributions have 15 million rows, and the other 44 have about 9.5 million, then it might not seem like much skew – most of them are 95% of what I had hoped for. But if the disks storing those 15 million-row tables are starting to fill up, then I’m essentially missing out on the opportunity to store 5.5 million rows on each of the 44 emptier ones – 240 million rows’ worth! To put it another way – if they were all as full as my fullest distribution, I could’ve had 48 * 15 million rows, which is 720 million rows! This is not efficient use of my storage.

    Applying a bit of maths like this can help me work out the wastage – looking to see what the maximum number of rows (or storage allocated) is, and multiplying that by the number of distributions that I have, to see what I could be storing if I didn’t have skew. This wastage could be stored as a percentage easily enough, and this can be tracked over time.

    You could very easily run this command across your distributed tables on a regular basis, storing the results in a table, and produce a report which shows both the overall skew across the appliance and individually across each table. This can then become part of your capacity planning process, as I’m sure you’re considering across all your SQL Server instances.
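
    As a sketch of that wastage calculation – assuming you’ve been capturing the DBCC PDW_SHOWSPACEUSED output into a table (the table and column names here are hypothetical):

    -- Hypothetical capture table: dbo.DistributionRowCounts(TableName, DistributionId, RowsStored, CapturedAt)
    SELECT TableName,
           CapturedAt,
           SUM(RowsStored)                                   AS TotalRows,
           MAX(RowsStored)                                   AS LargestDistribution,
           MAX(RowsStored) * COUNT(*) - SUM(RowsStored)      AS WastedRowCapacity,
           100.0 * (MAX(RowsStored) * COUNT(*) - SUM(RowsStored))
                 / (MAX(RowsStored) * COUNT(*))              AS WastagePercent
    FROM dbo.DistributionRowCounts
    GROUP BY TableName, CapturedAt
    ORDER BY TableName, CapturedAt;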

    @rob_farley

  • Tuning slow spatial queries in SQL Server

    Even with the right indexes in place, spatial queries in SQL Server are often too slow – but they needn’t be.

    Two of the most commonly found patterns of query in the spatial world are when you’re looking for the nearest thing to where you are (which I’ve written about before), and when you’re looking for the points that are within a particular boundary. This latter example is where I want to spend most of my time in this post.

    A spatial index splits the world (for geography, or area for geometry) into a series of grids. The index essentially stores whether each point is inside each grid section or not, or for more complex types, whether it completely covers the grid section, whether it’s not in the grid section at all, or whether it’s partially in it. If we know that a particular grid section is completely covered by a polygon / multipolygon (I’m going to refer to polygons from here on, but it all applies to multipolygons too), then every point that is within that grid section must also be in that polygon. The inverse applies to a grid section which doesn’t overlap with a polygon at all – none of the points in that grid can be in that polygon.

    More work is required if the overlap of the polygon and grid section is only partial.

    Let’s assume we’ve got to the finest level of the grid system, and the polygon that we’re considering is partially overlapping the grid section. Now, we need to do some calculations to see if each point is within the polygon (or at least, the intersection of the grid section and the polygon). These calculations involve examining all the sections of the polygon’s boundary to see if each point is inside or outside.

    Imagine trying to see if a particular point is within the coastline of Australia or not, and just relying on our eyes to judge whether an answer is ‘easy’ or ‘hard’.

    [image: map of Australia, zoomed out]

    If we’re zoomed out this much, we can easily tell that a point near the border of South Australia and Northern Territory (near where it says ‘Australia’ in red) is within the coastline, or that a point half way between Tasmania and Victoria is not. To know whether some arbitrary point near Adelaide is within the coastline or not – I can’t tell right now. I’m going to need to zoom in more.

    [image: zoomed in to South Australia, around Adelaide]

    Now, if my point of interest is near Murray Bridge, or in the sea between Kangaroo Island and the mainland, then I can tell my answer. But for somewhere slightly northwest of Adelaide, we need to zoom again. Notice here that we can see a lot more detail about the coastline than we could see before. Parts of the coastline that seemed relatively simple now show a complexity we hadn’t started to consider.

    If we zoom again, we see that the coastline to the northwest of Adelaide is frustratingly complex.

    [image: the coastline northwest of Adelaide]

    And this doesn’t stop as we keep zooming.

    [images: further zoom levels of the same stretch of coastline]

    The area around the Port River here is full of man-made structures, which can be defined relatively easily at some zoom level. A natural coastline is trickier (go back a few images to see the zoomed out version of this next one).

    [image: a natural stretch of coastline, zoomed in]

    Here we see a complexity that could just keep going. Each of these bays would have its own level of complexity, and as we zoom, we would see that the complexity just keeps going. The crinkliness of coastlines becomes like a fractal that we can keep zooming into, further and further, like in the animated GIF here (which is on a loop, but I think demonstrates the point well enough). It’s supposedly an illusion, but it’s really a basic property of fractals.

    [animated GIF: infinite coastline zoom] (Image from http://www.moillusions.com/infinite-zoom-coast-illusion/)

    Benoit B. Mandelbrot’s work in fractals shows a pattern which is probably a more familiar ‘infinite zoom’ concept.

    [animated GIF: Mandelbrot set zoom] (Image from http://commons.wikimedia.org/wiki/File:Mandelbrot_color_zoom.gif)

    Coastlines don’t actually let us zoom in infinitely, because we can’t go far past the detail of atoms, but we do need to consider how much detail we choose to define in our polygons.

    If I consider a dataset I have of South Australian spatial data, and I query the UnionAggregate of the different postcode areas, I get a geography value of the whole state.

    [image: the whole state of South Australia as a single geography value]

    There are over 26,000 points defining this area (I didn’t count them, I used the STNumPoints method). If I convert it into a string (using .ToString()), I get a result which is over a million characters long. This is going to take a lot of effort to use in real queries.
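
    Something like this (a sketch – the postcode table is the same one I use further down):

    DECLARE @SApolygon geography;

    SELECT @SApolygon = geography::UnionAggregate(geom)
    FROM postcodes.dbo.postcodes;

    SELECT @SApolygon.STNumPoints()   AS NumPoints,      -- over 26,000
           LEN(@SApolygon.ToString()) AS WKTCharacters;  -- over a million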

    But how much detail do I actually need?

    One of the spatial methods is called Reduce(). It simplifies the shapes involved.

    If I use Reduce(50), then it reduces the number of points by about half. But at this zoom level, we might not care at all.

    [image: the polygon after Reduce(50)]

    If I Reduce(250), the number of points comes down to about 4000.

    [image: the polygon after Reduce(250)]

    This isn’t necessarily good though. Reducing even further can start to distort the borders so much that we end up making mistakes. A point that is within the Reduced polygon might not be in our original one, and vice versa.

    At Reduce(10000) I have under 400 points.

    [image: the polygon after Reduce(10000)]

    At Reduce(50000) you can clearly see changes. Kangaroo Island has been simplified too much. The peninsulas have become remarkably triangular, and we would be making mistakes.

    [image: the polygon after Reduce(50000)]

    Mistakes will come as soon as we Reduce the number of points at all. Despite the fact that at the level shown above, we could see no discernible difference between the original set and Reduce(50), if we zoom in we can see that even Reduce(50) is a problem.

    [image: zoomed in – the original outline compared with Reduce(50)]

    You can clearly see that the crinkliness has been reduced, but that this could make the results of any query change. It would perform better, but it would be wrong.

    However – it would only be wrong near the borders. We can use this...

    Let me tell you about three other methods – STBuffer(@d), BufferWithCurves(@d), and BufferWithTolerance(@d, @t, @r). These put a buffer around our polygon. If we buffer enough, and then Reduce, we can make sure that we get no false negatives from our check – anything outside this area is definitely outside the original.

    BufferWithCurves produces a buffer shape using curves rather than points, which can offer some nice optimisations, but buffering coastlines using BufferWithCurves is very memory-intensive, and you may even find you are unable to produce this buffer.

    BufferWithTolerance is like having a Reduce built into the buffer – which sounds ideal, but I found that even with a very high tolerance I was still getting too many points for my liking. With a buffer of 1000, and tolerance of 250, I still had nearly 20,000 points. The things that look like curves in the image below are actually made up of many points.

    So I’m choosing a Reduced STBuffer instead.

    With a large enough value for STBuffer, and a sufficiently small value for Reduce, I can make a simple border, and can use STDifference to confirm that it completely surrounds my original polygon. With not much effort, I can find some values that work well enough. Notice that there are only seven points in the image below, whereas the BufferWithTolerance version has hundreds.

    select @SApolygon union all select @SApolygon.STBuffer(500).Reduce(500)

    [image: the original polygon with its buffered-and-reduced border]

    The query below confirms that no part of the original land is outside my buffered and reduced area, showing zero points in the STDifference, and is what I used to find appropriate values for buffering and reducing.

    select @SApolygon.STDifference(@SApolygon.STBuffer(@reducefactor).Reduce(@reducefactor)).STNumPoints();

    [image: query result – zero points]

    Having defined an area that can have no false negatives, we can flip it around to eliminate false positives. We do this by buffering and reducing the inverse (which we find using ReorientObject()). I actually found that buffering by 1000 was needed to make sure there was no overlap (the second query below confirms it – the first query is just to overlay the two images).

    select @SApolygon union all select @SApolygon.ReorientObject().STBuffer(1000).Reduce(500).ReorientObject();

    select @SApolygon.ReorientObject().STBuffer(1000).Reduce(500).ReorientObject().STDifference(@SApolygon).STNumPoints(); -- must be zero

    [image: the original polygon with the reoriented buffered-and-reduced border]

    With a zero for the second query above, we have defined an area for which there are no false positives.

    Now I know that points within the ‘no false positives’ area are definitely within my original polygon, and that points within the ‘no false negatives’ area are definitely not. The area between the two is of more interest, and needs checking against the complex coastline.

    [image: the band between the two borders]

    Now let’s play with some better queries to show how to do this.

    Let’s create a table of areas. I’m going to use computed columns (which must be persisted to be useful) for the ‘no false negatives’ and ‘no false positives’ areas, although I could’ve done this manually if I wanted to experiment with buffers and tolerances on a per-area basis. Let’s also create a table of points, which we’ll index (although I’ll also let you know how long the queries took without the index).

    create table dbo.areas
    (id int not null identity(1,1) primary key,
    area geography,
    nofalseneg as area.STBuffer(500).Reduce(500) persisted,
    nofalsepos as area.ReorientObject().STBuffer(1000).Reduce(500).ReorientObject() persisted
    );
    go
    create table dbo.points
    (id int not null identity(1,1) primary key,
    point geography
    );
    go
    create spatial index ixSpatialPoints on dbo.points(point);

    And we’ll put some data in. Our SA outline in the areas, and the points from AdventureWorks2012’s list of addresses – 19614 points.

    insert dbo.areas (area)
    select geography::UnionAggregate(geom)
    from postcodes.dbo.postcodes
    go
    insert dbo.points (point)
    select SpatialLocation
    from AdventureWorks2012.person.Address;

    Putting the data in took some time. The 19,614 points took about 5 seconds to get inserted, but the single area took 25 seconds. STBuffer() takes a while, and this is why having these computed columns as persisted can help so much.

    The query that we want to solve is:

    select p.*
    from areas a
    join points p on p.point.STIntersects(a.area) = 1;

    This returns 46 rows, and it takes 4 seconds (or 3.5 minutes without the index).

    ...whereas this query takes less than a second (or 20 seconds without the index):

    select p.*
    from areas a
    join points p on p.point.STIntersects(a.nofalsepos) = 1;

    But it only returns 44 rows, because we’re only including the points that are in the “no false positives” area.

    This query returns 47 rows. We’re counting the points that are in the “no false negatives” area, which is a larger area than we want.

    select p.*
    from areas a
    join points p on p.point.STIntersects(a.nofalseneg) = 1;

    It also runs in under a second (or 20 seconds without the index).

    By now, hopefully you see what I’m getting at...

    What I need is the list of points that are inside ‘nofalsepos’, plus the ones that are inside ‘nofalseneg’ but not inside ‘nofalsepos’ and that have also been checked against the proper area.

    select p.*
    from areas a
    join points p on p.point.STIntersects(a.nofalsepos) = 1
    union all
    select p.*
    from areas a
    join points p on
                    not p.point.STIntersects(a.nofalsepos) = 1
                and p.point.STIntersects(a.nofalseneg) = 1
                and p.point.STIntersects(a.area) = 1;

    This query takes approximately 1 second (or 40 without the index).

    But it’s complex – and that’s unfortunate – although we can hide this complexity behind a view.

    create view dbo.PointAreaOverlap as
    select p.id as pointid, p.point, a.id as areaid, a.area
    from areas a
    join points p on p.point.STIntersects(a.nofalsepos) = 1
    union all
    select p.id as pointid, p.point, a.id as areaid, a.area
    from areas a
    join points p on
                    not p.point.STIntersects(a.nofalsepos) = 1
                and p.point.STIntersects(a.nofalseneg) = 1
                and p.point.STIntersects(a.area) = 1;

    , which we can query using much simpler code such as:

    select pointid, point
    from dbo.PointAreaOverlap
    where areaid = 1;

    Coastlines, or any complex spatial values, can make for poor performance, even with indexes in place. To improve performance, you could consider Reduce() to have it run quicker, although you’ll need to take some additional steps to guarantee correctness.

    And in case you were wondering what the “B.” in “Benoit B. Mandelbrot” stands for... it stands for “Benoit B. Mandelbrot”.

    @rob_farley

  • SHOWPLAN permission denied even if the database isn’t actually used

    To view a query plan, you need SHOWPLAN permission on the database level at least. You have this if you have CONTROL DATABASE, or CONTROL SERVER, or if you have ALTER TRACE at the instance level. I know this last one because it’s mentioned in Books Online on the ‘Database Permissions’ page, not because it’s particularly intuitive.

    As a consultant, I sometimes deal with customers who are reluctant to grant me the kind of access level that I would like to have to work on queries. SHOWPLAN is something that I will almost always request, though, and generally it’s considered harmless enough. I want to be able to show plans, so SHOWPLAN is part of what I like to have when writing any kind of query. Actually, I often find myself requesting ALTER TRACE, because it covers SHOWPLAN across all databases. Without it, you can find yourself in a situation where you sometimes get this error

    [image: SHOWPLAN permission denied error]

    , because a view, function, or stored procedure accesses a database that you haven’t been granted access to. Maybe it contains sensitive information – maybe you don’t have security clearance, for example, but there is a table in that database that is referenced for part of a process you need to look into. I’m not going to get into the why, or the reasons why you could request better access, or anything like that – that’s not the point of this post. The point of this post is to talk about something which I learned about SHOWPLAN across databases that aren’t actually used in the query. And it’s part of this month’s T-SQL Tuesday, hosted by Mike Donnelly (@SQLMD).

    [image: T-SQL Tuesday logo]

    I was thinking about this situation though – having cross-database queries and not having SHOWPLAN on all of the referenced databases – and about the fact that views often contain more information than you necessarily need. This got me back to my Redundant Joins material (which I should reblog about, as I haven’t written about it properly on sqlblog.com), and to the fact that the Query Optimizer can simplify out joins which aren’t actually used at all.

    A question occurred to me which I didn’t know the answer to, so I did a bit of research and found the answer – making it something I wanted to write about for this T-SQL Tuesday about new things learned.

    Imagine a view, a table-valued function, a sub-query, just some table expression, which references (joins to) a lookup table but doesn’t need to. If we’re not interested in the data in the lookup table, this join is only needed if it’s matching multiple rows, or being used as a filter (which can’t happen if it’s a left outer join), or if it’s a right outer or full outer join (and therefore wanting to return all the rows in the lookup table, even those not mentioned in the left-hand set). If it’s not used, it’s redundant, and won’t be part of the query plan.

    Annoyingly, the simplification phase, when redundant joins are removed, is done AFTER permissions are checked. This is easy to demonstrate. Let’s consider a user that has VIEW DEFINITION rights on a table, but not SELECT rights. This user can run sp_help, and see all the metadata associated with the table. This user can query sys.columns and see the rows there, one for each column in the table. But to run the query SELECT TOP (0) * FROM dbo.someTable; – which is purely metadata – permission is denied.

    The reason I know it’s only metadata is that, running as a more-privileged user, the query plan shows me this (as shown here, using AdventureWorks2012.Production.Product instead of dbo.someTable).

    [image: query plan showing no Seek or Scan on the table]

    This query does not select data from the table. If it did, we’d see a Seek or a Scan here. This query never needs to access the table. It is explicitly told to fetch no rows from it. The only thing we use here is metadata, and we do have permission to get that.

    And yet the less-privileged user can’t run this query. Metadata isn’t a problem, but the permissions are tested first, and the query is rejected.

    [image: SELECT permission denied error for the less-privileged user]

    Permissions are checked once the query has been parsed. If an object is used in the query, then SELECT permission is required. If an object is updated, then UPDATE permission is needed, even if it’s logically impossible to update any actual rows (try WHERE 1=2 if you need to check).

    Now once a plan is in cache, VIEW SERVER STATE is needed to be able to view it. And if you have VIEW SERVER STATE, then you can view the plans that are in cache, even if you don’t have permissions to run the queries.

    ...which brings me to SHOWPLAN.

    SHOWPLAN is different to VIEW SERVER STATE – it doesn’t apply to the plan cache. The plan cache is an instance-level thing, and a database permission like SHOWPLAN isn’t going to cut it.

    To view the plan of a query that’s not in cache, you need SHOWPLAN permission. And you need to be able to run the query – even if the query isn’t actually going to touch the tables. I wouldn’t mind being able to look at plans to offer tuning advice without having to have permission to run the query, but this is just one of those things.

    Sadly, it extends to databases. If a database is referenced by a query, even if it’s not used, then you need to have SHOWPLAN permission on that database (or ALTER TRACE at the instance level, as I mentioned earlier).

    So if a view references a database for a lookup, and your query uses that view, you won’t be able to see the query plan of any query that uses it. You can have SHOWPLAN permission in the database where your data is, and with another user, you could verify that your plan doesn’t even touch the other database. But if it mentions it at all, you need SHOWPLAN on that database.

    The script below will let you reproduce this if you want.

    @rob_farley 

     

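    -- Create a test login, and a test database in which it will be db_owner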
    create login test with password ='test'
    go
    create database test
    go
    use test
    go
    create user test for login test
    alter role db_owner add member test

    go
    create table dbo.test (test int);
    go


    grant showplan to test
    go
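    -- In AdventureWorks2012, give the test user SELECT on Production.Product, but explicitly deny SHOWPLAN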
    use AdventureWorks2012
    go
    create user test for login test
    grant select on Production.Product to test
    deny showplan to test
    go

    use test
    go

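    -- As the test login: the cross-database query itself runs fine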
    execute as login = 'test'
    go
    select t.*
    from dbo.test t
    left join (select top (1) 1 from AdventureWorks2012.Production.Product) t2(c) on t.test = t2.c
    go
    revert
    go

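    -- As the test login again: asking for the plan fails, because SHOWPLAN is denied in AdventureWorks2012,
    -- even though the plan never touches Production.Product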
    execute as login = 'test'
    go
    set showplan_xml on
    go
    select t.*
    from dbo.test t
    left join (select top (1) 1 from AdventureWorks2012.Production.Product) t2(c) on t.test = t2.c
    go
    set showplan_xml off
    go
    revert
    go

    --Original user:
    set showplan_xml on
    go
    select t.*
    from dbo.test t
    left join (select top (1) 1 from AdventureWorks2012.Production.Product) t2(c) on t.test = t2.c
    go
    set showplan_xml off
    go

    ----Cleanup
    --use AdventureWorks2012
    --go
    --drop user test
    --drop database test
    --drop login test

  • T-SQL v MDX: Pick the hierarchy and populate the report parameter values

    When dealing with report parameters, you want the experience to be as seamless as possible for the end user. In SQL Server Reporting Services, we have the tremendously useful feature of having dynamic parameters, in which you can set a particular report parameter based on the choice of a previous one. For example, if you have parameters such as Country and City, you can set it up so that you pick the Country as Australia, and get the City choices filtered down to Adelaide, Melbourne, and Canberra, but if you choose the UK as your Country, your City choices become London, Brighton, and Dover.

    In T-SQL, we do this by making the City parameter use (for its Available Values set) a dataset which is dependent on the Country parameter. Something like “select City from Cities where Country = @country” – it’s quite straightforward.

    But when you want to be able to pick something entirely different, the story becomes more complex.

    Suppose you want to be able to filter a report either by Size or by Type. Typically you’d do this with two parameters, with an Any or All option near the top of each list. This makes a lot of sense, and even caters for the scenario of filtering the report by both Size and Type. But sometimes this doesn’t quite suit the situation.

    Imagine you’re filtering a report by date. You can either filter by the Financial Year, or by the Calendar Year. Or in fact, by the Week, Month, or Quarter. One option is to just pick the Start Date and End Date, and let the user work out when the Financial Year should be – another option is to let the user choose the type of period, and then populate the second parameter with a bunch of choices accordingly.

    In T-SQL, we could have a query which leverages a table of numbers to produce a bunch of reporting periods based on the type. Something like this:

    DECLARE @periodtype VARCHAR(50) = 'calweek';

    SELECT EnglishMonthName + ' ' + CAST(CalendarYear AS CHAR(4)) AS PeriodName
         , MIN(DateKey) AS PeriodStart, MAX(DateKey) AS PeriodEnd
    FROM dbo.DimDate
    GROUP BY CalendarYear, MonthNumberOfYear, EnglishMonthName
    HAVING @periodtype = 'month'
    UNION ALL
    SELECT CAST(CalendarYear AS CHAR(4)) AS PeriodName
         , MIN(DateKey) AS PeriodStart, MAX(DateKey) AS PeriodEnd
    FROM dbo.DimDate
    GROUP BY CalendarYear
    HAVING @periodtype = 'calyear'
    UNION ALL
    SELECT CAST(FiscalYear AS CHAR(4)) AS PeriodName
         , MIN(DateKey) AS PeriodStart, MAX(DateKey) AS PeriodEnd
    FROM dbo.DimDate
    GROUP BY FiscalYear
    HAVING @periodtype = 'fiscalyear'
    UNION ALL
    SELECT 'Week ' + CAST(WeekNumberOfYear AS VARCHAR(2)) + ', ' + CAST(CalendarYear AS CHAR(4)) AS PeriodName
         , MIN(DateKey) AS PeriodStart, MAX(DateKey) AS PeriodEnd
    FROM dbo.DimDate
    GROUP BY CalendarYear, WeekNumberOfYear
    HAVING @periodtype = 'calweek'
    ORDER BY PeriodStart;

    In this scenario, we basically choose which of the queries we’re going to run using a variable, and a query parameter in SSRS. It works nicely, and thanks to Startup Expression Predicates in Filter operators, only actually runs one of the queries.

    [image: query plan with Filter operators using Startup Expression Predicates]

    , with the results for ‘month’ being something like:

    [image: results for ‘month’]

    You may be thinking about the fact that SSRS will only let us have a single value for a parameter entry – not two. This can be overcome by concatenating the two values, and then splitting them in the query, or by having these values stored in a separate table. To concatenate them, just use:

    , CAST(MIN(DateKey) AS CHAR(8)) + CAST(MAX(DateKey) AS CHAR(8)) AS PeriodValue

    And then in the actual report query, doing something like:

    WHERE f.DateKey BETWEEN CAST(LEFT(@Period,8) AS INT) AND CAST(RIGHT(@Period,8) AS INT)

    But in Analysis Services, things are done a little differently. We have hierarchies instead, and need to refer to them somehow. We need to be able to pull out items such as [Transaction Date].[Fiscal Calendar].[Fiscal Month].&[201501], or [Transaction Date].[By Week].[Week].&[20141122].

    We’re not supposed to create MDX tuples using T-SQL. You could, but it’s not the right approach. It’s almost never the right approach.

    I’m going to start by letting the user select a @ReportPeriodType, which is going to be one of the following values:

    [Transaction Date].[Fiscal Calendar].[Fiscal Month]
    [Transaction Date].[Fiscal Calendar].[Fiscal Quarter]
    [Transaction Date].[Fiscal Calendar].[Fiscal Year]
    [Transaction Date].[Calendar].[Year]
    [Transaction Date].[By Week].[Week]

    And then use a query like this:

    WITH
    MEMBER Measures.ParamName AS STRTOVALUE(@ReportPeriodType + ".Hierarchy.CurrentMember.Name")
    MEMBER Measures.ParamUniq AS STRTOVALUE(@ReportPeriodType + ".Hierarchy.CurrentMember.Unique_Name")
    SELECT
    { Measures.ParamName, Measures.ParamUniq } ON COLUMNS,
    STRTOSET(@ReportPeriodType + ".Members") ON ROWS
    FROM [MyCube]

    And if I want to filter this to only things from the last 2 years, I could use:

    WHERE (STRTOMEMBER("[Transaction Date].[Date].&[" + @today + "]").LAG(730) : STRTOMEMBER("[Transaction Date].[Date].&[" + @today + "]"))

    Let’s look at what I’m doing here. I’m passing in a parameter which is a string, which is the level of a hierarchy. I’m going to use the members of this level on the Rows axis. To construct that, STRTOSET(@ReportPeriodType + ".Members") does the trick nicely. I could just leave it at that, and use the MEMBER_CAPTION and MEMBER_UNIQUE_NAME properties of each dimension member, but the field name will change according to the level, and I don’t want to have to work out the impact of this in SSRS. It’s much easier to have a couple of calculated members to grab the Level’s Hierarchy’s CurrentMember’s Name and Unique_Name, which I throw onto the Columns axis, and I’m sorted. My WHERE clause works just fine, so long as I’m not grabbing data out of the [Date] hierarchy, for filtering the list of parameters down to a manageable size. If I had a hidden hierarchy, I could easily pass that in with confidence that my report writers aren’t going to try to ask for hierarchy members from it.

    Notice that the main part of my query has no mention of the dimension name, or hierarchy name, or anything. This can be used for the members of ANY hierarchy, in ANY dimension. I simply need to pass in the Dimension.Hierarchy.Level string that I want. Here are some of the results if I pass in [Transaction Date].[Fiscal Calendar].[Fiscal Month].

    [image: results for [Transaction Date].[Fiscal Calendar].[Fiscal Month]]

    The dynamic nature of cubes is remarkable, and can be leveraged to make reusable data sets very easily.

    @rob_farley

  • Tuning Parallel Data Warehouse Queries

    Performance tuning in regular SQL Server can be addressed in a number of ways. This can involve looking at what’s going on with the disk configuration, the memory configuration, the wait stats, the parallelism settings, indexing, and so much more. But if you have a Parallel Data Warehouse (PDW) environment, then there are a lot of things that are taken out of your hands.

    When you buy the Analytics Platform System (APS) appliance, which hosts PDW, you’re buying a complete system. It’s an appliance, in the same way that you buy appliances for the kitchen. When you go camping and you want to cook some food, you have to build a fire, and find some way of suspending the food over the fire. Building a fire isn’t exactly easy in itself – you need to consider what you’re going to do with kindling, you need larger pieces of wood for building heat, you need a spark – and then working out how to suspend the food at an appropriate height is another matter again! There are plenty of opportunities to tune this environment, and I’m sure many people have this down to a fine art. For most of us, though, we would rather just have an appliance (an oven) and just put the food in and turn it on.

    The PDW environment is like the oven rather than the campfire. Disk comes preconfigured. Network comes preconfigured. Placement of files and filegroups is preconfigured, as is the configuration of tables and databases across these filegroups. High availability decisions are made for us, and we are left to design the database.

    But the design of the tables within the database can dramatically affect the performance of the system – in ways that regular SQL Server environments don’t even start to have to consider. And this is because of the concept of RAM across multiple servers.

    To be able to see a row of data, ready to be able to do something with it – whether that be to return it to the user who ran the query, to manipulate it in some way, or to join it to another row from another table – it must exist in RAM. DBAs understand this, and make sure that their SQL environments have as much RAM as possible, to avoid having to pull that row into RAM from disk mid-query. But when two rows that need to be joined exist on different servers, the memory spaces in RAM that can hold those rows exist on different servers too. In PDW, we have just that situation – lots of servers, with the data spread out across them. It looks like a single server, but is actually made up of lots of servers in a scaled-out architecture.

    Therefore, to join data in PDW, we have to be confident that any two rows that need to be joined exist on the same server. This can be done in one of three ways:

    1. At least one of the rows can be stored on every server.

    This is done by configuring one of the tables as a Replicated table rather than a Distributed table. This is very common, and is a typical choice for dimension tables that don’t change particularly often. If the entire list of Stores is stored (ha!) on every server, then our Sales data can be easily joined to it without fear that the Store in question for a particular Sales row might not be on the same server as said Sales row.

    2. The two tables can be distributed using the same key, with that key being part of the join conditions.

    Another common method. Suppose we have a large number of Products, which are frequently being changed and versioned, and we have a large amount of Sales. Both are good candidates for being Distributed rather than Replicated – we probably wouldn’t want a full copy of our large and quickly-growing Products table on every server. But if both tables are distributed on ProductKey and this is part of the join condition (p.ProductKey = s.ProductKey), then the system will know that any two rows that are a successful match according to the join conditions must also be on the same server. (There’s a sketch of what these two approaches look like as table definitions just after this list.)

    3. Data can be moved mid-query.

    In regular SQL, if a row isn’t in RAM already, we need to pull a copy of it into RAM off the disk. Similarly, in PDW, if the system can’t be sure that the row exists on the right server, it can pull a copy of it onto the right server. This is known as Data Movement, and just as regular DBAs try to minimise the amount of disk activity for performance tuning, PDW DBAs try to minimise Data Movement.

    Data Movement comes in a variety of flavours, but I want to mention two of them here. One is broadcasting, which is turning the smaller of the two tables (or at least, the data which is identified as being of interest once other row and column filters have been applied) into a temporary replicated table, putting a copy of that data on every server. This turns our scenario into situation 1, and the query can continue. Another is shuffling, which is distributing some of the data on a different key, which is part of the join conditions – putting us into situation 2.
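
    Here’s a rough sketch of what those first two options look like as table definitions – the tables and columns are made up, the syntax is the APS/PDW flavour, and the clustered columnstore index is just one common choice:

    -- Option 1: a replicated dimension – a full copy on every compute node
    CREATE TABLE dbo.Stores
    (
        StoreKey  int          NOT NULL,
        StoreName nvarchar(50) NOT NULL
    )
    WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);

    -- Option 2: two large tables distributed on the same key that appears in their join
    CREATE TABLE dbo.Products
    (
        ProductKey  int           NOT NULL,
        ProductName nvarchar(100) NOT NULL
    )
    WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX);

    CREATE TABLE dbo.Sales
    (
        SalesKey   bigint NOT NULL,
        ProductKey int    NOT NULL,
        StoreKey   int    NOT NULL,
        Quantity   int    NOT NULL
    )
    WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX);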

    To satisfy a single query, there might be a lot of data movement required.

    If you consider a typical data warehouse query, there are a lot of joins, across a lot of tables, and the join conditions are different for every join. Unless you start to think about PDW queries in a very different way to which you think about non-PDW queries, you can very easily suffer from poor performance simply down to unnecessary data movement.

    Thinking outside the box can provide some useful mechanisms for tuning PDW, which can see the performance improve massively.

    Denormalising is a method that many data warehouse designers are very familiar with, but when you start to consider the pain of joins in a PDW environment, it starts to become a lot more significant. It’s often not appropriate to denormalise too much, but if you have been keen on normalising significantly in the past, you should potentially reconsider this useful method.

    Redundant predicates are something which many developers don’t seem to appreciate. I’ve heard developers say that any piece of redundant code should be removed, because it’s going to create extra work. But in the PDW world, these predicates could help explain to the PDW Query Optimizer that data movement isn’t actually necessary, and could save you a large amount of effort.

    Views are often maligned in the regular SQL world, as people write views that query views that query views. I’m not suggesting that you fall into that trap, but you should certainly consider views to encapsulate join logic, encouraging the use of those redundant predicates that I just mentioned.

    Table splitting seems counter-intuitive to avoiding joins, but splitting a table horizontally (having some rows in one table and others in another) gives extra options around how to distribute the data. Some data could be distributed according to one column, and other data could be distributed on another. This has the added benefit of avoiding skew, but can also benefit from using views when accessing these tables, as PDW does not support check constraints. Splitting tables vertically is rarer, but can allow dimensions to get wider and allow for additional denormalisation before hitting the maximum row width of 32kB (in-row data is limited to 8kB, but data movement buffers are only 32kB, which makes for a much harder limit).

    These methods don’t have the same kind of impact in your ordinary SQL Server environments, but can have a massive effect in PDW. You can see examples of how these things can come into play by looking at a previous post of mine http://sqlblog.com/blogs/rob_farley/archive/2014/12/05/minimising-data-movement-in-pdw-using-query-optimisation-techniques.aspx, where you should be able to spot my use of all of these methods.

    Oh, and this piece about performance tuning fits into this month’s T-SQL Tuesday topic, hosted by Russ Thomas (@sqljudo) on the topic of Performance.

    [image: T-SQL Tuesday logo]

    @rob_farley

  • Why isn’t my filtered index being used?

    Quite often, people have filtered indexes but find they’re not being used as often as they’d like. I was reminded of this recently when I read a good post by Kendra Little from brentozar.com about filtered indexes. In it, Kendra talks about how the WHERE clause of a filtered index allows an IN clause, but not an OR clause (to quote someone from Microsoft “We currently do not support the "OR" clause due to the optimizer matching difficulties for such clauses”). Going back a few years, Tim Chapman wrote a post about the pains of filtered indexes.

    Anyway, both of these posts hint that filtered indexes aren’t always used. Tim addresses it directly, and Kendra mentions needing a hint to make sure the index is used. I thought I’d explore the question a little more. I’m confident that both Tim and Kendra know this information – they are two of the top index experts in the world. This post isn’t for them, but for those people who are trying to find out why their filtered indexes aren’t being used.

    To be used, the filtered index must be able to satisfy the query.

    This should be fairly obvious, but it goes a little deeper than you might think on first glance. Let’s explore Tim’s examples to show what I mean.

    Tim creates a filtered index:

    CREATE INDEX FIDX_SalesOrderDetail_ProductID
    ON Sales.SalesOrderDetail (ProductID)
    WHERE ProductID = 870;

    ...and then shows that it’s used successfully for the query:

    SELECT ProductID
    FROM Sales.SalesOrderDetail
    WHERE ProductID = 870;
     

    (The image here is from Tim’s blog post, and belongs to Microsoft)

    No surprise here – if the system knows that ProductID is 870, then it can use an index which only includes rows that satisfy that.

    Tim then tries to use a variable instead of 870 – although he still passes in the value of 870.

    DECLARE @ProductID INT;
    SET @ProductID = 870;

    SELECT ProductID
    FROM Sales.SalesOrderDetail
    WHERE ProductID = @ProductID;

    (Image from Tim’s blog again)

    No luck – the system doesn’t use the filtered index. He can’t even use a hint to force it to use it – SQL replies saying that it can’t create a plan for it using that index hint.

    So what’s going on?

    The problem is not with the Query Optimizer seeing that the value is going to be 870 – the problem is with the plan cache. You see, when SQL runs a query, it assumes the plan will get reused, so it puts the plan into the cache. But the version that goes into the cache is a general one, which doesn’t consider the specific value that is being passed in. Because it needs a plan that will work regardless of what the variable is set to, using the filtered index here would be inappropriate.

    Tim shows one way around this, and ‘hints’ at another in an edit – although sadly, points made in blog post edits tend to get missed.

    The way that Tim gets around this is to use Dynamic SQL, but I’m not a fan.

    DECLARE @SQL NVARCHAR(MAX), @ProductID INT
    SET @ProductID = 870
    SET @SQL = N'SELECT ProductID
    FROM Sales.SalesOrderDetail 
    WHERE ProductID = ' + CAST(@ProductID AS VARCHAR(10))
    EXECUTE sp_executesql @SQL

    Yeah – I’m not a fan of this. Go read my post on SQL Injection for why.

    Tim does mention a better method in the ‘edit’ bit in his post, and I want to look at that. He says: “In many cases, a way to get around the local variable problem is to recompile the statement.” – but I want to make it very clear that the point is not actually to recompile the statement, but to use a statement that isn’t going to get put into the cache (which you do by using the OPTION (RECOMPILE) hint, which makes it sound like you’re recompiling the statement).

    When you use OPTION (RECOMPILE), the main impact is not that it recompiles, but that the query doesn’t get cached. Because it doesn’t get cached, it won’t find the query in the cache beforehand either. This means that it doesn’t need to consider the generalised version – it has the confidence to know that it doesn’t need to cater for future uses, so it can use the filtered index!
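    In other words, Tim’s query with the hint added – something like this – is free to use the filtered index:

    DECLARE @ProductID INT;
    SET @ProductID = 870;

    SELECT ProductID
    FROM Sales.SalesOrderDetail
    WHERE ProductID = @ProductID
    OPTION (RECOMPILE);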

    image

    So the better option than using Dynamic SQL is to use OPTION (RECOMPILE).

    So that’s one reason why your filtered index might not be used – but there’s another too:

    Using the filtered index might be too much work.

    Let’s consider Kendra’s example. She had a query that used IN. Her example was:

    CREATE INDEX IX_Votes_filter ON dbo.Votes (PostId) WHERE (VoteTypeId IN (1,2));

    But as I don’t have a Votes table, I’m going to use an equivalent with AdventureWorks:

    CREATE INDEX FIDX_Product2Color ON Production.Product (ProductID) WHERE (Color IN ('Red', 'Black'));

    Kendra uses SELECT COUNT(*) queries to test it. Similarly to Kendra’s example, mine works well when IN is used, and when OR is used; but when just one of the values is used, it needs a hint before it will use the filtered index, and then it needs a Lookup to satisfy the query.
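    To give a feel for what I mean, these are the kinds of queries being tested – the index hint in the last one is only there because, without it, the Query Optimizer prefers a clustered index scan:

    -- Matches the filter exactly – the filtered index gets used without any help
    SELECT COUNT(*) FROM Production.Product WHERE Color IN ('Red', 'Black');

    -- The same thing expressed with OR – still fine
    SELECT COUNT(*) FROM Production.Product WHERE Color = 'Red' OR Color = 'Black';

    -- Just one of the values – needs the hint, and then needs a Lookup for Color
    SELECT COUNT(*) FROM Production.Product WITH (INDEX (FIDX_Product2Color))
    WHERE Color = 'Red';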

    image

    What Kendra doesn’t point out is why the filtered index needed the hint to be used in the single-option examples (which is fair enough – because her post was about OR v IN, not about why the filtered index wasn’t being used).

    The reason is the Lookup that’s needed. It’s so expensive that it’s cheaper for the Query Optimiser to do a clustered index scan instead. It’s standard ‘tipping point’ stuff, but we don’t normally see this when we have COUNT(*), because COUNT(*) is just counting rows, not returning extra columns.

    ...except that there is a column that isn’t included in the filtered index, which our query needs – Color.

    Subtly, even though our filtered index only contains rows that have the Color either Red or Black (or in Kendra’s example, VoteTypeID either 1 or 2), it doesn’t store which rows are Red and which rows are Black. We know that every row in the index is either Red or Black, so we can use this index as a starting point, but we would need to do a Lookup to get the actual Color value.

    To fix this, we should INCLUDE the Color column in the filtered index.

    CREATE INDEX FIDX_Product2Color2 ON Production.Product (ProductID) INCLUDE (Color) WHERE (Color IN ('Red', 'Black'));

    Now we don’t need to hint at the index to use, and we see a residual predicate being used to make sure that we only pull Red rows out of the index.

    image

    So we see two reasons here for filtered indexes not being used, and two ways to help encourage them to be used more often. The first is to consider using OPTION (RECOMPILE) to help persuade the Query Optimizer not to consider generalising the query across different parameter values, and the second is to INCLUDE the columns that are used in the filter, in case the database engine needs those values later in the query.

    Do try to use filtered indexes, and be understanding about those times when they’re not used. There’s quite probably a simple explanation.

    @rob_farley

  • SQL Injection – the golden rule

    The problem with SQL Injection is that most people don’t realise the fundamental concept which makes SQL Injection vulnerability not only easy to spot, but also easy to prevent. And it’s the thing that SQL Injection has in common with countless other hacking mechanisms that have been around since the early days of computing.

    The simple truth is this: You are vulnerable if you confuse DATA with COMMAND.

    SQL Injection occurs when a piece of user-entered data makes its way into the command that is being executed. Not simply as a piece of data, such as the value that a column is being compared to, but when it becomes something which actually gets executed. Buffer overflow attacks worked in the same way – although the method of getting the data to be executed is different.

    So you should never put yourself in a situation where something that a user has entered gets executed. It doesn’t matter what sanitisation you have done, you should never do it. Creative hackers will be able to find ways around however you do your sanitisation. Just don’t run something that has come from the user – pass it in using parameters instead.

    Code within a stored procedure is trustworthy – SQL code written in a stored procedure can be trusted. But...

    Code within a web application is trustworthy – SQL code written in a web application can be trusted. But...

    ...but all this can be undermined if your developers allow user-entered text to be executed, rather than treating it as parameter values.

    (Of course, if your server is compromised and someone can tell rogue commands to run, then all bets are off – I’m just looking at SQL Injection here)

    What happens with SQL Injection is this: a COMMAND is constructed that includes user-entered DATA, and is then executed. It doesn’t matter whether this is done within .NET code, or within perl code, or within a stored procedure – you just don’t ever execute data.

    There are plenty of guides to avoiding SQL Injection. They say to avoid using dynamic SQL; to avoid ad hoc SQL; to only use stored procedures. Most of them miss the point. They’re not bad, but following these rules doesn’t make you injection-proof. You simply don’t ever execute anything that came from the user.

    Dynamic SQL is fine. You can construct a query based on user input – so long as you don’t execute something that came from the user. If the user chooses a particular option, the dynamically-generated SQL can include or exclude some predicate, or can bring a new parameter into the mix. This gives the freedom of dynamic SQL without needing to become vulnerable to SQL Injection.

    Ad hoc SQL is also fine, and you don’t need to use stored procedures. It doesn’t matter whether your SQL code is stored in a stored procedure within the database, or whether your SQL code is stored within your application.

    Ad hoc code or dynamic SQL like this is okay:

    if (UserInputtedFirstName.Length > 0) {
       cmd += " AND u.FirstName = @fn ";
       params.Add("@fn", .......
       ......
    }

    but code like this is not:

    if (UserInputtedFirstName.Length > 0) {
       cmd += " AND u.FirstName = " + Sanitise(UserInputtedFirstName);
       ......
    }

    ...no matter what the Sanitise functionality is. The only way to protect against SQL Injection is to only use controlled commands, and never anything else.
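    If you prefer to see that in T-SQL terms, here’s a minimal sketch of the same idea, using a hypothetical dbo.Users table. The shape of the command changes based on what the user did, but the value the user typed only ever travels as a parameter:

    DECLARE @UserInputtedFirstName nvarchar(100) = N'Rob';   -- stand-in for whatever the user typed
    DECLARE @qry nvarchar(max) = N'SELECT u.UserID, u.FirstName FROM dbo.Users AS u WHERE 1 = 1';

    IF @UserInputtedFirstName > N''
       SET @qry += N' AND u.FirstName = @fn';   -- the command grows, but only with text we control

    EXEC sp_executesql @qry, N'@fn nvarchar(100)', @fn = @UserInputtedFirstName;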

    But what about generic procedures, which developers use to be able to query whichever objects they want?

    It’s easy to simply “Just say no” to procedures like this. After all, they’re not healthy for all kinds of reasons – the plan cache, the effective use of indexes, and more. But if you have a stored procedure like:

    create proc dbo.GetAnyData(@tablename nvarchar(256), @colname nvarchar(256), @filtercol nvarchar(256), @filterval sql_variant) as
    begin
       declare @qry nvarchar(max) = 'select ' + @colname + ' from ' + @tablename + ' where ' + @filtercol + ' = @val';
       exec sp_executesql @qry, N'@val sql_variant', @val = @filterval;
    end

    ...with an application that uses it, then you may be somewhat worried (and legitimately so) about SQL Injection, and be wondering how you’re going to refactor all the code that uses this procedure.

    So how do you make this safe from SQL Injection, so that the calls to the procedure aren’t susceptible, while all the redevelopment is done?

    Hopefully you know the answer – you need to control the commands.

    The issue is not with the use of @filterval – that’s being passed in as data, and is being used as data. It’s never part of the command itself. The problem is that @tablename, @colname and @filtercol are all passed in from the calling application, potentially even from the user, and then used as part of the command. This is what needs to be fixed.

    Luckily, we have access to a bunch of things that we know can be used here – the actual table names and actual column names in the database’s catalog views. So we can safely use those. They are controlled.

    create proc dbo.GetAnyData(@tablename nvarchar(256), @colname nvarchar(256), @filtercol nvarchar(256), @filterval sql_variant) as
    begin
       declare @knowntable nvarchar(256), @knowncol nvarchar(256), @knownfiltercol nvarchar(256);

       select @knowntable = quotename(object_name(object_id)), @knowncol = quotename(name)
       from sys.all_columns
       where object_id = object_id(@tablename) and name = @colname;
      
       select @knownfiltercol = quotename(name)
       from sys.all_columns
       where object_id = object_id(@tablename) and name = @filtercol;

       declare @qry nvarchar(max) = 'select ' + @knowncol + ' from ' + @knowntable + ' where ' + @knownfiltercol + ' = @val';
       if @qry is not null
          exec sp_executesql @qry, N'@val sql_variant', @val = @filterval;
    end

    I’m not saying this is good. I’m saying this is SAFE. We don’t execute anything that we don’t already control. The only tables we can query are ones that we know actually exist. The only columns that we can use are ones that must exist. You can always add more logic, to make sure that people can’t query system objects, and things like that – but they definitely can’t execute anything overly malicious. They can’t make calls to sp_configure or to xp_cmdshell. They can’t drop tables or update values. If you need to restrict the columns, you can easily filter the queries on all_columns, so that the calling code can’t reach as much – as in the sketch below.
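    For example, here’s a sketch of tightening the column lookup a little further – restricting it to user tables, and to a hypothetical whitelist table (dbo.QueryableColumns) of columns you’re happy to expose:

    select @knowncol = quotename(c.name)
    from sys.all_columns as c
    join dbo.QueryableColumns as q   -- hypothetical whitelist of object_id + column_name
      on q.object_id = c.object_id
     and q.column_name = c.name
    where c.object_id = object_id(@tablename)
      and c.name = @colname
      and objectproperty(c.object_id, 'IsUserTable') = 1;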

    SQL Injection is serious. It’s a huge problem for people who deal with databases – and not just SQL Server, but all databases. The solution is easy, but requires some level of discipline.

    Don’t execute user input, and don’t let your developers do it either (and if that means forcing them to use stored procedures that you can audit, then fine – do whatever it takes).

    Oh, and as this month’s T-SQL Tuesday (hosted by Kenneth Fisher, @sqlstudent144) is on the topic of security, that’s what this post is on.

    TSQL2sDay150x150

    @rob_farley

  • Medians pre-SQL 2012

    SQL 2012 was a big release for working out the median in SQL Server, with the advent of the function PERCENTILE_CONT(). It’s a very elegant way of working out the median (hint, that’s the 0.5 point), even though it’s not actually an aggregate function, as I’ve written before.

    Plus – it doesn’t even perform well. About a year ago, Aaron Bertrand (@aaronbertrand) wrote a fantastic post about different methods for getting medians, and showed that PERCENTILE_CONT() is actually one of the slowest methods, and that the best method is to use an idea from Peter Larsson (@SwePeso) that uses an OFFSET-FETCH clause to grab the rows of interest before doing an average of them.

    Except that the OFFSET-FETCH clause was also new in 2012. So if you’re stuck on SQL 2008 R2 and earlier, you’re a bit more stuck.

    All the pre-SQL 2012 methods that Aaron showed used ROW_NUMBER() except one – which used a combination of MIN/MAX over each half of the data. But one method that Aaron didn’t explore in his post was to simulate OFFSET-FETCH in earlier versions. Let me show you…

    Here’s the OFFSET-FETCH method. Notice that it fetches either 1 or 2 rows (depending on whether the overall count is odd or even), having offset by just under half of the set. For example, with 7 rows it skips 3 and fetches the 4th; with 8 rows it skips 3 and fetches the 4th and 5th, which then get averaged.

    SELECT    d.SalesPerson, w.Median
    FROM
    (
      SELECT SalesPerson, COUNT(*) AS y
      FROM dbo.Sales
      GROUP BY SalesPerson
    ) AS d
    CROSS APPLY
    (
      SELECT AVG(0E + Amount)
      FROM
      (
        SELECT z.Amount
         FROM dbo.Sales AS z
         WHERE z.SalesPerson = d.SalesPerson
         ORDER BY z.Amount
         OFFSET (d.y - 1) / 2 ROWS
         FETCH NEXT 2 - d.y % 2 ROWS ONLY
      ) AS f
    ) AS w(Median);

    What my pre-2012-compatible version does is fetch slightly more than half the set first, and then take the TOP 1 or 2 rows of that, but in DESC order. With 8 rows, the inner TOP grabs 5 rows and the outer TOP takes the 5th and 4th; with 7 rows, the inner TOP grabs 4 rows and the outer TOP takes just the 4th.

    SELECT d.SalesPerson, w.Median
    FROM
    (
      SELECT SalesPerson, COUNT(*) AS y
      FROM dbo.Sales
      GROUP BY SalesPerson
    ) AS d
    CROSS APPLY
    (
      SELECT AVG(0E + Amount)
      FROM
      (
        SELECT TOP (2 - d.y % 2) Amount
        FROM
        (
          SELECT TOP (d.y / 2 + 1) z.Amount
          FROM dbo.Sales AS z
          WHERE z.SalesPerson = d.SalesPerson
          ORDER BY z.Amount
        ) AS t
        ORDER BY Amount DESC
      ) AS f
    ) AS w(Median);

    With OFFSET-FETCH, we’re grabbing the rows we want by skipping over the rows we’re not interested in until we find the ones that we are interested in. In the TOP/TOPDESC, we’re identifying the rows we want by the fact that they’re the bottom of the top slightly-more-than-half set.

    Other than that, the idea is exactly the same. The results are identical, but what about the performance?

    First, let’s give you the code to set up your environment (as found in Aaron’s post) – I used a clustered index.

    CREATE TABLE dbo.Sales(SalesPerson INT, Amount INT);
    GO
     
    --CREATE CLUSTERED INDEX x ON dbo.Sales(SalesPerson, Amount);
    --CREATE NONCLUSTERED INDEX x ON dbo.Sales(SalesPerson, Amount);
    --DROP INDEX x ON dbo.sales;
     
    ;WITH x AS
    (
      SELECT TOP (100) number FROM master.dbo.spt_values GROUP BY number
    )
    INSERT dbo.Sales WITH (TABLOCKX) (SalesPerson, Amount)
      SELECT x.number, ABS(CHECKSUM(NEWID())) % 99
      FROM x CROSS JOIN x AS x2 CROSS JOIN x AS x3;

    What I want to do to evaluate this is to look at the query plans. Once I’ve done that, I’ll make a comment about the performance and where it fits into the mix from Aaron’s post.

    So those plans… OFFSET-FETCH method first, followed by the TOP/TOPDESC method. I’m using a clustered index on the data – a nonclustered index gives the same shape but with nonclustered index operations instead of clustered index operations. Heaps are a different story that I’m not exploring here.

    image

    As you’d expect, there’s a lot of similarity. Both use Nested Loops, grabbing the Counts from a Scan on the outer branch, with a Seek on the inner branch. And both inner branches have a Top Operator pulling the data out of a Seek. But the TOP/TOPDESC method has TWO Top operators, with a Sort in between. This is because of the ‘TOPDESC’ bit. If we had a ‘Bottom’ operator, then that would avoid the need for a Sort, but no such animal exists, and it does ‘Bottom’ by doing a Top of re-Sorted data. It’s very disappointing. The Top operator in the OFFSET-FETCH method has a new property called OffsetExpression, which it uses to skip over as many rows as it needs – it’s simply not supported pre-2012.

    image

    (Quick side note: the arrow between the Compute Scalar and the right-most Top operator in both plans is quite thin – much thinner than you might expect. This is only a quirk of the plan because the Actuals haven’t been reported here. MSDN (https://technet.microsoft.com/en-us/library/ms178082.aspx) says: “Compute Scalar operators that appear in Showplans generated by SET STATISTICS XML might not contain the RunTimeInformation element. In graphical Showplans, Actual Rows, Actual Rebinds, and Actual Rewinds might be absent from the Properties window when the Include Actual Execution Plan option is selected in SQL Server Management Studio. When this occurs, it means that although these operators were used in the compiled query plan, their work was performed by other operators in the run-time query plan.” Therefore, the arrow coming out of the Compute Scalar operator is the width of the estimated number of rows, because it doesn’t have the actual number of rows. But it’s a Compute Scalar – it’s not going to change the number of rows, and you should consider the width of the arrow as being the width of the arrow going into it.)

    Of course, this TOP/TOPDESC method is slower than OFFSET-FETCH. If we had a ‘Bottom’ operator, I think it wouldn’t be very much slower, but here we have a Sort operator! And those things are bad. The plans estimated that the cost of the Sort would be 27% of the total query, and that the ratio between the two queries would be 58:42, which is 1.38:1. But remember that those Cost percentages are based on estimated values, and we know those estimates are quite a long way out.

    So instead, we use a more empirical method, which is to run them against each other.
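    (If you want to measure it yourself rather than trusting the plan costs, something as simple as this is enough – turn on the time statistics, run each of the two queries above a few times with a warm cache, and compare the elapsed times in the Messages tab.)

    SET STATISTICS TIME ON;

    -- run the OFFSET-FETCH query from above a few times here
    -- run the TOP/TOPDESC query from above a few times here

    SET STATISTICS TIME OFF;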

    On my machine (a Surface Pro 2), with a warm cache, the OFFSET-FETCH method took around 380ms, compared to around 570ms for the TOP/TOPDESC. It’s definitely slower – no surprises there. It’s a good 50% slower, if not more. But this still makes it faster than any of the pre-SQL 2012 versions that Aaron used.

    I’m sure you’re going to point out that I’m clearly running this on a version of SQL Server that is at least 2012… so I ran it on a SQL 2008 R2 box as well, and found that the plan was identical to the one shown here, and that it was about 30% faster than the “2005_3” version from Aaron’s post with an index applied.

    So if you’re using SQL 2008 R2 (or earlier) still, then don’t dismiss the best-performing median function from Aaron’s post (thanks again, Peso!), but instead, consider coming up with a 2008R2-compatible version, as I’ve done here.

    Update: Another method is to consider simply filtering on ROW_NUMBER(), which isn’t included in Aaron’s post either. It’s still good, but doesn’t quite perform as quickly as the TOP/TOPDESC method on the million-row set, because it has to figure out the ROW_NUMBER() for a lot of rows. The OffsetExpression property in the Top operator of SQL 2012+ is your friend.

    SELECT d.SalesPerson, w.Median
    FROM
    (
      SELECT SalesPerson, COUNT(*) AS y
      FROM dbo.Sales
      GROUP BY SalesPerson
    ) AS d
    CROSS APPLY
    (
      SELECT AVG(0E + Amount)
      FROM
      (
        SELECT z.Amount,
               rn = ROW_NUMBER() OVER (ORDER BY z.Amount)
        FROM dbo.Sales AS z
        WHERE z.SalesPerson = d.SalesPerson
      ) AS f
      WHERE f.rn BETWEEN (d.y + 1) / 2 AND (d.y + 2) / 2
    ) AS w(Median);

    @rob_farley

  • APS / PDW Monitoring

    When you get an Analytics Platform System (APS) – the appliance that hosts a copy of SQL Server Parallel Data Warehouse edition (PDW) and potentially a Hadoop cluster as well – one of the things you get is an Admin Console. It’s quite handy to be able to look at it and see whether your appliance is healthy or not.

    The trouble with the Admin Console, though, is what I’ve just said – you look at it to assess the health of your appliance. And this isn’t something that you really want to be doing. What are you going to do – look at it every morning and then get on with your day? If this is how you monitor your environment, the chances are you won’t be watching at the moment a problem actually starts, and your real alert will be the phone ringing and an earful of ‘angry customer’.

    You need a better solution – something which is going to actually let you know if a problem is detected, and which is going to store some historical information so that you can do some troubleshooting. You know – just like you have on your other SQL boxes. The thing is that PDW doesn’t come with something which can send emails. And it doesn’t come with a scheduler like SQL Agent which can be used to kick off jobs to collect data, or to check perfmon counters in case there’s an issue.

    So how do you monitor PDW?

    The answer is: from outside it.

    Ideally, your organisation has centralised monitoring anyway. Some server(s) that run around checking that everything is healthy. They check the health of the Windows boxes, they see how much disk space is free on all the volumes, they make sure that backups have been taken (not just SQL backups, but backups of whole servers), and run tests like DBCC, restores, and pings. If there’s a problem with the monitoring servers, this is recognised and understood.

    Now, you can roll your own one of these, and have it query DMVs like PDW’s sys.dm_pdw_component_health_alerts, sys.dm_pdw_errors, and sys.dm_pdw_os_performance_counters. Collecting this information (and the contents of many other DMVs) can provide excellent coverage when troubleshooting, and can also highlight potential issues before they become problems. Running DBCC PDW_SHOWSPACEUSED from time to time is definitely important, to stay conscious of how skew is looking – the list of things to keep an eye on goes on and on.
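    Just as a sketch of the kind of thing I mean (the fact table name here is made up, and you’d obviously store the results somewhere rather than just selecting them):

    -- Anything the appliance has flagged as unhealthy
    SELECT * FROM sys.dm_pdw_component_health_alerts;

    -- Errors raised recently across the appliance
    SELECT TOP (100) * FROM sys.dm_pdw_errors ORDER BY create_time DESC;

    -- How evenly a particular table is spread across the distributions
    DBCC PDW_SHOWSPACEUSED ("dbo.FactSales");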

    Something that shouldn’t be overlooked is the usefulness of System Center Operations Manager (even if I keep wanting to type ‘Centre’ instead of ‘Center’). There are SCOM Management Packs available to cater for PDW, HDInsight (another component within APS) and APS itself. If SCOM is part of your organisation, then configuring it to monitor your APS appliance is definitely worth doing. I’ve lifted the image here from the APS help file – if you’re at all familiar with SCOM, you’ll recognise it and see that you have good coverage of your APS environment with it. It should never fully replace using queries to look at the metadata within (for assessing skew, etc.), but you should definitely be using SCOM with APS if you can.

    image

    I mentioned that this image is part of the APS help file – it goes into quite some detail about setting up SCOM to work with APS, so if you’re considering APS, you should be able to reassure your IT staff that they will be able to use their existing SCOM environment to monitor the appliance still.

    Don’t neglect monitoring your APS box. When we get an appliance, it’s easy to let it just sit there and do its stuff, assuming that everything is going to be okay because it’s an appliance. We don’t monitor our kettles at home, but our businesses don’t depend on the health of the kettles (maybe the coffee machine, but that’s a different story). Monitoring doesn’t have to be hard work, but it does have to happen. Luckily, we get a bunch of tools to help us make that happen.

    TSQL2sDay150x150

    And this is for yet another T-SQL Tuesday. The first for 2015, and the 62nd in all – hosted this time by Robert Pearl (@pearlknows).

    @rob_farley

  • Four SQL MVPs at LobsterPot – including three in Australia

    Today LobsterPot Solutions achieves a new first. We are the only company ever to employ three current Australian SQL MVPs, giving us four awardees in total. Congratulations to Martin Cairney, who joins Julie Koesmarno (AUS), Ted Krueger (USA) and me (AUS) as recipients of this prestigious award. This demonstrates LobsterPot’s ongoing commitment to the SQL Server community, and shows that our consultants are truly influential in the SQL world.

    MVP_FullColor_ForScreen

    From Microsoft’s website about MVPs:
    Microsoft Most Valuable Professionals, or MVPs, are exceptional community leaders who actively share their high-quality, real-world deep technical expertise with the community and with Microsoft. They are committed to helping others get the most out of their experience with Microsoft products and technologies.
    Technical communities play a vital role in the adoption and advancement of technology—and in helping our customers do great things with our products. The MVP Award provides us with an opportunity to say thank you and to bring the voice of community into our technology roadmap.

    This fits very closely with LobsterPot’s desire to help people with their data story. We help with the adoption and advancement of SQL Server, and help customers do great things with data. It’s no surprise that a high proportion of LobsterPot consultants are MVP awardees.
