THE SQL Server Blog Spot on the Web


Rob Farley

- Owner/Principal with LobsterPot Solutions (an MS Gold Partner consulting firm), Microsoft Certified Master, Microsoft MVP (SQL Server), APS/PDW trainer and leader of the SQL User Group in Adelaide, Australia. Rob is a former director of PASS, and runs training courses around the world in SQL Server and BI topics.

  • Why isn’t my filtered index being used?

    Quite often, people have filtered indexes but find they’re not being used as often as they’d like. I was reminded of this recently when I read a good post by Kendra Little from brentozar.com about filtered indexes. In it, Kendra talks about how the WHERE clause of a filtered index allows an IN clause, but not an OR clause (to quote someone from Microsoft “We currently do not support the "OR" clause due to the optimizer matching difficulties for such clauses”). Going back a few years, Tim Chapman wrote a post about the pains of filtered indexes.

    Anyway, both of these posts hint that filtered indexes aren’t always used. Tim addresses it directly, and Kendra mentions needing a hint to make sure the index is used. I thought I’d explore the question a little more. I’m confident that both Tim and Kendra know this information – they are two of the top index experts in the world. This post isn’t for them, but for those people who are trying to find out why their filtered indexes aren’t being used.

    To be used, the filtered index must be able to satisfy the query.

    This should be fairly obvious, but it goes a little deeper than you might think on first glance. Let’s explore Tim’s examples to show what I mean.

    Tim creates a filtered index:

    CREATE INDEX FIDX_SalesOrderDetail_ProductID
    ON Sales.SalesOrderDetail (ProductID)
    WHERE ProductID = 870;

    ...and then shows that it’s used successfully for the query:

    SELECT ProductID
    FROM Sales.SalesOrderDetail
    WHERE ProductID = 870;
     

    (The image here is from Tim’s blog post, and belongs to Microsoft)

    No surprise here – if the system knows that ProductID is 870, then it can use an index which only includes rows that satisfy that.

    Tim then tries to use a variable instead of 870 – although he still passes in the value of 870.

    DECLARE @ProductID INT;
    SET @ProductID = 870;

    SELECT ProductID
    FROM Sales.SalesOrderDetail
    WHERE ProductID = @ProductID;

    (Image from Tim’s blog again)

    No luck – the system doesn’t use the filtered index. He can’t even use a hint to force it to use it – SQL replies saying that it can’t create a plan for it using that index hint.

    So what’s going on?

    The problem is not with the Query Optimizer seeing that the value is going to be 870 – the problem is with the plan cache. You see, when SQL runs a query, it figures it won’t be in isolation and it puts it into the cache. But the version that goes into the cache is a general one, that doesn’t consider the values that are passed in. Because it needs a plan that will work regardless of what the parameter is set to, using the filtered index here would be inappropriate.

    Tim shows one way around this, and ‘hints’ at another in an edit – although sadly, you tend to find that blog post edits can miss the key point a little.

    The way that Tim gets around this is to use Dynamic SQL, but I’m not a fan.

    DECLARE @SQL NVARCHAR(MAX), @ProductID INT
    SET @ProductID = 870
    SET @SQL = N'SELECT ProductID
    FROM Sales.SalesOrderDetail 
    WHERE ProductID = ' + CAST(@ProductID AS VARCHAR(10))
    EXECUTE sp_executesql @SQL

    Yeah – I’m not a fan of this. Go read my post on SQL Injection for why.

    Tim does mention a better method in the ‘edit’ bit in his post, and I want to look at that. He says: “In many cases, a way to get around the local variable problem is to recompile the statement.” – but I want to make it very clear that the point is not actually to recompile the statement, but to use a statement that isn’t going to get put into the cache (which you do by using the OPTION (RECOMPILE) hint, which makes it sound like you’re recompiling the statement).

    When you use OPTION (RECOMPILE), the main impact is not that it recompiles, but that the query doesn’t get cached. Because it doesn’t get cached, it won’t find the query in the cache beforehand either. This means that it doesn’t need to consider the generalised version – it has the confidence to know that it doesn’t need to cater for future uses, so it can use the filtered index!
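
    In practical terms, that just means taking Tim’s variable query from above and adding the hint – something like this:

    DECLARE @ProductID INT;
    SET @ProductID = 870;

    SELECT ProductID
    FROM Sales.SalesOrderDetail
    WHERE ProductID = @ProductID
    OPTION (RECOMPILE);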

    (image: the query plan now uses the filtered index)

    So the better option than using Dynamic SQL is to use OPTION (RECOMPILE).

    So that’s one reason why your filtered index might not be used – but there’s another too:

    Using the filtered index might be too much work.

    Let’s consider Kendra’s example. She had a query that used IN. Her example was:

    CREATE INDEX IX_Votes_filter ON dbo.Votes (PostId) WHERE (VoteTypeId IN (1,2));

    But as I don’t have a Votes table, I’m going to use an equivalent with AdventureWorks:

    CREATE INDEX FIDX_Product2Color ON Production.Product (ProductID) WHERE (Color IN ('Red', 'Black'));

    Kendra uses SELECT COUNT(*) queries to test it. Similarly to Kendra’s example, this works well when IN is used, and when OR is used, but when just one of the options is used, it needs a hint to use the index, and then it needs a lookup to satisfy it.
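
    The test queries look something like this (a sketch against my AdventureWorks version of the index – the single-value query is the one that needs the hint, and then the Lookup):

    -- These use the filtered index without any help:
    SELECT COUNT(*) FROM Production.Product WHERE Color IN ('Red', 'Black');
    SELECT COUNT(*) FROM Production.Product WHERE Color = 'Red' OR Color = 'Black';

    -- This one ignores it unless we force the issue, and then a Lookup is needed:
    SELECT COUNT(*) FROM Production.Product WITH (INDEX (FIDX_Product2Color))
    WHERE Color = 'Red';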

    (image: execution plans for the IN, OR, and single-value queries)

    What Kendra doesn’t point out is why the filtered index needed the hint to be used in the single-option examples (which is fair enough – because her post was about OR v IN, not about why the filtered index wasn’t being used).

    The reason why is because of the Lookup that’s needed. This is so expensive, it’s cheaper for the Query Optimiser to do a clustered index scan instead. It’s standard ‘tipping point’ stuff, but we don’t normally see this when we have COUNT(*), because COUNT(*) is just counting rows, not returning extra columns.

    ...except that there is a column that isn’t included in the filtered index, which our query needs – Color.

    Subtly, even though our filtered index only contains rows that have the Color either Red or Black (or in Kendra’s example, VoteTypeID either 1 or 2), it doesn’t store which rows are Red and which rows are Black. We know that every row in the index is either Red or Black, so we can use this index as a starting point, but we would need to do a Lookup to get the actual Color value.

    To fix this, we should INCLUDE the Color column in the filtered index.

    CREATE INDEX FIDX_Product2Color2 ON Production.Product (ProductID) INCLUDE (Color) WHERE (Color IN ('Red', 'Black'));

    Now we don’t need to hint at the index to use, and we see a residual predicate being used to make sure that we only pull Red rows out of the index.

    (image: the plan now uses the new filtered index, with a residual predicate on Color)

    So we see two reasons here for filtered indexes not being used, and two ways to help encourage them to be used more often. The first is to consider using OPTION (RECOMPILE) to help persuade the Query Optimizer not to consider generalising the query across different parameter values, and the second is to INCLUDE the columns that are used in the filter, in case the database engine needs those values later in the query.

    Do try to use filtered indexes, and be understanding about those times when they’re not used. There’s quite probably a simple explanation.

    @rob_farley

  • SQL Injection – the golden rule

    The problem with SQL Injection is that most people don’t realise the fundamental concept which makes SQL Injection vulnerability not only easy to spot, but also easy to prevent. And it’s the thing that SQL Injection has in common with countless other hacking mechanisms that have been around since the early days of computing.

    The simple truth is this: You are vulnerable if you confuse DATA with COMMAND.

    SQL Injection occurs when a piece of user-entered data makes its way into the command that is being executed. Not simply as a piece of data, such as the value that a column is being compared to, but when it becomes something which actually gets executed. Buffer overflow attacks worked in the same way – although the method of getting the data to be executed is different.

    So you should never put yourself in a situation where something that a user has entered gets executed. It doesn’t matter what sanitisation you have done, you should never do it. Creative hackers will be able to find ways around however you do your sanitisation. Just don’t run something that has come from the user – pass it in using parameters instead.

    Code within a stored procedure is trustworthy – SQL code written in a stored procedure can be trusted. But...

    Code within a web application is trustworthy – SQL code written in a web application can be trusted. But...

    ...but all this can be undermined if your developers tell user-entered code to be executed, rather than being treated as parameters.

    (Of course, if your server is compromised and someone can tell rogue commands to run, then all bets are off – I’m just looking at SQL Injection here)

    What happens with SQL Injection is this: a COMMAND is constructed that includes user-entered DATA, and is then executed. It doesn’t matter whether this is done within .NET code, or within perl code, or within a stored procedure – you just don’t ever execute data.

    There are plenty of guides to avoiding SQL Injection. They say to avoid using dynamic SQL; to avoid ad hoc SQL; to only use stored procedures. Most of them miss the point. They’re not bad, but following these rules doesn’t make you injection-proof. You simply don’t ever execute anything that came from the user.

    Dynamic SQL is fine. You can construct a query based on user input – so long as you don’t execute something that came from the user. If the user chooses a particular option, the dynamically-generated SQL can include or exclude some predicate, or can bring a new parameter into the mix. This gives the freedom of dynamic SQL without needing to become vulnerable to SQL Injection.

    Ad hoc SQL is also fine, and you don’t need to use stored procedures. It doesn’t matter whether your SQL code is stored in a stored procedure within the database, or whether your SQL code is stored within your application.

    Ad hoc code or dynamic SQL like this is okay:

    if (UserInputtedFirstName.Length > 0) {
       cmd += " AND u.FirstName = @fn ";
       params.Add("@fn", .......
       ......
    }

    but code like this is not:

    if (UserInputtedFirstName.Length > 0) {
       cmd += " AND u.FirstName = " + Sanitise(UserInputtedFirstName);
       ......
    }

    ...no matter what the Sanitise functionality is. The only way to protect against SQL Injection is to only use controlled commands, and never anything else.
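
    The same principle in T-SQL terms looks something like the sketch below, using sp_executesql. (The dbo.Users table and the @FirstName value are just hypothetical stand-ins here – the point is that the user’s value only ever travels as a parameter, never as part of the command text.)

    DECLARE @FirstName nvarchar(100) = N'Rob';   -- imagine this value came from the user
    DECLARE @qry nvarchar(max) = N'SELECT u.UserID, u.FirstName FROM dbo.Users AS u WHERE 1=1';

    IF @FirstName IS NOT NULL AND LEN(@FirstName) > 0
       SET @qry += N' AND u.FirstName = @fn';    -- controlled text is appended; the user value never enters the string

    EXEC sp_executesql @qry, N'@fn nvarchar(100)', @fn = @FirstName;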

    But what about generic procedures, that developers use to be able to query whichever objects they want?

    It’s easy to simply “Just say no” to procedures like this. After all – it’s not healthy for all kinds of reasons: the plan cache, the effective use of indexes, and more. But if you have a stored procedure like:

    create proc dbo.GetAnyData(@tablename nvarchar(256), @colname nvarchar(256), @filtercol nvarchar(256), @filterval sql_variant) as
    begin
       declare @qry nvarchar(max) = 'select ' + @colname + ' from ' + @tablename + ' where ' + @filtercol + ' = @val';
       exec sp_executesql @qry, N'@val sql_variant', @val = @filterval;
    end

    ...with an application that uses it, then you may be somewhat worried (and legitimately so) about SQL Injection, and be wondering how you’re going to refactor all the code that uses this procedure.

    So how do you make this safe from SQL Injection, so that the calls to the procedure aren’t susceptible, while all the redevelopment is done?

    Hopefully you know the answer – you need to control the commands.

    The issue is not with the use of @filterval – that’s being passed in as data, and is being used as data. It’s never part of the command itself. The problem is that @tablename, @colname and @filtercol are all passed in from the calling application, potentially even from the user, and then used as part of the command. This is what needs to be fixed.

    Luckily, we have access to a bunch of things that we know can be used here – actual table names and actual column names in the database’s catalog views. Therefore, we can safely use those things. They are controlled.

    create proc dbo.GetAnyData(@tablename nvarchar(256), @colname nvarchar(256), @filtercol nvarchar(256), @filterval sql_variant) as
    begin
       declare @knowntable nvarchar(256), @knowncol nvarchar(256), @knownfiltercol nvarchar(256);

       select @knowntable = quotename(object_name(object_id)), @knowncol = quotename(name)
       from sys.all_columns
       where object_id = object_id(@tablename) and name = @colname;
      
       select @knownfiltercol = quotename(name)
       from sys.all_columns
       where object_id = object_id(@tablename) and name = @filtercol;

       declare @qry nvarchar(max) = 'select ' + @knowncol + ' from ' + @knowntable + ' where ' + @knownfiltercol + ' = @val';
       if @qry is not null
          exec sp_executesql @qry, N'@val sql_variant', @val = @filterval;
    end

    I’m not saying this is good. I’m saying this is SAFE. We don’t execute anything that we don’t already control. The only tables we can query are ones that we know actually exist. The only columns that we can use are ones that must exist. You can always add more logic, to make sure that people can’t query system objects, and things like that – but they definitely can’t execute anything overly malicious. They can’t make calls to sp_configure or to xp_cmdshell. They can’t drop tables or update values. If you need to restrict the columns, you can easily filter the queries on all_columns, so that the calling code can’t access so much.

    SQL Injection is serious. It’s a huge problem for people who deal with databases – and not just SQL Server, but all databases. The solution is easy, but requires some level of discipline.

    Don’t execute user input, and don’t let your developers do it either (and if that means forcing them to use stored procedures that you can audit, then fine – do whatever it takes).

    Oh, and as this month’s T-SQL Tuesday (hosted by Kenneth Fisher, @sqlstudent144) is on the topic of security, that’s what this post is on.


    @rob_farley

  • Medians pre-SQL 2012

    SQL 2012 was a big release for working out the median in SQL Server, with the advent of the function PERCENTILE_CONT(). It’s a very elegant way of working out the median (hint, that’s the 0.5 point), even though it’s not actually an aggregate function, as I’ve written before.

    Plus – it doesn’t even perform well. About a year ago, Aaron Bertrand (@aaronbertrand) wrote a fantastic post about different methods for getting medians, and showed that PERCENTILE_CONT() is actually one of the slowest methods, and that the best method is to use an idea from Peter Larsson (@SwePeso) that uses an OFFSET-FETCH clause to grab the rows of interest before doing an average of them.

    Except that the OFFSET-FETCH clause was also new in 2012. So if you’re stuck on SQL 2008 R2 and earlier, you’re a bit more stuck.

    All the pre-SQL 2012 methods that Aaron showed used ROW_NUMBER() except one – which used a combination of MIN/MAX over each half of the data. But one method that Aaron didn’t explore in his post was to simulate OFFSET-FETCH in earlier versions. Let me show you…

    Here’s the OFFSET-FETCH method. Notice that it fetches either 1 or 2 rows (depending on whether the overall count is 1 or 2), but offsets by just under half of the set.

    SELECT    d.SalesPerson, w.Median
    FROM
    (
      SELECT SalesPerson, COUNT(*) AS y
      FROM dbo.Sales
      GROUP BY SalesPerson
    ) AS d
    CROSS APPLY
    (
      SELECT AVG(0E + Amount)
      FROM
      (
        SELECT z.Amount
         FROM dbo.Sales AS z
         WHERE z.SalesPerson = d.SalesPerson
         ORDER BY z.Amount
         OFFSET (d.y - 1) / 2 ROWS
         FETCH NEXT 2 - d.y % 2 ROWS ONLY
      ) AS f
    ) AS w(Median);

    What my pre-2012-compatible version does is to fetch slightly MORE than half the set first, and then get the top 1 or 2 rows of that, but in DESC order.

    SELECT    d.SalesPerson, w.Median
    FROM
    (
      SELECT SalesPerson, COUNT(*) AS y
      FROM dbo.Sales
      GROUP BY SalesPerson
    ) AS d
    CROSS APPLY
    (
      SELECT AVG(0E + Amount)
      FROM
      (
        SELECT TOP (2 - d.y % 2) Amount
        FROM
        (
        SELECT TOP (d.y / 2 + 1) z.Amount
         FROM dbo.Sales AS z
         WHERE z.SalesPerson = d.SalesPerson
         ORDER BY z.Amount
         ) AS t
         ORDER BY Amount DESC
      ) AS f
    ) AS w(Median);

    With OFFSET-FETCH, we’re grabbing the rows we want by skipping over the rows we’re not interested in until we find the ones that we are interested in. In the TOP/TOPDESC, we’re identifying the rows we want by the fact that they’re the bottom of the top slightly-more-than-half set.

    Other than that, the idea is exactly the same. The results are identical, but what about the performance?

    First, let’s give you the code to set up your environment (as found in Aaron’s post) – I used a clustered index.

    CREATE TABLE dbo.Sales(SalesPerson INT, Amount INT);
    GO
     
    --CREATE CLUSTERED INDEX x ON dbo.Sales(SalesPerson, Amount);
    --CREATE NONCLUSTERED INDEX x ON dbo.Sales(SalesPerson, Amount);
    --DROP INDEX x ON dbo.sales;
     
    ;WITH x AS
    (
      SELECT TOP (100) number FROM master.dbo.spt_values GROUP BY number
    )
    INSERT dbo.Sales WITH (TABLOCKX) (SalesPerson, Amount)
      SELECT x.number, ABS(CHECKSUM(NEWID())) % 99
      FROM x CROSS JOIN x AS x2 CROSS JOIN x AS x3;

    What I want to do to evaluate this is to look at the query plans. Once I’ve done that, I’ll make a comment about the performance and where it fits into the mix from Aaron’s post.

    So those plans… OFFSET-FETCH method first, followed by the TOP/TOPDESC method. I’m using a clustered index on the data – a nonclustered index gives the same shape but with nonclustered index operations instead of clustered index operations. Heaps are a different story that I’m not exploring here.

    (image: query plan for the OFFSET-FETCH method)

    As you’d expect, there’s a lot of similarity. Both use Nested Loops, grabbing the Counts from a Scan on the outer branch, with a Seek on the inner branch. And both inner branches have a Top Operator pulling the data out of a Seek. But the TOP/TOPDESC method has TWO Top operators, with a Sort in between. This is because of the ‘TOPDESC’ bit. If we had a ‘Bottom’ operator, then that would avoid the need for a Sort, but no such animal exists, and it does ‘Bottom’ by doing a Top of re-Sorted data. It’s very disappointing. The Top operator in the OFFSET-FETCH method has a new property called OffsetExpression, which it uses to skip over as many rows as it needs – it’s simply not supported pre-2012.

    (image: query plan for the TOP/TOPDESC method)

    (Quick side note: the arrow between the Compute Scalar and the right-most Top operator in both plans is quite thin – much thinner than you might expect. This is only a quirk of the plan because the Actuals haven’t been reported here. MSDN (https://technet.microsoft.com/en-us/library/ms178082.aspx) says: “Compute Scalar operators that appear in Showplans generated by SET STATISTICS XML might not contain the RunTimeInformation element. In graphical Showplans, Actual Rows, Actual Rebinds, and Actual Rewinds might be absent from the Properties window when the Include Actual Execution Plan option is selected in SQL Server Management Studio. When this occurs, it means that although these operators were used in the compiled query plan, their work was performed by other operators in the run-time query plan.” Therefore, the arrow coming out of the Compute Scalar operator is the width of the estimated number of rows, because it doesn’t have the actual number of rows. But it’s a Compute Scalar – it’s not going to change the number of rows, and you should consider the width of the arrow as being the width of the arrow going into it.)

    Of course, this TOP/TOPDESC method is slower than OFFSET-FETCH. If we had a ‘Bottom’ operator, I think it wouldn’t be very much slower, but here we have a Sort operator! And those things are bad. The plans estimated that the cost of the Sort would be 27% of the total query, and that the ratio between the two queries would be 58:42, which is 1.38:1. But remember that those Cost percentages are based on estimated values, and we know those estimates are quite a long way out.

    So instead, we use a more empirical method, which is to run them against each other.
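
    One simple way to do that (just a sketch – not necessarily how the timings below were captured) is to turn on the timing statistics and run the two versions back-to-back against a warm cache:

    SET STATISTICS TIME ON;   -- elapsed and CPU times appear in the Messages tab
    SET STATISTICS IO ON;     -- logical reads per table, for good measure

    -- ...run the OFFSET-FETCH version here, then the TOP/TOPDESC version,
    -- and compare the 'SQL Server Execution Times' reported for each...

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;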

    On my machine (a Surface Pro 2), with a warm cache, the OFFSET-FETCH method took around 380ms, compared to around 570ms for the TOP/TOPDESC. It’s definitely slower – no surprises there. It’s a good 50% slower, if not more. But this still makes it faster than any of the pre-SQL 2012 versions that Aaron used.

    I’m sure you’re going to point out that I’m clearly running this on a version of SQL Server that is at least 2012… so I ran it on a SQL 2008 R2 box as well, and found that the plan was identical to the one shown here, and that it was about 30% faster than the “2005_3” version from Aaron’s post with an index applied.

    So if you’re using SQL 2008 R2 (or earlier) still, then don’t dismiss the best-performing median function from Aaron’s post (thanks again, Peso!), but instead, consider coming up with a 2008R2-compatible version, as I’ve done here.

    Update: Another method is to consider simply filtering on ROW_NUMBER(), which isn’t included in Aaron’s post either. It’s still good, but doesn’t quite perform as quickly as the TOP/TOPDESC method on the million-row set, because it has to figure out the ROW_NUMBER() for a lot of rows. The OffsetExpression property in the Top operator of SQL 2012+ is your friend.

    SELECT d.SalesPerson, w.Median
    FROM
    (
      SELECT SalesPerson, COUNT(*) AS y
      FROM dbo.Sales
      GROUP BY SalesPerson
    ) AS d
    CROSS APPLY
    (
      SELECT AVG(0E + Amount)
      FROM
      (
        SELECT z.Amount, rn = ROW_NUMBER() OVER (ORDER BY z.Amount)
        FROM dbo.Sales AS z
        WHERE z.SalesPerson = d.SalesPerson
      ) AS f
      WHERE f.rn BETWEEN (d.y + 1) / 2 AND (d.y + 2) / 2
    ) AS w(Median);

    @rob_farley

  • APS / PDW Monitoring

    When you get an Analytics Platform System (APS) – the appliance that hosts a copy of SQL Server Parallel Data Warehouse Edition (PDW) and potentially a Hadoop cluster as well – one of the things that you get is a Management Console. It’s quite handy to be able to look at it and see if your appliance is healthy or not.

    The trouble with the Management Console, though, is what I’ve just said – you look at it to assess the health of your appliance. And this isn’t something that you really want to do. What are you going to do – look at it every morning and then get on with your day? If this is what you’re doing to monitor your environment, the chances are you won’t be watching at the moment you actually start to have a problem, and really you’ll be getting the alert when your phone rings, or at least after you’ve picked it up and got an earful of ‘angry customer’.

    You need a better solution – something which is going to actually let you know if a problem is detected, and which is going to store some historical information so that you can do some troubleshooting. You know – just like you have on your other SQL boxes. The thing is that PDW doesn’t come with something which can send emails. And it doesn’t come with a scheduler like SQL Agent which can be used to kick off jobs to collect data, or to check perfmon counters in case there’s an issue.

    So how do you monitor PDW?

    The answer is: from outside it.

    Ideally, your organisation has centralised monitoring anyway. Some server(s) that run around checking that everything is healthy. They check the health of the Windows boxes, they see how much disk space is free on all the volumes, they make sure that backups have been taken (not just SQL backups, but backups of whole servers), and run tests like DBCC, restores, and pings. If there’s a problem with the monitoring servers, this is recognised and understood.

    Now, you can roll your own one of these, and have it query DMVs like PDW’s sys.dm_pdw_component_health_alerts, sys.dm_pdw_errors, and sys.dm_pdw_os_performance_counters. Collecting this information (and the contents of many other DMVs) can help provide excellent coverage when troubleshooting, and also highlight potential issues before they arise. Running DBCC PDW_SHOWSPACEUSED from time to time is definitely important, to be conscious of how skew is looking – the list of things to keep an eye on goes on and on.
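
    A home-grown collector might periodically run queries along these lines (just a sketch – I’ve left the column lists as SELECT *, and the table name in the DBCC command is only an example):

    -- Recent appliance alerts and errors
    SELECT TOP (100) * FROM sys.dm_pdw_component_health_alerts;
    SELECT TOP (100) * FROM sys.dm_pdw_errors;

    -- Appliance-level performance counters
    SELECT TOP (100) * FROM sys.dm_pdw_os_performance_counters;

    -- Space used and skew for a particular distributed table (example name)
    DBCC PDW_SHOWSPACEUSED ("dbo.FactSales");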

    Something that shouldn’t be overlooked is the usefulness of System Center Operations Manager (even if I keep wanting to type ‘Centre’ instead of ‘Center’). There are SCOM Management Packs available to cater for PDW, HDInsight (another component within APS) and APS itself. If SCOM is part of your organisation, then configuring it to monitor your APS appliance is definitely worth doing. I’ve lifted the image here from the APS help file – if you’re at all familiar with SCOM, you’ll recognise it and see that you have good coverage of your APS environment with it. It should never fully replace using queries to look at the metadata within (for assessing skew, etc.), but you should definitely be using SCOM with APS if you can.

    (image: SCOM monitoring an APS appliance – from the APS help file)

    I mentioned that this image is part of the APS help file – it goes into quite some detail about setting up SCOM to work with APS, so if you’re considering APS, you should be able to reassure your IT staff that they will be able to use their existing SCOM environment to monitor the appliance still.

    Don’t neglect monitoring your APS box. When we get an appliance, it’s easy to let it just sit there and do its stuff, assuming that everything is going to be okay because it’s an appliance. We don’t monitor our kettles at home, but our businesses don’t depend on the health of the kettles (maybe the coffee machine, but that’s a different story). Monitoring doesn’t have to be hard work, but it does have to happen. Luckily, we get a bunch of tools to help us make that happen.

    And this is for yet another T-SQL Tuesday. The first for 2015, and the 62nd in all – hosted this time by Robert Pearl (@pearlknows).

    @rob_farley

  • Four SQL MVPs at LobsterPot – including three in Australia

    Today LobsterPot Solutions sets a new first. We are the only company to ever employ three current Australian SQL MVPs, giving us four awardees in total. Congratulations to Martin Cairney, who joins Julie Koesmarno (AUS), Ted Krueger (USA) and me (AUS) as recipients of this prestigious award. This demonstrates LobsterPot’s ongoing commitment to the SQL Server community, and shows that our consultants are truly influential in the SQL world.

    From Microsoft’s website about MVPs:
    Microsoft Most Valuable Professionals, or MVPs are exceptional community leaders who actively share their high-quality, real-world deep technical expertise with the community and with Microsoft. They are committed to helping others get the most out of their experience with Microsoft products and technologies.
    Technical communities play a vital role in the adoption and advancement of technology—and in helping our customers do great things with our products. The MVP Award provides us with an opportunity to say thank you and to bring the voice of community into our technology roadmap.

    This fits very closely with LobsterPot’s desire to help people with their data story. We help with the adoption and advancement of SQL Server, and help customers do great things with data. It’s no surprise that a high proportion of LobsterPot consultants are MVP awardees.

  • Merry Christmas

    I just wanted to take a moment to wish a Merry Christmas to you: people in the SQL Community; people I see at clients and the like; and especially those people I am incredibly privileged to have working at possibly the best SQL Consultancy in the world.

    To those who I have represent my brand: I love you guys! You’re all passionate about providing the best experience for our customers, developing the SQL community, and doing amazing things to help people improve their data story. I couldn’t be prouder of you all. Sure, there are times when I lose sleep (and hair) over stuff, but I know that we have each other’s backs, and that’s a brilliant thing. I’ve often likened us to that story about the tiger in a cage. The best way to defend such a tiger is to let it out of its cage. If I can help enable you, and remove any obstacles that come between you and your ability to be phenomenal, then that’s what I’ll try to do. We all have our different styles, but together I think we can be an incredible force. It’s been a crazy year in many ways, including starting the LobsterPot story in the US (and Ted – you’ve been incredible!), but we have even more exciting times ahead, I’m sure. The Microsoft data stack is developing quicker than ever, and people are using it in bigger and better ways all the time.

    Merry Christmas guys. Let’s continue to spread the SQL cheer… :)

    @rob_farley

  • Retrieving N rows per group

    Sometimes a forum response should just be a blog post… so here’s something I wrote over at http://dba.stackexchange.com/a/86765/4103.

    The question was somewhat staged I think, being from Paul White (@sql_kiwi), who definitely knows this stuff already.

    His question:

    I often need to select a number of rows from each group in a result set.

    For example, I might want to list the 'n' highest or lowest recent order values per customer.

    In more complex cases, the number of rows to list might vary per group (defined by an attribute of the grouping/parent record). This part is definitely optional/for extra credit and not intended to dissuade people from answering.

    What are the main options for solving these types of problems in SQL Server 2005 and later? What are the main advantages and disadvantages of each method?

    AdventureWorks examples (for clarity, optional)

    1. List the five most recent transaction dates and IDs from the TransactionHistory table, for each product that starts with a letter from M to R inclusive.
    2. Same again, but with n history lines per product, where n is five times the DaysToManufacture Product attribute.
    3. Same, for the special case where exactly one history line per product is required (the single most recent entry by TransactionDate, tie-break on TransactionID).

    And my answer:

    Let's start with the basic scenario.

    If I want to get some number of rows out of a table, I have two main options: ranking functions; or TOP.

    First, let's consider the whole set from Production.TransactionHistory for a particular ProductID:

    SELECT h.TransactionID, h.ProductID, h.TransactionDate
    FROM Production.TransactionHistory h
    WHERE h.ProductID = 800;

    This returns 418 rows, and the plan shows that it checks every row in the table looking for this - an unrestricted Clustered Index Scan, with a Predicate to provide the filter. 797 reads here, which is ugly.

    Expensive Scan with 'Residual' Predicate

    So let's be fair to it, and create an index that would be more useful. Our conditions call for an equality match on ProductID, followed by a search for the most recent by TransactionDate. We need the TransactionID returned too, so let's go with: CREATE INDEX ix_FindingMostRecent ON Production.TransactionHistory (ProductID, TransactionDate) INCLUDE (TransactionID);.

    Having done this, our plan changes significantly, and drops the reads down to just 3. So we're already improving things by over 250x or so...

    Improved plan

    Now that we've levelled the playing field, let's look at the top options - ranking functions and TOP.

    WITH Numbered AS
    (
    SELECT h.TransactionID, h.ProductID, h.TransactionDate, ROW_NUMBER() OVER (ORDER BY TransactionDate DESC) AS RowNum
    FROM Production.TransactionHistory h
    WHERE h.ProductID = 800
    )
    SELECT TransactionID, ProductID, TransactionDate
    FROM Numbered
    WHERE RowNum <= 5;
    
    SELECT TOP (5) h.TransactionID, h.ProductID, h.TransactionDate
    FROM Production.TransactionHistory h
    WHERE h.ProductID = 800
    ORDER BY TransactionDate DESC;

    Two plans - basic TOP\RowNum

    You will notice that the second (TOP) query is much simpler than the first, both in query and in plan. But very significantly, they both use TOP to limit the number of rows actually being pulled out of the index. The costs are only estimates and worth ignoring, but you can see a lot of similarity in the two plans, with the ROW_NUMBER() version doing a tiny amount of extra work to assign numbers and filter accordingly, and both queries end up doing just 2 reads to do their work. The Query Optimizer certainly recognises the idea of filtering on a ROW_NUMBER() field, realising that it can use a Top operator to ignore rows that aren't going to be needed. Both these queries are good enough - TOP isn't so much better that it's worth changing code, but it is simpler and probably clearer for beginners.

    So this works for a single product. But we need to consider what happens if we need to do this across multiple products.

    The iterative programmer is going to consider the idea of looping through the products of interest, and calling this query multiple times, and we can actually get away with writing a query in this form - not using cursors, but using APPLY. I'm using OUTER APPLY, figuring that we might want to return the Product with NULL, if there are no Transactions for it.

    SELECT p.Name, p.ProductID, t.TransactionID, t.TransactionDate
    FROM 
    Production.Product p
    OUTER APPLY (
        SELECT TOP (5) h.TransactionID, h.ProductID, h.TransactionDate
        FROM Production.TransactionHistory h
        WHERE h.ProductID = p.ProductID
        ORDER BY TransactionDate DESC
    ) t
    WHERE p.Name >= 'M' AND p.Name < 'S';

    The plan for this is the iterative programmers' method - Nested Loop, doing a Top operation and Seek (those 2 reads we had before) for each Product. This gives 4 reads against Product, and 360 against TransactionHistory.

    APPLY plan

    Using ROW_NUMBER(), the method is to use PARTITION BY in the OVER clause, so that we restart the numbering for each Product. This can then be filtered like before. The plan ends up being quite different. The logical reads are about 15% lower on TransactionHistory, with a full Index Scan going on to get the rows out.
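
    The query being described is essentially a sketch like this – the earlier CTE with PARTITION BY h.ProductID added and the join to Product included (the choice of h.ProductID rather than p.ProductID is discussed below):

    WITH Numbered AS
    (
    SELECT p.Name, p.ProductID, h.TransactionID, h.TransactionDate, ROW_NUMBER() OVER (PARTITION BY h.ProductID ORDER BY h.TransactionDate DESC) AS RowNum
    FROM Production.Product p
    LEFT JOIN Production.TransactionHistory h ON h.ProductID = p.ProductID
    WHERE p.Name >= 'M' AND p.Name < 'S'
    )
    SELECT Name, ProductID, TransactionID, TransactionDate
    FROM Numbered n
    WHERE RowNum <= 5;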

    ROW_NUMBER plan

    Significantly, though, this plan has an expensive Sort operator. The Merge Join doesn't seem to maintain the order of rows in TransactionHistory, so the data must be re-sorted to be able to find the row numbers. It's fewer reads, but this blocking Sort could feel painful. Using APPLY, the Nested Loop will return the first rows very quickly, after just a few reads, but with a Sort, ROW_NUMBER() will only return rows after most of the work has been finished.

    Interestingly, if the ROW_NUMBER() query uses INNER JOIN instead of LEFT JOIN, then a different plan comes up.

    ROW_NUMBER() with INNER JOIN

    This plan uses a Nested Loop, just like with APPLY. But there's no Top operator, so it pulls all the transactions for each product, and uses a lot more reads than before - 492 reads against TransactionHistory. There isn't a good reason for it not to choose the Merge Join option here, so I guess the plan was considered 'Good Enough'. Still - it doesn't block, which is nice - just not as nice as APPLY.

    The PARTITION BY column that I used for ROW_NUMBER() was h.ProductID in both cases, because I had wanted to give the QO the option of producing the RowNum value before joining to the Product table. If I use p.ProductID, we see the same shape plan as with the INNER JOIN variation.

    WITH Numbered AS
    (
    SELECT p.Name, p.ProductID, h.TransactionID, h.TransactionDate, ROW_NUMBER() OVER (PARTITION BY p.ProductID ORDER BY h.TransactionDate DESC) AS RowNum
    FROM Production.Product p
    LEFT JOIN Production.TransactionHistory h ON h.ProductID = p.ProductID
    WHERE p.Name >= 'M' AND p.Name < 'S'
    )
    SELECT Name, ProductID, TransactionID, TransactionDate
    FROM Numbered n
    WHERE RowNum <= 5;

    But the Join operator says 'Left Outer Join' instead of 'Inner Join'. The number of reads is still just under 500 against the TransactionHistory table.

    PARTITION BY on p.ProductID instead of h.ProductID

    Anyway - back to the question at hand...

    We've answered question 1, with two options that you could pick and choose from. Personally, I like the APPLY option.

    To extend this to use a variable number (question 2), the 5 just needs to be changed accordingly. Oh, and I added another index, so that there was an index on Production.Product.Name that included the DaysToManufacture column.

    WITH Numbered AS
    (
    SELECT p.Name, p.ProductID, p.DaysToManufacture, h.TransactionID, h.TransactionDate, ROW_NUMBER() OVER (PARTITION BY h.ProductID ORDER BY h.TransactionDate DESC) AS RowNum
    FROM Production.Product p
    LEFT JOIN Production.TransactionHistory h ON h.ProductID = p.ProductID
    WHERE p.Name >= 'M' AND p.Name < 'S'
    )
    SELECT Name, ProductID, TransactionID, TransactionDate
    FROM Numbered n
    WHERE RowNum <= 5 * DaysToManufacture;
    
    SELECT p.Name, p.ProductID, t.TransactionID, t.TransactionDate
    FROM 
    Production.Product p
    OUTER APPLY (
        SELECT TOP (5 * p.DaysToManufacture) h.TransactionID, h.ProductID, h.TransactionDate
        FROM Production.TransactionHistory h
        WHERE h.ProductID = p.ProductID
        ORDER BY TransactionDate DESC
    ) t
    WHERE p.Name >= 'M' AND p.Name < 'S';

    And both plans are almost identical to what they were before!

    Variable rows

    Again, ignore the estimated costs - but I still like the TOP scenario, as it is so much simpler, and the plan has no blocking operator. The reads are fewer on TransactionHistory because of the high number of zeroes in DaysToManufacture, but in real life, I doubt we'd be picking that column. ;)

    One way to avoid the block is to come up with a plan that handles the ROW_NUMBER() bit to the right (in the plan) of the join. We can persuade this to happen by doing the join outside the CTE. (Edited because of a silly typo that meant that I turned my Outer Join into an Inner Join.)

    WITH Numbered AS
    (
    SELECT h.TransactionID, h.ProductID, h.TransactionDate, ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY TransactionDate DESC) AS RowNum
    FROM Production.TransactionHistory h
    )
    SELECT p.Name, p.ProductID, t.TransactionID, t.TransactionDate
    FROM Production.Product p
    LEFT JOIN Numbered t ON t.ProductID = p.ProductID
    AND t.RowNum <= 5 * p.DaysToManufacture
    WHERE p.Name >= 'M' AND p.Name < 'S';

    (image: query plan for the join outside the CTE)

    The plan here looks simpler - it's not blocking, but there's a hidden danger.

    Notice the Compute Scalar that's pulling data from the Product table. This is working out the 5 * p.DaysToManufacture value. This value isn't being passed into the branch that's pulling data from the TransactionHistory table – it's being used in the Merge Join, as a Residual.

    (image: plan detail – the Merge Join, with the 5 * DaysToManufacture value used as a Residual)

    So the Merge Join is consuming ALL the rows, not just the first however-many-are-needed, but all of them and then doing a residual check. This is dangerous as the number of transactions increases. I'm not a fan of this scenario - residual predicates in Merge Joins can quickly escalate. Another reason why I prefer the APPLY/TOP scenario.

    In the special case where it's exactly one row, for question 3, we can obviously use the same queries, but with 1 instead of 5. But then we have an extra option, which is to use regular aggregates.

    SELECT ProductID, MAX(TransactionDate)
    FROM Production.TransactionHistory
    GROUP BY ProductID;

    A query like this would be a useful start, and we could easily modify it to pull out the TransactionID as well for tie-break purposes (using a concatenation which would then be broken down), but we either look at the whole index, or we dive in product by product, and we don't really get a big improvement on what we had before in this scenario.
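
    A sketch of that concatenate-then-split idea (not necessarily the exact approach I'd take) might look like this - pack a sortable date string and a zero-padded TransactionID together, take the MAX per ProductID, then break the winner apart again:

    SELECT ProductID,
           CONVERT(datetime, LEFT(packed, 23), 121) AS TransactionDate,
           CONVERT(int, RIGHT(packed, 10)) AS TransactionID
    FROM
    (
      SELECT ProductID,
             MAX(CONVERT(char(23), TransactionDate, 121)
                 + RIGHT('0000000000' + CONVERT(varchar(10), TransactionID), 10)) AS packed
      FROM Production.TransactionHistory
      GROUP BY ProductID
    ) AS x;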

    But I should point out that we're looking at a particular scenario here. With real data, and with an indexing strategy that may not be ideal, mileage may vary considerably. Despite the fact that we've seen that APPLY is strong here, it can be slower in some situations. It rarely blocks though, as it has a tendency to use Nested Loops, which many people (myself included) find very appealing.

    I haven't tried to explore parallelism here, or dived particularly deep into question 3, which I see as a special case that people rarely want, given the complication of concatenating and splitting. The main thing to consider here is that these two options are both very strong.

    I prefer APPLY. It's clear, it uses the Top operator well, and it rarely causes blocking.

    @rob_farley

  • Will 2015 be a big year for the SQL community?

    In Australia, almost certainly yes.

    Australia recently saw two Azure data centres open, meaning that customers can now consider hosting data in Azure without worrying about it going overseas. Whether you’re considering SQL Database or having an Azure VM with SQL on it, the story has vastly improved here in Australia, and conversations will go further.

    The impact of this will definitely reach the community…

    …a community which is moving from strength to strength in itself.

    I say that because in 2014 we have seen new PASS Chapters pop up in Melbourne and Sydney (user groups that have existed for some time but have now been aligned with PASS); many of the prominent Australian partner organisations have MVPs on staff now, which was mentioned a few times at the Australian Partner Conference in September; and SQL Saturdays have come a long way since the first ones were run around the country in 2012. February will see SQL Saturday 365 in Melbourne host around 30 sessions, and build on its 2013 effort of becoming one of the ten largest SQL Saturday events in the world. Microsoft Australia seems more receptive than ever to the SQL Server community, and I’m seeing individuals pushing into the community as well.

    From a personal perspective, I think 2015 will be an interesting year. As well as being a chapter leader and regional mentor, I know that I need to develop some new talks, after getting rejected to speak at the PASS Summit, but I also want to take the time to develop other speakers, as I have done in recent years.

    I also want to write more – both blogs and white papers. I’ve blogged every month for at least five years, but many months that’s just the T-SQL Tuesday post. (Oh yeah – this post is for one of those too, hosted by Wayne Sheffield (@DBAWayne) on the topic of ‘Giving Back’.) So I want to be able to write a lot more than 12 posts in the year, and take the opportunity to get deeper in the content. I know I have a lot to talk about, whether it be in the BI space, or about query plans, or PDW, or security – there really are a lot of topics I could cover – I just need to reserve the time to get my content out there.

    So challenge me. If you want help with an abstract, or a talk outline (which I know is very different to an abstract), or you want me to blog on a particular topic, then let me know and I’ll see what I can do. I want to give even more to the community, and if you’re in the community, that should include you!

    @rob_farley

  • Minimising Data Movement in PDW Using Query Optimisation Techniques

    This is a white paper that I put together recently about APS / PDW Query Optimisation. You may have seen it at http://blogs.technet.com/b/dataplatforminsider/archive/2014/11/14/aps-best-practice-how-to-optimize-query-performance-by-minimizing-data-movement.aspx as well, but in case you haven’t, read on!

    I think the significance of this paper is big, because most people who deal with data warehouses (and PDW even more so) haven’t spent much time thinking about Query Optimisation techniques, and certainly not about how they can leverage features of SQL Server’s Query Optimizer to minimise data movement (which is probably the largest culprit for poor performance in a PDW environment).

    Oh, and I have another one that I’m writing too...

     


    The Analytics Platform System, with its MPP SQL Server engine (SQL Server Parallel Data Warehouse) can deliver performance and scalability for analytics workloads that you may not have expected from SQL Server. But there are key differences in working with SQL Server PDW and SQL Server Enterprise Edition that one should be aware of in order to take full advantage of the SQL Server PDW capabilities. One of the most important considerations when tuning queries in Microsoft SQL Server Parallel Data Warehouse is the minimisation of data movement. This post shows a useful technique regarding the identification of redundant joins through additional predicates that simulate check constraints.

    Microsoft’s PDW, part of the Analytics Platform System (APS), offers scale-out technology for data warehouses. This involves spreading data across a number of SQL Server nodes and distributions, such that systems can host up to many petabytes of data. To achieve this, queries which use data from multiple distributions to satisfy joins must leverage the Data Movement Service (DMS) to relocate data during the execution of the query. This data movement is both a blessing and a curse; a blessing because it is the fundamental technology which allows the scale-out features to work, and a curse because it can be one of the most expensive parts of query execution. Furthermore, tuning to avoid data movement is something with which many SQL Server query tuning experts have little experience, as it is unique to the Parallel Data Warehouse edition of SQL Server.

    Regardless of whether data in PDW is stored in a column-store or row-store manner, or whether it is partitioned or not, there is a decision to be made as to whether a table is to be replicated or distributed. Replicated tables store a full copy of their data on each compute node of the system, while distributed tables distribute their data across distributions, of which there are eight on each compute node. In a system with six compute nodes, there would be forty-eight distributions, with an average of less than 2.1% (100% / 48) of the data in each distribution.

    When deciding whether to distribute or replicate data, there are a number of considerations to bear in mind. Replicated data uses more storage and also has a larger management overhead, but can be more easily joined to data, as every SQL node has local access to replicated data. By distributing larger tables according to the hash of one of the table columns (known as the distribution key), the overhead of both reading and writing data is reduced – effectively reducing the size of databases by an order of magnitude.
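
    In PDW’s DDL, this choice is expressed in the WITH clause of CREATE TABLE. A minimal sketch (the table and column names here are purely illustrative):

    -- Distributed: rows are hashed on CustomerKey across the distributions
    CREATE TABLE dbo.Sales
    (
        SalesKey bigint NOT NULL,
        CustomerKey int NOT NULL,
        Amount money NOT NULL
    )
    WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);

    -- Replicated: a full copy is stored on every compute node
    CREATE TABLE dbo.DimDate
    (
        DateKey int NOT NULL,
        CalendarDate date NOT NULL
    )
    WITH (DISTRIBUTION = REPLICATE, CLUSTERED INDEX (DateKey));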

    Having decided to distribute data, choosing which column to use as the distribution key is driven by factors including the minimisation of data movement and the reduction of skew. Skew is important because if a distribution has much more than the average amount of data, this can affect query time. However, the minimisation of data movement is probably the most significant factor in distribution-key choice.

    Joining two tables together involves identifying whether rows from each table match according to a number of predicates, but to do this, the two rows must be available on the same compute node. If one of the tables is replicated, this requirement is already satisfied (although it might need to be ‘trimmed’ to enable a left join), but if both tables are distributed, then the data is only known to be on the same node if one of the join predicates is an equality predicate between the distribution keys of the tables, and the data types of those keys are exactly identical (including nullability and length). More can be read about this in the excellent whitepaper about Query Execution in Parallel Data Warehouse at http://gsl.azurewebsites.net/Portals/0/Users/Projects/pdwau3/sigmod2012.pdf

    To avoid data movement between commonly-performed joins, creativity is often needed by the data warehouse designers. This could involve the addition of extra columns to tables, such as adding the CustomerKey to many fact data tables (and using this as the distribution key) – since the orders, items, payments, and other information required for a given report are all ultimately about a customer – and adding additional predicates to each join to alert the PDW Engine that only rows within the same distribution could possibly match. This is thinking that is alien for most data warehouse designers, who would typically feel that adding CustomerKey to a table not directly related to a Customer dimension is against best-practice advice.

    Another technique commonly used by PDW data warehouse designers, but rarely seen in other SQL Server data warehouses, is splitting tables into two – either vertically or horizontally. Both are relatively common in PDW, to avoid some of the problems that can otherwise occur.

    Splitting a table vertically is frequently done to reduce the impact of skew when the ideal distribution key for joins is not evenly distributed. Imagine the scenario of identifiable customers and unidentifiable customers – increasingly the situation, as stores have loyalty programs allowing them to identify a large portion (but not all) of their customers. For the analysis of shopping trends, it could be very useful to have data distributed by customer, but if half the customers are unknown, there will be a large amount of skew.

    To solve this, sales could be split into two tables, such as Sales_KnownCustomer (distributed by CustomerKey) and Sales_UnknownCustomer (distributed by some other column). When analysing by customer, the table Sales_KnownCustomer could be used, including the CustomerKey as an additional (even if redundant) join predicate. A view performing a UNION ALL over the two tables could be used to allow reports that need to consider all Sales.
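
    In its simplest form, such a view is just a sketch like this (a variation with extra predicates appears later in this post):

    CREATE VIEW dbo.Sales AS
    SELECT *
    FROM dbo.Sales_KnownCustomer
    UNION ALL
    SELECT *
    FROM dbo.Sales_UnknownCustomer;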

    The query overhead of having the two tables is potentially high, especially if we consider tables for Sales, SaleItems, Deliveries, and more, which might all need to be split into two to avoid skew while minimising data movement, using CustomerKey as the distribution key when known to allow customer-based analysis, and SalesKey when the customer is unknown.

    By distributing on a common key, the effect is to create mini-databases which are split out according to groups of customers, with all of the data about a particular customer residing in a single database. This is similar to the way that people scale out when doing so manually, rather than using a system such as PDW. Of course, there is a lot of additional overhead when trying to scale out manually, such as working out how to execute queries that do involve some amount of data movement.

    By splitting up the tables into ones for known and unknown customers, queries that were looking something like the following:

    SELECT …
    FROM Sales AS s
    JOIN SaleItems AS si
    ON si.SalesKey = s.SalesKey
    JOIN Delivery_SaleItems AS dsi
    ON dsi.LineItemKey = si.LineItemKey
    JOIN Deliveries AS d
    ON d.DeliveryKey = dsi.DeliveryKey

    …would become something like:

    SELECT …
    FROM Sales_KnownCustomer AS s
    JOIN SaleItems_KnownCustomer AS si
    ON si.SalesKey = s.SalesKey
    AND si.CustomerKey = s.CustomerKey
    JOIN Delivery_SaleItems_KnownCustomer AS dsi
    ON dsi.LineItemKey = si.LineItemKey
    AND dsi.CustomerKey = s.CustomerKey
    JOIN Deliveries_KnownCustomer AS d
    ON d.DeliveryKey = dsi.DeliveryKey
    AND d.CustomerKey = s.CustomerKey
    UNION ALL
    SELECT …
    FROM Sales_UnknownCustomer AS s
    JOIN SaleItems_UnknownCustomer AS si
    ON si.SalesKey = s.SalesKey
    JOIN Delivery_SaleItems_UnknownCustomer AS dsi
    ON dsi.LineItemKey = si.LineItemKey
    AND dsi.SalesKey = s.SalesKey
    JOIN Deliveries_UnknownCustomer AS d
    ON d.DeliveryKey = dsi.DeliveryKey
    AND d.SalesKey = s.SalesKey

    I’m sure you can appreciate that this becomes a much larger effort for query writers, and the existence of views to simplify querying back to the earlier shape could be useful. If both CustomerKey and SalesKey were being used as distribution keys, then joins between the views would require both, but this can be incorporated into logical layers such as Data Source Views much more easily than using UNION ALL across the results of many joins. A DSV or Data Model could easily define relationships between tables using multiple columns so that self-serving reporting environments leverage the additional predicates.

    The use of views should be considered very carefully, as it is easily possible to end up with views that nest views that nest view that nest views, and an environment that is very hard to troubleshoot and performs poorly. With sufficient care and expertise, however, there are some advantages to be had.

    The resultant query would look something like:

    SELECT …
    FROM Sales AS s
    JOIN SaleItems AS si
    ON si.SalesKey = s.SalesKey
    AND si.CustomerKey = s.CustomerKey
    JOIN Delivery_SaleItems AS dsi
    ON dsi.LineItemKey = si.LineItemKey
    AND dsi.CustomerKey = s.CustomerKey
    AND dsi.SalesKey = s.SalesKey
    JOIN Deliveries AS d
    ON d.DeliveryKey = dsi.DeliveryKey
    AND d.CustomerKey = s.CustomerKey
    AND d.SalesKey = s.SalesKey

    Joining multiple sets of tables which have been combined using UNION ALL is not the same as performing a UNION ALL of sets of tables which have been joined. Much like any high school mathematics teacher will happily explain that (a*b)+(c*d) is not the same as (a+c)*(b+d), additional combinations need to be considered when the logical order of joins and UNION ALLs is changed.

    (diagram: joining two UNION ALL pairs – all four combinations of branches must be considered)

    Notice that when we have (TableA1 UNION ALL TableA2) JOIN (TableB1 UNION ALL TableB2), we must perform joins not only between TableA1 and TableB1, and TableA2 and TableB2, but also TableA1 and TableB2, and TableB1 and TableA2. These last two combinations do not involve tables with common distribution keys, and therefore we would see data movement. This is despite the fact that we know that there can be no matching rows in those combinations, because some are for KnownCustomers and the others are for UnknownCustomers. Effectively, the relationships between the tables would be more like the following diagram:

    (diagram: the effective relationships between the KnownCustomer and UnknownCustomer tables)

    There is an important stage of Query Optimization which must be considered here, and which can be leveraged to remove the need for data movement when this pattern is applied – that of Contradiction.

    The contradiction algorithm is an incredibly useful but underappreciated stage of Query Optimization. Typically it is explained using an obvious contradiction such as WHERE 1=2. Notice the effect on the query plans of using this predicate.
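
    The queries behind those plans would be something along these lines (just a sketch, using one of the tables from the earlier examples):

    -- The same query, with and without an obvious contradiction
    SELECT SalesKey, CustomerKey
    FROM dbo.Sales_KnownCustomer;

    SELECT SalesKey, CustomerKey
    FROM dbo.Sales_KnownCustomer
    WHERE 1 = 2;   -- the contradiction is detected at optimization time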

    (image: the two query plans, with and without the contradiction)

    Because the Query Optimizer recognises that no rows can possibly satisfy the predicate WHERE 1=2, it does not access the data structures seen in the first query plan.

    This is useful, but many readers may not think that queries with such an obvious contradiction are going to appear in their code.

    But suppose the views that perform a UNION ALL are expressed in this form:

    CREATE VIEW dbo.Sales AS
    SELECT *
    FROM dbo.Sales_KnownCustomer
    WHERE CustomerID > 0
    UNION ALL
    SELECT *
    FROM dbo.Sales_UnknownCustomer
    WHERE CustomerID = 0;

    Now, we see a different kind of behaviour.

    Before the predicates are used, the query on the views is rewritten as follows (with SELECT clauses replaced by ellipses).

    SELECT …
    FROM (SELECT …
    FROM (SELECT ...
    FROM [sample_vsplit].[dbo].[Sales_KnownCustomer] AS T4_1
    UNION ALL
    SELECT …
    FROM [tempdb].[dbo].[TEMP_ID_4208] AS T4_1) AS T2_1
    INNER JOIN
    (SELECT …
    FROM (SELECT …
    FROM [sample_vsplit].[dbo].[SaleItems_KnownCustomer] AS T5_1
    UNION ALL
    SELECT …
    FROM [tempdb].[dbo].[TEMP_ID_4209] AS T5_1) AS T3_1
    INNER JOIN
    (SELECT …
    FROM (SELECT …
    FROM [sample_vsplit].[dbo].[Delivery_SaleItems_KnownCustomer] AS T6_1
    UNION ALL
    SELECT …
    FROM [tempdb].[dbo].[TEMP_ID_4210] AS T6_1) AS T4_1
    INNER JOIN
    (SELECT …
    FROM [sample_vsplit].[dbo].[Deliveries_KnownCustomer] AS T6_1
    UNION ALL
    SELECT …
    FROM [tempdb].[dbo].[TEMP_ID_4211] AS T6_1) AS T4_2
    ON (([T4_2].[CustomerKey] = [T4_1].[CustomerKey])
    AND ([T4_2].[SalesKey] = [T4_1].[SalesKey])
    AND ([T4_2].[DeliveryKey] = [T4_1].[DeliveryKey]))) AS T3_2
    ON (([T3_1].[CustomerKey] = [T3_2].[CustomerKey])
    AND ([T3_1].[SalesKey] = [T3_2].[SalesKey])
    AND ([T3_2].[SaleItemKey] = [T3_1].[SaleItemKey]))) AS T2_2
    ON (([T2_2].[CustomerKey] = [T2_1].[CustomerKey])
    AND ([T2_2].[SalesKey] = [T2_1].[SalesKey]))) AS T1_1

    Whereas with the inclusion of the additional predicates, the query simplifies to:

    SELECT …
    FROM (SELECT …
    FROM (SELECT …
    FROM [sample_vsplit].[dbo].[Sales_KnownCustomer] AS T4_1
    WHERE ([T4_1].[CustomerKey] > 0)) AS T3_1
    INNER JOIN
    (SELECT …
    FROM (SELECT …
    FROM [sample_vsplit].[dbo].[SaleItems_KnownCustomer] AS T5_1
    WHERE ([T5_1].[CustomerKey] > 0)) AS T4_1
    INNER JOIN
    (SELECT …
    FROM (SELECT …
    FROM [sample_vsplit].[dbo].[Delivery_SaleItems_KnownCustomer] AS T6_1
    WHERE ([T6_1].[CustomerKey] > 0)) AS T5_1
    INNER JOIN
    (SELECT …
    FROM [sample_vsplit].[dbo].[Deliveries_KnownCustomer] AS T6_1
    WHERE ([T6_1].[CustomerKey] > 0)) AS T5_2
    ON (([T5_2].[CustomerKey] = [T5_1].[CustomerKey])
    AND ([T5_2].[SalesKey] = [T5_1].[SalesKey])
    AND ([T5_2].[DeliveryKey] = [T5_1].[DeliveryKey]))) AS T4_2
    ON (([T4_1].[CustomerKey] = [T4_2].[CustomerKey])
    AND ([T4_1].[SalesKey] = [T4_2].[SalesKey])
    AND ([T4_2].[SaleItemKey] = [T4_1].[SaleItemKey]))) AS T3_2
    ON (([T3_2].[CustomerKey] = [T3_1].[CustomerKey])
    AND ([T3_2].[SalesKey] = [T3_1].[SalesKey]))
    UNION ALL
    SELECT …
    FROM (SELECT …
    FROM [sample_vsplit].[dbo].[Sales_UnknownCustomer] AS T4_1
    WHERE ([T4_1].[CustomerKey] = 0)) AS T3_1
    INNER JOIN
    (SELECT …
    FROM (SELECT …
    FROM [sample_vsplit].[dbo].[SaleItems_UnknownCustomer] AS T5_1
    WHERE ([T5_1].[CustomerKey] = 0)) AS T4_1
    INNER JOIN
    (SELECT …
    FROM (SELECT …
    FROM [sample_vsplit].[dbo].[Delivery_SaleItems_UnknownCustomer] AS T6_1
    WHERE ([T6_1].[CustomerKey] = 0)) AS T5_1
    INNER JOIN
    (SELECT …
    FROM [sample_vsplit].[dbo].[Deliveries_UnknownCustomer] AS T6_1
    WHERE ([T6_1].[CustomerKey] = 0)) AS T5_2
    ON (([T5_2].[CustomerKey] = [T5_1].[CustomerKey])
    AND ([T5_2].[SalesKey] = [T5_1].[SalesKey])
    AND ([T5_2].[DeliveryKey] = [T5_1].[DeliveryKey]))) AS T4_2
    ON (([T4_1].[CustomerKey] = [T4_2].[CustomerKey])
    AND ([T4_1].[SalesKey] = [T4_2].[SalesKey])
    AND ([T4_2].[SaleItemKey] = [T4_1].[SaleItemKey]))) AS T3_2
    ON (([T3_2].[CustomerKey] = [T3_1].[CustomerKey])
    AND ([T3_2].[SalesKey] = [T3_1].[SalesKey]))) AS T1_1

    This may seem more complex – it’s certainly longer – but this is the preferred shape: a UNION ALL of joins (each using the original join conditions), rather than joins across UNION ALLs. This is a powerful rewrite of the query.

    [Image: joins3 diagram – the rewritten query as a UNION ALL of joins]

    Furthermore, the astute PDW-familiar reader will quickly realise that the UNION ALL of two local queries (queries that don’t require data movement) is also local, and that therefore, this query is completely local. The TEMP_ID_NNNNN tables in the first rewrite are further evidence that data movement was required there.

    When the two plans are shown using PDW’s EXPLAIN keyword, the difference becomes even clearer.
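
    To reproduce this, the earlier query can simply be prefixed with the EXPLAIN keyword, which returns the plan rather than running the query. A sketch, using the views defined above (the column list is just illustrative):

    EXPLAIN
    SELECT s.SalesKey, d.DeliveryKey
    FROM dbo.Sales AS s
    JOIN dbo.SaleItems AS si
    ON si.SalesKey = s.SalesKey
    AND si.CustomerKey = s.CustomerKey
    JOIN dbo.Delivery_SaleItems AS dsi
    ON dsi.SaleItemKey = si.SaleItemKey
    AND dsi.CustomerKey = s.CustomerKey
    AND dsi.SalesKey = s.SalesKey
    JOIN dbo.Deliveries AS d
    ON d.DeliveryKey = dsi.DeliveryKey
    AND d.CustomerKey = s.CustomerKey
    AND d.SalesKey = s.SalesKey;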

    The first plan appears as follows, and it is obvious that there is a large amount of data movement involved.

    [Image: EXPLAIN output for the first plan, showing the data movement steps]

    [Image: EXPLAIN output for the second plan, with no data movement steps]

    The queries passed in are identical, but the altered definitions of the views have removed the need for any data movement at all. This should allow your query to run a little faster. Ok, a lot faster.

    Summary

    When splitting distributed tables vertically to avoid skew, views over those tables should include predicates which reiterate the conditions that cause the data to be populated into each table. This provides additional information to the PDW Engine that can remove unnecessary data movement, resulting in much-improved performance, both for standard reports using designed queries, and ad hoc reports that use a data model.

     

    Check us out at www.lobsterpot.com.au or talk to me via Twitter at @rob_farley

  • Learning through others

    This PASS Summit was a different experience for me – I wasn’t speaking. I’ve presented at three of the five PASS Summits I’ve been to; the only other one I hadn’t spoken at was 2012, while I was a PASS Director (and had been told I shouldn’t submit talks – advice that I’d ignored in 2013). I have to admit that I really missed presenting, both in 2012 and this year, and I will need to improve my session abstracts to make sure I get selected in future years.

    I’m not a very good ‘session attendee’ on the whole – it’s not my preferred style of learning – but I still wanted to go, because of the learning involved. Sometimes I will learn a lot from the various things that are mentioned in the few sessions I go to, but more significantly, I learn a lot from discussions with other people. I hear what they are doing with technology, and that encourages me to explore those technologies further. It’s not quite at the point of learning by osmosis simply by being in the presence of people who know stuff, but by developing relationships with people, and hearing them speak about the things they’re doing, I definitely learn a lot.

    Of course, I don’t get to know people for the sake of learning. I get to know people because I like getting to know people. But of course, one of the things I have in common with these people is SQL, and conversations often come around to that. And I know that I learn a lot from those conversations. I don’t have the luxury of living near many (any?) of my friends in the data community, and spending time with them in person definitely helps me.

    [Image: T-SQL Tuesday logo]

    And it’s not just SQL stuff that I learn. This month’s T-SQL Tuesday (for which this is a post) is hosted by Chris Yates (@YatesSQL), who I got to run alongside on one of the mornings. Even that was a learning experience for me, as we chatted about all kinds of things, and I listened to my feet hitting the ground – another technique I learned from a community – and made sure I stuck to my running form to minimise the pain I’d be feeling later in the day. Talking to Chris while I ran helped immensely, and I was far less sore than I thought I might be.

    On the SQL side, I got to learn about how excited people are about scale-out, with technologies like Stretched Tables coming very soon. As someone involved in the Parallel Data Warehouse space (and seriously – how thrilled was I to be able to chat with Dr Rimma Nehme, who was involved in the PDW Query Optimizer), scale-out is very much in my thoughts, and seeing what Microsoft is doing in this space is great – but learning what other people in the community are thinking about it is even more significant for me.

    @rob_farley 

    PS: This is the 60th T-SQL Tuesday. Huge thanks to Adam Machanic (@adammachanic) for starting this, and giving me something to write about each month these last five years.

  • PASS Summit WIT Lunch

    With the pleasant sound of cutlery on crockery, those lucky enough to secure tickets to the WIT Lunch at the PASS Summit get to listen to an interview with Kimberly Bryant, who is the founder of a non-profit organisation called Black Girls Code – helping teenaged girls from low-privilege communities to get into technology.

    She calls herself an Accidental Entrepreneur, driven by her passion to see the less-privileged have opportunities to explore an industry that was dominated by a very different part of the community. Her daughter was interested in tech, and went on a tech-focused summer camp, where she was the only non-white kid, and one of only three girls. With a crowd of about 40, that was less than ten percent of the camp.

    What Kimberly saw at the camp, and in other environments that are dominated by a particular demographic, was that the people who were providing for the group would cater for the masses, and not the minorities. From an economic perspective, I’m sure this makes sense. If you’re going to find something that caters for a particular cluster of people, a particular type of person, then targeting the larger clusters is likely to give the ‘best results’. But (my opinion) this is ignoring the fact that the larger clusters of people tend to be catered for by just about anything. In my experience, if someone is part of a larger cluster, they have a large amount of support from their peers already, and need less from the organisers. But if the organisers can ensure that the edges of the group are looked after, then the ones in the middle will still be just fine, and the whole group will be encouraged.

    Diversity is something that the IT industry suffers from, and I do mean ‘suffer’. Without good diversity, our industry is held back. Stupidly, our industry keeps shooting itself in the foot, and it’s the larger clusters of people – I guess that means people like me – who need to take a stand when we see things that would alienate minority groups.

    Kimberly Bryant points out that teams need diversity, and that hiring decisions shouldn’t turn people away for being different. For myself, as a business owner, I hope that I never turn someone away for that reason, because I do agree that teams need diversity. What I love the most, though, is that what Kimberly has done is to develop programs to make sure that people from a particular minority group present as stronger candidates to hiring managers.

    Let’s encourage people from minority groups to get into IT. We’ll all benefit from it.

    @rob_farley

  • Dr Rimma Nehme at the PASS Summit

    This Summit’s presentation from Microsoft Research Labs is from Dr Rimma Nehme, bucking the trend of having presentations from Dr David DeWitt. I’m really pleased to be able to hear from her, because she’s an absolute legend.

    Among her qualifications is work on the PDW Query Optimizer – a topic closer to me than probably any other area of SQL Server. I just wish I had known this a few minutes ago when I met her, but I’m sure she’ll chat more freely after her big presentation.

    [Image: http://www.sqlpass.org/images/speakers/RimmaNehme588.png]

    Today she’s talking about Cloud Computing, which is great because the cloud space has changed significantly in recent years, and it’s good to hear from Microsoft Research Labs again. For example, she looked at the power-effectiveness of a data centre by comparing the total power it draws against the computing power it delivers. This leads to exploring more effective systems, such as evaporative cooling (which is used by many Australian homes and businesses, of course), making energy-responsibility a key component of cloud computing. With such an effort being put into efficiency, the globally-responsible option is to use the cloud.

    The five key drivers for cloud that Dr Nehme listed are:

    • Elasticity
    • No CapEx
    • Pay Per Use
    • Focus on Business
    • Fast Time To Market

    These are all huge, of course, and the business aspects are massive. It’s increasingly easy to persuade businesses to move to the cloud, but the exciting thing about the technologies that have been discussed this week is the elasticity point.

    Microsoft is doing huge amounts of work to let people scale out easily. New technologies such as Stretched Tables will allow people to have hybrid solutions between on-prem and cloud like never before. With a background in the PDW Query Optimizer, Dr Nehme is the perfect person to be exploring what’s going on with spreading the load across multiple cloud-based machines for these scale-out solutions.

    The cloud means that many database professionals worry about their jobs. I’m sure people felt the same way when the industrial revolution came through. People who work on production-lines have been replaced by robots, and database administrators who only do high availability don’t need to handle that in the cloud space. But they will not be redundant. Dr Nehme just said “Cloud was not designed to be a threat to DBAs”, and this is significant. The key here is that we have more data than ever, and we need to be able to use computing power effectively.

    We can’t keep doing things the way we always have with the amount of data that is appearing, and we need to be more responsible than ever.

    Great keynote, Dr Nehme. I hope this is the first of many keynotes from you.

    @rob_farley

  • PASS Summit – Thursday Keynote

    It’s good to point out it’s still only Thursday, as my laptop tells me that it’s already Friday.

    Today is the second of only two keynotes this Summit, which means that it’s the opportunity to hear from Microsoft Research Labs about what’s going on with data from their perspective.

    It’s also when we get to hear from the PASS VPs – community members that I used to serve with on the PASS Board of Directors – about how PASS is doing from a Financial and Marketing perspective.

    One of the interesting things about PASS is that there are reserves of over a million dollars. I mention this because it’s an area that some people find quite “interesting” for a community and non-profit organisation, but I want to point out that these reserves let PASS be more free in what they (we?) do. Having a million dollars in the bank means that PASS can reach out and do things that will serve the community, even if it seems like it could be risky. There is a lot of risk in running the Summit every year, and this is the most obvious area where PASS could need money to cover costs that might not come back if, say, there’s another volcano eruption in Iceland. I saw first-hand the freedom that PASS had because of the reserves (although some risks were still very high, and freedom does not mean irresponsibility), and I know this is a good thing.

    From the marketing perspective, the celebration of individuals who have gone beyond the norm is a great part of the Summit event, and the PASSion Award winner has been announced as Andrey Korshikov. This guy has done so much for the Russian Data Community, making him the most influential SQL person in the largest country of the world. You can’t go past that…

    @rob_farley

  • Keynote technologies – new or not?

    So I’m sitting in the PASS Summit keynote, and there are some neat things being shown.

    Something that just appeared on the screen showed the locations of shoppers on a plan of a store. There were some ‘Oohs’ coming from around the room, as they mentioned that Kinect was being used to track locations. Hotspots were appearing on a time-driven picture.

    But the thing that I think is most exciting is that this is almost all achievable right now. Collecting information from Kinect is something that my friends John & Bronwen have been presenting about for years, and displaying things on custom maps in Power BI (complete with hotspots) is also very achievable. If you don’t know how to do this, get along to Hope Foley’s session this afternoon (Wed 5th), as she explores more of what’s possible with Power Map. She recently wrote a great post about Custom Maps in Power Map, walking through how to show spatial data on the plan of a building, playing it against a time dimension.

    The stuff in the keynote is excellent – much of it is future, but if you’re at the PASS Summit, you can be having conversations with many of the world’s best experts about how to revolutionise your data story, not just in the future, but right now.

    @rob_farley

  • PASS Summit keynote

    The PASS Summit has kicked off again with a tremendous keynote from Ranga. He's been in the role at Microsoft for a little over a year now, and has really come into his own, as can be seen by the presentation this morning. The data picture hasn't changed hugely over the past year, although the "Internet of Things" space is growing quickly.


    With that, the speed of growth in data volume has kicked in harder than ever. Being able to collect, process, and analyse the kinds of volumes that we're now facing means that scaling is a major feature being discussed. In recent years, this has meant looking at Big Data and the ways that it can hook into our existing solutions, while technologies like Hekaton have allowed us to handle huge numbers of transactions in a scale-up scenario.

    This year, though, we see scale-out having a refreshed focus. We're hearing talk of 'sharding' more, and the idea of being able to use multiple databases (including cloud-based ones) to achieve scale on demand – an elasticity that suits business more than ever.

    Most of our customers at LobsterPot see changes in the amount of business that’s going on across the year, with some having certain key days requiring orders of magnitude more traffic than on ‘normal’ days. They already scale out their websites, but data is another matter. Databases typically scale UP, not OUT.

    My work in the Analytics Platform System / Parallel Data Warehouse space makes me acutely aware of the challenges around scaling out data. When you need to perform joins between tables which have their data in different databases, on different servers, there are problems that need addressing. A lot of it happens behind the scenes through complex data movement techniques, so that it looks like a normal query – and that is hard stuff to do, even with clever data distribution.

    What we’re seeing this morning are some of the ways that Microsoft is providing scale out technology in SQL Server and SQL Database. Considering they now have over a million SQL Database databases in Azure, thinking about how to leverage this technology to enhance on-prem SQL Server databases to provide a new level of hybrid is very interesting.

    One of these technologies is Stretched Tables, which we saw this morning. This is about being able to take a table in SQL Server and stretch it into SQL Databases in Azure. This means that the table will be sharded across on-prem and cloud – hot data being stored locally, and less frequently used data being stored in the cloud. For queries that need to access data that’s in the cloud, data can be extracted from the cloud tables, pushing predicates down to pull back only part of the data, transparently (as far as the user is concerned).

    This is not like using linked servers and views, handling inserts with triggers. This is achieving hybrid behind the scenes, giving users a logical layer they can query to access their information whether it’s local or in the cloud.
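
    The linked-server approach being contrasted here would look something like this rough sketch (the linked server and table names are purely illustrative):

    CREATE VIEW dbo.Orders AS
    SELECT * FROM dbo.Orders_Recent                           -- hot rows kept locally
    UNION ALL
    SELECT * FROM AzureLinked.ArchiveDb.dbo.Orders_History;   -- cold rows in the cloud
    -- ...plus an INSTEAD OF INSERT trigger on the view to route new rows appropriately.

    Stretched Tables promises to do all of that routing and predicate pushing for us, without any of the plumbing.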

    Until now, I’ve always felt that ‘hybrid’ has been about using some components locally and other components in the cloud. But what we’re seeing now are ways that ‘hybrid’ can mean the core of our database – the tables themselves – is handled in a hybrid way.

    Exciting times ahead…

    @rob_farley
