SQLblog.com – The SQL Server Blog Spot on the Web

Rob Farley

- Owner/Principal with LobsterPot Solutions (a Microsoft Gold Partner consulting firm), Microsoft Certified Master, Microsoft MVP (SQL Server), APS/PDW trainer and leader of the SQL User Group in Adelaide, Australia. Rob is a former director of PASS, and runs training courses around the world in SQL Server and BI topics.

  • String length and SARGability

    CONVERT_IMPLICIT isn’t the only problem with getting data types wrong. You might have the right type, but what if the length is wrong? This post will look at both getting the type wrong and getting the length wrong too.

    Let’s do some testing. We’ll need a table with indexes. I’d normally use one of the AdventureWorks versions for this, but as they have a tendency to use user-defined types (which I’m not actually a fan of), I’m going to create my own. Also, my example needs to leverage a composite index. I’m only putting a single row in the table, because the amount of data isn’t relevant to what I’m going to show. I’m using a Windows collation, which is the default on my machine. I’ll put a note in later to mention why.

    create table dbo.StringLength
    (
        pk int primary key,
        id int,
        v50 varchar(50) collate Latin1_General_CI_AS,
        vmax varchar(max) collate Latin1_General_CI_AS,
        n50 nvarchar(50) collate Latin1_General_CI_AS,
        nmax nvarchar(max) collate Latin1_General_CI_AS
    );
    go
    create index ix_v50 on dbo.StringLength (v50, id);
    create index ix_n50 on dbo.StringLength (n50, id);
    go
    insert dbo.StringLength (pk, id, v50, vmax, n50, nmax)
    values (1, 1, 'abcdefghij', 'abcdefghij', N'abcdefghij', N'abcdefghij');
    go

    I haven’t indexed the vmax and nmax fields, because you can’t use them as index keys. Of course, there’s plenty of argument for having those fields in your actual tables, but if you want to be able to search on that data, you might want to consider a full-text index. If the searching is always on the start of the string, you could consider another option, but we’ll come to that later.

    Let’s look at what happens when we do it right, defining variables using the right type.

    declare
        @s varchar(1000) = 'abcdefghij',
        @i int = 0;
    select id
    from dbo.StringLength
    where v50 = @s
    and id > @i;

    [image: execution plan – Index Seek with a Seek Predicate (Prefix and Start), and no residual Predicate]

    Notice that there is no Predicate property of the Index Seek, only a Seek Predicate, which has both a Prefix and Start. The range of rows returned by the Seek Predicate has a start-point based on a combination of the Prefix and Start, and an end-point which is the end of the Prefix.

    If the wrong types are used, we see that a conversion is needed.

    Let’s start by using a varchar(50) parameter, and comparing it to the nvarchar(50) column.
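
    For reference, the query is of this shape – the same pattern as before, but with the parameter declared as varchar(50) and compared to the n50 column (this is my reconstruction, not the exact screenshot):

    declare
        @s varchar(50) = 'abcdefghij',
        @i int = 0;
    select id
    from dbo.StringLength
    where n50 = @s
    and id > @i;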

    [image: execution plan – Index Seek with CONVERT_IMPLICIT in the Prefix of the Seek Predicate]

    We still see no Predicate here, but look at the Prefix. A CONVERT_IMPLICIT is needed because the types don’t match. In case you haven’t heard, this is bad.

    But how bad is it? Actually, not very bad at all, because @s is converted into the correct type, and then used in the Seek. You should still avoid it by passing in the correct type, but the cost of converting a parameter to the correct type is not that bad, because it only needs to happen once.

    What happens if we do it the other way around, passing in an nvarchar(50) parameter and comparing it to the varchar(50) column?
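
    Something along these lines should show it (the same query again, just with the parameter declared as nvarchar):

    declare
        @s nvarchar(50) = N'abcdefghij',
        @i int = 0;
    select id
    from dbo.StringLength
    where v50 = @s
    and id > @i;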

    [image: execution plan – Index Seek with a residual Predicate, and no Prefix in the Seek Predicate]

    Oh!

    Straight away, you’ll notice that there’s a different shape to the execution plan: we’ve lost the Prefix in the Seek Predicate, and we have a Predicate (the residual one) property as well. But we do still have an Index Seek. It hasn’t resorted to using an Index Scan, as would’ve been the case if we had used a number.

    (Just for completeness, let’s truncate the table – to avoid a conversion error – and use a number for the comparison)
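
    In code, that’s roughly this (a sketch – the point is simply that an int is being compared to the varchar column):

    truncate table dbo.StringLength;
    go
    declare @i int = 0;
    select id
    from dbo.StringLength
    where v50 = @i;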

    [image: execution plan – Index Scan, with CONVERT_IMPLICIT on the v50 column in the Predicate]

    Here we get an Index Scan. No Seek Predicate. The index on v50 is as good as useless because we’re comparing the column to a number. Look what’s going on in the Predicate – we’re converting the v50 field into an integer, and seeing if it’s equal to @i. That’s doing it for every row in the index.

    Luckily, we now get a warning about this. See the yellow triangle with an exclamation mark in it on the SELECT operator? If I click on that, I see a warning that says:

    Type conversion in expression (CONVERT_IMPLICIT(int,[tempdb].[dbo].[StringLength].[v50],0)) may affect "CardinalityEstimate" in query plan choice, Type conversion in expression (CONVERT_IMPLICIT(int,[tempdb].[dbo].[StringLength].[v50],0)=[@i]) may affect "SeekPlan" in query plan choice

    It’s actually two warnings. One is a SeekPlan warning, and one is a CardinalityEstimate warning. It’s the SeekPlan one that has caused the Scan, while the CardinalityEstimate problem means the Query Optimizer has little idea about how many rows to expect out of the Scan operator. (And no, there’s no full stop / period at the end of those warnings. Go figure...)

    Anyway, that was just an aside, because I figure there are plenty of posts already out there about this CONVERT_IMPLICIT issue leading to a Scan instead of a Seek. Let’s go back to our situation, where we were dealing with nvarchar and varchar strings, and still had a Seek. This one:

    [image: execution plan – Index Seek with a residual Predicate (the nvarchar parameter against the varchar column)]

    Clearly this is a different situation to a regular CONVERT_IMPLICIT. It’s less harmful, although there is still a big impact, and it’s potentially much more commonplace, as people don’t tend to care quite as much if they see an Index Seek already in play.

    Oh, and this behaviour doesn’t happen with SQL collations. If you have a SQL collation, the sort order between varchar and nvarchar is different, and it has to do a Scan, just like when I used a number.

    The thing that’s happening here is the same as when you have a datetime column that you’re casting to a date, or when you’re using a LIKE comparison with a fixed start. The Query Optimizer uses a function called GetRangeThroughConvert (in the Compute Scalar – you can see it in the XML), which is then used to create a Dynamic Seek. I’ve presented on this before, such as in the SARGability and Residualiciousness talks at SQLBits in 2010 and the PASS Summit in 2011. Paul White (@sql_kiwi) expanded on the GetRangeThroughConvert function in a post from 2012. The seek can’t guarantee to catch everything though, so it uses a Predicate (the residual one) to make sure the value matches exactly.

    So why is this so bad? In my presentations I’ve talked about the GetRangeThroughConvert behaviour as being a good thing. More on that further down in the post.

    What if we pass in the correct type, but make it too long or too short?

    [image: execution plan for the over-long parameter – the same shape as when the length is correct]

    When it’s too long (though we haven’t considered ‘max’ yet), we get the same behaviour as if it were the right length. Interestingly, if you search the XML version of this plan for either 1000 or 50 (except for in the column name), you don’t find it anywhere. It’s as if we had passed in a parameter of the correct length. The same happens if you pass in a string that is too short, but there you need to consider whether you might be truncating the value you put into the parameter.

    In this situation, my query didn’t return the same results, because @s is only ‘abcde’. But it does this without any kind of warning – you can populate a varchar(5) variable with a longer string and it won’t complain at all.
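
    A sketch of what I mean – the assignment silently truncates, and the query quietly searches for the shorter value:

    declare
        @s varchar(5) = 'abcdefghij',  -- silently truncated to 'abcde'
        @i int = 0;
    select id
    from dbo.StringLength
    where v50 = @s
    and id > @i;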

    [image: the query run with the varchar(5) parameter – different results, and no warning]

    But max is done differently.

    Let’s see what happens when we pass in a varchar(max) or nvarchar(max) parameter, and compare it to the limited-length string.
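
    Again, for reference, a query of this shape is what I mean (my sketch):

    declare
        @s varchar(max) = 'abcdefghij',
        @i int = 0;
    select id
    from dbo.StringLength
    where v50 = @s
    and id > @i;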

    [image: execution plan – a Dynamic Seek using GetRangeThroughConvert, with a separate Filter operator for the max comparison]

    We’re comparing the varchar column to a varchar parameter, but because the parameter is defined as max, we get the GetRangeThroughConvert functionality again, just as in the nvarchar / varchar scenario. But it’s more complicated still – despite the fact that we have a Range, our Residual Predicate doesn’t include the equality check. That check has actually been pulled further left in the plan, into that new Filter operator you see there.

    You see, checking a max type is expensive, and involves memory allocation (that parameter is potentially up to 2GB in size), so the Filter is moved as far to the left as possible (SQL won’t ever do a max comparison in a Seek/Scan operator because of the memory allocation). By applying all the other filters (however they’re done – Joins, Seeks, Residuals) before the max filter, the number of rows pulled into the max filter is kept as small as possible. The Seek will return close to the correct rows, thanks to the GetRangeThroughConvert part, but that final check does still need to take place. It’s a good thing that the check is pulled left, but you should avoid passing in a max parameter so that the check can be done in the Seek Predicate instead.

    So what about the other way around? What if we have a max column, and a limited-length parameter?

    In some ways, this is more trivial because you can’t index a max column. This is one of the things that carried over from the days of text and ntext, although you couldn’t define a variable as text or ntext either, so you were less likely to try.

    If you can’t index the column, but still want to be able to perform searches that would be index-like, what can you do (apart from using a Full-Text Index)?

    Well, you could use a computed column that matches the length of your parameter. The non-clustered index persists the value, although it’s not persisted in the underlying heap / clustered index.

    alter table dbo.StringLength
    add vmax50 as cast(vmax as varchar(50));
    go
    create index ix_vmax50 on dbo.StringLength(vmax50, id) include (vmax);

    This is similar to what can be done to tune many queries to get around SARGability problems. But it won’t help here unless we change our query, because our query still uses the max column, and implicit conversion makes the shorter one longer, not the other way around. However, if we add an extra predicate to our query, we can achieve what we want:
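
    Something like this – a sketch showing the extra predicate on the computed column (the parameter here is assumed to be varchar(50)):

    declare
        @s varchar(50) = 'abcdefghij',
        @i int = 0;
    select id
    from dbo.StringLength
    where vmax = @s
    and vmax50 = @s
    and id > @i;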

    [image: execution plan – a Seek on ix_vmax50, with the comparison against the max column still done in a Filter]

    The Filter here is still the max check, but now we at least have a more effective seek on the rest of it, thanks to making a column which contains the first part of the potentially-long string.

    So you can see that GetRangeThroughConvert functionality is useful, and way better than the alternative of using a Scan.

    But GetRangeThroughConvert is actually bad for your query. Honestly.

    The reason why it’s bad is because of what has happened to the second predicate that we’re using, the “id > @i” bit. When the type matched properly, it was part of the Seek Predicate. When GetRangeThroughConvert is used, either through the wrong type or by passing in the right type but with the max length, this second predicate gets relegated to the Residual (see the image below). That means that every row that satisfies the string comparison must be checked against this second predicate. It’s like using the phone book and not appreciating that all the “Farley”s are sorted by first name. Even without a second predicate, there may be problems in the plan because of unnecessary sorts, or the avoidance of a Merge Join, simply because the data cannot leverage the second column in an index (or the CIX key if there are no more columns in the key). If you’re looking for Barry Smith in the phone book, but are passing in ‘Smith’ via a varchar(max) parameter, you’ll be scanning all the Smiths looking for Barry. If you’d passed it in using varchar(100), then you could’ve found all the Barry Smiths with your Seek.

    [image: Index Seek properties showing the “id > @i” predicate relegated to the Residual]

    It won’t give you a warning, but you’ll find your indexes aren’t being used as effectively as you might like.

    @rob_farley

  • Not-so-dirty SQL hacks

    Using a hammer to push in a screw isn’t a good idea, no matter how good the hammer is. We all know that. And yet there are times when we get frustrated at the ‘right tool’ and opt for the one that will work. Unfortunately, there are plenty of examples in the IT space – the topic of which is this month’s T-SQL Tuesday, hosted by Rick Krueger (@DataOgre).

    [image: T-SQL Tuesday logo]

    There are many ways to use SQL. Having seen plenty of awful code over the years, and written plenty of it myself, I know that there are lots of ways to achieve the same thing, and there are ways to persuade the Query Optimizer to take one particular approach rather than another. These efforts might seem quite ‘hacky’ at first glance, but if there is good reason for them, then perhaps they belong.

    For example, a couple of years ago (and then again a couple of weeks ago) I wrote about using views as a way of passing sets of values to a stored procedure (where the stored procedure is masquerading as an instead-of trigger). At the time, people told me it was quite hacky, and that they’d never use it. However, I’ve heard of other people (including Greg Low, who also blogs at sqlblog.com) using this same method for getting data into Azure using SSIS, because it just works, and it has a number of advantages over other methods.

    Many look at string concatenation as another example of this, although some get attached to the wrong solution.

    When concatenating strings stored across a single row, there is no issue – simply use the + operator, or a function like CONCAT().
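
    A rough example of what I mean (assuming an AdventureWorks2012-style Person.Person table; CONCAT needs SQL Server 2012 or later, and treats NULLs as empty strings, whereas + returns NULL if any part is NULL):

    SELECT p.FirstName + ' ' + p.LastName AS FullName,
           CONCAT(p.FirstName, ' ', p.LastName) AS FullNameToo
    FROM Person.Person AS p;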

    [image: concatenating values from a single row, using the + operator and CONCAT()]

    It’s when you need to consider data across multiple rows that you have more of a problem.

    If there are a known number of rows in play, then performing a Pivot transformation can simplify the problem down to the single-line variety. I don’t mind whether you prefer to use the PIVOT operator or not. Personally, I prefer not to. In this example, I take a simple set of unpivoted data, and do a basic pivot transform, using GROUP BY and CASE. These then form useful values for the CONCAT function.
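
    The pattern is roughly this (dbo.NameParts is a made-up table for illustration – one row per PersonID, PartType and NamePart – not something from AdventureWorks):

    SELECT CONCAT(
               MAX(CASE WHEN n.PartType = 'First' THEN n.NamePart END), ' ',
               MAX(CASE WHEN n.PartType = 'Last' THEN n.NamePart END)
           ) AS FullName
    FROM dbo.NameParts AS n
    GROUP BY n.PersonID;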

    [image: pivoting with GROUP BY and CASE, with the results fed into CONCAT]

    Frustratingly, pivoting requires the values to be known, producing a known set of columns. Dynamic Pivot requires Dynamic SQL, but if this is being done, there’s no problem providing those values to a CONCAT function – though creating the Dynamic SQL probably requires a degree of concatenation in the first place!

    There is a “Quirky Update” method for concatenation that has been touted a bit over the years, but I find myself completely unable to recommend it, based on the prerequisites listed to get reliable results. For me, I want the results of any query I write to be guaranteed, and to have a query which could end up producing the wrong results is simply too dangerous for my liking. I’d rather use a cursor than risk wrong-results, and that’s saying something!

    The method I like to use for string concatenation has its roots in a hack of sorts, and to explain why, I’ll first describe the environment from which it has sprung – FOR XML PATH.

    There are a few ways to generate XML from SQL. I find FOR XML PATH to be quite straightforward, although it can become overly complex when nesting comes into play (causing me to switch to one of the others). However, for simple examples, XML PATH is very effective.

    The results of a simple query like this:

    SELECT ProductID, Name, ListPrice
    FROM Production.Product
    ORDER BY ProductID;

    can be turned into XML by simply adding a “FOR XML PATH” option at the end. Providing some additional parameters, we get results like the following:
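
    The additional parameters I’m referring to are the element and root names – something like this (the names ‘Product’ and ‘Products’ are my guesses, not necessarily what the screenshot used):

    SELECT ProductID, Name, ListPrice
    FROM Production.Product
    ORDER BY ProductID
    FOR XML PATH('Product'), ROOT('Products');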

    [image: the XML output of the FOR XML PATH query]

    You’ll notice that the column names have appeared within XML tags. The interesting thing here is that we can manipulate this behaviour by changing the name of the columns, with aliases. For example, putting an @ in front of the column name causes the element to be treated as an attribute instead (I remember it as “@ for ATtribute”).
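
    For example (again a sketch, with the same assumed element and root names):

    SELECT ProductID AS [@ProductID], Name AS [@Name], ListPrice AS [@ListPrice]
    FROM Production.Product
    ORDER BY ProductID
    FOR XML PATH('Product'), ROOT('Products');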

    [image: XML output with the columns appearing as attributes]

    Another trick with column names is that if there is none (or it’s [*]), then the element tag disappears. Naturally this is way better than having “<>” appear in the resultant XML.

    [image: XML output with the element tags removed]

    …but this behaviour of making the empty element disappear is part of the appeal.

    I’m told that it was around the time of the development of this ‘disappearing element’ bit that the penny dropped and using it for string concatenation became an option.

    Consider what happens if we only have the Name column now.

    [image: XML output containing just the Name elements]

    If we had no ROOT element, we wouldn’t have legal XML…

    [image: the output with no ROOT element – not well-formed XML]

    …but if we used an empty string for the PATH element as well…

    [image: the output with an empty string for PATH – the values run together as a single string]

    …now suddenly we have string concatenation (including the support for ordering), without having to resort to something which could fall apart at a moment’s notice.

    If we want commas included, that’s easy, and if we want to convert it back into a string (instead of being XML, with XML-style quotes), then surrounding it with brackets (round ones, what Americans call parentheses) and putting .value('.','varchar(max)') at the end will do the trick.
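
    Putting it all together looks something like this – note the TYPE directive so that .value() has an XML value to work with; the STUFF call (which trims the leading comma) is my own addition rather than something shown above:

    SELECT STUFF(
               (SELECT ', ' + p.Name
                FROM Production.Product AS p
                ORDER BY p.Name
                FOR XML PATH(''), TYPE
               ).value('.', 'varchar(max)'),
               1, 2, '') AS ProductList;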

    [image: a comma-separated list returned as a string via .value()]

    Here, we’ve taken XML generation and turned it into a string concatenation operation in a very simple way. The SQL Product Group realised this potential quite early on, and this is widely regarded as the best way of doing concatenation from SQL Server 2005 on.

    There are lots of other things that I’ve done over the years that might seem quite hacky, but are actually quite legitimate – but string concatenation is probably my favourite.

    @rob_farley

  • TVPs in SSIS

    Almost two years ago, I wrote about a method to use Table-Valued Parameters in SQL 2005 – or basically any environment that doesn’t support them natively.

    The idea was to use a View with an ‘instead of’ trigger. Essentially, the trigger acts as a stored procedure that can handle all the rows however you want. That could be distributing the rows into tables as if it were an actual insert, but it also works for running whatever code you like, because a trigger is essentially just a stored procedure.

    So anyway – today I got challenged to make it more explicit that this also works within SQL Server Integration Services. SSIS handles batch inputs into views, but if you were hoping to push data into a stored procedure in a data flow, that’s going to be on a row-by-row basis – no TVPs. I’d described it to a class I was teaching, showed them the code from my 2011 blog post, and they pointed out that I hadn’t made it particularly obvious (in my blog post) that you could use that method in SSIS.

    The idea is very simple – you just create a view, as I did in my earlier post, and then use it as a destination within SSIS. This can be used for all kinds of methods for which you’d want to use a TVP, including pushing data up into SQL Azure databases.
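
    For anyone who hasn’t seen the earlier post, a rough sketch of the idea looks like this (dbo.Orders and the column names are made up for illustration, not taken from that post):

    CREATE VIEW dbo.OrderBatch
    AS
    SELECT OrderID, CustomerCode
    FROM dbo.Orders
    WHERE 1 = 0;  -- the view itself never returns rows; it's just an insert target
    GO
    CREATE TRIGGER dbo.OrderBatchHandler
    ON dbo.OrderBatch
    INSTEAD OF INSERT
    AS
    BEGIN
        -- 'inserted' holds the whole batch that SSIS (or any client) pushed
        -- into the view, so it can be handled with set-based logic here,
        -- much like a table-valued parameter in a stored procedure
        INSERT dbo.Orders (OrderID, CustomerCode)
        SELECT OrderID, CustomerCode
        FROM inserted;
    END;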

    @rob_farley

  • Ultimate query tuning

    Infinitely better.

    100% of the reads removed. Roughly 4000 (okay, 3890), down to zero.

    None.

    Let me explain...

    Obviously if there’s data being returned, there are reads needed. And obviously there is some CPU needed for SQL to be given a query and do anything at all. Luckily for me, performance of a query is typically evaluated using the number of reads performed. That’s what I’m looking at.

    Recently I came across a query that typically returns no rows. It was looking for rows in a table that met a particular condition, joined those rows to a bunch of other tables, and returned the result. But yeah, typically no rows returned. Estimating a few, but none actually coming out.

    Consider the following query, on AdventureWorks.

    SELECT *
    FROM Sales.SalesOrderDetail as od
    JOIN Production.Product as p on p.ProductID = od.ProductID
    JOIN Production.ProductModel as m on m.ProductModelID = p.ProductModelID
    LEFT JOIN Production.ProductSubcategory as s on s.ProductSubcategoryID = p.ProductSubcategoryID
    where od.OrderQty > 1200;

    Running this query gives me no rows back, but well over a thousand reads.

    [image: the reads for the query – no rows returned, but well over a thousand reads]

    As well as this, it suggested a missing index.

    CREATE NONCLUSTERED INDEX OrderQtyIndex
    ON [Sales].[SalesOrderDetail] ([OrderQty])
    INCLUDE ([SalesOrderID], [SalesOrderDetailID], [CarrierTrackingNumber], [ProductID], [SpecialOfferID], [UnitPrice], [UnitPriceDiscount], [LineTotal], [rowguid], [ModifiedDate]);

    Creating this index does indeed reduce the reads. A lot too! Instead of scanning the Sales.SalesOrderDetail table, it can use an Index and quickly find any rows that have more than 1200 items sold.

    [image: the reads after creating OrderQtyIndex – far fewer]

    This is good. But is it worth stopping here?

    The index that I created gives a slightly different plan. It doesn’t perform a Seek on an index on the Sales.SalesOrderDetail table, it does a Scan! But zero reads.

    [image: the plan using the new index – a Scan, but zero reads]

    The difference is a filter.

    CREATE NONCLUSTERED INDEX ZeroReadsIndex
    ON [Sales].[SalesOrderDetail] ([OrderQty])
    INCLUDE ([SalesOrderID], [SalesOrderDetailID], [CarrierTrackingNumber], [ProductID], [SpecialOfferID], [UnitPrice], [UnitPriceDiscount], [LineTotal], [rowguid], [ModifiedDate])
    WHERE OrderQty > 1200;

    Now we have a completely empty index. Scanning this index is zero reads. It’s just like having a table with no rows in it. This index only gets populated with a row whenever a qualifying row appears in the underlying table. When that happens, only a few reads are required to get the necessary data out of the other tables. But for 99.9% of the time this query is run, there are now NO READS. In the real scenario, that row gets picked up and removed quite quickly, returning the query to the ‘zero reads’ state.

    Do you have frequently-run queries that typically return no rows, because there’s a predicate that is rarely satisfied? How about turning that predicate into an index filter, and seeing if you can reduce a big chunk of the footprint?

    @rob_farley

  • Looking back, or looking forward?

    Sometimes I only blog for T-SQL Tuesday. I don’t want to break my run of blogging for each one just yet, but I do wonder sometimes.

    The reason I’m reflecting is that what we’ve done in the past will so often affect what we do in the future. The musician Sting tells us that “History will teach us nothing”, but he’s trying to address the idea of being tied down by the negative experiences of the past, rather than making a conscious effort to see improvement. We need to acknowledge what has happened in the past, but understand that we can shape our futures and see a better world ahead. History can both help keep us diligent (as in the case of my blogging), and it can help us see where change is urgently needed.

    In the SQL world, we also need to find a balance between treating the past with respect, and not letting it hold us back.

    Mickey Stuewe is hosting this month’s T-SQL Tuesday, and the topic is Auditing (as you may have guessed).

    By keeping a record of what has happened, we can compare what things look like now, with what they looked like before, and that is incredibly powerful. (Erin Stellato is the name in SQL that most people associate with baselines – go and read some of her stuff on what you should be doing, because that’s not what I’m covering here.) It’s not just about being able to troubleshoot, but it’s about being able to implement change. Every change that we ever try involves risk, and there is a healthy amount of trepidation that goes along with that. But by being able to compare the before and after of change, you can make decisions more easily.

    Let me give you a couple of examples – one in the SQL world, and one not.

    The internet is full of tips and tricks about life (amongst countless other things, such as laughing penguins), with plenty being about little ways to be more effective. A while back I needed to replace an AA battery in the kitchen clock, but could only find AAA batteries. So I used a piece of Al-foil and got the clock working again. When I did this, I was curious about the impact on the battery, and the clock, and even on the foil. I wondered whether the battery would last as long, whether the clock would be affected by having less current going through it, and even whether it was safe (did the foil become hot, for example?). The problem was, I had no metrics to base my ideas on. I honestly had no idea how long it had been since the clock last got a new battery. Nor had I been tracking the health of the clock over, um, time, to see if it was staying accurate or not. I wasn’t about to start monitoring these things either – I just had to go with my gut and figure it was probably not awful for either the battery or the clock, but that I should put a normal AA battery in soon. The risk was small, but if I’d had data about these things, perhaps I’d be able to make a more informed decision.

    In SQL, I often tune queries. I can happily tell a client that I’ve reduced the amount of time taken for a query from 20 seconds to 0.2 seconds, or from 400,000 reads to 13 reads. I can easily take metrics on these things. But even these metrics aren’t always the right ones. What the customer is more interested in is whether or not these changes impact their business. Does the speed increase of these queries now mean that they can handle a larger customer base in busy times? Does this mean that they might be able to avoid spending the money on a new server? Does it both save them money and increase revenue?

    Business Intelligence draws information out of all kinds of business data, and hopefully provides a platform for being able to make decisions. No matter whether the data is on the performance metrics of a server or on the sales metrics for a product, there is an opportunity to be able to implement change and notice an upturn. If you don’t have that data, if you haven’t been auditing your systems, then you’re approaching change with a very different hat on, one that probably doesn’t sit quite so comfortably. Looking back at what the past was like provides a glimpse of what the future might be, and insight into how change can become rewarding. History can teach us plenty.

    But yes, although we all recognise that it’s good to have the metrics in place to measure the impact of change, we shouldn’t allow a lack of data to turn into both an excuse and a license for inactivity. Step out and see what change you can make, with both eyes open.

    @rob_farley

  • Getting better over time

    This month’s T-SQL Tuesday is hosted by Bradley Ball (@sqlballs), and he wants us to write about a proper balls-up. He wants us to see it as an opportunity for a Second Chance – to write about how we learned and did better when we got another go.

    I thought hard about this. I know I could write about times when I’ve done something badly – there are projects, jobs, university assignments, that I would happily redo – but I also know that I could list almost anything I’ve done in that category. Not that everything in my past is bad, but compared to the way I do things now, perhaps ‘not as good’. Hindsight is a wonderful thing, but over time we continue to hone our skills and, I’d like to think, we make improvements in the ways we do things.

    The thing I want to consider in this post is my blog. For the past 3.5 years I’ve been blogging here at sqlblog.com. Before that, I was at msmvps.com for about three years, and had spent a year or so at blogspot, reaching back to 2005. I even did four posts over at sqlblogcasts.com. Despite the fact that there are a number of posts I still find useful and that get complimented by people as knowledgeable as Paul White, I still cringe a lot when I look at old posts (and not just because the images seem to have been lost from my blogspot posts). There are things that I did out of ignorance (such as when I used scalar functions), and things that I know I could have explained better now. If I compare the stuff I write even now compared to content by better bloggers than me, those people (like Paul) who write posts I could only dream of putting together, it makes me realise my blog is pointless. When I look at the old stuff… cringing is an understatement!

    By all rights, I should remove all that old content. I feel like it must simply be there to misinform the masses. You know the masses – that would be all those people that might want to read about a fix() function in T-SQL, or about… actually, no, I can’t see anything from my old blogspot site that’s useful at all – okay, maybe the thing about using row_number() to avoid duplicates, but I clearly didn’t get my MVP status because of blogging… Yeah, those masses of people. Ha. Does something have mass if it’s zero?

    Over time, I would like to rework some of my old posts into new ones. For example – I would like to redo my Covering Schmuvvering post using trace flag 9130 to demonstrate some more about what’s going on. But even reworking posts doesn’t mean I should necessarily remove the old ones.

    I think mainly, I leave that old content there because it shows part of my journey. I imagine it won’t be long before I’ll look back at some of my recent posts and shake my head about what’s in there. When people like Jef Verheul write tweets suggesting I can’t write rubbish, I have plenty to point them at (including this one, I think!).

    If you’re thinking about whether it’s worth writing blog posts or not – just do it. Even if you write absolute rubbish, that’ll just mean you have posts that are at a similar level to what I’ve written – and still write!

    @rob_farley

    Note: This is the same whatever you’re doing, not just blogging. I’m sure everyone in the world can write queries better than earlier versions of me, do BI projects better than earlier versions of me, present better than earlier versions of me, etc, etc, etc. (And of course, plenty of you can do these things better than current versions of me too!)

  • Plan Operator Tuesday round-up

    Eighteen posts for T-SQL Tuesday #43 this month, discussing Plan Operators.

    I put them together and made the following clickable plan. It’s 1000px wide, so I hope you have a monitor wide enough.

    Let me explain this plan for you (people’s names are the links to the articles on their blogs – the same links as in the plan above).

    It was clearly a SELECT statement. Wayne Sheffield (@dbawayne) wrote about that, so we start with a SELECT physical operator, leveraging the logical operator Wayne Sheffield.

    The SELECT operator calls the Paul White operator, discussed by Jason Brimhall (@sqlrnnr) in his post. The Paul White operator is quite remarkable, and can consume three streams of data. Let’s look at those streams.

    The first pulls data from a Table Scan – Boris Hristov (@borishristov)’s post – using parallel threads (Bradley Ball – @sqlballs) that pull the data eagerly through a Table Spool (Oliver Asmus – @oliverasmus). A scalar operation is also performed on it, thanks to Jeffrey Verheul (@devjef)’s Compute Scalar operator.

    The second stream of data applies Evil (I figured that must mean a procedural TVF, but could’ve been anything), courtesy of Jason Strate (@stratesql). It performs this Evil on the merging of parallel streams (Steve Jones – @way0utwest), which suck data out of a Switch (Paul White – @sql_kiwi). This Switch operator is consuming data from up to four lookups, thanks to Kalen Delaney (@sqlqueen), Rick Krueger (@dataogre), Mickey Stuewe (@sqlmickey) and Kathi Kellenberger (@auntkathi). Unfortunately Kathi’s name is a bit long and has been truncated, just like in real plans.

    The last stream performs a join of two others via a Nested Loop (Matan Yungman – @matanyungman). One pulls data from a Spool (my post – @rob_farley) populated from a Table Scan (Jon Morisi). The other applies a catchall operator (the catchall is because Tamera Clark (@tameraclark) didn’t specify any particular operator, and a catchall is what gets shown when SSMS doesn’t know what to show. Surprisingly, it’s showing the yellow one, which is about cursors. Hopefully that’s not what Tamera planned, but anyway...) to the output from an Index Seek operator (Sebastian Meine – @sqlity).

    Lastly, I think everyone put in 110% effort, so that’s what all the operators cost. That didn’t leave anything for me, unfortunately, but that’s okay. Also, because he decided to use the Paul White operator, Jason Brimhall gets 0%, and his 110% was given to Paul’s Switch operator post.

    I hope you’ve enjoyed this T-SQL Tuesday, and have learned something extra about Plan Operators. Keep your eye out for next month’s one by watching the Twitter Hashtag #tsql2sday, and why not contribute a post to the party? Big thanks to Adam Machanic as usual for starting all this.

    @rob_farley

  • Spooling in SQL execution plans

    Sewing has never been my thing. I barely even know the terminology, and when discussing this with American friends, I even found out that half the words that Americans use are different to the words that English and Australian people use.

    That said – let’s talk about spools! In particular, the Spool operators that you find in some SQL execution plans. This post is for T-SQL Tuesday, hosted this month by me! I’ve chosen to write about spools because they seem to get a bad rap (even in my song I used the line “There’s spooling from a CTE, they’ve got recursion needlessly”). I figured it was worth covering some of what spools are about, and hopefully explain why they are remarkably necessary, and generally very useful.

    If you have a look at the Books Online page about Plan Operators, at http://msdn.microsoft.com/en-us/library/ms191158.aspx, and do a search for the word ‘spool’, you’ll notice it says there are 46 matches. 46! Yeah, that’s what I thought too...

    [image: Books Online search results showing 46 matches for ‘spool’]

    Spooling is mentioned in several operators: Eager Spool, Lazy Spool, Index Spool (sometimes called a Nonclustered Index Spool), Row Count Spool, Spool, Table Spool, and Window Spool (oh, and Cache, which is a special kind of spool for a single row, but as it isn’t used in SQL 2012, I won’t describe it any further here).

    Spool, Table Spool, Index Spool, Window Spool and Row Count Spool are all physical operators, whereas Eager Spool and Lazy Spool are logical operators, describing the way that the other spools work. For example, you might see a Table Spool which is either Eager or Lazy. A Window Spool can actually act as both, as I’ll mention in a moment.

    In sewing, cotton is put onto a spool to make it more useful. You might buy it in bulk on a cone, but if you’re going to be using a sewing machine, then you quite probably want to have it on a spool or bobbin, which allows it to be used in a more effective way. This is the picture that I want you to think about in relation to your data.

    I’m sure you use spools every time you use your sewing machine. I know I do. I can’t think of a time when I’ve got out my sewing machine to do some sewing and haven’t used a spool. However, I often run SQL queries that don’t use spools. You see, the data that is consumed by my query is typically in a useful state without a spool. It’s like I can just sew with my cotton despite it not being on a spool!

    [image: a simple query and its plan – no Spool in sight]

    Many of my favourite features in T-SQL do like to use spools though.

    [image: a similar query using an OVER clause, with a Table Spool appearing in the plan]

    This looks like a very similar query to before, but includes an OVER clause to return a column telling me the number of rows in my data set. I’ll describe what’s going on in a few paragraphs’ time.
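
    In case the screenshot is hard to read, the query was roughly of this shape (my reconstruction – the exact tables aren’t the point, the COUNT(*) OVER () is):

    SELECT p.Name, psc.Name AS Subcategory,
           COUNT(*) OVER () AS NumRowsInSet
    FROM Production.Product AS p
    JOIN Production.ProductSubcategory AS psc
        ON psc.ProductSubcategoryID = p.ProductSubcategoryID;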

    So what does a Spool operator actually do?

    The spool operator consumes a set of data, and stores it in a temporary structure, in the tempdb database. This structure is typically either a Table (ie, a heap), or an Index (ie, a b-tree). If no data is actually needed from it, then it could also be a Row Count spool, which only stores the number of rows that the spool operator consumes. A Window Spool is another option if the data being consumed is tightly linked to windows of data, such as when the ROWS/RANGE clause of the OVER clause is being used. You could maybe think about the type of spool being like whether the cotton is going onto a small bobbin to fit in the base of the sewing machine, or whether it’s a larger spool for the top.

    A Table or Index Spool is either Eager or Lazy in nature. Eager and Lazy are Logical operators, which talk more about the behaviour, rather than the physical operation. If I’m sewing, I can either be all enthusiastic and get all my cotton onto the spool before I start, or I can do it as I need it. “Lazy” might not be the best word to describe a person – in the SQL world it describes the idea of either fetching all the rows to build up the whole spool when the operator is called (Eager), or populating the spool only as it’s needed (Lazy).

    Window Spools are both physical and logical. They’re eager on a per-window basis, but lazy between windows.

    And when is it needed?

    The way I see it, spools are needed for two reasons.

    1 – When data is going to be needed AGAIN.

    2 – When data needs to be kept away from the original source.

    If you’re someone that writes long stored procedures, you are probably quite aware of the second scenario. I see plenty of stored procedures being written this way – where the query writer populates a temporary table, so that they can make updates to it without risking the original table. SQL does this too. Imagine I’m updating my contact list, and some of my changes move data to later in the book. If I’m not careful, I might update the same row a second time (or even enter an infinite loop, updating it over and over). A spool can make sure that I don’t, by using a copy of the data. This problem is known as the Halloween Effect (not because it’s spooky, but because it was discovered in late October one year). As I’m sure you can imagine, the kind of spool you’d need to protect against the Halloween Effect would be eager, because if you’re only handling one row at a time, then you’re not providing the protection... An eager spool will block the flow of data, waiting until it has fetched all the data before serving it up to the operator that called it.

    In the query below I’m forcing the Query Optimizer to use an index which would be upset if the Name column values got changed, and we see that before any data is fetched, a spool is created to load the data into. This doesn’t stop the index being maintained, but it does mean that the index is protected from the changes that are being done.
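
    The query in the screenshot is along these lines – a sketch only, wrapped in a transaction that I roll back, and assuming the standard AK_Product_Name index on AdventureWorks’ Production.Product table:

    BEGIN TRAN;
    UPDATE p
    SET Name = N'zz' + p.Name  -- a throwaway change to the index key, purely for the demo
    FROM Production.Product AS p WITH (INDEX (AK_Product_Name))
    WHERE p.Name LIKE N'Mountain-100%';
    ROLLBACK TRAN;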

    [image: update plan with an Eager Spool protecting the forced index from the Halloween Effect]

    There are plenty of times, though, when you need data repeatedly. Consider the query I put above. A simple join, but then counting the number of rows that came through. The way that this has executed (be it ideal or not) is to ask that a Table Spool be populated. That’s the Table Spool operator on the top row. That spool can produce the same set of rows repeatedly. This is the behaviour we see in the bottom half of the plan, where a join is being done between rows sourced from the spool – one stream being aggregated and one not – producing the columns that we need for the query.

    Table v Index

    When considering whether to use a Table Spool or an Index Spool, the question that the Query Optimizer needs to answer is whether there is sufficient benefit to storing the data in a b-tree. The idea of having data in indexes is great, but of course there is a cost to maintaining them. Here we’re creating a temporary structure for data, and there is a cost associated with populating each row into its correct position according to a b-tree, as opposed to simply adding it to the end of the list of rows in a heap. Using a b-tree could even result in page-splits as the b-tree is populated, so there had better be a reason to use that kind of structure. That all depends on how the data is going to be used in other parts of the plan. If you’ve ever thought that you could use a temporary index for a particular query, well this is it – and the Query Optimizer can do that if it thinks it’s worthwhile.

    It’s worth noting that even if a Spool is populated using an Index Spool, it can still be read back using a Table Spool. Whether a Spool used as a source shows as a Table Spool or an Index Spool is more about whether a Seek Predicate is used than about the underlying structure.

    Recursive CTE

    I’ve already shown you an example of spooling when the OVER clause is used. You might see them being used whenever you have data that is needed multiple times, and CTEs are quite common here.

    With the definition of a set of data described in a CTE, if the query writer is leveraging this by referring to the CTE multiple times, and there’s no simplification to be leveraged, a spool could theoretically be used to avoid reapplying the CTE’s logic. Annoyingly, this doesn’t happen. Consider this query, which really looks like it’s using the same data twice. I’m creating a set of data (which is completely deterministic, by the way), and then joining it back to itself. There seems to be no reason why it shouldn’t use a spool for the set described by the CTE, but it doesn’t.
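
    The query I mean is of this shape (not the exact one from the screenshot, just the same pattern – a deterministic CTE joined back to itself):

    WITH ProductData AS
    (
        SELECT p.ProductID, p.Name, p.ProductSubcategoryID, p.ListPrice
        FROM Production.Product AS p
    )
    SELECT a.ProductID, a.Name, b.ProductID AS RelatedProductID, b.Name AS RelatedName
    FROM ProductData AS a
    JOIN ProductData AS b
        ON b.ProductSubcategoryID = a.ProductSubcategoryID
        AND b.ProductID <> a.ProductID;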

    [image: the plan for the CTE joined back to itself – no Spool used]

    On the other hand, if we don’t pull as many columns back, we might see a very different plan.

    [image: a very different plan when fewer columns are pulled back]

    You see, CTEs, like all sub-queries, are simplified out to figure out the best way of executing the whole query. My example is somewhat contrived, and although there are plenty of cases when it’s nice to give the Query Optimizer hints about how to execute queries, it usually doesn’t do a bad job, even without spooling (and you can always use a temporary table).

    When recursion is used, though, spooling should be expected.

    Consider what we’re asking for in a recursive CTE. We’re telling the system to construct a set of data using an initial query, and then use that set as a source for another query, piping the results back into the same set and around again. It’s very much a spool. The analogy of cotton is long gone here, as the idea of having a continual loop of cotton feeding onto a spool and off again doesn’t quite fit, but that’s what we have here. Data is being fed onto the spool, and getting pulled out a second time when the spool is used as a source.

    [image: recursive CTE plan, with an Index Spool being populated and a Table Spool reading from it]

    (This query is running on AdventureWorks, which has a ManagerID column in HumanResources.Employee, not AdventureWorks2012)
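
    If you want to reproduce a plan of this shape, a recursive query along these lines should do it (my own sketch, using the EmployeeID/ManagerID columns mentioned above):

    WITH EmpCTE AS
    (
        SELECT e.EmployeeID, e.ManagerID, 0 AS Lvl
        FROM HumanResources.Employee AS e
        WHERE e.ManagerID IS NULL
        UNION ALL
        SELECT e.EmployeeID, e.ManagerID, m.Lvl + 1
        FROM HumanResources.Employee AS e
        JOIN EmpCTE AS m ON m.EmployeeID = e.ManagerID
    )
    SELECT EmployeeID, ManagerID, Lvl
    FROM EmpCTE;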

    The Index Spool operator is sucking rows into it – lazily. It has to be lazy, because at the start, there’s only one row to be had. However, as rows get populated onto the spool, the Table Spool operator on the right can return rows when asked, ending up with more rows (potentially) getting back onto the spool, ready for the next round. (The Assert operator is merely checking to see if we’ve reached the MAXRECURSION point – it vanishes if you use OPTION (MAXRECURSION 0), which you can try yourself if you like).

    Spools are useful. Don’t lose sight of that. Every time you use temporary tables or table variables in a stored procedure, you’re essentially doing the same – don’t get upset at the Query Optimizer for doing so, even if you think the spool looks like an expensive part of the query.

    I hope you’re enjoying this T-SQL Tuesday. Why not head over to my post that is hosting it this month to read about some other plan operators? At some point I’ll write a summary post – once I have, you should find a comment below pointing at it.

    @rob_farley

  • Running goals - an update

    Back in January, I wrote about some of my “running goals”. It’s time to update those of you who aren’t connected to me on Twitter or Facebook (or weren’t looking on the right days).

    I mentioned that I wanted to get a better time in a half marathon.

    Yes. I did that. I ran two half marathons in Adelaide this year, with times of 2:04:58 and 2:03:57.

    [images: finishing results from the two half marathons]

    I mentioned that I wouldn’t mind trying a full marathon.

    Last week, I did that. It was in the Barossa Valley – the wine region just north of Adelaide. My official time was 5:18:24. I probably could’ve been faster, but I’d had injuries along the way during my training. The South Australian Road Runners’ Club had a mentor program for people interested in doing a marathon, and I got involved. I didn’t make it to many (none) of the Sunday morning training runs, but I lined up anyway, and ran a marathon.

    The date of the marathon was significant for me – it was the day before the tenth anniversary of my back injury. In hindsight, I would never recommend running a marathon the day before a significant day like that. I was pleased I’d finished the marathon (I wasn’t aiming for a particular time, and was just pleased to finish – being hit by a car around 31km in didn’t really help my time, but I managed to get around), and I hadn’t really expected the impact of the injury-anniversary to affect me so much. I got physically and emotionally sick for a few days, and it was horrible. Ten years is a long time – more than a quarter of my life – and I know that it’s been tough not just on me but on those around me. Completing a marathon hasn’t made me feel like I’ve conquered my injury, it just helps me feel like I know I can keep going despite it.

    [image: running in the Barossa Marathon] << I’m smiling because someone’s pointing a camera at me. And because I can see the finish line.

    I mentioned I wanted to keep doing some cardio every day, and lose more weight.

    This fell off in mid-Feb when I got the first injury of my marathon training. I picked up a thing called “hip bursitis”. That led to increased back pain, and doing something every day was just beyond me. I got below 200lb, but only just. I’m roughly that now still, and I’m okay about that. I might try an “every day” thing again soon, but I’ll see how I go.

    I wanted to run some distance during 2013. 750 miles? 900? 1000?

    Well, five months in, I’ve run 341. That tracks to a bit over 800. But also I’ve run a marathon. And right now, a week later, part of me thinks I’ve achieved enough running goals, and it would be good to NOT run for a while. So forgive me if I don’t manage to run 1000 miles during 2013. I’d have to do over 20 miles every week from now to reach 1000 – my injuries just aren’t compatible with that.

    Running is tough. It’s not completely good for my back, and I have mixed emotions about it. As people have pointed out, I’m not exactly the right build for running... but that’s not the point. The point is that I have a back injury, and I need to work with my physio to continue to see it improve. Running might be part of that, but there are lots of things that I still don’t have in my life that I would like to be able to have again. I’d like to be able to dance. I’d like to be able to play sport with my sons. I’d like to be able to continue to pick up my daughter (who’s five, and getting towards the kind of weight that I can’t actually lift). One day, I plan to carry my wife over the threshold of the house we built. Any of those things is going to take a lot of work – but a year ago, I couldn’t run either.

    What, what? A car?

    Yes – I got hit by a car during the Barossa Marathon. It wasn’t like it took my legs out and I rolled across the windscreen. It’s more that it hit my hands.

    A marshal had stopped this car at a point where we had to cross the road we were running on. It was at the 31km point – almost 20 miles into the 26-mile route. The driver had wound down her window and was talking to the marshal. That was all fine. I took it fairly wide, and crossed in front of the car. The marshal was saying to her “Don’t pull off yet”, but she did – I guess she was still looking at the marshal as she took her foot off the clutch. I was in front of her at that point, and as she started forward, I wasn’t moving fast enough to jump out of the way. I turned and put my hands (and weight) on her car, and she stopped. I almost fell over, but didn’t.

    Annoyingly, I’d stopped running then. Later in the run, Erin Stellato (@erinstellato) tweeted to me that it’s easier to keep going if you never stop. Well, I had trouble getting going again. My legs were sore, of course. My back had just got a jolt I wasn’t expecting, as had my adrenalin. I was sweating and leaning forward (so my eyes were starting to suffer). It took me an embarrassingly long time to finish crossing the road. Thankfully there was a drink station there, where I washed my eyes out, and I kept going, about two minutes per mile slower than I’d been going beforehand. I’m not saying I would’ve broken five hours, but I would recommend if time is important to you that you don’t have an altercation with a vehicle part way round.

    Massive thanks to the people that sent the 187 tweets during my run (read to me by the RunMeter app). It helped. Now go and give money to Yanni Robel’s fundraising. I run in a Team In Training shirt to honour the effort that she’s putting in, so please – give money to her fund. Also, if you’re reading this in the first 8-ish hours after I’ve published it, send Yanni a message on Twitter (@yannirobel), because she’s about to run a marathon too!

  • T-SQL Tuesday #43 – Hello, Operator?

    June 11th is next week. It’s a Tuesday, the second Tuesday of the month, making it T-SQL Tuesday! This is your opportunity to write on a specific topic, along with countless* people from around the SQL community (* at least until the end of the day, when it will be quite easy to count how many people joined in). This month marks 3.5 years since it started, with 42 events held so far.

    This month, for number 43, I’m the host. That means that I set the topic, and this blog post is the one that you need to get a comment into so that I can find your post before I collate them.

    The topic is Plan Operators. If you ever write T-SQL, you will almost certainly have looked at execution plans (if you haven’t, go look at some now. I mean really – you should be looking at this stuff). As you look at these things, you will almost certainly have had your interest piqued by some, and tried to figure out a bit more about what’s going on.

    That’s what I want you to write about! One (or more) plan operators that you looked into. It could be a particular aspect of a plan operator, or you could do a deep dive and tell us everything you know. You could relate a tuning story if you want, or it could be completely academic. Don’t just quote Books Online at me, explain what the operator means to you. You could explore the Compute Scalar operator, or the many-to-many feature of a Merge Join. The Sequence Project, or the Lazy Spool. You’re bound to have researched one of them at some point (if you never have, take the opportunity this week), and have some wisdom to impart. This is a chance to raise the collective understanding about execution plans!

    [images: Compute Scalar, Merge Join, Sequence Project and Lazy Spool operator icons]

    So, the T-SQL Tuesday thing...

    If you haven’t heard about T-SQL Tuesday before, then do this exercise with me. Do a quick search for “T-SQL Tuesday”. If you glance down the results,  you’ll see a bunch of posts either inviting people to a T-SQL Tuesday, contributing in one (the ones that contribute link back to the host), or summarising the posts. The ‘host post’ (which this month is this one!) will have a bunch of comments and trackbacks, pointing to all the contributing posts (and hopefully to the summary too). All-in-all, it makes a terrific resource about that particular subject.

    So here’s what you do!

    1. Some time next Tuesday (GMT) – that’s June 11th 2013 – publish your blog post. If you’re in Australia like me, GMT Tuesday runs from mid-morning on the Tuesday through to mid-morning on Wednesday. If you’re in the US, then it’ll run from Monday evening through to Tuesday evening.

    2. Make sure that your post includes, somewhere near the top, the T-SQL Tuesday image (from above), with a link to this post. This helps people find the other posts on the topic, and hopefully encourages others to jump in with their own posts. If it helps, just switch to source/HTML mode, and put the following snippet in there:
    <a href="http://sqlblog.com/blogs/rob_farley/archive/2013/06/02/t-sql-tuesday-43-hello-operator.aspx" target="_blank"><img alt="TSQL Tuesday" align="right" border="0" src="http://sqlblog.com/blogs/rob_farley/TSQL2sDay150x150_1B3B2D1E.jpg"/></a>

    3. Come back to this post, and write a comment, providing a link to your post. That’s probably the only way I’ll know it’s there. You can tweet, sure (use the hashtag #tsql2sday), but I still might not find it.

    4 (optional, but maybe worthwhile). Keep your eye out for other people’s posts, and for when I publish the summary...

    That’s about it – happy writing!

    Remember: 1: Jun 11 GMT; 2: Image with a link; 3: Comment below.

    @rob_farley

  • Part of the journey: failure

    The topic for this month’s T-SQL Tuesday is about the journey. Wendy Pastrick’s choice (I’m hosting again next month!).

    There are a lot of journeys. There are some that just keep going, and others that seem to finish (some in success; some in failure). Of course, many of the ones that finish end up being the start of new journeys, but sometimes they don’t need to continue – they just need closure. There are things that can be learned regardless of how things went, whether or not goals were reached, and whether or not there was failure.

    There’s been a few things recently to remind me of this...

    I visited a company recently that had put a video together promoting the idea of failure. It wasn’t asking people to fail, but said “Go ahead and fail,” because failure happens. They had been through a rough time, but were persisting and seeing things turn around.

    Just the other night, we saw the musical Chitty Chitty Bang Bang (you probably know the movie – the Bond film where Dick van Dyke stars as the guy with the gadget-car, who takes on Goldfinger and falls for the girl with the inappropriate name). Anyway, there’s a brilliant song in that called “The Roses of Success” (YouTube link there). It has the same sentiment – “…from the ashes of disaster grow the roses of success!”

    A few years ago, my kids started saying “FAIL!” when someone did something wrong. I can’t say I liked the insult. Far worse would’ve been “DIDN’T TRY!” It would be very easy to just stay in bed and ‘avoid failure’ that way, but anyone who fails has at least done something. To fail, you must at least be active.

    I talk to a lot of people about Microsoft Certification, particularly people who have failed an exam. I tell everyone (not just those who have failed before) to try the exams before they feel they’re ready for them. What’s the worst that can happen? Worst case, they don’t pass. But how is that a bad thing? It might feel less than brilliant (I know, it’s happened to me before), but it gives an opportunity to target the weak areas before having a subsequent attempt. It doesn’t matter how many attempts it takes to get a passing score – the wrong option would be to give up. Studying can be excellent, but not to the point of causing extra stress.

    There are things in life we do easily, and there are things that we struggle with. I know there’s a bunch of stuff in my own life that falls into both categories. I don’t want this post to be a list of the things that I’m not doing well – I simply want to point out that I want to keep trying.

    With God’s help, I can improve in the areas in which I’m not excelling, and start to smell the roses of success.

  • Why I present

    Louis Davidson just asked me why I write, and now Bob Pusateri (@sqlbob) is asking me why I present, which is his question for this month’s T-SQL Tuesday.

    [image: T-SQL Tuesday logo]

    If you didn’t follow the link to see his actual question, you’ll need to know that he actually posed the question “How did you come to love presenting?”

    Well, sometimes I don’t, but on the whole, I have to admit that presenting is part of who I am, and I miss it if I’m not presenting. It’s why despite being a PASS board member (that link will only seem relevant if you’re reading this while I’m still one) and having plenty of reason to NOT present at the PASS Summit in 2013, I’ve submitted the maximum number of abstracts for consideration. It’s why I want to be teaching more, both online and in the classroom, and so on.

    It’s not that I think I have anything important to say (although I do only ever teach / present on things that I think are important).

    It’s not that I think I’m good at presenting (my feedback scores beg to differ).

    It’s not that I’m comfortable presenting (I still get ridiculously nervous most of the time).

    I’m just addicted to it.

    It’s a drug – it really is.

    I spend my time walking around the room, or around the stage, explaining things to people, watching for those moments when the audience gets it, and... well, I’m addicted to it.

    If you watch http://www.sqlbits.com/Sessions/Event5/Designing_for_simplification, you’ll see a few things. I was in Wales, and had started with the few words in Welsh that I know (but that’s been edited out – hopefully when I thought I was saying ‘hello’ I wasn’t actually insulting anyone). I nearly fell off the stage. I broke the microphone. I typed some things wrong in my queries. People complained that I didn’t say anything significant…

    But around 33:10 in, you hear the audience almost start clapping. IN THE UK (where people don’t clap for presentations). It’s a moment where people see something they weren’t expecting, and (hopefully) realise the potential in what they’ve heard.

    Phil Nolan wrote nicely about me on his blog, and said “Those of you who know Rob Farley will know he’s a funny guy with an enormous armoury of shockingly bad jokes.” More importantly though, he wrote “His design tips challenged a number of our ideas and meant I took away many valuable techniques,” which helped me know why I present.

    …because it’s not about me, it’s about you. I present because at least one of the people in the audience will benefit from it. And that’s addictive.

    @rob_farley

  • Filegroups and Non-Clustered Indexes

    Let’s start with some basics and then jump in a bit deeper, for this post to go with the 40th T-SQL Tuesday, hosted this month by Jen McCown.

    TSQL2sDay150x150

    SQL Server holds data, and that data is stored physically in files.

    Of course, in the database world we think of the data as living in tables*, so naturally there must be some sort of mapping between the data in a table and the data in a file. This is where filegroups come in.

    When SQL objects are created within a database, there is an option to tell the system where to store them. But this isn’t a file, it’s a filegroup. (If it were a file and the disk volume that the file was on filled up, we’d have big problems. Instead, we put objects in a filegroup, and can add extra files (on different volumes if so desired) to that filegroup.) Objects are stored within filegroups, and filegroups are groups of files, although many databases do end up with only a single file per filegroup.

    Filegroups end up providing a large amount of flexibility for the storage of data. Rarely accessed data can be put in filegroups that have files on cheaper (but probably slower) disk, while data that is heavily written can be put on disk that’s more appropriate for that, and so on. I’m sure you get the picture, and this is nothing overly thought-worthy.

    You may even have already considered the idea around partitioning data across filegroups, moving data of a certain age (but potentially even from the same table) onto a different filegroup, so that queries that use different parts of tables can benefit from having some data on faster disk. Lots of potential. :)
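    As a rough sketch of that idea (the filegroup names FGHistory and FGCurrent, the boundary date, and the object names are made up purely for illustration, and assume those filegroups already exist and have files in them), a partition function and scheme might look something like this:

    CREATE PARTITION FUNCTION pfOrderDate (date)
    AS RANGE RIGHT FOR VALUES ('20130101');
    GO
    CREATE PARTITION SCHEME psOrderDate
    AS PARTITION pfOrderDate TO (FGHistory, FGCurrent);
    GO
    -- A table or index created ON psOrderDate(OrderDate) would then keep pre-2013 rows
    -- on FGHistory and 2013-onwards rows on FGCurrent.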

    Where it becomes interesting though, is when you start to consider the concept of non-clustered indexes.

    You see, while a clustered index (or heap) provides the structure to store the data across all the columns of a table (although I’m simplifying it a little in regard to LOB data), we use copies of some of that data (in non-clustered indexes) to optimise access to the data.

    So where are these copies stored? Many database people don’t even think about this kind of thing, in the same way that they often don’t think about including indexes in their database design. I’m going to figure you’re not in that boat though, because I probably know you, and you’re probably not one of those people.

    Most people like to have a default filegroup that is NOT the primary filegroup. It means that when they create new objects, those new objects get put in the default filegroup, not the primary one. But it’s not actually that simple.

    Let’s start by creating a database.

    CREATE DATABASE fg_testing;
    GO
    USE fg_testing;
    GO
    SELECT *
    FROM sys.filegroups;
    --Only one filegroup at the moment

    image

    Notice the column data_space_id. This is the column which identifies each filegroup. We’ll use it later.

    Let’s create a new filegroup and set it to be the default.

    ALTER DATABASE fg_testing ADD FILEGROUP FG2;
    GO
    SELECT *
    FROM sys.filegroups;
    GO

    image

    Cool – data_space_id 2 is created.

    ALTER DATABASE fg_testing MODIFY FILEGROUP FG2 DEFAULT;

    image

    That command fails. I think this is a shame, but I guess it’s fair enough – we have to have a file in the filegroup before we can make it the default. That’s easy enough though, and probably what we’d be wanting to do before too long anyway.

    ALTER DATABASE fg_testing ADD FILE ( NAME = N'file2', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL11.SQL2012\MSSQL\DATA\fg_testing_file2.ndf') TO FILEGROUP FG2;
    GO
    ALTER DATABASE fg_testing MODIFY FILEGROUP FG2 DEFAULT;

    image

    (I’m doing this on my laptop, which only has one physical drive – on a server it’d go somewhere more sensible of course)
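    (If you’d rather not scan the full SELECT * output to confirm which filegroup is now the default, the is_default column in sys.filegroups is all you need – just a convenience check.)

    SELECT name, data_space_id, is_default
    FROM sys.filegroups;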

    Now let’s add another filegroup. This is going to be for some data, but I don’t want it to be my default.

    ALTER DATABASE fg_testing ADD FILEGROUP FG3;
    GO
    ALTER DATABASE fg_testing ADD FILE ( NAME = N'file3', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL11.SQL2012\MSSQL\DATA\fg_testing_file3.ndf') TO FILEGROUP FG3;

    image

    Now, I want to create an object on FG3, which I’m going to pretend is my slower disk.

    Once created, I’m looking in sys.objects to confirm which filegroup the table is in, but there’s no information there. Remember that a table is only metadata, and the things that matter are the indexes/heaps that are on it. Checking in sys.indexes shows me that indeed, it’s in data_space_id 3.

    CREATE TABLE dbo.OrderDates (OrderDate date PRIMARY KEY, NumOrders int NOT NULL DEFAULT 0) ON FG3;
    GO
    SELECT *
    FROM sys.objects
    WHERE object_id = OBJECT_ID('dbo.OrderDates');
    GO
    SELECT *
    FROM sys.indexes
    WHERE object_id = OBJECT_ID('dbo.OrderDates');
    GO

    image
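    If you’d rather see a filegroup name than just a data_space_id, a small join to sys.filegroups does the trick – purely a convenience query, nothing the rest of the post depends on:

    SELECT i.name AS index_name,
           i.type_desc,
           fg.name AS filegroup_name
    FROM sys.indexes AS i
    JOIN sys.filegroups AS fg ON fg.data_space_id = i.data_space_id
    WHERE i.object_id = OBJECT_ID('dbo.OrderDates');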

    For completeness’ sake, I’m going to put some data in there, using a query that I blogged about yesterday.

    INSERT dbo.OrderDates (OrderDate, NumOrders)
    SELECT TOP (DATEDIFF(DAY,'20010101','20130312'))
        DATEADD(day,ROW_NUMBER() OVER (ORDER BY (SELECT 1))-1,'20010101') AS OrderDate,
        ABS(CHECKSUM(NEWID())) % 100 as NumOrders
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    ORDER BY ROW_NUMBER() OVER (ORDER BY (SELECT 1))
    GO

    image

    But the whole point of this was to see what happens with the non-clustered index, which I want to be on recent data only (ie, filtered), and I want it to be in the DEFAULT filegroup.

    As I want it in the default group, I won’t specify a filegroup for the index.

    CREATE INDEX ixRecentData ON dbo.OrderDates (OrderDate)
    INCLUDE (NumOrders)
    WHERE OrderDate >= '20130101';
    GO
    SELECT *
    FROM sys.indexes
    WHERE object_id = OBJECT_ID('dbo.OrderDates');

    image

    But look! It’s put the non-clustered index in the same filegroup as the clustered index. This isn’t what I wanted. In fact, it’s almost never what I’d want, because even if the disk performance is identical, it can be nice for transactions that update both the clustered and non-clustered indexes to spread that work across different disks.

    Instead, we have to specify it explicitly to tell it to use the filegroup we want.

    DROP INDEX ixRecentData ON dbo.OrderDates
    GO
    CREATE INDEX ixRecentData ON dbo.OrderDates (OrderDate)
    INCLUDE (NumOrders)
    WHERE OrderDate >= '20130101'
    ON FG2;
    GO
    SELECT *
    FROM sys.indexes
    WHERE object_id = OBJECT_ID('dbo.OrderDates');

    image

    It’s a shame to have to be so explicit with these things, but when you create non-clustered indexes, you really should think about which filegroup they’re going onto, because they may well not go onto the default one as you might expect.
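    If you want to audit this sort of thing after the fact, something like the following lists every index in the database along with the data space it lives in (going via sys.data_spaces rather than sys.filegroups, so that indexes placed on partition schemes show up too):

    SELECT OBJECT_SCHEMA_NAME(i.object_id) AS [schema],
           OBJECT_NAME(i.object_id) AS [table],
           i.name AS index_name,
           i.type_desc,
           ds.name AS data_space_name
    FROM sys.indexes AS i
    JOIN sys.data_spaces AS ds ON ds.data_space_id = i.data_space_id
    JOIN sys.objects AS o ON o.object_id = i.object_id
    WHERE o.is_ms_shipped = 0
    ORDER BY [schema], [table], i.index_id;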

    @rob_farley

  • A TOP Query

    For the blog post that I’ll publish tomorrow, I wrote a query that I thought needed a blog post all of its own. This is that post.

    The query was this one. Its results aren’t that interesting – just a list of dates with a random number between 0 and 99. Just some sample data that I thought I’d use.

    SELECT TOP (DATEDIFF(DAY,'20010101','20130312'))
        DATEADD(day,ROW_NUMBER() OVER (ORDER BY (SELECT 1))-1,'20010101') AS OrderDate,
        ABS(CHECKSUM(NEWID())) % 100 as NumOrders
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    ORDER BY ROW_NUMBER() OVER (ORDER BY (SELECT 1));

    image

    So, let me give some background...

    When teaching about T-SQL, I often point out that a nums table is tremendously useful. One of its uses is to make a dates table, which can be really handy in a bunch of other ways, like in data warehouses. The idea is simple, assuming you have a table of numbers which starts from either 0 or 1 (I don’t really care, although I start mine from 1). Here I figure that you want the dates from 20010101 up to (but not including) today.
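    If you don’t happen to have a nums table, a minimal sketch for building one might look like the following (the name dbo.nums and the one-million-row size are just choices for illustration – build yours however you prefer):

    CREATE TABLE dbo.nums (num int NOT NULL PRIMARY KEY);
    GO
    WITH L0 AS (SELECT 1 AS c FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS ten(c)), -- 10 rows
         L1 AS (SELECT 1 AS c FROM L0 AS a CROSS JOIN L0 AS b), -- 100 rows
         L2 AS (SELECT 1 AS c FROM L1 AS a CROSS JOIN L1 AS b), -- 10,000 rows
         L3 AS (SELECT 1 AS c FROM L2 AS a CROSS JOIN L2 AS b)  -- 100,000,000 rows, plenty
    INSERT dbo.nums (num)
    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT 1))
    FROM L3;
    GO

    With a table like that in place, turning numbers into dates is just a DATEADD and a filter: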

    SELECT DATEADD(day, num-1, '20010101')
    FROM dbo.nums
    WHERE num <= DATEDIFF(day, '20010101', SYSDATETIME());

    I’m not going to show you the results, I figure that it’s easy for you to picture a list of dates.

    Oh, alright then.

    image

    Anyway, moving on.

    In this case, I didn’t have a nums table handy, and for that, I tend to use ROW_NUMBER() and the table master.dbo.spt_values. This table contains a bunch of rows of handy reference data. I wasn’t interested in the contents though; I just wanted to have enough rows, and it’s quicker to reference that table than to write a bunch of self-joining CTEs. There are over 2000 rows (2515 to be exact) in master.dbo.spt_values, and if I want a few million, I just cross join it to itself (2515 squared is over six million).

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS num
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2;

    image

    This query gives me a lot of rows, of course... but if I use it in a sub-query (or CTE) and filter it, the simplification work that the Query Optimizer does means it doesn’t try to work out all of those rows for me – it stops when it’s seen enough.

    As an example, I can use this in conjunction with my earlier query, which had a WHERE clause.

    WITH nums AS (
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS num
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    )
    SELECT DATEADD(day, num-1, '20010101')
    FROM nums
    WHERE num <= DATEDIFF(day, '20010101', SYSDATETIME());

    This produces my rows in less than a second, giving the same results as when we had a physical nums table.

    To include my random values, I’m using the commonly found randomizing method of ABS(CHECKSUM(NEWID())) % 100. RAND() is no good – it just produces the same value for every row. NEWID() is much better, but it’s not a number. CHECKSUM solves that, but the result can be negative. ABS fixes the sign, but still leaves a value in a very large range, and taking it modulo 100 brings it down to something between 0 and 99.
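    If you want to see the pieces individually, a quick throwaway query like this shows each step (each column calls NEWID() separately, so they’re independent samples, and the values will change on every run):

    SELECT NEWID() AS new_guid,                         -- a uniqueidentifier, not a number
           CHECKSUM(NEWID()) AS checksum_value,         -- an int, quite possibly negative
           ABS(CHECKSUM(NEWID())) AS abs_value,         -- non-negative, but a huge range
           ABS(CHECKSUM(NEWID())) % 100 AS num_orders;  -- somewhere between 0 and 99

    Plugged into the nums CTE, the full query looks like this: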

    WITH nums AS (
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS num
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    )
    SELECT DATEADD(day, num-1, '20010101') AS OrderDate,
           ABS(CHECKSUM(NEWID())) % 100 as NumOrders
    FROM nums
    WHERE num <= DATEDIFF(day, '20010101', SYSDATETIME());

    image

    This works nicely, and is a common pattern for dealing with this kind of thing.

    But it’s not the query at the top of this post. That was done without a CTE, and used TOP instead, ordering by the ROW_NUMBER().

    Let’s have a look at it.

    SELECT TOP (DATEDIFF(DAY,'20010101','20130312'))
        DATEADD(day,ROW_NUMBER() OVER (ORDER BY (SELECT 1))-1,'20010101') AS OrderDate,
        ABS(CHECKSUM(NEWID())) % 100 as NumOrders
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    ORDER BY ROW_NUMBER() OVER (ORDER BY (SELECT 1));

    We can see that the FROM clause is the same as in our nums CTE. And the same randomising bit for NumOrders is in there.

    But instead of using ROW_NUMBER() OVER (ORDER BY (SELECT 1)) to define a num column, we’re using it within another function? Can we do that? Yes. ROW_NUMBER() can only be used in the SELECT clause and in the ORDER BY clause, and there are restrictions about using it within aggregate functions as you might expect, but here we’re just using it as a value which changes with every row, and there’s really no problem at all.

    We don’t have a filter though. In our CTE version, we used a filter to make sure we weren’t using every row from our CROSS JOIN. Here, we’re using TOP instead. But not TOP with some number – TOP with a function in it! This has been possible for a while, and it even supports sub-queries that produce numbers, in the form (including the double-bracket): SELECT TOP ((SELECT SomeVal ... )) Col1, Col2...
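    A couple of throwaway examples of the shape of that syntax (the system views are used here only because they’re always available to query):

    DECLARE @n int = 5;

    SELECT TOP (@n * 2) name        -- any int expression works, including variables
    FROM sys.objects
    ORDER BY name;

    SELECT TOP ((SELECT COUNT(*) FROM sys.schemas)) name  -- note the double bracket around the sub-query
    FROM sys.objects
    ORDER BY name;

    Anyway, back to the query in question.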

    TOP appears in this query because I needed to limit the number of rows coming back. I couldn’t use the WHERE clause, because I didn’t have anything to filter on. So I used TOP, and had to use a function in there.

    So let’s compare the plans.

    To be fair, I’ll use the fixed date in both queries (for now).

    image

    Oh, how disappointing! My TOP query is 79% of the batch, and the CTE version is 21%. Clearly my new one is 4 times as bad, and I should give up on it.

    Actually, if you run these two queries on your own machine, you’ll see the first one isn’t 4 times as bad at all. It’s actually FASTER than the second. Something’s going on, and we should find out what.

    The plans look remarkably similar. In fact, the second one is identical apart from having an extra Filter operator. There’s no Filter in the TOP one, but I didn’t really expect that to make this much difference.

    Otherwise, the plans look pretty similar. They both use a Row Count Spool, have a Sequence Project to work out the ROW_NUMBER, and they both use a Top operator – even the second one which doesn’t use TOP.

    But you see, the Query Optimizer would have seen that I was filtering on a column that mapped to ROW_NUMBER(), and that I was doing a “less than” operation there. That’s like doing a TOP, and the Query Optimizer sees benefit in this. It doesn’t explain what’s going on with the “4 times worse” thing though.

    Let’s examine some of the numbers. Interestingly, we see that the Nested Loop operator expects to do almost no work in the second plan, and 27% in the first. There’s a warning on there – that’s just because I’m using a CROSS JOIN, and I’m okay with that.

    The word ‘expect’ was very important in that last paragraph. The percentages there are based on the work that is expected to be done. Let’s look at the properties of the two Nested Loop operators.

    image image

    Glancing down here, we see a lot of it is the same, but the Estimated Number of Rows in the first one is 4453 (which is correct), while the second one is only 100 (not quite so correct). The arrows on the upper side of the Nested Loops show the effect of this.

    image image

    The second one figures it’ll have seen all the rows it needs to see before it even gets a second row from the first table, whereas the first one thinks it might need 1.77058 rows (being 4453/2515). No wonder the second query thinks it’s quicker.

    Let’s see where this 100 estimate comes from though. Perhaps it’s that filter?

    Based on the estimates of the rows going into the filter and coming out of it, it expects that the filter will see 100 rows and return just 30. This is bad too, but it’s not as bad as the 100 v 4453 impact.

    This guess actually comes from the way that the Query Optimizer has turned the WHERE clause into a TOP clause. It figures there’s some doubt there, and guesses that 100 rows is probably not a bad place to start. When we give an explicit value (even using the DATEDIFF function), it can figure out what’s there and use that value. The second query goes part way there and works out that the result of the DATEDIFF is 4453, but simply doesn’t apply it fully to the Top operator, leaving us with that guess.

    image image

    It’d be nice if it could tell that 4453 is never NULL, and simplify this out a bit more, but it simply doesn’t do this.

    To round the post off, let’s consider what happens if we’re using SYSDATETIME() instead of the constant.

    SELECT TOP (DATEDIFF(DAY,'20010101',SYSDATETIME()))
        DATEADD(day,ROW_NUMBER() OVER (ORDER BY (SELECT 1))-1,'20010101') AS OrderDate,
        ABS(CHECKSUM(NEWID())) % 100 as NumOrders
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    ORDER BY ROW_NUMBER() OVER (ORDER BY (SELECT 1));

    WITH nums AS
    (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS num
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    )
    SELECT DATEADD(day,num-1,'20010101') AS OrderDate, ABS(CHECKSUM(NEWID())) % 100 as NumOrders
    FROM nums
    WHERE num <= DATEDIFF(DAY,'20010101',SYSDATETIME());

    image

    Oh! Now the first one is simpler still, leaving out the Row Count Spool operator, and thinking it’s going to be cheaper than the second one. Having not trusted that figure before, does this mean the first one is actually worse? Well, we have an idea about where to look – the estimates on some of the arrows, particularly near the Top operator.

    image image

    Oh no! Our first query thinks there’s now only one row going to be coming through. How awful! (Oh, and the number 4452 is fine, because I’m actually running this on March 11th, not March 12th, it’s just that March 12th is T-SQL Tuesday this month, which is what I was writing the query for).

    If you run this on your machine, hopefully you’ll see something different. Really.

    You see, this problem has kinda been fixed, and if you enable the documented trace flag 4199, it should be better – for the first query at least.

    By turning on trace flag 4199, and telling it to ignore what’s in the cache, it will evaluate SYSDATETIME() for that particular execution, and therefore come up with the right value for the Top operator. It doesn’t fix the WHERE clause version, but it does solve the TOP clause version.

    SELECT TOP (DATEDIFF(DAY,'20010101',SYSDATETIME()))
        DATEADD(day,ROW_NUMBER() OVER (ORDER BY (SELECT 1))-1,'20010101') AS OrderDate,
        ABS(CHECKSUM(NEWID())) % 100 as NumOrders
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    ORDER BY ROW_NUMBER() OVER (ORDER BY (SELECT 1))
    OPTION (QUERYTRACEON 4199, RECOMPILE);

    WITH nums AS
    (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS num
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    )
    SELECT DATEADD(day,num-1,'20010101') AS OrderDate, ABS(CHECKSUM(NEWID())) % 100 as NumOrders
    FROM nums
    WHERE num <= DATEDIFF(DAY,'20010101',SYSDATETIME())
    OPTION (QUERYTRACEON 4199, RECOMPILE);

    image

    The reason I say this shouldn’t have happened on your system is that you’re probably already using trace flag 4199 as a start-up parameter.

    So there you have it... a query which might seem strange at first glance, but is actually a really nice alternative. Don’t be afraid to use expressions in your TOP clause – it’s a very powerful mechanism, and TOP is a great operator to see in your plans (because they will ask for fewer rows from your Scans and Seeks – remember how execution plans suck?). As is often the case, we’re tripped up by estimates being out, but if you can see what’s actually going on, you should be able to make good decisions about how to form your queries.

    @rob_farley

  • Why should you bother with the PASS BA Conference this April?

    I mean really? Why should you spend some of your training budget to go to this thing?

    Suppose you’re someone in the PASS Community who mainly looks after people’s data. That could involve database administration, performance tuning, helping developers write queries, that kind of thing. What part of “Advanced Analytics and Insights”, “Big Data Innovations and Integration”, “Data Analytics and Visualization”, “Information Delivery and Collaboration” or “Strategy and Architecture” is relevant to you? It sounds all well and good for the BI team, who’s thinking about cubes and models and report subscriptions and Power something, but that’s not you.

    The problem is that as data professionals, we’re no longer just database administrators. The world has become more demanding than that. Maybe it’s because of the financial difficulties that the western world has been facing. Maybe it’s because we’ve out-grown our jobs as database administrators. Maybe we’re simply being asked for more than we were before.

    Now more than ever before, if you’re a data professional, you need to be thinking about more than just transaction logs, corruption checking, and backups. You need to be thinking about the overall data story. You can tune your databases to cope with the large amount of data that’s pouring into them, as more and more systems produce consumable data. You can work with your developers to help them understand the significance of indexes in getting the data out faster. But is this really enough?

    Today, we need to be strategic about the data. An increasing number of companies are moving their data to the cloud, where the need for database administrators is not quite the same as it has been in the past. There are a number of tools out there to allow you to manage hundreds, even thousands of database servers, putting pressure on you to be providing more from your role.

    And then you get asked into meetings! People more senior than you asking about what can be done with the data. Can you offer more than just a comment about how much they can trust you to make sure their data is available?

    This is why you need to be looking at things like the Business Analytics Conference. It’s because you need to know how to make the data that you look after more relevant to the organisation who entrusts you with it. You need to know how to get insight from that data. You need to know how to visualise it effectively. You need to know how to make it visible through portals such as SharePoint.

    And you need to know WHY these things are important.

    Either that, or you need to call in external consultants, who can provide these kinds of services. You know how to get in touch. ;)

    @rob_farley

    PS: I should mention that I’m on the PASS board, so I see a lot of stuff about this conference. I’m not part of the organising committee at all though, and have been remarkably separate from the whole process. I do consider that this conference is about helping people achieve more within the data space, and that’s something I think more people should be taking advantage of.
