- Owner/Principal with LobsterPot Solutions (a MS Gold Partner consulting firm), Microsoft Certified Master, Microsoft MVP (SQL Server), APS/PDW trainer and leader of the SQL User Group in Adelaide, Australia. Rob is a former director of PASS, and runs training courses around the world in SQL Server and BI topics.
This past weekend saw the first SQL Saturday in Melbourne. Numbers were good – there were about 300 people registered, and the attendance rate seemed high (though I didn’t find out the actual numbers). Looking around during the keynote, I didn’t see many empty seats in the room, and I knew there were 300 seats, plus people continued to arrive as the day went on.
My own session was fun. I’d been remarkably nervous (as I often am) beforehand, particularly as this was a talk I hadn’t given in about 3.5 years. There were elements of it that I teach often enough, but it was more about the structure of the talk, which ends up being so critical to how things end up working. I may give the impression of talking completely off-the-cuff, but I do have most of it thoroughly planned – the lack of slides and firm agenda is primarily there to allow me to have flexibility to match the audience.
Anyway, as my demos were coming together, I found myself putting ‘GO’ between various lines, so that my CREATE statements didn’t get the red squiggly underlines by the SSMS window. I find it kinda frustrating when I’m just going to be running individual statements, but nevertheless, it’s good to avoid the squiggles. But of course, GO isn’t part of T-SQL, and I thought it was worth mentioning it. I think only one person in the room (a former student) had heard me explain this before, so it worked out okay. And it fits in nicely with this month’s T-SQL Tuesday, which is on the topic of “Dirty Little Tricks”, and hosted by Matt Velic (@mvelic).
So I’m going to show you two tricks, which are essentially harmless, but also help demonstrate potentially useful features of SQL Server – one in T-SQL, and one in Analysis Services.
The T-SQL one, as I’ve already mentioned, is about the GO keyword, which isn’t actually part of T-SQL.
You see, it’s a feature of SQL Server Management Studio, and of sqlcmd, but it’s not really a database engine thing. It’s the batch separator, and defines the point at which a bunch of T-SQL commands should be separated from the bunch that follow. It’s particularly useful for those times when the command you’re issuing needs to be the first command in the batch (such as a CREATE command), or even issued completely by itself (such as SET SHOWPLAN_XML ON).
…and it’s configurable.
This is the SSMS Options dialog, and you’ll see an option where you can change GO to be something else.
I had thought at some point that you could change the Batch Separator to just about anything else, and then create a stored procedure called ‘go’, but of course, if you have more than one statement in your batch, then you must use EXEC to run a stored procedure. So hoping that ‘go’ might run a stored procedure by appearing at the end of your batch doesn’t work. Besides that would be a BadThingToDo.
Proper mischief involves changing it to a keyword, such as CREATE, DELETE or SELECT. If you make it SELECT, then all kinds of things will stop working in SSMS, and every SELECT query will come back with “A fatal scripting error occurred. Incorrect syntax was encountered while parsing SELECT.” Well, for new windows at least.
The point at which it becomes really annoying for your unsuspecting colleague is that restarting SSMS only makes it worse. The setting is stored in C:\Users\YourName\AppData\Roaming\Microsoft\SQL Server Management Studio\12.0\SqlStudio.bin (the 12.0 means 2014 – it’s 11.0 for SQL2012), so even if you think installing a new copy of SSMS will fix it, it won’t.
Sadly for you, reader, if they do an internet search, they’ll find posts like this one, and they will quickly realise who inflicted this pain on them.
The other trick that I thought I’d mention is with SSAS translations, and again, demonstrates a nice feature of SQL Server.
One rarely-used feature of Analysis Services (Multidimensional) is Translations.
I say it’s rarely used, but if you have an environment that needs to cater for multiple languages, then you could well use them. They allow someone who has different language settings on their client machine to see localised data, dimension names, and so on. It’s really quite clever, and can be very influential in gaining acceptance of a system that must be used throughout all the worldwide branches of your organisation.
But where you can have fun (so long as it doesn’t go into production) is when you have someone on your dev or test team who is originally from a different country (but with the same language), and likes to have their computer set to their home language. Like someone in Australia who likes to use English (New Zealand), or who likes to have English (Canada), despite the fact that they’ve been living in the United States for some years.
The trick here is to introduce a translation in the language that they choose to use. They’ll be the only person who will notice it, and you can go as subtle or as blatant as you like.
In the editor for the Cube file, you will see a Translations tab over on the right. It lets you enter the words in that language for the various concepts. So you could throw in the odd “eh” for Canadians, or mix up the vowels for Kiwis.
Once you get into Dimension translations, you have so many more options! You can tell the data within attributes to come from a different column, even one that you’ve only made up for the DSV. That means that the reports they see can throw in the odd reference to hockey, or hobbits, or whatever else you might decide is appropriate to mess with their heads. Of course, when they see the report having the wrong names for things, they’ll tell someone else to fix it, but there won’t be anything to fix. It’s almost the ultimate “Doesn’t work on my machine” scenario, just to mess with that one person who doesn’t have their language settings the same as everyone else.
…but please don’t let either of these go on in production. The last thing you need is to have someone think SQL is broken in production, or to have someone think you’re racist, when you’re just picking on New Zealanders.
There are quite a few reasons why an Index Scan is better than an Index Seek in the world of SQL Server. And yet we see lots of advice saying that Scans are bad and Seeks are good.
Let’s explore why.
Michael Swart (@MJSwart) is hosting T-SQL Tuesday this month, and wants people to argue against a popular opinion. Those who know me and have heard my present would realise that I often argue for things that are somewhat unconventional, and that I have good reason for doing so. (For example, in my Advanced T-SQL course, I teach people how to write GROUP BY statements. Because most people do it wrong most of the time.)
So today I’m going to look at some of what’s going on with Scans and Seeks, and will demonstrate why the Seek operator is the one that has more to do.
I’m not going to suggest that all your execution plans will be better if all the Seeks are replaced by Scans of those same indexes. That’s simply not the case. But the advice that you always hear is a generalisation. Some Seeks are better than some Scans, and some Scans are better than some Seeks. But best of all of them is a particular Scan, and hopefully this post will go some way to convincing you of that, and demonstrate ways that you can help your queries take advantage of this technique.
From the user’s perspective, the big thing with Seeks is that the database engine can go straight to the required data for a particular query, whereas Scans search through the whole table for the data that’s needed. This is fairly true, and certainly, if it were the whole story, then it would be very hard to argue against Seeks. After all – if we can go straight to the required data, then that’s perfect! Hopefully you’re already thinking that it does sound too good to be true, and yet this is what we’re taught about Seeks.
An index uses a tree-structure to store its data in a searchable format, with a root node at the ‘top’. The data itself is stored in an ordered list of pages at the ‘leaf level’ of the tree, with copies of the ‘key data’ in levels above. The ‘key data’ is anything that’s defined in the index key, plus enough extra data to make sure that each row is uniquely identifiable (if the index is a ‘unique index’, then it already has enough information, if not, then the clustered index key(s) are included – with uniquifier column if the CIX keys are not unique), and therefore searchable. This means that the data can be found quite quickly, but it still requires some searching. It’s not like we have the file, pageid and slot number ahead of time. Then we really could go straight to the data we needed, which is what happens when we do a RID Lookup against a heap. We might find that this address stores nothing more than a forwarding record to another RID, but still we’re getting to the data very quickly. With an Index Seek, or even a Key Lookup, we need to find the data by searching for it through the levels of the tree.
I’ll also point out that a Seek takes two forms: Singleton and RangeScan, depending on whether the systems knows that we’re looking for at most one record, or whether we’re looking for multiple records. The singleton form is only used when the system already has sufficient data to identify a unique record. If there is any chance that a second record could match, then a RangeScan is performed instead. For the sake of the post, let’s consider the singleton form a special case of the RangeScan form, because they both dive in to the index the same way, it’s just that the singleton only dives down, rather than looking around once there.
So the Seek operation works out that it can use the index to find some rows that satisfy a predicate – some condition in an ON, WHERE or HAVING clause. It works out a predicate that indicates the start of the range, and then looks for that row. The database engine starts at the top of the tree, at the root node, and checks the index key entries there to find out which row to go to at the next level down, where it repeats the operation, eventually reaching the leaf level, where it can find the start of the range. It then traverses the leaf pages of the index, until it reaches the end of the range – a condition which must be checked against each row it finds along the way.
A Scan simply starts at the first page of the index and starts looking. Clearly, if only some of the rows are of interest, those rows might not be all clumped together (as they would be in an index on a useful key), but if they are, then a Seek would’ve been faster for the same operation, but there’s important part here:
Seeks are only faster when the index is not ideal.
Seeks are able to locate the data of interest in a less-than-perfect index more quickly than simply starting at the first page and traversing through.
But that search takes effort, both at the start, and on each record that must be checked in the RangeScan. I’m not just talking about any residual predicates that need to be applied – it needs to check each row to see if it’s found the end of the range. Granted, these checks are probably very quick, but it’s still work.
What’s more, a Seek hides information more than a Scan.
When you’re troubleshooting, and you look at a Scan operator, you can see what’s going on. You might not be able to see how many rows have actually considered (ie, filtered using the Predicate) before returning the handful that you’ve asked for (particularly if the scan doesn’t run to completion), but other than that, it’s pretty simple. A Seek still has this (residual) Predicate property, but also has a Seek Predicate that shows finds the extents of the RangeScans – and we have no idea how big they are. At least with a Scan we can look in sys.partitions to see how many rows are in there.
Wait – RangeScans? Plural?
Yes. The execution plan does tell you that there are multiple RangeScans, if you look at the properties of the Seek operator. Obviously not in ‘Number of Executions’, or in ‘Actual’ anything. But in the Seek Predicates property. If you expand it. And count how many (at leas they’re numbered) entries there are. Each of these entries indicates another RangeScan. Each with its own cost.
And it’s not about the ‘Tipping Point’
I’m not going to talk about the fact that a Seek will turn into a Scan if the Seek is not selective enough, because that’s just not true. A Seek of a non-covering index, one that then requires lookups to get the rest of the required information will switch to using a covering index, even if that index is not ideal, if the number of lookups needed makes the ‘less ideal but covering’ index a less-costly option. This concept has nothing at all to do with Seeks and Scans. I can even make a Scan + Lookups turn into a Seek at a tipping point if you’re really keen... it’s entirely about the expense of Lookups.
So, Seeks have slightly more work to do, but this work is to make up for the indexes are typically ‘less-than-perfect’.
Whenever you need just a subset of an index, where that subset is defined by a predicate, then a Seek is going to be useful. But in a perfect world, many of our indexes can be pre-filtered to the rows of interest. That might be “active tasks” or “orders from today”, or whatever. If a query hits the database looking for this set of things, then a Scan is ideal, because we can choose to use an index which has already been filtered to the stuff we want.
So I don’t mind Scans. I don’t view them with the same level of suspicion as I do Seeks, and I often find myself looking for those common predicates that could be used in a filtered index, to potentially make indexes which are pre-filtered, and which are more likely to be scanned, because they have the 20 rows of interest (rather than seeking into a much larger index to get those 20 rows).
There’s more to this – I’ve delivered whole presentations on this topic, where I show how Scans can often make Top queries run quite nicely, and also how Seeks can tend to be called too frequently.
I don’t want you to start working to turn all your plans’ Seeks into Scans – but you should be aware that quite often, a Seek is only being done because your index strategy has space for improvement.
I feel like every database project has major decisions now, which are remarkably fundamental to the direction that’s going to be taken. And it’s almost as if new options appear with ever-increasing frequently.
Consider a typical database project, involving a transactional system to support an application, with extracts into a data warehouse environment for reporting, possibly with an analytical layer on top for aggregations.
Not so long ago, the transactional system could be one of a small number of database systems, but if you were primarily in the Microsoft space you’d be looking at SQL Server, either Standard or Enterprise (but that decision would be relatively easy, based on the balance between cost and features), with extracts into another database, using Analysis Services for aggregations and Reporting Services for reports. Yes, there were plenty of decisions to make, but the space has definitely become more complex. If you’re thinking about a BI solution, you need to work out whether you should leverage the SharePoint platform for report delivery, figure out whether you want to use the Tabular or Multidimensional models within SSAS, Project or Package within SSIS, and of course, cloud or ‘on-premise’.
This month’s T-SQL Tuesday topic, hosted by fellow MCM Jason Brimhall (@sqlrnnr) is on the times when a bet has had to be made, when you’ve had to make a decision about going one way rather than another, particularly when there’s been an element of risk about it. These decisions aren’t the kind of thing that could cause massive data loss, or cost someone their job, but nonetheless, they are significant decisions that need to be made, often before all the facts are known.
As I mentioned before, one of the biggest questions at the moment is: Cloud or “On-Premise”
I’m not going to get into the “on-premise” v “on-premises” argument. The way I look at it, “on-premise” has become an expression that simply means “not in the cloud”, and doesn’t mean it’s actually on your premises at all. The question is not about whether you have a physical server that you can walk up to without leaving your office – plenty of organisations have servers hosted with an ISP, without being ‘in the cloud’. It also doesn’t mean that you’re avoiding virtual machines completely.
So by ‘cloud’, I’m talking about a system like Windows Azure SQL Database. You’ve made the decision to adopt something like WASD, and are dealing with all the ramifications of such a system. Maintenance of it is being handled as part of your subscription. You’re not making decisions about what operating system you’re using, or what service accounts are being used. You’re spinning up a database in the cloud, because you’ve made a decision to take the project that way.
WASD has a much smaller initial outlay than purchasing licenses, and the pricing model is obviously completely different – not only using a subscription basis, but considering data transfer (for outbound data) too. If you’re comparing the cost of getting your system up and running, then the fact that you’re not having to set up servers, install an operating system, have media for backups, and so on, means that choosing the cloud can seem very attractive.
But there are often caveats (how often are ‘bets’ made more risky because of a caveat that was ignored or at least devalued?).
For example, right now, the largest WASD database is limited to 150GB. That might seem a lot for your application, but you still need to have considered what might happens if that space runs out. You can’t simply provision a new chunk of storage and tell the database to start using that as well.
You need to have considered what happens when the space runs out. Because it will.
I would like to think that every database system has asked this question, but too frequently, it doesn’t get asked, or otherwise, the answer is disregarded. Many on-premise systems find it easy enough to throw extra storage at the problem, and this is a perfectly valid contingency plan. Other systems have a strict archiving procedure in place, which can also ensure that the storage stays small. But still, there are questions to ask, and a plan to work out.
To me, it feels a lot like what happened to Twitter in its early days. The concept of Twitter is very simple – it’s like text messages sent to the world. But because the idea caught on, scaling become a bigger problem than they expected, much earlier than they expected. They were a victim of their own success. They worked things out, but there were definitely growing pains.
In the 1990s, many of us in the IT industry spent a decent amount of time fixing code that no one imagined would need to be still running in the futuristic 21st century. After the fact, many claimed that the problem had been over-exaggerated, but those of us who had worked on systems knew that a lot of things would have broken if we hadn’t invested that effort. It’s just that when a lot of software got written, no one expected it to be still be used in 2000. Those coders didn’t expect to be so successful.
It’s too easy to become a victim of success. I tell people that if they have done a good job with their database application, they will probably have underestimated its popularity, and will have also underestimated the storage requirements, and so on. I’ve seen many environments where storage volumes were undersized, and volumes which had been intended for one type of use now serve a variety (such as a drive for user database data files now containing tempdb or log files, even the occasional backup). As a consultant I never judge, because I understand that people design systems for what they know at the time, not necessarily the future. And storage is typically cheap to add.
But when it comes to Windows Azure SQL Databases, have a plan for what you do when you start to reach 150GB. Scaling out should be a question asked early, not late.
Suppose your SQL table has a bunch of spatial points (geographies if you like) with an order in which they need to appear (such as time) and you want to convert them into a LineString, or path.
One option is to convert the points into text, and do a bunch of string manipulation. I’m not so keen on that, even though it’s relatively straightforward if you use FOR XML PATH to do the heavy lifting.
The way I’m going to show you today uses three features that were all introduced in SQL Server 2012, to make life quite easy, and I think quite elegant as well.
Let’s start by getting some points. I’ve plotted some points around Adelaide. To help, I’m going to use Report Builder to show you the results of the queries – that way, I can put them on a map and you can get a feel for what’s going on, instead of just seeing a list of co-ordinates.
First let’s populate our data, creating an index that will be helpful later on:
select identity(int,1,1) as id, *
(geography::Point(-34.924269, 138.599252, 4326), 'Cnr Currie & KW Sts', cast('20140121 9:00' as datetime)),
(geography::Point(-34.924344, 138.597544, 4326), 'Cnr Currie & Leigh Sts', '20140121 9:30'),
(geography::Point(-34.923025, 138.597458, 4326), 'Cnr Leigh & Hindley Sts', '20140121 10:00'),
(geography::Point(-34.923016, 138.597608, 4326), 'Cnr Bank and Hindley Sts', '20140121 10:30'),
(geography::Point(-34.921775, 138.597533, 4326), 'Cnr Bank St and North Tce', '20140121 11:00'),
(geography::Point(-34.921520, 138.601814, 4326), 'Cnr North Tce and Gawler Pl', '20140121 11:30'),
(geography::Point(-34.924071, 138.601975, 4326), 'Cnr Gawler Pl and Grenfell St', '20140121 12:00'),
(geography::Point(-34.923966, 138.605590, 4326), 'Cnr Grenfell and Pulteney Sts', '20140121 12:30'),
(geography::Point(-34.921338, 138.605405, 4326), 'Cnr Pulteney St and North Tce', '20140121 13:00')
) p (geo, address, timeatlocation);
create index ixTime on dbo.JourneyPoints(timeatlocation) include (geo);
select * from dbo.JourneyPoints;
Great. Starting at the corner of Currie and King William Streets, we wander through the streets, including Leigh St, where the LobsterPot Solutions office is (roughly where the ‘e’ is).
I’ve labelled the points with the times, but it’s still not great viewing. Frankly, it’s a bit hard to see what route was taken.
What we really want is to draw lines between each of them. For this, I’m going to find the next point in the set, using LEAD(), and use the spatial function ShortestLineTo to get the path from our current point to the next one.
lead(geo) over (order by timeatlocation) as nextGeo,
geo.ShortestLineTo(lead(geo) over (order by timeatlocation)) as lineToNext,
I didn’t need to use pull back the fields geo and nextGeo, but I figure that the lineToNext column might be confusing at first glance, since it uses the subsequent row’s position as an argument in a function on the current row’s position. Anyway, hopefully you get the gist, here’s what it looks like.
This is way better – you can see the path that was taken, and can easily tell that the route didn’t just go straight up North Tce, it ducked down Gawler Place instead.
What’s more – with each part of the journey still being a separate row, I can colour each part differently. You know, in case I don’t like the “Tomato” colour in my last example (yes, that colour is called “Tomato”, no matter whether you say it “tomato”, or “tomato”, or even “tomato”).
To colour it differently, I’m going to throw in an extra field, which is just the number of minutes since we started. I’ll use the old fashioned OVER clause for that, to count the number of minutes since the earlier time.
select geo.ShortestLineTo(lead(geo) over (order by timeatlocation)) as lineToNext,
datediff(minute, min(timeatlocation) over (), timeatlocation) as minutesSinceStart
Cool – now I can easily tell which end it started at (the more tomatoey end), and where it ended (the paler end). Each segment is the same colour, but that’s okay.
Now, I said I’d use three SQL 2012 features, and so far the only new ones have been LEAD and ShortestLineTo. But remember I still have several rows, and each section of the route is a separate line. Well, to join them up, I’m going to use 2012’s UnionAggregate function. To use this, I need to use a sub-query (I’ll go with a CTE), because I can’t put an OVER clause inside an aggregate function.
with lines as (
select geo.ShortestLineTo(lead(geo) over (order by timeatlocation)) as LineToNext
select geography::UnionAggregate(LineToNext) as WholeRoute
Now I have my solution! I’ve converted points into lines, in the right order.
You may be wondering how this performs – what kind of execution plan is going to appear.
Well it’s this:
Look at this – there are Stream Aggregates (which just watch the data as it comes through, popping rows out when needed, but never holding onto anything except the aggregate as it grows), a Spool (which is used to do a bit of the windowing trickery, but also holding onto very little), and the Sequence Project & Segment operators which generate a row_number as a marker for the lead function. You might be interested to know that the right-most Stream Aggregate has the following “Defined Value” property:
[Expr1005] = Scalar Operator(LAST_VALUE([spatial_test].[dbo].[JourneyPoints].[geo])),
[[spatial_test].[dbo].[JourneyPoints].geo] = Scalar Operator(ANY([spatial_test].[dbo].[JourneyPoints].[geo]))
For each group (which is defined as the row), it uses the LAST_VALUE of geo, and ANY of geo. ANY is the current one, and LAST_VALUE is the row after it. It’s the last row, because the Spool gives up two rows for each ‘window’ – the current row and the lead row. In this scenario, with 9 rows of data in the index, the Spool pulls in (from the right) 9 rows, and serves up (to the left) 17. That’s two per original row, except the last which doesn’t have a lead row.
So the overhead on making this work is remarkably small. With an index in the right order, the amount of work to do is not much more than scanning over the ordered data.
Finally, if I had wanted to do this for several routes, I could have put a RouteID field in the table, used PARTITION BY RouteID in each OVER clause, and GROUP BY RouteID in the final query. If you do this, then you should put routeid as the first key column in your index. That way, the execution plan can be almost identical (just with slightly more explicit grouping, but with identical performance characteristics) to before.
with lines as (
select routeid, geo.ShortestLineTo(lead(geo) over (partition by routeid order by timeatlocation)) as LineToNext
select routeid, geography::UnionAggregate(LineToNext) as WholeRoute
group by routeid;
But I don’t have a picture of that, because that wasn’t the query I was wanting.
A couple of years I ago I was going to write a song about automation, in reggae style, which could maybe have been used by the Trinidad SQL community – particularly Nigel Sammy (@nigelsammy). The theme was going to be around the fact that you need automation because the sun shines and the beach is calling.
But of course, automation is about so much more than freeing up time for the beach (even here in Adelaide, where every weekday this week is set to be over 40C). Automation helps you be consistent in what you do by removing manual steps, and lets you focus your attention on the things that require thought, rather than being the same as always.
This month’s T-SQL Tuesday is about automation, and I thought I’d write about how a few of my favourite applications help me massively in the quest for better automation. The host-post asks about what has changed since the last time automation was a topic, but that time I mainly looked at Policy Based Management, which is great for being able to make sure that things happen. This time, I want to look particularly at the things I use to develop repeatable commands, thereby reducing how much I have to do compared to how much can be done by the machine.
SQL Server Management Studio (SSMS)
The Script button in dialogs! Oh how I love it. In fact, I wish that there were no OK button on dialog boxes in SSMS. I would be perfectly fine with a “Script and Close” button instead. I know I could have an Extended Events session or Trace running to be able to pick up what has just been run on the SQL box, but that doesn’t quite cut it. When I hit the OK button, I don’t actually know what commands are going to be run. I’ll have a good idea, of course, but if I’ve been tabbing through options and accidentally changed something, I might not have noticed (ok, I’m sure I will have, no one ever makes that mistake in real life). Even more significantly though, I might want to be able to run exactly the same command against another server. The Script button is amazingly useful and should be used by EVERYONE.
While I was at university, I used Unix a lot. My PC at home ran Linux, and I shuddered whenever I’d find out I had to use a Microsoft environment. It’s okay – I got over it – but one thing that remains is my appreciation for the text editor vi. I was pretty much forced to use it for a long while, and for a good year or more, I think I learned a new way of doing things almost every day. Just about every time you’d sit with someone else and work with them, you’d see something they’d do and go “Oh, how did you do that?” Of course, they’d reply with “Oh, that’s just pressing star”, or something like that. It was a good time, and I developed an appreciation for vi (and later, vim, and its Windows client gVim), which has stayed with me. Still I find myself opening Visual Studio and filling a row with ‘j’s as I hope to scroll down through the code.
From an automation perspective, gVim is great. The whole environment is based on keystrokes, so there’s never any reliance on putting the mouse cursor somewhere and clicking. Furthermore, I can hit ‘q’ and then record a macro, playing it back with @ (ok, it’s actually q followed by another letter, in which you store the macro, and @ followed by the letter for the macro of interest). This makes it great not just for writing code, but editing all kinds of text. I like Excel for being able to use formulas which can be repeated across each row, but I also find myself leveraging gVim’s macros for doing things even more easily – and navigating multiple lines.
I so wish that Windows had the macro-recording concept of gVim, or the Script button of SSMS. It would be really nice to be able to go to some spot in the Registry, or some Control Panel dialog, make some change, and say “And please give me a Script for what I’ve just done!” (If someone knows how to do this, PLEASE let me know)
But even so, PowerShell is tremendously useful. In my Linux days I would control everything through a shell environment (I preferred tcsh for some reason – I forget why – bash was good too, of course), and as such I could look back at what I’d just done, store scripts to repeat things another time, and so on. I don’t get that feeling with Windows, but PowerShell helps. I feel comfortable loading up a piece of XML in PowerShell (even an Execution Plan), and I love how easily I can move around XML in PowerShell.
Of course, every month I write a post for T-SQL Tuesday, and it would be quite neat to have a script that would automate that for me. But there are plenty of things that I don’t have automated (and may never do), and putting blog posts together is probably going to remain one of those. I can’t see myself creating a fully-automated Write-BlogPost cmdlet any time soon.
“It just runs slow these days”
I’m sure you’ve heard this, or even said it, about a computer that’s a few years old. We remember the days when the computer was new, and it seemed to just fly – but that was then, and this is now. Change happens, things erode, and become slower. Cars, people, computers. I can accept that cars get slower. They lose horsepower over time as the precision components wear and become less precise. I also know that my youth is a thing of the past. But electronics? What happens there?
Well, in my experience, computers don’t get slower. They just feel slower. I see two main reasons, and neither of them are because of ageing hardware.
Your computer might be slower than it was yesterday even. In the world of databases we might even be investigating why the computer is slower than it was five minutes ago. Again, it’s probably not because of ageing hardware.
One possible reason is that we’re simply asking systems to do more. If we’re comparing our laptops to when we bought them, we’re probably refreshing webpages more frequently (often in the background) and have installed too many utilities (hopefully not in the background, but you never know), and the system has more to get done in a given minutes compared to when it was new. With a database server, the amount of data has probably grown, there may be more VLFs in the log file to deal with, more users pushing more transactions. These are not things you want to uninstall like that annoying browser search bar on your aunt’s ageing computer, but they can be a very valid reason for things to be slower. Hopefully you are tuning your system to make sure that scalability is possible, and you’re very happy with the amount of extra work that’s being done, even if it does mean that some processes take a little longer than they once did.
This problem can be summarised by the fact that the system is having to wait for resources to be free so that it can get its stuff done.
Another reason for slowness is that the system is having to wait more for other reasons, things that you don’t want it having to wait for. An increase in busyness will cause slowness because of waiting, but you can easily make the argument that this is ‘acceptable’. It’s much more of a problem if the system is being slower without actually achieving any more than it was before.
Waits are the topic of this month’s T-SQL Tuesday, hosted by Robert Davis (@sqlsoldier). Go and have a look at his post to see what other people have written about on this topic.
In the SQL Server world, this kind of problem is identified by looking at wait stats. The system records what processes are waiting for, and you can see these by querying sys.dm_os_wait_stats. It’s very useful, but querying it in isolation isn’t as useful as taking snapshots of it. If you want to store copies of it over time, you may prefer to do something along the lines of:
--A schema for monitoring data can be useful
create schema monitoring;
--Create a table that has the structure of sys.dm_os_wait_stats
select top (0) *
--Add a column to know the the stats are collected
alter table monitoring.waits
add snapshot_time datetime default sysdatetime();
--Run this section regularly
insert monitoring.waits (wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms)
select * from sys.dm_os_wait_stats;
Regularly collecting snapshots of wait_stats like this can give you a picture of what has occurred over time. You can easily pull this data into a report, or into Excel, or even get a picture of a recent version quite easily, using a query such as:
with numbered as (
wait_time_ms - lead(wait_time_ms) over (partition by wait_type order by snapshot_time desc) as diff_wait_time,
waiting_tasks_count - lead(waiting_tasks_count) over (partition by wait_type order by snapshot_time desc) as diff_wait_count,
1000 * datediff(second,lead(snapshot_time) over (partition by wait_type order by snapshot_time desc),snapshot_time) as diff_ms,
row_number() over (partition by wait_type order by snapshot_time desc) as rownum
select wait_type, snapshot_time, diff_wait_count, diff_wait_time, diff_ms
where rownum = 1
order by diff_wait_time desc, wait_type;
This query compares the amount of wait time for each type (which is frustratingly stored as a string) since the previous one, using the LEAD function that was introduced in SQL Server 2012 (LEAD rather than LAG because we’re looking at snapshot_time desc, not ASC). Using ROW_NUMBER(), we can easily pick out the latest snapshot by filtering to rownum = 1, but if you’re just wanting to chart them, the contents of the CTE will be enough.
Make sure you keep an eye on the amount of data you’re storing, of course, and be careful of the impact of someone inadvertently clearing the stats (though as the query picks up deltas, you should be able to consider a filter that will ignore the deltas that might have spanned a period during which the stats were cleared).
This post is not going to go into all the different wait types to tell you which ones are worth worrying about and which ones are worth ignoring. But what I would suggest to you is that you track what’s going on with your environment and keep an eye out for things that seem unusual. When troubleshooting, you will find any history invaluable.
There is a lot of discussion about “the cloud”, and how that affects people’s data stories. Today the discussion enters the realm of T-SQL Tuesday, hosted this month by Jorge Segarra.
Over the years, companies have invested a lot in making sure that their data is good, and I mean every aspect of it – the quality of it, the security of it, the performance of it, and more. Experts such as those of us at LobsterPot Solutions have helped these companies with this, and continue to work with clients to make sure that data is a strong part of their business, not an oversight. Whether business intelligence systems are being utilised or not, every business needs to be able to rely on its data, and have the confidence in it. Data should be a foundation upon which a business is built.
In the past, data had been stored in paper-based systems. Filing cabinets stored vital information. Today, people have server rooms with storage of various kinds, recognising that filing cabinets don’t necessarily scale particularly well. It’s easy to ‘lose’ data in a filing cabinet, when you have people who need to make sure that the sheets of paper are in the right spot, and that you know how things are stored. Databases help solve that problem, but still the idea of a large filing cabinet continues, it just doesn’t involve paper.
If something happens to the physical ‘filing cabinet’, then the problems are larger still. Then the data itself is under threat. Many clients have generators in case the power goes out, redundant cables in case the connectivity dies, and spare servers in other buildings just in case they’re required. But still they’re maintaining filing cabinets.
You see, people like filing cabinets. There’s something to be said for having your data ‘close’. Even if the data is not in readable form, living as bits on a disk somewhere, the idea that its home is ‘in the building’ is comforting to many people. They simply don’t want to move their data anywhere else.
The cloud offers an alternative to this, and the human element is an obstacle.
By leveraging the cloud, companies can have someone else look after their filing cabinet. A lot of people really don’t like the idea of this, partly because the administrators of the data, those people who could potentially log in with escalated rights and see more than they should be allowed to, who need to be trusted to respond if there’s a problem, are now a faceless entity in the cloud.
But this doesn’t mean that the cloud is bad – this is simply a concern that some people may have.
In new functionality that’s on its way, we see other hybrid mechanisms that mean that people can leverage parts of the cloud with less fear. Companies can use cloud storage to hold their backup data, for example, backups that have been encrypted and are therefore not able to be read by anyone (including administrators) who don’t have the right password. Companies can have a database instance that runs locally, but which has its data files in the cloud, complete with Transparent Data Encryption if needed. There can be a higher level of control, making the change easier to accept.
Hybrid options allow people who have had fears (potentially very justifiable) to take a new look at the cloud, and to start embracing some of the benefits of the cloud (such as letting someone else take care of storage, high availability, and more) without losing the feeling of the data being close.
This month’s T-SQL Tuesday is hosted by Kendal van Dyke (@SQLDBA), and is on the topic of swag. He asks about the best SQL Server swag that we’ve ever received from a conference.
I can’t say I ever focus on getting the swag at conferences, as I see some people doing. I know there are plenty of people that get around all the sponsors as soon as they’ve arrived, collecting whatever goodies they can, sometimes as token gifts for those at home, sometimes as giveaways for the user groups they attend. I remember a few years ago at my first PASS Summit, the SQLCAT team gave me a large pile of leftover SQL Server swag to give away to my user group – piles of branded things to stop your phone sliding off your car dashboard, and other things. The user group members thought it was great, and over the course of a few months, happily cleared me out of it all.
I tend to consider swag to be something that you haven’t earned except by being at a conference, and there was no winning associated with it, it was simply a giveaway item at a sponsor booth. That means I don’t include the HP Mini laptop that was given away at TechEd Australia a few years ago to every attendee, or the SQL Server bag and Camelbak bottle that I was given as a thank-you for writing a guest blog post (which I use as my regular laptop bag and water bottle for work). I don’t even include the copy of Midtown Madness that I got as a door prize at my vey first TechEd event in 1999 (that was a really good game, and even meant that when I went to Chicago last year, I felt a strange familiarity about the place).
I don’t want to include shirts in the mix either. I was given a nice SQL Server shirt about five years ago TechEd Australia. It’s a business shirt (buttons, cuffs, pocket on the chest), black with the SQL Server logo on it. It was such a nice shirt that I commented about it to the Product Marketing Manager for Australia (Christine, at the time), who unexpectedly arranged for me to get another one. That was certainly an improvement on the tent I was given at one of the MVP conference I attended.
So when I consider these ‘rules’, two pieces of swag come to mind, and I think both were at PASS Summits (although I can’t be sure). One was a hand-warmer from HP, one of the “crystallisation-type” ones, which proved extremely popular when I got home, until one day when it didn’t survive being recharged – not overly SQL related, but still it was good swag.
The other was an umbrella, from expressor, which was from the PASS Summit in 2010, my first PASS Summit. I remember it well – Blythe Morrow (now Gietz) (@blythemorrow) was working the booth, having stopped working for PASS some time before, but she’d been on my list of people to meet, as I’d had plenty of contact with her while she’d worked at PASS, my being a chapter leader and general volunteer. There had been an expressor dinner on one of the first evenings, which I’d been asked to be at, which is when I’d met lots of SQL people in person for the first time, including Ted Krueger (@onpnt), Jessica Moss (@jessicamoss) and Blythe. Anyway, at some point the next day I swung by their booth to say hello and thank them for the dinner, and Blythe says “Oh, we have the best swag – here!” and handed me an umbrella.
And she was right. It’s excellent.
It’s PASS Election time, and this year, there are no serving board members on the ballot.
Two years ago, Adam Jorgensen, Denise McInerney and I were elected to the PASS board, serving two-year terms. Since then, Adam and Denise have both been elected as Vice-Presidents of PASS, and will remain on the board in that role. Of the previous VPs, Douglas McDowell is stepping off the board, while Tom LaRock has become President (the current President, Bill Graziano, becoming Immediate-Past President and the current IPP, Rushabh Mehta, stepping off the board).
So instead of the vote being to replace or re-elect Adam, Denise and me, we see Douglas and Rushabh leaving the board.
This leaves the question of me. I could run again this year, but instead have decided to take a break from the board.
Being on the PASS board is a great way to serve the SQL community, and I’m all about serving – it’s a large part of what drives me. In fact, I often think I’m better at serving than pursuing my own vision. Sometimes I think I’d make a great case study for psychologists. Other times, I think I’d they’d just roll their eyes and say “No, we see people like you all the time, you’re a classic <insert term here>.”
I was asked to be on the board just over two years ago, when PASS gathered people from the UK, Nordic and Australian regions to discuss the globalisation of PASS. Greg Low and I were there from Australia, and when JRJ, Raoul Ilyes and I were asked to be on the board, I accepted. All three of us did. Because Raoul was ineligible to become a director (he’d recently taken a job with SolidQ, and PASS has rules about having more than two board members from the same company), we were put in as board advisors. As I was already involved, I ran for election to be a director at the end of the year (two years ago). JRJ was chosen by the board to take over Andy Warren’s position when he left the board mid-term. And so the story of PASS globalisation continued with JRJ and me as directors.
Being a board member from Australia has been tough, but if it was that alone, I’d make it work. I’ve got up plenty of early mornings to be in meetings. But there’s more. I run a business (LobsterPot Solutions) out of Adelaide, which is a tough market. We also have staff in Melbourne and now Canberra (Julie’s moved there) too, and take on clients all around Australia and overseas. That doesn’t mean I travel a lot, but it does take time and emotional investment.
I should travel more than I do. In fact, that’s one of the reasons why I’m leaving the board – my old back injury. It doesn’t normally affect me much, but this year, it has done more than I would’ve liked. I trained for and completed a marathon earlier in the year. I ran because I could. I’d had nine years of not being able to run, and then after working with my physiotherapist to get it back, wanted to see how far I could push it. It was hard, but I ran a marathon. I have a medal to prove it. It was hard though, and it took a lot out of me. Every injury during training was exacerbated by my back condition, and I ended up not being able to fly to the other board meetings this year. I missed the first one because it was the first week of the school year and I chose to be here for my family, but the other two were because my back simply wasn’t up to it. I attended via Skype instead, but it wasn’t the same.
Essentially, I don’t feel that I’ve been able to be as effective this year on the board as what I’d like. I see my role on the board being to stand up for what is right for the community, and being an enabler. My portfolio this year has been SQLSaturday, and the focus I’ve taken has been in enabling Karla and Niko to do what they do more effectively, and to try to make life easier for the SQLSaturday organisers, attendees and sponsors. It’s been a year of change at PASS in many ways though, and with my energy levels being lower this year than I’d’ve liked, I don’t know how well I’ve done that. I have high standards, of course.
I would happily remain on the board as an advisor, or take on a different role within PASS (although Australia already has two Regional Mentors), but I’m also really thrilled that the people that are running for the board this year are such high quality. There are a few PASSion award winners in there, and almost everyone on the list has been involved in growing the SQL community in significant ways. I would love to see Allen get back on the board, and all the others will have a positive effect too. I could happily write recommendations for every one of them! I’m pleased I’m not running against these people, because I feel PASS deserves to have these people on the board more than it deserves to have me continue. I might run again in the future, but am also happy to serve behind the scenes too.
So instead, the PASS membership gets to vote three new people onto the board! Exciting times.
CONVERT_IMPLICIT isn’t the only problem with getting data types wrong. You might have the right type, but what if the length is wrong? This post will look at both getting the type wrong and getting the length wrong too.
Let’s do some testing. We’ll need a table with indexes. I’d normally use one of the AdventureWorks versions for this, but as they have a tendency to use user-defined types (which I’m not actually a fan of), I’m going to create my own. Also, my example needs to leverage a composite index. I’m only putting a single row in the table, because the amount of data isn’t relevant to what I’m going to show. I’m using a Windows collation, which is the default on my machine. I’ll put a note in later to mention why.
create table dbo.StringLength
pk int primary key,
v50 varchar(50) collate Latin1_General_CI_AS,
vmax varchar(max) collate Latin1_General_CI_AS,
n50 nvarchar(50) collate Latin1_General_CI_AS,
nmax nvarchar(max) collate Latin1_General_CI_AS
create index ix_v50 on dbo.StringLength (v50, id);
create index ix_n50 on dbo.StringLength (n50, id);
insert dbo.StringLength (pk, id, v50, vmax, n50, nmax)
values (1, 1, 'abcdefghij', 'abcdefghij', N'abcdefghij', N'abcdefghij');
I haven’t indexed the vmax and nmax fields, because you can’t use them as index keys. Of course, there’s plenty of argument to having those fields in your actual tables, but if you’re wanting to be able to search on that data, you might want to consider a full-text index. If the searching is always on the start of the string, you could consider another option, but we’ll come to that later.
Let’s look at what happens when we do it right, defining variables using the right type.
@s varchar(1000) = 'abcdefghij',
@i int = 0;
where v50 = @s
and id > @i;
Notice that there is no Predicate property of the Index Seek, only a Seek Predicate, which has both a Prefix and Start. The range of rows returned by the Seek Predicate has a start-point based on a combination of the Prefix and Start, and an end-point which is the end of the Prefix.
If the wrong types are used, we see that a conversion is needed.
Let’s start by using a varchar(50) parameter, and comparing it to the nvarchar(50) column.
We still see no Predicate here, but look at the Prefix. A CONVERT_IMPLICIT is needed because the types don’t match. In case you haven’t heard, this is bad.
But how bad is it? Actually, not very bad at all, because @s is converted into the correct type, and then used in the Seek. You should still avoid it by passing in the correct type, but the cost of converting a parameter to the correct type is not that bad, because it only needs to happen once.
What happens if we do it the other way around, passing in an nvarchar(50) parameter and comparing it to the varchar(50) column.
Straight away, you’ll notice that there’s a different shape to the execution plan, we’ve lost the Prefix in the Seek Predicate, and we have a Predicate (the residual one) property as well. But we do still have an Index Seek. It hasn’t resorted to using an Index Scan as would’ve been the case if we had used a number.
(Just for completeness, let’s truncate the table – to avoid a conversion error – and use a number for the comparison)
Here we get an Index Scan. No Seek Predicate. The index on v50 is as good as useless because we’re comparing the column to a number. Look what’s going on in the Predicate – we’re converting the v50 field into an integer, and seeing if it’s equal to @i. That’s doing it for every row in the index.
Luckily, we now get a warning about this. See the yellow triangle with an exclamation mark in it on the SELECT operator? If I click on that, I see a warning that says:
Type conversion in expression (CONVERT_IMPLICIT(int,[tempdb].[dbo].[StringLength].[v50],0)) may affect "CardinalityEstimate" in query plan choice, Type conversion in expression (CONVERT_IMPLICIT(int,[tempdb].[dbo].[StringLength].[v50],0)=[@i]) may affect "SeekPlan" in query plan choice
It’s actually two warnings. One is a SeekPlan warning, and one is a CardinalityEstimate warning. It’s the SeekPlan one that has caused the Scan, while the CardinalityEstimate problem means the Query Optimizer has little idea about how many rows to expect out of the Scan operator. (And no, there’s no full stop / period at the end of those warnings. Go figure...)
Anyway, that was just an aside, because I figure there are plenty of posts already out there about this CONVERT_IMPLICIT issue leading to a Scan instead of a Seek. Let’s go back to our situation, where we were dealing with nvarchar and varchar strings, and still had a Seek. This one:
Clearly this is a different situation to a regular CONVERT_IMPLICIT. It’s less harmful, although there is still a big impact, and it’s potentially much more commonplace, as people don’t tend to care quite as much if they see an Index Seek already in play.
Oh, and this behaviour doesn’t happen with SQL collations. If you have a SQL collation, the sort order between varchar and nvarchar is different, and it has to do a Scan, just like when I used a number.
The thing that’s happening here is the same as when you have a datetime column that you’re casting to a date, or when you’re using a LIKE comparison with fixed start. The Query Optimizer uses a function called GetRangeThroughConvert (in the Compute Scalar – you can see it in the XML), which is then used to create a Dynamic Seek. I’ve presented on this before, such as in the SARGability and Residualiciousness talks at SQLBits in 2010 and the PASS Summit in 2011. Paul White (@sql_kiwi) expanded on the GetRangeThroughConvert function in a post from 2012. The seek can’t guarantee to catch everything though, so it uses a Predicate (the residual one) to make sure the value matches exactly.
So why is this so bad? In my presentations I’ve talked about the GetRangeThroughConvert behaviour as being a good thing. More on that first down in the post.
What if we pass in the correct type, but make it too long or too short?
When it’s too long (though we haven’t considered ‘max’ yet), we get the same behaviour as if it were the right length. Interestingly, if you search the XML version of this plan for either 1000 or 50 (except for in the column name), you don’t find it anywhere. It’s as if we passed in the correct value. The same happens if you pass in a string that is too short, but here you need to consider whether you might be wrecking the parameter.
In this situation, my query didn’t return the same results, because @s is only ‘abcde’. But it does this without any kind of warning – you can populate varchar(5) variable with a longer string and it won’t complain at all.
But max is done differently.
Let’s see what happens when we pass in a varchar(max) or nvarchar(max) parameter, and compare it to the limited-length string.
We’re comparing the varchar column to a varchar parameter, but the parameter is defined as a max field, and we have the GetRangeThroughConvert functionality, like what happened with the nvarchar / varchar scenario. But it’s more complicated again – despite the fact that we have a Range, our Residual Predicate doesn’t include the equality check. That check has actually been pulled further left in the plan, in that new Filter operator you see there.
You see, checking a max type is expensive, and involves memory allocation (that parameter is potentially up to 2GB in size), so the Filter is moved to the left as far as possible (SQL won’t ever do a max comparison in a Seek/Scan operator because of the memory allocation). By calling all the other filters (however the filters are done, Joins, Seeks, Residuals) before applying the max filter, the data that gets pulled into the max filter is now as few rows as possible. The Seek will be close to the correct amount, thanks to the GetRangeThroughConvert part, but that final check does still need to take place. It’s a good thing that the check is pulled left, but you should avoid passing in a max parameter so that this check can be done in the Seek Predicate.
So what about the other way around? What if we have a max column, and a limited-length parameter?
In some ways, this is more trivial because you can’t index a max column. This is one of the things that carried over from the days of text and ntext, although you couldn’t define a variable as text or ntext either, so you were less likely to try.
If you can’t index the column, but still want to be able to perform searches that would be index-like, what can you do (apart from using an Full-Text Index)?
Well, you could use a computed column that matches the length of your parameter. The non-clustered index persists the value, although it’s not persisted in the underlying heap / clustered index.
alter table dbo.StringLength
add vmax50 as cast(vmax as varchar(50));
create index ix_vmax50 on dbo.StringLength(vmax50, id) include (vmax);
This is similar to what can be done to tune many queries to get around SARGability problems. But it won’t help here unless we change our query, because our query still uses the max column, and implied conversion makes the shorter one longer, not the other way around. However, if we add an extra predicate to our query, we can achieve what we want:
The Filter here is still the max check, but now we at least have a more effective seek on the rest of it, thanks to making a column which contains the first part of the potentially-long string.
So you can see that GetRangeThroughConvert functionality is useful, and way better than the alternative of using a Scan.
But GetRangeThroughConvert is actually bad for your query. Honestly.
The reason why it’s bad is because of what has happened to the second predicate that we’re using, the “id > @i” bit. When the type matched properly, it was part of the Seek Predicate. When GetRangeThroughConvert is used, either through the wrong type or by passing in the right type but with the max length, this second predicate gets relegated to the Residual (see the image below). That means that every row that satisfies the string comparison much be checked against this second predicate. It’s like using the phone book and not appreciating that all the “Farley”s are sorted by first name. Even without a second predicate, there may be problems in the plan because of unnecessary sorts, or the avoidance of a Merge Join, simply because the data cannot leverage the second column in an index (or the CIX key if there are no more columns in the key). If you’re looking for Barry Smith in the phone book, but are passing in ‘Smith’ via a varchar(max) parameter, you’ll be scanning all the Smiths looking for Barry. If you’d passed it in using varchar(100), then you could’ve found all the Barry Smiths with your Seek.
It won’t give you a warning, but you’ll find your indexes aren’t being used as effectively as you might like.
Using a hammer to push in a screw isn’t a good idea, no matter how good the hammer is. We all know that. and yet there are times when we get frustrated at the ‘right tool’ and opt for the one that will work. Unfortunately, there are plenty of examples in the IT space – the topic of which is this month’s T-SQL Tuesday, hosted by Rick Krueger (@DataOgre).
There are many ways to use SQL. Having seen plenty of awful code over the years, and written plenty of it myself, I know that there are lots of ways to achieve the same thing, and there are ways to persuade the Query Optimizer to take one particular approach rather than another. These efforts might seem quite ‘hacky’ at first glance, but if there is good reason for them, then perhaps they belong.
For example, a couple of years ago (and then again a couple of weeks ago) I wrote about using views as a way of passing sets of values to a stored procedure (where the stored procedure is masquerading as an instead-of trigger). At the time, people told me it was quite hacky, and that they’d never use it. However, I’ve heard of other people (including Greg Low, who also blogs at sqlblog.com) using this same method for getting data into Azure using SSIS, because it just works, and it has a number of advantages over other methods.
Many look at string concatenation as another example of this, although some get attached to the wrong solution.
When concatenating strings stored across a single row, there is no issue – simply use the + operator, or a function like CONCAT().
It’s when you need to consider data across multiple rows that you have more of a problem.
If there are a known number of rows in play, then performing a Pivot transformation can simplify the problem down to the single-line variety. I don’t mind whether you prefer to use the PIVOT operator or not. Personally, I prefer not to. In this example, I take a simple set of unpivotted data, and do a basic pivot transform, using GROUP BY and CASE. These then form useful values for the CONCAT function.
Frustratingly, pivoting requires the values to be known, producing a known set of columns. Dynamic Pivot requires Dynamic SQL, but if this is being done, there’s no problem to provide those values to a CONCAT function – though creating the Dynamic SQL probably requires a degree of concatenation in the first place!
There is a “Quirky Update” method for concatenation that has been touted a bit over the years, but I find myself completely unable to recommend it, based on the prerequisites listed to get reliable results. For me, I want the results of any query I write to be guaranteed, and to have a query which could end up producing the wrong results is simply too dangerous for my liking. I’d rather use a cursor than risk wrong-results, and that’s saying something!
The method I like to use for string concatenation has its roots in a hack of sorts, and to explain why, I’ll first describe the environment from which it has sprung – FOR XML PATH.
There are a few ways to generate XML from SQL. I find FOR XML PATH to be quite straight forward, although it can become overly complex when nesting comes into play (causing me to switch to one of the others). However, for simple examples, XML PATH is very effective.
The results of a simple query like this:
SELECT ProductID, Name, ListPrice
ORDER BY ProductID;
can be turned into XML by simply adding a “FOR XML PATH” option at the end. Providing some additional parameters, we get results like the following:
You’ll notice that the column names have appeared within XML tags. The interesting thing here is that we can manipulate this behaviour by changing the name of the columns, with aliases. For example, putting an @ in front of the column name causes the element to be treated as an attribute instead (I remember it as “@ for ATtribute”).
Another trick with column names is that if there is none (or it’s [*]), then the element tag disappears. Naturally this is way better than having “<>” appear in the resultant XML.
…but this behaviour of making the empty element disappear is part of the appeal.
I’m told that it was around the time of the development of this ‘disappearing element’ bit that the penny dropped and using it for string concatenation became an option.
Consider what happens if we only have the Name column now.
If we had no ROOT element, we wouldn’t have legal XML…
…but if we used an empty string for the PATH element as well…
…now suddenly we have string concatenation (including the support for ordering), without having to resort to something which could fall apart at a moment’s notice.
If we want commas included, that’s easy, and if we want to convert it back into a string (instead of being XML, with XML-style quotes), then surrounding it with brackets (round ones, what Americans call parentheses) and putting .value('.','varchar(max)') at the end will do the trick.
Here, we’ve taken XML generation and turned it into a string concatenation operation in a very simple way. The SQL Product Group realised this potential quite early on, and this is widely regarded as the best way of doing concatenation from SQL Server 2005 on.
There are lots of other things that I’ve done over the years that might seem quite hacky, but are actually quite legitimate – but string concatenation is probably my favourite.
Almost two years ago, I wrote about a method to use Table-Valued Parameters in SQL 2005 – or basically any environment that doesn’t support them natively.
The idea was to use a View with an ‘instead of’ trigger. Essentially, the trigger acts as a stored procedure, which is then used to be able to handle all the rows however you want. That could be distributing the rows into tables as if it’s an actual insert, but also works to just run whatever code you like, as a trigger is essentially just a stored procedure.
So anyway – today I got challenged to make it more explicit that this also works within SQL Server Integration Services. SSIS handles batch inputs into views, but if you were hoping to push data into a stored procedure in a data flow, that’s going to be on a row-by-row basis – no TVPs. I’d described it to a class I was teaching, showed them the code from my 2011 blog post, and they pointed out that I hadn’t made it particularly obvious (in my blog post) that you could use that method in SSIS.
The idea is very simple – you just create a view, as I did in my earlier post, and then use it as a destination within SSIS. This can be used for all kinds of methods for which you’d want to use a TVP, including pushing data up into SQL Azure databases.
100% of the reads removed. Roughly 4000 (okay, 3890), down to zero.
Let me explain...
Obviously if there’s data being returned, there are reads needed. And obviously there is some CPU needed for SQL to be given a query and do anything at all. Luckily for me, performance of a query is typically evaluated using the number of reads performed. That’s what I’m looking at.
Recently I came across a query that typically returns no rows. It was looking for rows in a table that met a particular condition, joined those rows to a bunch of other tables, and return the result. But yeah, typically no rows returned. Estimating a few, but none actually coming out.
Consider the following query, on AdventureWorks.
FROM Sales.SalesOrderDetail as od
JOIN Production.Product as p on p.ProductID = od.ProductID
JOIN Production.ProductModel as m on m.ProductModelID = p.ProductModelID
LEFT JOIN Production.ProductSubcategory as s on s.ProductSubcategoryID = p.ProductSubcategoryID
where od.OrderQty > 1200;
Running this query gives me no rows back, but well over a thousand reads.
As well as this, it suggested a missing index.
CREATE NONCLUSTERED INDEX OrderQtyIndex
ON [Sales].[SalesOrderDetail] ([OrderQty])
INCLUDE ([SalesOrderID], [SalesOrderDetailID], [CarrierTrackingNumber], [ProductID], [SpecialOfferID], [UnitPrice], [UnitPriceDiscount], [LineTotal], [rowguid], [ModifiedDate]);
Creating this index does indeed reduce the reads. A lot too! Instead of scanning the Sales.SalesOrderDetail table, it can use an Index and quickly find any rows that have more than 1200 items sold.
This is good. But is it worth stopping here?
The index that I created gives a slightly different plan. It doesn’t perform a Seek on an index on the Sales.SalesOrderDetail table, it does a Scan! But zero reads.
The difference is a filter.
CREATE NONCLUSTERED INDEX ZeroReadsIndex
ON [Sales].[SalesOrderDetail] ([OrderQty])
INCLUDE ([SalesOrderID], [SalesOrderDetailID], [CarrierTrackingNumber], [ProductID], [SpecialOfferID], [UnitPrice], [UnitPriceDiscount], [LineTotal], [rowguid], [ModifiedDate])
WHERE OrderQty > 1200;
Now, we have a completely empty index. Scanning this index is zero reads. It’s just like having a table with no rows in it. This index gets only populated with a row whenever a qualifying row appears in the underlying table. When that happens, there’s few reads required to be able to get the necessary data out of the other tables. But for 99.9% of the time this query is run, there are now NO READS. For the real scenario, that row gets picked up and removed quite quickly, returning the query to the ‘zero reads’ scenario.
Do you have frequently-run queries that typically return no rows, because there’s a predicate that is rarely satisfied? How about turning that predicate into an index filter, and seeing if you can reduce a big chunk of the footprint?
Sometimes I only blog for T-SQL Tuesday. I don’t want to break my run of blogging for each one just yet, but I do wonder sometimes.
The reason I’m reflecting is that what we’ve done in the past will so often affect what we do in the future. The musician Sting tells us that “History will teach nothing”, but he’s trying to address the idea of being tied down by the negative experiences of the past, rather than making a conscious effort to see improvement. We need to acknowledge what has happened in the past, but understand that we can shape our futures and see a better world ahead. History can both help keep us diligent (as in the case of my blogging), and it can help us see where change is urgently needed.
In the SQL world, we also need to find a balance between treating the past with respect, and not letting it hold us back.
Mickey Stuewe is hosting this month’s T-SQL Tuesday, and the topic is Auditing (as you may have guessed).
By keeping a record of what has happened, we can compare what things look like now, with what they looked like before, and that is incredibly powerful. (Erin Stellato is the name in SQL that most people associate with baselines – go and read some of her stuff on what you should be doing, because that’s not what I’m covering here.) It’s not just about being able to troubleshoot, but it’s about being able to implement change. Every change that we ever try involves risk, and there is a healthy amount of trepidation that goes along with that. But by being able to compare the before and after of change, you can make decisions more easily.
Let me give you a couple of examples – one in the SQL world, and one not.
The internet is full of tips and tricks about life (amongst countless other things, such as laughing penguins), with plenty being about little ways to be more effective. A while back I needed to replace a AA battery in the kitchen clock, but could only find AAA batteries. So I used a piece of Al-foil and got the clock working again. When I did this, I was curious about the impact on the battery, and the clock, and even on the foil. I wondered whether the battery would last as long, whether the clock would be affected by having less current going through it, and even whether it was safe (did the foil become hot, for example?). The problem was, I had no metrics to base my ideas on. I honestly had no idea how long it had been since the clock last got a new battery. Nor had I been tracking the health of the clock over, um, time, to see if it was staying accurate or not. I wasn’t about to start monitoring these things either – I just had to go with my gut and figure it was probably not awful for either the battery or the clock, but should a normal AA battery in soon. The risk was small, but if I’d had data about these things, perhaps I’d be able to make a more informed decision.
In SQL, I often tune queries. I can happily tell a client that I’ve reduced the amount of time taking for a query from 20 seconds to 0.2 seconds, or from 400,000 reads to 13 reads. I can easily take metrics on these things. But even these metrics aren’t always the right ones. What the customer is more interested in is whether or not these changes impact their business. Does the speed increase of these queries now mean that they are handle a larger customer base in busy times? Does this mean that they might be able to avoid spending the money on a new server? Does it both save them money and increase revenue?
Business Intelligence draws information out of all kinds of business data, and hopefully provides a platform for being able to make decisions. No matter whether the data is on the performance metrics of a server or on the sales metrics for a product, there is an opportunity to be able to implement change and notice an upturn. If you don’t have that data, if you haven’t been auditing your systems, then you’re approaching change with a very different hat on, one that probably doesn’t sit quite so comfortably. Looking back at what the past was like provides a glimpse of what the future might be, and insight into how change can become rewarding. History can teach us plenty.
But yes, although we all recognise that it’s good to have the metrics in place to measure the impact of change, we shouldn’t allow a lack of data to turn into both an excuse and a license for inactivity. Step out and see what change you can make, with both eyes open.
This month’s T-SQL Tuesday is hosted by Bradley Ball (@sqlballs), and he wants us to write about a proper balls-up. He wants us to think it’s an opportunity for a Second Chance, to see things as an opportunity to write about how we learned and did better when we had a second chance.
I thought hard about this. I know I could write about times when I’ve done something badly – there are projects, jobs, university assignments, that I would happily redo – but I also know that I could list almost anything I’ve done in that category. Not that everything in my past is bad, but compared to the way I do things now, perhaps ‘not as good’. Hindsight is a wonderful thing, but over time we continue to hone our skills and, I’d like to think, we make improvements in the ways we do things.
The thing I want to consider in this post is my blog. For the past 3.5 years I’ve been blogging here at sqlblog.com. Before that, I was at msmvps.com for about three years, and had spent a year or so at blogspot, reaching back to 2005. I even did four posts over at sqlblogcasts.com. Despite the fact that there are a number of posts I still find useful and that get complimented by people as knowledgeable as Paul White, I still cringe a lot when I look at old posts (and not just because the images seem to have been lost from my blogspot posts). There are things that I did out of ignorance (such as when I used scalar functions), and things that I know I could have explained better now. If I compare the stuff I write even now compared to content by better bloggers than me, those people (like Paul) who write posts I could only dream of putting together, it makes me realise my blog is pointless. When I look at the old stuff… cringing is an understatement!
By all rights, I should remove all that old content. I feel like it must simply be there to misinform the masses. You know the masses – that would be all those people that might want to read about a fix() function in T-SQL, or about… actually, no, I can’t see anything from my old blogspot site that’s useful at all – okay, maybe the thing about using row_number() to avoid duplicates, but I clearly didn’t get my MVP status because of blogging… Yeah, those masses of people. Ha. Does something have mass if it’s zero?
Over time, I would like to rework some of my old posts into new ones. For example – I would like to redo my Covering Schmuvvering post using trace flag 9130 to demonstrate some more about what’s going on. But even reworking posts doesn’t mean I should necessarily remove the old ones.
I think mainly, I leave that old content there because it shows part of my journey. I imagine it won’t be long before I’ll look back at some of my recent posts and shake my head about what’s in there. When people like Jef Verheul write tweets suggesting I can’t write rubbish, I have plenty to point them at (including this one, I think!).
If you’re thinking about whether it’s worth writing blog posts or not – just do it. Even if you write absolute rubbish, that’ll just mean you have posts that are at a similar level to what I’ve written – and still write!
Note: This is the same whatever you’re doing, not just blogging. I’m sure everyone in the world can write queries better than earlier versions of me, do BI projects better than earlier versions of me, present better than earlier versions of me, etc, etc, etc. (And of course, plenty of you can do these things better than current versions of me too!)