- Owner/Principal with LobsterPot Solutions (an MS Gold Partner consulting firm), Microsoft Certified Master, Microsoft MVP (SQL Server), APS/PDW trainer, and leader of the SQL User Group in Adelaide, Australia. Rob is a former director of PASS, and provides consulting and training around the world on SQL Server and BI topics.
The topic for this month’s T-SQL Tuesday is:
“Be inspired by the IT horror stories from http://thedailywtf.com, and tell your own daily WTF story. The truly original way developers generated SQL in project X. Or what the grumpy "DBA" imposed on people in project Y. Or how the architect did truly weird "database design" on project Z”
And I’m torn.
I haven’t missed a T-SQL Tuesday yet. Some months (okay, most months) it’s the only blog post I write. I know I should write more posts, but I simply get distracted by other things. Other things like working for clients, or spending time with the family, or sometimes nothing (you know – those occasions when you find yourself doing almost nothing and time just slips away, lost to some newspaper article or mindless game that looked good in the iTunes store). So I don’t want to miss one.
But I find the topic painful to write about. Not because of the memories of some of the nasty things I’ve seen at customer sites – that’s a major part of why we get called in. But because I wouldn’t ever want to be a customer who had a bad story that got told. When I see you tweeting things like “I’m dying in scalar-function hell today”, I always wonder who knows which customer you’re visiting today, or if you’re not a consultant whether your employer knows what you’re tweeting. Is your boss/customer okay with that tweet’s announcement that their stuff is bad? What if you tweet “Wow – turns out our website is susceptible to SQL Injection attacks!”? Or what if you write “Oh geez, this customer hasn’t had a successful backup in months…”? At what point does that become a problem for them? Is it when customers leave? Is it when they get hacked? Is it when their stock price drops? (I doubt the tweet of a visiting consultant would cause a stock price to fall, but still…)
So I’m quite reluctant to write this blog post at all. I had to think for some time before I thought of a scenario that I was happy to talk about.
This place was never a customer, and this happened a long time ago. Plus, it’s not a particularly rare situation – I just hadn’t seen it become this bad. So I’m happy enough to talk about this...
There was some code that was taking a long time to execute. It was populating a table with a list of IDs of interest, along with a guid that had been generated for this particular run. The main queries ran, doing whatever transforms they needed to do, inserting and updating some other tables, and then the IDs of interest were deleted from that table that was populated in the first part. It all seems relatively innocuous.
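In rough T-SQL, the pattern looked something like this (the table and column names here are my own invention – I'm just sketching the shape of it, not the customer's actual schema):

```sql
-- Sketch of the process (names are placeholders)
DECLARE @RunId uniqueidentifier = NEWID();

-- Populate the list of IDs of interest for this run
INSERT dbo.IdsOfInterest (RunId, Id)
SELECT @RunId, s.Id
FROM dbo.SourceData AS s
WHERE s.NeedsProcessing = 1;

-- ... main queries run here, joining to dbo.IdsOfInterest
--     on RunId = @RunId, inserting and updating other tables ...

-- Clean up the IDs for this run
DELETE dbo.IdsOfInterest
WHERE RunId = @RunId;
```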
But execution was getting worse over time. It had gone from acceptable, to less than ideal, to painful. And the guy who was asking me the question was a little stumped. He knew there was a Scan on the list of IDs – he was okay with that, because it was typically only a handful of rows. Once it had been a temporary table, but someone had switched it to be a regular table – I never found out why. The plans had looked the same, he told me, from when it was a temporary table even to now. But the temporary table solution hadn’t seen this nasty degradation. He was hoping to fix it without making a change to the procedures though, because that would have meant source control changes. I suspect the solution I recommended required a source control change too, but you never know.
What I found was that the list of IDs was being stored in a table without a clustered index. A heap. Now – I’m not opposed to heaps at all. Heaps are often very good, and shouldn’t be derided. But you need to understand something about heaps – which is that they’re not suited to tables that have a large number of deletes. Every time you insert a row into a heap, it goes into the first available slot on the last page of the heap. If there aren’t any slots available, it creates a new page, and the story continues. It doesn’t keep track of what’s happened earlier. Heaps can be excellent for getting data in – and Lookups are very quick, because every row is addressed by its actual Row ID rather than by key values which then require a Seek operation to find them (that said, it’s often cheap to avoid Lookups by adding extra columns to the Include list of a non-clustered index). But because a heap doesn’t think about what kind of state the earlier pages might be in, you can end up with heaps full of completely empty pages – page after page, chained together with header information but no actual rows therein. If you’re deleting rows from a heap, this is what you’ll get.
This guy’s heap had only a few rows in it. 8 in fact, when I looked – although I think a few moments later those 8 had disappeared, and were replaced by 13 others.
But the table was more than 400MB in size. For 8 small rows.
At 8kB per page, that’s over 50,000 pages. So every time the table was scanned, it was having to look through 50,000 pages.
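If you want to check whether a heap of your own has this problem, comparing the row count to the page count is enough to spot it. Something like this (with a hypothetical table name):

```sql
-- Compare row count to page count for a heap (index_id = 0 means heap)
-- dbo.IdsOfInterest is a placeholder name
SELECT row_count,
       used_page_count,
       used_page_count * 8 / 1024 AS UsedMB
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID(N'dbo.IdsOfInterest')
AND index_id = 0;
```

A healthy small heap shows a handful of pages. A heap like this one would show 8 rows against roughly 50,000 pages.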
When it had been a temporary table, a new table was created every time. The rows would typically have fitted on one or two pages, and then at the end, the temporary table would’ve disappeared. But I think multiple processes were needing to look at the list, so making sure it wasn’t bound to a single session might’ve been useful. I wasn’t going to judge, only to offer a solution. My solution was to put a clustered index in place. I could’ve suggested they rebuild the heap regularly, which would’ve been a quick process run as often as they liked – but a clustered index was going to suit them better. Compared to a single-page heap, things wouldn’t’ve been any faster, but compared to a large empty heap, Selects and Deletes would’ve been much faster. Inserts are what heaps do well – but that wasn’t a large part of the process here.
You see, a clustered index maintains a b-tree of data. The very structure of an index needs to be able to know what range of rows are on each page. So if all the rows on a page are removed, this is reflected within the index, and the page can be removed. This is something that is done by the Ghost Cleanup process, which takes care of actually deleting rows within indexes to reduce the effort within the transaction itself, but it does still happen. Heaps don’t get cleaned up in the same way, and can keep growing until they get rebuilt.
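For what it’s worth, either fix is a one-liner. Again, the names here are placeholders:

```sql
-- Option 1: give the table a clustered index, so emptied pages get released
CREATE CLUSTERED INDEX cix_IdsOfInterest ON dbo.IdsOfInterest (RunId, Id);

-- Option 2: leave it as a heap, but rebuild it regularly to reclaim the space
ALTER TABLE dbo.IdsOfInterest REBUILD;
```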
Sadly, this is the kind of problem that people can face all the time – the system worked well at first, testing didn’t show any performance problems, the scale of the system hasn’t changed, but over time it just starts getting slower. Defragmenting heaps is definitely worth doing, but better is to find those heaps which fragment quickly, and turn them into clustered indexes.
…but while I hope you never come across heaps that have grown unnecessarily, my biggest hope is that you be very careful about publicly discussing situations you’ve seen at customers.
“Do you have a backup?” is a common question, and very relevant for this month’s T-SQL Tuesday, hosted by Ken Fisher (@sqlstudent144), on the topic of backups. But it’s the wrong question.
I think the question should be “Can you recover if needed?”
We all know that a backup is only as good as your ability to restore from it – that you must test your backups to prove their worth. But there’s more to it than being able to restore a backup. Do you know what to do in case of a disaster? Can you restore what you want to restore? Does that restore get your applications back up? Does your reporting become available again? Do you have everything you need? Are there dependencies on other databases?
I often find that organisations don’t quite have the Disaster Recovery story they need, and this is mostly down to not having practised specific scenarios.
Does your disaster testing include getting applications to point at the new server? Does anything else break while you do that?
Does your disaster testing include a scenario where a rogue process changed values, but there is newer data that you want to keep?
Does your disaster testing include losing an extract from a source system which does incremental extracts?
Does your disaster testing include a situation where a well-meaning person has taken an extra backup, potentially spoiling differential or log backups?
Does your disaster testing include random scenarios where your team needs to figure out what’s going on and what needs to happen to get everything back?
The usefulness of standard SQL backups for some of these situations isn’t even clear. Many people take regular VM backups, but is that sufficient? Can you get the tail of the log if your VM disappears? Does a replicated copy of your database provide enough of a safety net here, or in general?
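On that tail-of-the-log point – it’s worth having practised the commands, because they’re easy to fumble under pressure. A sketch (the paths and database name are placeholders, and this assumes the FULL recovery model):

```sql
-- Grab the tail of the log, even if the data files are damaged or gone
BACKUP LOG MyDB TO DISK = N'D:\Backups\MyDB_tail.trn'
WITH NO_TRUNCATE, NORECOVERY;

-- Restore the chain, finishing with the tail
RESTORE DATABASE MyDB FROM DISK = N'D:\Backups\MyDB_full.bak' WITH NORECOVERY;
RESTORE LOG MyDB FROM DISK = N'D:\Backups\MyDB_log.trn' WITH NORECOVERY;
RESTORE LOG MyDB FROM DISK = N'D:\Backups\MyDB_tail.trn' WITH RECOVERY;
```

But if your server is a VM that has simply disappeared, you don’t get to run that first command at all – which is exactly the kind of gap that disaster testing should uncover.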
The key issue is not whether you have a backup. It’s not even whether you have a restorable backup. It’s whether you have what you need to survive if things go south – whichever southbound route you’ve been taken down.
Some people say I talk a lot – but I guess it depends on the context.
Certainly, for many years, I’ve been fairly comfortable about standing up in front of people and explaining things. Whether it’s teaching a course, leading a workshop, presenting at a conference, or preaching at a church, it all has that same “I’m talking, and people are looking at me” feeling. I totally understand why people get nervous about it, and still have a certain amount of terror that I suffer from before getting up to present. It doesn’t stop me doing it – I would happily present all the time, despite the fear factor.
It’s almost a cliché, but the biggest advice I have for new speakers is to realise that the people in the room do actually want to hear what you have to say. They don’t want you to fail.
…but there’s more to it than that.
I can present on just about any topic, so long as I have time to prepare. That preparation time is NOT in creating an effective talk (although that’s part of it) – it’s in getting to know the subject matter well.
Suppose I’m giving a talk about Columnstore indexes, like I just did at the PASS Summit. By all means, I want to craft a story for my presentation, and be able to work out which things I want to communicate through that story. If slides will work, then I’ll need to create them. If demos will work, then I’ll need to plan them too. But most of all, I want to get myself deep into Columnstore. I want to read everything there is on the subject. I want to create them, alter them, explore the DMVs about them, find ways to break them, and generally immerse myself in them. That way, I can speak confidently on the topic, knowing that I’m quite probably the most qualified person in the room to be up the front. I want to be explaining concepts that I know intimately.
When people ask questions, there’s no guarantee that I’ll know the answer. At the end of my talk at the PASS Summit, someone asked me if I’d tried using columnstore indexes in a particular way, and I had to say no. She went on to tell me what she’d found, and it was interesting and piqued my curiosity for an area I hadn’t explored. Would I have been thrown if she’d asked me during the session, in front of everyone else? No – not at all. Because I felt comfortable with the depth of my knowledge.
This applies just the same if I’m preaching in a church. If I’m preaching on a section of Galatians, I want to know that section backwards. I want to know the rest of the chapter, the rest of the book, what the rest of the Bible says on the matter, how it has applied in my own life, and what other people say on it too. I want to have a thorough picture of what God is saying to me, and to the rest of the church, through that passage.
When I get stuck in my words, and stumble in some way, I need to know the topic well. I will have a bunch of sound bites that I’ve rehearsed, and expect to explain things using particular phrases. But those are the things that can disappear from my head when the nerves strike. My safety net is the deep knowledge of the subject, so that I can find a different way of explaining it.
I don’t like giving word-perfect speeches. The idea of talking from a script that I need to stick to exactly doesn’t work for me – I get too nervous and wouldn’t be able to pull it off (although one day I will give stand-up comedy a try, which means having well-crafted jokes that need to be word-perfect to work). Knowing the material is way better than knowing the words, and for me is way less stressful.
My advice to anyone is to get into public speaking. It’s a great way of stretching yourself. But do get into your topic as deeply as you can. If you’ve looked at something from a variety of angles, you will be able to explain it to anyone.
Big thanks to Andy Yun (@sqlbek) for hosting this month’s T-SQL Tuesday.
Thursday! Kilt day.
We start with Grant Fritchey (PASS’ VP of Finance, in a kilt), talking about the various metrics of PASS, which show that the community is growing both numerically and geographically, reaching 87% of countries now. It’s good to know that things are going well. This is all public information, and I’m not going to go into the details here.
He also announces that PASS will have a BA Day – Jan 11th in Chicago. More information on this will follow.
Grant hands over to Denise McInerney (PASS’ VP of Marketing). She announces new branding for the PASS organisation – logo and website (website launching early next year) – and the dates for next year’s summit.
David DeWitt, Adjunct Professor at MIT (previously of Microsoft Research) comes up. He’s going to talk about data warehouse technologies, including cloud and scaling. Amazon Redshift, Snowflake, and SQL DW.
A great session, which will have helped a lot of people appreciate SQL DW more than ever.
So I’m back at the PASS Summit, and the keynote’s on! We’re all getting ready for a bunch of announcements about what’s coming in the world of the Microsoft Data Platform.
First up – Adam Jorgensen. Some useful stats about PASS, and this year’s PASSion Award winner, Mala Mahadevan (@sqlmal).
There are tweets going on using #sqlpass and #sqlsummit – you can get a lot of information from there.
Joseph Sirosh – Corporate Vice President for the Data Group, Microsoft – is on stage now. He’s talking about the 400M children in India (that’s more than all the people in the United States, Mexico, and Canada combined), and the opportunities around addressing student drop-out. Andhra Pradesh is predicting student drop-out using new ACID – Algorithms, Cloud, IoT, Data. I say “new” because ACID is an acronym database professionals know well.
He’s moving on to talk about three patterns: Intelligence DB, Intelligent Lake, Deep Intelligence.
Intelligence DB – taking the intelligence out of the application and moving it into the database. Instead of the application controlling the ‘smarts’, putting them into the database provides models, security, and a number of other useful benefits, letting any application sit on top of it. It can use SQL Server, particularly with SQL Server R Services, and support applications whether in the cloud, on-prem, or hybrid.
Rohan Kumar – General Manager of Database Systems – is up now. Fully Managed HTAP in Azure SQL DB hits General Availability on Nov 15th. HTAP is Hybrid Transactional / Analytical Processing, which fits really nicely with my session on Friday afternoon. He’s doing a demo showing the predictions per second (using SQL Server R Services), and how it easily reaches 1,000,000 per second. You can see more of this at this post, which is really neat.
Justin Silver, a Data Scientist from PROS comes onto stage to show how a customer of theirs handles 100 million price requests every day, responding to each one in under 200 milliseconds. Again we hear about SQL Server R Services, which pushes home the impact of this feature in SQL 2016. Justin explains that using R inside SQL Server 2016, they can achieve 100x better performance. It’s very cool stuff.
Rohan’s back, showing a Polybase demo against MongoDB – I’m sitting next to Kendra Little (@kendra_little), who is pretty sure it’s the first MongoDB demo at PASS – before moving on to show SQL on Linux. He not only installed SQL on Linux, but then restored a database from a backup that was taken on a Windows box, connected to it from SSMS, and ran queries. Good stuff.
Back to Joseph, who introduces Kalle Hiitola from Next Games – a Finnish gaming company – who created an iOS game that runs on Azure Media Services and DocumentDB, using BizSpark. 15 million installs, with 120GB of new data every day. 11,500 DocumentDB requests per second, and 43 million “Walkers” (zombies in their ‘Walking Dead’ game) eliminated every day. 1.9 million matches (I don’t think it’s about zombie dating though) per day. Nice numbers.
Now onto Intelligent Lake. Larger volumes of data than ever before take a different kind of strategy.
Scott Smith – VP of Product Development from Integral Analytics – comes in to show how Azure SQL Data Warehouse has allowed them to scale like never before in the electric-energy industry. He’s got some great visuals.
Julie Koesmarno on stage now. Can’t help but love Julie – she’s come a long way in the short time since leaving LobsterPot Solutions. She’s done Sentiment Analysis on War & Peace. It’s good stuff, and Julie’s demo is very popular.
Deep Intelligence is using Neural Networks to recognise components in images. eSmart Systems have a drone-based system for looking for faults in power lines. It’s got a familiar feel to it, based on discussions we’ve been having with some customers (but not with power lines).
Using R Services with ML algorithms, there’s some great options available…
Jen Stirrup on now. She’s talking about Pokemon Go and Azure ML. I don’t understand the Pokemon stuff, but the Machine Learning stuff makes a lot of sense. Why not use ML to find out where to find Pokemon?
There’s an amazing video about using Cognitive Services to help a blind man interpret his surroundings. For me, this is the best demo of the morning, because it’s where this stuff can be really useful.
SQL is changing the world.
I feel like this topic just keeps going around and around. Every time I’m in a room where someone needs to log into a computer that’s not theirs, there seems to be a thing of “Oh, I know their password…”, which makes me cringe.
I’ve written about this before, and even for a previous T-SQL Tuesday, about two years ago, but there’s something that I want to stress, which is potentially a different slant on the problem.
A password is not just YOUR secret. It’s also a secret belonging to the bank / website / program that the password is for.
Let me transport you in your mind, back to primary school. You had a club. You had a password that meant that you knew who was in the club and who wasn’t (something I’ve seen in movies – I don’t remember actually being in one). At some point you had a single password that was used by everyone, but then you found that other people knew the password and could gain entry – it only took one untrusted person for the password to get out.
You felt upset because that password wasn’t theirs to share. It was the property of you, the club owner. Someone got access to your club when you hadn’t actually granted them access.
Now suppose I’m an online retailer (I’m not, but there are systems that I administer). You’ve got a password to use my site, and I do all the right things to protect that password – one-way hashing before it even reaches the database, never even being able to see it let alone emailing it, and a ton of different mechanisms that make sure that your stuff is safe. You’ve decided to use a password which you’ve generated as a ‘strong password’, and that’s great. Maybe you can remember it, which doesn’t necessarily make it insecure. I don’t even care if you’ve written it down somewhere, so long as you’re treating it as a secret.
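By ‘one-way hashing’, I mean something along these lines – though in practice you’d want a deliberately slow algorithm such as bcrypt or PBKDF2 in the application layer, and the table and user here are entirely hypothetical:

```sql
-- Store a salted one-way hash, never the password itself
DECLARE @Password nvarchar(128) = N'correct horse battery staple';
DECLARE @Salt varbinary(16) = CRYPT_GEN_RANDOM(16);  -- random per-user salt

INSERT dbo.Users (UserName, PasswordSalt, PasswordHash)
VALUES (N'someuser',
        @Salt,
        HASHBYTES('SHA2_512', @Salt + CAST(@Password AS varbinary(256))));
```

The point is that nothing stored in the database can be turned back into the password – checking a login means hashing the supplied password with the stored salt and comparing the results.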
Because please understand, it’s MY secret too.
If the password you use gets out, because maybe someone gets into your LastPass account, or maybe someone steals the PostIt you’ve written it on, or maybe you use that same password at a different site which then gets hacked…
…then that other person has access to MY site as you.
If that other person buys stuff from me as you, I might need to refund you for the money / credit / points you didn’t mean to spend. And if I’ve already sent the goods out, then that’s going to hurt me.
If that other person does malicious things on my site because they’re accessing it as a privileged user, then that’s going to hurt me.
Someone knowing the secret that I’ve worked hard to keep secret… that’s going to hurt me.
I have no control over the password that you choose to use. But please understand that it’s not just YOUR password. Use something that is a secret between you and me. I will never know your password, but I want you to make sure that no one else ever does either. Don’t reuse passwords.
Big thanks to Andy Mallon (@amtwo) for hosting this month’s T-SQL Tuesday.
This month’s T-SQL Tuesday is hosted by Jeffrey Verheul (@devjef) and is on the topic of Cloud.
I seem to spend quite a bit of my time these days helping people realise the benefit of the Azure platform, whether it be Machine Learning for doing some predictions around various things (best course of action, or expected value, for example), or keeping a replicated copy of data somewhere outside the organisation’s network, or even a full-blown Internet of Things piece with Stream Analytics pulling messages off a Service Bus Event Hub. But primarily, the thing that I have to combat most of all is this:
Do I really want that stuff to be ‘out there’?
People aren’t used to having their data, their company information, their processing, going on somewhere outside the building where they physically are.
Now, there are plenty of times when organisations’ server rooms aren’t actually providing as much benefit as they expect. Conversations with people quickly help point out that their web site isn’t hosted locally (I remember in the late ‘90s a company I was at making the decision to start hosting their web site at an actual hosting provider rather than having every web request come in through the same modem as all their personal web browsing). Email servers are often the next to go. But for anyone working at home, the server room may as well be ‘the cloud’ anyway, because their data is going off to some ‘unknown’ place, with a decent amount of cabling between where they are and where their data is hosted.
Everyone’s photos are stored in the ‘cloud’ already, whether it be in Instagram’s repository or in something which is more obviously ‘the cloud’. Messages with people no longer just live on people’s phones, but on the servers of Facebook and Twitter. Their worries and concerns are no longer just between them and their psychiatrist, but stored on Google’s search engine web logs.
The ‘cloud’ is part of today’s world. You’re further into it than you may appreciate. So don’t be afraid, but try it out. Play with Azure ML, or with other areas of Cortana Intelligence. Put some things together to help yourself in your day-to-day activity. You could be pleasantly surprised about what you can do.
I have a session coming up at both the PASS Summit in October and the 24HOP Summit Preview event in September, on Operational Analytics. Actually, my session is covering the benefits of combining both In-Memory and R into the Operational Analytics story, to be able to see even greater benefits…
…but I thought I’d do some extra reading on Real-Time Operational Analytics, which also suits this month’s T-SQL Tuesday topic, hosted by Jason Brimhall (@sqlrnnr). He’s challenged us all to sharpen our skills in some area, and write about the experience.
Now, you can’t look at Real-Time Operational Analytics without exploring Sunil Agarwal (@S_u_n_e_e_l)’s excellent blog series. He covers a few things, but the one that I wanted to write about here is Compression Delay.
I’ve played with Compression Delay a little, but I probably haven’t appreciated the significance of it all that much. Sure, I get how it works, but I have always figured that the benefits associated with Compression Delay would be mostly realised by having Columnstore in general. So I was curious to read Sunil’s post where he looks at the performance numbers associated with Compression Delay. You can read this yourself if you like – it’s here – but I’m going to summarise it, and add some thoughts of my own.
The thing with Operational Analytics is that the analytical data, reporting data, warehouse-style data, is essentially the same data as the transactional data. Now, it doesn’t look quite the same, because it’s not been turned into a star-schema, or have slowly changing dimension considerations, but for the purposes of seeing what’s going on, it’s data that’s capable of handling aggregations over large amounts of data. It’s columnstore.
Now, columnstore data isn’t particularly suited to transactional data. Finding an individual row within columnstore data can be tricky, and it’s much more suited to rowstore. So when data is being manipulated quite a lot, it’s not necessarily that good to be using columnstore. Rowstore is simply better for this.
But with SQL 2016, we get updateable non-clustered columnstore indexes. Data which is a copy of the underlying table (non-clustered data is a copy – clustered data or heap data is the underlying table). This alone presents a useful opportunity, as we can be maintaining a columnstore copy of the data for analytics, while handling individual row updates in the rowstore.
Except that it’s a bit more complicated than that. Because every change to the underlying rowstore is going to need the same change made in columnstore. We’re not actually benefiting much.
Enter the filtered index. With a flag to indicate that frequent changes for that row have finished, we can choose to have the columnstore copy of the data only on those rows which are now relatively static. Excellent. Plus, the Query Optimizer does some clever things to help with queries in this situation.
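The filtered version looks something like this – a hypothetical orders table, where the columnstore copy only covers the rows whose flag says they’ve stopped changing:

```sql
-- Columnstore copy of only the 'settled' rows (SQL Server 2016)
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_Orders_Settled
ON dbo.Orders (OrderId, OrderDate, Quantity, Price)
WHERE OrderStatus = 'Complete';
```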
But many systems don’t have a flag like that. What then?
Well, one nice option is to consider using Compression Delay.
Compression Delay tells our columnstore index to delay compressing the data for some period of time. That is, to not turn it into proper columnstore data for a while. Remember I said that columnstore data doesn’t enjoy being updated much – this is to prevent that pain, by leaving it as rowstore data for a while.
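It’s just an option on the index – here with hypothetical names, telling SQL Server to leave newly arrived rows in the rowstore delta store for at least an hour before compressing them into columnstore segments:

```sql
-- Keep fresh rows in the delta store for 60 minutes before compression
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_Orders
ON dbo.Orders (OrderId, OrderDate, Quantity, Price)
WITH (COMPRESSION_DELAY = 60);  -- minutes

-- Or adjust an existing index
ALTER INDEX ncci_Orders ON dbo.Orders
SET (COMPRESSION_DELAY = 60);
```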
I haven’t really explored this much myself yet. I have a few simulations to run, to see what kind of performance gains can be had from this. But Sunil’s experiments saw a 15% improvement on the OLTP workload by choosing an appropriate Compression Delay, and that sounds pretty good to me.
I feel like there’s so much more to be explored with these new technologies. Having that flag to indicate when a row can be pulled into a filtered columnstore index seems really useful. Compression Delay seems great too, and in many ways feels like a nicer solution than ending up with a filtered index that might not catch everything. Compression Delay to me feels like having a filtered columnstore index that uses getdate() (which I think would be a lovely feature), although it’s not quite the same.
So I’m going to keep playing with this, and will be teaching you plenty of information at both the upcoming events. I could present a lot of it now, but I would prefer to deepen my understanding more before I have to stand in front of you all. For me, the best preparation for presentations is to try to know every tiny detail about the technology – but that’s a path I’m still on, as I continue to sharpen.
Don’t get me started on how I keep seeing people jump into Azure SQL DW without thinking about the parallel paradigm. SQL DW is to PDW, the way that Azure SQL DB is to SQL Server. If you were happy using SQL Server for your data warehouse, then SQL DB may be just fine. Certainly you should get your head around the MPP (Massively Parallel Processing) concepts before you try implementing something in SQL DW. Otherwise you’re simply not giving it a fair chance, and may find that MPP is a hindrance rather than a help. Mind you, if you have worked out that MPP is for you, then SQL DW is definitely a brilliant piece of software.
One of the biggest frustrations that people find with SQL DW is that you need (or rather, needed) to use SSDT to connect to it. You couldn’t use SSMS. And let’s face it – while the ‘recommended’ approach may be to use SSDT for all database development, most people I come across tend to use SSMS.
But now with the July 2016 update of SSMS, you can finally connect to SQL DW using SQL Server Management Studio. Hurrah!
…except that it’s perhaps not quite that easy. There’s a few gotchas to be conscious of, plus a couple of things that caused me more frustration than I’d’ve liked.
First I want to point out that at the time of writing, SSMS is still not a supported tool against PDW. You’ve always been able to connect to it to write queries, so long as you can ignore some errors that pop up about NoCount not being supported, but Object Explorer simply doesn’t work, and without Object Explorer, the overall experience has felt somewhat pained.
Now, when you provision SQL DW through the Azure portal, you get an interface in the portal that includes options for pausing, or changing the scale, as per this image:
And you may notice that there’s an option to “Open in Visual something” there. Following this link gives you a button that will open SSDT, and connect it to SQL DW. And this works! I certainly had a lot more luck doing this than simply opening SSDT and putting in some connection details. Let me explain…
In that image, notice the “Show database connection strings” link. That’s where you can see a variety of connection strings, and from there, you can extract the information you’ll need to make a connection in either SSDT or SSMS. You know, in case you don’t want to just hit the button to “Open in Visual something”.
When I first used these settings to connect using SSDT (rather than using the “Open in…” button), it didn’t really work for me. I found that when I used the “New Query” button, it would give me a “SQLQuery1.sql” window, rather than a “PDW SQLQuery1.dsql” window, and this wasn’t right. Furthermore, if I right-clicked a table and chose the “View Code” option, I would get an error. I also noticed that when I connected using the “Open in…” button, it would tell me I was connected to version 10.0.8408.8, but when I tried putting the details in myself, it would say version “12.0.2000”. I’ve since found out that this was my own doing, because I hadn’t specified the database to connect to. And this information turned out to be useful for using SSMS too.
There is no “Open in SSMS” button in Azure. But you can connect using the standard Connect to Database Engine part of SSMS.
And it works! Previous versions would complain about NOCOUNT, and Object Explorer would have a bit of a fit. There’s none of that now – terrific.
And you get to see everything in the Object Explorer too, complete with an icon for the MPP database. But the version says 12.0.2000.8 if you connect like this.
To solve this, you need to use the “Options >>>” button in that Connect to Server dialog, and specify the database. Then you’ll make the right connection, but you’ll lose the “Security” folder in Object Explorer.
Now, it’s not perfect yet.
When I look at Table Properties, for example, I can see that my table is distributed on a Hash, but it doesn’t tell me which column it is. It also tells me that the server I’m connected to is my own machine, rather than the SQL Azure instance.
I can see what the distribution column is within the Object Explorer, because it’s displayed with a different icon, but still, I would’ve liked to have seen it in the Properties window as well. It’s not going to get confused by having a golden or silver key there, as it might in a non-parallel environment, because those things aren’t supported. If they do become supported, I hope they manage to come up with another way of highlighting the distribution column.
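Until the Properties window shows it, the distribution column can be pulled from the catalog views instead. Here’s a sketch (with a hypothetical table name) using the PDW-specific DMV that SQL DW exposes:

```sql
-- Find the hash-distribution column of a table in SQL DW / PDW
-- (distribution_ordinal = 1 marks the distribution column):
SELECT c.name AS DistributionColumn
FROM sys.pdw_column_distribution_properties AS d
JOIN sys.columns AS c
  ON c.[object_id] = d.[object_id]
 AND c.column_id = d.column_id
WHERE d.distribution_ordinal = 1
  AND d.[object_id] = OBJECT_ID(N'dbo.MyTable');  -- hypothetical table name
```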
One rather large frustration is the very promising link on the database to “Open in Management Portal”, which opens a browser within SSMS (not exactly my preferred browser, but it seems like a good use for that feature). I’m okay with this, but following the link to the Query Performance Insight page, I’m immediately disappointed:
I get that SSMS doesn’t host the most ideal browser for this kind of thing, and that I’m probably going to be running a separate browser anyway, but I would like this to be addressed in a future update.
Probably my biggest frustration is that when I start a new query, I get this set of warnings:
…which suggests that it doesn’t really know about SQL DW. I can suppress the dialog so that it doesn’t re-appear, but I don’t like the feeling that the system is attempting those operations at all.
It’s certainly a lot less painful than it was in the past though. I love the fact that I can use the Object Explorer window. I love that I can script objects, in a way that feels way more natural to me than in SSDT.
This is SSDT:
This is SSMS:
Oddly, though, the SSMS script includes a USE statement at the top, which isn’t supported in SQL DW (I’m sure this won’t be the case for much longer).
Overall, I’m really pleased that the team has put things in place to make SSMS talk to SQL DW at all. I was beginning to think that SSMS wasn’t going to come to this particular party. This release, despite having some way to go just yet, suggests that I’ll soon be using SSMS more when I’m using SQL DW.
And that makes this topic worthy of Chris Yates’ T-SQL Tuesday blog party this month – celebrating the new things that have come along in the SQL world recently.
It’s T-SQL Tuesday, but this isn’t actually my post!
Regular readers might remember that I posted about “Number of Rows Read” a while back, and in that post, I mentioned that SQL Sentry should consider having a warning in Plan Explorer. Well, they put that feature in, and asked me if I could write about it. So I did.
But this wasn’t a standard “Can you write about our product, so we can post it on our main product page”, this was an offer to write for sqlperformance.com – alongside a select few others such as SQLskills people like Paul, Jonathan, Erin, Glenn, Tim, and (back when he was with them) Joe, SQL Sentry people like Aaron and Kevin, Jason, Rick, and the legendary Paul White.
Now, I consider writing for sqlblog.com a huge honour (and I’m still hugely grateful to Adam Machanic (@adammachanic) for his invite to write here, and the hosting that he does), and being invited to write for sqlperformance.com is a huge honour as well (just like it’s an honour to be asked to write for books, or to work for particular clients). So I’m going to write for both. I’ve just had three posts published on sqlperformance.com, but I plan to keep the numbers fairly even between the two. sqlblog.com will remain my main blogging site, and the one that I refer people to. And I still want to make sure I publish something here each month.
The content on both sites is excellent, from all the authors. Part of what attracted me to sqlblog.com was the high standard of content here, and the number of bloggers that I admire here, and the list of authors at sqlperformance.com is very strong too. Both Aaron and Paul (White) often write about T-SQL performance, which is close to my heart as well, so I think my content fits in quite nicely there.
Those three posts I’ve written are:
Number of Rows Read / Actual Rows Read warnings in Plan Explorer
What's "Actually" going on with that Seek?
SQL Server 2016 Temporal Table Query Plan Behaviour
One of my favourite things about Power BI is its extensibility. For ages I would hear about how good some other self-service reporting platforms were, and how things in the Microsoft space were lacking. Power View, in particular, was frustratingly limited in what you could do with it, and I felt somewhat disappointed. It was good as a data exploration tool, but simply wasn’t a good reporting environment if you wanted something that was customisable.
But in recent times, Power BI has really stepped up, with custom visualisations offering the ability to extend capabilities much further.
I hadn’t explored much in the way of custom visuals in Power BI until a while back, even though I was very much aware of the competition that was held in September. It had been on my list to explore some of what was possible. And this month, the T-SQL Tuesday topic (hosted by Wendy Pastrick – @wendy_dance) was to learn something new and to blog about it. So it seemed a good idea to learn how to make my own custom visualisation!
Now, creativity isn’t exactly my thing. I find it really hard to write songs, for example. I know how to do it – but I quickly become self-critical and get stuck. Writing is easier, because it feels less ‘creative’, and appeals more to the teacher / preacher in me (and I know that takes creativity, especially if you’ve ever seen me present, but it’s different). So sitting down and coming up with a new way of visualising data wasn’t something I was going to do.
But I found the Synoptic Panel visualisation, which is actually the one that won the competition, and learned some of how to use this. This one lets you make your own visualisations based on your own images. It’s really neat.
There are two aspects to it.
1. The .pbiviz file. This is what you add as your custom visualisation to the Power BI Desktop. You can get it from the Power BI Visuals Gallery, along with a bunch of other cool visuals. It’s the one that looks like a room with lots of colours.
2. Your image, with a bunch of extra mark-up. This is where the magic happens.
You head over to http://synoptic.design/ where there’s a tool for creating it all. You drop in your image (I went with the LobsterPot logo, which was my own design, in a moment of creativity), then pick the “magic wand” icon and click on the various areas of your image.
Here I’ve made just two areas, by clicking on the two parts of the logo. I can name them if I want, but already I’m pretty much done. There’s a big button on the right that says “Export to Power BI”, which lets me download an SVG file. Interestingly, I already had an SVG image of the company logo, but I needed the right markup, so I needed the exported one.
Now over in the Power BI Desktop tool, I wanted to see what this looked like. I started by importing the custom visualisation. I clicked on the ellipsis at the end of the “Visualizations” pane, and got the extra one added.
Clicking this to add it to my report, I got a window that looked like this:
And I expected to see some way of getting my LobsterPot claw to display. No such luck. I clicked all over, and saw nothing.
But it turned out I just needed to add some data to it first. Once I’d done that, I got some options:
Hitting “SELECT MAP”, I could find my SVG file and I got my claw!
The data I had was just two values. I wanted to be able to colour each section a different colour. But the areas of my image were already red, so I coloured my areas in white, setting the saturation to be less for the higher values (less white, therefore more red), and more for the lower values (more white, so less red). I made a negative measure to help with this.
But quite quickly, I had my claw, with a value of 90 showing quite red, and a value of 50 showing a fainter pink colour.
In hindsight, I should’ve edited my image before I started, making the two areas white, and then I could’ve easily coloured them in red. But in my exploration, I was able to learn some of the capabilities of this useful control (tip: read https://powerbi.microsoft.com/en-us/blog/visual-awesomeness-unlocked-using-the-synoptic-panel/), and I figure that the availability of this visualisation may mean that I never worry about creating my own from scratch.
I wonder whether I will one day come up with a visualisation of my own. Perhaps, although for the time being I’ll leave it up to the experts. I can use this one for all kinds of things, I suspect.
With SQL Server 2005’s extended support ending today, it seems appropriate to write a post about “My Favourite SQL Server Feature” for T-SQL Tuesday this month, hosted by Jens Vestergaard (@vestergaardj).
The thing is that when I consider reasons to upgrade from SQL Server 2005, I see a ton of features that could be leveraged in later versions. You could use availability groups, or columnstore indexes, or spatial. You could use the Tablix control in SSRS, or project-based SSIS configurations... so many reasons. But of course, if you’re feeling stuck on older versions, then it’s the backwards-compatibility of SQL Server that is the key. The new features can come later, once you’ve moved to a later version.
Now, I appreciate that if you have outer joins that use the “*=” syntax, then upgrading is going to take some effort. That does need to be fixed before you can move to later versions.
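For anyone facing that particular fix, the rewrite is mechanical. Here’s a hedged sketch with hypothetical table names:

```sql
-- Old-style outer join syntax, not accepted beyond compatibility level 80:
-- SELECT c.Name, o.OrderID
-- FROM dbo.Customers c, dbo.Orders o
-- WHERE c.CustomerID *= o.CustomerID;

-- The ANSI equivalent, valid on every later version:
SELECT c.Name, o.OrderID
FROM dbo.Customers AS c
LEFT OUTER JOIN dbo.Orders AS o
  ON o.CustomerID = c.CustomerID;
```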
You see, what I love the most about SQL Server today, and giving reasons to upgrade, is actually SQL Azure. If you have provisioned SQL Database in Azure, then your system is continually being upgraded. You are on the newest version of the platform. There is no decision you can make to satisfy those people who say “Hang on, leave me on SQL Server 2008 R2, because my third-party product doesn’t support SQL Server 2012 yet...”
I can assure you that Microsoft does not want to break the code that you have running successfully in SQL Database. They will continue to improve the platform, and provide new features, but I’m confident that any code that you write today for SQL Database will continue to work there for a very long time.
And that gives me hope for on-prem SQL Server environments too. I feel confident that things I do today, whether I’m dealing with new SQL 2016 work, or back as far as SQL 2008 R2, will continue to work in future versions. Because Microsoft is in the business of upgrading you to the latest version.
My Windows machine pulls down updates automatically. My iPhone does the same. My SQL Database does. I would like my on-prem SQL Server to be doing the same, automatically deploying things into a SQL 2016 environment as it becomes available.
Get off SQL 2005 as soon as you can, and brace yourself for frequent upgrades. Microsoft now recommends proactively applying cumulative updates, so I’m confident that it’s only a matter of time before upgrading becomes a continuous process on-prem, like it is in SQL Azure.
My favourite SQL Server feature is its backwards-compatibility.
...but first, let’s look at one oft-forgotten reason why finding a particular piece of text can be slow, if not impossible: collation. This will then provide a useful platform for making it go faster.
I say ‘impossible’, but of course it’s never impossible to find something in a database (assuming it’s there). It might take longer, but you can always scan the column for it, starting on the first page and going until you’ve found it. Various things like Full Text Search can help make things easier, but all-too-frequently we see code that searches for SomeText%, or worse: %SomeText%. This is the thing I want to look at – finding non-indexed string patterns.
First let’s remember that if we are hoping to use an index, we need to know what language we’re in. I have spoken before about how I picked up a map in Sweden to find Västerås, but couldn’t find it listed in the map’s index. I didn’t realise that in Swedish, ‘ä’ and ‘å’ are not the same as ‘a’, and are found in a different part of the alphabet. When I searched using an English alphabet, I couldn’t find the entry. I might think that ‘Västerås’ and ‘Vasteras’ are the same, but a Swedish person would tell me otherwise. It’s like if I refer to a game as ‘football’ – you would need to understand my personal collation setting to know what I was talking about. When Michael Palin sang (as a lumberjack) about wearing high-heels, suspenders and a bra, he wasn’t referring to anything that held his trousers up, despite what people using an American collation setting might think.
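To make that concrete, here’s a small sketch (the place names are just illustrative values) showing how the same three strings sort differently under a Swedish collation and a general Latin one:

```sql
-- Under a Swedish collation, 'Ä' sorts after 'Z',
-- so Ängelholm comes last:
SELECT v.name
FROM (VALUES (N'Arboga'), (N'Ängelholm'), (N'Zanzibar')) v(name)
ORDER BY v.name COLLATE Finnish_Swedish_CI_AS;

-- Under Latin1_General, 'Ä' sorts with 'A',
-- so Ängelholm comes second:
SELECT v.name
FROM (VALUES (N'Arboga'), (N'Ängelholm'), (N'Zanzibar')) v(name)
ORDER BY v.name COLLATE Latin1_General_CI_AS;
```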
But this is about making searches for text go faster. If we’re comparing two strings in a different collation we get an error, but let’s think about speed.
Consider that I’m looking for rows in a table WHERE CommentText LIKE '%Farl%'. Right away, I’m sure you appreciate that no amount of regular indexing on CommentText would let me perform an ordinary b-tree index search to find that row. I could improve it by using other technologies that will allow the individual words in my text to be found, but I’m just looking for a particular piece of text. It’s not even a whole word.
For my experiment, I’m using a table on SQL 2014 on my Surface Pro 2. It’s a larger version of AdventureWorks2012’s Person.Address with 19 million rows. There is a column called AddressLine1, which has collation SQL_Latin1_General_CP1_CI_AS and has type nvarchar(60). You can create it using code like this:
CREATE TABLE [Person].[Address_Big](
[BigAddressID] [int] IDENTITY(1,1) NOT NULL,
[AddressID] [int] NOT NULL,
[AddressLine1] [nvarchar](60) NOT NULL,
[AddressLine2] [nvarchar](60) NULL,
[City] [nvarchar](30) NOT NULL,
[StateProvinceID] [int] NOT NULL,
[PostalCode] [nvarchar](15) NOT NULL,
[SpatialLocation] [geography] NULL,
[rowguid] [uniqueidentifier] NOT NULL,
[ModifiedDate] [datetime] NOT NULL
);
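The original population script isn’t shown; a hedged sketch of one way to inflate Person.Address (which has roughly 19,000 rows) to about 19 million would be:

```sql
-- One possible way to populate Person.Address_Big – an assumption,
-- since the post doesn't include the script it actually used:
INSERT INTO Person.Address_Big
    (AddressID, AddressLine1, AddressLine2, City,
     StateProvinceID, PostalCode, SpatialLocation, rowguid, ModifiedDate)
SELECT a.AddressID, a.AddressLine1, a.AddressLine2, a.City,
       a.StateProvinceID, a.PostalCode, a.SpatialLocation, NEWID(), a.ModifiedDate
FROM Person.Address AS a
CROSS JOIN (SELECT TOP (1000) [object_id] FROM sys.all_objects) AS x;
```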
select * from Person.Address_Big
where AddressLine1 like N'%Farl%'
option (maxdop 1);

I ran this query quite a few times, and it took about 40 seconds to tell me there were no rows returned.
Obviously no one lives on 203 Farley Avenue, or 1 Farlabulous Drive. But nor do they live at 711 Gofarles Street. You see, despite the fact that I had specified ‘Farl’ with a capital F and lower-case ‘arl’, it didn’t care about that at all. My collation setting told it explicitly to ignore case. That’s what the CI was for in SQL_Latin1_General_CP1_CI_AS. In fact, if we query select * from fn_helpcollations() where name = 'SQL_Latin1_General_CP1_CI_AS'; we see it says “Latin1-General, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 52 on Code Page 1252 for non-Unicode Data”. So it’s not only case-insensitive, it’s also kanatype-insensitive and width-insensitive too.
Clearly there is a lot more work for it to do, when scanning large amounts of text looking for a group of consecutive characters that could match inexactly.
Now, without changing my table at all, I changed my query like so, telling it to use a binary collation for the search. To search exactly rather than inexactly.
select * from Person.Address_Big
where AddressLine1 like N'%Farl%' collate Latin1_General_BIN
option (maxdop 1);
I could’ve used the SQL collation SQL_Latin1_General_CP437_BIN, but I find it easier to remember the Windows collation, and in Australia, the default settings for SQL are to use Windows collations. They’re a little better than the SQL collations.
But anyway – this query returned in just 7 seconds. I re-ran the original one – 40 seconds. I re-ran this one – 7 seconds. It really was significantly faster. The plan is the same. There is no sneakiness going on. The search for the binary text was simply faster.
This makes sense. If I’m looking for a particular string, it’s going to be quicker if I can just look for the exact bits, and not have to consider what the text might be in a different case, or if width needs to play a part, and so on.
Now you might think “Great – I’m going to add that to all my string searches”, but you should understand that there is potential for the results to be different. If someone in my table lived on FARLEY BOULEVARD (in all caps, in the way that French people often write their surnames, for example), then that row would have been found in my case-insensitive-collation search, but not in my binary-collation search for the lower-case letters ‘arl’. It’s useful if the data in your system is only stored in capitals, in which case (ha!) you could actually change the collation of your column – but it’s definitely worth considering the benefits of asking for a collation-specific search.
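To make the difference concrete, here’s a standalone sketch (simple expressions, not the table from the experiment) showing the two collations disagreeing on the same value:

```sql
-- Case-insensitive collation: the all-caps value matches the mixed-case pattern
SELECT CASE WHEN N'FARLEY BOULEVARD' LIKE N'%Farl%'
            THEN 'found' ELSE 'not found' END;   -- found

-- Binary collation: the exact bits must match, so 'ARL' doesn't match 'arl'
SELECT CASE WHEN N'FARLEY BOULEVARD' LIKE N'%Farl%' COLLATE Latin1_General_BIN
            THEN 'found' ELSE 'not found' END;   -- not found
```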
And what about grouping, you ask? Ok, maybe I didn’t hear you ask that, but let’s pretend you did.
If there’s an index on the column you’re grouping by, then changing the collation is going to hurt a bit. Grouping could take advantage of a Stream Aggregate under our indexed collation, but changing the collation throws away the index order (ORDER BY doesn’t get handled well by a collation change either), so a Hash is required. But comparing two query plans that both use Hash Match (Aggregate) – one on a case-insensitive collation and one on a binary collation – I found the latter was slightly faster. Not as drastic a change as searching, but still 10-30% better: one would run in about 12 seconds, and one in about 10.
select City, count(*)
from Person.Address_Big
group by City
option (maxdop 1);

select City collate Latin1_General_BIN, count(*)
from Person.Address_Big
group by City collate Latin1_General_BIN
option (maxdop 1);
Considering what’s going on with a hash function and non-exact strings is actually pretty interesting. HASH(Value) must produce the same value for any two values that are considered equal – such as ‘FARLEY’ and ‘Farley’ in my CI collation. For this to happen, it obviously can’t hash the actual values, it must have to convert the values into a common form that will hash the same way regardless of case, kana, and width. But this is work that is hidden from the query plan. We can see the impact of it through the query speed, but not anywhere in the plan. This will become yet another thing for me to investigate – but not this week before T-SQL Tuesday comes around and I need to publish this post. New father Bob Pusateri (@sqlbob) is hosting this month, about text searching, in case you hadn’t guessed.
Finding things to publish online for public consumption is something I’m often a bit reluctant to do. Most of my work is for customers, and there’s no way I’m going to share some of their data unless they’ve explicitly allowed it.
So when Jorge Segarra (@sqlchicken) posted a challenge to publish a Power BI report on the web for T-SQL Tuesday, I had to give some thought to what kind of data to show. Luckily, Scott Stauffer (@sqlsocialite) has been Fitbit-challenging me over recent weeks – in particular, in challenges that don’t include some of the big-steppers like Steve Stedman (@sqlemt), who should probably change his name to Stepman, considering he has a treadmill desk and does over 100k steps each week. Anyway – with a group of people who do the same order of magnitude of steps as me, I stretched myself to do better than I had been doing, and figured this could make useful data.
I started with an export of the last 31 days from Fitbit. That’s as much data as you can pull down from them, and although I could go to the effort of getting extra exports and combining them, I didn’t for this. After all, I’d rather be out getting more steps done than analysing them.
I had a bit of cleaning to do first, because the data had an annoying first line. In fact, I found it easier to pull the data in as text, remove the top line, then split the data up by the delimiter. I could then mark the various columns as numbers, which made life a lot easier.
After all this was done, I was ready to throw some stuff onto a report (ok, I also added some measures to show the steps as a percentage of the total steps – they have to be measures to handle the division properly). It was easy to get a chart on there, and a card to show some of the numbers. But I wanted to make it a bit more interesting... so I added an average, to show how my increased steppage made an impact on my average.
I added a column: RollinAvgSteps = AVERAGEX(FILTER(fitbit_export_20160214, EARLIER(fitbit_export_20160214[Date])>=fitbit_export_20160214[Date]),fitbit_export_20160214[Steps])
...which takes the average of my steps to date. There are a bunch of ways to achieve this, but this is the way that I chose. And you can see that the average line is (happily) improving! Oh, and because I pulled down the extract on the 14th, there’s a dip at the end. My numbers were much healthier by the end of the day, and despite spending way too long NOT walking, I did about 7244 steps that day.
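Since this blog usually talks T-SQL, it’s worth noting the same running average is a one-line window function there. A sketch, with hypothetical table and column names:

```sql
-- Running average of steps to date, assuming a dbo.FitbitExport table
-- with one row per day (names are hypothetical):
SELECT [Date], Steps,
       AVG(1.0 * Steps) OVER (ORDER BY [Date]
                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS RollingAvgSteps
FROM dbo.FitbitExport
ORDER BY [Date];
```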
You can see the result at http://bit.ly/RobFitbit, which looks like:
Oh, and having done all this, I discovered that fellow SQL Saturday Melbourne precon presenter Reza Rad has a series on doing Power BI on Fitbit data – I was pleased to see that he did a similar set of transforms to the data.