- Owner/Principal of LobsterPot Solutions (a Microsoft Gold Partner consulting firm), Microsoft Certified Master, Microsoft MVP (SQL Server), APS/PDW trainer, and leader of the SQL User Group in Adelaide, Australia. Rob is a former director of PASS, and provides consulting and training courses around the world on SQL Server and BI topics.
If data modelling were easier, I doubt there would be as many books on the subject, and we wouldn’t have multiple methodologies to consider.
I’m not going to explore the different methodologies here – that’s almost a religious argument these days, and I am more than happy to let you adopt whichever method you like. Instead, I want to challenge you to think about what is driving your design, and what makes you consider whether it suits your business or not.
Time and time again I see companies that use software to help them run their business. Sometimes this is an off-the-shelf system or a cloud-based solution; sometimes it’s a bespoke system built by software developers. I’m definitely in favour of using software, and wonder how people operate without it these days.
...but how much is your business driven by the software? I see a lot of businesses being led by their software, rather than having the software adapt to the business. For the most part, I’m fine with either. There is a lot to be gained by using systems developed by similar businesses, and taking advantage of lessons learned by others. Letting that software help guide internal processes can be very useful.
But I don’t think that should apply to your data models – or at the very least, you should consider how much it does.
I don’t like to write about specific customer situations, so I’m not going to describe a particular anecdote in great detail here. But I want to say that I frequently see environments where the models used within data warehouses don’t describe the business that’s going on – they describe the software that’s used.
Many data professionals look at a data warehouse as a platform for reporting, built according to the available data sources. I disagree with this.
The models within a data warehouse should describe the business. If they don’t, the data model has gone wrong.
What is the central thing that your business does? What is the main interest point? What do you need to look after? For me, this forms the core of the warehouse.
The core of the warehouse is not necessarily the main fact table, but could be one of the main dimensions. If you’re a store, do you care about sales, or do you care about customers? The difference is subtle, but might drive some important design elements.
Two clothing stores might use the same back-end systems for their point-of-sales systems, and might have the same loyalty system set up to persuade people to keep coming back. But one store might have a focus of getting customers back, driving brand loyalty which leads to dedicated fans and word-of-mouth sales. The other store might be more about piquing interest from people walking past the door, and trying to get them to come in and pick up a bargain. Of course, there will be an element of both in both stores, but the culture amongst the staff will be slightly different, as the first tries to identify the customer, tries to make sure that the customer feels cared for, and tries to form a relationship with the customer. It’s less important that the customer buys something, so long as they are going to return. The second sees the customer as a way to get a sale, while the first sees the sale (or even the lack of a sale!) as a way to get a customer. I’m sure you can think of stores in each category.
It would be very easy to create the same data warehouse for both stores, using a standard retail environment. But are the needs of the stores adequately met?
There is no doubt that both stores need sales to stay afloat – the retail business requires it. But if your business culture has slightly different concerns to the industry standard, then the data model should cater for that. Perhaps you need a way of scoring customer loyalty, and some path analysis to see what helps a customer reach a particular level of engagement. Perhaps you need to start collecting extra data. Maybe the stores could consider awarding points for simply visiting the store, even if no sales are actually made. Is the person who works from a cafe and buys just one cup of coffee all morning good for business, or bad for business? Can your data model help explore this, or are you designing a system which only handles the data in your transactional system?
I like to come back to the description of a data warehouse as the single source of truth for an organisation. Many people consider this a matter of data quality – that once data is in the warehouse, it’s trusted and can be used for business analytics. But it should go beyond that. The data warehouse should have transformed the data kept by the various software packages into data which describes the business, becoming the source of truth about the business. The reports and dashboards across this data should help identify the culture of the organisation, by highlighting its values and ideals.
The starting point for a data warehouse design should not be “What are the facts we need to measure?” but rather “What are we about as a business?” – often similar, but occasionally not. Ask what success looks like and what questions will address that.
Don’t ignore the bottom line, but also don’t ignore what's really important to the business.
This post was prompted by the seventy-second monthly T-SQL Tuesday, hosted this month by Mickey Stuewe (@sqlmickey).
A regular spot at the PASS Summit is the Women in Technology lunch. This year is no different.
A few years ago, I was on a discussion panel at the lunch. The last couple of years, though, the format has changed to an interview with a champion for WiT. This year, Angie Chang is being interviewed about HackBright, an initiative that helps women build careers in IT. Angie has also been involved with Girl Geek Dinners.
HackBright’s classes are women-only. This is terrific, and men in technology need to understand how important it is.
My daughter goes to a girls’ school. She can learn there without any prejudice about which activities are suited to boys, and which are more suited to girls. So she learns how to program a robot and plays cricket, as well as doing gymnastics, dance, and choir. She sees no differentiation between these things, and will be able to discover skills that she might not have developed if she’d had to compete against boys. I don’t play cricket myself, but I know that in co-ed schools, it’s only really the boys that play cricket or program computers.
Angie is talking about scholarships that are available through HackBright, sponsored by some of the leading employers in the IT space. Clearly there are companies who have realised the value of technical women, and who want to ensure that their companies are welcoming to women. They talk about increasing the number of women who are getting into IT, but also understand that if organisations don’t provide cultures that encourage women to stay and develop their careers there, then the efforts of initiatives like HackBright get wasted, and the IT industry doesn’t improve.
A question has just come in from a guy who has a 16yo daughter in the San Francisco area, and he’s asked if his daughter is welcome to go to the Girl Geek Dinners.
Naturally, the answer is yes. :)
The second day keynote is always a highlight for me. Until a few years ago there were three keynotes, with the third day including a session from Microsoft Research. Recently this has changed, and the third day keynote rolls into the second. Today I’m expecting some new announcements, some updates on how PASS is tracking, an acknowledgement of the outgoing president, and more – plus the Microsoft Research session. And again I’m blogging as we go. Also, it’s SQLKilt Day, in support of the PASS Women in IT initiatives to encourage women to consider IT as a career option and to support women in the IT community.
The information about PASS finances, membership numbers, and reach continues to impress me. I used to be on the board, and was involved in some of the globalisation initiatives. I love that PASS is so much more than North America. While I would love it to be easier to get to the US from Australia, I keep coming across PASS people from other countries (including quite a lot in Australia). PASS has committed to hosting the Summit in Seattle until at least 2019, which pleases me greatly; the 2016 Summit will be in the week of October 24-28.
Lance Harra (@sqlfarmer) has won the PASSion award. I’m really not surprised at this – over the years I’ve been involved with PASS, I’ve seen his name as one of the regulars.
Before I know it, Dr Rimma Nehme from Microsoft Research has taken the stage, and she’s speaking about the Internet of Things. Dr David DeWitt is doing part of the talk, and these two complement each other really well. David is a regular at PASS keynotes, while Rimma did her first one last year. I’m not going to try to describe all the points that they’ve covered – but if you are interested in IoT, this is a great talk that covers IoT from basics up to some of the issues they are seeing with it. Even people experienced with IoT will get something out of this, and it’s explained in a nice simple way, as we’ve come to expect from these two.
Later today I have the Women in IT lunch to attend, another booth session and another main session to deliver, as well as catching up with a lot more people.
I’m back at the PASS Summit. Another year – my sixth now. And I’m sitting at the bloggers’ table, next to Mark Broadbent (@retracement).
The PASS Summit is by far the best SQL Server event in the world each year – even better than the Adelaide SQL Server User Group sessions, and the first keynote is always full of announcements.
It always starts with a bunch of information about how the reach of the SQL community has grown over the years, and it’s an amazing thing to see the impact that the community has these days. Tom LaRock, the current PASS President, has given the announcements about this.
Joseph Sirosh is up now. He’s the newly appointed Corporate VP for the Data Group. He’s speaking about the changing face of data. I feel like this is a regular spot at keynotes – data has been changing so fast for years now, and the things that people are using data for become more and more impressive. Last year it was about analysing shopping patterns for people who move through a department store – this year it’s looking at huge quantities of medical information to predict current and future medical conditions. This lets people save lives with data now, because early intervention becomes even more possible.
Eric Fleischman is Chief Architect and a VP of Platform Engineering at DocuSign. He’s talking about how SQL Server was the right fit for them, and this fits with what I hear from customers too. Microsoft is the leader in both Ability to Execute and Completeness of Vision in Gartner’s latest Magic Quadrant (as Joseph has just pointed out), and people are finding that SQL Server continues to be the sensible choice for even the largest of organisations.
Shawn Bice is up now – General Manager of the Database Systems Group. He’s showing the new features of SQL Server 2016, and pointing out that it’s all built-in. It’s not about add-ons – everything is part of the product. He talks about how it’s the leader in mission-critical workloads, the least vulnerable, the highest performing, the cheapest for BI, and now offers advanced analytics through the integration of R within the platform. R is the most popular language amongst data scientists, whether they’re solving problems in space, or marketing, or wherever. To make SQL Server the most significant data platform for data scientists, this R integration is critical.
HA and DR have been improved with better algorithms for data transfer. I see this as incredibly important. Every improvement in compression and parallelism is an improvement in moving data around, whether between servers within an on-prem system, a hybrid system, or pure Azure.
PolyBase comes into SQL Server 2016 to let people use T-SQL over Hadoop. I’m used to having PolyBase through APS, and the idea of being able to hook into Hadoop data stores from “regular SQL” provides numerous opportunities. This is going to lower the barrier for people who want to leverage Hadoop into their current environments. This is really exciting.
The columnstore improvements in SQL Server 2014 meant that columnstore data could be updated, but in SQL 2016 we get updateable non-clustered columnstore indexes. People who saw changes they would want to make to leverage columnstore, but decided against redesigning their tables, can now put an updateable columnstore index on a subset of the columns in a table, and leverage the technology much more easily. Again, lowering the barrier.
Rohan Kumar, a Partner Director in Engineering, is showing the impact of this, with a live dashboard, using a non-clustered columnstore index to explore data. He’s looking at how the R integration and the columnstore improvements provide a platform to discover anomalies in data in much quicker time than ever. I know that fraud analysis happens in close-to-real time within banks, but these changes make this kind of work available to many more organisations.
Sadly, Rohan then opens Profiler to show what’s happening behind the scenes with AlwaysEncrypted. But it’s Profiler, and everyone has been trying to move off Profiler for some years now. AlwaysEncrypted is impressive, but Profiler???
Stretched Databases make up the last main demo, and the keynote wraps up. The mood in the place is that these are exciting times.
With only a few days to go until people arrive in Seattle, I should probably explain what my sessions are going to be on. You know, in case you hadn’t thought to go to the PASS site and read for yourself (and for those who want to hear something that’s a little less ‘abstract’y). A few people told me last year they were disappointed I wasn’t presenting, so if you’re in that situation, maybe this post is of interest.
This year, I’m giving three different presentations – two regular Summit sessions that are on the schedule, and a 20-minute spot that I’m doing (twice!) at the Microsoft booth (Wednesday at 1:45pm and Thursday at 1:15pm).
The 20-minute spot is about techniques you can use to avoid data movement in MPP systems such as PDW or SQL DW. It focuses on Query Optimizer things that I do in regular SQL environments, leveraging things like join redundancy, contradiction optimisations, and tuning aggregations. It’s going to be quite fast-paced, as we have three significant things to explore, with heavy use of SQL query plans and showing how those techniques apply to MPP distributed plans. I’ve used these methods to make queries run WAY faster in both MPP and non-MPP environments, and people have said things like “I wish I’d known that before my last client” (someone at Microsoft), and “Oh, that’s cool – I’m so going to use that” (someone at a major US-based PDW partner). So yeah – come along. Both times!
The first main session I’m giving is one of my favourite sessions, and is called “A few of my favourite query plan operators”. This talk is going to focus on four Query Plan operators, but also discuss what’s going on in about six others. Or maybe seven – it depends how you count them, and how puzzled the looks on people’s faces go. We’ll explore what’s happening as your query runs, and why sometimes the counter-intuitive option could work out better. Lots of demos, as you’d expect from one of my presentations. Some live typing, and plenty of hand-waving as I describe why a particular plan shape is really what you’re after – even if the estimated cost might suggest otherwise.
The new session is one that I wasn’t expecting to give, so that’ll be fun. It’s on “The Power of Composite Indexes”, and already I’m kinda regretting it (no, not really) because Americans say that word differently to how I say it (for me the longest syllable is the “com”, while for Americans it’s the “pos”). But however you say it, the idea is to look at indexes with multiple keys, and at how powerful they can be. And how you can kill performance as well. We’ll be looking at issues like sargability v residuality, blocking plan operators (especially for people that came to my first session), partitions, and even helping you use T-SQL to fix one of the most frustrating features of the way that queries run.
So anyway – those are the sessions that I’m giving at the PASS Summit this year. Hopefully somewhere in all that content there’s something you haven’t heard before.
Over the past decade or so, Business Intelligence has become a big deal. As a data consultant, most of my work would be categorised as being in the BI space. People want to have insight into how their business is operating, and be able to use this to do things better. Data has become one of the biggest influencers in the world today – now that data is available, intuition is generally seen as ‘not good enough’, and people want empirical evidence for making decisions ...at least in my experience.
And that’s all well and good for people who run businesses. You’re just a DBA. Business Intelligence is something that you support, something that you provide, something which you do – for other people to consume.
My challenge to you is to become someone who consumes BI as well.
In fact, just about everyone within your organisation could do better with the stuff that you provide. And that ‘everyone’ includes YOU.
As a BI developer, you create reports. Do you track how frequently those reports are run? Do you track how long the reports take to produce? How often are they redefined? What KPIs do you put around that?
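If your reports run through Reporting Services, much of this is already logged for you. Here’s a rough sketch against the ExecutionLog3 view – assuming the default ReportServer catalog database name:

-- How often has each report run in the last month, and how long does it take?
select ItemPath,
    count(*) as Executions,
    avg(TimeDataRetrieval + TimeProcessing + TimeRendering) as AvgMilliseconds
from ReportServer.dbo.ExecutionLog3
where TimeStart >= dateadd(month, -1, getdate())
group by ItemPath
order by Executions desc;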
As a production DBA, you ensure that backups have been taken, and run tests to make sure every backup can be restored. Are you tracking how long those backups are taking? Are things taking longer? What are the metrics that suggest things are getting worse? Are you using predictive analytics to warn you a system might go down soon?
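The raw numbers for those questions are already sitting in msdb – a minimal sketch of a trend query:

-- Are the full backups taking longer over time?
select database_name,
    backup_start_date,
    datediff(second, backup_start_date, backup_finish_date) as DurationSeconds,
    backup_size / 1048576.0 as SizeMB
from msdb.dbo.backupset
where type = 'D' -- full database backups only
order by database_name, backup_start_date;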
As a helpdesk operator, how are you being measured? I’m guessing that your manager is analysing something about the satisfaction level of the people you help, or the number of tasks you get through in the week, and what kinds of tasks use your skills better... are you consuming that information too?
As a team leader... well, you get the picture.
Data is all around us. Not just in the Internet of Things, but in the metadata of the systems that we use.
If you are a data professional, you might be able to spend a bit of time exploring what’s possible using data that is important to you, like SQL Server Audit data, or Windows Event Logs, or report execution logs. If you can get something working in your development environment, get clearance to put it on an Azure instance or production SQL box – something which is looked after and which is properly licensed. But then, start having conversations about how this kind of approach could help just about everyone in the organisation. Big picture stuff is useful, but everyone has a big picture which is useful for them. Stepping back from the minutiae of the day and making intelligent decisions about tomorrow is not just for senior management, but should apply to self-management as well.
DBAs – get familiar with SQL Server’s auditing. Explore the posts that are coming out today in the T-SQL Tuesday event hosted this month by Sebastian Meine (@sqlity), and use this as a source for your own Business Intelligence system.
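If auditing is new to you, a basic server audit only takes a few statements to set up – the names and file path below are purely for illustration:

-- Create a server audit that writes to disk, and capture failed logins
CREATE SERVER AUDIT DemoAudit
TO FILE (FILEPATH = N'C:\Audits\');
ALTER SERVER AUDIT DemoAudit WITH (STATE = ON);

CREATE SERVER AUDIT SPECIFICATION DemoAuditSpec
FOR SERVER AUDIT DemoAudit
ADD (FAILED_LOGIN_GROUP)
WITH (STATE = ON);

-- Read the audit back with T-SQL – a data source for your own BI
SELECT event_time, action_id, server_principal_name
FROM sys.fn_get_audit_file(N'C:\Audits\*.sqlaudit', default, default);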
It’s now four years since I ran for election to the PASS Board, for a two-year term which ended two years ago.
The two other people elected when I was were Denise McInerney and Adam Jorgensen. Two years ago, Denise and Adam became the two Vice Presidents, and Adam is about to begin a stint as President. As for me, I didn’t run a second time. Two years ago, Jen Stirrup, Amy Lewis, and Tim Ford were elected as Rushabh Mehta, Douglas McDowell, and I left the board.
And so we come around to election time again, and Jen, Amy, and Tim are all about to finish their first two years on the board. Amy is not re-running, but Jen and Tim are, along with two new candidates – Argenis Fernandez and Ryan Adams.
I’m excited, because PASS can’t lose here. The current board members seem to be doing an excellent job (although I didn’t get to serve alongside either Jen or Tim, I know them both well and am sure they’re doing great), and I know that Argenis and Ryan would also serve the community well.
Argenis is a passionate guy who will stand for important things. I’ve had many conversations with him, and know him to be wise, caring, and dedicated to seeing good things happen. He raises money for charity, and is not afraid to put his reputation on the line for important things. He is one of my most excellent friends.
Ryan is also a great guy. He works hard for PASS, and has almost won the PASSion award for his dedication to the community. I haven’t had as many conversations with him as I have the other candidates, but I know that he will work tirelessly for the PASS community – that word ‘tireless’ seems to describe Ryan more than anyone else I can think of in the community, and I am a little jealous of his energy levels.
I can’t tell you who to vote for (except for Jen – as the only person running for the EMEA seat I expect everyone to vote for her), but do take the time to get to know the candidates and make your vote count – voting opens later this week.
This month’s T-SQL Tuesday is hosted by Jen McCown of @midnightdba fame. She wants us to write about strategies for managing the enterprise, and as a database consultant, I find myself giving advice on this kind of thing to customers remarkably often.
No, I’m not going to do stories about LobsterPot customers. We don’t do that. What happens between us and the customer stays between us and the customer. However, I can talk about the kinds of things that we look for when we talk to a customer.
The thing that I want to look at in this post is that twitch you get when something doesn’t feel right – the ‘spider-sense’ feeling that Peter Parker gets when there’s something that’s not quite right.
Experience is what helps people know what ‘normal’ looks like. I’ve heard stories that people who are trained to spot counterfeit bank notes don’t spend time with fake notes, they spend time with the real ones (on loan, presumably). They learn what ‘normal’ looks like, and get really familiar with it. That way, when something isn’t quite right, they can spot it and raise an alarm.
DBAs taking care of an environment, whether small or large, will have learned what ‘normal’ looks like. They should have benchmarks that tell them how various metrics perform, and for those things that they don’t have formal metrics on, they should still be familiar enough to recognise when something isn’t right, even if they can’t tell exactly what. Their ‘spider-sense’ should tingle.
If you don’t have benchmarks, then get them. Every time you find something that doesn’t seem right, you will wish you had a benchmark on that thing to be able to quantify your suspicion. Feeling like something isn’t right is great, but it won’t be long until someone asks “How do you know this thing needs attention?” and “I just know” probably won’t cut it. If you’re a consultant, you can probably get away with “In my experience...”, because that’s what they’re paying you for, but having supporting evidence – actual numbers – can help, particularly if you’re the main person responsible and are needing to persuade someone to find the money for a CapEx.
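Getting started can be as simple as snapshotting a few counters into a table on a schedule. A minimal sketch – the table is my own invention, and note that per-second counters store cumulative values, so you diff consecutive snapshots:

create table dbo.CounterBaseline (
    capture_time datetime not null,
    counter_name nvarchar(128) not null,
    cntr_value bigint not null
);

-- Schedule this in an Agent job and watch the trends
insert dbo.CounterBaseline (capture_time, counter_name, cntr_value)
select getdate(), rtrim(counter_name), cntr_value
from sys.dm_os_performance_counters
where counter_name in ('Page life expectancy', 'Batch Requests/sec');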
Having the numbers handy is useful for a lot of situations, but there are a whole bunch of tools also available to look at too. A while back I wrote about how DBAs could use the same kinds of tools that other industries hire data professionals to provide, in a post called “DBAs and the Internet of Things”. If you take this kind of approach and start analysing the readings from all kinds of things that affect your database, then you can get ahead of the game. Feed this stuff into something like Azure ML for predictive analytics, and you might be able to have an even-better spider-sense, where the prediction isn’t just based on your own opinions, but on what has caused failures in the past.
Too often, the significant thing is some small detail that most people wouldn’t notice, but because of your experience and expertise, you can spot it and work out whether it’s significant or not. Then if you don’t have that particular thing benchmarked, or analysed by other tools, you can include it to see what’s going on.
...and develop superhero powers for managing your enterprise. It’s something we regularly recommend to DBAs.
I just wrote a post about Live Query Statistics. Let me show you how this technology in SQL Server 2014 can be used for some amazingly cool (nerdy cool, at least) stuff.
Behind the scenes, LQS uses a DMV called sys.dm_exec_query_profiles. When you run a query with SET STATISTICS PROFILE ON, the engine puts data into this DMV for every operator in the plan that produces data (which doesn’t include the SELECT operator, or Compute Scalar, for example). What SSMS does while you’re watching a long-running query is poll this DMV over and over to get the progress data, so that you can see it in those moments between kicking off the query and its completion.
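You can see this for yourself: kick off a long-running query with SET STATISTICS PROFILE ON, and from another window take a look at the DMV – something like:

-- One snapshot of a running query's progress, operator by operator
select node_id, physical_operator_name, row_count, estimate_row_count
from sys.dm_exec_query_profiles
order by node_id;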
When you use LQS on a relatively quick query – say, one that completes in a single second – you can’t exactly watch this information come through. You might be able to use a debugger, and pause operation of your server for a moment while you step through it, but this is far from ideal. And yet a query that completes in a single second might actually need some work. What if this query needs to be able to run many times per second, and you’re looking for strategies to tune every last bit out of it?
Clearly LQS is going to be no use.
But the workings behind it... that’s another matter. Let me show you.
When entomologists want to study what happens in the flight of an insect, they take lots of photographs, often using a strobe light to capture a moment with as little blur as possible. It allows for incredibly detailed images, like the ones you can see in this article from the Daily Mail in the UK.
(Image only linked from source – please let me know if it’s not there, but also go and look at the amazing pictures that this guy takes)
I don’t know how many times this insect flapped its wings between each image that was captured, but I think you’ll agree that with enough images, it would be possible to learn a lot about the way that the wing-flapping takes place.
This is the same as what I’m doing with queries in SQL Server 2014.
Suppose I run a query over and over, with statistics profile turned on. The code here runs my sub-second query over and over for six minutes.
--set statistics profile on
-- (run with SET STATISTICS PROFILE ON in effect for the session)
declare @t datetime = dateadd(minute,6,getdate());
-- run the sub-second query over and over for six minutes
while (@t > getdate())
select p.Name, sum(d.OrderQty) as Qty
from Sales.SalesOrderDetail d
join Production.Product p
on p.ProductID = d.ProductID
group by p.Name;
I used an SSMS window for this, and told SSMS to discard the results. I only need to run it once to see the results – I just care about the profile stats. Incidentally, it returns 266 rows.
Now, I know that the plan that’s being used here is:
The Node_IDs of these operators are 1 (the Join), 2 (the Aggregate), 3 (the Scan on SalesOrderDetail), and 7 (the Scan on Product). These numbers have gaps just because of the inner workings of the Query Optimizer. And they go from left to right because that’s how a plan runs. The SELECT calls Node 1, which calls Nodes 2 and 7, and Node 2 calls Node 3.
So during five of the six minutes that my query was running over and over and over, I went into a different window and polled the DMV every half a second.
set nocount on;
declare @t datetime = dateadd(minute,5,getdate());
while (@t > getdate())
begin
    -- capture a snapshot into dbo.QueryStrobe (an illustrative table name – see below)
    insert dbo.QueryStrobe
    select getdate() as myquerytime, qp.*, 'Hash Match' as comment
    from sys.dm_exec_query_profiles qp;
    waitfor delay '0:00:00.5';
end
I had made a table which matched the DMV, with a couple of extra columns thrown in: one to record when I captured the moment (so that I could differentiate between moments), and one (called ‘comment’) to record which query I was monitoring. The DMV contains the sql_handle and plan_handle, so I could’ve differentiated between them later, but I wanted to be able to differentiate between them more easily than that.
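Creating such a table is easy if you let SELECT...INTO derive the DMV’s columns – dbo.QueryStrobe is just the illustrative name I’m using in these examples:

-- An empty capture table matching the DMV, plus the two extra columns
select top 0
    getdate() as myquerytime,
    qp.*,
    cast('' as varchar(100)) as comment
into dbo.QueryStrobe
from sys.dm_exec_query_profiles qp;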
This data gave me 597 different values for ‘myquerytime’, 597 different moments captured. I don’t know how many different times my query ran in that time – probably far more than 597, although I wouldn’t have really cared if it were less. These 597 moments each had up to 4 rows, showing how each operator was going in its cycle.
There are columns in the DMV for all kinds of attributes, but the one I was most interested in was row_count, which tells me how many rows the operator has produced. There are columns about CPU too, but as I’m looking at a query which runs in a small number of milliseconds, I’m more interested in the number of rows produced.
Here’s some of the data that’s in my table.
You can see that when I polled the DMV at 19:28:20.303, Node 3 (the Clustered Index Scan on SalesOrderDetail) had pushed 25241 rows, and no rows had been outputted by the other three. Half a second later, the snapshot showed 91044 rows from that node. Another half a second, and it was 54279, and in the poll at 19:28:21.807, the scan had served up all its rows, and there was output from the other three operators.
Each of these is like a strobe image, capturing a moment in the life of the query.
To recreate how the query runs, we need to piece them together. To do this, I’m going to assume that the outputted rows are done in the same order each time (which is a reasonable assumption when I’m running the query over and over in quick succession, with no changing parameters or conditions). By summing the row_count across all the operators in each poll, I can order the polls. A quick bit of pivotting...
select row_number() over (order by OverallRowCount) as rownum, *
from (
    select myquerytime,
        count(*) + sum(row_count) as OverallRowCount,
        max(case when node_id = 1 then row_count end) as HashMatchJoin,
        max(case when node_id = 2 then row_count end) as HashMatchAgg,
        max(case when node_id = 3 then row_count end) as CIXScan,
        max(case when node_id = 7 then row_count end) as IXScan
    from dbo.QueryStrobe
    where comment = 'Hash Match'
    group by myquerytime
) s
order by rownum;
...and I can see the query start:
..and the moment when the Clustered Index Scan stops outputting more rows:
...and the end of the query.
Notice that almost all the time we polled the DMV to see how the query was going, data was still being pulled out of SalesOrderDetail. It was only 96.1% (574/597) of the way into the query that data started to be outputted from the blocking Hash Aggregate. And because a Hash Match Join blocks until the hash table has been completed, we have to wait even longer before we eventually start pulling data from the Product table, when we pull up to 504 rows to find the 266 rows of interest.
I’m sure you can get a feel for how this query is running from this information.
But let’s compare this to what happens if I force a Nested Loop join instead of the Hash Match Join. This plan:
I repeated the collection of data, running the same query over and over but with OPTION (LOOP JOIN), and a different comment in the polled data. I also decided to use dynamic SQL to query my strobe table to save rewriting the pivot for each plan.
declare @comment varchar(100) = 'Forced Loop Join';
declare @qry nvarchar(max) =
'select row_number() over (order by OverallRowCount) as rownum, *
from (
    select myquerytime,
        count(*) + sum(row_count) as OverallRowCount'
    + (select ', max(case when node_id = ' + cast(node_id as varchar(10))
             + ' then row_count end) as ['
             + cast(node_id as varchar(10)) + ': ' + physical_operator_name + ']'
       from dbo.QueryStrobe
       where comment = @comment
       group by node_id, physical_operator_name
       order by node_id
       for xml path(''), type).value('.','nvarchar(max)')
    + '
    from dbo.QueryStrobe
    where comment = @comment
    group by myquerytime
) s
order by rownum;';
exec sp_executesql @qry, N'@comment varchar(100)', @comment = @comment;
It started very similarly, but was very different towards the end.
The Scan seemed to go for a longer portion of the plan – 98.5% (587/596), but as the Hash Match Aggregate started producing rows, the Nested Loop was pulling the row from the Seek and returning it to the SELECT operator before pulling the next row in. You can see the row_count going up equally across the three operators, which is very different to what we saw with the Hash Match Join.
A factor I hadn’t considered before became evident at the end. In the Hash Match Join example, we saw a bunch of moments when all the operators had finished producing their rows, which aren’t there in the Nested Loop example. The result of the query is no different, but the tear-down time is much quicker with the Nested Loop – presumably because the Hash Table used for the join doesn’t need to be dropped. I didn’t expect this to be as significant as it seems to be, but we certainly managed to catch six images – about 1% of them – when the operators had all stopped returning data, but the query was still hanging around as far as the DMV was concerned. With everything else being identical, I can only assume it’s down to the Hash Table having more to do to shut down than the Nested Loop.
Just for fun, I tested the pivot query itself, giving 582 strobe images.
The first thing to notice is that the Compute Scalar operators didn’t report anything, as expected.
The next is that the right-most Sort was blocking, and had a significant pause after the Scan finished – about 35 polls, or 6% of the query time.
The Stream Aggregate doesn’t block, but the left-most Sort, which has called the Stream Aggregate (via two Compute Scalars) does, following which there’s another pause (but smaller – fewer rows), after which the Segment and Sequence Project operators don’t block.
At the end of the query we have about 80 polls – well over 10% of the query time – after the Sequence Project has outputted its last row.
Now – there is more that I will learn about this still, and I have been making some assumptions about whether the DMV polling gives a sufficiently random moment. But from the things I’ve seen, there is definitely information here about queries that I haven’t seen before, and which requires some further research.
Finally, I spent a bit of time looking at visualisation options for this. I immediately thought of the Power View play axis that’s on scatter charts, but sadly I didn’t have much luck coming up with an effective visualisation quickly. I had felt like Hans Rosling with his scatter chart about birth rates, and used LAG and MAX() OVER() to come up with a speed of row production compared to the total, but I kept getting the “representative sample” message, which didn’t help. I’m sure it won’t be long before this visualisation is easy to produce, but for the purposes of analysis, I was more interested in exploring the data than making it look pretty.
Strobe photography is very impressive. I just never thought it would apply to T-SQL queries.
One of the coolest things in SQL Server 2016 is Live Query Statistics (LQS). But did you know that it’s available in SQL 2014? (Edit: Needs SP1)
The thing is that we haven’t had a good way to view it until now, with the arrival of SQL Server Management Studio 2016.
LQS provides the ability to watch an execution plan while the query is still running.
In the image above, you can see that I have a fairly long-running query, and I’ve taken a screen shot while the query is running. I can tell it’s still running because of the dotted lines. At this point, the Hash Match has asked the Index Scan for all the rows it’s going to, and the Index Scan has been marked as 100% done. The Hash Match is now asking the Merge Join for rows. The Merge Join is pulling data out of the Sort, which has already pulled in the rows it needed, and so on.
It’s cool stuff.
And significantly, this is against SQL Server 2014 (SP1). I had to use SSMS 2016, but then the button appeared at the top to include Live Query Statistics...
...and then I could see them. When I connected to an earlier version, such as SQL Server 2012, the box was disabled and I couldn’t see them.
So why not install SSMS 2016 (it’s only the client tool – I’m not suggesting you put it on a server), and see how some of your queries look?
You shouldn’t do this against a production machine, because it takes some effort for SQL Server to produce the data used for this. But see how you go. It’s definitely a useful performance-tuning feature which is available right now.
Oh, ok... I’ll explain some more.
Transparent Data Encryption encrypts data at rest. That’s the stuff that’s on disk – the encryption happens when the data is written to disk, and the decryption happens as the data is loaded into RAM from the disk. The engine handles this so that it’s invisible to the user, applications, and so on. Without it, you can open an MDF/NDF file in a hex editor and read the contents. With it, you can’t.
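For reference, turning TDE on is only a short sequence of commands – the names and password here are placeholders:

USE master;
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<a strong password>';
CREATE CERTIFICATE TDECert WITH SUBJECT = 'TDE demo certificate';

USE DemoDB;
CREATE DATABASE ENCRYPTION KEY
WITH ALGORITHM = AES_256
ENCRYPTION BY SERVER CERTIFICATE TDECert;
ALTER DATABASE DemoDB SET ENCRYPTION ON;
-- Back up the certificate and private key, or you won't be able to restore elsewhere!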
Here’s an example with a database that’s not encrypted:
And here’s an example that is:
I searched for some of the text that I could see – successfully in the first, unsuccessfully in the second.
I also used SQL Server to show me the contents of a page using DBCC PAGE, and could do this successfully (once I’d closed the files in the hex editor and brought the databases back online).
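The command looks something like this – trace flag 3604 sends the output to the client, and page 1 of file 1 is an arbitrary choice:

DBCC TRACEON(3604);
DBCC PAGE('DemoDB', 1, 1, 3); -- database, file id, page id, print option 3 (full detail)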
...which also worked in both databases.
I had hoped this would work, because I figured that DBCC PAGE would have to pull the data into RAM again (remember this system was offline – the pages weren’t in RAM before), and that it would decrypt the data as it did so. But I wondered if DBCC PAGE might be slightly lower-level, and bypass the decryption somehow. I argued with myself that if TDE was indeed Transparent, it shouldn’t care... and what if an application relied on using DBCC PAGE? It’s a known feature, even if it is officially undocumented (which is where my doubts set in).
But as you see, it worked okay.
But what if I dropped the table first? Would SQL then go “Hang on – this page isn’t one that I have control over any more...” and refuse to decrypt it?
No – it works just the same.
Even if you drop an object, you can still access the pages that it used until they get overwritten. You won’t be able to read them with a hex editor, but DBCC PAGE will still read them in decrypted form, letting you pull that data out.
And yes, you can even use DBCC WRITEPAGE to overwrite the bytes in their unencrypted form, so that you can use (dangerous) method of fixing corruption, even in an encrypted database. I’ve just redone my fix for Steve Stedman’s 10th Corruption Challenge, and it worked just fine on an encrypted version of the database.
It’s still T-SQL Tuesday on the topic of encryption, so I’m throwing this one into the mix for that as well.
Transparent Data Encryption has been around for some time now, making sure that data in SQL Server as stored on disk is encrypted. When it was announced, this was incredibly exciting.
You see, by default, SQL Server data is not encrypted. If you open up the pages within a data file, you can read the data that’s in there. Numbers are stored as hex, varchar strings are stored as readable values – it can be quite surprising to realise this. The first time you ever salvage an MDF file from a broken server, open up that file, and just start reading the data, you realise that the data in most databases is only as secure as the physical security of the disks – you can read data from that table whether or not your Windows login has access to it.
With Transparent Data Encryption, that MDF file is encrypted. It’s only decrypted when it’s pulled into RAM, where access to it is controlled by database permissions. If your access to a particular table has been denied, you’re not getting to read that data out of RAM. When the power goes out, the decrypted data in RAM disappears – that’s what RAM’s like. It’s a good thing. But TDE is across the whole database, whether you like it or not. And everyone who has access to the table can read it unencrypted, as if it’s never been encrypted.
It’s Transparent. It doesn’t feel like it’s encrypted, and as far as any application can see, it’s not. This only protects against the disk (or backup) being compromised.
And then there’s the distrust of DBAs.
As someone who deals with sensitive data regularly, I can assure you that this is a very real concern. I’ve had police checks, to help make sure that I’m trustworthy – and I make a point of never looking myself up in customer databases (I know I must exist in some of my customers’ databases – I receive bills from some of them even). I also have confidence that my employees at LobsterPot Solutions are all completely trustworthy, and that I’d throw the book at them if they ever broke that trust.
I’ve certainly been asked “How do I stop the DBA from being able to read the data?”
And that’s where the problem is. The DBA is the guardian of your data. They’re the person who makes sure that the database doesn’t have corruption, who can defrag indexes, tune queries, and so on. The DBA should be trustworthy with the data. But what if that data contains the lottery numbers? What if that data involves national secrets? At what point is the sensitivity of the data way too great to be able to let Archbishop Desmond Tutu look after it, let alone Barry the DBA?
Now, I understand that a DBA might feel insulted that they’re not trusted to read the data. They need to look after it, but they can’t look at it. My take is that if I can be removed from suspicion if there’s a leak, then great. I know I’m okay. The client should know I’m okay. But would a criminal investigation be able to remove me from suspicion? Not if I’m one of the only people that could read the data and know how to cover my tracks.
Transparent Data Encryption doesn’t stop me from being able to read the data. Not at all. If I have access to query the table, it looks unencrypted to me.
Always Encrypted, in SQL Server 2016, does stop me though, if I don’t have access to the certificate which protects the column master key. Always Encrypted stores encrypted data in the database file. It’s encrypted in RAM and while it’s being sent across the network, and is only decrypted by the application that has access to the certificate. Now, I know that as an administrator, I might be able to jump through a few hoops to be able to get access to it, but this could be handled by other people, who could deny my access to it.
This decryption is still transparent to the applications that use it – so the applications don’t need to change, apart from telling the connection string to look out for it. The .Net 4.6 environment will handle the decryption without developers needing to code it specifically. Access to those applications can be controlled in other ways.
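To give a flavour of the syntax (as per the SQL Server 2016 previews – the table and the key name CEK_Auto1 here are placeholders), a column is declared as encrypted in its definition, and the application opts in via its connection string:

-- Deterministic encryption requires a BIN2 collation on the column
CREATE TABLE dbo.Patients (
    PatientID int IDENTITY PRIMARY KEY,
    SSN char(11) COLLATE Latin1_General_BIN2
        ENCRYPTED WITH (
            COLUMN_ENCRYPTION_KEY = CEK_Auto1,
            ENCRYPTION_TYPE = DETERMINISTIC,
            ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
        ) NOT NULL
);
-- ...and in the application's connection string:
--   Column Encryption Setting=Enabled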
If you’re needing to search by values that are encrypted, it might not be ideal for you. But for that data that you need to make sure even you can’t read, this could be a really nice option. Roll on 2016!
PS: Another option is to build encryption functions into client applications, so that the applications encrypt everything, and handle everything well away from the database. We’ve been able to do this since applications first began, but when there are multiple applications, the coordination of keys can become problematic. What you get with SQL Server 2016 is centralised control over the encryption and decryption.
PPS: This post was brought to you for the 69th T-SQL Tuesday, hosted by Ken Wilson (@_KenWilson)
No, not in my systems.
Corruption does happen from time to time. At LobsterPot Solutions we get calls now and then from people who have corruption in their databases and want help getting them repaired. Generally, it’s not too much of a problem, although I’ve seen some doozies over time. (And if you need help, get in touch!)
Interestingly, I don’t think many DBAs practise solving corruption issues very often. They might test restores, and even test getting data back from a salvaged LDF file, but rarely would they test any of the stranger corruption scenarios. I’ve never put it on a high priority – I know lots of ways to get data out of tables, and know how to get data out of pages using DBCC PAGE... but I wouldn’t say that I practise solving corruption very often. There are so many different ways that a database can become corrupted, and 99% of the time, the data can be salvaged using some very standard approaches.
So it was good to see Steve Stedman (@SQLEmt) run a series of Corruption Challenges – ten corrupt databases, all of which could be repaired without data loss, but never using the standard DBCC CHECKDB repair option. I noticed the first one part way into the allotted time, and thought it would be a useful exercise to see if I could solve them all. Let’s face it – when someone gets in touch because they have a problem, they want to be reassured that they’re dealing with someone who knows what they’re doing.
All ten challenges have now appeared, and not only have I solved every one of them – a feat only shared by Andre Kamman of the Netherlands and Neil Abrahams of the UK – but I’ve also won three of the events, more than anyone else, allowing me to finish top of the scoreboard!
Winning the competition was never my intention – I’m just pleased to have been able to know that there were no challenges that beat me, which hopefully means that my customers are still in capable hands.
My challenge to you is to grab the ten databases that Steve has made available, and without looking at the solutions that are all posted, see how you go. At the very least it will be good practice for when you have to deal with the real thing.
Next, next, next, next, next... you know the drill.
Except that when installing SQL, it’s simply not good enough. The defaults might not work for you, and that makes this post qualify for this month’s T-SQL Tuesday, hosted by Andy Yun (@sqlbek).
Most things are fine, but there is one page which you really shouldn’t ignore. And it’s not even obvious. But if you’re just clicking Next, next, next, then you might easily miss it.
When you reach this page:
please make sure you don’t just hit Next without looking at the Collation tab. It looks like this:
Now, I don’t particularly care what you decide to put here. But you shouldn’t just set it blindly. If you are installing a new server and you need it to be consistent with what you’ve used before, go and look at the collation setting on your old instance. This is ridiculously important.
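Checking only takes a moment – a quick sketch using the built-in property functions (swap in your own database name):

SELECT SERVERPROPERTY('Collation') AS InstanceCollation,
       DATABASEPROPERTYEX('YourDatabase', 'Collation') AS DatabaseCollation;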
You see, people are lazy. And when I say people, I mean developers. And when I say developers, I mean bad developers. (There was a thing recently that said that “Women’s Soccer” should now be called simply “Soccer”. I agree with this. There shouldn’t have to be a differentiation between a game played by a particular type of people, except perhaps “American Football”, which is obviously called “football” because they use their feet so much. Oh right, about developers. I hope as time passes, developers become better. But for now, when I say “Developers”, I mean bad developers.)
So when a developer creates a temporary table, they do something like:
CREATE TABLE #resultset (col1 varchar(20) NOT NULL, col2 int NOT NULL);
And then they populate it with some data, and then they use it in another query. Like this:
SELECT t.*, r.col2
FROM dbo.SomeTable t
JOIN #resultset r ON t.code = r.col1;
This is code that has worked for years. But if you have ignored the collation setting and the instance collation is different to the database collation, you will get an error. The system won’t know whether two strings are supposed to be the same or not. Is ‘abc’ the same as ‘àbç’, if one tells you to ignore accents and the other says not to? The system can’t decide. It’s even more complex than that, because two strings might be identical, but it won’t know how to look them up if the alphabet orders letters differently. It can’t decide and gives an error. It goes in the too-hard basket.
Of course, a nice responsible developer will have created the temporary table like this, and then the problem never occurs:
CREATE TABLE #resultset (col1 varchar(20) COLLATE DATABASE_DEFAULT NOT NULL, col2 int NOT NULL);
But let’s face it – this is rare. Most people write database code without thinking about the collation settings, and that’s a problem. Most of us are bad developers.
The error you get is the one described here: https://connect.microsoft.com/SQLServer/feedback/details/324910/collation-error-behaviour-option. This is a Connect item which I raised in January 2008 (7.5 years ago), which has 47 upvotes, and which was closed in 2011 as “Won’t Fix”. It was looking likely for a while (at least when using tempdb), but then they must’ve realised it wasn’t particularly important. Ultimately, it could give unexpected results if you’re not confident about which language you’re using, and it’s potentially better to give an explicit error than to let your query work but give the wrong results.
So, revisiting the earlier temp table example:
SELECT t.*
FROM dbo.SomeTable t
JOIN #resultset r ON t.code = r.col1
WHERE t.Name = 'football';
Msg 468, Level 16, State 9, Line 1
Cannot resolve the collation conflict... ;)
Ever since data moved to the cloud, database administrators have wondered what it will mean for their jobs.
It doesn’t matter whether you are thinking about Infrastructure as a Service, Platform as a Service, or Database as a Service – there is some aspect of what a database administrator does that may not exist in the future.
And then there’s the situation around Big Data, and the Internet of Things. We see videos of how organisations are using devices that produce data to help control things as large as the London Underground, and as personal as your fitness routine. It doesn’t matter whether the Thing is recording your heart beat or the vibrations of a tube train – the fact is that data is being produced to help our lives.
For a database administrator, this presents a new set of concerns. Does the DBA of today need to be able to design and tune a system that leverages Windows Azure SQL Database? What about Parallel Data Warehouse? Hadoop? Apache Storm? StreamInsight? There are so many things that make it seem to the DBA that their skill set must adapt for them to survive.
But what I want to suggest is that the DBA should not be seeing these technologies as their domain to administer and tune, but their domain to consume and leverage.
Currently, database administrators in the SQL Server world use Performance Monitor counters to keep an eye on how disk, CPU, I/O, RAM, cache, etc., etc., are going. This data is polled regularly and stored, while SQL Server Agent alerts are configured to holler if there’s something noteworthy, such as a Transaction Log filling up. Extended Events allow all kinds of events to be monitored, with data about them examined, stored if necessary, reacted to, and more.
Once, we used SQL Server Trace, and a tool called Profiler. Profiler was frowned upon somewhat – its voracity for consuming data meant that running a trace live through it would put an unnecessary load on the SQL instance being monitored. The better option was to configure a trace, and then use Profiler after the fact to see what had happened. Extended Events (XE) is seen in a different kind of light. An XE session runs, keeping an eye on things in a very decoupled way, and the stream of events that is produced can be hooked into using a number of different methods.
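Setting up one of these event ‘devices’ takes only a few lines of DDL. Here’s a sketch of a session that records slow statements to a file – the session name and one-second threshold are my own choices:

-- Capture statements taking longer than one second, as an event stream
CREATE EVENT SESSION SlowStatements ON SERVER
ADD EVENT sqlserver.sql_statement_completed
    (ACTION (sqlserver.sql_text, sqlserver.database_id)
     WHERE duration > 1000000) -- duration is in microseconds
ADD TARGET package0.event_file (SET filename = N'SlowStatements');
ALTER EVENT SESSION SlowStatements ON SERVER STATE = START;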
From the perspective of the DBA, these XE sessions become an “Internet of Things”-style environment. Imagine that each configured event is a smart device, sitting somewhere, producing data in some semi-structured form. Of course, we know the structure, but each event might have a different configuration, with different kinds of properties being available. The DBA consumes them using the same mechanisms as IoT technologies, and suddenly becomes a new kind of DBA-superhero. It’s not unheard of today, to hear of people looking after large numbers of databases.
The Internet of Things is here for database administrators. Extended Events are part of it.
PS: This post about Extended Events is part of the latest T-SQL Tuesday event, hosted by my friend Jes Borland (@grrl_geek).