THE SQL Server Blog Spot on the Web


Rob Farley

- Owner/Principal with LobsterPot Solutions (an MS Gold Partner consulting firm), Microsoft Certified Master, Microsoft MVP (SQL Server), APS/PDW trainer and leader of the SQL User Group in Adelaide, Australia. Rob is a former director of PASS, and provides consulting and training courses around the world in SQL Server and BI topics.

  • Passwords – a secret you have no right to share

    I feel like this topic just keeps going around and around. Every time I’m in a room where someone needs to log into a computer that’s not theirs, there seems to be a thing of “Oh, I know their password…”, which makes me cringe.

    I’ve written about this before, and even for a previous T-SQL Tuesday, about two years ago, but there’s something that I want to stress, which is potentially a different slant on the problem.

    A password is not just YOUR secret. It’s also a secret belonging to the bank / website / program that the password is for.

    Let me transport you in your mind, back to primary school. You had a club. You had a password that meant that you knew who was in the club and who wasn’t (something I’ve seen in movies – I don’t remember actually being in one). At some point you had a single password that was used by everyone, but then you found that other people knew the password and could gain entry, because it only took one untrusted person for the password to get out.

    You felt upset because that password wasn’t theirs to share. It was the property of you, the club owner. Someone got access to your club when you hadn’t actually granted them access.

    Now suppose I’m an online retailer (I’m not, but there are systems that I administer). You’ve got a password to use my site, and I do all the right things to protect that password – one-way hashing before it even reaches the database, never being able to see it let alone email it, and a ton of different mechanisms that make sure that your stuff is safe. You’ve decided to use a password which you’ve generated as a ‘strong password’, and that’s great. Maybe you can remember it, which doesn’t necessarily make it insecure. I don’t even care if you’ve written it down somewhere, so long as you’re treating it as a secret.

    Because please understand, it’s MY secret too.

    If the password you use gets out, because maybe someone gets into your LastPass account, or maybe someone steals the PostIt you’ve written it on, or maybe you use that same password at a different site which then gets hacked…

    …then that other person has access to MY site as you.

    If that other person buys stuff from me as you, I might need to refund you for the money / credit / points you didn’t mean to spend. And if I’ve already sent the goods out, then that’s going to hurt me.

    If that other person does malicious things on my site because they’re accessing it as a privileged user, then that’s going to hurt me.

    Someone knowing the secret that I’ve worked hard to keep secret… that’s going to hurt me.

    I have no control over the password that you choose to use. But please understand that it’s not just YOUR password. Use something that is a secret between you and me. I will never know your password, but I want you to make sure that no one else ever does either. Don’t reuse passwords.


    Big thanks to Andy Mallon (@amtwo) for hosting this month’s T-SQL Tuesday.


  • You’ve been doing cloud for years...

    This month’s T-SQL Tuesday is hosted by Jeffrey Verheul (@devjef) and is on the topic of Cloud.

    I seem to spend quite a bit of my time these days helping people realise the benefit of the Azure platform, whether it be Machine Learning for doing some predictions around various things (best course of action, or expected value, for example), or keeping a replicated copy of data somewhere outside the organisation’s network, or even a full-blown Internet of Things piece with Stream Analytics pulling messages off a Service Bus Event Hub. But primarily, the thing that I have to combat most of all is this:

    Do I really want that stuff to be ‘out there’?

    People aren’t used to having their data, their company information, their processing, going on somewhere outside the building where they physically are.

    Now, there are plenty of times when organisations’ server rooms aren’t actually providing as much benefit as they expect. Conversations with people quickly help point out that their web site isn’t hosted locally (I remember in the late ‘90s a company I was at making the decision to start hosting their web site at an actual hosting provider rather than having every web request come in through the same modem as all their personal web browsing). Email servers are often the next to go. But for anyone working at home, the server room may as well be ‘the cloud’ anyway, because their data is going off to some ‘unknown’ place, with a decent amount of cabling between where they are and where their data is hosted.

    Everyone’s photos are stored in ‘the cloud’ already, whether it be in Instagram’s repository or in something which is more obviously ‘the cloud’. Messages with people no longer just live on people’s phones, but on the servers of Facebook and Twitter. Their worries and concerns are no longer just between them and their psychiatrist, but stored in Google’s search engine web logs.

    The ‘cloud’ is part of today’s world. You’re further into it than you may appreciate. So don’t be afraid, but try it out. Play with Azure ML, or with other areas of Cortana Intelligence. Put some things together to help yourself in your day-to-day activity. You could be pleasantly surprised about what you can do.


  • The Impact of Compression Delay in Real-time Operational Analytics

    I have a session coming up at both the PASS Summit in October and the 24HOP Summit Preview event in September, on Operational Analytics. Actually, my session is covering the benefits of combining both In-Memory and R into the Operational Analytics story, to be able to see even greater benefits…

    …but I thought I’d do some extra reading on Real-Time Operational Analytics, which also suits this month’s T-SQL Tuesday topic, hosted by Jason Brimhall (@sqlrnnr). He’s challenged us all to sharpen our skills in some area, and write about the experience.

    Now, you can’t look at Real-Time Operational Analytics without exploring Sunil Agarwal (@S_u_n_e_e_l)’s excellent blog series. He covers a few things, but the one that I wanted to write about here is Compression Delay.

    I’ve played with Compression Delay a little, but I probably haven’t appreciated the significance of it all that much. Sure, I get how it works, but I have always figured that the benefits associated with Compression Delay would be mostly realised by having Columnstore in general. So I was curious to read Sunil’s post where he looks at the performance numbers associated with Compression Delay. You can read this yourself if you like – it’s here – but I’m going to summarise it, and add some thoughts of my own.

    The thing with Operational Analytics is that the analytical data, reporting data, warehouse-style data, is essentially the same data as the transactional data. Now, it doesn’t look quite the same, because it’s not been turned into a star-schema, or have slowly changing dimension considerations, but for the purposes of seeing what’s going on, it’s data that’s capable of handling aggregations over large amounts of data. It’s columnstore.

    Now, columnstore data isn’t particularly suited to transactional data. Finding an individual row within columnstore data can be tricky, and it’s much more suited to rowstore. So when data is being manipulated quite a lot, it’s not necessarily that good to be using columnstore. Rowstore is simply better for this.

    But with SQL 2016, we get updateable non-clustered columnstore indexes. Data which is a copy of the underlying table (non-clustered data is a copy – clustered data or heap data is the underlying table). This alone presents a useful opportunity, as we can be maintaining a columnstore copy of the data for analytics, while handling individual row updates in the rowstore.
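
    For context, creating one of these is just an ordinary index statement. Here’s a minimal sketch – the table dbo.SalesOrders and its columns are made up for illustration:

    -- Sketch only: an updateable nonclustered columnstore copy of an OLTP table (SQL 2016+)
    CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_SalesOrders
    ON dbo.SalesOrders (OrderID, OrderDate, CustomerID, Quantity, Price);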

    Except that it’s a bit more complicated than that. Because every change to the underlying rowstore is going to need the same change made in columnstore. We’re not actually benefiting much.

    Enter the filtered index. With a flag to indicate that frequent changes for that row have finished, we can choose to have the columnstore copy of the data only on those rows which are now relatively static. Excellent. Plus, the Query Optimizer does some clever things to help with queries in this situation.
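
    The shape of that is just a filtered version of the same kind of index – again a sketch, assuming a hypothetical OrderStatus flag on the made-up table from before:

    -- Sketch: only rows flagged as 'Closed' get copied into the columnstore index
    CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_SalesOrders_Closed
    ON dbo.SalesOrders (OrderID, OrderDate, CustomerID, Quantity, Price)
    WHERE OrderStatus = 'Closed';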

    But many systems don’t have a flag like that. What then?

    Well, one nice option is to consider using Compression Delay.

    Compression Delay tells our columnstore index to delay compressing the data for some period of time. That is, to not turn it into proper columnstore data for a while. Remember I said that columnstore data doesn’t enjoy being updated much – this is to prevent that pain, by leaving it as rowstore data for a while.
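
    Syntax-wise it’s just an index option. A sketch against the same hypothetical table, delaying compression by two hours:

    -- Sketch: newly-modified rows stay in the rowstore delta store for at least 120 minutes
    CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_SalesOrders_Delayed
    ON dbo.SalesOrders (OrderID, OrderDate, CustomerID, Quantity, Price)
    WITH (COMPRESSION_DELAY = 120);

    -- or adjust it on an existing columnstore index
    ALTER INDEX ncci_SalesOrders_Delayed ON dbo.SalesOrders
    SET (COMPRESSION_DELAY = 120 MINUTES);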

    I haven’t really explored this much myself yet. I have a few simulations to run, to see what kind of performance gains can be had from this. But Sunil’s experiments saw a 15% improvement on the OLTP workload by choosing an appropriate Compression Delay, and that sounds pretty good to me.

    I feel like there’s so much more to be explored with these new technologies. Having that flag to indicate when a row can be pulled into a filtered columnstore index seems really useful. Compression Delay seems great too, and in many ways feels like a nicer solution than ending up with a filtered index that might not catch everything. Compression Delay to me feels like having a filtered columnstore index that uses getdate() (which I think would be a lovely feature), although it’s not quite the same.

    So I’m going to keep playing with this, and will be teaching you plenty of information at both the upcoming events. I could present a lot of it now, but I would prefer to deepen my understanding more before I have to stand in front of you all. For me, the best preparation for presentations is to try to know every tiny detail about the technology – but that’s a path I’m still on, as I continue to sharpen.


  • Finally, SSMS will talk to Azure SQL DW

    Don’t get me started on how I keep seeing people jump into Azure SQL DW without thinking about the parallel paradigm. SQL DW is to PDW, the way that Azure SQL DB is to SQL Server. If you were happy using SQL Server for your data warehouse, then SQL DB may be just fine. Certainly you should get your head around the MPP (Massively Parallel Processing) concepts before you try implementing something in SQL DW. Otherwise you’re simply not giving it a fair chance, and may find that MPP is a hindrance rather than a help. Mind you, if you have worked out that MPP is for you, then SQL DW is definitely a brilliant piece of software.

    One of the biggest frustrations that people find with SQL DW is that you need (or rather, needed) to use SSDT to connect to it. You couldn’t use SSMS. And let’s face it – while the ‘recommended’ approach may be to use SSDT for all database development, most people I come across tend to use SSMS.

    But now with the July 2016 update of SSMS, you can finally connect to SQL DW using SQL Server Management Studio. Hurrah!

    …except that it’s perhaps not quite that easy. There are a few gotchas to be conscious of, plus a couple of things that caused me more frustration than I’d’ve liked.

    First I want to point out that at the time of writing, SSMS is still not a supported tool against PDW. You’ve always been able to connect to it to write queries, so long as you can ignore some errors that pop up about NoCount not being supported, but Object Explorer simply doesn’t work, and without Object Explorer, the overall experience has felt somewhat pained.

    Now, when you provision SQL DW through the Azure portal, you get an interface in the portal that includes options for pausing, or changing the scale, as per this image:


    And you may notice that there’s an option to “Open in Visual something” there. Following this link gives you a button that will open SSDT, and connect it to SQL DW. And this works! I certainly had a lot more luck doing this than simply opening SSDT and putting in some connection details. Let me explain…

    In that image, notice the “Show database connection strings” link. That’s where you can see a variety of connection strings, and from there, you can extract the information you’ll need to make a connection in either SSDT or SSMS. You know, in case you don’t want to just hit the button to “Open in Visual something”.


    When I first used these settings to connect using SSDT (rather than using the “Open in…” button), it didn’t really work for me. I found that when I used the “New Query” button, it would give me a “SQLQuery1.sql” window, rather than a “PDW SQLQuery1.dsql” window, and this wasn’t right. Furthermore, if I right-clicked a table and chose the “View Code” option, I would get an error. I also noticed that when I connected using the “Open in…” button, it would tell me I was connected to version 10.0.8408.8, but when I tried putting the details in myself, it would say version “12.0.2000”. I’ve since found out that this was my own doing, because I hadn’t specified the database to connect to. And this information turned out to be useful for using SSMS too.

    There is no “Open in SSMS” button in Azure. But you can connect using the standard Connect to Database Engine part of SSMS.


    And it works! Previous versions would complain about NOCOUNT, and Object Explorer would have a bit of a fit. There’s none of that now – terrific.

    And you get to see everything in the Object Explorer too, complete with an icon for the MPP database. But the version says 12.0.2000.8 if you connect like this.


    To solve this, you need to use the “Options >>>” button in that Connect to Server dialog, and specify the database. Then you’ll make the right connection, but you’ll lose the “Security” folder in Object Explorer.



    Now, it’s not perfect yet.

    When I look at Table Properties, for example, I can see that my table is distributed on a Hash, but it doesn’t tell me which column it is. It also tells me that the server I’m connected to is my own machine, rather than the SQL Azure instance.


    I can see what the distribution column is within the Object Explorer, because it’s displayed with a different icon, but still, I would’ve liked to have seen it in the Properties window as well. It’s not going to get confused by having a golden or silver key there, as it might in a non-parallel environment, because those things aren’t supported. If they do become supported, I hope they manage to come up with another way of highlighting the distributed column.


    One rather large frustration is the very promising link on the database to “Open in Management Portal”,


    , which opens a browser within SSMS (not exactly my preferred browser, but it seems like a good use for that feature). I’m okay with this, but following the link to the Query Performance Insight page, I’m immediately disappointed:


    I get that SSMS doesn’t host the most ideal browser for this kind of thing, and that I’m probably going to be running a separate browser anyway, but I would like this to be addressed in a future update.

    Probably my biggest frustration is that when I start a new query, I get this set of warnings:


    …which suggests that it doesn’t really know about SQL DW. I can tell them to be suppressed, so that the dialog doesn’t re-appear, but I don’t like the feeling that the system is attempting them at all.

    It’s certainly a lot less painful than it was in the past though. I love the fact that I can use the Object Explorer window. I love that I can script objects, in a way that feels way more natural to me than in SSDT.

    This is SSDT:


    This is SSMS:


    , although oddly the SSMS script includes the USE statement at the top, which isn’t supported in SQLDW (I’m sure this won’t be the case for much longer).


    Overall, I’m really pleased that the team has put things in place to make SSMS talk to SQL DW at all. I was beginning to think that SSMS wasn’t going to come to this particular party. This release, despite having some way to go just yet, suggests that I’ll soon be using SSMS more when I’m using SQL DW.

    And therefore, this topic is worthy of Chris Yates’ T-SQL Tuesday blog party this month – celebrating the new things that have come along in the SQL world recently.



  • Some recent posts

    It’s T-SQL Tuesday, but this isn’t actually my post!

    Regular readers might remember that I posted about “Number of Rows Read” a while back, and in that post, I mentioned that SQL Sentry should consider having a warning in Plan Explorer. Well, they put that feature in, and asked me if I could write about it. So I did.

    But this wasn’t a standard “Can you write about our product, so we can post it on our main product page”, this was an offer to write for – alongside a select few others such as SQLskills people like Paul, Jonathan, Erin, Glenn, Tim, and (back when he was with them) Joe, SQL Sentry people like Aaron and Kevin, Jason, Rick, and the legendary Paul White.

    Now, I consider writing for a huge honour (and I’m still hugely grateful to Adam Machanic (@adammachanic) – for his invite to write here, and the hosting that he does), and being invited to write for is a huge honour as well (just like it’s an honour to be asked to write for books, or to work for particular clients). So I’m going to write for both. I’ve just had three posts get published on, but I plan to be keeping the numbers fairly even between the two. will remain my main blogging site, and the one that I refer people to. And I still want to make sure I publish something here each month.

    The content on both sites is excellent, from all the authors. Part of what attracted me to was the high standard of content here, and the number of bloggers that I admire here, and the list of authors at is very strong too. Both Aaron and Paul (White) often write about T-SQL performance, which is close to my heart as well, so I think my content fits in quite nicely there.

    Those three posts I’ve written are:

    Number of Rows Read / Actual Rows Read warnings in Plan Explorer

    What's "Actually" going on with that Seek?

    SQL Server 2016 Temporal Table Query Plan Behaviour

    Happy reading!


  • Exploring custom visuals in Power BI

    One of my favourite things about Power BI is its extensibility. For ages I would hear about how good some other self-service reporting platforms were, and how things in the Microsoft space were lacking. Power View, in particular, was frustratingly limited in what you could do with it, and I felt somewhat disappointed. It was good as a data exploration tool, but simply wasn’t a good reporting environment if you wanted something that was customisable.

    But in recent times, Power BI has really stepped up, with custom visualisations offering the ability to extend capabilities much further.

    I hadn’t explored much in the way of custom visuals in Power BI until a while back, even though I was very much aware of the competition that was held in September. It had been on my list to explore some of what was possible. And this month, the T-SQL Tuesday topic (hosted by Wendy Pastrick, @wendy_dance) was to learn something new and to blog about it. So it seemed a good idea to learn how to make my own custom visualisation!

    Now, creativity isn’t exactly my thing. I find it really hard to write songs, for example. I know how to do it – but I quickly become self-critical and get stuck. Writing is easier, because it feels less ‘creative’, and appeals more to the teacher / preacher in me (and I know that takes creativity, especially if you’ve ever seen me present, but it’s different). So sitting down and coming up with a new way of visualising data wasn’t something I was going to do.

    But I found the Synoptic Panel visualisation, which is actually the one that won the competition, and learned some of how to use this. This one lets you make your own visualisations based on your own images. It’s really neat.

    There are two aspects to it.

    1. The .pbiviz file. This is what you add as your custom visualisation to the Power BI Desktop. You can get it from the Power BI Visuals Gallery, along with a bunch of other cool visuals. It’s the one that looks like a room with lots of colours.


    2. Your image, with a bunch of extra mark-up. This is where the magic happens.

    You head over to where there’s a tool for creating it all. You drop in your image (I went with the LobsterPot logo, which was my own design, in a moment of creativity), and then, using the “magic wand” icon, click on the various areas of your image.


    Here I’ve made just two areas, by clicking on the two parts of the logo. I can name them if I want, but already I’m pretty much done. There’s a big button on the right that says “Export to Power BI”, which lets me download an SVG file. Interestingly, I already had an SVG image of the company logo, but I needed the right markup, so I needed the exported one.

    Now over in the Power BI Desktop tool, I wanted to see what this looked like. I started by importing the custom visualisation. I clicked on the ellipsis at the end of the “Visualizations” pane, and got the extra one added.


    Clicking this to add it to my report, I got a window that looked like this:


    And I expected to see some way of getting my LobsterPot claw to display. No such luck. I clicked all over, and saw nothing.

    But it turned out I just needed to add some data to it first. Once I’d done that, I got some options:


    Hitting “SELECT MAP”, I could find my SVG file and I got my claw!

    The data I had was just two values. I wanted to be able to colour each section a different colour. But the areas of my image were already red, so I coloured my areas in white, setting the saturation to be less for the higher values (less white, therefore more red), and more for the lower values (more white, so less red). I made a negative measure to help with this.

    But quite quickly, I had my claw, with a value of 90 showing quite red, and a value of 50 showing a fainter pink colour.


    In hindsight, I should’ve edited my image before I started, making the two areas white, and then I could’ve easily coloured them in red. But in my exploration, I was able to learn some of the capabilities of this useful control (tip: read, and I figure that the availability of this visualisation may mean that I never worry about creating my own from scratch.

    I wonder whether I will one day come up with a visualisation of my own. Perhaps, although for the time being I’ll leave it up to the experts. I can use this one for all kinds of things, I suspect.


  • Reasons to upgrade SQL Server

    With SQL Server 2005’s extended support ending today, it seems appropriate to write a post about “My Favourite SQL Server Feature” for T-SQL Tuesday this month, hosted by Jens Vestergaard (@vestergaardj).

    The thing is that when I consider reasons to upgrade from SQL Server 2005, I see a ton of features that could be leveraged in later versions. You could use availability groups, or columnstore indexes, or spatial. You could use the Tablix control in SSRS, or project-based SSIS configurations... so many reasons. But of course, if you’re feeling stuck on older versions, then it’s the backwards-compatibility of SQL Server that is the key. The new features can come later, once you’ve moved to a later version.

    Now, I appreciate that if you have outer joins that use the “*=” syntax, then upgrading is going to take some effort. That does need to be fixed before you can move to later versions.
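
    If you’re in that boat, the rewrite itself is mechanical. A hedged sketch, with made-up table names, of turning the old syntax into an ANSI join:

    -- Old-style outer join: only runs under compatibility level 80 and below
    SELECT c.CustomerID, o.OrderID
    FROM dbo.Customers c, dbo.Orders o
    WHERE c.CustomerID *= o.CustomerID;

    -- The ANSI equivalent, which works on every later version
    SELECT c.CustomerID, o.OrderID
    FROM dbo.Customers c
    LEFT JOIN dbo.Orders o ON o.CustomerID = c.CustomerID;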

    You see, what I love the most about SQL Server today, and giving reasons to upgrade, is actually SQL Azure. If you have provisioned SQL Database in Azure, then your system is continually being upgraded. You are on the newest version of the platform. There is no decision you can make to satisfy those people who say “Hang on, leave me on SQL Server 2008 R2, because my third-party product doesn’t support SQL Server 2012 yet...”

    I can assure you that Microsoft does not want to break the code that you have running successfully in SQL Database. They will continue to improve the platform, and provide new features, but I’m confident that any code that you write today for SQL Database will continue to work there for a very long time.

    And that gives me hope for on-prem SQL Server environments too. I feel confident that things I do today, whether I’m dealing with new SQL 2016 work, or back as far as SQL 2008 R2, will continue to work in future versions. Because Microsoft is in the business of upgrading you to the latest version.

    My Windows machine pulls down updates automatically. My iPhone does the same. My SQL Database does. I would like my on-prem SQL Server to be doing the same, automatically deploying things into a SQL 2016 environment as it becomes available.

    Get off SQL 2005 as soon as you can, and brace yourself for frequent upgrades. Microsoft recommends proactive upgrades from cumulative updates now, so I’m confident that it’s only a matter of time before upgrading becomes a continuous process on-prem, like it is in SQL Azure.

    My favourite SQL Server feature is its backwards-compatibility.


  • How to make text searching go faster

    ...but first, let’s look at one oft-forgotten reason why finding a particular piece of text can be slow, if not impossible: collation. This will then provide a useful platform for making it go faster.

    I say ‘impossible’, but of course it’s never impossible to find something in a database (assuming it’s there). It might take longer, but you can always scan the column for it, starting on the first page and going until you’ve found it. Various things like Full Text Search can help make things easier, but all-too-frequently we see code that searches for SomeText%, or worse: %SomeText%. This is the thing I want to look at – finding non-indexed string patterns.

    First let’s remember that if we are hoping to use an index, we need to know what language we’re in. I have spoken before about how I picked up a map in Sweden to find Västerås, but couldn’t find it listed in the map’s index. I didn’t realise that in Swedish, ‘ä’ and ‘å’ are not the same as ‘a’, and found at a different part of the alphabet. When I searched using an English alphabet, I couldn’t find the entry. I might think that ‘Västerås’ and ‘Vasteras’ are the same, but a Swedish person would tell me otherwise. It’s like if I refer to a game as ‘football’, you would need to understand my personal collation setting to know what I was talking about. When Michael Palin sung (as a lumberjack) about wearing high-heels, suspenders and a bra, he wasn’t referring to anything that held his trousers up, despite what people using an American collation setting might think.

    But this is about making searches for text go faster. If we’re comparing two strings in a different collation we get an error, but let’s think about speed.

    Consider that I’m looking for rows in a table WHERE CommentText LIKE '%Farl%'. Right away, I’m sure you appreciate that no amount of regular indexing on CommentText would let me perform an ordinary b-tree index search to find that row. I could improve it by using other technologies that will allow the individual words in my text to be found, but I’m just looking for a particular piece of text. It’s not even a whole word.

    For my experiment, I’m using a table on SQL 2014 on my Surface Pro 2. It’s a larger version of AdventureWorks2012’s Person.Address with 19 million rows. There is a column called AddressLine1, which has collation SQL_Latin1_General_CP1_CI_AS and has type nvarchar(120). You can create it using code like this:

    CREATE TABLE [Person].[Address_Big](
        [BigAddressID] [int] IDENTITY(1,1) NOT NULL,
        [AddressID] [int] NOT NULL,
        [AddressLine1] [nvarchar](60) NOT NULL,
        [AddressLine2] [nvarchar](60) NULL,
        [City] [nvarchar](30) NOT NULL,
        [StateProvinceID] [int] NOT NULL,
        [PostalCode] [nvarchar](15) NOT NULL,
        [SpatialLocation] [geography] NULL,
        [rowguid] [uniqueidentifier] NOT NULL,
        [ModifiedDate] [datetime] NOT NULL
    );
    go

    insert Person.Address_Big
    select * from Person.Address;
    go 100

    I ran this query quite a few times, and it took about 40 seconds to tell me there were no rows returned.

    select *
    from Person.Address_Big
    where AddressLine1 like N'%Farl%'
    option (maxdop 1);

    Obviously no one lives on 203 Farley Avenue, or 1 Farlabulous Drive. But nor do they live at 711 Gofarles Street. You see, despite the fact that I had specified ‘Farl’ with a capital F and lower-case ‘arl’, it didn’t care about that at all. My collation setting told it explicitly to ignore case. That’s what the CI was for in SQL_Latin1_General_CP1_CI_AS. In fact, if we query select * from fn_helpcollations() where name = 'SQL_Latin1_General_CP1_CI_AS'; we see it says “Latin1-General, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 52 on Code Page 1252 for non-Unicode Data”. So it’s not only case-insensitive, it’s also kanatype-insensitive and width-insensitive too.

    Clearly there is a lot more work for it to do, when scanning large amounts of text looking for a group of consecutive characters that could match inexactly.

    Now, without changing my table at all, I changed my query like so, telling it to use a binary collation for the search. To search exactly rather than inexactly.

    select *
    from Person.Address_Big
    where AddressLine1 like N'%Farl%' collate Latin1_General_BIN
    option (maxdop 1);

    I could’ve used the SQL collation SQL_Latin1_General_CP437_BIN, but I find it easier to remember the Windows collation, and in Australia, the default settings for SQL are to use Windows collations. They’re a little better than the SQL collations [citation needed].

    But anyway – this query returned in just 7 seconds. I re-ran the original one – 40 seconds. I re-ran this one – 7 seconds. It really was significantly faster. The plan is the same. There is no sneakiness going on. The search for the binary text was simply faster.

    This makes sense. If I’m looking for a particular string, it’s going to be quicker if I can just look for the exact bits, and not have to consider what the text might be in a different case, or if width needs to play a part, and so on.
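
    You can see the difference in what counts as a ‘match’ with a trivial comparison (the literals here are just my illustration):

    SELECT
        CASE WHEN N'FARLEY' = N'Farley' COLLATE SQL_Latin1_General_CP1_CI_AS
             THEN 'equal' ELSE 'not equal' END AS ci_compare,
        CASE WHEN N'FARLEY' = N'Farley' COLLATE Latin1_General_BIN
             THEN 'equal' ELSE 'not equal' END AS bin_compare;
    -- ci_compare returns 'equal'; bin_compare returns 'not equal'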

    Now you might think “Great – I’m going to add that to all my string searches”, but you should understand that there is potential for the results to be different. If there were someone in my table who lived in FARLEY BOULEVARD (in all caps, in the way that French people often write their surnames, for example), then that would have been found in my case-insensitive-collation search, but not in my binary-collation search for the lower-case letters ‘arl’. It’s useful if the data in your system is only stored in capitals, in which case (ha!) you could actually change the collation of your column, but it’s definitely worth considering the benefits of asking for a collation-specific search.
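
    If you do go down the path of changing the column itself, it’s a one-line change – a sketch against the demo table here, assuming nothing like an index or constraint depends on the column:

    -- Sketch: permanently store and compare AddressLine1 using a binary collation
    ALTER TABLE Person.Address_Big
    ALTER COLUMN AddressLine1 nvarchar(60) COLLATE Latin1_General_BIN NOT NULL;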

    And what about grouping, you ask? Ok, maybe I didn’t hear you ask that, but let’s pretend you did.

    If there’s an index on the column you’re grouping by, then changing the collation is going to hurt a bit. Grouping could take advantage of a Stream Aggregate under our indexed collation, but changing the collation throws away the index order (ORDER BY doesn’t get handled well by changing the collation either), which means a Hash is required. But comparing two query plans that both use Hash Match (Aggregate), one on a case-insensitive collation and one on a binary collation, I found the latter was slightly faster. Not as drastic a change as searching, but still 10-30% better. One would run in about 12 seconds, and one in about 10.

    select City, count(*)
    from Person.Address_Big
    group by City
    option (maxdop 1);

    select City collate Latin1_General_BIN, count(*)
    from Person.Address_Big
    group by City collate Latin1_General_BIN
    option (maxdop 1);

    Considering what’s going on with a hash function and non-exact strings is actually pretty interesting. HASH(Value) must produce the same value for any two values that are considered equal – such as ‘FARLEY’ and ‘Farley’ in my CI collation. For this to happen, it obviously can’t hash the actual values, it must have to convert the values into a common form that will hash the same way regardless of case, kana, and width. But this is work that is hidden from the query plan. We can see the impact of it through the query speed, but not anywhere in the plan. This will become yet another thing for me to investigate – but not this week before T-SQL Tuesday comes around and I need to publish this post. New father Bob Pusateri (@sqlbob) is hosting this month, about text searching, in case you hadn’t guessed.



  • A Fitbit report in Power BI for T-SQL Tuesday

    Finding things to publish online for public consumption is something I’m often a bit reluctant to do. Most of my work is for customers, and there’s no way I’m going to share some of their data unless they’ve explicitly allowed it.

    So when Jorge Segarra (@sqlchicken) posted a challenge to publish a Power BI report on the web for T-SQL Tuesday, I had to give some thought to what kind of data to show. Luckily, Scott Stauffer (@sqlsocialite) has been Fitbit-challenging me over recent weeks, and in particular, in some challenges that don’t include the big-steppers like Steve Stedman (@sqlemt), who should probably change his name to Stepman, considering he has a treadmill desk and does over 100k each week. Anyway – with a group of people who do the same order of magnitude of steps as me, I stretched myself to do better than I had been doing, and figured this could make useful data.


    I started with an export of the last 31 days from Fitbit. That’s as much data as you can pull down from them, and although I could go to the effort of getting extra exports and combining them, I didn’t for this. After all, I’d rather be out getting more steps done than analysing them.

    I had a bit of cleaning to do first, because the data had an annoying first line. In fact, I found it easier to pull the data in as text, remove the top line, then split the data up by the delimiter. I could then mark the various columns as numbers, which made life a lot easier.


    After all this was done, I was ready to throw some stuff onto a report (ok, I also added some measures to show the steps as a percentage of the total steps – they have to be measures to handle the division properly). It was easy to get a chart on there, and a card to show some of the numbers. But I wanted to make it a bit more interesting... so I added an average, to show how my increased steppage made an impact on my average.

    I added a column:

    RollinAvgSteps =
        AVERAGEX(
            FILTER(fitbit_export_20160214,
                   EARLIER(fitbit_export_20160214[Date]) >= fitbit_export_20160214[Date]),
            fitbit_export_20160214[Steps])

    ...which takes the average of my steps to date. There are a bunch of ways to achieve this, but this is the way that I chose. And you can see that the average line is (happily) improving! Oh, and because I pulled down the extract on the 14th, there’s a dip at the end. My numbers were much healthier by the end of the day, and despite spending way too long NOT walking, I did about 7244 steps that day.

    You can see the result at, which looks like:


    Oh, and having done all this, I discovered that fellow SQL Saturday Melbourne precon presenter Reza Rad has a series on doing Power BI on Fitbit data – I was pleased to see that he did a similar set of transforms to the data.


  • Join Effects With UPDATE

    A lot of people don’t like UPDATE with a FROM clause. I’m going to explore what’s going on, both logically and within the query plan. And I’m doing this for T-SQL Tuesday #74, hosted by Robert Davis (@sqlsoldier).

    I’m going to use a fresh copy of AdventureWorks2012. I’m using SQL 2012 SP3, but the things I’m looking at should apply to most versions. I’m going to use Production.Product and Production.ProductSubcategory. The Product table has a ProductSubcategoryID column with a foreign key in place, although this column allows nulls, as not every product must be within a subcategory.

    Our standard UPDATE query doesn’t have a FROM clause. It’s just “UPDATE ... SET ... WHERE ...”, and if we need to hook into other tables, we use sub-queries. Let’s look at why this is significant.

    The WHERE clause filters rows. A sub-query in the WHERE clause still only filters rows, either by being a scalar expression used in one of the predicates, or being a single-column table expression used in an IN predicate, or a table expression used in an EXISTS clause. Any other tables used in sub-queries in the WHERE clause can only be used to help filter the table being updated – they can’t affect the SET clause at all, or cause a row to be updated multiple times.

    Some examples are like this:

    UPDATE Production.Product
    SET DiscontinuedDate = SYSDATETIME()
    WHERE ProductSubcategoryID = (SELECT s.ProductSubcategoryID FROM Production.ProductSubcategory s WHERE s.Name = 'Tights');

    UPDATE Production.Product
    SET DiscontinuedDate = SYSDATETIME()
    WHERE ProductSubcategoryID IN (SELECT s.ProductSubcategoryID FROM Production.ProductSubcategory s WHERE s.ProductCategoryID = 4);

    UPDATE Production.Product
    SET DiscontinuedDate = SYSDATETIME()
    WHERE EXISTS (SELECT * FROM Production.ProductSubcategory s 
                  WHERE s.ProductCategoryID = 4 
                  AND s.ProductSubcategoryID = Production.Product.ProductSubcategoryID);

    Using other tables in the SET clause generally means something that returns a scalar value, although this could become more complex using CASE. Still though, the logical impact on the overall query is notable. Something in the SET clause cannot be used to filter the values being updated, or to update a value multiple times. The SET clause is a list of “column = <scalar expression>” clauses, in which each column must come from the table (or table expression) being updated (which means I don’t think it should ever have a table alias), and cannot be listed multiple times. As the expression is scalar, it can’t produce multiple rows or columns.

    UPDATE Production.Product
    SET DiscontinuedDate = (SELECT MAX(s.ModifiedDate)
                            FROM Production.ProductSubcategory s 
                            WHERE s.ProductSubcategoryID = Production.Product.ProductSubcategoryID)
    WHERE ProductSubcategoryID IN (SELECT s.ProductSubcategoryID FROM Production.ProductSubcategory s WHERE s.ProductCategoryID = 4);

    Now, all UPDATE statements could be written like this. As an update statement cannot change the number of rows in a table, the net effect on any row is a single change (I know triggers could be used to have a larger effect, but that’s a separate topic). I’ve met plenty of people over the years who will argue for never using a FROM clause in an UPDATE statement.

    You see, a FROM clause can have more of an effect than these sub-queries.

    Let’s think about what introducing extra tables via a FROM clause can do. For now, let’s start with what’s going on in a SELECT query, when you turn a FROM clause into a FROM ... JOIN clause.

    1. It can let you access data in the columns of those tables, to use in predicates or expressions.

    2. It can let you filter the data in the base table, by only allowing rows which match, such as when using an inner join or right outer join.

    3. It can cause rows in the base table to be returned multiple times, if multiple rows in the joined table match a single row in the base table.

    4. It can introduce NULL rows, if a full or right outer join is being done (or a left outer join with the base table second) and there are rows in the joined table that don’t match any rows in the base table.

    The particular combination of these will affect the type of join performed by your SELECT query – such as a Semi Join which does number 2, but none of the others. And if it does none, then the join is redundant and won’t appear in the query plan at all.

    So how does this work in an UPDATE statement?

    There are two possible ways that a FROM clause can work – one is to include the table being updated in the FROM clause, and the other is to have it NOT included in the FROM clause. If it doesn’t appear in the FROM clause, then predicates to define the matching criteria must be included in the WHERE clause to avoid updating every row. If it does appear in the FROM clause, then I would recommend using the table alias in the UPDATE clause rather than the name of the table itself.

    (Interestingly PDW does not support “UPDATE ... FROM ... JOIN”, although “UPDATE ... FROM ... WHERE” is fine.)

    So this is fine:

    UPDATE Production.Product
    SET DiscontinuedDate = s.ModifiedDate
    FROM Production.ProductSubcategory s
    WHERE s.ProductSubcategoryID = Production.Product.ProductSubcategoryID;

    As is this:

    UPDATE p
    SET DiscontinuedDate = s.ModifiedDate
    FROM Production.Product p
    JOIN Production.ProductSubcategory s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID;

    But please be careful about:

    UPDATE Production.Product
    SET DiscontinuedDate = s.ModifiedDate
    FROM Production.Product p
    JOIN Production.ProductSubcategory s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID;

    It works, but I don’t consider it safe. Because you have the potential to update a table which isn’t mentioned in the FROM clause, you could find yourself inadvertently updating every row in Production.Product. There are safeguards to prevent it happening – this next example gives an error:

    UPDATE Production.Product
    SET DiscontinuedDate = SYSDATETIME()
    FROM Production.Product p
    WHERE Production.Product.ProductSubcategoryID IS NOT NULL;

    , although this one doesn’t, and updates every row in the table – after all, we have a CROSS JOIN going on, because I’ve listed the wrong table.

    UPDATE Production.Product
    SET DiscontinuedDate = SYSDATETIME()
    FROM Production.ProductSubcategory p
    WHERE p.ProductSubcategoryID IS NOT NULL;

    If I’m writing queries, it’s generally fine. But if there’s a system which produces dynamic SQL, I start to worry. I’d rather update the alias, and be completely clear about what’s going on.

    So let’s go with the idea of using the table alias in the UPDATE clause when using the FROM clause, and choosing to always include the table being updated in the FROM clause. Unless we’re using PDW, of course.

    But the impact of those join effects... let’s look at them.

    Earlier, we saw this query. An inner join between Product and ProductSubcategory.

    UPDATE p
    SET DiscontinuedDate = s.ModifiedDate
    FROM Production.Product p
    JOIN Production.ProductSubcategory s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID;

    Now, because s.ProductSubcategoryID is known to be unique (it’s the primary key on s), there is no way that this can cause ‘multiple updates’ to Product. Things are okay here, but filtering could certainly apply. A join is done to get the values from ProductSubcategory, and the rows are fed into the Clustered Index Update operator.


    Filters are okay here. UPDATE is happy with filters, whether they’re implemented using the WHERE clause or via an ON clause.

    But what if the unique index weren’t there? Then we might see duplicate rows – the next join effect.

    ALTER TABLE [Production].[Product] DROP CONSTRAINT [FK_Product_ProductSubcategory_ProductSubcategoryID];
    ALTER TABLE [Production].[ProductSubcategory] DROP CONSTRAINT [PK_ProductSubcategory_ProductSubcategoryID];

    Now what does the plan look like – we should see a Table Scan instead of a Clustered Index Scan because we just dropped the PK, but what other differences?


    It looks very similar, but now throws a Distinct Sort in there. You see, an Update isn’t going to do multiple updates. It won’t allow it. So it does a Distinct Sort on the PK of the Product table, and uses whichever value it cares to for the update.

    Another option it could’ve used would’ve been to use an Aggregate operator (because GROUP BY and DISTINCT are essentially the same thing), in which case it would’ve needed to apply an aggregate function to s.ModifiedDate while grouping by the Product PK. Which aggregate? The ANY() aggregate, of course – because it doesn’t care which value to use, it just has to be a valid one. I can get this plan by using an OPTION (FAST 1) query hint, because that will avoid doing the Sort, as a Sort is blocking. It also turns the Hash Match into a Nested Loop, because it really wants to get that first row through as quickly as possible. It’s a slower query, but lets us see the ANY() aggregate.
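
    For reference, that’s just the earlier query with the hint added:

    UPDATE p
    SET DiscontinuedDate = s.ModifiedDate
    FROM Production.Product p
    JOIN Production.ProductSubcategory s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID
    OPTION (FAST 1);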


    So we can see that if multiple rows are going to be returned by the FROM clause, this will get shrunk down to a single one. This is how that third ‘join effect’ is handled.

    Be really careful about this. It’s a bad thing, and the reason why purists don’t like to see a FROM clause in an UPDATE statement.

    The next (and final – yay!) join effect is to have NULLs introduced.

    Let’s start by putting those constraints back in:

    ALTER TABLE [Production].[ProductSubcategory] ADD CONSTRAINT [PK_ProductSubcategory_ProductSubcategoryID] PRIMARY KEY CLUSTERED ([ProductSubcategoryID]);
    ALTER TABLE [Production].[Product]  WITH CHECK ADD  CONSTRAINT [FK_Product_ProductSubcategory_ProductSubcategoryID] FOREIGN KEY([ProductSubcategoryID])
    REFERENCES [Production].[ProductSubcategory] ([ProductSubcategoryID]);
    ALTER TABLE [Production].[Product] CHECK CONSTRAINT [FK_Product_ProductSubcategory_ProductSubcategoryID];

    ...and let’s put a RIGHT JOIN in there (for purely academic reasons – I know you’d never do this in real life, although you might put a LEFT JOIN in with the base table second). This means that our FROM clause will return an extra row for each ProductSubcategory that has no Products. There aren’t any of them in AdventureWorks2012, but the Query Optimizer doesn’t know that.

    Before I go any further, let’s quickly make something clear. With a right outer join, the result set of the join contains rows that don't exist in the base table. Obviously we can't update those – there's nothing in the base table for those rows. But we’re going to look at how the query plan handles this situation.

    UPDATE p
    SET DiscontinuedDate = s.ModifiedDate
    FROM Production.Product p
    RIGHT JOIN Production.ProductSubcategory s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID;

    There are two things of note here in the query plan (apart from the fact that it has put the base table second and used a left join):


    For a start, we still see a Distinct Sort! I can assure you that the unique constraint is in there. If I remove the keyword ‘RIGHT’ I go back to my original version without any distinctifying operator. But the thing that’s new here is that Filter. Let’s look at what the Filter is doing:


    It’s filtering on “[IsBaseRow1001] IS NOT NULL”. That value is coming from the Product table, and is simply checking to see whether the row coming out of the Join operator is a match or not. It’s testing to see if we actually have a row in the base table to update. It could’ve tested ProductID for NULL for this, like we would in a query, but I guess it’s quicker to test IsBaseRow for being NULL than to test ProductID. I don’t know much about IsBaseRow1001, but I can tell that it’s not more than a single byte. The Estimated Row Size on the Scan of Product was 15B in the original query, and is 16B in this query. But I’m just guessing here. Theoretically it’s not needed at all, of course, and for testing, could have been a single bit.

    Or the Query Optimizer could have turned the join into an inner join. After all, we’re not interested in updating a non-row. As much as it’s interesting to see IsBaseRow1001 coming through, I can’t help but think that turning that join operator into an inner join would’ve done the trick. But as we don’t see LEFT JOIN + NULL being turned into an Anti Semi Join either, I’m not too surprised that this translation isn’t happening either.

    Because there could be multiple Subcategories without Products, there is a possibility of non-unique ProductIDs – the NULLs – coming out of the Join operator. But these are the only ones that could be duplicates, because each Product has at most a single matching ProductSubcategoryID in s. Therefore, once the NULLs have been removed by the Filter, the QO should be able to know that the data coming out of the Filter is unique on ProductID, but it doesn’t use this information, and needs a distinctifying operator to be sure.

    The Distinct Sort is still on ProductID, but a FAST 1 hint turns it into a Hash Aggregate this time instead of a Stream Aggregate. The reason for this is that a Nested Loop over the Product table isn’t going to find the ProductSubcategories that don’t have Products (although it could if it understood the translation to Inner Join). Therefore, it still performs a Hash Aggregate, does the filter for IsBaseRow1001, and then does a Hash Match (Flow Distinct) on ProductID.
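
    Again for reference, that’s the RIGHT JOIN query from above with the same hint:

    UPDATE p
    SET DiscontinuedDate = s.ModifiedDate
    FROM Production.Product p
    RIGHT JOIN Production.ProductSubcategory s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID
    OPTION (FAST 1);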

    It’s interesting to see that we have a Build Residual here on ProductID, despite ProductID being an integer.


    You see, normally in a Hash Match on an integer we wouldn’t see a residual because the hash function produces an integer. It’s because ProductID could have been NULL. The nullability of the column coming through obviously wasn’t changed by the Filter (and let’s face it – it didn’t test ProductID for NULL, it tested IsBaseRow1001).

    Quick interlude for showing that the hash function produces a 32-bit value, and doesn’t need a residual check when hashing on a non-nullable integer (while a nullable integer needs more than 32 bits):

    Compare the plans of these two queries. The Hash Match operator in the first one doesn’t have a Probe Residual, because s.ProductCategoryID doesn’t allow NULLs. The Hash Match operator in the second does have a Probe Residual, because p.ProductSubcategoryID does allow NULLs, and a nullable integer can cause clashes in the hash table.

    SELECT *
    FROM Production.ProductSubcategory s
    JOIN Production.ProductCategory c ON c.ProductCategoryID = s.ProductCategoryID

    SELECT *
    FROM Production.Product p
    JOIN Production.ProductSubcategory s ON s.ProductSubcategoryID = p.ProductSubcategoryID

    Also consider the Hash Match operator in the following query:
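
    (The setup for dbo.rf_BigInts isn’t shown; any scratch table with a single non-nullable bigint column will do – the column name num below is my assumption.)

    -- Hypothetical setup; column name num is assumed
    CREATE TABLE dbo.rf_BigInts (num bigint NOT NULL);

    INSERT dbo.rf_BigInts (num)
    SELECT CAST(ProductID AS bigint)
    FROM Production.Product;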


    SELECT *
    FROM dbo.rf_BigInts t1
    JOIN dbo.rf_BigInts t2 ON t1.num = t2.num  -- join column name assumed, matching the setup above

    DROP TABLE dbo.rf_BigInts;

    , where we see a Probe Residual on a non-nullable bigint (a lot more than 32 bits). This tells me that bigints can have clashes in the hash table, despite non-nullable integers not showing this.

    Oh yes, we were looking at the Build Residual.

    If we went back to an Inner Join with FAST 1, where we got a Stream Aggregate, and turn that into a Hash Match on the non-nullable ProductID, we can see that our Build Residual disappears.

    ALTER TABLE [Production].[Product] DROP CONSTRAINT [FK_Product_ProductSubcategory_ProductSubcategoryID];
    ALTER TABLE [Production].[ProductSubcategory] DROP CONSTRAINT [PK_ProductSubcategory_ProductSubcategoryID];

    UPDATE p
    SET DiscontinuedDate = s.ModifiedDate
    FROM Production.Product p
    JOIN Production.ProductSubcategory s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID


    Let’s do a bit of a summary...

    If you’re doing an UPDATE, you can only update each row in that table one time, no matter what kind of impact your FROM clause might want to have on the base table. You may feel like it's going to update some rows multiple times, but that's not how it works.

    Each of the join effects is either applied (in the case of a filter) or mitigated (in the case of duplicates or NULLs), so that you can access the data in other tables without fear of having a bad effect on your UPDATE, but don’t do it! Because you can’t tell which value gets picked up by the ANY() aggregate (or first row by a distinctifying operator), you should avoid duplicates completely, so that your UPDATE finds a single value to update each row with.

    And I would encourage you to use table aliases in your UPDATE clause if you use a FROM clause with a JOIN – but if you don’t use a JOIN, then make sure you include the match in your WHERE clause.


  • A new superpower for SQL query tuners – Number of Rows Read

    There’s a technique that I’ve been using for some time now – and had been looking for going back even further – which has become immediately available to everyone who can see a query plan (version permitting).

    Some years back (ok, it was 2010), I started to present at conferences about SARGability and residual predicates. I had a session at SQLBits VII about it (one of the first times I’d presented with SQL MVPs from at least five different countries in the room), I presented at the 2011 PASS Summit on “The evils of residualiciousness”, and I even wrote a song with the line “my predicate’s residual, my seek just runs too slow”. I wrote blog posts about bad covering indexes, and TSA security probes (or something like that).

    The point of all this is that an index seek, merge join, or hash match is not necessarily effective for quickly locating the rows that you care about. It all comes down to the predicates that are involved, and whether they are SARGable for the index you’re trying to use.

    Over at my “Covering Schmuvvering” post, I describe an index and query on AdventureWorks like this:

    CREATE INDEX rf_ix_Covering ON Production.Product(DaysToManufacture)
    INCLUDE (Name, ProductNumber, Size, ReorderPoint, Color);

    SELECT Name, ProductNumber
    FROM Production.Product
    WHERE DaysToManufacture < 4
    AND ReorderPoint < 100
    AND Color = 'Red';

    The plan gives me an Index Seek that returns a single row. Wohoo!


    ...but I explain how this is actually really bad, because all the work is being done in the Predicate, not the Seek Predicate.


    You see, the “Predicate” is the Residual Predicate (my term – you’ll just see it as “Predicate” here, because it’s the same kind of thing that we see in a Scan, which doesn’t have a concept of the Seek Predicate and the Residual Predicate).

    The Residual Predicate is the leftover one, that needs to be checked after the Seek Predicate. You might not have one, if every predicate in your query is handled already by the time the Seek is done. But if you do have one, there is extra checking to do.

    So just like how a Scan will start on the first page of the index and keep running until it’s either reached the end or it doesn’t need to keep looking (because the operator on its left has stopped asking for more rows), a Seek will find the rows that satisfy the Seek Predicate, and then have to check each one to see if it satisfies any leftover predicates.

    The same applies for Merge Joins, which find things that match using the sort order of each data stream, but then need to apply any residual predicates. Hash Matches can be even worse, as the Probe Residual will include the Probe Keys if the hash function could cause two different values to be assigned to the same bucket. Read more on probe residuals here.

    Back to Index Seeks though...

    When the Query Optimizer creates a plan that has a residual predicate in a Seek or Scan, one of the earlier iterations will have seen this predicate applied in a Filter operator to the left of the Index operation. But before the plan is created, the residual predicate is pushed down into the index operation.

    If we apply trace flag 9130 (undocumented, so be wary), we can see the plan without the pushdown.
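
    If you want to try that for a single statement (on a test box – it’s undocumented), QUERYTRACEON is the easiest way in:

    SELECT Name, ProductNumber
    FROM Production.Product
    WHERE DaysToManufacture < 4
    AND ReorderPoint < 100
    AND Color = 'Red'
    OPTION (QUERYTRACEON 9130);   -- undocumented trace flag; don't rely on it in production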


    And that thick arrow shows us 407 rows being produced by the Seek operation, even though our original plan showed only a single row coming out of it. The original plan simply did not contain that information.

    But with Service Pack 3 of SQL Server 2012, all that changes.

    From now on, Index Seeks and Index Scans have an additional property called “Actual Rows Read”, which shows through in SSMS as “Number of Rows Read”.


    Please note that you need to have a recent-enough version of SSMS to see this come through. If you are running SSMS 2014, you’ll need to wait for the next service pack. Or if you’re running SSMS 2012 or SSMS 2016, make sure you’ve applied the latest service pack or CTP.
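
    If you’d rather see the number in the raw plan XML than hover over the operator in SSMS, capture the actual plan and look for the ActualRowsRead attribute on the operator’s runtime counters – something like:

    -- On 2012 SP3 (and later versions), the Index Seek's RunTimeCountersPerThread
    -- element carries an ActualRowsRead attribute, which SSMS shows as
    -- "Number of Rows Read"
    SET STATISTICS XML ON;

    SELECT Name, ProductNumber
    FROM Production.Product
    WHERE DaysToManufacture < 4
    AND ReorderPoint < 100
    AND Color = 'Red';

    SET STATISTICS XML OFF;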

    This feature is amazing though! A significant part of the query tuning I do is to look for ineffective Seek operations. Many people look for Scans, but don’t consider that a poor Seek is just as bad. Now, we can easily see that this Seek read 407 rows to produce just 1, and a different indexing strategy could be recommended. It wouldn’t surprise me to see warnings start to come through in third-party products like SQL Sentry’s Plan Explorer soon as well, because I think that a Seek with 0.25% effectiveness is worth a warning. And without having to use the trace flag either!

    (Edit: They do now!)

    It’s like you’ve just been given X-ray vision on the Index operators!

    Other experts in the space, such as Kendra Little (@Kendra_Little), understand the significance of the residual predicate – and now you can see it too.


    There are a few more things I would like to see to round this feature off. I’d like to see Estimated Rows Read (which I can see using 9130), and I’d like to see similar information for Merge Joins and Hash Matches. If you want to see these implemented, jump onto Connect and vote for these items: Estimated Rows Read and Merge Join / Hash Match info – and one to fix up the naming convention.

    ...and in the meantime, make sure your local copy of SSMS is up-to-date, and encourage people to upgrade their SQL 2012 boxes to SP3!


  • Does the paradigm fit the environment?

    This month we see the 73rd T-SQL Tuesday come around – six years were completed last month (the first was in December 2009), and this is the start of year seven. It’s hosted this month by Bradley Ball (@sqlballs), and he asks a question about whether our environments are naughty or nice.

    Now, I’m a consultant, and deal with a variety of customers. And I don’t talk about those customers. It’s a thing. I don’t ever want to have a tweet or post where I say “Seriously? I can’t believe my customer has a setup like this!” – because my customers wouldn’t want me to say those things in public, even if I’ve kept the names private.

    Something that I see from time to time, though, which really affects the ‘niceness’ of an environment, is the paradigm that was used to solve the particular problem.

    20-something years ago, when I was at university, I did a lot of subjects that haven’t obviously impacted my career. Most of them were interesting and fairly useful (although I haven’t really been able to apply the particular nuances of using matrices to rotate the camera in a 3D-modelling environment), but one that really stands out for me as being particularly useful was a subject on Programming Paradigms. We explored functional programming, logic programming, and a few others. I can’t actually tell you the full list of paradigms we explored – I didn’t even attend most of the classes (I had stuff going on which meant I ended up only scraping through – the lowest scoring semester of my whole degree). But the impact that it had on me was an appreciation that the problems we face today shouldn’t necessarily be approached with the same hat that we wore yesterday.

    In the database space, we use a set-based programming paradigm. We apply relational theory to a schema design, and then write queries using set-based logic. This is a useful approach, but it can go too far. And when you’re writing queries that you want to behave in a particular way, the focus can shift to something quite different. Perhaps you want to create a cursor, looping through each row of a resultset and doing some amount of processing on it – iterative code within a set-based environment. It’s a different paradigm, and it can turn a nice system into a naughty one, or perhaps even turn a naughty system into a nice one.
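
    Just to be clear about the kind of thing I mean – a trivial sketch, with made-up names:

    -- Iterative code within a set-based environment: walk a resultset
    -- and process each row individually
    DECLARE @CustomerID int;

    DECLARE customer_cursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT CustomerID
        FROM dbo.Customers        -- made-up table
        WHERE IsActive = 1;

    OPEN customer_cursor;
    FETCH NEXT FROM customer_cursor INTO @CustomerID;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC dbo.ProcessCustomer @CustomerID;   -- made-up per-row processing
        FETCH NEXT FROM customer_cursor INTO @CustomerID;
    END;

    CLOSE customer_cursor;
    DEALLOCATE customer_cursor;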

    Even within the database space, we have different paradigms to apply. I see data warehouses that try to stick to a normalised design like the underlying transactional environment. I see data warehouses that demand a purely star-schema design. I see parallel systems that haven’t considered distribution theory, and parallel systems which have pushed distribution theory to the nth degree. I see indexing strategies which help, and indexing strategies which don’t.

    Usually, this comes down to the paradigm being applied. It’s generally not too hard to spot when the wrong paradigm has been used, or when a particular paradigm has been pushed too far, but it’s not always easy to quantify and measure empirically. My perspective is that the people involved should feel like things make sense. When sufficiently educated people (people who ask questions rather than blindly accept what they are told) are comfortable with the design decisions, it’s generally not a problem. When they find themselves trying to figure out what’s going on, and why a particular approach to a query has been taken, then there’s an issue. And I don’t care whether that’s a problem with a T-SQL query, or an MDX query, or a piece of ETL – I simply find that if there are experts in the place who wince a little when describing why something is the way it is, then that’s a sign that things aren’t quite right.

    Now, I’ll happily help fight battles to get these things fixed. But as a consultant, I know there are battles worth fighting, and situations worth accepting. And I know that success can be achieved despite things which are less than ideal. But when I think about whether a particular environment is worth getting a lump of coal or a nice elf-created gift, then I often look at the paradigm that was used when the system was designed. Then at least, things will make more sense.

    I hope you all have had a terrific 2015. Why not decide to write a few T-SQL Tuesday posts yourself in 2016?



  • Anti-Harassment Policies and Codes of Conduct

    I was a director on the PASS board back in 2012 when our having a Code of Conduct was first raised. A number of conferences had experienced bad situations, particularly around sexist behaviour, and it was very appropriate for PASS to take a stand and say “We don’t want this kind of thing to happen to PASS members.”

    We ALL wanted to make sure that the PASS community was a safe community – one which people could be part of without having to worry about whether there would be an “incident”. No one wanted the PASS Summit, or a SQL Saturday, or any PASS-related event, to see an “incident”. We considered that the only acceptable number of incidents was zero.

    That said, there was a certain amount of awkwardness – particularly in the days leading up to the official discussion about the proposed Code of Conduct. There was a genuine fear about how a Code of Conduct would affect the tone of PASS events. Nobody wanted to be removed from an event because of a seemingly innocuous comment, but even more, no one wanted there to be an incident of harassment. And this fear expressed itself in awkwardness, bordering on flippancy.

    As the globalisation committee (a subsection of the board including some advisors – all of whom knew about the proposed Code) sat around to discuss globalisation, the first time there was a double-entendre, instead of raising an eyebrow or saying “Oh really?” or something else, the expression of the day was “There’s a Code of Conduct violation right there...”. It was a reflection of the nervousness that people felt around what the impact would be. People wanted to maintain the informal atmosphere of the meeting, but didn’t know how to react to a double-entendre in light of the future Code of Conduct – remembering that we ALL wanted PASS to become a safer community for our members.

    We don’t tolerate harassment at all. But at what point do things become harassment? At first it felt like we were trying to define it.

    As an Australian, I see a certain amount of banter about New Zealanders. It goes both ways, and the jokes are apparently very similar. They joke that we treat our sheep in particular ways, and we say the same about them. In the 1980s, the Kiwi Prime Minister Robert Muldoon said that New Zealanders moving to Australia raised the average IQ of both countries, which I think is a fantastic sledge! To suggest that people leaving New Zealand must be less smart than the average Kiwi, but still smarter than the average Australian, is a beautifully crafted rib. Is it racist? By definition, perhaps – but I doubt anyone felt vilified by it.

    “By definition, perhaps” was the phrase that worried me.

    I knew that if we defined the Code of Conduct wrongly, then I, and many others, could easily be in breach of it. I knew that if I reacted to a double-entendre with a raised eyebrow, that could be seen as sexualising a situation. I knew that if I joked that a Tottenham fan in the room was of lower intelligence than an Arsenal fan, then that could be seen as harassment. Maybe not by the Spurs fan, but by someone else watching, who might think that I genuinely insulted them. Even to suggest that a developer has no respect for data (as many PASS presenters might suggest in a session) could be seen as unfairly assigning undesirable attributes to people. It was a concern.

    So instead of raising an eyebrow, instead of reacting to any situation in my usual way, I reacted with “There’s a Code of Conduct violation right there...”. It still achieved the joke, but in a way that acknowledged my fears of what the Code might imply. It wasn’t appropriate, and I’m sorry. The correct thing to do would have been to have just bitten my tongue and ignore it. I also wasn’t the only one in that situation – I think just about everyone in the room did the same.

    We all wanted a policy, but we didn’t know how it was going to affect us.

    As we discussed it, we were able to work out that really what we wanted was not a Code of Conduct that defined what we allowed and what we didn’t allow, because we would never have been able to get that right. What we wanted was to adopt a stance that said “We do not tolerate harassment”, and to have a procedure for what happens if someone feels harassed. What we wanted was an Anti-Harassment Policy.

    Let me express that again:

    We do not tolerate harassment.

    And I don’t want to define what harassment means for an individual. I don’t want to define that certain types of touching are okay and others are not. I don’t want to define that particular types of eye contact count as harassment. I don’t want to define the words that can be used to describe body parts (like if someone falls and says they’ve hurt their backside – do they need to be careful about the word they use?), or what counts as “acceptable swearing” at a party. If we define this, then we run the risk that someone might go right up to the defined line in harassing someone, but we haven’t provided a course of action for the victim because the harasser hasn’t broken the “Code of Conduct”.

    I do want to have well-documented processes for how to react if someone feels harassed, because I want the person who feels harassed to know they have a course of action open to them.

    I think a Code of Conduct should be around expected behaviour in particular situations. A Code of Conduct says that a session presenter should wear a collared shirt not a T-shirt. A Code of Conduct says that a sponsor should respect the geographic boundaries of other vendors’ booths. A Code of Conduct shouldn’t say “You must not use someone’s nationality as the subject of a joke” – because when Australia was beaten in the final of the Rugby World Cup, that’s an opportunity to rib them about it, but the principle of standing against racism is incredibly valid. If I suggest that Americans are stupid for considering that “could care less” means the same as “could not care less” – am I crossing the line? It probably depends on a lot of other factors.

    Let me say it again:

    I do not tolerate harassment.

    I simply recognise that what some people see as harassment, others see as friendly banter. Should Bradley Ball, Joe Sack, and Buck Woody be offended about jokes regarding their names? I don’t know. That’s entirely up to them in the situation, and the context of what’s said. Sometimes they might be fine with it, other times they might not. That’s their right. No one else gets to dictate their reaction. Should Kevin Kline have been upset that I sang Happy Birthday to him loudly, in public situations, repeatedly, for a whole day? I try to monitor those situations, and back off if they seem to be getting upset. Is my detector of people’s personal lines sometimes faulty? Sadly, yes.

    I do not tolerate my own harassment of others.

    If you have ever felt harassed by me, I’m personally sorry and honestly regret it. I know I joke. I know I often joke at other people’s expense. But I never mean to harass.

    My personal Code of Conduct varies according to the company that I’m keeping – there are times that it’s okay to point out a double-entendre, but a job interview is probably not that time. My personal Anti-Harassment Policy is not variable. I don’t tolerate harassment, and if you ever feel harassed by me, tell me. If I don’t stop (though I hopefully always do stop), then tell me again, or tell a friend of mine and get them to help me stop (because I have probably misinterpreted you – if I say ‘Oi’ to someone who calls me fat, that doesn’t necessarily mean I’m feeling harassed, even though my extra kilos bother me and I really don’t like it being pointed out).

    PASS has an Anti-Harassment Policy. As the SQL community, we don’t tolerate it, and we know what to do if someone feels harassed.

    Defining harassment is tough – it’s subjective, and individual. Making a stance to say “we don’t tolerate it” and “if you harass someone, here’s how we will respond” is a good thing.

    Let me say that again:

    The PASS community doesn’t tolerate harassment.


  • What’s driving your data model?

    If data modelling were easier, I doubt there would be as many books on the subject, and we wouldn’t have multiple methodologies to consider.

    I’m not going to explore the different methodologies here – that’s almost a religious argument these days, and I am more than happy to let you adopt whichever method you like. Instead, I want to challenge you to think about what is driving your design, and what makes you consider whether it suits your business or not.

    Time and time again I see companies that use software to help them run their business. Sometimes this is an off-the-shelf system or a cloud-based solution; sometimes it’s a bespoke system built by software developers. I’m definitely in favour of using software, and wonder how people operate without it these days.

    ...but how much is your business driven by the software? I see a lot of businesses being led by their software, rather than having the software adapt to the business. For the most part, I’m fine with either. There is a lot to be gained by using systems developed by similar businesses, and taking advantage of lessons learned by others. Letting that software help guide internal processes can be very useful.

    But I don’t think that should apply to your data models – or at the very least, you should consider carefully how much it does.

    I don’t like to write about specific customer situations, so I’m not going to describe a particular anecdote in great detail here. But I want to say that I frequently see environments where the models used within data warehouses don’t describe the business that’s going on – they describe the software that’s used.

    Many data professionals look at a data warehouse as a platform for reporting, built according to the available data sources. I disagree with this.

    The models within a data warehouse should describe the business. If they don’t, the data model has gone wrong.

    What is the central thing that your business does? What is the main interest point? What do you need to look after? For me, this forms the core of the warehouse.

    The core of the warehouse is not necessarily the main fact table, but could be one of the main dimensions. If you’re a store, do you care about sales, or do you care about customers? The difference is subtle, but might drive some important design elements.

    Two clothing stores might use the same back-end systems for their point-of-sales systems, and might have the same loyalty system set up to persuade people to keep coming back. But one store might have a focus of getting customers back, driving brand loyalty which leads to dedicated fans and word-of-mouth sales. The other store might be more about piquing interest from people walking past the door, and trying to get them to come in and pick up a bargain. Of course, there will be an element of both in both stores, but the culture amongst the staff will be slightly different, as the first tries to identify the customer, tries to make sure that the customer feels cared for, and tries to form a relationship with the customer. It’s less important that the customer buys something, so long as they are going to return. The second sees the customer as a way to get a sale, while the first sees the sale (or even the lack of a sale!) as a way to get a customer. I’m sure you can think of stores in each category.

    It would be very easy to create the same data warehouse for both stores, using a standard retail environment. But are the needs of the stores adequately met?

    There is no doubt that both stores need sales to stay afloat – the retail business requires it. But if your business culture has slightly different concerns to the industry standard, then the data model should cater for that. Perhaps you need a way of scoring customer loyalty, and some path analysis to see what helps a customer reach a particular level of engagement. Perhaps you need to start collecting extra data. Maybe the stores could consider awarding points for simply visiting the store, even if no sales are actually made. Is the person who works from a cafe and buys just one cup of coffee all morning good for business, or bad for business? Can your data model help explore this, or are you designing a system which only handles the data in your transactional system?
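
    As a rough sketch of the kind of thing I mean (made-up names, and your own design would differ), a fact table that records visits rather than just sales lets the “lack of a sale” show up in the model:

    -- Capture store visits whether or not a sale occurred, so customer
    -- engagement can be modelled directly rather than inferred from sales
    CREATE TABLE dbo.FactStoreVisit
    (
        VisitKey             bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
        DateKey              int           NOT NULL,  -- reference to the date dimension
        StoreKey             int           NOT NULL,  -- reference to the store dimension
        CustomerKey          int           NOT NULL,  -- reference to the customer dimension
        MadePurchase         bit           NOT NULL,  -- the visit that didn't convert still counts
        SalesAmount          decimal(19,4) NOT NULL DEFAULT (0),
        LoyaltyPointsAwarded int           NOT NULL DEFAULT (0)
    );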

    I like to come back to the description of a data warehouse being the single source of truth for an organisation. Many people consider this an issue for data quality – that once data is in the warehouse, it’s trusted and can be used for business analytics. But it should go beyond that. The data warehouse should have transformed the data as kept by the various software packages into data which describes the business, becoming the source of truth about the business. The reports and dashboards across this data should help identify the culture of the organisation, by highlighting its values and ideals.

    The starting point for a data warehouse design should not be “What are the facts we need to measure?” but rather “What are we about as a business?” – often similar, but occasionally not. Ask what success looks like and what questions will address that.

    Don’t ignore the bottom line, but also don’t ignore what's really important to the business.


    This post was prompted by the seventy-second monthly T-SQL Tuesday, hosted this month by Mickey Stuewe (@sqlmickey).


  • PASS Summit 2015 WiT lunch

    A regular spot at the PASS Summit is the Women in Technology lunch. This year is no different.

    A few years ago, I was on a panel for discussion at the lunch. Over the last couple of years, though, the format has changed to an interview with a champion for WiT. This year, Angie Chang is being interviewed about an initiative called HackBright, which helps women form careers in IT. Angie has also been involved with Girl Geek Dinners.

    HackBright runs classes that are for women only. This is terrific, and men in technology need to understand how important this is.

    My daughter goes to a girls’ school. She can learn there without any prejudice about which activities are suited to boys and which are more suited to girls. So she learns how to program a robot and plays cricket, alongside gymnastics, dance, and choir. She sees no differentiation between these things, and will be able to discover skills that she might not have developed if she’d had to compete against boys. I don’t play cricket myself, but I know that in co-ed schools, it’s only really the boys that play cricket or program computers.

    Angie is talking about scholarships that are available through HackBright, sponsored by some of the leading employers in the IT space. Clearly there are companies who have realised the value of technical women, and who want to ensure that their companies are welcoming to women. They talk about increasing the number of women who are getting into IT, but also about understanding that if organisations don’t provide cultures that encourage women to stay and develop their careers there, then the efforts of companies like HackBright get wasted, and the IT industry doesn’t improve.

    A question has just come in from a guy who has a 16yo daughter in the San Francisco area, and he’s asked if his daughter is welcome to go to the Girl Geek Dinners.

    Naturally, the answer is yes. :)

