
Michael Coles: Sergeant SQL

SQL Server development, news and information from the front lines

  • SQLSaturday #69 - Philly Love

    Thanks to the Philly SQL Server User Group (PSSUG) and to everyone who attended SQLSaturday #69 in the City of Brotherly Love yesterday. It was a great event with a lot of great people. My presentations are available for download at the links below: 

    http://www.sqlsaturday.com/viewsession.aspx?sat=69&sessionid=3333

    http://www.sqlsaturday.com/viewsession.aspx?sat=69&sessionid=3334

    I just went through my speaker evaluations, and I'm happy to report the response was pretty positive across the board. Having lived in Philly, I know Philadelphians aren't shy about telling you how they really feel, so I really appreciate the positive feedback.

    For those of you who wrote comments with areas for improvement, rest assured I appreciate the feedback and I'll work your suggestions into future presentations!

    Next stop is SQLSaturday #71 in Boston (http://www.sqlsaturday.com/71/eventhome.aspx). Lots of top-notch speakers are presenting at this one, and I'm looking forward to learning a little PowerShell from the master, and maybe picking up a little bit o' that DBA stuff this time around :)

    See you in Boston!

     

  • SQLSaturday #60 - Cleveland Rocks!

    Looking forward to seeing all the DBAs, programmers and BI folks in Cleveland at SQLSaturday #60 tomorrow!  I'll be presenting on (1) Intro to Spatial Data and (2) Build Your Own Search Engine in SQL.  I've reworked the Spatial Data presentation based on feedback from previous SQLSaturday events and added more sample code.  I also expanded the Build Your Own Search Engine code samples to demonstrate additional FILESTREAM functionality.

    See you all tomorrow!

    A little road music, please!  http://www.youtube.com/watch?v=vU0JpyH1gC8


  • Why "Tailoring" Your Resume Is Bad

    I was just writing a response to a comment on my "Sell Yourself!" presentation (http://sqlblog.com/blogs/michael_coles/archive/2010/12/05/sell-yourself-presentation.aspx#comments), and it started getting a little lengthy so I decided to turn it into a blog post.  The "Sell Yourself!" post got a couple of very good comments on the blog, and quite a few more comments offline. 

    I think I'll start this one with a great exchange from the movie "The Princess Bride":

    Vizzini: HE DIDN'T FALL? INCONCEIVABLE.
    Inigo Montoya: You keep using that word. I do not think it means what you think it means.

    So before we get started on why I think "tailoring" your resume is a complete waste of time, let's make sure we're all talking about the same thing when we say "tailoring".  There are two types of "tailoring" your resume:

    1.  You create multiple versions of your resume targeting different areas of interest to you.  Shannon Lowder mentioned that he keeps a "master resume" with all his experience listed, and pulls content out of it for each position he's applying for.  He indicates he's had good success with this method.  I'm not sure if Shannon creates a completely new version of his resume for every job posting or just general versions for different types of positions.  I would recommend the latter, especially if you are planning on sending your resume out to a *lot* of recruiters and/or companies.  It would be difficult (though not impossible) to manage 20, 30 or more copies of your resume with only slight variations between them.

    Shannon uses a method that is actually very similar to one of my suggestions, which is to keep a Word document with descriptions of *all* your accomplishments (no matter how small).  Update the document regularly (I update mine at least once every 3 months, or more often if anything interesting happens).  When it's time to update your resume you can easily grab the relevant accomplishments from your master document and format them for your resume.  Even if you're happy at your current job, with no plans to ever leave, your master document comes in very handy at annual review time when the boss asks you "what were your accomplishments this year?".  It's a very small investment for a potentially significant return.

    I know of several people who have had success creating multiple versions of their resume, but there are some guidelines to keep in mind:

    • Make sure each version of your resume reflects a position you would *want* to be in.  If I'm interested in SQL Server development, .NET or SSIS development positions, it doesn't make much sense to create a custom resume highlighting my 3 months of Linux experience.
    • Manage your resumes well and make sure you send the right resume to the right recruiter.  Sending the wrong resume to the wrong recruiter, or bringing the wrong version of your resume to a job interview will only serve to confuse people.
    • Create your multiple versions of your resumes before you make first contact.  That is to say, put your multiple resume versions together before you approach a recruiter.  More on this later.

    So if it's all good, and people have success with it, what's the downside to "tailoring" your resume?  Well, that's where we get into #2.

    2.  The flip side of tailoring your resume is the last-minute rush job.  This happens after a recruiter has contacted you (or vice versa) and she tells you something along the lines of "Java development is required for this job, can you tailor your resume to highlight your Java experience?"

    This one calls for a little introspection.  If you were really interested in a position requiring skills that aren't highlighted on your resume, why wasn't that skill highlighted on your resume (or one of your resume variants) to begin with?  Considering you weren't interested enough to highlight that skill in the first place, do you think you're going to be happy in a job where it's a core requirement?  If you decide you'd love that job and leaving it off the resume the first time was just an oversight, then you should thoroughly integrate it into your resume and go for that job.  Here's the problem with "tailoring" your resume on-demand at a recruiter's request:

    • The recruiter probably wants it turned around quickly, usually within a day or two (at most).  You don't usually have adequate time to ensure quality.
    • Whatever you slap on your resume at the last minute will most likely have that tacked-on feel; it might have the "doesn't belong there" quality to it.  This can actually diminish the impact of your resume.
    • When you make last-minute changes to your resume you increase the odds that you'll introduce a typo or other error.  As I mention in the presentation your resume is your primary marketing tool.  You've most likely spent countless hours writing, formatting, spell-checking, reviewing, adjusting, tweaking, and perfecting your resume.  You can easily destroy all that hard work in a matter of minutes.
    • In my experience (this is just my experience, mind you), last-minute changes to your resume are almost completely ineffective.

    As Andy Warren points out "...the challenge is that if you don't tailor, the recruiter may perceive that you're not willing to help you both be successful."  I agree with Andy that if you're not willing to make last-minute changes to your resume the recruiter won't be happy; but a good recruiter should ask the same questions you asked above:  Why isn't that skill already highlighted?  Any recruiter worth her salt will wonder this.  A great recruiter will ask you that question and probe deeper to find out if this job really is a good fit for you.  To turn Andy's point around: If you change your resume to get a job that requires skills you're not really interested in, neither you, the recruiter, nor the company who hired you has achieved success.

    The important thing is to consider the concept of "tailoring".  Classic tailoring (making, mending clothes) requires thorough planning before the first cut is made in the cloth; well before the first stitch is stitched.  Randomly cutting and sewing pieces of extra cloth into a $1,000 tailored suit at the last minute is not a recipe for success.  The same thing goes for your high-quality resume.

    So, the bottom line for me is that creating multiple versions of your resume (well before you send it to recruiters) is not a bad thing.  I'm all for quality "tailoring" like this -- just make sure you manage it well.  Low-quality last-minute on-demand "tailored" changes to your resume don't seem to improve the odds of getting a quality job.

  • This Year's SQL Christmas Card

    This year's Christmas Card is similar to last year's.  I used the geometry data type again for a spatial data design.  Just download the attachment, unzip the .SQL script and run it in SSMS.  Then look at the Spatial Data preview tab for the result.
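
    If you just want a quick taste of how the Spatial Data preview tab works before running the card script, here's a trivial (and decidedly non-festive) example of my own -- any query that returns geometry values will light it up:

    -- Not the card itself -- just a minimal geometry value that SSMS
    -- will draw on the spatial preview tab.
    SELECT geometry::STGeomFromText('POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))', 0) AS SimpleSquare;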

    Also don't forget to visit http://www.noradsanta.org/ if your kids want to track Santa.

    Merry Christmas, Happy Holidays and have a great new year!

  • Here Comes the FY11 Earmarks Database

    I'm really interested in politics (don't worry, I'm not going to start bashing politicians and hammering you with political rage).  The point is that when the U.S. FY11 Omnibus Spending Bill (the bill to fund the U.S. Government for another year) was announced it piqued my interest.  I'm fascinated by "earmarks" (also affectionately known as "pork").  For those who aren't familiar with U.S. politics, "earmark" is a slang term for "Congressionally Directed Spending".  It's basically the set of provisions added to bills by Senators and Representatives in the U.S. Congress, directing money to be spent on certain programs or projects without all the hassle of introducing a separate bill, sending it through umpteen committees, and getting enough supporters in both chambers to vote it up.  Most of the time Senators and Representatives introduce earmarks that send money back to their home states (see this guy for details).

    Now it looks like the Omnibus Bill is dead for now, but the record of the earmarks is still available at http://appropriations.senate.gov/news.cfm in PDF format (no, you don't have to visit Wikileaks).  Just look for the "Earmarks" links at the bottom of the page under "All News Items".  So what I have been doing the past few days, since the bill was announced, is loading the earmarks into a SQL database.  It turned out to be a little more complex than it should have been.  There was the issue of extracting the data out of the PDF files, normalizing it, ETLing it into SQL Server, cleaning it up (a *lot* of manual cleanup required), adding and cleaning up reference data, and finally linking the earmarks to the reference data.

    You can download the resulting SQL 2008 database backup file from this post and restore it to your test server to play around with.  Here are some sample queries I ran:

    SELECT SUM(e.Amount)
    FROM dbo.Earmark e;

    Result:
    8313820025

    (That's $8.3 billion in earmarks -- note that some earmarks did not specify $ amounts, so this number is low)

    Want to see total earmarks requested by party affiliation?

    WITH CTE
    AS
    (
      SELECT DISTINCT e.EarmarkID, e.Amount, c.Party
      FROM dbo.Earmark e
      INNER JOIN dbo.EarmarkCongress ec
        ON e.EarmarkID = ec.EarmarkID
      INNER JOIN dbo.Congress c
        ON ec.CongressID = c.CongressID
    )
    SELECT Party, SUM(Amount)
    FROM CTE
    GROUP BY Party;

    Results:
    D, 6779032457
    R, 2933829318
    I, 216149500

    This one actually double-counts some items since the same earmark can be sponsored by a Democrat (D), a Republican (R), and/or an Independent (I) all at the same time.

    And if you want to see how much was requested by your Senators and Representatives:

    SELECT c.Chamber,
      c.LastName,
      c.FirstName,
      c.Party,
      SUM(e.Amount)
    FROM dbo.Earmark e
    INNER JOIN dbo.EarmarkCongress ec
      ON e.EarmarkID = ec.EarmarkID
    INNER JOIN dbo.Congress c
      ON ec.CongressID = c.CongressID
    WHERE c.State = 'NJ'
    GROUP BY c.Chamber,
      c.LastName,
      c.FirstName,
      c.Party;

    Results:
    House, Adler, John D, 23848125
    House, Andrews, Rob D, 27710000
    House, Holt, Rush D, 103953000
    House, Pallone, Frank D, 21810000
    House, Pascrell, Bill D, 15143000
    House, Payne, Donald D, 52908000
    House, Rothman, Steven D, 107153000
    House, Sires, Albio D, 120141000
    Senate, Lautenberg, Frank D, 239061125
    Senate, Menendez, Bob D, 243982125

    Again there can be double-counting since an earmark can have multiple sponsors (very likely when you're talking about Senators and Representatives from the same state).
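
    If the double-counting bothers you, one option (a quick sketch of my own, using only the tables and columns from the queries above) is to split each earmark's amount evenly across its sponsors so the per-party totals add back up to the overall sum:

    -- Allocate each earmark's amount evenly across its sponsors,
    -- then total the allocated amounts by party.
    WITH SponsorCounts
    AS
    (
      SELECT ec.EarmarkID, COUNT(*) AS SponsorCount
      FROM dbo.EarmarkCongress ec
      GROUP BY ec.EarmarkID
    )
    SELECT c.Party,
      SUM(e.Amount * 1.0 / sc.SponsorCount) AS AllocatedAmount
    FROM dbo.Earmark e
    INNER JOIN SponsorCounts sc
      ON e.EarmarkID = sc.EarmarkID
    INNER JOIN dbo.EarmarkCongress ec
      ON e.EarmarkID = ec.EarmarkID
    INNER JOIN dbo.Congress c
      ON ec.CongressID = c.CongressID
    GROUP BY c.Party;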

    Also note that I had to add in the non-voting Congressional delegates from U.S. territories (Puerto Rico, U.S. Virgin Islands, Guam, etc.) and Washington, D.C. to the reference data.  They don't vote, but they're allowed to attach earmarks to bills to bring some cash back home.

    I'd love to see what others are able to do with this data.  Some ideas -- use Reporting Services to map earmarks across the country; come up with clever queries to allocate earmarks equally across sponsors to avoid the double-counting issues; some Top n style queries by amounts, counts, etc.; combine it with spatial data and other reference data out there; compare things like # of sponsors for an earmark vs. the amount of the earmark, or seniority of sponsors vs. the amount of the earmark; maybe do something clever with it in PowerPivot.  If you find this type of data interesting and you download the database to play with, let us know the results and what you come up with.

    Democracy - It's what's for dinner

  • Sell Yourself! Presentation

    Thanks to everyone who attended my "Sell Yourself!" presentation at SQLSaturday #61 in Washington, D.C., and thanks to NOVA SQL for setting up the event!

    I'm uploading the presentation deck here in PDF, original length, with new materials (I had to cut some slides out due to time limits).  This deck includes a new section on recruiters and a little more information on the resume.

    BTW, if you're rewriting your resume I highly recommend the book Elements of Resume Style by S. Bennett.  I've used it as a reference when rewriting my resume and when helping others, and it's a very valuable tool.  There are one or two things I disagree with the author about (for instance, he recommends against using bulleted lists in a resume, while I think they're great for emphasis and readability in certain areas so long as they're not overused); but overall the book has plenty of solid advice on how to get the most out of your resume.

    Also, if you haven't done so yet, check out Steve Jones' presentation "The Modern Resume: Building Your Brand" and his new blog: http://modernresume.blogspot.com/.  Steve is a SQL MVP, entrepreneur (a founder and editor-in-chief at SQL Server Central: http://www.sqlservercentral.com), SQL Server guru, and all-around great guy.  His new career-oriented professional-development blog and presentations are full of great career advice and tips for SQL Professionals.

  • Try-N-Save SSIS Packages

    The Try-N-Save SSIS Packages from my SSIS Dimensional Data Optimization presentation are available at http://cid-8f7e6c950afc6b3e.office.live.com/self.aspx/.Public/Presentation/TryNSave.zip.

    I'm still working on getting the sample database uploaded - even compressed a backup is larger than SkyDrive's upload filesize limit. I'll script it out when I have time (in addition to DDL, there are some tables that need to be prepopulated).


  • High Performance Dimensional Data Loads With SSIS Presentation

    Just finished giving the SSIS High-Performance Dimensional Data Load presentation at SQLSaturday #59 NYC.  Here are the slides in PDF format.  I'll upload the Try-N-Save code and sample data later for attendees to play with.

    Thanks to everyone who attended my session and thanks to Melissa D. and NJSQL for putting this together.  For those who are interested in Alejandro Mesa's composable DML solution to the problem of Type 2 dimension updates, here's the complete statement from the demo:

    INSERT INTO Dim.Geography_Hash
    (
     CityName, CountyFIPS, CountyName, StateFIPS,
     StateUSPS, StateName, ZIP, TimeOffset,
     DaylightSavingTime, StartDateID, CurrentFlag, BatchID,
     LineageID, CubeInd, SortOrder, Hash
    )
    SELECT CityName, CountyFIPS, CountyName, StateFIPS,
     StateUSPS, StateName, ZIP, TimeOffset,
     DaylightSavingTime, StartDateID, CurrentFlag, BatchID,
     LineageID, CubeInd, SortOrder, Hash
    FROM
    (
     MERGE INTO Dim.Geography_Hash AS Target
     USING Staging.Geography_Hash AS Source
     ON Target.ZIP = Source.ZIP
      AND Target.CurrentFlag = Source.CurrentFlag
     WHEN MATCHED AND Target.Hash <> Source.Hash
      THEN UPDATE SET CurrentFlag = 'N'
     WHEN NOT MATCHED
      THEN INSERT
      (
       CityName, CountyFIPS, CountyName, StateFIPS,
       StateUSPS, StateName, ZIP, TimeOffset,
       DaylightSavingTime, StartDateID, CurrentFlag, BatchID,
       LineageID, CubeInd, SortOrder, Hash
      )
      VALUES
      (
       Source.CityName, Source.CountyFIPS, Source.CountyName, Source.StateFIPS,
       Source.StateUSPS, Source.StateName, Source.ZIP, Source.TimeOffset,
       Source.DaylightSavingTime, Source.StartDateID, Source.CurrentFlag, Source.BatchID,
       Source.LineageID, Source.CubeInd, Source.SortOrder, Source.Hash
      )
     OUTPUT $action, inserted.CityName, inserted.CountyFIPS, inserted.CountyName, inserted.StateFIPS,
       inserted.StateUSPS, inserted.StateName, inserted.ZIP, inserted.TimeOffset,
       inserted.DaylightSavingTime, inserted.StartDateID, inserted.CurrentFlag, inserted.BatchID,
       inserted.LineageID, inserted.CubeInd, inserted.SortOrder, inserted.Hash
    )
    AS T
    (
     action, CityName, CountyFIPS, CountyName, StateFIPS,
     StateUSPS, StateName, ZIP, TimeOffset,
     DaylightSavingTime, StartDateID, CurrentFlag, BatchID,
     LineageID, CubeInd, SortOrder, Hash
    )
    WHERE action = 'UPDATE';

    As mentioned, the OUTPUT clause on the inner MERGE statement feeds the outer INSERT clause.  Next stop is SQLSaturday #61 in DC at the beginning of December.

  • Dynamic SQL and Late Binding

    We all know that dynamic SQL runs in its own scope (see here for details), and that if used improperly it can have serious security implications.  But it's also important to realize that dynamic SQL evaluates your SQL statements more like a stored procedure than a regular script.  In a stored procedure, if a table you reference already exists, the procedure binds to the table structure as it exists at creation time; the procedure will late bind to a referenced table only if that table does not exist at creation time.  Dynamic SQL behaves the same way: if the table already exists, SQL Server validates the columns you reference against it at parse time.

    Consider the following SQL code.  In this code we create a table named xyz in one batch.  In the subsequent batch we check for the existence of table xyz, drop it if it already exists, and then recreate it.  In this scenario table xyz starts with two columns named i and c; when I rebuild the table in the second batch I will change the table structure to 3 columns--i, c and x:

    DBCC FREEPROCCACHE;
    DBCC DROPCLEANBUFFERS;
    GO

    CREATE TABLE dbo.xyz
    (
      i int,
      c varchar(100)
    );
    GO

    IF OBJECT_ID('dbo.xyz') IS NOT NULL
      DROP TABLE dbo.xyz;

    CREATE TABLE dbo.xyz
    (
      i int,
      c varchar(100),
      x varchar(100)
    );

    INSERT INTO dbo.xyz
    (
      i,
      c,
      x
    )
    VALUES
    (
      1,
      'hello',
      'good bye'
    );
    GO

    DROP TABLE dbo.xyz;
    GO

    This code runs successfully.  Now let's convert the second batch to dynamic SQL and execute the sample again:

    DBCC FREEPROCCACHE;
    DBCC DROPCLEANBUFFERS;
    GO

    CREATE TABLE dbo.xyz
    (
      i int,
      c varchar(100)
    );
    GO

    DECLARE @sql nvarchar(1000);
    SET @sql = N'
    IF OBJECT_ID(''dbo.xyz'') IS NOT NULL
      DROP TABLE dbo.xyz;

    CREATE TABLE dbo.xyz
    (
      i int,
      c varchar(100),
      x varchar(100)
    );

    INSERT INTO dbo.xyz
    (
      i,
      c,
      x
    )
    VALUES
    (
      1,
      ''hello'',
      ''good bye''
    );';

    EXEC (@sql);
    GO

    DROP TABLE dbo.xyz;
    GO

    The dynamic SQL example returns the following error message:

    Msg 207, Level 16, State 1, Line 12
    Invalid column name 'x'.

    Why is this?  Well, the xyz table exists, but the column x referenced in the dynamic SQL INSERT does not exist at parse time.  It would exist after the CREATE TABLE statement ran, but the INSERT fails to parse before any of the dynamic SQL is executed.  If you comment out the final DROP TABLE statement you can see that the DROP TABLE and CREATE TABLE statements in the dynamic SQL were never executed.  If you're doing something like this with dynamic SQL, you can ensure that the DROP TABLE and CREATE TABLE are executed before the INSERT statement is parsed by breaking up your dynamic SQL as shown below:

    DECLARE @sql nvarchar(1000);
    SET @sql = N'
    IF OBJECT_ID(''dbo.xyz'') IS NOT NULL
      DROP TABLE dbo.xyz;

    CREATE TABLE dbo.xyz
    (
      i int,
      c varchar(100),
      x varchar(100)
    );';

    EXEC (@sql);

    SET @sql = N'
    INSERT INTO dbo.xyz
    (
      i,
      c,
      x
    )
    VALUES
    (
      1,
      ''hello'',
      ''good bye''
    );';

    EXEC (@sql);
    GO

    As an interesting (if not entirely useful) aside, if you create the three-column table above, insert data into it, then drop it and recreate it with two columns, the dynamic SQL succeeds.  SQL Server appears to remember the old table structure and the script runs just fine.

    DBCC DROPCLEANBUFFERS;
    DBCC FREEPROCCACHE;
    GO

    CREATE TABLE dbo.xyz
    (
      i int,
      c varchar(100),
      x varchar(100)
    );
    GO

    INSERT INTO dbo.xyz
    (
      i,
      c,
      x
    )
    VALUES
    (
      1,
      'a',
      'b'
    );
    GO

    IF OBJECT_ID('dbo.xyz') IS NOT NULL
      DROP TABLE dbo.xyz;
    GO

    CREATE TABLE dbo.xyz
    (
      i int,
      c varchar(100)
    );
    GO

    DECLARE @sql nvarchar(1000);
    SET @sql = N'
    IF OBJECT_ID(''dbo.xyz'') IS NOT NULL
      DROP TABLE dbo.xyz;

    CREATE TABLE dbo.xyz
    (
      i int,
      c varchar(100),
      x varchar(100)
    );

    INSERT INTO dbo.xyz
    (
      i,
      c,
      x
    )
    VALUES
    (
      1,
      ''hello'',
      ''good bye''
    );';

    EXEC (@sql);
    GO

    DROP TABLE dbo.xyz;
    GO

    The trick is to remember that if the table already exists, dynamic SQL is looking at your table structure at parse time not at run time.  So your target table structure has to fit the dynamic SQL statements that are being parsed.

  • T-SQL Tuesday #006 Round-up!

    T-SQL Tuesday this month was all about LOB (large object) data.  Thanks to all the great bloggers out there who participated!  The participants this month posted some very impressive articles with information running the gamut from Reporting Services to SQL Server spatial data types to BLOB-handling in SSIS.


    One thing I noticed immediately was a trend toward articles about spatial data (SQL Server 2008 Geography and Geometry data types, a very fun topic to explore if you haven’t played around with it yet!)  So without further ado, and in no particular order, here’s the roundup:

     

    • Joseph Richberg was first across the finish line with a discussion of a project in which he loads PDF files into SQL Server.  Be sure to check out how he detects changes in large PDF files during the ETL process: http://josef-richberg.squarespace.com/journal/2010/5/10/t-sql-tuesday-6-what-about-blob.html.

    • Prolific SQL Server author and blogger Stacia Misner posted an excellent tutorial on combining spatial data with other relational data and surfacing it all with Reporting Services.  If you plan to do anything with spatial data in SQL Server 2008, BOOKMARK THIS NOW!  http://blog.datainspirations.com/2010/05/11/sql-server-2008-r2-reporting-services-the-word-is-but-a-stage-t-sql-tuesday-006/.

    • SQL Server guru Pinal Dave posted a great resource post for spatial data from around the world, complete with demonstration queries, at http://blog.sqlauthority.com/2010/05/11/sql-server-spatial-database-queries-what-about-blob-t-sql-tuesday-006/.  This is one to bookmark, folks!

    • First of all, kudos to Jason Brimhall on that crazy picture of a blobfish!  As a bonus, Jason gave us a great article on optimizing queries when LOB data is involved -- the discussion of the sys.dm_db_index_physical_stats DMV (plus the sample code) is worth the bookmark! http://jasonbrimhall.info/2010/05/11/t-sql-tuesday-006-a-blobbing-we-will-go/.

    • Chad Miller grabbed my attention with his fascinating article on using PowerShell to move BLOB data around.  I haven't done much with PowerShell to date, but I'm going to use his article as my jumping off point for getting my hands dirty with this powerful scripting language.  Check it out!  http://sev17.com/2010/05/t-sql-tuesday-006-blobs-filestream-and-powershell/.

    • Paul Randal provided the inside scoop on the negative performance effects of LOB data when you shrink your databases.  DBAs in particular need to check this one out! http://sqlskills.com/BLOGS/PAUL/post/Why-LOB-data-make-shrink-run-slooooowly-(T-SQL-Tuesday-006).aspx.

    • John "SSIS Master" Welch explored how SSIS handles LOB data. Check out his links to previous articles on how to move files to/from LOB columns using the Import Column and Export Column transformations.  http://agilebi.com/cs/blogs/jwelch/archive/2010/05/11/t-sql-tuesday-006-lobs-in-the-ssis-dataflow.aspx.

    • Stefan Bauer posted a nice article about setting up, configuring and testing FILESTREAM data at http://stef-bauer.com/2010/05/10/t-sql-tuesday-006-configure-filestream-storage-for-blobs/.

    • Michael Zilberstein wrote a nice article about some experiments he did to answer a question he had about LOB data and locking.  http://sqlblog.com/blogs/michael_zilberstein/archive/2010/05/11/24999.aspx.

    • Rob "Thunder From Down Under" Farley followed up his great article on exploded data with the Reporting Services map control with some pointers on how to get ESRI shapefile data loaded into SQL Server to begin with.  http://sqlblog.com/blogs/rob_farley/archive/2010/05/11/spatial-data-from-shapefiles-for-t-sql-tuesday-006.aspx.

    • Steve "SQL Ninja" Jones took us on a trip in his wayback machine with a nice article about the history of LOB data and some ideas for how you can get the best performance out of your large object data: http://www.sqlservercentral.com/blogs/steve_jones/archive/2010/05/11/put-your-left-lob-in_2C00_-put-your-left-lob-out_2620_.aspx.

    • Vamsi "the SQL Slugger" (!) posted hints for LOB data optimization at http://sqlslugger.com/archive/2010/05/t-sql-tuesday-006-what-about-blob/.

    • I rounded out this month with a sample spatial database that I use in live demos and presentations.  I created the database from the U.S. Census Bureau's Tiger/Line database, and it's definitely a good way to start playing with spatial data without the pain of creating your own spatial database from scratch.  http://sqlblog.com/blogs/michael_coles/archive/2010/05/11/t-sql-tuesday-006-tiger-line-spatial-data.aspx.

     

    Enjoy all the great articles and here’s looking forward to another great T-SQL Tuesday next month!

  • T-SQL Tuesday #006: Tiger/Line Spatial Data

    This month’s T-SQL Tuesday post is about LOB data http://sqlblog.com/blogs/michael_coles/archive/2010/05/03/t-sql-tuesday-006-what-about-blob.aspx.

     

    For this one I decided to post a sample Tiger/Line SQL database I use all the time in live demos. For those who aren't familiar with it, Tiger/Line is a dataset published by the U.S. Census Bureau. It contains nicely detailed geospatial data, going from the U.S. state level all the way down to the street, feature and landmark level. Tiger/Line data is very complete and detailed--but the best part is it's FREE.  There are lots of applications for Tiger/Line, like national [U.S.] and local mapping and geocoding applications [applications that convert street addresses to (latitude, longitude) coordinates].  All this great data is distributed in the form of a ton of ESRI shapefiles.

     

    A shapefile is basically a file format that contains shape objects like points, lines and polygons.  SQL Server doesn't natively understand ESRI shapefiles, but it does store the same kinds of geospatial objects (points, lines and polygons), and there are some handy utilities out there for loading these files into SQL Server.  Morten Nielsen has a great utility for loading shapefiles into Geometry and Geography data types at http://www.sharpgis.net/page/Shape2SQL.aspx.  Because of the volume of data involved in this project (I loaded hundreds of shapefiles) I decided to roll my own small set of SSIS custom components that read ESRI shapefiles and convert them to SQL Server spatial data types [keep an eye out -- these components are scheduled to be published with source code by SQL Server Standard magazine in the near future].

     

    The sample database can be downloaded from http://www.sqlkings.com/downloads/Tiger_Sample.zip


    In the future I’ll be sharing some code samples on the blog to demonstrate Tiger/Line data (as well as spatial data from other sources) based on this database.

     

    Here are a couple of quick queries you can run against this sample database to view the spatial data in SSMS 2008.

     

    SELECT *
    FROM Tiger.State;

    US of A

    SELECT c.*
    FROM Tiger.State s
    INNER JOIN Tiger.County c
      ON s.STATEFP = c.STATEFP
    WHERE s.STUSPS = 'TX';

    All my ex's live in Texas

    SELECT *
    FROM Tiger.ZCTA z
    WHERE z.ZCTA5CE00 LIKE '0[7-8]%';

    Jersey--Howyadoin?

     

    This last one works because all of the ZIP Code tabulation areas for the state of New Jersey start with '07' and '08'.  There are similar relationships between other ZCTA prefixes and their states.

     

    Next time we'll look at using SQL Server-based spatial data with online mapping programs like Bing maps.

     

    *For more information about Tiger/Line data visit http://www.census.gov/geo/www/tiger/tgrshp2009/tgrshp2009.html.

  • T-SQL Tuesday #006: "What About BLOB?"

    Invitation for T-SQL Tuesday #006:  "What About BLOB?"

     

    It's getting warm outside just in time for the May T-SQL Tuesday blog party.  I’ll be your host this month--and the secret word for this T-SQL Tuesday is "Large Object (LOB) Data".

     

    What About BLOB? 

     

    What’s T-SQL Tuesday?

     

    About 6 months ago Adam Machanic (Twitter: @AdamMachanic) decided to throw a worldwide blog party.  Every month Adam picks a host to post the topic and the rules.  Everyone who wants to participate publishes a blog entry on the topic of the day, all during the same 24 hour time period.

     

    Here’s the shortlist of the roundups from the first 5 T-SQL Tuesdays:

     

    • #001 Adam Machanic asked for your favorite Date/Time tricks
    • #002 Adam followed up by asking for your favorite Puzzling Situations
    • #003 Rob Farley made the commitment with his topic: Relationships
    • #004 Mike Walsh asked for your input on IO
    • #005 Aaron Nelson asked us to report on "Reporting"

    And that brings us up-to-date for #006: LOB Data.  Anyone can participate - just post a blog entry related to the topic and link it back to this page.


    To participate your post must go live between 00:00:00 GMT on Tuesday the 11th of May and 00:00:00 GMT on Wednesday the 12th.  See the table below for some quick conversions to other time zones.

     

    Time Zone    Start                  End
    GMT          2010-05-11 00:00:00    2010-05-12 00:00:00
    EST          2010-05-10 20:00:00    2010-05-11 20:00:00
    PST          2010-05-10 17:00:00    2010-05-11 17:00:00


    What Is LOB Data?

     

    MSDN conveniently defines Large Object ("LOB") data types for us: "LOB data types are those that exceed the maximum row size of 8 kilobytes (KB)."

     

    There have been several improvements in LOB data functionality in SQL Server 2008 (there were even some in SQL Server 2005).  In 2008 the XML, GEOMETRY and GEOGRAPHY data types can all hold up to 2.1 GB of data.  CLR data types can also hold up to 2.1 GB of data.  So the question of the day is how do you use LOB data?  Here are a few possible starting points (with a quick size demo after the list):

     

    • LOB data storage, optimization, limitations, "under-the-hood"
    • Indexing, querying, optimization, tricks, tips, performance tuning of LOB data
    • Interesting uses/projects for LOB data types:
      • The MAX data types (VARCHAR(MAX), NVARCHAR(MAX), VARBINARY(MAX))
      • XML
      • GEOMETRY/GEOGRAPHY (spatial)
      • CLR data types
    • FILESTREAM hints, tips, tricks, .NET SqlFileStream Class
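
    Just to make those size limits concrete, here's a tiny sketch of my own (the variable and values are arbitrary) showing a MAX type holding far more than the 8 KB that fits in a single row:

    -- nvarchar(max) can hold up to 2^31-1 bytes (~2.1 GB);
    -- this one value is already well past the 8 KB row-size limit.
    DECLARE @big nvarchar(max);
    SET @big = REPLICATE(CONVERT(nvarchar(max), N'x'), 100000);
    SELECT DATALENGTH(@big) AS BytesStored;  -- 200000 bytes (2 bytes per character)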

    The only rule is that your topic has to involve SQL Server's LOB data types in some form.  If you want to demonstrate handling LOB data in .NET, for instance, go for it.  If you want to demonstrate Oracle LOB data handling, this might not be the place to do it (although a comparison of the two might be interesting...) :)

     

    Here Come the Lawyers...

     

    Once again, please note the time is in GMT (aka UTC). As other hosts have mentioned, feel free to write your post ahead of time and schedule it for publication. The rules are based on when the post goes live, not when it is written. The rules for T-SQL Tuesday haven’t changed, but I’ll cover them again here:

    1. Your post must go live between 00:00:00 GMT on Tuesday the 11th of May and 00:00:00 GMT on Wednesday the 12th (see my handy-dandy time zone conversion table above for more info). If your post doesn’t go live in the time limits, it won’t be included in the round-up post.
    2. Your post must link back to this post (by trackback or comment).
    3. "T-SQL Tuesday #006" MUST be in the title of the post.
    4. It is your responsibility to verify the trackback or comment appears here on this post.  If you don’t see your trackback add your own comment with a link back to your T-SQL Tuesday post and it will be included in the roundup.

     

    Note that some bloggers have reported an issue that their blog sites don’t properly generate a trackback when they preschedule a post for publication.  After you publish your post, be sure to check here to see if a trackback was generated.  Feel free to post a comment here after your post goes live if you just want to play it safe.

     

    Twitter

     

    A lot of the T-SQL Tuesday bloggers and bloggees are on Twitter. Follow the hashtag #TSQL2sDay and when your post goes live, tweet a link to it with that tag.

     

    Want To Host?

     

    It's easy to become a host (hey, even I did it!):

     

    1. Participate in at least two events.

    2. Let Adam Machanic know you want to host.

     

    You can tweet Adam or leave a comment on his blog, as described in his first T-SQL Tuesday Invitation.

     

    If you have any questions, concerns or general comments, leave me a message on this post.  I'll respond to your post and/or make updates to this post as required.  Have fun thinking of a topic and I’m looking forward to all the great posts!

     

    Sponsors

     

    This episode of T-SQL Tuesday was brought to you by the letters "M", "A", "X", and the number 2,147,483,647.

     

  • Find a Hash Collision, Win $100

    Margarity Kerns recently published a very nice article at SQL Server Central on using hash functions to detect changes in rows during the data warehouse load ETL process.  On the discussion page for the article I noticed a lot of the same old arguments against using hash functions to detect change.  After having this same discussion several times over the past several months in public and private forums, I've decided to see if we can't put this argument to rest for a while.  To that end I'm going to hold a little contest:  Generate an SHA-1 hash collision and win $100 and a book (see bottom section for details).  Before I get into the details of the contest I'm going to give a little background of how this came about.

    Background Info

    NOTE: If you aren't familiar with hash functions I highly recommend first reading the Wikipedia article at http://en.wikipedia.org/wiki/Cryptographic_hash_function.

    The idea of using a hash function for change detection is not new.  Essentially a hash function generates a "fingerprint" of your data that you can use to compare an inbound row and an existing row.
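
    As a quick illustration of the fingerprint idea (my own sketch, not code from the article), you can concatenate the non-key columns with a delimiter, hash the result, and store that hash alongside the row:

    -- A minimal row "fingerprint" over two non-key columns using SHA-1.
    DECLARE @FirstName nvarchar(50),
        @LastName nvarchar(50);
    SELECT @FirstName = N'Jane',
        @LastName = N'Doe';

    SELECT HASHBYTES('SHA1', @LastName + N'|' + @FirstName) AS RowFingerprint;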

    Some people are wary of hash functions because they map a theoretically infinite number of large inputs to a much smaller finite set of hash values.  Most of the arguments people make against using hash functions for change detection boil down to variations of Murphy's Law:

    "There's a chance of a hash collision [generating the same hash value for two different inputs], so a collision will happen!"

    People have different ways of dealing with this issue, including taking one of the following positions:

    1. The chance of collision is negligible so no additional precautions are required.
    2. A collision will absolutely happen so I won't use hash functions for change detection at all!
    3. A collision may happen so I want to use hash values only to initially narrow down the number of rows I need to compare fully.

    Positions #1 and #2 above are at different ends of the spectrum.  Position #3 sits in the middle as a compromise solution.  While compromises may make for good politics, they often make for terrible technical solutions, as I'll discuss below.

    Position #1: Odds of Collision are Low Enough to be Ignored

    As far as position #1 is concerned, it depends on which hash function you're using.  You need to choose a true one-way collision-free* cryptographic hash function with a wide bit length.  I normally recommend an SHA-2 hash function (256, 384 or 512 bit hash value), or when that's not available the SHA-1 160 bit hash function.  The odds of generating a collision with a 160 bit hash function are on the order of 1 in 2^80.  That is to say, you can expect a collision only after you generate hashes for roughly 1,208,925,819,614,629,174,706,176 (2^80) rows of data.

    Of course if you're identifying rows by their natural or business keys this alternatively means you need to generate 1,208,925,819,614,629,174,706,176 variations of that single row before you'll hit a collision with SHA-1.

    To put that number in perspective, consider that Google processes 20,000,000,000,000,000 bytes (20 petabytes) of data per day.  If you were to store a single row in a database table for every single byte Google processes each day, it would take you 60,446,290 days (approximately 165,000 years) to store 1,208,925,819,614,629,174,706,176 rows in that table.
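
    For the curious, the back-of-the-envelope arithmetic behind those figures:

    2^80 hashes before an expected collision  = 1,208,925,819,614,629,174,706,176 ≈ 1.21 x 10^24
    20 petabytes per day at 1 row per byte    = 2 x 10^16 rows per day
    1.21 x 10^24 / 2 x 10^16                  ≈ 6.04 x 10^7 days ≈ 165,000 years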

    I personally assume position #1 on this subject, with the assumption that you have chosen a good solid hash function for the job.  More on this later.

    *A collision-free cryptographic hash function is a one-way hash function with negligible probability of generating the same hash value for two different inputs. SHA-1 and SHA-256 are examples of collision-free cryptographic hash functions.

    Position #2: I Don't Trust Hash Functions

    This position can't really be argued with.  As shown above the odds of a collision with SHA-1 or another collision-free hash function are extremely low.  But if you don't trust it, you just don't trust it.  So the alternative is to compare every inbound column with every existing column.  It will cost you in efficiency on wide tables, but if you're not concerned about processing power, server resources and execution time this classic method of change detection is well-proven to be 100% effective.
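
    Here's roughly what position #2 looks like in practice -- a sketch of my own with hypothetical staging and target tables, and with NULL handling omitted for brevity (every nullable column would need it in real code):

    -- No hashes: compare every column directly and update only the rows that differ.
    UPDATE t
    SET t.CityName = s.CityName,
        t.CountyName = s.CountyName,
        t.StateUSPS = s.StateUSPS
    FROM dbo.TargetTable AS t
    INNER JOIN dbo.StagingTable AS s
        ON s.BusinessKey = t.BusinessKey
    WHERE t.CityName <> s.CityName
        OR t.CountyName <> s.CountyName
        OR t.StateUSPS <> s.StateUSPS;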

    Position #3: The Compromise - Use Hash Values to Initially Narrow Down Results

    This position is the compromise position that combines the implementation of #1 and #2 above.  It sounds wonderful in theory - use a hash function to narrow down your results, eliminating rows that don't need to be compared column by column; then compare all of the columns in the remaining rows that haven't been eliminated.  So let's look at a scenario:

    • You are processing Row A through your ETL process into a target table.  Row B is the equivalent row in the target table (it has the same natural key/business key as Row A).  This assumes we are first locating the equivalent row in the target table by natural key/business key of the incoming row.

    There are three possible scenarios:

    • Row B exists in the target table, and is equal to Row A (no change).
    • Row B exists in the target table, but it is not equal to Row A (update).
    • Row B does not exist in the target table (insert Row A).

    Let's say you've generated two hash values, h(A) is the hash for Row A and h(B) is the hash for Row B.  Now we need to use h(A) and h(B) to eliminate rows to get rid of the extra column by column comparisons.  Here are the rules you need to implement to use h(A) and h(B) to eliminate extra comparisons in this compromise solution:

    A.  h(A) is equal to h(B): according to the compromise, if h(A) = h(B) we need to compare all columns of the inbound row against the existing row since the belief is that the hash function can/will generate collisions.  The idea is that h(A) may have generated the same value as h(B) even if A <> B.  So we need to:

    (1)  Compare all columns in A and B.  If A = B then perform no action.

    (2)  Compare all columns in A and B.  If A <> B then update.

    B.  h(A) is not equal to h(B): cryptographic hash functions guarantee that they will generate the same hash value for the exact same inputs.  So we can eliminate full row comparisons if h(A) <> h(B).  We know automatically that if h(A) <> h(B) then A <> B.  Just perform the update.

    C.  h(B) is NULL: that is, if Row B does not exist in the target table then h(B) is NULL.  This is a case where no further full-row comparisons are necessary.  Just insert the row.

    Now consider a slowly changing dimension (SCD) in a datamart application.  Many SCDs change slowly over time (hence the name slowly changing dimension).  This means that new rows (updates and inserts) are far less common than receiving duplicate rows during ETL.  So the vast majority of your inbound data will fall under rule A(1) above.  So you're still performing comparisons of all columns for the vast majority of rows in a given table just to figure out that you don't need to update them after all!

    If even 90% of your inbound rows fall under rule A(1) above you haven't saved much processing (you're still comparing all columns for changes for 90% of your inbound rows).  In fact you've probably cost yourself a lot of time and efficiency, since you haven't accounted for the overhead of generating hash values for 100% of the inbound rows.

    The only way this compromise is more efficient is if a very large percentage of your inbound rows (well over 50%) are inserts under Rule C or updates under Rule B above.  If the majority of your inbound rows are duplicates of existing rows under Rule A, you gain nothing.

    The Contest

    One-way collision-free cryptographic hash functions are supposed to have negligible probability of a hash collision, or two different inputs generating the same output.  Hash collisions are what cause change detection with hashes to fail.

    For instance, consider the following example of an MD5 hash collision:

    DECLARE @A varbinary(8000),
          @B varbinary(8000),
          @hA binary(16),
          @hB binary(16);

    SELECT @A = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70,
          @B = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f8955ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5bd8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70;

    SELECT @hA = HASHBYTES('MD5', @A),
          @hB = HASHBYTES('MD5', @B);
         
    SELECT CASE WHEN @A = @B
                      THEN '@A Equals @B'
                      ELSE '@A Is Not Equal To @B'
                      END AS AB_Equal,
                CASE WHEN @hA = @hB
                      THEN '@hA Equals @hB'
                      ELSE '@hA Is Not Equal To @hB'
                      END AS Hash_Equal;

    When you run this you'll notice that the query reports the two source varbinary strings @A and @B are not equal, yet the two MD5 hashes they generate are equal.  This is an example of a simple hash collision with MD5.

    Now the challenge is to populate the following script with two different binary values that generate the same hash value.  The output should be the same as shown above in the MD5 example.

    --  Begin script
    DECLARE @A varbinary(8000),
          @B varbinary(8000),
          @hA binary(20),
          @hB binary(20);

    -- Replace the ? below with binary strings

    SELECT @A = ?,
          @B = ?;

    SELECT @hA = HASHBYTES('SHA1', @A),
          @hB = HASHBYTES('SHA1', @B);

    SELECT CASE WHEN @A = @B
                      THEN '@A Equals @B'
                      ELSE '@A Is Not Equal To @B'
                      END AS AB_Equal,
                CASE WHEN @hA = @hB
                      THEN '@hA Equals @hB'
                      ELSE '@hA Is Not Equal To @hB'
                      END AS Hash_Equal;
    -- End script

    The first person who sends me an example of two varbinary strings that generate the same SHA1 hash value will win $100 (US$) and a copy of my book Pro T-SQL 2008 Programmer's Guide.

    And here are the inevitable conditions:

    1. No NULLs.  @A and @B in the script above cannot be set to NULL for purposes of this contest.
    2. 8,000 bytes or less.  The T-SQL HASHBYTES function accepts varbinary(8000) values, so the values assigned to @A and @B above must be 8,000 bytes or less in length.
    3. No unnecessary changes to the script.  The only change allowed to the script above are the replacement of the question marks (?) with binary strings.  No other changes to the script are authorized.
    4. Only one person will win.  The first person who sends me a copy of the above script with two different binary values that generate an SHA-1 hash collision will win.
    5. Void where prohibited.  Obviously if contests like this aren't legal in your country, state, county, city, etc. then you can't take part.  Petition your government to make it legal :)
    6. Time limits.  Entries must be received prior to midnight U.S. Eastern Standard Time on October 31, 2010.
    7. Decisions of the judge are final.  For purposes of this contest that would be me.
    8. SQL Server 2005 or 2008.  Entries must be runnable on SQL Server 2005 and SQL Server 2008 Developer Edition, and the results must be reproducible.

    If a winning entry is received prior to the deadline, I'll post an update entry to the blog with the winning script and the name of the winner.

  • T-SQL Tuesday #005: Creating SSMS Custom Reports

    This is my contribution to the T-SQL Tuesday blog party, started by Adam Machanic and  hosted this month by Aaron Nelson.  Aaron announced this month's topic is "reporting" so I figured I'd throw a blog up on a reporting topic I've been interested in for a while -- namely creating custom reports in SSMS.

    Creating SSMS custom reports isn't difficult, but like most technical work it's very detailed with a lot of little steps involved.  So this post is a little longer than usual and includes a lot of screenshots.  There's also a downloadable ZIP file with the projects from this article included.

    SSMS Custom Reports 

    SQL Server 2008 and 2005 both offer custom report functionality in Management Studio.  With a little bit of work you can create custom SSMS reports just like the standard reports that Microsoft ships with Management Studio.  The Disk Usage Report shown below is just one of the many standard reports that comes with SSMS.

    SSMS Standard Report

    Creating a Report Project 

    SSMS uses the SQL Server Reporting Services (SSRS) 2005 client/viewer control to render both standard and custom reports.  This is true for both SSMS 2005 and SSMS 2008.  So the first step to building a custom SSMS report is to fire up Visual Studio 2005 and create a new Report Server Project as shown below.  For this example we'll create a custom report that lists missing indexes, so give the project the name Missing Index.

    Create New Project Dialog

    Once you create the Report Server Project right-click on Reports in the Solution Explorer and choose Add > New Item...

    Add Report to Project

    When the Add New Item box appears, choose the Report template and give the report a name.  For this example I named the report Missing Index.rdl - the .rdl extension stands for "Report Definition Language", and is the standard extension for SSRS report definition files.

    Add Report Dialog

    Defining the Report Dataset 

    Once the report is added to your project you have to add a new dataset to the report.  The dataset defines the structure and content of the source data that will populate your report.  Choose <New Dataset...> from the Dataset: dropdown.

    Add New Dataset dropdown

    Visual Studio will respond with a Data Source box.  Just make sure the Type: dropdown is set to the default Microsoft SQL Server and put Data Source=. in the Connection string: box.  This is all that's required since the SSMS custom report you're creating will use connections created and managed by SSMS's Object Explorer.

    Add Datasource Dialog

    After you define the data source, you can define the SQL query that will populate your report.  Just put the query in the dataset window as shown below.

    Adding a Dataset

    I borrowed (and slightly modified) the following query from Brent Ozar.  He originally published it at SQLServerPedia.  This particular query uses SQL Server's missing index Dynamic Management Views (DMVs) to identify missing indexes.

    -- Begin missing index query

    WITH cte
    AS
    (
        SELECT mid.object_id AS object_id,
            QUOTENAME(OBJECT_SCHEMA_NAME(mid.object_id)) AS table_schema,
            QUOTENAME(OBJECT_NAME(mid.object_id)) AS table_name,
            QUOTENAME('IX_' + OBJECT_SCHEMA_NAME(mid.object_id) + '_' +
                LEFT(CAST(NEWID() AS CHAR(36)), 8)) AS index_name,
            mid.index_handle
        FROM sys.dm_db_missing_index_details mid
    )
    SELECT
        DENSE_RANK() OVER 
        (
            ORDER BY cte.table_schema, cte.table_name
        ) AS table_color,
        ROW_NUMBER() OVER 
        (
            PARTITION BY cte.table_schema, cte.table_name 
            ORDER BY (avg_total_user_cost * avg_user_impact) * (user_seeks + user_scans) DESC
        ) AS index_color,
        cte.table_schema,
        cte.table_name,
        cte.index_name AS index_name,
        (avg_total_user_cost * avg_user_impact) * (user_seeks + user_scans) AS calc_impact,
        'CREATE NONCLUSTERED INDEX ' + cte.index_name + ' ON ' + 
        cte.table_schema + '.' + cte.table_name + 
        ' (' + COALESCE(mid.equality_columns, '') + 
            CASE WHEN mid.inequality_columns IS NULL 
                THEN ''
                ELSE CASE WHEN mid.equality_columns IS NULL 
                    THEN ''  
                    ELSE ',' 
                    END + mid.inequality_columns 
                END + 
        ') ' + CASE WHEN mid.included_columns IS NULL 
                THEN ''  
                ELSE 'INCLUDE (' + mid.included_columns + ')' 
                END + 
        ';' AS create_stmt,
        mid.equality_columns,
        mid.inequality_columns,
        mid.included_columns
    FROM sys.dm_db_missing_index_group_stats AS migs
    INNER JOIN sys.dm_db_missing_index_groups AS mig 
        ON migs.group_handle = mig.index_group_handle
    INNER JOIN sys.dm_db_missing_index_details AS mid 
        ON mig.index_handle = mid.index_handle 
            AND mid.database_id = DB_ID()
    INNER JOIN cte 
        ON cte.index_handle = mid.index_handle
    WHERE migs.group_handle IN
    (
        SELECT group_handle 
        FROM sys.dm_db_missing_index_group_stats WITH (NOLOCK)
    )
    ORDER BY calc_impact DESC;

    -- End missing index query

    There are a couple of caveats to keep in mind about the missing index DMVs that this query relies on:

    • They're not designed to be exhaustive.  If you want to get a serious analysis of useful indexes in your databases, you definitely want to use a tool that will perform a more extensive analysis.
    • The DMVs only persist their information since the last time the SQL Server service was restarted.  If you've recently restarted the service you'll get very little information back.  These DMVs are best to use when your server has been running under normal load for a while.
    • The suggestions returned by the DMVs aren't always the best way to go.  These DMVs often recommend suggested indexes with lots of overlap (lots of columns in common, often in similar order).

    To put it another way, don't take the raw information returned by these DMVs at face value.  The information they return is definitely useful for determining where indexes might provide benefit if your system has been running under normal load for a while, and the redundant index suggestions tend to indicate which indexes might be most useful.  But these DMVs just amount to a starting point for analyzing your indexing needs.  I'd recommend against creating dozens of redundant indexes based on the raw output of these DMVs.

    Visual Studio will normally populate the dataset fields.  If for some reason it doesn't, you can manually edit the fields by hitting the Edit Selected Dataset button to pull up the Dataset window.  You can then enter the field names on the Fields tab if they aren't already populated.

    Edit Dataset Fields tab

    Building the Report 

    So far most of what we've done is just setup.  With this done, it's time to design and build the actual report.  For this, click on the Visual Studio Layout tab to get to the report designer surface.  You can drag and drop text boxes, charts, tables, images and any other controls from the Toolbox onto the designer surface.  To keep it simple we'll just drag a text box and a table onto the designer.

    Report Designer Layout tab

    Put the title of the report in the text box (in this case "Missing Index Report") and format it to your liking.  The table we dragged onto the designer surface has three columns by default.  For this example we want six columns total.  To add more columns right-click on the top border of the table and choose Insert Column to the Left to add a new column.  Repeat two more times.

    Inserting columns in SSRS table

    In the Header row of the table we'll type in the headers for each column like the following:

    Editing Table Column Headers in a report

    In the Detail row we'll put in the formulas to populate the database fields like this:

    Editing the Report Table Details

    SSRS formulas begin with the equal sign (=).  Fields from the dataset are referenced directly using the format Fields!field_name.Value.
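
    For example, if you use the query above, the detail cell under the create-statement column would contain an expression like the following (the exact column layout is up to you):

    =Fields!create_stmt.Value

    A cell that combines the schema and table name would use the SSRS concatenation operator:

    =Fields!table_schema.Value & "." & Fields!table_name.Value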

    Deploying and Running the Report 

    At this point use Visual Studio to build the project.  Once it builds without error, navigate to the project directory in Windows Explorer and copy the Missing Index.rdl file to the SSMS Custom Reports directory (on my computer this directory is located at C:\Users\Michael\Documents\SQL Server Management Studio\Custom Reports; the exact path will be different on yours). 

    You can now bring up this report in SSMS by right-clicking on a database in the Object Explorer and selecting Reports > Custom Reports...

    Running a Custom Report in SSMS

    Choose your custom report from the file selection box and click Open.

    Selecting a Custom Report in SSMS

    When you run a custom report in SSMS you'll get a warning like the following:

    SSMS Custom Report Warning Box

    Just choose Run.  You might also want to check the box that says "Please don't show this warning again" to keep the box from popping up every time you run a custom report.  The simple custom report you've created looks like the one shown below.

    Simple Custom Report

    You can use images, color and other formatting techniques to make the report easier to read and use, and more flexible for your users.  Consider the image below, which is a screenshot of a reformatted version of the Missing Index report.  This one includes more information, color and collapsible sections.  Both reports are included in the attached ZIP file under the Missing Index and Missing Index Color directories.

    Reformatted SSMS Custom Report

    You can play around with the source files included in the attached ZIP file.

  • Bit-Twiddling in SQL

    Someone posted a question to the SQL Server forum the other day asking how to count runs of zero bits in an integer using SQL.  Basically the poster wanted to know how to efficiently determine the longest contiguous run of zero bits in any given 32-bit integer.  Here are a couple of examples to demonstrate the idea:

     

    Decimal         Binary                                  Longest zero-bit run
    999,999,999     00111011 10011010 11001001 11111111     2 contiguous zero bits
    666,666,666     00100111 10111100 10000110 10101010     4 contiguous zero bits

     

    My first reaction was that SQL is not my go-to language for bit-twiddling hacks.  These types of calculations are generally most efficient in C-style compiled procedural languages, where the bit manipulation operations map almost one-for-one onto low-level machine instructions.  In all fairness, SQL does have bit-level operators (&, |, and so on), but they aren't as fast as the equivalent operations in a language like C#.
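
    As a quick illustration of those operators (just the built-in bitwise AND and OR against a couple of small numbers):

    SELECT 12 & 10 AS BitwiseAnd,   -- 1100 & 1010 = 1000 = 8
        12 | 10 AS BitwiseOr;       -- 1100 | 1010 = 1110 = 14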

     

    At any rate, a few different ideas were tossed around, like the simplistic loop-and-count procedural method: keep a running count of the current zero-bit run, loop over the bits, compare the running count to the longest run seen so far, and shift the integer one bit to the right each time (a quick sketch of this approach appears after the table-build script below).  But this being SQL, I decided to attack the problem from a set-based perspective.  To start with I built a 1,000,000 row table to hold random integers:

     

    CREATE TABLE TempNum
    (
        Num BIGINT NOT NULL
    );
    GO

    WITH GenerateRandom
    AS
    (
        SELECT 1 AS n,
            ABS(CONVERT(BIGINT, CONVERT(BINARY, CHECKSUM(NEWID()))) % 4294967296) AS Random

        UNION ALL

        SELECT n + 1,
            ABS(CONVERT(BIGINT, CONVERT(BINARY, CHECKSUM(NEWID()))) % 4294967296)
        FROM GenerateRandom
        WHERE n < 1000000
    )
    INSERT INTO TempNum
    (
        Num
    )
    SELECT Random
    FROM GenerateRandom
    OPTION (MAXRECURSION 0);
    GO
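
    For reference, the loop-and-count approach mentioned above boils down to something like the following.  This is just my own sketch for a single value, with made-up variable names; the original poster's version may well have looked different:

    -- Loop-and-count sketch: examine the 32 low-order bits one at a time,
    -- tracking the current run of zero bits and the longest run seen so far.
    DECLARE @Num BIGINT = 666666666,
        @CurrentRun TINYINT = 0,
        @LongestRun TINYINT = 0,
        @BitsChecked TINYINT = 0;

    WHILE @BitsChecked < 32
    BEGIN
        IF @Num & 1 = 0
        BEGIN
            SET @CurrentRun = @CurrentRun + 1;
            IF @CurrentRun > @LongestRun SET @LongestRun = @CurrentRun;
        END
        ELSE SET @CurrentRun = 0;

        SET @Num = @Num / 2;    -- shift one bit to the right
        SET @BitsChecked = @BitsChecked + 1;
    END

    SELECT @LongestRun AS ZeroBitRun;   -- returns 4 for 666,666,666
    GO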

     

    Note that I used SQL Server’s BIGINT data type (64-bit integer) since I wanted to deal only in unsigned 32-bit integers.  For my initial crack at a solution I split the 32-bit integer up into individual bits and treated the whole thing like a classic Gaps and Islands problem.  Essentially the 1 bits are islands, the 0's are gaps, and the length of the longest gap is the correct answer. 

     

    DBCC FREEPROCCACHE;
    DBCC DROPCLEANBUFFERS;
    GO

    WITH Powers
    AS
    (
        SELECT 0 AS id,
            CAST(1 AS BIGINT) AS pwr

        UNION ALL

        SELECT p.id + 1,
            POWER(CAST(2 AS BIGINT), p.id + 1)
        FROM Powers p
        WHERE p.id < 33
    ),
    Islands
    AS
    (
        SELECT ROW_NUMBER() OVER (PARTITION BY tn.Num ORDER BY id) AS RowNum,
            tn.Num,
            p.id,
            p.pwr
        FROM Powers p
        CROSS JOIN TempNum tn
        WHERE p.pwr & ((tn.Num * 2 ) | CAST(8589934593 AS BIGINT)) <> 0
    )
    SELECT c.Num AS Num,
        MAX(n.id - c.id - 1) AS ZeroBitRun
    INTO #Temp
    FROM Islands AS c
    INNER JOIN Islands AS n
        ON n.RowNum = c.RowNum + 1
            AND n.Num = c.Num
    WHERE n.id - c.id > 1
    GROUP BY c.Num,
        n.Num;
    GO

     

    One interesting aside on this solution: to ensure that leading and trailing zeroes were counted for every 32-bit number, I had to shift the number left one bit (multiply by 2) and do a bitwise OR (the | operator) with 8589934593, setting both the lowest bit (bit 0) and the highest bit (bit 33) to 1.  Basically this means you're now dealing with a 34-bit integer whose highest and lowest bits always count as islands, which ensures that leading and trailing gaps in the 32 bits in between are correctly identified.  On my PC this query took an average of just over 360 seconds (about 6 minutes) to identify the largest run of zero bits in 1,000,000 random numbers.  Another solution posted to the group used logarithms combined with bit-shifting in a recursive CTE; that one completed 1,000,000 iterations in about 17.5 minutes.
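
    To see what that does to one of the sample numbers: 999,999,999 shifted left one bit and ORed with 8589934593 becomes 10,589,934,591, which is just the original 32 bits sandwiched between a 1 in bit 0 and a 1 in bit 33.  A quick way to check it (illustration only, not part of the solution):

    -- 999,999,999 = 0x3B9AC9FF; after the shift and OR the value is
    -- 0x2773593FF = 10,589,934,591.
    SELECT (999999999 * CAST(2 AS BIGINT)) | CAST(8589934593 AS BIGINT) AS WithGuardBits;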

     

    I couldn't help but think that, when you get down to it, a solution to this problem should play to SQL's strengths.  I decided that what I really needed for an efficient SQL solution was a lookup table.  Granted, a lookup table of 4+ billion rows (one row for each and every 32-bit number) would take a long time to build and probably wouldn't lend itself to I/O efficiency in SQL Server, so I opted for a scaled-down version and built a lookup table of 65,536 rows, with one row representing every possible 16-bit number.  There are a lot of clever ways to grab bit-level pattern information (Google "de Bruijn sequence", for instance), but since this lookup table is a one-time build with a static population I decided to keep it simple.

     

    CREATE TABLE #Nybbles
    (
        Num int primary key not null,
        String varchar(4) not null,
        Leading tinyint not null,
        Trailing tinyint not null
    );
    GO

    INSERT INTO #Nybbles
    (
        Num,
        String,
        Leading,
        Trailing
    )
    VALUES (0, '0000', 4, 4), (1, '0001', 3, 0), (2, '0010', 2, 1), (3, '0011', 2, 0),
        (4, '0100', 1, 2), (5, '0101', 1, 0), (6, '0110', 1, 1), (7, '0111', 1, 0),
        (8, '1000', 0, 3), (9, '1001', 0, 0), (10, '1010', 0, 1), (11, '1011', 0, 0),
        (12, '1100', 0, 2), (13, '1101', 0, 0), (14, '1110', 0, 1), (15, '1111', 0, 0);
    GO

    CREATE TABLE #BitMask
    (
        Length tinyint,
        Mask varchar(16)
    );
    GO

    WITH GetMasks
    AS
    (
        SELECT 0 AS Length,
            CAST('' AS varchar(16)) AS Mask

        UNION ALL

        SELECT Length + 1,
            CAST(REPLICATE('0', Length + 1) AS varchar(16))
        FROM GetMasks
        WHERE Length < 16
    )
    INSERT INTO #BitMask
    (
        Length,
        Mask
    )
    SELECT Length,
        Mask
    FROM GetMasks;
    GO

    CREATE TABLE BitPattern
    (
        Num int not null primary key,
        Trailing tinyint,
        Leading tinyint,
        Seq tinyint
    );
    GO

    WITH GetBin
    AS
    (
        SELECT n1.Num * 4096 + n2.Num * 256 + n3.Num * 16 + n4.Num AS Num,
            n1.String + n2.String + n3.String + n4.String AS String
        FROM #Nybbles n1
        CROSS JOIN #Nybbles n2
        CROSS JOIN #Nybbles n3
        CROSS JOIN #Nybbles n4
    )
    INSERT INTO BitPattern
    (
        Num,
        Trailing,
        Leading,
        Seq
    )
    SELECT g.Num,
        (
            SELECT MAX(b1.Length) 
            FROM #BitMask b1
            WHERE RIGHT(g.String, b1.Length) = b1.Mask
        ) AS Trailing,
        (
            SELECT MAX(b2.Length) 
            FROM #BitMask b2
            WHERE LEFT(g.String, b2.Length) = b2.Mask
        ) AS Leading,
        COALESCE
        (
            (
                SELECT MAX(b3.Length)
                FROM #BitMask b3
                WHERE CHARINDEX(b3.Mask, g.String) > 0
            ), 
        0) AS Seq
    FROM GetBin g
    ORDER BY Num;
    GO

    DROP TABLE #BitMask;
    DROP TABLE #Nybbles;
    GO 

     

    The first part of this script puts all 16 possible 4-bit nybbles (a nybble is half a byte) and their equivalent binary-formatted strings (0 = '0000', 14 = '1110') into a temp table called #Nybbles.

     

    There’s also a #BitMask temp table with bitmasks representing zero-bit runs.  The bitmasks are just strings of consecutive '0' characters of the necessary length (length 1 = '0', length 5 = '00000').

     

    The BitPattern table is the actual 16-bit number lookup table.  It's populated by combining nybbles from the #Nybbles temp table in every possible order to produce all 65,536 16-bit values.  The table has 4 columns:

     

    • Num is the 16-bit number
    • Trailing is the number of zero bits trailing (on the right-hand side) in the number
    • Leading is the number of zero bits leading (on the left-hand side) in the number
    • Seq is the longest sequence of zero bits within the number

     

    The total run time to build this lookup table was around 30 seconds on my computer.  Keep in mind that’s a one-time cost since you never have to build (or modify) the table again.
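
    A quick spot-check against a few easy-to-verify values is a reasonable sanity test once the lookup table is built; the expected results in the comments below follow directly from the column definitions:

    -- 0     = 0000000000000000 -> Trailing 16, Leading 16, Seq 16
    -- 1     = 0000000000000001 -> Trailing  0, Leading 15, Seq 15
    -- 32768 = 1000000000000000 -> Trailing 15, Leading  0, Seq 15
    -- 65535 = 1111111111111111 -> Trailing  0, Leading  0, Seq  0
    SELECT Num,
        Trailing,
        Leading,
        Seq
    FROM BitPattern
    WHERE Num IN (0, 1, 32768, 65535)
    ORDER BY Num;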

     

    With the information in this lookup table the query that locates the longest run of zero bits in any given 32-bit number is relatively simple:

     

    DBCC FREEPROCCACHE;
    DBCC DROPCLEANBUFFERS;
    GO

    WITH CTE
    AS
    (
        SELECT tn.Num,
            (
                SELECT
                    CASE WHEN l.Seq > h.Seq THEN
                    (
                        CASE WHEN l.Seq > h.Trailing + l.Leading THEN l.Seq
                            ELSE h.Trailing + l.Leading
                            END
                    )
                    ELSE
                    (
                        CASE WHEN h.Seq > h.Trailing + l.Leading THEN h.Seq
                            ELSE h.Trailing + l.Leading
                            END
                    )
                    END
            ) AS Seq,
            CASE WHEN l.Trailing = 16 THEN h.Trailing + l.Trailing
                ELSE l.Trailing
                END AS Trailing,
            CASE WHEN h.Leading = 16 THEN h.Leading + l.Leading
                ELSE h.Leading
                END AS Leading
        FROM TempNum tn WITH (NOLOCK)
        INNER JOIN BitPattern l WITH (NOLOCK)
            ON l.Num = (tn.Num & 65535)
        INNER JOIN BitPattern h WITH (NOLOCK)
            ON h.Num = (tn.Num / 65536)
    )
    SELECT Num,
        Seq,
        Leading,
        Trailing
    INTO #Temp
    FROM CTE;
    GO

     

    Basically you join the BitPattern lookup table on the high 16 bits and again on the low 16 bits.  The first CASE expression in the Seq subquery performs a 3-way maximum calculation.  It returns the largest of:

     

    • the low word zero-bit run (l.Seq),
    • the high word zero-bit run (h.Seq), or
    • the count of high word trailing zero-bits + the count of low word leading zero-bits (h.Trailing + l.Leading)

     

    The other two CASE expressions return the total number of trailing and leading zero bits in the 32-bit number.  The CASE logic is needed to handle the case where the high or low word is all zero bits, so the count carries across the 16-bit boundary.
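
    As a worked example, take 666,666,666 (0x27BC86AA) from the table at the top of this post.  Its low word is 34474 (0x86AA: Seq 4, Leading 0, Trailing 1) and its high word is 10172 (0x27BC: Seq 2, Leading 2, Trailing 2), so the 3-way maximum is MAX(4, 2, 2 + 0) = 4, matching the 4-bit zero run shown earlier.  You can pull the two lookup rows side by side to check the math:

    SELECT l.Seq AS LowSeq,                     -- 4
        h.Seq AS HighSeq,                       -- 2
        h.Trailing + l.Leading AS BoundarySeq   -- 2 + 0 = 2
    FROM BitPattern AS l
    CROSS JOIN BitPattern AS h
    WHERE l.Num = (666666666 & 65535)       -- low 16 bits  = 34474
        AND h.Num = (666666666 / 65536);    -- high 16 bits = 10172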

     

    This particular solution took an average of about 6 seconds to calculate the longest zero-bit run for 1,000,000 numbers.

     

    To my earlier point, I created a C# solution (which itself could have been optimized further) that performed the exact same calculation for 1,000,000 random 32-bit integers in 2 seconds flat.  So I guess there are two main points here: (1) make sure you choose the right tool/language for the job, and (2) whatever tool/language you choose, try to play to its strengths.
