
Enjoy Another Sandwich -- Kent Tegels

Yummy slices of SQL Server between slices of .NET and XML

  • I’m 99% confident that where you are matters

    It really has been a long time since I posted anything of value here. Yes, a lot of that is by my own choice, and some of you might be wondering if I’ve given up on SQL Server. No, I haven’t; it remains a vital tool for me. But in the last couple of years I have become more of a user of the product rather than an “internals guru.” To be frank, going from technical trainer to university professor has had a lot to do with that. I tend to care much less now about squeezing cycles out of execution times than I do about being able to produce meaningful work. Well, as meaningful as academia is, anyway.

     

    Something happened yesterday that maybe you didn’t know about. You might be left saying “and I care why?” too. No, as far as I can tell, no new bits shipped. No, nobody has discovered a new way of using hash indexes to improve query performance. And no, those rumors of SQL Server running on Linux didn't come true.

     

    Microsoft Research simply published a research paper.

     

    But this isn’t just a paper, methinks. It is the paper my side of the world (the arcane place where statistics, geography and business overlap one another) needs the rest of the world to really grok. I’ve been saying for years that “maps are data too” and reminding people of Tobler’s first law[1]. The glassy-eyed, mouth-breathing reception to it has been… less than inspiring.

     

    A single paper may not change that much, but even if only a few of the folks that see this blog post spend the few minutes it takes to read the paper and consider its implications… I believe the results could really push forward acceptance of Spatial Business Intelligence, and our economy along with it. How? BI has already become an essential part of how America leverages technology for competitive advantage today. We make more effective and efficient use of capital because of it. Adding a spatial dimension to the mix can help us understand not only what consumers do, but also what factors arising from culture and clustering drive those behaviors.

     

    Enough talk. Go. Read. Think. Innovate.

     

    “Learning Location Correlation from GPS trajectories”: http://research.microsoft.com/apps/pubs/default.aspx?id=121308

     

     



    [1] “Everything is related to everything else, but nearer things are more related than more distant things.” Sort of like your desire to drink the cold beverage-of-choice already in your refrigerator compared to having to go buy said beverage, bring it home, cool it down and then drink it.

  • +1: The start of me, version 4.0

    Cross-posted from Splendid Isolation

    After a few months of silence, I am happy to say that I am returning to regular duty in the Microsoft Technical space. The last eight months have been very difficult ones between health issues and the economy. However, I feel like now is the best time to start making a personal investment in community building again. Therefore, to that end, I am pleased to make the following announcements.

    Working with the other leaders of The Sioux Falls .NET user group, I have taken on responsibility for getting The Sioux Falls chapter of the Professional Association for SQL Server up and going. As you know, SQL Server is just as much a passion of mine as helping others learn and grow. I am very excited about this opportunity. We will soon be announcing an organizational meeting for those of you interested in attending.

    Second, as many of you know, I had a great run with DevelopMentor as one of their SQL Server instructors. So when I read Ted Neward's post about the company, I felt sad.

    In much the same way he does, I feel warmly about people like Dr. Sumida and Niels Berglund. Indeed, I remember my night with "Scary Ron" and agree -- it had a lot of impact on me! So much of what Ted has to say resonates deeply with me -- especially the bits about what it meant to be part of that culture. Honestly, that is what I found to be so great about being there.

    Sadly, the economic reality of 2009 meant that I could not continue to be part of DM going forward. While there has been a lot of back-and-forth about DM's economic future today, there is no doubt in my mind that they will continue to be successful thanks to the hard work of their instructors and back-office staff.

    Still, I love to learn and share what I have learned with others. That is why, today, I am pleased to announce that I have joined the Technical Staff of PluralSight. Again, I feel blessed to be working with "the best of the best" folks like Fritz Onion, Dan Sullivan and many others. This is also a great fit for me since I can work on developing online-delivered content, cutting down on travel while I work on getting practical experience and data for my Doctorate research. Of course, getting my health back to the best it has been in years helps too.

    HDC09

    Last but far from least, I am happy to say that Joe Olson has let me know that I will be speaking at the Heartland Developer's Conference again this year. Of all of the conferences I have done, HDC is the most rewarding and important to me since it directly affects the developers in my community first. This year's talk will be very different for me: rather than doing a deep dive into some facet of the technology stack I will be talking about "how to (and not to) give a presentation."

    Maybe a little of "Scary Kent" will finally come out. I'm +1 with that.


  • Welcome to my Splendid Isolation

    If you are wondering if I am still alive, the answer is: you betcha! Today I am happy to announce that a long-term goal of mine is starting to take shape. You can read more about that at: http://sfdnug.com/blogs/splendid-isolation/.

    Yes, I will still be blogging my SQL Server specific content here too.

     SQL Server 2008 R2 sounds very interesting from what I've heard and read about so far. Let's see if they can keep on delivering! 

  • T-SQL statements with multiple Common Table Expressions: Yes, we can.

    Yesterday I was mulling over the SQL Server tweets and read one by Michelle Ufford (aka SqlFool). She was answering another developer's question about the use of Common Table Expressions: is it possible to have multiple CTEs in a single query?

    Yes and No.

    No in the sense that you cannot nest CTE definitions. That is, you cannot embed a WITH clause inside another CTE's definition and have its results feed the query in question.

    Yes in the sense that you can define multiple CTEs as "peers." That is, you can have as many CTEs as you like -- each separated by a comma -- which feed a query.
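
    In skeleton form, peer CTEs look like this (a minimal runnable sketch, separate from the worked example below):

    with firstCte(n) as (
        select 1
    ), secondCte(n) as (
        select 2
    )
    select firstCte.n as firstValue, secondCte.n as secondValue
    from firstCte cross join secondCte;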

    This is more than just syntax and semantic sugar. Remember that one of the use cases for a CTE is to write an executable query where you can re-use a resolved result set over and over again without rewriting its query each time as a sub-query. There may be cases where you need to do that with more than one feeding query too.

    Here's an example. Sure, there are other ways to write this, but it demonstrates the use of peered CTEs. Let us say we have a table that represents students' scores on an exam. Each time the class runs, a new generation of the test is issued. Ideally, the student scores should be statistically similar for each class, assuming a normal distribution of students. You decide to see if that's really the case: you would like to know how many student scores from the later test generations were between the average score less one standard deviation from the first generation and the average score plus one standard deviation from the second.

    Let us start by generating some data in a temporary table:

    create table #t(
        id int not null primary key clustered,
        score decimal(5,2) not null,
        generation tinyint not null);
    declare @score float = 1.0;
    declare @group tinyint = 0;
    declare @index smallint = 1000;
    while @index > 0 begin
        set @score = 100*RAND();
        set @group = RAND()*5+1;
        insert into #t values (1000-@index,@score,@group);
        set @index -= 1;
    end;

    You've heard about CTEs and wonder, "Might they help in this case?" That really is an interesting question. Here is one way to solve the problem using CTEs:

    -- find the means and S for the first two generations of data
    -- (shows using peer CTEs)
    -- on my test machine, total sub-tree cost is 0.02304924
    with g0(a,s) as (
    select AVG(score),STDEVP(score) from #t where generation=1
    ), g1(a,s) as (
    select AVG(score),STDEVP(score) from #t where generation=2
    )
    -- find the number of criteria matching records
    select COUNT(score),avg(score),stdevp(score) from #t,g0,g1
    where generation > 1
    group by g0.a,g0.s,g1.a,g1.s
    having avg(score) between g0.a-g0.s and g1.a+g1.s;

    And another using sub-queries:

    -- with sub-queries instead
    -- on my test machine, total sub-tree cost is 0.0197593
    select COUNT(score),avg(score),stdevp(score) from #t
    where generation > 1
    having AVG(score) between (
    select AVG(score)-STDEVP(score) from #t where generation=1
    ) and (
    select AVG(score)+STDEVP(score) from #t where generation=2);

    So here's where this gets interesting. Which of these two queries has a lower query cost and why? If the hairs on the back of your neck started rising when you read the where clause of the CTE-based query... GOOD, they should have. What we have done there is create a Cartesian product. Most DBAs and developers would cringe at that. After all, aren't you bringing as many rows into scope as the product of the sizes of the base table and the two result sets from the CTEs? Yes, you are. But even so, 1000x1x1 is still just one thousand. The problem is that the query engine does not really grok that. If you execute the two queries sequentially with "show actual plan" turned on, you'll see that the CTE-driven plan has a considerably higher cost than the sub-query plan. Why? You have probably already guessed! It is the expense of doing the Cartesian product.
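
    If you want the estimated subtree costs quoted in the code comments above rather than the graphical plan, one way to get them is the textual estimated plan. This is a sketch of the session settings involved; paste either query into the middle batch:

    set showplan_all on;
    go
    -- paste the CTE version or the sub-query version here and run it;
    -- the TotalSubtreeCost column of the first row is the estimated plan cost
    go
    set showplan_all off;
    go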

    The actual plan shows us this nicely:

    Click here to see the picture. For some reason, inserting the IMG tag here causes IE to freak out. Yuck!

    1: Yes, yes, I know, I know. Statistically this is not really meaningful. You should be using Student's T-Test comparing each group individually. It is just an example used here to help people understand the peer CTE concept. Before sending me a whiney email, statboy, just keep that in mind.

  • A quick update

     Today is a bittersweet day. Yesterday was my last day as an employee of DevelopMentor. I truly appreciated the opportunities I had there. My future with them is still not completely written. There are a number of classes coming up that I could be teaching for them (see http://www.develop.com/schedule/microsoft-data for a list). Moreover, there are many good folks still working hard for them. I wish all of them the best success.

    As for me, right now my top priority is getting healthy again and keeping on chipping away at my Doctorate. I have accepted an adjunct professorship with Colorado Technical University here in Sioux Falls, where I will be teaching classes in business and computer science. In July, I should be going to the Institute for Advanced Learning (http://www.instituteforadvancedstudies.com/) in Colorado Springs to start the long road towards a dissertation. It is odd to be starting that again, but I am feeling much better about getting it done.

    There are some other opportunities too. I have been approached by other companies to help with consulting and training opportunities. So, despite the bleakness of the current economy, I am feeling pretty good. Of course, it helps when you have a strong support network backing you, and I am certainly blessed to have that going full-on for me now.

    In a few days, I am hoping to introduce a new personal web site with blog and wiki features. The focus of the site will be on the technologies and practices I am most interested in: business intelligence, data mining, F#, XML, .NET, leadership and, of course... things gourmand! I will still be posting here from time to time as well when something relevant comes up.

    Looking out my window, I see clear skies and smell a warm breeze. A purring gray cat is asleep on my lap. As Suzuki-Roshi taught, "Just to be alive is enough."

  • another 15 seconds of fame

    Thanks to Chris G.  Williams for his "9 questions" mini-interview. Read it at http://geekswithblogs.net/cwilliams/archive/2009/02/22/129607.aspx
  • Some interesting effects of Table Partitioning

    I had the pleasure of mentoring a very good class last week. One feature in SQL Server 2005 they really liked was table partitioning. In brief, this technology allows a table to be fragmented over n-many file groups that consist of one or more database files. There are a couple of very good reasons to do this. First, for transactional tables, it takes fewer resources to back up and restore, since only the active parts of the database are written to specific file groups. Second, if the partitioning is aligned to anticipated, commonly used queries, the data access engine will have to read a minimum amount of index and tabular data to complete them. That means that I/O is reduced and queries complete faster and cheaper.

    That is, at least, the theory. However, a demonstration of table partitioning illustrated two other interesting effects. First, a well-designed partitioning scheme can produce lower-cost, non-parallelized query plans than identically designed non-partitioned tables. Second, even queries that would not seem to benefit from the partitioning design can still show considerable improvement.

    Let us start with the design of the database. For our purposes, this database will host a single table -- one that records the line items of sales transactions. In a real database, this design may or may not be appropriate. However, we are interested in showing how the concept works. Following is the create database statement used. Notice that we have file groups distributed over three named drives. In my case, I am using a USB hub with three USB 2.0 memory sticks plugged into it[i]. We will create a non-partitioned version of the table on drive "D." We will distribute the transaction log over three files on drives D, F and G. We will also create sixteen file groups, each containing a single file. We will eventually create the partitioned table over these file groups.

    create database [xmlazon] on primary
    ( name = N'xmlazoN',filename = N'd:\xmlazon.mdf',size = 224mb,maxsize = unlimited,filegrowth = 1mb )
    ,filegroup xmlazon_fg0(name = N'xmlazon_f0',filename = N'd:\xmlazon_f0.ndf',size = 64mb,maxsize = unlimited,filegrowth = 1mb )
    ,filegroup xmlazon_fg1(name = N'xmlazon_f1',filename = N'd:\xmlazon_f1.ndf',size = 32mb,maxsize = unlimited,filegrowth = 1mb )
    ,filegroup xmlazon_fg2(name = N'xmlazon_f2',filename = N'f:\xmlazon_f2.ndf',size = 8mb,maxsize = unlimited,filegrowth = 1mb )
    ,filegroup xmlazon_fg3(name = N'xmlazon_f3',filename = N'g:\xmlazon_f3.ndf',size = 8mb,maxsize = unlimited,filegrowth = 1mb )
    ,filegroup xmlazon_fg4(name = N'xmlazon_f4',filename = N'f:\xmlazon_f4.ndf',size = 8mb,maxsize = unlimited,filegrowth = 1mb )
    ,filegroup xmlazon_fg5(name = N'xmlazon_f5',filename = N'g:\xmlazon_f5.ndf',size = 8mb,maxsize = unlimited,filegrowth = 1mb )
    ,filegroup xmlazon_fg6(name = N'xmlazon_f6',filename = N'f:\xmlazon_f6.ndf',size = 8mb,maxsize = unlimited,filegrowth = 1mb )
    ,filegroup xmlazon_fg7(name = N'xmlazon_f7',filename = N'f:\xmlazon_f7.ndf',size = 8mb,maxsize = unlimited,filegrowth = 1mb )
    ,filegroup xmlazon_fg8(name = N'xmlazon_f8',filename = N'g:\xmlazon_f8.ndf',size = 8mb,maxsize = unlimited,filegrowth = 1mb )
    ,filegroup xmlazon_fg9(name = N'xmlazon_f9',filename = N'f:\xmlazon_f9.ndf',size = 8mb,maxsize = unlimited,filegrowth = 1mb )
    ,filegroup xmlazon_fgA(name = N'xmlazon_fA',filename = N'g:\xmlazon_fA.ndf',size = 8mb,maxsize = unlimited,filegrowth = 1mb )
    ,filegroup xmlazon_fgB(name = N'xmlazon_fB',filename = N'f:\xmlazon_fB.ndf',size = 8mb,maxsize = unlimited,filegrowth = 1mb )
    ,filegroup xmlazon_fgC(name = N'xmlazon_fC',filename = N'g:\xmlazon_fC.ndf',size = 8mb,maxsize = unlimited,filegrowth = 1mb )
    ,filegroup xmlazon_fgD(name = N'xmlazon_fD',filename = N'f:\xmlazon_fD.ndf',size = 8mb,maxsize = unlimited,filegrowth = 1mb )
    ,filegroup xmlazon_fgE(name = N'xmlazon_fE',filename = N'g:\xmlazon_fE.ndf',size = 8mb,maxsize = unlimited,filegrowth = 1mb )
    ,filegroup xmlazon_fgF(name = N'xmlazon_fF',filename = N'd:\xmlazon_fF.ndf',size = 32mb,maxsize = unlimited,filegrowth = 1mb )
    log on
    ( name = N'xmlazon_log0',filename = N'd:\xmlazon_log0.ldf',size = 74mb,maxsize = 2048gb,filegrowth = 10% )
     ,( name = N'xmlazon_log1',filename = N'f:\xmlazon_log1.ndf',size = 74mb,maxsize = 2048gb,filegrowth = 10% )
     ,( name = N'xmlazon_log2',filename = N'g:\xmlazon_log2.ndf',size = 74mb,maxsize = 2048gb,filegrowth = 10% );

    Now we can turn our attention to the two versions of the table in question. First, here is the non-partitioned version. It is a basic table design illustrating some of the basics of good design, such as using the smallest possible data types and having a clustered index on the table.

    create table dbo.saleDetail (
        customerID smallInt not null
        , orderDate smallDateTime not null check(orderDate > '2005-10-06')
        , orderLine tinyint not null check(orderLine > 1)
        , productID smallint not null
        , quantity smallint not null check(quantity > 0)
        , unitSalePrice decimal(7,2) not null check(unitSalePrice > 1.0)
        , lineTotal as quantity*unitSalePrice persisted
          constraint pkSaleDetail primary key(customerID,orderDate,orderLine));

    The partitioned version of the table depends on both a partitioning function and a partitioning scheme. The partition function is simple. Based on an input data type (SmallDateTime in this case), it returns a whole number. Think of it as being like a CASE ... WHEN statement with a twist. The values listed in the body of the function act as boundaries. When a value enters the function, it is compared with those boundary values and the matching partition number is returned. In our example, the value 7 October 2005 returns one, whereas 1 January 2006 returns two, and so on.

    create partition function quarteringFunction(smallDateTime) as range right for values ('20060101','20070101','20070201','20070301','20070401','20070501','20070601','20070701','20070801','20070901','20071001','20071101','20071201','20080101','20090101');
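
    Once the function exists, you can sanity-check the mapping described above with the $PARTITION function (a quick sketch):

    select $PARTITION.quarteringFunction('20051007') as earlyOrders,  -- returns 1
           $PARTITION.quarteringFunction('20060101') as q1Of2006;     -- returns 2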

    The partition scheme is also straightforward, as this object actually invokes the partitioning function and uses the result to return a file group name on action statements:

    create partition scheme quartersToFileGroups as partition quarteringFunction to(xmlazon_fg0,xmlazon_fg1,xmlazon_fg2,xmlazon_fg3,xmlazon_fg4,xmlazon_fg5,xmlazon_fg6,xmlazon_fg7,xmlazon_fg8,xmlazon_fg9,xmlazon_fga,xmlazon_fgb,xmlazon_fgc,xmlazon_fgd,xmlazon_fge,xmlazon_fgf);

    For example, if a row is inserted with a date of 7 October 2005, the partition function returns one and the partition scheme cues the data engine to write to file group xmlazon_fg0.
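
    If you want to confirm which file group each partition number maps to, the catalog views can list the whole mapping. This is a sketch; it assumes it is run in the xmlazon database after the scheme above has been created:

    select dds.destination_id as partitionNumber, fg.name as fileGroupName
    from sys.partition_schemes ps
    join sys.destination_data_spaces dds
        on dds.partition_scheme_id = ps.data_space_id
    join sys.filegroups fg
        on fg.data_space_id = dds.data_space_id
    where ps.name = 'quartersToFileGroups'
    order by dds.destination_id;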

    The table definitions are nearly identical; however, the partitioned table must be created on the partition scheme, with the partitioning column passed in:

    create table dbo.saleDetail2 (
        customerID smallInt not null
        , orderDate smallDateTime not null check(orderDate > '2005-10-06')
        , orderLine tinyint not null check(orderLine > 1)
        , productID smallint not null
        , quantity smallint not null check(quantity > 0)
        , unitSalePrice decimal(7,2) not null check(unitSalePrice > 1.0)
        , lineTotal as quantity*unitSalePrice persisted
          constraint pkSaleDetail2 primary key(customerID,orderDate,orderLine))
    on quartersToFileGroups(orderDate);

    As for test data: I wrote a little C# application that generates an ADO.NET DataTable with about two and a half million rows in it, then wrote that data to the non-partitioned version. To make repeating the test a bit easier and more stable, I then used the BCP program to dump the data to a native, binary file. I can then load the file using the BULK INSERT statement:

    bulk insert xmlazon.dbo.saleDetail from 'd:\saleDetails.raw' with(batchsize=100000,dataFileType='native',tablock);
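
    For reference, the native export described above can be produced with a bcp command along these lines (a sketch; the instance name and the trusted-connection switch are assumptions about my setup):

    bcp xmlazon.dbo.saleDetail out d:\saleDetails.raw -n -T -S localhost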

    Loading the entire file into the non-partitioned table took, on average, about 74 seconds. Loading into the partitioned table took, again on average, about 113 seconds, or about 52.7% longer. This makes sense: each of the inbound rows needs to be resolved to a file group using the partition scheme (and thus the partition function).

    I wanted to begin testing with a query that should take maximum advantage of the partitioning concept. The following query against the non-partitioned version of the table selects about 2.78% (69,584) of the rows from the table. Since I am running on a dual-core machine, I used the MAXDOP option to emulate having a single CPU. Data is read from the table using a Clustered Index Scan.

    select productID,quantity,lineTotal from dbo.saleDetail where orderDate between '2007-11-23' and '2007-12-26' option(MAXDOP 1);

    We can increase MAXDOP to two easily enough:

    select productID,quantity,lineTotal from dbo.saleDetail where orderDate between '2007-11-23' and '2007-12-26' option(MAXDOP 2);

    Running these same queries against the partitioned table yields dramatic differences. The following table summarizes the results for all four of the test query versions:

    Measure               Non-Partitioned,     Non-Partitioned,     Partitioned,         Partitioned,
                          Non-Parallelized     Parallelized         Non-Parallelized     Parallelized
    Initial row count     68415.62             68415.62             64729.71             64729.71
    Total plan IO         7.783866             7.783866             0.9781018            0.9781018
    Total plan CPU        2.757001562          1.528797662          0.350537321          0.350537321
    Total plan Cost       12.74087             10.41267             1.384889             1.384889

    While the numbers tell a large part of the story here -- mainly that partitioning the table dramatically reduces the plan costs in all categories -- there is something else you need to see. Here are the plan tree texts for the partitioned-table queries. The MAXDOP 1 listing is first:

    Compute Scalar(DEFINE:([xmlazon].[dbo].[saleDetail2].[lineTotal]=[xmlazon].[dbo].[saleDetail2].[lineTotal]))
        
    Clustered Index Scan(OBJECT:([xmlazon].[dbo].[saleDetail2].[pkSaleDetail2]), SEEK:([PtnId1000] >= (13) AND [PtnId1000] <= (14)),  WHERE:([xmlazon].[dbo].[saleDetail2].[orderDate]>='2007-11-23 00:00:00.000' AND [xmlazon].[dbo].[saleDetail2].[orderDate]<='2007-12-26 00:00:00.000') ORDERED FORWARD)

    Here is the MAXDOP 2 plan:

    Compute Scalar(DEFINE:([xmlazon].[dbo].[saleDetail2].[lineTotal]=[xmlazon].[dbo].[saleDetail2].[lineTotal]))
         Clustered Index Scan(OBJECT:([xmlazon].[dbo].[saleDetail2].[pkSaleDetail2]), SEEK:([PtnId1000] >= (13) AND [PtnId1000] <= (14)),  WHERE:([xmlazon].[dbo].[saleDetail2].[orderDate]>='2007-11-23 00:00:00.000' AND [xmlazon].[dbo].[saleDetail2].[orderDate]<='2007-12-26 00:00:00.000') ORDERED FORWARD)

    Yes, you are right -- they are exactly the same plan. Here we have such low-cost access to the data that parallelizing the plan does not help reduce its cost. This can be very helpful in situations where you find that you are running into issues with CPU load. It also demonstrates one of the good side effects of using partitioned tables.

    All of this is well and good, you might be thinking, but what about queries that run "against the grain" of the table partitioning design? Is there some downside to it? Well, to test that, consider the following query. It has been designed to read all of the data in the table, not just some parts of it:

    select distinct sd.customerID,sd.productID,COUNT(sd.productID),SUM(sd.lineTotal) from dbo.saleDetail sd group by sd.customerID,sd.productID with rollup order by sd.customerID,sd.productID option(maxdop 1);

    The following table summarizes the statistics for the combinations of parallelization and use of the partitioned table:

    Measure               Non-Partitioned,     Non-Partitioned,     Partitioned,         Partitioned,
                          Non-Parallelized     Parallelized         Non-Parallelized     Parallelized
    Initial Row Count     2500003              2500003              2500003              2500003
    Final Row Count       11103.24             11103.24             11098.89             11098.89
    Total IO              7.80646652           7.795127262          7.84741552           0.005630631
    Total CPU             213.045881           137.0510621          132.3262086          11.79203117
    Total Cost            220.8524             144.8462             140.1736             94.92677

    As you can see, while this query would not seem to benefit from table partitioning, it actually does. This demonstrates the second side effect of table partitioning. Since partitioned tables can be spread over many file groups, the data engine may be able to access data resting in tables and indexes more effectively. In some cases, this effect can out-perform parallelization of queries against non-partitioned tables. However, partitioning should not be considered a "silver bullet" for solving all performance issues, as there is a demonstrated impact on insert performance -- and thus likely a similar impact on update and delete performance -- as well.

    Source code and data are available for download on request; use the contact feature here.


    [i] Think RARID: a redundant array of REALLY inexpensive disks.

  • HDC08 Omaha Demos Posted

    Slides and demos for my HDC08 Omaha talks are now available for download from http://tinyurl.com/5oykgj. Thanks to all who attended!
  • Looks like it's time for my 15 seconds of fame...

    First up, my article on HierarchyID has been published in MSDN Magazine. It's also available on the MSDN web site at http://msdn.microsoft.com/en-us/magazine/cc794278.aspx.

    Second, my recent interview with Dr. Greg Low for his world-famous "SQL Downunder Podcast" is now available at http://www.sqldownunder.com/PreviousShows/tabid/98/Default.aspx. In show #39, Greg and I talk about the Spatial feature support in SQL Server 2008.

    Lastly, just a reminder that the Heartland Developers' Conferences (http://www.heartlanddc.com/) are just around the corner. There's going to be a lot of great content at these shows and I'm looking forward to them. I'll be giving a talk about the Spatial features in SQL Server 2008.

  • RDL me this - how do you write reports if all you have is SQL Server Express Advanced Services Edition?

    A few days ago via Twitter (http://www.twitter.com/, follow me as ktegels), Mr. Lowe (AKA Vendoran) asked "SQL Express with Advanced services comes with SSRS, but does it come with the ability to create/edit rdl in VS?"

    The short answer to that question is "it depends." In most of the paid-license editions of Visual Studio, you can use the Report Viewer controls to do that. For example, you can create a new C# Windows Forms project, then drag and drop a Report Viewer control onto the form. After you have created the report you want, you will find an "RDLC" file in that project. RDLCs are, essentially, RDL files designed for use with the Report Viewer control. The good news is that these can be used with Report Manager in most cases if you: a) rename the file with an extension of RDL instead of RDLC before uploading, and b) reconfigure the connection string used by that report to a shared connection defined within Report Manager. However, the Express editions of Visual Studio do not support this. True, you can look for, enable and use the Report Viewer control, but you will not be able to design a report in the Express SKU.

    So what to do, what to do? Your best bet today is to download and install the Report Builder 2.0 tool. Robert Bruckner, a member of the SQL Server team specializing in the Express/Advanced Services edition, has a blog post (http://blogs.msdn.com/robertbruckner/archive/2008/08/25/ReportBuilder-20-RC1-Release.aspx) about this tool and a download link. I tried this new tool today. It feels very much like the Report Designer in SSRS 2008 -- a good thing indeed. When you save a designed report, you get an RDL file. Based on what Mr. Bruckner says in his blog, I believe this tool has been targeted to help solve the riddle at hand.

    I am currently building a VPC with SQL Server 2008 Express Edition with Advanced Services to see if some version of this tool is included with it. If so, it looks like we have a good solution for low-cost reporting in hand.

    By the way, if you are using the new AdventureWorks2008 OLTP database, you might want to check out this "issue" (http://www.codeplex.com/MSFTDBProdSamples/WorkItem/View.aspx?WorkItemId=7427) I found with [HumanResources].[vEmployeeDepartment]. Your votes would be appreciated.

  • The Top Five New Features in SSIS 2008 for Developers, Part 2

    Lookup. For many SQL Server Integration Services developers, it's the transformation we most love to dislike -- especially if you are using it against data living on a remote server that is not on the same local network as the host running your package. This combination of circumstances sometimes drives us to do some very inventive things. In this post I want to look at a situation I ran into a few months ago, how I addressed that situation with SSIS90 and how SSIS100 improves on that.

    For the sake of keeping things simple, let us start with a scenario. You have a list of 5,000 email addresses in an XML file. You need to do a look-up of those email addresses and get each person's name and mailing address. You will write that to a CSV file. This is a trivial task using SSIS. Where SSIS90 runs into a problem is when the server you are performing the look-up against is on the far end of a network connection. Following the best practices of SSIS development, suppose your look-up task was based on a SQL statement that specified only the columns of interest. However, what do you do about the rows of interest? In SSIS90, there is not much you can actually do easily.
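
    For concreteness, a columns-only look-up query might look something like this (a sketch; the table and column names are hypothetical):

    select emailAddress, firstName, lastName, addressLine1, city, stateProvince, postalCode
    from dbo.person;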

    Remember that unless you use a memory restriction, SSIS90 fully populates the look-up cache with all of the records in the look-up query before processing. That is a double-edged sword: it definitely increases the "wait time" before a data flow task starts processing records, especially if you are pulling data from a remote source over a slow network connection. Yet, once the data flow starts, the records are processed at blistering speeds.

    A straightforward solution to this situation is to use SSIS's ability to serialize buffers to a file -- also known as RAW files. In this case, an initial streaming of the lookup data is made into a RAW file. That RAW file is then used as a data source in a second package, where it is "merge joined" with the XML source. This effectively provides the same functionality as the look-up did in the previous package. Take care not to assume that the RAW file will have all of the needed information -- after all, people can move and new email addresses could be added at any time. Therefore, when you are designing the package, you should handle unmatched records using the lookup transformation. In addition, you should update the RAW file with the updated data.

    In SSIS100, the idea is fundamentally the same, but the tasks are slightly different. The problem being addressed is that there is no obvious way to serialize a look-up cache to a file (aside from the aforementioned technique, which may not be at all obvious), nor is a "merge join" an obvious way of doing a "look-up." The SSIS team has added a new transformation -- the Cache Transform -- and modified another -- the look-up -- to make it more obvious how to serialize the cache.

    The Cache Transform component itself is simple enough: within a given Data Flow, you connect this component to a path and it writes the buffers from that path to a new version of the raw-file format known as a Cache-Raw file (or CAW, its file extension). Like a normal RAW file, the output contains the binary version of the data. It also contains an index covering one or more of the columns. This indexing helps the look-up component use the cached data efficiently. When using this to solve our slow-data problem, we would run a one-time process to initialize the cache from a data source.

    The Look-up component in SSIS100 is expanded to accommodate using the Cache-Raw file. As before, this component can be configured to use an OLE-DB data source. However, it can also be configured to use a Cache-Raw file instead. This can dramatically improve the performance of the data flow since it eliminates the start-up delay in acquiring the data. However, there is still the possibility that the cached data may be stale or missing desired matches. Another new feature in the SSIS100 look-up is the ability to direct rows that were not found in the look-up to a new data path. You could "kind of" do this in SSIS90 if you assumed that the only row-level error was a failure to find a match. In the new design, rows not matching from the Cache-Raw file can be redirected to another look-up configured to read recently updated data from the remote database.

    One thing to keep in mind about the new look-up transform is that it does not seem to maintain the Cache-Raw file automatically. What you should do is redirect the no-match rows to a traditional look-up. As you complete processing in the data flow, add a second Cache Transform that writes the union of the matched rows and the not-matched-but-by-then-looked-up rows. You will need to name that CAW file something other than what you used as the source, of course. The File System Task can then be used to delete the old version of the cache and rename the new file to the name you used when configuring the first look-up.

    Examples of how to build the packages discussed in the post can be downloaded from http://gosqlserver.net/downloads/ttfnfssis100part2.zip .

  • The Top Five New Features in SSIS 2008 for Developers, Part 1

    You have to give credit where credit is due, and I certainly have to give credit to Kirk Haselden et al. and their book Microsoft SQL Server 2005 Integration Services. That book -- and some trial and error -- taught me a lot about how to tune data flow tasks for better performance. The folks at Simple Talk have part of the book online, specifically the part about tuning data flows (http://www.simple-talk.com/sql/sql-server-2005/sql-server-2005-ssis-tuning-the-dataflow-task/). If you aren't all that familiar with tuning SSIS 2005 data flows, it's a great piece to read before going at the rest of this article.

    I'd love to say that you don't need to worry about data flow tuning in SQL Server 2008 Integration Services (SSIS100), but I can't. What I can say is that you need to focus less on CPU allocation and utilization. As noted in the aforementioned article, in SSIS 2005 CPUs are allocated statically as the data flow task initializes. Basically, this means that if the optimizer for SSIS 2005 determines that it only needs one or two CPUs to execute the data flow, that is all it will ever use, even if more CPUs are available to work the data flow. What changed for SSIS100 is that CPUs are now dynamically scheduled based on the amount of "flow pressure," and all available CPUs are utilized. "Slower" tasks can now get more CPU time.

    Why did the SSIS team decide to use static scheduling in SSIS90 but dynamic scheduling in SSIS100? The answer is simple: back in 2002 to 2004, when SQL Server 2005 was being planned and developed, multi-CPU hosts were not as common as they are today. When you only have one or two CPUs to work with, static scheduling is simpler and has the lowest plan-generation time and cost. However, now that most production-level hosts have four or more CPUs to work with, dynamic allocation is the more efficient solution. On multi-CPU machines, this change enhances performance considerably in many cases. On the downside, on single-CPU machines, performance may actually degrade somewhat, so it is certainly something you want to know about and consider.

  • A detour: Building a MultiPolygon with SqlGeometryBuilder

    The RTM version of SQL Server 2008 and the recent release of the CLR updates with Visual Studio 2008 SP1 gave us the SqlGeometryBuilder and SqlGeographyBuilder classes to work with. These are very handy, simple APIs but, well, let's just say that the documentation on them is a bit lacking (Isaac Kunen has some discussion of them in his post Our Upcoming Builder API). This morning, I wanted to write a few bits of .NET code that did the same work as this T-SQL statement:

    declare @p geometry = geometry::STGeomFromText('MULTIPOLYGON(((-77.054700 38.872957,-77.057962 38.872620,-77.058547 38.870079,-77.055592 38.868840,-77.053217 38.870656,-77.054700 38.872957),(-77.056972 38.870639,-77.055851 38.870219,-77.054875 38.870864,-77.055452 38.871804,-77.056784 38.871655,-77.056972 38.870639)),((-77.056408 38.875290,-77.056947 38.875224,-77.057466 38.873598,-77.057273 38.872737,-77.055335 38.873020,-77.055499 38.874058,-77.056408 38.875290)))',4326);

    The trick here is knowing how to delimit the rings of a polygon and how to make a collection. It helps to keep four simple rules in mind:

    1. After instantiating the Builder, immediately set the Spatial Reference ID (SRID) you want to use. You must do this before you define any elements in the collection.
    2. In the case of a collection type, you need to call .BeginGeometry passing in a member of the OpenGisGeometryType enumeration for the desired collection.
    3. Each element in the collection needs to be started with a call to .BeginGeometry as well; each ring within it starts with a call to .BeginFigure.
    4. Collection members must be well-formed, and the collection itself must also be well-formed.

    Keeping all that in mind, here's an example method for constructing the geometry shown above:

    private static SqlGeometry CreateMultipolygon() {
      // Create a new Geometry Builder to work with
      SqlGeometryBuilder gb = new SqlGeometryBuilder();
      // Set the Spatial Reference ID to 4326 to match the T-SQL example above
      gb.SetSrid(4326);
      // Start the collection
      gb.BeginGeometry(OpenGisGeometryType.MultiPolygon);
      // Start the first element in this collection
      gb.BeginGeometry(OpenGisGeometryType.Polygon);
      // Define the first figure of this polygon (the exterior ring)
      gb.BeginFigure(-77.054700,38.872957);
      gb.AddLine(-77.057962, 38.872620);
      gb.AddLine(-77.058547, 38.870079);
      gb.AddLine(-77.055592, 38.868840);
      gb.AddLine(-77.053217, 38.870656);
      gb.AddLine(-77.054700, 38.872957);
      // End the first figure
      gb.EndFigure();
      // Define the second figure (an interior ring)
      gb.BeginFigure(-77.056972, 38.870639);
      gb.AddLine(-77.055851, 38.870219);
      gb.AddLine(-77.054875, 38.870864);
      gb.AddLine(-77.055452, 38.871804);
      gb.AddLine(-77.056784, 38.871655);
      gb.AddLine(-77.056972, 38.870639);
      gb.EndFigure();
      // End the first polygon
      gb.EndGeometry();
      // Define the second polygon
      gb.BeginGeometry(OpenGisGeometryType.Polygon);
      gb.BeginFigure(-77.056408, 38.875290);
      gb.AddLine(-77.056947, 38.875224);
      gb.AddLine(-77.057466, 38.873598);
      gb.AddLine(-77.057273, 38.872737);
      gb.AddLine(-77.055335, 38.873020);
      gb.AddLine(-77.055499, 38.874058);
      gb.AddLine(-77.056408, 38.875290);
      gb.EndFigure();
      gb.EndGeometry();
      // End (close) the collection
      gb.EndGeometry();
      // Return that as a SqlGeometry instance
      return gb.ConstructedGeometry;
    }

  • The Top Five New Features in SSIS 2008 for Developers, Part 0

    I am going to get myself in trouble (again) by saying this, but SSIS isn’t a Business Intelligence tool as much as it is a developer tool. If you are like me, you have written a lot of code that does, basically, the following:

    • Extracts the data from some place;
    • Transforms that data somehow;
    • Loads the transformed data into a database or some other store.

    Sure, my toolset has changed over the years from COBOL to Perl, Expect and QuickBasic to VBScript to C#, but the basic tasks have not. That is probably why I never warmed up to SQL Server 2000 DTS. The idea of extract, load and transform never really worked for me. I wanted streams, not tables.

    So when SSIS debuted with SQL Server 2005, I decided to make the effort to learn it to the best of my abilities. Yes, it is a great tool but like any tool, it does have some shortcomings:

    • Optimizing data flows is somewhat of a black art;
    • Using the linear lookup with a remote server is slow... unless you "cheated";
    • Transact-SQL lacks an UPSERT command (at least prior to 2008; see the MERGE sketch after this list);
    • Calling a Web Service from a script required building an external assembly and making it available to the Script runtime;
    • I am back to writing scripts in BASIC. Compiled BASIC with the full range of the CLR, sure, but it is still, well, BASIC.
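
    On the UPSERT point, SQL Server 2008 addresses it with the MERGE statement. Here is a minimal sketch (the staging and target table names are hypothetical):

    merge dbo.customerTarget as t
    using dbo.customerStaging as s
        on t.customerID = s.customerID
    when matched then
        update set t.emailAddress = s.emailAddress
    when not matched by target then
        insert (customerID, emailAddress) values (s.customerID, s.emailAddress);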

    Over the next few posts, I want to talk about the new features in SSIS 2008 that help address these shortcomings in SSIS 2005. However, if you saw my recent talk in Omaha, you already know the story. But as of this morning, you can get the bits too! Just browse to http://www.4shared.com/dir/7670149/a1b13c97/Omaha_SQLBI_User_Group.html and, when prompted for a password, enter "SQL4You" (sans quotes, of course). My presentation and bits are in the file labeled "ug_wnissis100.zip"

    Sudhir Gajre’s excellent performance tuning for SQL Server 2005 deck is also available from that site.

  • When is a bug not a bug?

    I guess when the developers decide it is not. Consider the following query:
    drop table dbo.shapeParts
    go
    create table dbo.shapeParts(
         shapePartID tinyint not null identity(1,1) primary key,
         shapeID tinyint not null,
         shapeQuad tinyint null,
         shape geometry not null);
    go
    insert into dbo.shapeParts(shapeID,shapeQuad,shape) values
    (1,0,geometry::STGeomFromText('POLYGON((0 1,1 1,1 2,0 2,0 1))',1));
    insert into dbo.shapeParts(shapeID,shapeQuad,shape) values
    (1,1,geometry::STGeomFromText('POLYGON((1 1,2 1,2 2,1 2,1 1))',1));
    insert into dbo.shapeParts(shapeID,shapeQuad,shape) values
    (1,2,geometry::STGeomFromText('POLYGON((0 0,1 0,1 1,0 1,0 0))',1));
    insert into dbo.shapeParts(shapeID,shapeQuad,shape) values
    (1,3,geometry::STGeomFromText('POLYGON((1 0,2 0,2 1,1 1,1 0))',1));
    go
    create spatial index shapeindex on dbo.shapeParts(shape)
    using geometry_grid with (
         bounding_box =(0, 0, 2, 2),
         grids =(level_1 = low,level_2 = low,level_3 = low,level_4 = low),
         cells_per_object = 1)
    go
    declare @g2 geometry = geometry::STGeomFromText('POINT(1 1)',1);
    select top(1) *
    from dbo.shapeParts with(index(shapeIndex))
    where (shape.STContains(@g2)=1)
    or (shape.STIntersects(@g2))=1
    go
    The Books Online topic "Geography Methods Supported by Spatial Indexes" reads:
    Under certain conditions, spatial indexes support the following set-oriented geography methods: STIntersects(), STEquals(), and STDistance(). To be supported by a spatial index, these methods must be used within the WHERE clause of a query, and they must occur within a predicate of the following general form: geography1.method_name(geography2) comparison_operator valid_number. To return a non-null result, geography1 and geography2 must have the same Spatial Reference Identifier (SRID). Otherwise, the method returns NULL.
    What's in question here is what "under certain conditions" means. That seems simple enough, and for other data types the kind of operation performed with multiple filter predicates is not an issue. But if you actually try to execute this query in RC0, you get an execution-time error:
    Msg 8635, Level 16, State 10, Line 2
    The query processor could not produce a query plan for a query with a spatial index hint. Reason: Could not find required comparison predicate. Try removing the index hints or removing SET FORCEPLAN.
    This seemed like a bug to me, so I filed it on connect (see https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=357578). The response?
    "Thank you for the bug submission. Spatial indexes match only WHERE/ON clauses made of single atomic predicates or ANDs thereof -- not ORs. We also do not infer that STIntersects here is redundant."
    The response that came back was authoritative and well-reasoned. Yet it also smacks of "it's not an error because we say it isn't." Hopefully this response (and this kind of response) will be reconsidered in the future.
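
    Based on that response, the form the optimizer will match with the hint is a single supported comparison predicate (or ANDs of such predicates) with no ORs. A sketch of that kind of rewrite, offered here as an untested assumption:

    declare @g2 geometry = geometry::STGeomFromText('POINT(1 1)',1);
    select top(1) *
    from dbo.shapeParts with(index(shapeIndex))
    where shape.STIntersects(@g2) = 1;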