THE SQL Server Blog Spot on the Web


Michael Zilberstein

  • T-SQL Tuesday #006: LOB, row-overflow and locking behavior

    This post is my contribution to T-SQL Tuesday #006, hosted this time by Michael Coles.

    Actually, this post was born last Thursday when I attended Kalen Delaney's "Deep dive into SQL Server Internals" seminar in Tel-Aviv. I asked a question, Kalen didn't have an answer at hand, so during a break I created a demo in order to check a certain behavior. The demo comes later in this post, but first a small teaser. I have a MyTable table with 10 rows. I take 2 rows that reside on different pages. In the first session a transaction is opened and the first row is deleted using the PAGLOCK hint (without COMMIT for the meantime). In the second session I select all columns from the second row (both sessions access the rows by the primary key, which is also the clustered index). The second session gets blocked. How come?

    Here is the full story and demo script. First we create a table with 10 rows, 2 of which are long.

    CREATE TABLE dbo.MyTable( c1 INT NOT NULL PRIMARY KEY, c2 VARCHAR(8000), c3 VARCHAR(8000), c4 VARCHAR(8000), c5 VARCHAR(MAX) )

    ;WITH Nums( Num )
    AS
    (
        SELECT ROW_NUMBER() OVER(ORDER BY [object_id])
        FROM sys.objects
    )
    INSERT INTO dbo.MyTable(c1, c2, c3, c4, c5)
    SELECT
        Num AS c1,
        CASE WHEN Num IN (2, 8) THEN REPLICATE('a', 3000) ELSE 'a' END AS c2,
        CASE WHEN Num IN (2, 8) THEN REPLICATE('b', 3000) ELSE 'b' END AS c3,
        CASE WHEN Num IN (2, 8) THEN REPLICATE('c', 4000) ELSE 'c' END AS c4,
        CASE WHEN Num IN (2, 8) THEN REPLICATE('d', 4000) ELSE 'd' END AS c5
    FROM Nums AS n1
    WHERE Num <= 10

    Let's examine how our table is stored (7 is the ID of my database; I don't show the entire output, only the relevant parts):

    DBCC IND(7, MyTable, -1)



    So we have 2 leaf-level data pages, 1 page for row-overflow data and 1 page for LOB data, plus 3 IAM pages - one per page type. We can also verify that the row-overflow and LOB pages contain 2 rows each (using DBCC PAGE we can check that m_slotCnt = 2 in the page header). Now open a new window in Management Studio and execute:


    BEGIN TRAN
    DELETE FROM dbo.MyTable WITH(PAGLOCK) WHERE c1 = 2

    In another session execute:

    SELECT c1, c2, c3, c5 FROM dbo.MyTable WHERE c1 = 8

    We see that the second session is blocked, so let's check the locking details with the sp_lock procedure:




    The page in contention is 200585, which is the LOB data page, as we can see from the first table. Actually, that's perfectly natural behavior - DELETE acquires an exclusive lock, SELECT tries to acquire a shared one. But I'm sure it would have taken me some time to solve this puzzle had it been a real case in my database. We're used to thinking of a row as sitting on one page, so a rowlock locks a row or a key - or perhaps an entire page, if we used the PAGLOCK hint or the Optimizer decided to start from a page-level lock. But we aren't accustomed to thinking of a row that spans multiple pages.

    And what happens if we query the row-overflow page? Pretty much the same (just substitute c4 for c5 in the second query):




    Another interesting issue is: what happens when we don't query the LOB and row-overflow columns? We know that SQL Server can't lock particular columns - it always locks an entire row of a table or an index. So what happens if we query only the first 3 columns: c1, c2 and c3? No blocking lock this time. It means that SQL Server does, in effect, lock particular columns and leave the other columns unlocked - but only when they reside on separate pages. Actually, it is similar to the behavior of column-oriented databases.

    Finally, I want to check what happens when the LOB column is updated - whether in-row data is also locked. So I update the LOB column in the first session and try to query the same row, but without the LOB column.

    First session:


    BEGIN TRAN
    UPDATE dbo.MyTable SET c5 = REPLICATE('e', 4000) WHERE c1 = 2

    Second session:

    SELECT c1, c2, c3, c4 FROM dbo.MyTable WHERE c1 = 2

    A blocking lock again - this time on a key lock:




    Here I don't see any reason for an exclusive lock - a shared lock would be enough to prevent DML operations on the row. There is no danger of a dirty read here, since the in-row data page isn't updated at all. So the first step was nice - a shared lock isn't acquired when we don't read the row-overflow and/or LOB page. The second step, in my opinion, should be: for DML operations, acquire only a shared lock on the pages/rows/keys that aren't updated.

  • Parsing Extended Events xml_deadlock_report

    Jonathan Kehayias and Paul Randal posted great articles more than a year ago on how to monitor historical deadlocks using the Extended Events system_health default session. Both tried to fix, on the fly, a bug in the xml output that caused xml validation failures. Today I found out that their version isn't bulletproof either. So here is the fixed one:

    SELECT CAST(xest.target_data AS XML) xml_data, *
    INTO #ring_buffer_data
    FROM sys.dm_xe_session_targets xest
    INNER JOIN sys.dm_xe_sessions xes ON xes.[address] = xest.event_session_address
    WHERE xest.target_name = 'ring_buffer' AND 
          xes.[name] = 'system_health'

    ;WITH CTE( event_name, event_time, deadlock_graph )
    AS
    (
        SELECT
            event_xml.value('(./@name)', 'varchar(1000)') AS event_name,
            event_xml.value('(./@timestamp)', 'datetime') AS event_time,
            event_xml.value('(./data[@name="xml_report"]/value)[1]', 'varchar(max)') AS deadlock_graph
        FROM #ring_buffer_data
        CROSS APPLY xml_data.nodes('//event[@name="xml_deadlock_report"]') n (event_xml)
        WHERE event_xml.value('@name', 'varchar(4000)') = 'xml_deadlock_report'
    )
    SELECT event_name, event_time,
        CAST(
            CASE WHEN CHARINDEX('<victim-list/>', deadlock_graph) > 0 THEN
                REPLACE(
                    REPLACE(deadlock_graph, '<victim-list/>', '<deadlock><victim-list>'),
                    '<process-list>', '</victim-list><process-list>')
            ELSE
                REPLACE(
                    REPLACE(deadlock_graph, '<victim-list>', '<deadlock><victim-list>'),
                    '<process-list>', '</victim-list><process-list>')
            END
        AS XML) AS DeadlockGraph
    FROM CTE
    ORDER BY event_time DESC

    The difference, as you can see, is in the check whether the '<victim-list>' node is empty (CHARINDEX('<victim-list/>', deadlock_graph) > 0). My system_health session caught some weird deadlocks. According to the trace, they're caused by intra-query parallelism, when threads of the same session deadlock one another. It is a rather rare type of deadlock, but nothing strange - Bart Duncan and Craig Freedman blogged about it long ago. What is really strange is that this deadlock isn't caught by trace flag 1222 - the error log remained empty. And - probably connected to the previous item - the victim list is empty. So instead of an opening '<victim-list>' node we receive an empty one: '<victim-list/>'. I'm not sure such a deadlock deserves any attention if it occurs once in a while, but I sure want to be able to parse xml_deadlock_report events without failures.
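    The string surgery is easy to sanity-check outside SQL Server. Here is a minimal Python sketch of the same fix-up; the sample payload is a hypothetical, heavily abbreviated xml_deadlock_report fragment, not real output:

```python
import xml.etree.ElementTree as ET

def fix_deadlock_graph(graph: str) -> str:
    """Wrap the report in an opening <deadlock> tag and close <victim-list>,
    handling both the empty (<victim-list/>) and non-empty variants."""
    if '<victim-list/>' in graph:
        graph = graph.replace('<victim-list/>', '<deadlock><victim-list>')
    else:
        graph = graph.replace('<victim-list>', '<deadlock><victim-list>')
    return graph.replace('<process-list>', '</victim-list><process-list>')

# Hypothetical, abbreviated payload with the empty victim list produced by
# an intra-query parallelism deadlock:
raw = '<victim-list/><process-list><process id="p1"/></process-list></deadlock>'
fixed = fix_deadlock_graph(raw)
ET.fromstring(fixed)  # parses cleanly once fixed
```

    The ELSE-style branch is the original fix; the extra check for the self-closing '<victim-list/>' is what makes it survive the empty-victim-list deadlocks described above.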

  • T-SQL Tuesday - IO capacity planning

    This post is my contribution to Adam Machanic's T-SQL Tuesday #004, hosted this time by Mike Walsh.

    Being an application DBA, I usually don't take part in discussions about which storage to buy or how to configure it. My interaction with IO is usually via PerfMon. When somebody calls me asking why everything is suddenly so slow on a database server, the "disk queue length" or "average seconds per transfer" counters provide an overwhelming answer in 60-70% of such cases. Sometimes it can be fixed by adding indexes to heavy tables. Not always, though. Many companies, especially young and fast-growing ones, just can't properly measure their IO requirements. They are usually aware of how much disk space they need, but a question about IOpS is met with puzzled looks.

    Recently I had to build a new system capable of 40 thousand random IOpS at peak load. In such a system, IO is by far the most expensive part. Although the space requirements were moderate, just several Terabytes, 4-5 SATA disks of 1TB each won't be enough. Far from it. Modern disks with 15,000 RPM rotation speed can perform ~170 random IO requests per second. So, 40K / 170 = 235 disks in several RAID0 arrays. Enough?
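    As a quick sanity check, the spindle arithmetic from this and the next paragraph can be written down in a few lines (a back-of-the-envelope sketch using the numbers above):

```python
# Back-of-the-envelope spindle count for the target workload.
peak_iops = 40_000        # required random IOpS at peak load
iops_per_disk = 170       # ~random IOs/sec for a 15K RPM disk

disks_raid0 = peak_iops // iops_per_disk    # spindles with no redundancy
disks_raid10 = disks_raid0 * 2              # mirroring doubles the disk count
print(disks_raid0, disks_raid10)
```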

    Not so fast. What about high availability requirements? Can we afford downtime if one of the disks is gone? No, we can't. OK, so RAID0 is out of scope. That leaves us with RAID10 (striping plus mirroring) and different parity options: RAID3 through RAID6. I'll consider RAID5, as it is the most popular one. For RAID10 the calculation is simple: the 235 disks in RAID0 we've seen earlier, x2 for the mirror disks = 470 disks. That's quite a lot! So let's consider the less expensive (at least it sounds so) option - RAID5. A standard RAID5 array contains 5 disks. Unlike RAID3 and RAID4, it has no dedicated parity disk (instead, parity is written round-robin across all disks), so we can assume that nothing is wasted. We're still running with the same number - 235 disks, just a little less overall space due to parity. But space isn't our concern here. So, can we finally issue the order to the storage vendor?

    Unfortunately, we're still not there. It's a good time to ask about the usage pattern - what do those 40K IOpS consist of? ~40% are SELECT operations (i.e. 16K operations per second), 20% are INSERT (8K per second) and 40% are UPDATE (16K per second). We also know that the rows that undergo an UPDATE are those that were SELECTed before, and that all operations use a clustered index seek. Let's perform a simple calculation. We can safely assume that the upper levels of the clustered index are in cache, so we have 1 physical read per SELECT. An UPDATE performs a read + a write, but we know that the row has been read earlier, so we can assume that the read also comes from cache. So, the price of an UPDATE is 1 IO. Great! 1 IO per operation regardless of its type - we're still at 235 disks. At this point some of you would say: "Stop wasting our time, let's buy that damn storage!"

    OK, after a week or two we install the storage, configure 235 disks in arrays of 5, plus a little more for hot spares, Windows, the SQL Server transaction log etc. Now we open PerfMon and start a load test, expecting to see 40K IOpS at maximum load. And we would be greatly disappointed. Actually, having your CV updated is a good thing at this point - some organizations that I know won't employ a DBA who wastes around 100K bucks of their money. So, where is the catch?

    The catch is the UPDATE. Contrary to widespread belief, PerfMon doesn't show physical IO. Yes, we know that drives D and E reside on the same "physical disk". But is it really physical? No, it is just another layer that can conceal beneath it anything from a directly attached disk to a RAID50 array. PerfMon shows IO "on host". So we indeed see 1 IO per UPDATE. But the storage actually does something else - and it is the really physical layer now. First, it reads both the data and the parity. The data is in cache, so we have 1 read instead of 2; but remember, we haven't updated anything yet (you probably ask about the parity - why isn't it in cache? Because for SELECT operations only data is read). Next, it writes both the data and the parity. So, we have 3 IO operations per single UPDATE. Going back to the overall numbers, 16K updates are actually 48K IO operations when running on RAID5. This trap is known as the "RAID5 write penalty". And that's not the only RAID5 issue. We also get ~2/3 performance degradation on a single disk failure - because in order to reconstruct a corrupted bit we have to read the corresponding bits on all 4 other disks, including parity -> 4 times more IO. Considering all this, we should carefully evaluate whether RAID5 is really more cost effective than RAID10. The answer is positive for read-intensive environments, where there is no write penalty. For write-intensive environments the TCO of RAID10 can be better, especially taking into account possible corruptions.
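    The write-penalty arithmetic above, as a sketch (same numbers as in the text; the penalty factor of 3 assumes the data-page read is served from cache):

```python
host_update_iops = 16_000   # UPDATEs/sec as PerfMon sees them "on host"
raid5_update_penalty = 3    # read parity + write data + write parity

physical_update_iops = host_update_iops * raid5_update_penalty
print(physical_update_iops)   # the 16K host updates become 48K physical IOs
```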

    Last but not least. I know that some statements here are oversimplified. I didn't take into account page splits or influence of storage system cache; I didn't consider SSD-based or mixed storage. Although I used real system I've built as example, here it is more of a mathematical model. Real world is always more complicated. But I don't want my fellow DBAs to fall asleep reading this post :-).

  • For want of a nail

    For want of a nail the shoe was lost.
    For want of a shoe the horse was lost.
    For want of a horse the rider was lost.
    For want of a rider the battle was lost.
    For want of a battle the kingdom was lost.
    And all for the want of a horseshoe nail.


    I'm now in the middle of a severity A case that is best depicted by this rhyme. While Microsoft engineers look for the root cause, I'll describe here how a small bug can "kill" a strong server. The story begins when, for some unknown reason, the "rows" column in the sys.partitions catalog view began to show 4.6 billion rows for the clustered index of one of the entity tables in my datawarehouse database. COUNT(*) on the table returns 1 million rows.

    Second step - a daily job that executes the sp_updatestats stored procedure. This procedure updates statistics in the database using sampled data. Let's say I randomly took 5% of the table's data and calculated the number of unique values for Column1 - n values. So the estimated total number of unique values of Column1 in my table is 20*n. But how do I know how many rows per unique value of Column1 there are in the table? Very simple: SQL Server just takes the rows column from sys.partitions and divides it by our 20*n number.
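    To see how badly the bogus row count skews the density estimate, here is that division with illustrative numbers (the distinct-value count n is made up for the example):

```python
n = 1_000                          # distinct Column1 values in the 5% sample (assumed)
estimated_distinct = 20 * n        # extrapolated to the whole table

actual_rows = 1_000_000            # what COUNT(*) returns
bogus_rows = 4_600_000_000         # what sys.partitions claims

rows_per_value_ok = actual_rows // estimated_distinct    # a selective predicate
rows_per_value_bad = bogus_rows // estimated_distinct    # "millions of rows" territory
print(rows_per_value_ok, rows_per_value_bad)
```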

    Third step - a query: it joins our table with another one WHERE Column1 = x. The Optimizer compiles the query and builds an execution plan based on the existing statistics (i.e. 4.6 billion rows in the table). The expected number of rows from our table is in the millions - so the Optimizer chooses a HASH JOIN as the best way to perform the JOIN operation.

    Now that the Optimizer has decided on an execution plan, it asks for a memory quota for query execution - based on the estimate of millions of rows. Remember that the memory quota for query execution isn't taken from the data buffer cache, so on 32-bit systems we can't utilize too much memory. For example, I executed a single query as described earlier and, while it was running, checked the memory usage of my session.


    So we have 252MB granted while less than 1MB is used. The total available memory is 1.6GB, so you can calculate yourself how many concurrent queries the system supports in its current state. That is exactly the situation we had: 5 sessions were running and 65 were waiting for a memory grant (if you monitor wait types - you'll see a lot of RESOURCE_SEMAPHORE).
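    The "calculate yourself" part, spelled out (numbers from the PerfMon observation above; in practice SQL Server reserves part of that memory, which is why only 5 sessions were actually running):

```python
available_mb = 1_600        # ~1.6 GB available for query memory grants
grant_per_query_mb = 252    # granted per query, though <1 MB is actually used

concurrent_queries = available_mb // grant_per_query_mb
print(concurrent_queries)   # everything beyond this waits on RESOURCE_SEMAPHORE
```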

    P.S. DBCC UPDATEUSAGE fixes the initial problem. So while Microsoft engineers look for the reason why sys.partitions went wild, I'm monitoring it in order to fix the problem before it "kills" the server.

  • Partition Details Custom Report - enhanced

    Recently my friend and colleague Yoni Nakache drew my attention to a nice custom report which can save any DBA valuable time - it returns the number of rows per partition along with partition boundaries and filegroup data. The rdl can be found here at Codeplex. What was definitely missing is space usage information - reserved / used space per partition. So I filled the gap - the new rdl is here.

  • NHibernate wonders

    This ORM tool just can't stop surprising me. A week ago I found out that although NHibernate can execute a stored procedure, it doesn't support output parameters - when a procedure returns one, it causes a failure. Today I found another pearl. Imagine the following scenario: 2 entities with a many-to-many relationship between them. For example, projects and employees - a project contains many employees, an employee can participate in several projects. So we have a ProjectsEmployees table that maps employees to projects. Done with the foreword, now the scenario itself. Project X already has 9 employees mapped to it in the ProjectsEmployees table. We want to add another employee to the project. Just a simple INSERT operation, right? NHibernate doesn't look for easy ways. It performs the following operations:

    1. DELETE FROM ProjectsEmployees WHERE ProjectID = X

    2-11. INSERT INTO ProjectsEmployees VALUES()...

    Instead of a single INSERT we have 11 operations here. Besides the unnecessary pressure on the IO system and possible locks, we have a much more serious consistency issue - if the HR department queries the table after the DELETE operation but before all the INSERTs have completed, it would receive a project state that never existed in real life - 3 employees, for example. But that's the way NHibernate works - "just in case", it overwrites the entire project's collection.

    Ah, and how did I find out? I noticed that after mapping a new employee to a project, the table's delete trigger fires. The WEB team developers (those actually programming with NHibernate) keep a log of the commands NHibernate executes in the development environment. There I saw it, plain and simple.

  • Monitoring page splits with Extended Events

    After reading Kalen Delaney's post about a single insert causing 10 page splits, I wanted to see those splits in detail - their order, in the first place. And in SQL Server 2008 there is a way to trace splits - using the new Extended Events infrastructure. Here is a simple script that creates the trace and afterwards displays the results.

    First of all, create and populate the table in tempdb as described in Kalen's post.

    USE tempdb;

    CREATE TABLE split_page 
    (id INT IDENTITY(0, 2) PRIMARY KEY,  -- even ids: 0, 2, 4, ..., as in Kalen's script
     id2 bigint DEFAULT 0,
     data1 VARCHAR(33) NULL, 
     data2 VARCHAR(8000) NULL);
    GO

    INSERT INTO split_page DEFAULT VALUES;
    GO 385

    Now let's create and start the Extended Events session. The only event we would like to monitor is page_split. Along the way we'll capture the sql text, in order to be sure that it is our insert that caused the split.

    CREATE EVENT SESSION MonitorPageSplits ON SERVER
    ADD EVENT sqlserver.page_split
    (
        ACTION (sqlserver.database_id, sqlserver.sql_text)
        WHERE sqlserver.database_id = 2   -- tempdb
    )
    ADD TARGET package0.asynchronous_file_target
    (
        SET filename     = N'c:\temp\MonitorPageSplits.etx',
            metadatafile = N'c:\temp\MonitorPageSplits.mta'
    );
    GO

    ALTER EVENT SESSION MonitorPageSplits ON SERVER STATE = start;

    Now execute INSERT command from Kalen's script.

    SET IDENTITY_INSERT split_page  ON;
    INSERT INTO split_page (id, id2, data1, data2)
    SELECT 111, 0, REPLICATE('a', 33), REPLICATE('b', 8000);
    SET IDENTITY_INSERT split_page  OFF;

    Afterwards we'll stop the session and display the results.



    ALTER EVENT SESSION MonitorPageSplits ON SERVER STATE = stop;
    GO

    SELECT
        split.value('(/event/data[@name=''file_id'']/value)[1]','int') AS [file_id],
        split.value('(/event/data[@name=''page_id'']/value)[1]','int') AS [page_id],
        split.value('(/event[@name=''page_split'']/@timestamp)[1]','datetime') AS [event_time],
        split.value('(/event/action[@name=''sql_text'']/value)[1]','varchar(max)') AS [sql_text]
    FROM
    (
        SELECT CAST(event_data AS XML) AS split
        FROM sys.fn_xe_file_target_read_file('c:\temp\MonitorPageSplits*.etx', 'c:\temp\MonitorPageSplits*.mta', NULL, NULL)
    ) AS t
    ORDER BY [event_time]

    OK, the result is:


    So, we see 10 splits - among them 4 splits of page 148, another 3 of page 178, etc. It makes sense. When a split occurs, ~half of the data from the old page goes to the new one. So if the new row - the one that caused the split - originally belonged in the first half of the page, after the split it would still try to enter the old page - not the new one. In our case we originally had 56 rows before the new one (id from 0 to 110, step 2) and 385 - 56 = 329 rows after it. I would still expect 3 and not 4 splits of the initial page, because (((385 / 2) / 2) / 2) = ~48 < 57 (the new row's position). So I expected that after the third split the new row would at last leave the initial page. But I was wrong - I don't know whether the math just isn't that strict, or there are other factors I didn't think of.
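    The halving argument can be sketched in a few lines - count how many 50/50 splits it takes until the new row's slot no longer fits on the original page (per the reasoning above; the actual splitting evidently differs, since the trace shows 4 splits):

```python
rows_on_page = 385    # rows on the original page before the insert
new_row_slot = 57     # 56 rows (id 0..110, step 2) precede the new row (id = 111)

splits = 0
while rows_on_page >= new_row_slot:
    rows_on_page //= 2    # after a 50/50 split the old page keeps the first half
    splits += 1
print(splits)   # expected splits of the initial page, vs 4 observed
```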

  • Restoring master database

    Disclaimer: I'm not an infrastructure DBA. Probably everything below is just bullshit.

    Recently our talented IT guys destroyed a disk on a virtual machine. They chose the disk carefully - it was the one with the master database. How do you restore master from backup? Start the SQL Server service from the command line in single-user mode and execute the RESTORE DATABASE command. OK, the start in single-user mode fails because it doesn't find the master database files. At this point I would have liked to take the backup file and, instead of restoring the database to a working state, just extract the master.mdf and mastlog.ldf files (and not just them, see ahead). But unfortunately that is unsupported (or I failed to find the right option - see the disclaimer).

    Next step - find a SQL Server installation with the same product version on another server, stop the service, and copy master.mdf and mastlog.ldf (it can't be done without stopping the service - the files are inaccessible). Again "sqlservr.exe -m"... This time mssqlsystemresource.mdf and mssqlsystemresource.ldf are missing. Copy them too. Also model.mdf and modellog.ldf. Again "sqlservr.exe -m". This time I faced a real problem - startup failed on creating tempdb, because on the server I took master.mdf from, the tempdb datafile sat on disk F and the tempdb log on disk H, while on my server I have only C and D. As far as I understand, at this point I had 3 options: re-install SQL Server and afterwards restore all databases, including master, from backup; add disks F and H; or find an installation where tempdb sits on C or D. Luckily I could just add F and H, and that solved the problem.

    But the whole sequence of actions seems unreasonable. BTW, after starting the service in single-user mode, I restored master from backup and it automatically moved tempdb back to its old location on D - master contains the paths to all system databases. Afterwards F and H were removed. It seems totally unreasonable to me that in order to restore the master database, new logical disks had to be created. I see 2 possible solutions: to be able to extract the mdf and ldf files from a backup without actually restoring the database, or to be able to configure the path to the system databases externally, in the service properties, like the path to master and the errorlog.

  • NHibernate and triggers

    Recently I had an interesting experience with NHibernate (I was against working with it from the beginning, for many reasons, but unfortunately I wasn't the one to decide). I added a trigger to a table; the trigger performs several update / insert operations on other tables. Suddenly, updating the table via NHibernate returned an error:

    NHibernate.AdoNet.TooManyRowsAffectedException: Unexpected row count: 6; expected: 1

    So NHibernate doesn't support triggers that perform DML operations? Not quite. The solution is: add SET NOCOUNT ON before the trigger performs its DML operations. It looks like NHibernate doesn't really count DML operations, but rather the "2 rows affected" messages returned by SQL Server - and the NOCOUNT setting suppresses these messages.

  • MERGE statement syntax modified

    This is probably known to all, but I've just found out - the MERGE command syntax has been modified since CTP6. In the beginning of 2008 I built a system on CTP6, and this week, while preparing for production deployment, I found out that procedures were failing with a syntax error. In CTP6, MERGE contained WHEN TARGET NOT MATCHED and WHEN SOURCE NOT MATCHED blocks. In RTM, if you write WHEN TARGET NOT MATCHED, it fails compilation; the new syntax is WHEN NOT MATCHED [BY TARGET] and WHEN NOT MATCHED BY SOURCE.

    For non-believers, here is an old blog post by Chad Boyd with examples that use the old syntax.

  • Query performance troubleshooting in SQL Server 2008: query_hash and query_plan_hash

    Recently I noticed 2 new columns added to the sys.dm_exec_query_stats and sys.dm_exec_requests DMVs in SQL Server 2008: query_hash and query_plan_hash. These columns can greatly enhance the performance monitoring process. In SQL 2005, the main query I use for query performance monitoring is:

    SELECT TOP 10
        qs.execution_count,
        (qs.total_physical_reads + qs.total_logical_reads + qs.total_logical_writes) AS [Total IO],
        (qs.total_physical_reads + qs.total_logical_reads + qs.total_logical_writes) / qs.execution_count AS [Avg IO],
        SUBSTRING(qt.[text], qs.statement_start_offset/2, (
            CASE
                WHEN qs.statement_end_offset = -1 THEN LEN(CONVERT(NVARCHAR(MAX), qt.[text])) * 2
                ELSE qs.statement_end_offset 
            END - qs.statement_start_offset)/2
        ) AS query_text,
        tp.query_plan AS StatementPlan
    FROM sys.dm_exec_query_stats qs
    CROSS APPLY sys.dm_exec_sql_text (qs.[sql_handle]) AS qt
    OUTER APPLY sys.dm_exec_query_plan(qs.plan_handle) tp
    ORDER BY [Avg IO] DESC
    --ORDER BY [Total IO] DESC

    It returns the top 10 heaviest queries by average IO that exist in cache. I won't go deeper into discussing the plan cache memory pressure conditions that can force a query plan out of cache, thus preventing its detection. Let's just say that it works right in 99% of cases, which is good enough. Memory pressure can be detected by other queries.

    Which queries would be candidates for tuning? First, those with the highest total IO. Second, queries with the highest average IO that pass a certain minimum-number-of-executions criterion (we usually won't tune a query, even the heaviest one, that runs once a month in some offline batch). As a side note - you're probably asking why I ignore CPU counters like total_worker_time. The reason is simple: in sys.dm_exec_query_stats this counter is unreliable - it shows incorrect numbers in the case of parallel execution.

    So, Houston, do we have a problem here? Unfortunately, we do - when the application that works with the database doesn't make proper use of parameterization (and we don't want to force parameterization via a database-level setting). In such a case we'll see lots of similar queries with 1 or 2 in execution_count and different values of should-be-parameters in query_text. We can miss such queries, either because every single one is not heavy enough to be of interest, or because many of them aren't in cache anymore, pushed out by new queries - even of the same type. It is an especially realistic scenario on 32-bit systems, where the entire non-data cache is limited to 1GB of space.

    What are our options with poorly parameterized queries? We can use the CLR user-defined function provided by Itzik Ben-Gan in his book "Inside SQL Server 2005: T-SQL Querying", which uses regular expressions to parameterize the query text (this function is widely used for Profiler trace analysis). The query text can be passed through the function and used as a grouping column. But even if we don't count the performance and CPU price of grouping by a text column, I know several organizations that just won't let you create your objects in their database.

    Taking all the above into account, I was delighted to find out that in SQL 2008 Microsoft added the query_hash and query_plan_hash columns to the sys.dm_exec_query_stats DMV. query_hash is the same for queries with similar logic; query_plan_hash is the same for queries with a similar execution plan. And what's the difference? For a column with uneven data distribution, the execution plan can differ depending on the parameter value. If a value appears in 90% of a table, the Optimizer will surely choose a scan. For another value, responsible for 0.1% of the rows, the Optimizer will prefer an index seek with a key or RID lookup, depending on the table's structure. For those two queries we'll see the same query_hash but different query_plan_hash values.

    So the new SQL 2008 version of the query is:

    ;WITH CTE(TotalExecutions, [Total IO], [Avg IO], StatementTextForExample, plan_handle, QueryHash, QueryPlanHash)
    AS
    (
        SELECT TOP 10  
            SUM(execution_count) AS TotalExecutions,
            SUM(total_physical_reads + total_logical_reads + total_logical_writes) AS [Total IO],
            SUM(total_physical_reads + total_logical_reads + total_logical_writes) / SUM(execution_count) AS [Avg IO],
            MIN(query_text) AS StatementTextForExample,
            MIN(plan_handle) AS plan_handle,
            query_hash AS QueryHash,
            query_plan_hash AS QueryPlanHash
        FROM
        (
            SELECT
                qs.*,
                SUBSTRING(qt.[text], qs.statement_start_offset/2, (
                    CASE
                        WHEN qs.statement_end_offset = -1 THEN LEN(CONVERT(NVARCHAR(MAX), qt.[text])) * 2
                        ELSE qs.statement_end_offset 
                    END - qs.statement_start_offset)/2
                ) AS query_text
            FROM sys.dm_exec_query_stats AS qs
            CROSS APPLY sys.dm_exec_sql_text(qs.[sql_handle]) AS qt
            WHERE qt.[text] NOT LIKE '%sys.dm_exec_query_stats%'
        ) AS query_stats
        GROUP BY query_hash, query_plan_hash
        ORDER BY [Avg IO] DESC
    )
    SELECT
        TotalExecutions, [Total IO], [Avg IO],
        StatementTextForExample,
        tp.query_plan AS StatementPlan,
        QueryHash, QueryPlanHash
    FROM CTE
    OUTER APPLY sys.dm_exec_query_plan(plan_handle) AS tp
    ORDER BY [Avg IO] DESC
    --ORDER BY [Total IO] DESC;

    Another great use of the new columns is to create a repository and monitor execution plan changes over time, i.e. changes in query_plan_hash for queries with the same query_hash value. In SQL 2005 such monitoring was pretty complicated and required on-the-fly parameterization. In SQL 2008 it looks pretty straightforward.

  • Name resolution in SQL Server

    This post is inspired by a recent post by Mladen Prajdić. Mladen found out (as many did before him - hanging around blogs and communities, I see questions about this issue from time to time) that name resolution for temporary tables differs from that for permanent tables. While for a permanent table more than one CREATE TABLE statement with the same table name can be issued inside the same batch or procedure, for a temporary table it fails compilation with: "There is already an object named '#t1' in the database".

    -- Succeeds
    CREATE TABLE t1(col1 int);
    DROP TABLE t1;
    CREATE TABLE t1(col1 int);

    -- Fails
    CREATE TABLE #t1(col1 int);
    DROP TABLE #t1;
    CREATE TABLE #t1(col1 int);

    The reasons for this differentiation are historical. Deferred name resolution was introduced in SQL Server 7, but for permanent tables only. It doesn't work for temporary tables, and it doesn't work for variables either, even when they are declared in different scopes:

    -- Fails with "The variable name '@a' has already been declared" error,
    -- despite the fact that @a is declared in 2 different scopes.
    IF (1 = 1)
        DECLARE @a INT 
    ELSE
        DECLARE @a INT 

    A possible workaround for temporary tables (if for some reason you desperately need to create a temporary table with the same name twice inside a batch) is to alter the table schema instead of dropping and recreating it from scratch.

    CREATE TABLE #t1(col1 int);
    ALTER TABLE #t1 ADD col2 VARCHAR(32);
  • Partitioning by GUID

    Recently I had to use a GUID (Globally Unique Identifier) as a primary key in order to ensure cross-server uniqueness. In many cases a surrogate key (some sort of concatenation of a server id and an intra-server identity) would give better performance, but in my case, due to application requirements, a surrogate key wasn't an option.

    The next step is partitioning by the GUID column. Say you want 3 partitions. How do you build the partition function? Sure, you can perform a calculation and find the exact 2 values that divide the whole GUID range into 3 equal parts. I needed at least 20 partitions, so I used an approximation. If you write down a long hexadecimal number, you'll get something like A09F76... I decided that partitioning by the first 2 hex digits is enough. An example in decimal: let's divide the values from 1 to 10000 into 3 partitions - what would the ranges be? The exact answer is: 1 - 3333, 3334 - 6666, 6667 - 10000. Using an approximation by the first 2 digits gives the following ranges: 1 - 3300, 3301 - 6600, 6601 - 10000. 2 orders of magnitude (the first 2 digits) are a good enough approximation. In hexadecimal, the first 2 digits (00 - FF) represent 0 - 255 in decimal. I'm not accustomed to hexadecimal calculations, so first I divide the decimal range into 3 parts: 1 - 85, 86 - 170, 171 - 255. The next step is converting it to hexadecimal: 01 - 55, 56 - AA, AB - FF.
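    The boundary calculation generalizes to any partition count; here is a small sketch of the arithmetic above (an illustrative helper, not part of the original script):

```python
def byte_boundaries(n_partitions):
    """Upper boundary of the leading byte (00-FF) for each partition
    except the last, as 2 hex digits."""
    return [format((255 * i) // n_partitions, '02X')
            for i in range(1, n_partitions)]

print(byte_boundaries(3))    # the 3-partition case from the text: 55 and AA
```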

    OK, done with the values - how do I translate them into GUIDs? For example, 01 in hexadecimal - what is the GUID representation of it (taking into account that these should be the first 2 digits, not the last ones)? It took me a while to understand. First I tried "01000000-0000-0000-0000-000000000000" - surprisingly, all the rows (more than a million) went to the first partition. OK, another try: "00000001-0000-0000-0000-000000000000". Still the same. The catch is that a GUID should be read in byte-group order from right to left, while inside every byte group the reading order is from left to right. And every 2 hex digits are 1 byte. So the right representation for 01 as the first 2 digits of a GUID is: "00000000-0000-0000-0000-010000000000".
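    This byte-ordering rule, as stated above, can be expressed as a tiny helper (a sketch of the described comparison order, not an official SQL Server API):

```python
def guid_sort_key(guid):
    """Byte order in which SQL Server compares a uniqueidentifier, per the
    rule above: byte groups right to left, bytes within a group left to right."""
    groups = guid.split('-')
    return bytes.fromhex(''.join(reversed(groups)))

# '01' as the first compared 2 hex digits lives in the rightmost group,
# so this boundary sorts above a GUID with leading digits FF:
boundary = '00000000-0000-0000-0000-010000000000'
leading_ff = 'FF000000-0000-0000-0000-000000000000'
print(guid_sort_key(boundary) > guid_sort_key(leading_ff))
```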

    Finally, here is a partition function that partitions GUIDs into 20 range groups:

    CREATE PARTITION FUNCTION [pf_GUIDRangePartition](uniqueidentifier) AS RANGE RIGHT 
    FOR VALUES (
        -- 19 boundary GUIDs of the form '00000000-0000-0000-0000-XX0000000000',
        -- where XX steps through 00-FF in 20 roughly equal parts
        ...
    ) ;
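    Under the scheme above, the 19 boundary values for 20 partitions can be generated programmatically. A minimal Python sketch (`guid_boundaries` is my hypothetical helper; the generated values approximate the author's list, which is not shown in full):

```python
def guid_boundaries(partitions: int) -> list[str]:
    """Split the one-byte range 0-255 into `partitions` roughly equal
    parts and place each boundary byte in the first byte of the last
    GUID group, i.e. the most significant position for comparison."""
    bounds = []
    for i in range(1, partitions):
        b = round(256 * i / partitions)           # 13, 26, 38, ...
        bounds.append(f"00000000-0000-0000-0000-{b:02X}0000000000")
    return bounds

boundaries = guid_boundaries(20)
print(len(boundaries))    # 19
print(boundaries[0])      # 00000000-0000-0000-0000-0D0000000000
```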
  • Scalability features I would like to have in SQL Server

    1. Active-Active cluster.

    Like RAC in Oracle. The maximum a commodity server can provide these days is a machine with 4 quad-core CPUs. If an application requires more, we can split the database (which is not always possible; and even when it is, such a solution often looks very artificial) or buy a high-end server with more CPUs and a much higher price tag. The only thing that keeps SQL Server alive in the market for applications with high scalability requirements is the totally unreasonable (IMHO) price that Oracle charges for RAC. It costs so much that even a high-end machine is cheaper. For a machine with 4 quad-core CPUs you'll pay 100K$ for a SQL Server 2005 Enterprise Edition license (25K$ per license per CPU) and 240K$ for Oracle 11g Enterprise Edition (40K$ per license per every 2 cores). If you buy 2 machines, for Microsoft the price is doubled. For Oracle - if you want them in RAC - double the licensing price, plus add another 20K$ for RAC per every pair of cores. Which gives us 200K$ vs 960K$ for two identical machines with 4 quad-core CPUs. Price is the only real answer Microsoft actually has to RAC - all the rest is pretty pathetic. Judge for yourself - here is the link to the whitepaper (PDF) on Microsoft's site that compares SQL Server with RAC. And what will happen if tomorrow Oracle reduces the price for RAC by 50%? BTW, the prices for Oracle are list prices, but they're willing to negotiate. So the gap isn't that big.

    2. Indexes per partition

    SQL Server 2005 gave a partial answer (or, to be honest, a workaround) - indexed views. But with a huge limitation: since an index on a view couldn't be aligned with the partition scheme, the sliding window scenario was impossible (the index on the view had to be dropped in order to perform a partition switch or merge). In SQL Server 2008 the situation is better - now an indexed view can be aligned with the partition scheme and survives switch and merge operations on partitions. Nevertheless, an indexed view is pretty costly compared to a normal index. It also requires a unique index, which is not always possible.

    Another part of the answer in SQL Server 2008 is filtered indexes. They should be cheaper than an indexed view and can be non-unique. But filtered indexes can't be created on computed columns (Microsoft's answer to Oracle's function-based indexes) or UDT columns. Close, but still not there.

    3. Bitmap indexes and function-based indexes

    Those aren't pure scalability features, but they can be very useful when you need to analyze, apply filters and run queries against large read-only partitions (this requirement is one of the two main reasons to keep huge amounts of historical data; the second is legislation and accountability requirements).

    I'm sure there are other features I forgot to mention, but the lack of the features listed above causes me to curse more often than I would like to.

  • How do we know that Microsoft conducts performance tests?

    We know it from the comments in system procedures' code. Here is what I found while digging for some undocumented stuff about statistics:

    -- I don't like this copy&paste stuff, but here it boosts performance by 50%

    You can find it too - in the sys.sp_table_statistics2_rowset procedure. It returns the first 2 parts of the DBCC SHOW_STATISTICS output (the stat header and the density vector) in table format. The procedure header is:

    EXEC sys.sp_table_statistics2_rowset
        @table_name = ?, --  sysname
        @table_schema = ?, --  sysname
        @table_catalog = ?, --  sysname
        @stat_name = ?, --  sysname
        @stat_schema = ?, --  sysname
        @stat_catalog = ? --  sysname

    If you supply only @table_name and @stat_name, you'll receive the same output as from an undocumented (I think) syntax of DBCC SHOW_STATISTICS:

