THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Michael Zilberstein

  • LAG function – practical use and comparison to old syntax

    Recently I had to analyze huge trace – 46GB of trc files. Looping over files I loaded them into trace table using fn_trace_gettable function and filters I could use in order to filter out irrelevant data. I ended up with 6.5 million rows table, total of 7.4GB in size. It contained RowNum column which was defined as identity, primary key, clustered. One of the first things I detected was that although time difference between first and last events in the trace was 10 hours, total duration of all sql events was only 4 hours. So the next assumption to check was whether application got stuck - whether we have significant gap between two subsequent events. In the following query I’m looking for events that occurred at least 30 seconds after previous event’s completion.

    SQL 2005/2008/2008R2 syntax:

    SELECT
      
    curEvent.RowNum, prevEvent.EndTime AS PrevEventEndTime, curEvent.StartTime AS CurEventStartTime
    ,
      
    DATEDIFF(SECOND, prevEvent.EndTime, curEvent.StartTime) AS
    ApplicationStuckDuration
    FROM
      
    dbo.TraceData curEvent
      
    INNER JOIN dbo.TraceData prevEvent ON curEvent.RowNum - 1 =
    prevEvent.RowNum
    WHERE DATEDIFF(SECOND, prevEvent.EndTime, curEvent.StartTime) >=
    30

    image

    SQL 2012 syntax:

    ;WITH GapsCTE AS
    (
      
    SELECT RowNum, StartTime AS CurEventStartTime, LAG(EndTime, 1, NULL) OVER(ORDER BY RowNum ASC) AS
    PrevEventEndTime
      
    FROM
    dbo.TraceData
    )  
    SELECT RowNum, PrevEventEndTime, CurEventStartTime, DATEDIFF(SECOND, PrevEventEndTime, CurEventStartTime) AS
    ApplicationStuckDuration
    FROM
    GapsCTE
    WHERE DATEDIFF(SECOND, PrevEventEndTime, CurEventStartTime) >=
    30

    image

    Here are results (pretty surprising at the first glance, I have to say):

      SQL 2012 SQL 2005/2008/2008R2
    Duration 103 seconds 77 seconds
    Reads 919,752 1,839,504
    CPU 20 seconds 13 seconds

    In the old SQL versions we see twice reads – that’s obvious from execution plan since we scan entire table twice and merge join on RowNum column. It seems that in this case Merge Join is so efficient that CPU overheard of segmentation, aggregation and spooling costs more than additional single ordered scan of the table that is already in memory anyway.

    I started to write all this in order to demonstrate practical usage of the new function and how we can benefit from it compared to previous SQL Server versions. Results became a complete surprise for me.

    Conclusion: always check and recheck yourself!

  • Reading temporary table from another session

    It happens to me at least once a week – I want to check progress of some heavy script that runs in chunks over big dataset and find out that it writes intermediate data to temporary table only. Last time it happened 3 days ago when I wanted to analyze 50GB trace table on my notebook. I wrote a script that was taking 200 thousand rows at a time, parameterizing them and aggregating by different keys – host name, application etc. Usual trace analysis stuff. After an hour I wanted to check the progress but found out that intermediate results are written to temp table and of course I forgot to add debug prints. Took me some thought and ~5 minutes of coding to find the solution. How?

    My trace analysis script ran over 200K rows chunks and among other things aggregated them grouping by parameterized query text. One of the measurement columns was “TotalQueries” – counter of rows in a group that could be later used to calculate average values for Reads, Writes, Duration and CPU. Every 200k rows chunk inserted rows into #BA temp table. Last step in a script, when all chunks have already been processed, was aggregation of #BA table’s data into permanent table. So in order to monitor the progress I could SUM all values of “TotalQueries” column in #BA table and compare it to number of rows in my 50GB trace table (which can be easily verified using sp_spaceused).

    Of course, I new the temp table name – I wrote the initial script. So first thing was to find its object_id.

    SELECT [object_id] FROM tempdb.sys.tables WHERE name LIKE '#BA%'

    It brought me negative number: –1546597904 in my current demo. I don’t have access to temp table created by another session. But as admin on my own notebook I sure have access to every page in every database. So next step is to find all pages that belong to #BA table. For this task we have undocumented but widely known (and used) DBCC IND command. Since digging in every page manually and wasting hour on it wasn’t exactly my purpose, I kept DBCC IND output in the table variable and used it later in the script.

    DECLARE @BATablePages TABLE
    (
      
    PageFID INT, PagePID INT, IAMFID INT, IAMPID INT, ObjectID BIGINT, IndexID INT, PartitionNumber INT
    ,
      
    PartitionID BIGINT, iam_chain_type VARCHAR(64), PageType TINYINT, IndexLevel TINYINT, NextPageFID INT
    ,
      
    NextPagePID INT, PrevPageFID INT, PrevPagePID
    INT
    )

    INSERT INTO
    @BATablePages
    EXEC('DBCC IND(2, -1546597904, 1)'
    )

    Actually we need only PagePID column and only for leaf level data pages e.g. PageType = 1. Next step is to loop over those pages and insert their content into another table. For viewing page’s data I used another widely known (and undocumented as well) DBCC PAGE command. It is less known that DBCC PAGE can be used along with WITH TABLERESULT suffix. Note: column definitions in the following script are accidental – I just wanted them to be wide enough so that script can complete successfully.

    DECLARE @rc INT, @v_PageID INT, @sql NVARCHAR(MAX)

    CREATE TABLE #ind( RowNum INT NOT NULL IDENTITY(1,1), PageNum INT
    NOT NULL )
    CREATE TABLE #page(ParentObject VARCHAR(128), [Object] VARCHAR(512), Field VARCHAR(128), [Value] VARCHAR(8000
    ) )

    INSERT INTO #ind( PageNum
    )
    SELECT
    PagePID
    FROM
    @BATablePages
    WHERE PageType =
    1

    SET @rc =
    @@ROWCOUNT
    WHILE @rc >
    0
    BEGIN
       SELECT
    @v_PageID = PageNum FROM #ind WHERE RowNum =
    @rc
          
      
    SET @sql = 'DBCC PAGE(2, 1, ' + CONVERT(VARCHAR(32), @v_PageID) +
    ', 3) WITH TABLERESULTS'

      
    INSERT INTO #page(ParentObject, [Object], Field, [Value]
    )
      
    EXEC(@sql
    )

      
    SET @rc = @rc -
    1
    END

    So now we have all data pages inside single table but not in a format we’re used to:

    image

    But actually, we now have everything we need. The rest is simple:

    DELETE FROM #page WHERE Field <> 'TotalQueries'
                                    
    SELECT SUM(CONVERT(INT, [Value])) AS
    TotalQueries
    FROM #page
  • Page splits, extended events, index page allocation and all the fish

    2 years ago I wrote about monitoring page splits with Extended Events. Only 2 bloggers explored Extended Events feature at that time, so my post was more of a learning than for any practical matter. Yet one question remained open: according to straightforward calculations, after 3 splits of the first page, next split should have occurred on another page - but trace with extended events clearly showed 4 splits of initial page.

    Recently Pavel Nefyodov has drawn my attention to another curious fact: initially we had 1 page; after all those splits we had 10. But trace showed 10 splits, not 9. So where did additional split come from? And finally I've realized that there is another simple way to trace those splits - via transaction log. If we manually checkpoint the system before and perform insert in transaction afer - our transaction log will containg exactly the relevant data; and it'll remain this way all the time transaction stays open. We'll query transaction log using undocumented but widely known fn_dblog function.

    OK, Let's see some code (mostly taken from my old post which in turn used page split demo from Kalen Delaney's blog). Initial configuration (including trace definition):

    USE tempdb;
    GO

    CREATE TABLE split_page  
    (id INT IDENTITY(0,2) PRIMARY KEY,
    id2 bigint DEFAULT 0,
    data1 VARCHAR(33) NULL,  
    data2 VARCHAR(8000) NULL);
    GO

    INSERT INTO split_page DEFAULT VALUES;
    GO 385

    CREATE EVENT SESSION MonitorPageSplits ON SERVER
    ADD EVENT sqlserver.page_split
    (
      
    ACTION (sqlserver.database_id, sqlserver.sql_text)  
       
    WHERE sqlserver.database_id = 2
    )
    ADD TARGET package0.asynchronous_file_target
    (
      
    SET  
          
    filename = N'c:\temp\MonitorPageSplits.etx',  
          
    metadatafile = N'c:\temp\MonitorPageSplits.mta'
    );
    GO

    CHECKPOINT
    GO

    Short glance at our table's allocation inside data file:

    DBCC IND(tempdb, split_page, -1);
    GO

    PagePID IndexID iam_chain_type PageType IndexLevel
    175 1 In-row data 10 NULL
    148 1 In-row data 1 0

    So we have 1 IAM page and 1 data page. Now the split itself and - remember, we should query transaction log before committing transaction if we don't want accidental checkpoint to ruine all our efforts.

    ALTER EVENT SESSION MonitorPageSplits ON SERVER STATE = start;
    GO

    SET IDENTITY_INSERT split_page  ON;
    GO
    BEGIN TRAN

    INSERT INTO
    split_page (id, id2, data1, data2)
         
    SELECT 111, 0, REPLICATE('a', 33), REPLICATE('b', 8000);
    GO

    SET IDENTITY_INSERT split_page  OFF;
    GO

    SELECT [Current LSN], Operation, Context, [Page ID], [Description]
    FROM fn_dblog(NULL, NULL)

    COMMIT
    GO

    ALTER
    EVENT SESSION MonitorPageSplits ON SERVER STATE = STOP;
    GO
    DROP EVENT SESSION MonitorPageSplits ON SERVER;
    GO

    SELECT  
      
    split.value('(/event/data[@name=''file_id'']/value)[1]','int') AS [file_id],
      
    split.value('(/event/data[@name=''page_id'']/value)[1]','int') AS [page_id],
      
    split.value('(/event[@name=''page_split'']/@timestamp)[1]','datetime') AS [event_time],
      
    split.value('(/event/action[@name=''sql_text'']/value)[1]','varchar(max)') AS [sql_text]
    FROM
      
    (  
          
    SELECT CAST(event_data AS XML) AS split
          
    FROM sys.fn_xe_file_target_read_file('c:\temp\MonitorPageSplits*.etx', 'c:\temp\MonitorPageSplits*.mta', NULL, NULL)
       )
    AS t
    ORDER BY [event_time]

    First let's take a look at our table now:

    PagePID IndexID iam_chain_type PageType IndexLevel
    175 1 In-row data 10 NULL
    148 1 In-row data 1 0
    184 1 In-row data 2 1
    185 1 In-row data 1 0
    187 1 In-row data 1 0
    188 1 In-row data 1 0
    189 1 In-row data 1 0
    190 1 In-row data 1 0
    191 1 In-row data 1 0
    392 1 In-row data 1 0
    393 1 In-row data 1 0
    394 1 In-row data 1 0

    Transaction log contains more rows than we are interested in, so I cleaned irrelevant ones out from the table below and correlated transaction log with extended events trace. I also transformed hexadecimal page numbers from transaction log to decimal ones - like we see in extended events trace.

    image

    So we can see that indeed there have been 3 splits of the first page. From transaction log we see that first action was actually allocating of page 184 which is - look at the allocation table above - index page. According to transaction log this page is allocated - not splitted from any other page. Still extended events trace shows as if it has been splitted from the first data page of our table. Don't know whether this behavior can be considered as bug - probably not. Anyway - both unanswered question from my old post are resolved: first page indeed has been splitted 3 times and we had only 9 data page splits while 10th split belongs to index page allocation.

    Have a nice week!

  • Read Committed isolation level, indexed views and locking behavior

    From BOL, "Key-Range Locking" article:

    Key-range locks protect a range of rows implicitly included in a record set being read by a Transact-SQL statement while using the serializable transaction isolation level. The serializable isolation level requires that any query executed during a transaction must obtain the same set of rows every time it is executed during the transaction. A key range lock protects this requirement by preventing other transactions from inserting new rows whose keys would fall in the range of keys read by the serializable transaction. (bold is mine - MZ)


    I had a blocking locks problems recently. What should have been working concurrently from multiple threads actually worked single-threaded only. Here is demo - table of employees and indexed view above it that shows number of employees and total salary per department. I populated it randomly with up to 10 deparments.


    USE [tempdb]
    GO

    SET ANSI_NULLS, ANSI_PADDING, ANSI_WARNINGS, CONCAT_NULL_YIELDS_NULL, QUOTED_IDENTIFIER ON
    SET
    NUMERIC_ROUNDABORT OFF
    GO

    CREATE TABLE dbo.Employees
    (
      
    EmployeeID INT NOT NULL IDENTITY(1, 1) PRIMARY KEY,
      
    DepartmentID SMALLINT NOT NULL,
      
    Salary MONEY NOT NULL
    )
    GO

    CREATE VIEW dbo.V_Departments
    WITH SCHEMABINDING
    AS
       SELECT
          
    DepartmentID,
          
    SUM(Salary) AS DepartmentSalary,
          
    COUNT_BIG(*) AS TotalEmployees
      
    FROM dbo.Employees
      
    GROUP BY DepartmentID
    GO

    CREATE UNIQUE CLUSTERED INDEX PK_V_Departments ON dbo.V_Departments( DepartmentID ASC )
    GO

    INSERT INTO dbo.Employees( DepartmentID, Salary )
    SELECT
      
    [object_id] % 10 + 1 AS DepartmentID,
      
    [object_id] % 3000 + 1 AS Salary
    FROM sys.objects
    GO

    Now I open 2 concurrent transactions and try to add new employee from both. The important thing here is that both belong to new departments: 11 and 13 (in the demo data we had departments from 1 to 10).


    Session 1 Session 2
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED

    BEGIN TRAN

    INSERT INTO
    dbo.Employees( DepartmentID, Salary )
    VALUES( 11, 2500 )
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED

    BEGIN TRAN

    INSERT INTO
    dbo.Employees( DepartmentID, Salary )
    VALUES( 13, 3500 )

    Second transaction is locked. Why? We don't have anything else that updates department 12, so why it is locked? Let's find out


    SELECT
      
    OBJECT_NAME(i.[object_id]) AS ObjectName,
      
    i.name AS IndexName,
      
    l.request_mode AS LockRequestMode,
      
    l.resource_type AS LockResourceType,
      
    l.resource_description AS LockResourceDescription
    FROM
      
    sys.dm_tran_locks AS l
      
    INNER JOIN sys.dm_os_waiting_tasks AS wt ON l.lock_owner_address = wt.resource_address
      
    INNER JOIN sys.partitions p ON p.hobt_id = l.resource_associated_entity_id
      
    INNER JOIN sys.indexes AS i ON i.[object_id] = p.[object_id] AND i.index_id = p.index_id
    WHERE wt.session_id = 57
    AND l.request_status = 'WAIT'


    image


    RangeS-U mode means shared lock for the range between values and update lock for values themselves. One value is obvious - department 11 that we've inserted in the first session. Second value, as we see in the query result, is (ffffffffffff) which indicates max int. Storage Engine knows to lock rows and other objects that are scanned by Query Processor. But we don't have row with department greater than 11. So right boundary of locked range is set to maximum possible value.

    So we understand what range is locked. But why? If we work in pure read committed isolation level, we hold exclusive lock on rows we modify but nobody prevents another user from inserting or deleting another row with same department value. But we expect that our operation would update corresponding department row in the view before anybody else updates same row (e.g. if we read data after our transacion is complete, it won't be affected by other transactions). Which is classic serializable behavior. So instead of forcing customer to require serializable isolation level expicitely, SQL Server implicitely converted read-committed to serializable for this particular operation. It appears that same scenario occurs with cascading updates on foreign keys.

  • Old error in active locks detection script

    Almost every applicative DBA has scripts for locking and blocking issues detection. Some, like me, wrote the script themselves in order to become familiar with underlying DMVs, others downloaded one of the huge number of versions available in the net. All of these scripts are based on sys.dm_tran_locks DMV which contains information about currently active locks and on sys.dm_os_waiting_tasks that holds infomation about waiting tasks. Some of the scripts also filter out locks acquired by system sessions by adding "WHERE request_session_id > 50". My own script was among them.

    Recently I had a blocking issue. But for some reason my script didn't show blocking locks tree. Deeper investigation showed that both blocker and blocked session belonged to Service Broker. But why script failed to indicate it? Following query provides an answer - it shows spid of Service Broker activated tasks:

    SELECT spid FROM sys.dm_broker_activated_tasks

    image

    As you can see, common belief that all spids under 50 belong to internal system sessions is just a myth. Actually, it was right at SQL 2000 days. Starting from SQL 2005 Serive Broker (and, I think, other features too) started to use first 50 spids.

    Very nice ebook and script collection "Free SQL Server DMV Starter Pack" inspired me to write this post. It is really nice and "no nonsense" - I'll recommend it as a great starter to young DBAs that don't have their own script collection. Still script for detecting active locks contains the same old familiar mistake:

    SELECT ...
    FROM sys.dm_tran_locks
    WHERE request_session_id > 50
  • T-SQL Tuesday #006: LOB, row-overflow and locking behavior

    This post is my contribution to T-SQL Tuesday #006, hosted this time by Michael Coles.

    Actually this post was born last Thursday when I attended Kalen Delaney's "Deep dive into SQL Server Internals" seminar in Tel-Aviv. I asked question, Kalen didn't have answer at hand, so during a break I created demo in order to check certain behavior. Demo goes later in this post but first small teaser. I have MyTable table with 10 rows. I take 2 rows that reside on different pages. In first session transaction is opened and first row is deleted using PAGLOCK hint (without COMMIT for a meantime). In second session I select all columns from the second row(both sessions access rows by primary key which is also clustered index). Second session got locked. How come?

    Here is the full story and demo script. First we create table with 10 rows, 2 of which are long.

    CREATE TABLE dbo.MyTable( c1 INT NOT NULL PRIMARY KEY, c2 VARCHAR(8000), c3 VARCHAR(8000), c4 VARCHAR(8000), c5 VARCHAR(MAX) )
    GO

    ;WITH Nums( Num )
    AS
    (
      
    SELECT ROW_NUMBER() OVER( ORDER BY (SELECT 1) ) AS Num
      
    FROM sys.objects
    )
    INSERT INTO dbo.MyTable(c1, c2, c3, c4, c5)
    SELECT 
      
    Num AS c1,
      
    CASE WHEN Num IN (2, 8) THEN REPLICATE('a', 3000) ELSE 'a' END AS c2,
      
    CASE WHEN Num IN (2, 8) THEN REPLICATE('b', 3000) ELSE 'b' END AS c3,
      
    CASE WHEN Num IN (2, 8) THEN REPLICATE('c', 4000) ELSE 'c' END AS c4,
      
    CASE WHEN Num IN (2, 8) THEN REPLICATE('d', 4000) ELSE 'd' END AS c5
    FROM Nums AS n1
    WHERE Num <= 10
    GO

    Let's examine, how out table is stored (7 is ID of my database; I don't show entire output but only relevant parts):

    DBCC IND(7, MyTable, -1)

    image

     

    So we have 2 leaf-level data pages, 1 page for row-overflow data and 1 page for LOB data plus 3 IAM pages - one per page type. We can also verify that row-overflow and LOB pages contain 2 rows each (using DBCC PAGE we check that m_slotCnt = 2 in page's header). Now open new window in Management Studio and execute:

    BEGIN TRAN

    DELETE FROM
    dbo.MyTable WITH(PAGLOCK) WHERE c1 = 2

    In another session execute:

    SELECT c1, c2, c3, c5 FROM dbo.MyTable WHERE c1 = 8

    We see that second session is locked, so let's check locking details with sp_lock procedure:

     

    image

     

    Page in contention is 200585 which is LOB data page as we can see from the first table. Actually that's perfectly natural behavior - DELETE acquires exclusive lock, SELECT tries to acquire shared. But I'm sure it would have taken some time for me to solve the puzzle whether it was real case in my database. We're used to think of a row sitting at one page, so rowlock locks a row or a key. Probably entire page if we used PAGLOCK hint or Optimizer decided to start from a page-level lock. But we aren't accustomed to think of a row which spans over multiple pages.

    And what happens if we query row-overflow page? Pretty much the same (just substitute c5 by c4 in the second query):

     

    image

     

    Another interesting issue is - what happens when we don't query LOB and row-overflow columns? We know that SQL Server doesn't know to lock particular columns but always entire row of a table or an index. So what will happen if we query only first 3 columns: c1, c2 and c3? No blocking lock this time. It means that SQL Server knows to lock particular columns and leave other columns unlocked but only if they reside on separate pages. Actually it is similar to column-oriented databases behavior.

    Finally I want to check what happens when LOB column is updated - whether in-row data is also being locked. So I update LOB column in the first session and try to query same row but without LOB column.

    First session:

    BEGIN TRAN

    UPDATE
    dbo.MyTable SET c5 = REPLICATE('e', 4000) WHERE c1 = 2

    Second session:

    SELECT c1, c2, c3, c4 FROM dbo.MyTable WHERE c1 = 2

    Blocking lock again. Now on a keylock:

     

    image

     

    Here I don't see any reason for exclusive lock - shared would be enough in order to prevent DML operations on the row. No danger of dirty read here since in-row data page isn't updated at all. So first step was nice - shared lock isn't acquired when we don't read row-overflow and/or LOB page. Second step in my opinion should be - for DML operations acquire only shared lock for pages/rows/keys that aren't updated.

  • Parsing Extended Events xml_deadlock_report

    Jonathan Kehayias and Paul Randall posted more than a year ago great articles on how to monitor historical deadlocks using Extended Events system_health default trace. Both tried to fix on the fly bug in xml output that caused failures in xml validation. Today I've found out that their version isn't bulletproof either. So here is the fixed one:

    SELECT CAST(xest.target_data as XML) xml_data, *
    INTO #ring_buffer_data
    FROM 
      
    sys.dm_xe_session_targets xest
      
    INNER JOIN sys.dm_xe_sessions xes on xes.[address] = xest.event_session_address
    WHERE 
      
    xest.target_name = 'ring_buffer' AND 
      
    xes.name = 'system_health'
    GO

    ;WITH CTE( event_name, event_time, deadlock_graph )
    AS
    (
      
    SELECT
          
    event_xml.value('(./@name)', 'varchar(1000)') as event_name,
          
    event_xml.value('(./@timestamp)', 'datetime') as event_time,
          
    event_xml.value('(./data[@name="xml_report"]/value)[1]', 'varchar(max)') as deadlock_graph
      
    FROM #ring_buffer_data
          
    CROSS APPLY xml_data.nodes('//event[@name="xml_deadlock_report"]') n (event_xml)
      
    WHERE event_xml.value('@name', 'varchar(4000)') = 'xml_deadlock_report'
    )
    SELECT event_name, event_time
       
    CAST(
          
    CASE 
              
    WHEN CHARINDEX('<victim-list/>', deadlock_graph) > 0 THEN
                  
    REPLACE (
                      
    REPLACE(deadlock_graph, '<victim-list/>', '<deadlock><victim-list>'),
                      
    '<process-list>', '</victim-list><process-list>'
              
    ELSE
                  
    REPLACE (
                      
    REPLACE(deadlock_graph, '<victim-list>', '<deadlock><victim-list>'),
                      
    '<process-list>', '</victim-list><process-list>'
          
    END 
       AS
    XML) AS DeadlockGraph
    FROM CTE
    ORDER BY event_time DESC
    GO

    The difference as you can see is in the check whether '<victim-list>' node is empty (CHARINDEX('<victim-list/>', deadlock_graph) > 0). My system_health session caught some weird deadlocks. According to trace they're caused by intra-query parallelism when threads of the same session deadlock one another. Until now it is rather a rare type of deadlock but nothing strange - Bart Duncan and Craig Freedman blogged about it long ago. What is really strange is that this deadlock isn't caught by 1222 trace flag - error log remained empty. And - probably connected to previous item - victim list is empty. So instead of opening node '<victim-list>' we receive an empty one: '<victim-list/>'. I'm not sure such a deadlock deserves any attention if it occurs once in a while but I sure want to be able to parse xml_deadlock_report events without failures.

  • T-SQL Tuesday - IO capacity planning

    This post is my contribution to Adam Machanic's T-SQL Tuesday #004, hosted this time by Mike Walsh.

    Being applicative DBA, I usually don't take part in discussions which storage to buy or how to configure it. My interaction with IO is usually via PerfMon. When somebody calls me asking why everything is suddenly so slow on database server, "disk queue length" or "average seconds per transfer" counters provide an overwhelming answer in 60-70% of such cases. Sometimes it can be fixed by adding indexes to heavy tables. Not always though. Many companies, especially young and fast growing ones, just can't properly measure IO requirements. They are usually aware of how much disk space they need. Question about IOpS is met with puzzled looks.

    Recently I had to build a new system capable of 40 thousand random IOpS at the peak load. In such IO is by far the most expensive part. Although space requirements were moderate, just several Terabytes, 4-5 SATA disks of 1TB won't be enough. Far from it. Modern disks with 15000 RPM rotation speed can perform ~170 random IO requests per second. So, 40K/170 = 235 disks in several RAID0 arrays. Enough?

    Not so fast. What about high availability requirements? Can we afford downtime if one of the disks is gone? No, we can't. OK, so RAID0 is out of scope. It leaves us with RAID10(striping plus mirroring) and different parity options: RAID3 through RAID6. I'll consider RAID5 as it is the most popular one. For RAID10 calculation is simple: 235 disk in RAID0 we've seen earlier x2 for mirror disks = 470 disks. That's quite a lot! So let's consider less expensive (at least sounds so) option - RAID5. Standard RAID5 array contains 5 disks. Unlike RAID3 and RAID4 it has no dedicated disk for parity (instead parity is written in a round robbin on all disks), so we can assume that nothing is wasted. We're still running with the same number - 235 disks, just a little less overall space due to parity. But space isn't our concern here. So, can we finally issue order to the storage vendor?

    Unfortunately we're still not there. Good time to ask about usage pattern - what do those 40K IOpS contain? ~40% are SELECT operations (e.g. 16K operations per second), 20% are INSERT (8K per second) and 40% are UPDATE (16K per second). We also know, that rows that undergo UPDATE are those that are SELECTed before. And we know that all operations use clustered index seek. Let's perform simple calculation. We can safely assume that upper levels of clustered index are in cache, so we have 1 physical read per SELECT. UPDATE performs read + write but we know that row has been read earlier, so we can assume that read is also from cache. So, price of UPDATE is 1 IO. Great! 1 IO per operation regardless of its type - we're still with 235 disks. At this point some of you would say: "Stop wasting our time, let's buy that damn storage!"

    OK, after week or two we install storage, configure 235 disks in arrays of 5 plus a little more for hot spare, Windows, SQL Server transaction log etc. Now we open PerfMon and start load test expecting to see 40K IOpS at maximum load. And we would be greatly disappointed. Actually, having CV updated is a good thing at this point - some organizations that I know won't employ DBA that wastes them around 100K bucks. So, where is the catch?

    The catch is an UPDATE. Contrary to the wide belief, PerfMon doesn't show physical IO. Yes, we know that drives D and E reside on the same "physical disk". But is it really physical? No, it is just another layer that can conceal beneath anything from directly attached disk to RAID50 array. PerfMon shows IO "on host". So we probably see 1 IO for UPDATE. But storage actually does something else - and it is really physical layer now. First, it reads both data and parity bytes. Data byte is in cache, so we have 1 read instead of 2; but remember, we haven't updated anything yet (you probably ask about parity byte - why it isn't in cache? because for SELECT operations only data bytes are read). Next, it writes both data and parity. So, we have 3 IO operations per single UPDATE. Going back to overall numbers, 16K updates are actually 48K IO operations when working on RAID5. This trap is known as "RAID5 write penalty". That's not the only RAID5 issue. We also have ~2/3 performance degradation on single disks failure - because in order to reconstruct corrupted bit we have to read all 4 bits on other disks including parity -> 4 times more IO. Considering all this, we should carefully evaluate whether RAID5 is really more cost effective than RAID10. The answer is positive for read-intensive environments where we do not have write penalty. For write-intensive environments TCO of RAID10 can be better, especially taking into account possible corruptions.

    Last but not least. I know that some statements here are oversimplified. I didn't take into account page splits or influence of storage system cache; I didn't consider SSD-based or mixed storage. Although I used real system I've built as example, here it is more of a mathematical model. Real world is always more complicated. But I don't want my fellow DBAs to fall asleep reading this post :-).

  • For want of a nail

    For want of a nail the shoe was lost.
    For want of a shoe the horse was lost.
    For want of a horse the rider was lost.
    For want of a rider the battle was lost.
    For want of a battle the kingdom was lost.
    And all for the want of a horseshoe nail.

     

    I'm now in the middle of severity A case that is best depicted by this rhyme. While Microsoft engineers look for a root cause, I'll describe here how small bug can "kill" strong server. The story begins when for some unknown reason "rows" column in sys.partitions DMV begins to show 4.6 billion rows for clustered index of one of the entities tables in my datawarehouse database. COUNT(*) on the table returns 1 million rows.

    Second step - daily job that executes sp_updatestats stored procedure. This procedure updates statistics in the database using sample data . Let's say, I randomly took 5% of table's data and calculated number of unique values for Column1 - n values. So total unique values of Column1 in my table is 20*n. But how do I know, how many rows per unique value of Column1 are there in the table? Very simple: SQL Server just takes rows column from sys.partitions and divides by our 20*n number.

    Third step - query: it joins between our Table and another one WHERE Column1 = x. Optimizer compiles the query and builds execution plan based on existing statistics (e.g. 4 billion rows in the table). Expected number of rows from our table is millions - so Optimizer chooses HASH JOIN as the best way to perform JOIN operation.

    Now that Optimizer made a decision about execution plan, it asks for a memory quota for query execution. Based on the estimation about millions of rows. Remember that memory quota for query execution isn't taken from data buffer cache, so on 32 bit systems we can't utilize too much memory. For example, I executed single query as described earlier and while it was running checked memory usage of my session.

    image

    So we have 252Mb granted while less than 1Mb used. Total available memory is 1.6Gb, so you can calculate yourself how many concurrent queries system supports at its current state. It is exactly the situation we had: 5 sessions were running and 65 were waiting for memory grant (if you monitor wait types - there will be a lot of RESOURCE_SEMAPHORE).

    P.S. DBCC UPDATEUSAGE fixes the initial problem. So while Microsoft engineers look for the reason why sys.partitions went wild, I monitor it in order to fix the problem before it "kills" the server.

  • Partition Details Custom Report - enhanced

    Recently my fellow friend and colleague Yoni Nakache drew my attention to the nice custom report which can save valuable time to any DBA - it returns number of rows per partition along with partition boundaries and filegroup data. The rdl can be found here at Codeplex. What was definitely missing is space usage information - reserved / used space per partition. So I filled the gap - new rdl is here.

  • NHibernate wonders

    This ORM piece of tool just can't stop to surprise. A week ago I've found out that although NHibernate can execute stored procedure, it doesn't support output parameters - when procedure returns one, it causes failure. Today I've found another pearl. Imagine following scenario: 2 entities with many-to-many relationship between them. For example, projects and employees - project contains many employees, employee can participate in several projects. So we have ProjectsEmployees table that maps employees to projects. Done with a foreword, now the scenario itself. Project X contains 9 employees already mapped to it in the ProjectsEmployees table. We want to add another employee to the project. Just a simple INSERT operation, right? NHibernate doesn't look for easy ways. It performs the following operations:

    1. DELETE FROM ProjectsEmployees WHERE ProjectID = X

    2-11. INSERT INTO ProjectsEmployees VALUES()...

    Instead of single INSERT we have 11 operations here. Besides unnecessary pressure on IO system and possible locks, we have much more serious consistency issue - if HR department queries table after DELETE operation but before all the INSERTs have been completed, it would receive project status that never occurred in real life - 3 employees for example. But that's the way NHibernate works - "just in case" overwrites the entire projects object.

    Ah, and how did I find out? Noticed that after mapping new employee to project trigger on delete fires. WEB team developers (those actually programming with NHibernate) keep log of commands NHibernate executes in development environment. There I have seen it plain and simple.

  • Monitoring page splits with Extended Events

    After reading Kalen Delaney's post about single insert causing 10 page splits, I wanted to see those splits in detail - their order at first place. And in SQL Server 2008 there is a way to trace splits - using new Extended Events infrastructure. Here is simple script that creates the trace and afterwards displays results.

    First of all, create and populate table in tempdb as described in the Kalen's post.

    USE tempdb;
    GO

    CREATE TABLE split_page 
    (id INT IDENTITY(0,2) PRIMARY KEY,
    id2 bigint DEFAULT 0,
    data1 VARCHAR(33) NULL, 
    data2 VARCHAR(8000) NULL);
    GO

    INSERT INTO split_page DEFAULT VALUES;
    GO 385

    Now, let's create and start Extended Events session. The only event we would like to monitor is page_split. On the way we'll capture sql text in order to be sure that it is our insert that caused split.

    CREATE EVENT SESSION MonitorPageSplits ON SERVER
    ADD EVENT sqlserver.page_split
    (
      
    ACTION (sqlserver.database_id, sqlserver.sql_text
       
    WHERE sqlserver.database_id = 2
    )
    ADD TARGET package0.asynchronous_file_target
    (
      
    SET 
          
    filename = N'c:\temp\MonitorPageSplits.etx'
          
    metadatafile = N'c:\temp\MonitorPageSplits.mta'
    );
    GO

    ALTER EVENT SESSION MonitorPageSplits ON SERVER STATE = start;
    GO

    Now execute INSERT command from Kalen's script.

    SET IDENTITY_INSERT split_page  ON;
    GO
    INSERT INTO split_page (id, id2, data1, data2)
         
    SELECT 111, 0, REPLICATE('a', 33), REPLICATE('b', 8000);
    GO
    SET IDENTITY_INSERT split_page  OFF;
    GO

    Afterwards we'll close the session and display results.

    ALTER EVENT SESSION MonitorPageSplits ON SERVER STATE = STOP;
    GO

    DROP EVENT SESSION MonitorPageSplits ON SERVER;
    GO

    SELECT 
      
    split.value('(/event/data[@name=''file_id'']/value)[1]','int') AS [file_id],
      
    split.value('(/event/data[@name=''page_id'']/value)[1]','int') AS [page_id],
      
    split.value('(/event[@name=''page_split'']/@timestamp)[1]','datetime') AS [event_time],
      
    split.value('(/event/action[@name=''sql_text'']/value)[1]','varchar(max)') AS [sql_text]
    FROM
      

          
    SELECT CAST(event_data AS XML) AS split
          
    FROM sys.fn_xe_file_target_read_file('c:\temp\MonitorPageSplits*.etx', 'c:\temp\MonitorPageSplits*.mta', NULL, NULL)
       )
    AS t
    ORDER BY [event_time]
    GO

    OK, the result is:

    image

    So, we see 10 splits, among them 4 splits of page 148, another 3 of page 178 etc. It makes sense. When split occurs, ~half of the data from the old page goes to the new one. So if the new row - the one that caused split - should originally have entered first half of the page, after the split it would still try to enter the old page - not the new one. In our case originally we had 56 rows before the new one (id from 0 to 110 step 2) and 385 - 56 = 329 rows after. I would still expect 3 and not 4 splits of the initial page because (((385 / 2) / 2) / 2) = ~48 < 57 (new row's place). So I have expected that after the third split new row would at last leave the initial page. But I was wrong - don't know whether that's just not strict math or there're other factors I didn't think of.

  • Restoring master database

    Disclaimer: I'm not an infrastructure DBA.Probably everything below is just bullshit.

    Recently our talented IT guys destroyed disk on a virtual machine. They chose the disk carefully - it was the one with master database. How do you restore master db from backup? Start SQL Server service from command line in single user mode and execute RESTORE DATABASE command. OK, start in single user mode fails because it doesn't find master database files. At this point I would like to take backup file and instead of restoring database in working state, just extract master.mdf and mastlog.ldf files (not just them, see ahead). But unfortunately it is unsupported (or I failed to find the right option - see disclaimer). Next step - find SQL Server installation with the same product version on another server, stop the service, copy master.mdf and mastlog.ldf (can't be done without stopping service - files are inaccessible). Again "sqlserver.exe -m"... This time mssqlsystemresource.mdf and mssqlsystemresource.ldf are missing. Copy them too. Also model.mdf and modellog.ldf. Again "sqlserver.exe -m". This time I faced real problem - startup failed on creating tempdb because on the server I took master.mdf from, tempdb datafile sat on disk F and tempdb log on disk H. While on my server I have only C and D. As far as I understand, at this point I have 3 options: re-install SQL Server and afterwards restore all databases including master from backup, add disks F and H or find installation where tempdb sits on C or D. Luckily I could just add F and H and it solved the problem. But all the sequence of actions seems unreasonable. BTW, after starting service in single user mode, I restored master from backup and it automatically moved tempdb to its old location on D - master contains path to all system databases. Afterwards F and H have been removed. Seems to me totally unreasonable that in order to restore master database new logical disks had to be created. I see 2 possible solutions: to be able to extract mdf and ldf files from backup without actually restoring database or to be able to configure path to system databases externally in service properties like path to master and errorlog.

  • NHibernate and triggers

    Recently had an interesting experience with NHibernate (I was against working with it from the beginning for many reasons, but unfortunately I wasn't the one to decide). I added trigger to a table that performs several update / insert operation to other tables. Suddenly updating table via NHibernate returned error:

    NHibernate.AdoNet.TooManyRowsAffectedException: Unexpected row count: 6; expected: 1

    So NHibernate doesn't support triggers that perform DML operations? Not quite. The solution is: add SET NOCOUNT ON before trigger performs DML operations. Looks like NHibernate doesn't really count DML operations, but messages of the type "2 rows updated" returned by SQL Server. And NOCOUNT setting suppresses these messages.

  • MERGE statement syntax modified

    This is probably known to all but I've just found out - MERGE command syntax had been modified since CTP6. In the beginning of 2008 I had built a system on CTP6 and this week, while preparing for production deployment, found out that procedures are failing with syntax error. It appears that in CTP6 MERGE contained the following blocks:

    • WHEN MATCHED
    • WHEN [TARGET] NOT MATCHED
    • WHEN SOURCE NOT MATCHED

    In RTM, if you write WHEN TARGET NOT MATCHED, it fails compilation. New syntax is:

    • WHEN MATCHED
    • WHEN NOT MATCHED [BY TARGET]
    • WHEN NOT MATCHED BY SOURCE

    For non-believers, here is old blog post by Chad Boyd with examples that use old syntax.

More Posts Next page »
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement