THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Michael Zilberstein

  • Israeli SQL Server Usergroup: Locking & Blocking in active environment

    A week ago I gave my session on the subject. It is only first part of the two – second will be next time when there is an open slot. Demos from my session are available here.

  • Corruption case

    Recently I had to take care of the most interesting corruption case I’ve even seen, so decided to share this experience with you. We’re talking about small accounting program which keeps its data in SQL Server Express – in this particular case in SQL Server 2005. The customer called today and sent me following error screen (nice screenshot – taken with cellular phone camera Smile):

    image

    Upon connecting to the server I’ve immediately noticed dumps with the same error. Here is entire error message:

    A time-out occurred while waiting for buffer latch -- type 2, bp 04268450, page 1:804, stat 0xc00009, database id: 5, allocation unit Id: 72057594108248064, task 0x00A186B8 : 0, waittime 300, flags 0x1a, owning task 0x00A0A4D8. Not continuing to wait.

     This article was extremely helpful in analyzing this dump with WinDbg tool – flow and somewhat cryptic commands described there easily pinpointed the guilty thread and its call stack:

    sqlservr!LatchBase::AcquireInternal
    sqlservr!BUF::AcquireLatch
    sqlservr!BPool::Get
    sqlservr!PageRef::Fix
    sqlservr!IndexPageManager::GetPageForLinkModification
    sqlservr!RemoveBTreePageIfUnchangedInternal
    sqlservr!RemoveBTreePageIfUnchanged
    sqlservr!CleanVersionsOnBTreePage
    sqlservr!IndexDataSetSession::CleanupVersionsOnPage
    sqlservr!GhostExorciser::CleanupPage
    sqlservr!TaskGhostCleanup::ProcessTskPkt
    sqlservr!GhostRecordCleanupTask
    sqlservr!CGhostCleanupTask::ProcessTskPkt
    sqlservr!TaskReqPktTimer::ExecuteTask
    sqlservr!OnDemandTaskContext::ProcessTskPkt
    sqlservr!SystemTaskContext::ExecuteFunc
    sqlservr!SystemTaskEntryPoint
    sqlservr!OnDemandTaskContext::FuncEntryPoint
    sqlservr!SOS_Task::Param::Execute
    sqlservr!SOS_Scheduler::RunTask
    sqlservr!SOS_Scheduler::ProcessTasks
    sqlservr!SchedulerManager::WorkerEntryPoint
    sqlservr!SystemThread::RunWorker
    sqlservr!SystemThreadDispatcher::ProcessWorker
    sqlservr!SchedulerManager::ThreadEntryPoint
    msvcr80!_callthreadstartex
    msvcr80!_threadstartex

    Highlighted part of the call stack indicates that Ghost Cleanup process caused this failure. Actually, I had a similar encounter with stuck ghost cleanup several years ago. The easiest way to verify that it is still stuck is to execute DBCC CHECKDB command. Indeed CHECKDB appeared to be blocked by ghost cleanup session.

    What’s next? You can’t kill system session. Indeed you can’t by you can start SQL Server without it – using trace flag 661 as startup parameter (don’t forget to remove flag and restart service again if you use this flag). So, after restarting service, rebuilding index in question, removing trace flag and restarting service again, I’ve already thought that I’ve earned my beer database is fixed.

    Not so fast. Now DBCC CHECKDB succeeds to complete but results are very very red. Including interesting messages like:

    Msg 8992, Level 16, State 1, Line 1
    Check Catalog Msg 3853, State 1: Attribute (object_id=1575689407) of row (object_id=1575689407,column_id=1) in sys.columns does not have a matching row (object_id=1575689407) in sys.objects.
    Msg 8992, Level 16, State 1, Line 1
    Check Catalog Msg 3853, State 1: Attribute (object_id=1575689407) of row (object_id=1575689407,column_id=2) in sys.columns does not have a matching row (object_id=1575689407) in sys.objects.
    Msg 8992, Level 16, State 1, Line 1
    Check Catalog Msg 3853, State 1: Attribute (object_id=1575689407) of row (object_id=1575689407,column_id=3) in sys.columns does not have a matching row (object_id=1575689407) in sys.objects.
    Msg 8992, Level 16, State 1, Line 1
    Check Catalog Msg 3853, State 1: Attribute (object_id=1575689407) of row (object_id=1575689407,index_id=0) in sys.indexes does not have a matching row (object_id=1575689407) in sys.objects.
    Msg 8992, Level 16, State 1, Line 1
    Check Catalog Msg 3855, State 1: Attribute (data_space_id=1) exists without a row (object_id=1575689407,index_id=0) in sys.indexes.

    Nice, uh? After first wave of shock has passed, I checked and found out that indeed object with that id doesn’t exist. So it seems that all we need to do is to delete 3 rows from sys.columns and 1 row from sys.indexes. Ah, but those sys.something objects are views, aren’t they? And what are the real objects? The way to find real – internal – tables and columns is via execution plans:

    image

    So actually we need to delete rows from sys.syscolpars and from sys.sysidxstats (notice that column names are also different). How do we do it? Let’s try DAC (Dedicated Admin Connection)? No way – Express Edition doesn’t support DAC. Unless… unless we use trace flag 7806 as startup parameter.

    Restart server again, connect using DAC, try to delete rows… Oops,

    Msg 259, Level 16, State 1, Line 1
    Ad hoc updates to system catalogs are not allowed.

    But for this we have Paul Randal’s instructions. So: “ sqlservr.exe -sInstanceName -m -T661 –T7806 “, then “ sqlcmd -S.\InstanceName /A ” and finally…

       1: DELETE FROM sys.syscolpars WHERE id = 1575689407 AND number = 0
       2: GO
       3: DELETE FROM sys.sysidxstats WHERE id = 1575689407 AND indid = 0
       4: GO

    Now stop the server, remove all trace flags, start server in a normal way and verify that DBCC CHECKDB returns nothing. Bingo! And… well deserved my own home-brewed Scottish Ale!

    image

  • Beware of SQL Server and PerfMon differences in disk latency calculation

    Recently sp_blitz procedure on one of my OLTP servers returned alarming notification about high latency on one of the disks (more than 100ms per IO). Our chief storage guy didn’t understand what I was talking about – according to his measures, average latency is only about 15ms. In order to investigate the issue, I’ve recorded 2 snapshots of sys.dm_io_virtual_file_stats and calculated latency per read and write separately. Results appeared to be even more alarming: while for read average latency was only 9ms, for write it skyrocketed to 260ms. Even more strange – PerfMon results for the same time interval showed similar read latency but only 17ms per write! We’ve almost opened support request for it – during informal phone conversations with colleagues in Microsoft nobody could explain it.

    Finally we’ve found the answer after more thorough analysis of PerfMon file. It appears that calculation formula is different. sys.dm_io_virtual_file_stats contains total number of writes and total write latency (stall), so by dividing them we receive average latency per single write operation. PerfMon on another hand measure Disk sec/write counter every sample interval (default is 15 seconds, I used 1 second interval in order to see more detailed picture). And according to PerfMon, average latency per write during certain period is average of samples measured during this period. Disk we’re talking about contains only datafiles of a single database. So we’ve noticed that while number of read operations per second is pretty equal along entire interval, writes behave totally different. Most of the time we see 0 (zero) writes – and 0 write latency accordingly. But once every minute we see massive writes – SQL tries to write almost 4000 IOs per second for several seconds in a row (several means sometimes 2, sometimes 10). I guess, it is checkpoint. And during these peaks latency jumps to huge values – up to 1.8 seconds (!) per write. The catch is that when we have 59 samples with 0 latency (we don’t have writes, remember?) and 1 second with 120ms latency, PerfMon with calculate average and receive (59*0 + 1*120)/60 = 2ms per write latency. sys.dm_io_virtual_file_stats at the same time will report 120 ms / write. And indeed, when I calculated average write latency from PerfMon but took into account only samples with non-zero number of writes, results became very close to what sys.dm_io_virtual_file_stats showed.

    So who is right in this story and should we do something with the disk? Both are right – they just calculate different numbers. sys.dm_io_virtual_file_stats calculates average latency per IO but doesn’t provide any indication whether load is linear or contains peaks and idle periods. Peaks have much more influence – because peaks in latency are usually accompanied by peaks in number of IO operations, so their relative weight is higher. Perfmon on another hand, when it comes to calculate average over period (for example, PAL tool does it automatically), smoothens peaks because for it every sample – the one with zero writes and the one with 10K writes per second – has same relative weight.

    Here is how our latency graph looks:

    image

    My first thought was: “Checkpoint is background process, shouldn’t really affect user experience – storage guy is right, leave the disk alone”. But then I decided to check read latencies – indeed, when it is so busy that write takes 1.8 seconds, there is no reason to suppose that reads fill any better. As I said in the beginning, overall average read latency was 9ms. But when separated for write-idle and checkpoint periods, results were completely different. For write-idle periods, read latency was 1.5ms. But during checkpoint it jumped to 120ms. So user experience is affected – read queries that go to disk can run very slow or even appear as stuck during checkpoints. Beware Smile.

  • LAG function – practical use and comparison to old syntax

    Recently I had to analyze huge trace – 46GB of trc files. Looping over files I loaded them into trace table using fn_trace_gettable function and filters I could use in order to filter out irrelevant data. I ended up with 6.5 million rows table, total of 7.4GB in size. It contained RowNum column which was defined as identity, primary key, clustered. One of the first things I detected was that although time difference between first and last events in the trace was 10 hours, total duration of all sql events was only 4 hours. So the next assumption to check was whether application got stuck - whether we have significant gap between two subsequent events. In the following query I’m looking for events that occurred at least 30 seconds after previous event’s completion.

    SQL 2005/2008/2008R2 syntax:

    SELECT
      
    curEvent.RowNum, prevEvent.EndTime AS PrevEventEndTime, curEvent.StartTime AS CurEventStartTime
    ,
      
    DATEDIFF(SECOND, prevEvent.EndTime, curEvent.StartTime) AS
    ApplicationStuckDuration
    FROM
      
    dbo.TraceData curEvent
      
    INNER JOIN dbo.TraceData prevEvent ON curEvent.RowNum - 1 =
    prevEvent.RowNum
    WHERE DATEDIFF(SECOND, prevEvent.EndTime, curEvent.StartTime) >=
    30

    image

    SQL 2012 syntax:

    ;WITH GapsCTE AS
    (
      
    SELECT RowNum, StartTime AS CurEventStartTime, LAG(EndTime, 1, NULL) OVER(ORDER BY RowNum ASC) AS
    PrevEventEndTime
      
    FROM
    dbo.TraceData
    )  
    SELECT RowNum, PrevEventEndTime, CurEventStartTime, DATEDIFF(SECOND, PrevEventEndTime, CurEventStartTime) AS
    ApplicationStuckDuration
    FROM
    GapsCTE
    WHERE DATEDIFF(SECOND, PrevEventEndTime, CurEventStartTime) >=
    30

    image

    Here are results (pretty surprising at the first glance, I have to say):

      SQL 2012 SQL 2005/2008/2008R2
    Duration 103 seconds 77 seconds
    Reads 919,752 1,839,504
    CPU 20 seconds 13 seconds

    In the old SQL versions we see twice reads – that’s obvious from execution plan since we scan entire table twice and merge join on RowNum column. It seems that in this case Merge Join is so efficient that CPU overheard of segmentation, aggregation and spooling costs more than additional single ordered scan of the table that is already in memory anyway.

    I started to write all this in order to demonstrate practical usage of the new function and how we can benefit from it compared to previous SQL Server versions. Results became a complete surprise for me.

    Conclusion: always check and recheck yourself!

  • Reading temporary table from another session

    It happens to me at least once a week – I want to check progress of some heavy script that runs in chunks over big dataset and find out that it writes intermediate data to temporary table only. Last time it happened 3 days ago when I wanted to analyze 50GB trace table on my notebook. I wrote a script that was taking 200 thousand rows at a time, parameterizing them and aggregating by different keys – host name, application etc. Usual trace analysis stuff. After an hour I wanted to check the progress but found out that intermediate results are written to temp table and of course I forgot to add debug prints. Took me some thought and ~5 minutes of coding to find the solution. How?

    My trace analysis script ran over 200K rows chunks and among other things aggregated them grouping by parameterized query text. One of the measurement columns was “TotalQueries” – counter of rows in a group that could be later used to calculate average values for Reads, Writes, Duration and CPU. Every 200k rows chunk inserted rows into #BA temp table. Last step in a script, when all chunks have already been processed, was aggregation of #BA table’s data into permanent table. So in order to monitor the progress I could SUM all values of “TotalQueries” column in #BA table and compare it to number of rows in my 50GB trace table (which can be easily verified using sp_spaceused).

    Of course, I new the temp table name – I wrote the initial script. So first thing was to find its object_id.

    SELECT [object_id] FROM tempdb.sys.tables WHERE name LIKE '#BA%'

    It brought me negative number: –1546597904 in my current demo. I don’t have access to temp table created by another session. But as admin on my own notebook I sure have access to every page in every database. So next step is to find all pages that belong to #BA table. For this task we have undocumented but widely known (and used) DBCC IND command. Since digging in every page manually and wasting hour on it wasn’t exactly my purpose, I kept DBCC IND output in the table variable and used it later in the script.

    DECLARE @BATablePages TABLE
    (
      
    PageFID INT, PagePID INT, IAMFID INT, IAMPID INT, ObjectID BIGINT, IndexID INT, PartitionNumber INT
    ,
      
    PartitionID BIGINT, iam_chain_type VARCHAR(64), PageType TINYINT, IndexLevel TINYINT, NextPageFID INT
    ,
      
    NextPagePID INT, PrevPageFID INT, PrevPagePID
    INT
    )

    INSERT INTO
    @BATablePages
    EXEC('DBCC IND(2, -1546597904, 1)'
    )

    Actually we need only PagePID column and only for leaf level data pages e.g. PageType = 1. Next step is to loop over those pages and insert their content into another table. For viewing page’s data I used another widely known (and undocumented as well) DBCC PAGE command. It is less known that DBCC PAGE can be used along with WITH TABLERESULT suffix. Note: column definitions in the following script are accidental – I just wanted them to be wide enough so that script can complete successfully.

    DECLARE @rc INT, @v_PageID INT, @sql NVARCHAR(MAX)

    CREATE TABLE #ind( RowNum INT NOT NULL IDENTITY(1,1), PageNum INT
    NOT NULL )
    CREATE TABLE #page(ParentObject VARCHAR(128), [Object] VARCHAR(512), Field VARCHAR(128), [Value] VARCHAR(8000
    ) )

    INSERT INTO #ind( PageNum
    )
    SELECT
    PagePID
    FROM
    @BATablePages
    WHERE PageType =
    1

    SET @rc =
    @@ROWCOUNT
    WHILE @rc >
    0
    BEGIN
       SELECT
    @v_PageID = PageNum FROM #ind WHERE RowNum =
    @rc
          
      
    SET @sql = 'DBCC PAGE(2, 1, ' + CONVERT(VARCHAR(32), @v_PageID) +
    ', 3) WITH TABLERESULTS'

      
    INSERT INTO #page(ParentObject, [Object], Field, [Value]
    )
      
    EXEC(@sql
    )

      
    SET @rc = @rc -
    1
    END

    So now we have all data pages inside single table but not in a format we’re used to:

    image

    But actually, we now have everything we need. The rest is simple:

    DELETE FROM #page WHERE Field <> 'TotalQueries'
                                    
    SELECT SUM(CONVERT(INT, [Value])) AS
    TotalQueries
    FROM #page
  • Page splits, extended events, index page allocation and all the fish

    2 years ago I wrote about monitoring page splits with Extended Events. Only 2 bloggers explored Extended Events feature at that time, so my post was more of a learning than for any practical matter. Yet one question remained open: according to straightforward calculations, after 3 splits of the first page, next split should have occurred on another page - but trace with extended events clearly showed 4 splits of initial page.

    Recently Pavel Nefyodov has drawn my attention to another curious fact: initially we had 1 page; after all those splits we had 10. But trace showed 10 splits, not 9. So where did additional split come from? And finally I've realized that there is another simple way to trace those splits - via transaction log. If we manually checkpoint the system before and perform insert in transaction afer - our transaction log will containg exactly the relevant data; and it'll remain this way all the time transaction stays open. We'll query transaction log using undocumented but widely known fn_dblog function.

    OK, Let's see some code (mostly taken from my old post which in turn used page split demo from Kalen Delaney's blog). Initial configuration (including trace definition):

    USE tempdb;
    GO

    CREATE TABLE split_page  
    (id INT IDENTITY(0,2) PRIMARY KEY,
    id2 bigint DEFAULT 0,
    data1 VARCHAR(33) NULL,  
    data2 VARCHAR(8000) NULL);
    GO

    INSERT INTO split_page DEFAULT VALUES;
    GO 385

    CREATE EVENT SESSION MonitorPageSplits ON SERVER
    ADD EVENT sqlserver.page_split
    (
      
    ACTION (sqlserver.database_id, sqlserver.sql_text)  
       
    WHERE sqlserver.database_id = 2
    )
    ADD TARGET package0.asynchronous_file_target
    (
      
    SET  
          
    filename = N'c:\temp\MonitorPageSplits.etx',  
          
    metadatafile = N'c:\temp\MonitorPageSplits.mta'
    );
    GO

    CHECKPOINT
    GO

    Short glance at our table's allocation inside data file:

    DBCC IND(tempdb, split_page, -1);
    GO

    PagePID IndexID iam_chain_type PageType IndexLevel
    175 1 In-row data 10 NULL
    148 1 In-row data 1 0

    So we have 1 IAM page and 1 data page. Now the split itself and - remember, we should query transaction log before committing transaction if we don't want accidental checkpoint to ruine all our efforts.

    ALTER EVENT SESSION MonitorPageSplits ON SERVER STATE = start;
    GO

    SET IDENTITY_INSERT split_page  ON;
    GO
    BEGIN TRAN

    INSERT INTO
    split_page (id, id2, data1, data2)
         
    SELECT 111, 0, REPLICATE('a', 33), REPLICATE('b', 8000);
    GO

    SET IDENTITY_INSERT split_page  OFF;
    GO

    SELECT [Current LSN], Operation, Context, [Page ID], [Description]
    FROM fn_dblog(NULL, NULL)

    COMMIT
    GO

    ALTER
    EVENT SESSION MonitorPageSplits ON SERVER STATE = STOP;
    GO
    DROP EVENT SESSION MonitorPageSplits ON SERVER;
    GO

    SELECT  
      
    split.value('(/event/data[@name=''file_id'']/value)[1]','int') AS [file_id],
      
    split.value('(/event/data[@name=''page_id'']/value)[1]','int') AS [page_id],
      
    split.value('(/event[@name=''page_split'']/@timestamp)[1]','datetime') AS [event_time],
      
    split.value('(/event/action[@name=''sql_text'']/value)[1]','varchar(max)') AS [sql_text]
    FROM
      
    (  
          
    SELECT CAST(event_data AS XML) AS split
          
    FROM sys.fn_xe_file_target_read_file('c:\temp\MonitorPageSplits*.etx', 'c:\temp\MonitorPageSplits*.mta', NULL, NULL)
       )
    AS t
    ORDER BY [event_time]

    First let's take a look at our table now:

    PagePID IndexID iam_chain_type PageType IndexLevel
    175 1 In-row data 10 NULL
    148 1 In-row data 1 0
    184 1 In-row data 2 1
    185 1 In-row data 1 0
    187 1 In-row data 1 0
    188 1 In-row data 1 0
    189 1 In-row data 1 0
    190 1 In-row data 1 0
    191 1 In-row data 1 0
    392 1 In-row data 1 0
    393 1 In-row data 1 0
    394 1 In-row data 1 0

    Transaction log contains more rows than we are interested in, so I cleaned irrelevant ones out from the table below and correlated transaction log with extended events trace. I also transformed hexadecimal page numbers from transaction log to decimal ones - like we see in extended events trace.

    image

    So we can see that indeed there have been 3 splits of the first page. From transaction log we see that first action was actually allocating of page 184 which is - look at the allocation table above - index page. According to transaction log this page is allocated - not splitted from any other page. Still extended events trace shows as if it has been splitted from the first data page of our table. Don't know whether this behavior can be considered as bug - probably not. Anyway - both unanswered question from my old post are resolved: first page indeed has been splitted 3 times and we had only 9 data page splits while 10th split belongs to index page allocation.

    Have a nice week!

  • Read Committed isolation level, indexed views and locking behavior

    From BOL, "Key-Range Locking" article:

    Key-range locks protect a range of rows implicitly included in a record set being read by a Transact-SQL statement while using the serializable transaction isolation level. The serializable isolation level requires that any query executed during a transaction must obtain the same set of rows every time it is executed during the transaction. A key range lock protects this requirement by preventing other transactions from inserting new rows whose keys would fall in the range of keys read by the serializable transaction. (bold is mine - MZ)


    I had a blocking locks problems recently. What should have been working concurrently from multiple threads actually worked single-threaded only. Here is demo - table of employees and indexed view above it that shows number of employees and total salary per department. I populated it randomly with up to 10 deparments.


    USE [tempdb]
    GO

    SET ANSI_NULLS, ANSI_PADDING, ANSI_WARNINGS, CONCAT_NULL_YIELDS_NULL, QUOTED_IDENTIFIER ON
    SET
    NUMERIC_ROUNDABORT OFF
    GO

    CREATE TABLE dbo.Employees
    (
      
    EmployeeID INT NOT NULL IDENTITY(1, 1) PRIMARY KEY,
      
    DepartmentID SMALLINT NOT NULL,
      
    Salary MONEY NOT NULL
    )
    GO

    CREATE VIEW dbo.V_Departments
    WITH SCHEMABINDING
    AS
       SELECT
          
    DepartmentID,
          
    SUM(Salary) AS DepartmentSalary,
          
    COUNT_BIG(*) AS TotalEmployees
      
    FROM dbo.Employees
      
    GROUP BY DepartmentID
    GO

    CREATE UNIQUE CLUSTERED INDEX PK_V_Departments ON dbo.V_Departments( DepartmentID ASC )
    GO

    INSERT INTO dbo.Employees( DepartmentID, Salary )
    SELECT
      
    [object_id] % 10 + 1 AS DepartmentID,
      
    [object_id] % 3000 + 1 AS Salary
    FROM sys.objects
    GO

    Now I open 2 concurrent transactions and try to add new employee from both. The important thing here is that both belong to new departments: 11 and 13 (in the demo data we had departments from 1 to 10).


    Session 1 Session 2
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED

    BEGIN TRAN

    INSERT INTO
    dbo.Employees( DepartmentID, Salary )
    VALUES( 11, 2500 )
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED

    BEGIN TRAN

    INSERT INTO
    dbo.Employees( DepartmentID, Salary )
    VALUES( 13, 3500 )

    Second transaction is locked. Why? We don't have anything else that updates department 12, so why it is locked? Let's find out


    SELECT
      
    OBJECT_NAME(i.[object_id]) AS ObjectName,
      
    i.name AS IndexName,
      
    l.request_mode AS LockRequestMode,
      
    l.resource_type AS LockResourceType,
      
    l.resource_description AS LockResourceDescription
    FROM
      
    sys.dm_tran_locks AS l
      
    INNER JOIN sys.dm_os_waiting_tasks AS wt ON l.lock_owner_address = wt.resource_address
      
    INNER JOIN sys.partitions p ON p.hobt_id = l.resource_associated_entity_id
      
    INNER JOIN sys.indexes AS i ON i.[object_id] = p.[object_id] AND i.index_id = p.index_id
    WHERE wt.session_id = 57
    AND l.request_status = 'WAIT'


    image


    RangeS-U mode means shared lock for the range between values and update lock for values themselves. One value is obvious - department 11 that we've inserted in the first session. Second value, as we see in the query result, is (ffffffffffff) which indicates max int. Storage Engine knows to lock rows and other objects that are scanned by Query Processor. But we don't have row with department greater than 11. So right boundary of locked range is set to maximum possible value.

    So we understand what range is locked. But why? If we work in pure read committed isolation level, we hold exclusive lock on rows we modify but nobody prevents another user from inserting or deleting another row with same department value. But we expect that our operation would update corresponding department row in the view before anybody else updates same row (e.g. if we read data after our transacion is complete, it won't be affected by other transactions). Which is classic serializable behavior. So instead of forcing customer to require serializable isolation level expicitely, SQL Server implicitely converted read-committed to serializable for this particular operation. It appears that same scenario occurs with cascading updates on foreign keys.

  • Old error in active locks detection script

    Almost every applicative DBA has scripts for locking and blocking issues detection. Some, like me, wrote the script themselves in order to become familiar with underlying DMVs, others downloaded one of the huge number of versions available in the net. All of these scripts are based on sys.dm_tran_locks DMV which contains information about currently active locks and on sys.dm_os_waiting_tasks that holds infomation about waiting tasks. Some of the scripts also filter out locks acquired by system sessions by adding "WHERE request_session_id > 50". My own script was among them.

    Recently I had a blocking issue. But for some reason my script didn't show blocking locks tree. Deeper investigation showed that both blocker and blocked session belonged to Service Broker. But why script failed to indicate it? Following query provides an answer - it shows spid of Service Broker activated tasks:

    SELECT spid FROM sys.dm_broker_activated_tasks

    image

    As you can see, common belief that all spids under 50 belong to internal system sessions is just a myth. Actually, it was right at SQL 2000 days. Starting from SQL 2005 Serive Broker (and, I think, other features too) started to use first 50 spids.

    Very nice ebook and script collection "Free SQL Server DMV Starter Pack" inspired me to write this post. It is really nice and "no nonsense" - I'll recommend it as a great starter to young DBAs that don't have their own script collection. Still script for detecting active locks contains the same old familiar mistake:

    SELECT ...
    FROM sys.dm_tran_locks
    WHERE request_session_id > 50
  • T-SQL Tuesday #006: LOB, row-overflow and locking behavior

    This post is my contribution to T-SQL Tuesday #006, hosted this time by Michael Coles.

    Actually this post was born last Thursday when I attended Kalen Delaney's "Deep dive into SQL Server Internals" seminar in Tel-Aviv. I asked question, Kalen didn't have answer at hand, so during a break I created demo in order to check certain behavior. Demo goes later in this post but first small teaser. I have MyTable table with 10 rows. I take 2 rows that reside on different pages. In first session transaction is opened and first row is deleted using PAGLOCK hint (without COMMIT for a meantime). In second session I select all columns from the second row(both sessions access rows by primary key which is also clustered index). Second session got locked. How come?

    Here is the full story and demo script. First we create table with 10 rows, 2 of which are long.

    CREATE TABLE dbo.MyTable( c1 INT NOT NULL PRIMARY KEY, c2 VARCHAR(8000), c3 VARCHAR(8000), c4 VARCHAR(8000), c5 VARCHAR(MAX) )
    GO

    ;WITH Nums( Num )
    AS
    (
      
    SELECT ROW_NUMBER() OVER( ORDER BY (SELECT 1) ) AS Num
      
    FROM sys.objects
    )
    INSERT INTO dbo.MyTable(c1, c2, c3, c4, c5)
    SELECT 
      
    Num AS c1,
      
    CASE WHEN Num IN (2, 8) THEN REPLICATE('a', 3000) ELSE 'a' END AS c2,
      
    CASE WHEN Num IN (2, 8) THEN REPLICATE('b', 3000) ELSE 'b' END AS c3,
      
    CASE WHEN Num IN (2, 8) THEN REPLICATE('c', 4000) ELSE 'c' END AS c4,
      
    CASE WHEN Num IN (2, 8) THEN REPLICATE('d', 4000) ELSE 'd' END AS c5
    FROM Nums AS n1
    WHERE Num <= 10
    GO

    Let's examine, how out table is stored (7 is ID of my database; I don't show entire output but only relevant parts):

    DBCC IND(7, MyTable, -1)

    image

     

    So we have 2 leaf-level data pages, 1 page for row-overflow data and 1 page for LOB data plus 3 IAM pages - one per page type. We can also verify that row-overflow and LOB pages contain 2 rows each (using DBCC PAGE we check that m_slotCnt = 2 in page's header). Now open new window in Management Studio and execute:

    BEGIN TRAN

    DELETE FROM
    dbo.MyTable WITH(PAGLOCK) WHERE c1 = 2

    In another session execute:

    SELECT c1, c2, c3, c5 FROM dbo.MyTable WHERE c1 = 8

    We see that second session is locked, so let's check locking details with sp_lock procedure:

     

    image

     

    Page in contention is 200585 which is LOB data page as we can see from the first table. Actually that's perfectly natural behavior - DELETE acquires exclusive lock, SELECT tries to acquire shared. But I'm sure it would have taken some time for me to solve the puzzle whether it was real case in my database. We're used to think of a row sitting at one page, so rowlock locks a row or a key. Probably entire page if we used PAGLOCK hint or Optimizer decided to start from a page-level lock. But we aren't accustomed to think of a row which spans over multiple pages.

    And what happens if we query row-overflow page? Pretty much the same (just substitute c5 by c4 in the second query):

     

    image

     

    Another interesting issue is - what happens when we don't query LOB and row-overflow columns? We know that SQL Server doesn't know to lock particular columns but always entire row of a table or an index. So what will happen if we query only first 3 columns: c1, c2 and c3? No blocking lock this time. It means that SQL Server knows to lock particular columns and leave other columns unlocked but only if they reside on separate pages. Actually it is similar to column-oriented databases behavior.

    Finally I want to check what happens when LOB column is updated - whether in-row data is also being locked. So I update LOB column in the first session and try to query same row but without LOB column.

    First session:

    BEGIN TRAN

    UPDATE
    dbo.MyTable SET c5 = REPLICATE('e', 4000) WHERE c1 = 2

    Second session:

    SELECT c1, c2, c3, c4 FROM dbo.MyTable WHERE c1 = 2

    Blocking lock again. Now on a keylock:

     

    image

     

    Here I don't see any reason for exclusive lock - shared would be enough in order to prevent DML operations on the row. No danger of dirty read here since in-row data page isn't updated at all. So first step was nice - shared lock isn't acquired when we don't read row-overflow and/or LOB page. Second step in my opinion should be - for DML operations acquire only shared lock for pages/rows/keys that aren't updated.

  • Parsing Extended Events xml_deadlock_report

    Jonathan Kehayias and Paul Randall posted more than a year ago great articles on how to monitor historical deadlocks using Extended Events system_health default trace. Both tried to fix on the fly bug in xml output that caused failures in xml validation. Today I've found out that their version isn't bulletproof either. So here is the fixed one:

    SELECT CAST(xest.target_data as XML) xml_data, *
    INTO #ring_buffer_data
    FROM 
      
    sys.dm_xe_session_targets xest
      
    INNER JOIN sys.dm_xe_sessions xes on xes.[address] = xest.event_session_address
    WHERE 
      
    xest.target_name = 'ring_buffer' AND 
      
    xes.name = 'system_health'
    GO

    ;WITH CTE( event_name, event_time, deadlock_graph )
    AS
    (
      
    SELECT
          
    event_xml.value('(./@name)', 'varchar(1000)') as event_name,
          
    event_xml.value('(./@timestamp)', 'datetime') as event_time,
          
    event_xml.value('(./data[@name="xml_report"]/value)[1]', 'varchar(max)') as deadlock_graph
      
    FROM #ring_buffer_data
          
    CROSS APPLY xml_data.nodes('//event[@name="xml_deadlock_report"]') n (event_xml)
      
    WHERE event_xml.value('@name', 'varchar(4000)') = 'xml_deadlock_report'
    )
    SELECT event_name, event_time
       
    CAST(
          
    CASE 
              
    WHEN CHARINDEX('<victim-list/>', deadlock_graph) > 0 THEN
                  
    REPLACE (
                      
    REPLACE(deadlock_graph, '<victim-list/>', '<deadlock><victim-list>'),
                      
    '<process-list>', '</victim-list><process-list>'
              
    ELSE
                  
    REPLACE (
                      
    REPLACE(deadlock_graph, '<victim-list>', '<deadlock><victim-list>'),
                      
    '<process-list>', '</victim-list><process-list>'
          
    END 
       AS
    XML) AS DeadlockGraph
    FROM CTE
    ORDER BY event_time DESC
    GO

    The difference as you can see is in the check whether '<victim-list>' node is empty (CHARINDEX('<victim-list/>', deadlock_graph) > 0). My system_health session caught some weird deadlocks. According to trace they're caused by intra-query parallelism when threads of the same session deadlock one another. Until now it is rather a rare type of deadlock but nothing strange - Bart Duncan and Craig Freedman blogged about it long ago. What is really strange is that this deadlock isn't caught by 1222 trace flag - error log remained empty. And - probably connected to previous item - victim list is empty. So instead of opening node '<victim-list>' we receive an empty one: '<victim-list/>'. I'm not sure such a deadlock deserves any attention if it occurs once in a while but I sure want to be able to parse xml_deadlock_report events without failures.

  • T-SQL Tuesday - IO capacity planning

    This post is my contribution to Adam Machanic's T-SQL Tuesday #004, hosted this time by Mike Walsh.

    Being applicative DBA, I usually don't take part in discussions which storage to buy or how to configure it. My interaction with IO is usually via PerfMon. When somebody calls me asking why everything is suddenly so slow on database server, "disk queue length" or "average seconds per transfer" counters provide an overwhelming answer in 60-70% of such cases. Sometimes it can be fixed by adding indexes to heavy tables. Not always though. Many companies, especially young and fast growing ones, just can't properly measure IO requirements. They are usually aware of how much disk space they need. Question about IOpS is met with puzzled looks.

    Recently I had to build a new system capable of 40 thousand random IOpS at the peak load. In such IO is by far the most expensive part. Although space requirements were moderate, just several Terabytes, 4-5 SATA disks of 1TB won't be enough. Far from it. Modern disks with 15000 RPM rotation speed can perform ~170 random IO requests per second. So, 40K/170 = 235 disks in several RAID0 arrays. Enough?

    Not so fast. What about high availability requirements? Can we afford downtime if one of the disks is gone? No, we can't. OK, so RAID0 is out of scope. It leaves us with RAID10(striping plus mirroring) and different parity options: RAID3 through RAID6. I'll consider RAID5 as it is the most popular one. For RAID10 calculation is simple: 235 disk in RAID0 we've seen earlier x2 for mirror disks = 470 disks. That's quite a lot! So let's consider less expensive (at least sounds so) option - RAID5. Standard RAID5 array contains 5 disks. Unlike RAID3 and RAID4 it has no dedicated disk for parity (instead parity is written in a round robbin on all disks), so we can assume that nothing is wasted. We're still running with the same number - 235 disks, just a little less overall space due to parity. But space isn't our concern here. So, can we finally issue order to the storage vendor?

    Unfortunately we're still not there. Good time to ask about usage pattern - what do those 40K IOpS contain? ~40% are SELECT operations (e.g. 16K operations per second), 20% are INSERT (8K per second) and 40% are UPDATE (16K per second). We also know, that rows that undergo UPDATE are those that are SELECTed before. And we know that all operations use clustered index seek. Let's perform simple calculation. We can safely assume that upper levels of clustered index are in cache, so we have 1 physical read per SELECT. UPDATE performs read + write but we know that row has been read earlier, so we can assume that read is also from cache. So, price of UPDATE is 1 IO. Great! 1 IO per operation regardless of its type - we're still with 235 disks. At this point some of you would say: "Stop wasting our time, let's buy that damn storage!"

    OK, after week or two we install storage, configure 235 disks in arrays of 5 plus a little more for hot spare, Windows, SQL Server transaction log etc. Now we open PerfMon and start load test expecting to see 40K IOpS at maximum load. And we would be greatly disappointed. Actually, having CV updated is a good thing at this point - some organizations that I know won't employ DBA that wastes them around 100K bucks. So, where is the catch?

    The catch is an UPDATE. Contrary to the wide belief, PerfMon doesn't show physical IO. Yes, we know that drives D and E reside on the same "physical disk". But is it really physical? No, it is just another layer that can conceal beneath anything from directly attached disk to RAID50 array. PerfMon shows IO "on host". So we probably see 1 IO for UPDATE. But storage actually does something else - and it is really physical layer now. First, it reads both data and parity bytes. Data byte is in cache, so we have 1 read instead of 2; but remember, we haven't updated anything yet (you probably ask about parity byte - why it isn't in cache? because for SELECT operations only data bytes are read). Next, it writes both data and parity. So, we have 3 IO operations per single UPDATE. Going back to overall numbers, 16K updates are actually 48K IO operations when working on RAID5. This trap is known as "RAID5 write penalty". That's not the only RAID5 issue. We also have ~2/3 performance degradation on single disks failure - because in order to reconstruct corrupted bit we have to read all 4 bits on other disks including parity -> 4 times more IO. Considering all this, we should carefully evaluate whether RAID5 is really more cost effective than RAID10. The answer is positive for read-intensive environments where we do not have write penalty. For write-intensive environments TCO of RAID10 can be better, especially taking into account possible corruptions.

    Last but not least. I know that some statements here are oversimplified. I didn't take into account page splits or influence of storage system cache; I didn't consider SSD-based or mixed storage. Although I used real system I've built as example, here it is more of a mathematical model. Real world is always more complicated. But I don't want my fellow DBAs to fall asleep reading this post :-).

  • For want of a nail

    For want of a nail the shoe was lost.
    For want of a shoe the horse was lost.
    For want of a horse the rider was lost.
    For want of a rider the battle was lost.
    For want of a battle the kingdom was lost.
    And all for the want of a horseshoe nail.

     

    I'm now in the middle of severity A case that is best depicted by this rhyme. While Microsoft engineers look for a root cause, I'll describe here how small bug can "kill" strong server. The story begins when for some unknown reason "rows" column in sys.partitions DMV begins to show 4.6 billion rows for clustered index of one of the entities tables in my datawarehouse database. COUNT(*) on the table returns 1 million rows.

    Second step - daily job that executes sp_updatestats stored procedure. This procedure updates statistics in the database using sample data . Let's say, I randomly took 5% of table's data and calculated number of unique values for Column1 - n values. So total unique values of Column1 in my table is 20*n. But how do I know, how many rows per unique value of Column1 are there in the table? Very simple: SQL Server just takes rows column from sys.partitions and divides by our 20*n number.

    Third step - query: it joins between our Table and another one WHERE Column1 = x. Optimizer compiles the query and builds execution plan based on existing statistics (e.g. 4 billion rows in the table). Expected number of rows from our table is millions - so Optimizer chooses HASH JOIN as the best way to perform JOIN operation.

    Now that Optimizer made a decision about execution plan, it asks for a memory quota for query execution. Based on the estimation about millions of rows. Remember that memory quota for query execution isn't taken from data buffer cache, so on 32 bit systems we can't utilize too much memory. For example, I executed single query as described earlier and while it was running checked memory usage of my session.

    image

    So we have 252Mb granted while less than 1Mb used. Total available memory is 1.6Gb, so you can calculate yourself how many concurrent queries system supports at its current state. It is exactly the situation we had: 5 sessions were running and 65 were waiting for memory grant (if you monitor wait types - there will be a lot of RESOURCE_SEMAPHORE).

    P.S. DBCC UPDATEUSAGE fixes the initial problem. So while Microsoft engineers look for the reason why sys.partitions went wild, I monitor it in order to fix the problem before it "kills" the server.

  • Partition Details Custom Report - enhanced

    Recently my fellow friend and colleague Yoni Nakache drew my attention to the nice custom report which can save valuable time to any DBA - it returns number of rows per partition along with partition boundaries and filegroup data. The rdl can be found here at Codeplex. What was definitely missing is space usage information - reserved / used space per partition. So I filled the gap - new rdl is here.

  • NHibernate wonders

    This ORM piece of tool just can't stop to surprise. A week ago I've found out that although NHibernate can execute stored procedure, it doesn't support output parameters - when procedure returns one, it causes failure. Today I've found another pearl. Imagine following scenario: 2 entities with many-to-many relationship between them. For example, projects and employees - project contains many employees, employee can participate in several projects. So we have ProjectsEmployees table that maps employees to projects. Done with a foreword, now the scenario itself. Project X contains 9 employees already mapped to it in the ProjectsEmployees table. We want to add another employee to the project. Just a simple INSERT operation, right? NHibernate doesn't look for easy ways. It performs the following operations:

    1. DELETE FROM ProjectsEmployees WHERE ProjectID = X

    2-11. INSERT INTO ProjectsEmployees VALUES()...

    Instead of single INSERT we have 11 operations here. Besides unnecessary pressure on IO system and possible locks, we have much more serious consistency issue - if HR department queries table after DELETE operation but before all the INSERTs have been completed, it would receive project status that never occurred in real life - 3 employees for example. But that's the way NHibernate works - "just in case" overwrites the entire projects object.

    Ah, and how did I find out? Noticed that after mapping new employee to project trigger on delete fires. WEB team developers (those actually programming with NHibernate) keep log of commands NHibernate executes in development environment. There I have seen it plain and simple.

  • Monitoring page splits with Extended Events

    After reading Kalen Delaney's post about single insert causing 10 page splits, I wanted to see those splits in detail - their order at first place. And in SQL Server 2008 there is a way to trace splits - using new Extended Events infrastructure. Here is simple script that creates the trace and afterwards displays results.

    First of all, create and populate table in tempdb as described in the Kalen's post.

    USE tempdb;
    GO

    CREATE TABLE split_page 
    (id INT IDENTITY(0,2) PRIMARY KEY,
    id2 bigint DEFAULT 0,
    data1 VARCHAR(33) NULL, 
    data2 VARCHAR(8000) NULL);
    GO

    INSERT INTO split_page DEFAULT VALUES;
    GO 385

    Now, let's create and start Extended Events session. The only event we would like to monitor is page_split. On the way we'll capture sql text in order to be sure that it is our insert that caused split.

    CREATE EVENT SESSION MonitorPageSplits ON SERVER
    ADD EVENT sqlserver.page_split
    (
      
    ACTION (sqlserver.database_id, sqlserver.sql_text
       
    WHERE sqlserver.database_id = 2
    )
    ADD TARGET package0.asynchronous_file_target
    (
      
    SET 
          
    filename = N'c:\temp\MonitorPageSplits.etx'
          
    metadatafile = N'c:\temp\MonitorPageSplits.mta'
    );
    GO

    ALTER EVENT SESSION MonitorPageSplits ON SERVER STATE = start;
    GO

    Now execute INSERT command from Kalen's script.

    SET IDENTITY_INSERT split_page  ON;
    GO
    INSERT INTO split_page (id, id2, data1, data2)
         
    SELECT 111, 0, REPLICATE('a', 33), REPLICATE('b', 8000);
    GO
    SET IDENTITY_INSERT split_page  OFF;
    GO

    Afterwards we'll close the session and display results.

    ALTER EVENT SESSION MonitorPageSplits ON SERVER STATE = STOP;
    GO

    DROP EVENT SESSION MonitorPageSplits ON SERVER;
    GO

    SELECT 
      
    split.value('(/event/data[@name=''file_id'']/value)[1]','int') AS [file_id],
      
    split.value('(/event/data[@name=''page_id'']/value)[1]','int') AS [page_id],
      
    split.value('(/event[@name=''page_split'']/@timestamp)[1]','datetime') AS [event_time],
      
    split.value('(/event/action[@name=''sql_text'']/value)[1]','varchar(max)') AS [sql_text]
    FROM
      

          
    SELECT CAST(event_data AS XML) AS split
          
    FROM sys.fn_xe_file_target_read_file('c:\temp\MonitorPageSplits*.etx', 'c:\temp\MonitorPageSplits*.mta', NULL, NULL)
       )
    AS t
    ORDER BY [event_time]
    GO

    OK, the result is:

    image

    So, we see 10 splits, among them 4 splits of page 148, another 3 of page 178 etc. It makes sense. When split occurs, ~half of the data from the old page goes to the new one. So if the new row - the one that caused split - should originally have entered first half of the page, after the split it would still try to enter the old page - not the new one. In our case originally we had 56 rows before the new one (id from 0 to 110 step 2) and 385 - 56 = 329 rows after. I would still expect 3 and not 4 splits of the initial page because (((385 / 2) / 2) / 2) = ~48 < 57 (new row's place). So I have expected that after the third split new row would at last leave the initial page. But I was wrong - don't know whether that's just not strict math or there're other factors I didn't think of.

More Posts Next page »
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement