What does it mean if you see a high percentage of signal waits? Thanks to Microsoft whitepapers, presentations, and blogs, everybody would say it implies CPU pressure.
Well, almost everyone except Mario Broodbakker, whose excellent blog “Taking the guesswork out of SQL Server performance profiling Part 2” shows us that SQL Server signal waits are in fact an indication of the SQLOS scheduler pressure, which may or may not be an indication of the CPU pressure as seen by the OS. I don’t know whether Mario Broodbakker was the first one who came up with the term ‘SQLOS scheduler pressure’, but I have not seen it used elsewhere or by anybody else.
I read this excellent blog a while ago. But I don’t think I appreciated it significance at the time, or else I would not have totally forgotten anything about it until Stuart Ozer of Microsoft brought it to my attention a few days ago. Well, there was a scenario where the OS CPU perfmon counters suggested that none of the processors was fully used, while at the same time SQLOS showed very high signal wait time with no apparent bottleneck on other resources.
The reason this appears to be contradictory is that if signal waits are truly an indicator of the CPU pressure, then SQL Server should be using more CPU time if it’s not waiting on any other resources. Clearly, something doesn’t add up.
Well, the key to this puzzle is Mario Broodbakker’s observation that signal waits are an indicator of the SQLOS scheduler pressure. Although the SQLOS scheduler pressure often translates into the CPU pressure, it’s not always the case. Mario Broodbakker showed a case where a CPU intensive query was run on two connections, they were both scheduled on the same SQLOS scheduler, but they were run on two different processor cores. Because the queries had to wait for each for their turns to run on the same scheduler, there was very high signal wait time. But at the same time, each of the two processors was only half used.
In addition to understanding the behavior conceptually, the best way to really appreciate the behavior is to see it in action. But if you are using a multi-socket multi-core machine such as an HP DL585 with four dual-core sockets, it may be hard to reproduce the behavior with Mario’s CPU intensive query (given in his blog referenced above).
To help you see the behavior, I’ve modified Mario Broodbakker’s query to the following (table t1 has a single column with 10,000 rows):
use mytest
go
exec begin_schedstats
go
exec begin_waitstats
go
select * from sys.dm_os_tasks where session_id = @@spid
go
waitfor delay '00:05:00'
SELECT COUNT(*)
FROM t1 a INNER LOOP JOIN t1 b
ON b.c1 = a.c1
option (maxdop 1)
select * from sys.dm_os_tasks where session_id = @@spid
go
exec end_schedstats
go
exec end_waitstats
go
Follow this procedure to see the behavior:
- First, fire up perfmon to display the % Processor Time counters for all the processors
- Start a Query Analyzer instance, and open several QA connection to the same SQL instance to run the above query
- For each connection, from the QA output, check to see what SQLOS scheduler the query is running. Note that sys.dm_os_tasks has a scheduler_id column.
- If you don’t see any two connections sharing the same scheduler, open more QA connections to run the query and eventually SQL Server will wrap around to use the same scheduler.
- If you see two connections running the query on the same scheduler, you are done. Abort all the other connections, and wait to see how these two connections drive the % Processor Time counter values on the perfmon display. At steady state, you should see two processors being driven to ~50% each, when this query should in fact drive each processor to 100% if different SQLOS scheduler is used.
I don’t know whether there is a guarantee that you’ll see the queries running on two different processor cores. But from my tests, I saw it in each and every of the test runs.
Mario Broodbakker’s message on the SQLOS scheduler pressure is not well known. It’s a very important message that should be promoted.