THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

The Rambling DBA: Jonathan Kehayias

The random ramblings and rantings of frazzled SQL Server DBA

Follow Up on Failed Connections and Threadpool waits

A comment was posted on my blog post The benefits of attending PASS realized! asking for a follow up as a result of investigating this problem.  In that post, I mentioned that a high number of unused connections in oversized connection pools were causing the problem.  Unfortunately that information was incorrect, and thanks to some emails with Bob Ward, I now understand the problem better and have been able to reproduce the issue on a VM on my laptop by changing the ‘max worker threads' sp_configure option to the minimum value of 128.

So what really was the root of the problem?  Along with MOSS 2007, two other major applications had databases on this SQL Server.  One of the applications has two connection pools to SQL Server, each with one hundred connections each and when it activates, the volume of concurrent requests to the SQL Server were in the range of 150 – 190 requests at the same time.  The other application has a single connection pool and generally runs in the range of 20 concurrent connections to SQL Server.  The database for this application has a 353,000,000 row logging table that is roughly 62GB in size.  This table is rarely used beyond inserting new log records, and purging records that are older than 45 days old.  It exists for troubleshooting problems associated with data transmission from our stores to our home office.  However, when it is queried, the queries execute in parallel, and the resulting query plan would use 6 to 8 workers to execute, depending on the request being made.  The root of the problem is that the volume of concurrent requests to SQL Server exceeded the number of available worker threads at the point that the problem occurred. 

To track this problem proactively, a couple of performance counters can be monitored.  Previously I posted that failed logins, coupled with high wait times for the THREADPOOL wait type can be used to troubleshoot the problem.  However, another method exists to investigate the problem.  It happens that whenever a process waits for a worker thread to execute that a series of performance counters are updated to reflect this state.  The “SQLServer:WaitStatistics\Wait for the worker” performance counters can be collected over time to see when and how often the problem occurs.

To fix the issue, we’ve moved the MOSS databases onto a dedicated SQL Server to separate the workloads which has resolved the problem.  Essentially the size of the SQL Server isn’t sufficient for the volume of concurrent work, when all three of the applications are under load together.  We can actually pinpoint the point in time when this problem would have started in our environment from our deployment tracking software.  Over the summer changes were made in the MOSS farm to add second web server behind our load balancer which increased the amount of concurrent work beyond the tipping point for the SQL Server.

Once again, I want to thank Bob Ward for his assistance with this as well as for reviewing the contents of this post for accuracy before I posted it.

Published Tuesday, December 01, 2009 5:51 PM by Jonathan Kehayias

Comments

 

ALZDBA said:

Nice feedback ! Thanks.

btw here's a query that may help collecting the data ...

/* SQLServer instance uptime */

SELECT @@servername as ServerName

, login_time

, datediff(mi, login_time, getdate()) as SQLServer_UpTime_Minutes

, dateadd(mi, datediff(mi, login_time, getdate()),0) as SQLServer_Uptime_Datetime_yyyy_mm_01

FROM sys.dm_exec_sessions

WHERE session_id = 1

/* Query That Returns Calculated SQL Server Performance Counter Values

*

* SQL2005 variant using the DMV

*/

SELECT dmv_cntr.[object_name]

, dmv_cntr.counter_name

, dmv_cntr.instance_name

, dmv_cntr.cntr_value

, dmv_cntr.cntr_type

, CASE dmv_cntr.cntr_type

WHEN 537003264 -- This counter is expressed as a ratio and requires calculation. (Sql2005)

THEN CONVERT(FLOAT, dmv_cntr.cntr_value) / coalesce(RatioBaseValue.cntr_RatioBaseValue,-1)

ELSE dmv_cntr.cntr_value -- The values of the other counter types are already calculated.

END  as Calculated_Counter_value

, CASE dmv_cntr.cntr_type

WHEN 537003264 -- This counter is expressed as a ratio and requires calculation. (Sql2005)

THEN RatioBaseValue.cntr_RatioBaseValue

ELSE 1 -- The values of the other counter types are already calculated.

END as cntr_RatioBaseValue

FROM sys.dm_os_performance_counters dmv_cntr

left join ( SELECT CASE cntr_value

WHEN 0 THEN 1

ELSE cntr_value

END as cntr_RatioBaseValue

, counter_name

, [object_name]

, instance_name

, cntr_type

FROM sys.dm_os_performance_counters

WHERE cntr_type = 1073939712

) RatioBaseValue

on dmv_cntr.counter_name + ' ' = SUBSTRING(RatioBaseValue.counter_name, 1, PATINDEX('% Base%', RatioBaseValue.counter_name))

AND dmv_cntr.[object_name] = RatioBaseValue.[object_name]

AND dmv_cntr.instance_name = RatioBaseValue.instance_name

WHERE dmv_cntr.cntr_type <> 1073939712 -- Don't display the divisors. (RatioBaseValue)

--  and dmv_cntr.counter_name like '%wait%'

order by 1,2,3

December 2, 2009 7:35 AM
 

Jonathan Kehayias said:

The only problem with using the query to get the counters is that you are using a worker in SQL to do so which adds to the problem.  I'd stick with PerfMon to do the counter collection because of this.

December 2, 2009 1:19 PM
Anonymous comments are disabled

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement