A lot of conversation this year centered on return on investment (ROI) justifications for attending the PASS Summit. Let's face it, the economy has been tough on a lot of businesses, and the idea of spending a few thousand dollars on conference admission, flights and hotels was a harder sell for some people. I know a lot of people who attended the Summit on their own dime, myself included. However, as a testament to the value of attending this conference, the very first session I attended diagnosed and provided information for a problem that had existed on one of my servers for many weeks but was impossible to diagnose unless you knew what you were looking at.
After the opening keynote, one of the first sessions at PASS 2009 was Bob Ward's 500-level Inside SQL Server Wait Types. I met Bob Ward last year when I attended his session Debugging the Memory of SQL Server 2008. Bob has an amazing ability to take the most complex subjects and make them digestible. He always starts out with a disclaimer that he expects you to have a basic knowledge of SQL, and then he jumps in head first. I would venture to say that it doesn't matter what Bob is speaking about, at some point you are going to see the debugger pop up during his session.
Recently I spoke about Wait Types and Troubleshooting at SQL Saturday #21 in Orlando, FL. I also wrote the chapter on Wait Stats in Professional SQL Server 2008 Internals and Troubleshooting (which will be available early from what I hear), and along the way I traded a few emails with Bob looking for more information on the PREEMPTIVE wait types specifically. In our last exchange Bob mentioned that he would cover those in his session at PASS, so I was really looking forward to it. What I didn't expect was that I would actually miss part of his session, ironically the part where he discussed some of the PREEMPTIVE types, because I was remotely looking at one of my production servers to determine whether it had a problem that Bob had only briefly glanced over during the session.
Basically, for the last 2-3 weeks the web servers for our MOSS 2007 environment have occasionally been logging errors like the following:
Event Type: Error
Event Source: Windows SharePoint Services 3
Event Category: General
Event ID: 27745
Time: 11:13:51 AM
The description for Event ID ( 27745 ) in Source ( Windows SharePoint Services 3 ) cannot be found. The local computer may not have the necessary registry information or message DLL files to display messages from a remote computer. You may be able to use the /AUXSOURCE= flag to retrieve this description; see Help and Support for details. The following information is part of the event: #50071: Unable to connect to the database WSS_Content_Main_Portal on PortalDb. Check the database connection information and make sure that the database server is running..
When these occur the web application returns a similar "unable to connect to the database" exception to the end user.
Just looking at those two items it would seem obvious that the database server disappeared momentarily from the network. However, from a DBA standpoint all signals pointed in the opposite direction.
The SharePoint databases are on a shared SQL Server that hosts two other line-of-business applications and a few other smaller databases. None of the other applications were logging any errors or experiencing performance problems. Coupled with that, there are currently four different systems monitoring the server in question: SQLH2, SCOM 2007 R2, SiteScope, and a custom monitor that I wrote years ago (SCOM 2007 R2 is in progress to replace everything else). None of the ones that generate alerts ever alerted to a problem on the server, and the performance counters being collected by SQLH2 were well within specifications. In addition, I was generally connected to that server when the problem occurred and none of my query windows had been disconnected. Based on all of this information, my response to the MOSS team has been that the SQL Server is just fine.
The above errors are from the Monday of PASS and, as you might expect, from when I was in the air flying. I got the email with both included during my layover in Minneapolis, MN, and logged in to check things out despite the fact that I had no other alerts, emails or signs that something was actually wrong with the SQL Server. Once again the response back was: everything looks good, so when is the MOSS RAP scheduled with Microsoft so that we can figure out what is wrong with that environment?
One thing I know about that particular SQL Server is that it always runs a high number of user connections; over 400 is not uncommon, despite the fact that 300+ of those are doing nothing but sleeping. Tuesday morning, in the middle of his session, Bob briefly touched on the THREADPOOL wait type and mentioned that connection failures can be caused by it. Basically, at least as I understood it from the session, the server is out of worker threads and can't bind the new connection request to a worker thread. When this happens the connection waits for 10 seconds before timing out and failing.
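If you want to sanity check a server of your own for this, the worker thread ceiling and the current worker usage are both exposed through DMVs. Something along these lines (a quick sketch, not the exact queries from Bob's session) will show the configured maximum from sys.dm_os_sys_info and how much of it the schedulers are actually using:

-- Worker thread ceiling for the instance (512 by default on a 64-bit server with 4 or fewer CPUs)
SELECT max_workers_count
FROM sys.dm_os_sys_info

-- Current workers and any tasks queued up waiting for a worker to free up
SELECT SUM(current_workers_count) AS [CurrentWorkers],
       SUM(work_queue_count) AS [QueuedTasks]
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE'

If [QueuedTasks] climbs above zero while [CurrentWorkers] is sitting at the ceiling, new requests are going to pile up in THREADPOOL waits exactly the way Bob described.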
The database server happens to be a 2 vCPU 64-bit server, which means that under the default configuration it has 512 worker threads. For whatever reason, this, along with the fact that the server runs so many user connections, clicked with me when Bob mentioned this wait type, and I had to VPN into work to check whether this server had THREADPOOL waits occurring. Sure enough it did, so I ran a couple of other queries to count user sessions:
-- Check for accumulated THREADPOOL waits (worker thread exhaustion)
SELECT * FROM sys.dm_os_wait_stats WHERE wait_type = 'THREADPOOL'

-- Count user sessions, and how many have been idle for more than 15 minutes
SELECT Count(*) AS [UserSessions]
FROM sys.dm_exec_sessions
WHERE is_user_process = 1

SELECT Count(*) AS [SessionsOver60]
FROM sys.dm_exec_sessions
WHERE is_user_process = 1
AND last_request_end_time < Dateadd(mi,-15,Getdate())
I should point out that the values above are from after the server was restarted over this past weekend during a maintenance period. However, as you can see, the problem still exists. Needless to say I walked out of Bob's session and headed to the SQL Clinic to get with a member of the Dev team, review what I was seeing, and get some more information. This is an added plus of attending the PASS Summit: the development team was available in the Convention Center to look at problems in real time at no additional cost. You can't beat that with a stick.
You might wonder why the problem still existed at the beginning of this week. The problem isn't really a SQL Server problem. The problem in this case is an oversized connection pool that has no timeout set for its connections, combined with what looks like possibly incorrect connection management in some custom MOSS Web Parts. The application in this case is actually a victim of how it is being used, configured, or customized. I should point out that MOSS out of the box doesn't exhibit this kind of connection problem, that there are three servers in our MOSS farm, and that the idle sessions are not all coming specifically from the MOSS farm.
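If you want to see where idle connections like this are coming from on your own server, grouping the sleeping sessions in sys.dm_exec_sessions by host and program name gives a quick breakdown (a rough sketch, not necessarily the exact query we used here):

-- Attribute sleeping user sessions to the hosts and applications that own them
SELECT host_name,
       program_name,
       Count(*) AS [IdleSessions]
FROM sys.dm_exec_sessions
WHERE is_user_process = 1
  AND status = 'sleeping'
GROUP BY host_name, program_name
ORDER BY [IdleSessions] DESC

The program_name column in particular makes it fairly obvious whether the connections belong to the MOSS web front ends or to something else entirely.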
While attending PASS didn't fix the problem here, it did bring the root cause of the problem to light. I would venture to say that even a MOSS Premier Field Engineer wouldn't have been able to isolate this problem, though I could be wrong about that. This simply isn't the kind of thing you see often enough for it to be something you'd think about. We are working to solve the problem given the additional information we now have, thanks to Bob Ward and PASS Summit 2009.
Now for the not really obligatory PASS plug, but I enjoyed the last two Summits so much that I'm going to do it anyway. If you missed this year's Summit and you want to attend next year's, there is no better time than right now to sign up for it.
Register for PASS Summit 2010, only $995 until Jan 15 www.sqlpass.org/summit/na2010
This is the lowest price you will see for PASS Summit 2010, and it is available now so that you can tap into any remaining training budget for FY2009, possibly allowing your company to split the total cost of attendance (there are hotel and flight costs as well) between this year's and next year's budgets. If you are looking for a real-life way to justify the cost to your employer, you can't beat my own personal experiences this year.
I'd like to thank Bob Ward and the SQL Server Development team for the information that led to identifying the root cause of this problem.