Linchi Shea

Checking out SQL Server via empirical data points

SQL Server puzzle: Why are the SQL Agent jobs executing twice?

I recently ran into a rather freakish case in which many SQL Agent jobs on a SQL Server 2005 instance were reported to run twice at their scheduled times. The second run started either at almost exactly the same time as the first or only a second or two later.

 

Upon further examination, I could confirm the following:

 

  • The jobs did indeed run twice. No hallucination here! This was not a case of the job history showing two entries for a single run; the jobs actually ran twice as often as their schedules should have allowed.
  • A SQL trace revealed that the related queries (e.g. queries checking the job status, retrieving the job info, and updating the job history) all came from the same server, locally. So it was not the case that a copy of msdb somewhere else was still pointing back to this server. Nor was it the case that jobs on some other server had schedules identical to those on this server and were kicking off the jobs on this server.
  • Stopping SQL Agent did not stop all the SQL Agent traffic seen in the SQL trace. Nor did it prevent the jobs from being executed, though they were no longer executed twice.
  • For a long-running job whose output file was specified, the second run would often fail because it could not get hold of the output file, which was further evidence that a second attempt was indeed made to run the job at the scheduled time.
  • Not all the scheduled jobs ran twice. The jobs that did not run twice appeared to be the ones whose durations were extremely short.
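
To make the first check concrete, here is a minimal Python sketch of how one might flag double runs in exported job history. It is not the method I used at the time. It assumes the rows were exported as (job name, run_date, run_time) tuples, one per job outcome, with the date and time stored as integers in YYYYMMDD and HHMMSS form, as they are in msdb.dbo.sysjobhistory. The function names and the five-second window are hypothetical choices for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def to_datetime(run_date: int, run_time: int) -> datetime:
    """Combine integer date (YYYYMMDD) and time (HHMMSS) into a datetime."""
    # Zero-pad so that, for example, run_time 10001 is read as 01:00:01.
    return datetime.strptime(f"{run_date:08d}{run_time:06d}", "%Y%m%d%H%M%S")


def find_double_runs(rows, window_seconds=5):
    """Return (job_name, first_start, second_start) for every run that starts
    within window_seconds of the previous run of the same job.

    rows: iterable of (job_name, run_date, run_time) tuples (assumed export
    format, one tuple per job outcome).
    """
    starts = defaultdict(list)
    for job_name, run_date, run_time in rows:
        starts[job_name].append(to_datetime(run_date, run_time))

    window = timedelta(seconds=window_seconds)
    doubles = []
    for job_name, times in sorted(starts.items()):
        times.sort()
        # Compare each start time with the next one for the same job.
        for earlier, later in zip(times, times[1:]):
            if later - earlier <= window:
                doubles.append((job_name, earlier, later))
    return doubles
```

A job that legitimately runs every few seconds would also be flagged, so the window should be smaller than the shortest schedule interval on the instance.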

 

In addition, I googled for reports of similar behavior in the community and did find a few cases, but none of them identified the root cause.

 

As I mentioned at the beginning, this is a rather freakish case and I don’t expect you to run into it. But it is still interesting from a troubleshooting perspective, and that’s why I think it’s worth sharing the story here.

 

Now, with the information given above, can you guess what may have caused this behavior? I’ve included the root cause at the bottom of this post. Note that the same observed behavior may well have other root causes, and what I encountered may be just one of them. Take a minute to think about it before you scroll to the bottom.

 

 ||

 ||

 ||

 ||

 ||

 \/

 

 

 

 

 

 

 

In this particular case, the SQL Server instance was a single instance running in a two-node cluster, and after poking around I eventually discovered that the SQL Agent service was still running on the inactive node (say node B). In other words, both nodes had the SQL Agent service running at the same time for the same SQL Server instance! Furthermore, it looked as though the SQL Agent service was never stopped on node B when the SQL instance was failed over from node B to node A.

 

This was not supposed to happen at all, and I had never seen it happen before. When failing over from node B to node A, the cluster service is supposed to ensure that SQL Server and SQL Agent are stopped on node B; that is how the server cluster is designed to function. Note that you can’t just start SQL Agent on node B (outside of the cluster) if SQL Server has been failed over to node A, because the SQL Agent service depends on the SQL Server service, and the SQL Server service cannot run on node B when all the system databases are on node A.

 

I don’t know the underlying reason why SQL Agent was able to continue running on node B. But as soon as I killed the SQL Agent process from the OS on node B, the jobs stopped being executed twice.
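
For anyone who wants to check for the same condition, below is a rough Python sketch of the comparison. It is not what I ran. It assumes you have already captured, for each cluster node, the text printed by the Windows `sc query` command for that node's SQL Agent service (SQLSERVERAGENT on a default instance), and that the text contains the usual `STATE : <number> <word>` line. The function names and node names are hypothetical.

```python
import re

# Matches a line such as "STATE              : 4  RUNNING".
_STATE = re.compile(r"STATE\s*:\s*\d+\s+(\w+)")


def parse_state(sc_output: str) -> str:
    """Extract the state word (e.g. RUNNING, STOPPED) from `sc query` text."""
    match = _STATE.search(sc_output)
    return match.group(1) if match else "UNKNOWN"


def stray_agent_nodes(sc_output_by_node: dict, active_node: str) -> list:
    """List nodes other than the active one whose Agent service reports RUNNING."""
    return sorted(
        node
        for node, text in sc_output_by_node.items()
        if node != active_node and parse_state(text) == "RUNNING"
    )
```

An empty result is what a healthy cluster should give; any node name in the list means SQL Agent is running somewhere other than on the node that currently owns the instance.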

Published Wednesday, August 25, 2010 6:13 PM by Linchi Shea

Comments

 

AaronBertrand said:

Whoa, that is weird.  I have never seen this happen either.  Was this symptom noticed shortly after applying a service pack or cumulative update?  Just wondering if the cluster got confused when one of the nodes was being updated...

August 25, 2010 9:32 PM
 

Linchi Shea said:

Hi Aaron;

No, there was no patch update on the cluster. I have a hunch about what might have caused it, but just haven't got time to test and verify. Definitely, the cluster was confused for whatever reason.

August 26, 2010 12:47 AM
 

Texdanny said:

Yes, that happened before. We did everything - and then we had to delete the job; recreate from scratch, and now all is well.  It made us look pretty bad, but we are still troubleshooting to prevent recurrence in the future.

August 28, 2010 9:54 PM
 

NebraskaPaul said:

We had the same issue.  I never could nail down the root cause, but did the same thing as TexDanny.  Deleted the job, recreated from scratch and all was well.

August 30, 2010 10:23 AM
 

Check said:

This may be over three and a half years old, but this just solved an ongoing mystery for me too.  Thanks a ton!

March 18, 2014 3:14 PM
 

Lucas Benevides (DBA Cabuloso) said:

I use SQL Server 2008 R2 with a failover cluster and this happened to me. It took me two whole days to find it out, thanks to this post. It is quite a huge BUG. We have also worked with clusters for years and had never seen anything like this.

Weeeird.

April 15, 2014 2:10 PM
