I recently ran into a rather freakish case in which many SQL Agent jobs on a SQL Server 2005 instance were reported to run twice at their scheduled times. The second run took place either at almost exactly the same time or only a second or two later.
Upon further examination, I could confirm the following:
- The jobs did indeed run twice. No hallucination here! This was not merely a matter of the job history showing two entries for each run; the jobs actually ran twice as often as their schedules would normally allow.
- A SQL trace revealed that the related queries (e.g. queries checking the job status, retrieving the job info, and updating the job history) all came locally from the same server. So it was not the case that msdb had been copied somewhere else and was still pointing back to this server. Nor was it the case that jobs on some other server had identical schedules to the jobs on this server and were kicking off the jobs on this server.
- Stopping SQL Agent did not stop all the SQL Agent traffic seen in the SQL trace. Nor did it prevent the jobs from being executed, though they were no longer being executed twice.
- For a long-running job with an output file specified, the second run would often fail because it could not get hold of the output file, which was further evidence that a second attempt was indeed made to run the job at the scheduled time.
- Not all the scheduled jobs ran twice. The jobs that did not run twice appeared to be the ones whose durations were extremely short.
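The double-run pattern described above can be checked mechanically once the job start times are available. The following is only a minimal sketch: it assumes the start times have already been exported as (job name, start time) pairs from wherever they were recorded, and the function name `find_double_runs` and the five-second window are hypothetical choices made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_double_runs(runs, window_seconds=5):
    """Return (job_name, first_start, second_start) for consecutive runs
    of the same job that started within window_seconds of each other.

    runs is an iterable of (job_name, started_at) pairs, where
    started_at is a datetime. The window size is an assumption.
    """
    starts_by_job = defaultdict(list)
    for job_name, started_at in runs:
        starts_by_job[job_name].append(started_at)

    window = timedelta(seconds=window_seconds)
    suspects = []
    for job_name, starts in starts_by_job.items():
        starts.sort()
        # Compare each start time with the next one for the same job.
        for earlier, later in zip(starts, starts[1:]):
            if later - earlier <= window:
                suspects.append((job_name, earlier, later))
    return suspects


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the actual incident.
    sample = [
        ("Nightly backup", datetime(2024, 1, 1, 2, 0, 0)),
        ("Nightly backup", datetime(2024, 1, 1, 2, 0, 1)),
        ("Index rebuild", datetime(2024, 1, 1, 3, 0, 0)),
    ]
    for job_name, first, second in find_double_runs(sample):
        print(f"{job_name}: started at {first} and again at {second}")
```

Jobs that legitimately run every few seconds would be flagged as well, so the window needs to be shorter than the tightest real schedule.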
In addition, I googled for any reports of similar behaviors out there in the community, and did find a few reported cases. But none of them reported the root cause.
As I mentioned at the beginning, this is a rather freakish case and I don’t expect you to run into it. But it is still interesting from a troubleshooting perspective, and that’s why I think it’s worth sharing the story here.
Now, with the information given above, can you guess what may have caused this behavior? I’ve included the root cause at the bottom of this post. Note that the same observed behavior may well have other root causes, and what I encountered may be just one of them. Take a minute to think about it before you scroll to the bottom.
In this particular case, the SQL Server instance was a single instance running in a two-node cluster, and after poking around I eventually discovered that the SQL Agent service was still running on the inactive node (say node B). In other words, both nodes had the SQL Agent service running at the same time for the same SQL Server instance! Furthermore, it looked as though, when the SQL instance was failed over from node B to node A, the SQL Agent service was never stopped on node B.
This was not supposed to happen at all, and I had never seen it happen before. When failing over from node B to node A, the cluster service is supposed to ensure that SQL Server and SQL Agent are stopped on node B; that is how the server cluster is designed to function. Note that you can’t just start SQL Agent on node B (outside the cluster) if SQL Server has been failed over to node A, because the SQL Agent service depends on the SQL Server service, and the SQL Server service cannot run on node B when all the system databases are on node A.
I don’t know the underlying reason why SQL Agent was able to continue running on node B. But as soon as I killed the SQL Agent process from the OS on node B, the jobs stopped being executed twice.
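If you want to check for a stray SQL Agent on a passive node, one option is to ask each node for the service state with the Windows `sc` utility and flag any node other than the active one that reports the service as running. The sketch below is illustrative only. It assumes a default instance (whose Agent service is named `SQLSERVERAGENT`), English-language `sc query` output, and sufficient rights to query the remote nodes; the function names and node names are hypothetical.

```python
import re
import subprocess


def parse_service_state(sc_output):
    """Extract the state word (e.g. RUNNING, STOPPED) from `sc query` output.

    Returns None if no STATE line is found.
    """
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_output, re.MULTILINE)
    return match.group(1) if match else None


def query_service_state(node, service="SQLSERVERAGENT"):
    """Run `sc \\\\<node> query <service>` and return the parsed state.

    The default service name is an assumption (default instance only).
    """
    result = subprocess.run(
        ["sc", f"\\\\{node}", "query", service],
        capture_output=True,
        text=True,
    )
    return parse_service_state(result.stdout)


def nodes_running_agent(states, active_node):
    """Return the nodes, other than active_node, whose state is RUNNING.

    states maps node name to the state string from parse_service_state.
    """
    return sorted(
        node
        for node, state in states.items()
        if state == "RUNNING" and node != active_node
    )


if __name__ == "__main__":
    # Hypothetical node names; replace with the real cluster nodes.
    nodes = ["nodeA", "nodeB"]
    states = {node: query_service_state(node) for node in nodes}
    for node in nodes_running_agent(states, active_node="nodeA"):
        print(f"SQL Agent is unexpectedly running on {node}")
```

The parsing is kept separate from the `sc` call so that it can be exercised against captured output without touching a live cluster.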