THE SQL Server Blog Spot on the Web

Joe Chang

Job History row limiter

There are situations where guaranteed correctness is essential, which is why we have transactional databases meeting the ACID properties. And then there are situations where we just need an approximate number. One example is the size of the SQL Agent job history log.

I bring up this topic because on more than one occasion, after fixing performance problems at a client site, the remaining top waits are two SQL statements belonging to stored procedures in msdb: one in sp_sqlagent_log_jobhistory,

[screenshot: execution statistics for the sp_sqlagent_log_jobhistory insert]

and one in sp_jobhistory_row_limiter.

[screenshot: execution statistics for the sp_jobhistory_row_limiter statement]

The average wait times for the day of Feb 28 were 207ms and 201ms respectively. The average elapsed times over several weeks are 171ms and 170ms respectively. Neither statement shows meaningful CPU consumption, at 0.716ms and 0.124ms per call respectively.

A third statement, in sp_jobhistory_row_limiter, actually has much higher CPU consumption. This statement contributes 33.5ms CPU and 39.4ms elapsed time per call over the long period.

[screenshot: execution statistics for the high-CPU sp_jobhistory_row_limiter statement]

In terms of overall load on the system, these msdb procedures are negligible. A modern 2-socket system has 12-16 physical cores, so there are 288-384 CPU-hours available per day. The CPU intensive query consumes 40 min per day (33.5ms/call x 71,509 calls per day), a very small portion of the available CPU.
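As a quick back-of-envelope check of these figures (a sketch in Python using the numbers quoted above; the core count is the low end of the range):

```python
# Rough arithmetic behind the "40 min per day" figure above.
cores = 12                        # low end of a modern 2-socket system
core_hours_per_day = cores * 24   # 288 core-hours (384 at 16 cores)

cpu_ms_per_call = 33.5            # high-CPU statement in sp_jobhistory_row_limiter
calls_per_day = 71509

cpu_minutes_per_day = cpu_ms_per_call * calls_per_day / 1000 / 60
print(round(cpu_minutes_per_day))                # 40 (minutes of CPU per day)

pct_of_capacity = cpu_minutes_per_day / 60 / core_hours_per_day * 100
print(round(pct_of_capacity, 2))                 # 0.23 (percent of available CPU)
```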

The wait time of the 2 longer running queries may be significant at 4 hours per day each. The calls are from SQL Agent, not a live person waiting for the response, so the wait time is not a problem by itself. The high wait statement in sp_sqlagent_log_jobhistory is:

INSERT INTO msdb.dbo.sysjobhistory ( job_id, step_id, step_name, sql_message_id, sql_severity, message, run_status, run_date, run_time, run_duration, operator_id_emailed, operator_id_netsent, operator_id_paged, retries_attempted, server)
VALUES ( @job_id, @step_id, @step_name, @sql_message_id, @sql_severity, @message, @run_status, @run_date, @run_time, @run_duration, @operator_id_emailed, @operator_id_netsent, @operator_id_paged, @retries_attempted, @server)
GO

The high wait statement in sp_jobhistory_row_limiter is:

SELECT @current_rows_per_job = COUNT(*)
FROM msdb.dbo.sysjobhistory with (TABLOCKX)
WHERE (job_id = @job_id)

The maximum elapsed times of sp_sqlagent_log_jobhistory and sp_jobhistory_row_limiter over the longer period are 11 and 41 seconds respectively. Potentially this might cause problems in situations that require precise timing. Of course, it is probably not a good idea to rely on SQL Agent for precise timing on a heavily loaded database server processing a high volume of jobs.

One might think that at just under 1 call per second (71,509 calls per day against 86,400 seconds per day), the chance of a collision between two statements of 0.2 sec each is small. However, these are not random calls; they fire on schedule. So if 100 jobs fire every 2 minutes on the minute, the chance of collision is much higher.
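For a rough sense of the odds, here is a sketch assuming calls arrive at random (a Poisson process); the on-schedule firing pattern described above would only make overlap more likely:

```python
import math

calls_per_day = 71509
seconds_per_day = 86400
hold_s = 0.2        # approximate time each statement holds its locks

rate = calls_per_day / seconds_per_day            # ~0.83 calls per second
# Probability that a given call overlaps at least one other call,
# i.e. another arrival lands within a 2 * hold_s window:
p_overlap = 1 - math.exp(-rate * 2 * hold_s)
print(round(p_overlap, 2))                        # 0.28
```

So even under the optimistic random-arrival assumption, more than a quarter of calls would overlap another.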

The sp_sqlagent_log_jobhistory procedure inserts into jobhistory and then, at the end, calls sp_jobhistory_row_limiter. There will not be contention between the two procedures for a given job and session, so the issue is whether there will be collisions between different jobs. (I am not sure under what circumstances there would be multiple sessions of one job. Does this occur if the job does not complete within its scheduled period, or can one job be controlled by multiple schedules?)

The typical sequence of statements in sp_jobhistory_row_limiter, excluding conditionals, is as follows:

EXECUTE master.dbo.xp_instance_regread -- 'JobHistoryMaxRows'

EXECUTE master.dbo.xp_instance_regread -- 'JobHistoryMaxRowsPerJob'

BEGIN TRANSACTION

/* this is the high wait statement */
SELECT @current_rows_per_job = COUNT(*)
FROM msdb.dbo.sysjobhistory with (TABLOCKX)
WHERE (job_id = @job_id)

WITH RowsToDelete AS (
  SELECT TOP (@rows_to_delete) *
  FROM msdb.dbo.sysjobhistory
  WHERE (job_id = @job_id)
  ORDER BY instance_id
)
DELETE FROM RowsToDelete;

/* this is the high CPU statement */
SELECT @current_rows = COUNT(*)
FROM msdb.dbo.sysjobhistory

/* this is rarely called? */
WITH RowsToDelete AS (
  SELECT TOP (@rows_to_delete) *

  FROM msdb.dbo.sysjobhistory
  ORDER BY instance_id
)
DELETE FROM RowsToDelete;

IF (@@trancount > 0)
COMMIT TRANSACTION

The combination of the SELECT with (TABLOCKX) and the BEGIN/COMMIT TRAN will block inserts to jobhistory for the duration of the transaction. So it is quite possible there are collisions with the sp_sqlagent_log_jobhistory insert statement.

Below is the detail for the xp_instance_regread call. For some reason the execution statistics are not available. If it is mainly called from the sp_jobhistory_row_limiter procedure, at 2 calls per execution, then there would be about 143,000 calls per day with an average wait of 12.6ms per call.

[screenshot: execution statistics for xp_instance_regread]

I am thinking that the reason for the BEGIN/COMMIT TRAN block in the sp_jobhistory_row_limiter procedure is to prevent one session from deleting rows that are in the process of being inserted by another session.

Having limits on both the total number of rows for job history and for individual jobs is a good idea. But this is an example of a situation where we are not concerned about guaranteed accuracy. In this case there are 150 jobs and 260,000 rows in jobhistory, requiring 460MB of storage. If we set the total jobhistory limit at 500,000 rows and the individual job history limit at 1,000 rows, perhaps our intention is only to keep the history from growing without bound, not to enforce an exact count.
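The storage figures above work out to roughly 1.8KB per history row, so the proposed caps imply a bounded table size (a quick sketch):

```python
rows = 260_000          # current sysjobhistory rows (from the text)
size_mb = 460           # current storage

per_row_kb = size_mb * 1024 / rows
print(round(per_row_kb, 1))                       # 1.8 (KB per row)

# Projected size if the total cap were 500,000 rows:
print(round(500_000 * per_row_kb / 1024))         # 885 (MB)
```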

We definitely do not need to check and enforce both jobhistory limits 71,500 times per day. One way to reduce the frequency of deletes is to delete more rows than the minimum each time the limit is reached: I suggest deleting at least 10 rows per job every time the threshold is exceeded.
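To illustrate the effect of batch deletes, a small simulation (a sketch; it assumes one history row per job run and the hypothetical 10-row batch suggested above):

```python
cap = 1000     # per-job history limit
batch = 10     # rows deleted each time the cap is exceeded

rows = cap     # job is already at its cap
deletes = 0
for run in range(100):          # simulate 100 subsequent runs
    rows += 1                   # each run logs one history row
    if rows > cap:
        rows -= batch           # batch delete instead of deleting one row
        deletes += 1
print(deletes)                  # 10 (delete passes; with batch = 1 it would be 100)
```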

I think it is silly to do a full count of the number of rows in sysjobhistory on every job execution. The SQL Server SELECT COUNT(*) statement does not cheat: it reports the actual number of rows in the table by touching the row header of each and every row. Perhaps we should use an alternative, good-enough row count instead, from the sys.dm_db_partition_stats row_count column for example.

I would also like to do away with the BEGIN/COMMIT TRAN block along with the TABLOCKX. Perhaps we can restrict deletes to history rows more than one day old?

/* be sure to make this index change before altering the procedure */
CREATE INDEX nc1 ON sysjobhistory (job_id) INCLUDE (run_date)
WITH (DROP_EXISTING = ON)
GO

So my proposal for the sequence of statements in sp_jobhistory_row_limiter is as follows:

SELECT @today = CONVERT( int, CONVERT(char (8), GETDATE(),112))

--BEGIN TRAN
SELECT @current_rows_per_job = COUNT(*)
FROM msdb.dbo.sysjobhistory -- with (TABLOCKX)
WHERE (job_id = @job_id) AND run_date < @today

SELECT @rows_to_delete = @current_rows_per_job - @max_rows_per_job

IF ( @rows_to_delete > 0 ) BEGIN
/* if we only delete rows from previous days, this branch will only activate once per day */
WITH RowsToDelete AS (
  SELECT TOP (@rows_to_delete) *
  FROM msdb.dbo.sysjobhistory
  WHERE (job_id = @job_id) AND run_date < @today
  ORDER BY instance_id
)
DELETE FROM RowsToDelete;

END

/* check the cheap approximate row count first */
SELECT @current_rows = row_count
FROM sys.dm_db_partition_stats d WITH(NOLOCK)
WHERE object_id = OBJECT_ID('sysjobhistory') AND index_id < 2

/* only if the above is above threshold */
IF (@current_rows > @max_total_rows) BEGIN

SELECT @current_rows = COUNT(*)
FROM msdb.dbo.sysjobhistory with (NOLOCK)
WHERE run_date < @today

SELECT @rows_to_delete = @current_rows - @max_total_rows

IF ( @rows_to_delete > 0 ) BEGIN
WITH RowsToDelete AS (
  SELECT TOP (@rows_to_delete) *
  FROM msdb.dbo.sysjobhistory
  WHERE run_date < @today
  ORDER BY instance_id
)
DELETE FROM RowsToDelete;
-- ALTER INDEX ALL ON msdb.dbo.sysjobhistory REBUILD?
END

END

--COMMIT TRAN

Even though the overall impact of the jobhistory limit procedures is insignificant, it is somewhat annoying to have them show up at the top of the list in a third-party performance monitoring tool. With a few simple precautions, they can be eliminated.

ps
If anyone is willing to alter the jobhistory row limiter procedure,
please let us know the changes made along with the before and after execution statistics.

I should also add that the sysjobhistory table will get heavily fragmented over time, especially since job_id is a uniqueidentifier, so periodic index rebuilds are probably in order. Perhaps our strategy should be: when the max total rows is exceeded, delete a larger percentage, say 20% for example, then rebuild the index (outside the transaction, if one is present).

pps - I may have inadvertently deleted a comment? - oops! - sorry

Published Tuesday, March 05, 2013 4:42 PM by jchang


Comments

 

Deba Nayak said:

Hi Joe: good information. I see a similar potential performance issue indicated in the Ignite tool for our production systems. Let me know how the proposed changes have been working for you so far. Thanks again. debanayak@yahoo.com.

June 25, 2013 1:19 PM
 

jchang said:

The client did not use my proposed changes to the msdb procedures. They decided that because no one actually looked at the job history, it was better to truncate the table periodically. I still advise rebuilding the clustered index after truncation to force the release of dead space.

But please let me know what you do and the results.

June 25, 2013 10:46 PM

About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
