Merrill Aldrich

Rules-Driven Maintenance

I haven’t posted for a while, but I hope that today I’ve got something really interesting to share.

Even though I love SQL Server, there are a few areas that have always sort of disappointed me:

  1. People around the world have constantly had to implement and re-implement essentially the same patterns for database maintenance and backups.
  2. Even though those patterns are well established, there isn’t much out of the box in SQL Server -- and I include Maintenance Plans and Central Management Server/Policy-Based Management -- that works well to implement them, especially across a large number of instances. There are some very good community-supplied solutions ( http://ola.hallengren.com/, http://sqlfool.com/2011/06/index-defrag-script-v4-1/ ) that help, which is great, but the popularity of those solutions also demonstrates that the gap in the product is real.
  3. I personally found I was spending a lot of time setting up, checking on and troubleshooting basic maintenance and backup jobs, and after a while it becomes very repetitive work. We have about 2000 databases and a team that has ranged from one to four people.

Some simple examples:

  1. You want all databases on all production instances to back up nightly, in sequence, during a maintenance window. If a new database is added, it should also start being backed up. Old files should be removed in some safe, reasonable fashion.
  2. You want log backups for any databases that are in full recovery mode, on any instance, to happen at least every 15 minutes, and more often if the log files are filling up.
  3. You want any indexes that are fragmented, and only those that are fragmented, to be rebuilt or defragged nightly during a defined maintenance window.
  4. You want all databases on all instances to be verified with CheckDB nightly or weekly, in sequence, at a specified time.

Basically anybody who manages small to mid-size databases has these rules about maintenance, and they are always very similar, but we have to go around and set up many, maybe hundreds, of jobs and scripts to implement those rules. Editing the jobs when the rules, or perhaps the deployed databases, change is really tedious.

Solution: Rules-Driven Maintenance

I came up with a system to help manage this with less manual effort. The system consists of tables containing preferences that describe how maintenance and backups should operate, and some stored procedures and functions that implement those preferences in an automated, instance-wide maintenance routine. I was thinking of calling this “Policy-Based Maintenance,” but I don’t want it to be confused with the SQL Server feature Policy-Based Management (PBM) because it’s not really the same animal. (I like PBM, but don’t get me started about how the SQL Server team forgot to finish that feature before moving on to other things...)

Here is a brief synopsis of how the system is implemented, using full backups as an example:

  1. There’s a policy table called FullBackupDefaults that defines the basic preferences about when and where full backups should be taken.
  2. There’s a second table called FullBackupOverrides where it’s possible to make exceptions to the default settings for specific databases.
  3. A table-valued function compares those preferences to the state of the databases on the instance including their backup history, and generates a “work queue” – a list of databases to back up – for the current time.
  4. A SQL Agent job is spawned that processes that work queue, iterating through the listed databases and backing them up.
  5. When the spawned job completes successfully, it is deleted.
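The work-queue pattern in the steps above can be sketched roughly as follows. This is an illustrative sketch only: the object names (FullBackupWorkQueue and its columns) are assumptions for the example, not necessarily the names used in the actual code.

```sql
-- Hypothetical sketch of the driver logic described above.
DECLARE @now datetime = GETDATE();

-- The table-valued function compares the policy tables to the databases
-- on the instance and their backup history, and returns the work queue.
SELECT DatabaseName
FROM dbo.FullBackupWorkQueue(@now);

-- If any rows come back, the driver spawns a one-off SQL Agent job that
-- iterates the queue, issuing BACKUP DATABASE for each entry, and the
-- spawned job deletes itself on success.
```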

If a new database is added to the instance, it’ll automatically become part of this process, using the default settings. If you have a database that needs special treatment, let’s say weekly full backups instead of nightly full backups, then you can make an entry in the FullBackupOverrides table that describes the different policy to apply to the exception database.

Using that general idea, I created similar maintenance code for both data (full and differential) and log backups, index maintenance, statistics maintenance and integrity checks. Each type of activity has its own pair of preference tables following the same design as the example above. With this solution, I can deploy maintenance across a large number of SQL Server instances, and set preferences for how it should act, rather than making and maintaining detailed SQL Agent jobs or maintenance plans, which would be hard to administer centrally. The maintenance will handle new databases fairly seamlessly, so there is no need to go back and constantly check the servers, other than to run some PBM rules against them from the Central Management Server to be sure things are healthy.

What about the existing community solutions?

You should definitely check these out:

http://ola.hallengren.com/

http://sqlfool.com/2011/06/index-defrag-script-v4-1/

They are great, and they offer features that are different from, and in some respects better than, this solution's. The one feature that I think distinguishes mine is the use of preference tables to drive all the maintenance activity.

Disclaimers

The code for the system is posted here, and you are welcome to download it and try it out. Please note that this is not a commercial solution, and that while you may test or deploy this, you do so at your sole risk. Edits to the code almost certainly will be required for your environment. Please read, understand and test the code thoroughly before you even imagine using this in production. I do not want to cause you or your employer any pain. License for using this solution is GPL 3.0: http://opensource.org/licenses/GPL-3.0. You may use and modify this code, but not sell it.

This has been run on SQL Server versions from 2005 to 2008 R2. It might work on 2012, but I have not tested it.

I run only case-insensitive collation; if you are on a case-sensitive system, then you have my sympathy, and you might have to edit a few things to make this work.

There is coverage for Quest Litespeed backups in the solution; other third-party backup products could be incorporated easily, but have not been. It should be possible to follow the pattern used to handle Litespeed and extend it to any other third-party stored procedures.

If you haven’t done index maintenance for a while and you flip this solution on, you might blow up your transaction log. Be sure that you have plenty of log space and plenty of log backup space, and watch the process for the first few runs. This is especially true if the combination of Standard Edition and the preference for Online index maintenance applies in your environment, because the maintenance process will then perform index maintenance using reorganize (defrag), which produces a lot of log.
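Before the first run, it's worth checking how full each log currently is so you can size the log and log-backup space accordingly; the standard DBCC command gives a quick per-database summary:

```sql
-- Reports log size and percent-of-log-used for every database
-- on the instance; run before and during the first maintenance runs.
DBCC SQLPERF (LOGSPACE);
```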

Lastly, this is a solution for the type of environment that has many small databases, not the type with a few big, performance-intensive databases. Databases that are large and require detailed attention may not be suitable for a one-size-fits-all solution like this.

Walkthrough

OK, with all the fine print out of the way, here’s how you can try this code. You will need these prerequisites:

  1. PowerShell 2 and SSMS / SMO installed on the workstation from which you will deploy the solution. The deployment is handled by a PowerShell 2 script that uses invoke-sqlcmd. Before running it, you’ll need to have the SQL Server Snap-ins for PowerShell installed and working. The solution itself, after deployment, does not require PowerShell on the servers where it is deployed.
  2. The full backup procedure in this solution calls a file delete procedure that I have implemented as a CLR stored procedure. You’ll need either mine, or you’ll need to substitute your own, or comment out the calls to that procedure and handle deletes some other way. The one I use is available here.
  3. Windows-authenticated sysadmin access to your SQL Server environment. The deployment script uses windows auth and your own identity. It could be edited to support SQL auth if necessary, but it does not by default.

To install on your test server

  1. Download and unpack the code into a convenient folder.
  2. Read and edit Deploy.ps1, entering the name of your test server, and the location of the deployment SQL scripts, according to the instructions in the file. The deployment script can push the solution out to one or to multiple SQL Server instances.
  3. Execute your edited Deploy.ps1 to install the solution to the specified server.

The deployment process will populate an admin database on the SQL instance with the policy tables, stored procedures and functions, and will add five SQL Agent jobs to the instance. You can use your own admin database or allow the script to create one for you.

There is one SQL Agent job for each type of activity required for maintenance:

Policy-Based Data Backups

Policy-Based Log Backups

Policy-Based Integrity Checks

Policy-Based Index Maintenance

Policy-Based Stats Maintenance

Each job typically works like this:

  1. On the job schedule, each job will check to see if it’s time to do any work. (For example, is it now the start of the backup maintenance window for any database(s)?) The work, if any, is listed using a table-valued function that takes the current time as a parameter.
  2. If there is work to be done according to the maintenance preferences, then the system will spawn a new SQL Agent job, specifically for that work. Example: Run index maintenance as of Feb 5, 2012 at 21:00.
  3. The newly spawned SQL Agent job will iterate through the work queue for that time. If the job is successful, it will be automatically deleted. If there is an error, it will remain on the server for troubleshooting, or for monitoring software to alert on.

Fix This One Broken Thing

In the Full Backups stored procedure, there are calls to a proc called DeleteAgedBackupFiles. You, most likely, don’t have said procedure :-). You can either remove this call, substitute your own, or compile the CLR code and install it from Prereq #2 above. I didn’t do this to be mean – this is a cheap trick to make you, gentle reader, not install in production until you have read and understood the code.
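If you just want the backup procedure to run while you evaluate the solution, one option is a no-op stand-in. This stub is purely hypothetical: the real procedure is a CLR proc whose parameter list is defined in its project, so the signature shown here is a guess and must be adjusted to match the actual calls in the Full Backups procedure.

```sql
-- Hypothetical no-op stand-in for testing only; it deletes nothing,
-- so old backup files will accumulate until you handle cleanup yourself.
CREATE PROCEDURE dbo.DeleteAgedBackupFiles
    @path nvarchar(260) = NULL   -- hypothetical parameter; match the real call
AS
BEGIN
    PRINT 'DeleteAgedBackupFiles stub: no files deleted.';
END
```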

Setting Preferences and Policies

The rules driving the maintenance jobs are stored in a few simple tables. There are views over those tables that give a summary of what preferences are in effect:

SELECT * FROM dbo.FullBackupPolicies

SELECT * FROM dbo.DiffBackupPolicies

SELECT * FROM dbo.LogBackupPolicies

SELECT * FROM dbo.IndMaintPolicies

SELECT * FROM dbo.StatsMaintPolicies

SELECT * FROM dbo.IntegCheckPolicies

Behind each view is a pair of tables, one for default settings that apply across the whole instance, and a second for database-specific overrides or exceptions to those defaults. To adjust the behavior or the timing of the different kinds of maintenance, just adjust the preference values in those underlying tables.

Some examples:

Q. How do I set the time for my integrity checks window for all databases?

A. In the table dbo.IntegCheckDefaults set the value for WindowStartTime to the desired time, as 24-hour time, format hh:mi.
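For example, to run integrity checks at 02:00 for every database on the instance (the column name comes from the preference list in the FAQ; WindowStartTime is stored as hh:mi text):

```sql
-- Instance-wide default: integrity check window opens at 02:00 daily.
UPDATE dbo.IntegCheckDefaults
SET WindowStartTime = '02:00';
```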

Q. How would I limit integrity checks for one larger database to only Saturday nights?

A. In the table dbo.IntegCheckOverrides, make a row with the database name, and set the value for all days of the week, except Saturday, to 0. Set Saturday to 1.
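A sketch of that override row, using an illustrative database name; the day-of-week bit columns are taken from the preference list in the FAQ, and any columns left NULL inherit the instance defaults:

```sql
-- Check BigWarehouse (illustrative name) only on Saturday nights.
INSERT dbo.IntegCheckOverrides
    (DatabaseName, Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday)
VALUES
    ('BigWarehouse', 0, 0, 0, 0, 0, 0, 1);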

Q. How do I make one larger database use differential backups nightly, with one weekly full backup, while the other databases get a nightly full?

A. In the table dbo.FullBackupOverrides, make a row for the large database. Set all the values for days of the week to 0 except the day you want the full backup. In the table dbo.DiffBackupOverrides, make a row for the database and set all the days of the week to 1 except the day you want the full backup.
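Sketched out, with an illustrative database name and Sunday as the full-backup day (column names assumed from the preference list in the FAQ):

```sql
-- Full backup Sunday only for the large database...
INSERT dbo.FullBackupOverrides
    (DatabaseName, Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday)
VALUES
    ('BigWarehouse', 1, 0, 0, 0, 0, 0, 0);

-- ...and differentials the other six nights.
INSERT dbo.DiffBackupOverrides
    (DatabaseName, Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday)
VALUES
    ('BigWarehouse', 0, 1, 1, 1, 1, 1, 1);
```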

Q. How can I make the system ignore backups for one large database that I explicitly don’t want to back up?

A. In the table dbo.FullBackupOverrides, add a row for the database where Exclude = 1.
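That is a one-line override (illustrative database name; other columns stay NULL and simply inherit the defaults, which no longer matter once the database is excluded):

```sql
-- Skip ScratchDB (illustrative name) in policy-driven full backups.
INSERT dbo.FullBackupOverrides (DatabaseName, Exclude)
VALUES ('ScratchDB', 1);
```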

This same pattern should work to set any and all preferences.

Turning On Automated Maintenance

The deployment script will install the SQL Agent jobs as “disabled,” as a safety net. When you are ready to begin using the automated maintenance, just enable the five SQL Agent jobs.
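Enabling them can be done from SSMS, or in one batch with msdb's sp_update_job, assuming the job names match the list earlier in the post:

```sql
-- Enable the five policy jobs deployed by the solution.
EXEC msdb.dbo.sp_update_job @job_name = N'Policy-Based Data Backups',      @enabled = 1;
EXEC msdb.dbo.sp_update_job @job_name = N'Policy-Based Log Backups',       @enabled = 1;
EXEC msdb.dbo.sp_update_job @job_name = N'Policy-Based Integrity Checks',  @enabled = 1;
EXEC msdb.dbo.sp_update_job @job_name = N'Policy-Based Index Maintenance', @enabled = 1;
EXEC msdb.dbo.sp_update_job @job_name = N'Policy-Based Stats Maintenance', @enabled = 1;
```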

With the jobs running, watch the SQL Error Log periodically, and the file system, to be sure you are getting all the maintenance activities when you expect to, and that the agent jobs are not failing.

Feedback

I would LOVE feedback on this solution and ideas about how to make it better. If you try it, please come back here and let me know how it went.

(F.) A. Q.

Here are some details that may be helpful:

Q. What are the available preferences?

WindowStartTime: All maintenance activities except log backups run in a window starting at a specified time; this is the daily start time for that window.
WindowDurationMins: Index maintenance can be limited to a set duration, past which new rebuilds will not start. This value is the length of the window. (A rebuild already in progress, however, will run to completion.)
Sunday - Saturday: Days of the week on which to perform the maintenance activity (bit columns).
MaxIntervalHours / MaxIntervalMinutes: For backups, the maximum length of time between backups. For example, you can set a rule that a full backup should be taken at least every 24 hours, or a log backup at least every 15 minutes.
MaxLogPercentUsed: The log-percent-full threshold that will trigger a log backup. The frequency of log backups is governed both by MaxIntervalMinutes and by MaxLogPercentUsed.
Path: The path for file output. If NULL, the default path set at the instance level is used.
ThirdPartyBackup: Optional third-party backup product. Only Litespeed is implemented. Can be NULL (implies Native), ‘Native’ or ‘Litespeed’.
UseDifferentialBackup: Bit indicating that differential backups should be used.
FullBackupSequencesToRetain: For the backup cleanup process, how many sets of full, differential and log backups to keep on disk.
DeleteOnlyArchivedFiles: For the backup cleanup process, whether to delete only files whose “ready to archive” attribute is set in the OS. (Can prevent deleting files that have not yet gone to tape.)
DatabaseName: For override tables, the name of the database the overrides apply to.
Exclude: Ignore/skip the database (bit).
Online: For index maintenance, try to keep the indexes online (1) or do not try (0). On Enterprise Edition, Online = 1 causes online index rebuilds; on Standard Edition, Online = 1 causes reorganize instead of rebuild. Watch the log file size and backup disk capacity if using defrag!
DefragThreshold: When examining indexes for fragmentation, the percent-fragmentation threshold that will cause index reorganize.
RebuildThreshold: When examining indexes for fragmentation, the percent-fragmentation threshold that may cause rebuild (subject to the Online preference and SQL Server edition as described above).

Q. Why does the log backup policy job never stop?

A. The log backup policy job is implemented as a loop with a one-minute delay, so as not to flood the job history with executions on a one-minute schedule. It runs every minute to keep up with log files that might be filling at any given time. It doesn’t do work every minute; a backup is executed for a given database only when MaxIntervalMinutes or MaxLogPercentUsed is exceeded.

Q. I set my WindowStartTime to 03:15 or 11:45 and now I don’t see any maintenance/backups? Huh?

A. The Policy jobs can run on any interval, at any time, but at least one execution has to “line up,” within a few minutes, with the WindowStartTime. That is, if the policy job runs every hour, on the hour, but your window is set to start at 15 minutes after the hour, the job will never find work to do. Set the job schedules to align with the corresponding WindowStartTime.

Q. Why did the index maintenance job fill my Standard Edition log?

A. See if you have the combination of Standard Edition and Online = 1 for index maintenance. That causes indexes to be maintained using reorganize, which is very log-intensive. You can:

a. Watch and wait, because the level of fragmentation, and so the amount of log generated, may drop with more regular maintenance, depending on the database. That is, it might be a temporary issue.

b. Turn the Online preference off, if your application is not 24x7 and you can tolerate blocking during a maintenance window.

c. Get out the checkbook and perform an edition upgrade to Enterprise.

Q. How the heck do these crazy overrides work?

A. The design of the overrides is: for each database, for every value in the preferences, IF the override table has a non-null value, use it. If the override table has no value or the value is NULL then use the corresponding default value. This means that when you add a row to an overrides table, you only have to fill in the specific values that are different from the defaults, and leave the rest as NULL. The NULLs will “inherit” the default values from the defaults table at run time.
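One plausible sketch of that inheritance rule, using the full backup tables as an example; this is not the actual view text, just an illustration of the COALESCE pattern it describes:

```sql
-- Effective policy per database: override value if present, else the default.
-- Defaults is assumed here to be a single-row table, hence the CROSS JOIN.
SELECT
    o.DatabaseName,
    COALESCE(o.WindowStartTime, d.WindowStartTime) AS WindowStartTime,
    COALESCE(o.Exclude,         d.Exclude)         AS Exclude
FROM dbo.FullBackupOverrides AS o
CROSS JOIN dbo.FullBackupDefaults AS d;
```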

Q. What happens if a job fails?

A. Failure typically will result in an extra job (or many extra jobs) showing in SQL Agent Job Activity Monitor, with a name that indicates the date and time that the job was intended to run. The job history can be used to troubleshoot what happened. Successful maintenance jobs will delete at the end of their run, but failed jobs will remain for troubleshooting.

Q. How can I see what index maintenance was done?

A. Select from the TrackFragmentation table: SELECT * FROM dbo.TrackFragmentation

Q. Does this work with:

Log Shipping: Yes. Log backups should work by firing the log backup job in SQL Agent instead of running a conventional log backup. You may elect to disable the log shipping log backup job’s schedule (but don’t delete the job) to prevent collisions between the schedule starting the job and this solution starting the job.
Mirroring: Yes.
Replication: Probably. I don’t use replication much, so it has not been tested.
Clustering: Yes.
Named Instances: Yes. For multiple instances on the same machine, consider staggering the WindowStartTime values on each instance.
SQL Server 2012: I hope to find out soon. Looks promising.
SQL Server 2000: Um … no.
VLDB: Not really. This solution targets many small, simple databases. It does operations like sp_updatestats that might not be appropriate for a big system.
Partitioning: Yes; there is automatic detection and index maintenance for partitioned indexes on Enterprise Edition.

Q. I have an admin database already. Can I use it?

A. Yes. Check for name collisions for objects in the database, and then just enter your preferred admin database name in the Deploy PowerShell script.

Q. The SQL scripts don’t include “USE <database>”?

A. True. The T-SQL scripts are designed to be used with a database that has the name of your choosing. (In my shop, all the admin databases have unique names per instance, by design.) The PowerShell deployment script handles the database name. If you run the scripts from SSMS, be sure to set the database manually.

Q. The deployment script croaked in the middle. Can I rerun it?

A. Yes, the deployment scripts are all designed with “If not exists…” and can be rerun if necessary.

Q. I have tested the solution and I want to deploy on multiple instances. How?

A. The Deploy.ps1 file can use a list of SQL Server instances to deploy across multiple servers if desired. Keep in mind that the Policy agent jobs are disabled by default when deployed and will need to be switched on.

Q. Why is WindowStartTime implemented as text! You Hack! :-)

A. SQL Server 2005 (the time data type didn’t exist until SQL Server 2008).

Note: I’ve published some updates to this code, so please grab the latest version here.

Published Wednesday, August 01, 2012 11:23 AM by merrillaldrich

Attachment(s): RulesMaintenanceV1.zip


Comments

 

Merrill Aldrich said:

This month I have a minor update to the Rules-Driven Maintenance code I originally posted back in August

March 11, 2013 10:56 PM
 

mbourgon said:

Merrill, a question about this for you.  It looks like I could stagger backups by setting up 2 rows per server, so that M/W/F it would back up to server A, and T/Th/S it would back up to server B.  Is that correct?  

September 20, 2013 3:07 PM
 

merrillaldrich said:

When you say "back up to server x" I assume you mean the output path for backups going to a share on another server rather than to local disk?

If so, I looked at it and with a small change and some testing I think you could do that. You'd have to change the PK on the FullBackupOverrides to allow more than one row of settings per database, then make sure, or enforce with a different constraint, that the settings don't overlap - or else you might get strange behavior like two backups the same day.

September 21, 2013 3:40 PM
 

Merrill Aldrich said:

Happy Fall! It’s a beautiful October here in Minneapolis / Saint Paul. In preparation for my home town

October 9, 2013 7:19 PM

