When dealing with VLDB availability, I found it useful to keep in mind a set of simple and general guidelines. Let me highlight them here.
1. Identify VLDB Early in the Application Lifecycle
The importance of this is rather self evident. But it is often ignored in practice, and it may take a high-profile outage for people to wake up and start to retrofit their existing systems with VLDB best practices.
2. Specifically Plan for Disaster Recovery Scenarios
With VLDB, you have little margin for error and no second chance for retry.
- Plan and document for each of the key disaster categories. Key disaster categories may include: site failure, single server failure, cluster failure, SQL instance failure, user/application errors, database corruption, file failure, page failure, storage outage, and network outage.
- For each disaster recovery scenario, the plan must include specific recovery strategies with detailed step by step recovery procedures. Some of the recovery procedures are not trivial, and a small error may significantly prolong the outage. So perish the thought of doing them by the seat of your pants!
- The plan must identify potential areas in which the availability requirements may not be met, and include contingencies for remedying the deficiencies. Where it is found to fail the availability requirements, alternative database/app architecture or infrastructure improvement may need to be introduced.
3. Reduce the Unit of Consistency Checks, Backup, and Recovery
- Give priority to recovery strategies at the finer granular level, and only fall back to recovery strategies at a larger unit when necessary. SQL Server 2005 provides a suite of features to enable recovering from database corruption at different levels of granularity. One has the choices of restoring pages, restoring files, restoring filegroups, and restoring the database.
- Reserve the strategy of full database restore as the last resort. This separates a VLDB from a small database. For a small database, restoring from a full database backup is almost always a necessary, and often sufficient, step in its recovery strategy. For a VLDB, restoring from a full database backup is rarely desirable.
- Avoid the one-big-file approach. Create multiple data files for a database. This may improve time to recovery because if only a single file is damaged one has the option of restoring that file and replaying the transaction log backups.
- Enable the database PAGE_VERIFY CHECKSUM option. The option helps with early and more precise detection of problems on the I/O path, and facilitates root cause analysis should a problem is encountered.
- Determine whether data in a VLDB can be partitioned into separate logical groups. This is not implementation specific. Implementation can be based on table partitioning, placing user objects in different filegroups, or placing the objects in different databases.
- Partition large tables and implement partition-based recovery strategy. Table partitioning is one of the most important VLDB features.
- Partition a very large database into multiple smaller databases. Examine the database to see whether it can be divided into multiple databases that have loose coupling between them. Consider the tradeoff between ease of maintenance operations with multiple databases and likely increase in application complexity.
4. Reduce or Eliminate Unnecessary/Repetitive Work
A smaller database may tolerate unnecessary repetition in its maintenance work without compromising availability. In a VLDB, repetition becomes expensive and often a key hindrance to successfully meeting your availability requirements.
Examine expensive tight loops in maintenance jobs, and move as much work out of the inner loop as possible. Database maintenance operations often consist of collections or hierarchies of loops of work. For instance, it is common to loop through databases, tables, and then indexes to update statistics and perform index rebuild. The assumption is that these objects can be treated uniformly—an assumption whose validity must be explicitly examined in a VLDB context.
Explore opportunities to reduce repetitions caused by job failure and restart.
It is common that when a database maintenance job fails and restarts, it may repeat bulk of the work that has already been completed prior to the failure, a waste you may not afford in a critical VLDB.
5. Separate Read-Only Data from Read-Write Data
Fortunately, a VLDB often contains a large amount of data that becomes read-only once it ages beyond a point. The maintenance requirements on the read-only data are inevitably less demanding than those on the read-write data. Effective VLDB management must exploit this characteristic.
Identify the complete lifecycle for each data entity. A data entity typically goes through several phases in its lifecycle. An example of such a lifecycle is as follows: Data creation -> Read-write phase -> Read-only phase -> Data purge.
Partition very large tables using table partitioning, identify the read-only partitions, and apply the maintenance operations on the read-only partitions only when necessary.
Purge data at the end of its lifecycle by dropping its table partition. Removing a table partition is a database meta-data operation and can be done instantaneously.
Move read-only data to a separate read-only database or multiple read-only databases. This must be weighed against likely increase in application complexity due to the need to route queries to multiple databases.
6. Pre-perform as Much Work as Possible
Examine each recovery scenario, and look for work that can be done before the recovery state. The goal is to keep to a minimum the amount of work that has to be done after an event of a disaster to recover a database.
Do not delay learning until the recovery time. That's right!
7. Prefer Resilience to Recovery
Being able to transparently recover from a failure is resilience. In other words, resilience can be viewed as automated recovery. Server clustering, automatic failover with synchronous log shipping or disk replication, and multiple load-balanced HBAs are examples of configurations that improve resilience.
8. Take Advantage of the SAN infrastructure
Consider SAN-based DBMS snapshot backups. SAN-based snapshot backups are capable of backing up very large databases in a matter of seconds to another set of disk volumes, making available the otherwise backup time for performing other maintenance operations.
Consider SAN clones, snapshots or BCVs for offloading maintenance operations. These SAN technologies require little time to create a copy of a very large database, which can then be used for offloading maintenance operations.