In my career to date I’ve worked as a DBA for mainly “buy don’t build” scenarios. One advantage - and hair-graying problem, if I am honest - has been that I think I’ve seen upward of one or two hundred different ISV-provided applications’ SQL Server databases. It’s a great learning opportunity, as I can see successes and failures in a large variety of designs, and sort of “fast-forward” to version five or eight of a given design pattern to see things like whether it scales, or has storage problems, has security flaws or DR issues. Today I want to talk about an often repeated blunder that I’ve seen the whole time: designing around the wrong number of databases in SQL Server.
Implementing an application based on the wrong number of databases, architecturally, can have major consequences. As you can imagine, that type of design issue is very, very hard to untangle in version 2 or 5, after the application is out in the wild. So to me it makes sense to consider very carefully, and get this right if you are lucky enough to be at the beginning of the process, or at a point where a redesign is possible.
The Most Important Consideration: The Log(s)
The transaction log doesn’t seem to get a lot of love in the development process – it’s just that workhorse file that becomes a nuisance when it grows to fill its disk. But here’s the thing: whenever anything goes wrong in SQL Server, from a simple failed row insert to a server crash or DR event, the log is the key thing that saves your bacon. It makes rollback possible, and recovery of the data to a consistent state possible. It makes atomic transactions possible. It makes online full backup possible, and point-in-time recovery. I’d like to repeat that: the log is a vital element in taking a full backup while the database is online. We’ll see why that is so important in a minute.
The most important idea about the log is this: the only way to get a transactionally consistent set of data over time is to have either just one stream of journal (log records) for the data, or to have multiple streams of journal that are robustly tied together, lock step. Because each database in SQL Server has its own, independent log file, the first situation is simple to implement – just use one database – while the second is very, very difficult. Splitting data into more than one database implies multiple log files, which in turn implies an independent log sequence. Having separate transaction log sequences means that, while it’s initially possible to perform a transaction across the two databases consistently and correctly, there’s very little guarantee after that transaction, in the space of backup and restore or disaster recovery, especially out in the real world.
The converse is also true – there’s obviously a penalty for co-mingling unrelated information together in one database with a single stream of log records. Centralizing all activity, unnecessarily, in a single log file carries with it overhead, reduced concurrency and other problems. So how can one make this design decision? It turns out there are some fairly simple criteria one can apply, and, when it comes right down to it, only two scenarios.
Two Possible Mistakes
Fortunately there are just two mistakes that we can make:
- Separating data that needs to be transactionally consistent into two databases
- Combining data that has no need to be transactionally consistent into one database
Here are some questions to ask about a design to see whether it is helpful or dangerous to split it into multiple databases. First, if you are considering multiple databases, are there any places in the design where the databases must be “lock-step,” time-wise. Example: are there any tables in one database that refer by joins and foreign keys to another database, or to a sequence or identity generated in another database? If so, you may be open to profound risk.
To test this, do the following experiment – even as a thought experiment. Imagine that the system is running full-tilt, the application adding data to both databases. Halt the system. Restore one database from backup to one point in time, let’s say five minutes before the time of the halt. Next, restore the other database to a point in time TEN minutes before the halt. If the design implies that that five minute gap in time will cause problems with your data or application, then you are open to severe risk by storing this data under two separate log sequences, which is implied by using two databases.
Second, if your application uses three-part-name, cross-database queries, what are they for? Do they imply that the data across databases really should be part of the same set, or are they legitimately separate? Is there some give and take, in time sequence, between the data in one database and data in the other(s)? I’ve had ISV’s say things like “failover at the instance level is a requirement.” That would be OK, if it were possible. It’s not, not really. “But we would just restore both databases at/to the same time,” is something I’ve heard as a remedy. That isn’t anywhere near good enough, and here are some technical examples showing why:
Backups are Broken
- Full backups are not and cannot be time-consistent across multiple databases. That’s because, in order for full backup to work while a database is online, it incorporates some “slice” of the log file, which is replayed when the database is restored from the backup file, in order to make the data consistent in the database relative to changes that were made during the time the full backup was actually taken. The restore will always be as of a point in time at the end of the time it took to execute the full backup.
- It’s impossible to run full backups of two databases and make them end at the same time. Imagine: one day database 1 takes 5 minutes to back up and database 2 takes 25 minutes and 5 seconds to back up. Each day that time varies. Starting those at the same time obviously makes no difference. They would have to be made to end at the same time, every time, which is impossible.
- Implication: it’s not ever possible to restore multiple databases from full backups to the same point in time.
- One might say, then, that it’s just always a requirement that log backups be used, and any restore manually rolled forward to precisely the same time point. I think that would be fine, for the five percent of organizations who could actually make that happen. Few people actually do that out in the real world. Simple restore from full backup files is basically always an expectation, and while it’s technically possible to use log backup files, there’s real risk in the fact that either people won’t expect to have to do that, or they won’t know how to do that. Why even introduce that risk?
DR is Broken
- Suppose your customer (or your organization) has implemented log shipping or off-site async database mirroring as a disaster recovery solution.
- Crunch: disaster. The app just crashed, the database server is down, and we need to fail over.
- At the DR site, we have some log backup files, or streams of database mirroring log records – different ones for each database. For one database, the log sequence goes up to 5 minutes before the crash. For another database, the last log backup was much bigger, and it didn’t copy over the WAN in time, so that one is 15 minutes behind. Or that database was behind in sending mirror log records due to some large transaction.
- The DBA, as would be reasonable to expect, applies whatever log records are available and brings the DR site copies of the data online, and, because it’s a crisis and she’s busy, she moves on to the 300 other databases under her area of responsibility. The databases are now 10 minutes apart in time. How’s the application look?
The takeaway here: in order for HA/DR, crash recovery and backup features built in to SQL Server to work simply and reliably, your database design must take into account how the log sequence preserves data integrity and consistency in the data, which is easy in the context of one database, and extremely difficult across multiple databases.
Performance and Common Sense Partitioning
So, what about the other side of this issue? There are, of course, performance and administrative advantages to partitioning data out where it’s safe to do so. One basic principle of scalability is to avoid global resources (think one huge table) in favor of partitioned resources (several smaller tables). Here are some cases where it can be safe, and advantageous, to separate sets of data into distinct databases:
- True “reference” information that is loosely coupled to the primary database. I have seen this successfully implemented for “lookup data” that may be aggregated together from other areas of an organization, such as other regions or other organizational functions, that is not coupled to an application database in real time. I have also seen it used effectively for small data marts, data warehouses or reporting databases, where reporting information is copied out of a primary database, asynchronously. If such a database is “behind” the main application database, it doesn’t really matter, and whatever interface exists between the two is elastic enough to accommodate that.
- Audit or application error log information. There are applications where audit or application log data is produced in abundance, i.e. crazy huge tables. It may be safe, if business policies permit, to store this data in its own database with a separate log, enabling different and better administrative and retention policies. Simple example: let’s re-index the transactional database, but maybe let’s not tie up the main application’s database log by re-indexing tens of gigs of log or audit data, where fragmentation of that data isn’t really a concern. Secondarily, perhaps the transaction context for real activity in the main database can be different than the transaction context of audit or error logging so, for example, logging can’t block “real” transactions.
- Multiple ‘tenants’ using the same application. I have some experience (not a huge amount) with multi-tenant scenarios, and in every case I have seen, it’s been destructive to combine multiple tenants literally into a single database, and much more successful to host the tenants side-by-side, each in their own database. There are some administrative challenges to managing a whole collection of databases, but that can be solved with semi-clever code, tools, and rigor around implementation. The scalability problem of one massive DB is much worse. There are several reasons for this:
- The natural partitioning that results from splitting the DBs out (one tenant cares only about her own transactions) seems always to provide better query performance and scalability than co-mingling data from different audiences into the same tables. There is some penalty in procedure cache, because each different database will have distinct query plans, but having the data partitioned, on balance, tends to make the queries simpler and faster. Table and index scans, when required, are automatically restricted to a small set of data that matches the audience, rather than scanning a huge structure and discarding much of the data because it’s not relevant to the client who is performing a query.
- Locking and associated concurrency issues can be much simpler, and the damage contained. One tenant locks a table? No problem outside their own world, if they have their own database, and the lock is likely to be much shorter in duration.
- Tenants frequently want to take data with them when they come and go, and there’s a major advantage in terms of portability and/or archiving if the database for that tenant is physically independent.
- Scale-out to multiple commodity servers clearly is simpler.
- Backup and database maintenance across many small databases can be more asynchronous and less impactful than the same for one massive database.
- The log sequence(s) for each database can be safely and legitimately independent. This makes them smaller, easier to manage, and helps the HA/DR situation.
Conclusion
If you have the chance to make this choice, if it’s not already too late, choose wisely. Consider that disasters do happen, and be certain to choose a single or multiple database design for the right reasons, and in a way that doesn’t introduce risk for your application.