Refactoring tables does not have to be such a big deal. We do not have to shut modifications down, migrate all data into new structure, deploy modified modules, and do it all at once, while the system is down.
Doing all migration at once may be risky, and if we make a mistake, there is no easy rollback to the old version. This is why whenever we need to refactor a large table, we are using an alternative, low-risk, no-downtime, incremental approach. The method I am going to describe has been used in practice several times, and we did not have any problems with it.
All our transitions from old table structure to the new one were smooth and incremental. More to the point, we were able to set aside the migration at any time and switch to some other more important task, or leave for the day, or enjoy the weekend, all while the database was fully functional.
Typical refactoring scenario
I am totally making this up, I have never worked on bug tracking systems.
Suppose that we are storing tickets in a table dbo.Tickets, which, among other columns, has a column named AssignedTo. As such, each ticket can be assigned to only one person at any time.
However, the next version of our system will allow to assign a ticket to more than one person. Our plan is to create another table, dbo.TicketAssignments(TicketID, AssignedTo), migrate existing data to it, and change appr. 50 stored procedures affected by this change. To minimize risks, we are going to finish all database changes at least a few days before the new version of our system is released.
This means that the current version of our system is going to run against our refactored database exactly as it did against the old one. BTW, the table dbo.Tickets is quite large, and is heavily used all the time.
Before the migration. Prerequisites.
We shall need some extra disk space, approximately as much as the old table uses up. Besides, we need a server that is not struggling with its current workload, so that it can withstand some additional work for the duration of our migration.
Also we shall need a good test coverage on all modules using the table, including unit tests, stress tests, and performance baselines. Typically in our system we already have solid test coverage.
One more thing: because table structure is going to change, inserts directly into the table are not going to work any more. As a result, all test data should be populated via stored procedures that continue to work against the new table structure.
For example, we might use a stored procedure dbo.SaveTicket that has a parameter @AssingedTo to populate test data for our unit tests.
Creating new tables, changing modifications.
As our first step, we create two new empty tables:
- dbo.TicketsV2, which has all the same columns as dbo.Tickets, except it does not have AssignedTo column;
- dbo.TicketAssignments(TicketID, AssignedTo)
We also change all the procedures which modify dbo.Tickets, so that they write to both old and new tables.
This biggest risk in this step is introducing concurrency-related problems. We need to stress test our modifications well. If we have any problems, however, we can just run a rollback script, changing all the modifications back to their original version. Such rollback takes just a split second.
In fact, we never actually had any problems at this step, because our stress testing harness is quite solid.
Of course, our modifications get slower, but we have made sure that our hardware can handle it. We had not had actual problems with slow modifications either.
Migrating existing data
At this stage all the reports are running off the old table.
As all new changes get saved into both old and new tables, we are also moving over all existing data to the new structure. We want this migration to be non-intrusive, so we typically just run one WHILE loop, moving over like 1K-10K rows at a time, so that our migration does not hinder OLTP activity and reports.
Modifying the reports.
While the data is migrating, we can take our time changing our stored procedures to read from new tables. Because we have good unit testing coverage, we can refactor procedures with confidence - if we break something, we shall know it right away.
Because not all data has been migrated yet, we do not deploy the modified procedures.
Verifying that migration completed.
To verify that all data migrated correctly, we need to write a SQL query. To my best knowledge, there is no GUI tool that can efficiently compare large tables. Writing a query, however, is not that difficult.
This query is going to use a lot of resources. We need to be careful not to bring the server to its knees while the query is running. There are several ways to accomplish that.
Deploying modified reports.
Once we have complete and correct data in the new tables, we can start deploying new procedures. We do not have to deploy them all at once - we can deploy them five or ten modules at a time, even if some other procedures still access the old table.
There is one more problem our team need to be very careful with - our test server is not exactly identical to our production one. As such, we encounter a risk that our procedure runs fast in test environment, but is slow in production.
This is why we first deploy our changed procedures into a different schema, which (schema) is not exposed to our user. For example, instead of altering procedure dbo.GetTicketsForDay, we create a new procedure Internal.GetTicketsForDay. Only developers have privileges on Internal schema, so users cannot execute it yet. Once we have executed Internal.GetTicketsForDay in production environment and are happy with performance, we can deploy it as dbo.GetTicketsForDay.
Our only risk at this stage is that we can deploy poorly performing procedures. Our rollback strategy is simple - we just roll back to the original stored procedures that read from the old table.
Finalizing the migration.
Once all the reports access the new table, we can change our modifications, so that they no longer write to the old table. The old table can be archived out and dropped, reclaiming the additional storage which we needed for the migration.
As we have seen, we can refactor an OLTP table without downtime and with low risks, even if it is big and data migration takes a lot of time.
One more thing: a common reaction to such Agile war stories is a recommendation to "do it right the first time". It is so common that I would like to address it right now.
Of course, at the time when the previous version of bug tracking system, it was quite obvious that eventually we might need to add the ability to assign a ticket to more than one person. Of course, if we had the table dbo.TicketAssignments from the very beginning, we would not have to go through this complex refactoring. In other words, we should "do it right the first time", should we not?
In general, "doing it right the first time", developing a flexible database structure that we should not have to change later, makes a lot of practical sense, but not under all circumstances.
More specifically, when the previous version of bug tracking system was being designed and developed, there were hundreds of brilliant ideas, lots of great features that might be useful later on. Implementing all these hundreds of brilliant ideas would take years, and a much bigger team. It was just not possible. More to the point, it was not what our customers wanted - they wanted us to take care of their most important problems first, and quickly. They did not want to wait until we could provide more features.
Instead of trying to add as many features as possible and as a result delivering a half baked low quality product, the team concentrated on developing only must-have features, on performance, and on very high quality. As a result, the product had a very limited set of features, but it delivered everything it promised, and it was simple, very fast, and rock solid. As such, it was a success.
It would be quite easy to provide examples of over-engineered projects that tried to "do it right the first time" and miserably failed, but this is beyond the scope of this post...