Finally, a real geeky post. I received two questions in as many weeks about what really happens to existing partitions when you alter the underlying partition function. The big question being… does altering a partition function cause any data movement? The answer, as usual, is “It depends”
Most people who work with partitioning are aware that the SWITCH operation does not cause data movement. But SWITCH is an option to ALTER TABLE, not ALTER PARTITION FUNCTION. ALTER TABLE SWITCH … allows you to ‘switch’ the contents of a populated partition with an empty one, and if all the prerequisite conditions are not met, the SWITCH cannot proceed and you’ll get an error.
However, altering a partition function is different. When you alter a partition function with the SPLIT option, you can end up with an empty partition, so many people believe that it’s just a simple operation. But it’s not always.
The ALTER PARTITION FUNCTION statement is deceptively simple:
ALTER PARTITION FUNCTION partition_function_name()
{
SPLIT RANGE ( boundary_value )
| MERGE RANGE ( boundary_value )
} [ ; ]
You can either specify a new boundary value (with SPLIT) or an existing one (with MERGE). Let’s look at SPLIT.
Suppose I have a partition function with boundary points at 1, 10, 100, 1000. Suppose I have a table created on a partition scheme that uses this function, and I have populated the table. If I alter the function with SPLIT RANGE (50), the partition between 10 and 100 will be split into 2. If I already have data between 10 and 50 and other data between 50 and 100, some of the data will have to move. But which data?
Let me create a simple example in the AdventureWorks database. The following statements will create a copy of the Production.Product table and partition it into 5 partitions, with the boundary points suggested above.
USE AdventureWorks;
GO
CREATE PARTITION FUNCTION price_parts (money)
AS RANGE LEFT FOR VALUES ( 1.00,10.00,100.00,1000.00) ;
GO
CREATE PARTITION SCHEME price_scheme
AS PARTITION price_parts
ALL to ([PRIMARY]);
GO
SELECT * INTO NewProduct
FROM Production.Product;
GO
CREATE CLUSTERED INDEX newproduct_pk
ON NewProduct(ProductID)
ON price_scheme (ListPrice);
You can use the $partition reference to see how many rows are in each partition:
SELECT $partition.price_parts(ListPrice), COUNT(*)
FROM dbo.NewProduct
GROUP BY $partition.price_parts(ListPrice);
GO
Or, you can use the sys.partitions catalog view:
SELECT * FROM sys.partitions
WHERE object_id = object_id('dbo.NewProduct');
The output shows me 5 partitions, and the number of rows in each.
Now lets see what happens when I split partition number 3, which contains the rows between 10 and 100.
ALTER PARTITION FUNCTION price_parts()
SPLIT RANGE (50);
GO
Run the sys.partitions query again, and you’ll see 6 rows:
Partitions 3 and 4 together now have 77 rows, which were in the previous partition 3. But how can we tell which rows moved? You need to look at partition_id. Do not confuse partition_number with partition_id. Partition_number values are always consecutive so where we had partition_number values 1-5 before, we now have 1-6. What was partition 3 is now split into 3 and 4, what was partition 4 is now 5 and what was 5 is now 6. Partition_id values are actually tied to the physical storage and do NOT change unless the data moves. So by comparing the two outputs above, you can see that the old partition 5, using partition_id 72057594058571776 is now partition 6, but it still uses partition_id 72057594058571776. For partitions 3 and 4, we can see that partition 4 uses the partition_id of the old partition 3, and the new partition 3 has an entirely new partition_id. It was the data between 10 and 50 that moved.
WHY?
The answer is clearly stated in BOL, in the description of ALTER PARTITION FUNCTION:
Based on boundary_value, the Database Engine splits one of the existing ranges into two. Of these two, the one where the new boundary_value resides is considered the new partition.
This says that whatever partition contains the new value (50, in our case) is the one that moves! And how do we know which partition contains the boundary point? That depends on whether your partition function is defined using RANGE LEFT or RANGE RIGHT. In my case, the function was defined using RANGE LEFT, so any rows with a ListPrice equal to the boundary value go in the partition on the LEFT, that is, with values smaller that the boundary value. So the old partition 3 contained these values:
10 < ListPrice <= 100
The new partition 3 contains these values:
10 < ListPrice <= 50
And the new partition 4 contains these values:
50 < ListPrice <= 100
So what about when we add a new boundary point at the end to create a new empty partition? The maximum ListPrice value is 3578.27, so I’ll add a new boundary point at 5000, after also altering the partition scheme to allow for another filegroup.
ALTER PARTITION SCHEME price_scheme
NEXT USED [PRIMARY];
GO
ALTER PARTITION FUNCTION price_parts()
SPLIT RANGE (5000);
GO
Here is what the new partitions look like:
Note that even though partition 7 is empty, it uses the existing partition_id 72057594058571776. The data in partition 6 was moved to a new physical location with a new partition_id. For 86 rows, it doesn’t take all that long to move the rows, but what if partition 6 had 86 BILLION rows? You may think that adding a new empty partition is a very fast operation, but if your partition function definition requires the existing 86 billion rows to move, it could be a extremely slow operation.
So what happens when you use the MERGE option to ALTER PARTITION FUNCTION? As a former university instructor, it’s really great to be able to say… that’s your homework!
So have fun!
~Kalen