THE SQL Server Blog Spot on the Web


Merrill Aldrich

Technical content about Microsoft data technologies. All opinions expressed are purely my own and do not reflect positions of my employer or associates.

Hexagonal Close Packing for your Fact Data

Disclaimer: It's likely you don't really want to do what I am proposing in this post. This is just an esoteric defrag process, which won't really deliver much performance benefit but carries a fair amount of risk. I present it purely as a curiosity. To paraphrase Scott Hanselman, you might just want to run away at this point, as I can only guarantee that this is an idea, not a good idea. Modest fragmentation has a minor effect on most systems, and what follows is definitely overkill. And it might not even work.

So, suppose we have a textbook fact-dimensional data warehouse, having fact tables that are partitioned into file groups, and hence files, based on chronology. A typical structure might be a Primary data file with dimension tables, and one Fact file for every month of fact data. It struck me that it would be interesting to see if I could "close pack" the data in the older fact files, so that there's as little fragmentation as possible AND as little unused space in the files as possible. This turns out to be a bit of a stunt, because, as we know:

  • Reindexing within the same files requires work space for an index to be copied; that implies that each typical fact data file probably has a quantity of empty space a little larger than the largest object stored in the file, and perhaps more. In aggregate, that could be quite a lot of space for a large warehouse. You cannot reindex into the same file and end up with a packed file, nor can you guarantee that a reindex will end with the empty space at the end of the file.
  • Shrink is out of the picture, because, if it's run with anything other than Truncate, Shrink causes spectacular fragmentation as it moves data from the end of a file to the beginning.

Here's the best solution I've been able to come up with so far:

  1. Generate ALTER DATABASE statements that will add a file group and a new, empty fact file for every existing fact file.
  2. Factor in the output of DBCC SHOWFILESTATS, which contains the amount of used space in each existing file, so that the new files can be created at the right size, preventing OS-level fragmentation.
  3. Make new partition scheme(s), mapping to the new, empty files.
  4. Recreate all partitioned indexes from the "old" fact files into the new, empty files by running CREATE INDEX with DROP EXISTING using the new partition scheme(s). This also gives the opportunity to add FILLFACTOR=100, if the fillfactor on the existing indexes was set lower for some reason. (This is old, read-only data, and we want it to be as dense as possible.)
  5. Drop the old files, partition scheme(s) and filegroups.
  6. Rename the new partition scheme(s) with the name of the old one(s).
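The arithmetic behind step 2 is worth spelling out: DBCC SHOWFILESTATS reports space in 64 KB extents, so the used space in megabytes is usedextents * 64 / 1024, rounded up to a whole megabyte for the new file's SIZE. A minimal sketch of that calculation (Python here purely for illustration):

```python
import math

# DBCC SHOWFILESTATS reports space in 64 KB extents; the target size for
# each new file is the used space, rounded up to a whole megabyte.
def target_size_mb(used_extents: int) -> int:
    return math.ceil(used_extents * 64 / 1024)
```

For example, a file with 480 used extents would get a new file created at 30 MB.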

This would be very unpleasant to script manually, but I discovered it was great practice for two different scripting techniques: generating SQL using SQL, by composing statements in the output of SELECT, and generating scripts with PowerShell and SMO. The first is a quick-and-dirty technique, if a little error prone, and the second is like magic once you get the knack of PowerShell.

First, we need to make ALTER DATABASE statements that will add file groups and then files to the database. There might be tens or hundreds of them, so a little hack-o-mation is helpful. The ALTER DATABASE statements are simple and repetitive, so this is a case where generating them with T-SQL is practical. We'll need a list of the existing files, and the amount of space used in each, which can be had with a combination of system views and DBCC SHOWFILESTATS:

CREATE TABLE #tmpspc (
    fileid int,
    filegroup int,
    totalextents int,
    usedextents int,
    name sysname,
    filename nchar(520)
) ;

INSERT INTO #tmpspc
    EXEC ('DBCC SHOWFILESTATS') ;

SELECT  g.name AS filegroupname,
        usage.usedextents * CONVERT(float, 64) / 1024 AS usedMB
FROM    sys.filegroups g
        INNER JOIN sys.database_files f ON f.data_space_id = g.data_space_id
        INNER JOIN #tmpspc usage ON f.physical_name = usage.[filename]

In our warehouse, all the fact file groups are named 'FACTDATA_<yyyymm>', so the above results can be filtered by file group name. Next, some string manipulation can create all the ALTER DATABASE statements needed to add a parallel set of file groups and files alongside the existing ones. REPLACE() below is set up to derive the new names from the existing names:

SELECT  g.name AS filegroupname,
        'ALTER DATABASE myDB ADD FILEGROUP '
        + QUOTENAME(REPLACE(g.name, 'FACTDATA_', 'FACTDATA1_')) + ';' AS filegroupcreate,
        'ALTER DATABASE myDB ADD FILE ( NAME = N'''
        + REPLACE(f.[name], 'FACTDATA_', 'FACTDATA1_') + ''', '
        + 'FILENAME = N''' + REPLACE(f.physical_name, 'FACTDATA_',
                                     'FACTDATA1_') + ''', ' + 'SIZE = '
        + CAST(CEILING(( usage.usedextents * CONVERT(float, 64) / 1024 )) AS varchar(1000))
        + 'MB , ' + 'FILEGROWTH = 10240KB ) ' + 'TO FILEGROUP '
        + QUOTENAME(REPLACE(g.name, 'FACTDATA_', 'FACTDATA1_')) + '; ' AS filecreate
FROM    sys.filegroups g
        INNER JOIN sys.database_files f ON f.data_space_id = g.data_space_id
        INNER JOIN #tmpspc usage ON f.physical_name = usage.[filename]

I call this "hack-o-mation," because it's in the same family as "automation," only it's not really safe or consistent. It's like automation's unreliable cousin. If you generate SQL code like this, it's pretty easy to make a mistake, so you should review the results very carefully before executing. Anyway, we get, in my case, a little over 200 statements like:


    ALTER DATABASE myDB ADD FILE ( NAME = N'FACTDATA1_200101_File1', 
        FILENAME = N'F:\MSSQL.1\SQLData\myDB_FACTDATA1_200101_File1.ndf', 
        SIZE = 30MB , FILEGROWTH = 10240KB ) TO FILEGROUP [FACTDATA1_200101] ;
    ALTER DATABASE myDB ADD FILE ( NAME = N'FACTDATA1_200102_File1', 
        FILENAME = N'F:\MSSQL.1\SQLData\myDB_FACTDATA1_200102_File1.ndf', 
        SIZE = 13MB , FILEGROWTH = 10240KB ) TO FILEGROUP [FACTDATA1_200102] ;
    ALTER DATABASE myDB ADD FILE ( NAME = N'FACTDATA1_200103_File1', 
        FILENAME = N'F:\MSSQL.1\SQLData\myDB_FACTDATA1_200103_File1.ndf', 
        SIZE = 17MB , FILEGROWTH = 10240KB ) TO FILEGROUP [FACTDATA1_200103] ;
    ALTER DATABASE myDB ADD FILE ( NAME = N'FACTDATA1_200104_File1', 
        FILENAME = N'F:\MSSQL.1\SQLData\myDB_FACTDATA1_200104_File1.ndf', 
        SIZE = 16MB , FILEGROWTH = 10240KB ) TO FILEGROUP [FACTDATA1_200104] ; 
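The T-SQL generator above is nothing but string concatenation, so the same derivation can be sketched in any language. Here's a Python illustration of how one of those statements is assembled (the database, file, and filegroup names are hypothetical, modeled on the sample output):

```python
import math

# Illustration only: mirrors the REPLACE() and CEILING() logic of the
# T-SQL generator above.
def file_create_stmt(db, filegroup, logical_name, physical_name, used_extents):
    new = lambda s: s.replace('FACTDATA_', 'FACTDATA1_')  # derive new names
    size_mb = math.ceil(used_extents * 64 / 1024)         # 64 KB extents -> MB
    return ("ALTER DATABASE %s ADD FILE ( NAME = N'%s', FILENAME = N'%s', "
            "SIZE = %dMB , FILEGROWTH = 10240KB ) TO FILEGROUP [%s] ;"
            % (db, new(logical_name), new(physical_name), size_mb, new(filegroup)))

print(file_create_stmt('myDB', 'FACTDATA_200101', 'FACTDATA_200101_File1',
                       r'F:\MSSQL.1\SQLData\myDB_FACTDATA_200101_File1.ndf', 480))
```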

Once we have the logic to create the new files, we need a new partition scheme, which will take the existing partition function for the fact tables (the function contains all the boundaries where one partition ends and the next begins) and map it onto those new files. This can also be derived with T-SQL and a query against system views, to make a new partition scheme from an existing one. Again, a simple replace function substitutes the new file group names for the existing ones:

DECLARE @newPartScheme varchar(max) ;
DECLARE @fileGroupList varchar(max) ;

SET @fileGroupList = '' ;
SELECT  @fileGroupList = @fileGroupList + REPLACE(sf.name, 'FACTDATA_',
                                                 'FACTDATA1_') + ', '
FROM    sys.partition_schemes AS sps
       INNER JOIN sys.partition_functions AS pf ON sps.function_id = pf.function_id
       INNER JOIN sys.destination_data_spaces AS ds ON ds.partition_scheme_id = sps.data_space_id
                                                       AND ds.destination_id <= pf.fanout
       INNER JOIN sys.filegroups AS sf ON sf.data_space_id = ds.data_space_id
WHERE   sps.name = N'FileGroupsByDateMonthly'
ORDER BY ds.destination_id ASC ;
SET @fileGroupList = LEFT(@fileGroupList, LEN(RTRIM(@fileGroupList)) - 1) ;
SET @newPartScheme = 'CREATE PARTITION SCHEME [FileGroupsByDateMonthly1] '
   + ' AS PARTITION [PartitionByDateMonthly] TO ( ' + @fileGroupList + ' );' ;
SELECT  @newPartScheme ;            

This gives a single, giant statement like:

CREATE PARTITION SCHEME [FileGroupsByDateMonthly1]  
AS PARTITION [PartitionByDateMonthly] TO ( 
    FACTDATA1_200101, FACTDATA1_200102, FACTDATA1_200103, ... 
) ;
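The accumulate-then-trim pattern in the T-SQL (concatenate each name plus a comma into @fileGroupList, then use LEFT and LEN to cut the trailing comma) is a common string-aggregation trick. For comparison, a language with a join operation sidesteps the trimming entirely; a Python sketch, with hypothetical filegroup names:

```python
def partition_scheme_stmt(old_filegroups, new_scheme, partition_function):
    # ', '.join() avoids the trailing-comma trim the T-SQL does with LEFT/LEN.
    fg_list = ', '.join(fg.replace('FACTDATA_', 'FACTDATA1_')
                        for fg in old_filegroups)
    return ('CREATE PARTITION SCHEME [%s] AS PARTITION [%s] TO ( %s );'
            % (new_scheme, partition_function, fg_list))
```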

Now we have new FACT files, and the means to direct data into those files. The next bit is a little trickier, and we'll change scripting techniques from T-SQL "string builder" to PowerShell + SMO. This is because I need to generate CREATE INDEX statements for all the tables that the partition scheme uses, and it's just impractical to make them by building SELECTs against system views. (Technically, all of this work, including the scripting above, could be done in PowerShell.) One added difficulty: if we want this to be efficient, we have to drop and build the indexes in a specific order:

  1. Drop Non-Clustered Indexes
  2. Drop Clustered Index
  3. Create New Clustered Index
  4. Create New Non-Clustered Indexes
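The ordering matters because the nonclustered indexes depend on the clustered index: dropping the clustered index first would force SQL Server to rebuild every nonclustered index against the resulting heap, and creating it afterward would rebuild them yet again. Dropping nonclustered first and creating them last means each index is built exactly once. The sequencing logic is trivial, but here it is as a sketch (Python for illustration only):

```python
def reindex_order(indexes):
    """indexes: list of (name, is_clustered) pairs. Returns the plan:
    drop nonclustered, drop clustered, create clustered, create nonclustered."""
    nc = [n for n, clustered in indexes if not clustered]
    cl = [n for n, clustered in indexes if clustered]
    return [('DROP', n) for n in nc + cl] + [('CREATE', n) for n in cl + nc]
```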

So, we need a PowerShell technique that will locate all the partitioned tables, then script out Drop and Create statements for those in the correct order. Buck Woody has published a nice, concise sample of how to script objects with PowerShell. I used that as a starting point, and cooked up a script that works like this:

  • Create an instance of SMO server connected to the database server
  • Get the appropriate partitioned tables for the database in a collection
  • Loop through those, and for each table
    • Script DROP statements in the right order
    • Script CREATE statements in the right order, replacing the old partition scheme with the new one
  • End Loop

Here's the script:

# Create a connection to the server and choose a database
[System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SqlServer.SMO") | out-null 
$sqlserver = New-Object ('Microsoft.SqlServer.Management.Smo.Server') '<servername>'
$db = $sqlserver.Databases['<myDB>']

# Names of existing and new partition schemes. New will replace existing in script output:
$oldPartitionScheme = 'FileGroupsByDateSKMonthly'
$newPartitionScheme = 'FileGroupsByDateSKMonthly1'

# Get the tables that use the old partition scheme
$partitionedTables = $db.tables | Where-Object -filter { $_.PartitionScheme -eq $oldPartitionScheme }

# For each partitioned table, script out drop and then create statements for partitioned indexes
# in order drop non-clustered, drop clustered, create clustered, create non-clustered:
$scripter = New-Object ('Microsoft.SqlServer.Management.Smo.Scripter') ($sqlserver)

# Set scripting options here as needed
# $scripter.Options.<someoption> = <somevalue>
$partitionedTables | foreach {
    # Note current table in output:
    ''
    '/***  ' + $_.Schema + '.' + $_.Name + '  ***/'
    ''
    $partitionedIndexes = ( $_.indexes | Where-Object -Filter { $_.IsPartitioned -eq $True 
               -and $_.PartitionScheme -eq $oldPartitionScheme } )

    #script drop statements
    $scripter.Options.ScriptDrops = $True
    #script nonclustered indexes
    $partitionedIndexes | Where-Object -Filter { $_.IsClustered -eq $False } | foreach {
        $scripter.Script( $_ )
        'GO'
    }
    #script clustered indexes
    $partitionedIndexes | Where-Object -Filter { $_.IsClustered -eq $True } | foreach {
        $scripter.Script( $_ )
        'GO'
    }

    #script create statements
    $scripter.Options.ScriptDrops = $False
    #script clustered indexes
    $partitionedIndexes | Where-Object -Filter { $_.IsClustered -eq $True } | foreach {
        $indexCreateScr = $scripter.Script( $_ )
        $indexCreateScr = $indexCreateScr -replace $oldPartitionScheme, $newPartitionScheme
        $indexCreateScr = $indexCreateScr -replace 'SORT_IN_TEMPDB = OFF', 'SORT_IN_TEMPDB = ON'
        $indexCreateScr
        'GO'
    }
    #script nonclustered indexes
    $partitionedIndexes | Where-Object -Filter { $_.IsClustered -eq $False } | foreach {
        $indexCreateScr = $scripter.Script( $_ )
        $indexCreateScr = $indexCreateScr -replace $oldPartitionScheme, $newPartitionScheme
        $indexCreateScr = $indexCreateScr -replace 'SORT_IN_TEMPDB = OFF', 'SORT_IN_TEMPDB = ON'
        $indexCreateScr
        'GO'
    }
}

If the stars align, this PowerShell script will spew forth a long T-SQL script which you can pipe out to a text file, containing all the code to reconstruct every index on the new partition scheme. There are a few pitfalls, the main one being that the new partition scheme is inserted with a simple text replacement, so it's important that the names of the new and old partition schemes be unique enough for the replace to work cleanly. Check it over carefully and test it before you hit anything important with it.
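One way to make that text replacement safer, given that here the old name is a prefix of the new one, is to anchor it with a word boundary rather than a plain substring match, so re-running the substitution over already-converted text does nothing. A sketch (Python for illustration; the scheme names are the ones from the script above):

```python
import re

oldScheme = 'FileGroupsByDateSKMonthly'
newScheme = 'FileGroupsByDateSKMonthly1'

def swap_scheme(sql):
    # \b prevents the pattern matching inside the new name (old is a
    # prefix of new), so the substitution is idempotent.
    return re.sub(re.escape(oldScheme) + r'\b', newScheme, sql)
```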

The only additional effort to the data file compaction is creating a similar collection of file and filegroup dropping ALTER DATABASE scripts, which can be derived with one of the two techniques demonstrated above, and I'll leave that as an exercise for the reader.

In conclusion: this is a method that will compact the fact data into clean, small, defragmented files. The need for that is debatable. But, should you want to fool with it, it's a great way to hone your skills with partitioning, indexing, and techniques for script generation.

Published Sunday, August 9, 2009 9:00 PM by merrillaldrich
