This post is a bit off topic for SQL Server proper, but I thought someone out 'oogling teh inter-tubes might find it useful. Our group recently had to perform a large SAN migration both to new equipment and to a new geographic location. We started with a replication method provided by the SAN vendor. I won't mention who, but from where I sit I can't prevent you from guessing :-). We had to move a lot of files; the great majority made the move intact but, perhaps because we are picky, we needed every single, individual file to be copied successfully. The files are just files, not SQL Server data, but they are important. It came to light that a few got ... what's the technical term ... "whacked" in transit.
File Copy Validation
After the initial copy, we needed a method to validate the files. Since the files are in two physical locations, directly comparing them is impractical, as it would take all the same bandwidth that it took to transmit the files in the first place. So, we cooked up a PowerShell script based on some examples that find duplicate files using a simple MD5 hash, and used the script to output hash values and full paths for each system into two sets of CSV files, each in one of the two locations.
(If you haven't tried PowerShell yet, I have to say I am finding that in the grand scheme of things, the order of good stuff goes "Sliced Bread," then "PowerShell," a very close second.)
From the resulting CSV files, a quick SSIS load into SQL Server tables and it was possible to have SQL do the heavy lifting, finding any files where the full path is the same but the computed hashes don't match. From there, thirty seconds of additional effort made XCOPY commands to re-write the corrupt files from the source system to the destination. We were seaching for relatively few files out of millions, so this was a huge time saver.
This is just a quick and dirty administrative script, so I'm afraid it's not polished code, lacking niceties like error handling or parameters. YMMV, &c.:
#-------------------------------------------------#
# This script will read all files in a directory and subdirectories
# and compute a hash value for the content of each file, storing the
# hash in a text file report. Use to binary-compare sets of files in
# two locations.
#
# Created 8/27/2009 by Merrill Aldrich using logic from this
# duplicate file scanning method:
# http://blogs.msdn.com/powershell/archive/2006/04/25/583225.aspx
#-------------------------------------------------#
# Set directory to scan and output file to store hash values
$searchFolder = "E:\SomeDirectory"
$outFile = "C:\File_Name_Of_your_choice.csv"
$cryptoServiceProvider = [System.Security.Cryptography.MD5CryptoServiceProvider]
$hashAlgorithm = new-object $cryptoServiceProvider
# Header for CSV format output file:
"File,CreateDate,ModDate,Hash" | Out-File $outFile
# Scan each target file, recursively walk subfolders
"Scanning"
[int]$i = 0 # counter, just for user feedback
Get-Childitem -recurse $searchFolder `
| Where-Object { $_.PSIsContainer -eq $false } `
| foreach {
# Show progress every 'x' files
$i += 1
if ( $i%100 -eq 0 ) { $i.toString() + " Files Processed" | out-host }
# compute a hash value for file content
$stream = $null
$hashByteArray = $null
$stream = $_.OpenRead();
$hashByteArray = $hashAlgorithm.ComputeHash($stream);
$stream.Close();
# close the file stream in case of error
trap {
$_.Exception.Message | out-host
if ($stream -ne $null) { $stream.Close(); }
break;
}
$currentHash = [string]$hashByteArray
# append file name, dates, hash to the output text file
"""" + $_.fullname + """," `
+ $_.creationtime.ToString( "s" ) + "," `
+ $_.lastwritetime.ToString( "s" ) + "," `
+ $currentHash `
| Out-File -append $outFile
}
$i.toString() + " Files Processed"
"Scan Complete."
This process has worked well for us. The only speed bump: the standard format for writing date stamps in sortable format, $_.creationtime.ToString( "s" ), makes strings like 2004-05-03T14:07:32, which are perfectly good ISO dates as fas as I know, but SSIS's standard import-wizard-generated packages seem to choke on the "T." To work around that I had to add a little Derived Column / Replace expression into the import packages to remove the "T" character from the date stamps. It's so like SSIS, which I generally like very much, to go 95% of the way but make you work for that last 5%...