Quite often one or more sources for a data warehouse consist of flat files. Most of the times these files are delivered as a zip file with a date in the file name, for example FinanceDataExport_20100528.zip
Currently I work at a project that does a full load into the data warehouse every night. A zip file with some flat files in it is dropped in a directory on a daily basis. Sometimes there are multiple zip files in the directory, this can happen because the ETL failed or somebody puts a new zip file in the directory manually. Because the ETL isn’t incremental only the most recent file needs to be loaded. To implement this I used the simple code below; it checks which file is the most recent and deletes all other files.
Usage is quite simple, just copy/paste the code in your script task and create two SSIS variables:
- SourceFolder (type String): The folder that contains the (zip) files
- DateInFilename (type Boolean): A flag, set it to True if your filename ends with the date YYYYMMDD, set it to false if creation date of the files should be used
Note: In a previous blog post I wrote about unzipping zip files within SSIS, you might also find this useful: SSIS – Unpack a ZIP file with the Script Task
Public Sub Main()
'Use this piece of code to loop through a set of files in a directory
'and delete all files except for the most recent one based on a date in the filename.
'File name example:
'DataExport_20100413.zip
Dim rootDirectory As New DirectoryInfo(Dts.Variables("SourceFolder").Value.ToString) 'Set the directory in SSIS variable SourceFolder. For example: D:\Export\
Dim mostRecentFile As String = ""
Dim currentFileDate As Integer
Dim mostRecentFileDate As Integer
Dim currentFileCreationDate As Date
Dim mostRecentFileCreationDate As Date
Dim dateInFilename As Boolean = Dts.Variables("DateInFilename").Value 'If your filename ends with the date YYYYMMDD set SSIS variable DateInFilename to True. If not set to False.
If dateInFilename Then
'Check which file is the most recent
For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
currentFileDate = CInt(Left(Right(fi.Name, 12), 8)) 'Get date from current filename (based on a file that ends with: YYYYMMDD.zip)
If currentFileDate > mostRecentFileDate Then
mostRecentFileDate = currentFileDate
mostRecentFile = fi.Name
End If
Next
Else 'Date is not in filename, use creation date
'Check which file is the most recent
For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
currentFileCreationDate = fi.CreationTime 'Get creation date of current file
If currentFileCreationDate > mostRecentFileCreationDate Then
mostRecentFileCreationDate = currentFileCreationDate
mostRecentFile = fi.Name
End If
Next
End If
'Delete all files except the most recent one
For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
If fi.Name <> mostRecentFile Then
File.Delete(rootDirectory.ToString + "\" + fi.Name)
End If
Next
Dts.TaskResult = ScriptResults.Success
End Sub