
Alberto Ferrari

Reading zip files with SSIS

Working with SSIS, I often have to read flat files that arrive in zip format. The standard procedure is to launch WinZip, decompress the file, read it with a flat file adapter and then delete the uncompressed file. This behaviour generates a lot of useless I/O on temporary disks, just to decompress a file that will soon be deleted.

Another way of solving the problem is to have a look at the DotNetZip library. A very smart programmer wrote a library that makes opening zip files with .NET very easy. In version 1.6 he also added a great function: OpenReader. You can open a zip file, look at its entries and open a stream reader on any of them.
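As a minimal sketch (the archive path is just a placeholder, and note that the namespace has moved between releases: early 1.x builds shipped it as Ionic.Utils.Zip, later ones as Ionic.Zip), enumerating the entries and opening a reader looks roughly like this:

    Imports System.IO
    Imports Ionic.Zip   ' early 1.x builds: Ionic.Utils.Zip

    Module Demo
        Sub Main()
            ' Open the archive without extracting anything to disk
            Using zip As ZipFile = ZipFile.Read("C:\Data\sales.zip")
                For Each entry As ZipEntry In zip
                    Console.WriteLine(entry.FileName)
                    ' OpenReader returns a stream that inflates on demand
                    Using reader As New StreamReader(entry.OpenReader())
                        Console.WriteLine(reader.ReadLine())  ' first line
                    End Using
                Next
            End Using
        End Sub
    End Module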

Using this technique you can open a stream reader directly on the zip file and then process the stream with a VB source component in your data flow package.

And... yes, you cannot use the flat file adapter. You will need to parse the stream and read strings from it yourself, but you will end up with very little I/O when compared to the unzip / read / delete solution. As I/Os are very expensive, this technique may speed up your package significantly.
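A rough sketch of such a source component follows. It assumes a single output column called Line (a hypothetical name: any name will do) and, again, a placeholder archive path; a real package would read the path from a variable and split each line into typed columns.

    Imports System.IO
    Imports Ionic.Zip   ' or Ionic.Utils.Zip in early 1.x builds
    Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
    Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

    Public Class ScriptMain
        Inherits UserComponent

        Public Overrides Sub CreateNewOutputRows()
            ' Placeholder path; in a real package read it from a variable
            Using zip As ZipFile = ZipFile.Read("C:\Data\sales.zip")
                For Each entry As ZipEntry In zip
                    ' Stream straight out of the archive: no temp file on disk
                    Using reader As New StreamReader(entry.OpenReader())
                        Dim line As String = reader.ReadLine()
                        While line IsNot Nothing
                            Output0Buffer.AddRow()
                            ' "Line" is our hypothetical output column; a real
                            ' package would parse the string into typed columns
                            Output0Buffer.Line = line
                            line = reader.ReadLine()
                        End While
                    End Using
                Next
            End Using
        End Sub
    End Class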

Published Wednesday, July 16, 2008 7:25 PM by AlbertoFerrari

Comments

Bart Czernicki said:

Good solution for people who haven't used this. In some of our Business Intelligence loads we prefer to extract the zip file first: extracting a large file (> 500 MB) into memory isn't always a great idea. However, you are right that if you use this you will get huge performance gains in terms of I/O, and you can bypass what is usually the slowest part of the server... the hard disk(s)/SAN.

July 16, 2008 3:36 PM
 

AlbertoFerrari said:

Bart, the good news is that OpenReader does not decompress the file in memory. It opens a stream that decompresses chunks of the file at each read request.

This is exactly why I was interested in investigating this solution: you can avoid both I/O AND memory pressure, at the cost of some coding.
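To illustrate the point, a minimal sketch (reusing the entry variable from the snippet in the post): each call to Read inflates just one buffer's worth of data, so memory use stays flat no matter how big the file is.

    Using input As Stream = entry.OpenReader()
        Dim buffer(65535) As Byte   ' 64 KB working buffer
        Dim bytesRead As Integer = input.Read(buffer, 0, buffer.Length)
        While bytesRead > 0
            ' hand this chunk to the parser, then pull the next one
            bytesRead = input.Read(buffer, 0, buffer.Length)
        End While
    End Using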

We are using it to decompress 800 MB files without any significant memory consumption.

Alberto

July 16, 2008 4:15 PM
 

Noral Kuhlmann said:

Did you mean v1.5? I cannot find a reference to v1.6.

Thank you

July 16, 2008 6:28 PM
 

AlbertoFerrari said:

Noral, I forgot to say that v1.6 is still in beta, even though it is already available. You can download the sources and compile the library in order to run some tests.

July 17, 2008 3:14 AM
 

Bart Czernicki said:

Alberto,

In your example it will be an interesting experiment (for me) to see how some of the C# iterator extensions perform in memory. I have gotten huge performance gains that way (as they don't read the whole thing into memory).

Using LINQ with this (i.e. LINQ to CSV) would probably give a usability boost (e.g. for the kind of data profiling that is not available until SQL 2008).

http://www.codeproject.com/KB/linq/LINQtoCSV.aspx

July 19, 2008 11:02 PM
 

Robert said:

Thanks Alberto. We deal with very large flat files (around 20 GB compressed, 170 GB uncompressed) and are trying to avoid the I/O and increase performance. Do you have any references for implementing this in a data flow source component?

August 25, 2009 8:13 PM
 

Chris said:

Any chance you have a sample of this? I.e. from zipped flat file to script task? I just need to find the best way to get this done.

November 30, 2009 4:19 AM


About AlbertoFerrari

Alberto Ferrari is a Business Intelligence consultant. His interests lie in two main areas: BI development lifecycle methodologies and performance tuning of ETL and SQL code. His main activities are with SSIS and SSAS for the banking, manufacturing and statistical sectors. He is also a speaker at international conferences such as the European PASS Conference and the PASS Summit.