
Linchi Shea

Checking out SQL Server via empirical data points

Performance impact: how to waste time and space with SqlBulkCopy.WriteToServer()?

If you have some time to kill, you could try to load a tiny amount of data, say ~3GB or 3,000,000 rows, with SqlBulkCopy, but do remember to give the WriteToServer method a very small payload every time it is called. The test results reported in my previous post suggest that once you have started loading, you might as well go for a lunch break.

 

It turns out that wasting space is just as easy. In fact, you can waste a great amount of space simply by calling SqlBulkCopy.WriteToServer with a small payload. All you need to do is direct that payload at a heap table.

 

How much space can you waste? Well, how about SQL Server allocating more than 90GB of storage space in order to import about 3GB of data?

 

I ran a series of data loading tests with SqlBulkCopy against the same customer table used in my previous post. Each test loaded 3,000,000 rows into the customer table without any index. The database was pre-allocated 200GB of space, and the customer table was the only user table. The test data was generated in a C# program and added to a DataTable object, which was then passed to the WriteToServer method to be copied to the customer table on the SQL Server 2008 instance. The data loader program can be configured to run the bulk import either single threaded, or concurrently in any number of threads against multiple non-overlapping ranges.
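To make the setup concrete, here is a minimal sketch of this kind of loader, showing the single-threaded path only. The schema and names (connectionString, cust_id, cust_name) are simplified stand-ins for illustration, not the actual test harness:

using System.Data;
using System.Data.SqlClient;

class Loader
{
    // Bulk load totalRows rows into the customer heap, handing payloadRows
    // rows to each WriteToServer call -- the variable under test.
    static void Load(string connectionString, int totalRows, int payloadRows)
    {
        var table = new DataTable();
        table.Columns.Add("cust_id", typeof(int));
        table.Columns.Add("cust_name", typeof(string));   // assumed schema

        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "customer";

            for (int i = 1; i <= totalRows; i++)
            {
                table.Rows.Add(i, "customer " + i);       // generate a test row

                if (table.Rows.Count == payloadRows)      // the payload is full
                {
                    bulkCopy.WriteToServer(table);        // one small bulk copy batch
                    table.Clear();
                }
            }

            if (table.Rows.Count > 0)                     // flush the final partial payload
                bulkCopy.WriteToServer(table);
        }
    }
}

For the multi-threaded tests, the harness simply runs this kind of loop on several threads, each thread generating rows in its own non-overlapping cust_id range.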

 

The tests were divided into two groups: the first group loaded the data in a single thread, and the second group in 10 threads. Within each group, tests were run at different payload levels. More specifically, I populated the DataTable object with each of the following numbers of rows before calling the WriteToServer method with that object:

 

2, 5, 10, 20, 50, and 100

 

Before presenting the results, let me establish the frame of reference with the results from calling WriteToServer with a payload of 2,000 rows:

 

Payload (# of rows) | Threads | Duration (seconds) | Data (GB) | Reserved (GB) | Unused (GB)
--------------------|---------|--------------------|-----------|---------------|------------
2,000               | 1       | 289                | 2.4       | 3.8           | 1.4
2,000               | 10      | 94                 | 2.4       | 3.8           | 1.4

 

Okay, there was a huge performance difference between loading the data in a single thread versus loading the data in 10 threads. But I’d like to turn your attention to the space usage. At the completion of loading the data, I looked at how much space was actually used by the table data, how much space was reserved by SQL Server for the table, and how much space was allocated to the table but not used.
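For reference, those three numbers are what sp_spaceused reports for a single table as data, reserved, and unused. A quick sketch of pulling them from the loader's C# code, reusing the connectionString assumed earlier:

using System;
using System.Data.SqlClient;

// ...
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("EXEC sp_spaceused 'customer';", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        if (reader.Read())
        {
            // sp_spaceused returns the sizes as strings such as "3891200 KB"
            Console.WriteLine("data:     " + reader["data"]);
            Console.WriteLine("reserved: " + reader["reserved"]);
            Console.WriteLine("unused:   " + reader["unused"]);
        }
    }
}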

 

The above table shows that with payload = 2,000 rows, there was 1.4GB of unused space allocated to the table in both cases.

 

Now take a look at the following table, and pay particular attention to the Reserved and Unused columns at the smallest payload levels:

 

Payload (# of rows) | Threads | Duration (seconds) | Data (GB) | Reserved (GB) | Unused (GB)
--------------------|---------|--------------------|-----------|---------------|------------
2                   | 1       | 10,228             | 12        | 96            | 84
5                   | 1       | 4,074              | 4.8       | 38            | 33.2
10                  | 1       | 2,206              | 2.4       | 19.2          | 16.8
20                  | 1       | 1,267              | 2.4       | 9.6           | 7.2
50                  | 1       | 574                | 2.4       | 3.8           | 1.4
100                 | 1       | 385                | 2.4       | 3.8           | 1.4
2                   | 10      | 1,998              | 10        | 87            | 77
5                   | 10      | 872                | 4.8       | 38            | 33.2
10                  | 10      | 441                | 2.4       | 19            | 16.8
20                  | 10      | 240                | 2.4       | 9.6           | 7.2
50                  | 10      | 135                | 2.4       | 3.8           | 1.4
100                 | 10      | 105                | 2.4       | 3.8           | 1.4

 

The amount of space wasted was staggering when the payload was 2 rows or 5 rows.

 

For instance, SQL Server had to allocate a whopping 96GB in order to load the 3,000,000 rows, or about 3GB worth of data, into the customer table when WriteToServer copied 2 rows at a time. The good news is that this appears to be an anomaly only when the payload is extremely small. As the two tables above show, once the payload reached 50 rows, SQL Server did not need to allocate more than 3.8GB of storage space.

 

The number of threads made little difference in how much space was wasted.

 

The extra space is quite easy to reclaim by creating a clustered index on the table. The potential problem, however, is that you could run out of disk space during the bulk load if you didn't know about this behavior. In addition, if you have not pre-allocated plenty of free space, bulk loading with a small payload can cause the database file to autogrow, further degrading performance.
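As a sketch, the reclaim step is a single DDL statement; the index name and key column here are assumptions for illustration:

using System.Data.SqlClient;

// Building a clustered index rewrites the heap's rows into contiguous,
// densely filled pages, returning the sparsely used extents to the database.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "CREATE CLUSTERED INDEX cix_customer ON customer (cust_id);", conn))
{
    conn.Open();
    cmd.CommandTimeout = 0;   // rebuilding a large table can take a while
    cmd.ExecuteNonQuery();
}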

 

So why would SQL Server waste so much space when the WriteToServer payload is very small? I don't know enough about the internals of SQL Server's space allocation algorithm and/or SqlBulkCopy to offer an answer. But it's revealing to check how the pages were allocated and used in some of these outrageous cases. For instance, when the payload was 2 rows, each database extent had only a single used page on average, and in each used page only about 17% of the space was occupied by customer table rows.
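The page-density side of those numbers can be checked with the sys.dm_db_index_physical_stats DMV; in DETAILED mode it reports page_count and avg_page_space_used_in_percent for the heap (index_id = 0). A sketch along the lines of the earlier snippets:

using System;
using System.Data.SqlClient;

// ...
const string sql =
    @"SELECT page_count, avg_page_space_used_in_percent
      FROM sys.dm_db_index_physical_stats(
           DB_ID(), OBJECT_ID('customer'), NULL, NULL, 'DETAILED')
      WHERE index_id = 0;";   // index_id = 0 is the heap itself

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(sql, conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
            Console.WriteLine("pages: {0}, avg page density: {1:F1}%",
                reader["page_count"],
                reader["avg_page_space_used_in_percent"]);
    }
}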

 

The space usage picture was slightly better when the payload was 5 rows. Only one page per extent was still used, but the average page density was higher, at about 42%.

 

When the payload was bumped up to 50, the used page count per extent increased to 5 and the page density was respectable at about 85%.

 

Whatever the rationale for this behavior may be, SQL Server does not seem to handle space allocation efficiently when the WriteToServer payload is extremely small.

 

The solution? Simply avoid small payloads when using SqlBulkCopy. In my own test environment, that means I don't let the WriteToServer method keep copying fewer than several hundred rows at a time.

Published Saturday, July 02, 2011 11:00 PM by Linchi Shea


Comments

 

Mladen Prajdic said:

Have you tried using the IDataReader input instead of the DataTable?

Have your data generator class implement IDataReader and then yield the generated rows in the Read() method. Then pass your class to the WriteToServer method.

July 3, 2011 9:22 AM
 

Adam Machanic said:

Nice post. This all makes perfect sense if you consider that minimally logged bulk copy logs only extent allocations. When you send two rows at a time, that's not going to fill an entire extent (or probably even a single page in the extent) so you're going to waste a ton of space.

July 3, 2011 2:13 PM
 

Simon said:

I would agree with Adam; your row size is such that you will probably get 6-ish rows per page, so once you go above 42 rows (7x6) you will be filling all the pages in an extent. It would be interesting to try numbers between 40 and 50 to see if you can get the amount of unused space down.

July 4, 2011 4:28 AM
 

Linchi Shea said:

Adam and Simon;

It does make a lot of sense. See my new post on this for more info.

Mladen;

Do you expect using IDataReader would make a difference? It seems to me that IDataReader is just an interface; you still have to provide a specific implementation. And no matter how you implement it, you still have to consider how much payload the WriteToServer() method may have each time it is called, whether implicitly or explicitly. Is that not so?

July 4, 2011 10:50 PM
 

Mladen Prajdic said:

I was thinking more from the perspective of speed than space allocation.

If you implement IDataReader as a streaming class (it has to yield a row in its Read() implementation) and pass it to WriteToServer, you'd be doing direct streamed inserts instead of going batch by batch.

July 5, 2011 6:00 AM
 

Linchi Shea said:

Mladen;

I have not studied the behavior of WriteToServer(IDataReader) carefully yet, but it's something I'll be doing shortly. In my case, I'm generating all the data to be imported. Now, the Read() method would have to be implemented on top of some internal buffer, correct? While the buffer is being 'streamed' out by the WriteToServer() method via IDataReader.Read(), the buffer cannot be populated or 'streamed' in. So it is still a 'batch' to some extent, although one can certainly do things to mask the nature of that 'batch' (e.g. with multiple buffers). Again, I don't have sufficient experience with it, and certainly don't have intimate knowledge of all the critical timings of the behavior, to judge whether this is ultimately superior.

As said, I do plan to run some tests.

July 5, 2011 11:29 AM
 

Mladen Prajdic said:

I was talking about something like this.

I wrote it off the top of my head, so you'll have to tweak it, but I think it gets the point across.

// row that the Get* methods read from
private DataRow _currentRow;

// enumerator over the generated rows; each Read() call advances it one row
private IEnumerator<DataRow> _rowSource;

// IDataReader.Read(): hand WriteToServer the next generated row
public bool Read()
{
    if (_rowSource == null)
        _rowSource = GenerateTestRows().GetEnumerator();

    if (!_rowSource.MoveNext())
        return false;                  // no more test data; the bulk copy stops here

    _currentRow = _rowSource.Current;
    return true;
}

private IEnumerable<DataRow> GenerateTestRows()
{
    while (someConditionForGeneratingTestData)
    {
        /* this hands each row to the caller immediately after creation,
           so there's no buffer to speak of */
        yield return GenerateTestRow();    // build one test row here
    }
}

July 6, 2011 6:06 AM
 

Linchi Shea said:

Mladen;

Any idea how much data will be copied when WriteToServer(IDataReader) is called with this implementation?

July 6, 2011 9:55 AM
 

Mladen Prajdic said:

Everything until the loop exits.

So someConditionForGeneratingTestData should be, for example:

while (numberOfGeneratedRows < 10000)

July 6, 2011 11:17 AM
 

Linchi Shea said:

Ah, I was thinking about a different implementation. I will give this a try. Thanks, Mladen!

July 6, 2011 3:16 PM

