THE SQL Server Blog Spot on the Web


Joe Chang

Column Store, Parallelism and Decimal

SQL Server performance has the interesting property that no matter how sound and logical an idea of how it might behave in one scenario compared to another, what actually happens can be very different. Presumably the reason is that what happens inside the SQL Server engine is very complex, with very many elements supporting even a basic query, including steps we are not even aware of. On top of this, the modern microprocessor is also very complicated, with radically different characteristics depending on whether an address is in (processor L2) cache or a memory round-trip is necessary, and then whether it is a local or remote node memory access, not to mention the implications of cache-coherency checks.

With this in mind, some aspects of the Decimal/Numeric data type are of interest. There have been previous discussions of the fact that the Decimal data type is more expensive than integer or float, with an impact that can be on the order of 2-3X, enough to become a serious consideration in queries that aggregate very many rows and columns of the Decimal data type. It is easy enough to understand why: the integer and floating point data types can be executed directly by the microprocessor, while decimal must be handled in software, which should mean an overhead much higher than 2-4X. The explanation is that even the simple matter of accessing a column in the page-row storage organization of traditional database engines involves a series of address offset calculations for which the modern microprocessor cannot prefetch from memory far enough in advance to keep its execution pipeline filled.

If this is indeed the case, then one would expect the difference between integer and float compared to decimal to have a far larger impact in column store indexes, introduced as nonclustered in SQL Server 2012 and as clustered in the upcoming 2014. The whole point of column store is to access memory sequentially to fully utilize the capabilities of modern microprocessors, which emphasize bandwidth-oriented access over serialized round-trip memory accesses. In any performance investigation, it is always very helpful to first build a baseline with non-parallel execution plans. This is because parallel execution introduces a whole new set of variables that can complicate the assessment procedure. Of course, with such sound and logical reasoning, the outcome is inevitably the unexpected, hence the opening paragraph.

It would seem that the SQL Server engine follows a completely different code path for operations on column store indexes depending on whether the execution plan is non-parallel or parallel. But it turns out that this occurs in SQL Server 2014 CTP2 and not in SQL Server 2012 SP1, so it is possible the unexpected behavior will not occur in the 2014 release version.

Test System

The test system is a Dell PowerEdge T110II with 1 Xeon E3-1240 (Sandy Bridge) 4-core, 3.3GHz nominal (3.7GHz in Turbo) processor with Hyper-Threading enabled, 32GB memory, and storage on 8 SATA SSDs in RAID 0 attached to a LSI 9260 RAID controller. The operating system is Windows Server 2012 R2, and SQL Server version 2014 CTP 2.

The database was populated using the TPC-H data generator (dbgen) to produce a SF10 data set. This puts 59.986M rows in the Lineitem table, which would have been 10GB using the 8-byte datetime data type but is 8GB with the 4-byte date data type. The index keys are different from the TPC-H kit, and the tests here are not represented as conforming to TPC rules for official results.

Four Lineitem tables were created, all at SF 10. Two use the conventional page/row storage, and the other two use clustered columnstore indexes. The conventional tables were not compressed, while columnstore indexes are compressed by design, with no option otherwise. For each type of storage, one table has 4 columns of type float (8 bytes), and the other has the same 4 columns declared as decimal(18,6) at 9 bytes, NOT NULL in both cases.
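As a sketch, the decimal variants would have been created along these lines (table and column names here are assumed for illustration, not taken from the actual test scripts; the float variants simply replace decimal(18,6) with float):

```sql
-- Page/row variant: conventional table with four decimal measure columns.
CREATE TABLE LineitemDec (
    L_OrderKey      bigint        NOT NULL,
    L_Quantity      decimal(18,6) NOT NULL,
    L_ExtendedPrice decimal(18,6) NOT NULL,
    L_Discount      decimal(18,6) NOT NULL,
    L_Tax           decimal(18,6) NOT NULL,
    L_ShipDate      date          NOT NULL
);

-- Columnstore variant (SQL Server 2014 CTP2 syntax): same columns,
-- then converted to a clustered columnstore index.
CREATE TABLE LineitemDecCS (
    L_OrderKey      bigint        NOT NULL,
    L_Quantity      decimal(18,6) NOT NULL,
    L_ExtendedPrice decimal(18,6) NOT NULL,
    L_Discount      decimal(18,6) NOT NULL,
    L_Tax           decimal(18,6) NOT NULL,
    L_ShipDate      date          NOT NULL
);
CREATE CLUSTERED COLUMNSTORE INDEX ccsi_LineitemDecCS ON LineitemDecCS;
```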

The conventional Lineitem table averages 139 bytes per row, or 59 rows per page, with the 8-byte float, and 143 bytes per row, or 57.25 rows per page, with the 9-byte decimal. The tables with clustered columnstore indexes averaged 44.1 and 45.67 bytes per row for float and decimal respectively. The columnstore indexes were about 2.5GB versus 8GB for the conventional tables.

Previous tests have shown that there is no difference between int/bigint, money and float, as all are natively supported by the processor hardware. From the table definition, the four float/decimal columns are adjacent and should be within 64 bytes of the row header, meaning all four values likely fall on the same cache line.

One additional note is that this report is a quick response to a question concerning decimal overhead. I did not have time to set up rigorous measurements averaged over 100 executions and follow up with an examination of anomalies. All measurements here are based on a few runs.

Page/Row and ColumnStore Performance with Parallelism

The basic test case is a simple aggregate of 1-4 of the float or numeric columns along with a count, referenced against a count-only query. For the page/row tables, a clustered index (table) scan is forced. There is no point in forcing a scan on a columnstore index, due to the nature of column storage. Performance data is collected from sys.dm_exec_query_stats. An attempt was made to ensure data was in memory prior to each measurement, but some columnstore accesses generated a small amount of disk IO.
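Concretely, one step of the test sequence looked something like the following (column names and hints are illustrative; the exact scripts are not reproduced here):

```sql
-- 2-column aggregate case against the page/row decimal table,
-- forcing a clustered index scan and a non-parallel plan.
SELECT COUNT(*), SUM(L_Quantity), SUM(L_ExtendedPrice)
FROM   LineitemDec WITH (INDEX(1), FORCESCAN)
OPTION (MAXDOP 1);

-- CPU per row: sys.dm_exec_query_stats reports worker time in microseconds,
-- so multiply by 1000 and divide by rows scanned (59.986M at SF10) for ns/row.
SELECT st.text,
       qs.total_worker_time * 1000.0
         / (qs.execution_count * 59986052) AS cpu_ns_per_row
FROM   sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.last_execution_time DESC;
```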

Below is the CPU in nanoseconds per row for the four cases at DOP 1. The DMV reports worker time in micro-seconds, so that value was multiplied by 1000 to get nanoseconds.

The CPU nominal frequency is 3.3GHz, but for single-threaded operations it could very well be running at the turbo frequency of 3.7GHz, somewhat more than 3 cycles per ns. The cost of the Count-only operation, forcing a scan of the entire 8GB table (but touching neither the float nor the decimal columns), is about the same for both conventional tables at 58.8 and 60.15 ns per row respectively, probably reflecting the slight difference in table size (3%).

The true cost structure of a SQL Server table (clustered index) scan consists of a cost for each page accessed, each row within a page, and each column, with the first column accessed typically costing more than subsequent columns, perhaps a higher cost if a subsequent column is not on a previously accessed cache line, and perhaps higher again for non-fixed-length columns, which involve a more complicated address calculation.

In previous reports, I have cited the page access cost as in the 650-750 CPU-ns range for Sandy Bridge generation processors. So about 10ns of the average row cost cited above is amortizing the page access cost (for just under 60 rows per page).

Below are the same test data, but showing incremental cost of each additional column accessed and aggregated. The Count value is the same as above because it is the baseline operation.

Notice that the incremental cost for the first column aggregated (1SUM) is higher than the subsequent columns. It is strongly evident that decimal aggregation is much more expensive than the float type (and other tests show float to be the same as int and money).

The reason we cannot put a specific value on the difference is that the cost structure of the complete operation has page, row and column components, of which the int/float/decimal difference only involves the last. In addition, the number of columns of each type also affects the differential.

Below is the cost per row in CPU-ns of the Count query, with the two conventional tables on the left and the two columnstore indexes on the right, at DOP 1 to 8. The system under test has 4 physical cores with HT enabled. SQL Server correctly places threads on separate physical cores when DOP allows, but the DOP 8 test forces both logical processors on each core to be used. It is also clear in the columnstore tests that something is very peculiar: CPU per unit of work is not supposed to decrease from DOP 1 to 2. There are certain cases where this does happen, an example being a hash join where the parallel plan has a bitmap filter, which is not employed (per rule) in non-parallel plans. That is not the case here, and a test on SQL Server 2012 shows the expected performance advantage for columnstore at all DOP levels.

Below is the rows per second for the Count query. This is the inverse of elapsed time and is better for demonstrating scaling with DOP. The vertical axis is log scale in base 2, to better distinguish 2X steps. The scaling with DOP is not particularly good in the DOP 1-4 range, with each thread on a separate physical core. This appears to be genuine, as CPU/row rises only moderately with parallelism up to DOP 4. This query does almost no work other than page access, so it is possible there is contention somewhere in buffer pool management.

Perfect scaling would be a doubling of performance for each doubling of DOP (with each thread on a separate physical core), an example being going from 16 to 32 rows/µs on the vertical scale from DOP 1 to 2. An indicator of the quality of the measurement is the ratio of worker time to elapsed time; in a perfect situation, this would be equal to the DOP. At DOP 4, the ratio is unexpectedly low at 2.8. Very good scaling is normally expected when parallelism is over separate physical cores. Here the scaling in that range is poor, but appears to be great when both logical processors on each core are active at DOP 8. The sharp rise in CPU per row from DOP 4 to 8 is indicative of this aspect of HT. Had the DOP 4 measurement shown the correct worker/elapsed ratio closer to 4, there would have been only a moderate increase in performance from DOP 4 to 8.

Below is the single column SUM query cost per row (in ns) for the two conventional tables on the left and the two columnstore tables on the right. The cost difference between float and decimal in the conventional tables is now evident though not large. It is much more significant in the columnstore tables, as expected.

Below is the rows per second for the single column SUM query.

Below is the two column SUM query cost per row for the two conventional tables on the left and the two columnstore tables on the right. There is a larger difference between float and decimal in the conventional tables compared to the single column query. This is expected, as more of the work is in column operations relative to the page and row access. The difference on the columnstore tables is smaller than in the single column case, and this was not expected.

Below is the rows per second for the two column SUM query.

Below is the 3 column SUM query cost per row for the two conventional tables on the left and the two columnstore tables on the right.

Below is the rows per second for the 3 SUM test.

The graphs below show the query cost per row for Columnstore access with 1, 2 and 3 columns, the same as before, except with the DOP 1 value truncated.


Aspects of the Decimal data type cost relative to float have been examined, noting that float has elsewhere been observed to be the same as int and money. The overhead can be as large as 2-3X in conventional page/row tables, depending on the ratio of work between page and row access relative to column aggregation. In queries that access a small to even moderately large number of decimal values (rows × columns), the higher cost of decimal should not be significant. In large data warehouse queries that access ten million values, it will be noticeable, and it probably starts to get painful at the hundred-million scale.

It was expected that the Decimal data type cost would be much higher in columnstore indexes, as the row access costs are greatly reduced, magnifying the column contribution. First, the problem in SQL Server 2014 CTP2 with columnstore clustered indexes in non-parallel execution plans was observed; it is hoped that this will be resolved in RTM. The expected large impact was observed on queries aggregating a single decimal column, but the trend decreased unexpectedly for 2 and 3 columns aggregated. SQL Server performance characteristics are not always in line with expectations, no matter how sound and reasonable the logic is.

Published Saturday, March 01, 2014 9:19 AM by jchang




Greg Low said:

Nice summary Joe!

What does concern me whenever I see direct comparisons between using float and decimal though, is that I mostly want to use decimal as I really do want accurate data, particularly when dealing with monetary values, not approximate data as provided by float.

The two are far from interchangeable.

March 1, 2014 7:51 PM

Adam Machanic said:


I've been working on a lot of financial systems for the past 8 years or so, and while I used to be dogmatic about using DECIMAL for precision, my experience working on these systems has actually led me to the opposite point of view: I now favor FLOAT in most cases, since it's both smaller and performs better. As it turns out, except in accounting systems, a bit of lost precision is usually quite alright.

Most of the systems I work with are concerned with whole dollar amounts in the millions or billions, and losing a cent here and there is absolutely no big deal. The same was even true of a financial reconciliation database I worked on a few years ago. We changed everything to FLOAT (from DECIMAL(38,12) IIRC), re-ran all of the historical reports, and discovered that in only a tiny number of cases did the final numbers change at all. And those cases were not a problem for the business.

So I do think they're relatively interchangeable. Which is not to say that it's not extremely important both to test and to ask the right questions of the right people when making data type choices! DECIMAL certainly has its place. But I think a lot of people tend to make way too big a deal about precision even in cases when it really doesn't make all that much of a difference.

March 1, 2014 11:39 PM

jchang said:

My report here compares decimal and float compute costs, without getting into the differences between the two data types. I do stress I had previously determined that float and money had identical performance characteristics.

As to the double precision 8-byte float type, the accuracy is 15.95 digits, which is nearly 1 in 10 quadrillion, so even in cents that is a really big monetary value, nearly a full order of magnitude larger than the US debt (but this is getting closer).

I am inclined to think the more likely source of discrepancy stems from the fact that 0.1 cannot be represented exactly in binary (representation of float).
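A quick illustration of that point in T-SQL (a sketch; the exact displayed values depend on the client's float formatting):

```sql
-- 0.1 has no exact binary (base-2) representation, so float accumulates
-- a tiny error that exact decimal arithmetic does not.
DECLARE @f float         = 0.1,
        @d decimal(18,6) = 0.1;
SELECT @f + @f + @f AS float_sum,    -- very slightly above 0.3
       @d + @d + @d AS decimal_sum;  -- exactly 0.300000
```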

Then, when the float data type is passed on to the client, the client may make inconsistent use of rounding (some grids use floor, not round?). Also, it is important to look for any unexpected implicit data type conversions. Float should probably be kept as float all the way through, with a final handling step for fractional currency units.

It is also possible that if an interest amount is calculated to high precision, and a payment value is applied that differs by a truly microscopic amount, then instead of being treated as zero, the difference is treated as having a consequence.

I am certain that if the application coding was handled correctly with respect to the nature of the float type, then there would be no problems, but of course, this is not something that can be taught to sloppy developers (am I biased?).

March 2, 2014 1:50 AM

Vassilis Papadimos said:

The performance discontinuity you observed at DOP=2 can be explained by another variable, execution mode. Starting with SQL Server 2012, there is a new set of query execution algorithms that kick in for queries involving columnstore indexes. The unit of work for these new algorithms is not an individual row but a set of rows called a "batch" (not to be confused with T-SQL batches!) Operators using these new algorithms are annotated as "Estimated/Actual Execution Mode: Batch" in showplan. Batch execution mode is enabled only for parallel plans.
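One way to see this is to compare actual plans for the same columnstore query at DOP 1 and DOP 2, and inspect the execution mode reported on the columnstore scan operator (a sketch; the table name is assumed):

```sql
-- At MAXDOP 1 the plan shows "Actual Execution Mode: Row" on the
-- columnstore scan; at MAXDOP 2 (a parallel plan) it shows "Batch".
SET STATISTICS XML ON;

SELECT COUNT(*), SUM(L_ExtendedPrice)
FROM   LineitemDecCS
OPTION (MAXDOP 1);   -- Row mode

SELECT COUNT(*), SUM(L_ExtendedPrice)
FROM   LineitemDecCS
OPTION (MAXDOP 2);   -- Batch mode

SET STATISTICS XML OFF;
```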

There's an article on TechNet with more details (note that it was written for SQL Server 2012):

P.S. I work for Microsoft.

March 2, 2014 7:13 PM

jchang said:

thanks, I had been looking for that link. But the SQL Server 2012 behavior is correct; it is the 2014 CTP2 behavior that seems to be stupid. Could it be that Microsoft documented this effect in 2012, but did not activate it until 2014? In any case, this is similar to the bitmap filter, which is only activated in parallel plans, even though it could help in non-parallel joins

March 3, 2014 3:51 AM

Vassilis Papadimos said:

The limitation exists in both SQL Server 2012 and 2014, serial plans force row execution mode (and you're right, a similar limitation applies to bitmap filters). I'm not sure what you mean when you say "the SQL Server 2012 behavior is correct". Are you saying that a serial plan for SUM on a nonclustered columnstore index column is significantly slower in SQL Server 2014 compared to 2012? Please file this as an issue in Connect if you find that to be the case.

March 3, 2014 1:41 PM

jchang said:

Ok, you are right, the difference is very likely due to Row versus Batch mode. On SQL 2014 CTP2, tested on a clean-install system, the columnstore in Row mode has CPU very close to the normal page/row table. I only did a quick check of SQL 2012 SP1 on my laptop; I did not have enough memory for the full (page/row) table scan. And the 2012 nonclustered columnstore execution stayed in row mode at both DOP 1 and 2 (for unknown reasons; Windows 7?)

So the implication of this is that row mode loses almost all of the performance advantage of columnstore. There are legitimate reasons for this if the query has logic that cannot be supported in column/batch mode, but the decision to disable batch mode at DOP 1 could be mildly annoying. I will have my ExecStats tool check for columnstore row mode.


March 4, 2014 4:03 PM




About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
