
Joe Chang

Hekaton and Benchmarks?

It is rather curious that the two TPC-E benchmark results published for SQL Server 2014 did not employ memory optimized tables and natively compiled procedures, given that Hekaton is the hallmark feature of 2014. A potential 30X gain in transaction processing has been cited in presentations and whitepapers. No explanation has been given, so I will speculate. Why guess when I do not know? Well, hopefully someone will come out with the real reason.

TPC Benchmarks

The TPC benchmarks are meant to be a common standard that allows people to compare results across different DBMSes, operating systems, and processor or storage hardware. Many years ago, the TPC-C benchmark was very important: vendors made sure their database engines had no bottlenecks in processing that workload, which comprised only a small number of SQL operations. An actual database application might use many other database features, and there was no assurance that operations outside the benchmark were free of scaling limitations.

Several years ago, all of the major database players, software and hardware, contributed to TPC-E, which was supposed to reduce the cost of the benchmark system, increase schema complexity and be more representative (higher read-to-write ratio?). But after the benchmark was approved, Oracle decided not to publish results, even though every indication is that they are perfectly capable of producing very good single system results.

Oracle RAC scales very well in the TPC-C benchmark, which has a high degree of locality by Warehouse. TPC-E does not have that locality and is presumed to be more difficult to scale in an active-active cluster architecture. At the time, RAC scaling was very important to Oracle. Since then, Oracle has favored single-system benchmarks, especially on their Sun platform with SPARC processors. Recent SPARC processors have 8-way simultaneous multi-threading per core (SMT, equivalent to HT for Intel processors).

Microsoft decided to quickly shift over from TPC-C to TPC-E. The time frame of TPC-E roughly corresponded to the boundary between SQL Server versions 2005 and 2008. Microsoft did not allow TPC-C results to be published on SQL Server 2008, and the few TPC-E results that were published after the TPC-E launch employed SQL Server 2005.

One reason was that log stream compression was added in SQL Server 2008 to improve database mirroring. Log compression consumes CPU cycles and is constrained to a single thread, but it reduces network traffic, which is especially important over a WAN. Many people use database mirroring, and few are up against the single-core log compression throughput limit, so perhaps this was for the greater good?

See the Paul Randal blog post SQL Server 2008: Performance boost for Database Mirroring, 11 Oct 2007. It cites a SQLCAT article on database mirroring log compression dated 17 Sep 2007, http://sqlcat.com/sqlcat/b/technicalnotes/archive/2007/09/17/database-mirroring-log-compression-in-sql-server-2008-improves-throughput.aspx, but that link is no longer valid. Trace flag 1462 is cited as "Disable Mirroring Log Compression" (Paul says this flag is active).

TPC-C

The TPC-C benchmark consists of 5 stored procedures: New-Order (45%), Payment (43%) and three at 4% each: Order-Status, Delivery and Stock-Level. New-Order is the principal call: it gets the next order id, inserts one row for the order, inserts an average of 10 order line rows, and updates 10 rows in the Stock table. The Stock table has 2 integer columns forming the key, 4 quantity-related columns and 11 fixed-length char columns totaling 290 bytes (308 bytes per row, excluding table overhead). Only the four quantity columns are actually updated. However, is it necessary to write both the original and the new rows in their entirety to the log?

Paul Randal says only the bytes that changed need to be written to the log, but a key column update has the effect of an insert/delete.

It is also possible that the exact nature of the Stock update statement requires the char columns to be logged?

This would imply that the raw New-Order transaction log write is on the order of 6.4KB, to encompass 10 original and 10 new rows of Stock plus 10 order line inserts. The TPC-C reports cite 6.4KB per transaction, presumably including the Payment transaction?
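A back-of-envelope check on that figure (my arithmetic, assuming full before and after row images are logged for the Stock update; the order line inserts and order header rows would add several hundred bytes more):

-- Rough estimate of the raw New-Order log volume from the Stock update alone.
-- 308 bytes/row is the Stock row size cited above, excluding row overhead.
DECLARE @stock_row_bytes int = 308
SELECT 10 * @stock_row_bytes * 2 AS stock_old_plus_new_bytes   -- 6,160 bytes

Adding the 10 order line inserts lands in the neighborhood of the cited 6.4KB.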

The last TPC-C results on SQL Server (2005) were published in 2011: one for a system with 4 Opteron processors, and one for 2 x Xeon X5690 (Westmere) 6-core processors at 1,024,380 tpm-C, or 17,073 New-Order transactions per second. Assuming a 6.4KB log write per New-Order, the raw (uncompressed) log IO would be about 109MB/s.

More recently, Cisco published a TPC-C result with Oracle 11g on Linux on 2 x Xeon E5-2690 (Sandy Bridge) 8-core processors at 1,609,186 tpm-C, or 26,819 New-Order transactions/s. There is also an Oracle/Linux TPC-C result on 8 x E7-8870 (Westmere-EX) 10-core processors at 5,055,888 tpm-C, or 84,264 New-Order transactions/s.
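For reference, the conversion from tpm-C to New-Order transactions per second and to raw log bandwidth is simple arithmetic; the 6.4KB of log per New-Order is the assumption discussed above:

-- tpm-C to New-Order/sec (divide by 60) and to uncompressed log MB/s,
-- assuming ~6.4KB of raw log per New-Order transaction.
SELECT score_tpmC
, score_tpmC / 60.0               AS new_orders_per_sec   -- 17,073 / 26,820 / 84,265
, score_tpmC / 60.0 * 6.4 / 1024  AS raw_log_MB_per_sec   -- ~107 / ~168 / ~527
FROM (VALUES (1024380.0), (1609186.0), (5055888.0)) v(score_tpmC)

(The 109MB/s figure above corresponds to dividing by 1000 rather than 1024.)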

Even if the 2 x 8-core transaction log volume were within the single-thread compression capability, it is likely that the higher volume from 4-socket 8-core systems, or even the newer 2-socket E5 v2 12-core processors, would exceed the log stream compression capability of a single core?

TPC-E

The more recent TPC-E benchmark is comprised of 10 transactions, of which Trade-Order, representing the benchmark score, is 10% of the volume. Some of the individual transactions are comprised of multiple frames, so the total number of stored procedure calls (each involving a network round trip) per scored transaction is 24. If we work out the RPC volume for TPC-C and TPC-E, we see that the average cost per call is not substantially different between the two benchmarks. So one of the cited objectives for TPC-E, being more representative of current workloads, may be true, but the database engine does not really care.
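A rough cross-check of that claim, using the two 2 x X5690 results cited in this post (the 1,284 tps-E result appears in the table further down; the 45% New-Order mix and 24 calls per scored transaction are from above, and the arithmetic is mine):

-- Approximate stored procedure calls per second on the same 2 x X5690 hardware.
-- TPC-C: each transaction is a single call, and New-Order is 45% of the mix.
-- TPC-E: roughly 24 calls per scored (tps-E) transaction.
SELECT 1024380 / 60.0 / 0.45 AS tpcc_calls_per_sec   -- ~37,900
     , 1284 * 24.0           AS tpce_calls_per_sec   -- ~30,800

Similar call rates on the same processors suggest a similar average cost per call.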

Perhaps the more important matter is that the TRADE table row is 139 bytes, excluding overhead. Trade-Order involves 1 row and is only 10% of the transaction volume (4% of call volume?). Several of the TPC-E transactions involve write operations, and the TPC-E reports cite a log write of 6.7KB per trade, perhaps amortized over the other transactions.

Some TPC-E stats with raw log write estimates

Benchmark   Processor      Cores/socket   Score             Initial/Final data size   8-hour log space
TPC-C       2 x X5690      6              1,024,380 tpm-C   7TB (83MB/WH)             3000GB?
TPC-E       2 x X5690      6              1,284 tps-E       5.6/6.0TB                 98 (178?) GB
TPC-E       2 x E5-2690    8              1,881.76 tps-E    7.7/8.4TB                 378GB
TPC-E       2 x E5-2697    12             2,590.93 tps-E    10.7/11.3TB               523GB

So perhaps the net difference between TPC-E and TPC-C is a 10X reduction in log write?
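Converting the 8-hour log space column above into an average rate (the question marks in the table carry through to these numbers):

-- Average raw log MB/s implied by the 8-hour log space figures above.
SELECT benchmark, log_space_GB
, log_space_GB * 1024.0 / (8 * 3600) AS avg_log_MB_per_sec
FROM (VALUES ('TPC-C 2 x X5690',   3000)   -- ~107 MB/s
           , ('TPC-E 2 x X5690',    178)   -- ~6 MB/s
           , ('TPC-E 2 x E5-2690',  378)   -- ~13 MB/s
           , ('TPC-E 2 x E5-2697',  523)   -- ~19 MB/s
     ) v(benchmark, log_space_GB)

On the same 2 x X5690 hardware the gap is roughly an order of magnitude or somewhat more, depending on which of the uncertain figures is used.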

Hekaton

So what does all this have to do with Hekaton? Let us suppose that Hekaton is capable of providing a 10-30X increase in transaction performance. The current 2-socket Xeon E5 v2 system has 12 cores per socket, and the 4-socket E7 model has 15 cores per socket. These would generate roughly 17 and 34 MB/s of raw transaction log volume, respectively, without Hekaton. At a 10X gain from Hekaton, we would be beyond the log compression capability of a single core on the 4-socket system. At a 30X gain, beyond the compression capability for the 2-socket system? All of this is just guesswork. Would someone from Microsoft like to comment?
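The speculation in numbers, using the ~17 and ~34 MB/s baselines above (these baselines and the 10X/30X multipliers are the assumptions of this post; exactly where single-core log compression tops out is the open question, and the ~80MB/s figure in the Note below is compressed output, so the corresponding raw input rate would be somewhat higher):

-- Speculative: raw log MB/s at 10X and 30X Hekaton gains, from the
-- assumed 17 and 34 MB/s non-Hekaton baselines above.
SELECT system_desc, base_MBps
, base_MBps * 10 AS at_10x_gain
, base_MBps * 30 AS at_30x_gain
FROM (VALUES ('2-socket E5 v2, 12 cores/socket', 17)   -- 170 / 510
           , ('4-socket E7, 15 cores/socket',    34)   -- 340 / 1020
     ) v(system_desc, base_MBps)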

Note

See the SQLCAT slide deck “Designing High Scale OLTP Systems” by Thomas Kejser and Ewan Fairweather, presented at SQLBits 2010. It cites log write throughput of 80MB/s. It is not stated whether this was a limitation or just what that particular case produced. It does appear to be in the correct range for what one Nehalem-EX core can generate in compressed output.

Hekaton Math

The math behind the Hekaton performance gain is worth examining. An example was cited of the performance in a test scenario. Employing just the memory optimized tables, the gain was 3X. With natively compiled stored procedures (NCP) as well, another 10X gain was realized, for a combined gain of 30X.

Let us assume the baseline test query requires 100 units of work. A 3X gain via memory optimized tables means that work is now 33 units, so the difference between the traditional approach and the new system is 67 units.

The 10X gain with natively compiled procedures over that first step means the 33 units with just memory optimized tables is further reduced to 3.3, a net reduction of only another 29.7 units.
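The same arithmetic spelled out (the 100-unit baseline is just a convenient scale):

-- Work units remaining after each step, from the 3X and 10X figures above.
DECLARE @baseline float = 100
SELECT @baseline / 3                     AS after_memopt_tables   -- ~33.3 units, saving ~67
     , @baseline / 3 / 10                AS after_native_procs    -- ~3.3 units, saving another ~30
     , @baseline / (@baseline / 3 / 10)  AS combined_gain         -- 30X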

So if we were not careful with the underlying math, we might have thought that the 10X with natively compiled procedures was more important than the memory optimized tables. In principle, NCP could have been implemented without memory optimized tables.

This example demonstrates why it is important to tackle the big resource consumer first, but also that the gains are magnified when the secondary items can be improved as well.

TPC-C New Order Stock update SQL Statement

-----------------------
-- update stock values
-----------------------
-- Note the T-SQL idiom @variable = column = expression, which updates the
-- column and captures the new value in a local variable in one statement.
-- Only the four quantity columns are modified; s_data and the s_dist_xx
-- column selected by @d_id are merely read back into variables.
UPDATE stock SET
  s_ytd = s_ytd + @li_qty
, @s_quantity = s_quantity = s_quantity - @li_qty
  + CASE WHEN (s_quantity - @li_qty < 10) THEN 91 ELSE 0 END
, s_order_cnt = s_order_cnt + 1
, s_remote_cnt = s_remote_cnt
  + CASE WHEN (@li_s_w_id = @w_id) THEN 0 ELSE 1 END
, @s_data = s_data
, @s_dist = CASE @d_id
    WHEN 1 THEN s_dist_01 WHEN 2 THEN s_dist_02
    WHEN 3 THEN s_dist_03 WHEN 4 THEN s_dist_04
    WHEN 5 THEN s_dist_05 WHEN 6 THEN s_dist_06
    WHEN 7 THEN s_dist_07 WHEN 8 THEN s_dist_08
    WHEN 9 THEN s_dist_09 WHEN 10 THEN s_dist_10
  END
WHERE s_i_id = @li_id AND s_w_id = @li_s_w_id

 

Addendum

Regarding the comment below on the 256GB maximum for memory optimized tables: I do not see any reason that even a modern transaction processing database needs even half of that for in-memory data. But it is very reasonable to have tens of TB in the database for generally beneficial functionality. That being the case, I think it would be nice to have a feature for rolling data from an in-memory table into a regular table. My thought is that at the end of each day, or other period, the completed set of transactions would be converted to a regular table, which then becomes a partition in the archive/history table; a rough sketch follows.
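A sketch of what that rolling might look like with existing features; every object name here (TradeHot, TradeStage, TradeHistory, pfTradeDate) is invented for illustration, TradeHot is assumed to be the memory optimized table, and TradeStage is assumed to be an empty disk-based table with the indexes and check constraint required for a partition switch into TradeHistory:

-- Move the completed period from the in-memory table to a disk-based
-- staging table, then switch the staging table into the history partition.
DECLARE @cutoff date = CAST(SYSDATETIME() AS date)

BEGIN TRAN
  INSERT INTO dbo.TradeStage (TradeID, TradeDate, Symbol, Qty, Price)
  SELECT TradeID, TradeDate, Symbol, Qty, Price
  FROM dbo.TradeHot WITH (SNAPSHOT)        -- isolation hint for the memory optimized table
  WHERE TradeDate < @cutoff

  DELETE dbo.TradeHot WITH (SNAPSHOT)
  WHERE TradeDate < @cutoff
COMMIT

-- Metadata-only switch into the matching (empty) partition of the history table
DECLARE @p int
SELECT @p = $PARTITION.pfTradeDate(DATEADD(day, -1, @cutoff))

ALTER TABLE dbo.TradeStage
  SWITCH TO dbo.TradeHistory PARTITION @p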

Published Thursday, March 13, 2014 7:05 PM by jchang


Comments

 

Lonny Niederstadt said:

I think another reason for the current lack of TPC-E results for Hekaton may be the stated 256GB max size for memory optimized tables. I believe that's the aggregate limit.

March 13, 2014 10:01 PM
 

jchang said:

Excellent, I had not realized there was a limit for memory optimized tables. Is the limit also dependent on system memory?

Also, I failed to consider that the 2 x 12 core 2590 tps-E result has a db size of 10.7TB, way bigger than memory. So 10X would mean a 100TB database.

There is a rigid requirement in the TPC benchmarks for database size to scale with the score, to ensure that results are comparable.

So really there needs to be a new benchmark definition for in-memory. It could simply drop the database size scaling requirement, but perhaps require the maximum memory configuration (all DIMM sockets populated, though perhaps not with today's 32GB DIMMs).

March 14, 2014 11:03 AM
 

BobBarker said:

I really have no idea what any of this means

March 18, 2014 5:33 AM
 

Joe User said:

I don't see any 2014/2012 comparisons with similar HW/DBs and queries.

Don't you have to also compile the stored procs down to .NET CLR etc. and declare the tables as memory resident?

March 18, 2014 11:17 AM


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
