Earlier this year, HPE and Microsoft sponsored an article,
The Importance of Benchmarking,
in SQL Server Pro.
While the premise is true, it is also apparent that there has been little interest in benchmarks
within the DBA/IT community over the last several years.
There are many reasons for this,
one of which is the emphasis on the benchmark result as the important takeaway.
Today most people expect that systems are very powerful,
and probably know that performance at the processor core level
has improved only incrementally in recent years.
The main avenue is the increasing number of cores with each generation,
currently at 22 in Xeon E5 v4 and 24 in E7 v4.
The TPC benchmarks are generally well designed,
with rules meant to prevent the publication of a distorted result.
But every application has its own characteristics,
different table organizations,
and transactions with SQL logic
bearing no resemblance to a benchmark or another application.
As such, a benchmark result does not translate directly to a different set of circumstances.
What is of great importance are the problems that were encountered in the benchmark,
and the measures that were taken to address each problem.
The topic of this article is Inserts to a table with a sequentially increasing index,
commonly in the form of an identity column,
but other implementations are possible as well.
At a New York City SQL Server Users Group meeting, Thomas Grohser,
Expert SQL Server Performance Engineering,
mentioned that multiple clients (connections)
doing single-row inserts to a table
with a clustered index on an identity column
had extremely poor performance running on a 4-socket system.
In that case, this was about 6,000 rows per second.
For inserts to a table with a clustered index on a uniqueidentifier,
or with a compound key not leading with the identity column,
performance was 95,000 (calls and) rows per second.
Memory-optimized tables could achieve 1.7M rows per second,
but it might be a good idea to consult with Thomas on this.
The assessment was that this was a contention issue due to the high latency
between processor sockets.
All modern systems with more than one processor socket have
Non-Uniform Memory Access (NUMA) architecture.
For Intel processors, this goes back to Xeon 5500 (2009) and 7500 (2010) and 2003/4 for AMD Opteron.
An Insert to a table with an index on an identity column,
regardless of whether the index is clustered or nonclustered,
involves acquiring an exclusive lock on the last row of the last page,
as well as accessing the memory for that page and row.
When there are multiple concurrent threads with affinity on cores
in different sockets of a multi-socket system,
the memory involved bounces between the L3 caches of each socket.
While it is inefficient for memory to bounce across sockets,
the magnitude of the impact on Inserts is stunning.
Testing Insert Performance
The test environment here is a single socket Xeon E3 v3, quad-core, hyper-threading enabled.
Turbo-boost is disabled for consistency.
The software stack is Windows Server 2016 TP5, and SQL Server 2016 cu2 (build 2164).
Some tests were conducted on a single socket Xeon E5 v4 with 10 cores,
but most are on the E3 system.
In the past, I used to maintain two-socket systems for investigating such issues,
but only up to the Core2 generation, which was not NUMA.
The test table has 8 fixed length not null columns,
4 bigint, 2 guids, 1 int, and a 3-byte date.
This adds up to 71 bytes.
With file and row pointer overhead, this works out to 100 rows per page at 100% fill-factor.
Both heap and clustered index organized tables were tested.
The indexes tested were 1) single column key sequentially increasing
and 2) two column key leading with a grouping value followed by a sequentially increasing value.
The grouping value was chosen so that inserts go to many different pages.
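As a sketch, the test table and the two index variants might look like the following. The table and column names are my own illustration, not the actual test harness:

```sql
-- Hypothetical reconstruction of the test table:
-- 4 bigint, 2 guids, 1 int, and a 3-byte date, all fixed length and not null.
CREATE TABLE dbo.InsertTest (
    ID      bigint NOT NULL IDENTITY(1,1),  -- sequentially increasing
    GroupID bigint NOT NULL,                -- grouping value for variant 2
    Val1    bigint NOT NULL,
    Val2    bigint NOT NULL,
    Guid1   uniqueidentifier NOT NULL,
    Guid2   uniqueidentifier NOT NULL,
    IntVal  int    NOT NULL,
    DateVal date   NOT NULL                 -- the date type is 3 bytes
);

-- Variant 1: single-column, sequentially increasing key.
-- All concurrent inserts contend for the last row of the last page.
CREATE UNIQUE CLUSTERED INDEX ci_seq ON dbo.InsertTest (ID);

-- Variant 2: two-column key leading with the grouping value,
-- so that concurrent inserts land on many different pages.
-- CREATE UNIQUE CLUSTERED INDEX ci_grp ON dbo.InsertTest (GroupID, ID);
```

The same two key shapes were also tested as a nonclustered index on a heap.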
The test was for a client to insert a single row per call.
Note that the recommended practice is to consolidate
multiple SQL statements into a single RPC, aka network roundtrip,
and if appropriate, bracket multiple Insert, Update and Delete statements
with a BEGIN and COMMIT TRAN.
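The recommended batching might look like the following sketch, with illustrative table and parameter names:

```sql
-- One RPC (network round trip) carrying several insert statements,
-- bracketed in a single transaction, instead of one round trip
-- and one transaction per single-row insert.
BEGIN TRAN;
INSERT dbo.InsertTest (GroupID, Val1, Val2, Guid1, Guid2, IntVal, DateVal)
VALUES (@g1, @v1a, @v1b, NEWID(), NEWID(), @i1, @d1);
INSERT dbo.InsertTest (GroupID, Val1, Val2, Guid1, Guid2, IntVal, DateVal)
VALUES (@g2, @v2a, @v2b, NEWID(), NEWID(), @i2, @d2);
COMMIT TRAN;
```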
This test was contrived to determine the worst case insert scenario.
On the single socket Xeon E3 quad-core system, the heap table with no indexes
and both the heap with one nonclustered index and clustered index
with the two column key
could support an insert call rate of 70-72,000 per sec,
both rows inserted and number of network roundtrips.
The heap table with one nonclustered index and table with only a clustered index,
both with the index key being a single sequentially increasing column
such that inserts from all threads/connections go to the last page
could support an insert call rate of about 40,000 rows per sec.
The issue here is multiple concurrent connections
making calls to Insert into a single table
with a sequentially increasing index, clustered or nonclustered,
resulting in contention between threads for the last page and row.
There is a substantial 42% performance degradation
even on a single-socket system with memory in a common L3 cache,
as the relevant memory locations move between the L2 caches of different cores
each time exclusive access is acquired.
While the impact is large,
even this may not be a crippling effect depending on the circumstance.
In a multi-socket system, memory must now also be moved between L3 cache of
different processor sockets, which has higher latency.
The impact is expected to be more severe as the number of sockets increases.
The presumption here is that the limitation is in the ability to acquire exclusive
locks between threads on different cores, potentially in different sockets.
If each call inserted more than one row, the call rate might decrease only slightly,
resulting in a higher row insert rate.
Hence the explicit emphasis on the call rate as well as the row insert rate.
Note also that we should probably be able to insert into multiple tables
each at more or less the same rate as inserts to a single table
until the limitation becomes the log write thread or overall CPU.
This test is not the same as the test by Thomas Grohser
in terms of table structure, client application, and hardware etc.
Still, the expectation is that results would not be dramatically different.
My tests on this matter are incomplete, and more work is necessary.
As soon as I can get access to two and four socket systems, I will try to run this same test
(hint to Intel or someone).
In preliminary tests on the single socket 10-core system, the Insert call rate
with rows going to different pages was over 100,000 per sec for both heap and clustered,
and around 30,000 per sec with a sequential index, nonclustered or clustered.
We might infer that this is a contention issue in which performance degrades
with an increasing number of cores, for both single and multi-socket systems.
With multiple sockets, the degradation might be even more severe.
This issue might be more complex than a simple matter of the higher latency between sockets,
but other people might have better tools to conduct such an investigation.
There are work-arounds for this issue.
One is to implement a multi-column clustered index key such that inserts are distributed over many pages.
It is also necessary to avoid even a nonclustered index on the identity column,
which may have other implications.
Another work-around is simply to deploy on a single socket system.
This is actually good advice for perhaps a majority of situations.
The Xeon E3 with 4 cores is perhaps twenty times more powerful than
a 4-socket Pentium II Xeon from 1998.
If the quad-core E3 were not enough, more likely in memory or PCI-E capacity than compute,
then a single socket Xeon E5 with up to 22 cores should definitely be considered
before reflexively defaulting to a two-socket system
without quantitative substantiation of any performance assumptions.
One of the problems today is that infrastructure people have bought into
the vendor driven arguments for virtual machines and/or cloud,
but apply it pervasively even into the mission-critical systems,
while completely disregarding any special requirements.
The typical platform is a 2-socket.
There are other options too.
We could try affinitizing calls that insert into the critical table to a specific NUMA node,
using the SQL Server feature Map TCP/IP Ports to NUMA Nodes.
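As a sketch of that feature (port numbers and masks here are illustrative): in SQL Server Configuration Manager, under TCP/IP properties, the TCP Port field accepts a NUMA node affinity mask in square brackets after each port:

```
1500[0x1],1501[0x2]
```

Here port 1500 is mapped to NUMA node 0 (mask 0x1) and port 1501 to node 1 (mask 0x2); connections arriving on a given port are then serviced by schedulers on that node, so the critical inserts could be directed to a single node.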
Benchmarks and Inserts
Before closing this topic, given that it is so important,
we might ask whether the benchmarks shed light on this matter.
The TPC-C benchmark has the highest insert volume into a single table.
However, TPC-C does not have this problem because all tables are organized
(clustered index) by warehouse followed by district.
This reflects a brick and mortar business,
where customers place orders mostly in their own warehouse and district.
If we were curious anyway, TPC-C benchmark results are in transactions per minute (tpm-C).
The highest SQL Server result is 1.8M tpm-C, corresponding to 30K new orders per second.
Technically, the order line table has more rows inserted.
Each transaction inserts one row to orders
and on average 10 rows to the order line table with a single call,
so the Insert statement call volume is the same for both tables.
No TPC-C benchmark results were published for SQL Server version 2008 or later,
even though a handful of results continued to be published for version 2005
after 2008 RTM.
The reason for this is that SQL Server 2008 compressed log file writes
to better support database mirroring,
a very important feature across the broad base of users.
Log writes were single-threaded until version 2016.
This meant that SQL Server write performance could be limited by the ability of a
single core running the log writer thread to compress log entries.
Presumably there was some survey to suggest that this would be acceptable,
as there is no option to disable log write compression?
The TPC-C benchmark has a curious statement when updating the stock table,
with an OUTPUT clause that touches but does not change several fat columns.
This part serves no meaningful business purpose,
but has the effect of grossly inflating the log write size far beyond
what should be necessary.
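A sketch of the pattern follows. The column names are from the TPC-C stock table schema, but the statement itself is illustrative, not the actual kit code, and @qty, @item_id, and @warehouse_id are assumed parameters:

```sql
-- The statement changes only the small quantity column, yet the
-- OUTPUT clause touches wide columns that are not modified,
-- which per the discussion above inflates the effective log write
-- far beyond the few bytes that actually changed.
UPDATE dbo.stock
SET s_quantity = s_quantity - @qty
OUTPUT inserted.s_dist_01,   -- fat padding columns, unchanged
       inserted.s_dist_02,
       inserted.s_data
WHERE s_i_id = @item_id AND s_w_id = @warehouse_id;
```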
Presumably other RDBMSs do not have mandatory log compression.
And so the highest TPC-C result on a system with Intel processors
is 5M tpm-C (8 sockets, 80 cores total, 160 threads)
corresponding to 84,000 inserts to orders per second.
The TPC-E benchmark does have 3 tables that are unique on the (big)integer Trade Id column,
and two more tables that have a two column key leading with Trade Id.
Is there a magical tuning trick that can evade the insert to a table
with a sequentially increasing index problem?
TPC-E works around this by not using an identity column in the tables
where one might think the primary key column should be an identity.
In fact, it uses a scalar function and a table to assign trade Id values.
The TradeOrderFrame4 procedure first calls the function Get_Next_T_ID
with an output variable to retrieve a Trade Id.
The function uses the table TID_RANGES to store blocks of ID values.
Each block, represented by a row in TID_RANGES, is assigned
to a specific session id (SPID, accessed with @@SPID).
Every time the function is called,
the current Id value in the row for the caller's spid is returned,
and the row is updated to the next Id value.
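The mechanism might look something like the following sketch. Note that a T-SQL scalar function cannot itself update a table, so the actual kit implementation presumably differs; the column names are my assumptions:

```sql
-- Hypothetical sketch of block-allocated trade Id assignment.
-- Each session owns its own row in TID_RANGES holding its current Id,
-- so sessions never contend with each other for the same row.
CREATE TABLE dbo.TID_RANGES (
    spid       int    NOT NULL PRIMARY KEY,
    current_id bigint NOT NULL
);
GO
CREATE PROCEDURE dbo.Get_Next_T_ID
    @trade_id bigint OUTPUT
AS
BEGIN
    -- Atomically advance this session's Id and return the new value.
    UPDATE dbo.TID_RANGES
    SET @trade_id = current_id = current_id + 1
    WHERE spid = @@SPID;
END
```

Because each session draws from its own pre-assigned block of Id values, inserts keyed on trade Id from different sessions do not all converge on a single last page.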
This is probably an acceptable implementation in the TPC-E benchmark,
where all connections are running transactions.
In a real-world environment, there might be very many open connections,
not all of which process transactions, sharing a common connection pool.
I might suggest using the spid modulo the total number of logical processors.
Or perhaps spid divided by 2 first, then modulo the number of cores when HT is active.
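A sketch of that mapping, with the core count an assumption for an 8-core system with HT:

```sql
-- Hypothetical: fold an arbitrary spid onto one of a fixed number of
-- Id-range rows, so that a large connection pool shares a small set
-- of pre-assigned Id blocks, roughly one per core.
DECLARE @cores int = 8;                      -- physical cores (assumed)
DECLARE @slot  int = (@@SPID / 2) % @cores;  -- divide by 2 first when HT is active
SELECT @slot AS id_range_slot;
```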
All of this is in the details required by the TPC full disclosure rules,
but it is buried in the supplemental files among thousands of extraneous files.
There is no explanation given as to why, in the three tables
with a primary key clustered on the trade Id column, the key is not an identity,
instead using a function and a table that require an update to maintain.
It is interesting that this method was in TPC-E from the beginning.
The first TPC-E result was published in 2007
for a 16-socket system
with Xeon (7140M?) dual-core processors.
In that generation, there were probably 4 processors in each NUMA node.
The performance was 660 tpsE,
meaning 660 inserts per second to each of 3 tables unique on trade Id
plus inserts to two more tables.
The current 4-socket Xeon E7 v4 with 96 cores total is vastly more powerful.
Was the insert with sequentially increasing index problem known back then?
If it occurred at such a low insert rate,
then how many people suffered this problem without knowing why?
The conclusion to draw is that, yes benchmarks are important.
However, what is important is not the result that vendors like to brag about,
but rather the many problems that were solved in producing a competitive result.
While all the methods are disclosed in the supporting files,
there is no accompanying explanation.
Some methods are simple to apply, like lock pages in memory.
Others must be worked into the database architecture,
which has deep ramifications if done after the system is already in production,
or must be part of the client-side application, such as the TCP port affinity.
Some methods might have critically significant impact.
Others might have just the minor impact desired in benchmarks,
but no important consequence in a production system.
I do not suppose anyone would be willing to apply the -x startup parameter
in a production environment, but it might be good to know the impact?
Good luck in deciphering which details are critically important to a specific situation
and which might just be used to get an extra 1% edge over a competitor.
The test systems, client load generator and server, each had an Intel I210 1GbE connected to a common switch.
A single-threaded connection could drive about 4K single-row select RPC/sec, and 2K single-row insert RPC/sec. A high level of saturation was achieved at 80 concurrent connections.
About 12 years ago, I tested Xeon (Pentium 4) systems with 1GbE connected back to back (no switch), which could drive 5K single-row select RPC/s. I did not test the current 1GbE direct-connect network round-trip rate.
Both of these systems also had an Intel XL710 40GbE with a direct QSFP+ connection. At first this did not seem to work in Windows Server 2016 TP5 with the Intel beta driver.
Intel support was of the opinion that it should work, but did not say so definitively.
Then on Tuesday, with the latest Windows TP5 hotfixes, all of a sudden the load generator was connecting via the 40GbE instead of the 1GbE. There were some quirks in the behavior.
In any case, the 40GbE direct connect could drive 11K select RPC/s and 6K insert RPC/s on a single thread. Only 20 concurrent threads were necessary to achieve a high level of load, at which point the 10-core Xeon E5 could support perhaps 40K single-row inserts per second with a sequential clustered index.
I am currently in the process of changing over to Windows Server 2016 RTM + hotfixes,
which also seems to upgrade the XL710 driver from 1.3.115 to 1.5.59, so more to come.
Henrik provided the link,
SQL Server Latch Contention, which is why he described this as latch contention.
I would have thought it was lock contention, but apparently it is latch contention according to Microsoft.
From a quick look, Microsoft identifies the problem as one of degree of concurrency, more severe at higher concurrency. Between Thomas Grohser's report and my data, it should be emphasized that there is most probably a strong NUMA versus non-NUMA hardware impact as well. But it would be preferable if one person ran the same test on 1-socket, 2-socket, and 4-socket systems.