THE SQL Server Blog Spot on the Web


Joe Chang

SQL Server on Xeon Phi?

Can the Intel Xeon Phi x200, aka Knights Landing, run SQL Server? It does run Windows Server 2016, so is there anything in SQL Server 2016 that would stop it from installing? Xeon Phi is designed for HPC, so it has presumably not been tested with SQL Server, but that alone does not tell us whether it will or will not work. If it does work, then it could be used to prove out some important theories.

The main line Intel processors used in the Core i7/5/3 and Xeon product lines, with recent generation codenames Haswell, Broadwell, Skylake and soon Kaby Lake, are heavily overbuilt at the core level. The Broadwell processor core is designed to operate at 4GHz if not higher.

The Haswell 22nm processor was rated for up to 3.5GHz base, 3.9GHz turbo at launch in Q2’13. In Q2’14, a new model was rated 4.0/4.4GHz base/turbo at 88W TDP. Both are C-0 stepping, so was the higher frequency achieved through maturity of the manufacturing process, or by cherry picking? The Broadwell 14nm processor had a top frequency of 3.3/3.8GHz base/turbo at 65W for desktop, but perhaps this is because it was aimed more at mobile than desktop? (Curiously, there is also a Xeon E3 v4 at 3.5/3.8GHz and 95W with Iris Pro graphics.) The top Skylake 14nm processor was 4.0/4.2GHz base/turbo at 91W.

With a single core under load, the processor is probably running at the turbo boost frequency. When all four cores are under load, it should be able to maintain the rated base frequency while staying within design thermal specifications, and it might be able to run at a boosted frequency depending on which execution units are active.

The latest Intel Xeons (regular, not Phi) are the E5 and E7 v4, based on the Broadwell core. There are 3 die versions, LCC, MCC, and HCC, with 8 (10?), 15, and 24 cores respectively. All of these should be able to operate at the same frequency as the desktop Broadwell or better, considering that the Xeon E5/7 v4 Broadwells came out one year after the desktop processors. But Xeons need to be more conservative in their ratings, so a lower frequency is understandable.

The top Xeon 4-core model, E5-1630 v4, using the LCC die, is 3.7/4.0GHz at 140W TDP. The top 8-core, E5-1680 v4, is 3.4/4.0GHz, also at 140W TDP.

The top 14-core (MCC die) is 2.2/2.8GHz 105W. The top 24-core (HCC die) is 2.2/3.4GHz 140W. So the Xeon E5 and E7 v4 processors are built using cores designed to operate electrically at over 4GHz, but are constrained by heat dissipation when all cores are active to a much lower value, as low as one-half the design frequency in the high core count parts.

The transistor density half of Moore’s law is that doubling the number of transistors on the same manufacturing process should enable a 40% increase in general purpose performance. The implication here is that if a particular Intel (full-size) processor core is designed with the transistor budget to operate at 4GHz, then in theory a core with one-quarter of that transistor budget should be comparable to the full-size core operated at one-half the design frequency, whatever the actual operating frequency of the quarter-size core is. (Doubling the quarter-size core to half-size yields a 1.4X gain. Doubling again to full-size yields another 1.4X, for approximately 2X performance going from quarter to full size.)
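The arithmetic in the parenthetical can be sketched in a few lines. This is only an illustration, and it assumes the "40% per doubling" rule is read as performance scaling with the square root of the transistor budget (1.4 is approximately the square root of 2). The function name `relative_performance` is made up for this sketch.

```python
import math

def relative_performance(transistor_ratio: float) -> float:
    """Performance relative to the full-size core.

    Assumption: doubling the transistor budget gives ~1.4X (sqrt(2)),
    so performance scales as the square root of the transistor ratio.
    """
    return math.sqrt(transistor_ratio)

quarter = relative_performance(0.25)  # quarter-size core
half = relative_performance(0.5)      # half-size core
full = relative_performance(1.0)      # full-size core

print(f"quarter-size core: {quarter:.2f}x of full-size")
print(f"quarter -> half gain: {half / quarter:.2f}x")
print(f"half -> full gain:    {full / half:.2f}x")
print(f"quarter -> full gain: {full / quarter:.2f}x")
```

Under this reading, the quarter-size core lands at one-half the performance of the full-size core, which is the comparison the paragraph above makes.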

So the theory is that it might be possible to have 100 cores of one-quarter the complexity of the Broadwell core on a die of size comparable to Broadwell EX (456mm²), with adjustments for L2/L3 variations, and differences in the memory and PCI elements.
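As a rough check on the "100 cores" figure, the estimate can be written out as below. The 24-core count and the one-quarter size ratio come from the text. Treating core count as inversely proportional to core size, and ignoring cache, memory and PCI area, are simplifying assumptions of this sketch.

```python
FULL_CORES = 24     # cores on the large Broadwell die, from the text
SIZE_RATIO = 0.25   # small core has one-quarter the transistor budget

# Assumption: the same core area holds 1 / SIZE_RATIO times as many small cores.
small_cores = round(FULL_CORES / SIZE_RATIO)

# From the text: one quarter-size core is roughly comparable to one full-size
# core throttled to half its design frequency, which is about where the
# high core count parts run. For perfectly parallel work, the ratio of
# aggregate throughput is then just the ratio of core counts.
throughput_ratio = small_cores / FULL_CORES

print(f"small cores in the same core area: {small_cores}")
print(f"aggregate throughput vs. {FULL_CORES} throttled big cores: {throughput_ratio:.1f}x")
```

This only applies to work that parallelizes across all cores; it says nothing about single-threaded tasks.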

This is just what Xeon Phi, aka Knights Landing, appears to be. There are 72 cores in 36 tiles, operating at 1.3-1.5GHz base, 1.7GHz turbo. The Xeon Phi x200 is based on the Silvermont Atom, but at 14nm. A tile is composed of 2 Atom cores, each with 4-way simultaneous multi-threading (SMT), and 1MB of L2 cache shared between the 2 cores. (There is no shared L3? How is cache coherency handled?) The Xeon Phi has 16GB of MCDRAM and 6 memory channels capable of 115GB/s and 384GB max capacity (6x64GB). The MCDRAM can be used in one of three modes: Cache, Flat, or Hybrid.
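The totals implied by these figures can be tallied as follows. Tile, core, SMT, channel and DIMM numbers are from the text. The DDR4-2400 transfer rate and the 8-byte channel width are assumptions added here to show where a figure like 115GB/s could come from.

```python
TILES = 36
CORES_PER_TILE = 2
THREADS_PER_CORE = 4        # 4-way SMT
L2_PER_TILE_MB = 1
DDR_CHANNELS = 6
MAX_GB_PER_CHANNEL = 64

# Assumptions (not stated in the post): DDR4-2400, 8 bytes per transfer.
TRANSFERS_PER_SEC = 2400e6
BYTES_PER_TRANSFER = 8

cores = TILES * CORES_PER_TILE
threads = cores * THREADS_PER_CORE
l2_total_mb = TILES * L2_PER_TILE_MB
ddr_capacity_gb = DDR_CHANNELS * MAX_GB_PER_CHANNEL
ddr_bandwidth_gbs = DDR_CHANNELS * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e9

print(f"cores: {cores}, hardware threads: {threads}")
print(f"total L2: {l2_total_mb}MB, max DDR capacity: {ddr_capacity_gb}GB")
print(f"peak DDR bandwidth: {ddr_bandwidth_gbs:.1f}GB/s")
```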

There is no mention of the MCDRAM latency, only the phenomenal combined bandwidth of 400-500GB/s. My expectation is that the round-trip latency from the processor to off-die memory should be lower when the memory is in the same package as the processor, compared to the common arrangement in which memory is outside the processor package. This is because it should be possible to use really narrow wires to connect the processor to memory in a common package, so there should be fewer buffering circuits needed to amplify the signal current? (Can some circuit designer speak to this, please?)

This combination of higher core count and more threads per core via SMT is more or less comparable to IBM POWER, SPARC and even AMD Zen. Transactional queries are essentially pointer-chasing code: fetch a memory location, then use its value to determine the next location to fetch. This should run fine on a simpler core than the 6/8-port superscalar Broadwell. Such code also has many dead cycles during the round-trip memory access latency, implying that SMT will work well (beyond the two threads per core in the main line Intel cores).
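To make "pointer chasing" concrete, here is a minimal sketch of the access pattern. It is not SQL Server code, and Python will not reproduce the hardware timing; it only shows the dependency structure, where each fetch cannot be issued until the previous one has returned. The names `build_chain` and `chase` are invented for this example.

```python
import random

def build_chain(n: int, seed: int = 0) -> list[int]:
    """Build next_index as a single random cycle over n slots.

    Following it from slot 0 visits every slot once before returning to 0,
    so each step lands at a location that is hard to predict.
    """
    rng = random.Random(seed)
    order = list(range(1, n))
    rng.shuffle(order)
    order = [0] + order
    next_index = [0] * n
    for here, there in zip(order, order[1:] + order[:1]):
        next_index[here] = there
    return next_index

def chase(next_index: list[int], steps: int) -> int:
    """Follow the chain: the location of each fetch is the value of the last."""
    position = 0
    for _ in range(steps):
        # This load depends on the result of the previous load.
        position = next_index[position]
    return position

chain = build_chain(100_000)
# A single cycle of length n returns to the start after exactly n steps.
print(chase(chain, len(chain)))  # prints 0
```

With a working set much larger than cache, a loop like `chase` spends most of its time waiting on memory, which is the dead time that additional SMT threads could fill.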

However, this may not be the best general purpose computing solution, in that there are important single-threaded tasks, or tasks that are not massively parallelizable, for which the existing powerful Intel core is the best. My thinking is that a mix of a few powerful cores and many smaller cores is the right solution, and that there should be a few smaller cores dedicated to special OS functions (interrupt handling and polling), in a blended asymmetric-symmetric arrangement.
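One way to explore the big/small mix is an Amdahl's-law style model, in the spirit of the Hill and Marty multicore model. The model is not from this post; it is a sketch that assumes performance scales with the square root of core size, that serial work runs on one big core when available, and that power limits and SMT are ignored. The function `speedup` and its parameters are hypothetical.

```python
import math

def speedup(parallel_fraction: float, budget: int, big_size: int, big_count: int) -> float:
    """Speedup relative to a single small core.

    budget:     total core area, in units of one small core
    big_size:   area of one big core, in small-core units
    big_count:  number of big cores; the rest of the area holds small cores
    """
    small_count = budget - big_count * big_size
    if small_count < 0:
        raise ValueError("big cores exceed the area budget")
    big_perf = math.sqrt(big_size)            # assumed square-root scaling
    serial_perf = big_perf if big_count > 0 else 1.0
    parallel_perf = big_count * big_perf + small_count
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction / serial_perf + parallel_fraction / parallel_perf)

BUDGET = 96     # area of 24 big cores, in quarter-size core units
BIG_SIZE = 4    # one big core = four small cores

all_big = speedup(0.9, BUDGET, BIG_SIZE, big_count=24)
all_small = speedup(0.9, BUDGET, BIG_SIZE, big_count=0)
mixed = speedup(0.9, BUDGET, BIG_SIZE, big_count=4)

print(f"all big: {all_big:.1f}  all small: {all_small:.1f}  mixed: {mixed:.1f}")
```

For a workload that is 90% parallel, this simplified model favors the mixed layout over either uniform one, though the ranking depends on the parallel fraction chosen.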

Published Tuesday, August 23, 2016 7:35 PM by jchang




RichB said:

Now that the BI edition is dead and the Standard edition crippled, what would be the win on this (other than it being kind of cool) rather than running a similar total CPU Hz on faster but fewer cores, as per some of your and PaulR's recent (last few years!) posts? 72 cores and 288 threads sounds impressive...

I have a feeling you answered my question in the last 3 paragraphs, but I'm not entirely sure :)

I'm also struggling with understanding possibly something quite basic - how many of these processors can one easily present to an OS as a single 'server'?  

Always find your hardware notes fascinating, so thanks :)

August 23, 2016 9:07 PM

RichB said:

Incidentally in the Xeon Phi Product Brief it does claim to be "binary compatible with Intel Xeon processors which allows it to run any x86 workload"

August 23, 2016 9:11 PM

jchang said:

The objective for the current Xeon Phi is proof of concept.

The concept is that the smaller, simpler Atom cores at 1.3GHz are not far off the powerful Broadwell cores at 2-3GHz+ in transactional queries, with the expectation that Broadwell at 3GHz+ is exceptional only for code running largely inside L2 cache.

Also, for transactions, SMT should be more than the 2 threads per core in the Xeons since Nehalem, because the nature of the code is serialized memory accesses, for which memory latency is everything and frequency does not help much. (In my earlier post advocating single socket for most environments, dropping the Xeon E5 v2 from 2.7GHz to 135MHz resulted in only a 3X reduction in performance for CPU-intensive SQL, less for IO.)
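The 2.7GHz to 135MHz result can be turned into a rough estimate of how much time is spent waiting on memory. This is a deliberately simple two-term model, not a measurement. It assumes run time is compute time, which scales with core frequency, plus stall time, which does not, and that memory latency in nanoseconds was unchanged by the frequency drop. The function name is made up.

```python
def split_compute_and_stall(freq_ratio: float, slowdown: float) -> tuple[float, float]:
    """Solve  compute + stall = 1  and  freq_ratio * compute + stall = slowdown.

    Returns the fractions of run time at full frequency that are
    frequency-dependent (compute) and frequency-independent (stall).
    """
    compute = (slowdown - 1.0) / (freq_ratio - 1.0)
    stall = 1.0 - compute
    return compute, stall

freq_ratio = 2700 / 135   # 20X lower clock
compute, stall = split_compute_and_stall(freq_ratio, slowdown=3.0)

# Idealized number of threads needed to keep one core busy if every
# stall could be overlapped with another thread's compute.
threads_to_fill = 1.0 / compute

print(f"compute: {compute:.1%}, memory stall: {stall:.1%}")
print(f"threads to fill the stalls (idealized): {threads_to_fill:.1f}")
```

Under these assumptions roughly nine-tenths of the time at 2.7GHz is stall, which is consistent with the argument for more than 2 threads per core. The exact thread count should not be taken literally.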

If what I say above is somewhat (in the direction of what is) true (with many caveats), then the next generation processor-system would have a mix of powerful Skylake/Cannon Lake cores (at 4-way SMT) and very many Phi cores. This could be a single socket system with a processor having both big and small cores, or a two socket system with one processor of big cores and a second of very many small cores, though I am favoring single socket.

It then becomes Microsoft’s obligation to price each class of core appropriately. I believe that AMD cores are currently charged at one-half that of an Intel core, so we should argue that the Atom core should be one-half that of the AMD.

And oh yeah, Microsoft would also need to rewrite their kernel to understand near and far memory, using each appropriately. And also for 3D XPoint, but they are already doing that. SQL Server would also need to know this.

August 24, 2016 8:58 AM

Craigmeister said:

Taking a step back from the focus on SQL Server for a moment, as a proof of concept, does this go down the path of a sort of processor-centric model of distributed computing? It seems instead of a distributed system of cheap computers (my crude description of Hadoop) or the proprietary distributed PLC I/O hardware of Netezza, Intel is playing with the idea of a distributed system of cheap processors. If so, what kind of processing load would that be designed for?

Supposedly, the GPU is best at FP calculations and is being tapped for ML and deep learning systems. Would a complementary system of Atom processors/cores provide similar or better capabilities?

August 30, 2016 9:48 AM

jchang said:

The Xeon Phi is not cheap. It may be cheap to build a chip with 4-8 Atom cores, but building a giant die is not. Networking cheap computers only works for workloads that can tolerate high latency.

As I said above, the nature of Moore's Law is that a complex core does not have linear gain with the number of transistors. Some workloads really need a powerful core, others work OK with a simpler core, and if parallelized, it is possible to aggregate more capability with simpler cores.

It is more correct to say the GPU is better suited to massive SIMD, integer or FP.

In summary: yes/no, it depends. There is no one solution for everything; in fact, a mixed solution might be better.

August 30, 2016 11:25 AM





About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
