
Joe Chang

Intel Processor Architecture 2020

There is an article on WCCFtech reporting that a new Intel processor architecture to succeed the Lake processors (Sky, Cannon, Ice and Tiger) will be "faster and leaner" and, more interestingly, might not be entirely compatible with older software. The original source is bitsandchips.it. I suppose it is curious that the Lake processors form a double tick-tock, or now process-architecture-optimization (PAO), but that skips Kaby and Cannon. Both the Bridge (Sandy and Ivy) and Well (Has and Broad) processors each had only one tick-tock pair.

Naturally, I cannot resist commenting on this. About time!

For perspective, in the really old days, processor architecture and instruction set architecture (ISA) were much the same thing. The processor implemented the instruction set, so that was the architecture. I am excluding the virtual-architecture concept, in which lower-cost versions would not implement the complete instruction set in hardware.

The Intel Pentium Pro was a significant step away from this, with micro-architecture and instruction set architecture now largely separate topics. The Pentium Pro had its own internal instructions, called micro-operations (µops). The processor dynamically decodes X86 instructions into these "native" micro-operations. This was one of the main concepts that allowed Intel to borrow many of the important technologies from RISC.
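
To make the decode concept concrete, below is a minimal sketch in Python of how complex X86 instructions might be cracked into simpler RISC-like micro-operations. The instruction forms and the µop breakdowns are illustrative assumptions on my part, not Intel's actual internal encodings.

# Illustrative only: a toy model of cracking X86-style instructions into
# RISC-like micro-operations. The breakdowns below are assumptions for
# illustration, not Intel's actual internal encodings.

DECODE_TABLE = {
    # register-to-register ALU op: already simple, one uop
    "add rax, rbx":   ["uop_add   rax, rax, rbx"],
    # load-op form: a load uop feeding an ALU uop
    "add rax, [rcx]": ["uop_load  tmp0, [rcx]",
                       "uop_add   rax, rax, tmp0"],
    # read-modify-write form: load, ALU, then store uops
    "add [rcx], rax": ["uop_load  tmp0, [rcx]",
                       "uop_add   tmp0, tmp0, rax",
                       "uop_store [rcx], tmp0"],
}

def decode(instruction):
    """Return the micro-operation sequence for one X86-style instruction."""
    return DECODE_TABLE[instruction]

if __name__ == "__main__":
    for insn, uops in DECODE_TABLE.items():
        print(f"{insn:16} -> {len(uops)} uop(s): {uops}")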

The Pentium 4 processor, codename Willamette, had a trace cache, which was a cache for decoded instructions. This may not have been in the Core 2 architecture that followed Pentium 4.

My recollection is that Pentium Pro had 36 physical registers, of which only 8 were visible to the X86 ISA. The processor would rename the ISA registers as necessary to support out-of-order execution. Pentium 4 increased this to 128 registers.
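
As a rough illustration of what renaming buys, here is a minimal sketch of a rename table mapping architectural register names onto a larger physical register file (36, per the recollection above). The allocation policy is a simplification for illustration only.

# Illustrative only: a toy register renamer. Architectural registers are
# mapped onto a larger physical register file (36 here, per the Pentium Pro
# recollection above). Each new write gets a fresh physical register,
# removing false write-after-write and write-after-read dependencies.

NUM_PHYSICAL = 36
free_list = [f"p{i}" for i in range(NUM_PHYSICAL)]   # unused physical registers
rename_map = {}                                      # architectural name -> physical register

def rename_write(arch_reg):
    """Allocate a fresh physical register for a write to arch_reg."""
    phys = free_list.pop(0)
    rename_map[arch_reg] = phys
    return phys

def rename_read(arch_reg):
    """Return the physical register currently holding arch_reg."""
    return rename_map[arch_reg]

if __name__ == "__main__":
    # Two back-to-back writes to eax land in different physical registers,
    # so the second write does not have to wait for consumers of the first.
    print("write eax ->", rename_write("eax"))   # p0
    print("write eax ->", rename_write("eax"))   # p1, independent of p0
    print("read  eax <-", rename_read("eax"))    # p1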

Also see MIT 6.838 and NJIT rlopes

The Nehalem micro-architecture diagrams do not mention a µop cache (the acronym is DSB, for Decoded Stream Buffer), but Sandy Bridge and subsequent processors do. This is curious because both Willamette and Nehalem are Oregon designs, while Core 2 and Sandy Bridge are Haifa designs.

The other stream that comes into this topic involves the Intel Itanium adventure. The original plan for Itanium was to have a hardware (silicon) X86 unit. Naturally, this would not be comparable to the then-contemporary X86 processors, which for Merced would have been the Pentium III, codename Coppermine, at 900MHz. So by implication, X86 execution would probably be comparable to something several years old, a Pentium II 266MHz with luck, and Itanium was not lucky.

By the time of Itanium 2, the sophistication of software CPU emulation was sufficiently advanced that the hardware X86 unit was discarded. In its place was the IA-32 Execution Layer. Also see the IEEE Micro paper on this topic. My recollection is that the Execution Layer emulation was not great, but not bad either.

The two relevant technologies are: one, the processor having native µops instead of the visible X86 instructions, and two, the Execution Layer for non-native code. With this, why is the compiler still generating X86 binaries (ok, Intel wants to call these IA-32 and Intel 64 instructions)?

Why not make the native processor µops visible to the compiler? When the processor detects a binary with native micro-instructions, it can bypass the decoder. Also, make the full set of physical registers visible to the compiler. If Hyper-Threading is enabled, then the compiler should know to use only the correct fraction of registers.
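
A minimal sketch of what I have in mind follows. It is entirely hypothetical: the native-µop binary flag, the 128-register figure (from the Pentium 4 recollection), and the halve-the-budget-under-Hyper-Threading policy are my assumptions, not anything Intel has proposed.

# Entirely hypothetical sketch of the proposal above. A binary carries a flag
# saying whether it was compiled to native micro-ops; the front end either
# bypasses the X86 decoder or decodes as usual. The register budget the
# compiler may target is halved when Hyper-Threading shares the core.

PHYSICAL_REGISTERS = 128   # assumed size of the physical register file

def frontend_path(binary_is_native_uops):
    """Pick the front-end path for a binary (hypothetical)."""
    if binary_is_native_uops:
        return "feed micro-ops straight to the scheduler (decoder bypassed)"
    return "decode X86 instructions into micro-ops, as today"

def compiler_register_budget(hyper_threading_enabled):
    """Registers the compiler may allocate per thread (hypothetical policy)."""
    return PHYSICAL_REGISTERS // 2 if hyper_threading_enabled else PHYSICAL_REGISTERS

if __name__ == "__main__":
    print(frontend_path(True))
    print(frontend_path(False))
    print("register budget, HT on :", compiler_register_budget(True))
    print("register budget, HT off:", compiler_register_budget(False))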

I am inclined to also say that the more the compiler knows about the underlying hardware, the better it can generate binaries to fully utilize the available resources, with less reliance on the processor doing dynamic scheduling for parallelism. But of course, that was what Itanium was, and we would need to understand why Itanium did not succeed. My opinion is that EPIC was really better suited to scientific computing than to logic-heavy server applications.

Have one or two generations of overlap, during which Microsoft and the Linux players make a native micro-op operating system. Then ditch the hardware decoders for X86. Any old code would then run on the Execution Layer, which may not be 100% compatible. But we need a clean break from the old baggage or it will sink us.

Off topic, but who thinks legacy baggage is sinking the Windows operating system?

Addendum

Of course, I still think that one major issue is that Intel is stretching their main-line processor core over too broad a spectrum. The Core is used in both high-performance and high-efficiency modes. For high performance, it is capable of well over 4GHz, probably limited more by power than by transistor switching speed. For power efficiency, the core is throttled to 2 or even 1 GHz.

If Intel wants to do this in a mobile processor, it is probably not that big a deal. However, in the big server chips, with 24 cores in Xeon v4 and possibly 32 cores in the next generation (v5), it becomes a significant matter.

The theory is that if a given core is designed to operate at a certain level, then doubling the logic should achieve a 40% increase in performance. So if Intel is deliberately de-rating the core in the Xeon HCC die, then they could build a different core targeted specifically at one half the original performance, with perhaps one quarter the complexity.

So it should be possible to have 100 cores, each with half the performance of the 4GHz-capable Broadwell core, i.e., equivalent to Broadwell at 2GHz? If this supposed core were very power efficient, then perhaps we could even support the thermal envelope of 100 mini-cores?
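
The back-of-the-envelope arithmetic, assuming only the rule of thumb (Pollack's rule) that single-core performance scales roughly with the square root of logic complexity:

# Back-of-the-envelope arithmetic for the two paragraphs above, assuming only
# the rule of thumb (Pollack's rule) that single-core performance scales
# roughly with the square root of logic complexity.
import math

def perf_from_complexity(relative_complexity):
    """Relative single-core performance for a given relative complexity."""
    return math.sqrt(relative_complexity)

def complexity_from_perf(relative_perf):
    """Relative complexity needed to reach a given relative performance."""
    return relative_perf ** 2

if __name__ == "__main__":
    # Doubling the logic buys roughly a 40% performance increase.
    print("2x logic         ->", round(perf_from_complexity(2.0), 2), "x performance")

    # A core targeted at half the performance needs about one quarter the logic.
    print("0.5x performance ->", complexity_from_perf(0.5), "x complexity")

    # 100 such mini-cores: aggregate throughput and area, relative to one big core.
    mini_cores = 100
    print("aggregate throughput:", mini_cores * 0.5, "x one big core")
    print("total area          :", mini_cores * complexity_from_perf(0.5), "big-core equivalents")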

Of course, not every application is suitable for wide parallelism. I would like to see Intel do a processor with mixed cores. Perhaps 2 or 4 high performance cores and 80 or so mini-cores?

A really neat trick would be if the GPU were programmable, but perhaps the graphics vendors already have things along this line?

Published Tuesday, December 27, 2016 12:19 PM by jchang


Comments

 

Anon said:

>Why not make the native processor µops visible to the compiler? When the processor detects a binary with native micro-instructions, it can bypass the decoder.

Since you asked: The instructions can never "bypass the decoder". The output side of the decoder is much too wide (even the old 6502 had more than 100 bits of control lines; just imagine what a modern x86 has) for it to be directly encoded in the program with any efficiency -- the I-cache would be swamped. Since some kind of decoder is always necessary, it makes sense anyway to decouple the external instructions from the internal operations somewhat, to allow for flexibility in future implementation changes.

Also, the internal µops on modern Intel chips are quite close to the external instructions anyway, especially in the domain of fused µops. The vast majority of x86 instructions actually decode 1-to-1 into µops -- it's just a more efficient encoding of the internal instructions.

>Also make the full set of physical registers visible to the compiler?

The whole point of having rename registers is that the processor can assign them dynamically to, e.g., subsequent iterations of a loop, instead of having the compiler do unrolling and register allocation. Or over subroutine boundaries, where the compiler can't even do optimal register allocation. It leads to fewer instructions (saving I-cache), smaller instructions since fewer registers need to be encoded (again saving I-cache), and a simpler programming model, with pretty much no downsides.

>I am inclined to also say that the more the compiler knows about the underlying hardware, the better it can generate binaries to fully utilize available resources

This is true, but the compiler already knows more than enough about the hardware to do that. This is what the numerous CPU profiles in GCC are for.

>with less reliance on the processor doing dynamic scheduling for parallelism

This kind of thinking was the entire reason why Itanium failed. The processor can do more efficient and optimal allocation of resources than a compiler will ever be able to, since it can adapt to runtime circumstances such as varying load latencies, and, again, fix register allocation around subroutine boundaries.

May 4, 2017 1:50 PM


