Intel Lunar Lake CPU Architecture Deep-Dive: Lion Cove +14% IPC, Skymont IPC More Than Raptor Cove, Next-Gen Power Management & Scheduling

Intel has introduced Lunar Lake, its most advanced, efficient & ground-breaking SOC to date, and we are taking a deep dive into its P-Core & E-Core architectures and more.

Lunar Lake has been the talk of the town ever since it was first unveiled by Intel, and today the company is finally taking the wraps off the chip to help us understand what makes it tick. The design goals for the Lunar Lake CPU were simple: build a highly efficient SOC designed to cater to next-gen AI PC platforms such as Microsoft Copilot+. Some of the achievements of Lunar Lake include:

  • Breakthrough x86 power efficiency
  • Exceptional Core Performance
  • Massive Leap in Graphics
  • Unmatched AI Compute

    So before we get into the in-depth details, let's take a quick overview of Intel's Lunar Lake. It all begins with the construction, which uses several packaging technologies to house several tiles.

    The Lunar Lake SOC has 7 main components, starting with the interposer package. This package hosts the memory, the stiffener, and the Base Tile, which uses the Foveros interconnect to combine the Compute Tile and the Platform Controller Tile. You might also notice that Intel went with far fewer tiles on Lunar Lake versus Meteor Lake; that was done to maximize efficiency and minimize latency overhead. As for process nodes, the Lunar Lake Compute Tile is made on TSMC's N3B and the Platform Controller die uses the TSMC N6 process node.

    Lunar Lake is also Intel's first chip to feature on-package memory, which comes in 16 GB & 32 GB (dual-rank) LPDDR5X configurations running at up to 8533 MT/s per chip. The memory uses a 16b x4 channel layout and achieves 40% lower PHY power along with a 250 mm² area saving versus a traditional PCB-embedded design.
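As a quick back-of-the-envelope check on those memory figures, the peak bandwidth works out like this (assuming the "16b x4" layout means four 16-bit channels per memory chip and that two on-package chips form a 128-bit combined bus, which the article does not spell out):

```python
# Hedged back-of-the-envelope: peak LPDDR5X bandwidth for Lunar Lake's
# on-package memory. The 128-bit total bus width is our assumption
# (2 chips x 4 channels x 16 bits), not a figure stated above.
transfer_rate_mts = 8533                  # mega-transfers/s per pin
bus_width_bits = 2 * 4 * 16               # assumed combined bus width
peak_gbs = transfer_rate_mts * 1e6 * bus_width_bits / 8 / 1e9
print(f"{peak_gbs:.1f} GB/s")             # ~136.5 GB/s theoretical peak
```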

    With the outside of the chip covered, it's now time to briefly glance at the chip's 8-core hybrid design, which still employs a P-Core and E-Core configuration with 4 cores each. These cores are backed by a brand-new Thread Director. As for the P-Cores, you are getting a PPA-optimized core design for improved single-threaded performance, 2.5 MB of L2 cache per core, and up to 12 MB of shared L3 cache. The E-Cores feature 4 MB of shared L2 cache per cluster & offer twice the vector & AI throughput.

    Then we have the new Xe2 GPU which offers 8 Xe cores, 8 new Ray Tracing Units, XMX support, & 8 MB of dedicated cache along with brand new media & display capabilities. The Lunar Lake SOC adds a whopping 120 Platform TOPS to the mix with 48 TOPS coming from the NPU, 67 TOPS coming from the GPU, and around 5 TOPS from the CPU.
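The 120 Platform TOPS figure is simply the sum of the three engines' contributions, which is easy to sanity-check:

```python
# Sanity check on the platform TOPS breakdown quoted above.
npu_tops, gpu_tops, cpu_tops = 48, 67, 5
platform_tops = npu_tops + gpu_tops + cpu_tops
print(platform_tops)                      # 120 Platform TOPS
print(f"{npu_tops / platform_tops:.0%}")  # NPU share: 40%
```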

    With that said, Intel announced 80+ designs across 20+ partners for its Lunar Lake SOCs. The company anticipates a Q3 launch with wider availability starting in Q4 2024. Intel also announced a brand new AI PC developer kit based on the Lunar Lake SOC which will be available around the same timeframe and allow developers to develop new AI PC experiences and fine-tune them for the Lunar Lake chips. This Dev Kit will also be compatible with future CPUs such as Panther Lake.

    Now we switch gears to the architectural deep dives for Lunar Lake:

    One of the two new core architectures that Intel has incorporated within Lunar Lake CPUs is called Lion Cove and it's a "P" core that is tuned for higher performance.

    It's the direct successor to the Redwood Cove P-Core featured on the Meteor Lake CPUs and is designed to deliver:

  • Performance & Area Efficiency - Optimize ST perf/watt and perf/area for client SOCs
  • Overall Microarchitecture - Generation IPC improvement and future scalability
  • Modernize design database - Accelerate innovation going forward

    With Lion Cove, Intel is shifting its entire Hyper-Threading strategy. Typically, you'll see SMT support on modern-day chips, which adds a second thread per core. On existing chips such as Meteor Lake or Raptor Lake, Hyper-Threading provides up to +30% throughput at +20% Cdyn (power at the same V/F curve). There are three types of scheduling handled on a hybrid client device: OS scheduling across P-Cores (no HT), E-Core clusters, or P-Cores (with HT).

    When the Lion Cove core architecture was in development, it was going to target the traditional CPU market which meant that we would've seen a variant of Lion Cove which retained hyper-threading or SMT support. However, when Lion Cove P-Cores were chosen for Lunar Lake, the development team had to check every available transistor and see if it made sense to incorporate it into the product.

    Looking at the really low power characteristics of Lunar Lake, it was ultimately decided to drop SMT support, which resulted in better performance throughput and efficiency; hence, the final design of Lion Cove eliminated the need for HT/SMT onboard. Transactional Synchronization Extensions (TSX) and Advanced Matrix Extensions (AMX) have been removed too. Intel more or less removed any transistor that didn't add to the processor's productivity.

    Removing the Hyper-Threading logic and optimizing the core leads to a +15% performance-per-power increase, +10% performance/area, and +30% perf/power/area in single-threaded workloads versus the same core with the Hyper-Threading logic left in.

    Even against a hyper-threaded implementation running multi-threaded code, the new approach still yields +5% better performance/power and +15% better perf/power/area. The only downside is performance/area, which falls by 15%.

    Lion Cove comes with a brand-new AI self-tuning controller which acts as a thermal management controller that adapts to real-time operating conditions. This allows the core to run at higher frequencies and achieve higher sustained performance versus previous static techniques.

    The core also features finer clock granularity, which extracts more performance from a given core power budget: Lion Cove gets a +2% benefit from its 16.67 MHz frequency intervals versus the 100 MHz intervals on last-generation chips.
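Why finer bins matter can be shown with a toy calculation (illustrative only, not Intel's actual controller logic): a DVFS controller can only sit on multiples of its interval, so it rounds the allowed frequency down to the nearest bin, and a coarser interval leaves more headroom unused.

```python
# Toy illustration of frequency quantization. The 3190 MHz thermal
# limit is a made-up example, not an Intel spec.
def achievable(limit_mhz: float, bin_mhz: float) -> float:
    """Highest bin multiple at or below the allowed frequency."""
    return (limit_mhz // bin_mhz) * bin_mhz

limit = 3190.0
print(achievable(limit, 100.0))    # 3100.0 -> 90 MHz left unused
print(achievable(limit, 16.67))    # ~3183.97 -> only ~6 MHz unused
```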

    Diving into the Lion Cove P-Core architecture, the front-end on the new core features an 8x larger prediction block than the previous generation. There's also a wider fetch and increased decode bandwidth. Both the Uop cache and Uop queue (192 entries) have been increased in capacity, along with the read bandwidth.

    The Out of Order Engine sees a split of the INT & VEC domains, each with independent renaming and schedulers. The engine comes with 8-wide allocation/rename units. Following are the improvements versus Redwood Cove:

  • 8 Wide alloc/rename (versus 6)
  • 12 wide retirement (versus 8)
  • 576 deep instruction window (versus 512)
  • 18 execution ports (vs 12)

    On the integer side, you are getting:

  • 6 Integer ALU (versus 5)
  • 3 Jump units (versus 2)
  • 3 shift units (versus 2)
  • 3 mul 64x64->64 (versus 1)

    On the Vector side, you are getting:

  • 4 SIMD ALUs "256b" (versus 3)
  • 2 FMA @ 4 cycle latency "256b"
  • 2 FP dividers "256b" (versus 1)

    Intel's Lion Cove significantly changes the memory subsystem with a new 3-level cache hierarchy made up of L0, L1, and L2. The L0 cache has a capacity of 48 KB and a 4-cycle load-to-use latency (3x256b loads for Lunar Lake and 2x512b for Arrow Lake).

    The L1 cache has a 9-cycle load-to-use latency (2x64B) with 192 KB for L1d and 64 KB for L1i. The L2 cache has a 16-cycle load-to-use latency (2x64B) with 2.5 MB of L2 per core on Lunar Lake and 3 MB per core on Arrow Lake CPUs. The memory subsystem also gets a 128-page DTLB (versus 96) and 3 STA AGUs (versus 2).

    Lion Cove P-Core IPC, Performance & Efficiency For Lunar Lake Mobile

    Now for the most important aspect of the Intel Lion Cove core architecture: its IPC. Intel states a +14% IPC improvement for LNC cores versus the RWC (Redwood Cove) cores, and the performance scales across different power levels, with the biggest benefits seen at the lowest power figures, yielding an increase of over 18%. This is a double-digit IPC increase and a big update versus the previous generations.

    One big advantage of the Lion Cove core is that it's 99% process agnostic, which means that it is compatible with pretty much any node; that wasn't possible with previous designs, which were made for a specific process node.

    The second core incorporated within Lunar Lake is called Skymont and it is an E-core that is optimized for efficiency.

    Skymont is the direct successor to the Crestmont core which we saw on Meteor Lake CPUs and comes with some huge updates in terms of performance and efficiency. Some of the highlights include:

  • Increase workload coverage - Increase range of Low Power island & MT Performance
  • Double vector & AI Throughput - For Increased VNNI capability support
  • Increase Scalability - For overall performance uplift

    Starting with the details, Skymont comes with an updated prediction block of 128 bytes, faster "find the next" instruction lookups, and 96 instruction bytes for parallel fetch. Skymont also features a wider decode, which includes 9-wide (3x3) decode, or 50% more decode clusters than the Crestmont E-Cores, nanocode that unlocks microcode parallelism per cluster, and a Uop queue capacity that's increased from 64 entries to 96 entries.

    Moving to the Out of Order Engine (OOE), we are looking at an 8-wide allocation and 16-wide retire, which means that resources can be allocated and cleared faster. That's 2 additional allocation and 8 additional retire connections versus the last-gen Crestmont cores.

    Queuing also gets more resources, with the out-of-order window now growing to 416 entries. Dispatch ports have been increased to 26, which include 8 integer ALUs, 3 jump ports, and 3 loads per cycle.

    Vector performance is upgraded to a 4x 128-bit floating point and SIMD vector pipeline for double the GFLOPS/TOPS. The added FMUL, FADD, and FMA units also bring shorter latencies, and there's native hardware support for floating point rounding. Intel has also added additional execution units within Lunar Lake CPUs that lead to improved AI performance. Loads have been increased from 2 to 3 (128-bit), stores have been increased from 2 to 4, while the shared L2 TLB grows from 3072 to 4192 entries for both code and data.
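To put the wider vector engine in perspective, here's a rough sketch of the per-core FP32 throughput those four 128-bit pipes imply, assuming all four can issue an FMA every cycle (our assumption; the article doesn't break down per-pipe issue restrictions):

```python
# Rough FP32 throughput per cycle for a 4x 128-bit FMA configuration.
# Assumes every pipe issues one FMA per cycle -- an illustrative upper
# bound, not a confirmed Skymont issue diagram.
pipes = 4                      # 4x 128-bit FP/SIMD pipes
fp32_lanes = 128 // 32         # 4 FP32 lanes per 128-bit pipe
flops_per_fma = 2              # fused multiply-add = 2 FLOPs
per_cycle = pipes * fp32_lanes * flops_per_fma
print(per_cycle)               # 32 FP32 FLOPs per cycle per core
```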

    The memory subsystem sees enhancements across the board, with the L2 cache now at 4 MB per four-core cluster, double the bandwidth (from 64B to 128B per cycle), and faster L1-to-L1 transfers with more predictable communication. This is achieved by keeping the transfer within the cluster's shared L2 cache instead of going out to the fabric. Eviction bandwidth has also been doubled from 16 bytes to 32 bytes per clock.

    Skymont E-Core IPC, Performance & Efficiency For Lunar Lake Mobile

    The Skymont E-Core architecture is also scalable across multiple platforms; on Lunar Lake, it powers the Low Power Island, a performance-efficiency setup that leverages the low-power fabric and system cache to deliver increased workload coverage.

    For Lunar Lake, IPC has been estimated at +38% in integer (SPECrate2017_int_base est / GCC) and +68% in floating point (SPECrate2017_fp_base) versus the Crestmont E-Cores on the Meteor Lake CPUs.

    You are getting the following improvements in single-threaded workloads:

  • Same Performance at 1/3 Power (Versus Crestmont LP-E)
  • 70% Higher Performance at ISO (Versus Crestmont LP-E)
  • 2x Higher Performance at Peak Power (Versus Crestmont LP-E)

    And the following in multi-threaded workloads:

  • Same performance at 1/3 Power (Versus Crestmont LP-E)
  • 2.9x Higher Performance at ISO (Versus Crestmont LP-E)
  • 4x Higher Performance at Peak Power (Versus Crestmont LP-E)

    As mentioned above, Intel's Skymont E-Core is quite scalable, and in Arrow Lake we can see the full capabilities of this architecture with a 2% IPC improvement over the Raptor Cove P-Core in both integer and floating point. Intel showcases how Skymont ends up being as fast as Raptor Cove at 0.6x the power while offering 1.2x higher performance at the same power. That's quite a leap over the Crestmont and Gracemont E-Cores that came before it.
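Those two comparison points translate directly into perf-per-watt ratios; this is just arithmetic on the quoted figures, not a new measurement:

```python
# Skymont vs. Raptor Cove efficiency, derived from the quoted points.
# Point 1: same performance at 0.6x the power.
iso_perf_efficiency_gain = 1.0 / 0.6       # ~1.67x perf/W
# Point 2: 1.2x the performance at the same (ISO) power.
iso_power_efficiency_gain = 1.2 / 1.0      # 1.20x perf/W
print(f"{iso_perf_efficiency_gain:.2f}x")
print(f"{iso_power_efficiency_gain:.2f}x")
```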

    Also, the Lunar Lake Low-Power Island has its own dedicated voltage rail and sits closer to the DRAM. This means that the decoupled E-Core can now scale well for both low-power and high-performance products. Intel also revealed that there's a 5% delta between Skymont on the low-power island and Skymont on a dedicated engine.

    Starting with Lunar Lake CPUs, Intel is adding a brand new Thread Director upgrade that evolves the scheduler to better utilize the P & E-Cores on the chip. First introduced on Alder Lake CPUs & extended to Raptor Lake, Thread Director first scheduled the high-demand workloads to the P-Cores and low-demand workloads to the E-Cores. Over time, the work that was provided to the E-Cores moved to the P-Cores where it could be done faster.

    However, in its current state, we have seen that Thread Director has some drawbacks, especially in gaming scenarios where work being moved to E-Cores is not only slower but also introduces latency bottlenecks, causing unwanted stutter during the switch. To overcome this, certain game engines prioritize P-Cores, and users have also resorted to disabling the E-Cores entirely to get the best performance.

    With Meteor Lake, Intel introduced its LP E-Cores where the work was first scheduled and if the Thread Director saw that the work was exceeding capacity, it was moved over to the standard P and E-Cores on the compute tile.

    For Lunar Lake, the Thread Director starts at the E-Cores, but high-performance options will start at the P-Cores, giving OEMs the flexibility to adjust the scheduling to their needs. Thread Director provides a hint on which core to put the workload on; the ultimate decision rests with the OS.

    Intel also brings new foundations to Thread Director which include:

  • Enhanced algorithms for workload classification
  • Finer Granularity in workload handling
  • Very low power/thermal hint to OS for experience continuity

    Intel is also introducing OS Containment Zones within the Windows OS, which reads the initialization table and sets up zones with PPM (Processor Power Management) parameters.

    These include an Efficiency Zone, which schedules work to the E-Cores, a Hybrid/Compute Zone, which schedules work to the P-Cores, and a "Zoneless" mode, which schedules work across both P-Cores and E-Cores. These zones constrain workloads to only those cores and keep the rest of the compute tile's cores either parked or idle.
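Conceptually, a containment zone boils down to an affinity mask plus a parking policy. Here's a toy sketch of that idea (purely illustrative; the core IDs, zone names, and function are hypothetical, not Intel's or Windows' actual API):

```python
# Hypothetical model of OS Containment Zones: each zone maps to the set
# of cores its workloads may use; cores outside the set can be parked.
P_CORES = {0, 1, 2, 3}   # made-up IDs for the 4 P-cores
E_CORES = {4, 5, 6, 7}   # made-up IDs for the 4 E-cores

ZONES = {
    "efficiency": E_CORES,          # Efficiency Zone -> E-cores only
    "hybrid_compute": P_CORES,      # Hybrid/Compute Zone -> P-cores only
    "zoneless": P_CORES | E_CORES,  # Zoneless -> all cores available
}

def allowed_cores(zone: str) -> set:
    """Cores a workload in `zone` may run on; the rest stay parked/idle."""
    return ZONES[zone]

print(sorted(allowed_cores("efficiency")))  # [4, 5, 6, 7]
```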

    Power management also sees an upgrade as an internal block within the SOC. This block has three SOC power management profiles which include Best Efficiency Mode, Balanced Mode & Performant Mode. The SOC frequency and scheduling are based on the respective mode chosen by the OS scheduler and the power management block directly communicates with the Intel Thread Director engine.

    With these two engines, Intel manages up to 35% power savings with containment and power management optimization enabled in applications such as Microsoft Teams.

    OEMs will have the freedom to select the optimization gears using the Intel Dynamic Tuning Technology for their respective products. Intel also provides a teaser at the future of Thread Director which is expected to leverage increased scenario granularity, AI-based scheduling hints, and Cross IP Scheduling. These innovations might come to Panther Lake next year or the chips after that.

    One of the biggest selling points of the upcoming AI PC platforms will be the TOPS each respective platform has to offer. Generally, GPUs are seen as the main component for handling AI compute, but more recently, NPUs have started to take up market share as they offer low overhead and are designed specifically for AI processing. That means they only run when needed, leading to a lower-power, more efficient way to do AI.

    Intel shows that while CPU and GPU make up the majority of the AI market, NPUs are expected to see further adoption as we move forward. As such, the company is offering 120 peak TOPS with its Lunar Lake SOCs with the NPU alone amounting to 40% of the total processing power.

    Within Lunar Lake SOCs, Intel has integrated its 4th Generation NPU architecture, called NPU 4, which offers twice the power efficiency and 48 peak TOPS, a 4.36x increase over the NPU in Meteor Lake SOCs, which offered just 11 TOPS. So how does Intel get from NPU 3 to NPU 4? Well, the answer is scalability.

    NPU 4 is by all means a bigger and enhanced version of NPU 3 across the board. It improves the architecture, it increases the number of engines and it pushes the frequency further.

    These upgrades are necessary since AI relies mostly on vector and matrix operations, which are quite complex. NPU 4 gets 12K MACs versus 4K MACs on the last gen, and the NCEs, or Neural Compute Engines, are increased from 2 to 6. Each MAC array is still 2048-wide, but it's a 16x16x8 array for INT8 and a 16x16x4 array for FP16 data types.

    The NPU also offers a higher clock rate, dialing it up to 1.95 GHz from the previous 1.4 GHz on Meteor Lake SOCs. NPU 4 is said to offer twice the performance at ISO power and 4x the peak performance versus NPU 3. NPU 4 also comes with an upgraded SHAVE DSP with 4x the vector compute and 12x the overall vector performance, which improves transformer/LLM performance. The vector register file is now 512-bit, and there's 4x the bandwidth to and from the SHAVE DSP. The DMA engine also gets 2x higher bandwidth and new functions for embedding tokenization.
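Those specs line up neatly with the headline 48 TOPS number if each MAC is counted as two INT8 operations (a multiply plus an accumulate) per cycle, which is the usual TOPS convention:

```python
# Back-of-the-envelope NPU 4 peak INT8 throughput from the quoted specs.
macs = 12 * 1024                     # "12K MACs" (6 engines x 2048-wide)
ops_per_mac = 2                      # multiply + accumulate
freq_hz = 1.95e9                     # quoted NPU 4 clock
tops = macs * ops_per_mac * freq_hz / 1e12
print(f"{tops:.1f} TOPS")            # ~47.9, matching the 48 peak TOPS
print(f"{48 / 11:.2f}x")             # 4.36x over Meteor Lake's 11 TOPS
```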

    So overall improvements for NPU 4 vs NPU 3 include:

  • 12x Higher Vector Performance
  • 4x Higher AI TOPS
  • 2x Higher IP Bandwidth

    So what do all of these numbers mean for power efficiency? Well, in Stable Diffusion, Lunar Lake SOCs provide significant power savings versus Meteor Lake SOCs while delivering over a 4.5x performance jump.

    Intel Lunar Lake also features updated connectivity in the form of Wi-Fi 7 and Thunderbolt 4 support.

    Lunar Lake laptops will feature up to 3 Thunderbolt 4 ports and offer up to 25% better read and write speeds with Thunderbolt 5 SSDs. Furthermore, these experiences will be accelerated using Thunderbolt Share, which adds a new level of productivity by allowing multi-PC connectivity.

    For Wi-Fi, Intel is integrating the latest Wi-Fi 7 technologies on the SOC itself rather than making use of a discrete module, as was the case with the prior Meteor Lake generation. New features include Wi-Fi proximity sensing, Bluetooth over PCIe (versus USB), up to 55% reduction in boot and wake-from-sleep times for Bluetooth, and low-power gaming and productivity, all while lowering costs and reducing the footprint.

    The integrated Wi-Fi 7 solution on Lunar Lake has a 28% smaller silicon footprint than the BE200 networking interface and features the CNVio 3 interface at 11 Gbps (versus 5 Gbps for CNVio 2). There's also RF Interference Mitigation Technology, which dynamically adjusts the DDR clock frequency, something that has a major impact on Wi-Fi performance. The most important aspect of the Wi-Fi 7 integration on Lunar Lake SOCs is Multi-Link Operation, or MLO, which adds increased reliability, a boost in throughput, latency improvements, and traffic separation/differentiation.

    Security is also an important aspect of the Lunar Lake SOCs, especially the EVO platforms. Lunar Lake has several built-in security engines providing hardware security such as the Intel SSE (Silicon Security Engine), Intel GSC (Graphics Security Controller), CSME (Converged Security & Manageability Engine), and Intel PSE (Partner Security Engine).

    That covers just about everything about Intel's Lunar Lake SOCs. The chips are expected to ship next quarter. Intel isn't sharing any information on the SKUs, their performance & pricing just yet, but we can expect those details as we get closer to launch.
