AMD Zen 5 Core Architecture Breakdown At Hot Chips: Zen For A New Chapter In High-Performance Computing

At Hot Chips, AMD is offering an in-depth look at its brand-new Zen 5 core architecture, which will power its next generation of high-performance PC products.

AMD's Zen 1 core architecture first launched back in 2017, and since then the company has introduced five new architectures (Zen+, Zen 2, Zen 3, Zen 4, Zen 5). AMD started the decade by launching the Zen 3 architecture, which brought a 19% IPC improvement, an 8-core complex, and a larger L3 cache per CCX while using the 7nm/6nm process technologies.

The company followed up with Zen 4, which brought another 14% IPC improvement, AVX-512 instructions (on an FP-256 datapath), a doubled L2 cache of 1 MB, support for VNNI/BFLOAT16, and a move to the 5nm and 4nm process technologies.

This year, AMD introduced Zen 5, its latest high-performance core architecture, which brings a 16% IPC uplift, AVX-512 with FP-512 variants, 8-wide dispatch, 6 ALUs, dual-pipe fetch/decode, and 4nm/3nm process technologies. Today, AMD is deep-diving into the full Zen 5 architecture at Hot Chips.
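To put the three quoted IPC figures side by side, the short sketch below multiplies them together. It assumes the per-generation gains compound, which is only an approximation because AMD measured each uplift on its own workload mix; the names `IPC_UPLIFTS` and `cumulative_ipc` are made up for this illustration.

```python
# IPC uplift claims quoted in the article, expressed as fractions.
IPC_UPLIFTS = {"Zen 3": 0.19, "Zen 4": 0.14, "Zen 5": 0.16}


def cumulative_ipc(uplifts):
    """Multiply per-generation gains (assumption: the gains compound)."""
    total = 1.0
    for gain in uplifts.values():
        total *= 1.0 + gain
    return total


if __name__ == "__main__":
    total = cumulative_ipc(IPC_UPLIFTS)
    # Baseline is the generation before Zen 3.
    print(f"Cumulative IPC vs. the pre-Zen 3 baseline: {total:.2f}x")
```

Under that assumption the three generations together work out to roughly 1.57x the IPC of the core that preceded Zen 3.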

AMD starts by stating the design objectives for Zen 5. In terms of performance, Zen 5 aims to deliver another major 1T (single-thread) and NT (multi-thread) performance increase, balance cross-core 1T/NT instruction and data throughput, add front-end parallelism, increase execution parallelism, deliver high throughput with efficient data movement and prefetching, and support AVX512/FP512 data paths for throughput and AI uplifts. At the same time, AMD wants to add new capabilities, such as additional ISA extensions and new security features, along with expanded platform support through its Zen 5 and Zen 5C core variants.

Following is an overview of AMD's Zen 5 core architecture:

2 Threads/Core

NextGen Branch Predictor

Caches:

  • I-Cache: 32KB, 8-way; 2x 32B fetch/cycle
  • Op-cache: 6K inst; 2x 6-wide fetch/cycle
  • D-Cache: 48KB, 12-way; 4 mem ops/cycle
  • L2-Cache: 1MB, 16-way

Dual I-Fetch/decode pipes, 4 inst/pipe

8 ops/cycle dispatched to integer or FP

Execution capabilities:

  • 6 Integer ALU
  • 4 AGU, 4 addresses to LS per cycle
  • 4 FP ops/cycle; 2-cycle FADD

TLBs:

  • L1: 64-entry ITLB, 96-entry DTLB
  • L2: 2K ITLB, 4K DTLB (everything but 1G)
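The cache sizes and associativities above determine how many sets each cache has. The sketch below works that out with the usual relation sets = size / (ways x line size). The 64-byte line size is an assumption (the article does not state it), and `CACHES` and `geometry` are illustrative names.

```python
LINE_BYTES = 64  # assumption: 64-byte cache lines, not stated in the article

# (capacity in bytes, associativity) as listed in the overview above.
CACHES = {
    "L1I": (32 * 1024, 8),
    "L1D": (48 * 1024, 12),
    "L2": (1024 * 1024, 16),
}


def geometry(size_bytes, ways, line_bytes=LINE_BYTES):
    """Return (number of sets, bytes per way) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    way_bytes = size_bytes // ways
    return sets, way_bytes


if __name__ == "__main__":
    for name, (size, ways) in CACHES.items():
        sets, way_bytes = geometry(size, ways)
        print(f"{name}: {sets} sets, {way_bytes // 1024} KiB per way")
```

With these inputs the L1I and L1D both come out at 64 sets and 4 KiB per way, and the L2 at 1024 sets and 64 KiB per way. The per-way figure does not depend on the assumed line size.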
As for what Zen 5 offers to provide balanced throughput, you are looking at:

Front End parallelism:

  • 2 predicted taken branches per cycle
  • 2x Op cache pipes
  • 2x instruction fetch/decode pipes
  • 8-wide dispatch

Execution:

  • Integer: 6 ALU, 4 AGU addresses -> LS
  • FPU: full 512b AVX512 datapaths
  • FPU: 4 execution pipes

Dataflow:

  • 4 load pipes support two 512b AVX512 pipes
  • 48K, 12-way L1D cache delivers 4 mem ops/cycle
  • 2x width L2 cache <-> L1I and L1D caches
In terms of Fetch Advances, AMD's Zen 5 core architecture offers:

Branch Prediction: fewer bubbles, more accuracy, and throughput

  • Zero-bubble conditional branches
  • L2-sized (16K) L1 BTB and larger TAGE
  • Larger return address stack (52 entry)
  • 2 taken predictions/cycle

Memory management:

  • Aggressive fetch hides L2 & table walk latencies
  • 4x the L2 ITLB (2048 entry)

Icache latency and bandwidth:

  • 64B/cycle fetch
  • 2 instruction fetch streams
In terms of Decode Advances, AMD's Zen 5 core architecture offers:

Opcache: higher density with greater coverage and throughput

  • 33% more entry associativity (16-way)
  • Dense entries store 6 instructions or fused instructions, not ops
  • 2 OC pipes x 6 inst/pipe -> 12 inst/cycle

Dual Decode Pipes

  • 2 pipes support parallel independent instruction streams/basic blocks
  • 4 inst/cycle throughput per pipe
  • SMT mode gives each thread a pipe

8-wide dispatch to Int and FP execution
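The per-pipe numbers above can be combined into peak front-end rates. The sketch below does the multiplication and then caps each path at the 8-wide dispatch limit. It treats one instruction as one dispatched op, which is a simplifying assumption, and `PEAK_RATES` and `sustained_peak` are names invented for this example.

```python
# Figures quoted in the article.
OP_CACHE_PIPES, OP_CACHE_INST_PER_PIPE = 2, 6
DECODE_PIPES, DECODE_INST_PER_PIPE = 2, 4
DISPATCH_WIDTH = 8

PEAK_RATES = {
    "op_cache": OP_CACHE_PIPES * OP_CACHE_INST_PER_PIPE,  # inst/cycle
    "decode": DECODE_PIPES * DECODE_INST_PER_PIPE,        # inst/cycle
}


def sustained_peak(source):
    """Peak rate through dispatch for one front-end path.

    Assumption: one instruction maps to one dispatched op.
    """
    return min(PEAK_RATES[source], DISPATCH_WIDTH)


if __name__ == "__main__":
    for source, rate in PEAK_RATES.items():
        print(f"{source}: supplies {rate}/cycle, "
              f"dispatch-limited to {sustained_peak(source)}/cycle")
    # In SMT mode the article says each thread gets one decode pipe.
    print(f"decode per thread in SMT mode: {DECODE_INST_PER_PIPE}/cycle")
```

Read this way, the op cache can supply more instructions per cycle (12) than dispatch can consume (8), while the two decode pipes together exactly match the dispatch width.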

In terms of Execution Advances, AMD's Zen 5 core architecture offers:

8-wide dispatch, rename, retire

Integer scheduler advances

  • Unified with age matrix
  • More symmetry, simplifying pick
  • 6 ALU with 3 multipliers, 3 branch units

4 AGU feed a wider LS with 4 memory addresses per cycle

Execution window growth

  • Scheduler growth
  • 240-entry physical register file
  • ROB/retire queue: 448 entries (1T) / 224 entries (2T)
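The article only says the integer scheduler is "unified with age matrix". For readers unfamiliar with the term, an age matrix is a general technique for tracking which scheduler entries are older than which, so that the picker can select the oldest ready operation. The sketch below is a generic textbook-style illustration and not AMD's design; the class and method names are hypothetical.

```python
class AgeMatrix:
    """Tracks relative age of scheduler slots with an N x N bit matrix."""

    def __init__(self, size):
        self.size = size
        self.valid = [False] * size
        # older[i][j] is True when slot i holds an older op than slot j.
        self.older = [[False] * size for _ in range(size)]

    def allocate(self, slot):
        """Fill a free slot; every currently valid slot is older than it."""
        assert not self.valid[slot], "slot must be free"
        for other in range(self.size):
            self.older[slot][other] = False
            self.older[other][slot] = self.valid[other]
        self.valid[slot] = True

    def release(self, slot):
        """Free a slot once its op has issued."""
        self.valid[slot] = False
        for other in range(self.size):
            self.older[slot][other] = False
            self.older[other][slot] = False

    def pick_oldest(self, ready):
        """Return the ready slot that no other ready slot is older than."""
        candidates = [s for s in ready if self.valid[s]]
        for slot in candidates:
            if not any(self.older[other][slot] for other in candidates):
                return slot
        return None


if __name__ == "__main__":
    sched = AgeMatrix(4)
    for slot in (2, 0, 3):  # program order: slot 2 holds the oldest op
        sched.allocate(slot)
    print(sched.pick_oldest([0, 3]))  # slot 0 is older than slot 3
```

The point of the structure is that slots can be allocated and freed in any order without moving entries around, since age lives in the matrix and not in the slot position.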
AMD has also made major FP changes and added new features such as the aforementioned AVX-512 with a full 512b datapath. Zen 5 offers more bandwidth and lower latency with four execution pipelines (1 op/cycle each), 2 LS/integer register pipelines, two 512b loads per cycle, one 512b store per cycle, and 2-cycle FADDs. The execution window has also grown, with 8-wide dispatch into 3 larger schedulers (one per pipe pair) and a doubled physical register file.
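The load and store figures translate directly into bytes per cycle. The sketch below does that conversion and, purely for illustration, scales it by a clock speed. The 5.0 GHz value is a hypothetical input and not a figure from the article, and the function names are made up for this example.

```python
VECTOR_BITS = 512       # full-width AVX-512 datapath, per the article
LOADS_PER_CYCLE = 2     # 512b loads per cycle, per the article
STORES_PER_CYCLE = 1    # 512b stores per cycle, per the article


def bytes_per_cycle(ops_per_cycle, bits=VECTOR_BITS):
    """Peak bytes moved per cycle for a given number of vector ops."""
    return ops_per_cycle * bits // 8


def gb_per_second(bytes_cycle, clock_ghz):
    """Peak GB/s per core: bytes/cycle times billions of cycles/second."""
    return bytes_cycle * clock_ghz


if __name__ == "__main__":
    clock_ghz = 5.0  # hypothetical clock, chosen only for this example
    load_bytes = bytes_per_cycle(LOADS_PER_CYCLE)
    store_bytes = bytes_per_cycle(STORES_PER_CYCLE)
    print(f"loads:  {load_bytes} B/cycle, "
          f"{gb_per_second(load_bytes, clock_ghz):.0f} GB/s")
    print(f"stores: {store_bytes} B/cycle, "
          f"{gb_per_second(store_bytes, clock_ghz):.0f} GB/s")
```

These are peak L1-level numbers per core under the stated assumptions; sustained bandwidth in real code would depend on cache hit rates and the rest of the memory hierarchy.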

Lastly, we have the Load and Store advances which include:

48KB 12-way L1D keeping 4-cycle load-to-use

More Bandwidth

  • 4 LS pipes for a mix of 4 loads/2 stores per cycle
  • 4 Integer load pipes support two 512b AVX512 pipes
  • 2 store commits per cycle
  • 64B fill/victim from/to L2 DCache
  • String Store optimizations: eliminates the destination memory read and frees bandwidth if the store will overwrite the block

Larger In-Flight Window

  • Load and Store queue growth
  • Store coalescing buffer growth
  • More in-flight misses (scalable tracking)
  • Scalable load ordering queue

Data prefetching

  • New 2D stride prefetcher also improves stream and region prefetchers
  • Extends workload pattern recognition
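The article does not describe how the 2D stride prefetcher works. To give a feel for what a "2D stride" access pattern is, the toy sketch below generates addresses for a walk over a tile (a fixed inner stride, with a larger jump at the end of each row) and compares a simple last-delta predictor against a predictor that also learns the row length and the jump. This is an illustration of the access pattern only, not AMD's mechanism; all function names and parameters are hypothetical.

```python
def tile_addresses(base, rows, cols, row_stride, col_stride):
    """Addresses touched when walking a rows x cols tile row by row."""
    return [base + r * row_stride + c * col_stride
            for r in range(rows) for c in range(cols)]


def one_d_hits(addrs):
    """Correct predictions from a last-delta (1D stride) predictor."""
    hits = 0
    for i in range(2, len(addrs)):
        predicted = addrs[i - 1] + (addrs[i - 1] - addrs[i - 2])
        hits += predicted == addrs[i]
    return hits


def two_d_hits(addrs):
    """Correct predictions from a toy 2D predictor.

    It learns the inner stride, the number of inner steps per row, and
    the jump delta from the first row transition, then predicts.
    """
    inner = addrs[1] - addrs[0]
    run = jump = None  # learned at the first row transition
    pos = 0            # inner steps taken so far in the current row
    hits = 0
    for i in range(1, len(addrs)):
        delta = addrs[i] - addrs[i - 1]
        if run is not None:
            predicted = jump if pos == run else inner
            hits += predicted == delta
        if delta == inner and (run is None or pos < run):
            pos += 1
        else:
            if run is None:
                run, jump = pos, delta
            pos = 0
    return hits


if __name__ == "__main__":
    addrs = tile_addresses(base=0, rows=4, cols=4,
                           row_stride=4096, col_stride=64)
    print("1D last-delta hits:", one_d_hits(addrs), "of", len(addrs) - 2)
    print("2D toy hits:", two_d_hits(addrs))
```

In this toy setup the last-delta predictor misses twice around every row boundary, while the 2D version predicts every access once it has seen one full row.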
Moving over to the cache, Zen 5 has seen several upgrades: 2x the L2/core interface bandwidth, with 64B/clk to the L1I and L1D and from the L1D; 2x the L2 associativity, which is now 16-way and takes 3.5 fewer cycles; and support for more L3 in-flight misses. Configurations include 32 MB of L3 for Zen 5 (4 MB per core) and 16 MB of L3 for Zen 5C (2 MB per core).

Talking about the two configurations, the Zen 5 core is optimized for peak 1T performance, while the Zen 5C core is aimed at perf/W- and perf/area-optimized platforms. Both Zen 5 and Zen 5C use the same ISA, which includes the following:

For power efficiency, AMD has built Zen 5 from the ground up and continues to build on its power gating improvements and 2T support (a major perf/watt benefit). The Zen 5 architecture also features reduced power state entry/exit times, better branch prediction to eliminate wasted work, and fewer operations overall, cutting bus, cache, and inter-core traffic through string operation optimizations and more effective, more efficient prefetchers.

Following are the key advances made within Zen 5 versus Zen 4:

AMD is also sharing the Zen 5 core complex's speeds and feeds: double the L2 associativity, double the L2 bandwidth, a low-latency L3 with 320 L3 in-flight misses, a fast, private 1 MB L2 cache, an L3 shared across all cores in the complex, and an L3 that is filled from L2 victims, with L2 tags duplicated in the L3 for probe filtering and fast cache transfer.

Talking about products, AMD's Zen 5 core complexes, or CCXs, will be featured first across three product families: Ryzen 9000 "Granite Ridge" desktop CPUs, Ryzen AI 300 "Strix" laptop CPUs, and 5th Gen EPYC "Turin" data center CPUs.

AMD is just getting started with Zen 5, so we can expect even more products in the future as the company fine-tunes the architecture for PCs and servers.
