AMD Zen 5 Core Architecture Breakdown At Hot Chips: Zen For A New Chapter In High-Performance Computing


At Hot Chips, AMD is offering an in-depth look at its brand-new Zen 5 core architecture, which will power its next generation of high-performance PCs.

AMD's Zen 1 core architecture first launched back in 2017, and since then the company has introduced five new architectures (Zen+, Zen 2, Zen 3, Zen 4, Zen 5). AMD started the decade by launching the Zen 3 architecture, which brought a 19% IPC improvement, an 8-core complex, and larger L3 caches per CCX while utilizing the 7nm/6nm process technologies.

The company followed up with Zen 4, which brought another 14% IPC improvement, AVX-512 instructions (on an FP-256 datapath), a doubled L2 cache of 1 MB, and support for VNNI/BFLOAT16, all on the 5nm and 4nm process technologies.

This year, AMD introduced Zen 5, its latest high-performance core architecture, which brings a 16% IPC uplift, AVX-512 with FP-512 variants, 8-wide dispatch, 6 ALUs, dual-pipe fetch/decode, and the use of 4nm/3nm process technologies. Today, AMD is deep-diving into the full Zen 5 architecture at Hot Chips.

AMD starts by stating the design objectives for Zen 5. In terms of performance, Zen 5 aims to deliver another major 1T and NT performance increase with balanced cross-core 1T/NT instruction and data throughput. To get there, the design creates front-end parallelism, increases execution parallelism, delivers high-throughput, efficient data movement and prefetching, and supports AVX512/FP512 data paths for throughput and AI uplifts. At the same time, AMD wants to add new capabilities, such as additional ISA extensions and new security features, along with expanded platform support through its Zen 5 and Zen 5C core variants.

Following is an overview of AMD's Zen 5 core architecture:

2 Threads/Core

NextGen Branch Predictor

Caches:

I-Cache: 32KB, 8-way; 2x32B fetch/cycle

Op-cache: 6K inst; 2x 6-wide fetch/cycle

D-Cache: 48KB, 12-way; 4 mem ops/cycle

L2-Cache: 1MB, 16-way

Dual I-Fetch/decode pipes, 4 inst/pipe

8 ops/cycle dispatched to integer or FP

Execution capabilities:

6 Integer ALU

4 AGU, 4 addresses to LS per cycle

4 FP ops/cycle; 2-cycle FADD

TLBs:

L1: 64-entry ITLB, 96-entry DTLB

L2: 2K ITLB; 4K DTLB (everything but 1G)
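As a quick sanity check on the cache figures above, the number of sets in each cache follows from capacity, associativity, and line size. The sketch below assumes 64-byte cache lines, which the list does not state; the helper name `num_sets` is purely illustrative.

```python
# Assumption: 64-byte cache lines (not stated in the overview list).
LINE_BYTES = 64
KIB = 1024


def num_sets(size_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Sets = capacity / (ways * line size) for a set-associative cache."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


# (capacity in bytes, associativity) as quoted in the overview.
caches = {
    "L1I": (32 * KIB, 8),
    "L1D": (48 * KIB, 12),
    "L2": (1024 * KIB, 16),
}

cache_sets = {name: num_sets(size, ways) for name, (size, ways) in caches.items()}

for name, sets in cache_sets.items():
    print(f"{name}: {sets} sets")
```

Under that line-size assumption, the 12-way 48KB L1D works out to the same set count as the 8-way 32KB L1I, which is one way a cache can grow in capacity by adding ways rather than sets.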

As for what the Zen 5 offers to provide a balanced throughput, you are looking at:

Front End parallelism:

2 predicted taken branches per cycle

2x Op cache pipes

2x instruction fetch/decode pipes

8 wide dispatch

Execution:

Integer: 6ALU, 4AGU addresses->LS

FPU: full 512b AVX512 datapaths

FPU: 4 execution pipes

Dataflow:

4 load pipes support 2, 512b AVX512 pipes

48K, 12-way L1D cache delivers 4 memops/cycle

2x width L2 cache <-> L1I and L1D caches

In terms of Fetch Advances, AMD's Zen 5 core architecture offers:

Branch Prediction: fewer bubbles, more accuracy, and throughput

Zero-bubble conditional branches

L2-sized (16K) L1 BTB and larger TAGE

Larger return address stack (52 entries)

2 taken predictions/cycle

Memory management:

Aggressive fetch hides L2 & table walk latencies

4x the L2 ITLB (2048 entry)

Icache latency and bandwidth

64B/cycle fetch

2 instruction fetch streams
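To make the return address stack item above more concrete, here is a minimal textbook-style model of one. It is a generic sketch, not AMD's implementation: only the 52-entry depth comes from the list, while the class name, the circular-overwrite policy, and the "no prediction" behavior on underflow are assumptions for illustration.

```python
class ReturnAddressStack:
    """Fixed-depth circular stack of predicted return addresses (generic model)."""

    def __init__(self, depth: int = 52):
        self.depth = depth
        self.entries = [None] * depth
        self.top = 0    # index of the next free slot
        self.count = 0  # number of valid entries, capped at depth

    def push(self, return_addr: int) -> None:
        """On a call: record the address of the instruction after the call."""
        self.entries[self.top] = return_addr
        self.top = (self.top + 1) % self.depth
        self.count = min(self.count + 1, self.depth)  # oldest entry is overwritten when full

    def pop(self):
        """On a return: predict the most recent unmatched call's return address."""
        if self.count == 0:
            return None  # nothing recorded, so no prediction
        self.top = (self.top - 1) % self.depth
        self.count -= 1
        return self.entries[self.top]


ras = ReturnAddressStack()
ras.push(0x1000)
ras.push(0x2000)
print(hex(ras.pop()))  # most recent call returns first
```

In a model like this, call chains deeper than the stack lose their oldest return addresses, which is why a deeper stack can help with heavily nested code.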

In terms of Decode Advances, AMD's Zen 5 core architecture offers:

Opcache: higher density with greater coverage and throughput

33% more entry associativity (16-way)

Dense entries store 6 instructions or fused instructions, not ops

2 OC pipes x 6 inst/pipe -> 12 inst/cycle

Dual Decode Pipes

2 pipes support parallel independent instruction streams/basic blocks

4 inst/cycle throughput per pipe

SMT mode gives each thread a pipe

8-wide dispatch to Int and FP execution
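The decode figures above can be combined into a rough peak-rate comparison. The sketch below simply multiplies pipes by per-pipe width and caps the result at the dispatch width; it treats one instruction as one op and assumes both pipes are always busy, which are simplifications rather than anything AMD has stated.

```python
# Figures quoted in the list above.
OP_CACHE_PIPES, OP_CACHE_INST_PER_PIPE = 2, 6
DECODE_PIPES, DECODE_INST_PER_PIPE = 2, 4
DISPATCH_WIDTH = 8

op_cache_rate = OP_CACHE_PIPES * OP_CACHE_INST_PER_PIPE  # instructions per cycle
decode_rate = DECODE_PIPES * DECODE_INST_PER_PIPE        # instructions per cycle


def peak_delivered(supply_rate: int, dispatch_width: int = DISPATCH_WIDTH) -> int:
    """Peak rate reaching the back end: limited by the narrower of supply and dispatch."""
    return min(supply_rate, dispatch_width)


print("op cache supply:", op_cache_rate, "-> delivered:", peak_delivered(op_cache_rate))
print("decoder supply: ", decode_rate, "-> delivered:", peak_delivered(decode_rate))
```

Read this way, the op cache can supply more than dispatch can accept, leaving headroom to absorb fetch bubbles, while the two decode pipes together match the dispatch width.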

In terms of Execution Advances, AMD's Zen 5 core architecture offers:

8-wide dispatch, rename, retire

Integer scheduler advances

Unified with age matrix

More symmetry, simplifying pick

6 ALU with 3 multipliers, 3 branch units

4 AGU feed a wider LS with 4 memory addresses per cycle

Execution window growth

Scheduler growth

240-entry physical register file

ROB/retire queue 448/224 1T/2T entries
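For readers unfamiliar with the "age matrix" mentioned above, the following is a minimal generic sketch of how such a structure lets a scheduler pick the oldest ready op without keeping entries sorted. It is a textbook-style model with illustrative names and sizes, not a description of AMD's actual scheduler.

```python
class AgeMatrixScheduler:
    """Toy scheduler: ops sit in arbitrary slots, a bit matrix records relative age."""

    def __init__(self, size: int):
        self.size = size
        self.valid = [False] * size
        self.ready = [False] * size
        # older[i][j] is True when the op in slot i was allocated before the op in slot j.
        self.older = [[False] * size for _ in range(size)]

    def allocate(self, ready: bool = False) -> int:
        """Place a new op in the first free slot and return the slot index."""
        slot = self.valid.index(False)  # raises ValueError when the scheduler is full
        for other in range(self.size):
            self.older[other][slot] = self.valid[other]  # every live op is older
            self.older[slot][other] = False              # the new op is older than nothing
        self.valid[slot] = True
        self.ready[slot] = ready
        return slot

    def wake(self, slot: int) -> None:
        """Mark an op's operands as available."""
        self.ready[slot] = True

    def pick(self):
        """Issue the oldest ready op and free its slot; return None if none is ready."""
        candidates = [i for i in range(self.size) if self.valid[i] and self.ready[i]]
        for i in candidates:
            if not any(self.older[j][i] for j in candidates):
                self.valid[i] = False
                self.ready[i] = False
                return i
        return None


sched = AgeMatrixScheduler(4)
a = sched.allocate()            # oldest, not ready yet
b = sched.allocate(ready=True)
c = sched.allocate(ready=True)
print(sched.pick())             # b: the oldest op that is ready
```

The point of the structure is that slots can be reused in any order while age is still tracked exactly, so the pick logic is the same for every entry.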

AMD has also made major FP changes and added new features, such as the aforementioned AVX-512 with a full 512b datapath. Zen 5 offers more bandwidth and lower latency with four execution pipelines at 1 op/cycle each, two LS/integer register pipelines, two 512b loads per cycle, one 512b store per cycle, and 2-cycle FADDs. The execution window has also grown, with 8-wide dispatch into three larger schedulers (one per pipe pair) and a doubled physical register file.
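The 512b load and store figures translate directly into bytes per cycle. The short calculation below shows the arithmetic; the 5.0 GHz clock is a hypothetical value chosen only to turn per-cycle numbers into GB/s and does not come from AMD's presentation.

```python
VECTOR_BITS = 512
LOADS_PER_CYCLE = 2
STORES_PER_CYCLE = 1
ASSUMED_CLOCK_GHZ = 5.0  # hypothetical clock, for illustration only

vector_bytes = VECTOR_BITS // 8                       # one 512b vector in bytes
load_bytes_per_cycle = LOADS_PER_CYCLE * vector_bytes
store_bytes_per_cycle = STORES_PER_CYCLE * vector_bytes


def gb_per_second(bytes_per_cycle: int, clock_ghz: float) -> float:
    """Bytes/cycle times cycles/ns gives bytes/ns, which equals GB/s."""
    return bytes_per_cycle * clock_ghz


print(f"loads:  {load_bytes_per_cycle} B/cycle, "
      f"{gb_per_second(load_bytes_per_cycle, ASSUMED_CLOCK_GHZ):.0f} GB/s at the assumed clock")
print(f"stores: {store_bytes_per_cycle} B/cycle, "
      f"{gb_per_second(store_bytes_per_cycle, ASSUMED_CLOCK_GHZ):.0f} GB/s at the assumed clock")
```

These are peak per-core figures against the L1 data cache; sustained rates depend on the workload hitting in cache.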

Lastly, we have the Load and Store advances which include:

48KB 12-way L1D keeping 4-cycle load-to-use

More Bandwidth

4 LS pipes for a mix of 4 loads/2 stores per cycle

4 Integer load pipes support 2, 512b AVX512 pipes

2 store commit per cycle

64B fill/victim from/to L2 DCache

String Store optimizations - eliminates dest mem read, frees bandwidth if the Store will overwrite the block

Larger In-Flight Window

Load and Store queue growth

Store coalescing buffer growth

More in-flight misses (scalable tracking)

Scalable load ordering queue

Data prefetching

New 2D stride prefetcher also improves stream and region prefetchers

Extends workload pattern recognition
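AMD has not detailed how the 2D stride prefetcher works, so the sketch below illustrates only the simpler, classic one-dimensional stride technique that such designs build on. Everything in it (the class name, the per-instruction table, the "same stride twice" confidence rule, and the prefetch degree) is an assumption for illustration.

```python
class StridePrefetcher:
    """Classic 1D stride detector, tracked per load instruction address (pc)."""

    def __init__(self, degree: int = 2):
        self.degree = degree  # how many strides ahead to prefetch
        self.table = {}       # pc -> (last address, last observed stride)

    def access(self, pc: int, addr: int) -> list:
        """Record a load and return the addresses this model would prefetch."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, 0)
            return []
        last_addr, last_stride = entry
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        # Only prefetch once the same non-zero stride has been seen twice in a row.
        if stride == 0 or stride != last_stride:
            return []
        return [addr + stride * k for k in range(1, self.degree + 1)]


pf = StridePrefetcher(degree=2)
for address in (1000, 1064, 1128):
    predicted = pf.access(pc=0x400, addr=address)
print(predicted)  # addresses one and two strides past the last access
```

A two-dimensional variant would, in principle, track a second stride for patterns such as walking rows of a matrix, but the details of AMD's design are not covered in the material above.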

Moving over to the cache, Zen 5 has seen several upgrades: 2x the L2/core interface bandwidth, with 64B/clk to the L1I and L1D and from the L1D; 2x the L2 associativity, which is now 16-way; 3.5 fewer cycles; and support for more L3 in-flight misses. Configurations include 32/16 MB of L3 (Zen 5 / Zen 5C), which works out to 4 MB per core (Zen 5) and 2 MB per core (Zen 5C).
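Taking the quoted L3 figures at face value, dividing the total L3 by the per-core share gives the number of cores implied to share each L3. This is only arithmetic on the numbers above, not a statement about any specific product configuration.

```python
# (total L3 in MB, L3 per core in MB) as quoted above.
l3_configs = {
    "Zen 5": (32, 4),
    "Zen 5C": (16, 2),
}

implied_cores = {
    name: total_mb // per_core_mb
    for name, (total_mb, per_core_mb) in l3_configs.items()
}

for name, cores in implied_cores.items():
    print(f"{name}: {cores} cores sharing the L3")
```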

Talking about the two configurations, the Zen 5 core is optimized for peak 1T performance, while the Zen 5C core is aimed at perf/W and perf/area optimized platforms. Both Zen 5 and Zen 5C use the same ISA.

For power efficiency, AMD has built Zen 5 from the ground up while continuing to build on its power-gating improvements and 2T support (a major perf/watt benefit). The Zen 5 architecture also features reduced power-state entry/exit times and better branch prediction to eliminate wasted work, and it cuts bus, cache, and inter-core traffic through string-operation optimizations and more effective, more efficient prefetchers.

Following are the key advances made within Zen 5 versus Zen 4:

AMD is also sharing the Zen 5 core complex's speeds and feeds: double the L2 associativity, double the L2 bandwidth, a low-latency L3 with 320 L3 in-flight misses, a fast and private L2 cache (1 MB), and an L3 shared across all cores in the complex. The L3 is filled from L2 victims, and L2 tags are duplicated in the L3 for probe filtering and fast cache transfers.

Talking about products, AMD's Zen 5 core complexes, or CCXs, will be featured first across three rounds of products. These include Ryzen 9000 "Granite Ridge" Desktop CPUs, Ryzen AI 300 "Strix" Laptop CPUs, and 5th Gen EPYC "Turin" Data Center CPUs.

AMD just got started with Zen 5 so we can expect even more products in the future as the company fine-tunes the architecture for PCs & servers.