DRAM Cache For GPUs Improves Performance By Up To 12.5x While Significantly Reducing Power Versus HBM

A new research paper explores the usefulness of a DRAM cache for GPUs, which can enable higher performance at lower power.

The GPU industry, spanning consumer, workstation, and AI GPUs, keeps advancing memory capacity and bandwidth, but the current trajectory isn't sustainable; without a more innovative approach, those gains will eventually hit a wall.

So far, GPU makers have pushed this segment forward either by incorporating large secondary LLCs (Last-Level Caches) or by increasing the size of their L2 caches. Against that backdrop, researchers have devised a new approach to GPU memory, particularly HBM, that aims to break today's capacity and bandwidth limits while making data transfer and management far more efficient.

In a research paper published on arXiv, researchers propose adding a dedicated DRAM cache to GPU memory, similar to what we see in modern SSDs. A DRAM cache is a high-speed staging area that enables an efficient "fetch-and-execute" process. This cache differs somewhat from the SSD variety, however, in that it is backed by SCM (Storage-Class Memory), a much more scalable alternative to modern HBM that also carries a lower per-bit dollar cost than DRAM.

The researchers propose a hybrid approach that uses SCM and DRAM together to reduce or avoid memory oversubscription and deliver higher performance per unit of capacity, as sketched below.
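To make the idea concrete, here is a minimal, hypothetical sketch of a hybrid memory front end: a small DRAM cache sitting in front of a large SCM pool, with a simple reuse-based bypass heuristic. This is not the paper's actual policy; all names, sizes, and thresholds are assumptions for illustration.

```python
# Minimal sketch of a hybrid DRAM-cache + SCM memory front end.
# Illustrative only: names, sizes, and the reuse heuristic are assumptions,
# not the paper's actual SCM-aware bypass policy.

class HybridMemory:
    def __init__(self, dram_lines=4, reuse_threshold=2):
        self.dram = {}                 # small, fast DRAM cache: line -> data
        self.dram_lines = dram_lines   # DRAM cache capacity in lines
        self.scm = {}                  # large, slower SCM backing store
        self.access_count = {}         # per-line access counter for reuse tracking
        self.reuse_threshold = reuse_threshold

    def read(self, line):
        self.access_count[line] = self.access_count.get(line, 0) + 1
        if line in self.dram:          # DRAM cache hit: fast path
            return self.dram[line]
        data = self.scm.get(line)      # miss: fetch from SCM
        # Bypass heuristic: only cache lines that have shown reuse, so
        # streaming (low-reuse) data never thrashes the small DRAM cache.
        if self.access_count[line] >= self.reuse_threshold:
            if len(self.dram) >= self.dram_lines:
                self.dram.pop(next(iter(self.dram)))  # naive FIFO eviction
            self.dram[line] = data
        return data

    def write(self, line, data):
        self.scm[line] = data          # write-through to SCM for simplicity
        if line in self.dram:
            self.dram[line] = data


mem = HybridMemory()
mem.write(0x100, "tile A")
for _ in range(3):
    mem.read(0x100)    # repeated reads promote the line into the DRAM cache
```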

The research is thorough, as expected, and introduces several mechanisms to speed up SCM data fetching. One of them is the Aggregated Metadata-In-Last-column (AMIL) DRAM cache organization, which accelerates the lookup of "tags," the metadata indicating which data resides in each cache line. AMIL keeps all the tags for a row together in that row's last column within the DRAM cache, so they can be fetched with a single access, reducing tag probe overhead while maintaining Error-Correcting Code (ECC) protection. A rough illustration follows.
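As a rough illustration of the AMIL idea (row geometry here is invented for the example, not the paper's actual layout): instead of interleaving each tag with its data block, as in a Tag-And-Data (TAD) layout, all of a row's tags are packed into its last column, so one column read resolves the tags for every line in the row.

```python
# Rough illustration of Tag-And-Data (TAD) vs. Aggregated Metadata-In-Last-
# column (AMIL) layouts for one DRAM row. Geometry is assumed for the example.

BLOCKS_PER_ROW = 7   # data blocks per DRAM row (assumed)

def tad_row(blocks, tags):
    # TAD: each tag travels with its block, so checking N tags
    # means touching N separate (tag, data) pairs across the row.
    return [(tags[i], blocks[i]) for i in range(BLOCKS_PER_ROW)]

def amil_row(blocks, tags):
    # AMIL: data blocks first, then every tag aggregated into the last
    # column. A single column read returns all tags for the row, and the
    # per-block metadata bits freed up can retain full ECC protection.
    return blocks + [tuple(tags)]

blocks = [f"data{i}" for i in range(BLOCKS_PER_ROW)]
tags = [f"tag{i}" for i in range(BLOCKS_PER_ROW)]

row = amil_row(blocks, tags)
all_tags = row[-1]   # one access to the last column probes every tag at once
print(all_tags)      # ('tag0', 'tag1', ..., 'tag6')
```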

We propose overcoming the memory capacity limitation of GPUs with high-capacity Storage-Class Memory (SCM) and DRAM cache. By significantly increasing the memory capacity with SCM, the GPU can capture a larger fraction of the memory footprint than HBM for workloads that mandate memory oversubscription, resulting in substantial speedups. However, the DRAM cache needs to be carefully designed to address the latency and bandwidth limitations of the SCM while minimizing cost overhead and considering GPU's characteristics. Because the massive number of GPU threads can easily thrash the DRAM cache and degrade performance, we first propose an SCM-aware DRAM cache bypass policy for GPUs that considers the multidimensional characteristics of memory accesses by GPUs with SCM to bypass DRAM for data with low performance utility. In addition, to reduce DRAM cache probe traffic and increase effective DRAM BW with minimal cost overhead, we propose a Configurable Tag Cache (CTC) that repurposes part of the L2 cache to cache DRAM cache line tags. The L2 capacity used for the CTC can be adjusted by users for adaptability. Furthermore, to minimize DRAM cache probe traffic from CTC misses, our Aggregated Metadata-In-Last-column (AMIL) DRAM cache organization co-locates all DRAM cache line tags in a single column within a row. The AMIL also retains the full ECC protection, unlike prior DRAM cache implementation with Tag-And-Data (TAD) organization.

Compared to HBM, the HMS improves performance by up to 12.5× (2.9× overall) and reduces energy by up to 89.3% (48.1% overall). Compared to prior works, we reduce DRAM cache probe and SCM write traffic by 91-93% and 57-75%, respectively.

Paper - Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory
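The abstract above also describes a Configurable Tag Cache (CTC) that repurposes part of the GPU's L2 cache to hold DRAM cache tags, so most probes can be answered without touching DRAM at all. Here is a loose sketch of that lookup order; the class name, sizing parameters, and eviction policy are all assumptions, not the paper's design.

```python
# Loose sketch of a Configurable Tag Cache (CTC) lookup path: a user-tunable
# slice of the L2 cache holds DRAM-cache tags, so a CTC hit answers "is this
# line in the DRAM cache?" without a DRAM tag probe. Names/sizes assumed.

class ConfigurableTagCache:
    def __init__(self, l2_entries=1024, l2_fraction=0.25):
        # Users can adjust how much L2 capacity is carved out for tags.
        self.capacity = int(l2_entries * l2_fraction)
        self.tags = {}  # DRAM-cache set index -> cached tag

    def probe(self, set_index, tag):
        cached = self.tags.get(set_index)
        if cached is not None:
            return cached == tag       # CTC hit: no DRAM tag probe needed
        return None                    # CTC miss: must probe DRAM (AMIL row)

    def fill(self, set_index, tag):
        # Install a tag fetched from DRAM after a CTC miss.
        if len(self.tags) >= self.capacity:
            self.tags.pop(next(iter(self.tags)))  # naive eviction
        self.tags[set_index] = tag
```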

Now for the more exciting part: the proposed solution delivers a significant performance improvement, up to 12.5x over HBM, while cutting energy use by up to 89.3%. These optimistic figures could mark the industry's next transition toward more "innovative" GPU memory solutions, provided that SCM combined with DRAM becomes a commercial reality and passes the necessary qualification tests.

News Source: @Underfox3
