Bruce Jacob

University of Maryland

SLIDE

# All Tomorrow's Memories

Bruce Jacob, **Keystone Professor**, University of Maryland








# Stealth Revolution

Page Mapping Garbage Collection Wear Leveling

> Other FTL Functions

#### **Fusion-ish SSD**

#### Application

#### **Operating System** VM + FS






#### Other FTL Functions

#### **Fusion-ish SSD**

#### Application

#### **Operating System** VM + FS

Page Mapping Garbage Collection Wear Leveling








# **Stealth Revolution**

#### [DATA]



**Question: where is data?**

- Answer: <data owner>
- To data owner: read X
- To requester: mem[X]
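The exchange above can be read as a two-step lookup: first ask where an address lives, then read it at its owner. The following is a minimal sketch under that reading. The slide does not say who answers the question, so the `Directory` class, the `Owner` class, and the `fetch` helper are all hypothetical names introduced for illustration.

```python
# Hypothetical sketch: ask where X lives, then read X at its owner.

class Owner:
    """A device that holds some addresses (an assumption, e.g. an SSD or memory node)."""

    def __init__(self, name):
        self.name = name
        self.mem = {}

    def read(self, x):
        # "To requester: mem[X]"
        return self.mem[x]


class Directory:
    """Answers "where is data?" with the owner of an address."""

    def __init__(self):
        self.owner_of = {}

    def place(self, x, owner, value):
        # Store the value at the owner and remember who holds it.
        owner.mem[x] = value
        self.owner_of[x] = owner

    def where_is(self, x):
        # "Answer: <data owner>"
        return self.owner_of[x]


def fetch(directory, x):
    owner = directory.where_is(x)   # Question: where is data?
    return owner.read(x)            # To data owner: read X


directory = Directory()
ssd = Owner("ssd0")
directory.place(0x1000, ssd, b"payload")
result = fetch(directory, 0x1000)
print(result)
```

The requester never needs to know the layout in advance; it only needs something that can name the owner.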

# Major implications (esp. considering Fusion-io-like designs) for OS and applications



CPU

**FPGA** 

**GPU** 


# **Background: Wish List**

Bandwidth, Capacity, Low Power, Nonvolatility, **Fine-Grained Access**

- DRAM - HBM/HMC\*
- Flash, 3DXP, RRAM, PCM, etc. - NVMM\*

\*Things we did and/or are doing now (I'll cover in talk)



HB




# Major implications for OS\* and applications


# **Background: Memory Latency**

Timeline (axis: TIME):

- Bank Precharge: tRP = 15 ns
- Row Activate (15 ns) and Data Restore (another 22 ns): tRCD = 15 ns, tRAS = 37.5 ns



Cost of access is high; requires significant effort to amortize this over the (increasingly short) payoff.
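To put a number on that cost, here is a small sketch using the timing values from this slide. The relation tRC = tRAS + tRP is the standard DRAM row-cycle definition; the 5 ns burst length is an assumption chosen only for illustration and does not come from the talk.

```python
# Timing parameters from the slide, in nanoseconds.
T_RP = 15.0    # bank precharge
T_RCD = 15.0   # row activate until a column command can issue
T_RAS = 37.5   # row activate plus data restore

# Standard DRAM relation: one full row cycle is activate/restore plus precharge.
T_RC = T_RAS + T_RP

# Restore time implied by the slide's figures (the slide rounds this to 22 ns).
restore_ns = T_RAS - T_RCD


def overhead_fraction(burst_ns):
    """Share of a row cycle not spent moving data, if the cycle yields one burst."""
    return 1.0 - burst_ns / T_RC


# ASSUMPTION: a 5 ns data burst per row cycle; not a figure from the slide.
ASSUMED_BURST_NS = 5.0
frac = overhead_fraction(ASSUMED_BURST_NS)
print(f"tRC = {T_RC} ns, restore = {restore_ns} ns, overhead = {frac:.1%}")
```

Under that assumed burst length, roughly nine tenths of each row cycle is overhead unless further accesses hit the open row or other banks work in parallel.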








### Hybrid Memory Cube

- VAULT (channel)
- Partitions (ranks)
- Logic Base (I/O & CTL)
- Off-chip: high-speed SerDes and generic protocol
- 4 I/O ports, up to 80 GB/s each
- Next gen is 160 GB/s per port (640 GB/s total)
- Total concurrency = **16 x 8 x 2..8** (256–1024)
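The totals on this slide are plain products, and a short check makes them easy to verify. The variable names are illustrative only, and the slide does not label the three concurrency factors, so they are kept as an unlabeled tuple.

```python
# Port figures from the slide.
PORTS = 4
GB_S_PER_PORT = 80             # current generation, "up to"
GB_S_PER_PORT_NEXT_GEN = 160   # next generation

total_gb_s = PORTS * GB_S_PER_PORT
total_next_gen_gb_s = PORTS * GB_S_PER_PORT_NEXT_GEN

# Concurrency range "16 x 8 x 2..8": the last factor varies from 2 to 8.
FIXED_FACTORS = (16, 8)
min_concurrency = FIXED_FACTORS[0] * FIXED_FACTORS[1] * 2
max_concurrency = FIXED_FACTORS[0] * FIXED_FACTORS[1] * 8

print(total_gb_s, total_next_gen_gb_s, min_concurrency, max_concurrency)
```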





#### **High Bandwidth Memory**

- Uses a simple '2.5D' design instead of full 3D stacking
- **TSV Stack:** up to 4 or 8 DRAM dies (HBM DRAMs)
- HBM interface: 1024-bit x 2 GT/s *= 256 GB/sec*
- **GPU/CPU**, **TSV Interposer**





#### Each Link is 128 Bits Wide: 1024 Total
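The 256 GB/s figure follows directly from the interface width and transfer rate quoted for HBM. A quick check, with variable names chosen for illustration:

```python
# Interface figures from the slides.
LINK_WIDTH_BITS = 128
TOTAL_WIDTH_BITS = 1024
TRANSFERS_PER_SECOND = 2_000_000_000   # 2 GT/s

# Number of 128-bit links that make up the 1024-bit interface.
links = TOTAL_WIDTH_BITS // LINK_WIDTH_BITS

# Bits per second across the interface, converted to gigabytes per second.
bandwidth_gb_s = TOTAL_WIDTH_BITS * TRANSFERS_PER_SECOND / 8 / 1e9

print(links, bandwidth_gb_s)
```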

# **High Bandwidth Memory**

[Figure: **CPU/ASIC** and DRAM stacks]






# Performance Comparison







#### Non-Volatile Main Memory

- CPU
- DDRx SDRAM Last-Level Cache
- NAND Flash Main Memory (or *any* source of cheap bits)

| Cost for 10 GB | Size of 10 GB | Power for 10 GB | Power per GB/s |
|----------------|---------------|-----------------|----------------|
| \$1,000        | 1 bucket      | 0.1–1 W         | 0.1 W          |
| \$100          |               | 1 W             | 0.1 W          |
| \$10           | <1 chip       |                 | 0.1 W (?)      |
| **\$40**       | <1 chip       |                 | 0.1 W (?)      |

**Note:** wear-out is mitigated by using MANY devices (thousands). A single device would wear out in under two days; therefore, 1000 devices should last for at least a year. Next, you can trade off longevity for access time and wearout: if the data need only last hours or minutes, wearout is reduced.
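The wear-out argument on this slide is a scaling estimate, sketched below. It assumes writes are spread perfectly evenly across all devices (ideal wear leveling), which is an idealization rather than a claim from the talk. The function name is hypothetical.

```python
def array_lifetime_days(single_device_days, n_devices):
    """Lifetime of an array if the same write traffic is spread evenly over n devices.

    ASSUMPTION: ideal wear leveling, so lifetime scales linearly with device count.
    """
    return single_device_days * n_devices


DEVICES = 1000

# Upper bound from the slide: one device lasts "under two days".
upper_bound_days = array_lifetime_days(2, DEVICES)

# Shortest single-device lifetime that still gives a full year with 1000 devices.
min_single_device_days = 365 / DEVICES

print(upper_bound_days, min_single_device_days * 24)
```

So the "at least a year" claim holds as long as a single device would survive roughly nine hours of the full write load on its own.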






# A Tale of 3 Memory Systems










# FTL – Flash

### **32G DRAM**














### **High Bandwidth Non Volatiles**

Borrow a page from the HMC playbook

- **Network Fabric** and multiple MCs
- **NV ReRAM:** up to 1000ns expected\*

\*trade-offs?


![](_page_27_Picture_6.jpeg)


#### Crossbar 3D ReRAM

![](_page_28_Picture_5.jpeg)

![](_page_28_Picture_6.jpeg)

- Cells have minimum area (<u>no access transistor</u>)
- 2-stack arrays @ 16nm, 20 x 20 mm die: 64GB of ReRAM
- 8-stack arrays => 256 GB of ReRAM
- Stacks arbitrarily high
- No. Access. Transistor.
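The capacity figures above scale linearly with the number of stacked layers, and they also imply a cell footprint. The sketch below works both out. It assumes decimal gigabytes, one bit per cell, and that the whole die is memory array, none of which is stated on the slide, so the resulting footprint is only a rough estimate.

```python
# Figures from the slide.
FEATURE_NM = 16
DIE_SIDE_MM = 20
GB_AT_TWO_STACKS = 64

GB_PER_LAYER = GB_AT_TWO_STACKS / 2


def capacity_gb(stacks):
    """Capacity if every added layer contributes the same number of bits."""
    return GB_PER_LAYER * stacks


# ASSUMPTIONS: 1 GB = 1e9 bytes, one bit per cell, entire die used for the array.
die_area_nm2 = (DIE_SIDE_MM * 1e6) ** 2
bits_per_layer = GB_PER_LAYER * 1e9 * 8
area_per_bit_nm2 = die_area_nm2 / bits_per_layer

# Footprint in units of F^2, where F is the 16 nm feature size.
footprint_f2 = area_per_bit_nm2 / FEATURE_NM ** 2

print(capacity_gb(8), area_per_bit_nm2, round(footprint_f2, 2))
```

Under these assumptions the footprint lands at a few F² per bit, which is the density range one would expect only from a cell with no access transistor.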

![](_page_28_Picture_8.jpeg)

# No Access Transistor

1T1R Memory Array

Low Latency, Low Density

High Performance, High Density

1T1R

![](_page_29_Picture_1.jpeg)

#### **Crossbar RRAM Technology**


![](_page_29_Picture_7.jpeg)

1TnR

(n=8)

#### No Access Transistor

![](_page_30_Picture_1.jpeg)


![](_page_30_Picture_6.jpeg)

![](_page_31_Picture_0.jpeg)


![](_page_31_Picture_5.jpeg)

# 1TnR (n = 1 .. 2048) (n=8)

![](_page_31_Picture_7.jpeg)

![](_page_32_Picture_0.jpeg)

For n = 2048 area is ~75% white space

> Use for processor (cores, controllers, routers, NoC, etc.)


![](_page_32_Picture_8.jpeg)

![](_page_33_Figure_0.jpeg)

# 1TnR (n = 1 .. 2048) (n=8)

![](_page_33_Picture_2.jpeg)

![](_page_34_Figure_0.jpeg)


# **Example Monolithic Numbers**

- ~64 cores, ~256GB ReRAM, ~4k banks
- Assume <u>200ns</u> latency for <u>8-byte</u> payload:
  Bandwidth = 4k \* 8 bytes / 200ns = 4k \* 40 MB/s = <u>160 GB/s</u>
- e.g., 64 cores, each 4-way multithreaded, each with 512-bit (8-longword) SIMD, vectored & scatter-gather loads, 4-deep non-blocking => 8k
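The arithmetic on this slide can be reproduced in a few lines. The sketch below takes "4k" as 4000 so that the result matches the slide's 160 GB/s, and it reads "8k" as the product of the four processor-side factors (an interpretation, since the slide does not name the unit). The last step applies Little's law, which is not on the slide but links bandwidth to the number of accesses that must be in flight.

```python
# Figures from the slide; "4k" is taken as 4000 to match the quoted 160 GB/s.
BANKS = 4000
LATENCY_NS = 200
PAYLOAD_BYTES = 8

# One bank completes one 8-byte access every 200 ns.
per_bank_mb_s = PAYLOAD_BYTES / LATENCY_NS * 1000   # bytes/ns converted to MB/s
total_gb_s = BANKS * per_bank_mb_s / 1000

# Processor-side request sources, multiplied out (interpreted as requests in flight).
CORES, THREADS, SIMD_LANES, NONBLOCKING_DEPTH = 64, 4, 8, 4
outstanding = CORES * THREADS * SIMD_LANES * NONBLOCKING_DEPTH

# Little's law: accesses in flight = throughput x latency.
# total_gb_s is numerically bytes/ns, so dividing by the payload gives accesses/ns.
in_flight_needed = (total_gb_s / PAYLOAD_BYTES) * LATENCY_NS

print(round(per_bank_mb_s), round(total_gb_s), outstanding, round(in_flight_needed))
```

Sustaining 160 GB/s at 200 ns needs about 4000 accesses in flight, and the processor in the example can issue about twice that many.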

![](_page_35_Picture_9.jpeg)

### So what all does this enable?

- **<u>HBM/HMC</u>**: hugely parallel systems (the duality of bandwidth and parallelism), streaming applications, 2x performance
- **<u>NVMM</u>**: massive data sets, new OS paradigms such as merged VM+FS and journaled main memory (built-in checkpoint/restart)
- **HBNV**: fine-grained operations on enormous sparse data sets

![](_page_36_Picture_12.jpeg)

Expect a shake-up soon.


# **Datacenter & Cloud Issues**

- Distribution at storage-level interface simplifies application development
- Potential for significant performance
- RoCE appropriate for supercomputing? How about RoXX?
- At what round-trip latency does this rival MPI as programming model?

![](_page_37_Picture_16.jpeg)


# **Nonvolatility Issues**

- Unified VM+FS subsystems (OS redesign)
- By default, data in process address space is temporary, garbage-collected at exit(); a permanentify function keeps it around
- ➡ Possible directions:
  - Persistent objects (e.g. Mneme, POMS) [failed only due to reliance on disk]
  - Named regions
  - Journaled main memory w/ checkpointing
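One way to picture journaled main memory with checkpointing is an undo log: every write records the value it overwrote, a checkpoint discards the log, and a rollback replays it backwards. The toy model below is a sketch of that general technique, not the design proposed in the talk. The class and method names are hypothetical, and a Python dict stands in for nonvolatile memory.

```python
class JournaledMemory:
    """Toy undo-log journal; a dict stands in for nonvolatile main memory."""

    def __init__(self):
        self.mem = {}
        # Entries are (address, existed_before, old_value) since the last checkpoint.
        self.journal = []

    def write(self, addr, value):
        # Log the prior state before overwriting it.
        self.journal.append((addr, addr in self.mem, self.mem.get(addr)))
        self.mem[addr] = value

    def checkpoint(self):
        # Current contents become the recovery point, so the log can be dropped.
        self.journal.clear()

    def rollback(self):
        # Undo writes in reverse order to restore the last checkpoint.
        while self.journal:
            addr, existed, old = self.journal.pop()
            if existed:
                self.mem[addr] = old
            else:
                del self.mem[addr]


nvm = JournaledMemory()
nvm.write("a", 1)
nvm.checkpoint()
nvm.write("a", 2)
nvm.write("b", 3)
nvm.rollback()
print(nvm.mem)
```

After the rollback only the checkpointed state remains, which is the "built-in checkpoint/restart" behavior in miniature.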

![](_page_38_Picture_11.jpeg)

# **Rethink Protection & Translation**

- **Capacity issues** → TLB overhead is ~20%
- So get rid of it already!
- BUT: need protection, authentication
- Why not waste bits? Simplify both sharing and translation by eliminating much of VM
- → OS/HW co-design needed: e.g., sharing via vaddr instead of paddr, language support?
- Recall: nonvolatile main memories ~TB per node
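A rough calculation shows why translation becomes a concern at terabyte capacities. The page size and TLB entry count below are assumptions picked for illustration, not figures from the talk; only the ~1 TB per node comes from the slide.

```python
# ASSUMPTIONS: 4 KiB pages and a 1536-entry TLB (hypothetical values).
PAGE_BYTES = 4 * 1024
TLB_ENTRIES = 1536

# From the slide: roughly a terabyte of main memory per node.
MEMORY_BYTES = 1 << 40

# Number of page mappings needed to cover all of memory.
pages_needed = MEMORY_BYTES // PAGE_BYTES

# TLB reach: how much memory the TLB can map at any one time.
tlb_reach_bytes = TLB_ENTRIES * PAGE_BYTES
reach_fraction = tlb_reach_bytes / MEMORY_BYTES

print(pages_needed, tlb_reach_bytes // (1024 * 1024), f"{reach_fraction:.2e}")
```

With these assumed values the TLB covers a few megabytes out of a terabyte, so workloads that roam across the whole memory would miss constantly.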

![](_page_39_Picture_11.jpeg)

# **Bottom Line**

It's going to happen. :)

- Combined VM+FS subsystems
- Journaled main memory
- Persistent Object Store work from 80s
- OS: Simpler design, fewer potential bugs
- VM arguably a <u>way</u> better abstraction to distribute than the FS
- Monolithic = good for many applications


![](_page_40_Picture_13.jpeg)


![](_page_41_Picture_4.jpeg)

#### Shameless Plug

![](_page_42_Picture_1.jpeg)

# Washington DC Oct 2019


![](_page_42_Picture_7.jpeg)

# **MEMSYS 2018**

The International Symposium on Memory Systems, October 1–4, Washington DC

www.memsys.io

Memory-device manufacturing, memory-architecture design, and the use of memory technologies by application software all profoundly impact today's [...] computing systems, in terms of their performance, [...], predictability, power dissipation, and cost. Existing [...] technologies are seen as limiting in terms of power, capacity, and [...] emerging memory technologies offer the potential to overcome [...] and design-related limitations to answer the requirements [...] applications. Our goal is to bring together researchers, [...] others interested in this exciting and rapidly evolving [...] each other on the latest state of the art, to exchange ideas, [...] future challenges. Visit memsys.io for more information.

#### Organizers

- Bruce Jacob, U. Maryland
- Kathy Smiley, Memory Systems
- Rajat Agarwal, Intel
- Abdel-Hameed Badawy, NMSU

#### Submissions

[...] (sigconf), blind submission (no authors listed), up to 16 pages long

#### [...] Interest

[...]lished papers containing significant novel ideas and [...] are solicited. Papers focusing on system, software, and [...] concepts, outside of traditional conference scopes, will be preferred over others (e.g., the desired focus is away from pipeline design, processor cache design, prefetching, data prediction, etc.). Symposium topics include, but are not limited to, the following:

- Memory-system design from both hardware and software perspectives
- Memory failure modes and mitigation strategies
- Memory-system resilience, especially at large scale
- Memory and system security issues
- Operating system design for hybrid/nonvolatile memories

[...] RAM, 3DXP, memristors, etc. [...] languages, optimization [...] memory technologies [...] hardware, software, mitigation [...] storage/memory/accelerators [...]ment techniques [...] hardware and software, [...] applications [...] datacenter applications [...]e-memory machines [...] technologies to support them, [...] heterogeneous memories [...]

Jishen Zhao, UC San Diego

[...] papers, and each [...] given a 20 minute presentation time slot. All accepted papers will be published in the ACM Digital Library.

![](_page_42_Figure_37.jpeg)


# Thank You!

Bruce Jacob

blj@umd.edu

www.ece.umd.edu/~blj

![](_page_43_Picture_5.jpeg)

![](_page_43_Picture_6.jpeg)

![](_page_43_Picture_7.jpeg)