Bruce Jacob

University of Maryland

#### **Contemporary DRAM Architectures and Beyond**

**Bruce Jacob** 

Electrical & Computer Engineering University of Maryland, College Park http://www.ece.umd.edu/~blj/

#### OUTLINE:

- Motivation & Background
- Experiments
- Results
- More Recent Results



#### Sources

"A Performance Study of Contemporary DRAM Architectures," *Proc. ISCA '99.* V. Cuppu, B. Jacob, B. Davis, and T. Mudge

Recent experiments by Vinodh Cuppu, Ph.D. student at University of Maryland

Recent experiments by Brian Davis, Ph.D. student at University of Michigan

### Dilemma: THIS ...

STATUS QUO in MEMORY-SYSTEM RESEARCH:

if (memory\_instruction(INSTR)) {
 if (L1\_cache\_miss( data\_addr(INSTR) )) {
 if (L2\_cache\_miss( data\_addr(INSTR) )) {
 cycles += DRAM\_LATENCY;
 }
 }
}
....
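A minimal runnable sketch of the flat-latency model above, for illustration only. The `Instr` record, its flag fields, and the value of `DRAM_LATENCY` are hypothetical stand-ins (the slide gives no concrete numbers or data structures); the point is that every L2 miss is charged the same fixed cost, with no notion of DRAM state or overlap.

```python
from dataclasses import dataclass

DRAM_LATENCY = 50  # cycles; hypothetical value, not taken from the slides


@dataclass
class Instr:
    """Hypothetical instruction record with precomputed cache outcomes."""
    is_memory: bool
    l1_miss: bool = False
    l2_miss: bool = False


def memory_cycles(instrs, dram_latency=DRAM_LATENCY):
    """Charge a fixed DRAM latency for every access that misses both caches."""
    cycles = 0
    for instr in instrs:
        if instr.is_memory and instr.l1_miss and instr.l2_miss:
            cycles += dram_latency
    return cycles


trace = [
    Instr(is_memory=True, l1_miss=True, l2_miss=True),   # goes to DRAM
    Instr(is_memory=True, l1_miss=True, l2_miss=False),  # served by L2
    Instr(is_memory=False),                              # not a memory op
]
total = memory_cycles(trace)
```

Only the first instruction in `trace` reaches DRAM, so the model adds exactly one `DRAM_LATENCY` regardless of which row or bank is involved.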

#### ... or THIS

Fast Page Mode Read Cycle



# Motivation

#### HERE'S WHAT YOU MISS:



### Goal

PRELIMINARY DRAM STUDY:

- Bus Transmission
- Row Access
- Column Access
- Data Transfer
- Bus Wait/Synch Time
- Stalls Due to Refresh
- The OVERLAP of These Components (with each other) (with CPU execution)
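To see why the overlap item matters, here is a small sketch that compares the naive sum of component times with the time actually occupied once overlapping components are merged. The interval representation, the function names, and the nanosecond values are all assumptions made for illustration; they are not measurements from the study.

```python
def busy_time(intervals):
    """Total time covered by at least one (start, end) interval."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Disjoint from the running interval: close it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps or touches the running interval: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def hidden_time(intervals):
    """Naive sum of component durations minus the merged busy time."""
    naive = sum(end - start for start, end in intervals)
    return naive - busy_time(intervals)


# Hypothetical component timings in ns (start, end).
components = {
    "row access": (0, 30),
    "column access": (30, 60),
    "data transfer": (50, 90),  # begins before the column access completes
}
spans = list(components.values())
naive_sum = sum(end - start for start, end in spans)
```

With these made-up numbers the components sum to 100 ns but occupy only 90 ns, so a model that simply adds component latencies would overcount by the 10 ns of overlap.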

**MODEL EXISTING TECHNOLOGY** 

#### **BUS TRANSMISSION**



# **DRAM Primer**

#### **ROW ACCESS**

Figure: CPU, bus, memory controller, and DRAM (row decoder, memory array, sense amps, column decoder, data in/out buffers).





#### **BUS TRANSMISSION**



#### **Read Timing for Conventional DRAM**



# **DRAM Primer**

#### Read Timing for Fast Page Mode DRAM



#### **Read Timing for Extended Data Out DRAM**



#### **Read Timing for Synchronous DRAM**





# **Simulator Overview**

CPU: SimpleScalar v3.0a

- 8-way out-of-order
- L1 cache: split 64K/64K, lockup free x32
- L2 cache: unified 1MB, lockup free x1
- L2 blocksize: 128 bytes

Main Memory: 8 64Mb DRAMs

- 100MHz/128-bit memory bus
- Optimistic open-page policy (close-immediately can be calculated)
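The bus and block parameters above imply some simple arithmetic, sketched here. The function names are illustrative, and the sketch assumes one transfer per bus clock (single data rate), which the slide does not state explicitly.

```python
# Values taken from the simulator overview.
BUS_CLOCK_MHZ = 100
BUS_WIDTH_BITS = 128
L2_BLOCK_BYTES = 128
NUM_DRAMS = 8
DRAM_DENSITY_MBIT = 64


def bytes_per_transfer(width_bits):
    """Bytes moved across the bus in one transfer."""
    return width_bits // 8


def peak_bandwidth_mb_per_s(clock_mhz, width_bits):
    """Peak bandwidth in MB/s, assuming one transfer per bus clock."""
    return clock_mhz * bytes_per_transfer(width_bits)


def block_transfer_ns(block_bytes, clock_mhz, width_bits):
    """Time to move one cache block across the bus, ignoring all DRAM latency."""
    per_transfer = bytes_per_transfer(width_bits)
    transfers = -(-block_bytes // per_transfer)  # ceiling division
    return transfers * 1000 / clock_mhz


peak = peak_bandwidth_mb_per_s(BUS_CLOCK_MHZ, BUS_WIDTH_BITS)          # 1600 MB/s
block_ns = block_transfer_ns(L2_BLOCK_BYTES, BUS_CLOCK_MHZ, BUS_WIDTH_BITS)  # 80.0 ns
capacity_mbytes = NUM_DRAMS * DRAM_DENSITY_MBIT // 8                   # 64 MB
```

Under that assumption the bus peaks at 1.6 GB/s, and a 128-byte L2 block occupies it for 8 bus cycles (80 ns) before any row or column access time is counted.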

**Represents a "typical" workstation** 
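The open-page policy named in the memory configuration can be sketched as a per-bank state machine that remembers which row is held in the sense amps. This is a generic illustration, not the simulator's code: the class name and the timing values are hypothetical, and the slide does not spell out what "optimistic" means, so no attempt is made to model it. The close-immediately helper assumes the precharge is hidden after each access.

```python
class OpenPageBank:
    """One DRAM bank under an open-page policy (row stays open after access)."""

    def __init__(self, t_precharge, t_row, t_col):
        self.t_precharge = t_precharge
        self.t_row = t_row
        self.t_col = t_col
        self.open_row = None  # no row open initially

    def access(self, row):
        """Return the latency of accessing `row` and leave that row open."""
        if self.open_row == row:
            latency = self.t_col                      # row hit: column access only
        elif self.open_row is None:
            latency = self.t_row + self.t_col         # bank idle: activate, then read
        else:
            # Row conflict: close the old row, open the new one, then read.
            latency = self.t_precharge + self.t_row + self.t_col
        self.open_row = row
        return latency


def close_immediately_latency(t_row, t_col):
    """Latency when every access finds the bank already precharged (assumed)."""
    return t_row + t_col


# Hypothetical timings in ns.
bank = OpenPageBank(t_precharge=20, t_row=30, t_col=30)
latencies = [bank.access(5), bank.access(5), bank.access(7)]
```

With these made-up timings the three accesses cost 60, 30, and 80 ns: the policy wins on the row hit and loses on the row conflict, which is why the amount of locality in the access stream matters.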

# **DRAM Configurations**

FPM, EDO, SDRAM, ESDRAM:

Figure: CPU and caches connect through the memory controller and a 128-bit 100MHz bus to a DIMM of eight x16 DRAMs.

**Rambus, Direct Rambus, SLDRAM:**



Note: TRANSFER WIDTH of Direct Rambus Channel

- equals that of ganged FPM, EDO, etc.
- is 2x that of Rambus & SLDRAM

# **DRAM Configurations**

Strawman: Rambus, etc.



### **Overhead: Memory vs. CPU**



Definitions (var. on Burger et al.):

- t<sub>PROC</sub> processor with perfect memory
- t<sub>REAL</sub> realistic configuration
- t<sub>BW</sub> CPU with wide memory paths
- t<sub>DRAM</sub> time seen by DRAM system
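The slide lists only the four measured quantities. One plausible way to combine them, in the style of the Burger et al. decomposition, is sketched below; the formulas and the dictionary keys are an assumption about how the quantities relate, not something stated on the slide, and the cycle counts are invented.

```python
def decompose(t_proc, t_real, t_bw, t_dram):
    """Split execution time into stall and overlap components.

    t_proc: processor with perfect memory
    t_real: realistic configuration
    t_bw:   CPU with wide memory paths
    t_dram: time seen by the DRAM system
    """
    return {
        # All time lost to the memory system.
        "memory_stall": t_real - t_proc,
        # Stalls that remain even with wide paths: attributed to latency.
        "latency_stall": t_bw - t_proc,
        # Extra stalls once paths are realistic: attributed to bandwidth.
        "bandwidth_stall": t_real - t_bw,
        # Portion of processor time that runs concurrently with DRAM activity.
        "cpu_memory_overlap": t_proc - (t_real - t_dram),
    }


# Hypothetical cycle counts for illustration.
parts = decompose(t_proc=100, t_real=150, t_bw=130, t_dram=80)
```

A quick sanity check on the assumed formulas: the latency and bandwidth stalls sum to the total memory stall, and if nothing overlapped (t<sub>REAL</sub> = t<sub>PROC</sub> + t<sub>DRAM</sub>) the overlap term would be zero.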



# Memory & CPU — PERL



### **Average Latency of DRAMs**



### **Average Latency of DRAMs**



### **Cost-Performance**

FPM, EDO, SDRAM, ESDRAM:

- Lower Latency => Wide/Fast Bus
- Increase Capacity => Decrease Latency
- Low System Cost

Rambus, Direct Rambus, SLDRAM:

- Lower Latency => Multiple Channels
- Increase Capacity => Increase Capacity
- High System Cost

#### 1 DRDRAM = Multiple SDRAM

# Conclusions

**100MHz/128-bit Bus is Current Bottleneck** 

 Solution: Fast Bus/es & MC on CPU (*e.g.* Compaq Alpha, Sony Emotion, ...)

Current DRAMs Solving Bandwidth Problem (but not Latency Problem)

There is Locality in DRAM Accesses (but how important is this?)

SPECint '95 Fits in 1MB Cache

# **Recent (Unfinished) Work**

Investigation of Organization-Level Parameters:

- Channel widths & speeds, turnaround
- Independent vs. ganged channels
- Banks per channel, burst widths

Detailed Study of DRDRAM vs. SDRAM in Highly Concurrent Environment

**Embedded DRAM+DSP Architectures** 

**Detailed Study of Multiprocessor Buses** 

### **Channel/Bank Model**



### **Read/Write Request Model**



# **Concurrency Model**





#### Legal if no turnaround and R/W to different banks:



Legal if turnaround  $\leq$  10ns and R/W to different banks:



# **Bandwidth vs. Burst Width**

Figure: Cycles per Instruction (0 to 1.25) vs. System Bandwidth (0.4, 0.8, 1.6, 3.2, 6.4 GB/s; GB/s = Channels \* Width \* Speed), one series each for 8-, 16-, 32-, 64-, and 128-byte burst widths.

PERL: 1 channel, 4 banks, 2GHz CPU
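The system-bandwidth axis is defined as Channels \* Width \* Speed. A minimal sketch of that product follows; the function name, the choice of bytes and MHz as units, and the example configurations are assumptions, since the slide does not say which channel widths and speeds produced each bandwidth point.

```python
def system_bandwidth_gb_per_s(channels, width_bytes, speed_mhz):
    """System bandwidth in GB/s as channels * width * speed.

    `speed_mhz` is taken to mean millions of transfers per second per channel.
    """
    return channels * width_bytes * speed_mhz / 1000


# Hypothetical configurations that land on points of the bandwidth axis.
narrow_fast = system_bandwidth_gb_per_s(channels=1, width_bytes=2, speed_mhz=800)
wide_slow = system_bandwidth_gb_per_s(channels=1, width_bytes=16, speed_mhz=100)
two_channels = system_bandwidth_gb_per_s(channels=2, width_bytes=2, speed_mhz=800)
```

Here a narrow, fast channel and a wide, slow channel both come out at 1.6 GB/s, which is why the study varies burst width separately: equal system bandwidth does not imply equal behavior per request.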

# Conclusions

None yet ... preliminary data

#### **CONTACT INFO:**

**Prof. Bruce Jacob** 

Electrical & Computer Engineering University of Maryland, College Park http://www.ece.umd.edu/~blj/ blj@eng.umd.edu

