

Bruce Jacob

University of Maryland

### **Salient Points**

**128 Registers (1KB cache)**

- Eliminates renaming
- Reduces cache accesses

**Speculative Loads**

- Moves loads past control barriers

**Predicated Execution**

- Eliminates 40% of conditional branches

First time we've seen all these in one place


## **Goals of Architecture**

**Overcome performance limiters:** 

- Branches
- Memory latency
- Sequential program model

Long Architecture Lifetime

- Large register file
- Fully interlocked architecture
- No fixed issue-width

**Retain Backward Compatibility with x86** 

THE IA-64 ARCHITECTURE


# Background

In the Beginning ... CISC

**RISC -> larger register files, simpler design** 

• (better IC technologies)

VLIW -> possible to do lots in parallel

• Influenced superscalar design, dynamic scheduling

Superscalar Design:

- Dependency-checking, I-dispatch, register renaming, out-of-order
- A lot of hardware required


### Performance Limiters: Branches

20-30% Execution Time on Today's Processors: MISPREDICTS

4-issue, 8-stage pipe, BR resolved in stage 7: 24 issue slots squashed per mispredict

**Mispredicts:** 

- 95% prediction accuracy
- Branches = 1/6 instrs

One mispredict per 120 instrs (5% of 1/6): 24 stalled issue slots per 120 instrs

ALSO: branches break the program into "basic blocks"

- Severely limits opportunities for parallelism
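The arithmetic on this slide can be reproduced with a short sketch. The function name and the cost model are assumptions made for illustration: it assumes every issue slot in the stages ahead of the resolving stage is squashed, which matches the slide's figure of 24 for a 4-issue machine resolving in stage 7.

```python
from fractions import Fraction


def mispredict_overhead(issue_width, resolve_stage, accuracy, branch_fraction):
    """Return (squashed slots per mispredict, instrs per mispredict, wasted fraction).

    Assumed model: a mispredict squashes issue_width slots for each of the
    (resolve_stage - 1) stages filled before the branch resolves.
    """
    squashed = issue_width * (resolve_stage - 1)
    # One mispredict occurs every 1 / (miss rate * branch frequency) instructions.
    instrs_per_mispredict = 1 / ((1 - accuracy) * branch_fraction)
    # Fraction of all issue slots (useful + squashed) that are thrown away.
    wasted = squashed / (instrs_per_mispredict + squashed)
    return squashed, instrs_per_mispredict, wasted


squashed, interval, wasted = mispredict_overhead(
    issue_width=4,
    resolve_stage=7,
    accuracy=Fraction(95, 100),
    branch_fraction=Fraction(1, 6),
)
print(squashed, interval, float(wasted))
```

Under this simplified model, 24 of every 144 issue slots are lost, roughly one sixth. That is somewhat below the 20-30% quoted at the top of the slide, which presumably reflects costs the sketch ignores.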








# **Obvious Answer: VLIW**

(Very Long Instruction Word)

Where did VLIW fail that Intel/HP can win?

- Code expansion: unused issue slots are filled with NOPs (e.g. `A B NOP NOP`)
- Binary compatibility
- Cannot exploit full parallelism:
  - Control instructions serialize execution
  - Cannot move loads above branches
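The code-expansion problem can be illustrated with a minimal sketch. It assumes a hypothetical fixed 4-wide VLIW word and a program already split into groups of mutually independent operations; the function names and the example groups are invented for illustration.

```python
def pack_fixed_width(groups, width=4):
    """Pack groups of independent ops into fixed-width VLIW words.

    Each group must issue after the previous one, so a word never mixes
    ops from two groups; leftover slots are padded with NOPs.
    """
    words = []
    for group in groups:
        for start in range(0, len(group), width):
            chunk = group[start:start + width]
            words.append(chunk + ["NOP"] * (width - len(chunk)))
    return words


def nop_fraction(words):
    """Fraction of all instruction slots occupied by NOPs."""
    slots = sum(len(word) for word in words)
    nops = sum(op == "NOP" for word in words for op in word)
    return nops / slots


# Hypothetical program: three dependent groups of 2, 1, and 3 independent ops.
words = pack_fixed_width([["A", "B"], ["C"], ["D", "E", "F"]])
for word in words:
    print(" ".join(word))
print(nop_fraction(words))
```

In this example half of the encoded slots are NOPs. This is the kind of waste a fixed-width encoding incurs whenever the available parallelism is narrower than the word.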




## **Variable Instruction: Bundle**

| Bit 128 |        |        | Bit 0    |
|---------|--------|--------|----------|
| Instr2  | Instr1 | Instr0 | Template |

Instructions:

- Opcode
- Predicate register (6)
- Source1 & Source2 (7 each)
- Dest (7)

Template (8?):

- Instruction grouping (can do in 3 bits: I0, I1, I2 paired w/ prev)
- Prefetch hints?
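A minimal sketch of packing and unpacking such a bundle follows. It takes the slide's field widths at face value, including the speculative 8-bit template. That leaves 40 bits per instruction slot and, after the predicate, source, and destination fields, 13 bits for the opcode. The opcode width, the ordering of fields within a slot, and all names here are assumptions for illustration, not the published encoding.

```python
TEMPLATE_BITS = 8   # the slide's guess ("8?")
SLOT_BITS = 40      # assumed: (128 - 8) / 3
# Assumed field order within a slot, low bits first; opcode width is the remainder.
FIELDS = [("pred", 6), ("src1", 7), ("src2", 7), ("dest", 7), ("opcode", 13)]


def encode_instr(**values):
    """Pack one instruction's fields into a 40-bit integer."""
    word, shift = 0, 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def decode_instr(word):
    """Inverse of encode_instr: split a 40-bit integer into named fields."""
    fields, shift = {}, 0
    for name, width in FIELDS:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields


def pack_bundle(template, instrs):
    """Place the template in the low bits and Instr0..Instr2 above it."""
    if len(instrs) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template & ((1 << TEMPLATE_BITS) - 1)
    for i, word in enumerate(instrs):
        bundle |= word << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Return (template, [Instr0, Instr1, Instr2]) from a 128-bit integer."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    instrs = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask for i in range(3)]
    return template, instrs


add = encode_instr(pred=5, src1=1, src2=2, dest=3, opcode=100)
bundle = pack_bundle(0x2A, [add, add, add])
print(hex(bundle), decode_instr(unpack_bundle(bundle)[1][0]))
```

With 7-bit register specifiers, each source and destination field can name any of the 128 registers, and the 6-bit predicate field can name up to 64 predicate registers.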


## **Predicated Execution**

Jerry Huck on instruction-level parallelism:

"Just because you allow it doesn't mean you're going to get any of it."

**BRANCHES LIMIT ILP:** 

Sequential, no-predict: normal bank teller

Sequential, predict: fill out slip in advance (predict whether deposit or withdrawal)

Predicated Execution: fill out both slips, throw away whichever is wrong
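The bank-teller analogy can be sketched in code. This is a software model only: the function names are hypothetical, and `commit` stands in for hardware that discards the result of an instruction whose predicate is false.

```python
def teller_branchy(is_deposit, balance, amount):
    """Sequential version: decide first, then do only the chosen work."""
    if is_deposit:
        return balance + amount
    return balance - amount


def commit(predicate, old, new):
    """Model of a predicated write: keep the new value only if the predicate holds."""
    return new if predicate else old


def teller_predicated(is_deposit, balance, amount):
    """Predicated version: fill out both slips, throw away whichever is wrong."""
    p_deposit = is_deposit          # a compare sets two complementary predicates
    p_withdraw = not is_deposit
    result = balance
    # Both operations are issued; neither waits on a branch.
    result = commit(p_deposit, result, balance + amount)   # (p_deposit)  add
    result = commit(p_withdraw, result, balance - amount)  # (p_withdraw) sub
    return result


print(teller_predicated(True, 100, 30), teller_predicated(False, 100, 30))
```

Both versions compute the same answer. The predicated form does extra work on every call, but it removes the branch, so there is nothing to mispredict and the two operations can issue in parallel.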






# Long Architecture Life

Large Register File

• Like the jump from 8 -> 32 registers between the 70's and the 80's

**Fully Interlocked** 

• Not tied to a particular implementation

VLIW w/ Variable Instruction Width

• Not tied to a particular implementation

• Works well for "shovels" of different widths