# **Computer Architecture** 2007-2008

# Organization (www.liacs.nl/ca)

#### People

- Lecturer: Lex Wolters
- Assignment leader: Harmen van der Spek
- Assistant: Van Thieu Vu
- Student assistants: Eyal Halm & Joris Huizer

#### Lectures (3 EC)

- Wednesday 11.15-13.00h till Dec 5th (except Oct 3rd)
- Book: Hennessy & Patterson, fourth edition!
- Exam: date unknown yet

#### Assignment (4 EC)

- Parts 1 (10%), 2a (30%), 2b (30%), 3 (30%): strict deadlines
- Assistance (room 306):
  - Wed 13.45-15.30h (scheduled): this afternoon Intro part 1
  - Mon, Tue, Thu 15.30-16.30h

# Lecture 1 - Introduction

Slides were used during lectures by David Patterson, Berkeley, spring 2006

#### Outline

- Computer Science at a Crossroads
- Computer Architecture v. Instruction Set Arch.
- What Computer Architecture brings to table

Break

#### Crossroads: Conventional Wisdom in Comp. Arch

- Old Conventional Wisdom: Power is free, Transistors expensive
- New Conventional Wisdom: "Power wall" Power expensive, Xtors free (can put more on chip than can afford to turn on)
- Old CW: Sufficiently increasing Instruction Level Parallelism via compilers, innovation (Out-of-order, speculation, VLIW, ...)
- New CW: "ILP wall" law of diminishing returns on more HW for ILP
- Old CW: Multiplies are slow, Memory access is fast
- New CW: "Memory wall" Memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for multiply)
- Old CW: Uniprocessor performance 2X / 1.5 yrs
- New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
  - Uniprocessor performance now 2X / 5(?) yrs
  - ⇒ Sea change in chip design: multiple "cores" (2X processors per chip / ~ 2 years)
    - More simpler processors are more power efficient


#### Sea Change in Chip Design

- Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm<sup>2</sup> chip
- RISC II (1983): 32-bit, 5 stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm<sup>2</sup> chip
- 125 mm<sup>2</sup> chip, 0.065 micron CMOS = 2312 RISC II+FPU+Icache+Dcache
  - RISC II shrinks to ~ 0.02 mm<sup>2</sup> at 65 nm
  - Caches via DRAM or 1 transistor SRAM?
- Processor is the new transistor?

#### Déjà vu all over again?

- Multiprocessors imminent in 1970s, '80s, '90s, ...
- "... today's processors ... are nearing an impasse as technologies approach the speed of light.."
  David Mitchell, The Transputer: The Time Is Now (1989)
- Transputer was premature ⇒ Custom multiprocessors strove to lead uniprocessors ⇒ Procrastination rewarded: 2X seq. perf. / 1.5 years
- "We are dedicating all of our future product development to multicore designs. ... This is a sea change in computing"
  Paul Otellini, President, Intel (2004)
- Difference is all microprocessor companies switch to multiprocessors (AMD, Intel, IBM, Sun; all new Apples 2 CPUs) ⇒ Procrastination penalized: 2X sequential perf. / 5 yrs ⇒ Biggest programming challenge: 1 to 2 CPUs

## **Problems with Sea Change**

- Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, Libraries, ... not ready to supply Thread Level Parallelism or Data Level Parallelism for 1000 CPUs / chip
- Architectures not ready for 1000 CPUs / chip
  - Unlike Instruction Level Parallelism, cannot be solved by computer architects and compiler writers alone, but also cannot be solved *without* participation of computer architects
- The 4<sup>th</sup> edition of the textbook 'Computer Architecture: A Quantitative Approach' explores shift from Instruction Level Parallelism to Thread Level Parallelism / Data Level Parallelism






## Instruction Set Architecture

"... the attributes of a [computing] system as seen by the programmer, *i.e.* the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls the logic design, and the physical implementation."

- Amdahl, Blaauw, and Brooks, 1964


- Organization of Programmable Storage
- Data Types & Data Structures: Encodings & Representations
- Instruction Formats
- Instruction (or Operation Code) Set
- Modes of Addressing and Accessing Data Items and Instructions
- Exceptional Conditions

## **ISA vs. Computer Architecture**

- Old definition of computer architecture = instruction set design
  - Other aspects of computer design called implementation
  - Insinuates implementation is uninteresting or less challenging
- Our view is computer architecture >> ISA
- Architect's job much more than instruction set design; technical hurdles today more challenging than those in instruction set design
- Since instruction set design not where action is, some conclude computer architecture (using old definition) is not where action is
  - We disagree on conclusion
  - Agree that ISA not where action is (ISA in appendix B)

#### Comp. Arch. is an Integrated Approach

- What really matters is the functioning of the complete system
  - hardware, runtime system, compiler, operating system, and application
  - In networking, this is called the "End to End argument"
- Computer architecture is not just about transistors, individual instructions, or particular implementations
  - E.g., original RISC projects replaced complex instructions with a compiler + simple instructions






#### What Computer Architecture brings to Table

- Other fields often borrow ideas from architecture
- **Quantitative Principles of Design**
  - 1. Take Advantage of Parallelism
  - 2. Principle of Locality
  - 3. Focus on the Common Case
  - 4. Amdahl's Law
  - 5. The Processor Performance Equation
- Careful, quantitative comparisons
  - Define, quantify, and summarize relative performance
  - Define and quantify relative cost
  - Define and quantify dependability
  - Define and quantify power
- Culture of anticipating and exploiting advances in technology
- Culture of well-defined interfaces that are carefully implemented and thoroughly checked

## 1) Take Advantage of Parallelism

- Increasing throughput of server computer via multiple processors or multiple disks
- Detailed HW design
  - Carry lookahead adders use parallelism to speed up computing sums from linear to logarithmic in number of bits per operand
  - Multiple memory banks searched in parallel in set-associative caches
- Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence.
  - Not every instruction depends on immediate predecessor ⇒ executing instructions completely/partially in parallel possible
  - Classic 5-stage pipeline: 1) Instruction Fetch (Ifetch), 2) Register Read (Reg), 3) Execute (ALU), 4) Data Memory Access (Dmem), 5) Register Write (Reg)
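The benefit of overlapping instructions in the 5-stage pipeline can be sketched with a simple cycle count. This is an idealized model, assuming one clock cycle per stage and no stalls or hazards; the function names are illustrative and not from the lecture.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int = 5) -> int:
    """Each instruction runs through all stages before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int = 5) -> int:
    """Ideal pipeline: fill the pipe once, then finish one instruction per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


def speedup(num_instructions: int, num_stages: int = 5) -> float:
    """Ratio of unpipelined to pipelined cycle counts."""
    return unpipelined_cycles(num_instructions, num_stages) / pipelined_cycles(
        num_instructions, num_stages
    )


if __name__ == "__main__":
    for n in (1, 10, 100, 10_000):
        print(n, unpipelined_cycles(n), pipelined_cycles(n), round(speedup(n), 2))
```

Under these assumptions the speedup approaches the number of stages for long instruction sequences; the time for a single instruction does not improve.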





## 2) The Principle of Locality

- The Principle of Locality: Programs access a relatively small portion of the address space at any instant of time.
- Two Different Types of Locality:
  - Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  - Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
- Last 30 years, HW relied on locality for memory perf.
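How hardware exploits both kinds of locality can be illustrated with a toy direct-mapped cache model. This is only a sketch: the cache geometry (64 lines of 16 bytes) and the three access patterns are assumptions chosen for illustration, not figures from the lecture.

```python
def hit_rate(addresses, num_lines=64, line_size=16):
    """Hit rate of a direct-mapped cache for a sequence of byte addresses."""
    stored_block = [None] * num_lines  # which memory block each line holds
    hits = 0
    for addr in addresses:
        block = addr // line_size      # memory block containing this byte
        index = block % num_lines      # cache line the block maps to
        if stored_block[index] == block:
            hits += 1
        else:
            stored_block[index] = block  # miss: fetch the whole line
    return hits / len(addresses)


# Spatial locality: 1024 consecutive 4-byte words (like scanning an array).
sequential = list(range(0, 4096, 4))
# No locality: 1024 accesses, each 64 bytes apart, never revisited.
strided = list(range(0, 1024 * 64, 64))
# Temporal locality: the same 64 words re-read 16 times (like a loop).
looped = list(range(0, 256, 4)) * 16

if __name__ == "__main__":
    for name, trace in (("sequential", sequential), ("strided", strided), ("looped", looped)):
        print(name, hit_rate(trace))
```

In this model the sequential scan hits on three of every four accesses because each miss brings in a whole line, the strided trace never hits, and the loop misses only on its first pass.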











## 5) The Processor Performance Equation

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

|              | Inst Count | CPI | Clock Rate |
|--------------|------------|-----|------------|
| Program      | X          |     |            |
| Compiler     | X          | (X) |            |
| Inst. Set    | X          | X   |            |
| Organization |            | X   | X          |
| Technology   |            |     | X          |




#### Outline

- Technology Trends: Culture of tracking, anticipating and exploiting advances in technology
- Careful, quantitative comparisons:
  - 1. Define, quantify, and summarize relative performance
  - 2. Define and quantify relative cost
  - 3. Define and quantify dependability
  - 4. Define and quantify power



## Tracking Technology Performance Trends

- Drill down into 4 technologies:
  - Disks
  - Memory
  - Network
  - Processors
- Compare ~1980 Archaic vs. ~2000 Modern
  - Performance Milestones in each technology
- Compare for Bandwidth vs. Latency improvements in performance over time
- Bandwidth: number of events per unit time
  - E.g., Mbits / second over network, Mbytes / second from disk
- Latency: elapsed time for a single event
  - E.g., one-way network delay in microseconds, average disk access time in milliseconds

| Disk           | Archaic: CDC Wren I, 1983 | Modern: Seagate 373453, 2003    | Improvement |
|----------------|---------------------------|---------------------------------|-------------|
| Rotation speed | 3600 RPM                  | 15000 RPM                       | 4X          |
| Capacity       | 0.03 GBytes               | 73.4 GBytes                     | 2500X       |
| Tracks/Inch    | 800                       | 64000                           | 80X         |
| Bits/Inch      | 9550                      | 533,000                         | 60X         |
| Platters       | Three 5.25"               | Four 2.5" (in 3.5" form factor) |             |
| Bandwidth      | 0.6 MBytes/sec            | 86 MBytes/sec                   | 140X        |
| Latency        | 48.3 ms                   | 5.7 ms                          | 8X          |
| Cache          | none                      | 8 MBytes                        |             |



| Memory          | Archaic: 1980 DRAM (asynchronous) | Modern: 2000 Double Data Rate Synchronous (clocked) DRAM | Improvement |
|-----------------|-----------------------------------|----------------------------------------------------------|-------------|
| Capacity        | 0.06 Mbits/chip                   | 256.00 Mbits/chip                                        | 4000X       |
| Transistors     | 64,000 xtors, 35 mm<sup>2</sup>   | 256,000,000 xtors, 204 mm<sup>2</sup>                    |             |
| Data bus        | 16-bit per module, 16 pins/chip   | 64-bit per DIMM, 66 pins/chip                            | 4X          |
| Bandwidth       | 13 Mbytes/sec                     | 1600 Mbytes/sec                                          | 120X        |
| Latency         | 225 ns                            | 52 ns                                                    | 4X          |
| Block transfers | none                              | page mode                                                |             |










#### Rule of Thumb for Latency Lagging BW

- In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4 (and capacity improves faster than bandwidth)
- Stated alternatively: Bandwidth improves by more than the square of the improvement in Latency
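The rule of thumb can be checked against the disk and memory milestones quoted in this lecture (0.6 to 86 MBytes/sec and 48.3 to 5.7 ms for disk; 13 to 1600 Mbytes/sec and 225 to 52 ns for DRAM). The sketch below is one way to do that check; the helper names are illustrative.

```python
import math

# (bandwidth_old, bandwidth_new, latency_old, latency_new), taken from the lecture's tables.
MILESTONES = {
    "disk": (0.6, 86.0, 48.3, 5.7),         # MBytes/sec, ms
    "memory": (13.0, 1600.0, 225.0, 52.0),  # Mbytes/sec, ns
}


def improvements(bw_old, bw_new, lat_old, lat_new):
    """Return (bandwidth gain, latency gain) as plain ratios."""
    return bw_new / bw_old, lat_old / lat_new


def latency_gain_per_bw_doubling(bw_gain, lat_gain):
    """Latency improvement achieved in the time bandwidth doubles once."""
    doublings = math.log2(bw_gain)
    return lat_gain ** (1.0 / doublings)


if __name__ == "__main__":
    for name, figures in MILESTONES.items():
        bw_gain, lat_gain = improvements(*figures)
        per_doubling = latency_gain_per_bw_doubling(bw_gain, lat_gain)
        print(f"{name}: BW {bw_gain:.0f}X, latency {lat_gain:.1f}X, "
              f"latency^2 {lat_gain ** 2:.0f}X, per BW doubling {per_doubling:.2f}X")
```

For both technologies the bandwidth gain exceeds the square of the latency gain, and the latency gain per bandwidth doubling falls between 1.2 and 1.4.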



#### 6 Reasons Latency Lags Bandwidth (cont'd)

- 2. Distance limits latency
  - Size of DRAM block  $\Rightarrow$  long bit and word lines ⇒ most of DRAM access time
  - Speed of light and computers on network
  - 1. & 2. explain linear latency vs. square BW?
- 3. Bandwidth easier to sell ("bigger=better")
  - E.g., 10 Gbits/s Ethernet ("10 Gig") vs. 10 µsec latency Ethernet
  - 4400 MB/s DIMM ("PC4400") vs. 50 ns latency
  - Even if just marketing, customers now trained
  - Since bandwidth sells, more resources thrown at bandwidth, which further tips the balance

#### 6 Reasons Latency Lags Bandwidth (cont'd)

- 4. Latency helps BW, but not vice versa
  - Spinning disk faster improves both bandwidth and rotational latency
    - 3600 RPM ⇒ 15000 RPM = 4.2X
    - Average rotational latency: 8.3 ms ⇒ 2.0 ms
    - Things being equal, also helps BW by 4.2X
  - Lower DRAM latency ⇒ More accesses/second (higher bandwidth)
  - Higher linear density helps disk BW (and capacity), but not disk Latency
    - 9,550 BPI ⇒ 533,000 BPI ⇒ 60X in BW
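The rotational latency figures above follow from the fact that, on average, the disk must turn half a revolution before the requested sector arrives under the head. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: the time for half a revolution, in ms."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0


if __name__ == "__main__":
    old, new = avg_rotational_latency_ms(3600), avg_rotational_latency_ms(15000)
    print(f"3600 RPM: {old:.1f} ms, 15000 RPM: {new:.1f} ms, ratio {old / new:.1f}X")
```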

#### 6 Reasons Latency Lags Bandwidth (cont'd)

- 5. Bandwidth hurts latency
  - Queues help Bandwidth, hurt Latency (Queuing Theory)
  - Adding chips to widen a memory module increases Bandwidth but higher fan-out on address lines may increase Latency
- 6. Operating System overhead hurts Latency more than Bandwidth
  - Long messages amortize overhead; overhead bigger part of short messages
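The amortization argument in reason 6 can be made concrete with a simple cost model: each message pays a fixed software overhead plus a transfer time proportional to its size. The overhead (10 µs) and link rate (1 GByte/s) below are hypothetical values chosen for illustration, not measurements from the lecture.

```python
def effective_bandwidth(message_bytes: float, overhead_s: float, link_bytes_per_s: float) -> float:
    """Delivered bytes per second once the fixed per-message overhead is included."""
    transfer_s = message_bytes / link_bytes_per_s
    return message_bytes / (overhead_s + transfer_s)


if __name__ == "__main__":
    OVERHEAD_S = 10e-6   # hypothetical per-message OS overhead
    LINK_RATE = 1e9      # hypothetical raw link bandwidth in bytes/second
    for size in (100, 10_000, 1_000_000):
        fraction = effective_bandwidth(size, OVERHEAD_S, LINK_RATE) / LINK_RATE
        print(f"{size:>9} bytes: {fraction:.1%} of raw bandwidth")
```

With these assumed values a short message sees only a small fraction of the raw bandwidth, while a long message gets nearly all of it, even though both pay the same overhead.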

#### Summary of Technology Trends

- For disk, LAN, memory, and microprocessor, bandwidth improves by square of latency improvement
  - In the time that bandwidth doubles, latency improves by no more than 1.2X to 1.4X
- Lag probably even larger in real systems, as bandwidth gains multiplied by replicated components
  - Multiple processors in a cluster or even in a chip
  - Multiple disks in a disk array
  - Multiple memory modules in a large memory
  - Simultaneous communication in switched LAN
- HW and SW developers should innovate assuming Latency Lags Bandwidth
  - If everything improves at the same rate, then nothing really changes
  - When rates vary, require real innovation

#### Outline

- Technology Trends: Culture of tracking, anticipating and exploiting advances in technology
- Careful, quantitative comparisons:
  - 1. Define and quantify cost
  - 2. Define and quantify power
  - 3. Define and quantify dependability
  - 4. Define, quantify, and summarize relative performance

#### Define and quantify cost (1/3)

#### Three factors lower cost:

- 1. Learning curve: manufacturing costs decrease over time, measured by change in yield
  - Yield = % of manufactured devices that survive the testing procedure
- 2. Volume: doubling volume cuts cost 10%
  - Decreases time to get down the learning curve
  - Increases purchasing and manufacturing efficiency
  - Amortizes development costs over more devices
- 3. Commodities: reduce costs by reducing margins
  - Products sold by multiple vendors in large volumes that are essentially identical
  - E.g. keyboards, monitors, DRAMs, disks, PCs
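The volume rule of thumb (doubling volume cuts cost 10%) can be written as a simple formula. The sketch below assumes the 10% cut compounds with each doubling, which is an interpretation rather than something the slide states; the base cost and volumes are hypothetical.

```python
import math


def unit_cost(base_cost: float, base_volume: float, volume: float,
              cut_per_doubling: float = 0.10) -> float:
    """Unit cost after scaling volume, assuming each doubling cuts cost by a fixed fraction."""
    doublings = math.log2(volume / base_volume)
    return base_cost * (1.0 - cut_per_doubling) ** doublings


if __name__ == "__main__":
    # Hypothetical part costing 100 (arbitrary currency units) at 1,000 units.
    for volume in (1_000, 2_000, 4_000, 8_000):
        print(volume, round(unit_cost(100.0, 1_000, volume), 2))
```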

Most of computer cost in Integrated Circuits (ICs)





#### Define and quantify cost: cost vs. price (3/3)

- Margin = price product sells for - cost to manufacture
- Margins pay for research and development (R&D), marketing, sales, manufacturing equipment maintenance, building rental, cost of financing, pretax profits, and taxes.
- Most companies spend 4% (commodity PC business) to 12% (high-end server business) of income on R&D, which includes all engineering.


#### Define and quantify power (1/2)

- For CMOS chips, traditional dominant energy consumption has been in switching transistors, called *dynamic power*

  $\text{Power}_{dynamic} = 1/2 \times \text{CapacitiveLoad} \times \text{Voltage}^2 \times \text{FrequencySwitched}$

- For mobile devices, energy is the better metric

  $\text{Energy}_{dynamic} = \text{CapacitiveLoad} \times \text{Voltage}^2$

- For a fixed task, slowing clock rate (frequency switched) reduces power, but not energy
- Capacitive load is a function of the number of transistors connected to an output and of the technology, which determines capacitance of wires and transistors
- Dropping voltage helps both, so went from 5V to 1V
- To save energy & dynamic power, most CPUs now turn off clock of inactive modules (e.g. Fl. Pt. Unit)

#### Example of quantifying power

Suppose 15% reduction in voltage results in a 15% reduction in frequency. What is impact on dynamic power?

$$
\begin{aligned}
\text{Power}_{dynamic} &= 1/2 \times \text{CapacitiveLoad} \times \text{Voltage}^2 \times \text{FrequencySwitched} \\
\text{NewPower}_{dynamic} &= 1/2 \times \text{CapacitiveLoad} \times (.85 \times \text{Voltage})^2 \times (.85 \times \text{FrequencySwitched}) \\
&= (.85)^3 \times \text{OldPower}_{dynamic} \\
&\approx 0.6 \times \text{OldPower}_{dynamic}
\end{aligned}
$$
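The same calculation can be reproduced numerically from the dynamic power and energy formulas. The sketch below uses normalized units (capacitive load, voltage and frequency all set to 1.0), which is an assumption made purely for illustration.

```python
def dynamic_power(capacitive_load: float, voltage: float, frequency: float) -> float:
    """Dynamic power = 1/2 x capacitive load x voltage^2 x frequency switched."""
    return 0.5 * capacitive_load * voltage ** 2 * frequency


def dynamic_energy(capacitive_load: float, voltage: float) -> float:
    """Dynamic energy = capacitive load x voltage^2 (independent of frequency)."""
    return capacitive_load * voltage ** 2


if __name__ == "__main__":
    load, volts, freq = 1.0, 1.0, 1.0  # normalized, illustrative units
    old = dynamic_power(load, volts, freq)
    new = dynamic_power(load, 0.85 * volts, 0.85 * freq)
    print(f"power ratio after 15% voltage and frequency cut: {new / old:.3f}")

    # Slowing only the clock halves power but leaves the energy formula unchanged.
    slow = dynamic_power(load, volts, 0.5 * freq)
    print(f"power ratio at half clock: {slow / old:.2f}, "
          f"energy ratio: {dynamic_energy(load, volts) / dynamic_energy(load, volts):.2f}")
```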

#### Define and quantify power (2/2)

- Because leakage current flows even when a transistor is off, static power is now important too

  $\text{Power}_{static} = \text{Current}_{static} \times \text{Voltage}$

- Leakage current increases in processors with smaller transistor sizes
- Increasing the number of transistors increases power even if they are turned off
- In 2006, goal for leakage is 25% of total power consumption; high performance designs at 40%
- Very low power systems even gate voltage to inactive modules to control loss due to leakage


#### Define and quantify dependability (1/3)

- How to decide when a system is operating properly?
- Infrastructure providers now offer Service Level Agreements (SLA) to guarantee that their networking or power service will be dependable
- Systems alternate between 2 states of service with respect to an SLA:
  - 1. Service accomplishment, where the service is delivered as specified in SLA
  - 2. Service interruption, where the delivered service is different from the SLA
- Failure = transition from state 1 to state 2
- Restoration = transition from state 2 to state 1

#### Define and quantify dependability (2/3)

- Module reliability = measure of continuous service accomplishment (or time to failure). Two metrics:
  - 1. Mean Time To Failure (MTTF) measures Reliability
  - 2. Failures In Time (FIT) = 1/MTTF, the rate of failures (traditionally reported as failures per billion hours of operation)
- Mean Time To Repair (MTTR) measures Service Interruption
  - Mean Time Between Failures (MTBF) = MTTF + MTTR
- Module availability measures service as it alternates between the 2 states of accomplishment and interruption (number between 0 and 1, e.g. 0.9)
- Module availability = MTTF / (MTTF + MTTR)
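The availability and MTBF definitions translate directly into code. In the sketch below the MTTF and MTTR values are hypothetical, chosen only so that the result is easy to read.

```python
def mtbf_hours(mttf_hours: float, mttr_hours: float) -> float:
    """Mean Time Between Failures = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Module availability = MTTF / (MTTF + MTTR), a number between 0 and 1."""
    return mttf_hours / mtbf_hours(mttf_hours, mttr_hours)


if __name__ == "__main__":
    # Hypothetical module: fails on average every 999 hours, takes 1 hour to repair.
    print(mtbf_hours(999.0, 1.0), availability(999.0, 1.0))
```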

#### Example calculating reliability

- If modules have exponentially distributed lifetimes (age of module does not affect probability of failure), overall failure rate is the sum of failure rates of the modules
- Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF), and 1 power supply (0.2M hour MTTF):

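$$
\begin{aligned}
\text{FailureRate} &= 10 \times (1/1{,}000{,}000) + 1/500{,}000 + 1/200{,}000 \\
&= (10 + 2 + 5)/1{,}000{,}000 \\
&= 17/1{,}000{,}000 \\
&= 17{,}000\ \text{FIT} \\
\text{MTTF} &= 1{,}000{,}000{,}000/17{,}000 \\
&\approx 59{,}000\ \text{hours}
\end{aligned}
$$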
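The calculation above can be reproduced in a few lines, under the same assumption of exponentially distributed lifetimes so that failure rates add. FIT is taken here as failures per billion hours, which is what makes 17 failures per million hours equal 17,000 FIT; the function names are illustrative.

```python
HOURS_PER_FIT_UNIT = 1_000_000_000  # FIT counts failures per billion hours


def system_failure_rate(module_mttfs_hours):
    """Failures per hour for a system that fails when any one module fails."""
    return sum(1.0 / mttf for mttf in module_mttfs_hours)


def to_fit(failures_per_hour: float) -> float:
    """Convert a per-hour failure rate into FIT."""
    return failures_per_hour * HOURS_PER_FIT_UNIT


if __name__ == "__main__":
    # 10 disks, 1 disk controller and 1 power supply, as in the example.
    modules = [1_000_000] * 10 + [500_000, 200_000]
    rate = system_failure_rate(modules)
    print(f"FIT: {to_fit(rate):.0f}, MTTF: {1.0 / rate:.0f} hours")
```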

#### And in conclusion ...

- Computer Architecture >> instruction sets
- Computer Architecture skill sets are different
  - 5 Quantitative principles of design
  - Quantitative approach to design
  - Solid interfaces that really work
  - Technology tracking and anticipation
- Computer Science at the crossroads from sequential to parallel computing
  - Salvation requires innovation in many fields, including computer architecture
- Tracking and extrapolating technology part of architect's responsibility
- Expect Bandwidth in disks, DRAM, network, and processors to improve by at least as much as the square of the improvement in Latency
- Quantify dynamic and static power
  - Capacitance x Voltage<sup>2</sup> x frequency, Energy vs. power
- Quantify dependability
  - Reliability (MTTF, FIT), Availability (99.9...)

# Reading

- This lecture: chapter 1
- Next lecture: appendix A
- Assignment 1: appendix B