# **Computer Architecture** 2007-2008

#### Organization (www.liacs.nl/ca)

#### People

- Lecturer: Lex Wolters
- Assignment leader: Harmen van der Spek
- Assistant: Van Thieu Vu
- Student assistants: Eyal Halm & Joris Huizer

#### Lectures (3 EC)

- Wednesday 11.15-13.00h till Dec 5th (except Oct 3rd)
- Book: Hennessy & Patterson, fourth edition!
- Exam: date not yet known
- Assignment (4 EC)
  - Parts 1 (10%), 2a (30%), 2b (30%), 3 (30%): strict deadlines
  - Assistance (room 306):
    - Wed 13.45-15.30h (scheduled): this afternoon Intro part 1
    - Mon, Tue, Thu 15.30-16.30h

# Lecture 1 - Introduction

Slides were used during lectures by David Patterson, Berkeley, spring 2006

#### Outline

- Computer Science at a Crossroads
- Computer Architecture v. Instruction Set Arch.
- What Computer Architecture brings to table

Break

#### Crossroads: Conventional Wisdom in Comp. Arch

- Old Conventional Wisdom: Power is free, Transistors expensive
- New Conventional Wisdom: "Power wall": Power expensive, Xtors free (can put more on chip than can afford to turn on)
- Old CW: Sufficiently increasing Instruction Level Parallelism via compilers, innovation (Out-of-order, speculation, VLIW, ...)
- New CW: "ILP wall": law of diminishing returns on more HW for ILP
- Old CW: Multiplies are slow, Memory access is fast
- New CW: "Memory wall": Memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for multiply)

- Old CW: Uniprocessor performance 2X / 1.5 yrs
- New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
  - Uniprocessor performance now 2X / 5(?) yrs
- ⇒ Sea change in chip design: multiple "cores" (2X processors per chip / ~ 2 years)
  - More, simpler processors are more power efficient
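To make the two doubling times easier to compare, the sketch below converts "2X every N years" into a yearly multiplier. The function name `annual_growth` is only an illustrative choice, and the 5-year figure carries the same question mark as on the slide.

```python
def annual_growth(doubling_years: float) -> float:
    """Yearly performance multiplier implied by a given doubling time."""
    return 2.0 ** (1.0 / doubling_years)


old_rate = annual_growth(1.5)  # old CW: 2X every 1.5 years
new_rate = annual_growth(5.0)  # new CW: 2X every 5(?) years

print(f"old CW: about {100 * (old_rate - 1):.0f}% per year")
print(f"new CW: about {100 * (new_rate - 1):.0f}% per year")
```

Under these assumptions the uniprocessor improvement rate drops from roughly 59% per year to roughly 15% per year.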



#### Déjà vu all over again?

- Multiprocessors imminent in 1970s, '80s, '90s, ...
- "... today's processors ... are nearing an impasse as technologies approach the speed of light.."
  - David Mitchell, The Transputer: The Time Is Now (1989)
- Transputer was premature ⇒ Custom multiprocessors strove to lead uniprocessors ⇒ Procrastination rewarded: 2X seq. perf. / 1.5 years
- "We are dedicating all of our future product development to multicore designs. ... This is a sea change in computing"
  - Paul Otellini, President, Intel (2004)
- Difference is all microprocessor companies switch to multiprocessors (AMD, Intel, IBM, Sun; all new Apples 2 CPUs) ⇒ Procrastination penalized: 2X sequential perf. / 5 yrs ⇒ Biggest programming challenge: 1 to 2 CPUs

#### Sea Change in Chip Design

- Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm<sup>2</sup> chip
- RISC II (1983): 32-bit, 5 stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm<sup>2</sup> chip
- 125 mm<sup>2</sup> chip, 0.065 micron CMOS = 2312 RISC II+FPU+Icache+Dcache
  - RISC II shrinks to ~ 0.02 mm<sup>2</sup> at 65 nm
  - Caches via DRAM or 1 transistor SRAM?
- Processor is the new transistor?

#### **Problems with Sea Change**

- Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, Libraries, ... not ready to supply Thread Level Parallelism or Data Level Parallelism for 1000 CPUs / chip
- Architectures not ready for 1000 CPUs / chip
  - Unlike Instruction Level Parallelism, cannot be solved by computer architects and compiler writers alone, but also cannot be solved *without* participation of computer architects
- The 4<sup>th</sup> edition of the textbook 'Computer Architecture: A Quantitative Approach' explores shift from Instruction Level Parallelism to Thread Level Parallelism / Data Level Parallelism

#### Outline

- Computer Science at a Crossroads
- Computer Architecture v. Instruction Set Arch.
- What Computer Architecture brings to table





#### Instruction Set Architecture

"... the attributes of a [computing] system as seen by the programmer, *i.e.* the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation."

Amdahl, Blaauw, and Brooks, 1964

- Organization of Programmable Storage
- Data Types & Data Structures: Encodings & Representations
- Instruction Formats
- Instruction (or Operation Code) Set
- Modes of Addressing and Accessing Data Items and Instructions
- Exceptional Conditions

#### **ISA vs. Computer Architecture**

- Old definition of computer architecture = instruction set design
  - Other aspects of computer design called implementation
  - Insinuates implementation is uninteresting or less challenging
- Our view is computer architecture >> ISA
- Architect's job much more than instruction set design; technical hurdles today more challenging than those in instruction set design
- Since instruction set design not where action is, some conclude computer architecture (using old definition) is not where action is
  - We disagree on conclusion
  - Agree that ISA not where action is (ISA in appendix B)

#### Comp. Arch. is an Integrated Approach

- What really matters is the functioning of the complete system
  - hardware, runtime system, compiler, operating system, and application
  - In networking, this is called the "End to End argument"
- Computer architecture is not just about transistors, individual instructions, or particular implementations
  - E.g., Original RISC projects replaced complex instructions with a compiler + simple instructions





#### Outline

- Computer Science at a Crossroads
- Computer Architecture v. Instruction Set Arch.
- What Computer Architecture brings to table

#### What Computer Architecture brings to Table

- Other fields often borrow ideas from architecture
- **Quantitative Principles of Design**
  - 1. Take Advantage of Parallelism
  - 2. Principle of Locality
  - 3. Focus on the Common Case
  - 4. Amdahl's Law
  - 5. The Processor Performance Equation
- Careful, quantitative comparisons
  - Define, quantify, and summarize relative performance
  - Define and quantify relative cost
  - Define and quantify dependability
  - Define and quantify power
- Culture of anticipating and exploiting advances in technology
- Culture of well-defined interfaces that are carefully implemented and thoroughly checked

#### 1) Take Advantage of Parallelism

- Increasing throughput of server computer via multiple processors or multiple disks
- Detailed HW design
  - Carry lookahead adders use parallelism to speed up computing sums from linear to logarithmic in number of bits per operand
  - Multiple memory banks searched in parallel in set-associative caches
- Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence.
  - Not every instruction depends on immediate predecessor ⇒ executing instructions completely/partially in parallel possible
  - Classic 5-stage pipeline: 1) Instruction Fetch (Ifetch), 2) Register Read (Reg), 3) Execute (ALU), 4) Data Memory Access (Dmem), 5) Register Write (Reg)
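As a rough illustration of why overlapping instructions shortens a sequence, the sketch below counts cycles for an idealized pipeline. It assumes one cycle per stage, no hazards or stalls, and an unpipelined machine that spends all five stage-times on each instruction; the function names are illustrative only.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """Each instruction runs through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """The first instruction fills the pipeline; each later one finishes one cycle after its predecessor."""
    return n_stages + (n_instructions - 1)


n = 1000
speedup = unpipelined_cycles(n) / pipelined_cycles(n)
print(f"{n} instructions: {unpipelined_cycles(n)} vs {pipelined_cycles(n)} cycles, speedup {speedup:.2f}")
```

For long sequences the ideal speedup approaches the number of stages; real pipelines fall short of this because of hazards.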





#### 2) The Principle of Locality

- The Principle of Locality: Programs access a relatively small portion of the address space at any instant of time.
- Two Different Types of Locality:
  - Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  - Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
- Last 30 years, HW relied on locality for memory perf.











#### Processor performance equation

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

|              | Inst Count | CPI | Clock Rate |
|--------------|------------|-----|------------|
| Program      | X          |     |            |
| Compiler     | X          | (X) |            |
| Inst. Set    | X          | X   |            |
| Organization |            | X   | X          |
| Technology   |            |     | X          |
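A minimal sketch of the processor performance equation follows. The workload numbers (instruction count, CPI, clock rate) are hypothetical and chosen only to show how the three factors multiply; `cpu_time` is an illustrative name.

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time in seconds = instructions x cycles/instruction x seconds/cycle."""
    seconds_per_cycle = 1.0 / clock_rate_hz
    return instruction_count * cpi * seconds_per_cycle


# Hypothetical workload: 1e9 instructions, CPI of 1.5, 2 GHz clock.
baseline = cpu_time(1e9, 1.5, 2e9)
# Hypothetical compiler change that removes 10% of the instructions.
optimized = cpu_time(0.9e9, 1.5, 2e9)

print(f"baseline: {baseline:.3f} s, optimized: {optimized:.3f} s")
```

Because the factors multiply, a change that lowers instruction count but raises CPI by the same proportion leaves CPU time unchanged, which is why the table tracks which design level affects which factor.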





#### Outline

- Technology Trends: Culture of tracking, anticipating and exploiting advances in technology
- Careful, quantitative comparisons:
  - 1. Define, quantify, and summarize relative performance
  - 2. Define and quantify relative cost
  - 3. Define and quantify dependability
  - 4. Define and quantify power



#### Tracking Technology Performance Trends

- Drill down into 4 technologies:
  - Disks
  - Memory
  - Network
  - Processors
- Compare ~1980 Archaic vs. ~2000 Modern
  - Performance Milestones in each technology
- Compare for Bandwidth vs. Latency improvements in performance over time
- Bandwidth: number of events per unit time
  - E.g., Mbits / second over network, Mbytes / second from disk
- Latency: elapsed time for a single event
  - E.g., one-way network delay in microseconds, average disk access time in milliseconds

| Disk: Archaic             | Modern                                   | Improvement |
|---------------------------|------------------------------------------|-------------|
| CDC Wren I, 1983          | Seagate 373453, 2003                     |             |
| 3600 RPM                  | 15000 RPM                                | 4X          |
| 0.03 GBytes capacity      | 73.4 GBytes                              | 2500X       |
| Tracks/Inch: 800          | Tracks/Inch: 64000                       | 80X         |
| Bits/Inch: 9550           | Bits/Inch: 533,000                       | 60X         |
| Three 5.25" platters      | Four 2.5" platters (in 3.5" form factor) |             |
| Bandwidth: 0.6 MBytes/sec | Bandwidth: 86 MBytes/sec                 | 140X        |
| Latency: 48.3 ms          | Latency: 5.7 ms                          | 8X          |
| Cache: none               | Cache: 8 MBytes                          |             |



| Memory: Archaic                          | Modern                                       | Improvement |
|------------------------------------------|----------------------------------------------|-------------|
| 1980 DRAM (asynchronous)                 | 2000 Double Data Rate Synchr. (clocked) DRAM |             |
| 0.06 Mbits/chip                          | 256.00 Mbits/chip                            | 4000X       |
| 64,000 xtors, 35 mm<sup>2</sup>          | 256,000,000 xtors, 204 mm<sup>2</sup>        |             |
| 16-bit data bus per module, 16 pins/chip | 64-bit data bus per DIMM, 66 pins/chip       | 4X          |
| 13 Mbytes/sec                            | 1600 Mbytes/sec                              | 120X        |
| Latency: 225 ns                          | Latency: 52 ns                               | 4X          |
| (no block transfer)                      | Block transfers (page mode)                  |             |











#### Rule of Thumb for Latency Lagging BW

- In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4 (and capacity improves faster than bandwidth)
- Stated alternatively: Bandwidth improves by more than the square of the improvement in Latency
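The rule of thumb can be checked against the disk and memory milestones quoted earlier (bandwidth 140X vs. latency 8X for disk, 120X vs. 4X for memory). The sketch below is only a sanity check of those four numbers; the helper name is illustrative.

```python
import math

# (bandwidth improvement, latency improvement) taken from the disk and memory milestones above
milestones = {"disk": (140.0, 8.0), "memory": (120.0, 4.0)}


def latency_gain_per_bw_doubling(bw_gain: float, latency_gain: float) -> float:
    """Latency improvement factor during each doubling of bandwidth."""
    doublings = math.log2(bw_gain)
    return latency_gain ** (1.0 / doublings)


for name, (bw, lat) in milestones.items():
    per_doubling = latency_gain_per_bw_doubling(bw, lat)
    print(f"{name}: latency x{per_doubling:.2f} per BW doubling, "
          f"BW gain {bw:.0f} vs latency gain squared {lat ** 2:.0f}")
```

Both technologies land inside the 1.2 to 1.4 range, and in both cases the bandwidth gain exceeds the square of the latency gain.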



#### 6 Reasons Latency Lags Bandwidth (cont'd)

- 2. Distance limits latency
  - Size of DRAM block  $\Rightarrow$  long bit and word lines ⇒ most of DRAM access time
  - Speed of light and computers on network
  - 1. & 2. explains linear latency vs. square BW?
- 3. Bandwidth easier to sell ("bigger=better")
  - E.g., 10 Gbits/s Ethernet ("10 Gig") vs. 10 µsec latency Ethernet
  - 4400 MB/s DIMM ("PC4400") vs. 50 ns latency
  - Even if just marketing, customers now trained
  - Since bandwidth sells, more resources thrown at bandwidth, which further tips the balance

#### 6 Reasons Latency Lags Bandwidth (cont'd)

- 4. Latency helps BW, but not vice versa
  - Spinning disk faster improves both bandwidth and rotational latency
    - 3600 RPM ⇒ 15000 RPM = 4.2X
    - Average rotational latency: 8.3 ms ⇒ 2.0 ms
    - Things being equal, also helps BW by 4.2X
  - Lower DRAM latency ⇒ More accesses/second (higher bandwidth)
  - Higher linear density helps disk BW (and capacity), but not disk Latency
    - 9,550 BPI ⇒ 533,000 BPI ⇒ 60X in BW
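The rotational latency figures follow from the spindle speed: on average the head waits half a revolution. A small sketch, with an illustrative function name:

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: the time for half a revolution, in milliseconds."""
    ms_per_revolution = 60_000.0 / rpm
    return 0.5 * ms_per_revolution


old = avg_rotational_latency_ms(3600)
new = avg_rotational_latency_ms(15000)
print(f"3600 RPM: {old:.1f} ms, 15000 RPM: {new:.1f} ms, ratio {old / new:.1f}X")
```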

#### 6 Reasons Latency Lags Bandwidth (cont'd)

- 5. Bandwidth hurts latency
  - Queues help Bandwidth, hurt Latency (Queuing Theory)
  - Adding chips to widen a memory module increases Bandwidth but higher fan-out on address lines may increase Latency
- 6. Operating System overhead hurts Latency more than Bandwidth
  - Long messages amortize overhead; overhead bigger part of short messages

#### Summary of Technology Trends

- For disk, LAN, memory, and microprocessor, bandwidth improves by square of latency improvement
  - In the time that bandwidth doubles, latency improves by no more than 1.2X to 1.4X
- Lag probably even larger in real systems, as bandwidth gains multiplied by replicated components
  - Multiple processors in a cluster or even in a chip
  - Multiple disks in a disk array
  - Multiple memory modules in a large memory
  - Simultaneous communication in switched LAN
- HW and SW developers should innovate assuming Latency Lags Bandwidth
  - If everything improves at the same rate, then nothing really changes
  - When rates vary, require real innovation

#### Outline

- Technology Trends: Culture of tracking, anticipating and exploiting advances in technology
- Careful, quantitative comparisons:
  - 1. Define and quantify cost
  - 2. Define and quantify power
  - 3. Define and quantify dependability
  - 4. Define, quantify, and summarize relative performance

#### Define and quantify cost (1/3)

#### Three factors lower cost:

- 1. Learning curve: manufacturing costs decrease over time, measured by change in yield
  - Yield = % of manufactured devices that survive the testing procedure
- 2. Volume: doubling volume cuts cost 10%
  - Decreases time to get down the learning curve
  - Increases purchasing and manufacturing efficiency
  - Amortizes development costs over more devices
- 3. Commodities: reduce costs by reducing margins
  - Products sold by multiple vendors in large volumes that are essentially identical
  - E.g. keyboards, monitors, DRAMs, disks, PCs

Most of computer cost in Integrated Circuits (ICs)





#### Define and quantify cost: cost vs. price (3/3)

- Margin = price product sells for − cost to manufacture
- Margins pay for research and development (R&D), marketing, sales, manufacturing equipment maintenance, building rental, cost of financing, pretax profits, and taxes.
- Most companies spend 4% (commodity PC business) to 12% (high-end server business) of income on R&D, which includes all engineering.

#### Outline

- Technology Trends: Culture of tracking, anticipating and exploiting advances in technology
- Careful, quantitative comparisons:
  - 1. Define and quantify cost
  - 2. Define and quantify power
  - 3. Define and quantify dependability
  - 4. Define, quantify, and summarize relative performance

#### Define and quantify power (1/2)

- For CMOS chips, traditional dominant energy consumption has been in switching transistors, called *dynamic power*
- Power<sub>dynamic</sub> = $\frac{1}{2}$ × CapacitiveLoad × Voltage<sup>2</sup> × FrequencySwitched
- For mobile devices, energy better metric
- Energy<sub>dynamic</sub> = CapacitiveLoad × Voltage<sup>2</sup>
- For a fixed task, slowing clock rate (frequency switched) reduces power, but not energy
- Capacitive load a function of number of transistors connected to output and technology, which determines capacitance of wires and transistors
- Dropping voltage helps both, so went from 5V to 1V
- To save energy & dynamic power, most CPUs now turn off clock of inactive modules (e.g. Fl. Pt. Unit)

#### Example of quantifying power

Suppose 15% reduction in voltage results in a 15% reduction in frequency. What is impact on dynamic power?

$$
\begin{aligned}
Power_{dynamic} &= 1/2 \times CapacitiveLoad \times Voltage^{2} \times FrequencySwitched \\
&= 1/2 \times CapacitiveLoad \times (0.85 \times Voltage)^{2} \times (0.85 \times FrequencySwitched) \\
&= (0.85)^{3} \times OldPower_{dynamic} \\
&\approx 0.6 \times OldPower_{dynamic}
\end{aligned}
$$
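The same calculation can be reproduced numerically. In the sketch below the capacitive load, voltage, and frequency are normalized to 1.0, an assumption that is harmless because only the ratio matters; `dynamic_power` is an illustrative name.

```python
def dynamic_power(capacitive_load: float, voltage: float, frequency: float) -> float:
    """Dynamic power = 1/2 x capacitive load x voltage^2 x frequency switched."""
    return 0.5 * capacitive_load * voltage ** 2 * frequency


old_power = dynamic_power(1.0, 1.0, 1.0)     # normalized baseline
new_power = dynamic_power(1.0, 0.85, 0.85)   # 15% lower voltage and frequency

print(f"new / old dynamic power = {new_power / old_power:.3f}")
```

The ratio is 0.85 cubed: voltage contributes a squared factor and frequency a linear one.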

#### Define and quantify power (2/2)

- Because leakage current flows even when a transistor is off, now static power important too

  Power<sub>static</sub> = Current<sub>static</sub> × Voltage

- Leakage current increases in processors with smaller transistor sizes
- Increasing the number of transistors increases power even if they are turned off
- In 2006, goal for leakage is 25% of total power consumption; high performance designs at 40%
- Very low power systems even gate voltage to inactive modules to control loss due to leakage

#### Outline

#### Review

- Technology Trends: Culture of tracking, anticipating and exploiting advances in technology
- Careful, quantitative comparisons:
  - 1. Define and quantify relative cost
  - 2. Define and quantify power
  - 3. Define and quantify dependability
  - 4. Define, quantify, and summarize relative performance

#### Define and quantify dependability (1/3)

- How to decide when a system is operating properly?
- Infrastructure providers now offer Service Level Agreements (SLA) to guarantee that their networking or power service would be dependable
- Systems alternate between 2 states of service with respect to an SLA:
  - 1. Service accomplishment, where the service is delivered as specified in SLA
  - 2. Service interruption, where the delivered service is different from the SLA
- Failure = transition from state 1 to state 2
- Restoration = transition from state 2 to state 1

#### Define and quantify dependability (2/3)

- Module reliability = measure of continuous service accomplishment (or time to failure). Two metrics:
  - 1. Mean Time To Failure (MTTF) measures Reliability
  - 2. Failures In Time (FIT) = 1/MTTF, the rate of failures, traditionally reported as failures per billion hours of operation
- Mean Time To Repair (MTTR) measures Service Interruption
  - Mean Time Between Failures (MTBF) = MTTF + MTTR
- Module availability measures service as alternation between the 2 states of accomplishment and interruption (number between 0 and 1, e.g. 0.9)
- Module availability = MTTF / (MTTF + MTTR)
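A short sketch of the availability formula follows. The MTTF and MTTR values are hypothetical and only show how repair time enters the result; `availability` is an illustrative name.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Module availability = MTTF / (MTTF + MTTR), a number between 0 and 1."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical module: fails on average every 1,000,000 hours, takes 24 hours to repair.
a = availability(1_000_000, 24)
print(f"availability = {a:.6f}")
```

Halving MTTR has almost the same effect on availability as doubling MTTF, which is why repair time is tracked alongside reliability.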

#### Example calculating reliability

- If modules have exponentially distributed lifetimes (age of module does not affect probability of failure), overall failure rate is the sum of failure rates of the modules
- Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF), and 1 power supply (0.2M hour MTTF):

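$$
\begin{aligned}
\text{FailureRate} &= 10 \times \frac{1}{1{,}000{,}000} + \frac{1}{500{,}000} + \frac{1}{200{,}000} \\
&= \frac{10 + 2 + 5}{1{,}000{,}000} = \frac{17}{1{,}000{,}000} \\
&= 17{,}000\ \text{FIT}
\end{aligned}
$$

$$\text{MTTF} = \frac{1{,}000{,}000{,}000}{17{,}000} \approx 59{,}000\ \text{hours}$$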
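The worked example above can be reproduced in a few lines, under the same assumption of exponentially distributed lifetimes so that failure rates simply add. The function name is illustrative.

```python
def system_failure_rate(module_mttfs_hours):
    """Failures per hour for a system that fails when any one module fails."""
    return sum(1.0 / mttf for mttf in module_mttfs_hours)


# 10 disks (1M hour MTTF each), 1 disk controller (0.5M hours), 1 power supply (0.2M hours)
modules = [1_000_000] * 10 + [500_000, 200_000]

rate = system_failure_rate(modules)
fit = rate * 1e9            # failures per billion hours
system_mttf = 1.0 / rate    # hours

print(f"FIT = {fit:.0f}, MTTF = {system_mttf:.0f} hours")
```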

#### And in conclusion ...

- Computer Architecture >> instruction sets
- Computer Architecture skill sets are different
  - 5 Quantitative principles of design
  - Quantitative approach to design
  - Solid interfaces that really work
  - Technology tracking and anticipation
- Computer Science at the crossroads from sequential to parallel computing
  - Salvation requires innovation in many fields, including computer architecture
- Tracking and extrapolating technology part of architect's responsibility
- Expect Bandwidth in disks, DRAM, network, and processors to improve by at least as much as the square of the improvement in Latency
- Quantify dynamic and static power
  - Capacitance x Voltage<sup>2</sup> x frequency, Energy vs. power
- Quantify dependability
  - Reliability (MTTF, FIT), Availability (99.9...)

# Reading

- This lecture: chapter 1
- Next lecture: appendix A
- Assignment 1: appendix B

# Lecture 2 – Performance & **Pipelining**

Slides were used during lectures by David Patterson, Berkeley, spring 2006

#### **Review from last lecture**

- Tracking and extrapolating technology part of architect's responsibility
- Expect Bandwidth in disks, DRAM, network, and processors to improve by at least as much as the square of the improvement in Latency
- Quantify Cost (vs. Price)
  - IC ≈ f(Area<sup>2</sup>) + Learning curve, volume, commodity, margins
- Quantify dynamic and static power
  - Capacitance x Voltage<sup>2</sup> x frequency, Energy vs. power
- Quantify dependability
  - Reliability (MTTF vs. FIT), Availability (MTTF/(MTTF+MTTR))

#### **Outline**

- Quantify and summarize performance
  - Ratios, Geometric Mean, Multiplicative Standard Deviation
  - Fallacies & Pitfalls: Benchmarks age, disks fail, 1 point fail danger
- Pipelining
  - MIPS: an ISA for Pipelining
  - 5 stage pipelining
  - Structural and Data Hazards
  - Forwarding
  - Branch Schemes
  - Exceptions and Interrupts
- Conclusion

#### **Definition: Performance**

- Performance is in units of things per sec
  - bigger is better
- If we are primarily concerned with response time

$$Performance(X) = \frac{1}{ExecutionTime(X)}$$

"X is n times faster than Y" means

$$n = \frac{Performance(X)}{Performance(Y)} = \frac{ExecutionTime(Y)}{ExecutionTime(X)}$$
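As a small illustration of the definition, the sketch below computes n from two execution times. The times are hypothetical, and `times_faster` is an illustrative name.

```python
def times_faster(exec_time_x: float, exec_time_y: float) -> float:
    """n such that 'X is n times faster than Y' = ExecutionTime(Y) / ExecutionTime(X)."""
    return exec_time_y / exec_time_x


# Hypothetical measurements: X finishes in 10 s, Y in 15 s.
n = times_faster(10.0, 15.0)
print(f"X is {n:.2f} times faster than Y")
```

Note the inversion: the slower machine's time goes in the numerator because performance is the reciprocal of execution time.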

#### Performance: What to measure?

- Usually rely on benchmarks vs. real workloads
- To increase predictability, collections of benchmark applications, called benchmark suites, are popular
- SPECCPU: popular desktop benchmark suite
  - CPU only, split between integer and floating point programs
  - SPECCPU2006: 12 integer and 17 floating point programs
  - Motto: "An ounce of honest data is worth a pound of marketing hype"
  - SPECSFS (NFS file server) and SPECWeb (WebServer) added as server benchmarks
- Transaction Processing Council measures server performance and cost-performance for databases
  - TPC-C: complex query for Online Transaction Processing
  - TPC-H: models ad hoc decision support
  - TPC-W: a transactional web benchmark
  - TPC-App: application server and web services benchmark

#### How Summarize Suite Performance (1/5)

- Arithmetic average of execution time of all programs?
  - But they vary by 4X in speed, so some would be more important than others in arithmetic average
- Could add a weight per program, but how to pick a weight?
  - Different companies want different weights for their products
- SPECRatio: Normalize execution times to reference computer, yielding a ratio proportional to performance:

$$SPECRatio = \frac{\text{time on reference computer}}{\text{time on computer being rated}}$$





#### How Summarize Suite Performance (4/5)

- Does a single mean well summarize performance of programs in benchmark suite?
- Can decide if mean a good predictor by characterizing variability of distribution using standard deviation
- Like geometric mean, geometric standard deviation is multiplicative rather than arithmetic
- Can simply take the logarithm of SPECRatios, compute the standard mean and standard deviation, and then take the exponent to convert back:

$$GeometricMean = \exp\left(\frac{1}{n} \times \sum_{i=1}^{n} \ln(SPECRatio_i)\right)$$

$$GeometricStDev = \exp(StDev(\ln(SPECRatio_i)))$$
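The two formulas translate directly into code. The sketch below uses hypothetical SPECRatios, and it assumes the population standard deviation of the logarithms, since the slide does not say whether the sample or population form is intended.

```python
import math
import statistics


def geometric_mean(ratios):
    """exp of the arithmetic mean of the natural logs."""
    logs = [math.log(r) for r in ratios]
    return math.exp(sum(logs) / len(logs))


def geometric_stdev(ratios):
    """exp of the standard deviation of the natural logs (population form assumed)."""
    logs = [math.log(r) for r in ratios]
    return math.exp(statistics.pstdev(logs))


spec_ratios = [1.0, 2.0, 4.0]  # hypothetical SPECRatios
gm = geometric_mean(spec_ratios)
gsd = geometric_stdev(spec_ratios)

# Because the deviation is multiplicative, the one-StDev range is [gm / gsd, gm * gsd].
print(f"geometric mean {gm:.2f}, multiplicative StDev {gsd:.2f}, "
      f"range [{gm / gsd:.2f}, {gm * gsd:.2f}]")
```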











Range is [0.75,2.27] with 11/14 inside 1 StDev (78%)

#### Fallacies and Pitfalls (1/2)

- Fallacies: commonly held misconceptions
  - When discussing a fallacy, we try to give a counterexample
- Pitfalls: easily made mistakes
  - Often generalizations of principles true in limited context
- Show Fallacies and Pitfalls to help you avoid these errors
- Fallacy: Benchmarks remain valid indefinitely
  - Once a benchmark becomes popular, tremendous pressure to improve performance by targeted optimizations or by aggressive interpretation of the rules for running the benchmark: "benchmarksmanship."
  - 70 benchmarks from the 5 SPEC releases. 70% were dropped from the next release since no longer useful
- Pitfall: A single point of failure
  - Rule of thumb for fault tolerant systems: make sure that every component is redundant so that no single component failure can bring down the whole system (e.g., power supply)

#### Fallacies and Pitfalls (2/2)

- Fallacy: Rated MTTF of disks is 1,200,000 hours or ≈ 140 years, so disks practically never fail
- But disk lifetime is 5 years ⇒ replace a disk every 5 years; on average, 28 replacements wouldn't fail
- A better unit: % that fail (1.2M MTTF = 833 FIT)
- Fail over lifetime: if had 1000 disks for 5 years = 1000\*(5\*365\*24)\*833 / 10<sup>9</sup> = 36,485,000 / 10<sup>6</sup> ≈ 37 ⇒ 3.7% (37/1000) fail over 5 yr lifetime (1.2M hr MTTF)
- But this is under pristine conditions
  - little vibration, narrow temperature range, no power failures
- Real world:
  - 3% to 6% of SCSI drives fail per year
    - 3400 - 6800 FIT or 150,000 - 300,000 hour MTTF [Gray & van Ingen 05]
  - 3% to 7% of ATA drives fail per year
    - 3400 - 8000 FIT or 125,000 - 300,000 hour MTTF [Gray & van Ingen 05]
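The conversions between MTTF, FIT, and failures over a deployment can be sketched as follows. The helper names are illustrative, and a year is taken as 365 × 24 hours as in the calculation above.

```python
HOURS_PER_YEAR = 365 * 24


def fit_from_mttf(mttf_hours: float) -> float:
    """Failures per billion hours of operation."""
    return 1e9 / mttf_hours


def expected_failures(n_disks: int, years: float, fit: float) -> float:
    """Expected number of failures among n_disks running for the given number of years."""
    return n_disks * years * HOURS_PER_YEAR * fit / 1e9


def fit_from_annual_failure_rate(fraction_per_year: float) -> float:
    """Convert 'x% fail per year' into FIT."""
    return fraction_per_year / HOURS_PER_YEAR * 1e9


print(f"1.2M hour MTTF = {fit_from_mttf(1_200_000):.0f} FIT")
print(f"1000 disks, 5 years, 833 FIT: {expected_failures(1000, 5, 833):.1f} failures")
print(f"3% per year = {fit_from_annual_failure_rate(0.03):.0f} FIT")
```

The last line shows where the real-world range starts: a 3% annual failure rate is already about four times the 833 FIT implied by the rated MTTF.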

#### Outline

- Quantify and summarize performance
  - Ratios, Geometric Mean, Multiplicative Standard Deviation
  - Fallacies & Pitfalls: Benchmarks age, disks fail, 1 point fail danger
- Pipelining
  - MIPS: an ISA for Pipelining
  - 5 stage pipelining
  - Structural and Data Hazards
  - Forwarding
  - Branch Schemes
  - Exceptions and Interrupts
- Conclusion

# A "Typical" RISC ISA

- 32-bit fixed format instruction (3 formats)
- 32 32-bit GPR (R0 contains zero, DP take pair)
- 3-address, reg-reg arithmetic instruction
- Single address mode for load/store: base + displacement
  - no indirection
- Simple branch conditions
- Delayed branch

see: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3





#### Approaching an ISA

- Instruction Set Architecture
  - Defines set of operations, instruction format, hardware supported data types, named storage, addressing modes, sequencing
- Meaning of each instruction is described by the register transfer language (RTL) on architected registers and memory
- Given technology constraints assemble adequate datapath
  - Architected storage mapped to actual storage
  - Function units to do all the required operations
  - Possible additional storage (e.g. MAR, MBR, ...)
  - Interconnect to move information among regs and FUs
- Map each instruction to sequence of RTLs
- Collate sequences into symbolic controller state transition diagram (STD)
- Lower symbolic STD to control points
- Implement controller











#### Pipelining is not quite that easy!

- Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
  - Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
  - Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
  - Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).











· Machine A is 1.33 times faster



#### **Three Generic Data Hazards**

**Read After Write (RAW)**: Instr<sub>J</sub> tries to read operand before Instr<sub>I</sub> writes it

> I: add r1,r2,r3
>
> J: sub r4,<mark>r1</mark>,r3

Caused by a "Dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

#### **Three Generic Data Hazards**

- Write After Read (WAR): Instr<sub>J</sub> writes operand before Instr<sub>I</sub> reads it
  - I: sub r4,<mark>r1</mark>,r3
  - J: add <mark>r1</mark>,r2,r3
  - K: mul r6,r1,r7
- Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
- Can't happen in MIPS 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5
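The hazard categories can be recognized purely from register names. The sketch below is a simplified illustration: the `Instr` type and `data_hazards` function are hypothetical, and it also reports Write After Write (WAW), the third of the three generic data hazards. It flags potential hazards between two instructions in program order; whether one actually causes a problem depends on the pipeline, as noted above for WAR.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def data_hazards(first: Instr, second: Instr) -> List[str]:
    """Name-based hazards between two instructions, with `first` earlier in program order."""
    hazards = []
    if first.dest in second.srcs:
        hazards.append("RAW")   # second reads what first writes (true dependence)
    if second.dest in first.srcs:
        hazards.append("WAR")   # second overwrites what first still reads (anti-dependence)
    if first.dest == second.dest:
        hazards.append("WAW")   # both write the same register
    return hazards


# The RAW example: I: add r1,r2,r3 ; J: sub r4,r1,r3
print(data_hazards(Instr("add", "r1", ("r2", "r3")), Instr("sub", "r4", ("r1", "r3"))))
# The WAR example: I: sub r4,r1,r3 ; J: add r1,r2,r3
print(data_hazards(Instr("sub", "r4", ("r1", "r3")), Instr("add", "r1", ("r2", "r3"))))
```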

















#### **Branch Stall Impact**

- If CPI = 1, 30% branch, Stall 3 cycles ⇒ new CPI = 1 + 0.30 × 3 = 1.9!
- Two part solution:
  - Determine branch taken or not sooner, AND
  - Compute taken branch address earlier
- MIPS branch tests if register = 0 or ≠ 0
- MIPS Solution:
  - Move Zero test to ID/RF stage
  - Adder to calculate new PC in ID/RF stage
  - 1 clock cycle penalty for branch versus 3



#### Four Branch Hazard Alternatives

#### #1: Stall until branch direction is clear

#### #2: Predict Branch Not Taken

- Execute successor instructions in sequence
- "Squash" instructions in pipeline if branch actually taken
- Advantage of late pipeline state update
- 47% MIPS branches not taken on average
- PC+4 already calculated, so use it to get next instruction

#### #3: Predict Branch Taken

- 53% MIPS branches taken on average
- But haven't calculated branch target address in MIPS
  - MIPS still incurs 1 cycle branch penalty
  - Other machines: branch target known before outcome

#### Four Branch Hazard Alternatives

#### #4: Delayed Branch

- Define branch to take place AFTER a following instruction
  - Branch delay of length n:
    - branch instruction
    - sequential successor<sub>1</sub>
    - sequential successor<sub>2</sub>
    - ...
    - sequential successor<sub>n</sub>
- 1 slot delay allows proper decision and branch target address in 5 stage pipeline
- MIPS uses this







$Pipeline \ speedup = \frac{Pipeline \ depth}{1 + Branch \ frequency \times Branch \ penalty}$

Assume 4% unconditional branch, 6% conditional branch-untaken, 10% conditional branch-taken

| Scheduling scheme | Branch penalty | CPI  | Speedup v. unpipelined | Speedup v. stall |
|-------------------|----------------|------|------------------------|------------------|
| Stall pipeline    | 3              | 1.60 | 3.1                    | 1.0              |
| Predict taken     | 1              | 1.20 | 4.2                    | 1.33             |
| Predict not taken | 1              | 1.14 | 4.4                    | 1.40             |
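The CPI and speedup figures for the branch schemes can be re-derived from the branch mix. The sketch below is only an illustration: it assumes a 5-stage pipeline with an ideal CPI of 1, and it assumes that "predict taken" charges its 1-cycle penalty to every branch while "predict not taken" charges it only to unconditional and taken branches (this is the reading that reproduces the CPI values 1.20 and 1.14).

```python
# Branch mix from the slide, as fractions of all instructions.
UNCONDITIONAL = 0.04
COND_UNTAKEN = 0.06
COND_TAKEN = 0.10
PIPELINE_DEPTH = 5  # assumption: 5-stage pipeline with an ideal CPI of 1


def branch_cpi(penalized_fraction, penalty_cycles):
    """CPI = 1 + (fraction of instructions that pay the penalty) x penalty."""
    return 1.0 + penalized_fraction * penalty_cycles


ALL_BRANCHES = UNCONDITIONAL + COND_UNTAKEN + COND_TAKEN

CPI = {
    "Stall pipeline": branch_cpi(ALL_BRANCHES, 3),
    "Predict taken": branch_cpi(ALL_BRANCHES, 1),
    "Predict not taken": branch_cpi(UNCONDITIONAL + COND_TAKEN, 1),
}

for scheme, cpi in CPI.items():
    vs_unpipelined = PIPELINE_DEPTH / cpi      # pipeline depth / CPI
    vs_stall = CPI["Stall pipeline"] / cpi     # relative to always stalling
    print(f"{scheme:18s} CPI={cpi:.2f}  v. unpipelined={vs_unpipelined:.1f}  v. stall={vs_stall:.2f}")
```

Changing the branch mix or the penalties shows how quickly the stall scheme falls behind as branches become more frequent.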

#### **Problems with Pipelining**

- Exception: An unusual event happens to an instruction during its execution
  - Examples: divide by zero, undefined opcode
- Interrupt: Hardware signal to switch the processor to a new instruction stream
  - Example: a sound card interrupts when it needs more audio output samples (an audio "click" happens if it is left waiting)
- Problem: It must appear that the exception or interrupt occurs between 2 instructions (I<sub>i</sub> and I<sub>i+1</sub>)
  - The effect of all instructions up to and including I<sub>i</sub> is totally complete
  - No effect of any instruction after I<sub>i</sub> can take place
- The interrupt (exception) handler either aborts the program or restarts at instruction I<sub>i+1</sub>



#### And In Conclusion:

- Quantify and summarize performance
  - Ratios, Geometric Mean, Multiplicative Standard Deviation
- F&P: Benchmarks age, disks fail, single point of failure danger
- Control via State Machines and Microprogramming
- Just overlap tasks; easy if tasks are independent
- Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:
  - $Speedup = \frac{Pipeline \ depth}{1 + Pipeline \ stall \ CPI} \times \frac{Cycle \ Time_{unpipelined}}{Cycle \ Time_{pipelined}}$
- Hazards limit performance on computers:
  - Structural: need more HW resources
  - Data (RAW,WAR,WAW): need forwarding, compiler scheduling
  - Control: delayed branch, prediction
- Exceptions, Interrupts add complexity

#### Reading

- This lecture: appendix A Pipelining
- Next lecture: appendix C Memory Hierarchy



#### **Review from last lecture**

- Quantify and summarize performance
  - Ratios, Geometric Mean, Multiplicative Standard Deviation
- F&P: Benchmarks age, disks fail, single point of failure danger
- Control via State Machines and Microprogramming
- Just overlap tasks; easy if tasks are independent
- Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:
  - $Speedup = \frac{Pipeline \ depth}{1 + Pipeline \ stall \ CPI} \times \frac{Cycle \ Time_{unpipelined}}{Cycle \ Time_{pipelined}}$
- Hazards limit performance on computers:
  - Structural: need more HW resources
  - Data (RAW,WAR,WAW): need forwarding, compiler scheduling
  - Control: delayed branch, prediction
- Exceptions, Interrupts add complexity

#### Outline

- Review
- Memory hierarchy
- Locality
- Cache design
- Virtual address spaces
- Page table layout
- TLB design options
- Conclusion











#### The Principle of Locality

- The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.

- Two Different Types of Locality:
  - Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  - Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access)
- Last 15 years, HW relied on locality for speed

It is a property of programs which is exploited in machine design.







- Hit rate is so high that we usually talk about miss rate instead
- Miss rate fallacy: miss rate is to average memory access time what MIPS is to CPU performance (a proxy that can mislead)
- Average memory-access time = Hit time + Miss rate × Miss penalty (ns or clocks)
- Miss penalty: time to replace a block from lower level, including time to replace in CPU
  - access time: time to lower level = f(latency to lower level)
  - transfer time: time to transfer block = f(bandwidth between upper & lower levels)
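The average memory-access time formula can be tried out with a few numbers. This is a minimal sketch; the function names and all numeric values (cycle counts, block size, bandwidth) are hypothetical and chosen only to illustrate the miss rate fallacy, where the cache with the lower miss rate ends up slower because its hit time is longer.

```python
def miss_penalty(access_time, block_bytes, bytes_per_cycle):
    """Miss penalty = access time (latency) + transfer time (block size / bandwidth)."""
    return access_time + block_bytes / bytes_per_cycle


def amat(hit_time, miss_rate, penalty):
    """Average memory-access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * penalty


# Hypothetical lower level: 100 cycles of latency, 64-byte blocks, 8 bytes per cycle.
penalty = miss_penalty(access_time=100, block_bytes=64, bytes_per_cycle=8)

# Hypothetical cache A: fast hit, slightly higher miss rate.
amat_a = amat(hit_time=1, miss_rate=0.020, penalty=penalty)
# Hypothetical cache B: lower miss rate, but a slower hit.
amat_b = amat(hit_time=2, miss_rate=0.015, penalty=penalty)

print(f"miss penalty = {penalty:.0f} cycles")
print(f"cache A: {amat_a:.2f} cycles, cache B: {amat_b:.2f} cycles")
```

With these made-up numbers cache B has the better miss rate but the worse average access time, which is why miss rate alone is not a reliable figure of merit.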







#### Q3: Which block should be replaced on a miss?

- Easy for Direct Mapped
- Set Associative or Fully Associative:
  - LRU (Least Recently Used): appealing, but hard to implement for high associativity
  - Random: easy to implement, how well does it work?

| Size   | 2-way LRU | 2-way Ran | 4-way LRU | 4-way Ran | 8-way LRU | 8-way Ran |
|--------|-----------|-----------|-----------|-----------|-----------|-----------|
| 16 KB  | 5.2%      | 5.7%      | 4.7%      | 5.3%      | 4.4%      | 5.0%      |
| 64 KB  | 1.9%      | 2.0%      | 1.5%      | 1.7%      | 1.4%      | 1.5%      |
| 256 KB | 1.15%     | 1.17%     | 1.13%     | 1.13%     | 1.12%     | 1.12%     |





#### Cache misses

- Compulsory: first access miss, cold start miss.
- Capacity: cache is full.
- Conflict: two blocks are mapped to the same location.

#### **6 Basic Cache Optimizations**

#### **Reducing Miss Rate**

- 1. Larger Block size (compulsory misses)
- 2. Larger Cache size (capacity misses)
- 3. Higher Associativity (conflict misses)

#### **Reducing Miss Penalty**

- 4. Multilevel Caches
- 5. Giving Reads Priority over Writes
  - E.g., Read complete before earlier writes in write buffer

#### **Reducing Hit Time**

- 6. Avoiding Address Translation during Indexing of the Cache

#### Outline

- Review
- Memory hierarchy
- Locality
- Cache design
- Virtual address spaces
- Page table layout
- TLB design options
- Conclusion



No way to prevent a program from accessing any machine resource





























## Summary #3/3: TLB, Virtual Memory

- Page tables map virtual address to physical address
- TLBs are important for fast translation
- TLB misses are significant in processor performance
  - funny times, as most systems can't access all of 2nd level cache without TLB misses!
- Caches, TLBs, Virtual Memory all understood by examining how they deal with 4 questions:
  - 1) Where can block be placed?
  - 2) How is block found?
  - 3) What block is replaced on miss?
  - 4) How are writes handled?
- Today VM allows many processes to share single memory without having to swap all processes to disk; <u>today VM protection is more important than memory hierarchy benefits, but computers insecure</u>

#### Reading

- This lecture: appendix C Memory Hierarchy
- Next lecture: chapter 2 Instruction-Level Parallelism

# Lecture 4 – Instruction Level Parallelism

Slides were used during lectures by David Patterson, Berkeley, spring 2006

## Outline

- ILP
- Compiler techniques to increase ILP
- Loop Unrolling
- Static Branch Prediction
- Dynamic Branch Prediction
- Overcoming Data Hazards with Dynamic Scheduling
- Tomasulo Algorithm
- Conclusion

# **Recall from Pipelining**

Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls

- Ideal pipeline CPI: measure of the maximum performance attainable by the implementation
- <u>Structural hazards</u>: HW cannot support this combination of instructions
- <u>Data hazards</u>: instruction depends on result of prior instruction still in the pipeline
- <u>Control hazards</u>: caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

# Instruction Level Parallelism

Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance

Two approaches to exploit ILP:

- 1) Rely on hardware to help discover and exploit the parallelism dynamically (e.g., Pentium 4, AMD Opteron, IBM Power)
- 2) Rely on software technology to find parallelism, statically at compile-time (e.g., Itanium 2)

# Instruction-Level Parallelism (ILP)

- Basic Block (BB) ILP is quite small
  - BB: a straight-line code sequence with no branches in except at the entry and no branches out except at the exit
  - average dynamic branch frequency 15% to 25%
     ⇒ 4 to 7 instructions execute between a pair of branches
  - plus instructions in BB likely to depend on each other
- To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
- Simplest: <u>loop-level parallelism</u> to exploit parallelism among iterations of a loop. E.g.,

for (i=1; i<=1000; i=i+1) x[i] = x[i] + y[i];

# Loop-Level Parallelism

- Exploit loop-level parallelism by "unrolling" the loop, either 1. dynamically via branch prediction or 2. statically via loop unrolling by the compiler (another way is vectors, to be covered later)
- Determining instruction dependence is critical to Loop Level Parallelism
- If 2 instructions are
  - <u>parallel</u>, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards)
  - <u>dependent</u>, they are not parallel and must be executed in order, although they may often be partially overlapped



- Instr<sub>J</sub> is data dependent (aka true dependence) on Instr<sub>I</sub> if:
  - 1. Instr<sub>J</sub> tries to read an operand before Instr<sub>I</sub> writes it

> I: add r1,r2,r3
>
> J: sub r4,<mark>r1</mark>,r3

  - 2. or Instr<sub>J</sub> is data dependent on Instr<sub>K</sub>, which is dependent on Instr<sub>I</sub>
- If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped
- Data dependence in instruction sequence  $\Rightarrow$  data dependence in source code  $\Rightarrow$  effect of original data dependence must be preserved
- If data dependence caused a hazard in pipeline, called a Read After Write (RAW) hazard

# ILP and Data Dependencies, Hazards

- HW/SW must preserve program order: order instructions would execute in if executed sequentially as determined by original source program
  - Dependences are a property of programs
- Presence of dependence indicates potential for a hazard, but actual hazard and length of any stall is property of the pipeline
- Importance of the data dependencies
  - 1) indicates the possibility of a hazard
  - 2) determines order in which results must be calculated
  - 3) sets an upper bound on how much parallelism can possibly be exploited
- HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program

# Name Dependence #1: Anti-dependence

- Name dependence: when 2 instructions use same register or memory location, called a name, but no flow of data between the instructions associated with that name; two versions of name dependence
- Instr<sub>J</sub> writes operand before Instr<sub>I</sub> reads it

> I: sub r4,r1,r3
>
> J: add <mark>r1</mark>,r2,r3
>
> K: mul r6,r1,r7

Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1"

- If anti-dependence caused a hazard in the pipeline, called a Write After Read (WAR) hazard



- If output-dependence caused a hazard in the pipeline, called a Write After Write (WAW) hazard
- Instructions involved in a name dependence can execute simultaneously if name used in instructions is changed so instructions do not conflict
  - Register renaming resolves name dependence for regs
  - Either by compiler or by HW
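The three kinds of dependence (true, anti, output) can be told apart mechanically from the destination and source registers of two instructions. The following is a minimal sketch, not a real compiler pass: the `Instr` tuple and the `dependences` function are hypothetical names, each instruction is assumed to have exactly one destination register, and memory operands are ignored.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def dependences(first: Instr, second: Instr) -> Set[str]:
    """Classify dependences of `second` on `first` (first comes earlier in program order)."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")   # true data dependence: second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")   # anti-dependence: second overwrites a source of first
    if first.dest == second.dest:
        found.add("WAW")   # output dependence: both write the same name
    return found


# True dependence on r1.
raw = dependences(Instr("add", "r1", ("r2", "r3")), Instr("sub", "r4", ("r1", "r3")))

# Anti-dependence on r1: the later add reuses the name r1.
war = dependences(Instr("sub", "r4", ("r1", "r3")), Instr("add", "r1", ("r2", "r3")))

# Renaming the destination (r1 -> r8, a hypothetical free register) removes it.
renamed = dependences(Instr("sub", "r4", ("r1", "r3")), Instr("add", "r8", ("r2", "r3")))

print(raw, war, renamed)
```

Only the RAW case reflects a real flow of data; the other two disappear once the later instruction writes to a different name, which is what register renaming does.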







- Preserving exception behavior ⇒ any changes in instruction execution order must not change how exceptions are raised in program (⇒ no new exceptions)
- Example (assume branches not delayed):

          DADDU R2,R3,R4
          BEQZ  R2,L1
          LW    R1,0(R2)
      L1:

- Problem with moving LW before BEQZ?



#### Example

- This code adds a scalar to a vector:

      for (i=1000; i>0; i=i-1)
          x[i] = x[i] + s;

- Assume following latencies for all examples
  - Ignore delayed branch in these examples

| Instruction producing result | Instruction using result | Latency in cycles | Stalls between in cycles |
|------------------------------|--------------------------|-------------------|--------------------------|
| FP ALU op                    | Another FP ALU op        | 4                 | 3                        |
| FP ALU op                    | Store double             | 3                 | 2                        |
| Load double                  | FP ALU op                | 1                 | 1                        |
| Load double                  | Store double             | 1                 | 0                        |



    Loop: L.D    F0,0(R1)    ;F0=vector element
          ADD.D  F4,F0,F2    ;add scalar from F2
          S.D    0(R1),F4    ;store result
          DADDUI R1,R1,-8    ;decrement pointer 8B (DW)
          BNEZ   R1,Loop     ;branch R1!=zero







    Loop: L.D    F0,0(R1)
          L.D    F6,-8(R1)
          L.D    F10,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F4,F0,F2
          ADD.D  F8,F6,F2
          ADD.D  F12,F10,F2
          ADD.D  F16,F14,F2
          S.D    0(R1),F4
          S.D    -8(R1),F8
          S.D    -16(R1),F12
          DSUBUI R1,R1,#32
          S.D    8(R1),F16   ; 8-32 = -24
          BNEZ   R1,Loop

#### Unrolled Loop Detail

- Do not usually know upper bound of loop
- Suppose it is n, and we would like to unroll the loop to make k copies of the body
- Instead of a single unrolled loop, we generate a pair of consecutive loops:
  - 1st executes (n mod k) times and has a body that is the original loop
  - 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
- For large values of n, most of the execution time will be spent in the unrolled loop
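The pair of consecutive loops can be sketched in a high-level language. This is only an illustration of the transformation a compiler would perform, written by hand for the scalar-plus-vector example with k = 4; the function name `add_scalar_unrolled` is hypothetical.

```python
K = 4  # unroll factor (number of copies of the loop body)


def add_scalar_unrolled(x, s):
    """Add s to every element of x using a remainder loop plus a 4x unrolled loop."""
    n = len(x)
    i = 0
    # 1st loop: original body, executes (n mod k) times.
    for _ in range(n % K):
        x[i] = x[i] + s
        i += 1
    # 2nd loop: unrolled body, outer loop iterates (n / k) times.
    for _ in range(n // K):
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += K
    return x


print(add_scalar_unrolled([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 10.0))
```

The four statements in the unrolled body touch different elements, so they are independent and can be scheduled freely; the loop test and index update are paid once per four elements instead of once per element.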

# **5** Loop Unrolling Decisions

Requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:

- 1. Determine loop unrolling useful by finding that loop iterations were independent (except for maintenance code)
- 2. Use different registers to avoid unnecessary constraints forced by using same registers for different computations
- 3. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
- 4. Determine that loads and stores in unrolled loop can be interchanged by observing that loads and stores from different iterations are independent
  - Transformation requires analyzing memory addresses and finding that they do not refer to the same address
- 5. Schedule the code, preserving any dependences needed to yield the same result as the original code

# 3 Limits to Loop Unrolling

- 1) Decrease in amount of overhead amortized with each extra unrolling
  - Amdahl's Law
- 2) Growth in code size
  - For larger loops, concern it increases the instruction cache miss rate
- 3) <u>Register pressure</u>: potential shortfall in registers created by aggressive unrolling and scheduling
  - If it is not possible to allocate all live values to registers, unrolling may lose some or all of its advantage

Loop unrolling reduces impact of branches on pipeline; another way is branch prediction



# Dynamic Branch Prediction

- Why does prediction work?
  - Underlying algorithm has regularities
  - Data that is being operated on has regularities
  - Instruction sequence has redundancies that are artifacts of way that humans/compilers think about problems
- Is dynamic branch prediction better than static branch prediction?
  - Seems to be
  - There are a small number of important branches in programs which have dynamic behavior

# **Dynamic Branch Prediction**

- Performance = f(accuracy, cost of misprediction)
- Branch History Table: Lower bits of PC address index table of 1-bit values
  - Says whether or not branch taken last time
  - No address check
- Problem: in a loop, 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit):
  - End of loop case, when it exits instead of looping as before
  - First time through loop on *next* time through code, when it predicts exit instead of looping
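The two mispredictions per loop execution can be reproduced with a small simulation. The sketch below models a single branch only (no table indexing or aliasing) and compares the 1-bit scheme with a 2-bit saturating counter; the function names, the initial predictor states, and the trace of 10 loop executions are assumptions made for illustration.

```python
def mispredictions_1bit(outcomes):
    """1-bit predictor: predict whatever the branch did last time."""
    last_taken = True  # assumption: start out predicting taken
    misses = 0
    for taken in outcomes:
        if last_taken != taken:
            misses += 1
        last_taken = taken
    return misses


def mispredictions_2bit(outcomes):
    """2-bit saturating counter: values 2 and 3 predict taken, 0 and 1 not taken."""
    counter = 3  # assumption: start in the strongly-taken state
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


# Loop branch: taken 9 times, then falls through once; the loop is run 10 times.
one_loop = [True] * 9 + [False]
trace = one_loop * 10

print("1-bit:", mispredictions_1bit(trace))
print("2-bit:", mispredictions_2bit(trace))
```

In steady state the 1-bit scheme misses on the loop exit and again on the first iteration of the next execution, while the 2-bit counter needs two wrong outcomes in a row before it changes its prediction and so only misses on the exit.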





# **Correlated Branch Prediction**

- Idea: record *m* most recently executed branches as taken or not taken, and use that pattern to select the proper *n*-bit branch history table
- In general, (*m*,*n*) predictor means record last *m* branches to select between 2<sup>m</sup> history tables, each with *n*-bit counters
  - Thus, old 2-bit BHT is a (0,2) predictor
- Global Branch History: *m*-bit shift register keeping T/NT status of last *m* branches.
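An (m,n) predictor can be sketched as 2^m tables of n-bit saturating counters selected by a global history shift register. This is a simplified model, not a description of any particular processor: the class name, the `index_bits` parameter (how many low PC bits index each table), and the initial counter value of 0 are assumptions of this sketch.

```python
class CorrelatingPredictor:
    """(m, n) predictor: last m branch outcomes select one of 2**m tables of n-bit counters."""

    def __init__(self, m, n, index_bits):
        self.m, self.n = m, n
        self.entries = 1 << index_bits
        self.max_count = (1 << n) - 1
        self.history = 0  # m-bit global shift register of T/NT outcomes
        self.tables = [[0] * self.entries for _ in range(1 << m)]

    def predict(self, pc):
        counter = self.tables[self.history][pc % self.entries]
        return counter >= (1 << (self.n - 1))  # upper half of the range means "taken"

    def update(self, pc, taken):
        table = self.tables[self.history]
        i = pc % self.entries
        table[i] = min(self.max_count, table[i] + 1) if taken else max(0, table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

    def size_bits(self):
        return (1 << self.m) * self.n * self.entries


def count_mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 50  # a branch that flips every time
plain = count_mispredictions(CorrelatingPredictor(0, 2, 4), pc=0, outcomes=alternating)
correlated = count_mispredictions(CorrelatingPredictor(1, 2, 4), pc=0, outcomes=alternating)
print("(0,2):", plain, " (1,2):", correlated)
```

On the alternating pattern the (0,2) predictor keeps missing the taken case, while one bit of history is enough for the (1,2) predictor to learn the pattern after a short warm-up.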





# **Tournament Predictors**

- Multilevel branch predictor
- Use *n*-bit saturating counter to choose between predictors
- Usual choice between global and local predictors



# **Tournament Predictors**

Tournament predictor using, say, 4K 2-bit counters indexed by local branch address. Chooses between:

- Global predictor
  - 4K entries indexed by history of last 12 branches (2<sup>12</sup> = 4K)
  - Each entry is a standard 2-bit predictor
- Local predictor
  - Local history table: 1024 10-bit entries recording last 10 branches, indexed by branch address
  - The pattern of the last 10 occurrences of that particular branch used to index table of 1K entries with 3-bit saturating counters
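The storage cost of this tournament predictor follows directly from the table sizes listed above. The short calculation below simply adds them up; the variable names are only for illustration.

```python
K = 1024

# Sanity checks on the index widths quoted on the slide.
assert 2 ** 12 == 4 * K   # 12 bits of global history index 4K entries
assert 2 ** 10 == 1 * K   # 10 bits of local history index 1K counters

selector_bits = 4 * K * 2        # 4K 2-bit counters choosing between the predictors
global_bits = 4 * K * 2          # 4K entries, each a 2-bit predictor
local_history_bits = 1 * K * 10  # 1024 entries of 10-bit local history
local_counter_bits = 1 * K * 3   # 1K entries of 3-bit saturating counters

total_bits = selector_bits + global_bits + local_history_bits + local_counter_bits
print(total_bits, "bits =", total_bits // K, "Kbits")
```

The total comes to 29K bits, which is in line with the figure of roughly 30K bits quoted for tournament predictors in processors of that period.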









# **Dynamic Branch Prediction Summary**

- Prediction becoming important part of execution
- Branch History Table: 2 bits for loop accuracy
- Correlation: Recently executed branches correlated with next branch
  - Either different branches (GA)
  - Or different executions of same branches (PA)
- Tournament predictors take insight to next level, by using multiple predictors
  - usually one based on global information and one based on local information, and combining them with a selector
  - In 2006, tournament predictors using ≈ 30K bits are in processors like the Power5 and Pentium 4
- Branch Target Buffer: include branch address & prediction



# Advantages of Dynamic Scheduling

- Dynamic scheduling hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior
- It handles cases when dependences unknown at compile time
  - it allows the processor to tolerate unpredictable delays such as cache misses, by executing other code while waiting for the miss to resolve
- It allows code that was compiled for one pipeline to run efficiently on a different pipeline
- It simplifies the compiler
- Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling (next lecture)

# HW Schemes: Instruction Parallelism

- Key idea: Allow instructions behind stall to proceed

      DIVD F0,F2,F4
      ADDD F10,F0,F8
      SUBD F12,F8,F14

- Enables out-of-order execution and allows out-of-order completion (e.g., SUBD)
- In a dynamically scheduled pipeline, all instructions still pass through issue stage in order (in-order issue)

- Will distinguish when an instruction *begins execution* and when it *completes execution*; between 2 times, the instruction is *in execution*
- Note: Dynamic execution creates WAR and WAW hazards and makes exceptions harder

# Dynamic Scheduling Step 1

- Simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
- Split the ID pipe stage of simple 5-stage pipeline into 2 stages:

  - 1) Issue: Decode instructions, check for structural hazards
  - 2) Read operands: Wait until no data hazards, then read operands

# A Dynamic Algorithm: Tomasulo's

- For IBM 360/91 (before caches!)
  - ⇒ Long memory latency
- Goal: High Performance without special compilers
- Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations
  - This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!
- Why Study 1966 Computer?
  - The descendants of this have flourished!
    - Alpha 21264, Pentium 4, AMD Opteron, Power 5, ...

# Tomasulo Algorithm

- Control & buffers <u>distributed</u> with Function Units (FU)
  - FU buffers called "<u>reservation stations</u>"; have pending operands
- Registers in instructions replaced by values or pointers to reservation stations (RS); called <u>register renaming</u>
  - Renaming avoids WAR, WAW hazards
  - More reservation stations than registers, so can do optimizations compilers cannot
- Results to FU from RS, <u>not through registers</u>, over <u>Common Data Bus</u> that broadcasts results to all FUs
- Avoids RAW hazards by executing an instruction only when its operands are available
- Load and Stores treated as FUs with RSs as well
- Integer instructions can go past branches (predict taken), allowing FP ops beyond basic block in FP queue



# **Reservation Station Components**

- Op: Operation to perform in the unit (e.g., + or –)
- Vj, Vk: Value of Source operands
  - Store buffers have V field, result to be stored
- Qj, Qk: Reservation stations producing source registers (value to be written)
  - Note: Qj,Qk=0 => ready
  - Store buffers only have Qi for RS producing result
- Busy: Indicates reservation station or FU is busy

**Register result status**—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.

# Three Stages of Tomasulo Algorithm

- 1. Issue: get instruction from FP Op Queue
  - If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).
- 2. Execute: operate on operands (EX)
  - When both operands ready then execute; if not ready, watch Common Data Bus for result
- 3. Write result: finish execution (WB)
  - Write on Common Data Bus to all awaiting units; mark reservation station available
- Normal data bus: data + destination ("go to" bus)
- <u>Common data bus</u>: data + <u>source</u> ("<u>come from</u>" bus)
  - 64 bits of data + 4 bits of Functional Unit <u>source</u> address
  - Write if matches expected Functional Unit (produces result)
  - Does the broadcast
- Example speed: 3 clocks for Fl. pt. +,-; 10 for \*; 40 clks for /
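The issue and write-result stages can be sketched with a small model of reservation stations, the register result status, and the common data bus broadcast. This is a heavily simplified illustration and not the 360/91 design: all names are hypothetical, a single pool of stations serves every operation, `None` plays the role of "Qj, Qk = 0", and the execute stage is not modelled (the caller supplies the computed value).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # value of first source operand
    vk: Optional[float] = None  # value of second source operand
    qj: Optional[str] = None    # station that will produce vj (None = value ready)
    qk: Optional[str] = None    # station that will produce vk (None = value ready)

    def ready(self) -> bool:
        return self.busy and self.qj is None and self.qk is None


class TomasuloSketch:
    def __init__(self, station_names, registers):
        self.stations = {name: ReservationStation(name) for name in station_names}
        self.regs = dict(registers)
        self.reg_status: Dict[str, str] = {}  # register -> station that will write it

    def issue(self, op, dest, src1, src2):
        """Stage 1: claim a free station and rename the source operands."""
        rs = next((s for s in self.stations.values() if not s.busy), None)
        if rs is None:
            return None  # structural hazard: no free station, issue stalls
        rs.busy, rs.op = True, op
        rs.qj = self.reg_status.get(src1)
        rs.vj = self.regs[src1] if rs.qj is None else None
        rs.qk = self.reg_status.get(src2)
        rs.vk = self.regs[src2] if rs.qk is None else None
        self.reg_status[dest] = rs.name  # later readers of dest wait on this station
        return rs.name

    def write_result(self, producer, value):
        """Stage 3: broadcast (source station, value) to all waiting units."""
        for rs in self.stations.values():
            if rs.qj == producer:
                rs.vj, rs.qj = value, None
            if rs.qk == producer:
                rs.vk, rs.qk = value, None
        for reg in [r for r, p in self.reg_status.items() if p == producer]:
            self.regs[reg] = value
            del self.reg_status[reg]
        self.stations[producer] = ReservationStation(producer)  # station is free again


cpu = TomasuloSketch(
    ["RS1", "RS2", "RS3"],
    {"F0": 0.0, "F2": 8.0, "F4": 2.0, "F8": 3.0, "F10": 0.0, "F12": 0.0, "F14": 1.0},
)
div = cpu.issue("DIVD", "F0", "F2", "F4")
add = cpu.issue("ADDD", "F10", "F0", "F8")   # must wait for F0 from the divide
sub = cpu.issue("SUBD", "F12", "F8", "F14")  # independent of the divide
sub_ready_before_div = cpu.stations[sub].ready()
add_ready_before_div = cpu.stations[add].ready()
cpu.write_result(sub, 3.0 - 1.0)  # SUBD completes out of order
cpu.write_result(div, 8.0 / 2.0)  # broadcast wakes up the waiting ADDD
```

The subtract finishes while the add is still waiting on the divide, and the add picks up its missing operand from the broadcast rather than from the register file.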










































# Why can Tomasulo overlap loop iterations?

- Register renaming
  - Multiple iterations use different physical destinations for registers (dynamic loop unrolling).
- Reservation stations
  - Permit instruction issue to advance past integer control flow operations
  - Also buffer old values of registers totally avoiding the WAR stall
- Other perspective: Tomasulo building data flow dependency graph on the fly

#### Tomasulo's scheme offers two major advantages

- 1. Distribution of the hazard detection logic – distributed reservation stations and the CDB
  - If multiple instructions waiting on single result, & each instruction has other operand, then instructions can be released simultaneously by broadcast on CDB
  - If a centralized register file were used, the units would have to read their results from the registers when register buses are available
- 2. Elimination of stalls for WAW and WAR hazards



- Number of functional units that can complete per cycle limited to one!
  - Multiple CDBs ⇒ more FU logic for parallel assoc stores
- Non-precise interrupts!
  - We will address this later

# And In Conclusion ... (1)

- Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
- Loop unrolling by compiler to increase ILP
- Branch prediction to increase ILP
- Dynamic HW exploiting ILP
  - Works when can't know dependence at compile time
  - Can hide L1 cache misses
  - Code for one machine runs well on another

# And In Conclusion ... (2)

- Reservation stations: renaming to larger set of registers + buffering source operands
  - Prevents registers as bottleneck
  - Avoids WAR, WAW hazards
  - Allows loop unrolling in HW
- Not limited to basic blocks (integer units get ahead, beyond branches)
- Helps cache misses as well
- Lasting Contributions
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- 360/91 descendants are Intel Pentium 4, IBM Power 5, AMD Athlon/Opteron, ...

#### Reading

- This lecture: chapter 2 Instruction-Level Parallelism
- Next week: no class, Oct 3rd
- Next class, Oct 10th: ILP (cont'd)
- This afternoon: *introduction on assignment 2;* highly recommended!

# Lecture 5 – Instruction Level Parallelism (cont'd)

Slides were used during lectures by David Patterson, Berkeley, spring 2006

#### **Review from Last Time (1)**

- Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
- Loop unrolling by compiler to increase ILP
- Branch prediction to increase ILP
- Dynamic HW exploiting ILP
  - Works when can't know dependence at compile time
  - Can hide L1 cache misses
  - Code for one machine runs well on another

#### **Review from Last Time (2)**

- Reservation stations: renaming to larger set of registers + buffering source operands
  - Prevents registers as bottleneck
  - Avoids WAR, WAW hazards
  - Allows loop unrolling in HW
- Not limited to basic blocks (integer units get ahead, beyond branches)
- Helps cache misses as well
- Lasting Contributions
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron, ...

#### Outline

- ILP
- Speculation
- Speculative Tomasulo Example
- Memory Aliases
- Exceptions
- VLIW
- Increasing instruction bandwidth
- Register Renaming vs. Reorder Buffer
- Value Prediction
- Limits to ILP

#### Speculation to greater ILP

- Greater ILP: Overcome control dependence by hardware speculating on outcome of branches and executing program as if guesses were correct
- Speculation ⇒ fetch, issue, and execute instructions as if branch predictions were always correct
- Dynamic scheduling ⇒ only fetches and issues instructions
- Essentially a data flow execution model: Operations execute as soon as their operands are available

#### Speculation to greater ILP

3 components of HW-based speculation:

- 1. Dynamic branch prediction to choose which instructions to execute
- 2. Speculation to allow execution of instructions before control dependences are resolved + ability to undo effects of incorrectly speculated sequence
- 3. Dynamic scheduling to deal with scheduling of different combinations of basic blocks

#### **Adding Speculation to Tomasulo**

- Must separate execution from allowing instruction to finish or "commit"
- This additional step called instruction commit
- When an instruction is no longer speculative, allow it to update the register file or memory
- Requires additional set of buffers to hold results of instructions that have finished execution but have not committed
- This reorder buffer (ROB) is also used to pass results among instructions that may be speculated

#### **Reorder Buffer (ROB)**

- In Tomasulo's algorithm, once an instruction writes its result, any subsequently issued instructions will find result in the register file
- With speculation, the register file is not updated until the instruction commits
  - (we know definitively that the instruction should execute)
- Thus, the ROB supplies operands in interval between completion of instruction execution and instruction commit
  - ROB is a source of operands for instructions, just as reservation stations (RS) provide operands in Tomasulo's algorithm
  - ROB extends architectured registers like RS

#### **Reorder Buffer Entry**

#### Each entry in the ROB contains four fields:

- 1. Instruction type
  - A branch (has no destination result), a store (has a memory address destination), or a register operation (ALU operation or load, which has register destinations)
- 2. Destination
  - Register number (for loads and ALU operations) or memory address (for stores) where the instruction result should be written
- 3. Value
  - Value of instruction result until the instruction commits
- 4. Ready
  - Indicates that instruction has completed execution, and the value is ready

#### **Reorder Buffer operation**

- Holds instructions in FIFO order, exactly as issued
- When instructions complete, results placed into ROB
  - Supplies operands to other instructions between execution complete & commit ⇒ more registers like RS
  - Tag results with ROB buffer number instead of reservation station
- Instructions commit ⇒ values at head of ROB placed in registers

- As a result, easy to undo speculated instructions on mispredicted branches or on exceptions

[Figure: FP Op Queue, FP Regs, Reorder Buffer, Res Stations, FP Adder; commit path]

## Recall: 4 Steps of Speculative Tomasulo Algorithm

- 1. Issue: get instruction from FP Op Queue
  - If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called "dispatch")
- 2. Execution: operate on operands (EX)
  - When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called "issue")
- 3. Write result: finish execution (WB)
  - Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.
- 4. Commit: update register with reorder result
  - When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called "graduation")
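
The four steps can be sketched in a few lines of Python. This is only an illustration, not the hardware algorithm itself: reservation stations and the CDB are left out, and the names `ROBEntry`, `ReorderBuffer`, and the `mispredicted` flag are assumptions made for this sketch. It shows the four entry fields, out-of-order completion, in-order commit from the head, and the flush on a mispredicted branch.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ROBEntry:
    """One reorder buffer entry with the four fields from the slides."""
    kind: str                     # "branch", "store", or "reg"
    dest: Optional[object]        # register name or memory address; None for a branch
    value: Optional[float] = None
    ready: bool = False
    mispredicted: bool = False    # hypothetical extra flag, only for this sketch


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()    # FIFO: the head is the oldest instruction
        self.regs = {}            # committed (architectural) registers
        self.mem = {}             # committed memory

    def issue(self, kind, dest=None):
        """Step 1: allocate an entry at the tail; the entry acts as the tag."""
        entry = ROBEntry(kind, dest)
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        """Step 3: record the result; this may happen out of program order."""
        entry.value = value
        entry.ready = True
        entry.mispredicted = mispredicted

    def commit(self):
        """Step 4: retire ready instructions from the head, strictly in order."""
        committed = 0
        while self.entries and self.entries[0].ready:
            head = self.entries.popleft()
            committed += 1
            if head.kind == "reg":
                self.regs[head.dest] = head.value
            elif head.kind == "store":
                self.mem[head.dest] = head.value
            elif head.kind == "branch" and head.mispredicted:
                self.entries.clear()   # flush all younger, speculative work
                break
        return committed
```

For example, if an instruction issued after a branch finishes first, `commit()` retires nothing until the older instructions are ready; if the branch then turns out to be mispredicted, the younger result is discarded without ever reaching `regs`.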

















#### **Avoiding Memory Hazards**

- WAW and WAR hazards through memory are eliminated with speculation because actual updating of memory occurs in order, when a store is at head of the ROB, and hence, no earlier loads or stores can still be pending
- RAW hazards through memory are maintained by two restrictions:
  - 1. not allowing a load to initiate the second step of its execution if any active ROB entry occupied by a store has a Destination field that matches the value of the A field of the load, and
  - 2. maintaining the program order for the computation of an effective address of a load with respect to all earlier stores
- These restrictions ensure that any load that accesses a memory location written to by an earlier store cannot perform the memory access until the store has written the data
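
A minimal Python sketch of this check, assuming the ROB entries older than the load are available as `(kind, destination)` pairs. The function name and the use of `None` for a store whose effective address has not been computed yet are assumptions of this sketch, not part of the slides.

```python
def load_may_access_memory(load_addr, older_entries):
    """Return True if the load may start its memory access.

    older_entries: (kind, dest) pairs for the active ROB entries ahead of
    the load in program order. For a store, dest is its effective address,
    or None if that address has not been computed yet.
    """
    for kind, dest in older_entries:
        if kind != "store":
            continue
        if dest is None:
            # Restriction 2: earlier store addresses are computed first,
            # so the load waits until this address is known.
            return False
        if dest == load_addr:
            # Restriction 1: an active store writes the same address.
            return False
    return True
```

Once the matching store reaches the head of the ROB and commits, it leaves `older_entries` and the load can proceed.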

#### **Exceptions and Interrupts**

- IBM 360/91 invented "imprecise interrupts"
  - Computer stopped at this PC; it's likely close to this address
  - Not so popular with programmers
  - Also, what about Virtual Memory? (Not in IBM 360)
- Technique for both precise interrupts/exceptions and speculation: in-order completion and in-order commit
  - If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly
  - This is exactly the same as what we need to do with precise exceptions
- Exceptions are handled by not recognizing the exception until instruction that caused it is ready to commit in ROB
  - If a speculated instruction raises an exception, the exception is recorded in the ROB

  - This is why reorder buffers in all new processors

#### **Getting CPI below 1**

- CPI ≥ 1 if issue only 1 instruction every clock cycle
- Multiple-issue processors come in 3 flavors:
  - 1. statically-scheduled superscalar processors,
  - 2. dynamically-scheduled superscalar processors, and
  - 3. VLIW (very long instruction word) processors
- 2 types of superscalar processors issue varying numbers of instructions per clock
  - use in-order execution if they are statically scheduled, or
  - out-of-order execution if they are dynamically scheduled
- VLIW processors, in contrast, issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction (Intel/HP Itanium)

#### VLIW: Very Long Instruction Word

- Each "instruction" has explicit coding for multiple operations
  - In IA-64, grouping called a "packet"
  - In Transmeta, grouping called a "molecule" (with "atoms" as ops)
- Tradeoff instruction space for simple decoding
  - The long instruction word has room for many operations
  - By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
  - E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
    - 16 to 24 bits per field => 7\*16 or 112 bits to 7\*24 or 168 bits wide
  - Need compiling technique that schedules across several branches

#### **Recall: Unrolled Loop that Minimizes Stalls for Scalar**

| Line    | Instruction | Operands    | Comment                |
|---------|-------------|-------------|------------------------|
| 1 Loop: | L.D         | F0,0(R1)    | L.D to ADD.D: 1 cycle  |
| 2       | L.D         | F6,-8(R1)   | ADD.D to S.D: 2 cycles |
| 3       | L.D         | F10,-16(R1) |                        |
| 4       | L.D         | F14,-24(R1) |                        |
| 5       | ADD.D       | F4,F0,F2    |                        |
| 6       | ADD.D       | F8,F6,F2    |                        |
| 7       | ADD.D       | F12,F10,F2  |                        |
| 8       | ADD.D       | F16,F14,F2  |                        |
| 9       | S.D         | 0(R1),F4    |                        |
| 10      | S.D         | -8(R1),F8   |                        |
| 11      | S.D         | -16(R1),F12 |                        |
| 12      | DSUBUI      | R1,R1,#32   |                        |
| 13      | BNEZ        | R1,LOOP     |                        |
| 14      | S.D         | 8(R1),F16   | ; 8-32 = -24           |

#### Loop Unrolling in VLIW

| Memory<br>reference 1 | Memory<br>reference 2 | FP<br>operation 1 | FP<br>op. 2      | Int. op/<br>branch | Clock |
|-----------------------|-----------------------|-------------------|------------------|--------------------|-------|
| L.D F0,0(R1)          | L.D F6,-8(R1)         |                   |                  |                    | 1     |
| L.D F10,-16(R1)       | L.D F14,-24(R1)       |                   |                  |                    | 2     |
| L.D F18,-32(R1)       | L.D F22,-40(R1)       | ADD.D F4,F0,F2    | ADD.D F8,F6,F2   |                    | 3     |
| L.D F26,-48(R1)       |                       | ADD.D F12,F10,F2  | ADD.D F16,F14,F2 |                    | 4     |
|                       |                       | ADD.D F20,F18,F2  | ADD.D F24,F22,F2 |                    | 5     |
| S.D 0(R1),F4          | S.D -8(R1),F8         | ADD.D F28,F26,F2  |                  |                    | 6     |
| S.D -16(R1),F12       | S.D -24(R1),F16       |                   |                  |                    | 7     |
| S.D -32(R1),F20       | S.D -40(R1),F24       |                   |                  | DSUBUI R1,R1,#56   | 8     |
| S.D 8(R1),F28         |                       |                   |                  | BNEZ R1,LOOP       | 9     |

- Unrolled 7 times to avoid delays (7 doubles of 8 bytes each, so R1 steps by 56 and the last store uses 8-56 = -48)
- 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
- Average: 2.5 ops per clock, 50% efficiency
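
The summary figures for this schedule can be checked with simple arithmetic. The short Python sketch below only counts what appears in the table (7 loads, 7 adds, 7 stores, 1 integer op, 1 branch, over 9 clocks with 5 slots per instruction); the variable names are chosen for this illustration.

```python
# Operation counts read off the VLIW schedule above.
ops = {"loads": 7, "adds": 7, "stores": 7, "int_op": 1, "branch": 1}
total_ops = sum(ops.values())

clocks = 9                  # one VLIW instruction per clock
slots_per_instruction = 5   # 2 memory refs, 2 FP ops, 1 int op/branch
iterations = 7              # loop unrolled 7 times

clocks_per_iteration = clocks / iterations
ops_per_clock = total_ops / clocks
efficiency = total_ops / (clocks * slots_per_instruction)

print(f"{total_ops} ops in {clocks} clocks")
print(f"{clocks_per_iteration:.2f} clocks per iteration")
print(f"{ops_per_clock:.2f} ops per clock")
print(f"{efficiency:.0%} of the slots are used")
```

Roughly half of the operation slots stay empty, which is the wasted encoding space that makes code size a concern for VLIW.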

#### **Problems with 1st Generation VLIW**

- Increase in code size
  - generating enough operations in a straight-line code fragment requires ambitiously unrolling loops
  - whenever VLIW instructions are not full, unused functional units translate to wasted bits in instruction encoding
- Operated in lock-step; no hazard detection HW
  - a stall in any functional unit pipeline caused entire processor to stall, since all functional units must be kept synchronized
  - Compiler might predict functional units, but caches hard to predict
- Binary code compatibility
  - Pure VLIW => different numbers of functional units and unit latencies require different versions of the code

#### Intel/HP IA-64 "Explicitly Parallel Instruction Computer (EPIC)"

- IA-64: instruction set architecture
- 128 64-bit integer regs + 128 82-bit floating point regs
  - Not separate register files per functional unit as in old VLIW
- Hardware checks dependencies (interlocks => binary compatibility over time)
- Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?
- Itanium<sup>™</sup> was first implementation (2001)
  - Highly parallel and deeply pipelined hardware at 800 MHz
  - 6-wide, 10-stage pipeline at 800 MHz on 0.18 μ process
- Itanium 2<sup>™</sup> is name of 2nd implementation (2005)
  - 6-wide, 8-stage pipeline at 1666 MHz on 0.13 μ process
  - Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3



#### Branch Target Buffer (BTB)

- PC of instruction to fetch is compared with the PC of branch stored in each BTB entry
- When match is found, Predicted PC is returned
- If branch predicted taken, instruction fetch continues at the Predicted PC
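
As a rough illustration of the branch target buffer lookup, the Python sketch below models the BTB as a table from branch PC to predicted target. The class name, the `taken` bit stored per entry, and the fixed 4-byte instruction size are assumptions of this sketch; a real BTB is a small associative memory with limited capacity.

```python
class BranchTargetBuffer:
    def __init__(self):
        # PC of branch -> (predicted target PC, predicted taken?)
        self.table = {}

    def update(self, branch_pc, target_pc, taken):
        """Record the outcome of a resolved branch."""
        self.table[branch_pc] = (target_pc, taken)

    def next_fetch_pc(self, pc, instr_size=4):
        """Look up the PC being fetched; return where to fetch next."""
        hit = self.table.get(pc)
        if hit is not None:
            target_pc, predicted_taken = hit
            if predicted_taken:
                return target_pc       # match and predicted taken
        return pc + instr_size         # no match or predicted not taken
```

The lookup uses only the PC, so the next fetch address is available before the instruction has been decoded.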







#### Speculation: Register Renaming vs. ROB

- Alternative to ROB is a larger physical set of registers combined with register renaming
  - Extended registers replace function of both ROB and reservation stations
- Instruction issue maps names of architectural registers to physical register numbers in extended register set
  - On issue, allocates a new unused register for the destination (which avoids WAW and WAR hazards)
  - Speculation recovery easy because a physical register holding an instruction destination does not become the architectural register until the instruction commits
- Most Out-of-Order processors today use extended registers with renaming
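
A small Python sketch of the mapping step, under simplifying assumptions: the class `RenameMap` and its free list are names invented for this illustration, and freeing a physical register at commit is not modeled. It shows why allocating a fresh physical register for each destination removes WAW and WAR hazards.

```python
class RenameMap:
    def __init__(self, arch_regs, num_physical):
        # Initially architectural register i lives in physical register i.
        self.map = {r: i for i, r in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_physical))

    def rename(self, dest, srcs):
        """Rename one instruction; returns (physical dest, physical sources)."""
        # Sources are looked up first, so they see the older mappings.
        phys_srcs = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: issue must stall")
        phys_dest = self.free.pop(0)   # new, unused register for the result
        self.map[dest] = phys_dest
        return phys_dest, phys_srcs


rm = RenameMap(["F0", "F2", "F4"], num_physical=6)
i1 = rm.rename("F4", ["F0", "F2"])   # F4 = F0 op F2
i2 = rm.rename("F0", ["F4"])         # WAR on F0 with i1
i3 = rm.rename("F4", ["F2"])         # WAW on F4 with i1
print(i1, i2, i3)
```

After renaming, `i2` writes a different physical register than the one `i1` reads, and `i1` and `i3` write different physical registers, so only the true dependence from `i1` to `i2` remains.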

#### Value Prediction

- Attempts to predict value produced by instruction
  - E.g., Loads a value that changes infrequently
- Value prediction is useful only if it significantly increases ILP
  - Focus of research has been on loads; so-so results, no processor uses value prediction
- Related topic is address aliasing prediction
  - RAW for load and store or WAW for 2 stores
- Address alias prediction is both more stable and simpler since need not actually predict the address values, only whether such values conflict
  - Has been used by a few processors



#### **Perspective**

- Interest in multiple-issue because wanted to improve performance without affecting uniprocessor programming model
- Taking advantage of ILP is conceptually simple, but design problems are amazingly complex in practice
- Conservative in ideas, just faster clock and bigger
- Processors of last 5 years (Pentium 4, IBM Power 5, AMD Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the 1st dynamically scheduled, multiple-issue processors announced in 1995
  - Clocks 10 to 20X faster, caches 4 to 8X bigger, 2 to 4X as many renaming registers, and 2X as many load-store units ⇒ performance 8 to 16X
- Peak v. delivered performance gap increasing

#### In Conclusion ...

- Interrupts and Exceptions either interrupt the current instruction or happen between instructions
- Possibly large quantities of state must be saved before interrupting
- Machines with precise exceptions provide one single point in the program to restart execution
  - All instructions before that point have completed
  - No instructions after or including that point have completed
- Hardware techniques exist for precise exceptions even in the face of out-of-order execution!
  - Important enabling factor for out-of-order execution



#### Limits to ILP

- Conflicting studies of amount
  - Benchmarks (vectorized Fortran FP vs. integer C programs)
  - Hardware sophistication
  - Compiler sophistication
- How much ILP is available using existing mechanisms with increasing HW budgets?
- Do we need to invent new HW/SW mechanisms to keep on processor performance curve?
  - Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints
  - Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock
  - Motorola AltiVec: 128 bit ints and FPs
  - Supersparc Multimedia ops, etc.

#### **Overcoming Limits**

- Advances in compiler technology + significantly new and different hardware techniques may be able to overcome limitations assumed in studies
- However, unlikely such advances when coupled with realistic hardware will overcome these limits in near future

#### Limits to ILP

Initial HW Model here; MIPS compilers.

Assumptions for ideal/perfect machine to start:

- 1. *Register renaming* – infinite virtual registers => all register WAW & WAR hazards are avoided
- 2. *Branch prediction* – perfect; no mispredictions
- 3. *Jump prediction* – all jumps perfectly predicted (returns, case statements)
  - 2 & 3 ⇒ no control dependencies; perfect speculation & an unbounded buffer of instructions available
- 4. *Memory-address alias analysis* – addresses known & a load can be moved before a store provided addresses not equal
  - 1 & 4 eliminates all but RAW

Also: perfect caches; 1 cycle latency for all instructions (FP \*,/); unlimited instructions issued/clock cycle
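
Under these assumptions only true (RAW) dependences limit execution, so the ILP of a trace is its length divided by the length of its longest dependence chain. The Python sketch below illustrates that idea for register dependences only; the trace format and the function name `ideal_ilp` are assumptions of this sketch, and dependences through memory are not modeled.

```python
def ideal_ilp(trace):
    """Estimate ILP on the ideal machine for a register-only trace.

    trace: list of (dest, [sources]) register names in program order.
    Returns (instructions per cycle, cycles), assuming unlimited issue,
    perfect renaming and prediction, and 1-cycle latency everywhere.
    """
    if not trace:
        return 0.0, 0
    ready_cycle = {}   # register -> cycle in which its newest value is produced
    depth = 0
    for dest, srcs in trace:
        # Execute one cycle after the last source value becomes available.
        cycle = 1 + max((ready_cycle.get(s, 0) for s in srcs), default=0)
        ready_cycle[dest] = cycle   # renaming: WAW and WAR never delay anything
        depth = max(depth, cycle)
    return len(trace) / depth, depth


trace = [
    ("F0", ["R1"]),         # load
    ("F4", ["F0", "F2"]),   # add, depends on the load
    ("F6", ["R1"]),         # independent load
    ("F8", ["F6", "F2"]),   # add, depends on the second load
    ("R1", ["R1"]),         # loop index update
]
print(ideal_ilp(trace))
```

The five instructions form chains of length 2, so the ideal machine finishes them in 2 cycles, an ILP of 2.5.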

#### Limits to ILP HW Model comparison

|                                  | Model    | Power 5                                                 |
|----------------------------------|----------|---------------------------------------------------------|
| Instructions Issued<br>per clock | Infinite | 4                                                       |
| Instruction Window<br>Size       | Infinite | 200                                                     |
| Renaming<br>Registers            | Infinite | 48 integer +<br>40 Fl. Pt.                              |
| Branch Prediction                | Perfect  | 2% to 6% misprediction<br>(Tournament Branch Predictor) |
| Cache                            | Perfect  | 64KI, 32KD, 1.92MB<br>L2, 36 MB L3                      |
| Memory Alias<br>Analysis         | Perfect  | ??                                                      |







|                                     | New Model                                                             | Model    | Power 5                                                       |
|-------------------------------------|-----------------------------------------------------------------------|----------|---------------------------------------------------------------|
| Instructions<br>Issued per<br>clock | 64                                                                    | Infinite | 4                                                             |
| Instruction<br>Window Size          | 2048                                                                  | Infinite | 200                                                           |
| Renaming<br>Registers               | Infinite                                                              | Infinite | 48 integer +<br>40 Fl. Pt.                                    |
| Branch<br>Prediction                | Perfect vs. 8K<br>Tournament vs.<br>512 2-bit vs.<br>profile vs. none | Perfect  | 2% to 6%<br>misprediction<br>(Tournament Branch<br>Predictor) |
| Cache                               | Perfect                                                               | Perfect  | 64KI, 32KD, 1.92MB<br>L2, 36 MB L3                            |
| Memory<br>Alias                     | Perfect                                                               | Perfect  | ??                                                            |





|                                     | New Model                             | Model    | Power 5                            |
|-------------------------------------|---------------------------------------|----------|------------------------------------|
| Instructions<br>Issued per<br>clock | 64                                    | Infinite | 4                                  |
| Instruction<br>Window Size          | 2048                                  | Infinite | 200                                |
| Renaming<br>Registers               | Infinite v. 256,<br>128, 64, 32, none | Infinite | 48 integer +<br>40 Fl. Pt.         |
| Branch<br>Prediction                | 8K 2-bit                              | Perfect  | Tournament Branch<br>Predictor     |
| Cache                               | Perfect                               | Perfect  | 64KI, 32KD, 1.92MB<br>L2, 36 MB L3 |
| Memory<br>Alias                     | Perfect                               | Perfect  | Perfect                            |



|                                     | New Model                                 | Model    | Power 5                            |
|-------------------------------------|-------------------------------------------|----------|------------------------------------|
| Instructions<br>Issued per<br>clock | 64                                        | Infinite | 4                                  |
| Instruction<br>Window Size          | 2048                                      | Infinite | 200                                |
| Renaming<br>Registers               | 256 Int + 256 FP                          | Infinite | 48 integer +<br>40 Fl. Pt.         |
| Branch<br>Prediction                | 8K 2-bit                                  | Perfect  | Tournament                         |
| Cache                               | Perfect                                   | Perfect  | 64KI, 32KD, 1.92MB<br>L2, 36 MB L3 |
| Memory<br>Alias                     | Perfect v. Stack<br>v. Inspect v.<br>none | Perfect  | Perfect                            |



|                                     | New Model                        | Model    | Power 5                            |
|-------------------------------------|----------------------------------|----------|------------------------------------|
| Instructions<br>Issued per<br>clock | 64 (no<br>restrictions)          | Infinite | 4                                  |
| Instruction<br>Window Size          | Infinite vs. 256,<br>128, 64, 32 | Infinite | 200                                |
| Renaming<br>Registers               | 64 Int + 64 FP                   | Infinite | 48 integer +<br>40 Fl. Pt.         |
| Branch<br>Prediction                | 1K 2-bit                         | Perfect  | Tournament                         |
| Cache                               | Perfect                          | Perfect  | 64KI, 32KD, 1.92MB<br>L2, 36 MB L3 |
| Memory<br>Alias                     | HW<br>disambiguation             | Perfect  | Perfect                            |



#### Limits to ILP (1)

- Doubling issue rates above today's 3-6 instructions per clock, say to 6 to 12 instructions, probably requires a processor to
  - issue 3 or 4 data memory accesses per cycle,
  - resolve 2 or 3 branches per cycle,
  - rename and access more than 20 registers per cycle, and
  - fetch 12 to 24 instructions per cycle.
- The complexities of implementing these capabilities are likely to mean sacrifices in the maximum clock rate
  - E.g., widest issue processor is the Itanium 2, but it also has the slowest clock rate, despite the fact that it consumes the most power!

#### Limits to ILP (2)

- Most techniques for increasing performance increase power consumption
- The key question is whether a technique is energy *efficient*: does it increase power consumption faster than it increases performance?
- Multiple issue processors techniques all are energy inefficient:
  - 1. Issuing multiple instructions incurs some overhead in logic that grows faster than the issue rate grows
  - 2. Growing gap between peak issue rates and sustained performance
- Number of transistors switching = f(peak issue rate), and performance = f(sustained rate), growing gap between peak and sustained performance ⇒ increasing energy per unit of performance

## Reading

- This lecture:
  - chapter 2 ILP
  - chapter 3: 3.1-3.4 Limits to ILP
- Next lecture:
  - chapter 3: 3.5-3.8 Simultaneous Multithreading (SMT)
- No class on Wed Oct 31st
- Wed Nov 14th 11.15-13.00h & 13.45-15.30h, room 402

# Lecture 6 Simultaneous Multithreading

Slides were used during lectures by David Patterson, Berkeley, spring 2006

#### Outline

- Thread Level Parallelism (TLP)
- Multithreading
- Simultaneous Multithreading (SMT)
- Power 4 vs. Power 5
- Head to Head: VLIW vs. Superscalar vs. SMT
- Commentary
- Conclusion

#### How to Exceed ILP Limits?

- These are not laws of physics; just practical limits for today, and perhaps overcome via research
- Compiler and ISA advances could change results
- WAR and WAW hazards through memory: eliminated WAW and WAR hazards through register renaming, but not in memory usage
  - Can get conflicts via allocation of stack frames as a called procedure reuses the memory addresses of a previous frame on the stack

#### HW v. SW to increase ILP

- Memory disambiguation: HW best
- Speculation:
  - HW best when dynamic branch prediction better than compile time prediction
  - Exceptions easier for HW
  - HW doesn't need bookkeeping code or compensation code
  - Very complicated to get right
- Scheduling: SW can look ahead to schedule better
- Compiler independence: does not require new compiler, recompilation to run well

#### Performance beyond single thread ILP

- There can be much higher natural parallelism in some applications (e.g., Database or Scientific codes)
- Explicit Thread Level Parallelism or Data Level Parallelism
- Thread: process with own instructions and data
  - thread may be a process part of a parallel program of multiple processes, or it may be an independent program
  - Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
- Data Level Parallelism: Perform identical operations on data, and lots of data

#### **Thread Level Parallelism (TLP)**

- ILP exploits implicit parallel operations within a loop or straight-line code segment
- TLP explicitly represented by the use of multiple threads of execution that are inherently parallel
- Goal: Use multiple instruction streams to improve
  - 1. Throughput of computers that run many programs
  - 2. Execution time of multi-threaded programs
- TLP could be more cost-effective to exploit than ILP

#### New Approach: Multithreaded Execution

- Multithreading: multiple threads to share the functional units of 1 processor via overlapping
  - processor must duplicate independent state of each thread, e.g., a separate copy of register file, a separate PC, and for running independent programs, a separate page table
  - memory shared through the virtual memory mechanisms, which already support multiple processes
  - HW for fast thread switch; much faster than full process switch ≈ 100s to 1000s of clocks
- When switch?
  - Alternate instruction per thread (fine grain)
  - When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)

#### **Fine-Grained Multithreading**

- Switches between threads on each instruction, causing the execution of multiple threads to be interleaved
- Usually done in a round-robin fashion, skipping any stalled threads
- CPU must be able to switch threads every clock
- Advantage is it can hide both short and long stalls, since instructions from other threads executed when one thread stalls
- Disadvantage is it slows down execution of individual threads, since a thread ready to execute without stalls will be delayed by instructions from other threads
- Used on Sun's Niagara (will see later)
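
The round-robin policy with skipping of stalled threads can be sketched as follows. This is a toy model in Python, not a description of any real processor: one thread issues per clock, and the `stalled` dictionary (which threads are stalled in which cycle) is an input invented for this illustration.

```python
def fine_grained_schedule(num_threads, stalled, cycles):
    """Pick one thread per clock, round-robin, skipping stalled threads.

    stalled: dict mapping cycle -> set of thread ids stalled in that cycle.
    Returns the issuing thread id for each cycle (None means a bubble).
    """
    schedule = []
    next_thread = 0
    for cycle in range(cycles):
        blocked = stalled.get(cycle, set())
        chosen = None
        # Try each thread once, starting from the round-robin position.
        for offset in range(num_threads):
            candidate = (next_thread + offset) % num_threads
            if candidate not in blocked:
                chosen = candidate
                break
        schedule.append(chosen)
        if chosen is not None:
            next_thread = (chosen + 1) % num_threads
    return schedule


# Thread 1 is stalled during cycles 1 and 2 (e.g., on a cache miss).
print(fine_grained_schedule(3, {1: {1}, 2: {1}}, cycles=5))
```

While thread 1 is stalled, threads 2 and 0 fill its slots, so no issue cycle is lost; the price is that each thread issues at most once every few clocks even when it could run without stalls.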

#### **Coarse-Grained Multithreading**

- Switches threads only on costly stalls, such as L2 cache misses
- Advantages
  - Relieves need to have very fast thread-switching
  - Doesn't slow down thread, since instructions from other threads issued only when the thread encounters a costly stall
- Disadvantage is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
  - Since CPU issues instructions from 1 thread, when a stall occurs, the pipeline must be emptied or frozen
  - New thread must fill pipeline before instructions can complete
- Because of this start-up overhead, coarse-grained multithreading is better for reducing penalty of high cost stalls, where pipeline refill << stall time
- Used in IBM AS/400

#### For most apps, most execution units lie idle



#### Do both ILP and TLP?

- TLP and ILP exploit two different kinds of parallel structure in a program
- Could a processor oriented at ILP also exploit TLP?
  - functional units are often idle in data path designed for ILP because of either stalls or dependences in the code
- Could the TLP be used as a source of independent instructions that might keep the processor busy during stalls?
- Could TLP be used to employ the functional units that would otherwise lie idle when insufficient ILP exists?



#### Simultaneous Multithreading (SMT)

- Simultaneous multithreading (SMT): insight that dynamically scheduled processor already has many HW mechanisms to support multithreading
  - Large set of virtual registers that can be used to hold the register sets of independent threads
  - Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in datapath without confusing sources and destinations across threads
  - Out-of-order completion allows the threads to execute out of order, and get better utilization of the HW
- Just adding a per thread renaming table and keeping separate PCs
  - Independent commitment can be supported by logically keeping a separate reorder buffer for each thread

> Source: Microprocessor Report, December 6, 1999 "Compaq Chooses SMT for Alpha"



#### **Design Challenges in SMT**

- Since SMT makes sense only with fine-grained implementation, impact of fine-grained scheduling on single thread performance?
  - A preferred thread approach sacrifices neither throughput nor single-thread performance?
  - Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput, when preferred thread stalls
- Larger register file needed to hold multiple contexts
- Not affecting clock cycle time, especially in
  - Instruction issue - more candidate instructions need to be considered
  - Instruction completion - choosing which instructions to commit may be challenging
- Ensuring that cache and TLB conflicts generated by SMT do not degrade performance









#### Changes in Power 5 to support SMT

- Increased associativity of L1 instruction cache and the instruction address translation buffers
- Added per thread load and store queues
- Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
- Added separate instruction prefetch and buffering per thread
- Increased the number of virtual registers from 152 to 240
- Increased the size of several issue queues
- The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

#### **Initial Performance of SMT**

- Pentium 4 Extreme SMT yields 1.01 speedup for SPECint\_rate benchmark and 1.07 for SPECfp\_rate
  - Pentium 4 is dual threaded SMT
  - SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
- Running on Pentium 4 each of 26 SPEC benchmarks paired with every other (26<sup>2</sup> runs): speed-ups from 0.90 to 1.58; average was 1.20
- Power 5, 8 processor server: 1.23 faster for SPECint\_rate with SMT, 1.16 faster for SPECfp\_rate
- Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
  - Most gained some
  - Fl. Pt. apps had most cache conflicts and least gains

## Head to Head ILP competition

| Processor                        | Micro architecture                                                | Fetch /<br>Issue /<br>Execute | Func-<br>tional<br>Units | Clock<br>Rate<br>(GHz) | Transis-<br>tors,<br>Die size              | Power         |
|----------------------------------|-------------------------------------------------------------------|-------------------------------|--------------------------|------------------------|--------------------------------------------|---------------|
| Intel<br>Pentium<br>4<br>Extreme | Speculative<br>dynamically<br>scheduled; deeply<br>pipelined; SMT | 3/3/4                         | 7 int.<br>1 FP           | 3.8                    | 125 M,<br>122<br>mm <sup>2</sup>           | 115<br>W      |
| AMD<br>Athlon 64<br>FX-57        | Speculative<br>dynamically<br>scheduled                           | 3/3/4                         | 6 int.<br>3 FP           | 2.8                    | 114 M,<br>115<br>mm <sup>2</sup>           | 104<br>W      |
| IBM<br>Power5<br>(1 CPU<br>only) | Speculative<br>dynamically<br>scheduled; SMT;<br>2 CPU cores/chip | 8/4/8                         | 6 int.<br>2 FP           | 1.9                    | 200 M,<br>300<br>mm <sup>2</sup><br>(est.) | 80W<br>(est.) |
| Intel<br>Itanium 2               | Statically<br>scheduled<br>VLIW-style                             | 6/5/11                        | 9 int.<br>2 FP           | 1.6                    | 592 M,<br>423<br>mm <sup>2</sup>           | 130<br>W      |







#### **No Silver Bullet for ILP**

- No obvious overall leader in performance
- The AMD Athlon leads on SPECInt performance followed by the Pentium 4, Itanium 2, and Power5
- Itanium 2 and Power5, which perform similarly on SPECFP, clearly dominate the Athlon and Pentium 4 on SPECFP
- Itanium 2 is the most inefficient processor both for Fl. Pt. and integer code for all but one efficiency measure (SPECFP/Watt)
- Athlon and Pentium 4 both make good use of transistors and area in terms of efficiency
- IBM Power5 is the most effective user of energy on SPECFP and essentially tied on SPECINT

#### Limits to ILP

- Doubling issue rates above today's 3-6 instructions per clock, say to 6 to 12 instructions, probably requires a processor to
  - Issue 3 or 4 data memory accesses per cycle,
  - Resolve 2 or 3 branches per cycle,
  - Rename and access more than 20 registers per cycle, and

  - Fetch 12 to 24 instructions per cycle.
- Complexities of implementing these capabilities likely means sacrifices in maximum clock rate
  - E.g., widest issue processor is the Itanium 2, but it also has the slowest clock rate, despite the fact that it consumes the most power!

#### Limits to ILP

- Most techniques for increasing performance increase power consumption
- The key question is whether a technique is <u>energy</u> <u>efficient</u>: does it increase power consumption faster than it increases performance?
- Multiple issue processors techniques all are energy inefficient:
  - 1. Issuing multiple instructions incurs some overhead in logic that grows faster than the issue rate grows
  - 2. Growing gap between peak issue rates and sustained performance
- Number of transistors switching = f(peak issue rate), and performance = f( sustained rate), growing gap between peak and sustained performance ⇒ increasing energy per unit of performance

#### Commentary

- Itanium architecture does not represent a significant breakthrough in scaling ILP or in avoiding the problems of complexity and power consumption
- Instead of pursuing more ILP, architects are increasingly focusing on TLP implemented with single-chip multiprocessors
- In 2000, IBM announced the 1st commercial single-chip, general-purpose multiprocessor, the Power4, which contains 2 Power3 processors and an integrated L2 cache
  - Since then, Sun Microsystems, AMD, and Intel have switched to a focus on single-chip multiprocessors rather than more aggressive uniprocessors.
- Right balance of ILP and TLP is unclear today
  - Perhaps right choice for server market, which can exploit more TLP, may differ from desktop, where single-thread performance may continue to be a primary requirement

#### And in conclusion ...

- Limits to ILP (power efficiency, compilers, dependencies ...) seem to limit practical designs to 3 to 6 issue
- Explicit parallelism (data level parallelism or thread level parallelism) is the next step to performance
- Coarse grain vs. fine grained multithreading
  - Only on big stall vs. every clock cycle
- Simultaneous Multithreading is fine grained multithreading based on OOO superscalar microarchitecture
  - Instead of replicating registers, reuse rename registers
- Itanium/EPIC/VLIW is not a breakthrough in ILP
- Balance of ILP and TLP unclear in marketplace

## Reading

- This lecture:
  - chapter 3: Limits on ILP; Multithreading
- Next lecture:
  - appendix F (on CD): Vector processors
  - start with chapter 4: Multiprocessors

# Lecture 7 Vector Processors & Multiprocessor Introduction

Slides were used during lectures by Krste Asanovic & David Patterson, Berkeley, spring 2006

#### Outline

- Vector Processors
- Vector Metrics, Terms
- Multiprocessing Motivation
- SISD v. SIMD v. MIMD
- Centralized vs. Distributed Memory
- Challenges to Parallel Programming
- Conclusion

#### **Supercomputers**

Definition of a supercomputer:

- Fastest machine in world at given task
- A device to turn a compute-bound problem into an I/O bound problem
- Any machine costing \$30M+
- Any machine designed by Seymour Cray

CDC6600 (Cray, 1964) regarded as first supercomputer

#### **Supercomputer Applications**

#### Typical application areas

- Military research (nuclear weapons, cryptography)
- Scientific research
- Weather forecasting
- Oil exploration
- Industrial design (car crash simulation)

All involve huge computations on large data sets

In 70s-80s, Supercomputer = Vector Machine

#### **Vector Supercomputers**

#### Epitomized by Cray-1, 1976:

- Scalar Unit + Vector Extensions
- Load/Store Architecture
- Vector Registers
- Vector Instructions
- Hardwired Control
- Highly Pipelined Functional Units
- Interleaved Memory System
- No Data Caches
- No Virtual Memory







| C code               | Scalar code      | Vector code       |
|----------------------|------------------|-------------------|
| for (i=0; i<64; i++) | LI R4, 64        | LI VLR, 64        |
|  C[i] = A[i] + B[i]; | loop:            | LV V1, R1         |
|                      | L.D F0, 0(R1)    | LV V2, R2         |
|                      | L.D F2, 0(R2)    | ADDV.D V3, V1, V2 |
|                      | ADD.D F4, F2, F0 | SV V3, R3         |
|                      | S.D F4, 0(R3)    |                   |
|                      | DADDIU R1, 8     |                   |
|                      | DADDIU R2, 8     |                   |
|                      | DADDIU R3, 8     |                   |
|                      | DSUBIU R4, 1     |                   |
|                      | BNEZ R4, loop    |                   |
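As a rough illustration of why the vector version is called compact, the sketch below counts the dynamic instructions each column of the table executes for 64 elements. The instruction lists are copied from the table; the function name `dynamic_instruction_counts` and the assumption that all 64 elements fit in one vector register (so no strip-mining loop is needed) are choices made for this example only.

```python
# Count dynamic instructions for the 64-element loop C[i] = A[i] + B[i].
# The instruction lists mirror the scalar and vector columns of the table.

SCALAR_SETUP = ["LI R4, 64"]
SCALAR_LOOP_BODY = [
    "L.D F0, 0(R1)",
    "L.D F2, 0(R2)",
    "ADD.D F4, F2, F0",
    "S.D F4, 0(R3)",
    "DADDIU R1, 8",
    "DADDIU R2, 8",
    "DADDIU R3, 8",
    "DSUBIU R4, 1",
    "BNEZ R4, loop",
]
VECTOR_CODE = [
    "LI VLR, 64",
    "LV V1, R1",
    "LV V2, R2",
    "ADDV.D V3, V1, V2",
    "SV V3, R3",
]


def dynamic_instruction_counts(n_elements):
    """Return (scalar, vector) dynamic instruction counts.

    Assumes n_elements fits in one vector register, so the vector code
    needs no strip-mining loop (an assumption of this sketch).
    """
    scalar = len(SCALAR_SETUP) + n_elements * len(SCALAR_LOOP_BODY)
    vector = len(VECTOR_CODE)
    return scalar, vector


scalar_count, vector_count = dynamic_instruction_counts(64)
print(scalar_count, vector_count)  # 577 5
```

Under these assumptions the scalar loop fetches and decodes 577 instructions while the vector code fetches 5, even though both perform the same 64 additions.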

#### **Vector Instruction Set Advantages**

- Compact
  - one short instruction encodes N operations
- Expressive, tells hardware that these N operations:
  - are independent
  - use the same functional unit
  - access disjoint registers
  - access registers in the same pattern as previous instructions
  - access a contiguous block of memory (unit-stride load/store)
  - access memory in a known pattern (strided load/store)
- Scalable
  - can run same object code on more parallel pipelines or lanes













#### Vector Memory-Memory vs. Vector Register Machines

- Vector memory-memory architectures (VMMA) require greater main memory bandwidth, why?
  - All operands must be read in and out of memory
- VMMAs make it difficult to overlap execution of multiple vector operations, why?
  - Must check dependencies on memory addresses
- VMMAs incur greater startup latency
  - Scalar code was faster on CDC Star-100 for vectors < 100 elements
  - For Cray-1, vector/scalar breakeven point was around 2 elements
- ⇒ Apart from CDC follow-ons (Cyber-205, ETA-10) all major vector machines since Cray-1 have had vector register architectures





| Benchmark<br>name | Operations executed<br>in vector mode,<br>compiler-optimized | Operations executed<br>in vector mode,<br>hand-optimized | Speedup from<br>hand optimization |
|-------------------|--------------------------------------------------------------|----------------------------------------------------------|-----------------------------------|
| BDNA              | 96.1%                                                        | 97.2%                                                    | 1.52                              |
| MG3D              | 95.1%                                                        | 94.5%                                                    | 1.00                              |
| FLO52             | 91.5%                                                        | 88.7%                                                    | N/A                               |
| ARC3D             | 91.1%                                                        | 92.0%                                                    | 1.01                              |
| SPEC77            | 90.3%                                                        | 90.4%                                                    | 1.07                              |
| MDG               | 87.7%                                                        | 94.2%                                                    | 1.49                              |
| TRFD              | 69.8%                                                        | 73.7%                                                    | 1.67                              |
| DYFESM            | 68.8%                                                        | 65.6%                                                    | N/A                               |
| ADM               | 42.9%                                                        | 59.6%                                                    | 3.60                              |
| OCEAN             | 42.8%                                                        | 91.2%                                                    | 3.92                              |
| TRACK             | 14.4%                                                        | 54.6%                                                    | 2.52                              |
| SPICE             | 11.5%                                                        | 79.9%                                                    | 4.06                              |
| OCD               | 4.2%                                                         | 75.1%                                                    | 2.15                              |

| Processor        | Compiler             | Completely<br>vectorized | Partially<br>vectorized | Not<br>vectorized |
|------------------|----------------------|--------------------------|-------------------------|-------------------|
| CDC CYBER 205    | VAST-2 V2.21         | 62                       | 5                       | 33                |
| Convex C-series  | FC5.0                | 69                       | 5                       | 26                |
| Cray X-MP        | CFT77 V3.0           | 69                       | 3                       | 28                |
| Cray X-MP        | CFT V1.15            | 50                       | 1                       | 49                |
| Cray-2           | CFT2 V3.1a           | 27                       | 1                       | 72                |
| ETA-10           | FTN 77 V1.0          | 62                       | 7                       | 31                |
| Hitachi S810/820 | FORT77/HAP V20-2B    | 67                       | 4                       | 29                |
| IBM 3090/VF      | VS FORTRAN V2.4      | 52                       | 4                       | 44                |
| NEC SX/2         | FORTRAN77 / SX V.040 | 66                       | 5                       | 29                |

For each processor we indicate how many loops were completely vectorized, partially vectorized, and unvectorized. These loops were collected by Callahan, Dongarra, and Levine [1988]. Two different compilers for the Cray X-MP show the large dependence on compiler technology.













#### **Vector Scatter/Gather**

Want to vectorize loops with indirect accesses:

for (i=0; i<N; i++)
 A[i] = B[i] + C[D[i]];

#### Indexed load instruction (Gather)

LV vD, rD # Load indices in D vector
LVI vC, rC, vD # Load indirect from rC base
LV vB, rB # Load B vector
ADDV.D vA, vB, vC # Do add
SV vA, rA # Store result
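The gather sequence can be modelled in a few lines. This is a minimal sketch, not the hardware semantics in full: Python lists stand in for vector registers, the helper names `gather` and `vector_add` are invented for the example, and the data values are hypothetical.

```python
def gather(base, indices):
    """Model of LVI: load base[idx] for every idx in the index vector."""
    return [base[idx] for idx in indices]


def vector_add(x, y):
    """Model of ADDV.D: elementwise add of two equal-length vectors."""
    return [a + b for a, b in zip(x, y)]


# Hypothetical data for illustration.
B = [10.0, 20.0, 30.0, 40.0]
C = [0.5, 1.5, 2.5, 3.5, 4.5]
D = [4, 0, 2, 2]          # indices into C; repeats are harmless for a gather

vD = list(D)              # LV     vD, rD
vC = gather(C, vD)        # LVI    vC, rC, vD
vB = list(B)              # LV     vB, rB
vA = vector_add(vB, vC)   # ADDV.D vA, vB, vC
A = list(vA)              # SV     vA, rA

# Same result as the scalar loop A[i] = B[i] + C[D[i]].
assert A == [B[i] + C[D[i]] for i in range(len(B))]
print(A)  # [14.5, 20.5, 32.5, 42.5]
```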

#### **Vector Scatter/Gather**

#### Scatter example:

for (i=0; i<N; i++)
 A[B[i]]++;

#### Is following a correct translation?

```
LV vB, rB # Load indices in B vector
LVI vA, rA, vB # Gather initial A values
ADDV vA, vA, 1 # Increment
SVI vA, rA, vB # Scatter incremented values
```
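One way to explore the question is to model both versions and compare them. The sketch below is an illustration under simple assumptions: lists stand in for memory and vector registers, the whole index vector is processed as one vector, and the function names are invented for the example. It suggests the translation matches the scalar loop only when the indices in B are all distinct.

```python
def scalar_increment(A, B):
    """Reference semantics: for (i=0; i<N; i++) A[B[i]]++;"""
    A = list(A)
    for idx in B:
        A[idx] += 1
    return A


def vector_increment(A, B):
    """Model of the LV / LVI / ADDV / SVI sequence on one full vector."""
    A = list(A)
    vB = list(B)                      # LV   vB, rB
    vA = [A[idx] for idx in vB]       # LVI  vA, rA, vB  (gather old values)
    vA = [x + 1 for x in vA]          # ADDV vA, vA, 1
    for idx, value in zip(vB, vA):    # SVI  vA, rA, vB  (scatter)
        A[idx] = value
    return A


A0 = [0, 0, 0, 0]
unique = [3, 1, 0]      # every index appears once
repeated = [2, 2, 2]    # the same element is incremented three times

assert vector_increment(A0, unique) == scalar_increment(A0, unique)
print(scalar_increment(A0, repeated))  # [0, 0, 3, 0]
print(vector_increment(A0, repeated))  # [0, 0, 1, 0]
```

With repeated indices, all three gathered values are read before any store happens, so the increments collapse into one.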



SV vA, rA # Store A back to memory under mask





#### Problem: Loop-carried dependence on reduction variables

for (i=0; i<N; i++)
 sum += A[i]; # Loop-carried dependence on sum

Solution: Re-associate operations if possible, use binary tree to perform reduction

Rearrange as:

sum[0:VL-1] = 0 # Vector of VL partial sums
for(i=0; i<N; i+=VL) # Stripmine VL-sized chunks
 sum[0:VL-1] += A[i:i+VL-1]; # Vector sum

Now have VL partial sums in one vector register:

do {
 VL = VL/2; # Halve vector length
 sum[0:VL-1] += sum[VL:2\*VL-1] # Halve no. of partials
} while (VL>1)
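The rearranged reduction can be written out as a runnable sketch. The function name `vector_sum` is invented here, and two simplifying assumptions are made: VL is a power of two and the array length is a multiple of VL, so no cleanup loop is needed. Integers are used because re-associating floating-point additions can change rounding.

```python
def vector_sum(A, VL=8):
    """Sum A with VL partial sums followed by a binary-tree reduction.

    Assumes VL is a power of two and len(A) is a multiple of VL
    (assumptions made to keep the sketch short).
    """
    assert VL > 0 and VL & (VL - 1) == 0, "VL must be a power of two"
    assert len(A) % VL == 0, "len(A) must be a multiple of VL"

    partial = [0] * VL                          # sum[0:VL-1] = 0
    for i in range(0, len(A), VL):              # strip-mine VL-sized chunks
        chunk = A[i:i + VL]
        partial = [p + x for p, x in zip(partial, chunk)]   # vector add

    while VL > 1:                               # binary-tree reduction
        VL //= 2                                # halve vector length
        partial = [partial[j] + partial[j + VL] for j in range(VL)]
    return partial[0]


data = list(range(1, 33))       # 1..32
assert vector_sum(data, VL=8) == sum(data)
print(vector_sum(data, VL=8))   # 528
```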

#### A Modern Vector Super: NEC SX-6 (2003)

- CMOS Technology
  - 500 MHz CPU, fits on single chip
  - SDRAM main memory (up to 64GB)
- Scalar unit
  - 4-way superscalar with out-of-order and speculative execution
  - 64KB I-cache and 64KB data cache
- Vector unit
  - 8 foreground VRegs + 64 background VRegs (256x64-bit elements/VReg)
  - 1 multiply unit, 1 divide unit, 1 add/shift unit, 1 logical unit, 1 mask unit
  - 8 lanes (8 GFLOPS peak, 16 FLOPS/cycle)
  - 1 load & store unit (32x8 byte accesses/cycle)
  - 32 GB/s memory bandwidth per processor
- SMP structure
  - 8 CPUs connected to memory through crossbar
  - 256 GB/s shared memory bandwidth (4096 interleaved banks)

#### **Multimedia Extensions**

- Very short vectors added to existing ISAs for micros
- Usually 64-bit registers split into 2x32b or 4x16b or 8x8b
- Newer designs have 128-bit registers (Altivec, SSE2)
- Limited instruction set:
  - no vector length control
  - no strided load/store or scatter/gather
  - unit-stride loads must be aligned to 64/128-bit boundary
- Limited vector register length:
  - requires superscalar dispatch to keep multiply/add/load units busy
  - loop unrolling to hide latencies increases register pressure
- Trend towards fuller vector support in microprocessors

#### **Properties of Vector Processors**

- Each result independent of previous result
  - => long pipeline, compiler ensures no dependencies
  - => high clock rate
- Vector instructions access memory with known pattern
  - => highly interleaved memory
  - => amortize memory latency over ≈ 64 elements
  - => no (data) caches required! (Do use instruction cache)
- Reduces branches and branch problems in pipelines
- Single vector instruction implies lots of work (≈ loop) => fewer instruction fetches

### **Operation & Instruction Count: RISC v. Vector Processor**

| Program | Operations: RISC | Operations: Vector | Operations: R / V | Instructions: RISC | Instructions: Vector | Instructions: R / V |
|---------|------------------|--------------------|-------------------|--------------------|----------------------|---------------------|
| swim256 | 115              | 95                 | 1.1x              | 115                | 0.8                  | 142x                |
| hydro2d | 58               | 40                 | 1.4x              | 58                 | 0.8                  | 71x                 |
| nasa7   | 69               | 41                 | 1.7x              | 69                 | 2.2                  | 31x                 |
| su2cor  | 51               | 35                 | 1.4x              | 51                 | 1.8                  | 29x                 |
| tomcatv | 15               | 10                 | 1.4x              | 15                 | 1.3                  | 11x                 |
| wave5   | 27               | 25                 | 1.1x              | 27                 | 7.2                  | 4x                  |
| mdljdp2 | 32               | 52                 | 0.6x              | 32                 | 15.8                 | 2x                  |

#### **Common Vector Metrics**

- R<sub>∞</sub>: MFLOPS rate on an infinite-length vector
  - vector "speed of light"
  - Real problems do not have unlimited vector lengths, and the start-up penalties encountered in real problems will be larger
  - (R<sub>n</sub> is the MFLOPS rate for a vector of length n)
- N<sub>1/2</sub>: The vector length needed to reach one-half of R<sub>∞</sub>
  - a good measure of the impact of start-up
- N<sub>V</sub>: The vector length needed to make vector mode faster than scalar mode
  - measures both start-up and speed of scalars relative to vectors, quality of connection of scalar unit to vector unit
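To make the three metrics concrete, the sketch below evaluates them under a deliberately simple timing model that is an assumption of this example, not something stated on the slide: a vector operation on n elements takes `startup_cycles + n * cycles_per_element` clocks, and scalar code takes `n * scalar_cycles_per_element` clocks. All function names and numeric values are hypothetical.

```python
import math


def rate_mflops(n, startup_cycles, cycles_per_element, clock_mhz, flops_per_element=1):
    """R_n under the assumed model T(n) = startup + n * cycles_per_element."""
    cycles = startup_cycles + n * cycles_per_element
    return flops_per_element * n * clock_mhz / cycles


def r_infinity(cycles_per_element, clock_mhz, flops_per_element=1):
    """Limit of R_n as n grows: the start-up term vanishes."""
    return flops_per_element * clock_mhz / cycles_per_element


def n_half(startup_cycles, cycles_per_element):
    """Smallest n with R_n >= R_inf / 2, i.e. n * cycles_per_element >= startup."""
    return math.ceil(startup_cycles / cycles_per_element)


def n_v(startup_cycles, cycles_per_element, scalar_cycles_per_element):
    """Smallest n for which vector time is strictly less than scalar time."""
    gain = scalar_cycles_per_element - cycles_per_element
    if gain <= 0:
        return None   # vector mode never wins under this model
    return math.floor(startup_cycles / gain) + 1


# Hypothetical machine: 500 MHz, 64-cycle start-up, 1 cycle per vector
# element, 5 cycles per element in scalar mode.
print(r_infinity(1, 500))            # 500.0
print(n_half(64, 1))                 # 64
print(rate_mflops(64, 64, 1, 500))   # 250.0
print(n_v(64, 1, 5))                 # 17
```

In this model N<sub>1/2</sub> equals the start-up time divided by the per-element time, which is why it is a direct measure of start-up overhead.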

#### **Vector Execution Time**

- Time = f(vector length, data dependencies, struct. hazards)
- Initiation rate: rate that FU consumes vector elements (= number of lanes; usually 1 or 2 on Cray T-90)
- Convoy: set of vector instructions that can begin execution in same clock (no struct. or data hazards)
- Chime: approx. time for a vector operation
- <u>m convoys take m chimes</u>; if each vector length is n, then they take approx. m x n clock cycles (ignores overhead; good approximation for long vectors)

1: LV <u>V1</u>,Rx ;load vector X
2: MULV <u>V2</u>,F0,<u>V1</u> ;vector-scalar mult.
   LV V3,Ry ;load vector Y
3: ADDV <u>V4</u>,<u>V2</u>,V3 ;add
4: SV Ry,<u>V4</u> ;store the result

4 convoys, 1 lane, VL=64 ⇒ 4 x 64 = 256 clocks (or 4 clocks per result)
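The convoy grouping in this example can be reproduced with a small greedy sketch. It is a simplification under stated assumptions: instructions are packed in program order, each functional unit can appear once per convoy, only read-after-write dependences are checked, and no chaining is assumed. The function name `build_convoys` and the tuple layout are invented for the example.

```python
def build_convoys(instructions):
    """Greedily pack instructions, in program order, into convoys.

    Each instruction is (text, unit, dest, sources). An instruction starts a
    new convoy if its functional unit is already used in the current convoy
    (structural hazard) or if it reads a register written in the current
    convoy (data hazard; no chaining is assumed).
    """
    convoys = []
    units_in_use, written = set(), set()
    for text, unit, dest, sources in instructions:
        hazard = unit in units_in_use or any(s in written for s in sources)
        if not convoys or hazard:
            convoys.append([])
            units_in_use, written = set(), set()
        convoys[-1].append(text)
        units_in_use.add(unit)
        if dest is not None:
            written.add(dest)
    return convoys


program = [
    ("LV V1,Rx",      "load/store", "V1", []),
    ("MULV V2,F0,V1", "multiply",   "V2", ["V1"]),
    ("LV V3,Ry",      "load/store", "V3", []),
    ("ADDV V4,V2,V3", "add",        "V4", ["V2", "V3"]),
    ("SV Ry,V4",      "load/store", None, ["V4"]),
]

convoys = build_convoys(program)
VL, lanes = 64, 1
clocks = len(convoys) * VL // lanes   # m convoys x n elements, overhead ignored
print(len(convoys), clocks)  # 4 256
```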

#### **Memory operations**

- Load/store operations move groups of data between registers and memory
- Three types of addressing

- Unit stride
  - » Contiguous block of information in memory
  - » Fastest: always possible to optimize this
- Non-unit (constant) stride
  - » Harder to optimize memory system for all possible strides
  - » Prime number of data banks makes it easier to support different strides at full bandwidth
- <u>Indexed</u> (gather-scatter)
  - » Vector equivalent of register indirect
  - » Good for sparse arrays of data
  - » Increases number of programs that vectorize




- How to do prime number of banks efficiently?
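The remark about a prime number of banks can be checked with a short calculation. The sketch assumes word-interleaved memory in which address a maps to bank `a mod num_banks`; under that assumption a stride-s access stream touches `num_banks / gcd(s, num_banks)` distinct banks. The function names are invented for the example.

```python
from math import gcd


def banks_touched(stride, num_banks):
    """Number of distinct banks hit by addresses 0, s, 2s, ... (word-interleaved)."""
    return num_banks // gcd(stride, num_banks)


def banks_touched_bruteforce(stride, num_banks, accesses=1024):
    """Same quantity by direct enumeration, as a cross-check."""
    return len({(i * stride) % num_banks for i in range(accesses)})


for num_banks in (16, 17):
    row = {s: banks_touched(s, num_banks) for s in (1, 2, 4, 8, 16, 17)}
    print(num_banks, row)
# 16 {1: 16, 2: 8, 4: 4, 8: 2, 16: 1, 17: 16}
# 17 {1: 17, 2: 17, 4: 17, 8: 17, 16: 17, 17: 1}
```

With 16 banks, power-of-two strides concentrate accesses on a few banks; with 17 banks every stride that is not a multiple of 17 spreads over all banks. The cost, as the bullet above asks, is computing the bank number without a cheap bit-field extraction.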

#### **Vectors Are Inexpensive**

#### Scalar

- N ops per cycle $\Rightarrow O(N^2)$ circuitry
- HP PA-8000
  - 4-way issue
  - reorder buffer: 850K transistors
    - incl. 6,720 5-bit register number comparators

#### Vector

- N ops per cycle $\Rightarrow O(N + \epsilon N^2)$ circuitry
- T0 vector micro
  - 24 ops per cycle
  - 730K transistors total
    - only 23 5-bit register number comparators
  - No floating point

#### **Vectors Lower Power**

#### Single-issue Scalar

- One instruction fetch, decode, dispatch per operation
- Arbitrary register accesses, adds area and power
- Loop unrolling and software pipelining for high performance increases instruction cache footprint
- All data passes through cache; waste power if no temporal locality
- One TLB lookup per load or store
- Off-chip access in whole cache lines

#### Vector

- One inst fetch, decode, dispatch per vector
- Structured register accesses
- Smaller code for high performance, less power in instruction cache misses
- Bypass cache
- One TLB lookup per group of loads or stores
- Move only necessary data across chip boundary

#### Superscalar Energy Efficiency Even Worse


#### **Superscalar**

- Control logic grows quadratically with issue width
- Control logic consumes energy regardless of available parallelism
- Speculation to increase visible parallelism wastes energy

#### Vector

- Control logic grows linearly with issue width
- Vector unit switches off when not in use
- Vector instructions expose parallelism without speculation
- Software control of speculation when desired:
  - Whether to use vector mask or compress/expand for conditionals

#### **Vector Applications**

- Limited to scientific computing?
- Multimedia Processing (compress., graphics, audio synth, image proc.)
- Standard benchmark kernels (Matrix Multiply, FFT, Convolution, Sort)
- Lossy Compression (JPEG, MPEG video and audio)
- Lossless Compression (Zero removal, RLE, Differencing, LZW)
- Cryptography (RSA, DES/IDEA, SHA/MD5)
- Speech and handwriting recognition
- Operating systems/Networking (memcpy, memset, parity, checksum)
- Databases (hash/join, data mining, image/video serving)
- Language run-time support (stdlib, garbage collection)
- even SPECint95

#### **Older Vector Machines**

| Machine    | Year | Clock   | Regs  | Elements | FUs | <u>LSUs</u> |
|------------|------|---------|-------|----------|-----|-------------|
| Cray 1     | 1976 | 80 MHz  | 8     | 64       | 6   | 1           |
| Cray XMP   | 1983 | 120 MHz | 8     | 64       | 8   | 2 L, 1 S    |
| Cray YMP   | 1988 | 166 MHz | 8     | 64       | 8   | 2 L, 1 S    |
| Cray C-90  | 1991 | 240 MHz | 8     | 128      | 8   | 4           |
| Cray T-90  | 1996 | 455 MHz | 8     | 128      | 8   | 4           |
| Convex C-1 | 1984 | 10 MHz  | 8     | 128      | 4   | 1           |
| Convex C-4 | 1994 | 133 MHz | 16    | 128      | 3   | 1           |
| Fuj. VP200 | 1982 | 133 MHz | 8-256 | 32-1024  | 3   | 2           |
| Fuj. VP300 | 1996 | 100 MHz | 8-256 | 32-1024  | 3   | 2           |
| NEC SX/2   | 1984 | 160 MHz | 8+8K  | 256+var  | 16  | 8           |
| NEC SX/3   | 1995 | 400 MHz | 8+8K  | 256+var  | 16  | 8           |

#### **Newer Vector Computers**

- Cray X1
  - MIPS like ISA + Vector in CMOS
- NEC Earth Simulator
  - Fastest computer in world for 3 years; 40 TFLOPS
  - 640 CMOS vector nodes

#### **Key Architectural Features of X1**

New vector instruction set architecture (ISA)

- Much larger register set (32x64 vector, 64+64 scalar)
- 64- and 32-bit memory and IEEE arithmetic
- Based on 25 years of experience compiling with Cray1 ISA

Decoupled Execution

- Scalar unit runs ahead of vector unit, doing addressing and control
- Hardware dynamically unrolls loops, and issues multiple loops concurrently
- Special sync operations keep pipeline full, even across barriers
- ⇒ Allows the processor to perform well on short nested loops

Scalable, distributed shared memory (DSM) architecture

- Memory hierarchy: caches, local memory, remote memory
- Low latency, load/store access to entire machine (tens of TBs)
- Processors support 1000's of outstanding refs with flexible addressing
- Very high bandwidth network
- Coherence protocol, addressing and synchronization optimized for DM

#### Cray X1E Mid-life Enhancement

- Technology refresh of the X1 (0.13μm)
  - ~50% faster processors
  - Scalar performance enhancements
  - Doubling processor density
  - Modest increase in memory system bandwidth
  - Same interconnect and I/O
- Machine upgradeable
  - Can replace Cray X1 nodes with X1E nodes

# ESS – configuration of a general purpose supercomputer

- 1. Processor Nodes (PN): Total number of processor nodes is 640. Each processor node consists of eight vector processors of 8 GFLOPS and 16 GB of shared memory. Therefore, the total number of processors is 5,120, and the total peak performance and main memory of the system are 40 TFLOPS and 10 TB, respectively. Two nodes are installed in one cabinet, whose size is 40"x56"x80". 16 nodes are in a cluster. Power consumption per cabinet is approximately 20 KVA.
- 2. Interconnection Network (IN): The nodes are coupled together with more than 83,000 copper cables via single-stage crossbar switches of 16 GB/s x2 (Load + Store). The total length of the cables is approximately 1,800 miles.
- 3. Hard Disk: RAID disks are used for the system. The capacities are 450 TB for the system's operations and 250 TB for users.
- 4. Mass Storage system: 12 Automatic Cartridge Systems (STK PowderHorn9310); total storage capacity is approximately 1.6 PB.

From Horst D. Simon, NERSC/LBNL, May 15, 2002, "ESS Rapid Response Meeting"
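The system totals quoted for the processor nodes follow from the per-node figures. A minimal check, using only numbers from the configuration above (the constant names are invented for the example, and binary units are assumed for the memory total):

```python
NODES = 640
CPUS_PER_NODE = 8
GFLOPS_PER_CPU = 8
MEMORY_GB_PER_NODE = 16

total_cpus = NODES * CPUS_PER_NODE                   # 5,120 processors
peak_tflops = total_cpus * GFLOPS_PER_CPU / 1000     # 40.96, quoted as 40 TFLOPS
memory_tb = NODES * MEMORY_GB_PER_NODE / 1024        # 10.0 TB (binary units)
print(total_cpus, peak_tflops, memory_tb)  # 5120 40.96 10.0
```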







#### **Vector Summary**

- Vector is alternative model for exploiting ILP
- If code is vectorizable, then simpler hardware, more energy efficient, and better real-time model than Out-of-order machines
- Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations
- Fundamental design issue is memory bandwidth
  - With virtual address translation and caching
- Will multimedia popularity revive vector architectures?

#### **Outline**

- Vector Processors
- Vector Metrics, Terms
- Multiprocessing Motivation
- SISD v. SIMD v. MIMD
- Centralized vs. Distributed Memory
- Challenges to Parallel Programming
- Conclusion



#### Déjà vu all over again?

"... today's processors ... are nearing an impasse as technologies approach the speed of light.."

David Mitchell, *The Transputer: The Time Is Now* (1989)

- Transputer had bad timing (Uniprocessor performance↑)
- ⇒ Procrastination rewarded: 2X seq. perf. / 1.5 years

"We are dedicating all of our future product development to multicore designs. ... This is a sea change in computing"

Paul Otellini, President, Intel (2005)

- All microprocessor companies switch to MP (2X CPUs / 2 yrs)
- ⇒ Procrastination penalized: 2X sequential perf. / 5 yrs

| Manufacturer/Year | AMD/'05 | Intel/'06 | IBM/'04 | Sun/'05 |
|-------------------|---------|-----------|---------|---------|
| Processors/chip   | 2       | 2         | 2       | 8       |
| Threads/Processor | 1       | 2         | 2       | 4       |
| Threads/chip      | 2       | 4         | 4       | 32      |

#### Other Factors ⇒ Multiprocessors

- Growth in data-intensive applications
  - Data bases, file servers, ...
- Growing interest in servers, server perf.
- Increasing desktop perf. less important
  - Outside of graphics
- Improved understanding in how to use multiprocessors effectively
  - Especially server where significant natural TLP
- Advantage of leveraging design investment by replication
  - Rather than unique design

# Flynn's Taxonomy

M.J. Flynn, "Very High-Speed Computers", Proc. of the IEEE, V 54, 1900-1909, Dec. 1966.



|                      | Single Data          | Multiple Data                          |
|----------------------|----------------------|----------------------------------------|
| Single Instruction   | SISD (Uniprocessor)  | <u>SIMD</u> (single PC: Vector, CM-2)  |
| Multiple Instruction | MISD (????)          | <u>MIMD</u> (Clusters, SMP servers)    |

- SIMD ⇒ Data Level Parallelism
- MIMD ⇒ Thread Level Parallelism
- MIMD popular because
  - Flexible: N pgms and 1 multithreaded pgm
  - Cost-effective: same MPU in desktop & MIMD

# Back to Basics

- "A parallel computer is a collection of processing elements that <u>cooperate</u> and communicate to solve large problems fast."
- Parallel Architecture = Computer Architecture + Communication Architecture
- Two classes of multiprocessors WRT memory:
  - 1. Centralized Memory Multiprocessor
    - < few dozen processor chips (and < 100 cores) in 2006
    - Small enough to share single, centralized memory
  - 2. Physically Distributed-Memory multiprocessor
    - Larger number chips and cores than 1
    - BW demands ⇒ Memory distributed among processors





#### **Distributed Memory Multiprocessor**

- Pro: Cost-effective way to scale memory bandwidth
  - If most accesses are to local memory
- Pro: Reduces latency of local memory accesses
- Con: Communicating data between processors more complex
- Con: Must change software to take advantage of increased memory BW

# Two Models for Communication and Memory Architecture

- 1. Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors
- 2. Communication occurs through a shared address space (via loads and stores): shared memory multiprocessors either
  - UMA (Uniform Memory Access time) for shared address, centralized memory MP
  - NUMA (Non Uniform Memory Access time multiprocessor) for shared address, distributed memory MP
- In past, confusion whether "sharing" means sharing physical memory (Symmetric MP) or sharing address space





#### **Challenges of Parallel Processing**

- Second challenge is long latency to remote memory
- Suppose 32 CPU MP, 2GHz, 200 ns remote memory, all local accesses hit memory hierarchy and base CPI is 0.5. (Remote access = 200/0.5 = 400 clock cycles.)
- What is performance impact if 0.2% instructions involve remote access?
  - a. 1.5X
  - b. 2.0X
  - c. 2.5X

#### **CPI Equation**

CPI = Base CPI + Remote request rate x Remote request cost

= 0.5 + 0.2% x 400 = 0.5 + 0.8 = 1.3

With no communication, the machine is 1.3/0.5 = 2.6 times faster than when 0.2% of instructions involve a remote access
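The same calculation as a short sketch, using the numbers from the example (2 GHz clock, 200 ns remote access, base CPI 0.5, 0.2% remote accesses). The function name `effective_cpi` is invented for the illustration.

```python
def effective_cpi(base_cpi, remote_rate, remote_cost_cycles):
    """CPI = base CPI + remote request rate x remote request cost."""
    return base_cpi + remote_rate * remote_cost_cycles


clock_ghz = 2.0
remote_latency_ns = 200.0
cycle_time_ns = 1.0 / clock_ghz                     # 0.5 ns per clock
remote_cost = remote_latency_ns / cycle_time_ns     # 400 clock cycles

cpi = effective_cpi(base_cpi=0.5, remote_rate=0.002, remote_cost_cycles=remote_cost)
slowdown = cpi / 0.5
print(round(cpi, 3), round(slowdown, 3))  # 1.3 2.6
```

Even a remote access rate of one instruction in 500 more than doubles the CPI, which is why the frequency of remote accesses matters so much.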

#### And in Conclusion [1/2] ...

- One instruction operates on vectors of data
- Vector loads get data from memory into big register files, operate, and then vector store
- E.g., Indexed load, store for sparse matrix
- Easy to add vector to commodity instruction set
  - E.g., Morph SIMD into vector
- Vector is very efficient architecture for vectorizable codes, including multimedia and many scientific codes

#### And in Conclusion [2/2] ...

- "End" of uniprocessors speedup => Multiprocessors
- Parallelism challenges: % parallelizable, long latency to remote memory
- Centralized vs. distributed memory
  - Small MP vs. lower latency, larger BW for larger MP
- Message Passing vs. Shared Address
  - Uniform access time vs. Non-uniform access time

#### **Reading and Schedule**

- This lecture:
  - Appendix F: Vector Processors
  - Chapter 4: 4.1 Introduction Multiprocessors
- Next week, Oct 31st: No class
- Next lecture, Nov 7<sup>th</sup>: remainder of chapter 4 (in the afternoon feedback on assignment 2a)
- On Wed Nov 14<sup>th</sup> both at 11.15-13.00h and at 13.45-15.30h lectures in room 402

# Lecture 8 Snooping Cache Based Multiprocessors

Slides were used during lectures by David Patterson, Berkeley, spring 2006

#### **Review**

- 1 instruction operates on vectors of data
- Vector loads get data from memory into big register files, operate, and then vector store
- E.g., Indexed load, store for sparse matrix
- Easy to add vector to commodity instruction set
  - E.g., Morph SIMD into vector
- Vector is very efficient architecture for vectorizable codes, including multimedia and many scientific codes
- "End" of uniprocessor speedup ⇒ Multiprocessors
- Parallelism challenges: % parallelizable, long latency to remote memory
- Centralized vs. distributed memory
  - Small MP vs. lower latency, larger BW for larger MP
- Message Passing vs. Shared Address
  - Uniform access time vs. Non-uniform access time

#### Outline

- Review
- Coherence
- Write Consistency
- Snooping
- Building Blocks
- Snooping protocols and examples
- Coherence traffic and Performance on MP
- Conclusion

#### **Challenges of Parallel Processing**

- 1. Application parallelism ⇒ primarily via new algorithms that have better parallel performance
- 2. Long remote latency impact ⇒ both by architect and by the programmer
- For example, reduce frequency of remote accesses either by
  - Caching shared data (HW)
  - Restructuring the data layout to make more accesses local (SW)
- Today's lecture on HW to help latency via caches








#### **Defining Coherent Memory System**

- 1. <u>Preserve Program Order</u>: A read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P
- 2. <u>Coherent view of memory</u>: Read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses
- 3. <u>Write serialization</u>: 2 writes to same location by any 2 processors are seen in the same order by all processors
  - If not, a processor could keep value 1 since saw as last write
  - For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1

#### Write Consistency

#### For now assume

- 1. A write does not complete (and allow the next write to occur) until all processors have seen the effect of that write
- 2. The processor does not change the order of any write with respect to any other memory access
- ⇒ if a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A

These restrictions allow the processor to reorder reads, but force the processor to finish writes in program order

#### **Basic Schemes for Enforcing Coherence**

- Programs on multiple processors will normally have copies of the same data in several caches
  - Unlike I/O, where it's rare
- Rather than trying to avoid sharing in SW, SMPs use a HW protocol to maintain coherent caches
  - Migration and Replication key to performance of shared data
- Migration: data can be moved to a local cache and used there in a transparent fashion
  - Reduces both latency to access shared data that is allocated remotely and bandwidth demand on the shared memory
- Replication: for reading shared data simultaneously, since caches make a copy of data in local cache
  - Reduces both latency of access and contention for read shared data

#### Outline

- Review
- Coherence
- Write Consistency
- Snooping
- Building Blocks
- Snooping protocols and examples
- Coherence traffic and Performance on MP
- Conclusion

#### **Two Classes of Cache Coherence Protocols**

- 1. <u>Directory based</u>: Sharing status of a block of physical memory is kept in just one location, the <u>directory</u>
- 2. <u>Snooping</u>: Every cache with a copy of data also has a copy of sharing status of block, but no centralized state is kept
  - All caches are accessible via some broadcast medium (a bus or switch)
  - All cache controllers monitor or snoop on the medium to determine whether or not they have a copy of a block that is requested on a bus or switch access





#### **Architectural Building Blocks**

- Cache block state transition diagram
  - FSM specifying how disposition of block changes
    - » invalid, valid, exclusive
- Broadcast Medium Transactions (e.g., bus)
  - Fundamental system design abstraction
  - Logically single set of wires connect several devices
  - Protocol: arbitration, command/addr, data
  - ⇒ Every device observes every transaction
- Broadcast medium enforces serialization of read or write accesses ⇒ Write serialization
  - 1<sup>st</sup> processor to get medium invalidates others' copies
  - Implies cannot complete write until it obtains bus
  - All coherence schemes require serializing accesses to same cache block
- Also need to find up-to-date copy of cache block

#### Locate up-to-date copy of data

- Write-through: get up-to-date copy from memory
  - Write through simpler if enough memory BW
- Write-back harder
  - Most recent copy can be in a cache
- Can use same snooping mechanism
  - 1. Snoop every address placed on the bus
  - 2. If a processor has dirty copy of requested cache block, it provides it in response to a read request and aborts the memory access
  - Complexity from retrieving cache block from cache, which can take longer than retrieving it from memory
- Write-back needs lower memory bandwidth
  - ⇒ Support larger numbers of faster processors
  - ⇒ Most multiprocessors use write-back

#### **Cache Resources for WB Snooping**

- Normal cache tags can be used for snooping
- Valid bit per block makes invalidation easy
- Read misses easy since rely on snooping
- Writes ⇒ Need to know whether any other copies of the block are cached
  - No other copies ⇒ No need to place write on bus for WB
  - Other copies ⇒ Need to place invalidate on bus

#### **Cache Resources for WB Snooping**

- To track whether a cache block is shared, add extra state bit associated with each cache block, like valid bit and dirty bit
  - Write to Shared block  $\Rightarrow$  Need to place invalidate on bus and mark cache block as private (if an option)
  - No further invalidations will be sent for that block
  - This processor called owner of cache block
- Owner then changes state from shared to unshared (or exclusive)
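The write-hit decision described above can be sketched in a few lines. This is a minimal illustration, not a real cache controller: the `BlockState` class and the `write_hit` function are hypothetical names, and the bus itself is reduced to a boolean "invalidate needed" result.

```python
class BlockState:
    """Per-block state bits: valid, dirty, and the extra shared bit."""
    def __init__(self, valid=False, dirty=False, shared=False):
        self.valid = valid
        self.dirty = dirty
        self.shared = shared

def write_hit(block):
    """Handle a write hit; return True if an invalidate must go on the bus."""
    if not block.valid:
        raise ValueError("not a hit: block is invalid")
    need_invalidate = block.shared   # other copies may exist only if shared
    block.shared = False             # this cache becomes the owner
    block.dirty = True               # write-back: memory is now stale
    return need_invalidate

block = BlockState(valid=True, shared=True)
print(write_hit(block))   # first write to a shared block
print(write_hit(block))   # later writes by the owner stay off the bus
```

The second call shows why the shared bit saves bandwidth: once the block is private, writes no longer generate bus traffic.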

#### Cache behavior in response to bus

- Every bus transaction must check the cache address tags
  - Could potentially interfere with processor cache accesses
- A way to reduce interference is to duplicate tags
  - One set for cache accesses, one set for bus accesses
- Another way to reduce interference is to use L2 tags
   Since L2 less heavily used than L1
  - ⇒ Every entry in L1 cache must be present in the L2 cache, called the inclusion property
  - If Snoop gets a hit in L2 cache, then it must arbitrate for the L1 cache to update the state and possibly retrieve the data, which usually requires a stall of the processor

#### **Example Protocol**

- Snooping coherence protocol is usually implemented by incorporating a finite-state controller in each node
- Logically, think of a separate controller associated with each cache block

   That is, snooping operations or cache requests for different blocks can proceed independently
- In implementations, a single controller allows multiple operations to distinct blocks to proceed in interleaved fashion
  - That is, one operation may be initiated before another is completed, even though only one cache access or one bus access is allowed at a time



#### Is 2-state Protocol Coherent?

- Processor only observes state of memory system by issuing memory operations
- Assume bus transactions and memory operations are atomic, and a one-level cache
  - all phases of one bus transaction complete before next one starts
  - processor waits for memory operation to complete before issuing next
  - with one-level cache, assume invalidations applied during bus transaction
- All writes go to bus + atomicity
  - Writes serialized by order in which they appear on bus (bus order)
     => invalidations applied to caches in bus order
- How to insert reads in this order?
  - Important since processors see writes through reads, so determines whether write serialization is satisfied
  - But read hits may happen independently and do not appear on bus or enter directly in bus order
- Let's understand other ordering issues



 Doesn't constrain ordering of reads, though shared-medium (bus) will order read misses too

 any order among reads between writes is fine, as long as in program order














#### **Example**

| step | P1 State | P1 Addr | P1 Value | P2 State | P2 Addr | P2 Value | Bus Action | Bus Proc. | Bus Addr | Bus Value | Memory Addr | Memory Value |
|--------------------|-------|----|----|-------|----|----|------|----|----|----|----|----|
| P1: Write 10 to A1 | Excl. | A1 | 10 |       |    |    | WrMs | P1 | A1 |    |    |    |
| P1: Read A1        | Excl. | A1 | 10 |       |    |    |      |    |    |    |    |    |
| P2: Read A1        |       |    |    | Shar. | A1 |    | RdMs | P2 | A1 |    |    |    |
|                    | Shar. | A1 | 10 |       |    |    | WrBk | P1 | A1 | 10 | A1 | 10 |
|                    |       |    |    | Shar. | A1 | 10 | RdDa | P2 | A1 | 10 | A1 | 10 |
| P2: Write 20 to A1 |       |    |    |       |    |    |      |    |    |    |    |    |
| P2: Write 40 to A2 |       |    |    |       |    |    |      |    |    |    |    |    |

Assumes A1 and A2 map to same cache block.
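The trace in the example can be replayed with a small simulation. The following is a sketch under stated assumptions, not the protocol of any particular machine: a three-state (Invalid, Shared, Exclusive) write-back invalidate protocol, one cache line per processor (so A1 and A2 conflict), atomic bus transactions, and a write to a Shared block treated as a write miss. The class names `Line` and `SnoopingBus` are hypothetical, and the bus actions for the last two steps are what this sketch produces, since the table leaves those rows blank.

```python
INVALID, SHARED, EXCLUSIVE = "Inv.", "Shar.", "Excl."

class Line:
    """The single cache line of one processor (A1 and A2 both map here)."""
    def __init__(self):
        self.state, self.addr, self.value = INVALID, None, 0

class SnoopingBus:
    def __init__(self, n_procs):
        self.lines = [Line() for _ in range(n_procs)]
        self.memory = {}          # addr -> value; an absent addr reads as 0
        self.log = []             # (action, processor, addr) in bus order

    def _bus(self, action, proc, addr):
        self.log.append((action, f"P{proc + 1}", addr))

    def _evict(self, proc, new_addr):
        line = self.lines[proc]
        if line.addr == new_addr or line.state == INVALID:
            return
        if line.state == EXCLUSIVE:            # dirty victim: write it back
            self.memory[line.addr] = line.value
            self._bus("WrBk", proc, line.addr)
        line.state = INVALID

    def _snoop(self, requester, addr, is_write):
        for p, line in enumerate(self.lines):
            if p == requester or line.addr != addr or line.state == INVALID:
                continue
            if line.state == EXCLUSIVE:        # owner supplies the dirty block
                self.memory[addr] = line.value
                self._bus("WrBk", p, addr)
            line.state = INVALID if is_write else SHARED

    def read(self, proc, addr):
        line = self.lines[proc]
        if line.addr == addr and line.state != INVALID:
            return line.value                  # read hit: no bus traffic
        self._bus("RdMs", proc, addr)
        self._evict(proc, addr)
        self._snoop(proc, addr, is_write=False)
        line.state, line.addr, line.value = SHARED, addr, self.memory.get(addr, 0)
        self._bus("RdDa", proc, addr)
        return line.value

    def write(self, proc, addr, value):
        line = self.lines[proc]
        if not (line.addr == addr and line.state == EXCLUSIVE):
            self._bus("WrMs", proc, addr)      # also used for writes to Shared
            self._evict(proc, addr)
            self._snoop(proc, addr, is_write=True)
        line.state, line.addr, line.value = EXCLUSIVE, addr, value

bus = SnoopingBus(2)
bus.write(0, "A1", 10)   # P1: Write 10 to A1
bus.read(0, "A1")        # P1: Read A1 (hit)
bus.read(1, "A1")        # P2: Read A1
bus.write(1, "A1", 20)   # P2: Write 20 to A1
bus.write(1, "A2", 40)   # P2: Write 40 to A2
for entry in bus.log:
    print(*entry)
```

Every write that is not an Exclusive hit goes through `_bus` first, so the order of entries in `log` is the bus order that serializes writes.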





#### **Implementation Complications**

- Write Races:
  - Cannot update cache until bus is obtained
    - » Otherwise, another processor may get bus first, and then write the same cache block!
  - Two step process:
    - » Arbitrate for bus
    - » Place miss on bus and complete operation
  - If miss occurs to block while waiting for bus, handle miss (invalidate may be needed) and then restart.
  - Split transaction bus:
    - » Bus transaction is not atomic: can have multiple outstanding transactions for a block
    - » Multiple misses can interleave, allowing two caches to grab block in the Exclusive state
    - » Must track and prevent multiple misses for one block
- Must support interventions and invalidations

#### **Implementing Snooping Caches**

- Multiple processors must be on bus, with access to both addresses and data
- Add a few new commands to perform coherency, in addition to read and write
- Processors continuously snoop on address bus
  - If address matches tag, either invalidate or update
- Since every bus transaction checks cache tags, could interfere with CPU just to check:
  - solution 1: duplicate set of tags for L1 caches just to allow checks in parallel with CPU
  - solution 2: L2 cache already duplicate, provided L2 obeys inclusion with L1 cache
    - » block size, associativity of L2 affects L1

#### Limitations in Symmetric Shared-Memory **Multiprocessors and Snooping Protocols**

- Single memory must accommodate all CPUs ⇒ Multiple memory banks
- In a bus-based multiprocessor, the bus must support both coherence traffic & normal memory traffic
  - ⇒ Multiple buses or interconnection networks (crossbar or small point-to-point)
- Opteron
  - Memory connected directly to each dual-core chip
  - Point-to-point connections for up to 4 chips
  - Remote memory and local memory latency are similar, allowing the OS to treat Opteron as a UMA computer

#### Outline

- Review
- Coherence
- Write Consistency
- Snooping
- Building Blocks
- Snooping protocols and examples
- Coherence traffic and Performance on MP
- Conclusion

#### **Performance of Symmetric Shared-Memory Multiprocessors**

#### Cache performance is combination of

- 1. Uniprocessor cache miss traffic
- 2. Traffic caused by communication
  - Results in invalidations and subsequent cache misses

#### 4th C: coherence miss

Joins Compulsory, Capacity, Conflict

#### **Coherency Misses**

- 1. True sharing misses arise from the communication of data through the cache coherence mechanism
  - Invalidates due to 1<sup>st</sup> write to shared block
  - Reads by another CPU of modified block in different cache
  - Miss would still occur if block size were 1 word
- 2. False sharing misses arise when a block is invalidated because some word in the block, other than the one being read, is written into
  - Invalidation does not cause a new value to be communicated, but only causes an extra cache miss
  - Block is shared, but no word in block is actually shared ⇒ miss would not occur if block size were 1 word

#### **Example: True v. False Sharing v. Hit?**

Assume x1 and x2 in same cache block. P1 and P2 both read x1 and x2 before.

| Time | P1       | P2       | True, False, Hit? Why?          |
|------|----------|----------|---------------------------------|
| 1    | Write x1 |          | True miss; invalidate x1 in P2  |
| 2    |          | Read x2  | False miss; x1 irrelevant to P2 |
| 3    | Write x1 |          | False miss; x1 irrelevant to P2 |
| 4    |          | Write x2 | False miss; x1 irrelevant to P2 |
| 5    | Read x2  |          | True miss; invalidate x2 in P1  |





#### A Cache Coherent System Must:

- Provide set of states, state transition diagram, and actions
- Manage coherence protocol
  - (0) Determine when to invoke coherence protocol
  - (a) Find info about state of block in other caches to determine action
    - » whether need to communicate with other cached copies
  - (b) Locate the other copies
  - (c) Communicate with those copies (invalidate/update)
- (0) is done the same way on all systems
  - state of the line is maintained in the cache
  - protocol is invoked if an "access fault" occurs on the line
- Different approaches distinguished by (a) to (c)

#### **Bus-based Coherence**

- All of (a), (b), (c) done through broadcast on bus
  - faulting processor sends out a "search"
  - others respond to the search probe and take necessary action
- Could do it in scalable network too
  - broadcast to all processors, and let them respond
- Conceptually simple, but broadcast doesn't scale with p
  - on bus, bus bandwidth doesn't scale
  - on scalable network, every fault leads to at least p network transactions
- Scalable coherence:
  - can have same cache states and state transition diagram
  - different mechanisms to manage protocol

#### And in Conclusion ...

- Caches contain all information on state of cached memory blocks
- Snooping cache over shared medium for smaller MP by invalidating other cached copies on write
- Sharing cached data ⇒ Coherence (values returned by a read), Consistency (when a written value will be returned by a read)
- MPs are highly effective for multiprogrammed workloads
- MPs proved effective for intensive commercial workloads, such as OLTP (assuming enough I/O to be CPU-limited), DSS applications (where query optimization is critical), and large-scale web-searching applications

#### **Reading and Schedule**

- This lecture:
  - 4.2 Symmetric Shared-Memory Architectures
  - 4.3 Performance of Symmetric Shared-Memory Multiprocessors
- This afternoon: feedback on assignment 2a
- Next week, Nov 14th:
  - 11.15-13.00h: directory-based MP & rest of chapter 4
  - 13.45-15.30h: chapter 5 memory hierarchy design

### Lecture 9 Directory Based Multiprocessors

Slides were used during lectures by David Patterson, Berkeley, spring 2006

#### **Review**

- Caches contain all information on state of cached memory blocks
- Snooping cache over shared medium for smaller MP by invalidating other cached copies on write
- Sharing cached data ⇒ Coherence (values returned by a read), Consistency (when a written value will be returned by a read)

#### Outline

- Review
- Directory-based protocols and examples
- Synchronization
- Consistency
- Cross Cutting Issues
- Fallacies and Pitfalls
- Cautionary Tale
- Sun T1 ("Niagara") Multiprocessor
- Microprocessor Comparison
- Conclusion

#### **Bus-based Coherence**

- All of (a), (b), (c) done through broadcast on bus
  - faulting processor sends out a "search"
  - others respond to the search probe and take necessary action
- Could do it in scalable network too
  - broadcast to all processors, and let them respond
- Conceptually simple, but broadcast doesn't scale with p
  - on bus, bus bandwidth doesn't scale
  - on scalable network, every fault leads to at least p network transactions
- Scalable coherence:
  - can have same cache states and state transition diagram
  - different mechanisms to manage protocol

#### **Scalable Approach: Directories**

- Every memory block has associated directory information
  - keeps track of copies of cached blocks and their states
  - on a miss, find directory entry, look it up, and communicate only with the nodes that have copies if necessary
  - in scalable networks, communication with directory and copies is through network transactions
- Many alternatives for organizing directory information



#### **Directory Protocol**

- Similar to Snoopy Protocol: three states
  - Shared: ≥ 1 processors have data, memory up-to-date
  - Uncached: no processor has it; not valid in any cache
  - Exclusive: 1 processor (owner) has data; memory out-of-date
- In addition to cache state, must track which processors have data when in the shared state (usually bit vector, 1 if processor has copy)
- Keep it simple(r):
  - Writes to non-exclusive data ⇒ write miss
  - Processor blocks until access completes
  - Assume messages received and acted upon in order sent
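A directory entry with the three states and a sharer bit vector might look like the sketch below. The class `DirectoryEntry` and its method names are hypothetical and chosen for illustration; real directories pack this state into hardware fields per memory block.

```python
UNCACHED, SHARED, EXCLUSIVE = "Uncached", "Shared", "Exclusive"

class DirectoryEntry:
    """State of one memory block plus a bit vector of processors holding it."""
    def __init__(self, n_procs):
        self.n_procs = n_procs
        self.state = UNCACHED
        self.sharers = 0                 # bit p is 1 if processor p has a copy

    def add_sharer(self, p):
        self.sharers |= 1 << p           # set bit p
        self.state = SHARED

    def set_owner(self, p):
        self.sharers = 1 << p            # exactly one bit set: the owner
        self.state = EXCLUSIVE

    def clear(self):
        self.sharers = 0
        self.state = UNCACHED

    def holders(self):
        """Processor numbers whose bit is set, lowest first."""
        return [p for p in range(self.n_procs) if (self.sharers >> p) & 1]

entry = DirectoryEntry(n_procs=4)
entry.add_sharer(0)
entry.add_sharer(2)
print(entry.state, bin(entry.sharers), entry.holders())
entry.set_owner(3)
print(entry.state, entry.holders())
```

The bit vector costs one bit per processor per memory block, which is why the slide calls it the usual choice rather than the only one.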

#### **Directory Protocol**

- No bus and don't want to broadcast:
  - interconnect no longer single arbitration point
  - all messages have explicit responses
- Terms: typically 3 processors involved
  - Local node where a request originates
  - Home node where the memory location of an address resides
  - Remote node has a copy of a cache block, whether exclusive or shared
- Example messages on next slide: P = processor number, A = address

#### **Directory Protocol Messages (Fig 4.22)**

| Message type | Source | Destination | Msg Content | Function |
|------------------|----------------|----------------|---------|----------|
| Read miss | Local cache | Home directory | P, A | Processor P reads data at address A; make P a read sharer and request data |
| Write miss | Local cache | Home directory | P, A | Processor P has a write miss at address A; make P the exclusive owner and request data |
| Invalidate | Home directory | Remote caches | A | Invalidate a shared copy at address A |
| Fetch | Home directory | Remote cache | A | Fetch the block at address A and send it to its home directory; change the state of A in the remote cache to shared |
| Fetch/Invalidate | Home directory | Remote cache | A | Fetch the block at address A and send it to its home directory; invalidate the block in the cache |
| Data value reply | Home directory | Local cache | Data | Return a data value from the home memory (read miss response) |
| Data write back | Remote cache | Home directory | A, Data | Write back a data value for address A (invalidate response) |

#### **State Transition Diagram for One Cache Block in Directory Based System**

- States identical to snoopy case; transactions very similar
- Transitions caused by read misses, write misses, invalidates, data fetch requests
- Generates read miss & write miss message to home directory
- Write misses that were broadcast on the bus for snooping  $\Rightarrow$  explicit invalidate & data fetch requests
- Note: on a write, a cache block is bigger, so need to read the full cache block










#### **Example Directory Protocol**

- Block is Exclusive: current value of the block is held in the cache of the processor identified by the set Sharers (the owner) ⇒ three possible directory requests:
  - Read miss: owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to Shared and causes the owner to send the data to the directory, where it is written to memory & sent back to the requesting processor. Identity of requesting processor is added to set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy). State is Shared.
  - Data write-back: owner processor is replacing the block and hence must write it back, making memory copy up-to-date (the home directory essentially becomes the owner), the block is now Uncached, and the Sharer set is empty.
  - Write miss: block has a new owner. A message is sent to old owner causing the cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to identity of new owner, and state of block is made Exclusive.

| step | P1 State | P1 Addr | P1 Value | P2 State | P2 Addr | P2 Value | Bus Action | Bus Proc. | Bus Addr | Bus Value | Directory Addr | Directory State | Directory {Procs} | Memory Value |
|--------------------|-------|----|----|-------|----|----|--------|----|----|----|----|-------|---------|----|
| P1: Write 10 to A1 |       |    |    |       |    |    | WrMs   | P1 | A1 |    | A1 | Excl. | {P1}    |    |
|                    | Excl. | A1 | 10 |       |    |    | DaRp   | P1 | A1 | 0  |    |       |         |    |
| P1: Read A1        | Excl. | A1 | 10 |       |    |    |        |    |    |    |    |       |         |    |
| P2: Read A1        |       |    |    | Shar. | A1 |    | RdMs   | P2 | A1 |    |    |       |         |    |
|                    | Shar. | A1 | 10 |       |    |    | Ftch   | P1 | A1 | 10 | A1 |       |         | 10 |
|                    |       |    |    | Shar. | A1 | 10 | DaRp   | P2 | A1 | 10 | A1 | Shar. | {P1,P2} | 10 |
| P2: Write 20 to A1 |       |    |    | Excl. | A1 | 20 | WrMs   | P2 | A1 |    |    |       |         | 10 |
|                    | Inv.  |    |    |       |    |    | Inval. | P1 | A1 |    | A1 | Excl. | {P2}    | 10 |
| P2: Write 40 to A2 |       |    |    |       |    |    | WrMs   | P2 | A2 |    | A2 | Excl. | {P2}    | 0  |
|                    |       |    |    |       |    |    | WrBk   | P2 | A1 | 20 | A1 | Unca. | {}      | 20 |
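The message sequence in the directory example can be reproduced with a short simulation. This is a sketch under several assumptions that are not stated in the table: each processor has a single cache line (so A1 and A2 conflict), replacement of a Shared line is silent, and a write miss by a processor that already holds a Shared copy gets no data reply. `DirectorySystem`, `Cache`, and the abbreviation `FtIn` (for Fetch/Invalidate) are hypothetical names.

```python
UNCACHED, SHARED, EXCLUSIVE, INVALID = "Unca.", "Shar.", "Excl.", "Inv."

class Cache:
    def __init__(self):
        self.state, self.addr, self.value = INVALID, None, 0

class DirectorySystem:
    def __init__(self, n_procs):
        self.caches = [Cache() for _ in range(n_procs)]
        self.memory = {}       # addr -> value; an absent addr reads as 0
        self.directory = {}    # addr -> [state, set of sharer processor ids]
        self.log = []          # (message, processor, addr) in the order sent

    def _msg(self, kind, proc, addr):
        self.log.append((kind, f"P{proc + 1}", addr))

    def _entry(self, addr):
        return self.directory.setdefault(addr, [UNCACHED, set()])

    def _evict(self, proc, new_addr):
        c = self.caches[proc]
        if c.addr == new_addr or c.state == INVALID:
            return
        if c.state == EXCLUSIVE:                 # owner must write back
            self._msg("WrBk", proc, c.addr)
            self.memory[c.addr] = c.value
            self.directory[c.addr] = [UNCACHED, set()]
        c.state = INVALID

    def _recall(self, addr, keep_shared):
        """Fetch (or fetch/invalidate) the block from its exclusive owner."""
        (owner,) = self._entry(addr)[1]
        self._msg("Ftch" if keep_shared else "FtIn", owner, addr)
        self.memory[addr] = self.caches[owner].value
        self.caches[owner].state = SHARED if keep_shared else INVALID

    def read(self, proc, addr):
        c = self.caches[proc]
        if c.addr == addr and c.state != INVALID:
            return c.value                       # hit: no messages
        self._msg("RdMs", proc, addr)
        self._evict(proc, addr)
        entry = self._entry(addr)
        if entry[0] == EXCLUSIVE:
            self._recall(addr, keep_shared=True)
        entry[0] = SHARED
        entry[1].add(proc)
        c.state, c.addr, c.value = SHARED, addr, self.memory.get(addr, 0)
        self._msg("DaRp", proc, addr)
        return c.value

    def write(self, proc, addr, value):
        c = self.caches[proc]
        if not (c.addr == addr and c.state == EXCLUSIVE):
            had_copy = c.addr == addr and c.state == SHARED
            self._msg("WrMs", proc, addr)
            self._evict(proc, addr)
            entry = self._entry(addr)
            if entry[0] == EXCLUSIVE:
                self._recall(addr, keep_shared=False)
            elif entry[0] == SHARED:
                for p in sorted(entry[1] - {proc}):   # only nodes with copies
                    self._msg("Inval.", p, addr)
                    if self.caches[p].addr == addr:
                        self.caches[p].state = INVALID
            entry[0], entry[1] = EXCLUSIVE, {proc}
            if not had_copy:
                self._msg("DaRp", proc, addr)
        c.state, c.addr, c.value = EXCLUSIVE, addr, value

system = DirectorySystem(2)
system.write(0, "A1", 10)
system.read(0, "A1")
system.read(1, "A1")
system.write(1, "A1", 20)
system.write(1, "A2", 40)
for entry in system.log:
    print(*entry)
```

Unlike the bus version, invalidations go only to the processors recorded in the sharer set, which is the property that lets the scheme scale.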













#### **A Popular Middle Ground**

- Two-level "hierarchy"
- Individual nodes are multiprocessors, connected non-hierarchically
  - e.g. mesh of SMPs
- Coherence across nodes is directory-based
  - directory keeps track of nodes, not individual processors
- Coherence within nodes is snooping or directory
  - orthogonal, but needs a good interface of functionality
- SMP on a chip: directory + snoop?

#### Synchronization

- Why Synchronize? Need to know when it is safe for different processes to use shared data
- Issues for Synchronization:
  - Uninterruptible instruction to fetch and update memory (atomic operation);
  - User level synchronization operation using this primitive;
  - For large scale MPs, synchronization can be a bottleneck; techniques to reduce contention and latency of synchronization

# Uninterruptible Instruction to Fetch and Update Memory

- Atomic exchange: interchange a value in a register for a value in memory
  - 0 ⇒ synchronization variable is free
  - 1 ⇒ synchronization variable is locked and unavailable
  - Set register to 1 & swap
  - New value in register determines success in getting lock
    - 0 if you succeeded in setting the lock (you were first)
    - 1 if other processor had already claimed access
  - Key is that exchange operation is indivisible
- Test-and-set: tests a value and sets it if the value passes the test
- Fetch-and-increment: it returns the value of a memory location and atomically increments it
  - 0 ⇒ synchronization variable is free
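A spin lock built on atomic exchange can be sketched as follows. Python has no hardware exchange instruction, so the hypothetical `AtomicCell` class models indivisibility with an internal `threading.Lock`; on a real machine the `exchange` method would be a single uninterruptible instruction. The functions `acquire`, `release`, and `run` are likewise illustrative names.

```python
import threading
import time

class AtomicCell:
    """A memory word with an indivisible exchange (modelled, not hardware)."""
    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()   # stands in for hardware atomicity

    def exchange(self, new):
        with self._guard:                # swap register value and memory value
            old, self._value = self._value, new
            return old

    def store(self, new):
        with self._guard:
            self._value = new

def acquire(lock_var):
    # Set register to 1 & swap; 0 coming back means we were first.
    while lock_var.exchange(1) == 1:
        time.sleep(0)                    # someone else holds it: yield and retry

def release(lock_var):
    lock_var.store(0)                    # 0 => synchronization variable is free

def run(n_threads=4, iters=200):
    lock_var = AtomicCell(0)
    counter = [0]

    def worker():
        for _ in range(iters):
            acquire(lock_var)
            counter[0] += 1              # critical section on shared data
            release(lock_var)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]

print(run())
```

Because only one thread can ever read back 0 from `exchange`, every increment happens inside the lock and the final count equals `n_threads * iters`.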





#### **Another MP Issue: Memory Consistency Models**

- What is consistency? When must a processor see the new value? e.g., seems that

| P1 | P2 |
|---------------------|---------------------|
| A = 0;              | B = 0;              |
| ...                 | ...                 |
| A = 1;              | B = 1;              |
| L1: if (B == 0) ... | L2: if (A == 0) ... |

- Impossible for both if statements L1 & L2 to be true?
  - What if write invalidate is delayed & processor continues?
- Memory consistency models: what are the rules for such cases?
- Sequential consistency: result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved ⇒ assignments before ifs above
  - SC: delay all memory accesses until all invalidates done
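The claim about L1 and L2 can be checked by brute force. The sketch below enumerates every interleaving that keeps each processor's accesses in program order, which is what sequential consistency permits. It then repeats the check with each write made visible only after the same processor's read, a deliberately crude model of a delayed invalidate. The function names and this "delayed" model are assumptions for illustration.

```python
# Each access: (processor, kind, variable, value written or None)
P1 = [("P1", "write", "A", 1), ("P1", "read", "B", None)]   # A = 1; L1: if (B == 0)
P2 = [("P2", "write", "B", 1), ("P2", "read", "A", None)]   # B = 1; L2: if (A == 0)

def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def outcome(order):
    """Return (L1 condition true, L2 condition true) for one global order."""
    mem = {"A": 0, "B": 0}
    saw_zero = {}
    for proc, kind, var, val in order:
        if kind == "write":
            mem[var] = val
        else:
            saw_zero[proc] = (mem[var] == 0)
    return saw_zero["P1"], saw_zero["P2"]

def outcomes(p1, p2):
    return {outcome(order) for order in interleavings(p1, p2)}

sc = outcomes(P1, P2)                  # program order preserved
delayed = outcomes(P1[::-1], P2[::-1]) # each write becomes visible after the read

print((True, True) in sc)        # False: both ifs cannot be true under SC
print((True, True) in delayed)   # True: possible once writes are delayed
```

Under sequential consistency at least one processor always sees the other's write, whereas the delayed model admits the outcome where both conditions hold.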

#### Memory Consistency Model

- Schemes for faster execution than sequential consistency
- Not an issue for most programs; they are synchronized
  - A program is synchronized if all accesses to shared data are ordered by synchronization operations: write (x), release (s) {unlock}, acquire (s) {lock}, read (x)
- Only those programs willing to be nondeterministic are not synchronized: "data race": outcome f(proc. speed)
- Several Relaxed Models for Memory Consistency since most programs are synchronized; characterized by their attitude towards: RAR, WAR, RAW, WAW to different addresses

# **Relaxed Consistency Models: The Basics**

- Key idea: allow reads and writes to complete out of order, but use synchronization operations to enforce ordering, so that a synchronized program behaves as if the processor were sequentially consistent
  - By relaxing orderings, may obtain performance advantages
  - Also specifies range of legal compiler optimizations on shared data
  - Unless synchronization points are clearly defined and programs are synchronized, compiler could not interchange read and write of 2 shared data items because that might affect the semantics of the program
- 3 major sets of relaxed orderings:
  - 1. W→R ordering (all writes completed before next read)
    - Because it retains ordering among writes, many programs that operate under sequential consistency operate under this model, without additional synchronization. Called processor consistency.
  - 2. W→W ordering (all writes completed before next write)
  - 3. R→W and R→R orderings, a variety of models depending on ordering restrictions and how synchronization operations enforce ordering
- Many complexities in relaxed consistency models: defining precisely what it means for a write to complete; deciding when processors can see values that they have written

#### Mark Hill observation

Instead, use speculation to hide latency from strict consistency model

- If processor receives invalidation for memory reference before it is committed, processor uses speculation recovery to back out computation and restart with invalidated memory reference
- 1. Aggressive implementation of sequential consistency or processor consistency gains most of advantage of more relaxed models
- 2. Implementation adds little to implementation cost of speculative processor
- 3. Allows the programmer to reason using the simpler programming models

#### Cross Cutting Issues: Performance Measurement of Parallel Processors

- Performance: how well does it scale as the number of processors increases?
- Speedup fixed as well as scaleup of problem
  - Assume benchmark of size n on p processors makes sense: how to scale benchmark to run on m \* p processors?
  - <u>Memory-constrained scaling</u>: keeping the amount of memory used per processor constant
  - <u>Time-constrained scaling</u>: keeping total execution time, assuming perfect speedup, constant
- Example: 1 hour on 10 P, time ~ O(n<sup>3</sup>), 100 P?
  - <u>Time-constrained scaling</u>: 1 hour ⇒ 10<sup>1/3</sup>n ⇒ 2.15n scale up
  - <u>Memory-constrained scaling</u>: 10n size ⇒ 10<sup>3</sup>/10 ⇒ 100X or 100 hours! 10X processors for 100X longer???
  - Need to know application well to scale: # iterations, error tolerance
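
The two results in the example follow from the stated cost model. A short derivation, assuming run time is proportional to $n^3/p$ and that memory use grows linearly with the problem size $n$ (the symbols $T$, $n'$ and $p$ are introduced here for illustration):

$$
T(n, p) \propto \frac{n^3}{p}
$$

Time-constrained scaling from 10 to 100 processors keeps $T$ fixed:

$$
\frac{n'^3}{100} = \frac{n^3}{10} \;\Rightarrow\; n' = 10^{1/3}\, n \approx 2.15\, n
$$

Memory-constrained scaling gives 10X the memory, so $n' = 10n$:

$$
\frac{T(10n, 100)}{T(n, 10)} = \frac{(10n)^3 / 100}{n^3 / 10} = \frac{10^3}{10} = 100
$$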

# Fallacy: Amdahl's Law doesn't apply to parallel computers

- Since some part linear, can't go 100X?
- 1987 claim to break it, since 1000X speedup
  - Researchers scaled the benchmark to have a data set size that is 1000 times larger and compared the uniprocessor and parallel execution times of the scaled benchmark. For this particular algorithm the sequential portion of the program was constant, independent of the size of the input, and the rest was fully parallel; hence, linear speedup with 1000 processors
- Usually the sequential portion scales with the data too
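
A small sketch of the arithmetic behind this fallacy. `amdahl_speedup` is the standard form of Amdahl's Law; `scaled_speedup` models the 1987 experiment under the assumption stated above (constant sequential time, parallel work growing with the data set). Function and parameter names are illustrative.

```c
/* Amdahl's Law: speedup on p processors when a fraction
 * `parallel_fraction` of the original run time is perfectly parallel
 * and the rest stays sequential. */
double amdahl_speedup(double parallel_fraction, int p)
{
    double sequential_fraction = 1.0 - parallel_fraction;
    return 1.0 / (sequential_fraction + parallel_fraction / p);
}

/* Scaled experiment: the sequential part takes a constant time `t_seq`,
 * while the parallel work is `t_par_per_unit * scale`. Returns the ratio
 * of uniprocessor time to p-processor time for the scaled problem. */
double scaled_speedup(double t_seq, double t_par_per_unit, double scale, int p)
{
    double uni = t_seq + t_par_per_unit * scale;
    double par = t_seq + t_par_per_unit * scale / p;
    return uni / par;
}
```

With a 1% sequential fraction, `amdahl_speedup(0.99, 100)` is about 50, well short of 100X. By contrast, `scaled_speedup(0.001, 1.0, 1000.0, 1000)` is about 999, because the sequential time was held constant while the data set grew; Amdahl's Law still applies to the scaled problem.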

# Fallacy: Linear speedups are needed to make multiprocessors cost-effective

- Mark Hill & David Wood 1995 study
  - Compare costs SGI uniprocessor and MP
  - Uniprocessor = \$38,400 + \$100 \* MB
  - MP = \$81,600 + \$20,000 \* P + \$100 \* MB
  - 1 GB, uni = \$138k v. mp = \$181k + \$20k \* P
- What speedup for better MP cost performance?
  - 8 proc = \$341k; \$341k/138k ⇒ 2.5X
  - 16 proc ⇒ need only 3.6X, or 25% linear speedup
- Even if need some more memory for MP, not linear
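
The break-even speedups can be reproduced from the two cost formulas. A minimal sketch, assuming 1 GB is counted as 1000 MB (which matches the \$138k figure); the function names are illustrative.

```c
/* Cost model from the comparison above, in dollars.
 * `mb` is memory in MB, `procs` the number of MP processors. */
long uni_cost(long mb)
{
    return 38400L + 100L * mb;
}

long mp_cost(long procs, long mb)
{
    return 81600L + 20000L * procs + 100L * mb;
}

/* Speedup the MP must reach to match the uniprocessor's
 * cost/performance: the ratio of the two prices. */
double breakeven_speedup(long procs, long mb)
{
    return (double)mp_cost(procs, mb) / (double)uni_cost(mb);
}
```

`breakeven_speedup(8, 1000)` is about 2.5 and `breakeven_speedup(16, 1000)` about 3.6, far below linear speedup in both cases.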

#### Fallacy: Scalability is almost free

- "build scalability into a multiprocessor and then simply offer the multiprocessor at any point on the scale from a small number of processors to a large number"
- Cray T3E scales to 2048 CPUs vs. 4 CPU Alpha
  - At 128 CPUs, it delivers a peak bisection BW of 38.4 GB/s, or 300 MB/s per CPU (uses Alpha microprocessor)
  - Compaq Alphaserver ES40 up to 4 CPUs and has 5.6 GB/s of interconnect BW, or 1400 MB/s per CPU
- Building apps that scale requires significantly more attention to load balance, locality, potential contention, and serial (or partly parallel) portions of program. 10X is very hard

# Pitfall: Not developing SW to take advantage (or optimize for) multiprocessor architecture

- SGI OS protects the page table data structure with a single lock, assuming that page allocation is infrequent
- Suppose a program uses a large number of pages that are initialized at start-up
- Program parallelized so that multiple processes allocate the pages
- But page allocation requires lock of page table data structure, so even an OS kernel that allows multiple threads will be serialized at initialization (even if separate processes)

#### Answers to 1995 Questions about Parallelism

- In the 1995 edition of this text, we concluded the chapter with a discussion of two then current controversial issues.
- 1. What architecture would very large scale, microprocessor-based multiprocessors use?
- 2. What was the role for multiprocessing in the future of microprocessor architecture?
- Answer 1. Large scale multiprocessors did not become a major and growing market ⇒ clusters of single microprocessors or moderate SMPs
- Answer 2. Astonishingly clear. For at least the next 5 years, future MPU performance comes from the exploitation of TLP through multicore processors vs. exploiting more ILP

#### **Cautionary Tale**

- Key to success of birth and development of ILP in 1980s and 1990s was software in the form of optimizing compilers that could exploit ILP
- Similarly, successful exploitation of TLP will depend as much on the development of suitable software systems as it will on the contributions of computer architects
- Given the slow progress on parallel software in the past 30+ years, it is likely that exploiting TLP broadly will remain challenging for years to come









- 1.2 GHz at ≈72W typical, 79W peak power consumption
- Write through: allocate LD, no-allocate ST





#### CPI Breakdown of Performance

| Benchmark | Per Thread CPI | Per core CPI | Effective CPI for 8 cores | Effective IPC for 8 cores |
|-----------|----------------|--------------|---------------------------|---------------------------|
| TPC-C     | 7.20           | 1.80         | 0.23                      | 4.4                       |
| SPECJBB   | 5.60           | 1.40         | 0.18                      | 5.7                       |
| SPECWeb99 | 6.60           | 1.65         | 0.21                      | 4.8                       |
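
The columns of this table are related by simple division. A sketch of that arithmetic, assuming 4 threads per core (32 threads on 8 cores) and that threads and cores overlap perfectly; the macro and function names are illustrative.

```c
#define THREADS_PER_CORE 4   /* assumed: 32 threads / 8 cores */
#define CORES            8

/* CPI seen by one core when its threads interleave perfectly. */
double per_core_cpi(double per_thread_cpi)
{
    return per_thread_cpi / THREADS_PER_CORE;
}

/* CPI of the whole chip with all cores running. */
double effective_cpi(double per_thread_cpi)
{
    return per_core_cpi(per_thread_cpi) / CORES;
}

/* Instructions per clock for the whole chip. */
double effective_ipc(double per_thread_cpi)
{
    return 1.0 / effective_cpi(per_thread_cpi);
}
```

For TPC-C, a per-thread CPI of 7.20 gives a per-core CPI of 1.80, an effective CPI of 0.225 and an effective IPC of about 4.4, matching the first row after rounding.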



#### Performance: Benchmarks + Sun Marketing

| Benchmark\Architecture                                      | Sun Fire T2000 | IBM p5-550 with 2 dual-core Power5 chips | Dell PowerEdge                                  |
|-------------------------------------------------------------|----------------|------------------------------------------|-------------------------------------------------|
| SPECjbb2005 (Java server software) business operations/ sec | 63,378         | 61,789                                   | 24,208 (SC1425 with dual single-core Xeon)      |
| SPECweb2005 (Web server performance)                        | 14,001         | 7,881                                    | 4,850 (2850 with two dual-core Xeon processors) |
| NotesBench (Lotus Notes performance)                        | 16,061         | 14,740                                   |                                                 |
| SPECjappServer 2004 Dual Node                               |                |                                          |                                                 |

#### Space, Watts, and Performance

|                             | Sun Fire T2000 | HP 1x4640 |
|-----------------------------|----------------|-----------|
| Space (RU)                  | 2              | 4         |
| Watts                       | 320            | 1,303     |
| Performance (SPECjapp JOPs) | 615            | 471       |
| Performance / Watt          | 1.922          | 0.361     |
| SWaP                        | 0.96           | 0.09      |
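
The last two rows can be derived from the first three. A sketch, assuming SWaP is defined as performance divided by the product of space and power, which reproduces the SWaP row; the function names are illustrative.

```c
/* Performance per watt. */
double perf_per_watt(double perf, double watts)
{
    return perf / watts;
}

/* SWaP (Space, Watts and Performance), assumed here to be
 * performance / (rack units * watts). */
double swap_metric(double perf, double rack_units, double watts)
{
    return perf / (rack_units * watts);
}
```

`swap_metric(615, 2, 320)` is about 0.96 and `swap_metric(471, 4, 1303)` about 0.09.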



| Processor                            | SUN T1           | Opteron       | Pentium D      | IBM Power 5      |
|--------------------------------------|------------------|---------------|----------------|------------------|
| Cores                                | 8                | 2             | 2              | 2                |
| Instruction issues<br>/ clock / core | 1                | 3             | 3              | 4                |
| Peak instr. issues                   | 8                | 6             | 6              | 8                |
| Multithreading                       | Fine-<br>grained | No            | SMT            | SMT              |
| L1 I/D in KB per core                | 16/8             | 64/64         | 12K<br>uops/16 | 64/32            |
| L2 per core/shared                   | 3 MB<br>shared   | 1MB /<br>core | 1MB/<br>core   | 1.9 MB<br>shared |
| Clock rate (GHz)                     | 1.2              | 2.4           | 3.2            | 1.9              |
| Transistor count (M)                 | 300              | 233           | 230            | 276              |
| Die size (mm <sup>2</sup> )          | 379              | 199           | 206            | 389              |
| Power (W)                            | 79               | 110           | 130            | 125              |





#### Niagara 2

- Improve performance by increasing threads supported per chip from 32 to 64
   - 8 cores \* 8 threads per core
- Floating-point unit for each core, not for each chip
- Hardware support for encryption standards AES, 3DES, and elliptical-curve cryptography
- Niagara 2 will add a number of 8x PCI Express interfaces directly into the chip in addition to integrated 10 Gigabit Ethernet XAUI interfaces and Gigabit Ethernet ports.
- Integrated memory controllers will shift support from DDR2 to FB-DIMMs and double the maximum amount of system memory.

Kevin Krewell "Sun's Niagara Begins CMT Flood -The Sun UltraSPARC T1 Processor Released" *Microprocessor Report*, January 3, 2006

#### And in Conclusion ...

- Caches contain all information on state of cached memory blocks
- Snooping cache over shared medium for smaller MP by invalidating other cached copies on write
- Sharing cached data ⇒ Coherence (values returned by a read), Consistency (when a written value will be returned by a read)
- Snooping and Directory Protocols similar; bus makes snooping easier because of broadcast (snooping ⇒ uniform memory access)
- Directory has extra data structure to keep track of state of all cache blocks
- Distributing directory ⇒ scalable shared address multiprocessor ⇒ Cache coherent, Non uniform memory access

#### Reading

- This lecture:
  - chapter 4: 4.4-4.10 rest of Multiprocessors and TLP
- Next lecture:
  - chapter 5: Memory Hierarchy Design

# Lecture 10 - Advanced Memory Hierarchy

Slides were used during lectures by David Patterson, Berkeley, spring 2006

#### Outline

- 11 Advanced Cache Optimizations
- Memory Technology and DRAM optimizations
- Virtual Machines
- Xen VM: Design and Performance
- AMD Opteron Memory Hierarchy
- Opteron Memory Performance vs. Pentium 4
- Conclusion



#### **Review: 6 Basic Cache Optimizations**

#### **Reducing hit time**

- 1. Giving Reads Priority over Writes
  - E.g., Read complete before earlier writes in write buffer
- 2. Avoiding Address Translation during Cache Indexing

#### Reducing Miss Penalty

- 3. Multilevel Caches

#### **Reducing Miss Rate**

- 4. Larger Block size (Compulsory misses)
- 5. Larger Cache size (Capacity misses)
- 6. Higher Associativity (Conflict misses)









#### 4. Increasing Cache Bandwidth by Pipelining

- Pipeline cache access to maintain bandwidth, but higher latency
- Instruction cache access pipeline stages:
  - 1: Pentium
  - 2: Pentium Pro through Pentium III
  - 4: Pentium 4
- ⇒ greater penalty on mispredicted branches
- ⇒ more clock cycles between the issue of the load and the use of the data

#### 5. Increasing Cache Bandwidth: Non-Blocking Caches

- <u>Non-blocking cache</u> or <u>lockup-free cache</u> allows data cache to continue to supply cache hits during a miss
  - requires F/E bits on registers or out-of-order execution
  - requires multi-bank memories
- "<u>hit under miss</u>" reduces the effective miss penalty by working during miss vs. ignoring CPU requests
- "<u>hit under multiple miss</u>" or "<u>miss under miss</u>" may further lower the effective miss penalty by overlapping multiple misses
  - Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses
  - Requires multiple memory banks (otherwise cannot support)
  - Pentium Pro allows 4 outstanding memory misses



#### 6. Increasing Cache Bandwidth via Multiple Banks

- Rather than treat the cache as a single monolithic block, divide into independent banks that can support simultaneous accesses
  - E.g., T1 ("Niagara") L2 has 4 banks
- Banking works best when accesses naturally spread themselves across banks ⇒ mapping of addresses to banks affects behavior of memory system
- Simple mapping that works well is "sequential interleaving"
  - Spread block addresses sequentially across banks
  - E.g., if there are 4 banks, bank 0 has all blocks whose address modulo 4 is 0; bank 1 has all blocks whose address modulo 4 is 1; ...
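
Sequential interleaving reduces to a modulo on the block address. A minimal sketch; the 64-byte block size and all names are assumptions made for illustration.

```c
#include <stdint.h>

#define NUM_BANKS   4    /* e.g., 4 banks as in the example above */
#define BLOCK_BYTES 64   /* assumed block size, for illustration only */

/* Block address = byte address with the block offset removed. */
uint64_t block_address(uint64_t byte_addr)
{
    return byte_addr / BLOCK_BYTES;
}

/* Sequential interleaving: consecutive blocks go to consecutive banks. */
unsigned bank_of(uint64_t byte_addr)
{
    return (unsigned)(block_address(byte_addr) % NUM_BANKS);
}

/* Position of the block inside its bank. */
uint64_t index_in_bank(uint64_t byte_addr)
{
    return block_address(byte_addr) / NUM_BANKS;
}
```

A sequential sweep through memory therefore touches banks 0, 1, 2, 3, 0, ... in turn, so consecutive block accesses can proceed in different banks at the same time.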



#### 8. Merging Write Buffer to Reduce Miss Penalty

- Write buffer to allow processor to continue while waiting to write to memory
- If buffer contains modified blocks, the addresses can be checked to see if address of new data matches the address of a valid write buffer entry
- If so, new data are combined with that entry
- Increases the effective block size of writes for a write-through cache when writes go to sequential words or bytes, since multiword writes are more efficient to memory
- The Sun T1 (Niagara) processor, among many others, uses write merging
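
A minimal software model of write merging, not a description of any particular processor's buffer: the sizes `ENTRIES` and `WORDS_PER_ENTRY` and all names are assumptions chosen for illustration. Each entry covers a group of sequential words; a new write first looks for an entry holding the same group and merges into it, and only otherwise takes a free entry.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ENTRIES         4   /* assumed buffer depth */
#define WORDS_PER_ENTRY 4   /* assumed: one entry covers 4 sequential words */

struct wb_entry {
    bool     in_use;
    uint32_t block;                    /* word address / WORDS_PER_ENTRY */
    bool     valid[WORDS_PER_ENTRY];   /* per-word valid bits */
    uint32_t data[WORDS_PER_ENTRY];
};

struct write_buffer {
    struct wb_entry e[ENTRIES];
};

void wb_init(struct write_buffer *wb)
{
    memset(wb, 0, sizeof *wb);
}

/* Insert one word write. Returns true if accepted, false if the buffer
 * is full (the processor would have to stall). */
bool wb_write(struct write_buffer *wb, uint32_t word_addr, uint32_t value)
{
    uint32_t block  = word_addr / WORDS_PER_ENTRY;
    uint32_t offset = word_addr % WORDS_PER_ENTRY;

    /* Merge: reuse an entry that already holds this block. */
    for (int i = 0; i < ENTRIES; i++) {
        if (wb->e[i].in_use && wb->e[i].block == block) {
            wb->e[i].valid[offset] = true;
            wb->e[i].data[offset]  = value;
            return true;
        }
    }
    /* No match: allocate a free entry. */
    for (int i = 0; i < ENTRIES; i++) {
        if (!wb->e[i].in_use) {
            wb->e[i].in_use        = true;
            wb->e[i].block         = block;
            wb->e[i].valid[offset] = true;
            wb->e[i].data[offset]  = value;
            return true;
        }
    }
    return false;
}

/* Number of entries currently occupied. */
int wb_used(const struct write_buffer *wb)
{
    int n = 0;
    for (int i = 0; i < ENTRIES; i++)
        n += wb->e[i].in_use ? 1 : 0;
    return n;
}
```

Four writes to word addresses 100 through 103 occupy a single entry in this model; without merging they would fill the whole buffer.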

#### 9. Reducing Misses by Compiler Optimizations

- McFarling [1989] reduced cache misses by 75% on 8KB direct mapped cache, 4 byte blocks <u>in software</u>
- Instructions
  - Reorder procedures in memory so as to reduce conflict misses
  - Profiling to look at conflicts (using tools they developed)

- Data
  - Merging Arrays: Improve spatial locality by single array of compound elements vs. 2 arrays
  - Loop Interchange: Change nesting of loops to access data in order stored in memory
  - Loop Fusion: Combine 2 independent loops that have same looping and some variables overlap
  - Blocking: Improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
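
Blocking is the least obvious of the four data transformations listed above. As a minimal sketch, assuming a square matrix multiply and a blocking factor that divides the matrix size (the names `MAT_N`, `BLK`, `matmul_naive` and `matmul_blocked` are illustrative, not from the slides):

```c
#define MAT_N 64   /* matrix dimension (illustrative) */
#define BLK   16   /* blocking factor; assumed to divide MAT_N */

/* Unblocked: each x[i][j] walks a full row of y and a full column of z. */
void matmul_naive(double x[MAT_N][MAT_N], double y[MAT_N][MAT_N],
                  double z[MAT_N][MAT_N])
{
    for (int i = 0; i < MAT_N; i++)
        for (int j = 0; j < MAT_N; j++) {
            double r = 0.0;
            for (int k = 0; k < MAT_N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked: works on BLK x BLK submatrices so the touched parts of y and z
 * are reused while they are still in the cache. x must be zeroed by the
 * caller, because each (jj, kk) block adds a partial sum into x[i][j]. */
void matmul_blocked(double x[MAT_N][MAT_N], double y[MAT_N][MAT_N],
                    double z[MAT_N][MAT_N])
{
    for (int jj = 0; jj < MAT_N; jj += BLK)
        for (int kk = 0; kk < MAT_N; kk += BLK)
            for (int i = 0; i < MAT_N; i++)
                for (int j = jj; j < jj + BLK; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + BLK; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```

Both functions compute the same product; only the order in which the elements of `y` and `z` are visited differs. A suitable value for `BLK` depends on the cache size.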

#### **Merging Arrays Example**

```c
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};

struct merge merged_array[SIZE];
```

Reducing conflicts between val & key; improve spatial locality

#### Loop Interchange Example

```c
/* Before: inner loop strides through memory, 100 words per step */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After: inner loop walks along a row, i.e. sequential addresses */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];
```

Sequential accesses instead of striding through memory every 100 words; improved spatial locality

#### Loop Fusion Example

2 misses per access to a & c vs. one miss per access; improve spatial locality
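
A sketch of the kind of code this comparison refers to, assuming four square arrays `a`, `b`, `c` and `d` in the style of the textbook's loop fusion example; the function names and the size `FUSE_N` are illustrative.

```c
#define FUSE_N 100   /* array dimension (illustrative) */

/* Before: two separate loop nests. a and c are each traversed twice, and
 * by the second nest their early elements may already have been evicted. */
void unfused(double a[FUSE_N][FUSE_N], double b[FUSE_N][FUSE_N],
             double c[FUSE_N][FUSE_N], double d[FUSE_N][FUSE_N])
{
    for (int i = 0; i < FUSE_N; i++)
        for (int j = 0; j < FUSE_N; j++)
            a[i][j] = 1.0 / b[i][j] * c[i][j];
    for (int i = 0; i < FUSE_N; i++)
        for (int j = 0; j < FUSE_N; j++)
            d[i][j] = a[i][j] + c[i][j];
}

/* After: one loop nest. a[i][j] and c[i][j] are reused immediately,
 * while they are still in the cache (or in registers). */
void fused(double a[FUSE_N][FUSE_N], double b[FUSE_N][FUSE_N],
           double c[FUSE_N][FUSE_N], double d[FUSE_N][FUSE_N])
{
    for (int i = 0; i < FUSE_N; i++)
        for (int j = 0; j < FUSE_N; j++) {
            a[i][j] = 1.0 / b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }
}
```

The two versions compute the same `a` and `d`; fusion is legal here because the second statement only reads the element of `a` that the first statement has just written.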













#### Compiler Optimization vs. **Memory Hierarchy Search**

- Compiler tries to figure out memory hierarchy optimizations
- New approach: "Auto-tuners" 1st run variations of program on computer to find best combinations of optimizations (blocking, padding, ...) and algorithms, then produce C code to be compiled for that computer
- "Auto-tuner" targeted to numerical method E.g., PHiPAC (BLAS), Atlas (BLAS), Sparsity (Sparse linear algebra), Spiral (DSP), FFT-W





| Technique                                        | Hit<br>Time | Band-<br>width | Miss<br>penalty | Miss<br>rate | HW cost/<br>complexity | Comment                                                            |
|--------------------------------------------------|-------------|----------------|-----------------|--------------|------------------------|--------------------------------------------------------------------|
| Small and simple caches                          | +           |                |                 | -            | 0                      | Trivial; widely used                                               |
| Way-predicting caches                            | +           |                |                 |              | 1                      | Used in Pentium 4                                                  |
| Trace caches                                     | +           |                |                 |              | 3                      | Used in Pentium 4                                                  |
| Pipelined cache access                           | -           | +              |                 |              | 1                      | Widely used                                                        |
| Nonblocking caches                               |             | +              | +               |              | 3                      | Widely used                                                        |
| Banked caches                                    |             | +              |                 |              | 1                      | Used in L2 of Opteron and<br>Niagara                               |
| Critical word first and early<br>restart         |             |                | +               |              | 2                      | Widely used                                                        |
| Merging write buffer                             |             |                | +               |              | 1                      | Widely used with write<br>through                                  |
| Compiler techniques to reduce<br>cache misses    |             |                |                 | +            | 0                      | Software is a challenge;<br>some computers have<br>compiler option |
| Hardware prefetching of<br>instructions and data |             |                | +               | +            | 2 instr.,<br>3 data    | Many prefetch instructions;<br>AMD Opteron prefetches<br>data      |
| Compiler-controlled<br>prefetching               |             |                | +               | +            | 3                      | Needs nonblocking cache; in<br>many CPUs                           |

#### Main Memory Background

- Performance of Main Memory:
  - Latency: Cache Miss Penalty
    - » Access Time: time between request and word arrives
    - » Cycle Time: time between requests
  - Bandwidth: I/O & Large Block Miss Penalty (L2)
- Main Memory is DRAM: Dynamic Random Access Memory
  - Dynamic since needs to be refreshed periodically (8 ms, 1% time)
  - Addresses divided into 2 halves (Memory as a 2D matrix):
    - » RAS or Row Access Strobe
    - » CAS or Column Access Strobe
- Cache uses SRAM: Static Random Access Memory
  - No refresh (6 transistors/bit vs. 1 transistor)
  - <u>Size</u>: DRAM/SRAM - <u>4-8</u>, <u>Cost/Cycle time</u>: SRAM/DRAM - <u>8-16</u>

#### **Main Memory Deep Background**

- "Out-of-Core", "In-Core," "Core Dump"?
- "Core memory"?
- Non-volatile, magnetic
- Lost to 4 Kbit DRAM (today using 512Mbit DRAM)
- Access time 750 ns, cycle time 1500-3000 ns







#### DRAM name based on Peak Chip Transfers / Sec, DIMM name based on Peak DIMM MBytes / Sec

| Standard | Clock Rate (MHz) | M transfers / second | DRAM Name | MBytes/s/DIMM | DIMM Name |
|----------|------------------|----------------------|-----------|---------------|-----------|
| DDR      | 133              | 266                  | DDR266    | 2128          | PC2100    |
| DDR      | 150              | 300                  | DDR300    | 2400          | PC2400    |
| DDR      | 200              | 400                  | DDR400    | 3200          | PC3200    |
| DDR2     | 266              | 533                  | DDR2-533  | 4264          | PC4300    |
| DDR2     | 333              | 667                  | DDR2-667  | 5336          | PC5300    |
| DDR2     | 400              | 800                  | DDR2-800  | 6400          | PC6400    |
| DDR3     | 533              | 1066                 | DDR3-1066 | 8528          | PC8500    |
| DDR3     | 666              | 1333                 | DDR3-1333 | 10664         | PC10700   |
| DDR3     | 800              | 1600                 | DDR3-1600 | 12800         | PC12800   |

- M transfers / second = Clock Rate x 2; MBytes/s/DIMM = M transfers / second x 8
- Margin notes: fastest for sale 4/06 (\$125/GB); fastest for sale 1/07 (\$400/GB)
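
The x 2 and x 8 relations in the table can be written out directly. A small sketch, assuming a 64-bit (8-byte) DIMM data path; the macro and function names are illustrative.

```c
#define TRANSFERS_PER_CLOCK 2   /* double data rate: both clock edges */
#define DIMM_BYTES_PER_XFER 8   /* assumed 64-bit wide DIMM data path */

/* Peak chip rate in M transfers/s from the clock rate in MHz. */
unsigned mtransfers_per_sec(unsigned clock_mhz)
{
    return clock_mhz * TRANSFERS_PER_CLOCK;
}

/* Peak DIMM bandwidth in MBytes/s from the chip transfer rate. */
unsigned dimm_mbytes_per_sec(unsigned mtransfers)
{
    return mtransfers * DIMM_BYTES_PER_XFER;
}
```

The tabulated clock rates are rounded (266 MHz stands for 266.67 MHz), so doubling a tabulated clock does not always reproduce the tabulated transfer rate exactly; multiplying the transfer rate by 8 does reproduce the MBytes/s column.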

#### **Need for Error Correction!**

- Motivation:
  - Failures/time proportional to number of bits!
  - As DRAM cells shrink, more vulnerable
- Went through period in which failure rate was low enough without error correction that people didn't do correction
  - DRAM banks too large now
  - Servers always corrected memory systems
- Basic idea: add redundancy through parity bits
  - Common configuration: Random error correction
    - » SEC-DED (single error correct, double error detect)
    - » One example: 64 data bits + 8 parity bits (11% overhead)
  - Really want to handle failures of physical components as well
    - » Organization is multiple DRAMs/DIMM, multiple DIMMs
    - » Want to recover from failed DRAM and failed DIMM!
    - » "Chip kill" handle failures width of single DRAM chip
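
The 64 + 8 configuration is consistent with the usual Hamming bound. As a sketch (the function names are illustrative): a single-error-correcting code over m data bits needs the smallest r with 2^r ≥ m + r + 1 check bits, and SEC-DED adds one overall parity bit on top of that.

```c
/* Smallest r such that 2^r >= data_bits + r + 1: the number of check
 * bits a Hamming code needs to correct any single-bit error. */
unsigned sec_check_bits(unsigned data_bits)
{
    unsigned r = 0;
    while ((1u << r) < data_bits + r + 1)
        r++;
    return r;
}

/* SEC-DED adds one overall parity bit so that double errors are
 * detected as well. */
unsigned secded_check_bits(unsigned data_bits)
{
    return sec_check_bits(data_bits) + 1;
}
```

For 64 data bits this gives 7 + 1 = 8 check bits, i.e. 8 of 72 stored bits, or about 11%.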

#### **Introduction to Virtual Machines**

- VMs developed in late 1960s
  - Remained important in mainframe computing over the years
  - Largely ignored in single user computers of 1980s and 1990s
- Recently regained popularity due to
  - increasing importance of isolation and security in modern systems,
  - failures in security and reliability of standard operating systems,
  - sharing of a single computer among many unrelated users,
  - and the dramatic increases in raw speed of processors, which makes the overhead of VMs more acceptable

#### What is a Virtual Machine (VM)?

- Broadest definition includes all emulation methods that provide a standard software interface, such as the Java VM
- "(Operating) System Virtual Machines" provide a complete system level environment at binary ISA
  - Here assume ISAs always match the native hardware ISA
  - E.g., IBM VM/370, VMware ESX Server, and Xen
- Present illusion that VM users have entire computer to themselves, including a copy of OS
- Single computer runs multiple VMs, and can support multiple, different OSes
  - On conventional platform, single OS "owns" all HW resources
  - With a VM, multiple OSes all share HW resources
- Underlying HW platform is called the host, and its resources are shared among the guest VMs

#### Virtual Machine Monitors (VMMs)

- Virtual machine monitor (VMM) or hypervisor is software that supports VMs
- VMM determines how to map virtual resources to physical resources
- Physical resource may be time-shared, partitioned, or emulated in software
- VMM is much smaller than a traditional OS; isolation portion of a VMM is ≈ 10,000 lines of code

#### VMM Overhead?

- Depends on the workload
- User-level processor-bound programs (e.g., SPEC) have zero virtualization overhead
  - Run at native speeds since OS rarely invoked
- I/O-intensive workloads ⇒ OS-intensive ⇒ execute many system calls and privileged instructions ⇒ can result in high virtualization overhead
  - For System VMs, goal of architecture and VMM is to run almost all instructions directly on native hardware
- If I/O-intensive workload is also I/O-bound ⇒ low processor utilization since waiting for I/O ⇒ processor virtualization can be hidden ⇒ low virtualization overhead

#### **Requirements of a Virtual Machine Monitor**

- A VM Monitor
  - Presents a SW interface to guest software,
  - Isolates state of guests from each other, and
  - Protects itself from guest software (including guest OSes)
- Guest software should behave on a VM exactly as if running on the native HW
  - Except for performance-related behavior or limitations of fixed resources shared by multiple VMs
- Guest software should not be able to change allocation of real system resources directly
- Hence, VMM must control ≈ everything even though guest VM and OS currently running is temporarily using them
  - Access to privileged state, Address translation, I/O, Exceptions and Interrupts, ...

#### **Requirements of a Virtual Machine Monitor**

- VMM must be at higher privilege level than guest VM, which generally runs in user mode
  - ⇒ Execution of privileged instructions handled by VMM
- E.g., Timer interrupt: VMM suspends currently running guest VM, saves its state, handles interrupt, determines which guest VM to run next, and then loads its state
  - Guest VMs that rely on timer interrupt provided with virtual timer and an emulated timer interrupt by VMM
- Requirements of system virtual machines are ≈ same as paged-virtual memory:
  - 1. At least 2 processor modes, system and user
  - 2. Privileged subset of instructions available only in system mode, trap if executed in user mode
    - All system resources controllable only via these instructions

#### **ISA Support for Virtual Machines**

- If plan for VM during design of ISA, easy to reduce instructions executed by VMM, speed to emulate
  - ISA is <u>virtualizable</u> if can execute VM directly on real machine while letting VMM retain ultimate control of CPU: "<u>direct execution</u>"
  - Since VMs have been considered for desktop/PC server apps only recently, most ISAs were created ignoring virtualization, including 80x86 and most RISC architectures
- VMM must ensure that guest system only interacts with virtual resources ⇒ conventional guest OS runs as user mode program on top of VMM
  - If guest OS accesses or modifies information related to HW resources via a privileged instruction (e.g., reading or writing the page table pointer), it will trap to VMM
- If not, VMM must intercept instruction and support a virtual version of sensitive information as guest OS expects

#### Impact of VMs on Virtual Memory

- Virtualization of virtual memory if each guest OS in every VM manages its own set of page tables?
- VMM separates real and physical memory
  - Makes real memory a separate, intermediate level between virtual memory and physical memory
  - Some use the terms virtual memory, physical memory, and machine memory to name the 3 levels
  - Guest OS maps virtual memory to real memory via its page tables, and VMM page tables map real memory to physical memory
- VMM maintains a shadow page table that maps directly from the guest virtual address space to the physical address space of HW
  - Rather than pay extra level of indirection on every memory access
  - VMM must trap any attempt by guest OS to change its page table or to access the page table pointer
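
The shadow page table is the composition of the two mappings described above. A toy sketch with single-level tables indexed by page number; the table size, the `INVALID` marker and all function names are assumptions made for illustration, and real page tables are multi-level and carry protection bits.

```c
#include <stdint.h>

#define PAGES   8            /* tiny address spaces, for illustration */
#define INVALID UINT32_MAX   /* marks an unmapped page */

/* guest_pt: guest virtual page -> real page     (kept by the guest OS)
 * vmm_pt:   real page          -> physical page (kept by the VMM)
 * shadow:   guest virtual page -> physical page (what the HW walks) */
void build_shadow(const uint32_t guest_pt[PAGES],
                  const uint32_t vmm_pt[PAGES],
                  uint32_t shadow[PAGES])
{
    for (int v = 0; v < PAGES; v++) {
        uint32_t real = guest_pt[v];
        shadow[v] = (real >= PAGES) ? INVALID : vmm_pt[real];
    }
}

/* Called by the VMM after it traps a guest write to the guest page table:
 * apply the guest's update, then refresh the one affected shadow entry. */
void on_guest_pt_write(uint32_t guest_pt[PAGES],
                       const uint32_t vmm_pt[PAGES],
                       uint32_t shadow[PAGES],
                       int vpage, uint32_t new_real)
{
    guest_pt[vpage] = new_real;
    shadow[vpage] = (new_real >= PAGES) ? INVALID : vmm_pt[new_real];
}
```

Because the hardware walks only `shadow`, a memory access pays for one translation; the price is that every guest update to its page table has to go through `on_guest_pt_write`, which is why the VMM must trap those updates.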

#### ISA Support for VMs & Virtual Memory

- IBM 370 architecture added additional level of indirection that is managed by the VMM
  - Guest OS keeps its page tables as before, so the shadow pages are unnecessary
  - (AMD Pacifica proposes same improvement for 80x86)
- To virtualize software TLB, VMM manages the real TLB and has a copy of the contents of the TLB of each guest VM
  - Any instruction that accesses the TLB must trap
  - TLBs with Process ID tags support a mix of entries from different VMs and the VMM, thereby avoiding flushing of the TLB on a VM switch

#### Impact of I/O on Virtual Machines

- I/O most difficult part of virtualization
  - Increasing number of I/O devices attached to the computer
  - Increasing diversity of I/O device types
  - Sharing of a real device among multiple VMs
  - Supporting many device drivers that are required, especially if different guest OSes are supported on same VM system
- Give each VM generic versions of each type of I/O device driver, and let VMM handle real I/O
- Method for mapping virtual to physical I/O device depends on the type of device:
  - Disks partitioned by VMM to create virtual disks for guest VMs
  - Network interfaces shared between VMs in short time slices, and VMM tracks messages for virtual network addresses to ensure that guest VMs only receive their messages

#### Example: Xen VM

- Xen: Open-source System VMM for 80x86 ISA
  - Project started at University of Cambridge, GNU license model
- Original vision of VM is running unmodified OS
  - Significant wasted effort just to keep guest OS happy
- "paravirtualization": small modifications to guest OS to simplify virtualization

Three examples of paravirtualization in Xen:

- 1. To avoid flushing TLB when invoke VMM, Xen mapped into upper 64 MB of address space of each VM
- 2. Guest OS allowed to allocate pages, just check that didn't violate protection restrictions
- 3. To protect the guest OS from user programs in VM, Xen takes advantage of 4 protection levels available in 80x86
  - Most OSes for 80x86 keep everything at privilege levels 0 or at 3
  - Xen VMM runs at the highest privilege level (0)
  - Guest OS runs at the next level (1)
  - Applications run at the lowest privilege level (3)



#### Xen and I/O

- To simplify I/O, privileged VMs assigned to each hardware I/O device: "driver domains"
  - Xen Jargon: "domains" = Virtual Machines
- Driver domains run physical device drivers, although interrupts still handled by VMM before being sent to appropriate driver domain
- Regular VMs ("guest domains") run simple virtual device drivers that communicate with physical device drivers in driver domains over a channel to access physical I/O hardware
- Data sent between guest and driver domains by page remapping









#### Protection and Instruction Set Architecture

- Example Problem: 80x86 POPF instruction loads flag registers from top of stack in memory
  - One such flag is Interrupt Enable (IE)
  - In system mode, POPF changes IE
  - In user mode, POPF simply changes all flags <u>except</u> IE
  - Problem: guest OS runs in user mode inside a VM, so it expects to see a changed IE, but it won't
- Historically, IBM mainframe HW and VMM took 3 steps:
  - Reduce cost of processor virtualization
    - Intel/AMD proposed ISA changes to reduce this cost
  - Reduce interrupt cost by steering interrupts to proper VM directly without invoking VMM
  - [...] and 3. not yet addressed by Intel/AMD; in the future?



18 instructions cause problems for virtualization:

- 1. Read control registers in user mode that reveal that the guest operating system is running in a virtual machine (such as POPF), and
- 2. Check protection as required by the segmented architecture but assume that the operating system is running at the highest privilege level

- Virtual memory: 80x86 TLBs do not support process ID tags ⇒ more expensive for VMM and guest OSes to share the TLB
  - each address space change typically requires a TLB flush

#### Intel/AMD address 80x86 VM Challenges

- Goal is direct execution of VMs on 80x86
- Intel's VT-x
  - A new execution mode for running VMs
  - An architected definition of the VM state
  - Instructions to swap VMs rapidly
  - Large set of parameters to select the circumstances where a VMM must be invoked
  - VT-x adds 11 new instructions to 80x86
- Xen 3.0 plan proposes to use VT-x to run Windows on Xen
- AMD's Pacifica makes similar proposals
  - Plus indirection level in page table like IBM VM 370
- Ironic adding a new mode
  - If OS starts using mode in kernel, new mode would cause performance problems for VMM since ≈ 100 times too slow

#### Outline

- 11 Advanced Cache Optimizations
- Memory Technology and DRAM optimizations
- Virtual Machines
- Xen VM: Design and Performance
- AMD Opteron Memory Hierarchy
- Opteron Memory Performance vs. Pentium 4
- Conclusion

#### **AMD Opteron Memory Hierarchy**

- 12-stage integer pipeline yields a maximum clock rate of 2.8 GHz and fastest memory PC3200 DDR SDRAM
- 48-bit virtual and 40-bit physical addresses
- I and D cache: 64 KB, 2-way set associative, 64-B block, LRU
- L2 cache: 1 MB, 16-way, 64-B block, pseudo LRU
- Data and L2 caches use write back, write allocate
- L1 caches are virtually indexed and physically tagged
- L1 I TLB and L1 D TLB: fully associative, 40 entries
  - 32 entries for 4 KB pages and 8 for 2 MB or 4 MB pages
- L2 I TLB and L2 D TLB: 4-way, 512 entries of 4 KB pages
- Memory controller allows up to 10 cache misses
  - 8 from D cache and 2 from I cache

#### **Opteron Memory Hierarchy Performance**

- For SPEC2000
  - I cache misses per instruction are 0.01% to 0.09%
  - D cache misses per instruction are 1.34% to 1.43%
  - L2 cache misses per instruction are 0.23% to 0.36%
- Commercial benchmark ("TPC-C-like")
  - I cache misses per instruction are 1.83% (100X!)
  - D cache misses per instruction are 1.39% (≈ same)
  - L2 cache misses per instruction are 0.62% (2X to 3X)
- How compare to ideal CPI of 0.33?





#### Pentium 4 vs. Opteron Memory Hierarchy

| CPU               | Pentium 4 (3.2 GHz*)                                | Opteron (2.8 GHz*)                                  |
|-------------------|-----------------------------------------------------|-----------------------------------------------------|
| Instruction Cache | Trace Cache (8K micro-ops)                          | 2-way associative, 64 KB, 64B block                 |
| Data Cache        | 8-way associative, 16 KB, 64B block, inclusive in L2 | 2-way associative, 64 KB, 64B block, exclusive to L2 |
| L2 cache          | 8-way associative, 2 MB, 128B block                 | 16-way associative, 1 MB, 64B block                 |
| Prefetch          | 8 streams to L2                                     | 1 stream to L2                                      |
| Memory            | 200 MHz x 64 bits                                   | 200 MHz x 128 bits                                  |

*Clock rate for this comparison in 2005; faster versions exist





#### And in Conclusion [1/2] ...

- Memory wall inspires optimizations since so much performance lost there
  - Reducing hit time: Small and simple caches, Way prediction, Trace caches
  - Increasing cache bandwidth: Pipelined caches, Multibanked caches, Nonblocking caches
  - Reducing Miss Penalty: Critical word first, Merging write buffers
  - Reducing Miss Rate: Compiler optimizations
  - Reducing miss penalty or miss rate via parallelism: Hardware prefetching, Compiler prefetching
- "Auto-tuners" search replacing static compilation to explore optimization space?
- DRAM Continuing Bandwidth innovations: Fast page mode, Synchronous, Double Data Rate

#### And in Conclusion [2/2] ...

- VM Monitor presents a SW interface to guest software, isolates state of guests, and protects itself from guest software (including guest OSes)
- Virtual Machine Revival
  - Overcome security flaws of large OSes
  - Manage Software, Manage Hardware
  - Processor performance no longer highest priority
- Virtualization challenges for processor, virtual memory, and I/O
- Paravirtualization to cope with those difficulties
- Xen as example VMM using paravirtualization
  - 2005 performance on non-I/O bound, I/O intensive apps: 80% of native Linux without driver VM, 34% with driver VM
- Opteron memory hierarchy still critical to performance

#### Reading

- This lecture:
  - chapter 5: Memory Hierarchy Design
- Next lecture:
  - chapter 6: Storage Systems

# Lecture 11 - Storage

Slides were used during lectures by David Patterson, Berkeley, spring 2006

#### **Case for Storage**

- Shift in focus from computation to communication and storage of information
  - E.g., Cray Research/Thinking Machines vs. Google/Yahoo
  - "The Computing Revolution" (1960s to 1980s) ⇒ "The Information Age" (1990 to today)
- Storage emphasizes reliability and scalability as well as cost-performance
- What is "Software king" that determines which HW features actually used?
  - Operating System for storage
  - Compiler for processor
- Also has own performance theory (queuing theory) that balances throughput vs. response time

#### Outline

- Magnetic Disks
- RAID
- Advanced Dependability/Reliability/Availability
- I/O Benchmarks, Performance and Dependability
- Intro to Queuing Theory
- The End

#### **Disk Organization**







- Continued advance in capacity (60%/yr) and bandwidth (40%/yr)
- Slow improvement in seek, rotation (8%/yr)

| Year | Sequentially | Randomly (1 sector/seek) |
|------|--------------|--------------------------|
| 1990 | 4 minutes    | 6 hours                  |
| 2000 | 12 minutes   | 1 week(!)                |
| 2006 | 56 minutes   | 3 weeks (SCSI)           |



#### Replace Small Number of Large Disks with Large Number of Small Disks! (1988 Disks)

|           | IBM 3390K  | IBM 3.5" 0061 | x70         |    |
|-----------|------------|---------------|-------------|----|
| Capacity  | 20 GBytes  | 320 MBytes    | 23 GBytes   |    |
| Volume    | 97 cu. ft. | 0.1 cu. ft.   | 11 cu. ft.  | 9X |
| Power     | 3 KW       | 11 W          | 1 KW        | 3X |
| Data Rate | 15 MB/s    | 1.5 MB/s      | 120 MB/s    | 8X |
| I/O Rate  | 600 I/Os/s | 55 I/Os/s     | 3900 I/Os/s | 6X |
| MTTF      | 250 KHrs   | 50 KHrs       | ??? Hrs     |    |
| Cost      | \$250K     | \$2K          | \$150K      |    |

Disk Arrays have potential for large data and I/O rates, high MB per cu. ft., high MB per KW, but what about reliability?

#### **Array Reliability**

Reliability of N disks = Reliability of 1 Disk ÷ N

50,000 Hours ÷ 70 disks = 700 hours

Disk system MTTF: Drops from 6 years to 1 month!

Arrays (without redundancy) too unreliable to be useful!

Hot spares support reconstruction in parallel with access: very high media availability can be achieved
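The MTTF arithmetic above can be checked with a short sketch. It assumes independent disk failures at a constant rate, which is what makes the simple division valid; the function name is illustrative and not from the slides.

```python
HOURS_PER_YEAR = 24 * 365

def array_mttf_hours(disk_mttf_hours: float, n_disks: int) -> float:
    """MTTF of an array without redundancy: one disk failure fails the array."""
    return disk_mttf_hours / n_disks

single_disk = 50_000                          # hours, from the slide
array = array_mttf_hours(single_disk, 70)     # 70 small disks, no redundancy

print(f"single disk:   {single_disk / HOURS_PER_YEAR:.1f} years")
print(f"70-disk array: {array:.0f} hours, about {array / 24:.0f} days")
```

The slide rounds the array result to 700 hours, which is roughly one month.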

#### **Redundant Arrays of (Inexpensive) Disks**

- Files are "striped" across multiple disks
- Redundancy yields high data availability
  - <u>Availability</u>: service still provided to user, even if some components failed
- Disks will still fail
- Contents reconstructed from data redundantly stored in the array
  - $\Rightarrow$  Capacity penalty to store redundant info
  - $\Rightarrow$  Bandwidth penalty to update redundant info



• (RAID 2 not interesting, so skip)



#### RAID 3

- Sum computed across recovery group to protect against hard disk failures, stored in P disk
- Logically, a single high capacity, high transfer rate disk: good for large transfers
- Wider arrays reduce capacity costs, but decrease availability
- 33% capacity cost for parity if 3 data disks and 1 parity disk
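A minimal sketch of the parity idea, assuming the "sum" is a bytewise XOR (sum modulo 2) across the recovery group. The block contents and the helper name `parity` are made up for illustration.

```python
from functools import reduce

def parity(blocks: list[bytes]) -> bytes:
    """Bytewise XOR (even parity) across equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"disk0 data", b"disk1 data", b"disk2 data"]   # 3 data disks
p_disk = parity(data)                                   # contents of the P disk

# Disk 1 fails: rebuild it from the surviving data disks plus the P disk.
rebuilt = parity([data[0], data[2], p_disk])
assert rebuilt == data[1]

capacity_cost = 1 / len(data)   # 1 parity disk per 3 data disks, about 33%
```

The same XOR that produces the P disk also rebuilds any single lost disk, because XOR-ing a value in twice cancels it out.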

#### **Inspiration for RAID 4**

- RAID 3 relies on parity disk to discover errors on Read
- But every sector has an error detection field
- To catch errors on read, rely on error detection field vs. the parity disk
- Allows independent reads to different disks simultaneously









#### **RAID 6: Recovering from 2 failures**

- Why > 1 failure recovery?
  - operator accidentally replaces the wrong disk during a failure
  - since disk bandwidth is growing more slowly than disk capacity, the MTT Repair of a disk in a RAID system is increasing ⇒ increases the chances of a 2nd failure during repair since it takes longer
  - reading much more data during reconstruction means increasing the chance of an uncorrectable media failure, which would result in data loss

#### **RAID 6: Recovering from 2 failures**

- Network Appliance's row-diagonal parity or RAID-DP
- Like the standard RAID schemes, it uses redundant space based on parity calculation per stripe
- Since it is protecting against a double failure, it adds two check blocks per stripe of data.
- If p+1 disks total, p-1 disks have data; assume p=5
- Row parity disk is just like in RAID 4
  - Even parity across the other 4 data blocks in its stripe
- Each block of the diagonal parity disk contains the even parity of the blocks in the same diagonal

#### Example p = 5

- Row diagonal parity starts by recovering one of the 4 blocks on the failed disk using diagonal parity
  - Since each diagonal misses one disk, and all diagonals miss a different disk, 2 diagonals are only missing 1 block
- Once the data for those blocks is recovered, then the standard RAID recovery scheme can be used to recover two more blocks in the standard RAID 4 stripes
- Process continues until two failed disks are restored

| Data Disk 0 | Data Disk 1 | Data Disk 2 | Data Disk 3 | Row Parity | Diagonal Parity |
|-------------|-------------|-------------|-------------|------------|-----------------|
| 0           | 1           | 2           | 3           | 4          | 0               |
| 1           | 2           | 3           | 4           | 0          | 1               |
| 2           | 3           | 4           | 0           | 1          | 2               |
| 3           | 4           | 0           | 1           | 2          | 3               |
| 4           | 0           | 1           | 2           | 3          | 4               |
| 0           | 1           | 2           | 3           | 4          | 0               |
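The recovery procedure can be sketched in a few lines. This is an illustrative model, not Network Appliance's implementation: blocks are small integers, one stripe group has p-1 = 4 rows, the block in row r of disk d is assumed to lie on diagonal (r + d) mod p, and diagonal p-1 is assumed not to be stored. All function names are hypothetical.

```python
import random
from functools import reduce
from operator import xor

P = 5          # prime; disks 0..3 hold data, disk 4 holds row parity
ROWS = P - 1   # blocks per disk in one stripe group

def diagonal_cells(k):
    """(row, disk) positions on diagonal k, over data and row-parity disks."""
    return [(r, (k - r) % P) for r in range(ROWS)]

def encode(data):
    """data[r] holds the P-1 data blocks of row r; returns (grid, diag)."""
    grid = [row + [reduce(xor, row)] for row in data]   # append row parity
    diag = [reduce(xor, (grid[r][d] for r, d in diagonal_cells(k)))
            for k in range(ROWS)]                        # diagonal P-1 not stored
    return grid, diag

def recover(grid, diag, failed):
    """Rebuild all blocks of the two failed disks in place and return grid."""
    missing = {(r, d) for r in range(ROWS) for d in failed}
    while missing:
        before = len(missing)
        # Each group is (cells, XOR that all cells must add up to).
        groups = [(diagonal_cells(k), diag[k]) for k in range(ROWS)]
        groups += [([(r, d) for d in range(P)], 0) for r in range(ROWS)]
        for cells, check in groups:
            lost = [c for c in cells if c in missing]
            if len(lost) == 1:                 # exactly one unknown: solvable
                value = check
                for r, d in cells:
                    if (r, d) not in missing:
                        value ^= grid[r][d]
                r, d = lost[0]
                grid[r][d] = value
                missing.remove((r, d))
        if len(missing) == before:
            raise ValueError("no diagonal or row with a single missing block")
    return grid

random.seed(1)
data = [[random.randrange(256) for _ in range(P - 1)] for _ in range(ROWS)]
grid, diag = encode(data)

damaged = [row[:] for row in grid]
for r in range(ROWS):
    damaged[r][1] = damaged[r][3] = None       # disks 1 and 3 fail
assert recover(damaged, diag, failed={1, 3}) == grid
```

The loop alternates between diagonals and rows exactly as the bullets describe: a diagonal with one missing block is solved first, which leaves a row with one missing block, and so on until both disks are rebuilt.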




#### **Definitions**

- Examples on why precise definitions so important for reliability
  - Is a programming mistake a fault, error, or failure?
  - Are we talking about the time it was designed or the time the program is run?
  - If the running program doesn't exercise the mistake, is it still a fault/error/failure?
  - If an alpha particle hits a DRAM memory cell, is it a fault/error/failure if it doesn't change the value?
  - Is it a fault/error/failure if the memory doesn't access the changed bit?
  - Did a fault/error/failure still occur if the memory had error correction and delivered the corrected value to the CPU?

#### International Federation for Information Processing (IFIP) Standard terminology

- Computer system <u>dependability</u>: quality of delivered service such that reliance can be placed on service
- <u>Service</u> is observed <u>actual behavior</u> as perceived by other system(s) interacting with this system's users
- Each module has ideal <u>specified behavior</u>, where service specification is agreed description of expected behavior
- A system <u>failure</u> occurs when the actual behavior deviates from the specified behavior
- Failure occurred because of an <u>error</u>, a defect in module
- The cause of an error is a <u>fault</u>
- When a fault occurs it creates a <u>latent error</u>, which becomes <u>effective</u> when it is activated
- When error actually affects the delivered service, a failure occurs (time from error to failure is <u>error latency</u>)

#### Fault v. (Latent) Error v. Failure

- An error is manifestation *in the system* of a fault, a failure is manifestation *on the service* of an error
- If an alpha particle hits a DRAM memory cell, is it a fault/error/failure if it doesn't change the value?
- Is it a fault/error/failure if the memory doesn't access the changed bit?
- Did a fault/error/failure still occur if the memory had error correction and delivered the corrected value to the CPU?
  - An alpha particle hitting a DRAM can be a fault
  - If it changes the memory, it creates an error
  - Error remains latent until affected memory word is read
  - If the affected word error affects the delivered service, a failure occurs

#### **Fault Categories**

- 1. Hardware faults: Devices that fail, such as an alpha particle hitting a memory cell
- 2. Design faults: Faults in software (usually) and hardware design (occasionally)
- 3. Operation faults: Mistakes by operations and maintenance personnel
- 4. Environmental faults: Fire, flood, earthquake, power failure, and sabotage

#### Also by duration:

- 1. <u>Transient faults</u> exist for limited time and not recurring
- 2. <u>Intermittent faults</u> cause a system to oscillate between faulty and fault-free operation
- 3. <u>Permanent faults</u> do not correct themselves over time

#### Fault Tolerance vs Disaster Tolerance

- Fault-Tolerance (or more properly, Error-Tolerance): mask local faults (prevent errors from becoming failures)
  - RAID disks
  - Uninterruptible Power Supplies
  - Cluster Failover
- Disaster Tolerance: masks site errors (prevent site errors from causing service failures)
  - Protects against fire, flood, sabotage, ...
  - Redundant system and service at remote site
  - Use design diversity

From Jim Gray's "Talk at UC Berkeley on Fault Tolerance" 11/9/00





#### HW Failures in Real Systems: Tertiary Disks

A cluster of 20 PCs in seven 7-foot high, 19-inch wide racks with 368 8.4 GB, 7200 RPM, 3.5-inch IBM disks. The PCs are P6-200MHz with 96 MB of DRAM each. They run FreeBSD 3.0 and the hosts are connected via switched 100 Mbit/second Ethernet

| Component                     | Total in System | Total Failed | % Failed |
|-------------------------------|-----------------|--------------|----------|
| SCSI Controller               | 44              | 1            | 2.3%     |
| SCSI Cable                    | 39              | 1            | 2.6%     |
| SCSI Disk                     | 368             | 7            | 1.9%     |
| IDE Disk                      | 24              | 6            | 25.0%    |
| Disk Enclosure - Backplane    | 46              | 13           | 28.3%    |
| Disk Enclosure - Power Supply | 92              | 3            | 3.3%     |
| Ethernet Controller           | 20              | 1            | 5.0%     |
| Ethernet Switch               | 2               | 1            | 50.0%    |
| Ethernet Cable                | 42              | 1            | 2.3%     |
| CPU/Motherboard               | 20              | 0            | 0%       |

#### Does Hardware Fail Fast? 4 of 384 Disks that failed in Tertiary Disk

| Messages in system log for failed disk                                                  | No. log<br>msgs | Duration<br>(hours) |
|-----------------------------------------------------------------------------------------|-----------------|---------------------|
| Hardware Failure (Peripheral device write fault [for] Field Replaceable Unit)           | 1763            | 186                 |
| <b>Not Ready</b> (Diagnostic failure: ASCQ = Component ID [of] Field Replaceable Unit)  | 1460            | 90                  |
| Recovered Error (Failure Prediction Threshold<br>Exceeded [for] Field Replaceable Unit) | 1313            | 5                   |
| Recovered Error (Failure Prediction Threshold<br>Exceeded [for] Field Replaceable Unit) | 431             | 17                  |

#### High Availability System Classes

Goal: Build Class 6 Systems

| System Type            | Unavailable (min/year) | Availability | Availability Class |
|------------------------|------------------------|--------------|--------------------|
| Unmanaged              | 50,000                 | 90.%         | 1                  |
| Managed                | 5,000                  | 99.%         | 2                  |
| Well Managed           | 500                    | 99.9%        | 3                  |
| Fault Tolerant         | 50                     | 99.99%       | 4                  |
| High-Availability      | 5                      | 99.999%      | 5                  |
| Very-High-Availability | .5                     | 99.9999%     | 6                  |
| Ultra-Availability     | .05                    | 99.99999%    | 7                  |

#### **UnAvailability = MTTR/MTBF**

can cut it in 1/2 by halving MTTR or doubling MTBF

From Jim Gray's "Talk at UC Berkeley on Fault Tolerance " 11/9/00
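A small sketch relating availability classes to downtime and to the MTTR/MTBF ratio. It assumes class n means n nines of availability and a 365-day year; the table above rounds the resulting minutes. The MTTR and MTBF values in the usage lines are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

def unavailable_minutes(availability_class: int) -> float:
    """Class n means n nines of availability, i.e. unavailability 10**-n."""
    return MINUTES_PER_YEAR * 10 ** -availability_class

def unavailability(mttr_hours: float, mtbf_hours: float) -> float:
    """The slide's approximation: UnAvailability = MTTR / MTBF."""
    return mttr_hours / mtbf_hours

for n in range(1, 8):
    print(f"class {n}: {unavailable_minutes(n):>10.2f} min/year")

base = unavailability(mttr_hours=1.0, mtbf_hours=1000.0)         # hypothetical values
faster_repair = unavailability(mttr_hours=0.5, mtbf_hours=1000.0)
fewer_failures = unavailability(mttr_hours=1.0, mtbf_hours=2000.0)
```

Both `faster_repair` and `fewer_failures` come out at half of `base`, which is the point of the formula: repair speed matters as much as failure rate.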

#### How Realistic is "5 Nines"?

- HP claims HP-9000 server HW and HP-UX OS can deliver 99.999% availability guarantee "in certain pre-defined, pre-tested customer environments"
  - Application faults?
  - Operator faults?
  - Environmental faults?
- Collocation sites (lots of computers in 1 building on Internet) have
  - 1 network outage per year (~1 day)
  - 1 power failure per year (~1 day)
- Microsoft Network unavailable recently for a day due to problem in Domain Name Server: if only outage per year, 99.7% or 2 Nines

#### Outline

- Magnetic Disks
- RAID
- Advanced Dependability/Reliability/Availability
- I/O Benchmarks, Performance and Dependability
- Intro to Queuing Theory
- The End





#### I/O Benchmarks: Transaction Processing

- Early 1980s great interest in OLTP
  - Expecting demand for high TPS (e.g., ATM machines, credit cards)
  - Tandem's success implied medium range OLTP expands
  - Each vendor picked own conditions for TPS claims, report only CPU times with widely different I/O
  - Conflicting claims led to disbelief of all benchmarks ⇒ chaos
- 1984 Jim Gray (Tandem) distributed paper to Tandem + 19 in other companies to propose standard benchmark
- Published "A measure of transaction processing power," Datamation, 1985 by Anonymous et al.
  - To indicate that this was effort of large group
  - To avoid delays of legal department of each author's firm
  - Still get mail at Tandem to author "Anonymous"
- Led to Transaction Processing Council in 1988 (www.tpc.org)

#### I/O Benchmarks: TP1 by Anon et al.

- DebitCredit Scalability: size of account, branch, teller, history function of throughput

| TPS    | Number of ATMs | Account-file size |
|--------|----------------|-------------------|
| 10     | 1,000          | 0.1 GB            |
| 100    | 10,000         | 1.0 GB            |
| 1,000  | 100,000        | 10.0 GB           |
| 10,000 | 1,000,000      | 100.0 GB          |

  - Each input TPS ⇒ 100,000 account records, 10 branches, 100 ATMs
  - Accounts must grow since a person is not likely to use the bank more frequently just because the bank has a faster computer!
- Response time: 95% transactions take ≤ 1 second
- Report price (initial purchase price + 5 year maintenance = cost of ownership)
- Hire auditor to certify results
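The scaling rule can be written out as a small calculation. The per-TPS factors come from the slide; the 100 bytes per account record is an assumption inferred from the table (0.1 GB for 1,000,000 records), and the class and function names are illustrative.

```python
from dataclasses import dataclass

BYTES_PER_ACCOUNT = 100   # assumption: 0.1 GB / 1,000,000 records in the table

@dataclass
class DebitCreditConfig:
    tps: int
    atms: int
    branches: int
    account_records: int
    account_file_gb: float

def scale(tps: int) -> DebitCreditConfig:
    """Each TPS of claimed throughput adds 100,000 accounts, 10 branches, 100 ATMs."""
    records = 100_000 * tps
    return DebitCreditConfig(
        tps=tps,
        atms=100 * tps,
        branches=10 * tps,
        account_records=records,
        account_file_gb=records * BYTES_PER_ACCOUNT / 1e9,
    )

for tps in (10, 100, 1_000, 10_000):
    print(scale(tps))
```

A vendor claiming ten times the throughput therefore has to provision ten times the database, so a fast CPU cannot win by running against a tiny data set.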

#### **Unusual Characteristics of TPC**

- Price is included in the benchmarks
  - cost of HW, SW, and 5-year maintenance agreements included ⇒ price-performance as well as performance
- The data set generally must scale in size as the throughput increases
  - trying to model real systems, demand on system and size of the data stored in it increase together
- The benchmark results are audited
  - Must be approved by certified TPC auditor, who enforces TPC rules ⇒ only fair results are submitted
- Throughput is the performance metric but response times are limited
  - e.g., TPC-C: 90% transaction response times < 5 seconds
- An independent organization maintains the benchmarks
  - COO ballots on changes, meetings, to settle disputes

#### **TPC Benchmark History/Status**

| Benchmark                                        | Data Size (GB)  | Performance<br>Metric                     | 1st Results |
|--------------------------------------------------|-----------------|-------------------------------------------|-------------|
| A: Debit Credit (retired)                        | 0.1 to 10       | transactions/s                            | Jul-90      |
| B: Batch Debit Credit<br>(retired)               | 0.1 to 10       | transactions/s                            | Jul-91      |
| C: Complex Query<br>OLTP                         | 100 to 3000<br>(min. 07 * tpm) | new order<br>trans/min (tpm)   | Sep-92      |
| D: Decision Support<br>(retired)                 | 100, 300, 1000  | queries/hour                              | Dec-95      |
| H: Ad hoc decision<br>support                    | 100, 300, 1000  | queries/hour                              | Oct-99      |
| R: Business reporting decision support (retired) | 1000            | queries/hour                              | Aug-99      |
| W: Transactional web                             | ~ 50, 500       | web inter-<br>actions/sec.                | Jul-00      |
| App: app. server & web<br>services               |                 | Web Service<br>Interactions/sec<br>(SIPS) | Jun-05      |

#### I/O Benchmarks via SPEC

- SFS 3.0 Attempt by NFS companies to agree on standard benchmark
  - Run on multiple clients & networks (to prevent bottlenecks)
  - Same caching policy in all clients
  - Reads: 85% full block & 15% partial blocks
  - Writes: 50% full block & 50% partial blocks
  - Average response time: 40 ms
  - Scaling: for every 100 NFS ops/sec, increase capacity 1GB
- Results: plot of server load (throughput) vs. response time & number of users
  - Assumes: 1 user => 10 NFS ops/sec
  - 3.0 for NFS 3.0
- Added SPECMail (mailserver), SPECWeb (webserver) benchmarks



#### Availability benchmark methodology

- Goal: quantify variation in QoS metrics as events occur that affect system availability
- Leverage existing performance benchmarks
  - to generate fair workloads
  - to measure & trace quality of service metrics
- Use fault injection to compromise system
  - hardware faults (disk, memory, network, power)
  - software faults (corrupt input, driver error returns)
  - maintenance events (repairs, SW/HW upgrades)
- Examine single-fault and multi-fault workloads
  - the availability analogues of performance micro- and macro-benchmarks



#### **Reconstruction policy (2)**

- Linux: favors performance over data availability
  - automatically-initiated reconstruction, idle bandwidth
  - virtually no performance impact on application
  - very long window of vulnerability (>1hr for 3GB RAID)
- Solaris: favors data availability over app. perf.
  - automatically-initiated reconstruction at high BW
  - as much as 34% drop in application performance
  - short window of vulnerability (10 minutes for 3GB)
- Windows: favors neither!
  - manually-initiated reconstruction at moderate BW
  - as much as 18% app. performance drop
  - somewhat short window of vulnerability (23 min/3GB)



#### **Deriving Little's Law**

- Time<sub>observe</sub> = elapsed time that observe a system
- Number<sub>task</sub> = number of (overlapping) tasks during Time<sub>observe</sub>
- Time<sub>accumulated</sub> = sum of elapsed times for each task

#### Then

- Mean number tasks in system = Time<sub>accumulated</sub> / Time<sub>observe</sub>
- Mean response time = Time<sub>accumulated</sub> / Number<sub>task</sub>
- Arrival Rate = Number<sub>task</sub> / Time<sub>observe</sub>

#### Factoring RHS of 1st equation

- Time<sub>accumulated</sub> / Time<sub>observe</sub> = (Time<sub>accumulated</sub> / Number<sub>task</sub>) x (Number<sub>task</sub> / Time<sub>observe</sub>)
- Then get Little's Law:
- Mean number tasks in system = Arrival Rate x Mean response time
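The derivation can be checked numerically on a handful of tasks. The task start and end times below are made-up values inside a 10-second observation window; the variable names mirror the slide's symbols.

```python
import math

# (start, end) times in seconds for each task seen during the observation window
tasks = [(0.0, 3.0), (1.0, 2.0), (2.0, 6.0), (5.0, 9.0)]
time_observe = 10.0

number_task = len(tasks)
time_accumulated = sum(end - start for start, end in tasks)

mean_tasks_in_system = time_accumulated / time_observe
mean_response_time = time_accumulated / number_task
arrival_rate = number_task / time_observe

# Little's Law: both sides are Time_accumulated / Time_observe
assert math.isclose(mean_tasks_in_system, arrival_rate * mean_response_time)
```

Nothing about the arrival pattern or service discipline was assumed, which is why the law holds for any system in equilibrium.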



#### **Server Utilization**

- For a single server, service rate = 1 / Time<sub>server</sub>
- Server utilization must be between 0 and 1, since system is in equilibrium (arrivals = departures); often called traffic intensity, traditionally ρ
- Server utilization = mean number tasks in service = Arrival rate x Time<sub>server</sub>
- What is disk utilization if get 50 I/O requests per second for disk and average disk service time is 10 ms (0.01 sec)?
- Server utilization = 50/sec x 0.01 sec = 0.5
- Or server is busy on average 50% of time

#### Time in Queue vs. Length of Queue

- We assume First In First Out (FIFO) queue
- Relationship of time in queue (*Time<sub>queue</sub>*) to mean number of tasks in queue (*Length<sub>queue</sub>*)?
- Time<sub>queue</sub> = Length<sub>queue</sub> x Time<sub>server</sub> + "Mean time to complete service of task when new task arrives if server is busy"
- New task can arrive at any instant; how predict last part?
- To predict performance, need to know something about distribution of events

#### **Distribution of Random Variables**

- A variable is random if it takes one of a specified set of values with a specified probability
  - Cannot know exactly next value, but may know probability of all possible values
- I/O Requests can be modeled by a random variable because OS normally switching between several processes generating independent I/O requests
  - Also given probabilistic nature of disks in seek and rotational delays
- Can characterize distribution of values of a random variable with discrete values using a *histogram*
  - Divides range between the min & max values into *buckets*
  - Histograms then plot the number in each bucket as columns
  - Works for discrete values e.g., number of I/O requests?
- What about if not discrete? Very fine buckets

#### Characterizing distribution of a random variable

- Need mean time and a measure of variance
- For mean, use weighted arithmetic mean (WAM):
  - $f_i$ = frequency of task $i$
  - $T_i$ = time for task $i$
  - Weighted arithmetic mean = $f_1 \times T_1 + f_2 \times T_2 + \ldots + f_n \times T_n$
- For variance, instead of standard deviation, use Variance (square of standard deviation) for WAM:
  - Variance = $(f_1 \times T_1^2 + f_2 \times T_2^2 + \ldots + f_n \times T_n^2) - \text{WAM}^2$
  - If time is milliseconds, Variance units are square milliseconds!
- Got a unitless measure of variance?

#### Squared Coefficient of Variance (C<sup>2</sup>)

- C<sup>2</sup> = Variance / WAM<sup>2</sup> ⇒ C = sqrt(Variance)/WAM = StDev/WAM
  - Unitless measure
- Trying to characterize random events, but need distribution of random events with tractable math
- Most popular such distribution is exponential distribution, where C = 1
- Note using constant to characterize variability about the mean
  - Invariance of C over time ⇒ history of events has no impact on probability of an event occurring now
  - Called memoryless, an important assumption to predict behavior
  - (Suppose not; then have to worry about the exact arrival times of requests relative to each other ⇒ make math not tractable!)
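A short sketch of the weighted arithmetic mean, Variance, and C<sup>2</sup>. It assumes the frequencies are fractions that sum to 1; the sample frequencies and times are invented for illustration.

```python
import math

def wam(freqs, times):
    """Weighted arithmetic mean: f1*T1 + f2*T2 + ... + fn*Tn."""
    return sum(f * t for f, t in zip(freqs, times))

def variance(freqs, times):
    """(f1*T1^2 + ... + fn*Tn^2) - WAM^2."""
    return sum(f * t * t for f, t in zip(freqs, times)) - wam(freqs, times) ** 2

def squared_cv(freqs, times):
    """C^2 = Variance / WAM^2, a unitless measure of variability."""
    return variance(freqs, times) / wam(freqs, times) ** 2

# Half the requests take 10 ms, half take 30 ms (hypothetical workload).
freqs, times = [0.5, 0.5], [10.0, 30.0]
print(wam(freqs, times))                    # mean in ms
print(variance(freqs, times))               # in square ms
print(math.sqrt(squared_cv(freqs, times)))  # C, unitless

# All requests take exactly 20 ms: no variability, so C^2 = 0.
constant = squared_cv([1.0], [20.0])
```

For the two-valued workload the mean is 20 ms, the Variance is 100 square ms, and C is 0.5, well below the C = 1 of an exponential distribution.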

#### **Poisson Distribution**

- Most widely used exponential distribution is Poisson
- Described by probability mass function: Probability (k) = $e^{-a} \times a^k / k!$
  - where a = Rate of events x Elapsed time
- If interarrival times exponentially distributed & use arrival rate from above for rate of events, number of arrivals in time interval t is a Poisson process
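The probability mass function is easy to evaluate directly. The sketch below uses a hypothetical rate of 50 requests per second over a 0.1 second interval, so a = 5; the function name is illustrative.

```python
import math

def poisson_pmf(k: int, rate: float, elapsed: float) -> float:
    """Probability of exactly k events, with a = rate x elapsed time."""
    a = rate * elapsed
    return math.exp(-a) * a ** k / math.factorial(k)

rate, elapsed = 50.0, 0.1          # hypothetical: 50 requests/s, 0.1 s window
probs = [poisson_pmf(k, rate, elapsed) for k in range(60)]

total = sum(probs)                                  # should be very close to 1
mean = sum(k * p for k, p in enumerate(probs))      # should be very close to a
print(f"P(0 arrivals) = {probs[0]:.4f}, total = {total:.6f}, mean = {mean:.4f}")
```

Truncating the sum at k = 59 loses only a negligible tail for a = 5, so the total and the mean land on 1 and a to many decimal places.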

#### **Time in Queue**

- Time new task must wait for server to complete a task, assuming server busy
  - Assuming it's a Poisson process
- Average residual service time = $\frac{1}{2}$ × Arithmetic mean × (1 + C<sup>2</sup>)
  - When distribution is not random & all values = average  $\Rightarrow$  standard deviation is 0  $\Rightarrow$  C is 0 ⇒ average residual service time = half average service time
  - When distribution is random & Poisson  $\Rightarrow$  C is 1  $\Rightarrow$  average residual service time = weighted arithmetic mean
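The two cases above follow directly from the formula. A minimal sketch, with a hypothetical function name and an assumed 20 ms mean service time:

```python
def avg_residual_service_time(mean_service_time, c_squared):
    """Average residual service time = 1/2 x arithmetic mean x (1 + C^2)."""
    return 0.5 * mean_service_time * (1 + c_squared)


# Assumed mean service time of 20 ms.
constant_case = avg_residual_service_time(20.0, 0.0)     # all values equal: half the mean
exponential_case = avg_residual_service_time(20.0, 1.0)  # C = 1: the full mean
print(constant_case, exponential_case)
```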

#### **Time in Queue**

- All tasks in queue (Length<sub>queue</sub>) ahead of new task must be completed before task can be serviced
  - Each task takes on average Time<sub>server</sub>
  - Task at server takes average residual service time to complete
- Chance server is busy is server utilization $\Rightarrow$ expected time for service is Server utilization $\times$ Average residual service time
- Time<sub>queue</sub> = Length<sub>queue</sub> × Time<sub>server</sub> + Server utilization × Average residual service time
- Substituting definitions for Length<sub>queue</sub> and Average residual service time, & rearranging:

Time<sub>queue</sub> = Time<sub>server</sub> x Server utilization/(1-Server utilization)
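The substitution can be written out step by step. This sketch assumes Little's law for the queue (Length<sub>queue</sub> = Arrival rate × Time<sub>queue</sub>), the definition Server utilization = Arrival rate × Time<sub>server</sub>, and exponentially distributed service times (C<sup>2</sup> = 1), so that the average residual service time equals Time<sub>server</sub>:

$$
\begin{aligned}
\text{Time}_{\text{queue}} &= \text{Length}_{\text{queue}} \times \text{Time}_{\text{server}} + \text{Server utilization} \times \text{Time}_{\text{server}} \\
&= (\text{Arrival rate} \times \text{Time}_{\text{queue}}) \times \text{Time}_{\text{server}} + \text{Server utilization} \times \text{Time}_{\text{server}} \\
&= \text{Server utilization} \times \text{Time}_{\text{queue}} + \text{Server utilization} \times \text{Time}_{\text{server}}
\end{aligned}
$$

Collecting the Time<sub>queue</sub> terms on the left:

$$
\text{Time}_{\text{queue}} \times (1 - \text{Server utilization}) = \text{Server utilization} \times \text{Time}_{\text{server}}
\;\Rightarrow\;
\text{Time}_{\text{queue}} = \text{Time}_{\text{server}} \times \frac{\text{Server utilization}}{1 - \text{Server utilization}}
$$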

#### M/M/1 Queuing Model

- System is in equilibrium
- Times between 2 successive requests arriving, "interarrival times", are exponentially distributed
- Number of sources of requests is unlimited
- Server can start next job immediately
- Single queue, no limit to length of queue, and FIFO discipline, so all tasks in line must be completed
- There is one server
- Called M/M/1 (book also derives M/M/m)
  1. Exponentially random request arrival (C<sup>2</sup> = 1)
  2. Exponentially random service time (C<sup>2</sup> = 1)
  3. 1 server
  - *M* standing for Markov, the mathematician who defined and analyzed the memoryless processes

#### Example

40 disk I/Os / sec, requests are exponentially distributed, and average service time is 20 ms

- $\Rightarrow$ Arrival rate/sec = 40, Time<sub>server</sub> = 0.02 sec

1. On average, how utilized is the disk?
   - Server utilization = Arrival rate × Time<sub>server</sub> = 40 × 0.02 = 0.8 = 80%
2. What is the average time spent in the queue?
   - Time<sub>queue</sub> = Time<sub>server</sub> × Server utilization/(1 − Server utilization) = 20 ms × 0.8/(1 − 0.8) = 20 × 4 = 80 ms
3. What is the average response time for a disk request, including the queuing time and disk service time?
   - Time<sub>system</sub> = Time<sub>queue</sub> + Time<sub>server</sub> = 80 + 20 ms = 100 ms

### How much better with 2X faster disk?

Average service time is 10 ms

- ⇒ Arrival rate/sec = 40, Time<sub>server</sub> = 0.01 sec

1. On average, how utilized is the disk?
   - Server utilization = Arrival rate × Time<sub>server</sub> = 40 × 0.01 = 0.4 = 40%
2. What is the average time spent in the queue?
   - Time<sub>queue</sub> = Time<sub>server</sub> × Server utilization/(1 − Server utilization) = 10 ms × 0.4/(1 − 0.4) = 10 × 2/3 = 6.7 ms
3. What is the average response time for a disk request, including the queuing time and disk service time?
   - Time<sub>system</sub> = Time<sub>queue</sub> + Time<sub>server</sub> = 6.7 + 10 ms = 16.7 ms

6X faster response time with 2X faster disk!
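Both disk examples can be reproduced with a few lines. This is a sketch of the M/M/1 formulas used above; the function name `mm1` is an assumption, and times are in seconds:

```python
def mm1(arrival_rate, time_server):
    """Return (utilization, time in queue, time in system) for an M/M/1 queue."""
    utilization = arrival_rate * time_server
    if not 0 <= utilization < 1:
        raise ValueError("M/M/1 needs utilization below 100% to be in equilibrium")
    time_queue = time_server * utilization / (1 - utilization)
    return utilization, time_queue, time_queue + time_server


# 40 disk I/Os per second with a 20 ms disk, then with a 2X faster (10 ms) disk.
slow_util, slow_queue, slow_system = mm1(40, 0.020)
fast_util, fast_queue, fast_system = mm1(40, 0.010)

speedup = slow_system / fast_system  # ratio of the two response times
print(slow_util, slow_queue, slow_system)
print(fast_util, fast_queue, fast_system)
print(speedup)
```

The response time improves by more than the service time does because the lower utilization also shrinks the queueing term.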

#### Value of Queuing Theory in practice

- Learn quickly: do not try to utilize a resource 100%, but how far should you back off?
- Allows designers to decide impact of faster hardware on utilization and hence on response time
- Works surprisingly well

## Cross cutting Issues: Buses $\Rightarrow$ point-to-point links and switches

| Standard           | Width | Length | Clock rate   | MB/s | Max devices |
|--------------------|-------|--------|--------------|------|--------|
| (Parallel) ATA     | 8b    | 0.5 m  | 133 MHz      | 133  | 2      |
| Serial ATA         | 2b    | 2 m    | 3 GHz        | 300  | ?      |
| (Parallel) SCSI    | 16b   | 12 m   | 80 MHz (DDR) | 320  | 15     |
| Serial Attach SCSI | 1b    | 10 m   |              | 375  | 16,256 |
| PCI                | 32/64 | 0.5 m  | 33 / 66 MHz  | 533  | ?      |
| PCI Express        | 2b    | 0.5 m  | 3 GHz        | 250  | ?      |

- No. bits and BW is per direction ⇒ 2X for both directions (not shown).
- Since use fewer wires, commonly increase BW via versions with 2X-12X the number of wires and BW

## **Storage Example: Internet Archive**

- Goal of making a historical record of the Internet
  - Internet Archive began in 1996
  - Wayback Machine interface performs time travel to see what the website at a URL looked like in the past
- It contains over a petabyte (10<sup>15</sup> bytes), and is growing by 20 terabytes (10<sup>12</sup> bytes) of new data per month
- In addition to storing the historical record, the same hardware is used to crawl the Web every few months to get snapshots of the Internet.



# **Estimated Cost**

- VIA processor, 512 MB of DDR266 DRAM, ATA disk controller, power supply, fans, and enclosure = \$500
- 7200 RPM Parallel ATA drive holds 500 GB = \$375.
- 48-port 10/100/1000 Ethernet switch and all cables for a rack = \$3000.
- Cost \$84,500 for an 80-TB rack.
- 160 Disks are ≈ 60% of the cost
- Other costs: power, space, .....

#### **Estimated Performance**

- 7200 RPM Parallel ATA drive holds 500 GB, has an average seek time of 8.5 ms, and transfers at 50 MB/second from the disk. The PATA link speed is 133 MB/second.
- Performance of the VIA processor is 1000 MIPS.
  - Operating system uses 50,000 CPU instructions for a disk I/O.
  - Network protocol stack uses 100,000 CPU instructions to transmit a data block between the cluster and the external world.
- ATA controller overhead is 0.1 ms to perform a disk I/O.
- Average I/O size is 16 KB for accesses to the historical record via the Wayback interface, and 50 KB when collecting a new snapshot
- Disks are the limit: $\approx 75$ I/Os/s per disk, 300/s per node, 12000/s per rack, or about 200 to 600 Mbytes/sec bandwidth per rack
- Switch needs to support 1.6 to 3.8 Gbits/second over 40 links of 1 Gbit/sec each
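The disk-limited figures can be approximated from the parameters above. This sketch makes several assumptions not stated on the slide: average rotational latency is half a rotation at 7200 RPM, 1 MB = 1000 KB, and a rack holds 40 nodes of 4 disks each (inferred from 300 I/Os/s per node and 12000 per rack). All names are hypothetical:

```python
def disk_io_time_ms(io_kb, seek_ms=8.5, rpm=7200, controller_ms=0.1,
                    transfer_mb_per_s=50.0):
    """Time for one disk I/O: seek + rotation + controller overhead + transfer."""
    rotation_ms = 0.5 * 60_000.0 / rpm       # assumption: half a rotation on average
    transfer_ms = io_kb / transfer_mb_per_s  # KB / (MB/s) gives ms when 1 MB = 1000 KB
    return seek_ms + rotation_ms + controller_ms + transfer_ms


DISKS_PER_NODE = 4    # assumption inferred from 300 I/Os/s per node
NODES_PER_RACK = 40   # assumption inferred from 12000 I/Os/s per rack

# CPU limit per node: 50,000 + 100,000 instructions per I/O at 1000 MIPS.
cpu_ios_per_s = 1000e6 / (50_000 + 100_000)

results = {}
for io_kb in (16, 50):  # Wayback access size and crawl snapshot size
    disk_ios_per_s = 1000.0 / disk_io_time_ms(io_kb)
    node_ios_per_s = DISKS_PER_NODE * disk_ios_per_s
    rack_ios_per_s = NODES_PER_RACK * node_ios_per_s
    rack_mb_per_s = rack_ios_per_s * io_kb / 1000.0
    results[io_kb] = (disk_ios_per_s, node_ios_per_s, rack_ios_per_s, rack_mb_per_s)

for io_kb, row in results.items():
    print(io_kb, [round(v, 1) for v in row])
print(round(cpu_ios_per_s))
```

Under these assumptions each disk delivers somewhat over 70 I/Os per second, while the CPU could handle several thousand per node, which is consistent with the disks being the limit.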

#### **Estimated Reliability**

- CPU/memory/enclosure MTTF is 1,000,000 hours (x 40)
- PATA Disk MTTF is 125,000 hours (x 160)
- PATA controller MTTF is 500,000 hours (x 40)
- Ethernet Switch MTTF is 500,000 hours (x 1)
- Power supply MTTF is 200,000 hours (x 40)
- Fan MTTF is 200,000 hours (x 40)
- PATA cable MTTF is 1,000,000 hours (x 80)
- MTTF for the system is 531 hours (≈ 3 weeks)
- 70% of time failures are disks
- 20% of time failures are fans or power supplies
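The system MTTF follows from adding the failure rates of all components, assuming independent failures with exponentially distributed lifetimes. The sketch below also assumes 80 PATA cables (two per node, since a PATA cable serves two drives), which is the count that reproduces the 531-hour figure. The variable names are hypothetical:

```python
# (count, MTTF in hours) for each component type in one rack.
components = {
    "CPU/memory/enclosure": (40, 1_000_000),
    "PATA disk": (160, 125_000),
    "PATA controller": (40, 500_000),
    "Ethernet switch": (1, 500_000),
    "Power supply": (40, 200_000),
    "Fan": (40, 200_000),
    "PATA cable": (80, 1_000_000),  # assumption: 2 cables per node
}

# With independent, exponentially distributed lifetimes, failure rates simply add.
failure_rates = {name: count / mttf for name, (count, mttf) in components.items()}
system_failure_rate = sum(failure_rates.values())
system_mttf_hours = 1 / system_failure_rate

disk_share = failure_rates["PATA disk"] / system_failure_rate
fan_psu_share = (failure_rates["Fan"] + failure_rates["Power supply"]) / system_failure_rate

print(round(system_mttf_hours), round(disk_share, 2), round(fan_psu_share, 2))
```

The disk and fan/power-supply shares come out close to the 70% and 20% quoted above.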

# Summary (1/2)

- Disks: Areal density now 30%/yr vs. 100%/yr in 2000s
- RAID Techniques: Goal was performance, popularity due to reliability of storage
- TPC: price performance as normalizing configuration feature
  - Auditing to ensure no foul play
  - Throughput with restricted response time is normal measure
- Fault ⇒ Latent errors in system ⇒ Failure in service
- Components often fail slowly
- Real systems: problems in maintenance, operation as well as hardware, software



## The End

#### The last lecture

- chapter 6: Storage Systems

#### Exam

- Mon Jan 14th 2008, 14-17h
- chap 1-6, app A, C & F
- remark: sample exams on website based on previous edition of book
- Assignment
  - deadline 2b: Dec 3<sup>rd</sup>
  - deadline 3: Dec 24th (intro by Eyal on Wed Dec 5th, 13.45h)