# Lecture 7: Vector Processors & Multiprocessor Introduction

Slides were used during lectures by Krste Asanovic & David Patterson, Berkeley, spring 2006

#### Outline

- Vector Processors
- Vector Metrics, Terms
- Multiprocessing Motivation
- SISD v. SIMD v. MIMD
- Centralized vs. Distributed Memory
- Challenges to Parallel Programming
- Conclusion

#### **Supercomputers**

Definition of a supercomputer:

- Fastest machine in world at given task
- A device to turn a compute-bound problem into an I/O bound problem
- Any machine costing \$30M+
- Any machine designed by Seymour Cray

CDC6600 (Cray, 1964) regarded as first supercomputer

#### **Supercomputer Applications**

#### Typical application areas

- Military research (nuclear weapons, cryptography)
- Scientific research
- Weather forecasting
- Oil exploration
- Industrial design (car crash simulation)

All involve huge computations on large data sets

In 70s-80s, Supercomputer = Vector Machine

#### **Vector Supercomputers**

#### Epitomized by Cray-1, 1976:

- Scalar Unit + Vector Extensions
- Load/Store Architecture
- Vector Registers
- Vector Instructions
- Hardwired Control
- Highly Pipelined Functional Units
- Interleaved Memory System
- No Data Caches
- No Virtual Memory







| C code | Scalar code | Vector code |
|--------|-------------|-------------|
| `for (i=0; i<64; i++)`<br>`C[i] = A[i] + B[i];` | `LI R4, 64`<br>`loop:`<br>`L.D F0, 0(R1)`<br>`L.D F2, 0(R2)`<br>`ADD.D F4, F2, F0`<br>`S.D F4, 0(R3)`<br>`DADDIU R1, 8`<br>`DADDIU R2, 8`<br>`DADDIU R3, 8`<br>`DSUBIU R4, 1`<br>`BNEZ R4, loop` | `LI VLR, 64`<br>`LV V1, R1`<br>`LV V2, R2`<br>`ADDV.D V3, V1, V2`<br>`SV V3, R3` |

#### **Vector Instruction Set Advantages**

- Compact
  - one short instruction encodes N operations
- Expressive, tells hardware that these N operations:
  - are independent
  - use the same functional unit
  - access disjoint registers
  - access registers in the same pattern as previous instructions
  - access a contiguous block of memory (unit-stride load/store)
  - access memory in a known pattern (strided load/store)
- Scalable
  - can run same object code on more parallel pipelines or lanes













#### Vector Memory-Memory vs. Vector Register Machines

- Vector memory-memory architectures (VMMA) require greater main memory bandwidth. Why?
  - All operands must be read in and out of memory
- VMMAs make it difficult to overlap execution of multiple vector operations. Why?
  - Must check dependencies on memory addresses
- VMMAs incur greater startup latency
  - Scalar code was faster on CDC Star-100 for vectors < 100 elements
  - For Cray-1, vector/scalar breakeven point was around 2 elements
- ⇒ Apart from CDC follow-ons (Cyber-205, ETA-10), all major vector machines since Cray-1 have had vector register architectures

(we ignore vector memory-memory from now on)



| Benchmark<br>name | Operations executed<br>in vector mode,<br>compiler-optimized | Operations executed<br>in vector mode,<br>hand-optimized | Speedup from<br>hand optimization |
|-------------------|--------------------------------------------------------------|----------------------------------------------------------|-----------------------------------|
| BDNA              | 96.1%                                                        | 97.2%                                                    | 1.52                              |
| MG3D              | 95.1%                                                        | 94.5%                                                    | 1.00                              |
| FLO52             | 91.5%                                                        | 88.7%                                                    | N/A                               |
| ARC3D             | 91.1%                                                        | 92.0%                                                    | 1.01                              |
| SPEC77            | 90.3%                                                        | 90.4%                                                    | 1.07                              |
| MDG               | 87.7%                                                        | 94.2%                                                    | 1.49                              |
| TRFD              | 69.8%                                                        | 73.7%                                                    | 1.67                              |
| DYFESM            | 68.8%                                                        | 65.6%                                                    | N/A                               |
| ADM               | 42.9%                                                        | 59.6%                                                    | 3.60                              |
| OCEAN             | 42.8%                                                        | 91.2%                                                    | 3.92                              |
| TRACK             | 14.4%                                                        | 54.6%                                                    | 2.52                              |
| SPICE             | 11.5%                                                        | 79.9%                                                    | 4.06                              |
| OCD               | 4.2%                                                         | 75.1%                                                    | 2.15                              |

| Processor        | Compiler             | Completely<br>vectorized | Partially<br>vectorized | Not<br>vectorized |
|------------------|----------------------|--------------------------|-------------------------|-------------------|
| CDC CYBER 205    | VAST-2 V2.21         | 62                       | 5                       | 33                |
| Convex C-series  | FC5.0                | 69                       | 5                       | 26                |
| Cray X-MP        | CFT77 V3.0           | 69                       | 3                       | 28                |
| Cray X-MP        | CFT V1.15            | 50                       | 1                       | 49                |
| Cray-2           | CFT2 V3.1a           | 27                       | 1                       | 72                |
| ETA-10           | FTN 77 V1.0          | 62                       | 7                       | 31                |
| Hitachi S810/820 | FORT77/HAP V20-2B    | 67                       | 4                       | 29                |
| IBM 3090/VF      | VS FORTRAN V2.4      | 52                       | 4                       | 44                |
| NEC SX/2         | FORTRAN77 / SX V.040 | 66                       | 5                       | 29                |

Results of applying vectorizing compilers to the 100 FORTRAN test kernels: for each processor, the table shows how many loops were completely vectorized, partially vectorized, and unvectorized. These loops were collected by Callahan, Dongarra, and Levine [1988]. Two different compilers for the Cray X-MP show the large dependence on compiler technology.













#### **Vector Scatter/Gather**

Want to vectorize loops with indirect accesses:

    for (i=0; i<N; i++)
      A[i] = B[i] + C[D[i]];

#### Indexed load instruction (Gather)

    LV     vD, rD       # Load indices in D vector
    LVI    vC, rC, vD   # Load indirect from rC base
    LV     vB, rB       # Load B vector
    ADDV.D vA, vB, vC   # Do add
    SV     vA, rA       # Store result
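As a rough scalar model of what the gather sequence computes, the following C sketch may help. The names `gather`, `add_indirect`, and the maximum vector length `MVL` are illustrative assumptions, not part of any real ISA; on a vector machine each inner loop would be a single vector instruction, and the outer loop is the usual strip-mining over chunks of at most `MVL` elements.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length (illustrative) */

/* Models an indexed load (LVI): dst[i] = base[idx[i]] for i in [0, vl). */
static void gather(double *dst, const double *base, const size_t *idx, size_t vl)
{
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++)
        dst[i] = base[idx[i]];
}

/* Computes A[i] = B[i] + C[D[i]] in chunks of at most MVL elements. */
static void add_indirect(double *A, const double *B, const double *C,
                         const size_t *D, size_t n)
{
    double vC[MVL];  /* stands in for vector register vC */
    for (size_t start = 0; start < n; start += MVL) {
        size_t vl = (n - start < MVL) ? n - start : MVL;
        gather(vC, C, D + start, vl);          /* LVI vC, rC, vD        */
        for (size_t i = 0; i < vl; i++)        /* LV vB; ADDV.D; SV vA  */
            A[start + i] = B[start + i] + vC[i];
    }
}
```

Calling `add_indirect(A, B, C, D, n)` leaves the same values in `A` as the scalar loop on the slide.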

#### **Vector Scatter/Gather**

#### Scatter example:

    for (i=0; i<N; i++)
      A[B[i]]++;

#### Is following a correct translation?

```
LV vB, rB # Load indices in B vector
LVI vA, rA, vB # Gather initial A values
ADDV vA, vA, 1 # Increment
SVI vA, rA, vB # Scatter incremented values
```
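The translation is only safe when the index vector holds no repeated values. The hedged C sketch below contrasts the original loop (`scalar_increment`) with a model of the four-instruction vector sequence (`gather_scatter_increment`); both function names and the `MVL` bound are assumptions made for illustration. When an index repeats, every duplicate gathers the same old value, so the scatter stores old+1 instead of counting each occurrence.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length (illustrative) */

/* The original loop: each occurrence of an index increments A once. */
static void scalar_increment(int *A, const size_t *B, size_t n)
{
    for (size_t i = 0; i < n; i++)
        A[B[i]]++;
}

/* Models LVI / ADDV / SVI on one vector of vl <= MVL indices. */
static void gather_scatter_increment(int *A, const size_t *B, size_t vl)
{
    int vA[MVL];  /* stands in for vector register vA */
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++) vA[i] = A[B[i]];  /* gather all old values  */
    for (size_t i = 0; i < vl; i++) vA[i] += 1;       /* increment every lane   */
    for (size_t i = 0; i < vl; i++) A[B[i]] = vA[i];  /* scatter; duplicates overwrite */
}
```

With indices {2, 2, 0} and `A` initially zero, the scalar loop ends with `A[2] == 2`, while the gather/scatter model ends with `A[2] == 1`. A compiler can therefore vectorize this loop this way only if the indices are known to be distinct.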



    SV vA, rA    # Store A back to memory under mask





#### Vector Reductions

Problem: Loop-carried dependence on reduction variables

    for (i=0; i<N; i++)
      sum += A[i];                   # Loop-carried dependence on sum

Solution: Re-associate operations if possible, use binary tree to perform reduction

    # Rearrange as:
    sum[0:VL-1] = 0                  # Vector of VL partial sums
    for (i=0; i<N; i+=VL)            # Stripmine VL-sized chunks
      sum[0:VL-1] += A[i:i+VL-1];    # Vector sum
    # Now have VL partial sums in one vector register
    do {
      VL = VL/2;                     # Halve vector length
      sum[0:VL-1] += sum[VL:2*VL-1]  # Halve no. of partials
    } while (VL>1)
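The slice pseudocode above can be written out in plain C roughly as follows. This is a minimal sketch, assuming a fixed power-of-two vector length `VLEN` and a hypothetical function name `vector_sum`; the inner loops over `j` stand in for single vector instructions.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 8  /* assumed vector length; must be a power of two */

static double vector_sum(const double *A, size_t n)
{
    double partial[VLEN] = {0};  /* vector of VLEN partial sums */
    size_t i = 0;

    assert(A != NULL || n == 0);

    /* Strip-mined phase: add one VLEN-sized chunk element-wise per pass. */
    for (; i + VLEN <= n; i += VLEN)
        for (size_t j = 0; j < VLEN; j++)
            partial[j] += A[i + j];

    /* Tail: fewer than VLEN elements remain (a shorter vector length). */
    for (size_t j = 0; i < n; i++, j++)
        partial[j] += A[i];

    /* Binary-tree phase: halve the number of partial sums each step. */
    for (size_t vl = VLEN / 2; vl >= 1; vl /= 2)
        for (size_t j = 0; j < vl; j++)
            partial[j] += partial[j + vl];

    return partial[0];
}
```

Because the additions are re-associated, a floating-point result can differ in the last bits from the sequential loop, which is why the slide says "if possible".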

#### A Modern Vector Super: NEC SX-6 (2003)

- CMOS Technology
  - 500 MHz CPU, fits on single chip
  - SDRAM main memory (up to 64GB)
- Scalar unit
  - 4-way superscalar with out-of-order and speculative execution
  - 64KB I-cache and 64KB data cache
- Vector unit
  - 8 foreground VRegs + 64 background VRegs (256x64-bit elements/VReg)
  - 1 multiply unit, 1 divide unit, 1 add/shift unit, 1 logical unit, 1 mask unit
  - 8 lanes (8 GFLOPS peak, 16 FLOPS/cycle)
  - 1 load & store unit (32x8 byte accesses/cycle)
  - 32 GB/s memory bandwidth per processor
- SMP structure
  - 8 CPUs connected to memory through crossbar
  - 256 GB/s shared memory bandwidth (4096 interleaved banks)

#### **Multimedia Extensions**

- Very short vectors added to existing ISAs for microprocessors
- Usually 64-bit registers split into 2x32b or 4x16b or 8x8b
- Newer designs have 128-bit registers (Altivec, SSE2)
- Limited instruction set:
  - no vector length control
  - no strided load/store or scatter/gather
  - unit-stride loads must be aligned to 64/128-bit boundary
- Limited vector register length:
  - requires superscalar dispatch to keep multiply/add/load units busy
  - loop unrolling to hide latencies increases register pressure
- Trend towards fuller vector support in microprocessors

#### **Properties of Vector Processors**

- Each result independent of previous result
  - => long pipeline, compiler ensures no dependencies
  - => high clock rate
- Vector instructions access memory with known pattern
  - => highly interleaved memory
  - => amortize memory latency over 64 elements
  - => no (data) caches required! (Do use instruction cache)
- Reduces branches and branch problems in pipelines
- Single vector instruction implies lots of work (≈ loop)
  - => fewer instruction fetches

#### Operation & Instruction Count: RISC v. Vector Processor

| Program | Operations: RISC | Operations: Vector | Operations: R/V | Instructions: RISC | Instructions: Vector | Instructions: R/V |
|---------|------------------|--------------------|-----------------|--------------------|----------------------|-------------------|
| swim256 | 115              | 95                 | 1.1x            | 115                | 0.8                  | 142x              |
| hydro2d | 58               | 40                 | 1.4x            | 58                 | 0.8                  | 71x               |
| nasa7   | 69               | 41                 | 1.7x            | 69                 | 2.2                  | 31x               |
| su2cor  | 51               | 35                 | 1.4x            | 51                 | 1.8                  | 29x               |
| tomcatv | 15               | 10                 | 1.4x            | 15                 | 1.3                  | 11x               |
| wave5   | 27               | 25                 | 1.1x            | 27                 | 7.2                  | 4x                |
| mdljdp2 | 32               | 52                 | 0.6x            | 32                 | 15.8                 | 2x                |

#### **Common Vector Metrics**

- R<sub>∞</sub>: MFLOPS rate on an infinite-length vector
  - vector "speed of light"
  - Real problems do not have unlimited vector lengths, and the start-up penalties encountered in real problems will be larger
  - (R<sub>n</sub> is the MFLOPS rate for a vector of length n)
- N<sub>1/2</sub>: The vector length needed to reach one-half of R<sub>∞</sub>
  - a good measure of the impact of start-up
- N<sub>V</sub>: The vector length needed to make vector mode faster than scalar mode
  - measures both start-up and speed of scalars relative to vectors, quality of connection of scalar unit to vector unit
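To see how these metrics relate, here is a small C sketch under a simple linear timing model, which is an assumption and not something the slides state: a vector of n elements takes `startup + n * per_elem` clocks. Under that model the rate for length n is n / (startup + n * per_elem), the asymptotic rate is 1 / per_elem, and setting the rate to half of that gives N<sub>1/2</sub> = startup / per_elem. Rates are in results per clock; multiplying by the clock frequency and the FLOPs per result would give MFLOPS. All function names are illustrative.

```c
#include <assert.h>

/* Results per clock for a vector of length n (linear timing model, assumed). */
static double rate(double startup, double per_elem, double n)
{
    assert(per_elem > 0.0 && n > 0.0);
    return n / (startup + n * per_elem);
}

/* R_inf: the rate as n grows without bound. */
static double r_inf(double per_elem)
{
    assert(per_elem > 0.0);
    return 1.0 / per_elem;
}

/* N_1/2: the length at which rate() reaches half of r_inf(). */
static double n_half(double startup, double per_elem)
{
    assert(per_elem > 0.0);
    return startup / per_elem;
}
```

For example, with a hypothetical 12-clock start-up and one result per clock, `n_half(12, 1)` is 12, and `rate(12, 1, 12)` is 0.5, half of `r_inf(1)`.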

#### **Vector Execution Time**

- Time = f(vector length, data dependencies, struct. hazards)
- Initiation rate: rate that FU consumes vector elements (= number of lanes; usually 1 or 2 on Cray T-90)
- Convoy: set of vector instructions that can begin execution in same clock (no struct. or data hazards)
- Chime: approx. time for a vector operation
- m convoys take m chimes; if each vector length is n, then they take approx. m x n clock cycles (ignores overhead; good approximation for long vectors)

Example, with the convoy number at the left:

    1: LV   V1,Rx      ;load vector X
    2: MULV V2,F0,V1   ;vector-scalar mult.
       LV   V3,Ry      ;load vector Y
    3: ADDV V4,V2,V3   ;add
    4: SV   Ry,V4      ;store the result

4 convoys, 1 lane, VL=64 $\Rightarrow$ 4 x 64 = 256 clocks (or 4 clocks per result)
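The chime approximation is simple enough to write down as a one-line C function. This is a sketch only: the name `chime_cycles` is made up here, and dividing the vector length by the number of lanes is an assumption based on the statement that the initiation rate equals the number of lanes. Start-up overhead is ignored, as on the slide.

```c
#include <assert.h>

/* Approximate clock cycles for `convoys` convoys over vectors of length n,
 * assuming each convoy takes ceil(n / lanes) clocks and no start-up cost. */
static unsigned long chime_cycles(unsigned long convoys, unsigned long n,
                                  unsigned long lanes)
{
    assert(lanes > 0);
    return convoys * ((n + lanes - 1) / lanes);
}
```

`chime_cycles(4, 64, 1)` reproduces the 256 clocks of the example above.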

#### **Memory operations**

- Load/store operations move groups of data between registers and memory
- Three types of addressing
  - Unit stride
    - Contiguous block of information in memory
    - Fastest: always possible to optimize this
  - Non-unit (constant) stride
    - Harder to optimize memory system for all possible strides
    - Prime number of data banks makes it easier to support different strides at full bandwidth
  - Indexed (gather-scatter)
    - Vector equivalent of register indirect
    - Good for sparse arrays of data
    - Increases number of programs that vectorize

#### **Interleaved Memory Layout**

(Figure: a vector processor connected to eight memory banks, selected by Addr Mod 8.)

- Great for unit stride:
  - Contiguous elements in different DRAMs
  - Startup time for vector operation is latency of single read
- What about non-unit stride?
  - Above good for strides that are relatively prime to 8
  - Bad for: 2, 4
  - Better: prime number of banks...!
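Why strides of 2 and 4 are bad with 8 banks, and why a prime bank count helps, follows from a little number theory: with word-interleaved banks (bank = word address mod number of banks), a long strided stream touches banks / gcd(stride, banks) distinct banks. The C sketch below encodes that; the function names are illustrative, and the stride is assumed to be measured in words.

```c
#include <assert.h>

/* Greatest common divisor by Euclid's algorithm; gcd_u(0, b) == b. */
static unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of distinct banks a long stream with the given word stride touches,
 * assuming bank = word address mod banks. */
static unsigned banks_touched(unsigned stride, unsigned banks)
{
    assert(banks > 0);
    return banks / gcd_u(stride % banks, banks);
}
```

With 8 banks, strides 1 and 3 reach all 8 banks, stride 2 reaches 4, and stride 4 reaches only 2. With 7 banks, every stride that is not a multiple of 7 reaches all 7.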

### How to get full bandwidth for Unit Stride?

- Memory system must sustain (# lanes x word) / clock
- No. memory banks > memory latency to avoid stalls
  - *m* banks $\Rightarrow$ *m* words per memory latency *l* clocks
  - if *m* < *l*, then there is a gap in the memory pipeline
  - may have 1024 banks in SRAM
- If desired throughput greater than one word per cycle
  - Either more banks (start multiple requests simultaneously)
  - Or wider DRAMs. Only good for unit stride or large data types
- More banks/weird numbers of banks good to support more strides at full bandwidth
  - How to do prime number of banks efficiently?
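The rule of thumb relating bank count to latency can be sketched in C as below. The model is an assumption made for illustration: unit-stride access, one request issued per clock in round-robin order, and each bank busy for the full latency of `latency` clocks. Under it, m banks deliver at most m words every l clocks, so the pipeline only stays full when m is at least l. The function name is hypothetical.

```c
#include <assert.h>

/* Sustained words per clock for a unit-stride stream, assuming one request
 * per clock and each bank busy for `latency` clocks per access. */
static double sustained_words_per_clock(unsigned banks, unsigned latency)
{
    assert(banks > 0 && latency > 0);
    if (banks >= latency)
        return 1.0;                             /* no gap in the pipeline */
    return (double)banks / (double)latency;     /* m words per l clocks   */
}
```

For instance, 4 banks with an 8-clock latency sustain only half a word per clock under this model.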

#### **Vectors Are Inexpensive**

#### Scalar

- N ops per cycle $\Rightarrow O(N^2)$ circuitry
- HP PA-8000
  - 4-way issue
  - reorder buffer: 850K transistors
    - incl. 6,720 5-bit register number comparators

#### Vector

- N ops per cycle $\Rightarrow O(N + \epsilon N^2)$ circuitry
- T0 vector micro
  - 24 ops per cycle
  - 730K transistors total
    - only 23 5-bit register number comparators
  - No floating point

#### **Vectors Lower Power**

#### Single-issue Scalar

- One instruction fetch, decode, dispatch per operation
- Arbitrary register accesses, adds area and power
- Loop unrolling and software pipelining for high performance increases instruction cache footprint
- All data passes through cache; waste power if no temporal locality
- One TLB lookup per load or store
- Off-chip access in whole cache lines

#### Vector

- One inst fetch, decode, dispatch per vector
- Structured register accesses
- Smaller code for high performance, less power in instruction cache misses
- Bypass cache
- One TLB lookup per group of loads or stores
- Move only necessary data across chip boundary

#### Superscalar Energy Efficiency Even Worse

#### **Superscalar**

- Control logic grows quadratically with issue width
- Control logic consumes energy regardless of available parallelism
- Speculation to increase visible parallelism wastes energy

#### Vector

- Control logic grows linearly with issue width
- Vector unit switches off when not in use
- Vector instructions expose parallelism without speculation
- Software control of speculation when desired: Whether to use vector mask or compress/expand for conditionals

#### **Vector Applications**

Limited to scientific computing?

- Multimedia Processing (compression, graphics, audio synth, image proc.)
- Standard benchmark kernels (Matrix Multiply, FFT, Convolution, Sort)
- Lossy Compression (JPEG, MPEG video and audio)
- Lossless Compression (Zero removal, RLE, Differencing, LZW)
- Cryptography (RSA, DES/IDEA, SHA/MD5)
- Speech and handwriting recognition
- Operating systems/Networking (memcpy, memset, parity, checksum)
- Databases (hash/join, data mining, image/video serving)
- Language run-time support (stdlib, garbage collection)
- even SPECint95

#### **Older Vector Machines**

| Machine    | Year | Clock   | Regs  | Elements | FUs | LSUs     |
|------------|------|---------|-------|----------|-----|----------|
| Cray 1     | 1976 | 80 MHz  | 8     | 64       | 6   | 1        |
| Cray XMP   | 1983 | 120 MHz | 8     | 64       | 8   | 2 L, 1 S |
| Cray YMP   | 1988 | 166 MHz | 8     | 64       | 8   | 2 L, 1 S |
| Cray C-90  | 1991 | 240 MHz | 8     | 128      | 8   | 4        |
| Cray T-90  | 1996 | 455 MHz | 8     | 128      | 8   | 4        |
| Convex C-1 | 1984 | 10 MHz  | 8     | 128      | 4   | 1        |
| Convex C-4 | 1994 | 133 MHz | 16    | 128      | 3   | 1        |
| Fuj. VP200 | 1982 | 133 MHz | 8-256 | 32-1024  | 3   | 2        |
| Fuj. VP300 | 1996 | 100 MHz | 8-256 | 32-1024  | 3   | 2        |
| NEC SX/2   | 1984 | 160 MHz | 8+8K  | 256+var  | 16  | 8        |
| NEC SX/3   | 1995 | 400 MHz | 8+8K  | 256+var  | 16  | 8        |

#### **Newer Vector Computers**

- Cray X1
  - MIPS like ISA + Vector in CMOS
- NEC Earth Simulator
  - Fastest computer in world for 3 years; 40 TFLOPS
  - 640 CMOS vector nodes

#### **Key Architectural Features of X1**

#### New vector instruction set architecture (ISA)

- Much larger register set (32x64 vector, 64+64 scalar)
- 64- and 32-bit memory and IEEE arithmetic
- Based on 25 years of experience compiling with Cray1 ISA

#### Decoupled Execution

- Scalar unit runs ahead of vector unit, doing addressing and control
- Hardware dynamically unrolls loops, and issues multiple loops concurrently
- Special sync operations keep pipeline full, even across barriers
- ⇒ Allows the processor to perform well on short nested loops

#### Scalable, distributed shared memory (DSM) architecture

- Memory hierarchy: caches, local memory, remote memory
- Low latency, load/store access to entire machine (tens of TBs)
- Processors support 1000's of outstanding refs with flexible addressing
- Very high bandwidth network
- Coherence protocol, addressing and synchronization optimized for DM

#### Cray X1E Mid-life Enhancement

- Technology refresh of the X1 (0.13μm)
  - ~50% faster processors
  - Scalar performance enhancements
  - Doubling processor density
  - Modest increase in memory system bandwidth
  - Same interconnect and I/O

#### Machine upgradeable

- Can replace Cray X1 nodes with X1E nodes

# ESS – configuration of a general purpose supercomputer

- 1. Processor Nodes (PN): The total number of processor nodes is 640. Each processor node consists of eight vector processors of 8 GFLOPS each and 16 GB of shared memory. Therefore, the total number of processors is 5,120, and the total peak performance and main memory of the system are 40 TFLOPS and 10 TB, respectively. Two nodes are installed in one cabinet, whose size is 40"x56"x80". 16 nodes form a cluster. Power consumption per cabinet is approximately 20 kVA.
- 2. Interconnection Network (IN): Each node is coupled to the others with more than 83,000 copper cables via single-stage crossbar switches of 16 GB/s x2 (Load + Store). The total length of the cables is approximately 1,800 miles.
- 3. Hard Disk: RAID disks are used for the system. The capacities are 450 TB for system operations and 250 TB for users.
- 4. Mass Storage system: 12 Automatic Cartridge Systems (STK PowderHorn9310); total storage capacity is approximately 1.6 PB.

From Horst D. Simon, NERSC/LBNL, May 15, 2002, "ESS Rapid Response Meeting"
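The totals quoted for the processor nodes follow directly from the per-node figures, as this small C sketch shows. The struct and function names are made up for illustration; the inputs are the numbers given above (640 nodes, 8 processors per node, 8 GFLOPS per processor, 16 GB per node).

```c
#include <assert.h>

/* Aggregate figures for a machine built from identical nodes. */
struct ess_totals {
    unsigned processors;   /* total vector processors  */
    unsigned peak_gflops;  /* total peak, in GFLOPS    */
    unsigned memory_gb;    /* total main memory, in GB */
};

static struct ess_totals ess_compute_totals(unsigned nodes, unsigned cpus_per_node,
                                            unsigned gflops_per_cpu,
                                            unsigned gb_per_node)
{
    struct ess_totals t;
    assert(nodes > 0 && cpus_per_node > 0);
    t.processors  = nodes * cpus_per_node;
    t.peak_gflops = t.processors * gflops_per_cpu;
    t.memory_gb   = nodes * gb_per_node;
    return t;
}
```

`ess_compute_totals(640, 8, 8, 16)` gives 5,120 processors, 40,960 GFLOPS (the quoted 40 TFLOPS), and 10,240 GB (the quoted 10 TB).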







#### **Vector Summary**

- Vector is alternative model for exploiting ILP
- If code is vectorizable, then simpler hardware, more energy efficient, and better real-time model than out-of-order machines
- Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations
- Fundamental design issue is memory bandwidth
  - With virtual address translation and caching
- Will multimedia popularity revive vector architectures?

#### **Outline**

- Vector Processors
- Vector Metrics, Terms
- Multiprocessing Motivation
- SISD v. SIMD v. MIMD
- Centralized vs. Distributed Memory
- Challenges to Parallel Programming
- Conclusion



#### Déjà vu all over again?

- "... today's processors ... are nearing an impasse as technologies approach the speed of light.."
  David Mitchell, *The Transputer: The Time Is Now* (1989)
- Transputer had bad timing (Uniprocessor performance ↑)
  - ⇒ Procrastination rewarded: 2X seq. perf. / 1.5 years
- "We are dedicating all of our future product development to multicore designs. ... This is a sea change in computing"
  Paul Otellini, President, Intel (2005)
- All microprocessor companies switch to MP (2X CPUs / 2 yrs)
  - ⇒ Procrastination penalized: 2X sequential perf. / 5 yrs

| Manufacturer/Year | AMD/'05 | Intel/'06 | IBM/'04 | Sun/'05 |
|-------------------|---------|-----------|---------|---------|
| Processors/chip   | 2       | 2         | 2       | 8       |
| Threads/Processor | 1       | 2         | 2       | 4       |
| Threads/chip      | 2       | 4         | 4       | 32      |

#### Other Factors ⇒ Multiprocessors

- Growth in data-intensive applications
  - Data bases, file servers, ...
- Growing interest in servers, server perf.
- Increasing desktop perf. less important
  - Outside of graphics
- Improved understanding in how to use multiprocessors effectively
  - Especially server where significant natural TLP
- Advantage of leveraging design investment by replication
  - Rather than unique design

# Flynn's Taxonomy

M.J. Flynn, "Very High-Speed Computers", Proc. of the IEEE, V 54, 1900-1909, Dec. 1966.

- Flynn classified by data and control streams in 1966

|                      | Single Data          | Multiple Data                       |
|----------------------|----------------------|-------------------------------------|
| Single Instruction   | SISD (Uniprocessor)  | SIMD (single PC: Vector, CM-2)      |
| Multiple Instruction | MISD (????)          | MIMD (Clusters, SMP servers)        |

- SIMD ⇒ Data Level Parallelism
- MIMD ⇒ Thread Level Parallelism
- MIMD popular because
  - Flexible: N programs or 1 multithreaded program
  - Cost-effective: same MPU in desktop & MIMD

#### Back to Basics

- "A parallel computer is a collection of processing elements that <u>cooperate</u> and communicate to solve large problems fast."
- Parallel Architecture = Computer Architecture + Communication Architecture
- Two classes of multiprocessors WRT memory:
  - 1. Centralized Memory Multiprocessor
    - < few dozen processor chips (and < 100 cores) in 2006
    - Small enough to share single, centralized memory
  - 2. Physically Distributed-Memory multiprocessor
    - Larger number of chips and cores than 1.
    - BW demands ⇒ Memory distributed among processors





#### **Distributed Memory Multiprocessor**

- Pro: Cost-effective way to scale memory bandwidth
  - If most accesses are to local memory
- Pro: Reduces latency of local memory accesses
- Con: Communicating data between processors more complex
- Con: Must change software to take advantage of increased memory BW

# Two Models for Communication and Memory Architecture

- 1. Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors
- 2. Communication occurs through a shared address space (via loads and stores): shared memory multiprocessors either
  - UMA (Uniform Memory Access time) for shared address, centralized memory MP
  - NUMA (Non Uniform Memory Access time multiprocessor) for shared address, distributed memory MP
- In past, confusion whether "sharing" means sharing physical memory (Symmetric MP) or sharing address space





#### **Challenges of Parallel Processing**

- Second challenge is long latency to remote memory
- Suppose 32 CPU MP, 2 GHz, 200 ns remote memory, all local accesses hit memory hierarchy and base CPI is 0.5. (At 2 GHz a clock cycle is 0.5 ns, so a remote access costs 200 ns / 0.5 ns = 400 clock cycles.)
- What is performance impact if 0.2% of instructions involve remote access?
  - a. 1.5X
  - b. 2.0X
  - c. 2.5X

#### **CPI Equation**

CPI = Base CPI + Remote request rate x Remote request cost

CPI = 0.5 + 0.2% x 400 = 0.5 + 0.8 = 1.3

The machine with no communication is 1.3/0.5 = 2.6 times faster than the one where 0.2% of instructions involve a remote access.
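The same arithmetic as a short C sketch, which makes it easy to try other remote-access rates or latencies. The function names are illustrative, not from the slides; the model is exactly the CPI equation above.

```c
#include <assert.h>

/* Cost of one remote access in clock cycles: latency (ns) x clock rate (GHz). */
static double remote_cycles(double clock_ghz, double remote_ns)
{
    assert(clock_ghz > 0.0 && remote_ns >= 0.0);
    return remote_ns * clock_ghz;
}

/* CPI = base CPI + remote request rate x remote request cost. */
static double effective_cpi(double base_cpi, double remote_rate,
                            double remote_cost_cycles)
{
    assert(remote_rate >= 0.0 && remote_rate <= 1.0);
    return base_cpi + remote_rate * remote_cost_cycles;
}
```

With the numbers from the example, `remote_cycles(2.0, 200.0)` is 400, and `effective_cpi(0.5, 0.002, 400.0)` is about 1.3, a slowdown of about 2.6 relative to the base CPI of 0.5.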

#### And in Conclusion [1/2] ...

- One instruction operates on vectors of data
- Vector loads get data from memory into big register files, operate, and then vector store
- E.g., Indexed load, store for sparse matrix
- Easy to add vector to commodity instruction set
  - E.g., Morph SIMD into vector
- Vector is very efficient architecture for vectorizable codes, including multimedia and many scientific codes

#### And in Conclusion [2/2] ...

- "End" of uniprocessor speedup => Multiprocessors
- Parallelism challenges: % parallelizable, long latency to remote memory
- Centralized vs. distributed memory
  - Small MP vs. lower latency, larger BW for larger MP
- Message Passing vs. Shared Address
  - Uniform access time vs. Non-uniform access time

#### **Reading and Schedule**

- This lecture:
  - Appendix F: Vector Processors
  - Chapter 4: 4.1 Introduction Multiprocessors
- Next week, Oct 31<sup>st</sup>: No class
- Next lecture, Nov 7<sup>th</sup>: remainder of chapter 4 (in the afternoon feedback on assignment 2a)
- On Wed Nov 14<sup>th</sup> both at 11.15-13.00h and at 13.45-15.30h lectures in room 402