

Cell Broadband Engine

Chip Multiprocessing and the Cell Broadband Engine
Michael Gschwind

ACM Computing Frontiers 2006

© 2006 IBM Corporation


Computer Architecture at the turn of the millennium
“The end of architecture” was being proclaimed
– Frequency scaling as the performance driver
– State-of-the-art microprocessors:
  • multiple instruction issue
  • out-of-order execution
  • register renaming
  • deep pipelines

Little or no focus on compilers
– Questions about the need for compiler research
– Academic papers focused on microarchitecture tweaks

Chip Multiprocessing and the Cell Broadband Engine, M. Gschwind, Computing Frontiers 2006

The age of frequency scaling
[Figure: Dennard's scaled device (gate, n+ source/drain, wiring, p substrate) under constant-field scaling.
Scaling: voltage V/α, oxide tox/α, wire width W/α, gate length L/α, diffusion depth xd/α, substrate doping α·NA.
Results: higher density ~α², higher speed ~α, power/circuit ~1/α², power density ~constant.
Source: Dennard et al., JSSC 1974.]

[Plot: performance (bips) vs. total FO4 per stage, from 37 down to 7 FO4; deeper pipelines deliver higher performance.]
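These scaling rules can be sanity-checked with a few lines of arithmetic (a sketch; `alpha` is the dimensionless scaling factor from Dennard et al., and the power relation assumes P ∝ C·V²·f):

```python
# Classic (Dennard) constant-field scaling: shrink every linear
# dimension and the supply voltage by a factor alpha > 1.
def dennard_scaling(alpha):
    density = alpha ** 2              # devices per unit area: ~alpha^2
    speed = alpha                     # circuit speed: ~alpha (delay ~1/alpha)
    power_per_ckt = 1 / alpha ** 2    # C*V^2*f: (1/a)*(1/a)^2*(a) = 1/a^2
    power_density = density * power_per_ckt   # ~constant
    return density, speed, power_per_ckt, power_density

# One full generation (alpha = 2): 4x density, 2x speed,
# 1/4 the power per circuit, unchanged power density.
print(dennard_scaling(2.0))  # (4.0, 2.0, 0.25, 1.0)
```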

Frequency scaling
A trusted standby
– Frequency as the tide that raises all boats
– Increased performance across all applications
– Kept the industry going for a decade

Massive investment to keep going
– Equipment cost
– Power increase
– Manufacturing variability
– New materials


… but reality was closing in!
[Plots: performance (bips) and power-performance (bips³/W) relative to the optimum, vs. total FO4 per stage; and active vs. leakage power on a log scale.]


Power crisis
The power crisis is not “natural”
– Created by deviating from ideal scaling theory
– Vdd and Vt not scaled by α
  • additional performance with increased voltage

Laws of physics
– Pchip = ntransistors × Ptransistor
– Marginal performance gain per transistor is low
  • significant power increase
  • decreased power/performance efficiency

[Plot: Tox (Å), Vdd, and Vt vs. gate length Lgate (µm), showing the deviation from classic scaling.]


The power inefficiency of deep pipelining
[Plot: bips and bips³/W relative to the optimal FO4, vs. total FO4 per stage; the performance-optimal design point is a much deeper pipeline than the power-performance-optimal point. Source: Srinivasan et al., MICRO 2002.]

Deep pipelining increases the number of latches and the switching rate
– power increases with at least f²
Latch insertion delay limits the gains of pipelining


Hitting the memory wall
Latency gap
– Memory speedup lags behind processor speedup; limits ILP
Chip I/O bandwidth gap
– Less bandwidth per MIPS

The latency gap acts as an application bandwidth gap:

usable bandwidth = (no. of in-flight requests × avg. request size) / roundtrip latency

– Typically (much) less than chip I/O bandwidth

[Plot: normalized MFLOPS vs. memory latency, 1990–2003. Source: McKee, Computing Frontiers 2004. Formula source: Burger et al., ISCA 1996.]
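The bandwidth relation above is Little's law applied to memory requests; a tiny sketch (the numbers are illustrative, not measurements from the talk):

```python
def usable_bandwidth(inflight_requests, request_bytes, roundtrip_ns):
    """Little's law for memory: sustained bandwidth equals the bytes
    kept in flight divided by the time each byte stays in flight."""
    return inflight_requests * request_bytes / roundtrip_ns  # bytes/ns = GB/s

# A conventional core keeping 8 cache misses of 128 bytes in flight
# over a 100 ns roundtrip sustains only ~10 GB/s, no matter how much
# raw pin bandwidth the chip provides.
print(usable_bandwidth(8, 128, 100))  # 10.24
```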


Our Y2K challenge
10× performance of desktop systems
1 TeraFlop/second with a four-node configuration
1 Byte of bandwidth per 1 Flop
– the “golden rule for balanced supercomputer design”
Scalable design across a range of design points
Mass-produced and low cost


Cell Design Goals
Provide the platform for the future of computing
– 10× performance of desktop systems shipping in 2005

Computing density as the main challenge
– Dramatically increase performance per X
  • X = area, power, volume, cost, …

Single-core designs offer diminishing returns on investment
– In power, area, design complexity and verification cost


Necessity as the mother of invention
Increase power-performance efficiency
– Simple designs are more efficient in terms of power and area
Increase memory subsystem efficiency
– Increase data transaction size
– Increase the number of concurrently outstanding transactions
  • exploit a larger fraction of chip I/O bandwidth
Use CMOS density scaling
– Exploit density instead of frequency scaling to deliver increased aggregate performance
Use compilers to extract parallelism from applications
– Exploit application parallelism to translate aggregate performance into application performance


Exploit application-level parallelism
Data-level parallelism
– SIMD parallelism improves performance with little overhead
Thread-level parallelism
– Exploit application threads with a multi-core design approach
Memory-level parallelism
– Improve memory access efficiency by increasing the number of parallel memory transactions
Compute-transfer parallelism
– Transfer data in parallel with execution by exploiting application knowledge


Concept phase ideas
Concept-map themes around the chip multiprocessor:
– large architected register set
– modular design and design reuse
– high frequency and low power
– control the cost of coherency
– reverse Vdd scaling for low power
– efficient use of the memory interface


Cell Architecture
Heterogeneous multicore system architecture
– Power Processor Element (PPE) for control tasks
– Synergistic Processor Elements (SPEs) for data-intensive processing

Synergistic Processor Element (SPE) consists of
– Synergistic Processor Unit (SPU)
– Synergistic Memory Flow Control (SMF)

PPE: 64-bit Power Architecture with VMX

[Block diagram: eight SPEs (SXU, local store, SMF; 16B/cycle each) and the PPE (PXU, L1, L2) on the Element Interconnect Bus (EIB, up to 96B/cycle); MIC to dual XDR™ memory at 16B/cycle and BIC to FlexIO™ at 16B/cycle (2×).]



Shifting the Balance of Power with Cell Broadband Engine
Data processor instead of control system
– Control-centric code is stable over time
– Big growth in data processing needs
  • modeling • games • digital media • scientific applications

Today’s architectures are built on a 40-year-old data model
– Efficiency as defined in 1964
– Big overhead per data operation
– Parallelism added as an afterthought


Powering Cell – the Synergistic Processor Unit
[Block diagram: SPU internals. Instruction fetch reads 64B lines from the single-port local store SRAM into an instruction line buffer (ILB); issue/branch logic dispatches 2 instructions; the vector register file (VRF) feeds the VFPU, VFXU, permute (PERM), and load/store (LSU) units over 16B datapaths; the SMF moves 128B lines. Source: Gschwind et al., Hot Chips 17, 2005.]


Density Computing in SPEs
Today, execution units are only a fraction of core area and power
– The bigger fraction goes to other functions
  • address translation and privilege levels
  • instruction reordering
  • register renaming
  • cache hierarchy

Cell changes this ratio to increase performance per area and power
– Architectural focus on data processing
  • wide datapaths
  • more, and wider, architectural registers
  • data privatization and a single level of processor-local store
  • all code executes in a single (user) privilege level
  • static scheduling


Streamlined Architecture
Architectural focus on simplicity
– Aids achievable operating frequency
– Optimize circuits for the common performance case
– Compiler aids in layering traditional hardware functions
– Leverage 20 years of architecture research

Focus on statically scheduled data parallelism
– Focus on data-parallel instructions
  • no separate scalar execution units
  • scalar operations mapped onto the data-parallel dataflow
– Exploit wide datapaths
  • data processing
  • instruction fetch
– Address impediments to static scheduling
  • large register set
  • reduce latencies by eliminating non-essential functionality
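Mapping scalar work onto the data-parallel dataflow relies on select operations instead of branches. A pure-Python sketch of the idea (a 4-wide vector modeled as a list; the real SPU does this with its bitwise select instruction):

```python
# Data-parallel select: per-element choice between two vectors,
# replacing a data-dependent branch.
def select(mask, a, b):
    # mask element True -> take the element from a, else from b
    return [x if m else y for m, x, y in zip(mask, a, b)]

def vec_abs(v):
    # |x| without any branch: compute both candidates, then select
    mask = [x >= 0 for x in v]
    return select(mask, v, [-x for x in v])

print(vec_abs([3, -1, 0, -7]))  # [3, 1, 0, 7]
```

Because both candidate results are computed unconditionally, the schedule is static and branch-free, which suits a statically scheduled, deeply pipelined SIMD core.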


SPE Highlights
RISC organization
– 32-bit fixed-length instructions
– Load/store architecture
– Unified register file

User-mode architecture
– No page translation within the SPU

SIMD dataflow
– Broad set of operations (8, 16, 32, 64 bit)
– Graphics SP-float
– IEEE DP-float

256KB local store
– Combined instruction and data store

DMA block transfer
– Using Power Architecture memory translation

[Die plot: 14.5mm² in 90nm SOI; labeled units include the GPR array, even and odd fixed-point pipes, forwarding network, local store arrays, and the channel/DMA/SMM/ATO/RTB and bus interface (SBI) blocks. Source: Kahle, Spring Processor Forum 2005.]


Synergistic Processing
Concept map: the SPU design decisions reinforce each other.
– data-level parallelism: large register file, data-parallel select, scalar layering
– static scheduling: static prediction, deterministic latency, large basic blocks, ILP optimization
– high frequency: shorter pipeline, simpler µarch, instruction bundling, wide datapaths
– compute density: local store, single port, sequential fetch


Efficient data-sharing between scalar and SIMD processing
Legacy architectures separate scalar and SIMD processing
– Data sharing between SIMD and scalar processing units is expensive
  • transfer penalty between register files
– Defeats the data-parallel performance improvement in many scenarios

A unified register file facilitates data sharing for efficient exploitation of data parallelism
– Allows exploitation of data parallelism without a data transfer penalty
  • data parallelism is then always an improvement


SPU Pipeline
SPU pipeline front end:
IF1 IF2 IF3 IF4 IF5 · IB1 IB2 · ID1 ID2 ID3 · IS1 IS2

SPU pipeline back end (each class: RF1 RF2, execute stages, WB):
– branch instruction: resolved after RF1 RF2
– permute instruction: EX1–EX4
– load/store instruction: EX1–EX6
– fixed-point instruction: EX1–EX2
– floating-point instruction: EX1–EX6

Legend: IF instruction fetch, IB instruction buffer, ID instruction decode, IS instruction issue, RF register file access, EX execution, WB write back


SPE Block Diagram
[Block diagram: floating-point, fixed-point, permute, load/store, branch, and channel units with result forwarding and staging around the register file; instruction issue unit / instruction line buffer; 256kB single-port local store SRAM with 128B read and 128B write ports shared with the DMA unit; on-chip coherent bus at 8 bytes/cycle; internal datapaths of 16, 64, and 128 bytes/cycle.]



Synergistic Memory Flow Control
The SMF implements memory management and mapping
The SMF operates in parallel with the SPU
– Independent compute and transfer
– Command interface from the SPU
  • DMA queue decouples SMF and SPU
  • MMIO interface for remote nodes

Block transfer between system memory and local store

SPE programs reference system memory using the user-level effective address space
– Ease of data sharing
– Local-store-to-local-store transfers
– Protection

[Block diagram: the SPE pairs the SXU and local store with an SMF containing a DMA engine and DMA queue, atomic facility, MMU, RMT, and bus interface control with MMIO, attached to the data, snoop, and control buses.]



Synergistic Memory Flow Control
Bus interface
– Up to 16 outstanding DMA requests
– Requests up to 16KByte
– Token-based bus access management

Element Interconnect Bus
– Four 16-byte data rings
– Multiple concurrent transfers
– 96B/cycle peak bandwidth
– Over 100 outstanding requests
– 200+ GByte/s @ 3.2+ GHz

[Diagram: the EIB's central data arbiter connects the PPE, SPE1/3/5/7, and IOIF1 on one side and the MIC, SPE0/2/4/6, and BIF/IOIF0 on the other, each via 16B-wide on- and off-ramps. Source: Clark et al., Hot Chips 17, 2005.]


Memory efficiency as key to application performance
The greatest challenge is translating peak performance into application performance
– Peak Flops are useless without a way to feed them data

A cache miss provides too little data, too late
– Inefficient for streaming / bulk data processing
– Initiates the transfer only when its results are already needed
– Application-controlled data fetch avoids not-on-time data delivery

The SMF is a better way to look at an old problem
– Fetch data blocks based on the algorithm
– Blocks reflect application data structures


Traditional memory usage
Long-latency memory access operations are exposed
Cannot overlap 500+ cycles with computation
Memory access latency severely impacts application performance

[Timeline: execution time divided among computation, memory protocol, memory idle, and memory contention.]



Exploiting memory-level parallelism
Reduce the performance impact of memory accesses through concurrent accesses
Carefully scheduled memory accesses in numeric code
Out-of-order execution increases the chance of discovering more concurrent accesses
– But overlapping a 500-cycle latency with computation using OoO is illusory
Bandwidth is limited by queue size and roundtrip latency
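How much concurrency an out-of-order core would need follows from the bandwidth-delay product (a sketch with illustrative numbers):

```python
def required_inflight(bandwidth_gbs, latency_ns, line_bytes):
    """Requests that must be in flight to sustain a target bandwidth
    across a memory roundtrip (bandwidth-delay product)."""
    bytes_in_flight = bandwidth_gbs * latency_ns  # GB/s * ns = bytes
    return bytes_in_flight / line_bytes

# Covering a 500-cycle roundtrip at 3.2 GHz (~156 ns) at 25.6 GB/s
# with 128-byte lines needs ~31 misses in flight, far more than a
# typical out-of-order miss queue tracks.
print(round(required_inflight(25.6, 156.25, 128)))  # 31
```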


Exploiting MLP with Chip Multiprocessing
More threads ⇒ more memory-level parallelism
– Overlap accesses from multiple cores


A new form of parallelism: CTP
Compute-transfer parallelism
– Concurrent execution of compute and transfer increases efficiency
  • avoids costly and unnecessary serialization
– An application thread has two threads of control
  • SPU: computation thread
  • SMF: transfer thread

Optimize memory access and data transfer at the application level
– Exploit programmer and application knowledge about data access patterns
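The SPU/SMF split maps naturally onto double buffering. A pure-Python sketch of the pattern (the `fetch` helper is a stand-in for an SMF DMA get; real SPE code would issue DMA commands and wait on tag groups):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(block_id):
    # stand-in for an SMF DMA transfer from system memory to local store
    return [block_id * 10 + i for i in range(4)]

def process_stream(block_ids, compute):
    results = []
    with ThreadPoolExecutor(max_workers=1) as smf:  # the "transfer thread"
        pending = smf.submit(fetch, block_ids[0])   # prefetch the first block
        for nxt in block_ids[1:] + [None]:
            buf = pending.result()                  # wait for the transfer
            if nxt is not None:
                pending = smf.submit(fetch, nxt)    # start the next transfer...
            results.append(compute(buf))            # ...while computing
    return results

print(process_stream([0, 1, 2], sum))  # [6, 46, 86]
```

With the next transfer always in flight while the current block is processed, the memory roundtrip hides behind computation instead of serializing with it.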


Memory access in Cell with MLP and CTP
Super-linear performance gains observed
– Decoupling data fetch and use


Synergistic Memory Management
Concept map: synergistic memory management goals and mechanisms.
– timely data delivery: decouple fetch and access, parallel compute & transfer
– high throughput: deeper fetch queue, many outstanding transfers, multi-core
– application control: explicit data transfer, block fetch, reduced coherence cost
– local store: reduced miss complexity, improved code generation, shared I&D
– storage density and low-latency access: single port, sequential fetch


Cell BE Implementation Characteristics
241M transistors, 235mm²
Design operates across a wide frequency range
– Optimize for power & yield
> 200 GFlops (SP) @ 3.2GHz
> 20 GFlops (DP) @ 3.2GHz
Up to 25.6 GB/s memory bandwidth
Up to 75 GB/s I/O bandwidth
100+ simultaneous bus transactions
– 16+8-entry DMA queue per SPE

[Plot: frequency increase vs. power consumption (relative) as a function of supply voltage. Source: Kahle, Spring Processor Forum 2005.]
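The single-precision figure can be reproduced from the organization described earlier, assuming each of the 8 SPEs retires one 4-wide single-precision fused multiply-add (2 flops per lane) per cycle:

```python
def peak_sp_gflops(cores, simd_lanes, flops_per_lane, ghz):
    # peak rate = cores * lanes * flops per lane per cycle * cycles/ns
    return cores * simd_lanes * flops_per_lane * ghz

# 8 SPEs x 4 SP lanes x 2 flops (multiply-add) x 3.2 GHz
print(peak_sp_gflops(8, 4, 2, 3.2))  # 204.8
```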


Cell Broadband Engine


System configurations
Game console systems · Blades · HDTV · Home media servers · Supercomputers

[Diagrams of Cell BE system configurations, each processor with its own XDR™ memory and IOIF links: a single Cell BE processor (IOIF0/IOIF1); two processors coupled directly over the BIF; and four processors connected through a switch on the BIF.]


Compiling for the Cell Broadband Engine
The lesson of “RISC computing”
– The architecture provides fast, streamlined primitives to the compiler
– The compiler uses primitives to implement higher-level idioms
– If the compiler can’t target it, don’t include it in the architecture

Compiler focus throughout the project
– Prototype compiler soon after the first proposal
– The Cell compiler team has made significant advances in
  • automatic SIMD code generation
  • automatic parallelization
  • data privatization

[Diagram: the Cell design balances raw hardware performance against programmability and programmer productivity.]

Single source automatic parallelization
Single-source compiler
– Addresses legacy code and programmer productivity
– Partitions a single source program into SPE and PPE functions
Automatic parallelization
– Across multiple threads
– Vectorization to exploit SIMD units
Compiler-managed local store
– Data and code blocking and tiling
– “Software cache” managed by the compiler
Range of programmer input
– Fully automatic
– Programmer-guided
  • pragmas, intrinsics, OpenMP directives

Prototype developed in IBM Research


Compiler-driven performance
Code partitioning
– Thread-level parallelism
– Handle local store size
– “Outlining” into SPE overlays
Data partitioning
– Data access sequencing
– Double buffering
– “Software cache”
Data parallelization
– SIMD vectorization
– Handling of data with unknown alignment

Source: Eichenberger et al., PACT 2005


Raising the bar with parallelism…
Data parallelism and static ILP in both core types
– Results in low overhead per operation
Multithreaded programming is key to great Cell BE performance
– Exploit application parallelism with 9 cores
– Regardless of whether code exploits DLP & ILP
– A challenge regardless of homogeneity/heterogeneity
Leverage parallelism between data processing and data transfer
– A new level of parallelism exploiting bulk data transfer
– Simultaneous processing on SPUs and data transfer on SMFs
– Offers superlinear gains beyond MIPS scaling


… while understanding the trade-offs
Uniprocessor efficiency is actually low
– Gelsinger’s Law captures historic (performance) efficiency
  • 1.4× performance for 2× transistors
– Marginal uniprocessor (performance) efficiency is 40% (or lower!)
  • and power efficiency is even worse

The “true” bar is marginal uniprocessor efficiency
– A multiprocessor “only” has to beat a uniprocessor to be the better solution
– Much low-hanging fruit to be picked by multithreading applications
  • embarrassing application parallelism that has not yet been exploited
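The 40% figure follows directly from Gelsinger's rule of thumb (a quick check; marginal efficiency here means extra performance per extra transistor, relative to the baseline core):

```python
def marginal_perf_efficiency(perf_gain, transistor_gain):
    """Delta-performance / delta-transistors, both relative to a 1x baseline."""
    return (perf_gain - 1.0) / (transistor_gain - 1.0)

# 2x transistors -> ~1.4x performance: the added transistors return
# only ~40% of their "linear" value, a bar a second core (roughly 1x
# performance per 1x transistors) clears easily.
print(round(marginal_perf_efficiency(1.4, 2.0), 2))  # 0.4
```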


Cell: a Synergistic System Architecture
Cell is not a collection of different processors, but a synergistic whole
– Operation paradigms, data formats and semantics are consistent
– Shared address translation and memory protection model

SPEs optimized for efficient data processing
– SPEs share the Cell system functions provided by the Power Architecture
– The SMF implements the interface to memory
  • copy-in / copy-out to local storage

The Power Architecture provides system functions
– Virtualization
– Address translation and protection
– External exception handling

The Element Interconnect Bus integrates the system as its data transport hub


A new wave of innovation
CMPs everywhere
– same-ISA CMPs and different-ISA CMPs
– homogeneous and heterogeneous CMPs
– thread-rich CMPs, GPUs as compute engines
Compilers and software must and will follow…
Novel systems
– Power4, BOA, Piranha, Cyclops, Niagara, BlueGene, Power5, Xbox360, Cell BE


Conclusion
Established performance drivers are running out of steam
We need a new approach to building high-performance systems
– Application- and system-centric optimization
– Chip multiprocessing
– Understand and manage memory access
– Compilers to orchestrate data and instruction flow
We need a new push on understanding workload behavior
We need a renewed focus on compilation technology
An exciting journey is ahead of us!


© Copyright International Business Machines Corporation 2005, 2006. All Rights Reserved. Printed in the United States, May 2006.

The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both: IBM, IBM Logo, Power Architecture. Other company, product and service names may be trademarks or service marks of others.

All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary. While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made. THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.
