Fiber-to-the-processor and other challenges for photonics in
future systems
A.F.J. Levi
http://www.usc.edu/alevi
with contributions from
Bindu Madhavan – USC
and
Agilent Technologies
Stanford, April 21, 2005
What is a system?
VSR interconnect
➢ Understand electronics in systems
– Definition of a system
• Complex enough to require a system area network
– Multi-processor rack-based systems, routers, data centers, telephone switches, automobiles, etc., are systems
– Cell phones, telephone handsets, cameras, pocket calculators, etc., are not complex enough to be systems
– Chip IO performance
– Backplane performance
➢ Chassis systems are composed of a passive backplane with connectors for linecards
– The backplane supplies power to the linecards
– Connectors are interconnected by traces in the backplane
➢ Chassis systems have slots for linecards that plug into the backplane at connectors
➢ Total chip-to-chip interconnect length is up to 1 meter
➢ Interconnect loss is a tradeoff between
– Cost – improved line characteristics using costlier dielectric materials, blind-via techniques, and counterboring of backplane press-fit connector vias
– Density – reduced signal density at the linecard-backplane interface allows for cheaper PCB manufacturing options
[Figure: chassis backplane cross-section showing the IC, package-to-PCB transition, line card trace and via, backplane connector, backplane via, and backplane trace. Example: a 5 RU (8.75") chassis with 128 ports × 40 × 2 Gb/s = 10.24 Tb/s across the backplane and line cards at 8 × 8 × 40 × 2 Gb/s = 5.12 Tb/s.]
System interconnect hierarchy and advanced optical solutions
FTTP
[Figure: system interconnect hierarchy. Transfer bit rate (10 k to 10 T) versus interconnect length (0.1 nm to 1 km) for Gate-to-Gate, Chip-to-Chip, Substrate-to-Substrate, Board-to-Board, Shelf-to-Shelf, and Frame-to-Frame links, marking the length at which electrical transmission lines are required. Regions show where electronics gives way to parallel optical data links (POLO, PONI), the conventional optical data link ("LAN"), and fiber-to-the-processor applications; increasing system functionality drives both axes. The length axis extends down past the electron Bohr radius in GaAs to the single atom, where quantum effects are accessed by photonics.]
A. F. J. Levi, Optical Interconnects in Systems, Proc. IEEE 88, 1264-1270 (2000)
Parallel optical interconnect products emerge from DARPA-funded POLO – PONI – MAUI programs
POLO-PONI-MAUI
➢ POLO (1994 – 1997)
➢ PONI (1997 – 2000) inspired products for 10 m – 600 m interconnect lengths: Agilent, Zarlink, Picolight, Gore, Emcore, Paracer, E20, Silicon Light Machines, Cielo
– Agilent announced 12 × 3.3 Gb/s = 40 Gb/s, November 2000
– Full production November 2001; customers: Nortel, Cisco, IBM
– 12 × 10 Gb/s = 120 Gb/s demonstrated 2003
➢ MAUI (2002 – present)
– Combination of VCSEL WDM and parallel fiber-optic technology for FTTP, aimed at 1 m – 100 m interconnect-length applications
– 240 Gb/s at < 1 W demonstrated 2004
[Figure: 1995 – 2004 timeline with module photographs showing VCSELs/PINs, optics, guide pins, and passives; an 8 mm × 6 mm PMOSA built from a silicon IC, flex circuit, and metal base delivers 240 – 1000 Gb/s at < 1 W.]
Parallel optics and CMOS integration
POLO
➢ HP experimental Afterburner JetStream ring network (July 1995)
– 1 Gb/s Tx, 1 Gb/s Rx
➢ POLO point-to-point host interface for parallel optics (October 1997)
– 16 Gb/s Tx, 16 Gb/s Rx
➢ 20× JetStream on a chip: ring network for parallel optics integrated in a single CMOS IC (December 2000)
– 20 Gb/s Tx, 20 Gb/s Rx
– Link Adapter Chip for the parallel fiber-optic ring network
• 400,000 transistors, includes ring MAC
• 10.2 mm × 7.3 mm in 0.5 µm CMOS
• Tape-out 8.17.00, received 11.10.00
[Figure: photographs of the three generations of high-speed parallel fiber-optic interfaces on their host boards (210 mm and 144 mm scale markers), dated July 1995, October 1997, and December 2000.]
New markets for optical interconnects: Solving the electronics interconnect and packaging mess!
FTTP
[Figure: conventional node with CPU, memory controller, IO controller, PCI cards, and main memory, highlighting the memory access bottleneck and the SAN.]
➢ Integration trend places multiple processors on a single chip
– Chip multi-processor (CMP) from Broadcom (SiByte BCM1250)
➢ Main memory likely to remain separate in most systems
– 10 nm CMOS circuits have 100M transistors/mm² (see the sketch below)
• 6 transistors per bit in SRAM → ~16 Mb = 2 MB/mm² or 200 MB/cm²
• 1 transistor per bit in DRAM → 100 Mb = 12 MB/mm² or 1.2 GB/cm²
– Might be useful for a single-chip notebook computer, or make an interesting L2 cache for a CMP
➢ Multiple processor boards in chassis systems are connected by switches
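A minimal back-of-envelope check of the memory-density bullets above, assuming the slide's figure of 100M transistors/mm² for 10 nm CMOS and that every transistor is spent on the bit array; the function name is illustrative, not from the talk:

```python
TRANSISTORS_PER_MM2 = 100e6  # slide's assumption for 10 nm CMOS

def memory_density_mb_per_mm2(transistors_per_bit):
    """Capacity per mm^2 if all transistors form the bit array."""
    bits = TRANSISTORS_PER_MM2 / transistors_per_bit
    return bits / 8 / 1e6  # bits -> megabytes

print(memory_density_mb_per_mm2(6))  # SRAM, 6T cell: ~2.1 MB/mm^2 (~208 MB/cm^2)
print(memory_density_mb_per_mm2(1))  # DRAM, 1T cell: ~12.5 MB/mm^2 (~1.25 GB/cm^2)
```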
1U (1.75") thick 20-port GbE switch/router for chassis servers (2001)
System example
➢ 96 W, hot-swappable 20-port GbE router
➢ 15.5" × 5.35"
➢ ~2300 components
➢ ~7000 nets, ~11000 pins
➢ Electrical and optical GbE IO
– 8 GbE optical links
– 8 GbE backplane links
– 4 GbE Cat-5 links
[Figure: board photograph with callouts: SERDES + dual quad-channel MMF optical modules; quad 8-port, mesh-connected GbE switch ICs with 20 external ports; clock generation; quad serial-link IC for GbE backplane interconnect; GbE PHY IC; eight GbE serial backplane interconnects over low-cost CPCI connectors; 100 W, 48 V, 20 A brick; management microprocessor and support circuitry.]
Integration- and packaging-driven processor crisis: The case for fiber-to-the-processor (FTTP)
System level issues
➢ Electronics fails to deliver
– Power crisis: projected kW CPU not viable
• Processor crisis driving multi-core processor design with increased IO demand and only a fraction of transistors active at any one time
• Intel moves to CMP; Pentium IV uni-processor development terminated (2005)
– Bandwidth density and latency crisis
• Increasing mismatch between memory bus bandwidth and CPU
• Many CPU cycles wasted after a cache miss
– Signal integrity crisis
• EMI, reflections, crosstalk, and device noise may lead the way to optical interconnects
• High-speed electrical signaling not reliable
• $400M i820 memory translator hub recall because of electrical noise (5.10.00)
• 1.13 GHz PIII recall because of electrical noise in a circuit element (8.28.00)
➢ Fiber-to-the-processor is a new design point
– Less power and lower power density in a distributed system using a WDM SAN
– Better signal integrity, optical isolation
– More bandwidth density gives reduced latency in node and SAN
– Removes the electrical backplane bottleneck for future multi-processor systems
[Figure: log-scale CPU power (W, 1 – 1000) versus year, 1980 – 2010, tracking i386SX through Pentium 4 and Itanium against Moore's Law for the on-chip high-performance local clock (SIA '97).]
[Figure: Ethernet switch-port data rate (Gb/s, 0.01 – 10) versus year, 1994 – 2004; Ethernet data-rate deployment tracks Moore's Law, 2× every 2 years.]
[Figure: bus bandwidth (Gb/s, 0.1 – 1000) for processors from i386DX-16 and i486DX-25/33 through P1-66 to P1-233, P2-450, P3-733, P4-1500 to P4-3200, and Itanium-2, comparing external memory bandwidth with internal CPU bandwidth. Internal CPU bandwidth accounts for superscalar microprocessor architecture by multiplying internal datapath width by the number of instructions that can be issued simultaneously.]
Optical interconnects and the memory access bottleneck
FTTP
[Figure: the same external-memory versus internal-CPU bus-bandwidth chart as the previous slide (i386DX-16 through Itanium-2, 0.1 – 1000 Gb/s). Optical interconnect can fill the memory-access performance gap with a bandwidth edge density of 60 – 600 Gb/s/mm.]
FTTP: A new architecture enabled by optical interconnects and high-performance CMOS integration
FTTP
System level issues
➢ New technology
– Optical interconnect
• Ultra-high bandwidth
• Low power
• Low latency
➢ Integration
– CMOS interface to optics
• High-performance crossbar switch
➢ New switch-based architecture
– Next-generation scalable NUMA
• Switch integrated in processor and memory
Driving to a "technology convergence point": optical interconnect + CMOS optical interface + switch-based architecture
[Figure: multi-processor switch-based network of nodes (P1, P2, L3 at 5 Tb/s each) connected through the SAN; high-performance CMOS interface to parallel optics and WDM VCSELs.]
Example latency estimate
[Figure: four nodes, each with a processor (P), controller (Ctl), memory, and crossbar, chained through the SAN; per-segment round-trip-time contributions of 10 ns, 16 ns, 20 ns, 30 ns, and 50 ns are marked along the path.]
➢ Round-trip time, summed per segment (as the sketch after this list computes):
80 ns + 10 ns + 16 ns + 30 ns + 16 ns + 30 ns + 16 ns + 30 ns + 16 ns + 10 ns + 20 ns + 50 ns = 324 ns
– Cycle counts: 10 Cy at 125 MHz (80 ns); 5 Cy at 500 MHz (10 ns); 4+4 Cy at 500 MHz (16 ns); 15 Cy at 500 MHz (30 ns); 4+4 Cy at 500 MHz (16 ns); 15 Cy at 500 MHz (30 ns)
➢ A 10× increase in clock rate reduces round-trip time ~10×
➢ Assume time-of-flight ~ 0 ns
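A minimal sketch reproducing the slide's 324 ns round-trip sum, assuming zero time of flight; cycle-quoted delays convert as cycles/clock, and the grouping of the remaining marked segments is taken directly from the figure:

```python
def ns(cycles, clock_hz):
    """Delay in nanoseconds for a given cycle count and clock."""
    return cycles / clock_hz * 1e9

segments_ns = [ns(10, 125e6),         # 80 ns
               ns(5, 500e6),          # 10 ns
               ns(4 + 4, 500e6),      # 16 ns
               ns(15, 500e6),         # 30 ns
               ns(4 + 4, 500e6),      # 16 ns
               ns(15, 500e6),         # 30 ns
               16, 30, 16, 10,        # remaining marked segments (ns)
               20, 50]
print(sum(segments_ns))  # 324.0 ns

# 10x the clocks (125 MHz -> 1.25 GHz, 500 MHz -> 5 GHz) scales every
# cycle-quoted term down 10x, which is the slide's ~10x round-trip claim.
```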
System impact of increased available bandwidth: Reduced message latency and improved scaling
➢ Bisection bandwidth and message latency for a k-ary n-cube network
– A network with n dimensions and k nodes per dimension:

$$BW_{\mathrm{bisection}} = 2 \cdot BW_{\mathrm{port}} \cdot k^{\,n-1}$$

$$t_{\mathrm{message\ latency}} = D \cdot (t_r + t_s + t_w) + \frac{L}{BW}$$

$$N = k^{n}, \qquad D = \frac{n k}{4}, \qquad \mathrm{Ports} = 2n$$

where
N → total number of nodes
k → number of nodes in each dimension
n → number of dimensions
D → average distance between any pair of nodes
t_r → time to make a routing decision (10 cycles, < 20 ns)
t_s → delay through the switch (6 cycles, < 20 ns)
t_w → the interconnection delay (1.0 m hop length)
BW → bandwidth of each port = B × W, where B is the bandwidth of each line and W is the port width
L → packet length (1 kB)

➢ The 4 SAN ports can be used to design a 2-D torus with N = k² processors (n = 2, N = 16, 64, 256, 1024)
➢ Message latency is then

$$t_{\mathrm{message\ latency}} = \frac{k}{2}\,(t_r + t_s + t_w) + \frac{L}{BW}$$

➢ For a 32-processor network (see the sketch below)
– A 32 GB/s, 4-port switch achieves ×1.5 better no-load average message latency compared with a 20 GB/s, 6-port switch
• (×1.36 better no-load average message latency for 2048 processors)
[Figure: a 3-ary 2-cube (2-D torus) and a 3-ary 3-cube (3-D torus, wrap-around not shown) of processor nodes; link labels 3.2 GB/s = 25.6 Gb/s and 32 GB/s = 256 Gb/s.]
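A minimal sketch of the no-load latency model above, t = D·(t_r + t_s + t_w) + L/BW with D = nk/4. The hop delay t_w below is an assumption (the slide gives only the 1.0 m hop length), so this illustrates the model rather than reproducing the exact ×1.5 comparison:

```python
def message_latency_ns(n, num_nodes, bw_gbytes, t_r=20e-9, t_s=20e-9,
                       t_w=5e-9, packet_bytes=1024):
    """No-load average message latency for a k-ary n-cube, in ns."""
    k = num_nodes ** (1.0 / n)    # nodes per dimension
    d = n * k / 4                 # average inter-node distance (hops)
    serialization = packet_bytes / (bw_gbytes * 1e9)
    return (d * (t_r + t_s + t_w) + serialization) * 1e9

# 4-port switch -> 2-D torus; 6-port switch -> 3-D torus.
print(message_latency_ns(n=2, num_nodes=32, bw_gbytes=32))
print(message_latency_ns(n=3, num_nodes=32, bw_gbytes=20))
```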
System impact of reduced cache miss rate
➢ Simulation assumptions
– L1 hit rate: 90% (based on third-party test results)
– L2 access latency: 9 cycles (based on P4)
• http://www.aceshardware.com/Spades/read.php?article_id=20000190
– L3 access latency: 20 cycles (based on Merced)
• http://www.geek.com/procspec/features/itanium/index.htm
– Assume 96% of memory accesses are satisfied by L1 and L2
– 5.0 GHz processor speed
– 1.3 cycles per instruction
• Using Intel assumptions (each instruction is sub-divided into micro-ops during execution)
• http://developer.intel.com/design/pentium4/manuals/248966.htm
➢ Impact of memory access bandwidth on cache hit rate not taken into account
– Improved BW improves hit rate because of reduced pre-fetch distance
➢ Performance of FTTP with only an L2 cache and a 96% cache hit rate equals RAMBUS with L2 and L3 at a 99.3% cache hit rate (see the sketch below)
– Adding an L3 cache to hide memory access latency does not outperform FTTP
[Figure: performance chart comparing 99.3% hit, 600 MIPS against 96.0% hit, 600 MIPS; arrow indicates improving performance.]
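A hedged sketch of the kind of estimate behind the comparison above: effective MIPS from base CPI plus miss stalls. The miss-penalty figures are illustrative assumptions; the slide does not state the RAMBUS or FTTP memory access times used in its simulation:

```python
CLOCK_HZ = 5.0e9   # slide assumption
BASE_CPI = 1.3     # slide assumption (Intel figure)

def mips(hit_rate, miss_penalty_cycles):
    """Effective MIPS given combined cache hit rate and miss penalty."""
    cpi = BASE_CPI + (1.0 - hit_rate) * miss_penalty_cycles
    return CLOCK_HZ / cpi / 1e6

# A 96% hit rate backed by a fast (FTTP-like) memory path can match a
# 99.3% hit rate backed by a much slower one.
print(mips(0.960, miss_penalty_cycles=150))  # assumed fast main memory
print(mips(0.993, miss_penalty_cycles=860))  # assumed slow main memory
```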
Fiber-to-the-processor: Exposing raw CPU performance
System level issues
➢ Single-chip multi-CPU module with integrated switch and optical system area network (SAN)
– SoC internal bandwidth: 10 GHz × 128 × 2 × 2 = 5.12 Tb/s
➢ Main memory module with a high-performance optical IO port
➢ All off-chip high-speed signals are optical (the sketch below checks the arithmetic)
– 1.28 Tb/s × 5 ports = 6.4 Tb/s SoC IO bisection bandwidth
• RDMA ready
• A 1RU electrical backplane supports only two (2) SoC processors
• The number of SoC processors using an FTTP backplane is determined by power dissipation
➢ All off-chip slow-speed signals are electrical (including electrical power)
[Figure: SoC with CPUs (L1, L2, L3 caches), RDMA engine, and memory controller with crossbar switch; North/South/East/West WDM processor SAN ports, each an optical port at 2 × 80 GB/s (WDM, 2 × 64 × 10 Gb/s = 1.28 Tb/s); 4 × 32 b-wide, 4 Gb/s point-to-point half-duplex electrical data links; single-chip processor and main memory with PIM and TLB in FTTP sockets on a PMOSA module, joined by the fiber-optic interconnect plane.]
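A quick arithmetic check of the bandwidth figures quoted above; the expressions simply mirror the slide's own factors:

```python
soc_internal = 10e9 * 128 * 2 * 2  # 10 GHz x 128 b x 2 x 2
port = 2 * 64 * 10e9               # per optical port: 2 x 64 x 10 Gb/s
bisection = 5 * port               # five SAN ports

print(soc_internal / 1e12)  # 5.12 Tb/s
print(port / 1e12)          # 1.28 Tb/s (= 2 x 80 GB/s)
print(bisection / 1e12)     # 6.4 Tb/s
```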
FTTP exposes raw CPU performance with multiple serial optical chip-to-chip interconnects
➢ Single-chip CPU module (SoC) with FTTP optical interface
➢ Main memory module with a high-performance optical port
– Serial main memory fed by an optical/CMOS interface
➢ All off-chip high-speed signals are optical
➢ All off-chip slow-speed signals are electrical (including electrical power)
➢ Key FTTP enablers:
– Agilent MAUI optical sub-assembly
– USC multi-rate, multi-lane serial CMOS interface
[Figure: single-chip CPU module integrating multiple CPUs (each with L1 and L2, sharing L3 caches) and multiple optical serial links; the optical signaling boundary of the multi-processor SoC feeds the MAUI interconnect fabric and system-wide fiber-optic interconnect plane. MAUI optical port: 2 × 32 GB/s = 512 Gb/s; serial feed to main memory; single-chip processor and main memory with PIM and TLB in FTTP sockets.]
Flip-chip optical socket LGA concept
➢ Today at USC: 1.27 mm pitch FC-LGA, 40 × 40 mm², 960 pins, Rogers 2800 dielectric, estimated price $30 in 10k volume
– 212.5 µm center-to-center IC pad pitch
– Option 1: 6.5 × 6.5 mm² IC = 216 differential IO
– Option 2: 5.0 × 5.0 mm² IC = 108 differential IO
– Package performance
• −3 dB bandwidth > 20 GHz, NEXT < −30 dB
• Can be improved to −3 dB bandwidth ~40 GHz, NEXT < −30 dB
➢ Easily modified to implement an "optical socket" for fiber to the processor
– Package-level optical interconnect for inter-chip optical buses
– An 8 mm × 5 mm chip-scale optical port is a prototype today (see the sketch below)
• Today: 0.48 Tb/s, < 2 W unidirectional fiber-optic port
• Future: > 1 Tb/s, < 1 W unidirectional fiber-optic port
– Includes alignment pins for an MT ferrule with 12-fiber ribbon
Agilent / MAUI – DARPA program
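A back-of-envelope look at the optical-socket throughput above. The 4-wavelength WDM per fiber is an assumption chosen to be consistent with the MAUI figures elsewhere in the talk, not a spec quoted on this slide:

```python
fibers = 12             # MT ferrule, 12-fiber ribbon
wavelengths = 4         # assumed WDM channels per fiber
rate_per_lambda = 10e9  # assumed 10 Gb/s per wavelength

print(fibers * wavelengths * rate_per_lambda / 1e12)  # 0.48 Tb/s
```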
A system architecture roadmap: The FTTP opportunity
FTTP
[Figure: roadmap from 2000 to 2010 of traditional system partitioning across increasing interconnect length scales: processor bus, local I/O bus, backplane, system area network, and local area network. Proprietary buses, PCI, Compact PCI, VME, proprietary interconnect, Gbit Ethernet, and 10/100 Ethernet give way to Rapid I/O, Infiniband, and 10 Gbit Ethernet; increasing system integration drives technology insertion of FTTP and 100 Gbit Ethernet. FTTP target: minimum 1 Tb/s/port × 5 ports/chip.]
The cost of myths
➢ 'Optics will not speed up memory access'
– Said by Howard Davidson, OIDA, October 21, 2004, Burlingame, CA.
– Actually only true for SMP and its current programming model, in which latency is dominated by global directory coherency
• NUMA, which has local coherency, does not suffer from this problem – but you have to change your software
➢ Embracing myths as truths avoids the need to innovate
Impact of decreasing CMOS device feature size on interconnect: 80 Gb/s serial IO
[Figure: "Pad Characteristics" — flip-chip pad pitch (µm, 0 – 250) and high-performance ASIC IO pad count (1000 – 3500) plotted against year since 1958 (43 – 58, i.e. 2001 – 2016).]
Scaling trends
[Figure: fT (GHz, 50 – 400) versus CMOS feature size (0.01 – 0.16 µm); fit y = −91845x³ + 39908x² − 6368.4x + 459.84, R² = 0.9903.]
[Figure: transistor density (10⁴ – 10¹⁰ transistors/mm²) versus minimum CMOS feature size (µm); fit y = 11429·x⁻².]
[Figure: IC IO density — 75 µm pad diameter on a 150 µm pitch.]
➢ Transistor scaling to 10 nm CMOS by 2016
– 100M transistors/mm² (2 Intel Pentium-IV processors), consistent with the density fit evaluated in the sketch below
• Scaling fails due to IO, on-chip wiring, and Vdd ~ 0.8 V, giving 10 – 60 W power dissipation
– 80 Gb/s IO based on PAM-4, fT > 400 GHz, and 400 mW
– High-speed IO pad-pitch improvement limited by crosstalk and package material properties
– 75 µm pad diameter and 150 µm pitch
– 36 bond pads/mm²
– 9 differential-pair IO/mm²
– 18 power and ground pads/mm²
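The density-law fit from the "transistor density versus feature size" chart above, y = 11429·x⁻² with x in µm and y in transistors/mm². Evaluating it at 10 nm recovers the 100M transistors/mm² figure used in these bullets:

```python
def transistors_per_mm2(feature_um):
    """Transistor density from the chart's power-law fit."""
    return 11429 * feature_um ** -2

print(transistors_per_mm2(0.01))  # ~1.1e8 at 10 nm CMOS
print(transistors_per_mm2(0.13))  # ~6.8e5 at 130 nm, for comparison
```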
Challenges for electronics and photonics driven by CMOS scaling
Electronics
Computation
➢ 10 nm CMOS, fT > 400 GHz, < 10⁻¹⁸ J switching energy
– 10 – 12 metal layers
– 100 transistors/µm² for random logic
– 500 transistors/µm² for SRAM cells (0.0122 µm² per single-port SRAM cell)
– 100M transistors/mm² (2 Pentium-IV/mm²)
– 80 Gb/s IO (PAM-4 and fT > 400 GHz)
➢ Integration implies high power density ~ 10 – 60 W/mm²
– Assumes 110 °C junction temperature and Si thermal conductivity κ = 1.5 W/(cm·°C)
– Forces 10 mm² area (~ 1 – 6 W/mm²) for a 100M-transistor circuit in 10 nm CMOS (or liquid cooling …)
➢ Distributed architecture on chip
– Benefit from large fT to reduce power, and use high-speed serial IO to reduce packaging cost
– Remaining area for power regulation, RF-style and analog elements, self-test, calibration
Communication
➢ Controlled-impedance launch to the package trace with S11 < −10 dB restricts flip-chip IO pitch on the IC/package to 150 µm
– 9 differential IO/mm² suggests high-speed serial, which also reduces backplane design effort
➢ Low-loss (< −3 dB), low-crosstalk (< −30 dB), dense-IO electrical packages require
– tan δ < 0.002
– εr < 2.5
– Via technology: high aspect ratio, blind vias, tight pad overlap of via, relatively tight registration
– Low-loss-tangent PCB dielectric (tan δ < 0.002)
➢ A high-density, "perfect" electrical backplane connector is required that is mechanically reliable, manufacturable, low-cost, low-NEXT, and impedance-matched at the data rate
[Figure: processor, memory, and communications ICs in packages connected by PCB traces and a connector.]
Challenges for electronics and photonics driven by Moore's Law CMOS scaling
Photonics
Computation
➢ Optical logic and memory not practical at the present time
– Optical devices cannot match electronic feature size (100 transistors/µm² in 10 nm CMOS) and efficiency, or approach computational equivalence for digital processing
➢ Electronic interface to optical devices potentially limited by:
– Bias voltage and current
– Drive voltage and current
– Intimacy of integration requiring fan-in/fan-out of controlled-impedance lines
– Harsh thermal, mechanical, and electromagnetic environment
➢ Slow-speed photonic devices!
– ≤ 20 Gb/s digital modulation of laser diodes
Communication
➢ Fiber optics superior to electrical interconnect on length scales ≥ 1 m, using metrics of signal loss, power dissipation, and bandwidth
➢ Lower-power, higher-impedance lines can be used to interface electronics to optical devices
➢ "Optical PCB-trace" required for intra-chassis interconnect
➢ Optical connector has superior form-factor (3× –