Fiber-to-the-processor and other challenges for photonics in
future systems
A.F.J. Levi
http://www.usc.edu/alevi
with contributions from
Bindu Madhavan – USC
and
Agilent Technologies
Stanford, April 21, 2005
What is a system?
VSR interconnect
➢ Understand electronics in systems
– Definition of a system
• Complex enough to require a system area network
– Multi-processor rack-based systems, routers, data centers, telephone switches, automobiles, etc., are systems
– Cell phones, telephone handsets, cameras, pocket calculators, etc., are not complex enough to be systems
– Chip IO performance
– Backplane performance
➢ Chassis systems are composed of a passive backplane with connectors for linecards
– The backplane supplies power to the linecards
– Connectors are interconnected by traces in the backplane
➢ Chassis systems have slots for linecards that plug into the backplane at connectors
➢ Total chip-to-chip interconnect length is up to 1 meter
➢ Interconnect loss is a tradeoff between
– Cost – improved line characteristics using costlier dielectric materials, blind-via techniques, and counterboring of backplane press-fit connector vias
– Density – reduced signal density at the linecard-backplane interface allows for cheaper PCB manufacturing options
[Figure: chassis backplane cross-section showing the IC, package-to-PCB transition, line card trace and via, backplane connector, backplane via, and backplane trace. Example: a 5 RU (8.75") chassis with 128 ports × 40 × 2 Gb/s = 10.24 Tb/s across the backplane and line cards at 8 × 8 × 40 × 2 Gb/s = 5.12 Tb/s.]
System interconnect hierarchy and advanced optical solutions
FTTP
[Figure: system interconnect hierarchy. Transfer bit rate (10 k to 10 T) versus interconnect length (0.1 nm to 1 km) for Gate-to-Gate, Chip-to-Chip, Substrate-to-Substrate, Board-to-Board, Shelf-to-Shelf, and Frame-to-Frame links, marking the length at which electrical transmission lines are required. Regions show where electronics gives way to parallel optical data links (POLO, PONI), the conventional optical data link ("LAN"), and fiber-to-the-processor applications; increasing system functionality drives both axes. The length axis extends down past the electron Bohr radius in GaAs to the single atom, where quantum effects are accessed by photonics.]
A. F. J. Levi, Optical Interconnects in Systems, Proc. IEEE 88, 1264-1270 (2000)
Parallel optical interconnect products emerge from DARPA-funded POLO – PONI – MAUI programs
POLO-PONI-MAUI
➢ POLO (1994 – 1997)
➢ PONI (1997 – 2000) inspired products for 10 m – 600 m interconnect lengths: Agilent, Zarlink, Picolight, Gore, Emcore, Paracer, E20, Silicon Light Machines, Cielo
– Agilent announced 12 × 3.3 Gb/s = 40 Gb/s, November 2000
– Full production November 2001; customers: Nortel, Cisco, IBM
– 12 × 10 Gb/s = 120 Gb/s demonstrated 2003
➢ MAUI (2002 – present)
– Combination of VCSEL WDM and parallel fiber-optic technology for FTTP, aimed at 1 m – 100 m interconnect-length applications
– 240 Gb/s at < 1 W demonstrated 2004
[Figure: 1995 – 2004 timeline with module photographs showing VCSELs/PINs, optics, guide pins, and passives; an 8 mm × 6 mm PMOSA built from a silicon IC, flex circuit, and metal base delivers 240 – 1000 Gb/s at < 1 W.]
Parallel optics and CMOS integration
POLO
➢ HP experimental Afterburner JetStream ring network (July 1995)
– 1 Gb/s Tx, 1 Gb/s Rx
➢ POLO point-to-point host interface for parallel optics (October 1997)
– 16 Gb/s Tx, 16 Gb/s Rx
➢ 20× JetStream on a chip: ring network for parallel optics integrated in a single CMOS IC (December 2000)
– 20 Gb/s Tx, 20 Gb/s Rx
– Link Adapter Chip for the parallel fiber-optic ring network
• 400,000 transistors, includes ring MAC
• 10.2 mm × 7.3 mm in 0.5 µm CMOS
• Tape-out 8.17.00, received 11.10.00
[Figure: photographs of the three generations of high-speed parallel fiber-optic interfaces on their host boards (210 mm and 144 mm scale markers), dated July 1995, October 1997, and December 2000.]
New markets for optical interconnects: Solving the electronics interconnect and packaging mess!
FTTP
[Figure: conventional node with CPU, memory controller, IO controller, PCI cards, and main memory, highlighting the memory access bottleneck and the SAN.]
➢ Integration trend places multiple processors on a single chip
– Chip multi-processor (CMP) from Broadcom (SiByte BCM1250)
➢ Main memory likely to remain separate in most systems
– 10 nm CMOS circuits have 100M transistors/mm² (see the sketch below)
• 6 transistors per bit in SRAM → ~16 Mb = 2 MB/mm² or 200 MB/cm²
• 1 transistor per bit in DRAM → 100 Mb = 12 MB/mm² or 1.2 GB/cm²
– Might be useful for a single-chip notebook computer, or make an interesting L2 cache for a CMP
➢ Multiple processor boards in chassis systems are connected by switches
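A minimal back-of-envelope check of the memory-density bullets above, assuming the slide's figure of 100M transistors/mm² for 10 nm CMOS and that every transistor is spent on the bit array; the function name is illustrative, not from the talk:

```python
TRANSISTORS_PER_MM2 = 100e6  # slide's assumption for 10 nm CMOS

def memory_density_mb_per_mm2(transistors_per_bit):
    """Capacity per mm^2 if all transistors form the bit array."""
    bits = TRANSISTORS_PER_MM2 / transistors_per_bit
    return bits / 8 / 1e6  # bits -> megabytes

print(memory_density_mb_per_mm2(6))  # SRAM, 6T cell: ~2.1 MB/mm^2 (~208 MB/cm^2)
print(memory_density_mb_per_mm2(1))  # DRAM, 1T cell: ~12.5 MB/mm^2 (~1.25 GB/cm^2)
```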
1U (1.75") thick 20-port GbE switch/router for chassis servers (2001)
System example
➢ 96 W, hot-swappable 20-port GbE router
➢ 15.5" × 5.35"
➢ ~2300 components
➢ ~7000 nets, ~11000 pins
➢ Electrical and optical GbE IO
– 8 GbE optical links
– 8 GbE backplane links
– 4 GbE Cat-5 links
[Figure: board photograph with callouts: SERDES + dual quad-channel MMF optical modules; quad 8-port, mesh-connected GbE switch ICs with 20 external ports; clock generation; quad serial-link IC for GbE backplane interconnect; GbE PHY IC; eight GbE serial backplane interconnects over low-cost CPCI connectors; 100 W, 48 V, 20 A brick; management microprocessor and support circuitry.]
Integration- and packaging-driven processor crisis: The case for fiber-to-the-processor (FTTP)
System level issues
➢ Electronics fails to deliver
– Power crisis: projected kW CPU not viable
• Processor crisis driving multi-core processor design with increased IO demand and only a fraction of transistors active at any one time
• Intel moves to CMP; Pentium IV uni-processor development terminated (2005)
– Bandwidth density and latency crisis
• Increasing mismatch between memory bus bandwidth and CPU
• Many CPU cycles wasted after a cache miss
– Signal integrity crisis
• EMI, reflections, crosstalk, and device noise may lead the way to optical interconnects
• High-speed electrical signaling not reliable
• $400M i820 memory translator hub recall because of electrical noise (5.10.00)
• 1.13 GHz PIII recall because of electrical noise in a circuit element (8.28.00)
➢ Fiber-to-the-processor is a new design point
– Less power and lower power density in a distributed system using a WDM SAN
– Better signal integrity, optical isolation
– More bandwidth density gives reduced latency in node and SAN
– Removes the electrical backplane bottleneck for future multi-processor systems
[Figure: log-scale CPU power (W, 1 – 1000) versus year, 1980 – 2010, tracking i386SX through Pentium 4 and Itanium against Moore's Law for the on-chip high-performance local clock (SIA '97).]
[Figure: Ethernet switch-port data rate (Gb/s, 0.01 – 10) versus year, 1994 – 2004; Ethernet data-rate deployment tracks Moore's Law, 2× every 2 years.]
[Figure: bus bandwidth (Gb/s, 0.1 – 1000) for processors from i386DX-16 and i486DX-25/33 through P1-66 to P1-233, P2-450, P3-733, P4-1500 to P4-3200, and Itanium-2, comparing external memory bandwidth with internal CPU bandwidth. Internal CPU bandwidth accounts for superscalar microprocessor architecture by multiplying internal datapath width by the number of instructions that can be issued simultaneously.]
Optical interconnects and the memory access bottleneck
FTTP
[Figure: the same external-memory versus internal-CPU bus-bandwidth chart as the previous slide (i386DX-16 through Itanium-2, 0.1 – 1000 Gb/s). Optical interconnect can fill the memory-access performance gap with a bandwidth edge density of 60 – 600 Gb/s/mm.]
FTTP: A new architecture enabled by optical interconnects and high-performance CMOS integration
FTTP
System level issues
➢ New technology
– Optical interconnect
• Ultra-high bandwidth
• Low power
• Low latency
➢ Integration
– CMOS interface to optics
• High-performance crossbar switch
➢ New switch-based architecture
– Next-generation scalable NUMA
• Switch integrated in processor and memory
Driving to a "technology convergence point": optical interconnect + CMOS optical interface + switch-based architecture
[Figure: multi-processor switch-based network of nodes (P1, P2, L3 at 5 Tb/s each) connected through the SAN; high-performance CMOS interface to parallel optics and WDM VCSELs.]
Example latency estimate
[Figure: four nodes, each with a processor (P), controller (Ctl), memory, and crossbar, chained through the SAN; per-segment round-trip-time contributions of 10 ns, 16 ns, 20 ns, 30 ns, and 50 ns are marked along the path.]
➢ Round-trip time, summed per segment (as the sketch after this list computes):
80 ns + 10 ns + 16 ns + 30 ns + 16 ns + 30 ns + 16 ns + 30 ns + 16 ns + 10 ns + 20 ns + 50 ns = 324 ns
– Cycle counts: 10 Cy at 125 MHz (80 ns); 5 Cy at 500 MHz (10 ns); 4+4 Cy at 500 MHz (16 ns); 15 Cy at 500 MHz (30 ns); 4+4 Cy at 500 MHz (16 ns); 15 Cy at 500 MHz (30 ns)
➢ A 10× increase in clock rate reduces round-trip time ~10×
➢ Assume time-of-flight ~ 0 ns
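A minimal sketch reproducing the slide's 324 ns round-trip sum, assuming zero time of flight; cycle-quoted delays convert as cycles/clock, and the grouping of the remaining marked segments is taken directly from the figure:

```python
def ns(cycles, clock_hz):
    """Delay in nanoseconds for a given cycle count and clock."""
    return cycles / clock_hz * 1e9

segments_ns = [ns(10, 125e6),         # 80 ns
               ns(5, 500e6),          # 10 ns
               ns(4 + 4, 500e6),      # 16 ns
               ns(15, 500e6),         # 30 ns
               ns(4 + 4, 500e6),      # 16 ns
               ns(15, 500e6),         # 30 ns
               16, 30, 16, 10,        # remaining marked segments (ns)
               20, 50]
print(sum(segments_ns))  # 324.0 ns

# 10x the clocks (125 MHz -> 1.25 GHz, 500 MHz -> 5 GHz) scales every
# cycle-quoted term down 10x, which is the slide's ~10x round-trip claim.
```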
System impact of increased available bandwidth: Reduced message latency and improved scaling
➢ Bisection bandwidth and message latency for a k-ary n-cube network
– A network with n dimensions and k nodes per dimension:

$$BW_{\mathrm{bisection}} = 2 \cdot BW_{\mathrm{port}} \cdot k^{\,n-1}$$

$$t_{\mathrm{message\ latency}} = D \cdot (t_r + t_s + t_w) + \frac{L}{BW}$$

$$N = k^{n}, \qquad D = \frac{n k}{4}, \qquad \mathrm{Ports} = 2n$$

where
N → total number of nodes
k → number of nodes in each dimension
n → number of dimensions
D → average distance between any pair of nodes
t_r → time to make a routing decision (10 cycles, < 20 ns)
t_s → delay through the switch (6 cycles, < 20 ns)
t_w → the interconnection delay (1.0 m hop length)
BW → bandwidth of each port = B × W, where B is the bandwidth of each line and W is the port width
L → packet length (1 kB)

➢ The 4 SAN ports can be used to design a 2-D torus with N = k² processors (n = 2, N = 16, 64, 256, 1024)
➢ Message latency is then

$$t_{\mathrm{message\ latency}} = \frac{k}{2}\,(t_r + t_s + t_w) + \frac{L}{BW}$$

➢ For a 32-processor network (see the sketch below)
– A 32 GB/s, 4-port switch achieves ×1.5 better no-load average message latency compared with a 20 GB/s, 6-port switch
• (×1.36 better no-load average message latency for 2048 processors)
[Figure: a 3-ary 2-cube (2-D torus) and a 3-ary 3-cube (3-D torus, wrap-around not shown) of processor nodes; link labels 3.2 GB/s = 25.6 Gb/s and 32 GB/s = 256 Gb/s.]
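A minimal sketch of the no-load latency model above, t = D·(t_r + t_s + t_w) + L/BW with D = nk/4. The hop delay t_w below is an assumption (the slide gives only the 1.0 m hop length), so this illustrates the model rather than reproducing the exact ×1.5 comparison:

```python
def message_latency_ns(n, num_nodes, bw_gbytes, t_r=20e-9, t_s=20e-9,
                       t_w=5e-9, packet_bytes=1024):
    """No-load average message latency for a k-ary n-cube, in ns."""
    k = num_nodes ** (1.0 / n)    # nodes per dimension
    d = n * k / 4                 # average inter-node distance (hops)
    serialization = packet_bytes / (bw_gbytes * 1e9)
    return (d * (t_r + t_s + t_w) + serialization) * 1e9

# 4-port switch -> 2-D torus; 6-port switch -> 3-D torus.
print(message_latency_ns(n=2, num_nodes=32, bw_gbytes=32))
print(message_latency_ns(n=3, num_nodes=32, bw_gbytes=20))
```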
System impact of reduced cache miss rate
➢ Simulation assumptions
– L1 hit rate: 90% (based on third-party test results)
– L2 access latency: 9 cycles (based on P4)
• http://www.aceshardware.com/Spades/read.php?article_id=20000190
– L3 access latency: 20 cycles (based on Merced)
• http://www.geek.com/procspec/features/itanium/index.htm
– Assume 96% of memory accesses are satisfied by L1 and L2
– 5.0 GHz processor speed
– 1.3 cycles per instruction
• Using Intel assumptions (each instruction is sub-divided into micro-ops during execution)
• http://developer.intel.com/design/pentium4/manuals/248966.htm
➢ Impact of memory access bandwidth on cache hit rate not taken into account
– Improved BW improves hit rate because of reduced pre-fetch distance
➢ Performance of FTTP with only an L2 cache and a 96% cache hit rate equals RAMBUS with L2 and L3 at a 99.3% cache hit rate (see the sketch below)
– Adding an L3 cache to hide memory access latency does not outperform FTTP
[Figure: performance chart comparing 99.3% hit, 600 MIPS against 96.0% hit, 600 MIPS; arrow indicates improving performance.]
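A hedged sketch of the kind of estimate behind the comparison above: effective MIPS from base CPI plus miss stalls. The miss-penalty figures are illustrative assumptions; the slide does not state the RAMBUS or FTTP memory access times used in its simulation:

```python
CLOCK_HZ = 5.0e9   # slide assumption
BASE_CPI = 1.3     # slide assumption (Intel figure)

def mips(hit_rate, miss_penalty_cycles):
    """Effective MIPS given combined cache hit rate and miss penalty."""
    cpi = BASE_CPI + (1.0 - hit_rate) * miss_penalty_cycles
    return CLOCK_HZ / cpi / 1e6

# A 96% hit rate backed by a fast (FTTP-like) memory path can match a
# 99.3% hit rate backed by a much slower one.
print(mips(0.960, miss_penalty_cycles=150))  # assumed fast main memory
print(mips(0.993, miss_penalty_cycles=860))  # assumed slow main memory
```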
Fiber-to-the-processor: Exposing raw CPU performance
System level issues
➢ Single-chip multi-CPU module with integrated switch and optical system area network (SAN)
– SoC internal bandwidth: 10 GHz × 128 × 2 × 2 = 5.12 Tb/s
➢ Main memory module with a high-performance optical IO port
➢ All off-chip high-speed signals are optical (the sketch below checks the arithmetic)
– 1.28 Tb/s × 5 ports = 6.4 Tb/s SoC IO bisection bandwidth
• RDMA ready
• A 1RU electrical backplane supports only two (2) SoC processors
• The number of SoC processors using an FTTP backplane is determined by power dissipation
➢ All off-chip slow-speed signals are electrical (including electrical power)
[Figure: SoC with CPUs (L1, L2, L3 caches), RDMA engine, and memory controller with crossbar switch; North/South/East/West WDM processor SAN ports, each an optical port at 2 × 80 GB/s (WDM, 2 × 64 × 10 Gb/s = 1.28 Tb/s); 4 × 32 b-wide, 4 Gb/s point-to-point half-duplex electrical data links; single-chip processor and main memory with PIM and TLB in FTTP sockets on a PMOSA module, joined by the fiber-optic interconnect plane.]
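A quick arithmetic check of the bandwidth figures quoted above; the expressions simply mirror the slide's own factors:

```python
soc_internal = 10e9 * 128 * 2 * 2  # 10 GHz x 128 b x 2 x 2
port = 2 * 64 * 10e9               # per optical port: 2 x 64 x 10 Gb/s
bisection = 5 * port               # five SAN ports

print(soc_internal / 1e12)  # 5.12 Tb/s
print(port / 1e12)          # 1.28 Tb/s (= 2 x 80 GB/s)
print(bisection / 1e12)     # 6.4 Tb/s
```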
FTTP exposes raw CPU performance with multiple serial optical chip-to-chip interconnects
➢ Single-chip CPU module (SoC) with FTTP optical interface
➢ Main memory module with a high-performance optical port
– Serial main memory fed by an optical/CMOS interface
➢ All off-chip high-speed signals are optical
➢ All off-chip slow-speed signals are electrical (including electrical power)
➢ Key FTTP enablers:
– Agilent MAUI optical sub-assembly
– USC multi-rate, multi-lane serial CMOS interface
[Figure: single-chip CPU module integrating multiple CPUs (each with L1 and L2, sharing L3 caches) and multiple optical serial links; the optical signaling boundary of the multi-processor SoC feeds the MAUI interconnect fabric and system-wide fiber-optic interconnect plane. MAUI optical port: 2 × 32 GB/s = 512 Gb/s; serial feed to main memory; single-chip processor and main memory with PIM and TLB in FTTP sockets.]
Flip-chip optical socket LGA concept
➢ Today at USC: 1.27 mm pitch FC-LGA, 40 × 40 mm², 960 pins, Rogers 2800 dielectric, estimated price $30 in 10k volume
– 212.5 µm center-to-center IC pad pitch
– Option 1: 6.5 × 6.5 mm² IC = 216 differential IO
– Option 2: 5.0 × 5.0 mm² IC = 108 differential IO
– Package performance
• −3 dB bandwidth > 20 GHz, NEXT < −30 dB
• Can be improved to −3 dB bandwidth ~40 GHz, NEXT < −30 dB
➢ Easily modified to implement an "optical socket" for fiber to the processor
– Package-level optical interconnect for inter-chip optical buses
– An 8 mm × 5 mm chip-scale optical port is a prototype today (see the sketch below)
• Today: 0.48 Tb/s, < 2 W unidirectional fiber-optic port
• Future: > 1 Tb/s, < 1 W unidirectional fiber-optic port
– Includes alignment pins for an MT ferrule with 12-fiber ribbon
Agilent / MAUI – DARPA program
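A back-of-envelope look at the optical-socket throughput above. The 4-wavelength WDM per fiber is an assumption chosen to be consistent with the MAUI figures elsewhere in the talk, not a spec quoted on this slide:

```python
fibers = 12             # MT ferrule, 12-fiber ribbon
wavelengths = 4         # assumed WDM channels per fiber
rate_per_lambda = 10e9  # assumed 10 Gb/s per wavelength

print(fibers * wavelengths * rate_per_lambda / 1e12)  # 0.48 Tb/s
```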
A system architecture roadmap: The FTTP opportunity
FTTP
[Figure: roadmap from 2000 to 2010 of traditional system partitioning across increasing interconnect length scales: processor bus, local I/O bus, backplane, system area network, and local area network. Proprietary buses, PCI, Compact PCI, VME, proprietary interconnect, Gbit Ethernet, and 10/100 Ethernet give way to Rapid I/O, Infiniband, and 10 Gbit Ethernet; increasing system integration drives technology insertion of FTTP and 100 Gbit Ethernet. FTTP target: minimum 1 Tb/s/port × 5 ports/chip.]
The cost of myths
➢ 'Optics will not speed up memory access'
– Said by Howard Davidson, OIDA, October 21, 2004, Burlingame, CA.
– Actually only true for SMP and its current programming model, in which latency is dominated by global directory coherency
• NUMA, which has local coherency, does not suffer from this problem – but you have to change your software
➢ Embracing myths as truths avoids the need to innovate
Impact of decreasing CMOS device feature size on interconnect: 80 Gb/s serial IO
[Figure: "Pad Characteristics" — flip-chip pad pitch (µm, 0 – 250) and high-performance ASIC IO pad count (1000 – 3500) plotted against year since 1958 (43 – 58, i.e. 2001 – 2016).]
Scaling trends
[Figure: fT (GHz, 50 – 400) versus CMOS feature size (0.01 – 0.16 µm); fit y = −91845x³ + 39908x² − 6368.4x + 459.84, R² = 0.9903.]
[Figure: transistor density (10⁴ – 10¹⁰ transistors/mm²) versus minimum CMOS feature size (µm); fit y = 11429·x⁻².]
[Figure: IC IO density — 75 µm pad diameter on a 150 µm pitch.]
➢ Transistor scaling to 10 nm CMOS by 2016
– 100M transistors/mm² (2 Intel Pentium-IV processors), consistent with the density fit evaluated in the sketch below
• Scaling fails due to IO, on-chip wiring, and Vdd ~ 0.8 V, giving 10 – 60 W power dissipation
– 80 Gb/s IO based on PAM-4, fT > 400 GHz, and 400 mW
– High-speed IO pad-pitch improvement limited by crosstalk and package material properties
– 75 µm pad diameter and 150 µm pitch
– 36 bond pads/mm²
– 9 differential-pair IO/mm²
– 18 power and ground pads/mm²
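The density-law fit from the "transistor density versus feature size" chart above, y = 11429·x⁻² with x in µm and y in transistors/mm². Evaluating it at 10 nm recovers the 100M transistors/mm² figure used in these bullets:

```python
def transistors_per_mm2(feature_um):
    """Transistor density from the chart's power-law fit."""
    return 11429 * feature_um ** -2

print(transistors_per_mm2(0.01))  # ~1.1e8 at 10 nm CMOS
print(transistors_per_mm2(0.13))  # ~6.8e5 at 130 nm, for comparison
```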
Challenges for electronics and photonics driven by CMOS scaling
Electronics
Computation
➢ 10 nm CMOS, fT > 400 GHz, < 10⁻¹⁸ J switching energy
– 10 – 12 metal layers
– 100 transistors/µm² for random logic
– 500 transistors/µm² for SRAM cells (0.0122 µm² per single-port SRAM cell)
– 100M transistors/mm² (2 Pentium-IV/mm²)
– 80 Gb/s IO (PAM-4 and fT > 400 GHz)
➢ Integration implies high power density ~ 10 – 60 W/mm²
– Assumes 110 °C junction temperature and Si thermal conductivity κ = 1.5 W/(cm·°C)
– Forces 10 mm² area (~ 1 – 6 W/mm²) for a 100M-transistor circuit in 10 nm CMOS (or liquid cooling …)
➢ Distributed architecture on chip
– Benefit from large fT to reduce power, and use high-speed serial IO to reduce packaging cost
– Remaining area for power regulation, RF-style and analog elements, self-test, calibration
Communication
➢ Controlled-impedance launch to the package trace with S11 < −10 dB restricts flip-chip IO pitch on the IC/package to 150 µm
– 9 differential IO/mm² suggests high-speed serial, which also reduces backplane design effort
➢ Low-loss (< −3 dB), low-crosstalk (< −30 dB), dense-IO electrical packages require
– tan δ < 0.002
– εr < 2.5
– Via technology: high aspect ratio, blind vias, tight pad overlap of via, relatively tight registration
– Low-loss-tangent PCB dielectric (tan δ < 0.002)
➢ A high-density, "perfect" electrical backplane connector is required that is mechanically reliable, manufacturable, low-cost, low-NEXT, and impedance-matched at the data rate
[Figure: processor, memory, and communications ICs in packages connected by PCB traces and a connector.]
Challenges for electronics and photonics driven by Moore's Law CMOS scaling
Photonics
Computation
➢ Optical logic and memory not practical at the present time
– Optical devices cannot match electronic feature size (100 transistors/µm² in 10 nm CMOS) and efficiency, or approach computational equivalence for digital processing
➢ Electronic interface to optical devices potentially limited by:
– Bias voltage and current
– Drive voltage and current
– Intimacy of integration requiring fan-in/fan-out of controlled-impedance lines
– Harsh thermal, mechanical, and electromagnetic environment
➢ Slow-speed photonic devices!
– ≤ 20 Gb/s digital modulation of laser diodes
Communication
➢ Fiber optics superior to electrical interconnect on length scales ≥ 1 m, using metrics of signal loss, power dissipation, and bandwidth
➢ Lower-power, higher-impedance lines can be used to interface electronics to optical devices
➢ "Optical PCB-trace" required for intra-chassis interconnect
➢ Optical connector has superior form-factor (3× –