## High-performance Scalable Photonics On-chip Network for Many-core Systems-on-Chip

## Achraf Ben Ahmed

A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE AND ENGINEERING



Adaptive Systems Laboratory, Graduate Department of Computer and Information Systems, University of Aizu, Japan

March 2016

The thesis titled

### High-performance Scalable Nanophotonics On-chip Network for Many-core Systems

by

### Achraf Ben Ahmed

is reviewed and approved by:

| Title                             | Name                   |
|-----------------------------------|------------------------|
| Professor (Chief referee)         | Abderazek Ben Abdallah |
| Professor                         | Toshiyaki Miyazaki     |
| Professor                         | Tsuneo Tsukahara       |
| Professor                         | Anh T. Pham            |
| Senior Associate Professor        | Yukihide Kohira        |

University of Aizu

March 2016

Dedicated to

my lovely Mother,

my Father, and to the rest of my Family

### High-performance Scalable Nanophotonics On-chip Network for Many-core Systems

Achraf Ben Ahmed

Submitted for the Degree of Doctor of Philosophy, March 2016

#### Abstract

The continuously increasing demand for higher-performance computing systems, together with aggressive technology scaling, has driven the trend of integrating a large number of cores on a single chip. In future generations of high-performance many-core systems, the efficiency of the communication infrastructure is as important as the computation efficiency of individual cores. Conventional electrical Networks-on-Chip (NoCs) are expected to reach their limits with increasing core counts because of high power dissipation and reduced performance.

As indicated in the latest version of the ITRS roadmap, photonic wiring is a promising interconnect paradigm for future system-on-chip (SoC) designs that can provide broadband data transfer rates unmatchable by existing metal interconnects. When photonic wiring is combined with Wavelength Division Multiplexing (WDM), multiple parallel optical streams of data are transferred concurrently through a single waveguide. This contrasts with Electronic Networks-on-Chip (ENoCs), which require a unique metal wire per bit stream. The key to saving power in on-chip photonic communication is that, once a photonic path is established, the optical data is transmitted end to end without the need for buffering, repeating, or regenerating.

The photonic switching/routing techniques, the configuration method, and the routing algorithm directly affect the performance and power characteristics of future many-core on-chip photonic communication. In particular, the control module and the path configuration algorithm, which orchestrate the different electrical control functions, play an important role in how both the electrical and the photonic resources are used. In this dissertation, a set of novel photonic routing algorithms and architectures is proposed for future on-chip optical networks.

First, a new low-latency, non-blocking photonic switch/router (NBPS) and its control module, capable of handling all photonic communication configuration tasks, are proposed. The proposed approach is based on a new hybrid spatial switching mechanism for the photonic data stream transfer, carried out by manipulating the state of the broadband switching elements. In addition, the NBPS relies on Wavelength-Selective Switching (WSS) for handling all communication configuration tasks.

Second, a new contention-aware path configuration algorithm and architecture for Electro-Assisted Photonic Network-on-Chip (EA-PNoC) are proposed. In addition to the main configuration tasks, the algorithm decouples the Electronic Control Network (ECN) from the Photonic Communication Network (PCN) so that the photonic and electrical domains work independently of each other. The proposed algorithm orchestrates the processing of the different path configuration packets and significantly alleviates contention in the ECN.

Third, a low-complexity routing and configuration algorithm for EA-PNoC is proposed. The approach is mainly based on photonic components augmented with a simple electronic control module and a so-called wavelength-shifting mechanism. The main merit of this new approach is that the path is configured using photonic devices instead of the typical power-hungry electronic router.

The proposed architectures and algorithms were evaluated with a discrete-event simulator, which incorporates detailed physical models of the photonic components. Results show that we could achieve better energy efficiency, as well as a considerable reduction in the blocking occurrence, which is the main source of latency and bandwidth degradation in conventional EA-PNoCs.

### High-performance, Highly Scalable Photonics On-chip Network for On-chip Many-core Systems

#### Achraf Ben Ahmed

### Submitted for the Degree of Doctor of Philosophy, March 2016

### Abstract

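The growing demand for high-performance computing systems and for more aggressive technology scaling has driven the approach of integrating more cores on a single chip. In next-generation high-performance many-core systems, an efficient communication infrastructure is as important as the computation efficiency of individual cores. Conventional electrical Networks-on-Chip (NoCs) are expected to reach their limits as core counts increase, because of high power dissipation and reduced performance.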

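As the latest version of the ITRS roadmap indicates, photonic wiring is a promising paradigm for future system-on-chip (SoC) designs, providing broadband data transfer rates that existing metal interconnects cannot match. When it is combined with Wavelength Division Multiplexing (WDM), multiple parallel optical data streams are transferred concurrently through a single waveguide. This contrasts with Electronic Networks-on-Chip (ENoCs), which require a separate metal wire for each bit stream.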

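Photonic switching and routing techniques, the configuration method, and the routing algorithm directly affect the performance and power characteristics of future many-core on-chip photonic communication. In particular, the control module and the path configuration algorithm, which orchestrate the different electrical control functions, play an important role in how both the electrical and the photonic resources are used. This dissertation proposes a set of novel photonic routing algorithms and architectures for future on-chip optical networks.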

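First, a new low-latency non-blocking photonic switch/router (NBPS) and a control module capable of handling all photonic communication configuration tasks are proposed. The proposed approach is based on a new hybrid spatial switching mechanism for the photonic data stream transfer, carried out by manipulating the state of the broadband switching elements. In addition, the NBPS builds on Wavelength-Selective-Switching (WSS) for handling all communication configuration tasks.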

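Second, a new contention-aware path configuration algorithm and architecture are proposed. In addition to the main configuration tasks, the algorithm decouples the Electronic Control Network (ECN) from the Photonic Communication Network (PCN), so that the photonic and electrical domains operate independently of each other. The proposed approach orchestrates the different path-setup packets and significantly alleviates contention in the ECN.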

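Third, a low-complexity routing and configuration algorithm for hybrid photonic Network-on-Chip is proposed. The approach is mainly based on photonic devices augmented with a simple electronic control module and a so-called wavelength-shifting mechanism.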

## Declaration

The work in this thesis is based on research carried out at the Adaptive Systems Laboratory at the University of Aizu, Japan. No part of this thesis has been submitted elsewhere for any other degree or qualification and it is all my own work unless referenced to the contrary in the text.

#### Copyright © 2016 by Achraf Ben Ahmed.

"The copyright of this thesis rests with the author. No quotations from it should be published without the author's prior written consent and information derived from it should be acknowledged".

## Acknowledgements

I would like to express my sincere gratitude to my advisor, Prof. Abderazek Ben Abdallah, for the continuous support of my Ph.D. study and research, and for his patience, motivation, enthusiasm, and immense knowledge. His guidance helped me throughout the research and the writing of this thesis.

I would also like to thank Prof. Toshiyaki Miyazaki, Prof. Tsuneo Tsukahara, Prof. Anh Pham, and Prof. Yukihide Kohira of the University of Aizu for taking the time to review my thesis. Moreover, I extend my sincere gratitude to Prof. Yuichi Okuyama for his help and support during the past three years.

I would also like to thank my family for the support they provided me through my entire life and in particular, I must acknowledge my parents, without whose love, encouragement and editing assistance, I would not have finished this thesis.

Last but not least, I would like to thank all my friends back home and in Japan, especially the members of the Adaptive Systems Laboratory at the University of Aizu. They eased my integration into Japanese society with their valuable advice, making my stay in Japan much easier and more comfortable.

## Contents

| Section | Title |
|---------|-------|
|         | Abstract |
|         | Declaration |
|         | Acknowledgments |
| **1**   | **Introduction** |
| 1.1     | Current System Design |
| 1.2     | Electronic Network-on-Chip |
| 1.3     | Photonic Interconnect |
| 1.4     | Photonic Networks-on-Chip: Problems and Motivation |
| 1.5     | Thesis Objectives and Contributions |
| 1.6     | Thesis Outline |
| **2**   | **Background: Photonic Networks-on-Chip** |
| 2.1     | Photonic Communication |
| 2.1.1   | Photonic NoC Building Blocks |
| 2.1.1.1 | Laser |
| 2.1.1.2 | Coupler |
| 2.1.1.3 | Waveguide |
| 2.1.1.4 | Micro-Ring Resonator |
| 2.1.1.5 | Modulator |
| 2.1.1.6 | Photodetector |
| 2.2     | Routing Schemes in PNoCs |
| 2.2.1   | Circuit Switching |
| 2.2.2   | Wavelength-Routed |
| 2.2.2.1 | Source-based Routing |
| 2.2.2.2 | Destination-based Routing |
| 2.2.2.3 | Multiple Write Single Read |
| 2.2.2.4 | Single Write Multiple Read |
| 2.2.2.5 | Fully Connected Crossbar |
| 2.3     | Photonic NoC Metrics |
| 2.3.1   | Power Budget |
| 2.3.2   | Data Integrity |
| 2.4     | Chapter Summary |
| **3**   | **Related Works** |
| 3.1     | … Photonic NoCs |
| 3.2     | …ed Photonic NoCs |
| 3.3     | … NoCs Architectures |
| 3.4     | Chapter Summary |
| **4**   | **Non-blocking, Low Latency Electro-Assisted Photonic NoC Architecture** |
| 4.1     | Introduction |
| 4.2     | System Architecture |
| 4.3     | Electro-optic R… |
| 4.4     | PHENIC Non-… |
| 4.4.1   | Building … |
| 4.4.1.1 | … |
| 4.4.1.2 | … |
| 4.4.2   | Micro-R… |
| 4.4.3   | Teardown … |
| 4.4.4   | Optical … |
| 4.5     | …Weight E… |
| 4.5.3   | Network Interface and Gateway Architecture |
| 4.5.4   | Dimension-Order-Routing (DOR-XY) |
| 4.5.5   | Arbiter Architecture |
| 4.6     | Chapter Summary |
| **5**   | **Contention-Aware Path Configuration Algorithm** |
| 5.1     | Introduction |
| 5.2     | Contention-aware Path Configuration Algorithm |
| 5.2.1   | Path Configuration Phases |
| 5.2.1.1 | Path Setup |
| 5.2.1.2 | ACK |
| 5.2.1.3 | Payload Transmission |
| 5.2.1.4 | Teardown |
| 5.2.2   | Advantages of the Proposed Path Configuration Algorithm |
| 5.3     | Evaluation |
| 5.3.1   | Methodology and Assumptions |
| 5.3.1.1 | Benchmarks |
| 5.3.2   | Complexity |
| 5.3.3   | Latency Evaluation Under Synthetic Workloads |
| 5.3.3.1 | Blocking Latency |
| 5.3.3.2 | Blocked Requests |
| 5.3.4   | Latency Evaluation Under Realistic Workloads |
| 5.3.4.1 | Path Setup Latency Ratio |
| 5.3.4.2 | Speedup |
| 5.3.5   | Bandwidth Evaluation Under Synthetic Workloads |
| 5.3.6   | Bandwidth Evaluation Under Realistic Workloads |
| 5.3.7   | Energy Evaluation Under Synthetic Workloads |
| 5.3.7.1 | Path Configuration Energy Overhead |
| 5.3.7.2 | Total Power |
| 5.3.8   | Energy Evaluation Under Realistic Workloads |
| 5.3.8.1 | Path Configuration Energy Overhead |
| 5.3.8.2 | Total Power |
| 5.4     | Results Summary |
| 5.5     | Chapter Summary |
| **6**   | **Energy-efficient Wavelength-Shifting Routing Algorithm and Architecture** |
| 6.1     | Introduction |
| 6.2     | Wavelength-Routed Control Network Architecture |
| 6.3     | Wavelength-Shifting Routing Algorithm |
| 6.3.1   | Routing Phases |
| 6.3.2   | Blocking Management |
| 6.3.3   | Case Study |
| 6.4     | Evaluation |
| 6.4.1   | Methodology and Assumptions |
| 6.4.2   | Complexity |
| 6.4.3   | Path Configuration Delay |
| 6.4.3.1 | Delay Under Light Traffic |
| 6.4.3.2 | Delay Under Heavy Traffic |
| 6.4.4   | Micro-Ring Release Time |
| 6.4.5   | Bandwidth |
| 6.4.6   | Path Configuration Energy |
| 6.4.6.1 | Energy Under Light Traffic |
| 6.4.6.2 | Energy Under Heavy Traffic |
| 6.4.7   | Energy Efficiency |
| 6.4.8   | Chapter Summary |
| **7**   | **Thesis Summary and Discussion** |
| 7.1     | Contributions Summary |
| 7.2     | Results Summary |
| 7.3     | Discussion |

## List of Figures

| 1.1  | SOC design complexity trends                                             | 2  |
|------|--------------------------------------------------------------------------|----|
| 1.2  | Power consumption trends for communication-centric SoC design. $\ . \ .$ | 3  |
| 1.3  | Power consumption trends for computation-centric SoC design              | 4  |
| 1.4  | Network-on-Chip architecture                                             | 5  |
| 2.1  | Functional diagram of an optical communication.                          | 13 |
| 2.2  | Cross-section of a waveguide.                                            | 15 |
| 2.3  | Micrographs of a fabricated microring resonator                          | 16 |
| 2.4  | Micro-ring modulator                                                     | 16 |
| 2.5  | Circuit model of germanium detector with inductive gain peak             | 17 |
| 2.6  | Anatomy of EA-PNoC architecture.                                         | 18 |
| 2.7  | Anatomy of WR-PNoC architecture.                                         | 20 |
| 2.8  | Example of a source-based routing with four nodes                        | 21 |
| 2.9  | Example of a destination-based routing with four nodes                   | 22 |
| 2.10 | A Multi-Write Single-Read connection between four nodes                  | 22 |
| 2.11 | A Single-Read Multi-Write connection between four nodes                  | 23 |
| 2.12 | A fully connected crossbar connecting to four nodes                      | 23 |
| 2.13 | Different source of noise in a photonic link                             | 25 |
| 4.1  | PHENIC system architecture.                                              | 35 |
| 4.2  | Power consumption comparison results                                     | 36 |
| 4.3  | PNoC's power and latency distributions.                                  | 37 |
| 4.4  | Input-buffer dynamic energy break-down before saturation                 | 38 |
| 4.5  | Example of 5x5 blocking switch                                           | 39 |
| 4.6  | Examples of dependency between the PSCP and Teardown packets             | 40 |

| 4.7  | PHENIC's non-blocking photonic switch                                   | 42 |
|------|-------------------------------------------------------------------------|----|
| 4.8  | Photonic siwtch building blocks                                         | 43 |
| 4.9  | Photonic switch building blocks instantiation.                          | 44 |
| 4.10 | Worst case optical power loss                                           | 48 |
| 4.11 | PHENIC's light-weight electronic router.                                | 50 |
| 4.12 | Example of $4 \times 4$ interconnection network                         | 51 |
| 4.13 | PHENIC's electronic controller configuration packet size and format.    | 52 |
| 4.14 | PHENIC's gateway.                                                       | 53 |
| 4.15 | PHENIC's electronic and photonic arbiters.                              | 55 |
| 5.1  | Successful path-setup.                                                  | 59 |
| 5.2  | Failed path-setup                                                       | 60 |
| 5.3  | ACK phase                                                               | 61 |
| 5.4  | Payload transmission                                                    | 61 |
| 5.5  | Tear-down phase                                                         | 62 |
| 5.6  | Latency comparison results under random uniform traffic                 | 70 |
| 5.7  | Average blocking latency comparison under random uniform traffic        | 71 |
| 5.8  | Number of blocked request comparison result                             | 72 |
| 5.9  | Average path setup ratio                                                | 73 |
| 5.10 | Normalized speedup comparison results                                   | 74 |
| 5.11 | Bandwidth comparison results under random uniform traffic               | 75 |
| 5.12 | Bandwidth comparison results under FFT and Dataflow workloads           | 76 |
| 5.13 | Path setup and acknowledgments energy for half-loaded network           | 77 |
| 5.14 | Path setup and acknowledgments energy near-saturation                   | 78 |
| 5.15 | Total energy and energy efficiency comparison results                   | 79 |
| 5.16 | Total Energy breakdown comparison                                       | 80 |
| 5.17 | Input-buffer dynamic energy breakdown near-saturation                   | 81 |
| 5.18 | Normalized path setup dynamic power per achieved bandwidth              | 81 |
| 5.19 | Power efficiency comparison results                                     | 82 |
| 6.1  | Photonic switch controller                                              | 89 |
| 6.2  | Wavelength-shifting routing algorithm flowchart                         | 91 |
| 6.3  | Packet format generated at the sender network interface                 | 91 |
| 6.4  | Micro architecture of the path request detector and modulator banks.    | 92 |
| 6.5  | Micro architecture of the Path Blocked Manager.                         | 93 |
| 6.6  | Communication example in the wavelength-routed control network          | 94 |
| 6.7  | Wavelength-shifting algorithm example $(1/3)$                           | 95 |
| 6.8  | Wavelength-shifting algorithm example $(2/3)$                           | 96 |
| 6.9  | Wavelength-shifting algorithm example $(3/3)$                           | 97 |
| 6.10 | Complexity and power comparison results                                 | 100 |
| 6.11 | Path configuration delay under contention-less                          | 102 |
| 6.12 | Path configuration delay comparison results under contention            | 102 |
| 6.13 | Micro-ring release time under contention-less                           | 104 |
| 6.14 | Offered bandwidth comparison results under contention-less              | 105 |
| 6.15 | Path configuration energy comparison results under contention-less      | 106 |
| 6.16 | Path configuration energy comparison results under contention           | 106 |
| 6.17 | Energy efficiency comparison results under contention-less              | 107 |

## List of Tables

| 4.1 | Micro-rings configuration for data transmission              |
|-----|--------------------------------------------------------------|
| 4.2 | Wavelength assignment for acknowledgment signals             |
| 4.3 | Insertion loss parameters                                    |
| 4.4 | Comparison between $5 \times 5$ optical routers              |
| 4.5 | Power loss comparison                                        |
| 5.1 | Configuration parameters                                     |
| 5.2 | Photonic communication network energy parameters             |
| 5.3 | Ring requirement comparison results for 64-core systems      |
| 5.4 | Ring requirement comparison results for 256-core systems     |
| 5.5 | Evaluation results summary under uniform random traffic      |
| 5.6 | Evaluation results summary under FFT workload                |
| 5.7 | Evaluation results summary under Dataflow workload           |
| 6.1 | Chip configuration                                           |
| 6.2 | Delay contribution for 32 nm technology nodes                |
| 6.3 | Energy contribution for 32 nm technology nodes               |

## List of Abbreviations

| 3D-IC:   | Three-dimensional Integrated Circuit                 |
|----------|------------------------------------------------------|
| 3D-NoC:  | Three-dimensional Network-on-Chip                    |
| ACK:     | Acknowledgment                                       |
| BER:     | Bit Error Rate                                       |
| DB:      | Detector Bank                                        |
| DPE:     | Data Processing Element                              |
| DOR:     | Dimension Order Routing                              |
| DWDM:    | Dense Wavelength Division Multiplexing               |
| EA-PNoC: | Electro-Assisted PNoC                                |
| E-NoC:   | Electronic Network-on-Chip                           |
| ECN:     | Electronic Control Network                           |
| EOR:     | Electro-Optic Router                                 |
| FCA:     | Free Carrier Absorption                              |
| ITRS:    | International Technology Roadmap for Semiconductors  |
| MB:      | Modulator Bank                                       |
| MRR:     | Micro-Ring Resonator                                 |
| MRCT:    | Micro-Ring Configuration Table                       |
| MRST:    | Micro-Ring State Table                               |
| MRRT:    | Micro-Ring Release Time                              |
| MPSoC:   | Multiprocessor Systems-on-Chip                       |
| MWSR:    | Multiple Write Single Read                           |
| NBPS:    | Non-Blocking Photonic Switch                         |
| NI:      | Network Interface                                    |
| P2P:     | Point-to-Point                                       |
| PBM:     | Path Blocked Module                                  |
| PBP:     | Path Blocked Packet                                  |
| PCD:     | Path Configuration Delay                             |
| PCE:     | Path Configuration Energy                            |
| PCN:     | Photonic Communication Network                       |
| PE:      | Processing Element                                   |
| PIC:     | Photonic Integrated Circuit                          |
| P-NoC:   | Photonic Network-on-Chip                             |
| PR:      | Path Request                                         |
| PSCP:    | Path Setup Control Packet                            |
| PSC:     | Photonic Switch Controller                           |
| SNR:     | Signal to Noise Ratio                                |
| SoC:     | System-on-Chip                                       |
| SRMW:    | Single Write Multiple Read                           |
| TPA:     | Two Photon Absorption                                |
| TIA:     | Trans-Impedance Amplifier                            |
| WDM:     | Wavelength Division Multiplexing                     |
| WRCN:    | Wavelength-Routed Control Network                    |
| WR-PNoC: | Wavelength-Routed Photonic Network-on-Chip           |
| WSRA:    | Wavelength-Shifting Routing Algorithm                |
| WSS:     | Wavelength Selective Switching                       |

## Chapter 1

## Introduction

In this chapter, we give the background of the conducted research, including the interconnect scalability issues in current Systems-on-Chip (SoCs) and the reasons why the Electronic Network-on-Chip (E-NoC) no longer scales as the number of cores inside the chip increases. Photonic Networks-on-Chip (PNoCs) are introduced as a potential solution to mitigate the E-NoC problems. Because routing is the main challenge when designing a PNoC, the two approaches for routing optical data are introduced and discussed. We end this chapter with the main research objectives and contributions, as well as the outline of the thesis.

### 1.1 Current System Design

Over the past few decades, the advancement of computing systems has been dominated by rapid advances in semiconductor technology, which made it possible to shrink the chip size and to integrate hundreds or even thousands of cores. Systems-on-Chip (SoCs) [1] are embedded systems composed of several modules on a single chip (processors, memories, input/output peripherals). With SoCs, it is now possible to process information and execute critical tasks at higher speed and lower power on a tiny chip. This is due to the increasing number of transistors that can be embedded on a single chip, which keeps doubling every 18 months, as Gordon Moore predicted [2]. This technology scaling has allowed SoCs to grow continuously in component count and complexity and to evolve into systems with many processors embedded on a single die. As an example, the *Intel Xeon* processor [3] includes 2.3 billion transistors. With such a high integration level, the development of many cores on a single die has become possible. These systems are called Multiprocessor Systems-on-Chip (MPSoCs). For instance, the *Tilera Tile64* [4] and *Intel Polaris* [5] contain 64 and 80 cores, respectively.

Figure 1.1: SoC design complexity trends [6].

Figure 1.1 illustrates the SoC design complexity trends projected by the International Technology Roadmap for Semiconductors (ITRS) 2011 [6]. ITRS predicts that the number of Processing Elements (PEs) will grow rapidly in subsequent years to reach 6000 PEs by 2026. The amount of main memory is also assumed to increase proportionally with the number of PEs. In the same way, the number of Data Processing Engines (DPEs) will increase significantly, leading to more than 70 TFlops of processing performance [6].

As the number of cores keeps increasing, specific constraints must be taken into consideration in order to take advantage of this large number efficiently, for example design complexity, energy dissipation, silicon area, manufacturability and yield, and resource management. In particular, the interconnection network plays an increasingly important role in determining the performance and the power consumption of the entire chip [7]. Interconnects consume more than 50% of dynamic power, and this percentage is expected to increase [8]. As a result of the significant increase in the number of PEs and DPEs, the power consumption will increase proportionally, making it a key factor in the design of communication-centric SoCs. Even for performance-centric design, the power consumption remains a problem. Although the power consumption per DPE will decrease, this will be outweighed by the increase in the number of DPEs per chip, which results in critical chip packaging and cooling issues. This growth in power demand is illustrated in Figs. 1.2 and 1.3.

Figure 1.2: Power consumption trends for communication-centric SoC design [6].

In addition to the growing number of PEs and DPEs inside the chip, this increase in power consumption can be explained by the communication schemes used in such systems. The communication bottleneck stems from the fact that exploiting parallelism has become a necessity in new architecture designs to meet the performance requirements and the power constraints. However, parallelism implies additional communication overhead due to sharing and synchronization between threads and cores. Another aspect of the communication bottleneck is I/O frequency scaling. While dramatic increases in on-chip frequencies have yielded huge performance gains over the last decades, this has not been the case for DRAM and other I/O, a gap usually referred to as the memory wall [9].



Figure 1.3: Power consumption trends for computation-centric SoC design [6].

### 1.2 Electronic Network-on-Chip

Electronic Networks-on-Chip (E-NoCs) [10-17] were introduced as a promising solution to the issues mentioned above. Based on a simple and scalable architecture platform, a NoC connects processors, memories, and other custom designs by switching packets on a hop-by-hop basis to provide higher bandwidth and better performance. As shown in Fig. 1.4, NoC architectures are built from connecting segments (or wires) and switching blocks to combine the benefits of the previously proposed Point-to-Point (P2P) and shared-bus architectures while solving their disadvantages, such as the large number of long wires in P2P and the lack of scalability in shared-bus systems.

However, E-NoCs present serious scalability problems as the number of cores increases into the hundreds or thousands. These problems concern the three main metrics that determine the performance of an interconnection network: throughput, latency, and power. As an example, in a 2D mesh-based architecture, the channels, buffers, and crossbar are responsible for 15-30% of the total network power consumption [5].

This limitation comes mainly from the large network diameter that NoCs suffer from.



Figure 1.4: Network-on-Chip architecture [10].

The network diameter is the number of hops that a flit traverses on the longest minimal path between any source-destination pair. The diameter is important for NoC design because a large diameter increases the worst-case routing latency in the network. For these reasons, optimizing NoC-based architectures has become increasingly necessary, and much research has been conducted to achieve this goal through various approaches, such as developing fast routers [18–21] or designing new network topologies [22–24]. One of the proposed solutions is to extend the Network-on-Chip to the third dimension (3D-NoC).
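To make the diameter definition concrete, the following minimal sketch computes it by breadth-first search on a regular mesh. The function name `mesh_diameter` and the choice of comparing an 8x8 mesh with a 4x4x4 mesh are illustrative assumptions, not configurations taken from the cited works.

```python
from collections import deque
from itertools import product


def mesh_diameter(dims):
    """Return the diameter (in hops) of a mesh with the given side lengths."""
    nodes = list(product(*(range(k) for k in dims)))

    def neighbors(node):
        # Mesh links connect nodes that differ by one in exactly one dimension.
        for axis, size in enumerate(dims):
            for step in (-1, 1):
                coord = node[axis] + step
                if 0 <= coord < size:
                    yield node[:axis] + (coord,) + node[axis + 1:]

    def eccentricity(source):
        # Breadth-first search gives the minimal hop count from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for nxt in neighbors(current):
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
        return max(dist.values())

    # The diameter is the longest of all minimal paths.
    return max(eccentricity(n) for n in nodes)


if __name__ == "__main__":
    print(mesh_diameter((8, 8)))     # 2D mesh with 64 nodes
    print(mesh_diameter((4, 4, 4)))  # 3D mesh with 64 nodes
```

For the same 64 nodes, the 2D mesh has a diameter of 14 hops and the 3D mesh a diameter of 9 hops, which illustrates why adding a third dimension shortens the worst-case path.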

In fact, three-dimensional integrated circuits (3D-ICs) [25] have attracted much attention as a potential solution to the interconnect bottleneck. A three-dimensional chip is a stack of multiple device layers with direct vertical interconnects tunneling through them [26–29]. Research conducted so far has shown that 3D-ICs can achieve higher packing density thanks to the addition of a third dimension to the conventional two-dimensional layout. The average interconnect length can also be reduced, as demonstrated in [30, 31], where a stochastic model for the global net-length distribution of a 3D-SoC is derived. In comparison to a 2D-SoC, the results show that three-dimensional architectures potentially reduce the net length by a factor on the order of the square root of the number of strata. This reduction in net length could lead to significant reductions in chip footprint area, power dissipation, and cycle time [32, 33]. In addition, circuitry is more immune to noise with 3D-ICs [25].

Feero et al. [15] showed that 3D-NoC can reduce the latency and the energy per packet by decreasing the number of hops by 40%, the hop count being a fundamental factor in evaluating system performance. Pavlidis et al. [34] analyzed the zero-load latency and power consumption, and demonstrated that decreases of 62% and 58% in power consumption can be achieved with 3D-NoC compared to a traditional 2D-NoC topology for network sizes of N = 128 and N = 256 nodes, respectively, where N is the number of cores connected in the network.

However, in large-scale systems, where the number of cores can reach a few thousand, performance is still limited by the available interconnect bandwidth and the associated power budget. In fact, the power consumption is proportional to the number of cores [34], in particular the power spent on data transfers between processors and memory, between processing units, or between memory storage units. Thus, to achieve a significant and scalable solution to the interconnect problems, fundamental changes in system interconnect and fabrication technologies are needed, especially at the architecture level, where the power consumption should be independent of the future growth in the number of cores and in the bandwidth requirement.

A promising approach to the above problems is the use of integrated optics technology, which could increase the ratio between data rate and power dissipation. Photonic Integrated Circuits (PICs) use light (i.e., photons) rather than electrons to send and receive data across the chip. Recent developments in nanostructures, metamaterials, and silicon technologies have expanded the range of functionality that allows light to be used as a means of data transfer inside the chip.

### 1.3 Photonic Interconnect

Integrated photonic technology is an attractive interconnect solution that can be used to mitigate the energy and bandwidth bottlenecks arising in SoCs. Photonic interconnects can improve bandwidth density by leveraging Wavelength Division Multiplexing (WDM) to transmit multiple parallel streams of data concurrently over a single waveguide [35], with very high data rates reaching 25 Gbps and 40 Gbps per channel (i.e., wavelength) [36, 37]. As a result, photonics can alleviate the problems facing interconnect subsystems that are reaching limits in wire and input/output (I/O) pin density.
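As a rough illustration of how WDM aggregates bandwidth on one waveguide, the sketch below multiplies the number of wavelengths by the per-channel rate. The per-channel rates of 25 and 40 Gbps come from the text; the channel count of 16 and the 1 Tbps target are illustrative assumptions, and the model ignores losses, crosstalk, and encoding overhead.

```python
import math


def aggregate_bandwidth_gbps(num_wavelengths, rate_per_wavelength_gbps):
    """Aggregate bandwidth of one waveguide carrying independent WDM channels."""
    if num_wavelengths < 0 or rate_per_wavelength_gbps < 0:
        raise ValueError("inputs must be non-negative")
    return num_wavelengths * rate_per_wavelength_gbps


def wavelengths_needed(target_gbps, rate_per_wavelength_gbps):
    """Smallest number of wavelengths that reaches the target bandwidth."""
    if rate_per_wavelength_gbps <= 0:
        raise ValueError("per-wavelength rate must be positive")
    return math.ceil(target_gbps / rate_per_wavelength_gbps)


if __name__ == "__main__":
    # 16 channels and a 1000 Gbps target are assumptions for illustration.
    for rate in (25, 40):
        print(rate, aggregate_bandwidth_gbps(16, rate), wavelengths_needed(1000, rate))
```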

When using photonics in NoC architectures, the key to saving power is that once a photonic path is established, the optical data is transmitted end-to-end without the need for buffering, repeating, or regenerating. This is different from electronic NoCs, where messages are buffered, regenerated, and then transmitted on the inter-router links several times en route to their destination. In addition, photonic routers do not need to switch for every bit of the transmitted data as electronic routers do; optical routers switch on and off once per message, and their energy dissipation does not depend on the bit rate. This feature allows the transmission of ultra-high-bandwidth messages while avoiding the power cost found in traditional electronic networks. Higher bandwidth density is attractive from a deployment and integration standpoint since it enables enormous throughput within the same or smaller physical dimensions than in the electronic domain. The combined advantages of better bandwidth density and power efficiency make photonic interconnects a serious contender as a technological replacement for electronic interconnects.
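The scaling argument above can be sketched with a toy energy model. Everything in it is an assumption made for illustration: the function names, the femtojoule values, and the extra per-bit term for electro-optic and opto-electronic conversion, which the paragraph does not mention but which a real link would pay at its end points.

```python
def electronic_energy_fj(bits, hops, hop_energy_fj_per_bit):
    """Every bit is buffered, regenerated, and re-driven at every hop."""
    return bits * hops * hop_energy_fj_per_bit


def photonic_energy_fj(bits, hops, switch_energy_fj_per_msg, conversion_fj_per_bit):
    """Switches are set once per message; only end-point conversion is per bit."""
    return hops * switch_energy_fj_per_msg + bits * conversion_fj_per_bit


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to show how the two costs scale.
    for bits in (1_000, 10_000, 100_000):
        e_elec = electronic_energy_fj(bits, hops=4, hop_energy_fj_per_bit=100)
        e_phot = photonic_energy_fj(bits, hops=4,
                                    switch_energy_fj_per_msg=5_000,
                                    conversion_fj_per_bit=50)
        print(bits, e_elec, e_phot)
```

In this model the electronic cost grows with the product of message size and hop count, whereas the hop-dependent part of the photonic cost is paid once per message regardless of its size.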

It should be mentioned that the difficulties of electrical interconnections are not simply ones of scaling the power and bandwidth of the interconnection system. There are plenty of other difficulties that could be dominant reasons for moving to a radical solution like photonic interconnection, including issues such as voltage isolation, timing accuracy, and overall ease of design. For example, an optical link may allow larger synchronous zones in systems, not only on one chip but also possibly extending to multiple chips. The performance limitations of electrical interconnects at high speeds make it difficult to maintain large synchronous zones. Photonic links help because their effective signal velocities may be higher than those of electronic lines (repeated and unrepeated), where the effective signal propagation velocity is limited to a relatively small fraction of the velocity of light (e.g., 10-20%) [38].

Another benefit of photonic interconnects is that they intrinsically provide voltage isolation between the different components inside the chip. In fact, photodetectors essentially count photons rather than measure classical voltage, and provide perfect voltage isolation as a result [38].

### 1.4 Photonic Networks-on-Chip: Problems and Motivation

When using photonic links, the first challenge facing on-chip network designers is the lack of data buffering and in-flight processing for optical beams. Two main approaches have been proposed to deal with this issue. The first approach is the Electro-Assisted PNoC (EA-PNoC) architecture. In an EA-PNoC, an Electronic Control Network (ECN) is dedicated to handling the arbitration and routing process [39] and to setting the switches before the data transmission takes place in an end-to-end fashion. The data is transmitted in a Photonic Communication Network (PCN) made of photonic switches [39] or crossbars [40]. This approach is power hungry since it uses an entire network for routing and arbitration purposes. In addition, the path configuration process required to enable the end-to-end transmission adds a latency overhead [41–43]. On the other hand, this approach provides a huge bandwidth since the whole laser power budget is dedicated to the data transfer.

The second approach is the Wavelength-Routed PNoC (WR-PNoC) architecture [44–48]. WR-PNoCs use individual wavelengths, which can be statically or dynamically allocated to source-destination pairs using combinations of modulators, filters, and waveguides. They use wavelength selectivity to route data through the network, in contrast to circuit-switched networks (i.e., EA-PNoCs), which use wavelength selectivity for bandwidth aggregation. Such networks are characterized by low latency since there is no path configuration before the transmission. However, this kind of architecture has a high photonic-layer complexity (e.g., more than a million ring resonators are required for the implementation in [44]). Besides the complexity, the basic building block of these architectures is the ring resonator. Because temperature variations affect the refractive index [49, 50], the resonance frequency of the ring can shift, leading to misrouting.
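A static allocation of wavelengths to source-destination pairs can be sketched as follows. The modular mapping used here is a hypothetical example chosen for its simplicity; it is not the allocation used by any of the cited WR-PNoC architectures, and the function names are assumptions.

```python
def wavelength_for(src, dst, num_nodes):
    """Hypothetical static mapping: wavelength index for the pair (src, dst)."""
    if not (0 <= src < num_nodes and 0 <= dst < num_nodes):
        raise ValueError("node index out of range")
    if src == dst:
        raise ValueError("a node does not send to itself")
    return (dst - src) % num_nodes


def routing_table(num_nodes):
    """Enumerate the wavelength assigned to every ordered node pair."""
    return {
        (src, dst): wavelength_for(src, dst, num_nodes)
        for src in range(num_nodes)
        for dst in range(num_nodes)
        if src != dst
    }


if __name__ == "__main__":
    table = routing_table(4)
    for dst in range(4):
        # At each receiver, every sender arrives on a different wavelength,
        # so passive filters can separate the senders without arbitration.
        arriving = sorted(w for (s, d), w in table.items() if d == dst)
        print(dst, arriving)
```

With this mapping, N nodes need N - 1 wavelengths, which hints at why the number of wavelengths and photonic devices becomes the main design concern as the network grows.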

Both approaches have their pros and cons. For the EA-PNoC scheme, which is used for bandwidth-sensitive workloads, more attention should be given to the ECN in order to minimize the path configuration overhead. For the WR-PNoC scheme, in contrast, the design concentrates on minimizing the number of photonic devices (i.e., modulators and detectors), as well as the number of wavelengths used.

### 1.5 Thesis Objectives and Contributions

Building on the facts mentioned above, in this thesis we propose a high-performance, scalable nanophotonics on-chip network for many-core systems. First, a novel energy-efficient and high-throughput many-core hybrid silicon-photonic Network-on-Chip architecture (PHENIC) is proposed. The proposed PHENIC system has a Non-Blocking Photonic Switch (NBPS) and is equipped with a contention-aware path configuration algorithm. The proposed system efficiently reduces the occurrence of blocking, which reduces the total energy and increases the system's bandwidth. We demonstrate that the proposed system has better performance and lower energy dissipation than conventional EA-PNoCs.

Second, to further optimize the PHENIC system, a Wavelength-Routed Control Network (WRCN) is proposed, where the ECN is replaced by another plane made of photonic devices rather than electronic ones (i.e., electronic routers). Moreover, a Wavelength-Shifting Routing Algorithm (WSRA) is proposed to handle the different routing and arbitration processes. The proposed WRCN comes as a compromise between the EA-PNoC and WR-PNoC architectures, where a considerable bandwidth can be aggregated (as in EA-PNoC) with low power and latency overhead (as in WR-PNoC).

The main contributions of this research are as follows:

- A new Non-Blocking Photonic Switch (NBPS) capable of handling all the acknowledgment signals required for the path setup process (i.e., ACK and Tear-down). To this end, we adopt a new hybrid switching policy in the PCN. Spatial switching, which is the policy mostly used in conventional hybrid-PNoC designs, is used for the data stream transfer and is performed by manipulating the state of the broadband switching elements. Wavelength Selective Switching is used for the acknowledgment and Tear-down signals by means of passive filters placed at the input and output of each port.
- A contention-aware path configuration algorithm [42, 43] that decouples the ECN from the PCN so that they work independently of each other. The proposed algorithm orchestrates the processing of the different path setup packets, thus significantly alleviating the contention in the ECN and its consequent energy overhead, and further enhancing the bandwidth, in contrast to previously proposed hybrid-PNoC architectures.
- A Wavelength-Routed Control Network (WRCN), which aims to reduce the path configuration delay as well as the energy. The proposed architecture is augmented with a Wavelength-Shifting Routing Algorithm (WSRA), which manages the different routing and arbitration processes.
- A detailed performance evaluation where we highlight the efficiency of the proposed system and the performance gain compared to well-known, previously proposed EA-PNoC systems [51]. Moreover, an analytical model is proposed, where we highlight the merits of the proposed WSRA in terms of path configuration delay and energy efficiency.

### 1.6 Thesis Outline

The rest of the thesis is organized as follows:

- In Chapter 2, we first give an overview of the main components of the photonic on-chip interconnect, and we highlight the different routing schemes.
- Chapter 3 presents some of the important related works that dealt with EA-PNoC, WR-PNoC, and other architectures.
- Chapter 4 introduces the proposed non-blocking electro-optic router with its two components: the non-blocking photonic switch and the electronic controller.
- In Chapter 5, the proposed photonic architecture and the contention-aware routing algorithm are discussed. A detailed performance evaluation is given at the end of this chapter.
- We dedicate Chapter 6 to introduce the proposed WRCN and the corresponding routing algorithm (i.e., WSRA). A design space exploration is performed where we compare the proposed scheme with conventional ones.
- Finally in Chapter 7, we end this thesis with the conclusion. We also discuss how this work can be optimized further.

## Chapter 2

# Background: Photonic Networks-on-Chip

In this chapter, we introduce the photonic on-chip interconnect paradigm, and we explain its principal components and the different steps that the data need to go through. We also highlight the two main approaches of routing photonic data, which are the EA-PNoC and WR-PNoC approaches. Examples from the literature are shown and we discuss the pros and cons of each approach.

### 2.1 Photonic Communication

Figure 2.1 shows a functional diagram of an optical communication link. The original electronic data stream is first encoded for signal conditioning. The second step is to increase the per-wire data rate (if needed) through serialization, which reduces the number of wires required at the output by combining multiple incoming data streams (i.e., wires). The aggregate data rate remains constant before and after the serializer. Third, a driver circuit is required for each transmission wire to further condition each signal with the appropriate peak-to-peak voltage levels and to supply an adequate amount of current to drive the optical modulator. Finally, the modulator translates the electrical signal into an optical signal so that it can propagate on the photonic interconnection network. Once the original data stream is converted to the optical domain, the data travels along the waveguides allocated by the routing process until it reaches the receiver side.

Figure 2.1: Functional diagram of an optical communication.

At the receiver side, a dedicated photodetector intercepts the incoming light beam and converts the photon stream back into an electrical current. After detection, the resulting current is amplified by a Trans-Impedance Amplifier (TIA), which converts the output of the photodetector from a current-based to a voltage-based signal. The next step is deserialization, which restores the original data rate (i.e., the rate before serialization). Lastly, the decoding step recovers the original data signal.
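The serialization and deserialization steps described above can be sketched as a bit-interleaving round trip. This is a simplified software model under stated assumptions: the function names, the interleaving order, and the 2:1 ratio are illustrative, and encoding, driving, modulation, detection, and amplification are left out.

```python
def serialize(lanes, ratio):
    """Interleave every group of `ratio` parallel lanes into one faster lane."""
    if ratio < 1 or len(lanes) % ratio != 0:
        raise ValueError("lane count must be a multiple of the ratio")
    if len({len(lane) for lane in lanes}) > 1:
        raise ValueError("all lanes must carry the same number of bits")
    serial = []
    for start in range(0, len(lanes), ratio):
        group = lanes[start:start + ratio]
        # Take one bit from each lane in turn, cycle after cycle.
        serial.append([bit for cycle in zip(*group) for bit in cycle])
    return serial


def deserialize(serial, ratio):
    """Undo `serialize`: split each fast lane back into `ratio` slow lanes."""
    lanes = []
    for lane in serial:
        for offset in range(ratio):
            lanes.append(lane[offset::ratio])
    return lanes


if __name__ == "__main__":
    parallel = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 0]]
    fast = serialize(parallel, ratio=2)
    # Four lanes of 3 bits become two lanes of 6 bits in the same time window,
    # so the per-lane rate doubles while the aggregate rate is unchanged.
    print(fast)
    print(deserialize(fast, ratio=2) == parallel)
```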

#### 2.1.1 Photonic NoC Building Blocks

In this section, the main photonic building blocks needed for the generation and detection of optical data inside a Photonic Network-on-Chip (PNoC) are introduced.

#### 2.1.1.1 Laser

Lasers emit light through a process of optical amplification based on the stimulated emission of photons. Most lasers consist of a gain medium, a pump, and a mechanism for optical feedback [52]. For photonic interconnection networks, the key laser parameters are the wavelength of operation, maximum output power, power efficiency, stability, footprint, CMOS compatibility, and cost. By coupling an off-chip laser with broadband quantum-dot semiconductor optical amplifiers, many wavelength channels with low relative intensity noise can be produced, which may then be modulated, transmitted, and received. This approach alleviates the packaging complexity of the silicon chip while simultaneously minimizing the overall cost and power consumption. One of the promising solutions for the off-chip laser is the Vertical-Cavity Surface-Emitting Laser (VCSEL) [53]. Such devices can be flip-chipped directly on top of the grating coupler, with strict optical alignment needed for integration.

#### 2.1.1.2 Coupler

Couplers allow on-chip components to interface physically with off-chip ones. The challenge in couplers is that the cross-section of the on-chip waveguide is about 1000 times smaller than that of the off-chip fiber [54]. This mismatch in cross-section between the off-chip and on-chip domains gives rise to many problems involving insertion loss, integration density, bandwidth density, crosstalk, reflectivity, and scalability. Another major challenge for each coupling technology, critical for its commercial viability, is its packaging [54].

#### 2.1.1.3 Waveguide

The waveguide is the basic building block of the photonic link. It is used to carry a high-speed optical data stream from one point to another. In most silicon photonic applications, the high index contrast between the silicon waveguides and the silicon dioxide cladding surrounding them results in an extremely small optical mode size for single-mode operation. This condition enables very dense integration of silicon photonic devices. For example, crystalline silicon photonic waveguides are capable of transporting wavelength-parallel optical data with terabit-per-second data rates across the entire chip [54, 55]. In addition, as the photonic link might follow an irregular trajectory, waveguides can be bent [54, 56] or crossed if needed [54, 57]. Figure 2.2 shows a cross-section of a silicon waveguide.



Figure 2.2: Cross-section of a waveguide [58].

#### 2.1.1.4 Micro-Ring Resonator

Micro-Ring Resonators (MRRs) are widely used to achieve high-performance operations such as switching, detection, and modulation. MRRs with a high refractive index contrast are used in dense integration due to their small size. When used as dynamic components, MRRs can serve as micro-ring modulators to modulate a specific wavelength according to their Free Spectral Range (FSR). An extremely small ring has a large FSR and allows the filtering of a single wavelength channel [54]. In addition to wavelength selectivity, MRRs can act as broadband switches that switch many wavelengths at a time. This technique is used for bandwidth aggregation when multiple wavelengths are needed to deliver the required bandwidth. It is used mainly in EA-PNoCs, where the MRRs are tuned electronically through an integrated heater. Figure 2.3 shows scanning electron micrographs of a fabricated micro-ring resonator with a waveguide width of 500 nm, together with waveguide cross-sections at two cleaved facets.



Figure 2.3: Scanning electron micrographs of a fabricated microring resonator and waveguide cross-section at two cleaved facets [59].

#### 2.1.1.5 Modulator

The electro-optic modulator [54, 60] is a critical device that enables high-speed conversion of an electrical signal into an optical signal. By using an array of modulators, multiple wavelengths can be obtained (i.e., multiplexing). The resonant wavelength (the wavelength to which each modulator is tuned) is determined by the round-trip phase of the micro-ring resonator used. Integrated silicon modulators can operate at very high frequencies, reaching 25 GHz [36] or even 40 GHz [37].

Figure 2.4 [61] shows a modulator with the embedded heater needed to tune it to a specific wavelength.



Figure 2.4: Micro-ring modulator. (a) Cross-section of the designed modulator with a local heater and (b) Top-view SEM image of the fully fabricated microring modulator. The ring radius is 5 $\mu$m [61].

#### 2.1.1.6 Photodetector

Photodetectors are the end point of a photonic link. By absorbing the incoming photons, a photodetector translates the optical data stream into an electronic one. When multiplexing is used at the start of the photonic link (i.e., an array of modulators producing multiple wavelengths), an array of photodetectors is needed to isolate each wavelength. To achieve this, a passive (selective) MR is placed in front of each photodetector to intercept the required wavelength. Germanium is used as the absorbing material for photodetection, with a resulting bandwidth reaching 40 GHz [62] and 60 GHz [63]. Figure 2.5 shows a circuit model of a germanium photodetector.



Figure 2.5: Circuit model of germanium detector with inductive gain peak. (a) Optical micrograph of the gain-peaked photodetector using the 360 pH inductor. The inductor is approximately 100 $\mu$m $\times$ 100 $\mu$m in size [63] and (b) Inductor used to peak the frequency response of the photodetector [63].

## 2.2 Routing Schemes in PNoCs

In this section, the two main routing schemes in PNoCs are described. Circuit switching, used in EA-PNoCs, and wavelength-selective routing, used in WR-PNoCs, are the two main routing and arbitration schemes in most PNoC architectures. Some other schemes are also used, such as Time-Division Multiplexing [64, 65].



Figure 2.6: Anatomy of EA-PNoC architecture. The two communications are using the same set of wavelengths in a circuit switching scheme.

## 2.2.1 Circuit Switching

An important aspect of the EA-PNoC architecture is that once a path is set, no additional processing is required to ensure that a packet reaches its intended destination. This property comes at the cost of additional power and latency overheads. In an EA-PNoC (Electro-assisted PNoC), the source node first issues a configuration packet to the destination node via a copper-based electrical link. The configuration packet is routed through the ECN (Electronic Control Network), reserving the photonic switches along the path for the photonic message that will follow it in the PCN (Photonic Communication Network). It includes the source and destination addresses and any additional control information needed. When the destination node receives the configuration packet, it acknowledges via an ACK packet that the path is ready for transmission. When the ACK packet is received and processed, the source node starts transmitting the optical data stream through the previously reserved photonic switches. When the transmission is done, the reserved path is released by a release packet. The circuit-switched nature of these EA-PNoCs directly affects the performance and power characteristics of on-chip communication.
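The reservation and release steps described above can be sketched as a small model. This is a simplified illustration rather than the mechanism of any specific architecture: the 2D-mesh coordinates, the dimension-order (XY) routing, the class and method names, and the assumption that a switch is held entirely by one communication are all hypothetical choices made for the example.

```python
def xy_path(src, dst):
    """Switches visited by dimension-order (X then Y) routing on a 2D mesh (assumed)."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

class CircuitSwitchedNetwork:
    """Toy model: each photonic switch is held by at most one communication."""

    def __init__(self):
        self.reserved = {}  # switch coordinate -> id of the communication holding it

    def setup(self, comm_id, src, dst):
        """Configuration packet: reserve every switch on the path, hop by hop."""
        taken = []
        for switch in xy_path(src, dst):
            if switch in self.reserved:
                for s in taken:  # blocked: undo the partial reservation
                    del self.reserved[s]
                return None
            self.reserved[switch] = comm_id
            taken.append(switch)
        return taken  # at this point the destination would answer with an ACK

    def release(self, comm_id):
        """Release packet: free every switch held by this communication."""
        self.reserved = {s: c for s, c in self.reserved.items() if c != comm_id}

net = CircuitSwitchedNetwork()
first = net.setup("C1", (0, 0), (2, 1))
second = net.setup("C2", (1, 0), (1, 2))  # needs switch (1, 0), already held by C1
net.release("C1")
retry = net.setup("C2", (1, 0), (1, 2))
print(first, second, retry)
```

In this toy run the second request fails while the first path is held and succeeds once it is released, which is the behavior that makes setup latency and blocking central to the performance of circuit-switched PNoCs.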

Figure 2.6 shows two communications between two different source-destination pairs. Both communications use the same set of wavelengths, which provides an enormous bandwidth through WDM. In contrast with wavelength-routed architectures, where WDM is used for routing purposes, in an EA-PNoC the available wavelengths (i.e., the laser power budget) are completely dedicated to data transmission, which provides the huge bandwidth needed for bandwidth-intensive applications.

## 2.2.2 Wavelength-Routed

Wavelength-routed architectures allocate individual wavelengths to source-destination pairs using combinations of modulators, filters, waveguides, and photodetectors. Wavelength-routed networks use wavelength-selective routing to route data between source and destination, selecting the appropriate wavelength through filters so that it is delivered to the required destination, in contrast to circuit-switched networks where all available wavelengths can be dedicated to one communication. These architectures exhibit lower latency than electro-assisted architectures since they do not require a path configuration. On the other hand, such a scheme is characterized by an excessive use of photonic devices, which increases proportionally with the number of nodes.

Figure 2.7 shows the anatomy of a wavelength-routed link. Any source/destination pair uses a unique wavelength<sup>1</sup>. The main challenge in this kind of architecture is how to use the minimum set of photonic devices and wavelengths to ensure a contention-free network. In the following subsections, we show how a WR-PNoC can be built using photonic components.



Figure 2.7: Anatomy of WR-PNoC architecture. The four communications are using a different set of wavelengths.

<sup>1</sup>Most of the proposed architectures in the literature follow this approach to minimize the number of required photonic devices.

#### 2.2.2.1 Source-based Routing

The source-based routing is a configuration where each node reads from a single wavelength channel that has been assigned to it. Any other node can write to this channel. Figure 2.8 shows an implementation example with microring modulators, ring filters, and detectors. The layout uses  $N \times N$  rings and N wavelength channels, where N is the number of nodes. To prevent network collision, multiple access points must be disallowed from concurrently writing to a common destination. This could be accomplished with a separate arbitration network, such as a token arbitration ring [44].
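The need for arbitration in this scheme can be illustrated with a small check for concurrent writers. The sketch assumes, purely for illustration, that node i reads wavelength channel i and that a set of transmissions is attempted in the same time window; the function name and the channel numbering are hypothetical.

```python
from collections import defaultdict

def find_collisions(n_nodes, transmissions):
    """Return {destination: [sources]} for destinations written by more than one source.

    Node i is assumed to read wavelength channel i (hypothetical numbering).
    """
    writers = defaultdict(list)
    for src, dst in transmissions:
        if not (0 <= src < n_nodes and 0 <= dst < n_nodes) or src == dst:
            raise ValueError(f"invalid transmission {src}->{dst}")
        writers[dst].append(src)  # src modulates the channel owned by dst
    return {dst: srcs for dst, srcs in writers.items() if len(srcs) > 1}

# Nodes 0 and 2 both write to node 3 in the same window: an arbiter must serialize them.
same_window = [(0, 3), (1, 2), (2, 3)]
print(find_collisions(4, same_window))
```

An arbitration mechanism such as the token scheme mentioned above would ensure that this function never returns a non-empty result for the transmissions actually admitted to the network.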



Figure 2.8: Example of a source-based routing with four nodes. Each node receives data on a dedicated wavelength channel and transmits on all other wavelength channels.

#### 2.2.2.2 Destination-based Routing

Figure 2.9 shows the implementation of destination-based routing for four nodes. In this configuration, a node modulates a single wavelength channel, and its intended destination selectively reads it. Like the source-based routing implementation, this method uses $N \times N$ rings and $N$ wavelength channels.

#### 2.2.2.3 Multiple Write Single Read

Another way of connecting nodes together is a multi-write single-read (MWSR) interconnection. This kind of connection generally involves a serpentine-like waveguide that passes through all nodes and covers the entire chip. An MWSR connection between four nodes is shown in Fig. 2.10, where each node is represented by its transmitter part (Tx) and its receiver part (Rx). Each receiver node has its dedicated waveguide, where the receiver part is tuned to the wavelength modulated by the other (N-1) nodes, N being the number of nodes in the network. This configuration, like the source-based and destination-based schemes, requires an arbitration network to tune the required modulators in the transmitter part of each node. Assuming only one wavelength for the transmission, as shown in Fig. 2.10, this configuration requires N waveguides in addition to (N-1) modulators and one photodetector at each node.

Figure 2.9: Example of a destination-based routing with four nodes. Each node transmits data on a dedicated wavelength channel and receives on all other wavelength channels.

Figure 2.10: A Multi-Write Single-Read connection between four nodes.

#### 2.2.2.4 Single Write Multiple Read

Figure 2.11 shows an alternative to MWSR, the Single Write Multiple Read (SWMR) connection. This configuration also requires one waveguide for each node. Each node has one modulator and (N-1) photodetectors. The rings placed in front of the photodetectors are turned ON or OFF by a separate control network.



Figure 2.11: A Single-Write Multi-Read connection between four nodes.

#### 2.2.2.5 Fully Connected Crossbar

A disadvantage of the previous wavelength-routed schemes is that they require an additional arbitration mechanism, either local or global. To avoid having to arbitrate between the nodes, a fully connected wavelength crossbar can ensure contention-free communication. As shown in Fig. 2.12, each source modulates data on a different wavelength channel depending on the destination, and each destination receives on a different wavelength depending on the source. This implementation requires $N \times (N-1)$ modulators, $N \times (N-1)$ photodetectors (each with its corresponding filter ring), and $N \times (N-1)$ wavelength channels. Such a configuration might therefore not be feasible for large networks, or even for small ones if we consider the consequences of putting many wavelengths into a single waveguide in terms of crosstalk and data correctness.
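To make the scaling concrete, the device counts quoted in this section for the wavelength-routed schemes can be tabulated as a function of the number of nodes. The sketch below only restates those figures (with the MWSR and SWMR per-node counts multiplied by N, under the single-wavelength assumption of the figures); the function name and dictionary layout are illustrative choices.

```python
def wr_pnoc_resources(n):
    """Device counts for n nodes, restating the per-scheme figures quoted in the text."""
    return {
        "source-based":      {"rings": n * n, "wavelengths": n},
        "destination-based": {"rings": n * n, "wavelengths": n},
        # MWSR: (n - 1) modulators and one photodetector at each node, n waveguides.
        "MWSR": {"waveguides": n, "modulators": n * (n - 1), "photodetectors": n},
        # SWMR: one modulator and (n - 1) photodetectors at each node, n waveguides.
        "SWMR": {"waveguides": n, "modulators": n, "photodetectors": n * (n - 1)},
        "crossbar": {
            "modulators": n * (n - 1),
            "photodetectors": n * (n - 1),
            "wavelengths": n * (n - 1),
        },
    }

for n in (4, 16, 64):
    print(n, wr_pnoc_resources(n)["crossbar"])
```

The quadratic growth of the crossbar entries is what makes the fully connected scheme hard to scale, independently of the crosstalk concerns raised above.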



Figure 2.12: A fully connected crossbar connecting four nodes. Each source-destination combination uses a dedicated wavelength.

## 2.3 Photonic NoC Metrics

In addition to the conventional metrics, such as power consumption, delay and throughput, PNoCs have other specific metrics that need to be taken into consideration when designing an efficient and reliable system. The two most important design considerations are the power budget and the data integrity.

## 2.3.1 Power Budget

The power budget here does not refer to the power consumed by the network; it is defined as the maximum laser power that can be injected into the network without altering the photonic devices (waveguides, modulators, etc.). The injected laser power should not exceed the nonlinear threshold of the photonic components, defined as the power level above which the components start to exhibit nonlinear behavior. The two main nonlinear effects are Two-Photon Absorption (TPA) [66] and Free Carrier Absorption (FCA). When present, these two phenomena induce high loss [66], resonance mismatch [54], and alteration of the components. The power penalty, or power loss, which is the amount of power lost between the source and the destination, will be discussed in the next chapter when evaluating the proposed switch. This loss can be caused by a crossing, or by passing through or by an MRR. The lost power then propagates inside the network, causing noise at the receiver side.
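The two constraints discussed here can be combined into a simple feasibility check in decibel units: the injected power must stay below the nonlinear threshold, and what remains after the accumulated losses must still be detectable. The sketch below is an assumption-laden illustration: the detector sensitivity is a parameter introduced for the example (it is not discussed in this section), and all numeric values are placeholders rather than measured data.

```python
def link_feasible(laser_dbm, losses_db, nonlinear_threshold_dbm, detector_sensitivity_dbm):
    """Check one photonic path against both ends of the power budget (dB / dBm units)."""
    received_dbm = laser_dbm - sum(losses_db)  # losses add up in dB along the path
    return {
        "received_dbm": received_dbm,
        "below_nonlinear_threshold": laser_dbm <= nonlinear_threshold_dbm,
        "detectable": received_dbm >= detector_sensitivity_dbm,
    }

# Placeholder numbers chosen only for illustration, not taken from the thesis.
path_losses_db = [1.0, 0.5, 0.5, 0.25, 0.25]  # e.g. coupler, crossings, MRR pass-bys
print(link_feasible(10.0, path_losses_db, 20.0, -20.0))
```

Under this simplified view, every additional crossing or MRR on the path consumes part of the margin between the nonlinear threshold and the minimum detectable power.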

## 2.3.2 Data Integrity

Data integrity concerns the reliability of the network in terms of data correctness, and how the network should be built to minimize the losses and the resulting noise. Figure 2.13 shows an example of the resulting noise in a photonic link with two wavelengths. In addition to the noise induced by the losses, noise can also be caused by the laser and by the modulation, as shown in (a) and (b), respectively. In (c), the crossing element causes a power penalty to the signal: a portion of the original signal spreads to the other outputs of the crossing element. These small portions of power travel across the network and couple with other signals, leading to additional noise, as shown in (e). Another source of noise is the imperfect coupling of the ring, as shown in (d). Before being detected, the signal needs to be filtered. In this filtering step, some of the power leaks and is detected together with the original signal. Moreover, other causes of noise can be related to thermal effects, such as Johnson [67] and shot [68] noise.

Figure 2.13: Different sources of noise in a photonic link. (a) Laser noise, (b) modulation noise, (c) noise due to the loss induced by a crossing element, (d) noise due to the imperfect coupling of filters, and (e) noise due to power losses traveling inside the network and coupling with other signals.

The amount of noise in a detected signal is given by the *Signal to Noise Ratio* (SNR). From the SNR, the Bit Error Rate (BER) is calculated. Due to the presence of noise in a given channel, a receiver may detect a 1 bit when the originating transmitter intended to send a 0 bit; the BER is the rate of such erroneously detected bits. Theoretically, for error-free transmission the BER should be equal to zero, but a BER between $10^{-10}$ and $10^{-12}$ is also acceptable.
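As an illustration of how a signal-quality figure maps to a BER, the sketch below uses the textbook relation for on-off keying with Gaussian noise, BER = ½·erfc(Q/√2), where Q is the receiver quality factor. This particular formulation is an assumption made for the example; this section does not state which SNR-to-BER expression is used later in the thesis.

```python
import math

def ber_from_q(q):
    """BER of on-off keying under Gaussian noise: 0.5 * erfc(Q / sqrt(2)).

    Textbook relation, assumed here for illustration only.
    """
    return 0.5 * math.erfc(q / math.sqrt(2.0))

for q in (5.0, 6.0, 7.0):
    print(f"Q = {q:.1f} -> BER ~ {ber_from_q(q):.2e}")
```

With this relation, the acceptable BER range quoted above corresponds to a Q factor of roughly 6.4 to 7, which shows how steeply the error rate falls as the noise is reduced.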

## 2.4 Chapter Summary

In this chapter, the different steps involved in optical communication were explained, and the different photonic components used to build a PNoC were reviewed. We also reviewed the two main approaches for routing optical data: the electro-assisted and the wavelength-routed schemes. Some PNoC-specific metrics, such as the power budget and data integrity, were discussed. In the next chapter, some of the important works dealing with PNoC architectures are discussed.

# Chapter 3

# Related Works

In this chapter, we discuss some of the important related works that dealt with Photonic Networks-on-Chip. We divide this chapter into three sections. The first one deals with EA-PNoCs, where we present the works done on the path configuration algorithm as well as on the photonic switches. In the second section, some WR-PNoC architectures are discussed, along with the problem of the photonic resources required by such architectures. Finally, we dedicate a section to other architectures that mix the two previous approaches.

## 3.1 Electro-assisted Photonic NoCs

Many research groups have proposed EA-PNoC architectures. These works proposed either an optimization of the path configuration algorithm or a photonic switch with fewer MRRs (Micro-ring Resonators) and crossings.

Many works have been conducted so far to tackle the problems of the path configuration algorithm, such as the blocking occurrence and the resulting delay and power overheads. *Hendry et al.* [69] proposed a circuit-switched memory access in photonic interconnection networks. This work represents a typical EA-PNoC, where all path setup steps are generated and executed in the ECN. *Chan et al.* [70] proposed a circuit-switched EA-PNoC for core-to-memory connections, with the addition of wavelength-selective spatial routing to increase the path diversity and the bandwidth. *Chan et al.* [40] also proposed a circuit-switched mesh using a 4x4 non-blocking switch augmented with two gateways for ejection/injection from/to the network. Shacham et al. [39] proposed a torus hybrid PNoC based on a blocking 4x4 optical switch, with an additional network for the ejection/injection from/to the torus. *Petracca et al.* [71] proposed a non-blocking torus EA-PNoC where the conventional path setup scheme is used. Cisse et al. [72] proposed an EA-PNoC torus named HPNoC, which uses predictive switching [73] in the ECN to reduce the setup latency by reducing the pipeline stages of the electrical router. Although the latency is reduced by such predictive switching, the path setup steps are all generated and transmitted in the ECN. Ye et al. [74] proposed a new protocol, called Quickly Acknowledge and Simultaneously Tear-down (QAST), to reduce the control delays during the path setup and tear-down processes. QAST uses an optical ACK signal and sends a *Tear-down* packet at the beginning of a transmission instead of at the end, as in conventional EA-PNoCs. Sending the *Tear-down* in parallel with the transmission does not solve the problem of the path setup procedure, because the optical transmission of the data is very short, and sending the *Tear-down* after it, or at the same time, does not reduce the latency overhead.

The previously cited works [39, 40, 69, 71, 72, 74] use a conventional path configuration process in which all configuration packets are generated and transmitted in the electronic network. They differ only in the topology (mesh or torus), the adopted switch (blocking or non-blocking), and some optimizations such as the predictive switching in [72] or the Tear-down optimization in [74].

From the photonic switch point of view, many works have also been proposed. The main concern was to minimize the number of MRRs used, in addition to obtaining a better layout (fewer crossings, bends, etc.). Ye et al. [74] propose a non-blocking switch that comes in $5 \times 5$ and $4 \times 4$ configurations. While the $5 \times 5$ configuration is used for a typical connection between the four neighbors and the PE, the $4 \times 4$ configuration is used at the edges to save area and reduce the number of MRRs. Similarly, *Ji et al.* [75] propose a $5 \times 5$ non-blocking photonic switch whose routing functionality was demonstrated through 12.5 Gbps high-speed signal transmission experiments. Many blocking switches were also proposed; the aim of this kind of switch is to reduce the number of photonic components used (e.g., waveguides and MRRs), which implies better efficiency in terms of crosstalk and laser power. Among these works, we can cite those of *Chan et al.*, *Shacham et al.* [39], and *Wang et al.* [76].

## 3.2 Wavelength-routed Photonic NoCs

Wavelength-routed architectures come as an alternative to EA-PNoC architectures. The main argument for this kind of architecture is that a fully optical architecture eliminates the overheads associated with path configuration (i.e., delay and power). This argument is reasonable, but such architectures come at the cost of higher photonic device usage, which can reach several thousands of MRRs for a 64-core system, in addition to the required waveguides, modulators, and detectors.

Vantrease et al. proposed a fully photonic architecture named Corona, which completely removes all electrical interconnects and replaces them with an optical crossbar and token arbitration [44]; the complete architecture with a broadcasting mechanism requires a million MRRs. In a later work [77], they presented channel-based and slot-based protocols for their arbitration mechanism, in addition to flow control for fully optical interconnects. Gu et al. proposed FNOC [78], a fat-tree-based fully optical network. They omit the electronic control layer by using an optical turn-around router (OTAR), which carries both payload data and network control data on the same optical network. Pasricha et al. [79] proposed using an optical ring waveguide with bus protocol standards to replace global pipelined electrical interconnects. Beausoleil et al. [80] proposed a crossbar-based architecture where 64 wavelengths are multiplexed over 270 waveguides; 256 waveguides are allocated for control and data, and 14 waveguides are for broadcast and arbitration. Zhang et al. [81] introduced a multilayer nanophotonic interconnect named MPNOC, which uses multiple layers to create a crossbar with no optical waveguide crossover. Chan et al. [40] propose an optical crossbar using 56 waveguides to perform the routing process. A recent work proposed by Randy et al. [45] also uses a multi-layer photonic interconnect, with Micro-ring Resonators for the interlayer communication rather than TSVs, as used in [81]. *Kirman et al.* [82] proposed a fully optical ONoC using wavelength-based oblivious routing, where each node has physical connectivity to all other nodes via static paths. For the wavelength allocation between the nodes, they use a wavelength-reuse algorithm proposed by *Aggarwal et al.* [83]. *Chen et al.* [48] proposed a fully optical NoC that uses two wavelength assignment methods, called source-based wavelength assignment (SW) and destination-based wavelength assignment (DW).

Some other works focused on how to reduce the crossbar complexity in fully optical architectures. *Pan et al.* proposed *Flexishare* [84], a flexible crossbar topology that allows channel provisioning according to the average traffic load, together with a distributed token stream arbitration that provides multiple tokens for a given channel.

Although many works have tried to reduce the photonic devices required to implement fully optical WR-PNoCs, the feasibility of such architectures is not straightforward. Besides the number of required MRRs, such systems are more vulnerable to misrouting and faults than EA-PNoCs, because the arbitration and the routing depend tightly on the frequencies of the wavelengths used, which may deviate due to thermal issues inside the chip [49, 50].

## 3.3 Other Photonic NoC Architectures

In this section, we also introduce some other works, which are a mix of the two previous approaches. In these works, authors tried to mitigate the problem of the EA-PNoCs (i.e., path configuration overheads) and WR-PNoCs (i.e., excessive use of photonic components).

For example, *Pan et al.* proposed *Firefly* [85] which reduces the crossbar complexity by designing smaller optical crossbars connecting selected clusters and implementing electrical interconnect within the cluster. Another recent work was proposed by *Tan et al.* [86] where a butterfly fat-tree based hybrid optoelectronic NoC architecture is introduced using the generic wavelength-routed optical router. However, the wavelength assignment used in this approach for routing purposes leads to an inefficient use of the optical spectrum.

Some other works opted to deal with the ECN (Electronic Control Network) in EA-PNoCs by simply eliminating it. For instance, in the work proposed by *Wang et al.* [87], the typical ECN is reduced to one central controller that processes all path setup request packets and sets the corresponding optical switches according to an MRR state table. Although this solution reduces the hop count in the ECN, it suffers from a complex centralized router, and the electronic layer can no longer be used as a conventional one for small packets (e.g., cache block broadcasting).

Another interesting work addressing the path configuration problem was proposed by *Hendry et al.* [64,65], who completely remove the ECN and replace it with a *Time Division Multiplexing* arbitration scheme that provides round-robin fairness in setting up photonic circuit paths. In this work, instead of setting the path in the ECN, each communication between any pair of nodes is only allowed to be active during a specific time slot. Although this solution eliminates the blocking occurrence, since the time is divided between all nodes, the obtained results show that the electronic energy did not decrease. This is because of the buffering required when switching between the X and Y directions. Moreover, the path is fixed at the design level.

*Cianchetti et al.* [46] presented *Phastlane*, which is built upon a low-latency optical crossbar for data transmission under contention-free conditions. However, when contention exists, the router makes use of electrical buffers and, if necessary, a high-speed drop signaling network. Source-based routing is used, with wavelength shifting at each hop to decode the destination address.

Bahirat et al. proposed METEOR, a hybrid photonic NoC based on concentric rings. The path setup in the proposed architecture is performed using separate waveguides, in addition to the separate waveguides used for data transmission. The concentric rings interface with the cores through gateway interfaces. To keep implementation costs low, the authors restrict the number of gateway interfaces. The region of influence is defined as the region that has access to the concentric rings. Cores outside those regions communicate through the electronic network.

## 3.4 Chapter Summary

In this chapter, we discussed some of the important related works that deal with wavelength-routed architectures as well as electro-assisted ones. For EA-PNoC architectures, no prior work has actually tried to change the path configuration scheme, or to introduce a new one, while keeping the benefit of the electronic layer. In previous architectures, the problem of the electronic layer was only solved by eliminating it. We believe that the electronic layer can be kept and used for small messages (such as cache coherence protocol messages), since the electronic router is optimized and well suited for message broadcasting. In the coming chapters, the proposed electro-optic router and the contention-aware routing algorithm will be introduced as a solution for EA-PNoC architectures, as well as a new control plane made of photonic devices to further reduce the power and delay overheads. With the proposed system, we reduce the overhead of the path configuration to make EA-PNoCs a more convenient interconnection infrastructure for high-performance many-core systems.

# Chapter 4

# Non-blocking, Low Latency Electro-Assisted Photonic NoC Architecture

## 4.1 Introduction

In this chapter we explain the proposed EA-PNoC (Electro-Assisted Photonic Network-On-Chip architecture), named PHENIC. The backbone of this architecture, the electro-optic router (EOR), is introduced with its two main parts: the NBPS (Non-Blocking Photonic Switch) and its corresponding electronic controller. In this chapter, we also discuss the challenges of the EA-PNoC from the photonic point of view where the trade-offs between blocking and non-blocking switches are introduced. In addition, the challenges of designing the electronic controller module are also highlighted.

## 4.2 System Architecture

The simplified block diagram of the PHENIC system is shown in Fig. 4.1. The system consists of two networks. The first one is the PCN, which is based on silicon broadband photonic switches interconnected by waveguides. The second one is the ECN, which is used for path reservation and for configuring the optical switches in the PCN, mainly by powering the MRRs ON and OFF. Each Processing Element (PE) is connected to a local electronic router and to the corresponding gateway (modulator/detector) in the PCN. Messages generated by the PEs are separated into control signals and payload signals. Control signals are routed in the ECN and used for path setting (i.e., routing). The payload signals are converted to optical data and transmitted on the PCN. Many EORs have been proposed in the literature to mitigate the different challenges and limitations of both the photonic switch and the electronic controller. Before we introduce the proposed EOR, let us first review the most challenging issues in designing a photonic switch and its corresponding electronic controller.

## 4.3 Electro-optic Router Design Challenges

We previously performed a preliminary evaluation where we analyzed the performance of our earlier proposed PHENIC system<sup>1</sup> [88–90].

The evaluation was performed with different message sizes, network sizes, and various synthetic traffic patterns. We also compared the results with those of a conventional ENoC system [10, 11, 16, 91–96]. Figure 4.2 shows the power consumption comparison near saturation for different network sizes and different benchmarks. The first thing to notice from this figure is the efficiency of the PHENIC system in terms of power consumption when compared to a conventional ENoC. As we explained earlier, this efficiency is inherited from the low-power properties of the photonic link. The second observation is the large gap that appears when we increase the network size from 64 to 256 cores. Although this power overhead is still lower than that of the ENoC, it calls into question the scalability of EA-PNoC systems as the network size increases, because these systems are targeted at many-core systems that can reach hundreds or thousands of cores [10]. To understand the reasons for this increase in power, and which of the ECN and PCN is mainly responsible, we analyzed the power and latency overheads of a 256-core PHENIC system under different message sizes, as illustrated in Fig. 4.3.

As we can see in Fig. 4.3 (a), the ECN consumes the largest portion of the system power budget. This portion varies between 78% and 93% of the total system power, depending on the message size. At the same time, the setup latency is three times greater than the transmission latency, as illustrated in Fig. 4.3 (b). From these results, it becomes obvious that the ECN should be given more attention and further optimized to reduce the power overhead and to increase the throughput of a given EA-PNoC system. For this purpose, we evaluated the average dynamic energy in the input buffer, which is considered to be the most congested and power-hungry component in the ECN. This analysis gives a clear understanding of the effects caused by the different packets traveling through the ECN, since the input buffer activity reflects the behavior of the other components, such as the routing computation module, arbiter, crossbar, and inter-router links.

<sup>1</sup>This evaluation was made with a blocking photonic switch to see the impact of the blocking occurrence on the system performance. All evaluation results can be found in [41].

Figure 4.1: PHENIC system architecture. (a) Electro-optic routers interconnected in a 3x3 mesh, (b) 5x5 non-blocking photonic switch, and (c) unified tile including PE, NI, and control modules.

Figure 4.2: Power consumption comparison results.

Figure 4.4 depicts the input-buffer dynamic energy breakdown near saturation when evaluated on 64-core and 256-core systems. From this figure, we can see that two portions dominate the total dynamic energy in the input buffer: the *Path-setup-Control-Packet* (PSCP) and *Path\_blocked* packets (85%), while the *ACK* and *Tear-down* packets consume a much smaller share (15%). As we previously mentioned, for every communication between a source and a destination, a *PSCP* packet is injected and travels through the ECN to request the necessary resources in the PCN.

Figure 4.3: PNoC's power and latency distributions. (a) Power and (b) latency.

When another communication is using these resources, a *Path\_blocked* packet is generated and travels back to the source node, releasing the resources already reserved by the *PSCP* packet. The alternation between *PSCP* and *Path\_blocked* packets continues until the requested resources are released and become available. In this fashion, a significant amount of energy is wasted on generating, processing, and storing ineffective packets that never reach their destinations. The energy burden of these two packets can be seen in Fig. 4.4, where we can also observe that they are roughly equivalent. This is logical, since as the number of *Path\_blocked* packets increases, the number of *PSCP* packets necessary to establish the path again increases as well.
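The coupling between the two packet types can be illustrated by counting the control packets needed for a single message whose path is blocked a given number of times before it succeeds. This is a deliberately simplified sketch: it counts packets rather than energy, ignores hop counts, and the function names are hypothetical.

```python
def ecn_packets_for_one_message(blocked_attempts):
    """Count ECN control packets when the path setup is blocked `blocked_attempts` times."""
    return {
        "PSCP": blocked_attempts + 1,      # every failed attempt plus the successful one
        "Path_blocked": blocked_attempts,  # one per failed attempt
        "ACK": 1,
        "Tear-down": 1,
    }

def setup_share(counts):
    """Fraction of packets that are PSCP or Path_blocked (packet count, not energy)."""
    total = sum(counts.values())
    return (counts["PSCP"] + counts["Path_blocked"]) / total

for k in (0, 1, 5):
    counts = ecn_packets_for_one_message(k)
    print(k, counts, round(setup_share(counts), 2))
```

Even in this crude model, the numbers of PSCP and Path_blocked packets track each other closely, and together they quickly dominate the control traffic as the number of blocked attempts grows.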

In an ideal situation, the *Path\_blocked* energy overhead should be removed, and the PSCP overhead should be only slightly higher than the ACK and *Tear-down* overheads. In practice, this is very difficult to achieve; thus, the most effective approach is to reduce the blocking occurrence as much as possible, in order to reduce the generation of *PSCP* packets and alleviate the consequent congestion. In conventional EA-PNoC systems, blocking is mainly caused by two major factors. The first is the use of a blocking optical switch, where some input or output ports share the same resources (MRRs and waveguides). This kind of switch has been used in many prior works [40,76] to decrease the energy (both static and dynamic) in the PCN by reducing the number of MRRs and waveguides.

Figure 4.4: Input-buffer dynamic energy breakdown before saturation.

Figure 4.5 shows a blocking switch layout [41,76]. Blocking switches are simple, with a limited number of MRRs and waveguides, while non-blocking switches are much more complex. Nevertheless, the ECN energy and latency increase significantly when a blocking switch is used, because of the large difference between the energy properties of the photonic and electrical domains. Consequently, EA-PNoCs should be equipped with non-blocking optical switches, which eliminate the dependency caused by resource sharing between communications.

The second factor is the high congestion frequently found in the ECN. As mentioned previously, the ECN in most EA-PNoCs hosts different kinds of packets (e.g., PSCP, Path\_blocked, ACK, and Tear-down). These packets share all the resources of the ECN, creating congestion that has a huge impact on the energy as well as on the system throughput. The naive approach to relieving this congestion is to increase the buffer size; however, this solution increases both the static power consumption and the ECN area. Allowing part of these packets to be transferred in the PCN provides a better traffic balance and fairer resource utilization. In particular, by sending the ACK and Tear-down packets



Figure 4.5: Example of 5x5 blocking switch [41,76].

in the PCN as optical signals, three main advantages can be achieved. First, the congestion in the ECN is significantly relieved and, as a result, the blocking probability decreases. Second, transferring these two types of packets in the PCN exploits its low-energy properties; moreover, the ECN buffer size can be reduced without affecting the performance. Third, it breaks the dependency between the different path configuration steps, which can otherwise increase the blocking probability.

To understand this last point, Fig. 4.6 illustrates a simplified example of three cases of dependency between the PSCP and Tear-down packets that we frequently observed during our evaluation study. In the first case, the PSCP of a given communication (C3) is stored in the west input port and requests the east output port; however, the requested resources for both the west input port and the east output port are used by another communication (C1). Although the Tear-down packet will release these resources in the next cycle, the PSCP is dropped and a *Path\_blocked* packet is generated to travel back to the source node (represented by a green dashed line in Fig. 4.6), where a new PSCP is generated. Similarly, in the second case, a PSCP in the south input port (C6) requests the north output port. In this case, the PSCP does not share the same output port with the previous communication



Figure 4.6: Examples of dependency between the PSCP and Teardown packets.

(C4). Nevertheless, it is blocked, since the input-port resources are already reserved and will only be released in the next cycle by the *Tear-down* packet located in the same input port. In the third case, the *PSCP* (C5) and the *Tear-down* (C2) packets are located in different input ports and request the same local output port. We assume that, for arbitration reasons, the *PSCP* is served first; therefore, it is blocked even though the local output-port resources will be released in the next cycle. This case is the worst one because the *PSCP* is already at the destination node, yet it is blocked and has to travel all the way back to the source node because of its dependency on the *Tear-down* packet.

In conclusion, blocking constitutes the primary source of energy and latency overhead in conventional EA-PNoC systems. It is mainly caused by congestion, which mostly occurs because different types of packets share the ECN resources. To solve the problems identified in this study, the next section details the complete architecture of the proposed PHENIC system. We first highlight the key functions and components of the energy-efficient non-blocking switch, and then present the adopted path-setup algorithm, which aims to alleviate the contention commonly found in conventional EA-PNoC systems.

## 4.4 PHENIC Non-Blocking Photonic Switch

The proposed switch should handle the data stream like any conventional photonic switch, as well as the acknowledgment signals and the regeneration of the *Tear-down* signal at each hop. Thus, we adopt a hybrid switching policy: *spatial switching* for the data signals, by manipulating the state of the broadband switching elements (green MRs in Fig. 4.8), and *wavelength-selective switching* for the *Tear-down* signals, by using detectors and modulators. Moreover, since the *Tear-down* signals must be checked and regenerated at each hop, it is crucial that they are handled automatically, without interfering with the data signals or causing blocking inside the switch.

It is important to mention that there is no dedicated gateway, with detector and modulator banks, for the *Tear-down* signal at the local port. Instead, when the *Tear-down* is generated at the source's network interface, it is first sent to the electronic router. There, the *Photonic Switch Controller*, shown later in Fig. 4.11, releases the corresponding MRRs and generates another *Tear-down*, which is sent to the output-port modulator in the PCN, from where it continues its path hop by hop until it reaches its destination. At the destination node, the *Tear-down* is detected at the input port and sent to the *Photonic Switch Controller* in the corresponding electronic router. In this fashion, we avoid the overhead of an additional gateway, which becomes significant as the number of cores increases.

## 4.4.1 Building Blocks

#### 4.4.1.1 Waveguides

The core of the proposed switch is a 4x4 non-blocking switch. As shown in Fig. 4.7, two waveguides (top right side of the figure) are used: one for the ejection and one for the injection from the east and the north ports to the gateway. Similarly, two other waveguides are used for the south and west ports. In addition to the



Figure 4.7: PHENIC's non-blocking photonic switch.

straight waveguide, crossing and bent waveguides are used, as shown in Fig. 4.8 (d) and (e), respectively. Figure 4.9 shows an example of photonic component instantiation using the OMNeT++ framework.



Figure 4.8: Optical router building blocks. (a) crossing element with two opposite MRs, (b) crossing element with single MR, (c) parallel switching element, (d) crossing element without MR, and (e) bending element. Numbers indicate the corresponding dimension in  $\mu$ m [40,97].

#### 4.4.1.2 Micro-Ring Resonators

PHENIC's photonic switch uses two types of MRRs. First, active broadband microrings, controlled by the electronic controller, switch the data from one direction to another. Second, passive MRRs filter a specific wavelength. While the broadband switching elements are used for the payload data, the passive ones are used for the acknowledgment signals (i.e., ACK and Tear-down). Figures 4.8 (a) and (b) show the two types of active microrings used in the photonic switch: a crossing waveguide with a double MRR and one with a single MRR, respectively.

## 4.4.2 Micro-Ring Configuration

Table 4.1 shows the MRR configuration for data transmission, where 18 MRRs are used in a non-blocking fashion. We use the first six wavelengths in the optical spectrum starting from 1550 nm, with a wavelength spacing of 0.8 nm to maintain low crosstalk, as reported in [98]. For the acknowledgment signals, we use the first five wavelengths in the optical spectrum starting from 1550 nm: four wavelengths for the *Tear-down* signal, one dedicated to each port



Figure 4.9: Photonic switch building blocks instantiation.

except the local one, and a single wavelength for the *ACK*. The remaining available wavelengths are used for data transmission. Moreover, the five wavelengths used for the *ACK* and *Tear-down* signals are constant regardless of the network size, in contrast with WR-PNoC architectures, where the number of wavelengths used for control and arbitration grows with the network size. Thus, setting these wavelengths aside for control would not degrade the system bandwidth. Their share becomes negligible especially when Dense Wavelength Division Multiplexing (DWDM) is used, which provides up to 128 wavelengths per waveguide [99].

| output/Input | Local | North | East  | South | West |
|--------------|-------|-------|-------|-------|------|
| Local        | -     | 9,18  | 11,18 | 14    | 16   |
| North        | 17,10 | -     | 1     | 3     | None |
| East         | 17,12 | 2     | -     | None  | 4    |
| South        | 13    | 6     | None  | -     | 8    |
| West         | 15    | None  | 5     | 7     | -    |

Table 4.1: Micro-rings configuration for data transmission.

## 4.4.3 Teardown and Acknowledgment Handling

When *Tear-down* signals enter the switch, they need to be redirected to the corresponding electronic controller. Since these signals come from different ports and are modulated with different wavelengths, detectors capable of detecting all four wavelengths are placed in front of the input ports to intercept them. The detected signal is converted to the electrical domain and redirected to the electronic router to be processed. According to the information it carries, the corresponding MRRs are released. For the ACK, when the PSCP reaches the destination, a 1-bit optical signal is modulated starting from the output port (i.e., in the opposite direction) and travels back to the source. The wavelength assignment for each port is shown in Table 4.2.

This hybrid switching mechanism has two benefits. First, we take advantage of the low power consumption of the photonic link by using optical pulses modulated with the suitable wavelength instead of propagating the acknowledgment signals in the ECN. Second, we take advantage of the WDM properties by separating the acknowledgment packets from the data signals and letting them coexist in the same medium without interfering with each other. This is in contrast to the electronic domain, where these acknowledgment packets travel for several hops and consequently block the waiting cores from sending their PSCP packets.

|        | Local             | North             | East              | South             | West              |
|--------|-------------------|-------------------|-------------------|-------------------|-------------------|
| Input  | $Mod_{\lambda_0}$ | $Det_{\lambda_3}$ | $Det_{\lambda_2}$ | $Det_{\lambda_1}$ | $Det_{\lambda_4}$ |
| Output | $Det_{\lambda_0}$ | $Mod_{\lambda_1}$ | $Mod_{\lambda_4}$ | $Mod_{\lambda_3}$ | $Mod_{\lambda_2}$ |

Table 4.2: Wavelength assignment for acknowledgment signals.
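As a rough illustration of Table 4.2, the following Python sketch encodes the assignment and checks that the wavelength modulated at each output port is the one detected at the facing input port of the neighboring switch. The dictionary names, the `OPPOSITE` map, and the helper `wavelength_nm` (which assumes that $\lambda_k$ sits at 1550 nm plus $k$ times the 0.8 nm spacing) are illustrative assumptions and are not taken from the PHENIC implementation.

```python
# Illustrative sketch only: names and structure are assumptions;
# the wavelength indices follow Table 4.2.

# Wavelength index detected at each non-local input port (Tear-down signals).
DETECTOR_AT_INPUT = {"North": 3, "East": 2, "South": 1, "West": 4}

# Wavelength index modulated at each non-local output port (Tear-down signals).
MODULATOR_AT_OUTPUT = {"North": 1, "East": 4, "South": 3, "West": 2}

# Wavelength index used for the ACK signal at the local port.
ACK_WAVELENGTH = 0

# A signal leaving through one port enters the neighbor through the facing port.
OPPOSITE = {"North": "South", "South": "North", "East": "West", "West": "East"}


def wavelength_nm(index, start_nm=1550.0, spacing_nm=0.8):
    """Assumed mapping from a wavelength index to nanometres."""
    return round(start_nm + index * spacing_nm, 1)


def input_port_for(index):
    """Identify the input port of a detected Tear-down from its wavelength index."""
    for port, lam in DETECTOR_AT_INPUT.items():
        if lam == index:
            return port
    raise ValueError(f"wavelength index {index} is not a Tear-down wavelength")


def assignment_is_consistent():
    """Each output modulator must match the neighbor's facing input detector."""
    return all(
        DETECTOR_AT_INPUT[OPPOSITE[port]] == lam
        for port, lam in MODULATOR_AT_OUTPUT.items()
    )
```

For instance, a Tear-down leaving through the east output is modulated on $\lambda_4$, which is the wavelength detected at the west input of the next switch, so `input_port_for(4)` identifies the west port.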

## 4.4.4 Optical Power Loss Evaluation

The trade-off between blocking and non-blocking switches is essentially about the laser power that needs to be injected into the chip. The optical power budget must satisfy Equation 4.4.1, where $P_{threshold}$, $D_{sensitivity}$, $IL_{max}$, and $n$ are the power threshold, the detector sensitivity, the worst-case insertion loss, and the number of wavelengths, respectively.

$$P_{threshold} - D_{sensitivity} \ge IL_{max} + 10\log_{10}n \tag{4.4.1}$$

The power threshold is the injected power above which a photonic component starts to exhibit non-linear behavior (e.g., high insertion loss). For example, waveguides and modulators have power thresholds of 15 dBm [55] and -2 dBm [54], respectively. The detector sensitivity is the amount of power required to excite the photodiode. A sensitivity of 7.3 dBm is demonstrated in [100] with a bit-error-rate of $10^{-12}$. $IL_{max}$ is the worst-case insertion loss, that is, the maximum loss sustained when the signal travels between the source and the destination, obtained by adding all losses as described in Equation 4.4.2.

$$Total_{(loss)} = PassBy_{(loss)} + PassThr_{(loss)} + Cross_{(loss)} + Bend_{(loss)} + Prop_{(loss)} \tag{4.4.2}$$

$PassBy_{(loss)}$ is the loss incurred when the signal passes by an MRR without coupling into it. $PassThr_{(loss)}$ is the loss incurred when the signal couples into an MRR. $Cross_{(loss)}$, $Bend_{(loss)}$, and $Prop_{(loss)}$ are the losses caused by the crossing elements, the bending elements, and the total propagation between the source and the destination, respectively.
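Equations 4.4.1 and 4.4.2 can be sketched in Python as follows. This is a minimal illustration rather than the evaluation code used in this work: the function and parameter names are assumptions, and the per-element loss values are those listed in Table 4.3.

```python
import math

# Per-element losses in dB, as listed in Table 4.3.
LOSS_DB = {
    "pass_by": 0.005,            # passing by a ring without coupling
    "pass_through": 0.5,         # dropping into (coupling with) a ring
    "crossing": 0.12,            # one waveguide crossing
    "bend": 0.005,               # one 90-degree bend
    "propagation_per_cm": 1.2,   # silicon waveguide propagation
}


def total_loss_db(pass_by=0, pass_through=0, crossings=0, bends=0, length_cm=0.0):
    """Equation 4.4.2: sum of the individual loss contributions, in dB."""
    return (
        pass_by * LOSS_DB["pass_by"]
        + pass_through * LOSS_DB["pass_through"]
        + crossings * LOSS_DB["crossing"]
        + bends * LOSS_DB["bend"]
        + length_cm * LOSS_DB["propagation_per_cm"]
    )


def budget_is_met(p_threshold_dbm, d_sensitivity_dbm, il_max_db, n_wavelengths):
    """Equation 4.4.1: P_threshold - D_sensitivity >= IL_max + 10*log10(n)."""
    available_db = p_threshold_dbm - d_sensitivity_dbm
    required_db = il_max_db + 10 * math.log10(n_wavelengths)
    return available_db >= required_db
```

As a usage example, `total_loss_db(pass_through=1, crossings=1)` adds one ring drop (0.5 dB) and one crossing (0.12 dB), giving about 0.62 dB, a value that also appears for several directions in Table 4.5.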

It is important to mention that handling the ACK and Tear-down signals optically adds no additional insertion loss. In fact, since the *Tear-down* signal is regenerated at each hop, the incurred insertion loss is much lower than the worst-case insertion loss. For the ACK signal, the insertion loss is the same as that of the corresponding data signal.

The power overhead is caused by the modulators and detectors placed in front of each port. As we show in the next chapter, this overhead is much lower than the one incurred if the acknowledgment signals were transferred over conventional electrical links. In addition, since the energy of modulators and detectors depends on the number of bits, we use only 1 bit to modulate the ACK signal and 8 bits for the *Tear-down*, enough to encode any destination address in a 256-core system. To understand the pros and cons of each

| Parameter                  | Value                         |
|----------------------------|-------------------------------|
| Propagation loss (silicon) | 1.2 dB/cm                     |
| Waveguide crossing         | 0.12 dB                       |
| Waveguide bending          | $0.005 \text{ dB}/90^{\circ}$ |
| Drop into a ring           | 0.5 dB                        |
| Passing by a ring          | 0.005 dB                      |

Table 4.3: Insertion loss parameters [101–103].

approach (using a blocking or a non-blocking switch), the insertion loss of PHENIC's photonic switch is evaluated against that of a blocking one [40]. Figure 4.8 shows the different building blocks used to model the two photonic switches. The numbers in the figure represent the dimensions of the components [97]. In addition to the crossing element with MRR, the blocking switch has a parallel switching element. These dimensions are used to calculate the resulting propagation loss when the light travels through different components and different switches. Table 4.3 shows the values of the various losses that a signal can incur during its travel between the modulator and the detector, as reported in [101–103].

Figure 4.10: Worst case optical power loss.

|                     | PHENIC   | Bl-Switch [40] |
|---------------------|----------|----------------|
| Non-blocking        | Yes      | No             |
| Number of rings     | 18       | 12             |
| Number of crossings | 27       | 10             |
| Passive routing     | Yes (x4) | Yes (x4)       |

Table 4.4: Comparison between  $5 \times 5$  optical routers.

Table 4.4 compares the two evaluated switches in terms of their optical resource cost. The blocking switch clearly has better physical characteristics (i.e., fewer waveguide crossings and fewer MRRs) than the PHENIC switch. In fact, this kind of switch is suited to light traffic loads, where the injection rate is low and the use of a blocking switch does not degrade the performance. In addition, with a minimal number of rings, its insertion loss is lower than that of the non-blocking switch. However, when it comes to system performance, a network built on this kind of switch consumes more energy and suffers from a considerably higher number of blocked requests, as shown in the next chapter.

Table 4.5 shows the optical power loss of the twenty possible communication pairs inside the switch, (e.g.  $E \mapsto L$  is the optical power loss from the East port to

| Direction     | PHENIC (dB) | Blocking [40] (dB) | Direction     | PHENIC (dB) | Blocking [40] (dB) |
|---------------|-------------|--------------------|---------------|-------------|--------------------|
| $E \mapsto L$ | 1.36        | 0.5                | $N \mapsto S$ | 1.36        | 1                  |
| $E \mapsto N$ | 1.11        | 0.62               | $N \mapsto W$ | 1.11        | 0.62               |
| $E \mapsto S$ | 0.99        | 0.63               | $S \mapsto E$ | 0.99        | 0.63               |
| $E \mapsto W$ | 1.48        | 1                  | $S \mapsto L$ | 0.87        | 1.13               |
| $L \mapsto E$ | 1.48        | 1.13               | $S \mapsto N$ | 1.48        | 1                  |
| $L \mapsto N$ | 1.24        | 1.5                | $S \mapsto W$ | 1.11        | 0.62               |
| $L \mapsto S$ | 1.11        | 0.5                | $W \mapsto E$ | 1.36        | 1                  |
| $L \mapsto W$ | 0.86        | 1.12               | $W \mapsto L$ | 0.74        | 1.5                |
| $N \mapsto E$ | 0.99        | 0.62               | $W \mapsto N$ | 1.11        | 0.62               |
| $N \mapsto L$ | 1.24        | 1.12               | $W \mapsto S$ | 0.99        | 0.62               |
| Average loss  | 1.15        | 0.87               |               |             |                    |

Table 4.5: Power loss comparison.

the Ejection port). We take into consideration only the crossing loss and the drop-into-ring loss, since they are much larger than the pass-by-ring loss or the bending loss. We can see that the blocking switch has the lowest average loss, owing to the reduced number of crossings and MRRs inside this router. This result leads to a lower worst-case insertion loss, as shown in Fig. 4.10. For the two types of switches, the waveguide crossing loss contributes almost 50% of the total loss. The second contributor is $PassThr_{(loss)}$, incurred when the signal needs to couple with the ring element to change direction.

## 4.5 Light-Weight Electronic Controller Architecture

Despite the huge bandwidth offered by EA-PNoC architectures, the ECN (Electronic Control Network) is considered the main source of latency and power consumption, as discussed earlier. This overhead may be caused by an inappropriate message size, a non-optimized physical channel width, or, above all, a non-optimized path-setup protocol, which is a source of both power and latency overhead. The message size concerns both the electronic configuration packet and the optical payload data. The channel width affects the overall performance of the ECN, since the area cost and power consumption increase linearly with the physical channel width for the buffer and quadratically for the crossbar, as reported in [104]. The modulation rate may also affect the electronic controller by adding serialization and deserialization power and latency.



Figure 4.11: PHENIC's light-weight electronic router.

Figure 4.11 shows the proposed light-weight electronic controller. First, we can see the connection between the network interface (NI) and the local port, where a configuration packet (CP) is sent from the NI to the local port. The CP can be a setup packet or a path-blocked packet. The NI is also connected to the data switch (i.e., PCN). When the source node receives the ACK, the payload is processed by a serializer bank (if needed), a high-speed driver, and a modulator that converts the electrical signal to an optical one. At the destination node, the optical data leaves the data switch and goes through a detection step, a high-speed amplification step, and a deserialization step. Finally, the NI's receiver receives the payload data at its original clock speed. Many parameters and design considerations need to be taken into account to design a light-weight router suitable for EA-PNoCs.

#### 4.5.1 PHENIC Topology Consideration

In the PHENIC system, the EORs (Electro-Optic Routers) are connected to each other in a mesh-like arrangement. We opted for the mesh topology because it is simple and regular, and because building a mesh-based photonic switch is easier than building one for an irregular topology. Figure 4.12 shows examples of interconnection networks in mesh-like and torus-like arrangements.



Figure 4.12: Example of  $4 \times 4$  interconnection network. (a) Mesh and (b) Torus

Some other works opted for the torus-based arrangement, which is a very attractive architecture in terms of exploiting the edges. Exploiting the edges in the photonic domain would not degrade the performance because the distance does not matter in photonic transmission. However, as explained in the next chapter, the torus-based architecture is highly penalized in terms of electronic power because of its complex electronic controller.

#### 4.5.2 Packet Format and Buffer Size

One of the most critical design considerations in EA-PNoC is the choice of the format and size of the configuration packet. In fact, the packet format and size need to be optimized for latency rather than bandwidth. Thus, the PHENIC system has a highly optimized packet format, which consists only of the source and destination addresses and a packet identifier.

In a pure ENoC, the packet format/size and the physical channel width must be optimized for different data sets and must carry control and data packets of different sizes; in contrast, the packet format/size for the ECN needs to be optimized to carry only small packets. For example, in a pure ENoC the physical channel width varies between 128 bits and 512 bits, whereas in an EA-PNoC there is no need for a wide physical channel since all packets are relatively small, and a small channel width can be chosen to reduce the power consumption in the ECN. The channel width should be sized for the largest of these packets to avoid packet fragmentation. In fact, all packets carried in the ECN are control packets, and the latency resulting from their fragmentation would affect the overall performance, as discussed previously.

Figure 4.13 shows the packet format needed for the path configuration. This format is used when issuing the PSCP (Path Setup Control Packet) and when the latter is blocked.



Figure 4.13: PHENIC's electronic controller configuration packet size and format.

To perform all path configuration steps, only the packet Id and the source and destination addresses are needed. The Id field is 1 bit wide (0 for a path-setup packet and 1 for a path-blocked packet), and the source and destination addresses are encoded with 8 bits each, for a network of up to 256 cores. By moving half of the network's traffic to the photonic layer, the buffer size is halved, leading to less energy overhead in the electronic layer. Even with half of the buffer size, we can still obtain better bandwidth and lower latency.
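The 17-bit configuration packet described above (1-bit Id plus two 8-bit addresses) can be sketched in Python as follows. The field widths come from the text; the ordering of the fields inside the word and the names `encode_packet` and `decode_packet` are assumptions made only for illustration.

```python
# Illustrative sketch: field widths follow the text (1-bit Id, 8-bit source,
# 8-bit destination); the field ORDER inside the word is an assumption.
ID_BITS, ADDR_BITS = 1, 8
PACKET_BITS = ID_BITS + 2 * ADDR_BITS  # 17 bits in total

PATH_SETUP, PATH_BLOCKED = 0, 1


def encode_packet(packet_id, src, dst):
    """Pack (Id, source, destination) into a single integer word."""
    if packet_id not in (PATH_SETUP, PATH_BLOCKED):
        raise ValueError("Id must be 0 (path setup) or 1 (path blocked)")
    if not (0 <= src < 2 ** ADDR_BITS and 0 <= dst < 2 ** ADDR_BITS):
        raise ValueError("addresses must fit in 8 bits (up to 256 cores)")
    return (packet_id << (2 * ADDR_BITS)) | (src << ADDR_BITS) | dst


def decode_packet(word):
    """Unpack a word produced by encode_packet into (Id, source, destination)."""
    mask = 2 ** ADDR_BITS - 1
    return (word >> (2 * ADDR_BITS)) & 1, (word >> ADDR_BITS) & mask, word & mask
```

Because every configuration packet fits in a single 17-bit word under this layout, a physical channel of that width would carry it without fragmentation.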

#### 4.5.3 Network Interface and Gateway Architecture

The task of the NI is, first, to issue the configuration packets (i.e., PSCP, ACK, Tear-down) and, second, to convert the payload data from electronic to optical and vice versa.



Figure 4.14: PHENIC's gateway used for Electro-To-Optical and Optical-To-Electro conversion. The gateway is the medium between the processing plane and the data plane. The example shows how eight data streams are converted to only four with a higher data rate.

The module responsible for the Electro-To-Optical and Optical-To-Electro conversion is the gateway, a module of the network interface that converts the data between the two domains. One important consideration when designing a gateway is the data modulation rate and the number of wavelengths used (i.e., the link bandwidth). These mainly depend on the application workload characteristics and therefore change from one application to another. In the PHENIC system, the data rate is equal to the processor frequency, and we use only one wavelength for the payload data (in addition to the five wavelengths used for ACK and Tear-down handling). Thus, the serialization/deserialization power and latency overhead are eliminated. Figure 4.14 shows an example of the typical connection between a gateway and the processing element. The figure shows how eight data streams can be serialized into only four. The resulting data has a higher frequency, suitable for the modulator driver bank. On the receiver side, the original clock is recovered by a deserializer bank after the signal passes through the TIA (transimpedance amplifier) bank. The serialization and deserialization steps do not affect the aggregate bandwidth; only the data rate changes. It should be mentioned that the serialization degree strongly depends on the workload characteristics and the available laser power budget (i.e., the number of available wavelengths).
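The observation that serialization changes the per-stream data rate but not the aggregate bandwidth can be checked with a small Python sketch. The function names and the 2.5 Gb/s input rate used in the usage note are hypothetical and serve only as an illustration.

```python
def serialize(n_streams, rate_gbps, degree):
    """Merge `degree` input streams into one output stream `degree` times faster."""
    if n_streams % degree != 0:
        raise ValueError("number of streams must be a multiple of the degree")
    return n_streams // degree, rate_gbps * degree


def aggregate_bandwidth(n_streams, rate_gbps):
    """Total bandwidth carried by all streams together."""
    return n_streams * rate_gbps
```

With eight streams at a hypothetical 2.5 Gb/s each and a serialization degree of 2, `serialize(8, 2.5, 2)` yields four streams at 5.0 Gb/s, and `aggregate_bandwidth` returns the same 20.0 Gb/s before and after serialization.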

#### 4.5.4 Dimension-Order-Routing (DOR-XY)

The PHENIC system uses *Dimension-Ordered Routing* (DOR-XY). The routing process at each router is defined by three main pipeline stages: *Buffer Writing* (BW), *Routing Calculation and Switch Allocation* (RC/SA), and *Crossbar Traversal* (CT). We opted for deterministic XY routing for two main reasons: first, its simplicity and deadlock-free properties; second, the minimal distance between the source and the destination. While in an ENoC the number of hops in one communication does not affect the whole system performance, in EA-PNoCs it affects the laser power budget because of the additional losses (e.g., waveguide crossing, propagation, etc.) that a signal encounters when additional hops are added. Algorithm 1 shows the steps for computing the output port in DOR-XY.

#### 4.5.5 Arbiter Architecture

The arbiter is the heart of the proposed electro-optic router. It consists of two parts: the electronic arbiter, responsible for the arbitration of the electronic packets, and the photonic switch controller (PSC), responsible for handling all requests between the electronic arbiter and the photonic switch.

Figure 4.15 shows the main building blocks of PHENIC's arbiter. The main arbiter and the included PSC receive the detected *Tear-down* from the switch above (colored arrows). According to the information encoded in this signal, the corresponding MRRs are released and a new *Tear-down* is generated for the next hop. This continues until the signal reaches its final destination and all MRs involved in the communication are released. More details about the routing and arbitration processes are given in Chapter 5.

Figure 4.15: PHENIC's electronic and photonic arbiters. (a) electronic arbitration module with the Stall-Go flow control and Round-Robin scheduling, (b) Photonic Switch Controller (PSC), (c) Micro-ring Configuration Table (MRCT), and (d) Micro-ring State Table (MRST).

```
// Destination address
Input: X_dest, Y_dest
// Current node address
Input: X_current, Y_current
// The new output port
Output: Outport

if (X_current is equal to X_dest) then
    if (Y_current is equal to Y_dest) then
        Outport ← LOCAL;
    else
        if (Y_current is smaller than Y_dest) then
            Outport ← NORTH;
        else
            Outport ← SOUTH;
        end
    end
else
    if (X_current is smaller than X_dest) then
        Outport ← EAST;
    else
        Outport ← WEST;
    end
end
```

Algorithm 1: Dimension-Ordered-Routing (DOR-XY)
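Algorithm 1 can be sketched in Python as follows. This is an illustrative rendering rather than the router implementation: the function names and the helper `route` are assumptions, as is the coordinate convention (x grows eastward and y grows northward), which is inferred from the comparisons in Algorithm 1.

```python
def dor_xy(x_current, y_current, x_dest, y_dest):
    """Return the output port chosen by XY dimension-order routing."""
    if x_current != x_dest:
        # Route along the X dimension first.
        return "EAST" if x_current < x_dest else "WEST"
    if y_current != y_dest:
        # X is aligned, so route along the Y dimension.
        return "NORTH" if y_current < y_dest else "SOUTH"
    # Both coordinates match: deliver to the local port.
    return "LOCAL"


def route(src, dst):
    """List the output ports taken hop by hop from src to dst, given as (x, y)."""
    step = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}
    x, y = src
    ports = []
    while True:
        port = dor_xy(x, y, *dst)
        ports.append(port)
        if port == "LOCAL":
            return ports
        dx, dy = step[port]
        x, y = x + dx, y + dy
```

For example, `route((0, 0), (2, 1))` first exhausts the X offset (two hops east), then the Y offset (one hop north), and finally selects the local port, which is the minimal path that the text relies on to bound the optical losses.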

## 4.6 Chapter Summary

In this chapter, we introduced the proposed PHENIC system, which is equipped with a non-blocking photonic switch and a light-weight electronic controller. We also discussed the challenges of designing the electronic controller and how blocking considerably degrades the system performance, especially when a blocking photonic switch is used. In the next chapter, the contention-aware path configuration routing algorithm is introduced and discussed.

## Chapter 5

# Contention-Aware Path Configuration Algorithm

## 5.1 Introduction

After introducing the EOR (Electro-Optical Router), this chapter presents the proposed contention-aware path configuration algorithm. First, the different steps involved in this algorithm are described, and the main merits of moving part of the configuration packets to the photonic layer are highlighted. Second, we evaluate the typical system performance parameters, such as latency, bandwidth, and power. In addition, special care is given to the path configuration process, where the blocking latency, the number of blocked requests, and the dynamic energy for the path configuration are evaluated.

## 5.2 Contention-aware Path Configuration Algorithm

As explained previously, the proposed architecture is able to remove the dependency between the ECN and the PCN, which causes a significant latency overhead in conventional hybrid-PNoC systems. All steps involved in the proposed algorithm are described in Algorithm 2.

#### Algorithm 2: Path-configuration Algorithm.

    // Path Setup Control Packet for communication i: PSCPi
    // Path Blocked Packet for communication i: PBi
    Input:  S_i, D_i
    // From ACK detector
    Input:  Detc_ACKs
    // To ACK modulator
    Output: Mod_ACKs
    // From Teardown detector
    Input:  TeardMod_i
    // To Teardown modulator
    Output: TeardMod_i
    // To Microring resonator
    Output: MRRs_{j=0...n}

    // Buffer writing and routing computation stages
     1  initialization;
     2  while (Path-Setup-Control-Packet (PSCP) != 0) do
     3      DestAdd ← PSCPi;
     4      PortIn ← PSCPi;
     5      if (resource are available) then         /* check MRRs state */
     6          Grant_i ← Arbiter;
     7      else                                     /* generate path blocked */
     8          Blocked_i ← Arbiter;
     9      end
    10  end

    // Path blocked
    11  initialization;
    12  while (PB != 0) do                           /* Path blocked arrives */
    13      if (MRRsi state is reserved) then        /* release reserved MRRs */
    14          release ← MRRsi;
    15  end

    // Generate ACK
    16  initialization;
    17  while (NI receiver ← PSCPi) do               /* PSCP arrives to Dest */
    18      if (PSCP arrives to NI) then             /* generate ACK to Src */
    19          ACK_i ← To modulator ACK (λ0);
    20  end

    // Receives ACK
    21  initialization;
    22  while (NI receiver ← ACK_i(λ0)) do           /* ACK arrives to Src λ0 */
    23      if (ACK arrives to the NIsender) then    /* modulate the data */
    24          Data_i ← To Data's Modulator;
    25  end

    // Identify and Generate Teardown_i
    26  initialization;
    27  while (From detector signal = Teardown_i with λi) do
            /* find In-port according to the wavelength */
    28      findInport ← λi;
    29      free ← MRRsi;                            /* Free involved MRRs */
    30      Teardown_i ← To modulator λi;            /* generate new Tear-down according to λi */
    31  end

### 5.2.1 Path Configuration Phases

#### 5.2.1.1 Path Setup



Figure 5.1: Successful path-setup.

Figure 5.1 shows an example of a successful path-setup process where all the necessary resources between a given source-destination pair are reserved. Before optical data transmission, the source node issues a Path-Setup-Control-Packet (PSCP), which is routed in the ECN and carries the source and destination addresses. Other information is also included. For example, one bit is used for the Packet-type field, which is "0" for a PSCP and "1" when the configuration packet is a Path-blocked packet. Further fields for ensuring Quality-of-Service and fault-tolerance, such as Message-ID, Fault-status, and Error-Detection-Code, can also be included. At each electrical router, the output-port is calculated according to Dimension-Order routing [16]. Every time the PSCP progresses to the next router, the optical waveguides between the previous and current routers are reserved. Depending on the output port of the electrical router, the corresponding photonic router is configured by switching ON/OFF one or more MRRs using the micro-ring configuration table shown in Table 4.1.
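The output-port computation performed at each electrical router can be sketched as follows. This is a minimal illustration of dimension-order (XY) routing, not the router implementation itself: the `(x, y)` coordinate convention, the `Port` names, and the choice that "north" means increasing `y` are assumptions made only for this example.

```python
from enum import Enum


class Port(Enum):
    LOCAL = "local"
    NORTH = "north"
    EAST = "east"
    SOUTH = "south"
    WEST = "west"


def xy_output_port(current, destination):
    """Dimension-order routing: correct the X offset first, then the Y offset."""
    cx, cy = current
    dx, dy = destination
    if dx > cx:
        return Port.EAST
    if dx < cx:
        return Port.WEST
    if dy > cy:
        return Port.NORTH  # assumption: north means increasing y
    if dy < cy:
        return Port.SOUTH
    return Port.LOCAL      # the PSCP has reached its destination tile


def xy_route(source, destination):
    """Return the (router, output port) pairs a PSCP visits on its way."""
    step = {
        Port.EAST: (1, 0),
        Port.WEST: (-1, 0),
        Port.NORTH: (0, 1),
        Port.SOUTH: (0, -1),
    }
    hops = []
    node = source
    while True:
        port = xy_output_port(node, destination)
        hops.append((node, port))
        if port is Port.LOCAL:
            return hops
        node = (node[0] + step[port][0], node[1] + step[port][1])
```

Because the route depends only on the source and destination coordinates, every router can compute its output port locally from the addresses carried in the PSCP.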

In the example shown in Fig. 5.1, the packet enters through the local input-port attached to the Network Interface (NI) and requests the east output-port. According to Table 4.1, MRRs 12 and 17 are required, and their availability is checked in the Micro Ring State Table (MRST). In this table, both MRRs' states are "0" (free). Therefore, the switch controller reserves these two MRRs and changes their states from "0" (free) to "1" (not free). After this successful hop-based reservation, the PSCP continues to the next hop and the same procedure is repeated until all necessary MRRs are reserved for the complete path. This process is illustrated in *lines* 1 - 10 of Algorithm 2.
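The per-hop check described above can be sketched in a few lines. The table contents below are a hypothetical excerpt: only the two entries named in the text (local to east needing MRRs 12 and 17, and west to local needing MRR 16) are included, and the function and variable names are assumptions for illustration.

```python
FREE, ACTIVE = 0, 1  # the two MRST states used by the proposed scheme

# Hypothetical excerpt of the micro-ring configuration table (MRCT):
# (input port, output port) -> MRRs that must be switched ON.
MRCT = {
    ("local", "east"): (12, 17),
    ("west", "local"): (16,),
}


def try_reserve(mrst, in_port, out_port):
    """Reserve the MRRs for one hop; return False if any of them is in use."""
    rings = MRCT[(in_port, out_port)]
    if any(mrst[ring] != FREE for ring in rings):
        return False          # blocked: the caller would emit a Path_blocked packet
    for ring in rings:
        mrst[ring] = ACTIVE   # state "0" -> "1"; the ring is switched ON
    return True
```

For instance, starting from an MRST in which rings 12 and 17 are free, a first local-to-east request succeeds and a second identical request is refused until the rings are released.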



Figure 5.2: Failed path-setup.

When the requested MRRs at a given optical switch along the path are not available, blocking occurs. This can be seen in Fig. 5.2, where MRR 16, which is necessary for the ejection to the local output-port from the west input-port, is used by another communication. In this case, the *PSCP* is converted into a *Path\_blocked* packet (PB). The PB then travels back to the source node and releases the already reserved resources. The release is done by resetting the corresponding entries in the MRST to "0" and by sending an electrical "OFF" signal to the corresponding MRRs in the PCN. This process is illustrated in *lines* 11 – 15 of Algorithm 2.
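The reserve-then-roll-back behavior can be sketched as a small simulation. It is only an illustration under stated assumptions: the path is given as a list of `(switch, rings)` pairs, the ring numbers used for intermediate hops are hypothetical, and the electrical "OFF" signal is modeled simply as resetting the MRST entry.

```python
FREE, ACTIVE = 0, 1


def setup_path(hops, tables):
    """Try to reserve every hop of a path.

    hops   -- list of (switch_id, rings_needed) in source-to-destination order
    tables -- dict mapping switch_id to its MRST (ring number -> state)
    Returns True on success. On blocking, releases what was reserved so far
    (the job of the Path_blocked packet) and returns False.
    """
    reserved = []
    for switch, rings in hops:
        mrst = tables[switch]
        if any(mrst[ring] != FREE for ring in rings):
            # The PSCP becomes a PB packet and walks back towards the source.
            for prev_switch, prev_rings in reversed(reserved):
                for ring in prev_rings:
                    tables[prev_switch][ring] = FREE  # MRST reset, ring OFF
            return False
        for ring in rings:
            mrst[ring] = ACTIVE
        reserved.append((switch, rings))
    return True
```

A failed attempt therefore leaves every MRST exactly as it was before the PSCP was injected, so the source can retry later without leaking reservations.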



Figure 5.3: ACK phase.

#### 5.2.1.2 ACK

When the *PSCP* arrives successfully at the destination node, the NI modulates a one-bit acknowledgment (ACK) signal that travels back to the source via the PCN. This can be seen in Fig. 5.3 and in lines 16 - 20 of Algorithm 2.

#### 5.2.1.3 Payload Transmission



Figure 5.4: Payload transmission.

Upon the arrival of this ACK signal, the source node modulates the payload through the data modulators and sends it to the destination node via the PCN. Lines 21 - 25 of Algorithm 2 and Fig. 5.4 depict this data/payload transfer phase.


#### 5.2.1.4 Teardown

Figure 5.5: Tear-down phase.

The last process of the proposed path-configuration algorithm is the Tear-down step, as shown in lines 26 - 31 of Algorithm 2. When the entire payload has been transmitted, the reserved optical resources must be released. This is handled by the source node, which sends a Tear-down packet to the destination after a predetermined number of cycles that depends on the source-destination addresses, the transmission bandwidth, and the message size. As shown in Fig. 5.5, the source's NI sends the electronic Tear-down packet (TD) to the first electronic router $ER_1$. The Electronic Controller (EC) in this router indexes the MRCT with the input-output port information and determines the MRRs that need to be released. As we can see in this figure, the states of MRRs 12 and 17, previously reserved in the path-setup process, are reset to *Free* (state="0") and electrical "OFF" signals are sent to these two MRRs. After the MRRs are deactivated, a new optical Tear-down signal is generated according to the used wavelength. It is sent through the PCN to the next hop, where it is converted back to an electrical signal and redirected to the EC in the corresponding electronic router to be processed. There, the MRRs are released in the same way and a new optical Tear-down signal is generated. This process is repeated until the *Tear-down* reaches the destination and all optical resources are released.

## 5.2.2 Advantages of the Proposed Path Configuration Algorithm

It is important to mention that the path-setup and path-blocked processes of the proposed algorithm are very similar to the conventional ones [39–41, 69, 72, 88–90]. The main difference is that the MRST in our proposal contains only two states: *Free* and *Active*. The MRRs are set "ON" as soon as the PSCP succeeds in reserving them. In the conventional mechanisms, three states are necessary: *Free*, *Reserved*, and *Active*. When the PSCP finds the requested MRRs *Free*, it updates their states in the MRST to *Reserved* without turning them "ON". When the path-setup process is completed, the ACK signal travels back to the source node and sets the corresponding MRRs "ON" by updating their states in the MRST to *Active*. With the proposed algorithm, some portions of the reserved path might be set "ON" and then "OFF" again when resources further along the path are unavailable. However, it enables fast ACK transmission in the PCN.

In conventional path-setup algorithms, the ACK and Tear-down packets are transmitted in the ECN and have to go through all the buffering, routing computation, and arbitration stages. With the proposed algorithm, they are carried via the PCN. As a consequence, the ETE latency can be significantly reduced, and dynamic energy is saved as well. In addition, we considerably decrease the latency caused by path blocking, which requires several cycles for dropping the path and generating a new PSCP. Another key feature of the proposed path setup algorithm is the efficient utilization of the ECN resources. By moving the acknowledgment signals to the upper layer, we can reduce the buffer depth to only 2 slots, since half of the network traffic is eliminated. This reduction is a key factor in designing a light-weight router that is highly optimized for latency and energy.

## 5.3 Evaluation

#### 5.3.1 Methodology and Assumptions

We evaluate our proposed system using a modified version of PhoenixSim, which is a physical-layer simulator developed in the OMNeT++ simulation environment [97]. The simulator incorporates detailed physical models of basic photonic building blocks such as waveguides, modulators, photodetectors, and switches. Electronic energy performance is based on the ORION simulator [104]. The PHENIC system is evaluated for 64- and 256-core systems. We compare the obtained results with the previous blocking mesh-based PHENIC system (PHENIC\_BL) [41] and three conventional hybrid-PNoC architectures [40,97,105]. We chose these three networks for their different behaviors. The first one has a blocking switch (Chan\_Mesh) and was proposed by Chan et al. [40]. The second one is considered non-blocking (Chan\_Xb), since it uses a crossbar [97]. The third is a torus-based system (Shacham) [105], which can set up paths with fewer hops by taking advantage of the connections between the edges.

Tables 5.1 and 5.2 show the system and energy configuration parameters, respectively.

| Network Configuration                     | Value                          |
|-------------------------------------------|--------------------------------|
| Process technology                        | 32 nm                          |
| Number of tiles                           | 64, 256                        |
| Chip area (equally divided amongst tiles) | 400 mm$^2$                     |
| Core frequency                            | 2.5 GHz                        |
| Electronic Control frequency              | 1 GHz                          |
| Power Model                               | Orion 2.0                      |
| Buffer Depth                              | 2                              |
| Physical Channel Width                    | 32                             |
| Forwarding                                | Wormhole-like switching        |
| Scheduling                                | Round-robin                    |
| Control Flow                              | Stall-go                       |
| Routing                                   | Static XY                      |
| Message size                              | 2 Kbytes                       |
| Simulation time                           | 10 ms ($25 \times 10^8$ cycles) |

Table 5.1: Configuration parameters.

| Energy Parameter           | Value        |
|----------------------------|--------------|
| Datarate (per wavelength)  | 2.5 GB/s     |
| MRRs dynamic energy        | 375 fJ/bit   |
| MRRs static energy         | 400 $\mu$W   |
| Modulators dynamic energy  | 25 fJ/bit    |
| Modulators static energy   | 30 $\mu$W    |
| Photodetector energy       | 50 fJ/bit    |
| MRRs static thermal tuning | 1 $\mu$W/ring |

Table 5.2: Photonic communication network energy parameters.

#### 5.3.1.1 Benchmarks

For benchmarks, we used *Random Uniform* and *Bitreverse* traffic patterns. *Random Uniform* traffic is a communication pattern where the destinations are randomly and uniformly selected each time a new communication occurs. In *Bitreverse*, each node sends messages to the complement node of its ID, which results in very long communications that allow us to observe the scalability of the proposed system. We limited the use of *Bitreverse* to the study of the path configuration energy overhead. In addition, we evaluate the performance of the proposed system using two realistic workloads: the Cooley-Tukey FFT algorithm [106] and the Data flow/streaming execution model [97].
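The two synthetic patterns can be sketched as destination generators. This is an illustration only: the function names are hypothetical, the node count is assumed to be a power of two, and the random generator is assumed never to pick the source itself. Since the text describes the pattern as targeting the complement of the node ID while the name usually denotes reversing the bit order of the ID, both variants are shown.

```python
import random


def uniform_random_destination(src, num_nodes, rng):
    """Pick a destination uniformly among all nodes except the source."""
    dst = rng.randrange(num_nodes - 1)
    return dst if dst < src else dst + 1


def bit_complement_destination(src, num_nodes):
    """Destination whose ID is the bitwise complement of the source ID."""
    bits = num_nodes.bit_length() - 1   # e.g. 6 bits for 64 nodes
    return src ^ ((1 << bits) - 1)


def bit_reverse_destination(src, num_nodes):
    """Destination whose ID is the source ID with its bit order reversed."""
    bits = num_nodes.bit_length() - 1
    return int(format(src, f"0{bits}b")[::-1], 2)
```

In a 64-node network, node 0 is paired with node 63 under the complement variant, which is the corner-to-corner case that produces the longest paths in a mesh.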

The traffic pattern generated by the FFT algorithm is modeled according to [107]. In this traffic, each core starts by reading from the memory. Then, it processes $k = m/M$ sample elements, where $m$ is the size of the array of input samples and $M$ is the number of cores. After this phase, the algorithm proceeds with a sequence of $\log M$ iterations. At each iteration, the processors exchange data according to a butterfly scheme, resulting in long distance communications. Finally, a write-to-memory step is executed where the cores store the exchanged data. To get the characteristics of the computation and communication steps, the Pentium-M core was used as a reference according to the work in [107, 108]. The Pentium-M takes 39.32 ms to compute the FFT on 256 K samples and 2.18 ms for the communication stage.
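The communication structure of this workload can be sketched as below. The sketch assumes the common butterfly convention in which, at iteration `s`, core `c` exchanges data with core `c XOR 2^s`; the exact pairing order used in [107] is not stated here, so this ordering, like the function name, is an assumption.

```python
def fft_schedule(num_cores, num_samples):
    """Return (samples per core, butterfly partner table per iteration)."""
    stages = num_cores.bit_length() - 1          # log2(M) iterations
    if 1 << stages != num_cores:
        raise ValueError("num_cores must be a power of two")
    samples_per_core = num_samples // num_cores  # k = m / M
    partners = [
        [core ^ (1 << stage) for core in range(num_cores)]
        for stage in range(stages)
    ]
    return samples_per_core, partners
```

With 64 cores and 256 K samples, each core handles 4096 samples and takes part in 6 exchange iterations; in the last one, core 0 is paired with core 32, which illustrates why the pattern produces long distance communications.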

For the Data flow execution model [97], each core computes some part of the total computation (i.e., a piece of data), passes it on to another core, and repeats the same computation for the next piece of data that arrives. In this fashion, a large data set is broken into small chunks and processed in parallel through all the cores in the network. In this application, only cores at the edges read and write to the memory, and the data is exchanged on a hop-by-hop basis, resulting in short distance communications. With these two applications, we can observe the effects of short and long distance communications on the proposed system performance.

#### 5.3.2 Complexity

In this section we evaluate the complexity of the proposed system against the four other architectures. The evaluation considers the number of used rings and the resulting static thermal tuning. The number of used MRRs is given by equation 5.3.1, where $Mod/Detc_{(ring)}$ is the number of rings required to modulate/detect the payload signal, $Switch_{(ring)}$ is the number of rings required for the photonic switch to route the optical data, and $ACKs_{(ring)}$ is the number of rings required to handle the acknowledgment signal.

$$Total_{(ring)} = Mod/Detc_{(ring)} + Switch_{(ring)} + ACKs_{(ring)} \tag{5.3.1}$$

Tables 5.3 and 5.4 show the comparison results for 64- and 256-core systems, respectively. We can see that the two blocking networks (PHENIC\_BL and Chan\_Mesh) have the lowest number of rings. This kind of network is intended for light traffic loads, where the injection rate is low and the use of a blocking switch does not degrade the performance. In addition, with a minimal number of rings, the resulting insertion loss is lower than that of the non-blocking networks. However, when it comes to system performance, this kind of network shows higher energy and a considerably higher number of blocked requests, as shown in the next section. Compared to the other networks, the proposed PHENIC system has additional rings used for the acknowledgment signal. This increase can reach 100%, 50% and 12% when compared to the blocking networks, crossbar and torus systems, respectively. We observe the same behavior when evaluating the static thermal tuning, which is required to maintain the functionality of the rings, under a 20 K temperature range with $1\,\mu W$ for each ring.

|                            | PHENIC | PHENIC_BL | Chan_Mesh | Chan_Xb | Shacham |
|----------------------------|--------|-----------|-----------|---------|---------|
| Mod/Detc                   | 64     | 64        | 64        | 64      | 64      |
| Switch                     | 1152   | 852       | 768       | 1152    | 1620    |
| ACKs                       | 640    | -         | -         | -       | -       |
| Total                      | 1856   | 916       | 832       | 1216    | 1684    |
| Static Thermal Tuning (mW) | 37     | 18        | 16        | 24      | 33      |

Table 5.3: Ring requirement comparison results for 64 cores systems.
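The ring totals of equation 5.3.1 and the thermal tuning figures for the 64-core case can be checked with a short script. The ring counts are taken from Table 5.3; the reading of the thermal budget as 1 µW per ring per kelvin over a 20 K range is an assumption, chosen because it reproduces the tabulated values to within 1 mW. All names are hypothetical.

```python
# Ring counts for the 64-core configuration (values from Table 5.3).
RINGS_64 = {
    "PHENIC":    {"mod_detc": 64, "switch": 1152, "acks": 640},
    "PHENIC_BL": {"mod_detc": 64, "switch": 852,  "acks": 0},
    "Chan_Mesh": {"mod_detc": 64, "switch": 768,  "acks": 0},
    "Chan_Xb":   {"mod_detc": 64, "switch": 1152, "acks": 0},
    "Shacham":   {"mod_detc": 64, "switch": 1620, "acks": 0},
}


def total_rings(counts):
    """Equation 5.3.1: modulator/detector + switch + acknowledgment rings."""
    return counts["mod_detc"] + counts["switch"] + counts["acks"]


def thermal_tuning_mw(rings, uw_per_ring_per_kelvin=1.0, delta_kelvin=20.0):
    """Static thermal tuning power in mW (assumed per-kelvin interpretation)."""
    return rings * uw_per_ring_per_kelvin * delta_kelvin / 1000.0


def ring_overhead_percent(network, baseline, table=RINGS_64):
    """Extra rings of `network` relative to `baseline`, in percent."""
    return 100.0 * (total_rings(table[network]) / total_rings(table[baseline]) - 1.0)
```

Under this reading, PHENIC's 1856 rings require about 37.1 mW, and its ring overhead relative to the crossbar-based network is about 53%, in line with the roughly 50% quoted in the text.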

#### 5.3.3 Latency Evaluation Under Synthetic Workloads

Figures 5.6 (a) and (b) show the overall average latency and the average latency near the saturation region, respectively. We can see that at zero load, all networks behave in the same way. Near saturation, PHENIC shows more flexibility and scalability with 256 cores when compared to the other networks. For the 64-core configuration, the crossbar-based system slightly outperforms the PHENIC system in terms of latency. This can be explained by the Optical-to-Electronic conversion of the *Teardown* signal, which affects the overall latency for small networks.

#### 5.3.3.1 Blocking Latency

Fig. 5.7 shows the blocking latency comparison results for all studied networks in 64 and 256 cores systems under random uniform traffic. The blocking latency is defined as the average time added to the overall latency when a path setup packet is blocked and needs to go back to the source node. The first thing to notice from Fig. 5.7 is the overhead of using a blocking switch to save on the number of rings. The previous blocking PHENIC\_BL and the Chan\_Mesh networks have a considerable blocking latency, reaching 200% when compared to the proposed PHENIC, crossbar, and torus systems, in both 64 and 256 cores systems. When comparing the proposed PHENIC to the crossbar-based and the torus-based systems, we can see that the proposed PHENIC slightly outperforms the two networks in the 64 cores system. For larger networks, the benefits of the proposed PHENIC system are clear. For example, when compared with the crossbar-based system, which is considered non-blocking and has the same number of rings (except for those used for the acknowledgment signals), we can see an improvement of 60% just before saturation. We can conclude that by breaking the dependency between the different configuration packets, many requests can be saved from being blocked. The improvement is smaller when compared to the torus-based system, at just 37%. This is because the torus can use the edge connections, so a path blocked packet spends less time reaching the source node. Another interesting behavior is that the curve of the proposed system is less steep than those of the other networks. We can see, for instance, that between the 0.06 ms and 0.04 ms injection rates (near-saturation region), the blocking latency for the crossbar-based system increases by 300%, while it increases by just 63% for the PHENIC system. We can say that the proposed system is less sensitive to blocking than both the blocking networks (PHENIC\_BL, Chan\_Mesh) and the non-blocking networks (Chan\_Xb, Shacham).

|                            | PHENIC | PHENIC_BL | Chan_Mesh | Chan_Xb | Shacham |
|----------------------------|--------|-----------|-----------|---------|---------|
| Mod/Detc                   | 256    | 256       | 256       | 256     | 256     |
| Switch                     | 4608   | 3252      | 3072      | 4608    | 6324    |
| ACKs                       | 2560   | -         | -         | -       | -       |
| Total                      | 7424   | 3508      | 3328      | 4864    | 6580    |
| Static Thermal Tuning (mW) | 149    | 71        | 67        | 98      | 131     |

Table 5.4: Ring requirement comparison results for 256 cores systems.

Figure 5.6: Latency comparison results under random uniform traffic. (a) Overall Latency and (b) Latency near-saturation.



Figure 5.7: Average blocking latency comparison under random uniform traffic. The left Y-axis shows the blocking latency for PHENIC\_BL and Chan\_Mesh networks and the right Y-axis for PHENIC, Chan\_Xb and Shacham networks.

#### 5.3.3.2 Blocked Requests

Our final evaluation in this subsection is shown in Fig. 5.8, which gives the number of blocked requests that reached more than half of the network diameter, that is, the number of PSCPs that failed to reach their destinations after traveling more than half of their path. We can see that for low injection rates, all networks behave similarly. When the injection rate increases and the system reaches the near-saturation region (between the two vertical dashed lines), the number of blocked requests in the proposed PHENIC system decreases by 31% and 36% when compared to the crossbar and the torus based systems for 256 cores, respectively. Compared to the blocking networks, the number of blocked requests for PHENIC decreases by 42% for 256 cores and by 35% for the 64 cores system. Moreover, the curves behave in the same way as the blocking latency in Fig. 5.7: the curves for the other networks are steeper than that of the proposed system. In this figure, we only show the most energy-costly portion of the blocked packets. When a PSCP travels more than half the network and is then canceled, it incurs a high amount of wasted energy (i.e., buffering, switching, crossbar traversal).



Figure 5.8: Number of blocked requests comparison results under uniform traffic. We limit this evaluation to the requests that reached more than half of the network diameter. The vertical dashed line represents the near-saturation point.

#### 5.3.4 Latency Evaluation Under Realistic Workloads

#### 5.3.4.1 Path Setup Latency Ratio

Figures 5.9 (a) and (b) show the average path setup latency ratio comparison results for 64 cores and 256 cores systems, respectively. As shown in equation 5.3.2, the path setup latency ratio is defined as the ratio between the average time needed to set the path and the average time needed for the transmission. The time needed for setting the path includes the time to send the PSCP in addition to the time resulting from blocked packets and the ACK. In other words, the network efficiency is higher when the average path setup overhead is low.

$$PSL_{ratio} = \frac{T_{PathRequest} + T_{PathBlocked} + T_{ACK}}{T_{transmission}} \tag{5.3.2}$$
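As a minimal sketch of how this ratio would be computed from measured times, the helper below evaluates equation 5.3.2 directly. The function name and the sample values in the usage note are hypothetical and are not measurements from this evaluation.

```python
def psl_ratio(t_path_request, t_path_blocked, t_ack, t_transmission):
    """Path setup latency ratio of equation 5.3.2 (all times in the same unit)."""
    if t_transmission <= 0:
        raise ValueError("transmission time must be positive")
    return (t_path_request + t_path_blocked + t_ack) / t_transmission
```

For example, with hypothetical averages of 30, 10, and 10 cycles for the request, blocking, and ACK components and 100 cycles of transmission, the ratio is 0.5; a lower value indicates a more efficient network.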

Since the exchange of data is done on a hop-by-hop basis in the Data flow application, we can see a low path setup overhead for all networks. However, PHENIC-II achieves the lowest overhead of all the systems, with an improvement that can reach 50% when compared to the Torus architecture. When it comes to the FFT application, we can see the impact of long communications: the path setup overhead increases considerably for all networks and for both sizes. Nevertheless, the PHENIC-II system achieves up to 30% and 50% improvement when compared to the Torus architecture for 64 and 256 cores, respectively. When compared to the blocking and the crossbar architectures, an improvement of 10% and 5% can be observed, respectively.

Figure 5.9: Average path setup latency ratio. (a) 64 cores and (b) 256 cores.

#### 5.3.4.2 Speedup

When evaluating the speedup, the Data flow application with its short communications affects the performance of the studied systems. In fact, the Torus architecture can no longer take advantage of its capability of using the edges. Therefore, it has the lowest speedup among all networks, as shown in Fig. 5.10 (a) and (b). The PHENIC-II system outperforms the Torus and the blocking networks when running the Data flow application for both network sizes. However, it is outperformed by the crossbar system by 7% in the 64 cores system. In the 256 cores system, PHENIC outperforms the crossbar-based system by 9%, which confirms that the PHENIC system shows better performance for large networks. When it comes to the FFT, the Torus architecture slightly outperforms all other networks, taking advantage of the communication between the edges.

Figure 5.10: Normalized speedup comparison results. (a) 64 cores and (b) 256 cores.

#### 5.3.5 Bandwidth Evaluation Under Synthetic Workloads

For the achieved bandwidth, Fig. 5.11 shows that the bandwidth is increased by 24% and 51% when compared to PHENIC\_BL and Chan\_Mesh, respectively, for both 64 and 256 cores configurations. When compared to the crossbar and the torus systems, we can see that the three systems behave in the same way. While the torus system can set up paths with fewer hops, the PHENIC system achieves the same performance without the extra access network that the torus requires. This behavior is observed for both 64 and 256 core systems.



Figure 5.11: Bandwidth comparison results under random uniform traffic.

#### 5.3.6 Bandwidth Evaluation Under Realistic Workloads

Fig. 5.12 shows the bandwidth comparison results under FFT and Dataflow workloads. We can see that for FFT the achieved bandwidth is similar for all networks, for both small and large network sizes, with a small advantage for the torus-based system. For the Dataflow workload, the PHENIC system is outperformed by the crossbar-based system by 5% in the 64 cores system. In the 256 cores system, PHENIC outperforms the crossbar-based system by 9%, which confirms that the PHENIC system benefits from the proposed path configuration scheme in large networks rather than small ones.

#### 5.3.7 Energy Evaluation Under Synthetic Workloads

#### 5.3.7.1 Path Configuration Energy Overhead

We evaluate the energy overhead for the PSCP, which is given by Equation 5.3.3, where $PSCP_{Succ}$ is the dynamic energy dissipated in the ECN by the successful *PSCPs* reaching their destinations, and $PSCP_{Failed}$ is the dynamic energy consumed by the *PSCPs* which resulted in *Path\_blocked* packets. We also evaluate the *ACK* energy overhead, which is defined as: (1) the energy dissipated by the *ACK* and *Tear-down* packets for the PHENIC\_BL system, and (2) the sum of the dynamic energy of the modulators and detectors used for the optical ACK and Tear-down signals in the PHENIC system. These two definitions are represented by equations 5.3.4 and 5.3.5, respectively.

Figure 5.12: Bandwidth comparison results under FFT and Dataflow workloads.

$$PSCP_{Energy} = PSCP_{Succ} + PSCP_{Failed} \tag{5.3.3}$$

$$E\text{-}ACKs_{Energy} = Ack_{Packet} + Teardown_{Packet} \tag{5.3.4}$$

$$O\text{-}ACKs_{Energy} = ACKs_{Modulators} + ACKs_{Detectors} \tag{5.3.5}$$

Figures 5.13 (a) and (b) show the PSCP and ACKs dynamic energy overhead for half-load traffic under random uniform and bitreverse benchmarks, respectively. As can be seen in these two figures, the energy overhead of the PSCP decreases by almost 66% for both 256 and 64 cores systems when compared to the blocking networks. A similar enhancement can be seen for the ACKs energy, which is reduced by 36% in 256 cores and 64% in 64 cores systems.



Figure 5.13: Path setup and acknowledgments energy for half-loaded network. (a) and (b) random traffic, (c) and (d) bitreverse.

The crossbar-based system, however, outperforms the proposed PHENIC in both 64 and 256 cores systems. Nevertheless, the PHENIC system still shows better performance than the torus-based system. Figures 5.14 (a) and (b) represent the energy overhead when the system is fully loaded (i.e., near the saturation region) for random uniform and bitreverse traffic, respectively. We can notice that the decrease in the PSCP and ACK energy is considerable when compared to the other architectures for both small and large networks, especially when compared to the blocking ones. Moreover, the torus-based system is largely penalized due to the additional ports for the connection between the edges. We can also see that, for all networks, the PSCP energy dominates the overall energy. This is because blocking can be avoided only up to a certain limit; due to the limitation of the photonic resources, some of the requests still become blocked. This problem is mostly related to the structure of the switch, and also to the routing algorithm used, and it can be mitigated by using high-radix switches. For the acknowledgment energy, it is clear that the optical handling of the *Tear-down* and *ACK* adopted in PHENIC is more energy efficient for the two benchmarks and for the two network sizes.

Figure 5.14: Path setup and acknowledgments energy near-saturation. (a) and (b) random, (c) and (d) bitreverse.

#### 5.3.7.2 Total Power

This can be clearly observed when we compare the total energy and the energy efficiency. While the proposed PHENIC, crossbar-based, and torus-based systems behave in the same way in terms of bandwidth, they have different energy profiles.



Figure 5.15: Total energy and energy efficiency comparison results under random uniform traffic near-saturation.

Figure 5.15 shows the total energy and the energy efficiency comparison results for 64 and 256 cores systems. For the 256 cores configuration, the proposed system outperforms all other networks. This is illustrated by an improvement in energy efficiency reaching 26% and 48% when compared to the crossbar-based (non-blocking) and the mesh-based (blocking) systems, respectively. When compared to the torus-based architecture, PHENIC improves the energy efficiency by up to 70%. The torus-based architecture offers high bandwidth thanks to the connection between edges, which leads to short communications. On the other hand, it comes at a high energy cost. This can be explained by the fact that the additional input-ports, required for the edge connections in the torus-based system, incur increased area and consequently an energy overhead. In Fig. 5.16 (a) and (b), the energy breakdown is shown for 64 and 256 cores systems, respectively. Compared to the other networks, where the electronic energy reaches 90% of the total energy, PHENIC shows a more balanced energy distribution between the photonic and electronic networks, even though the electronic share is still high at 70% of the total system energy. A closer look at the energy evaluation explains this energy-efficient behavior. Figures 5.17 (a) and (b) show the buffering dynamic energy comparison results. From these figures, we can see first how the dynamic energy of the *PSCP* is decreased considerably when compared to all other networks. We can also observe the significant decrease in the *Path\_blocked* dynamic energy, which is a direct consequence of the considerable decrease of the *PSCP* dynamic energy. It can be seen that the torus-based system achieves the same performance as the proposed PHENIC system in terms of bandwidth, at the cost of higher electronic energy.

Figure 5.16: Total Energy breakdown comparison under random uniform traffic near-saturation. (a) 64 cores systems, (b) 256 cores systems.

#### 5.3.8 Energy Evaluation Under Realistic Workloads

#### 5.3.8.1 Path Configuration Energy Overhead

We also compared our proposed system to the other networks in terms of power efficiency. First, we evaluate the dynamic power required to set the path. Figures 5.18 (a) and (b) show the normalized dynamic power consumption per achieved bandwidth. Using optical signaling for path configuration in the PHENIC system reduces the dynamic power consumption by up to 20% and 60% when compared to the blocking and the crossbar architectures, respectively. We also noticed that the Torus system is largely penalized in terms of power because of its additional ports to connect the edges. Moreover, the benefits of using optical signaling in the path setup are more noticeable in terms of energy for long distance communications (i.e., FFT traffic pattern) than for shorter ones (i.e., data flow traffic pattern).

Figure 5.17: Input-buffer dynamic energy breakdown near-saturation. (a) 64 cores, (b) 256 cores.

Figure 5.18: Normalized path setup dynamic power per achieved bandwidth.



#### 5.3.8.2 Total Power

Figure 5.19: Power efficiency comparison results.

Finally, we present the results of the power efficiency. It is calculated as the ratio of the achieved bandwidth in Gbps to the total power consumption in watts (static and dynamic). The results are shown in Figs. 5.19 (a) and (b). We can see that the Torus system is always penalized in terms of power efficiency. In addition, in the FFT benchmark for 256 cores, the PHENIC-II system outperforms the blocking and the crossbar architectures by 5% and 14%, respectively. For the Data flow, PHENIC-II outperforms the blocking architecture by 51% while showing the same behavior as the crossbar architecture. It is important to mention that since the two applications access the memory intensively, in the PHENIC-II system we also use optical signaling for any acknowledgment between the memory banks and the cores (e.g., request for read/write). This achieves higher network efficiency and contributes to lower power consumption and lower path setup overhead.
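The definition of power efficiency given above can be written compactly as follows. The symbol names are introduced here for illustration only and are not taken from the original evaluation.

$$\eta_{power} = \frac{BW_{achieved}\ [\mathrm{Gbps}]}{P_{static} + P_{dynamic}\ [\mathrm{W}]}$$

A network that delivers the same bandwidth with a lower total power, or a higher bandwidth at the same total power, therefore obtains a higher $\eta_{power}$.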

### 5.4 Results Summary

The PHENIC system was compared to conventional EA-PNoC (Electro-assisted PNoC) architectures having different properties. The PHENIC system, with the proposed non-blocking electro-optic router and its path configuration algorithm, can reduce the blocking occurrence by up to 42% and 35% for 256-core and 64-core systems with uniform random traffic, respectively. As a consequence, the path configuration energy overhead was considerably reduced in synthetic workloads, as well as in realistic ones. The speedup was also increased when evaluating with FFT and Dataflow workloads. We can see an increase of 9% when compared to a crossbar-based architecture for 256-core systems, in contrast with 64-core systems, where PHENIC was outperformed by 7%. This deficiency of the PHENIC system in small networks is explained by the fact that optical communication, whether off-chip or on-chip, shows its benefit only over long distances. Nevertheless, the PHENIC system still outperforms all networks under all workloads in terms of path configuration energy overhead as well as energy efficiency. Tables 5.5, 5.6, and 5.7 summarize the obtained results with uniform random, FFT, and Dataflow workloads, respectively.

## 5.5 Chapter Summary

In this chapter, the contention-aware path configuration algorithm was introduced. We performed a detailed evaluation under synthetic and realistic workloads. From the obtained results, we can see that the proposed PHENIC outperforms the other systems, whether they use non-blocking or blocking switches. In addition, it provides much better energy efficiency than the torus-based system, which can offer the same bandwidth as the proposed system. We can conclude that the improvement obtained by PHENIC is the result of the association of three main factors: (1) the non-blocking switch supporting optical acknowledgment signals, (2) the light-weight router with reduced buffer size, and (3) the path setup algorithm that adopts hybrid switching inside the photonic switch. In the next chapter, we introduce an alternative to the typical ECN in order to further reduce the overheads associated with the path configuration.

| Metric                         | Cores | PHENIC | PHENIC_BL | Chan_Mesh | Chan_Xb | Shasham |
|--------------------------------|-------|--------|-----------|-----------|---------|---------|
| Blocking Latency ($10^{-3}$ s) | 64    | 4.99   |           | 3.99      |         | 333     |
| Blocking Latency ($10^{-3}$ s) | 256   | 300    | 802       | 802       | 691     | 421     |
| #Blocked Request               | 64    | 42     | 65        | 65        | 50      | 55      |
| #Blocked Request               | 256   | 85     | 152       | 152       | 120     | 130     |
| Band. (Gbps)                   | 64    | 46.7   | 29.1      | 29.1      | 47.12   | 46.53   |
| Band. (Gbps)                   | 256   | 98.33  | 67.09     | 67.09     | 95.563  | 100.48  |
| Path Conf. Energy ($\mu J$)    | 64    | 1.84   | 5.84      | 5.80      | 2.13    | 3.56    |
| Path Conf. Energy ($\mu J$)    | 256   | 14.56  | 53.74     | 53.74     | 31.59   | 50.76   |
| Buff. Ener. ($\mu J$)          | 64    | 0.13   | 0.45      | 0.45      | 1.11    | 2.28    |
| Buff. Ener. ($\mu J$)          | 256   | 2.01   | 7.21      | 7.21      | 1.71    | 3.29    |
| Ener. Eff. ($J/b$)             | 64    | 0.34   | 0.42      | 0.42      | 0.24    | 0.70    |
| Ener. Eff. ($J/b$)             | 256   | 0.38   | 0.74      | 0.74      | 0.51    | 1.31    |

Table 5.5: Evaluation results summary under uniform random traffic.

|           | PS Latency Ratio (64) | PS Latency Ratio (256) | Normalized Speed Up (64) | Normalized Speed Up (256) | Path Conf./Band. (pJ/b) (64) | Path Conf./Band. (pJ/b) (256) | Band. (Gbps) (64) | Band. (Gbps) (256) | Power/Band. (64) | Power/Band. (256) |
|-----------|-------|-------|-------|-------|-------|--------|-------|-------|-------|-------|
| PHENIC    | 0.114 | 0.041 | 1.006 | 1.034 | 2.564 | 10.340 | 0.420 | 0.302 | 2.629 | 2.688 |
| PHENIC_BL | 0.132 | 0.046 | 0.990 | 1.063 | 2.554 | 10.718 | 1     | 0.802 | 2.471 | 2.598 |
| Chan_Mesh | 0.132 | 0.046 | 1     | 1.063 | 2.554 | 10.718 | 1.000 | 0.802 | 2.471 | 2.598 |
| Chan_Xb   | 0.136 | 0.040 | 1.002 |       | 2.559 | 9.887  | 0.581 | 0.409 | 2.458 | 2.378 |
| Shasham   | 0.157 | 0.072 | 1.004 | 1.106 | 2.564 | 11.290 | 0.937 |       | 0.794 | 0.874 |

Table 5.6: Evaluation results summary under FFT workload.

|           | Path Conf. Overhead (64) | Path Conf. Overhead (256) | Normalized Speed Up (64) | Normalized Speed Up (256) | Path Conf. Ener/band. (64) | Path Conf. Ener/band. (256) | Band. (Gbps) (64) | Band. (Gbps) (256) | Power/Band. (64) | Power/Band. (256) |
|-----------|-------|-------|-------|-------|--------|-------|-------|-------|--------|--------|
| PHENIC    | 0.007 | 0.007 | 1.387 | 2.509 | 62     | 347   | 0.145 | 0.024 | 69.159 | 65.432 |
| PHENIC_BL | 0.020 | 0.018 | 0.990 | 1.464 | 56.474 | 202.2 | 0.288 | 0.056 | 68.560 | 45.390 |
| Chan_Mesh | 0.020 | 0.018 | 1     | 1.464 | 56.474 | 202.2 | 0.288 | 0.056 | 68.560 | 45.390 |
| Chan_Xb   | 0.022 | 0.021 | 1.482 | 2.301 | 83.674 | 317.7 | 0.167 | 0.028 | 72.355 | 69.294 |
| Shasham   | 0.022 | 0.021 | 1.020 |       | 57.618 | 138   | 1     |       | 17.103 | 10.355 |

Table 5.7: Evaluation results summary under Dataflow workload.

# Chapter 6

# Energy-efficient Wavelength-Shifting Routing Algorithm and Architecture

# 6.1 Introduction

In this chapter, we present a new control network architecture and a new path configuration algorithm, named Wavelength-Routed Control Network (WRCN) and Wavelength-Shifting Routing Algorithm (WSRA), respectively. They come as an alternative to the conventional ones and to the previously discussed path configuration algorithm. The key idea of this new path configuration scheme is to further minimize the path configuration overheads by using a control plane made mainly of photonic devices instead of the typical electronic routers. We evaluate this new mechanism through an analytical model, and we show the benefits of this new way of configuring the path.

# 6.2 Wavelength-Routed Control Network Architecture

The trade-offs between configuring the path electronically or optically are essentially the number of photonic components used and the path configuration delay and energy. The proposed control network is based on photonic switch controllers (PSCs) interconnected in a mesh-like arrangement, but with waveguides instead of electronic links. The key idea of the WRCN is to perform all required path configuration steps optically rather than electronically. The electronic components in the proposed architecture (i.e., buffers) serve as a support, especially under contention.

Figure 6.1 shows a simplified block diagram of the PSC. The key idea is to use a wavelength-shifting mechanism between any input/output pair along the path between a source and a destination. To enable this wavelength-shifting mechanism, detector-banks (DB) are placed in front of each input port to receive the incoming signal, while modulator-banks (MB) are placed in front of each output port to modulate the new optical signal with the new wavelength. The two banks are connected via electronic wires. In Figure 6.1, the DBs are illustrated with hexagons and the MBs with circles.

To handle these different configuration packets, one DB/MB pair for the Path Request (PR) and one for the Path Blocked and Path Teardown (BT) are used on each pair of input/output ports. These DBs/MBs can be seen in Figure 6.1 in red and blue, respectively. The purple circle in front of each PR entry stands for a tuned micro-ring that redirects the incoming request to the Path Blocked Module (PBM) when the requested resources are not available. One of the main arguments for the conventional EA-PNoC approach is that the electronic controller can be used for small and latency-sensitive messages, for which the path configuration overheads cannot be amortized. Similarly, we use the proposed controller to directly route small and latency-sensitive messages, but using optical signals instead of electronic packets. Moreover, as we explain later, the routing of the optical data is done in an oblivious and passive manner, thus eliminating the latency



Figure 6.1: Photonic switch controller (PSC). PR: Path Request, BT: Path Blocked and Path Teardown, D: Data, PBM: Path Blocked Module

and power overhead caused by the buffering, routing computation, and switching steps in the conventional electronic router, which take at least three clock cycles.

### 6.3 Wavelength-Shifting Routing Algorithm

The novelty of the proposed WSRA is that the configuration packet/data signal is generated at the source node with a wavelength $\lambda_M$, where M is the number of hops needed to arrive at the destination. Along the path, the wavelength-shifting mechanism implemented in the proposed controller shifts the signal so that it arrives at the destination node with a wavelength of $\lambda_0$. At the destination node, the DB responsible for detecting signals with $\lambda_0$ is connected to the main controller, which forwards the received signal to its corresponding network interface. To route the configuration packets and short messages in the control layer in an oblivious manner, without the need for buffering or computing the output port, we use static source routing based on *Dimension-Ordered Routing* (DOR-XY). Figure 6.2 shows a flowchart of this algorithm. Any signal entering the PSC is handled according to the DB it arrives from (i.e., the signal's type) and its wavelength. In addition to turning ON and OFF the required MRRs, two main decisions can be taken. The first is to send the signal to the next hop by automatically shifting its wavelength if $\lambda_n \neq \lambda_0$, and the second is to redirect the signal to the corresponding network interface if $\lambda_n = \lambda_0$.
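The wavelength-shifting idea can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function name `wsra_trace`, the coordinate representation of nodes, and the assumption that the wavelength index drops by exactly one per hop are all introduced here for illustration only.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (x, y) mesh coordinates (hypothetical representation)


def wsra_trace(src: Node, dst: Node) -> List[Tuple[Node, int]]:
    """Return (node, wavelength index) pairs along a DOR-XY path.

    Assumption: the index is decremented exactly once per hop, so a signal
    launched with index k reaches index 0 after k hops in that dimension.
    """
    (x, y), (xd, yd) = src, dst
    trace: List[Tuple[Node, int]] = []
    # X dimension first, then Y (dimension-ordered routing).
    for axis, remaining in (("x", abs(xd - x)), ("y", abs(yd - y))):
        if remaining == 0:
            continue  # no hops needed in this dimension
        step_x = (1 if xd > x else -1) if axis == "x" else 0
        step_y = (1 if yd > y else -1) if axis == "y" else 0
        # The main controller modulates the signal with lambda_<remaining>.
        trace.append(((x, y), remaining))
        while remaining > 0:
            x, y = x + step_x, y + step_y
            remaining -= 1  # detected, then re-modulated one index lower
            trace.append(((x, y), remaining))
        # Index 0: the lambda_0 detector hands the signal to the main controller.
    return trace


if __name__ == "__main__":
    for node, index in wsra_trace((0, 0), (2, 2)):
        print(node, f"lambda_{index}")
```

In this sketch the same index set is reused for the X and Y legs, so the number of distinct wavelengths depends only on the longest run in one dimension.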

#### 6.3.1 Routing Phases

The packet is initially generated at the source's network interface, where the routing information is embedded in it. As shown in Figure 6.3, the packet contains the addresses of the source and destination nodes, the number of hops needed before the turn to the Y direction (#HopsBT), and the number of hops needed to arrive at the destination after the turn (#HopsAT). The Id field stands for the packet identifier (00: Path request, 01: Path blocked, 10: Path teardown, 11: Data). The Data field is used only for the short and latency-sensitive messages.

Figure 6.2: Wavelength-shifting routing algorithm flowchart.

| Id | Source.Ad | Dest.Ad | #HopsBT | #HopsAT | Data |
|----|-----------|---------|---------|---------|------|

Figure 6.3: Packet format generated at the sender network interface. Source.Ad: Source address, Dest.Ad: Destination address, #HopsBT: Number of hops before the turn, #HopsAT: Number of hops after the turn, Data: Payload data used only for short/latency-sensitive messages

Figure 6.4 shows the micro-architecture of the proposed PSC. For simplicity, the figure only shows the structure of the Path Request (PR) detector-bank and modulator-bank needed to route the path request between the (WestIn, EastOut) and (SouthIn, NorthOut) directions. The other path configuration packets and the short message data are handled in the same way. In Figure 6.4, we assume



Figure 6.4: Micro architecture of the path request detector and modulator banks for (WestIn, EastOut) and (SouthIn,NorthOut) directions.

a network size of $M \times M$, where M is the number of nodes in one direction. The number of rings (i.e., detectors/modulators) inside each bank depends on the maximum number of hops that a message can travel. As we show later, using this approach with full connectivity between any Inport/Outport pair leads to a number of rings N proportional to $(M - 1)^2$.
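The packet format of Figure 6.3 can be sketched as a small Python data structure. The field order and the 2-bit Id encoding follow the text; the field widths (`ADDR_BITS`, `HOP_BITS`, `DATA_BITS`) and the class name `ConfigPacket` are hypothetical, since the text does not specify them.

```python
from dataclasses import dataclass

# Field widths in bits. Only the 2-bit Id comes from the text; the other
# widths are illustrative assumptions.
ID_BITS, ADDR_BITS, HOP_BITS, DATA_BITS = 2, 8, 4, 16
WIDTHS = (ID_BITS, ADDR_BITS, ADDR_BITS, HOP_BITS, HOP_BITS, DATA_BITS)

PACKET_IDS = {
    0b00: "path request",
    0b01: "path blocked",
    0b10: "path teardown",
    0b11: "data",
}


@dataclass(frozen=True)
class ConfigPacket:
    pkt_id: int   # Id field
    src: int      # Source.Ad
    dst: int      # Dest.Ad
    hops_bt: int  # #HopsBT: hops before the turn
    hops_at: int  # #HopsAT: hops after the turn
    data: int = 0  # payload, used only for short/latency-sensitive messages

    def pack(self) -> int:
        """Concatenate the fields, Id first, into a single integer word."""
        fields = (self.pkt_id, self.src, self.dst,
                  self.hops_bt, self.hops_at, self.data)
        word = 0
        for value, width in zip(fields, WIDTHS):
            if not 0 <= value < (1 << width):
                raise ValueError(f"value {value} does not fit in {width} bits")
            word = (word << width) | value
        return word

    @classmethod
    def unpack(cls, word: int) -> "ConfigPacket":
        """Inverse of pack(): split the word back into its fields."""
        fields = []
        for width in reversed(WIDTHS):
            fields.append(word & ((1 << width) - 1))
            word >>= width
        return cls(*reversed(fields))


if __name__ == "__main__":
    request = ConfigPacket(pkt_id=0b00, src=0, dst=10, hops_bt=2, hops_at=2)
    print(PACKET_IDS[request.pkt_id], hex(request.pack()))
```

Because #HopsBT and #HopsAT travel inside the packet, the main controller at the source and at the turning node only has to read one field to choose the launch wavelength.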

#### 6.3.2 Blocking Management

Figure 6.5 shows the micro-architecture of the Path Blocked Module (PBM). This module plays a significant role in the proposed architecture, and it is arranged in such a way that it intercepts all redirected signals from the blocked inputs. These redirected signals are decoded, and a path blocked signal is generated and sent back to the source node to inform it about the non-availability of the required resources. The newly generated path blocked signal follows the same routing process explained previously for the path request. The PBM in Fig. 6.5 embeds the routing information into an electronic packet and sends it to the main controller, which modulates the path blocked signal with the required wavelength. The routing information can be easily obtained by finding out from which detector bank (i.e., NorthIn, EastIn, SouthIn, WestIn) and with which wavelength the blocked request entered the PBM.



Figure 6.5: Micro-architecture of the Path Blocked Module.

#### 6.3.3 Case Study

Figure 6.6 shows an example of communication in a $4 \times 4$ network. This example illustrates the worst case, where the configuration packet travels the maximum number of hops in addition to having to switch direction. This communication involves seven steps, which are summarized in Figs. 6.7, 6.8 and 6.9. This example takes into consideration that the converted signal needs to be buffered when the signal switches direction from X/Y to Y/X. We justify this design choice in the next section, where we show that a fully connected scheme leads to a considerable number of MRRs. Figure 6.7 shows how the main controller of the



Figure 6.6: Communication example in the wavelength-routed control network. The configuration packet needs to travel two hops before the turn and two hops after the turn.

source node receives the configuration packet from the network interface. In this example, two hops are needed before the turn (switching between X and Y directions). Thus, the main controller encodes the new signal with $\lambda_2$. Figure 6.8 shows



Figure 6.7: Wavelength-shifting algorithm example (1/3). The network interface sends a configuration packet, from which the main controller decodes how many hops remain before the turn. The new signal is encoded with $\lambda_2$.

the configuration packet traveling the two hops before the turn. At each hop, the wavelength is shifted until the signal arrives at the switching node with $\lambda_0$ (Fig. 6.8 (b)). The main controller decodes the embedded routing information again and modulates the new signal with $\lambda_2$, since two hops remain to reach the destination. Figure 6.9 shows the configuration packet traveling the two hops after the turn. At each hop, the wavelength is shifted until the signal arrives at the destination node with $\lambda_0$ (Fig. 6.9 (b)). As for the X direction, the main controller decodes the embedded information again and sends the converted packet to its corresponding network interface. This wavelength-shifting routing process is the same for all the configuration packets, as well as for the small messages. As can be seen from this example, the number of wavelengths used depends only on the number of hops in one direction. Moreover, the same set of wavelengths can be used for both directions, since the X and Y directions do not interfere with each other. When the detector bank receives the incoming signal, in addition to transferring it to the linked modulator, it sends a signal to a dedicated port on the main controller, which switches ON or OFF the required MRR(s). This operation concerns only the path configuration messages. For the short messages, the main controller intervenes only to resolve congestion if the required resources are not available.

Figure 6.8: Wavelength-shifting algorithm example (2/3). (a) The signal is detected with $\lambda_2$ and modulated again with $\lambda_1$ and (b) The signal enters the switch with $\lambda_0$, the converted signal is redirected to the main controller, which decides to modulate a new signal with $\lambda_2$ to travel in the Y direction.

Figure 6.9: Wavelength-shifting algorithm example (3/3). (a) The signal is detected with $\lambda_2$ and modulated again with $\lambda_1$ and (b) The signal enters the switch with $\lambda_0$, the converted signal is redirected to the main controller, which redirects it to its network interface.

### 6.4 Evaluation

#### 6.4.1 Methodology and Assumptions

In this section, we investigate the number of required rings, as well as the delay and power trade-offs in designing the architecture discussed in the previous section, while varying micro-architectural parameters under different scenarios. Tables 6.1, 6.2 and 6.3 show the chip configuration for this study, the delay contribution of the involved components, and the energy contribution, respectively. Our study is based on an analytical model, where the obtained results depend on the number of hops for a given communication. We chose this approach because the overheads associated with the path configuration mainly depend on the communication distance, and our goal is to show that in the proposed scheme the delay and the energy are only slightly affected by the distance, in contrast to the conventional process of configuring the path.

| Process technology                        | 32nm         |
|-------------------------------------------|--------------|
| Number of tiles                           | 64/256       |
| Chip area (equally divided amongst tiles) | $400 \ mm^2$ |
| Core frequency                            | 2.5GHz       |
| Electronic Control frequency              | 5 GHz        |

Table 6.1: Chip configuration.

#### 6.4.2 Complexity

Figures 6.10 (a) and (b) show the required number of rings and the resulting static power, respectively. In this calculation, we consider only the rings of the control network. Rings in the data network (rings for switch and rings for data

| Modulator Driver            | 16.3 ps                   |
|-----------------------------|---------------------------|
| Modulator                   | 20 ps                     |
| Detector TIA                | 6.9 ps                    |
| Detector                    | 0.3 ps                    |
| Waveguide propagation       | 46.7 ps/cm                |
| Electronic Wire propagation | 200 ps/cm                 |
| Router delay                | 600  ps (3  clock cycles) |

Table 6.2: Delay contribution for 32 nm technology nodes [109].

modulation and detection) are not included. The total number of rings inside the chip is obtained from the sum of all rings in the data and control networks, as shown in Equation 6.4.1. We calculate the required number of rings for fully connected detector and modulator banks (i.e., under contention-less condition, any signal can travel between the X/Y and Y/X directions without being stopped). In addition, we calculate the required rings when a signal needs to be buffered whenever it switches between the X/Y and Y/X directions. The number of rings N for a network size of $M \times M$, where M is the number of nodes in one direction, is obtained according to Equations 6.4.2 and 6.4.3 for the fully connected (2D) and half connected (1D) schemes, respectively.

$$Total_{(ring)} = Mod/Detc_{(ring)} + Switch_{(ring)} + Routing_{(ring)} \tag{6.4.1}$$

$$N_{(Ring2D)} = 24(M-1)^2 + 4M^2 + 4M^2(M-1)^2 \tag{6.4.2}$$

$$N_{(Ring1D)} = 24(M-1)M^2 + 4M^3 \tag{6.4.3}$$

The results show that a fully connected scheme dramatically increases the number of required rings, reaching 16k for a network size of 16 $\times$ 16. This number is too high; for thermal and area reasons, a chip cannot practically host this many photonic devices. Furthermore, the static power required to thermally tune these rings is significant in the 2D scheme, reaching 33 W for a 16 $\times$ 16 network. Thus, we opted for a partially connected

scheme, at the cost of additional clock cycles for buffering and re-modulating the signal when it needs to be switched from the X/Y to the Y/X direction. Compared to a conventional EA-PNoC, the proposed architecture requires an additional 12k and 100k rings in 64-node and 256-node systems, respectively. Nevertheless, the number of rings required by the proposed architecture is much lower than in other fully optical NoCs. For example, in a 64-node system, the works in [44], [110], and [111] require 256k, 14k, and 39k rings, respectively, while the proposed architecture needs only 12k for a similar 64-node system.
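As a quick check of the half-connected figures quoted above, Equation 6.4.3 can be evaluated directly. The sketch below is only an arithmetic aid; the function name `rings_half_connected` is introduced here and is not part of the thesis.

```python
def rings_half_connected(m: int) -> int:
    """Evaluate Eq. 6.4.3 as printed: rings of the half-connected (1D)
    control network for an M x M mesh."""
    return 24 * (m - 1) * m**2 + 4 * m**3


if __name__ == "__main__":
    for m in (8, 16):  # 64-node and 256-node systems
        print(f"{m} x {m}: {rings_half_connected(m)} rings")
```

For M = 8 and M = 16 this gives 12,800 and 108,544 rings, in line with the roughly 12k and 100k additional rings reported for the 64-node and 256-node systems.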



Figure 6.10: Complexity and power comparison results. (a) Total number of rings and (b) Rings' static power.

#### 6.4.3 Path Configuration Delay

In this section, we compare the Path Configuration Delay (PCD) $(PC_{(delay)})$ under contention-less condition (i.e., the path request signal arrives at the destination node without being blocked) and with a path request being blocked. We focus on the number of communication hops between source and destination, regardless of the traffic pattern. The $PC_{(delay)}$ is the sum of the time needed to receive the Path Request signal $(T_{(PR)})$ and the time needed to send back the ACK $(T_{(Ack)})$. An additional blocking delay $(T_{(BL)})$ is added when a PR is blocked before reaching the destination. The total delay is given by Equation 6.4.4. Table 6.2 shows the delay contribution of the involved components.

$$PC_{(delay)} = T_{(PR)} + T_{(Ack)} + T_{(BL)} \tag{6.4.4}$$

In the previously proposed PHENIC [42], the path request and the path blocked signals are generated in the electronic layer, while the ACK is transmitted optically in the data layer, following the path previously set by the path request. Likewise, the teardown signal travels in the optical layer, but in a hop-by-hop manner, to release the previously set MRRs. For the conventional EA-PNoCs, all configuration packets are generated and transmitted in the electronic layer. Equations 6.4.5 and 6.4.6 give the delays $T_{PR(elec)}$ and $T_{BL(elec)}$, in terms of hop count, when these packets are transmitted electronically. $T_{(PR)}$ and $T_{(Ack)}$ are equal since the ACK goes back to the source following the previously set path. The blocking delay depends on the hop count of the communication (n) and also on the node where the path request is blocked (j). $Link_{(delay)}$, $Router_{(delay)}$, and $NIF_{clk}$ stand for the delay of the electronic wire between routers, the delay of the router (pipeline stages), and the clock cycle needed for the network interface to send the path request again, respectively.

$$T_{PR(elec)}(n) = T_{ACK(elec)} = (n-1) * Link_{(delay)} + n * Router_{(delay)} \tag{6.4.5}$$

$$T_{BL(elec)}(n, 0 < j \le n-1) = (n-1-j) * Link_{(delay)} + (n-j) * Router_{(delay)} + NIF_{clk} \tag{6.4.6}$$
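Equations 6.4.4 to 6.4.6 can be turned into a small calculator for the conventional electronic case. This is a sketch under stated assumptions: the router and wire delays come from Table 6.2 and the clock period from the 5 GHz control frequency in Table 6.1, but the hop length `LINK_CM` (a 20 mm chip edge divided across 8 tiles) is an assumption made here, and all function names are hypothetical.

```python
ROUTER_DELAY_PS = 600.0  # Table 6.2: router delay (3 clock cycles)
WIRE_PS_PER_CM = 200.0   # Table 6.2: electronic wire propagation
CLOCK_PS = 200.0         # one cycle at the 5 GHz electronic control frequency
LINK_CM = 0.25           # ASSUMED hop length: 20 mm chip edge / 8 tiles

LINK_DELAY_PS = WIRE_PS_PER_CM * LINK_CM


def t_pr_elec(n: int) -> float:
    """Eq. 6.4.5: electronic path request delay over n hops (ps)."""
    return (n - 1) * LINK_DELAY_PS + n * ROUTER_DELAY_PS


def t_bl_elec(n: int, j: int) -> float:
    """Eq. 6.4.6: blocking delay when the request is blocked at node j (ps)."""
    if not 0 < j <= n - 1:
        raise ValueError("j must satisfy 0 < j <= n - 1")
    return (n - 1 - j) * LINK_DELAY_PS + (n - j) * ROUTER_DELAY_PS + CLOCK_PS


def pc_delay_elec(n: int, blocked_at=None) -> float:
    """Eq. 6.4.4 for a conventional EA-PNoC, where T_Ack equals T_PR."""
    total = 2 * t_pr_elec(n)
    if blocked_at is not None:
        total += t_bl_elec(n, blocked_at)
    return total


if __name__ == "__main__":
    for hops in (2, 4, 8, 16):
        print(hops, "hops:", pc_delay_elec(hops) / CLOCK_PS, "cycles")
```

The router term dominates the link term by more than an order of magnitude under these parameters, which is why the electronic delay grows almost linearly with the hop count.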

#### 6.4.3.1 Delay Under Light Traffic

Figure 6.11 shows the path configuration delay as a function of the number of hops. The results compare the conventional EA-PNoC, the PHENIC system, and the wavelength-shifting scheme. The first thing to notice is the gap between the three networks: PHENIC outperforms the conventional one, and the wavelength-shifting scheme outperforms the two other architectures.



Figure 6.11: Path configuration delay under contention-less.

#### 6.4.3.2 Delay Under Heavy Traffic



Figure 6.12: Path configuration delay comparison results under contention. (a) conventional EA-PNoC, (b) PHENIC and (c) Wavelength-shifting

Figures 6.12 (a), (b) and (c) show the path configuration delay for conventional EA-PNoC, PHENIC and WS systems, respectively. We can see that PHENIC system outperforms the conventional one by up to 50%, with a worst case delay between 150 and 200 cycles when the PSCP (Path Setup Control Packet) is being blocked

a few hops before reaching the destination. For the conventional one, the worst case is between 200 and 300 cycles.

For the wavelength-shifting scheme, $T_{PR(ws)}$ and $T_{BL(ws)}$ are obtained with Equations 6.4.7 and 6.4.8, respectively, where $Wg_{(delay)}$ is the waveguide delay between two consecutive PSCs, $TRX_{(delay)}$ is the time needed to detect and modulate the signal, and $SN_{clk}$ is a clock cycle added if there is a need to switch between the X/Y and Y/X directions.

$$T_{PR(ws)}(n) = (n-1) * Wg_{(delay)} + n * TRX_{(delay)} + SN_{clk}\ (\text{if } X_{src} \neq X_{dst} \oplus Y_{src} \neq Y_{dst}) \tag{6.4.7}$$

$$T_{BL(ws)}(n, 0 < j \le n - 1) = (n - 1 - j) * Wg_{(delay)} + (n - j) * TRX_{(delay)} + SN_{clk}\ (\text{if } X_{src} \ne X_{dst} \oplus Y_{src} \ne Y_{dst}) \tag{6.4.8}$$
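Equations 6.4.7 and 6.4.8 can be evaluated in the same spirit. The sketch below makes several assumptions that the text does not state: `TRX_DELAY_PS` is taken as the sum of the detector, TIA, modulator driver and modulator delays of Table 6.2, the hop length `HOP_CM` is 0.25 cm, and the direction-switch condition is passed in as a boolean `needs_turn` rather than derived from coordinates. All names are hypothetical.

```python
WG_PS_PER_CM = 46.7  # Table 6.2: waveguide propagation
HOP_CM = 0.25        # ASSUMED distance between two consecutive PSCs
CLOCK_PS = 200.0     # one cycle at the 5 GHz electronic control frequency

# ASSUMPTION: detect + modulate = detector + TIA + modulator driver + modulator.
TRX_DELAY_PS = 0.3 + 6.9 + 16.3 + 20.0
WG_DELAY_PS = WG_PS_PER_CM * HOP_CM


def t_pr_ws(n: int, needs_turn: bool) -> float:
    """Eq. 6.4.7: wavelength-shifting path request delay over n hops (ps)."""
    switch = CLOCK_PS if needs_turn else 0.0
    return (n - 1) * WG_DELAY_PS + n * TRX_DELAY_PS + switch


def t_bl_ws(n: int, j: int, needs_turn: bool) -> float:
    """Eq. 6.4.8: blocking delay when the request is blocked at node j (ps)."""
    if not 0 < j <= n - 1:
        raise ValueError("j must satisfy 0 < j <= n - 1")
    switch = CLOCK_PS if needs_turn else 0.0
    return (n - 1 - j) * WG_DELAY_PS + (n - j) * TRX_DELAY_PS + switch


if __name__ == "__main__":
    for hops in (2, 4, 8, 16):
        print(hops, "hops:", round(t_pr_ws(hops, needs_turn=True), 3), "ps")
```

Under these assumed parameters the per-hop cost is a few tens of picoseconds, so the single buffering cycle at the turn is the largest individual term for short paths.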

To evaluate the proposed architecture in terms of delay and to have a fair comparison with the other architectures, we assume that any communication with more than three hops has to switch from X/Y to Y/X. This assumption is not realistic, but it lets us observe the impact of the direction switch combined with the blocking occurrence. Despite that, the results in Fig. 6.12 (c) show an important decrease in the path configuration delay, with a worst case between 20 and 30 cycles.

#### 6.4.4 Micro-Ring Release Time

In this evaluation, we study the impact of the new routing on the Micro-Ring Release Time (MRRT). Figures 6.13 (a), (b) and (c) show the micro-ring release time for the conventional EA-PNoC, PHENIC and WS systems, respectively. These results are expected, as they are the direct consequence of a fast and efficient path configuration scheme. We can see that the conventional EA-PNoC is highly penalized because all path configuration steps need to be transmitted electronically. With the new routing method we can see a considerable decrease in the time needed to release the



Figure 6.13: Micro-ring release time under contention-less condition. (a) conventional EA-PNoC, (b) PHENIC, and (c) Wavelength-shifting

reserved MRRs. In addition to the impact on the bandwidth (more communications can occur if the reservation time of the MRRs is decreased), decreasing the MRRT will help to decrease the needed thermal power for tuning these MRRs.

#### 6.4.5 Bandwidth

The achieved bandwidth under contention-less condition is shown in Fig. 6.14. The figure shows that for small distances, the bandwidth is considerably increased in the wavelength-shifting scheme. For example, for a communication of four hops, the WS mechanism can achieve up to 3500 Kb/cycle, while for the conventional and PHENIC architectures, the maximum is about 100 Kb/cycle and 250 Kb/cycle, respectively. For long distances (more than 16 hops), WS largely outperforms the other architectures, with 250 Kb/cycle against a few Kb/cycle for the two other architectures. In a typical computer system, communication distances are limited to a few hops, and long-distance communication is generally used for global synchronization messages, such as the ones used in cache coherence protocols.

#### 6.4.6 Path Configuration Energy

Similar to the path configuration delay, in this section, we evaluate the energy associated with the path configuration. This evaluation takes into consideration only the dynamic energy, as we want to see the effect of using photonic components on the energy. Table 6.3 shows the energy contribution of the involved components.



Figure 6.14: Offered bandwidth comparison results under contention-less.

| Component                        | Energy contribution |
|----------------------------------|---------------------|
| Buffering                        | 0.12 pJ/bit         |
| Electronic Crossbar              | 0.36 pJ/bit         |
| Electronic Wire                  | 0.34 pJ/mm/bit      |
| Electronic Static                | 0.35 pJ/bit         |
| Modulator Driver                 | 0.32 pJ/bit         |
| Modulator                        | 0.025 pJ/bit        |
| Detector TIA                     | 0.69 pJ/bit         |
| Detector                         | 0.05 pJ/bit         |
| Micro-Ring ON/OFF                | 0.375               |
| Micro-Ring Static Thermal Tuning | $1 \ \mu W/Ring/K$  |

Table 6.3: Energy contribution for 32nm technology nodes [97, 105].
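As a rough illustration of how the per-component figures in Table 6.3 combine, the sketch below sums them for one electronic hop and for one electro-optic plus opto-electronic conversion. The way the components are grouped, the 1 mm wire length, and all function names are assumptions made for this example; the actual composition used by the simulator is not specified here. The Electronic Static and Micro-Ring ON/OFF entries are left out, the latter because its unit is not given in the table.

```python
# Dynamic energy figures taken from Table 6.3 (pJ/bit unless noted).
ENERGY_PJ = {
    "buffering": 0.12,
    "crossbar": 0.36,
    "wire_per_mm": 0.34,        # pJ/mm/bit
    "modulator_driver": 0.32,
    "modulator": 0.025,
    "detector_tia": 0.69,
    "detector": 0.05,
}


def electronic_hop_energy(wire_mm: float) -> float:
    """Energy per bit (pJ) of one electronic hop: buffer + crossbar + wire.

    Assumption: a hop traverses one buffer, one crossbar and `wire_mm` of wire.
    """
    e = ENERGY_PJ
    return e["buffering"] + e["crossbar"] + e["wire_per_mm"] * wire_mm


def photonic_conversion_energy() -> float:
    """Energy per bit (pJ) of one E/O plus one O/E conversion.

    Assumption: modulator driver + modulator on the sender side,
    detector + TIA on the receiver side.
    """
    e = ENERGY_PJ
    return (e["modulator_driver"] + e["modulator"]
            + e["detector_tia"] + e["detector"])


if __name__ == "__main__":
    # Hypothetical 1 mm link between neighbouring routers.
    print(f"electronic hop (1 mm): {electronic_hop_energy(1.0):.3f} pJ/bit")
    print(f"photonic conversion:   {photonic_conversion_energy():.3f} pJ/bit")
```

Under these assumptions, a single electronic hop costs less than one photonic conversion, but the electronic cost is paid at every hop, whereas the conversion cost is paid once per conversion regardless of the distance travelled in between.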

#### 6.4.6.1 Energy Under Light Traffic

Figure 6.15 shows the path configuration dynamic energy under contention-less condition. In contrast with the delay results, the WS scheme is outperformed by our previous PHENIC system, because many photonic components are involved in the path configuration process in the WS scheme. PHENIC's advantage is most visible for a large number of hops. For example, for a communication of 20 hops, PHENIC achieves a path configuration energy of 150 while WS requires about 300. Nevertheless, WS still largely outperforms the conventional architecture.



Figure 6.15: Path configuration energy comparison results under contention-less.

#### 6.4.6.2 Energy Under Heavy Traffic

When blocking occurs, we see the same energy profile, with the PHENIC system outperforming the two other architectures. However, the gap between PHENIC and WS is smaller than in the contention-less condition.



Figure 6.16: Path configuration energy comparison results under contention. (a) conventional EA-PNoC, (b) PHENIC and (c) Wavelength-shifting.

This decrease in the gap between the two architectures shows the benefits of WS when the network is congested. In contention-less condition, PHENIC's energy reaches a maximum of 280, while the maximum for WS is about 450. In the presence of blocking, PHENIC's path configuration energy can reach 600, an increase of 114%, while WS can reach a maximum of 700 Joule, a rise of only 50%. Moreover, in this evaluation, we assumed that the path request is blocked only once, which is optimistic since a request can be blocked several times before it reaches the destination.
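The relative increases quoted above can be checked with a few lines of arithmetic. The sketch below uses the rounded maxima as they appear in the text; the function name `percent_increase` is illustrative.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Rounded maxima quoted in the text: (contention-less, with blocking).
MAX_ENERGY = {"PHENIC": (280, 600), "WS": (450, 700)}

if __name__ == "__main__":
    for name, (free, blocked) in MAX_ENERGY.items():
        print(f"{name}: +{percent_increase(free, blocked):.1f}%")
    # Absolute gap between WS and PHENIC in each condition.
    gap_free = MAX_ENERGY["WS"][0] - MAX_ENERGY["PHENIC"][0]
    gap_blocked = MAX_ENERGY["WS"][1] - MAX_ENERGY["PHENIC"][1]
    print(f"gap: {gap_free} (contention-less) vs {gap_blocked} (blocking)")
```

With these inputs the PHENIC increase reproduces the stated 114%. The WS increase comes out slightly above the stated 50%, which suggests that the "about 450" baseline or the percentage itself is rounded. The absolute gap between the two schemes shrinks from 170 to 100.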

#### 6.4.7 Energy Efficiency

Figure 6.17 depicts the energy efficiency comparison results under contention-less condition. The conventional architecture is always penalized, and its energy efficiency curve increases exponentially with the number of hops. This behavior is not adequate for future generations of many-core systems, since the energy efficiency should not be affected by the communication distance. The PHENIC system has a less aggressive curve, where the increase only becomes noticeable when the communication distance reaches 20 hops. For the proposed wavelength-shifting mechanism, the curve is flat and the energy efficiency is independent of the communication distance. This result comes from the fact that the overhead of the additional components that perform the path configuration steps is amortized by the large offered bandwidth obtained with the small path configuration delay.

Figure 6.17: Energy efficiency comparison results under contention-less.

#### 6.4.8 Chapter Summary

In this chapter, we presented a new control network architecture and its corresponding wavelength-shifting routing algorithm. We showed that this new scheme outperforms the previously discussed PHENIC system, and especially the conventional EA-PNoC architectures. The results show a significant reduction in the path configuration delay and in the time needed to release the micro-rings. While the dynamic energy increases with the wavelength-shifting scheme, the energy efficiency considerably decreases and becomes independent of the communication distance.

# Chapter 7

# **Thesis Summary and Discussion**

We conclude this dissertation with a chapter that summarizes the main contributions of this research and discusses the results obtained from the simulations and the design space exploration. Finally, we highlight how this work can be improved, as well as other design considerations that were not covered in this dissertation.

## 7.1 Contributions Summary

In this thesis, we presented a high-performance nanophotonics on-chip network architecture, named PHENIC, equipped with two contention-aware path configuration algorithms. The proposed architectures reduce the blocking occurrence in EA-PNoC (Electro-Assisted PNoC), and also reduce the path configuration delay and energy overheads.

We tackled the blocking occurrence by proposing a non-blocking photonic switch. In addition to routing the data stream, the proposed switch handles the ACK and Tear-down signals, which are moved to the photonic layer. Two switching policies were adopted in the proposed photonic switch: spatial switching for the data stream (i.e., payload data), as in previously proposed photonic switches, and wavelength-selective switching to handle the ACK and Tear-down signals.

To control the photonic switch, a light-weight electronic controller was also proposed. In addition to configuring the switch's MRRs like a conventional controller, the proposed controller manages the incoming and outgoing Tear-down signals. A contention-aware path configuration algorithm was proposed accordingly. This algorithm manages the different path configuration steps in both the ECN (Electronic Control Network) and the PCN (Photonic Communication Network).

The photonic switch and the electronic controller work together and enable the dissociation of the ECN and PCN, which reduces the blocking occurrence during the path configuration process.

Moreover, a new alternative to the conventional ECN is proposed, which consists of a control network made mainly of photonic components instead of electronic ones as in a typical EA-PNoC. To orchestrate the different path configuration steps, the proposed control network is augmented with a new wavelength-shifting routing algorithm. The key idea of the new control network is to configure the path using optical signals instead of electronic packets. Eliminating power-hungry electronic components, such as the crossbar and the buffer, reduces the path configuration overhead in terms of power and delay.

## 7.2 Results Summary

In this research, we focused mainly on the issues related to the path configuration from the architecture and algorithm points of view. In addition to the conventional metrics of any on-chip network (latency, bandwidth and power), the blocking latency, the number of blocked requests and the path configuration energy were also evaluated.

The PHENIC system was compared to conventional EA-PNoC (Electro-Assisted PNoC) architectures with different properties. With the proposed non-blocking electro-optic router and its path configuration algorithm, PHENIC reduces the blocking occurrence by up to 42% and 35% for 256-core and 64-core systems, respectively, under uniform random traffic. As a consequence, the path configuration energy overhead was considerably reduced for synthetic workloads as well as realistic ones. The speedup also increased when evaluated with the FFT and DataFlow workloads: it is 9% higher than that of a crossbar-based architecture for 256-core systems, whereas for 64-core systems PHENIC was outperformed by 7%. Nevertheless, PHENIC still outperforms all networks under all workloads regarding the path configuration energy, as well as the energy efficiency.

For the wavelength-routed control network, a design space exploration shows that the path configuration energy and delay can be further reduced when we use a wavelength-selective routing. We showed that the path configuration delay was considerably decreased when compared with the conventional control networks, where all configuration packets are generated and transmitted electronically. The wavelength-shifting scheme also outperforms PHENIC system, where half of the configuration packets are moved to the photonic layer. This comes at the cost of additional photonic components (i.e., rings, detectors, modulators) and static power needed to maintain the proper functionality of these components.

## 7.3 Discussion

Despite the good results obtained with the proposed PHENIC architecture, some issues should be addressed to enhance its performance and reliability.

The main challenge facing PNoC systems is their thermal sensitivity. PNoCs are sensitive to ambient temperature variations because their basic building blocks, ring resonators, are sensitive to those variations. For this reason, any proposed PNoC should be based on a thermally resilient architecture that tolerates small or even large temperature variations. For the time being, temperature fluctuation is handled by trimming methods, where the resonance wavelength is corrected when a temperature variation causes a drift. The ideal solution is to tackle this problem preventively. For example, a deep system-level modeling and analysis of thermal effects should be carried out to extract the thermal behavior of the photonic devices and the consequences of the interaction between light and silicon. Based on the obtained results, a thermally resilient architecture and routing algorithms should be proposed.

Besides thermal resilience, any proposed architecture must also take into consideration the noise and the Bit Error Rate (BER) at the receiver side. A multilayered photonic switch could be an option, since it can reduce or totally eliminate the waveguide crossings inside the switch. By removing the crossings and the resulting power loss, less laser power would need to be injected into the chip.

Moreover, the reliability issue could be tackled from the routing algorithm point of view. Because of aging factors, some MRRs could become faulty, degrading the link capability or even causing its permanent failure. The obvious solution is to reroute the message, but photonic interconnects have additional constraints. For example, an extra hop in a photonic interconnect could cause extra losses and noise, which could affect the injected laser power and the correctness of the data at the receiver side.
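To give a sense of scale for the thermal sensitivity raised in this discussion, the sketch below estimates the static power needed to keep a set of micro-rings tuned, using the 1 µW/Ring/K figure from Table 6.3. The linear model, the ring count and the temperature offset are hypothetical values chosen only for illustration; they are not results from this thesis.

```python
# Static thermal tuning cost from Table 6.3: 1 microwatt per ring per kelvin.
TUNING_UW_PER_RING_PER_K = 1.0


def thermal_tuning_power_mw(num_rings: int, delta_t_kelvin: float) -> float:
    """Static tuning power in mW.

    Assumption: power scales linearly with the number of rings and with
    the temperature offset that has to be compensated.
    """
    microwatts = TUNING_UW_PER_RING_PER_K * num_rings * delta_t_kelvin
    return microwatts / 1000.0


if __name__ == "__main__":
    # Hypothetical network: 10,000 rings, 20 K offset to compensate.
    print(f"{thermal_tuning_power_mw(10_000, 20):.1f} mW")
```

Because this cost grows with both the ring count and the temperature range, an architecture that tolerates larger drifts or uses fewer rings directly reduces the static power budget.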

Some other issues that are not covered in this thesis should also be mentioned. Regarding the memory wall, for example, some works have designed memory-processor interconnects whose goal is to mitigate the memory wall by providing more bandwidth to the memory using a photonic link. The gateway should also be carefully designed; its design depends mainly on the workload characteristics and the available laser power budget (i.e., number of available wavelengths). Thus, a deep study of the application characteristics needs to take place before the design of the gateway.

The integration of photonic and electronic devices in the same CMOS design flow is also an important research direction. The electronics industry spends billions of dollars to develop tools, processes and facilities. The most urgent question is how to reuse these capabilities for photonics. In reality, it is very difficult to reuse electronic facilities for photonics without making any process change. For example, photonics requires relatively primitive processing at 90 nm, in contrast to microelectronic chips, which require 16 nm processing.

# Bibliography

- [1] G. Leary and K. Chatha, "Design of noc for soc with multiple use cases requiring guaranteed performance," in 23rd International Conference on VLSI Design, Jan 2010, pp. 200–205.
- [2] "The law that's not a law," Spectrum, IEEE, vol. 52, no. 4, pp. 38–57, April 2015.
- [3] S. Rusu, S. Tam, H. Muljono, D. Ayers et al., "A 45nm 8 core enterprise xeon processor," in IEEE Asian Solid State Circuits Conference, Nov 2009, pp. 9–12.
- [4] S. Bell, B. Edwards, J. Amann, R. Conlin *et al.*, "TILE64 processor: A 64-core soc with mesh interconnect," in *IEEE International Solid-State Circuits Conference*, Feb 2008, pp. 88–598.
- [5] S. Vangal, J. Howard, G. Ruhl, S. Dighe *et al.*, "An 80-tile 1.28tflops network-on-chip in 65nm cmos," in *IEEE International Solid-State Circuits Conference*, Feb 2007, pp. 98–589.
- [6] International Technology Roadmap For Semiconductors: Chapter: Interconnect, 2011. [Online]. Available: www.itrs.net
- [7] R. Kumar, V. Zyuban, and D. M. Tullsen, "Interconnections in multi-core architectures: Understanding mechanisms, overheads and scaling," in *Proceedings of the 32nd Annual International Symposium on Computer Architecture*, ser. ISCA '05, 2005, pp. 408–419.

- [8] N. Magen, A. Kolodny, U. Weiser, and N. Shamir, "Interconnect-power dissipation in a microprocessor," in *Proceedings of the 2004 International Workshop on System Level Interconnect Prediction*, ser. SLIP '04, 2004, pp. 7–13.
- [9] P. Jacob, A. Zia, O. Erdogan, P. Belemjian *et al.*, "Mitigating memory wall effects in high-clock-rate and multicore cmos 3-d processor memory stacks," *Proceedings of the IEEE*, vol. 97, no. 1, pp. 108–122, Jan 2009.
- [10] A. Ben Abdallah, Multicore Systems-on-Chip: Practical Hardware/Software Design, 2nd Edition. Atlantis Press, Paris, 2013, ISBN-13: 978-9491216916.
- [11] A. Ben Ahmed and A. Ben Abdallah, "Graceful deadlock-free fault-tolerant routing algorithm for 3d network-on-chip architectures," J. Parallel Distrib. Comput., vol. 74, no. 4, pp. 2229–2240, April 2014.
- [12] K. Mori, A. Esch, A. Ben Abdallah, and K. Kuroda, "Advanced design issues for oasis network-on-chip architecture," in 2010 International Conference on Broadband, Wireless Computing, Communication and Applications (BWCCA), Nov 2010, pp. 74–79.
- [13] A. Ben Ahmed and A. Ben Abdallah, "Architecture and design of high-throughput, low-latency, and fault-tolerant routing algorithm for 3d-network-on-chip (3D-NoC)," J. Supercomput., vol. 66, no. 3, pp. 1507–1532, Dec. 2013.
- [14] L. Benini and G. D. Micheli, Networks on chips: technology and tools. Morgan Kaufmann, 2006.
- [15] B. Feero and P. Pande, "Performance evaluation for three-dimensional networks-on-chip," in *IEEE Computer Society Annual Symposium on VLSI*, March 2007, pp. 305–310.
- [16] A. Ben Abdallah and M. Sowa, "Basic network-on-chip interconnection for future gigascale mcsocs applications," in *Proc. of the Symposium on Science, Society, Technology, Communication and Computation Orthogonalization*, 2006, pp. 4–6.

- [17] K. N. Dang, M. Meyer, Y. Okuyama, X.-T. Tran, and A. Ben Abdallah, "A soft-error resilient 3d network-on-chip router," in *IEEE 7th International Conference on Awareness Science and Technology*, 2015.
- [18] J. Kim, D. Park, T. Theocharides, N. Vijaykrishnan, and C. R. Das, "A low latency router supporting adaptivity for on-chip interconnects," in *Proceedings of the 42nd Annual Design Automation Conference*, ser. DAC '05, 2005, pp. 559–564.
- [19] J. Kim, C. Nicopoulos, D. Park, V. Narayanan, M. Yousif, and C. Das, "A gracefully degrading and energy-efficient modular router architecture for on-chip networks," in 33rd International Symposium on Computer Architecture, 2006, pp. 4–15.
- [20] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jha, "Express virtual channels: Towards the ideal interconnection fabric," in *Proceedings of the 34th Annual International Symposium on Computer Architecture*, ser. ISCA '07, 2007, pp. 150–161.
- [21] R. Mullins, A. West, and S. Moore, "Low-latency virtual-channel routers for on-chip networks," in *Proceedings of the 31st Annual International Symposium on Computer Architecture*, ser. ISCA '04, 2004, p. 188.
- [22] W. J. Dally, "Performance analysis of k-ary n-cube interconnection networks," *IEEE Trans. Comput.*, vol. 39, no. 6, pp. 775–785, Jun. 1990.
- [23] J. Kim, J. Balfour, and W. Dally, "Flattened butterfly topology for on-chip networks," *Computer Architecture Letters*, vol. 6, no. 2, pp. 37–40, Feb 2007.
- [24] U. Ogras and R. Marculescu, "'It's a small world after all': NoC performance optimization via long-range link insertion," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 14, no. 7, pp. 693–706, July 2006.

- [25] C.-W. Chou, J.-F. Li, Y.-C. Yu, C.-Y. Lo *et al.*, "Hierarchical test integration methodology for 3-D ICs," *IEEE Design Test*, vol. 32, no. 4, pp. 59–70, Aug 2015.
- [26] S. Das, A. Fan, K.-N. Chen, C. S. Tan et al., "Technology, performance, and computer-aided design of three-dimensional integrated circuits," in *Proceedings of the 2004 International Symposium on Physical Design*, ser. ISPD '04, 2004, pp. 108–115.
- [27] C.-T. Ko, Z.-C. Hsiao, Y.-J. Chang, P.-S. Chen, Y.-J. Hwang et al., "A wafer-level three-dimensional integration scheme with cu tsvs based on microbump/adhesive hybrid bonding for three-dimensional memory application," *IEEE Transactions on Device and Materials Reliability*, vol. 12, no. 2, pp. 209–216, June 2012.
- [28] A. Ben Ahmed, A. Ben Ahmed, and A. B. Abdallah, "Deadlock-Recovery Support for Fault-tolerant Routing Algorithms in 3D-NoC Architectures," in *Embedded Multicore Socs (MCSoC), 2013 IEEE 7th International Symposium on*. IEEE, 2013, pp. 67–72.
- [29] A. Ben Ahmed and A. Ben Abdallah, "LA-XYZ: low latency, high throughput look-ahead routing algorithm for 3D network-on-chip (3D-NoC) architecture," in *Embedded Multicore Socs (MCSoC), 2012 IEEE 6th International Symposium on*. IEEE, 2012, pp. 167–174.
- [30] J. Joyner, P. Zarkesh-Ha, and J. Meindl, "A stochastic global net-length distribution for a three-dimensional system-on-a-chip (3d-soc)," in 14th Annual IEEE International ASIC/SOC Conference, 2001, pp. 147–151.
- [31] P. Zarkesh-Ha, J. Davis, and J. Meindl, "Prediction of net-length distribution for global interconnects in a heterogeneous system-on-a-chip," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 8, no. 6, pp. 649–659, Dec 2000.

- [32] A. W. Topol, D. C. La Tulipe, Jr., L. Shi, D. J. Frank et al., "Three-dimensional integrated circuits," *IBM J. Res. Dev.*, vol. 50, no. 4/5, pp. 491–506, Jul. 2006.
- [33] A. Rahman, A. Fan, and R. Reif, "Comparison of key performance metrics in two- and three-dimensional integrated circuits," in *Proceedings of the IEEE 2000 International Interconnect Technology Conference*, 2000, pp. 18–20.
- [34] V. Pavlidis and E. Friedman, "3-d topologies for networks-on-chip," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 15, no. 10, pp. 1081–1090, Oct 2007.
- [35] X. Zheng, S. Lin, Y. Luo, J. Yao, G. Li *et al.*, "Efficient wdm laser sources towards terabyte/s silicon photonic interconnects," *Journal of Lightwave Technology*, vol. 31, no. 24, pp. 4142–4154, Dec 2013.
- [36] G. Li, X. Zheng, J. Yao, H. Thacker *et al.*, "25gb/s 1v-driving cmos ring modulator with integrated thermal tuning," *Opt. Express*, vol. 19, no. 21, pp. 20435–20443, Oct 2011.
- [37] G. Li, X. Zheng, H. Thacker, J. Yao et al., "40 gb/s thermally tunable cmos ring modulator," in 2012 IEEE 9th International Conference on Group IV Photonics (GFP), Aug 2012, pp. 1–3.
- [38] D. Miller, "Rationale and challenges for optical interconnects to electronic chips," *Proceedings of the IEEE*, vol. 88, no. 6, pp. 728–749, June 2000.
- [39] A. Shacham, K. Bergman, and L. Carloni, "On the design of a photonic network-on-chip," in *First International Symposium on Networks-on-Chip*, *NOCS 2007*, May 2007, pp. 53–64.
- [40] J. Chan, G. Hendry, K. Bergman, and L. Carloni, "Physical-layer modeling and system-level design of chip-scale photonic interconnection networks," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 30, no. 10, pp. 1507–1520, Oct 2011.

- [41] A. Ben Ahmed, M. Meyer, Y. Okuyama, and A. Ben Abdallah, "Efficient router architecture, design and performance exploration for many-core hybrid photonic network-on-chip (2D-PHENIC)," in 2nd International Conference on Information Science and Control Engineering (ICISCE), April 2015, pp. 202–206.
- [42] A. Ben Ahmed, Y. Okuyama, and A. Ben Abdallah, "Contention-free routing for hybrid photonic mesh-based network-on-chip systems," in *The 9th IEEE International Symposium on Embedded Multicore/Manycore SoCs (MCSoc)*, September 2015, pp. 235–242.
- [43] A. Ben Ahmed, M. Meyer, Y. Okuyama, and A. Ben Abdallah, "Hybrid photonic noc based on non-blocking photonic switch and light-weight electronic router," in *The 2015 IEEE International Conference on Systems, Man and Cybernetics (SMC)*, October 2015.
- [44] D. Vantrease, R. Schreiber, M. Monchiero, M. McLaren, N. Jouppi et al., "Corona: System implications of emerging nanophotonic technology," in Computer Architecture, 2008. ISCA '08. 35th International Symposium on, June 2008, pp. 153–164.
- [45] R. Morris, A. Kodi, A. Louri, and R. Whaley, "Three-dimensional stacked nanophotonic network-on-chip architecture with minimal reconfiguration," *Computers, IEEE Transactions on*, vol. 63, no. 1, pp. 243–255, Jan 2014.
- [46] M. J. Cianchetti, J. C. Kerekes, and D. H. Albonesi, "Phastlane: A rapid transit optical routing network," *SIGARCH Comput. Archit. News*, vol. 37, no. 3, pp. 441–450, Jun. 2009.
- [47] M. Briere, B. Girodias, Y. Bouchebaba, G. Nicolescu, F. Mieyeville et al., "System level assessment of an optical noc in an mpsoc platform," in *Design*, *Automation Test in Europe Conference Exhibition*, 2007. DATE '07, April 2007, pp. 1–6.

- [48] Z. Chen, H. Gu, Y. Chen, and H. Zhang, "Wavelength assignment in optical network-on-chip: Design and performance," in *TENCON 2013 - 2013 IEEE Region 10 Conference (31194)*, Oct 2013, pp. 1–4.
- [49] M. S. Nawrocka, T. Liu, X. Wang, and R. R. Panepucci, "Tunable silicon microring resonator with wide free spectral range," *Applied Physics Letters*, vol. 89, no. 7, 2006.
- [50] M. Lipson, "Guiding, modulating, and emitting light on silicon-challenges and opportunities," *Journal of Lightwave Technology*, vol. 23, no. 12, pp. 4222–4238, Dec. 2005.
- [51] A. Ben Ahmed and A. Ben Abdallah, "Hybrid silicon-photonic network-on-chip for future generations of high-performance many-core systems," *The Journal of Supercomputing*, vol. 71, no. 12, pp. 4446–4475, 2015.
- [52] G. Soni and V. Banga, "Performance analysis of free space optical link at 650 nm wavelength," in *Third International Conference on Computational Intelligence and Information Technology*, Oct 2013, pp. 474–481.
- [53] T. Fukushima, Y. Ito, M. Murugesan, J. Bea, K. Lee *et al.*, "Tiny vcsel chip self-assembly for advanced chip-to-wafer 3d and hetero integration," in *3D Systems Integration Conference (3DIC), 2014 International*, Dec 2014, pp. 1–4.
- [54] K. Bergman, L. P. Carloni, A. Biberman, J. Chan, and G. Hendry, *Photonic Network-on-Chip Design*. Springer-Verlag New York, 2014, ISBN: 978-1-4419-9334-2.
- [55] B. Lee, C. Xiaogang, A. Biberman, L. Xiaoping *et al.*, "Ultrahigh-bandwidth silicon photonic nanowire waveguides for on-chip networks," *IEEE Photonics Technology Letters*, vol. 20, no. 6, pp. 398–400, March 2008.
- [56] H. Philipp, K. Andersen, W. Svendsen, and H. Ou, "Amorphous silicon rich silicon nitride optical waveguides for high density integrated optics," *Electronics Letters*, vol. 40, no. 7, pp. 419–421, April 2004.

- [57] M. Popovic, E. Ippen, and F. Kartner, "Low-loss bloch waves in open structures and highly compact, efficient si waveguide-crossing arrays," pp. 56–57, Oct 2007.
- [58] T. Barwicz, M. A. Popovic, P. T. Rakich, M. R. Watts *et al.*, "Microring-resonator-based add-drop filters in sin: fabrication and analysis," *Opt. Express*, vol. 12, no. 7, pp. 1437–1442, Apr 2004.
- [59] S. Xiao, M. H. Khan, H. Shen, and M. Qi, "Compact silicon microring resonators with ultra-low propagation loss in the C band," *Opt. Express*, vol. 15, no. 22, pp. 14467–14475, Oct 2007.
- [60] T. Baba, S. Akiyama, M. Imai, N. Hirayama *et al.*, "50-gb/s ring-resonator-based silicon modulator," *Opt. Express*, vol. 21, no. 10, pp. 11869–11876, May 2013.
- [61] P. Dong, R. Shafiiha, S. Liao, H. Liang *et al.*, "Wavelength-tunable silicon microring modulator," *Opt. Express*, vol. 18, no. 11, pp. 10941–10946, May 2010.
- [62] L. Vivien, J. Osmond, J.-M. Fédéli, D. Marris-Morini *et al.*, "42 ghz p.i.n germanium photodetector integrated in a silicon-on-insulator waveguide," *Opt. Express*, vol. 17, no. 8, pp. 6252–6257, Apr 2009.
- [63] A. Novack, M. Gould, Y. Yang, Z. Xuan *et al.*, "Germanium photodetector with 60 ghz bandwidth using inductive gain peaking," *Opt. Express*, vol. 21, no. 23, pp. 28387–28393, Nov 2013.
- [64] G. Hendry, J. Chan, S. Kamil, L. Oliker et al., "Silicon nanophotonic network-on-chip using tdm arbitration," in *IEEE 18th Annual Symposium on High Performance Interconnects (HOTI)*, Aug 2010, pp. 88–95.
- [65] G. Hendry, E. Robinson, V. Gleyzer, J. Chan et al., "Time-division-multiplexed arbitration in silicon nanophotonic networks-on-chip for high-performance chip multiprocessors," J. Parallel Distrib. Comput., vol. 71, no. 5, pp. 641–650, May 2011.

- [66] K. Xu, H. K. Tsang, G. Lei, Y. M. Chen *et al.*, "Osnr monitoring for nrz-psk signals using silicon waveguide two-photon absorption," *IEEE Photonics Journal*, vol. 3, no. 5, pp. 968–974, Oct 2011.
- [67] A. Pollarolo, T. Jeong, S. Benz, and H. Rogalla, "Johnson noise thermometry measurement of the boltzmann constant with a 200 omega sense resistor," *IEEE Transactions on Instrumentation and Measurement*, vol. 62, no. 6, pp. 1512–1517, June 2013.
- [68] S. Khunkhao, C. Umjaruan, A. Thaworn, S. Nuanloy et al., "Shot noise behavior of planar mo/n-si/mo photodetector structure in avalanche mode," in 2010 International Conference on Electrical Engineering/Electronics Computer Telecommunications and Information Technology (ECTI-CON), May 2010, pp. 1137–1141.
- [69] G. Hendry, E. Robinson, V. Gleyzer, J. Chan, L. Carloni et al., "Circuit-switched memory access in photonic interconnection networks for high-performance embedded computing," in High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, Nov 2010, pp. 1–12.
- [70] J. Chan and K. Bergman, "Photonic interconnection network architectures using wavelength-selective spatial routing for chip-scale communications," *IEEE/OSA Journal of Optical Communications and Networking*, vol. 4, no. 3, March 2012.
- [71] M. Petracca, B. Lee, K. Bergman, and L. Carloni, "Design exploration of optical interconnection networks for chip multiprocessors," in *High Performance Interconnects, 2008. HOTI '08. 16th IEEE Symposium on*, Aug 2008, pp. 31–40.
- [72] C. Adi, H. Matsutani, M. Koibuchi, H. Irie, T. Miyoshi, and T. Yoshinaga, "An efficient path setup for a photonic network-on-chip," in 2010 First International Conference on Networking and Computing (ICNC), Nov 2010, pp. 156–161.

- [73] H. Matsutani, M. Koibuchi, H. Amano, and T. Yoshinaga, "Prediction router: Yet another low latency on-chip router architecture," in *High Performance Computer Architecture*, 2009. HPCA 2009. IEEE 15th International Symposium on, Feb 2009, pp. 367–378.
- [74] Y. Ye, J. Xu, B. Huang, X. Wu et al., "3-D mesh-based optical network-on-chip for multiprocessor system-on-chip," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 32, no. 4, pp. 584–596, April 2013.
- [75] R. Ji, L. Yang, L. Zhang, Y. Tian *et al.*, "Five-port optical router for photonic networks-on-chip," *Opt. Express*, vol. 19, no. 21, pp. 20258–20268, Oct 2011.
- [76] H. Wang, L. Benjamin, A. Shacham, and K. Bergman, "Silicon nanophotonic network-on-chip using tdm arbitration," in *High Performance Interconnects* (HOTI), 2010 IEEE 18th Annual Symposium on, Aug 2010, pp. 88–95.
- [77] D. Vantrease, N. Binkert, R. Schreiber, and M. Lipasti, "Light speed arbitration and flow control for nanophotonic interconnects," in *Microarchitecture*, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium on, Dec 2009, pp. 304–315.
- [78] H. Gu, J. Xu, and W. Zhang, "A low-power fat tree-based optical network-on-chip for multiprocessor system-on-chip," in *Design, Automation Test in Europe Conference Exhibition, DATE '09*, April 2009, pp. 3–8.
- [79] S. Pasricha and N. Dutt, "Orb: An on-chip optical ring bus communication architecture for multi-processor systems-on-chip," in *Design Automation Conference, 2008. ASPDAC 2008. Asia and South Pacific*, March 2008, pp. 789–794.
- [80] R. Beausoleil, J. Ahn, N. Binkert, A. Davis, D. Fattal et al., "A nanophotonic interconnect for high-performance many-core computation," in 16th IEEE Symposium on High Performance Interconnects, Aug 2008, pp. 182–189.

- [81] X. Zhang and A. Louri, "A multilayer nanophotonic interconnection network for on-chip many-core communications," in 47th ACM/IEEE Design Automation Conference (DAC), June 2010, pp. 156–161.
- [82] N. Kirman and J. F. Martínez, "A power-efficient all-optical on-chip interconnect using wavelength-based oblivious routing," SIGARCH Comput. Archit. News, vol. 38, no. 1, pp. 15–28, Mar. 2010.
- [83] A. Aggarwal, A. Bar-Noy, D. Coppersmith, R. Ramaswami, B. Schieber, and M. Sudan, "Efficient routing and scheduling algorithms for optical networks," in *Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms*, ser. SODA '94, 1994, pp. 412–423.
- [84] Y. Pan, J. Kim, and G. Memik, "Flexishare: Channel sharing for an energy-efficient nanophotonic crossbar," in *High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on*, Jan 2010, pp. 1–12.
- [85] Y. Pan, P. Kumar, J. Kim, G. Memik, Y. Zhang, and A. Choudhary, "Firefly: Illuminating future network-on-chip with nanophotonics," in *Proceedings of the 36th Annual International Symposium on Computer Architecture*, ser. ISCA '09. ACM, 2009, pp. 429–440.
- [86] X. Tan, M. Yang, L. Zhang, X. Wang, and Y. Jiang, "A hybrid optoelectronic networks-on-chip architecture," *Lightwave Technology, Journal of*, vol. 32, no. 5, pp. 991–998, March 2014.
- [87] J. Wang, G. Han, B. Li, J. Lu, and W. Dou, "Cpnoc: An energy-efficient photonic network-on-chip," in Advanced Information Networking and Applications Workshops (WAINA), 2013 27th International Conference on, March 2013, pp. 1571–1576.
- [88] A. Ben Ahmed and A. Ben Abdallah, "PHENIC: Silicon photonic 3d-network-on-chip architecture for high-performance heterogeneous many-core system-on-chip," in Sciences and Techniques of Automatic Control and Computer Engineering (STA), 2013 14th International Conference on, Dec 2013, pp. 1–9.

- [89] M. Meyer, Y. Okuyama, and A. Ben Abdallah, "On the design of a fault-tolerant photonic network," in *The 2015 IEEE International Conference on Systems, Man and Cybernetics (SMC)*, October 2015, pp. 821–826.
- [90] M. Meyer, A. Ben Ahmed, Y. Tanaka, and A. Ben Abdallah, "FTTDOR: Microring fault-resilient optical router for reliable network-on-chip systems," in *The 9th IEEE International Symposium on Embedded Multicore/Manycore SoCs (MCSoc)*, September 2015, pp. 227–234.
- [91] A. Ben Ahmed, T. Ochi, S. Miura, and A. Ben Abdallah, "Run-Time Monitoring Mechanism for Efficient Design of Application-Specific NoC Architectures in Multi/Manycore Era," in Complex, Intelligent, and Software Intensive Systems (CISIS), 2013 Seventh International Conference on, July 2013, pp. 440–445.
- [92] S. Miura, A. Ben Abdallah, and K. Kuroda, "PNoC: Design and Preliminary Evaluation of a Parameterizable NoC for MCSoC Generation and Design Space Exploration," in *The 19th Intelligent System Symposium (FAN 2009)*, 2009, pp. 314–317.
- [93] A. Ben Ahmed, A. Ben Abdallah, and K. Kuroda, "Architecture and design of efficient 3D network-on-chip (3D NoC) for custom multicore SoC," in Broadband, Wireless Computing, Communication and Applications (BWCCA), 2010 International Conference on. IEEE, 2010, pp. 67–73.
- [94] K. Mori, A. Ben Abdallah, and K. Kuroda, "Design and evaluation of a complexity effective network-on-chip architecture on FPGA," in *Proc. of The 19th Intelligent System Symposium (FAN 2009)*, 2009, pp. 318–321.
- [95] A. Ben Abdallah, T. Yoshinaga, and M. Sowa, "Mathematical Model for Multiobjective Synthesis of NoC Architectures," in *IEEE Proc. of the 36th International Conference on Parallel Processing*, 2007.
- [96] A. Ben Ahmed and A. Ben Abdallah, "Low-overhead Routing Algorithm for 3D Network-on-Chip," in Networking and Computing (ICNC), 2012 Third International Conference on, Dec 2012, pp. 23–32.

- [97] J. Chan, G. Hendry, A. Biberman, K. Bergman, and L. Carloni, "Phoenixsim: A simulator for physical-layer analysis of chip-scale photonic interconnection networks," in *Design, Automation Test in Europe Conference Exhibition* (DATE), 2010, March 2010, pp. 691–696.
- [98] K. Preston, N. Sherwood-Droz, J. Levy, and M. Lipson, "Performance guidelines for wdm interconnects based on silicon microring resonators," in 2011 Conference on Lasers and Electro-Optics (CLEO), May 2011, pp. 1–2.
- [99] L. Brusberg, H. Schroder, M. Queisser, and K. Lang, "Single-mode glass waveguide platform for dwdm chip-to-chip interconnects," in 2012 IEEE 62nd Electronic Components and Technology Conference (ECTC), May 2012, pp. 1532–1539.
- [100] H. Pan, S. Assefa, W. M. J. Green, D. M. Kuchta *et al.*, "High-speed receiver based on waveguide germanium photodetector wire-bonded to 90nm soi cmos amplifier," *Opt. Express*, vol. 20, no. 16, pp. 18145–18155, Jul 2012.
- [101] F. Xia, L. Sekaric, and Y. Vlasov, "Ultracompact optical buffers on a silicon chip," *Nat. Photon.*, vol. 1, pp. 65–71, 2007.
- [102] W. Bogaerts, P. Dumon, D. V. Thourhout, and R. Baets, "Low-loss, low-cross-talk crossings for silicon-on-insulator nanophotonic waveguides," *Opt. Lett.*, vol. 32, pp. 2801–2803, Oct 2007.
- [103] B. Lee, A. Biberman, P. Dong, M. Lipson, and K. Bergman, "All-optical comb switch for multiwavelength message routing in silicon photonic networks," *IEEE Photonics Technology Letters*, vol. 20, no. 10, pp. 767–769, May 2008.
- [104] A. Kahng, B. Li, L.-S. Peh, and K. Samadi, "Orion 2.0: A power-area simulator for interconnection networks," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 20, no. 1, pp. 191–196, Jan 2012.
- [105] A. Shacham, K. Bergman, and L. Carloni, "Photonic networks-on-chip for future generations of chip multiprocessors," *IEEE Transactions on Computers*, vol. 57, no. 9, pp. 1246–1260, Sept 2008.

- [106] H. Andrews, "A high-speed algorithm for the computer generation of fourier transforms," *IEEE Transactions on Computers*, vol. C-17, no. 4, pp. 373–375, April 1968.
- [107] G. Hendry, E. Robinson, V. Gleyzer, J. Chan et al., "Circuit-switched memory access in photonic interconnection networks for high-performance embedded computing," in 2010 International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2010, pp. 1–12.
- [108] M. Frigo and S. G. Johnson, "Benchmarked fft implementations." [Online]. Available: http://www.fftw.org/benchfft/ffts.html
- [109] G. Chen, H. Chen, M. Haurylau, N. Nelson et al., "Predictions of cmos compatible on-chip optical interconnect," in Proceedings of the 2005 International Workshop on System Level Interconnect Prediction, ser. SLIP '05, 2005, pp. 13–20.
- [110] A. Joshi, C. Batten, Y.-J. Kwon, S. Beamer et al., "Silicon-photonic clos networks for global on-chip communication," in 3rd ACM/IEEE International Symposium on Networks-on-Chip, May 2009, pp. 124–133.
- [111] Y.-H. Kao and H. Chao, "Blocon: A bufferless photonic clos network-on-chip architecture," in *Fifth IEEE/ACM International Symposium on Networks on Chip (NoCS)*, May 2011, pp. 81–88.

# List of Publications

### **Refereed Journals**

- Achraf Ben Ahmed, Abderazek Ben Abdallah, "Silicon-Photonic Network-On-Chip for Future Generations of High-performance Many-core Systems", The Journal of Supercomputing, vol. 71, no. 12, pp. 4446-4475, October 2015.
- Achraf Ben Ahmed, Abderazek Ben Abdallah, "Architecture and Design of Real-time System for Elderly Health Monitoring", International Journal of Embedded Systems, Inderscience Publishers. (In press)

### **Refereed International Conferences**

- Achraf Ben Ahmed, Michael Meyer, Yuichi Okuyama and Abderazek Ben Abdallah, "Hybrid Photonic NoC Based On Non-blocking Photonic Switch and Light-weight Electronic Router", in the IEEE Proceeding of the 2015 International Conference on Systems, Man and Cybernetics (SMC), pp. 56-61, October 2015.
- Achraf Ben Ahmed, Yuichi Okuyama and Abderazek Ben Abdallah, "Contention-free Routing For Hybrid Photonic mesh-based Network-On-Chip Systems", in the IEEE proceeding of the 9th International Symposium on Embedded Multicore/Manycore SoCs (MCSoC), pp. 235-242, September 2015.
- Achraf Ben Ahmed, Michael Meyer, Yuichi Okuyama and Abderazek Ben Abdallah, "Efficient Router Architecture, Design and Performance Exploration for Many-core Hybrid Photonic Network-on-Chip (2D-PHENIC)", in the IEEE Proceedings of the International Conference on Information Science and Control Engineering (ICISC), pp. 202-206, April 2015.

- Achraf Ben Ahmed and Abderazek Ben Abdallah, "PHENIC: Towards Photonic 3D-Network-on-Chip Architecture for High-throughput Many-core Systems-on-Chip", in the IEEE Proceeding of the 14th International Conference on Sciences and Techniques of Automatic control and computer engineering, pp. 1-9, December 2013.
- Achraf Ben Ahmed and Abderazek Ben Abdallah, "Hardware/Software Prototyping of Dependable Real-Time System for Elderly Health Monitoring", in the IEEE Proceeding of the 14th International conference on Sciences and Techniques of Automatic control and computer engineering, pp. 1-9, December 2013.
- Achraf Ben Ahmed, Yumiko Kimezawa and Abderazek Ben Abdallah, "Towards Smart Health Monitoring System for Elderly People", in the IEEE Proceeding of The 4th International Conference on Awareness Science and Technology, pp. 248-253, August 2012.