

# Power- and Performance-aware On-Chip Interconnection Architectures for Many-core Systems

by

### Hemanta Kumar Mondal

Under the Supervision of Dr. Sujay Deb

Indraprastha Institute of Information Technology Delhi

April, 2017

© Indraprastha Institute of Information Technology Delhi (IIITD),

New Delhi 110020



#### Submitted

in partial fulfillment of the requirements for the degree of Doctor of Philosophy

to the

Indraprastha Institute of Information Technology Delhi

April, 2017

### Certificate

This is to certify that the thesis titled "Power- and Performance-aware On-Chip Interconnection Architectures for Many-core Systems" being submitted by Mr. Hemanta Kumar Mondal to the Indraprastha Institute of Information Technology Delhi, for the award of the degree of Doctor of Philosophy, is an original research work carried out by him under my supervision. In my opinion, the thesis has reached the standards fulfilling the requirements of the regulations relating to the degree.

The results contained in this thesis have not been submitted in part or full to any other university or institute for the award of any degree/diploma.

April, 2017

Dr. Sujay Deb

Indraprastha Institute of Information Technology Delhi

New Delhi 110 020


Dedicated to the loving memories of my Father and my Uncle.....

### Acknowledgements

I have indeed been lucky to have the opportunity to work with Dr. Sujay Deb, my PhD advisor. It would not have been possible to complete this journey without his immense support. He always guided me not only as a mentor, but also as a great philosopher. I learned a lot under his supervision. Dr. Deb is one of the most fascinating people I have ever met in my life. His enthusiasm, excellent insights, strong technical expertise, and systematic problem-solving approach helped me a lot to excel in my research career. His immense energy, optimism and positive attitude have kept motivating me to face challenges at both the professional and the personal level. Thank you, Dr. Sujay Deb. I am looking forward to continuing research with you in the future.

A special mention goes to my doctoral committee members Dr. M. S. Hashmi and Dr. Alexander Fell and my collaborator Dr. Amlan Ganguly, who have played a major role in shaping my PhD journey. I thank them for their insightful feedback and comments. I would like to take the privilege of acknowledging Prof. Pankaj Jalote for his thought and research vision, which I really admire. Special thanks to all the professors and administrative staff for all kinds of financial and infrastructure support. In particular, I would like to thank Mr. Ashutosh Brahma for his ample support throughout my PhD journey. In every difficult time, he was always with me.

I would like to thank all my (present and past) colleagues Naushad Ansari, Shiju S, Dr. Hemant Aggarwal, Wazir Singh, Vijay Sharma, Naveen Gupta, Salin Verma, Rakhi Hemani, Sonal Aggarwal, Ayushi Rastogi, Aijaz Zaidi, Amit Chauhan, Rahul Gangopadhyay, Dinesh Rano, Deepayan Banerjee, Nalla Anandakumar, Mitali Sinha, Arpan Jati, Dipto Sarkar, Nilabha Bhattacharjee and Aritra Hazra for their presence during my tough times. I would like to express my sincere thanks to all my lab mates for their support, valuable technical discussions and, especially, the 'Friday party', which I will miss a lot. It is my pleasure to extend my gratitude to Sri Harsha Gade, Raghav Kishore, Wazir Singh, Sidhartha Sankar Rout and Shashwat Kaushik for their massive support during my research.

My family's support has been enormous throughout my journey. My biggest thanks go to my father, mother, uncle, aunty, sisters, brothers, and all my relatives for their remarkable support. Thank you all once again for your constant support and blessings throughout my journey. Thanks to my wife, Mampi. She never forced me to bring her to the University. Thank God that I got such a wonderful wife. Now, it's time to start the second phase of my life. Thank you all!!!

### **Abstract**

Networks-on-Chip (NoCs) are fast becoming the de facto communication infrastructure in Chip Multi-Processors (CMPs) for large-scale applications. Traditional NoCs implemented with planar metal interconnects suffer from high latency and significant power consumption overhead. This is mainly due to multi-hop data exchange over wired links, particularly when the number of cores is high. To address these problems, Wireless NoCs (WNoCs), which augment the multi-hop wired interconnects of a NoC with high-bandwidth, single-hop, long-range wireless links, are being explored. Although WNoCs replace multi-hop communication, NoC components, including the wireless transceivers, still consume a significant portion of chip power, which is one of the major bottlenecks in NoC architectures for CMPs. With progressing technology generations and system sizes, this proportion increases exponentially. Another important concern with existing WNoCs is the performance limitation due to single-frequency-channel communication with omnidirectional antenna setups. These bottlenecks open up new opportunities for detailed investigation into the power and performance efficiency of WNoCs and for the design of low-energy, high-performance communication infrastructures for CMPs.

Analysis of network resources for several benchmarks shows that utilization, and hence energy consumption, is application dependent and that the desired performance can be achieved without operating all resources at maximum specifications. To reduce power consumption, we propose a leakage power-aware NoC architecture that power-gates routers based on their utilization. To compute router utilization, we propose an adaptive two-step estimation method that computes utilization at both the global and the local router level. This hybrid estimation method provides an accurate prediction of router utilization with low run-time overhead. Using the utilization estimates, we reduce the switching and idle-state power consumption of the WNoC architecture. To eliminate the impact of power gating and maintain performance, we implement a deadlock-free Seamless Bypass Routing (SBR) strategy that bypasses power-gated routers.

Based on router utilization, we propose a switching (dynamic) power-aware NoC architecture that uses an Adaptive Multi-Voltage Scaling (AMS) mechanism to achieve significant energy savings. To implement the AMS-based WNoC architectures, we also propose a multi-level voltage shifter with an efficient control mechanism that allows switching between two voltage levels from a given fixed set of voltage levels. Moreover, most wireless interconnects are implemented using a token passing protocol in which only a single pair is actively involved in data transmission at any given time. Hence, the wireless transceivers can be selectively switched on and off depending on the workload. This improves the power efficiency of the network without affecting overall network performance, especially when all the wireless transceivers are designed to operate at the same frequency and only one pair can use the channel at a time.

Since not all of these wireless links are required all the time, power-gated wireless transceivers can provide an effective solution for power-efficient WNoC design. In this dissertation, to increase power efficiency, we also propose a partially power-gated transceiver for wireless interfaces (WIs) that uses AMS to reduce idle-state power consumption based on the utilization of the WIs. For packets transmitted over wireless links, a receiver-end control strategy is proposed. This enables an effective power-gating strategy for WIs, as it eliminates periodic waking of the complete receiver chain. The proposed technique also significantly reduces routing overhead and the need for control signals.

However, most existing WNoC architectures use omnidirectional antennas along with a token passing protocol to access the wireless medium. This limits the achievable performance benefits, since only one wireless pair can communicate at a time. It is also not practical in the immediate future to arbitrarily scale up the number of non-overlapping channels by designing mm-wave transceivers operating in disjoint frequency bands. Consequently, we explore the use of directional antennas, with which multiple wireless interconnect pairs can communicate simultaneously. Concurrent wireless communication can result in interference, which can be minimized by optimal placement of the wireless nodes. To address this, we propose an interference-aware Directional Wireless NoC (DWNoC) topology with optimal placement of WIs that incorporates planar log-periodic antennas (PLPAs). The DWNoC architecture enables directional point-to-point links between transceivers, and hence multiple wireless links can operate at the same time without interference. It also significantly increases the energy efficiency of the DWNoC, as well as the utilization of WIs, compared to existing NoC architectures.

In addition, we address the on-chip communication bottlenecks between Last Level Caches (LLCs) and Memory Controllers (MCs) in accessing off-chip memory. Communication between LLCs and memory controllers faces significant challenges due to the placement of memory controllers, high network latency, and the switching strategy. In particular, as system size increases, the latency between the caches and the limited number of memory controllers increases, thereby degrading memory performance. To overcome this, we propose an adaptive hybrid switching strategy with a dual-crossbar router that provides low-latency paths between caches and memory controllers. Performance is further improved by finding the optimal number and placement of memory controllers with low overhead. To reduce the energy overhead of the dual-crossbar routers, we introduce partially drowsy and power-gated techniques for the routers in the proposed architecture.

### **Publications**

#### **Journal Articles**

- 1. **Hemanta Kumar Mondal**, Sri Harsha Gade, M S Shamim, Sujay Deb, and Amlan Ganguly, "Interference-Aware Wireless Network-on-Chip Architecture using Directional Antennas," IEEE Transactions on Multi-Scale Computing Systems (**TMSCS**), July 2016. (DOI: 10.1109/TMSCS.2016.2595527)
- 2. **Hemanta Kumar Mondal**, Sri Harsha Gade, Raghav Kishore, and Sujay Deb, "SNoC: Utilization-based Adaptive Strategy for Power-efficient NoC Architecture," Sustainable Computing: Informatics and Systems, Elsevier. (Minor revision)
- 3. **Hemanta Kumar Mondal**, Sri Harsha Gade, Raghav Kishore, and Sujay Deb, "Adaptive Multi-Voltage Scaling with Utilization Prediction for Energy-efficient Wireless NoC," IEEE Transactions on Sustainable Computing (**T-SUSC**), August 2017. (Accepted)
- 4. **Hemanta Kumar Mondal**, Sri Harsha Gade, Raghav Kishore, and Sujay Deb, "An Efficient and Reconfigurable Evaluation Framework for Irregular NoC Architecture," (Under preparation)
- 5. **Hemanta Kumar Mondal**, Sri Harsha Gade, and Sujay Deb, "Interconnection Supports for Energy-efficient High Bandwidth Memory Access in CMPs," (Under review)

#### **Conference Articles**

- 1. Shrestha Bansal, **Hemanta Kumar Mondal**, Sri Harsha Gade and Sujay Deb, "Energy-efficient NoC Router for High Throughput Applications in Many-core GPUs," iNIS 2017. (Under review)
- 2. Shashwat Kaushik, Muni Aggarwal, **Hemanta Kumar Mondal**, Sri Harsha Gade and Sujay Deb, "Path Loss-aware Adaptive Transmission Power Control Scheme for Energy-efficient Wireless NoC," MWSCAS. (Accepted)
- 3. **Hemanta Kumar Mondal**, Shashwat Kaushik, Sri Harsha Gade, and Sujay Deb, "Energy-efficient Transceiver for Wireless NoC," 30th International Conference on VLSI Design (**VLSID**), Hyderabad, January 2017.
- 4. Raghav Kishore, **Hemanta Kumar Mondal** and Sujay Deb, "Energy-efficient Reconfigurable Framework for Evaluating Hybrid NoCs," 20th International Symposium on VLSI Design and Test (VDAT), IIT Guwahati, India, May 2016.
- 5. **Hemanta Kumar Mondal**, Sri Harsha Gade, Raghav Kishore, and Sujay Deb, "Adaptive multi-voltage scaling in wireless NoC for high performance low power applications," Design, Automation & Test in Europe Conference & Exhibition (**DATE**), Dresden, Germany, pp. 1315-1320, 2016.
- 6. **Hemanta Kumar Mondal**, Sri Harsha Gade, Raghav Kishore, Shashwat Kaushik and Sujay Deb, "Power Efficient Router Architecture for Wireless Network-on-Chip," 17th International Symposium on Quality Electronic Design (ISQED), Santa Clara, March 2016.
- 7. **Hemanta Kumar Mondal**, Sri Harsha Gade, Raghav Kishore, and Sujay Deb, "Power- and performance-aware fine-grained reconfigurable router architecture for NoC," Green Computing Conference and Sustainable Computing Conference (**IGSC**), 2015 Sixth International, Las Vegas, NV, pp. 1-6, 2015.
- 8. Sri Harsha Gade, **Hemanta Kumar Mondal**, and Sujay Deb, "A Hardware and Thermal Analysis of DVFS in a Multi-core System with Hybrid WNoC Architecture," 2015 28th International Conference on VLSI Design (**VLSID**), Bangalore, pp. 117-122, 2015.
- 9. **Hemanta Kumar Mondal**, and Sujay Deb, "Wireless network-on-chip: a new era in multi-core chip design," 2014 25th IEEE International Symposium on Rapid System Prototyping (**RSP**) in conjunction with ESWeek'14, New Delhi, pp. 59-64, 2014. (Invited paper)
- 10. **Hemanta Kumar Mondal**, and Sujay Deb, "An energy efficient wireless Network-on-Chip using power-gated transceivers," 2014 27th IEEE International System-on-Chip Conference (SOCC), Las Vegas, pp. 243-248, 2014.
- 11. **Hemanta Kumar Mondal**, Sri Harsha Gade and Sujay Deb, "An Efficient Hardware Implementation of DVFS in Multi-core System with Wireless Network-on-Chip," 2014 IEEE Computer Society Annual Symposium on VLSI (**ISVLSI**), Tampa, FL, pp. 184-189, 2014.
- 12. **Hemanta Kumar Mondal**, and Sujay Deb, "Energy efficient on-chip wireless interconnects with sleepy transceivers," 2013 8th IEEE Design and Test Symposium (**IDT**), Marrakesh, pp. 1-6, 2013.

# Contents

- CERTIFICATE
- ACKNOWLEDGEMENTS
- ABSTRACT
- PUBLICATIONS
- CONTENTS
- LIST OF FIGURES
- LIST OF TABLES
- CHAPTER 1 INTRODUCTION
  - 1.1 Motivation and Introduction
  - 1.2 Contributions
  - 1.3 Dissertation outline
- CHAPTER 2 LEAKAGE POWER-AWARE NOC
  - 2.1 Related work
  - 2.2 Leakage Power-Aware NoC
    - 2.2.1 NoC Router Design
    - 2.2.2 Global Utilization Computation
    - 2.2.3 Runtime Utilization Estimation
    - 2.2.4 Power Management Controller (PMC)
    - 2.2.5 Routing Strategy
  - 2.3 Walkthrough Example
  - 2.4 Performance Evaluation
    - 2.4.1 Simulation Setup
    - 2.4.2 Performance Metrics Review
    - 2.4.3 Router-level Statistics
    - 2.4.4 Bandwidth and Latency Analysis
    - 2.4.5 Energy saving
    - 2.4.6 Comparison with Other Low Power NoCs
    - 2.4.7 Router Power and Area Overhead
    - 2.4.8 Scalability
  - 2.5 Conclusions
- CHAPTER 3 SWITCHING POWER-AWARE NOC
  - 3.1 Preliminaries and Related Works
  - 3.2 Switching and Idle-state Power-aware WNoC
    - 3.2.1 System Architecture
    - 3.2.2 Dynamic Runtime Utilization Computation
    - 3.2.3 AMS: Base Router Control
    - 3.2.4 AMS: WI Control
    - 3.2.5 Routing Strategy
  - 3.3 Hardware Implementation
    - 3.3.1 Adaptive Multi-Voltage Scaling Controller
    - 3.3.2 On-Chip Voltage Regulator
    - 3.3.3 Voltage Level Shifter
  - 3.4 Experiments
    - 3.4.1 Experimental Setup
    - 3.4.2 AMSC-based Router Implementation
    - 3.4.3 Power Gated LNA and PA Implementation
    - 3.4.4 Comparison of Energy and Energy-delay Product
    - 3.4.5 Energy Saving
    - 3.4.6 Performance Evaluation
    - 3.4.7 Scalability and Impact of Power Gating
  - 3.5 Conclusion
- CHAPTER 4 DIRECTIONAL WIRELESS NOC ARCHITECTURE
  - 4.1 Related work
  - 4.2 Communication Infrastructure
    - 4.2.1 Physical Layer Design
    - 4.2.2 Topology of the DWNoC
    - 4.2.3 Interference Aware WI Placement Problem
    - 4.2.4 Interference Avoidance Constraints
    - 4.2.5 Proposed Optimization Algorithm
    - 4.2.6 Communication Protocol
  - 4.3 Experimental Results
    - 4.3.1 Simulation Setup
    - 4.3.2 Interference-Aware Constraints for WI placement
    - 4.3.3 Antenna Characteristics and Link Budget Analysis
    - 4.3.4 Interference-Aware WI Placement
    - 4.3.5 Performance Evaluation of DWNoC
    - 4.3.6 Application Specific Traffic
    - 4.3.7 Area Overheads and Scalability
  - 4.4 Conclusion
- CHAPTER 5 LOW LATENCY NETWORK FOR OFF-CHIP MEMORY ACCESS
  - 5.1 Related work
  - 5.2 Efficient Low Latency NoC
    - 5.2.1 Hybrid Switching Strategy
    - 5.2.2 Energy-efficiency
  - 5.3 Experimental Results
    - 5.3.1 Simulation Setup
    - 5.3.2 Router Implementation with Hybrid Switching and Overheads
    - 5.3.3 Memory Controller Placement
    - 5.3.4 Performance Evaluation
    - 5.3.5 Energy saving
    - 5.3.6 Summary of Existing and Proposed Architectures
  - 5.4 Conclusions and Future Works
    - 5.4.1 Conclusions
    - 5.4.2 Future Works
- CHAPTER 6 CONCLUSIONS AND FUTURE WORKS
  - 6.1 Conclusions
  - 6.2 Future works
    - 6.2.1 Energy-efficient High Bandwidth Memory Access
    - 6.2.2 Hybrid electrical-optical interconnect
    - 6.2.3 Fault-tolerant and reliable emerging interconnect
    - 6.2.4 Network-on-chip architecture for artificial neural networks
- BIBLIOGRAPHY

# List of Figures

- Figure 1.1 Normalized NoC power consumption with processing cores at different technology nodes
- Figure 1.2 Component wise power consumption of transceiver at 65nm
- Figure 1.3 Dynamic and leakage power of router at Vdd=1V
- Figure 1.4 Overview of energy-efficient WNoC highlighting all the Contributions
- Figure 2.1 Implementation of power gated hybrid router design for proposed NoC
- Figure 2.2 Categorization of routers based on pre-computed global utilization under real applications for a 64-core system
- Figure 2.3 Block level diagram shows power management controller with control signals
- Figure 2.4 Router bypasses links and control circuitry
- Figure 2.5 Pipeline stages of (a) regular XY routing and (b) seamless bypass routing
- Figure 2.6 Example scenario illustrating our proposed architecture
- Figure 2.7 Histogram for flit arrival at router that represents (a) HU, (b) MU and (c) RU
- Figure 2.8 Normalized bandwidth of NoC architectures with different traffic situations for a 64 core system wrt mesh
- Figure 2.9 Normalized saturation latency of Proposed NoC based architecture over regular architectures wrt mesh
- Figure 2.10 Percentage of packet energy saving using proposed NoC over regular mesh and WNoC
- Figure 3.1 Cluster-level AMS base wireless NoC architecture
- Figure 3.2 State machine diagram of AMS controller
- Figure 3.3 Schematic of power-gated LNA and PA with AMSC
- Figure 3.4 Format of data transmission over wireless link
- Figure 3.5 Hybrid switched inductor and switched capacitor voltage regulator for multi-voltage scaling
- Figure 3.6 Level shifter with AMSC
- Figure 3.7 Level shifter controlling mechanism
- Figure 3.8 Gain and noise figure analysis for pgLNA and npgLNA
- Figure 3.9 PMOS switches with block level LNA/PA
- Figure 3.10 S-parameter analysis for pgPA and npgPA
- Figure 3.11 S-parameter analysis across the process and temperature variations
- Figure 3.12 Shows normalized energy of 64 core system under uniform random traffic pattern with respect to baseline Mesh topology
- Figure 3.13 Shows (a) normalized delay and (b) normalized energy-delay product of 64 core system under uniform random traffic pattern with respect to baseline Mesh topology
- Figure 3.14 Packet energy saving in percentage with AMS over non-AMS architecture under application-specific and synthetic traffic
- Figure 3.15 Peak throughput for proposed and baseline architectures
- Figure 3.16 Peak latency for proposed and baseline architectures
- Figure 3.17 Power consumption of PA/LNA in different operating modes
- Figure 4.1 On-chip planar log-periodic antenna with dimensions in mm
- Figure 4.2 A hybrid hierarchical NoC architecture with subnets structure
- Figure 4.3 4×4 small world network topology with directional antennas
- Figure 4.4 Interference-aware wireless interfaces placement algorithm
- Figure 4.5 Data routing strategy
- Figure 4.6 Performance evaluation setup for DWNoC
- Figure 4.7 Return loss of the on-chip planar log-periodic antenna
- Figure 4.8 Radiation pattern along the azimuthal and elevation plane
- Figure 4.9 S(2,1) variation with distance
- Figure 4.10 S(2,1) variation with different antenna orientation
- Figure 4.11 Antennas setup with a different orientation
- Figure 4.12 S(2,1) plot for SIR calculation
- Figure 4.13 Wireless links placement in 64 core system
- Figure 4.14 Latency for various NoC architectures for 64 core system
- Figure 4.15 Peak bandwidth and packet energy dissipation of various NoC architectures for 64-core system
- Figure 4.16 Peak bandwidth and packet energy dissipation of various NoC architectures for 256-core system
- Figure 4.17 Bandwidth and packet energy of WNoCs with application-specific traffic for 64 core system
- Figure 4.18 Percentage of area overhead over total silicon area
- Figure 5.1 An example of 6×6 Mesh NoC topology with memory controller, processing elements and Off-chip memory
- Figure 5.2 Proposed router architecture
- Figure 5.3 The pipeline stages: (a) 5-stage baseline router for packet (b) 2-stages for circuit switching
- Figure 5.4 Intermediate steps represent for packet transfer
- Figure 5.5 PMC controls the inactive/active components
- Figure 5.6 Optimal memory controller placement using mean-shift approach
- Figure 5.7 Reduction in memory latency over baseline
- Figure 5.8 Improvement in peak throughput over baseline
- Figure 5.9 Improvement in application runtime over baseline
- Figure 5.10 Reduction in network energy over baseline

# List of Tables

- Table 2.1 Shows the seamless bypass routing strategy
- Table 2.2 Simulation Setup for Proposed NoC
- Table 2.3 SPLASH-2 and PARSEC Benchmark Characteristics for Expected Trends
- Table 2.4 Summary of proposed and energy-efficient NoC Architectures
- Table 2.5 Characteristics Table of Proposed NoC
- Table 3.1 Probabilistic Notations
- Table 3.2 Shows AMS Controller mechanism on routers
- Table 3.3 Shows AMS Controller mechanism on routers
- Table 3.4 Characteristics of Voltage Regulator
- Table 3.5 Representation of Topologies/Components
- Table 3.6 Simulation Setup
- Table 3.7 Router Implementation
- Table 3.8 Summary of Existing and Proposed Power/Energy-efficient NoC Architectures
- Table 4.1 Dimensions of the antenna
- Table 4.2 Percentage of busy and idle cycles in a 64-core system given default problem sizes
- Table 5.1 Shows algorithm for circuit switching setup
- Table 5.2 Intensity Classification of PARSEC Benchmark
- Table 5.3 Shows selection logic for hybrid switching
- Table 5.4 Simulation setup
- Table 5.5 Summary of Existing and Proposed Works

### Chapter 1

### Introduction

### 1.1 Motivation and Introduction

Advances in chip design indicate that there will be a manifold increase in the number of cores on a single die over the next few years. Chip Multi-Processors (CMPs) are gaining significant interest for a wide range of applications: consumer electronics, single-chip cloud computers, supercomputers, defense applications, etc. Intel has recently developed the Knights Landing processor with 36 tiles for high-performance computing [1]. A processor array containing 1000 independent processors has been fabricated in a 32-nm technology node [2]. Other companies such as Tilera, Nvidia, and Samsung are also developing multi-core systems. However, existing methods of integrating and designing multi-core chips do not scale to very large core counts due to several challenges. On-chip communication and power consumption are two such major challenges.

For more than four decades, Moore's Law and Dennard scaling have resulted in increasing transistor integration capacity at constant power density [3]. With feature sizes below 65 nm, this trend can no longer continue from generation to generation because of the exponential growth of leakage current and the limits on threshold and operating voltage scaling. Recent studies have highlighted interconnect power as a bottleneck in the total chip power consumption [4]. With CMPs becoming more communication-centric than computation-centric, interconnect power plays a key role in accommodating the expected growth in the number of cores. These facts emphasize the need for aggressive power-saving methods for interconnection networks in order to achieve energy-efficient CMPs without significant performance degradation.

These challenges can be addressed by introducing a power-aware Network-on-Chip (NoC). The NoC has emerged as a communication platform that enables partitioning of the design effort [5], and advances in NoC design have made it the preferred choice for the communication backbone of Chip Multi-Processors. However, traditional NoCs suffer from limitations arising from multi-hop communication over conventional planar metal interconnects, where data transfer across the NoC fabric causes high latency and power consumption. With a further increase in the number of cores on a chip, this problem will be significantly aggravated. Several attempts have been made to alleviate the limitations of regular NoC architectures through innovations in routing and interconnect deployment [6]. However, because the underlying interconnect technology remains the metal/dielectric combination, the performance improvements have been only incremental.

Different approaches have been explored to address the limitations of conventional NoCs, such as 3D NoCs, photonic NoCs, and NoC architectures with multi-band RF interconnects [7] [8] [9]. All these new technologies are capable of improving the speed and power dissipation of data transfer in a many-core chip. However, in 3D NoCs, manufacturing issues and temperature rise due to increased power density put a ceiling on the overall performance advantages. Photonic NoCs need an underlying electrical network to set up path and routing information, adding to the overhead. RF-interconnect-based NoCs require long on-chip transmission lines to be laid across the chip. In comparison, wireless transmission for on-chip data transfer over relatively long distances stands out as a revolutionary alternative which, besides improving the performance profiles of modern NoCs, also eliminates the need to lay out waveguides on the chip [10] [11] [12] [13]. Establishing long-range, single-hop communication links with wireless transceivers reduces the number of hops involved in data transmission between distant nodes on the chip, thus improving performance. Depending upon the operating frequency range, the data bandwidth of such links can be very high. Such high-bandwidth, low-latency, long-distance links enable the design of novel NoC architectures that were so far impossible to conceive due to the age-old limitations imposed by delays along long wires. Of these emerging interconnects, wireless links with mm-wave antennas offer the most feasible solution because of their compatibility with the CMOS manufacturing process. Coupled with advances in mm-wave transceiver design [14], on-chip wireless links offer the most promising solution to improve the performance of NoCs. Many Wireless NoC (WNoC) architectures [11] [12] [13] have been proposed with varying performance improvements and overheads. All these architectures augment a conventional wired topology with Wireless Interfaces (WIs), which are primarily used for long-range communication to reduce multi-hop communications. The advancements made in NoCs and on-chip WIs provide high-performance communication infrastructures to match the computation capabilities of processing elements in many-core systems. However, wireless transceivers have their own power overheads, which is another major concern in WNoC architectures.

Most of the wireless NoC architectures are based on omnidirectional antennas. In most cases, these WNoC architectures adopt a token passing based medium access mechanism to transmit data over the shared wireless channel. Since all antennas share a common channel, only a single communication is possible at any instant of time. This limits the performance benefits of using wireless links. One potential solution to achieve simultaneous communications is to have multiple non-overlapping wireless channels. But creating multiple transceivers tuned to non-overlapping channels is an extremely challenging job due to effects of interference. Interference reduces available bandwidth and degrades the bit error rate (BER). The cumulative impact of these leads to poor Quality-of-Service (QoS).
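The single-channel limitation can be illustrated with a minimal sketch of token-passing medium access. The function name, the round-robin token order, and the rule of one packet per token hold are assumptions made purely for illustration; they are not taken from any specific WNoC design.

```python
from collections import deque


def token_passing_schedule(pending, num_cycles):
    """Simulate a round-robin token shared by wireless interfaces (WIs).

    pending: dict mapping a WI id to the number of packets it wants to send.
    Returns a list of (cycle, wi_id) transmissions. Because only the token
    holder may transmit, there is at most one transmission per cycle.
    """
    ring = deque(sorted(pending))   # assumed token order: ascending WI id
    remaining = dict(pending)
    sent = []
    for cycle in range(num_cycles):
        holder = ring[0]            # WI that currently holds the token
        if remaining[holder] > 0:
            remaining[holder] -= 1  # assumed: one packet per token hold
            sent.append((cycle, holder))
        ring.rotate(-1)             # pass the token to the next WI
    return sent


schedule = token_passing_schedule({0: 2, 1: 0, 2: 1}, num_cycles=6)
```

In this sketch a WI with queued data must wait for the token to come around even when the channel is otherwise idle, which is the serialization that multiple concurrent channels aim to remove.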

To address these issues, we discuss multiple effective solutions for building an efficient communication infrastructure for CMPs. However, as technology scales, interconnect power and latency increase. This results in high power consumption in NoC/WNoC topologies at deep-submicron technologies and large system sizes. To realize energy-efficient high-performance computing, it is necessary to introduce aggressive power-saving methods for NoCs. The work proposed in this dissertation tackles power consumption in NoC elements to achieve energy-efficient on-chip interconnection in CMPs. The total power consumption in the NoC at different technology nodes and system sizes is shown in Figure 1.1. The network and core power values are obtained using the DSENT [15] and McPAT [16] tools. For all technologies, the total die area is kept constant, and the NoC power consumption for a given system size is computed. As can be observed, NoC power increases exponentially as technology nodes scale and system size increases, even though the supply voltage reduces.

Figure 1.1 Normalized NoC power consumption with processing cores at different technology nodes

Power consumption in network components remains a major issue in deep-submicron technology. As CMOS scaling progresses into smaller technology nodes, power has become the primary design constraint. An NoC, with many routers transferring data simultaneously, consumes high power and accounts for a significant portion of total chip power. For example, the NoC in the Intel Teraflop processor consumes 28% of the total tile power [17]. As system size increases, this is multiplied several fold by the significantly higher number of active routers. In addition, it has been shown that it is not possible to operate all system resources (cores, NoC, memory, etc.) simultaneously at full power all the time while meeting Thermal Design Power (TDP) constraints [18]. High power consumption also leads to higher system temperatures, affecting the reliability of the system. This makes it necessary to reduce the power consumption in NoC routers for an energy-efficient and reliable system while achieving the required network performance. Figure 1.2 shows the trend of router dynamic and leakage power across technology nodes from 45 nm to 22 nm. The results are obtained by simulating a regular router using DSENT [15] and breaking down the power consumption into dynamic and leakage components. Leakage power, unlike dynamic power, increases as technology scales and accounts for 59% of total power at 22 nm. This trend shows that both leakage and dynamic power consumption are major concerns in NoC architectures. These facts emphasize the need to introduce aggressive power-saving methods for NoC router components and to aim for energy-efficient and reliable interconnection in sustainable computing platforms.

Figure 1.2 Dynamic and leakage power of router at Vdd=1V

Similarly, power consumed in Wireless Interfaces (WIs) during idle phases of wireless communication is a key source of energy waste in WNoCs. The component-wise breakdown of WI power in Figure 1.3 [19] shows that WIs dissipate high power, most of which is consumed by the Low-Noise Amplifier (LNA) and the Power Amplifier (PA). Energy efficiency at routers as well as WIs has therefore become a major issue for NoC architectures. The power consumption in the baseline router and the idle-state power dissipation in WIs have to be reduced to achieve low-power on-chip communication while providing the desired network performance.

Figure 1.3 Component wise power consumption of transceiver at 65nm

Based on the utilization of routers, we propose a leakage power-aware NoC architecture that reduces power consumption using a power gating strategy. The methods of router utilization computation are discussed in Chapter 2. In the presence of power-gated routers, packets may be dropped, or packet latency may increase because data is held in virtual channels. Therefore, it is also essential to reduce packet drops and latency to avoid performance degradation. To maintain performance, we propose a seamless bypass routing strategy that routes packets by bypassing the power-gated router(s) and also reduces packet drops.

To achieve energy-efficient communication, we propose a switching power-aware WNoC using an Adaptive Multi-Voltage Scaling (AMS) control technique for wired and wireless NoC architectures. The AMS mechanism follows a hybrid approach: it implements voltage scaling in router components and power gating in WIs to save switching and idle-state power, respectively. It operates on the utilization of components during different application activity phases to adaptively vary the supply voltage and reduce energy consumption in NoCs. The utilization of router components is computed using the precomputed global utilization and stochastic router utilization models discussed in Section 2.2 (Chapter 2) and Section 3.2 (Chapter 3), respectively. The stochastic utilization model is derived dynamically by observing the routing characteristics of the application at each router. Stochastic modeling of router utilization allows AMS to adapt to different applications according to their performance requirements while maintaining optimal network power consumption.
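The benefit of voltage scaling for switching power follows from the standard first-order CMOS relation, stated here as background rather than as a result of this work. With activity factor $\alpha$, switched capacitance $C_L$, supply voltage $V_{dd}$, and clock frequency $f$:

$$
P_{\mathrm{dyn}} = \alpha \, C_L \, V_{dd}^{2} \, f ,
\qquad
\frac{P_{\mathrm{dyn}}(V_2)}{P_{\mathrm{dyn}}(V_1)} = \left( \frac{V_2}{V_1} \right)^{2} \quad \text{at fixed } \alpha,\ C_L,\ f .
$$

For example, lowering the supply of a lightly utilized router component from 1.0 V to 0.8 V (illustrative values, not the voltage levels used by AMS) would reduce its switching power to $0.8^2 = 0.64$ of the original, a 36% saving, provided that activity and frequency are unchanged.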

At the same time, we also evaluate the utilization of WIs to increase the energy efficiency of the WNoC. WI utilization is tracked by checking the availability of the token and of data to be transmitted at each WI, and WIs are power gated when they are not actively engaged in communication. Because the token is transmitted wirelessly, it can be missed when the receiver end is inactive. Waking up periodically to receive the token is not a feasible solution due to the excessive transient energy overhead. This is one of the major challenges in power-gated WIs. To avoid this, we propose a receiver-end wake-up control strategy that uses an address signature along with data packets. This strategy processes the signature without interrupting the inactive components. For this, the unique signature of the desired receiver WI is inserted between dummy bits and the actual packet data to be transmitted. We add a certain number of dummy bits to the original packet to avoid data loss during signature matching and wake-up latency. We ensure that the extra dummy bits added to the original data introduce minimal overhead. At the receiving WI, the received bits are decoded for this WI's address using a pattern matching decoder. If the address matches, the remaining actual data is received. If there is a mismatch, packet reception ceases and the LNA is put back into the sleep state. The pattern matching decoder operates at the serial data stream frequency and starts decoding the WI address after the LNA is active.
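The framing idea (dummy bits, then the receiver signature, then the payload) can be sketched as follows. This is only an illustration under stated assumptions: bit strings stand in for the serial stream, dummy bits are all zeros, every signature has the same length and starts with a 1, and the bits that arrive during the wake-up latency are simply lost. The function names and these encoding choices are hypothetical and are not the decoder design developed in Chapter 3.

```python
def build_frame(signature: str, payload: str, num_dummy: int) -> str:
    """Transmitter side: dummy bits, then receiver signature, then data."""
    if not signature.startswith("1"):
        # Assumption of this sketch: a leading 1 marks the end of the dummies.
        raise ValueError("sketch assumes signatures start with '1'")
    return "0" * num_dummy + signature + payload


def receive(frame: str, my_signature: str, wake_latency_bits: int):
    """Receiver side: return the payload on a signature match, else None."""
    # Bits arriving while the LNA is still waking up are lost.
    visible = frame[wake_latency_bits:]
    start = visible.find("1")        # first non-dummy bit starts the signature
    if start < 0:
        return None                  # nothing but dummy bits was seen
    end = start + len(my_signature)
    if visible[start:end] != my_signature:
        return None                  # mismatch: stop receiving, LNA sleeps again
    return visible[end:]             # match: the rest is the actual packet data


frame = build_frame(signature="1011", payload="110010", num_dummy=4)
```

As long as the wake-up latency is no longer than the dummy prefix, the signature and payload survive intact. If the latency exceeds the dummy prefix, part of the signature is lost and the match is no longer reliable, which is why the number of dummy bits has to cover the wake-up time.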

To improve performance and avoid the limitations of existing single-channel WNoCs, we also propose a Directional Wireless NoC (DWNoC) architecture using Planar Log-Periodic Antennas (PLPAs) in this dissertation. In this topology, we propose an interference-aware wireless node placement algorithm to improve performance. To achieve that, we define the conditions that can result in wireless interference and find a placement that avoids all of them. This enables us to create multiple concurrent wireless links, eventually improving performance and energy efficiency. We explore PLPAs for the on-chip wireless links and demonstrate the use of their directivity to enable pair-wise communications. The DWNoC architecture is designed around three requirements: interference avoidance, low power dissipation, and performance enhancement through concurrent links.

In addition, communication between the Last Level Cache (LLC) and memory controllers faces significant challenges due to the placement of memory controllers, high network latency, and the switching strategy. Although advancements such as multichannel memory controllers and High Bandwidth Memory (HBM) improve memory bandwidth, the limited number of pins restricts the total number of memory controllers. With few memory controllers servicing requests from a large set of cores, the placement and interconnection of these memory controllers within the NoC play a significant role in providing efficient off-chip memory access. Several works in the past have demonstrated the capabilities of NoCs to provide high-performance communication in CMPs. Intel's 80-core design [17] arranges cores in a 10x8 2D mesh network with 5-port routers and provides a bandwidth of 80 Gbps per tile. Tile64 from Tilera [20] uses an 8x8 2D mesh and provides a bisection bandwidth of 2.56 Tbps. Both these architectures use memory controllers placed at the top and bottom of the network. With fewer memory controllers, better NoC design and placement can have a huge impact on system performance. Arbitration policies between memory requests arriving at each memory controller have been proposed in [21] [22] to provide better interconnection support for improving memory performance. A study of memory controller placement and routing between cores and memory [23] has shown that diamond placement with 16 memory controllers provides optimal performance in an 8x8 mesh-based CMP. A generic placement method with a divide-and-conquer approach [24] analyzes different memory controller placements based on a hop-count metric. NoC hardware support has also been explored to provide high-bandwidth off-chip memory access. Another possible solution for high-bandwidth memory access is to establish dedicated paths between processing elements and memory controllers. However, this is not scalable with increasing system sizes.

To improve the performance of memory accesses in CMPs, we introduce an adaptive hybrid switching strategy, which combines circuit switching and packet switching, with dual crossbar routers. Since studies show that optimal placement of memory controllers can significantly reduce the communication mismatch, we place the memory controllers so as to enhance memory bandwidth. To reduce the energy overhead of dual crossbar routers, we present partially drowsy and power gating techniques in the proposed architecture.

### 1.2 Contributions

The overall contributions of this thesis are illustrated in Figure 1.4. The figure highlights the target applications, the WNoC-based CMP, and the associated challenges. The hybrid NoC architecture consists of a large number of tiles. Each tile contains a router, a core, caches (L1, L2, and L3), a network interface controller, and links (hops). There are two types of routers: base routers (BRs) and hybrid routers (HRs). Each tile is connected to neighboring routers through links. Packets are transmitted from one core to another through links and routers. For multi-hop communications, a packet takes the wireless links depending on the availability of the wireless channel and the token-passing Medium Access Control (MAC) protocol. The main challenges associated with this interconnect technology are performance dependency, power consumption due to NoC components, and the evaluation framework for WNoC. Our main contributions in this thesis are indicated by shaded boxes and are discussed one by one in the following sections.



Figure 1.4 Overview of energy-efficient WNoC highlighting all the Contributions

The major contributions of this work are as follows:

- I. A low-power router architecture using a partially power-gated router coupled with drowsy virtual channels (VCs) is proposed to reduce leakage power based on router utilization, as discussed in Chapter 2. To reduce the performance impact of the proposed power-saving methods, we introduce a low-overhead, deadlock-free seamless bypass routing strategy. We evaluate the power and area overheads along with the impacts of power gating. We also present a detailed comparative analysis of the proposed techniques against the state of the art.
- II. We propose a low-overhead adaptive two-step hybrid estimation method that uses a stochastic process to estimate router utilization from its past usage pattern. Based on the utilization of routers, we propose an energy-efficient NoC architecture using the Adaptive Multi-Voltage Scaling (AMS) mechanism to achieve significant energy savings by scaling the voltage of router components, as discussed in Chapter 3. To implement the AMS scheme in the NoC, we employ a multi-level voltage shifter that allows switching between two voltage levels from a given fixed set of voltage levels. We also integrate an on-chip voltage regulator and level shifter with the routers. We also use power-gating-enabled WIs with AMS to reduce the idle-state power dissipation and, in the process, significantly improve the power efficiency of the NoC infrastructure.

- III. For packets transmitted over wireless links, a receiver-end control strategy is proposed to wake up the desired receiver based on the token, as discussed in Chapter 3. This scheme uses a unique address signature along with data packets. The inactive components remain inactive until the signature is matched by the pattern matching decoder at the desired WI. This enables an effective power gating strategy for WIs, as it eliminates periodic waking up of the complete receiver chain. The signature is inserted between dummy bits and the data packet. The dummy bits are added to the packet to avoid data loss while components are in inactive mode. The proposed technique also significantly reduces routing overhead and the need for control signals.
- IV. A DWNoC architecture using PLPAs is explored for multichannel simultaneous communications without interference, as discussed in Chapter 4. Simultaneous communication can enhance the overall system performance significantly. We propose an interference-aware WI placement algorithm and routing strategy for minimizing interference effects. This algorithm also helps in the optimal utilization of the WIs and increases the energy efficiency of the network compared to the WNoC setup with omnidirectional antennas.
- V. To enhance the performance of the system in terms of memory access latency, we propose an energy-efficient adaptive hybrid switching strategy with dual crossbar routers that allow simultaneous use of both circuit-switched and packet-switched paths, as discussed in Chapter 5. A method to find the optimal number and placement of memory controllers in the network using a machine learning approach is also introduced to increase the memory access bandwidth. A routing protocol for seamless communication between last level caches and memory controllers using the adaptive hybrid switching strategy is also proposed.
- VI. We developed a performance evaluation framework for WNoC using system and network simulators. A router-utilization-based metric is introduced and applied to enable the design and evaluation of an energy-efficient architecture by implementing AMS with power gating. The proposed framework is implemented on top of an existing cycle-accurate open-source network simulator and system simulator. The framework is used for experimental evaluations under various application-specific and synthetic traffic scenarios to validate our proposed works.

### 1.3 Dissertation outline

The remainder of this dissertation is divided into five chapters. In this section, we briefly discuss the contents of each chapter.

Chapter 2: We implement a leakage power-aware NoC architecture with a power management controller that applies fine-grained power gating to NoC routers based on their utilization. The utilization computation method for routers is discussed in this chapter. The design of router bypass links and control circuitry is presented to transfer data through power-gated routers. A deadlock-free seamless bypass routing scheme is presented that makes use of bypass links to minimize the adverse effects of power gating and maintain performance. A walk-through example describes the various steps involved in power gating and routing data in the NoC under various possible scenarios. We also present a detailed analysis of the power and area overheads of the proposed architecture, the impacts of power gating, and a comparison of the proposed design with existing low-power techniques.

Chapter 3: First, a detailed description of dynamic runtime utilization computation for estimating router utilization with low runtime overhead is presented. A stochastic model is developed for accurate prediction of router utilization from its past utilization characteristics during the dynamic runtime utilization computation phase. The supply voltage of router elements is varied dynamically based on the predicted utilization. The AMS control mechanism, implemented to reduce switching power in router elements and idle-state power in WI components, is described in this chapter. Similarly, we power gate the WI components when they are not actively involved in data transmission to reduce idle-state power consumption. We present the details of the AMS controller, along with the multi-level voltage shifter and voltage regulator, which dynamically change the supply voltage of router components and WIs based on their utilization to save network energy. A control mechanism for packets transferred over wireless links is presented to avoid data loss and maintain signal fidelity in the presence of power-gated WIs. A new approach to transmitting the medium access token over the wireless link is proposed to minimize control signal latency. The impact of PVT variations in the PA and LNA is analyzed in detail in the results section.

Chapter 4: We demonstrate directional-antenna-based WIs for an interference-aware DWNoC architecture for many-core systems to overcome the bottleneck of the existing WNoC setup with omnidirectional antennas. It is also shown that DWNoC can establish concurrent links that do not interfere with each other. Simultaneous communications can enhance the overall system performance significantly. We also propose an interference-aware WI placement algorithm and routing strategy for minimizing interference effects. This algorithm also helps in the optimal utilization of the WIs in the network with minimum overheads.

Chapter 5: In this chapter, we highlight the on-chip communication bottlenecks between Last Level Caches (LLCs) and Memory Controllers (MCs) in accessing off-chip memory. To overcome them, we discuss a hybrid switching strategy with dual crossbar routers that allow simultaneous use of both circuit-switched and packet-switched paths. We also seek the optimal number and placement of memory controllers in the network using a machine learning approach. We further improve upon this by using power-efficient drowsy virtual channels and power gating techniques to achieve energy-efficient off-chip memory access. A routing protocol is introduced for seamless communication between LLCs and MCs using the adaptive hybrid switching strategy.

Chapter 6: In the last chapter of this dissertation, we conclude the present research work with a detailed summary of outcomes and future research directions in the domain of efficient communication infrastructure.

### Chapter 2

## Leakage Power-aware NoC

In this chapter, we implement a leakage power-aware NoC architecture with a power management controller that applies fine-grained power gating to NoC routers based on their utilization. The utilization computation method for routers is discussed in this chapter. The design of router bypass links and control circuitry is presented to transfer data through power-gated routers. A deadlock-free seamless bypass routing scheme is presented that makes use of bypass links to minimize the adverse effects of power gating and maintain performance. A walk-through example describes the various steps involved in power gating and routing data in the NoC under various possible scenarios. We also present a detailed analysis of the power and area overheads of the proposed architecture, the impacts of power gating, and a comparison of the proposed design with existing low-power techniques.

We tackle leakage power consumption in NoCs and propose ways to reduce it without impacting performance. Analysis of router power consumption shows that leakage power accounts for a significant portion of total NoC power. We also observe that the contribution of leakage power increases as technology scales, specifically in deep-submicron nodes. We reduce leakage power consumption using power gating in the base router. Power gating in the NoC architecture is driven by router utilization data. Router utilization is estimated in two steps: i) pre-computed global utilization and ii) runtime utilization. In this chapter, we implement a cost-effective runtime utilization estimation method to save leakage power. In the first step, utilization is pre-computed from knowledge of the hardware and the application characteristics. This pre-computation provides a coarse utilization estimate and helps us identify routers at the extreme ends of utilization, i.e., very highly used or rarely used. Runtime utilization estimation is then performed on the remaining routers and provides a fine-grained estimate that captures router idle phases during application execution. The two-step hybrid approach helps us achieve accurate profiling of router utilization characteristics and maximize energy savings while reducing the estimation overheads.
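The two-step idea can be sketched in a few lines. The 10% and 75% category thresholds are the ones quoted in Section 2.2.2. Everything else is an assumption made for illustration: the function names, the runtime threshold, and the simplification that rarely used routers are gated and highly used routers are kept active without runtime monitoring.

```python
RU_MAX = 0.10  # global utilization below this: Rarely-Utilized (RU)
HU_MIN = 0.75  # global utilization above this: Highly-Utilized (HU)


def classify(global_util: float) -> str:
    """Step 1: coarse category from pre-computed global utilization."""
    if global_util < RU_MAX:
        return "RU"
    if global_util > HU_MIN:
        return "HU"
    return "MU"


def plan_power_gating(global_utils, runtime_utils, runtime_threshold=0.10):
    """Return a 'gate' or 'active' decision per router.

    runtime_threshold is a hypothetical knob; runtime utilization is only
    looked up for Moderately-Utilized (MU) routers.
    """
    decisions = {}
    for router, util in global_utils.items():
        category = classify(util)
        if category == "RU":
            decisions[router] = "gate"      # assumed static decision
        elif category == "HU":
            decisions[router] = "active"    # assumed static decision
        else:
            # Step 2: only MU routers pay the runtime estimation overhead.
            idle = runtime_utils[router] < runtime_threshold
            decisions[router] = "gate" if idle else "active"
    return decisions


plan = plan_power_gating(
    global_utils={"r0": 0.05, "r1": 0.90, "r2": 0.40, "r3": 0.50},
    runtime_utils={"r2": 0.02, "r3": 0.30},
)
```

Only `r2` and `r3` need a runtime estimate here, which is the sense in which the pre-computed step limits the number of routers that are controlled dynamically.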

In this work, we propose a leakage power- and performance-aware NoC architecture using power gating to achieve energy-efficient communication in many-core systems. Of all router components, the crossbar fabric and virtual channels have comparatively high leakage power consumption [18]. Hence, the proposed method partially power gates the router to turn off these components, which account for most of the leakage power in the NoC. Power gating techniques, in general, incur a performance penalty due to the wake-up latency of bringing a circuit back to the active state. In addition, they can incur performance or energy penalties due to 1) short sleep periods within a phase of high cumulative utilization; 2) wake-ups caused by burst traffic during light utilization; 3) blockage of paths; and 4) isolation of routers. To overcome these effects, we propose a Seamless Bypass Routing (SBR) strategy. SBR uses bypass links to route data through power-gated routers effectively. It reduces the impacts of wake-up latency, congestion, blocking, and isolation due to power-gated routers and thereby minimizes the impact of power gating on communication performance. Power gating allows static reconfiguration of the NoC according to the application requirements and reduces power consumption. By considering runtime utilization locally at each router, a further reduction in router leakage power consumption is achieved.

### 2.1 Related work

Communication performance over long distances and power consumption have been major challenges for many-core NoC architectures. WNoC architectures such as those proposed in [11] [12] [13] [14] [25] [26] improve performance over traditional wired-only NoCs by inserting low-latency, low-energy wireless links. With power becoming a major design constraint in ultra deep-submicron (DSM) technologies, several low-power design techniques such as dynamic scaling, multi-$V_t$ cells, and clock/power gating [27] have been proposed to reduce power consumption in CMOS ICs. Recently, several works have explored implementing such methods to reduce power consumption in links and routers and achieve energy-efficient network infrastructures.

Power gating schemes, which cut off the power supply to a circuit when it is not active, have been widely used for leakage power reduction and have also been employed in NoCs [28] [29] [30]. Panthre [31], a power-aware routing and topology reconfiguration method, provides long uninterrupted intervals of sleep to selected units using power gating. Panthre considers link utilization to implement power gating: if the utilization of a datapath is below a predefined threshold, it is put into the sleep state. An ultra fine-grained runtime power gating method for on-chip routers is proposed in [32]; it reduces leakage power using a look-ahead technique, but with a performance penalty. A fine-grained power-gated FlexiBuffer that reduces buffer leakage power is proposed in [33] and can operate with minimal changes to flow control. NoRD (Node-Router Decoupling) [34] is a power gating bypass technique that decouples the node's ability to transfer packets by monitoring the status of the associated router. Traffic-based virtual channel activation is explored in [35] for low-power applications. Power Punch, a performance-aware technique, has been proposed to achieve non-blocking power gating of on-chip routers [36]. In that work, a power control signal is sent ahead of the packet to punch through any blocked routers along the imminent path of the packet. An energy-efficient virtual channel power gating mechanism for on-chip networks is discussed in [37] to save leakage power. All these schemes offer leakage power savings using efficient algorithms based on link, virtual channel, or application utilization. Many of these algorithms either depend on predetermined router/link utilization estimates or perform utilization estimation entirely at runtime. Predetermined utilization estimates can be obtained with reduced overheads but offer minimal flexibility, which can adversely affect performance. On the other hand, fully runtime estimation methods offer dynamic control options but incur significant overheads.

To maximize leakage power savings in NoCs with minimal overheads, we propose a hybrid utilization estimation approach for NoCs in this work. The two-step utilization computation strikes a balance between the rigidity of predetermined methods and the overheads of runtime methods. The pre-computed estimate reduces the number of routers controlled dynamically at runtime, limiting the estimation overheads. The runtime utilization estimate, on the other hand, captures the detailed activity characteristics during different phases of application execution to maximize energy savings. In addition, the proposed NoC promises to significantly increase the power efficiency of on-chip wireless communication along with regular router components. A flow control technique that allows packets to bypass router pipeline stages is proposed in [38]. ShortPath is augmented with a fine-grained pipeline bypassing mechanism to skip all the stages without contention for high-speed applications [39]. The authors of [40] combine two techniques, adaptive channel buffers and router pipeline bypassing, to simultaneously reduce power consumption and improve performance. A traffic-based evaluation of a bypassing technique that avoids the full routing functionality of selected nodes is discussed in [41]. Single-cycle Multi-Hop Asynchronous Repeated Traversal (SMART) is proposed to dynamically set up single-cycle paths with turns from source to destination [42]. The seamless bypass routing strategy proposed in this work, together with power gating control, minimizes the performance impacts and adverse effects of power-gated routers in the network. We present a detailed discussion of the NoC architecture and its performance evaluation with WNoC. We explore the performance overhead and associated trade-offs for realizing the proposed framework.

### 2.2 Leakage Power-Aware NoC

A generic NoC architecture consists of routers attached to message source/sink components, with all routers interconnected by wires in a specific topology. WNoC architectures use strategically and optimally placed WIs at some routers to provide long-distance wireless communication and improve NoC performance. A WNoC topology with Base Routers (BRs) and Hybrid Routers (HRs) (BR + WI) is explored in [43] [44]. In this section, we discuss the hybrid two-level utilization-estimation-based power gating strategy and the router design implemented in the proposed architecture. We describe the power management controller that decides router power gating operation and the router modifications necessary to reduce NoC leakage power consumption without affecting performance significantly.

#### 2.2.1 NoC Router Design

The architecture of the power-gated router design implemented in the NoC is shown in Figure 2.1. The BR components of the NoC router are typical of any NoC router design and provide similar functionality. The router design consists of the following major components: i) BR components, which include wired I/O ports, buffers and virtual channels, crossbar fabric, route computation unit, switch allocation and arbiter; ii) WI components; iii) Utilization Computation Unit (UCU); iv) Power Management Controller (PMC); and v) bypass links and multiplexing circuit. The HR shown in the figure contains WI components and control signals along with all BR components. WI components include a serializer, modulator, and Power Amplifier (PA) at the transmitter and a Low Noise Amplifier (LNA), demodulator, and deserializer at the receiver. A Voltage Controlled Oscillator (VCO) generates the required carrier signal for both transmitter and receiver. These components provide the necessary hardware to read packet data from the BR and convert it to analog information that can be transmitted over the wireless channel. The UCU observes the activity of the router and computes runtime router utilization. The PMC reads the necessary information about the status of the router and decides whether to power gate it. The bypass links with multiplexing circuitry implement the SBR strategy to reduce the adverse impacts of power gating and avoid performance penalties in the NoC.

Figure 2.1 Implementation of power gated hybrid router design for proposed NoC

#### 2.2.2 Global Utilization Computation

The hybrid approach operates using two levels of router utilization. The coarse-grain pre-computed utilization is observed over the entire application period and helps us categorize routers as Highly-Utilized (HU), Moderately-Utilized (MU) and Rarely-Utilized (RU). Each router is placed into one of these three categories based on its pre-computed global utilization. The thresholds that determine which category a router belongs to depend on the application being run on the many-core system. The thresholds are set as follows: RU routers have utilization below 10%, MU routers have utilization between 10% and 75%, and HU routers have utilization above 75%. As the names suggest, routers classified as RU are rarely used over the entire application time, whereas HU routers are active for most of the application time. MU routers fall in between these two categories. Figure 2.2 shows the number of routers falling into each category for different PARSEC [45] and SPLASH-2 [46] benchmarks. It can be seen that less than 30% of routers are highly active across all benchmarks simulated and on average 48% of routers are moderately utilized. This allows us to save power in routers with low utilization by reducing their voltage without impacting performance significantly. Router utilization may vary from application to application. We define thresholds for router categorization based on pre-computed router utilization, as shown in Figure 2.2, to reduce the hardware overheads. Thresholds are estimated by running the *PARSEC* and *SPLASH-2* benchmark suites. In our simulation setup, we run the applications individually to estimate the thresholds. These utilization thresholds may change with other benchmarks or with multiple simultaneous applications. For such cases, the thresholds and the number of routers falling into each category vary according to the characteristics of those applications. However, the NoC architecture can be reconfigured by setting new threshold values without any other changes.

Figure 2.2 Categorization of routers based on pre-computed global utilization under real applications for a 64-core system

For estimating the router utilization at the global level in our experiments, a counter is associated with each router and is incremented whenever a routing decision is made at that router. All counters are reset at the start of simulation. A total count is computed as the sum of the individual counts of all routers. The utilization of any router is given by the ratio of that router's count to the total count. Therefore, at any cycle, the utilization of the desired router can be observed. The utilization of router $i$, $Ur_i$, can be determined as

$$Ur_{i}\Big|_{t} = \left(\frac{Rc_{i}}{\sum_{j=1}^{n} Rc_{j}} \times 100\%\right)\Bigg|_{t},$$

where $Ur_i$ is the utilization of the $i^{th}$ router, $Rc_i$ is the register count of the $i^{th}$ router, $n$ is the total number of routers and $t$ is the simulation cycle at which the function call is made. Given the dynamic nature of the metric, time ($t$) is only used for reference purposes. In the following subsection, we discuss the runtime utilization estimation method for MU routers.
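The counter-based estimate and the three-way categorization can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names `global_utilization` and `categorize` and the sample counter values are hypothetical, and the 10% and 75% thresholds are the ones quoted in this section.

```python
from typing import List


def global_utilization(route_counts: List[int]) -> List[float]:
    """Percentage share of routing decisions per router (Rc_i over the sum of all Rc)."""
    total = sum(route_counts)
    if total == 0:
        # No routing decision has been made yet, so every share is zero.
        return [0.0] * len(route_counts)
    return [100.0 * count / total for count in route_counts]


def categorize(util_percent: float, ru_max: float = 10.0, hu_min: float = 75.0) -> str:
    """Map a utilization percentage to RU, MU or HU using the quoted thresholds."""
    if util_percent < ru_max:
        return "RU"
    if util_percent > hu_min:
        return "HU"
    return "MU"


# Hypothetical counter values for a four-router toy network.
counts = [5, 80, 10, 5]
util = global_utilization(counts)
labels = [categorize(u) for u in util]
print(util, labels)
```

Keeping the thresholds as parameters mirrors the statement that they are application dependent and can be reconfigured.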

#### 2.2.3 Runtime Utilization Estimation

To achieve maximum leakage energy saving with minimal runtime overheads, proposed NoC architecture estimates router utilization in two steps. During the precomputation step, UCU estimates the utilization of a router for the entire application period. We implement this using proposed NoC, where each router can be power gated individually using a power management controller. In the proposed method, routers under RU are power gated throughout the application. Routers under HU are never power gated. Finally, MU routers are power gated dynamically based on run-time utilization estimate. This reduces the runtime computation overheads of power gating operation in proposed NoC.

To achieve better power efficiency in MU routers, the proposed NoC exploits the runtime utilization of a router during different phases of application execution. The runtime utilization estimate of MU routers is computed on an epoch-to-epoch basis at the start of each epoch. An epoch signifies a fixed or variable duration phase in the total simulation period. An epoch's duration can remain fixed throughout the application or vary from epoch to epoch, and is chosen to suit application needs. For our simulations, we use a fixed duration of 1K cycles for each epoch. The value is chosen empirically for the set of benchmarks used in our simulations. Longer epoch durations do not provide the granularity needed to capture different application phases, while shorter durations increase the runtime overheads of control operations. The utilization estimate is computed as the number of cycles for which the router is active within an epoch period. The UCU in each router computes the runtime utilization of all neighboring routers that are one hop away from it. At the start of an epoch, it processes the packet headers waiting in its input VCs through the Header Decoder (HD) to determine the number of packets that will be routed to each output port. If the number of packets to be transferred to an output port is less than a predetermined global threshold, the UCU marks the downstream router connected to that port as underutilized for the next epoch. Once the UCU completes its operation, the utilization estimate of each neighboring router is transferred to that router.
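The per-epoch step performed by the UCU can be sketched as follows. This is a minimal sketch, assuming the Header Decoder has already reduced each waiting header to its requested output port; the function name `pilot_out_bits`, the port labels and the threshold value are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

PORTS = ("N", "E", "S", "W")  # the four mesh neighbours of a router


def pilot_out_bits(header_ports: Iterable[str], threshold: int) -> Dict[str, int]:
    """Return one bit per neighbour: 1 means 'expected to be underutilized next epoch'.

    header_ports holds the output port requested by each packet header that is
    waiting in the input VCs at the start of the epoch.
    """
    demand = Counter(header_ports)
    return {port: int(demand[port] < threshold) for port in PORTS}


# Hypothetical snapshot: three packets head East, one heads North.
bits = pilot_out_bits(["E", "E", "E", "N"], threshold=2)
print(bits)
```

Each bit would then be sent to the corresponding neighbour, where it is combined with the bits from that neighbour's other adjacent routers.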

#### 2.2.4 Power Management Controller (PMC)

Power Management Controller (PMC) of NoC controls the power gating operation of router components and generates the necessary power gating control signal such as  $PGS\_BR$  for base router components as presented in Figure 2.3. An active high PGS signal from PMC indicates that corresponding components need to be power gated and vice-versa. At the start of each epoch, PMC receives its runtime utilization estimate from all neighboring routers as 1-bit  $Pilot\_in$  signals. It also transmits the results of UCU as  $Pilot\_out$  signals; one 1-bit  $Pilot\_out$  for each neighboring router. An active high pilot signal indicates the utilization is low. In a simple mesh, PMC sends four  $Pilot\_out$  signals and receives four  $Pilot\_in$  signals from neighboring routers.

Figure 2.3 Block level diagram of the power management controller with control signals

In BR, VCs and crossbar consume a significant portion of the leakage power and hence only these components are power gated, i.e., routers in the NoC are partially power gated. PMC utilizes pre-computed and runtime router utilization estimates to generate the PGS\_BR signal and control the power gating operation. Routers falling under the HU and RU categories are never and always power gated, respectively, throughout the application period. PMC in these routers sets PGS\_BR to active low for HU and active high for RU for the entire application period and disables all runtime control operations by UCU. Power gating in MU routers is enforced at the start of an epoch based on runtime utilization estimates. At the start of each epoch, PMC receives utilization estimates from all neighboring (one-hop distance) routers as *Pilot\_in* signals and decides whether to enforce power gating. If more than half of the *Pilot\_in* signals indicate that the router is going to be underutilized, PMC power gates the BR components. The VCs and crossbar fabric (CF) of the router remain in sleep mode for the entire epoch duration until a new decision is made at the start of the next epoch. To communicate the power gating status of the router, a one-bit Router Status Signal (RSS) is transmitted by PMC to all neighboring routers. An active high RSS indicates that the router is power gated. This is utilized in the Seamless Bypass Routing (SBR) strategy by neighboring routers to effectively route packets through power gated routers. Similar to the *Pilot\_in* signals received from neighboring routers that indicate the router's utilization estimate, PMC transmits *Pilot\_out* signals to all neighboring routers indicating their runtime utilization estimate as computed by UCU.

While UCU is computing utilization of neighboring routers, the router continues its regular operation. UCU and its operation lie outside the input to output critical path of the router and do not add additional delay to packet transfer. When PMC makes the decision to power gate the router, an active high *PGS\_BR* signal is sent to router components, and an active high *RSS* signal is sent to neighboring routers. Once the router is power gated, bypass links are enabled to transfer packets through the power gated router. When a power gated router is being brought to the active state, PMC wakes up the router components and sends active low RSS to neighboring routers, and the neighboring routers can start routing packets normally.
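The decision rule of the PMC described in this subsection can be summarized in a few lines of Python. This is an illustrative sketch rather than the RTL of the controller; the function name `pgs_br` and the string labels for the categories are hypothetical.

```python
from typing import Sequence


def pgs_br(category: str, pilot_in: Sequence[int]) -> int:
    """Return 1 (active high) to power gate the VCs and crossbar, otherwise 0.

    category is the pre-computed class of the router ("HU", "MU" or "RU");
    pilot_in holds the 1-bit estimates received from the neighbouring routers.
    """
    if category == "HU":
        return 0  # never power gated
    if category == "RU":
        return 1  # power gated for the whole application
    # MU: gate only if more than half of the neighbours report low utilization.
    return int(2 * sum(pilot_in) > len(pilot_in))


# The RSS sent to the neighbours carries the same polarity as PGS_BR.
rss = pgs_br("MU", [1, 1, 1, 0])
print(rss)
```

With four neighbours in a mesh, a tie of two active pilot signals does not trigger power gating under the "more than half" rule.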

#### 2.2.5 Routing Strategy

The hybrid two-level utilization based approach of NoC reduces leakage power consumption in NoC significantly by selectively power gating the routers. But, turning off some routers in the network can lead to some adverse effects like performance degradation due to wake up latency, congestion, router blocking, deadlock and isolation. To overcome these challenges and reduce the impact of power gating on performance, we propose Seamless Bypass Routing (SBR) strategy that uses bypass links and additional router modifications to effectively route data through partially power gated routers, where only VCs and crossbar fabric are turned off.

##### 2.2.5.1 Bypass Links

To implement the SBR strategy, we make two changes to the NoC router architecture compared to a regular router design: bypass links that allow data to be transferred through a power gated router, and a power gating table with the Routing Computation (RC) unit that stores information about the power gating status of the four neighboring routers, as shown in Figure 2.4. The two additional bypass links in the NoC router connect the North & South ports and the East & West ports, bypassing the crossbar fabric as shown in the figure. These links provide a means to transfer data through a power gated router and reduce stalling and re-routing of packets at the neighboring routers. At each input and output port, a demultiplexer and a multiplexer are added respectively, which control the flow of data through either the crossbar fabric or the bypass links. The demultiplexer at the input port is controlled by two signals, the power gating signal from PMC and the Local Data (*LD*) signal from the RC unit. *LD* indicates that the destination of the data at the input is the local node. When the router is not power gated, data from the port is written to the buffers and transferred through the crossbar as in a regular router. When it is power gated, data is routed either to the local port or to the bypass link according to the *LD* signal. At each output port, a multiplexer is added, which is again controlled by the power gating and *LD* signals. Under regular router operation, data is taken from the crossbar fabric. When the router is power gated, data is taken either from the bypass link or from the local port as per *LD*. To avoid deadlock, one buffer slot is reserved for the RC computation of each VC.

Figure 2.4 Router bypass links and control circuitry

A power gating status table is maintained with RC, which stores information about the status of neighboring routers as shown in Figure 2.4. The entries in the table are updated based on RSS received at the start of epoch. An entry '1' against any port indicates that the router connected to that port is power gated and is used while determining the output port for a packet. The use of bypass links when router is power gated, and SBR strategy to minimize the adverse effects of power gating are explained in subsequent section.
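The select logic of the input demultiplexer and the fixed pairing of the bypass links can be written down compactly. The sketch below is only a behavioural illustration; the names `input_demux`, `bypass_exit_port` and the returned path labels are hypothetical.

```python
# Each bypass link joins opposite ports: North with South, East with West.
BYPASS_EXIT = {"N": "S", "S": "N", "E": "W", "W": "E"}


def input_demux(pgs_br: int, ld: int) -> str:
    """Choose where a flit arriving at an input port is steered.

    pgs_br is the power gating signal from the PMC (1 = router is power gated);
    ld is the Local Data signal from RC (1 = destination is the local node).
    """
    if not pgs_br:
        return "buffer"  # regular operation: write to the VC buffers, then crossbar
    return "local" if ld else "bypass"


def bypass_exit_port(in_port: str) -> str:
    """Output port reached through the bypass link from the given input port."""
    return BYPASS_EXIT[in_port]


print(input_demux(1, 0), bypass_exit_port("S"))
```

Because a bypass link only joins opposite ports, a flit on a bypass link cannot change between the X and Y directions inside a power gated router.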

##### 2.2.5.2 Seamless Bypass Routing

Performance degradation in NoCs with power gated routers arises from re-routing or stalling of packets due to the presence of one or more power gated routers along the downstream path of a packet. The degradation can take the form of increased packet latency or of congestion and contention on non-power gated paths. Another performance impact is the wake-up latency incurred while bringing components back to the active state. To reduce the impact of wake-up latency, the arbiter, switch allocator, RC, and virtual channel allocator are always kept ON, as these router components consume very little leakage power. When a router comes into the active state from the sleep state, packets arriving at its input ports can be processed through these stages while the VCs and Crossbar Fabric (CF) become active. To address the other effects, we propose SBR, which is an extended version of the XY routing algorithm [47]. In regular XY routing, the packet is routed first along the X-direction and then along the Y-direction from source to destination. A typical router pipeline requires five stages: Buffer Write (BW), Routing Computation (RC), Virtual channel Allocator and Switch Allocator (VA and SA), Switch Traversal (ST) and Link Traversal (LT), as shown in Figure 2.5(a). Data and tail flits go through the ST and LT stages. If the next router in the path is power gated, the packet is stalled at the current router, which increases packet latency.

Figure 2.5 Pipeline stages of (a) regular XY routing and (b) seamless bypass routing

Table 2.1 Seamless bypass routing strategy

```
Algorithm: Seamless Bypass Routing
At each epoch start:
      - For ports N, E, W, S:
            • if RSS == 1
                  Set Hop_Limit to 2
              else
                  Set Hop_Limit to 1
X_Curr, Y_Curr: X-, Y- positions of router
X_Dest, Y_Dest: X-, Y- positions of destination
if |X_Dest - X_Curr| >= Hop_Limit:
      - Set Output to Port E if X_Dest > X_Curr
      - Set Output to Port W if X_Dest < X_Curr
else
      - if |Y_Dest - Y_Curr| >= Hop_Limit:
            • Set Output to Port N if Y_Dest > Y_Curr
            • Set Output to Port S if Y_Dest < Y_Curr
        else
            • if |Y_Dest - Y_Curr| = 0:
                  Set Output to Port E if X_Dest > X_Curr
                  Set Output to Port W if X_Dest < X_Curr
                  Set Output to Local Port if X_Dest = X_Curr
              else
                  Wait
```

The SBR strategy implemented in the NoC reduces packet stalling by effectively transferring packets through bypass links, as described in Table 2.1. The RC in each router of the NoC has a power gating status table with RSS entries for all of its neighboring routers, as shown in Figure 2.4. When the RSS entry against a port is '0', RC follows the regular deadlock free XY routing strategy. When the RSS entry for any port is '1', it indicates that the router connected to that port is power gated. RC then decides whether to route the packet through another port connected to a non-power gated router or to use the bypass links of the power gated router. The decision depends on the distance of the destination from the current router. If the packet destination is not the power gated router and is more than one hop away in the same direction as the power gated router, RC decides to use the bypass links of the partially power gated router. The packet is transferred to the power gated router, and then directly transferred to the second router in the path using the bypass links along that direction. If the destination is one hop away in the same direction as the power gated router, RC reroutes the packet through one of the ports connected to non-power gated routers. If the packet destination is the power gated router, the packet is routed to the power gated router. At the power gated router, when a packet arrives at an input port, RC resolves the packet destination. If the destination is the local node, data is routed to the local port. Otherwise, data is routed to an output port using the bypass link. Since RC takes one cycle to complete its operation, one flit sized input buffer always remains ON at each input port of the partially power gated router to store the flit under processing. In the SBR strategy, a packet passes through only two stages, RC and Bypass Link Traversal (BLT), to reach its output port, as shown in Figure 2.5(b). Hence, it takes only three cycles to route the packet, as opposed to five cycles in a regular router. The additional hardware added at each port and at RC adds little overhead to the router. By using the bypass links and small modifications to the router architecture, SBR avoids the major performance impacts of power gating while achieving leakage power savings.

SBR is deadlock free as it follows an X-direction-first strategy similar to that of traditional XY routing. Packet transfer using SBR proceeds along the X-direction first and then along the Y-direction, and hence does not result in any cyclic dependencies. Even while using bypass links, packet transfer still makes only one turn from the X- to the Y-direction. The only exception arises when a power gated router along the path (if present) has the same Y coordinate as the packet source and the same X coordinate as the packet destination (the packet source and destination are both different from the power gated router). In such a scenario, an additional turn from X to Y is made at the neighboring router of the power gated router. At the ensuing router, the packet is routed using simple XY or SBR based on its RSS entries. The routing is deterministic, and bypass links do not allow a change of direction between X and Y. Hence, none of the possible output ports result in a cyclic turn or a dependency on upstream router resources. This prevents any cyclic dependencies and deadlocks in the SBR strategy.
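To make the routing decision concrete, the following Python sketch gives one possible reading of Table 2.1 combined with the prose rules of this subsection. It is not the RC implementation: the function name `sbr_output_port`, the coordinate convention (North means increasing Y) and the per-port interpretation of `Hop_Limit` are assumptions, and the delivery to a power gated destination router follows the prose description rather than the table.

```python
from typing import Dict, Tuple


def sbr_output_port(curr: Tuple[int, int], dest: Tuple[int, int],
                    rss: Dict[str, int]) -> str:
    """Pick an output port at the current router.

    curr and dest are (x, y) mesh coordinates; rss maps a port name to 1 when
    the neighbour behind that port is power gated (missing ports count as 0).
    Returns "N", "E", "S", "W", "LOCAL" or "WAIT".
    """
    # Assumption: Hop_Limit is kept per port, 2 behind a gated neighbour, else 1.
    hop_limit = {p: 2 if rss.get(p, 0) else 1 for p in ("N", "E", "S", "W")}
    dx, dy = dest[0] - curr[0], dest[1] - curr[1]
    if dx == 0 and dy == 0:
        return "LOCAL"
    x_port = "E" if dx > 0 else "W"
    y_port = "N" if dy > 0 else "S"
    # X first: regular XY hop, or a hop into a gated router whose bypass is usable.
    if dx != 0 and abs(dx) >= hop_limit[x_port]:
        return x_port
    # Otherwise try Y: this also covers re-routing around a gated X neighbour.
    if dy != 0 and abs(dy) >= hop_limit[y_port]:
        return y_port
    # Remaining cases: the target lies one hop away behind a gated neighbour.
    if dy == 0:
        return x_port  # the gated X neighbour is itself the destination
    if dx == 0:
        return y_port  # prose rule: deliver to the gated destination router
    return "WAIT"      # diagonal destination with both neighbours gated


print(sbr_output_port((1, 0), (1, 2), {"N": 1}))
```

In the printed call the North neighbour is power gated and the destination is two hops away along Y, so the North port is chosen and the packet crosses the gated router on its bypass link.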

### 2.3 Walkthrough Example

In this section, the proposed method is illustrated through a walkthrough example. The current state of the illustrated 16-core system is shown in Figure 2.6. Routers 1, 7 and 8 are hybrid routers with WIs. Based on pre-computed utilization, routers 4 and 5 fall into the RU category and are power gated for the entire application period. As per the output of UCU computations at the start of the current epoch, routers 10 and 11 are partially power gated for the current epoch. As per the information about all power gated routers, the ports with RSS='1' in different routers are shown in the table in Figure 2.4. All other ports in every router have their RSS value set to '0'. To showcase different capabilities of the proposed method, we consider the traversal of three packets (named $P_1$, $P_2$ and $P_3$) originating from source router 0. The destinations for the three packets are routers 9, 4 and 15 respectively. The three packets reach their respective destinations following the steps described below:

Figure 2.6 Example scenario illustrating our proposed architecture

- I. At time t<sub>1</sub>, packet P<sub>1</sub> is injected from the local node into router 0. The RC unit at router 0 processes the packet header and decides the output port to be the East port (following XY routing), since the RSS of the East port is '0' (router 1 is in the wake-up state). The arbiter grants access of the East port to P<sub>1</sub>.
- II. At t<sub>2</sub>, packet P<sub>1</sub> traverses the crossbar switch at router 0 to reach the East port of router 0. At the same time, packet P<sub>2</sub> is injected from the local node into router 0. The destination for P<sub>2</sub> is the IP attached to router 4. The *RSS* of the North port at router 0 is '1' as router 4 is power gated. Since the destination is the local port of router 4, RC selects the North port as the output for P<sub>2</sub>. Since the port is free, the arbiter grants access of this port to P<sub>2</sub>.
- III. At t<sub>3</sub>, P<sub>1</sub> traverses the link between routers 0 & 1 and reaches the input buffer of router 1. Packet P<sub>2</sub> traverses the crossbar switch at router 0 to reach the North port. At the same time, packet P<sub>3</sub> is injected into router 0 from the local node. Since the destination is router 15, following XY routing, RC assigns the East port to P<sub>3</sub> and the arbiter grants access as the port is available.
- IV. At t<sub>4</sub>, the header of P<sub>1</sub> is processed by RC at router 1. Since the destination is router 9, which is two hops away along the Y-dimension, RC assigns the North port to P<sub>1</sub> and the arbiter grants access to the port. P<sub>2</sub> traverses the link between routers 0 & 4 and reaches the input of router 4. P<sub>3</sub> traverses the crossbar switch at router 0 to reach the East port of the router.
- V. At t<sub>5</sub>, P<sub>1</sub> traverses the crossbar switch to reach the North port of router 1. The header of P<sub>2</sub> is processed by RC at router 4. Since the destination is the node attached to router 4, RC assigns the local port to P<sub>2</sub>. P<sub>3</sub> traverses the link between routers 0 & 1 to reach the input of router 1.
- VI. At t<sub>6</sub>, P<sub>1</sub> is processed by RC at router 5 (RC is not power gated) and, since the destination is not the local node, it is assigned to the bypass link connecting the North and South ports of router 5. P<sub>2</sub> reaches its destination. P<sub>3</sub> is processed by RC at router 1. Since the destination is router 15 and there is a wireless link available along the path, RC assigns the wireless link at router 1 to P<sub>3</sub>.
- VII. At t<sub>7</sub>, P<sub>1</sub> traverses the bypass link to reach the input of router 9. P<sub>3</sub> traverses the crossbar switch at router 1 to reach the WI.
- VIII. At t<sub>8</sub>, P<sub>1</sub> is processed by RC at router 9 and, since the destination is the local node, it is assigned to the local port of the router. P<sub>3</sub> is processed through the WI at router 1 (WI address bits are added at the start of the packet) and is transmitted using the antenna.
- IX. At t<sub>9</sub>, P<sub>1</sub> reaches its destination. P<sub>3</sub> is received by the antenna of the WI at router 7. The packet P<sub>3</sub> is also received by the antenna of the WI at router 8.
- X. At t<sub>10</sub>, the WI address added to P<sub>3</sub> is processed at the WIs of both routers 7 and 8. At the WI of router 7, since the WI address matches its own, packet P<sub>3</sub> is received and processed at the WI.
- XI. From t<sub>11</sub> to t<sub>17</sub>, RC processes P<sub>3</sub> at router 7 and, following XY routing, P<sub>3</sub> reaches its destination at router 15 through router 11.
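The routes followed in the steps above can be reproduced with a small Python sketch. It assumes a row-major numbering of the 4×4 mesh (router id = 4·y + x), which is inferred from the steps (router 1 lies East of router 0, router 4 lies North of it) and is not stated explicitly; the names `xy_path`, `GATED` and the way the wireless hop is spliced in are hypothetical.

```python
from typing import List, Tuple

WIDTH = 4                    # 16-core system arranged as a 4x4 mesh (assumption)
GATED = {4, 5, 10, 11}       # power gated routers in the example


def coords(router_id: int) -> Tuple[int, int]:
    """Row-major mapping from router id to (x, y)."""
    return router_id % WIDTH, router_id // WIDTH


def xy_path(src: int, dst: int) -> List[int]:
    """Routers visited by plain XY routing, X-direction first."""
    (x, y), (tx, ty) = coords(src), coords(dst)
    path = [src]
    while x != tx:
        x += 1 if tx > x else -1
        path.append(y * WIDTH + x)
    while y != ty:
        y += 1 if ty > y else -1
        path.append(y * WIDTH + x)
    return path


p1 = xy_path(0, 9)                                # wired route of P1
bypassed = [r for r in p1[1:-1] if r in GATED]    # routers crossed on bypass links
p3_wired = xy_path(0, 15)                         # P3 without the wireless shortcut
p3_hybrid = xy_path(0, 1) + xy_path(7, 15)        # wired to 1, wireless 1 -> 7, wired on
print(p1, bypassed, len(p3_wired) - 1, len(p3_hybrid) - 1)
```

Under these assumptions P<sub>1</sub> crosses router 5 on its bypass link, and the wireless shortcut from router 1 to router 7 reduces the hop count of P<sub>3</sub> from six to four.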

### 2.4 Performance Evaluation

In this section, we discuss the performance benefits, energy efficiency and associated overhead of the proposed NoC and compare it in detail with other existing power-gated NoC architectures. For performance evaluation of the architecture based on the proposed NoC router, we present peak throughput and average latency with both application specific and synthetic traffic.

#### 2.4.1 Simulation Setup

The proposed NoC architecture is characterized using a modified cycle-accurate Noxim simulator [48] that models the progress of data flits accurately per clock cycle, accounting for all flits that reach their destination or are dropped [49]. Network traces based on application traffic are collected from the Graphite [50] full system simulator using *PARSEC* [45] and *SPLASH-2* [46] benchmarks. The traces are executed on the Noxim simulator. We consider a system size of 64 cores for our experiments, which is representative of current multicore technology trends. The width of all wired links is the same as the flit size, which is 32 bits. We consider a moderate packet size of 64 flits for all our experiments. Similar to the wired links, we adopt wormhole routing in the wireless links too. The routers are input buffered, with each port comprising 4 VCs. Each VC has a depth of 4 flits. The ports associated with the WIs have an increased buffer depth to avoid excess packet drops. Through simulations, we find the ideal buffer depth at WIs to be 8 flits; increasing it further provides diminishing returns. The NoC switches are driven with a clock frequency of 2.5 GHz. The configured system setup for all simulations is presented in Table 2.2. Threshold values for RU and HU routers are application dependent. To estimate the power breakdown in terms of dynamic and leakage power consumption, the RTL level design of the router is synthesized with Synopsys Design Compiler using 32nm CMOS technology. The delays and energy dissipation of the wired links are obtained through Cadence simulations, taking into account the specific length of each link based on the established connections on a 20 mm × 20 mm die. The on/off keying (OOK) modulation based wireless transceiver adopted from [19] is designed and simulated using the TSMC 65-nm CMOS process and is shown to dissipate 32 mW while sustaining a data rate of 16 Gbps with a bit-error rate (BER) of less than 10<sup>-15</sup> and occupying an area of 0.3 mm<sup>2</sup>. We adopt an SA-based optimization technique for placement of WIs to get maximum benefits [14]. A token-based medium access mechanism is used in the WNoC architecture.

Table 2.2 Simulation Setup for Proposed NoC

| Parameter       | Value                                                    |
|-----------------|----------------------------------------------------------|
| Topology        | 8×8 Mesh, 8×8 Mesh based WNoC                            |
| Routing         | XY for baseline, SBR, NorthLast for hybrid router node   |
| Pipeline        | 4 stages for regular and 3 stages for power-gated router |
| Flit size       | 32 bits                                                  |
| Packet size     | 64 flits                                                 |
| Clock Frequency | 2.5GHz                                                   |
| Workload        | Synthetic and Real                                       |
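Two quantities follow directly from the setup figures quoted above and are convenient when reading the energy results: the energy per bit of the wireless transceiver and the buffer capacity of a regular input port. The short Python calculation below derives them; the variable names are illustrative, and the per-flit figure covers the transceiver only.

```python
# Figures quoted in the simulation setup.
transceiver_power_w = 32e-3   # 32 mW
data_rate_bps = 16e9          # 16 Gbps
flit_bits = 32
vcs_per_port = 4
vc_depth_flits = 4

# Energy spent by the transceiver per transmitted bit and per 32-bit flit.
energy_per_bit_j = transceiver_power_w / data_rate_bps
energy_per_flit_j = energy_per_bit_j * flit_bits

# Storage of one regular (non-WI) input port.
buffer_bits_per_port = vcs_per_port * vc_depth_flits * flit_bits

print(energy_per_bit_j, energy_per_flit_j, buffer_bits_per_port)
```

This gives roughly 2 pJ per bit (about 64 pJ per flit) for the wireless transceiver and 512 bits of buffering per regular input port.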

#### 2.4.2 Performance Metrics Review

In order to evaluate the proposed NoC in terms of packet energy dissipation and peak network bandwidth, we review some basic metrics. Packet energy,  $E_{pkt}$  is the energy dissipated in transferring one packet completely from source to destination at network saturation. It can be measured as

$$E_{pkt} = \frac{\sum_{i=1}^{N_{pkt}} \left[ (L_i - h_i \lambda) E_{buf} + h_i E_{wire} \lambda \right] + N_{sim} E_{wireless}}{N_{pkt}}, \tag{2.2}$$

where $N_{pkt}$ is the number of packets routed in the NoC, $L_i$ is the latency of the $i^{th}$ packet, $h_i$ is the number of hops along the path of the $i^{th}$ packet and $E_{buf}$ is the energy dissipated for a flit in router buffers. The energy dissipation of a wireline hop is $E_{wire}$ and $\lambda$ is the packet length in number of flits. $N_{sim}$ is the duration of the simulation and $E_{wireless}$ is the energy dissipated by two wireless transceivers in the WNoC in one cycle.
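As a numerical illustration of the packet energy metric, the following Python sketch evaluates it for two packets. It assumes that both the buffer term and the wire term are summed over all packets, which is one reading of the expression; the function name `packet_energy` and all input values are hypothetical and given in arbitrary energy units.

```python
from typing import Sequence


def packet_energy(latencies: Sequence[float], hops: Sequence[int], lam: int,
                  e_buf: float, e_wire: float,
                  n_sim: int, e_wireless: float) -> float:
    """Average energy per packet.

    latencies[i] and hops[i] are L_i and h_i of packet i, lam is the packet
    length in flits, n_sim is the simulation duration in cycles.
    """
    n_pkt = len(latencies)
    wired = sum((lat - h * lam) * e_buf + h * e_wire * lam
                for lat, h in zip(latencies, hops))
    return (wired + n_sim * e_wireless) / n_pkt


# Two hypothetical packets of 64 flits each.
e_pkt = packet_energy(latencies=[200, 300], hops=[2, 3], lam=64,
                      e_buf=1.0, e_wire=2.0, n_sim=1000, e_wireless=0.5)
print(e_pkt)
```

Here the per-packet wired contributions are 328 and 492 units, the wireless term adds 500 units over the whole run, and the average is 660 units per packet.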

Peak bandwidth is the maximum achievable data rate for the NoC. The bandwidth is measured as the average number of bits successfully arriving per core per second. Bandwidth, *B* can be determined as,

$$B = t \beta N f, \tag{2.3}$$

where $t$ is the maximum throughput in number of flits received per clock at network saturation, $\beta$ is the number of bits in a flit, $N$ is the number of cores in the NoC and $f$ is the clock frequency. Throughput is directly obtained from system level simulations performed by the NoC simulator.

Figure 2.7 Histogram for flit arrival at router that represents (a) HU, (b) MU and (c) RU
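A quick numerical check of the bandwidth expression, using the flit size, core count and clock frequency of the simulation setup, is shown below. The saturation throughput of 0.25 flits per cycle is a made-up value chosen only for illustration and is assumed to be a per-core figure; the function name `peak_bandwidth` is hypothetical.

```python
def peak_bandwidth(t: float, beta: int, n_cores: int, f_hz: float) -> float:
    """B = t * beta * N * f, in bits per second."""
    return t * beta * n_cores * f_hz


# beta = 32 bits, N = 64 cores, f = 2.5 GHz; t = 0.25 is a hypothetical throughput.
bandwidth_bps = peak_bandwidth(t=0.25, beta=32, n_cores=64, f_hz=2.5e9)
print(bandwidth_bps / 1e12, "Tbps")
```

With these inputs the expression evaluates to 1.28 Tbps.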

#### 2.4.3 Router-level Statistics

The applicability of power gating in a NoC depends largely on the workload activity of the network. We run the workloads to collect statistics of flit arrivals at the router level. For example, Figure 2.7 shows the histogram of flit arrivals under Uniform traffic at various timestamps. Each plot represents flit arrivals over a fixed total simulation period at three different routers falling under the HU, MU and RU categories. The intervals between timestamps contain a mixture of long and short durations. Applying power gating during long intervals provides the best savings with reduced transient energy. Figure 2.7(a) represents HU routers, where power gating is not affordable. RU routers in Figure 2.7(c) have very sparse utilization spread across time and can be power gated for the entire application period to save the most static energy with little impact on performance. Figure 2.7(b) represents MU routers, which have some dense utilization periods along with very low utilization periods. For these routers, power gating can be applied dynamically on an epoch-to-epoch basis. Applying power gating in short intervals increases transient energy due to frequent state changes.
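The distinction between long and short idle intervals can be illustrated by extracting the gaps between consecutive flit arrivals at a router. The sketch below is an assumption-laden illustration: the function name `idle_gaps`, the sample timestamps and the `min_idle` cut-off of 1000 cycles are all hypothetical.

```python
from typing import List, Sequence, Tuple


def idle_gaps(arrivals: Sequence[int], min_idle: int) -> List[Tuple[int, int]]:
    """Return (start, end) cycle pairs of gaps between arrivals of at least min_idle cycles."""
    stamps = sorted(arrivals)
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a >= min_idle]


# Hypothetical flit arrival timestamps (in cycles) at one router.
gaps = idle_gaps([0, 5, 1200, 1210, 4000], min_idle=1000)
print(gaps)
```

Only the two long gaps are reported; the short gaps between bursts are the ones where power gating would mostly add transient energy.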



Figure 2.8 Normalized bandwidth of NoC architectures with different traffic situations for a 64 core system w.r.t. mesh

#### 2.4.4 Bandwidth and Latency Analysis

We evaluate the performance of NoC under different traffic situations with both mesh & WNoC architectures in terms of bandwidth and latency.

**Bandwidth:** We evaluate the peak bandwidth of the proposed NoC for a 64-core system under both synthetic and real traffic. To evaluate performance under both computation and communication intensive workloads, both *SPLASH-2* and *PARSEC* benchmarks are considered. Figure 2.8 shows the peak bandwidth (the maximum achievable data rate) at network saturation for mesh and WNoC architectures with the proposed NoC routers and with regular routers. The proposed NoC improves bandwidth or performs equally well in more than half of the tested benchmarks and traffic patterns. Peak bandwidth improves with Bitreversal traffic, whereas it remains the same with LU\_C, Transpose1, Transpose2 and Butterfly traffic patterns. The performance degradation observed with the remaining traffic patterns is very small. As SBR reduces the contention delay due to power gated routers, this reduction enhances the overall bandwidth of the network. From the results, it is evident that the proposed architecture offers little to no performance degradation while achieving considerable power/energy savings at the router and network level.

**Latency:** Figure 2.9 shows the network performance in terms of latency under various traffic patterns. The normalized packet latency for mesh and WNoC with the NoC router is 0.62 and 0.39 respectively with respect to the baseline mesh. It can be seen that latency for Blackscholes, fmm, and synthetic applications is low due to the low communication intensity. It can also be seen that latency for Fluidanimate, FFT, Raytrace, Radix and Radiosity\_C is high due to the high communication requirements. However, the latency increase in both mesh and WNoC architectures using the NoC router is insignificant compared to using regular routers, except for FFT and synthetic traffic patterns. From the bandwidth and latency results, it is clear that the NoC architecture achieves significant static energy saving with little to no impact on the performance of the network.

Figure 2.9 Normalized saturation latency of Proposed NoC based architecture over regular architectures w.r.t. mesh

Figure 2.10 Percentage of packet energy saving using proposed NoC over regular mesh and WNoC

#### 2.4.5 Energy saving

The average packet energy saving obtained by NoC is evaluated by running *PARSEC* [45] and *SPLASH-2* [46] benchmarks on a 64-core system. These benchmarks consist of a combination of complete executables and computational kernels, which represent a variety of compute-intensive scenarios across scientific, engineering and financial applications. We also run synthetic traffic patterns viz., Transpose1, Transpose2, Bitreversal, Shuffle and Butterfly. The synthetic traffic patterns represent regularly occurring traffic scenarios and can have messages generated by different sources. The energy comparisons are made with mesh and WNoC architectures using regular routers and proposed NoC routers. The energy saved with NoC over regular routers in both architectures for different traffic patterns is illustrated in Figure 2.10. On average, NoC achieves a 49% saving in energy across all benchmarks as compared with regular WNoC and mesh.

Table 2.3 SPLASH-2 and PARSEC Benchmark Characteristics for Expected Trends

| Benchmarks                          | Spatial locality | Power saving | Remarks                                             |
|-------------------------------------|------------------|--------------|-----------------------------------------------------|
| LU, FFT, Blackscholes, Fluidanimate | High             | High         | High power saving due to many underutilized routers |
| FMM, Radix                          | Medium           | Medium       | Medium                                              |
| Radiosity, Raytrace                 | Low              | High         | High power saving due to many underutilized routers |

Note: Power saving ranges: High $\geq 40\%$, Medium $\geq 30\%$ and $< 40\%$, Low $< 30\%$.

The energy savings achieved by NoC depend on the spatial locality and traffic pattern characteristics of the benchmarks, as shown in Table 2.3. If the spatial locality of a program is high, data is accessed contiguously among a few routers or from a small address space, and vice-versa. Benchmarks with high spatial locality have many RU routers outside the heavily used cluster of routers. This increases the chances of saving power in those routers by power gating them through pre-computed utilization estimates. On the other hand, if spatial locality is low, data is accessed non-contiguously over large chunks of addresses. This results in many routers falling under the MU category of the pre-computed estimate, and static energy savings are primarily achieved through runtime optimizations. While the energy saving with NoC is significant for both high and low spatial locality, runtime and bypass control overheads are higher in the latter case, with routers going in and out of sleep state, whereas routers remain mostly power gated in the former case. When the spatial locality of benchmarks is moderate, the energy saving obtained by NoC depends on the distribution of application traffic and can vary from moderate to high. Different benchmarks with their respective spatial locality characteristics are presented in Table 2.3 and their energy savings in mesh architecture can be observed from Figure 2.10.

#### 2.4.6 Summary of Proposed and Existing NoC Architectures

In this section, we compare NoC with other recently proposed power-gated NoC architectures. The power saving obtained and impacts incurred by each method along with NoC are shown in Table 2.4. The impacts incurred can be either degradation in performance or area overheads. As can be seen, NoC provides significantly higher leakage energy savings than the other techniques, while adding 7% area overhead and no performance degradation. In addition, NoC addresses short-term sleep effects, path blocking and isolation of routers by use of the SBR strategy and bypass links.

Table 2.4 Summary of proposed and energy-efficient NoC Architectures

| Ref. | Approaches  | Configurations | Power gated components | Advantages | Power/energy saving (%) | Penalty |
|------|-------------|----------------|------------------------|------------|-------------------------|---------|
| [30] | Panthre     | 8×8 mesh; VCs: 4 VCs/port, 8 flits/VC; XY for baseline and up/down for Panthre; clock frequency: 2 GHz; pipeline: 2 stage VC flow control; simulator: trace-driven cycle-level simulator; benchmark: synthetic uniform and SPEC CPU2006 | Data path and whole router | Leakage power saving with low latency impact | Overall network power: 14.5% | 1.8% degradation in performance |
| [27] | FlexiBuffer | 8×8 mesh; XY routing; 1.5 GHz clock frequency; technology: 32nm CMOS process; simulator: cycle accurate simulator, Orion 2.0 | Buffers | Leakage power saving with low latency impact | Overall router power: 39% | 3% degradation in throughput |
| [28] | NoRD        | 4×4 mesh and 8×8 mesh; input buffer: 5 flits depth; 3 GHz clock frequency; pipeline: 4 stages; simulator: Simics with GEMS and Garnet, and Orion 2.0; technology: 45nm CMOS process; benchmark: PARSEC | Router | Reduces the router static energy and improves the average packet latency | Router static energy saving: 29.9% | 3% area overhead |
| This | Proposed    | 8×8 mesh and WNoC; XY, SBR and North-Last for WNoCs; 32 flit size; packet size: 64 flits; 2.5 GHz clock frequency; simulator: Graphite and Noxim, DSENT; technology: 32nm CMOS process; benchmarks: PARSEC and SPLASH-2 | Buffers and crossbar | Saves leakage power consumption in base router | Router leakage power saving: 92.20% for BR | 7% area overhead for BR |

#### 2.4.7 Router Power and Area Overhead

In this section, we discuss the overheads due to power gating implementation in NoC architecture. NoC uses a power gating switch to keep leakage power to a minimum. The power gating switch consumes 3.16 µW of power. The power consumption of the additional components (inclusive of UCU, PMC, PG, and modified RC) is 88.83 µW. The leakage power consumption of a wormhole-based router at an average injection rate of 0.3 flits/cycle is 7.54 mW. The component-wise leakage power consumption of the regular router and NoC is shown in Table 2.5. Leakage power consumption for NoC during active and sleep modes, including all components, is 7.65 mW and 0.60 mW respectively. The proposed technique can save up to 92.20% of base router leakage power during sleep mode.

Table 2.5 Characteristics Table of Proposed NoC

| Components           | Leakage power, regular wormhole router (mW) | Leakage power, proposed (mW) | Area, regular wormhole based router (mm<sup>2</sup>) | Area, proposed (mm<sup>2</sup>) |
|----------------------|---------------------------------------------|------------------------------|------------------------------------------------------|---------------------------------|
| Buffer               | 6.14                                        | 0.192                        | 13.3×10<sup>-3</sup>                                 | 13.3×10<sup>-3</sup>            |
| Crossbar             | 1.15                                        | 0.05                         | 5.88×10<sup>-3</sup>                                 | 5.88×10<sup>-3</sup>            |
| Power gating control |                                             | 0.089                        |                                                      | 1.822×10<sup>-3</sup>           |
| Others               | 0.247                                       | 0.267                        | 0.518×10<sup>-3</sup>                                | 0.518×10<sup>-3</sup>           |
| Total                | 7.537                                       | 0.596                        | 19.698×10<sup>-3</sup>                               | 21.2175×10<sup>-3</sup>         |
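The reported sleep-mode saving can be cross-checked from the quoted totals. The sketch below is only an illustration: it assumes the saving is the sleep-mode leakage of the proposed router measured against its own active-mode leakage, and the function name `leakage_saving_percent` is hypothetical rather than taken from the thesis.

```python
def leakage_saving_percent(active_mw: float, sleep_mw: float) -> float:
    """Percentage of leakage power removed when the router is in sleep mode."""
    return (1.0 - sleep_mw / active_mw) * 100.0


# Values quoted in Section 2.4.7 and Table 2.5 (in mW).
ACTIVE_LEAKAGE_MW = 7.65   # proposed router, active mode, all components
SLEEP_LEAKAGE_MW = 0.596   # proposed router, sleep mode (Table 2.5 total)

saving = leakage_saving_percent(ACTIVE_LEAKAGE_MW, SLEEP_LEAKAGE_MW)
print(f"Sleep-mode leakage saving: {saving:.2f}%")
```

Under this assumed definition the result is roughly 92.2%, in line with the figure reported in the text.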

The total area requirement of a regular wormhole based router is  $19.70\times10^{-3}\text{mm}^2$ . The power gating switch occupies  $0.10\times10^{-3}\text{mm}^2$  of area. The PMC, modified RC and UCU units together occupy  $1.72\times10^{-3}\text{mm}^2$ . The total area overhead due to additional components (including PG, UCU and PMC) in NoC is  $1.82\times10^{-3}\text{mm}^2$ . The total area requirement for base routers in NoC (including router and power gating components) is  $21.22\times10^{-3}\text{mm}^2$ . NoC BR requires an additional area overhead of 7% over regular BR. The component-wise area and power requirements for NoC and regular wormhole routers are presented in Table 2.5. The cross-section area overhead for a single wire is  $5.7\times10^{-3}\mu\text{m}^2$ . The total wire overhead with neighboring routers is  $0.046\mu\text{m}^2$ , which is very small compared to the baseline wires.

#### 2.4.8 Scalability

NoC integrates two components, PMC and UCU, to control the power gating operation and uses PMOS as a header to reduce leakage power consumption in NoC. The number of header sleep transistors increases with the number of routers. The modifications to the RC unit do not require any information from other routers and so they do not change with the number of routers. PMC and UCU units operate using signals only from the four neighboring routers and hence their implementation remains unchanged with system size. Hence, the proposed approach can be extended to any number of cores. The most important aspect of the proposed NoC is that it can be used for both wireline and wireless network topologies without any significant performance impact. The proposed method achieves a significant saving in energy with little area overhead and is scalable to any system size in both wireline and wireless NoCs.

### 2.5 Conclusions

In this chapter, we propose and evaluate the performance of NoC architecture for achieving sustainable computing platforms [51]. It employs a fine-grained power gating technique in the router to reduce leakage power consumption. Using a hybrid two-level utilization estimation approach and a seamless bypass routing strategy, NoC maximizes power saving while minimizing the impact on performance. Simulation results show that the proposed approach saves up to 92.20% of leakage power consumption in the base routers compared with the baseline router architecture. The seamless bypass routing strategy and its capability to avoid performance degradation and many of the adverse effects of power gating are discussed in detail. On average, NoC reduces the total packet energy consumption by 49% with negligible reduction in network bandwidth and delay performance. Furthermore, the proposed NoC architecture can be effectively implemented for both wired and wireless NoCs with only 7% area overhead as compared to the baseline router architecture.

# Chapter 3

# Switching Power-aware NoC

In this chapter, we first present a detailed description of dynamic runtime utilization computation for estimating router utilization with low runtime overhead. A stochastic model is developed for accurate prediction of router utilization from its past utilization characteristics during the dynamic runtime utilization computation phase. The supply voltage of router elements is varied dynamically based on the predicted utilization. We then describe the Adaptive Multi-Voltage Scaling (AMS) control mechanism, which is implemented to reduce switching power in router elements and idle-state power in WI components. Adaptive multi-voltage scaling is a dynamic power minimization technique that reduces the operating voltage based on the utilization of the components. Similarly, we power gate the WI components when they are not actively involved in data transmission to reduce idle-state power consumption. We present the details of the AMS controller, along with the multi-level voltage shifter and voltage regulator used to dynamically change the supply voltage of router components and WIs based on their utilization and save network energy. A control mechanism for packets transferred over the wireless link is presented to avoid data loss and maintain signal fidelity in the presence of power gated WIs. A new approach to transmitting the medium access token through the wireless link is proposed to minimize control signal latency. The impact of PVT variations on the PA and LNA is analyzed in the results section.

We describe the strategy to estimate the utilization of a router using a stochastic process model at the application level. A stochastic process is a collection of random variables that represents the evolution of a system over time. Based on the utilization, we compute the switching power consumption of the routers using AMS. As power/energy is the main concern for on-chip communication in many-core systems, we also attempt to reduce the power consumption of WIs based on their utilization using the AMS control mechanism. The utilization of WIs depends on the token passing protocol. We notice that not all the WIs are utilized all the time, as WNoC supports single frequency channel communication to avoid interference. Therefore, we reduce the idle-state power consumption at WIs using the AMS scheme, which improves the power efficiency at WIs significantly.

Router utilization is estimated at two levels, pre-computed global utilization and dynamic runtime utilization; this hybrid two-step approach is discussed in detail in Chapter 2. For switching power saving, the pre-computed global utilization method is the same as in Chapter 2. Here, the off-line supply voltage estimation process provides a great advantage over the existing methods: it not only reduces power consumption significantly but also avoids a large hardware overhead. Because leakage power saving is based on a binary decision of turning a router on or off, it does not need a detailed utilization estimate. Voltage scaling does, and therefore dynamic runtime utilization is estimated using a stochastic process model. Together, these approaches provide substantial power/energy saving without any significant performance degradation.

### 3.1 Preliminaries and Related Works

With power becoming the biggest challenge for chip design in the latest technology generations, several low power techniques like dynamic scaling and power gating have been proposed in the literature. These techniques, in the recent past, have been extended to NoCs in many-core architectures to make them energy-efficient. Dynamic scaling schemes control the operating voltage or frequency or both of a circuit by exploiting idle phases of execution. They reduce dynamic energy consumption of CMOS circuits without adversely impacting their performance. Different algorithms have been employed to identify low active phases of different NoC elements and save switching energy [52] [53]. Dynamic Voltage/Frequency Scaling (DVFS) for WNoCs has been explored in [54] and improves the energy-delay product with various real traffic patterns. A DVFS-enabled sustainable WNoC architecture has been proposed to facilitate the design of power and thermal efficient communication in multicore chips [55]. A history-based Dynamic Voltage Scaling (DVS) policy based on link utilization is discussed in [56] to reduce the network energy consumption. All these methods save dynamic power consumption in NoCs by varying the voltage and/or frequency of routers appropriately. On the other hand, power gating schemes cut off the power supply to a circuit when it is not active and reduce leakage power. In this chapter, we propose an energy-efficient WNoC architecture using a novel AMS mechanism to achieve significant network energy saving by scaling the voltage of router components. Our main aim in this work is to optimize the supply voltage to minimize unnecessary dynamic power consumption without significant performance degradation. Similarly, we perform power gating of WIs to save idle-state power consumption. The combination of dynamic voltage scaling and power gating can reduce the energy/power consumption significantly in WNoC architecture.

AMS uses a hybrid power management approach to reduce switching power based on utilization of routers [57]. Routers at extreme ends (HU and RU) remain at high or low utilization for most of the application duration. They do not exhibit significant variations in runtime idle and active phases like MU routers. Exploiting this, AMS operates HU and RU routers at high and low voltages respectively to reduce runtime dynamic control overheads. Routers falling in MU are operated dynamically using runtime utilization to scale voltage during low utilization phases and save energy. Using this hybrid two level router utilization, AMS achieves considerable energy savings in NoC without impacting its performance or incurring significant runtime control overheads.

The proposed AMS also controls the supply to WIs in the WNoC topology to save idle-state power consumption. Most existing WNoCs [12] [13] [14] employ single channel wireless communication because of the ease of design and low overheads. As a result, only a pair of WIs communicate at any time. Exploiting this, the idle-state power associated with all the other idle WIs is eliminated by power gating them. The power gating control for WIs in AMS is independent of base router operation and hence, AMS is easily extended to wired NoC topologies without any modifications. In this chapter, we also present a detailed discussion of WI control using the AMS scheme. We discuss the hardware implementation of the AMS controller and the associated voltage regulator and level shifter circuits to efficiently choose and switch between different voltage levels. The proposed architecture is evaluated using *PARSEC* [45] and *SPLASH-2* [46] benchmarks for achievable energy savings, network performance, power gating impacts, and overheads.

### 3.2 Switching and Idle-state Power-aware WNoC

In this section, we discuss the architecture of the AMS based WNoC, the stochastic model for estimation of router utilization, and the AMS control mechanism for router voltage scaling and WI power gating.

#### 3.2.1 System Architecture

The WNoC topology considered in this work augments a 2D wired mesh topology with single hop, long-range wireless links, where WIs are used for long distance communication across the chip, as presented in Figure 3.1 (representative of our proposed architecture). We adopt the simulated annealing-based optimization technique for placement of WIs from [1] to get maximum performance benefits. The WNoC architecture consists of Base Routers (BRs) and a few Hybrid Routers (HRs) that integrate WIs with BR components. Each BR is a five port, fully connected wired router.

A cluster in the WNoC topology with HRs, BRs, AMS controller and Voltage Regulator Unit (VRU) is shown in Figure 3.1. A BR includes wired I/O ports, buffers, crossbar, route computation unit, and switch allocator. The BR communicates with the neighboring routers through wired links. The WI in an HR converts packet data between digital and RF domains through serializer/deserializer, modulator/demodulator, power amplifier and LNA components. A single antenna with each WI is used to transmit and receive the data. The wireless channel is shared between all WIs, and a token passing mechanism with round robin arbitration is adopted to provide access to the shared wireless medium. An AMS controller is integrated with each router to control the operating voltage of router components and WIs. The router voltage is controlled through router utilization estimates, and WI power gating is operated based on availability of the token and data to be transmitted.

Figure 3.1 Cluster-level AMS based wireless NoC architecture

#### 3.2.2 Dynamic Runtime Utilization Computation

The operating voltage of MU routers in the AMS based NoC architecture is varied according to their utilization on an epoch-to-epoch basis. At the start of each epoch, the router utilization estimate for the next epoch is computed, and the operating voltage is set accordingly. An epoch is a time interval within the total simulation period. Its duration can remain fixed throughout the application or vary from epoch to epoch, and is chosen to suit application needs. The efficiency of AMS control operation, and consequently the energy saving achieved, depends on accurately capturing the utilization of each router. We propose a probabilistic utilization model based on stochastic processes to provide an accurate and fine-grained prediction of router utilization. The proposed model uses a linear estimator for router utilization and stochastic updates to the prediction model to capture past utilization characteristics. The linear estimator is chosen because it makes it easy to determine router utilization from observed parameters. The probabilistic parameters used for estimation of router utilization are shown in Table 3.1. The utilization of a router in the proposed model is measured as the number of flits the router is expected to process in an epoch. This number depends on the router's current input port occupancy and on the flits forwarded from its neighboring routers (routers that are a single hop away). The number of flits occupying all input buffers is directly observable at the start of an epoch and is represented by the random variable T. The values taken by T vary between 0 and  $\eta * \beta$, where  $\eta$  is the router degree and  $\beta$  is the buffer size in flits at each port. The number of flits received from neighboring routers is obtained using the utilization estimation model. The neighboring routers, at the start of an epoch, estimate the number of flits to be transmitted to the router under consideration and communicate this information.

**Table 3.1 Probabilistic Notations** 

| Notations                         | Classifications                                                               |
|-----------------------------------|-------------------------------------------------------------------------------|
| $\mathbf{F}$                      | Flit count random vector of length $\eta$: $[F_1 \ F_2 \ \dots \ F_{\eta}]$   |
| $F_i$                             | Random variable denoting number of flits routed to output port i in an epoch  |
| T                                 | Buffer occupancy random variable (number of flits occupying all input ports)  |
| $E[\mathbf{F}]$, $Var[\mathbf{F}]$ | Mean and variance of random vector $\mathbf{F}$                              |
| $E_T$, $\sigma_T^2$               | Mean and variance of random variable T                                        |
| $E_i$, $\sigma_i^2$               | Mean and variance of random variable $F_i$ in random vector $\mathbf{F}$      |
| $Cov[X,Y]$                        | Covariance of random variables X and Y                                        |
| $\mathbf{C}_{\mathbf{F}}$         | Covariance matrix of random vector $\mathbf{F}$                               |
| $\mathbf{C}_{\mathbf{F}F_i}$      | Cross-covariance of $F_i$ and $\mathbf{F}$                                    |
| $\eta$                            | Degree of the router (including local port)                                   |
| $\beta$                           | Buffer size at each input port in flits                                       |

The total utilization for router under consideration is then computed from the combination of input buffer occupancy and utilization estimated from neighboring routers.

At any router, we estimate the number of flits routed to each port by observing its current input buffer occupancy, T, and past traffic characteristics. The random vector  $\mathbf{F}$  indicates the number of flits routed to each output port in an epoch. For the topology evaluated in our work, the router degree is five, and  $\mathbf{F}$  is represented as  $[F_N\ F_E\ F_W\ F_S\ F_L]$. It is assumed that the number of flits routed to one output port is not affected by the number of flits routed to another port. Hence, the random variables  $F_i$  are independent of each other and the covariance of any two  $F_i$'s is zero. T and  $\mathbf{F}$  take different values (in number of flits) and vary in time from epoch to epoch, and so form a stochastic process. We use a linear estimator to find the number of flits routed to each port, given the buffer occupancy, T, at the start of epoch n+1 and the traffic characteristics for the previous n epochs, as shown in Equation 3.1

$$F_i = \mathbf{a}_{\mathbf{F}}' \mathbf{F} + a_T T + b_F + b_T \tag{3.1}$$

where

$$\mathbf{a}_{\mathbf{F}} = \mathbf{C}_{\mathbf{F}}^{-1} \mathbf{C}_{\mathbf{F}F_{i}}, \quad a_{T} = \frac{Cov[F_{i}, T]}{Var[T]}$$

$$b_{F} = E_{i} - \mathbf{a}_{\mathbf{F}}' E[\mathbf{F}], \quad b_{T} = E_{i} - a_{T} E_{T} \tag{3.2}$$
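The linear estimator can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: it follows Equations 3.1 and 3.2 as printed, assumes a diagonal covariance matrix (so that its inverse is element-wise), and the function name `estimate_flits`, its parameter names, and the numbers in the example call are all hypothetical.

```python
from typing import Sequence


def estimate_flits(
    f_prev: Sequence[float],     # F: flits routed to each port in epoch n
    t_now: float,                # T: buffer occupancy at the start of epoch n+1
    mean_f: Sequence[float],     # E[F]
    var_f: Sequence[float],      # diagonal of C_F, i.e. Var[F_j] (assumed diagonal)
    cross_cov: Sequence[float],  # C_{F F_i}
    mean_i: float,               # E_i
    cov_i_t: float,              # Cov[F_i, T]
    mean_t: float,               # E_T
    var_t: float,                # Var[T]
) -> float:
    """Estimate F_i for the next epoch using the printed linear estimator."""
    # a_F = C_F^{-1} C_{F F_i}; with a diagonal C_F this is an element-wise ratio.
    a_f = [c / v for c, v in zip(cross_cov, var_f)]
    a_t = cov_i_t / var_t
    b_f = mean_i - sum(a * m for a, m in zip(a_f, mean_f))
    b_t = mean_i - a_t * mean_t
    return sum(a * f for a, f in zip(a_f, f_prev)) + a_t * t_now + b_f + b_t


# Illustrative numbers only (not measured data); five ports as in the text.
estimate = estimate_flits(
    f_prev=[6, 4, 4, 4, 4], t_now=12,
    mean_f=[4, 4, 4, 4, 4], var_f=[2, 2, 2, 2, 2],
    cross_cov=[1, 0, 0, 0, 0], mean_i=4,
    cov_i_t=2, mean_t=10, var_t=4,
)
print(estimate)
```

Only the arithmetic of the estimator is shown here; how the cross-covariance entries are measured in hardware is outside the scope of this sketch.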

The random vector **F** on the right-hand side of Equation 3.1 is the number of flits routed to each port in the  $n^{th}$  epoch. The covariance matrix and cross-covariance vector capture previous traffic characteristics. The accuracy of the utilization estimate depends on how well these parameters capture the traffic properties. For this purpose, the mean, variance and covariance parameters are updated in each epoch based on the observed values of flits routed in the previous epoch. The update operation of the estimation parameters is not on the data path of router operation and hence does not incur additional delay overheads. Since cross-covariance and covariance values are updated at each epoch, they capture the traffic characteristics of all previous epochs. In epoch n+1, the parameters are updated as follows:

$$E_{i_{\langle n+1\rangle}} = \frac{n}{n+1} E_{i_{\langle n\rangle}} + \frac{F_{i_{\langle n\rangle}}}{n+1}, \quad E_{T_{\langle n+1\rangle}} = \frac{n}{n+1} E_{T_{\langle n\rangle}} + \frac{T_{\langle n\rangle}}{n+1}$$

$$Var[F_{i}]_{\langle n+1\rangle} = \frac{n}{n+1} \left[Var[F_{i}]_{\langle n\rangle} + E_{i_{\langle n\rangle}}^{2}\right] + \frac{F_{i_{\langle n\rangle}}^{2}}{n+1} - E_{i_{\langle n+1\rangle}}^{2}$$

$$Var[T]_{\langle n+1\rangle} = \frac{n}{n+1} \left[Var[T]_{\langle n\rangle} + E_{T_{\langle n\rangle}}^{2}\right] + \frac{T_{\langle n\rangle}^{2}}{n+1} - E_{T_{\langle n+1\rangle}}^{2} \tag{3.3}$$

The covariance matrix, cross-covariance vector, and coefficients in Equation 3.1 are updated from new mean and variance parameters.
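The per-epoch update above is a running mean and variance recurrence, and a short Python sketch makes the bookkeeping concrete. The function name `update_moments` and the sample values are hypothetical. For the purpose of checking the recurrence against batch statistics, the sketch starts from n = 0 with no prior estimate, whereas the thesis starts from blind initial estimates.

```python
from typing import Tuple


def update_moments(mean: float, var: float, sample: float, n: int) -> Tuple[float, float]:
    """Fold the sample observed in epoch n into the running mean and variance.

    `mean` and `var` summarize the first n observations; the return value
    summarizes n + 1 observations.
    """
    new_mean = n / (n + 1) * mean + sample / (n + 1)
    new_var = n / (n + 1) * (var + mean ** 2) + sample ** 2 / (n + 1) - new_mean ** 2
    return new_mean, new_var


# Illustrative flit counts for one output port over four epochs.
observed_flits = [4, 7, 1, 8]
mean, var = 0.0, 0.0
for n, flits in enumerate(observed_flits):
    mean, var = update_moments(mean, var, flits, n)

print(mean, var)
```

The same function applies unchanged to the buffer occupancy T. Because only the previous mean and variance are stored, the update needs constant memory per tracked variable regardless of how many epochs have elapsed.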

At system start, there is no prior knowledge or past observed values of flits routed, and hence the parameters must be initialized with blind estimates. Without loss of generality, we assume  $\mathbf{F}$  and T to have Gaussian distributions with their respective mean and variance parameters. We set the initial values of the mean and variance parameters as

$$E_T = \frac{\eta * \beta}{2}, \quad Var[T] = \frac{\eta * \beta}{10} \tag{3.4}$$

$$E_i = \frac{\beta}{2}, \quad Var[F_i] = \frac{\beta}{5} \tag{3.5}$$

In each epoch, the utilization estimates for number of flits routed to each port are computed and are sent to routers connected to corresponding ports. To minimize the wiring overhead, we divide router utilization into multiple levels, and the utilization estimates are quantized into one of these levels. This quantized utilization estimate of each port is transferred to the neighboring routers.

At any router, the utilization estimates (UE) are received from all its neighboring routers (computed using Equation 3.1 at those routers). The total utilization of the router for the next epoch is computed as sum of its input buffer occupancy and all upstream loads as shown in Equation 3.6

$$TUE = \sum_{i=1}^{\eta} UE_i + T_Q \tag{3.6}$$

where TUE is the total utilization estimate for the next epoch,  $UE_i$  is the upstream load from port i, and  $T_Q$  is the quantized buffer occupancy. This total utilization estimate is used by the AMS controller to determine the router supply voltage for the next epoch. In the next subsections, we discuss a switching power-aware NoC architecture using this stochastic process model and present a thorough performance evaluation of the NoC including energy efficiency and overheads.

#### 3.2.3 AMS: Base Router Control

The AMS control mechanism is distinctively separated into two controls viz. for BR components and WI components in HR. The BR control is operated based on router utilization estimates described in section 3.2.2 and WI control is operated as per availability of data to transmit/receive.

The BR control operation using utilization estimates is shown in Table 3.2. At the start of each epoch, quantized utilization estimates from neighboring routers are received and the total utilization is computed as shown in Table 3.2. The total router utilization for an epoch is also quantized, and the level into which the current router utilization falls is used to determine the supply voltage. The thresholds for each level are determined by the user based on the applications. For our simulations, we have considered four utilization levels under MU: 0-5% (L<sub>1</sub>), 5-25% (L<sub>2</sub>), 25-75% (L<sub>3</sub>) and above 75% (L<sub>4</sub>). Hence, the quantized utilization is represented as a two-bit utilization level.

**Table 3.2 AMS controller mechanism on routers**

```
Algorithm: Pseudo code of AMS controller mechanism on routers
Initial: Pre-computed global utilization: HU, MU, RU
    Supply_Routers_RU ← 0V
    Supply_Routers_HU ← 1V
    Set router states and utilization levels
AMS Control: for each epoch
    Utilization Estimate (UE) ← Stochastic_Process_Model
    Upstream Load ← Each_port (UE_i in Equation 3.6)
    Input Load ← Input buffers occupied (T_Q in Equation 3.6)
    Total Load (TUE) ← Upstream Load + Input Load
    if (TUE > Current Utilization):
        Voltage ↑
    else if (TUE < Current Utilization):
        Voltage ↓
    else:
        No Change
```

Once the router utilization is resolved, the supply voltage is determined as shown in Table 3.2. If the utilization for the next epoch is higher than the current utilization, the supply voltage is increased, and vice-versa. If the utilization estimate is the same as the current utilization, the supply voltage remains the same. Based on AMS control operation of BR components, the state transitions between different voltages depending on the utilization estimate are shown in Figure 3.2. For the current implementation, we considered four supply voltage states: low power (S<sub>LP</sub>), normal (S<sub>N</sub>) and high performance (S<sub>HP</sub>) states along with the power gated state (S<sub>PG</sub>). Once the system is powered, all the routers start in the normal state except RU and HU routers, which are operated using pre-computed global utilization and are not controlled dynamically. HU and RU routers are provided with high and low supply voltage respectively and remain under those voltages for the entire application duration. As the system continues to run, utilization estimates (for MU routers) are made for each epoch, and the controller scales the voltage up or down according to a high or low utilization estimate. The voltage state transitions based on the utilization estimate and current utilization can be determined from the state machine in Figure 3.2.

Figure 3.2 State machine diagram of AMS controller
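The per-epoch decision for an MU router can be sketched as follows. This is an illustration under stated assumptions, not the controller from the thesis: the half-open level boundaries, the ordering of the four states, and the one-step, saturating transition policy are assumptions, since the exact transitions are defined by the state machine figure. The names `VoltageState`, `quantize_utilization` and `next_state` are hypothetical.

```python
from enum import IntEnum


class VoltageState(IntEnum):
    S_PG = 0  # power gated
    S_LP = 1  # low power
    S_N = 2   # normal
    S_HP = 3  # high performance


# Upper bounds of levels L1..L3 as fractions; anything above the last is L4.
LEVEL_BOUNDS = (0.05, 0.25, 0.75)


def quantize_utilization(utilization: float) -> int:
    """Map a utilization fraction in [0, 1] to a two-bit level 0..3 (L1..L4)."""
    for level, bound in enumerate(LEVEL_BOUNDS):
        if utilization < bound:  # assumed half-open intervals
            return level
    return len(LEVEL_BOUNDS)


def next_state(state: VoltageState, current_level: int, estimated_level: int) -> VoltageState:
    """Move one state up or down, saturating at the ends (assumed policy)."""
    if estimated_level > current_level:
        return VoltageState(min(state + 1, VoltageState.S_HP))
    if estimated_level < current_level:
        return VoltageState(max(state - 1, VoltageState.S_PG))
    return state


# A router in the normal state whose estimate rises from 10% to 60% utilization.
state = next_state(VoltageState.S_N, quantize_utilization(0.10), quantize_utilization(0.60))
print(state.name)
```

Comparing quantized levels rather than raw flit counts keeps the comparison to two bits per router, which matches the two-bit utilization level described above.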

#### 3.2.4 AMS: WI Control

The idle-state power consumption of WIs in WNoCs is reduced by power gating WI components in the AMS scheme. A typical WI consists of (i) serializer/deserializer buffers that interface data between the WI and the remaining router components, (ii) a modulator/demodulator to modulate the packet data stream to the desired transmission frequency, (iii) a PA and an LNA to amplify the transmitted and received signals respectively, and (iv) an antenna to transmit or receive data. The antennas at all WIs are assumed to be broadcast capable. Furthermore, the wireless medium is shared among all WIs as they all operate at the same frequency. A token passing mechanism with round-robin arbitration is adopted from [14] to control access to the medium. Of all the WI components, the PA and LNA are the most power hungry and consume more than 60% of WI power; hence we power gate only these components. Figure 3.3 shows the block-level representation of the power-gated LNA and PA. The design of the LNA and PA is adopted from [58]. Two Power Gating Switches (PGSs), one each for the LNA and PA, are used to power gate them individually. Two



Figure 3.3 Schematic of power-gated LNA and PA with AMSC

switches are operated based on the availability of the token and of data to be transmitted at the WI. Since a single antenna is used for both transmission and reception, only one of the LNA and PA is active at any given time.

Initially, the PA and LNA in all WIs are kept in sleep mode. At any HR, when the routing strategy decides to use the WI and the associated WI holds the token to transmit data, its PA is brought to the active state as shown in Table 3.3. The wake-up operation of the power amplifier overlaps with the switch traversal operation. However, the wake-up latency of the LNA and the additional WI address bits add overheads to the overall operation, which are quantified at the end of this section. When the RC at the HR decides to use the wireless link for packet transfer, the arbiter grants access to the WI according to the availability of the token. AMS uses this grant signal to wake up the PA. While the PA is being brought to the active state, the packet traverses the crossbar to reach the WI buffers (switch traversal stage of the router pipeline). Since the wake-up latency is smaller than the clock period (values discussed in the results section), the PA is active before the packet reaches the WI and the packet is transmitted without additional delay. The transmitted signal is received at the destination WI through the shared wireless medium. At the receiving WI, the LNA is brought to the active state as soon

Table 3.3 AMS controller mechanism on routers

```
Algorithm: AMS Controller mechanism on routers
LNA and PA control:
Initial assumption: All PA and LNA are power gated.
    Assign token to WI_i
TX: while (Request for output)
        if (token available && Arbiter_Grant_WI == 1)
            SupplyPA ← Vdd;
            Transmit data using shared wireless medium
            Continue until whole message transferred
            if (packet transfer complete)
                Pass token after tail flit is transmitted
                SupplyPA ← 0V;
                Arbiter_Grant_WI = 0;
            end if
        else
            Wait for token
        end if
    end while
RX: while (Rx power > noise threshold)
        SupplyLNA ← Vdd;
        Decode WI Address
        if (Received Address != WI Address)
            SupplyLNA ← 0V;
        if (packet transfer complete)
            SupplyLNA ← 0V;
    end while
```



Figure 3.4 Format of data transmission over wireless link

as the signal is detected at the antenna. To differentiate a valid received signal from channel noise, a comparator [59] is used, whose threshold is set to the average of the signal strength between the two farthest-apart nodes and the worst-case noise floor of the wireless channel. The received packet is then demodulated and deserialized through the WI components at the receiving HR. All components in both the transmitting and receiving WIs are kept active during data transmission. The receiving WI is put in sleep mode once the tail flit transmission is complete. Because the antennas in the WNoC topology are broadcast capable and all WIs operate at the same frequency, the LNAs at all WIs are brought to the active state on detecting a valid transmission. This results in unnecessary energy consumption in each LNA, as it is kept awake until the header is resolved by the RC unit. Furthermore, packet transmission over wireless links is serial, and all bits received at the antenna during the wake-up period of the LNA are lost. To avoid data loss, we add redundant dummy bits at the start of the packet as shown in Figure 3.4. The number of redundant bits added equals the product of the wake-up latency and the bit rate of the transmitter. Since the wake-up latency is very small, the additional bit overhead is very low at mm-wave frequencies.

The redundant energy consumption in LNAs is reduced by minimizing the packet processing time needed to resolve the desired receiver, so that the LNAs of all other WIs can be turned off sooner. For this purpose, a unique address is assigned to each WI in AMS and is used to resolve the destination WI. The unique address of the desired receiver WI is inserted between the dummy bits and the actual packet data to be transmitted, as illustrated in Figure 3.4. At the receiving WI, the received bits are decoded for this WI address using a pattern matching decoder. If the address matches, the remaining data is received. If there is a mismatch, packet reception stops and the LNA is put back to sleep. The pattern matching decoder operates at the serial data stream frequency and starts decoding the WI address once the LNA is active.

The number of bits required for the WI address is $r = \log_2 N_{WI} + 1$, which is small because $N_{WI}$, the number of WIs, is significantly less than the total number of routers. To indicate the start of the WI address bits and differentiate them from dummy bits, we set the MSB of every WI address to '0' and use an all-'1' pattern for the dummy bits. Although the dummy bits and WI address add bit overhead to packet transmission, they avoid data loss and excessive energy consumption with little control overhead.
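The framing described above can be sketched in a few lines. This is an illustrative model on bit strings, not the pattern matching hardware: the function names are hypothetical, $\log_2 N_{WI}$ is rounded up so that non-power-of-two WI counts also work (an assumption), and the numeric values in the usage lines are placeholders rather than measured parameters.

```python
import math


def address_bits(n_wi):
    """r = log2(N_WI) + 1: a leading '0' followed by the WI index bits."""
    return (n_wi - 1).bit_length() + 1  # ceil(log2(n_wi)) + 1


def dummy_bits(wakeup_latency_s, bit_rate_bps):
    """m = wake-up latency x bit rate, rounded up to whole bits."""
    return math.ceil(round(wakeup_latency_s * bit_rate_bps, 6))


def build_frame(dest_wi, n_wi, m, payload):
    """Dummy bits (all '1'), then the WI address (MSB '0'), then the payload."""
    r = address_bits(n_wi)
    return "1" * m + format(dest_wi, f"0{r}b") + payload


def decode(received, my_wi, n_wi):
    """Return the payload on an address match, else None (LNA back to sleep)."""
    r = address_bits(n_wi)
    start = received.find("0")  # the first '0' marks the start of the address
    if start < 0 or len(received) < start + r:
        return None
    address = int(received[start:start + r], 2)
    return received[start + r:] if address == my_wi else None


if __name__ == "__main__":
    m = dummy_bits(0.25e-9, 16e9)       # placeholder latency and bit rate
    frame = build_frame(2, 4, m, "1011")
    # The receiver may lose leading dummy bits while its LNA wakes up.
    print(decode(frame[3:], my_wi=2, n_wi=4), decode(frame, my_wi=1, n_wi=4))
```

Because the dummy pattern is all '1' and every address starts with '0', the decoder still finds the address when some dummy bits are lost during LNA wake-up.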

The total time required to complete packet transmission, considering all control bits, using wireless links is

$$T_{\text{max}} = (r + m + N_{flits} \times S_{flit})\, t_{bit} + t_{propagation} \tag{3.7}$$

where $N_{flits}$ is the number of flits in a packet, $S_{flit}$ is the flit size in bits, $m$ is the number of dummy bits, $r$ is the number of WI address bits, $t_{bit}$ is the bit duration and $t_{propagation}$ is the propagation delay of the wireless channel. In case of a WI address mismatch, packet reception stops within the first $r+m$ bits. The time taken to resolve the WI address with pattern matching is at most $r \times t_{bit}$. In comparison, without a WI address, the entire header flit has to be received ($S_{flit} \times t_{bit}$) and resolved by the RC (which takes one clock cycle). For example, the worst-case duration before the LNA is turned off with a WI address is 0.35ns, while the best case without WI address bits is 2ns for the experimental setup and specifications mentioned in Section 3.4. Hence, the active period of the LNA and the corresponding energy consumption are reduced considerably by adding WI address bits.
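As a worked illustration of Equation 3.7, the short sketch below evaluates the total transmission time and the two address-resolution bounds discussed above. The function names are hypothetical, and the values of `r`, `m`, `t_bit` and the propagation delay in the usage lines are placeholders chosen only to exercise the formula; only the flit size and packet length come from Table 3.5.

```python
def wireless_packet_time(r, m, n_flits, s_flit, t_bit, t_propagation):
    """Equation 3.7: T_max = (r + m + N_flits * S_flit) * t_bit + t_propagation."""
    return (r + m + n_flits * s_flit) * t_bit + t_propagation


def resolve_time_with_address(r, t_bit):
    """Upper bound on the time to resolve the receiver by pattern matching."""
    return r * t_bit


def resolve_time_without_address(s_flit, t_bit, t_clock):
    """Whole header flit received, then one clock cycle in the RC unit."""
    return s_flit * t_bit + t_clock


if __name__ == "__main__":
    t_bit = 50e-12    # placeholder bit duration
    t_clock = 0.4e-9  # one cycle at 2.5GHz
    total = wireless_packet_time(r=3, m=4, n_flits=64, s_flit=32,
                                 t_bit=t_bit, t_propagation=0.1e-9)
    print(f"T_max = {total * 1e9:.2f} ns")
    print(f"with address: <= {resolve_time_with_address(3, t_bit) * 1e9:.2f} ns")
    print(f"without address: {resolve_time_without_address(32, t_bit, t_clock) * 1e9:.2f} ns")
```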

Once the packet transmission is complete, the comparator output at all WIs goes low as no valid signal is present in the wireless medium. The transmitting WI then notifies the next WI in the round-robin order by transmitting the token as shown in Figure 3.4. The token data is the address of the next WI. At the receiving WI, the output of the pattern matching circuit remains high while the comparator output goes low (after the data is transmitted), indicating that the received data is a token. Upon receiving the token, the next WI turns on its PA and starts transmitting. The main advantage of transmitting the token wirelessly is reduced control wiring overhead and signaling time. The AMS mechanism, with the proposed power gating control, reduces idle-state power consumption in WI components without adding significant control overhead.
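The token hand-off can be modeled with a small round-robin sketch. It is a behavioral illustration under simple assumptions (WIs numbered 0 to N−1, WI 0 holding the token initially); `TokenRing` and `is_token` are hypothetical names, not part of the implemented design.

```python
class TokenRing:
    """Round-robin token arbitration among n_wi wireless interfaces."""

    def __init__(self, n_wi):
        self.n_wi = n_wi
        self.holder = 0  # assumption: WI 0 holds the token at start-up

    def can_transmit(self, wi):
        """Only the token holder may wake its PA and transmit."""
        return wi == self.holder

    def pass_token(self):
        """After the tail flit, send the address of the next WI as the token."""
        self.holder = (self.holder + 1) % self.n_wi
        return self.holder


def is_token(address_match, comparator_high):
    """Address matched and the medium then went quiet: the frame was a token."""
    return address_match and not comparator_high


if __name__ == "__main__":
    ring = TokenRing(4)
    print([ring.pass_token() for _ in range(5)])
```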

#### 3.2.5 Routing Strategy

To avoid deadlocks and livelocks and to minimize multi-hop communication in the WNoC architecture, we adopt the routing strategy from [6]. It also avoids cyclic dependencies for wireless links. With all routers in the network arranged in an $M \times N$ matrix, the positions of source $i$ and destination $j$ are $(x_1, y_1)$ and $(x_2, y_2)$ respectively. The WI positions nearest to the source and destination routers are denoted $WI_S=(p,q)$ and $WI_D=(r,s)$ respectively. Data transfer between routers $i$ and $j$ is governed by two conditions. The hop count between source and destination is computed both as the Manhattan distance $(D_M)$ and as the distance using wireless links $(D_{WI})$, where flit transmission over a wireless link counts as one hop. First, if $D_M$ is less than $D_{WI}$, the packet is routed over wired links using XY routing. Otherwise, we check whether a WI is at most two hops away from the source router. If so, a hybrid path (a combination of wired and wireless links) is used to route the packet using South-Last routing; otherwise the packet is transferred through wired links using default XY routing. The second condition is imposed to reduce traffic at the WIs and avoid congestion. The South-Last routing strategy is shown to be deadlock-free in the presence of wireless links [6] and is hence adopted in this work.
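The two conditions can be summarized in a short decision sketch. It is illustrative only: it assumes that $D_{WI}$ is the wired hop count from the source to $WI_S$, plus one wireless hop, plus the wired hop count from $WI_D$ to the destination, which is one plausible reading of the description. The function names and returned labels are hypothetical.

```python
def manhattan(a, b):
    """Hop count between two mesh coordinates over wired links."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def choose_route(src, dst, wi_s, wi_d, max_wi_hops=2):
    """Select the wired XY path or the hybrid South-Last path for a packet."""
    d_m = manhattan(src, dst)
    # Assumption: wired hops to WI_S + one wireless hop + wired hops from WI_D.
    d_wi = manhattan(src, wi_s) + 1 + manhattan(wi_d, dst)
    if d_m < d_wi:
        return "wired-XY"
    if manhattan(src, wi_s) <= max_wi_hops:
        return "hybrid-South-Last"
    return "wired-XY"


if __name__ == "__main__":
    print(choose_route((0, 0), (7, 7), wi_s=(1, 1), wi_d=(6, 6)))
    print(choose_route((0, 0), (1, 0), wi_s=(1, 1), wi_d=(6, 6)))
```

The `max_wi_hops` limit mirrors the second condition: a distant WI is not used even when the wireless path is shorter, which keeps traffic at the WIs low.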

# 3.3 Hardware Implementation

In this section, we discuss the hardware-level implementation of the AMS controller, on-chip voltage regulator and voltage level shifter required for the proposed mechanism.

#### 3.3.1 Adaptive Multi-Voltage Scaling Controller

The proposed AMS Controller (AMSC) consists of a Utilization Computation Unit (UCU), state estimation unit, comparator and Power Gating (PG) control unit as shown in Figure 3.1. The UCU comprises a utilization prediction model along with predictor and update units. The prediction model holds the flit and buffer occupancy model parameters used to predict router utilization; these parameters are updated in each epoch. The predictor unit uses these model parameters to compute utilization estimates for all neighboring routers at the start of each epoch. The update unit updates the model using observed values of past utilization characteristics. The utilization estimates are communicated as quantized 2-bit *Pilot\_out* signals to the corresponding neighboring routers. Similarly, the AMSC at each router receives its upstream load estimates as 2-bit *Pilot\_in* signals from neighboring

routers. Hence, AMS mechanism requires a total of 16-bit wiring overhead for communicating utilization estimates with neighboring routers. Using utilization estimates, state estimation unit determines the operating voltage for next epoch as described in section 3.2.3. Based on the output of state estimation unit, AMSC sends *CNTRL\_MV* signal to voltage regulator circuit to set appropriate voltage. *CNTRL\_MV* is also used to control level shifter circuit at the input ports of the router as illustrated in Figure 3.1.
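A minimal sketch of the 2-bit quantization and the resulting wiring count is given below. The uniform bin boundaries are an assumption made for illustration, since the actual thresholds are not stated here, and the 16-bit figure is reproduced under the reading of four neighbors with one 2-bit *Pilot\_out* and one 2-bit *Pilot\_in* each. Function names are hypothetical.

```python
def quantize_2bit(utilization):
    """Map a utilization estimate in [0, 1] to a 2-bit level (0..3).

    Assumption: four uniform bins; the real thresholds may differ.
    """
    clipped = min(max(utilization, 0.0), 1.0)
    return min(int(clipped * 4), 3)


def pilot_wiring_bits(n_neighbors=4, bits_per_signal=2):
    """Pilot_out to and Pilot_in from every neighboring router."""
    return 2 * n_neighbors * bits_per_signal


if __name__ == "__main__":
    print([quantize_2bit(u) for u in (0.0, 0.3, 0.5, 0.75, 1.0)])
    print(pilot_wiring_bits())
```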

The PG control unit and comparator control the power gating of the PA and LNA components. The PG control for the PA takes the arbiter grant signal as input. Based on the RC decision and the arbiter's grant signal for the WI, AMSC generates the appropriate *PG\_PA* signal to control the PA. The PG control for the LNA receives input from the comparator, which is associated with the receiver front-end, and from the WI address decoding circuit. At a low-to-high transition of the comparator output, AMSC drives *PG\_LNA* to wake up the LNA. As soon as the WI address decoder completes its operation, AMSC puts the LNA to sleep if the decoder output is low. For both *PG\_PA* and *PG\_LNA*, an active-high value indicates that the respective component is power gated.
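The gating conditions above reduce to simple combinational logic, sketched here as Boolean functions. This is a behavioral illustration with hypothetical signal names as arguments; it follows the active-high convention (True means power gated) and treats completion of the packet transfer as an additional condition for gating the LNA, as in Table 3.3.

```python
def pg_pa(rc_selects_wi, arbiter_grant_wi):
    """PG_PA: True (gated) unless the RC chose the WI and the arbiter granted it."""
    return not (rc_selects_wi and arbiter_grant_wi)


def pg_lna(comparator_high, decode_done, address_match, transfer_complete):
    """PG_LNA: True (gated) when no valid signal, on mismatch, or after the tail flit."""
    if not comparator_high:
        return True   # no valid signal on the medium
    if decode_done and not address_match:
        return True   # packet is addressed to another WI
    if transfer_complete:
        return True   # tail flit received
    return False      # keep the LNA awake


if __name__ == "__main__":
    print(pg_pa(True, True), pg_lna(True, True, False, False))
```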

#### 3.3.2 On-Chip Voltage Regulator

The voltage scaling mechanism of proposed AMS requires generation and



Figure 3.5 Hybrid switched inductor and switched capacitor voltage regulator for multi-voltage scaling

**Table 3.4 Characteristics of Voltage Regulator** 

| Characteristics     | Voltage level V1 | Voltage level V2 | Voltage level V3 |
|---------------------|------------------|------------------|------------------|
| Transient time (ns) | 38               | 43               | 39               |
| Efficiency (%)      | 68.9             | 66.0             | 67.4             |
| Energy (nJ)         | 0.398            | 0.30             | 0.183            |

assignment of multiple voltages by an on-chip voltage regulator. Besides generating the necessary voltages, it is challenging to supply voltages independently to each router and to transition quickly between voltage levels. A single chip-level regulator requires a complex power distribution network for all voltage levels, whereas implementing a voltage regulator with each router incurs significant area overhead. To balance area and routing overheads, we implement one Voltage Regulator Unit (VRU) per cluster (group of neighboring routers) as shown in Figure 3.1. The VRU circuit is adopted from [54] and provides high power conversion efficiency. We use a Switched Inductor and Switched Capacitor (SISC) voltage regulator, shown in Figure 3.5, to generate the voltage levels. The output voltage of the circuit is regulated by controlling the duty cycle of the clock. It requires a smaller inductor than a traditional SI buck regulator, thereby reducing area overhead, while still supporting dynamic voltage scaling. Switches (P1/N1, P2/N2, P3/N3, and P4/N4) are clocked at 200MHz and 400MHz to divide the supply voltage VDD. Switch NP/NN is mainly used to control the upper and lower voltage levels depending on the required output voltage. The VRU generates four primary voltages in our implementation: V1 (1V) for the S<sub>HP</sub> state, V2 (0.9V) for the S<sub>N</sub> state, V3 (0.8V) for the S<sub>LP</sub> state and V4 (0V) for cutting off the power supply. The transient time, efficiency, and energy overhead of the voltage regulator are listed in Table 3.4. The efficiency is higher for larger output voltages, and the switching time depends mainly on the initial and final voltage levels. The four primary voltage levels from the VRU are routed to all routers in the cluster. At each router, AMSC selects and switches to the appropriate voltage level for the next epoch.

#### 3.3.3 Voltage Level Shifter

AMS mechanism allows dynamic scaling of voltage at router level and so requires control to handle data from one voltage level to another for data transfer between



Figure 3.6 Level shifter with AMSC

neighboring routers. For this purpose, we integrate a voltage level shifter at the input of every port in the router. Once the supply voltage for the router is set at the start of an epoch, AMSC configures the level shifter to convert between the incoming voltage and the router voltage. The level shifter is implemented as a differential cascode voltage switch circuit as shown in Figure 3.6. It forms a half-latch from two upper PMOS transistors (MP1 and MP2) and a pair of NMOS transistors (MN1 and MN2), which are controlled by input signals $A$ and $\bar{A}$. When the input signal $A$ goes from low to high, MN1 is turned on, and as $\bar{A}$ goes from high to low, MN2 is turned off. As a consequence, the voltage at node P1 is pulled down, turning MP2 on. Once MP2 is on, node P2 is charged. Thus, the voltage level is passed to P2, which is the output of the level shifter. This output voltage is the input of the transmission gate (TG1), which passes the voltage level based on the control signal from AMSC. When AMSC sends the combination 00 ($B_1B_2$), voltage V1 is passed to the next router, and so on, as presented in Figure 3.7. Four transmission gates are required for the four voltage levels V1, V2, V3 and V4. Depending on the present status of the routers, AMSC decides the appropriate voltage level. This level shifter with TG is placed at each I/O port of every router. In total, four level shifters are needed to convert the voltage levels from all neighboring routers to the router's voltage



Figure 3.7 Level shifter controlling mechanism

level.
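The transmission-gate selection can be written as a lookup on the control bits $B_1B_2$. Only the mapping of 00 to V1 is stated above; the remaining three entries are assumed here to follow in binary order, and the names `TG_SELECT`, `LEVEL_V` and `select_level` are hypothetical.

```python
# Control bits (B1, B2) -> voltage level passed by the transmission gates.
# Only (0, 0) -> V1 is given in the text; the other entries are assumed.
TG_SELECT = {(0, 0): "V1", (0, 1): "V2", (1, 0): "V3", (1, 1): "V4"}

# Voltage levels from Section 3.3.2.
LEVEL_V = {"V1": 1.0, "V2": 0.9, "V3": 0.8, "V4": 0.0}


def select_level(b1, b2):
    """Return the level name and its voltage for a control combination."""
    name = TG_SELECT[(b1, b2)]
    return name, LEVEL_V[name]


if __name__ == "__main__":
    for bits in sorted(TG_SELECT):
        print(bits, select_level(*bits))
```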

# 3.4 Experiments

In this section, we discuss the detailed implementation, overheads, and performance benefits of our proposed architectures as compared to a traditional mesh NoC and a wireless NoC. We also compare our proposed architectures with other recently proposed energy-efficient NoC/WNoC architectures.

**Table 3.5 Simulation Setup** 

| Parameter       | Value                                                    |
|-----------------|----------------------------------------------------------|
| Topology        | 8×8 and 16×16 Mesh NoC, 8×8 and 16×16 WNoC architectures |
| Routing         | XY for baseline, XY and South-Last for WNoCs             |
| Pipeline        | 4 stages                                                 |
| Buffer depth    | 4 for base routers and 8 for hybrid routers              |
| Flit size       | 32 bits                                                  |
| Packet size     | 64 flits                                                 |
| Clock frequency | 2.5GHz                                                   |
| Workload        | Synthetic and real                                       |

**Table 3.6 Representation of Topologies/Components** 

| Topologies/components        | Representation |
|------------------------------|----------------|
| Electrical Mesh NoC          | Mesh           |
| Electrical Mesh NoC with AMS | amsMesh        |
| Regular Wireless NoC         | WNoC           |
| Wireless NoC with AMS        | amsWNoC        |
| Power gated PA               | pgPA           |
| Power gated LNA              | pgLNA          |
| Non-power gated PA           | npgPA          |
| Non-power gated LNA          | npgLNA         |

#### 3.4.1 Experimental Setup

We modeled and simulated the NoC architectures using the cycle-accurate Noxim simulator [48]. Full system simulation of *PARSEC* [45] and *SPLASH-2* [46] benchmarks using Graphite [50] is used to generate the traces that are fed into the NoC simulator. We run the application specific real traffic for preliminary analysis of routers. Systems with 64 and 256 cores are used for all evaluations and system specifications are presented in Table 3.5. The die area is kept fixed at 20mm×20mm for all system sizes. The width of all wired links is same as flit size. We have adopted wormhole routing. The routers are input buffered with each port comprised of 4 VCs. Each VC has a depth of 4 flits. The ports associated with the WIs have an increased buffer depth of 8 flits [58]. The NoC switches are driven with a 2.5GHz clock. The network switches and adaptive multi-voltage scaling controller are synthesized with Synopsys Design Compiler using 28nm process technology. In this work, we also implement the voltage regulator and level-shifter circuits, power gated PA and LNA using Cadence tools at 65nm process technology. Throughout our evaluation, we refer to multiple topologies and network elements as shown in Table 3.6.

#### 3.4.2 AMSC-based Router Implementation

To implement AMS scheme, we integrate each router with additional components, viz., AMSC and level shifter per port. AMSC and four level shifters together occupy 1006.83μm² per router. The total area requirement for modified base router (including control units, buffers, crossbar, arbiter, RC, and VCs) is 9969.59μm² and the area overhead associated with AMS is 8.63% of the base router. The component-wise area and power

**Table 3.7 Router Implementation** 

| Components     | Area (μm²) | Power (µW)             |
|----------------|------------|------------------------|
| AMS controller | 860.83     | 42.84                  |
| Router         | 8962.76    | 409                    |
| Comparator     | 191.47     | 146                    |
| Level shifter  | 36.5       | 6.3 × 10<sup>-3</sup>  |
| PG control     | 100.18     | 3.16                   |

numbers of the router with AMS control are shown in Table 3.7. The total power consumption of all BR components combined is $409\mu W$. The AMS controller, with its UCU and state estimation units, consumes a total power of $42.84\mu W$ for all its operations. The level shifter at each port consumes 6.3nW while converting voltage from one level to another. Hence, the AMS control mechanism incurs a power overhead of about 10% over a regular BR. The areas of the transceiver circuits for the transmitter $(T_x)$ and receiver $(R_x)$ are $90\times10^3\mu m^2$ and $70\times10^3\mu m^2$ respectively [19]. Therefore, the total area of a hybrid router with control units is $160.29\times10^3\mu m^2$. The AMSC, level shifter and regulator circuits add less than 1% silicon area overhead to a hybrid router.
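The area and power totals quoted above follow directly from the per-component values in Table 3.7, as the short calculation below illustrates. The dictionary and variable names are hypothetical, and the assumption that four level shifters are counted per router is taken from Section 3.4.2.

```python
# Per-component values from Table 3.7 (area in um^2, power in uW).
AREA_UM2 = {"ams_controller": 860.83, "router": 8962.76, "level_shifter": 36.5}
POWER_UW = {"ams_controller": 42.84, "router": 409.0, "level_shifter": 6.3e-3}

N_LEVEL_SHIFTERS = 4  # one per port, four per router

# AMSC plus four level shifters, and the resulting modified base router area.
ams_area = AREA_UM2["ams_controller"] + N_LEVEL_SHIFTERS * AREA_UM2["level_shifter"]
modified_router_area = AREA_UM2["router"] + ams_area

# Power overhead of the AMS control relative to the base router.
ams_power = POWER_UW["ams_controller"] + N_LEVEL_SHIFTERS * POWER_UW["level_shifter"]
power_overhead_pct = 100.0 * ams_power / POWER_UW["router"]

if __name__ == "__main__":
    print(f"AMSC + level shifters: {ams_area:.2f} um^2")
    print(f"Modified base router:  {modified_router_area:.2f} um^2")
    print(f"Power overhead:        {power_overhead_pct:.1f} %")
```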

We have implemented the stochastic model parameters. All parameters are updated from the previous set of corresponding values. Hence, all update and estimate operations are simple additions and multiplications. Moreover, only five estimates are calculated at each router, so the overall area, power and latency overheads are small. For each router, from Equation 3.1, the total number of operations is six multiplications and six additions. The update of the estimation parameters is not on the data path of router operation and hence does not incur additional delay overhead.

#### 3.4.3 Power Gated LNA and PA Implementation

In this section, we discuss the implementation of power gated LNA and PA in detail. Both PA and LNA are designed to operate at frequency of 58GHz. The gain and noise figure performance of power gated LNA (pgLNA) and non-power gated LNA (npgLNA) are shown in Figure 3.8. At the design frequency, peak gain and noise figure of npgLNA are 10dB and 2.8dB respectively. The implementation of power gating with LNA shows a small deviation in performance as observed from Figure 3.8. The peak gain of



Figure 3.8 Gain and noise figure analysis for pgLNA and npgLNA

pgLNA remains the same as that of npgLNA at and below the design frequency; however, it starts falling at higher frequencies. This results in a small decrease in bandwidth: the 3dB bandwidth of pgLNA is 18GHz at the 58GHz operating frequency. The bandwidth deviation occurs due to the increase in parallel capacitance contributed by the PMOS switch shown in Figure 3.9, which results in an increased quality factor. Similarly, the noise figure of pgLNA increases to 3dB because the PMOS device adds noise.

The PA is designed to operate at 58GHz with 1V supply voltage. The S-parameter analysis of npgPA and pgPA is illustrated in Figure 3.10. The peak gain of non-power gated PA is 12dB with 3dB bandwidth of 16GHz. The gain reduces to 11dB after applying power gating scheme. *IR* drop across the sleep transistor reduces the voltage swing, which impacts the linearity of the amplifier. The total power consumed by LNA/PA is 10mW.



Figure 3.9 PMOS switches with block level LNA/PA



Figure 3.10 S-parameter analysis for pgPA and npgPA

Figure 3.11 shows the $S_{21}$ variation of the PA across Temperature (Temp) and Process (P) variations. At a temperature of 125°C and the SS process corner, which is the worst case, the gain of the PA degrades from 10dB (25°C) to 8dB (125°C). This result shows that the gain does not change significantly even under worst-case conditions.

Performance deviations of both the LNA and PA due to power gating depend on the sizing of the PMOS switch. The PMOS switch operates in the linear region during the active state of the components and adds linear resistance and capacitance to the original design. However, with proper sizing, its effects are minimized at the cost of an area penalty. For our LNA/PA design, the sleep transistor is sized at ten times the NMOS transistors to achieve this.

In the biasing circuit, a capacitor is inserted at the virtual supply node to avoid the impact of RF signals on the supply voltages. To reduce $R_{on}$, a large PMOS switch is used. However, in this scenario, a high peak current (inrush current) flows for high $V_{ds}$. This current can push devices out of their safe operating area (SOA). It is reduced by using two PMOS switches of different sizes in parallel as shown in Figure 3.9. These switches are controlled by two input control signals ($V_{AL}$ and $V_{BS}$) with a small delay between them to avoid the high inrush current. This adds some delay during wake-up, but it reduces the inrush current significantly. The LNA and PA components of a WI consume 20mW of the 32mW total transceiver power. pgPA and pgLNA reduce the power



Figure 3.11 S-parameter analysis across the process and temperature variations

consumption significantly in WNoC architecture. Hence, total power consumption for PA and LNA component reduces up to 62.50% of a single hybrid router.

#### 3.4.4 Comparison of Energy and Energy-delay Product

Figure 3.12 shows the total packet energy with uniform random traffic for a 64-core system. The energy values for the different architectures are normalized to that of the baseline Mesh. The results shown in the figure are for a packet injection rate of 0.3 flits/cycle. At this injection rate, the throughput of all architectures is saturated. From the figure, it can be seen that amsMesh reduces energy consumption by 22% as compared to Mesh. The energy savings are achieved through the efficient voltage scaling of the AMS mechanism. WNoC and amsWNoC topologies consume significantly less energy than mesh-based topologies due to the improvements provided by WIs. Total packet energy for amsWNoC is 32% less as



Figure 3.12 Normalized energy of 64-core system under uniform random traffic pattern with respect to baseline Mesh topology.



Figure 3.13 (a) Normalized delay and (b) normalized energy-delay product of 64-core system under uniform random traffic pattern with respect to baseline Mesh topology.

compared to WNoC architecture. The improvement in energy is achieved by power gating of WIs along with voltage scaling of base routers. The normalized packet delay for the different topologies with uniform random traffic is shown in Figure 3.13(a). Voltage scaling of the base routers in amsMesh does not result in performance degradation, as shown in the figure. With amsWNoC, although the performance degrades as compared to the WNoC topology, it is still better than that of the Mesh topology. Figure 3.13(b) shows the Energy-Delay Product (EDP) of the different NoC architectures for the uniform random traffic pattern. It can be observed that the EDP of amsWNoC is still better than that of the WNoC topology, even though amsWNoC has lower performance. amsWNoC has 89.91%, 82.96% and 33.05% lower EDP compared to the Mesh, amsMesh and WNoC topologies respectively. As the frequency remains constant and dynamic energy has a quadratic relationship with supply voltage, amsWNoC provides lower EDP than the regular Mesh and WNoC architectures.
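The two relationships used in this comparison, EDP as the product of energy and delay and the quadratic dependence of dynamic energy on supply voltage, can be illustrated with a short sketch. The function names are hypothetical and the normalized energy and delay values in the usage lines are placeholders, not the measured results of Figures 3.12 and 3.13.

```python
def edp(energy, delay):
    """Energy-delay product of one architecture."""
    return energy * delay


def reduction_pct(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


def dynamic_energy_scale(v_new, v_ref):
    """Dynamic energy ratio at constant frequency: (V_new / V_ref)^2."""
    return (v_new / v_ref) ** 2


if __name__ == "__main__":
    # Placeholder normalized values: lower energy, slightly higher delay.
    baseline = edp(energy=1.0, delay=1.0)
    proposed = edp(energy=0.5, delay=1.2)
    print(f"EDP reduction: {reduction_pct(baseline, proposed):.1f} %")
    # Scaling a router from V1 (1V) to V2 (0.9V) or V3 (0.8V).
    print(f"{dynamic_energy_scale(0.9, 1.0):.2f} {dynamic_energy_scale(0.8, 1.0):.2f}")
```

The placeholder case shows how EDP can improve even when delay increases, provided the energy reduction is large enough.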



Figure 3.14 Packet energy saving in percentage with AMS over non-AMS architecture under application-specific and synthetic traffic

#### 3.4.5 Energy Saving

To evaluate the energy savings provided by the AMS scheme, we evaluate it for 64- and 256-core systems. To obtain characteristics for both computation- and communication-intensive traffic patterns, we consider benchmarks from *PARSEC* [45] and *SPLASH-2* [46] along with synthetic traffic patterns. The overall packet energy savings achieved by the proposed AMS-based architectures for the different benchmarks are presented in Figure 3.14. The results presented are the energy saved with amsMesh over Mesh and with amsWNoC over WNoC at both system sizes. On average across all benchmarks, the proposed technique saves 32% for the 64-core and 56% for the 256-core system in WNoC topologies. In Mesh, it saves 26% for the 64-core and 51% for the 256-core system on average as compared with the baseline architectures. From the figure, it can be observed that the energy savings attained for the 256-core system are higher than those for the 64-core system for both Mesh and WNoC.

We present a summary of our proposed technique along with recently proposed power/energy-efficient NoC architectures as shown in Table 3.8 (all metrics are compared with those of baseline architectures). As can be seen from the table, the proposed technique performs equally in terms of overall network energy savings and energy-delay product as compared to dynamic scaling techniques of [55] and [56]. Our proposed technique, with only voltage scaling, achieves comparable results and also further reduces idle-state power consumption in WIs. The power gating control of AMS scheme for WIs achieves high

Table 3.8 Summary of Existing and Proposed Power/Energy-efficient NoC Architectures

| Ref.     | Approaches                           | Configurations | Power gated components | Advantages | Power/Energy saving (%) | Penalty |
|----------|--------------------------------------|----------------|------------------------|------------|-------------------------|---------|
| [25]     | Panthre                              | 8×8 Mesh; VCs: 4 VCs/port, 8 flits/VC; XY for baseline and up/down for Panthre; clock frequency: 2GHz | Data path and whole router | Leakage power saving with low latency impact | Overall network power: 14.5% | 1.8% degradation in performance |
| [27]     | FlexiBuffer                          | 8×8 Mesh, XY routing, 1.5GHz clock frequency | Buffers | Buffer management and leakage power | Overall router power: 39% | 3% degradation in throughput |
| [40]     | DVFS-enabled sustainable WNoC        | 8×8 Mesh, flit size 32, 2.5GHz clock frequency | Link | Reduce temperature hotspot and link energy | Overall network energy: 60% | |
| [41]     | DVS policy based on link utilization | 8×8 Mesh, GAL NoC, 32 bit router, 8 VCs/port, 32 bits flits, Routing XY | Link and FIFO | DVS for Link and FIFO for low energy | Energy-delay: 36% | Negligible performance degradation |
| Proposed | amsWNoC                              | 8×8 and 16×16 Mesh and WNoC; XY and South-Last for WNoCs; flit size: 32 bits; packet size: 64 flits; 2.5GHz clock frequency | Buffers and crossbar, wireless interfaces | Saves both switching power consumption in base router and idle-state power at WI | Overall packet energy: 20% - 69%; idle-state power of WI: up to 62.50%; EDP: 33.05% | Less than 1% silicon area overhead in HR |

power savings as compared to other schemes as shown in the table. AMS scheme achieves significant energy savings as compared to baseline topologies without adding significant overhead. The savings achieved improve with increasing system sizes.

#### 3.4.6 Performance Evaluation

We evaluate the performance of the AMS scheme under different traffic situations for both mesh and WNoC topologies in terms of throughput and latency. To bring out the characteristics of the proposed architecture in the presence of both computation-intensive and communication-intensive workloads, we have considered five synthetic traffic patterns, ten *SPLASH-2* [46] benchmarks, and two *PARSEC* [45] benchmarks, *Blackscholes* and *Fluidanimate*. The *SPLASH-2* benchmarks include *Cholesky*, *FFT*, *fmm*, *LU\_C*, *LU\_NC*, *Radiosity\_C*, *Radix*, *Raytrace*, *Volrend* and *Water-nsquared*.

#### 3.4.6.1 Throughput Analysis

To evaluate the performance of the proposed AMS based architectures, we considered both synthetic and application-based traffic distributions. In the following analysis, the system sizes considered are 64-core and 256-core for the purpose of scalability



Figure 3.15 Peak throughput for proposed and baseline architectures

analysis. We also compare it with a conventional wireline mesh and regular WNoC architectures. Figure 3.15 shows the peak throughput at network saturation for Mesh and WNoC architectures. The throughput values for the different architectures are normalized with respect to the maximum throughput of the baseline Mesh for the 64-core system. As expected, all WNoCs provide better throughput than Mesh due to the presence of single-hop, long-range wireless links. The figure also shows that WNoCs for the 256-core system have greater throughput than the 256-core Mesh in most of the applications. The WNoC configuration provides the highest throughput due to hybrid communication: packets in WNoCs are transmitted over wireless links and can follow alternative wireline routes at the same time. The performance degradation observed across traffic patterns is very small; the performance of Shuffle, Bitreversal, Blackscholes, Radiosity\_C, Raytrace and fmm is marginally degraded. These results show that the proposed architecture offers little to no performance degradation while achieving considerable energy savings at the router and network levels.

#### 3.4.6.2 Latency Analysis

Figure 3.16 summarizes performance in terms of latency under various traffic patterns. All packet latency values are normalized with respect to maximum value of



Figure 3.16 Peak latency for proposed and baseline architectures

baseline Mesh of the 64-core system. From the figure, it can be observed that WNoCs have lower latency than Mesh for the 256-core system. Latency for *Blackscholes*, *Fluidanimate*, *FFT*, *LU\_NC*, *Radix*, *Volrend* and *Water-nsquared* is high for Mesh architectures in the 64-core system due to the high communication intensity. Latency for *Butterfly*, *Blackscholes*, *LU\_C*, and *Raytrace* on WNoCs is comparatively low due to their communication requirements. However, the latency overhead of using AMSC in both Mesh and WNoC architectures is very small compared to the regular architectures. From the throughput and latency results, it is clear that the WNoC architecture achieves significant energy savings with little to no impact on network performance.

#### 3.4.7 Scalability and Impact of Power Gating

The proposed router design integrates two components, viz., an on-chip level shifter and AMSC, with every router. The AMSC unit in a router requires data from the same router and its downstream routers. As the additional components in AMSC are associated only with neighboring routers (single hop), the implementation remains unchanged with system size, and the AMS scheme provides a scalable design approach. The on-chip voltage regulator is placed centrally in every cluster. The total number of voltage regulators remains the same even for larger systems, but complexity will increase with system size. The proposed method achieves significant savings in energy with little area overhead and is scalable to



Figure 3.17 Power consumption of PA/LNA in different operating modes

any system size. The time taken for the utilization estimation step at the start of each epoch is small and adds little overhead to router operations. The pre-computation step for categorizing the routers is a one-time process and requires the same time as one epoch's calculations. However, it removes up to 60% of routers from dynamic computation, thereby significantly reducing the overhead of the proposed method.

Mispredictions during the estimation of runtime utilization at any router may result in the voltage for the next epoch being set higher or lower than required. If the router voltage is set to a lower value, the power saving is greater, but performance may be affected (increased charging time of load capacitances). However, since our implementation keeps the frequency constant at all voltages, the performance impact due to lower voltage is insignificant. If, instead, the router utilization is mispredicted as higher, the router components are operated at higher voltages, thereby increasing the power consumption and reducing the energy savings. As the router utilization is predicted for each epoch, this high-power phase lasts only for a short period before the estimate is computed again for the next epoch.

The major impacts of power gating at WI include wakeup latency and transient energy consumption. When a circuit changes states between sleep and wake up modes,

114.59pJ transient energy is spent with the implemented power gating switch. The wakeup latency of WI due to power gating is 0.14ns which is comparable with Panthre [31]. The normalized power consumption of PA/LNA components of WI is presented in Figure 3.17. From this figure it can be seen that normalized leakage power consumption is 17uW at sleep mode. Time needed to change the state is in order of nanoseconds and so does not add significant overhead to the performance. In our implementation, average transient time of on-chip voltage regulator and voltage level shifter are 40ns and 12.56ns, respectively. These time periods are enough to avoid the mismatches of voltage and frequency between routers.

#### 3.5 Conclusion

In this chapter, we propose an energy-efficient WNoC architecture using a novel adaptive multi-voltage scaling (AMS) scheme based on router utilization [60]. To provide an accurate and fine-grained prediction of router utilization, we propose a probabilistic utilization model based on stochastic processes. We also present a multi-level voltage shifter that switches between any two voltage levels from a fixed set of supply voltages, and an efficient control mechanism for packet transmission over wireless links. We also discuss the power-gated LNA and PA and the impacts of power gating in detail. The proposed architecture saves both switching power consumption in the base router and idle-state power consumption in WIs for the WNoC architecture. The AMS scheme saves up to 56% in network packet energy consumption compared to baseline architectures without incurring significant performance penalty or area overhead. The proposed architecture saves up to 62.50% of idle-state power consumption in WIs for the 256-core system using the power gating approach.

# Chapter 4

# Directional Wireless NoC Architecture

In this chapter, we demonstrate directional antenna based WIs for interference-aware DWNoC architecture for many-core systems to overcome the bottleneck of existing WNoC setup with omnidirectional antennas. It is also shown that DWNoC can establish concurrent links which do not interfere with each other. Simultaneous communications can enhance the overall system performance significantly. We also propose the interference-aware WIs placement algorithm and routing strategy for minimizing interference effects. This algorithm also helps in the optimal utilization of the WIs in the network with minimum overheads.

In this dissertation, so far, we have discussed the wireless NoC setup with omnidirectional antennas, where single frequency channel communication is employed. Most of the wireless NoC architectures are based on omnidirectional antennas. In most cases, these WNoC architectures adopt a token passing based medium access mechanism to transmit data over the shared wireless channel. Since all antennas share a common channel, only a single communication is possible at any instant of time. This limits the performance benefits of using wireless links. One potential solution to achieve simultaneous communications is to have multiple non-overlapping wireless channels. But creating multiple transceivers tuned to non-overlapping channels is an extremely challenging job due to effects of interference and high overheads. Interference reduces available bandwidth and degrades the bit error rate (BER). The cumulative impact of these leads to poor Quality-of-Service (QoS).

In order to improve performance over WNoCs with omnidirectional setups, we explore using directional antennas in this chapter. This will enable us to create multiple concurrent wireless links, eventually improving performance and energy-efficiency. We propose to use Planar Log-Periodic Antennas (PLPAs) for the on-chip wireless links and demonstrate the use of their directivity to pair WIs and have concurrent communications without interference. We present Directional Wireless NoC (DWNoC) architecture

following three requirements: interference-avoidance, low power dissipation and performance enhancement through concurrent links.

#### 4.1 Related work

Recently different research groups have explored multiple wireless NoC architectures for efficient on-chip communication infrastructure. A comprehensive survey regarding various WNoC architectures and their design principles is presented in [10]. Notable examples include an inter-router wireless scalable express channel for NoC (iWISE) architecture that reduces power consumption, area overhead and improves performance proposed in [11]. In this architecture, on-chip communication is achieved by using an omnidirectional antenna. A two-tier hybrid wireless/wired architecture to interconnect hundreds to thousands of cores in CMPs using sub-THz wireless links is discussed in [12]. Each wireless hub consists of a single transmitter and multiple receivers with terahertz on-chip antenna, but it consumes 4.5pJ/bit. It improves latency while power consumption is comparable with a mesh network. Most recently, energy efficient hierarchical NoC architecture with Zigzag antenna and millimeter wave transceivers is proposed to design a mm-wave wireless NoC in [14]. A network-based processor array architecture with 2D mesh is proposed in [13] for performance enhancement of NoC. The topology proposed here is a 2D wired mesh architecture augmented with few wireless interfaces. It saves power and area overhead but latency does not improve significantly. To improve the latency, energy efficient, low-cost domain specific irregular mesh WNoC is discussed in [61]. Network performance of this WNoC has been evaluated by implementing the RF microarchitecture. A hybrid WNoC designed with carbon nanotube (CNT) antennas [62] is capable of achieving several orders of magnitude lower energy consumption. In [63], the authors proposed a nature-inspired WNoC with CNT antennas, which is resilient to failures of the antenna elements. However, integration of CNT antennas with standard CMOS processes needs to overcome significant challenges. 
Millimeter-wave CMOS transceivers operating in the sub-THz frequency range are a more near-term solution.

Concurrently progress is being made in the design of on-chip antenna for intra-chip wireless communications. Various types of on-chip antenna have been proposed and

analyzed for energy-efficient on-chip wireless communication. A wireless transmitter and receiver have been integrated with an on-chip antenna operating at 15GHz and used to transmit a global clock signal in [64]. A push-push oscillator with an on-chip patch antenna is designed to operate at a high frequency in [65]. The feasibility of integrated antennas has been discussed in [66], where three different types of antenna (zigzag, meander, and loop) have been simulated. For communication within ICs and over free space, a 2mm zigzag dipole antenna is discussed in [67]. The authors in [68] have analyzed the propagation characteristics of integrated antennas for intra-chip communications and concluded that on-chip propagation is predominantly through surface waves rather than space waves. Recently, the authors in [69] proposed an algorithm to optimize antenna orientation at each WI and minimize communication energy by exploiting antenna directivity. They use a simulated annealing algorithm to achieve the optimal orientation of the antennas. The authors in [70] have discussed metal zigzag antennas operating on different non-overlapping frequency channels for simultaneous communications. A wireless NoC architecture augmented with directional on-chip planar log-periodic antennas is explored in [71] for simultaneous multi-channel communications.

Most of the existing WNoC architectures we have explored use antennas with broadcast capability. Establishing concurrent links using these antennas is not possible because of interference issues. WNoCs using sub-THz transceivers use a token passing based data-link layer to enable multiple transceivers access the shared wireless channel [58] [11]. The token-based approach limits the number of simultaneous transmissions to one over the shared wireless channel. This is because it is non-trivial to design transceivers in non-overlapping frequency bands to form scalable designs with multiple transceivers in different frequencies to achieve Frequency Division Multiple Access (FDMA). In [72], the authors proposed a Code Division Multiple Access (CDMA) based WNoC allowing multiple transmitters to send data over the wireless medium simultaneously through orthogonal code channels. However, in this CDMA based WNoC, lack of synchronization between multiple transmitters may result in low reliability in data transfer over the wireless medium due to loss in orthogonality of the codes. In this work, we design a WNoC based on directional antennas to enable multiple concurrent wireless channels over the NoC fabric. Consequently, the performance and energy-efficiency of the WNoC will be

improved. We propose a wireless interface placement algorithm with deadlock-free routing for the interference-aware DWNoC architecture.

#### 4.2 Communication Infrastructure

Insertion of bypass paths or long-range shortcuts realized with metal interconnects is shown to improve the performance in a conventional mesh-based NoC [6]. Small-world networks are a type of complex networks often found in nature that are characterized by both short-distance and long-range links [73]. Long-range shortcuts realized with wireless interconnects can improve the efficiency of the network as such networks have very low average number of hops between nodes even for very large network sizes. In this section, we discuss the physical layer, topology and communication mechanisms of the proposed DWNoC architecture.

#### 4.2.1 Physical Layer Design

The architecture, and consequently the performance, of DWNoC strongly depends on the physical layer design strategy. In this section, we first discuss the directional log-periodic antenna that we propose to use in the DWNoC, followed by details of the transceiver design.

#### 4.2.1.1 Log-periodic Antenna

PLPAs are very popular and widely studied for their ease of manufacturing and their wide-band properties [74]. However, to the best of our knowledge, these antennas have not been implemented on a silicon substrate for on-chip wireless interconnections. In this work, we propose to use the design for an on-chip planar log-periodic antenna with wide bandwidth and high end-fire directivity in the millimeter wave frequency range. This antenna can also be designed such that it can operate at multiple frequencies.

The design parameters of the log-periodic antenna adopted from [75], with eight teeth are shown in Figure 4.1. Here the sizes of the teeth increase in a logarithmic manner. The dimensions mentioned in Table 4.1 are specifically for 60 GHz carrier frequency and can be scaled appropriately for any desired frequency of operation within micro and millimeter wave frequency range. The longest dimension of the antenna is 1.1825 mm that



Figure 4.1 On-chip planar log-periodic antenna with dimensions in mm

is comparable to the wavelength of the signal in the dielectric medium given by equation (4.1).

$$\lambda_g = \lambda / \sqrt{\varepsilon_r} \tag{4.1}$$

Where,  $\lambda_g$  is the wavelength in a dielectric medium,  $\lambda$  is the wavelength in free space, and  $\varepsilon_r$  is the dielectric constant of a 10-20  $\Omega$ -cm silicon substrate, which is commonly used in CMOS technology. The feasibility of this substrate has already been studied in [76]. A wireless interconnect can be implemented by establishing a communication link between antennas on the same substrate with one antenna in the end-fire region of another. Using these antennas, we establish wireless links in proposed DWNoC architecture.
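As a quick numerical check of (4.1), the sketch below evaluates the guided wavelength at the 60 GHz carrier mentioned above. The relative permittivity of 11.7 is an assumed textbook value for silicon and is not taken from this work; the function name is illustrative only.

```python
import math

C = 299_792_458.0  # speed of light in vacuum, m/s


def guided_wavelength_mm(freq_hz: float, eps_r: float) -> float:
    """Wavelength in a dielectric, lambda_g = lambda / sqrt(eps_r), in mm."""
    free_space_m = C / freq_hz
    return 1e3 * free_space_m / math.sqrt(eps_r)


# eps_r = 11.7 is an assumption (typical for silicon), not a value from the text.
lam_g = guided_wavelength_mm(60e9, 11.7)
print(f"guided wavelength at 60 GHz: {lam_g:.3f} mm")
```

Under this assumption the guided wavelength comes out at roughly 1.5 mm, which is the same order as the 1.1825 mm longest antenna dimension quoted above.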

#### 4.2.1.2 Transceiver

To ensure the high throughput and energy efficiency of the proposed DWNoC, the transceiver circuitry has to provide wide bandwidth as well as low power consumption. The mm-wave frequency band can provide abundant bandwidth without suffering from severe signal degradation, because of the short communication distances in the DWNoC. In designing the on-chip mm-wave wireless transceiver, low power design considerations



Figure 4.2 A hybrid hierarchical NoC architecture with subnets structure

need to be taken into account both at the architecture and circuit levels. In this work, we have adopted a non-coherent on-off keying (OOK) based transceiver proposed in [19] for the DWNoC. OOK is selected as it allows relatively simple and low-power circuit implementation. As illustrated in Figure 4.2, the transmitter (TX) circuitry consists of an up-conversion mixer and a power amplifier (PA). On the receiver (RX) side, direct-conversion topology is adopted, consisting of a low noise amplifier (LNA), a down-conversion mixer and a baseband amplifier. An injection-lock voltage-controlled oscillator (VCO) is reused for TX and RX. With both direct-conversion and injection-lock technology, a power-hungry phase-lock loop (PLL) is eliminated. The OOK transceiver design keeps the additional overhead added by WIs in DWNoC architecture to a minimum.

Table 4.1 Dimensions of the antenna

| Dimension | Length (mm) |
|-----------|-------------|
| L1        | 0.11000     |
| L2        | 0.14019     |
| L3        | 0.17867     |
| L4        | 0.22772     |
| L5        | 0.29023     |
| L6        | 0.36990     |
| L7        | 0.47144     |
| L8        | 0.60085     |
| L9        | 0.05500     |
| W1        | 0.11000     |
| W2        | 0.16500     |
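The logarithmic growth of the teeth can be checked directly from Table 4.1. The short sketch below assumes that L1 to L8 are the lengths of the eight teeth (the table does not label them explicitly) and computes the ratio between successive entries.

```python
# Lengths L1..L8 from Table 4.1, in mm (assumed here to be the eight teeth).
teeth = [0.11000, 0.14019, 0.17867, 0.22772,
         0.29023, 0.36990, 0.47144, 0.60085]

# In a log-periodic structure, successive elements scale by a constant factor.
ratios = [longer / shorter for shorter, longer in zip(teeth, teeth[1:])]
mean_ratio = sum(ratios) / len(ratios)
max_dev = max(abs(r - mean_ratio) for r in ratios)

print(f"mean scaling ratio: {mean_ratio:.4f}")
print(f"largest deviation from the mean: {max_dev:.1e}")
```

The ratios agree to within rounding of the tabulated values, which is consistent with a single constant scaling factor between adjacent teeth.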

#### 4.2.2 Topology of the DWNoC

In the DWNoC topology, each core is connected to a NoC switch and the switches are interconnected using either only wireline or both wireline and wireless links. The topology of DWNoC is a small-world network where the wireline links between switches are established following an inverse power-law distribution as shown in (4.2).

$$p(i,j) = \frac{d_{ij}^{-\alpha} f_{ij}}{\sum_{\forall i} \sum_{\forall j} d_{ij}^{-\alpha} f_{ij}} \tag{4.2}$$

where $p(i,j)$ is the probability of establishing a link between two switches i and j,  $d_{ij}$  is the Manhattan distance between them, and  $f_{ij}$  is the frequency of communication between switches i and j. As can be seen from the equation, the probability of a link insertion between two switches i and j separated by a Manhattan distance of  $d_{ij}$  is proportional to the distance raised to a finite negative power. The value of  $\alpha$  is optimized to obtain topologies with higher performance gains and optimal wiring costs [77]. The distance is obtained by considering a tile-based floorplan of the cores on the die. This power-law based link distribution results in both short-distance connections and long-range links due to the non-zero probability for



Figure 4.3 4×4 small world network topology with directional antennas

links between far-away nodes according to (4.2). The frequency of traffic interaction between the cores,  $f_{ij}$ , is also factored into (4.2) so that more frequently communicating cores have a higher probability of having a direct link, optimizing the topology for application-specific traffic. For a general-purpose many-core NoC, (4.2) may be modified for a uniform traffic distribution, which captures both local and long-distance traffic interactions. To establish the network connectivity, each pair of switches in the NoC is selected and a link is established between them with the probability given in (4.2) until all the links are deployed. The total number of links is the same as that of a conventional mesh. The network setup is repeated until there are no isolated clusters. An upper bound is imposed on the number of wireline links attached to a particular switch so that no switch becomes unrealistically large [63].
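The link-insertion rule in (4.2) can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the value $\alpha = 1.8$, and the sampling procedure are assumptions for illustration, and the connectivity check and per-switch degree bound described above are omitted. Probabilities are computed over unordered switch pairs, which leaves the normalized values of (4.2) unchanged for symmetric traffic.

```python
import itertools
import random


def link_probabilities(rows, cols, alpha, freq=None):
    """Return {(i, j): p(i, j)} over unordered switch pairs, following (4.2)."""
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    weights = {}
    for i, j in itertools.combinations(range(len(nodes)), 2):
        d = abs(nodes[i][0] - nodes[j][0]) + abs(nodes[i][1] - nodes[j][1])
        f = 1.0 if freq is None else freq[i][j]  # uniform traffic by default
        weights[(i, j)] = d ** (-alpha) * f
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def sample_links(probs, n_links, seed=0):
    """Draw n_links distinct switch pairs, weighted by their probabilities."""
    rng = random.Random(seed)
    remaining = dict(probs)
    chosen = []
    while len(chosen) < n_links and remaining:
        pairs = list(remaining)
        pick = rng.choices(pairs, weights=[remaining[p] for p in pairs], k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen


# alpha = 1.8 is an illustrative value only; a 4x4 mesh has 24 links.
probs = link_probabilities(4, 4, alpha=1.8)
links = sample_links(probs, n_links=24)
```

Adjacent switches receive the largest probabilities, while distant pairs keep a small but non-zero chance of being linked, which is what produces the mix of short and long-range links.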

Once the wireline network is established as outlined above, we proceed for WI placement. For this purpose, a certain number of the NoC switches will be equipped with the transceivers to form the WIs as shown in Figure 4.3. It represents our proposed DWNoC architecture. This architecture consists of directional antennas, hybrid switches (combination of wired and wireless switches), and regular network switches. Each WI is paired only to one other WI, and the directional antennas of paired WIs are in the end fire region of each other. This allows for multiple antenna pairs to communicate with each other at the same time using the same frequency. The placement of the WIs is optimized to minimize the average distance or hop-count between source-destination pairs while ensuring that no WI or wireless link is in the path of communication between other pairs of WIs. Optimization of WI placement is essential for efficient performance and interference free communication during multiple simultaneous wireless data transmissions.

#### 4.2.3 Interference Aware WI Placement Problem

DWNoC topology augments a regular wired network with directional wireless links to improve communication latency and energy. The wireline network is laid out using (4.2) as described in the previous section. To obtain the optimal placement of WIs for multiple interference-free concurrent wireless transmissions for a given topology and traffic pattern, we define a constrained optimization problem. The WI placement problem minimizes the average latency of the network while avoiding interference and meeting WI overhead constraints.

The objective function of the WI placement problem is defined as (4.3),

$$\min_{W} \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} h_{ij} f_{ij} \tag{4.3}$$

where  $h_{ij} = K \times d_{ij-WI} + (1-K)\, d_{ij}$  and  $K = 1 \text{ or } 0$.

The optimization variable W is a two-column matrix denoting the WI pairs, each pair connected by a wireless link. N is the total number of nodes arranged in an R×C mesh topology.  $f_{ij}$  is the frequency of communication between nodes i and j.  $h_{ij}$  is the number of hops between nodes i and j. The value of  $h_{ij}$  depends on whether a wireless link exists in the vicinity (within one wired hop) of the path between nodes i and j, which is denoted by K. K is a logical variable, with '1' indicating the presence of a wireless link along the path and '0' indicating no wireless link. Starting from the source node, the algorithm searches for a WI within one wired hop at every node along the path towards the destination. If a WI is found and its wireless link reduces the total hop count to the destination, K is set to '1'.  $d_{ij}$  is the Manhattan distance between nodes i and j, which is the hop count in the absence of a wireless link.  $d_{ij-WI}$  represents the hop count between nodes i and j when a wireless link exists along the path. A hop over a wired link refers to transferring a flit between two adjacent routers, generally accomplished in one system clock cycle. For wireless links, a hop refers to transmitting data between the paired WIs, which takes multiple system clock cycles due to bit serialization of the flit. For our setup, this is three clock cycles. The hop count, with or without a wireless link, is the total number of cycles over all hops between source and destination. The objective of the placement problem is to minimize the average hop count between any two nodes over all the nodes in the network.
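A minimal sketch of the objective in (4.3) is given below. It is a simplification rather than the procedure described above: instead of searching for a WI within one wired hop along the path, it takes the cheaper of the wired path and any path that uses one wireless link, and it treats wireless links as bidirectional. Node ids are assumed to be row-major, and all function names are illustrative.

```python
WIRELESS_HOP_CYCLES = 3  # per the text: one wireless hop takes three cycles


def manhattan(a, b, cols):
    """Manhattan distance between node ids a and b in a row-major mesh."""
    return abs(a // cols - b // cols) + abs(a % cols - b % cols)


def hop_count(i, j, wi_pairs, cols):
    """h_ij: the cheaper of the wired path and a path using one wireless link."""
    best = manhattan(i, j, cols)                      # K = 0 case, d_ij
    for a, b in wi_pairs:
        for src_wi, dst_wi in ((a, b), (b, a)):       # assumed bidirectional
            via = (manhattan(i, src_wi, cols) + WIRELESS_HOP_CYCLES
                   + manhattan(dst_wi, j, cols))      # K = 1 case, d_ij-WI
            best = min(best, via)
    return best


def average_hop_count(rows, cols, wi_pairs, freq=None):
    """Objective (4.3), with f_ij = 1 unless a traffic matrix is given."""
    n = rows * cols
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                f = 1.0 if freq is None else freq[i][j]
                total += hop_count(i, j, wi_pairs, cols) * f
    return total / (n * (n - 1))


baseline = average_hop_count(8, 8, wi_pairs=[])
with_link = average_hop_count(8, 8, wi_pairs=[(0, 63)])
print(f"mesh: {baseline:.3f}, with one corner-to-corner link: {with_link:.3f}")
```

Comparing the two values shows how a single long-range link lowers the average, which is the quantity the placement problem minimizes.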

The WI placement in DWNoC topology is a constrained optimization problem. The first constraint arises from overheads due to WI components. Each wireless interface adds area and power overheads,  $A_{WI}$  and  $P_{WI}$, to the system due to the transceiver and antenna. The number of WI pairs placed in DWNoC is limited by the tolerable area and power overheads. This constraint is represented by (4.4), where  $N_{WI}$  is the maximum allowed number of WIs that can keep the overhead within tolerable limits.

$$2 \times \mathrm{rows}(W) \le N_{WI} \tag{4.4}$$

DWNoC topology makes multiple concurrent wireless communications at same frequency possible by allowing each WI to be able to communicate with only one another WI. This is enforced by constraining each WI to only one pair while optimizing the WI placement. This constraint is represented by (4.5). Since the antennas used in DWNoC are directional in nature, antennas in a WI pair are oriented to be in the end-fire radiation regions of one another.

$$\mathrm{unique}(W) = W \tag{4.5}$$

WIs are generally more efficient for long-range intra-chip communication and hence it is necessary that paired WIs are far apart from each other. For this purpose, we enforce a constraint, as represented by (4.6), that paired WIs are separated by a minimum threshold distance.

$$\left| W_{X} * \begin{pmatrix} 1 \\ -1 \end{pmatrix} \right| + \left| W_{Y} * \begin{pmatrix} 1 \\ -1 \end{pmatrix} \right| \ge D_{th} * \mathbf{1} \tag{4.6}$$

Where,  $W_X$  and  $W_Y$  are matrices representing X- and Y- coordinates of nodes in a mesh network. **1** is a vector of all ones with length same as the number of rows in W.  $D_{th}$  is the Manhattan distance between paired nodes corresponding to the physical minimum threshold distance. The minimum threshold distance is determined by the distance at which communication through wireless link achieves energy savings over wired counterpart at the same distance. This distance is generally in the range of 7mm to 10mm [10] [78].
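The three constraints (4.4) to (4.6) can be expressed as simple predicates on a list of WI pairs. The sketch below is illustrative only: it assumes row-major node ids, reads (4.5) as "no node appears in more than one pair", and uses hypothetical function names and example values.

```python
def satisfies_overhead(wi_pairs, n_wi_max):
    """(4.4): two WIs per pair, with the total bounded by N_WI."""
    return 2 * len(wi_pairs) <= n_wi_max


def satisfies_unique_pairing(wi_pairs):
    """(4.5): no WI node appears in more than one pair."""
    nodes = [node for pair in wi_pairs for node in pair]
    return len(nodes) == len(set(nodes))


def satisfies_min_distance(wi_pairs, cols, d_th):
    """(4.6): |x1 - x2| + |y1 - y2| >= D_th for every pair."""
    for a, b in wi_pairs:
        dist = abs(a % cols - b % cols) + abs(a // cols - b // cols)
        if dist < d_th:
            return False
    return True


# Example on an 8x8 mesh; the pairs, N_WI = 6 and D_th = 4 are made-up values.
pairs = [(0, 36), (7, 59)]
ok = (satisfies_overhead(pairs, n_wi_max=6)
      and satisfies_unique_pairing(pairs)
      and satisfies_min_distance(pairs, cols=8, d_th=4))
print(ok)
```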

To ensure reliable concurrent wireless communication, any pair of directional wireless links must be non-interfering. This is achieved by placing interference avoidance constraints on the placement of WIs and wireless links as represented by (4.7).

$$\begin{aligned} \mathbf{A}(W^{v}) &= \mathbf{0} \\ W^{v} &= W * \begin{bmatrix} N & 1 & N+1 & 0 \\ 1 & N & 0 & N+1 \end{bmatrix} - N * \mathbf{1}_{rows(W) \times 4} \end{aligned} \tag{4.7}$$

where $\mathbf{A}$ is the $N \times N$ matrix representing the interference avoidance constraints between any two wireless links, and $W^{v}$ converts the values in the $W$ matrix to the appropriate positions in $\mathbf{A}$. The interference constraints are described in detail in the following section. All these constraints combined form the constrained optimization problem for placing WIs in DWNoC topology.

#### 4.2.4 Interference Avoidance Constraints

The interference avoidance constraints form the crucial part of WI placement in DWNoC topology. They ensure that placement of WIs is such that concurrent wireless transmissions are interference free. We use coordinate geometry and the directional properties and transmission characteristics of the PLPA antenna like transmission gain, beam width, bit-error rate (BER), path loss variation and signal-to-interference ratio (SIR) to develop interference constraints. The properties of the PLPA antenna that are required to obtain interference avoidance regions are

- a) The radiation pattern of the PLPA antenna, which gives the main lobe and side lobe end fire regions.
- b) The path loss variation of the antenna with respect to distance, which defines the distance $D_{EFR}$ over which a receiving antenna can reliably receive the transmitted signal with significant strength.
- c) The received power and interference power over different radial directions surrounding a transmitter antenna. This gives the regions near an antenna where unpaired WIs can be placed to avoid interference.

We obtain these characteristics by implementing and simulating the PLPA in HFSS tool. All these antenna characteristics are not independent and are correlated with each other. Valid data pertaining to these parameters and some of their inter-dependencies relevant to this work have been provided in the experimental results section.

Since all wireless communications use the same frequency channel, for concurrent wireless transmissions to happen, data from one communicating pair should not interfere with another pair's data or reach unpaired WIs. We use properties from coordinate geometry to develop constraints that ensure there is no interference. Each node in the DWNoC is considered as a point in the XY plane, and any two nodes are represented as  $N_1(x_1, y_1)$  and  $N_2(x_2, y_2)$. The wireless link between two paired WI nodes can be represented as the line segment joining both node coordinates. Using this, interference in wireless transmission can be avoided if:

- a) No two wireless link segments can intersect with each other, i.e., there is no intersecting point between them. This avoids any interference when data is being transmitted simultaneously using two links at same frequency.
- b) Any node coordinate present in the end-fire radiation regions of antennas from a wireless link is ineligible to act as WI node for another wireless link. If  $\theta_{ML}$  and  $\theta_{SL}$  are main lobe and side lobe widths of PLPA in radian, four triangular regions can be formed for two antennas from a wireless link. All WI nodes from other wireless links should be outside the bounds of these regions. This makes sure that data from one wireless link does not reach WIs from other links, which may lead to data loss at that WI.

If a WI node pair and the wireless link is placed in DWNoC topology, the first condition gives all the ineligible wireless links or WI node pairs. The second condition gives the nodes unavailable to be WIs as part of other wireless links. Both the conditions must be satisfied for all the wireless link and corresponding node pairs that exist in the DWNoC topology.

As previously mentioned, antennas from paired WIs are in the main lobe end fire regions of each other. Here it is assumed that the transmission power of an antenna at a WI is tuned so as to reach only the paired antenna and beyond that transmitted signal power falls significantly. The bound of the region in the side lobe is determined by distance after which transmitted power is insignificant. Satisfying the two conditions for all WIs ensures that there is no interference while transmitting data simultaneously using multiple wireless links.
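The two geometric conditions can be sketched as follows. This is an illustration under stated assumptions rather than the constraints as implemented in this work: the lobe is approximated by a circular sector instead of a triangular region, only the lobe aimed at the paired antenna is modeled (the side-lobe directions are not specified here), and the function names, beam width and reach values are hypothetical.

```python
import math


def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p)."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def _on_segment(p, q, r):
    """True if collinear point r lies within the bounding box of segment pq."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_intersect(a, b, c, d):
    """Condition (a): do wireless links ab and cd share any point?"""
    o1, o2 = _orient(a, b, c), _orient(a, b, d)
    o3, o4 = _orient(c, d, a), _orient(c, d, b)
    if o1 != o2 and o3 != o4:
        return True
    return ((o1 == 0 and _on_segment(a, b, c))
            or (o2 == 0 and _on_segment(a, b, d))
            or (o3 == 0 and _on_segment(c, d, a))
            or (o4 == 0 and _on_segment(c, d, b)))


def in_lobe(tx, rx, node, beam_width, reach):
    """Condition (b): is node inside the sector of tx that is aimed at rx?"""
    dx, dy = node[0] - tx[0], node[1] - tx[1]
    dist = math.hypot(dx, dy)
    if dist == 0 or dist > reach:
        return False
    aim = math.atan2(rx[1] - tx[1], rx[0] - tx[0])
    # Wrap the angular offset into [-pi, pi] before comparing.
    off = abs((math.atan2(dy, dx) - aim + math.pi) % (2 * math.pi) - math.pi)
    return off <= beam_width / 2


crossing = segments_intersect((0, 0), (4, 4), (0, 4), (4, 0))
exposed = in_lobe((0, 0), (4, 0), (2, 0.1), math.radians(30), reach=5)
print(crossing, exposed)
```

A candidate link would be rejected if `segments_intersect` is true against any placed link, and a candidate WI node would be rejected if `in_lobe` is true for any placed antenna.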

#### 4.2.5 Proposed Optimization Algorithm

To solve the constrained WI placement problem formulated above, we propose an optimization algorithm based on Breadth First Search (BFS). The proposed optimization algorithm, shown in Figure 4.4, is a greedy algorithm that places wireless links recursively, one by one, while satisfying all the constraints at each step. At each step, we set the parameter  $H_{Temp}$  to the average hop count of the existing DWNoC setup,  $H_{Sol}$. A set of valid WI node candidates and a set of ineligible WI node pairs are given as input. Each valid WI node pair for each node in the WI node candidate set is added to the existing topology, and the corresponding average hop count,  $H_{current}$, as described in (4.3), is computed. If the current hop count,  $H_{current}$, is less than  $H_{Temp}$, the intermediate solution is updated with this node pair, and  $H_{Temp}$  is updated with  $H_{current}$. Otherwise, the algorithm moves to the next valid node pair. To calculate the hop count, we use the BFS method, which traverses a tree structure from the root node, exploring neighbor nodes first and so on. At the end of each step, once all valid node pairs are explored, the algorithm identifies the new WI node pair that reduces the average hop count the most. The WI node pair matrix,  $W_{Sol}$, is updated with this new pair, which is added to the DWNoC topology. The average hop count of the DWNoC topology,  $H_{Sol}$, is updated with  $H_{Temp}$  from this step.

Using the interference constraints, all remaining nodes and node pairs that violate the interference constraints for existing DWNoC topology are computed. The valid WI node candidate set is updated by removing the nodes violating the constraints. Similarly, ineligible WI node pair set is updated by adding the new invalid node pairs. These two sets along with the updated hop count are provided as input to the next recursive step in the algorithm. The RxC mesh topology and its average hop count are provided as initial solutions for first step. In this way, the proposed algorithm keeps adding new WI node pairs/wireless links till one of the stopping criteria is met. There are three stopping criteria for the algorithm:

- The number of WIs added to the topology exceeds the maximum allowed WIs,  $N_{WI}$ .
- Any recursive step in the algorithm returns an empty solution.
- The valid WI candidate node set at the end of any recursive step is empty.

Figure 4.4 Interference-aware wireless interfaces placement algorithm

Figure 4.5 Data routing strategy

The proposed algorithm provides an interference-free WI placement setup for the DWNoC topology and at the same time provides the optimal number of wireless links required for a given system size and traffic pattern. The algorithm is greedy in nature, as at each step it adds only the one wireless link that provides the largest reduction in average hop count. Using the constrained WI placement problem and the proposed BFS-based greedy algorithm, an interference-free DWNoC topology that allows concurrent wireless communications can be designed for any system size. We implemented the proposed formulation and algorithm in MATLAB for different system sizes and traffic patterns. The WI placement and the optimal number of links for different system sizes are discussed in the results section.
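The greedy loop described in this subsection can be sketched in a few lines of Python. This is an illustrative sketch rather than the MATLAB implementation used in this work: it is written iteratively instead of recursively, the function names are hypothetical, and the interference constraints are abstracted into a caller-supplied predicate `violates(a, b, placed)`, since their exact form depends on the antenna characteristics. For simplicity, the candidate set is pruned only by removing nodes that already host a WI.

```python
from collections import deque
from itertools import combinations


def mesh_adjacency(rows, cols):
    """Adjacency sets of an R x C wireline mesh; nodes are (row, col) tuples."""
    adj = {(r, c): set() for r in range(rows) for c in range(cols)}
    for (r, c) in list(adj):
        for nb in ((r + 1, c), (r, c + 1)):
            if nb in adj:
                adj[(r, c)].add(nb)
                adj[nb].add((r, c))
    return adj


def average_hop_count(adj):
    """Average shortest-path length over all ordered node pairs, using BFS."""
    total = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        total += sum(dist.values())
    n = len(adj)
    return total / (n * (n - 1))


def greedy_wi_placement(rows, cols, max_wis, violates):
    """Add one wireless link per step, keeping the pair that lowers H the most."""
    adj = mesh_adjacency(rows, cols)
    h_sol = average_hop_count(adj)          # initial solution: plain mesh
    w_sol = []                              # placed WI node pairs
    candidates = set(adj)                   # valid WI candidate nodes
    while 2 * (len(w_sol) + 1) <= max_wis and candidates:
        h_temp, best = h_sol, None
        for a, b in combinations(sorted(candidates), 2):
            # Skip wired neighbours and pairs rejected by the constraints.
            if b in adj[a] or violates(a, b, w_sol):
                continue
            adj[a].add(b); adj[b].add(a)    # tentatively add the link
            h_current = average_hop_count(adj)
            adj[a].discard(b); adj[b].discard(a)
            if h_current < h_temp:
                h_temp, best = h_current, (a, b)
        if best is None:                    # empty solution: stop
            break
        a, b = best
        adj[a].add(b); adj[b].add(a)
        w_sol.append(best)
        h_sol = h_temp
        candidates -= {a, b}                # each node hosts at most one WI
    return w_sol, h_sol
```

Calling `greedy_wi_placement(4, 4, 2, lambda a, b, placed: False)` places a single link on a 4x4 mesh with the constraints disabled; a realistic predicate would reject pairs that cross an existing link or fall inside its end-fire region.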

#### **4.2.6 Communication Protocol**

To avoid deadlock situations and minimize multi-hop communication in the proposed DWNoC architecture, we use the routing strategy illustrated in Figure 4.5. In an $N \times N$ matrix, the positions of the source (S) $i$ and destination (D) $j$ are $(x_1, y_1)$ and $(x_2, y_2)$, respectively. The nearest source and destination WI positions are denoted as $WI_S=(p,q)$ and $WI_D=(r,s)$. Since the small-world topology based DWNoC is essentially an irregular architecture, we adopt South-Last routing as proposed in [6] to achieve an efficient deadlock-free routing policy for wireless links. It also avoids cyclic dependencies for wireless links. Data transfer between $i$ and $j$ is governed by two conditions. First, if the nearest WI from the source is more than two hops away, the data is routed via deadlock-free XY routing [79]; otherwise, the nearest WI from the source is considered for long-distance communication. Second, hop counts are estimated as both the Manhattan distance ($D_M$) between $i$ and $j$ and the minimum hop count ($D_{WI}$) between $i$ and $j$ using wireless links. If $D_{WI}$ is less than $D_M$, a hybrid path (a combination of wireless and wireline links) carries the data. If not, XY routing takes place over the wireline links. This process is repeated for every packet transfer.
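The two routing conditions can be illustrated with a short Python sketch. It makes several simplifying assumptions that are not stated in the text: wireline segments are measured by Manhattan distance, a wireless link counts as a single hop, and $WI_D$ is taken to be the partner of the WI nearest to the source. The function names are hypothetical, and the sketch only selects the path type; it does not implement South-Last or XY routing themselves.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) mesh positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def choose_route(src, dst, wi_links, max_wi_hops=2):
    """Return ("wireline" | "hybrid", hop_count) for one packet.

    wi_links is a list of paired WI positions, e.g. [((0, 0), (7, 7))].
    """
    d_m = manhattan(src, dst)
    # Each link can be entered from either end.
    endpoints = list(wi_links) + [(b, a) for a, b in wi_links]
    if not endpoints:
        return "wireline", d_m
    wi_s, wi_d = min(endpoints, key=lambda e: manhattan(src, e[0]))
    # Condition 1: nearest WI more than two hops away -> XY routing.
    if manhattan(src, wi_s) > max_wi_hops:
        return "wireline", d_m
    # Condition 2: use the wireless link only if it shortens the path.
    d_wi = manhattan(src, wi_s) + 1 + manhattan(wi_d, dst)
    if d_wi < d_m:
        return "hybrid", d_wi
    return "wireline", d_m
```

For example, with a single link between (0, 0) and (7, 7), a packet from (0, 1) to (7, 6) takes the hybrid path in 3 hops instead of 12 wireline hops, while a packet from (3, 3) to (4, 4) stays on the wireline mesh.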

# **4.3 Experimental Results**

In this section, we present the simulation setup shown in Figure 4.6 and the performance evaluation results of the DWNoC architecture. This setup represents the tool chain used to obtain the interference-aware architecture. The rest of the section describes the performance analysis of DWNoC against existing architectures and with application-specific traffic in subsections 4.3.5 and 4.3.6.

#### 4.3.1 Simulation Setup

We discuss the characteristics of the on-chip log-periodic antenna that enables the directional wireless links. Simulation results for the PLPA are obtained using Ansoft's HFSS tool [80]. Based on these characteristics we model the energy consumption, bandwidth, and reliability of the wireless interconnect. In this work, the NoC architecture is characterized using a cycle-accurate simulator that models the progress of the data flits accurately per clock cycle, accounting for those flits that reach the destination as well as those that are dropped. The width of all wired links is considered to be the same as the flit size, which is 32 bits in this work. We consider a moderate packet size of 64 flits for all our experiments.

Figure 4.6 Performance evaluation setup for DWNoC

As in the wired links, we adopt wormhole routing in the wireless links. We consider two system sizes of 64 and 256 cores for our experiments, which are representative of current multicore technology trends [81]. The number of VCs in the DWNoC switches depends on the system size and the number of interconnects. The network switches are synthesized from an RTL-level design using 65nm standard cell libraries from CMP [82], using Synopsys. The NoC switches are driven with a clock frequency of 2.5 GHz. The delays and energy dissipation on the wired links were obtained through Cadence simulations, taking into account the specific length of each link based on the established connections in the 20mm×20mm die following the topology of the NoCs. The wireless transceiver adopted from [19] is designed and simulated using the TSMC 65-nm CMOS process and is shown to dissipate 32mW while sustaining a data rate of 16Gbps with a BER of less than $10^{-15}$ and occupying an area of 0.3mm<sup>2</sup>.

We evaluate the proposed DWNoC in terms of packet energy dissipation and peak network bandwidth. Packet energy,  $E_{pkt}$  is the energy dissipated in transferring one packet completely from source to destination at network saturation. It can be measured as

$$E_{pkt} = \frac{\sum_{i=1}^{N_{pkt}} \left[ (L_i - h_i \lambda) E_{buf} + h_i \lambda E_{wire} \right] + N_{sim} E_{wireless}}{N_{pkt}} \tag{4.8}$$

where $N_{pkt}$ is the number of packets routed in the NoC, $L_i$ is the latency of the $i^{th}$ packet, $h_i$ is the number of hops in the path of the packet and $E_{buf}$ is the energy dissipation of a flit in the NoC switch buffers. The energy dissipation of a wireline hop is $E_{wire}$ and $\lambda$ is the packet length in number of flits. $N_{sim}$ is the duration of the simulation in cycles and $E_{wireless}$ is the energy dissipated by all the wireless transceivers in the WNoC in one cycle.

Peak bandwidth is the maximum achievable data rate for the NoC. The bandwidth is measured as the average number of bits successfully arriving per core per second. Bandwidth, *B* can be determined as,

$$B = t \beta N f \tag{4.9}$$

where $t$ is the maximum throughput in flits received per clock cycle at network saturation, $\beta$ is the number of bits in a flit, $N$ is the number of cores in the NoC and $f$ is the clock frequency. The throughput is directly obtained from system-level simulations performed by the NoC simulator. In the following subsections, we first discuss the characteristics of the antennas and the associated constraints for interference avoidance, and then present a thorough performance evaluation of the proposed DWNoC architecture.
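As a minimal sketch, the two metrics in (4.8) and (4.9) can be computed from per-packet simulation statistics as follows. The function and parameter names are hypothetical, and the numbers in the usage note are placeholders chosen for easy arithmetic, not measured values from this work.

```python
def packet_energy(latencies, hops, flits_per_packet,
                  e_buf, e_wire, n_sim, e_wireless):
    """Average packet energy following (4.8).

    latencies[i] and hops[i] are L_i and h_i of the i-th packet,
    flits_per_packet is lambda, n_sim is the simulation length in cycles.
    """
    if not latencies or len(latencies) != len(hops):
        raise ValueError("need one latency and one hop count per packet")
    lam = flits_per_packet
    wired = sum((l_i - h_i * lam) * e_buf + h_i * lam * e_wire
                for l_i, h_i in zip(latencies, hops))
    return (wired + n_sim * e_wireless) / len(latencies)


def peak_bandwidth(throughput, bits_per_flit, num_cores, clock_hz):
    """Peak bandwidth in bits per second following (4.9)."""
    return throughput * bits_per_flit * num_cores * clock_hz
```

With the 32-bit flits, 64 cores and 2.5 GHz clock used in this chapter, and a placeholder saturation throughput of 0.5, `peak_bandwidth(0.5, 32, 64, 2.5e9)` evaluates to 2.56e12 bits per second.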

#### 4.3.2 Interference-Aware Constraints for WI placement

To obtain interference-aware constraints, we first analyzed the following characteristics: i) antenna characteristics for transmission gain, beam coverage and link budget for QoS, ii) path loss variation with distance for constraining the maximum separation between paired antennas, iii) antenna orientation and iv) Signal-to-Interference Ratio (SIR) for minimal impact of interference on performance. These parameters and those of their inter-dependencies that are relevant to this work are discussed here.

#### 4.3.3 Antenna Characteristics and Link Budget Analysis

The antenna characteristics such as radiation pattern, beamwidth and path loss determine the interference avoidance constraints used in the WI placement algorithm. The 60 GHz on-chip PLPA described earlier is simulated, and the results are presented in this section. The return loss of the antenna, i.e. the $S_{11}$ parameter, is shown in Figure 4.7. The $S_{11}$ parameter indicates that this antenna can be operated as a dual-band antenna at 60 GHz and 44 GHz. In this work, we consider the 60 GHz band for DWNoC.

Figure 4.7 Return loss of the on-chip planar log-periodic antenna

The radiation patterns of the antenna in the azimuth and elevation planes are shown in Figure 4.8. The end-fire beamwidth of the on-chip PLPA is 33°. Similarly, the half-power beamwidth along the elevation plane is 30°. The radiation pattern determines the region in which a paired antenna should lie for a given antenna.

Figure 4.8 Radiation pattern along the azimuthal and elevation plane

Next, we simulated the transmission characteristics of two PLPAs integrated on the same substrate and separated by a distance of 20mm, which is the typical on-chip interconnect length in the DWNoC architecture. The antennas are aligned such that maximum signal can be coupled between them, i.e. one antenna is oriented along the end-fire region of the second antenna. The transmission gain, $G_a$, between the antennas is computed according to (4.10):

$$G_a = \frac{\left|S_{21}\right|^2}{(1 - \left|S_{11}\right|^2)(1 - \left|S_{22}\right|^2)} \tag{4.10}$$

where $S_{11}=S_{22}$ because the two antennas are identical. At 60 GHz, the gain is -38.65 dB, which translates to a transmission power requirement in the range of 0dBm (with the OOK modulation scheme and a BER requirement of $10^{-15}$) [11]. This level of transmitter power can be easily generated in the on-chip scenario.
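Since simulators typically report S-parameters in dB, evaluating (4.10) requires converting them to power ratios first. The helper below is a small sketch of that conversion; the function name is hypothetical and the values in the usage note are placeholders, because the individual $S_{11}$ and $S_{21}$ readings at 60 GHz are not tabulated in this section.

```python
import math


def transmission_gain_db(s21_db, s11_db, s22_db):
    """Transmission gain G_a of (4.10) in dB, from S-parameters given in dB."""
    # An S-parameter magnitude in dB is 20*log10|S|, so |S|^2 = 10^(dB/10).
    s21_sq = 10 ** (s21_db / 10)
    s11_sq = 10 ** (s11_db / 10)
    s22_sq = 10 ** (s22_db / 10)
    gain = s21_sq / ((1 - s11_sq) * (1 - s22_sq))
    return 10 * math.log10(gain)
```

For instance, with placeholder readings of $S_{21}$ = -40 dB and $S_{11}$ = $S_{22}$ = -10 dB, the mismatch correction raises the gain to about -39.08 dB; for well-matched antennas the correction vanishes and $G_a$ approaches $S_{21}$.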

#### 4.3.3.1 Path Loss Estimation with Distance

We aligned two PLPAs integrated on the same substrate, separated by a distance of up to 28mm, with one antenna employed as the transmitter and the other as the receiver. For path loss estimation, the power attenuation is calculated at various distances between transmitter and receiver. The distance between transmitter and receiver is varied from 2mm to 28mm and the corresponding variation of the S(2,1) parameter is shown in Figure 4.9. Path loss increases exponentially as per the variation of S(2,1).

Figure 4.9 S(2,1) variation with distance

From the effective slope obtained in our simulations, we calculate the value of the path loss (PL) factor $\Upsilon$. The value of $\Upsilon$ is 1.42, which is smaller than the free-space PL factor of 2. For comparison, the PL factors reported in [67] are 1.454 for zigzag, 1.342 for meander and 1.411 for linear antennas. Path loss in dB can be calculated using:

$$PL = 10\,\Upsilon \log_{10}(d) + NF_{TOTAL} \tag{4.11}$$

where $NF_{TOTAL}$ is the total noise floor. PL is equal to -64.95 dB over a 28mm T-R separation. The path loss variation of the antenna with distance sets a constraint on the maximum separation that two paired antennas can have between them.
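The slope-based estimate of the path loss factor can be illustrated with an ordinary least-squares fit of PL against $10\log_{10}(d)$. The sketch below uses synthetic data generated from the model itself, not the simulated S(2,1) values of Figure 4.9; the function names, the distance unit and the offset of -85 dB are assumptions made purely for illustration.

```python
import math


def path_loss_db(distance, exponent, offset_db):
    """Log-distance path loss model of (4.11)."""
    return 10 * exponent * math.log10(distance) + offset_db


def fit_path_loss_exponent(distances, losses_db):
    """Least-squares fit of loss = exponent * 10*log10(d) + offset."""
    xs = [10 * math.log10(d) for d in distances]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses_db) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses_db))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    exponent = sxy / sxx
    offset = mean_y - exponent * mean_x
    return exponent, offset


# Synthetic samples over the 2mm to 28mm range (illustrative only).
distances_mm = [2, 4, 8, 16, 28]
samples = [path_loss_db(d, 1.42, -85.0) for d in distances_mm]
exponent, offset = fit_path_loss_exponent(distances_mm, samples)
```

On noise-free synthetic samples the fit recovers the exponent used to generate them; with simulated or measured data the residuals indicate how well the log-distance model describes the channel.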

Based on the path loss characteristics, we also estimate the energy consumed in transmitting data over wireless links and compare it with that of wired links. The energy consumed in a link depends on the distance between the source and destination nodes. At a distance of 10mm, data transmission over wireless links consumes 1.8320pJ/bit, including the power consumed in the transceiver circuits at the transmitter and receiver and the transmission loss of the channel. Over the same distance, buffered and unbuffered wires consume 4.6074pJ/bit and 10.6625pJ/bit, respectively. As can be observed, wireless links consume significantly less energy than wired links. Furthermore, the energy consumption of wireless links varies linearly with distance, whereas unbuffered wires exhibit exponential variation. Although buffered wires also show linear variation, their rate of change is higher than that of wireless links, so the energy saving with wireless links becomes apparent at large distances. This, along with the low latency of wireless links, makes them very efficient for communicating data over long distances across the chip.

#### 4.3.3.2 Receiver Power with Antenna Orientation

To understand the relationship between received power and antenna orientation for WI placement, we simulated a transmitter-receiver pair of PLPAs in which one antenna is rotated about the second (receiver) antenna. The antenna is rotated about the axis normal to its plane, and the variation of S(2,1) is plotted. The power received from the transmitter at each rotation, due to the different lobes, is illustrated in Figure 4.10. We compare this figure with the azimuthal radiation pattern for validation. From the figure, the power ratio at 0° and 360° is -19.6 dB due to the main lobe. The power ratio at 180° is -19 dB due to the back lobe. The power ratio at 225° is around -28.81 dB due to small side lobes. The variation of received power with antenna orientation, along with the radiation pattern, determines the nodes available for a given wireless link, as described in the interference avoidance conditions of the WI placement algorithm.

Figure 4.10 S(2,1) variation with different antenna orientation

#### 4.3.3.3 Signal-to-Interference Analysis

Figure 4.11 Antennas setup with a different orientation

Signal-to-Interference Ratio (SIR) analysis is essential for accurate system model design. SIR changes with the antenna setup and can be improved by adjusting the antenna orientation. As a case study for interference analysis, we take a valid antenna setup that satisfies all constraints, as shown in Figure 4.11. We obtained the optimal number of WIs for the 64-core system size as 6 from SA. There are six antennas in total, named $A_1$-$A_6$, required to establish pair-wise communication for the given network. $A_1$, $A_5$ and $A_6$ communicate with $A_2$, $A_4$ and $A_3$, respectively. When $A_5$ and $A_4$ communicate, $A_2$ is not entirely out of interference range because of the side lobes. We therefore measured the interference received by $A_2$, as described below. In this setup, $A_5$ and $A_4$ communicate, and $A_2$ listens. The parameters S(4,5) and S(2,5) are plotted in Figure 4.12. From the plot, the values of S(4,5) and S(2,5) at 60 GHz are -18.741 dB and -78.824 dB, respectively. Using these values, we determine the SIR as follows:

$$S(4,5) = 10\log_{10}\frac{P_{RX}}{P_{TX}} \quad\Rightarrow\quad \frac{P_{RX}}{P_{TX}} = 10^{S(4,5)/10} \tag{4.12}$$

$$S(2,5) = 10\log_{10}\frac{P_{INT}}{P_{TX}} \quad\Rightarrow\quad \frac{P_{INT}}{P_{TX}} = 10^{S(2,5)/10} \tag{4.13}$$

Now from equations (4.12) and (4.13), we can calculate SIR as follows:

$$SIR = 10\log_{10}\frac{P_{RX}}{P_{INT}} = S(4,5) - S(2,5) = 60.083\ \text{dB} \tag{4.14}$$

The interference signal received by $A_2$ (when $A_4$ and $A_5$ communicate in the above setup) corresponds to an S(2,5) of -78.824 dB, which is very low. The estimated worst-case SIR is 60.083 dB. Therefore, optimal placement of wireless interfaces can reduce the interference effects.
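The SIR figure can be checked numerically from the two S-parameter readings quoted above. The short sketch below computes it both directly in dB and through the linear power ratios; the function names are chosen here for illustration.

```python
import math


def sir_db(s_signal_db, s_interference_db):
    """SIR in dB as the difference of the two coupling coefficients in dB."""
    return s_signal_db - s_interference_db


def sir_db_from_ratios(s_signal_db, s_interference_db):
    """Same quantity, computed through the linear power ratios."""
    p_rx_over_p_tx = 10 ** (s_signal_db / 10)
    p_int_over_p_tx = 10 ** (s_interference_db / 10)
    return 10 * math.log10(p_rx_over_p_tx / p_int_over_p_tx)


# Readings at 60 GHz quoted in the text: S(4,5) and S(2,5).
sir = sir_db(-18.741, -78.824)
```

Both forms agree because the transmitted power cancels in the ratio, so the SIR depends only on the two coupling coefficients.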



Figure 4.12 S(2,1) plot for SIR calculation



Figure 4.13 Wireless links placement in 64 core system

#### 4.3.4 Interference-Aware WI Placement

Using the properties of the PLPA and the constrained placement optimization problem from sections 4.2.3 to 4.2.5, we obtain a WI placement that avoids interference and is optimized for average hop count. We implemented the placement algorithm for system sizes from 64 nodes to 1024 nodes. The WI placement and DWNoC topology for a 64-core system are shown in Figure 4.13. As seen, no two wireless links cross each other's paths, and no unpaired WI is present in the end-fire region of another link. This allows three simultaneous communications on the same frequency without any interference. The numbers of WI pairs that can be placed using this algorithm for system sizes of 64, 128, 256 and 512 nodes are 3, 3, 7 and 10, respectively. These values are obtained with the overhead constraints disabled in the algorithm.

#### 4.3.5 Performance Evaluation of DWNoC

In this section, we evaluate the peak bandwidth, latency, and energy efficiency of the proposed directional antenna based DWNoC architecture developed in the preceding subsections. The network architectures discussed earlier are simulated using a cycle-accurate simulator which models the progress of data flits accurately per clock cycle, accounting for flits that reach the destination as well as those that are dropped. One hundred thousand iterations were performed to reach stable results in each experiment, eliminating the effect of transients in the first few thousand cycles.

Figure 4.14 Latency for various NoC architectures for 64 core system

We evaluate the latency of the proposed DWNoC against standard mesh, torus and clos under a uniform random traffic pattern. The variation of packet latency for all architectures is plotted in Figure 4.14. The wireless links enable a lower saturation latency than mesh, torus and clos, which is a large improvement in network capability over these standard topologies.

A 64-core system with uniform random traffic distribution is also used for bandwidth evaluation, and the results are shown in Figure 4.15. We compare DWNoC with a conventional wireline mesh, torus, clos and the token-based WNoC. The token-based WNoC is formed by overlaying the same small-world based wireline NoC with the optimized location and number of WIs as in [83]. It was found that 13 WIs optimized the achievable bandwidth in the case of the token-based WNoC. The zig-zag antennas used in this architecture are not directional, and hence the shared wireless medium requires a token passing protocol that grants access to a single transmitter at any given time. The OOK-based wireless transceivers used in the token-based WNoC are the same as in the proposed DWNoC and operate in the same mm-wave frequency range.

Figure 4.15 Peak bandwidth and packet energy dissipation of various NoC architectures for 64-core system.

Through cycle-accurate system-level simulations, we found that our proposed architecture achieves higher bandwidth and lower packet energy than the existing architectures: mesh, torus, clos and token-based WNoC. The token-based WNoC has higher bandwidth and consumes less packet energy than the mesh, torus and clos of the same system size due to the efficient small-world network topology and energy-efficient wireless links. However, in the token-based WNoC, only a single wireless link is active at any given point of time. This limits the potential performance gains of the token-based system. In contrast, long-range point-to-point wireless links established with directional antennas enable multiple concurrent wireless links. Consequently, both performance and packet energy improve in the proposed DWNoC architecture with directional antennas.

Figure 4.16 Peak bandwidth and packet energy dissipation of various NoC architectures for 256-core system

We have evaluated the directional antenna based DWNoC with a similar number of WIs (3 links or 6 WIs for the 64-core system) as the token-based WNoC optimized for best performance. Even for the same overheads in wireless transceivers, we see that the proposed architecture achieves higher bandwidth and lower packet energy than the token-based system. The DWNoC with the best trade-off in packet energy and area overhead, with 6 wireless links (12 WIs), achieves significantly higher bandwidth. For the 256-core system, the same trend is observed in Figure 4.16. The 256-core DWNoC further improves the performance as it has a higher number of concurrent wireless links. The proposed architecture improves peak bandwidth by up to 84.13% over mesh, 67.39% over torus, 53.44% over clos, and 9% over regular WNoC under a uniform random traffic pattern for the 64-core system. Similarly, it increases peak bandwidth by up to 213.89% over mesh, 159.41% over torus, 101.21% over clos, and 10.13% over regular WNoC for the 256-core system. This architecture also improves energy efficiency by up to 22.20% over mesh and 7% over regular WNoC for the 64-core system.

#### 4.3.6 Application Specific Traffic

In this section, we evaluate the performance of the proposed DWNoC in the presence of real application-based traffic for a 64-core system. To bring out the characteristics of the DWNoC architecture in the presence of both computation-intensive and communication-intensive workloads, we consider six *SPLASH-2* [46] benchmarks, *FFT, FMM, Radix, LU, Radiosity, Raytrace*, and three *PARSEC* [45] benchmarks, *Blackscholes, Canneal* and *Fluidanimate*, which vary in their traffic distributions. The characteristics of the applications are summarized in Table 4.2.

Table 4.2 Percentage of busy and idle cycles in a 64-core system given default problem sizes

| Benchmark    | Busy % | Idle % | Default Problem Size                |
|--------------|--------|--------|-------------------------------------|
| FFT          | 81.99  | 18.01  | 65,536 Data Points                  |
| Radix        | 84.98  | 15.02  | 262,144 Integers, 1024 RADIX        |
| LU           | 87.62  | 12.38  | 512x512 Matrix, 16x16 Blocks        |
| Canneal      | 56.74  | 43.26  | 200,000 Elements                    |
| FMM          | 67.64  | 32.36  | 16K particles                       |
| Radiosity    | 81.46  | 18.54  | room, -ae 5000.0 -en 0.050 -bf 0.10 |
| Raytrace     | 80.25  | 19.75  | car                                 |
| Blackscholes | 79.15  | 20.85  | 65,536 options                      |
| Fluidanimate | 72.78  | 27.21  | 5 frames, 300,000 particles         |

Figure 4.17 Bandwidth and packet energy of WNoCs with application-specific traffic for 64 core system

The workloads are simulated using GEM5 [84], a full-system simulator, and the resulting traffic traces are then fed to the NoC simulator to determine network-level characteristics. Figure 4.17 shows the peak bandwidth and packet energy at network saturation for the proposed DWNoC and the token-based WNoC architecture in the non-uniform traffic scenario. The results show that non-uniform traffic patterns follow the same trend as uniform traffic. For all traffic patterns, the DWNoC architecture with 6 WIs performs best due to the higher number of simultaneous links. *Canneal* has the highest offered load and hence displays the highest gains in bandwidth for the DWNoC architectures compared to the other benchmarks. This implies that communication-intensive workloads benefit more in terms of performance from the directional antenna based DWNoC. However, the gain in packet energy for the DWNoC with 6 WIs is significantly higher for less communication-intensive benchmarks like *FFT*. In this case, increasing the number of wireless links results in more packets using wireless links, which yields higher energy efficiency.

#### 4.3.7 Area Overheads and Scalability

In this section, we evaluate the area overheads required to enable the DWNoC architectures. The DWNoC with 12 WIs is designed for the 256-core system. The token-based WNoC has 13 WIs as proposed in [83]. However, the antennas used in the two architectures are different and hence have slightly different area requirements. The total number of electronic ports in all the switches is the same for the WNoCs and the conventional electronic mesh. As a result, the area of the wired switches is equal for all three architectures. The area overheads for all the WNoCs and the mesh are shown in Figure 4.18. The wireless transceiver circuits require an active area of 0.16mm<sup>2</sup> each [19] and the area of the PLPA antennas is 1.33mm<sup>2</sup> each. The overall area for the DWNoC with 6 wireless links is the maximum for the 256-core system and is approximately 11.57% of the total die area, assumed to be 400mm<sup>2</sup>.

Figure 4.18 Percentage of area overhead over total silicon area

The optimal number of WIs can be decided as per the system size. Because of the associated power and area overhead, the number of WIs in a wireless NoC scales at a much slower rate than the number of cores. Depending upon the system size, additional links can be placed by applying the above constraints within the WI placement algorithm. Hence, our proposed approach provides a scalable, modular DWNoC architecture.

#### 4.4 Conclusion

In this work, we explore the use of directional antennas for on-chip wireless interconnects where multiple wireless pairs can communicate simultaneously [85]. We propose an interference-aware WI placement algorithm for the WNoC architecture by incorporating directional planar log-periodic antennas (PLPAs). This DWNoC architecture enables directional point-to-point links between transceivers, and hence multiple wireless links can operate at the same time without interference. The proposed architecture improves peak bandwidth by up to 84.13% over mesh, 67.39% over torus, 53.44% over clos, and 9% over regular WNoC under a uniform random traffic pattern for the 64-core system. Similarly, it increases peak bandwidth by up to 213.89% over mesh, 159.41% over torus, 101.21% over clos, and 10.13% over regular WNoC for the 256-core system. This architecture also improves energy efficiency by up to 22.20% over mesh and 7% over regular WNoC for the 64-core system.

# Chapter 5 Low Latency Network for Off-Chip Memory Access

In this chapter, we highlight the on-chip communication bottlenecks between Last Level Caches (LLCs) and Memory Controllers (MCs) in accessing off-chip memory. To overcome them, we discuss a hybrid switching strategy with dual crossbar routers that allow simultaneous use of both circuit-switched and packet-switched paths. We also find the optimal number and placement of memory controllers in the network using a machine learning approach. We further improve upon this by using power-efficient drowsy virtual channels and power gating techniques to achieve energy-efficient off-chip memory access. A routing protocol is introduced for seamless communication between LLCs and MCs using the adaptive hybrid switching strategy.

Communication between LLCs and memory controllers faces significant challenges due to the placement of memory controllers, high network latency, and the switching strategy. An important issue associated with off-chip memory access is the feasible number of memory controllers, which is limited by the available pin bandwidth, and their placement in the on-chip network. In particular, as system size increases, the latency between caches and the limited number of memory controllers increases, thereby degrading memory performance. There is therefore a great need for radically different approaches to improve memory controller access on future CMPs.

Existing networks based on packet switching (PS) have their own bottlenecks in terms of latency and throughput due to the use of the store-and-forward method, and they require more power for the complex switching protocol. Similarly, circuit switching (CS) provides guaranteed high bandwidth once a path is established, but it increases the network delay due to circuit/path setup, and data is lost if a router in the circuit is down. A feasible approach to improving memory controller access is to establish dedicated paths/links with a combination of both CS and PS methodologies. To implement this hybrid switching, the major challenges are circuit switch setup, true selection logic for hybrid switching, the routing protocol and workloads.

Figure 5.1 An example of $6\times6$ Mesh NoC topology with memory controller, processing elements and Off-chip memory

To increase the off-chip memory access bandwidth, we propose an adaptive hybrid switching strategy with a dual crossbar router that combines packet-switched and circuit-switched networks. We also propose an optimal placement strategy for the memory controllers in this work. An example of the proposed architecture is shown in Figure 5.1, which consists of on-chip components such as memory controllers, caches and processing elements, and off-chip memory components. The proposed architecture forms circuit-switched paths between last level caches and memory controllers to provide low-latency access to main memory. The regular data communication between all other nodes in the network uses the packet-switched network, thereby improving memory access performance without impacting regular on-chip network communication. Based on the request from a core, a reservation table is used to form a circuit-switched path to the nearest memory controller. The dual crossbar router contains circuitry for both the packet-switched and circuit-switched networks. To provide guaranteed paths with low latency, we find the optimal number and placement of memory controllers in the network using a clustering algorithm. To reduce the energy overhead of dual crossbar routers, we introduce partially drowsy and power gating techniques in the proposed architecture.

#### 5.1 Related work

The communication infrastructure between LLCs and MCs is the biggest performance bottleneck in the latest technology generations. Hybrid switching strategies and the placement of memory controllers have been explored to overcome this challenge. The global management of NoCs in accelerator-rich architectures with a hybrid network is explored in [86] to provide predictable performance and energy efficiency. A global accelerator manager is used to manage the location and timing of accelerator accesses and to reserve paths with fewer conflicts. The major issues with a single global manager are the latency and routing congestion in communicating with all cores. To minimize LLC access delays, a hybrid scheme that uses virtual cut-through switching for short request packets and circuit switching for longer, delay-sensitive response packets is introduced in [87]. A hybrid router combining circuit and packet switching with a bus architecture for on-chip networks is discussed in [88] to handle streaming and best-effort traffic efficiently. Similarly, possible placements of memory controllers have been discussed to reduce the communication bottleneck. In this context, the authors of [24] have explored the optimal placement of memory controllers using a divide and conquer method. In this method, the number of memory controllers increases exponentially with system size. The placement of memory controllers is explored in [89] using a simulated annealing approach to find the best memory controller configuration. The authors of [90] focus on the optimal placement of cores, distributed shared LLCs and memory controllers to minimize latency. A genetic algorithm based memory controller placement is presented in [23] to reduce contention and latency in the on-chip fabric. The authors of [91] propose two networks prioritizing memory response message latency that can cooperatively improve performance by reducing end-to-end memory access latencies. A zero-latency circuit setup scheme is explored to transfer individual data packets [92]. A hybrid circuit/packet-switched NoC that exploits communication locality through periodic configuration of the most beneficial circuits has been explored in [93]. In addition, a unique private cache scheme targeting this class of interconnects exploits communication locality to improve communication latency. Most of the state-of-the-art architectures use hybrid switching, simulated annealing, genetic algorithms and analytical approaches. Hence, suitable and feasible solutions are required to identify the switching strategy and the optimal number and placement of memory controllers for providing a low-latency path between LLCs and MCs.

There has been significant research targeting energy-efficiency of on-chip communication infrastructures. Online reinforcement learning based machine learning is introduced to perform dynamic partitioning of LLC along with DVFS in core and network components in [94]. Fine-grained power gated FlexiBuffer is explored in [33] to reduce leakage power with minimal changes to flow control. NoRD (Node-Router Decoupling) [34], a novel technique for bypassing power gated routers, decouples the node's ability to transfer a packet by monitoring the status of associated router. Power-aware routing and topology reconfiguration named Panthre [31] is proposed to provide long intervals of uninterrupted sleep to selected units using power gating. The drowsy buffer is explored in [18] to reduce the power consumption of the network. In this work, we design a router that supports adaptive hybrid strategy. The performance is further optimized by finding the optimal number and placement of memory controllers using machine learning approach. The proposed method promises to increase power efficiency of on-chip communication infrastructure significantly. We present a detailed performance evaluation of proposed NoC architecture and explore the performance overheads and associated trade-offs for realizing the proposed NoC architecture.

# **5.2 Efficient Low Latency NoC**

The proposed router with a dual crossbar and controller is shown in Figure 5.2. It consists of five I/O ports, a dual crossbar, a routing unit, an allocator, a reservation table, mux/demux, a controller, latches, and virtual channels (VCs). The controller consists of an analyzer, a selection unit, and a power management unit. Four virtual channels are associated with each port. The crossbars are implemented using a folded technique to support dual switching in a given cycle while reducing the energy overhead. The two 2-folded crossbars consist of four switching elements (two for each crossbar), whereas a single conventional crossbar uses five switching elements. The folded bufferless crossbar (upper crossbar) saves energy by avoiding read/write operations in buffers, while the regular switch stores data in VCs.



Figure 5.2 Proposed router architecture

The analyzer observes the address of the task injected by the processing element. Latches store the data until the analyzer completes its operation. The selection logic alters the switching mode based on the response from the analyzer. The power management controller decides on power saving based on the selection logic response and the traffic pattern. When a packet is injected into a router, the controller, depending on the task and the selection logic response, sets the flag to '0' for circuit switching and '1' for packet switching. If the flag is '0', data traverse the upper crossbar; otherwise they traverse the lower crossbar. The adaptive hybrid switching strategy, energy efficiency, and placement of memory controllers are discussed in the subsequent sections.

# **5.2.1 Hybrid Switching Strategy**

The key design goal of our proposed hybrid switching network is to provide a high bandwidth memory access for CMPs. The dual crossbar based router architecture enables high-performance communication infrastructure by allowing simultaneous use of both circuit and packet switched paths.

Table 5.2 Algorithm for circuit switching setup

|    | Algorithm I: Pseudo Code for Circuit Switching Setup                                     |
|----|------------------------------------------------------------------------------------------|
| 1. | Controller collects addresses from core during injection                                 |
| 2. | Analyzer analyzes the received addresses                                                 |
| 3. | for all addresses obtained from core do                                                  |
| 4. | Reserve the path from Source to Destination between buffer, caches and memory controller |
| 5. | if conflict found                                                                        |
| 6. | Wait into VC                                                                             |
| 7. | else update the reservation table                                                        |
| 8. | end for                                                                                  |
| 9. | Reserve the path for circuit switching between core and memory controller.               |

# **5.2.1.1** Circuit-Switching Setup

In this section, we discuss the circuit switching setup for the hybrid router. A reservation table is used to configure the circuit switching path as described in Table 5.2. The controller receives information regarding read/write sets from the processing core and performs the buffer allocation operation. Based on this data, the controller finds the nearest shared last-level caches and memory controller. It reserves the circuit switching path between buffer, caches, and memory controller and stores the *StartTime*, *EndTime*, *InPort*, and *OutPort* information from the data set locally in the reservation table. At the beginning of the processing task, the controller sends a request to the memory controller according to the read set to fetch data from the buffer. The memory controller sends the responses once all the requests are satisfied; the access latency is uncertain and depends on the position of the LLCs and memory controllers. To reduce the access latency, optimal placement of memory controllers is required, which is discussed in a later section. If a conflict is found during memory controller access, the controller stops the current packet transfer or diverts it to another path and grants the signal to establish a circuit-switched path. If there is no conflict, the router updates the reservation table at each cycle. Once the path is established, the data stream follows the same

**Table 5.1 Intensity Classification of PARSEC Benchmark**

| Applications                               | Workload type          | Switching Techniques |
|--------------------------------------------|------------------------|----------------------|
| Blackscholes, Bodytrack, Vips, Dedup       | Compute intensive (CI) | Packet Switching     |
| Freqmine, Ferret, X264, Fluidanimate       | Hybrid                 | Hybrid Switching     |
| Streamcluster, Canneal, Facesim, Swaptions | Memory intensive (MI)  | Circuit Switching    |

Table 5.3 Selection logic for hybrid switching

|     | Algorithm II: Selection Logic for Hybrid Switching |
|-----|----------------------------------------------------|
| 1.  | Controller receives a task description from a core |
| 2.  | if MI && ~CI                                       |
| 3.  | Setup circuit switch (algorithm 1)                 |
| 4.  | Transfer packet                                    |
| 5.  | Destroy circuit switch path                        |
| 6.  | Enable packet switching                            |
| 7.  | else if CI && ~MI                                  |
| 8.  | Use buffered crossbar                              |
| 9.  | Transfer packet                                    |
| 10. | else                                               |
| 11. | Hybrid switching (packet + circuit)                |
| 12. | if (conflict found)                                |
| 13. | Set Wait for packet switch                         |
| 14. | Transfer circuit switch packet                     |
| 15. | Transfer packet switch data                        |

path. Once the circuit switch session is complete, controller resets the path to be used for packet switching.
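The reservation-table bookkeeping used during circuit setup can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this chapter: the `Reservation` and `ReservationTable` classes, the half-open time interval, and the rule that two circuits conflict when they overlap in time and share an input or output port are hypothetical. Only the four field names (*StartTime*, *EndTime*, *InPort*, *OutPort*) come from the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reservation:
    """One reservation-table entry with the four fields named in the text."""
    start_time: int
    end_time: int  # assumed exclusive: the port is free again at end_time
    in_port: str
    out_port: str


@dataclass
class ReservationTable:
    entries: list = field(default_factory=list)

    def conflicts(self, req: Reservation) -> bool:
        """Assumed rule: overlap in time on the same input or output port."""
        for e in self.entries:
            overlap = req.start_time < e.end_time and e.start_time < req.end_time
            same_port = req.in_port == e.in_port or req.out_port == e.out_port
            if overlap and same_port:
                return True
        return False

    def reserve(self, req: Reservation) -> bool:
        """Record the circuit if the path is free; otherwise the caller waits in a VC."""
        if self.conflicts(req):
            return False
        self.entries.append(req)
        return True

    def release(self, now: int) -> None:
        """Drop finished circuits so their ports return to packet switching."""
        self.entries = [e for e in self.entries if e.end_time > now]


table = ReservationTable()
first = table.reserve(Reservation(0, 10, "local", "east"))
clash = table.reserve(Reservation(5, 12, "west", "east"))   # east is still reserved
later = table.reserve(Reservation(10, 15, "west", "east"))  # starts after the first ends
```

In this sketch a rejected request corresponds to the "Wait into VC" branch of Algorithm I, and `release` corresponds to resetting the path for packet switching once the circuit session completes.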

# 5.2.1.2 Selection Logic for Hybrid-Switching

A selection logic unit is associated with each router to implement the hybrid switching strategy. The selection logic unit operates based on the workload. Workloads are classified into three categories: i) compute intensive, ii) hybrid, and iii) memory intensive. For example, the intensity classification of the *PARSEC* benchmarks is given in Table 5.1. The selection logic works according to the type of benchmark, as described in Table 5.3. Both switching modes can work simultaneously because a dual crossbar is incorporated in the router architecture.
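The mapping from workload class to switching mode can be sketched in a few lines. The sketch below is illustrative only: the names `Workload`, `Crossbar`, `select_crossbars`, and `crossbars_for` are hypothetical, and the lookup table simply restates the PARSEC classification given in this section. The flag values follow the earlier description ('0' for circuit switching on the upper, bufferless crossbar; '1' for packet switching on the lower, buffered crossbar).

```python
from enum import Enum


class Workload(Enum):
    COMPUTE_INTENSIVE = "CI"
    HYBRID = "hybrid"
    MEMORY_INTENSIVE = "MI"


class Crossbar(Enum):
    BUFFERLESS = 0  # flag '0': circuit switching, upper crossbar
    BUFFERED = 1    # flag '1': packet switching, lower crossbar


# PARSEC intensity classification as listed in this section.
PARSEC_CLASS = {
    "blackscholes": Workload.COMPUTE_INTENSIVE,
    "bodytrack": Workload.COMPUTE_INTENSIVE,
    "vips": Workload.COMPUTE_INTENSIVE,
    "dedup": Workload.COMPUTE_INTENSIVE,
    "freqmine": Workload.HYBRID,
    "ferret": Workload.HYBRID,
    "x264": Workload.HYBRID,
    "fluidanimate": Workload.HYBRID,
    "streamcluster": Workload.MEMORY_INTENSIVE,
    "canneal": Workload.MEMORY_INTENSIVE,
    "facesim": Workload.MEMORY_INTENSIVE,
    "swaptions": Workload.MEMORY_INTENSIVE,
}


def select_crossbars(workload: Workload) -> tuple:
    """Return the crossbar(s) enabled for a task of the given class."""
    if workload is Workload.MEMORY_INTENSIVE:
        return (Crossbar.BUFFERLESS,)
    if workload is Workload.COMPUTE_INTENSIVE:
        return (Crossbar.BUFFERED,)
    # Hybrid workloads use both crossbars at the same time.
    return (Crossbar.BUFFERLESS, Crossbar.BUFFERED)


def crossbars_for(benchmark: str) -> tuple:
    """Hypothetical helper: look up a PARSEC benchmark by name."""
    return select_crossbars(PARSEC_CLASS[benchmark.lower()])
```

For instance, `crossbars_for("Canneal")` selects only the bufferless crossbar, while `crossbars_for("ferret")` enables both.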

# **5.2.1.3 Routing Protocol**

The routing protocol for seamless on-chip communication using hybrid switching is discussed here. A typical router pipeline requires five stages to transfer a packet: Buffer Write (BW), Routing Computation (RC), Virtual channel Allocator and Switch Allocator (VA and SA), Switch Traversal (ST), and Link Traversal (LT), as shown in Figure 5.3. The packet switching strategy follows these stages to transfer a packet. Because circuit switching establishes a dedicated path between two nodes, it bypasses the earlier pipeline stages, and a packet traverses only the ST and LT stages. XY-routing is adopted for both circuit switching (CS) and packet switching (PS). Once a path is established for CS, the controller switches the routing strategy to the bufferless crossbar. If conflicts are found during circuit switching path setup, the flit(s) for PS are stored in either vacant VCs or neighbor routers, enabling both PS and CS to work simultaneously. To store the flits in a neighbor router, a look-ahead routing strategy is used.

Figure 5.3 The pipeline stages: (a) 5-stage baseline router for packet switching, (b) 2-stage router for circuit switching
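To make the difference between the two pipelines concrete, the following sketch computes an XY route on a mesh and a rough zero-load hop latency for each switching mode. It assumes one cycle per pipeline stage and counts the stages once per hop, which is a simplification introduced here for illustration; the function names are hypothetical and the numbers are not simulation results.

```python
PACKET_STAGES = ("BW", "RC", "VA/SA", "ST", "LT")  # 5-stage packet-switched pipeline
CIRCUIT_STAGES = ("ST", "LT")                      # 2-stage circuit-switched pipeline


def xy_route(src, dst):
    """Return the (x, y) nodes visited: travel along X first, then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def zero_load_cycles(src, dst, stages):
    """Assumed model: every hop costs one cycle per pipeline stage."""
    hops = len(xy_route(src, dst)) - 1
    return hops * len(stages)


route = xy_route((0, 0), (2, 1))
packet_cycles = zero_load_cycles((0, 0), (2, 1), PACKET_STAGES)
circuit_cycles = zero_load_cycles((0, 0), (2, 1), CIRCUIT_STAGES)
```

For the three-hop route from (0, 0) to (2, 1), this simple model gives 15 cycles for packet switching and 6 cycles for circuit switching, ignoring contention and circuit setup time.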

For both switching modes, the router receives packets and forwards them to the output port by following the intermediate steps shown in Figure 5.4. The analyzer classifies packets as memory intensive, compute intensive, or hybrid. For CS, a packet either traverses an existing path or sets up a new path. To establish a new CS path, the arbitration policy of [95] is used. The arbitration policy is managed by a flit-ranking component and a port-selection component using knowledge of the reservation table. The flit-ranking component ranks all incoming flits in every cycle, and the flit with the highest rank gets access to establish a CS path. The port-selection component ranks the output ports based on the availability of each port and its desirability for that flit. The flit-ranking component gives the highest priority to CS flits, followed by PS flits.

Figure 5.4 Intermediate steps for packet transfer

Flit-level routing is adopted from [95] to avoid deadlocks and livelocks in the hybrid switching strategy. A packet is injected into a port of the router only if at least one channel is free. Otherwise, the flit must be stored in a VC of the upstream router until a VC becomes available. Every router decides the status of its channels locally. For example, suppose two flits, C and P, are injected into a router, where flit C is for CS and flit P is for PS. The flit-ranking and port-selection components give priority to flit C in every cycle. Flit P is stored in VCs while the CS session path is active, or it takes an alternative path. Together, the flit-ranking and port-selection components ensure that no deadlocks or livelocks occur [95].
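The example with flits C and P can be expressed as a small arbitration sketch. It is a simplified illustration, not the policy of [95] itself: the `Flit` fields, the use of age as a tie-breaker within a class, and the `arbitrate` function are assumptions made here. The only rule taken from the text is that CS flits are ranked ahead of PS flits and that a losing flit waits in a VC.

```python
from dataclasses import dataclass


@dataclass
class Flit:
    name: str
    is_circuit: bool  # True for a CS flit, False for a PS flit
    age: int          # assumed tie-breaker: older flits rank higher
    wanted_port: str


def arbitrate(flits, free_ports):
    """Rank flits (CS before PS, then oldest first) and hand out free ports."""
    free = set(free_ports)
    granted, waiting = {}, []
    ranked = sorted(flits, key=lambda f: (not f.is_circuit, -f.age))
    for f in ranked:
        if f.wanted_port in free:
            granted[f.name] = f.wanted_port
            free.remove(f.wanted_port)
        else:
            waiting.append(f.name)  # stays in a VC for this cycle
    return granted, waiting


flit_c = Flit("C", is_circuit=True, age=1, wanted_port="east")
flit_p = Flit("P", is_circuit=False, age=5, wanted_port="east")
granted, waiting = arbitrate([flit_p, flit_c], free_ports=["east", "north"])
```

Here flit C obtains the east port even though flit P is older, and flit P remains buffered until the port is released or an alternative route is chosen.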

# **5.2.2** Energy-efficiency

In this section, we present a low-power router architecture that uses drowsy and power gating approaches to achieve an energy-efficient communication infrastructure. Of all the components of a router, the virtual channels and crossbar fabrics have comparatively high power consumption. To reduce this overhead, we apply these techniques to the NoC architecture as shown in Figure 5.5.

## **5.2.2.1 Power Management Controller**

The Power Management Controller (PMC) controls the drowsy and power gating operations of router components. It generates the pointer signal that manages the active/inactive components. The internal components of the router are classified into three states: Active (00), Drowsy (01), and Power gated (10). A 2-bit pointer signal is associated with the PMC to change the state of the NoC components, and the PMC operates the router under these states. Initially, the pointer component of the controller is assigned '00' (active state). A one-bit pilot signal ($Pilot\_In$/$Pilot\_Out$) is associated with every neighbor router to spread the information one hop ahead, as shown in Figure 5.2. This reduces the wake-up latency penalty of sleeping or drowsy components.
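The three pointer states and the one-hop pilot signal can be modeled as a small state machine. The sketch below is a behavioral illustration only; the class and method names are hypothetical, and the transition rules (go drowsy when data must be retained, power gate otherwise, wake on an incoming pilot bit) are assumptions consistent with, but not specified by, the description above.

```python
from enum import Enum


class PowerState(Enum):
    ACTIVE = "00"
    DROWSY = "01"
    POWER_GATED = "10"


class PowerManagementController:
    def __init__(self):
        # The pointer starts in the active state, as stated in the text.
        self.pointer = PowerState.ACTIVE

    def go_idle(self, holds_data: bool) -> None:
        """Assumed policy: keep data in drowsy mode, otherwise cut power."""
        self.pointer = PowerState.DROWSY if holds_data else PowerState.POWER_GATED

    def pilot_in(self, bit: int) -> None:
        """A pilot bit from an upstream neighbor wakes the component one hop early."""
        if bit == 1:
            self.pointer = PowerState.ACTIVE

    def pilot_out(self, packet_heading_to_neighbor: bool) -> int:
        """Raise the pilot bit toward the next router on the packet's path."""
        return 1 if packet_heading_to_neighbor else 0


pmc = PowerManagementController()
initial = pmc.pointer.value
pmc.go_idle(holds_data=False)
gated = pmc.pointer.value
pmc.pilot_in(1)
woken = pmc.pointer.value
```

Because the pilot bit arrives one hop ahead of the packet, the wake-up transition in this model overlaps with the packet's traversal of the previous router.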



Figure 5.5 PMC controls the inactive/active components

#### 5.2.2.2 Drowsy buffer

We introduce fine-grained power saving techniques using dynamic management. A drowsy technique is applied to save power without losing any data stored in the VCs. A D-flip-flop (DFF) based buffer is used to design the VCs. Before applying the drowsy scheme to the DFFs, we estimate the data retention voltage needed to retain the stored data in the VCs. The data retention voltage for the DFF is 0.63V at the 32nm technology node. Consequently, the drowsy circuit operates at two voltage levels: 0.63V (drowsy state) and 1V (active state). The drowsy signal is activated depending on the workload. To reduce the wake-up penalty and avoid data loss, we apply the drowsy circuit only to the first virtual channel of each port.

# 5.2.2.3 Power gated VCs/ Crossbar

We apply power gating to the rest of the virtual channels and to unused crossbar fabrics to completely shut down idle components. A pointer indicates the status of the virtual channels for every port, as shown in Figure 5.2. When a pointer changes to the power gating state, the controller sends the sleep signal to all VCs from the second VC onward. The status of the pointer changes depending on the workload. If the first virtual channel contains data, the pointer status for the second VC is changed to '00' to store the next incoming flit. The process is repeated for all other VCs: the pointer status of the $(n+1)^{th}$ VC is changed to '00' whenever the $n^{th}$ VC contains data. For this operation, we insert a header sleep transistor between the VCs/crossbar and the power supply to control the on/off states. This technique gives the greatest leakage (idle-state) power benefit when virtual channels and crossbars remain unused for a long time.
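The cascaded wake-up of virtual channels can be illustrated with a short function that derives each VC's pointer value from the occupancy of the VCs in a port. This is a sketch under assumptions: the function name is hypothetical, and treating an empty first VC as drowsy ('01') is a simplification, since the text states only that the drowsy signal depends on the workload.

```python
def vc_pointer_states(occupied):
    """Return the 2-bit pointer value for each VC of one port.

    occupied[n] is True when VC n currently holds data (VC 0 is the first VC).
    """
    states = []
    for n, has_data in enumerate(occupied):
        if has_data:
            states.append("00")      # in use, must be active
        elif n == 0:
            states.append("01")      # assumed: idle first VC is kept drowsy
        elif occupied[n - 1]:
            states.append("00")      # woken early because VC n-1 holds data
        else:
            states.append("10")      # idle and not needed yet: power gated
    return states


idle_port = vc_pointer_states([False, False, False, False])
one_busy = vc_pointer_states([True, False, False, False])
two_busy = vc_pointer_states([True, True, False, False])
```

With four VCs per port, an idle port yields `["01", "10", "10", "10"]`, and each newly occupied VC activates exactly one additional VC ahead of it.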

# **5.2.2.4 Optimal Memory Controller Placement**

To enable high-performance memory access along with the hybrid switching technique, we find the optimal number and placement of memory controllers in the on-chip network. We adopt the mean shift clustering algorithm [96] to find the placement and optimum number of memory controllers for a given system size. Mean shift is a non-parametric approach based on density estimation that is useful for analyzing clusters with arbitrary shapes. It treats the feature space of the nodes in the network as an empirical probability density function and finds the modes of this density. The modes are the local maxima, corresponding to the dense feature regions around which the clusters are developed. Because mean shift works with arbitrarily shaped clusters and needs no prior information about the number of clusters, it is well suited to placement in a NoC.

We use the communication activity between different cores and memory as the feature set to cluster the cores and find the locations of memory controllers in the NoC. The input feature of any node is its normalized communication activity with every other node (L1 cache, L2 cache, and memory) in the network. Hence, for a network with *N* nodes, the input is *N* feature vectors in an *N*-dimensional space. The mean shift algorithm finds regions of high communication activity and forms clusters of the corresponding nodes within these dense regions. We then place a memory controller at the center of each such cluster, thereby finding the number and placement of controllers based on application characteristics.
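A minimal flat-kernel mean shift, written with the Python standard library only, illustrates how the number of clusters emerges from the data rather than being fixed in advance. It is a sketch, not the MATLAB implementation used in this work: the `bandwidth` value, the flat (uniform) kernel, the mode-merging threshold, and the four toy feature vectors are assumptions chosen for illustration.

```python
import math


def mean_shift(points, bandwidth, iters=50, tol=1e-6):
    """Flat-kernel mean shift. Returns (modes, labels), one label per point."""

    def shift(p):
        # Repeatedly move p to the mean of all points within `bandwidth`.
        for _ in range(iters):
            near = [q for q in points if math.dist(p, q) <= bandwidth]
            new = tuple(sum(c) / len(near) for c in zip(*near))
            if math.dist(p, new) < tol:
                return new
            p = new
        return p

    modes, labels = [], []
    for p in points:
        m = shift(tuple(p))
        for i, existing in enumerate(modes):
            if math.dist(m, existing) < bandwidth / 2:  # same mode, same cluster
                labels.append(i)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels


# Toy example: four nodes, each described by a normalized activity vector.
# The values are invented for illustration and are not measured traces.
features = [
    (0.7, 0.2, 0.1, 0.0),
    (0.6, 0.3, 0.1, 0.0),
    (0.0, 0.1, 0.2, 0.7),
    (0.0, 0.1, 0.3, 0.6),
]
modes, labels = mean_shift(features, bandwidth=0.5)
```

In this toy case the algorithm finds two modes, so two memory controllers would be suggested, one per cluster. Mapping each mode back to a physical router position (the "center" of the cluster) is a separate step that this sketch leaves out.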

# **5.3 Experimental Results**

In this section, we discuss implementation, performance benefits, overheads and scalability of our proposed architecture. We compare our design with prior works that employ placement of memory controllers to address the memory bandwidth bottleneck.

**Table 5.4 Simulation setup**

| Setup                           | Components         | Configuration                                                                                                                                                                           |
|---------------------------------|--------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Details of system architecture  | System size        | 48 cores, 12 L2 caches                                                                                                                                                                  |
|                                 | L1 cache (private) | 8-way, 32KB, LRU policy                                                                                                                                                                 |
|                                 | L2 cache (shared)  | 16-way, 128KB, LRU policy, 1 L2 shared by four cores                                                                                                                                    |
|                                 | Memory             | 2048MB/channel, 4 channels/MC                                                                                                                                                           |
| Details of network architecture | Topology           | Mesh topology                                                                                                                                                                           |
|                                 | Routing            | XY routing, wormhole switching                                                                                                                                                          |
|                                 | Router             | 5-port (including local port), 8-flit depth per port<br>5-stage (BW, RC, SA, ST, LT) for packet switching (both baseline and proposed)<br>2-stage (ST, LT) for circuit switching |
|                                 | On-chip wire link  | 64-bit width, 1 cycle latency                                                                                                                                                           |
|                                 | Off-chip wire link | 256-bit width, 2 cycle latency                                                                                                                                                          |

# **5.3.1 Simulation Setup**

We model and simulate the proposed NoC architecture using the cycle-accurate heterogeneous system simulator Multi2Sim [97]. A system with 48 cores is used for all evaluations, and the system specifications are presented in Table 5.4. The baseline architecture uses two memory controllers [98]. The cores and memory are arranged in a mesh NoC topology. The routers are input buffered, with each port comprising 4 VCs. Each VC has a depth of 8 flits. The NoC switches are driven with a 2.5GHz clock. The width of all wired links is the same as the flit size, 32 bits. The network switches are synthesized with Synopsys Design Compiler using 32nm process technology. We also implement the drowsy circuit and the sleep transistors for power gating using Cadence tools at 65nm technology. The mean shift algorithm that finds the optimal placement of memory controllers in the proposed architecture is implemented in MATLAB. The performance is evaluated by simulating benchmarks from the *AMD SDK* [99] on the baseline and proposed architectures using Multi2Sim.

# **5.3.2** Router Implementation with Hybrid Switching and Overheads

To implement the hybrid switching scheme, we integrate a dual crossbar, a switching controller, and a power management controller with each router. The controller, with its analyzer and selection unit, occupies 1006.83µm². The area of the modified base router (including control units, buffers, crossbar, modified arbiter, RC, and VCs) is 9969.59µm². The total area overhead associated with the proposed hybrid approach is 0.16% of the total silicon area for 48 cores. The total power consumption of all base router components combined is 409µW. The controller with its analyzer and selection units consumes a total power of 42.84µW for all its operations. The sleep transistor consumes 8.89nW of leakage power and occupies 1.95µm² of area.

Figure 5.6 Optimal memory controller placement using mean-shift approach

# **5.3.3 Memory Controller Placement**

We collect core-to-memory and core-to-core communication traces from the full-system simulator as the feature set to cluster the cores and find the locations of memory controllers in the NoC. Before clustering, we estimate the mean of the core-to-memory and core-to-core communications for various traffic characteristics. We then run the mean-shift clustering algorithm, which finds regions of high communication activity and forms clusters of the corresponding nodes within these dense regions. We place a memory controller at the center of each cluster. Using the mean-shift clustering algorithm, we obtain a memory controller placement that optimizes overall inter-core communication and memory accesses. For the 48-core system with the applications considered, the optimal number of memory controllers is 4, and their placement is shown in Figure 5.6. In comparison, the prior works in [23] and [24] use 16 memory controllers in a 64-node system, as shown in Table 5.5.



Figure 5.7 Reduction in memory latency over baseline

#### 5.3.4 Performance Evaluation

We evaluate the performance of proposed architecture under different traffic situations in terms of execution time, throughput and latency.

# **5.3.4.1** Latency Improvements

Figure 5.7 shows the network performance in terms of average memory access latency under various applications. It can be seen that the proposed architecture reduces access latency by 27.15% on average (geometric mean) over baseline architecture. The improvements are achieved by forming dedicated circuit switch paths between last-level caches and memory controllers.



Figure 5.8 Improvement in peak throughput over baseline



Figure 5.9 Improvement in application runtime over baseline

#### **5.3.4.2** Throughput Improvements

Figure 5.8 shows the peak throughput improvement provided by hybrid switching with optimal memory controller placement. The throughput is increased by 61.97%, on average over baseline architecture with 2 memory controllers. By finding the optimal number and placement of memory controllers, uniform access is provided to all nodes to improve the overall throughput of memory accesses.

# **5.3.4.3 Execution Time**

Figure 5.9 shows the reduction in application runtime with the proposed architecture compared to the baseline architecture. We consider one and two applications running simultaneously in all our evaluations. On average, the proposed architecture speeds up application execution by 12.53% over the baseline architecture.

#### **5.3.5** Energy saving

The average packet energy saving obtained by the proposed architecture over the baseline system is shown in Figure 5.10. On average, the proposed architecture saves 21.13% in energy across all benchmarks compared with a regular NoC. The energy improvements come from drowsy and power-gated network components (in the idle state) and from low-energy paths for memory controller accesses.



Figure 5.10 Reduction in network energy over baseline

#### 5.3.6. Summary of Existing and Proposed Architectures

Table 5.5 summarizes the prior works and the proposed work that tackle the issue of memory performance in CMPs. As can be seen, the proposed architecture provides significant benefits in terms of latency, runtime, and energy metrics. Furthermore, the proposed modifications rely only on local information, and hence the design is scalable to any system size without incurring significant overheads.

# **5.4 Conclusions and Future Works**

#### 5.4.1 Conclusions

In this chapter, we propose an energy-efficient, adaptive hybrid-switching NoC to address the on-chip communication bottlenecks between LLCs and MCs. A dual-crossbar router is introduced to support circuit and packet switching simultaneously. The performance is further optimized by finding the optimal number and placement of memory controllers using a machine learning approach. Energy efficiency is enhanced by introducing a drowsy and power-gated router. We also present the detailed implementation, overheads, and performance benefits of the proposed architecture. Performance evaluations show that the proposed design achieves improvements of 12.53%, 61.97%, and 21.13% in

**Table 5.5 Summary of Existing and Proposed Works** 

| Ref. | Configurations and workloads | Network parameters and Technology | MCs | Improvements |
|------|------------------------------|-----------------------------------|-----|--------------|
| [22] | 64 in-order cores; L1 I&D caches: 16KB (2-way set associative), 1 cycle latency; L2 (private): 64KB (4-way set associative), 6 cycle latency; L3 (shared): 8MB (16-way associative); Memory latency: 150 cycles<br>Workload: TPC-H, SPECjbb, TPC-W, SPECweb and Synthetic | Processors: 64; router latency: 1 cycle; Inter-router wire latency: 1 cycle; Buffer: 32 flits; virtual channels: 2 for XY, YX, CDR and 4 for XY-YX | 16 | 10-15% latency reduction |
| [23] | Processors: 64; Processor frequency: 2GHz; L1 cache (private): 16KB, 4-way, 64 bit line, 3 cycle access latency; L2 cache (shared): 64MB, 64 bit line, 6 cycle latency; Memory latency: 260 cycles<br>Workload: SPLASH-2, PARSEC and TPC-H | 8×8 Mesh, 5-port router, Router scheme: Wormhole, Flit size: 128 bits, Technology: 32nm CMOS process | 16 | 7.63% latency reduction, 13.94% PDP reduction |
| [68] | L1 I&D (private): 32KB, 4-way, 3 cycle latency; L2 (shared): 2MB, 8-way, 6 cycle latency; Memory latency: 280-cycle access latency; Processor frequency: 2GHz<br>Workload: PARSEC | 4×8 Mesh topology, XY routing, wormhole switching, 4-stage router pipeline | 4 | 11.3% runtime reduction, 11% energy reduction |
| Proposed Method | 48 cores, 12 L2 caches; L1 cache (private): 8-way, 32KB, LRU policy; L2 cache (shared): 16-way, 128KB, LRU policy, 1 L2 shared by four cores<br>Workload: AMD SDK | Mesh topology, wormhole routing, XY-routing, Technology: 32nm CMOS process | 4 | 27.86% runtime reduction, 30.28% throughput improvement, 31.21% energy reduction |

runtime, throughput, and network energy respectively.

# **5.4.2 Future Works**

Though the proposed work provides significant benefits for accessing memory controllers and improves memory access performance, there is still scope for further enhancement in performance. As a part of our ongoing investigations, we would like to explore further enhancements for optimal placement of memory controllers and a hardware-level hybrid switching strategy. In hybrid switching, the main challenge is to build a packet-switched network for communication-centric workloads and a circuit-switched network for memory-intensive workloads. This challenge involves both hardware and software components. Furthermore, a long circuit switching delay results in significantly poorer performance. There are also stability issues due to rapidly changing traffic patterns. Transitioning between the two switching modes every few milliseconds is a significant challenge for hardware design. Efficient scheduling is required to address these issues on a reconfigurable hardware system. Among the challenges, it is difficult to predict how different applications with disparate requirements would perform and impact the scheduling of the system. I plan to explore all these challenges in the near future to reduce the on-chip communication bottlenecks between last-level caches and memory controllers.

# Chapter 6

# Conclusions and Future Works

# **6.1 Conclusions**

In this dissertation, we discussed on-chip communication bottlenecks for many-core systems in terms of power and performance. The applications of many-core systems range from embedded systems to data centers to supercomputers. An efficient on-chip communication infrastructure is a major concern for many-core systems. To avoid the excessive power consumption due to interconnect technology, we explored various novel power saving schemes for NoC architectures. At the same time, we introduced a seamless bypass routing strategy to maintain system performance. Wireless NoCs, by augmenting wired topologies with low-latency wireless links, overcome the performance limitations of conventional NoCs. However, Wireless Interfaces (WIs) consume a significant amount of power. We resolved this limitation of WIs by introducing power gating along with an AMS controller. This is realized with a receiver-end control strategy that wakes up the desired WI holding the token for wireless communication. Existing WNoCs suffer in terms of scalability because only one pair can communicate at a time, as the WIs are designed to operate at the same frequency. To overcome this, we proposed a directional wireless NoC topology using PLPAs with optimal placement of WIs. To evaluate these wireless-interconnect-based systems, we developed an efficient evaluation framework for extensive performance evaluation. Additionally, to enhance system performance in terms of memory access latency, we proposed an energy-efficient adaptive hybrid switching strategy with dual-crossbar routers that allow simultaneous use of both circuit- and packet-switched paths. All proposed architectures are evaluated using the popular PARSEC and SPLASH-2 benchmark suites.

In Chapter 2, the utilization estimation method for routers is discussed. The utilization of the router is computed using two levels of operation: pre-computed global and dynamic runtime utilization methods. The proposed architecture saves up to 92.20% of the leakage power in base routers. The detailed implementation of the power management controller and the impact of power gating on the base router are described. To maintain performance, a seamless bypass routing strategy is explored. This routing strategy alleviates the performance penalty for some real and synthetic benchmarks. It saves 49% of the overall packet energy on average compared to a regular WNoC.

In Chapter 3, to reduce the switching power, the dynamic runtime utilization is modeled using a stochastic process for better accuracy. The AMS scheme saves up to 56% in network packet energy consumption compared to baseline architectures without incurring significant performance penalty or area overhead. An energy-efficient transceiver for WNoC using a novel power gating approach is presented along with its detailed impact. We propose a receiver-end wake-up control strategy that uses an address signature sent along with data packets. This strategy processes the signature without interrupting the inactive components. A new approach to transmitting the medium access token through the wireless link is proposed to minimize control signal latency. To the best of our knowledge, this is the first work that shows how to minimize the impact of power gating on the performance of WIs and discusses the power efficiency of WIs. The proposed architecture saves up to 62.50% of idle-state power consumption in WIs for a 256-core system using the power gating approach. We present the details of the AMS controller, along with the multilevel voltage shifter and voltage regulator that dynamically change the supply voltage of router components and WIs based on their utilization. Furthermore, the proposed method can be effectively implemented for both wired and wireless NoC topologies with only 7% area overhead compared to the baseline router architecture.

In Chapter 4, we explore the directional wireless NoC to support simultaneous multi-channel communication using PLPAs. The optimal placement of the WIs is described, which avoids interference effects by satisfying all the interference-aware constraints. The proposed architecture improves peak bandwidth by up to 84.13% over mesh, 67.39% over torus, 53.44% over Clos, and 9% over regular WNoC under a uniform random traffic pattern for a 64-core system. Similarly, it increases peak bandwidth by up to 213.89% over mesh, 159.41% over torus, 101.21% over Clos, and 10.13% over regular WNoC for a 256-core system. This architecture also improves energy efficiency by up to 22.20% over mesh and 7% over regular WNoC for a 64-core system. It will also improve peak bandwidth as well as energy efficiency for multiple concurrent applications at large scale. The proposed architecture significantly increases the utilization of WIs through concurrent communication.

In Chapter 5, we address the on-chip communication bottlenecks between Last Level Caches and Memory Controllers in accessing off-chip memory. We explore an energy-efficient hybrid switching mechanism and the optimal placement of memory controllers, applying a machine learning approach to the placement problem. Energy efficiency is enhanced by introducing a drowsy and power-gated router. The adaptive hybrid switching strategy with a dual crossbar scheme improves the peak throughput of the network by 61.97% and reduces the network energy by 21.13% compared to traditional NoC architectures. As part of our ongoing investigations, we would like to explore further enhancements to the optimal placement of memory controllers and a hardware-level hybrid switching strategy.

# **6.2 Future works**

I want to pursue multidisciplinary investigations to carry forward my work on efficient communication infrastructure for many-core architectures. Specifically, I would like to explore high-bandwidth memory access, hybrid electrical-optical interconnects, fault-tolerant and reliable emerging interconnects, and NoCs for artificial neural networks. A glimpse of these future research directions is given in the following sections.

# **6.2.1 Energy-efficient High Bandwidth Memory Access**

High-performance computing at the petaflop and exaflop scale is being explored for applications in biomedical, multimedia, security and many other fields. At the same time, the gap between the bandwidth that processors demand and the available memory bandwidth is widening, and existing architectures are reaching their limits. An important issue associated with off-chip memory access is the feasible number of memory controllers, which is limited by available pin bandwidth, and their placement in the on-chip network [24] [90] [89]. To provide high-bandwidth memory access for exascale applications, the critical challenges are an efficient interconnect design that provides low-latency paths between cores and memory controllers, an optimal number and placement of memory controllers for a given system size, and a routing protocol that services memory requests from all cores evenly. Of these, the optimal placement of memory controllers and the communication bottlenecks between on-chip and off-chip components are two major challenges. To address them, I would like to explore memory controller placement, efficient switching strategies and routing protocols that improve memory access speed. I want to bridge the gap between high-bandwidth off-chip memory and efficient on-chip networks to reap the full benefit of recent advances in technology. Hardware accelerators can also improve communication latency between on-chip and off-chip components, so I would also like to explore accelerator-rich architectures to address these bottlenecks.
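
One simple way to compare candidate memory controller placements is the average hop count from each tile to its nearest controller. The sketch below computes this for a 2D mesh, assuming minimal (Manhattan-distance) routing and one controller attached to each listed tile; the function name `avg_hops_to_nearest_mc` and the example placements are hypothetical and serve only as an illustration, not as the placement method used in this thesis.

```python
from itertools import product


def avg_hops_to_nearest_mc(mesh_dim, mc_positions):
    """Mean Manhattan distance from every tile of a mesh_dim x mesh_dim
    mesh to its nearest memory controller (MC) tile."""
    if not mc_positions:
        raise ValueError("at least one memory controller is required")
    tiles = list(product(range(mesh_dim), repeat=2))
    total = sum(
        min(abs(x - mx) + abs(y - my) for mx, my in mc_positions)
        for x, y in tiles
    )
    return total / len(tiles)


if __name__ == "__main__":
    # Hypothetical placements on a 4x4 mesh.
    single_corner = [(0, 0)]
    four_corners = [(0, 0), (0, 3), (3, 0), (3, 3)]
    print(avg_hops_to_nearest_mc(4, single_corner))  # 3.0
    print(avg_hops_to_nearest_mc(4, four_corners))   # 1.0
```

Hop count alone ignores contention at the controllers and the routing protocol, so a metric like this is only a first-order filter before detailed simulation.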

# **6.2.2 Hybrid Electrical-Optical Interconnect**

Existing data center networks based on electrical packet switching have their own bottlenecks. Electrical networks consume considerable power and are limited in latency and throughput. Hybrid electrical-optical networks are a promising solution that offers high bandwidth and low power consumption for very large data center networks [100] [101] [102]. The main challenge is to build a network that combines packet switching in the electrical domain with circuit switching in the optical domain. This challenge involves both hardware and software components. Furthermore, a long circuit switching delay results in significantly poorer performance when traffic patterns change rapidly, and the transition between the two networks every few milliseconds is a significant challenge for hardware design. Designing an efficient scheduling algorithm to determine the circuit switching topology is another major challenge, and efficient scheduling is also required to address the issues of the reconfigurable hardware system. In particular, it is difficult to predict how different applications with disparate requirements would perform and affect the scheduling of the system. Hybrid interconnect-architecture co-design, along with hardware support for efficient communication infrastructure for data center applications, promises to be the key to achieving efficient many-core systems. I would like to continue my research in the pursuit of efficient many-core designs with emerging hybrid interconnect networks.

# 6.2.3 Fault-tolerant and reliable emerging interconnect

I plan to further explore many-core architectures based on emerging interconnects as the number of cores and sub-systems on a chip increases. The probability of failure in emerging interconnects rises due to process variations, aging effects and soft errors, both in current technologies and in those expected in future generations [103] [104]. I would like to address these issues by investigating all levels (circuit, architecture, and communication) to maintain the reliability of the system. For that, we need an efficient fault model for emerging interconnects and ways to avoid the impact of faults at the architecture and network levels. Efficient fault detection, isolation logic and correction methods with low overhead are essential for many-core chip design. I plan to pursue these challenging tasks of designing fault-free and reliable many-core chips.

# 6.2.4 Network-on-chip architecture for artificial neural networks

Convolutional Neural Networks (CNN) and Deep Learning currently offer the best solutions to many problems in aerospace, automotive, military, electronics, financial, industrial, medical, telecommunications, speech recognition and other domains. The combination of neural networks and deep learning is achieving outstanding performance in these applications. At the same time, convolutional neural networks are both computationally intensive and memory intensive, so they are difficult to deploy in low-power, lightweight embedded applications. For these types of applications, providing a highly flexible, scalable and energy-efficient communication infrastructure is a major architectural challenge for the hardware implementation of reconfigurable neural networks [105] [106] [107]. I would like to investigate energy-efficient interconnection networks based on emerging interconnects for these neural networks and compare them with existing setups.

# Bibliography

- [1] Sodani, A., Gramunt, R., Corbal, J., Kim, H.S., Vinod, K., Chinthamani, S., Hutsell, S., Agarwal, R. and Liu, Y.C., "Knights Landing: Second-Generation Intel Xeon Phi Product," *IEEE Micro*, vol. 36, no. 2, pp. 34-46, 2016.
- [2] Bohnenstiehl, B., Stillmaker, A., Pimentel, J.J., Andreas, T., Liu, B., Tran, A.T., Adeagbo, E. and Baas, B.M., "KiloCore: A 32-nm 1000-Processor Computational Array," *IEEE Journal of Solid-State Circuits*, vol. 52, no. 4, pp. 891-902, 2017.
- [3] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous and A. R. LeBlanc, "Design of ion-implanted MOSFET's with very small physical dimensions," *IEEE Journal of Solid-State Circuits*, vol. 9, no. 5, pp. 256-268, 1974.
- [4] Y. Hoskote, S. Vangal, A. Singh, N. Borkar and S. Borkar, "A 5-GHz Mesh Interconnect for a Teraflops Processor," *IEEE Micro*, vol. 27, no. 5, pp. 51-61, 2007.
- [5] T. Bjerregaard and M. Shankar, "A survey of research and practices of network-on-chip," *ACM Computing Surveys (CSUR)*, vol. 38, no. 1, 2006.
- [6] U. Y. Ogras and R. Marculescu, "'It's a small world after all': NoC performance optimization via long-range link insertion," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 14, no. 7, pp. 693-706, 2006.
- [7] V. F. Pavlidis, E. G. Friedman, "3-D Topologies for Networks-on-Chip," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 15, no. 10, pp. 1081-1090, 2007.
- [8] Shacham, A., Bergman, K. and Carloni, L.P., "Photonic Network-on-Chip for Future Generations of Chip Multi-Processors," *IEEE Transactions on Computers*, vol. 57, no. 9, pp. 1246-1260, 2008.
- [9] Chang, M.F., Cong, J., Kaplan, A., Naik, M., Reinman, G., Socher, E. and Tam, S.W., "CMP Network-on-Chip Overlaid With Multi-Band RF-Interconnect," in *IEEE International Symposium on High-Performance Computer Architecture (HPCA)*, 2008.
- [10] Deb, S., Ganguly, A., Pande, P.P., Belzer, B. and Heo, D., "Wireless NoC as Interconnection Backbone for Multicore Chips: Promises and Challenges," *IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS)*, vol. 2, no. 2, pp. 228-239, 2012.
- [11] D. DiTomaso, A. Kodi, S. Kaya, D. Matolak, "iWISE: Inter-router Wireless Scalable Express Channels for Network-on-Chip (NoCs) Architectures," in 19th Annual IEEE Symposium on High Performance Interconnects, 2011.
- [12] Lee, S.B., Tam, S.W., Pefkianakis, I., Lu, S., Chang, M.F., Guo, C., Reinman, G., Peng, C., Naik, M., Zhang, L. and Cong, J., "A Scalable Micro Wireless Interconnects Structure for Chips," in *MobiCom '09*, 2009.
- [13] C. Wang, W. Hu, N. Bagherzadeh, "A Wireless Network-on-Chip Design for Multicore Platforms," in 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2011.
- [14] Deb, S., Chang, K., Yu, X., Sah, S.P., Cosic, M., Ganguly, A., Pande, P.P., Belzer, B. and Heo, D., "Design of an Energy Efficient CMOS Compatible NoC architecture with Millimeter-Wave Wireless Interconnects," *IEEE Transactions on Computers*, vol. 62, no. 99, pp. 2382-2396, 2012.

- [15] Sun, C., Chen, C.H.O., Kurian, G., Wei, L., Miller, J., Agarwal, A., Peh, L.S. and Stojanovic, V., "DSENT - a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling," in *IEEE/ACM Sixth International Symposium on Networks-on-Chip*, 2012.
- [16] S. Li, J.-H. Ahn, R. Strong, J. Brockman, D. Tullsen, and N. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), 2009.
- [17] Vangal, S.R., Howard, J., Ruhl, G., Dighe, S., Wilson, H., Tschanz, J., Finan, D., Singh, A., Jacob, T., Jain, S. and Erraguntla, V., "An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS," *IEEE Journal of Solid-State Circuits*, vol. 43, no. 1, pp. 29-41, 2008.
- [18] Zhan, J., Ouyang, J., Ge, F., Zhao, J. and Xie, Y., "DimNoC: A dim silicon approach towards power-efficient on-chip network," in *52nd ACM/EDAC/IEEE Design Automation Conference (DAC)*, 2015.
- [19] Yu, X., Baylon, J., Wettin, P., Heo, D., Pande, P.P. and Mirabbasi, S., "Architecture and Design of Multichannel Millimeter-Wave Wireless NoC," *IEEE Design & Test*, vol. 31, no. 6, pp. 19-28, 2014.
- [20] Tilera Corporation, August 2010. [Online]. Available: http://www.tilera.com.
- [21] Mutlu, O. and Moscibroda, T., "Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared dram systems," in *SIGARCH Comput. Archit. News*, 2008.
- [22] Nesbit, K.J., Aggarwal, N., Laudon, J. and Smith, J.E., "Fair queuing memory systems," in *MICRO*, 2006.
- [23] Abts, D., Enright Jerger, N.D., Kim, J., Gibson, D. and Lipasti, M.H., "Achieving predictable performance through better memory controller placement in many-core CMPs," in *ISCA*, 2009.
- [24] Xu, T.C., Liljeberg, P. and Tenhunen, H., "Optimal memory controller placement for chip multiprocessor," in *CODES and ISSS*, 2011.
- [25] Deb, S. and Mondal, H.K., "Wireless network-on-chip: a new era in multi-core chip design," in *IEEE International Symposium on Rapid System Prototyping (RSP)*, 2014.
- [26] Mondal, H.K., Kaushik, S., Gade, S.H. and Deb, S., "Energy-Efficient Transceiver for Wireless NoC," in 30th International Conference on VLSI Design, 2017.
- [27] K. Roy, S. Mukhopadhyay and H. Mahmoodi-Meimand, "Leakage current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits," *Proceedings of the IEEE*, vol. 91, no. 2, pp. 305-327, 2003.
- [28] Mondal, H.K. and Deb, S., "Energy efficient on-chip wireless interconnects with sleepy transceivers," in *IEEE 8th International Design and Test Symposium (IDT)*, 2013.
- [29] Mondal, H.K. and Deb, S., "An energy efficient wireless Network-on-Chip using power-gated transceivers," in 27th IEEE International System-on-Chip Conference (SOCC), 2014.
- [30] Mondal, H.K., Harsha, G.N.S. and Deb, S., "An efficient hardware implementation of DVFS in multi-core system with wireless network-on-chip," in *IEEE Computer Society Annual Symposium on VLSI (ISVLSI)*, 2014.
- [31] R. Parikh, R. Das and V. Bertacco, "Power-aware NoCs through routing and topology reconfiguration," in *51st ACM/EDAC/IEEE Design Automation Conference (DAC)*, 2014.

- [32] Matsutani, H., Koibuchi, M., Ikebuchi, D., Usami, K., Nakamura, H. and Amano, H., "Ultra Fine-Grained Run-Time Power Gating of On-chip Routers for CMPs," in *Fourth ACM/IEEE International Symposium on Networks-on-Chip (NOCS)*, 2010.
- [33] G. Kim, J. Kim and S. Yoo, "FlexiBuffer: Reducing leakage power in onchip network routers," in 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC), 2011.
- [34] L. Chen and T. M. Pinkston, "NoRD: Node-Router Decoupling for Effective Power-gating of On-Chip Routers," in 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012.
- [35] Muhammad, S.T., Ezz-Eldin, R., El-Moursy, M.A., El-Moursy, A.A. and Refaat, A.M., "Traffic-Based Virtual Channel Activation for Low-Power NoC," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 23, no. 12, pp. 3029-3042, 2015.
- [36] Chen, L., Zhu, D., Pedram, M. and Pinkston, T.M., "Power punch: Towards non-blocking power-gating of NoC routers," in *IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)*, 2015.
- [37] Mirhosseini, A., Sadrosadati, M., Fakhrzadehgan, A., Modarressi, M. and Sarbazi-Azad, H., "An energy-efficient virtual channel power-gating mechanism for on-chip networks," in *Design, Automation & Test in Europe Conference & Exhibition (DATE)*, 2015.
- [38] T. Krishna, A. Kumar, P. Chiang, M. Erez, L.S. Peh, "NoC with Near-Ideal Express Virtual Channels Using Global-Line Communication," in *16th IEEE Symposium on High Performance Interconnects (HOTI)*, 2008.
- [39] Psarras, A., Seitanidis, I., Nicopoulos, C. and Dimitrakopoulos, G., "ShortPath: A Network-on-Chip Router with Fine-Grained Pipeline Bypassing," *IEEE Transactions on Computers*, vol. 65, no. 10, pp. 3136-3147, 2016.
- [40] Kodi, A., Louri, A. and Wang, J., "Design of energy-efficient channel buffers with router bypassing for network-on-chips (NoCs)," in *IEEE Quality of Electronic Design*, 2009.
- [41] Hollis, S.J. and Jackson, C., "When does network-on-chip bypassing make sense?," in *IEEE International SOC Conference*, 2009.
- [42] Krishna, T., Chen, C.H.O., Kwon, W.C. and Peh, L.S., "SMART: single-cycle multihop traversals over a shared network on chip," *IEEE MICRO*, vol. 34, no. 3, pp. 43-56, 2014.
- [43] Mondal, H.K., Gade, S.H., Kishore, R., Kaushik, S. and Deb, S., "Power efficient router architecture for wireless Network-on-Chip," in *17th International Symposium on Quality Electronic Design (ISQED)*, 2016.
- [44] Mondal, H.K., Gade, S.H., Kishore, R. and Deb, S., "Power- and performance-aware fine-grained reconfigurable router architecture for NoC," in *IEEE Green Computing Conference and Sustainable Computing Conference (IGSC)*, 2015.
- [45] C. Bienia, "Benchmarking modern multiprocessors," Ph.D. Dissertation, Princeton Univ., http://parsec.cs.princeton.edu/publications.htm, 2011.
- [46] Woo, S.C., Ohara, M., Torrie, E., Singh, J.P. and Gupta, A., "The SPLASH-2 programs: characterization and methodological considerations," Proceedings of the 22nd annual international symposium on Computer architecture (ISCA '95). ACM, 1995.
- [47] R. Wu, Y. Wang and D. Zhao, "Low-Cost Deadlock-Free Design of Minimal-Table Rerouted XY-Routing for Irregular Wireless NoCs," Networks-on-Chip (NOCS), 2010.

- [48] F. Fazzino, M. Palesi and D. Patti, "Noxim: Network-on-chip simulator," 2008. [Online]. Available: http://sourceforge.net/projects/noxim.
- [49] R. Kishore, H.K. Mondal, S. Deb, "Energy-efficient Reconfigurable Framework for Evaluating Hybrid NoCs," in *International Symposium on VLSI Design and Test (VDAT)*, 2016.
- [50] Miller, J.E., Kasture, H., Kurian, G., Gruenwald, C., Beckmann, N., Celio, C., Eastep, J. and Agarwal, A., "Graphite: A distributed parallel simulator for multicores," in *The Sixteenth International Symposium on High-Performance Computer Architecture*, 2010.
- [51] Mondal, H.K., Harsha, G.N.S., Kishore, R. and Deb, S., "P2NoC: Power- and Performance-aware NoC Architectures for Sustainable Computing," *Sustainable Computing: Informatics and Systems*, (Under minor revision).
- [52] Gade, S.H., Mondal, H.K. and Deb, S., "A hardware and thermal analysis of DVFS in a multi-core system with hybrid WNoC architecture," in 28th International Conference on VLSI Design (VLSID), 2015.
- [53] Kaushik, S., Agrawal, M., Mondal, H.K., Harsha, G.N.S. and Deb, S., "Path Loss-aware Adaptive Transmission Power Control Scheme for Energy-efficient Wireless NoC," in *MWSCAS*, 2017.
- [54] Murray, J., Tang, N., Pande, P.P., Heo, D. and Shirazi, B.A., "DVFS Pruning for Wireless NoC Architectures," *IEEE Design & Test*, vol. 32, no. 2, pp. 29-38, 2015.
- [55] Murray, J., Pande, P.P. and Shirazi, B., "DVFS-enabled sustainable wireless NoC architecture," in 2012 IEEE International SOC Conference (SOCC), 2012.
- [56] Rahimi, A., Salehi, M.E., Mohammadi, S. and Fakhraie, S.M., "Low-energy GALS NoC with FIFO-Monitoring dynamic voltage scaling," *Microelectronics J.*, vol. 42, no. 6, pp. 889-896, 2008.
- [57] Mondal, H.K., Gade, S.H., Kishore, R. and Deb, S., "Adaptive multi-voltage scaling in wireless NoC for high performance low power applications," in *Proceedings of the Conference on Design, Automation & Test in Europe*, 2016.
- [58] Chang, K., Deb, S., Ganguly, A., Yu, X., Sah, S.P., Pande, P.P., Belzer, B. and Heo, D., "Performance evaluation and design trade-offs for wireless network-on-chip architectures," *ACM Journal on Emerging Technologies in Computing Systems (JETC)*, vol. 8, no. 3, p. 23, 2012.
- [59] A. Saurabh, P. Srivastava, "New improved high speed low power double tail comparator design for 2.5 GHz input signal," in *IEEE Technology Symposium (TechSym)*, 2014.
- [60] Mondal, H.K., Harsha, G.N.S., Kaushik, S. and Deb, S., "Adaptive Multi-Voltage Scaling with Utilization Prediction for Energy-efficient Wireless NoC," *IEEE Transactions on Sustainable Computing*, (Under major revision).
- [61] R. Wu, Y. Wang, and D. Zhao, "Low-cost deadlock-free design of minimal-table rerouted xy-routing for irregular wireless nocs," in *ACM/IEEE International Symposium on Networks-on-Chip (NOCS)*, 2010.
- [62] A. Ganguly, K. Chang, S. Deb, P. P. Pande, B. Belzer, and C. Teuscher, "Scalable Hybrid Wireless Network-on-Chip Architectures for Multicore Systems," *IEEE Transactions on Computers*, vol. 60, no. 10, pp. 1485-1502, 2011.
- [63] Ganguly, A., Wettin, P., Chang, K. and Pande, P., "Complex network inspired fault-tolerant NoC architectures with wireless links," in *IEEE/ACM International Symposium on Networks on Chip (NoCS)*, 2011.

- [64] B. A. Floyd, C. M. Hung, and K. K. O, "Intra-chip wireless interconnect for clock distribution implemented with integrated antennas, receivers and transmitters," *IEEE Journal of Solid-State Circuits*, pp. 543-552, 2002.
- [65] Seok, E., Cao, C., Shim, D., Arenas, D.J., Tanner, D.B., Hung, C.M. and Kenneth, K.O., "A 410GHz CMOS push-push oscillator with an on-chip patch antenna," in *ISSCC*, 2008.
- [66] Kim, K. and Yoon, H., "On-chip wireless interconnection with integrated antennas," International Electron Devices Meeting, 2000.
- [67] Kim, K., Floyd, B.A., Mehta, J.L., Yoon, H., Hung, C.M., Bravo, D., Dickson, T.O., Guo, X., Li, R., Trichy, N. and Caserta, J., "On-chip antennas in silicon ICs and their application," *IEEE Trans. Electron Devices*, vol. 52, pp. 1312-1323, 2005.
- [68] Y. P. Zhang, Z. M. Chen and M. Sun, "Propagation mechanisms of radio waves over intra-chip channels with integrated antennas: Frequency-domain measurements and time-domain analysis," *IEEE Transactions on Antennas and Propagation*, pp. 2900-2906, 2007.
- [69] A. Mineo, M. Palesi, G. Ascia, V. Catania, "Exploiting antenna directivity in wireless NoC architectures," *Microprocessors and Microsystems*, vol. 43, pp. 59-66, 2016.
- [70] S. Deb, K. Chang, M. Cosic, A. Ganguly, P. P. Pande, D. Heo, and B. Belzer, "CMOS compatible many-core NoC architectures with multi-channel millimeter-wave wireless links," in *Proceedings of the Great Lakes Symposium on VLSI. ACM*, 2012.
- [71] Shamim, M.S., Mansoor, N., Samaiyar, A., Ganguly, A., Deb, S. and Sundar Ram, S., "Energy-efficient wireless network-on-chip architecture with log-periodic on-chip antennas," in *Proceedings of the 24th edition of the Great Lakes Symposium on VLSI. ACM*, 2014.

- [72] A. Vidapalapati, V. Vijayakumaran, A. Ganguly and A. Kwasinski, "NoC Architectures with Adaptive Code Division Multiple Access based Wireless Links," in *Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS)*, 2012.
- [73] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," *Nature*, vol. 393, pp. 440-442, 1998.
- [74] Chen, Z.N., Ammann, M.J., Qing, X., Wu, X.H., See, T.S. and Cai, A., "Planar antennas," *IEEE Microw. Mag*, vol. 7, no. 6, pp. 63-73, 2006.
- [75] A. Samaiyar, S. S. Ram, S. Deb, "Millimeter-wave planar log periodic antenna for on-chip wireless interconnects," in *Antennas and Propagation* (EuCAP), 2014 8th European Conference on, 2014.
- [76] K. Kim and K. K. O, "Characteristics of integrated dipole antennas on bulk, SOI, and SOS substrates for wireless communication," in *IEEE International Interconnect Technology Conference*, 1998.
- [77] T. Petermann and P. De Los Rios, "Spatial small-world networks: a wiring cost perspective," 2005. [Online]. Available: arXiv:cond-mat/0501420v2.
- [78] S. H. Gade, S. Deb, "Achievable Performance Enhancements with mm-Wave Wireless Interconnects in NoC," in *Proceedings of the 9th International Symposium on Networks-on-Chip (NOCS)*, 2016.
- [79] Dehyadgari, M., Nickray, M., Afzali-Kusha, A. and Navabi, Z., "Evaluation of pseudo adaptive XY routing using an object oriented model for NOC," in 17th IEEE International Conference on Microelectronics (ICM), 2005.
- [80] "ANSYS HFSS: High Frequency Electromagnetic Field Simulation," [Online]. Available: http://www.ansys.com/products/electronics/ansys-hfss.
- [81] Jotwani, R., Sundaram, S., Kosonocky, S., Schaefer, A., Andrade, V., Constant, G., Novak, A. and Naffziger, S., "An x86-64 Core Implemented in 32nm SOI CMOS," in *IEEE International Solid-State Circuits Conference*, 2010.
- [82] "Chip MultiProjects," [Online]. Available: http://cmp.imag.fr/.
- [83] N. Mansoor, A. Ganguly, M. Yuvaraj, "An Energy-efficient and robust millimeter-wave wireless Network-on-Chip architecture," in *Proceedings of IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)*, 2013.
- [84] Binkert, N., Beckmann, B., Black, G., Reinhardt, S.K., Saidi, A., Basu, A., Hestness, J., Hower, D.R., Krishna, T., Sardashti, S. and Sen, R., "The GEM5 Simulator," in *ACM SIGARCH Comp. Arch. News*, 2011.
- [85] Mondal, H., Gade, S., Shamim, M., Deb, S. and Ganguly, A., "Interference-aware wireless network-on-chip architecture using directional antennas," *IEEE Transactions on Multi-Scale Computing Systems*, 2016.
- [86] Cong, J., Gill, M., Hao, Y., Reinman, G. and Yuan, B., "On-chip interconnection network for accelerator-rich architectures," in *DAC*, 2015.
- [87] Lotfi-Kamran, P., Modarressi, M. and Sarbazi-Azad, H., "An Efficient Hybrid-Switched Network-on-Chip for Chip Multiprocessors," *IEEE Transactions on Computers*, vol. 65, no. 5, pp. 1656-1662, 2016.
- [88] Lusala, A.K. and Legat, J.D., "A hybrid router combining circuit switching and packet switching with bus architecture for on-chip networks," in *IEEE International NEWCAS Conference (NEWCAS)*, 2010.
- [89] Eitschberger, P., Keller, J., Thiele, F. and Kessler, C., "Exploring the Placement of Memory Controllers in Manycore Processors: A Case Study for Intel SCC," in *MCC'13*, 2013.
- [90] Tootaghaj, D.Z. and Farhat, F., "Optimal Placement of Cores, Caches and Memory Controllers in Network On-Chip," 2016. [Online]. Available: arXiv:1607.04298v4.
- [91] Sharifi, A., Kultursay, E., Kandemir, M. and Das, C.R., "Addressing end-to-end memory access latency in NoC-based multicores," in *45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)*, 2012.
- [92] Mazloumi, A. and Modarressi, M., "A hybrid packet/circuit-switched router to accelerate memory access in NoC-based chip multiprocessors," in *Design, Automation & Test in Europe Conference & Exhibition (DATE)*, 2015.
- [93] Abousamra, A., Jones, A.K. and Melhem, R., "Codesign of NoC and cache organization for reducing access latency in chip multiprocessors," *IEEE Transactions on Parallel and Distributed Systems*, vol. 23, no. 6, pp. 1038-1046, 2012.
- [94] Jain, R., Panda, P.R. and Subramoney, S., "Machine Learned Machines: Adaptive co-optimization of caches, cores, and On-chip Network," in *Design, Automation & Test in Europe Conference & Exhibition (DATE)*, 2016.
- [95] Moscibroda, T. and Mutlu, O., "A case for bufferless routing in on-chip networks," in *ACM SIGARCH Computer Architecture News*, 2009.
- [96] Comaniciu, D. and Meer, P., "Mean shift: a robust approach toward feature space analysis," *IEEE Transactions on Pattern Analysis and Machine Intelligence*, vol. 24, no. 5, pp. 603-619, 2002.
- [97] Ubal, R., Jang, B., Mistry, P., Schaa, D. and Kaeli, D., "Multi2sim: A simulation framework for CPU-GPU computing," in *International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '12. ACM*, 2012.
- [98] Sodani, A., Gramunt, R., Corbal, J., Kim, H.S., Vinod, K., Chinthamani, S., Hutsell, S., Agarwal, R. and Liu, Y.C., "Knights Landing: Second-Generation Intel Xeon Phi Product," *IEEE Micro*, vol. 36, no. 2, pp. 34-46, 2016.
- [99] "AMD," [Online]. Available: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk.
- [100] Farrington, N., Porter, G., Radhakrishnan, S., Bazzaz, H.H., Subramanya, V., Fainman, Y., Papen, G. and Vahdat, A., "Helios: A Hybrid Electrical/Optical Switch," in ACM SIGCOMM Computer Communication Review, 2010.
- [101] Griese, E., "A high-performance hybrid electrical-optical interconnection technology for high-speed electronic systems," *IEEE Transactions on Advanced Packaging*, vol. 24, no. 3, pp. 375-383, 2001.
- [102] Rahman, M.N. and Esmailpour, A., "A Hybrid Electrical and Optical Networking Topology of Data Center for Big Data Network," in *ASEE*, 2014.
- [103] Park, D., Nicopoulos, C., Kim, J., Vijaykrishnan, N. and Das, C.R., "Exploring fault-tolerant network-on-chip architectures," in *IEEE International Conference on Dependable Systems and Networks*, 2006.
- [104] Fiorin, L. and Sami, M., "Fault-tolerant network interfaces for networks-on-chip," *IEEE Transactions on Dependable and Secure Computing*, vol. 11, no. 1, pp. 16-29, 2014.
- [105] Carrillo, S., Harkin, J., McDaid, L.J., Morgan, F., Pande, S., Cawley, S. and McGinley, B., "Scalable hierarchical network-on-chip architecture for spiking neural network hardware implementations," *IEEE Transactions on Parallel and Distributed Systems*, vol. 24, no. 12, pp. 2451-2461, 2013.
- [106] Vainbrand, D. and Ginosar, R., "Scalable network-on-chip architecture for configurable neural networks," *Microprocessors and Microsystems*, vol. 35, no. 2, pp. 152-166, 2011.
- [107] Vainbrand, D. and Ginosar, R., "Network-on-chip architectures for neural networks," in *Fourth ACM/IEEE International Symposium on Networks-on-Chip (NOCS)*, 2010.