# Communication Scheduling and Buslet-based Design: New Paradigms for High Level Synthesis 

by<br>ENZO TARTAGLIONE<br>Laurea, Politecnico di Torino, Turin, Italy, 2013

THESIS

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer Engineering in the Graduate College of the University of Illinois at Chicago, 2015<br>Chicago, Illinois

Defense Committee:

Shantanu Dutt, Chair and Advisor
John Lillis
Mariagrazia Graziano, Politecnico di Torino

# To my parents 

always near
despite the distance.

## ACKNOWLEDGMENTS

I would like to thank my advisor at the University of Illinois at Chicago, Prof. Shantanu Dutt, with whom I collaborated to perform the research described in this thesis. The work in this thesis is a result of this collaboration when I was a student at UIC from August 2014 to July 2015, as well as a funded research assistant from January 2015 to July 2015. This collaboration also allowed me to intellectually grow so much.

I would like to thank all my friends from Turin belonging to the group "The paladins of Medusa", old and new members, for letting me feel as I were in Italy despite the huge distance between us. A special thank goes to Massimo S., thanks to whom I was able to carry out part of the paperwork necessary to graduate.

Lastly, I deeply thank my parents for the vigorous and continuous support they have always given to me. Since my first year at the university they helped me to overcome difficult moments and certainly I was able to do whatever I did thanks to their continuous support.

## PREFACE

This project, as the final stage of my master's studies, gave me both the experience and abilities that I was willing to acquire in the field of VLSI CAD algorithms and tools development.

The chance of working with Prof. Dutt, in particular, provided me first with knowledge on state-of-the-art High Level Synthesis algorithms and then gave me the possibility of developing skills in problem solving and algorithm design. In the last three months the work was mainly focused on the study and design of interconnect models for dynamic power minimization. Furthermore, such an opportunity helped me to develop skills in object oriented coding.

The complexity of the analyzed problems, finally, gave me a brand new viewpoint on problem solving methods and approaches and enlarged my vision beyond a purely engineering perspective.

## TABLE OF CONTENTS

## CHAPTER

## PAGE

1 INTRODUCTION ..... 1
1.1 Flexible buslets: a new interconnection paradigm ..... 7
1.1.1 Previous work using bus-like structures ..... 9
1.2 Thesis outline ..... 10
2 MAIN ISSUES IN BUSLET WIRED DESIGNS ..... 11
2.1 Maximization of wire efficiency ..... 11
2.2 Dynamic power consumption and signal delay: constraint on maximum cardinality of a buslet ..... 19
2.2.1 Maximum fanin and fanout per FU ..... 21
3 SCHEDULING ALGORITHMS ..... 24
3.1 Introduction to the VLSI CAD flow ..... 24
$3.2 \quad$ High level synthesis ..... 26
3.2.1 As-soon-as-possible scheduling algorithm ..... 27
3.2.2 As-late-as-possible scheduling algorithm ..... 29
3.2.3 Mobility range for an operation node ..... 32
$3.3 \quad$ Force directed scheduling algorithm ..... 33
3.3.1 Probability estimation for an operation node ..... 34
3.3.2 Distribution graph ..... 35
3.3.3 Force computation ..... 37
3.3.4 Predecessor and successor forces ..... 39
3.3.5 Complexity ..... 45
$3.4 \quad$ Scheduling algorithm for buslet-based designs ..... 45
3.4.1 Communication with dedicated interconnections ..... 45
3.4.2 Force directed communication scheduling ..... 47
3.4.3 Complexity ..... 51
3.5 Summary of scheduling algorithms ..... 51
4 BINDING ALGORITHMS ..... 53
4.1 General concepts to be taken into account ..... 53
4.1.1 Wirelength model ..... 56
4.1.2 Effect of fanin/fanout constraints on binding ..... 57
4.2 Chronological binding ..... 60
4.2.1 Reuse of already available hardware ..... 61
4.2.2 Reduction of solution space ..... 62
4.2.3 Complexity ..... 64

## TABLE OF CONTENTS (Continued)

4.3 Chronological binding with similarity reduction ..... 64
4.3.1 Justification of similarity-based pruning ..... 70
4.3.2 Complexity ..... 71
4.4 Chronological lookahead binding ..... 72
4.4.1 Complexity ..... 74
4.5 Simultaneous binding of iso-scheduled COMs ..... 75
4.5.1 Improving the average computation time ..... 79
4.5.2 Complexity ..... 82
4.6 Force directed binding ..... 83
4.6.1 Formulation of probability for a given solution ..... 84
4.6.2 Formulation of weight function for a given solution ..... 85
4.6.3 Predecessor/successor forces ..... 86
4.6.4 Efficacy of this algorithm ..... 87
4.6.5 Complexity ..... 89
5 BUSLET POWER MODELING ..... 90
5.1 Metrics involved in buslet power consumption ..... 90
5.2 Minimum spanning tree ..... 91
5.2.1 Prim's algorithm ..... 92
5.2.2 Kruskal's algorithm ..... 92
5.3 Standard buffer placement ..... 93
5.3.1 Complexity ..... 96
5.4 MST-2B ..... 96
5.4.1 Shortest path problem: Dijkstra's algorithm ..... 99
5.4.2 Complexity ..... 100
5.5 MST-1B ..... 102
5.5.1 Placing tristate buffers ..... 104
5.5.2 Complexity ..... 105
5.6 MST-LOGD ..... 106
5.6.1 Complexity ..... 108
5.7 MST-BF ..... 108
5.7.1 Complexity ..... 109
5.8 MAX-D ..... 109
5.8.1 Complexity ..... 112
5.9 MAX-D-MAXHOP ..... 115
5.9.1 Complexity ..... 115
5.10 Hierarchical partitioning ..... 115
5.10.1 Complexity ..... 119
6 EXPERIMENTAL RESULTS ..... 120
6.1 Performance for scheduling and binding algorithms ..... 121
6.1.1 Comparison with another HLS algorithm ..... 121

## TABLE OF CONTENTS (Continued)

## CHAPTER

PAGE
6.1.2 Discussion of results ..... 123
$6.2 \quad$ Power estimation ..... 127
7 CONCLUSIONS ..... 141
APPENDIX ..... 143
CITED LITERATURE ..... 148
VITA ..... 153

## LIST OF TABLES

## TABLE

PAGE
I AVERAGE RUNTIMES FOR DIFFERENT DFG SIZES . . . . . . 143
II AVERAGE AREA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
III AVERAGE WIRELENGTH . . . . . . . . . . . . . . . . . . . . . . . . 144
IV AVERAGE WIRE EFFICIENCY . . . . . . . . . . . . . . . . . . . . . 144
V AVERAGE WIRELENGTH FOR DIFFERENT BUSLET STRUC-
TURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
$\begin{array}{ll}\text { VI TOTAL DYNAMIC POWER CONSUMED BY INTERCONNECTS } \\ & \text { - WIRE CAPACITANCE CONTRIBUTION . . . . . . . . . . . . } 145\end{array}$
$\begin{array}{ll}\text { VII TOTAL DYNAMIC POWER CONSUMED BY INTERCONNECTS } \\ & \text { - TRISTATE BUFFER CONTRIBUTION . . . . . . . . . . . . . }\end{array}$
$\begin{array}{ll}\text { VIII TOTAL STATIC POWER CONSUMED BY INTERCONNECTS - } \\ & \text { TRISTATE BUFFER LEAKAGE . . . . . . . . . . . . . . . . . }\end{array}$
IX DYNAMIC POWER CONSUMED BY FUS . . . . . . . . . . . . . 146
X STATIC POWER CONSUMED BY FUS . . . . . . . . . . . . . . . . 146
XI TOTAL POWER . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
XII PERCENTAGE CHANGE IN TOTAL POWER . . . . . . . . . . . . 147
XIII MAXIMUM DISTANCE BETWEEN FUS . . . . . . . . . . . . . . . 147

## LIST OF FIGURES

## FIGURE

## PAGE

1 Example of dedicated interconnections design ..... 12
2 Example of buslet design ..... 13
3 Example of overlapping communications. ..... 15
4 Example of non overlapping communications. ..... 16
5 Example of dedicated interconnects implementation. ..... 17
6 Example of buslet implementation ..... 17
$7 \quad$ Second example of buslet implementation. ..... 18
8 Functional unit interface with the buslet ..... 19
9 Example of as-soon-as-possible schedule ..... 28
10 Example of as-late-as-possible schedule ..... 31
11 Example of mobility range for a node in a dataflow graph ..... 33
12 Scheduling probability example ..... 35
13 Example of distribution graph ..... 36
14 Example of self force computation ..... 38
15 Example of self force computation ..... 39
16 Reduction of mobility range for predecessors ..... 40
17 Effect of scheduling node $N_{j}$ on predecessor $N_{i}$ and successor $N_{k}$ ..... 41
18 Reduction of mobility range for predecessors ..... 43

## LIST OF FIGURES (Continued)

## FIGURE

PAGE
19 Example of time frame in which the communication between $u_{i}$ and $u_{j}$ can happen ..... 46
20
Example of communications possible concurrency ..... 47
21 Mobility range for communication between $u_{i}$ and $u_{j}$ ..... 49
22 Example of communication scheduling and its effect on mobility ranges for functional units ..... 50
23 General dataflow for the scheduling technique used ..... 52
24
Examples of computation of wirelength for buslet cardinalities from 2 to 5 ..... 57
25 Example of fanin/fanout constraint ..... 59
26 Example of reuse of previously bound buslets ..... 61
27 Example of binding solution space reduction ..... 63
28 Example two solutions similar differing for buslets labels ..... 65
29 Example two solutions similar with different binding for both buslets and functional units ..... 66
30 Why simultaneous binding matters ..... 76
31 Chronological binding vs. simultaneous binding of iso-scheduled coms output ..... 76
32 Binding overlapping communications ..... 78
33
Analysis of possibilities for 3 overlapping communications ..... 80
34
Example of SBP ..... 94
35 Communication in the SBP model ..... 97
36
Communication in the MST-2B model ..... 98
37 Example of MST-2B ..... 100

## LIST OF FIGURES (Continued)

## FIGURE

PAGE
38 Example of MST-1B ..... 102
39 Communication in the MST-1B model ..... 103
40 Example of MST-LOGD ..... 107
41
Example of MST-BF ..... 111
42 MAX-D algorithm at work ..... 113
43 Example of MAX-D ..... 113
44 HP at work ..... 117
45 Example of hierarchical partitioning ..... 118
46 Comparison between greedy-binding method and ours for buslet cardi- nality 2 . ..... 122
47 Comparison between two different output chip layouts ..... 123
48 Average, maximum and peak wire efficiencies ..... 124
49 Runtime data and curve fitting for the designed HLS algorithms ..... 124
50 Total wirelength averages ..... 127
51 Total area averages ..... 128
52 Logic gates composing a FA ..... 130
53 SRAM cell ..... 131
54 Total power consumption ..... 133
55 Total wirelengths ..... 133
56 Dynamic power contribution of FUs ..... 134
57 Leakage power contribution of FUs ..... 135
58 Dynamic power contribution of wire ..... 136

## LIST OF FIGURES (Continued)

59 Maximum distance between any two connected FUs ..... 136
60 Dynamic power contribution of tristate buffers input load ..... 137
61 Leakage power contribution of tristate buffers ..... 138
62 Percentage change in total power ..... 138
63 Total interconnect power ..... 139

## LIST OF ABBREVIATIONS

| ALAP | As Late As Possible |
| :--- | :--- |
| ALU | Arithmetic Logic Unit |
| ASAP | As Soon As Possible |
| ASIC | Application-Specific Integrated Circuit |
| BP | Binding Possibility |
| CB | Chronological Binding |
| CBSR | Data Flow Graph |
| DFG | Force Directed Binding |
| FDB | Force Directed Scheduling |
| FDCS | Finite State Machine |
| FDS | Functional Unit |
| FSM | High Level Synthesis |
| FU | Lookahead |
| HLS | HA |

# LIST OF ABBREVIATIONS (Continued) 

| ML-RCS | Minimum Latency-Resource Constrained Sched- |
| :---: | :---: |
|  | ule |
| MR-LCS | Minimum Resource-Latency Constrained Sched- |
|  | ule |
| MST | Minimum Spanning Tree |
| PF | Predecessor Force |
| RCA | Ripple Carry Adder |
| RTL | Register Transfer Level |
| SF | Self-Force |
| SSI | Small Scale Integration |
| UIC | University of Illinois at Chicago |
| VLSI | Very Large Scale Integrated |
| WL | Wirelength |

## SUMMARY

Current nanoscale designs are highly interconnect dominated, taking about $70 \%$ of the chip area. Interconnects also consume a significant part of the dynamic power and are responsible of about the $60 \%$ of signal delays. It is, thus, important to be able to synthesize much lower interconnect-complexity designs than are possible with current high-level synthesis (HLS) tools and algorithms. Towards that end, we have developed the following new paradigms in the scheduling, binding and general architecture synthesis problems of HLS:

- Flexibly-structured that connect a few neighborhood functional units (FUs) instead of dedicated interconnects between pairs of FUs, thereby sharing interconnects among a number of FU pairs that need to communicate.
- Communication scheduling (followed by standard operation scheduling that respects the communication schedules) in which communications between FUs are scheduled at appropriate times to minimize the number of buslets needed, subject to buslet cardinality constraints (for the purpose of upper bounding signal delay).
- Buslet binding techniques, aiming to respect both buslet cardinality constraint and a constraint on maximum fanin and fanout for the functional units. These techniques will range from simple but effective approaches like chronological binding (CB) to more sophisticated ones, like the use of lookahead approaches and simultaneous binding of iso-scheduled communications (communications scheduled in the same clock cycle). Furthermore, in this direction, similar solutions detection mechanism was developed, in order to improve the


## SUMMARY (Continued)

final quality of the result. Finally, also a force directed approach was used to solve the binding problem (FDB). All these techniques were implemented and compared in terms of both performance and complexity.

- Buslet power modeling. A number of configurations with multiple tri-state buffers for interconnecting FUs through a buslet were implemented, aiming to minimize the total power consumed using buslets. These range from techniques using minimum spanning trees to more sophisticated structures with constraints on maximum graph distance between connected FUs to hierarchical partitioning.

Using the aforementioned techniques, we obtain significant wirelength (WL) reduction, ranging between $35 \%$ and $71 \%$, compared to conventional designs with dedicated interconnects between communicating FU-pairs. The total chip area, including total FU area, also reduces in our designs compared to conventional designs. The power, on the other side of the coin, will increase with buslet size, but sublinearly. Empirical results show that we are able to limit the increment of power consumed by buslets compared to dedicated-interconnect designs, to a logarithmic function of the maximum buslet cardinality.

## CHAPTER 1

## INTRODUCTION

Our world is full of integrated circuits. It is possible to find them practically everywhere, like in mobiles, cars, televisions etc.; hence, they are now the heart of the modern technology. The integrated circuit (IC) is a cluster embedding a number of very advanced electric circuits. In particular, it will contain a very large number of components like transistors, resistors and capacitors, all interconnected. Among these, the transistor is the fundamental electrical component, at the basement of modern technology. It was invented in 1947 by John Bardeen, Walter Brattain and William Shockley[1], it was revolutionary: it embedded together tiny dimensions, high speed, high reliability and efficiency. Decades after decades, larger and more complex circuits, but at the same time compact and efficient, were this way realized.From the first ICs commercialized in the early 60s, containing just a few of components on the same chip (for example, TAA320 by Philips had two transistors only) and, for this, called Small-Scale Integration devices (SSI), transistors scaling increased thanks to more accurate fabrication technologies (Medium-Scale Integration contained hundreds of transistors on the same chip, Large-Scale Integration hundreds of thousands). Nowadays, the most advanced ICs embed millions of components on an area no larger than a fingernail. The transistors on these chips are realized using up 14 nm technology (like for Intel Core i7-5550U). In order to understand these
sizes, we can say that we could fit thousands of transistors inside a red blood cell ${ }^{1}$. This is the era of Very Large-Scale Integrated Circuits (VLSI).

A very famous prediction in this field was done in 1965 by Gordon Moore, head of the research and development office at Fairchild Semiconductor. He speculated [3] that:
"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years."
G. Moore, 1965

Today we have reached physical limits of transistor scaling: we need to handle problems like leakage power consumption, limited range of gate metals, limited available materials for modeling the channel due to scaling degree and, most important, atomic dimension about to be reached ${ }^{2}$. Keep on scaling transistors (trend known as More Moore) is unsustainable. However, nowadays a huge work of research for allowing transistor scaling is done. One of the

[^0]most promising among these relies on spin state electron usage, called spintronics[4], but there are also other possibilities like tunnel junctions[5][6][7] and new studies on nano-wire geometries[8][9]. The use of these new technologies in order to produce new ICs and, for this, making the transistor scaling trend keep on increasing is known as More than Moore.

Hence, in rapidly shrinking technologies, interconnect design has become a significant issue to be tackled. Indeed, as we are scaling more and more transistors, the density of interconnections is hugely increasing. With this, interconnections consume the most of the dynamic power in the $\mathrm{IC}[10]$. Further, performance in terms of maximum clock frequency is bottlenecked by interconnections because of self and inter-wire parasitic capacitance effects, some of them due to tight narrowness of wires like crosstalk [11], wire performance fall for effects like IR-Drop [12] and its reliability for the electromigration effect [12]. These, mixed with the increase of resistance due to both long wires and use of VIAs, will increase the RC time constant, reducing the cutoff frequency. A number of approaches like buffer insertion[13], wire shaping[14] and communication encoding[15] were proposed in order to minimize VLSI interconnects. However, all of these techniques focus their attention on the minimization of the delay on designs with a big number of interconnections. What if we try to directly minimize the number of the required interconnections or their length? In that case, as effect, we will obtain the effect of minimizing parasitic parameters making also designs less complex, making the routers algorithms running more easily and obtaining better results (in terms of average wirelength for a given design). The problem, here, is how to address such an issue. In order to minimize the "wire cost", we need to work on such a minimization already at HLS phase. During this phase, indeed, FUs are
instantiated as well as their mutual interconnections: the true minimization of interconnections starts here.

Traditionally, in HLS one of the following two problems need to be solved:

- Minimum latency-resource constrained scheduling (ML-RCS): here we have a fixed number of hardware units and we need schedule operations on them to minimize the total latency.
- Minimum resource-latency constrained scheduling (MR-LCS): in this case we have a fixed latency and we need to schedule operations to satisfy the latency constraint and minimize the total number of resources.

As we aim to minimize the interconnect complexity, which is a problem tightly linked to resources minimization, we will solve the MR-LCS problem. Furthermore, this correlates to current problems of interest in both ASICs and embedded designs, where the paradigm of resource/power optimization for a target performance is nowadays dominant. Some works already tried to address such a relevant issue. The approaches used were mainly two:

- Wirelength minimization.
- Minimization of the number of interconnects.

All the papers aiming to minimize the wirelength are interconnection-aware high level synthesis algorithms which will have a floorplanning ${ }^{1}$ step during the HLS. In particular, during this, both scheduling ${ }^{2}$ and binding ${ }^{3}$ steps are performed together and the whole design is driven by the wirelength estimated by the floorplanner. As we can imagine, using such an approach, we will effectively be able to minimize the wirelength. However, as we are introducing a floorplan step, we are also introducing a non negligible computation effort: for each possible decision at the scheduling/binding step, in order to evaluate it, we need to run the floorplanner. Hence, such an approach is feasible for small designs only.

One of the approaches used was integrating interconnect power optimization into high-level synthesis[16]. The main strength of such a paper consisted not only in the reduction datapath unit power but also in the reduction of the power consumed by interconnects. Practically, such a work reduces unnecessary switching activity in the interconnects. This is accomplished by an interconnect-aware binding, which tries to allocate as close as possible hardware resources with a high communication rate. Furthermore, a gating technique is proposed to reduce spurious switching activity. Such a work, however, has the big drawback of running, together with the

[^1]binding algorithm, also the floorplanning stage. This will lead to a big overhead in the computation, also using an approximate power model.

Another proposed work uses physical information to drive the solution at the HLS stage of the VLSI CAD flow[17]. In particular, here it was explicitly designed an incremental floorplanning HLS algorithm, which improves at the same time designs schedule, binding and floorplan. Thanks to the incremental approach, the overhead introduced by the floorplan stage is reduced; however, performing both scheduling, binding and floorplan stages at the same time will make runtime hugely increasing.

Other works reducing interconnections in VLSI have, on the contrary, their focus on trying to minimize the amount of wires itself instead of their own length. They are able to do it detecting "common patterns" across the input[18]: if a sequence of same-type operations is observed in more than one point of the input, then they can be mapped to the same physical units, using exactly the same interconnections, which will result in the reduction of interconnect complexity. This approach is completely unaware of the wirelength because it is not the parameter to be optimized. For this, its execution is relatively fast and it obtains improvements in wiring complexity respect to a wire minimization unaware algorithm. However, such a model is not very accurate for two reasons:

- It is not estimating the wirelength at all: for this reason, there may be cases in which we prefer to have more short connections than few very long ones ${ }^{1}$.
- It is not trivial to dynamically decide the granularity of the pattern detection: if this is fixed, then it may lead to non negligible suboptimal results; if it is dynamically determined, instead, it will result in a high computation effort.

Another approach used in such a direction was trying to optimize communication sharing in pipelined architectures[19]: in this way, the total number of interconnections reduces and their use increases. However, such a technique has some significant weaknesses:

- It works after HLS and, for this, it is limited to an already defined hardware architecture.
- It relies on pipeline architectures: this means that such an optimization is limited to the number of pipeline stages through a line: the more they are, the more the number of registers will be (higher power consumption) and the least they are, the least such an optimization is effective.


### 1.1 Flexible buslets: a new interconnection paradigm

Minimizing the number of interconnects is an interesting because it aims to maximize the use of the same interconnections. Let us define for a given wire $w_{i}$ a parameter, called wire efficiency, as

$$
\begin{equation*}
\varepsilon\left(w_{i}\right)=\frac{T\left(w_{i}\right)}{\bar{\lambda}} \tag{1.1}
\end{equation*}
$$

[^2]where $T\left(w_{i}\right)$ is the number of clock cycles $w_{i}$ is busy and $\bar{\lambda}$ is the total latency of the output design ${ }^{1}$. A good approach in order to minimize interconnections is the maximization of such a parameter. In this way, we are trying to reuse, where possible, the already existing interconnections.

However, in a typical design, such a parameter is usually very low ${ }^{2}$ making a given wire idle for most of the time.

In a typical design, wire efficiency is usually very low (according to experimental data we present in in Table IV and Fig. 48, maximum wire efficiency for dedicated-interconnection based designs is below 0.1 or $10 \%$ ), making a given wire idle most of the time. A natural solution to alleviate these issues is merging close-by wires which are busy in different clock cycles in a single mini-bus like structure which we call a buslet, and whose wirelength (WL) is smaller than the sum of the WL's of the merged wires. Such a buslet is a flexibly-structured mini-bus in that it connects a small subset of FUs placed potentially anywhere on the chip. However, we need to take into account the no-conflict constraint on a given buslet: we need to ensure that at most one FU per clock cycle will drive the buslet (which it will do via tri-state buffers that are enabled at the right clock cycle(s) by the datapath FSM controller, which all HLS design have). In order to ensure this, a proper scheduling algorithm (minimizing the number

[^3]of buslets requested for such a constraint) as well as a binding one, need to be designed, as well as a study of the necessary constraints in order to ensure a good solution quality.

### 1.1.1 Previous work using bus-like structures

Finally, we note that the problem of minimizing design wirelength using "bus-like" structures was earlier proposed in [20]. This problem was solved using partitioned buses, each of which:

1. Spans across all the FUs.
2. Connects to the FUs in a linear order (determined based on the DFG in order to maximize the number of adjacent FU communication).

The number of such buses depends on the maximum number of simultaneous "overlapping" communication needed across the aforementioned linear order, and this depends on the operation scheduling performed to solve the ML-RCS. Further, it is also not clear that they can solve the more standard MR-LCS problem that we are solving, due to their technique not having any control on worst-case communication delay because of their long-spanning buses. While this was an innovative technique, one of the drawbacks is that the number of buses is not minimized in initial scheduling, and this number can be prohibitive for large designs - they only show that their technique works well in reducing total wirelength and area for two small DFGs compared to previous dedicated-interconnects based designs. Further, the worst-case length across which a communication can take place is linear in the number of FUs, leading to a slow design for larger DFGs. Also, since there is no constraint on the number of buses, the input and output capacitive loading of FUs can be large, leading to a further slow-down of their designs, as well
as to high dynamic power. Finally, apparently in order to reduce the interconnect complexity imposed by their linear order (each bus spans across all FUs), they use ALUs instead of FUs of different functionality, thus reducing the number of total resources. However, this can lead to higher leakage power (only one FU within an ALU can be active at any time, meaning that FU usage will be low, and thus they would generally need more total FUs - in spite of the constraint on number of ALUs - than a design of comparable speed that uses individual FUs). In contrast, we have flexibly structured buslets that each connect only a few FUs (thus not having a slow down or increased dynamic power due to unnecessary FU loading from buses). Further, we perform communication scheduling first in order to minimize the number of buslets needed, and thereby significantly reduce interconnect complexity for DFGs of any size.

### 1.2 Thesis outline

The rest of the thesis is organized as follows. In the next chapter an analysis of the constraints needed for designs without high interconnect delay and power consumption will be presented. In Chapter 3, I present an overview of the entire VLSI flow with focus on the general area of this thesis. Furthermore, some basic issues in scheduling are presented, including the well-known force-directed scheduling algorithm, which is the starting point for our scheduling algorithms. Chapter 3 discusses also our scheduling algorithms that aim to minimize, first, the number of buslets, and then the number of FUs. Chapter 4 discusses several new binding algorithms and Chapter 5 is entirely devoted to the design of different buslet interconnection structures in order to minimize dynamic power. The thesis ends with presentation and discussion of results in Chapter 6, and conclusions in Chapter 7.

## CHAPTER 2

## MAIN ISSUES IN BUSLET WIRED DESIGNS

The use of buslet in place of standard dedicated interconnections will bring evident benefits to the whole design in terms of overall wire structure to be routed as well as maximization of the already existing resources itself. However, a number of factors are involved in such a process: from the increment in dynamic power consumed during the communication for fanins to the increment of area due to multiplexers or demultiplexers needed. In this chapter all these aspects will be analyzed, defining the main driving factors of the HLS algorithm.

### 2.1 Maximization of wire efficiency

A design involving buslets will bring evident benefits on the efficiency of the design itself: as previously stated, it is literally possible to merge separate wires which are active in different clock cycles, obtaining benefits in the overall wirelength. Let us take, as example, Fig. 1. From this, we can see that we have to map 4 different communications between 4 different Functional Units (FUs). Unfortunately, in order to map all of these communications, using dedicated interconnections we will need to use four different interconnections, even if all of these will happen in different clock cycles. Now, let us use a buslet instead: as all of these communications will happen in different clock cycles, we will be able to map all of these in a wire structure simultaneously linking all of these FUs. This is possible because, when a


Figure 1: Example of dedicated interconnections design.
Dashed arrows indicate communications while continuous lines are implemented interconnection. Each color indicates a different clock cycle in which the communication happens.
communication between source functional unit $F U_{s}$ and destination functional unit $F U_{d}$ takes place, then all the others wired to the same buslet will ignore the signal, exactly how it does work in a regular bus, thanks to the use of tristate buffers. The same implementation of the problem presented in Fig. 1 with the use of buslets is sketched in Fig. 2.

Let us, now, analyze the difference in wirelength between the two designs. In the dedicated interconnections we will have the following wirelengths:

- blue wire: $d\left(F U_{1}, F U_{3}\right)+0.25 d\left(F U_{2}, F U_{3}\right)$
- red wire: $d\left(F U_{1}, F U_{3}\right)+0.25 d\left(F U_{2}, F U_{3}\right)$
- green wire: $d\left(F U_{2}, F U_{3}\right)$


Figure 2: Example of buslet design. Dashed arrows indicate communications while the continuous line is the buslet. Each color indicates a different clock cycle in which the communication happens.

- black wire: $d\left(F U_{2}, F U_{3}\right)$
where $d(x, y)$ indicates the distance between $x$ and $y$. So, we will have that the total wirelength for dedicated interconnections design in Fig. 1 is obtaining a relevant reduction in wirelength.

$$
d_{\text {ded interc }}=2 d\left(F U_{1}, F U_{3}\right)+2.5 d\left(F U_{2}, F U_{3}\right)
$$

For the buslet wiring, instead, it is evident it is

$$
d_{\text {buslet }}=d\left(F U_{1}, F U_{3}\right)+2 d\left(F U_{2}, F U_{3}\right)
$$

Let us, now, analyze the difference in wire efficiency between the two implementations. According to the definition of wire efficiency in (1.1), in the dedicated interconnections design we have $\frac{1}{4}$ the efficiency we will have in the buslet design. This evidences that having designs with buslets will have also a significant improvement in the overall wire efficiency. We can generalize this telling that the wire efficiency of a buslet is the sum of the dedicated interconnections merged in order to obtain it.

Let us say that $C\left(w_{m}\right)$ is the set of communications mapped in the dedicated interconnection $w_{m}$. We say that $w_{m}$ respects the condition of non conflicting communications if, $\forall c o m_{i} \in$ $C\left(w_{m}\right), t\left(\operatorname{com}_{i}\right) \neq t\left(\operatorname{com}_{j}\right)$, with $\operatorname{com}_{i} \neq \operatorname{com}_{j}$, where $t\left(\operatorname{com}_{i}\right)$ is the scheduling time for $\operatorname{com}_{i}$. However, such a condition works assuming communication latency ${ }^{1} \lambda_{\text {com }}$ unitary. If we have a parametric value of $\lambda_{\text {com }}$, we need to ensure the following conditions:

$$
\begin{cases}t\left(\operatorname{com}_{i}\right)+\lambda_{\mathrm{com}}-1<t\left(\operatorname{com}_{j}\right) & \text { if } t\left(\operatorname{com}_{i}\right) \geq t\left(\operatorname{com}_{j}\right)  \tag{2.1}\\ t\left(\operatorname{com}_{j}\right)+\lambda_{\mathrm{com}}-1<t\left(\operatorname{com}_{i}\right) & \text { if } t\left(\operatorname{com}_{i}\right) \geq t\left(\operatorname{com}_{j}\right)\end{cases}
$$

which can be also written as

$$
\begin{equation*}
\left|t\left(\operatorname{com}_{i}\right)-t\left(\operatorname{com}_{j}\right)\right| \geq \lambda_{c o m} \tag{2.2}
\end{equation*}
$$

which means that the scheduling times for the two communications must differ at least by the latency of the communication itself.

[^4]

Figure 3: Example of overlapping communications.

Let us take, as example, Fig. 3. In this case, at a certain point, $t\left(\operatorname{com}_{j}\right)$ has been scheduled while $t\left(\right.$ com $\left._{i}\right)$ has not finished yet. This obviously means that these communications will bever been scheduled to the same buslet. Now, let us analyze the case represented in Fig. 4. In this case, which is a limit case, the two scheduling times will exactly differ of $\lambda_{\text {com }}$. However, there will not be any conflict between them. In this way, they may be "mapped" to the same buslet. Now, if we decide to merge a set $M$ of wires, we need to ensure the non conflicting communications condition: $\forall c o m_{i} \in \forall w_{n} \in M$ (2.2) must be true.

Let us say that $M$ is a set of dedicated interconnections respecting the condition of non overlapping communications. We can here see that, merging buslets, we will always increase the wire efficiency (1.1)

$$
\begin{equation*}
\varepsilon_{\text {buslet }}=\sum_{\forall w_{n} \in M} \varepsilon_{i} \tag{2.3}
\end{equation*}
$$



Figure 4: Example of non overlapping communications.

However, now comes the problem of how to maximize buslet efficiency. Let us assume we need to merge, in a certain way, some interconnections. Here we are using, for sake of simplicity, $\lambda_{\text {com }}=1$. We have a design with 6 functional units (respectively A, B, C, D, E and F) and with the following communication scheduling times:

$$
\left\{\begin{array}{l}
t[\operatorname{com}(A, B)]=t[\operatorname{com}(A, C)]=t[\operatorname{com}(E, F)] \\
t[\operatorname{com}(C, D)]=t[\operatorname{com}(C, B)]=t[\operatorname{com}(D, E)]
\end{array}\right.
$$

Dedicated interconnections design is pictured in Fig. 5. In order to build buslets merging these wires, we have 6 possibilities.

For example, we can have an example of merging in Fig. 6. Here, in particular, wires $w_{1}$ and $w_{5}$ were merged creating buslet $b_{1}, w_{2}$ and $w_{4}$ generated $b_{2}$ and $w_{3}$ and $w_{6}$ were merged into


Figure 5: Example of dedicated interconnects implementation.


Figure 6: Example of buslet implementation.


Figure 7: Second example of buslet implementation.
$b_{3}$. This may appear to be an optimal solution for the input design. However, in order to get such a solution, all the possible solutions were inspected, and this will take exponential time; so, for big designs, this is not a very good solution. For this, an heuristic to solve such a problem appears to be necessary. Furthermore, we can see that we are constrained by communications scheduling times: in fact, it is evident that the lowest possible number of buslets is the maximum number of overlapping communications scheduling times. Hence, if we could change the previous constraints on this like

$$
\left\{\begin{aligned}
t[\operatorname{com}(A, B)] & =t[\operatorname{com}(A, C)] \\
t[\operatorname{com}(C, D)] & =t[\operatorname{com}(C, B)] \\
t[\operatorname{com}(E, F)] & =t[\operatorname{com}(D, E)]
\end{aligned}\right.
$$

we would be able to reduce the total number of needed buslets. An example is provided in


Figure 8: Functional unit interface with the buslet. It will have an output and an input tristate buffer, which will read/write from a proper buffer.

Fig. 7 in which $w_{1}, w_{5}$ and $w_{6}$ were merged in $b_{1}$ and all the remaining created $b_{2}$. Hence, in order to have good quality solutions, we do not need to work on binding only, but scheduling times for communication play a fundamental role in the final solution quality.

### 2.2 Dynamic power consumption and signal delay: constraint on maximum cardinality

## of a buslet

Reduction of wirelength and increase of wire efficiency are two wonderful effects brought by buslets. However, we have also to take into account two relevant side effects of simultaneously connecting several functional units: each of these will have an input capacitance which will increase the overall dynamic power necessary to perform a single communication. Furthermore, the wirelength of the single interconnection (not of the overall design) will increase, which will also bring to an increment of the signal delay. This implies that we have to use a higher dynamic power compared to the dedicated interconnections case having, at the same time, a lower maximum working frequency. We need to carefully handle such a relevant issues. Let
us suppose that buslets are bidirectional communication structures (so, data can travel from any connected functional unit $F U_{a}$ to any $F U_{b}$ ). In such a structure, any connected resource/FU needs two different "accesses" to the buslet ${ }^{1}$ :

- one as its output, which needs to be a tristate buffer.
- one as its input.

Let us ignore the capacitive parasitics at the tristate buffer outputs (as they generally are much smaller than input capacitances) as well as capacitance for the interconnection itself (for the moment). We can see that the main capacitive load is given by the input capacitances. Let us define buslet cardinality $\left|b_{j}\right|$ of buslet $b_{j}$ as the number of FUs connected to it. We know that, in the general case, some FUs can be connected to a buslet $b_{j}$ as a fanout, without having it as a fanin. For this reason, defining $\left|f i n\left(b_{j}\right)\right|$ as the number of FUs having $b_{j}$ as a fanin, we know that $\left|f i n\left(b_{j}\right)\right| \leq\left|b_{j}\right|$. Defining $C_{i}$ as the input capacitance of $F U_{i}$ (or that if a fanin multiplexer of $b_{j}$ ), the total buslet capacitance will be

$$
\begin{equation*}
C_{b_{j}}=\sum_{i=1}^{\left|f i n\left(b_{j}\right)\right|} C_{i} \tag{2.4}
\end{equation*}
$$

[^5]If we say, for sake of simplicity, that every FU will have the same input capacitance $C_{i n}$, we can approximate (2.4) to

$$
\begin{equation*}
C_{b_{j}} \approx \sum_{i=1}^{\left|f i n\left(b_{j}\right)\right|} C_{i n}=\left|f i n\left(b_{j}\right)\right| \cdot C_{i n} \tag{2.5}
\end{equation*}
$$

According to (2.5), if we have a large number of FUs having a buslet as fanin and, according to this, allowing high buslet cardinality, we may have a high input load capacitance to a driving FU. Hence, one important parameter to be taken into account is an upper bound on buslet cardinality.

Buslet cardinality and length also affect signal delay, and thus limiting maximum buslet cardinality reduces the RC delay by both reducing wire resistance and capacitance, as well as capacitive load of FU inputs. However, signal delay is generally tackled as on upper bound constraint (e.g., signal delay between any two FUs should be at most 1 clock period) and not a minimization metric, and thus the ramification of buslets on signal delay is a little lighter than on power.

### 2.2.1 Maximum fanin and fanout per FU

All the functional units, in order to be connected to a buslet, need an input and an output access to it. However, in the general case, they will be connected to a number of buslets greater than one: we need extra structures in order to connect the input or the output access of the FU to the buslets. These structures are multiplexers for inputs and demultiplexers for the outputs. According to this, from the point of view of a single FU, it becomes important to limit the
number of buslets it is connected to, in order to minimize the signal delay in demultiplexers or multiplexers. Since a FU may have a buslet as only fanin or fanout, these two paramers will typically not be the same, and thus we need an upper-bound constraint on each. In the following, we motivate the need for such constraints:

- Signal delay: as already stated, if a FU has a high fanout, the amount of delay in the data transmission to the buslet is proportional to the number of 1:2 demultiplexers that a larger 1:m demultiplexer is constructed from the signal needs to cross. Assuming a tree structure, if the amount of time to cross a single $1: 2$ demultiplexer is $t_{\text {demux }}$ and the fanout is m , the estimated delay is

$$
\begin{equation*}
\Delta t_{\text {fanout }}=\left\lceil\log _{2}(m)\right\rceil \cdot t_{\text {demux }} \tag{2.6}
\end{equation*}
$$

The same issue applies to fanin using multiplexers. Thus, fixing an upper bound on the fanin and on the fanout, we will ensure data transmission from FU to target buslet (and vice versa) will not be too slow.

- Area increment: multiplexers, as well as demultiplexers, have a cost, here expressed in terms of increment in area size. Assuming fanout m, tree structure for the $1: \mathrm{m}$ demultiplexer and area for a single 1:2 demultiplexer $A_{\text {demux }}$, the expected increment of area is

$$
\begin{equation*}
\Delta A_{\text {fanout }}=(m-1) \cdot A_{\text {demux }} \tag{2.7}
\end{equation*}
$$

which, unlike for the time delay increment in (2.6), is a linear increment and, for this, for high m , this has a non negligible effect on the whole design.

We know, indeed, that a $2: 1$ multiplexer needs 4 transistors to be implemented, as it uses transmission gates, while a full-adder requires 14 transistors[21]. For this reason, as increasing the input data size of $h$ bits (we have 2 inputs of size $h$ ) we will increase hardware area cost of a factor h itself for both an h bit size 2:1 multiplexer and for a Ripple Carry Adder (RCA) (as for two h-bits inputs we need h 2:1 multiplexers and for adding two $h$-bits addenda we need $h$ Full Adders (FAs)), we can approximately say that for a given $h$ data size of any input, we have that the area of a multiplexer is about $\frac{1}{3}$ RCA's. The same issue applies for fanin and multiplexers.

## CHAPTER 3

## SCHEDULING ALGORITHMS

### 3.1 Introduction to the VLSI CAD flow

The electronics industry was able to achieve an outstanding growth over the last two decades, thanks to the huge improvements obtained in the field of large-scale systems design. Essentially, such a market took place thanks to the the advent of VLSI. For this, some well organized structure for designing VLSI circuits needed to be developed.

The design process, at each one of its levels, meets the same intermediate steps. Indeed, it starts from a given set of requirements; then, an initial design structure is developed according to some criteria and tested against the requirements. Whenever the requirements are not met, the design needs to be somehow improved. We can divide the nowadays VLSI CAD flow in the following steps:

1. System specification: In this stage global goals and constraints of the system are defined. In particular, it will be evidenced which one will be the functionality of the system (like input and output signals description) as well as boundaries on minimum speed or maximum power consumption, constraints on size and space occupied by the circuit or even more complex constraints like maximum temperature.
2. High level synthesis: Here we take as input a sequence of operations which will provide the desired output and it will give, as output, a general FU structure with all needed data
transfer between hardware blocks and time in which these FU are allowed to start their computation. Functional Units are here seen as black boxes which will perform a given computation.
3. Logic design: Choice of the proper logic gates structure for each functional unit needed in the high level synthesis stage. This phase can be seen also as RTL description of the FUs.
4. Ciruit design: All logic gates are here converted in transistors, according to the used technology. In particular, all the transistors are here sized in order to meet power and delay requirements (the transistor model has plenty of capacitances which are heavily taken into account in this synthesis stage).
5. Physical design: Here the circuit is converted into a layout, which is the physical implementation of the circuit. This will represent the actual fabrication result of the circuit. It is divided in a number of intermediate stages; however the two most significant are floorplanning (which is the placement on the board of the FUs) and routing (which is the creation of the physical wires interconnecting, where needed, the functional units). Both these synthesis stages can be iterated until physical constraints like delay and power consumption are not met.
6. Physical verification and signoff: In this stage all the constraints are checked. If some of these are not met, then synthesis stage(s) are iterated, depending on how the synthesis algorithms are designed.
7. Fabrication: Here layout is converted into masks for lithography, which will be used for the fabrication in the tapeout process.
8. Packaging and testing: Now, the circuit is physically tested and a verification of the constraints is performed.

We will mainly focus on the high level synthesis stage as it is our work domain.

### 3.2 High level synthesis

High-level synthesis (HLS) became a relevant issue in the VLSI design since the late 1970s. Because of the huge scaling trend happened in the last decades, a lot of the research was performed in order to bring some improvement at the design level. More, soon researchers realized that some decisions taken at in the earliest stages of the CAD flow will determine the quality of the final output. For this reason, a lot of research was done on the so called High Level Synthesis (HLS).

We can divide the high level synthesis in some steps, which can be also each other merged:

1. Scheduling: operation nodes, of the input DFG, will receive a scheduling time, according to some implemented heuristics, which usually aims to minimize total quantity of needed functional units for the given design or the total request time to accomplish the task the whole DFG aims to reach
2. Binding: here, operation nodes will be "mapped" to functional units (FUs) according to some heuristics. This means that we will decide which physical piece of hardware will execute the operation described by each operation node.

As these are the earliest steps in the whole physical design process, any decision taken here may affect quality of gained design in successive steps: for example, if we do not take into account chip area minimization problem, we may have huge designs with very distant FUs, indirectly affecting maximum distance between communicating functional units and, for this, making working frequency drop. For this reason, it is very important to properly take into account a number of metrics also in the early stages of physical design.

Through the history of high level synthesis, a number of algorithms were developed. Here follows the most significant, fundamental ones, in order to wholly understand the thesiswork.

### 3.2.1 As-soon-as-possible scheduling algorithm

This is the simplest scheduling algorithm. Let us suppose we wish to minimize the final design total latency regardless the number of hardware units we will need to use. In order to do this, what we can do is simply schedule any operation node in the first available clock cycle when all its predecessors in the dataflow graph ended their own computation. This gives this algorithm the name as-soon-as-possible (ASAP).

Let us define $\lambda\left(u_{i}\right)$ the latency of the operation node $u_{i}$. If $P\left(u_{i}\right)$ is the set of all predecessors of operation node $u_{i}$, the asap schedule of $u_{i}$ is defined as

$$
\begin{equation*}
\operatorname{asap}(u)=\max _{\forall v \in P(u)}\{\operatorname{asap}(v)+\lambda(v)\} \tag{3.1}
\end{equation*}
$$

Of course, the first nodes in the DFG are immediately scheduled as they do not have dependency with any other node.


Figure 9: This is an example of as-soon-as-possible schedule.

An interesting property of this algorithm is that it provides the lowest possible execution time for the provided input. This is called minimum latency for the input.

Let us take, as example of as-soon-as-possible schedule, Fig. 9. Here we are assuming execution time for each operation node to be 1 clock cycle and communication latency to be negligible. Here we can see that, regardless the number of nodes working at the same clock cycles, we are scheduling all of them as soon as input data is ready. For example, we can schedule node F just when data will arrive from its predecessor nodes A and B. At the beginning, there are some nodes which have no predecessors (in this case, A, B, C, D and E) and are immediately
executed. Following this approach, we see that the clock cycle when the last operation nodes ends its execution (in this case K at clock cycle 5) corresponds also to the minimum possible latency for correctly executing the whole dataflow graph. We see that such an algorithm has a pull-up effect on the scheduling times (all of these are placed at the earliest possible clock cycle). This algorithm will analyze each node $(O(n))$ inspecting all its predecessors $(O(n))$. In the worst case, this runs in $O\left(n^{2}\right)$.

```
Algorithm 1 As-soon-as-possible scheduling
    procedure ASAP(DFG)
        Topological sort \((\mathrm{DFG}) \triangleright\) Here, for every directed edge \(e\left(u_{i}, u_{j}\right)\) from \(u_{i}\) to \(u_{j}, u_{i}\) comes
    before \(u_{j}\) in the ordering
        for all \(u_{i} \in D F G\) do
            first_sched \(=0 \quad \triangleright\) This is asap time, initialized
            for all \(u_{j} \in P\left(u_{i}\right)\) do
            if \(t\left(u_{j}\right)+\lambda\left(u_{j}\right)>\) first_sched then \(\triangleright\) We need to respect data dependency, as
    stated in (3.1)
                    first_sched \(=t\left(u_{j}\right)+\lambda\left(u_{j}\right)\)
            \(t\left(u_{i}\right)=\) first_sched
```


### 3.2.2 As-late-as-possible scheduling algorithm

As-late-as-possible (ALAP) scheduling algorithm can be seen as the complementary than ASAP. Indeed, it will schedule the operation nodes in the latest available opportunity. However, such an algorithm will not provide the input latency as output, but it will ask for it as input.

Hence, the provided latency $\bar{\lambda}$ needs to be greater than the minimum latency for the input, provided by ASAP scheduling:

$$
\begin{equation*}
\bar{\lambda} \geq \lambda_{A S A P} \tag{3.2}
\end{equation*}
$$

Contrarily than the ASAP schedule, the ALAP schedule for a given node $u$ can happen when all its successors are scheduled: if we define $V\left(u_{i}\right)$ the set of all the successor nodes of $u_{i}$, the alap schedule for $u_{i}$ is

$$
\begin{equation*}
\operatorname{alap}(u)=\min _{\forall u_{j} \in V\left(u_{i}\right)}\left\{\operatorname{alap}\left(u_{j}\right)-\lambda\left(u_{i}\right)\right\} \tag{3.3}
\end{equation*}
$$

Here the latest nodes (those having no successors) will be scheduled at time $\bar{\lambda}$ minus their own latency. Let us also here analyze an example of as-late-as-possible schedule (proposed in Fig. 10). We underline that it is the same DFG as in Fig. 9. Here, unlike for asap, we start binding all the nodes which have no successors or, in a different way, all nodes not needing to send computed data (in this case, K, H and I) at the least possible clock cycle (which has to be provided as input) minus their own latency (in this case we are assuming all the operation nodes have the same unitary latency; so, we are scheduling these at $\bar{\lambda}-1$ ). Then, we will schedule all the nodes which have to transmit data to these ( K needs to receive data from both J and G, H from D and E from I) and so on. We see that such an algorithm has a pull-down effect on the scheduling times (all of these are placed at the least possible clock cycle).

The complexity follows the same analysis as done for asap schedule: it is $O\left(n^{2}\right)$. Such an algorithm is not that much useful by itself; however, it contributes in the definition of a very important parameter in our algorithms.


Figure 10: This is an example of as-late-as-possible schedule. It is the same DFG as in 9

```
Algorithm 2 As-late-as-possible scheduling
    procedure \(\operatorname{ALAP}(\mathrm{DFG}, \bar{\lambda})\)
        Reverse topological sort (DFG) \(\triangleright\) Here, for every directed edge \(e\left(u_{i}, u_{j}\right)\) from \(u_{i}\) to \(u_{j}, u_{j}\)
    comes before \(u_{i}\) in the ordering
        for all \(u_{i} \in D F G\) do
            last_sched \(=\infty\)
            for all \(u_{j} \in V\left(u_{i}\right)\) do
            if \(t\left(u_{j}\right)-\lambda\left(u_{i}\right)<\) first_sched then \(\quad \triangleright\) Here we need to respect reverse data
    dependency (3.3)
                last_sched \(=t\left(u_{j}\right)-\lambda\left(u_{i}\right)\)
            \(t\left(u_{i}\right)=\) last_sched
```


### 3.2.3 Mobility range for an operation node

According to the definition of as-soon-as-possible (3.1) and as-late-as-possible (3.3) schedule, given a maximum execution time for out input $\bar{\lambda}$, we can determine a set of acceptable clock cycles in which a given node $u$ can be scheduled. In particular, we can say that, if node $u$ will be scheduled in any clock cycle between $\operatorname{asap}(u)$ and $\operatorname{alap}(u)$, it will certainly exist a solution such that the total latency will be lower or equal than $\bar{\lambda}$. The clock cycles between $\operatorname{asap}(u)$ and $\operatorname{alap}(u)$ are called mobility range of operation node $u$.

Hence, in order to respect latency constraints, we say that scheduling time $t(u)$ must be in its mobility range

$$
\begin{equation*}
t(u) \in[\operatorname{asap}(u) ; \operatorname{alap}(u)] \tag{3.4}
\end{equation*}
$$

Further, from the mobility range of a node $u$ we can know how much decisional power on its scheduling time we have. Such an estimation is obtained just counting all the valid scheduling times for node $u$. Now, as we know by (3.2) that $\operatorname{asap}(u) \leq \operatorname{alap}(u)$, we can determine a parameter, called mobility of node $u$, defined as

$$
\begin{equation*}
\mu(u)=\operatorname{alap}(u)-\operatorname{asap}(u)+1 \tag{3.5}
\end{equation*}
$$

which will always be a positive number. Let us take, as example, Fig. 11. This, in particular, encloses both the asap schedule in Fig. 9 and the alap in Fig. 9. Let us focus on the node G. Here we see that its asap schedule is at clock cycle 2 while its alap is at clock cycle 3. According to what stated for asap and alap schedules and to the maximum latency constraint here fixed


Figure 11: Example of mobility range for a node in a dataflow graph
to $\bar{\lambda}=5$, we will never beein able to schedule G earlier than clock cycle 2 or later than clock cycle 3 . Hence, we have just two possible schedulings for this and, for this, a mobility of 2 .

### 3.3 Force directed scheduling algorithm

One well-known scheduling algorithm, known for its efficacy and efficiency, is forced directed scheduling, by Pierre G. Paulin and John P. Knight [22]. A wide number of MR-LCS algorithms used in High Level Synthesis comes from force directed scheduling (FDS). The main objective of FDS is minimizing the least number of FUs needed for the binding step. Such an objective is achieved uniformly distributing the operation nodes in all the available functional units. This will guarantee that functional units necessary in order to perform some operation in a given time range are efficiently used in any other time moment of the overall design latency, leading to a very high use of the FU and, according to the efficiency definition we provided, to a high
efficiency. Fundamental for FDS algorithm are asap and alap scheduling algorithms, because the mobility range for each operation node is an essential parameter.

We can divide this algorithm essentially in four different steps:

1. Computation of probabilities for operation nodes.
2. Creation of "distribution graphs".
3. Computation of forces and self forces
4. Inclusion of the effect on predecessors and successors

### 3.3.1 Probability estimation for an operation node

FDS algorithm assumes each operation node will have a uniform scheduling probability for any of the clock cycles in the mobility range and null in any other clock cycle. For this, we can say that the probability $p(u, k)$ of scheduling the operation node $u$ in clock cycle $k$ is

$$
p(u, k)= \begin{cases}\frac{1}{\mu(u)} & \text { if } \operatorname{asap}(u) \leq k \leq \operatorname{alap}(u)  \tag{3.6}\\ 0 & \text { otherwise }\end{cases}
$$

Let us take, as example, the probability of scheduling node G in the DFG in Fig. 11. As shown in Fig. 12 and as already stated, the only two available clock cycles G can be scheduled are 2 and 3 . According to the mobility definition, its $\mu$ is 2 . Hence, as stated in (3.6), we have a uniform probability distribution for all clock cycles in range $[2 ; 3]$ which is $\frac{1}{2}$ for each of these clock cycles, while it is zero for all the others clock cycles (in this case, 1 and 4).


Figure 12: Probability for node G to be scheduled in any clock cycle of the DFG in Fig. 11.

### 3.3.2 Distribution graph

After computing all the probabilities for any clock cycle $j$ in range $[0 ; \bar{\lambda}]$, we can see each clock cycles how many functional units of a certain type (like adders, multipliers...) are needed. If we say that $J$ is the set of all the operations nodes $u_{i}$ requiring the same type $g\left(u_{i}\right)$, we can say that, at the generic clock cycle $j$, the distribution graph for functional units of type $g\left(u_{i}\right)$ is

$$
\begin{equation*}
W\left(g\left(u_{i}\right), k\right)=\sum_{u_{i} \in J} p\left(u_{i}, k\right) \tag{3.7}
\end{equation*}
$$

From this, we have an estimation of the peak of needed functional units: indeed, the minimum number of FUs for a design is the maximum number of functional units active in the same clock cycle. $W\left(k\left(u_{i}\right), k\right)$ indicates the distribution of the sum of scheduling a node of a certain type $g\left(u_{i}\right)$ in a given clock cycle $k$. Hence, if we minimize such a quantity, then we will try to schedule operation nodes in points in which the distribution graph is the lowest.

Let us try, for example, to build the distribution graph for multipliers in the DFG in Fig. 11. We have to take all the probabilities for all the latency of our DFG for any node performing multiplication and adding them through every single clock cycle, according to the formulation


Figure 13: Example of distribution graph for multipliers in DFG in Fig. 11
given in (3.7). This is shown in Fig. 13. In particular, in this case we will have the highest value for our distribution graph in clock cycle 1 (it is $\frac{17}{6}$ ). Hence, we will have a lot of operation nodes that can be scheduled in clock cycle 1. As we are solving an MR-LCS problem, we want to minimize such a maximum value. Furthermore, in this way, we are implicitly maximizing the efficiency of the needed functional units and at the same time minimizing them. However, a formulation for this is necessary.

### 3.3.3 Force computation

Now we can compute the so called force for a given schedule ${ }^{1}$. This is the new parameter introduced by such a technique and it can be called the heart of this algorithm.

We define force of a given node $u_{i}$ in clock cycle $k$ as

$$
\begin{equation*}
F\left(u_{i}, k\right)=p\left(u_{i}, k\right) \cdot W\left(g\left(u_{i}\right), k\right) \tag{3.8}
\end{equation*}
$$

This expresses a sort of "repulsive" force, like that of a spring. For this, what we aim to have is the scheduling of the generic operation node $u_{i}$ in the clock cycle having the lowest force possible or, better, where the other forces are pushing to. For this, it is introduced the notion of self force, expressing how much a given force is weak against the others for scheduling the same operation node $u_{i}$ :

$$
\begin{equation*}
S F\left(u_{i}, k\right)=\sum_{j=\operatorname{asap}\left(u_{i}\right)}^{\operatorname{alap}\left(u_{i}\right)}\left(\delta_{j, k}-p\left(u_{i}, k\right)\right) \cdot W\left(g\left(u_{i}\right), k\right) \tag{3.9}
\end{equation*}
$$

where $\delta_{j, k}$ is a Kronecker delta.
In order to evaluate which is the best clock cycle to schedule a particular node $u$, we look at the lowest self-force value of it per every available clock cycle, because essentially it is the clock cycle in which we will make the total force being the lowest.

Let us analyze, for example, the self force for node G. According to the distribution graph,

[^6]

Figure 14: Example of self force computation. Here we want to focus on node G. The DFG is in Fig. 11
we have just two possible schedules (as we can see in Fig. 14). Let us suppose we want to schedule it in clock cycle 2. If we take such a decision, then we will have, as effect, that the distribution graph for cc 2 will increase of a value $1-\frac{1}{2}$ while it will decrease for clock cycle 2 of $\frac{1}{2}$. So, we will have distribution graph value in 2 being $\frac{8}{6}+\left(1-\frac{1}{2}\right)=\frac{11}{2}$ while in 3 being $\frac{5}{6}-\frac{1}{2}=\frac{2}{6}$. We can see a representation of this in Fig. 15a

Let us suppose, now, that we wish to schedule it in clock cycle 3. The distribution graph will decrease its value in 2 to $\frac{8}{6}-\frac{1}{2}=\frac{5}{6}$ while it will increase in 3 to $\frac{5}{6}+\left(1-\frac{1}{2}\right)=\frac{8}{6}$. We can see a representation of this in Fig. 15b. So, at the end, we will have a lower maximum in the distribution graph if we schedule G in 3.


Figure 15: Example of self force computation. Here we are scheduling node G in clock cycle 2 (a) or 3 (b). The DFG is in Fig. 11

The self force formulation in (3.9) performs exactly such an analysis: it will evaluate at the same time the increment of the distribution graph at a certain point and the decrease in all the other possible schedules. For this, choosing the lowest value of self force for performing our scheduling, we will at the same time minimize both the value of the distribution graph in the scheduling point and in all the other points in the mobility range.

### 3.3.4 Predecessor and successor forces

Scheduling the operation node $u_{i}$ in a given clock cycle $k$ will affect mobility range of adjacent nodes in the dataflow graph. In particular, we will have the effect of modifying the


Figure 16: Reduction of mobility range for predecessor of $G$ if scheduled at clock cycle 2 . The DFG is in Fig. 11
mobility range for the predecessors of $u_{i}$ modifying $\operatorname{alap}\left(P\left(u_{i}\right)\right)$ and the mobility range of its successors modifying $\operatorname{asap}\left(V\left(u_{i}\right)\right)$.

Let us take the example of scheduling node $G$ in clock cycle 2. According to the input DFG (in Fig. 11), it has as predecessor node C and as successor K . If we perform such a scheduling, it means that it needs to receive input data at most by clock cycle 2. For this, C needs to produce its output data at most by cc 2 and, for this, it has to be scheduled earlier. This results in the reduction of its mobility range and, for instance, in the reduction of its mobility, which will modify the distribution graph itself. This is visible in Fig. 16 This may be a non-negligible effect for the quality of our solution because it may lead to suboptimalities. For this, we have to take in consideration the effect of the scheduling on predecessors and successors for the scheduled node, together with self force in the choice of scheduling to be performed.


Figure 17: Effect of scheduling node $N_{j}$ on predecessor $N_{i}$ and successor $N_{k}$

The predecessor force $\operatorname{PF}\left(u_{i}, k\right)$, corresponding to scheduling $u_{i}$ at clock cycle $k$, is

$$
\begin{array}{r}
P F\left(u_{i}, k\right)=\sum_{u_{h} \in P\left(u_{i}\right)} \sum_{j \in M R\left(u_{h}\right)} \\
{\left[\Delta_{\text {pred }}\left(j, k, u_{h}\right)\right] W(j)} \tag{3.10}
\end{array}
$$

where

$$
\Delta_{\text {pred }}\left(j, k, u_{h}\right)=\left\{\begin{array}{l}
p_{\text {new }}\left(u_{h}, j\right)-p\left(u_{h}, j\right)  \tag{3.11}\\
\text { if } j+\lambda\left(u_{h}\right) \\
+\lambda(\text { com }) \leq k \\
-p\left(u_{h}, j\right) \\
\text { otherwise }
\end{array}\right.
$$

$$
\begin{equation*}
p_{\text {new }}\left(u_{h}, j\right)=\frac{1}{\mu_{\text {new }}\left(u_{h}\right)} \tag{3.12}
\end{equation*}
$$

$\lambda\left(u_{h}\right)$ is the latency $u_{h}$-type node requires and $\mu_{\text {new }}\left(u_{h}\right)$ is the new mobility of $u_{h}$ due to $\operatorname{alap}\left(u_{h}\right)$ reduction to $k-\lambda(c o m)-\lambda\left(u_{h}\right)$ in case it originally was $>=k-\lambda(c o m)-\lambda\left(u_{h}\right)(k$ is the clock cycle for which the force effect of the current node of interest $u_{i}$ is being determined).

Instead, the successors force $\operatorname{VF}\left(u_{i}, k\right)$ is

$$
\begin{array}{r}
V F\left(u_{i}, k\right)=\sum_{u_{h} \in V\left(u_{i}\right)} \sum_{j \in M R\left(u_{h}\right)} \\
{\left[\Delta_{\text {succ }}\left(j, k, u_{h}\right)\right] W(j)} \tag{3.13}
\end{array}
$$

where

$$
\Delta_{\text {succ }}\left(j, k, u_{h}\right)=\left\{\begin{array}{l}
p_{\text {new }}\left(u_{h}, j\right)-p\left(u_{h}, j\right)  \tag{3.14}\\
\text { if } j-\lambda_{\text {com }} \leq k \\
-p\left(u_{h}, j\right) \\
\text { otherwise }
\end{array}\right.
$$

Let us analyze more in depth the example previously discussed and pictured in Fig. 16. In particular, let us analyze the predecessor force contribution to the scheduling of G to cc 2. In particular, it can no longer be scheduled in clock cycle 2. For this, we will apply a similar formulation already described for self force. For this, we will have positive contribution


Figure 18: Reduction of mobility range for predecessor of $G$ if scheduled at clock cycle 2 and its effect on the distribution graph. The DFG is in Fig. 11.
(making the evaluation of the current scheduling worse) from all the schedules which will not be available while a negative one from those still available. In this case, in particular, we will have for clock cycle 2 a contribution $1-\frac{1}{2} \cdot \frac{8}{6}$ which indicates that the current solution will bring the distribution graph having a higher value in 2 . As we can see, this is conform to the formulation provided in (3.10).

Finally, the schedule being chosen $\left(u_{i}, k\right)$ is the one having the lowest value, combining self force and predecessors/successors contributions.

$$
\begin{equation*}
\left(u_{i}, k\right)=\min _{\forall k, \forall u_{i}}\left\{S F\left(u_{i}, k\right)+P F\left(u_{i}, k\right)+V F\left(u_{i}, k\right)\right\} \tag{3.15}
\end{equation*}
$$

```
Algorithm 3 Force directed scheduling
    procedure \(\operatorname{FDS}(\mathrm{DFG}, \bar{\lambda})\)
        REM_NODES \(\leftarrow\) NODES(DFG) \(\triangleright\) Load all the nodes in remaining nodes
        while REM_NODES \(\neq \emptyset\) do \(\quad \triangleright\) While there still is one node to be scheduled
            Update asap and alap times for all REM_NODES
            Compute distribution graph for all clock cycles
                \(\triangleright\) According to (3.7)
            for all \(u_{i} \in R E M_{-} N O D E S\) do \(\quad \triangleright\) For all unscheduled nodes compute forces
                for all \(c k \in\left[\operatorname{asap}\left(u_{i}\right) ; \operatorname{alap}\left(u_{i}\right)\right]\) do
                    Compute \(S F\left(u_{i}, c k\right)\)
                                \(\triangleright\) The computation follows (3.9)
                    \(\operatorname{PSF}\left(u_{i}, c k\right)=0\)
                    for all \(u_{j} \in P\left(u_{i}\right), V\left(u_{i}\right)\) do \(\quad \triangleright\) For all the predecessors and successors of \(u_{i}\)
                        for all \(c c \in\left[\operatorname{asap}\left(u_{j}\right) ; \operatorname{alap}\left(u_{j}\right)\right]\) do
                        if \(c c\) valid schedule then \(\quad \triangleright\) This follows the formulation in (3.10)
                                    \(\operatorname{PSF}\left(u_{i}, c k\right)=S F\left(u_{j}, c k\right)\)
                                    else
                                    \(\operatorname{PSF}\left(u_{i}, c k\right)+=S F\left(u_{j}, c k\right)\)
            Schedule \(u_{\text {sch }}\), having \(\min \left\{S F\left(u_{i}, c k\right)+P S F\left(u_{i}, c k\right)\right\}\)
            Remove \(u_{\text {sch }}\) from REM_NODES \(\quad \triangleright\) You remove it as it is now scheduled
        return (sched(NODES)) \(\quad \triangleright\) Return all nodes scheduling times
```


### 3.3.5 Complexity

Let us first analyze this algorithm without predecessors and successors. In such a condition, we will analyze every node every scheduling step (as we need to compute self forces), this will totally take $O\left(n^{2}\right)$ time. Including predecessors and successors, as their maximum number can be $n-1$, we have an overhead of $O(n)$, making the total complexity being $O\left(n^{3}\right)$.

### 3.4 Scheduling algorithm for buslet-based designs

The topic here discussed involves the scheduling of operation nodes. As previously seen, a good scheduling algorithm used for minimizing the minimum amount of FUs for a given design is FDS. However, such an approach assumes that an arbitrary number of interconnections is available for the final design and their contribution to the design is negligible. In general, this is not true, as a big number of interconnections will make the design at the routing phase being a very complex problem to be solved.

### 3.4.1 Communication with dedicated interconnections

For any design using dedicated interconnections, if we schedule node $u_{i}$ at time $t\left(u_{i}\right)$ and one of its successors $v_{i}$ at $t\left(v_{i}\right)$ (in this way, if $P\left(v_{i}\right)$ is the set of $v_{i}$ predecessors and $V\left(u_{i}\right)$ is the set of successors of $u_{i}$, we can say that $u_{i} \in P\left(v_{i}\right)$ and $\left.v_{i} \in V\left(u_{i}\right)\right)$, the communication $\operatorname{com}\left(u_{i}, v_{i}\right)$ between these can happen in the time frame

$$
\begin{equation*}
t\left[\operatorname{com}\left(u_{i}, v_{i}\right)\right] \in\left[t\left(u_{i}\right)+\lambda\left(u_{i}\right) ; t\left(v_{i}\right)-\lambda_{\text {com }}\right] \tag{3.16}
\end{equation*}
$$



Figure 19: Example of time frame in which the communication between $u_{i}$ and $u_{j}$ can happen
where $\lambda_{\text {com }}$ is the latency of the communication, usually

$$
\begin{equation*}
\lambda_{\text {com }}=1 \tag{3.17}
\end{equation*}
$$

and deciding when a given communication will take place will not appreciably change the hardware structure as well as the total power consumed ${ }^{1}$. For this reason, such a problem is not taken into account by scheduling, as it involves the management of registers and their allocation. Timing for communications, in this way, will happen after the binding phase.

Let us suppose $u_{j}$ is a successor of $u_{i}$. Further, $u_{i}$ will have data ready from clock cycle 3 and $u_{j}$ is scheduled for clock cycle 5 (as pictured in Fig. 19). If $\lambda_{c o m}=1$ we can schedule

[^7]

Figure 20: Example of communications possible concurrency.
Such an issue is relevant for our buslet-based design.
the communication to both clock cycle 3 or 4 and, in case it is $\lambda_{\text {com }}=2$, just in clock cycle
3. However, from this we can see that, when we will perform the binding, we really need an interconnection between the functional units $u_{i}$ and $u_{j}$ will be bound to and, finally, such a communication needs be performed in any case ${ }^{1}$.

### 3.4.2 Force directed communication scheduling

Here we wish to use wiring structures which will connect at the same time a number of functional units. For this, in order to minimize their minimum needed number, we have to reduce as much as possible their concurrency.

Let us suppose to have three communications having some timeframes (computed as previ-

[^8]ously stated previously and stated in Fig. 19) as pictured in Fig. 22. As we are going to use buslets, we need to ensure that, in a given buslet, conflicts do not have to happen. For this, one communication at a time can happen on each buslet. In case we decide to schedule all the communications in clock cycle 3, then we need 3 buslets. However, as it is possible to see, assuming $\lambda_{\text {com }}=1$, it is easy to minimize such a number scheduling communications in different clock cycles. So, if our aim is here to minimize the minimum number of needed buslets, the scheduling must happen not on the nodes in the DFG (operation nodes) but on the arcs (communications between operation nodes).

For this, let us assume that operation nodes are not scheduled yet. we can say that the mobility range for communication $\operatorname{com}\left(u_{i}, v_{i}\right)$ between operation node $u_{i}$ and $v_{i}$ is

$$
\begin{equation*}
\mu\left[\operatorname{com}\left(u_{i}, v_{i}\right)\right] \in\left[\operatorname{asap}\left(u_{i}\right)+\lambda\left(u_{i}\right) ; \operatorname{alap}\left(v_{i}\right)-\lambda_{\text {com }}\right] \tag{3.18}
\end{equation*}
$$

An example of communication mobility range can be found in Fig. 21. However, once we will schedule a communication, we will have an effect on the mobility range of the operation nodes. Let us decide to schedule communication $\operatorname{com}\left(u_{i}, u_{j}\right)$ from $u_{i}$ to $u_{j}$ in $t\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right]$. The new alap time for $u_{i}$ will be

$$
\begin{equation*}
\operatorname{alap}\left(u_{i}\right)=t\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right]-\lambda\left(u_{i}\right) \tag{3.19}
\end{equation*}
$$

while the new asap time for $v_{i}$ will be

$$
\begin{equation*}
\operatorname{asap}\left(v_{i}\right)=t\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right]+\lambda_{c o m} \tag{3.20}
\end{equation*}
$$



Figure 21: Mobility range for communication between $u_{i}$ and $u_{j}$.
We are assuming here $\lambda\left(u_{i}\right)=2$ and $\lambda_{\text {com }}=1$.

In general, the situation presented in (3.19) and (3.20) is pictured in Fig. 22. This happens because we need to ensure data from the source being ready by the beginning of the communication from the source $u_{i}$ and data to be computed by $u_{j}$ has already been transmitted when $u_{j}$ is scheduled. For this reason, we need to take into account a reduction of the mobility range for operation nodes. However, as our main objective here is minimizing concurrency between communications, we will perform force directed scheduling for communications (FDCS) independently from the regular FDS. However, we will not have the warranty that mobility for every operation node will drop to 1 .

For this, regular FDS has to be performed on the modified mobility ranges after FDCS, in order to get a complete schedule for both communications and operation nodes.


Figure 22: Example of communication scheduling and its effect on mobility ranges for functional units

```
Algorithm 4 Force directed scheduling
    procedure \(\operatorname{FDCS}(\mathrm{DFG}, \bar{\lambda})\)
        REM_COMS \(\leftarrow\) EDGES \((\) DFG \() \quad \triangleright\) Load all the edges of the input DFG
        while REM_COMS \(\neq \emptyset\) do \(\quad \triangleright\) While there still is one com to be scheduled
            Update asap and alap times for all REM_COMS
            Compute distribution graph for all clock cycles
                                \(\triangleright\) According to (3.7)
            for all \(e_{i} \in\) REM_COMS do \(\quad \triangleright\) For all unscheduled coms compute forces
                for all \(c k \in\left[\operatorname{asap}\left(e_{i}\right) ; \operatorname{alap}\left(e_{i}\right)\right]\) do
                Compute \(S F\left(e_{i}, c k\right) \quad \triangleright\) The computation follows (3.9)
                \(\operatorname{PSF}\left(e_{i}, c k\right)=0\)
                    for all \(e_{j} \in P\left(e_{i}\right), V\left(e_{i}\right)\) do \(\quad \triangleright\) For all the predecessors and successors of \(e_{i}\)
                for all \(c c \in\left[\operatorname{asap}\left(e_{j}\right) ; \operatorname{alap}\left(e_{j}\right)\right]\) do
                        if \(c c\) valid schedule then \(\quad \triangleright\) This follows the formulation in (3.10)
                \(\operatorname{PSF}\left(e_{i}, c k\right)-=S F\left(e_{j}, c k\right)\)
                    else
                    \(\operatorname{PSF}\left(e_{i}, c k\right)+=S F\left(e_{j}, c k\right)\)
            Schedule \(e_{s c h}\), having \(\min \left\{S F\left(e_{i}, c k\right)+P S F\left(e_{i}, c k\right)\right\}\)
            Remove \(e_{s c h}\) from REM_COMS \(\triangleright\) You remove it as it is now scheduled
        return (sched(COMS))
                                Return all communications scheduling times
```


### 3.4.3 Complexity

Let us first analyze this algorithm without predecessors and successors. In such a condition, we will analyze every edge (as we are talking about communications) every scheduling step (as we need to compute self forces), this will totally take $O\left(e^{2}\right)$ time. Including predecessors and successors, as their maximum number can be $e-1$, we have an overhead of $O(e)$, making the total complexity being $O\left(e^{3}\right)$.

### 3.5 Summary of scheduling algorithms

Our scheduling is articulated in two different parts, sequentially executed:

1. Scheduling for communications: applies exactly the same model than FDS (3.9) to communications, including communication predecessors/successors analysis (3.10)
2. Scheduling for operation nodes, according to remaining mobility ranges

The complexity for FDCS is $O\left(e^{2}\right)$ while the complexity for FDS is $O\left(n^{2}\right)$. As these algorithms are applied sequentially and we can not say anything on the input DFG connectivity, the total complexity for the scheduling is $O\left(n^{2}+e^{2}\right)$.


Figure 23: General dataflow for the scheduling technique used

## CHAPTER 4

## BINDING ALGORITHMS

### 4.1 General concepts to be taken into account

After communication and node scheduling, we need to bind each operation node to a FU and each communication between operation nodes to a target buslet (these bindings also lead to binding of FU connections to buslets). From FDCS and FDS we obtained:

- Scheduling times for operation nodes $t\left(u_{i}\right)$
- Scheduling times for communications $t\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right]$
- Lower bound on the final number of functional units. We may finally need more FUs than those determined during scheduling because of the fanin/fanout constraints we impose on binding that were not imposed during scheduling (we cannot determine fanin/fannout during scheduling).

Let us assume we need to bind operation node $u_{i}$ to a target functional unit. This operation node requires a FU able to perform the operation needed by $u_{i}$. Let us identify the type of functional units able to perform the operation described by operation node $u_{i}$ as $k\left(u_{i}\right)$ and let us say the set of FU of type $k\left(u_{i}\right)$ is $F U\left[k\left(u_{i}\right)\right]$. We identify the subset $F U_{\text {avl }}\left(u_{i}\right)$ of available functional units as the set of functional units $r_{l} \in F U\left[k\left(u_{i}\right)\right]$ which
are idling in the clock cycles range $\left[t\left(u_{i}\right) ; t\left(u_{i}\right)+\lambda\left(u_{i}\right)-1\right]$. More formally we can say that

$$
\begin{equation*}
r_{l} \in F U_{\text {avl }}\left(u_{i}\right) \mid \prod_{k=t\left(u_{i}\right)}^{t\left(u_{i}\right)+\lambda\left(u_{i}\right)-1} \delta_{c k}\left(r_{l}, k\right)=1 \tag{4.1}
\end{equation*}
$$

where

$$
\delta_{c k}\left(r_{l}, k\right)= \begin{cases}1 & \text { if } r_{l} \text { idling in } k  \tag{4.2}\\ 0 & \text { otherwise }\end{cases}
$$

Further, we also observe that we have

$$
\begin{equation*}
F U_{\text {avl }}\left(u_{i}\right) \subseteq F U\left(k\left(u_{i}\right)\right) \tag{4.3}
\end{equation*}
$$

Let us here define, for a given functional unit $r_{l}$, the set of all the buslet in fanin as fin $\left(r_{l}\right)$ and the set in fanout as $\operatorname{fout}\left(r_{l}\right)$. From this, we know that the fanin of $r_{l}$ is the cardinality of $\operatorname{fin}\left(r_{l}\right)$ and that the fanout is the cardinality of fout $\left(r_{l}\right)$. We define fanin for the functional unit $u_{i}$ as $\left|\operatorname{fin}\left(r_{l}\right)\right|$ and fanout as $\mid$ fout $\left(r_{l}\right) \mid$. Because of the constraint on maximum allowable fanin and fanout on any functional unit, we need also to ensure that, after binding, for any $r_{l}$ in our design, we will have

$$
\begin{equation*}
\left|f i n\left(r_{l}\right)\right| \leq \text { max_fanin } \tag{4.4}
\end{equation*}
$$

and

$$
\begin{equation*}
\mid \text { fout }\left(r_{l}\right) \mid \leq \text { max_fanout } \tag{4.5}
\end{equation*}
$$

It may happen that, if we will not have any $r_{l} \in F U_{\text {avl }}\left(u_{i}\right)$ also respecting (4.4) and (4.5), we may need to add some functional unit of type $k\left(u_{i}\right)$ to $F U\left[k\left(u_{i}\right)\right]$ and, consequently, to $F U_{\text {avl }}\left(u_{i}\right)$.

- Lower bound on the final number of buslets. The number of buslets can also increase beyond this value during binding due to the max buslet cardinality constraint on it that we cannot determine during scheduling.

Let us assume that the set of buslets in our design is $B$. In this way, any buslet in the design $b_{k} \in B$. Furthermore, the number of total buslets in our design is given by the cardinality of $B$. For this, we say that our design has $|B|$ buslets. Let us assume we need to bind $\operatorname{com}\left(u_{i}, u_{j}\right)$. Let us define set of available buslets $B_{\text {avl }}\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right]$ as set of buslets which are idling in clock cycles window $\left[t\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right] ; t\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right]+\lambda_{\text {com }}-1\right]$. More formally,

$$
\begin{equation*}
b_{i} \in B_{\text {avl }}\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right] \mid \prod_{k=t\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right]}^{t\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right]+\lambda_{c o m}-1} \delta_{c k}\left(b_{i}, k\right)=1 \tag{4.6}
\end{equation*}
$$

where $\delta_{c k}\left(b_{i}, k\right)$ is defined in (4.2).By consequence, we also have

$$
\begin{equation*}
B_{\text {avl }}\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right] \subseteq B \tag{4.7}
\end{equation*}
$$

Let us assume that

$$
\begin{equation*}
F U_{a v l}\left(u_{i}\right) \neq \emptyset \tag{4.8}
\end{equation*}
$$

and that

$$
\begin{equation*}
F U_{a v l}\left(u_{j}\right) \neq \emptyset \tag{4.9}
\end{equation*}
$$

If $\forall r_{l} \in F U_{\text {avl }}\left(u_{i}\right), \forall r_{m} \in F U_{\text {avl }}\left(u_{j}\right), \forall b_{k} \in B_{\text {avl }}\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right]$, after binding, we will have

$$
\begin{equation*}
\left|b_{k}\right|>\text { max_buslet_cardinality } \tag{4.10}
\end{equation*}
$$

then we need to add another buslet to $B$ and, consequently, to $B_{\text {avl }}\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right]$, in order to use it to bind $\operatorname{com}\left(u_{i}, u_{j}\right)$.

At the end of binding process, what we aim to reach is to satisfy buslet cardinality constraint as well as maximum fanin and fanout. Moreover, just using these conditions is not sufficient in order to have a design in which we have low wirelength: this has to be a driving factor in our whole binding process. For this reason, what we want to use is a wirelength estimation model giving us an idea on which solution, among all those available, may be better to choose.

### 4.1.1 Wirelength model

As we need a very quick-to-be computed wirelength model, without performing any kind of floorplanning, we estimate wirelength for a given buslet $b_{k}$ based on only two parameters:

1. current buslet cardinality $\left|b_{k}\right|$
2. maximum allowed buslet cardinality $\max \{|b|\}$

We assume for simplicity that all FUs have the same square shape and same area. We then estimate the buslet WL as the sum of Manhattan distances of all FUs in it from the center of


Figure 24: Examples of computation of wirelength for buslet cardinalities from 2 to 5
gravity of the most "compact" area that the buslet and its FUs lie in. This is a obviously lower bound on the WL of the buslet, since due to FUs being connected to multiple buslets, not all buslets can be located in their most compact areas determined independent of other buslets. However, as empirical evidence shows (Fig. 50 and Table III), this simple WL model works well in yielding a final design with low floorplanning determined WL.

### 4.1.2 Effect of fanin/fanout constraints on binding

As previously stated, during our binding, we will have to deal with fanin and fanout constraints for ach functional unit. Let us say that $F U_{l}$ is a generic functional unit required from
our binding. We need to ensure that the number of buslets it is connected as fanin will not exceed a given threshold level max_fanin

$$
\begin{equation*}
\text { fanin }\left(F U_{l}\right) \leq \text { max_fanin } \tag{4.11}
\end{equation*}
$$

Same issue will apply to fanout:

$$
\begin{equation*}
\text { fanout }\left(F U_{l}\right) \leq \text { max_fanout } \tag{4.12}
\end{equation*}
$$

We need to ensure this during binding. However, such a problem is not straightforward. Let us say that we want to bind operation node $u_{i}$ to functional unit $F U_{l}$, which, as it is connected to the operation node $u_{j}$, bound to $F U_{m}$, will require a link with it through a buslet $b_{k}$. When we perform such an operation, we have to ensure that, at least,

$$
\begin{equation*}
\operatorname{fanin}\left(F U_{l}\right)+\sum_{\forall k\left(u_{n}\right) \in P_{\text {unb }}\left(u_{i}\right)} \delta_{\text {comn }}\left(F U_{l}, k\left(u_{n}\right)\right) \leq \text { max_fanin } \tag{4.13}
\end{equation*}
$$

where

$$
\delta_{\text {conn }}\left(F U_{l}, k\left(u_{n}\right)\right)= \begin{cases}1 & \text { if } F U_{l} \text { not connected to any } F U \text { of type } k\left(u_{n}\right)  \tag{4.14}\\ 0 & \text { otherwise }\end{cases}
$$



Figure 25: Example of fanin/fanout constraint to be taken into account if binding G (in red). FU which binds G needs to be able to be connected to the connected to the ones the green op nodes will be mapped to.
and $P_{u n b}\left(u_{i}\right)$ is the set of unbound predecessor nodes of $u_{i}$. This is because we need to ensure that, after binding $u_{i}$ to $F U_{l}$, we will be able to respect maximum fanin constraint. Such a constraint will apply to the fanout constraint:

$$
\text { fanout }\left(F U_{l}\right)+\sum_{\forall k\left(u_{n}\right) \in V_{u n b}\left(u_{i}\right)} \delta_{\text {conn }}\left(F U_{l}, k\left(u_{n}\right)\right) \leq \text { max_fanout }
$$

where $V_{u n b}\left(u_{i}\right)$ is the set of unbound successor nodes of $u_{i}$.
A practical example of this is presented by Fig. 25. Here, when we decide to bind the operational node G to a given functional unit $F U_{i}$, we also need to ensure it will be able to be connected to at least one more functional unit (that K will be mapped to): in fact, C can potentially be
mapped to the same FU as it is of the same type. However, such a constraint may be too tight. Let us assume we have $u_{n} \in V_{u n b}\left(u_{i}\right)$. When binding $u_{i}$ to $F U_{l}$, we have $F U_{l}$ already connected to the functional unit $F U_{m}$, with $k\left(u_{n}\right)=k\left(F U_{m}\right)$. However, when binding $u_{n}$, we may have the case that $F U_{m}$ has already been bound from another operation node. This means that we may not been able to satisfy maximum fanout constraint. Same issue applies to predecessors and fanin constraint. In order to be sure to satisfy fanin/fanout constraints, we need to use the loose constraints

$$
\begin{gather*}
\text { fanin }\left(F U_{l}\right)+\left|P_{\text {unb }}\left(u_{i}\right)\right| \leq \text { max_fanin }  \tag{4.15}\\
\text { fanout }\left(F U_{l}\right)+\left|V_{\text {unb }}\left(u_{i}\right)\right| \leq \text { max_fanout } \tag{4.16}
\end{gather*}
$$

This may oversize the constraint but this will certainly allow us to satisfy maximum fanin and fanout constraints.

### 4.2 Chronological binding

The first finding algorithm we are going to introduce is called chronological binding (CB). This is because it will perform binding moving chronologically: if we are binding $\operatorname{com}\left(u_{i}, u_{j}\right)$, then it means that we will have already bound all the communications $\operatorname{com}\left(u_{k}, u_{l}\right)$ having $t\left[\operatorname{com}\left(u_{k}, u_{l}\right)\right]<t\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right]$. This is a greedy strategy that will ensure us to bind the absolute locally best solution for the current communication. This technique, in its semplicity, will bring with itself two huge benefits:

1. Makes reuse of hardware available more natural
2. Reduces the solution space to be explored


Figure 26: Example of reuse of previously bound buslets.
In blue communications binding, while in green operation nodes binding.

### 4.2.1 Reuse of already available hardware

Binding communications in chronological way, we will be able to more easily reuse already existing hardware. Let us suppose we have bound $\operatorname{com}\left(u_{i}, u_{j}\right)$ to $b_{k}$, binding also $u_{i}$ to the functional unit $r_{0}$ and $u_{j}$ to $r_{1}$. When its communication successor $\operatorname{com}\left(u_{j}, u_{l}\right)$ needs to be bound, considering that $u_{j}$ has already been bound to $r_{1}$, if resource type $k\left(u_{l}\right)=k\left(u_{i}\right)$ and $r_{0}$ is an available FU , then it is possible to bind such a communication without any resource adding. We have also to add that it is very likely for both $r_{0}$ and $b_{k}$ to be available as we are moving chronologically.

Let us analyze an example to better understand this. Looking at Fig. 26, we see that, moving chronologically, we will first bind $\operatorname{com}(A, B)$. Let us assume A is bound to the FU A1 while B to B1. In order to perform the communication between these functional units, a certain interconnection is created (as we will use buslets, it will be through one of these). When
we will need to bind successor communication, if it is with an "A-type" operation node, we can reuse such a communication. Furthermore, as we are moving chronologically and if we realistacally assume $\lambda_{\text {com }}=1$, if such a connection exists, it can certainly be reused. This naturally contributes int the wire efficiency increment.

### 4.2.2 Reduction of solution space

In the general case (except at the beginning of our binding process), when we need to bind a communication $\operatorname{com}\left(u_{i}, u_{j}\right)$, we have already bound $u_{i}$ to a target functional unit, as we have also bound the predecessor communication $\operatorname{com}\left(u_{h}, u_{i}\right) \in P\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right]$. If we consider the space of the solutions as combination of communications, buslets, source and destination functional units, each binding step not only we are analyzing one communication at a time but we already know which is the functional unit the source is bound to. For this, as the total number of buslets will be at most the number of edges $e$ of the DFG and the quantity of functional units is the amount of nodes $n$ of the DFG, then we say that CB will scale the size of solutions to be analyzed from $n^{2} e^{2}$ to $n e$.

An example of such an issue can be observed in Fig. 27. Let us suppose we already bound $\operatorname{com}(\mathrm{A}, \mathrm{B})$ with, of course, operation nodes to functional units. When we are going to bing the successor communication $\operatorname{com}(\mathrm{B}, \mathrm{A})$, we already find B being bound to B 1 . This, as it is here evidenced, reduces the solution space to be explored. This is something constant in all the binding execution, significantly reducing the amount of possibilities to be computed each step.


Figure 27: Example of binding solution space reduction.
In blue communications binding, while in green operation nodes binding.

```
Algorithm 5 General chronological binding algorithm
    procedure Chronological_Binding(sched(NODES, COMS))
        Sort_growing_sched_time(COMS)
        partial_soln_set \(=\emptyset\)
        initial_empty_binding \(=\) initialize(buslets, FUs) \(\quad \triangleright\) Initialized empty design
        push(partial_soln_set, initial_empty_binding) \(\triangleright\) Initialized list of partial solutions
        for all \(\operatorname{com}_{i} \in C O M S\) do
            com_bound \(=\) false
            while com_bound \(=\) false do
            Compute all possibilities and put them in Local_solns
            if Local_solns \(=\emptyset\) then \(\triangleright\) If no possibility available, then we lack of FU or buslet
                    Add needed resource (FU or buslet)
            else
                        com_bound \(=\) true
                        if similar_soln_pruning \(=\) true then
                            Local_solns \(=\) Similar_solns_cutter(Local_solns) \(\quad \triangleright\) This function will be
    explained in Sec. 4.3
                Sort_in_decreasing_WL_order(Local_solns)
                    partial_soln_set \(=\) first_n_solns(Local_solns, soln_held_per_step \() ~ \triangleright\) We take the
    best n solutions
        return (first(partial_soln_set)) \(\triangleright\) We return at the end the design having the best
    wirelength
```


### 4.2.3 Complexity

About the binding, as we are binding e different communications to $b$ buslets and to $r$ functional units, a first estimation of the complexity could be $O\left(e b r^{2}\right)$. However, according to our former analysis, source nodes are most of the times already bound. For this, the complexity will reduce to reduces to $O(e b r)$. We can also state that certainly $b \leq e$ (as at most we will have one buslet per communication) and $r \leq n$ (as at most we will have one functional unit per node and in the worst case we will have only one resource type). According to this, the complexity for CB will be $O\left(n e^{2}\right)$. Finally, let us say that we want to propagate each binding step at most $S$ best solutions: the complexity becomes $O\left(S n e^{2}\right)$.

### 4.3 Chronological binding with similarity reduction

Chronological binding with similarity reduction (CBSR) is a natural extension of simple CB. When we decide to hold a number of locally "optimal" solution, in fact, there may be the case that we are holding some similar solution. We can say that two solutions are similar if they will bring, at the end, to the same buslet structures. This means that, at the end of our binding process, if we say that solution $S_{1}$ is similar to solution $S_{2}$, then the set of buslets $B_{1} \in S_{1}$ will be exactly the same as the set of buslets $B_{2} \in S_{2}$. More formally, we can say that

$$
\begin{equation*}
\forall b_{i} \in B_{1} \exists b_{j} \in B_{2} \mid \forall r_{k} \in b_{i} \Rightarrow r_{k} \in b_{j} \tag{4.17}
\end{equation*}
$$

We can see an example in Fig. 28. Here we see that Solution A is similar to Solution B as buslet b1 and b2 are inverted and functional units bind exactly the same operation nodes in the


Figure 28: Example two solutions similar differing for buslets labels. In red operation nodes bindings, in black buslets and in yellow functional units.

DFG. However, such a definition seems to be weak for $r_{k}$ : in fact, potentially any $r_{i} \in F U\left[k\left(r_{k}\right)\right]$ can have the same schedule of $r_{k}$ in a different design: it is like swapping labels.

In the example presented by Fig. 29, we not only have buslet labels swapped, but also those for functional units. We are here assuming $F U_{i}$ and $F U_{j}$ are of the same type. We see that the schedules for both are overlapping, as well as the structure of the buslets they are connected to. Here it is not necessary that they are bound exactly to the same operation nodes, provided


Figure 29: Example two solutions similar with different binding for both buslets and functional units. In red operation nodes bindings, in black buslets and in yellow functional units.
that the buslets they are connected to are the same and their busy clock cycles are exactly identical. For this, we should modify statement (4.17) allowing any resource in $b_{j}$ to match $r_{k}$ :

$$
\begin{equation*}
\forall b_{i} \in B_{1} \exists b_{j} \in B_{2}\left|\forall r_{k} \in b_{i} \exists r_{l} \in b_{j}\right| \forall c k \in[0 ; \bar{\lambda}-1] \delta_{c k}\left(r_{k}, c k\right)=\delta_{c k}\left(r_{l}, c k\right) \tag{4.18}
\end{equation*}
$$

In general we can say that, if (4.18) is true, then $S_{1}$ and $S_{2}$ are similar. Let us give, now, an extensive formulation of this.

Let us suppose we need to bind $\operatorname{com}\left(u_{i}, u_{j}\right)$. We know that the set of available FUs to be
mapped to $u_{i}$ is $F U_{\text {avl }}\left(u_{i}\right)$ and for $u_{j}$ is $F U_{\text {avl }}\left(u_{j}\right)$. Moreover, the set of available buslets is $B_{\text {avl }}\left(\operatorname{com}\left(u_{i}, u_{j}\right)\right)$. If we say that

$$
\begin{gather*}
b_{k} \in B_{\text {avl }}\left(\operatorname{com}\left(u_{i}, u_{j}\right)\right)  \tag{4.19}\\
r_{l} \in F U_{\text {avl }}\left(u_{i}\right)  \tag{4.20}\\
r_{m} \in F U_{\text {avl }}\left(u_{j}\right) \tag{4.21}
\end{gather*}
$$

we can say that the quantity of binding possibilities $\left|S\left(\operatorname{com}\left(u_{i}, u_{j}\right)\right)\right|$ is

$$
\begin{equation*}
\left|S\left(\operatorname{com}\left(u_{i}, u_{j}\right)\right)\right|=\sum_{\forall b_{k}} \sum_{\forall r_{l}} \sum_{\forall r_{m}} \delta_{R}\left(r_{l}, b_{k}\right) \cdot \delta_{R}\left(r_{m}, b_{k}\right) \cdot \delta_{B}\left(r_{l}, r_{m}, b_{k}\right) \tag{4.22}
\end{equation*}
$$

where, having

$$
\Delta_{R(\text { fanout })}\left(r_{l}, b_{k}\right)= \begin{cases}1 & \text { if } b_{k} \notin \text { fanout }\left(r_{l}\right)  \tag{4.23}\\ 0 & \text { otherwise }\end{cases}
$$

it is

$$
\delta_{R}\left(r_{l}, b_{k}\right)= \begin{cases}1 & \text { iffanout }\left(r_{l}\right)+\Delta_{R(\text { fanout })}\left(r_{l}, b_{k}\right) \leq \text { max_fanout }  \tag{4.24}\\ 0 & \text { otherwise }\end{cases}
$$

and in the same way we have

$$
\Delta_{R(\text { fanin })}\left(r_{m}, b_{k}\right)= \begin{cases}1 & \text { if } b_{k} \notin \operatorname{fanin}\left(r_{m}\right)  \tag{4.25}\\ 0 & \text { otherwise }\end{cases}
$$

$$
\delta_{R}\left(r_{m}, b_{k}\right)=\left\{\begin{array}{l}
1 \quad \text { iffanin }\left(r_{m}\right)+\Delta_{R(\text { fanin })}\left(r_{m}, b_{k}\right) \leq \text { max_fanin }  \tag{4.26}\\
0 \text { otherwise }
\end{array}\right.
$$

and, saying that

$$
\Delta_{B}\left(r_{l}, r_{m}, b_{k}\right)= \begin{cases}2 & \text { if }\left(r_{l} \wedge r_{m}\right) \notin \operatorname{busletset}\left(b_{k}\right)  \tag{4.27}\\ 1 & \text { if } \exists!\left(r_{l}, r_{m}\right) \notin \operatorname{busletset}\left(b_{k}\right) \\ 0 & \text { otherwise }\end{cases}
$$

we have

$$
\delta_{B}\left(r_{l}, r_{m}, b_{k}\right)= \begin{cases}1 & \text { if }\left|b_{k}\right|+\Delta_{B}\left(r_{l}, r_{m}, b_{k}\right) \leq \text { max_bus_cardinality }  \tag{4.28}\\ 0 & \text { otherwise }\end{cases}
$$

From this, we can see that we may potentially have several binding possibilities, resulting with the same estimated wirelength, which are "similar": for example, assuming that we are just beginning out binding (so, all buslets are empty and all FUs available), if we need to bind $\operatorname{com}\left(u_{i}, u_{j}\right)$, which one may be the difference between binding it to buslet $b_{k}$ or $b_{k+1}$, as both of them are empty?

If we desire to save more than one binding possibility for every step, as those "similar" solutions will have exactly the same wirelength, it may be necessary to implement a similar solutions detector, in order to prune all solutions leading to same final solutions.

In order to find solution $S^{*}$ is similar to $S^{r e f}$, we need to look at the following issues:

1. Each buslet $b^{*} \in S^{*}$ needs to have a corresponding buslet $b^{r e f} \in S^{r e f}$ such that

$$
\begin{equation*}
\left|b^{*}\right|=\left|b^{r e f}\right| \tag{4.29}
\end{equation*}
$$

2. Each buslet $b^{*} \in S^{*}$ needs to have a corresponding buslet $b^{r e f} \in S^{r e f}$ so that, for each clock cycle $m$, defining, for the generic resource ${ }^{1} \mathrm{r}$

$$
\delta_{c k}(r, m)= \begin{cases}0 & \text { if ridling in } m  \tag{4.30}\\ 1 & \text { otherwise }\end{cases}
$$

we need to have

$$
\begin{equation*}
\delta_{c k}\left(b^{*}, m\right)=\delta_{c k}\left(b^{r e f}, m\right) \tag{4.31}
\end{equation*}
$$

$\forall m$ value.
3. Then, assuming $b_{k}^{r e f}$ has been found similar to $b_{k}^{*}$, $\forall$ functional unit $F U_{i}^{r e f} \in b_{k}^{r e f}$, we need to find a functional unit $F U_{i}^{*} \in b_{k}^{*}$ such that
(a) Has same busy clock cycles and, in particular, $\forall m$

$$
\begin{equation*}
\delta_{c k}\left(F U_{i}^{r e f}, m\right)=\delta_{c k}\left(F U_{i}^{*}, m\right) \tag{4.32}
\end{equation*}
$$

[^9](b) Has same fanin and fanout:
\[

\left\{$$
\begin{array}{l}
\operatorname{fanin}\left(F U_{i}^{r e f}\right)=\operatorname{fanin}\left(F U_{i}^{*}\right)  \tag{4.33}\\
\operatorname{fanout}\left(F U_{i}^{r e f}\right)=\operatorname{fanout}\left(F U_{i}^{*}\right)
\end{array}
$$\right.
\]

### 4.3.1 Justification of similarity-based pruning

Let us imagine two solutions $S_{1}$ and $S_{2}$ are similar at step $l-1$ but, at step $l$, while binding $\operatorname{com}\left(u_{i}, u_{j}\right), S_{1}$ will have, among all its binding possibilities, a solution with a better best wirelength than $S_{2}$. We can have this when we need to add $F U_{i}$ to a buslet $b_{k}$ in $S_{2}$ while we do not do the same in $S_{1}$. This happens if:

1. $\exists b_{1} \in S_{1}, b_{2} \in S_{2}| | F U_{\text {avl }}\left(u_{i}\right) \in b_{1}\left|=\left|F U_{\text {avl }}\left(u_{i}\right) \in b_{2}\right|\right.$

If we have not the same number of same type FUs available connected to the same buslet, then we may need to add one more of such a FU to the buslet, resulting in a cardinality increment and, consequently, in a wirelength increase. However, this goes against condition (4.32).
2. $\exists F U_{1} \in S_{1}, F U_{2} \in S_{2} \mid k\left(F U_{1}\right)=k\left(F U_{2}\right)$,
$\left|b_{\text {avl }}\left[f i n\left(F U_{1}\right)\right]\right|=\left|b_{\text {avl }}\left[\operatorname{fin}\left(F U_{2}\right)\right]\right|$
If we need to use a FU as source and the available different fanin buslets are different, then we need to add one more buslet to $\operatorname{fanin}\left(F U_{2}\right)$. However, as we have ensured (4.33) for all FUs linked to buslets having (4.29), this can not happen.
3. $\nexists F U_{1} \in S_{1}, F U_{2} \in S_{2} \mid k\left(F U_{1}\right)=k\left(F U_{2}\right)$,
$\mid b_{\text {avl }}\left[\right.$ fout $\left.\left(F U_{1}\right)\right]|=| b_{\text {avl }}\left[\right.$ fout $\left.\left(F U_{2}\right)\right] \mid$
Same as previous point.

```
Algorithm 6 Similarity detector algorithm
    procedure Similar_Solns_Cutter(Input_solns) \(\triangleright\) Here I take as input the set of solutions
    to verify
        Output_solns = \(\emptyset\)
        for all \(S_{i} \in\) Input_solns do
            similar \(=\) false
            for all \(S_{j} \in\) Output_solns do
                if all buslets in \(S_{i}\) have a similar in \(S_{j}\) then
                    if all FUs in \(S_{i}\) have a similar in \(S_{j}\) then
                    similar \(=\) true
            if similar \(=\) false then push \(\left(\right.\) Output_solns, \(\left.S_{i}\right)\)
        return (Output_solns)
```


### 4.3.2 Complexity

Similarity reduction will certainly improve the quality of solutions. However, it may introduce some overhead in the computation of the final solution. It will concur, at the end of each binding step, in the formation of the set of the Y partial solutions to be propagated in the next binding steps. For this, for each binding step $(O(e))$ we will analyze the Y top ranked (according to our driving metrics which, in this case, is wirelength) solutions $(O(Y))$. For each of these, we will try to match buslets $(O(e))$ and functional units $(O(n))$. Let us say $s_{n} \in Y$.
$y_{0}$ will be matched with $Y-1$ other solutions, $y_{1}$ with $Y-2$ and so on. With this, we have a number of matchings to be tested which is $O\left(Y^{2}\right)^{1}$. So, the total complexity added from simularity reduction will be $O\left(Y^{2} n e\right)$. With this, we will have that total complexity for SRCB will be $O(Y n e(e+Y))$.

### 4.4 Chronological lookahead binding

A possible development of chronological binding algorithm is to evaluate the fouture implications for a given binding solution chosen. In fact, if we perform choices in function of the current design structure without taking care of future effects of it, then we may easily fall in suboptimal solutions.

Let us assume we are at binding step $i-1$ and our current solution is $S_{i-1}$ (with this we mean the set of both functional units and buslets already bound). We know that each communication
${ }^{1}$ We know that

$$
\sum_{i=0}^{N-1} i=\frac{N(N-1)}{2}
$$

can be bound to a buslet only, as well as to a source FU and a destination one. Let us say that such a quatern of values is identified by $q_{k} \in Q_{i}{ }^{12}$. We see that

$$
\begin{equation*}
S_{i}=S_{i-1}+q_{b e s t} \tag{4.35}
\end{equation*}
$$

where $q_{b e s t} \in Q_{i}$ is the quatern of values which will make the increase of total wirelength the lowest. Each $q_{k} \in Q_{i}$ may significantly modify the sets $Q_{i+f}$ with $f$ positive integer. Let us assume our $q_{b e s t}$ involved the binding of $\operatorname{com}\left(u_{h}, u_{j}\right)$ to buslet $b_{k}$ and $u_{j}$ to $r_{m}$. When we will bind its successor communication $\operatorname{com}\left(u_{j}, u_{l}\right)$, we will already have $u_{j}$ bound to $r_{m}$ and, if no a connection is available between $r_{m}$ and any functional unit in $F U_{a v l}\left(u_{l}\right)$, then extra wiring cost will be needed. However, if we will be able to detect this in advance, then we may be able to save extra wirelength.

So, let us say we need to compute the partial solution at the i-th step allowing the inspection of the next $k$ levels. Our partial solution will be

$$
\begin{equation*}
S_{i}=S_{i-1}+q_{\text {best }}(i+k-1) \tag{4.36}
\end{equation*}
$$

${ }^{1}$ In particular, we can extensively write

$$
\begin{equation*}
q_{k}=\left(\operatorname{com}_{i}, b_{j}, r_{l}, r_{m}\right) \tag{4.34}
\end{equation*}
$$

where $\operatorname{com}_{i}$ is the bound communication, $b_{j}$ is the buslet $c o m_{i}$ has been bound to, $r_{l}$ is the source FU while $r_{m}$ is the destination one.

[^10]where $q_{b e s t}(i+k-1)$ is the quatern producing, to the $i+k-1$ binding step, the best wirelength.

```
Algorithm 7 Chronological binding with lookahead
    procedure Chronological_BInding_with_Lookahead(sched(NODES, COMS))
        sort_coms_per_ascending_sched_time() \(\triangleright\) From lowest to highest scheduling time
        root_soln \(=\) initialize_binding_soln()
        for all com \(_{i} \in\) coms_to_bind do
            best_LA_wl \(=\infty\)
            for all available buslets \(b_{k}\), source \(F U_{m}\), destination \(F U_{n}\) do \(\triangleright\) Any combination of
    these
            local \(=\operatorname{bind}\left(S_{j}, \operatorname{com}_{i}, b_{k}, F U_{m}, F U_{n}\right) \triangleright\) Here we will bind the quatern \(\left(c o m_{i}, b_{k}\right.\),
    \(\left.F U_{m}, F U_{n}\right)\) to \(S_{j}\)
            LA_wirelen \(=\) wl_lookahead(local, com \(_{i}\), coms_to_bind, LA_level) \(\quad \triangleright\) We call
    lookahead function
            if LA_wirelen < best_LA_wl then \(\triangleright\) This was implemented holding 1 solution only
                    best_local_soln = local
                best_LA_wl = LA_wirelen
        root_soln \(=\) best_local_soln
        return (root_soln)
```


### 4.4.1 Complexity

We already know that usual chronological binding will take $O\left(n e^{2}\right)$. However, here we need to take into account that, when we are performing the lookahead search, we will try any possible combination between any buslet $(O(e)$ ), any source $(O(n))$ and any destination $(O(n))$. All of this will be performed for the next, let us say, $k$ steps, each binding step.

The overhead introduced by lookahead technique is, this way, $O\left(\mathrm{ken}^{2}\right)$, bringing the total algorithm complexity to $O\left(k n^{3} e^{3}\right)$.

```
Algorithm 8 Wirelength lookahead
    procedure WL_LOOKAHEAD(binding_soln, last_bound_com, coms_to_bind, LA_level) \(\triangleright\) This
    performs the lookahead and returns the best wirelength after k binding steps.
        if LA_level \(=0\) then
            return wirelength(binding_soln) \(\triangleright\) This function will compute wirelength for a given
    solution
        current_com \(=\) next_com(last_bound_com, coms_to_bind) \(\triangleright\) We compute the
    chronologically next com of last_bound_com
        best_LA_wl \(=\infty\)
        for all available buslets \(b_{k}\), source \(F U_{m}\), destination \(F U_{n}\) do \(\triangleright\) any combination of these
            local \(=\operatorname{bind}\left(S_{j}\right.\), com_to_bind, \(\left.b_{k}, F U_{m}, F U_{n}\right)\)
            LA_wirelen \(=\) wl_lookahead(local, current_com, coms_to_bind, LA_level-1) \(\triangleright\)
    Recursive call: we decrease LA level and we pass the new bound com and the new partial
    solution (local)
            if LA_wirelen < best_LA_wl then
                best_LA_wl = LA_wirelen
        return (best_LA_wl) \(\triangleright\) LA function will return just the best wirelength value found
```


### 4.5 Simultaneous binding of iso-scheduled COMs

According to the algorithm developed so far, what we are going to do in our design is binding the best solution(s) moving from one communication to another. Communications are sorted in ascending scheduling time order; so, we are binding communications according to this. However, what shall we expect to do if more than one communication has the same scheduling time? How to sort them? Can this lead us to suboptimal solutions?

If we assume a generic communication latency $\lambda_{c o m}>0$, for any nodes in the DFG $u_{i}, u_{j}, u_{k}$, $u_{l}$, if we have $\left|t\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right]-t\left[\operatorname{com}\left(u_{k}, u_{l}\right)\right]\right|<\lambda_{\text {com }}$, then the two communications can not be scheduled to the same buslet. However, depending on the order they are bound, we may have different effects on the resulting output.


Figure 30: Scheduling example in order to figure out why simultaneous binding matters. Here we are assuming latency for both FUs and communications 1.


Figure 31: Binding obtained by using CB (on the left, (a)) or simultaneous binding of iso-scheduled communications (on the right, (b)).

Input scheduling is taken by Fig. 30.

Let us assume we have, as input, the scheduling in Fig. 30. Further, let us also assume we have a maximum cardinality constraint of 3 . If we use Chronological binding technique, when we will bind $\operatorname{com}(\mathrm{B}, \mathrm{A})$, we will reuse the already existing interconnection between these FUs (in Fig. 31a, it is the black wire). However, in order to bind com(D, A), as we can not use the black wire because already busy in clock cycle 4 and we can not use the blue wire because we would violate the buslet maximum cardinality constraint, we need to add another buslet (the red one). However, if we analyze simultaneously scheduled communications, we can be able to find better configurations, satisfying the constraints and saving on the number of buslets (and, in this way, increasing the wire efficiency). This is pictured in Fig. 31b. Here we have bound $\operatorname{com}(\mathrm{B}, \mathrm{A})$ to the blue buslet, connecting A to it ans $\operatorname{com}(\mathrm{D}, \mathrm{A})$ to the black one, adding D to it.

Hence, for this simple example we can formalize as follows: if we have that

$$
\begin{equation*}
\forall \operatorname{com}\left(u_{i}, u_{j}\right) \in D F G \exists \operatorname{com}\left(u_{k}, u_{l}\right) \in D F G \mid t\left[\operatorname{com}\left(u_{k}, u_{l}\right)\right]<t\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right]+\lambda_{\operatorname{com}} \tag{4.37}
\end{equation*}
$$

then we have to take into account all the edges being bound in any possible order (this means that the complexity will be $e!$ ). However, in order this to happen, some conditions need to be satisfied. We can primarily say that, if we have

$$
\begin{equation*}
\forall u_{i} \in D F G \exists k\left(u_{i}\right) \mid \lambda\left[k\left(u_{i}\right)\right] \geq \lambda_{\text {com }} \tag{4.38}
\end{equation*}
$$



Figure 32: Example of domino effects when considering overlapping communications. Here we are assuming $\lambda_{\text {com }}=2$.

Furthermore, we usually have

$$
\begin{equation*}
\lambda_{\text {com }} \leq \lambda\left[t\left(u_{i}\right)\right] \forall u_{i} \in D F G \tag{4.39}
\end{equation*}
$$

Let us assume (4.39) is verified. At such a point, the maximum number of communications to be taken into account will be the maximum number of overlapping communications, which is an output of FDCS. However, for this, we have a domino effect which may let the number of communications to be bound huge.

Let us analyze the case proposed in Fig. 32. Here we want to bind com1. In order to do it, assuming that we have $\lambda_{\text {com }}=2$, we want to respect (4.37). For this, we need to bind together all the communications within the red circle. However, we see that the same condition applies to all the other communications, generating the overmentioned domino effect. In our
computation case, we will assume $\lambda_{\text {com }}=1$, which is a realistic parameter which, more, will satisfy (4.39). In this case, thanks to FDCS, we will have that the maximum number of overlapping communications $\phi$ will be

$$
\begin{equation*}
\phi=\left\lceil\frac{\max \left\{t\left[\operatorname{com}\left(u_{i}, u_{j}\right)\right]\right\}-\min \left\{t\left[\operatorname{com}\left(u_{k}, u_{l}\right)\right]\right\}}{e}\right\rceil \tag{4.40}
\end{equation*}
$$

$\forall u_{i}, u_{j}, u_{k}, u_{l} \in D F G$. Hence, as this is a special case, the overmentioned domino effect does not take place. Usually such a number is very low and, thanks to this, computing all the dispositions between these will not bring to any significant overhead. An example of graph with 3 overlapping communications can be found in 33 .

### 4.5.1 Improving the average computation time

We define binding level the group of binding solutions in which the same number of communication has already been bound.

Let us suppose we have $\phi$ overlapping communications. For the first communication to be bound, we have $\phi$ possibilities. In each of them there will still be $\phi-1$ communications to be bound; so, at the second binding level, we will have $\phi \cdot(\phi-1)$ possibilities. If we step to the last level which has to be the $\phi^{\text {th }}$ as we need to bind all $\phi$ communications, we will have $\phi \cdot(\phi-1) \cdot \ldots \cdot 1$ possibilities, which is $\phi$ !.

What we want to do now is trying to reduce this number of possibilities to be computed as much as possible. In order to do this, we can use the detection of similar patterns heuristics (Sec. 4.3), applied at each binding level: in this way, we are able to prune all similar solutions.


Figure 33: Analysis of possibilities for 3 overlapping communications

At this point, however, if we find solutions moving per binding levels (e.g. we first find solutions for first binding level, then for second...) we will not be able to erase any similar pattern, as we need to arrive at the last level in order to correctly explore solution space ${ }^{1}$. For this, in order to make us able to prune some solutions, we may apply here Depth-First Search algorithm: as we have a final result, we may be able to prune branches in the "possibilities tree" we have. Here a question comes naturally out: is this really necessary? Can we go in depth just for one

[^11]solution and then, applying other algorithms, like Breadth-first search, prune all suboptimal solutions (referring to the first initially found)?

If we act this way, we have the disadvantage of having just one reference solution (until we do not reach leaves in our solutions graph): if we use DFS algorithm, we are constantly trying to reach leaves in our solution graph, having best solution been updated quicker than in BFS. In particular, here we are using a "left hand" algorithm.

The pseudocode is divided into three different procedures.

Simultaneous_COM_binding_caller will just find out which will be the next communications to be bound and it will pass both the current solution and the communications to be bound to Simultaneous_COM_binder, which will produce the best solution (in terms of wirelength).

```
Algorithm 9 Simultaneous com binding caller
    procedure Simultaneous_COM_BINDING_CALLER
        root_soln \(=\) initialize_binding_soln()
        while not all coms bound do \(\quad \triangleright\) While there still is a com not bound
            coms_to_bind \(=\) lowest_sched_time_unbound_coms (coms) \(\quad \triangleright\) Look for the set of
    unbound coms with the lowest scheduling time
        root_soln \(=\) Simultaneous_COM_binder(coms_to_bind, root_soln) \(\triangleright\) Call
    Simultaneous_COM_binder: it will return the best solution binding all the coms_to_bind
        return (root_soln)
```

Simultaneous_COM_binder is really the heart of the algorithm: it will analyze all the possible solutions binding communications in any possible order and will choose the best. It is a recursive
function. This also calls solution_binder which, taking as input a set of binding solutions and the com to be bound, it will produce the set of all possible binding solutions for the given inputs.

```
Algorithm 10 Simultaneous COM binder
    procedure Simultaneous_COM_BINDER(coms_to_bind, binding_solns)
        if coms_to_bind \(=\emptyset\) then \(\quad \triangleright\) Recursive function: if no more coms to be bound, then
    return just the best among the passed binding solutions
            return (best(binding_solns)) \(\triangleright\) best=lowest wirelength
        for all com \(_{i} \in\) coms_to_bind do
            intermediate_bindings \(=\) solution_binder \(\left(\operatorname{com}_{i}\right.\), binding_solns \()\)
            coms_list_next_level \(=\) coms_to_bind \(-\operatorname{com}_{i} \quad \triangleright\) coms_list_next_level contains
    coms_to_bind excluding com \(_{i}\)
            binding_solns_harvest \(=\) Simultaneous_COM_binder(coms_list_next_level, intermedi-
    ate_bindings) \(\triangleright\) Recursive call: it will be passed all the computed solutions to this moment
    and the remaining coms to be bound
            best_binding_soln \(=\) best(binding_solns_harvest, best_binding_soln) \(\quad \triangleright\) best=lowest
    wirelength
        return (best_binding_soln)
```


### 4.5.2 Complexity

This algorithm adds an overhead to standard chronological binding $\left(O\left(n e^{2}\right)\right)$. Such an overhead is given by the computation of all possible solutions for overlapping communications. If we say that the maximum number of overlapping communications is $\phi$, usually much lower than $e$, and that $\gamma=2^{\phi}(\gamma$, in this case, represents the average computations for all the

```
Algorithm 11 Solution binder
    procedure Solution_binder(com_to_bind, binding_solns) \(\triangleright\) Inputs
    are scheduling times (both coms and nodes), max buslet cardinality, max fanin and fanout
    for FUs and how many solutions to propagate each binding step. Finally, we can choose to
    use similar solution pruning enabling the proper flag.
        for all \(S_{j} \in\) binding_solns do
            for all available buslets \(b_{k}\), source \(F U_{m}\), destination \(F U_{n}\) do \(\triangleright\) any combination of
    these
            local \(=\operatorname{bind}\left(S_{j}\right.\), com_to_bind, \(\left.b_{k}, F U_{m}, F U_{n}\right) \quad \triangleright\) We perform binding, starting
    from solution \(S_{j}\), for the quatern \(\left(C O M_{i}, B_{k}, F U_{m}, F U_{n}\right)\)
            push(Local_solns, local) \(\triangleright\) Add local to Local_solns
        return (Local_solns)
```

combinations, having first a reference solution to be compared with intermediates). The final complexity will be, this way, $O\left(\gamma n e^{2}\right)$.

### 4.6 Force directed binding

Here what we aim to present is an alternative to chronological binding. In fact, as chronological binding has evident benefits, unfortunately, shows also the non negligible drawback of having a too local view of the whole dataflow graph. If we focus just on binding a single communication per time (or, by extension, to a restricted set of them, considering overlapping communications or also taking into account lookahead mechanisms), we will not be able to take into account the whole DFG structure, seeing which will be the effect of our binding on the whole final solution. For this, it may be reasonable to use a force directed-like approach also for binding, with the aim of minimizing final wirelength.

Let us define a solution possibility $q$ as a combination of four different elements:

- Communication $\operatorname{com}\left(u_{i}, u_{j}\right)$ to be bound
- Buslet $b_{k}$ to which $\operatorname{com}\left(u_{i}, u_{j}\right)$ will be bound
- Functional unit $r_{l}$ to which $u_{i}$ will be bound
- Functional unit $r_{m}$ to which $u_{j}$ will be bound

If, at step $n$ during execution of our algorithm, we have obtained a binding solution $B S_{i-1}$, the new binding solution $B S_{i}$ will be obtained applying the chosen solution $q_{i}$ to $B S_{i-1}$ :

$$
\begin{equation*}
B S_{i}=B S_{i-1}+q_{i} \tag{4.41}
\end{equation*}
$$

Now comes naturally the question: how to choose $q_{i}$, if we are not moving chronologically anymore?

### 4.6.1 Formulation of probability for a given solution

As we aim to minimize the overall wirelength of our design and our wirelength model deals with buslet cardinality (in particular, wirelength is directly proportional to current buslet cardinality), it is very likely for us to try to minimize it. It may sounds natural to elaborate a force using the probability of binding a given solution $q_{k}$ is inversely proportional to the cardinality $b_{k}$ will have after binding $F U_{i}$ and $F U_{j}$ to it.

According to the definition of probability, in order to compute the probability for a particular $q_{k}$, we need to have the future buslet cardinalities for any solution binding the communication
$\operatorname{com}\left(u_{i}, u_{j}\right)$ taken into exam.
We can model, this way, the probability of binding the possibility $q\left(c o m_{i}, b_{k}, r_{l}, r_{m}\right)$ as

$$
\begin{equation*}
p\left[q\left(\operatorname{com}_{i}, b_{k}, r_{l}, r_{m}\right)\right]=\frac{\frac{1}{1+\Delta W L\left(S_{n-1}, q\left(\operatorname{com}_{i}, b_{k}, r_{l}, r_{m}\right)\right)}}{\sum_{\text {all } q_{s} \text { for } \operatorname{com}_{i}} p\left(q_{s}\right)} \tag{4.42}
\end{equation*}
$$

where $\Delta W L\left(S_{n-1}, q\left(c o m_{i}, b_{k}, F U_{m}, F U_{n}\right)\right)$ is the difference between the wirelength for $S_{n}=$ $S_{n-1}+P_{r}$ and wirelength for $S_{n-1}$.

### 4.6.2 Formulation of weight function for a given solution

What we aim here to formulate is a factor determining a sort of distribution graph in which we may aim to minimize peaks. However, it is not a straightforward issue to determine which one could be the parameter we may apply such an issue.

The first idea we may have is to directly plug whole solution's wirelength into it. In fact, the highest such a parameter is for a given solution, the least we are likely to choose it. So, as first instance, we may decide to define weight function in our FDB algorithm as

$$
w\left(q_{k}\right)=\Delta W L\left(S_{n-1}, q_{k}\right)
$$

With these, we will compute the self force as

$$
\begin{equation*}
S F\left(q_{k}\right)=\left(1-p\left(q_{k}\right)\right) * w\left(q_{k}\right)-\sum_{\forall q_{s} \neq q_{k} \text { with comi }} p\left(q_{s}\right) * w\left(q_{s}\right) \tag{4.43}
\end{equation*}
$$

After computing all the self forces, it will be chosen the binding solution with the lowest one.

### 4.6.3 Predecessor/successor forces

According to what said previously for the force directed-like approaches (FDS, our FDCS), after we take a decision we will have an effect on predecessors as well as on successors. Does such an issue apply also on force directed binding?

Let us assume we want to bind the quatern described in $q_{k}$ which was defined in (4.34). Then, we will have, on all the other binding possibilities, the following effects:

1. $b_{j} \in q_{k}$ will not be anymore available $\forall c o m_{h}| | t\left(c o m_{h}\right)-t\left(c o m_{i}\right) \mid<\lambda_{c o m}$
2. $r_{l}$ will not be available $\forall u_{h}\left|k\left(u_{k}\right)=k\left(u_{i}\right),\left|t\left(u_{h}\right)-t\left(u_{i}\right)\right|<\lambda\left(k\left(u_{i}\right)\right)\right.$
3. $r_{m}$ will not be available $\forall u_{h}\left|k\left(u_{k}\right)=k\left(u_{j}\right),\left|t\left(u_{h}\right)-t\left(u_{j}\right)\right|<\lambda\left(k\left(u_{j}\right)\right)\right.$
4. $b_{j}$ may increase its cardinality
5. $r_{l}$ and $r_{m}$ may increase their own fanin/fanout

Let us analyze all these effects. For the first effect, as we can not have any conflict on the buslet, we have a direct effect on the forces for all the communications scheduled at the same time the currently bound communication is. About the availability of the the functional units, we will certainly have an effect on all the nodes in the DFG having a scheduling time which overlaps the currently marked as busy clock cycles for the current functional units. This will certainly affect the force for all the communications between nodes having such a type. These are the effects not taking into account the constrains on the buslet cardinality and on functional unit fanin/fanout.

Analyzing the buslet variation of cardinality, we will have an effect on the force for all the
communications which can be bound to buslet $b_{j}$. Unfortunately, this means that we will have an effect on all the forces for our DFG and, for this, we need to analyze all of them. Further, the effect on fanin/fanout will affect all the operation nodes of the same type as the currently bound ones through the whole design. Synthesizing, we need to analyze the variation for all the forces in the design.

### 4.6.4 Efficacy of this algorithm

Such an algorithm will show a very peculiar behavior. For its formulation, it will take a communication between two operation nodes, bind it and, then, keep on binding other communications between the same kind of functional units as much as possible. So we can say that, after binding the quatern $q_{k}$, it will bind all the other quaterns $q_{h}$ such that

- It will not create conflicts on the buslet $b_{j} \in q_{k}$.
- Functional units $r_{l}$ and $r_{m}$ will not be busy.

This happens because the distribution graph will make the force dropping for these (as it will use an already available structure, the value for it is zero). According to this, any effect on predecessors/successors appears being unuseful (as all this contribution is ignored from the distribution graph structure). However, this may lead to suboptimalities. A similar example was already discussed in Sec. 4.5. In that case, the self force to bind $\operatorname{com}(\mathrm{B}, \mathrm{A})$ will be the lowest as an interconnection is already present. This, as seen, will drive the solution to be suboptimal.

For this, we expect such a technique not to produce the best absolute result but, as it globally
looks at all the communications, not producing the worst results either.
Down here it is presented a very synthetic version of FDB which, however, illustrates all main issues to be tackled.

```
Algorithm 12 FDB algorithm
    procedure Force_Directed_Binding(sched(COM, NODES))
        root_soln \(=\) initialize_binding_soln()
        while not all coms bound do
            prev_wl \(=\) wirelength (root_soln) \(\quad \triangleright\) Wirelength before next binding step
            for com \(_{i} \in\) coms_to_bind do
                for all available buslets \(b_{k}\), source \(F U_{m}\), destination \(F U_{n}\) do \(\triangleright\) Any combination
    of these
                wl_after \(\left[\right.\) com \(\left._{i}\right]\left[b_{k}\right]\left[F U_{m}\right]\left[F U_{n}\right]=\) wirelength(bind(root_soln, \(\operatorname{com}_{i}, b_{k}, \quad F U_{m}\),
    \(\left.\left.F U_{n}\right)\right) \triangleright\) With this we can identify a 4D variable. The same will apply to weight, probability
    and SF
                weight \(\left[\mathrm{com}_{i}\right]\left[b_{k}\right]\left[F U_{m}\right]\left[F U_{n}\right]=\) wl_after \(\left[\mathrm{com}_{i}\right]\left[b_{k}\right]\left[F U_{m}\right]\left[F U_{n}\right]\) - prev_wl
                for all available buslets \(b_{k}\), source \(F U_{m}\), destination \(F U_{n}\) do
                    compute probability \(\left[\mathrm{com}_{i}\right]\left[b_{k}\right]\left[F U_{m}\right]\left[F U_{n}\right]\)
                            \(\triangleright\) See 4.42
                for all available buslets \(b_{k}\), source \(F U_{m}\), destination \(F U_{n}\) do
                    compute \(\mathrm{SF}\left[\mathrm{com}_{i}\right]\left[b_{k}\right]\left[F U_{m}\right]\left[F U_{n}\right]\)
                            \(\triangleright\) See 4.43
            lowest_SF \(=\infty\)
            for com \(_{i} \in\) coms_to_bind do
                for all available buslets \(b_{k}\), source \(F U_{m}\), destination \(F U_{n}\) do
                    if SF \(\left[c o m_{i}\right]\left[b_{k}\right]\left[F U_{m}\right]\left[F U_{n}\right]<\) lowest_SF then \(\quad \triangleright\) Look for the lowest SF
                    lowest_SF \(=\mathrm{SF}\left[\mathrm{com}_{i}\right]\left[b_{k}\right]\left[F U_{m}\right]\left[F U_{n}\right]\)
                    best_bind \(=\left(c o m_{i}, b_{k}, F U_{m}, F U_{n}\right)\)
            root_soln \(=\) bind(root_soln, best_bind)
        return (root_soln)
```


### 4.6.5 Complexity

Here the complexity is comparable to that already computed for FDS algorithm, as the basic technique is the same. However, in FDS the solution space to be explored was $n$ only. Here it will be $O\left(n^{2} e^{2}\right)$ (complexity coming from the quatern of values composing each binding possibility). For this, the total complexity of such a technique without predecessors and successors will be $O\left(n^{4} e^{4}\right)$. However, if we want to also add predecessors and successors contribution, the total expected complexity will be $O\left(n^{6} e^{6}\right)$.

## CHAPTER 5

## BUSLET POWER MODELING

Once we perform scheduling and binding, we will have, as output, a bench of functional units and interconnections between them. Furthermore, we also have the correct sequence for both data transmissions (performed using buslets) and data computation (performed in the functional units). Having these data, commonly outputted after the high level synthesis step in VLSI CAD (Sec. 3.1), we need to go through the rest of the CAD flow. Let us assume we will jump straight to physical design (floorplanning). In order to estimate the power consumed using buslets and to compare it with dedicated interconnections, some modeling is necessary. Indeed, we can use some technique to minimize all the metrics involved in wiring power consumption. All the models here presented assume floorplanning stage has already been performed: for this, all the FUs are already placed on chip.

### 5.1 Metrics involved in buslet power consumption

Let us assume a generic buslet $b_{i}$ connects $\left|b_{i}\right|$ functional units (this is its cardinality). Further, let us assume that the two functional units $F U_{j}, F U_{k} \in b_{i}$. Whenever a communication between these two functional units has to be performed, two main factors contribute to the power consumption (in this case, this is dynamic power):

1. Wire length to be driven. The wire has its own capacitance and it has to be driven. From this, we can say that, the lowest this is, the lowest this power contribution is. An idea could be "slicing" the circuit such that just the minimum wirelength will be driven.
2. Input buffer capacitance. Each FU will have, as input, a buffer. This has a non negligible input capacitance to be driven. Furthermore, if we wish to insert some buffers in the middle of the circuit, this contribution depends on the number of buffers met. This goes against the statement evidenced in the previous point and, for this, some trade-off is necessary.

Furthermore, we have also a static power contribution, coming from the buffers in the design, due to leakage current. All of these factors need to be taken into account while designing the buslet structure itself.

### 5.2 Minimum spanning tree

The first approach we can try to use is creating a minimum spanning tree (MST), connecting all the functional units. Let us build a completely connected ${ }^{1}$ undirected $^{2}$ graph G, having as nodes all the functional units connected to the same buslet $b_{j}$. Let us associate to each edge $e\left(F U_{i}, F U_{j}\right)$ connecting $F U_{i}, F U_{j} \in G$ the weight $w\left[e\left(F U_{i}, F U_{j}\right)\right]$, which will be the Manhattan distance between the two functional units. Now, we will use an algorithm allowing us to have

[^12]a connected graph ${ }^{1}$ with the least total minimum wirelength. We define it being a minimum spanning tree. Here follows a quick description of two standard algorithms used to solve this problem.

### 5.2.1 Prim's algorithm

This algorithm was designed by Robert Prim in 1957[23]. Prim's algorithm uses a greedy approach that computes an MST in a weighted undirected graph. So, it will look for all the edges which will create a tree including each and every node, in which the total of the edges' weights is the minimum. The algorithm builds such a tree one node at a time, from an arbitrary chosen one, and each step it will add the cheapest possible connection from such a built the tree to any other non connected node.

The algorithm, however, may be modified to start with any particular vertex. The total complexity of this algorithm is $O\left(n^{2}\right)$.

### 5.2.2 Kruskal's algorithm

This algorithm was designed by Joseph Kruskal 1956[24]. Kruskal's algorithm is another minimum-spanning-tree algorithm.

It will work finding, each step, the cheapest edge which will connect any two trees in the forest. Also this uses a greedy approach and it will compute the MST starting from a connected graph. Hence, it will find the edges, which will form a tree, which will include all the nodes, minimizing the total of the edges' weights.

[^13]```
Algorithm 13 Prim algorithm
    procedure \(\operatorname{Prim}(\mathrm{G}) \quad \triangleright \mathrm{G}\) is the input graph, containing both nodes and edges
        \(Q=\emptyset\)
        \(F=\emptyset \quad \triangleright \mathrm{F}\) will be our minimum spanning tree
        for each node \(u_{i} \in G\) do
            \(C\left[u_{i}\right]=\min _{\forall u_{j} \in P\left(u_{i}\right)}\left\{w\left[e\left(u_{i}, u_{j}\right)\right]\right\}\)
            \(Q=Q+u_{i} \quad \triangleright\) When initialized \(Q\) contains all the nodes in \(G\)
        while \(Q\) not empty do
            \(u_{\text {best }}=\operatorname{first}(Q)\)
            for all \(u_{i} \in Q\) do
                if \(C\left[u_{i}\right]<u_{\text {best }}\) then
                    \(u_{\text {best }}=u_{i} \quad \triangleright\) Here we will select the edge with the lowest cost in Q
            \(F=F+u_{\text {best }}\)
            \(Q=Q-u_{\text {best }} \quad \triangleright\) Here we will exclude \(u_{\text {best }}\) from Q
        return \((\mathrm{F}) \quad \triangleright\) best \(\quad \triangleright\) We return the MST
```

The complexity of this algorithm is $O(e \log (e))$ : this means that this algorithm is better than Prim's if the number of arcs from the starting graph is not high. Hence, if the graph is fully connected, Prim runs in lower time.

### 5.3 Standard buffer placement

According to the common structure for any bus, through the line we will not have any buffer. The signal, in this way, will be propagated through the whole wirelength of the buslet, reaching all the functional units connected to it. Of course, we will minimize their number according to the fact that a given FU will need to read or to write only data from/to the buslet. This is completely deterministic as it is a known data from any of our HLS algorithms. For example, if a given functional unit $F U_{j} \in b_{j}$ will drive the buslet never reading it, we will insert the output buffer only. This will improve both static and dynamic power, reducing also signal delay. We

```
Algorithm 14 Kruskal algorithm
    procedure \(\operatorname{KruskaL}(\mathrm{G}) \quad \triangleright \mathrm{G}\) is the input graph, containing both nodes and edges
        \(Q=\emptyset\)
        \(F=\emptyset\)
        for each edge \(e_{i} \in G\) do
            \(Q=Q+e_{i}\)
        Sort_ascending \((\mathrm{Q})\)
                            \(\triangleright\) When initialized \(Q\) contains all the edges in \(G\)
                                \(\triangleright\) Sort all the edges in weight ascending order
        while \(Q\) not empty do
            \(e_{\text {best }}=\operatorname{first}(Q)\)
            \(F=F+u_{i}+u_{j}\)
            for all \(e_{i} \in Q\) do
                    if \(u_{i} \in F\) and \(u_{j} \in F\) then
                        \(Q=Q-e_{i}\)
        return \((\mathrm{F})\)


Figure 34: Example of standard buffer placement model for buslet of maximum cardinality 8 . FUs are the squares, diamonds are bidirectional buffers (connected as shown in the legend) and blue nodes are steiner points.
have benefits on the dynamic power because we will decrease the total buslet capacitance to be driven. We know that, for estimating dynamic power, we have two contributions. In this case, the contribution from the wirelength \(C_{w}\) is fixed for any communication happening and it is
\[
\begin{equation*}
C_{w}\left(b_{j}\right)=W L\left(b_{j}\right) \cdot C_{F / m} \tag{5.1}
\end{equation*}
\]
where \(W L\left(b_{j}\right)\) is the total wirelength for \(b_{j}\) and \(C_{F / m}\) is characteristic line capacitance per length unit. The contribution from the buffers, instead, is
\[
\begin{equation*}
C_{t r i}\left(b_{j}\right)=C_{t r i-i n} \sum_{\forall F U_{i} \in b_{j}} \delta_{\text {fanin }}\left(F U_{i}, b_{j}\right) \tag{5.2}
\end{equation*}
\]
where
\[
\delta_{\text {fanin }}\left(F U_{i}, b_{j}\right)=\left\{\begin{array}{l}
1 \text { if } F U_{i} \text { reads data from } b_{j}  \tag{5.3}\\
0 \text { otherwise }
\end{array}\right.
\]
and \(C_{t r i-i n}\) is typical input capacitance for tristate buffers. For this reason, minimizing fanout buffers will have here a contribution on saving static power only, but fanin buffers will also save static power. An example of such an interconnection type is shown in Fig. 34 To wrap-up, the final dynamic power for a single buslet in the design \(b_{j}\) will be
\[
\begin{equation*}
P_{d y n}\left(b_{j}\right)=f_{w} \cdot V_{d d}^{2} \cdot \frac{\operatorname{COMS}\left(b_{j}\right)}{\bar{\lambda}}\left(C_{w}\left(b_{j}\right)+C_{t r i}\left(b_{j}\right)+1\right) \tag{5.4}
\end{equation*}
\]
```

Algorithm 15 Standard MST power model
procedure MST-POW (G, D, $\bar{\lambda}$ )
tot_pow $=0$
for all buslet $b_{j} \in D$ do
$F=\operatorname{Prim}(\mathrm{G})$
$\mathrm{WL}=$ total_length $(\mathrm{G})$
CFIN $=$ total_fanin_buffers $\left(b_{j}\right)$
this_buf_pow $=P_{d y n}\left(b_{j}\right) \quad \triangleright$ It is (5.4)
this_buf_pow $=$ this_buf_pow + static_power
tot_pow $=$ tot_pow + this_buf_pow

```

\subsection*{5.3.1 Complexity}

In order to build the MST we use here Prim's algorithm \(\left(O\left(n^{2}\right)\right)\). Then, in order to decide the buffer placement, we will take another \(O(n)\) time and, finally, to compute the total static and dynamic power we just need to know their quantity \((O(1))\). The most dominant part is here the creation of the MST and, for this, the total final complexity is \(O\left(n^{2}\right)\).

\subsection*{5.4 MST-2B}

The main problem with standard buffer placement is that we have to drive all the buslet for any communication, even if two functional units are each other close.

We can see an example of such a behavior in Fig. 35. In order to avoid this, what we can do is literally "slicing" the buslet, using tristate buffers which may enable/disable the propagation of the signal according to the target functional unit position in the MST.

Let us call routing nodes all the nodes in the MST which are not FUs. Each wire \(w\left(u_{i}, u_{j}\right)\) in the MST connecting two nodes which are both routing nodes will have, at both its ends, a


Figure 35: Example of communication between FUs in the case of standard interconnections. Here, even if the two FUs needing to communicate are close (in this case, \(F U_{1}\) is sending data to \(F U_{2}\), the signal will propagate through all the buslet (it is evidenced in red).)
bidirectional tristate buffer \({ }^{1}\). Using these, it is possible to minimize the total wirelength to be driven. We can see this implementation in Fig. 37. In that case, just the minimum wirelength will be driven.

However, such an approach has some drawbacks:
- When the signal reaches routing node \(R_{k}\), the fanin input capacitance at that particular node will be not only that of the next wire it will cross, but also the sum of the input capacitances of all the other wires connected to \(R_{k}\). An example of this is in Fig. 36 .

\footnotetext{
\({ }^{1}\) We define bidirectional tristate buffer the pairing of two tristate buffers in parallel, oriented in the two opposite directions.
}


Figure 36: Example of communication between \(F U_{1}\) and \(F U_{4}\) in the MST-2B model. In this case, even if the wirelength is the lowest possible (in red), the number of input capacitances to be loaded (evidenced in green) is very high, much greater than in the standard implementation.

Let us call the extra capacitance to be driven (that not belonging to the path the signal will follow) \(C_{\text {fin-extra }}\left(R_{k}\right)\). It will be
\[
\begin{equation*}
C_{\text {fin-extra }}\left(R_{k}\right)=C_{\text {tri-in }} \cdot\left(\sum_{\forall w_{j} \text { connected to } R_{k}} \delta_{\text {wirefanin }}\left(w_{j}, R_{k}\right)-1\right) \tag{5.5}
\end{equation*}
\]
where
\[
\delta_{\text {wirefanin }}\left(w_{j}, R_{k}\right)= \begin{cases}1 & \text { if } w_{j} \text { reads data from } R_{k}  \tag{5.6}\\ 0 & \text { otherwise }\end{cases}
\]
- For all the wire slices crossed by the signal, we will have a contribution from the fanin tristate buffers. Let us call such a contribution \(C_{\text {fin-route }-2 B}\). Let us also assume we have the route \(Y\) which is a bunch of wires \(w_{j}\) connecting routing nodes only. We can say that wire \(w\left(R_{k}, R_{l}\right)\) will connect routing nodes \(R_{k}\) and \(R_{l}\). From this, the contribution of \(C_{\text {fin-route }-2 B}(Y)\) can be expressed as
\[
\begin{equation*}
C_{\text {fin-route- } 2 B}(Y)=\sum_{\forall w_{j}\left(R_{k}, R_{l}\right) \in Y} 2+\delta_{\text {wirefanout }}\left(w_{j}, R_{k}\right)+\delta_{\text {wirefanin }}\left(w_{j}, R_{l}\right) \tag{5.7}
\end{equation*}
\]
where \(\delta_{\text {wirefanin }}\left(w_{j}, R_{l}\right)\) was already defined in (5.6) while
\[
\delta_{\text {wirefanout }}\left(w_{j}, R_{k}\right)= \begin{cases}1 & \text { if } w_{j} \text { writes data to } R_{k}  \tag{5.8}\\ 0 & \text { otherwise }\end{cases}
\]

Always referring to Fig. 36, when performing \(\operatorname{com}\left(F U_{1}, F U_{4}\right)\), input capacitances for \(F U_{2}, F U_{3}\) and another wire connecting steiner points are loaded.

So, such an approach will minimize capacitance from wire with a higher cost coming from tristate buffers.

\subsection*{5.4.1 Shortest path problem: Dijkstra's algorithm}

Dijkstra's algorithm is used in a graph, directed or undirected, to find the shortest path between two nodes. It was designed by Edsger W. Dijkstra in 1959[25].

From the source node, it will iteratively look for the lowest cost edge, and it will bring at the destination node the sum of the crossed edges. It will end when it will reach the destination


Figure 37: Example of implementation of MST-2B.
node.
The total complexity of this algorithm is \(O\left(n^{2}\right)\), but there are some other implementations making this \(O(e+n \log (n))\).

\subsection*{5.4.2 Complexity}

The total complexity of such an algorithm is tightly linked to the initial computation of the MST (Prim's algorithm, \(O\left(n^{2}\right)\) ), computation of the dynamic power (number of communications, \(e\), times shortest path computation, \(O\left((n+r)^{2}\right)\), where \(r\) is the number of routing nodes) and buffer placement ( \(O(w)\), with \(w\) number of wires, because each wire will need to handle its own buffers, depending on read/write operations). In total, our complexity will be \(O\left(n^{2}\right)+O\left((n+r)^{2}\right)+O(w)\). As \(O\left(n^{2}\right) \leq O\left((n+r)^{2}\right)\), we can say that the complexity is
```

Algorithm 16 Dijkstra algorithm
procedure $\operatorname{DiJkstra}(\mathrm{G}, \mathrm{s}, \mathrm{d}) \triangleright \mathrm{G}$ is the graph, s is the source node and d is the destination
$S=\emptyset$
for all $u_{i} \in G$ do
if $u_{i} \neq s$ then
$d\left(u_{i}\right)=\infty \quad \triangleright$ The distance between s and $u_{i}$ is unknown
$p\left(u_{i}\right)=\infty \quad \triangleright$ The minimum distance node between $u_{i}$ predecessors and s is
unknown
$S=S+u_{i}$
while $S \neq \emptyset$ do
$\mathrm{n}=$ node with $\min \mathrm{d}$
$S=S-n$
for all $u_{j}$ predecessors and successors of n do
temp $=d(n)+e\left(u_{j}, n\right)$
if temp<d( $\left.u_{j}\right)$ then
$d\left(u_{j}\right)=t e m p$
$p\left(u_{j}\right)=n$

```
```

Algorithm 17 MST with two buffers per line slice power model
procedure MST-2B(G, D, $\bar{\lambda})$
tot_pow $=0$
for all buslet $b_{j} \in D$ do
$F=\operatorname{Prim}(\mathrm{G})$
this_buf_pow $=0$
for all $c_{i} \in C S\left(b_{j}\right)$ do $\triangleright$ For all communications scheduled (CS) in buslet $b_{j}$
$\mathrm{R}=\operatorname{Dijkstra}\left(\mathrm{G}, u_{i}, u_{j}\right)$
CFIN $=C_{\text {fin-extra }}(R, G)+C_{\text {fin-route-2B }}(R) \triangleright C_{\text {fin-extra }}(R, G)$ defined in (5.5)
while $C_{\text {fin-route-2B }}(R)$ defined in (5.7)
this_buf_pow $=P_{d y n}\left(b_{j}\right) \triangleright$ It is (5.4), using as capacitance CFIN and as distance
$d(R)$
this_buf_pow $=$ this_buf_pow + static_power
tot_pow $=$ tot_pow + this_buf_pow

```


Figure 38: Example of implementation of MST-1B.
\(O\left((n+r)^{2}+w\right)\). However, as we are in a minimum spanning tree, we also know that \(w=n+r-1\). For this, we rewrite the complexity as \(O\left((n+r)^{2}+n+r-1\right)\) and, finally, \(O\left((n+r)^{2}\right)\).

\subsection*{5.5 MST-1B}

In MST-2B, we are routing signals through the shortest path of our MST. This is performed completely isolating this path from the rest of the buslet, thanks to the use of tristate buffers at both the ends of each wire. However, we can reach such a result using a couple of tristate buffers only.

A structure example of this can be found in Fig. 38. MST-1B differs from MST-2B exactly from this. In particular, it has the big advantage of halving \(C_{\text {fin-route }}\) contribution, as it will be met one couple of tristate buffers per wire only.

Let us try to perform the same communication as in Fig. 36 (MST-2B). With this implemen-


Figure 39: Example of communication between \(F U_{1}\) and \(F U_{4}\) in the MST-1B model. In this case, even if the wirelength is the lowest possible (in red), the number of input capacitances to be loaded (evidenced in green) is very high, much greater than in the standard implementation.
tation, the number of the buffers crossed almost halves. We find this in Fig. 39.
Assuming route \(Y\), the contribution of \(C_{\text {fin-route-1B }}(Y)\) can be expressed as
\[
\begin{equation*}
C_{\text {fin-route- } 1 B}(Y)=\sum_{\forall w_{j}\left(R_{k}, R_{l}\right) \in Y} 1+\delta_{\text {wirefanin }}\left(w_{j}, R_{l}\right) \tag{5.9}
\end{equation*}
\]
where \(\delta_{\text {wirefanin }}\left(w_{j}, R_{l}\right)\) was already defined in (5.6).
The other side of the coin of using this technique, however, is that the signal will not propagate through the shortest path only. If we look at Fig. 39, indeed, we see that the signal, for a small piece of line, will overpropagate.

\subsection*{5.5.1 Placing tristate buffers}

Here comes naturally a question: where it is more convenient to place the pair of buffers? It depends on the communications flow itself. Let us assume we are taking this decision on wire \(w\left(R_{k}, R_{l}\right)\). Here, we can determine the communication rate, defined as the ratio between the number of communications between two nodes over the total latency
\[
\begin{equation*}
r\left(R_{k}, R_{l}\right)=\frac{\# \text { coms from } R_{k} \text { to } R_{l}}{\bar{\lambda}} \tag{5.10}
\end{equation*}
\]

Note: \(r\left(R_{k}, R_{l}\right) \neq r\left(R_{l}, R_{k}\right)\). Let us assume we place the buffer at distance \(x\) from \(R_{k}\). In this case, the total dynamic power consumed by \(w\left(R_{k}, R_{l}\right)\) will be
\[
\begin{array}{r}
P_{\text {wiredyn }}\left[w\left(R_{k}, R_{l}\right)\right]=\left[r\left(R_{k}, R_{l}\right)+r\left(R_{l}, R_{k}\right)\right] \cdot C_{F / m} \cdot d\left[w\left(R_{k}, R_{l}\right)\right]+ \\
{\left[\eta_{1}-r\left(R_{k}, R_{l}\right)\right] \cdot C_{F / m} \cdot x+}  \tag{5.11}\\
{\left[\eta_{2}-r\left(R_{l}, R_{k}\right)\right] \cdot C_{F / m} \cdot\left\{d\left[w\left(R_{k}, R_{l}\right)\right]-x\right\}}
\end{array}
\]
where \(\eta_{1}\) is defined as
\[
\begin{equation*}
\eta_{1}=\sum_{\forall n_{i} \in U\left(R_{k}, R_{l}\right)} r\left(n_{i}, R_{k}\right) \tag{5.12}
\end{equation*}
\]
where \(U\left(R_{k}, R_{l}\right)\) is the set of all neighbors nodes of \(R_{k}\) excluding \(R_{l}\), and \(\eta_{2}\) is similary defined. The contribution depending on \(x\) is \(\left[\eta_{1}-r\left(R_{k}, R_{l}\right)\right] \cdot C_{F / m} \cdot x+\left[\eta_{2}-r\left(R_{l}, R_{k}\right)\right] \cdot C_{F / m}\). \(\left\{d\left[w\left(R_{k}, R_{l}\right)\right]-x\right\}\). In order to minimize the power, we have two possibilities:
\[
x= \begin{cases}0 & \text { ifr }\left(R_{k}, R_{l}\right)-r\left(R_{l}, R_{k}\right)+\eta_{1}-\eta_{2} \geq 0  \tag{5.13}\\ d\left[w\left(R_{k}, R_{l}\right)\right] & \text { otherwise }\end{cases}
\]

Furthermore, thanks to this analysis, we can accurately evaluate the total power consumed by each slice of the buslet. This makes the total dynamic power estimation easier.

When we have to compute the contribution of the buffer input capacitances to the dynamic power, we have to distinguish the two cases. Let us say that \(C_{1}\) is the buffer oriented from \(R_{k}\) to \(R_{l}\) and \(C_{2}\) is oriented in the opposite direction. Both will be loaded when signal will entirely cross the wire (in both directions), while just one will be loaded when neighbor nodes will communicate. According to this, we can state that
\[
\left\{\begin{array}{l}
P_{C_{1} d y n}=\left[r\left(R_{k}, R_{l}\right)+r\left(R_{l}, R_{k}\right)+\eta_{1}\right] \cdot C_{t r i-i n}  \tag{5.14}\\
P_{C_{2} d y n}=\left[r\left(R_{k}, R_{l}\right)+r\left(R_{l}, R_{k}\right)+\eta_{2}\right] \cdot C_{t r i-i n}
\end{array}\right.
\]

\subsection*{5.5.2 Complexity}

The complexity of this algorithm is the same as MST-2B adding the overhead of taking the decision of where to place the buffer. This, in particular, analyzes the neighbor nodes \((O(n+r))\)
```

Algorithm 18 MST with one buffer per line slice power model
procedure $\operatorname{MST}-1 \mathrm{~B}(\mathrm{G}, \mathrm{D}, \bar{\lambda})$
tot_pow $=0$
for all buslet $b_{j} \in D$ do
$F=\operatorname{Prim}(\mathrm{G})$
this_buf_pow $=0$
for all $c_{i} \in C S\left(b_{j}\right)$ do $\quad \triangleright$ For all communications scheduled (CS) in buslet $b_{j}$
$\mathrm{R}=\operatorname{Dijkstra}\left(\mathrm{G}, u_{i}, u_{j}\right)$
$\operatorname{track}(\mathrm{R}, \mathrm{G}) \quad \triangleright$ Mark each wire composing R as crossed in G
for all edges $e_{i} \in G$ do $\quad \triangleright \mathrm{G}$ is composed by edges, already owning the number of
times being crossed
Place_buffers $\left(e_{i}\right) \quad \triangleright$ According to (5.13)
Compute wire_dpow $\left(e_{i}\right)$
Compute buf_dpow $\left(e_{i}\right)$
$t h i s_{-} b u f_{-} p o w=$ wire_dpow $\left(e_{i}\right)+b u f_{\_} d p o w\left(e_{i}\right)+\operatorname{static\_ buf}\left(e_{i}\right)$
this_buf_pow $=$ this_buf_pow + static_power
tot_pow $=$ tot_pow + this_buf_pow

```
per each edge \((O(n+r-1))\), and adding an overhead \(O\left((n+r)^{2}\right)\). However, complexity for MST-2B was already \(O\left((n+r)^{2}\right)\); so, MST-1B complexity is \(O\left((n+r)^{2}\right)\).

\subsection*{5.6 MST-LOGD}

All the previous approaches (Standard, MST-2B and MST-1B) are approaches based on minimum spanning tree. This means that, as they are built, they ensure the lowest possible wirelength for the buslet. However, this may lead to situations in which, for a given communication, the total number of nodes to be crossed may be very high. In the worst case, we may need to cross the whole buslet, making power consumption (for MST-2B and MST-1B) higher than in the standard approach (this is due to the extra buffers inserted in the line).

In order to ensure this, we may fix a maximum graph distance between, let us say, a root node,


Figure 40: Example of implementation of MST-LOGD.
and any other node in the buslet. Let us define hop distance \(h_{d}\) as the distance, expressed as number of nodes, between two nodes in a graph.

If we impose this to be at \(\operatorname{most} \log _{2}(n)\) from a root node, which can be randomly chosen, then we are, at the same time, ensuring that the maximum distance between any two nodes in such a graph will be, at most, \(2 \log _{2}(n)+1\). As we will build our MST starting from a fully connected graph, such a solution will always exist and will have, as result, a more compact buslet structure. What we need to do is just modifying Prim algorithm adding as constraint the maximum distance from a root node. For any other characteristic, this will be just an evolution of MST-1D.
```

Algorithm 19 Prim algorithm with max hop distance constraint
procedure Prim_logmod(G, D)
$Q=\emptyset$
$F=\emptyset$
for each node $u_{i} \in G$ do
$C\left[u_{i}\right]=\min _{\forall u_{j} \in P\left(u_{i}\right)}\left\{w\left[e\left(u_{i}, u_{j}\right)\right]\right\}$
$Q=Q+u_{i} \quad \triangleright$ When initialized $Q$ contains all the nodes in $G$
root $=\emptyset$
while $Q$ not empty do
$u_{\text {best }}=\operatorname{first}(Q)$
for all $u_{i} \in Q$ do
if $C\left[u_{i}\right]<u_{\text {best }}$ then
if $\operatorname{hops}\left(u_{i}\right.$, root $) \leq D$ then $\quad \triangleright$ If the number of hops separating $u_{i}$ and root
respects the constraint
$u_{\text {best }}=u_{i}$
if root $=\emptyset$ then
root $=u_{\text {best }}$
$F=F+u_{\text {best }}$
$Q=Q-u_{\text {best }}$

```

\subsection*{5.6.1 Complexity}

We are just adding an overhead to the computation of the MST obtained satisfying logarithmic distance from root node. For this, each time we have to test the distance between the potentially new node and the root. However, as all the nodes, once added, will store the information of their own distance from the root, such a test will be \(O(1)\), non affecting the overall complexity, remaining, as in the case of MST-1D, \(O\left((n+r)^{2}\right)\).

\subsection*{5.7 MST-BF}

The problem with MST-LOGD is that we do not have any constraint on the maximum number of wires connected to a single node. In this way, we may potentially have high undesired
```

Algorithm 20 MST with one buffer per line slice power model and max log distance
procedure MST-LOGD (G, D, $\bar{\lambda}$ )
tot_pow $=0$
for all buslet $b_{j} \in D$ do
$F=$ Prim_logmod (G, D)
this_buf_pow $=0$
for all $c_{i} \in C S\left(b_{j}\right)$ do $\quad \triangleright$ For all communications scheduled (CS) in buslet $b_{j}$
$\mathrm{R}=\operatorname{Dijkstra}\left(\mathrm{G}, u_{i}, u_{j}\right)$
$\operatorname{track}(\mathrm{R}, \mathrm{G}) \quad \triangleright$ Mark each wire composing R as crossed in G
for all edges $e_{i} \in G$ do $\quad \triangleright \mathrm{G}$ is composed by edges, already owning the number of
times being crossed
Place_buffers $\left(e_{i}\right) \quad \triangleright$ According to (5.13)
Compute wire_dpow $\left(e_{i}\right)$
Compute buf_dpow $\left(e_{i}\right)$
$t h i s_{-} b u f_{-} p o w=$ wire_dpow $\left(e_{i}\right)+b u f_{\_} d p o w\left(e_{i}\right)+\operatorname{static\_ buf}\left(e_{i}\right)$
this_buf_pow $=$ this_buf_pow + static_power
tot_pow $=$ tot_pow + this_buf_pow

```
capacitances to be loaded. We want to also impose a constraint on this. Let us call branching factor the maximum number of wires connected to a single node and let us say it to be \(f\). Each node will have at most \(f\) wires connected to it, limiting the number of extra undesired capacitances to be loaded.

\subsection*{5.7.1 Complexity}

We are just adding an overhead for the introduction of the branching factor. However, such a test will be \(O(1)\) only. The overall complexity, in this way, will keep on being \(O\left((n+r)^{2}\right)\).

\subsection*{5.8 MAX-D}

Let us, now, focus on the minimization of the longest path between FUs. Such a kind of optimization makes sense because we may want to minimize as much as possible wire dynamic
```

Algorithm 21 Prim algorithm with max branch factor constraint
procedure Prim_maxbranchfactor(G, B)
$Q=\emptyset$
$F=\emptyset$
$B=\emptyset$
for each node $u_{i} \in G$ do
$C\left[u_{i}\right]=\min _{\forall u_{j} \in P\left(u_{i}\right)}\left\{w\left[e\left(u_{i}, u_{j}\right)\right]\right\}$
$B\left[u_{i}\right]=0$
$Q=Q+u_{i} \quad \triangleright$ When initialized $Q$ contains all the nodes in $G$
while $Q$ not empty do
$u_{\text {best }}=\operatorname{first}(Q)$
for all $u_{i} \in Q$ do
if $C\left[u_{i}\right]<u_{\text {best }}$ then
acceptable $=$ true
for all neighbors nodes $u_{j}$ of $u_{i}$ do
if $B\left[u_{j}\right] \geq B$ then $\quad \triangleright$ If the number of branches of $u_{i}$ does not meet the
constraint

$$
\text { acceptable }=\text { false }
$$

$$
\text { if acceptable }=\text { true then }
$$

$$
u_{\text {best }}=u_{i}
$$

for all neighbors nodes $u_{j}$ of $u_{\text {best }}$ do

$$
B\left[u_{j}\right]=B\left[u_{j}\right]+1
$$

$$
B\left[u_{\text {best }}\right]=B\left[u_{\text {best }}\right]+1
$$

$$
F=F+u_{\text {best }}
$$

$$
Q=Q-u_{\text {best }}
$$

```


Figure 41: Example of implementation of MST-BF with maximum branching factor allowed 3.
```

Algorithm 22 MST with one buffer per line slice power model and max branching factor
procedure $\operatorname{MST}-\operatorname{LOGD}(\mathrm{G}, \mathrm{B}, \bar{\lambda})$
tot_pow $=0$
for all buslet $b_{j} \in D$ do
$F=$ Prim_maxbranchfactor (G, B)
this_buf_pow $=0$
for all $c_{i} \in C S\left(b_{j}\right)$ do $\quad \triangleright$ For all communications scheduled (CS) in buslet $b_{j}$
$\mathrm{R}=\operatorname{Dijkstra}\left(\mathrm{G}, u_{i}, u_{j}\right)$
$\operatorname{track}(\mathrm{R}, \mathrm{G})$
$\triangleright$ Mark each wire composing R as crossed in G
for all edges $e_{i} \in G$ do $\quad \triangleright \mathrm{G}$ is composed by edges, already owning the number of
times being crossed
Place_buffers $\left(e_{i}\right) \quad \triangleright$ According to (5.13)
Compute wire_dpow $\left(e_{i}\right)$
$\triangleright(5.11)$
Compute buf_dpow $\left(e_{i}\right)$
$t h i s \_b u f$ _pow $=w i r e \_d p o w\left(e_{i}\right)+b u f \_d p o w\left(e_{i}\right)+s t a t i c \_b u f\left(e_{i}\right)$
this_buf_pow $=$ this_buf_pow + static_power
tot_pow $=$ tot_pow + this_buf_pow

```
power consumption. In order to do this, what we have to do is to fix an upper bound for such a distance and each other connect the other Functional units accordingly. In order to do this, we can first connect the two most distant FUs and use their distance as upper bound. Then, we can connect the other FUs to the so obtained wire through the closest Hannan point \({ }^{1}\). However, it may happen that some FUs can not be directly connected to it. At this point, they are connected to the closest Hannan point (distance computed as Manhattan distance). Then, there may be FUs that, even if connected in such a way, can not respect the maximum distance imposed. If this happens, then they are connected at the end, such that the maximum distance will be minimized.

Finally, as the first two FUs can be connected in a number of ways, as shown by the dashed lines in Fig. 42, a number of interconnections are attempted and we will hold just the best (i.e. that having the least maximum distance).

\subsection*{5.8.1 Complexity}

This technique first looks for the two most distant FUs \(\left(O\left(\frac{n^{2}}{2}\right)\right)\). Then, for each FU, it will compute Hannan points \((O(n))\) and check if max distance constraint is respected \((O(n)\), because just one path exists between any connected FU). Finally, All the FUs not respecting such a constraint will be connected to the closest Hannan point \(\left(O\left(n^{2}\right)\right)\). Hence, the total complexity here is \(O\left(\frac{n^{2}}{2}\right)+O\left(n^{2}\right)+O\left(n^{2}\right)=O\left(n^{2}\right)\).

\footnotetext{
\({ }^{1}\) In this case, we define Hannan point the intersection between the already placed wire and the wire representing the shortest distance between such a wire and the target FU.
}


Figure 42: MAX-D algorithm at work. First it defines that max distance is between \(F U_{1}\) and \(F U_{2}\), then it tracks one among the possible min Manhattan distance paths between them (the other paths are dashed), then Hanna points are computed (green "X") and finally, where distance constraints respected, FUs are connected (the example of final design is in Fig. 43).


Figure 43: Example of final interconnections obtained with MAX-D.
```

Algorithm 23 Maximum distance constrained buslet generation
procedure MAX-D(G)
tot_pow $=0$
for all buslet $b_{j} \in G$ do
$\mathrm{B}=\emptyset \quad \triangleright \mathrm{B}$ is the final structure of the buslet
$\mathrm{R}=\mathrm{FUs}\left(b_{j}\right) \quad \triangleright \mathrm{R}$ contains the remaining FUs to be connected to the buslet
$\mathrm{Q}=$ compute_clique $\left(\mathrm{G}, b_{j}\right) \triangleright$ Here we create a clique from G having as nodes all the
FUs in $b_{j}$ having as weight of the arcs the Manhattan distance between the nodes
first_edge $=\emptyset$
for all edges $e_{i} \in Q$ do
if $w\left[e_{i}\right] \geq w[$ first_edge $]$ then
first_edge $=e_{i}$
$B=$ first_edge
$\mathrm{R}=\mathrm{R}$ - FUs(first_edge) $\quad \triangleright$ Remove from the set of FUs to be connected those
connected through first_edge
for all $r_{i} \in R$ do
next_wire $=\operatorname{Hannan}\left(r_{i}\right.$, first_edge $) \triangleright$ Here we compute the closest wire connecting
$r_{i}$ to first_edge
if $\mathrm{R}+$ next_wire respects max distance w[first_edge] then
$\mathrm{B}=\mathrm{B}+$ next_wire $\quad \triangleright$ We add wire connecting $r_{i}$ to first_edge
$\mathrm{R}=\mathrm{R}-r_{i} \quad \triangleright$ We remove $r_{i}$ from the list of the to be connected FUs
for all $r_{i} \in R$ do $\quad \triangleright$ For all the FUs not connected because not respecting max
distance constraint
next_wire $=$ shortest_edge $\left(r_{i}, \mathrm{~B}\right) \triangleright$ We compute the shortest edge connecting $r_{i}$ to
B
$\mathrm{B}=\mathrm{B}+$ next_wire $\quad \triangleright$ We add wire connecting $r_{i}$ to first_edge
$\mathrm{R}=\mathrm{R}-r_{i} \quad \triangleright$ We remove $r_{i}$ from the list of the to be connected FUs
this_buf_pow $=0$
for all $c_{i} \in C S\left(b_{j}\right)$ do $\quad \triangleright$ For all communications scheduled (CS) in buslet $b_{j}$
$\mathrm{R}=\operatorname{Dijkstra}\left(\mathrm{G}, u_{i}, u_{j}\right)$
$\operatorname{track}(\mathrm{R}, \mathrm{G}) \quad \triangleright$ Mark each wire composing R as crossed in G
for all edges $e_{i} \in G$ do $\quad \triangleright \mathrm{G}$ is composed by edges, already owning the number of
times being crossed
Place_buffers $\left(e_{i}\right) \quad \triangleright$ According to (5.13)
Compute wire_dpow $\left(e_{i}\right) \quad \triangleright(5.11)$
Compute buf_dpow $\left(e_{i}\right)$
this_buf_pow $=$ wire_dpow $\left(e_{i}\right)+\operatorname{buf}$ _dpow $\left(e_{i}\right)+\operatorname{static\_ buf}\left(e_{i}\right)$
this_buf_pow $=$ this_buf_pow + static_power
tot_pow $=$ tot_pow + this_buf_pow

```

\subsection*{5.9 MAX-D-MAXHOP}

MAX-D technique intrinsically has a big drawback: in the first tracked line, there will be a number of steiner points, which will make huge the power consumption due to buffer load. A solution to this problem is to include also a constraint to the maximum number of steiner points a given communication needs to cross: Thanks to this, we my more difficultly meet the max distance constraint; however, we can fix the maximum number of buffers to be loaded for a single communication.

\subsection*{5.9.1 Complexity}

Here we are adding the overhead of checking hop distance to MAX-D algorithm. However, when we are computing max distance, this is an \(O(1)\) operation. Fot this, the total complexity remains \(O\left(n^{2}\right)\).

\subsection*{5.10 Hierarchical partitioning}

Let us use, now, a completely different approach. Now, we aim to build a binary tree-like structure. In order to do this, we hierarchically partition the floorplan computing centers of gravity. Then, when we will have partitions of one functional unit each, we will build our binary tree, connecting the previously computed centers of gravity (CG).

We can find an example of hierarchical partitioning in Fig. 44. More formally, let us say we have a set of functional units \(S\) to be connected through a buslet. We compute the center of gravity of it
\[
\begin{equation*}
C G(S)=\left[\frac{\sum_{\forall F U_{i} \in S} x\left(F U_{i}\right)}{|S|} ; \frac{\sum_{\forall F U_{i} \in S} y\left(F U_{i}\right)}{|S|}\right] \tag{5.15}
\end{equation*}
\]
```

Algorithm 24 Maximum distance constrained buslet generation with maximum hop distance
constraint
procedure MAX-D(G, max_hops)
tot_pow $=0$
for all buslet $b_{j} \in G$ do
$\mathrm{B}=\emptyset \quad \triangleright \mathrm{B}$ is the final structure of the buslet
$\mathrm{R}=\mathrm{FUs}\left(b_{j}\right) \quad \triangleright \mathrm{R}$ contains the remaining FUs to be connected to the buslet
$\mathrm{Q}=$ compute_clique $\left(\mathrm{G}, b_{j}\right) \triangleright$ Here we create a clique from G having as nodes all the
FUs in $b_{j}$ having as weight of the arcs the Manhattan distance between the nodes
first_edge $=\emptyset$
for all edges $e_{i} \in Q$ do
if $w\left[e_{i}\right] \geq w[$ first_edge $]$ then
first_edge $=e_{i}$
B = first_edge
$\mathrm{R}=\mathrm{R}$ - FUs(first_edge) $\quad \triangleright$ Remove from the set of FUs to be connected those
connected through first_edge
for all $r_{i} \in R$ do
next_wire $=\operatorname{Hannan}\left(r_{i}\right.$, first_edge $) \triangleright$ Here we compute the closest wire connecting
$r_{i}$ to first_edge
if $\mathrm{R}+$ next_wire respects max distance $\mathrm{w}[$ first_edge] then
if $\mathrm{R}+$ next_wire respects max hop distance max_hops then
$\mathrm{B}=\mathrm{B}+$ next_wire $\quad \triangleright$ We add wire connecting $r_{i}$ to first_edge
$\mathrm{R}=\mathrm{R}-r_{i} \quad \triangleright$ We remove $r_{i}$ from the list of the to be connected FUs
for all $r_{i} \in R$ do $\quad \triangleright$ For all the FUs not connected because not respecting max
distance constraint
next_wire $=$ shortest_edge $\left(r_{i}, \mathrm{~B}\right) \triangleright$ We compute the shortest edge connecting $r_{i}$ to
B
$\mathrm{B}=\mathrm{B}+$ next_wire $\quad \triangleright$ We add wire connecting $r_{i}$ to first_edge
$\mathrm{R}=\mathrm{R}-r_{i} \quad \triangleright$ We remove $r_{i}$ from the list of the to be connected FUs
this_buf_pow $=0$
for all $c_{i} \in C S\left(b_{j}\right)$ do $\quad \triangleright$ For all communications scheduled (CS) in buslet $b_{j}$
$\mathrm{R}=\operatorname{Dijkstra}\left(\mathrm{G}, u_{i}, u_{j}\right)$
$\operatorname{track}(\mathrm{R}, \mathrm{G}) \quad \triangleright$ Mark each wire composing R as crossed in G
for all edges $e_{i} \in G$ do $\quad \triangleright \mathrm{G}$ is composed by edges, already owning the number of
times being crossed
Place_buffers $\left(e_{i}\right) \quad \triangleright$ According to (5.13)
Compute wire_dpow $\left(e_{i}\right) \quad \triangleright(5.11)$
Compute buf_dpow $\left(e_{i}\right)$
this_buf_pow $=$ wire_dpow $\left(e_{i}\right)+\operatorname{buf}$ _dpow $\left(e_{i}\right)+\operatorname{static\_ buf}\left(e_{i}\right)$
this_buf_pow $=$ this_buf_pow + static_power
tot_pow $=$ tot_pow + this_buf_pow

```


Figure 44: Example of hierarchical partitioning. Here we are recursively partitioning the floorplan (first partitioning in red, second in blue and, finally, connect all the centers of gravity in reverse order to that they were created.).
where \(x\left(F U_{i}\right)\) is x value of the position of \(F U_{i}\) while \(y\left(F U_{i}\right)\) is the y position. Now, we decide if we want to divide \(S\) partitioning vertically or horizontally. If we have
\[
\begin{equation*}
\forall F U_{i}, F U_{j} \in S \max \left\{d\left[x\left(F U_{i}\right), x\left(F U_{j}\right)\right]\right\}>\max \left\{d\left[y\left(F U_{i}\right), y\left(F U_{j}\right)\right]\right\} \tag{5.16}
\end{equation*}
\]
then we will partition \(S\) vertically, otherwise horizontally. We will use, this way, a divide and conquer-like approach. Furthermore, after tracked the buslet, we can apply the buffer placements in MST-2B and MST-1B.


Figure 45: Example of implementation of hierarchical partitioning. In blue we have the centers of gravity. We see that the last one is a dummy nose as it connects two wires only and, for this, it can not be considered a steiner point.
```

Algorithm 25 Hierarchical partitioning wire generation
procedure $\mathrm{HP}(\mathrm{G})$
if $\operatorname{size}(\mathrm{G})=1$ then return G
$F=\emptyset$
Compute $C G(G)$
$\triangleright$ According to formulation in (5.15)
Compute $\operatorname{maxx}(\mathrm{G})$
Compute $\operatorname{maxy}(\mathrm{G})$
if $\max x>\max y$ then
$\mathrm{L}=$ left_partition(G, CG)
$\mathrm{R}=$ right_partition $(\mathrm{G}, \mathrm{CG})$
$\mathrm{F}=\mathrm{F}+\mathrm{HP}(\mathrm{L})+\mathrm{HP}(\mathrm{R})+\mathrm{CG}$
else
$\mathrm{U}=$ up_partition(G, CG)
$\mathrm{D}=$ down_partition(G, CG)
$\mathrm{F}=\mathrm{F}+\mathrm{HP}(\mathrm{U})+\mathrm{HP}(\mathrm{D})+\mathrm{CG}$
return F

```

\subsection*{5.10.1 Complexity}

Computing the first center of gravity will take \(O(n)\). After partitioning it, we will have two sets of approx \(\frac{n}{2}\) size each, which will need to be partitioned and so on. We will perform such a division \(\log _{2}(n)\) times; so, the time for dividing will be \(O(n \log (n))\). The number of generated CG will be \(n-1\); so, conquering will take \(O(n)\) time. For this, we can say that the generation of such a buslet will take \(O(n \log (n))\) time. However, we have to perform also buffer placement, which will be dominant as it is \(O\left((n+r)^{2}\right)\). Here we know \(r\) as it is \(n-1\). For this, we can easily say that the complexity will be \(O\left(4 n^{2}\right)\).

\section*{CHAPTER 6}

\section*{EXPERIMENTAL RESULTS}

All the algorithms presented in the previous sections were implemented in C++. In order to evaluate both final results and runtimes, several runs were performed, from the mediabench benchmark suite (with number of nodes and edges ranging from (11, 7) for DFG hal to (197, 196) for DFG smooth_color_z_triangle) [26]. In order to evaluate both final results and runtimes, several runs were performed, with 14 different DFGs, For each FU type, aspect ratios 1:1, \(3: 4\) and 1:2 were used in the floorplanning step, in which we used the well-known floorplanner Parquet[27] to obtain the wirelengths and areas of the final designs. We assume the following areas of resources: add/sub \(8 \mathrm{\mu m}^{2}\), multiplier/divider \(128 \mathrm{\mu m}^{2}\), memory \(512 \mathrm{\mu m}^{2}\), comparator \(8 \mu^{2}\). For any run of our algorithm, where we define run of our algorithm the execution of our algorithm giving as input a particular combination of the following parameters: input DFG, maximum buslet cardinality, maximum FUs' fanin, maximum FUs' fanout, maximum DFG's latency, a set of 20 solutions were produced, in which the aspect ratio of any FU was randomly chosen. Then, for all of these solutions power estimation was performed for all the patterns we have presented and then these values were averaged. Communication delay (FU-to-FU signal delay) is assumed to take 1 clock cycle, and the communication as well as operations scheduling done assume this delay for all communication. All the runs were performed on a 4th Generation Intel Core i7-4720HQ Processor ( 2.60 GHz with 1600 MHz 6 MB cache) processor machine with 16 GB RAM.

\subsection*{6.1 Performance for scheduling and binding algorithms}

The most significant metrics we determined were area size (Fig. 51) and wirelength (Fig. 50) shows the wirelength interchange average obtained from 14 different DFGs for different buslet cardinality constraint, comparing four different algorithm settings:
- No similarity solution pruning, propagating from each binding step 1 best partial solution only (NSP1).
- No similarity solution pruning, propagating from each binding step 20 best partial solutions (NSP20).
- Use of similarity solution pruning, propagating from each binding step 1 best partial solution only (YSP1).
- Use of similarity solution pruning, propagating from each binding step 20 best partial solutions (YSP20).
- Force Directed Binding (FDB).
- No similarity solution pruning, propagating from each binding step 1 best partial solution only with lookahead of next com to be bound (LA1).
- Simultaneous iso-scheduled communications binding (SIMCOM).

\subsection*{6.1.1 Comparison with another HLS algorithm}

In order to evaluate how fair is the solution we obtain for dedicated interconnections (cardinality \(>2\) ), which will be used to evaluate the effect on wirelength of buslets, we have compared


Figure 46: Comparison between greedy-binding method and ours for buslet cardinality 2.
these results with those obtained from another common sense algorithm. This is structured as follows: as scheduling algorithm it will use FDS only while, as binding, a greedy approach which tries to maximize as much as possible the reuse of the same already created interconnections. However, the special case of our algorithm for buslet cardinality of 2 achieves lower wirelenghts than the latter FDS and greedy-binding method by about \(55 \%\) on an average, and thus we compare the buslet results (cardinality \(>2\) ) to our version of the dedicated interconnect results (Fig. 46). This is possible thanks to the efficacy of our FDCS algorithm, which allows us to minimize the number of interconnections needed in the binding step. So, all the comparisons will be performed here on will be fair.


Figure 47: Comparison between two different output chip layouts for smooth_color_z_ triangle (197 nodes, 196 edges) with max FU fanin and fanout constraints of 8. (a) Dedicated-interconnections based design (area: \(4788.06 \mathrm{~mm}^{2}\), FUs: 34, WL: \(2847.29 \mu \mathrm{~m}\) ). (b) Buslet-based design with max cardinality 5 (area: \(3147.09 \mathrm{\mu m}^{2}\), FUs: 27, WL: \(1169.09 \mu \mathrm{~m}\) ).

\subsection*{6.1.2 Discussion of results}

For all the computed solutions a latency constraint of 1.6 x DFG's asap time was imposed. Maximum fanin and maximum fanout constraints were set to 8 . \(\lambda_{\text {com }}\) was set to 1 clock cycle. With these constraints, using buslets instead of dedicated interconnections, we obtained a wirelength reduction ranging between \(35 \%\) and \(71 \%\) as shown in Table III. According to these data, the wiring complexity was significantly reduced. We can visually notice (Fig. 47) this comparing the output of the floorplanner for max buslet cardinality of 5 (Fig. 47b) with the dedicated interconnections solution (Fig. 47a). Let us focus, for example, on the FU "H1":


Figure 48: Average, maximum and peak wire efficiencies gained using NSP1, \(\lambda_{t o t}=1.6 \cdot \operatorname{asap}(D F G), \operatorname{maxFU}\) fanin \(/\) fanout \(=8\).


Figure 49: Runtime data and curve fitting for the designed HLS algorithms
in Fig. 47a it is placed on the top right of the floorplan and we can count 7 wires it is connected to, while in Fig. 47b, where it is placed in the low left, it is connected through 3 buslets only. Such a reduction happens for all the FUs. Same thing happens for "F0", which is the most connected FU in Fig. 47a: its number of interconnections scales of a factor greater than 2 in Fig. 47b. We can also see a wirelength reduction trend with the increase of maximum buslet cardinality in Fig. 50 and Table III. According to this, if we look at how average wire efficiency (Table IV and Fig. 48) changes as a function of buslet cardinality, we see that it has a growing trend as buslet cardinality increases. In particular, we can see that best results are obtained propagating more than one solution: comparing NSP20 to NSP1 (Table III), we have a reduction in wirelength up to \(8 \%\). Empirically we can clearly see that we obtained the best results using SIMCOM binding, which is followed by LA1. We can explain this thinking that LA1 will look at the next communication only, while, in a design, there may be more than two overlapping communications. This empirically states that simultaneously binding iso-scheduled communications is the best strategy, leading to the best results.

If we look at runtimes (Table I), we notice that runtimes are mainly affected by the number of edges in the graph as opposed to the number of nodes:
- For NSP1 and YSP1 runtime \(t=0.06 e^{2}\), where \(e\) is the number of edges of the input DFG
- For NSP20 and YSP20 runtime \(t=0.98 e^{2}\)
- For LA runtime \(t=0.74 e^{2}\)
- For SIMCOM runtime \(t=0.44 e^{2}\)
- For FDB runtime \(t=0.02 e^{3}-2.85 e^{2}+91 e\)

This was also stated during the theoretical analysis of the designed algorithms in Ch. 3 and Ch. 4: in particular, as our scheduling and binding algorithms are buslet and, for instance, communication-centric, the number of edges in the DFG, which represents also the number of communications to be both scheduled and bound, is the main parameter which affects the runtime. If we look at the runtime in Fig. 49, we see that all the designed techniques, apart from FDB, empirically have a quadratic best fit. This is consistent with the theoretical analysis: we expected all of these being \(O\left(k n e^{2}\right)\) where \(k\) is a factor differing for all the designed algorithms. Also FDB behavior is consistent with the theoretical analysis as it was predicted being \(O\left(n^{4} e^{4}\right)\). Among these, we can say that the algorithm providing the best wirelength (SIMCOM) is also the second in speed (the fastest is simple CB). FDB has a cubic best fit and, for this, it is fast for low number of edges but, beyond 120 edges, increases rapidly, already having, for \(e \approx 200\), the worst runtime.

Looking at total chip area (Table II, Fig. 51), we can observe that, for buslet cardinality lower than 6, increasing buslet cardinality, a decreasing trend occurs. This can be explained by observing that the number of FUs used is higher for lower buslet cardinalities. This is an effect of the maximum fanin/fanout constraint. If we are using dedicated interconnections, we can connect a given \(F U_{i}\) to at most max_fanin + max_fanout other FUs. However, if we use buslets of maximum cardinality \(\max \{|b|\}\), as the fanin/fanout constraint applies to buslets, we will be able to connect at most (max_fanin + max_fanout) \(\cdot(\max \{|b|\}-1)\) FUs to \(F U_{i}\) while respecting the actual fanin/fanout constraint. For this, a reduction of the number of FUs and


Figure 50: Total wirelength averages
thus layout area is expected.

\subsection*{6.2 Power estimation}

All the following power estimations are performed on the outputs of NSP1. When computing the power, we are assuming supply voltage 1.2 V and working frequency 1 GHz . We have analyzed the following approaches:
- \(S B P\) : standard buffer placement
- MST-2B: MST with2 buffer placed per wire
- MST-1B: MST with 1 buffer placed per wire


Figure 51: Total area averages
- MST-LOGD: MST with constraint on the maximum hop distance between FUs, which has to be logarithmic
- MST-BF3: MST with branching factor 3 (branching factor set to 3 )
- MAX-D: buslet with constraint on maximum physical distance between each FU
- MAX-D-MAXHOP3: buslet with constraint on maximum physical distance between each FU and maximum hop distance 3
- HP: hierarchical partitioning

The technology node assumed is 22 nm with unit length wire capacitance \(0.11 \frac{\mathrm{fF}}{\mu \mathrm{m}}[28]\), input capacitance for an inverter \(\left(C_{L}^{\text {inv }}\right) 0.08 \mathrm{fF}[29]\) and leakage current \(\left(I_{\text {leak }}^{\text {inv }}\right) 3.8 \mathrm{pA}[29]\). In order
to compute the dynamic power to drive an FU we first define its dynamic energy per operation as
\[
\begin{equation*}
D E_{o p}^{F U}=\frac{1}{2} \cdot V_{d d}^{2} \cdot C_{L}^{F U} \cdot p(0 \rightarrow 1) \tag{6.1}
\end{equation*}
\]
where
- \(V_{d d}\) is the supply voltage.
- \(C_{L}^{F U}\) is the sum of the input loads of all the logic gates of the FU.
- \(p(0 \rightarrow 1)\) is the probability that the driving signal will indeed be a 0 to 1 transiting signal. Probabilities of 0 and 1 of a signal being uniform, \(p(0 \rightarrow 1)=\frac{1}{4}\); however, since this assumption is generally incorrect, and depends on the logic of the path generating the signal, in our general formulation, we set \(p(0 \rightarrow 1)\) to be a little higher at \(\frac{1}{3}\).

From (6.1) we can define the dynamic power for a given FU as
\[
\begin{equation*}
D P^{F U}=D E_{o p}^{F U} \cdot f \cdot \varepsilon_{F U} \tag{6.2}
\end{equation*}
\]
where
- \(f\) is the working frequency.
- \(\varepsilon_{F U}\) is the fraction of cc's over the latency period in which it will be "busy" (potentially driven by a \(0 \rightarrow 1\) transiting signal).

Leakage power is computed as
\[
\begin{equation*}
P_{\text {leak }}^{F U}=V_{d d} \cdot I_{\text {leak }}^{F U} \tag{6.3}
\end{equation*}
\]


Figure 52: Logic gates composing a FA
where \(I_{\text {leak }}^{F U}\) is the leakage current.
Tristate buffers are realized with two inverters and a transmission gate (TG) cascaded. For this, we can say that the total input load of tristate buffers is \(3 \cdot C_{L}^{i n v}\) of an inverter (input for the two inverters and for the transmission gate) while the leakage power consumed is \(2 \cdot P_{\text {leak }}^{\text {inv }}\) (we ignore the leakage current of a T-gate since we estimate that it has much smaller leakage than an inverter because its path to ground goes through the gate of at least one transistor). Let us derive the leakage and dynamic power also for a full adder (FA). We know that a FA is made of two 2 -inputs XOR gates, three 2 -inputs AND gates and one 3-inputs OR gate. For each of these, we know that the dynamic power depends on the number of fanins of each gate. For this, we can say that, according to the FA design we take into account (Fig. 52), the input load for a FA is \(15 \cdot C_{L}^{i n v}\). From this, dynamic power is obtained using (6.1) and (6.2). Talking about leakage power, we can see that the equivalent resistance will always be comparable to that of


Figure 53: SRAM cell
the inverter, as \(R_{o f f} \gg R_{\text {on }}\), where \(R_{\text {off }}\) is the drain-source resistance when the transistor is off and \(R_{o n}\) is the drain-source resistance when transistor is on. For this, for each gate, we have 1x the leakage power consumed by an inverter. As we have 6 different gates, at the end we will have a leakage power consumption of \(6 \cdot P_{l e a k}^{i n v}\).

Let us assume we are working with 16 -bit data. Assuming that for arithmetic operations we will use a ripple carry adder (RCA), its total input load will be \(240 \cdot C_{L}^{i n v}\) and its leakage power \(96 \cdot P_{\text {leak }}^{i n v}\). If we want to determine the leakage and dynamic power for an array multiplier, we know that it requires, in our case, \(16^{2}\) FAs. Hence, its input load will be \(3840 \cdot C_{L}^{i n v}\) and its leakage power \(1536 \cdot P_{\text {leak }}^{i n v}\).

Finally, let us analyze an SRAM cell power consumption: as it is made of two inverters and two pass transistors, its power consumption is the same as the tristate buffer. Now, let us assume we will use \(16 \times 256\) memories. Its input load in write will be \(48 \cdot C_{L}^{i n v}\) (as we will write 16 bits
only), its input load in read \(16 \cdot C_{L}^{\text {inv }}\) (pass transistors on the address line activated only) while the leakage power will be \(8192 \cdot P_{\text {leak }}^{i n v}\).

Let us compute, now, the lowest allowable switching time \(t_{s w}^{\min }\). From our former analysis, we can see that the most critical FU we will use is the multiplier, as it will have a critical path crossing 32 FAs. Let us assume here that the total wirelength of the interconnections in a FA to be \(\left(\frac{1}{10}\right)^{\text {th }}\) the average wirelength of a buslet. From experimental data, we know that it is 80 \(\mu \mathrm{m}\). Furthermore, as we know that \(R_{w i r e}\) is \(17 \frac{\Omega}{\mu \mathrm{~m}}[30]\), the total delay in a single FU due to wiring is \(t_{\text {wire }}^{F U}=11.9 \mathrm{ps}\). Now, in the entire multiplier, this will be \(t_{\text {wire }}^{m u l}=380.8 \mathrm{ps}\). Let us analyze, now, transistors switching time. We know that, in a FA, the critical path crosses 2 logic gates. Furthermore, the longest switching time is \(t_{r d}=0.48 \mathrm{ps}[29]\). Hence, the total transistors switching time in a FA is \(t_{\text {tran }}^{F A}=1.44 \mathrm{ps}\) and, for the multiplier, it will be \(t_{t r a n}^{m u l}=46.08 \mathrm{ps}\). Hence, the lowest switching time we may theoretically have is \(t_{s w}^{m i n}=426.88 \mathrm{ps}\). From this, we can deduct that the theoretical highest frequency is \(f_{\max }=2.34 \mathrm{GHz}\). We want to use a frequency which is below the theoretical maximum one: this is why we want to choose \(f=1\) GHz.

Let us analyze the total power consumption (Table XI, Fig. 54) for different buslet designs. From this, we can see that SBP, for any buslet cardinality, is the worst approach for the power metric. We could expect this because, with such a structure, the signal propagates through the whole buslet wirelength. We can see that, in general, the best approaches are here MST-2B, MST-1B and MST-LOGD. Hence, here looks like that the total wirelength plays a fundamental role for the final power consumption (Table XIII, Fig. 55).


Figure 54: Total power consumption


Figure 55: Total wirelengths


Figure 56: Dynamic power contribution of FUs

Let us analyze the contribution of FUs power. Dynamic power (Table IX, Fig. 56) is constant through all the buslet cardinalities. We have, as input for all our algorithms, a DFG, whose essential components are operation nodes. Each operation node represents an operation to be executed, regardless the number of bound FUs, within the total design latency \(\bar{\lambda}\). Even if we minimize the number of FUs, all the operation nodes will be bound to a FU; so, all of them will always be executed. Hence, dynamic power consumed by FUs does not change for any buslet cardinality or FU fanin/fanout constraint. However, leakage power (Table X, Fig. 57) depends on the number of the bound FUs. As stated in Par. 6.1, the number of FUs decreases when the buslet cardinality increases. Hence, the leakage power decreases and, as dynamic power is constant, also total power consumed by FUs decreases.


Figure 57: Leakage power contribution of FUs

Let us analyze now, in detail, all the power contributions of the buslet. Dynamic power, in particular, receives contributions from both wire capacitance and buffer input capacitance. If we analyze the contribution of wire capacitance (Table VI, Fig. 58), we see, as we could expect, that we have the highest contribution from SBP (as previously stated, the signal will be propagated through all the buslet). This is the upper bound for MST-based interconnect structures. Then, we see that we have the lowest contribution for MST-2B. As it will isolate just the shortest path the signal needs to propagate through, this is the lower bound for MST-based approaches.

Let us analyze, now, the contribution to dynamic power given by tristate buffers (Table VII, Fig. 60). Here we see that SBP is the lowest (we expect this as their number is in the worst

Total dyna̧̧aic power consumed by interconnects (wire capacitance contribution)


Figure 58: Dynamic power contribution of wire


Figure 59: Maximum distance between any two connected FUs


Figure 60: Dynamic power contribution of tristate buffers input load
case the number of FUs connected through a buslet) while the worst is MST-2B.
Analyzing leakage power (Table VIII, Fig. 61), which is here the leakage power consumed by tristate buffers, we see that we have the lowest contribution in SBP while the highest in both MST-2B and HP. leakage power is here directly proportional to the number of buffers. We can see that HP, which had a lower dynamic power consumption than MST-2B, here shows similar power values. This is due to the branching factor each steiner node has. In HP it will always be 3 and, for this, every time a signal crossing a steiner node will load 3 buffers while, in MST-2B, such a parameter is not precised; hence, dynamic power may be higher. Furthermore, with HP we will meet, through the signal shortest path, at most \(2 \log _{2}(n)\) steiner points while, for MST-2B, it will be, in the worst case, \(2(n-1)\).

The total power consumed by the buslet is the sum of all of these three contributions (we


Figure 61: Leakage power contribution of tristate buffers


Figure 62: Percentage change in total power


Figure 63: Total interconnect power
can see this in Table XI and Fig. 63). Hence, we see that the worst values are held by SBP while the best are for MST-LOGD and MST-2B, as the wire power consumption contribution is here the most relevant.

If we want to analyze the percentage change of power from the dedicated interconnects case (Table XII, Fig. 62), we see that, for cardinality 3, we have a reduction of \(0.19 \%\) on the total power, for MST-2B and MST-1B. This is due to the fact that buslet power consumption does not increase as much as FUs leakage power decreases. Apart from this, however, we have a sublinear increment in the total power, due to the fact that FUs leakage power does not decrease significantly. We have, for cardinality 8 , an increase in total power consumption of
about \(0.8 \%\) for all the techniques apart from SBP, whose power increment is \(1.43 \%\). Regarding the wirelength (Fig. 55), we see that as expected SBP has the best values while MST-BF is second best ranked.

\section*{CHAPTER 7}

\section*{CONCLUSIONS}

In this thesis, we proposed a novel and innovative interconnection paradigm for HLS designs, based on the use of flexibly-structured buslets in place of dedicated interconnections. Moreover, a new scheduling algorithm, Force Directed Communication Scheduling, was developed to perform communication scheduling (to the best of our knowledge, the first time such an operation has been performed in HLS) with the goal of minimizing the number of buslets. A number of buslet-centric binding algorithms were also proposed, with the aim of minimizing wirelength under maximum buslet cardinality and maximum FU fanin/fanout constraints. A "similar solutions pruning" heuristic was also developed, whose task is to avoid the propagation of similar solutions during binding. Finally, a number of buslet designs with multi-tri-stated points and different routing techniques were developed and implemented with the aim of minimizing either the total power consumed by buslets or the maximum communication distance (in order to meet latency constraints without increasing the number of FUs or buslets). Experimental results show that the use of buslets results in an average wirelength reduction in range of \(35 \%\) to \(71 \%\) across a set of 14 tested DFGs. This clearly establishes the efficacy of buslet-based interconnections in reducing interconnect complexity. Finally, power dissipation grows sublinearly with maximum buslet cardinality. In particular, for high cardinalities, we have found a buslet design technique that results in logarithmic growth in power with buslet cardinality. Further challenges in this direction involve the optimization of dynamic power
consumption also for buslets with low cardinalities (as opposed to achieving logarithmic power growth asymptotically, i.e., for buslets with high cardinalities).

\section*{APPENDIX}

\section*{EXPERIMENTAL DATA COLLECTED}

Here we provide all the tables with various empirical data that have been referenced in Chapter 6.

TABLE I: AVERAGE RUNTIMES FOR DIFFERENT DFG SIZES
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline & & \multicolumn{7}{|c|}{Average runtimes [s]} \\
\hline DFG name & nodes/ edges & NSP1 & NSP20 & YSP1 & YSP20 & LA1 & SIMCOM & FDB \\
\hline hal & 11/8 & 0.06 & 0.169 & 0.015 & 0.015 & 0.122 & 0.068 & 0.004 \\
\hline horner & 18/16 & 0.171 & 3.42 & 0.249 & 0.608 & 0.98 & 0.552 & 0.105 \\
\hline arf & 28/30 & 0.452 & 11.159 & 0.734 & 9.063 & 3.009 & 3.213 & 0.162 \\
\hline ewf & 34/47 & 1.921 & 32.609 & 3.156 & 46.163 & 29.579 & 20.509 & 1.004 \\
\hline smooth & 51/52 & 1.78 & 36.926 & 2.626 & 46.095 & 17.491 & 11.948 & 0.414 \\
\hline motion & 53/50 & 1.982 & 29.277 & 3.015 & 23.55 & 30.429 & 13.753 & 1.449 \\
\hline feedback & 53/50 & 12.834 & 246.226 & 20.348 & 296.99 & 187.059 & 88.734 & 20.066 \\
\hline collapse & 56/73 & 15.361 & 186.836 & 24.611 & 228.616 & 66.494 & 79.581 & 8.409 \\
\hline write & 106/88 & 94.048 & 1438.797 & 91.209 & 1194.191 & 379.791 & 237.063 & 41.464 \\
\hline interpolate & 108/104 & 565.312 & 15717.174 & 826.798 & 14555.136 & 11780.99 & 5773.47 & 6027.027 \\
\hline matmul & 109/116 & 336.527 & 8193.599 & 542.706 & 7172.703 & 7361.628 & 2294.239 & 4492.863 \\
\hline idctcol & 114/164 & 1164.966 & 18102.956 & 1222.529 & 16008.73 & 11513.62 & 3501.898 & 6324.749 \\
\hline jpeg & 134/169 & 1728.064 & 29170.12 & 1848.222 & 27840.8 & 23334.3 & 24555.86 & 33360.879 \\
\hline smooth & 197/196 & 2829.357 & 49515.27 & 2941.344 & 45300.13 & 34534.22 & 16134.999 & 65808.893 \\
\hline
\end{tabular}

\section*{APPENDIX (Continued)}

TABLE II: AVERAGE AREA
\begin{tabular}{|c|c|c|c|c|c|c|c|}
\hline & \multicolumn{7}{|c|}{ Average area \(\left[\mathrm{\mu m}^{2}\right]\)} \\
\hline buslet card & NSP1 & NSP20 & YSP1 & YSP20 & LA1 & SIMCOM & FDB \\
\hline 2 & 34140 & 33780 & 34138 & 34393 & 31527 & 31976 & 28520 \\
3 & 32682 & 31427 & 32710 & 31698 & 28330 & 28850 & 27240 \\
4 & 31154 & 29776 & 31110 & 30103 & 28168 & 28412 & 26931 \\
5 & 30183 & 29595 & 30281 & 29822 & 27306 & 27273 & 26329 \\
6 & 29423 & 28769 & 30044 & 29413 & 26832 & 26500 & 26044 \\
7 & 28871 & 29504 & 28818 & 28679 & 26754 & 26378 & 25906 \\
8 & 28749 & 28908 & 29340 & 28490 & 27077 & 26877 & 25849 \\
\hline
\end{tabular}

TABLE III: AVERAGE WIRELENGTH
\begin{tabular}{|c|c|c|c|c|c|c|c|}
\hline & \multicolumn{7}{|c|}{ Average wirelength \([\mu \mathrm{m}]\)} \\
\hline buslet card & NSP1 & NSP20 & YSP1 & YSP20 & LA1 & SIMCOM & FDB \\
\hline 2 & 12890.23 & 11829.41 & 12903.97 & 12056.56 & 11951.75 & 11891.36 & 11822.00 \\
3 & 8210.82 & 7721.33 & 8355.09 & 7728.88 & 7724.66 & 7588.27 & 7834.83 \\
4 & 7419.55 & 7069.31 & 7203.90 & 6889.79 & 6872.57 & 6801.19 & 7212.15 \\
5 & 5911.16 & 5684.90 & 5795.64 & 5384.46 & 5513.70 & 5305.71 & 5718.99 \\
6 & 4877.09 & 4506.85 & 4989.47 & 4564.82 & 4534.18 & 4348.19 & 4789.79 \\
7 & 4301.18 & 4021.56 & 4240.95 & 3930.40 & 3991.34 & 3815.23 & 3981.63 \\
8 & 3632.94 & 3467.75 & 3639.42 & 3569.15 & 3595.23 & 3422.67 & 3480.06 \\
\hline
\end{tabular}

TABLE IV: AVERAGE WIRE EFFICIENCY
\begin{tabular}{|c|c|c|c|}
\hline buslet card & \begin{tabular}{c} 
Average wire efficiency [\%] \\
\(\operatorname{avg}\left(d f g_{j}\left(\operatorname{avg}\left(\varepsilon\left(b_{i}\right)\right)\right)\right.\)
\end{tabular} & \begin{tabular}{c}
\(\operatorname{Max}\) wire efficiency [\%] \\
\(\operatorname{avg}\left(d f g_{j}\left(\max \left(\varepsilon\left(b_{i}\right)\right)\right)\right.\)
\end{tabular} & \begin{tabular}{c} 
Peak wire efficiency[\%] \\
\(\max \left(d f g_{j}\left(\max \left(\varepsilon\left(b_{i}\right)\right)\right)\right.\)
\end{tabular} \\
\hline 2 & 0.6 & 4.3 & 9.8 \\
3 & 0.7 & 6.8 & 10.6 \\
4 & 0.9 & 8.5 & 14.9 \\
5 & 1.8 & 11.1 & 20.7 \\
6 & 3.5 & 12.4 & 23.3 \\
7 & 4.1 & 14.7 & 25.5 \\
8 & 3.6 & 17.2 & 27.7 \\
\hline
\end{tabular}

\section*{APPENDIX (Continued)}

TABLE V: AVERAGE WIRELENGTH FOR DIFFERENT BUSLET STRUCTURES
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline & \multicolumn{8}{|c|}{ Average wirelength \([\mu \mathrm{m}]\)} \\
\hline \begin{tabular}{c} 
buslet \\
card
\end{tabular} & SBP & MST-2B & MST-1B & MST-LOGD & MST-BF3 & HP & MAX-D & \begin{tabular}{c} 
MAX-D \\
MAXHOP3
\end{tabular} \\
\hline 2 & 12043.56 & 12043.56 & 12043.56 & 12043.56 & 12043.56 & 12043.56 & 12043.56 & 12043.56 \\
3 & 8290.99 & 8290.99 & 8290.99 & 10135.06 & 10135.06 & 8988.84 & 9143.08 & 9514.66 \\
4 & 7877.12 & 7877.12 & 7877.12 & 8695.93 & 8695.93 & 9101.65 & 9609.30 & 10314.31 \\
5 & 6830.58 & 6830.58 & 6830.58 & 8055.14 & 8419.98 & 8205.02 & 9111.87 & 9910.66 \\
6 & 5756.40 & 5756.40 & 5756.40 & 7229.11 & 7305.61 & 7125.79 & 8252.74 & 9063.80 \\
7 & 5338.84 & 5338.84 & 5338.84 & 7164.80 & 7157.40 & 6787.59 & 8051.16 & 9004.20 \\
8 & 5094.44 & 5094.44 & 5094.44 & 6103.21 & 6109.72 & 6577.24 & 8066.88 & 9062.46 \\
\hline
\end{tabular}

TABLE VI: TOTAL DYNAMIC POWER CONSUMED BY INTERCONNECTS - WIRE CAPACITANCE CONTRIBUTION
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline & \multicolumn{8}{|c|}{ Power [nW] } \\
\hline \begin{tabular}{c} 
buslet \\
card
\end{tabular} & SBP & MST-2B & MST-1B & MST-LOGD & MST-BF3 & HP & MAX-D & \begin{tabular}{c} 
MAX-D \\
MAXHOP3
\end{tabular} \\
\hline 2 & 325614.48 & 302265.00 & 302265.00 & 302265.00 & 302265.00 & 302265.00 & 302265.00 & 302265.00 \\
3 & 499326.47 & 333705.21 & 333705.21 & 393170.09 & 393170.09 & 368010.09 & 361843.91 & 379925.31 \\
4 & 627904.34 & 359291.88 & 378154.14 & 381928.38 & 381928.38 & 427339.17 & 418111.59 & 449245.76 \\
5 & 757494.36 & 377306.02 & 411995.67 & 422277.06 & 458542.19 & 471866.75 & 464719.65 & 501538.98 \\
6 & 852962.00 & 390877.94 & 429326.25 & 440535.92 & 486823.15 & 503781.09 & 487184.72 & 524452.74 \\
7 & 965694.90 & 410624.50 & 455353.79 & 461668.42 & 519260.68 & 542027.72 & 513271.87 & 549435.31 \\
8 & 1051381.50 & 419680.27 & 467384.43 & 465556.54 & 496738.44 & 561435.89 & 524144.84 & 559028.99 \\
\hline
\end{tabular}

TABLE VII: TOTAL DYNAMIC POWER CONSUMED BY INTERCONNECTS TRISTATE BUFFER CONTRIBUTION
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline & \multicolumn{9}{|c|}{ Power [nW] } \\
\hline \begin{tabular}{c} 
buslet \\
card
\end{tabular} & SBP & MST-2B & MST-1B & MST-LOGD & MST-BF3 & HP & MAX-D & \begin{tabular}{c} 
MAX-D \\
MAXHOP3
\end{tabular} \\
\hline 2 & 31166.38 & 24774.01 & 24774.01 & 24774.01 & 24774.01 & 36912.77 & 24774.01 & 24774.01 \\
3 & 42498.09 & 59350.18 & 59350.18 & 59350.18 & 59350.18 & 57788.22 & 59350.18 & 59350.18 \\
4 & 50518.95 & 84559.50 & 74394.82 & 71130.52 & 71130.52 & 75012.09 & 73117.42 & 71737.93 \\
5 & 60317.44 & 105972.78 & 87644.58 & 82131.07 & 84163.36 & 89995.06 & 84064.25 & 81427.78 \\
6 & 69564.78 & 125518.43 & 99759.03 & 91352.18 & 93900.22 & 102141.99 & 93949.53 & 90398.45 \\
7 & 77806.02 & 144498.12 & 111680.10 & 100120.85 & 102304.29 & 115057.36 & 104669.25 & 100012.91 \\
8 & 87322.46 & 158367.21 & 120174.26 & 112690.69 & 112834.14 & 120794.21 & 110792.24 & 106931.97 \\
\hline
\end{tabular}

\section*{APPENDIX (Continued)}

TABLE VIII: TOTAL STATIC POWER CONSUMED BY INTERCONNECTS - TRISTATE BUFFER LEAKAGE
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline & \multicolumn{9}{|c|}{ Power [nW] } \\
\hline \begin{tabular}{c} 
buslet \\
card
\end{tabular} & SBP & MST-2B & MST-1B & MST-LOGD & MST-BF3 & HP & MAX-D & \begin{tabular}{c} 
MAX-D \\
MAXHOP3
\end{tabular} \\
\hline 2 & 69.97 & 83.42 & 83.42 & 83.42 & 83.42 & 96.45 & 83.42 & 83.42 \\
3 & 52.98 & 78.72 & 78.72 & 70.39 & 70.39 & 101.05 & 77.28 & 76.52 \\
4 & 51.19 & 108.88 & 91.00 & 81.27 & 81.27 & 111.92 & 87.89 & 83.90 \\
5 & 46.87 & 114.27 & 89.08 & 76.39 & 78.86 & 113.81 & 83.60 & 78.63 \\
6 & 43.66 & 115.26 & 87.29 & 71.36 & 76.60 & 113.06 & 81.36 & 75.79 \\
7 & 42.03 & 119.14 & 87.89 & 69.32 & 74.39 & 114.63 & 81.39 & 74.25 \\
8 & 42.09 & 122.41 & 89.93 & 77.83 & 81.32 & 115.69 & 82.56 & 73.73 \\
\hline
\end{tabular}

TABLE IX: DYNAMIC POWER CONSUMED BY FUS
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline & \multicolumn{8}{|c|}{ Power [nW] } \\
\hline \begin{tabular}{c} 
buslet \\
card
\end{tabular} & SBP & MST-2B & MST-1B & MST-LOGD & MST-BF3 & HP & MAX-D & \begin{tabular}{c} 
MAX-D \\
MAXHOP3
\end{tabular} \\
\hline 2 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 \\
3 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 \\
4 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 \\
5 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 \\
6 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 4948688.04 & 494868.04 \\
7 & 494868.04 & 494868.04 & 4948688.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 \\
8 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 & 494868.04 \\
\hline
\end{tabular}

TABLE X: STATIC POWER CONSUMED BY FUS
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline & \multicolumn{8}{|c|}{ Power [nW] } \\
\hline \begin{tabular}{c} 
buslet \\
card
\end{tabular} & SBP & MST-2B & MST-1B & MST-LOGD & MST-BF3 & HP & MAX-D & \begin{tabular}{c} 
MAX-D \\
MAXHOP3
\end{tabular} \\
\hline 2 & 359267.20 & 359267.20 & 359267.20 & 359267.20 & 359267.20 & 359267.20 & 359267.20 & 359267.20 \\
3 & 356903.60 & 356903.60 & 356903.60 & 356903.60 & 356903.60 & 356903.60 & 356903.60 & 356903.60 \\
4 & 357093.60 & 357093.60 & 357093.60 & 357093.60 & 357093.60 & 357093.60 & 357093.60 & 357093.60 \\
5 & 356736.40 & 356736.40 & 356736.40 & 356736.40 & 356736.40 & 356736.40 & 356736.40 & 356736.40 \\
6 & 356196.80 & 356196.80 & 356196.80 & 356196.80 & 356196.80 & 356196.80 & 356196.80 & 356196.80 \\
7 & 356022.00 & 356022.00 & 356022.00 & 356022.00 & 356022.00 & 356022.00 & 356022.00 & 356022.00 \\
8 & 356029.60 & 356029.60 & 356029.60 & 356029.60 & 356029.60 & 356029.60 & 356029.60 & 356029.60 \\
\hline
\end{tabular}

\section*{APPENDIX (Continued)}

TABLE XI: TOTAL POWER
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline & \multicolumn{8}{|c|}{ Power [nW] } \\
\hline \begin{tabular}{c} 
buslet \\
card
\end{tabular} & SBP & MST-2B & MST-1B & MST-LOGD & MST-BF3 & HP & MAX-D & \begin{tabular}{c} 
MAX-D \\
MAXHOP3
\end{tabular} \\
\hline 2 & 861340.83 & 860759.44 & 860759.44 & 860759.44 & 860759.44 & 861015.24 & 860759.44 & 860759.44 \\
3 & 862661.11 & 859711.47 & 859711.47 & 860892.43 & 860892.43 & 860388.66 & 860272.80 & 860633.67 \\
4 & 865581.29 & 860947.54 & 861103.62 & 861104.09 & 861104.09 & 862120.58 & 861874.11 & 862465.22 \\
5 & 868007.55 & 861384.28 & 861686.32 & 861768.99 & 862537.41 & 862955.49 & 862663.72 & 863342.41 \\
6 & 869559.04 & 861508.03 & 861733.83 & 861773.96 & 862755.91 & 863296.36 & 862768.89 & 863437.65 \\
7 & 871802.09 & 862111.64 & 862318.61 & 862195.15 & 863395.73 & 864146.38 & 863330.26 & 863953.26 \\
8 & 873713.81 & 862581.00 & 862738.75 & 862540.42 & 863170.41 & 864657.93 & 863678.95 & 864290.59 \\
\hline
\end{tabular}

TABLE XII: PERCENTAGE CHANGE IN TOTAL POWER
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline & \multicolumn{9}{|c|}{ Power [\%] } \\
\hline \begin{tabular}{c} 
buslet \\
card
\end{tabular} & SBP & MST-2B & MST-1B & MST-LOGD & MST-BF3 & HP & MAX-D & \begin{tabular}{c} 
MAX-D \\
MAXHOP3
\end{tabular} \\
\hline 2 & 0.00 & -0.07 & -0.07 & -0.07 & -0.07 & -0.04 & -0.07 & -0.07 \\
3 & 0.15 & -0.19 & -0.19 & -0.05 & -0.05 & -0.11 & -0.12 & -0.08 \\
4 & 0.49 & -0.05 & -0.03 & -0.03 & -0.03 & 0.09 & 0.06 & 0.13 \\
5 & 0.77 & 0.01 & 0.04 & 0.05 & 0.14 & 0.19 & 0.15 & 0.23 \\
6 & 0.95 & 0.02 & 0.05 & 0.05 & 0.16 & 0.23 & 0.17 & 0.24 \\
7 & 1.21 & 0.09 & 0.11 & 0.10 & 0.24 & 0.33 & 0.23 & 0.30 \\
8 & 1.44 & 0.14 & 0.16 & 0.14 & 0.21 & 0.39 & 0.27 & 0.34 \\
\hline
\end{tabular}

TABLE XIII: MAXIMUM DISTANCE BETWEEN FUS
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline & \multicolumn{8}{|c|}{ distance \(\mu \mathrm{m}]\)} \\
\hline \begin{tabular}{c} 
buslet \\
card
\end{tabular} & SBP & MST-2B & MST-1B & MST-LOGD & MST-BF3 & HP & MAX-D & \begin{tabular}{c} 
MAX-D \\
MAXHOP3
\end{tabular} \\
\hline 2 & 252.4321 & 252.4321 & 252.4321 & 252.4321 & 252.4321 & 252.4321 & 252.4321 & 252.4321 \\
3 & 391.4565 & 391.4565 & 391.4565 & 472.7913 & 472.7913 & 354.1469 & 387.966 & 431.638 \\
4 & 476.5574 & 476.5574 & 476.5574 & 485.4096 & 485.4096 & 418.2369 & 472.6802 & 537.204 \\
5 & 568.4289 & 568.4289 & 568.4289 & 582.4692 & 647.0399 & 488.6677 & 558.275 & 642.5311 \\
6 & 635.792 & 635.792 & 635.792 & 643.9562 & 724.1355 & 535.0732 & 618.4402 & 707.064 \\
7 & 697.9911 & 697.9911 & 697.9911 & 702.9376 & 786.1211 & 576.6508 & 667.4153 & 761.1182 \\
8 & 715.8234 & 715.8234 & 715.8234 & 702.7463 & 754.152 & 589.9182 & 680.1763 & 766.8425 \\
\hline
\end{tabular}

\section*{CITED LITERATURE}
1. Shockley, W.: The path to the conception of the junction transistor. Electron Devices, IEEE Transactions on, 23(7):597-620, 1976.
2. Turgeon, M. L.: Clinical hematology: theory and procedures, volume 936. Lippincott Williams \& Wilkins, 2005.
3. Moore, G. E.: Cramming more components onto integrated circuits, reprinted from electronics, volume 38 , number 8, april 19, 1965, pp. 114 ff. Solid-State Circuits Society Newsletter, IEEE, 11(5):33-35, 2006.
4. Wolf, S., Awschalom, D., Buhrman, R., Daughton, J., Von Molnar, S., Roukes, M., Chtchelkanova, A. Y., and Treger, D.: Spintronics: a spin-based electronics vision for the future. Science, 294(5546):1488-1495, 2001.
5. Moodera, J. S., Kinder, L. R., Wong, T. M., and Meservey, R.: Large magnetoresistance at room temperature in ferromagnetic thin film tunnel junctions. Physical Review Letters, 74(16):3273, 1995.
6. Yuasa, S., Nagahama, T., Fukushima, A., Suzuki, Y., and Ando, K.: Giant roomtemperature magnetoresistance in single-crystal fe/mgo/fe magnetic tunnel junctions. Nature materials, 3(12):868-871, 2004.
7. Ikeda, S., Miura, K., Yamamoto, H., Mizunuma, K., Gan, H., Endo, M., Kanai, S., Hayakawa, J., Matsukura, F., and Ohno, H.: A perpendicular-anisotropy cofeb-mgo magnetic tunnel junction. Nature materials, 9(9):721-724, 2010.
8. Stockman, M. I.: Nanofocusing of optical energy in tapered plasmonic waveguides. Physical review letters, 93(13):137404, 2004.
9. Wang, Z. L., Gao, R. P., Pan, Z. W., and Dai, Z. R.: Nano-scale mechanics of nanotubes, nanowires, and nanobelts. Advanced Engineering Materials, 3(9):657, 2001.
10. Liu, D. and Svensson, C.: Power consumption estimation in cmos vlsi chips. Solid-State Circuits, IEEE Journal of, 29(6):663-670, 1994.

\section*{CITED LITERATURE (Continued)}
11. Altunyurt, N., Aygun, K., Doran, K., and Mekonnen, Y.: Mitigation of far-end crosstalk induced by routing and out-of-plane interconnects, July 4 2013. WO Patent App. PCT/US2011/067,976.
12. Bhooshan, R., Kuve, S., and Puvvada, V.: Electro-migration (em) and voltage (ir) drop analysis of integrated circuit (ic) designs, January 1 2008. US Patent 7,315,992.
13. Wu, B. and Sherwani, N. A.: Effective buffer insertion of clock tree for high-speed vlsi circuits. Microelectronics journal, 23(4):291-300, 1992.
14. Fishburn, J. P.: Shaping a vlsi wire to minimize elmore delay. In Proceedings of the 1997 European conference on Design and Test, page 244. IEEE Computer Society, 1997.
15. Benini, L., De Micheli, G., Macii, E., Sciuto, D., and Silvano, C.: Address bus encoding techniques for system-level power optimization. In Design, Automation and Test in Europe, 1998., Proceedings, pages 861-866. IEEE, 1998.
16. Zhong, L. and Jha, N. K.: Interconnect-aware low-power high-level synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24(3):336-351, 2005.
17. Gu, Z., Wang, J., Dick, R. P., and Zhou, H.: Unified incremental physical-level and highlevel synthesis. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 26(9):1576-1588, 2007.
18. Herrmann, D. and Ernst, R.: Improved interconnect sharing by identity operation insertion. In Proceedings of the 1999 IEEE/ACM international conference on Computer-aided design, pages 489-493. IEEE Press, 1999.
19. Park, N. and Kurdahi, F. J.: Module assignment and interconnect sharing in registertransfer synthesis of pipelined data paths. In Computer-Aided Design, 1989. ICCAD-89. Digest of Technical Papers., 1989 IEEE International Conference on, pages 16-19. IEEE, 1989.
20. Ewering, C.: Automatic high level synthesis of partitioned busses. In Computer-Aided Design, 1990. ICCAD-90. Digest of Technical Papers., 1990 IEEE International Conference on, pages 304-307. IEEE, 1990.

\section*{CITED LITERATURE (Continued)}
21. Vigneswaran, T., Mukundhan, B., and Reddy, P. S.: A novel low power, high speed 14 transistor cmos full adder cell with \(50 \%\) improvement in threshold loss problem. Enformatika Trans. Eng. Comp. Tech, 13:82-85, 2006.
22. Paulin, P. G. and Knight, J. P.: Force-directed scheduling for the behavioral synthesis of asics. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 8(6):661-679, 1989.
23. Prim, R. C.: Shortest connection networks and some generalizations. Bell system technical journal, 36(6):1389-1401, 1957.
24. Kruskal, J. B.: On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical society, 7(1):48-50, 1956.
25. Dijkstra, E. W.: A note on two problems in connexion with graphs. Numerische mathematik, 1(1):269-271, 1959.
26. Lee, C., Potkonjak, M., and Mangione-Smith, W. H.: Mediabench: a tool for evaluating and synthesizing multimedia and communicatons systems. In Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, pages 330-335. IEEE Computer Society, 1997.
27. Ng, A. N., Markov, I. L., Aggarwal, R., and Ramachandran, V.: Solving hard instances of floorplacement. In Proceedings of the 2006 international symposium on Physical design, pages 170-177. ACM, 2006.
28. Nakamura, N., Oda, N., Soda, E., Hosoi, N., Gawase, A., Aoyama, H., Tanaka, Y., Kawamura, D., Chikaki, S., Shiohara, M., et al.: Feasibility study of 70 nm pitch cu/porous low-k d/d integration featuring euv lithography toward 22 nm generation. In Electron Devices Meeting (IEDM), 2009 IEEE International, pages 1-4. IEEE, 2009.
29. Prathima, A., Bailey, K., and Gurumurthy, K.: Impact of device parameters of triple gate soi-finfet on the performance of cmos inverter at 22 nm . International Journal of VLSI Design \& Communication Systems, 3(5):79, 2012.
30. Nakamura, N., Takigawa, Y., Soda, E., Hosoi, N., Tarumi, Y., Aoyama, H., Tanaka, Y., Kawamura, D., Ogawa, S., Oda, N., et al.: Design impact study of wiring size and barrier metal on device performance toward 22 nm -node featuring euv lithogra-

\section*{CITED LITERATURE (Continued)}
phy. In Interconnect Technology Conference, 2009. IITC 2009. IEEE International, pages \(1 \overline{4-16 \text {. IEEE, } 2009 .}\)
31. Goldstine, H. H. and Goldstine, A.: The electronic numerical integrator and computer (eniac). Mathematical Tables and Other Aids to Computation, pages 97-110, 1946.
32. Kilby, J. S.: Invention of the integrated circuit. Electron Devices, IEEE Transactions on, 23(7):648-654, 1976.
33. Wolfe, T.: the tinkerings of robert noyce. Esquire (December 1983), pages 346-374, 1983.
34. Moore, G. E. et al.: Progress in digital integrated electronics. IEDM Tech. Digest, 11, 1975.
35. Donze, R., Erickson, K., Hovis, W., Sheets, J., Tetzloff, J., and Zumbrunnen, L.: Fin fet diode structures and methods for building, September 17 2004. US Patent App. 10/944,624.
36. Fried, D. M., Nowak, E. J., Rainey, B. A., and Sadana, D. K.: Fin fet devices from bulk semiconductor and method for forming, November 4 2003. US Patent 6,642,090.
37. Chang, J., Guillorn, M., Haensch, W., and Saenger, K.: Fin field effect transistor devices with self-aligned source and drain regions, November 26 2013. US Patent 8,592,280.
38. Cea, S., Mehandru, R., Shifren, L., and Kuhn, K.: Wrap-around contacts for finfet and tri-gate devices, June 23 2011. US Patent App. 12/646,651.
39. Bernstein, K., Sleight, J., and Yang, M.: Hybrid crystal orientation cmos structure for adaptive well biasing and for power and performance enhancement, December 8 2009. US Patent 7,629,233.
40. Adkisson, J., Dunbar, T., Gambino, J., and Leitch, M.: Graphene field effect transistor, June 20 2013. WO Patent App. PCT/US2012/063,508.
41. Jego, C., Casseau, E., and Martin, E.: Interconnect cost control during high-level synthesis. In Proceedings of Design Circuits \& Integrated Systems Conference, pages 507-512, 2000.

\section*{CITED LITERATURE (Continued)}
42. Krishnan, V. and Katkoori, S.: Clock period minimization with iterative binding based on stochastic wirelength estimation during high-level synthesis. In VLSI Design, 2008. VLSID 2008. 21st International Conference on, pages 641-646. IEEE, 2008.

\section*{VITA}

Name: Enzo Tartaglione

Education: Master of Science in Electrical and Computer Engineering University of Illinois at Chicago, USA, July 2015

Master of Science in Electronic Engineering Politecnico di Torino, Italy, July 2015

Alta Scuola Politecnica student
Italy

Bachelor of Science in Electronic Engineering
Politecnico di Torino, Italy, October 2013```


[^0]:    ${ }^{1}$ A human blood cell (erythrocyte) has approximately a $50 \mu \mathrm{~m}^{2}$ area[2]. From this, we can see that we have three magnitude orders of difference between the size of an erythrocyte and a modern transistor.
    ${ }^{2}$ Nowadays the used technology is 14 nm (as stated before) while we know that a silicon atom has a Van Der Waals radius of 220 pm . This means that a transistor is realized with less than a hundred of silicon atoms. Furthermore, we can not have transistors too much close each other in order to avoid mutual electric field influence.

[^1]:    ${ }^{1}$ A floorplan of an IC is a schematic representation of the placement of its FUs. Floorplans are created during the so called floorplanning design stage, which usually is the first step in the physical design phase.
    ${ }^{2}$ We define scheduling as the step in the HLS in which we are deciding when a given operation should happen. It usually is the first step of HLS.
    ${ }^{3}$ We define binding as the step in HLS in which we map each operation to a given FU. It usually is performed after scheduling step.

[^2]:    ${ }^{1}$ Connecting distant FUs is more difficult than connecting close ones; for this, designs with few very distant connections may result in more complex interconnections than those with more but shorter.

[^3]:    ${ }^{1}$ We define latency of a given design as the number of clock cycles needed for a complete execution.
    ${ }^{2}$ From our collected empiric data, even trying to maximize such a parameter, we obtained peak values below 0.1 (Table IV)

[^4]:    ${ }^{1}$ We define communication latency the time interval necessary to complete the communication from its begin.

[^5]:    ${ }^{1}$ Actually, this is not always true. In particular, let us assume $F U_{i}, F U_{j} \in b_{n}$ where $b_{n}$ is a buslet. If $b_{n}$ performs communication from $F U_{i}$ to $F U_{j}$ only, then $F U_{i}$ will not need input tristate buffer and $F U_{j}$ will not need output tristate buffer. Such a consideration will be taken into account in the estimation of the power consumed for communications.

[^6]:    ${ }^{1}$ With the term schedule we identify a pair operation node - clock cycle in which it is scheduled.

[^7]:    ${ }^{1}$ However, we have to state that, changing communication schedule, we may have effect on peak temperature on the realized device. Anyway, such a problem can not be taken directly into account at high level synthesis stage, as routing still does not exist.

[^8]:    ${ }^{1}$ This statement is always true if $u_{i}$ and $u_{j}$ need to be mapped to different types of FUs. If they can be mapped to the same FU, such a communication will never take place as data is already in internal registers of such an FU. We are assuming such an optimization will automatically be performed.

[^9]:    ${ }^{1}$ Here with resource we mean both FU or buslet.

[^10]:    ${ }^{2}$ Note here that, as we are moving in chronological order, we will analyze each time all the possible binding solutions for the same communication $\operatorname{com}_{i}$. For this reason, we will have as Q sets as it is the total latency $\bar{\lambda}$

[^11]:    ${ }^{1}$ at a certain level $k$ we may prune a solution because worse than others for $\Delta_{1}$ WL but, in the next $\phi-k$ levels, while this pruned solution worsened of a factor $\Delta_{2}$, all the solutions held at level $k$ worsened of factor $\Delta_{3}>\Delta_{1}+\Delta_{2}$

[^12]:    ${ }^{1}$ We define completely connected a graph $G$ having $n$ nodes, each of them directly connected to all the other $n-1$ nodes. Such a graph will have $\frac{n^{2}-n}{2}$ edges in total.
    ${ }^{2}$ We define a graph G being undirected when, for each couple of nodes $u_{i}, u_{j} \in G$, if exists an edge connecting $u_{i}$ to $u_{j}$, then it will also exist an edge connecting $u_{j}$ to $u_{i}$.

[^13]:    ${ }^{1}$ We define a graph G being connected if, $\forall u_{i}, u_{j} \in G$, exists a path connecting them.

