Timing and Area Optimization for VLSI Circuit and Layout

Wei-Tong Chuang
**Title:** Timing and Area Optimization for VLSI Circuit and Layout

**Personal Author(s):** Chuang, Wei-Tong

**Type of Report:** Technical

**Time Covered:** FROM 8/91 TO 4/94

**Date of Report:** May 10, 1994

**Page Count:** 149

**Subject Terms:**
- Time-area optimization
- VLSI layout
- Linear programming
- Delay estimation
- Sequential circuits
- Physical design
- Placement
- Gate sizing

**Abstract:**
(Attached)

**Distribution/Availability of Abstract:**
- Unclassified/Unlimited
- Same as report
- DTIC Users

**Abstract Security Classification:** Unclassified

**Name of Responsible Individual:**

**Telephone:**

**Office Symbol:**

---

**Form:** DD Form 1473, JUN 86

**Security Classification of This Page:** Unclassified
This thesis considers two problems in computer-aided design of VLSI circuits: (1) discrete gate sizing and (2) timing-driven placement improvement.

The discrete gate-sizing problem is described as follows. A standard cell library typically contains several versions of any given gate type, each of which has a different gate size. We consider the problem of choosing optimal gate sizes from the library to minimize a cost function (such as total circuit area) while meeting the timing constraints imposed on the circuit. After presenting an efficient solution algorithm for combinational circuits, we examine the problem of minimizing the area of a synchronous sequential circuit for a given clock period specification. This is done by appropriately selecting a size for each gate in the circuit and by adjusting the delays between the central clock distribution node and individual flip-flops. Existing methods treat these two problems separately, which may lead to very suboptimal solutions in some cases. We develop a novel unified approach to tackle them simultaneously. We also address the problem of making this work applicable to very large synchronous sequential circuits by partitioning these circuits to reduce the computational complexity.

Traditionally, gate sizing is performed before the actual physical design steps are performed. A drawback of such an approach is that the interconnect wire lengths are not available at the gate-sizing stage. The gate sizes selected to be optimal at that stage may no longer be optimal later in the physical design process in which large interconnect capacitances are introduced at the output of each gate. To remedy this problem, we propose a novel algorithm which performs delay and area optimization for a given compact placement by resizing and relocating cells in the circuit layout. The algorithm combines gate sizing with the placement adjustment procedure into one formulation. Since the gate-sizing procedure is embedded within the placement adjustment process, interconnect capacitance information is included in the gate-size selection process.
TIMING AND AREA OPTIMIZATION FOR VLSI CIRCUIT AND LAYOUT

BY

WEI-TONG CHUANG

B.S., National Taiwan University, 1986
M.S., University of Maryland at College Park, 1989

THESIS

Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 1994

Urbana, Illinois
This thesis considers two problems in computer-aided design of VLSI circuits:
(1) discrete gate sizing and (2) timing-driven placement improvement.

The discrete gate-sizing problem is described as follows. A standard cell library typically contains several versions of any given gate type, each of which has a different gate size. We consider the problem of choosing optimal gate sizes from the library to minimize a cost function (such as total circuit area) while meeting the timing constraints imposed on the circuit. After presenting an efficient solution algorithm for combinational circuits, we examine the problem of minimizing the area of a synchronous sequential circuit for a given clock period specification. This is done by appropriately selecting a size for each gate in the circuit and by adjusting the delays between the central clock distribution node and individual flip-flops. Existing methods treat these two problems separately, which may lead to very suboptimal solutions in some cases. We develop a novel unified approach to tackle them simultaneously. We also address the problem of making this work applicable to very large synchronous sequential circuits by partitioning these circuits to reduce the computational complexity.

Traditionally, gate sizing is performed before the actual physical design steps are performed. A drawback of such an approach is that the interconnect wire lengths are not available at the gate-sizing stage. The gate sizes selected to be optimal at that stage
may no longer be optimal later in the physical design process in which large interconnect capacitances are introduced at the output of each gate. To remedy this problem, we propose a novel algorithm which performs delay and area optimization for a given compact placement by resizing and relocating cells in the circuit layout. The algorithm combines gate sizing with the placement adjustment procedure into one formulation. Since the gate-sizing procedure is embedded within the placement adjustment process, interconnect capacitance information is included in the gate-size selection process.
ACKNOWLEDGMENTS

I would like to express my sincere appreciation and thanks to my advisor, Professor Ibrahim Hajj, for all of his assistance, technical or otherwise, patience and support that he has provided during my studies. I would like to acknowledge him for sharing his knowledge, insight and experience with me. I would also like to express my thanks to Professor Sachin Sapatnekar of the Iowa State University for initially suggesting the gate-sizing problem and for many fruitful discussions on various matters relating to the contents of this thesis. Thanks are also due to Professors Chung Laung Liu, Farid Najm, and Resve Saleh for serving on my committee.

I am grateful to my family for all of their help, support and tutelage. I would also like to thank all of my friends (who are too numerous to enumerate here), particularly Yao-Jen Chang, Shu-Ling Cheng, Terry Lee, Ping-Chung Li, Jaidip Singh, and Chin-Chi Teng for making my stay in Champaign-Urbana a pleasant one.

Finally, I thank all of the members of the Digital and Analog Circuits Group at the Coordinated Science Laboratory of the University of Illinois.
# TABLE OF CONTENTS

<table>
<thead>
<tr>
<th>CHAPTER</th>
<th>PAGE</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>1 INTRODUCTION</strong></td>
<td>1</td>
</tr>
<tr>
<td>1.1 Introduction</td>
<td>1</td>
</tr>
<tr>
<td>1.2 The Process of Electronic System Design</td>
<td>2</td>
</tr>
<tr>
<td>1.2.1 System design</td>
<td>2</td>
</tr>
<tr>
<td>1.2.2 Logic design</td>
<td>3</td>
</tr>
<tr>
<td>1.2.3 Circuit design</td>
<td>4</td>
</tr>
<tr>
<td>1.2.4 Physical design</td>
<td>5</td>
</tr>
<tr>
<td>1.3 Design Styles</td>
<td>5</td>
</tr>
<tr>
<td>1.4 Standard-cell Design</td>
<td>8</td>
</tr>
<tr>
<td>1.5 Discrete Gate-Sizing Problem</td>
<td>9</td>
</tr>
<tr>
<td>1.5.1 Optimization for combinational circuits</td>
<td>9</td>
</tr>
<tr>
<td>1.6 Optimization for Sequential Circuits</td>
<td>11</td>
</tr>
<tr>
<td>1.7 Performance-driven Placement</td>
<td>12</td>
</tr>
<tr>
<td>1.8 Organization of the Thesis</td>
<td>15</td>
</tr>
<tr>
<td><strong>2 DISCRETE GATE-SIZING PROBLEM</strong></td>
<td>16</td>
</tr>
<tr>
<td>2.1 Introduction</td>
<td>16</td>
</tr>
<tr>
<td>2.2 Previous Work</td>
<td>20</td>
</tr>
<tr>
<td>2.3 Problem Formulation</td>
<td>23</td>
</tr>
<tr>
<td>2.3.1 Formulation of delay constraints</td>
<td>23</td>
</tr>
<tr>
<td>2.3.2 Formulation of the linear program</td>
<td>29</td>
</tr>
<tr>
<td>2.4 Phase II: The Mapping Algorithm</td>
<td>32</td>
</tr>
<tr>
<td>2.4.1 Implicit enumeration approach</td>
<td>33</td>
</tr>
<tr>
<td>2.4.2 Global implicit enumeration approach</td>
<td>38</td>
</tr>
<tr>
<td>2.5 Phase III: The Adjusting Algorithm</td>
<td>42</td>
</tr>
<tr>
<td>2.6 Experimental Results</td>
<td>43</td>
</tr>
<tr>
<td>2.7 Conclusions</td>
<td>47</td>
</tr>
<tr>
<td><strong>3 OPTIMIZATION FOR SYNCHRONOUS SEQUENTIAL CIRCUITS</strong></td>
<td>50</td>
</tr>
<tr>
<td>3.1 Introduction</td>
<td>50</td>
</tr>
<tr>
<td>3.2 Previous Work on Clock Skew Optimization</td>
<td>54</td>
</tr>
<tr>
<td>3.2.1 Minimize $P$ subject to clocking constraints</td>
<td>57</td>
</tr>
<tr>
<td>3.2.2 Maximize minimum margin for error</td>
<td>58</td>
</tr>
<tr>
<td>3.3 Formulation of Constraints</td>
<td>59</td>
</tr>
<tr>
<td>3.4 Symbolic Propagation of Constraints</td>
<td>61</td>
</tr>
</tbody>
</table>
### 3.5 Satisfying Short-Path Delay Constraints

Page 64

### 3.6 Experimental Results

Page 69

### 3.7 Comment and Conclusions

- **3.7.1 Clock tree routing**
  - Page 70
- **3.7.2 Conclusions**
  - Page 74

### 4 PARTITIONING FOR OPTIMIZATION

- **4.1 Introduction**
  - Page 77
- **4.2 Previous Work on Partitioning**
  - **4.2.1 Iterative partitioning**
    - Page 80
  - **4.2.2 Spectral partitioning**
    - Page 84
- **4.3 Sanchis' Multiple-way Partitioning Algorithm**
  - Page 87
- **4.4 Synchronous Sequential Circuit Partitioning**
  - Page 89
- **4.5 Experimental Results and Conclusion**
  - Page 96

### 5 DELAY AND AREA OPTIMIZATION FOR PLACEMENT

- **5.1 Introduction**
  - Page 99
- **5.2 Previous Work**
  - Page 103
- **5.3 Timing-Driven Placement with Gate Sizing**
  - **5.3.1 Physical constraints**
    - Page 106
  - **5.3.2 Timing and sizing constraints**
    - Page 108
  - **5.3.3 Objective function**
    - Page 109
  - **5.3.4 Slot constraints**
    - Page 109
  - **5.3.5 Final LP**
    - Page 110
- **5.4 A Unified Algorithm for Adjusting Placement and Gate Sizing**
  - Page 111
- **5.5 Experimental Results**
  - Page 116
- **5.6 Conclusion**
  - Page 118

### 6 CONCLUSIONS

- **6.1 Future Work**
  - Page 120

### APPENDIX A EXTRACTING PARAMETERS FROM A LIBRARY

Page 125

### REFERENCES

Page 129

### VITA

Page 136
# LIST OF TABLES

<table>
<thead>
<tr>
<th>Table</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.1 Number of $(s, p)$-gates.</td>
<td>9</td>
</tr>
<tr>
<td>2.1 Performance comparison of GALANT with Lin's algorithm and simulated annealing.</td>
<td>45</td>
</tr>
<tr>
<td>2.2 Execution times for the Linear Program and the Mapping and Adjusting Algorithms.</td>
<td>46</td>
</tr>
<tr>
<td>2.3 Performance comparison of GALANT with Lin's algorithm and simulated annealing for c432.</td>
<td>47</td>
</tr>
<tr>
<td>3.1 Experimental results of the symbolic constraints propagation algorithm for ISCAS89 benchmark circuits.</td>
<td>71</td>
</tr>
<tr>
<td>3.2 Performance comparison with and without clock skew optimization for ISCAS 89 benchmark circuits.</td>
<td>72</td>
</tr>
<tr>
<td>3.3 Improving possible clocking speeds using clock skew optimization.</td>
<td>73</td>
</tr>
<tr>
<td>4.1 Performance comparison of the partitioning procedure.</td>
<td>97</td>
</tr>
<tr>
<td>5.1 Experimental results of PRECISE.</td>
<td>116</td>
</tr>
</tbody>
</table>
LIST OF FIGURES

Figure | Page
-------|-----
1.1 A typical IC design process. | 3
1.2 Engineering trade-offs among different design styles. | 6
1.3 Advantage of gate sizing together with placement. | 14
2.1 (a) A chain of three inverters. (b) Effect of transistor sizes on delay for the three-inverter chain. | 17
2.2 An example illustrating calculation of the output load capacitance of a gate. | 24
2.3 Top view of the geometry of a typical transistor. | 25
2.4 Approximating gate terminal capacitance by an affine function. | 26
2.5 Surface plots of the function $g(w,z) = z/w$ from two different viewpoints. | 27
2.6 Approximation of the function $1/w$ by a piecewise linear function. | 28
2.7 Approximating gate area by an affine function. | 30
2.8 An example illustrating the definition of $m_t$. | 31
2.9 An example illustrating the construction of a state-space tree in the mapping algorithm. | 36
2.10 Comparison of Galant and Lin's algorithm against simulated annealing for c432. | 48
3.1 The advantages of nonzero clock skew. | 51
3.2 An example illustrating the definition of a synchronous block. | 52
3.3 A synchronous sequential system. | 55
3.4 Double-clocking and zero-clocking. | 56
3.5 The symbolic constraints propagation algorithm. | 62
3.6 An example illustrating symbolic delay propagation algorithm. | 64
3.7 An example illustrating the definition of $G_{slack}$. | 67
3.8 The buffer insertion algorithm. | 68
3.9 An example illustrating buffer insertion algorithm. | 69
3.10 (a) A clock pin on a latch. (b) The modified model of a clock pin according to the optimal skew obtained from our algorithm. | 75
4.1 An example illustrating the concept of level gain. | 82
4.2 An example for multipin net model. | 84
4.3 An example illustrating the definition of an internal latch, a sequential block, and a boundary latch. | 90
4.4 Tightness factor. | 91
4.5 Example showing the definition of merit. | 92
4.6 Example showing calculation of connection numbers. | 94
4.7 Moving gain

5.1 Advantage of gate sizing together with placement
5.2 Approximating wire length using bounding box method for 2- and 3-pin nets
5.3 Approximating wire length using bounding box method for 4- and 5-pin nets
5.4 Approximating wire length using bounding box method
5.5 An outline of the Resizing and Relocation algorithm

6.1 Retiming and clock delay transformation

A.1 Calculating $\gamma$ and $\varepsilon$
A.2 Calculating $\alpha$ and $\beta$
A.3 Calculating $\tau_1$ and $\tau_2$
CHAPTER 1
INTRODUCTION

1.1 Introduction

The influence of very large scale integrated (VLSI) circuit technology on our society during the past decade has been overwhelming, in application areas ranging from consumer products to personal computers, to business management, to defense electronics. The functional capability of the modern integrated circuit (IC) has increased in scope and complexity exponentially with time over the past two decades. The exponential growth pattern in IC functions over time was first described by Gordon Moore [1], and the projection he made based on this pattern is known as Moore’s law.

The creation of large, complex electronic systems has grown beyond the capabilities of many engineers without the aid of computers. Successful completion of large design projects requires that computers be used in virtually all aspects of the design process. This trend toward automation will accelerate as improved circuit fabrication technologies permit higher levels of integration and as more powerful computers allow more sophisticated tools. These tools must span the spectrum of the design process, including partitioned design entry, logic synthesis, circuit design, circuit simulation and verification,
physical design, process simulation, and the design for testability and manufacturability. These tools are commonly implemented and termed as computer-aided design (CAD) or electronic design automation (EDA) programs. The evolution of integrated circuit development has become heavily dependent on the development of CAD and EDA resources for design support.

In this thesis, we examine two problems in the field of electronic design automation: (1) gate sizing for combinational circuits and sequential circuits, and (2) timing and area optimization for a compact placement. Before we go into details of our work, we give a brief description of the electronic systems design process.

### 1.2 The Process of Electronic System Design

A typical IC design process, shown in Figure 1.1, is composed of the four following phases [2, 3]: system design, logic design, circuit design, and physical design. They are briefly described in the following.

#### 1.2.1 System design

System design is the process of defining the circuit functionality and the input-output behavior. A behavior representation describes how a particular design should respond to a given set of inputs. Behavior may be specified by Boolean equations, tables of input and output values, or algorithms written in high-level computer languages, or hardware description languages (HDL).
As far as the physical aspect of the design is concerned, at this level one is concerned with connecting the major subsystems and communication interfaces with the external world; global wiring strategies; selecting layers for carrying global control, data, and power; placement of major subsystem; and routing strategies.

1.2.2 Logic design

Logic design is the process of transforming the register transfer level (RTL) specification of a design into a netlist of logic gates such as NAND gates, NOR gates, inverters, AOI gates and latches. This process begins with logic descriptions given by the RTL specification or generated by designers directly at the logic level, and optimizes the network of gates that are required to implement the function specified by the logic descriptions.
The design of random logic has objectives such as:

- minimize overall layout area of the fabricated chip;
- minimize critical path delay time;
- maximize testability of the synthesized logic.

Generally, a logic design system divides the design problem into two steps [4]:

- A technology-independent step, which manipulates general Boolean functions to optimize the logic, using algebraic and/or Boolean techniques.
- A technology-mapping step, which translates the technology-independent description derived in the first step to a set of logic gates that can be implemented in the design method of choice (e.g., standard-cells, gate-arrays, field-programmable gate arrays).

1.2.3 Circuit design

The circuit design phase concerns the electrical laws that govern the detailed behavior of the basic circuit elements such as transistors, resistors, capacitors, and inductors. It transforms the basic logic components into networks of transistors and interconnects.

Delay, power consumption, charge sharing problem, and reliability are among the major concerns in this phase.
1.2.4 Physical design

Physical design consists of transforming a circuit design description into a physical representation that can be used to manufacture the specified electronic circuit. Once the circuit description of a network is available, it can be converted into a layout. Behavioral or structural representations from the previous phases are transformed into geometric shapes that are used in the fabrication of the system. Placement and routing are the two major tasks in this phase.

Placement is the task of placing modules adjacent to each other on a chip to minimize area or delay. The placement procedure determines the locations of components within the circuit being designed, subject to the constraints imposed by the designers and the design rules imposed by the fabrication process and by physical principles.

Following placement, components are arranged on the chip, and the task remains to insert the electrical connections among the components to make them function correctly. A router takes a module placement and a list of connections and connects the components with wires.

Often, an iterative process of placement and routing is used to optimize certain objectives, such as performance or layout area, of the design.

1.3 Design Styles

There are various chip design options that may be used to implement a system design, such as sea-of-gate, gate array, standard-cell design, and full-custom design. These VLSI
Each design style requires different trade-offs and imposes different constraints on the chip physical design in an attempt to make the design more manageable, while maintaining sufficient design flexibility.

The trade-offs among these different approaches are illustrated in Figure 1.2 [5]. In the following, we briefly discuss the advantages and disadvantages of different design styles, namely, gate-array, standard-cell, and full-custom designs.

To develop a full-custom design, engineering groups are assembled to cover the wide range of skills required to design the part virtually from scratch. These groups may include experts in process engineering, device modeling, circuit design, physical design layout, logic design, and system architecture design. The final design is optimized for the best density and performance. However, the design turnaround time is usually large. In addition, due to the large design effort required, a full-custom design is desired only for
high volume, for which the initial engineering expense can be compensated over a long, active product life.

In the gate-array design, several of the lithographic patterning levels are standardized, except for the interconnections and via geometries. The devices used to implement circuit designs are prefabricated on a chip, but are left unconnected after the initial processing step. A circuit design/logic block is placed at a specified cell location by assigning the appropriate pattern of wires to coordinate inside that cell area; these wires connect the devices to form the selected logic gate. Different logic blocks are then connected to implement the desired logic function. The disadvantage of gate arrays is that they are not optimal for any task. There are usually blocks that are not used. Since block placement is done in advance, interconnect routing can become complex and the resulting long wires can slow down the circuit. Also, the design will not be compact since interblock spacing is fixed to allow worst-case routing needs. Another problem with the gate-array approach is that the transistor patterns are predefined. Therefore, the transistors cannot be tuned to the specified application. This leads to inferior performance compared to full-custom design.

Between these two extremes lies the standard-cell approach which strives for high design system support for chip physical design and the capability to locally optimize circuit designs using hand-crafted cells and layouts. This approach involves the use of a library of basic functional elements, each of which has been fully characterized. In the standard-cell design approach, a division is made between the tasks of handcrafting circuit designs and placing and wiring those circuit blocks together. This separation is
based on the assumption that the time-consuming task of handcrafting a custom layout is best restricted to small circuit designs only. The initial circuit design and layout are done once, and the resulting shapes stored in a technology library for repeated use across many designs.

1.4 Standard-cell Design

The standard-cell approach has the advantage of greatly simplifying the automated synthesis process because it separates the synthesis system from the details of cell layout issues. The cell library presents models for individual cells, which are useful for performing circuit and timing analyses. With the aid of many advanced CAD tools, the performance of a standard-cell designed circuit is fairly high, while the design turnaround time is fast. Therefore, this approach has become the mainstay in Application-Specific Integrated Circuits (ASICs).

A crucial issue in the cell library approach is the size of the library. If the library is too small, much time is spent in converting the logic into a format that can be supported by the small library. On the other hand, if the library size is too large, the issues of database maintenance, pattern matching and searching become significant [6]. Moreover, the useful life of a library is relatively short as dictated by the lifetime of the technology in use [7]. For these reasons, the cell libraries tend to remain relatively small in size.

The prevalent use of complex gates such as AOI or OAI further complicates the library issue. As shown in Table 1.1 [8], as many as 3,503 different complex gates can be
Table 1.1 Number of \((s, p)\)-gates.

<table>
<thead>
<tr>
<th>((s, p))</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
<tr>
<td>2</td>
<td>2</td>
<td>7</td>
<td>18</td>
<td>42</td>
<td>90</td>
<td>186</td>
</tr>
<tr>
<td>3</td>
<td>3</td>
<td>18</td>
<td>396</td>
<td>1677</td>
<td>6877</td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>4</td>
<td>42</td>
<td>396</td>
<td>3503</td>
<td>28435</td>
<td>222943</td>
</tr>
<tr>
<td>5</td>
<td>5</td>
<td>90</td>
<td>1677</td>
<td>28435</td>
<td>425803</td>
<td>6084393</td>
</tr>
<tr>
<td>6</td>
<td>6</td>
<td>186</td>
<td>6877</td>
<td>222943</td>
<td>6084393</td>
<td>154793519</td>
</tr>
</tbody>
</table>

configured for \((s, p) = (4, 4)\), where the gates are constrained to have at most \(s\) transistors from output to ground and \(p\) transistors from output to power supply. This number dramatically increases to 425,803 for \((s, p) = (5, 5)\) and 154,793,519 for \((s, p) = (6, 6)\). It is apparent that a moderately sized library cannot support all of the possible circuit configurations for complex gates.

The other problem with the standard-cell approach is that even if the individual cells in the library are nearly optimal in performance and in terms of compactness of the layout, the whole circuit is often suboptimal after all cells are put together. For flexibility, many standard-cell libraries contain multiple versions of some cells with different driving powers.

1.5 Discrete Gate-Sizing Problem

1.5.1 Optimization for combinational circuits

In general, circuit delay and circuit area are the primary concerns of any logic design optimization. In many cases, a reduction in the number of stages (gates) between an input
and an output node can reduce the circuit area and delay. Such an optimization is usually made during the logic synthesis stage. This reduction is not, however, guaranteed to reduce the circuit delay. A standard-cell library typically contains several versions of any given gate type. Cells of identical gate type differ from each other in attribution, such as driving-capability, gate area, and input capacitive load. Because of these differences, the selection of cell versions for each individual gate in the circuit has a profound impact on the characteristics (i.e., delay, circuit area, and power consumption) of the whole circuit. By means of gate sizing, a fixed-topology logic circuit can be significantly optimized. In this thesis, we assume that a logic-level circuit description is provided, and the objective is to perform gate-size selection in an optimal way. The logic synthesis stage is usually performed before the technology mapping stage. Hence, we do not address this issue in this thesis.

Given a netlist of a logic circuit and a cell library, an automatic gate-size optimization algorithm chooses, from the library, one version of a logic gate for each cell such that

1. the total circuit delay is under a constraint and an objective function (such as circuit area or power consumption) is minimized;

2. the total circuit area (or power consumption) is under a constraint and the circuit delay is minimized.

The former is called the area (power) optimization problem while the latter is called the timing optimization problem. In this thesis, we concentrate on the first problem; specifically, we minimize total circuit area. It should be mentioned that the algorithm
presented in this thesis can be extend to the power optimization problem and timing optimization under area or power constraint as well.

1.6 Optimization for Sequential Circuits

Optimization for synchronous sequential circuits, on the other hand, is different from combinational circuit optimization. An additional degree of freedom is available to the designer in that one can set the time at which clock signals arrive at various flip-flops (FFs) in the circuit by controlling interconnect delays in the clock signal distribution network. With such adjustments, it is possible to change the delay specifications for the combinational stages of a synchronous sequential circuit to allow for better sizing.

After developing an optimization algorithm for combinational circuits in Chapter 2, we present an optimization technique for synchronous sequential circuits in Chapter 3. We examine the following problem: Given a clock period specification, how can the area of a synchronous sequential circuit be minimized by appropriately selecting a size for each gate in the circuit from a standard-cell library, and by adjusting the delays between the central clock and individual flip-flops?

In general, given a combinational subcircuit that lies between two FFs \( i \) and \( j \), with clock arrival times \( s_i \) and \( s_j \), respectively, we have the following relations:

\[
\begin{align*}
  s_i + \text{Maxdelay}(i,j) + T_{\text{setup}} &\leq s_j + P \\
  s_i + \text{Min delay}(i,j) &\geq s_j + T_{\text{hold}}
\end{align*}
\]
where \( \text{Mazdelay}(i,j) \) and \( \text{Mindelay}(i,j) \) are, respectively, the maximum and the minimum combinational delays between the two FFs, and \( P \) is the clock period. Fishburn [9] studied the clock skew problem under the assumption of constant combinational gate delays, and formulated the problem of finding the optimal clock period and the optimal skews as a linear program (LP). The objective was to minimize \( P \), with the constraints given by the inequalities in (1.1) and (1.2) above. In real design situations, however, \( P \) is dictated by system requirements, and the real problem is to reduce the circuit area.

We first consider optimizing circuits of moderate size. Then, in Chapter 4, we consider arbitrarily large synchronous sequential circuits for which the size of the formulated optimization problems becomes prohibitively large, and present a partitioning algorithm to handle such circuits. The partitioning algorithm is used to control the computational cost of the optimization problems. After the partitioning procedure, we can apply the optimization algorithm to each partitioned subcircuit individually.

### 1.7 Performance-driven Placement

To ensure that high-quality designs are produced, a CAD system must take two important issues into consideration while designing a circuit:

- Layout efficiency: producing a compact circuit layout.
- Performance: satisfying the timing specifications dictated by the clocking scheme.

With the increasing drive for high-performance chips, the timing-driven layout has become more and more important.
Conventional (area-driven) placement tools try to place modules in a chip to minimize the total wire length. However, as device geometries continue to shrink, interconnect delays become increasingly significant. As a result, the reduction of interconnect wire length, which heavily influences the interconnect delay, has become increasingly important.

Recently, there has been extensive research on performance-driven placement [10–13]. Performance-driven placement techniques can be broadly divided into two categories: net-oriented and path-oriented. In the net-oriented approach, the acceptable delay of each gate (cell) is calculated and translated into bounds on the delay associated with each net. These bounds then serve as constraints during the subsequent placement step. In the path-oriented approach, timing analyses of critical paths are performed dynamically during the placement step. All paths, or a subset of them, are taken into account implicitly in the formulation.

Conventionally, gate sizing is performed after technology mapping, and before the physical placement step. A drawback of such an approach is that accurate interconnect wire lengths are not available during the gate-sizing procedure. The gate size selected optimally at that stage may no longer be optimal after the physical design stage in which large interconnect capacitances are introduced at the output of each gate. To deal with this problem, an iteration procedure is usually followed. After global placement, the capacitance associated with each net is extracted, and the gate-sizing procedure is repeated. However, in such an iterative approach, the variation of net capacitance between iterations may be large and cause large perturbation in the solutions. Thus, a
number of iterations may be required, making this approach quite expensive. To deal with this problem, it is desirable that gate sizing and placement be incorporated into a single procedure.

As an illustration, consider a layout placement shown in Figure 1.3(a). Gate D fans out to gates $L_1$, $L_2$, and $L_3$. Assume that the delay of this circuit under such layout violates timing constraints imposed on it. Moreover, $D$ and $L_2$ lie on a long path whose delay exceeds the timing constraint. Conventional performance-driven placement would move $D$, $L_1$, $L_2$, and $L_3$ closer to each other to decrease the delay of gate $D$, as shown in Figure 1.3(b). This may increase the wire lengths of other nets attached to cells $D$, $L_1$, $L_2$, and $L_3$. But if automatic gate sizing is incorporated with performance-driven placement, a possible solution would be to replace $D$ with a template with a higher driving capacity, and $L_1$ with one with a smaller loading capacitance with respect to $D$. As a result,
some of the cells could be moved to better locations, as shown in Figure 1.3(c). The overall effect is a reduction of the long path delay, while the increase in area is kept to a minimum.

In Chapter 5, we propose an algorithm which combines the gate-sizing problem and performance-driven placement into one procedure. By considering these two problems together, the value of interconnect capacitance is known during the selection stage of the automatic sizing procedure. Therefore, optimal gate sizes can be chosen for each gate based on layout information, thus reducing the number of iterations required in the conventional approach.

1.8 Organization of the Thesis

Chapter 2 of the thesis deals with discrete gate sizing for combinational circuits. In Chapter 3, we formulate the synchronous sequential circuit area optimization problem and present the algorithms to tackle the problem. The partitioning algorithm presented in Chapter 4 allows us to handle large circuits. Chapter 5 of the thesis discusses a novel approach to timing-driven placement. Finally, concluding remarks are made in Chapter 6.
CHAPTER 2
DISCRETE GATE-SIZING PROBLEM

2.1 Introduction

The delay of a MOS integrated circuit can be tuned by appropriately choosing the sizes of transistors in the circuit. While a combinational MOS circuit in which all transistors have the minimum size has the smallest possible area, its circuit delay may not be acceptable. It is often possible to reduce the delay of such a circuit, at the expense of increased area, by increasing the sizes of certain transistors in the circuit. The optimization problem that deals with this area-delay trade-off is known as the sizing problem. In general, the interaction between the size of a certain gate and the delay of the whole circuit is very complicated. A larger cell usually has a larger driving capability and a larger input capacitive load. Therefore, using a large template tends to speed up the gate itself while slowing down the predecessor gates that drive it.

Example 2.1 Consider the chain of three CMOS inverters shown in Figure 2.1(a) [14]. Let the width of both the n-type and p-type transistors in gate 2 be \( w_2 \), and let \( D \) be the total delay through the three gates. Consider the effect of increasing \( w_2 \), while keeping
Figure 2.1 (a) A chain of three inverters. (b) Effect of transistor sizes on delay for the three-inverter chain.

the size of the transistors in gates 1 and 3 fixed. This causes the magnitude of the output current of gate 2 to increase, thus the time required, $d_2$, for gate 2 to drive its output signal will decrease monotonically (Figure 2.1(b)). However, increasing $w_2$ also increases the capacitive load on the output of gate 1, thus slowing down the output transition of the first gate. Beyond a certain point, $w_2 = A$, the total delay, $D$, starts to increase with respect to $w_2$, which shows the nonmonotonicity of the delay-area relationship.

The rationale for dealing with only combinational circuits in a world rampant with sequential circuits is as follows. A typical MOS digital integrated circuit consists of multiple stages of combinational logic blocks that lie between latches, clocked by system clock signals. Delay reduction must ensure that the worst-case delays of the combinational blocks are such that valid signals reach a latch in time for a transition in the signal clocking the latch. In other words, the worst-case delay of each combinational stage must be restricted to be below a certain specification.
The problem of continuous sizing, in which transistor sizes are allowed to vary continuously between a minimum size and a maximum size, has been tackled by several researchers [14–21]. The problem is most often posed as a nonlinear optimization problem, with nonlinear programming techniques used to arrive at the solution. The solutions found by these techniques are then rounded to the nearest integer. The continuous model works well for sizing transistors in a full-custom layout, but does not work well for designing with macrocells and standard cells, for which only a small number of choices are available. Transistor sizing on a gate array, where transistor sizes have to be multiples of the standard transistor, is also poorly realized by the continuous model.

A related problem that has received less attention is that of discrete or library-specific sizing. In this problem, only a limited number of size choices are available for each gate. This corresponds to the scenario in which a circuit designer is permitted to choose gate configurations for each gate type from within a standard-cell library. This problem is essentially a combinatorial optimization problem and has been shown to be NP-complete [22].

In this chapter, we present a new algorithm for solving the gate-sizing problem for combinational circuits that takes into consideration the variations of gate output capacitance with gate resizing. There are three phases in our algorithm. In the first stage, the gate-sizing problem is formulated as a linear program. The solution of this linear program provides us with a set of gate sizes that does not necessarily belong to the set of allowable sizes. Therefore, in the second phase, we move from the linear program solution to a set of allowable gate sizes, using heuristic techniques. In the third phase,
we further fine-tune the solution to guarantee that the delay constraints are satisfied.

Finally, to illustrate the efficacy of our algorithm, we present a comparison of the results of this technique with the solutions obtained by simulated annealing as well as by our implementation of the algorithm in [23].

It is worth mentioning that rounding solutions of the linear program to the nearest available sizes may not produce good solutions. In a tightly constrained problem, rounding continuous sizes to the nearest discrete size may not even give a feasible solution. The only reason that the continuous model works so well for transistor sizing is that the performance measures are rather insensitive to small changes in transistor sizes, and the steps between possible sizes are small compared to the sizes themselves. In our problem, however, the change between each step is large. Consequently, the solution obtained by rounding a linear program solution may violate timing constraints, or the objective value may be much larger than the optimal solution. Therefore, a more sophisticated algorithm is needed to handle the problem.

This chapter is organized as follows. We briefly describe previous approaches to the discrete gate-sizing problem in Section 2.2. Then we describe the linear programming approach that we propose in Section 2.3, followed by two postprocessing phases described in Sections 2.4 and 2.5. Experimental results are given in Section 2.6. Finally, we conclude this chapter in Section 2.7.
2.2 Previous Work

Chan [22] proposed a solution to the problem that was based on a branch-and-bound strategy. The algorithm is exact for Boolean tree networks. For general networks that are not tree-structured, a backtracking-based algorithm is proposed for finding a feasible solution. The algorithm for solving the optimal discrete sizing problem on a Boolean tree network consists of two phases. In the first phase, timing requirements for each vertex in the network are generated and propagated through the network. All of the timing requirements at the fan-in of each vertex are intersected to prune infeasible timing requirements of the vertex's predecessors. In the second phase, backward substitution is used to assign optimal sizes to each vertex to minimize the total cost. For general DAGs (directed acyclic graph), a cloning procedure is used to convert the DAG into an equivalent tree, whereby a vertex of fan-out $m$ is implicitly duplicated $m$ times, followed by a reconciliation step in which a single size that satisfies the requirements on all of the cloned vertices is selected. As pointed out in [24], this procedure does not necessarily provide the optimal solution for a general DAG; moreover, this algorithm is of exponential complexity in the worst case.

The approach of Lin et al. [23] uses a heuristic algorithm that is an adaptation of the TILOS algorithm [15] for continuous transistor sizing, with further refinements. The approach is based on a greedy algorithm that uses two measures known as sensitivity and criticality to determine which cell sizes are to be changed. The sensitivity of a cell indicates how much local delay per unit area can be decreased if we pick another
template for this specific cell, while criticality tells us whether a cell has to be replaced by a larger template to fulfill the delay constraints of the circuit. A weighted sum of a cell’s sensitivity and criticality is used to guide the algorithm to select a certain number of gates to be replaced with a different template. At the beginning of the algorithm, all cells in the circuit are set to their minimum sizes. The algorithm consists of a series of iterations, each iteration in turn having two phases. In the first phase (increasing phase), a quantum number of cells are replaced with larger templates, such that the delay constraints can be satisfied. After the timing constraints are satisfied, in the second phase (decreasing phase), a quantum number of cells are replaced with templates with smaller cell areas to reduce total circuit area. The value of quantum is determined emperically and is reduced by one-half over each iteration. The iteration continues until quantum becomes 1 or no improvement is possible. However, while the TILOS algorithm is known to work reasonably well for the continuous sizing case, the primary reason for its success is that the change in the circuit in each iteration is very small. On the other hand, in the discrete sizing case, any change must necessarily be a large jump, and a TILOS-like algorithm is likely to give very suboptimal results.

Another algorithm proposed by Li et al. [24] is exact for series-parallel circuits. A simple parallel circuit is a basic circuit that is comprised of several chains that have the same first and last module. A series-parallel circuit is a basic circuit recursively defined as [24]:

- A chain of basic modules is a series-parallel circuit.
• A simple parallel circuit is a series-parallel circuit.

• A circuit obtained from a series-parallel circuit $C$ by replacing any interconnect of $C$ by another series-parallel circuit is also a series-parallel circuit.

The algorithm uses a dynamic programming technique to find solutions for a chain of modules. For a simple parallel circuit, a number of transformations are repeated to obtain the optimal implementation. Finally, the optimal implementation of any series-parallel circuit is obtained by repeatedly using the chain and simple parallel circuit transformation on subcircuits of the given series-parallel circuit. This work is extended to nonseries-parallel circuits, whose structures are represented by general DAGs, and several heuristic techniques are used in conjunction with the algorithm, but no guarantees on optimality are made for such circuits. Moreover, their algorithm does not consider the capacitances of fan-out modules (gates). Therefore the results may not be accurate since, in reality, the gate delay is a function of the fan-out gate sizes as well.

Both of the above two approaches [23, 24] are heuristics, and hence no concrete statements can be made on how close their solutions are to the optimal solution. Moreover, neither work shows comparisons with a technique such as simulated annealing [25] that is known to give optimal or near-optimal solutions.

The algorithm proposed in [26] does use simulated annealing; however, since simulated annealing is computationally expensive, a technique for variable pruning is used by this algorithm to reduce the computational complexity. An initial configuration is obtained using an algorithm similar to TILOS [15]. The set of gates that are left at minimum size at
the end of this algorithm are eliminated from the parameter space, under the assumption that these cells would not be sized in the final configuration. The sizes of the remaining cells are determined using a simulated annealing algorithm. One argument against such an algorithm is that it would have very large runtimes for tight timing specifications, in which a large number of cells would be sized by the TILOS-like heuristic.

2.3 Problem Formulation

For a combinational circuit, the discrete gate-sizing problem is formulated as

\[
\text{minimize } \text{Area} \\
\text{subject to } \text{Delay} \leq T_{\text{spec}} \tag{2.1}
\]

Alternatively, we can formulate the following problem:

\[
\text{minimize } \text{Delay} \\
\text{subject to } \text{Area} \leq A_{\text{spec}} \tag{2.2}
\]

In this chapter, we concentrate on the first problem, although the same algorithm can be applied to the second problem with minor changes.

2.3.1 Formulation of delay constraints

The delay of a gate in a standard-cell library can be characterized by

\[
delay = R_{\text{out}} \times C_{\text{out}} + \tau = \frac{R_e}{w} \times C_{\text{out}} + \tau_1 \cdot w + \tau_2 \tag{2.3}
\]
Figure 2.2 An example illustrating calculation of the output load capacitance of a gate.

where $R_{\text{out}}$ is the equivalent resistance of the gate, $C_{\text{out}}$ is the load capacitance of the gate, $r$ is the intrinsic delay of the gate, $R_u$ represents the on-resistance of a unit transistor, and $w_i$ is called the nominal gate size of $g_i$. Therefore, the size of each gate can be parameterized by a number, $w$, referred to as the (nominal) gate size.

The output load capacitance of a gate can be calculated by summing the gate terminal capacitances of its fan-out gates and interconnect wiring capacitance, assuming that layout information is given. For the time being, we ignore the interconnect capacitance. In Chapter 5, we will discuss how to combine layout information with our formulation to obtain more accurate results.

Consider a gate $G_i$ which fans out to several gates including gate $G_j$, as shown in Figure 2.2. The output node of logic gate $i$ is connected to the n-type transistor $n_{ji}$ and p-type transistor $p_{ji}$ of logic gate $G_j$. Let the transistor $n_{ji}$ have the geometry as shown in Figure 2.3.

As illustrated in Figure 2.3, the parameter $L$ stands for the length of the channel, $x_{n_{ji}}$ is the channel width of transistor $n_{ji}$, and $d_d$ and $d_s$ refer to the lengths of the drain
Figure 2.3 Top view of the geometry of a typical transistor.

and source terminals. The gate terminal capacitance, $C_g$, of the transistor $n_{ji}$ can be expressed as

$$C_g = C_{GTA} \cdot L \cdot x_{n_{ji}} + 2 \cdot C_{GTP} \cdot (L + x_{n_{ji}})$$  \hspace{1cm} (2.4)$$

where

$C_{GTA}$ : Gate terminal area capacitance ($\text{pF}/\mu\text{m}^2$)

$C_{GTP}$ : Gate terminal perimeter capacitance ($\text{pF}/\mu\text{m}$)

Since the channel length, $L$, of transistors in a typical standard-cell library is fixed, the output load capacitance of logic gate $i$ with respect to logic gate $j$ can be expressed as

$$\text{cap}(i,j) = K_1 \cdot x_{n_{ji}} + K_2$$ \hspace{1cm} (2.5)$$

where

$$K_1 = C_{GTA} \cdot L + 2 \cdot C_{GTP}$$

$$K_2 = 2 \cdot C_{GTP} \cdot L$$ \hspace{1cm} (2.6)$$

In general, the gate terminal capacitances of a certain transistor in different versions of a logic gate may not be linearly proportional to the nominal size of that logic gate.
For example, Figure 2.4 shows a typical plot of the gate terminal capacitance of a certain transistor with respect to different sizes of a logic gate. Inspite of this, however, we can approximate the data points by an affine function using linear least-squares approximation, as shown in the figure. In other words, the output load capacitance of logic gate $i$ as seen by logic gate $j$ is

$$cap(i,j) = a_{ij} \cdot z_i + \beta_{ij}$$  \hspace{1cm} (2.7)

where $z_j$ is the size of logic gate $j$.

Therefore, the output load capacitance of gate $i$ can be found to be

$$C_{out} = cap(i,1) + cap(i,2) + \cdots + cap(i,f)$$

$$= a_{i1} \cdot z_1 + \beta_{i1} + a_{i2} \cdot z_2 + \beta_{i2} + \cdots + a_{if} \cdot z_f + \beta_{if}$$ \hspace{1cm} (2.8)

where $z_1, z_2, \ldots, z_f$ are the sizes of the cells which logic gate $i$ fans out to.

Thus, the delay function $D(w)$ of gate $i$ with nominal size $w$ can be represented as

$$D(w) = \frac{R_c}{w} \cdot C_{out} + \tau_1 \cdot w + \tau_2$$
Figure 2.5 Surface plots of the function $g(w, z) = z/w$ from two different viewpoints.

\[ = R_w \cdot \frac{a_{i1} \cdot z_1 + \beta_{i1} \cdots + a_{i.f} \cdot z_f + \beta_{i.f}}{w} + \tau_1 \cdot w + \tau_2 \quad (2.9) \]

Therefore the delay of a cell is a sum of functions of $g(w, z) = z/w$ and $h(w) = 1/w$. Figure 2.5 shows surface plots of the function $z/w$. Since the function $g(w, z) = z/w$ is relatively smooth, it can be approximated by a convex piecewise linear function with $q$ regions of the form

\[
PWL(w, z) = \begin{cases} 
  a_1 \cdot w + b_1 \cdot z + c_1 & (w, z) \in \text{Region } R_1 \\
  a_2 \cdot w + b_2 \cdot z + c_2 & (w, z) \in \text{Region } R_2 \\
  \vdots \\
  a_q \cdot w + b_q \cdot z + c_q & (w, z) \in \text{Region } R_q
\end{cases} \quad (2.10)
\]

\[= \max_{1 \leq i \leq q} (a_i \cdot w + b_i \cdot z + c_i) \quad \forall (x, y) \in \bigcup_{1 \leq i \leq q} R_i \quad (2.11)\]

The second equality follows from the first since $PWL(w, z)$ is convex.
Figure 2.6 Approximation of the function $1/w$ by a piecewise linear function.

The function $1/w$ is shown in Figure 2.6. Similarly, we can approximate the function $h(w) = 1/w$ with a convex piecewise linear function of the form

$$pwl(x) = \begin{cases} 
  d_1 \cdot w + e_1 & w \in \text{Region } r_1 \\
  d_2 \cdot w + e_2 & w \in \text{Region } r_2 \\
  \vdots \\
  d_q \cdot w + e_q & w \in \text{Region } r_q 
\end{cases}$$

(2.12)

Therefore, the gate delay $D(w, z_1, \ldots, z_f)$ of a gate with size $w$, and fan-out gate sizes $z_1 \cdots z_f$ can be represented using a convex piecewise linear function with $q$ regions, as follows:

\begin{align*}
D(w, z_1, \ldots, z_f) &= \max_{1 \leq j \leq q} (d_j \cdot w + e_j) \\
&\quad \forall w \in \bigcup_i r_i
\end{align*}

(2.13)
2.3.2 Formulation of the linear program

The formal definition of the gate-sizing problem for a combinational circuit is as given in (2.1). Since the objective function, namely, the area of the circuit, is difficult to estimate, we approximate it as the sum of the gate sizes, as has been done in almost all work on sizing [14-21].

Similarly, in general, the cell area of a logic gate may not be linearly proportional to the cell size, as shown in Figure 2.7. Nevertheless, we can approximate those data points by an affine function using linear least-squares approximation as shown in the figure.

Therefore, the cell area of a gate \( i \) can be expressed as

\[
\text{area}(i) = \gamma_i \cdot w_i + \varepsilon_i
\]  

(2.15)

where \( w_i \) is the nominal size of gate \( i \).

The delay specification states that all path delays must be bounded by \( T_{\text{spec}} \). Since the number of PI-PO paths could be exponential, the set of constraining delay equations could
potentially be exponential in the number of gates; unless certain additional variables, $m_i$, $i = 1 \ldots M$ (where $M$ is the number of gates), are introduced to reduce the number of constraints. The worst-case signal arrival time $m_i$ corresponds to the worst-case delay from the primary inputs to gate $i$. Using these variables, for each gate $i$ with delay $d_i$, we have

$$m_i = \max\{m_j + d_j \mid \forall j \in \text{Fanin}(i)\}$$

(2.16)

where $\text{Fanin}(i)$ is the set of fan-in gates of gate $i$. Equivalently, we have

$$m_j + d_i \leq m_i, \quad \forall j \in \text{Fanin}(i).$$

(2.17)

This reduces the number of constraining equations to $\sum_{i=1}^{M} \text{Fanin}(i)$, which, for most practical circuits, is of the order $O(M)$.

For example, consider a part of a circuit as shown in Figure 2.8. Gates 1, 2, and 3 fan out to gate 4. The worst-case signal arrival times of gate 1, 2, and 3 are 4.5, 3.0, and
Figure 2.8 An example illustrating the definition of $m_i$.

3.5, respectively. The gate delay of logic gate 4 is 0.3. Then the worst-case signal arrival time of gate 4 is $m_4 = \max(4.5, 3.0, 3.5) + 0.3 = 4.8$.

We now formulate the linear program as

$$\text{minimize} \quad \sum_{i=1}^{M} \gamma_i \cdot w_i$$

subject to  For all gates $i = 1 \cdots M$

$$m_j + d_i \leq m_i \quad \forall j \in \text{Fanin}(i)$$

$$m_i \leq T_{\text{spec}} \quad \forall \text{gate } i \text{ at PO's}$$

$$d_i \geq \hat{D}(w_i, w_{i,1}, \ldots w_{i,\text{fanout}(i)})$$

$$w_i \geq \text{Minsize}(i)$$

$$w_i \leq \text{Maxsize}(i)$$

where $w_{i,1}, \ldots w_{i,\text{fanout}(i)}$ are the sizes of the gates to which gate $i$ fans out, and $\text{Minsize}(i)$ and $\text{Maxsize}(i)$ are the minimum and maximum sizes of gate $i$ in the library, respectively. Notice that in the objective function, the constant term in (2.15) is omitted since it does not affect the result.
The preceding is a linear program in the variables \( w_i, d_i, m_i \). It is worth noting that the entries in the constraint matrix are very sparse, which makes the problem amenable to fast solution by sparse linear program approaches. Notice that the equalities of (2.14) are replaced here by inequalities so as to satisfy (2.15).

It should be emphasized that our approach is able to handle different timing specifications at different primary outputs. However, for the sake of simplicity, we use the same timing specification for all of the primary outputs in the circuit.

2.4 Phase II: The Mapping Algorithm

The set of permissible sizes for gate \( i \) is \( S_i = \{ w_{i,1}, \ldots, w_{i,p_i} \} \), where \( p_i \) is the cardinality of \( S_i \). The solution of the linear program would, in general, provide a gate size, \( w_i \), that does not belong to \( S_i \). If so, we consider the two permissible gate sizes that are closest to \( w_i \); we denote the nearest larger (smaller) size by \( w_{i+} \) (\( w_{i-} \)). Note that in any standard cell library, \( w_{i+} \) has a smaller delay than \( w_{i-} \). Since it is reasonable to assume that the LP solution is close to the solution of the combinatorial problem, we formulate the following smaller problem:

For all \( i = 1 \cdots M \): Select \( w_i = w_{i+} \) or \( w_{i-} \), such that \( Delay \leq T_{spec} \)

Although the complexity has been reduced from \( O(\prod_{i=1}^{M} p_i) \) for the original problem to \( O(2^M) \), this is still an NP-complete problem. In this section we present an implicit enumeration algorithm for mapping the gate sizes obtained using linear programming...
onto permissible gate sizes. The algorithm is based on a breadth-first branch-and-bound approach.

It is worth pointing out that the solution to this problem is not necessarily the optimal solution; however, it is very likely that the final objective function value for a solution arrived at using good heuristics will be close to the linear program solution, and hence close to the optimal solution. This supposition is borne out by the results presented in Section 2.6.

In Section 2.4.1, we present an implicit enumeration mapping algorithm which is single-path oriented. Although the execution time is fast, the result may not be satisfying. Therefore, in Section 2.4.2, we propose an improved implicit enumeration mapping algorithm using a global approach.

### 2.4.1 Implicit enumeration approach

The algorithm first places all $M$ gates in a queue, $Q$, in decreasing order of their worst-case signal arrival time, $m_i$. The longest path, $P$, from any PI to the gate at the head of $Q$ is found. The unmapped gates along $P$ are mapped to permissible gate sizes using an implicit enumeration approach [27]. Once a gate size has been mapped onto a permissible size, it is said to be processed, and remains unchanged during the remainder of the enumeration process. A processed gate is removed from the queue $Q$.

After $P$ has been processed, the process is repeated for the longest path to the gate that is now at the head of $Q$, until $Q$ is empty. Thus, although the circuit could have
an exponentially large number of paths, our algorithm has to handle at most $N$ of those paths.

Let $G_1$ be the gate that is currently at the head of the queue. Let $P = G_1, G_2, \ldots, G_{|P|}$ be the longest path from any $P_i$ to gate $G_1$, where $|P|$ is the number of gates on the path. The order of gates on the path is such that $G_i$ fans out to $G_{i-1}$, $2 \leq i \leq |P|$. The predecessor (successor) of gate $G_i$ on the path $P$ is the gate $G_{i+1}$ ($G_{i-1}$). Note that $G_{|P|}$ has no predecessor and $G_1$ has no successor.

Starting from $G_1$, we form a state-space tree. Each node at level $i$ in the state-space tree is a cell configuration, which represents a possible realization of gate $G_i$. To help define a cell configuration, we introduce the following notation. Let

\begin{align*}
C(i,j) & : \text{the } j\text{th node at level } i, \\
\text{anc}(i,j) & : \text{the ancestor node of } C(i,j), \\
FO(i) & : \text{the set of gates that gate } i \text{ fans out to,} \\
\text{area}(i, w_i) & : \text{the cell area of gate } i \text{ when its size is } w_i, \text{area}(i, w_i) = \gamma_i \cdot w_i \text{ (see (2.18)),} \\
R_{\text{out}}^i(w_i) & : \text{the equivalent resistance of gate } i, \text{ corresponding to size } w_i, \text{ that drives its load capacitances, } R_{\text{out}}^i(w_i) = R_w/w_i \text{ (see (2.2)),} \\
\Gamma^i(w_j) & : \text{cap}(i,j), \text{ given that gate } j \text{ is the predecessor of gate } i \text{ on path } P, \text{ and the size of gate } j \text{ is } w_j.
\end{align*}

**Definition 2.1** A cell configuration, $C(i,j)$, is a triple $(W_{ij}, A_{ij}, D_{ij})$, \[ W_{ij} = W_{C(i,j)} \in \{w_{i+}, w_{i-}\}, \]
\[ A_{ij} = A_{C(i,j)} = \text{area}(i, W_{ij}) + \text{area}(\text{anc}(i,j)), \]
\[ D_{ij} = D_{C(i,j)} = d_{ij} + \text{cap}(i,j), \]
where $d_{ij} = R^*_{out}(W_{ij}) \cdot \left[ \sum_{k \in F(i), k \neq i-1} \text{cap}(i, k) + \Gamma^i(W_{\text{end}(i,j)}) \right]$

where $A_{ij}$ is the accumulated area from the root to $C(i, j)$, $D_{ij}$ is the accumulated delay from the root to $C(i, j)$, and $d_{ij}$ is the configuration delay associated with $C(i, j)$. Physically, $d_{ij}$ corresponds to the delay of gate $i$, given that gate $i$ has size $W_{ij}$, and gate $(i - 1)$ has size $W_{\text{end}(i,j)}$.

In the state-space tree, each node has no more than two successors since there are at most two choices for the gate size. Every node in the tree corresponds to an assignment of sizes to those gates which lie on the path from the tree root to that node.

The root of the tree is, by definition, assigned a null cell configuration $(0,0,0)$. We begin with the unprocessed gate on the current path, $P$, that is closest to the POs, and implicitly enumerate the two possible realizations of each gate $i$, $w_{i+}$ and $w_{i-}$. The delay of each gate is dependent on its own size and on the size of the gates that it fans out to. Therefore, once $G_i$ has been enumerated, the delay associated with the predecessor of $G_i$ on path $P$ can be calculated, and it can be enumerated. The process continues until all gates along $P$ have been processed.

During the enumeration process, it is possible to eliminate several of the possibilities to prune the search space. A node $C(i, j)$ with a cell configuration $(W_{ij}, A_{ij}, D_{ij})$ is bounded if there exists a cell configuration $(W_{ik}, A_{ik}, D_{ik})$ at the same level of the tree such that

1. $\text{area}(i, W_{ik}) \leq \text{area}(i, W_{ij})$, $A_{ik} \leq A_{ij}$ and $D_{ik} < D_{ij}$, or
Figure 2.9  An example illustrating the construction of a state-space tree in the mapping algorithm.

(2) $\text{area}(i, W_{ih}) \leq \text{area}(i, W_{ij}), A_{ih} < A_{ij}$ and $D_{ih} \leq D_{ij}$.

Example 2.2 In Figure 2.9, let $G_1$ be the current head of the queue, $Q$. Let $G_2$ be the predecessor of $G_1$, and $G_3$ that of $G_2$ on the longest path from a PI to $G_1$. There are two possible realizations for $G_1$, namely,

(1) one with $\text{area}(1, W_{1,1}) = 1.2$ and delay $d_{1,1} = 0.9$, and

(2) one with $\text{area}(1, W_{1,2}) = 0.8$ and delay $d_{1,2} = 1.1$.

If neither node $C(1, 1)$ nor $C(1, 2)$ is bounded, we proceed to construct the second level for both cell configurations. The two successors of node $C(1, 1)$ in the tree represent two possible configurations of $G_2$ if $G_1$ is chosen to be of the size with $\text{area}(1, W_{1,1}) = 1.2$. 

36
Further, node $C(2, 1)$ represents the configuration if $G_2$ is chosen to have a template with $\text{area}(2, W_{2,1}) = 1.5$. Here, if the corresponding configuration delay of $G_2$, $d_{2,1} = 0.8$, then

- accumulated delay of $G_1$ and $G_2$, $D_{2,1} = 1.7$

- accumulated area of $G_1$ and $G_2$, $A_{2,1} = 2.7$

Similarly, node $C(2, 2)$ represents the situation if $G_1$ is chosen to be of the size with cell area 1.2 and $G_2$ with cell area 1.0. If the configuration delay of $G_2$, $d_{2,2} = 1.2$, then

- accumulated delay $D_{2,2} = 2.1$

- accumulated area $A_{2,2} = 2.2$

The entries of node $C(2, 3)$ and $C(2, 4)$ can be calculated similarly.

Now, notice that nodes $C(2, 1)$ and $C(2, 3)$ have the same gate area for $G_2$, while node $C(2, 3)$ has less accumulated area and accumulated delay than node $C(2, 1)$. Therefore, node $C(2, 1)$ is bounded, and it is not necessary to enumerate the descendants of $C(2, 1)$. Similarly, $C(2, 2)$ is bounded since $C(2, 4)$ has superior configuration to $C(2, 2)$.

For every path $P$ in the circuit, we define a quantity known as the maximum path delay, $(MPD)$, as follows:

$$MPD(P) = \begin{cases} 
\min_{j \in FO(i)} (m_j - d_j), & \text{if gate } i \text{ is not at a PO} \\
\min \left[ \min_{j \in FO(i)} (m_j - d_j), T_{spec} \right], & \text{if gate } i \text{ is at a PO}
\end{cases}$$

(2.19)
where gate \( i \) is the gate that lies at the end of path \( P \). Note that even if gate \( i \) is at a PO, it could still fan out to other gates in the circuit; this is reflected in the definition of the \( MPD \). Maximum path delay physically corresponds to the maximal delay that can be assigned to path \( P \) before its effect is propagated beyond gate \( G_{\text{end}} \) at the end of the path.

After the state-space tree for the longest path \( P \) has been constructed, the algorithm examines the cell configurations at the leaf nodes of the tree. The cell configuration, \( C(|P|, n) \), which satisfies the following requirements, is selected.

1. \( D_{|P|, n} \leq MPD(P) \),

2. \( D_{|P|, n} \geq D_{|P|, i} \) \( \forall C(|P|, i) \) such that \( D_{|P|, i} \leq MPD(P) \).

In requirement (2), instead of using \( A_{|P|, n} \leq A_{|P|, i} \) as the criterion, we use \( D_{|P|, n} \geq D_{|P|, i} \). This is because we do not want to perturb the solution obtained from the linear programming too much. This way, it is expected that no change in gate size takes the circuit delay radically away from \( T_{\text{spec}} \).

By performing a trace-back from \( C(|P|, n) \) to the root of the tree, the size of each gate along \( P \) is determined from the cell configurations at each traversed node of the tree.

### 2.4.2 Global implicit enumeration approach

The rationale behind our global enumeration algorithm is based on the following observation. Given the solution of the linear programming, the majority of the gates
remain at their smallest sizes. Only a small portion of the gates in the circuit are moved to a larger size because, for a typical circuit, although there may be a huge number of long paths, the number of gates on these long paths is, in general, relatively small.

Based on this observation, during the implicit enumeration procedure we may ignore those gates which are assigned to have their smallest size by the solution of the linear programming, and concentrate on those gates that have been assigned larger sizes and are probably on long paths.

Definition 2.2 A \textit{critical gate} is a gate whose size is larger than its smallest possible size.

Notice that the determination of critical gates, in general, can be very difficult to obtain, since whether a gate is critical or not heavily depends on the circuit structure as well as the tightness of the delay bounds. However, using an analytical approach such as linear programming, whether a gate is critical or not can be determined easily.

We modify the circuit topology by adding a source node \textit{s}\text{\textsubscript{o}} and a sink node \textit{s}\text{\textsubscript{i}}. A dummy edge is added from node \textit{s}\text{\textsubscript{o}} to each of the input nodes and from each of the output nodes to the node \textit{s}\text{\textsubscript{i}}. Next, for each gate \textit{i} we define \textit{max-delay-to-sink}, denoted by \textit{mds}(i), to be the maximum of the delays of all possible paths starting from gate \textit{i} to the sink node \textit{s}\text{\textsubscript{i}} [28]. That is,

\begin{equation}
mds(i) = \max_{j \in \text{FO}(i)} \{mds(j) + d_i\}
\end{equation}

The method for finding \textit{max-delay-to-sink} is a topological sort. That is, \textit{mds}(i) of a gate \textit{i} can be calculated only after all of the \textit{mds}'s of its fan-out gates have been
computed. Therefore, the computation of mds's starts from sink node si and proceeds backwards until we reach the source node so.

A breadth-first search is applied to levelize the circuit from the sink node backwards. The level of a gate i in this levelization is called its backward circuit level, c-level(i). By definition, the backward circuit level of the sink node si is 0, while the source node so has the largest backward circuit level. Starting from si, we form a state-space tree by implicitly enumerating critical gates. During the enumeration, noncritical gates remain at their minimum size and need not be enumerated. Each level in the state-space tree corresponds to a critical gate. The corresponding critical gate of level i is gate k, where k = F(i). Similarly, the corresponding level of a critical gate k in the state-space tree is called the gate's tree level, t.level(k). Therefore t.level(F(i)) = i. Each node at level i in the state-space tree is a cell configuration, which represents a possible realization of its corresponding gate. Let C(i, j) denote the jth node at level i, and anc(i, j) be its ancestor node.

Definition 2.3 A cell configuration, C(i, j) is a triple \( (W_{ij}, A_{ij}, D_{ij}) \),

\[
W_{ij} = W_{C(i,j)} \in \{w_{F(i)+}, w_{F(i)-}\},
\]

\[
A_{ij} = A_{C(i,j)} = \gamma_{F(i)} \cdot W_{ij} + A_{anc(i,j)},
\]

\[
D_{ij} = D_{C(i,j)} = \max \{\text{mds}(k)\}, \text{ where } k \text{ is a gate in the circuit (not necessarily a critical gate), which satisfies } c\text{-level}(k) = c\text{-level}(F(i)) + 2.
\]

\(^1\)This is different from a traditional levelizing scheme which is done starting from the source node and proceeds forwards.
where $A_{ij}$ is the accumulated area from the root to $C(i,j)$. (Notice that $\gamma_{F(i)} \cdot W_{ij}$ is the cell area of gate $F(i)$, given that its size is $W_{ij}$.)

In the state-space tree, each node has no more than two successors since there are at most two choices for the gate size. The root of the tree is, by definition, assigned a null cell configuration $(0, 0, 0)$. We begin with the critical gate that has the smallest backward circuit level and implicitly enumerate the two possible realizations of each gate $F(i)$, $w_{F(i)+}$ and $w_{F(i)-}$. The delay of each gate is dependent on its own size and on the size of the gates that it fans out to. Therefore, once $g_{F(i)}$ has been enumerated, the delay associated with the predecessor of $g_{F(i)}$ can be calculated, and the remaining critical gates can be enumerated. During the enumeration process, it is possible to eliminate several of the possibilities to prune the search space. A node $C(i,j)$ with a cell configuration $(W_{ij}, A_{ij}, D_{ij})$ is bounded if there exists a cell configuration $(W_{ik}, A_{ik}, D_{ik})$, at the same level of the tree such that

1. $A_{ik} \leq A_{ij}$ and $D_{ik} < D_{ij}$, or
2. $A_{ik} < A_{ij}$ and $D_{ik} \leq D_{ij}$.

After all of the critical gates have been implicitly enumerated, we keep calculating max-delay-to-sink for each remaining gate. However, since noncritical gates have fixed sizes, no enumeration is necessary. Rather, we simply propagate the values toward the source node. For each leaf node of the state-space tree, the max-delay-to-sink of the source node corresponding to that node is calculated and denoted by $D'_{ij}$. The cell configuration

---

\[^{2}\text{If there is more than one critical gate which has the same backward circuit level, one of them is randomly chosen.}\]
which has the largest $D'_{ij}$ and satisfies $D'_{ij} \leq T_{\text{spec}}$ is selected. By performing a trace-back from the selected leaf node to the root of the tree, the size of each critical gate is determined from the cell configurations at each traversed node.

2.5 Phase III: The Adjusting Algorithm

After the mapping phase, if the delay constraints cannot be satisfied, some of the gates in the circuit must be fine-tuned. For each PO which violates the timing constraints, we identify the longest path to that PO. For example, if gate $p$ at the PO has a worst case signal arrival time $m_p > T_{\text{spec}}$, we first find the longest path, $P$, to $G_p$. The path slack of $P$ is defined as

$$P_{\text{slack}}(P) = T_{\text{spec}} - m_p$$

(2.21)

For each gate along that longest path, we calculate the local delay difference for each of the gates along path $P$. Assume that $G_{i-1}, G_i, G_{i+1}$ are consecutive gates, in order of precedence, on path $P$. The local delay and local delay difference associated with $G_i$ are defined as

$$\text{delay}(G_i) = R_{\text{out}}^{i+1} \cdot C_{\text{out}}^{i+1} + R_{\text{out}}^{i} \cdot C_i$$

(2.22)

$$\Delta \text{delay}(G_i) = R_{\text{out}}^{i+1} \cdot \Delta C_{\text{out}}^{i+1} + \Delta R_{\text{out}}^{i} \cdot C_i$$

(2.23)

where $R_{\text{out}}^i$ and $C_{\text{out}}^i$ are, respectively, the equivalent driving resistance of gate $i$, and the capacitive load driven by gate $i$. Therefore, $\Delta \text{delay}(G_i)$ is the difference between the original local delay of $G_i$ and the new local delay of $G_i$ after we replace it with a different gate size that has a different value of $R_{\text{out}}^i$ and $C_{\text{out}}^{i+1}$. 

42
After calculating the local delay difference associated with each of the gates along path $P$, we select the largest one, $\Delta \text{delay}(G_n)$, which satisfies

$$\Delta \text{delay}(G_n) < P\text{slack}(P)$$  \hspace{1cm} (2.24)

and change the size of $G_n$ accordingly. If none of the local delay differences satisfy (2.24), we select the most negative one and replace the gate with a new realization. This process continues until the delay constraints are all satisfied. Also, notice that unlike in the mapping algorithm, we do not restrict our choices to $w_i+$ and $w_i-$ at this phase.

### 2.6 Experimental Results

The preceding algorithms were implemented in a program GALANT (GAte sizing using Linear programming ANd heuricTics) on a Sun Sparc10 station. The test circuits include several ISCAS85 combinational benchmark circuits [29]. Each cell in the standard-cell library has four different sizes of realization with different driving capabilities.

To prove the efficacy of the approach, a simulated annealing algorithm and Lin's algorithm [23] were implemented for comparison. The parameters used in Lin's algorithm have been tuned to give the best overall results. The simulated annealing algorithm that we have implemented is similar to that described in [26]. However, unlike in [26], all gate sizes were allowed to change during the simulated annealing procedure; while the run-times for this procedure were extremely high, the solution obtained can safely be said to be close to optimal. Although simulated annealing does not guarantee the global optimal
solution, a well-designed algorithm and a very slow annealing procedure can provide a solution that is very close to the global optimum.

The results of our approach, in comparison with Lin's algorithm and simulated annealing, are shown in Table 2.1. The test circuits include five ISCAS85 benchmarks, and vary in size from 160 gates (824 transistors) to 3512 gates (15,396 transistors). It can be seen that the accuracy of the results of our approach ranges from being as good as simulated annealing for c432 to a discrepancy of less than 2.0% in comparison with simulated annealing. The average discrepancy is less than 1.0%, and the run times are considerably smaller than those for simulated annealing.

Although Lin's algorithm runs much faster than GALANT, it does not always provide good results. For loose timing constraints, its solution is comparable to the result obtained using GALANT. For somewhat tight specifications, however, its solution becomes excessively pessimistic. For even tighter delay constraints, it cannot obtain a solution at all. As mentioned previously, Lin's algorithm essentially is an adaptation of the TILOS algorithm [15] for continuous transistor sizing, with a few enhancements. While the TILOS algorithm is known to work reasonably well for the continuous sizing case, the primary reason for its success is that the change in the circuit in each iteration is very small. However, in the discrete sizing case, any change must necessarily be a large jump, and a TILOS-like algorithm is likely to give very suboptimal results.

Table 2.2 shows the amount of time taken by the mapping and adjusting algorithm in comparison with the time required to solve the linear program, for some of the results in Table 2.1. It is clear that for all circuits, the chief component (over 95%) of the runtime
Table 2.1 Performance comparison of GALANT with Lin’s algorithm and simulated annealing.

<table>
<thead>
<tr>
<th>Circuit</th>
<th>$T_{spe}$</th>
<th>Simulated Annealing</th>
<th>GALANT</th>
<th>Lin’s Algorithm</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>Area ($A_{SA}$)</td>
<td>Run time</td>
<td>Area ($A_G$)</td>
</tr>
<tr>
<td>c432</td>
<td>16.0</td>
<td>2372</td>
<td>19min 53s</td>
<td>2376</td>
</tr>
<tr>
<td></td>
<td>14.0</td>
<td>2515</td>
<td>21min 17s</td>
<td>2515</td>
</tr>
<tr>
<td></td>
<td>12.0</td>
<td>2960</td>
<td>24min 27s</td>
<td>2983</td>
</tr>
<tr>
<td>c1355</td>
<td>14.0</td>
<td>8276</td>
<td>3h 32min</td>
<td>8276</td>
</tr>
<tr>
<td></td>
<td>13.0</td>
<td>9258</td>
<td>3h 45min</td>
<td>9412</td>
</tr>
<tr>
<td></td>
<td>12.5</td>
<td>10224</td>
<td>4h 12min</td>
<td>10417</td>
</tr>
<tr>
<td>c2670</td>
<td>17.0</td>
<td>17623</td>
<td>5h 22min</td>
<td>17623</td>
</tr>
<tr>
<td></td>
<td>16.0</td>
<td>17777</td>
<td>5h 42min</td>
<td>17790</td>
</tr>
<tr>
<td></td>
<td>14.0</td>
<td>18929</td>
<td>8h 12min</td>
<td>19079</td>
</tr>
<tr>
<td>c5315</td>
<td>20.0</td>
<td>36906</td>
<td>13h 46min</td>
<td>36954</td>
</tr>
<tr>
<td></td>
<td>18.5</td>
<td>37438</td>
<td>14h 2min</td>
<td>37457</td>
</tr>
<tr>
<td></td>
<td>17.0</td>
<td>38618</td>
<td>14h 43min</td>
<td>38863</td>
</tr>
<tr>
<td>c7552</td>
<td>18.0</td>
<td>50557</td>
<td>22h 5min</td>
<td>50604</td>
</tr>
<tr>
<td></td>
<td>17.0</td>
<td>50740</td>
<td>23h 20min</td>
<td>51254</td>
</tr>
<tr>
<td></td>
<td>16.0</td>
<td>52069</td>
<td>24h 5min</td>
<td>52563</td>
</tr>
</tbody>
</table>

Average Area Ratio: 1.0057

was the linear programming algorithm; the heuristic was extremely fast in comparison.

The discrepancy between the sum of LP solution time and the time required for mapping and adjusting in Table 2.2, and the total runtime in Table 2.1 is attributable to the preprocessing step which performs miscellaneous administrative steps such as reading in the circuit description and leveling the circuit.
Table 2.2 Execution times for the Linear Program and the Mapping and Adjusting Algorithms.

<table>
<thead>
<tr>
<th>Circuit</th>
<th># of gates</th>
<th>$T_{spec}$</th>
<th>LP solution</th>
<th>Mapping and Adjusting</th>
</tr>
</thead>
<tbody>
<tr>
<td>c432</td>
<td>160</td>
<td>12.0</td>
<td>6.98s</td>
<td>0.75s</td>
</tr>
<tr>
<td>c1355</td>
<td>546</td>
<td>14.0</td>
<td>1min 4s</td>
<td>7.33s</td>
</tr>
<tr>
<td>c2670</td>
<td>1193</td>
<td>14.0</td>
<td>6min 50s</td>
<td>13.68s</td>
</tr>
<tr>
<td>c5315</td>
<td>2307</td>
<td>17.0</td>
<td>18min 29s</td>
<td>32.51s</td>
</tr>
<tr>
<td>c7552</td>
<td>3512</td>
<td>16.0</td>
<td>1h 10min</td>
<td>1min 21s</td>
</tr>
</tbody>
</table>

A comparison of the run-times for GALANT, Lin's algorithm, and simulated annealing on the circuit c432, for various timing specifications, is shown in Table 2.3 and is plotted in Figure 2.10. It is clear that GALANT is orders of magnitude faster than simulated annealing, with results of comparable quality. It can be seen that as the timing specification becomes tighter, the area increases; the increase in area is very rapid for tighter timing specifications. In all cases, the solution obtained by GALANT is very close to the solution obtained by simulated annealing. In comparison with the results of Lin's algorithm, we find that GALANT provides results of substantially better quality, with reasonable run-times.

The runtime of GALANT is seen to go up as the timing specifications become tighter. This can be ascribed to the fact that there are many more solutions of the linear program that are close to the optimal solution, and hence the simplex procedure takes a longer time. This is in contrast with the case for a loose timing specification, in which most
Table 2.3 Performance comparison of GALANT with Lin’s algorithm and simulated annealing for c432.

<table>
<thead>
<tr>
<th>Circuit</th>
<th>$T_{spec}$</th>
<th>Simulated Annealing</th>
<th>GALANT</th>
<th>Lin’s Algorithm</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>Area $A_{SA}$</td>
<td>Run time</td>
<td>Area $A_G$</td>
</tr>
<tr>
<td>c432</td>
<td>17.5</td>
<td>2331</td>
<td>24min 29s</td>
<td>2331</td>
</tr>
<tr>
<td></td>
<td>17.0</td>
<td>2337</td>
<td>24min 28s</td>
<td>2337</td>
</tr>
<tr>
<td></td>
<td>16.5</td>
<td>2350</td>
<td>25min 11s</td>
<td>2350</td>
</tr>
<tr>
<td></td>
<td>16.0</td>
<td>2372</td>
<td>25min 40s</td>
<td>2376</td>
</tr>
<tr>
<td></td>
<td>15.5</td>
<td>2394</td>
<td>25min 45s</td>
<td>2402</td>
</tr>
<tr>
<td></td>
<td>15.0</td>
<td>2420</td>
<td>26min 28s</td>
<td>2424</td>
</tr>
<tr>
<td></td>
<td>14.5</td>
<td>2467</td>
<td>26min 17s</td>
<td>2467</td>
</tr>
<tr>
<td></td>
<td>14.0</td>
<td>2515</td>
<td>26min 32s</td>
<td>2515</td>
</tr>
<tr>
<td></td>
<td>13.5</td>
<td>2563</td>
<td>27min 47s</td>
<td>2563</td>
</tr>
<tr>
<td></td>
<td>13.0</td>
<td>2645</td>
<td>27min 57s</td>
<td>2658</td>
</tr>
<tr>
<td></td>
<td>12.5</td>
<td>2801</td>
<td>28min 33s</td>
<td>2801</td>
</tr>
<tr>
<td></td>
<td>12.0</td>
<td>2950</td>
<td>24min 27s</td>
<td>2983</td>
</tr>
<tr>
<td></td>
<td>11.5</td>
<td>3096</td>
<td>36min 4s</td>
<td>3139</td>
</tr>
<tr>
<td></td>
<td>11.0</td>
<td>3300</td>
<td>38min 28s</td>
<td>3315</td>
</tr>
<tr>
<td></td>
<td>10.5</td>
<td>3546</td>
<td>43min 5s</td>
<td>3583</td>
</tr>
</tbody>
</table>

Average Area Ratio: 1.009 for GALANT and 1.206 for Lin’s Algorithm.

Gates are at minimum size at the solution, and the vertices of the feasible region where these gates are at nonminimum sizes are clearly suboptimal.

2.7 Conclusions

In this chapter, an efficient algorithm is presented to minimize the area taken by cells in standard-cell designed combinational circuits under timing constraints. We present...
a comparison of the results of our algorithm with the solutions obtained by our implementation of Lin's algorithm [23] and by simulated annealing. In [23], it was shown that Lin's algorithm is able to obtain better results than the technology mapping of MIS2 [8]. Although Lin's algorithm is fast, its solution becomes excessively pessimistic for tight delay constraints. For very tight timing constraints, it fails to obtain a solution at all. Experimental results show that our approach can obtain a near-optimal solution (compared to simulated annealing) in a reasonable amount of time, even for very tight delay constraints. By adding additional linear programming constraints to account for short path delay [30], and slightly modifying the mapping and adjusting algorithm, the same approach can be used to tackle the double-sided delay constraints problem.

The major bottleneck of our approach is the time required to solve the linear program. Our approach uses a linear program which is solved using a package available in the
public domain [31], whose base is a sparse matrix dual simplex linear program solver. It is possible to reduce the CPU usage using vector processors; as pointed out in [31], the CPU usage can be reduced by about 40% on an Alliant FX/8 machine. Although the computational complexity of the simplex method can be exponential in the worst case, it has been observed that for most practical problems, the complexity ranges from $O((1/n+1/(m-n))^{-1})$ to $O((1/n+1/(m-n+1)-1/m)^{-1})$ for $m$ inequality constraints and $n$ variables [32]. Other polynomial-time linear programming algorithms such as Karmarkar's algorithm [33] may also be employed; however, in practice, its average run-time has been found to be similar to that of the simplex algorithm.

Finally, to increase the accuracy of the results, instead of using the $RC$ delay model, one can use fast timing simulation to evaluate delay of the circuit during implicit enumeration. The run time will be greater. However, as we have mentioned, the number of critical gates is likely to be relatively small. Therefore, the size of state-space tree is usually small. This means that the number of times that we have to perform fast timing simulation would also be small.
CHAPTER 3

OPTIMIZATION FOR SYNCHRONOUS SEQUENTIAL CIRCUITS

3.1 Introduction

The delay-area optimization problem for a combinational circuit is examined in Chapter 2. Optimization for synchronous sequential circuits, on the other hand, is different. An additional degree of freedom is available to the designer in that one can set the time at which clock signals arrive at various flip-flops (FFs) in the circuit by controlling interconnect delays in the clock signal distribution network. With such adjustments, it is possible to change the delay specifications for the combinational stages of a synchronous sequential circuit to allow for better sizing. This effect is even more important in the standard-cell environment, where the granularity of available choices for gate sizes is coarse, and the delay of an optimally sized combinational subcircuit may differ significantly from its delay specification. However, consideration of clock skew in conjunction with sizing increases the complexity of the problem tremendously, since it is no longer possible to decouple the problem and solve it on one subcircuit at a time.
Example 3.1 Consider the circuit shown in Figure 3.1. If the gates in Block 1 are sized substantially, while those in Block 2 are close to their minimum sizes, then by allowing a clock skew at FF B, it is possible to increase the delay specification for Block 1 and decrease that for Block 2. This could reduce the area of Block 1 greatly, at the expense of a small increase in the area of Block 2.

Example 3.2 Consider the synchronous sequential circuit shown in Figure 3.2. In addition to the possibility of adjusting clock skew at boundary latches as in Example 3.1, we can adjust clock skew at internal latches as well. By doing so, it is also possible to reduce the circuit area of the combinational block.

In general, given a combinational circuit segment that lies between two flip-flops \( i \) and \( j \), if \( s_i \) and \( s_j \) are the clock arrival times at the two flip-flops, we have the following relations:

\[
s_i + \text{Maxdelay}(i,j) + T_{\text{setup}} \leq s_j + P \tag{3.1}
\]

\[
s_i + \text{MinDelay}(i,j) \geq s_j + T_{\text{hold}} \tag{3.2}
\]
where $\text{Maxdelay}(i,j)$ and $\text{MinDelay}(i,j)$ are, respectively, the maximum and the minimum combinational delays between the two flip-flops, and $P$ is the clock period. Fishburn [9] studied the clock skew problem, under the assumption that the delays of the combinational segments are constant, and formulated the problem of finding the optimal clock period and the optimal skews as a linear program. The objective was to minimize $P$, with the constraints given by the inequalities in (3.1) and (3.2) above. In real design situations, however, $P$ is dictated by system requirements, and the real problem is to reduce the circuit area.

In this chapter, we examine the following problem: Given a clock period specification, how can the area of a synchronous sequential circuit be minimized by appropriately selecting gate size for each gate in the circuit from a standard-cell library, and by adjusting the delays between the central clock and individual flip-flops? For simplicity, the analysis will use positive-edge-triggered D-flip-flops. In the following, the terminologies flip-flop
(FF) and latch will be used interchangeably. We assume that all primary inputs (PIs) and primary outputs (POs) are connected to FFs outside the system, and are clocked with zero (or constant) skew.

We first present an algorithm for small synchronous sequential circuits and then show how it can be extended to arbitrarily large circuits. The algorithm works in three phases to solve the problem. In the first phase, the combined gate sizing and clock skew optimization problem is formulated as an LP. The solution of this LP provides us with a set of gate sizes that does not necessarily belong to the set of allowable sizes. Hence, in the second phase, we move from the LP solution to a set of allowable gate sizes, using heuristic techniques. At the end of the second phase, the set of allowable sizes obtained may not satisfy (3.1) and (3.2) simultaneously. Hence in the third stage, we fine-tune the longest path to satisfy (3.1) and satisfy the short path constraints in (3.2) by appropriately inserting delay buffers in the short path.

In Chapter 4, we consider arbitrarily large synchronous sequential circuits for which the sizes of the formulated LPs are prohibitively large, and present a partitioning algorithm to handle such circuits. The partitioning algorithm is used to control the computational cost of the linear programs. After the partitioning procedure, we can apply the optimization algorithm to each partitioned subcircuit.

This chapter is organized as follows. We briefly discuss previous work on clock skew optimization in Section 3.2. In Section 3.3, we formulate the synchronous sequential circuit area optimization problem. To reduce the number of constraints in our formulation, we propose a pruning algorithm in Section 3.4. A buffer insertion algorithm is presented...
in Section 3.5, which is used to satisfy short-path constraints without violating long-path constraints. Experimental results are given in Section 3.6. Finally, Section 3.7 concludes this chapter.

3.2 Previous Work on Clock Skew Optimization

Synchronous circuit designers usually try to eliminate clock skew. Clock skew is referred to as the variations in the delays from the central clock source to individual flip-flops of the system. This effort can involve equalization of wire length [34] or wire width [35], symmetric design of the distribution network, and design guidelines to eliminate skew due to process variations [36]. Clock skew can limit the clock speed of a synchronous system or cause clocking hazards leading to malfunction at any clock rate.

In a synchronous sequential circuit, a data race due to clock skew can cause the system to fail [37]. Consider a synchronous sequential digital system with flip-flops (FFs) as shown in Figure 3.3. Let $s_i$ denote the individual delay between the central clock source and flip-flop $FF_i$, and let $P$ be the clock period. Assume that there is a data path, with delay $d_{ij}$, from the output of $FF_i$ to the input of $FF_j$ for a certain input combination to the system. As illustrated in Figure 3.4, there are two constraints on $s_i, s_j$ and $d_{ij}$ that must be satisfied:

Double Clocking: If $s_j > s_i + d_{ij}$, then when the positive clock edge arrives at $FF_i$, the data race ahead through the path and destroy the data at the input to $FF_j$ before the clock arrives there. When the clock edge finally arrives at $FF_j$, the
wrong data are clocked through. Since the data are through two FF's with one
clock edge, this has been called double-clocking.

Zero Clocking: This occurs when \( s_i + d_{ij} > s_j + P \), i.e., the data reach \( FF_j \) too late.

When the clock edge arrives at \( FF_j \), the correct data are not ready yet. Since no
correct data are clocked in by a FF, this is called zero-clocking.

It is, therefore, desirable to keep the maximum (longest-path) delay small to maximize
the clock speed, while keeping the minimum (shortest-path) delay large enough to avoid
clock hazards.

In [9], Fishburn developed a set of inequalities which indicates whether either of the
above hazards is present. In his model, each \( FF_i \) receives the central clock signal delayed
by \( s_i \) by the delay element imposed between it and the central clock. Further, in order
for a FF to operate correctly when the clock edge arrives at time \( t \), it is assumed that the
correct input data must be present and stable during the time interval \( (t - T_{\text{setup}}, t + T_{\text{hold}}) \).
where $T_{\text{setup}}$ and $T_{\text{hold}}$ are the setup time and hold time of the FF, respectively. For all of the FFs, the lower and upper bounds $MIN(i, j)$ and $MAX(i, j)$ (where $1 \leq i, j \leq L$, $L$ is the total number of FFs in the circuit) are computed, which are the times required for a signal edge to propagate from $FF_i$ to $FF_j$. Since it is possible that multiple paths exist from $FF_i$ to $FF_j$, $MIN(i, j)$ and $MAX(i, j)$ must be computed as the minimum and maximum of these path delays; if no such path exists, define $MIN(i, j) = \infty$ and $MAX(i, j) = -\infty$.

To avoid double-clocking between $FF_i$ and $FF_j$, the data edge generated at $FF_i$ by a clock edge may not arrive at $FF_j$ earlier than $T_{\text{hold}}$ after the latest arrival of the same clock edge arrives at $FF_j$. The clock edge arrives at $FF_i$ at $s_i$, the fastest propagation from $FF_i$ to $FF_j$ is $MIN(i, j)$. The arrival time of the clock edge at $FF_j$ is $s_j$. Thus, we have
\[ s_i + \text{MIN}(i,j) \geq s_j + T_{\text{hold}}. \quad (3.3) \]

Similarly, to avoid zero-clocking, the data generated at FF\textsubscript{i} by the clock edge must arrive at FF\textsubscript{j} no later than \( T_{\text{setup}} \) amount of time before the next clock edge arrives. The slowest propagation time from FF\textsubscript{i} to FF\textsubscript{j} is \( \text{MAX}(i,j) \). The clock period is \( P \), thus the next clock edge arrives at FF\textsubscript{j} at \( s_j + P \). Therefore,

\[ s_i + T_{\text{setup}} + \text{MAX}(i,j) \leq s_j + P. \quad (3.4) \]

Inequalities (3.3) and (3.4) dictate the correct operation of a synchronous sequential system.

Two different optimization problems are then formulated [9] with regard to clock skew optimization. They are discussed briefly in the following.

### 3.2.1 Minimize \( P \) subject to clocking constraints

Assume that the value of \( T_{\text{setup}}, T_{\text{hold}}, \) and the maximum and minimum delays between each pair of flip-flops (\( \text{MAX}(i,j), \text{MIN}(i,j) \)) are constant, while the clock period \( P \) and clock skews to individual flop-flops, \( s_i \), are variable. To make the period \( P \) as short as possible while satisfying the system of inequalities Eq. (3.3) and (3.4), a linear program can be formulated as follows:
minimize \( P \)

subject to \( s_i - s_j \geq T_{\text{hold}} - \text{MIN}(i,j), \quad \forall i, j = 1, \ldots, L \) \hfill (3.5)

\[ s_j - s_i + P \geq T_{\text{setup}} + \text{MAX}(i,j), \quad \forall i, j = 1, \ldots, L \]

### 3.2.2 Maximize minimum margin for error

While manufacturing a circuit, it is inevitable that process variations will cause design parameters, such as component values, to waver from their nominal values. As a result, the manufactured circuit may no longer meet some design specifications, such as the requirements on the delay. On the other hand, a system on the verge of clock hazards might pass system diagnosis but malfunction at unpredictable times due to fluctuations in ambient temperature or power supply voltage. One way to increase reliability of the system and prevent these problems from happening, is to provide a safety margin over all the constraints of the slack, i.e., the amount by which the inequality is satisfied. This converts the problem into a maximin problem. This is modeled by introducing a new variable \( M \), which is added to each of the main constraint inequalities so that when \( M \) is maximized by the program, it will be the minimum slack over all the inequalities. In this problem formulation, \( M \) and \( s_i \) are variables to be determined, while \( P \) is specified as a constant. The problem can be formulated as:
maximize $M$

subject to $s_i - s_j - M \geq T_{\text{hold}} - \text{MIN}(i,j)$, \quad $\forall i, j = 1, \ldots, L$ \hfill (3.6)

$s_j - s_i - M \geq T_{\text{setup}} + \text{MAX}(i,j) - P$, \quad $\forall i, j = 1, \ldots, L$

### 3.3 Formulation of Constraints

In Fishburn's approach [9], it is assumed that circuit delays are fixed. In our problem, since gate sizes are to be determined, individual gate delays, and therefore the total circuit delay, are variables, while the clock period is a user-specified constant. Therefore, the problem becomes much more complicated since the delays $\text{MIN}(i,j)$ and $\text{MAX}(i,j)$ between each pair of latches are now also variables.

Our problem requires us to represent path delay constraints between every pair of FFs. This may be achieved by performing PERT [38] on the circuit and setting all FFs except the FF of interest (e.g., $FF_i$) to $-\infty$ ($\infty$) for the longest (shortest) delay path from $FF_i$ to all FFs, and the arrival time at the FF of interest is set to 0 [9]. Therefore in addition to the longest-path delay variable $m_k$, for the shortest-path delay, we introduce new variables, $p_k$, $k = 1 \ldots N$, which correspond to the shortest delay from the PIs (the outputs of FFs are considered as pseudo-PIs) up to the output of $G_k$.

$$p_j + d_k \geq p_k, \quad \forall j \in \text{Fanin}(k).$$ \hfill (3.7)
To represent path delays between every pair of FFs, we need intermediate variables \( m'_k \) (\( p'_k \)) to represent the longest (shortest) delay from FF\(_i\) to the \( k^{th} \) gate. The number of constraints so introduced may be prohibitively large. An efficient procedure for intelligent selection of intermediate \( m'_k \) and \( p'_k \) variables to reduce the number of additional variables and constraints without making approximations has been developed. Deferring a discussion on these procedures to Section 3.4, we now formulate the linear program for a general synchronous sequential circuit as

\[
\begin{align*}
\text{minimize} \quad & \sum_{k=1}^{N} \gamma_k \cdot w_k \\
\text{subject to} \quad & d_k \geq D(w_k, w_{k,1}, \ldots w_{k,\text{fo}(k)}), \quad 1 \leq k \leq N \\
& w_k \geq \text{Minsize}(k), \quad 1 \leq k \leq N \\
& w_k \leq \text{Maxsize}(k), \quad 1 \leq k \leq N \\
& \text{For all FF } i, \quad 1 \leq i \leq \mathcal{L} \\
& \quad s_i + p'_k \geq s_j + T_{\text{hold}} \quad 1 \leq j \leq \mathcal{L}, \quad k = \text{Fanin}(FF_j) \\
& \quad s_i + T_{\text{setup}} + m'_k \leq s_j + P_{\text{spec}} \quad 1 \leq j \leq \mathcal{L}, \quad k = \text{Fanin}(FF_j) \\
& \text{For all gates } k = 1, \ldots, N \\
& \quad m'_i + d_k \leq m'_k, \quad \forall l \in \text{Fanin}(k) \\
& \quad p'_i + d_k \geq p'_k, \quad \forall l \in \text{Fanin}(k)
\end{align*}
\]

(3.8)

The above is a linear program in the variables \( w_i, d_i, m_i, p_i \) and \( s_i \). Again, the entries in the constraint matrix are very sparse, which makes the problem amenable to fast solution by sparse linear program approaches.
3.4 Symbolic Propagation of Constraints

We begin by counting the number of LP constraints in (3.8). We ignore the constraints on the maximum and minimum sizes of each gate since these are handled separately by the simplex method. The \(d_i\) inequalities impose \(q\) constraints for each of the gates in the circuit to the LP formulation (see Eq. (2.15)). Let \(F = \sum_{i=1}^{N} \text{Fanin}(i)\), where \(N\) is total number of gates in the circuit. Then for each FF, there are \(O(F + L)\) constraints, where \(L\) is the total number of FFs in the circuit. Therefore the total number of constraints could be as large as \(O(N \cdot q + L \cdot (F + L))\). Assume that the average number of fan-ins to a gate is 2.5 and \(q = 5\). Then \(F = 2.5N\), and \(L \cdot F\) is the dominant term in the expression above. For real circuits, \(L\) is large, and hence the number of constraints could be tremendous. In this section, we propose a symbolic propagation method to prune the number of constraints by a judicious choice of the intermediate variables \(m\) and \(p\), without sacrificing accuracy. Basically, for any PI, we introduce \(m\) and \(p\) variables for those gates that are in that PI's fan-out cone. Also, we collapse constraints on chains of gates wherever possible (line 6 in Figure 3.5).

The synchronous sequential circuit is first levelized. For this purpose, the inputs of FFs are considered as pseudo-POs the outputs of FFs are considered as pseudo-PIs. Two string variables, \(m\text{string}(i)\) and \(p\text{string}(i)\), are used to store the long-path delay and short-path delay constraints associated with gate \(i\), respectively. For each gate and each FF, an integer variable \(w_i \in \{0, 1\}\) is introduced to indicate its status; that is, the variable \(w_i\) has the value 1 whenever \(m\text{ring}(i)\) and \(p\text{ring}(i)\) are nonempty, i.e.,
ALGORITHM Symbolic_propagation()

1. for i = 1 to L {
   2.   w_i ← 0, mstring(j) ← "", pstring(j) ← "" for all gates and PI’s;
   3.   for j = 1 to max_level {
   4.     for each gate k at level j {
   5.        if ( w_i = 0 for all l ∈ fanin(k) ); /* do nothing */
   6.        if ( among all l ∈ fanin(k), exactly one w_l = 1, others=0 ) {
   7.           mstring(k) ← mstring(l') + "d_k",
           pstring(k) ← pstring(l') + "d_k", w_k ← 1;
           /* w_l = 1, l' ∈ fanin(k) */
   8.        } else {
   9.           w_k ← 1, mstring(k) ← "m_k", pstring(k) ← "p_k";
 10.          for all w_l = 1, l ∈ fanin(k) {
 11.             write down the two constraints,
 12.             mstring(l) + d_k ≤ m_k, pstring(l) + d_k ≥ p_k,
 13.          }
 14.        }
   15.     }
 16. }

Figure 3.5 The symbolic constraints propagation algorithm.

when the constraints stored in mstring(i) and pstring(i) must be propagated; otherwise, w_i = 0.

The algorithm for propagating delay constraints symbolically is given in Figure 3.5. In the following discussion of the algorithm, we elaborate on the formation of mstring; the formation of pstring proceeds analogously. At line 2, for each gate j, w_j and mstring(j) are initialized by setting w_j = 0, and mstring(j) to the null string. At line 5, we check if w_l = 0 for all l ∈ fanin(k), i.e., if all of gate k’s input gates have a null mstring. If so,
no constraints have to be propagated, and no operations are needed. Next, at line 6, we check whether exactly one of gate k's input gates, e.g., gate l', has a nonempty mstring; others have null mstring's. If so, we may continue to propagate the constraint. This is implemented by concatenating mstring(l') and "d", and storing the resulting string in mstring(k). Also w_k is set to 1 to indicate that further propagation is required at this gate. Finally, if more than one of gate k's input gates have nonempty mstring, we add a new intermediate variable, m^k, and the string "m^k" is stored at mstring(k) (line 9).

For each input gate whose mstring is nonempty (w_i = 1), we need a delay constraint (line 12).

Example 3.3 Figure 3.6 gives an example that illustrates the symbolic delay constraints propagation algorithm. Assume that mstring(11) = "m_{11}^1", mstring(12) = mstring(13) = "" (null string). Therefore, from lines 6 and 7 of the pseudo-code, mstring(14) = "m_{11}^1 + d_4" and w_4 = 1. Propagating this further, we find that similarly, mstring(15) = "m_{11}^1 + d_4 + d_5" and w_5 = 1. Finally, for gate 16, we apply lines 9 through 12 and find that we must introduce a variable m_{16}, and set w_{16} = 1. We also write down the two constraints shown in the figure and add these to the set of LP constraints.

By using the symbolic constraints propagation algorithm, although the actual reduction is dependent on the structure of the circuit, experimental results show that this algorithm can reduce the number of constraints to less than 7% of the original number on the average for the tested circuits.

63
3.5 Satisfying Short-Path Delay Constraints

The solution of the LP would, in general, provide a gate size, \( x_k \), that does not belong to the permissible set, \( S_k = \{w_{k,1}, \ldots, w_{k,n}\} \). If so, we consider the two permissible gate sizes that are closest to \( w_k \); we denote the nearest larger (smaller) size by \( w_{k,+} \) (\( w_{k,-} \)). As in Section 2.4, we formulate the following smaller problem:

For all \( k = 1, \ldots, N \): Select \( w_k = w_{k,+} \) or \( w_{k,-} \), such that

\[
\text{for all FFs } 1 \leq i, j \leq L
\]

\[
s_i + \text{Maxdelay}(i, j) + T_{\text{setup}} \leq s_j + P_{\text{spec}}
\]

\[
s_i + \text{Mindelay}(i, j) \geq s_j + T_{\text{hold}}
\]

The mapping algorithm described in Section 2.4 can be used to obtain a solution for this problem.
After the mapping phase, if some of the delay constraints cannot be satisfied, we have to fine-tune some gate sizes in the circuit. In Section 2.5, we have discussed the approach to resolving the violation of long-path delay constraints. The same strategy can be applied to synchronous sequential circuit optimization, except that the definition of path slack must be modified.

For each PO $j$ (including pseudo POs at the inputs of FFs), the required maximum (minimum) signal arrival times, $req_l(j)$ ($req_s(j)$), can be expressed as

$$req_l(j) = s_j + P_{spec} - T_{setup}$$

$$req_s(j) = s_j + T_{hold}$$

The path slack then can be defined as

$$P_{slack}(P_l(n)) = req_l(n) - m_n$$

Violations of short-path delay constraints, on the other hand, can be resolved by inserting delay buffers. However, buffer insertion cannot be carried out arbitrarily, since one must simultaneously ensure that the changes in the circuit do not violate any long path constraints.

For every gate $i$ in the circuit, we define the gate slack, $G_{slack}(i)$, as

$$G_{slack}(i) = \begin{cases} 
\min \{m_j + G_{slack}(j) - (d_j + m_i)\}, & \text{if gate } i \text{ is not at a PO} \\
\min \{ \min_{j \in PO(i)} [m_j + G_{slack}(j) - (d_j + m_i)], (req_l(i) - m_i) \}, & \text{otherwise}
\end{cases}$$

$$G_{slack}(i) = \begin{cases} 
\min \{m_j + G_{slack}(j) - (d_j + m_i)\}, & \text{if gate } i \text{ is not at a PO} \\
\min \{ \min_{j \in PO(i)} [m_j + G_{slack}(j) - (d_j + m_i)], (req_l(i) - m_i) \}, & \text{otherwise}
\end{cases}$$
Note that if gate \( i \) is at a PO, it could still fan out to other gates in the circuit; this is reflected in the definition of the gate slack. Physically, gate slack corresponds to the amount by which the delay of gate \( i \) can be increased before its effect will be propagated to any POs or FFs, in terms of long-path delay. Therefore, it tells us the maximum delay that a delay buffer can have if we are to insert a delay buffer at the output of gate \( i \).

For example, consider a part of a circuit as shown in Figure 3.7. In the circuit, gates 4 and 5 are connected to flip-flops. Gate 4 has gates 1 and 2 as its fan-in gates, while gate 5 has gate 2 and 3 as its fan-in gates. The required signal arrival times at the inputs of both flip-flops are indicated in the figure. The long-path and short-path signal arrival times of gates 1, 2, and 3 are shown in the figure. The delays of gates 4 and 5 are 0.5 and 0.4, respectively. With this information, the long-path and short-path signal arrival times of gates 4 and 5 can be calculated, and are given in the figure. As we can see, the short-path signal arrival time of gate 4 is 2.1, which is less than the required minimum signal arrival time, 2.3. Therefore, it is necessary to process the short paths to gate 4.

Gate slacks are calculated using Eq. (3.11). For the time being, we assume that inserting a delay buffer at the output of a gate will not affect the delay of that gate. Since the gate slack of gate 2 is 0.2, we can insert a delay buffer with 0.2 delay immediately at the output of gate 2.\(^1\) After the buffer insertion, it can be seen that \( p_s = 2.3, m_s = 6.0 \), \( p_s = 2.2 \), and \( m_s = 5.5 \). This way, the required minimum signal arrival time at the input of flip-flop A is satisfied, while none of the required maximum signal arrival times are violated. On the other hand, if we insert a delay buffer with delay time 0.3 at the same

\(^1\)Alternatively, we can insert a delay buffer immediately at the input of gate 4.
Figure 3.7 An example illustrating the definition of $G_{slack}$.

In a real situation, however, inserting a buffer at the output of a gate will affect the delay of that gate. Therefore, care must be taken when performing buffer insertion to increase short-path delay.

The algorithm for inserting buffers is shown in Figure 3.8. At the beginning of this phase, we first back-propagate gate slacks from POs and all FFs. The gate slack of each gate is determined recursively using (3.11). In line (4) of the algorithm, beginning
ALGORITHM Insert_buffer(nl)

1. Let \( P_s(nl) \) be the shortest path to gate \( nl \), and \( G_{n1}, \ldots, G_{nk} \) be on path \( P_s(nl) \) \((G_{ni} \text{ fans out to } G_{n(i-1)}, 2 \leq i \leq k, k = \# \text{ of gates along } P_s(nl))\);
2. \( i \leftarrow 1 \);
3. while \( (p_{nl} < req(nl)) \) {
   4. if \( \exists \) a (smallest) buffer, \( bf \), in the library such that:
      \[
      \text{delay}(G_{ni}) < \text{delay}'(G_{ni}) + \text{delay}(bf) \leq \text{delay}(G_{ni}) + \text{slack}(G_{ni})
      \]
      \[
      \text{delay}(G_{ni}) < \text{delay}'(G_{ni}) + \text{delay}(bf) \leq \text{delay}(G_{ni}) + \text{slack}(G_{ni})
      \]
   5. insert \( bf \) at the output of \( G_{ni} \);
   6. incrementally update \( \text{slack}(j), m_j, p_j \) for each gate \( j \) in the circuit;
   7. if \( (p_{nl} \geq req(nl)) \) stop;
   8. else goto 1.
}
9. \( i \leftarrow i + 1 \);
}

Figure 3.8 The buffer insertion algorithm.

from the smallest buffer in the library, we try to insert a buffer at the output of gate \( G_{ni} \). The delay of the buffer is denoted by \( \text{delay}(bf) \). Since the output capacitance of \( G_{ni} \) is changed during this process, we have to recalculate its delay, which is denoted by \( \text{delay}'(G_{ni}) \).

Example 3.4 In Figure 3.9, let gate 4 be connected to some FF. The required maximum arrival time \( (req) \) is 4.8, and the required minimum arrival time \( (req) \) is 1.3. The actual long-path delays \( (m_i) \) and short-path delays \( (p_i) \) for all gates are as indicated. The gate slack of each gate is calculated and shown in the figure. Since gate 4 violates the shortest-path delay requirement, the shortest-path to it, \( P_s(4) \), is found; this can be seen to include gate 3. Since the gate slack of gate 3 is 1.0, we can insert a delay buffer between
gates 3 and 4. If \( \text{delay}(3) = 0.5 \), the delay after introducing the buffer, \( \text{delay}'(3) = 0.4 \), and \( \text{delay}(bf) = 0.3 \), then the new value of \( p_4 \) is 1.4, which satisfies \( \text{req}_4(4) \).

### 3.6 Experimental Results

The algorithms above were implemented in program GALANT-S on a Sun Sparc10 station. The test circuits include many of the ISCAS85 combinational benchmark circuits [29] and ISCAS89 synchronous sequential circuits [39]. Each cell in the standard-cell library has five different sizes of realization with different driving capabilities.

First, in Table 3.1, the experimental results using the symbolic constraints propagation algorithm are listed. For each circuit, the numbers of primary inputs, primary outputs, flip-flops, and gates are also shown. Both the number of longest-path delay constraints without using symbolic constraint propagation algorithm and the number of constraints pruned by the algorithm are given. It is clear that our pruning algorithm
is very efficient. The number of delay constraints is reduced by more than 93\% on the average.

For a given desired clock period ($P_{\text{spec}}$), the optimized results both with and without clock skew optimization are shown in Table 3.2. Depending on the structure of the circuits, the improvement over total area of the circuit ranges from 1.2\% to almost 20\%. As for the execution time, the run time ranges from about the same for some circuits, to less than double or triple for most circuits.

One may raise the question of whether it is worthwhile to minimize circuit area through clock skew optimization, since the reduction of area is not very significant for some circuits. However, Table 3.3 provides some more in-depth experiments of two circuits, s838 and s1423. In this experiment, we try to minimize the area using different specified clock periods. As one can see, for s1423, the minimum clock period without clock skew optimization is about 32.5. On the other hand, using clock skew optimization, the minimum period can be as small as 22, which gives an almost 33\% improvement in terms of clock speed. For s838, using clock skew optimization also gives a 30\% improvement. Hence, using clock skew optimization can not only reduce the circuit area, but also allows a faster clock speed.

3.7 Comment and Conclusions

3.7.1 Clock tree routing

In [34], Tsay proposed a zero-skew clock tree routing algorithm. In his approach, a clock tree is modeled as an $RC$ tree for delay analysis. Based on a lumped delay model
Table 3.1 Experimental results of the symbolic constraints propagation algorithm for ISCAS89 benchmark circuits.

<table>
<thead>
<tr>
<th>Circuit</th>
<th># of PIs</th>
<th># of POs</th>
<th># of FFs</th>
<th># of gates</th>
<th>longest-path constraints</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>original</td>
<td>pruned</td>
</tr>
<tr>
<td>s27</td>
<td>4</td>
<td>1</td>
<td>3</td>
<td>10</td>
<td>133</td>
</tr>
<tr>
<td>s208</td>
<td>11</td>
<td>2</td>
<td>8</td>
<td>104</td>
<td>3276</td>
</tr>
<tr>
<td>s298</td>
<td>3</td>
<td>6</td>
<td>14</td>
<td>137</td>
<td>4556</td>
</tr>
<tr>
<td>s344</td>
<td>9</td>
<td>11</td>
<td>15</td>
<td>160</td>
<td>6720</td>
</tr>
<tr>
<td>s349</td>
<td>9</td>
<td>11</td>
<td>15</td>
<td>160</td>
<td>6816</td>
</tr>
<tr>
<td>s382</td>
<td>3</td>
<td>6</td>
<td>21</td>
<td>158</td>
<td>7488</td>
</tr>
<tr>
<td>s386</td>
<td>7</td>
<td>7</td>
<td>6</td>
<td>171</td>
<td>4758</td>
</tr>
<tr>
<td>s400</td>
<td>3</td>
<td>6</td>
<td>21</td>
<td>162</td>
<td>7824</td>
</tr>
<tr>
<td>s420</td>
<td>19</td>
<td>2</td>
<td>16</td>
<td>196</td>
<td>11830</td>
</tr>
<tr>
<td>s444</td>
<td>3</td>
<td>6</td>
<td>21</td>
<td>181</td>
<td>8592</td>
</tr>
<tr>
<td>s510</td>
<td>19</td>
<td>7</td>
<td>6</td>
<td>211</td>
<td>10775</td>
</tr>
<tr>
<td>s526</td>
<td>3</td>
<td>6</td>
<td>21</td>
<td>229</td>
<td>11688</td>
</tr>
<tr>
<td>s641</td>
<td>35</td>
<td>24</td>
<td>19</td>
<td>379</td>
<td>30402</td>
</tr>
<tr>
<td>s838</td>
<td>35</td>
<td>2</td>
<td>32</td>
<td>446</td>
<td>55948</td>
</tr>
<tr>
<td>s953</td>
<td>16</td>
<td>23</td>
<td>29</td>
<td>395</td>
<td>34470</td>
</tr>
<tr>
<td>s1196</td>
<td>14</td>
<td>14</td>
<td>18</td>
<td>529</td>
<td>32736</td>
</tr>
<tr>
<td>s1423</td>
<td>17</td>
<td>5</td>
<td>74</td>
<td>657</td>
<td>106379</td>
</tr>
<tr>
<td>s1488</td>
<td>8</td>
<td>19</td>
<td>6</td>
<td>748</td>
<td>21014</td>
</tr>
<tr>
<td>s1494</td>
<td>8</td>
<td>19</td>
<td>6</td>
<td>725</td>
<td>20860</td>
</tr>
<tr>
<td>s5378</td>
<td>35</td>
<td>49</td>
<td>179</td>
<td>2779</td>
<td>911854</td>
</tr>
</tbody>
</table>
Table 3.2 Performance comparison with and without clock skew optimization for IS-CAS 89 benchmark circuits.

<table>
<thead>
<tr>
<th>Circuit</th>
<th>$P_{\text{spec}}$</th>
<th>with clock skew opt.</th>
<th>w/o clock skew opt.</th>
<th>$\frac{A_1}{A_2}$</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>Area ($A_1$)</td>
<td>Run time</td>
<td>Area ($A_2$)</td>
</tr>
<tr>
<td>s27</td>
<td>3.75</td>
<td>151.12</td>
<td>0.32s</td>
<td>179.29</td>
</tr>
<tr>
<td>s208</td>
<td>6.8</td>
<td>1404.00</td>
<td>3.32s</td>
<td>1745.25</td>
</tr>
<tr>
<td>s298</td>
<td>6.5</td>
<td>2125.50</td>
<td>4.20s</td>
<td>2295.58</td>
</tr>
<tr>
<td>s344</td>
<td>8.0</td>
<td>2093.00</td>
<td>7.10s</td>
<td>2400.67</td>
</tr>
<tr>
<td>s349</td>
<td>8.0</td>
<td>2128.75</td>
<td>6.18s</td>
<td>2498.17</td>
</tr>
<tr>
<td>s382</td>
<td>8.5</td>
<td>2216.50</td>
<td>7.68s</td>
<td>2334.04</td>
</tr>
<tr>
<td>s386</td>
<td>6.5</td>
<td>3521.37</td>
<td>7.55s</td>
<td>3577.17</td>
</tr>
<tr>
<td>s400</td>
<td>8.4</td>
<td>2314.00</td>
<td>8.19s</td>
<td>2515.50</td>
</tr>
<tr>
<td>s420</td>
<td>12.0</td>
<td>2522.00</td>
<td>9.06s</td>
<td>2952.63</td>
</tr>
<tr>
<td>s444</td>
<td>8.5</td>
<td>2463.50</td>
<td>11.55s</td>
<td>2724.04</td>
</tr>
<tr>
<td>s510</td>
<td>11.0</td>
<td>3219.67</td>
<td>16.13s</td>
<td>3261.37</td>
</tr>
<tr>
<td>s526</td>
<td>6.5</td>
<td>3914.08</td>
<td>10.21s</td>
<td>4311.67</td>
</tr>
<tr>
<td>s641</td>
<td>22.0</td>
<td>4598.75</td>
<td>51.59s</td>
<td>4747.17</td>
</tr>
<tr>
<td>s838</td>
<td>10.5</td>
<td>6162.00</td>
<td>100.67s</td>
<td>7324.42</td>
</tr>
<tr>
<td>s953</td>
<td>10.5</td>
<td>5516.87</td>
<td>243.93s</td>
<td>5898.75</td>
</tr>
<tr>
<td>s1196</td>
<td>12.0</td>
<td>8550.21</td>
<td>288.15s</td>
<td>8752.42</td>
</tr>
<tr>
<td>s1423</td>
<td>35.0</td>
<td>9871.87</td>
<td>1069.75s</td>
<td>10151.38</td>
</tr>
<tr>
<td>s1488</td>
<td>10.0</td>
<td>15025.29</td>
<td>148.27s</td>
<td>15322.12</td>
</tr>
<tr>
<td>s1494</td>
<td>10.0</td>
<td>14773.96</td>
<td>158.14s</td>
<td>14962.46</td>
</tr>
<tr>
<td>s5378</td>
<td>10.0</td>
<td>29219.12</td>
<td>2633.78s</td>
<td>29717.53</td>
</tr>
</tbody>
</table>
Table 3.3 Improving possible clocking speeds using clock skew optimization.

<table>
<thead>
<tr>
<th>Circuit</th>
<th>$P_{spec}$</th>
<th>with clock skew opt.</th>
<th>w/o clock skew opt.</th>
<th>$\frac{A_1}{A_2}$</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>Area ($A_1$)</td>
<td>Run time</td>
<td>Area ($A_2$)</td>
</tr>
<tr>
<td>s838</td>
<td>10.5</td>
<td>6162.00</td>
<td>100.67s</td>
<td>7324.42</td>
</tr>
<tr>
<td></td>
<td>10.25</td>
<td>6165.25</td>
<td>102.18s</td>
<td>7365.58</td>
</tr>
<tr>
<td></td>
<td>10.0</td>
<td>6182.04</td>
<td>103.25s</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td>7.5</td>
<td>6637.58</td>
<td>130.20s</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td>6.75</td>
<td>7417.58</td>
<td>172.31s</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td>6.5</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>s1423</td>
<td>35.0</td>
<td>9871.87</td>
<td>1069.75s</td>
<td>10151.38</td>
</tr>
<tr>
<td></td>
<td>32.5</td>
<td>9998.63</td>
<td>1130.89s</td>
<td>10545.71</td>
</tr>
<tr>
<td></td>
<td>30.0</td>
<td>10154.08</td>
<td>1450.03s</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td>22.0</td>
<td>12178.83</td>
<td>1605.43s</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td>20.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
</tbody>
</table>
and the delay computation method, he found that any two zero-skewed subtrees can be merged into a tree with zero skew by tapping the connection to a specific location of each subtree. The approach is a recursive bottom-up algorithm. To realize the clock routing of a nonzero-skew system as in our approach, Tsay's zero-skew routing algorithm can be modified to handle the problem [34]. This can be done by adding a fictitious delay element on each clock pin.

Let us assume that the optimal clock delay to latch $i$ is $D_0 + D_i$, where $D_0$ is a common offset value which is unknown until the clock routing is determined. Thus, the skew between latch $i$ and latch $j$ is $D_i - D_j$. Let $D_{\text{mas}}$ be the maximum clock delay, i.e.,

$$D_{\text{mas}} = \max(D_0 + D_k) = D_0 + \max_k D_k$$

(3.12)

Define the fictitious delay of latch $i$ as

$$d_i = D_{\text{mas}} - (D_0 + D_i) = \max_k D_k - D_i$$

(3.13)

In other words, each clock pin attached to a latch is modeled as a lumped delay model with an input loading capacitance and a branch delay, as shown in Figure 3.10. Then the zero-skew routing algorithm is performed on this modified clock tree with the fictitious delay on each clock pin.

### 3.7.2 Conclusions

In this chapter, a unified approach to minimizing synchronous sequential circuit area and optimizing clock skews has also been presented. Traditionally, the circuit area of a synchronous sequential circuit is minimized one combinational subcircuit at a time. Our
Figure 3.10 (a) A clock pin on a latch. (b) The modified model of a clock pin according to the optimal skew obtained from our algorithm.

Experiments have shown that this may lead to very suboptimal solutions in some cases. Experimental results show that using clock skew optimization can not only reduce the circuit area, but also allows a faster clock speed.

In our formulation, for each gate in the circuit, we use the same delay variable \(d_k\) when calculating longest-path delay and shortest-path delay. In practice, however, the worst-case maximum delay and worst-case minimum delay are different for a specific gate. Nonetheless, our formulation and algorithm described in this chapter can be modified to consider this effect.

In the experimental results, only active circuit area (cell area) is considered. The data do not include the clock tree routing area. It is possible that due to the introduction of clock skew at each latch, the clock tree routing area may be increased. On the other hand, since both positive and negative clock skews are allowed at each latch, it is possible that the net increase in the clock tree routing area may be insignificant. Nevertheless, more thorough study should be conducted before the clock skew optimization technique can be applied to real circuit designs.
Finally, the clock skew scheme may appear similar to the *maximum-rate pipelining* technique used in pipelined computer systems [40]. However, the clock in a maximum-rate pipeline cannot be single-stepped or even slowed down significantly. This makes maximum-rate designs extremely hard to debug. In the clock skew scheme, by contrast, single-stepping is always possible [9]. Therefore, circuits implemented using clock skew techniques can be debugged without difficulties.
CHAPTER 4
PARTITIONING FOR OPTIMIZATION

4.1 Introduction

As indicated in Section 3.4, the number of constraints in our formulation of the LP is, in the worst case, proportional to the product of the number of gates and the number of FFs in the circuit. Ideally, for a given synchronous sequential circuit, all variables and constraints should be considered together to obtain an optimal solution. However, for large synchronous sequential circuits, the size of the LP could be prohibitively large even with our symbolic constraint propagation algorithm. Therefore, it is desirable to partition large synchronous sequential circuits into smaller, more tractable subcircuits, so that we can apply the algorithm described in Chapter 3 to each subcircuit. While this would entail some loss of optimality, an efficient partitioning scheme would minimize that loss; moreover, the reduction in execution time would be very rewarding.

It is well-known that multiple-way network partitioning problems are NP-hard [41]. Therefore, typical approaches to solving such problems find heuristics that will yield approximate solutions in polynomial time [42, 43]. Traditional partitioning problems
usually have explicit objective functions; for example, in physical layout it is desirable to have minimal interface signals resulting from partitioning the circuit, and hence the objective function to be minimized there is the number of nets connecting more than two blocks. Our synchronous sequential circuit partitioning problem, however, is made harder by the absence of a well-defined objective function; since our ultimate goal is to minimize the total area of the circuit, there is no direct physical measure that could serve as an objective function for partitioning. In this chapter, we develop a heuristic measure that will be shown to be an effective objective function for our partitioning problem.

In this chapter, we first briefly discuss previous work on network partitioning in Section 4.2. We develop our partitioning algorithm based on Sanchis' multiple-way partitioning algorithm [42]; details of Sanchis' algorithm are provided in Section 4.3. We present our synchronous sequential circuit partitioning algorithm in Section 4.4. Finally, experimental results are given in Section 4.5, and we conclude this chapter in the same section.

4.2 Previous Work on Partitioning

As VLSI system complexity increases, a divide-and-conquer approach is used to keep the circuit design process tractable. Using this strategy, a complex problem is divided into small subproblems, thus reducing the complexity of the original problem dramatically.
Given a circuit (network) consisting of a set of modules (nodes) connected by a set of signals (nets), the objective of a K-way partitioning is to divide the whole circuit into K subsets such that the number of signals crossing these subsets is minimized.

A network as described above can be modeled as a graph where each edge (net) connects exactly two vertices (node). The graph partitioning problem can be formally stated as follows. We are given an undirected graph, $G = (V, E)$ where vertices $V = \{v_1, v_2, \ldots, v_n\}$ and weighted edges $e = (v_i, v_j)$ represent the cost of putting $v_i$ and $v_j$ in separate partitions. The problem is to divide the vertices into $k$ disjoint sets $\{P_1, P_2, \ldots, P_k\}$ for a given $k$, such that some cost function is optimized. The cost function can be based on the weights of the edges cut and/or the sizes of the partitions.

Ford and Fulkerson [44] proposed the max-flow-min-cut algorithm, which finds the optimum solution between subsets of unconstrained sizes in polynomial time. Using their algorithm, a minimum cut separating designated nodes $s$ and $t$ can be found by flow techniques in $O(n^3)$ time, where $n = |V|$ = number of vertices in the graph. Cut-tree techniques [45] will yield the global minimum cut using $n - 1$ minimum cut computation in $O(n^4)$ time. However, these algorithms tend to generate very unbalanced partitions. Unfortunately, when size balancing constraints are imposed, the problem becomes NP-complete. Because of its importance, many heuristics have been proposed to solve the partitioning problem [42, 43, 46–51]. These heuristics can be classified into the following two categories:
(1) **Iterative method** - Iterative heuristics explore the solution space by making a large number of moves (small changes to the solution) either randomly or greedily in an attempt to discover a global minimum.

(2) **Spectral method** - In spectral partitioning techniques, the eigenvectors and eigenvalues (spectrum) of a graph are computed, and a cost function is shown to be minimized by a function of the spectrum. Some heuristic is used for mapping the information provided by the eigenvectors into an actual partition.

The two approaches are discussed in more detail in the following two subsections.

### 4.2.1 Iterative partitioning

In [46], Kernighan and Lin described a heuristic procedure for graph partitioning which became the basis for most of the iterative improvement partitioning algorithms generally used. Their algorithm deals with the problem of partitioning a network with \( n \) cells (vertices) (\( n \) even) into two partitions of \( n/2 \) cells each. The basic approach is to start with a given partition and to improve it by iteratively choosing one node from each of the blocks (partitions) and exchanging them. The nodes are selected to be switched so that a maximum decrease in cut-set size is obtained (or minimum increase if no decrease is possible). The algorithm consists of a series of passes. In each pass, two nodes are interchanged in turn until all \( n \) nodes have been moved. In each iteration, the two nodes to be moved are chosen from among the ones which have not been moved during the pass. At the end of each pass, the \( n/2 \) partitions produced during the pass are examined.
and the one with the minimum cut-set size is chosen as the starting partition for the next pass. Passes are performed until no improvement in cut-set size can be obtained.

Fiduccia and Mattheyses [47] modified the Kernighan-Lin algorithm. Their algorithm has a linear worst-case complexity in each pass. In their algorithm, only one cell is moved between two partitions at a time instead of switching pairs. This allows for more flexibility in block sizes. In addition, a method is introduced for keeping the candidates in each partition sorted at all times. Elegant data structures were developed through which they could maintain the sorted candidates, and thus avoiding searching for a candidate to be moved. Hence a linear-time complexity is achieved. Fiduccia and Mattheyses also introduced the idea of preserving balance in the sizes of the blocks. Since only one cell is moved at a time, block sizes cannot be constrained to be constant during the pass. Instead, each block’s size is constrained to be within a given range. When choosing the next cell to be moved, the cell with the highest gain (reducing maximum number of cuts across the partition) in each block is examined. It will always be possible to move at least one of these cells while preserving balance. If both may be moved, the one with the highest gain is selected.

Krishnamurthy [48] further improved the Fiduccia-Mattheyses algorithm by refining the method for choosing the best cell to be moved. He introduced the concept of level gain. Consider the example shown in Figure 4.1. Moving A would eliminate one net; moving B, however, would not eliminate any net. However, if we move B and C together, two nets can be eliminated. Therefore, the first level gain ($\gamma_1$) of A is 1, and that of B is 0; while the second level gain ($\gamma_2$) of A is 0, and that of B is 2. The gain vector of a
cell $E$ is then defined as

$$\Gamma_l(E) = \langle \gamma_1(E), \cdots, \gamma_l(E) \rangle \quad (4.1)$$

where $l$ is the number of levels used. These vectors are ordered lexicographically. At each iteration, the free cell with the largest gain vector is moved. Computing higher-level gains enables the algorithm to better distinguish between cells whose first-level gains are the same.

In [42], Sanchis further generalized Krishnamurthy’s algorithm to deal with the multiple-way partitioning problem. There are several ways in which a two-way partitioning algorithm can be adapted to multiple-way partitioning. For example, one can successively choose pairs of blocks and apply the two-way algorithm to these pairs. However, since eliminating a net from the cut-set formed across a given pair of blocks does not necessarily remove it from the cut-set of the multiple block partition, this method may not be able to obtain good results. The second method consists of a hierarchical use of the two-way algorithm. For example, for a four-way partition we could use the two-way al-
algorithm to partition the cells into two blocks, and then partition each of these two blocks into two blocks each. However, the first partitioning will try to minimize the number of connections between the first two blocks, thus tending to maximize the connections inside these two blocks and making it harder to obtain good partitions thereafter. An alternative for obtaining better solutions is to attempt to improve the partition uniformly at each step. Under such a scheme, we should consider at each iteration during a pass all possible moves of each free cell from its home block to any of the other blocks, and the best of such moves should be chosen. This is the basic approach taken by Sanchis. Since we develop our partitioning scheme based on Sanchis' algorithm, details of the algorithm will be discussed in Section 4.3.

More recently, Yeh et al. [43] proposed a general-purpose multiple-way partitioning algorithm. In their approach, a top-down clustering is carried out first to group highly connected subsircuits into clusters and then condense these clusters into single nodes prior to the execution of iterative procedure. They also proposed a uniform multipin net model to capture the contributory moves. Consider the example shown in Figure 4.2. Suppose during a pass, nodes A, B and C have not been locked. Moving A would eliminate nets $e_i$, $e_k$, and $e_l$ and introduce net $e_n$ to the cut-set. Thus the gain would be 2. Moving B would not eliminate any net, but would allow for nets $e_i$, $e_j$, and $e_k$ to be eliminated, provided that node C is moved along with B. If we use the level gain model, the moving of A would be favored, which would depress the movement of B and C. Based upon the above observation, a different approach is introduced. Let us now concentrate on the perspective of nets. If we want to eliminate net $e_j$, we would have to move B and C
together. This would also introduce the elimination of nets $e_i$ and $e_k$ at the same time. Thus the gain would be 3. On the other hand, if we decide to remove net $e_i$, we would move $A$ only, and the gain is 2. Therefore, if we view a move as initiated by a net instead of a node, the ambiguity associated with selecting moves would be reduced. Based on this model, a primal-dual iteration is used to enhance the iteration improvement. The primal process is based on the Fiduccia-Mattheyses algorithm. The dual process is similar to the primal process except that it concentrates on net perspective during the pass.

4.2.2 Spectral partitioning

Spectral-based partitioning extracts information about the structure of the graph from the eigenvalues and eigenvectors of the matrices derived from the graph. A graph can be represented by the adjacency matrix $A(G)$.

$$ A_{ij} = \begin{cases} a_{ij}, & \text{if } (v_i, v_j) \in E \\ 0, & \text{otherwise} \end{cases} \quad (4.2) $$
where \( a_{ij} \) is the weight of the edge between \( v_i \) and \( v_j \). By convention, \( A_{ii} = 0 \) for all \( i = 1, \ldots, n \). If we let \( d(v_i) \) denote the degree of node \( v_i \) (i.e., the sum of weights of all edges incident on \( v_i \)), we obtain the \( n \times n \) diagonal degree matrix \( D(G) \) defined by

\[
G_{ij} = \begin{cases} 
\sum_{k=1}^{n} a_{ik}, & \text{if } i = j \\
0, & \text{if } i \neq j
\end{cases} \tag{4.3}
\]

The Laplacian of \( G \) is the \( n \times n \) symmetric matrix \( Q(G) = D(G) - A(G) \). Since the rows (and columns) sum to 0, the Laplacian is singular; it has rank of at most \( n - 1 \) and 0 as an eigenvalue. In fact, the multiplicity of the 0 eigenvalue is the number of connected components of \( G \).

Donath and Hoffman [52] derived a lower bound on the weight of the edges cut (\( E_c \)) by a partition satisfying predetermined partition sizes. If \( m_1 \geq m_2 \geq \cdots \geq m_k \) are the given partition sizes and \( \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_k \) are the smallest \( k \) eigenvalues of the Laplacian, then \( E_c \geq \frac{1}{2} \sum_{i=1}^{k} \lambda_i m_i \).

In [53], Hall showed that the eigenvalues/eigenvectors of the Laplacian solve the one-dimensional quadratic placement problem of finding the vector \( \mathbf{x} = (x_1, x_2, \ldots, x_n) \) which minimizes the total weighted squared distance between \( n \) points that can be expressed as

\[
x = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2 A_{ij} \tag{4.4}
\]

subject to the constraints \( |\mathbf{x}| = (\mathbf{x}^T \mathbf{x})^{1/2} = 1 \).

Here \( x_i \) is the coordinate assigned to vertex \( v_i \) in a one-dimensional space. The constraint is imposed to avoid the trivial solution in which all \( x_i \)s are equal. Equation
(4.4) can be rewritten in matrix notation in quadratic form as

\[
\text{minimize } \quad z = x^TQx
\]

subject to \( (x^T x)^{1/2} = 1 \) \hspace{1cm} (4.5)

To solve this constrained minimization problem, we form the Lagrangian

\[
L = x^TQx - \lambda (x^T x - 1)
\] \hspace{1cm} (4.6)

Taking the partial derivative of \( L \) with respect to \( x \) and setting it equal to 0 yields

\[
2Qx - 2\lambda x = 0
\] \hspace{1cm} (4.7)

which can be rewritten as

\[
(Q - \lambda I)x = 0
\] \hspace{1cm} (4.8)

where \( I \) is the identity matrix.

This is an eigenvalue formulation for \( \lambda \). For a system of \( n \) linear equations, there are \( n \) possible eigenvalues \( \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n \). For a connected graph, the Laplacian has rank of \( n - 1 \). The minimum eigenvalue 0 gives the trivial solution \( x = (1/\sqrt{n}), \cdots, 1/\sqrt{n}) \). Hence the eigenvector corresponding to the second smallest eigenvalue, \( \lambda_2 \) is used. The second smallest eigenvalue is a lower bound on a nontrivial solution to (4.5). In his paper, Hall heuristically derived a \( k \)-dimensional generalization in which the eigenvectors are used as the basis for clustering placement.

Recently, Hagen and Kahng [49] established a connection between Hall’s formulation and 2-way ratio-cut [54] partitioning. They construct a 2-way partition from \( v_2 \) (the
corresponding eigenvector of $\lambda_2$) by sorting $v_2$ and identifying a cut in the sorted $v_2$
which yields the best ratio-cut value.

In [51], Chan et al. developed a spectral approach to multiple-way ratio-cut partitioning which provides a generalization of the ratio-cut cost metric to $k$-way partitioning and a lower bound on this cost metric. Their approach involves finding the $k$ smallest eigenvalues/eigenvector pairs to the Laplacian. The eigenvectors provide an embedding of the graph's $n$ vertices into a $k$-dimensional subspace. A heuristic is then used to enforce the points in the embedding into $k$ partitions.

4.3 Sanchis' Multiple-way Partitioning Algorithm

A cell is labeled free if it has not been moved during the pass; otherwise, it is labeled locked. Define

$$\phi_{A_i}(N) = |\{C|C \in A_i \text{ and } C \in C_N \text{ and } C \text{ is free}\}|$$

$$\lambda_{A_i}(N) = |\{C|C \in A_i \text{ and } C \in C_N \text{ and } C \text{ is locked}\}|$$

(4.9)

Thus $\phi_{A_i}(N)$ is the number of free cells on the net $N$ which are in the block $A_i$, while $\lambda_{A_i}(N)$ is the number of locked cells on net $N$ which are in the block $A_i$. For each block $A_i$ and each net $N$, define the binding numbers $\beta$:

$$\beta_{A_i}(N) = \begin{cases} 
\phi_{A_i}(N) & \text{if } \lambda_{A_i}(N) = 0 \\
\infty & \text{if } \lambda_{A_i}(N) > 0
\end{cases}$$

(4.10)
The binding number of a net with respect to a block of a partition indicates how tightly the net is bound to the block. We also define the function \( \beta' \) as follows:

\[
\beta'_{A_i}(N) = \sum_{j \neq i} \beta_{A_j}(N)
\]  

That is, \( \beta'_{A_i}(N) \) is the sum of all of the binding numbers of net \( N \) with respect to all of the blocks of the partition except block \( A_i \); it gives a measure of how tightly \( N \) is bound to the partitions other than \( A_i \).

We now define the \( i \)th level gain associated with moving cell \( C \) from block \( A_j \) to block \( A_k \).

\[
\gamma_i(C) = \begin{cases} 
\frac{1}{|N \in N_C| \beta'_{A_k}(N) = i \text{ and } \beta_{A_k}(N) > 0|} & \\
-\frac{1}{|N \in N_C| \beta'_{A_j}(N) = i - 1 \text{ and } \beta_{A_j}(N) > 0|}
\end{cases}
\]

The first term in (4.12) measures the \( i \)th level benefit of moving cell \( C \) from the side of the partition consisting of all blocks except \( A_k \), to \( A_k \). The second term measures the \( i \)th level penalty of moving \( C \) from \( A_j \) to the side of the partition consisting of all blocks except \( A_j \).

The balance requirement for the block sizes can be satisfied as follows. Let \( r_1, \ldots, r_b \) be such that \( 0 \leq r_i \leq 1 \) for each \( i \) and

\[
\sum_{i=1}^{b} r_i = 1
\]

We want the size of \( A_i \) to be close to \( r_i c \), where \( c \) is the total number of cells in the network. A parameter \( w \) is chosen such that \( 0 < w \leq \min_{1 \leq i \leq b}(r_i c) \), and we allow the following range for the size of \( A_i \):

\[
r_i c - w \leq |A_i| \leq r_i c + w
\]
That is, a cell move from $A_i$ to $A_j$ is allowed if it preserves the above relationship for $A_i$ and $A_j$.

Given the initial partition, the algorithm improves the partition by iteratively moving one cell from one block to another in a series of passes. A cell is labeled free if it has not been moved during that pass. Each pass in turn consists of a series of iterations during each of which the free block with the largest gain is moved. During each move, we ensure that the number of constraints in a block does not violate the constraints given by (4.14). The gain vector, $\Gamma^k(C)$, as defined in (4.12), is updated constantly as cells are moved from one block to another. At the end of each pass, the partitions generated during that pass are examined and the one with the minimum cut-set size is chosen as the starting partition for the next pass. Passes are performed until no improvement of the objective value can be obtained.

### 4.4 Synchronous Sequential Circuit Partitioning

To help us describe our partitioning algorithm, we introduce the following terminology. For a synchronous sequential circuit, such as one shown in Figure 4.3, we define the following:

- **An internal latch** is a latch whose fan-in and fan-out gates belong to the same combinational block.

- **A sequential block** consists of a combinational subcircuit and its associated internal latches.
Boundary latches are latches that act as either a pseudo-PI or a pseudo-PO (but not both) to a combinational block, i.e., latches whose fan-in and fan-out gates belong to different combinational blocks.

A partition of a synchronous sequential circuit $N$ is a partition of the sequential blocks of $N$ into disjoint groups. A $b$-way partitioning of the network is described by the $b$-tuple $(G_1, G_2, \ldots, G_b)$ where the $G_i$s are disjoint sets of sequential blocks whose union is the entire set of blocks in the network. Each $G_i$ is said to be a group of the partition.

After partitioning, boundary latches that lie between groups (that do not belong to any groups) will be set to have constant skews. In other words, we do not have any control on the skews of those latches during the optimization process.

For a given sequential block $B$, let $L_B$ denote the set of boundary latches incident on $B$, and for a given boundary latch $L$, $B_L$ denotes the set of sequential blocks that $L$ is
connected to. For each boundary latch $L$, we define input tightness $\tau_{in}$, output tightness $\tau_{out}$, and the tightness ratio $\tau$ as

\[
\tau_{in}(L) = \text{maximum combinational delay from any boundary latch to } L \text{ in the unsized circuit},
\]

\[
\tau_{out}(L) = \text{maximum combinational delay from } L \text{ to any boundary latch in the unsized circuit},
\]

\[
\tau(L) = \begin{cases} 
\tau_{in}/\tau_{out} & \text{if } \tau_{in} \geq \tau_{out} \\
\tau_{out}/\tau_{in} & \text{if } \tau_{in} < \tau_{out}
\end{cases}
\]

(4.15)

where the adjective "unsized" implies that all gates in the subcircuit are at the minimum size. The tightness ratio $\tau(L)$ provides a measure of how advantageous it would be to provide a skew at $L$. For example, in Figure 4.4, if the input tightness (of path $P_{in}$) is 3.0, and output tightness (of path $P_{out}$) is 1.1, retiming the path $P_{in} \cup P_{out}$ would be of great benefit. Therefore, if the circuit contains an FF whose input and output are connected to paths with vastly different tightness factors, the two paths should remain in the same partition.
For each pair of blocks \((B_i, B_j)\), define merit \(\mu_{ij}\) as

\[
\mu_{ij} = \sum_{B_i \leftrightarrow B_j} \tau(L_k)
\]

where \(B_i \leftrightarrow B_j\) means latch \(L_k\) lies between \(B_i\) and \(B_j\). The value of \(\mu_{ij}\) is defined to be 0 if \(B_i\) and \(B_j\) are disjoint. For example, in Figure 4.5, sequential blocks \((B_i\) and \(B_j\)) are connected through three latches \(L_1, L_2,\) and \(L_3\) with tightness ratios \(\tau(L_1),\tau(L_2)\) and \(\tau(L_3)\), respectively. Then the merit between the two blocks is \(\mu_{ij} = \tau(L_1) + \tau(L_2) + \tau(L_3)\).

Physically, \(\mu_{ij}\) is used to measure the figure of merit if \(B_i\) and \(B_j\) are in the same group. A high \(\mu_{ij}\) means that the tightness ratio is high, and hence \(B_i\) and \(B_j\) should be in the same group.

The cost associated with each block, \(B_i\), is \(c_i\), the number of linear programming constraints required for solving \(B_i\). This number can be calculated very efficiently. Assume that group \(G_k\) consists of blocks \(B_{ki}, i = 1, \ldots, |G_k|\). Then we define the cost of \(G_k\), \(C(G_k) = \sum_{i=1}^{|G_k|} c_{ki}\), and the merit of \(G_k\), \(M(G_k) = \sum_{i=1}^{|G_k|} \sum_{j=i+1}^{|G_k|} \mu_{ij}\). We now formulate the following optimization problem:
\[
\begin{aligned}
& \max \quad \sum_{k=1}^{N} M(G_k) \\
& \text{subject to} \quad C(G_k) < \alpha \cdot \text{MaxConstraints}
\end{aligned}
\]  

where \( N \) is the number of groups, \( \text{MaxConstraints} \) is the maximum number of constraints that one wishes to feed to the LP, and \( \alpha \geq 1 \) is introduced so that the partitioning procedure becomes more flexible since the cost of a group is allowed to exceed \( \text{MaxConstraints} \) temporarily. Now that the partitioning problem has been explicitly defined, we develop a multiple-way synchronous sequential circuit partitioning algorithm based on the algorithm proposed by Sanchis [42].

For each group \( G_k \) and each boundary latch \( L \), define the connection number, \( \Phi \), as

\[
\Phi_{G_k}(L) = |\{B | B \in G_k \text{ and } B \in B_L\}|
\]

Since each boundary latch connects exactly two blocks, \( \Phi_{G_k}(L) \in \{0, 1, 2\} \). In other words, if \( B_i \not= B_j \), then (a) if \( B_i \not\in G_k \) and \( B_j \not\in G_k \), \( \Phi_{G_k}(L) = 0 \) (Figure 4.6(a)), (b) if \( B_i \not\in G_k \) and \( B_j \in G_k \), or vice versa (Figure 4.6(b)), \( \Phi_{G_k}(L) = 1 \), and (c) if \( B_i \in G_k \) and \( B_j \in G_k \), \( \Phi_{G_k}(L) = 2 \) (Figure 4.6(c)).

The gain associated with moving \( B \) from \( G_i \) to \( G_j \) is defined as

\[
\begin{aligned}
\Gamma_{ij}(B) &= \sum_l (\tau(L_l) | L_l \in L_B \text{ and } \Phi_{G_j}(L_l) = 1) \\
&\quad - \sum_n (\tau(L_n) | L_n \in L_B \text{ and } \Phi_{G_i}(L_n) = 2)
\end{aligned}
\]  

The first term of (4.19) measures the benefit of moving \( B \) to \( G_j \), while the second measures the penalty of moving \( B \) out of \( G_i \).
Figure 4.6 Example showing calculation of connection numbers.

As an example, consider a scenario illustrated in Figure 4.7. According to the figure, latch $L_m$ belongs to group $G_i$, while $L_t$ does not belong to any group. Therefore, we are able to change the skew of latch $L_m$, but not that of latch $L_t$. If we move sequential block $B$ from $G_i$ to $G_j$, latch $L_t$ would be included in group $G_j$, which means that we are able to adjust the skew of latch $L_t$ when we apply our optimization procedure on group $G_j$.

On the other hand, now $L_m$ does not belong to any group. Therefore, by moving $B$ from $G_i$ to $G_j$, we obtain control over latch $L_m$, and the benefit is $\tau(L_t)$. Also we lose control
on latch $L_m$, which gives a penalty of $\tau(L_m)$. Finally, latch $L_n$ does not play a role here, since it does not belong to groups $G_i$ or $G_j$, before or after the moving.

Before beginning the partitioning procedure, the number of linear programming constraints, $c_i$, required for each block $i$ is calculated using the modified symbolic constraints propagation algorithm. If $c_i \geq MaxConstraints$ for some block $B_i$, then it is placed in a group alone and will not be processed later. Let

$$TotalConstraints = \sum_j (c_j | c_j < MaxConstraints)$$  \hspace{1cm} (4.20)

Each remaining block is put into one of the $N'$ groups,

$$N' = \left\lceil \frac{TotalConstraints}{MaxConstraints} \right\rceil$$  \hspace{1cm} (4.21)

such that for each group $k$, $C(G_k) < MaxConstraints$. This is an integer knapsack problem, and many heuristic algorithms can be used to obtain an initial partition (see, for example, [55], Chapter 2). In some cases, it may be impossible to put all blocks into $N$ groups without violating the restriction on $C(G_k)$ above; if so, the number of groups may be larger than that given in (4.21).
After the partitioning, we apply the optimization algorithm described in Chapter 3 to each group.

4.5 Experimental Results and Conclusion

Table 4.1 gives the experimental results for the partitioning procedure. Since most of the ISCAS89 circuits consist of only one combinational block, we generated some synchronous sequential random logic circuits. The number of gates and FFs in those circuits are shown in Table 4.1. For each circuit, we conduct three experiments.

(1) First, we minimize the area using clock skew optimization, but without partitioning.

(2) Secondly, we minimize the circuit area using both clock skew optimization and partitioning.

(3) For comparison, we minimize the circuit with neither clock skew optimization nor partitioning.

From the table, it can be seen that the first approach is able to obtain the best result as expected. Since it considers all variables at the same time, it provides the best solution. However, the run time is large. Compared to the first approach, the second approach runs much faster, at a very slight area penalty. Not surprisingly, the third approach gives the worst solution. We also note that the introduction of clock skew provides a significantly faster clock speed for circuit m1337. Although it has not been shown here, the same result also holds for m1783. For m1783, we also specify several different MaxConstraints. The result shows that as the specified MaxConstraints increases, the number of groups after partitioning decreases. As the number of groups decreases, the optimized solution...
Table 4.1 Performance comparison of the partitioning procedure.

<table>
<thead>
<tr>
<th>Circuit</th>
<th># of PIs</th>
<th># of POs</th>
<th># of FFs</th>
<th># of gates</th>
<th># of blocks</th>
</tr>
</thead>
<tbody>
<tr>
<td>m51</td>
<td>8</td>
<td>8</td>
<td>12</td>
<td>51</td>
<td>5</td>
</tr>
<tr>
<td>m144</td>
<td>16</td>
<td>2</td>
<td>18</td>
<td>144</td>
<td>9</td>
</tr>
<tr>
<td>m1337</td>
<td>51</td>
<td>53</td>
<td>97</td>
<td>1337</td>
<td>42</td>
</tr>
<tr>
<td>m1783</td>
<td>90</td>
<td>54</td>
<td>124</td>
<td>1783</td>
<td>43</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Circuit</th>
<th>$P_{spec}$</th>
<th>with clock skew opt.</th>
<th></th>
<th>without</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>w/o partitioning</td>
<td>with partitioning</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>Area</td>
<td>Run time</td>
<td>MxCnst</td>
<td>$N$</td>
<td>Area</td>
</tr>
<tr>
<td>m51</td>
<td>5.0</td>
<td>731</td>
<td>1.74s</td>
<td>300</td>
<td>2</td>
</tr>
<tr>
<td>m144</td>
<td>6.2</td>
<td>1872</td>
<td>6.11s</td>
<td>300</td>
<td>5</td>
</tr>
<tr>
<td>m1337</td>
<td>9.5</td>
<td>12364</td>
<td>135.35s</td>
<td>1500</td>
<td>6</td>
</tr>
<tr>
<td></td>
<td>9.25</td>
<td>12353</td>
<td>151.34s</td>
<td>1500</td>
<td>6</td>
</tr>
<tr>
<td>m1337</td>
<td>7.5</td>
<td>12685</td>
<td>171.92s</td>
<td>1500</td>
<td>6</td>
</tr>
<tr>
<td>m1337</td>
<td>6.75</td>
<td>13049</td>
<td>186.61s</td>
<td>1500</td>
<td>6</td>
</tr>
<tr>
<td>m1337</td>
<td>6.5</td>
<td>-</td>
<td>-</td>
<td>1500</td>
<td>G</td>
</tr>
<tr>
<td>m1783</td>
<td>9.5</td>
<td>18564</td>
<td>427.14s</td>
<td>300</td>
<td>16</td>
</tr>
<tr>
<td></td>
<td>1000</td>
<td>8</td>
<td>18708</td>
<td>156.55s</td>
<td>2000</td>
</tr>
</tbody>
</table>

$^{\dagger}$ MxCnst = MaxConstraints, the maximum number of constraints.

$^{\dagger}$ $N$, number of groups after partitioning.
using the partitioning procedure improves, while the run time increases only slightly. When \( N = 6 \), the solution is comparable to that without using partitioning, and the run time is still far less than that without using partitioning.

In summary, in this chapter we develop a synchronous sequential circuit partitioning algorithm. We propose a heuristic measure which is shown to be effective as the objective function of the partitioning problem. Experimental results show that our partitioning procedure is very effective in making our optimization algorithm run at a much faster speed, with no significant degradation in the quality of the solution.
CHAPTER 5

DELAY AND AREA OPTIMIZATION FOR PLACEMENT

5.1 Introduction

For standard-cell based VLSI circuits, optimization for improving timing performance can be carried out at three levels in the design process: logic synthesis, gate size selection, and layout. In previous chapters, we have concentrated on optimizing the timing performance of a VLSI circuit by gate sizing. Thus far, we have not considered interconnect delay due to wiring capacitances. As the size of today’s VLSI circuits becomes increasingly larger and the device size becomes smaller, the delay of a circuit becomes dominated by interconnect delays [56]. For example, in an SSI or MSI chip designed in 5 μm nMOS technology, the gate input capacitance (15 fF per minimum size transistor) dominates the wiring capacitance (200 fF/mm). A typical transistor with W/L = 10 has a capacitance of 150 fF, which is equivalent to 0.75 mm of wire. With a typical MSI die size of a few millimeters on a side, most of the nets will be well below 0.75 mm. According to the scaling theory, when the transistors are scaled down, the gate input capacitance is reduced, while the wiring capacitance per unit length is unchanged [56].
Therefore in a 0.5 μm CMOS technology, a minimum-sized transistor has 1.5 fF input capacitance, which yields a typical transistor (W/L = 10) input capacitance of 15 fF. This is equivalent to 0.075 mm of wire, which represents a large number of the nets in a typical 12 × 12 mm VLSI chip. Consequently, as the devices are scaled down further in submicron technology, the node capacitances will not go down as much as they did in a 5 μm nMOS MSI chip because of the increased role of the wiring capacitance, and the delay of a circuit is dominated by interconnect delay.

In this chapter, we extend our work to consider interconnect delay. We coalesce our work on the gate size selection and placement steps. Layout optimization, also referred to as timing-driven layout is concerned with placement and routing. From the overall chip timing viewpoint, the placement steps are more critical than the routing which can affect mostly the local issues such as noise coupling. For this reason, placement has received more attention in timing-driven layout.

Recently, there has been extensive research on timing-driven placement [10–13]. Timing-driven placement techniques can be broadly divided into two categories: net-oriented and path-oriented. In the net-oriented approach, the acceptable delay of each gate (cell) is calculated and translated into bounds on the delay associated with each net. These bounds then serve as constraints during the subsequent placement step. In the path-oriented approach, timing analyses of critical paths are performed dynamically during the placement step. All paths, or a subset of them, are taken into account implicitly in the formulation. Since the delay of a circuit is inherently path-oriented, it is expected that path-based approaches can obtain better solutions [13,57].

A standard-cell library typically contains several versions of any given gate type, each of which has a different gate size. The gate-sizing problem is that of choosing
optimal gate sizes from the library to minimize a cost function (such as total circuit area), while meeting the timing constraints imposed on the circuit. This is usually done after technology mapping, where the logic function of each gate is determined, and before the physical placement step. A drawback of such an approach is that accurate interconnect wire lengths are not available during the gate-sizing procedure. The gate size selected optimally at that stage may no longer be optimal after the physical design stage where large interconnect capacitances are introduced at the output of each gate. To deal with this problem, an iteration procedure is usually followed. After global placement, the capacitance associated with each net is extracted, and the gate-sizing procedure is repeated. However, in such an iterative approach, the variation of net capacitance between iterations may be large and cause large perturbation in the solutions. Thus, a number of iterations may be required, making this approach quite expensive. To deal with this problem, it is desirable that gate sizing and placement be incorporated into a single procedure.

As an illustration, consider a layout placement shown in Figure 5.1(a). Gate $D$ fans out to gates $L_1$, $L_2$ and $L_3$. Assume that the delay of this circuit under such layout conditions violates the timing constraints imposed on it. Moreover, $D$ and $L_2$ lie on a long path whose delay exceeds the timing constraint. Conventional timing-driven placement would move $D$, $L_1$, $L_2$ and $L_3$ closer to one another to decrease the delay of gate $D$, as shown in Figure 5.1(b). This may increase the wire lengths of other nets attached to cells $D$, $L_1$, $L_2$ and $L_3$. But if automatic gate sizing is incorporated with timing-driven placement, a possible solution would be to replace $D$ with a template with a higher driving capacity, and $L_1$ with one with a smaller loading capacitance with respect to $D$. As a result, some of the cells could be moved to better locations, as shown in Figure...
Figure 5.1 Advantage of gate sizing together with placement.

5.1(c). The overall effect is a reduction of the long-path delay, while the increase in area is kept to a minimum.

In [58], Kim et al. propose an area-timing-testability driven placement algorithm. Their algorithm consists of a series of iterations. At the beginning of each iteration, a placement using Timberwolf [59] is done to minimize the total wire length. After placement, a set of partial scan flip-flops is selected, followed by a gate-sizing step [23]. After gate sizing is done, timing bounds are calculated for each net; then Timberwolf is called again to obtain an improved layout. In each iteration of the annealing step inside Timberwolf, cells switch their positions in an attempt to reduce the total wire length and also to meet the timing bound assigned to each net. Therefore, the algorithm is net-based, and the gate sizing and placement steps are treated separately.

In this chapter, we propose an algorithm which combines the gate-sizing problem and timing-driven placement into one procedure. By considering these two problems
together, the value of the interconnect capacitance is known during the selection stage of the automatic sizing procedure. Therefore, optimal gate sizes can be chosen for each gate based on layout information, thus reducing the number of iterations required in the conventional approach. For simplicity, we use a row-based layout style. However, a more general arrangement can be used. In the following, the terminologies “gate” and “cell” are used interchangeably. Both refer to a module in the circuit. Besides, in the following, we consider combinatorial circuits only. For a sequential circuit, we can apply our algorithm to a combinatorial block in the sequential circuit one at a time.

This chapter is organized as follows. Section 5.2 briefly discusses previous work on timing-driven placement. In Section 5.3 we formulate the task of timing-driven placement with automatic gate sizing in a single optimization problem. In Section 5.4 we describe a novel algorithm which performs delay and area optimization for a given compact placement by means of gate resizing and relocation. Experimental results are provided in Section 5.5. Finally, we conclude the chapter in Section 5.6.

5.2 Previous Work

For many years, timing-driven layout techniques were net-oriented. The timing constraints derived from higher levels were translated into bounds on delay associated with each net, and timing-driven placement and routing were used to synthesize a layout satisfying those constraints. However, timing is not associated with the nets but with the signal flows along paths which are combinations of nets in the circuit. Therefore, instead of satisfying individual net delays, the constraints on the sum of delays of all of the nets constituting a path must be satisfied.
To take into consideration more accurate timing behavior and achieve globally better solutions, timing analysis of critical paths must be performed dynamically during the placement procedure. Such a technique is proposed by Jackson and Kuh in [10] where a sequence of linear programming steps is used to determine the cell placement in a hierarchical approach. The delay behavior is modeled in a path-oriented manner and considers intercell delays as well as interconnect and pin capacitances.

Sutanthavibul and Shragowitz [60] proposed a hierarchical constructive placement algorithm with look-ahead and adaptive placement capabilities. The delay functions are computed based upon the net geometry, capacitance per unit wire length, and the net loading, to arrive at the path delay values.

Donath et al. [61] introduced an approach in which the timing is evaluated together with routability in the global placement step. The parameterized delay equations are used in the path analysis. During the placement of cells on the critical paths, fast incremental timing analysis is performed to evaluate the feasibility of each move. A complete timing analysis is done after each major step.

Srinivasan et al. [11] proposed an approach based on Lagrangian Relaxation. They observed that only a small subset of timing requirements is active as constraints at one time, thus the problem of a large number of paths can be effectively avoided. They represented timing requirements by a set of linear inequalities. When the corresponding constrained optimization problem is turned into a Lagrangian, these linear inequalities make the Lagrangian nondifferentiable. The subgradient method was used to update Lagrange multipliers on the nondifferentiable Lagrangian.

Most recently, Hamada et al. [13] proposed an algorithm which also transforms the placement with timing constraints into a Lagrangian problem. A primal-dual approach
is then used to find the optimal relative module locations. In each primal dual iteration, the primal problem is solved by a piecewise linear resistive network method, while the dual process is used to update the Lagrange multipliers by using the Newton method.

5.3 Timing-Driven Placement with Gate Sizing

Typically, a path-based timing-driven placement algorithm formulates the placement problem as an optimization problem, with both timing requirement and physical placement requirement as constraints. The constraints are usually linear ones. The objective function can be either a linear function or a quadratic function of the cell coordinates. A quadratic objective function allows efficient quadratic programming techniques to be used, thus the problem can be solved relatively fast. However, in [62], it was observed that a linear objective function tends to reflect the actual wiring demands more accurately than the quadratic objective function. Therefore, we choose to use a linear objective function in our approach.

A circuit can be modeled as a set of $M$ gates (cells), $G = \{g_1, \ldots, g_M\}$, interconnected by a set of $N$ nets, $N = \{n_1, \ldots, n_N\}$, that attach to the cells at pins. For the sake of simplicity, we assume that all gates in the circuit are of single output. Therefore, net $n_i$ is associated with gate $g_i$. Hence, the same index $i$ can be referred to as both a gate and a net. We also assume that all pins are located in the centers of cells. Therefore, the physical location of a cell $i$ on the chip is represented by $(x_i, y_i)$, where $x_i$ ($y_i$) is the $x$ ($y$) coordinate of cell $i$. The positions of the I/O pads are fixed and located on the perimeter of the chip. These constraints act as the boundary conditions.
There are three categories of constraints in our LP formulation, namely, physical, timing, and sizing constraints.

5.3.1 Physical constraints

We approximate the wire length of an individual net by the half-perimeter of the smallest rectangle enclosing the pins of the net [10]. This approximation is the same as the rectilinear, minimal Steiner tree length for two- or three-pin nets (Figure 5.2). The approximation error for four- and five-pin nets is within the width of the bounding box of the Steiner tree length [63] (Figure 5.3). The bounding box for net $i$ is denoted by four
parameters, the northernmost ($\eta_i$), southernmost ($\sigma_i$), easternmost ($\varepsilon_i$), and westernmost ($\omega_i$) extents of the pins of the net (Figure 5.4). Mathematically, the bounding box constraints can be expressed as follows:

$$
\begin{align*}
\varepsilon_i &\geq x_{ij}, \\
\omega_i &\leq x_{ij}, \\
\eta_i &\geq y_{ij}, \\
\sigma_i &\leq y_{ij}, \\
\forall 1 \leq j \leq p_i
\end{align*}
$$

(5.1)

where $p_i$ is the number of pins associated with net $i$ and $j$ is a pin of net $i$.

Let $C_h$ and $C_v$ denote the unit length wire capacitance in horizontal and vertical layers, respectively. Then the interconnect capacitance, $C_i$, of net $i$ can be estimated as

$$
C_i = C_h(\varepsilon_i - \omega_i) + C_v(\eta_i - \sigma_i)
$$

(5.2)

Similarly, the length of net $i$, $l_i$, is

$$
l_i = (\varepsilon_i - \omega_i) + (\eta_i - \sigma_i)
$$

(5.3)
Therefore, the total wire length of the layout is

\[ \sum_{i=1}^{N} l_i \]  

(5.4)

5.3.2 Timing and sizing constraints

Consider a single-output gate \( i \) with \( f_i(i) \) inputs, and gate \( i \) fans out to \( f_o(i) \) gates. The worst-case signal arrival time at the output of gate \( i \), \( m_i \), can be expressed as

\[ m_i \geq m_{ij} + d_i, \quad 1 \leq j \leq f_i(i) \]  

(5.5)

Now the delay of gate \( g_i \) can be contributed to the loading capacitance of its fan-out gates, plus the wire capacitance of its fan-out net \( n_i \). Let \( CL_{ij} \) represent the loading capacitance of gate \( g_{ij} \) with respect to gate \( g_i \). Then the delay of gate \( g_i \) is

\[ d_i = \frac{R_e}{w_i} \times C_{out} + \tau \]  

(5.6)

\[ = \frac{R_e}{w_i} \times (C_i + \sum_{j=1}^{f_o(i)} C_{L_{ij}}) + \tau_1 \cdot w_i + \tau_2 \]  

(5.7)

\[ = \frac{R_e}{w_i} \times \{C_a(e_i - \omega_i) + C_v(\eta_i - \sigma_i) + \sum_{j=1}^{f_o(i)} (\alpha_{ij} \cdot w_{ij} + \beta_{ij})\} + \tau_1 \cdot w_i + \tau_2 \]  

(5.8)

where \( \alpha_{ij} \) and \( \beta_{ij} \) are related to the (transistor) gate terminal area capacitance and (transistor) gate terminal perimeter capacitance of the transistor of cell \( j \) to which cell \( i \) fans out [17].

As in Section 2.3.1, this is a sum of functions of the form \( y/w \). Therefore, it can be approximated by a piecewise linear function.
5.3.3 Objective function

The objective function of our optimization problem can be formulated as

\[
\min \left( \sum_{i=1}^{M} \gamma_i \cdot w_i + T \cdot \sum_{i=1}^{N} l_i \right)
\]  \hspace{1cm} (5.9)

where \( l_i \) is the length of net \( i \), \( T \) is a constant. In general, we may want to set \( T \) equal to the sum of the width of interconnect wire and the minimum distance between two adjacent wires. That way, \( T \) is the minimum width a wire occupies on the chip.

The objective function in this formulation represents two important quantities to be minimized in physical design. The first term is the total area of the cells. The second term represents the total area taken by the interconnect wires.

5.3.4 Slot constraints

For most placement algorithms using mathematical programming techniques, the solutions in general would yield a placement which could have many cell overlaps. Therefore, placement is usually alternated with partitioning steps that generate constraints for the next step. During each step, the following constraint is introduced for each region:

\[
r_i^x = \frac{1}{M_i} \cdot \sum_{j \in M_i} x_j
\]

\[
r_i^y = \frac{1}{M_i} \cdot \sum_{j \in M_i} y_j
\]  \hspace{1cm} (5.10)

where \( r_i^x \) (\( r_i^y \)) is the \( x \) (\( y \)) coordinate of the center of the \( i \)th region, \( M_i \). \( M_i \) is the number of cells in that region.
Equation (5.10) forces the center of gravity of all cells in the region to be equal to the center of the region. Therefore, the cells are distributed better over the whole placement region.

5.3.5 Final LP

After introducing the constraints and objective function, we are in a position to formulate the following linear programming:

\[
\begin{align*}
\text{minimize} & \quad \left( \sum_{i=1}^{M} \gamma_i \cdot w_i + \gamma \cdot \sum_{i=1}^{N} l_i \right) \\
\text{subject to} & \quad \text{For all gates } i = 1 \cdots M \\
& \quad m_j + d_i \leq m_i \quad \forall j \in \text{Fanin}(i) \\
& \quad m_i \leq T_{\text{spec}} \quad \forall \text{ gates } i \text{ at } PO's \\
& \quad d_i \geq \delta(w_i, w_{i,1}, \ldots, w_{i,fo(i)}, \varepsilon_i, \omega_i, \eta_i, \sigma_i) \\
& \quad w_i \geq \text{Minsize}(i) \\
& \quad w_i \leq \text{Maxsize}(i) \\
& \quad \varepsilon_i \geq x_{ij} \quad \forall 1 \leq j \leq p_i \\
& \quad \omega_i \leq x_{ij} \quad \forall 1 \leq j \leq p_i \\
& \quad \eta_i \geq y_{ij} \quad \forall 1 \leq j \leq p_i \\
& \quad \sigma_i \leq y_{ij} \quad \forall 1 \leq j \leq p_i
\end{align*}
\]

The above is a linear program in the variables \(w_i, m_i, d_i, x_i, y_i, \varepsilon_i, \omega_i, \eta_i, \) and \(\sigma_i\).
5.4 A Unified Algorithm for Adjusting Placement and Gate Sizing

Although it is possible to solve (5.11) directly, the execution time may be excessively large due to the large number of variables and constraints. In this section, we present an algorithm which tackles this problem indirectly, and thus reduces execution time.

Notice that timing-driven placement is needed because, in general, gate sizes are selected before the placement procedure, and gate sizes are fixed during placement. This imposes a restriction on the placement tool in the search for a good placement with minimum wire length. On the other hand, although placement tools such as those in [59, 64] can obtain a placement with minimal wire length, the delay of the circuit based on that placement may exceed timing constraints. Recently, it has been suggested that a compact placement which violates the timing constraint could be made to satisfy the delay bound by adjusting the sizes of some gates, without altering the placement topology (Chapter 16, [65]). In the following, we propose an algorithm which combines gate resizing and relocation to satisfy timing constraints, and at the same time the total circuit area (including cell area and wire length) is kept to a minimum, for a given compact placement.

First, all of the gates in the circuit are set to their minimum size. A compact placement is obtained with the objective of minimizing the total wire length. This can be done by using existing placement packages (e.g., Timberwolf [59] or Gordian [64]). After that, the wiring capacitance associated with the output of each gate is calculated. Based on this information, together with the circuit structure, optimal gate sizes are selected using the gate size optimization algorithm described in Sections 2.4 and 2.5. In general, some gates will be selected to have a larger size. This may cause overlap among cells. This problem
ALGORITHM Resizing and Relocation()

1. do initial placement;
2. do initial gate sizing for all cells in the circuit;
3. while ( timing constraints are not satisfied ) {
   4. select gates belonging to type 1, 2, and 3;
   5. formulate LP (Eq. (5.11)) for these gates;
      (remaining cells serve as boundary conditions)
   6. solve the LP;
   7. use mapping algorithm (Sec. 2.4) to obtain permissible size
      for each gate;
   8. adjust cell locations to avoid overlap;
   9. }
10. report final placement;

Figure 5.5 An outline of the Resizing and Relocation algorithm.

can be solved by shifting cells to avoid overlap. In general, however, the perturbation on
the delays of individual gates may cause the circuit delay to exceed the delay constraints.
If that does happen, a conventional approach would repeat the gate-sizing procedure
to guarantee that the circuit delay of that specific layout is below the delay constraint.
Usually, the gate sizing and placement procedures have to be repeated a few times before
a final solution is reached.

Our algorithm, in contrast, does not repeat the gate-sizing procedure all over again.
Rather, once the algorithm detects that the delay of the circuit is violated, a number
of gates, as described below, are selected. These gates will be resized and/or moved
to different locations to satisfy time constrains as well as to minimize total circuit area
(including cell area and wire length). The outline of our algorithm is shown in Figure 5.5.
In the following, we described how we select only a small portion of gates in the circuit for resizing and relocation, and how cell resizing and relocation can be combined into one formulation.

In addition to the worst-case signal arrival time, \( m_i \), for each gate \( i \) in the circuit we introduce the required signal arrival time, \( r_i \). The required signal arrival time is the latest time by which a signal has to arrive at the output of gate \( i \) to make the delay at the POs less than the specified delay. The required signal arrival time is defined to be

\[
T_{\text{spec}}, \quad \text{if gate } i \text{ at PO}
\]

\[
\max\{r_j - d_j \mid \forall j \in \text{Fanout}(i)\}, \quad \text{otherwise}
\]

For each gate \( i \), we also define a slack \( s_i \), where.

\[
s_i = r_i - m_i
\]

Definition 5.1 An active gate \( i \) is a gate with \( s_i < 0 \). The set of all active gates is denoted by \( C \).

Definition 5.2 The timing of a circuit layout is said to be satisfied if and only if \( C \) is empty, i.e., \( s_i \geq 0 \), for every gate \( i \) in the circuit.

Definition 5.3 A critical path is a path in which all of the gates along the path have slack values less than or equal to zero.

Our objective is to satisfy specified delay bounds and to keep the total circuit area to a minimum. This can be achieved in two ways. The first one is to resize gates. For example, for those gates lying on critical paths, we may replace a gate with a template with a higher driving capacity to reduce the delay of that gate. Alternatively, we may
replace a gate with one with smaller input capacitance, thus reducing the delay of its driving gates. The second one is to move some cells to new locations, so as to reduce the interconnect wiring capacitance attached to those gates lying on critical paths. This will also reduce the delays of these gates.

The unified optimization algorithm begins by calculating the slack of each gate. Then three types of gates are selected for improvement.

(1) The first type is active gates, which are gates with negative slack. These gates will be allowed to change their sizes as well as be free to move to new locations. Since active gates are those with worst-case signal arrival times later than the required signal arrival time, it is most important to adjust the delays of active gates such that the circuit delay is less than the specified timing constraints. However, in addition to adjusting the size and location of an active gate, the following two types of cells should also be included in the linear program.

(2) The second type involves those gates with nonnegative slacks less than a small specified value, \( \delta \). During this phase some gates will change their size, and some others will be moved to new locations; as a result, output load capacitances of certain gates will increase (while those of others will decrease). For those gates with large slacks, it is likely that such changes, although they will increase their delay, will not make their slack negative. Therefore, those gates with larger slacks are likely to remain nonactive. However, for those gates with small nonnegative slacks, it is possible that such a delay increase will make their slacks become negative. Therefore, it is more advantageous to include those gates with small nonnegative slacks in the formulation to avoid additional iterations.
The third type of gate includes those that are directly connected to the outputs of active gates. Remember that active gates are those which violate timing constraints. Therefore, reducing the delays of these gates, besides changing their sizes, can also be accomplished by reducing their output load capacitances. This can be done by either moving or reducing the sizes of the active gates' fan-out cells.

Gates belonging to types 1, 2, and 3 are put into the linear program, (5.11), and a new solution is obtained by solving it. In principle, to obtain a better solution, it is necessary to include all three types of gates in the linear program. In practice, however, to maintain the efficiency of the program, it is necessary to limit the number of gates to be included. In addition, since many gates' locations are fixed, they can serve as boundary conditions for physical constraints. Therefore, the gravity centering constraints, (5.10), are not needed. Furthermore, to avoid drastically changing the solution, each selected gate is allowed to change to its nearest larger or smaller size only.

The solution of such a formulated linear program gives a new size and a new position for each selected cell. The mapping algorithm described in Section 2.4 is used to obtain the permissible size for each gate. Since many cells are moved to new locations, and some of them are replaced with templates of different sizes, there may be overlap among some cells. Therefore, it is necessary to move cells into (slightly) different locations to avoid overlap.

If necessary, the above procedure is repeated until the delay constraints are all satisfied. However, according to our experience, only one iteration is needed in most cases. Also, since only a relatively small number of gates are selected to be resized and relocated, the execution time for each iteration is relatively small (compared to the time needed to resize all gates).
Table 5.1 Experimental results of PRECISE.

<table>
<thead>
<tr>
<th>Circuit</th>
<th>$T_{spec}$</th>
<th>Conventional approach</th>
<th>PRECISE</th>
<th>$A_P$</th>
<th>$L_P$</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>cell area ($A_M$)</td>
<td>cell area ($A_P$)</td>
<td>wire length ($L_M$)</td>
<td>wire length ($L_P$)</td>
</tr>
<tr>
<td>c432</td>
<td>14.0</td>
<td>3111</td>
<td>3061</td>
<td>785</td>
<td>732</td>
</tr>
<tr>
<td></td>
<td>13.0</td>
<td>3726</td>
<td>3484</td>
<td>912</td>
<td>796</td>
</tr>
<tr>
<td></td>
<td>12.0</td>
<td>-</td>
<td>4344</td>
<td>-</td>
<td>913</td>
</tr>
<tr>
<td>c1355</td>
<td>20.0</td>
<td>8096</td>
<td>7997</td>
<td>2612</td>
<td>2356</td>
</tr>
<tr>
<td></td>
<td>19.0</td>
<td>8998</td>
<td>8911</td>
<td>2794</td>
<td>2619</td>
</tr>
<tr>
<td></td>
<td>18.0</td>
<td>-</td>
<td>9912</td>
<td>-</td>
<td>2811</td>
</tr>
<tr>
<td>c2670</td>
<td>25.0</td>
<td>18015</td>
<td>17680</td>
<td>11243</td>
<td>10710</td>
</tr>
<tr>
<td></td>
<td>23.0</td>
<td>18648</td>
<td>18408</td>
<td>11462</td>
<td>10840</td>
</tr>
<tr>
<td></td>
<td>21.0</td>
<td>-</td>
<td>19692</td>
<td>-</td>
<td>11788</td>
</tr>
<tr>
<td>c5315</td>
<td>26.0</td>
<td>37650</td>
<td>37310</td>
<td>23810</td>
<td>23173</td>
</tr>
<tr>
<td></td>
<td>24.0</td>
<td>38432</td>
<td>38055</td>
<td>24231</td>
<td>23534</td>
</tr>
<tr>
<td></td>
<td>22.5</td>
<td>-</td>
<td>39351</td>
<td>-</td>
<td>24344</td>
</tr>
<tr>
<td>c7552</td>
<td>27.0</td>
<td>50968</td>
<td>51046</td>
<td>43625</td>
<td>42524</td>
</tr>
<tr>
<td></td>
<td>24.0</td>
<td>-</td>
<td>52699</td>
<td>43613</td>
<td>43613</td>
</tr>
<tr>
<td></td>
<td>23.0</td>
<td>-</td>
<td>54088</td>
<td>43673</td>
<td>43673</td>
</tr>
</tbody>
</table>

Average Ratio

|          | 0.983 | 0.939 |

5.5 Experimental Results

The above algorithms have been implemented in C in the program PRECISE (PeRformance-driven plaCement with automatic gate SizE optimization) on a Sun Sparc 10 Station.

The experimental results of the program PRECISE, which implements the unified placement improvement and gate resizing algorithms, are summarized in Table 5.1.
To show the effectiveness of our algorithm, we intentionally adjust the value of the interconnect wiring capacitance per unit length, such that interconnect delay accounts for about 30% of the total delay of each circuit. At present, we use Fiduccia's min-cut partitioning algorithm [47] to obtain a compact placement. The partitioning algorithm recursively divides cells into two partitions so that the number of nets that cross the partition boundaries is minimized, until a small number of cells are left in each partition and then cells are placed to their final location. It has been observed that partitioning-based placement tends to spread the wiring across the layout surface and thus produces very routable placement (Chapter 4, [3]). More compact placement can be obtained by using other algorithms (e.g., [59, 64]). For comparison, we also perform placement and gate sizing based on the purely iterative approach. That is, the two procedures of placement adjustment and gate resizing are executed separately and are included in an iteration loop. The experimental results show that PRECISE is able to obtain better solutions than the conventional iterative approach. Moreover, for very tight timing bounds, the conventional approach fails to obtain solutions at all. This is because cell locations are fixed in the conventional approach, and excessively large capacitances may have been introduced at the output of some gates on critical paths. On the other hand, in addition to resizing cells, PRECISE also moves cells to different locations to reduce large wiring capacitance. Therefore, it is able to obtain solutions even for tight delay bounds. Furthermore, since instead of trying to resize all cells, PRECISE resizes only a small portion of cells when timing bounds are violated; as a result, its execution time is faster in general.
5.6 Conclusion

To date, gate sizing and placement are treated separately in different steps during the circuit design process. Such an approach has not caused much trouble because interconnect delay takes up only a small amount of the circuit delay in a chip fabricated using today's VLSI technology. However, as the devices are scaled down in deep submicron technology, the delay of a circuit becomes dominated by interconnect delay. Therefore, it becomes more and more important to combine gate sizing and placement into one procedure.

In this chapter, for the first time, the gate-sizing problem is combined with placement in one formulation. Although the execution time for the combined problem may be excessively large, we propose an indirect approach to fully utilize some special properties of the formulation to develop a novel algorithm which performs delay and area optimization for a given compact placement, by resizing and relocating cells in the circuit layout. The experimental results are very encouraging.
CHAPTER 6
CONCLUSIONS

In this thesis, an efficient algorithm is presented to minimize the area taken by cells in standard-cell designs under timing constraints. Experimental results show that our approach can obtain a near-optimal solution (compared to simulated annealing) in a reasonable amount of time, even for very tight delay constraints.

For synchronous sequential circuits, a unified approach to minimizing circuit area and optimizing clock skews is presented. Traditionally, the circuit area of a sequential circuit is minimized one combinational subcircuit at a time. Our experiments have shown that this may lead to very suboptimal solutions in some cases. We formulate the discrete gate-sizing optimization as a linear program, which enables us to integrate the equations with clock skew optimization constraints, taking a more global view of the problem. Experimental results show that this approach not only reduces total circuit area, but also gives much faster operational clock speed. For large sequential circuits, we also present a partitioning procedure. Experiments shows that our partitioning procedure is very effective. Using our partitioning procedure, our optimization algorithm is able to run at a much faster speed, with no significant degradation in the quality of the solution.

To date, most research on performance-driven placement assumes that gate sizes are selected before the placement stage. This imposes a restriction on the placement tool
in searching for a good placement with minimum wire length. Recently, it has been suggested that a compact placement which violates the timing constraint could be made to satisfy the delay bound by adjusting the sizes of some gates, without altering the placement topology [65]. In this thesis, we have shown that such an approach may lead to solutions of inferior quality. Instead, by considering resizing and moving the locations of some gates in a unified optimization procedure, we are able to obtain better solutions, with smaller execution times than the conventional iterative method. For the first time, the gate-sizing problem is combined with placement in one formulation. Although the execution time for the combined problem may be excessively large, we propose an indirect approach to fully utilize some special properties of the formulation to develop a novel algorithm which performs delay and area optimization for a given compact placement, by resizing and relocating cells in the circuit layout.

6.1 Future Work

In a combinational circuit, there are some paths that can never be excited by any combination at the primary inputs. Hence, these paths can never be critical [66–68]. The presence of false paths in a circuit causes some gates to be sized unnecessarily, since the optimizer tries to reduce the delay along a path that can never be critical. This may lead to a suboptimal solution to the gate-sizing problem. In other words, although a physical level performance optimizer must certify that the delay of the longest sensitizable paths after optimization is not longer than the specified delay, long paths are allowed to exist in the optimized circuit if they are not sensitizable. As demonstrated in [69], most long paths in a complex circuits are actually false. As a result, to optimize the performance
of a large circuit at the physical level, it is important to consider the false path problem in conjunction with gate size optimization.

Retiming [70] has been shown to be an effective technique to optimize the performance of synchronous sequential circuits. Retiming is an operation on a network whereby registers move across logic blocks in order to minimize the clock cycle or the number of registers while maintaining the behavior of the circuits.

The retiming technique considers only the sequential elements of the circuit; it assumes that the combinational logic structure is fixed. In [71], a set of logic synthesis operations has been combined with retiming to optimize sequential circuits for the area and clock period. Peripheral retiming has been used to optimize the performance of pipelined circuits using combinational delay optimization techniques [72]. While these formulations do exist, they are not directly relevant to our work since we assume that we begin our optimization at the end of the logic synthesis stage. It has been shown that a retiming algorithm can be formulated as a mixed integer linear program [70]. This approach, however, may not be applied directly to our problem, since the computational complexity involved would be prohibitive. We propose to seek methods of carrying out retiming through a series of inexpensive local optimization. For example, as shown in Figure 6.1 [9], it can be seen that changing the clock arrival time at a flip-flop is equivalent to changing the delay specifications on the combinational subcircuits to which that flip-flop is connected. The net effect of this is similar to the moving of the flip-flop across combinational logic module boundaries. Therefore the solution to the clock skew optimization problem could also be interpreted to be a new set of timing specifications for each combinational subcircuit, which may be enforced either by permitting a clock skew, or through retiming.
Methods for using retiming in combination with clock skew for achieving this change in the timing specification should be explored, thus obtaining better solutions for gate-size optimization.

In a real chip, the delay between two successive logic gates is composed of three elements: (1) intrinsic delay due to switching a gate on/off, (2) delays due to charging fanout and load capacitance, and (3) delay due to distributed RC interconnection. The scaling rule suggests that the interconnect delay will be dominant for the circuits with larger chip size and smaller geometry. The effect can be quite significant for submicron circuits since the interconnect delay grows superlinearly with the scaling factor and the chip dimension. As the VLSI fabrication technology reaches submicron device dimensions and gigahertz frequencies, it is necessary to consider such interconnect delays. The timing-driven placement improvement algorithm we propose in Chapter 5 uses a lumped RC model. However, wires on scaled-down ICs have significant resistance and should be analyzed as distributed RC lines. In summary, timing-driven placement and routing remain the challenges of today's submicron device technology.

Figure 6.1 Retiming and clock delay transformation.
Currently, there are growing demands for low power circuits for two main reasons. First, as the device size and chip density continue to increase rapidly, with a down scale to 0.6 μm at present (and 0.2 μm expected) and over 100 MHz clock cycles, it becomes too expensive to provide adequate cooling systems for powerful microprocessor chips. For example, using 0.75 μm technology and 3.3 V power supply, DEC's Alpha chip consumes 30 W at 200 MHz [73]. Second, with the increasing popularity of portable consumer products (e.g., laptop/notebook computers and cellular phones), low-power designs become a must, because conventional nickel-cadmium battery technology provides only 20 W·h of energy for each pound of weight [74]. For these reasons, designers now are willing to trade off area for low power consumption.

There has been active research related to low power designs [75–82]. At the architecture level, a parallel implementation can be used to maintain throughput while reducing the supply voltage, thus reducing the power consumption [75]. At the circuit level, a popular technique is to turn off the system clock for those parts of the circuit that are not active. In a CMOS design, the average power consumed by a gate is given by

$$P_{avg} = \frac{1}{2} \times C_{out} \times V_{dd}^2 \times D$$

(6.1)

where $C_{out}$ is the output load capacitance, $V_{dd}$ is the power supply voltage, and $D$ is the transition density of the gate [79]. Hence, power consumption of a gate is determined by three factors, namely, $C_{out}$, $V_{dd}$ and $D$. At the device level, work is being done to reduce the peak voltage needed for switching (reducing $V_{dd}$). Other than $V_{dd}$, we can reduce the power consumed by a single gate by reducing $C_{out}$ and $D$. Some work has been done in the area of low power logic synthesis. However, as in the area and delay optimization in logic synthesis, the work is applied to technology-independent synthesis, where gate
models for power, delay, and area are not very accurate [76]. Therefore, it is important to perform low-power design at the gate-sizing and physical layout stages.

For the gate-sizing and physical layout problems, it is important to reduce the capacitance load of those gates with large switching activity. For example, if gate i has large switching activity, it is then desirable to reduce its capacitance load due to (1) its fan-out gates, and (2) interconnect wires. For case (1), we should choose a template with smaller input capacitance for its fan-out gates. Hence, the objective function of the optimization should be weighted by the transition density of each gate. To deal with case (2), during placement procedure, it is advantageous to put the fan-out gates of i closer. Similarly, a weight based on each gate's transition density can be included when calculating the total wire length. Therefore, a low-power driven placement algorithm can also be developed.
APPENDIX A

EXTRACTING PARAMETERS FROM A LIBRARY

In this appendix, we will show how various parameters needed in our formulation can be calculated from a given standard-cell library.

As an example, consider a three-input AOI (and-or-inverter) gate whose output logic value is $a_1 \cdot a_2 + \overline{b}$. This gate has three inputs, namely, $a_1$, $a_2$, and $b$. In the given library, this gate has three templates with different cell areas and driving capabilities. Usually, the library lists pin-to-pin delay information, as well as worst-case delay. In our application, we use the worst-case delay. Suppose the characteristics of the three templates are specified as follows.

- **Template 1**
  - cell area = 1856
  - worst-case delay = $R_{out} \times C_{out} + r = 3.64 \times C_{out} + 0.75$
  - pin capacitance of $a_1 = 0.123$
  - pin capacitance of $a_2 = 0.091$
  - pin capacitance of $b = 0.111$

- **Template 2**
  - cell area = 3401
- worst-case delay = 2.05 \times C_{\text{out}} + 1.40
- pin capacitance of a_1 = 0.185
- pin capacitance of a_2 = 0.175
- pin capacitance of b = 0.201

- Template 3
  - Cell area = 5797
  - worst-case delay = 1.00 \times C_{\text{out}} + 2.30
  - pin capacitance of a_1 = 0.405
  - pin capacitance of a_2 = 0.285
  - pin capacitance of b = 0.345

Let \( R_a = 5.0 \). From the above information, we have

- \( w_1 = R_a/R_1^{\text{g}} = 5.0/3.64 = 1.374 \)
- \( w_2 = R_a/R_2^{\text{out}} = 5.0/2.05 = 2.439 \)
- \( w_3 = R_a/R_3^{\text{out}} = 5.0/1.00 = 5.000 \)

where \( w_1, w_2, \) and \( w_3 \) are permissible gate sizes.

To obtain a linear relationship between the cell area and the gate size, we linearly approximate the set of data points of area vs. \( w \), \{(1.374, 1856), (2.439, 3401), (5.000, 5797)\}. Then we have the following linear expression of the gate area in terms of the gate size, \( w \).

\[
\text{area} = \gamma \cdot w + \varepsilon = 1060.02 \cdot w + 571.346 \quad (A.1)
\]

Therefore, \( \gamma = 1060.02 \) and \( \varepsilon = 571.346 \). The data points and the affine function are plotted in Figure A.1
The capacitance at input pin $a_1$ to be used to calculate the loading capacitance, $C_{out}$, of this gate’s fan-in gates can be obtained by linearly approximating the set of data points of $a_1$ pin capacitance vs. $w$, $\{(1.374,0.1229), (2.439,0.185), (5.000,0.405)\}$. This gives

$$cap(a_1) = \alpha_{a_1} \cdot w + \beta_{a_1} = 0.05885 \cdot w + 0.03007 \quad (A.2)$$

Therefore, $\alpha_{a_1} = 0.05885$ and $\beta_{a_1} = 0.03007$. The data points and the affine function are plotted in Figure A.2. The capacitances at pins $a_1$, $a_2$, and $b$ contribute to the output capacitance of any fan-in gates.

Following the same procedure, we can find that $\alpha_{a_2} = 0.06942$, $\beta_{a_2} = -0.00033$, and $\alpha_b = 0.06371$, $\beta_b = 0.03336$.

To obtain the values of $\tau_1$ and $\tau_2$, we have to linearly approximate the following data set of $\tau$ vs. $w$, $\{(1.374,0.75), (2.439,1.40), (5.00,2.30)\}$. This gives

$$\tau = \tau_1 \cdot w + \tau_2 = 0.41349 \cdot w + 0.268641 \quad (A.3)$$

Therefore, $\tau_1 = 0.41349$ and $\tau_2 = 0.268641$. The data points and the affine function are plotted in Figure A.3.
Finally, the gate delay can be approximated by the following equation:

\[
\text{delay} = R_{\text{out}} \times C_{\text{out}} + \tau_1 \cdot w + \tau_2
\]

\[
= \frac{5.0}{w} \times C_{\text{out}} + 0.41349 \cdot w + 0.268641
\]

(A.4)
REFERENCES


VITA

Wei-Tong Chuang was born in Taichung, Taiwan, in 1964. He received the Bachelor of Science degree from the National Taiwan University in 1986, and the Master of Science degree from the University of Maryland at College Park in December 1989, both in Electrical Engineering. From 1988 to 1989, he was a recipient of the Graduate School Fellowship of the University of Maryland. From January 1990 to January 1994, he was a research assistant at the Coordinated Science Laboratory at the University of Illinois. He is currently a member of technique staff of the AT&T Bell Laboratories at Murray Hill, New Jersey. His research interests include optimization algorithms for CAD, timing analysis, and physical design of VLSI circuits.

PUBLICATIONS


