Systolic Design with Asynchronous Controls for Digital-Signal Processing

Final Report

Author:
Dr. Kamran Reihani

July 15, 1993

U.S. Army Research Office
Grant: DAAL03-90-G-0211

Physical Science Laboratory
P.O. Box 30002
Las Cruces, NM 88003-0002
(505) 522-9100 FAX 505-522-9389/9434

93-25093
Systolic Design with Asynchronous Controls for Digital-Signal Processing

Final Report

Author:

Dr. Kamran Reihani

July 15, 1993

U. S. Army Research Office
Grant: DAAL03-90-G-0211

Physical Science Laboratory
New Mexico State University
Box 30002
Las Cruces, New Mexico 88003-0002

The views, opinions, and/or findings contained in this report are those of the author and should not be construed as an official Department of the Army position, policy, or decision, unless so designated by other documentation.
# Systolic Design with Asynchronous Controls for Digital-Signal Processing

## Title and Subtitle

**Systolic Design with Asynchronous Controls for Digital-Signal Processing**

## Authors

Dr. Kamran Reihani

## Performing Organization Name(s) and Address(es)

**Physical Science Laboratory**
New Mexico State University
Box 30002
Las Cruces, NM 88003-0002

## Sponsor(s) and Monitor(s)

**U.S. Army Research Office**
P. O. Box 12211
Research Triangle Park, NC 27709-2211

## Abstract

The research sponsored by this grant is focused on the development of a theoretical and technological basis for designing efficient systolic arrays for digital-signal processing algorithms. The major contributions of the research are the following:

- Data reduction techniques via the utilization of algorithm properties
- Conversion of sequential input signal into input blocks by means of a spiral systolic mesh that is suitable for parallel processing and that is flexible for enabling various array dimensions
- Devising a hybrid of SA and data-flow approach, making data streams independent of computations executed in each processor, thus reducing waiting time. The communication (PE) protocols resolve the data-flow conflicts created by the merging of the spiral and asynchronous systolic array architecture

## Subject Terms

Asynchronous controls, Block kernel cycle, Communication protocol, Computational graph, Connection matrices, Data propagation, Digital-Signal processing algorithms, Processing element, Spiral mesh, Systolic array

## Distribution/Availability Statement

Approved for public release; distribution unlimited.

## Limitation of Abstract

UNCLASSIFIED
The Report Documentation Page (RDP) is used in announcing and cataloging reports. It is important that this information be consistent with the rest of the report, particularly the cover and title page. Instructions for filling in each block of the form follow. It is important to **stay within the lines** to meet optical scanning requirements.

<table>
<thead>
<tr>
<th><strong>Block 1</strong></th>
<th><strong>Agency Use Only (Leave blank)</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Block 2</strong></td>
<td><strong>Report Date</strong> Full publication date including day, month, and year, if available (e.g. 1 Jan 88) Must cite at least the year.</td>
</tr>
<tr>
<td><strong>Block 3</strong></td>
<td><strong>Type of Report and Dates Covered</strong> State whether report is interim, final, etc. If applicable, enter inclusive report dates (e.g. 10 Jun 87 - 30 Jun 88).</td>
</tr>
<tr>
<td><strong>Block 4</strong></td>
<td><strong>Title and Subtitle</strong> A title is taken from the part of the report that provides the most meaningful and complete information. When a report is prepared in more than one volume, repeat the primary title, add volume number, and include subtitle for the specific volume. On classified documents enter the title classification in parentheses.</td>
</tr>
<tr>
<td><strong>Block 5</strong></td>
<td><strong>Funding Numbers</strong> To include contract and grant numbers; may include program element number(s), project number(s), task number(s), and work unit number(s). Use the following labels: C - Contract PR - Project G - Grant TA - Task PE - Program WU - Work Unit Element Accession No.</td>
</tr>
<tr>
<td><strong>Block 6</strong></td>
<td><strong>Author(s)</strong> Name(s) of person(s) responsible for writing the report, performing the research, or credited with the content of the report. If editor or compiler, this should follow the name(s).</td>
</tr>
<tr>
<td><strong>Block 7</strong></td>
<td><strong>Performing Organization Name(s) and Address(es)</strong> Self-explanatory.</td>
</tr>
<tr>
<td><strong>Block 8</strong></td>
<td><strong>Performing Organization Report Number</strong> Enter the unique alphanumeric report number(s) assigned by the organization performing the report.</td>
</tr>
<tr>
<td><strong>Block 9</strong></td>
<td><strong>Sponsoring/Monitoring Agency Name(s) and Address(es)</strong> Self-explanatory.</td>
</tr>
<tr>
<td><strong>Block 10</strong></td>
<td><strong>Sponsoring/Monitoring Agency Report Number (If known)</strong></td>
</tr>
<tr>
<td><strong>Block 11</strong></td>
<td><strong>Supplementary Notes</strong> Enter information not included elsewhere such as: Prepared in cooperation with...; Trans. of...; To be published in... When a report is revised, include a statement whether the new report supersedes or supplements the older report.</td>
</tr>
<tr>
<td><strong>Block 12a</strong></td>
<td><strong>Distribution/Availability Statement</strong> Denotes public availability or limitations. Cite any availability to the public. Enter additional limitations or special markings in all capitals (e.g. NOFORN, REL, ITAR).</td>
</tr>
<tr>
<td><strong>Block 12b</strong></td>
<td><strong>Distribution Code</strong></td>
</tr>
<tr>
<td><strong>Block 13</strong></td>
<td><strong>Abstract</strong> Include a brief (Maximum 200 words) factual summary of the most significant information contained in the report.</td>
</tr>
<tr>
<td><strong>Block 14</strong></td>
<td><strong>Subject Terms</strong> Keywords or phrases identifying major subjects in the report.</td>
</tr>
<tr>
<td><strong>Block 15</strong></td>
<td><strong>Number of Pages</strong> Enter the total number of pages.</td>
</tr>
<tr>
<td><strong>Block 16</strong></td>
<td><strong>Price Code</strong> Enter appropriate price code (NTIS only).</td>
</tr>
<tr>
<td><strong>Blocks 17 - 19</strong></td>
<td><strong>Security Classifications</strong> Self-explanatory. Enter U.S. Security Classification in accordance with U.S. Security Regulations (i.e., UNCLASSIFIED). If form contains classified information, stamp classification on the top and bottom of the page.</td>
</tr>
<tr>
<td><strong>Block 20</strong></td>
<td><strong>Limitation of Abstract</strong> This block must be completed to assign a limitation to the abstract. Enter either UL (unlimited) or SAR (same as report). An entry in this block is necessary if the abstract is to be limited. If blank, the abstract is assumed to be unlimited.</td>
</tr>
</tbody>
</table>
# TABLE OF CONTENTS

| LIST OF FIGURES                                                                 | .......................................................... | 2 |
| 1. STATEMENT OF RESEARCH PROBLEM                                              | ....................................................................... | 3 |
| 2. SUMMARY OF RESULTS                                                          | ....................................................................... | 4 |
| 2.1 Theoretical Aspect                                                         | ....................................................................... | 4 |
| 2.2 System Aspect                                                             | ....................................................................... | 6 |
| 2.2.1 Spiral Architecture                                                     | ....................................................................... | 8 |
| 2.2.2 Asynchronous Architecture                                               | ....................................................................... | 13 |
| 2.3 Performance Aspect                                                        | ....................................................................... | 18 |
| 2.4 References                                                                | ....................................................................... | 19 |
| 3. PUBLICATIONS AND REPORTS                                                   | ....................................................................... | 21 |
| 4. PERSONNEL SUPPORTED AND DEGREES AWARDED                                     | ....................................................................... | 22 |
| APPENDIX A                                                                    | ....................................................................... | 23 |
| APPENDIX B                                                                    | ....................................................................... | 25 |
| APPENDIX C                                                                    | ....................................................................... | 27 |
| APPENDIX D                                                                    | ....................................................................... | 29 |
## LIST OF FIGURES

<table>
<thead>
<tr>
<th>Fig.</th>
<th>Description</th>
<th>Page No.</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fig. 1</td>
<td>One-step adaptive lattice predictor, $L = 3$</td>
<td>5</td>
</tr>
<tr>
<td>Fig. 2</td>
<td>Computational graph for $v_{i,k}, a_{b,k}$</td>
<td>7</td>
</tr>
<tr>
<td>Fig. 3</td>
<td>Computational scheme for $v_{i,k}, a_{b,k}$</td>
<td>7</td>
</tr>
<tr>
<td>Fig. 4</td>
<td>Interconnection scheme for $v_{i,k}, a_{b,k}$</td>
<td>7</td>
</tr>
<tr>
<td>Fig. 5</td>
<td>Proposed PE for $v_{i,k}, a_{b,k}$</td>
<td>7</td>
</tr>
<tr>
<td>Fig. 6</td>
<td>Propagation of data between array PEs</td>
<td>8</td>
</tr>
<tr>
<td>Fig. 7</td>
<td>Spiral systolic array architecture for LPF filter of order 12</td>
<td>9</td>
</tr>
<tr>
<td>Fig. 8</td>
<td>Inputs and outputs of a PE</td>
<td>10</td>
</tr>
<tr>
<td>Fig. 9</td>
<td>PE input/output data distribution for kernel cycle $t = 9$</td>
<td>12</td>
</tr>
<tr>
<td>Fig. 10</td>
<td>Proposed protocol in (type I)</td>
<td>17</td>
</tr>
</tbody>
</table>
1. STATEMENT OF RESEARCH PROBLEM

The systolic algorithm approach is currently the most effective design procedure for minimizing computing time and the number of processors. Systolic arrays (SAs) are examples of VLSI special purpose processor networks that realize computationally expensive algorithms in hardware, and can achieve massive concurrency. Such arrays are numerously used for the real-time implementation of digital-signal processing algorithms, because of the properties that these algorithms possess for their effective mapping onto high-speed SA architectures. The main problem in a systolic approach is that extra delays are needed to assure proper timing and synchronization, which will collectively slow down computation and, thereby, reduce the throughput rate. For a large-scale array, this synchronization can become particularly tedious. Moreover, algorithms for asynchronous SAs (i) have not been fully utilized for data reduction, and (ii) the conversion of sequential input signals into input blocks that will alter the organization of the systolic mesh has not been performed.

The research sponsored by this grant focused on the following important technical issues that result in the development of efficient SAs.

- Data reduction techniques via the utilization of algorithm properties.
- Conversion of sequential input signals into input blocks by means of a spiral systolic mesh that is suitable for parallel processing and that is flexible for enabling various array dimensions.
- Making data streams independent of computations executed in each processor, thus reducing waiting time.

The main objective of this research grant is to develop a theoretical and technological basis for designing an asynchronous SA — that is a hybrid of SA and data-flow approach — one capable of converting input signals into input blocks for producing a faster and more flexible system. The key to this design is a spiral SA structure, self-timed processors, and communication protocols to get control of data streams so that each computation can start if all its data are available. Moreover, the communication protocols resolve the data-flow conflicts created by the merging of the spiral and asynchronous SA architectures. The SA processes the input signal efficiently and eliminates the complex shift register organization of traditional filter realizations. Incorporated in this design are maximum parallelism and pipelinability, trade-off among computations, communications, and memory. Furthermore, the systolic array will use simple local interconnections without undesirable properties, such as preloading input data or global broadcasting.

---

1 For most digital signal processing algorithms.
2. SUMMARY OF RESULTS

The research performed during this grant focused on the improvement of present methods of designing systolic arrays [1-7] for the following digital signal processing algorithms [8-15]:

- discrete Fourier transform (DFT)
- inverse discrete Fourier transform (IDFT)
- convolution
- correlation
- linear phase filter (LPF)
- discrete Hadamard transform (DHT)
- one-step adaptive lattice predictors
- arbitrarily large LMS adaptive filters

This required a comprehensive research effort that addresses theoretical, system, and performance aspects.

Theoretical Aspect: Describe useful mathematical representations for digital-signal processing (DSP) algorithms.

System Aspect: Develop a technique for tailoring the algorithms into forms suitable for mapping onto SA architectures and to develop array systems for implementing the architectures efficiently.

Performance Aspect: Evaluate the performance of the new algorithms developed especially for the asynchronous array.

2.1 Theoretical Aspect

The mathematical representations for DSP algorithms include important properties such as local and regular data broadcasting. On the basis of these representations, we have arrived at a design procedure for creating a systolic array and a systolic algorithm for the realization of DSP algorithms efficiently. For example, lattice filters offer the following features:

1. They are very efficient structures, because of their simultaneous production of the forward and backward prediction errors.
2. The various stages of the lattice predictors are decoupled from each other. Moreover, the backward prediction errors generated by the stages of a lattice predictor are orthogonal to each other for wide-sense stationary input data.
3. They are modular in structure; an order increase is induced by adding one or more stages without affecting previous computations.
4. The structure of each of the stages is identical, making lattice filters viable candidates for VLSI implementation.
Figure 1 illustrates the diagram of a one-step predictor of order 3.

\begin{figure}
\centering
\includegraphics[width=\textwidth]{figure1.png}
\caption{One-step adaptive lattice predictor, L = 3.}
\end{figure}

The forward and backward signals are labeled \( s_{i,k} \) and \( s'_{i,k} \), respectively. The relations describing a one-step adaptive lattice predictor are

\begin{align}
\dot{s}_{0,k} &= x_k \\
\dot{s}_{1,k} &= s_{1,k} + K_1 s_{1,k-1} \\
\dot{s'}_{1,k} &= K_1 s'_{1,k-1} \\
\dot{s}_{2,k} &= s_{2,k} + K_2 s_{2,k-1} \\
\dot{s'}_{2,k} &= K_2 s'_{2,k-1} \\
\dot{s}_{3,k} &= s_{3,k} + R \\
\dot{s'}_{3,k} &= R \\
\dot{\epsilon}_f,k &= s_{L,k} \\
\dot{\epsilon}_b,k &= s'_{L,k}
\end{align}

where, \( \epsilon_{f,k} \) and \( \epsilon_{b,k} \) are the respective forward and backward predicted errors. The LMS algorithm for this lattice is

\begin{align}
K_{i,k+1} &= K_{i,k} - 2\mu_i s_{i+1,k} s'_{i,k} \\
0 &\leq i \leq L - 1
\end{align}

Here, \( \mu_i \) is the so-called convergence parameter, which is time invariant. This parameter is set to different values from stage to stage [13]. The data reduction technique can be demonstrated via the theoretical aspect of the LPF digital filters. The one-dimensional FIR filter is mathematically described as [11]

\begin{align}
y(k) &= \sum_{n=1}^{N} \omega(n) x(k - n + 1) \\
&= \sum_{n=1}^{N} \omega(n) \left[ x(k - n + 1) \pm x(k - N + n) \right] \\
\omega(n) &= \pm \omega(N + 1 - n)
\end{align}

For LPF digital filters, the following four cases are considered.

Case 1, 2: Even filter order (+ for even and - for odd symmetry):

\begin{align}
y(k) &= \sum_{n=1}^{N/2} \omega(n) \left[ x(k - n + 1) \pm x(k - N + n) \right]
\end{align}
Case 3.4: Odd filter order (+ for even and - for odd symmetry):

\[
y(k) = \sum_{n=1}^{N-1} \omega(n) \left[ x(k-n+1) \pm x(k-N+n) \right] \pm u(k)
\]

\[
\omega(n) = \pm \omega(N+1-n)
\]

\[
u(k) = \left\{ \begin{array}{ll}
\omega \left( \frac{N+1}{2} \right) x \left( k \frac{N-1}{2} \right) & \text{for even symmetry} \\
0 & \text{for odd symmetry}
\end{array} \right.
\]

Because of the similarities between the first two and last two cases, case 1 and case 2 are combined into one class, and case 3 and case 4 are combined into another class, with the respective reduced filter orders of \( N/2 \), \( N/2 \), \( (N+1)/2 \), and \( (N-1)/2 \) for SA implementation [16].

A summary of the theoretical aspect for the remaining DSP algorithms is listed in Appendix A.

2.2 System Aspect

A fundamental problem in the design of SA structures is to obtain localized algorithms, namely, localization of data dependencies. To achieve maximum parallelism in an algorithm, data dependencies during the computations must be determined. The following notations can be useful for the system aspect of SA design:

**Definition 1:** A dependency graph shows the dependency of the computations that occur in an algorithm. An algorithm is computable if, and only if, its dependency graph contains no loops or cycles.

**Definition 2:** A localized dependency graph has only local dependencies, i.e., the length of each dependency arc is independent of the problem size.

**Definition 3:** A computational graph is a localized dependency graph with each node in the graph labeled by the indices of the terms it computes.

For SA architecture, determining the computational graph for a given algorithm is a suitable starting point for non-structurally altered arrays. For example, spiral SAs have been structurally altered to conform to an architecture capable of processing blocks of input signals, rather than sequential input signals. The computational graph for the one-step adaptive lattice predictor is shown in Fig. 2. The large circles represent processing elements (PEs), and the first and second digits of the small circles represent one of nine operations, shown in Fig. 3, and the PE number, respectively.

2 This list is presented to maintain continuity with the system and performance design aspects of the DSP algorithms.
The computational scheme illustrates identical sets of operations in various stages. This enables us to determine the interconnection scheme which reveals the PE organization and PEs' input and output. The computational and interconnection scheme for the one-step adaptive lattice predictor are illustrated in Figs. 3 and 4, and the internal operations of a PE are shown in Fig. 5 [17]. The system aspect of the SA design for the DSP algorithms is listed in Appendix B [18-19].
2.2.1 Spiral Architecture

To illustrate this method, we describe herein the process involved in designing a spiral SA for an LPF digital filter: An input signal \( x(m) \) is passed from the input \( x_n \) of a PE at \((i, j)\) to its output \( x_{n+1} \) (see equations 4 and 5). Using zero or one delay, this input is transferred to the input \( x_n \) of the neighboring cell \((i + 1, j^*)\). The index \( j^* \) can be identified as \( j + 1 \) or \( j - 1 \) using an analogy from optics. Input data \( x(k) \) moves between the PEs in the direction \( d_1 = 5 \) or \( d_2 = 3 \) until it reaches the left or right boundary of the array, respectively (see Fig. 6(a) and (b)). Then the input is "reflected" back in the respective direction \( d_3 = 3 \) or \( d_4 = 5 \), with the intermediate direction \( d_5 = 4 \). The intercell propagation of intermediate values \( y(k) \), of input data \( y(k) \) can be described by

\[
y(k) = y(k-1) + \omega(i) \cdot x(k - i + 1)
\]

where \( y(k)_b = 0 \) and \( y(k)_b = y(k) \). A delay element is positioned immediately following a PE at \((i, j)\), provided that

\[
j = \begin{cases} 
(i + 1) \mod M & \text{for } (r - 1) M \leq i < r M \quad r \text{ odd} \\
(i) \mod M & \text{for } i = r M \\
(M - i) \mod M & \text{for } (r - 1) M \leq i < r M \quad r \text{ even} \\
(M - i + 1) \mod M & \text{if } i = r M
\end{cases}
\]

where \( M \) is the number of input data blocks.

![Diagram of directional encoded values](a) Directional encoded values

![Diagram of data propagation between array PEs](b) Data propagation between array PEs

Fig. 6. Propagation of data between array PEs.

The two systolic array architectures with spiral structure of intercell connections, proposed to compute equation (4) for class 1 and equation (5) for class 2 (for \( N = 12 \)) are depicted in Fig. 7(a) and (b), respectively [16]. The secondary input
signal $x(k - N + n)$ is shown integrated into the structure. In this structure, the PE performs the following operation, as illustrated in Fig. 8:

$$x_{out} = x_{in}$$
$$y_{out} = y_{in} + \omega(i) [x_{in} \pm x'_{in}]$$  \hfill (8)

Here $x(k - N + n)$ is the past input data, symmetric to $x(k - n + 1)$, and $N - 2i + 1$ sampling periods apart.

Fig. 7. Spiral systolic array architecture for LPF filter of order 12.
(Notations: + for even symmetry and - for odd symmetry.)
Let $U$, the reduced filter order, and $M$, the number of input data blocks, designate the spiral SA's number of rows and columns, respectively. Denote $\text{indx}(i, j, t)$ and $\text{indxo}(i, j, t)$ to be the respective indices of input and output signals of a PE where $i, j,$ and $t$ are the row, column and block kernel cycle. Here $i = 1, \ldots, U$, $j = 1, \ldots, M$, and $t = 1, \ldots, T_{\text{max}}$ ($T_{\text{max}} = (K/M + .5)_{\text{integer}}$), where $K$ is the number of input signals. Note that the indices of input and output signals are related by the following:

$$\text{indxo}(i, j, t) = \text{indx}(i, j, t) - M$$  \hspace{1cm} (9)
The following algorithm demonstrates the positioning of the input signals in the PEs of the spiral SA, at kernel cycle \( t \).

---

**Algorithm 1.** { "Input": spiral SA's row and column size, and input data size  
Output: input signals' indices, \( \text{indx}(i, j, t) \). }" 

Begin

\[
T_{\text{max}} = (K/M + 0.5), \text{integer}
\]

for \( t = 1 \) to \( T_{\text{max}} \) do

begin

\[
R = \begin{cases} 
    M/2, & \text{if } M \text{ even} \\
    (M - 1)/2, & \text{if } M \text{ odd}
\end{cases}
\]

\( i = 1; \)

for \( j = 1 \) to \( M \) do  

\{ 1st row computations \}

begin

\[
\text{indx}_{i, j, t} = \begin{cases} 
    (t - 1)M + (j + 1)/2, & \text{if } j \text{ odd} \\
    tM - (j - 2)/2, & \text{if } j \text{ even}
\end{cases}
\]

end:

for \( i = 2 \) to \( N \) do  

\{ rows 2 to \( N \) computations \}

begin

\( p = i \text{ mod } 2; \)

for \( j = 1 \) to \( M \) do

\[
\text{indx}_{i, j, t} = \text{indx}_{i - 1, j, t} - M - 1;
\]

for \( q = 1 \) to \( R - 1, \text{ if } i \text{ even} \) do

begin

\( j1 = 2q + p - 2; \)

\( j2 = j1 + 1; \)

if \( j2 < = M \) then

begin

\[
\text{mid} = \text{indx}_{i, j1, t}; \quad \{ \text{swap of indices} \}
\]

\[
\text{indx}_{i, j1, t} = \text{indx}_{i, j2, t};
\]

\[
\text{indx}_{i, j2, t} = \text{mid};
\]

end: \( \{ j2 < = M \} \)

end:

end: \( \{ i = 2, N \} \)

end: \( \{ t = 1, T_{\text{max}} \} \)

End.
The index of some feedback input, \( \ell' \), is related to the index of its destination PE, \( k \) by (see eq. 4 and Fig. 7).

\[
\ell' = k + \Delta k' \\
\Delta k' = (k - N + n) - (k - n + 1) = -N + 2n - 1
\]

(10)

An example for the PE input/output data distribution in the spiral SA, with \( N/2 = \ell' \) and \( M = 4 \) dimensions (at kernel cycle \( t = 9 \)) is shown in Fig. 9:

<table>
<thead>
<tr>
<th>PE output data distr.</th>
<th>( j = 1 )</th>
<th>( j = 2 )</th>
<th>( j = 3 )</th>
<th>( j = 4 )</th>
</tr>
</thead>
<tbody>
<tr>
<td>( i = 1 )</td>
<td>( (33) )</td>
<td>( (36) )</td>
<td>( (34) )</td>
<td>( (35) )</td>
</tr>
<tr>
<td>( (29) )</td>
<td>( (32) )</td>
<td>( (30) )</td>
<td>( (31) )</td>
<td></td>
</tr>
<tr>
<td>( i = 2 )</td>
<td>( (28) )</td>
<td>( (29) )</td>
<td>( (31) )</td>
<td>( (30) )</td>
</tr>
<tr>
<td>( (24) )</td>
<td>( (25) )</td>
<td>( (27) )</td>
<td>( (26) )</td>
<td></td>
</tr>
<tr>
<td>( i = 3 )</td>
<td>( (24) )</td>
<td>( (23) )</td>
<td>( (25) )</td>
<td>( (26) )</td>
</tr>
<tr>
<td>( (20) )</td>
<td>( (19) )</td>
<td>( (21) )</td>
<td>( (22) )</td>
<td></td>
</tr>
<tr>
<td>( i = 4 )</td>
<td>( (19) )</td>
<td>( (20) )</td>
<td>( (18) )</td>
<td>( (21) )</td>
</tr>
<tr>
<td>( (15) )</td>
<td>( (16) )</td>
<td>( (14) )</td>
<td>( (17) )</td>
<td></td>
</tr>
<tr>
<td>( i = 5 )</td>
<td>( (15) )</td>
<td>( (14) )</td>
<td>( (16) )</td>
<td>( (13) )</td>
</tr>
<tr>
<td>( (11) )</td>
<td>( (10) )</td>
<td>( (12) )</td>
<td>( (9) )</td>
<td></td>
</tr>
<tr>
<td>( i = 6 )</td>
<td>( (10) )</td>
<td>( (11) )</td>
<td>( (9) )</td>
<td>( (8) )</td>
</tr>
<tr>
<td>( (6) )</td>
<td>( (7) )</td>
<td>( (5) )</td>
<td>( (4) )</td>
<td></td>
</tr>
</tbody>
</table>

Fig. 9. PE input/output data distribution for kernel cycle \( t = 9 \).

The connection matrices for the feedback input data of the class 1 and class 2 spiral SA (LPF of order 12) are shown here. (Note that \( N/2 = 6, \ M = 4 \) for class 1 and \( (N - 1)/2 = 5, \ M = 4 \) for class 2.)

\[
\begin{bmatrix}
\pm 0 (3, 4) & \pm 0 (2, 2) & \pm i (3, 2) & \pm 0 (2, 1) \\
\pm 0 (3, 2) & \pm 0 (3, 1) & \pm 0 (3, 4) & \pm 0 (3, 3) \\
\pm 0 (4, 4) & \pm 0 (4, 2) & \pm i (4, 3) & \pm 0 (3, 2) \\
\pm 0 (4, 3) & \pm 0 (4, 1) & \pm i (5, 4) & \pm 0 (4, 2) \\
\pm 0 (5, 3) & \pm 0 (5, 1) & \pm i (5, 4) & \pm 0 (5, 2) \\
\pm 0 (5, 4) & \pm 0 (5, 2) & \pm i (6, 4) & \pm 0 (6, 2)
\end{bmatrix}
\]

(11)
\[
\begin{pmatrix}
\pm i (3, 2) & \pm 0 (2, 4) & \pm 0 (2, 1) & \pm 0 (2, 2) \\
\pm 0 (3, 1) & \pm 0 (3, 3) & \pm i (3, 2) & \pm 0 (3, 4) \\
\pm i (4, 3) & \pm 0 (4, 4) & \pm 0 (4, 3) & \pm 0 (4, 4) \\
\pm 0 (4, 1) & \pm 0 (4, 2) & \pm 0 (4, 3) & \pm 0 (4, 4) \\
\pm i (5, 4) & \pm 0 (5, 3) & \pm i (5, 2) & \pm 0 (5, 1)
\end{pmatrix}
\]  

(12)

where \(0(i, j)\) and \(i(i, j)\) at the \((m, n)\) location of the connection matrix refer to the connection made from the respective output and input of the PE at location \((i, j)\) to the input of the PE that makes up the processor's feedback input data at the location \((m, n)\) of the spiral SA.

The spiral SA structure for computing eqs. A4 and A5 for the DFT and IDFT algorithms and for computing eqs. A2 and A3 for the convolution and correlation algorithms are illustrated in Appendix C [20]. Note that contrary to the LPF spiral SA structure neither structure requires a feedback input, and in addition, the DFT and IDFT spiral SA structures consist of no delay elements.

2.2.2 Asynchronous Architecture

The performance of a systolic array can be improved further if an asynchronous approach is adopted. The idea is to design self-timed processors and communication protocols to gain control of data streams, so that each computation can start if all its data are available. The proposed array is a hybrid of a systolic array and a data-flow machine.

In the asynchronous design, instead of using a global clock, self-timing PEs and a communication protocol are employed. The advantage is the following: The whole period of a clock unit for addition/subtraction, multiplication, addition, and data routing can be separated into several small steps, and some of these steps can be executed simultaneously. The concept of asynchronous computations can be specified as shown below, with steps 2 and 3 executed simultaneously.

Step 1
- send/receive the respective acknowledge signal and data to the previous processor
- send a request signal that simultaneously forwards data to the next processor

Step 2
- transfer data to the next processor

Step 3
- add/subtract, multiply, and accumulate the results
The basic features of the proposed array remain the same as in the systolic array, with the exception that the data routing and computing in each PE can be operated simultaneously. Moreover, a PE does not wait for data until the previous PE completes its computations. The following algorithm reflects this new feature for an LPF digital filter [16].

Algorithm 2: Linear Phase FIR Digital Filter (Asynchronous)

Begin

While there are data entering PE, do

Begin

Receive input data \( x_{in} \), \( x'_{in} \), \( y_{in} \);
Send \( x_{i} \) to the next processing element:

& \( v_{i,j} \leftarrow x_{in} \pm x'_{in} \);
& \( u_{i,j} \leftarrow \omega_{i} v_{i,j} \);
& \( y_{out} \leftarrow u_{i,j} + y_{in} \);
End While

End

Protocol realization for Algorithm 2

A protocol for realization of Algorithm 2 must control the flow of data and make the data flow independent of the internal operations in each PE, such that the values of input variables are not overwritten during their computing time intervals. Because in spiral SAs timing is crucial for the synchronization of the PEs’ input data, and in asynchronous machines operations are triggered by the availability of the data, the system protocol for Algorithm 2 (1) triggers PEs’ operations by the fullness of their \( x_{in} \), \( x'_{in} \), and \( y_{in} \) input ports, and (2) allows data routing and computing to be performed simultaneously. In the proposed protocol for an LPF digital filter, five types of signals — REQ, ACK, FLG, EMP, FUL — are introduced, two of which are external signals and three of which are internal signals. The REQ signal reports to the next PE that the data in its output port are ready for transmission. The ACK signal reports to the previous PE that its input port is ready to receive new data. The FLG, EMP, and FUL signals indicate the completeness of the add/subtract-multiply-and-add computations, and the emptiness and fullness of the input port(s), respectively. The protocol can be described formally for data \( x_{in} \) as shown below, while noting that the specifications of the protocols for \( x'_{in} \) and \( y_{in} \) are similar.

1. PE, \( i \), receives a request, REQ\( ^{i} \), from a source PE when the data in its output port are ready to be transmitted.
2. PE\textsubscript{i} receives an acknowledge ACK\textsubscript{i} from a destined PE when its input port is ready to accept new data.

3. PE\textsubscript{i} has three internal signals, FLG\textsubscript{i}, EMP\textsubscript{i}, and FUL\textsubscript{i}, which indicate the completeness of the add/subtract-multiply-and-add operations, and the emptiness and fullness of the input port(s).

4. PE\textsubscript{i} contains the following transition latches:

   Latch 1 activates the signal ACK\textsubscript{i}, which is fired toward a source PE only after both signals REQ\textsubscript{i} and EMP\textsubscript{i} are received.

   Latch 2 controls the data flow from the input port to the buffer-for-add/subtract and to the output port, and it also activates the signals EMP\textsubscript{i} and FUL\textsubscript{i}. It is fired only after the signal FLG\textsubscript{i} is received.

   Latch 3 controls the data flow from the output port of the PE to the input port of a destined PE. It is fired only after the signal ACK\textsubscript{i} is received from the destined PE.

A detailed configuration of this protocol for type I cell (case 1 or 3 FIR linear phase digital filters) is depicted in Figure 10. In this configuration, there are two lines that intersect each PE. One is a bidirectional control line for the transmission of the ACK and REQ bit-signals, and the other is a unidirectional data line. The linear phase FIR digital filter operations in PE\textsubscript{i}, are shown in Algorithm 3 [16].

The PE protocols for some of the most important DSP algorithms are shown in Appendix D [17-19].
Algorithm 3: Linear Phase FIR Digital Filter Operation  
(A Completed Asynchronous Version)

Begin

While there are data entering PE, do
 Begin
 For all PE,s asynchronously do
  Begin
   Wait for REQinput from a source PE:
   Receive REQinput;
   While input port is empty do
    Begin
     Activate REQinput & EMP:
     Send ACKinput:
     Receive data $x_{in}$, $x'_{in}$, $y_{in}$:
     Store data into the input port:
     Activate FUL signal:
     & Send REQoutput to a destined PE:
     Wait for ACKoutput from the destined PE:
     Send $x_{in}$ to output port & buffer-for-add:
     & Add ($v_{i,j} ← x_{in} + x'_{in}$):
     Store $v_{i,j}$ into buffer-for-mult:
     Read $\omega_i$ from buffer-for-mult:
     & $v_{i,j}$ from buffer-for-mult:
     Mult ($u_{i,j} ← \omega_i v_{i,j}$):
     Store $u_{i,j}$ into buffer-for-add:
     Read $u_{i,j}$ from buffer-for-add:
     & $y_{in}$ from buffer-for-add:
     Add ($y_{out} ← u_{i,j} + y_{in}$):
     Store $y_{out}$ into buffer-for-W:
     & Activate FLG signal:
    End While
   End For
  End While
 End
Fig. 10. Proposed protocol in (type I) PE. 

17
2.3 Performance Aspect

The execution time modeling of an asynchronous system can represent quantitative evaluation of the developed algorithms. We developed three performance measures, throughput, speedup, and efficiency, for evaluating the asynchronous spiral SA realizations. The performance analysis technique is next demonstrated for the LPF digital filter asynchronous SA realization [16]. The following notations will be used to describe the time models of the arrays:

\[ T_{in} = \text{time for reading data from input port} \]
\[ T_{D_i} = \text{time for internal data transfer} \]
\[ T_{D_o} = \text{time for external data transfer} \]
\[ T_R = \text{signal REQ propagation time} \]
\[ T_A = \text{signal ACK propagation time} \]
\[ T_{Fire} = \text{time for firing data via latch} \]
\[ T_{C} = \text{clock time for synchronous systolic array realization} \]
\[ T_E = \text{signal EMP propagation time} \]
\[ T_{Ad} = \text{computing time for addition} \]
\[ T_{M} = \text{computing time for multiplication} \]
\[ T_{w} = \text{time for fetching data from buffer-for-add/multiply/etc.} \]
\[ T_{F} = \text{signal FLG/FUL propagation time} \]
\[ T_{Out} = \text{time for reading data from output port} \]

The time delays, starting with the reading of data from input port, for the respective asynchronous spiral SA and the corresponding synchronous systolic array can be described by the following relationships (see Figure 10):

\[ T_{PE_{(asyn)}} = \max\{T_{in} + T_{D_i} + (T_R + T_{Fire}) + (T_A + T_{Fire}) + T_{Out} + T_{D_o} \} \]
\[ = (T_{in} + T_{D_i} + T_{w} + T_{A} + T_{M} + T_{w} + T_{A} + T_{w}) + (T_R + T_{Fire}) + (T_{A} + T_{Fire}) + T_{Out} + T_{D_o} \] (13)

\[ T_{total_{(asyn)}} = (T_{in} + T_{D_i} + T_{Out} + T_{D_o} + 4T_{w} + 2T_{A} + T_{M} + T_{Com}) \] (14)

where,

\[ T_{Com} = T_{A} + 2T_{Fire} + T_{R} \]

and,

\[ T_{total_{(syn)}} = M'(T_{in} + T_{D_i} + T_{D_o} + 3 \max \{T_{A}, T_{M} \} + T_{w}) \]
\[ = M'(T_{in} + T_{D_i} + T_{D_o} + 3(T_{M} + T_{w}) = M'T_{C} \] (15)

where, \( M' \) refers to the number of PEs in the synchronous systolic array, and

\[ M' = \begin{cases} M + 1 & \text{if } N \text{ odd, even symmetry} \\ M & \text{otherwise} \end{cases} \]
The throughput can be obtained by dividing the number of processors by the execution time per sample. Thus, the throughput using $M \times U$ processors, $R(M_n)$ is defined by the following equation:

$$R(M_n) = \frac{M}{T_{in} + T_{D_i} + T_{out} + T_{D_o} + 4T_w + 2T_A + T_M + T_{Com}}$$ (16)

The speedup is the ratio of the throughput in the asynchronous spiral SA system to that in a reference synchronous system. From above, the speedup via an asynchronous spiral SA realization is represented as

$$S(M_n) = \frac{M (T_{in} + T_{D_i} + T_{D_o} + 3 (T_M + T_W))}{M' (T_{in} + T_{D_i} + T_{out} + T_{D_o} + 4T_W + 2T_A + T_M + T_{Com})}$$ (17)

The system efficiency is defined as the ratio of the speedup and the number of processors

$$E(M_n) = \frac{T_{in} + T_{D_i} + T_{D_o} + 3 (T_M + T_W)}{U(M') (T_{in} + T_{D_i} + T_{out} + T_{D_o} + 4T_W + 2T_A + T_M + T_{Com})}$$ (18)

The above equation indicates that the reduction of interprocessor communication time, $T_{Com}$, greatly improves the system's efficiency [16].

2.4 References


3. PUBLICATIONS AND REPORTS


4. PERSONNEL SUPPORTED AND DEGREES AWARDED

<table>
<thead>
<tr>
<th>Major Investigator</th>
<th>Research Area</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reihani, K.</td>
<td>Signal Processing and Image Understanding</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Doctoral Students</th>
<th>Dissertation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fan, Yiping</td>
<td>Current Ph.D. Student: Signal Processing</td>
</tr>
<tr>
<td>Zhang, Zhi</td>
<td>Current Ph.D. Student: Signal Processing</td>
</tr>
</tbody>
</table>
APPENDIX A

THEORETICAL ASPECT FOR DSP ALGORITHMS

— Arbitrarily Large LMS Adaptive Filters

A standard form for the LMS algorithm when real discrete time signals, \( x_i \), are used is described by

\[
W_{j+1} = W_j + 2 \mu \xi_j X_j
\]  
(A1)

where \( W_j \) and \( X_j \) are vectors representing, respectively, the filter weights and the outputs of the transversal filter line at time \( j \)

\[
W_j^T = [w_{1,j}, w_{2,j}, ..., w_{n,j}]
\]

\[
X_j^T = [x_j, x_{j-1}, ..., x_{j-n+1}]
\]

and \( \mu \) is the gain constant that regulates the speed and stability of adaptation. The error, \( \xi_j \), is the difference between the desired response \( d_j \), an externally supplied training signal, and the filter output \( y_j \)

\[
\xi_j = d_j - y_j
\]

where the filter output is given by

\[
y_j = X_j^T W_j
\]

— Convolution and Correlation

Convolution and correlation are closely related operations that are basic to many areas of digital signal processing. The convolution of two time series is equivalent to the product of the Fourier transforms of the two time series. They can be expressed for a finite length \( N \) as

**Convolution:**

\[
c_{xy}(n) = \text{conv}[[x(k), y(k)] = \sum_{k=0}^{N-1} x(k)y(n-k) = \sum_{k=0}^{N-1} x(n-k)y(k)
\]  
(A2)

**Correlation:**

\[
r_{xy}(n) = \text{corr}[[x(k), y(k)] = \sum_{k=0}^{N-1} x(k)y(n+k) = \sum_{k=0}^{N-1} x(n+k)y(k)
\]  
(A3)

— DFT and IDFT

The discrete Fourier transform (DFT) is used to transform an ordered sequence of data samples, usually from the time domain into the frequency domain, so that spectral information about the sequence can become known explicitly. The inverse discrete Fourier Transform (IDFT) is used to obtain a data sequence in time domain

23
from its complex spectrum. If we denote \( x(k) \) \((k = 0, 1, \ldots, N - 1)\) as a time series and \( X(k) \) as the output of an \( N \) point Fourier transform, then

\[
X(k) = \sum_{n=0}^{N-1} x(n) e^{-j \frac{2\pi n}{N} k} = \sum_{n=0}^{N-1} x(n) w(nk) \quad k = 0, 1, \ldots, N - 1
\]  

(A4)

\[
x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) e^{j \frac{2\pi n}{N} k} = \frac{1}{N} \sum_{k=0}^{N-1} X(k) w(-nk) \quad n = 0, 1, \ldots, N - 1
\]  

(A5)

where

\[
w(nk) = e^{-j \frac{2\pi n}{N} k}
\]

\[
w(-nk) = e^{j \frac{2\pi n}{N} k}
\]

--- DHT

A Hadamard matrix is a matrix whose columns are orthogonal and comprised of elements whose value is either -1 or +1. Given a Hadamard matrix of order \( N \). Hadamard matrices of higher order may be generated by the simple recursive formulation.

\[
H_{2N} = \begin{bmatrix}
H_N & H_N \\
-H_N & -H_N
\end{bmatrix}
\]  

(A6)

The Hadamard matrix of the lowest order becomes

\[
H_2 = \begin{bmatrix}
1 & 1 \\
1 & -1
\end{bmatrix}
\]  

(A7)

and the Hadamard matrix of order \( N = 4 \) is given as

\[
H_4 = \begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & -1 & 1 & -1 \\
1 & 1 & -1 & 1 \\
1 & -1 & -1 & 1
\end{bmatrix} = \begin{bmatrix}
H_2 & H_2 \\
H_2 & -H_2
\end{bmatrix}
\]  

(A8)

24
APPENDIX B

SYSTEM ASPECT I(A)

— Arbitrarily Large LMS Adaptive Filters

— Computational graph for $y_j$

$$
\begin{align*}
y_j &\quad (11) (21) (31) \\
x_j &\quad (12) (22) (32) \\
\alpha_0 &= 0 \\
\end{align*}
$$

— Computational scheme for $y_j$

— Interconnection scheme for $y_j$

Interconnection scheme for $y_j$
— Proposed PE, for $y_i$

— Systolic array scheme for a $4 \times 4$ DHT
APPENDIX C

SYSTEM ASPECT I(B)

— Spiral SA Structure to Compute Equations A4 and A5 (N=6)
— DFT and IDFT
— Spiral SA Structure to Compute Equations A2 and A3 \((N = 6)\)

— Convolution and correlation

Convolution

Correlation
APPENDIX D

SYSTEM ASPECT II

— One-Step Adaptive Lattice Predictors’ PEₙ Protocol
Arbitrarily Large LMS Adaptive Filters' PE Protocol

- Internal Flag
- Acknowledge Flag
- Empty-Input Buffer Flag
- Addition Transition
- Multiplication Transition
- Transition (I/O) Latch
- Data Line
- Signal (ACK/REQ) Line
- Data Buffer
- Request Flag

Diagram:

\[
\begin{align*}
\text{(REQ/ACK)} & \quad x_{j+1} & \quad \alpha_{j+1} & \quad \text{(REQ/ACK)} & \quad y_{j+1} & \quad \text{(REQ/ACK)} \\
\end{align*}
\]
DHT's PE, Protocols

---

(a) Type I PE

(b) Type II PE