This project on subpixel target detection relates to research in the optimization of three-dimensional computing structures for use in target detection and to research in the reduction of an optimum computing to an efficiently-designed silicon chip. During the work period reported here research continued on the mathematical optimization of optimizing the device design assuming a planar computing structure for executing cellular logic transforms. A final analysis of this planar geometry has been completed. There are well-defined optima which had not heretofore been recognized. The results of this research will have a major impact on devices which we intend to prototype during Phase II. An Appendix is included, submitted by our subcontractor Visual Information Technologies (Plano, Texas) illustrating a VLSI implementation of a three-dimensional morphology processor based upon the design diagramed in Technical Progress Report Number 2.
Topic Number: SDIO 88-10

Title: Three Dimensional Cellular Automata for Subpixel Target Detection

Contract Number: N00014-88-C-0717

From: Kensal Consulting, Tucson, Arizona (Code: 0D9C9)

To: Dr. Keith Bromley, NOSC, San Diego (Code: N00014)

Project Description:

This project on subpixel target detection relates to research in the optimization of three-dimensional computing structures for use in target detection and to research in the reduction of an optimum computing structure to an efficiently-designed silicon chip.

Technical Progress:

During February research continued furthering the analysis first reported in Technical Progress Report Number 1 and Technical Progress Report Number 4, namely, the mathematical optimization of device design assuming a planar structure for executing cellular logic transforms. The optimization criterion is the maximization of pixops (picture point operations) per device. As reported in Technical Progress Report Number 4, it has been discovered that using a variable on-chip data window size improves performance. However, in our previous work, equations were not developed which showed clearly the basis of the optimum conditions discovered, although the optima were clearly defined. This has now been rectified and a final analysis of two-dimensional (planar) structures has been completed. The results are dramatic. There is a well-defined optimum design
which had not heretofore been recognized. The results of this research will have a major impact assuming that prototyping of devices for cellular logic transforms is funded in response to our Phase II Proposal (to be prepared in April). Results are given below.

Two-Dimensional Case

Assume that one-bit pixels in a square data field of span S must be processed using a neighborhood processing device with a small on-chip data memory. This device is assumed to receive 3 pixels (bit:) each clock cycle t from the data field. It is also assumed to contain B LUTs (LookUp Tables) capable of flash processing all B pixels (bits) simultaneously and returning them to the data field in another, non-overlapping clock cycle t. Finally, it is assumed that the on-chip data window memory is capable of storing M pixels (bits). Initially multiple redundancy in M is not assumed. This will be treated later. It is assumed, however, that M can be configured as a matrix with an arbitrary number of rows R and columns $C = M/R$.

The purpose of the following analysis is to determine the minimum processing time for a specific value of M. Then to apply this determination to an analysis which finds the value of M which yields the maximum number of pixels processed per unit time per on-chip transistor. From this the best processing device design can be found.

The time $T_1$ to fill $M$ is given by

$$T_1 = \frac{M t}{B} \quad (1)$$

In this time every B-bit entity must have available the eight other B-Bit entities that form the neighborhoods of all B bits. For example, if B=1, then at least eight other one-bit entities must be stored in M to form a single 3x3 neighborhood. In this case $R=3$, $C=3$, and $M=9$. Conversely for B=16 (halfword transfer) at least eight other 16-bit entities must be stored in M so that B bits may be processed simultaneously. This yields a minimum value of $M=144$, $R=3$, and $C=48$. Thus, in general the number of pixels processed per round-trip cycle through M are given by

$$N=(R-2)(C-2B) = (R-2)(M/R-2B) \quad (2)$$

The processed values of these pixels are loaded in a time $T_2$ given by
\[ T_2 = (R-2)(M/R-2B)t/B \]  
(3)

Adding equations (1) and (3) provides the round trip processing time as

\[ T_0 = (M+(R-2)(M/R-2B))t/B \]  
(4)

The number of round-trip cycles required is simply \( S^2/N \) so that the total processing time \( T \) is given by

\[
T = T_0(S^2/N) \\
= S^2(M+(R-2)(M/R-2B))t/B(R-2)(M/R-2B) \\
= S^2Mt/B(R-2)(M/R-2B) + S^2t/B
\]  
(5)

For a given design, \( M, B, \) and \( t \) are fixed and, for a given processing task, \( S \) is fixed. Under these assumptions the second term in equation (5) is a constant as is the numerator in the first term. Thus, to minimize \( T \), for a given task and design, the denominator of the first term in equation (5) must be maximized. This denominator may be expressed in terms of a constant \( A \) given by

\[ A = BM + 4B^2 \]  
(6)

and a function of \( R \) given by

\[ f(R) = -2BM/R - 2B^2R \]  
(7)

differentiating \( f(R) \), setting the result equal to zero, and solving for \( R \) yields

\[ R = (M/B)^{\frac{1}{2}} \]  
(8)

Substituting this result in (5) yields the minimum processing time as follows

\[
T_{\text{min}}(M,B,S,t) = S^2Mt/B((M/B)^{\frac{1}{2}}-2)((MB)^{\frac{1}{2}}-2B)+S^2t/B
\]  
(9)

Letting \( S=512 \) and \( t = 100\text{ns} \), equation (9) is plotted in Figures 1-4 as a function of \( M \) for \( B=1, 4, 8, 16 \), respectively, corresponding to bit, nibble, byte, and halfword
entities. Examination of these figures shows that the processing time for the optimum configuration of \( M \) (equation (8)) falls rapidly only over a relatively small range of \( M \) and reaches asymptotic value given by \( 2S^2 t/B \). This indicates that halfword \( (B = 16) \) flash processors using large values of \( M \), such as the PHP of Carnegie-Mellon \( (M = 65536) \) are wasteful of transistors. As shown in Figure 4, little is gained in such designs by increasing \( M \) much beyond 1024. Also note that, for \( M = 1024 \), the optimum configuration of \( M \) is highly rectangular, namely, 8 rows by 128 columns.

Now consider optimization in terms of the number of pixops (pixel operations) per unit time per transistor. Using triple redundancy in \( M \) for maximum speed (U.S. Patent 4,641,351, K. Preston Jr.), the total number of transistors is given by

\[
D = 12M + BL \tag{10}
\]

where \( L \) is the number of transistors per LUT and it is assumed that four transistors are required per memory cell. For the 3x3 neighborhood flash processor considered here \( L = 4 \times 512 = 2048 \). Pixops per unit time per transistor are given by

\[
P = S^2/T \tag{11}
\]

Combining equations (9), (10), and (11) yields

\[
P = B((M/B)^2 - 2)((MB)^{3/2} - 2B)/(12M + 2048B)M^{1/2} + B/(12M + 2048B)t \tag{12}
\]

Equation (12) is plotted in Figures 1-4 for \( B = 1, 4, 8, 16 \), respectively, using \( t = 100\text{ns} \). These plots correspond to Figures 1-4 and furnish a graphical solution to the maximization of equation (12) as a function of \( M \). The optimum values of \( M \) found are given in the below table

<table>
<thead>
<tr>
<th>( B )</th>
<th>Optimum ( M )</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>~ 49</td>
</tr>
<tr>
<td>4</td>
<td>~ 225</td>
</tr>
<tr>
<td>8</td>
<td>~ 441</td>
</tr>
<tr>
<td>16</td>
<td>~ 900</td>
</tr>
</tbody>
</table>
It is worth noting that these design optima are independent of the span of the data field $S$. This again points up the fallacy that large values of $M$ are required when processing large data fields. In fact, as $M$ increases beyond the optima given above, efficiency falls, i.e., transistors are being wasted.

**Preliminary Design Analysis**

The appendix to this technical progress report has been submitted by Visual Information Technologies (Texas) as part of their subcontracted research on chip design. This analysis has benefitted from the research conducted by Kensal Consulting in that the research at Visual Information Technologies has been conducted in close coordination with our own.

**Plans for Contract Completion:**

Plans for contract completion include (1) the extension of our two-dimensional analysis to the final three-dimensional structure for use in target track detection and (2) completion of the design analysis being conducted at Visual Information Technologies. Also it is likely that the software emulator being built on the MacIntosh II will be finished and some preliminary results furnished (see Technical Progress Report Number 3). At the same time the Phase II proposal will be outlined in preparation for its submission in April 1989.
Figure 1A - Bit Data Transfer

Figure 1B - Bit Data Transfer
Data Window Memory

Figure 2A - Nibble Data Transfer

Data Window Memory

Figure 2B - Nibble Data Transfer
Figure 3A - Byte Data Transfer

Figure 3B - Byte Data Transfer
Figure 4A - Word Data Transfer

Figure 4B - Word Data Transfer
APPENDIX

VLSI Implementation

Of A

3D Morphology Processor

By

Dr. John Norsworthy

Visual Information Technologies

Plano, Texas
Introduction

The purpose of this report is to outline progress that has been made in the understanding of a chip implementation of a 3-dimensional image morphology processor. It is the intent of the research to outline the trade-offs associated with this chip development. This research is conducted in association with Kensal Consulting concerning a sub-pixel target detection system, and is funded by an SBIR grant from the SDIO. VITec (Visual Information Technologies, Inc.) is uniquely qualified to participate in this research because of its background in VLSI based image processing [1,2].

Background

A thorough understanding of the 3D morphology processing has been attained. Gaining fluency with the underlying concepts is essential in understanding the appropriate tradeoffs to be made in the VLSI implementation of a system component.

Current image processing techniques typically involve processing intensity values represented by integers with 8 to 16 bits of precision. These intensity values are samples taken at points in a rectangularly tesselated space, which is a generalization of traditional 1D signal processing concepts to two or more dimensions. Examples of machines performing image processing in this manner are the VITec Image Computer, the Pixar Image Computer, as well as systems from Vicom and Gould. A 2D neighborhood, or kernal, would consist of 9 pixels, while a 3D kernal would consist of 27 voxels.

In this study, pixels are binary variables, having been processed by a thresholding circuit. Furthermore, these voxels are samples from a hexahedrally tesselated space, such that each sample is equidistant from all of its neighbors. This tessellation has a physical analogy to the cubic close-packed crystal lattice [3]. In three dimensions, the hexahedral tessellation leads to a neighborhood of 13 voxels, the outer twelve of
which form the vertices of a tetradecahedron, a polyhedra having 14 faces.

Since each pixel is a single-bit value, it can be regarded as a binary variable, which leads to the concept of a "logical transform" of a 3D neighborhood. In the case of the proposed chip, a structure would be built whose inputs are the 13 neighborhood voxels of the hexahedrally tessellated space, and whose single output is an arbitrary logical function of those 13 inputs. This structure directly implements a truth table with $2^{13}$, or 8192, entries, and corresponds to a 8Kx1 RAM (Random Access Memory) which serves as a LUT (LookUp Table). It is clear that the hexahedrally tessellated space is particularly suitable for this form of processing, since a 3D neighborhood of a rectangularly tessellated space would have 27 voxels, requiring a LUT having $2^{27}$ entries. Once this basis is established, the following architectural issues are apparent:

- Should there be parallel LUTs on a single chip
- Is it necessary to implement an arbitrary logical function of 13 variables. Research indicates that most logical transforms already in practice may be described as a composite function of the an arbitrary function of the 12 outer voxels and the center voxel (described further below) [4].
- How is image data efficiently transferred as 13 neighborhood voxels to the LUT.
- How is result data returned once it is processed by the LUT.
- Are there any constraints, such as cost, space, rad-hard, power, maximum transistor count, etc., or should all of these be "sensibly minimized".

There are several implementation issues to be studied as well. These include:

- What is the cheapest, most reliable and most well understood semiconductor process that the performance and cost requirements will allow.
- What package should be used.
- How should the RAM be designed (for example, should a 3-transistor dynamic cell or a 6-transistor static cell be used).
- What integrated circuit design methodology should be used.

**Architectural Studies**

A proposed chip architecture is shown in figures 1a and 1b, which fits well into a system proposal by Kensal Consulting. In the processor, data is loaded into shift registers, once per cycle. Although figure one shows data being loaded 8b at a time, the exact number of bits loaded at a time
is a parameter which needs further study. There are seven shift registers, corresponding to the 7 voxels of a hexagonal neighborhood in a particular frame. Three frames are needed to construct a 13 voxel 3D neighborhood, however the top and bottom 3 voxels in the 3D neighborhood occupy 3 of the 7 bit positions of the hexagon (for example, see [3], page 54). Once these 7 shift registers are loaded, processing may begin. In the 8b example of figures 1a and 1b, 8 neighborhoods are passed to the single LUT, one at a time for 8 cycles, where the LUT’s output fills an output shift register. The shift register’s contents are then outputted in parallel to an image memory. There is a 3 cycle latency in getting the pipeline started.

The difference between figures 1a and 1b is that the LUT in figure 1b has been halved in size, and its output is processed by the center voxel logic to form a composite logical function. This center voxel logic is shown in figure 2. As can be seen in figure 2, a 4b control field, which defines any one of the possible 16 logical functions of two variables, defines the logical function of the LUT output and the center voxel. It should be stressed that this composite processing does not replace the general function of an 8K LUT, but from known morphological logical transforms, complete generality is not required, and this assumption halves the size of the largest chip component: the LUT RAM.

Further study will be performed on the other architectural issues, and summarized in a final report at the conclusion of this study.

**Chip Implementation Studies**

Although the processing bandwidth of the chip is presently undefined, it is likely that a standard ASIC CMOS process will be adequate to implement this chip. Presently, most CMOS integrated circuit design is being conducted using 1μ or less minimum photolithographic design rules, which allows for extremely high packing densities. For example, if one were to examine the 1989 International Solid State Circuit Conference Digest of Technical Papers, he would find that most papers involving digital circuitry use 1μ processes, while simultaneously, chip sizes continue to increase dramatically. For example, a 1 million transistor microprocessor was presented (by Intel) which had a die area of 150 square millimeters. By the time that an implementation of this chip is started, 1μ processing will be commonplace.

CMOS processing is preferred because of its low power dissipation, high speed, proliferation of use, and insensitivity to noise. CMOS performance is typically adequate for all but the most demanding performance requirements.
The chip requirements here are much less demanding. Consider the area of a standard 6-transistor fully static RAM cell (the reasons for selecting this cell design will be presented in the final report), which we have designed and is 13.2µ x 21.6µ, using 1µ design rules as published by VLSI Technology, Inc. Assuming 30% overhead for RAM decoding, sense and write circuitry, the RAM area is only 3.0 square millimeters. This area is extremely small, and should result in very low costs. Using this process, it is certainly possible to design a LUT which could be cycled at more than 50 MHz.

Thus, with the semiconductor processing currently available today, the research should not be too constrained by chip area. Process issues will be further refined in the final report.

Conclusion

Further research should be conducted to improve on the availability of data to the LUT, so that it is always kept busy, perhaps adding complexity to the way data is brought into the proposed chip. Further study should be conducted along the lines of parallel LUTs within the chip. With today’s CMOS processing capability, schemes to reduce an 8K LUT to a 4K LUT are not worthwhile, in light of the generality which is sacrificed.

References

1. Norsworthy et al, 1988 IEEE ISSCC Digest of Technical Papers, pg 158


4. Private Communication with Kendall Preston 3/13/89
Figure 1a Proposed Chip Architecture with full LUT

Routing Matrix

Input Data Demultiplexer

Input Data

Shift Register

Shift Register

Shift Register

Shift Register

Shift Register

Shift Register

Shift Register

Input Shift Register

8Kx1 Static RAM (LUT)

Output Shift Register

Output Data

AddressIn

R/W -
Figure 2
Center Voxel Logic