# VLSI Implementation of Real-Time Parallel DCT/DST Lattice Structures for Video Communications\*

C.T. Chiu<sup>1</sup>, R. K. Kolagotla<sup>2</sup>, K.J.R. Liu, and J. F. JáJá.

**Electrical Engineering Department** Institute of Systems Research University of Maryland College Park, MD 20742

Abstract - The alternate use [1] of the discrete cosine transform (DCT) and the discrete sine transform (DST) can achieve higher data compression rates and less blocking effects in image processing. A parallel lattice structure that can dually generate the 1-D DCT and DST is proposed. This architecture is ideally suited for VLSI implementation because of its modularity, regularity, and local interconnections. The VLSI implementation of the lattice module using the distributed arithmetic approach is described. This realization of the lattice module using  $2\mu m$  double metal CMOS technology is capable of processing at a data rate of 116Mb/s in real-time.

### INTRODUCTION

Due to advances in ISDN network and high definition television (HDTV) technology, high speed transmission of digital video signals becomes very desirable. Transform coding plays a significant role in reducing the bit rate for video and image processing applications. The DCT and DST are two of the most efficient transforms for video and image coding applications. The DCT is widely used for data compression. This is due to its better energy compaction property and to the fact that, for highly correlated signals, its performance is closest to that of the optimal Karhunen-Loeve Transform (KLT) compared to other discrete transforms[2][3]. It was shown by Jain that the performance of the DST approaches that of the KLT for a first-order Markov sequence with given boundary conditions, especially for signals with low correlation coefficients [4]. Rose *et.al* showed in [1] that alternate use of the DCT and DST can achieve the removal of redundancies in the correlation between



<sup>\*</sup>This work is partially supported by the NSF grant ECD-8803012.

<sup>&</sup>lt;sup>1</sup>C.T. Chiu is with Electrical Engineering Dept., National Taiwan Institute of Technology, Taipei, Taiwan. <sup>2</sup>Now with IBM Corp. Burlington, VT05452.

neighboring blocks, as well as the preservation of continuity across the block boundaries.

We propose a time-recursive lattice structure that can generate the DCT and DST simultaneously. The time-recursive approach is adopted because data arrive serially in digital signal transmission. The resulting architectures are regular, modular, and have only local communication. These structures can compute the transformed data for sequential input time-recursively with a throughput rate of one sample per clock cycle. The total number of multipliers required is a linear function of the transform size N. Furthermore, there is no constraint on N. Therefore, our architectures are ideally suited for VLSI implementation.

The most important block in the 1-D and 2-D DCT lattice structures is the lattice module. The VLSI implementation of the lattice DCT/DST module is described in this paper. In order to achieve an efficient VLSI realization, we use the distributed-arithmetic method to implement the multipliers in the structure. The advantages of using this memory-oriented realization are the savings in chip area, better accuracy and higher speed [5]. This chip has been fabricated using  $2\mu m$  CMOS technology, and is capable of processing at a data rate of 116 Mb/s in real-time.

## Parallel 1-D DCT/DST Lattice Structure

A new scheme employing the time-recursive approach is presented. Since data arrives serially in digital signal transmission, we consider the orthogonal transforms from a time-recursive point of view instead of the entire block of data at a time. Denote  $X_c(k,t)$  and  $X_s(k,t)$  as the DCT and DST of a data sequence [x(t), x(t+1), ..., x(t+N-1)]. The time-recursive relations for the new transforms  $X_c(k, t+1)$  and  $X_s(k, t+1)$  in terms of the previous transforms  $X_c(k,t)$  are given by

$$\begin{aligned} & X_c(k,t+1) = \left\{ X_c(k,t) + \left[ -x(t) + (-1)^k x(t+N) \right] \left( \frac{2}{N} \cos\left(\frac{\pi k}{2N}\right) \right\} \cos(\frac{\pi k}{N}) \\ & + \left\{ X_s(k,t) + \left[ -x(t) + (-1)^k x(t+N) \right] \frac{2}{N} \sin\left(\frac{\pi k}{2N}\right) \right\} \sin(\frac{\pi k}{N}), \ (1) \end{aligned}$$

and

$$X_{s}(k,t+1) = \left\{ X_{s}(k,t) + \left[ -x(t) + (-1)^{k} x(t+N) \right] \frac{2}{N} \sin\left(\frac{\pi k}{2N}\right) \right\} \cos(\frac{\pi k}{N}) \\ - \left\{ X_{c}(k,t) + \left[ -x(t) + (-1)^{k} x(t+N) \right] \frac{2}{N} \cos\left(\frac{\pi k}{2N}\right) \right\} \sin(\frac{\pi k}{N}).$$
(2)

The lattice module manifesting this approach is shown in Fig. 1 for k = 0, 1, 2, ..., N-1. Note that the transform domain data X(k, t) has been decomposed into N disjoint components that use similar lattice modules with



Figure 1: The lattice structure for the DCT and DST with coefficients C(k)'s and D(k)'s, k = 1, 2, ..., N - 1.

different multipliers coefficients in them. This structure requires 6N - 8 multipliers and 5N - 1 adders; the total computational time is N clock cycles. It is important to note that the transformed data of subsequent input vectors can be generated every clock cycle.

# Distributed-Arithmetic based Implementation

Since the lattice module is the most important component in the DCT/DST lattice structures, we focus on the VLSI implementation of the lattice module. Every lattice module in the DCT/DST structure is a modified normal form digital filter [7], which has the least roundoff noise and is less sensitive to coefficient inaccuracy. In the following we will focus our discussion on the 8 point DCT with 8-bit input signals and 12-bit output signals using two's complement binary number system.

Sun et. al. [5, 7] proposed the first working  $16 \times 16$  DCT chip which incorporates the distributed-arithmetic method. Using this memory-oriented structure, a high speed, high accuracy efficient hardware implementation of the 2-D DCT can be achieved. We adopt the distributed-arithmetic scheme in our VLSI implementations. The multiplication results stored in the ROM are computed using double precision numbers on a SUN workstation. Using this realization, the roundoff errors due to multiplication are minimized since distributed-arithmetic transforms explicit multiplication into implicit multiplication. Therefore, the errors of the system are all due to the quantization



Figure 2: The building blocks of the lattice module with clocks.

errors resulting from finite precision implementation and addition operations. Under the 12-bit two's complement realization, the RMS error values are approximately 40dB [5], which is satisfactory for most applications.

One way to implement the lattice module is to realize those two multipliers with same input by one ROM. Using this method, the ROM size of each lattice array is 13824 bits and the number of adders needed is 10. When the number of bits of the input signal is reduced, the ROM size is reduced but the number of adders is increased. The adder is a 12-bit carry lookahead full adder/subtracter which is constructed using three 4-bit carry-look-ahead adders. Since, ROMs need less area than general purpose multipliers and can achieve a higher speed, circuit implementations using this approach can be used for very high speed video signal processing.

## Design of the Building Blocks

The main building blocks of the lattice structure are ROMs, adders, shift registers, and latches. These components will be described individually in this section. The critical path in our lattice modules is the feedback loop which includes one ROM and three adders. A traditional two-phase clocking scheme would use one phase to perform these computations and a second phase to latch the results. In order to make the two phases of the clock more symmetric, we perform computations on both phases of the clock. As shown in Fig. 2, we perform one addition and the ROM lookup during the first phase and two additions during the second phase. Intermediate results are stored in half-latches as described below.



## **ROM Implementation**

As described in Section 4, the main component of the distributed-arithmetic based lattice structure is the ROM. Most existing ROMs are implemented based on the precharge concept, that is, the bit lines are precharged high during the precharge phase, and then the selected word lines discharge some of the bit lines according the coefficients stored during the evaluate phase. In order to reduce the ROM access time, we use a novel ROM design [9]. Fig. 3 shows the detail of each cell in the ROM. A simple inverter with a feedback transistor and a transmission gate controlled by phase  $\phi_1$  is used as a sense amplifier at the output of the bit-lines. We precharge the bit lines to an intermediate voltage between GND and Vdd, and use n-channel transistors to either charge the bit line from this intermediate voltage to Vdd-Vt or discharge it to GND, during the evaluate phase. In this scheme, the array is fully populated; . *i.e.* the number of n-channel transistors in the array is MN, where M is the number of word lines and N is the number of bit lines. A 'zero' is stored at a particular location, by connecting the n-channel transistor at that location to Vdd; a 'one' is stored by connecting the transistor to GND. The cell size is only  $13\lambda \times 16\lambda$ .

In our distributed-arithmetic scheme, the multiplication of the 12-bit input number with a 12-bit sine or cosine coefficient is performed by two ROMs each with 6-bit inputs and two adders. This reduces the chip area needed to implement multiplication with fixed coefficients. The ROM includes two 6-bit decoders and two small ROMs. The 12-bit input is divided into two parts; the most significant 6-bits of the input are used to generate the coefficients for small ROM1 and the least significant 6-bits are used for small ROM2. The final result of the multiplication is obtained by adding the outputs of small ROM1 with a shifted version of the outputs of small ROM2. We only store



Figure 4: The logical diagram of the 6-bit decoder.

the most significant 7-bit result of the multiplication at ROM2. The sizes of small ROM1 and ROM2 are 64 words by 24 bits and 64 words by 14 bits respectively.

In order to improve the ROM access time, each 6-bit decoder is implemented as a tree consisting of two 3-bit decoders and a linear array of 64 AND gates. The delay time for this 6-bit decoder is 8.55ns, while a straightforward implementation would have a delay of 20.40ns. The outputs of the 64 AND gates form the word lines of the ROM array. The logical layout of the 6-bit decoder is shown in Fig. 4. The physical size of the final ROM is  $1292\lambda \times 1646\lambda$  which is much smaller than the area needed by a general purpose multiplier. The total ROM access time is 20ns.

### Adders

Since the lattice modules are implemented based on a word-serial bit-parallel approach, high-speed bit-parallel adders are necessary. We build a 12-bit carry lookahead adder using three 4-bit carry lookahead adders [10]. Two large inverters are placed at the outputs of the adders to supply sufficient drive capability. The physical size of the 12-bit adder is  $1022\lambda \times 256\lambda$  and the propagation delay through the adder is 18ns.

### Shift Registers and Latches

There are two kinds of latches and one shift registers in the circuit. One of the latches is the half-latch which is controlled by phase  $\phi_2$  and is used to store the intermediate results obtained from the adders. The logical representation of the 12-bit half-latch is depicted in Fig. 5. The other latch is the reset controlled half-latch located in the feedback loop. Its logical circuit is shown in Fig. 5. When the reset signal goes low, the outputs from ROM2 and ROM3 are set to zero. The shift register located at the input stage of the system is



Half latch

#### Reset control half latch

Figure 5: The logical diagram of half-latch and reset controlled half-latch.



Figure 6: The signal diagram of the clock signals.

a regular two phase shift register which delays the input sequence by eight clock cycles.

## **Control** Unit

Only one control signal (reset) and two clock phases  $(\phi_1 \text{ and } \phi_2)$  are required in this system. Phase  $\phi_1$  is used to latch the computational results from one adder and the ROM, while phase  $\phi_2$  is used to latch the results from the remaining adders. Since the propagation delay time for the ROM and the adder is approximately the same, we can make both clock phases symmetric to each other. The signal diagram of these two phases is depicted in Fig. 6. The reset signal is active low. One of the attractive features of this chip is the very simple control signals used. No additional logical control circuitry is needed in the design.

# **Chip Realization and Simulations**

Having realized the symbolic layout of the individual blocks, the next issue is to integrate all these components efficiently. This includes three ROMs,



Figure 7: The physical layout diagram of the lattice module.

eleven adders, four half-latches, two reset controlled half-latches and one shift register. ROM2 and ROM3 are rotated by 90 and 270 degree respectively to simplify inter-component routing. This chip accepts 8-bit input signals and produces 12-bit DCT and DST coefficients every 100ns. The physical layout of the lattice module chip is depicted in Fig. 7. There are 18000 transistors in the chip, most of which are used in the three fully-populated ROMs. The total size of the active circuitry is  $5400\lambda \times 3400\lambda$ . This is fabricated in a die of size  $6800\lambda \times 4600\lambda$  and packaged in a 40-pin chip.

Both logical and timing simulations were performed on the this chip. From the simulation results due to Sun *et. al* [5], the word length of the ROM must be at least 9 bits to ensure that the SNR is greater than 40dB. To reduce the error due to recursive computations, we increase the word length of the ROM to 12 bits. We used IRSIM to perform logic simulations on the layout of the chip. The results from IRSIM were compared with the DCT and DST of the same input sequence obtained from a C program written on a SUN workstation. The results are accurate up to the least significant bit of the 12-bit representation. The SNR of this computation from simulations is about 41dB, which is satisfactory for image and video processing applications. SPICE simulations indicate that the worst case rise and fall time for the ROM bit-lines are 8ns and 9ns respectively. Fig. 8 shows the timing simulation for the entire ROM which includes the decoder, cell array, and sense



Figure 8: The SPICE simulation of the ROM.

amplifier. The sense amplifier, together with a feedback transistor is used to precharge the bit lines to an intermediate voltage between GND and Vdd. This decreases the voltage swing on the bit lines during the evaluate phase, and hence reduces the ROM access time. It should be noted that although the bit line does not charge up to Vdd, the sense amplifier can still discharge the output to GND in a relatively short time. The delay time in this case is 15ns. The delay time for the three stage 12-bit carry lookahead adder is 20ns. The critical path in the structure is the feedback loop, which contains one ROM and three adders. Timing simulations indicate that the chip works at a frequency of 14.5MHz and hence input data at a rate of 116Mb/s can be processed in real-time.

# Conclusion

The algorithm and architecture of the first chip that can generate the 1-D DCT and DST simultaneously were described. We implemented the lattice module using the distributed arithmetic method with a data rate of 116 Mb/s under  $2\mu m$  CMOS technology. Fabricating using  $0.8\mu m$  technology will result in a data rate of 320 Mb/s. Testing results clearly show that our VLSI implementation is a good candidate for real-time image processing. We are currently exploring further refinements and a full-implementation of a

| Technology              | 2-um double metal CMOS |
|-------------------------|------------------------|
| Core size               | $5.6 \times 3.4 mm^2$  |
| Die size                | $6.8 \times 4.6 mm^2$  |
| Total no of transistors | 18,000                 |
| Active signal pads      | 39                     |
| Speed                   | 14.5 <i>M</i> Hz       |

Table 1: Summary of the DCT lattice module chip.

system that can handle  $16 \times 16$  block sizes.

## References

- K. Rose, A. Heiman, and I. Dinstein, "DCT/DST Alternate-Transform Image Coding," *IEEE Trans. on Communication*, vol. 38, No. 1, pp. 94-101, Jan. 1990.
- [2] A. Rosenfeld, and A. C. Kak, Digital Picture Processing, 2nd edition, Academic Press, 1982.
- [3] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," *IEEE Trans. Comput.*, vol. C-23, pp. 90-93, Jan. 1974.
- [4] A. K. Jain, "A fast Karhunen-Loeve transform for a class of stochastic process," *IEEE Trans., Commun.*, vol. COM-24, pp.1023-1029, 1976.
- [5] M. T. Sun, T. C. Chen, and A. M. Gottlieb, "VLSI implementation of a 16 × 16 Discrete Cosine Transform," *IEEE Trans. on Circuits and Systems*, vol. 36, No. 4, pp. 610-617, Apr. 1989.
- [6] R. Gluth, "Regular FFT-related transform kernels for DCT/DST based polyphase filter banks," *IEEE ICASSP*, 1991, pp. 2205-2208.
- [7] S. A. White, "Applications of Distributed-Arithmetic to Digital Signal Processing: A Tutorial Review," *IEEE ASSP Magazine*, pp. 4-19, July. 1989.
- [8] K. J. R. Liu, and C. T. Chiu, "Unified Parallel Lattice Structures for Time-Recursive Discrete Cosine/Sine/Hartley Transforms," to appear IEEE Trans. on Signal processing, May, 1993.
- [9] L. A. Glasser, and D. W. Dobberpuhl, The Design and Analysis of VLSI Circuits, Addison Wesley, 1985.
- [10] Neil H. E. Weste, and Kamran Eshraghian, Principles of CMOS VLSI Design, A Systems Perspective, Addison Wesley, 1985.