# Hardware Architecture of Motion Estimation (ME)

#### Shao-Yi Chien

Most slides are prepared by Y.W. Huang, DSP/IC Design Lab.

- Track block regions from frame to frame
- Compromise between efficiently removing temporal redundancy and computation cost
- Motion estimation is usually performed only at the encoder side to avoid huge computation at the decoder side, so the motion data have to be transmitted from the encoder to the decoder as side information.
- Motion estimation takes more than 90% of the total computation in modern video encoders
- A translational model is often adopted.

# Block Matching Algorithm (BMA) (2/3)

Motion vector $V_t(p,q) = (Vec_i, Vec_j)$: the location in the search range $\Omega$ that has the maximum correlation value between blocks in temporally adjacent frames.

[Figure: current frame]

# Block Matching Algorithm (BMA) (3/3)

[Figure: video sequence. Previous reconstructed frame t-1 (reference frame) with the search range and the best matching block; current encoding frame t (current frame) with the current block and its motion vector.]

# Factors Affecting BMA

- Search algorithm
- Matching criterion
  - SSD (sum of squared pixel differences, mostly used in software)
  - SAD (sum of absolute pixel differences, mostly used in hardware)
- Search range [-p, +p]

[Figure: search range in reference frame]

# Full-Search Block Matching Algorithm

[Figure: current block, search range, reference block (candidate block), candidate search position (search location)]

$$SAD(i,j) = \sum_{k=1}^{N} \sum_{l=1}^{N} \left| x_{t}(k,l) - x_{t-1}(k+i,l+j) \right|$$

### **Computation Complexity**

```
Loop 1: For m = 0 to (width/blocksize)-1            (for each macroblock)
Loop 2:   For n = 0 to (height/blocksize)-1
Loop 3:     For i = -d to d-1                       (for each candidate search position)
Loop 4:       For j = -d to d-1
Loop 5:         For k = 0 to N-1                    (calculate the distortion,
Loop 6:           For l = 0 to N-1                   and choose the smallest one)
                    MAD(i,j) = MAD(i,j) + |X(k,l)-Y(k+i,l+j)|
                  End (Loop 6)
                End (Loop 5)
              End (Loop 4)
            End (Loop 3)
          End (Loop 2)
        End (Loop 1)
```

# Inter-Level Parallelism (1/2)

```
Loop 1: For m = 0 to N-1
Loop 2:   For n = 0 to N-1
Loop 3:     For k = -p to p-1
Loop 4:       For l = -p to p-1
                SAD(k,l) = SAD(k,l) + |X(m,n)-Y(m+k,n+l)|
              End (Loop 4)
            End (Loop 3)
          End (Loop 2)
        End (Loop 1)
```

Each PE is responsible for the SAD of all pixels in a candidate.

- For 2-D arrays (2p x 2p PEs), at least (N x N) cycles are required.
- For 1-D arrays (2p x 1 PEs), at least (2p x N x N) cycles are required.

DSP in VLSI Design

### Inter-Level Parallelism (2/2)

[Figure: current block and candidate blocks]

# Intra-Level Parallelism (1/2)

```
Loop 1: For k = -p to p-1
Loop 2:   For l = -p to p-1
Loop 3:     For m = 0 to N-1
Loop 4:       For n = 0 to N-1
                SAD(k,l) = SAD(k,l) + |X(m,n)-Y(m+k,n+l)|
              End (Loop 4)
            End (Loop 3)
          End (Loop 2)
        End (Loop 1)
```

Each PE is responsible for the SAD of one pixel in all candidates.

- For 2-D arrays (N x N PEs), at least (2p x 2p) cycles are required.
- For 1-D arrays (N x 1 PEs), at least (N x 2p x 2p) cycles are required.

# Intra-Level Parallelism (2/2)

[Figure: current block and candidate blocks]

# 1-D Systolic Array

K.-M. Yang, M.-T. Sun, L. Wu, "A family of VLSI designs for the motion compensation block-matching algorithm," *IEEE Transactions on Circuits and Systems*, vol. 36, no. 10, October 1989

# Systolic Array (1/3)

- Systolic architecture (systolic array)
  - A network of processing elements (PEs) that rhythmically compute and pass data through the system
- Modularity and regularity
- All the PEs in the systolic array are uniform and fully pipelined
- Contains only local interconnections

# Systolic Array (2/3)

[Figure: typical systolic array]

# Systolic Array (3/3)

- Some relaxations
  - Not only local but also neighbor interconnections
  - Use of data broadcast operations
  - Use of different PEs in the system, especially at the boundaries
  - Also called a "semi-systolic array"

### 1-D Linear PE Array ME

- The first chip design for block matching motion estimation in the world
- Two kinds of dataflow
  - Broadcast the reference frame, move the current frame
  - Broadcast the current frame, move the reference frame
- PE number = 1-D search range
- Each PE computes the SAD of a candidate macroblock
- Flexible block size (simply change the data flow)
- Chips can be cascaded to enlarge the search range

# Block Matching Algorithm

$$S(m_j) = \sum \sum |a(I_a + i, J_a + j) - b(I_b + k, J_b + l + m_j)|,$$

for $m_j = 0, 1, \dots, 15$.
- a: current frame
- b: reference frame

# Broadcasting Reference Frame

Search range [-8, +7], block size 16x16.

[Figure: processing element with an accumulator. Each PE accumulates the SAD of a candidate, $\sum \lvert a(i,j) - b(k,l) \rvert$.]

#### **Basic Data Flow**

[Figure: basic data flow]

### Broadcasting Current Frame

#### **Basic Data Flow**

[Figure labels: b(0,0), b(0,1), ..., b(0,15)]

Data sequence per cycle (C, P', P) and the absolute difference computed by each PE. PE$_n$ accumulates $\sum\sum \lvert a(i,j) - b(k,l+n) \rvert$ for $n = 0, \dots, 15$.

| Cycle time | C | P' | P | PE$_0$ | PE$_1$ | ... | PE$_{14}$ | PE$_{15}$ |
|---|---|---|---|---|---|---|---|---|
| 0x16+0 | a(0,0) | b(0,16) | b(1,0) | a(0,0)-b(0,0) | a(0,0)-b(0,1) | ... | a(0,0)-b(0,14) | a(0,0)-b(0,15) |
| 0x16+1 | a(0,1) | b(0,17) | b(1,1) | a(0,1)-b(0,1) | a(0,1)-b(0,2) | ... | a(0,1)-b(0,15) | a(0,1)-b(0,16) |
| ... | | | | | | | | |
| 0x16+14 | a(0,14) | b(0,30) | b(1,14) | a(0,14)-b(0,14) | a(0,14)-b(0,15) | ... | a(0,14)-b(0,28) | a(0,14)-b(0,29) |
| 0x16+15 | a(0,15) | b(0,31) | b(1,15) | a(0,15)-b(0,15) | a(0,15)-b(0,16) | ... | a(0,15)-b(0,29) | a(0,15)-b(0,30) |
| 1x16+0 | a(1,0) | b(1,16) | b(2,0) | a(1,0)-b(1,0) | a(1,0)-b(1,1) | ... | a(1,0)-b(1,14) | a(1,0)-b(1,15) |
| 1x16+1 | a(1,1) | b(1,17) | b(2,1) | a(1,1)-b(1,1) | a(1,1)-b(1,2) | ... | a(1,1)-b(1,15) | a(1,1)-b(1,16) |
| ... | | | | | | | | |
| 1x16+14 | a(1,14) | b(1,30) | b(2,14) | a(1,14)-b(1,14) | a(1,14)-b(1,15) | ... | a(1,14)-b(1,28) | a(1,14)-b(1,29) |
| 1x16+15 | a(1,15) | b(1,31) | b(2,15) | a(1,15)-b(1,15) | a(1,15)-b(1,16) | ... | a(1,15)-b(1,29) | a(1,15)-b(1,30) |
| ... | | | | | | | | |
| 14x16+0 | a(14,0) | b(14,16) | b(15,0) | a(14,0)-b(14,0) | a(14,0)-b(14,1) | ... | a(14,0)-b(14,14) | a(14,0)-b(14,15) |
| 14x16+1 | a(14,1) | b(14,17) | b(15,1) | a(14,1)-b(14,1) | a(14,1)-b(14,2) | ... | a(14,1)-b(14,15) | a(14,1)-b(14,16) |
| ... | | | | | | | | |
| 14x16+14 | a(14,14) | b(14,30) | b(15,14) | a(14,14)-b(14,14) | a(14,14)-b(14,15) | ... | a(14,14)-b(14,28) | a(14,14)-b(14,29) |
| 14x16+15 | a(14,15) | b(14,31) | b(15,15) | a(14,15)-b(14,15) | a(14,15)-b(14,16) | ... | a(14,15)-b(14,29) | a(14,15)-b(14,30) |
| 15x16+0 | a(15,0) | b(15,16) | b(0,0) | a(15,0)-b(15,0) | a(15,0)-b(15,1) | ... | a(15,0)-b(15,14) | a(15,0)-b(15,15) |
| 15x16+1 | a(15,1) | b(15,17) | b(0,1) | a(15,1)-b(15,1) | a(15,1)-b(15,2) | ... | a(15,1)-b(15,15) | a(15,1)-b(15,16) |
| ... | | | | | | | | |
| 15x16+14 | a(15,14) | b(15,30) | b(0,14) | a(15,14)-b(15,14) | a(15,14)-b(15,15) | ... | a(15,14)-b(15,28) | a(15,14)-b(15,29) |
| 15x16+15 | a(15,15) | b(15,31) | b(0,15) | a(15,15)-b(15,15) | a(15,15)-b(15,16) | ... | a(15,15)-b(15,29) | a(15,15)-b(15,30) |

Reference frame data are loaded from $Q_0$-$Q_{15}$ to $R_0$-$R_{15}$ in parallel every 16 cycles.

# Cascaded Chip Design

For example, the search range is extended from [-8, +7] to [-16, +15], and the motion estimation of 2 macroblocks is processed simultaneously.

### Overlapped Search Area

The overlapped search area can be broadcast to each chip to save bandwidth.

[Figure: overlapped tracking area of two adjacent blocks; overlapped sub-tracking area]

# Motion Estimation with Fractional Precision (1/2)

Quarter-pel precision

[Figure legend: candidates of integer pixel; candidates of fractional pixel]

# Motion Estimation with Fractional Precision (2/2)

[Figure]

# Fractional Motion Compensation Chip-Pair Design

[Figure]

# Block Diagram of a Fractional Motion Estimation Chip

[Figure]

# 2-D Systolic Array

H. Yeo and Y. H. Hu, "A novel modular systolic array architecture for full-search block matching motion estimation," *IEEE Transactions on Circuits and Systems for Video Technology*, vol. 5, no. 5, October 1995

# 6-D Array to 3-D Array

Index mapping from the 6-D loop nest (v, h, m, n, i, j) to the 3-D loop nest (b, l, k):

- $b = vN_h + h$, with $0 \le b < N_{hv} = N_h N_v$ (b: block index)
- $l = 2p(m+p) + n + p$, with $0 \le l < l_p = (2p)^2$ (l: search candidate index)
- $0 \le k < N^2$, $\ldots = k \bmod N$ (k: pixel index)

```
6-D loop nest:

do v = 0 to N_v - 1
  do h = 0 to N_h - 1
    MV(h,v) = (0,0)
    D_min(h,v) = ∞
    do m = -p to p - 1
      do n = -p to p - 1
        MAD(m,n) = 0
        do i = 0 to N - 1
          do j = 0 to N - 1
            MAD(m,n) = MAD(m,n) + |x(i,j) - y(i+m,j+n)|
          enddo j
        enddo i
        if D_min(h,v) > MAD(m,n)
          D_min(h,v) = MAD(m,n)
          MV(h,v) = (m,n)
        endif
      enddo n
    enddo m
  enddo h
enddo v

3-D loop nest:

do b = 0 to N_hv - 1
  MV(b) = 0
  D_min(b) = ∞
  do l = 0 to l_p - 1
    MAD(l) = 0
    do k = 0 to N^2 - 1
      MAD(l) = MAD(l) + |x_s(k) - y_s(k+l)|
    enddo k
    if D_min(b) > MAD(l)
      D_min(b) = MAD(l)
      MV(b) = l
    endif
  enddo l
enddo b
```

#### 3-D DG of Motion Estimation

[Figure: 3-D dependence graph; the legible axis labels are b and k]

- Data reuse of the reference frame is also considered

#### 3-D DG to 2-D SFG

[Figure]

#### 2-D SFG to 1-D SFG

$D^* = N^2 D$

[Figure]

#### 1-D SFG to 2-D Mesh

[Figure]

# Scheduling

[Figure: scheduling example. A grid of reference pixels y11 to y11,7 spans reference block 1 and reference block 2, with the search area of reference block 2 and the overlapped search area between two adjacent blocks marked. Current pixels x11, x12, x13, ... enter a row of 16 PEs that outputs the MV, and a per-clock-cycle table (cycles 1 to 31) lists which reference pixel y each PE receives. D denotes a delay.]

# Tree-Based Architecture

Y.-S. Jehng, L.-G. Chen, and T.-D. Chiueh, "An efficient and simple VLSI tree architecture for motion estimation algorithms," *IEEE Transactions on Signal Processing*, vol. 41, no. 2, February 1993

#### **Features**

- High throughput
- Parallel computing
- Short data path length, low latency
- Independent data-flow computation, which benefits irregular block matching, especially the realization of the three-step hierarchical search algorithm

#### Tree-Based Architecture

[Figure]

#### Memory Interleaving

[Figure]

#### Whole Architecture

[Figure]

#### Pipeline Interleaving

- Remove pipeline bubbles (step hazards)
- Interleave with the adjacent block

[Figure: pipeline timing table. Candidates 2-1 to 2-9 of the current task's step 2 pass through the pipeline stages, each stage lagging the previous one by one cycle; the following slots are labeled as step 2 of the next task.]

#### Tree-Cut Technique

- Applying folding to
  - Processing elements
  - Memory

#### Comparison

TABLE I: TREE-CUT TECHNIQUE FOR HARDWARE REDUCTION

| Configuration | # Adders | N = 8 | N = 16 | Time instances required (FBMA) | N = 8, d = 3 | N = 16, d = 7 |
|---|---|---|---|---|---|---|
| Systolic mesh | $2N^2 + N + 1$ | 137 | 529 | $(2d + 1)(N + 2d)$ | 98 | 450 |
| Full tree | $2N^2$ | 128 | 512 | $(2d + 1)^2$ | 49 | 225 |
| 1/2 cut | $2N^2/2 + 1$ | 65 | 257 | $2(2d + 1)^2$ | 98 | 450 |
| 1/4 cut | $2N^2/4 + 1$ | 33 | 129 | $4(2d + 1)^2$ | 196 | 900 |

# On-Chip RAM Issues

Ref: Jen-Chieh Tuan, Tian-Sheuan Chang, and Chein-Wei Jen, "On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture," *IEEE Transactions on Circuits and Systems for Video Technology*, vol. 12, no. 1, pp. 61-72, Jan. 2002.

#### On-Chip SRAM

The off-chip memory bandwidth can be dramatically reduced with on-chip memory.

[Figure]

#### On-Chip SRAM

- If the current block pixels and search area pixels are buffered in on-chip SRAM, the required bandwidth on the system bus (external RAM) decreases significantly
  - Data reuse of search area pixels can further reduce the bandwidth of the system bus
- The SRAM acts like the cache memory in a CPU
- This is a trade-off between area and bandwidth
- In the following discussions, we assume the block size is N x N and the search range is [-P, +P-1]

### Different Schemes of Data Reuse for Search Area Pixels

- Data reuse between different rows of candidates in one column of a block (scheme A)
- Data reuse between adjacent columns of candidates in a block (scheme B)
- Data reuse between adjacent blocks in one row of blocks (scheme C)
- Data reuse between different rows of blocks (scheme D)
- In today's technology, scheme C is mostly used.
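
To make the bandwidth argument concrete, here is a minimal counting sketch under the stated assumptions (block size N x N, search range [-P, +P-1], frame borders ignored). The function names and the three cases are illustrative and do not map one-to-one onto schemes A to D; the last case corresponds to reuse between horizontally adjacent blocks, as in scheme C.

```python
def ref_loads_per_block(N: int, P: int) -> dict:
    """Off-chip reference-pixel loads needed to search one N x N block.

    Assumes a search range of [-P, +P-1] in both directions, so there are
    (2P)^2 candidates, and ignores frame-border effects.
    """
    candidates = (2 * P) ** 2
    window = 2 * P + N - 1  # side length of the search area touched by one block
    return {
        # Every candidate fetches its own N x N pixels from external RAM.
        "per_candidate_fetch": candidates * N * N,
        # The whole search area of a block is fetched once into on-chip SRAM.
        "per_block_window": window * window,
        # Only N new columns are fetched; the rest is reused from the left neighbour.
        "adjacent_block_reuse": N * window,
    }


def loads_per_current_pixel(N: int, P: int) -> dict:
    """Same counts, normalised by the N x N current pixels of the block."""
    return {name: count / (N * N) for name, count in ref_loads_per_block(N, P).items()}


if __name__ == "__main__":
    for name, value in loads_per_current_pixel(16, 16).items():
        print(f"{name:>22}: {value:8.2f} loads per current pixel")
```

For N = 16 and P = 16 the three cases come to 1024, about 8.63, and about 2.94 loads per current pixel; the last figure is close to the 2P/N + 1 = 3 usually quoted for scheme C.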
#### Illustration of Scheme A

Data reuse between different rows of candidates in one column of a block

[Figure: candidate of row 0 and candidate of row 1]

#### Illustration of Scheme B

Data reuse between adjacent columns of candidates in a block

[Figure]

#### Illustration of Scheme C

Data reuse between adjacent blocks in one row of blocks

[Figure]

#### Illustration of Scheme D

Data reuse between different rows of blocks

[Figure]

## Comparison of Different Schemes of Search Area Data Reuse

| | Scheme A | Scheme B | Scheme C | Scheme D |
|---|---|---|---|---|
| On-chip buffer size (bytes) | (2N-1) x (N-1) | N x (2P+N-1) + N x (N-1) | Max{2N, 2P} x (2P+N-1) | W x (2P-1) + 2P x N |
| Off-chip to on-chip (times/pixel) | (2P/N+1)<sup>2</sup> x (2P/N) | (2P/N+1) x (2P/N) | 2P/N+1 | 1 |
| On-chip to core (times/pixel) | 2NP / (2P+N-1) | 2NP / (2P+N-1) x 2 | 2NP / (2P+N-1) x (2P/N+1) | 2P x (2P/N+1) |

#### Level C+ Data Reuse

- Conventional data reuse schemes are based on raster scan
- By use of stripe scan
  - Stitch n successive vertical MBs (n-stitched)
  - Load their search ranges
  - Partially reuse vertical data

Ref: Ching-Yeh Chen, Chao-Tsung Huang, Yi-Hau Chen, and Liang-Gee Chen, "Level C+ data reuse scheme for motion estimation with corresponding coding orders," *IEEE Transactions on Circuits and Systems for Video Technology*, vol. 16, no. 4, pp. 553-558, April 2006.
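
The effect of stitching n vertical MBs can be sketched numerically with the equivalent-access-factor (Ea) and on-chip buffer (SRB) expressions the slides give for Level C, Level C+, and Level D. The function names and the example parameters are illustrative, and W is assumed here to be the frame width in pixels.

```python
def level_c(N: int, sr_h: int, sr_v: int) -> tuple:
    """Return (Ea, SRB in pixels) for the Level C scheme."""
    return 1 + sr_v / N, (sr_h + N - 1) * (sr_v + N - 1)


def level_c_plus(N: int, sr_h: int, sr_v: int, n: int) -> tuple:
    """Return (Ea, SRB in pixels) for Level C+ with n vertically stitched MBs."""
    return 1 + sr_v / (n * N), (sr_h + N - 1) * (sr_v + n * N - 1)


def level_d(W: int, sr_h: int, sr_v: int) -> tuple:
    """Return (Ea, SRB in pixels) for Level D; W is assumed to be the frame width."""
    return 1.0, (sr_h + W - 1) * (sr_v - 1)


if __name__ == "__main__":
    # Example parameters (assumed): 16 x 16 MBs, 32 x 32 search range, 720-pixel-wide frame.
    N, SR_H, SR_V, W = 16, 32, 32, 720
    rows = [
        ("Level C", level_c(N, SR_H, SR_V)),
        ("Level C+ (n=2)", level_c_plus(N, SR_H, SR_V, 2)),
        ("Level D", level_d(W, SR_H, SR_V)),
    ]
    for name, (ea, srb) in rows:
        print(f"{name:<15} Ea = {ea:.2f}   SRB = {srb} pixels")
```

With these example numbers, going from Level C to Level C+ with n = 2 lowers Ea from 3.0 to 2.0 while the buffer grows from 2209 to 2961 pixels, far below the 23281 pixels that Level D would need.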
#### Comparison

| Data Reuse Scheme | Bandwidth (Ea) | SRB |
|---|---|---|
| Level C scheme | $1 + \frac{SR_V}{N}$ | $(SR_H + N - 1)(SR_V + N - 1)$ |
| Level C+ scheme | $1 + \frac{SR_V}{nN}$ | $(SR_H + N - 1)(SR_V + nN - 1)$ |
| Level D scheme | 1 | $(SR_H + W - 1)(SR_V - 1)$ |

- System memory bandwidth (equivalent access factor):

$$Ea_{ME} = \frac{\text{Total memory bandwidth for reference frame}}{\text{processed current pixels}}$$

- On-chip memory size (SRB)

#### Fast-Algorithm-Based Architecture

Y.-W. Huang, S.-Y. Chien, B.-Y. Hsieh, and L.-G. Chen, "An efficient and low power architecture design for motion estimation using global elimination algorithm," *IEEE International Conference on Acoustics, Speech, and Signal Processing*, 2002

#### Multi-Level SEA (MSEA)

- Convert the sea value to an msea value.
- Split the 16x16 macroblock into sub-blocks.
- SAD ≥ msea ≥ sea, which means MSEA can skip more unnecessary SAD calculations than SEA under the same scan order.
- However, computing the msea value is heavier than computing the sea value.

#### Example of MSEA at Level = 3

- MSEA is the sub-sampled version of SAD
- Define MSEA = SSAD

The 16 x 16 current block and the 16 x 16 candidate block are each divided into a 4 x 4 grid of 4 x 4 sub-blocks:

| csum<sub>00</sub> | csum<sub>01</sub> | csum<sub>02</sub> | csum<sub>03</sub> |
|---|---|---|---|
| csum<sub>10</sub> | csum<sub>11</sub> | csum<sub>12</sub> | csum<sub>13</sub> |
| csum<sub>20</sub> | csum<sub>21</sub> | csum<sub>22</sub> | csum<sub>23</sub> |
| csum<sub>30</sub> | csum<sub>31</sub> | csum<sub>32</sub> | csum<sub>33</sub> |

- $csum_{ij}$: sum of sub-block$_{ij}$ in the current block
- $rsum_{ij}$: sum of sub-block$_{ij}$ in the candidate block at search position (m,n)

$$msea(m,n) = |csum_{00} - rsum_{00}| + |csum_{01} - rsum_{01}| + \dots + |csum_{33} - rsum_{33}|$$

#### Global Elimination Algorithm

1. Compute SSAD by raster scan
2. Keep the M search positions with the smallest SSAD
3. Compute the SAD of the M positions and skip the rest

#### Flowchart of GEA

[Figure]

#### Reduced Operations

- Full-search block matching algorithm
  - Compute SAD at each search position
  - About 256 subtractions, 256 absolute value calculations, and 256 additions per search position
- Global elimination algorithm
  - Compute SSAD at each search position
  - SSAD is much easier to compute than SAD
  - About 16 subtractions, 16 absolute value calculations, and 16 additions per search position

### Motion Compensated Subjective 5

[Figure panels: previous frame, current frame, compensated frame (GEA and FSBMA), MV plot]

### Motion Compensated PSNR (dB)

Level = 3, M = 7

| Video Sequence | QCIF [-16,+15], GEA | QCIF [-16,+15], FSBMA | CIF [-32,+31], GEA | CIF [-32,+31], FSBMA |
|---|---|---|---|---|
| Coastguard | 32.93 | 32.93 | 31.55 | 31.59 |
| Container | 43.11 | 43.11 | 38.53 | 38.53 |
| Foreman | 32.22 | 32.21 | 32.82 | 32.85 |
| Hall Monitor | 32.97 | 32.98 | 34.82 | 34.90 |
| Mobile Calendar | 26.15 | 26.15 | 25.16 | 25.20 |
| Silent | 35.16 | 35.14 | 36.11 | 36.12 |
| Stefan | 24.67 | 24.71 | 25.71 | 25.73 |
| Table Tennis | 32.11 | 32.10 | 32.96 | 33.03 |
| Weather | 38.42 | 38.42 | 37.45 | 37.45 |

# VLSI Architecture Design for SEA

#### Systolic Part

Current block data and search range data are input column by column, from left to right and from top to bottom. Each PE computes the sum of a 4x4 sub-block, and the sixteen sums are processed in parallel.

#### Parallel Adder Tree

1. The "AD" unit stands for "Absolute Difference."
2. The sixteen sub-block sums of the current block (stored in registers right after the systolic part calculates them) and the sixteen sub-block sums of a candidate block (input directly from the systolic part) are input in parallel.
3. The msea value (at level 3) of a candidate block can be computed in one cycle.
4. This adder tree can be reused to compute SAD.

#### Parallel Comparator Tree (1/3)

[Figure: msea from the parallel adder tree]

1. The goal of the parallel comparator tree is to find the 7 smallest msea values among all search positions, so that the SADs of these 7 candidates can be calculated in the later stage.
2. The "mseax\_reg" units (x = 1~7) contain the up-to-date 7 smallest msea values and their corresponding search positions.
3. The new msea value of the next search position is input from the parallel adder tree.
4. The first part of the parallel comparator tree finds the maximum among the stored values and the newly arriving value.

#### Parallel Comparator Tree (2/3)

1. The second part of the parallel comparator tree checks whether a stored msea value is equal to the maximum value.
2. If more than one msea register is equal to msea\_max, only one of them is selected. This is done by the "CHECK" unit.

#### Parallel Comparator Tree (3/3)

1. The third part of the parallel comparator tree replaces the maximum msea value in the msea registers with the newly arriving value if necessary.
2. The msea registers always contain the 7 smallest msea values and their corresponding search positions.

#### Comparison with FSBMA [-16,+15]

| Architecture | Description | Required Freq. | Gate Count |
|--------------|-------------------|----------------|------------|
| Yang [1] | 1-D semi-systolic | 97.32 MHz | 44.7 K |
| AB1 [3] | 1-D systolic | 285.88 MHz | 14.4 K |
| AB2 [5] | 2-D systolic | 17.87 MHz | 98.2 K |
| Hsieh [9] | 2-D systolic | 26.24 MHz | 100.1 K |
| Tree [10] | Tree structure | 12.17 MHz | 58.7 K |
| Yeo [12] | 2-D semi-systolic | 3.04 MHz | 436.6 K |
| Lai [14] | 1-D semi-systolic | 3.04 MHz | 384.8 K |
| SA [15] | 2-D systolic | 12.17 MHz | 127.0 K |
| SSA [16] | 2-D semi-systolic | 12.17 MHz | 110.6 K |
| Ours [18] | Based on GEA | 19.42 MHz | 23.1 K |

#### Comparison with FSBMA [-32,+31]

| Architecture | Description | Required Freq. | Gate Count |
|--------------|-------------------|----------------|------------|
| Yang [1] | 1-D semi-systolic | 194.64 MHz | 107.6 K |
| AB1 [3] | 1-D systolic | 961.04 MHz | 16.5 K |
| AB2 [5] | 2-D systolic | 60.07 MHz | 107.9 K |
| Hsieh [9] | 2-D systolic | 74.14 MHz | 100.0 K |
| Tree [10] | Tree structure | 48.66 MHz | 58.4 K |
| Yeo [12] | 2-D semi-systolic | 3.04 MHz | 1746.4 K |
| Lai [14] | 1-D semi-systolic | 3.04 MHz | 1539.3 K |
| SA [15] | 2-D systolic | 48.66 MHz | 127.0 K |
| SSA [16] | 2-D semi-systolic | 48.66 MHz | 110.6 K |
| Ours [18] | Based on GEA | 61.62 MHz | 33.2 K |

#### Chip Photo & Spec.

| Item | Specification |
|---|---|
| Process | TSMC 1P4M 0.35 um |
| Package | 128 CQFP |
| Die Size | 3.679 x 4.001 mm<sup>2</sup> |
| Core Size | 2.591 x 2.879 mm<sup>2</sup> |
| Max. Frequency | 27.8 MHz |
| Logic Gate Count | 25,997 |
| On-Chip SRAM | 20,480 bits |
| Transistor Count | 357,551 |
| Search Range | [-16,+15] |
| Processing Speed | 152 QCIF (176 x 144) fps; 38 CIF (352 x 288) fps |
| Power Consumption | 272 mW @ 25 MHz |

# Motion Estimation in H.264

#### **Motion Compensation**

- Seven kinds of block sizes (16x16, 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4)
- 1/4 sample accuracy
  - 6-tap filtering for 1/2-pixel
  - Simplified filtering for 1/4-pixel
- Multiple reference pictures

#### Variable Block Sizes

[Figure: MB modes (16x8, 8x16, 8x8) and 8x8 modes (8x4, 4x8, 4x4), with partitions numbered from 0]

#### Example of Variable Block Sizes

[Figure]

#### Multiple Reference Frames

[Figure: reference frame usage in MPEG-1, MPEG-2, MPEG-4 versus H.264/JVT/AVC]

Note: the whole 8x8 sub-partition must be predicted from the same reference frame.

# Rate-Distortion Optimized Mode Decision

- Lagrangian method to minimize $J = D + \lambda R$
  - J is the cost function
  - D is the distortion (SAD, SATD, SSD)
  - R is a function of the bit rate
  - λ is called the Lagrangian multiplier
- Theoretically, if D is a function of R, denoted D(R), then λ is obtained by differentiating J with respect to R.
  - Setting the first derivative to zero and solving for λ gives $\lambda = -dD/dR$.
  - In the reference software, λ is a function of QP.

#### Macroblock Mode Decision

[Figure: for each of the four 8x8 blocks, the minimum cost over Inter4x4, Inter4x8, Inter8x4, and Inter8x8 (5 ref, 1/4-pel) gives MinCost0, MinCost1, MinCost2, and MinCost3; Inter16x16 (5 ref, 1/4-pel) gives MinCost.]

- Cost16x16 = MinCost
- Cost8x8 = MinCost0 + MinCost1 + MinCost2 + MinCost3

#### **Encountered Problem**

The exact estimation of the rate in the Lagrangian cost function makes parallel processing of different partitions impossible.

# Modified Macroblock Mode Decision

[Figure]

#### Basic Architecture

- 256 PEs
- Current block stays
- Broadcast search area pixels
- Accumulate and propagate partial SAD values
- Does not require 256 8-bit registers to buffer the pixels of a candidate block

## Basic Data Flow

[Figure: search area pixels in, SAD out]

#### Variable Block Size Architecture

[Figure]

#### Implementation

| Item | Specification |
|---|---|
| Process | TSMC 1P4M 0.35 um |
| Chip area | 5.056 x 5.056 mm<sup>2</sup> |
| Package | 128 CQFP |
| On-chip SRAM | 24,576 bits |
| Gate count | 105,575 |
| Max. freq. | 66.67 MHz |
| Search range | H [-24, +23], V [-16, +15] |
| Capability | 50 GOPS; D1 (720x480) 30 fps, 1 ref.; SIF (352x240) 30 fps, 4 ref. |

# Variable-Block-Size Motion Estimation

- Ching-Yeh Chen, Shao-Yi Chien, Yu-Wen Huang, Tung-Chien Chen, Tu-Chih Wang, and Liang-Gee Chen, "Analysis and Architecture Design of Variable Block-Size Motion Estimation for H.264/AVC," *IEEE Transactions on Circuits and Systems I*, vol. 53, no. 2, pp. 578-593, Feb. 2006.
- Tung-Chien Chen, Shao-Yi Chien, Yu-Wen Huang, Chen-Han Tsai, Ching-Yeh Chen, To-Wei Chen, and Liang-Gee Chen, "Analysis and Architecture Design of an HDTV720p 30 Frames/s H.264/AVC Encoder," *IEEE Transactions on Circuits and Systems for Video Technology*, vol. 16, no. 6, pp. 673-688, June 2006.

### Six Major Reference Architectures

- **1DInterYSW**: K. M. Yang, M. T. Sun, and L. Wu, "A family of VLSI designs for the motion compensation block-matching algorithm," IEEE Trans. Circuits Syst., vol. 36, no. 10, pp. 1317–1325, Oct. 1989.
- **2DInterYH**: H. Yeo and Y. H. Hu, "A novel modular systolic array architecture for full-search block matching motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 5, pp. 407–416, Oct. 1995.
- **2DInterLC**: Y. K. Lai and L. G. Chen, "A data-interlacing architecture with two-dimensional data-reuse for full-search block-matching algorithm," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 2, pp. 124–127, Apr. 1998.
- **2DIntraVS**: L. De Vos and M. Stegherr, "Parameterizable VLSI architectures for the full-search block-matching algorithm," IEEE Trans. Circuits Syst., vol. 36, no. 10, pp. 1309–1316, Oct. 1989.
- **2DIntraKP**: T. Komarek and P. Pirsch, "Array architectures for block matching algorithms," IEEE Trans. Circuits Syst., vol. 36, no. 10, pp. 1301–1308, Oct. 1989.
- **2DIntraHL**: C. H. Hsieh and T. P. Lin, "VLSI architecture for block-matching motion estimation algorithm," IEEE Trans. Circuits Syst. Video Technol., vol. 2, no. 2, pp. 169–175, Jun. 1992.

#### Inter-Parallel Architecture

[Figure]

#### Intra-Parallel Architecture

[Figure]

#### **AB2 Architecture**

[Figure: dependence graph projected on the i,k plane; systolic architecture for N = 3, p = 2]

Number of cycles for a macroblock = $(2p+1) \times (2p+N)$

## Propagation Partial SAD Architecture

[Figure]

### SAD Tree Architecture

[Figure, panels (a) to (c): propagation registers, 4 x 5]

Reconfigurable path: shift downward, leftward, and upward

- A: Shift downward and fetch 4 pixels in each cycle
- B: Shift downward and fetch 5 pixels in each cycle
- C: Shift leftward and do not fetch pixels
- D: Shift upward and fetch 4 pixels in each cycle
- E: Shift upward and fetch 5 pixels in each cycle

#### **VBSME** Version?

[Figure]

## Comparison

| Name | No. of PEs | Operating Cycles (Cycles/Macroblock) | Latency (Cycles) | Data Flow |
|---|---|---|---|---|
| 1DInterYSW [5] | $2P_h$ | $N^2 \times 2P_v + 2P_h$ | $N^2$ | Data Flow I |
| 2DInterYH [7] | $2P_h \times 2P_v$ | $2N^2$ | $N^2$ | Data Flow I |
| 2DInterLC [8] | $2P_h \times 2P_v$ | $2N^2$ | $2N^2$ | Data Flow I |
| 2DIntraVS [9] | $N^2$ | $2P_h \times 2P_h + N \times 2P_v$ | $N \times 2P_v$ | Data Flow III |
| 2DIntraKP [6] | $N^2$ | $2P_v \times (N + 2P_h) + N$ | $3N$ | Data Flow II |
| 2DIntraHL [10] | $N^2$ | $(2P_v + N - 1) \times (2P_h + N - 1)$ | $2N + (N-1) \times (2P_h + N - 2)$ | Data Flow II |
| Propagate Partial SAD | $N^2$ | $2P_h \times 2P_v + N - 1$ | $N$ | Data Flow II |
| SAD Tree | $N^2$ | $2P_h \times 2P_v + N - 1$ | $N$ | Data Flow III |

#### Comparison: Hexagonal Plot

[Figure]
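
To get a feel for the trade-off between PE count and cycle count, the expressions in the comparison table can be evaluated for one configuration. This is only a sketch: the function name and the example values (N = 16, P_h = P_v = 16) are assumptions, and each formula is transcribed as it appears in the table.

```python
def architecture_costs(N: int, Ph: int, Pv: int) -> dict:
    """Return {name: (number of PEs, operating cycles per macroblock)}.

    Formulas are transcribed from the comparison table; the first term of
    2DIntraVS is kept as printed there (2Ph x 2Ph).
    """
    return {
        "1DInterYSW": (2 * Ph, N * N * 2 * Pv + 2 * Ph),
        "2DInterYH": (2 * Ph * 2 * Pv, 2 * N * N),
        "2DInterLC": (2 * Ph * 2 * Pv, 2 * N * N),
        "2DIntraVS": (N * N, 2 * Ph * 2 * Ph + N * 2 * Pv),
        "2DIntraKP": (N * N, 2 * Pv * (N + 2 * Ph) + N),
        "2DIntraHL": (N * N, (2 * Pv + N - 1) * (2 * Ph + N - 1)),
        "Propagate Partial SAD": (N * N, 2 * Ph * 2 * Pv + N - 1),
        "SAD Tree": (N * N, 2 * Ph * 2 * Pv + N - 1),
    }


if __name__ == "__main__":
    costs = architecture_costs(N=16, Ph=16, Pv=16)
    # List architectures from fewest to most cycles per macroblock.
    for name, (pes, cycles) in sorted(costs.items(), key=lambda item: item[1][1]):
        print(f"{name:<22} PEs = {pes:5d}   cycles/MB = {cycles:5d}")
```

For this configuration the inter-parallel 2-D arrays need 1024 PEs and 512 cycles per macroblock, while Propagate Partial SAD and SAD Tree need 256 PEs and 1039 cycles, close to the (2P_h x 2P_v) lower bound for an N x N intra-parallel array.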