

### HLS 101

Lecturer: Hua-Yang Weng

Date: 2022/08/03



Signed in as: 翁華揚

#### **HLS Textbook**

| Why Learn HLS                                | ~   | HLS Textbook                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |  |  |  |  |  |  |  |  |  |
|----------------------------------------------|-----|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|--|--|--|--|--|--|
| Introduction to FPGA<br>Architecture         | ~   | As society embraces digital transformation with intelligent service and automation, the sheer volume of data and computing continues to skyrocket. Moore's law may be soon out of gas; even not, the power will limit its continued growth. So a new approach needs to pick up the gap. Heterogeneous computing is a likely candidate, especially FPGA.                                                                                                                                                                           |  |  |  |  |  |  |  |  |  |
| From Gate to HLS                             | ~   | Many infrastructure providers, such as Amazon, Microsoft, Alibaba, Baidu, are embracing FPGA as a Service (FaaS) to scale their computing                                                                                                                                                                                                                                                                                                                                                                                         |  |  |  |  |  |  |  |  |  |
| HLS Introduction                             | ~   | environment, e.g., Amazon F1 instance, Alibaba F3. FPGA design is traditionally performed by hardware designer The conventional way of job                                                                                                                                                                                                                                                                                                                                                                                        |  |  |  |  |  |  |  |  |  |
| Application Acceleration<br>Development Flow | *   | partitioned among software and hardware designer no longer meet the development cycle. It needs a paradigm shift. That is to have a software designer do end-to-end design from application to a hardware accelerator. From my experience of leading product developm the software engineer using C++ to design accelerator can design as good quality as an experienced hardware engineer in terms of                                                                                                                            |  |  |  |  |  |  |  |  |  |
| IO Interface                                 | ~   | performance and resource used. However, it does take a learning curve. The objective of the course is to empower the software designer to<br>develop an efficient hardware accelerator and develop a system that efficiently integrates application and hardware accelerator.                                                                                                                                                                                                                                                     |  |  |  |  |  |  |  |  |  |
| PIPELINE                                     | ~   | The HLS textbook is to supplement the in-class lecture. Therefore, it contains extensive material that is not possible to cover in class. HLS is                                                                                                                                                                                                                                                                                                                                                                                  |  |  |  |  |  |  |  |  |  |
| Data Flow                                    | ~   | an area that covers an extensive background, from the programming language, compiler, logic design, compiler techniques, computer<br>architecture, system design, and application-domain knowledge. In addition, it is the first time to put together comprehensive material from                                                                                                                                                                                                                                                 |  |  |  |  |  |  |  |  |  |
| Data Type                                    | ~   | industry documents, mainly from FPGA vendor Xilinx published papes. Laboratory and code examples are based on the Xilinx Vitis tool.                                                                                                                                                                                                                                                                                                                                                                                              |  |  |  |  |  |  |  |  |  |
| Memory Architecture                          | ~   | The textbook starts with                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |  |  |  |  |  |  |  |  |  |
|                                              |     | <ul> <li>Chapter 1: An Introduction. It gives a brief overview of the contemporary art of computation—the need for HLS and industry status in adopting HLS.</li> <li>Chapter 2: FPGA architecture. Designers need to know the architecture components (CLB, DSP, BRAM, and Interconnect) in the FPGA to use its resource effectively.</li> <li>Chapter 3: From Gate to HLS. It introduces background on logic design, Verilog language. A last it takes a gcd design to illustrate the abstraction that HLS can offer.</li> </ul> |  |  |  |  |  |  |  |  |  |
| Structure and Hierarchica<br>Design          | · ~ |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |  |  |  |  |  |  |  |  |  |
| Best Practice                                | ~   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |  |  |  |  |  |  |  |  |  |

- See the below textbook for details
  - https://boledu-next-chakra.vercel.app/textbooks/hls-textbook
- Course:
  - NTUEE EEE5060 Application Acceleration with High-Level-Synthesis
  - NTUEE EEE5029 Multimedia System-on-chip Design



## Outline

- Why HLS?
- HLS IP Flow
- Pragma Introduction
- Design Flow
- Labs



## Outline

- Why HLS?
- HLS IP Flow
- Pragma Introduction
- Design Flow
- Labs







- Software Defined Hardware (SDH)
- Computational Efficiency
- Design and Verification Productivity
- Improve Quality of Results (QoR)
- Fast system prototyping
- Fast architecture exploration
  - C++  $\longleftrightarrow$  python v.s. verilog  $\longleftrightarrow$  HLS



- Software Defined Hardware (SDH)
  - Defense Advanced Research Projects Agency
- Computational Efficiency
- Design and Verification Productivity
- Improve Quality of Results (QoR)
- Fast system prototyping
- Fast architecture exploration
  - C++  $\longleftrightarrow$  python v.s. verilog  $\longleftrightarrow$  HLS



### DARPA – Software Defined Hardware (SDH) Program

"In modern warfare, decisions are driven by information. ... The ability to exploit this data to understand and predict the world around us is an asymmetric advantage for the Department of Defense (DoD)."

Goal of SDH:

- Compute efficiency (GOPs/Watt) in SDH system to be at efficiencies within 5X of ASICs and 500-1000X better than CPU implementation.
- The same programmability as current NumPy/Python implementation



Provide application and dataflow reconfigurable software & hardware co-design for optimized performance

SDH Program Concept and Structure



- Software Defined Hardware (SDH)
- Computational Efficiency
- Design and Verification Productivity
- Improve Quality of Results (QoR)
- Fast system prototyping
- Fast architecture exploration
  - C++  $\longleftrightarrow$  python v.s. verilog  $\longleftrightarrow$  HLS



## **Computational Power**



- Highly flexibility
- Low computing efficiency
- General software processing





- Better flexibility
- Average computing efficiency
- Suited to simple logic and SIMD computationally intensive task
- Good flexibiliy
- Better computing efficiency
- Parallel computing, real-time processing, low power consumption
  - Suited to hardware acceleration of specific algorithm



### What area that FPGA is more compute efficient than GPU

- Irregular parallelism
- Customized data types
- Customized datapath, e.g. dataflow
- Efficient memory access semantics (random access, FIFO, stack etc.)

### Why is FPGA not as popular as GPU?



Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

### CD/W: Computational Density (GOPs/s) per Watt



http://cas.ee.ic.ac.uk/people/gac1/DATE2011/Stitt.pdf



BOLEDU



Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

- Software Defined Hardware (SDH)
- Computational Efficiency
- Design and Verification Productivity
- Improve Quality of Results (QoR)
- Fast system prototyping
- Fast architecture exploration
  - C++  $\longleftrightarrow$  python v.s. verilog  $\longleftrightarrow$  HLS



### **Design and Verification Productivity**



http://www.ecs.umass.edu/ece/labs/vlsicad/ece667/reading/hls-survey.pdf





Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

- Software Defined Hardware (SDH)
- Computational Efficiency
- Design and Verification Productivity
- Improve Quality of Results (QoR)
- Fast system prototyping
- Fast architecture exploration
  - C++  $\longleftrightarrow$  python v.s. verilog  $\longleftrightarrow$  HLS



### Improve Quality of Results (QoR)

- Advanced optimization algorithm leverage on continued research for better and intelligent synthesis algorithms
- Allow more design space exploration quickly create many different implementation from one high-level description of the design. e.g. Explore SIMD parallelism with a single parameters.

| Algorithm       | Impl. | FF  | LUT | BRAM | DSP48 | Throughput |
|-----------------|-------|-----|-----|------|-------|------------|
| Sobel Filter    | RTL   | 153 | 202 | 1    | 0     | 2.350 kHz  |
|                 | HLS   | 172 | 252 | 1    | 0     | 2.213 kHz  |
| Gaussian Filter | RTL   | 128 | 174 | 1    | 0     | 2.118 kHz  |
|                 | HLS   | 86  | 152 | 1    | 0     | 2.890 kHz  |
| Morphologic     | RTL   | 80  | 77  | 0    | 0     | 5.571 kHz  |
|                 | HLS   | 119 | 123 | 0    | 0     | 5.261 kHz  |
| Histogram       | RTL   | 176 | 201 | 1    | 0     | 1.758 kHz  |
|                 | HLS   | 141 | 214 | 1    | 0     | 1.819 kHz  |

"A Comparative Study between RTL and HLS for Image Processing Application with FPGAs" https://escholarship.org/uc/item/9vx1s37b



Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

## **Example from Qualcomm**



- The HLS design code space is much smaller at the C-level than at the RTL, making it easier to verify and correct; the 100x faster simulation speeds enable us to detect problems and close coverage magnitudes faster than in RTL
- With the HLS methodology, what is verified in C stays verified in the RTL domain. As a result, most of the bugs are found and corrected in C.
- When HLS/HLV is done, the remaining work in the RTL environment is mostly at the interface level.

Slides from NTUEE EEE5029 Multimedia System-on-chip Design



Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

# Example from NVIDIA (Image Decoder)





Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng



# Example from NVIDIA (Image Decoder)



### QoR - Area & Timing

| Design  | Display m           | odule #1 | Display module #2   |       | Camera module #1 |      | Camera module #2 |       |
|---------|---------------------|----------|---------------------|-------|------------------|------|------------------|-------|
|         | RTL                 | HLS      | RTL                 | HLS   | RTL              | HLS  | RTL              | HLS   |
| Area    | 3434                | 2876     | 8796                | 10960 | 2762             | 2838 | 49390            | 50247 |
| Timing  | 0                   | 0        | -0.36               | -0.33 | 0                | 0    | 0                | 0     |
| Perf    | 3 pixels / 3 cycles |          | 3 pixels / 3 cycles |       | 2 pixels / cycle |      | 2 pixels /cycle  |       |
| Latency | 3 cycles            |          | 3 cycles            |       | unconstrained    |      | unconstrained    |       |
|         |                     |          |                     |       |                  |      |                  |       |



Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

### Industry case - Nvidia

Nvidia Research - Machine Learning Accelerator

*"10X Improvement in RTL design and verification effort compared to manual RTL"* 

- Enable full SoC level performance < 2.6% from RTL in cycle count
- Low Design Effort Spec-to-Tapeout in 6 months with < 10 researchers

Nvidia Xavier 12nFF SoC

- C++ functional verification runtime ~500x less resource than RTL
- Fast verification makes rapid product changes possible
  - VP9/HEVC code from 8 to 10 bit color depth in 2 weeks
  - Change from 20nm/500Mhz to 28/nm/800Mhz in 3 days with HLS



NVResearch Prototype: 36 Chips on Package in TSMC 16nm Technology



https://www.mentor.com/hls-lp/multimedia/player/nvidia-design-and-verification-of-a-machine-learning-accelerator-socusing-an-object-oriented-hls-based-design-flow-2cea13e3-93cf-4539-bac6-01f75c263fc1



Hua-Yang Weng

### Industry Case – Google Designs VP9 CODEC in Half the Time

- Time to Verified RTL: 2x faster
  - Built in under 6 monts v.s. 1 year for RTL
  - 69k lines of C++ v.s. 1.2 millon lines of Verilog
- Simulation Speed: 500x faster
  - RTL simulation: 70 servers and 2 days
  - C simulation: 3 servers in 2 hours
- > 99% bugs caught in C simulation
- Benefits from the view of Google
  - 90% less code, less bug
  - Flexibility SW-like process, late-stage algorithm changes
  - Rapid HW prototyping rapidly evaluate new idea, algorithms



#### EDU

#### https://go.mentor.com/4uNV1





Hua-Yang Weng

## **Application Specific**

### Example of Oil, Gas workload

### Productivity

### Not the traditional programming model for FPGAs:

- One Software Engineer, no previous O&G experience, one month to describe & implement entire RTM Algo in C++
- No optimized library calls, completely described in C++
- < 500 lines of code, < 50 Pragmas</p>
- Standard language, open source tools and libraries



#### Hua-Yang Weng

#### Seismic Method for Oil and Gas industry

- Seismic Imaging Technology
- Seismic Survey: Acoustic wave sampling
- Seismic Imaging: Mathematically process the wave traces to create an image

#### RTM (Reverse Time Migration)

- High-fidelity algorithm for imaging complex sub-surface structures
- Cross-correlation between source wavefield and receiver wavefield
- Wavefield reconstruction by saved boundaries





- Software Defined Hardware (SDH)
- Computational Efficiency
- Design and Verification Productivity
- Improve Quality of Results (QoR)
- Fast system prototyping: ESL
- Fast architecture exploration
  - C++  $\longleftrightarrow$  python v.s. verilog  $\longleftrightarrow$  HLS



## Typical SOC design flow

 Overlap in specification/architecture phase and RTL-design phase; multiple design changes
 Architecture design done informally

SW development starting late in the project



Multimedia SoC Design

Graduate Institute of Electronics Engineering

Media IC and System Lab

National Taiwan University

Shao-Yi Chien

Hua-Yang Weng

Slides from NTUEE EEE5029 Multimedia System-on-chip Design

27

24





Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

## Emerging SoC Design Flow (2/2)

#### ESL : Electronic System Level Design



Source: Synopsys, Inc. Multimedia SoC Design

radonal raman onitolog

Shao-Yi Chien

## **ESL: New SOC Design Flow**

### Architecture closure

- Achieve a reduction # of RTL iterations
- Can perform concurrent HW and SW design
- Shorten the time it takes to get to golden RTL



#### Multimedia SoC Design

#### Shao-Yi Chien

46



Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

## **SOC** Design Flows

mories

DSP

Sub-System

Typical Flow: Step 1 and 2 performed on RTL model

idges

Architecture Closure

– RTL Closure

Micro

Sub-System

cess

Analog



### New Flow: Step 1 on transaction level, step 2 on RTL model

Application

Logic

Peripherals



#### Multimedia SoC Design



Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

#### Shao-Yi Chien

Hua-Yang Weng

Slides from NTUEE EEE5029 Multimedia System-on-chip Design

49





Multimedia SoC Design

Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University SystemC System Verilog Shao-Yi Chien Ptolemy Matlab

75

Hua-Yang Weng

## SystemC

Not a new language

- A special class library
- Based on C++
  - Includes all the advantages/disadvantages of C++
- Good reference implementation
- C++ compatibility supports SW compatibility
- Only limited path to implementation
- TLM methodology and experience exists
- Oriented towards HDS verification, architecture exploration, and fast higher level simulation



## **High Level Synthesis Tools**

- Mentor Graphics→Calypt: Catapult C (Acquiqred by Calypto)→ Mentor Graphics Catapult C
  - Forte Design System: Cynthesizer (Acquired by Cadence)
  - Synopsys: Synphony C compiler
  - Cadence: C2Silicon Startus HLS
  - ChipVision: PowerOpt?
  - Xilinx: Vivado Vitis-HLS, Vitis
  - NEC CyberWorkBench

Multimedia SoC Design

Media IC and System Lab Graduate Institute of Electronics Engineering Shao-Yi Chien

106

- Software Defined Hardware (SDH)
- Computational Efficiency
- Design and Verification Productivity
- Improve Quality of Results (QoR)
- Fast system prototyping: ESL
- Fast architecture exploration
  - C++  $\longleftrightarrow$  python v.s. verilog  $\longleftrightarrow$  HLS







## Outline

### • Why HLS?

### HLS IP Flow

- Pragma Introduction
- Design Flow
- Labs



## **HLS IP Flow**

0. Coding in C++

### 1. C-Simulation (SW-Emulation)

• Check the C source code evaluation with the golden (Similar to SystemC)

### 2. C-Synthesis

Perform C -> RTL synthesis

### 3. Co-Simulation (Hardware-Emulation)

- Using standard RTL verification tools
- 4. Generate bitstream (FPGA)
  - RTL to Gate-level synthesis + P&R for IC flow



Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

https://docs.xilinx.com/v/u/en-US/ug871-vivado-high-level-synthesis-tutorial

See UG871 for details:

## 0. Coding in C++

Coding in C++ rather than tedious RTL level

### Benefits:

- No sequential logic bugs
- Unified coding language
  - Design: C++
  - Verification: C++
  - Application: C++

### Disadvantages:

- Stiff learning curve
- RTL is still the mainstream in Digital IC Design Flow
  - FAE, customer, ....



### Illustration – GCD

Euclidean Algorithm

Simplified Euclidean GCD Algorithm

gcd(a,b) = gcd(b,(a-b))

 $= \gcd(a, (b-a))$ 

gcd(a,b) = gcd(b,r)where, a = qb + r





end Y = B; endmodule

- RTL synthesis tool only copies the circuit for the while/for loop
- But the # of loop could not be determined at compiling time
- The circuit could not be synthesized
- It needs a structure implementation



Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

Slides from NTUEE EEE5060 Application Acceleration with High-Level-Synthesis

### Illustration – GCD (RTL)





module gcd\_fsm( input clock, reset, go, input AGB, ALB, output A en, B en, A mux sel, B mux sel, out mux sel, ouput done ); reg running = 0; always @( posedge clock) begin if(go) running <= 1; else if (done) running <= 0; end reg [5:0] ctrl sig; assign { A\_en, B\_en, A\_mux\_sel, B\_mux\_sel, done } always @(\*) begin if(!running) ctrl\_sig = 5'b11\_00\_0; else if( AGB ) ctrl\_sig = 5'b10\_1x\_0; else if(ALB) ctrl sig = 5'b11 11 0; ctrl sig = 5'b00 xx 1; else end endmodule

```
// Datapath Logic
wire [width-1:0] out = (out_mux_sel) ? B: A-B;;
wire [width-1:0] A_next = ( A_mux_sel ) ? out : A_in;
wire [width-1:0] B_next = ( B_mux_sel ) ? A : B_in;
```

// Generate output control signals
wire AGB = ( A > B);
wire ALB = (A < B);</pre>

```
// edge-triggered flip-flop
always @( posedge clock) begin
if( A_en ) A <= A_next;
if (B_en) B <= B_next;
end
endmodule
```



Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University Slides from NTUEE EEE5060 Application Acceleration with High-Level-Synthesis

### A Glimpse of High-Level-Synthesis

- HLS build synchronous design
  - No timing -> no clock, reset
  - No port width imply by data type
  - Port direction lhs, rhs
    - Input: only read, "pass by value"
    - Ouptut: function return, a reference, or a pointer
    - Inout: a reference or a pointer
- Loop:
  - Automatic control/datapath synthesis







#### Mapping of Key Attributes of C Code





Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

Slides from NTUEE EEE5060 Application Acceleration with High-Level-Synthesis

### **Function Hierarchy**

- Top-level function becomes the top level of the RTL
- Sub-functions are synthesized into blocks in the RTL design
- Inlined to dissolve the hierarchy
  - Provide greater optimization opportunity





Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

### **Function Arguments**

- Function arguments mapped to ports on the RTL blocks
- Additional control ports are added to the design for control/synchronization among blocks
- Input/output (I/O) protocols
  - Allow automatically synchronize data exchange among blocks







Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

Slides from NTUEE EEE5060 Application Acceleration with High-Level-Synthesis

### How top module connects to system





### Arrays

- Typically implemented by a memory block
  - Read & write array mapped to RAM
  - Constant array mapped to ROM
- By using directives
  - An array can be partitioned and map to multiple RAMs (ARRAY\_PARTITION)
  - Multiple arrays can be merged and mapped to one RAM (ARRAY\_RESHAPE)
  - A array can be partitioned into individual elements and map to registers





Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

Slides from NTUEE EEE5060 Application Acceleration with High-Level-Synthesis

### Expressions – Data Flow Graph

- Expression is translated to datapath and its control path (FSM)
- Start by analyzing the data dependencies between the various steps in the expression shown above. This analysis leads to a Data Flow Graph (DFG)







Control Flow: Loop

- Loops are the main area of parallelism in an algorithm
- Loops can be
  - pipelined,
  - Unrolled, Partially unrolled,
  - Merged
  - Flattened
- HLS generates the datapath and control logic



Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

#### Control Flow – Rolled

- By default, loops are rolled
  - Each loop iteration corresponds to a "sequence" of states (DAG)
  - The state sequence will be repeated multiple times based on the loop trip count.
  - The resource (adder) is repeatedly used in the loop iteration.
  - Efficient use the resource, but longer latency







Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

Slides from NTUEE EEE5060 Application Acceleration with High-Level-Synthesis

### Loop - Unroll

- Rolled loops can be made unrolled or partially unrolled by #pragma UNROLL [factor = n]
- Pros
  - Decrease loop overhead
  - Increase parallelism for scheduling
- Cons
  - Increase operator count, negatively impact area, power and timing



Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University Loop - Pipeline

- One of the most important optimization
- Allow a new iteration to begin before the previous iteration is complete
- Key matric: Initiation Interval (II)





Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

Slides from NTUEE EEE5060 Application Acceleration with High-Level-Synthesis

# 1. C-Simulation (SW-Emulation)

Verify your C/C++ code with the golden

- Since the hardware is designed in C++, the testbench is also C++.
- It is just like Freshman C/C++ course.



### Non-Synthesizable Code for Testbench

- The great power of HLS is the C-simulation and RTL co-simulation testbench are the same.
  - Similar to systemC ESL validation
- Use <u>SYNTHESIS</u> for testbench code



# 2. C-Synthesis

- Analysis the C/C++ code and transform to RTL code
- Tools:
  - FPGA: Vivado-HLS (deprecated)  $\rightarrow$  Vitis-HLS
  - IC: Stratus-HLS, ... etc.
- Tools guaranteed the logic.
- Pragmas are needed to control its behavior
  - UNROLL factor=2
  - PIPELINE II=1



• . . .

Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

#### Resource Allocation, Scheduling, Binding

 Resource allocation: Each operation is mapped to a hardware resource, annotated with both timing and area information

#pragma HLS allocation operation instance = add limit = 1

- Scheduling: decide which clock cycle to perform what operations
- **Binding**: mapped to the hardware resource.

#pragma HLS bind\_op variable=<variable> op=<type>
impl=<value> latency=<int>





Example - Expression Datapath Resource allocation & Scheduling







delay=6 using 1 MAC



delay=4 using 1 adder and 1 multiplier



Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

Slides from NTUEE EEE5060 Application Acceleration with High-Level-Synthesis

### Example – Expression Datapath Binding & Resource Sharing





### Example – Control Flow





Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

Slides from NTUEE EEE5060 Application Acceleration with High-Level-Synthesis HLS extra the control logic in the form of a finite-state-machine, in which each of state C0, C1, C2, C3 perform the following tasks

- C0: perform (b+c) and loop initiation. The result is latched at the end of the C0 state
- C1: Generate in memory access control signal, including in\_addr, in\_ce
- C2: wait for the RAM return in[i] data
- C3: Perform the multiplication of x\*a and addition. Generate the out RAM control signals, out\_addr, out\_ce, out\_we

The full sequence of states are: C0, {C1, C2, C3}, {C1, C2, C3}, {C1, C2, C3}, and return to C0

If directive "#pragma PIPELINE" is specified, HLS generates a pipelined datapath for the operations in the loop and its corresponding loop controller logic.





Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

Slides from NTUEE EEE5060 Application Acceleration with High-Level-Synthesis

### Example – Control Flow (Vivado HLS)

#### int controlflow(int a[N]) { int i, acc; acc = 0;for(i = 0; i < N; i++) {</pre> acc += a[i]; } return acc; }

#### Scheduling

Resource

- Expression Variable Name

acc\_fu\_75\_p2

icmp\_In7\_fu\_58\_p2

LUT

9

21

9

39

i\_fu\_64\_p2

Multiplexer

Name

acc\_0\_reg\_46

ap\_NS\_fsm

i\_0\_reg\_35

Total

Total

| Operation\Control Step |
|------------------------|
| × Loop 1               |
| i_0(phi_mux)           |
| acc_0(phi_mux)         |
| icmp_In7(icmp)         |
| i(+)                   |
| a_load(read)           |
| acc(+)                 |
|                        |



Bitwidth P0

32

4

4

40

Bitwidth P1

32

1

4

37



- Memory address is the ٠ same as "variable i"
- But, address of array "a" • is different?



DSP48E

Bits

1

4

37

0 0

0

0 0

0 0

Operation

+

+

3

2 32

4

2

8

icmp

Input Size

FF LUT

0

Total Bits

64

4

8

76

39

13

61

9

Slides from NTUEE EEE5060 Application Acceleration with High-Level-Synthesis

#### Interface

| RTL Ports  | Dir | Bits | Protocol   | Source Object | C Type       |
|------------|-----|------|------------|---------------|--------------|
| ap_clk     | in  | 1    | ap_ctrl_hs | controlflow   | return value |
| ap_rst     | in  | 1    | ap_ctrl_hs | controlflow   | return value |
| ap_start   | in  | 1    | ap_ctrl_hs | controlflow   | return value |
| ap_done    | out | 1    | ap_ctrl_hs | controlflow   | return value |
| ap_idle    | out | 1    | ap_ctrl_hs | controlflow   | return value |
| ap_ready   | out | 1    | ap_ctrl_hs | controlflow   | return value |
| ap_return  | out | 32   | ap_ctrl_hs | controlflow   | return value |
| a_address0 | out | 4    | ap_memory  | a             | array        |
| a_ce0      | out | 1    | ap_memory  | a             | array        |
| a d0       | in  | 32   | ap memory  | а             | arrav        |

### 3. Co-Simulation (Hardware-Emulation)

- Using standard RTL verification tools
- Waveform viewers
- System-level considerations
  - E.g. FIFOs, deadlocks, ... etc.





Hua-Yang Weng

# 4.Generate bitstream (FPGA)

Tools: Vivado

- Automation in FPGA tools without clicks
- Configure the synthesis and placement via FPGA .tcl
- This step takes around 1~2 hr

#### Post layout verification → Run on FPGA



# **Application Timeline (1/2)**









Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

# Outline

- Why HLS?
- HLS IP Flow
- Pragma Introduction
- Design Flow
- Labs



# **#Pragma Introduction**

 Special purpose directive for turning on or off some compiler-specific features.

}

#pragma omp parallel for

- Example: OpenMP
  - Multi-thread programming
- for (k=0; k<pixel3DTiles[j].size() ; k++) {
   (\*pointDataVec)[pointCnt].x = point3DTiles[j][k](0);
   (\*pointDataVec)[pointCnt].y = point3DTiles[j][k](1);
   (\*pointDataVec)[pointCnt].z = point3DTiles[j][k](2);
   (\*pointDataVec)[pointCnt].pixel = pixel3DTiles[j][k];
   ++pointCnt;
   ++tilePointCnt;</pre>

- Example: HLS
  - Unroll, pipeline, .....

```
for (int i = 0; i < 9; i++) {
    #pragma HLS UNROLL
    R[i] = 0;</pre>
```

```
for (int i=0; i<frameDataNum; i++) {
    #pragma HLS LOOP_TRIPCOUNT min=1 max=TRIPCOUNT
    #pragma HLS PIPELINE
    frameStreamOut.write(frameIn[i]);
}</pre>
```



}

# **#Pragma Introduction**

- Interface Synthesis pragma HLS interface
- Task-level Pipeline pragma HLS dataflow, pragma HLS stream
- Pipeline pragma HLS pipeline
- Loop Unrolling pragma HLS unroll, pragma HLS dependence
- Array Optimization pragma HLS array\_partition, pragma HLS array\_reshape
- Resource Optimization pragma HLS allocation, pragma HLS function\_instantiate
- Others
  - https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/HLS-Pragmas



# Outline

- Why HLS?
- HLS IP Flow
- Pragma Introduction
- Design Flow
- Labs



# **Design Flow**

#### 1. Platform select

- Data center flow
- Embedded system flow
- 2. Develop software algorithm
- 3. Software profile
- 4. Set Acceleration Goal
- 5. Applicability of the Hardware
- 6. Hardware Architecture Plan
- 7. HLS coding



### **1. Platform select**



# 2. Develop Software Algorithm

- C++ is a better choice for HLS development flow
- Python or other language is okay, but need to translate to C++ for HLS hardware synthesis
  - Rewrite the code in C/C++
  - Cython or other transforms may/may-not help
- Pure C++ code is the simplest case
  - If function calls deeper API, then need to ensure the code in API is synthesizable

#### • Other examples: FINN (Python)



Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

# **3. Software profile: Identify the function to be accelerated**

| Number of subsystems | Background<br>Rendering | GUI  | Pose<br>Estimation | Pose<br>Refinement | Model<br>Rendering | Swap<br>Window | Total<br>Time | Avg GN<br>Iter # | Avg LS<br>Iter # |
|----------------------|-------------------------|------|--------------------|--------------------|--------------------|----------------|---------------|------------------|------------------|
| 4                    | 2.75                    | 0.53 | 6.51               | 10.43              | 0.31               | 7.60           | 28.51         | 3.16             | 0.17             |

- You can use timers such as std::chrono
  - https://en.cppreference.com/w/cpp/chrono
- The platform and the underlying computational cores matters a lot
  - High-end CPU, low-end CPU, MCU, GPU, .....



# 4. Set Acceleration Goal

### Set your goal

- What is the assumption of this goal, under what scenario?
- Example:

#### Acceleration Goals: Frame latency < 2.5 ms

- Assuming surgeon head motion  $\rightarrow$  20 deg/sec
- Assuming 4 subsystems  $\rightarrow$  1 surgical target + 3 surgical instruments
- Assuming pipelined sense-compute-render-display system
- Assuming bottleneck of the pipeline bounded by compute core
- Current software application latency  $\rightarrow$  10 ms



# 4. Set Acceleration Goal

| Number of subsystems | Background<br>Rendering | GUI  | Pose<br>Estimation | Pose<br>Refinement | Model<br>Rendering | Swap<br>Window | Total<br>Time | Avg GN<br>Iter # | Avg LS<br>Iter # |
|----------------------|-------------------------|------|--------------------|--------------------|--------------------|----------------|---------------|------------------|------------------|
| 4                    | 2.75                    | 0.53 | 6.51               | 10.43              | 0.31               | 7.60           | 28.51         | 3.16             | 0.17             |

- Is this goal competitive?
  - x4 times faster, not too impressive (Impressive -> 2 ~ 3 orders)
- Achievable ?

#### Roughly estimate the cycles needed

• What's the time complexity of the function? How much degree of parallelism can be achieved to reach this goal? (Resource enough?)

#### Determine if it is PCIe-bound (Data center flow)

- 800x800x3 + 10000 x (1 + 3 x 4) byte @ 33us = 57.84 GB/s
- U50 PCIe bandwidth (Host -> PCIe -> FPGA) Read(Write): 11GB/s -> PCIe-bounded



# 5. Applicability of the Hardware

- How general is your hardware?
  - ASIC-like?
  - DSP-like?
- Example (ASIC-like)
- Applicable to most **D**irect **D**ense **P**hotometric **R**efinement problems (**DDPR**)
  - Not restricted to planar or marker objects  $\rightarrow$  General 3D rigid objects
  - Suits the front-end refinement of Visual Odometry (VO) if depth provided
- Example (DSP-like)
  - Custom ISA + common processing units



# 6. Hardware Architecture Plan

- What is the possible compute architecture? dataflow?
- E.g. Systolic array, dedicated dataflow?





Media IC and System Lab Graduate Institute of Electronics Engineering National Taiwan University

Hua-Yang Weng

# **Design Flow**

#### 1. Platform select

- Data center flow
- Embedded system flow
- 2. Develop software algorithm
- 3. Software profile
- 4. Set Acceleration Goal
- 5. Applicability of the Hardware
- 6. Hardware Architecture Plan
- 7. HLS coding



# **HLS Code Prerequisites**

- No unsupported data type (Due to dynamic allocating)
  - Std::vector, new (malloc), \*pointers,
- No unsupported relative high-level-functions
  - E.g. Eigen, \*OpenCV, Open3D, ..... Most of the libraries.

# (※ Prevent RTL-like or hardware unfriendly coding style during design)



# **Original Host Code (1/3)**

#### 1. Not supported data type (Due to dynamic allocating)

(Eigen -> C++ template library for linear algebra, equivalent to numpy in python )

```
void DodecaSystemTracker::denseAlignment(const cv::Mat& normalizedImage,
    const Region& validRegion,
    const dst::Vec& pixel3D,
    const dst::MatX3& point3D,
    int methodSelection.
    double epsilonRot,
                                                                       {
    double epsilonTra,
    int maxIter,
    dst::Mat3* R,
    dst::Vec3* t)
    dst::Vec6 p;
    p << Transformation::FromRotationMatirxToAxisAngle(*R), * t;</pre>
    dst::Vec6 deltaP = dst::Vec6::Constant(DBL MAX);
    int iter = 0;
    int num = pixel3D.size();
    dst::Mat34 Rt;
    Rt << *R, * t;
    dst::Vec warpedI(num);
    dst::Vec warpedIu(num);
    dst::Vec warpedIv(num);
```

Graduate Institute of Electronics Engineering National Taiwan University

### Need to rewrite in array or pointer!

```
namespace dst
{
   typedef unsigned char byte;
   typedef Eigen::VectorXd Vec;
   typedef Eigen::MatrixXd Mat;
   typedef Eigen::Vector2d Vec2;
   typedef Eigen::Vector3d Vec3;
   typedef Eigen::Vector4d Vec4;
   typedef Eigen::Matrix<double, 6, 1> Vec6;
   typedef Eigen::RowVector2d RVec2;
   typedef Eigen::RowVector3d RVec3;
   typedef Eigen::RowVector4d RVec4;
   typedef Eigen::Natrix<double, 1, 6> RVec6;
```

typedef Eigen::Matrix<double, 1, 7> RVec7;

# Original Host Code (2/3)

2. Not supported relative high-level-functions

```
(Eigen class array-wise operation)
Need to rewrite the detail
implementations!
// --- Step 1: Compute Jfa ---
dst::MatX3 point2D = (point3D * R->transpose()).rowwise() + t->transpose()) * _inMat.transpose();
point2D.leftCols(2).array().colwise() *= 1. / point2D.col(2).array();
```

// --- Step 2: Compute H --dst::Mat6 H = J.transpose() \* J; // [6, 6] = [6, N] @ [N, 6]

// --- Step 3: Compute delta p --dst::Vec E = pixel3D - warpedI; // [N, ]
dst::Vec6 JtE = J.transpose() \* E; // [6, 1]
deltaP = H.inverse() \* JtE; // [6, 1]



# **Original Host Code (3/3)**

# (X Prevent RTL-like or hardware unfriendly coding style during design)

```
double rxcl1 14 = rx * cl1 14;
      double rycl1 l4 = ry * cl1 l4;
      double rzcl1_l4 = rz * cl1_l4;
      (*JRr)(0, 0) = -(sl * ry2rz2 * rx_l3) - (2 * ry2rz2 * rxcl1_l4);
      (*JRr)(0, 1) = (2 * cl1 * ry l2) - (sl * ry2rz2 * ry l3)
          - (2 * ry2rz2 * rycl1 l4);
      (*JRr)(0, 2) = (2 * cl1 * rz l2) - (sl * ry2rz2 * rz l3)
          - (2 * ry2rz2 * rzcl1 l4);
      (*JRr)(1, 0) = (rzsl * rx 13) - (rzcl * rx 12) - (cl1 * ry 12)
          + (rx * rysl * rx 13) + (2 * rx * ry * rxcl1 14);
      (*JRr)(1, 1) = (rzsl * ry_13) - (rzcl * ry_12) - (cl1 * rx_12)
          + (rx * rysl * ry 13) + (2 * rx * ry * rycl1 14);
      (*JRr)(1, 2) = (rzsl * rz 13) - (rzcl * rz 12) - sl 1
          + (rx * rysl * rz_l3) + (2 * rx * ry * rzcl1_l4);
      (*JRr)(2, 0) = (rycl * rx_12) - (cl1 * rz_12) - (rysl * rx_13)
          + (rx * rzsl * rx 13) + (2 * rx * rz * rxcl1 14);
      (*JRr)(2, 1) = sl l + (rycl * ry l2) - (rysl * ry l3)
          + (rx * rzsl * ry l3) + (2 * rx * rz * rycl1 l4);
      (*JRr)(2, 2) = (rycl * rz 12) - (cl1 * rx 12) - (rysl * rz 13)
          + (rx * rzsl * rz l3) + (2 * rx * rz * rzcl1 l4);
      (*JRr)(3, 0) = (rzcl * rx_l2) - (cl1 * ry_l2) - (rzsl * rx_l3)
          + (rx * rysl * rx 13) + (2 * rx * ry * rxcl1 14);
      (*JRr)(3, 1) = (rzcl * ry_{12}) - (cl1 * rx_{12}) - (rzsl * ry_{13})
          + (rx * rysl * ry_l3) + (2 * rx * ry * rycl1_l4);
Media IC and System Lab
                                                    Hua-Yang Weng
Graduate Institute of Electronics Engineering
National Taiwan University
```

### Use for loop instead to gain the benefit of HLS



# Then the 4 steps...

- C-Simulation
- C-Synthesis
- Co-Simulation
- Generate Bitstream



# Outline

- Why HLS?
- HLS IP Flow
- Pragma Introduction
- Design Flow
- Labs



# Lab Introduction

- Design files from NTUEE EEE5060 Application Acceleration with High-Level-Synthesis
- Lab1, Lab2: Embedded system flow
  - Vitis-hls, Vivado
  - MPSoC FPGAs: Pynq, Ultrascale+
- Lab3: Data center flow
  - Vitis
  - Data Center FPGAs: Alveo

