



Shao-Yi Chien



# Outline

- Processor technology
- Basic architecture
- Operation
- Programmer's view
- Development environment
- CPU power consumption
- Application-specific instruction-set processors (ASIP)
- Co-processor
- Selecting a microprocessor
- Other trends of processor design



### **Processor Technology**

- The architecture of the computation engine used to implement a system's desired functionality
- Processor does not have to be programmable
  - "Processor" not equal to general-purpose processor



Multimedia SoC Design




### **Processor Technology**

Processors vary in their customization for the problem at hand



Desired functionality:

    total = 0
    for i = 1 to N loop
        total += M[i]
    end loop

Figure: the same functionality implemented on a general-purpose processor, an application-specific processor, and a single-purpose processor.
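For comparison, the desired functionality can be written as a program for a general-purpose processor. This is only a sketch: the function name `sum_array` is an assumption for this example, and C arrays are indexed from 0 rather than from 1 as in the pseudocode.

```c
#include <stddef.h>

/* Sketch of the "desired functionality": total = M[0] + ... + M[n-1].
   The name sum_array is hypothetical, chosen for this example. */
long sum_array(const int *m, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += m[i];
    }
    return total;
}
```

A single-purpose processor would implement the same loop directly as a datapath and controller instead of as instructions in program memory.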



### **General-Purpose Processors**

- Programmable device used in a variety of applications
  - Also known as "microprocessor"
- Features
  - Program memory
  - General datapath with large register file and general ALU
- User benefits
  - Low time-to-market and NRE costs
  - High flexibility
- Intel CPUs are the best known, but there are hundreds of others





### **Single-Purpose Processors**

- Digital circuit designed to execute exactly one program
  - a.k.a. coprocessor, accelerator or peripheral
- Features
  - Contains only the components needed to execute a single program
  - No program memory
- Benefits
  - Fast
  - Low power
  - Small size





### **Application-Specific Processors**

- Programmable processor optimized for a particular class of applications having common characteristics
  - Compromise between general-purpose and single-purpose processors

- Features
  - Program memory
  - Optimized datapath
  - Special functional units
- Benefits
  - Some flexibility, good performance, size and power





# Why General-Purpose Processors in SoCs?

- Using microprocessors is a very efficient way to implement digital systems
- Microprocessors make it easier to design families of products that can be built to provide various feature sets at different price points and can be extended to provide new features to keep up with rapidly changing markets



### **Basic Architecture**

- Control unit and datapath
- Key differences from single-purpose processors
  - Datapath is general
  - Control unit doesn't store the algorithm – the algorithm is "programmed" into the memory





# **Datapath Operations**

- Load
  - Read memory location into register
- ALU operation
  - Feed selected registers through the ALU, store the result back in a register
- Store
  - Write register to memory location
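The three datapath operations can be sketched in C with a small model of a register file and a data memory. The type `datapath_t`, the function names, and the sizes (8 registers, 256 bytes) are assumptions made for this illustration and do not describe any particular processor.

```c
#include <stdint.h>

/* Hypothetical model: 8 registers and 256 bytes of data memory. */
typedef struct {
    uint8_t reg[8];
    uint8_t mem[256];
} datapath_t;

/* Load: read a memory location into a register. */
void dp_load(datapath_t *dp, unsigned r, uint8_t addr) {
    dp->reg[r & 7] = dp->mem[addr];
}

/* ALU operation: route two registers through the ALU (an add here)
   and store the result back in a register. */
void dp_alu_add(datapath_t *dp, unsigned rd, unsigned ra, unsigned rb) {
    dp->reg[rd & 7] = (uint8_t)(dp->reg[ra & 7] + dp->reg[rb & 7]);
}

/* Store: write a register to a memory location. */
void dp_store(datapath_t *dp, unsigned r, uint8_t addr) {
    dp->mem[addr] = dp->reg[r & 7];
}
```

Adding two memory values then takes two loads, one ALU operation and one store, which is the pattern a compiler generates for a statement such as `c = a + b`.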





# **Control Unit**

- Control unit: configures the datapath operations
  - Sequence of desired operations ("instructions") stored in memory – "program"
- Instruction cycle broken into several sub-operations, each one clock cycle, e.g.:
  - Fetch
  - Decode
  - Fetch operands
  - Execute
  - Store results





- Fetch
  - Get next instruction into IR
  - PC: program counter, always points to next instruction
  - IR: holds the fetched instruction
- Decode
  - Determine what the instruction means
- Fetch operands
  - Move data from memory to datapath register
- Execute
  - Move data through the ALU
  - (This example instruction does nothing during this sub-operation)
- Store results
  - Write data from register to memory
  - (This example instruction does nothing during this sub-operation)





# Pipelining: Increasing Instruction Throughput






# Superscalar and VLIW Architectures

Performance can be improved by:

- Faster clock (but there's a limit)
- Pipelining: slice up instruction into many stages
- Multiple ALUs to support more than one instruction stream – superscalar
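The throughput gain from pipelining can be estimated with a simple cycle count. The sketch below assumes an ideal pipeline: every stage takes one clock cycle and there are no stalls or hazards. The function names are hypothetical.

```c
/* Cycles to run n instructions when each one passes through all k
   one-cycle stages before the next instruction may start. */
unsigned long cycles_unpipelined(unsigned long n, unsigned long k) {
    return n * k;
}

/* Ideal k-stage pipeline: the first instruction needs k cycles, and each
   following instruction completes one cycle later (no stalls assumed). */
unsigned long cycles_pipelined(unsigned long n, unsigned long k) {
    return (n == 0) ? 0 : k + (n - 1);
}
```

With 8 instructions and 5 stages, this model gives 40 cycles without pipelining and 12 cycles with it. Real pipelines fall short of this bound because of branches and data dependences.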



### Superscalar and VLIW Architectures

- Superscalar
  - Scalar: non-vector operations
  - Fetches instructions in batches, executes as many as possible
    - May require extensive hardware to detect independent instructions (dynamic)
- VLIW (Very Long Instruction Word): each word in memory has multiple independent instructions (static)
  - Relies on the compiler to detect and schedule instructions
  - Currently growing in popularity



### **Two Memory Architectures**



### **Cache Memory**

- Memory access may be slow
- Cache is small but fast memory close to processor
  - Holds copy of part of memory
  - Hits and misses
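Hits and misses can be illustrated with a toy direct-mapped cache. This is a sketch under stated assumptions: the cache has 4 one-word lines, addresses are word addresses, and the names `cache_t`, `cache_access` and `average_access_time` are invented for the example. The last function is the usual average access time estimate, hit time plus miss rate times miss penalty.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 4u  /* hypothetical cache size: 4 one-word lines */

typedef struct {
    bool     valid[NUM_LINES];
    uint32_t tag[NUM_LINES];
    unsigned hits, misses;
} cache_t;

/* Look up a word address. Returns true on a hit; on a miss the line is
   filled, which evicts whatever was stored at that index before. */
bool cache_access(cache_t *c, uint32_t addr) {
    uint32_t index = addr % NUM_LINES;
    uint32_t tag   = addr / NUM_LINES;
    if (c->valid[index] && c->tag[index] == tag) {
        c->hits++;
        return true;
    }
    c->valid[index] = true;
    c->tag[index]   = tag;
    c->misses++;
    return false;
}

/* Average access time in the same unit as hit_time and miss_penalty. */
double average_access_time(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```

Addresses 0 and 4 map to the same line in this model, so alternating between them misses every time even though the cache is mostly empty.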







### Programmer's View

- Programmer doesn't need a detailed understanding of the architecture
  - Instead, needs to know what instructions can be executed
- Two levels of instructions:
  - Assembly level
  - Structured languages (C, C++, Java, etc.)
- Most development today is done using structured languages
  - But, some assembly level programming may still be necessary
  - Drivers: portion of program that communicates with and/or controls (drives) another device



### **Assembly-Level Instructions**

| Instruction   | Opcode | Operand 1 | Operand 2 |
|---------------|--------|-----------|-----------|
| Instruction 1 | opcode | operand1  | operand2  |
| Instruction 2 | opcode | operand1  | operand2  |
| Instruction 3 | opcode | operand1  | operand2  |
| Instruction 4 | opcode | operand1  | operand2  |
| ...           |        |           |           |

### Instruction Set

Defines the legal set of instructions for that processor

- Data transfer: memory/register, register/register, I/O, etc.
- Arithmetic/logical: move register through ALU and back
- Branches: determine next PC value when not just PC+1




## A Simple (Trivial) Instruction Set






### Addressing Modes

| Addressing mode   | Operand field    | Register-file contents | Memory contents       |
|-------------------|------------------|------------------------|-----------------------|
| Immediate         | Data             |                        |                       |
| Register-direct   | Register address | Data                   |                       |
| Register-indirect | Register address | Memory address         | Data                  |
| Direct            | Memory address   |                        | Data                  |
| Indirect          | Memory address   |                        | Memory address → Data |
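The addressing modes in the table can be read as different ways of turning the operand field into data. The following C sketch shows one possible interpretation for an 8-bit machine with 16 registers and 256 bytes of memory; the enum, the function name `fetch_operand` and the sizes are assumptions for this example.

```c
#include <stdint.h>

typedef enum {
    IMMEDIATE, REGISTER_DIRECT, REGISTER_INDIRECT, DIRECT, INDIRECT
} addr_mode_t;

/* Return the data selected by the operand field under the given mode. */
uint8_t fetch_operand(addr_mode_t mode, uint8_t field,
                      const uint8_t reg[16], const uint8_t mem[256]) {
    switch (mode) {
        case IMMEDIATE:         return field;                  /* field is the data     */
        case REGISTER_DIRECT:   return reg[field & 0x0f];      /* register holds data   */
        case REGISTER_INDIRECT: return mem[reg[field & 0x0f]]; /* register holds address */
        case DIRECT:            return mem[field];             /* field is the address  */
        case INDIRECT:          return mem[mem[field]];        /* memory holds address  */
    }
    return 0; /* not reached for valid modes */
}
```

Each extra level of indirection costs one more register-file or memory access, which is why immediate and register-direct operands are the cheapest.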




### **Programmer Considerations**

- Program and data memory space
  - Embedded processors often have very limited memory
    - e.g., 64 Kbytes program, 256 bytes of RAM (expandable)
- Registers: How many are there?
  - Only a direct concern for assembly-level programmers
  - Be aware of special-function registers



### **Programmer Considerations**



- I/O: two ways to communicate with external devices
  - I/O instructions for the parallel I/O ports of the CPU
    - Such as Intel x86
  - Memory-mapped I/O (through system bus)
    - The most common way


### **Programmer Considerations**

### Interrupts

- Cause the processor to suspend execution of the main program and jump to an interrupt service routine (ISR) or interrupt handler
- After the ISR completes, the processor resumes execution of the main program (foreground program) by restoring the PC
- The ISR should be located at a specific address in program memory



# **Operating System**

- Optional software layer providing low-level services to a program (application).
  - Resource (CPU, memory, ...) management
  - File management, disk access
  - Keyboard/display interfacing
  - Scheduling multiple programs for execution
    - Or even just multiple threads from one program
  - Program makes system calls to the OS


| Code                   | Comment               |
|------------------------|-----------------------|
| DB file_name "out.txt" | store file name       |
| MOV R0, 1324           | system call "open" id |
| MOV R1, file_name      | address of file-name  |
| INT 34                 | cause a system call   |
| JZ R0, L1              | if zero -> error      |
| . . . read the file    |                       |
| JMP L2                 | bypass error cond.    |
| L1:                    |                       |
| . . . handle the error |                       |
| L2:                    |                       |



### **Development Environment**

- Development processor (host)
  - The processor on which we write and debug our programs
    - Usually a PC
- Target processor
  - The processor that the program will run on in our embedded system
    - Often different from the development processor








- Compilers
  - Cross compiler
    - Runs on one processor, but generates code for another
- Assemblers
- Linkers
- Debuggers
- Profilers
- Typically, these tools are combined into a single integrated development environment (IDE)


### Debugger

- Runs on the development processor
- Supports stepwise program execution
- Supports breakpoints
- When the program stops, the user can examine the values of various memory and register locations
- Source-level debuggers: step-by-step in source program
- Uses instruction-set simulators (ISS) or virtual machines (VM)



### Emulator

- Support debugging of the program while it executes on the target processor
- Microprocessor in-circuit emulator (ICE)
  - A special hardware tool to emulate the behavior of a processor



# Instruction Set Simulator For a Simple Processor

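```
#include <stdio.h>

typedef struct {
    unsigned char first_byte, second_byte;
} instruction;

instruction program[1024];  // instruction memory
unsigned char memory[256];  // data memory

// Execute the loaded program; returns 0 on success, -1 on an unknown opcode.
int run_program(int num_bytes) {
    int pc = -1;
    unsigned char reg[16] = {0}, fb, sb;

    while (++pc < (num_bytes / 2)) {
        fb = program[pc].first_byte;   // opcode (high nibble), register (low nibble)
        sb = program[pc].second_byte;  // address, immediate, register or offset
        switch (fb >> 4) {
            case 0: reg[fb & 0x0f] = memory[sb]; break;            // load, direct
            case 1: memory[sb] = reg[fb & 0x0f]; break;            // store, direct
            case 2: memory[reg[fb & 0x0f]] = reg[sb >> 4]; break;  // store, register indirect
            case 3: reg[fb & 0x0f] = sb; break;                    // load immediate
            case 4: reg[fb & 0x0f] += reg[sb >> 4]; break;         // add
            case 5: reg[fb & 0x0f] -= reg[sb >> 4]; break;         // subtract
            case 6: pc += sb; break;                               // relative jump
            default: return -1;
        }
    }
    return 0;
}

// Helper not shown in the source listing; assumed here to dump data memory.
void print_memory_contents(void) {
    for (int i = 0; i < 256; i++)
        printf("%02x%c", memory[i], (i % 16 == 15) ? '\n' : ' ');
}

int main(int argc, char *argv[]) {
    FILE *ifs;

    if (argc != 2 || (ifs = fopen(argv[1], "rb")) == NULL) {
        return -1;
    }
    // With an element size of 1, fread returns the number of bytes read.
    int num_bytes = (int)fread(program, 1, sizeof(program), ifs);
    fclose(ifs);
    if (run_program(num_bytes) == 0) {
        print_memory_contents();
        return 0;
    }
    return -1;
}
```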



### **Testing and Debugging**



- ISS
  - Gives us control over time: set breakpoints, look at register values, set values, step-by-step execution, ...
  - But, doesn't interact with real environment
- Download to board
  - Use device programmer
  - Runs in real environment, but not controllable
- Compromise: emulator
  - Runs in real environment, at real-time speed or near
  - Supports some controllability from the PC




### Conventional stages of development

- Debugging using an ISS
- Emulation using an emulator
- Field testing by downloading the program directly into the target processor
- Different levels of simulation
  - Instruction-level simulator
  - Cycle-level simulator
  - Hardware/software co-simulator



## **CPU** Power Consumption

Most modern CPUs are designed with power consumption in mind to some degree

- Voltage drops: power consumption proportional to V<sup>2</sup>
- Toggling: more activity means more power
- Leakage: basic circuit characteristics; can be eliminated by disconnecting power
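The first two bullets correspond to the usual first-order model of dynamic (switching) power in CMOS logic, activity × capacitance × V² × frequency. The sketch below assumes that model; the function and parameter names are hypothetical, and leakage is not included.

```c
/* Dynamic (switching) power in watts under the first-order CMOS model:
   activity factor (0..1) * switched capacitance (F) * Vdd^2 (V^2) * clock (Hz). */
double dynamic_power(double activity, double cap_farads,
                     double vdd_volts, double freq_hz) {
    return activity * cap_farads * vdd_volts * vdd_volts * freq_hz;
}
```

Under this model, halving the supply voltage at the same clock frequency cuts dynamic power to one quarter, while halving only the clock frequency cuts it to one half.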



## **CPU** Power-Saving Strategies

- Reduce power supply voltage
- Run at lower clock frequency
- Disable function units with control signals when not in use
- Disconnect parts from power supply when not in use



## Power management styles

- Static power management: does not depend on CPU activity.
  - Example: user-activated power-down mode.
- Dynamic power management: based on CPU activity.
  - Example: disabling function units.
  - Example: dynamic voltage and frequency scaling (DVFS)



## Application: PowerPC 603 Energy Features

- Provides doze, nap, sleep modes.
- Dynamic power management features:
  - Uses static logic.
  - Can shut down unused execution units.
  - Cache organized into subarrays to minimize amount of active circuitry.



## PowerPC 603 Activity

#### Percentage of time when units are idle for SPEC integer/floating-point:

| Unit            | SPECint92 | SPECfp92 |
|-----------------|-----------|----------|
| D cache         | 29%       | 28%      |
| I cache         | 29%       | 17%      |
| load/store      | 35%       | 17%      |
| fixed-point     | 38%       | 76%      |
| floating-point  | 99%       | 30%      |
| system register | 89%       | 97%      |




## Power-Down Costs

- Going into a power-down mode costs:
  - time;
  - energy.

- Must determine if going into mode is worthwhile.
- Can model CPU power states with power state machine.
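Whether a power-down is worthwhile can be checked by comparing the energy of staying idle with the energy of sleeping over the same idle period. The model below is a simplified sketch: it has only one idle state and one sleep state, lumps the cost of entering and leaving sleep into a single time and a single energy, and all names and parameters are assumptions for this example.

```c
#include <stdbool.h>

typedef struct {
    double p_idle_w;   /* power while staying idle (W)                      */
    double p_sleep_w;  /* power while asleep (W)                            */
    double t_trans_s;  /* total time to enter and leave sleep (s)           */
    double e_trans_j;  /* total energy spent entering and leaving sleep (J) */
} sleep_model_t;

/* Energy used over an idle period if the CPU simply stays idle. */
double energy_stay_idle(const sleep_model_t *m, double t_idle_s) {
    return m->p_idle_w * t_idle_s;
}

/* Energy over the same period if the CPU sleeps instead.
   Only meaningful when t_idle_s >= t_trans_s. */
double energy_with_sleep(const sleep_model_t *m, double t_idle_s) {
    return m->e_trans_j + m->p_sleep_w * (t_idle_s - m->t_trans_s);
}

/* Sleeping pays off only if the idle period is long enough to cover the
   transitions and the total energy is lower than staying idle. */
bool sleep_is_worthwhile(const sleep_model_t *m, double t_idle_s) {
    if (t_idle_s < m->t_trans_s) {
        return false;
    }
    return energy_with_sleep(m, t_idle_s) < energy_stay_idle(m, t_idle_s);
}
```

A power state machine generalizes this idea: each state carries a power level, and each transition carries a time and energy cost.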



## Application: StrongARM SA-1100 Power Saving

- Processor takes two supplies:
  - VDD is main 3.3V supply.
  - VDDX is 1.5V.
- Three power modes:
  - Run: normal operation.
  - Idle: stops CPU clock, with logic still powered.
  - Sleep: shuts off most of chip activity; 3 steps, each about 30 µs; wakeup takes > 10 ms.




## SA-1100 Power State Machine

 $P_{run} = 400 \text{ mW}$ 






## Selecting a Microprocessor

- Issues
  - Technical: speed, power, size, cost
  - Other: development environment, prior expertise, licensing, etc.
- Speed: how to evaluate a processor's speed?
  - Clock speed, but instructions per cycle may differ
  - Instructions per second, but work per instruction may differ
  - Dhrystone: synthetic benchmark, developed in 1984; results are reported in Dhrystones/sec
    - MIPS: 1 MIPS = 1757 Dhrystones per second (based on Digital's VAX 11/780). A.k.a. Dhrystone MIPS. Commonly used today.
      - So, 750 MIPS = 750 × 1757 = 1,317,750 Dhrystones per second
  - SPEC: set of more realistic benchmarks, but oriented to desktops
  - EEMBC: EDN Embedded Microprocessor Benchmark Consortium, www.eembc.org
    - Suites of benchmarks: automotive, consumer electronics, networking, office automation, telecommunications
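The Dhrystone MIPS conversion above is a single scaling by the VAX 11/780 reference score. A minimal sketch, with hypothetical function names:

```c
/* Dhrystones per second scored by the VAX 11/780 reference machine,
   which is defined as 1 MIPS. */
#define DHRYSTONES_PER_MIPS 1757.0

/* Convert a raw Dhrystone score to Dhrystone MIPS. */
double dhrystone_mips(double dhrystones_per_sec) {
    return dhrystones_per_sec / DHRYSTONES_PER_MIPS;
}

/* Convert a Dhrystone MIPS rating back to Dhrystones per second. */
double dhrystones_per_second(double mips) {
    return mips * DHRYSTONES_PER_MIPS;
}
```

Because the rating is relative to one reference machine and one synthetic program, it is best used to compare processors running the same compiler settings.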



## **General Purpose Processors**

| Processor                      | Clock speed | Peripherals                               | Bus width | MIPS  | Power  | Transistors | Price |
|--------------------------------|-------------|-------------------------------------------|-----------|-------|--------|-------------|-------|
| **General-purpose processors** |             |                                           |           |       |        |             |       |
| Intel PIII                     | 1 GHz       | 2x16K L1, 256K L2, MMX                    | 32        | ~900  | 97 W   | ~7M         | \$900 |
| IBM PowerPC 750X               | 550 MHz     | 2x32K L1, 256K L2                         | 32/64     | ~1300 | 5 W    | ~7M         | \$900 |
| MIPS R5000                     | 250 MHz     | 2x32K, 2-way set assoc.                   | 32/64     | NA    | NA     | 3.6M        | NA    |
| StrongARM SA-110               | 233 MHz     | None                                      | 32        | 268   | 1 W    | 2.1M        | NA    |
| **Microcontrollers**           |             |                                           |           |       |        |             |       |
| Intel 8051                     | 12 MHz      | 4K ROM, 128 RAM, 32 I/O, Timer, UART      | 8         | ~1    | ~0.2 W | ~10K        | \$7   |
| Motorola 68HC811               | 3 MHz       | 4K ROM, 192 RAM, 32 I/O, Timer, WDT, SPI  | 8         | ~0.5  | ~0.1 W | ~10K        | \$5   |
| **Digital signal processors**  |             |                                           |           |       |        |             |       |
| TI C5416                       | 160 MHz     | 128K SRAM, 3 T1 ports, DMA, 13 ADC, 9 DAC | 16/32     | ~600  | NA     | NA          | \$34  |
| Lucent DSP32C                  | 80 MHz      | 16K inst., 2K data, serial ports, DMA     | 32        | 40    | NA     | NA          | \$75  |

Sources: Intel, Motorola, MIPS, ARM, TI, and IBM Website/Datasheet; Embedded Systems Programming, Nov. 1998



## Application-Specific Instruction-Set Processors (ASIPs)

- General-purpose processors
  - Sometimes too general to be effective in demanding applications
    - e.g., video processing requires huge video buffers and operations on large arrays of data, inefficient on a GPP
  - But single-purpose processor has high NRE, not programmable
- ASIPs targeted to a particular domain
  - Contain architectural features specific to that domain
    - e.g., embedded control, digital signal processing, video processing, network processing, telecommunications, etc.
  - Still programmable



## A Common ASIP: Microcontroller

- For embedded control applications
  - Reading sensors, setting actuators
  - Mostly dealing with events (bits): data is present, but not in huge amounts
  - e.g., VCR, disk drive, digital camera, washing machine, microwave oven
- Microcontroller features
  - On-chip peripherals
    - Timers, analog-digital converters, serial communication, etc.
    - Tightly integrated for programmer, typically part of register space
  - On-chip program and data memory
  - Direct programmer access to many of the chip's pins
  - Specialized instructions for bit-manipulation and other low-level operations



## **Co-Processor**

- Co-processor: added function unit that is called by instruction
  - Floating-point units are often structured as co-processors
- Tightly coupled to the CPU
- When the CPU receives a co-processor instruction, it must activate the co-processor and pass it the relevant instructions
- Co-processor: can load and store co-processor registers and CPU registers
- To provide compatibility, the function of the co-processor can be emulated with a software interrupt handler
- ARM allows up to 16 designer-selected co-processors.

## Another Common ASIP: Digital Signal Processors (DSP)

- For signal processing applications
  - Large amounts of digitized data, often streaming
  - Data transformations must be applied fast
  - e.g., cell-phone voice filter, digital TV, music synthesizer
- DSP features
  - Several instruction execution units
  - Single-cycle multiply-accumulate instruction, other instructions
  - Efficient vector operations, e.g., add two arrays
    - Vector ALUs, loop buffers, etc.
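The multiply-accumulate (MAC) operation is the inner step of most filters. As an illustration, the following C sketch computes one output sample of an FIR filter; the function name `fir_sample` and the 16-bit data with a 32-bit accumulator are assumptions for this example. On a DSP, each loop iteration would typically map to one single-cycle MAC instruction, with the loop overhead hidden by hardware loop support.

```c
#include <stddef.h>
#include <stdint.h>

/* One FIR output sample: the sum of coeff[k] * x[k] over n taps.
   The 32-bit accumulator is assumed wide enough for the chosen n. */
int32_t fir_sample(const int16_t *coeff, const int16_t *x, size_t n) {
    int32_t acc = 0;
    for (size_t k = 0; k < n; k++) {
        acc += (int32_t)coeff[k] * x[k];  /* multiply-accumulate (MAC) */
    }
    return acc;
}
```

On a general-purpose processor without a MAC instruction, the same loop needs a separate multiply, add, and loop-control instructions per tap.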



## Trend: Even More Customized ASIPs

- In the past, microprocessors were acquired as chips
- Today, we increasingly acquire a processor as Intellectual Property (IP)
  - e.g., synthesizable VHDL model
- Opportunity to add custom datapath hardware and a few custom instructions, or delete a few instructions
  - Can have significant performance, power and size impacts



# Trend: Even More Customized ASIPs

- Problem: need compiler/debugger for customized ASIP
  - Remember, most development uses structured languages
  - One solution: automatic compiler/debugger generation
    - e.g., <u>www.tensillica.com</u> (acquired by Cadence)
  - Another solution: retargetable compilers
  - Modern solution: automatic hardware/compiler/debugger generation with a processor architecture design language
    - CoWare LISATek → Synopsys Processor Designer
    - ARM MaxCore


#### **DVB-T and Application Specific Multirate DSP**

#### Infineon Low-Power, DVB-T Single-Chip Receiver:

- ASIP for DVB-T acquisition and tracking algorithms
- Harvard Architecture
- 60 mostly RISC-like Instructions
- 8x32-Bit General Purpose Registers, 4x9-Bit Address Registers
- 2048x20-Bit Instruction ROM, 512x32-Bit Data Memory

#### Infineon Application Specific Multirate DSP

- ASIP for interpolation and decimation filters or CORDIC
- Highly optimized data kernel
- Complex FSM to control
- Success story at <u>http://www.coware.com</u>







### Comparisons with reference models

| Design                         | Speed   | Area         |
|--------------------------------|---------|--------------|
| Infineon ASMD                  | 5.09 ns | 11 678 gates |
| Handwritten and optimized VHDL | 4.22 ns | 9 549 gates  |
| Infineon ICORE                 | 8 ns    | 59 000 gates |
| Handwritten and optimized VHDL | 8 ns    | 58 473 gates |

#### ➔ Design Time reduction: up to 60 %





## Signal Processing is Migrating to the Host





## CorExtend

- User Defined Instructions (UDI) for even higher performance
  - Performance tuning, design reuse and product differentiation
  - Add instruction (UDI) without architecture license
- As close to the microprocessor core as you can get
  - Tightly integrated to pipeline and the GPR
- Accelerates from 2x to hardwired-like speed
- Supported by newer MIPS cores: 4KE family, M4K, 4KSd, 24K



## **Users Execute Block Diagram**





## CorExtend<sup>™</sup> Instruction Development Flow





## CORXpert – Automating CorExtend<sup>™</sup> UDIs






## CorExtend/UDI Application Examples

- VoIP simple example
  - 2X total speed up over optimized code with 7K gates
- 802.11a/b/g/i/e lower MAC high throughput wirespeed examples
  - 30X AES (128-bit key) speed up with 10.5K gates + 20x64bit round key RAM
- ADSL2 + SIMD reuse example
  - >=40X RS decode speed up over optimized code with 8K gates
- JPEG decode acceleration



## Trend: Multi-core Comes from Low Power Demands

Low power: key technology in any application domain

Satoshi Matsushita, "Low Power Multi-Core Chips for Mobile Embedded Applications," Multi-core Processor Forum Notes, ISSCC 2006.

#### Solution Space in each Category

| Category        | Apps                | Characteristics                          | Throughput technique                      | Low power technique             | Products                                            |
|-----------------|---------------------|------------------------------------------|-------------------------------------------|---------------------------------|-----------------------------------------------------|
| Always On       | High-end server     | Independent transactions                 | Clock, ILP, Parallel processing           | -                               | IBM Power4                                          |
|                 | Server or Numerical | Independent transactions or calculations | Resource sharing by multiple threads      | Low Vdd                         | Sun Niagara, Tera MTA                               |
|                 | Network             | Independent packets                      | Parallel processing                       | -                               | Broadcom BCM1480, Cavium Octeon, PMC-Sierra RM11200 |
| Demanded On     |                     |                                          | Clock, SIMD, Parallel processing          | -                               | IBM/Sony/Toshiba Cell                               |
|                 | Desktop             | General-purpose, Media                   | Clock, ILP, SIMD, Parallel processing     | Low Vdd                         | Intel Pentium D, AMD Athlon 64 x2                   |
| Periodically On | Mobile              | General-purpose, Media                   | Parallel processing, SIMD, DSP, HW engine | Low Vdd, DVFS, Power-off        | NEC MP211, NEC MPCore, TI OMAP, Hitachi SH-Mobile   |
| Mostly Waiting  | Security            | Encrypt, Decrypt                         | HW engine                                 | Device, Small logic, Slow clock | Infineon 66Plus, Sony Felica                        |

#### Mobile Application SoC: MP211

- Asymmetric Multi-processor (AMP) which integrates three ARM926 (200MHz) and a DSP
- System level power control for intermittent load



Processor Chip," ISSCC Dig. Tech. Papers, pp.136-137, 2005. (NEC)



#### Scalability Enhancement with Multi-core



#### Media Parallel Multi-cores

- Various approaches for scalability
  - UniPhier (Matsushita): (VLIW + SIMD + HW) + CPU
  - MeP (Toshiba)
  - SH-Mobile (Renesas): DSP + HW Engine + CPU
  - OMAP (TI): (DSP or HW) + CPU
  - Cell (IBM/Sony/Toshiba): 8 x SPE (DSP) + CPU, higher clock
  - EMMA (NECEL): HW Engine + DSP + CPU

#### Technology Trends for CPU Parallel Multi-core

- Multiprocessing for performance and power efficiency
- Recovery from degradation of software productivity



#### AMP/SMP: Asymmetric/Symmetric Multiprocessor



## Trend: Heterogeneous System Architecture (HSA)

#### Target: power, performance, programmability and portability.

HSA Accelerated Processing Unit





#### A NEW ERA OF PROCESSOR PERFORMANCE





#### **EVOLUTION OF HETEROGENEOUS COMPUTING**



6 | 2012 Financial Analyst Day | February 2, 2012 | Consumerization, Cloud, Convergence



#### **HETEROGENEOUS SYSTEM ARCHITECTURE – AN OPEN PLATFORM**

- Open Architecture, published specifications
  - HSAIL virtual ISA
  - HSA memory model
  - HSA system specification
- ISA agnostic for both CPU and GPU
- Inviting partners to join us, in all areas
  - Hardware companies
  - Operating Systems
  - Tools and Middleware
  - Applications

HSA Foundation to guide the architecture







#### AMD VISION FOR HSA



Nearest Available Screen: Push of the button enables wireless display extension



Biometric Recognition: Secure, fast, accurate: face, voice, fingerprints



**3D Content:** New artistic uses of mainstream 3D content for everyday consumption

**User Generated** 



Augmented Reality: Superimpose graphics, audio and other digital information as a virtual overlay





Multi-point HD Video Conferencing: Flawless HD video anywhere, multi-stream encode.



Beyond HD Video Experiences: Streaming media, new codecs, 3D, transcode, Audio



All Day Computing: All day active use of your PC with no need of a power source.



Natural UI & Gestures: Touch, no touch & voice




#### Key Founders of HSA Foundation





## **HSA Solution Stack**



#### **HSA Solution Stack**



## Example: New ARM Architecture


