# Sundance SMT8096 PCI based DSP/FPGA development board

2009.12.18.

Kang Young Yun CISL, POSTECH

### Outline

- 1. Introduction
  - 1.1 SMT8096
  - 1.2 DSP/FPGA
- 2. Hardware
  - 2.1 DSP module
  - 2.2 FPGA module
  - 2.3 ADC/DAC module
- 3. Software
  - 3.1 Basic setup
  - 3.2 3L Diamond
  - 3.3 Xilinx ISE
  - 3.4 Modelsim
  - **3.5 PARS**
- 4. Examples
  - 4.1 DSP example
  - 4.2 FPGA example
  - 4.3 IDE example
  - 4.4 Xilinx ISE example
  - 4.5 Modelsim example
  - 4.6 PARS example

### Introduction - SMT8096

- The SMT8096 is a PCI system based on 3 main modules.
  - TI C6416T DSP module (SMT395)
  - Xilinx Virtex-4 FPGA module (SMT368)
  - Dual ADC/DAC module (SMT350)
- All plugged on a PCI carrier board (SMT310Q).



## Introduction - SMT8096 (cont.)



- SHB (Sundance High-speed Bus)
  - 32bit, 100MB/s
- Comport (Communication port)
  - 8bit, 20MB/s

## Introduction - SMT8096 (cont.)

- ▶ 보유 장비
  - 1) SMT8096 board (2X2 시스템)
  - 2) SMT8096 baord + FPGA mod. + ADC/DAC mod.
     (4X4 시스템으로 업그레이드)



## Introduction - DSP/FPGA

### Digital Signal Processor

- 프로그램에 의해 명령이 하나씩 patch되어 해석되어 동 작함
- 고정된 하드웨어 아키텍처에 의한 제약
- 제조사: Texas Instrument, Analog Device

### Field Programmable Gate Array

- 명령의 해석없이 하드웨어가 직접 동작함
- 맞춤화된 아키텍처, 버스 구조, 메모리 및 가속기 블록
- 제조사: Xilinx, Altera, Atmel

- Reasons to select an FPGA over a DSP microprocessor
  - Performance target not achievable with one-two microproce ssors
    - Properly executed FPGA designs typically outperform a DSP micr oprocessor by a factor of 100:1, and by more than 1000:1 in sp ecial circumstances
  - Power dissipation
    - Power dissipation of an FPGA-DSP design is typically about 20% of a microprocessor based design working at the same sample r ate
  - Programmatic issues can tip the balance
    - Software validation costs are avoided by using hardware
    - Availability of talent/tools
    - Reliability Issues



#### ▶ TI DSPs

#### **Applications Matrix Guideline**

|                | Digital Media<br>Processors      | OMAP<br>Applications<br>Processors | C6000<br>Digital Signal<br>Processors | C5000<br>Digital Signal<br>Processors | C2000<br>Digital Signal<br>Controllers                                | MSP430<br>Microcontrollers |
|----------------|----------------------------------|------------------------------------|---------------------------------------|---------------------------------------|-----------------------------------------------------------------------|----------------------------|
| Audio          |                                  |                                    |                                       |                                       |                                                                       |                            |
| Automotive     |                                  |                                    |                                       |                                       |                                                                       |                            |
| Communications |                                  |                                    |                                       |                                       |                                                                       |                            |
| Industrial     |                                  |                                    |                                       |                                       |                                                                       |                            |
| Medical        |                                  |                                    |                                       |                                       |                                                                       |                            |
| Security       |                                  |                                    |                                       |                                       |                                                                       |                            |
| Video          |                                  |                                    |                                       |                                       |                                                                       |                            |
| Wireless       |                                  |                                    |                                       |                                       |                                                                       |                            |
| Key Feature    | Complete tailored video solution | Low power and high performance     | High<br>performance                   | Power-efficient<br>performance        | Performance,<br>integration for<br>greener industrial<br>applications | Ultra-low power            |



#### C6000™ DSP Platform Roadmap

The C6000 DSP platform includes a wide range of devices that raise the bar in performance, set new levels of cost efficiency and offer on-chip peripheral integration to enable developers of high-performance systems to choose the device that best suits their specific application.

#### TMS320C64x<sup>™</sup> DSP Generation – Highest-Performance Fixed-Point DSPs

|                     | Internal RAM (Bytes)<br>L1 Program Cache/<br>L1 Data Cache/ |                       | Enhanced<br>DMA |                  |        |      |                   | Pow<br>CPU | er (W) <sup>2</sup> | Volta | ige (V) |                | 100-U              |
|---------------------|-------------------------------------------------------------|-----------------------|-----------------|------------------|--------|------|-------------------|------------|---------------------|-------|---------|----------------|--------------------|
| Part Number         | L2 Unified RAM/Cache                                        | McBSP                 | (Channels)      | COM <sup>3</sup> | Timers | MHz  | MIPS              | and L1     | Total               | Core  | 1/0     | Packaging      | Price <sup>1</sup> |
| Highest Performance | Highest Performance                                         |                       |                 |                  |        |      |                   |            |                     |       |         |                |                    |
| TMS320C6416TBGLZ1   | 16K/16K/1M                                                  | 2+Utopia <sup>4</sup> | 64              | PCI/HPI 32/16    | 3      | 1000 | 8000 <sup>5</sup> | 0.44       | 1.65                | 1.2   | 3.3     | 532 BGA. 23 mm | 246.55             |
| TMS320C6416TGLZ8    | 16K/16K/1M                                                  | 2+Utopia <sup>4</sup> | 64              | PCI/HPI 32/16    | 3      | 850  | $6800^{5}$        | TBD        | TBD                 | 1.2   | 3.3     | 532 BGA, 23 mm | 191.90             |
| TMS320C6416TBGLZ7   | 16K/16K/1M                                                  | 2+Utopia <sup>4</sup> | 64              | PCI/HPI 32/16    | 3      | 720  | $5760^{5}$        | 0.44       | 1.36                | 1.2   | 3.3     | 532 BGA, 23 mm | 123.80             |
| TMS320C6416TGLZ6    | 16K/16K/1M                                                  | 2+Utopia <sup>4</sup> | 64              | PCI/HPI 32/16    | 3      | 600  | $4800^{5}$        | 0.39       | 1.1                 | 1.1   | 3.3     | 532 BGA, 23 mm | 104.30             |
| TMS320C6415TBGLZ1   | 16K/16K/1M                                                  | 2+Utopia <sup>4</sup> | 64              | PCI/HPI 32/16    | 3      | 1000 | 8000              | 0.44       | 1.65                | 1.2   | 3.3     | 532 BGA, 23 mm | 219.60             |
| TMS320C6415TBGLZ8   | 16K/16K/1M                                                  | 2+Utopia <sup>4</sup> | 64              | PCI/HPI 32/16    | 3      | 850  | 6800              | TBD        | TBD                 | 1.2   | 3.3     | 532 BGA, 23 mm | 165.60             |
| TMS320C6415TBGLZ7   | 16K/16K/1M                                                  | 2+Utopia <sup>4</sup> | 64              | PCI/HPI 32/16    | 3      | 720  | 5760              | 0.44       | 1.36                | 1.2   | 3.3     | 532 BGA, 23 mm | 112.50             |
| TMS320C6415TBGLZ6   | 16K/16K/1M                                                  | 2+Utopia <sup>4</sup> | 64              | PCI/HPI 32/16    | 3      | 600  | 4800              | 0.39       | 1.1                 | 1.1   | 3.3     | 532 BGA, 23 mm | 90.00              |
| TMS320C6414TBGLZ1   | 16K/16K/1M                                                  | 3                     | 64              | HPI 32/16        | 3      | 1000 | 8000              | 0.44       | 1.65                | 1.2   | 3.3     | 532 BGA, 23 mm | 207.85             |
| TMS320C6414TBGLZ8   | 16K/16K/1M                                                  | 3                     | 64              | HPI 32/16        | 3      | 850  | 6800              | TBD        | TBD                 | 1.2   | 3.3     | 532 BGA, 23 mm | 157.40             |
| TMS320C6414TBGLZ7   | 16K/16K/1M                                                  | 3                     | 64              | HPI 32/16        | 3      | 720  | 5760              | 0.44       | 1.36                | 1.2   | 3.3     | 532 BGA, 23 mm | 106.95             |
| TMS320C6414TBGLZ6   | 16K/16K/1M                                                  | 3                     | 64              | HPI 32/16        | 3      | 600  | 4800              | 0.39       | 1.1                 | 1.1   | 3.3     | 532 BGA, 23 mm | 85.55              |

- Xilinx FPGAs
  - Virtex series high performance, high cost
    - Virtex-II pro
    - Virtex-4
    - Virtex-5
    - Virtex-6
  - Spartan series mid performance, low cost
    - Spartan-II
    - Spartan-3
    - Spartan-6

#### **FEATURE** Comparison Table

| Features                        | Virtex-5                    | Virtex-4                    | Extended Spartan-3A |
|---------------------------------|-----------------------------|-----------------------------|---------------------|
| Logic Cells                     | Up to 330,000               | Up to 200,000               | Up to 53,000        |
| User I/Os                       | Up to 1200                  | Up to 960                   | Up to 519 I/O       |
| I/O Standards Supported         | Over 40                     | Over 20                     | Over 20             |
| Clock Management - DCM          | Yes                         | Yes                         | Yes                 |
| Clock Management - PLL          | Yes                         | No                          | No                  |
| Embedded Block RAM              | Up to 18 Mbits              | Up to 11Mbits               | Up to 1.8 Mbits     |
| Embedded Multipliers for DSP    | Yes (25 x 18 MAC)           | Yes (18 x 18 MAC)           | Yes (18 x 18 MAC)   |
| Multi-Gigabit High Speed Serial | Yes                         | Yes                         | No                  |
| Soft Processor Support          | Yes                         | Yes                         | Yes                 |
| Embedded PowerPC® Processors    | Yes (PowerPC 440 Processor) | Yes (PowerPC 405 Processor) | No                  |

### Hardware - DSP module

#### SMT395-VP30-6 characteristics:

- ⇒ 1GHz TMS320C6416T fixed-point DSP,
- ⇒ 8000 MIPS peak performance,
- ⇒ Xilinx Virtex-II Pro VP30-6 FF896 package,
- ⇒ Six 20MB/s communication ports (Comports),
- $\Rightarrow$  256MBytes of SDRAM (133MHz),
- ⇒ 8MByte Flash ROM for boot code and FPGA programming,
- ⇒ Global expansion connector,
- ⇒ High bandwidth data I/O via 2 Sundance High-speed Buses (SHB),
- ⇒ JTAG Diagnostics port.





> RSL (Rocket Serial link): 2.5Gbit/s



### Hardware - FPGA module

#### SMT368 characteristics:

- ⇒ Virtex 4-SX (XCV4SX35 FPGA) in an FF668 package,
- ⇒ Two banks of 8Mbytes of ZBTRAM,
- ⇒ Four Sundance High-speed Bus (SHB 160 I/Os),
- ⇒ Two Comport,
- ⇒ One Sundance LVDS Bus (SLB 60-way Samtec SQH).
- > ZBTRAM (Zero Bus Turnaround Random Access Memory): designed to sustain 100% bus bandwidth by eliminating turnaround cycle when there is transition form Read to Write, or vice-versa.
- > LVDS (Low Voltage Differential Signaling): an electrical signaling system that can run at very high speeds over inexpensive twisted-pair copper cables











## Hardware - ADC/DAC module

#### SMT350 characteristics:

- ⇒ Two 14-bit ADCs (TI ADS5500) sampling at up to 105MHz,
- ⇒ Dual 16-bit DAC (DAC5686) sampling at up to 500MHz (interpolation),
- ⇒ One Sundance LVDS Bus (SLB 60-way Samtec SQH),
- ⇒ Low-jitter on-board system clock based around the combination of a VCXO and the TI - CDCM7005,
  - ⇒ 50-Ohm terminated analogue inputs and outputs, external triggers and clocks via MMCX (Huber and Suhner) connectors.

Daughter module component side



mmcx type connector

Daughter module solder side





PCB for probing purpose – SMT598







| Connector name (silkscreen and schematics) | Description               | Location on the board |
|--------------------------------------------|---------------------------|-----------------------|
| J13                                        | ADCA Analog Input         | Middle / Left         |
| J11                                        | ADCB Analog Input         | Middle / Left         |
| J32                                        | DACA Analog Output        | Middle / Right        |
| J31                                        | DACB Analog Output        | Middle / Right        |
| J30                                        | External Reference Input  | Top / Left            |
| J29                                        | External Clock Input      | Top / Left            |
| J34                                        | External Reference Output | Top / Right           |
| J4                                         | External Clock Output     | Top / Right           |
| J24                                        | External Trigger ADCs     | Bottom / Left         |
| J25                                        | External Trigger DAC      | Bottom / Left         |

## Software - Basic setup

- ▶ TI CCS (Code Composer Studio) 3.3 or higher
  - TI development tool suite (compiler, linker, etc)
- Xilinx ISE 9.2i (ISE 10.1 is not compatible)
  - Xilinx development tool suite (synthesis, simulation)
- Sundance driver
- 3L Diamond
  - A set of tools to create efficient applications using multiprocessor hardware made from DSPs and FPGAs
  - Provides an integrated development environment
  - Diamond server/ Diamond IDE

## Software - Basic setup (cont.)

#### Why do I need Diamond?

- This is rather like asking, why do I need a high-level language? The strict answer is that you can make do without tools to help you build applications, but you will be making your task an order of magnitude harder.
- If you decide not to use Diamond to build a multiprocessor application you will have to do all the work yourself, including things like:
  - loading all the processors, including ones remote from the host PC
  - keeping control of all the separate modules needed to load your application
  - starting your application in a synchronised way
  - managing communications, possibly including deadlock-free message routing
  - writing your own device drivers
  - explicitly managing all memory allocation
  - writing your own multithreading support or using something that is likely to be less efficient than Diamond's
  - inventing a host communication mechanism
  - being prepared to rewrite your source if the configuration changes
  - being prepared to make major changes if the underlying hardware changes
  - $\circ$  supporting multiple source versions for all hardware and configuration variations
  - handling all of the underlying hardware peculiarities (unexpected cache behaviour, for example)
  - ... and many more.
- Diamond can do all of this for you. 3L works closely with hardware vendors and puts a great deal of time, effort, and experience into optimising all aspects of the system to give you the best results.

## Software - 3L Diamond



## Software - 3L Diamond (cont.)

- The Diamond Model
  - In this model, a computing system is a collection of concurrently active sequential processes that can only communicate with each other over channels.
  - DSP tasks are implemented in C while FPGA tasks may be implemented directly in VHDL.



## Software - Xilinx ISE

- Xilinx ISE (Integrated Software Environment)
  - Synthesis: HDL code로부터 netlist 파일(\*.ngc) 생성
  - Implement
    - Translate: 여러 디자인을 하나의 netlist로 구성
    - Map: netlist로부터 physical component 구성 (slices and IOBs)
    - Place and route: 칩위에 구성물을 구성하고, 그 구성물을 연결하며 타이밍 데이터를 리포트함
  - Configuration: generate PROM files and download to devices using iMPACT
  - ISIM (ISE Simulator): PC 환경에서 로직 검증

## Software - Modelsim

ModelSim provides a comprehensive simulati on and debug environment for complex ASIC and FPGA designs.



## Software - PARS

Parallel Application from Rapid Simulation

#### **Design Flow with PARS**



### DSP example 1 - Hello.c

Application file 생성 과정



## DSP example 1 - Hello.c (cont.)

Application file 생성 과정

```
C:\WINDOWS\system32\cmd.exe

C:\W3L\Diamond\target\c6000\Sundance\yykang\example-1>31 c hello.c

C:\W3L\Diamond\target\c6000\Sundance\yykang\example-1>31 t hello.obj

C:\W3L\Diamond\target\c6000\Sundance\yykang\example-1>31 a hello.cfg hello.app

C:\W3L\Diamond\target\c6000\Sundance\yykang\example-1>hello.app
```

▶ Application 실행 화면



#### FPGA example 1 - addone

#### Overview

 In this example the DSP generates data that are sent to a task in the FPGA that increments them by one. The result is sent to the DSP which displays it on the host PC.



```
🖺 driver.c - 워드패드
파일(F) 편집(E) 보기(\underline{V}) 삽입(I) 서식(\underline{O}) 도움말(\underline{H})
 #include <stdio.h>
 #include <chan.h>
 // Input ports
 INPUT_PORT (0.DIN)
 // Output ports
 OUTPUT_PORT (0,DOUT)
 main(int argc, char *argv[], char *envp[],
    CHAN *in_ports[], int ins,
    CHAN *out_ports[], int outs)
        int i, result
        printf("3L Diamond FPGA example-1₩n" );
        for (i=0; i<100; i++) {
               chan_out_word(i, &DOUT);
               chan_in_word(&result, &DIN);
               printf("%d + 1 = %d₩n", i, result);
도움말을 보려면 <F1> 키를 누르십시오.
```

```
🧓 addone.vhd - 메모장
파일(\underline{F}) 편집(\underline{E}) 서식(\underline{O}) 보기(\underline{V}) 도움말(\underline{H})
library ieee;
use ieee.std logic 1164.all;
use ieee.numeric std.all;
library Smt;
use Smt.Smt pkq.all;
entity addone is
    port (
     -- @Diamond3L Begin@
     -- Do not alter the contents of the block between the begin and end tags.
     x_chan_in_0 : IN X_chan_t;
                                       --DIN
     y_chan_in_0 : OUT Y_chan t;
                                       --DIN
     x chan out 0 : OUT X chan t; --DOUT
     y chan out 0 : IN Y chan t;
                                       --DOUT
     -- @Diamond3L End@
     -- The following signals are always required for a task
                  : IN std logic;
     clk
     rst
                   : IN std logic;
     ce
                 : IN std loqic;
            : IN std_logic
     -- Do not add any ports after this point
   );
end addone;
architecture arch of addone is
begin
  x chan out 0.data
                          <= std logic vector(unsigned(x chan in 0.data) + to unsigned(1,</pre>
x chan in O.data'length));
 x chan out 0.write
                        <= x chan in 0.write;
 x chan out 0.validwords <= x chan in 0.validwords;
 y_chan_in_0
                  <= y chan out 0;
end arch;
```





## IDE example – addone



## IDE example - smt350



### IDE example - smt350 (cont.)



# Xilinx ISE example (cont.)



## Xilinx ISE example



### Modelsim example



#### Modelsim example (cont.)



## PARS example - add one



#### **Future** works

- ▶ ITRC Forum 전시 예정 (2010.05.25)
  - ∘ Full-Duplex AF Relay 구현
  - Relay: EM-level self-interference cancellation
  - Destination: real-time or image processing
- ▶ IDE 예제 코드 완벽 분석
- PARS programming
  - FPGA module
  - ADC/DAC module

#### References

- http://www.sundance.com/
- http://support.sundance.com