Kameleon, a RISC-V Based 2-core Multi-accelerator Academic SoC

27/07/2023

On 8 June, Francesc Moll, professor at the Polytechnic University of Catalonia, presented the poster “Kameleon, a RISC-V based 2-core multi-accelerator academic SoC” at the RISC-V Summit Europe 2023. The poster gives a quick overview of the Kameleon SoC and its components, developed within the framework of the DRAC project, which involves the Barcelona Supercomputing Center, the Universitat de Barcelona, the Universitat Politècnica de Catalunya, the Universitat Autònoma de Barcelona and the Institut de Microelectrònica de Barcelona.

Kameleon is a SoC based on the RISC-V ISA. It integrates two cores, Sargantana and Lagarto Ka, and four accelerators: SAURIA, an accelerator for Deep Neural Networks; PQC, an encryption accelerator for secure communication; Picos, a hardware-accelerated task scheduler; and WFA, a genomics analysis accelerator.

RISC-V opens the possibility of designing application-specific computing systems based on accelerators that offload specific, complex tasks from the CPU, which improves both performance and efficiency. The Kameleon SoC is the result of the DRAC research project, in which six universities and research centres set out to implement a RISC-V-based multi-accelerator ASIC. Two different cores were developed in the project: one out-of-order and one in-order. In addition, the project produced three domain-specific accelerators, for cryptography, genomics and autonomous navigation, while a fourth accelerator was added for task scheduling. The project culminated in the physical design of the SoC, which, in addition to the digital IP, includes a PLL running at up to 2 GHz and a 4-lane 8 Gbps SerDes interface to an FPGA.

The Kameleon SoC architecture has two cores and four accelerators, but it does not implement dual-core functionality: only one core can be active at a time (chosen by the user), and the active core cannot be changed while the SoC is powered on.

A. In-order core

The in-order core, Sargantana, is a 64-bit RISC-V [1] processor that implements the RV64G ISA. It features a highly optimized seven-stage pipeline that can reach operating frequencies above 1 GHz. It also features a double-precision Floating Point Unit (FPU) and a 128-bit Single Instruction Multiple Data (SIMD) unit that accelerates domain-specific applications.

This core reaches 2.44 CoreMark/MHz [2]. It also delivers performance comparable to, or even higher than, other state-of-the-art academic cores on the EEMBC AutoBench benchmark suite [3]. By these measures, Sargantana is among the leading academically designed RISC-V cores.

B. Out-of-order core

The OoO core, Lagarto Ka, is a two-way 64-bit out-of-order processor that supports the I, M, and A extensions of the RISC-V ISA. Its ten-cycle pipeline covers all stages of the implementation. The micro-architecture is organized into two main blocks: a sequential front-end and an out-of-order back-end.

In the front-end, the core fetches and issues two instructions per clock cycle and supports speculative execution. In the back-end, dispatched instructions are stored in separate instruction queues, which can hold up to 32 instructions. The queues are designed to reduce energy consumption.

Instructions are executed by several functional units, combined with bypass logic that forwards source operands to dependent instructions.

The core reaches 0.897 IPC on the RISC-V benchmarks [4] and 0.988 IPC on the EEMBC benchmark suite [5].

C. Cache memory

Both cores, Sargantana and Lagarto Ka, have access to L1 and L2 cache memories. Each core has its own 48 kB L1 cache (16 kB for instructions and 32 kB for data), and the two cores share a 512 kB L2 cache.

Unlike the cores, all four accelerators can be active at the same time. The active core communicates with the accelerators through both AXI Lite and AXI Full interfaces.

A. Post-Quantum Cryptography accelerator (PQC)

The PQC accelerator is a module that accelerates the CME [6] (Classic McEliece) KEM (Key-Encapsulation Mechanism), specifically its encryption and decryption functions.

The CME KEM works as follows. The server generates a public-secret key pair (PK, SK) using the KEM. The client, which holds the server’s PK, then feeds it to the Encapsulation algorithm to produce a session key in both plaintext and encrypted form. Finally, the server receives the encrypted session key and decrypts it using its own SK and the Decapsulation algorithm. Upon successful completion of this protocol, the client and the server have securely established a common session key and can go on communicating via symmetric cryptography algorithms.

The PQC accelerator speeds up both halves of this exchange: the encryption of a session key into a ciphertext by one party, and its later decryption back into the session key by the other party.
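The key-establishment flow described above can be sketched in a few lines of Python. This is only an illustration of the keygen / encapsulate / decapsulate sequence: it swaps Classic McEliece for a toy Diffie-Hellman-style KEM over a small prime field, it is not secure, and the function names and parameters (`keygen`, `encapsulate`, `decapsulate`, `P`, `G`) are assumptions made for this example rather than the accelerator's interface.

```python
import hashlib
import secrets

# Toy parameters, NOT secure and NOT Classic McEliece: a Diffie-Hellman-style
# KEM modulo the Mersenne prime 2**127 - 1, used only to show the message flow.
P = 2**127 - 1
G = 3


def _kdf(shared: int) -> bytes:
    """Hash the shared group element into a 32-byte session key."""
    return hashlib.sha256(shared.to_bytes(16, "big")).digest()


def keygen():
    """Server side: generate the (PK, SK) pair."""
    sk = secrets.randbelow(P - 2) + 1
    pk = pow(G, sk, P)
    return pk, sk


def encapsulate(pk: int):
    """Client side: derive a session key and the ciphertext that carries it."""
    r = secrets.randbelow(P - 2) + 1
    ciphertext = pow(G, r, P)
    session_key = _kdf(pow(pk, r, P))
    return ciphertext, session_key


def decapsulate(sk: int, ciphertext: int) -> bytes:
    """Server side: recover the same session key from the ciphertext and SK."""
    return _kdf(pow(ciphertext, sk, P))


pk, sk = keygen()                      # server
ct, client_key = encapsulate(pk)       # client, holding the server's PK
server_key = decapsulate(sk, ct)       # server, after receiving the ciphertext
```

After these three steps both sides hold the same 32-byte key and could switch to a symmetric cipher, which is the point at which the protocol in the text ends.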

B. Systolic Array tensor Unit for Artificial Intelligence Acceleration (SAURIA)

Autonomous driving is one of engineering’s current key challenges spanning many fields of research. At the algorithmic level, the state-of-the-art models for autonomous driving are based on Deep Neural Networks (DNN) which achieve high inference accuracy at the cost of computational complexity. Due to this high computational complexity, accelerators are required to obtain acceptable response times.

SAURIA is the low-power, high-energy-efficiency solution for accelerating DNN workloads in the Kameleon SoC. Its design revolves around a systolic array architecture, which exploits parallelism, data reuse and pipelining in order to efficiently compute general matrix-matrix multiplications.
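As a rough illustration of how a systolic array computes a matrix-matrix product, the sketch below models an output-stationary array cycle by cycle. The dataflow is an assumption chosen for simplicity (the text does not say which dataflow SAURIA uses), and `systolic_matmul` is a hypothetical name. Each processing element PE(i, j) keeps one accumulator, and the skewed operand streams deliver the pair A[i][k], B[k][j] to it at cycle i + j + k.

```python
def systolic_matmul(A, B):
    """Cycle-by-cycle model of an output-stationary systolic array computing A @ B.

    Returns the result matrix and the number of cycles the model ran for.
    """
    rows, inner, cols = len(A), len(B), len(B[0])
    acc = [[0] * cols for _ in range(rows)]  # one accumulator per PE
    # The last operand pair reaches PE(rows-1, cols-1) at cycle rows+cols+inner-3.
    cycles = rows + cols + inner - 2
    for t in range(cycles):
        for i in range(rows):
            for j in range(cols):
                k = t - i - j  # index of the operand pair arriving this cycle
                if 0 <= k < inner:
                    acc[i][j] += A[i][k] * B[k][j]  # one multiply-accumulate
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

In this model every operand is used by a whole row or column of processing elements as it moves through the array, which is the data reuse the paragraph refers to, and in hardware all processing elements work in the same cycle, which is where the parallelism comes from.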

To maximize energy efficiency, approximate arithmetic circuits are used in the processing elements of the systolic array, reducing power consumption by 30% without any significant degradation in the accuracy of the YOLOv3 [7] object detection DNN. Running at 500 MHz, the SAURIA accelerator has a peak throughput of 128 GFLOP/s and a peak energy efficiency of 1.41 TFLOP/s/W.
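The figures above can be related to each other with a short calculation. The sketch below derives the operations per clock cycle from the quoted throughput and frequency, and a rough power estimate from the quoted efficiency. Two assumptions are not from the source: that one multiply-accumulate counts as two floating-point operations, and that peak throughput and peak efficiency occur at the same operating point, so the power figure is only indicative.

```python
FREQ_HZ = 500e6          # clock frequency quoted in the text
PEAK_FLOPS = 128e9       # peak throughput quoted in the text, FLOP/s
PEAK_EFFICIENCY = 1.41e12  # peak energy efficiency quoted in the text, FLOP/s per W

# Floating-point operations completed per clock cycle at peak throughput.
flops_per_cycle = PEAK_FLOPS / FREQ_HZ

# Assumption: one multiply-accumulate (MAC) is counted as two FLOPs.
macs_per_cycle = flops_per_cycle / 2

# Assumption: both peaks are reached at the same operating point.
power_watts = PEAK_FLOPS / PEAK_EFFICIENCY

print(f"{flops_per_cycle:.0f} FLOP/cycle, {macs_per_cycle:.0f} MAC/cycle")
print(f"about {power_watts * 1e3:.1f} mW at peak")
```

Under these assumptions the array completes 256 floating-point operations per cycle and draws on the order of 90 mW.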

C. WaveFront Alignment algorithm (WFA) accelerator

The WFA accelerator is the first accelerator for exact pairwise alignment of long DNA sequences based on the Wavefront Alignment Algorithm (WFA) [8]. It supports sequence lengths of up to 10K bases and error rates of up to 10%.

WFA accelerates DNA alignment in the Kameleon SoC. This accelerator, based on the evaluations on an FPGA prototype, provides performance improvements up to 1076× compared to the WFA implementation on the chip’s CPU.
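To give a feel for the wavefront idea, here is a minimal software sketch that computes the exact edit distance between two sequences. It is a simplification made for illustration: it uses unit-cost edits rather than the gap-affine penalties of the full WFA, it returns only the score and not the alignment, and it says nothing about how the hardware is organized. The name `wavefront_edit_distance` is hypothetical.

```python
def wavefront_edit_distance(a: str, b: str) -> int:
    """Exact edit distance computed with furthest-reaching wavefronts.

    For each score s, wavefront[k] is the furthest position j in b reachable
    on diagonal k = j - i with exactly s edits (i is the position in a).
    """
    n, m = len(a), len(b)
    target = m - n  # diagonal that contains the end cell (n, m)

    def extend(k: int, j: int) -> int:
        """Slide along diagonal k for free while the characters match."""
        i = j - k
        while i < n and j < m and a[i] == b[j]:
            i += 1
            j += 1
        return j

    wavefront = {0: extend(0, 0)}
    score = 0
    while wavefront.get(target, -1) < m:
        score += 1
        nxt = {}
        for k in range(-score, score + 1):
            candidates = []
            if k in wavefront:
                candidates.append(wavefront[k] + 1)      # mismatch
            if k - 1 in wavefront:
                candidates.append(wavefront[k - 1] + 1)  # insertion (extra char in b)
            if k + 1 in wavefront:
                candidates.append(wavefront[k + 1])      # deletion (extra char in a)
            # Keep only positions that stay inside both sequences.
            valid = [j for j in candidates if 0 <= j <= m and 0 <= j - k <= n]
            if valid:
                nxt[k] = extend(k, max(valid))
        wavefront = nxt
    return score


distance = wavefront_edit_distance("GATTACA", "GATCACA")
```

The work grows with the number of differences rather than with the full product of the two lengths, which is why a bounded error rate matters for this class of algorithm.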

Having a genomics accelerator inside the Kameleon SoC extends the chip’s functionality to the analysis of genomics data. This makes the chip a suitable platform for genomics applications, since it eliminates the need for costly external accelerators and the communication complexity they bring.

D. Hardware-accelerated task scheduler (Picos)

Picos is a hardware implementation of the runtime of a task-based programming model, developed by the in-house programming models group. Its main objective is to reduce runtime overhead by accelerating task scheduling (including dependence resolution) and task synchronization (taskwait).

Tasks can be executed on any processing core or accelerator. CPU tasks can be any piece of code, while accelerator tasks have to be an execution of the implemented application, with parameters given through a custom API.

Picos is able to manage and control the execution of all accelerators at the same time, including CPU tasks. Moreover, dependencies can be declared independently of the target (accelerator or CPU).
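The dependence resolution mentioned above can be modelled in software as follows. This is a sketch under stated assumptions, not the Picos API: the class `TaskGraph`, its methods, the target labels and the data names are all hypothetical. Each task declares the data it reads and writes, dependences are derived from those declarations regardless of where the task runs, and `schedule` returns an order that respects them.

```python
from collections import defaultdict, deque


class TaskGraph:
    """Software model of task dependence tracking (hypothetical, not the Picos API)."""

    def __init__(self):
        self.deps = {}                    # task -> tasks it must wait for
        self.target = {}                  # task -> "cpu" or an accelerator label
        self.last_writer = {}             # data item -> task that last wrote it
        self.readers = defaultdict(set)   # data item -> readers since the last write

    def submit(self, name, target, ins=(), outs=()):
        """Register a task; dependences do not depend on the target."""
        waits = set()
        for d in ins:                     # read-after-write
            if d in self.last_writer:
                waits.add(self.last_writer[d])
        for d in outs:                    # write-after-write and write-after-read
            if d in self.last_writer:
                waits.add(self.last_writer[d])
            waits |= self.readers[d]
        waits.discard(name)
        for d in ins:
            self.readers[d].add(name)
        for d in outs:
            self.last_writer[d] = name
            self.readers[d] = set()
        self.deps[name] = waits
        self.target[name] = target

    def schedule(self):
        """Return all tasks in an order that respects every dependence."""
        pending = {t: set(w) for t, w in self.deps.items()}
        ready = deque(t for t, w in pending.items() if not w)
        order = []
        while ready:
            done = ready.popleft()
            order.append(done)
            for other, w in pending.items():
                if done in w:
                    w.remove(done)
                    if not w:             # last dependence satisfied
                        ready.append(other)
        return order


g = TaskGraph()
g.submit("load", "cpu", outs=["reads"])
g.submit("align", "wfa", ins=["reads"], outs=["alignments"])
g.submit("detect", "sauria", ins=["frame"], outs=["boxes"])
g.submit("report", "cpu", ins=["alignments", "boxes"])
order = g.schedule()
```

Here `detect` has no dependence on `load` or `align`, so it can run concurrently with them on a different target, while `report` waits for both the CPU-side and accelerator-side producers.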

Fig. 1. Kameleon floorplan distribution.

The physical design of the Kameleon SoC was carried out in GlobalFoundries 22 nm FDSOI technology. The design measures 3 mm × 3 mm and has 203 IO pins, 120 of which are for power supply. The size of the design was dictated by the area requirements of the IPs it had to contain. The distribution of the IPs inside Kameleon’s square floorplan is shown in Fig. 1.

Two things are worth noting in this floorplan. First, both cores have been integrated into the same IP. Second, there are two IPs that are neither cores nor accelerators: a PLL for clock generation and four serializer/deserializers (SerDes) for communication.

The IPs were placed so as to leave the largest possible rectangular area for the cores’ IP, which was the last one to be designed.

With this design, we have demonstrated the viability of an open-source, open-hardware SoC designed entirely by academic and research institutions, which was the goal of the project. The design has not yet been manufactured. However, its functionality has been tested in simulation: it implements the functionality that all of its constituent IPs provide and reaches operating frequencies as high as 800 MHz.

REFERENCES

[1] RISC-V Foundation, “About RISC-V,” 2023.

[2] S. Gal-On and M. Levy, “Exploring CoreMark: A Benchmark Maximizing Simplicity and Efficacy,” The Embedded Microprocessor Benchmark Consortium, 2012.

[3] J. Poovey, T. Conte, M. Levy, and S. Gal-On, “A Benchmark Characterization of the EEMBC Benchmark Suite,” IEEE Micro, vol. 29, pp. 18–29, 2009.

[4] RISC-V International, “RISC-V Benchmarks,” https://github.com/riscv-software-src/riscv-tests, 2023. [Accessed Aug. 2019].

[5] Embedded Microprocessor Benchmark Consortium, “EEMBC AutoBench Performance Benchmark Suite,” https://www.eembc.org/autobench/, 2023. [Accessed Mar. 2023].

[6] M. R. Albrecht, D. J. Bernstein, T. Chou, C. Cid, et al., “Classic McEliece: Conservative Code-Based Cryptography,” 2020.

[7] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” CoRR, vol. abs/1804.02767, 2018.