

ALMA MATER STUDIORUM Università di Bologna

# New Computing Architecture for AI – The European Ecosystem

#### **Prof. Andrea Bartolini**

The Department of Electrical, Electronic and Information Engineering (**DEI**) – <a.bartolini@unibo.it>

#### AI Architectures: some trends...



Table 4: Scaling configurations and MFU for each stage of Llama 3 405B pre-training. See text and Figure 5 of the source for descriptions of each type of parallelism.

Source: https://arxiv.org/abs/2407.21783



Bommasani, Rishi, et al. "On the Opportunities and Risks of Foundation Models." Center for Research on Foundation Models (CRFM), Stanford Institute for Human-Centered Artificial Intelligence (HAI).







## **European Ecosystem of AI Platforms**






## **THE EU HPC & EPI TIMELINE**



## **EPI PROJECT FACTSHEET**

- Currently in Phase 2 (2022-2025)
- Consortium of 27 strategically chosen key European academic and industrial partners
- Total budget: 70 M€
- Funded 50% by the EuroHPC JU and co-funded by Croatia, France, Germany, Greece, Italy, the Netherlands, Portugal, Spain, Sweden and Switzerland







## **EPAC VISION AND CONTRIBUTIONS**




- **VEC**: self-hosted RISC-V CPU + wide VPU (256 double elements) supporting RVV 0.7.1 / 1.0
- **STX**: RISC-V CPU + specific cores for stencil and neural network computation
- **VRP**: RISC-V CPU with support for variable-precision arithmetic (data size up to 512 bit)
- **eFPGA**: on-chip reconfigurable logic
- **Ziptillion**: IP compressing/decompressing data to/from the main memory
- **KVX**: FPGA demonstrator of the Kalray RISC-V CPU targeting HPC and ML







# Towards A *Flexible* Manycore RV Accelerator

- **Gen-AI workloads** increasingly mix *dense* and *sparse* computations
  - Dense: stencils, encoding...
  - Sparse: weight and activation sparsification in CNNs, DNNs, LLMs, as well as graph NNs
- → Next-generation systems must handle *both* compute types efficiently
  - Need flexibility in quantization and sparsification
  - Existing accelerators focus on only one type or are too specialized and inflexible to be future-proof
- **Occamy**: a *flexible*, scalable RISC-V chiplet manycore for efficient sparse and dense **Gen-AI**
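To make the dense/sparse distinction above concrete, here is a minimal sketch in plain Python. It is not Occamy code: the function names and the small example matrix are illustrative assumptions. The dense kernel streams through every matrix entry with regular accesses, while the sparse kernel (CSR format) only touches nonzeros but needs an index lookup for every multiply, which is the irregular access pattern that makes sparse workloads hard to run efficiently.

```python
def dense_matvec(a, x):
    """Dense product: every entry of `a` is read, with regular accesses."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def to_csr(a):
    """Convert a dense list-of-lists matrix to CSR (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in a:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse product: only nonzeros are visited, via indirect indexing."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect access into x
        y.append(acc)
    return y


# Illustrative 3x4 matrix with 4 nonzeros out of 12 entries.
a = [[2.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 3.0, 0.0],
     [0.0, 4.0, 0.0, 0.0]]
x = [1.0, 2.0, 3.0, 4.0]

dense_result = dense_matvec(a, x)
sparse_result = csr_matvec(*to_csr(a), x)
```

Both kernels return the same vector; the sparse one performs 4 multiplies instead of 12, at the cost of one index load per multiply.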






Paulin et al., «Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet, Dual-HBM2E RISC-V-based Accelerator for Stencil and Sparse Linear Algebra Computations with 8-to-64-bit Floating-Point Support in 12nm FinFET» https://arxiv.org/pdf/2406.15068, 30.04.2024



# **Occamy: Inception and key Figures**

#### From concept to tapeout in 15 months

- Manticore concept at Hot Chips 2020 → GF challenge: take concept to prototype (multi-chiplet in GF12)
- Kickoff in April 2021, ≈25 people, up to 10 full-time
- Tapeout July 2022 (GF12) and September 2022 (GF65)
- GlobalFoundries, Synopsys, Rambus, Micron, Avery

#### 2.5D assembly by IZM (Fraunhofer)

- February to September 2023
- Received two *early* samples in August 2023

#### High-complexity multi-chiplet prototype

- 73mm<sup>2</sup>, 600 MGE each chiplet
- 606mm<sup>2</sup> 65nm passive interposer
- RO4350B, low-CTE, high-stability 12-layer PCB



Paulin et al., «Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet, Dual-HBM2E RISC-V-based Accelerator for Stencil and Sparse Linear Algebra Computations with 8-to-64-bit Floating-Point Support in 12nm FinFET» https://arxiv.org/pdf/2406.15068

# Achieving Scale through Hierarchical Design



[Figure: design hierarchy, with panels Occamy System, Occamy Chiplet, and Occamy Group]



Paulin et al., «Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet, Dual-HBM2E RISC-V-based Accelerator for Stencil and Sparse Linear Algebra Computations with 8-to-64-bit Floating-Point Support in 12nm FinFET» https://arxiv.org/pdf/2406.15068

#### Key challenge: tolerate memory latency at local and global level $\rightarrow$ never wait for memory!



# **Occamy Performance and FPU Utilization**

- Mixed workloads @ 1GHz, 0.8 V, 25°C
  - RV32G baseline vs. code using ISA extensions
  - All workloads use FP64 data, int16 indices
  - Sparse LHS real-world matrices, RHS 1% density
- → Near-ideal dense and leading sparse performance
  - **GEMM**: **686 GFLOP/s**, **40 GFLOP/s/W**, **89%** FPU utilization, competitive with GPUs
  - *Stencils*: up to **571 GFLOP/s**, **28 GFLOP/s/W**, **83%** FPU utilization (≥**15%** more than GPU code generators)
  - *SpMM*: up to **307 GFLOP/s**, **16 GFLOP/s/W**, **42%** FPU utilization (≥**1.6**× more than sparse-dense LA on GPUs)
  - *SpMSpM*: up to 187 GCOMP/s, 17 GCOMP/s/W, 49% index comparator utilization
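As a rough cross-check of the figures above, dividing throughput by energy efficiency gives the power that each pair of numbers would imply. This is a sketch that assumes both figures of a workload were measured at the same operating point, which the slide does not state (the "up to" values may come from different runs). The dictionary keys and the function name are illustrative, not taken from the paper.

```python
# (throughput, efficiency) pairs as quoted on the slide.
# Units: GFLOP/s and GFLOP/s/W, or GCOMP/s and GCOMP/s/W for sparse-sparse.
reported = {
    "dense GEMM": (686.0, 40.0),
    "stencil": (571.0, 28.0),
    "sparse-dense": (307.0, 16.0),
    "sparse-sparse": (187.0, 17.0),
}


def implied_power_w(throughput, efficiency):
    """Power in watts implied by throughput / (throughput per watt)."""
    return throughput / efficiency


implied = {name: implied_power_w(t, e) for name, (t, e) in reported.items()}

for name, watts in implied.items():
    print(f"{name}: about {watts:.1f} W")
```

Under that assumption, all four workloads land in the range of roughly 11 to 20 W.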



Paulin et al., «Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet, Dual-HBM2E RISC-V-based Accelerator for Stencil and Sparse Linear Algebra Computations with 8-to-64-bit Floating-Point Support in 12nm FinFET» https://arxiv.org/pdf/2406.15068



# Large Language Model Inference and Training

GPT-J Utilization by Sequence Length and Precision







# Metis - AI Platform



- AI edge inference accelerator

- M.2 module or PCIe card
- Metis AIPU executes all tasks of an AI workload
  - Offload complete network(s)
  - Not just individual layers
- Easy-to-use software stack
  - Voyager SDK combining compilation and quantization flow
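To illustrate what a quantization step in such a flow typically does, here is a generic sketch of symmetric per-tensor int8 weight quantization in plain Python. It is not the Voyager SDK API and makes no claim about how that SDK quantizes networks; the function names and example weights are assumptions chosen for illustration.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


# Illustrative weights; the largest magnitude maps to +/-127.
weights = [0.5, -1.27, 0.0, 0.8]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in 8 bits plus one shared floating-point scale, and the reconstruction error stays within half a quantization step.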





[ESSERC24] Metis AI Processing Unit – a 210 TOPS SoC Powered by Digital in-Memory Computing





# Metis AI Processing Unit (AIPU)






# **Quad-core System-on-Chip**

- PCIe 3.0 x4 link to host
- LPDDR4x
- RISC-V controlled
- 48 MiByte on-chip SRAM
  - 4 MiByte L1 per AI core
  - 32 MiByte L2 shared
- 4 MiByte D-IMC (Digital in-Memory Computing)
  - 1M 8-bit D-IMC weights per AI core
  - 52.4 TOPS per AI core
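A quick arithmetic check ties the per-core figures above to the chip-level totals, including the 210 TOPS quoted in the [ESSERC24] title. The sketch assumes that "quad-core" means four identical AI cores and that "1M" weights means 2^20; the variable names are illustrative.

```python
AI_CORES = 4                    # "quad-core" read as four identical AI cores
L1_PER_CORE_MIB = 4             # MiByte of L1 per AI core
L2_SHARED_MIB = 32              # MiByte of shared L2
DIMC_WEIGHTS_PER_CORE = 2**20   # "1M" read as 2^20 weights (assumption)
BYTES_PER_WEIGHT = 1            # 8-bit weights
TOPS_PER_CORE = 52.4

# On-chip SRAM: private L1 of every core plus the shared L2.
total_sram_mib = AI_CORES * L1_PER_CORE_MIB + L2_SHARED_MIB

# D-IMC weight storage across all cores, in MiByte.
total_dimc_mib = AI_CORES * DIMC_WEIGHTS_PER_CORE * BYTES_PER_WEIGHT / 2**20

# Aggregate throughput of the four AI cores.
total_tops = AI_CORES * TOPS_PER_CORE

print(total_sram_mib, total_dimc_mib, round(total_tops))
```

The totals come out at 48 MiByte of SRAM, 4 MiByte of D-IMC storage, and 209.6 TOPS, which rounds to the 210 TOPS headline figure.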

[ESSERC24] Metis AI Processing Unit – a 210 TOPS SoC Powered by Digital in-Memory Computing









Credits:

Andrea Bartolini

www.unibo.it