

## CSE 234: Data Systems for Machine Learning Winter 2025

1

https://hao-ai-lab.github.io/cse234-w25/

MLSys Basics

Optimizations and Parallelization

### LLMSys

### Enrollment Updates

- and Thursday)
	- To ~100 waitlisted students in total
	- With a 24h expiration
- If there are still rooms (which is highly likely)
	- We will send the next batch Friday morning

### • We have sent out two batches of enrollment invitations (on Wed

## Today's Learning Goals

- How to make operators fast in general?
	- *Vectorize*
	- **?** Data layout
	- ? Parallelization (at the operator level)
- Matmul-specific optimization
- GPUs and accelerators
	- High-level Idea
	- The accelerator market

Runtime: schedule memory

**Operator** 



Dataflow Graph

Autodiff

**Graph Optimization** 

**Parallelization** 

### Recap: Strides

### $\vee$  Python











### Why we bother saving "strides" when saving tensors

tensor torch.Tensor.view

Tensor.view(\*shape)  $\rightarrow$  Tensor

### • Strides can separate the underlying storage and the view of the

```
>>> x = torch.randn(4, 4)\gg x.size()
torch.Size([4, 4])\Rightarrow y = x.view(16)
\gg y.size()
torch.Size([16])>>> z = x.view(-1, 8) #
\gg z.size()
torch.Size([2, 8])
```
• Enable zero-copy of some very frequently used operators

### Operator: Slice

- Change the offset by +1
- Reduce the shape to [3, 2]

• Note: zero copy



### Operator: Transpose

### • How to do Transpose?

```
\vee Python
```

```
1 print(t.start(de())2 # (24, 12, 4, 1)3
 4 print(t.permute((1, 2, 3, 0)).is_{contiguous())5 # True
 6
 7 print(t.permute((1, 2, 3, 0)).stride())8 # (12, 4, 1, 24)9
10 print_internal(t.permute((1, 2, 3, 0)))11 # tensor([0, 1, 2, 3,4, 5, 6, 7,12 #
      8, 9, 10, 11,
13#14 #
       12, 13, 14, 15,
       16, 17, 18, 19,
15 #
16 #
       20, 21, 22, 23])
```
• Note 1: zero copy

• Note 2: underlying storage is unchanged



### Transposed Matrix  $A<sup>T</sup>$



### **Broadcast**

• Question: how to do broadcast?

- b.strides=[1], b.shape=[1,3], b.data=[1, 2, 3]
- 
- A.strides[1]]



• After broadcast: b.shape=[4,3], b.data=[1, 2, 3], b.strides=?

• Recall the def of strides: A[i, j] = A.data[offset + i\*A.strides[0] + j \*

### Home Exercise: Swapping tiles









### Problems of Strides

- Memory Access may become not continuous
	- Many vectorized ops requires continuous storage
	- What's the underlying storage after slice?



### TORCH.TENSOR.CONTIGUOUS

Tensor.contiguous(memory\_format=torch.contiguous\_format) → Tensor

Returns a contiguous in memory tensor containing the same data as self tensor. If self tensor is already in the specified memory format, this function returns the self tensor.

Parameters

memory\_format (torch.memory\_format, optional) - the desired memory format of returned Tensor. Default: torch.contiguous\_format.

### Parallelization (Elementise)

• How to parallelize the loop?

for (int i = 0; i < 64; ++i) { float4  $a =$  load float4(A + i\*4); float4  $b = load_{1}$ cloat4(B + i\*4); float4  $c = add_f $1$ oat4 $(a, b)$ ;$ store float4( $C + i* 4$ , c);

vectorized Vectorized &

parallelized

### We'll com back to this later

```
#pragma omp parallel for
for (int i = 0; i < 64; ++i) {
     float4 a = load_float4(A + i * 4);
     float4 b = load_float4(B + i*4);
     float4 c = add_float4(a, b);
     store float4(C * 4, C);
```
### Summary

### • Vectorization

- Leverage platform-specific vectorized functions
- reduce seek time
- Data layout
	- Stride format
	- Zero copy
	-
- Parallelization on CPUs

### • Enable fast array-manipulation: slice, transpose, broadcast, etc.

### Next

- How to make operators fast in general?
	- *Vectorize*
	- *V* Data layout
	- **V** Parallelization (at the operator level)
- **?** Matmul-specific optimization
- GPUs and accelerators
	- High-level Idea
	- The accelerator market

Runtime: schedule memory

**Operator** 



Dataflow Graph

Autodiff

**Graph Optimization** 

**Parallelization** 

### What is Matmul in Code?

Compute  $C = dot(A, B.T)$ 

float  $A[n][n], B[n][n], C[n][n];$ 

for (int  $i = 0; i < n; ++i$ ) for (int  $j = 0; j < n; ++j$ ) {  $C[i][j] = 0;$ for (int  $k = 0$ ;  $k < n$ ;  $++k$ ) {  $C[i][j] += A[i][k] * B[j][k];$ 

- What is the time complexity of 2D matmul?
- $\bullet$   $O(n \wedge 3)$

- 
- 
- What is the best complexity we can achieve?
- $O(n \land 2.371552)$

### Matmul Complexity



### a good area to do research  $\odot$

### How to Make Matmul Fast

# $max A = \#ops / \#bytes$

### Recall: Memory Hierarchy

- 
- 



## • Ideally: we want everything to be local to processors (In registers) • But registers are expensive and small, hence memory hierarhcy

Source: Latency numbers every programmer should know

7ns 14x L1 cache

200ns 20x L2 cache, 200x L1 cache

### Simplify It a bit





### Recall How to Estimate AI: count loads

float\* A,  $*B$ ,  $*C$ ,  $*D$ ,  $.E$ ,  $*mp1$ ,  $*tmp2;$ assume arrays are allocated here compute  $E = D + ((A + B) * C)$ add(n, A, B, tmp1); mul(n,  $tmp1, C, tmp2);$ add(n, tmp2, D, E);

void fused(int n, float\* A, float\* B, float\* C, float\* D, float $*$  E) { for (int  $i=0$ ;  $i\leq n$ ;  $i++$ )  $E[i] = D[i] + (A[i] + B[i]) * C[i];$ } compute  $E = D + (A + B)^* C$ fused(n, A, B,C, D, E);

### Overall arithmetic intensity = 1/3

Four loads, one store per 3 math ops arithmetic intensity = 3/5



### Review Matmul loop

dram float  $A[n][n]$ ,  $B[n][n]$ ,  $C[n][n]$ ; for (int i = 0; i < n; ++i) { for (int  $j = 0$ ;  $j < n$ ; ++ $j$ ) { register float  $c = 0$ ; for (int  $k = 0$ ;  $k < n$ ;  $++k$ ) { register float  $a = A[i][k];$ register float  $b = B[j][k];$  $c == a * b;$  $|C[i][j] = c;$ 

### #reigsters needed:  $1 + 1 + 1 = 3$

Read a Read b Write c n^3 n^3 n^2

Read cost:  $2 * n^2$  \* speed(dram -> register)

## Register Tiled Matrix Multiplication

```
dram float A[n/v1][n/v3][v1][v3];dram float|B[n/v2][n/v3][v2][v3];dram float C[n/v1][n/v2][v1][v2];
```

```
for (int i = \theta; i < n/v1; ++i) {
   for (int j = 0; j < n/v2; ++j) {
      register float c[v1][v2] = 0;for (int k = 0; k < n/v3; ++k) {
        register float a[v1][v3] = A[i][k];register float b[v2][v3] = B[j][k];c \leftarrow dot(a, b.T);
      C[i][j] = c;
```


## Register Tiled Matrix Multiplication

Read a Read b Write c N^3 / v\_2 N^3 / v\_1 **N^2** 

dram float  $A[n/v1][n/v3][v1][v3];$ dram float  $B[n/v2][n/v3][v2][v3];$ dram float  $C[n/v1][n/v2][v1][v2];$ 



#reigsters needed:  $v1*v3 + v2*v3 + v1*v2$ 

Read cost:  $(n^2\sqrt{2} + n^23 / v1)^*$  speed(dram -> register)



## **Register Tiled** Matrix Multiplication

• Q: is the load cost related to v3?

- Q: How to set v1 / v2?
	- What are the constraints?

## #reigsters needed:  $v1*v3 + v2*v3 + v1*v2$

• Q: Why essentially can tiling reduce read cost?

Read a Read b N^3 / v\_2 N^3 / v\_1

### Make it more complicated: Consider L1 cache







using at the L1-> register level What's the required condition?

Sub-procedure, can apply register tiling here



### Cache-aware tiling

```
dram float A[n/b1][b1][n];dram float B[n/b2][b2][n];dram float C[n/b1][n/b2][b1][b2];for (int i = 0; i < n/b1; ++i) {
  11cache float a[b1][n] = A[i];for (int j = 0; j < n/b2; ++j) {
     11cache b[b2][n] = B[j];C[i][j] = dot(a, b.T);Later we apply
             register tiling
             hara
```
- Data movement path:
- 1.Dram
- 2.Dram -> l1cache (cache tiling) 3.L1cache -> register (reg tiling)

### Cache-aware tiling

A's dram -> l1 cost:  $n / b1 * n * b1 = n^2$ B's dram -> l1 time cost:  $n / b1 * n / b2 * b2 * n = n \land 3 / b1$ 

dram float  $A[n/b1][b1][n];$ dram float  $B[n/b2][b2][n];$ dram float  $C[n/b1][n/b2][b1][b2];$ for (int i = 0; i <  $n/b1$ ; ++i) { **11cache** float  $a[b1][n] = A[i];$ for (int  $j = 0$ ;  $j < n/b2$ ; ++j) { **11cache**  $b[b2][n] = B[j];$  $C[i][j] = dot(a, b.T);$ 

### • b1  $*$  n + b2  $*$  n < L1 cache size

Vs. previous untiled version?

s.t.

### Putting Things Together

dram float  $A[n/b1][b1/v1][n][v1];$ dram float  $B[n/b2][b2/v2][n][v2];$ 

### We set  $v3 = 1$  (we know it does not matter)

### Outside: cache tiling Inside: register tiling



 $DRAM \rightarrow L1$  cache reads here





### Putting Things Together

dram float  $A[n/b1][b1/v1][n][v1];$ dram float  $B[n/b2][b2/v2][n][v2];$ 

### Outside: cache tiling Inside: register tiling

## **Cost: dram -> l1**

- $n/b1*b1/v1*n*v1 = n^2$
- $n/b1*n/b2*b2/v2*n*v2=n^2/$  $\overline{D}$
- **Cost: l1 -> register:**
- n / b1 \* n / b2 \* b1 / v1 \* b2  $/ v2 * n * v1 = n<sup>3</sup> / v2$
- $\cdot$  n^3 / v1





### In practice

- On CPUs: We have disk -> dram -> L2 -> L1 -> Register
- How to choose v1, v2, b1, b2, c1, c2?
- - $\bullet$  L2 -> L1
	- L1 -> register
- S.t. sizes of L2, L1, registers

• While we are reading from dram -> L2, can we concurrently read:

### Why tiling works: **reuse** loading

 $float A[n][n];$  $float B[n][n];$  $float C[n][n];$ 

 $C[i][j] = sum(A[i][k] * B[j][k], axis=k)$ 

Access of A is independent of the dimension of j

Tile the j dimension by v1 enables resue of A for v1 times





### Homework Candidate?

• Q: How to tile?

for n in range(0, N): for co in range(0, CO): for h in range(0, H): for w in range(0, W): for ci in range(0, CI): for kh in range(0, KH): for kw in range(0, KW):  $C[n,co,h,w]$  +=  $A[n,co,h+kh,w+kw] \times B[kh,kw,co,ci]$ Reduction loop. Stencil computation loops. 5) for parallelization.

Simple spatial loops.

Reduction loops. But usually too small (<=



### Where we are

- How to make operators fast in general?
	- *Vectorize*
	- *V* Data layout
	- **V** Parallelization (at the operator level)
- **Matmul-specific optimization**
- ? GPUs and accelerators
	- High-level Idea
	- The accelerator market

Runtime: schedule memory

**Operator** 



Dataflow Graph

Autodiff

**Graph Optimization** 

**Parallelization** 

### Recap CPU parallelization

- We can parallelize this loop using CPU threads
- $\bullet$   $==$  using many concurrent cores

```
#pragma omp parallel for
```
### for (int i = 0; i < 64; ++i) { float4  $a = load_f$ loat4 $(A + i * 4)$ ; float4  $b = load_f$ loat4(B + i\*4); float4  $c = add_fload4(a, b);$ store float4( $C * 4$ ,  $C$ );

## Vectorized & parallelized





Intel® Architecture currently has SIMD operations of vector length 4, 8, 16

## Single-Instruction Multiple-Data



## Chip Design Trajectory: SIMD







If we're able to reduce size of ALU (transistors) while keeping its power

### Chip Industry:  $70$ nm  $-$ > 60nm  $-$ > 50nm  $-$ > ...  $-$ >?



• Problem: this is not substantiable; there are also power/heat issues when you put more ALUs in a limited area (s.t. physics limitations)

### Chip Industry: Moore's Law Comes to an End





## Idea: How about we use a lot of weak/specialized cores







## Hardware Accelerators: GPUs

- **Graphics Processing Unit (GPU)**: Tailored for matrix/tensor ops
- Basic idea: Use tons of ALUs (but weak and more specialized); massive data parallelism (SIMD on steroids);
- Popularized by NVIDIA in early 2000s for video games, graphics, and multimedia; now ubiquitous in DL
- CUDA released in 2007; later wrapper APIs on top: CuDNN, CuSparse, CuDF (RapidsAI), NCCL, etc.

### Other Hardware Accelerators Other Hardware Accelerators 99999 66666 • E.g. 60606 66666

- - Tensor Processing Unit (TPU)
	- An "application-specific integrated circuit" (ASIC) created by Google in mid 2010s; used for AlphaGo
- E.g.
	- B200 (projected release 2025): fp4 / fp8 Tensorcore
- E.g.
	- M3 max: mixing tensorcore and normal core

## What Does It Mean by "Specialized" In accelerator world

### In General:

• Can only compute certain computations: matmul, w/ sparsity

- Functionality-specialized:
	-
	- Mixing specialized cores with versatile cores
- Reduce precision
	- Floating point operations: fp32, fp16, fp8, int8, int4, ...
- Tune the distribution of different components for specific workloads
	- SRAM, cache, registers, etc.

## Case Study 1: Nvidia GPU Specification



**NVIDIA AI Enterprise** 

\* With sparsity



NVIDIA HGX H100 Partner and NVIDIA-Certified Systems™ with 4 or 8 GPUs NVIDIA DGX H100 with 8 GPUs

Add-on

## Case Study 1: Nvidia GPU Specification



## Case Study 1: Nvidia GPU Specification





## Question: why this could work in ML programs?

### Case Study 2: Apple Silicon



### Case Study 2: Apple Silicon Revealed

### Up to 128GB of unified memory 92 billion transistors

### 16-core CPU

12 performance cores 4 efficiency cores Up to 80% faster than M1 Max Up to 50% faster than M2 Max



### 40-core GPU

Next-generation architecture **Dynamic Caching** Mesh shading Ray tracing Up to 50% faster than M1 Max Up to 20% faster than M2 Max

## Case Study 3: Leading Chip Startups







### Case Study 3: Groq

More specialized hardware

### Nvidia GPUs

## Question: How did Groq achieve that?

### Llama 3.3 70B Output Speed (multiple 1k input prompts)

Output Tokens per Second; Higher is better



### Case Study 3: Groq

### GroqCard™



### **Card Specifications Form Factor**

Dual width, full height, 3/4 length PCI Ex adapter

Performance Up to 750 TOPs, 188 TFLOPs (INT8, FP1

**Memory** 230 MB SRAM per chip Up to 80 IB/s on-die memory bandwid

**Chip Scaling** Up to 9 RealScale™ chip-to-chip conne

**Numerics** INT8, INT16, INT32 & TruePoint<sup>™</sup> techno MXM: FP32 **VXM: FP16, FP32** 

**Power** Max: 375W; TDP: 275 ; Typical: 240W



### Case Study 3: Groq

• Recall

```
dram float A[n][n], B[n][n], C[n][n];
for (int i = 0; i < n; ++i) {
   for (int j = 0; j < n; ++j) {
     register float c = 0;
      for (int k = 0; k < n; ++k) {
        register float a = A[i][k];
        register float b = B[j][k];
        c + = a * b;\mathcal{F}C[i][j] = c;}
```
### Take-home Exercise

- Study B100 specification and compare it to H100
	- How nvidia claims another 2x from H100 -> B100?
- How about B200?

### Economic Question

## Question: What is Nvidia's Moat?

Market Summary > NVIDIA Corp

### 136.20 USD +136.16 (340,400.00%) <sup>+</sup> all time

Closed: Jan 15, 7:59 PM EST · Disclaimer After hours 136.22 +0.020 (0.015%)





## Next: GPU and CUDA

- Basic concepts and Architecture
- Programming abstraction
- Case study: Matmul

Runtime: schedule memory

**Operator** 



### Dataflow Graph

Autodiff

Graph Optimization

**Parallelization** 

### GPU Overview



### Threads, Blocks, Grids

- Threads: smallest units to process a chunk of data
- Blocks: A group of threads that share memory
- Grid: A collection of blocks that execute the same kernel



### How many SMs/Threads we have?

- V100 (2018 Now): 80 SMs, 2048 threads/SM
- A100 (2020 Now): 108 SMs, 2048 threads/SM
- H100 (2022 Now): 144 SMs, 2048 threads/SM
- B100 (2025 ): go surveying the numbers

### CUDA

• Introduced in 2007 with NVIDIA Tesla architecture

• C-like languages for programming GPUs

• CUDA's design matches the grid/block/thread concepts in GPUs

### CUDA Programs contain A Hierarchy of Threads



const int  $Nx = 12$ ; const int  $Ny = 6$ ;

dim3 threadsPerBlock(4, 3, 1); dim3 numBlocks(Nx/threadsPerBlock.x, Ny/threadsPerBlock.y, 1);

// assume A, B, C are allocated Nx x Ny float arrays

// this call will trigger execution of 72 CUDA threads: // 6 thread blocks of 12 threads each matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);



Run on **CPU** 

### How Many threads/Blocks it runs on?

const int  $Nx = 12$ ; const int  $Ny = 6$ ;

dim3 threadsPerBlock(4, 3, 1); dim3 numBlocks(Nx/threadsPerBlock.x, Ny/threadsPerBlock.y, 1);

// assume A, B, C are allocated Nx x Ny float arrays

// this call will trigger execution of 72 CUDA threads: // 6 thread blocks of 12 threads each matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);





### Grid, Block, and Thread

- GridDim: The dimensions of the grid
- blockIdx: The block index within the grid
- blockDim: The dimensions of a block
- threadIdx: The thread index within a block
- What About GridId?
- What about threadDim?







### An Example CUDA Program

```
const int Nx = 12;
const int Ny = 6;
dim3 threadsPerBlock(4, 3, 1);
dim3 numBlocks(Nx/threadsPerBlock.x,
               Ny/threadsPerBlock.y, 1);
// assume A, B, C are allocated Nx x Ny float arrays
// this call will cause execution of 72 threads
// 6 blocks of 12 threads each
matrixAddDoubleB<<<numBlocks, threadsPerBlock>>>(A, B, C);
```

```
__device__ float doubleValue(float x)
   return 2 * x;
// kernel definition
__global__ void matrixAddDoubleB(float A[Ny][Nx],
                                 float B[Ny][Nx],float C[Ny][Nx])int i = blockIdx.x * blockDim.x + threadIdx.x;int j = blockIdx.y * blockDim.y + threadIdx.y;C[j][i] = A[j][i] + doubleValue(B[j][i]);}
```




### Separation CPU and GPU Execution

```
const int Nx = 12;
const int Ny = 6;
dim3 threadsPerBlock(4, 3, 1);
dim3 numBlocks(Nx/threadsPerBlock.x,
               Ny/threadsPerBlock.y, 1);
// assume A, B, C are allocated Nx x Ny float arrays
// this call will cause execution of 72 threads
// 6 blocks of 12 threads each
matrixAddDoubleB<<<numBlocks, threadsPerBlock>>>(A, B, C);
```

```
__device__ float doubleValue(float x)
   return 2 * x;
// kernel definition
__global__ void matrixAddDoubleB(float A[Ny][Nx],
                                 float B[Ny][Nx],
                                 float C[Ny][Nx])
\{int i = blockIdx.x * blockDim.x + threadIdx.x;int j = blockIdx.y * blockDim.y + threadIdx.y;C[j][i] = A[j][i] + doubleValue(B[j][i]);}
```


### #Threads is Explicit and Static in Programs

const int  $Nx = 11$ ; // not a multiple of threadsPerBlk.x const int  $Ny = 5$ ; // not a multiple of threadsPerBlk.y dim3 threadsPerBlk $(4, 3, 1)$ ; dim3  $numBlocks(3, 2, 1);$ // assume A, B, C are allocated Nx x Ny float arrays // this call will trigger execution of 72 CUDA threads: // 6 thread blocks of 12 threads each matrixAdd<<<numBlocks, threadsPerBlk>>>(A, B, C);

```
// kernel definition
__global__ void matrixAdd(float A[Ny][Nx],
                         float B[Ny][Nx],float C[Ny][Nx])
  int i = blockIdx.x * blockDim.x + threadIdx.x;int j = blockIdx.y * blockDim.y + threadIdx.y;// auard against out of bounds array access
  if (i < Nx 88 j < Ny)C[j][i] = A[j][i] + B[j][i];
```
Developers to:

- To provide CPU/GPU code separation
- Statically declare blockDim, shapes.
- Map data to blocks/threads
- Check boundary conditions

### SIMD Constraints: how to handle control flow?

### SIMD requires all ALUs/Core Must proceed in the same pace

oat A[N])

 $x * blockDim.x + threadIdx.x;$ 

f);