2021-06-08发表2023-02-16更新 Erial 15 分钟读完 (大约2311个字)

cuda3

内存和数据局部性

示例 – 矩阵乘法

#### 一个基本的矩阵乘法

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
  // Calculate the row index of the P element and M
  int Row = blockIdx.y*blockDim.y+threadIdx.y;
  // Calculate the column index of P and N
  int Col = blockIdx.x*blockDim.x+threadIdx.x;

  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k) {
      Pvalue += M[Row*Width+k]*N[k*Width+Col];
    }
    P[Row*Width+Col] = Pvalue;
  }
}

GPU 上的性能如何

CGMA ratio（Compute to Global Memory Access ratio）：每次访问 CUDA 内核区域内全局内存执行的浮点计算次数。
越大越好

基本矩阵乘法的 CGMA

CGMA 就是看你取一次数，多少次运算需要用到这个数的值

所有线程为其输入矩阵元素访问全局内存
- 每次浮点加法一次内存访问（4 字节）
- CGMA ratio =1
- 4B/s 的内存带宽/FLOPS
假设一个 GPU
- 峰值浮点速率 1,500 GFLOPS，200 GB/s DRAM 带宽
- 4*1,500 = 6,000 GB/s 需要达到峰值 FLOPS
- 200 GB/s 的内存带宽将执行速度限制在 50 GFLOPS
这将执行率限制为设备峰值浮点执行率的 3.3% (50/1500)！
需要大幅减少内存访问以接近 1,500 GFLOPS
要达到 1,500 GFLOPS 的峰值，我们需要 CGMA=30

如何提高内存访问效率？

增加计算量
提高利用率
利用内存层次结构
- 全局内存
- 共享内存
- 寄存器文件

声明 CUDA 变量

Variable declaration	Memory	Scope	Lifetime
int LocalVar;	register	thread	thread
device shared int SharedVar;	shared	block	block
device int GlobalVar;	global	grid	application
device constant int ConstantVar;	constant	grid	application

** device ** 在与 ** shared ** 或 ** constant ** 一起使用时是可选的
自动变量驻留在寄存器中
- 除了驻留在全局内存中的每线程数组

示例：共享内存变量声明

void blurKernel(unsigned char * in, unsigned char * out, int w, int h) {
  __shared__ float ds_in[TILE_WIDTH][TILE_WIDTH];
 …
}

CUDA 中的共享内存

一种特殊类型的内存，其内容在内核源代码中明确定义和使用
- 每个 SM 一个
- 以比全局内存高得多的速度（在延迟和吞吐量方面）访问
- 访问和共享范围——线程块
- Lifetime——线程阻塞，对应线程结束执行后内容会消失
- 通过内存加载/存储指令访问
- 计算机体系结构中的一种暂存存储器形式

基本矩阵乘法内核的全局内存访问模式

分块/阻塞 - 基本理念

将全局内存内容划分为 tiles
将线程的计算集中在每个时间点的一个或少量分块上

分块的基本概念

在拥堵的交通系统中，显着减少车辆可以大大改善所有车辆看到的延迟
- 为通勤者拼车
- 全局内存访问的平铺
  - 司机 = 访问其内存数据操作数的线程
  - 汽车 = 内存访问请求

一些计算对分块更具挑战性

有些拼车可能比其他拼车更容易
- 拼车参与者需要有类似的工作时间表
- 有些车辆可能更适合拼车
  
  分块也存在类似的挑战

需要同步：

大纲

识别由多个线程访问的全局内存内容块
将 tile 从全局内存加载到片上内存中
使用屏障同步来确保所有线程都准备好开始阶段
让多个线程从片上存储器访问它们的数据
使用屏障同步来确保所有线程都完成了当前阶段
移动到下一个 tile

•Tiled Matrix Multiplication Kernel

矩阵乘法

数据访问模式

每个线程 - 一行 M 和一列 N
每个线程块 – 一条 M 条和一条 N 条

分块矩阵乘法

将每个线程的执行分解成阶段
这样线程块在每个阶段的数据访问都集中在 M 的一个 tile 和 N 的一个 tile 上
tile 在每个维度中包含 BLOCK_SIZE 个元素

加载 tile

一个块中的所有线程都参与
每个线程在平铺代码中加载一个 M 元素和一个 N 元素

屏障同步

同步块中的所有线程
- __syncthreads()
同一个块中的所有线程必须到达 __syncthreads() 才能继续前进
最适合用于协调分阶段执行平铺算法
- 确保在阶段开始时加载 tile 的所有元素
- 确保在阶段结束时消耗 tile 的所有元素

处理分块的边界条件

平铺矩阵乘法内核

__global__ void MatrixMulKernel(float* M, float* N, float* P, Int Width)
{
  __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
  __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  int Row = by * blockDim.y + ty;
  int Col = bx * blockDim.x + tx;
  float Pvalue = 0;

 // Loop over the M and N tiles required to compute the P element
 for (int p = 0; p < WIDHT/TILE_WIDTH; ++p) {
    // Collaborative loading of M and N tiles into shared memory
    ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH+tx];
    ds_N[ty][tx] = N[(p*TILE_WIDTH+ty)*Width + Col];
    __syncthreads();

    for (int i = 0; i < TILE_WIDTH; ++i)Pvalue += ds_M[ty][i] * ds_N[i][tx];
    __synchthreads();
  }
  P[Row*Width+Col] = Pvalue;
}

分块（线程块）大小注意事项

每个线程块应该有多个线程
16 的 TILE_WIDTH 给出 1616 = 256 个线程
32 的 TILE_WIDTH 给出 3232 = 1024 个线程

For 16, in each phase, each block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations. (16 floating-point operations for each memory load)
CGMA=?

For 32, in each phase, each block performs 2*1024 = 2048 float loads from global memory for 1024 * (2*32) = 65,536 mul/add operations. (32 floating-point operation for each memory load)
CGMA=?

共享内存和线程

For an SM with 16KB shared memory
- Shared memory size is implementation dependent!
- For TILE_WIDTH = 16, each thread block uses 22564B = 2KB of shared memory.
- For 16KB shared memory, one can potentially have up to 8 thread blocks executing
  - This allows up to 8512 = 4,096 pending loads. (2 per thread, 256 threads per block)
- The next TILE_WIDTH 32 would lead to 232324 Byte= 8K Byte shared memory usage per thread block,
- allowing 2 thread blocks active at the same time
  - However, the thread count limitation of 1536 threads per SM in current generation GPUs will reduce the number of blocks per SM to one!
- Each __syncthread() can reduce the number of active threads for a block
  - More thread blocks can be advantageous
对于具有 16KB 共享内存的 SM
- 共享内存大小取决于实现！
- 对于 TILE_WIDTH = 16，每个线程块使用 22564B = 2KB 的共享内存。
- 对于 16KB 共享内存，最多可能有 8 个线程块在执行
  - 这允许多达 8*512 = 4,096 个待处理负载。（每个线程 2 个，每个块 256 个线程）
- 下一个 TILE_WIDTH 32 将导致每个线程块使用 232324 Byte= 8K Byte 共享内存，从而允许 2 个线程块同时处于活动状态
  - 但是，当前一代 GPU 中每个 SM 1536 个线程的线程数限制将使每个 SM 的块数减少到一个！
- 每个 __syncthread() 可以减少一个块的活动线程数
  - 更多的线程块可能是有利的

处理任意大小的矩阵

–Threads that do not calculate valid P elements but still need to participate in loading the input tiles

–Phase 0 of Block(1,1), Thread(1,0), assigned to calculate non-existent P[3,2] but need to participate in loading tile element N[1,2]

–

–Threads that calculate valid P elements may attempt to load non-existing input elements when loading input tiles

–Phase 0 of Block(0,0), Thread(1,0), assigned to calculate valid P[1,0] but attempts to load non-existing N[3,0]

不计算有效 P 元素但仍需要参与加载输入瓦片的线程
Block(1,1)的 Phase 0, Thread(1,0)，分配计算不存在的 P[3,2]但需要参与加载 tile 元素 N[1,2]

计算有效 P 元素的线程可能会在加载输入图块时尝试加载不存在的输入元素
Block(0,0)、Thread(1,0) 的第 0 阶段，分配用于计算有效 P[1,0] 但尝试加载不存在的 N[3,0]

“简单”的解决方案

–When a thread is to load any input element, test if it is in the valid index range

–If valid, proceed to load

–Else, do not load, just write a 0

–Rationale: a 0 value will ensure that that the multiply-add step does not affect the final value of the output element

–The condition tested for loading input elements is different from the test for calculating output P element

–A thread that does not calculate valid P element can still participate in loading input tile elements

Loading Elements – with boundary check

8    for (int p = 0; p < (Width-1) / TILE_WIDTH + 1; ++p) {

++       if(Row < Width && t * TILE_WIDTH+tx < Width) {
9               ds_M[ty][tx] = M[Row * Width + p * TILE_WIDTH + tx];
++       } else {
++             ds_M[ty][tx] = 0.0;
++       }
++       if (p*TILE_WIDTH+ty < Width && Col < Width) {
10             ds_N[ty][tx] = N[(p*TILE_WIDTH + ty) * Width + Col];
++       } else {
++             ds_N[ty][tx] = 0.0;
++       }
11      __syncthreads();

Inner Product – Before and After

++    if(Row < Width && Col < Width) {
12     	for (int i = 0; i < TILE_WIDTH; ++i) {
13            	    Pvalue += ds_M[ty][i] * ds_N[i][tx];
       	}
14     __syncthreads();
15   } /* end of outer for loop */
++   if (Row < Width && Col < Width)
16         P[Row*Width + Col] = Pvalue;
  } /* end of kernel */

一些要点

对于每个线程，条件是不同的
- 加载 M 元素
- 加载 N 元素
- 计算和存储输出元素
对于大矩阵，控制发散的影响应该很小

处理一般矩形矩阵

一般来说，矩阵乘法是根据矩形矩阵定义的
j x k M 矩阵乘以 k x l N 矩阵产生 j x l P 矩阵

我们已经介绍了方阵乘法，一种特殊情况

核函数需要泛化处理一般矩形矩阵
Width 参数被三个参数替换：j、k、l
当 Width 用于指代 M 的高度或 P 的高度时，用 j 代替
当 Width 用于指代 M 的宽度或 N 的高度时，替换为 k
当 Width 用于指代 N 的宽度或 P 的宽度时，替换为 l

cuda3

https://erial21.github.io/2021/06/08/CUDA/cuda3/

作者

Erial

发布于

2021-06-08

更新于

2023-02-16

许可协议

#CUDA

cuda3

内存和数据局部性

示例 – 矩阵乘法

GPU 上的性能如何

基本矩阵乘法的 CGMA

如何提高内存访问效率？

声明 CUDA 变量

示例：共享内存变量声明

CUDA 中的共享内存

基本矩阵乘法内核的全局内存访问模式

分块/阻塞 - 基本理念

分块的基本概念

一些计算对分块更具挑战性

大纲

•Tiled Matrix Multiplication Kernel

矩阵乘法

数据访问模式

分块矩阵乘法

加载 tile

屏障同步

处理分块的边界条件

平铺矩阵乘法内核

分块（线程块）大小注意事项

共享内存和线程

处理任意大小的矩阵

“简单”的解决方案

Loading Elements – with boundary check

Inner Product – Before and After

一些要点

处理一般矩形矩阵

作者

发布于

更新于

许可协议

喜欢这篇文章？打赏一下作者吧

评论

目录

最新文章