2021-06-07发表2023-02-16更新 Erial 10 分钟读完 (大约1433个字)

cuda2

多维内核

多维内核配置

示例

image-20210607161352581(C:\Users\Aerialith\AppData\Roaming\Typora\typora-user-images\image-20210607161352581.png

处理 2D 网格的图片

C/C++ 中的行优先布局

PictureKernel 的源代码

__global__ void PictureKernel(float* d_Pin, float* d_Pout, int height, int width){
  // Calculate the row # of the d_Pin and d_Pout element
  int Row = blockIdx.y*blockDim.y + threadIdx.y;
  // Calculate the column # of the d_Pin and d_Pout element
  int Col = blockIdx.x*blockDim.x + threadIdx.x;
  // each thread computes one element of d_Pout if in range
  if ((Row < height) && (Col < width)) {
    d_Pout[Row*width+Col] = 2.0*d_Pin[Row*width+Col];
  }
}

用于启动 PictureKernel 的主机代码

// assume that the picture is m × n,
// m pixels in y dimension and n pixels in x dimension
// input d_Pin has been allocated on and copied to device
// output d_Pout has been allocated on device
…
dim3 DimGrid((n-1)/16 + 1, (m-1)/16+1, 1);
dim3 DimBlock(16, 16, 1);
PictureKernel<<<DimGrid,DimBlock>>>(d_Pin, d_Pout, m, n);

用 16x16 的块覆盖 62x76 的图片

![image-20210607161715736(C:\Users\Aerialith\AppData\Roaming\Typora\typora-user-images\image-20210607161715736.png

并非 Block 中的所有线程都将遵循相同的控制流路径。

彩色到灰度图像处理示例

RGB 图像

图像中的每个像素都是一个 RGB 值
图像行的格式是 (r g b) (r g b) … (r g b)
RGB 范围分布不均

RGB 转灰度图像

灰度数字图像是其中每个像素的值仅携带强度信息的图像。

颜色计算公式

对于 (I, J) 处的每个像素 (r g b)，执行： grayPixel[I,J] = 0.21r + 0.71g + 0.07*b
这只是一个点积 <[r,g,b],[0.21,0.71,0.07]> 常量特定于输入 RGB 空间

RGB 转灰度代码

#define CHANNELS 3 // we have 3 channels corresponding to RGB
// The input image is encoded as unsigned characters [0, 255]
__global__ void colorConvert(unsigned char * grayImage, unsigned char * rgbImage,
                             int width, int height) {
 int x = threadIdx.x + blockIdx.x * blockDim.x;
 int y = threadIdx.y + blockIdx.y * blockDim.y;

 if (x < width && y < height) {
    // get 1D coordinate for the grayscale image
    int grayOffset = y*width + x;
    // one can think of the RGB image having
    // CHANNEL times columns than the gray scale image
    int rgbOffset = grayOffset*CHANNELS;
    unsigned char r =  rgbImage[rgbOffset      ]; // red value for pixel
    unsigned char g = rgbImage[rgbOffset + 2]; // green value for pixel
    unsigned char b = rgbImage[rgbOffset + 3]; // blue value for pixel
    // perform the rescaling and store it
    // We multiply by floating point constants
    grayImage[grayOffset] = 0.21f*r + 0.71f*g + 0.07f*b;
 }
}

#define CHANNELS 3 // we have 3 channels corresponding to RGB
// The input image is encoded as unsigned characters [0, 255]
__global__ void colorConvert(unsigned char * grayImage,
			      unsigned char * rgbImage,
                             int width, int height) {
 int x = threadIdx.x + blockIdx.x * blockDim.x;
 int y = threadIdx.y + blockIdx.y * blockDim.y;

 if (x < width && y < height) {
    // get 1D coordinate for the grayscale image
    int grayOffset = y*width + x;
    // one can think of the RGB image having
    // CHANNEL times columns than the gray scale image
    int rgbOffset = grayOffset*CHANNELS;
    unsigned char r =  rgbImage[rgbOffset      ]; // red value for pixel
    unsigned char g = rgbImage[rgbOffset + 1]; // green value for pixel
    unsigned char b = rgbImage[rgbOffset + 2]; // blue value for pixel
    // perform the rescaling and store it
    // We multiply by floating point constants
    grayImage[grayOffset] = 0.21f*r + 0.71f*g + 0.07f*b;
 }
}

图像模糊

模糊框

img src=”C:\Users\Aerialith\AppData\Roaming\Typora\typora-user-images\image-20210607162547106.png” alt=”image-20210607162547106” style=”zoom: 80%;” />

2D 内核的图像模糊

__global__ void blurKernel(unsigned char * in, unsigned char * out, int w, int h)  {
   int Col  = blockIdx.x * blockDim.x + threadIdx.x;
   int Row  = blockIdx.y * blockDim.y + threadIdx.y;

   if (Col < w && Row < h) {
       ... // Rest of our kernel
   }
 }

__global__ void blurKernel(unsigned char * in, unsigned char * out, int w, int h) {
     int Col  = blockIdx.x * blockDim.x + threadIdx.x;
     int Row  = blockIdx.y * blockDim.y + threadIdx.y;

     if (Col < w && Row < h) {
         int pixVal = 0;
         int pixels = 0;
         // Get the average of the surrounding 2xBLUR_SIZE x 2xBLUR_SIZE box
         for(int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE+1; ++blurRow) {
             for(int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE+1; ++blurCol) {

                 int curRow = Row + blurRow;
                 int curCol = Col + blurCol;
                 // Verify we have a valid image pixel
                 if(curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
                     pixVal += in[curRow * w + curCol];
                     pixels++; // Keep track of number of pixels in the accumulated total
                 }
             }
         }
         // Write our new pixel value out
         out[Row * w + Col] = (unsigned char)(pixVal / pixels);
     }
 }

线程调度

每个块可以相对于其他块以任何顺序执行。
硬件可以随时自由地将块分配给任何处理器
- 内核可扩展到任意数量的并行处理器

示例：执行线程块

线程以块粒度分配给流式多处理器 (SM)
- 在资源允许的情况下，每个 SM 最多 8 个块
- Fermi SM 最多可以占用 1536 个线程
- 可以是 256（线程/块）* 6 块
- 或 512（线程/块）* 3 个块等。
SM 维护线程/块 idx # s
SM 管理/调度线程执行

![image-20210607164925138(C:\Users\Aerialith\AppData\Roaming\Typora\typora-user-images\image-20210607164925138.png

具有 SIMD 单元的 Von-Neumann 模型

![image-20210607164946076(C:\Users\Aerialith\AppData\Roaming\Typora\typora-user-images\image-20210607164946076.png

作为调度单位的 Warp

每个 Block 作为 32 线程 Warps 执行
- 实施决策，不属于 CUDA 编程模型的一部分
- Warps 是 SM 中的基本调度单元
- 未来的 GPU 可能在每个 warp 中有不同数量的线程

![image-20210607165046866(C:\Users\Aerialith\AppData\Roaming\Typora\typora-user-images\image-20210607165046866.png

线程调度

warp 中的线程在 SIMD 中执行
- 选中时，warp 中的所有线程都执行相同的指令
- N 路路径 →1/N 吞吐量（应在同一路径内拓展分支）
SM 实现零开销 warp 调度
- 其下一条指令的操作数已准备好供使用的 Warps 有资格执行
- 根据优先级调度策略选择合格的 Warps 进行执行

![image-20210607165216727(C:\Users\Aerialith\AppData\Roaming\Typora\typora-user-images\image-20210607165216727.png

warp 示例

如果给一个 SM 分配了 3 个块，每个块有 256 个线程，那么一个 SM 中有多少个 Warp？
- 每个 Block 分为 256/32 = 8 Warps
- 有 8 * 3 = 24 个 warp

块粒度注意事项

对于使用多个块的矩阵乘法，我应该为 Fermi 使用 8X8、16X16 还是 32X32 块？
- 对于 8X8，我们每个块有 64 个线程。由于每个 SM 最多可以占用 1536 个线程，这相当于 24 个块。但是，每个 SM 最多只能占用 8 个 Blocks，每个 SM 只能有 512 个线程！
- 对于 16X16，我们每个块有 256 个线程。由于每个 SM 最多可以占用 1536 个线程，因此它最多可以占用 6 个块并实现满容量，除非其他资源考虑无效。
- 对于 32X32，我们每个块有 1024 个线程。费米的 SM 中只能容纳一个块。仅使用 SM 线程容量的 2/3。

cuda2

https://erial21.github.io/2021/06/07/CUDA/cuda2/

作者

Erial

发布于

2021-06-07

更新于

2023-02-16

许可协议

#CUDA

cuda2

多维内核

多维内核配置

示例

处理 2D 网格的图片

C/C++ 中的行优先布局

PictureKernel 的源代码

用于启动 PictureKernel 的主机代码

用 16x16 的块覆盖 62x76 的图片

彩色到灰度图像处理示例

RGB 图像

RGB 转灰度图像

颜色计算公式

RGB 转灰度代码

图像模糊

模糊框

2D 内核的图像模糊

线程调度

示例：执行线程块

具有 SIMD 单元的 Von-Neumann 模型

作为调度单位的 Warp

线程调度

warp 示例

块粒度注意事项

作者

发布于

更新于

许可协议

喜欢这篇文章？打赏一下作者吧

评论

目录

最新文章