2.2. CUDA Programming Model

There is a lot of useful discussion of optimizing code for HPC in the CUDA Programming Guide.

2.2.1. Warp Speed Programming

All NVIDIA devices to date have 32 threads per warp (exposed in device code as the built-in warpSize variable). This layout is important, since the hardware issues one instruction to the entire warp, which means 32 floating point operations can execute simultaneously in SIMD (more precisely, SIMT) fashion.
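
As a minimal sketch (the kernel name and launch configuration are illustrative, not from the guide), a thread can recover its warp and lane indices from the built-in warpSize and threadIdx variables:

#include <cstdio>

// Illustrative kernel: compute which warp a thread belongs to and its lane within it.
__global__ void warp_info() {
    int lane = threadIdx.x % warpSize; // position within the warp (0..31)
    int warp = threadIdx.x / warpSize; // which warp of the block
    if (lane == 0)
        printf("block %d, warp %d active\n", blockIdx.x, warp);
}

int main() {
    warp_info<<<2, 64>>>();  // 2 blocks of 64 threads = 2 warps per block
    cudaDeviceSynchronize(); // wait for the kernel and flush device printf output
    return 0;
}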

2.2.2. Device Intrinsics

The CUDA programming guide documents several important math intrinsics that can speed up your CUDA code. One of the most important is fma(x, y, z) (fmaf for single precision), which computes x*y + z as a single fused instruction, at the same speed as a single multiplication and without an intermediate rounding step.

Other intrinsics are listed in the CUDA Math API documentation.
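
As a small sketch (the kernel name, coefficients, and data layout are illustrative, not from the guide), fused multiply-adds chain naturally when evaluating a polynomial with Horner's rule:

// Illustrative kernel: evaluate 0.5*t^3 + 1.0*t^2 + 2.0*t + 3.0 with Horner's rule;
// each step compiles to a single fused multiply-add (FFMA) instruction.
__global__ void horner(size_t n, const float *x, float *out) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = x[i];
        float p = fmaf(0.5f, t, 1.0f); // 0.5*t + 1.0
        p = fmaf(p, t, 2.0f);          // p*t + 2.0
        p = fmaf(p, t, 3.0f);          // p*t + 3.0
        out[i] = p;
    }
}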

2.2.3. Intra-Warp Communication

Threads within a warp can operate in a tightly coupled way by exchanging register values directly through warp-level primitives such as __shfl_sync and __shfl_down_sync.

This is useful not only inside loops: it can be used to implement a reduction by hand, or to coordinate the warp so that it acts as a single group, as in the sketch below.
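
Here is an illustrative sketch (function and kernel names are assumptions) of a warp-wide sum using the __shfl_down_sync primitive, with no shared memory involved:

#include <cstdio>

// Illustrative warp-level sum: after the loop, lane 0 holds the total of all
// 32 lanes' values. The mask 0xffffffff means every lane participates.
__device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void reduce_demo() {
    // Each lane contributes its lane index; the sum of 0..31 is 496.
    float sum = warp_reduce_sum((float)(threadIdx.x % warpSize));
    if (threadIdx.x % warpSize == 0)
        printf("warp %d sum = %g\n", threadIdx.x / warpSize, sum);
}

int main() {
    reduce_demo<<<1, 64>>>(); // one block, two warps
    cudaDeviceSynchronize();
    return 0;
}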

2.2.4. For Loop Example

Here is a minimal example that runs Y += A*X (the SAXPY operation) using a CUDA kernel. It shows the grid-stride loop pattern:

__global__ void saxpy(size_t n, float a, float *x, float *y) {
    // grid-stride loop: each thread handles i, i + stride, i + 2*stride, ...
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    int numSMs; // the V100 has 80 SMs, each with 64 FP32 "CUDA cores"
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
    const size_t N = (1 << 20) + 3; // 1M elements plus 3
    float *x, *y;
    cudaMalloc(&x, 2*N*sizeof(float));
    y = x + N; // Note: this is inefficient, since y is 4-byte
               // but not 16-byte aligned ((uintptr_t)y & 0x0F == 3*4),
               // which rules out vectorized (float4) accesses
    saxpy<<<32*numSMs, 32>>>(N, 2.0f, x, y);
    cudaDeviceSynchronize(); // wait for the kernel to finish
    cudaFree(x);
    return 0;
}
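
Assuming the listing is saved as, say, saxpy.cu, it can be compiled and run with nvcc, for example: nvcc -o saxpy saxpy.cu && ./saxpy. Note that x and y are left uninitialized here; the example only demonstrates the launch configuration.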

2.2.5. TODO

  • CMakeLists.txt discussion
  • CUDA with Fortran