4.3 Mapping Threads to Data Elements

In the last section we mentioned that the key new idea in CUDA programming is that the programmer is responsible for:

1. setting up the grid of blocks of threads and
2. determining a mapping of those threads to elements in 1D, 2D, or 3D arrays.

We briefly saw task 1 (setting up grids with blocks) in the previous section, through the use of the dim3 data structure. Now we will examine more examples using dim3, then combine that with task 2, which is to map the threads within the blocks within the grid to data elements in arrays.

1D grid of 1D blocks of threads

Filename: 1-basics/1.2-dim3/dim3Demo1D1D.cu

The following example creates a 1-dimensional grid of 2 blocks. Each block is also 1-dimensional and contains 8 threads, for 16 threads in total:

#include <stdio.h>

// CUDA runtime
#include <cuda_runtime.h>

__global__ void hello() {
    // special built-in variables available to each thread in a kernel
    // or device function:
    // blockIdx    the x, y, z coordinate of the block in the grid
    // threadIdx   the x, y, z coordinate of the thread in the block
    printf("I am thread (%d, %d, %d) of block (%d, %d, %d) in the grid\n",
           threadIdx.x, threadIdx.y, threadIdx.z,
           blockIdx.x, blockIdx.y, blockIdx.z);
}

// Note that this is called from the host, not the GPU device.
// We create dim3 structs there and can print their components
// with this function.
void printDims(dim3 gridDim, dim3 blockDim) {
    printf("Grid Dimensions : [%d, %d, %d] blocks. \n",
           gridDim.x, gridDim.y, gridDim.z);

    printf("Block Dimensions : [%d, %d, %d] threads.\n",
           blockDim.x, blockDim.y, blockDim.z);
}

int main(int argc, char **argv) {

    // dim3 is a special data type: a vector of 3 integers.
    // each integer is accessed using .x, .y and .z

    // 1 dimensional case is the following:
    // 1D grid of 2 1D blocks
    dim3 gridDim(2);      // 2 blocks in x direction, y, z default to 1
    dim3 blockDim(8);     // 8 threads per block in x direction

    printDims(gridDim, blockDim);

    printf("From each thread:\n");
    hello<<<gridDim, blockDim>>>();
    cudaDeviceSynchronize();     // needed for printfs in kernel to flush

    return 0;
}

If main creates a 1D grid with 2 blocks of 8 threads, each thread can still compute a unique thread number, from 0 through 15, that can be used as an index into an array of 16 data values. Here is the code; look at the grid setup in main:

Filename: 1-basics/1.3-1DBlockPrint/print2Blocks.cu

// System includes
#include <stdio.h>
#include <assert.h>

#include <cuda_runtime.h>

// Given a 1 dimensional grid of blocks of threads,
// determine my thread number.
// This is run on the GPU on each thread.
__device__ int find1DThreadNumber() {
   // illustrative variable names
   int threadsPerBlock_horizontal = blockDim.x;
   int gridBlockNumber = blockIdx.x;

   int threadNumber = (gridBlockNumber * threadsPerBlock_horizontal) + threadIdx.x;
   return threadNumber;
}

// Print information about a thread running this function.
// This is run on the GPU on each thread.
__global__ void hellofromDevice1D(int val) {

   int threadNumber = find1DThreadNumber();   // device function call
   printf("[b%d of %d, t%d]:\tValue sent to kernel function is:%d\n",
               blockIdx.x, gridDim.x,
               threadNumber, val);
}

int main(int argc, char **argv) {

   //////////////////////////////////////////////////////////////
   //    Each block that you specify maps to an SM.
   //////////////////////////////////////////////////////////////

   printf("1D grid of blocks\n");
   // 2 blocks of 8 threads each goes to 2 SMs
   dim3 gridDim1(2), blockDim1(8);
   // TODO: try 8 blocks of 8 threads each. What do you observe?

   hellofromDevice1D<<<gridDim1, blockDim1>>>(1);

   cudaDeviceSynchronize();         // comment out and re-make and run
}

Note

Some other new ideas from this code are the following:

CUDA kernel functions that run on the device can have parameters, whose values are passed in from the host code that launches them.
A kernel function called from host code, which we learned is designated by the keyword __global__, can call other functions that also run on the device. These functions are designated with the keyword __device__, such as the function find1DThreadNumber() given above.

The situation from the above code is depicted in Figure 4-6, where the thread numbers computed by the function find1DThreadNumber, and printed in the program's output as t0, t1, t2, etc., are mapped to indices of an array containing 16 elements. Compare the function, repeated here, to Figure 4-6.

Function to obtain array index using information about 1D grid of 1D blocks

// Given a 1 dimensional grid of blocks of threads, 
// determine my thread number.
// This is run on the GPU on each thread.
__device__ int find1DThreadNumber() {
  // illustrative variable names
  int threadsPerBlock_horizontal = blockDim.x;
  int gridBlockNumber = blockIdx.x;

  int threadNumber;
  threadNumber = (gridBlockNumber * threadsPerBlock_horizontal) + threadIdx.x;
  return threadNumber;
}

Figure 4-6: 1D grid of 1D blocks of threads mapped to array indexes
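As a worked check of the mapping in Figure 4-6: with 8 threads per block, thread 3 of block 1 gets thread number 1 * 8 + 3 = 11. The sketch below is a minimal illustration and is not one of the book's example files. The kernel name showMapping and the helper globalIndex are hypothetical names introduced here. It launches the same 2 blocks of 8 threads and has each thread print the array index it would own.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original example files): the same
// arithmetic as find1DThreadNumber, with the inputs passed explicitly
// so that it can also be called from host code.
__host__ __device__ int globalIndex(int blockIndex, int threadsPerBlock,
                                    int threadIndex) {
    return (blockIndex * threadsPerBlock) + threadIndex;
}

// Each thread reports which array index it maps to.
__global__ void showMapping() {
    int block = blockIdx.x;
    int thread = threadIdx.x;
    int index = globalIndex(block, blockDim.x, thread);
    printf("block %d, thread %d -> array index %d\n", block, thread, index);
}

int main() {
    // Same configuration as Figure 4-6: 2 blocks of 8 threads.
    showMapping<<<2, 8>>>();
    cudaDeviceSynchronize();   // wait so the kernel's printf output is flushed

    // The same formula evaluated on the host for one thread:
    // block 1, thread 3, 8 threads per block.
    printf("host check: %d\n", globalIndex(1, 8, 3));
    return 0;
}
```

Lines printed by different blocks may appear in either order, but every index from 0 through 15 should appear exactly once.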

Warning

The function called find1DThreadNumber is sufficient to calculate an index into a 1-dimensional array of any length when using a 1D grid of 1D blocks. As a programmer, you must determine the grid and block sizes from the length of the array and ensure that you don’t go out of the bounds of the array. You will see how this is done next when we look at an example of vector addition from linear algebra.
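One common way to meet both requirements in this warning is to round the number of blocks up, so that there is at least one thread per element, and to have each thread compare its thread number against the array length before doing any work. The sketch below is an illustration under assumptions, not the vector addition example mentioned above: the names blocksNeeded and guardedHello and the length 20 are invented here, and the kernel only prints instead of touching a real array.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: the smallest number of blocks whose threads cover
// n elements (integer ceiling of n / threadsPerBlock).
int blocksNeeded(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Each thread computes its thread number as in find1DThreadNumber, then
// acts only if that number is a valid index for an array of length n.
__global__ void guardedHello(int n) {
    int threadNumber = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (threadNumber < n) {
        printf("thread %d would handle element %d\n",
               threadNumber, threadNumber);
    }
    // Threads with threadNumber >= n do nothing.
}

int main() {
    int n = 20;                 // assumed array length, not a multiple of 8
    int threadsPerBlock = 8;
    int numBlocks = blocksNeeded(n, threadsPerBlock);

    dim3 gridDim1(numBlocks), blockDim1(threadsPerBlock);
    printf("%d blocks of %d threads for %d elements\n",
           numBlocks, threadsPerBlock, n);

    guardedHello<<<gridDim1, blockDim1>>>(n);
    cudaDeviceSynchronize();    // wait so the kernel's printf output is flushed
    return 0;
}
```

With these values, 3 blocks (24 threads) are launched and the last 4 threads fail the bounds check, so no thread would index past element 19.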

Exercises

4.3-1: Try a few more blocks: Change the code for print2Blocks.cu to use more than 2 blocks, such as 4 (don’t try too many, because of all the printing that will happen, some of which may not appear in the output). What do you observe about the numbering for each thread?

4.3-2: Functions annotated with the __device__ qualifier must be called from a kernel function or another device function.

True.
Yes! The qualifier signifies code that gets called from a running section of device code, which can be a starting kernel function or another device function.
False.
The qualifier signifies code that gets called from a running section of device code.

2D grid of 2D blocks of threads

2D grids of 2D blocks of threads are useful for applications that use 2-dimensional arrays, or matrices. We will look at them in the next chapter, which covers applications.
