
CUDA Programming Cheat Sheet (DRAFT)

NVIDIA CUDA C Programming

This is a draft cheat sheet. It is a work in progress and is not finished yet.

CUDA Kernels

A CUDA kernel function is defined using the __global__ keyword.
A kernel is executed N times in parallel by N different threads on the device.
Each thread has a unique ID within its block, stored in the built-in threadIdx variable, a struct with components x, y, z.
Each thread block has a unique ID within the grid, stored in the built-in blockIdx variable, a struct with components x, y, z.
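
As a sketch, a minimal kernel using these built-in IDs might look like this (the kernel name and parameters are illustrative):

```cuda
// Illustrative kernel: scale each element of a vector v of length n by s.
__global__ void scale(float *v, float s, int n)
{
    // Global thread ID built from block ID, block size and thread ID.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)        // guard: the grid may contain more threads than elements
        v[i] *= s;
}
```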

Kernel Configuration

Kernel Execution Configuration
kernelFunction<<<num_blocks, num_threads>>>(params)
num_blocks: the number of thread blocks along each dimension of the grid.
num_threads: the number of threads along each dimension of the thread block.
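
For example (a sketch; kernelFunction and params are placeholders), a common pattern is to fix the block size and derive the grid size, or to use dim3 for multi-dimensional configurations:

```cuda
int n = 1 << 20;                                       // problem size (illustrative)
int num_threads = 256;                                 // threads per block
int num_blocks = (n + num_threads - 1) / num_threads;  // ceiling division
kernelFunction<<<num_blocks, num_threads>>>(params);

// Multi-dimensional configurations use dim3:
dim3 block(16, 16);                                    // 16x16 = 256 threads per block
dim3 grid(64, 64);                                     // 64x64 blocks
kernelFunction<<<grid, block>>>(params);
```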

CUDA Thread Organization

Threads are grouped into blocks, which can be organized in 1 to 3 dimensions.
Blocks are grouped into grids, which can also be organized in 1 to 3 dimensions.
Blocks are executed independently of one another.

1D Grid of 1D Blocks

int index = blockIdx.x * blockDim.x + threadIdx.x;

1D Grid of 3D Blocks

int index = blockIdx.x * blockDim.x * blockDim.y * blockDim.z + threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;

2D Grid of 2D Blocks applied on a Matrix

The index of each thread is identified by two coordinates i and j.
We find i by applying the 1D-grid-of-1D-blocks rule along the x axis:
int i = blockIdx.x * blockDim.x + threadIdx.x;
and j by applying the same rule along the y axis:
int j = blockIdx.y * blockDim.y + threadIdx.y;
Since each row of the grid holds gridDim.x * blockDim.x threads, the linear index is:
int index = j * gridDim.x * blockDim.x + i;
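
Putting the three formulas together, a sketch of a kernel indexing a row-major matrix (names are illustrative):

```cuda
// Each thread computes its (i, j) coordinates and the linear index.
__global__ void touch(int *mat, int rows, int cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // column coordinate
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // row coordinate
    if (i < cols && j < rows) {
        int index = j * gridDim.x * blockDim.x + i;  // row width of the grid
        mat[index] = index;
    }
}
```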

CUDA Events

Declaring a CUDA event
cudaEvent_t event;
Allocating the event
cudaEventCreate(&event);
Recording the event (enqueuing it in a stream)
cudaEventRecord(event, stream);
Synchronizing on the event (blocking the host until it is reached)
cudaEventSynchronize(event);
Finding the elapsed time in milliseconds between two events a and b
cudaEventElapsedTime(&elapsed, a, b);
Freeing the event variable
cudaEventDestroy(event);
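
These calls combine into the usual timing pattern (a sketch; the launched work is a placeholder):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);        // enqueue "start" in the default stream
// ... launch the kernel(s) to be timed ...
cudaEventRecord(stop, 0);         // enqueue "stop" after the work
cudaEventSynchronize(stop);       // block the host until "stop" is reached

float elapsed;                    // milliseconds
cudaEventElapsedTime(&elapsed, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```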

CUDA Streams

GPU operations in CUDA are submitted to execution queues called streams.
Operations pushed into a stream are executed in FIFO order.
There is a default stream, called stream 0.
Operations pushed into a non-default stream start only after the preceding operations in the default stream have completed.
Operations assigned to the default stream introduce implicit synchronization barriers among the other streams.

CUDA Streams API

Creating a stream
cudaStreamCreate(&stream);
Deallocating a stream
cudaStreamDestroy(stream);
Blocking the host until all operations in a stream are completed
cudaStreamSynchronize(stream);
Streams can be used to obtain concurrent execution of the same kernel or of different kernels.
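
A sketch of concurrent execution with two non-default streams (kernelA, kernelB and their parameters are placeholders):

```cuda
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Fourth launch parameter selects the stream; the kernels may overlap.
kernelA<<<num_blocks, num_threads, 0, s1>>>(paramsA);
kernelB<<<num_blocks, num_threads, 0, s2>>>(paramsB);

cudaStreamSynchronize(s1);        // block host until s1 is drained
cudaStreamSynchronize(s2);

cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```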

Synchr­oni­zation operations

Explicit Synchronization
cudaDeviceSynchronize() blocks host code until all operations on the device are completed.
cudaStreamWaitEvent(stream, event) blocks all operations assigned to a stream until the event is reached.

Implicit Synchronization
Operations assigned to the default stream.
Memory allocations on the device.
Setting operations on the device.
Page-locked memory allocations.
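
For example, cudaStreamWaitEvent can order work across two streams without blocking the host (a sketch; producer, consumer and buf are placeholders):

```cuda
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

cudaEvent_t done;
cudaEventCreate(&done);

producer<<<num_blocks, num_threads, 0, s1>>>(buf);  // writes buf in s1
cudaEventRecord(done, s1);                          // mark this point in s1

cudaStreamWaitEvent(s2, done, 0);                   // s2 waits for the event
consumer<<<num_blocks, num_threads, 0, s2>>>(buf);  // safe to read buf in s2
```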



Memory Workflow

First we allocate and "build" the input on the host.
Then we allocate dynamic memory on the device, obtaining pointers to the allocated memory areas.
Finally, we initialize the memory on the device and copy the memory from the host to the device.

At the end of the computation, we may want to copy the memory from the device back to the host.
The copy operation is blocking.
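
The workflow above, sketched for a vector of N doubles (names are illustrative):

```cuda
double *u_host = (double *) malloc(N * sizeof(double));
for (int i = 0; i < N; i++) u_host[i] = 1.0;        // "build" the input on the host

double *u_dev;
cudaMalloc((void **) &u_dev, N * sizeof(double));   // allocate on the device
cudaMemcpy(u_dev, u_host, N * sizeof(double),
           cudaMemcpyHostToDevice);                 // host -> device (blocking)

// ... run kernels on u_dev ...

cudaMemcpy(u_host, u_dev, N * sizeof(double),
           cudaMemcpyDeviceToHost);                 // device -> host (blocking)
cudaFree(u_dev);
free(u_host);
```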

Memory Allocation API Functions

Dynamic memory allocation
cudaMalloc((void **) &u_dev, N*sizeof(double));
u_dev is the pointer to the allocated variable.
Memory initialization on the device
cudaMemset(void *devPtr, int val, size_t count);
devPtr is a pointer to the device address space. The function fills the first count bytes of the memory area with the constant byte value val.
Copying data from host to device
cudaMemcpy(void *dst, void *src, size_t size, cudaMemcpyHostToDevice);
dst is the destination address, src is the source address, size is the size in bytes of the data to copy, and the last parameter is the direction of the copy.
Copying data from device to host
cudaMemcpy(void *dst, void *src, size_t size, cudaMemcpyDeviceToHost);
Since CUDA 4.0, CUDA supports Unified Virtual Addressing, meaning that the system itself knows where the buffer is allocated. The direction parameter can then be set to cudaMemcpyDefault.

Global Memory

Declaring a static variable
__device__ type variable_name;
Declaring a dynamic variable
cudaMalloc((void **) &ptr, size);
Deallocating a dynamic variable
cudaFree(ptr);
Allocating an aligned 2D buffer where elements are padded so that each row is aligned
cudaMallocPitch(&ptr, &pitch, width*sizeof(float), height);
cudaMallocPitch returns a pitch in bytes that must be used to access rows with strided accesses. For example:
float *row = (float *)((char *)devPtr + r * pitch);
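
A sketch of pitched allocation and row access (width, height, r and c are illustrative):

```cuda
float *devPtr;
size_t pitch;                                   // row stride in bytes, chosen by the runtime
cudaMallocPitch((void **) &devPtr, &pitch,
                width * sizeof(float), height);

// Rows must be addressed through the byte pitch, e.g. inside a kernel:
float *row = (float *)((char *)devPtr + r * pitch);
float element = row[c];                         // element (r, c)
```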

Shared Memory

Static variable declaration inside the kernel
__shared__ type shmem[SIZE];
Dynamic variable declaration (its size is given by the third kernel-launch parameter)
extern __shared__ type shmem[];
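
A sketch of a kernel staging data through dynamically sized shared memory; the size is supplied as the third launch parameter (names are illustrative):

```cuda
__global__ void staged_copy(float *out, const float *in, int n)
{
    extern __shared__ float shmem[];   // sized by the third <<<...>>> parameter
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) shmem[threadIdx.x] = in[i];
    __syncthreads();                   // every thread of the block reaches the barrier
    if (i < n) out[i] = shmem[threadIdx.x];
}

// Launch with one float of shared memory per thread:
// staged_copy<<<num_blocks, num_threads, num_threads * sizeof(float)>>>(out, in, n);
```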

Constant memory

Declaring a static variable
__constant__ type variable_name;
Copying memory from host to device
cudaMemcpyToSymbol(variable_name, &host_src, sizeof(type), 0, cudaMemcpyHostToDevice);
We cannot declare a dynamic variable in constant memory.

Texture Memory

Managing texture memory
Allocate global memory on the device
cudaMalloc(&M, memsize);
Create a texture reference
texture<datatype, dim> MtextureRef;
Create a channel descriptor
cudaChannelFormatDesc Mdesc = cudaCreateChannelDesc<datatype>();
Bind the texture reference to the memory
cudaBindTexture(0, MtextureRef, M, Mdesc);
Unbind at the end
cudaUnbindTexture(MtextureRef);
To access the texture memory, we read through the texture reference MtextureRef.
Accessing linear memory
tex1Dfetch(MtextureRef, address);
Accessing a 2D CUDA array
tex2D(MtextureRef, x, y);
Accessing a 3D CUDA array
tex3D(MtextureRef, x, y, z);

Asynch­ronous Data Transfers

Allocating page-locked memory on the host
cudaMallocHost(&buffer, size);
Freeing page-locked memory
cudaFreeHost(buffer);
Registering an existing host memory range for use by CUDA
cudaHostRegister(ptr, size, flags);
Unregistering a memory range that was registered with cudaHostRegister
cudaHostUnregister(ptr);
Copying data between host and device asynchronously
cudaMemcpyAsync(dst_buffer, src_buffer, size, direction, stream);
These operations must be queued into a non-default stream.
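
A sketch combining pinned memory, an asynchronous copy and a stream (process is a placeholder kernel):

```cuda
float *h_buf, *d_buf;
cudaMallocHost((void **) &h_buf, size);            // page-locked host buffer
cudaMalloc((void **) &d_buf, size);

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyAsync(d_buf, h_buf, size,
                cudaMemcpyHostToDevice, stream);   // returns immediately
process<<<num_blocks, num_threads, 0, stream>>>(d_buf);  // same stream: runs after the copy

cudaStreamSynchronize(stream);                     // wait for copy + kernel
cudaStreamDestroy(stream);
cudaFreeHost(h_buf);
cudaFree(d_buf);
```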

Page-l­ocked Memory

Pageable memory can be paged in or out by the operating system, whereas page-locked (pinned) memory cannot.

Paging out moves data from RAM to disk, while paging in moves data from disk back to RAM. These operations occur when the main memory does not have enough free space.

Error Handling

All CUDA API functions return an error code of type cudaError_t.
The constant cudaSuccess means no error.
cudaGetLastError() returns the status of the internal error variable. Calling this function resets the internal error to cudaSuccess.

Macro for Error Handling

#define CUDA_CHECK(X) {\
cudaError_t _m_cudaStat = X;\
if(cudaSuccess != _m_cudaStat) {\
fprintf(stderr,"\nCUDA_ERROR: %s in file %s line %d\n",\
cudaGetErrorString(_m_cudaStat), __FILE__, __LINE__);\
} }
CUDA_CHECK( cudaMemcpy(d_buf, h_buf, buffSize,
cudaMemcpyHostToDevice) );