slama.dev

GPU Computing

notes notes Icon, released 23. 2. 2023, updated 22. 1. 2024[edit]

Preface

This website contains my lecture notes from a lecture by Kazem Shekofteh from the academic year 2022/2023 (University of Heidelberg). If you find something incorrect/unclear, or would like to contribute, feel free to submit a pull request (or let me know via email).

Lecture overview

  1. Introduction [slides]
  2. CUDA programming [slides]
  3. Basic architecture [slides]
  4. Matrix multiplication optimizations [slides]
  5. Parallel computing [slides]
  6. Profiling [slides]
  7. Scheduling optimizations [slides]
  8. nn-body optimization [slides]
  9. Host-device optimizations [slides]
  10. OpenACC [slides]
  11. Stencil computations [slides]
  12. OpenCL [slides]
  13. Consistency & Coherence [slides]

Bulk-Synchronous Parallel (BSP) model

Scaling rules

Moore’s law

A law about the exponential increase of processing power, has many variants:

Amdahl’s law

We want to find the maximum possible improvement, when given

CUDA programming

A program always consists of two parts:

__global__ void matAdd (float A[N][N], float B[N][N], float C[N][N]) {
	// one thread computes one element
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	
	// compute only if we're within bounds!
	if (i < N && j < N)
		C[i][j] = A[i][j] + B[i][j];
}

int main() {
	// the thread grid and block structure is 2D, 2D
	// adding more/less changes the structure
	dim3 dimBlock(16, 16);
	dim3 dimGrid((N + dimBlock.x  1) / dimBlock.x, (N + dimBlock.y  1) / dimBlock.y);
	matAdd <<<dimGrid, dimBlock>>> (A, B, C);
}

Overview

Thread and memory diagram for CUDA.

Memory

Global/device memory
Shared memory
// static
__shared__ int s[64];

// dynamic, size is third parameter of kernel call
// extern refers to the fact that it's declared elsewhere (kernel)
extern __shared__ int s[];
Registers
Host memory
Code examples

Allocating memory:

float *hMem;
float *dMem;

if (USE_PINNED_MEMORY) {
	// casting to (void**) shouldn't be necessary in newer CUDAs
	// CUDA's malloc doesn't return the pointer, but a status
	// (which in this case is ignored but can be handled)
	cudaMallocHost((void**) &hMem, N * sizeof(float));
} else {
	hMem = (float*) malloc(N * sizeof(float));
}

cudaMalloc((void**)&dMem, N * sizeof(float));

Copying memory:

// host -> device
cudaMemcpy(dMem, hMem, N * sizeof(float), cudaMemcpyHostToDevice);

// device -> host
cudaMemcpy(hMem, dMem, N * sizeof(float), cudaMemcpyDeviceToHost);

Variable declaration

Prefix Location Access from Lifetime
__device__ global memory (device memory) device/host program
__constant__ constant memory (device memory) device/host program
__shared__ shared memory device thread block

Function declaration

Prefix Executed on Callable from
__device__ device device
__global__ device host
__host__ host host

Coalescing

Combining fine-grain access by multiple threads into a single operation.

Offset and stride access graph.

Stride access for Shared memory graph.

Thread scheduling

__global__ badKernel (...) {
	id = threadIdx.x;

	// not a great idea, use < instead!
	if ( id % 32 == 0 )
		out = complex_function_call();
	else
		out = 0;
}

During execution, the hardware schedules blocks to SMs

A SM has multiple warp schedulers that can execute multiple warps concurrently:

Optimizing Matrix Multiplication

Naive (CPU)

__host__ __device__ void GetMatrixValue(int row, int col, float* M, int Width) {
	return M[row * Width + col];
}

__host__ __device__ void SetMatrixValue(int row, int col, float* M, int Width, float val) {
	M[row * Width + col] = val;
}

void MM_CPU (float* M, float* N, float* P, int Width) {
	for (int col = 0; col < Width; ++col) {
		for (int row = 0; row < Width; ++row) {
			float Pvalue = 0;
			for (int k = 0; k < Width; ++k) {
				float Melement = GetMatrixValue(row, k, M, Width);
				float Nelement = GetMatrixValue(k, col, N, Width);
				Pvalue += Melement * Nelement;
			}
			SetMatrixValue(row, col, P, Width, Pvalue);
		}
	}
}

Naive (GPU)

__global__ void MM_NAIVE (float* Md, float* Nd, float* Pd, int Width) {
	float Pvalue = 0;
	float Melement, Nelement;
	int row = threadIdx.y;
	int col = threadIdx.x;

	for (int k = 0; k < Width; ++k) {
		Melement = GetMatrixValue(row, k, Md, Width);
		Nelement = GetMatrixValue(k, col, Nd, Width);
		Pvalue += Melement * Nelement;
	}

	SetMatrixValue(row, col, Pd, Width, Pvalue);
}

Multiple thread blocks (GPU)

__global__ void MM_MTB (float* Md, float* Nd, float* Pd, int Width) {
	float Pvalue = 0;
	float Melement, Nelement;
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;

	for (int k = 0; k < Width; ++k) {
		Melement = GetMatrixValue(row, k, Md, Width);
		Nelement = GetMatrixValue(k, col, Nd, Width);
		Pvalue += Melement * Nelement;
	}
	
	SetMatrixValue(row, col, Pd, Width, Pvalue);
}

Using shared memory (GPU)

Note: I think that the code in the presentation is wrong – the tiling doesn’t make sense for sizes other than the size of the block (otherwise threads are writing out of bounds). I’ve modified it, hopefully it’s somewhat correct.

// here we assume that
// - blockDim.x == blockDim.y and they divide Width,
// - nThreads * nBlocks = width ** 2

TILEWIDTH = 32  // same as blockDim.x and blockDim.y!


__global__ void MM_SM (float* Md, float* Nd, float* Pd, int Width) {
	// is allocated statically for simpler code
	__shared__ float Mds[TILEWIDTH][TILEWIDTH];
	__shared__ float Nds[TILEWIDTH][TILEWIDTH];
	
	float Pvalue = 0
	int tx = threadIdx.x;
	int ty = threadIdx.y;
	int row = blockIdx.y * TILEWIDTH + ty;
	int col = blockIdx.x * TILEWIDTH + tx;
	
	if (row > Width || col > Width)
		return;

	// loop over tiles
	for (int m = 0; m < Width / TILEWIDTH; ++m) {
		// load the tile of both of the arrays
		Mds[ty][tx] = GetMatrixValue(row, m * TILEWIDTH + tx, Md, Width);
		Nds[ty][tx] = GetMatrixValue(m * TILEWIDTH + ty, col, Nd, Width);

		// RAW dependency:
		// we must wait for all of them to finish!
		__syncthreads();

		// do the actual computation
		for (int k = 0; k < TILEWIDTH; ++k)
			Pvalue += Mds[ty][k] * Nds[k][tx];

		// WAR dependency:
		// again wait or some threads will change Mds/Nds
		__syncthreads ();
	}

	SetMatrixValue(row, col, Pd, Width, Pvalue);
}

Parallelism

Various levels of parallelism:

The lecture goes into theoretical parallel algorithm design.

Synchronization

Definition (synchronization): enforcement of a defined logical order between events. This establishes a defined time-relation between distinct places, thus defining their behavior in time.

Profiling

Definition (arithmetic density) , sometimes called computational intensity rr is the ratio between floating point operations and data movements, i.e. r=FLOPsByter = \frac{\mathrm{FLOPs}}{\mathrm{Byte}}

To evaluate a performance, we use the roofline model:

Roofline model illustration.

The lecture goes into CUDA profiling. Here are some important concepts:

Scheduling optimizations

Naive implementation

__global__ void Reduction(int *out, int *in, size_t N) {
	extern __shared__ int sPartials[];
	const int tid = threadIdx.x;

	// each thread loads one element from global to shared mem
	sPartials[tid] = in[blockIdx.x * blockDim.x + threadIdx.x];
	__syncthreads();

	// do reduction in shared mem
	for (unsigned int s = 1; s < blockDim.x; s *= 2) {
		if (tid % ( 2 * s ) == 0)
			sPartials[tid] += sPartials[tid + s];

		__syncthreads();
	}

	if (tid == 0)
		out[blockIdx.x] = sPartials[0];
 }

Reduction: naive implementation diagram.

Interleaved address divergent

for (unsigned int s = 1; s < blockDim.x; s *= 2) {
	int index = 2 * s * tid;

	if (index < blockDim.x)
		sPartials[index] += sPartials[index + s];

	__syncthreads();
}

Reduction: interleaved address divergent diagram.

Resolving bank conflicts

for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
	if (tid < s)
		sPartials[tid] += sPartials[tid + s];

	__syncthreads();
}

Reduction: resolving bank conflicts diagram.

Making use of idle threads

// we have HALF of the threads but each loads DOUBLE
unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

// each thread loads TWO elements from global to shared mem
sPartials[tid] = in[i] + in[i + blockDim.x];
__syncthreads();

Manual unrolling

for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
	if (tid < 0)
		sPartials[tid] += sPartials[tid + s];

	__syncthreads();
}

if (tid < 32 && blockDim.x >= 64) sPartials[tid] = sPartials[tid + 32];
if (tid < 16 && blockDim.x >= 32) sPartials[tid] = sPartials[tid + 16];
if (tid <  8 && blockDim.x >= 16) sPartials[tid] = sPartials[tid +  8];
if (tid <  4 && blockDim.x >=  8) sPartials[tid] = sPartials[tid +  4];
if (tid <  2 && blockDim.x >=  4) sPartials[tid] = sPartials[tid +  2];
if (tid <  1 && blockDim.x >=  2) sPartials[tid] = sPartials[tid +  1];

Overview

Here is an overview of the versions we have implemented so far (the values in the table are GB/s for the given version and TPB):

Version \ TPB 3232 6464 128128 256256 512512 10241024
naive 7.397.39 12.5712.57 16.77\mathbf{16.77} 14.6714.67 12.3312.33 9.059.05
addresses 10.4610.46 18.3318.33 23.88\mathbf{23.88} 18.9618.96 14.514.5 10.0210.02
banks 11.0511.05 19.5419.54 30.83\mathbf{30.83} 27.5127.51 23.6723.67 17.9917.99
first add 21.6821.68 37.1537.15 58.03\mathbf{58.03} 51.3151.31 43.7543.75 33.6633.66
unrolling 22.5922.59 36.9136.91 68.38\mathbf{68.38} 62.3562.35 53.0653.06 43.7843.78

nn-body optimization

I’m not writing the formulas from the slides, this isn’t a physics course.

AOS vs SOA

// AOS
struct {
	float x, y, z;
	float vx, vy, vz;
	float mass;
} p_t;

p_t particles [MAX_SIZE];
struct {
	float x    [MAX_SIZE],
	      y    [MAX_SIZE],
	      z    [MAX_SIZE];
	float vx   [MAX_SIZE],
	      vy   [MAX_SIZE],
	      vz   [MAX_SIZE];
	float mass [MAX_SIZE];
} p_t;

p_t particles;

Naive implementation

__host__ __device__ void bodyBodyInteraction(...) {
	float dx = x1 - x0;
	float dy = y1 - y0;
	float dz = z1 - z0;

	float distSqr = dx*dx + dy*dy + dz*dz;
	distSqr += softeningSquared;

	float invDist = rsqrtf(distSqr);

	float invDistCube = invDist * invDist * invDist;
	float s = mass1 * invDistCube;

	*fx = dx * s;
	*fy = dy * s;
	*fz = dz * s;
}
__global__ void ComputeNBodyGravitation_Naive(...) {
	// outer loop, in case blocks don't fully cover the bodies
	for (int i = blockIdx.x * blockDim.x + threadIdx.x;
	     i < N;
	     i += blockDim.x * gridDim.x)
	{
		float acc[3] = {0};
		float4 self = ((float4 *) posMass)[i];

		// care: also includes interaction between itself!
		// is essentially zero and prevents needless branching
#pragma unroll 16
		for (int j = 0; j < N; j++) {
			float4 other = ((float4 *) posMass)[j];
			float fx, fy, fz;

			bodyBodyInteraction(
				&fx, &fy, &fz,
				self.x, self.y, self.z,
				other.x, other.y, other.z, other.w,
				softeningSquared
			);

			acc[0] += fx;
			acc[1] += fy;
			acc[2] += fz;
		}

		force[3*i+0] = acc[0];
		force[3*i+1] = acc[1];
		force[3*i+2] = acc[2];
	}
}

Shared memory

__global__ void ComputeNBodyGravitation_Shared (...) {
	extern __shared__ float4 sharedPM[];
	for (int i = blockIdx.x * blockDim.x + threadIdx.x;
	     i < N;
	     i += blockDim.x * gridDim.x)
	{
		float acc[3] = {0};
		float4 self = ((float4 *) posMass)[i];
#pragma unroll 32
		for (int j = 0; j < N; j += blockDim.x) {
			// each thread loads its part to the shared memory
			sharedPM[threadIdx.x] = ((float4 *) posMass)[j+threadIdx.x];
			__syncthreads();

			for (size_t k = 0; k < blockDim.x; k++) {
				float fx, fy, fz;
				float4 other = sharedPM[k];

				bodyBodyInteraction(
					&fx, &fy, &fz,
					self.x, self.y, self.z,
					other.x, other.y, other.z, other.w,
					softeningSquared
				);

				acc[0] += fx;
				acc[1] += fy;
				acc[2] += fz;
			}
			__syncthreads();
		}

		force[3*i+0] = acc[0];
		force[3*i+1] = acc[1];
		force[3*i+2] = acc[2];
	}
}

Host-device optimizations

Host-device synchronization

SAXPY without streaming.

SAXPY with streaming.

cudaStream_t stream0, stream1;
cudaStreamCreate ( &stream0 );
cudaStreamCreate ( &stream1 );
float *d_A0, *d_B0, *d_C0;
float *d_A1, *d_B1, *d_C1;

// cudaMallocs go here

for (int i = 0; i < n; i += segSize * 2) {
	// stream 0
	cudaMemCpyAsync ( d_A0, h_A + i, segSize * sizeof(float), ... , stream0 );
	cudaMemCpyAsync ( d_B0, h_B + i, segSize * sizeof(float), ... , stream0 );
	saxpy <<< segSize/256, 256, 0, stream0 >>> ( ... );
	cudaMemCpyAsync ( d_C0, h_C + i, segSize * sizeof(float), ... , stream0 );

	// stream 1
	cudaMemCpyAsync ( d_A1, h_A + i + segSize, segSize * sizeof(float), ..., stream1 );
	cudaMemCpyAsync ( d_B1, h_B + i + segSize, segSize * sizeof(float), ..., stream1 );
	saxpy <<< segSize/256, 256, 0, stream1 >>> ( ... );
	cudaMemCpyAsync ( d_C1, h_C + i + segSize, segSize * sizeof(float), ..., stream1 );
}

Issues

Improved SAXPY streams diagram.

cudaStream_t stream0, stream1;
cudaStreamCreate ( &stream0 );
cudaStreamCreate ( &stream1 );
float *d_A0, *d_B0, *d_C0;
float *d_A1, *d_B1, *d_C1;

// cudaMallocs go here

for (int i = 0; i < n; i += segSize * 2) {
	cudaMemCpyAsync ( d_A0, h_A + i, segSize * sizeof(float), ... , stream0 );
	cudaMemCpyAsync ( d_B0, h_B + i, segSize * sizeof(float), ... , stream0 );
	cudaMemCpyAsync ( d_A1, h_A + i + segSize, segSize * sizeof(float), ..., stream1 );
	cudaMemCpyAsync ( d_B1, h_B + i + segSize, segSize * sizeof(float), ..., stream1 );

	saxpy <<< segSize/256, 256, 0, stream0 >>> ( ... );
	saxpy <<< segSize/256, 256, 0, stream1 >>> ( ... );

	cudaMemCpyAsync ( d_C0, h_C + i, segSize * sizeof(float), ... , stream0 );
	cudaMemCpyAsync ( d_C1, h_C + i + segSize, segSize * sizeof(float), ..., stream1 );
}

Virtual Shared Memory

Productivity

OpenACC

Naive version
int main() {
	int n = 256*1024*1024; float a = 2.0f; float b = 3.0f;

	float* x;
	float* y;

	// allocate & initialize x, y

	for (int i = 0; i < n; ++i)
		y[i] = a*x[i] + y[i];

	for (int i = 0; i < n; ++i)
		y[i] = b*x[i] + y[i];

	//free and cleanup
}
kernels pragma
int main() {
	int n = 256*1024*1024; float a = 2.0f; float b = 3.0f;

	// restrict states that the memory of the pointers doesn't overlap
	// is useful for compiler optimizations
	float* restrict x;
	float* restrict y;

	// allocate & initialize x, y

#pragma acc kernels
{
	for (int i = 0; i < n; ++i)
		y[i] = a*x[i] + y[i];

	for (int i = 0; i < n; ++i)
		y[i] = b*x[i] + y[i];
}

	//free and cleanup
}
parallel loop pragma
int main() {
	int n = 256*1024*1024; float a = 2.0f; float b = 3.0f;

	float* restrict x;
	float* restrict y;

	// allocate & initialize x, y

#pragma acc parallel loop
	for (int i = 0; i < n; ++i)
		y[i] = a*x[i] + y[i];

#pragma acc parallel loop
	for (int i = 0; i < n; ++i)
		y[i] = b*x[i] + y[i];

	//free and cleanup
}
data pragma
int main() {
	int n = 256*1024*1024; float a = 2.0f; float b = 3.0f;

	float* restrict x;
	float* restrict y;

	// allocate & initialize x, y

#pragma acc data copyin(x[0:n]) copy(y[0:n])
{
#pragma acc parallel loop present(x,y)
	for (int i = 0; i < n; ++i)
		y[i] = a*x[i] + y[i];

#pragma acc parallel loop present(x,y)
	for (int i = 0; i < n; ++i)
		y[i] = b*x[i] + y[i];
}

	//free and cleanup
}
Nested loops
// distributes outer loop to n threads
// each thread executes the inner loop sequentially
// block size 256, block count n/256 (rounded up)
#pragma acc parallel vector_length(256)
#pragma acc loop gang vector
for ( int i = 0; i < n; ++i ) {
	for ( int j = 0; j < m; ++j ) {
		// do stuff
	}
}

// same as above but only 16 blocks
// some threads might have to loop multiple times
#pragma acc parallel vector_length(256) num_gangs(16)
#pragma acc loop gang vector
for ( int i = 0; i < n; ++i ) {
	for ( int j = 0; j < m; ++j ) {
		// do stuff
	}
}

// parallelizes both loops
// distributes n outer loops to n blocks
// distributes inner loop to their threads (256/block)
#pragma acc parallel vector_length(256)
#pragma acc loop gang
for ( int i = 0; i < n; ++i ) {
#pragma acc loop vector
	for ( int j = 0; j < m; ++j ) {
		// do stuff
	}
}

Stencil computations

Image processing

Note: I am beyond confused to what the algorithm actually is, this is my best guess:

Partial differential equations

Read the slides, I’m fairly certain this isn’t too important.

Performance optimizations

GPU programming models

OpenCL

Platform model
Execution model
CUDA OpenCL
Grid NDRange
Thread Block Work group
Thread Work item
Thread ID Global ID
Block index Block ID
Thread index Local ID

Different types of kernels exist:

Memory model
CUDA OpenCL
Host memory Host memory
Global/device memory Global memory
Shared memory Local memory
Registers/local memory Private memory

Consistency & Coherence

Ordering problem: threads operate independently: which order to apply?

Cache coherence: two threads write to same variable, which gets written to cache:

Memory consistency: the order in which memory operations appear to be performed