Architecture of GPU and CUDA

In this blog, we will introduce the architecture of the GPU from the programmer's perspective and give some examples of CUDA programming. For more details, you should read CUDA Guide - Nvidia.

 

Intro

In this section, the architecture of the GPU will be introduced from two perspectives: hardware and software.

Hardware

Compared with a CPU, a GPU is specialized for highly parallel computations and is therefore designed such that more transistors are devoted to data processing rather than to data caching and flow control.

  • Devoting more transistors to data processing, e.g., floating-point computations, is beneficial for highly parallel workloads;
  • A GPU can hide memory access latencies with computation, instead of relying on large data caches and complex flow control to avoid long memory access latencies, both of which are expensive in terms of transistors.

From the perspective of hardware, there are a few key terms we need to know, such as the streaming multiprocessor (SM), the execution unit onto which thread blocks are scheduled, and the warp, a group of 32 threads that an SM schedules together.

 

Software

From the perspective of software, there are four key concepts: the grid, the block, the warp, and the thread.

We can conclude that "grid > block > warp > thread": a grid consists of blocks, a block consists of warps, and a warp is a group of 32 threads (see the sketch below).

[Figure: grid, block, and thread]
[Figure: scheduler on SMs]
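To make the hierarchy concrete, here is a minimal sketch (the kernel name and launch sizes are illustrative, not from the original post). Each thread can recover its position from the built-in variables blockIdx, blockDim, and threadIdx; warps are not indexed directly, but since a warp is 32 consecutive threads of a block, a thread's warp can be derived from its index:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: each thread locates itself in the
// grid > block > warp > thread hierarchy using built-in variables.
__global__ void whereAmI() {
    // Global thread ID in a 1-D grid of 1-D blocks.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Warps are groups of 32 consecutive threads within a block.
    int warpInBlock = threadIdx.x / warpSize;
    int laneInWarp  = threadIdx.x % warpSize;
    if (laneInWarp == 0)  // print once per warp to keep the output short
        printf("thread %d: block %d, warp %d\n", tid, blockIdx.x, warpInBlock);
}

int main() {
    whereAmI<<<2, 128>>>();   // a grid of 2 blocks, 128 threads each
    cudaDeviceSynchronize();  // wait for the kernel (and its printf) to finish
    return 0;
}
```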

 

Example: Nvidia V100

As a concrete example, the Tesla V100 contains 80 SMs, and each SM contains 64 FP32 CUDA cores, for a total of 5120 FP32 cores.
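Such per-device numbers can be queried at runtime with cudaGetDeviceProperties. A small sketch (device 0 is assumed):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("device: %s\n", prop.name);
    printf("SM count: %d\n", prop.multiProcessorCount);  // 80 on a V100
    printf("warp size: %d\n", prop.warpSize);            // 32
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```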

Programming Model

At its core are three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization. These are exposed to the programmer as a minimal set of C++ language extensions.

 

Memory Hierarchy

In short, each thread has private local memory, each thread block has shared memory visible to all threads of the block, and all threads have access to the same global memory. We will introduce this in detail with some examples in later blogs.
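Until then, here is a minimal sketch of where each memory space appears in a kernel (the kernel and variable names are illustrative):

```cpp
#include <cuda_runtime.h>

__device__ float globalData[256];   // global memory: visible to all threads,
                                    // lives as long as the application

__global__ void memorySpaces() {
    int i = threadIdx.x;            // automatic variables live in registers
                                    // (or spill to per-thread local memory)
    __shared__ float tile[256];     // shared memory: one copy per block,
                                    // visible to all threads of that block
    tile[i] = globalData[i];
    __syncthreads();                // barrier: wait until every thread of the
                                    // block has written its element
    globalData[i] = tile[(i + 1) % 256];  // safe to read a neighbour's write
}

int main() {
    memorySpaces<<<1, 256>>>();     // one block of 256 threads
    cudaDeviceSynchronize();
    return 0;
}
```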

 

Thread Hierarchy

The index of a thread and its thread ID relate to each other in a straightforward way: for a one-dimensional block they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of the thread with index (x, y) is x + y * Dx; for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of the thread with index (x, y, z) is x + y * Dx + z * Dx * Dy.

As an example, the following code adds two matrices A and B of size N x N and stores the result into matrix C:
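A sketch in the spirit of the MatAdd example from the CUDA Programming Guide, using one block of N x N threads (managed memory and N = 16 are illustrative choices to keep the host-side setup short):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

#define N 16  // N * N = 256 threads, well within the per-block limit

// Kernel definition: the thread with index (i, j) computes C[i][j].
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main() {
    // Managed (unified) memory is accessible from both host and device.
    float (*A)[N], (*B)[N], (*C)[N];
    cudaMallocManaged((void**)&A, N * N * sizeof(float));
    cudaMallocManaged((void**)&B, N * N * sizeof(float));
    cudaMallocManaged((void**)&C, N * N * sizeof(float));
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) { A[i][j] = 1.0f; B[i][j] = 2.0f; }

    // Kernel invocation with one block of N * N * 1 threads.
    dim3 threadsPerBlock(N, N);
    MatAdd<<<1, threadsPerBlock>>>(A, B, C);
    cudaDeviceSynchronize();

    printf("C[0][0] = %f\n", C[0][0]);  // expect 3.0
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```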

There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.
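To add matrices larger than one block allows, a kernel is launched over a grid of many blocks, and each thread derives its global position from blockIdx as well as threadIdx. A sketch (only the kernel and the launch change relative to the version above; the bounds check lets the grid over-cover the matrix):

```cpp
// Each thread computes one element of C; the grid supplies enough
// 16 x 16 blocks to cover the whole N x N matrix.
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

// Launch with a grid of ceil(N/16) x ceil(N/16) blocks of 16 x 16 threads.
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((N + 15) / 16, (N + 15) / 16);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
```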

 

Host and Device

There are two roles in a CUDA program: the host (the CPU and its memory) and the device (the GPU and its memory).

The host and the device execute in parallel: by default, the host will NOT wait for the device to finish its work after launching a kernel. If we want the host to wait for the device to finish the kernel functions, cudaDeviceSynchronize() should be called in host code.
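A small sketch of this behaviour (the busy-loop kernel is illustrative, not from the original post):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void slowKernel() {
    // Busy-wait so the asynchrony is observable.
    for (volatile int i = 0; i < 1000000; ++i) {}
}

int main() {
    slowKernel<<<1, 1>>>();
    // The launch returns immediately; the kernel may still be running here.
    printf("host: kernel launched, not necessarily finished\n");

    cudaDeviceSynchronize();  // block until all preceding device work is done
    printf("host: kernel has finished\n");
    return 0;
}
```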

 

Example: Vector Addition

Here are some naive examples of CUDA. Generally speaking, there are three steps to writing a CUDA program: (1) copy the input data from host memory to device memory, (2) launch the kernel to do the computation on the device, and (3) copy the results back from device memory to host memory.

Here is an example of vector addition.
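A complete sketch following the three steps above (the array size, kernel name, and launch configuration are illustrative):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host buffers.
    float* h_a = (float*)malloc(bytes);
    float* h_b = (float*)malloc(bytes);
    float* h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Step 1: copy the input data from host memory to device memory.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Step 2: launch the kernel to do the computation on the device.
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Step 3: copy the results back from device memory to host memory.
    // (cudaMemcpy waits for the kernel on the default stream to finish.)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);  // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```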

 

 

References

CUDA C++ Programming Guide, NVIDIA. https://docs.nvidia.com/cuda/cuda-c-programming-guide/