Course information
- Marks breakdown: two assignments worth 25 + 25, exam worth 50
- Implementation note: start the practicals early
- Exam focus: the regular practicals
GPU Architecture
factors influencing performance
CPUs: clock speed
- clock speed determines the processing rate per compute element
- power consumed is proportional to f * V<sup>2</sup>
- increasing frequency f while decreasing voltage V kept power manageable
- but V cannot be decreased much further, so raising f is no longer an option
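A quick worked example of the scaling (illustrative numbers, not from the lecture): with power proportional to f * V<sup>2</sup>, doubling f at a fixed V doubles the power, but doubling f while dropping V to 0.8V gives 2 * 0.8<sup>2</sup> ≈ 1.28 times the power, which is why past frequency increases relied on voltage reductions.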
CPUs: memory latency
- cache hierarchies
- large on-chip caches to stage data
- L1, L2, L3, controllers...
- other latency hiding measures
- hardware multithreading
- out of order execution
CPUs: memory bandwidth
- CPUs generally use commodity Double Data Rate (DDR) RAM
CPUs: parallelism
- use more cores
- limited clock speed keeps power consumption per core manageable
GPUs
- dedicated graphical processing units, originally designed for rendering
- balance between floating point performance and memory bandwidth
GPUs: clock speed
- underlying clock frequency relatively modest
- performance based on parallelism
GPUs: memory latency
- schedule many independent tasks
- if one task encounters a long-latency operation
- it is swapped out and another task is scheduled in its place
scientific applications
- GPUs good for
- many independent tasks
- significant data parallelism
- structured code with identifiable kernels
- not good for
- highly coupled problems with little parallelism
- IO-dominated problems
- poorly structured code with diffuse computational intensity
CUDA Programming
streaming multiprocessors
- a two-level hierarchy
- many streaming multiprocessors each with many cores
- exact numbers depend on particular hardware
dim3 structure
struct {
    unsigned int x;
    unsigned int y;
    unsigned int z;
};
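A minimal sketch of how dim3 is used to configure a kernel launch; the kernel name, data pointer and problem sizes are illustrative, not from the lecture:

// illustrative launch configuration: a 2D grid of 2D blocks
dim3 blockSize(16, 16);                               // 256 threads per block; unset z defaults to 1
dim3 gridSize((nx + 15) / 16, (ny + 15) / 16);        // enough blocks to cover an nx x ny domain
myKernel<<<gridSize, blockSize>>>(d_data, nx, ny);    // myKernel, d_data, nx, ny are hypothetical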
synchronisation between host and device
- kernel launches are asynchronous
- return immediately on the host
- synchronisation required to ensure completion
- errors can appear asynchronously
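A minimal sketch of host-device synchronisation and error checking after an asynchronous launch (kernel name and launch configuration are illustrative; assumes stdio.h is included):

myKernel<<<gridSize, blockSize>>>(d_data, n);   // returns immediately on the host
cudaError_t err = cudaGetLastError();           // catches launch-time errors
if (err != cudaSuccess) printf("launch: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();                  // blocks until the kernel has completed
if (err != cudaSuccess) printf("kernel: %s\n", cudaGetErrorString(err));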
synchronisation on the device
- synchronisation between threads in the same block is possible
- allows coordination of actions in shared memory
- allows reductions (see the sketch below)
- not possible to synchronise between blocks
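A minimal sketch of block-level coordination with __syncthreads(), here a shared-memory reduction; the names and the fixed block size of 256 are assumptions:

__global__ void blockSum(const double *in, double *blockSums, int n) {
    __shared__ double tmp[256];                  // assumes blockDim.x == 256 (a power of two)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tmp[threadIdx.x] = (i < n) ? in[i] : 0.0;
    __syncthreads();                             // all threads have written their element
    for (int s = blockDim.x / 2; s > 0; s /= 2) {
        if (threadIdx.x < s) tmp[threadIdx.x] += tmp[threadIdx.x + s];
        __syncthreads();                         // barrier applies to this block only
    }
    if (threadIdx.x == 0) blockSums[blockIdx.x] = tmp[0];   // one partial sum per block
}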
memory management
- host and device have separate address spaces
- data accessed by kernel must be in the device memory
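A minimal sketch of the usual explicit allocate-copy-compute-copy pattern (array name and size n are illustrative):

double *h_a = (double *) malloc(n * sizeof(double));    // host allocation
double *d_a = NULL;
cudaMalloc(&d_a, n * sizeof(double));                   // device allocation
cudaMemcpy(d_a, h_a, n * sizeof(double), cudaMemcpyHostToDevice);
/* ... launch kernels that read/write d_a ... */
cudaMemcpy(h_a, d_a, n * sizeof(double), cudaMemcpyDeviceToHost);
cudaFree(d_a);
free(h_a);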
Performance Optimisation
copying between host and device
- separate memory spaces mean some copying inevitable
- so unnecessary copies must be avoided
- keep data resident on the device
- may involve moving all relevant code to the device
- prefer recalculation / extra computation over data communication
occupancy and latency hiding
- work decomposed and distributed between threads
- actually want N<sub>threads</sub> >> N<sub>cores</sub>
- surplus threads hide the latency of accesses to main memory
memory coalescing
- GPUs have a high peak memory bandwidth
- only achieved when accesses are coalesced
- i.e. consecutive threads access consecutive memory locations
- if not, accesses may be serialised
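A minimal sketch contrasting coalesced and strided access patterns (array names and the stride parameter are illustrative):

__global__ void coalesced(const float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = a[i];           // thread i touches element i: consecutive threads,
}                                     // consecutive addresses, so few wide transactions

__global__ void strided(const float *a, float *b, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) b[i * stride] = a[i * stride];   // neighbouring threads hit addresses
}                                                        // 'stride' apart, so many transactions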
grid-stride loops
- common advice is to parallelise a loop by mapping one iteration to one thread
- but GPUs limit the number of threads per block (1024) and the number of blocks that make up a grid
- small cost to start/end a block
- kernels need to compute common values - redundant computation
- kernels need to initialise shared memory
- solution: each CUDA thread performs several iterations, striding by the total grid size (see the sketch below)
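A minimal sketch of a grid-stride loop (the scaling kernel and launch configuration are illustrative):

__global__ void scale(float *x, float a, int n) {
    int stride = blockDim.x * gridDim.x;                   // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;    // each thread starts at its global index
         i < n;
         i += stride) {                                    // and strides over the remaining work
        x[i] = a * x[i];
    }
}
// the grid size can be chosen independently of n, e.g. scale<<<128, 256>>>(d_x, 2.0f, n);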
code branching
- threads are scheduled in groups of 32
- each group is a 'warp'
- share same instruction scheduling hardware units
- executes instructions in lock-step (SIMT)
- every thread execute same instruction, but with different data
- branches in the code can cause serialisation
- threads that are not on the taken branch sit idle (see the sketch below)
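A minimal sketch of a divergent branch within a warp (the even/odd condition is illustrative):

__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) {           // even and odd threads in the same warp take
        x[i] = x[i] * 2.0f;     // different branches, so the two paths are
    } else {                    // executed one after the other while the
        x[i] = x[i] + 1.0f;     // other half of the warp sits idle
    }
}
// branching on a per-warp quantity, e.g. (threadIdx.x / 32) % 2 == 0, avoids divergence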
CUDA Memory
constant memory
- read only in kernel
- no cache coherency mechanism required to support writes
- fast, effective, high bandwidth
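A minimal sketch of constant memory usage; the coefficient array, kernel and stencil are illustrative:

__constant__ float coeff[3];                       // lives in constant memory, read-only in kernels

__global__ void smooth(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = coeff[0] * in[i-1] + coeff[1] * in[i] + coeff[2] * in[i+1];
}

// on the host: copy values into the constant symbol before launching
// float h_coeff[3] = {0.25f, 0.5f, 0.25f};
// cudaMemcpyToSymbol(coeff, h_coeff, 3 * sizeof(float));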
shared memory
- shared between threads in a block
- useful for temporary values
- may use for reductions
- may require care with synchronisation within a block
- lifetime of the kernel's blocks
- only addressable when a block starts executing
- released when a block finishes
unified memory
- GPU has a separate memory space from the host CPU
- unified (managed) memory presents a single address space to host and device; the runtime migrates the data automatically
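A minimal sketch using managed memory (kernel name, launch configuration and size n are illustrative; assumes stdio.h is included):

double *a = NULL;
cudaMallocManaged(&a, n * sizeof(double));   // single pointer usable on host and device
for (int i = 0; i < n; i++) a[i] = 1.0;      // initialise on the host
myKernel<<<nblocks, nthreads>>>(a, n);       // kernel accesses the same pointer
cudaDeviceSynchronize();                     // wait before the host reads the results
printf("a[0] = %f\n", a[0]);
cudaFree(a);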
CUDA Libraries
just use the existing libraries; see the slides
OpenMP Target Offloading
OpenMP device model
- host-centric model with one host device and multiple target devices of the same type
- device
- device data environment
target region
- basic offloading construct
- defines a section of the program to be executed on a device
host and device data
- have separate memory spaces
- data inside target region must be mapped to the device
- mapped data must not be accessed by the host until the target region has completed
map clause
#pragma omp target map(map-type: list)
where map-type is one of:
- to: copy data to the device on entry
- from: copy data back to the host on exit
- tofrom: copy data to the device on entry and back to the host on exit
- alloc: allocate an uninitialised copy on the device (no values copied)
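A minimal sketch of a target region with map clauses; the vector-add loop is illustrative and N is assumed to be a compile-time constant (parallelism on the device comes later):

double a[N], b[N], c[N];
/* ... initialise a and b on the host ... */
#pragma omp target map(to: a, b) map(from: c)
{
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}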
dynamically allocated data
need to specify the number of elements to be copied
int *B = (int *) malloc(sizeof(int) * N);
#pragma omp target map(to: B[0:N])   // array section syntax is [start:length], e.g. B[10:3] maps 3 elements starting at index 10
keeping data on the device
- moving data is expensive
- keep the data on the device between target regions
target data
construct that just maps data and does not offload execution
target update
construct that copies values between host and device in between target regions
#pragma omp target enter data map(to: A[0:N], B[0:N])
for (r = 0; r < reps; r++) {
    #pragma omp target
    {
        // do stuff with A and B on the device
    }
    // do something on the host
}
#pragma omp target exit data map(from: B[0:N])
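A minimal sketch of target update, refreshing the host copy part-way through without remapping; A and N are illustrative and assumed to be already allocated:

#pragma omp target enter data map(to: A[0:N])
#pragma omp target
{
    /* ... update A on the device ... */
}
#pragma omp target update from(A[0:N])    // copy current device values back to the host
/* ... host code reads or modifies A ... */
#pragma omp target update to(A[0:N])      // push host changes back to the device
#pragma omp target exit data map(release: A[0:N])   // unmap without copying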
parallelism on the device
- GPUs are not able to support a full threading model outside of a single streaming multiprocessor (SM)
- no sync or memory fences between SMs
- no coherency between L1 caches
teams construct
- create multiple master threads inside a target region
distribute construct
- distributes the iterations of the associated loop across the master threads of the teams (see the sketch below)
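A minimal sketch combining the offloading and work-sharing constructs; the arrays and loop are illustrative:

#pragma omp target teams distribute parallel for map(to: a[0:N], b[0:N]) map(from: c[0:N])
for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];
// teams creates a league of teams (mapped to SMs), distribute spreads chunks of the
// loop across the teams, and parallel for shares each chunk among a team's threads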
calling functions inside target regions
declare target
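A minimal sketch of declare target so that a function can be called from offloaded code; the square function and loop are illustrative:

#pragma omp declare target
double square(double x) { return x * x; }   // compiled for both host and device
#pragma omp end declare target

#pragma omp target teams distribute parallel for map(tofrom: a[0:N])
for (int i = 0; i < N; i++)
    a[i] = square(a[i]);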
target directive clauses
- device: selects which device the region is offloaded to
performance issues
memory layout