Advanced Parallel Techniques Revision Notes


# UoE Notes
# APT

课程信息

  • Assessment: two coursework assignments worth 25% each, plus an exam worth 50%
  • Practical note: start the practicals early
  • Exam focus: the material covered in the practicals

GPU Architecture

factors influencing performance

CPUs: clock speed

processing rate per compute element

  • power consumed is proportional to f × V<sup>2</sup>
  • increasing frequency f while decreasing voltage V kept power manageable
  • but V cannot be decreased much further, so f can no longer be increased

CPUs: memory latency

  • cache hierarchies
    • large on-chip caches to stage data
    • L1, L2, L3, controllers...
  • other latency hiding measures
    • hardware multithreading
    • out of order execution

CPUs: memory bandwidth

CPUs generally use commodity Double Data Rate (DDR) RAM

CPUs: parallelism

  • use more cores
  • limited clock speed keeps power consumption per core manageable

GPUs

  • graphics processing units, originally designed for rendering
    • a balance between floating point performance and memory bandwidth

GPUs: clock speed

  • underlying clock frequency relatively modest
  • performance based on parallelism

GPUs: memory latency

  • schedule many independent tasks
  • if one task encounters a long-latency operation
    • it is swapped out and another task is scheduled

scientific applications

  • GPUs good for
    • many independent tasks
    • significant data parallelism
    • structured code with identifiable kernels
  • not good for
    • highly coupled problems with little parallelism
    • IO-dominated problems
    • poorly structured code with diffuse computational intensity

CUDA Programming

streaming multiprocessors

  • a two-level hierarchy
    • many streaming multiprocessors each with many cores
    • exact numbers depend on particular hardware

dim3 structure

```c
struct { unsigned int x; unsigned int y; unsigned int z; };
```
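
A minimal sketch of how `dim3` configures a kernel launch; the kernel name `myKernel` and the grid/block sizes here are illustrative:

```cuda
__global__ void myKernel(float *a) { /* ... */ }

void launchExample(float *d_a) {
    dim3 blocksPerGrid(4, 4);      // grid is 4 x 4 x 1 blocks (z defaults to 1)
    dim3 threadsPerBlock(16, 16);  // each block is 16 x 16 x 1 threads
    myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_a);
}
```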

synchronisation between host and device

  • kernel launches are asynchronous
    • return immediately on the host
    • synchronisation required to ensure completion
    • errors can appear asynchronously
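
A sketch of the usual pattern, assuming a kernel `myKernel`: check for launch errors straight after the (asynchronous) launch, then synchronise to catch errors that only appear once the kernel actually executes.

```cuda
#include <stdio.h>

__global__ void myKernel(float *x) { /* ... */ }

void launchAndCheck(float *d_x, int nblocks, int nthreads) {
    myKernel<<<nblocks, nthreads>>>(d_x);   // returns immediately on the host

    // launch-configuration errors are reported synchronously
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("launch: %s\n", cudaGetErrorString(err));

    // block the host until the kernel completes; execution errors surface here
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) printf("execution: %s\n", cudaGetErrorString(err));
}
```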

synchronisation on the device

  • synchronisation between threads in the same block is possible
    • allows coordination of action in shared memory
    • allows reductions, as in the sketch below
  • not possible to synchronise between blocks
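
A sketch of a block-level sum reduction built on `__syncthreads()`, assuming `blockDim.x` is a power of two; since no barrier exists across blocks, combining the per-block results needs a second kernel (or atomics).

```cuda
__global__ void blockSum(const float *in, float *blockSums) {
    extern __shared__ float tmp[];               // one element per thread
    int tid = threadIdx.x;
    tmp[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                             // all loads visible to the block

    // tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tmp[tid] += tmp[tid + s];
        __syncthreads();                         // wait for each round
    }
    if (tid == 0) blockSums[blockIdx.x] = tmp[0];
}
// launch with the dynamic shared memory size as the third parameter:
// blockSum<<<nblocks, nthreads, nthreads * sizeof(float)>>>(d_in, d_sums);
```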

memory management

  • host and device have separate address spaces
    • data accessed by kernel must be in the device memory
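
The standard allocate/copy/free pattern, as a sketch:

```cuda
void runOnDevice(float *h_x, int N) {
    float *d_x;
    cudaMalloc((void **)&d_x, N * sizeof(float));   // device allocation
    cudaMemcpy(d_x, h_x, N * sizeof(float),
               cudaMemcpyHostToDevice);             // host -> device
    // ... launch kernels that read/write d_x ...
    cudaMemcpy(h_x, d_x, N * sizeof(float),
               cudaMemcpyDeviceToHost);             // device -> host
    cudaFree(d_x);
}
```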

Performance Optimisation

copying between host and device

  • separate memory spaces mean some copying is inevitable
  • but unnecessary copies must be avoided
    • keep data resident on the device
    • this may mean moving all relevant code to the device
    • prefer recalculation / extra computation over data communication

occupancy and latency hiding

  • work is decomposed and distributed between threads
  • we actually want N<sub>threads</sub> >> N<sub>cores</sub>
  • surplus threads hide the latency of access to main memory

memory coalescing

  • GPUs have a high peak memory bandwidth
    • it is only achieved when accesses are coalesced: consecutive threads access consecutive memory locations
  • if not, accesses may be serialised (see the sketch below)
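
A sketch contrasting the two access patterns; the stride of 32 is illustrative:

```cuda
// Coalesced: thread i touches element i, so a warp's 32 loads fall
// in one contiguous segment of memory.
__global__ void coalesced(float *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = 2.0f * a[i];
}

// Strided: consecutive threads touch elements 32 apart, so the warp's
// accesses span many segments and may be serialised.
__global__ void strided(float *a) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
    a[i] = 2.0f * a[i];
}
```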

grid stride loops

  • common advice is to parallelise a loop by mapping one iteration to one thread, but
    • GPUs limit the number of threads per block (1024) and the number of blocks that make up a grid
    • there is a small cost to start/end each block
    • kernels may need to compute common values: redundant computation
    • kernels may need to initialise shared memory
  • solution: each CUDA thread performs several iterations, as in the sketch below
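
A minimal grid-stride loop; one launch configuration then works for any problem size `N`:

```cuda
__global__ void scale(float *a, float b, int N) {
    int stride = gridDim.x * blockDim.x;   // total threads in the whole grid
    // each thread handles elements i, i + stride, i + 2*stride, ...
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += stride)
        a[i] = b * a[i];
}
```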

code branching

  • threads are scheduled in groups of 32
    • each group is a 'warp'
    • a warp shares the same instruction scheduling hardware units
    • and executes instructions in lock-step (SIMT)
      • every thread executes the same instruction, but with different data
  • branches in the code can cause serialisation
    • threads not in the executing branch sit idle (see the sketch below)
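
A sketch of divergence within a warp; branching on the warp index instead keeps each warp uniform:

```cuda
__global__ void divergent(float *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // even and odd threads of the same warp take different branches,
    // so the two paths are executed one after the other
    if (threadIdx.x % 2 == 0)
        a[i] = 0.0f;   // odd threads sit idle here
    else
        a[i] = 1.0f;   // even threads sit idle here
}

// better: (threadIdx.x / 32) % 2 is the same for all 32 threads of a warp,
// so branching on it causes no divergence within any warp
```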

CUDA Memory

constant memory

  • read only in kernel
  • no cache coherency mechanism required to support writes
  • fast, effective, high bandwidth
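
A sketch of constant memory; the array `coeff` and its size are illustrative. Values are set from the host with `cudaMemcpyToSymbol`:

```cuda
__constant__ float coeff[4];   // lives in device constant memory

__global__ void apply(const float *in, float *out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = coeff[0] + coeff[1] * in[i];   // read-only, broadcast to threads
}

// host side:
// float h_coeff[4] = {0.5f, 2.0f, 0.0f, 0.0f};
// cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));
```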

shared memory

  • shared between threads in a block
    • useful for temporary values
    • may be used for reductions
  • may require care with synchronisation within a block
  • lifetime is that of the kernel's blocks
    • only addressable once a block starts executing
    • released when the block finishes
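
The two ways of declaring shared memory, as a sketch (the reduction example earlier uses the dynamic form):

```cuda
__global__ void usesShared(void) {
    __shared__ float sStatic[128];    // static: size fixed at compile time
    extern __shared__ float sDyn[];   // dynamic: size set at launch time
    // ... threads of this block read/write sStatic and sDyn ...
}

// the dynamic size is the third launch parameter (in bytes):
// usesShared<<<nblocks, nthreads, nbytes>>>();
```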

unified memory

  • the GPU has a separate memory space from the host CPU
  • unified memory presents a single address space: the same pointer is valid on host and device, and data is migrated automatically
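
A sketch of unified memory with `cudaMallocManaged`: the same pointer is used on both host and device, with migration handled by the runtime.

```cuda
__global__ void addOne(float *a, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] += 1.0f;
}

int main(void) {
    int N = 1 << 20;
    float *a;
    cudaMallocManaged(&a, N * sizeof(float));   // visible to host and device
    for (int i = 0; i < N; i++) a[i] = 1.0f;    // host writes directly
    addOne<<<(N + 255) / 256, 256>>>(a, N);
    cudaDeviceSynchronize();                    // finish before the host reads
    cudaFree(a);
    return 0;
}
```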

CUDA Libraries

covered for usage only; see the slides for details

OpenMP Target Offloading

OpenMP device model

  • host-centric model with one host device and multiple target devices of the same type
  • device
    • an implementation-defined logical execution engine with its own local storage
  • device data environment
    • the set of variables mapped to (and visible on) a given device

target region

  • basic offloading construct
  • defines a section of the program to be executed on the target device (see the sketch below)
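
A minimal target region, as a sketch; without further constructs the block runs on the device, but not in parallel:

```c
void incOnDevice(double *a, int N) {
    // the structured block executes on the default target device
    #pragma omp target map(tofrom: a[0:N])
    {
        for (int i = 0; i < N; i++)
            a[i] += 1.0;
    }
}
```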

host and device data

  • have separate memory spaces
  • data inside target region must be mapped to the device
  • mapped data must not be accessed by the host until the target region has completed

map clause

#pragma omp target map(map-type: list), where map-type is one of:

  1. to: copy data to device on entry
  2. from: copy data to host on exit
  3. tofrom: copy data to device on entry and back on exit
  4. alloc: allocate an uninitialized copy on the device (no values)
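
A sketch combining map types: `x` is only read on the device, while `y` is both read and written.

```c
void saxpy(float a, float *x, float *y, int N) {
    #pragma omp target map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
}
```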

dynamically allocated data

need to specify the number of elements to be copied

```c
int *B = (int *)malloc(sizeof(int) * N);
#pragma omp target map(to: B[0:N])   // can do B[10:3]
```

keeping data on the device

  • moving data is expensive
  • keep the data on the device between target regions
  • target data constructs just map data and do not offload
  • the target update construct copies values between host and device between target constructs
```c
#pragma omp target enter data map(to: A[0:N], B[0:N])
for (r = 0; r < reps; r++) {
    #pragma omp target
    {
        // do stuff with A and B
    }
    // do something on the host
}
#pragma omp target exit data map(from: B[0:N])
```

parallelism on the device

  • GPUs are not able to support a full threading model outside of a single streaming multiprocessor (SM)
    • no synchronisation or memory fences between SMs
    • no coherency between L1 caches

teams construct

  • creates multiple master threads inside a target region (one per team)

distribute construct

  • distributes the iterations of a loop across the master threads of the teams (see the sketch below)
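
A sketch of the usual combined construct: `teams` creates the league (roughly, thread blocks), `distribute` splits the loop across teams, and `parallel for` splits each team's share across its threads.

```c
void scaleOnDevice(double *a, double b, int N) {
    #pragma omp target teams distribute parallel for map(tofrom: a[0:N])
    for (int i = 0; i < N; i++)
        a[i] = b * a[i];
}
```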

calling functions inside target regions

declare target

  • functions called inside a target region must be compiled for the device
  • the declare target directive marks functions (and global variables) for device compilation
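
A sketch of declare target; the hypothetical helper `square` gets a device version that can be called inside the target region:

```c
#pragma omp declare target
double square(double x) { return x * x; }
#pragma omp end declare target

void squareAll(double *a, int N) {
    #pragma omp target teams distribute parallel for map(tofrom: a[0:N])
    for (int i = 0; i < N; i++)
        a[i] = square(a[i]);   // calls the device version of square
}
```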

target directive clauses

device

  • the device(n) clause selects which target device to offload to, e.g. #pragma omp target device(1)

performance issues

memory layout

  • as on the CUDA side, data layout determines whether device memory accesses coalesce: consecutive threads should access consecutive elements
