Advanced Parallel Techniques Revision Notes


# UoE Notes
# APT

课程信息

  • Assessment: two coursework assignments worth 25% each, plus an exam worth 50%
  • Practical note: start the practicals early
  • Exam focus: the material covered in the practicals

GPU Architecture

factors influencing performance

CPUs: clock speed

processing rate per compute element

  • power consumed is proportional to f × V<sup>2</sup>
  • increasing frequency f while decreasing voltage V kept power manageable
  • but V cannot be decreased much further, so f can no longer be increased

CPUs: memory latency

  • cache hierarchies
    • large on-chip caches to stage data
    • L1, L2, L3, controllers...
  • other latency hiding measures
    • hardware multithreading
    • out of order execution

CPUs: memory bandwidth

CPUs generally use commodity Double Data Rate (DDR) RAM

CPUs: parallelism

  • use more cores
  • limited clock speed keeps power consumption per core manageable

GPUs

  • graphics processing units, originally designed for rendering
    • a balance between floating point performance and memory bandwidth

GPUs: clock speed

  • underlying clock frequency relatively modest
  • performance based on parallelism

GPUs: memory latency

  • schedule many independent tasks
  • if one task encounters a long-latency operation
    • it is swapped out and another task is scheduled

scientific applications

  • GPUs good for
    • many independent tasks
    • significant data parallelism
    • structured code with identifiable kernels
  • not good for
    • highly coupled problems with little parallelism
    • IO-dominated problems
    • poorly structured code with diffuse computational intensity

CUDA Programming

streaming multiprocessors

  • a two-level hierarchy
    • many streaming multiprocessors each with many cores
    • exact numbers depend on particular hardware

dim3 structure

```c
struct { unsigned int x; unsigned int y; unsigned int z; };
```
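
A minimal sketch of how `dim3` configures a kernel launch; the kernel name `myKernel` and the grid/block sizes here are illustrative:

```cuda
__global__ void myKernel(float *a) { /* ... */ }

void launchExample(float *d_a) {
    dim3 blocksPerGrid(4, 4);      // grid is 4 x 4 x 1 blocks (z defaults to 1)
    dim3 threadsPerBlock(16, 16);  // each block is 16 x 16 x 1 threads
    myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_a);
}
```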

synchronisation between host and device

  • kernel launches are asynchronous
    • return immediately on the host
    • synchronisation required to ensure completion
    • errors can appear asynchronously
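
A sketch of the usual pattern, assuming a kernel `myKernel`: check for launch errors straight after the (asynchronous) launch, then synchronise to catch errors that only appear once the kernel actually executes.

```cuda
#include <stdio.h>

__global__ void myKernel(float *x) { /* ... */ }

void launchAndCheck(float *d_x, int nblocks, int nthreads) {
    myKernel<<<nblocks, nthreads>>>(d_x);   // returns immediately on the host

    // launch-configuration errors are reported synchronously
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("launch: %s\n", cudaGetErrorString(err));

    // block the host until the kernel completes; execution errors surface here
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) printf("execution: %s\n", cudaGetErrorString(err));
}
```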

synchronisation on the device

  • synchronisation between threads in the same block is possible
    • allows coordination of action in shared memory
    • allows reductions, as in the sketch below
  • not possible to synchronise between blocks
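
A sketch of a block-level sum reduction built on `__syncthreads()`, assuming `blockDim.x` is a power of two; since no barrier exists across blocks, combining the per-block results needs a second kernel (or atomics).

```cuda
__global__ void blockSum(const float *in, float *blockSums) {
    extern __shared__ float tmp[];               // one element per thread
    int tid = threadIdx.x;
    tmp[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                             // all loads visible to the block

    // tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tmp[tid] += tmp[tid + s];
        __syncthreads();                         // wait for each round
    }
    if (tid == 0) blockSums[blockIdx.x] = tmp[0];
}
// launch with the dynamic shared memory size as the third parameter:
// blockSum<<<nblocks, nthreads, nthreads * sizeof(float)>>>(d_in, d_sums);
```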

memory management

  • host and device have separate address spaces
    • data accessed by kernel must be in the device memory
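
The standard allocate/copy/free pattern, as a sketch:

```cuda
void runOnDevice(float *h_x, int N) {
    float *d_x;
    cudaMalloc((void **)&d_x, N * sizeof(float));   // device allocation
    cudaMemcpy(d_x, h_x, N * sizeof(float),
               cudaMemcpyHostToDevice);             // host -> device
    // ... launch kernels that read/write d_x ...
    cudaMemcpy(h_x, d_x, N * sizeof(float),
               cudaMemcpyDeviceToHost);             // device -> host
    cudaFree(d_x);
}
```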

Performance Optimisation

copying between host and device

  • separate memory spaces mean some copying is inevitable
  • but unnecessary copies must be avoided
    • keep data resident on the device
    • this may mean moving all relevant code to the device
    • prefer recalculation / extra computation over data communication

occupancy and latency hiding

  • work is decomposed and distributed between threads
  • we actually want N<sub>threads</sub> >> N<sub>cores</sub>
  • surplus threads hide the latency of access to main memory

memory coalescing

  • GPUs have a high peak memory bandwidth
    • it is only achieved when accesses are coalesced: consecutive threads access consecutive memory locations
  • if not, accesses may be serialised (see the sketch below)
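
A sketch contrasting the two access patterns; the stride of 32 is illustrative:

```cuda
// Coalesced: thread i touches element i, so a warp's 32 loads fall
// in one contiguous segment of memory.
__global__ void coalesced(float *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = 2.0f * a[i];
}

// Strided: consecutive threads touch elements 32 apart, so the warp's
// accesses span many segments and may be serialised.
__global__ void strided(float *a) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
    a[i] = 2.0f * a[i];
}
```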

grid stride loops

  • common advice is to parallelise a loop by mapping one iteration to one thread, but
    • GPUs limit the number of threads per block (1024) and the number of blocks that make up a grid
    • there is a small cost to start/end each block
    • kernels may need to compute common values: redundant computation
    • kernels may need to initialise shared memory
  • solution: each CUDA thread performs several iterations, as in the sketch below
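
A minimal grid-stride loop; one launch configuration then works for any problem size `N`:

```cuda
__global__ void scale(float *a, float b, int N) {
    int stride = gridDim.x * blockDim.x;   // total threads in the whole grid
    // each thread handles elements i, i + stride, i + 2*stride, ...
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += stride)
        a[i] = b * a[i];
}
```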

code branching

  • threads are scheduled in groups of 32
    • each group is a 'warp'
    • a warp shares the same instruction scheduling hardware units
    • and executes instructions in lock-step (SIMT)
      • every thread executes the same instruction, but with different data
  • branches in the code can cause serialisation
    • threads not in the executing branch sit idle (see the sketch below)
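
A sketch of divergence within a warp; branching on the warp index instead keeps each warp uniform:

```cuda
__global__ void divergent(float *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // even and odd threads of the same warp take different branches,
    // so the two paths are executed one after the other
    if (threadIdx.x % 2 == 0)
        a[i] = 0.0f;   // odd threads sit idle here
    else
        a[i] = 1.0f;   // even threads sit idle here
}

// better: (threadIdx.x / 32) % 2 is the same for all 32 threads of a warp,
// so branching on it causes no divergence within any warp
```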

CUDA Memory

constant memory

  • read only in kernel
  • no cache coherency mechanism required to support writes
  • fast, effective, high bandwidth
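
A sketch of constant memory; the array `coeff` and its size are illustrative. Values are set from the host with `cudaMemcpyToSymbol`:

```cuda
__constant__ float coeff[4];   // lives in device constant memory

__global__ void apply(const float *in, float *out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = coeff[0] + coeff[1] * in[i];   // read-only, broadcast to threads
}

// host side:
// float h_coeff[4] = {0.5f, 2.0f, 0.0f, 0.0f};
// cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));
```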

shared memory

  • shared between threads in a block
    • useful for temporary values
    • may be used for reductions
  • may require care with synchronisation within a block
  • lifetime is that of the kernel's blocks
    • only addressable once a block starts executing
    • released when the block finishes
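
The two ways of declaring shared memory, as a sketch (the reduction example earlier uses the dynamic form):

```cuda
__global__ void usesShared(void) {
    __shared__ float sStatic[128];    // static: size fixed at compile time
    extern __shared__ float sDyn[];   // dynamic: size set at launch time
    // ... threads of this block read/write sStatic and sDyn ...
}

// the dynamic size is the third launch parameter (in bytes):
// usesShared<<<nblocks, nthreads, nbytes>>>();
```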

unified memory

  • the GPU has a separate memory space from the host CPU
  • unified memory presents a single address space: the same pointer is valid on host and device, and data is migrated automatically
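
A sketch of unified memory with `cudaMallocManaged`: the same pointer is used on both host and device, with migration handled by the runtime.

```cuda
__global__ void addOne(float *a, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] += 1.0f;
}

int main(void) {
    int N = 1 << 20;
    float *a;
    cudaMallocManaged(&a, N * sizeof(float));   // visible to host and device
    for (int i = 0; i < N; i++) a[i] = 1.0f;    // host writes directly
    addOne<<<(N + 255) / 256, 256>>>(a, N);
    cudaDeviceSynchronize();                    // finish before the host reads
    cudaFree(a);
    return 0;
}
```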

CUDA Libraries

covered for usage only; see the slides for details

OpenMP Target Offloading

OpenMP device model

  • host-centric model with one host device and multiple target devices of the same type
  • device
    • an implementation-defined logical execution engine with its own local storage
  • device data environment
    • the set of variables mapped to (and visible on) a given device

target region

  • basic offloading construct
  • defines a section of the program to be executed on the target device (see the sketch below)
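
A minimal target region, as a sketch; without further constructs the block runs on the device, but not in parallel:

```c
void incOnDevice(double *a, int N) {
    // the structured block executes on the default target device
    #pragma omp target map(tofrom: a[0:N])
    {
        for (int i = 0; i < N; i++)
            a[i] += 1.0;
    }
}
```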

host and device data

  • have separate memory spaces
  • data inside target region must be mapped to the device
  • mapped data must not be accessed by the host until the target region has completed

map clause

#pragma omp target map(map-type: list), where map-type is one of:

  1. to: copy data to device on entry
  2. from: copy data to host on exit
  3. tofrom: copy data to device on entry and back on exit
  4. alloc: allocate an uninitialized copy on the device (no values)
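
A sketch combining map types: `x` is only read on the device, while `y` is both read and written.

```c
void saxpy(float a, float *x, float *y, int N) {
    #pragma omp target map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
}
```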

dynamically allocated data

need to specify the number of elements to be copied

```c
int *B = (int *)malloc(sizeof(int) * N);
#pragma omp target map(to: B[0:N])   // can do B[10:3]
```

keeping data on the device

  • moving data is expensive
  • keep the data on the device between target regions
  • target data constructs just map data and do not offload
  • the target update construct copies values between host and device between target constructs
```c
#pragma omp target enter data map(to: A[0:N], B[0:N])
for (r = 0; r < reps; r++) {
    #pragma omp target
    {
        // do stuff with A and B
    }
    // do something on the host
}
#pragma omp target exit data map(from: B[0:N])
```

parallelism on the device

  • GPUs are not able to support a full threading model outside of a single streaming multiprocessor (SM)
    • no synchronisation or memory fences between SMs
    • no coherency between L1 caches

teams construct

  • creates multiple master threads inside a target region (one per team)

distribute construct

  • distributes the iterations of a loop across the master threads of the teams (see the sketch below)
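
A sketch of the usual combined construct: `teams` creates the league (roughly, thread blocks), `distribute` splits the loop across teams, and `parallel for` splits each team's share across its threads.

```c
void scaleOnDevice(double *a, double b, int N) {
    #pragma omp target teams distribute parallel for map(tofrom: a[0:N])
    for (int i = 0; i < N; i++)
        a[i] = b * a[i];
}
```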

calling functions inside target regions

declare target

  • functions called inside a target region must be compiled for the device
  • the declare target directive marks functions (and global variables) for device compilation
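
A sketch of declare target; the hypothetical helper `square` gets a device version that can be called inside the target region:

```c
#pragma omp declare target
double square(double x) { return x * x; }
#pragma omp end declare target

void squareAll(double *a, int N) {
    #pragma omp target teams distribute parallel for map(tofrom: a[0:N])
    for (int i = 0; i < N; i++)
        a[i] = square(a[i]);   // calls the device version of square
}
```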

target directive clauses

device

  • the device(n) clause selects which target device to offload to, e.g. #pragma omp target device(1)

performance issues

memory layout

  • as on the CUDA side, data layout determines whether device memory accesses coalesce: consecutive threads should access consecutive elements
