Execution Building Blocks

Thread

  • Has its own registers and private variables.

Warp / Wave

  • Fixed-size group of threads executed together in lockstep.

  • These are hardware scheduling units: the smallest batch of threads the GPU issues together.

  • The SM/CU's warp scheduler issues instructions for one warp/wave at a time to its execution units.

    • The SM does not always execute a whole warp in a single cycle; it issues instructions for a warp and can interleave instructions from multiple warps to hide latency.

    • The warp is the smallest scheduling/issue granularity, but how many instructions are dispatched per cycle, and which lanes are active, depend on the pipeline and issue width.

    • Reconvergence is implemented by hardware (and compiler) mechanisms; divergent threads are masked and the SM executes each path serially until reconvergence.

  • Wavefront (AMD) or Warp (NVIDIA).

  • Common sizes:

    • NVIDIA warp: 32 threads.

    • AMD wavefront: 64 threads on GCN; RDNA natively uses 32-wide waves (wave32), with a wave64 compatibility mode.

  • These threads share a program counter and execute the same instruction at the same time (SIMT model).

  • If threads diverge in control flow, the hardware masks off threads not taking the current branch until they reconverge (see the sketch below).
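
As a concrete illustration, here is a minimal GLSL compute shader sketch (the buffer layout and the even/odd split are invented for this example). Within each warp/wave, the branch splits the lanes into two masked groups, and the hardware executes the two paths one after the other before the lanes reconverge:

```glsl
#version 450
layout(local_size_x = 64) in;

layout(std430, binding = 0) buffer Data { float values[]; };

void main() {
    uint i = gl_GlobalInvocationID.x;
    // Within one warp/wave, some lanes take the if-branch and the rest
    // take the else-branch. The SM masks off the inactive lanes and runs
    // each path serially; the warp pays roughly the cost of both paths.
    if ((i & 1u) == 0u) {
        values[i] = values[i] * 2.0;  // even lanes active, odd lanes masked
    } else {
        values[i] = values[i] + 1.0;  // odd lanes active, even lanes masked
    }
    // All lanes are active again here (reconvergence point).
}
```

This is why branchy kernels are often restructured so that whole warps take the same path, e.g., by grouping work items by branch condition.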

Workgroup (API abstraction)

  • Defined by you in a Vulkan compute shader (GLSL/HLSL).

  • Group of threads that can share shared memory within a single SM/CU.

  • A workgroup is assigned to a single SM/CU and normally stays there for its whole lifetime; at most, a runtime that preempts it (e.g., on a context switch) may resume it elsewhere, but it never runs spread across several SMs at once.

    • In practice, you can write code as if the workgroup runs on a single SM until completion.

  • Size: arbitrary (within hardware limits), e.g., local_size_x = 256.

  • Purpose: group of threads that:

    • Can share shared memory / LDS.

    • Can synchronize using barrier() calls.

  • Hardware: the entire workgroup runs within one SM/CU (so its threads can share that SM's on-chip memory).

  • Example:

    • If you set local_size_x = 256, that's 256 threads in the workgroup (see the sketch below).
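
A minimal sketch of the above (the reduction kernel and the buffer names are assumptions made up for this example): a 256-thread workgroup stages data in shared memory and synchronizes with barrier(), which only works because all 256 threads are resident on the same SM/CU:

```glsl
#version 450
layout(local_size_x = 256) in;  // 256 threads per workgroup

layout(std430, binding = 0) buffer InBuf  { float inData[]; };
layout(std430, binding = 1) buffer OutBuf { float outData[]; };

shared float tile[256];  // shared memory / LDS: on-chip RAM of one SM/CU

void main() {
    uint lid = gl_LocalInvocationID.x;
    tile[lid] = inData[gl_GlobalInvocationID.x];
    barrier();  // wait until every thread in the workgroup has written

    // Tree reduction entirely in on-chip shared memory.
    for (uint s = 128u; s > 0u; s >>= 1u) {
        if (lid < s) tile[lid] += tile[lid + s];
        barrier();  // keep the whole workgroup in step at each level
    }
    if (lid == 0u) outData[gl_WorkGroupID.x] = tile[0];  // one sum per workgroup
}
```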

Subgroup (API abstraction)

  • Subset of threads in a workgroup that maps to a warp/wave (e.g., 32 threads).

  • Enables warp-level operations (shuffles, reductions) without shared memory (see the sketch below).

  • Exposed in Vulkan/OpenCL; the size is queried at runtime (in Vulkan, via VkPhysicalDeviceSubgroupProperties::subgroupSize).
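
A minimal subgroup-level version of the same reduction idea (assuming the GL_KHR_shader_subgroup extensions; the buffer names are invented). Note there is no shared array and no barrier(): the exchange happens in registers, inside one warp/wave:

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_arithmetic : require

layout(local_size_x = 256) in;

layout(std430, binding = 0) buffer InBuf  { float inData[]; };
layout(std430, binding = 1) buffer OutBuf { float partials[]; };

void main() {
    float v = inData[gl_GlobalInvocationID.x];

    // Register-to-register sum across the subgroup (one warp/wave).
    float sum = subgroupAdd(v);

    // One lane per subgroup writes the partial result.
    if (subgroupElect()) {
        partials[gl_WorkGroupID.x * gl_NumSubgroups + gl_SubgroupID] = sum;
    }
}
```

Because gl_SubgroupSize is only known at runtime (32 on NVIDIA, 32 or 64 on AMD), portable code sizes its data structures from the queried value rather than hard-coding 32.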

SM (Streaming Multiprocessor) / CU (Compute Unit)

  • SM = Streaming Multiprocessor (NVIDIA terminology).

  • CU = Compute Unit (AMD terminology).

  • The fundamental hardware block that executes shader threads; it runs many warps/waves concurrently.

  • Has per-SM caches and shared memory.

    • Registers (not shared between SMs).

    • Shared memory (LDS).

    • L1 cache (per-SM).

    • Implication: if one SM's L1 cache holds certain data, another SM won't see it there; coherence happens at the shared L2 (the sketch at the end of these notes makes this concrete).

  • Resource Partitioning:

    • The register file is statically partitioned among resident threads; each thread can use up to 255 registers on NVIDIA Ampere (see the worked example after this list).

    • Shared memory is a configurable split with L1 (e.g., 64–164 KB per SM on NVIDIA Ampere).

  • Concurrent Execution:

    • Keeps multiple warps/waves resident and in flight at once (e.g., up to 64 warps per SM on NVIDIA).

    • Hides latency by switching between resident warps at essentially zero cost (warp contexts stay on-chip, so no state has to be saved or restored).

  • Each SM/CU has:

    • Its own set of registers (private to threads assigned to it).

    • Its own shared memory / LDS (Local Data Store), accessible to all threads in a workgroup.

    • Access to L1 cache and special-function units (SFUs).

  • Vulkan equivalent: a workgroup in a compute shader runs entirely within one SM/CU.
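
A back-of-envelope occupancy check, using illustrative numbers (65,536 32-bit registers per SM, as on NVIDIA Ampere, and 32-thread warps): a kernel compiled to 64 registers per thread can keep at most 65,536 / 64 = 1,024 threads = 32 warps resident, so register pressure, not the nominal 64-warp limit, caps this SM at half occupancy. Getting the compiler down to 32 registers per thread (65,536 / 32 = 2,048 threads = 64 warps) would restore the full complement, which is why register count is one of the first things to check when latency hiding seems poor.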

TPC/GPC (Texture Processing Cluster / Graphics Processing Cluster) / SA (Shader Array)

  • A group of SMs/CUs that may share intermediate caches or specialized hardware.

  • NVIDIA:

    • Texture Processing Cluster (TPC) and Graphics Processing Cluster (GPC).

    • A TPC typically pairs 2 SMs; a GPC groups several TPCs with shared raster and geometry hardware.

  • AMD:

    • Shader Array (SA) and Workgroup Processor (WGP) are the cluster-like groupings in RDNA.

    • A WGP pairs 2 CUs that share an instruction cache and LDS (RDNA 2 adds a ray accelerator per CU); a Shader Array groups multiple WGPs.

  • Shared at the cluster level: sometimes texture units, geometry units, or an instruction cache.

GPU Die

  • Organized into top-level graphics engines (e.g., AMD's Shader Engine, NVIDIA's GPC).

  • Each engine contains multiple clusters plus fixed-function units (geometry, raster).

GPU

  • All clusters together, sharing the L2 cache and global memory.
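
To make the L1/L2 coherence point concrete, here is a minimal GLSL sketch (the Counters buffer and its layout are invented for this example). Threads in different workgroups may run on different SMs, so they can only communicate through global memory; the coherent qualifier and atomics make those accesses visible through the shared L2 rather than trapping them in one SM's private L1:

```glsl
#version 450
layout(local_size_x = 256) in;

// "coherent" makes reads/writes visible across SMs (through the shared L2)
// instead of each SM's private, non-coherent L1.
layout(std430, binding = 0) coherent buffer Counters { uint workgroupsDone; };

void main() {
    // Atomics on global memory are resolved at a cache level shared by all
    // SMs, so every workgroup observes a consistent count.
    if (gl_LocalInvocationID.x == 0u) {
        atomicAdd(workgroupsDone, 1u);  // one increment per workgroup
    }
}
```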