Specialized units & instructions

SFUs (special function units)

  • Note that transcendental operations (sin/cos, rsqrt, and similar) may execute on dedicated SFUs, whose latency and throughput differ from those of the regular arithmetic units.
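
A minimal sketch of the point above, taking CUDA on an NVIDIA GPU as one example platform (an assumption; the outline itself is vendor-neutral). The kernel and helper names are hypothetical. It evaluates one expression twice: once with the standard single-precision functions and once with the fast intrinsics `__sinf` and `__cosf`, which trade accuracy for fewer instructions and typically map onto the special function hardware. The helper only measures the accuracy gap; timing the two variants would need a profiler or CUDA events.

```cuda
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical kernel: same expression, two code paths.
// Note: building with -use_fast_math turns sinf/cosf into the intrinsics,
// which would make both paths identical.
__global__ void transcendental_demo(const float* x, float* precise,
                                    float* fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = x[i];
    // Standard math functions: more accurate, more instructions.
    precise[i] = sinf(v) * cosf(v) * rsqrtf(v + 1.0f);
    // Fast intrinsics: reduced accuracy, fewer instructions.
    fast[i] = __sinf(v) * __cosf(v) * rsqrtf(v + 1.0f);
}

// Hypothetical host helper: returns the largest absolute difference between
// the two variants over n samples in [0, 2*pi), or -1.0f if allocation fails.
float max_abs_difference(int n)
{
    float *x = nullptr, *precise = nullptr, *fast = nullptr;
    if (cudaMallocManaged(&x, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&precise, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&fast, n * sizeof(float)) != cudaSuccess) {
        cudaFree(x); cudaFree(precise); cudaFree(fast);
        return -1.0f;
    }
    for (int i = 0; i < n; ++i) x[i] = 6.2831853f * i / n;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    transcendental_demo<<<blocks, threads>>>(x, precise, fast, n);
    cudaDeviceSynchronize();

    float worst = 0.0f;
    for (int i = 0; i < n; ++i)
        worst = std::fmax(worst, std::fabs(precise[i] - fast[i]));

    cudaFree(x); cudaFree(precise); cudaFree(fast);
    return worst;
}
```

Calling `max_abs_difference(1 << 16)` from `main` and printing the result gives a feel for how much accuracy the fast path gives up on the chosen input range; the gap tends to grow for arguments far from zero.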

Tensor/Matrix cores, Ray-tracing cores

  • Mention the specialized units for matrix multiply-accumulate and for ray traversal; algorithms that can be mapped onto these units have markedly different performance characteristics from the same work run on the general-purpose cores.
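
As an illustration of the matrix case, here is a small sketch using CUDA's warp-level matrix API (`nvcuda::wmma`), again assuming an NVIDIA GPU, here of compute capability 7.0 or newer (compile with, for example, `-arch=sm_70`). The kernel and helper names are hypothetical. One warp computes a single 16x16 tile with half-precision inputs and single-precision accumulation. Ray-traversal hardware is normally reached through a ray-tracing API rather than from kernel code, so it is not sketched here.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

constexpr int DIM = 16;  // tile edge: one 16x16x16 multiply-accumulate

// Hypothetical kernel: a single warp computes D = A * B for one tile.
// A is row-major, B is column-major, D is written row-major.
__global__ void wmma_tile(const half* a, const half* b, float* d)
{
    wmma::fragment<wmma::matrix_a, DIM, DIM, DIM, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, DIM, DIM, DIM, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, DIM, DIM, DIM, float> acc;

    wmma::fill_fragment(acc, 0.0f);            // start from a zero accumulator
    wmma::load_matrix_sync(a_frag, a, DIM);    // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, DIM);
    wmma::mma_sync(acc, a_frag, b_frag, acc);  // acc = a_frag * b_frag + acc
    wmma::store_matrix_sync(d, acc, DIM, wmma::mem_row_major);
}

// Hypothetical host helper: multiplies two all-ones tiles and counts how many
// of the 256 outputs equal 16.0f. Returns -1 if allocation fails.
int count_correct_wmma_elements()
{
    const int n = DIM * DIM;
    half *a = nullptr, *b = nullptr;
    float* d = nullptr;
    if (cudaMallocManaged(&a, n * sizeof(half)) != cudaSuccess ||
        cudaMallocManaged(&b, n * sizeof(half)) != cudaSuccess ||
        cudaMallocManaged(&d, n * sizeof(float)) != cudaSuccess) {
        cudaFree(a); cudaFree(b); cudaFree(d);
        return -1;
    }
    for (int i = 0; i < n; ++i) {
        a[i] = __float2half(1.0f);
        b[i] = __float2half(1.0f);
        d[i] = 0.0f;
    }

    wmma_tile<<<1, 32>>>(a, b, d);  // exactly one warp
    cudaDeviceSynchronize();

    int correct = 0;
    for (int i = 0; i < n; ++i)
        if (d[i] == 16.0f) ++correct;

    cudaFree(a); cudaFree(b); cudaFree(d);
    return correct;
}
```

The design point worth noting is that the whole warp cooperates on one tile: the fragments are opaque, per-warp objects, which is why an algorithm has to be restructured around tiles before it can benefit from these units.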

Asynchronous copies / DMA engines

  • Add asynchronous copy mechanisms (device memory to shared memory, or transfers through a staging buffer) that, where the hardware supports them, let memory transfers overlap with compute.
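
The device-to-shared case can be sketched with CUDA's cooperative groups `memcpy_async`, assuming CUDA 11 or newer; on hardware without dedicated support the call still works but may fall back to an ordinary copy. The kernel, helper, and constant names are hypothetical. Each block uses two shared-memory buffers so that the copy of the next tile is in flight while the current tile is processed.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

constexpr int TILE = 256;  // elements per tile; must equal threads per block

// Hypothetical kernel: scales the input tile by tile with double buffering.
// Assumes tiles_per_block >= 1 and blockDim.x == TILE.
__global__ void scale_double_buffered(const float* in, float* out,
                                      int tiles_per_block, float factor)
{
    __shared__ __align__(16) float buf[2][TILE];
    cg::thread_block block = cg::this_thread_block();

    const size_t base = (size_t)blockIdx.x * tiles_per_block * TILE;
    const float* src = in + base;
    float* dst = out + base;

    // Prefetch the first tile.
    cg::memcpy_async(block, buf[0], src, sizeof(float) * TILE);

    for (int t = 0; t < tiles_per_block; ++t) {
        const int cur = t & 1;
        if (t + 1 < tiles_per_block) {
            // Start fetching the next tile into the other buffer...
            cg::memcpy_async(block, buf[cur ^ 1],
                             src + (size_t)(t + 1) * TILE,
                             sizeof(float) * TILE);
            // ...and wait only for the current one (all but the newest copy).
            cg::wait_prior<1>(block);
        } else {
            cg::wait(block);  // last tile: wait for everything outstanding
        }
        dst[(size_t)t * TILE + threadIdx.x] = buf[cur][threadIdx.x] * factor;
        block.sync();  // all reads of buf[cur] finish before it is refilled
    }
}

// Hypothetical host helper: runs the kernel on 4 blocks x 8 tiles and returns
// the number of mismatching outputs (0 on success, -1 if allocation fails).
int run_scale_double_buffered()
{
    const int blocks = 4, tiles = 8, n = blocks * tiles * TILE;
    float *in = nullptr, *out = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(in); cudaFree(out);
        return -1;
    }
    for (int i = 0; i < n; ++i) { in[i] = (float)i; out[i] = 0.0f; }

    scale_double_buffered<<<blocks, TILE>>>(in, out, tiles, 2.0f);
    cudaDeviceSynchronize();

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (out[i] != 2.0f * (float)i) ++mismatches;

    cudaFree(in); cudaFree(out);
    return mismatches;
}
```

The staging case follows the same pattern one level up: in CUDA, `cudaMemcpyAsync` on a non-default stream with pinned host memory lets a copy engine move one chunk while kernels work on another.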