SIMD

CPU ARM

NEON
  • Vector width:  128-bit registers

  • Typical lane count:  4 lanes × 32-bit (e.g., 4 × float32)
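
The lane counts quoted throughout these notes follow from dividing the register width by the element width. A minimal sketch of that arithmetic; the helper name lane_count is made up for illustration and is not part of any SIMD API:

```python
def lane_count(register_bits, element_bits):
    """Number of elements of the given width that fit in one vector register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


# A 128-bit register (NEON, SSE) viewed with different element widths.
for name, bits in [("int8", 8), ("int16", 16), ("float32", 32), ("float64", 64)]:
    print(f"128-bit register: {lane_count(128, bits)} x {name}")
```

The same helper gives 8 lanes of float32 for a 256-bit register and 16 lanes for a 512-bit register.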

SVE / SVE2 (Scalable Vector Extension)
  • Vector width:  Variable.

    • Register width is not fixed (128–2048 bits in 128-bit steps).

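    • Code can be written to be vector-length agnostic, so the same binary scales across cores with different vector widths.

  • Typical lane count:  Depends on the implemented width, from 4 lanes × 32-bit at 128 bits to 64 lanes × 32-bit at 2048 bits (e.g., 8 × float32 on a 256-bit implementation)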
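
Vector-length-agnostic code usually relies on a per-lane predicate rather than a hard-coded width. The sketch below models that idea in plain Python. It is only a simulation: the function names are hypothetical, with whilelt loosely named after SVE's WHILELT instruction, and the vector width is passed in as a parameter instead of being queried from hardware.

```python
def whilelt(start, limit, lanes):
    """Model of a 'while less than' predicate: lane k is active if start + k < limit."""
    return [start + k < limit for k in range(lanes)]


def vla_add(a, b, vector_bits):
    """Add two float32 arrays without assuming a fixed vector width."""
    lanes = vector_bits // 32          # float32 lanes per vector on this "hardware"
    out = [0.0] * len(a)
    i = 0
    while i < len(a):
        pred = whilelt(i, len(a), lanes)
        for k, active in enumerate(pred):
            if active:                 # inactive tail lanes are simply skipped
                out[i + k] = a[i + k] + b[i + k]
        i += lanes
    return out


a = [float(i) for i in range(10)]
b = [1.0] * 10
# The same loop produces the same result for any vector width.
results = [vla_add(a, b, bits) for bits in (128, 256, 2048)]
```

Because the predicate handles the final partial vector, no separate scalar tail loop is needed.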

CPU x86/x64 (Intel / AMD)

FMA3 / FMA4 (Fused multiply-add)
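  • Computes a × b + c in one instruction with a single rounding step.

  • Is often used in combination with AVX/AVX2/AVX-512.

  • Platform:  x86/x64 (FMA3: Intel & AMD; FMA4: AMD only, now discontinued)

  • Vector width:  256-bit registers (128-bit forms also exist)

  • Typical lane count:  8 lanes × 32-bit (8 × float32)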
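
The practical difference between a fused and an unfused multiply-add is the number of rounding steps. The sketch below illustrates this for a single double-precision value by doing the exact arithmetic with fractions and rounding once. It models the numerical behaviour only, not the hardware instruction, and the function names are made up for the example.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Compute a*b + c exactly, then round once to the nearest double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def unfused_multiply_add(a, b, c):
    """Ordinary evaluation: the product is rounded before the addition."""
    return a * b + c


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# Exact product is 1 - 2**-60, which rounds to 1.0 as a double.
fused = fused_multiply_add(a, b, c)      # keeps the -2**-60 term
unfused = unfused_multiply_add(a, b, c)  # loses it and yields 0.0
print(fused, unfused)
```

The inputs are chosen so that the intermediate product is not representable as a double, which is exactly the case where the two results differ.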

AVX-512
  • Adoption is limited (mainly HPC, data-center, and select desktop/laptop CPUs).

  • Includes masking, scatter/gather, and more advanced operations.

  • Platform:  x86/x64 (select Intel CPUs and AMD Zen 4 or later; not universally available)

  • Vector width:  512-bit registers

  • Typical lane count:  16 lanes × 32-bit (16 × float32)
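
Masking lets an instruction update only selected lanes. The sketch below models merge-masking for a 16-lane float32 add, the behaviour exposed by intrinsics such as _mm512_mask_add_ps. It is a plain-Python simulation, and masked_add is a hypothetical helper rather than a real API.

```python
def masked_add(src, mask, a, b):
    """Merge-masking: lane i gets a[i] + b[i] if bit i of mask is set, else src[i]."""
    return [a[i] + b[i] if (mask >> i) & 1 else src[i] for i in range(len(a))]


a = [float(i) for i in range(16)]   # 16 x float32 lanes in a 512-bit register
b = [100.0] * 16
src = [-1.0] * 16                   # values kept in inactive lanes
mask = 0x00FF                       # only the low 8 lanes are active

result = masked_add(src, mask, a, b)
```

Here the low eight lanes hold the sums and the high eight lanes keep the values from src.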

AVX / AVX2 (Advanced Vector Extensions)
  • AVX2 added 256-bit integer operations (AVX itself is primarily floating-point).

  • Platform:  x86/x64 (Intel & AMD)

  • Vector width:  256-bit registers

  • Typical lane count:  8 lanes × 32-bit (8 × float32)

SSE (Streaming SIMD Extensions)
  • Largely superseded by AVX, although SSE2 remains the baseline for x86-64.

  • SSE1–SSE4 progressively added instructions but retained 128-bit width.

  • Platform:  x86/x64 (Intel & AMD)

  • Vector width:  128-bit registers

  • Typical lane count:  4 lanes × 32-bit (4 × float32)

MMX
  • Legacy, obsolete.

  • Platform:  x86/x64 (Intel & AMD)

  • Vector width:  64-bit registers

  • Typical lane count:  2 lanes × 32-bit (2 × int32; MMX is integer-only, so also 4 × int16 or 8 × int8)
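
A packed register is just a fixed number of bits that instructions interpret as lanes of a chosen width. The sketch below uses Python's struct module to view the same 64 bits as 2, 4, or 8 signed integer lanes. It assumes little-endian byte order, as on x86, and only illustrates the layout, not MMX itself.

```python
import struct

# Pack two 32-bit integers into 8 bytes (64 bits), little-endian.
raw = struct.pack("<2i", 1, -2)

as_int32 = struct.unpack("<2i", raw)   # 2 lanes x 32-bit
as_int16 = struct.unpack("<4h", raw)   # 4 lanes x 16-bit
as_int8 = struct.unpack("<8b", raw)    # 8 lanes x 8-bit

print(len(raw) * 8, as_int32, as_int16, as_int8)
```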

RISC-V (pronounced "risk-five")

  • Is an open, modular instruction set architecture (ISA) based on the RISC (Reduced Instruction Set Computer) design principles.

  • Unlike proprietary ISAs (e.g., x86 by Intel/AMD, ARM by Arm Ltd.), RISC-V is:

    • Open source:  Anyone can use or implement it without licensing fees.

    • Modular:  It has a minimal base instruction set, with optional extensions (e.g., floating-point, SIMD, vector).

RVV (RISC-V Vector Extension)
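  • Like ARM SVE, RVV lets each hardware implementation choose its vector width.

  • Not fixed to 128, 256, or 512 bits; code adapts dynamically.

  • Scalable width:  The vector register length (VLEN) is a power of two chosen by the implementation; the standard V extension requires at least 128 bits, and the specification permits up to 65,536 bits.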

  • Vector-Length Agnostic (VLA):

    • Programs don’t assume a fixed vector width.

    • Code adapts to the hardware at runtime: the same binary works on 128-bit or 512-bit hardware.
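
RVV loops are typically strip-mined: on each iteration the program asks the hardware how many elements it may process, then advances by that amount. The sketch below simulates the pattern. The vsetvl function here is a simplified stand-in that always grants min(requested, maximum), which real hardware is not strictly required to do, and rvv_scale is a hypothetical example kernel.

```python
def vsetvl(avl, vlmax):
    """Simplified model: grant min(requested elements, hardware maximum)."""
    return min(avl, vlmax)


def rvv_scale(x, factor, vlen_bits):
    """Multiply every float32 element by factor; return (result, loop iterations)."""
    vlmax = vlen_bits // 32            # float32 elements per vector register
    out = []
    pos = 0
    remaining = len(x)
    steps = 0
    while remaining > 0:
        vl = vsetvl(remaining, vlmax)  # elements handled in this iteration
        out.extend(v * factor for v in x[pos:pos + vl])
        pos += vl
        remaining -= vl
        steps += 1
    return out, steps


x = [float(i) for i in range(10)]
narrow, narrow_steps = rvv_scale(x, 2.0, 128)   # 4 lanes: chunks of 4, 4, 2
wide, wide_steps = rvv_scale(x, 2.0, 512)       # 16 lanes: one chunk of 10
```

Both calls return the same values; only the number of loop iterations changes with the vector length.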

GPU

  • GPUs use SIMT (Single Instruction, Multiple Threads), which is not SIMD in the strict sense but is functionally similar at scale.

CUDA
  • NVIDIA GPUs

  • Vector width:  Scalar SIMT (each thread runs scalar code; the hardware groups threads into warps)

  • Typical lane count:  32 threads per warp (32 × 32-bit lanes)

OpenCL
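  • Cross-vendor compute API (GPUs, CPUs, and other accelerators)

  • Vector width:  Variable (depends on the device)

  • Typical lane count:  Device-dependent; built-in vector types have 2, 3, 4, 8, or 16 lanes (e.g., float8 is 8 × float32)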

Wavefronts / Warps
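  • Groups of GPU threads that execute in lockstep (NVIDIA calls them warps, AMD calls them wavefronts); used for both compute and shaders

  • Vector width:  32 threads (NVIDIA warps, AMD RDNA wave32) or 64 threads (AMD GCN/CDNA wavefronts)

  • Typical lane count:  32 or 64 lanes × 32-bit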
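
The mapping from threads to warps or wavefronts is simple integer arithmetic. The sketch below shows it for a one-dimensional launch. The function names are hypothetical, and the formula for the global index mirrors the usual blockIdx.x * blockDim.x + threadIdx.x expression used in CUDA kernels.

```python
def global_thread_index(block_idx, block_dim, thread_idx):
    """Index of a thread across the whole 1-D grid."""
    return block_idx * block_dim + thread_idx


def warp_and_lane(thread_idx, warp_size=32):
    """Split a thread's index within its block into (warp number, lane within warp)."""
    return divmod(thread_idx, warp_size)


def warps_per_block(block_dim, warp_size=32):
    """Warps needed for a block; the last warp may be only partly filled."""
    return -(-block_dim // warp_size)   # ceiling division


print(global_thread_index(2, 256, 5))
print(warp_and_lane(70), warp_and_lane(70, warp_size=64))
print(warps_per_block(100))
```

A block of 100 threads occupies four 32-wide warps, with the last warp only partly filled, so block sizes that are a multiple of the warp size avoid idle lanes.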