Mapping Data to Shaders

Shader Alignment

Minimum Dynamic-Offset / CBV Allocation Granularity
  • GPUs and drivers require that, when you bind or use a portion of a large buffer as a uniform/constant buffer, the start offset (and sometimes the size) lines up to a minimum alignment.

  • That alignment is the “minimum dynamic-offset” (Vulkan) or the CBV/constant buffer granularity (D3D12).

  • It lets the driver map many small logical buffers into a single big GPU buffer efficiently.

  • If you bind at an unaligned offset, the API/driver will reject it, or you will read wrong data or see degraded performance.

  • Drivers can report 64, 128, 256, or other powers of two.

  • UBO alignment is usually larger than SSBO alignment because UBO usage and caches are handled differently by the hardware.

  • Value :

    • Many APIs and drivers use 256 bytes as the Minimum Dynamic-Offset on common desktop GPUs.

      • VkGuide:

      struct MaterialConstants {  // written into uniform buffers later
          glm::vec4 colorFactors; // multiply the color texture
          glm::vec4 metal_rough_factors;
          glm::vec4 extra[14];
              /*
              padding, we need it anyway for uniform buffers
              it needs to meet a minimum requirement for its alignment. 
              256 bytes is a good default alignment for this which all the gpus we target meet, so we are adding those vec4s to pad the structure to 256 bytes.
              */
      };
      
    • But not every platform or GPU guarantees 256. Mobile or integrated GPUs may have different values.

    • VkPhysicalDeviceLimits .

      • minUniformBufferOffsetAlignment

        • Is the minimum required  alignment, in bytes, for the offset  member of the VkDescriptorBufferInfo  structure for uniform buffers.

        • When a descriptor of type DESCRIPTOR_TYPE_UNIFORM_BUFFER  or DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC  is updated, the offset   must  be an integer multiple of this limit.

        • Similarly, dynamic offsets for uniform buffers must  be multiples of this limit.

        • The value must  be a power of two.

      • minStorageBufferOffsetAlignment

        • Is the minimum required  alignment, in bytes, for the offset  member of the VkDescriptorBufferInfo  structure for storage buffers.

        • When a descriptor of type DESCRIPTOR_TYPE_STORAGE_BUFFER  or DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC  is updated, the offset   must  be an integer multiple of this limit.

        • Similarly, dynamic offsets for storage buffers must  be multiples of this limit.

        • The value must  be a power of two.

      • minTexelBufferOffsetAlignment

        • Is the minimum required alignment, in bytes, for the offset  member of the VkBufferViewCreateInfo  structure for texel buffers.

  • Best practice :

    • Query the GPU at runtime and align your buffer ranges to the reported value.

    • Assert size at compile time:

    static_assert(sizeof(MaterialConstants) == 256, "MaterialConstants must be 256 bytes");
    
Default Layouts
Alignment Options
  • Offset and Stride Assignment .

  • There are different alignment requirements depending on the specific resources and on the features enabled.

  • Platform dependency :

    • 32-bit IEEE-754

      • The scalar value is 4 bytes.

      • The standard for desktop, mobile, OpenGL ES and Vulkan.

    • 16-bit half precision :

      • The scalar value is 2 bytes.

      • In rare cases, like embedded or custom OpenGL drivers.

    • 64-bit IEEE-754 double :

      • The scalar value is 8 bytes.

      • Non-standard case.

      • Would require headers redefining GLfloat  as double , not compliant with spec.

  • C layout ≈ std430  only if you manually match packing and alignment. Otherwise, it’s platform-dependent.

| GLSL type                        | C equivalent                                        | Typical C (x86_64) - Alignment |            Typical C (x86_64) - Size | Typical C (x86_64) - Stride |                                                                     std140 - Base Alignment |                std140 - Occupied Size |                          std140 - Stride | std430 - Base Alignment |                                std430 - Occupied Size |                             std430 - Stride |
| -------------------------------- | --------------------------------------------------- | -----------------------------: | -----------------------------------: | --------------------------: | -----------------------------------------------------------------------------------------: | ------------------------------------: | ---------------------------------------: | ----------------------: | ----------------------------------------------------: | ------------------------------------------: |
| bool                            | C _Bool  (native) — or use int32_t  to match GLSL |       _Bool : 1; int32_t : 4 |             _Bool : 1; int32_t : 4 |     _Bool : 1; int32_t : 4 |                                                                                          4 |                                     4 | 16 (std140 rounds scalar arrays to vec4) |                       4 |                                                     4 |                                           4 |
| int  / uint                    | int32_t  / uint32_t                               |                              4 |                                    4 |                           4 |                                                                                          4 |                                     4 |                                       16 |                       4 |                                                     4 |                                           4 |
| float                           | float                                              |                              4 |                                    4 |                           4 |                                                                                          4 |                                     4 |                                       16 |                       4 |                                                     4 |                                           4 |
| double                          | double                                             |                              8 |                                    8 |                           8 |                                                                                          8 |                                     8 |          32 (rounded to dvec4 alignment) |                       8 |                                                     8 |                                           8 |
| vec2  / ivec2                  | float[2]  / int32_t[2]                            |                              4 |                                    8 |                           8 |                                                                                          8 |                                     8 |                                       16 |                       8 |                                                     8 |                                           8 |
| vec3  / ivec3                  | float[3]  / int32_t[3]                            |                              4 |                                   12 |                          12 |                                                                                         16 |                                    16 |                                       16 |                      16 |                                                    16 |                                          16 |
| vec4  / ivec4                  | float[4]  / int32_t[4]                            |                              4 |                                   16 |                          16 |                                                                                         16 |                                    16 |                                       16 |                      16 |                                                    16 |                                          16 |
| dvec2                           | double[2]                                          |                              8 |                                   16 |                          16 |                                                                                         16 |                                    16 |                                       32 |                      16 |                                                    16 |                                          16 |
| dvec3                           | double[3]                                          |                              8 |                                   24 |                          24 |                                                                                         32 |                                    32 |                                       32 |                      32 |                                                    32 |                                          32 |
| dvec4                           | double[4]                                          |                              8 |                                   32 |                          32 |                                                                                         32 |                                    32 |                                       32 |                      32 |                                                    32 |                                          32 |
| mat2  (2×2 float, column-major) | float[2][2]  (2 columns of vec2 )                 |                              4 |                                   16 |             8 (column size) |                                                                                         16 |                           16 × 2 = 32 |      each column has vec4 as stride (16) |                       8 |                                            8 × 2 = 16 |          each column has vec2 as stride (8) |
| mat3  (3×3 float, column-major) | float[3][3]  (3 columns of vec3 )                 |                              4 |                                   36 |            12 (column size) |                                                                                         16 |                           16 × 3 = 48 |      each column has vec4 as stride (16) |                      16 |                                           16 × 3 = 48 |         each column has vec3 as stride (16) |
| mat4  (4×4 float)               | float[4][4]                                        |                              4 |                                   64 |            16 (column size) |                                                                                         16 |                           16 x 4 = 64 |      each column has vec4 as stride (16) |                      16 |                                           16 × 4 = 64 |         each column has vec4 as stride (16) |
| T[]  (Array of T)               | T[]                                                |                     alignof(T) |                            sizeof(T) |                   sizeof(T) | base_align(T), rounded up to vec4 base align (16 for 32-bit scalars; 32 for 64-bit/double) | occupied per element = rounded stride |          base_align(T), rounded up to 16 |           base_align(T) | occupied per element = sizeof(T) rounded to alignment |                               base_align(T) |
| vec3[]  (Array of vec3)         | float[3][]                                         |                              4 |                                   12 |                          12 |                                                                                         16 |                                    16 |                                       16 |                      16 |                                                    16 |                                          16 |
| struct                          | struct { ... }                                     |          max(member alignment) | struct size padded to that alignment |     sizeof(struct) (padded) |                                                  max(member align) rounded up to vec4 (16) |  struct size padded to multiple of 16 |          sizeof(struct) rounded up to 16 |       max(member align) |                  struct size padded to that alignment | sizeof(struct) (padded to member alignment) |

Scalar Alignment
  • Looks like std430 , but its vectors are even more compact: every member aligns only to the size of its scalar component.

  • The spec calls this scalar alignment .

  • EXT_scalar_block_layout .

    • Core in Vulkan 1.2.

    • This extension allows most storage types to be aligned in scalar  alignment.

    • Make sure to set --scalar-block-layout  when running the SPIR-V Validator.

    • A big difference is that values may straddle a 16-byte boundary, which the other layouts disallow.

    • In GLSL this is used with the scalar  layout qualifier, after enabling the GL_EXT_scalar_block_layout  extension.

Extended Alignment (std140)
  • Source .

  • Conservative, padded layout used for uniform blocks.

  • Widely supported.

  • Caveats :

    • "Avoiding usage of vec3"

      • Usually applies to std140, because some hardware vendors do not follow the spec strictly. Everything should work when using std430, though.

      • Array of vec3  (std140) :

        • Base alignment will be 16 bytes (4× the size of a float ).

        • Size will be alignment * number of elements .

// Scalars
    float ->  4 bytes // for 32-bit IEEE-754
    int   ->  4 bytes // for 32-bit IEEE-754
    uint  ->  4 bytes // for 32-bit IEEE-754
    bool  ->  4 bytes // for 32-bit IEEE-754
    
// Vectors
    // Base alignments
    vec2  ->  8 bytes  // 2 times the underlying scalar type.
    vec3  -> 16 bytes  // 4 times the underlying scalar type.
    vec4  -> 16 bytes  // 4 times the underlying scalar type.
    
// Arrays
    // Size of the element type, rounded up to a multiple of the size of `vec4` (behave like `vec4` slots).
    // Arrays of types are not necessarily tightly packed.
    // An array of floats in such a block will not be equivalent to an array of floats in C/C++. Arrays only match their C/C++ definitions if the element type's size is a multiple of 16 bytes.
    // Ex: `float arr[N]` uses 16 bytes per element.

// Matrices
    // Treated as arrays of vectors. 
    // They are column-major by default; you can change it with `layout(row_major)` or `layout(column_major)`.

// Struct
    // Base alignment: that of the biggest member, rounded up to a multiple of the size of `vec4` (structs behave like `vec4` slots).
    // Members are padded so that each starts at its own (rounded) base alignment.
    // The struct size is the space needed by its members, padded to a multiple of the struct's base alignment.
  • Examples :

    layout(std140) uniform U { float a[3]; }; // size = 3 * 16 = 48 bytes
    
Base Alignment (std430)
  • Allowed usage :

    • SSBOs, Push Constants.

    • KHR_uniform_buffer_standard_layout .

      • Core in Vulkan 1.2.

      • Allows the use of std430  memory layout in UBOs.

      • These memory layout changes are only applied to Uniforms .

    • KHR_relaxed_block_layout .

      • Core in Vulkan 1.1; all Vulkan 1.1+ devices support relaxed block layout.

      • This extension allows implementations to indicate they can support more variation in block Offset  decorations.

      • This comes up when using std430  memory layout, where a vec3  (which is 12 bytes) still has 16-byte alignment.

      • With relaxed block layout an application can fit a float  on either side of the vec3  and maintain the 16 byte alignment between them.

      • Currently there is no way in GLSL to legally express relaxed block layout, but a developer can use the --hlsl-offsets  with glslang  to produce the desired offsets.

  • A relaxed layout used for shader-storage blocks; it allows much tighter packing.

  • Requires GLSL 4.30+ or equivalent support.

// Scalars
    float ->  4 bytes // for 32-bit IEEE-754
    int   ->  4 bytes // for 32-bit IEEE-754
    uint  ->  4 bytes // for 32-bit IEEE-754
    bool  ->  4 bytes // for 32-bit IEEE-754
    
// Vectors
    // Base alignments
    vec2  ->  8 bytes  // 2 times the underlying scalar type.
    vec3  -> 16 bytes  // 4 times the underlying scalar type.
    vec4  -> 16 bytes  // 4 times the underlying scalar type.
    
// Arrays
    // Array stride is the element's size rounded up to its base alignment (no `vec4` rounding).
    // Ex: `float arr[N]` uses 4 bytes per element.

// Matrices
    // Treated as arrays of vectors. 
    // They are column-major by default; you can change it with `layout(row_major)` or `layout(column_major)`.
    
// Struct
    // Struct packing is tighter and follows natural alignments (less padding).
  • Examples :

    layout(std430, binding=0) buffer B { float a[3]; }; // size = 3 * 4 = 12 bytes
    

Comparisons

  • Input Attributes vs Uniform Buffers :

    • "If you want a shader to access a resource that is not  on vertex buffers, they must go through a descriptor set".

    • With Vertex Buffers we can pass arbitrary attributes to the vertex shader for each vertex.

    • Including global variables as vertex data would be a waste of memory, and it would require us to update the vertex buffer whenever the transformation changes.

    • The right way to tackle this in Vulkan is to use Resource Descriptors

  • Uniform Buffers vs Push Constants :

    • ChatGPT...

    • "Let's say I want to send the time, would it be optimal to send it as a UBO or a push constant? The time should not change between draw calls in the same frame. Seems like with a UBO I only need to set it once and every object would benefit from it, but a push constant needs to be sent for every object?"

      • Exactly. For your “time” value, a Uniform Buffer is generally the more optimal choice in this scenario.

      • Push Constants

        • You must call vkCmdPushConstants  for each command buffer section where shaders need it.

        • Since push constants are set per draw/dispatch scope, if you have many objects, you’d be redundantly re-sending the same value (time) multiple times in the same frame.

        • There’s no automatic “shared” state — every pipeline that uses it must get the value pushed explicitly.

      • Uniform Buffers

        • You can store the time in a uniform buffer once per frame, bind it once in a descriptor set, and then every draw call will see the same value without re-uploading.

        • Works well for “global” frame data (view/proj matrices, time, frame index, etc.).

        • Binding a pre-allocated UBO in a descriptor set has low overhead and avoids per-draw constant pushing.

      • Performance implication:

        • If the data is the same for all draws in a frame, a UBO avoids redundant driver calls and state changes, and makes it easier to keep the command buffer lean. Push constants are better suited for per-object or per-draw small data.

  • Storage Image vs. Storage Buffer :

    • While both storage images and storage buffers allow for read-write access in shaders, they have different use cases:

    • Storage Images :

      • Ideal for 2D or 3D data that benefits from texture operations like filtering or addressing modes.

    • Storage Buffers :

      • Better for arbitrary structured data or when you need to access data in a non-uniform pattern.

  • Texel Buffer vs. Storage Buffer :

    • Texel buffers and storage buffers also have different strengths:

    • Texel Buffers :

      • Provide texture-like access to buffer data, allowing for operations like filtering.

    • Storage Buffers :

      • More flexible for general-purpose data storage and manipulation.

  • Do

    • Do keep constant data small, where 128 bytes is a good rule of thumb.

    • Do use push constants if you do not want to set up a descriptor set/UBO system.

    • Do make constant data directly available in the shader if it is pre-determinable, such as with the use of specialization constants.

  • Avoid

    • Avoid indexing in the shader if possible, such as dynamically indexing into buffer  or uniform  arrays, as this can disable shader optimisations in some platforms.

  • Impact

    • Failing to use the correct method for constant data will negatively impact performance, causing reduced FPS and/or increased bandwidth and load/store activity.

    • On Mali, register mapped uniforms are effectively free. Any spilling to buffers in memory will increase load/store cache accesses to the per thread uniform fetches.

Input Attributes

About
  • The only shader stage in core Vulkan that has an input attribute controlled by Vulkan is the vertex shader stage ( SHADER_STAGE_VERTEX ).

    #version 450
    layout(location = 0) in vec3 inPosition;
    
    void main() {
        gl_Position = vec4(inPosition, 1.0);
    }
    
  • Other shader stages, such as a fragment shader stage, have input attributes, but the values are determined from the output of the previous stages run before it.

  • This involves declaring the interface slots when creating the VkPipeline  and then binding the VkBuffer  before draw time with the data to map.

  • Before calling vkCreateGraphicsPipelines  a VkPipelineVertexInputStateCreateInfo  struct will need to be filled out with a list of VkVertexInputAttributeDescription  mappings to the shader.

    VkVertexInputAttributeDescription input = {};
    input.location = 0;
    input.binding  = 0;
    input.format   = FORMAT_R32G32B32_SFLOAT; // maps to vec3
    input.offset   = 0;
    
  • The only thing left to do is bind the vertex buffer and optional index buffer prior to the draw call.

    vkBeginCommandBuffer();
    // ...
    vkCmdBindVertexBuffer();
    vkCmdDraw();
    // ...
    vkCmdBindVertexBuffer();
    vkCmdBindIndexBuffer();
    vkCmdDrawIndexed();
    // ...
    vkEndCommandBuffer();
    
  • Limits :

    • maxVertexInputAttributes

    • maxVertexInputAttributeOffset

Memory Layout
  • Interleaved attributes (array-of-structs) :

    • Single binding.

  • Separate streams (struct-of-arrays) :

    • One binding per attribute.

  • One binding or many bindings? It doesn't matter much; in some cases one is better than the other, so don't worry too much about it.

Vertex Input Binding / Vertex Buffer
  • Tell Vulkan how to pass this data format to the vertex shader once it's been uploaded into GPU memory.

  • A vertex binding describes at which rate to load data from memory throughout the vertices.

  • It specifies the number of bytes between data entries and whether to move to the next data entry after each vertex or after each instance.

  • VkVertexInputBindingDescription .

    • binding

      • Specifies the index of the binding in the array of bindings.

    • stride

      • Specifies the number of bytes from one entry to the next.

    • inputRate

      • VERTEX_INPUT_RATE_VERTEX

        • Move to the next data entry after each vertex.

      • VERTEX_INPUT_RATE_INSTANCE

        • Move to the next data entry after each instance.

      • We're not going to use instanced rendering, so we'll stick to per-vertex data.

  • VkVertexInputAttributeDescription

    • Describes how to handle vertex input.

    • An attribute description struct describes how to extract a vertex attribute from a chunk of vertex data originating from a binding description.

    • We have two attributes, position and color, so we need two attribute description structs.

    • binding

      • Tells Vulkan from which binding the per-vertex data comes.

    • location

      • References the location  directive of the input in the vertex shader.

        • The input in the vertex shader with location 0  is the position, which has two 32-bit float components.

    • format

      • Describes the type of data for the attribute.

      • Implicitly defines the byte size of attribute data.

      • A bit confusingly, the formats are specified using the same enumeration as color formats.

      • The following shader types and formats are commonly used together:

        • float : FORMAT_R32_SFLOAT

        • vec2 : FORMAT_R32G32_SFLOAT

        • vec3 : FORMAT_R32G32B32_SFLOAT

        • vec4 : FORMAT_R32G32B32A32_SFLOAT

      • As you can see, you should use the format where the amount of color channels matches the number of components in the shader data type.

      • It is allowed to use more channels than the number of components in the shader, but they will be silently discarded.

        • If the number of channels is lower than the number of components, then the BGA components will use default values of (0, 0, 1) .

      • The color type ( SFLOAT , UINT , SINT ) and bit width should also match the type of the shader input. See the following examples:

        • ivec2 : FORMAT_R32G32_SINT , a 2-component vector of 32-bit signed integers

        • uvec4 : FORMAT_R32G32B32A32_UINT , a 4-component vector of 32-bit unsigned integers

        • double : FORMAT_R64_SFLOAT , a double-precision (64-bit) float

    • offset

      • Specifies the number of bytes since the start of the per-vertex data to read from.

  • Graphics Pipeline Vertex Input Binding :

    • For the following vertices:

      Vertex :: struct {
          pos:   eng.Vec2,
          color: eng.Vec3,
      }
      
      vertices := [?]Vertex{
          { {  0.0, -0.5 }, { 1.0, 0.0, 0.0 } },
          { {  0.5,  0.5 }, { 0.0, 1.0, 0.0 } },
          { { -0.5,  0.5 }, { 0.0, 0.0, 1.0 } },
      }
      
    • We setup this in the Graphics Pipeline creation:

      vertex_binding_descriptor := vk.VertexInputBindingDescription{
          binding   = 0,
          stride    = size_of(Vertex),
          inputRate = .VERTEX,
      }
      vertex_attribute_descriptor := [?]vk.VertexInputAttributeDescription{
          {
              binding  = 0,
              location = 0,
              format   = .R32G32_SFLOAT,
              offset   = cast(u32)offset_of(Vertex, pos),
          },
          {
              binding  = 0,
              location = 1,
              format   = .R32G32B32_SFLOAT,
              offset   = cast(u32)offset_of(Vertex, color),
          },
      }
      vertex_input_create_info := vk.PipelineVertexInputStateCreateInfo {
          sType                           = .PIPELINE_VERTEX_INPUT_STATE_CREATE_INFO,
          vertexBindingDescriptionCount   = 1,
          pVertexBindingDescriptions      = &vertex_binding_descriptor,
          vertexAttributeDescriptionCount = len(vertex_attribute_descriptor),
          pVertexAttributeDescriptions    = &vertex_attribute_descriptor[0],
      }
      
    • The pipeline is now ready to accept vertex data in the format of the vertices  container and pass it on to our vertex shader.

  • Vertex Buffer :

    • If you run the program now with validation layers enabled, you'll see that it complains that there is no vertex buffer bound to the binding.

    • The next step is to create a vertex buffer and move the vertex data to it so the GPU is able to access it.

    • Creating :

      • Follow the tutorial for creating a buffer, specifying BUFFER_USAGE_VERTEX_BUFFER  as the BufferCreateInfo   usage .

Index Buffer
  • Motivation :

    • Drawing a rectangle takes two triangles, which means that we need a vertex buffer with six vertices. The problem is that the data of two vertices needs to be duplicated, resulting in redundancies.

    • The solution to this problem is to use an index buffer.

    • An index buffer is essentially an array of pointers into the vertex buffer.

    • It allows you to reorder the vertex data, and reuse existing data for multiple vertices.

    • Ex: a rectangle drawn from four vertices and six indices (e.g., { 0, 1, 2, 2, 3, 0 } ).

      • The first three indices define the upper-right triangle, and the last three indices define the vertices for the bottom-left triangle.

    • It is possible to use either uint16_t  or uint32_t  for your index buffer depending on the number of entries in vertices . We can stick to uint16_t  for now because we're using fewer than 65535 unique vertices.

    • Just like the vertex data, the indices need to be uploaded into a VkBuffer  for the GPU to be able to access them.

  • Creating :

    • Follow the tutorial for creating a buffer, specifying BUFFER_USAGE_INDEX_BUFFER  as the BufferCreateInfo   usage .

  • Using :

    • We first need to bind the index buffer, just like we did for the vertex buffer.

    • The difference is that you can only have a single  index buffer. It’s unfortunately not possible to use different indices for each vertex attribute, so we do still have to completely duplicate vertex data even if just one attribute varies.

    • An index buffer is bound with vkCmdBindIndexBuffer  which has the index buffer, a byte offset into it, and the type of index data as parameters.

      • As mentioned before, the possible types are INDEX_TYPE_UINT16  and INDEX_TYPE_UINT32 .

    • Just binding an index buffer doesn’t change anything yet, we also need to change the drawing command to tell Vulkan to use the index buffer.

    • Remove  the vkCmdDraw  line and replace it with vkCmdDrawIndexed .

Push Constants

  • A Push Constant is a small bank of values accessible in shaders.

  • These are designed for small amounts (a few dwords) of high-frequency data, updated per-recording of the command buffer.

  • So that the shader knows where this data will arrive, we declare a push_constant  block in our shader code.

layout(push_constant) uniform MeshData {
    mat4 model;
} mesh_data;
  • Choosing to use Push Constants :

    • In early implementations of Vulkan on Arm Mali, this was usually the fastest way of pushing data to your shaders. In more recent times, we have observed on Mali devices that overall  they can be slower. If performance is something you are trying to maximise on Mali devices, descriptor sets may be the way to go. However, other devices may still favour push constants.

    • Having said this, descriptor sets are one of the more complex features of Vulkan, making the convenience of push constants still worth considering as a go-to method, especially if working with trivial data.

  • Limits :

    • maxPushConstantsSize

      • Guaranteed to be at least 128  bytes on all devices.

      • In Vulkan 1.4 the minimum was increased to 256  bytes.

  • Push Constants .

Offsets
  • Ex1 :

    layout(push_constant, std430) uniform pc {
        layout(offset = 32) vec4 data;
    };
    
    layout(location = 0) out vec4 outColor;
    
    void main() {
       outColor = data;
    }
    
    VkPushConstantRange range = {};
    range.stageFlags = SHADER_STAGE_FRAGMENT;
    range.offset = 32;
    range.size = 16;
    
Updating
  • Ex1 :

    • Push constants can be incrementally updated over the course of a command buffer.

    // vkBeginCommandBuffer()
    vkCmdBindPipeline();
    vkCmdPushConstants(offset: 0, size: 16, value = [0, 0, 0, 0]);
    vkCmdDraw(); // values = [0, 0, 0, 0]
    
    vkCmdPushConstants(offset: 4, size: 8, value = [1, 1]);
    vkCmdDraw(); // values = [0, 1, 1, 0]
    
    vkCmdPushConstants(offset: 8, size: 8, value = [2, 2]);
    vkCmdDraw(); // values = [0, 1, 2, 2]
    // vkEndCommandBuffer()
    
    • Interesting how old values are kept. Values that were not changed are preserved.

Lifetime
  • vkCmdPushConstants  is tied to a VkPipelineLayout , which is why the layouts must be compatible before a call to a command such as vkCmdDraw() .

  • Because push constants are not tied to descriptors, the use of vkCmdBindDescriptorSets  has no effect on the lifetime or pipeline layout compatibility  of push constants.

  • Just as it is possible to bind descriptor sets that are never used by the shader, the same is true for push constants.

CPU Performance
  • Push one struct once per draw instead of many separate vkCmdPushConstants calls (one call writing a small struct is far cheaper).

  • Many small state changes cause the driver to update internal tables, validate, or patch commands — that’s CPU work and cannot be avoided without batching.

  • Observations :

    • 5 push calls were taking 7.65us. I grouped them all into 1 single push call, now taking 3.08us.

    • This was substantial, as at the time I was issuing these push calls hundreds of times per frame; I later reduced this number, but it could still be significant.

Descriptor Sets

About

  • VkDescriptorSet

  • One Descriptor -> One Resource.

  • They are always organized in Descriptor Sets.

    • One or more descriptors contained.

    • Combine descriptors which are used in conjunction.

  • A handle or pointer into a resource.

    • Note that it is not just a pointer, but a pointer + metadata.

  • A core mechanism used to bind resources to shaders.

  • Holds the binding information that connects shader inputs to data such as VkBuffer  resources and VkImage  textures.

  • Think of it as a set of GPU-side pointers that you bind once.

  • The internal representation of a descriptor set is whatever the driver wants it to be.

  • Article by Arseny Kapoulkine .

  • Sample talking about best practices .

  • Content :

    • Where to find a Resource.

    • Usage type of a Resource.

    • Offsets, sometimes.

    • Some metadata, sometimes.

  • Example :


    // Note - only set 0 and 2 are used in this shader
    layout(set = 0, binding = 0) uniform sampler2D myTextureSampler;
    
    layout(set = 0, binding = 2) uniform uniformBuffer0 {
        float someData;
    } ubo_0;
    
    layout(set = 0, binding = 3) uniform uniformBuffer1 {
        float moreData;
    } ubo_1;
    
    layout(set = 2, binding = 0) buffer storageBuffer {
        float myResults;
    } ssbo;
    

  • Limits :

    • maxBoundDescriptorSets

    • Per-stage limits :

      • maxPerStageResources

      • maxPerStageDescriptorSamplers

      • maxPerStageDescriptorUniformBuffers

      • maxPerStageDescriptorStorageBuffers

      • maxPerStageDescriptorSampledImages

      • maxPerStageDescriptorStorageImages

      • maxPerStageDescriptorInputAttachments

    • Per-type limits (totals across all sets in a pipeline layout) :

      • maxDescriptorSetSamplers

      • maxDescriptorSetUniformBuffers

      • maxDescriptorSetUniformBuffersDynamic

      • maxDescriptorSetStorageBuffers

      • maxDescriptorSetStorageBuffersDynamic

      • maxDescriptorSetSampledImages

      • maxDescriptorSetStorageImages

      • maxDescriptorSetInputAttachments

    • VkPhysicalDeviceDescriptorIndexingProperties  if using Descriptor Indexing

    • VkPhysicalDeviceInlineUniformBlockPropertiesEXT  if using Inline Uniform Block

  • Visual explanation {0:00 -> 5:35} .

    • Nice.

    • The rest of the video is meh.

Difficulties
  • Problems :

    • "They are not bad but they very much force a specific rendering style: you have triple / quadrupled nested for loops, binding your things based on usage and then rebind descriptor sets as needed."

    • "Many of us are moving towards bindless rendering, where you just bind everything once in one big descriptor set, and then index into it at will; tho, Vulkan 1.0 does not greatly support, and also the descriptor count for it was quite low".

    • Cannot update descriptors after binding in a command buffer.

    • All descriptors must be valid, even if not used.

    • Descriptor arrays must be sampled uniformly.

      • Different invocations can’t use different indices.

      • Can sample “dynamically uniform”, e.g. runtime-based index.

    • Upper limit on descriptor counts.

    • Discourages GPU-driven rendering architectures.

      • Due to the need to set up descriptor sets per draw call it’s hard to adapt any of the aforementioned schemes to GPU-based culling or command submission.

  • Solutions :

    • Descriptor Indexing :

      • Available in 1.3, optional in 1.2, or EXT_descriptor_indexing .

      • Update descriptors after binding.

      • Update unused descriptors.

      • Relax requirement that all descriptors must be valid, even if unused.

      • Non-uniform array indexing.

    • Buffer Device Address :

      • Available in 1.3, optional in 1.2, or KHR_buffer_device_address .

      • Directly access buffers through addresses without a descriptor.

      • See [[#Physical Storage Buffer]] below.

    • Descriptor Buffers – EXT_descriptor_buffer :

      • Manage descriptors directly.

      • Similar to D3D12’s descriptor model .

Allocation

  • A scheme that works well is to use free lists of descriptor set pools; whenever you need a descriptor set pool, you allocate one from the free list and use it for subsequent descriptor set allocations in the current frame on the current thread. Once you run out of descriptor sets in the current pool, you allocate a new pool. Any pools that were used in a given frame need to be kept around; once the frame has finished rendering, as determined by the associated fence objects, the descriptor set pools can be reset via vkResetDescriptorPool  and returned to free lists. While it’s possible to free individual descriptors from a pool via DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET , this complicates the memory management on the driver side and is not recommended.

  • When a descriptor set pool is created, the application specifies the maximum number of descriptor sets allocated from it, as well as the maximum number of descriptors of each type that can be allocated from it. In Vulkan 1.1, the application doesn’t have to handle accounting for these limits – it can just call vkAllocateDescriptorSets and handle the error from that call by switching to a new descriptor set pool. Unfortunately, in Vulkan 1.0 without any extensions, it’s an error to call vkAllocateDescriptorSets if the pool does not have available space, so the application must track the number of sets and descriptors of each type to know beforehand when to switch to a different pool.

  • Different pipeline objects may use different numbers of descriptors, which raises the question of pool configuration. A straightforward approach is to create all pools with the same configuration that uses the worst-case number of descriptors for each type – for example, if each set can use at most 16 texture and 8 buffer descriptors, one can allocate all pools with maxSets=1024, and pool sizes 16*1024 for texture descriptors and 8*1024 for buffer descriptors. This approach can work but in practice it can result in very significant memory waste for shaders with different descriptor counts – you can’t allocate more than 1024 descriptor sets out of a pool with the aforementioned configuration, so if most of your pipeline objects use 4 textures, you’ll be wasting 75% of texture descriptor memory.

  • Strategies :

    • Two alternatives that provide a better balance of memory use:

    1. Measure an average number of descriptors used in a shader pipeline per type for a characteristic scene and allocate pool sizes accordingly. For example, if in a given scene we need 3000 descriptor sets, 13400 texture descriptors, and 1700 buffer descriptors, then the average number of descriptors per set is 4.47 textures (rounded up to 5) and 0.57 buffers (rounded up to 1), so a reasonable configuration of a pool is maxSets=1024, 5*1024 texture descriptors, 1024 buffer descriptors. When a pool is out of descriptors of a given type, we allocate a new one – so this scheme is guaranteed to work and should be reasonably efficient on average.

    2. Group shader pipeline objects into size classes, approximating common patterns of descriptor use, and pick descriptor set pools using the appropriate size class. This is an extension of the scheme described above to more than one size class. For example, it’s typical to have large numbers of shadow/depth prepass draw calls, and large numbers of regular draw calls in a scene – but these two groups have different numbers of required descriptors, with shadow draw calls typically requiring 0 to 1 textures per set and 0 to 1 buffers when dynamic buffer offsets are used. To optimize memory use, it’s more appropriate to allocate descriptor set pools separately for shadow/depth and other draw calls. Similarly to general-purpose allocators that can have size classes that are optimal for a given application, this can still be managed in a lower-level descriptor set management layer as long as it’s configured with application specific descriptor set usages beforehand.

Implementation
  • Descriptors are like pointers, so as any pointer they need to allocate space to live ahead of time.

  • How many :

    • It's possible to have 1 very big descriptor pool that handles the entire engine, but that means we need to know what descriptors we will be using for everything ahead of time.

    • That can be very tricky to do at scale. Instead, we will keep it simpler, and we will have multiple descriptor pools for different parts of the project , and try to be more accurate with them.

      • I don't know what that actually means in practice.

  • VkDescriptorPool .

    • Maintains a pool of descriptors, from which descriptor sets are allocated.

    • Descriptor pools are externally synchronized, meaning that the application must  not allocate and/or free descriptor sets from the same pool in multiple threads simultaneously.

    • They are very opaque.

    • VkDescriptorPoolCreateInfo .

      • Contains a type of descriptor (same VkDescriptorType  as on the bindings above ), alongside a ratio to multiply the maxSets  parameter by.

      • This lets us directly control how big the pool is going to be. maxSets  controls how many VkDescriptorSets  we can create from the pool in total, and the pool sizes specify how many individual descriptors of a given type the pool owns.

      • flags .

        • Is a bitmask of VkDescriptorPoolCreateFlagBits  specifying certain supported operations on the pool.

        • DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET

          • Determines if individual descriptor sets can be freed or not:

          • We're not going to touch the descriptor set after creating it, so we don't need this flag. You can leave flags  to its default value of 0 .

        • DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND

          • Descriptor pool creation may  fail with the error ERROR_FRAGMENTATION  if the total number of descriptors across all pools (including this one) created with this bit set exceeds maxUpdateAfterBindDescriptorsInAllPools , or if fragmentation of the underlying hardware resources occurs.

      • maxSets

        • Is the maximum number of descriptor sets that can  be allocated from the pool.

      • poolSizeCount

        • Is the number of elements in pPoolSizes .

      • pPoolSizes

        • Is a pointer to an array of VkDescriptorPoolSize  structures, each containing a descriptor type and number of descriptors of that type to be allocated in the pool.

        • If multiple VkDescriptorPoolSize  structures containing the same descriptor type appear in the pPoolSizes  array then the pool will be created with enough storage for the total number of descriptors of each type.

        • VkDescriptorPoolSize .

          • type

            • Is the type of descriptor.

          • descriptorCount

            • Is the number of descriptors of that type to allocate. If type  is DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK  then descriptorCount  is the number of bytes to allocate for descriptors of this type.

  • VkDescriptorSetAllocateInfo

    • descriptorPool

      • Is the pool which the sets will be allocated from.

    • descriptorSetCount

      • Determines the number of descriptor sets to be allocated from the pool.

    • pSetLayouts

      • Is a pointer to an array of descriptor set layouts, with each member specifying how the corresponding descriptor set is allocated.

  • vkAllocateDescriptorSets() .

    • The allocated descriptor sets are returned in pDescriptorSets .

    • When a descriptor set is allocated, the initial state is largely uninitialized and all descriptors are undefined, with the exception that samplers with a non-null pImmutableSamplers  are initialized on allocation.

    • Descriptors also become undefined if the underlying resource or view object is destroyed.

    • Descriptor sets containing undefined descriptors can  still be bound and used, subject to the following conditions:

      • For descriptor set bindings created with the PARTIALLY_BOUND  bit set:

        • All descriptors in that binding that are dynamically used must  have been populated before the descriptor set is consumed .

      • For descriptor set bindings created without the PARTIALLY_BOUND  bit set:

        • All descriptors in that binding that are statically used must  have been populated before the descriptor set is consumed .

      • Descriptor bindings with descriptor type of DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK   can  be undefined when the descriptor set is consumed ; though values in that block will be undefined.

      • Entries that are not used by a pipeline can  have undefined descriptors.

    • pAllocateInfo

      • Is a pointer to a VkDescriptorSetAllocateInfo  structure describing parameters of the allocation.

    • pDescriptorSets

      • Is a pointer to an array of VkDescriptorSet  handles in which the resulting descriptor set objects are returned.

  • Multithreading :

    • Descriptor pools are externally synchronized, meaning that the application must  not allocate and/or free descriptor sets from the same pool in multiple threads simultaneously.

    • Descriptor pools are used to allocate, free, reset, and update descriptor sets. By creating multiple descriptor pools, each application host thread is able to manage descriptor sets in its own pool at the same time.

Best Practices
  • Don’t allocate descriptor sets if nothing in the set changed. In the model with slots that are shared between different stages, this can mean that if no textures are set between two draw calls, you don’t need to allocate the descriptor set with texture descriptors.

  • Don't allocate descriptor sets from descriptor pools on performance critical code paths.

  • Don't allocate, free or update descriptor sets every frame, unless it is necessary.

  • Don't set DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET  if you do not need to free individual descriptor sets.

    • Setting DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET  may prevent the implementation from using a simpler (and faster) allocator.

Descriptor Types

Overview
  • For buffers, the application must choose between uniform and storage buffers, and whether to use dynamic offsets or not. Uniform buffers have a limit on the maximum addressable size – on desktop hardware, you get up to 64 KB of data, however on mobile hardware some GPUs only provide 16 KB (which is also the minimum guaranteed by the specification). The buffer resource can be larger than that, but a shader can only access this much data through one descriptor.

  • On some hardware, there is no difference in access speed between uniform and storage buffers, however for other hardware depending on the access pattern uniform buffers can be significantly faster. Prefer uniform buffers for small to medium sized data especially if the access pattern is fixed (e.g. for a buffer with material or scene constants). Storage buffers are more appropriate when you need large arrays of data that need to be larger than the uniform buffer limit and are indexed dynamically in the shader.

  • For textures, if filtering is required, there is a choice of a combined image/sampler descriptor (where, like in OpenGL, the descriptor specifies both the source of the texture data and the filtering/addressing properties), separate image and sampler descriptors (which maps better to the Direct3D 11 model), and an image descriptor with an immutable sampler descriptor, where the sampler properties must be specified when the pipeline object is created.

  • The relative performance of these methods is highly dependent on the usage pattern; however, in general immutable descriptors map better to the recommended usage model in other newer APIs like Direct3D 12, and give the driver more freedom to optimize the shader. This does alter renderer design to a certain extent, making it necessary to implement certain dynamic portions of the sampler state, like per-texture LOD bias for texture fade-in during streaming, using shader ALU instructions.

Storage Images
  • DESCRIPTOR_TYPE_STORAGE_IMAGE

  • Is a descriptor type that allows shaders to read from and write to an image without using a fixed-function graphics pipeline.

  • This is particularly useful for compute shaders and advanced rendering techniques.

  • Storage Images and Implementation .

// FORMAT_R32_UINT
layout(set = 0, binding = 0, r32ui) uniform uimage2D storageImage;

// example usage for reading and writing in GLSL
const uvec4 texel = imageLoad(storageImage, ivec2(0, 0));
imageStore(storageImage, ivec2(1, 1), texel);
  • Use cases :

    • Image Processing :

      • Storage images are ideal for image processing tasks like filters, blurs, and other post-processing effects.

Sampler
  • DESCRIPTOR_TYPE_SAMPLER  and DESCRIPTOR_TYPE_SAMPLED_IMAGE .

layout(set = 0, binding = 0) uniform sampler samplerDescriptor;
layout(set = 0, binding = 1) uniform texture2D sampledImage;

// example usage of using texture() in GLSL
vec4 data = texture(sampler2D(sampledImage,  samplerDescriptor), vec2(0.0, 0.0));
Combined Image Sampler
  • DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER

  • On some implementations, it may  be more efficient to sample from an image using a combination of sampler and sampled image that are stored together in the descriptor set in a combined descriptor.

layout(set = 0, binding = 0) uniform sampler2D combinedImageSampler;

// example usage of using texture() in GLSL
vec4 data = texture(combinedImageSampler, vec2(0.0, 0.0));
Uniform Buffer / UBO (Uniform Buffer Object)
layout(set = 0, binding = 0) uniform uniformBuffer {
    float a;
    int b;
} ubo;

// example of reading from UBO in GLSL
int x = ubo.b + 1;
vec3 y = vec3(ubo.a);
  • Uniform Buffers commonly use std140  layout (strict alignment rules, predictable padding).

    • Source: ChatGPT. I want to confirm.

/* UBO: small read-only data (std140) */
layout(set = 0, binding = 0, std140) uniform SceneParams {
    mat4 viewProj;
    vec4 lightPos;
    float time;
} scene;
  • UBO (Uniform Buffer Object) :

    • “Uniform buffer object” is more of an OpenGL-era name, but some Vulkan tutorials and developers still use it informally to mean the same thing — the buffer that holds uniform data.

Storage Buffer / SSBO (Shader Storage Buffer Object)
  • DESCRIPTOR_TYPE_STORAGE_BUFFER

  • GLSL uses distinct address spaces: uniform  → UBO, buffer  → SSBO.

  • Use std430  layout by default (tighter packing, fewer padding requirements).

  • SSBO (Shader Storage Buffer Object) is an OpenGL term.

// Implicit std430 (default)
layout(set = 0, binding = 0) buffer storageBuffer {
    float a;
    int b;
} ssbo;

// Explicit std430
layout(set = 0, binding = 1, std430) buffer ParticleData {
    vec4 pos[];
} particles;

// Reading and writing to a SSBO in GLSL
ssbo.a = ssbo.a + 1.0;
ssbo.b = ssbo.b + 1;
  • BufferBlock  and Uniform  would have been seen prior to KHR_storage_buffer_storage_class .

  • Storage buffers can also have dynamic offsets at bind time via DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC .

  • Why SSBO for dynamic arrays :

    • std430  allows tight packing and runtime-sized arrays (T data[]), which is ideal for dynamic-length storage.

    • SSBOs allow arbitrary indexing, read/write, and atomics.

    • maxStorageBufferRange is usually much larger than maxUniformBufferRange .

    • You can use *_DYNAMIC  descriptors to bind multiple subranges of one large backing buffer cheaply.

  • Many arrays :

    • A buffer block may contain multiple arrays, but only the last member of the block may be a runtime-sized (unsized) array T x[] . All other arrays must be fixed-size (compile-time constant) or you must implement sizing/offsets yourself.

      • This is invalid , even with descriptor indexing:

      layout(std430, set = 0, binding = 0) buffer FixedArrays { 
          vec4 A[]; 
          vec2 B[]; 
          mat4 C[]; 
          some_struct D[];
      } fixedArrays;
      
    1. Use a uint x[] :

      • 32-bit words; simplest and portable.

      • This is effectively an untyped byte/word blob stored in the SSBO, and you manually reinterpret (cast) it in the shader.

      layout(std430, set = 0, binding = 0) buffer PackedBytes {
          uint countA;   // number of A elements
          uint offsetA;  // offset into data[] in uint words
          uint countB;
          uint offsetB;  // offset into data[] in uint words
          uint countC;
          uint offsetC;
      
          uint data[];   // payload in 32-bit words
      } pb;
      
      // helpers
      float readFloat(uint baseWordIndex) {
          return uintBitsToFloat(pb.data[baseWordIndex]);
      }
      
      vec2 readVec2(uint baseWordIndex) {
          return vec2(
              uintBitsToFloat(pb.data[baseWordIndex + 0]),
              uintBitsToFloat(pb.data[baseWordIndex + 1])
          );
      }
      
      vec3 readVec3(uint baseWordIndex) {
          return vec3(
              uintBitsToFloat(pb.data[baseWordIndex + 0]),
              uintBitsToFloat(pb.data[baseWordIndex + 1]),
              uintBitsToFloat(pb.data[baseWordIndex + 2])
          );
      }
      
      vec4 readVec4(uint baseWordIndex) {
          return vec4(
              uintBitsToFloat(pb.data[baseWordIndex + 0]),
              uintBitsToFloat(pb.data[baseWordIndex + 1]),
              uintBitsToFloat(pb.data[baseWordIndex + 2]),
              uintBitsToFloat(pb.data[baseWordIndex + 3])
          );
      }
      
      mat4 readMat4(uint baseWordIndex) {
          // mat4 stored column-major as 16 floats (4 columns of vec4)
          return mat4(
              readVec4(baseWordIndex + 0),
              readVec4(baseWordIndex + 4),
              readVec4(baseWordIndex + 8),
              readVec4(baseWordIndex + 12)
          );
      }
      
    2. Use a vec4 x[] :

      • 128-bit blocks; simpler alignment for vec4/mat4 data.

      // Pack everything into vec4 blocks for simple alignment
      layout(std430, set = 0, binding = 0) buffer Packed {
          uint countA;
          uint offsetA; // in vec4-blocks
          uint countB;
          uint offsetB; // in vec4-blocks
          uint countC;
          uint offsetC; // in vec4-blocks
          uint countD;
          uint offsetD; // in vec4-blocks
      
          vec4 blocks[]; // single runtime-sized array (last member)
      } packed;
      
      // helpers
      vec4 getA(uint i) {
          return packed.blocks[packed.offsetA + i];
      }
      
      vec2 getB(uint i) {
          return packed.blocks[packed.offsetB + i].xy; // we store each B in one vec4 block
      }
      
      mat4 getC(uint i) {
          uint base = packed.offsetC + i * 4; // mat4 occupies 4 vec4 blocks
          return mat4(packed.blocks[base + 0],
                      packed.blocks[base + 1],
                      packed.blocks[base + 2],
                      packed.blocks[base + 3]);
      }
      
      // for some_struct D that we store as 1 vec4 per element:
      some_struct getD(uint i) {
          vec4 v = packed.blocks[packed.offsetD + i];
          // decode v -> some_struct fields and return the result
      }
      
    3. Use many SSBOs:

      layout(std430, set=0, binding=0) buffer BufA { vec4 A[]; } bufA;
      layout(std430, set=0, binding=1) buffer BufB { vec2 B[]; } bufB;
      layout(std430, set=0, binding=2) buffer BufC { mat4 C[]; } bufC;
      layout(std430, set=0, binding=3) buffer BufD { some_struct D[]; } bufD;
      
Texel Buffer
  • Texel buffers are a way to access buffer data with texture-like operations in shaders.

  • Texel Buffers and Implementation .

  • Compatibility Requirements .

    • The format specified in the shader (SPIR-V Image Format) must exactly match  the format used when creating the VkImageView (Vulkan Format).

  • Best Practices .

  • Uniform Texel Buffer :

    • DESCRIPTOR_TYPE_UNIFORM_TEXEL_BUFFER

    • Read-only access.

    layout(set = 0, binding = 0) uniform textureBuffer uniformTexelBuffer;
    
    // example of reading texel buffer in GLSL
    vec4 data = texelFetch(uniformTexelBuffer, 0);
    
    • Use cases :

      • Lookup Tables :

        • Uniform texel buffers are useful for implementing lookup tables that need to be accessed with texture-like operations.

  • Storage Texel Buffer :

    • DESCRIPTOR_TYPE_STORAGE_TEXEL_BUFFER

    • Read-write access.

    // FORMAT_R8G8B8A8_UINT
    layout(set = 0, binding = 0, rgba8ui) uniform uimageBuffer storageTexelBuffer;
    
    // example of reading and writing texel buffer in GLSL
    int offset = int(gl_GlobalInvocationID.x);
    uvec4 data = imageLoad(storageTexelBuffer, offset);
    imageStore(storageTexelBuffer, offset, uvec4(0));
    
    • Use cases :

      • Particle Systems :

        • Storage texel buffers can be used to store and update particle data in a compute shader, which can then be read by a vertex shader for rendering.

Input Attachment
  • DESCRIPTOR_TYPE_INPUT_ATTACHMENT

layout (input_attachment_index = 0, set = 0, binding = 0) uniform subpassInput inputAttachment;

// example loading the attachment data in GLSL
vec4 data = subpassLoad(inputAttachment);

Updates

Implementation
  • A Descriptor Set, even though created and allocated, is still empty. We need to fill it up with data.

  • Updates must  happen outside of command buffer recording and execution.

    • No update after vkCmdBindDescriptorSets() .

    • Usually you update before vkBeginCommandBuffer()  or after vkQueueSubmit() , once synchronization guarantees the command buffer has finished executing.

  • If using Descriptor Indexing :

    • Descriptors can be updated after binding in command buffers.

      • Command buffer execution will use most recent updates.


  • VkWriteDescriptorSet .

    • dstSet

      • Is the destination descriptor set to update.

    • dstBinding

      • Is the descriptor binding within that set.

    • dstArrayElement

      • Remember that descriptors can be arrays, so we also need to specify the first index in the array that we want to update.

      • If not using an array, the index is simply 0 .

      • Is the starting element in that array.

      • If the descriptor binding identified by dstSet  and dstBinding  has a descriptor type of DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK  then dstArrayElement  specifies the starting byte offset within the binding.

    • descriptorCount

      • It's a descriptor count, not  a descriptor SET count!!

      • Is the number of descriptors to update.

      • If the descriptor binding identified by dstSet  and dstBinding  has a descriptor type of DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK , then descriptorCount  specifies the number of bytes to update.

      • Otherwise, descriptorCount  is the number of elements in pImageInfo , pBufferInfo , or pTexelBufferView  (whichever array is used for this descriptor type).

    • descriptorType

      • We need to specify the type of descriptor again

      • Is a VkDescriptorType  specifying the type of each descriptor in pImageInfo , pBufferInfo , or pTexelBufferView .

      • It must  be the same type as the descriptorType  specified in VkDescriptorSetLayoutBinding  for dstSet  at dstBinding , except  if VkDescriptorSetLayoutBinding  for dstSet  at dstBinding  is equal to DESCRIPTOR_TYPE_MUTABLE_EXT .

      • The type of the descriptor also controls which array the descriptors are taken from.

    • pBufferInfo

      • Is a pointer to an array of VkDescriptorBufferInfo  structures or is ignored, as described below.

      • VkDescriptorBufferInfo .

        • Structure specifying descriptor buffer information

        • Specifies the buffer and the region within it that contains the data for the descriptor.

        • buffer

          • Is the buffer resource.

        • offset

          • Is the offset in bytes from the start of buffer .

          • Access to buffer memory via this descriptor uses addressing that is relative to this starting offset.

          • For DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC  and DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC  descriptor types:

            • offset  is the base offset from which the dynamic offset is applied.

        • range

          • Is the size in bytes that is used for this descriptor update, or WHOLE_SIZE  to use the range from offset  to the end of the buffer.

            • When range  is WHOLE_SIZE  the effective range is calculated at vkUpdateDescriptorSets  by taking the size of buffer  minus the offset .

          • For DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC  and DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC  descriptor types:

            • range  is the static size used for all dynamic offsets.

    • pImageInfo

      • Is a pointer to an array of VkDescriptorImageInfo  structures or is ignored, as described below.

      • VkDescriptorImageInfo .

        • imageLayout

          • Is the layout that the image subresources accessible from imageView  will be in at the time this descriptor is accessed.

          • Is used in descriptor updates for types DESCRIPTOR_TYPE_SAMPLED_IMAGE , DESCRIPTOR_TYPE_STORAGE_IMAGE , DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER , and DESCRIPTOR_TYPE_INPUT_ATTACHMENT .

        • imageView

          • Is an image view handle or NULL_HANDLE .

          • Is used in descriptor updates for types DESCRIPTOR_TYPE_SAMPLED_IMAGE , DESCRIPTOR_TYPE_STORAGE_IMAGE , DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER , and DESCRIPTOR_TYPE_INPUT_ATTACHMENT .

        • sampler

          • Is a sampler handle.

          • Is used in descriptor updates for types DESCRIPTOR_TYPE_SAMPLER  and DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER  if the binding being updated does not use immutable samplers.

    • pTexelBufferView

      • Is a pointer to an array of VkBufferView  handles, or is ignored, as described below.

  • vkUpdateDescriptorSets() .

    • descriptorWriteCount

      • Is the number of elements in the pDescriptorWrites  array.

    • pDescriptorWrites

      • Is a pointer to an array of VkWriteDescriptorSet  structures describing the descriptor sets to write to.

    • descriptorCopyCount

      • Is the number of elements in the pDescriptorCopies  array.

    • pDescriptorCopies

      • Is a pointer to an array of VkCopyDescriptorSet  structures describing the descriptor sets to copy between.

Best Practices
  • Don’t update descriptor sets if nothing in the set changed. In the model with slots that are shared between different stages, this can mean that if no textures are set between two draw calls, you don’t need to update the descriptor set with texture descriptors.

  • When rendering dynamic objects the application will need to push some amount of per-object data to the GPU, such as the MVP matrix. This data may not fit into the push constant limit for the device, so it becomes necessary to send it to the GPU by putting it into a VkBuffer  and binding a descriptor set that points to it.

  • Materials also need their own descriptor sets, which point to the textures they use. We can either bind per-material and per-object descriptor sets separately or collate them into a single set. Either way, complex applications will have a large amount of descriptor sets that may need to change on the fly, for example due to textures being streamed in or out.

  • Not-good Solution: One or more pools per-frame, resetting the pool :

    • The simplest approach to circumvent the issue is to have one or more VkDescriptorPool s per frame, reset them at the beginning of the frame and allocate the required descriptor sets from it. This approach will consist of a vkResetDescriptorPool()  call at the beginning, followed by a series of vkAllocateDescriptorSets()  and vkUpdateDescriptorSets()  to fill them with data.

    • This is very useful for things like per-frame descriptors. That way we can have descriptors that are used just for one frame, allocated dynamically, and then before we start the frame we completely delete all of them in one go.

    • This is confirmed to be a fast path by GPU vendors, and recommended to use when you need to handle per-frame descriptor sets.

    • The issue is that these calls can add a significant overhead to the CPU frame time, especially on mobile. In the worst cases, for example calling vkUpdateDescriptorSets()  for each draw call, the time it takes to update descriptors can be longer than the time of the draws themselves.

  • Solution: Caching descriptor sets :

    • A major way to reduce descriptor set updates is to re-use them as much as possible. Instead of calling vkResetDescriptorPool()  every frame, the app will keep the VkDescriptorSet  handles stored with some caching mechanism to access them.

    • The cache could be a hashmap with the contents of the descriptor set (images, buffers) as key. This approach is used in our framework by default. It is possible to remove another level of indirection by storing descriptor set handles directly in the materials and/or meshes.

    • Caching descriptor sets has a dramatic effect on frame time for our CPU-heavy scene.

    • In this game on a 2019 mobile phone it went from 44ms (23fps) to 27ms (37fps). This is a 38% decrease in frame time.

    • This system is reasonably easy to implement for a static scene, but it becomes harder when you need to delete descriptor sets. Complex engines may implement techniques to figure out which descriptor sets have not been accessed for a certain number of frames, so they can be removed from the map.

    • This may correspond to calling vkFreeDescriptorSets() , but this solution poses another issue: in order to free individual descriptor sets the pool has to be created with the DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET  flag. Mobile implementations may use a simpler allocator if that flag is not set, relying on the fact that pool memory will only be recycled in block.

    • It is possible to avoid using that flag by updating descriptor sets instead of deleting them. The application can keep track of recycled descriptor sets and re-use one of them when a new one is requested.
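
    • The hashmap-with-contents-as-key idea above can be sketched as follows; DescriptorKey , the stand-in handle fields, and the use of FNV-1a are illustrative, not from the source:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache key: the contents that would go into the descriptor
 * set (here just raw handle values) hashed with FNV-1a. A real cache maps
 * this hash to a VkDescriptorSet handle and reuses it across frames. */
typedef struct {
    uint64_t image_view;  /* stand-in for a VkImageView handle */
    uint64_t sampler;     /* stand-in for a VkSampler handle   */
    uint64_t buffer;      /* stand-in for a VkBuffer handle    */
} DescriptorKey;

static uint64_t hash_descriptor_key(const DescriptorKey *key) {
    const unsigned char *bytes = (const unsigned char *)key;
    uint64_t h = 1469598103934665603ull;       /* FNV-1a offset basis */
    for (size_t i = 0; i < sizeof *key; i++) {
        h ^= bytes[i];
        h *= 1099511628211ull;                 /* FNV-1a prime */
    }
    return h;
}
```

    • Identical contents always hash to the same key, so a cache hit can skip vkAllocateDescriptorSets()  and vkUpdateDescriptorSets()  entirely; collisions are still possible, so a real cache must compare the full key on a hit.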

  • Solution: One buffer per-frame :

    • We will now explore an alternative approach that is in some ways complementary to descriptor caching. Especially for applications in which descriptor caching is not feasible, buffer management is another lever for optimizing performance.

    • As discussed at the beginning, each rendered object will typically need some uniform data that must be pushed to the GPU somehow. A straightforward approach is to store a VkBuffer  per object and update that data each frame.

    • This already poses an interesting question: is one buffer enough? The problem is that this data will change dynamically and will be in use by the GPU while the frame is in flight.

    • Since we do not want to flush the GPU pipeline between each frame, we will need to keep several copies of each buffer, one for each frame in flight.

    • Another similar option is to use just one buffer per object, but with a size equal to num_frames * buffer_size , then offset it dynamically based on the frame index.

      • For each frame, one buffer per object is created and filled with data. This means that we will have many descriptor sets to create, since every object will need one that points to its VkBuffer . Furthermore, we will have to update many buffers separately, meaning we cannot control their memory layout and we might lose some optimization opportunities with caching.

    • We can address both problems by inverting the approach: instead of having a VkBuffer  per object containing per-frame data, we will have a VkBuffer  per frame containing per-object data. The buffer will be cleared at the beginning of the frame, then each object will record its data and will receive a dynamic offset to be used at vkCmdBindDescriptorSets()  time.

    • With this approach we will need fewer descriptor sets, as more objects can share the same one: they will all reference the same VkBuffer , but at different dynamic offsets. Furthermore, we can control the memory layout within the buffer.

    • Using a single large VkBuffer  in this case shows a performance improvement similar to descriptor set caching.
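
    • The offset arithmetic for this scheme can be sketched as below (illustrative names; the real alignment comes from the device's minUniformBufferOffsetAlignment  limit). Each frame in flight owns a region of one large buffer, and each object gets an aligned dynamic offset inside it:

```c
#include <assert.h>
#include <stdint.h>

/* One large buffer: object i's data lives at a dynamic offset that is a
 * multiple of the device's min dynamic-offset alignment. The returned
 * offset is what goes into pDynamicOffsets at vkCmdBindDescriptorSets()
 * time, so many objects can share one descriptor set. */
static uint32_t align_up(uint32_t value, uint32_t alignment) {
    /* alignment must be a power of two, which Vulkan guarantees here */
    return (value + alignment - 1) & ~(alignment - 1);
}

/* Offset of object `index` inside the region owned by frame `frame`,
 * when num-frames regions are stacked back to back in one buffer. */
static uint32_t object_offset(uint32_t frame, uint32_t max_objects,
                              uint32_t index, uint32_t struct_size,
                              uint32_t min_align) {
    uint32_t stride     = align_up(struct_size, min_align);
    uint32_t frame_base = frame * max_objects * stride;
    return frame_base + index * stride;
}
```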

    • For this relatively simple scene stacking the two approaches does not provide a further performance boost, but for a more complex case they do stack nicely:

      • Descriptor caching is necessary when the number of descriptor sets is not just due to VkBuffer s with uniform data, for example if the scene uses a large amount of materials/textures.

      • Buffer management will help reduce the overall number of descriptor sets, thus cache pressure will be reduced and the cache itself will be smaller.

    • (2025-09-08)

      • I personally liked this technique much more than descriptor caching.

      • It sounds more concrete than fiddling with descriptor sets.

      • Reminds me of Buffer Device Address.

  • Do

    • Update already allocated but no longer referenced descriptor sets, instead of resetting descriptor pools and reallocating new descriptor sets.

    • Prefer reusing already allocated descriptor sets, and not updating them with the same information every time.

    • Consider caching your descriptor sets when feasible.

    • Consider using a single (or few) VkBuffer  per frame with dynamic offsets.

    • Batch calls to vkAllocateDescriptorSets if possible – on some drivers, each call has measurable overhead, so if you need multiple sets, allocating them in one call can be faster.

    • To update descriptor sets, either use vkUpdateDescriptorSets  with a descriptor write array, or use vkUpdateDescriptorSetWithTemplate  from Vulkan 1.1. Using the descriptor copy functionality of vkUpdateDescriptorSets  is tempting with dynamic descriptor management for copying most descriptors out of a previously allocated array, but this can be slow on drivers that allocate descriptors out of write-combined memory. Descriptor templates can reduce the amount of work the application needs to do to perform updates – since in this scheme you need to read descriptor information out of shadow state maintained by the application, descriptor templates allow you to tell the driver the layout of your shadow state, making updates substantially faster on some drivers.

    • Prefer dynamic uniform buffers to updating uniform buffer descriptors. Dynamic uniform buffers allow specifying offsets into buffer objects via the pDynamicOffsets  argument of vkCmdBindDescriptorSets  without allocating and updating new descriptors. This works well with dynamic constant management where constants for draw calls are allocated out of large uniform buffers; it substantially reduces CPU overhead and can be more efficient on the GPU. While on some GPUs the number of dynamic buffers must be kept small to avoid extra overhead in the driver, one or two dynamic uniform buffers should work well in this scheme on all architectures.

    • On some drivers, unfortunately the allocate & update path is not very optimal – on some mobile hardware, it may make sense to cache descriptor sets based on the descriptors they contain if they can be reused later in the frame.

Descriptor Set Layout

  • Contains the information about what that descriptor set holds.

  • Specifies the types of resources that are going to be accessed by the pipeline, just like a render pass specifies the types of attachments that will be accessed.

  • How many :

    • You need to specify a descriptor set layout for each descriptor set when creating the pipeline layout.

      • You can use this feature to put descriptors that vary per-object and descriptors that are shared into separate descriptor sets.

      • In that case, you avoid rebinding most of the descriptors across draw calls, which is potentially more efficient.

    • Since the buffer structure is identical across frames, one layout suffices.

      • Create only 1 descriptor set layout, regardless of frames in-flight.

      • This layout defines the type of resource (e.g., VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER ) and its binding point.

  • VkDescriptorSetLayout .

    • Opaque handle to a descriptor set layout object.

    • Is defined by an array of zero or more descriptor bindings.

    • Where it's used :

    • VkDescriptorSetLayoutBinding .

      • Structure specifying a descriptor set layout binding.

      • Each individual descriptor binding is specified by a descriptor type, a count (array size) of the number of descriptors in the binding, a set of shader stages that can access the binding, and (if using immutable samplers) an array of sampler descriptors.

      • Bindings that are not specified have a descriptorCount  and stageFlags  of zero, and the value of descriptorType  is undefined.

      • binding

        • Is the binding number of this entry and corresponds to a resource of the same binding number in the shader stages.

        • Matches the binding number used in the shader source, e.g. layout(binding = 0)  for a uniform buffer object.

      • descriptorType

        • Is a VkDescriptorType  specifying which type of resource descriptors are used for this binding.

      • descriptorCount

        • Insight :

          • It's a descriptor count, not a descriptor SET count !! It just specifies how many resources are expected in that binding.

          • It makes complete sense to be used for arrays.

          • Caio:

            • What happens if the values don't match? For example, trying to get the index 5 of the array, when the binding was described having descriptorCount = 1  ?

          • Oni:

            • I don't know if this is specified. I guess it's only going to update the first element. So you're going to read bogus data. Maybe it changes between different drivers, no idea.

        • What value to use :

          • An MVP transformation fits in a single uniform buffer, so we use a descriptorCount  of 1 .

          • In other words, a whole struct counts as 1 .

        • Is the number of descriptors contained in the binding, accessed in a shader as an array.

          • Except if descriptorType  is DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK  in which case descriptorCount  is the size in bytes of the inline uniform block.

        • If descriptorCount  is zero this binding entry is reserved and the resource must  not be accessed from any stage via this binding within any pipeline using the set layout.

        • It is possible for the shader variable to represent an array of uniform buffer objects, and this property specifies the number of values in the array.

        • Examples :

          • This could be used to specify a transformation for each of the bones in a skeleton for skeletal animation.

      • stageFlags

        • Is a bitmask of VkShaderStageFlagBits  specifying which pipeline shader stages can  access a resource for this binding.

          • SHADER_STAGE_ALL  is a shorthand specifying all defined shader stages, including any additional stages defined by extensions.

        • If a shader stage is not included in stageFlags , then a resource must  not be accessed from that stage via this binding within any pipeline using the set layout.

        • Other than input attachments which are limited to the fragment shader, there are no limitations on what combinations of stages can  use a descriptor binding, and in particular a binding can  be used by both graphics stages and the compute stage.

      • pImmutableSamplers

        • Affects initialization of samplers.

        • If descriptorType  specifies a DESCRIPTOR_TYPE_SAMPLER  or DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER  type descriptor, then pImmutableSamplers   can  be used to initialize a set of immutable samplers .

        • If descriptorType  is not one of these descriptor types, then pImmutableSamplers  is ignored .

        • Immutable samplers are permanently bound into the set layout and must  not be changed; updating a DESCRIPTOR_TYPE_SAMPLER  descriptor with immutable samplers is not allowed and updates to a DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER  descriptor with immutable samplers does not modify the samplers (the image views are updated, but the sampler updates are ignored).

        • If pImmutableSamplers  is not NULL , then it is a pointer to an array of sampler handles that will be copied into the set layout and used for the corresponding binding. Only the sampler handles are copied; the sampler objects must  not be destroyed before the final use of the set layout and any descriptor pools and sets created using it.

        • If pImmutableSamplers  is NULL , then the sampler slots are dynamic and sampler handles must  be bound into descriptor sets using this layout.
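
      • A minimal sketch of filling this structure for a single vertex-stage UBO. The typedefs are local stand-ins so the snippet is self-contained without vulkan.h; the real definitions (same field names and values) live there:

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins mirroring <vulkan/vulkan.h> so the sketch compiles on
 * its own. The enum and bit values match the actual Vulkan headers. */
typedef enum { VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER = 6 } VkDescriptorType;
typedef unsigned VkShaderStageFlags;
#define VK_SHADER_STAGE_VERTEX_BIT 0x1u

typedef struct {
    unsigned            binding;
    VkDescriptorType    descriptorType;
    unsigned            descriptorCount;
    VkShaderStageFlags  stageFlags;
    const void         *pImmutableSamplers;
} VkDescriptorSetLayoutBinding;

/* One uniform buffer at binding 0, visible to the vertex stage only. */
static VkDescriptorSetLayoutBinding make_ubo_binding(void) {
    VkDescriptorSetLayoutBinding b = {0};
    b.binding            = 0;   /* layout(binding = 0) in the shader */
    b.descriptorType     = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
    b.descriptorCount    = 1;   /* one buffer, not an array          */
    b.stageFlags         = VK_SHADER_STAGE_VERTEX_BIT;
    b.pImmutableSamplers = NULL; /* not a sampler binding            */
    return b;
}
```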

    • VkDescriptorSetLayoutCreateInfo .

      • pBindings

        • A pointer to an array of VkDescriptorSetLayoutBinding  structures.

      • bindingCount

        • Is the number of elements in pBindings .

      • flags

    • vkCreateDescriptorSetLayout() .

      • Create a new descriptor set layout.

      • pCreateInfo

      • pAllocator

      • pSetLayout

        • Is a pointer to a VkDescriptorSetLayout  handle in which the resulting descriptor set layout object is returned.

  • VkPipelineLayoutCreateInfo .

    • Structure specifying the parameters of a newly created pipeline layout object

    • setLayoutCount

      • Is the number of descriptor sets included in the pipeline layout.

      • How it works :

        • It's possible to have multiple descriptor sets ( set = 0 , set = 1 , etc).

        • "You can have set = 0 being a set that is always bound and never changes, set = 1 is something specific to the current object being rendered, etc."

    • pSetLayouts

      • Is a pointer to an array of VkDescriptorSetLayout  objects.

      • The implementation must  not access these objects outside of the duration of the command this structure is passed to.

Binding

  • Descriptor state is tracked only inside a command buffer; descriptor sets are always bound at command-buffer level, and their state is local to the command buffer.

    • They are not bound at queue level or global level, only to command buffers.

  • Which set index to choose :

    • According to GPU vendors, each descriptor set slot has a cost, so the fewer we have, the better.

    • "Organize shader inputs into "sets" by update frequency."

    • Rarely changes -> low index.

    • Changes frequently -> high index.

    • Usually Descriptor Set 0 is used to always bind some global scene data, which will contain some uniform buffers and some special textures, and Descriptor Set 1 will be used for per-object data.
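
    • In GLSL that frequency split looks like the following (hypothetical declarations, not from the source):

```glsl
// set 0: global scene data, bound once per frame.
layout(set = 0, binding = 0) uniform SceneData {
    mat4 viewProj;
    vec4 ambientColor;
} scene;

// set 1: data that changes per object, rebound as needed.
layout(set = 1, binding = 0) uniform ObjectData {
    mat4 model;
} object;
```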

  • vkCmdBindDescriptorSets .

    • It needs to be done before the vkCmdDrawIndexed()  calls, for example.

    • commandBuffer

      • Is the command buffer that the descriptor sets will be bound to.

    • pipelineBindPoint

      • Is a VkPipelineBindPoint  indicating the type of the pipeline that will use the descriptors. There is a separate set of bind points for each pipeline type, so binding one does not disturb the others.

      • Unlike vertex and index buffers, descriptor sets are not unique to graphics pipelines, therefore, we need to specify if we want to bind descriptor sets to the graphics or compute pipeline.

      • Examples :

        • A raytracing command takes the currently bound descriptors from the raytracing bind point.

        • A draw command takes the currently bound descriptors from the graphics bind point.

        • The two don't interfere with each other.

    • layout

    • firstSet

      • Is the set number  of the first descriptor set  to be bound.

    • descriptorSetCount

      • Is the number of elements in the pDescriptorSets  array.

    • pDescriptorSets

      • Is a pointer to an array of handles to VkDescriptorSet  objects describing the descriptor sets to bind to.

    • dynamicOffsetCount

      • Is the number of dynamic offsets in the pDynamicOffsets  array.

    • pDynamicOffsets

      • Is a pointer to an array of uint32_t  values specifying dynamic offsets. They are consumed in order: all offsets for the dynamic bindings of set N come before those of set N+1, ordered by binding number within each set.

Strategy: Descriptor Indexing ( EXT_descriptor_indexing )

Plan
  • SSBOs and UBOs.

    • Can I just put different data without restriction?

      • Yes. See the SSBO section for that.

    • SSBOs or UBOs?

      • Using storage buffers exclusively instead of uniform buffers can increase GPU time on some architectures.

      • I'll use SSBO, as that was the general recommendation.

      • Maybe I'll mix both.

  • Globals:

    • Camera view/proj, lights, ambient, etc.

    • I could just bind this once as well.

  • Material Data:

    • The Material index is used to look up material data from material storage buffer. The textures can then be accessed using the indices from the material data and the descriptor array.

    • I'd use the instance index (or similar) to index into a []Material_Data .

  • Model Matrix / Transforms:

    • Same as material data. I can send via push constants if direct drawing, or via []model_matrix  if indirect drawing.

  • Draw Data:

    • Indices to index into the other arrays.

    struct DrawData
    {
        uint materialIndex;
        uint transformOffset;
        uint vertexOffset;
        uint unused0; // vec4 padding
    
        // ... extra gameplay data
    };
    
    • Vertex Shader:

      DrawData dd = drawData[gl_DrawIDARB];
      TransformData td = transformData[dd.transformOffset];
      vec4 positionLocal = vec4(positionData[gl_VertexIndex + dd.vertexOffset], 1.0);
      vec3 positionWorld = mat4x3(td.transform[0], td.transform[1], td.transform[2]) * positionLocal;
      
    • Frag Shader:

      DrawData dd = drawData[drawId];
      MaterialData md = materialData[dd.materialIndex];
      vec4 albedo = texture(sampler2D(materialTextures[md.albedoTexture], albedoSampler), uv * vec2(md.tilingX, md.tilingY));
      
  • Overall:

    • []textures

    • []material_data

      • uv, flip, modulate, etc.

    • []model_matrices

      • transforms.

    • []draw_data

      • Indices to index into the other arrays.

    • vertex/indices

      • As input attributes, to then use Indirect Drawing.

  • Slots:

    • tex buffer and material data buffer will be in the same set 0, or should they be 0/1?

    • Probably every bind is on desc set 0

    • The slots are based on frequency, but every single binding I'm talking about might just be bound once globally without problems.

  • Vertex:

    • Indirect vs Full bindless:

      • I'll use Indirect Drawing for now. ChatGPT deep search didn't give me much.

    • Go for bindless first with direct drawing. Instead of using the instanceID  or similar, I just send the draw_data index via push constants. This way the shader will be completely finalized; then I batch the draws via draw indirect and use the instanceID  instead of the push-constant ID.

      • Why not invert and do indirect first? I cannot do that, as the instanceID  is useless without a bindless design! I NEED to have a use for the ID, as I cannot bind desc sets or push constants for each individual draw! Bindless first is a MUST.

    • Having to bind vertex buffers per-draw would not work for a fully bindless design.

    1. Indirect Drawing:

    2. Full bindless:

      • Using a large index buffer: We need to bind index data. If just like the vertex data, index data is allocated in one large index buffer, we only need to bind it once using vkCmdBindIndexBuffer .

      • Some hardware doesn’t support vertex buffers as a first-class entity, and the driver has to emulate vertex buffer binding, which causes some CPU-side slowdowns when using vkCmdBindVertexBuffers .

      • In a fully bindless design, we need to assume that all vertex buffers are suballocated in one large buffer and either use per-draw vertex offsets ( vertexOffset  argument to vkCmdDrawIndexed ) to have hardware fetch data from it, or pass an offset in this buffer to the shader with each draw call and fetch data from the buffer in the shader. Both approaches can work well, and might be more or less efficient depending on the GPU.
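
      • The bookkeeping for one large shared vertex buffer can be sketched as below (illustrative names; the returned base is what would be passed as the vertexOffset  argument of vkCmdDrawIndexed , or sent to the shader in the pull-model variant):

```c
#include <assert.h>
#include <stdint.h>

/* All meshes are suballocated in one big vertex buffer; at upload time
 * each mesh records the index of its first vertex. Index buffer data
 * stays mesh-local, and the base is supplied per draw. */
typedef struct {
    uint32_t next_vertex; /* first free vertex slot in the shared buffer */
} VertexArena;

static uint32_t vertex_arena_alloc(VertexArena *arena, uint32_t vertex_count) {
    uint32_t base = arena->next_vertex; /* this mesh's vertexOffset */
    arena->next_vertex += vertex_count;
    return base;
}
```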

    3. Mesh Shaders.

      • Mesh Shaders is probably what is most true to the bindless strategy, but I won't go that way yet (too soon, too new).

    4. Compute

      • Maybe I could use a compute to do this for me, but then I'd lose the rasterizer.

About
  • Descriptor indexing is also known by the term "bindless", which refers to the fact that binding individual descriptor sets and descriptors is no longer the primary way we keep shader pipelines fed. Instead, we can bind a huge descriptor set once and just index into a large number of descriptors.

  • Adds a lot  of flexibility to how resources are accessed.

  • "Bindless algorithms" are generally built around this flexibility where we either index freely into a lot of descriptors at once, or update descriptors where we please. In this model, "binding" descriptors is not a concern anymore.

  • The core functionality of this extension is that we can treat descriptor memory as one massive array, and we can freely access any resource we want at any time, by indexing.

  • If an array is large enough, an index into that array is indistinguishable from a pointer.

  • At most, we need to write/copy descriptors to where we need them and we can now consider descriptors more like memory blobs rather than highly structured API objects.

  • The introduction of descriptor indexing revealed that the descriptor model is all just smoke and mirrors. A descriptor is just a blob of binary data that the GPU can interpret in some meaningful way. The API calls to manage descriptors really just boils down to “copy magic bits here.”

  • Support :

    • Descriptor Indexing was introduced in 2018, so most hardware from 2018 onward should support it.

    • Core in Vulkan 1.2+

    • Limits queried using VkPhysicalDeviceDescriptorIndexingPropertiesEXT .

    • Features queried using VkPhysicalDeviceDescriptorIndexingFeaturesEXT .

    • Features toggled using VkPhysicalDeviceDescriptorIndexingFeaturesEXT .

  • Required for :

    • Raytracing.

    • Many GPU Driven Rendering approaches.

  • Advantages :

    • No costly transfer of descriptors to the GPU every frame (which otherwise shows up as a lot of time spent in vkUpdateDescriptorSets  in Vulkan)

    • More flexible / dynamic rendering architecture

    • No manual tracking of per-object resource groups

    • Updating matrices and material data can be done in bulk before command recording

    • CPU and GPU refer to resources the same way, by index

    • GPU can store Texture IDs in a buffer for reference later in the frame – many uses

    • Easy Vertex Pulling – gets rid of binding vertex buffers

    • Write resource indexes from one shader into a buffer that another shader reads & uses

    • G-Buffer can use material ID instead of values

    • Terrain Splatmap contains material IDs allowing many materials to be used, instead of 4

    • And more


  • Disadvantages :

    • Requires hardware support

      • May be too new for widespread use

      • Different “feature levels” can help ease transition

    • Different Performance Penalties

      • Array indexing can cause memory indirections

        • Fetching texture descriptors from an array indexed by material data indexed by material index can add an extra indirection on GPU compared to some alternative designs

    • “With great power comes great responsibility”

      • GPU can't verify that valid descriptors are bound

      • Validation is costlier: happens inside shaders

      • Can be difficult to debug

      • Descriptor management is up to the Application

    • On some hardware, various descriptor set limits may make this technique impractical to implement; to be able to index an arbitrary texture dynamically from the shader, maxPerStageDescriptorSampledImages  should be large enough to accommodate all material textures - while many desktop drivers expose a large limit here, the specification only guarantees a limit of 16, so bindless remains out of reach on some hardware that otherwise supports Vulkan.

  • Comparison: Indexing resources without the extension :

    • Descriptor Indexing, explanation of "dynamic non-uniform" .

      • Good read.

    • Constant Indexing :

      layout(set = 0, binding = 0) uniform sampler2D Tex[4];
      
      texture(Tex[0], ...);
      texture(Tex[2], ...);
      
      // We can trivially flatten a constant-indexed array into individual resources,
      // so, constant indexing requires no fancy hardware indexing support.
      layout(set = 0, binding = 0) uniform sampler2D Tex0;
      layout(set = 0, binding = 1) uniform sampler2D Tex1;
      layout(set = 0, binding = 2) uniform sampler2D Tex2;
      layout(set = 0, binding = 3) uniform sampler2D Tex3;
      
    • Image Array Dynamic Indexing :

      • The dynamic indexing features allow us to use a non-constant expression to index an array.

        • This has been supported since Vulkan 1.0.

      • The restriction is that the index must be dynamically uniform .

      layout(set = 0, binding = 0) uniform sampler2D Tex[4];
      
      texture(Tex[dynamically_uniform_expression], ...);
      
    • Non-uniform vs Texture Atlas vs Texture Array :

      • Accessing arbitrary textures in a draw call is not a new problem, and graphics programmers have found ways over the years to workaround restrictions in older APIs. Rather than having multiple textures, it is technically possible to pack multiple textures into one texture resource, and sample from the correct part of the texture. This kind of technique is typically referred to as "texture atlas". Texture arrays (e.g. sampler2DArray) is another feature which can be used for similar purposes.

      • Problems with atlas:

        • Mip-mapping is hard to implement, and most likely must be done manually with derivatives and math.

        • Anisotropic filtering is basically impossible.

        • Any other sampler addressing than CLAMP_TO_EDGE  is very awkward to implement.

        • Cannot use different texture formats.

      • Problems with texture array:

        • All resolutions must match.

        • Number of array layers is limited (just 256 in min-spec).

        • Cannot use different texture formats.

      • Non-uniform indexing solves these issues since we can freely use multiple sampled image descriptors instead. Atlases and texture arrays still have their place. There are many use cases where these restrictions do not cause problems.

      • Non-uniform indexing is not just limited to textures (although that is the most relevant use case). Any descriptor type can be used as long as the device supports it.

Features
  • Update-after-bind :

    • In Vulkan, you generally have to create a VkDescriptorSet  and update it with all descriptors before you call vkCmdBindDescriptorSets . After a set is bound, the descriptor set cannot be updated again until the GPU is done using it. This gives drivers a lot of flexibility in how they access the descriptors. They are free to copy the descriptors and pack them somewhere else, promote them to hardware registers, the list goes on.

    • Update-After-Bind gives flexibility to applications instead. Descriptors can be updated at any time as long as they are not actually accessed by the GPU. Descriptors can also be updated while the descriptor set is bound to a command buffer, which enables a "streaming" use case.

      • This means the application doesn’t have to unbind or re-record command buffers just to change descriptors—reducing CPU overhead in some streaming-resource scenarios.

    • Concurrent Updates :

      • Another "hidden" feature of update-after-bind is that it is possible to update the descriptor set from multiple threads. This is very useful for true "bindless" since unrelated tasks might want to update descriptors in different parts of the streamed/bindless descriptor set.

  • Non-uniform indexing :

    • While update-after-bind adds flexibility to descriptor management, non-uniform indexing adds great flexibility for shaders.

    • It completely removes all restrictions on how we index into arrays, but we must notify our intent to the compiler.

    • Normally, drivers and hardware can assume that the dynamically uniform guarantee holds, and optimize for that case.

    • If we use the nonuniformEXT  decoration in GL_EXT_nonuniform_qualifier  we can let the compiler know that the guarantee does not necessarily hold, and the compiler will deal with it in the most efficient way possible for the target hardware. The rationale for having to annotate like this is that driver compiler backends would be forced to be more conservative than necessary if applications were not required to use nonuniformEXT .

    • When to use it :

      • The invocation group :

        • The invocation group is a set of threads (invocations) which work together to perform a task.

        • In graphics pipelines, the invocation group is all threads which are spawned as part of a single draw command. This includes multiple instances, and for multi-draw-indirect it is limited to a single gl_DrawID .

        • In compute pipelines, the invocation group is a single workgroup, so it’s very easy to know when it is safe to avoid nonuniformEXT.

        • An expression is considered dynamically uniform  if all invocations in an invocation group have the same value.

          • In other words, dynamically uniform  means that the index is the same across all threads spawned by a draw command.

      • Interaction with Subgroups :

        • It is very easy to think that dynamically uniform just means "as long as the index is uniform in the subgroup, it’s fine!". This is certainly true for most (desktop) architectures, but not all.

        • It is technically possible that a value can be subgroup uniform, but still not dynamically uniform. Consider a case where we have a workgroup size of 128 threads, with a subgroup size of 32. Even if each subgroup does subgroupBroadcastFirst()  on the index, each subgroup might have different values, and thus, we still technically need nonuniformEXT  here. If you know that you have only one subgroup per workgroup however, subgroupBroadcastFirst()  is good enough.

        • The safe thing to do is to just add nonuniformEXT  if you cannot prove the dynamically uniform property. If the compiler knows that it only really cares about subgroup uniformity, it could trivially optimize away nonuniformEXT(subgroupBroadcastFirst())  anyways.

        • The common reason to use subgroups here in the first place is that they were an old workaround for the lack of true non-uniform indexing, especially on desktop GPUs: broadcast the index with subgroupBroadcastFirst()  and loop until every invocation's index has been handled.
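
        • That subgroup workaround can be sketched as below (hypothetical GLSL, assuming GL_KHR_shader_subgroup_ballot ; the textures  array and index  are illustrative):

```glsl
#extension GL_KHR_shader_subgroup_ballot : require

layout(set = 0, binding = 0) uniform sampler2D textures[256];

// "Waterfall" loop: each iteration handles one subgroup-uniform index
// value; invocations whose index was just broadcast sample and exit,
// the rest keep looping until their own value is picked.
vec4 sampleWaterfall(uint index, vec2 uv) {
    for (;;) {
        uint uniformIndex = subgroupBroadcastFirst(index);
        if (uniformIndex == index) {
            return texture(textures[uniformIndex], uv);
        }
    }
}
```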

Implementation
  • Examples :

    • odin_cool_engine:

      • odin_cool_engine/src/rp_ui.odin

        • It just sends an index to the compute pipeline via push constants.

      • odin_cool_engine/src/renderer.odin:725

        • It just sends an index to the compute pipeline via push constants.

    • Descriptor Indexing Sample .

  • Setup :

    1. Check availability of the extension through vk.EXT_DESCRIPTOR_INDEXING_EXTENSION_NAME  + vk.EnumerateDeviceExtensionProperties .

    2. Check supported features of the extension through vk.GetPhysicalDeviceFeatures2  + vk.PhysicalDeviceDescriptorIndexingFeatures  as the pNext  term.

  • VkDescriptorSetLayoutCreateInfo .

    • flags

      • UPDATE_AFTER_BIND_POOL

        • Specifies that descriptor sets using this layout must be allocated from a descriptor pool created with the UPDATE_AFTER_BIND  bit set.

        • Descriptor set layouts created with this bit set have alternate limits for the maximum number of descriptors per-stage and per-pipeline layout.

        • The non-UpdateAfterBind limits only count descriptors in sets created without this flag. The UpdateAfterBind limits count all descriptors, but the limits may be higher than the non-UpdateAfterBind limits.

  • VkDescriptorBindingFlagBits :

    • PARTIALLY_BOUND

      • Specifies that descriptors in this binding that are not dynamically used need not contain valid descriptors at the time the descriptors are consumed.

        • A descriptor is 'dynamically used' if any shader invocation executes an instruction that performs any memory access using the descriptor.

        • If a descriptor is not dynamically used, any resource referenced by the descriptor is not considered to be referenced during command execution.

      • This means it is not necessary to bind every descriptor: a descriptor array binding can function even when not all array elements are written or valid.

      • This is critical if we want to make use of descriptor "streaming". A descriptor only has to be bound if it is actually used by a shader.

      • Without this feature, if you have an array of N descriptors and your shader indexes [0..N-1], all descriptors must be valid; otherwise behavior is undefined even if the shader never touches the uninitialized ones.

      • When enabled, you only need to write descriptors that the shader will index. “Holes” in the array are allowed, provided shader indices never touch them.

      • Use this when you want to leave “holes” in a large descriptor array (i.e. not update every element) without pre-filling unused slots with a fallback texture. When this flag is set, descriptors that are not dynamically used by the shader need not contain valid descriptors — but if the shader actually accesses an unwritten descriptor you still get undefined/invalid results. This is a convenience to avoid writing N fallback descriptors each time.

    • VARIABLE_DESCRIPTOR_COUNT

      • Allows a descriptor binding to have a variable number of descriptors.

      • Use a variable amount of descriptors in an array.

      • Specifies that this is a variable-sized descriptor binding, whose size will be specified when a descriptor set is allocated using this layout.

      • This must only  be used for the last binding in the descriptor set layout (i.e. the binding with the largest value of binding).

      • vk.DescriptorSetLayoutBinding.descriptorCount

        • The value is treated as an upper bound on the size of the binding.

        • The actual count is supplied at allocation time via VkDescriptorSetVariableDescriptorCountAllocateInfo .

        • For the purposes of counting against limits such as maxDescriptorSet  and maxPerStageDescriptor , the full value of descriptorCount  is counted, except for descriptor bindings with a descriptor type of DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK , when VkDescriptorSetLayoutCreateInfo.flags  does not contain DESCRIPTOR_SET_LAYOUT_CREATE_DESCRIPTOR_BUFFER . In this case, descriptorCount  specifies the upper bound on the byte size of the binding; thus it counts against the maxInlineUniformBlockSize  and maxInlineUniformTotalSize  limits instead.

      • When we later allocate the descriptor set, we can declare how large we want the array to be.

      • Be aware that there is a global limit to the number of descriptors that can be allocated at any one time.

      • This is extremely useful when using EXT_descriptor_indexing , since we do not have to allocate a fixed amount of descriptors for each descriptor set.

      • In many cases, it is far more flexible to use runtime sized descriptor arrays.

      • Use this when you want the shader-visible length of a descriptor-array binding to be chosen per descriptor set (i.e. different sets expose different array lengths) instead of using a single compile-time / layout upper bound. At allocation time you pass the actual count with VkDescriptorSetVariableDescriptorCountAllocateInfo . This reduces bookkeeping/pool usage and lets you avoid allocating the full upper bound for every set. It requires the descriptor-indexing feature to be enabled, and the variable-size binding must be the last binding in the set.

    • UPDATE_AFTER_BIND

      • Specifies that if descriptors in this binding are updated between when the descriptor set is bound in a command buffer and when that command buffer is submitted to a queue, then the submission will use the most recently set descriptors for this binding and the updates do not invalidate the command buffer. Descriptor bindings created with this flag are also partially exempt from the external synchronization requirement in vkUpdateDescriptorSetWithTemplateKHR  and vkUpdateDescriptorSets . Multiple descriptors with this flag set can be updated concurrently in different threads, though the same descriptor must not be updated concurrently by two threads. Descriptors with this flag set can be updated concurrently with the set being bound to a command buffer in another thread, but not concurrently with the set being reset or freed.

      • Update-after-bind is another critical component of descriptor indexing, which allows us to update descriptors after a descriptor set has been bound to a command buffer.

      • This is critical for streaming descriptors, but it also relaxes threading requirements: multiple threads can update descriptors concurrently on the same descriptor set.

      • UPDATE_AFTER_BIND  descriptors are somewhat of a precious resource, but the min-spec in Vulkan is at least 500k descriptors, which should be more than enough.

    • UPDATE_UNUSED_WHILE_PENDING

      • Specifies that descriptors in this binding can be updated after a command buffer has bound this descriptor set, or while a command buffer that uses this descriptor set is pending execution, as long as the descriptors that are updated are not used by those command buffers. Descriptor bindings created with this flag are also partially exempt from the external synchronization requirement in vkUpdateDescriptorSetWithTemplateKHR and vkUpdateDescriptorSets in the same way as for UPDATE_AFTER_BIND . If PARTIALLY_BOUND  is also set, then descriptors can be updated as long as they are not dynamically used by any shader invocations. If PARTIALLY_BOUND  is not set, then descriptors can be updated as long as they are not statically used by any shader invocations.

      • Update-Unused-While-Pending is somewhat subtle, and allows you to update a descriptor while a command buffer is executing.

      • The only restriction is that the descriptor cannot actually be accessed by the GPU.

    • UPDATE_AFTER_BIND  vs UPDATE_UNUSED_WHILE_PENDING

      • Both involve updates to descriptor sets after they are bound, but UPDATE_UNUSED_WHILE_PENDING  is the weaker requirement: it only concerns descriptors that are not used, whereas UPDATE_AFTER_BIND  requires the implementation to observe updates to descriptors that are used.

  • Enabling Non-Uniform Indexing :

    1. Enable runtimeDescriptorArray  and shaderSampledImageArrayNonUniformIndexing  (required for indexing an array of COMBINED_IMAGE_SAMPLER ), descriptorBindingPartiallyBound  (optional, to avoid undefined behavior on not fully populated arrays).

      • If in Vulkan <1.2, then the features must be enabled in the vk.PhysicalDeviceDescriptorIndexingFeatures .

      • If in Vulkan >=1.2, then the features must be enabled in the vk.PhysicalDeviceVulkan12Features .

        • If this is not followed, you'll get:

        [ERROR] --- vkCreateDevice(): pCreateInfo->pNext chain includes a VkPhysicalDeviceVulkan12Features structure, then it must not include a VkPhysicalDeviceDescriptorIndexingFeatures structure. The features in VkPhysicalDeviceDescriptorIndexingFeatures were promoted in Vulkan 1.2 and is also found in VkPhysicalDeviceVulkan12Features. To prevent one feature setting something to TRUE and the other to FALSE, only one struct containing the feature is allowed.
        pNext chain: VkDeviceCreateInfo::pNext -> [STRUCTURE_TYPE_LOADER_DEVICE_CREATE_INFO] -> [STRUCTURE_TYPE_LOADER_DEVICE_CREATE_INFO] -> [VkPhysicalDeviceVulkan13Features] -> [VkPhysicalDeviceVulkan12Features] -> [VkPhysicalDeviceDynamicRenderingUnusedAttachmentsFeaturesEXT] -> [VkPhysicalDeviceDescriptorIndexingFeatures].
        The Vulkan spec states: If the pNext chain includes a VkPhysicalDeviceVulkan12Features structure, then it must not include a VkPhysicalDevice8BitStorageFeatures, VkPhysicalDeviceShaderAtomicInt64Features, VkPhysicalDeviceShaderFloat16Int8Features, VkPhysicalDeviceDescriptorIndexingFeatures, VkPhysicalDeviceScalarBlockLayoutFeatures, VkPhysicalDeviceImagelessFramebufferFeatures, VkPhysicalDeviceUniformBufferStandardLayoutFeatures, VkPhysicalDeviceShaderSubgroupExtendedTypesFeatures, VkPhysicalDeviceSeparateDepthStencilLayoutsFeatures, VkPhysicalDeviceHostQueryResetFeatures, VkPhysicalDeviceTimelineSemaphoreFeatures, VkPhysicalDeviceBufferDeviceAddressFeatures, or VkPhysicalDeviceVulkanMemoryModelFeatures structure (https://vulkan.lunarg.com/doc/view/1.4.328.0/windows/antora/spec/latest/chapters/devsandqueues.html#VUID-VkDeviceCreateInfo-pNext-02830)
        
      vulkan12_features := vk.PhysicalDeviceVulkan12Features{
          // etc
      
          descriptorIndexing                        = true,
              // Descriptor Indexing:
              // Todo: Is this only for VK 1.2?
      
          runtimeDescriptorArray                    = true,
              // Descriptor Indexing:
      
          shaderSampledImageArrayNonUniformIndexing = true,
              // Descriptor Indexing: required for indexing an array of `COMBINED_IMAGE_SAMPLER`.
      
          descriptorBindingPartiallyBound           = true,
              // Descriptor Indexing: optional, to avoid undefined behavior on not fully populated arrays.
      
          descriptorBindingVariableDescriptorCount  = true,
              // Descriptor Indexing: Allows a descriptor binding to have a variable number of descriptors.
      
          // etc
      }
      
    2. In GLSL use the GL_EXT_nonuniform_qualifier  extension and wrap the index with nonuniformEXT(...)  (or apply nonuniformEXT  to the loaded value) so the compiler emits the SPIR-V NonUniformEXT  decoration.

    • In the shader :

      • Constructors and builtin functions, which all have return types that are not qualified by nonuniformEXT , will not generate nonuniform results.

        • Shaders need to use the constructor syntax (or assignment to a nonuniformEXT -qualified variable) to re-add the nonuniformEXT  qualifier to the result of builtin functions.

        • Correct:

          • It is important to note that to be 100% correct, we must use:

          • nonuniformEXT(sampler2D()) .

          • It is the final argument to a call like texture()  which determines if the access is to be considered non-uniform.

        • Wrong:

          • It is very common in the wild to see code like:

          • sampler2D(Textures[nonuniformEXT(in_texture_index)], ...)

          • This looks very similar to HLSL, but it is somewhat wrong.

          • Generally, it will work on drivers, but it is not technically correct.

        • Examples:

          • sampler2D()  is such a constructor, so we must add nonuniformEXT  afterwards.

            • out_frag_color = texture(nonuniformEXT(sampler2D(Textures[in_texture_index], ImmutableSampler)), in_uv);

      • Other use cases:

        • The nonuniform qualifier will propagate up to the final argument which is used in the load/store or atomic operation.

        • Examples:

          // At the top
          #extension GL_EXT_nonuniform_qualifier : require
          
          layout(set = 0, binding = 0) uniform UBO  { vec4 data; } UBOs[];
          layout(set = 1, binding = 0) buffer  SSBO { vec4 data; } SSBOs[];
          layout(set = 2, binding = 0) uniform sampler2D Tex[];
          layout(set = 3, binding = 0, r32ui) uniform uimage2D Img[];
          
          void usage(uint index, vec2 uv, ivec2 coord, uint val)
          {
              vec4 a = UBOs[nonuniformEXT(index)].data;
              vec4 b = SSBOs[nonuniformEXT(index)].data;
              vec4 c = texture(Tex[nonuniformEXT(index)], uv);
              uint count = imageAtomicAdd(Img[nonuniformEXT(index)], coord, val);
          }
          
          #version 450
          #extension GL_EXT_nonuniform_qualifier : require
          layout(local_size_x = 64) in;
          
          layout(set = 0, binding = 0) uniform sampler2D Combined[];
          layout(set = 1, binding = 0) uniform texture2D Tex[];
          layout(set = 2, binding = 0) uniform sampler Samp[];
          layout(set = 3, binding = 0) uniform U { vec4 v; } UBO[];
          layout(set = 4, binding = 0) buffer S { vec4 v; } SSBO[];
          layout(set = 5, binding = 0, r32ui) uniform uimage2D Img[];
          
          void main()
          {
              uint index = gl_GlobalInvocationID.x;
              vec2 uv = vec2(gl_GlobalInvocationID.yz) / 1024.0;
          
              vec4 a = textureLod(Combined[nonuniformEXT(index)], uv, 0.0);
              vec4 b = textureLod(nonuniformEXT(sampler2D(Tex[index], Samp[index])), uv, 0.0);
              vec4 c = UBO[nonuniformEXT(index)].v;
              vec4 d = SSBO[nonuniformEXT(index)].v;
          
              imageAtomicAdd(Img[nonuniformEXT(index)], ivec2(0), floatBitsToUint(a.x + b.y + c.z + d.w));
          }
          
      • Caveats:

        • LOD:

          • Using implicit LOD with nonuniformEXT can be spicy! If the threads in a quad do not have the same index, LOD might not be computed correctly.

          • The quadDivergentImplicitLOD  property lets you know if it will work.

          • In this case however, it is completely fine, since the helper lanes in a quad must come from the same primitive, which all have the same flat fragment input.

      • Avoiding nonuniformEXT :

        • You might consider using subgroup operations to implement nonuniformEXT  on your own.

        • This is technically out of spec, since the SPIR-V specification states that to avoid nonuniformEXT , the shader must guarantee that the index is "dynamically uniform".

        • "Dynamically uniform" means the value is the same across all invocations in an "invocation group".

        • The invocation group is defined to be all invocations (threads) for:

          • An entire draw command (for graphics)

          • A single workgroup (for compute).

        • Avoiding nonuniformEXT  with clever programming is far more likely to succeed when writing compute shaders, since the workgroup boundary serves as a much easier boundary to control than entire draw commands.

        • It is often possible to match workgroup to subgroup 1:1, unlike graphics where you cannot control how quads are packed into subgroups at all.

        • The recommended approach here is to just let the compiler do its thing to avoid horrible bugs in the future.

  • Enabling Update-After-Bind :

    1. In VkDescriptorSetLayoutCreateInfo  we must pass down binding flags in a separate struct with pNext .

      bindings_count := len(stage_set_layout.bindings)
      descriptor_bindings_flags := make([]vk.DescriptorBindingFlagsEXT, bindings_count, context.temp_allocator)
      for i in 0..<len(descriptor_bindings_flags) {
          descriptor_bindings_flags[i] = { .PARTIALLY_BOUND, .UPDATE_AFTER_BIND }
      }
      descriptor_bindings_flags[bindings_count - 1] += { .VARIABLE_DESCRIPTOR_COUNT }
          // Only the last binding supports VARIABLE_DESCRIPTOR_COUNT.
      
      descriptor_binding_flags_create_info := vk.DescriptorSetLayoutBindingFlagsCreateInfoEXT{
          sType         = .DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO_EXT,
          bindingCount  = u32(bindings_count),
          pBindingFlags = raw_data(descriptor_bindings_flags),
          pNext         = nil,
      }
      descriptor_set_layout_create_info := vk.DescriptorSetLayoutCreateInfo{
          sType        = .DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
          flags        = { .UPDATE_AFTER_BIND_POOL }, // Required when any binding uses .UPDATE_AFTER_BIND.
      
          bindingCount = u32(bindings_count),
          pBindings    = raw_data(stage_set_layout.bindings),
      
          pNext        = &descriptor_binding_flags_create_info,
      }
      
      // Num Descriptors
      static constexpr uint32_t NumDescriptorsStreaming  = 2048;
      static constexpr uint32_t NumDescriptorsNonUniform = 64;
      
      // Pool
      uint32_t poolCount = NumDescriptorsStreaming + NumDescriptorsNonUniform;
      VkDescriptorPoolSize       pool_size = vkb::initializers::descriptor_pool_size(VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE, poolCount);
      VkDescriptorPoolCreateInfo pool      = vkb::initializers::descriptor_pool_create_info(1, &pool_size, 2);
      
      // Allocate
      VkDescriptorSetVariableDescriptorCountAllocateInfoEXT variable_info{};
      allocate_info.pNext              = &variable_info;
      
      variable_info.sType              = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_VARIABLE_DESCRIPTOR_COUNT_ALLOCATE_INFO_EXT;
      variable_info.descriptorSetCount = 1;
      variable_info.pDescriptorCounts  = &NumDescriptorsStreaming;
      CHECK(vkAllocateDescriptorSets(get_device().get_handle(), &allocate_info, &descriptors.descriptor_set_update_after_bind));
      variable_info.pDescriptorCounts = &NumDescriptorsNonUniform;
      CHECK(vkAllocateDescriptorSets(get_device().get_handle(), &allocate_info, &descriptors.descriptor_set_nonuniform));
      
    2. The VkDescriptorPool  must also be created with UPDATE_AFTER_BIND . Note that there is a global limit to how many UPDATE_AFTER_BIND  descriptors can be allocated at any point. The min-spec here is 500k, which should be good enough.
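  • The descriptor "streaming" described above reduces to recycling slot indices in one large, partially-bound descriptor array. A minimal sketch of that CPU-side bookkeeping (the SlotAllocator  name and shape are hypothetical, not taken from the samples above):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical free-list allocator for slots in a large PARTIALLY_BOUND
// descriptor array. Only allocated slots are ever indexed by shaders, so
// freed "holes" never need a valid descriptor written into them.
struct SlotAllocator {
    uint32_t capacity;               // descriptorCount of the binding
    uint32_t next = 0;               // high-water mark
    std::vector<uint32_t> freeList;  // recycled slots

    explicit SlotAllocator(uint32_t cap) : capacity(cap) {}

    // Returns a slot index to write a descriptor into (and to hand to the
    // shader, e.g. via a push constant or a per-draw buffer).
    uint32_t allocate() {
        if (!freeList.empty()) {
            uint32_t slot = freeList.back();
            freeList.pop_back();
            return slot;
        }
        assert(next < capacity && "descriptor array exhausted");
        return next++;
    }

    // A slot may be recycled once the GPU no longer uses it (e.g. after the
    // frame's fence has signaled).
    void release(uint32_t slot) { freeList.push_back(slot); }
};
```

  Each allocate()  is paired with one vkUpdateDescriptorSets  write at that array element; with UPDATE_AFTER_BIND  that write may happen even after the set is bound, and with PARTIALLY_BOUND  freed slots can stay stale as long as shaders never index them.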

Strategy: Descriptor Buffers ( EXT_descriptor_buffer )

  • Article .

  • Sample .

  • Released on (2022-11-21).

  • TLDR :

    • Descriptor sets are now backed by VkBuffer  objects where you memcpy  in descriptors. Delete VkDescriptorPool  and VkDescriptorSet  from the API, and have fun!

    • Performance is either equal or better.

  • Coming from Descriptor Indexing, where we use plain uints instead of actual descriptor sets, some design questions come up.

  • Do we assign one uint per descriptor, or do we try to group them together such that we only need to push one base offset?

  • If we go with the latter, we might end up having to copy descriptors around. If we go with one uint per descriptor, we just added extra indirection on the GPU. GPU throughput might suffer with the added latency.

  • On the other hand, having to group descriptors linearly one after the other can easily lead to copy hell. Copying descriptors is still an abstracted operation that requires API calls to perform, and we cannot perform it on the GPU. The overhead of all these calls in the driver can be quite significant, especially in API layering. I’ve seen up to 10 million calls to “copy descriptor” per second which adds up.

  • Managing descriptors really starts looking more and more like just any other memory management problem. Let’s try translating existing API concepts into what they really are under the hood.

  • vkCreateDescriptorPool

    • vkAllocateMemory . Memory type unknown, but likely HOST_VISIBLE  and DEVICE_LOCAL . Size of pool computed from pool entries.

  • vkAllocateDescriptorSets

    • Linear or arena allocation from pool. Size and alignment computed from VkDescriptorSetLayout .

  • vkUpdateDescriptorSets

    • Writes raw descriptor data by copying payload from VkImageView  / VkSampler  / VkBufferView . Write offset is deduced from VkDescriptorSetLayout  and binding. The VkDescriptorSet  contains a pointer to HOST_VISIBLE  mapped CPU memory. Copies are similar.

  • vkCmdBindDescriptorSets

    • Binds the GPU VA of the VkDescriptorSet  somehow.

  • The descriptor buffer API effectively removes VkDescriptorPool  and VkDescriptorSet . The APIs now expose lower level detail.

  • For example, there’s now a bunch of properties to query:

    typedef struct VkPhysicalDeviceDescriptorBufferPropertiesEXT {
        ...
        size_t             samplerDescriptorSize;
        size_t             combinedImageSamplerDescriptorSize;
        size_t             sampledImageDescriptorSize;
        size_t             storageImageDescriptorSize;
        size_t             uniformTexelBufferDescriptorSize;
        size_t             robustUniformTexelBufferDescriptorSize;
        size_t             storageTexelBufferDescriptorSize;
        size_t             robustStorageTexelBufferDescriptorSize;
        size_t             uniformBufferDescriptorSize;
        size_t             robustUniformBufferDescriptorSize;
        size_t             storageBufferDescriptorSize;
        size_t             robustStorageBufferDescriptorSize;
        size_t             inputAttachmentDescriptorSize;
        size_t             accelerationStructureDescriptorSize;
        ...
    } VkPhysicalDeviceDescriptorBufferPropertiesEXT;
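  • Since every descriptor type has a queryable byte size, laying out a "descriptor set" in such a buffer is plain offset arithmetic. A sketch with made-up sizes (real values come from the properties struct above, and the real extension provides vkGetDescriptorSetLayoutSizeEXT  / vkGetDescriptorSetLayoutBindingOffsetEXT  to do this for you; computeOffsets  is a hypothetical helper):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout computation for a descriptor buffer: each binding's
// descriptor payload is memcpy'd at an offset we compute ourselves.
struct BindingDesc {
    size_t descriptorSize;  // e.g. sampledImageDescriptorSize
    size_t count;           // array size of the binding
};

// Returns the byte offset of each binding and the total size of the "set".
std::vector<size_t> computeOffsets(const std::vector<BindingDesc>& bindings,
                                   size_t alignment, size_t* totalSize) {
    std::vector<size_t> offsets;
    size_t offset = 0;
    for (const BindingDesc& b : bindings) {
        // Align each binding's start (alignment must be a power of two),
        // in the spirit of descriptorBufferOffsetAlignment.
        offset = (offset + alignment - 1) & ~(alignment - 1);
        offsets.push_back(offset);
        offset += b.descriptorSize * b.count;
    }
    *totalSize = offset;
    return offsets;
}
```

  For example, a single 32-byte descriptor followed by an array of four 16-byte descriptors with 64-byte alignment yields offsets 0 and 64 and a total size of 128 bytes.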
    

Strategy: Push Descriptor ( VK_KHR_push_descriptor )

  • Promoted to core in Vulkan 1.4.

  • Last modified date: (2017-09-12).

  • This extension allows descriptors to be written into the command buffer, while the implementation is responsible for managing their memory. Push descriptors may enable easier porting from older APIs and in some cases can be more efficient than writing descriptors into descriptor sets.

  • Sample .

  • New Commands

    • vkCmdPushDescriptorSetKHR

  • If Vulkan Version 1.1 or VK_KHR_descriptor_update_template  is supported:

    • vkCmdPushDescriptorSetWithTemplateKHR

  • New Structures

    • Extending VkPhysicalDeviceProperties2 :

      • VkPhysicalDevicePushDescriptorPropertiesKHR

  • New Enum Constants

    • VK_KHR_PUSH_DESCRIPTOR_EXTENSION_NAME

    • VK_KHR_PUSH_DESCRIPTOR_SPEC_VERSION

    • Extending VkDescriptorSetLayoutCreateFlagBits :

      • VK_DESCRIPTOR_SET_LAYOUT_CREATE_PUSH_DESCRIPTOR_BIT_KHR

    • Extending VkStructureType:

      • VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PUSH_DESCRIPTOR_PROPERTIES_KHR

  • If Vulkan Version 1.1 or VK_KHR_descriptor_update_template is supported:

    • Extending VkDescriptorUpdateTemplateType :

      • VK_DESCRIPTOR_UPDATE_TEMPLATE_TYPE_PUSH_DESCRIPTORS_KHR

Strategy: Bindful / Classic strategy (Slot-based / Frequency-based)

  • mna (midmidmid):

    • The reason you split up resources into multiple sets is actually to reduce  the cost of vkCmdBindDescriptorSets . The idea being that if you've got one set that holds scene-wide data and a different set that holds object-specific data, you only bind the scene stuff once  and then just leave it bound. Then the per-object updates go faster because you're pushing much smaller descriptor sets into whatever special silicon descriptor sets map to on your particular GPU. Note: there are rules about how you have to arrange your sets (so like the scene-wide one has to be at a lower index than the per-object one), and all of the pipelines you use must have compatible  layouts for the sets you aren't rebinding every time you switch to a different pipeline. Someone can correct me if I'm wrong, but if you switch to a pipeline that's got an incompatible layout for some descriptor set at index n  then all  descriptor sets at indices >= n  need to be rebound.

    • I think the only reason I'd change any of my stuff to bindless is if I hit however many hundreds of thousands of calls to vkCmdBindDescriptorSets  it takes for descriptors to be a per-frame bottleneck.

    • But I find descriptors pretty intuitive and easy to work with.

    • I didn't  find them easy to work with when I first  came to VK (from GL/D3D11-world), but now that I've got some scaffolding set up to manage them, they're easy sauce.

    • (They actually map pretty well to having worked with old  console GPUs where you manage the command queue directly and have to think about resource bindings in terms of physical registers on the GPU. It was helpful to have that background.)

    • If you're working with descriptor sets, then you have lots of little objects whose lifetimes you need to track and manage. Getting them grouped into the appropriate set of pools  cuts that number down to something that's not hard to manage. So, for me, I've got a dynamically allocated and recycled set of descriptor pools for stuff that changes every frame, and then I've got my materials grouped into pack files (for fast content loading) and each of those has one descriptor pool for all the sets for all of its materials. Easy peasy. For bindless, you need to figure out how you're going to divide up the big array of descriptors in your one mega set. There's different strategies for doing that. But you'll get a better description of them out of the bindless fans on the server.

    • Implementation-wise, I  don't think there's a huge complexity difference between the two approaches. Bindless might be conceptually  simpler since "it's just a big array" doesn't require as big of a mental shift as dividing resources up by usage and update frequency and thinking in those  terms.

  • In the “classic” model, before you draw or dispatch, you must bind each resource to a specific descriptor binding or slot.

  • Example:

    • vkCmdBindDescriptorSets(...)

    • Binding texture #0 for this draw, texture #1 for that draw, etc.

  • The shader uses a fixed binding index:

    • layout(set = 0, binding = 3) uniform sampler2D tex;

  • If you want to change which texture is used, you re-bind that descriptor.


Specialization Constants

  • Allows a constant value in SPIR-V to be specified at VkPipeline  creation time.

  • This is powerful as it replaces the idea of doing preprocessor macros in the high level shading language (GLSL, HLSL, etc).

  • A way to provide constant values to a SPIR-V shader at pipeline creation time so the compiler can constant-fold, inline, and eliminate branches.

    • This yields code equivalent to having compiled separate shader variants with those constant values baked in.

  • This is not Vulkan exclusive, but an optimization from SPIR-V. OpenGL 4.6 can also use this feature.

  • Sample .

  • UBOs and Push Constants suffer from limited optimizations during shader compilation. Specialization Constants can provide those optimizations:

    • Uniform buffer objects (UBOs) are one of the most common approaches when it is necessary to set values within a shader at run-time, and they are used in many tutorials. UBOs are pushed to the shader just prior to its execution; this is after shader compilation, which occurs during vkCreateGraphicsPipelines . As these values are set after the shader has been compiled, the driver’s shader compiler has limited scope to perform optimizations during compilation, because optimizations such as loop unrolling or unused code removal require the compiler to know the values controlling them, which is not possible with UBOs. Push constants suffer from the same problem, as they are also provided after the shader has been compiled.

    • Specialization Constants  are set before pipeline creation, meaning these values are known during shader compilation, which allows the driver’s shader compiler to perform optimizations. In this optimization process the compiler has the ability to remove unused code blocks and statically unroll loops, which reduces the fragment cycles required by the shader and results in increased performance.

    • While specialization constants rely on knowing the required values before pipeline creation occurs, by trading off this flexibility and allowing the compiler to perform these optimizations you can increase the performance of your application easily and reduce shader code size.

  • Do :

    • Use compile-time specialization constants for all control flow. This allows compilation to completely remove unused code blocks and statically unroll loops.

  • Don’t :

    • Use control-flow which is parameterized by uniform values; specialize shaders for each control path needed instead.

  • Impact :

    • Reduced performance due to less efficient shader programs.

  • Example :

    #version 450
    layout (constant_id = 0) const float myColor = 1.0;
    layout(location = 0) out vec4 outColor;
    
    void main() {
        outColor = vec4(myColor);
    }
    
    struct myData {
        float myColor = 1.0f;
    } myData;
    
    VkSpecializationMapEntry mapEntry = {};
    mapEntry.constantID = 0; // matches constant_id in GLSL and SpecId in SPIR-V
    mapEntry.offset     = 0;
    mapEntry.size       = sizeof(float);
    
    VkSpecializationInfo specializationInfo = {};
    specializationInfo.mapEntryCount = 1;
    specializationInfo.pMapEntries   = &mapEntry;
    specializationInfo.dataSize      = sizeof(myData);
    specializationInfo.pData         = &myData;
    
    VkGraphicsPipelineCreateInfo pipelineInfo = {};
    pipelineInfo.pStages[fragIndex].pSpecializationInfo = &specializationInfo;
    
    // Create first pipeline with myColor as 1.0
    vkCreateGraphicsPipelines(&pipelineInfo);
    
    // Create second pipeline with same shader, but sets different value
    myData.myColor = 0.5f;
    vkCreateGraphicsPipelines(&pipelineInfo);
    
  • Use cases :

    • Toggling features:

      • Support for a feature in Vulkan isn’t known until runtime. This usage of specialization constants is to prevent writing two separate shaders, but instead embedding a constant runtime decision.

    • Improving backend optimizations:

      • Optimizing shader compilation  from SPIR-V to GPU.

      • The “backend” here refers to the implementation’s compiler that takes the resulting SPIR-V and lowers it down to some ISA to run on the device.

      • Constant values allow a set of optimizations such as constant folding , dead code elimination , etc. to occur.

    • Affecting types and memory sizes:

      • It is possible to set the length of an array or a variable type used through a specialization constant.

      • It is important to note that a compiler will need to allocate registers depending on these types and sizes. This means it is likely that a pipeline cache will miss if the difference in allocated registers is significant.

  • How they work :

    • The values are supplied using VkSpecializationInfo  attached to the VkPipelineShaderStageCreateInfo .

    • In GLSL (or HLSL → SPIR-V) mark a constant with a constant id, e.g. layout(constant_id = 0) const int MATERIAL_MODE = 0;

    • Create VkSpecializationMapEntry  entries mapping constantID  → offset/size in your data block.

    • Fill a contiguous data buffer with the specialization values and set up VkSpecializationInfo .

    • Put the VkSpecializationInfo*  into the shader stage VkPipelineShaderStageCreateInfo  before calling vkCreateGraphicsPipelines . The backend finalizes (specializes/compiles) the shader at pipeline creation time.
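    • The map entries and data block can come straight from a plain struct layout. A hedged sketch (the SpecData  fields and constant ids are hypothetical; MapEntry  is a simplified stand-in for VkSpecializationMapEntry  with the same three fields):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical block of specialization constants; each field maps to one
// constant_id in the shader (e.g. layout(constant_id = 1) const uint ...).
struct SpecData {
    uint32_t materialMode;  // constant_id = 0
    uint32_t lightCount;    // constant_id = 1
    float    exposure;      // constant_id = 2
};

// Simplified stand-in for VkSpecializationMapEntry.
struct MapEntry {
    uint32_t constantID;
    uint32_t offset;  // byte offset into the data block
    uint32_t size;    // byte size of the value
};

// Offsets come from the struct layout, so the blob passed via
// VkSpecializationInfo::pData is simply a pointer to a SpecData instance.
const MapEntry kEntries[] = {
    {0, uint32_t(offsetof(SpecData, materialMode)), sizeof(uint32_t)},
    {1, uint32_t(offsetof(SpecData, lightCount)),   sizeof(uint32_t)},
    {2, uint32_t(offsetof(SpecData, exposure)),     sizeof(float)},
};
```

    In real code the same three numbers go into VkSpecializationMapEntry , with mapEntryCount = 3 , dataSize = sizeof(SpecData) , and pData  pointing at the filled-in SpecData .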

  • How it affects the pipeline workflow :

    • TLDR :

      • It does not solve the pipeline workflow problem. It provides a system for shader optimization at SPIR-V→GPU compile time.

      • Specialization lets you get near-compile-time optimizations while still selecting variants at runtime, but it does not avoid having multiple created pipelines if you need multiple different specialized behaviors.

    • They do not, by themselves, precompile every possible branch permutation and keep them all resident for you. Each distinct set of specialization values that you want available at runtime normally corresponds to a separately created pipeline (the specialization values are applied during pipeline creation).

    • If you need multiple variants you must create (or reuse) the pipelines for those values.

    • If you have N independent boolean specialization choices, the number of possible specialized pipelines is 2^N (exponential growth). Creating many pipelines increases driver/state memory and creation time; use caching/derivatives/libraries if creation cost or count is a concern.

    • You cannot change a specialization constant per draw without binding a different pipeline: the specialization is fixed for the pipeline object, so per-draw changes require binding another pipeline or using a different strategy (uniforms, push constants, dynamic branching).

    • Different values mean different pipeline creation (driver work / memory).

    • "Is this a way to precompile every branching of a shader?"

      • Yes, but only if you actually create a pipeline for each variant.

      • Specialization constants let the driver compile-away branches at pipeline-creation time, but they do not magically produce all variants for you at draw time.
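    • The usual mitigation for the 2^N growth can be sketched as a cache keyed on the packed specialization values, so each distinct combination pays the expensive pipeline creation at most once. PipelineHandle  and the cache below are hypothetical stand-ins, not Vulkan API.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for a created VkPipeline handle.
using PipelineHandle = uint64_t;

// N independent boolean specialization choices give 2^N possible variants.
uint32_t variant_count(int n) { return 1u << n; }

// Pack the boolean choices into a cache key.
uint32_t pack_spec_key(const bool* choices, int n) {
    uint32_t key = 0;
    for (int i = 0; i < n; ++i)
        if (choices[i]) key |= 1u << i;
    return key;
}

// Cache pipelines by specialization values: each distinct combination pays
// the (expensive) creation cost at most once.
struct PipelineCache {
    std::unordered_map<uint32_t, PipelineHandle> pipelines;
    uint64_t creations = 0;  // stands in for calls to vkCreateGraphicsPipelines

    PipelineHandle get(uint32_t key) {
        auto it = pipelines.find(key);
        if (it != pipelines.end()) return it->second;  // reuse: no driver work
        PipelineHandle p = ++creations;
        pipelines.emplace(key, p);
        return p;
    }
};

uint32_t example_key() {
    bool choices[3] = {true, false, true};
    return pack_spec_key(choices, 3);
}

uint64_t creations_for_two_identical_requests() {
    PipelineCache cache;
    cache.get(example_key());
    cache.get(example_key());  // second request hits the cache
    return cache.creations;
}
```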

  • Recommendations :

    • Improving shader performance with Vulkan's specialization constants .

      • When we create the Vulkan pipeline, we pass this specialization information using the pSpecializationInfo  field of VkPipelineShaderStageCreateInfo . At that point, the driver will override the default value of the specialization constant with the value provided here before the shader code is optimized and native GPU code is generated, which allows the driver compiler backend to generate optimal code.

      • It is possible to compile the same shader with different constant values in different pipelines, so even if a value changes often, so long as we have a finite number of combinations, we can generate optimized pipelines for each one ahead of the start of the rendering loop and just swap pipelines as needed while rendering.

      • "promote the UBO array to a push constant".

      • Applying specialization constants in a small number of shaders allowed me to benefit from loop unrolling and, most importantly, UBO promotion to push constants in the SSAO pass, obtaining performance improvements that ranged from 10% up to 20% depending on the configuration.

      • In other words:

        • The article shows how it's possible to pass a value to the shader during graphics pipeline creation so the shader is compiled from SPIR-V to native GPU code with that constant overridden.

        • This helps by allowing the SPIR-V→GPU compiler to make optimization choices such as unrolling loops and removing branches; it can also enable UBO promotion.

        • The article does not suggest specialization constants solve the pipeline workflow problem. It focuses on compile-time shader optimizations.

Physical Storage Buffer ( KHR_buffer_device_address )

  • Impressions :

    • (2025-09-08)

    • No descriptor sets.

      • Cool.

    • Very easy to set up.

    • Shader usage is a bit tricky; you typically still need push constants (or another bound buffer) to get the root buffer address into the shader.

    • More prone to programmer errors because there is no automatic bounds checking.

    • Hmm, idk, for now not sure.

  • Adds the ability to have “pointers in the shader”.

  • Buffer device address is a powerful and unique feature of Vulkan. It exposes GPU virtual addresses directly to the application, and the application can then use those addresses to access buffer data freely through pointers rather than descriptors.

  • This feature lets you place addresses in buffers and load and store to them inside shaders, with full capability to perform pointer arithmetic and other tricks.

  • Support :

    • Promoted to core in Vulkan 1.2; the bufferDeviceAddress  feature became mandatory in Vulkan 1.3.

    • Submitted at (2019-01-06), core at (2019-11-25).

    • Coverage :

      • (2025-09-08) 71.6%

      • 79.8% Windows

      • 70.9% Linux

      • 68.7% Android

  • Lack of safety :

    • A critical thing to note is that a raw pointer has no idea of how much memory is safe to access. Unlike SSBOs, which can be bounds-checked when robustness features are enabled, raw pointers require you to do range checks yourself or avoid relying on out-of-bounds behavior.

  • Creating a buffer :

    • To be able to grab a device address from a VkBuffer , you must create the buffer with SHADER_DEVICE_ADDRESS  usage.

    • The memory you bind that buffer to must be allocated with the corresponding flag via pNext .

    VkMemoryAllocateFlagsInfoKHR flags_info{STRUCTURE_TYPE_MEMORY_ALLOCATE_FLAGS_INFO_KHR};
    flags_info.flags             = MEMORY_ALLOCATE_DEVICE_ADDRESS_KHR;
    memory_allocation_info.pNext = &flags_info;
    
    • After allocating and binding the buffer, query the address:

    VkBufferDeviceAddressInfoKHR address_info{STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO_KHR};
    address_info.buffer = buffer.buffer;
    buffer.gpu_address  = vkGetBufferDeviceAddressKHR(device, &address_info);
    
    • This address behaves like a normal address; you can offset the VkDeviceAddress  value as you see fit since it is a uint64_t .

    • There is no host-side alignment requirement enforced by the API for this value.

    • When using this pointer in shaders, you must provide and respect alignment semantics yourself, because the shader compiler cannot infer anything about a raw pointer loaded from memory.

    • You can place this pointer inside another buffer and use it as an indirection.
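    • Since VkDeviceAddress  is a plain uint64_t , the offsetting, alignment, and indirection described above can be sketched entirely host-side (align_up  and the read/write helpers are illustrative, not Vulkan API):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// VkDeviceAddress is a plain uint64_t; these helpers are illustrative.
using DeviceAddress = uint64_t;

// Round an address up to a power-of-two alignment, e.g. to satisfy the
// alignment you promised the shader via buffer_reference_align.
DeviceAddress align_up(DeviceAddress addr, uint64_t alignment) {
    return (addr + alignment - 1) & ~(alignment - 1);
}

// One level of indirection: write a device address into another buffer,
// exactly like storing a pointer in a struct.
void write_address(std::vector<uint8_t>& buffer, size_t offset, DeviceAddress addr) {
    std::memcpy(buffer.data() + offset, &addr, sizeof(addr));
}

DeviceAddress read_address(const std::vector<uint8_t>& buffer, size_t offset) {
    DeviceAddress addr;
    std::memcpy(&addr, buffer.data() + offset, sizeof(addr));
    return addr;
}

DeviceAddress roundtrip(DeviceAddress addr) {
    std::vector<uint8_t> buffer(16);
    write_address(buffer, 8, addr);
    return read_address(buffer, 8);
}
```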

  • GL_EXT_buffer_reference :

    • In Vulkan GLSL, the GL_EXT_buffer_reference  extension allows declaring buffer blocks as pointer-like types rather than SSBOs. GLSL lacks true pointer types, so this extension exposes pointer-like behavior.

    #extension GL_EXT_buffer_reference : require
    
    • You can forward-declare types. Useful for linked lists and similar structures.

    layout(buffer_reference) buffer Position;
    
    • You can declare a buffer reference type. This is not an SSBO declaration, but effectively a pointer-to-struct.

    layout(std430, buffer_reference, buffer_reference_align = 8) writeonly buffer Position {
        vec2 positions[];
    };
    
    • buffer_reference  tags the type accordingly. buffer_reference_align  marks the minimum alignment for pointers of this type.

    • You can place the Position  type inside another buffer or another buffer reference type:

    layout(std430, buffer_reference, buffer_reference_align = 8) readonly buffer PositionReferences {
        Position buffers[];
    };
    
    • Now you have an array of pointers.

    • You can also place a buffer reference inside push constants, an SSBO, or a UBO.

    layout(std430, set = 0, binding = 0) readonly buffer Pointers {
        Position positions[];
    };
    
    layout(std430, push_constant) uniform Registers {
        PositionReferences references;
    } registers;
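    
    • On the host side, the push-constant block above is just an 8-byte address. A minimal sketch of the matching C++ struct, assuming 64-bit device addresses:

```cpp
#include <cassert>
#include <cstdint>

// Host-side mirror of the push-constant block above: the PositionReferences
// "pointer" is just an 8-byte GPU virtual address (a VkDeviceAddress).
struct Registers {
    uint64_t references;  // device address of the PositionReferences buffer
};

// The pipeline layout's push-constant range would cover sizeof(Registers)
// bytes, and vkCmdPushConstants would upload the struct directly.
constexpr uint32_t kRegistersSize = sizeof(Registers);
```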
    
  • Casting pointers :

    • A key aspect of buffer device address is that we gain the capability to cast pointers freely.

    • While it is technically possible (and useful in some cases!) to "cast pointers" with SSBOs with clever use of aliased declarations like so:

    layout(set = 0, binding = 0) buffer SSBO { float v1[]; };
    layout(set = 0, binding = 0) buffer SSBO2 { vec4 v4[]; };
    
    • It gets kind of hairy quickly, and not as flexible when dealing with composite types.

    • When we have casts between integers and pointers, we get the full madness  that is pointer arithmetic. Nothing stops us from doing:

    #extension GL_EXT_buffer_reference : require
    layout(buffer_reference) buffer PointerToFloat { float v; };
    
    PointerToFloat pointer = load_pointer();
    uint64_t int_pointer = uint64_t(pointer);
    int_pointer += offset;
    pointer = PointerToFloat(int_pointer);
    pointer.v = 42.0;
    
    • Not all GPUs support 64-bit integers, so it is also possible to use uvec2  to represent pointers. This way, we can do raw pointer arithmetic in 32-bit, which might be more optimal anyway.

    #extension GL_EXT_buffer_reference_uvec2 : require
    layout(buffer_reference) buffer PointerToFloat { float v; };
    PointerToFloat pointer = load_pointer();
    uvec2 int_pointer = uvec2(pointer);
    uint carry;
    uint lo = uaddCarry(int_pointer.x, offset, carry);
    uint hi = int_pointer.y + carry;
    pointer = PointerToFloat(uvec2(lo, hi));
    pointer.v = 42.0;
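    
    • The carry trick above can be checked host-side: splitting a 64-bit address into 32-bit halves and adding the offset with an explicit carry gives the same result as 64-bit addition. A minimal C++ sketch mirroring the GLSL:

```cpp
#include <cassert>
#include <cstdint>

// Host-side mirror of the GLSL uvec2 trick: a 64-bit pointer is split into
// (lo, hi) 32-bit halves and offset with an explicit carry, matching what
// uaddCarry does in the shader.
struct Uvec2Ptr {
    uint32_t lo;
    uint32_t hi;
};

Uvec2Ptr split(uint64_t addr) {
    return {static_cast<uint32_t>(addr), static_cast<uint32_t>(addr >> 32)};
}

uint64_t join(Uvec2Ptr p) {
    return (static_cast<uint64_t>(p.hi) << 32) | p.lo;
}

// Equivalent of: lo = uaddCarry(p.x, offset, carry); hi = p.y + carry;
Uvec2Ptr add_offset(Uvec2Ptr p, uint32_t offset) {
    uint64_t sum = static_cast<uint64_t>(p.lo) + offset;
    uint32_t carry = static_cast<uint32_t>(sum >> 32);
    return {static_cast<uint32_t>(sum), p.hi + carry};
}
```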
    
  • Debugging :

    • When debugging or capturing an application that uses buffer device addresses, there are some special driver requirements that are not universally supported. To capture application buffers which contain raw pointers, the device address for a given buffer must remain stable when the capture is replayed in a new process.

    • Applications do not have to do anything here: tools like RenderDoc enable the bufferDeviceAddressCaptureReplay  feature for you and deal with all the magic associated with address capture behind the scenes. If the bufferDeviceAddressCaptureReplay  feature is not present, however, tools like RenderDoc will mask out the bufferDeviceAddress  feature, so beware.

  • Sample .
