Vulkan

Starting

Versions
  • Version 1.3 (2024-02-22).

Is OOP?
  • No; the core API is plain C built around opaque handles, not object-oriented. Vulkan-Hpp layers C++ bindings on top.

API Structs
  • Many structures in Vulkan require you to explicitly specify the type of structure in the sType  member.

  • Functions that create or destroy an object take a VkAllocationCallbacks  parameter that allows you to use a custom allocator for driver memory; it will be left as nullptr  in this tutorial.

  • Almost all functions return a VkResult  that is either VK_SUCCESS  or an error code. The specification describes which error codes each function can return and what they mean.

  • Objects whose names have the KHR  postfix are part of a Vulkan extension.

  • The pNext  member can point to an extension structure.

Compatibility

Support
  • Platform Support .

  • Checking for Vulkan support .

  • Windows (7 and later)

    • Yes, via the official SDK and drivers.

  • Linux

    • Yes. Native support via Mesa and vendor drivers.

  • Android (5.0+)

    • Yes, most devices from Android 7.0+ support Vulkan.

  • macOS

    • No native support — requires MoltenVK (Vulkan-to-Metal wrapper).

  • iOS

    • No native support — requires MoltenVK.

  • Web

    • No native support — experimental via WebGPU or Emscripten with translation layers.

  • Consoles

    • Partially supported; depends on platform SDKs and NDAs (e.g., Nintendo Switch uses a Vulkan-like API).

Driver support
  • Vulkan requires updated GPU drivers.

  • Older or integrated GPUs (especially pre-2013) may lack Vulkan support.

  • Vendor support varies: NVIDIA, AMD, and Intel generally support Vulkan on most modern hardware.

Compatibility Layers
  • Layers and wrappers used to increase compatibility across platforms:

  • MoltenVK :

    • Runs Vulkan on Metal (required for macOS/iOS).

  • gfx-rs / wgpu / bgfx :

    • Abstraction layers to use Vulkan when available, fallback to other APIs.

  • ANGLE / Zink :

    • Can translate other APIs (e.g., OpenGL) to Vulkan and vice-versa.

Tutorials

Tutorials in Docs
  • Docs Vulkan Guide .

    • I already read everything before the memory allocation section.

  • Docs Vulkan Tutorial .

    • Based on the vulkan-tutorial, with differences:

      • Vulkan 1.4 as a baseline

      • Dynamic rendering instead of render passes

      • Timeline semaphores

      • Slang  as the primary shading language

      • Modern C++ (20) with modules

      • Vulkan-Hpp  with RAII

      • It also contains Vulkan usage clarifications, improved synchronization and new content.

      • "This tutorial will use RAII with smart pointers and it will endeavor to demonstrate the latest methods and extensions which should hopefully make Vulkan a joy to use."

    • Does not require knowledge of previous APIs, but you need to know C++ and graphics math.

    • Impressions :

      • Holy moly the new C++ API is a pain.

      • I preferred to go back to the vulkan-tutorial  several times and check how it's used in the C API.

      • I used this tutorial only as a base to consider the new features.

      • I didn't use Slang, I didn't like it; I stayed with GLSL.

  • vulkan-tutorial .

    • Does not require knowledge of previous APIs, but you need to know C++ and graphics math.

    • You can use C, but the tutorial is in C++.

    • Vulkan 1.0; shown here .

    • Uses GLSL for shaders.

  • ~ Vulkan Guide .

    • For people with previous experience with Graphics APIs.

    • I'm not a big fan of this guide.

    • Uses :

      • Vulkan 1.3.

      • C++, Visual Studio, CMake.

      • SDL to create a window.

      • Vk Bootstrap .

        • Abstracts a big amount of boilerplate that Vulkan has when setting up. Most of that code is written once and never touched again, so we will skip most of it using this library. This library simplifies instance creation, swapchain creation, and extension loading. It will be removed from the project eventually in an optional chapter that explains how to initialize that Vulkan boilerplate the “manual” way.

      • VMA (Vulkan Memory Allocator)

        • Implements memory allocators for Vulkan, header only. In Vulkan, the user has to deal with the memory allocation of buffers, images, and other resources on their own. This can be very difficult to get right in a performant and safe way. Vulkan Memory Allocator does it for us and allows us to simplify the creation of images and other resources. Widely used in personal Vulkan engines or smaller scale projects like emulators. Very high end projects like Unreal Engine or AAA engines write their own memory allocators.

    • Impressions :

      • The tutorial gives you a project with many things already done, and holds your hand for every syntax, file, folder, methodology, etc.

        • It simply throws a lot of stuff at you.

        • It's a pretty bloated  experience, for sure.

        • I consider that a pain.

  • Samples Collections in C++ .

  • Vulkan Barriers Explained .

  • Vulkan AMD Blog Posts .

  • Writing an Efficient Vulkan Renderer .

Playlists
  • Playlist Vulkan with Odin - Nadako .

    • Vulkan 1.3, with Dynamic Rendering.

    • I watched videos 1 through 11.

    • They are good videos.

    • I do not recommend them to someone who has never seen anything before, because they are not exactly for beginners and their explanations lack some foundation.

    • I recommend them as a reference for how to set up in Odin.

  • Playlist Vulkan - OGLDEV .

    • C++, with Visual Studio.

    • Assumes you have seen another GPU API before.

    • Video 1:

      • Window with GLFW, not explained.

    • Video 8:

      • Theory explanation ok; code explanation meh.

    • Video 12:

      • Synchronization with 1 frame in-flight.

      • Good video.

    • Video 16:

      • Descriptor Sets.

      • Nope. See the spec, guides, or other videos on the subject; I think that's better.

    • Video 21:

      • Dynamic Rendering.

      • {0:00 -> 12:14}

        • Explanation of the code to obtain the EXT for Vulkan 1.2, and ignore it for Vulkan 1.3

      • The rest of the video is irrelevant, it does not explain anything beyond what to change if someone is following his code line by line.

  • Playlist Vulkan 2024 - GetIntoGameDev .

    • Overall :

      • The person seems nice and I like when he draws things.

      • Unfortunately 95% of the series videos are code in C++ and he does not do a good job explaining the code.

      • I listed some videos below that I considered interesting.

    • Vulkan 1.3.

    • Video 12:

      • Synchronization, with 1 frame in-flight.

      • The drawings are nice.

    • ~Video 13:

      • Multithreaded rendering.

      • Nope. See the Multithreading Rendering section to understand why "nope".

    • Video 26:

      • Barycentric coordinates.

    • Only code, so nope :

      • Videos: 9, 10, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 27, 28, 29.

    • Playlist Vulkan - GetIntoGameDev .

      • Vulkan 1.2, (2022-01-22).

      • Watch the new 2024 version of the tutorials.

      • The person sometimes explains on a sheet of paper, which is nice.

  • Playlist Vulkan - Computer Graphics at TU Wien .

    • Vulkan 1.2.

    • Video 1:

      • SDK, Instances, extensions, physical devices, logical devices.

      • Ok.

    • Video 2:

      • Presentation Modes, Swapchain.

      • {10:20 -> 21:45}

        • Explanation of all Presentation Modes.

    • Video 3:

      • Explanation of Buffers and Images.

      • The explanation seemed a bit rushed and the definitions are poorly established.

      • I can return and rewatch the video after reading the documentation.

    • Video 4:

      • Commands, Command Pools, Command Buffers.

      • Ok, sure.

      • I skipped the descriptor sets part.

    • Video 5:

      • Pipelines.

      • I skipped it.

    • Video 6:

      • Synchronization.

      • Skipped.

    • Impressions :

      • I don't like the illustrations, nor the tone of the explanation.

      • I simply feel I learn more and feel more confident reading the documentation or the spec.

      • The videos are "more technical", but when that is the case documentation is better.

      • I prefer a simpler playlist to learn some basic concepts, and to read the documentation for advanced topics.

  • Playlist Vulkan - Brendan Galea .

    • Vulkan 1.0.

    • C++, with Visual Studio.

    • It's a pain to see C++ code.

    • The sketch explanations in the middle of the videos are ok, but the rest is very bad; all code-related parts are unpleasant and with a LOT of mess in C++.

    • Video 1:

      • Window with GLFW.

    • Video 2:

      • Light explanation of the graphics pipeline.

      • {9:54}

        • Shader compilation, to SPIR-V.

    • Video 20:

      • Descriptor Sets

      • {0:00 -> 5:35} Nice explanation.

      • The rest of the video is nah.

  • Vulkan playlist - Cakez .

    • C++

    • Starts by teaching how to install Visual Studio and Git...

    • Does not use GLFW, instead creates its own platform layer on Windows to create a window.

  • Vulkan playlist - Francesco Piscani .

    • He uses the vulkan-tutorial.

    • Spends the first 4 episodes doing basically nothing, just setting up CMake and Linux.

    • Nope, it sounds bad as a tutorial.

Talks
  • Vulkan in Doom 3 .

    • Use RenderDoc extensively.

    • 1 Render Pass, 1 subpass, 3 attachments.

    • Buffers and Images

    • Allocations:

      • VMA for allocators.

    • 28 shaders + changes => 100 pipelines total at runtime.

    • Synchronization:

      • Not much of it. Doom 3 was single-threaded; it didn't require multithreading.

Samples

  • To run :

    • Clone the repo recursively ( git clone --recursive ).

    • Build the entire solution.

    • The binaries end up in Vulkan-Samples\build\windows\app\bin\debug\AMD64 .

    • Copy the shaders  and assets  folders from Vulkan-Samples  to the folder above.

    • Run .\vulkan_samples sample sample_name .

  • Note :

    • The normal and the hpp samples have the same performance; it does not matter.

  • Impressions :

    • The extension samples were more visually "uninteresting".

    • I saw all API samples, but I didn't see all Extensions.

    • There were still other folders besides these two, but I was lazy to check.

API

  • instancing

    • Wow, awesome.

    • The fps is very high.

  • oit_linked_lists (Order Independent Transparency)

  • oit_depth_peeling (Order Independent Transparency)

    • The object in the center rotates with the mouse.

  • compute_nbody

  • dynamic_uniform_buffers.

  • hdr

    • Allows changing the object, toggling the skybox, changing the exposure, toggling bloom.

  • terrain_tessellation

    • Increasing the tessellation factor made it look like the terrain polycount increased.

  • timestamp_queries

    • Allows changing the object, toggling the skybox, changing the exposure, toggling bloom.

  • separate_image_sampler

    • Allows selecting linear or nearest filtering.

  • texture_loading

    • Allows increasing the LOD bias, reducing image quality.

  • texture_mipmap_generation

    • Allows calibrating the LOD bias, and choosing between mipmap off, bilinear and anisotropic.

  • hello_triangle_1_3 / hello_triangle

    • Nothing special

    • No dynamic resize.

Extensions

  • dynamic_line_rasterization

    • This sample demonstrates functions from various extensions related to dynamic line rasterization.

    • These functions can be useful for developing CAD applications.

    • From the EXT_line_rasterization  extension.

      • vkCmdSetLineStippleEXT  - sets the stipple pattern.

    • From the EXT_extended_dynamic_state3  extension:

      • vkCmdSetPolygonModeEXT  - sets how defined primitives should be rasterized.

      • vkCmdSetLineRasterizationModeEXT  - sets the algorithm for line rasterization.

      • vkCmdSetLineStippleEnableEXT  - toggles stippling for lines.

    • And also from core Vulkan:

      • vkCmdSetLineWidth  - sets the line width.

      • vkCmdSetPrimitiveTopologyEXT  - defines which type of primitive is being drawn.

  • debug utils

    • Toggle bloom, toggle skybox.

    • Uses the EXT_debug_utils  extension to set up a validation layer messenger callback and pass additional debugging information to debuggers like RenderDoc.

    • EXT_debug_utils  has been introduced based on feedback for the initial Vulkan debugging extensions EXT_debug_report  and EXT_debug_marker , combining these into a single instance extension with some added functionality.

    • Procedure examples :

      • vkCmdBeginDebugUtilsLabelEXT

      • vkCmdInsertDebugUtilsLabelEXT

      • vkCmdEndDebugUtilsLabelEXT

      • vkQueueBeginDebugUtilsLabelEXT

      • vkQueueInsertDebugUtilsLabelEXT

      • vkQueueEndDebugUtilsLabelEXT

      • vkSetDebugUtilsObjectNameEXT

      • vkSetDebugUtilsObjectTagEXT

  • conditional_rendering

    • A list of 235 parts of the car, which can be disabled to not render.

    • The EXT_conditional_rendering  extension allows the execution of rendering commands to be conditional based on a value taken from a dedicated conditional buffer.

    • This may help an application reduce latency by conditionally discarding rendering commands without application intervention.

    • This sample demonstrates usage of this extension for conditionally toggling the visibility of sub-meshes of a complex glTF model.

    • Instead of having to update command buffers, this is done by updating the aforementioned buffer.

  • conservative_rasterization

    • Enabling the conservative rasterization option causes this blending effect.

    • EXT_conservative_rasterization  changes the way fragments are generated.

    • Enables overestimation to generate fragments for every pixel touched  instead of only pixels that are fully covered.

  • color_write_enable

    • Color picker to change the background color.

    • Some options for "bit", changing the triangle color.

    • The EXT_color_write_enable  extension allows toggling the output color attachments using a pipeline dynamic state.

    • It allows the program to prepare an additional framebuffer populated with the data from a defined color blend attachment which can be blended dynamically to the final scene.

    • The final results are comparable to those obtained with vkCmdSetColorWriteMaskEXT , but it does not require the GPU driver to support EXT_extended_dynamic_state3 .

  • dynamic_blending

    • This sample demonstrates the functionality of EXT_extended_dynamic_state3  related to blending.

    • It includes the following features:

      • vkCmdSetColorBlendEnableEXT : toggles blending on and off.

      • vkCmdSetColorBlendEquationEXT : modifies blending operators and factors.

      • vkCmdSetColorBlendAdvancedEXT : utilizes more complex blending operators.

      • vkCmdSetColorWriteMaskEXT : toggles individual channels on and off.

  • descriptor_indexing

  • ~descriptor_buffer_basic

    • Just boxes rotating, I didn't understand.

    • Just textures rotating, I didn't understand.

  • dynamic_multisample_rasterization

    • This sample demonstrates one of the functionalities of EXT_extended_dynamic_state3  related to rasterization samples.

    • The extension can be used to dynamically change sampling without the need to swap pipelines.

    • This thing took quite a while to open, generating binary files, etc.

  • dynamic_primitive_clipping

    • This sample demonstrates how to apply depth clipping  using the vkCmdSetDepthClipEnableEXT()  command which is a part of the EXT_extended_dynamic_state3  extension.

    • Additionally, it shows how to apply primitive clipping  using the gl_ClipDistance[]  builtin shader variable.

    • It is worth noting that primitive clipping  and depth clipping  are two separate features of the fixed-function vertex post-processing stage.

    • They're both described in the same chapter of the Vulkan specification (chapter 27.4, "Primitive clipping").

    • What is primitive clipping

      • Primitives produced by vertex/geometry/tessellation shaders are sent to fixed-function vertex post-processing.

      • Primitive clipping is a part of the post-processing pipeline in which primitives such as points/lines/triangles are culled against the cull volume and then clipped to the clip volume.

      • And then they might be further clipped by results stored in the gl_ClipDistance[]  array - values in this array must be calculated in a vertex/geometry/tessellation shader.

      • In the past, the fixed-function version of the OpenGL API provided a method to specify parameters for up to 6 clipping planes (half-spaces) that could perform additional primitive clipping. Fixed-function hardware calculated proper distances to these planes and made a decision - should the primitive be clipped against these planes or not (for historical study - search for the glClipPlane()  description).

      • Vulkan inherited the idea of primitive clipping, but with one important difference: the user has to calculate the distance to the clip planes on their own in the vertex shader.

      • And - because the user does it in a shader - they do not have to use clip planes at all. It can be any kind of calculation, as long as the results are put in the gl_ClipDistance[]  array.

      • Values that are less than 0.0 cause the vertex to be clipped. In the case of a triangle primitive the whole triangle is clipped if all of its vertices have values stored in gl_ClipDistance[]  below 0.0. When some of these values are above 0.0 - the triangle is split into new triangles as described in the Vulkan specification.

    • What is depth clipping

      • When depth clipping is disabled then effectively there is no near or far plane clipping.

      • Depth values of primitives that are behind the far plane are clamped to the far plane depth value (usually 1.0).

      • Depth values of primitives that are in front of the near plane are clamped to the near plane depth value (by default it's 0.0, but may be set to -1.0 if we use settings defined in VkPipelineViewportDepthClipControlCreateInfoEXT  structure. This requires the presence of the EXT_depth_clip_control  extension which is not part of this tutorial).

      • In this sample the result of depth clipping (or lack of it) is not clearly visible at first. Try to move the viewer position closer to the object and see how the "use depth clipping" checkbox changes object appearance.

  • ~buffer_device_address.

    • I didn't understand. It's just things moving.

  • ~calibrated_timestamps

    • timestamp_queries, but with other timings.

Core

Instance / Extensions

Instance
  • VkInstance

    • The Vulkan context, used to access drivers.

  • The instance is the connection between your application and the Vulkan library.

  • VkApplicationInfo .

    • Optional, but it may provide some useful information to the driver to optimize our specific application.

  • VkInstanceCreateInfo .

    • Tells the Vulkan driver which global extensions and validation layers we want to use.

Instance Level Extensions
  • vkEnumerateInstanceExtensionProperties()

    • Retrieve a list of supported extensions before creating an instance.

    • Each VkExtensionProperties  struct contains the name and version of an extension.

Debugging

Validation Layers
  • Layers .

  • Vulkan is designed for high performance and low driver overhead; therefore, it includes very limited error checking and debugging capabilities by default.

  • The driver will often crash  instead of returning an error code if you do something wrong, or worse, it will appear to work on your graphics card and completely fail  on others.

  • Vulkan allows you to enable extensive checks through a feature known as validation layers .

  • Validation layers are pieces of code that can be inserted between the API and the graphics driver to do things like running extra checks on function parameters and tracking memory management problems.

  • The nice thing is that you can enable them during development and then completely disable them when releasing your application for zero overhead. Anyone can write their own validation layers, but the Vulkan SDK by LunarG provides a standard set of validation layers. You also need to register a callback function to receive debug messages from the layers.

  • Because Vulkan is so explicit about every operation and the validation layers are so extensive, it can actually be a lot easier  to find out why your screen is black compared to OpenGL and Direct3D!

  • Common operations in validation layers are:

    • Checking the values of parameters against the specification to detect misuse

    • Tracking the creation and destruction of objects to find resource leaks

    • Checking thread safety by tracking the threads that calls originate from

    • Logging every call and its parameters to the standard output

    • Tracing Vulkan calls for profiling and replaying

  • There were formerly two different types of validation layers in Vulkan: instance  and device  specific.

  • The idea was that instance layers would only check calls related to global Vulkan objects like instances, and device-specific layers would only check calls related to a specific GPU.

  • Device-specific layers have now been deprecated , which means that instance validation layers apply to all Vulkan calls.

  • We don’t really need to check for the existence of the debug utils extension, because it should be implied by the availability of the validation layers.

  • vkEnumerateInstanceLayerProperties

  • RenderDoc :

    • Do not run validation at the same time as RenderDoc, otherwise you'll also be validating RenderDoc.

  • Vulkan Configurator :

    • Overwrites the normal Layer setup.

    • Implicitly loads layers.

    • How to use :

      • RIGHT-CLICK.

  • Performance :

    • Ensure validation layers and debug callbacks are off for performance runs. Use pipeline cache objects to avoid repeated pipeline creation cost.

    • I noticed how each 'push', 'descriptor set bind', 'vertex bind', 'indices bind' and 'draw' was a lot slower with validations on.

Message Callback
  • The validation layers will print debug messages to the standard output by default, but we can also handle them ourselves by providing an explicit callback in our program.

  • This will also allow you to decide which kind of messages you would like to see.

  • messageSeverity

  • messageType

  • pfnUserCallback

    • messageSeverity

      • DEBUG_UTILS_MESSAGE_SEVERITY_VERBOSE_EXT

        • Diagnostic message

      • DEBUG_UTILS_MESSAGE_SEVERITY_INFO_EXT

        • Informational message like the creation of a resource

      • DEBUG_UTILS_MESSAGE_SEVERITY_WARNING_EXT

        • Message about behavior that is not necessarily an error, but very likely a bug in your application

      • DEBUG_UTILS_MESSAGE_SEVERITY_ERROR_EXT

        • Message about behavior that is invalid and may cause crashes.

    • messageType

      • DEBUG_UTILS_MESSAGE_TYPE_GENERAL_EXT

        • Some event has happened that is unrelated to the specification or performance

      • DEBUG_UTILS_MESSAGE_TYPE_VALIDATION_EXT

        • Something has happened that violates the specification or indicates a possible mistake

      • DEBUG_UTILS_MESSAGE_TYPE_PERFORMANCE_EXT

        • Potential non-optimal use of Vulkan

    • pCallbackData

      • Refers to a VkDebugUtilsMessengerCallbackDataEXT  struct containing the details of the message itself, with the most important members being:

      • pMessage

        • The debug message as a null-terminated string

      • pObjects

        • Array of Vulkan object handles related to the message

      • objectCount

        • Number of objects in the array

    • pUserData

      • Contains a pointer specified during the setup of the callback and allows you to pass your own data to it.

Debug Utils ( VK_EXT_debug_utils )
// Give a Vulkan object a human-readable name so it shows up
// in debuggers like RenderDoc and in validation messages.
must(
    vk.SetDebugUtilsObjectNameEXT(
        dev,
        &vk.DebugUtilsObjectNameInfoEXT {
            sType = .DEBUG_UTILS_OBJECT_NAME_INFO_EXT,
            objectType = obj,
            objectHandle = handle,
            pObjectName = strings.clone_to_cstring(name, context.temp_allocator),
        },
    ),
)

Window / Surface / GLFW

Window
  • The Vulkan API itself is completely platform-agnostic, which is why we need to use the standardized WSI (Window System Integration) extensions to interact with the window manager.

  • Windows can be created with the native platform APIs or libraries like GLFW  and SDL .

  • Some platforms allow you to render directly to a display without interacting with any window manager through the KHR_display  and KHR_display_swapchain  extensions.

  • These allow you to create a surface that represents the entire screen and could be used to implement your own window manager, for example.

GLFW
  • GLFW Reference .

  • The very first call in initWindow  should be glfwInit() , which initializes the GLFW library. Because GLFW was originally designed to create an OpenGL context, we need to tell it to not create an OpenGL context with a later call:

  • Because handling resized windows takes special care that we’ll look into later, disable it for now with another window hint call:

glfwWindowHint(GLFW_CLIENT_API, GLFW_NO_API);
glfwWindowHint(GLFW_RESIZABLE, GLFW_FALSE);
  • All that’s left now is creating the actual window. Add a GLFWwindow* window;  private class member to store a reference to it and initialize the window with:

window = glfwCreateWindow(WIDTH, HEIGHT, "Vulkan", nullptr, nullptr);
  • The first three parameters specify the width, height and title of the window. The fourth parameter allows you to optionally specify a monitor to open the window on, and the last parameter is only relevant to OpenGL.

  • Init:

void initWindow() {
    glfwInit();

    glfwWindowHint(GLFW_CLIENT_API, GLFW_NO_API);
    glfwWindowHint(GLFW_RESIZABLE, GLFW_FALSE);

    window = glfwCreateWindow(WIDTH, HEIGHT, "Vulkan", nullptr, nullptr);
}
  • Main loop:

void mainLoop() {
    while (!glfwWindowShouldClose(window)) {
        glfwPollEvents();
    }
}
  • Destroy:

void cleanup() {
    glfwDestroyWindow(window);

    glfwTerminate();
}
Surface
  • A VkSurfaceKHR  is an opaque handle representing a platform-specific presentation target (for example, a window on Windows, an X11 window on Linux, or a UIView on iOS). It is created directly from the Vulkan instance together with a native window handle. Conceptually, a surface is:

    • Instance-level: it lives above any physical or logical device.

    • Window abstraction: it wraps the OS window or drawable so that Vulkan knows where to submit images for display.

    • Device-agnostic: you can create a surface before choosing which GPU you will use.

  • Once created, the surface is used by a chosen physical device to query presentation support, formats and capabilities, and then by the logical device to build a Swapchain.

  • A surface itself is not intrinsically tied to any particular physical or logical device, because:

    • Creation: you call vkCreateSurfaceKHR(instance, …)  without involving a VkPhysicalDevice  or VkDevice  handle.

    • Lifetime: it exists even before you pick or create a device, and you destroy it with vkDestroySurfaceKHR(instance, surface, …) .

  • Lifetime :

    • The surface is tied to the GLFW window's lifecycle.

    • It does not change  when the window is resized, minimized, or restored.

    • The same surface handle remains valid until you destroy it (e.g., when closing the window).

  • "Window surfaces are part of the larger topic of render targets and presentation".

  • Surface Formats .

Extensions
  • To establish the connection between Vulkan and the window system to present results to the screen, we need to use the WSI (Window System Integration) extensions.

  • The KHR_surface  exposes a VkSurfaceKHR  object that represents an abstract type of surface to present rendered images to.

  • The surface in our program will be backed by the window that we’ve already opened with GLFW.

  • The KHR_surface  extension is an instance level extension, and we’ve actually already enabled it, because it’s included in the list returned by glfwGetRequiredInstanceExtensions . The list also includes some other WSI extensions that we’ll use in the next couple of chapters.

  • The window surface needs to be created right after  the instance creation, because it can actually influence the physical device selection.

  • It should also be noted that window surfaces are an entirely optional component in Vulkan if you just need off-screen rendering.

    • Vulkan allows you to do that without hacks like creating an invisible window (necessary for OpenGL).

  • Vulkan also allows you to render headlessly on a non-presenting GPU, render remotely over the internet, or run compute acceleration for AI without a render or presentation target.

  • Although the VkSurfaceKHR  object and its usage is platform-agnostic, its creation isn’t because it depends on window system details. For example, it needs the HWND  and HMODULE  handles on Windows. Therefore, there is a platform-specific addition to the extension, which on Windows is called KHR_win32_surface  and is also automatically included in the list from glfwGetRequiredInstanceExtensions .

  • GLFW actually has glfwCreateWindowSurface  that handles the platform differences for us.

Blocking the thread
  • Difficulties due to GLFW .

  • A callback glfw.SetWindowRefreshCallback  allows the swapchain to be recreated while resizing.

    • See [[#Swapchain Recreation]].

Physical Device / Logical Device

Physical Device
  • VkPhysicalDevice

  • A GPU. Used to query physical GPU details, like features, capabilities, memory size, etc.

Device Level Extensions
Queue Families
  • Most operations performed with Vulkan, like draw commands and memory operations, are asynchronously executed by submitting them to a VkQueue .

  • Queues are allocated from queue families, where each queue family supports a specific set of operations in its queues.

    • For example, there could be separate queue families for graphics, compute and memory transfer operations.

  • The availability of queue families could also be used as a distinguishing factor in physical device selection.

    • It is possible for a device with Vulkan support to not offer any graphics functionality; however, all graphics cards with Vulkan support today will generally support all queue operations that we’re interested in.

  • We need to check which queue families are supported by the device and which one of these supports the commands that we want to use.

Presentation support
  • Although the Vulkan implementation may support window system integration, that does not mean that every device in the system supports it. Therefore, we need to extend createLogicalDevice  to ensure that a device can present images to the surface we created.

  • Since the presentation is a queue-specific feature, the problem is actually about finding a queue family that supports presenting to the surface we created.

  • It’s actually possible that the queue families supporting drawing  commands and the queue families supporting presentation  do not  overlap.

    • It’s very likely that these end up being the same queue family after all, but throughout the program we will treat them as if they were separate queues for a uniform approach.

    • Nevertheless, you could add logic to explicitly prefer a physical device that supports drawing and presentation in the same queue for improved  performance.

  • Therefore, we have to take into account that there could be a distinct presentation queue.

  • We’ll look for a queue family that has the capability of presenting to our window surface. The function to check for that is vkGetPhysicalDeviceSurfaceSupportKHR , which takes the physical device, queue family index and surface as parameters.

  • It should be noted that the availability of a presentation queue, as we checked in the previous chapter, implies that the Swapchain extension must be supported. However, the extension does have to be explicitly  enabled.

  • Not all graphics cards are capable of presenting images directly to a screen for various reasons, for example, because they are designed for servers and don’t have any display outputs. Secondly, since image presentation is heavily tied into the window system and the surfaces associated with windows, it is not part of the Vulkan core. You have to enable the KHR_swapchain  device extension after querying for its support.

Surface Capabilities
  • The extents can change when resizing, so you should re-query the surface properties each time. Note that if the current extent is {UINT32_MAX, UINT32_MAX}  (happens on some platforms), the surface lets the application pick the size, and you need to ask the windowing system for an appropriate one (with GLFW, glfwGetFramebufferSize  returns the window size in pixels).

Logical Device
  • VkDevice

  • The “logical” GPU context that you actually execute things on.

  • Where you describe more specifically which VkPhysicalDeviceFeatures you will be using, like multi viewport rendering and 64-bit floats.

  • You also need to specify which queue families you would like to use.

Queues
  • Queues .

  • VkQueue

    • Execution “port” for commands.

    • GPUs will have a set of queues with different properties.

      • Some allow only graphics commands, others only allow memory commands, etc.

    • Command buffers are executed by submitting them into a queue, which will copy the rendering commands onto the GPU for execution.

  • The queues are automatically created along with the logical device, but we don’t have a handle to interface with them yet.

  • Device queues are implicitly cleaned up when the device is destroyed.

  • We can use the vkGetDeviceQueue  function to retrieve queue handles for each queue family. The parameters are the logical device, queue family, queue index and a pointer to the variable to store the queue handle in. Because we’re only creating a single queue from this family, we’ll simply use index 0 .

  • Vulkan Guide:

    • It is common to see engines using 3 queue families:

      • One for drawing the frame, another for async compute, and another for data transfer.

    • In this tutorial, we use a single queue that will run all our commands for simplicity.

Multi-queue
  • .

  • Some hardware only has one queue.

Render Loop

  • Now that everything is ready for rendering, you first ask the VkSwapchainKHR  for an image to render to. Then you allocate a VkCommandBuffer  from a VkCommandPool  or reuse an already allocated command buffer that has finished execution, and “start” the command buffer, which allows you to write commands into it.

  • Next, you begin rendering by using Dynamic Rendering.

  • Then create a loop where you bind a VkPipeline , bind some VkDescriptorSet  resources (for the shader parameters), bind the vertex buffers, and then execute a draw call.

  • If there is nothing more to render, you end the VkCommandBuffer . Finally, you submit the command buffer into the queue for rendering. This begins execution of the commands in the command buffer on the GPU. If you want to display the result, you “present” the rendered image to the screen. Because execution may not have finished yet, you use a semaphore to make the presentation wait until rendering is finished.

  • At a high level, rendering a frame in Vulkan consists of a common set of steps:

    • Wait for the previous frame to finish

    • Acquire an image from the Swapchain

    • Record a command buffer which draws the scene onto that image

      • Re-recording the command buffer every frame has a negligible performance cost.

    • Submit the recorded command buffer

      • This is the step that actually costs performance.

    • Present the Swapchain image

      • Puts it up on the screen.
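The steps above can be traced with a minimal sketch; renderOneFrame and the step strings are purely illustrative stand-ins, with the real Vulkan calls named in the comments:

```cpp
#include <string>
#include <vector>

// Illustrative stand-in: returns the ordered steps of one frame. The comments
// name the real Vulkan calls each step corresponds to.
std::vector<std::string> renderOneFrame() {
    std::vector<std::string> trace;
    trace.push_back("wait for previous frame fence");  // vkWaitForFences
    trace.push_back("acquire swapchain image");        // vkAcquireNextImageKHR
    trace.push_back("record command buffer");          // vkBeginCommandBuffer ... vkEndCommandBuffer
    trace.push_back("submit command buffer");          // vkQueueSubmit
    trace.push_back("present swapchain image");        // vkQueuePresentKHR
    return trace;
}
```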

Swapchain

  • Vulkan does not have the concept of a "default framebuffer," hence it requires an infrastructure that will own the buffers we will render to before we visualize them on the screen.

  • This infrastructure is known as the swapchain  and must be created explicitly in Vulkan.

  • The Swapchain is essentially a queue of images that are waiting to be presented to the screen.

  • Our application will acquire such an image to draw to it, and then return it to the queue.

  • The conditions for presenting an image from the queue depend on how the Swapchain is set up.

  • The general purpose of the Swapchain is to synchronize the presentation of images with the refresh rate of the screen.

    • This is important to make sure that only complete images are shown.

  • Every time we want to draw a frame, we have to ask the Swapchain to provide us with an image to render to. When we’ve finished drawing a frame, the image is returned to the Swapchain for it to be presented to the screen at some point.

  • "Is a collection of render targets".

    • Render Targets is not a well-defined term.

  • The number of render targets and conditions for presenting finished images to the screen depends on the present mode.

  • VkSwapchainKHR

    • Holds the images for the screen.

    • It allows you to render things into a visible window.

    • The KHR  suffix shows that it comes from an extension, which in this case is KHR_swapchain .

  • Swapchains .

    • Good video.

    • Pre-rotate on mobile.

    • When to recreate, recreation problems, recreation strategies, maintenance.

    • Present modes.

  • Support :

    • There are basically three kinds of properties we need to check:

      • Basic surface capabilities (min/max number of images in Swapchain, min/max width and height of images)

      • Surface formats (pixel format, color space)

      • Available presentation modes

    • It is important that we only try to query for Swapchain support after verifying that the extension is available.

Swapchain Creation
  • VkSwapchainCreateInfoKHR .

    • surface

      • Is the surface onto which the swapchain will present images. If the creation succeeds, the swapchain becomes associated with surface .

    • minImageCount

      • We also have to decide how many images we would like to have in the Swapchain. Simply sticking to the minimum means that we may sometimes have to wait on the driver to complete internal operations before we can acquire another image to render to. Therefore, it is recommended to request at least one more image than the minimum:

      uint32_t imageCount = surfaceCapabilities.minImageCount + 1;
      
      • We should also make sure to not exceed the maximum number of images while doing this, where 0  is a special value that means that there is no  maximum

      if (surfaceCapabilities.maxImageCount > 0 && imageCount > surfaceCapabilities.maxImageCount) {
          imageCount = surfaceCapabilities.maxImageCount;
      }
      
    • imageFormat

      • For the color space we’ll use SRGB if it is available, because it results in more accurate perceived colors . It is also pretty much the standard color space for images, like the textures we’ll use later on.

      • Because of that we should also use an SRGB color format, of which one of the most common ones is FORMAT_B8G8R8A8_SRGB .
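A format chooser along these lines is common. The SurfaceFormat struct and the constants below are simplified stand-ins for VkSurfaceFormatKHR and the vulkan_core.h enum values; in real code the candidate list comes from vkGetPhysicalDeviceSurfaceFormatsKHR:

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for VkSurfaceFormatKHR.
struct SurfaceFormat {
    uint32_t format;      // stand-in for VkFormat
    uint32_t colorSpace;  // stand-in for VkColorSpaceKHR
};

constexpr uint32_t FORMAT_B8G8R8A8_SRGB       = 50;  // mirrors VK_FORMAT_B8G8R8A8_SRGB
constexpr uint32_t COLOR_SPACE_SRGB_NONLINEAR = 0;   // mirrors VK_COLOR_SPACE_SRGB_NONLINEAR_KHR

// Prefer B8G8R8A8_SRGB with the sRGB non-linear color space; otherwise fall
// back to the first format the surface offers, which is always a valid choice.
SurfaceFormat chooseSurfaceFormat(const std::vector<SurfaceFormat>& available) {
    for (const SurfaceFormat& f : available) {
        if (f.format == FORMAT_B8G8R8A8_SRGB &&
            f.colorSpace == COLOR_SPACE_SRGB_NONLINEAR) {
            return f;
        }
    }
    return available.front();
}
```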

    • imageColorSpace

      • Is a VkColorSpaceKHR  value specifying the way the swapchain interprets image data.

    • imageExtent

      • Is the size (in pixels) of the swapchain image(s).

      • The swap extent is the resolution  of the Swapchain images. It’s almost always exactly equal to the resolution of the window that we’re drawing to in pixels .

      • The range of the possible resolutions is defined in the VkSurfaceCapabilitiesKHR  structure.

      • On some platforms, it is normal that maxImageExtent   may  become (0, 0) , for example when the window is minimized. In such a case, it is not possible to create a swapchain due to the Valid Usage requirements , unless scaling is selected through VkSwapchainPresentScalingCreateInfoKHR , if supported .

      • We’ll pick the resolution that best matches the window within the minImageExtent  and maxImageExtent  bounds. But we must specify the resolution in the correct unit.

      • GLFW uses two units when measuring sizes: pixels and screen coordinates . For example, the resolution {WIDTH, HEIGHT}  that we specified earlier when creating the window is measured in screen coordinates. But Vulkan works with pixels, so the Swapchain extent must be specified in pixels as well.

      • Unfortunately, if you are using a high DPI display (like Apple’s Retina display), screen coordinates don’t correspond to pixels. Instead, due to the higher pixel density, the resolution of the window in pixels will be larger than the resolution in screen coordinates. So if Vulkan doesn’t fix the swap extent for us, we can’t just use the original {WIDTH, HEIGHT} . Instead, we must use glfwGetFramebufferSize  to query the resolution of the window in pixels before matching it against the minimum and maximum image extent.

      • The surface capabilities change every time the window resizes, and they are only used for creating the Swapchain, so it doesn't make sense to cache them.
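The extent selection described above reduces to a clamp. chooseSwapExtent is a hypothetical helper over a stand-in Extent2D struct; real code reads these bounds from VkSurfaceCapabilitiesKHR and gets the framebuffer size in pixels from glfwGetFramebufferSize:

```cpp
#include <algorithm>
#include <cstdint>

// Stand-in for VkExtent2D.
struct Extent2D { uint32_t width, height; };

// If currentExtent is the special value {UINT32_MAX, UINT32_MAX}, the surface
// lets the application pick; we clamp the framebuffer size (in pixels) into
// the [minImageExtent, maxImageExtent] range. Otherwise the surface dictates
// the extent and we must use it as-is.
Extent2D chooseSwapExtent(Extent2D currentExtent, Extent2D minExtent,
                          Extent2D maxExtent, Extent2D framebufferSize) {
    if (currentExtent.width != UINT32_MAX) {
        return currentExtent;  // fixed by the surface
    }
    return Extent2D{
        std::clamp(framebufferSize.width,  minExtent.width,  maxExtent.width),
        std::clamp(framebufferSize.height, minExtent.height, maxExtent.height),
    };
}
```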

    • imageUsage

    • imageSharingMode  (Handling multiple queues):

      • We need to specify how to handle Swapchain images that will be used across multiple queue families. That will be the case in our application if the graphics queue family is different from the presentation queue. We’ll be drawing on the images in the Swapchain from the graphics queue and then submitting them on the presentation queue. There are two ways to handle images that are accessed from multiple queues:

        • SHARING_MODE_EXCLUSIVE :

          • An image is owned by one queue family at a time, and ownership must be explicitly transferred before using it in another queue family.

          • This option offers the best  performance.

        • SHARING_MODE_CONCURRENT :

          • Images can be used across multiple queue families without explicit ownership transfers.

          • Concurrent mode requires you to specify in advance between which queue families ownership will be shared using the queueFamilyIndexCount  and pQueueFamilyIndices  parameters.

      • If the queue families differ, then we’ll be using concurrent mode in this tutorial to avoid having to perform ownership transfers, because these involve some concepts that are better explained at a later time.

      • If the graphics queue family and presentation queue family are the same, which will be the case on most hardware, then we should stick to exclusive mode. Concurrent mode requires you to specify at least two distinct queue families.
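The rule above can be sketched as a small decision helper; SharingInfo and the constants are stand-ins for the VkSharingMode enum and the queueFamilyIndexCount / pQueueFamilyIndices fields of VkSwapchainCreateInfoKHR:

```cpp
#include <cstdint>
#include <vector>

// Mirrors VK_SHARING_MODE_* values.
constexpr uint32_t SHARING_MODE_EXCLUSIVE  = 0;
constexpr uint32_t SHARING_MODE_CONCURRENT = 1;

// Stand-in for the sharing-related fields of VkSwapchainCreateInfoKHR; the
// index list feeds queueFamilyIndexCount / pQueueFamilyIndices.
struct SharingInfo {
    uint32_t imageSharingMode;
    std::vector<uint32_t> queueFamilyIndices;
};

// Concurrent mode only when the graphics and present families differ (it
// requires at least two distinct families); otherwise exclusive mode, which
// needs no index list.
SharingInfo chooseSharingMode(uint32_t graphicsFamily, uint32_t presentFamily) {
    if (graphicsFamily != presentFamily) {
        return {SHARING_MODE_CONCURRENT, {graphicsFamily, presentFamily}};
    }
    return {SHARING_MODE_EXCLUSIVE, {}};
}
```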

    • queueFamilyIndexCount

      • Is the number of queue families having access to the image(s) of the swapchain when imageSharingMode  is SHARING_MODE_CONCURRENT .

    • pQueueFamilyIndices

      • Is a pointer to an array of queue family indices having access to the images(s) of the swapchain when imageSharingMode  is SHARING_MODE_CONCURRENT .

    • imageArrayLayers

      • Is the number of views in a multiview/stereo surface. For non-stereoscopic-3D applications, this value is 1.

    • presentMode

    • preTransform

      • We can specify that a certain transform should be applied to images in the Swapchain if it is supported ( supportedTransforms  in capabilities ), like a 90-degree clockwise rotation or horizontal flip. To specify that you do not want any transformation, simply specify the current transformation.

      • IDENTITY

        • This would not  be optimal on devices that support rotation and will lead to measurable performance loss.

        • It is strongly recommended that surface_properties.currentTransform  be used instead. However, the application is required to handle preTransform  elsewhere accordingly.

    • compositeAlpha

      • Specifies if the alpha channel should be used for blending with other windows in the window system.

      • You’ll almost always want to simply ignore the alpha channel, hence OPAQUE .

    • clipped

      • If set to TRUE , then that means that we don’t care about the color of pixels that are obscured, for example, because another window is in front of them.

      • Unless you really need to be able to read these pixels back and get predictable results, you’ll get the best performance by enabling clipping.

    • oldSwapchain

      • Can be an existing non-retired  swapchain currently associated with surface , or NULL_HANDLE .

      • If the oldSwapchain  is NULL_HANDLE :

        1. And if the native window referred to by pCreateInfo->surface  is already associated with a Vulkan swapchain, ERROR_NATIVE_WINDOW_IN_USE   must  be returned.

      • If the oldSwapchain  is valid:

        1. This may  aid in the resource reuse, and also allows the application to still present any images that are already acquired from it.

        2. And the oldSwapchain  has exclusive full-screen access, that access is released from pCreateInfo->oldSwapchain . If the command succeeds in this case, the newly created swapchain will automatically acquire exclusive full-screen access from pCreateInfo->oldSwapchain .

        3. And there are outstanding calls to vkWaitForPresent2KHR , then vkCreateSwapchainKHR   may  block until those calls complete.

        4. Any images from oldSwapchain  that are not acquired by the application may  be freed by the implementation, upon calling vkCreateSwapchainKHR , which may  occur even if creation of the new swapchain fails.

        5. The oldSwapchain  will be retired upon calling vkCreateSwapchainKHR , even if creation of the new swapchain fails.

          • After oldSwapchain  is retired, the application can  pass to vkQueuePresentKHR  any images it had already acquired from oldSwapchain .

            • An application may present an image from the old swapchain before an image from the new swapchain is ready to be presented.

            • As usual, vkQueuePresentKHR   may  fail if oldSwapchain  has entered a state that causes ERROR_OUT_OF_DATE  to be returned.

        6. The application can  continue to use a shared presentable image obtained from oldSwapchain  until a presentable image is acquired from the new swapchain, as long as it has not entered a state that causes it to return ERROR_OUT_OF_DATE .

        7. The application can  destroy oldSwapchain  to free all memory associated with oldSwapchain .

      • Regardless if the oldSwapchain  is valid or not:

        1. The new swapchain is created in the non-retired  state.

    • flags

      • Is a bitmask of VkSwapchainCreateFlagBitsKHR  indicating parameters of the swapchain creation.

      • SWAPCHAIN_CREATE_DEFERRED_MEMORY_ALLOCATION_EXT

        • When EXT_swapchain_maintenance1  is available, you can optionally amortize the cost of swapchain image allocations over multiple frames.

        • When this is used, image views cannot be created until the first time the image is acquired.

          • The idea is that normally the images and image views are acquired when a Swapchain recreation happens, but if this flag is enabled, they must instead be created after vkAcquireNextImageKHR  returns SUCCESS  or SUBOPTIMAL_KHR  for that image.

Present Modes
  • Common present modes are double buffering (vsync) and triple buffering.

  • The presentation mode is arguably the most important setting for the Swapchain, because it represents the actual conditions for showing images to the screen. There are four possible modes available in Vulkan:

    • PRESENT_MODE_IMMEDIATE_KHR

      • Images submitted by your application are transferred to the screen right away, which may result in tearing.

    • PRESENT_MODE_FIFO_KHR

      • The Swapchain is a queue where the display takes an image from the front of the queue when the display is refreshed, and the program inserts rendered images at the back of the queue. If the queue is full, then the program has to wait. This is most similar to vertical sync as found in modern games. The moment that the display is refreshed is known as "vertical blank".

    • PRESENT_MODE_FIFO_RELAXED_KHR

      • This mode only differs from the previous one if the application is late and the queue was empty at the last vertical blank. Instead of waiting for the next vertical blank, the image is transferred right away when it finally arrives. This may result in visible tearing.

    • PRESENT_MODE_MAILBOX_KHR

      • This is another variation of the second mode. Instead of blocking the application when the queue is full, the images that are already queued are simply replaced with the newer ones. This mode can be used to render frames as fast as possible while still avoiding tearing, resulting in fewer latency issues than standard vertical sync. This is commonly known as "triple buffering," although the existence of three buffers alone does not necessarily mean that the framerate is unlocked.

  • Only the PRESENT_MODE_FIFO_KHR  mode is guaranteed to be available, so we’ll again have to write a function that looks for the best mode that is available:
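A sketch of such a function, using stand-in constants in place of the VkPresentModeKHR enum (the candidate list would come from vkGetPhysicalDeviceSurfacePresentModesKHR):

```cpp
#include <cstdint>
#include <vector>

// Stand-ins mirroring VkPresentModeKHR values.
constexpr uint32_t PRESENT_MODE_IMMEDIATE = 0;
constexpr uint32_t PRESENT_MODE_MAILBOX   = 1;
constexpr uint32_t PRESENT_MODE_FIFO      = 2;

// Prefer MAILBOX when the surface offers it; FIFO is the only mode the spec
// guarantees to be available, so it is the fallback.
uint32_t choosePresentMode(const std::vector<uint32_t>& available) {
    for (uint32_t mode : available) {
        if (mode == PRESENT_MODE_MAILBOX) {
            return mode;
        }
    }
    return PRESENT_MODE_FIFO;
}
```

On mobile you would likely skip the MAILBOX preference entirely and return FIFO, per the energy-usage note below.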

  • .

  • Options :

    • I think that PRESENT_MODE_MAILBOX_KHR  is a very nice trade-off if energy usage is not a concern. It allows us to avoid tearing while still maintaining fairly low latency by rendering new images that are as up to date as possible right until the vertical blank.

    • On mobile devices, where energy usage is more important, you will probably want to use PRESENT_MODE_FIFO_KHR  instead.

    • .

    • .

      • Slide from the Samsung talk on (2025-02-25).

      • It recommends FIFO and says that mailbox is not as good as it seems because it induces a lot of stutter.

Drawing directly to the Swapchain vs Blitting to the Swapchain
  • Source .

  • Drawing directly into the swapchain :

    • Is fine for many projects, and it can even be optimal in some cases such as phones.

    • Restrictions :

      • Their resolution is fixed to whatever your window size is.

        • If you want to have higher or lower resolution, and then do some scaling logic, you need to draw into a different image.

        • Swapchain image size (imageExtent / surface extent) is part of swapchain creation and is tied to the surface. If you want an internal render at a different resolution (supersampling, dynamic resolution, lower-res upscaling), you create an offscreen image/render-target at the desired size and then copy/blit/resolve/tone-map into the swapchain image for presentation. The spec and WSI notes treat imageExtent as the surface-presentable size.

      • The formats of the image used in the swapchain are not guaranteed.

        • Different OS, drivers, and windowing modes can have different optimal swapchain formats.

        • The WSI model exposes the surface’s supported formats to the application via vkGetPhysicalDeviceSurfaceFormatsKHR  (or equivalent WSI queries); the returned list is implementation- and surface-dependent, so you must choose from what the platform/driver exposes. That means formats available for swapchains vary by OS, driver, and surface.

        • Vulkan explicitly states this via VkSurfaceFormatKHR  and vkGetPhysicalDeviceSurfaceFormatsKHR . The specification (Section 30.5 "WSI Swapchain", Vulkan 1.3.275) and tutorials emphasize that the application must query and choose from available formats supported by the surface/device combination. Android documentation (Vulkan on Android) and Windows (DXGI_FORMAT) similarly highlight platform-specific format requirements and HDR needs (e.g., FORMAT_A2B10G10R10_UNORM_PACK32  or DXGI_FORMAT_R10G10B10A2_UNORM  for HDR10). This variability makes direct rendering inflexible.

      • HDR support needs its own very specific formats.

        • HDR output requires specific color formats and color-space metadata (examples: 10-bit packed UNORM formats or explicit HDR color-space support such as ST2084/Perceptual Quantizer). WSI and sample repos treat HDR as a distinct case (e.g. A2B10G10 formats and HDR color spaces). Support is platform- and driver-dependent.

        • HDR Sample discussion .

      • Swapchain formats are, for the most part, low precision.

        • Some platforms with High Dynamic Range rendering have higher precision formats, but you will often default to 8 bits per color.

        • So if you want high precision light calculations, systems that would prevent banding, or to be able to go past 1.0 on the normalized color range, you will need a separate image for drawing.

          • HDR/high-dynamic-range lighting typically uses floating-point or extended-range render targets (e.g. R16G16B16A16_SFLOAT  or higher) for intermediate lighting accumulation; final tonemapping reduces values into the presentable format. Because presentable swapchain images are often limited (8-bit), the offscreen high-precision image plus a conversion/tonemap pass is the usual pattern.

        • Many surfaces expose 8-bit UNORM or sRGB formats (e.g. B8G8R8A8_UNORM / SRGB ) as commonly returned swapchain formats. Higher-precision formats (16-bit float per channel or 10-bit packed) exist and are used for HDR/high-precision pipelines, but they are not guaranteed by every surface/driver. Therefore applications that need high-precision lighting/accumulation commonly render into a 16-bit-float render target and tonemap/convert for presentation.

        • Banding artifacts in gradients or low-light scenes are a well-known consequence of limited precision. High-precision rendering (HDR, complex lighting, deferred shading G-Buffers) requires formats like FORMAT_R16G16B16A16_SFLOAT  (RGBA16F) to store values outside the [0.0, 1.0] range and prevent banding. While some  swapchains can  support HDR formats (e.g., 10:10:10:2), they are less universally available and not the default. Using RGBA16F directly in a swapchain is often unsupported or inefficient for presentation.

  • Drawing to a different image and copying/blitting to the swapchain image :

    • Advantages :

      • Decouples tonemapping from presentation timing

        • Tonemap into an intermediate LDR image that you control. You can finish the tonemap pass earlier and defer the actual transfer/present of the swapchain image to a later point, reducing risk of stalling the present path or blocking on swapchain ownership.

      • Avoids writing directly to the swapchain

        • Writing directly into the swapchain can introduce stalls (wait-for-acquire or present-time synchronization). Using an intermediate LDR image lets you do the heavy work off-swapchain and only do a cheap transfer/present step when convenient.

      • Enables batching / chaining of postprocesses without touching the swapchain

        • If you need further LDR processing (dithering, temporal AA, UI composite, overlays, readback for screenshots, or additional filters), do those against the intermediate image. This allows composing multiple passes without repeatedly transitioning the swapchain.

      • Easier support for multiple outputs or different sizes/formats

        • You can tonemap once to an LDR image and then blit/copy to different-size or different-format targets (screenshots, streaming encoder, secondary displays) without re-running tonemap.

      • Allows use of transient/optimized memory for the intermediate

        • The intermediate image can be created as transient (e.g., MEMORY_PROPERTY_LAZILY_ALLOCATED  or tiled transient attachment) to reduce memory pressure and bandwidth compared with always keeping a full persistent LDR buffer.

      • Better control over final conversion semantics

        • In shader you control quantization, gamma conversion, ordered/temporal dithering, and color-space tagging. After producing the controlled LDR image you can choose the transfer method (exact copy vs scaled blit) that matches target capabilities, improving visual consistency across vendors.

      • Improved cross-queue / async workflows

        • You can produce the LDR image on a graphics/compute queue and then perform a transfer on a transfer-only queue (or use a dedicated present queue) with explicit ownership transfers, possibly improving throughput if hardware supports it.

      • Facilitates deterministic screenshots / capture

        • Saving an intermediate LDR image for file export is safer (format/bit-depth known) than capturing the swapchain which may have platform-specific transforms applied.

    • Trade-offs :

      • Extra GPU memory usage

        • You need memory for the intermediate LDR image (unless you use transient attachments), which increases resident memory footprint.

      • Extra GPU bandwidth and a copy step

        • Creating an LDR image then copying/blitting to the swapchain costs memory bandwidth and GPU cycles. This can increase frame time if the transfer is on the critical path.

      • More layout transitions and synchronization complexity

        • You must manage transitions and possibly ownership transfers (if different queues are used). Incorrect synchronization can cause stalls or correctness bugs.

      • Potential increased latency if done poorly

        • If the copy/blit is done synchronously right before present, it can add latency compared with rendering directly to the swapchain; the intended decoupling only helps if scheduling is arranged to avoid the critical path.

      • Implementation complexity

        • Managing an extra render target, transient allocation, and copy logic is more code than rendering directly to the swapchain.

Swapchain Recreation

When to recreate
  • If the window surface changed such that the Swapchain is no longer compatible with it.

  • If the window resizes.

  • If the window minimizes.

    • This case is special because it will result in a framebuffer size of 0 .

    • We can handle by waiting for the framebuffer size to be back to something greater than 0 , indicating that the window is no longer minimized.
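The minimized-window wait can be sketched as a loop; getFramebufferSize and waitEvents are hypothetical stand-ins for glfwGetFramebufferSize and glfwWaitEvents:

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Blocks swapchain recreation while the window is minimized (framebuffer size
// 0x0) and returns the first non-zero size. getFramebufferSize stands in for
// glfwGetFramebufferSize, waitEvents for glfwWaitEvents.
std::pair<uint32_t, uint32_t> waitWhileMinimized(
        const std::function<std::pair<uint32_t, uint32_t>()>& getFramebufferSize,
        const std::function<void()>& waitEvents) {
    std::pair<uint32_t, uint32_t> size = getFramebufferSize();
    while (size.first == 0 || size.second == 0) {
        waitEvents();  // sleep until the windowing system reports a change
        size = getFramebufferSize();
    }
    return size;
}
```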

  • If the swapchain image format changed during an application's lifetime, for example, when moving a window from a standard range to a high dynamic range monitor.

Finding out that a recreation is needed
  • The vkAcquireNextImageKHR  and vkQueuePresentKHR  functions can return the following special values to indicate this.

    • ERROR_OUT_OF_DATE_KHR

      • The Swapchain has become incompatible with the surface and can no longer be used for rendering. Usually happens after a window resize.

    • SUBOPTIMAL_KHR

      • The Swapchain can still be used to successfully present to the surface, but the surface properties are no longer matched exactly.

      • You should ALWAYS  recreate the swapchain if the result is suboptimal.

      • This result means that it's a "success" but there will be performance penalties.

      • Both SUCCESS  and SUBOPTIMAL_KHR  are considered "success" return codes.

  • If the Swapchain turns out to be out of date when attempting to acquire an image, then it is no longer possible to present to it. Therefore, we should immediately recreate the Swapchain and try again in the next drawFrame  call.

  • You could also decide to do that if the Swapchain is suboptimal, but I’ve chosen to proceed anyway in that case because we’ve already acquired an image.

result = presentQueue.presentKHR( presentInfoKHR );
if (result == vk::Result::eErrorOutOfDateKHR || result == vk::Result::eSuboptimalKHR || framebufferResized) {
    framebufferResized = false;
    recreateSwapChain();
} else if (result != vk::Result::eSuccess) {
    throw std::runtime_error("failed to present Swapchain image!");
}

currentFrame = (currentFrame + 1) % MAX_FRAMES_IN_FLIGHT;
  • The vkQueuePresentKHR  function returns the same values with the same meaning. In this case, we will also recreate the Swapchain if it is suboptimal, because we want the best possible result.

  • Finding out explicitly :

    • Although many drivers and platforms trigger ERROR_OUT_OF_DATE_KHR  automatically after a window resize, it is not guaranteed to happen.

    • That’s why we’ll add some extra  code to also handle resizes explicitly:

      glfw.SetWindowUserPointer(vulkan_context.glfw_window, vulkan_context)
      glfw.SetFramebufferSizeCallback(vulkan_context.glfw_window, proc "c" (window: glfw.WindowHandle, _, _: i32) {
          vulkan_context := cast(^Vulkan_Context)glfw.GetWindowUserPointer(window)
          vulkan_context.glfw_framebuffer_resized = true
      })
      
    • "Usually it's not the best idea to depend on this".

      • Problems with multithreading.

      • You depend on the windowing system to notify changes correctly; this can be really tricky on mobile.

Recreating
void recreateSwapChain() {
    device.waitIdle();

    cleanupSwapChain();

    createSwapChain();
    createImageViews();
}
  • Synchronization :

    1. ~Flush and Recreate:

      • "We first call vkDeviceWaitIdle , because just like in the last chapter, we shouldn’t touch resources that may still be in use."

        • This is not enough.

        • .

      • The whole app has to stop and wait for synchronization.

      • .

      • .

    2. Recreate and check:

      • .

      • You do not  need to stop your rendering at any given point.

      • The reason why you are allowed to pass the old swapchain when recreating the new swapchain, is due to this strategy.

      • This is the recommendation.

      • Strategy .

        • This issue is resolved by deferring the destruction of the old swapchain and its remaining present semaphores to the time when the semaphore corresponding to the first present of the new swapchain can be destroyed. Because once the first present semaphore of the new swapchain can be destroyed, the first present operation of the new swapchain is done, which means the old swapchain is no longer being presented.

        • The destruction of both old swapchains must now be deferred to when the first QP of the new swapchain has been processed. If an application resizes the window constantly and at a high rate, we would keep accumulating old swapchains and not free them until it stops.

          • This potentially accumulates a lot of memory, I think.

        • So what's the correct moment then? Only after the new swapchain has completed one full cycle of presentations, that is, when I acquire image index 0  for the second  time.

      • Analysis :

        • (2025-08-19)

        • Holy, now I understand the problem.

        • I cannot delete anything from the old swapchain until I am sure that everything from the previous one has been presented. I thought that by acquiring the first image of the new swapchain, that would already indicate that it was safe to delete the old swapchain, but that's not true; by doing that, I only guarantee that 1 (ONE) image from the old swapchain has been presented, but the old swapchain may have several images in the queue.

        • However, as made clear, that is not the case.

        • Dealing with this can be a nightmare. Potentially having to handle multiple old swapchains at the same time in case of very frequent resizes (smooth swapchain).
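One way to implement the deferred destruction described above is a small retirement queue. RetiredSwapchainQueue is a hypothetical sketch: handles are plain integers standing in for VkSwapchainKHR, and the "safe frame" signal must come from the strategy above (e.g. when the first present of the new swapchain is known to be done):

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// A retired swapchain together with the frame index at which it was replaced.
struct RetiredSwapchain {
    uint64_t handle;        // stand-in for VkSwapchainKHR
    uint64_t retiredFrame;  // frame index when it was replaced
};

// Accumulates retired swapchains (frequent resizes can stack several) and
// releases every one that was retired at or before a caller-supplied safe
// frame. In real code, each released handle goes to vkDestroySwapchainKHR.
class RetiredSwapchainQueue {
public:
    void retire(uint64_t handle, uint64_t frame) {
        queue_.push_back({handle, frame});
    }
    // Collect (i.e. destroy) every swapchain retired at or before safeFrame.
    std::vector<uint64_t> collectDestroyable(uint64_t safeFrame) {
        std::vector<uint64_t> destroyed;
        while (!queue_.empty() && queue_.front().retiredFrame <= safeFrame) {
            destroyed.push_back(queue_.front().handle);  // vkDestroySwapchainKHR here
            queue_.pop_front();
        }
        return destroyed;
    }
    std::size_t pending() const { return queue_.size(); }
private:
    std::deque<RetiredSwapchain> queue_;
};
```

Without an upper bound on `pending()`, rapid continuous resizing accumulates memory, which is exactly the concern noted above.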

    3. EXT_swapchain_maintenance1 .

      • "You should always use this extension if available".

      • Support :

        • Introduced in 2023.

        • (2025-02-25)

          • Only 25% of Android devices and 20% of desktop GPUs support it.

          • It was added on Android 14.

      • Adds a collection of window system integration features that were intentionally left out or overlooked in the original KHR_swapchain  extension.

      • Features :

        • Allow applications to release previously acquired images without presenting them.

        • Allow applications to defer swapchain memory allocation for improved startup time and memory footprint.

        • Specify a fence that will be signaled when the resources associated with a present operation can  be safely destroyed.

        • Allow changing the present mode a swapchain is using at per-present granularity.

        • Allow applications to define the behavior when presenting a swapchain image to a surface with different dimensions than the image.

          • Using this feature may  allow implementations to avoid returning ERROR_OUT_OF_DATE_KHR  in this situation.

        • This extension makes vkQueuePresentKHR  more similar to vkQueueSubmit , allowing it to specify a fence that the application can wait on.

      • The problem with vkDeviceWaitIdle  or vkQueueWaitIdle :

        • Typically, applications call these functions and assume it’s safe to delete swapchain semaphores and the swapchain itself.

        • The problem is that WaitIdle  functions are defined in terms of fences - they only wait for workloads submitted through functions that accept a fence.

        • Unextended vkQueuePresent  does not provide a fence parameter.

        • Therefore, vkDeviceWaitIdle  can’t guarantee that it’s safe to delete swapchain resources.

          • The validation layers don't trigger errors in this case, but it's just because so many people use it and there's no good alternative.

          • When EXT_swapchain_maintenance1  is enabled the validation layer will report an error if the application shutdown sequence relies on vkDeviceWaitIdle  or vkQueueWaitIdle  to release swapchain resources instead of using a presentation fence.

        • The extension fixes this problem.

        • By waiting on the presentation fence, the application can safely release swapchain resources.
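
        • A sketch of how the presentation fence is attached, assuming EXT_swapchain_maintenance1  is enabled (variable names like presentFence  follow the tutorial style and are assumptions; error handling omitted):

```c
// Chain VkSwapchainPresentFenceInfoEXT into the present call
// (requires VK_EXT_swapchain_maintenance1).
VkSwapchainPresentFenceInfoEXT fenceInfo = {
    .sType = VK_STRUCTURE_TYPE_SWAPCHAIN_PRESENT_FENCE_INFO_EXT,
    .swapchainCount = 1,
    .pFences = &presentFence, // signaled when this present's resources can be freed
};

VkPresentInfoKHR presentInfo = {
    .sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR,
    .pNext = &fenceInfo,
    .waitSemaphoreCount = 1,
    .pWaitSemaphores = &renderFinishedSemaphore,
    .swapchainCount = 1,
    .pSwapchains = &swapChain,
    .pImageIndices = &imageIndex,
};
vkQueuePresentKHR(presentQueue, &presentInfo);

// Later: once presentFence is signaled, it is safe to destroy the
// semaphores (and eventually the retired swapchain) used by that present.
vkWaitForFences(device, 1, &presentFence, VK_TRUE, UINT64_MAX);
```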

    • To avoid a deadlock, only reset the fence if we are submitting work:

      • If the fence is reset right after waiting on it, but the swapchain turns out to be out of date (window resized) and we return early without submitting, nothing will ever signal the fence again, and the next wait on it deadlocks.

      • The fence is signaled by the completion of the work from vkQueueSubmit , and unsignaled by vkResetFences .

      vkWaitForFences(device, 1, &inFlightFences[currentFrame], VK_TRUE, UINT64_MAX);
      
      uint32_t imageIndex;
      VkResult result = vkAcquireNextImageKHR(device, swapChain, UINT64_MAX, imageAvailableSemaphores[currentFrame], VK_NULL_HANDLE, &imageIndex);
      
      if (result == VK_ERROR_OUT_OF_DATE_KHR) {
          recreateSwapChain();
          return;
      } else if (result != VK_SUCCESS && result != VK_SUBOPTIMAL_KHR) {
          throw std::runtime_error("failed to acquire swapchain image!");
      }
      
      // Only reset the fence if we are submitting work
      vkResetFences(device, 1, &inFlightFences[currentFrame]);
      
  • What to recreate :

    • The image views need to be recreated because they are based directly on the Swapchain images.

  • Smooth Swapchain Resizing :

    • "Don't bother with smooth swapchain resizing, it's not worth it".

    • My experience :

      • (2025-08-04)

      • A callback glfw.SetWindowRefreshCallback  allows the swapchain to be recreated while resizing.

      • Synchronization :

        • Since the swapchain is recreated all the time, it becomes difficult to manage when the old swapchain should be destroyed along with its resources.

        • At the moment I'm handling the old_swapchain in a "bad" way, and I feel that recreating it every resize frame only worsens synchronization.

          • It is not necessary to deal with the old_swapchain when using vkDeviceWaitIdle() .

      • My current implementation:

        eng.window_init(1280, 720, "Expedicao Hover", proc "c" (window: glfw.WindowHandle) {
            context = eng.global_context
            // fmt.printfln("REFRESHED")
            eng.swapchain_resize()
            game_draw(&game, game.cycle_draw.dt_cycles_s)
        })
        
Updating resources after recreating
  • Destroy every image and view created from the old swapchain (the swapchain destroys its own images).

  • Update everything that holds a reference to either of those.

    • If anything was created using the swapchain's size you also have to destroy and recreate those and update anything that references them.

    • There's no getting around it.

Frames In-Flight

Motivation
  • The render loop has one glaring flaw: unnecessary idling  of the host. We are required to wait on the previous frame to finish before we can start rendering the next.

  • To fix this we allow multiple frames to be in-flight at once, allowing the rendering of one frame to not interfere with the recording of the next.

  • This control over the number of frames in flight is another example of Vulkan being explicit.

Frame
  • There is no concept of a frame in Vulkan. This means that the way you render is entirely up to you. The only thing that matters is when you have to display the frame to the screen, which is done through a swapchain. But there is no fundamental difference between rendering and then sending the images over the network, saving them to a file, or displaying them on the screen through the swapchain.

  • This means it is possible to use Vulkan in an entirely headless mode, where nothing is displayed to the screen. You can render the images and then store them on disk (very useful for testing) or use Vulkan as a way to perform GPU calculations such as a raytracer or other compute tasks.

How many Frames In-Flight
  • We choose the number 2 because we don’t want the CPU to get too  far ahead of the GPU.

    • With two frames in flight, the CPU and the GPU can be working on their own tasks at the same time. If the CPU finishes early, it will wait till the GPU finishes rendering before submitting more work.

    • With three or more frames in flight, the CPU could get ahead of the GPU, adding frames of latency. Generally, extra latency isn’t desired.
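
  • The latency claim can be checked with a toy timing model (plain C, no Vulkan involved; all numbers illustrative): when the workload is GPU-bound, each extra frame in flight adds one full GPU frame of latency between recording a frame and its completion.

```c
#include <assert.h>

/* Toy timing model, not Vulkan: the CPU records frames instantly, the
 * GPU takes GPU_MS per frame, and the CPU may only run N frames ahead
 * (it waits on the fence of frame i - N before recording frame i). */
#define GPU_MS 10
#define MAX_SIM 64

static int steady_latency_ms(int frames_in_flight, int total_frames) {
    int submit[MAX_SIM], done[MAX_SIM];
    int gpu_free = 0;
    for (int i = 0; i < total_frames && i < MAX_SIM; i++) {
        /* CPU-side fence wait on frame i - N */
        submit[i] = (i >= frames_in_flight) ? done[i - frames_in_flight] : 0;
        /* GPU executes submissions in order */
        int start = submit[i] > gpu_free ? submit[i] : gpu_free;
        done[i]  = start + GPU_MS;
        gpu_free = done[i];
    }
    /* submit-to-completion latency of the last simulated frame */
    return done[total_frames - 1] - submit[total_frames - 1];
}
```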

One Per Frame In-Flight
  • Duplicate :

    • Resources :

      • Uniform Buffers.

        • If modified while a previous frame uses it, corruption occurs.

      • Dynamic Storage Buffers.

        • GPU-computed results (e.g., particle positions). Writing to a buffer while an older frame reads it causes hazards.

      • Color/Depth Attachments.

      • Staging Buffers

        • If updated per frame (e.g., vkMapMemory ), duplication avoids overwriting mid-transfer.

      • Compute Shader Output Buffers:

        • If frame N  writes, and frame N+1  reads, duplicate to prevent read-before-write.

        • Use ping-pong buffers (count = frames in-flight).

    • Command pool.

      • I have doubts about this; some people do it differently.

    • Command buffer.

    • 'present_finished_semaphore'.

    • 'render_finished_fence'.

  • Don't duplicate :

    • Resources :

      • Static Vertex/Index Buffers:

        • Initialized once, read-only. No per-frame updates.

      • Immutable Textures

        • Loaded once (e.g., via VkDeviceMemory ).

        • Not mapped for change.

        • It's device local.

    • Static BRDF LUTs.

      • Initialized once, read by all frames.
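
  • The split above can be sketched as a per-frame struct (plain C, illustrative names, not a real Vulkan API): everything the CPU rewrites each frame gets one copy per frame in flight, while static read-only resources exist once and are shared.

```c
#include <assert.h>

/* Sketch with illustrative names: resources the CPU rewrites every
 * frame get one copy per frame in flight; static, read-only resources
 * exist once and are shared by all frames. */
#define MAX_FRAMES_IN_FLIGHT 2

typedef struct {
    int commandBuffer;            /* stands in for VkCommandBuffer */
    int uniformBuffer;            /* rewritten by the CPU each frame */
    int presentFinishedSemaphore; /* stands in for VkSemaphore */
    int renderFinishedFence;      /* stands in for VkFence */
} FrameData;

static FrameData frames[MAX_FRAMES_IN_FLIGHT];
static int staticVertexBuffer; /* one copy: initialized once, read-only */

/* Select this frame's copies; static resources are used as-is. */
static FrameData *frame_for(unsigned frameNumber) {
    return &frames[frameNumber % MAX_FRAMES_IN_FLIGHT];
}
```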

Advancing a frame
void drawFrame() {
    ...

    currentFrame = (currentFrame + 1) % MAX_FRAMES_IN_FLIGHT;
}
  • By using the modulo ( % ) operator, we ensure that the frame index loops around after every MAX_FRAMES_IN_FLIGHT  enqueued frames.

Acquire Next Image

  • vkWaitForFences()

    • Waits on the previous frame.

    • Takes an array of fences and waits on the host for either any or all of the fences to be signaled before returning.

    • The TRUE  we pass here indicates that we want to wait for all fences, but in the case of a single one it doesn’t matter.

    • This function also has a timeout parameter that we set to the maximum value of a 64 bit unsigned integer, UINT64_MAX , which effectively disables the timeout.

  • vkAcquireNextImageKHR()

    • Acquire the index of an available image from the swapchain for rendering .

    • If an image was acquired, then it means that this image is idle  (i.e., not  currently being displayed or written to).

    • If no image is ready, the call blocks (or returns an error if non-blocking).

    • The returned image index is now " owned " by your app for rendering.

    • We only get a swapchain image index back; which image becomes available is decided by the presentation engine (the windowing system).

    • A semaphore/fence is signaled when the image is safe to use.

    • timeout

      • If the swapchain doesn’t have any image we can use, the call blocks the thread, up to the specified timeout.

      • The measurement unit is nanoseconds.

      • 1 second is fine: 1_000_000_000 .

    • semaphore

      • Semaphore to signal.

    • fence

      • Fence to signal.

      • It is possible to specify a semaphore, fence or both.

    • pImageIndex

      • Specifies a variable to output the index of the Swapchain image that has become available  to use.

      • The index refers to the VkImage  in the swapChainImages  array.

Image Layout Transitions
  • See Vulkan#Images .

  • Before we can start rendering to an image, we need to transition its layout to one that is suitable for rendering.

  • Before rendering, we transition the image layout to IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL .

// Before starting rendering, transition the swapchain image to COLOR_ATTACHMENT_OPTIMAL
transition_image_layout(
    imageIndex,
    vk::ImageLayout::eUndefined,
    vk::ImageLayout::eColorAttachmentOptimal,
    {},                                                     // srcAccessMask (no need to wait for previous operations)
    vk::AccessFlagBits2::eColorAttachmentWrite,                // dstAccessMask
    vk::PipelineStageFlagBits2::eTopOfPipe,                   // srcStage
    vk::PipelineStageFlagBits2::eColorAttachmentOutput        // dstStage
);
  • After rendering, we need to transition the image layout back to IMAGE_LAYOUT_PRESENT_SRC_KHR  so it can be presented to the screen:

// After rendering, transition the swapchain image to PRESENT_SRC
transition_image_layout(
    imageIndex,
    vk::ImageLayout::eColorAttachmentOptimal,
    vk::ImageLayout::ePresentSrcKHR,
    vk::AccessFlagBits2::eColorAttachmentWrite,                 // srcAccessMask
    {},                                                      // dstAccessMask
    vk::PipelineStageFlagBits2::eColorAttachmentOutput,        // srcStage
    vk::PipelineStageFlagBits2::eBottomOfPipe                  // dstStage
);

Render Targets

Attachments
  • Nvidia: Use storeOp = DONT_CARE  rather than UNDEFINED  layouts to skip unneeded render target writes.

  • Nvidia: Don't transition color attachments from "safe" to "unsafe" unless required by the algorithm.

Transient Resources
  • Transient attachments (or Transient Resources) are render targets (like color/depth buffers) designed to exist only temporarily during a render pass, with their contents discarded afterward. They're optimized for fast on-chip memory access and avoid unnecessary memory operations.
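
  • A hedged sketch of how such an attachment is typically created (the width / height  variables are assumptions; error handling omitted):

```c
// Transient color attachment: usable only within a render pass, backed
// by lazily allocated memory so tilers can keep it in on-chip memory.
VkImageCreateInfo imageInfo = {
    .sType       = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
    .imageType   = VK_IMAGE_TYPE_2D,
    .format      = VK_FORMAT_B8G8R8A8_UNORM,
    .extent      = { width, height, 1 },
    .mipLevels   = 1,
    .arrayLayers = 1,
    .samples     = VK_SAMPLE_COUNT_4_BIT, // e.g. an MSAA target resolved in-pass
    .usage       = VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT |
                   VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT,
};
// Back it with a memory type that has
// VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT, and use loadOp/storeOp =
// DONT_CARE (or CLEAR) so its contents never touch main memory.
```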

Render Target
  • A Render Target is not a term in Vulkan but it's a term in graphics programming.

  • It's a term for an image you render into. In Vulkan this is a VkImage  + VkImageView  used as a color/depth attachment in a render pass or as a color attachment in dynamic rendering.

  • Examples :

  • Drawing a UI :

    • The UI texture must preserve alpha in the areas you want to be transparent, for later compositing.

    1. Draw UI directly to the final render target (swapchain image, or image to blit to the swapchain image) :

      • After tonemap, enable blending and draw UI.

      • Oni:

        • For the scene, I render into an RGBA16 image, then I draw on the swapchain  with a tonemapper, then I draw the UI on the swapchain  with blending enabled.

    2. Composite in a shader :

      • Sample scene image and UI image, compute out = scene * (1 - alpha_ui) + ui * alpha_ui  (or use premultiplied alpha: out = scene + ui ).

        • Both ways work; premultiplied alpha avoids some edge artifacts if UI already uses premultiplied data.

  • Compositing :

    • Used to combine render targets, or any other images.

    1. Fragment shader :

      • Render to an image and draw a full-screen triangle/quad that samples the HDR image and outputs LDR color.

        • Could be the swapchain image if supported, or an intermediate image then blit/copy to swapchain.

      • Pros :

        • Simple and guaranteed compatible with swapchain color attachment usage.

        • Useful if you want to draw the UI while making this final composition.

          • Seems like I'm mixing responsibilities, even though I'm reducing one render pass.

      • Cons :

        • Less flexible for arbitrary per-pixel work that requires many conditionals or random write patterns.

        • Need to issue a draw call and set up graphics pipeline.

    2. Compute shader :

      • Sample HDR image(s), write the LDR pixels to an output image.

        • Could be the swapchain image if supported, or an intermediate image then blit/copy to swapchain.

      • Pros :

        • Flexible: can read multiple inputs and write arbitrary outputs (random writes, multiple passes) without needing geometry.

        • Easy to implement multi-image compositing in one dispatch (read N sampled images + write to storage image).

      • Cons :

        • On some GPUs a simple full-screen fragment pass can be faster due to fixed-function hardware for rasterization and blending.

      #version 450
      
      layout(local_size_x = 16, local_size_y = 16) in;
      layout(set=0, binding=0) uniform sampler2D gameTex;
      layout(set=0, binding=1) uniform sampler2D uiTex;
      layout(set=0, binding=2, rgba8) uniform writeonly image2D swapchainImg;
      
      void main() {
          ivec2 coord = ivec2(gl_GlobalInvocationID.xy);
          vec2 uv = vec2(coord) / textureSize(gameTex, 0);
          
          // Sample inputs
          vec3 game = texture(gameTex, uv).rgb;
          vec4 ui = texture(uiTex, uv);
          
          // Tonemap game (example: Reinhard)
          game = game / (game + vec3(1.0));
          
          // Composite: UI over game
          vec3 final = mix(game, ui.rgb, ui.a);
          
          // Write to swapchain
          imageStore(swapchainImg, coord, vec4(final, 1.0));
      }
      
      #version 450
      
      layout(local_size_x = 16, local_size_y = 16) in;
      
      layout(binding = 0) uniform sampler2D uSceneHDR;
      layout(binding = 1) uniform sampler2D uUI; // optional
      layout(binding = 2, rgba8) writeonly uniform image2D outImage; // target LDR image (could be swapchain-compatible image)
      
      vec3 reinhardTonemap(vec3 c) {
          return c / (1.0 + c);
      }
      
      vec3 toSRGB(vec3 linear) {
          return pow(linear, vec3(1.0/2.2));
      }
      
      void main() {
          ivec2 pix = ivec2(gl_GlobalInvocationID.xy);
          ivec2 size = imageSize(outImage);
          if (pix.x >= size.x || pix.y >= size.y) return;
      
          vec2 uv = (vec2(pix) + 0.5) / vec2(size);
          vec3 hdr = texture(uSceneHDR, uv).rgb;
          float exposure = 1.0;
          vec3 mapped = reinhardTonemap(hdr * exposure);
          mapped = toSRGB(mapped);
      
          // Optionally composite UI
          // vec4 ui = texture(uUI, uv);
          // vec3 outc = mix(mapped, ui.rgb, ui.a);
      
          imageStore(outImage, pix, vec4(mapped, 1.0));
      }
      
      // Dispatch: round the workgroup count up so edge pixels are covered
      vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, computePipe);
      vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, ...);
      vkCmdDispatch(cmd, (swapchain_width + 15) / 16, (swapchain_height + 15) / 16, 1);
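
      • The math in those shaders is easy to sanity-check on the CPU (plain C mirrors of the GLSL above; no GPU involved):

```c
#include <assert.h>

/* CPU mirrors of the math in the shaders above (plain C, no GPU).
 * reinhard maps [0, inf) into [0, 1), so any HDR value fits an 8-bit
 * target; over_blend mirrors the mix(game, ui.rgb, ui.a) compositing. */
static float reinhard(float c) {
    return c / (1.0f + c);
}

static float over_blend(float dst, float src, float a) {
    return dst * (1.0f - a) + src * a;
}
```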
      

Dynamic Rendering

  • Support :

  • VkRenderingAttachmentInfo

    • Structure specifying attachment information

    • imageView

      • Is the image view that will be used for rendering.

    • imageLayout

      • Is the layout that imageView  will be in during rendering.

    • resolveMode

      • Is a VkResolveModeFlagBits  value defining how data written to imageView  will be resolved into resolveImageView .

    • resolveImageView

      • Is an image view used to write resolved data at the end of rendering.

    • resolveImageLayout

      • Is the layout that resolveImageView  will be in during rendering.

    • loadOp

      • Specifies what to do with the image before rendering.

      • Is a VkAttachmentLoadOp  value defining the load operation  for the attachment.

      • We’re using ATTACHMENT_LOAD_OP_CLEAR  to clear the image to black before rendering.

    • storeOp

      • Specifies what to do with the image after rendering.

      • Is a VkAttachmentStoreOp  value defining the store operation  for the attachment.

      • We're using ATTACHMENT_STORE_OP_STORE  to store the rendered image for later use.

    • clearValue

      • Is a VkClearValue  structure defining values used to clear imageView  when loadOp  is ATTACHMENT_LOAD_OP_CLEAR .

  • VkRenderingInfo

    • Structure specifying render pass instance begin info.

    • Specifies the attachments to render to and the render area.

    • Combines the RenderingAttachmentInfo  with other rendering parameters.

    • flags

    • renderArea

      • Is the render area that is affected by the render pass instance.

      • Extent Requirements :

        • The rendering_info.renderArea.extent  has to fit inside the rendering_attachment.imageView  and hence the image.

      • If there is an instance of VkDeviceGroupRenderPassBeginInfo  included in the pNext  chain and its deviceRenderAreaCount  member is not 0 , then renderArea  is ignored, and the render area is defined per-device by that structure.

      • CharlesG - LunarG:

        • Viewports & scissors let you specify a size smaller than the full image, as well as redefining the origin & scale to use. Whereas the renderArea is specifying the actual image dimensions to use. This allows flexibility in how the backing VkImage is used in contrast to the viewport/scissor needs of the rendering itself. In most cases they are going to be “full” so its not like it comes into play always

        • More clarity: viewport & scissor are inputs to the rasterization stage, while the render area is an input for the attachment read/write.

      • Caio:

        • So, when comparing these two cases:

          • 1- I use a 1080p image for the renderArea  and a 640p  viewport and center the offset

          • 2- I use a 640p image for the renderArea  and a 640p  viewport and center the offset

        • Is there a difference between the quality and performance of these two? Or even, is there a visual difference?

      • CharlesG - LunarG:

        • I don't know tbh.

    • colorAttachmentCount

      • Is the number of elements in pColorAttachments .

    • pColorAttachments

      • Is a pointer to an array of colorAttachmentCount   VkRenderingAttachmentInfo  structures describing any color attachments used.

      • Each element of the pColorAttachments  array corresponds to an output location in the shader, i.e. if the shader declares an output variable decorated with a Location  value of X , then it uses the attachment provided in pColorAttachments[X] .

      • If the imageView  member of any element of pColorAttachments  is NULL_HANDLE , and resolveMode  is not RESOLVE_MODE_EXTERNAL_FORMAT_DOWNSAMPLE_ANDROID , writes to the corresponding location by a fragment are discarded.

    • pDepthAttachment

    • pStencilAttachment

    • viewMask

      • Is a bitfield of view indices describing which views are active during rendering, when it is not 0 .

    • layerCount

      • Is the number of layers rendered to in each attachment when viewMask  is 0 .

      • Specifies the number of layers to render to, which is 1 for a non-layered image.

Multi-view
Render Cmds

Drawing Commands

Draw Direct
  • Specify the Viewport and Scissor.

  • Bind the pipeline.

  • Bind the descriptor sets.

  • vkCmdDraw()

    • vertexCount

      • Even though we don’t have a vertex buffer, we technically still have 3 vertices to draw.

    • instanceCount

      • Used for instanced rendering, use 1  if you’re not doing that.

    • firstVertex

      • Used as an offset into the vertex buffer; defines the lowest value of gl_VertexIndex .

    • firstInstance

      • Used as an offset for instanced rendering; defines the lowest value of gl_InstanceIndex .

  • vkCmdDrawIndexed .

    • indexCount

      • The number of vertices to draw.

    • instanceCount

      • The number of instances to draw.

      • We’re not using instancing, so just specify 1  instance.

    • firstIndex

      • The base index within the index buffer.

      • Specifies an offset into the index buffer, using a value of 1  would cause the graphics card to start reading at the second index.

    • vertexOffset

      • The value added to the vertex index before indexing into the vertex buffer.

    • firstInstance

      • The instance ID of the first instance to draw.

Draw Indirect
  • "In some ways, Indirect Rendering is a more advanced form of instancing".

  • buffer + offset + (stride * index)

  • Executing a draw-indirect call will be equivalent to doing this.

    // Pseudocode: what vkCmdDrawIndexedIndirect effectively does with
    // the contents of the indirect buffer.
    void FakeDrawIndirect(VkCommandBuffer commandBuffer, void* buffer, VkDeviceSize offset, uint32_t drawCount, uint32_t stride)
    {
        char* memory = (char*)buffer + offset;
    
        for (uint32_t i = 0; i < drawCount; i++)
        {
            VkDrawIndexedIndirectCommand* command = (VkDrawIndexedIndirectCommand*)(memory + (i * stride));
    
            vkCmdDrawIndexed(commandBuffer,
                command->indexCount,
                command->instanceCount,
                command->firstIndex,
                command->vertexOffset,
                command->firstInstance);
        }
    }
    
  • It does not carry vertex data itself — it only supplies counts and base indices/instances. The actual vertex data and indices come from the buffers you previously bound with vkCmdBindVertexBuffers  and vkCmdBindIndexBuffer .

  • Vertex :

    • To move vertex and index buffers to bindless, generally you do it by merging the meshes into really big buffers. Instead of having 1 buffer per vertex buffer and index buffer pair, you have 1 buffer for all vertex buffers in a scene. When rendering, then you use BaseVertex offsets in the drawcalls. In some engines, they remove vertex attributes from the pipelines entirely, and instead grab the vertex data from buffers in the vertex shader. Doing that makes it much easier to keep 1 big vertex buffer for all drawcalls in the engine even if they use different vertex attribute formats. It also allows some advanced unpacking/compression techniques, and it’s the main use case for Mesh Shaders.

    • We also change the way the meshes work. After loading a scene, we create a BIG vertex buffer, and stuff all of the meshes of the entire map into it. This way we will avoid having to rebind vertex buffers.
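
    • The merged-buffer layout boils down to simple offset bookkeeping, sketched here in plain C with illustrative names: each mesh records where its indices and vertices start in the big buffers, and those offsets become the firstIndex / vertexOffset arguments of the draw call.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch with illustrative names: appending meshes into one big vertex
 * buffer and one big index buffer, recording the per-mesh offsets that
 * later become the firstIndex / vertexOffset arguments of
 * vkCmdDrawIndexed (or the fields of VkDrawIndexedIndirectCommand). */
typedef struct {
    uint32_t firstIndex;   /* where this mesh's indices start  */
    uint32_t indexCount;
    int32_t  vertexOffset; /* added to each index before fetch */
} MergedMesh;

static uint32_t totalIndices  = 0;
static uint32_t totalVertices = 0;

static MergedMesh merge_mesh(uint32_t vertexCount, uint32_t indexCount) {
    MergedMesh m = {
        .firstIndex   = totalIndices,
        .indexCount   = indexCount,
        .vertexOffset = (int32_t)totalVertices,
    };
    totalIndices  += indexCount;
    totalVertices += vertexCount;
    return m;
}
```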

  • Implementation :

    • If the device supports multi-draw indirect ( VkPhysicalDeviceFeatures::multiDrawIndirect ), then the entire array of draw commands can be executed through a single call to vkCmdDrawIndexedIndirect . Otherwise, each draw command must be executed through a separate call to vkCmdDrawIndexedIndirect  with a draw count of 1 :

      // m_enable_mci: supports multiDrawIndirect
      if (m_enable_mci && m_supports_mci)
      {
          vkCmdDrawIndexedIndirect(draw_cmd_buffers[i], indirect_call_buffer->get_handle(), 0, cpu_commands.size(), sizeof(cpu_commands[0]));
      }
      else
      {
          for (size_t j = 0; j < cpu_commands.size(); ++j)
          {
              vkCmdDrawIndexedIndirect(draw_cmd_buffers[i], indirect_call_buffer->get_handle(), j * sizeof(cpu_commands[0]), 1, sizeof(cpu_commands[0]));
          }
      }
      
    • vkCmdDrawIndexedIndirectCount .

      • Behaves similarly to vkCmdDrawIndexedIndirect except that the draw count is read by the device from a buffer during execution. The command will read an unsigned 32-bit integer from countBuffer located at countBufferOffset and use this as the draw count.

  • Textures :

    • Due to the fact that you want to have as many things on the GPU as possible, this pipeline maps very well if you combine it with “Bindless” techniques, where you stop needing to bind descriptor sets per material or change vertex buffers. Having a bindless renderer also makes Raytracing much more performant and effective.

    • On this guide we will not use bindless textures as their support is limited, so we will do 1 draw-indirect call per material used.

    • To move textures into bindless, you use texture arrays.

    • With the correct extension, the size of the texture array can be unbounded in the shader, like when you use SSBOs.

    • Then, when accessing the textures in the shader, you access them by index which you grab from another buffer. If you don’t use the Descriptor Indexing extensions, you can still use texture arrays, but they will need a bounded size. Check your device limits to see how big can that be.

    • To make materials bindless, you need to stop having 1 pipeline per material. Instead, you want to move the material parameters into SSBOs, and go with an ubershader  approach.

    • In the Doom engines, they have a very low number of pipelines for the entire game. Doom Eternal has less than 500 pipelines, while Unreal Engine games often have 100,000+ pipelines. If you use ubershaders to massively lower the number of unique pipelines, you will be able to increase efficiency in a huge way, as vkCmdBindPipeline  is one of the most expensive calls when drawing objects in Vulkan.

  • Push Constants :

    • Push Constants and Dynamic Descriptors can be used, but they have to be “global”. Using push constants for things like camera location is perfectly fine, but you can't use them for object ID, as that’s a per-object value and you specifically want to draw as many objects as possible in 1 draw.

Multithreading Rendering

  • I'm not sure, I don't think it's necessary.

  • From what I understand, it's about using multiple CPU threads to handle submissions and presentations, etc.

  • It has nothing to do with frames in flight, btw.

  • Explanation .

    • The video explains okay, but nah.

    • ->  In the next video he says it wasn't exactly a good idea and reverted  what he did in that video.

      • "It was technically slower and more confusing to do synchronizations".

Render Passes and Framebuffers

Dynamic Rendering: Features and differences from Render Passes
  • Replaces VkRenderPass  and Framebuffers.

    • Instead, we can specify the color, depth, and stencil attachments directly when we begin rendering.

  • Describe renderpasses inline with command buffer recording.

  • Provides more flexibility by allowing us to change the attachments we’re rendering to without creating new render pass objects.

  • Greatly simplifies application architecture.

  • Synchronization still needs to be done, but now it's even more explicit, truer to its stated nature.

    • We had to do that with Render Passes, but that was bound up in the Render Pass creation.

    • Now, the synchronization is more explicit.

  • Tiling GPUs aren't left behind.

    • The Vulkan 1.4 dynamicRenderingLocalRead  feature ( KHR_dynamic_rendering_local_read ) brings tiling GPUs to the same capabilities, without them needing to state the Render Passes up front.

  • I wouldn't say that "You should use Render Passes if your hardware isn't new enough", because it isn't fun.

  • Better compatibility with modern rendering techniques.

Subpasses
  • External subpass dependencies :

    • Explained by TheMaister 2019; he is part of the Khronos Group.

    • The main purpose of external subpass dependencies is to deal with initialLayout and finalLayout of an attachment reference. If initialLayout != layout used in the first subpass, the render pass is forced to perform a layout transition.

    • If you don’t specify anything else, that layout transition will wait for nothing before it performs the transition. Or rather, the driver will inject a dummy subpass dependency for you with srcStageMask = TOP_OF_PIPE. This is not what you want since it’s almost certainly going to be a race condition. You can set up a subpass dependency with the appropriate srcStageMask and srcAccessMask.

    • The external subpass dependency is basically just a vkCmdPipelineBarrier injected for you by the driver.

    • The whole premise here is that it’s theoretically better to do it this way because the driver has more information, but this is questionable, at least on current hardware and drivers.

    • There is a very similar external subpass dependency setup for finalLayout. If finalLayout differs from the last use in a subpass, driver will transition into the final layout automatically. Here you get to change dstStageMask / dstAccessMask . If you do nothing here, you get BOTTOM_OF_PIPE , which can actually be just fine. A prime use case here is swapchain images which have finalLayout = PRESENT_SRC_KHR .

    • Essentially, you can ignore external subpass dependencies .

    • Their added complexity gives very little gain. Render pass compatibility rules also imply that if you change even minor things like which stages to wait for, you need to create new pipelines!

    • This is dumb, and will hopefully be fixed at some point in the spec.

    • However, while the usefulness of external subpass dependencies is questionable, they have some convenient use cases I’d like to go over:

      • Automatically transitioning TRANSIENT_ATTACHMENT  images :

        • If you’re on mobile, you should be using transient images where possible. When using these attachments in a render pass, it makes sense to always have them as initialLayout = UNDEFINED. Since we know that these images can only ever be used in COLOR_ATTACHMENT_OUTPUT  or EARLY / LATE_FRAGMENT_TEST  stages depending on their image format, the external subpass dependency writes itself, and we can just use transient attachments without having to think too hard about how to synchronize them. This is what I do in my Granite engine, and it’s quite useful. Of course, we could just inject a pipeline barrier for this exact same purpose, but that’s more boilerplate.

      • Automatically transitioning swapchain images :

        • Typically, swapchain images are always just used once per frame, and we can deal with all synchronization using external subpass dependencies. We want initialLayout = UNDEFINED , and finalLayout = PRESENT_SRC_KHR .

        • srcStageMask  is COLOR_ATTACHMENT_OUTPUT  which lets us link up with the swapchain acquire semaphore. For this case, we will need an external subpass dependency. For the finalLayout  transition after the render pass, we are fine with BOTTOM_OF_PIPE  being used. We’re going to use semaphores here anyways.

        • I also do this in Granite.

Framebuffers
  • VkFramebuffer

    • Holds the target images for a renderpass.

    • Only used in legacy tutorials.

  • Just wrappers to image views.

  • The attachments of a Framebuffer are the Image Views.

  • The Framebuffers are used within a Render Pass.

  • LunarG / Vulkan: "Kinda of a bad name, it's just a couple of image views".

  • Only exists to combine images and renderpasses.

Render Passes
  • VkRenderPass

    • Holds information about the images you are rendering into. All drawing commands have to be done inside a renderpass.

    • Only used in legacy tutorials.

  • Render passes in Vulkan describe the type  of images that are used during rendering operations, how  they will be used, and how  their contents should be treated.

  • All drawing commands happen inside a "render pass".

  • Acts as pseudo render graph.

  • Allows tiling GPUs to use memory efficiently.

    • Efficient scheduling.

  • Describes the image attachments.

  • Defines the subpasses.

  • Declares dependencies between subpasses.

  • Requires VkFramebuffers .

    • Whereas a render pass only describes the type of images, a VkFramebuffer  actually binds specific images to these slots.

  • Problem :

    • Great in theory, not so great to use in practice.

    • Single object with many responsibilities.

      • Made the API harder to reason about when looking at the code.

    • Hard to architect into a renderer.

      • Yet another input for pipelines.

    • The main benefit is for tiling based GPUs.

      • Commonly found in mobile.

    • "Use Dynamic Rendering, it's much better".

Submit

  • Submits the Command Buffers recorded.

  • VkSubmitInfo

    • The first three parameters specify which semaphores to wait on before execution begins and in which stage(s) of the pipeline to wait.

    • We want to wait for writing colors to the image until it’s available, so we’re specifying the stage of the graphics pipeline that writes to the color attachment.

    • That means that theoretically, the implementation can already start executing our vertex shader and such while the image is not yet available.

    • Each entry in the waitStages  array corresponds to the semaphore with the same index in pWaitSemaphores .

    • pCommandBuffers

      • Specifies which command buffers to actually submit for execution. We simply submit the single command buffer we have.

    • pSignalSemaphores

      • Specifies which semaphores to signal once the command buffer(s) have finished execution.

      • In our case we’re using the renderFinishedSemaphore  for that purpose.

  • vkQueueSubmit()

    • fence

      • Is an optional handle to a fence to be signaled  once all submitted command buffers have completed execution.

    • The function takes an array of VkSubmitInfo  structures, which is more efficient when the workload is much larger.

    • The last parameter references an optional fence that will be signaled when the command buffers finish execution.

    • This allows us to know when it is safe for the command buffer to be reused, thus we want to give it drawFence . Now we want the CPU to wait while the GPU finishes rendering that frame we just submitted:

Presentation

  • The last step of drawing a frame is submitting the result back to the Swapchain to have it eventually show up  on the screen.

  • Presentation Engine :

    • .

  • VkPresentInfoKHR

    • pWaitSemaphores

      • Which semaphores to wait on before presentation can happen, just like VkSubmitInfo .

      • Since we want to wait for the command buffer to finish execution (and thus for our triangle to be drawn), we wait on the semaphores it will signal, so we use signalSemaphores .

    • The next two parameters specify the Swapchains to present images to and the index of the image for each Swapchain.

    • This will almost always be a single swapchain.

    • pResults

      • It allows you to specify an array of VkResult  values to check for every Swapchain if presentation was successful.

      • It’s not necessary if you’re only using a single Swapchain, because you can use the return value of the present function.

  • vkQueuePresentKHR()

    • Submits a rendered image to the presentation queue.

    • Used after queueing all rendering commands and transitioning the image to the correct layout.

    • Vulkan transfers ownership of the image to the 'presentation engine'.

  • How a presentation happens :

    • Who :

      • The GPU  (via the display controller/hardware), orchestrated by the OS/window system .

    • When :

      • At the next vertical blanking interval ( Vblank ).

        • Vblank  is the moment between screen refreshes (e.g., at 60 Hz, every 16.67 ms).

      • In a Vulkan workflow, we can be sure that the presentation happened between vkQueuePresentKHR()  and the next vkAcquireNextImageKHR() .

        • The job of the present_complete_semaphore  is to hold this information.

    • How :

      • The GPU's display controller  reads the image from GPU memory.

      • The OS/window system (e.g., X11/Wayland on Linux, Win32 on Windows) composites the image into the application window.

      • The final output is scanned out to the display.

  • Image recycling :

    • After presentation, the image is released back to the swapchain.

    • It becomes available for re-acquisition via vkAcquireNextImageKHR  (after the next vblank).

Synchronization and Cache Control

KHR_synchronization2
  • KHR_synchronization2 .

  • Nvidia: Use KHR_synchronization2 , the new functions allow the application to describe barriers more accurately.

  • Highlights :

    • One main change with the extension is to have pipeline stages and access flags now specified together in memory barrier structures.

      • This makes the connection between the two more obvious.

    • Because the 32 bits of VkAccessFlags  were running out, the VkAccessFlags2KHR  type was created with a 64-bit range. To prevent the same issue for VkPipelineStageFlags , the VkPipelineStageFlags2KHR  type was also created with a 64-bit range.

    • Adds 2 new image layouts, IMAGE_LAYOUT_ATTACHMENT_OPTIMAL_KHR  and IMAGE_LAYOUT_READ_ONLY_OPTIMAL_KHR , to make layout transitions easier.

    • etc.

Queues

  • Any synchronization applies globally to a VkQueue ; there is no concept of only-inside-this-command-buffer synchronization.

  • Graphics pipelines are executable on queues supporting QUEUE_GRAPHICS . Stages executed by graphics pipelines can  only be specified in commands recorded for queues supporting QUEUE_GRAPHICS .

QueueIdle and DeviceIdle
  • These functions can be used as a very rudimentary way to perform synchronization.

  • Closing the program :

    • We should wait for the logical device to finish operations before exiting mainLoop  and destroying the window.

    • You can also wait for operations in a specific command queue to be finished with vkQueueWaitIdle .

    • You’ll see that the program now exits without problems when closing the window.

  • Problem :

    • vkDeviceWaitIdle  or vkQueueWaitIdle  end up being needed because vkQueuePresentKHR  offers no fence to signal.

  • Solution :

  • .

Queue Family Ownership Transfer
  • Resources created with a VkSharingMode  of SHARING_MODE_EXCLUSIVE   must  have their ownership explicitly transferred from one queue family to another in order to access their content in a well-defined manner on a queue in a different queue family.

  • Resources shared with external APIs or instances using external memory must  also explicitly manage ownership transfers between local and external queues (or equivalent constructs in external APIs) regardless of the VkSharingMode  specified when creating them.

  • If you need to transfer ownership to a different queue family, you need memory barriers, one in each queue to release/acquire ownership.

  • If memory dependencies are correctly expressed between uses of such a resource between two queues in different families, but no ownership transfer is defined, the contents of that resource are undefined for any read accesses performed by the second queue family.

  • A queue family ownership transfer consists of two distinct parts:

    1. Release exclusive ownership from the source queue family

    2. Acquire exclusive ownership for the destination queue family

    • An ownership transfer is defined if the two queue family values are not equal; either value may also be one of the special queue family values reserved for external memory ownership transfers.

    • An application must  ensure that these operations occur in the correct order by defining an execution dependency between them, e.g. using a semaphore.

    • A release operation  is used to release exclusive ownership of a range of a buffer or image subresource range. A release operation is defined by executing a buffer memory barrier  (for a buffer range) or an image memory barrier  (for an image subresource range) using a pipeline barrier command, on a queue from the source queue family.

    • Etc, I haven't read much about it.

Command Buffers
  • The specification states that commands start execution in-order, but complete out-of-order. Don’t get confused by this. The fact that commands start in-order is simply convenient language to make the spec language easier to write.

  • Unless you add synchronization yourself, all commands in a queue execute out of order. Reordering may happen across command buffers and even vkQueueSubmits .

  • This makes sense, considering that Vulkan only sees a linear stream of commands once you submit. It is a pitfall to assume that splitting command buffers or submits adds some magic synchronization for you.

  • Frame buffer operations inside a render pass happen in API-order, of course. This is a special exception which the spec calls out.

Queue Submissions (vkQueueSubmit)
  • Queue submission commands

  • It automatically performs a domain operation from host to device  for all writes performed before the command executes, so in most cases an explicit memory barrier is not needed.

  • In the few circumstances where a submit does not occur between the host write and the device read access, writes can  be made available by using an explicit memory barrier.

Example
  • vkCmdDispatch (PIPELINE_STAGE_COMPUTE_SHADER)

  • vkCmdCopyBuffer (PIPELINE_STAGE_TRANSFER)

  • vkCmdDispatch (PIPELINE_STAGE_COMPUTE_SHADER)

  • vkCmdPipelineBarrier (srcStageMask = PIPELINE_STAGE_COMPUTE_SHADER)

  • We would be referring to the two vkCmdDispatch  commands, as they perform their work in the COMPUTE stage. Even if we split these 4 commands into 4 different vkQueueSubmits , we would still consider the same commands for synchronization.

  • Essentially, the work we are waiting for is all commands which have ever been submitted to the queue including any previous commands in the command buffer we’re recording.

Blocking Operations

  • .

    • By Samsung 2019.

    • I don't know if this information is still valid.

    • See the Mobile section for optimizations of vkQueuePresent .

Examples

  • Synchronization examples .

  • Example 1 :

    • vkCmdDispatch  – writes to an SSBO, ACCESS_SHADER_WRITE

    • vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = TRANSFER, srcAccessMask = SHADER_WRITE, dstAccessMask = 0)

    • vkCmdPipelineBarrier(srcStageMask = TRANSFER, dstStageMask = COMPUTE, srcAccessMask = 0, dstAccessMask = SHADER_READ)

    • vkCmdDispatch  – read from the same SSBO, ACCESS_SHADER_READ

    • While StageMask  cannot be 0, AccessMask  can be 0.

  • Recently allocated image, to use in a compute shader as a storage image :

    • The pipeline barrier looks like:

      • oldLayout = UNDEFINED

        • Input is garbage

      • newLayout = GENERAL

        • Storage image compatible layout

      • srcStageMask = TOP_OF_PIPE

        • Wait for nothing

      • srcAccessMask = 0

        • This is key, there are no pending writes to flush out.

        • This is the only way to use TOP_OF_PIPE  in a memory barrier.

      • dstStageMask = COMPUTE

        • Unblock compute after the layout transition is done

      • dstAccessMask = SHADER_READ | SHADER_WRITE

  • Swapchain Image Transition to PRESENT_SRC :

    • We have to transition them into IMAGE_LAYOUT_PRESENT_SRC_KHR  before passing the image over to the presentation engine.

    • Having dstStageMask = BOTTOM_OF_PIPE  and dstAccessMask = 0  is perfectly fine. We don’t care about making this memory visible  to any stage beyond this point. We will use semaphores to synchronize with the presentation engine anyways.

    • The pipeline barrier looks like:

      • srcStageMask = COLOR_ATTACHMENT_OUTPUT

        • Assuming we rendered to swapchain in a render pass.

      • srcAccessMask = COLOR_ATTACHMENT_WRITE

      • dstStageMask = BOTTOM_OF_PIPE

        • After transitioning into this PRESENT  layout, we’re not going to touch the image again until we reacquire the image, so dstStageMask = BOTTOM_OF_PIPE  is appropriate.

      • dstAccessMask = 0

      • oldLayout = COLOR_ATTACHMENT_OPTIMAL

      • newLayout = PRESENT_SRC_KHR

    • Setting dstAccessMask = 0  on the final TRANSFER_DST → PRESENT_SRC_KHR  barrier means “there is no GPU  access after this barrier that we are ordering/expressing.” For swapchain-present that is intentional and common: presentation is outside the GPU pipeline, so the barrier only needs to make the producer  writes (e.g. your blit TRANSFER_WRITE ) available/visible; the presentation engine performs its own, external visibility semantics.

  • Example 1 :

    • vkCmdPipelineBarrier(srcStageMask = FRAGMENT_SHADER, dstStageMask = ?)

    • Vertex shading for future commands can begin executing early, we only need to wait once FRAGMENT_SHADER  is reached.

  • Example 2 :

    1. vkCmdDispatch

    2. vkCmdDispatch

    3. vkCmdDispatch

    4. vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = COMPUTE)

    5. vkCmdDispatch

    6. vkCmdDispatch

    7. vkCmdDispatch

    • {5, 6, 7} must wait for {1, 2, 3}.

    • A possible execution order here could be:

      • #3

      • #2

      • #1

      • #7

      • #6

      • #5

    • {1, 2, 3} can execute out-of-order, and so can {5, 6, 7}, but these two sets of commands can not interleave execution.

    • In spec lingo {1, 2, 3} happens-before  {5, 6, 7}.

  • Chain of Dependencies (1) :

    1. vkCmdDispatch

    2. vkCmdDispatch

    3. vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = TRANSFER)

    4. vkCmdPipelineBarrier(srcStageMask = TRANSFER, dstStageMask = COMPUTE)

    5. vkCmdDispatch

    6. vkCmdDispatch

    • {5, 6} must wait for {1, 2}.

    • We created a chain of dependencies between COMPUTE -> TRANSFER -> COMPUTE.

    • When we wait for TRANSFER in 4, we must also wait for anything which is currently blocking TRANSFER.

  • Chain of dependencies (2) :

    1. vkCmdDispatch

    2. vkCmdDispatch

    3. vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = TRANSFER)

    4. vkCmdMagicDummyTransferOperation

    5. vkCmdPipelineBarrier(srcStageMask = TRANSFER, dstStageMask = COMPUTE)

    6. vkCmdDispatch

    7. vkCmdDispatch

    • {4} must wait for {1, 2}.

    • {6, 7} must wait for {4}.

    • The chain is {1, 2} -> {4} -> {6, 7}, and if {4} is noop (no operation), {1, 2} -> {6, 7} is achieved.

Execution Dependencies, Memory Dependencies, Memory Model

Data hazards
  • Execution dependencies  and memory dependencies  are used to solve data hazards, i.e. to ensure that read and write operations occur in a well-defined order.

    • An operation  is an arbitrary amount of work to be executed on the host, a device, or an external entity such as a presentation engine.

  • Write-after-read hazards :

    • Can be solved with just an execution dependency

  • Read-after-write hazards :

    • Need appropriate memory dependencies to be included between them.

  • Write-after-write hazards :

    • Need appropriate memory dependencies to be included between them.

  • If an application does not include dependencies to solve these hazards, the results and execution orders of memory accesses are undefined .

Execution Dependencies
  • An execution dependency  is a guarantee that for two sets of operations, the first set must  happen-before the second set. If an operation happens-before another operation, then the first operation must  complete before the second operation is initiated.

  • Execution dependencies  alone are not sufficient to guarantee that values resulting from writes in one set of operations can  be read from another set of operations.

Memory Available
  • Availability operations :

    • Cause the values generated by specified memory write accesses to become available  to a memory domain for future access. Any available value remains available until a subsequent write to the same memory location occurs (whether it is made available or not) or the memory is freed.

    • Even with coherent mapping, you still need to have a dependency between the host writing that memory and the GPU operation reading it.

  • We can say “making memory available” is all about flushing caches.

  • vkFlushMappedMemoryRanges()

    • Guarantees that host writes to the memory ranges described by pMemoryRanges   can  be made available  to device access, via availability operations  from the ACCESS_HOST_WRITE  access type.

    • This is required after CPU writes to non-coherent memory; HOST_COHERENT  effectively provides it automatically.

  • Cache example :

    • When our L2 cache contains the most up-to-date data there is, we can say that memory is available , as L1 caches connected to L2 can pull in the most up-to-date data there is.

    • Once a shader stage writes to memory, the L2 cache no longer has the most up-to-date data there is, so that memory is no longer considered available .

      • If other caches try to read from L2, they will see undefined data.

      • Whatever wrote that data must make those writes available  before the data can be made visible  again.

Memory Domain
  • Memory domain operations :

    • Cause writes that are available to a source memory domain to become available to a destination memory domain (an example of this is making writes available to the host domain available to the device domain).

Memory Visible
  • Visibility operations :

    • Cause values available  to a memory domain to become visible  to specified memory accesses.

    • Memory barriers are visibility operations. Without them, you wouldn’t have visibility of the memory.

      • The execution barrier ensures the completion of commands, but the srcStageMask , dstStageMask , srcAccessMask  and dstAccessMask  are what handle availability and visibility.

  • Once written values are made visible to a particular type of memory access, they can  be read or written by that type of memory access.

  • We can say “making memory visible” is all about invalidating caches.

  • Availability is a necessary part of visibility, but availability alone is not sufficient.

    • You can do things that might have caused visibility, but because the write was not available, they don’t actually make the write visible.

  • Under the hood, visibility is implementation-specific. The pure-visibility parts typically involve forcing lines out of caches and/or invalidating them. But some kinds of visibility may not require even that.

  • vkInvalidateMappedMemoryRanges() .

    • Guarantees that device writes to the memory ranges described by pMemoryRanges , which have been made available  to the host memory domain using the ACCESS_HOST_WRITE  and ACCESS_HOST_READ  access types, are made visible  to the host.

    • If a range of non-coherent memory is written by the host and then invalidated without first being flushed, its contents are undefined.

Host Coherent
  • MEMORY_PROPERTY_HOST_COHERENT

    • If a memory object does  have this property:

      • Writes  to the memory object from the host are automatically made available  to the host domain.

      • This means you don't need vkFlushMappedMemoryRanges()  or vkInvalidateMappedMemoryRanges() .

      • This property alone is insufficient for availability. You still need to use synchronization to make sure that reads and writes from CPU and GPU happen in the right order, and you need memory barriers on the GPU side to manage GPU caches (make CPU writes visible to GPU reads, and make GPU writes available to CPU reads).

      • Coherency is about "visibility", but you still need availability.

    • If a memory object does not  have this property:

      • vkFlushMappedMemoryRanges() must  be called in order to guarantee that writes to the memory object from the host are made available  to the host domain, where they can  be further made available to the device domain via a domain operation.

      • vkInvalidateMappedMemoryRanges()   must  be called to guarantee that writes which are available to the host domain are made visible  to host operations.

Memory Dependency
  • Memory Dependency  is an execution dependency  which includes availability  and visibility  operations such that:

    • The first set of operations happens-before the availability  operation.

    • The availability operation happens-before the visibility  operation.

    • The visibility operation happens-before the second set of operations.

  • It enforces availability  and visibility  of memory accesses and execution order  between two sets of operations.

  • Most synchronization commands in Vulkan define a memory dependency.

  • The specific memory accesses that are made available  and visible  are defined by the access scopes  of a memory dependency.

  • Any type of access that is in a memory dependency’s first access scope  is made available .

  • Any type of access that is in a memory dependency’s second access scope  has any available writes made visible  to it.

  • Any type of operation that is not in a synchronization command’s access scopes will not be included in the resulting dependency.

Execution Stages

  • The stage masks are bit-masks, so it's perfectly fine to wait for both X and Y work.

  • By specifying the source and target stages, you tell the driver what operations need to finish before the transition can execute, and what must not have started yet.

  • Nvidia: Use optimal srcStageMask  and dstStageMask . Most important cases: If the specified resources are accessed only in compute or fragment shaders, use the compute or the fragment stage bits for both masks, to make the barrier fragment-only or compute-only.

  • Caio: "Wait for srcStageMask  to finish, before dstStageMask  can start".

First synchronization scope
  • srcStageMask

  • This represents what we are waiting for.

  • "What operations need to finish before the transition can execute".

Second synchronization scope
  • dstStageMask

  • "What operations must not have started yet".

  • Any work submitted after this barrier will need to wait for the work represented by srcStageMask  before it can execute.

Stages
  • VkPipelineStageFlagBits2 .

  • TOP_OF_PIPE  and BOTTOM_OF_PIPE :

    • These stages are essentially “helper” stages, which do no actual work, but serve some important purposes. Every command will first execute the TOP_OF_PIPE  stage. This is basically the command processor on the GPU parsing the command. BOTTOM_OF_PIPE  is where commands retire after all work has been done.

    • Both these pipeline stages are deprecated, and applications should prefer ALL_COMMANDS  and NONE .

    • Memory Access :

      • Never use AccessMask != 0  with these stages. These stages do not perform memory accesses . Any srcAccessMask  and dstAccessMask  combination with either stage is meaningless, and the spec disallows this.

      • TOP_OF_PIPE  and BOTTOM_OF_PIPE  are purely there for the sake of execution barriers, not memory barriers.

  • TOP_OF_PIPE

    • In the first scope:

      • Equivalent to NONE

      • Is basically saying “wait for nothing”, or to be more precise, we’re waiting for the GPU to parse all commands.

        • We had to parse all commands before getting to the pipeline barrier command to begin with.

    • In the second scope:

      • Equivalent to ALL_COMMANDS  with VkAccessFlags2  set to 0 .

  • BOTTOM_OF_PIPE

    • In the first scope:

      • Equivalent to ALL_COMMANDS , with VkAccessFlags2  set to 0 .

    • In the second scope:

      • Equivalent to NONE .

      • Basically translates to “block the last stage of execution in the pipeline”.

      • “No work after this barrier is going to wait for us”.

  • NONE

    • Specifies no stages of execution.

  • ALL_COMMANDS

    • Specifies all operations performed by all commands supported on the queue it is used with.

    • Basically drains the entire queue for work.

  • ALL_GRAPHICS

    • Specifies the execution of all graphics pipeline stages.

    • It's the same as ALL_COMMANDS , but only for render passes.

    • Is equivalent to the logical OR of:

      • DRAW_INDIRECT

      • COPY_INDIRECT

      • TASK_SHADER

      • MESH_SHADER

      • VERTEX_INPUT

      • VERTEX_SHADER

      • TESSELLATION_CONTROL_SHADER

      • TESSELLATION_EVALUATION_SHADER

      • GEOMETRY_SHADER

      • FRAGMENT_SHADER

      • EARLY_FRAGMENT_TESTS

      • LATE_FRAGMENT_TESTS

      • COLOR_ATTACHMENT_OUTPUT

      • CONDITIONAL_RENDERING

      • TRANSFORM_FEEDBACK

      • FRAGMENT_SHADING_RATE_ATTACHMENT

      • FRAGMENT_DENSITY_PROCESS

      • SUBPASS_SHADER

      • INVOCATION_MASK

      • CLUSTER_CULLING_SHADER

Order of execution stages
  • Ignoring TOP_OF_PIPE  and BOTTOM_OF_PIPE .

  • Graphics primitive pipeline :

    • DRAW_INDIRECT

      • Parses indirect buffers.

    • COPY_INDIRECT

    • INDEX_INPUT

    • VERTEX_ATTRIBUTE_INPUT

      • Consumes fixed-function VBOs and IBOs.

    • VERTEX_SHADER

    • TESSELLATION_CONTROL_SHADER

    • TESSELLATION_EVALUATION_SHADER

    • GEOMETRY_SHADER

    • TRANSFORM_FEEDBACK

    • FRAGMENT_SHADING_RATE_ATTACHMENT

    • EARLY_FRAGMENT_TESTS

      • Early  depth/stencil tests.

      • Render pass performs its loadOp  of a depth/stencil attachment.

      • This stage isn’t all that useful or meaningful except in some very obscure scenarios with frame buffer self-dependencies (aka, GL_ARB_texture_barrier ).

      • When blocking a render pass with dstStageMask , just use a mask of EARLY_FRAGMENT_TESTS | LATE_FRAGMENT_TESTS .

      • dstStageMask = EARLY_FRAGMENT_TESTS  alone might work since that will block loadOp , but there might be shenanigans with memory barriers if you are 100% pedantic about any memory access happening in LATE_FRAGMENT_TESTS . If you’re blocking an early stage, it never hurts to block a later stage as well.

    • FRAGMENT_SHADER

    • LATE_FRAGMENT_TESTS

      • Late  depth-stencil tests.

      • Render pass performs its storeOp  of a depth/stencil attachment when a render pass is done.

      • When you’re waiting for a depth map to have been rendered in an earlier render pass, you should use srcStageMask = LATE_FRAGMENT_TESTS , as that will wait for the storeOp  to finish its work.

    • COLOR_ATTACHMENT_OUTPUT

      • This one is where loadOp , storeOp , MSAA resolves, and the framebuffer blend stage take place.

      • Basically anything that touches a color attachment in a render pass in some way.

      • If you’re waiting for a render pass which uses color to be complete, use srcStageMask = COLOR_ATTACHMENT_OUTPUT , and similar for dstStageMask  when blocking render passes from execution.

      • Usage as dstStageMask :

        • COLOR_ATTACHMENT_OUTPUT  is the appropriate dstStageMask  when you are transitioning an image so it can be written as a color attachment.

  • Graphics mesh pipeline :

    • DRAW_INDIRECT

    • TASK_SHADER

    • MESH_SHADER

    • FRAGMENT_SHADING_RATE_ATTACHMENT

    • EARLY_FRAGMENT_TESTS

    • FRAGMENT_SHADER

    • LATE_FRAGMENT_TESTS

    • COLOR_ATTACHMENT_OUTPUT

  • Compute pipeline :

    • DRAW_INDIRECT

    • COPY_INDIRECT

    • COMPUTE_SHADER

  • Transfer pipeline :

    • COPY_INDIRECT

    • TRANSFER

  • Subpass shading pipeline :

    • SUBPASS_SHADER

  • Graphics pipeline commands executing in a render pass with a fragment density map attachment : (almost unordered)

    • The following pipeline stage where the fragment density map read happens has no particular order  relative to the other stages.

    • It is logically earlier than EARLY_FRAGMENT_TESTS , so:

      • FRAGMENT_DENSITY_PROCESS

      • EARLY_FRAGMENT_TESTS

  • Conditional rendering stage : (unordered)

    • Is formally part of both the graphics, and the compute pipeline.

    • The predicate read has unspecified order relative to other stages of these pipelines:

    • CONDITIONAL_RENDERING

  • Host operations :

    • Only one pipeline stage occurs.

    • HOST

  • Command preprocessing pipeline :

    • COMMAND_PREPROCESS

  • Acceleration structure build operations :

    • Only one pipeline stage occurs.

    • ACCELERATION_STRUCTURE_BUILD

  • Acceleration structure copy operations :

    • Only one pipeline stage occurs.

    • ACCELERATION_STRUCTURE_COPY

  • Opacity micromap build operations :

    • Only one pipeline stage occurs.

    • MICROMAP_BUILD

  • Ray tracing pipeline :

    • DRAW_INDIRECT

    • RAY_TRACING_SHADER

  • Video decode pipeline :

    • VIDEO_DECODE

  • Video encode pipeline :

    • VIDEO_ENCODE

  • Data graph pipeline :

    • DATA_GRAPH

Memory Access

  • Access scopes  do not interact with the logically earlier or later stages for either scope - only the stages the application specifies are considered part of each access scope.

  • These flags represent memory access that can be performed.

  • Each pipeline stage can perform certain memory accesses, and thus we take the combination of pipeline stage + access mask and we get potentially a very large number of incoherent caches on the system.

  • Each GPU core has its own set of L1 caches as well.

  • Real GPUs will only have a fraction of the possible caches here, but as long as we are explicit about this in the API, any GPU driver can simplify this as needed.

  • Access masks either read from a cache, or write to an L1 cache in our mental model.

  • Certain access types are only performed by a subset of pipeline stages.

  • "Had this access ( srcAccessMask ) and it's going to have this access ( dstAccessMask )".

  • srcAccessMask

    • Lists the access types that happened before  the barrier (the producer accesses) and that must be made available/visible by the barrier.

    • Must describe the kinds of accesses that actually happened before the barrier (the producer accesses you need to make available/visible) .

    • It does not  describe what you want the resource to become after the barrier — that is expressed by dstAccessMask  (what will happen after).

    • The stage masks (src/dst stage) specify the pipeline stages that contain those accesses.

    • srcAccessMask = 0  means “there are no prior GPU memory accesses that this barrier needs to make available” (i.e. nothing to claim as the producer side).

  • dstAccessMask

    • Lists the access types that will happen after  the barrier (the consumer accesses) and that must see the producer’s writes.

    • dstAccessMask = 0  means “there are no subsequent GPU memory accesses that this barrier needs to order/make visible to” (i.e. no GPU consumer to describe with access bits).

Access Flags
  • VkAccessFlagBits2 .

  • MEMORY_READ

    • Specifies all read accesses.

    • It is always valid in any access mask, and is treated as equivalent to setting all READ  access flags that are valid where it is used.

  • MEMORY_WRITE

    • Specifies all write accesses.

    • It is always valid in any access mask, and is treated as equivalent to setting all WRITE  access flags that are valid where it is used.

  • SHADER_READ

    • Same as SAMPLED_READ  + STORAGE_READ  + TILE_ATTACHMENT_READ .

  • SHADER_SAMPLED_READ

  • HOST_READ

    • Specifies read access by a host operation. Accesses of this type are not performed through a resource, but directly on memory.

    • Such access occurs in the PIPELINE_STAGE_2_HOST  pipeline stage.

  • HOST_WRITE

    • Specifies write access by a host operation. Accesses of this type are not performed through a resource, but directly on memory.

    • Such access occurs in the PIPELINE_STAGE_2_HOST  pipeline stage.

Access Flag -> Pipeline Stages

| Access flag                             | Pipeline stages                                                                                                                                                                                                                                                                                         |
|-----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| NONE                                   | Any                                                                                                                                                                                                                                                                                                     |
| INDIRECT_COMMAND_READ                  | DRAW_INDIRECT , ACCELERATION_STRUCTURE_BUILD , COPY_INDIRECT                                                                                                                                                                                                                                         |
| INDEX_READ                             | VERTEX_INPUT , INDEX_INPUT                                                                                                                                                                                                                                                                            |
| VERTEX_ATTRIBUTE_READ                  | VERTEX_INPUT , VERTEX_ATTRIBUTE_INPUT                                                                                                                                                                                                                                                                 |
| UNIFORM_READ                           | VERTEX_SHADER , TESSELLATION_CONTROL_SHADER , TESSELLATION_EVALUATION_SHADER , GEOMETRY_SHADER , FRAGMENT_SHADER , COMPUTE_SHADER , RAY_TRACING_SHADER , TASK_SHADER , MESH_SHADER , SUBPASS_SHADER , CLUSTER_CULLING_SHADER                                                                 |
| INPUT_ATTACHMENT_READ                  | FRAGMENT_SHADER , SUBPASS_SHADER                                                                                                                                                                                                                                                                      |
| SHADER_READ                            | ACCELERATION_STRUCTURE_BUILD , MICROMAP_BUILD , VERTEX_SHADER , TESSELLATION_CONTROL_SHADER , TESSELLATION_EVALUATION_SHADER , GEOMETRY_SHADER , FRAGMENT_SHADER , COMPUTE_SHADER , RAY_TRACING_SHADER , TASK_SHADER , MESH_SHADER , SUBPASS_SHADER , CLUSTER_CULLING_SHADER               |
| SHADER_WRITE                           | VERTEX_SHADER , TESSELLATION_CONTROL_SHADER , TESSELLATION_EVALUATION_SHADER , GEOMETRY_SHADER , FRAGMENT_SHADER , COMPUTE_SHADER , RAY_TRACING_SHADER , TASK_SHADER , MESH_SHADER , SUBPASS_SHADER , CLUSTER_CULLING_SHADER                                                                 |
| COLOR_ATTACHMENT_READ                  | FRAGMENT_SHADER , COLOR_ATTACHMENT_OUTPUT                                                                                                                                                                                                                                                             |
| COLOR_ATTACHMENT_WRITE                 | COLOR_ATTACHMENT_OUTPUT                                                                                                                                                                                                                                                                                |
| DEPTH_STENCIL_ATTACHMENT_READ          | FRAGMENT_SHADER , EARLY_FRAGMENT_TESTS , LATE_FRAGMENT_TESTS                                                                                                                                                                                                                                         |
| DEPTH_STENCIL_ATTACHMENT_WRITE         | EARLY_FRAGMENT_TESTS , LATE_FRAGMENT_TESTS                                                                                                                                                                                                                                                            |
| TRANSFER_READ                          | ALL_TRANSFER , COPY , RESOLVE , BLIT , ACCELERATION_STRUCTURE_BUILD , ACCELERATION_STRUCTURE_COPY , MICROMAP_BUILD , CONVERT_COOPERATIVE_VECTOR_MATRIX                                                                                                                                          |
| TRANSFER_WRITE                         | ALL_TRANSFER , COPY , RESOLVE , BLIT , CLEAR , ACCELERATION_STRUCTURE_BUILD , ACCELERATION_STRUCTURE_COPY , MICROMAP_BUILD , CONVERT_COOPERATIVE_VECTOR_MATRIX                                                                                                                                 |
| HOST_READ                              | HOST                                                                                                                                                                                                                                                                                                   |
| HOST_WRITE                             | HOST                                                                                                                                                                                                                                                                                                   |
| MEMORY_READ                            | Any                                                                                                                                                                                                                                                                                                     |
| MEMORY_WRITE                           | Any                                                                                                                                                                                                                                                                                                     |
| SHADER_SAMPLED_READ                    | VERTEX_SHADER , TESSELLATION_CONTROL_SHADER , TESSELLATION_EVALUATION_SHADER , GEOMETRY_SHADER , FRAGMENT_SHADER , COMPUTE_SHADER , RAY_TRACING_SHADER , TASK_SHADER , MESH_SHADER , SUBPASS_SHADER , CLUSTER_CULLING_SHADER                                                                 |
| SHADER_STORAGE_READ                    | VERTEX_SHADER , TESSELLATION_CONTROL_SHADER , TESSELLATION_EVALUATION_SHADER , GEOMETRY_SHADER , FRAGMENT_SHADER , COMPUTE_SHADER , RAY_TRACING_SHADER , TASK_SHADER , MESH_SHADER , SUBPASS_SHADER , CLUSTER_CULLING_SHADER                                                                 |
| SHADER_STORAGE_WRITE                   | VERTEX_SHADER , TESSELLATION_CONTROL_SHADER , TESSELLATION_EVALUATION_SHADER , GEOMETRY_SHADER , FRAGMENT_SHADER , COMPUTE_SHADER , RAY_TRACING_SHADER , TASK_SHADER , MESH_SHADER , SUBPASS_SHADER , CLUSTER_CULLING_SHADER                                                                 |
| VIDEO_DECODE_READ                      | VIDEO_DECODE                                                                                                                                                                                                                                                                                           |
| VIDEO_DECODE_WRITE                     | VIDEO_DECODE                                                                                                                                                                                                                                                                                           |
| VIDEO_ENCODE_READ                      | VIDEO_ENCODE                                                                                                                                                                                                                                                                                           |
| VIDEO_ENCODE_WRITE                     | VIDEO_ENCODE                                                                                                                                                                                                                                                                                           |
| TRANSFORM_FEEDBACK_WRITE               | TRANSFORM_FEEDBACK                                                                                                                                                                                                                                                                                     |
| TRANSFORM_FEEDBACK_COUNTER_READ        | DRAW_INDIRECT , TRANSFORM_FEEDBACK                                                                                                                                                                                                                                                                    |
| TRANSFORM_FEEDBACK_COUNTER_WRITE       | TRANSFORM_FEEDBACK                                                                                                                                                                                                                                                                                     |
| CONDITIONAL_RENDERING_READ             | CONDITIONAL_RENDERING                                                                                                                                                                                                                                                                                  |
| COMMAND_PREPROCESS_READ                | COMMAND_PREPROCESS                                                                                                                                                                                                                                                                                     |
| COMMAND_PREPROCESS_WRITE               | COMMAND_PREPROCESS                                                                                                                                                                                                                                                                                     |
| FRAGMENT_SHADING_RATE_ATTACHMENT_READ  | FRAGMENT_SHADING_RATE_ATTACHMENT                                                                                                                                                                                                                                                                       |
| ACCELERATION_STRUCTURE_READ            | VERTEX_SHADER , TESSELLATION_CONTROL_SHADER , TESSELLATION_EVALUATION_SHADER , GEOMETRY_SHADER , FRAGMENT_SHADER , COMPUTE_SHADER , RAY_TRACING_SHADER , TASK_SHADER , MESH_SHADER , CLUSTER_CULLING_SHADER , ACCELERATION_STRUCTURE_BUILD , ACCELERATION_STRUCTURE_COPY , SUBPASS_SHADER  |
| ACCELERATION_STRUCTURE_WRITE           | ACCELERATION_STRUCTURE_BUILD , ACCELERATION_STRUCTURE_COPY                                                                                                                                                                                                                                            |
| FRAGMENT_DENSITY_MAP_READ              | FRAGMENT_DENSITY_PROCESS                                                                                                                                                                                                                                                                               |
| COLOR_ATTACHMENT_READ_NONCOHERENT      | COLOR_ATTACHMENT_OUTPUT                                                                                                                                                                                                                                                                                |
| DESCRIPTOR_BUFFER_READ                 | VERTEX_SHADER , TESSELLATION_CONTROL_SHADER , TESSELLATION_EVALUATION_SHADER , GEOMETRY_SHADER , FRAGMENT_SHADER , COMPUTE_SHADER , RAY_TRACING_SHADER , TASK_SHADER , MESH_SHADER , SUBPASS_SHADER , CLUSTER_CULLING_SHADER                                                                 |
| INVOCATION_MASK_READ                   | INVOCATION_MASK                                                                                                                                                                                                                                                                                        |
| MICROMAP_READ                          | MICROMAP_BUILD , ACCELERATION_STRUCTURE_BUILD                                                                                                                                                                                                                                                         |
| MICROMAP_WRITE                         | MICROMAP_BUILD                                                                                                                                                                                                                                                                                         |
| OPTICAL_FLOW_READ                      | OPTICAL_FLOW                                                                                                                                                                                                                                                                                           |
| OPTICAL_FLOW_WRITE                     | OPTICAL_FLOW                                                                                                                                                                                                                                                                                           |
| SHADER_TILE_ATTACHMENT_READ            | FRAGMENT_SHADER , COMPUTE_SHADER                                                                                                                                                                                                                                                                      |
| SHADER_TILE_ATTACHMENT_WRITE           | FRAGMENT_SHADER , COMPUTE_SHADER                                                                                                                                                                                                                                                                      |
| DATA_GRAPH_READ                        | DATA_GRAPH                                                                                                                                                                                                                                                                                             |
| DATA_GRAPH_WRITE                       | DATA_GRAPH                                                                                                                                                                                                                                                                                             |

Pipeline Stage -> Access Flags

| Pipeline stage                      | Access flags                                                                                                                                                                                                                                                                                                                   |
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| ACCELERATION_STRUCTURE_BUILD       | ACCELERATION_STRUCTURE_READ , ACCELERATION_STRUCTURE_WRITE , INDIRECT_COMMAND_READ , MICROMAP_READ , SHADER_READ , TRANSFER_READ , TRANSFER_WRITE                                                                                                                                                                       |
| ACCELERATION_STRUCTURE_COPY        | ACCELERATION_STRUCTURE_READ , ACCELERATION_STRUCTURE_WRITE , TRANSFER_READ , TRANSFER_WRITE                                                                                                                                                                                                                                |
| ALL_TRANSFER                       | TRANSFER_READ , TRANSFER_WRITE                                                                                                                                                                                                                                                                                               |
| ANY                                | MEMORY_READ , MEMORY_WRITE , NONE                                                                                                                                                                                                                                                                                           |
| BLIT                               | TRANSFER_READ , TRANSFER_WRITE                                                                                                                                                                                                                                                                                               |
| CLEAR                              | TRANSFER_WRITE                                                                                                                                                                                                                                                                                                                |
| CLUSTER_CULLING_SHADER             | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                                                    |
| COLOR_ATTACHMENT_OUTPUT            | COLOR_ATTACHMENT_READ , COLOR_ATTACHMENT_READ_NONCOHERENT , COLOR_ATTACHMENT_WRITE                                                                                                                                                                                                                                          |
| COMMAND_PREPROCESS                 | COMMAND_PREPROCESS_READ , COMMAND_PREPROCESS_WRITE                                                                                                                                                                                                                                                                           |
| COMPUTE_SHADER                     | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_TILE_ATTACHMENT_READ , SHADER_TILE_ATTACHMENT_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                     |
| CONDITIONAL_RENDERING              | CONDITIONAL_RENDERING_READ                                                                                                                                                                                                                                                                                                    |
| CONVERT_COOPERATIVE_VECTOR_MATRIX  | TRANSFER_READ , TRANSFER_WRITE                                                                                                                                                                                                                                                                                               |
| COPY                               | TRANSFER_READ , TRANSFER_WRITE                                                                                                                                                                                                                                                                                               |
| COPY_INDIRECT                      | INDIRECT_COMMAND_READ                                                                                                                                                                                                                                                                                                         |
| DATA_GRAPH                         | DATA_GRAPH_READ , DATA_GRAPH_WRITE                                                                                                                                                                                                                                                                                           |
| DRAW_INDIRECT                      | INDIRECT_COMMAND_READ , TRANSFORM_FEEDBACK_COUNTER_READ                                                                                                                                                                                                                                                                      |
| EARLY_FRAGMENT_TESTS               | DEPTH_STENCIL_ATTACHMENT_READ , DEPTH_STENCIL_ATTACHMENT_WRITE                                                                                                                                                                                                                                                               |
| FRAGMENT_DENSITY_PROCESS           | FRAGMENT_DENSITY_MAP_READ                                                                                                                                                                                                                                                                                                     |
| FRAGMENT_SHADER                    | ACCELERATION_STRUCTURE_READ , COLOR_ATTACHMENT_READ , DEPTH_STENCIL_ATTACHMENT_READ , DESCRIPTOR_BUFFER_READ , INPUT_ATTACHMENT_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_TILE_ATTACHMENT_READ , SHADER_TILE_ATTACHMENT_WRITE , SHADER_WRITE , UNIFORM_READ  |
| FRAGMENT_SHADING_RATE_ATTACHMENT   | FRAGMENT_SHADING_RATE_ATTACHMENT_READ                                                                                                                                                                                                                                                                                         |
| GEOMETRY_SHADER                    | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                                                    |
| HOST                               | HOST_READ , HOST_WRITE                                                                                                                                                                                                                                                                                                       |
| INDEX_INPUT                        | INDEX_READ                                                                                                                                                                                                                                                                                                                    |
| INVOCATION_MASK                    | INVOCATION_MASK_READ                                                                                                                                                                                                                                                                                                          |
| LATE_FRAGMENT_TESTS                | DEPTH_STENCIL_ATTACHMENT_READ , DEPTH_STENCIL_ATTACHMENT_WRITE                                                                                                                                                                                                                                                               |
| MESH_SHADER                        | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                                                    |
| MICROMAP_BUILD                     | MICROMAP_READ , MICROMAP_WRITE , SHADER_READ , TRANSFER_READ , TRANSFER_WRITE                                                                                                                                                                                                                                             |
| OPTICAL_FLOW                       | OPTICAL_FLOW_READ , OPTICAL_FLOW_WRITE                                                                                                                                                                                                                                                                                       |
| RAY_TRACING_SHADER                 | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                                                    |
| RESOLVE                            | TRANSFER_READ , TRANSFER_WRITE                                                                                                                                                                                                                                                                                               |
| SUBPASS_SHADER                     | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , INPUT_ATTACHMENT_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                           |
| TASK_SHADER                        | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                                                    |
| TESSELLATION_CONTROL_SHADER        | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                                                    |
| TESSELLATION_EVALUATION_SHADER     | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                                                    |
| TRANSFORM_FEEDBACK                 | TRANSFORM_FEEDBACK_COUNTER_READ , TRANSFORM_FEEDBACK_COUNTER_WRITE , TRANSFORM_FEEDBACK_WRITE                                                                                                                                                                                                                               |
| VERTEX_ATTRIBUTE_INPUT             | VERTEX_ATTRIBUTE_READ                                                                                                                                                                                                                                                                                                         |
| VERTEX_INPUT                       | INDEX_READ , VERTEX_ATTRIBUTE_READ                                                                                                                                                                                                                                                                                           |
| VERTEX_SHADER                      | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                                                    |
| VIDEO_DECODE                       | VIDEO_DECODE_READ , VIDEO_DECODE_WRITE                                                                                                                                                                                                                                                                                       |
| VIDEO_ENCODE                       | VIDEO_ENCODE_READ , VIDEO_ENCODE_WRITE                                                                                                                                                                                                                                                                                       |

Pipeline Barriers

  • Pipeline barriers also provide synchronization control within a command buffer, but at a single point, rather than with separate signal and wait operations. Pipeline barriers can  be used to control resource access within a single queue.

  • Gives control over which pipeline stages need to wait on previous pipeline stages when a command buffer is executed.

  • Nvidia: Minimize the use of barriers. A barrier may cause a GPU pipeline flush. We have seen redundant barriers and associated wait for idle operations as a major performance problem for ports to modern APIs.

  • Nvidia: Prefer a buffer/image barrier over a global memory barrier so the driver can better optimize and schedule it, unless a single memory barrier lets you merge many buffer/image barriers together.

  • Nvidia: Group barriers in one call to vkCmdPipelineBarrier2() . This way, the worst case can be picked instead of sequentially going through all barriers.

  • Nvidia: Don’t insert redundant barriers, as this limits parallelism; avoid read-to-read barriers.

  • vkCmdPipelineBarrier2() .

    • When submitted to a queue, it defines memory dependencies between commands that were submitted to the same queue before  it, and those submitted to the same queue after  it.

    • commandBuffer

      • Is the command buffer into which the command is recorded.

    • pDependencyInfo

      • VkDependencyInfo .

      • Specifies the dependency information for a synchronization command.

      • This structure defines a set of memory dependencies , as well as queue family ownership transfer operations  and image layout transitions .

      • Each member of pMemoryBarriers , pBufferMemoryBarriers , and pImageMemoryBarriers  defines a separate memory dependency .

      • dependencyFlags

      • memoryBarrierCount

        • Is the length of the pMemoryBarriers  array.

      • pMemoryBarriers

        • VkMemoryBarrier2 .

        • Specifies a global memory barrier.

        • srcStageMask

        • srcAccessMask

        • dstStageMask

        • dstAccessMask

      • bufferMemoryBarrierCount

        • Is the length of the pBufferMemoryBarriers  array.

      • pBufferMemoryBarriers

        • VkBufferMemoryBarrier2 .

        • Specifies a buffer memory barrier.

        • Defines a memory dependency  limited to a range of a buffer, and can  define a queue family ownership transfer operation  for that range.

        • Both access scopes  are limited to only memory accesses to buffer  in the range defined by offset  and size .

        • srcStageMask

        • srcAccessMask

        • dstStageMask

        • dstAccessMask

        • srcQueueFamilyIndex

        • dstQueueFamilyIndex

        • buffer

          • Is a handle to the buffer whose backing memory is affected by the barrier.

        • offset

          • Is an offset in bytes into the backing memory for buffer ; this is relative to the base offset as bound to the buffer (see vkBindBufferMemory ).

        • size

          • Is a size in bytes of the affected area of backing memory for buffer , or WHOLE_SIZE  to use the range from offset  to the end of the buffer.

      • imageMemoryBarrierCount

        • Is the length of the pImageMemoryBarriers  array.

      • pImageMemoryBarriers

        • VkImageMemoryBarrier2 .

        • Specifies an image memory barrier.

        • Defines a memory dependency  limited to an image subresource range, and can  define a queue family ownership transfer operation  and image layout transition  for that subresource range.

        • Image Transition :

          • If oldLayout  is not equal to newLayout , then the memory barrier defines an image layout transition for the specified image subresource range.

          • If this memory barrier defines a queue family ownership transfer operation , the layout transition is only executed once between the queues.

          • When the old and new layout are equal, the layout values are ignored - data is preserved no matter what values are specified, or what layout the image is currently in.

        • srcStageMask

        • srcAccessMask

        • dstStageMask

        • dstAccessMask

        • srcQueueFamilyIndex

        • dstQueueFamilyIndex

        • oldLayout

        • newLayout

        • image

          • Is a handle to the image affected by this barrier.

        • subresourceRange
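The parameters above can be sketched as a single image barrier recorded with vkCmdPipelineBarrier2() . A minimal fragment, assuming Vulkan 1.3 and pre-existing cmd  and image  handles (names are illustrative):

```cpp
// Transition a color attachment so a later fragment shader can sample it.
VkImageMemoryBarrier2 imageBarrier{};
imageBarrier.sType         = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2;
imageBarrier.srcStageMask  = VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT;
imageBarrier.srcAccessMask = VK_ACCESS_2_COLOR_ATTACHMENT_WRITE_BIT; // make writes available
imageBarrier.dstStageMask  = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT;
imageBarrier.dstAccessMask = VK_ACCESS_2_SHADER_SAMPLED_READ_BIT;    // make them visible to sampling
imageBarrier.oldLayout     = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
imageBarrier.newLayout     = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
imageBarrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED; // no ownership transfer
imageBarrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
imageBarrier.image            = image;
imageBarrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

VkDependencyInfo depInfo{};
depInfo.sType                   = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
depInfo.imageMemoryBarrierCount = 1;
depInfo.pImageMemoryBarriers    = &imageBarrier;

vkCmdPipelineBarrier2(cmd, &depInfo);
```

Setting both queue family indices to VK_QUEUE_FAMILY_IGNORED  means no queue family ownership transfer is performed.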

Execution Barrier

  • Every command you submit to Vulkan goes through a set of stages. Draw calls, copy commands and compute dispatches all go through pipeline stages one by one. This represents the heart of the Vulkan synchronization model.

  • Operations performed by synchronization commands (e.g. availability operations  and visibility operations ) are not executed by a defined pipeline stage. However other commands can still synchronize with them by using the synchronization scopes to create a dependency chain.

  • When we synchronize work in Vulkan, we synchronize work happening in these pipeline stages as a whole, and not individual commands of work.

  • Vulkan does not let you add fine-grained dependencies between individual commands. Instead you get to look at all work which happens in certain pipeline stages.

Memory Barriers

  • Execution order and memory order are two different things.

  • Memory barriers are the tools we can use to ensure that caches are flushed and our memory writes from commands executed before the barrier are available to the pending after-barrier commands. They are also the tool we can use to invalidate caches so that the latest data is visible to the cores that will execute after-barrier commands.

  • In contrast to execution barriers, these access masks only apply to the precise stages set in the stage masks, and are not extended to logically earlier and later stages.

  • GPUs are notorious for having multiple, incoherent caches which all need to be carefully managed to avoid glitched out rendering.

  • This means that just synchronizing execution alone is not enough to ensure that different units on the GPU can transfer data between themselves.

  • Memory being available  and memory being visible  are an abstraction over the fact that GPUs have incoherent caches.

  • For GPU reading operations from CPU-written data, a call to vkQueueSubmit  acts as a host memory dependency on any CPU writes to GPU-accessible memory, so long as those writes were made prior  to the function call.

  • If you need more fine-grained write dependency (you want the GPU to be able to execute some stuff in a batch while you're writing data, for example), or if you need to read data written by the GPU, you need an explicit dependency.

  • For in-batch GPU reading, this could be handled by an event; the host sets the event after writing the memory, and the command buffer operation that reads the memory first issues vkCmdWaitEvents  for that event. And you'll need to set the appropriate memory barriers and source/destination stages.

  • For CPU reading of GPU-written data, this could be an event, a timeline semaphore, or a fence.

  • But overall, CPU writes to GPU-accessible memory still need some form of synchronization.

Global Memory Barriers

  • VkMemoryBarrier2 .

  • A global memory barrier deals with access to any resource, and it’s the simplest form of a memory barrier.

  • In vkCmdPipelineBarrier2 , we are specifying 4 things to happen in order:

    • Wait for srcStageMask  to complete

    • Make all writes performed in possible combinations of srcStageMask  + srcAccessMask   available

    • Make available  memory visible  to possible combinations of dstStageMask  + dstAccessMask .

    • Unblock work in dstStageMask.

  • A common misconception I see is that _READ  flags are passed into srcAccessMask , but this is redundant .

    • It does not make sense to make reads available.

    • Ex : you don’t flush caches when you’re done reading data.
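The four steps above can be sketched as a global barrier between two compute dispatches. Note that srcAccessMask  contains only write bits, as described. Assumes an existing command buffer cmd  (illustrative name):

```cpp
// Dispatch A writes a storage buffer; dispatch B reads it.
VkMemoryBarrier2 barrier{};
barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER_2;
barrier.srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
barrier.srcAccessMask = VK_ACCESS_2_SHADER_STORAGE_WRITE_BIT; // writes only; no _READ flags
barrier.dstStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
barrier.dstAccessMask = VK_ACCESS_2_SHADER_STORAGE_READ_BIT;

VkDependencyInfo depInfo{};
depInfo.sType              = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
depInfo.memoryBarrierCount = 1;
depInfo.pMemoryBarriers    = &barrier;

vkCmdDispatch(cmd, 64, 1, 1);        // dispatch A: produces data
vkCmdPipelineBarrier2(cmd, &depInfo); // wait, make available, make visible, unblock
vkCmdDispatch(cmd, 64, 1, 1);        // dispatch B: consumes data
```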

Buffer Memory Barrier

  • We’re just restricting memory availability and visibility to a specific buffer.

  • TheMaister: No GPU I know of actually cares, I think it makes more sense to just use VkMemoryBarrier rather than bothering with buffer barriers.

Image Memory Barrier / Image Layout Transition

  • VkImageLayout .

  • Image subresources can be transitioned from one layout to another as part of a memory dependency  (e.g. by using an image memory barrier ).

  • Image layout transitions are done as part of an image memory barrier.

  • The layout transition happens in-between the make available  and make visible  stages of a memory barrier.

  • The layout transition itself is considered a read/write operation, and the rules are basically that memory for the image must be available  before the layout transition takes place.

  • After a layout transition, that memory is automatically made available  (but not visible !).

  • Basically, think of the layout transition as some kind of in-place data munging which happens in L2 cache somehow.

  • How :

    • If a layout transition is specified in a memory dependency.

  • When :

    • It happens-after the availability  operations in the memory dependency, and happens-before the visibility  operations.

    • Layout transitions that are performed via image memory barriers execute in their entirety in submission order , relative to other image layout transitions submitted to the same queue, including those performed by render passes.

    • This ordering of image layout transitions only applies if the implementation performs actual read/write operations during the transition.

    • An application must  not rely on ordering of image layout transitions to influence ordering of other commands.

  • Ensure :

    • Image layout transitions may perform read and write accesses on all memory bound to the image subresource range, so applications must  ensure that all memory writes have been made available  before a layout transition is executed.

  • Available  memory is automatically made visible  to a layout transition, and writes performed by a layout transition are automatically made available .

Old Layout
  • The old layout must  either be UNDEFINED , or match the current layout of the image subresource range.

    • If the old layout matches the current layout of the image subresource range, the transition preserves the contents of that range.

    • If the old layout is UNDEFINED , the contents of that range may  be discarded. This can provide performance or power benefits.

      • Nvidia: Use UNDEFINED  when the previous content of the image is not needed.

  • Tile-based architectures may be able to avoid flushing tile data to memory, and immediate style renderers may be able to achieve fast metadata clears to reinitialize frame buffer compression state, or similar.

  • If the contents of an attachment are not needed after a render pass completes, then applications should  use DONT_CARE .

  • Why Need the Old Layout in Vulkan Image Transitions .

    • Cool.

Recently allocated image
  • If we just allocated an image and want to start using it, all we need is a layout transition, and we don’t need to wait for anything before performing that transition.

  • It’s important to note that freshly allocated memory in Vulkan is always considered available  and visible  to all stages and access types. You cannot have stale caches when the memory was never accessed.
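A sketch of that first transition: since there is nothing to wait on and nothing to make available, the source scopes can be empty. Assumes Vulkan 1.3 and an illustrative image  handle:

```cpp
// First-use transition of a freshly allocated image, e.g. before a copy into it.
VkImageMemoryBarrier2 barrier{};
barrier.sType         = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2;
barrier.srcStageMask  = VK_PIPELINE_STAGE_2_NONE;  // nothing to wait for
barrier.srcAccessMask = VK_ACCESS_2_NONE;          // fresh memory: nothing to make available
barrier.dstStageMask  = VK_PIPELINE_STAGE_2_TRANSFER_BIT;
barrier.dstAccessMask = VK_ACCESS_2_TRANSFER_WRITE_BIT;
barrier.oldLayout     = VK_IMAGE_LAYOUT_UNDEFINED; // previous contents not needed
barrier.newLayout     = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.image            = image;
barrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

VkDependencyInfo depInfo{};
depInfo.sType                   = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
depInfo.imageMemoryBarrierCount = 1;
depInfo.pImageMemoryBarriers    = &barrier;

vkCmdPipelineBarrier2(cmd, &depInfo);
```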

Events / "Split Barriers"

  • A way to get overlapping work in-between barriers.

  • The idea of VkEvent  is to get some unrelated commands in-between the “before” and “after” sets of commands.

  • For advanced compute, this is a very important thing to know about, but not all GPUs and drivers can take advantage of this feature.

  • Nvidia: Use vkCmdSetEvent2  and vkCmdWaitEvents2  to issue an asynchronous barrier to avoid blocking execution.

Example
  • Example 1 :

    1. vkCmdDispatch

    2. vkCmdDispatch

    3. vkCmdSetEvent(event, srcStageMask = COMPUTE)

    4. vkCmdDispatch

    5. vkCmdWaitEvent(event, dstStageMask = COMPUTE)

    6. vkCmdDispatch

    7. vkCmdDispatch

    • The " before " set is now { 1 , 2 }, and the " after " set is { 6 , 7 }.

    • 4  here is not affected by any synchronization and it can fill in the parallelism “bubble” we get when draining the GPU of work from 1 , 2 , 3 .
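Example 1 could be recorded roughly as follows (synchronization2 API; cmd , event , and the dispatch sizes are illustrative):

```cpp
// Memory dependency between the "before" set {1,2} and the "after" set {6,7}:
// compute writes made available and visible to compute reads.
VkMemoryBarrier2 barrier{};
barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER_2;
barrier.srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
barrier.srcAccessMask = VK_ACCESS_2_SHADER_STORAGE_WRITE_BIT;
barrier.dstStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
barrier.dstAccessMask = VK_ACCESS_2_SHADER_STORAGE_READ_BIT;

VkDependencyInfo dep{};
dep.sType              = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
dep.memoryBarrierCount = 1;
dep.pMemoryBarriers    = &barrier;

vkCmdDispatch(cmd, 64, 1, 1);            // 1: "before" set
vkCmdDispatch(cmd, 64, 1, 1);            // 2: "before" set
vkCmdSetEvent2(cmd, event, &dep);        // 3: signal half of the split barrier
vkCmdDispatch(cmd, 64, 1, 1);            // 4: unrelated work, fills the bubble
vkCmdWaitEvents2(cmd, 1, &event, &dep);  // 5: wait half of the split barrier
vkCmdDispatch(cmd, 64, 1, 1);            // 6: "after" set
vkCmdDispatch(cmd, 64, 1, 1);            // 7: "after" set
```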


Semaphores and Fences

  • These objects are signaled as part of a vkQueueSubmit .

  • To signal a semaphore or fence, all previously submitted commands to the queue must complete.

  • If this were a regular pipeline barrier, we would have srcStageMask = ALL_COMMANDS . However, we also get a full memory barrier, in the sense that all pending writes are made available.  Essentially, srcAccessMask = MEMORY_WRITE .

  • Signaling a fence or semaphore works like a full cache flush. Submitting commands to the Vulkan queue makes all memory access performed by host visible to all stages and access masks. Basically, submitting a batch issues a cache invalidation on host visible memory.

  • A common mistake is to think that you need to do this invalidation manually when the CPU is writing into staging buffers or similar:

    • srcStageMask = HOST

    • dstStageMask = TRANSFER

    • srcAccessMask = HOST_WRITE

    • dstAccessMask = TRANSFER_READ

    • If the write happened before vkQueueSubmit , this is automatically done for you.

    • This kind of barrier is necessary if you are using vkCmdWaitEvents  where you wait for host to signal the event with vkSetEvent . In that case, you might be writing the necessary host data after   vkQueueSubmit  was called, which means you need a pipeline barrier like this. This is not exactly a common use case, but it’s important to understand when these API constructs are useful.

Semaphore
  • VkSemaphore

  • Semaphores facilitate GPU <-> GPU synchronization across Vulkan queues.

    • Used for syncing multiple command buffer submissions one after the other.

    • The CPU continues running without blocking.

  • Implicit memory guarantees when waiting for a Semaphore :

    • While signalling a semaphore makes all memory available , waiting for a semaphore makes memory visible .

    • This basically means you do not need a memory barrier if you use synchronization with semaphores, since signal/wait pairs of semaphores work like a full memory barrier.

    • Example :

      • Queue 1 writes to an SSBO in compute, and consumes that buffer as a UBO in a fragment shader in queue 2.

      • We’re going to assume the buffer was created with QUEUE_FAMILY_CONCURRENT .

      • Queue 1

        • vkCmdDispatch

        • vkQueueSubmit(signal = my_semaphore)

        • There is no pipeline barrier needed here.

        • Signalling the semaphore waits for all commands, and all writes in the dispatch are made available  to the device before the semaphore is actually signaled.

      • Queue 2

        • vkCmdBeginRenderPass

        • vkCmdDraw

        • vkCmdEndRenderPass

        • vkQueueSubmit(wait = my_semaphore, pDstWaitStageMask = FRAGMENT_SHADER)

        • When we wait for the semaphore, we specify which stages should wait for this semaphore, in this case the FRAGMENT_SHADER  stage.

        • All relevant memory access is automatically made visible , so we can safely access UNIFORM_READ  in FRAGMENT_SHADER  stage, without having extra barriers.

        • The semaphores take care of this automatically, nice!

  • Examples :

    • Basic signaling / waiting :

      • Let’s say we have semaphore S and queue operations A and B that we want to execute in order.

      • What we tell Vulkan is that operation A will 'signal' semaphore S when it finishes executing, and operation B will 'wait' on semaphore S before it begins executing.

      • When operation A finishes, semaphore S will be signaled, while operation B won’t start until S is signaled.

      • After operation B begins executing, semaphore S is automatically reset back to being unsignaled, allowing it to be used again.

    • Image Transition on Swapchain Images :

      • We need to wait for the image to be acquired, and only then can we perform a layout transition.

      • The best way to do this is to use pDstWaitStageMask = COLOR_ATTACHMENT_OUTPUT , and then use srcStageMask = COLOR_ATTACHMENT_OUTPUT  in a pipeline barrier which transitions the swapchain image after semaphore is signaled.
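The basic signal/wait pattern maps onto vkQueueSubmit  like this (classic pre-synchronization2 VkSubmitInfo ; handle names are illustrative):

```cpp
// Wait on imageAvailable before color attachment output; signal renderFinished
// once every command in this submission completes.
VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;

VkSubmitInfo submit{};
submit.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit.waitSemaphoreCount   = 1;
submit.pWaitSemaphores      = &imageAvailable;  // "wait" semaphore
submit.pWaitDstStageMask    = &waitStage;       // which stages block on the wait
submit.commandBufferCount   = 1;
submit.pCommandBuffers      = &cmd;
submit.signalSemaphoreCount = 1;
submit.pSignalSemaphores    = &renderFinished;  // "signal" semaphore

vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);
```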

  • Types of Semaphores :

    • Binary Semaphores :

      • A binary semaphore is either unsignaled or signaled.

      • It begins life as unsignaled.

      • The way we use a binary semaphore to order queue operations is by providing the same semaphore as a 'signal' semaphore in one queue operation and as a 'wait' semaphore in another queue operation.

      • Only binary semaphores will be used in this tutorial; further mentions of the term semaphore refer exclusively to binary semaphores.

    • Timeline Semaphores :

      • A timeline semaphore carries a monotonically increasing 64-bit value; signal and wait operations target specific counter values (core in Vulkan 1.2).

  • Correctly using the Semaphore for vkQueuePresent :

    • Swapchain Semaphore Reuse .

    • Since Vulkan SDK 1.4.313 , the validation layer reports cases where the present wait semaphore is not used safely:

      • This is currently reported as VUID-vkQueueSubmit-pSignalSemaphores-00067  or you may see "your VkSemaphore is being signaled by VkQueue, but it may still be in use by VkSwapchainKHR"

    • In this context, safely  means that the Vulkan specification guarantees the semaphore is no longer in use and can be reused.

    • The problem :

      • vkQueuePresentKHR  is different from the vkQueueSubmit  family of functions in that it does not provide a way to signal a semaphore or a fence (without additional extensions).

      • This means there is no way to wait for the presentation signal directly. It also means we don’t know whether VkPresentInfoKHR::pWaitSemaphores  are still in use by the presentation operation.

      • If vkQueuePresentKHR  could signal, then waiting on that signal would confirm that the present queue operation has finished — including the wait on VkPresentInfoKHR::pWaitSemaphores .

      • In summary, it’s not obvious when it’s safe to reuse present wait semaphores.

      • The Vulkan specification does not guarantee that waiting on a vkQueueSubmit  fence also synchronizes presentation operations.

    • The reuse of presentation resources should rely on vkAcquireNextImageKHR  or additional extensions, rather than on vkQueueSubmit  fences.

    • Solution options :

      1. Allocate one "submit finished" semaphore per swapchain image instead of per in-flight frame.

        • Allocate the submit_semaphores  array based on the number of swapchain images (instead of the number of in-flight frames)

        • Index this array using the acquired swapchain image index (instead of the current in-flight frame index)

      2. Using EXT_swapchain_maintenance1 .
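Option 1 can be sketched as indexing the signal semaphore by the acquired image index rather than the in-flight frame index (all names illustrative):

```cpp
// acquireSemaphores is sized per in-flight frame;
// submitSemaphores is sized per swapchain image (option 1).
uint32_t imageIndex = 0;
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                      acquireSemaphores[frameIndex], // per in-flight frame
                      VK_NULL_HANDLE, &imageIndex);

// ... record commands, then signal the semaphore owned by this *image*:
submit.signalSemaphoreCount = 1;
submit.pSignalSemaphores    = &submitSemaphores[imageIndex];
vkQueueSubmit(queue, 1, &submit, inFlightFence);

VkPresentInfoKHR present{};
present.sType              = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
present.waitSemaphoreCount = 1;
present.pWaitSemaphores    = &submitSemaphores[imageIndex]; // safe: reused only after
present.swapchainCount     = 1;                             // this image is re-acquired
present.pSwapchains        = &swapchain;
present.pImageIndices      = &imageIndex;
vkQueuePresentKHR(queue, &present);
```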

Fences
  • VkFence

  • Fences facilitate GPU -> CPU synchronization.

    • Used to know if a command buffer has finished being executed on the GPU.

  • While signalling a fence makes all memory available, it does not make that memory available to the CPU, only within the device. This is where the dstStageMask = PIPELINE_STAGE_HOST  and dstAccessMask = ACCESS_HOST_READ  flags come in: if you intend to read back data on the CPU, you must issue a pipeline barrier that makes the memory available to the HOST as well.

  • In our mental model, we can think of this as flushing the GPU L2 cache out to GPU main memory, so that CPU can access it over some bus interface.

  • In order to signal that fence, any pending writes to that memory must have been made available, so even recycled memory can be safely reused without a memory barrier. This point is kind of subtle, but it really helps your sanity not having to inject memory barriers everywhere.

  • Usage :

    • Similar to semaphores, fences are either in a signaled or unsignaled state.

    • Whenever we submit work to execute, we can attach a fence to that work. When the work is finished, the fence will be signaled.

    • Then we can make the CPU wait for the fence to be signaled, guaranteeing that the work has finished before the CPU continues.

    • Fences must be reset manually to put them back into the unsignaled state.

      • This is because fences are used to control the execution of the CPU, and so the CPU gets to decide when to reset the fence.

      • Contrast this to semaphores which are used to order work on the GPU without the CPU being involved.

    • Unlike the semaphore, the fence does  block CPU execution.

      • In general, it is preferable to not block the host unless necessary.

      • We want to feed the GPU and the host with useful work to do. Waiting on fences to signal is not useful work.

      • Thus, we prefer semaphores, or other synchronization primitives not yet covered, to synchronize our work.
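The usage pattern above, sketched for a typical frame loop (illustrative names):

```cpp
// Block until the previous submission using this fence has finished on the GPU.
vkWaitForFences(device, 1, &inFlightFence, VK_TRUE, UINT64_MAX);
// Fences must be reset manually back to the unsignaled state.
vkResetFences(device, 1, &inFlightFence);

// ... record this frame's command buffer ...

// Attach the fence to the submission; it is signaled when the work completes.
vkQueueSubmit(queue, 1, &submit, inFlightFence);
```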

  • Example :

    • Taking a screenshot :

      • Once we have already done the necessary work on the GPU, we now need to transfer the image from the GPU over to the host and then save the memory to a file.

      • We have command buffer A which executes the transfer and fence F. We submit command buffer A with fence F, then immediately tell the host to wait for F to signal. This causes the host to block until command buffer A finishes execution.

      • Thus, we are safe to let the host save the file to disk, as the memory transfer has completed.

      • Unlike the semaphore example, this example does  block host execution. This means the host won’t do anything except wait until the execution has finished. For this case, we had to make sure the transfer was complete before we could save the screenshot to disk.

Main Loop Synchronization

Command Buffers

  • Commands in Vulkan, like drawing operations and memory transfers, are not executed directly using function calls. You have to record all the operations you want to perform in command buffer objects.

  • The advantage of this is that when we are ready to tell Vulkan what we want to do, all the commands are submitted together. Vulkan can more efficiently process the commands since all of them are available together.

  • In addition, this allows command recording to happen in multiple threads  if so desired.

Command Pools

  • Command pools are used to create and allocate Command Buffers.

  • Command pools are opaque objects that command buffer memory is allocated from, and which allow the implementation to amortize the cost of resource creation across multiple command buffers.

Creation
  • vkCreateCommandPool() .

    • device

      • Is the logical device that creates the command pool.

    • pAllocator

    • pCommandPool

      • Is a pointer to a VkCommandPool  handle in which the created pool is returned.

    • pCreateInfo

      • VkCommandPoolCreateInfo .

      • queueFamilyIndex

        • Designates a queue family as described in section Queue Family Properties . All command buffers allocated from this command pool must  be submitted on queues from the same queue family.

        • Command buffers are executed by submitting them on one of the device queues (graphics and presentation queues, for example).

        • Each command pool can only allocate command buffers that are submitted on a single type of queue.

      • flags

        • Is a bitmask indicating usage behavior for the pool and command buffers allocated from it.

        • COMMAND_POOL_CREATE_TRANSIENT

          • Hint that command buffers are rerecorded with new commands very often (may change memory allocation behavior)

        • COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER

          • Allow command buffers to be rerecorded individually, without this flag they all have to be reset together

          • If we record a command buffer every frame, we want to be able to reset and rerecord over it, thus, this flag should be enabled so a command buffer can be reset individually.

        • COMMAND_POOL_CREATE_PROTECTED

          • Specifies that command buffers allocated from the pool are protected command buffers.
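Putting the creation parameters together, a minimal sketch (assuming device  and graphicsQueueFamilyIndex  already exist):

```cpp
VkCommandPoolCreateInfo poolInfo{};
poolInfo.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
// Allow command buffers from this pool to be reset individually.
poolInfo.flags = VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT;
// All buffers allocated here must be submitted to queues of this family.
poolInfo.queueFamilyIndex = graphicsQueueFamilyIndex;

VkCommandPool commandPool;
if (vkCreateCommandPool(device, &poolInfo, nullptr, &commandPool) != VK_SUCCESS) {
    // handle error
}
```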

Management
  • Command pools manage the memory that is used to store the buffers, and command buffers are allocated from them.

  • Destroying a Command Pool destroys the Command Buffers allocated from it.

  • Reset the whole Command Pool :

    • vkResetCommandPool .

      • Resetting a command pool recycles all of the resources from all of the command buffers allocated from the command pool back to the command pool. All command buffers that have been allocated from the command pool are put in the initial state .

      • Any primary command buffer allocated from another VkCommandPool  that is in the recording or executable state  and has a secondary command buffer allocated from commandPool  recorded into it, becomes invalid .

  • Free individual Command Buffers :

    • vkFreeCommandBuffers() .

      • device

        • Is the logical device that owns the command pool.

      • commandPool

        • Is the command pool from which the command buffers were allocated.

      • commandBufferCount

        • Is the length of the pCommandBuffers  array.

      • pCommandBuffers

        • Is a pointer to an array of handles of command buffers to free.

      • Any primary command buffer that is in the recording or executable state  and has any element of pCommandBuffers  recorded into it, becomes invalid .

Command Buffer

Creation / Allocation
  • VkCommandBuffer .

    • Encodes GPU commands.

    • All execution that is performed on the GPU itself (not in the driver) has to be encoded in a command buffer.

  • vkAllocateCommandBuffers() .

    • pAllocateInfo

      • VkCommandBufferAllocateInfo .

      • commandPool

        • Is the command pool from which the command buffers are allocated.

      • level

        • VkCommandBufferLevel .

        • Specifies if the allocated command buffers are primary or secondary command buffers.

        • COMMAND_BUFFER_LEVEL_PRIMARY

          • Command Buffer Primary.

        • COMMAND_BUFFER_LEVEL_SECONDARY

          • Command Buffer Secondary.

      • commandBufferCount

        • Is the number of command buffers to allocate from the pool.

    • pCommandBuffers

      • Is a pointer to an array of Command Buffer handles in which the resulting command buffer objects are returned. The array must  be at least the length specified by the commandBufferCount  member of pAllocateInfo . Each allocated command buffer begins in the initial state.
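A minimal allocation sketch, assuming an existing device  and commandPool :

```cpp
VkCommandBufferAllocateInfo allocInfo{};
allocInfo.sType              = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
allocInfo.commandPool        = commandPool;
allocInfo.level              = VK_COMMAND_BUFFER_LEVEL_PRIMARY; // directly submittable
allocInfo.commandBufferCount = 1;

VkCommandBuffer cmd; // begins in the initial state
if (vkAllocateCommandBuffers(device, &allocInfo, &cmd) != VK_SUCCESS) {
    // handle error
}
```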

Lifecycle
  • Lifecycle .


  • Reset a single Command Buffer :

    • Once a command buffer has been submitted, it is still “alive” and being consumed by the GPU; at this point it is NOT safe to reset it yet. You need to make sure that the GPU has finished executing all of the commands from that command buffer before you can reset and reuse it.

    • vkResetCommandBuffer() .

      • commandBuffer

        • Is the command buffer to reset. The command buffer can  be in any state other than pending , and is moved into the initial state .

      • flags

      • Any primary command buffer that is in the recording or executable state  and has commandBuffer  recorded into it, becomes invalid .

      • After a command buffer is reset, any objects or memory specified by commands recorded into the command buffer must  no longer be accessed when the command buffer is accessed by the implementation.

    • If the command buffer was already recorded once, then a call to vkBeginCommandBuffer  will implicitly reset it.

Levels
  • Primary :

    • Only these can be submitted to queues for execution.

    • Cannot be called from other command buffers.

  • Secondary :

    • Cannot be submitted directly, but can be called from primary command buffers.

    • "We won’t make use of the secondary command buffer functionality here, but you can imagine that it’s helpful to reuse common operations from primary command buffers."

    • vkCmdExecuteCommands() .

      • A primary command buffer would use this to execute a secondary command buffer.

    • Re-recording :

      • If a secondary moves to the invalid state or the initial state, then all primary buffers it is recorded in move to the invalid state. A primary moving to any other state does not affect the state of a secondary recorded in it.

      • So, when a secondary command buffer is re-recorded, the primary becomes invalid.

      • Eve: "It is not capturing a reference to a command buffer, it is going through and copying all the commands in the command buffer into itself."

Command Types
  • Action-Type, State-Type, Sync-Type.


Command Buffer Recording

  • Writes the commands we want to execute into a command buffer.

  • It’s not possible to append commands to a buffer at a later time.

  • vkBeginCommandBuffer() .

    • commandBuffer

      • Is the handle of the command buffer which is to be put in the recording state.

    • pBeginInfo

      • VkCommandBufferBeginInfo .

      • Specifies some details about the usage of this specific command buffer.

      • flags

        • VkCommandBufferUsageFlagBits .

        • Specifies how we’re going to use the command buffer.

        • COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT

          • The command buffer will be rerecorded right after executing it once.

        • COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE

          • This is a secondary command buffer that will be entirely within a single render pass.

        • COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE

          • The command buffer can be resubmitted while it is also already pending execution.

        • None of these flags are applicable for us right now.

      • pInheritanceInfo

        • VkCommandBufferInheritanceInfo .

          • If the command buffer is a secondary command buffer, then the VkCommandBufferInheritanceInfo  structure defines any state that will be inherited from the primary command buffer:

        • Used if commandBuffer  is a secondary command buffer. If this is a primary command buffer, then this value is ignored.

        • It specifies which state to inherit from the calling primary command buffers.

  • vkEndCommandBuffer() .

    • The command buffer must  have been in the recording state , and, if successful, is moved to the executable state .

    • If there was an error during recording, the application will be notified by an unsuccessful return code returned by vkEndCommandBuffer , and the command buffer will be moved to the invalid state .
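The begin/end pair sketched together (assuming an allocated command buffer cmd ):

```cpp
VkCommandBufferBeginInfo beginInfo{};
beginInfo.sType            = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
beginInfo.flags            = 0;       // no usage flags needed here
beginInfo.pInheritanceInfo = nullptr; // only relevant for secondary command buffers

vkBeginCommandBuffer(cmd, &beginInfo); // moves cmd to the recording state

// ... record draw/copy/dispatch commands ...

if (vkEndCommandBuffer(cmd) != VK_SUCCESS) { // executable state on success,
    // invalid state on error; handle it here
}
```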

Pre-recording

  • "Many early Vulkan tutorials and documents recommended writing a command buffer once and re-using it wherever possible. In practice however re-use rarely has the advertized performance benefit while incurring a non-trivial development burden due to the complexity of implementation. While it may appear counterintuitive, as re-using computed data is a common optimization, managing a scene with objects being added and removed as well as techniques such as frustum culling which vary the draw calls issued on a per frame basis make reusing command buffers a serious design challenge. It requires a caching scheme to manage command buffers and maintaining state for determining if and when re-recording becomes necessary. Instead, prefer to re-record fresh command buffers every frame. If performance is a problem, recording can be multithreaded as well as using secondary command buffers for non-variable draw calls, like post processing."

Multi-threading Recording

  • Usage of secondary command buffers for Vulkan Multithreaded Recording .

    • There's an example code section.

  • External synchronization

    • A type of synchronization required  of the application, where parameters defined to be externally synchronized must  not be used simultaneously in multiple threads.

  • Internal Synchronization

    • A type of synchronization required  of the implementation, where parameters not defined to be externally synchronized may  require internal mutexing to avoid multithreaded race conditions.

  • Any object parameters that are not labeled as externally synchronized are either not mutated by the command or are internally synchronized.

  • Additionally, certain objects related to a command’s parameters (e.g. command pools and descriptor pools) may  be affected by a command, and must  also be externally synchronized.

Queues
  • Only a single thread can be submitting to a given queue at any time. If you want multiple threads doing VkQueueSubmit , then you need to create multiple queues.

  • As the number of queues can be as low as 1 on some devices, what engines tend to do is something similar to the pipeline compile thread or the OpenGL API call thread: have a thread dedicated to just doing VkQueueSubmit .

  • As VkQueueSubmit  is a very expensive operation, this can bring a very nice speedup: the time spent executing that call is spent on a second thread, and the main logic of the engine doesn’t have to stop.

  • Data upload is another area that is very often multithreaded. Here, you have a dedicated IO thread that loads assets from disk, and said IO thread has its own queue and command allocators, ideally a transfer queue. This way it is possible to upload assets at a speed completely separate from the main frame loop, so if it takes half a second to upload a set of big textures, you don’t get a hitch. To do that, you need to create a transfer or async-compute queue (if available) and dedicate it to the loader thread. Once you have that, it’s similar to what was described for the pipeline compiler thread: the IO thread communicates through a parallel queue with the main simulation loop to upload data asynchronously. Once a transfer has been uploaded, and a Fence confirms it has finished, the IO thread can send the info to the main loop, and the engine can connect the new textures or models into the renderer.
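
  • The dedicated submit thread described above can be sketched without any real Vulkan calls. All names and types here ( Submission , SubmitThread ) are hypothetical placeholders; the point is only that a single worker thread is ever the one "submitting", which is what the one-thread-per-queue rule requires:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Placeholder for whatever data a real engine would need per submission.
struct Submission { int frameIndex; };

class SubmitThread {
public:
    SubmitThread() : worker_(&SubmitThread::run, this) {}
    ~SubmitThread() { if (worker_.joinable()) stopAndCount(); }

    // Called from the main loop: hand work to the submit thread and return
    // immediately, so the main logic never blocks on the expensive submit.
    void enqueue(Submission s) {
        { std::lock_guard<std::mutex> lock(mutex_); pending_.push(s); }
        cv_.notify_one();
    }

    // Drain remaining work, join the worker, and report how many
    // submissions it performed.
    int stopAndCount() {
        { std::lock_guard<std::mutex> lock(mutex_); done_ = true; }
        cv_.notify_one();
        worker_.join();
        return submitted_;
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            cv_.wait(lock, [this] { return done_ || !pending_.empty(); });
            while (!pending_.empty()) {
                pending_.pop();  // a real engine would call vkQueueSubmit here
                ++submitted_;
            }
            if (done_) return;
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<Submission> pending_;
    int submitted_ = 0;
    bool done_ = false;
    std::thread worker_;  // declared last so the other members exist first
};
```

  • The same shape works for the IO/upload thread: replace the "submit" step with a staging-buffer copy on a transfer queue and a fence wait.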

Command Pools
  • When you record command buffers, their command pools can only be used from one thread at a time. While you can create multiple command buffers from a command pool, you can’t record into those command buffers from multiple threads. If you want to record command buffers from multiple threads, then you will need more command pools, one per thread.

  • Secondary Command Buffers :

    • Vulkan command buffers have a system for primary and secondary command buffers. The primary buffers are the ones that open and close RenderPasses, and can get directly submitted to a queue. Secondary command buffers are used as “child” command buffers that execute as part of a primary one.

    • Their main purpose is multithreading.

    • Secondary command buffers can’t be submitted to a queue on their own.

  • Command Pools are a system to allow recording command buffers across multiple threads.

    • They enable different threads to use different allocators, without internal synchronization on each use.

  • A single command pool must be externally synchronized ; it must not be accessed simultaneously from multiple threads.

    • That includes use via recording commands on any command buffers allocated from the pool, as well as operations that allocate, free, and reset command buffers or the pool itself.

  • If you want multithreaded command recording, you need more VkCommandPool  objects. By using a separate command pool in each host-thread the application can create multiple command buffers in parallel without any costly locks.

    • For that reason, we will pair a command buffer with its command allocator.

  • You can allocate as many VkCommandBuffer  as you want from a given pool, but you can only record commands from one thread at a time.

  • Command buffers can be recorded on multiple threads while having a relatively light thread handle the submissions.

  • If two commands access the same object or memory and at least one of the commands declares the object to be externally synchronized, then the caller must  guarantee not only that the commands do not execute simultaneously, but also that the two commands are separated by an appropriate memory barrier (if needed).

  • Similarly, if a Vulkan command accesses a non-const memory parameter and the application also accesses that memory, or if the application writes to that memory and the command accesses it as a const memory parameter, the application must  ensure the accesses are properly synchronized with a memory barrier if needed.

  • Memory barriers are particularly relevant for hosts based on the ARM CPU architecture, which is more weakly ordered than many developers are accustomed to from x86/x64 programming. Fortunately, most higher-level synchronization primitives (like the pthread library) perform memory barriers as a part of mutual exclusion, so mutexing Vulkan objects via these primitives will have the desired effect.
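
  • The one-pool-per-thread pattern above can be sketched with placeholder types (no real Vulkan; CommandPool  and CommandBuffer  here are stand-ins): each thread records only into its own pool, so external synchronization is satisfied without any mutex, and a single thread then gathers everything for submission:

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <vector>

struct CommandBuffer { std::vector<std::string> commands; };

// Stand-in for VkCommandPool: owns the buffers allocated from it.
struct CommandPool {
    std::vector<CommandBuffer> buffers;
    CommandBuffer& allocate() { buffers.emplace_back(); return buffers.back(); }
};

std::vector<CommandBuffer> recordInParallel(int threadCount, int drawsPerThread) {
    std::vector<CommandPool> pools(threadCount);  // one pool per thread
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t) {
        threads.emplace_back([&pools, t, drawsPerThread] {
            // Each thread touches only pools[t], so no locking is needed.
            CommandBuffer& cb = pools[t].allocate();
            for (int d = 0; d < drawsPerThread; ++d)
                cb.commands.push_back("draw " + std::to_string(t) + ":" +
                                      std::to_string(d));
        });
    }
    for (auto& th : threads) th.join();
    // After the join, a single (light) thread gathers the recorded buffers
    // for submission, mirroring the dedicated submit thread.
    std::vector<CommandBuffer> result;
    for (auto& p : pools)
        for (auto& cb : p.buffers) result.push_back(cb);
    return result;
}
```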

Pipelines

  • In Vulkan, to execute code on the GPU, we need to set up a pipeline.

  • There are two types of pipelines, Graphics and Compute:

    • Compute pipelines :

      • Are much simpler, because they only require the data for the shader code, and the layout for the descriptors used for data bindings.

    • Graphics pipelines :

      • Have to configure a considerable amount of state for all of the fixed-function hardware in the GPU such as color blending, depth testing, or geometry formats.

  • Both types of pipelines share the shader modules and the layouts, which are built in the same way.

  • VkPipeline

Pipeline Layout
  • A collection of DescriptorSetLayouts  and PushConstantRanges  defining its push constant usage.

  • PipelineLayouts for a graphics and compute pipeline are made in the same way, and they must be created before the pipeline itself.

  • VkPipelineLayout .

    • VkPipelineLayoutCreateInfo .

      • Structure specifying the parameters of a newly created pipeline layout object.

      • setLayoutCount

        • Is the number of descriptor sets included in the pipeline layout.

      • pSetLayouts

        • Is a pointer to an array of VkDescriptorSetLayout  objects.

        • The implementation must  not access these objects outside of the duration of the command this structure is passed to.

    • vkCreatePipelineLayout() .

      • pCreateInfo

      • flags

      • setLayoutCount

      • pSetLayouts

        • Is a pointer to an array of VkDescriptorSetLayout  objects. The implementation must  not access these objects outside of the duration of the command this structure is passed to.

      • pushConstantRangeCount

        • Is the number of push constant ranges included in the pipeline layout.

      • pPushConstantRanges

        • Is a pointer to an array of VkPushConstantRange  structures defining a set of push constant ranges for use in a single pipeline layout. In addition to descriptor set layouts, a pipeline layout also describes how many push constants can  be accessed by each stage of the pipeline.
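
  • The spec constrains each push constant range: offset  and size  must be multiples of 4, size  must be nonzero, and the range must fit within the maxPushConstantsSize  device limit (whose required minimum is 128 bytes). A sketch of those rules, using a stand-in struct rather than the real Vulkan header:

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for VkPushConstantRange (stage flags omitted for brevity).
struct PushConstantRange {
    uint32_t offset;
    uint32_t size;
};

bool isValidPushConstantRange(const PushConstantRange& r,
                              uint32_t maxPushConstantsSize) {
    if (r.size == 0) return false;
    if (r.offset % 4 != 0 || r.size % 4 != 0) return false;  // multiples of 4
    if (r.offset >= maxPushConstantsSize) return false;
    if (r.size > maxPushConstantsSize - r.offset) return false;  // must fit
    return true;
}
```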

Mesh Shaders

Support
  • (2025-09-12)

  • It is important to note that while portability between APIs can be achieved, portability in performance among vendors is much harder. This is one of the reasons why this extension has not been released as a ratified KHR extension and Khronos continues to investigate improvements to geometry rasterization.

  • There are further aspects that can influence the performance of mesh shaders in a vendor dependent way:

    • The number of maximum output vertices and primitives that a mesh shader is compiled with.

    • The number of per-vertex and per-primitive output attributes that are passed to fragment shaders. For example, it may be beneficial to fetch additional attributes in the fragment shader and interpolate them via hardware barycentrics to reduce the output space of the mesh shader.

    • The complexity of the culling performed in the mesh shader. For example details regarding the per-vertex and/or per-primitive culling with compact outputs compared to letting the hardware perform culling.

    • The usage of additional shared memory. If possible developers should use subgroup operations (such as shuffle) instead.

    • The task payload size.

    • Task shaders may add overhead, use them only when they can cull a meaningful number of primitives or when actual geometry amplification is desired.

    • Do not try to reimplement the fixed-function pipeline, strive for simpler algorithms instead.

Motivation
  • The current state of the Graphics Pipeline is not a direct mapping of how a GPU operates.

  • There's a lot of Per Vertex -> Per Primitive -> Per Vertex -> Per Primitive happening inside a Graphics Pipeline.

  • The idea is to use the flexibility of Compute Shaders and use the GPU closer to how it actually operates.

  • Mesh and Task shaders follow the compute programming model and use threads cooperatively to generate meshes within a workgroup. The vertex and index data for these meshes are written similarly to shared memory in compute shaders.

  • Mesh shader output is directly consumed by the rasterizer, as opposed to the previous approach of using a compute dispatch followed by an indirect draw.

  • Mesh Shading applications can avoid preallocation of output buffers.

  • Before deciding to use mesh shaders, developers should ensure they are a good fit for their application. The traditional pipeline may still be best suited to many use cases, and it may not be trivial to improve performance using the mesh shading pipeline given the long evolution and optimization efforts applied to the traditional pipeline stages.

  • Applications and games dealing with high geometric complexity  can, however, benefit from the flexibility of the two-stage approach, which allows efficient culling , level-of-detail  techniques as well as procedural generation .

  • Compared to the traditional pipeline, the mesh shaders allow easy access to the topology of the generated primitives and developers are free to repurpose the threads to do both vertex shading and primitive shading work. This is in contrast to tessellation shaders, which, while fast, provide very limited control over the triangles created, and geometry shaders, which use a single-thread programming model that is inefficient for modern streaming processors.

Task Shader
  • Is optional and provides a way to implement geometry amplification by creating variable mesh shader workgroups directly in the pipeline. Task shader workgroups can output an optional payload, which is visible as read-only input to all its child mesh shader workgroups.

  • A Task Shader decides how many Mesh Shaders you would like to run.

Meshlets / Triangle Clusters
  • When rasterizing geometry, mesh shaders typically make use of pre-computed triangle clusters with an upper bound on the number of vertices and triangles, also sometimes referred to as meshlets. Because task and mesh shaders, like compute, have only workgroup and invocation indices as input, all data fetching is handled by the application directly, which entirely removes fixed-function vertex processing and input assembly. This allows developers to be flexible in the storage of mesh data in both vertex and primitive topology representations. Another very common technique is to leverage the task shader and let one local invocation test one cluster for visibility. Through the use of subgroup operations developers can compute and write out information about the visible clusters into the task shader payload.

  • The meshlet / primitive cluster dimensions can have an especially big impact for the developer, as when streaming it is ideal to store assets with a fixed clustering in advance. Vendors may have different performance recommendations and so we suggest the use of smaller cluster sizes that work equally well across multiple vendors and process multiple small clusters at once on implementations that perform better with larger clusters. In this area we advise developers to experiment and consult with their hardware vendors for recommendations.

Using it
What a Mesh Shader enables
  • You can do very early culling.

  • It can be faster than the classical Graphics Pipeline, if correctly optimized.

  • Mesh Shader output Execution Mode :

    • The mesh stage will set either OutputPoints , OutputLinesEXT , or OutputTrianglesEXT

    #extension GL_EXT_mesh_shader : require
    
    // Only 1 of the 3 is allowed
    layout(points) out;
    layout(lines) out;
    layout(triangles) out;
    

Cluster Culling Shader

Graphics Pipeline

  • The graphics pipeline is required for all common drawing operations.

  • Holds the state of the GPU needed to draw. For example: shaders, rasterization options, depth settings.

  • It describes the configurable state of the graphics card, like the viewport size and depth buffer operation and the programmable state using VkShaderModule objects.

Stages
  • Disabling stages :

    • The tessellation and geometry stages can be disabled if you are just drawing simple geometry.

    • If you are only interested in depth values, then you can disable the fragment shader stage, which is useful for shadow map  generation.

  • Fixed-function stages :

    • Allow you to tweak their operations using parameters, but the way they work is predefined.

    • Dynamic State :

      • While most  of the pipeline state needs to be baked into the pipeline state, a limited amount of the state can actually be changed without recreating the pipeline at draw time.

      • Examples are the size of the viewport, line width and blend constants.

      • If you want to use dynamic state and keep these properties out, then you’ll have to fill in a VkPipelineDynamicStateCreateInfo  struct.

      • This will cause the configuration of these values to be ignored , and you will be able (and required) to specify the data at drawing time.

      • This results in a more flexible setup and is widespread for things like viewport and scissor state, which would result in a more complex setup when being baked into the pipeline state.

  • Programmable stages :

    • Means that you can upload your own code to the graphics card to apply exactly the operations you want.

    • This allows you to use fragment shaders, for example, to implement anything from texturing and lighting to ray tracers. These programs run on many GPU cores simultaneously to process many objects, like vertices and fragments in parallel.

  • Immutability :

    • Is almost completely immutable, so you must recreate the pipeline from scratch if you want to change shaders, bind different framebuffers or change the blend function.

    • The disadvantage is that you’ll have to create a number of pipelines (many VkPipeline objects) that represent all the different combinations of states you want to use in your rendering operations. However, because all the operations you’ll be doing in the pipeline are known in advance, the driver can optimize for it much better.

      • Runtime performance is more predictable because large state changes like switching to a different graphics pipeline are made very explicit.

    • Only some basic configuration, like viewport size and clear color, can be changed dynamically.

Shader Compilation

Shader Module
  • A VkShaderModule  is a processed shader file.

  • We create it from a pre-compiled SPIR-V file.

  • We can call vkDestroyShaderModule  after they are used for the graphics pipeline creation.
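
  • A sketch of preparing SPIR-V bytes for shader module creation: pCode  is a uint32_t  pointer and codeSize  must be a multiple of 4, so the raw file bytes are repacked into 32-bit words and sanity-checked against the SPIR-V magic number ( 0x07230203 ). The helper name is hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

std::vector<uint32_t> toSpirvWords(const std::vector<uint8_t>& bytes) {
    if (bytes.size() < 4 || bytes.size() % 4 != 0)
        throw std::runtime_error("SPIR-V size must be a nonzero multiple of 4");
    std::vector<uint32_t> words(bytes.size() / 4);
    for (size_t i = 0; i < words.size(); ++i) {
        // Assemble each word little-endian, the common on-disk layout.
        words[i] = static_cast<uint32_t>(bytes[4 * i]) |
                   (static_cast<uint32_t>(bytes[4 * i + 1]) << 8) |
                   (static_cast<uint32_t>(bytes[4 * i + 2]) << 16) |
                   (static_cast<uint32_t>(bytes[4 * i + 3]) << 24);
    }
    if (words[0] != 0x07230203u)
        throw std::runtime_error("not a SPIR-V module (bad magic)");
    return words;
}
```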

Input Assembly

  • Fixed-function stage.

  • Collects the raw vertex data from the buffers you specify and may also use an index buffer to repeat certain elements without having to duplicate the vertex data itself.

  • VkPipelineVertexInputStateCreateInfo

    • Describes the format of the vertex data that will be passed to the vertex shader.

    • pVertexBindingDescriptions

      • Spacing between data and whether the data is per-vertex or per-instance (see instancing ).

    • pVertexAttributeDescriptions

      • Type of the attributes passed to the vertex shader, which binding to load them from and at which offset.

  • VkPipelineInputAssemblyStateCreateInfo .

    • Describes two things: what kind of geometry will be drawn from the vertices and if primitive restart should be enabled.

    • topology

      • PRIMITIVE_TOPOLOGY_POINT_LIST

        • points from vertices

      • PRIMITIVE_TOPOLOGY_LINE_LIST

        • line from every two vertices without reuse

      • PRIMITIVE_TOPOLOGY_LINE_STRIP

        • the end vertex of every line is used as start vertex for the next line

      • PRIMITIVE_TOPOLOGY_TRIANGLE_LIST

        • triangle from every three vertices without reuse

      • PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP

        • the second and third vertex of every triangle are used as the first two vertices of the next triangle

    • primitiveRestartEnable

      • Normally, the vertices are loaded from the vertex buffer by index in sequential order, but with an element buffer  you can specify the indices to use yourself.

        • This allows you to perform optimizations like reusing vertices.

      • If you set this to TRUE , then it’s possible to break up lines and triangles in the _STRIP  topology modes by using a special index of 0xFFFF  or 0xFFFFFFFF .
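
  • The topology and primitive-restart rules above can be sketched as plain arithmetic: how many primitives a vertex count yields per topology, and how the 0xFFFFFFFF  restart index (for 32-bit indices) cuts a triangle strip:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Topology { PointList, LineList, LineStrip, TriangleList, TriangleStrip };

size_t primitiveCount(Topology t, size_t vertexCount) {
    switch (t) {
        case Topology::PointList:     return vertexCount;
        case Topology::LineList:      return vertexCount / 2;
        case Topology::LineStrip:     return vertexCount < 2 ? 0 : vertexCount - 1;
        case Topology::TriangleList:  return vertexCount / 3;
        case Topology::TriangleStrip: return vertexCount < 3 ? 0 : vertexCount - 2;
    }
    return 0;
}

// With primitiveRestartEnable, each restart index ends the current strip;
// count the triangles every sub-strip produces.
size_t restartedTriStripCount(const std::vector<uint32_t>& indices) {
    size_t total = 0, run = 0;
    for (uint32_t idx : indices) {
        if (idx == 0xFFFFFFFFu) { total += run < 3 ? 0 : run - 2; run = 0; }
        else ++run;
    }
    total += run < 3 ? 0 : run - 2;
    return total;
}
```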

Primitive Topology

Vertex Shader

  • Programmable stage.

  • Is run for every vertex and generally applies transformations to turn vertex positions from model space to screen space. It also passes per-vertex data down the pipeline.

  • The VkShaderModule  objects are created from shader byte code.

  • Accesses and computes one vertex at a time.

Tessellation Shader

  • Programmable stage.

  • Subdivides patches of vertices to generate additional geometry, increasing the detail of the mesh on the GPU.

  • You can do tessellation in the Geometry Shader, but the Tessellation Shader is more appropriate and efficient.

  • .

    • Sending that many vertices to the Vertex Shader would be much more expensive than generating them in the Tessellation Shader.

  • Tessellation Evaluation Shader.

    • Kinda like a Vertex Shader, after the Tessellation.

  • Tessellation Shader .

    • I was too lazy to watch it all.

    • The inputs are complicated, etc.

  • Tessellation output Execution Mode :

    • The tessellation evaluation stage will set either Triangles , Quads , or Isolines

    // Only 1 of the 3 is allowed
    layout(quads) in;
    layout(isolines) in;
    layout(triangles) in;
    

Geometry Shader

  • Programmable stage.

  • It operates on primitives .

  • Is run on every primitive (triangle, line, point) and can discard it or output more primitives than came in. This is similar to the tessellation shader but much more flexible.

  • However, it is used little in today’s applications because the performance is not that good  on most graphics cards except for Intel’s integrated GPUs.

    • Also, almost all geometry shader use cases can be replaced with a more modern Mesh shader pipeline, which like ray tracing is a wholly new pipeline solution, so it exists outside the standard graphics pipeline setup.

  • A Vertex Shader is more parallelized than a Geometry Shader.

  • A Vertex Shader computes one vertex at a time, while a geometry shader gets all the vertices that compose a primitive .

    • It does not  have access to the whole mesh, just the vertices that compose the current primitive.

  • OpenGL Primitives :

    • May be useful.

  • Think of the Primitive Inputs as just the amount of vertices you are sending at a time.

  • The reason for this is that you can get any primitive input and have any primitive output.

  • .

    • Use EndPrimitive()  so the line strips are separated.

  • .

    • The Vertex Shader can output data to the Geometry Shader, in the form of an array.

    • The Geometry Shader can output data to the Fragment Shader, in the form of a value interpolated using barycentric coordinates.

  • Instancing :

    • You can have many instances of a Geometry Shader, where the input is the same but the output changes.

  • .

    • The smoke is a quad facing the camera (billboard).

    • The points are converted to quads.

  • Geometry output Execution Mode :

    • A geometry stage will set either OutputPoints , OutputLineStrip , or OutputTriangleStrip

    // Only 1 of the 3 is allowed
    layout(points) out;
    layout(line_strip) out;
    layout(triangle_strip) out;
    

Rasterization

  • Fixed-function stage.

  • Breaks the primitives into fragments .

  • Fragments are the pixel-sized elements that the primitives cover on the framebuffer.

  • Any fragments that fall outside the screen are discarded, and the attributes outputted by the vertex shader are interpolated across the fragments.

  • Fragments that are behind other primitive fragments can also be discarded here because of depth testing.

  • VkPipelineRasterizationStateCreateInfo .

    • polygonMode

    • lineWidth

      • Is the width of rasterized line segments.

      • The maximum line width that is supported depends on the hardware.

      • Any line thicker than 1.0f  requires you to enable the wideLines  GPU feature.

      • If set to 0.0f , the validation layers report an error like: “ lineWidth  is 0.0, but the line width state is static ( pCreateInfos[0].pDynamicState->pDynamicStates  does not contain DYNAMIC_STATE_LINE_WIDTH ) and the wideLines  feature was not enabled.” The Vulkan spec states: if the pipeline requires pre-rasterization shader state, and the wideLines  feature is not enabled, and no element of the pDynamicStates  member of pDynamicState  is DYNAMIC_STATE_LINE_WIDTH , the lineWidth  member of pRasterizationState  must  be 1.0.

      • So, set it to 1.0f  by default.

    • cullMode

      • NONE

        • Specifies that no triangles are discarded

      • FRONT

        • Specifies that front-facing triangles are discarded

      • BACK

        • Specifies that back-facing triangles are discarded

      • FRONT_AND_BACK

        • Specifies that all triangles are discarded.

      • Following culling, fragments are produced for any triangles which have not been discarded.

    • frontFace

      • Specifies the vertex order for the faces to be considered front-facing.

      • COUNTER_CLOCKWISE

        • Specifies that a triangle with positive area is considered front-facing.

      • CLOCKWISE

        • Specifies that a triangle with negative area is considered front-facing.

      • Any triangle which is not front-facing is back-facing, including zero-area triangles.

    • rasterizerDiscardEnable .

      • When enabled, primitives are discarded after they are processed by the last active shader stage in the pipeline before rasterization.

      • Controls whether primitives are discarded immediately before the rasterization stage. This is important because when this is set to TRUE  the rasterization hardware is not executed.

      • There are many Validation Usage errors that will not occur if this is set to TRUE  because some topology hardware is unused and can be ignored.

      • Enabling this state is meant for very specific use cases. Prior to compute shaders, this was a common technique for writing geometry shader output to a buffer.

      • It can be used to debug/profile non-rasterization bottlenecks.

    • flags

      • Reserved for future use.

    • depthClampEnable

      • See the Depth section for details.

    • depthBiasEnable

      • See the Depth section for details.

    • depthBiasConstantFactor

      • See the Depth section for details.

    • depthBiasSlopeFactor

      • See the Depth section for details.

    • depthBiasClamp

      • See the Depth section for details.
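
  • The frontFace  rule above can be sketched as a signed-area test. Note this uses the conventional mathematical orientation (y up); Vulkan's framebuffer y axis points down, which flips the visual sense of "counter-clockwise", but the sign logic is the same:

```cpp
#include <cassert>

struct Vec2 { float x, y; };

// Twice-signed-area cross product, halved; sign encodes the winding.
float signedArea(Vec2 a, Vec2 b, Vec2 c) {
    return 0.5f * ((b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y));
}

// With FRONT_FACE_COUNTER_CLOCKWISE, a positive area means front-facing;
// zero-area triangles count as back-facing, per the notes above.
bool isFrontFacingCCW(Vec2 a, Vec2 b, Vec2 c) {
    return signedArea(a, b, c) > 0.0f;
}
```

  • Combined with cullMode , this determines which triangles survive to produce fragments (e.g. CULL_MODE_BACK_BIT  discards the back-facing ones).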

Polygon Mode
  • Determines how fragments are generated for geometry.

  • These modes affect only the final  rasterization of polygons. The polygon’s vertices are shaded and the polygon is clipped and possibly culled before these modes are applied.

  • FILL

    • Fill the area of the polygon with fragments.

  • LINE

    • Polygon edges are drawn as lines

  • POINT

    • Polygon vertices are drawn as points

    • If VkPhysicalDeviceMaintenance5Properties :: polygonModePointSize  is TRUE , the point size of the final rasterization of polygons is taken from PointSize .

    • Otherwise, the point size of the final rasterization of polygons is 1.0.

  • FILL_RECTANGLE_NV

    • Specifies that polygons are rendered using polygon rasterization rules, modified to consider a sample within the primitive if the sample location is inside the axis-aligned bounding box of the triangle after projection.

    • Note that the barycentric weights used in attribute interpolation can  extend outside the range [0,1]  when these primitives are shaded.

    • Special treatment is given to a sample position on the boundary edge of the bounding box. In such a case, if two rectangles lie on either side of a common edge (with identical endpoints) on which a sample position lies, then exactly one of the triangles must  produce a fragment that covers that sample during rasterization.

    • Polygons rendered in FILL_RECTANGLE_NV  mode may  be clipped by the frustum or by user clip planes. If clipping is applied, the triangle is culled rather than clipped.

    • Area calculation and facingness are determined for FILL_RECTANGLE_NV  mode using the triangle’s vertices.

  • If you have a vertex shader that has PRIMITIVE_TOPOLOGY_TRIANGLE_LIST  input and then during rasterization uses POLYGON_MODE_LINE , the effective topology is the Line Topology Class  at that time. This means something like lineWidth  would be applied when filling in the polygon with POLYGON_MODE_LINE .

Fragment Operations

Order
  1. Discard rectangles test

  2. Scissor test

  3. Exclusive scissor test

  4. Sample mask test

  5. Certain Fragment shading operations:

    • Sample Mask Accesses

    • Tile Image Reads

    • Depth Replacement

    • Stencil Reference Replacement

    • Interlocked Operations

  6. Multisample coverage

  7. Depth bounds test

  8. Stencil test

  9. Depth test

  10. Representative fragment test

  11. Sample counting

  12. Coverage to color

  13. Coverage reduction

  14. Coverage modulation

Early Per-Fragment Tests
  • OpenGL 4.6:

    • Once fragments are produced by rasterization, a number of per-fragment operations are performed prior to fragment shader execution. If a fragment is discarded during any of these operations, it will not be processed by any subsequent Stage, including fragment shader execution.

    • Three fragment operations are performed, and a further three are optionally performed on each fragment, in the following order:

      • the pixel ownership test (see section 14.9.1);

      • the scissor test (see section 14.9.2);

      • multisample fragment operations (see section 14.9.3);

    • If early per-fragment operations are enabled, these tests are also performed:

      • the stencil test (see section 17.3.3);

      • the depth buffer test (see section 17.3.4);

        • The depth buffer test discards the incoming fragment if a depth comparison fails. The comparison is enabled or disabled with the generic Enable and Disable commands using target DEPTH_TEST. When disabled, the depth comparison and subsequent possible updates to the depth buffer value are bypassed and the fragment is passed to the next operation. The stencil value, however, is modified as indicated below as if the depth buffer test passed. If enabled, the comparison takes place and the depth buffer and stencil value may subsequently be modified.

      • occlusion query sample counting (see section 17.3.5)

    • Early fragment tests, as an optimization, exist to prevent unnecessary executions of the Fragment Shader. If a fragment will be discarded based on the Depth Test (due perhaps to being behind other geometry), it saves performance to avoid executing the fragment shader. There is specialized hardware that makes this particularly efficient in many GPUs.

    • The most effective way to use early depth test hardware is to run a depth-only pre-processing pass. This means to render all available geometry, using minimal shaders and a rendering pipeline that only writes to the depth buffer. The Vertex Shader should do nothing more than transform positions, and the Fragment Shader does not even need to exist.

    • This provides the best performance gain if the fragment shader is expensive, or if you intend to use multiple passes across the geometry.

    • Limitations :

      • The Spec states that these operations happen after fragment processing. However, a specification only defines apparent behavior, so the implementation is only required to behave "as if" it happened afterwards.

      • Therefore, an implementation is free to apply early fragment tests if the Fragment Shader being used does not do anything that would impact the results of those tests. So if a fragment shader writes to gl_FragDepth, thus changing the fragment's depth value, then early testing cannot take place, since the test must use the newly computed value.

      • Do recall that if a fragment shader writes to gl_FragDepth, even conditionally, it must write to it at least once on all codepaths.

      • There can be other hardware-based limitations as well. For example, some hardware will not execute an early depth test if the (deprecated) alpha test is active, as these use the same hardware on that platform. Because this is a hardware-based optimization, OpenGL has no direct controls that will tell you if early depth testing will happen.

      • Similarly, if the fragment shader discards the fragment with the discard keyword, this will almost always turn off early depth tests on some hardware. Note that even conditional  use of discard will mean that the FS will turn off early depth tests.

      • All of the above limitations apply only to early testing as an optimization. They do not apply to anything below.

    • More recent hardware can force early depth tests, using a special fragment shader layout qualifier:

      • layout(early_fragment_tests) .

        • Vulkan:

          • Specifying this is a way for the application programmer to promise the implementation that it is algorithmically safe to kill the fragments early; you explicitly allow the change in application-visible behavior.

          • Specifying this will make per-fragment tests be performed before fragment shader execution. If this is not declared, per-fragment tests will be performed after fragment shader execution. Only one fragment shader (compilation unit) need declare this, though more than one can. If at least one declares this, then it is enabled.

        • OpenGL 4.6:

          • An explicit control is provided to allow fragment shaders to enable early fragment tests. If the fragment shader specifies the early_fragment_tests  layout qualifier, the per-fragment tests will be performed prior to fragment shader execution. Otherwise, they will be performed after fragment shader execution.

          • This will also perform early stencil tests.

          • There is a caveat with this. This feature cannot  be used to violate the sanctity of the depth test. When this is activated, any writes to gl_FragDepth  will be ignored . The value written to the depth buffer will be exactly what was tested against  the depth buffer: the fragment's depth computed through rasterization.

          • This feature exists to ensure proper behavior when using Image Load Store  or other incoherent memory writing . Without turning this on, fragments that fail the depth test would still perform their Image Load/Store operations, since the fragment shader that performed those operations successfully executed. However, with early fragment tests, those tests were run before the fragment shader. So this ensures that image load/store operations will only happen on fragments that pass the depth test.

          • Enabling this feature has consequences for the results of a discarded fragment.
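
  • As with the execution-mode snippets earlier in these notes, the qualifier is a single declaration in the fragment shader source:

```glsl
#version 450

// Force the per-fragment tests (depth, stencil, ...) to run before this
// shader executes; any write to gl_FragDepth would then be ignored.
layout(early_fragment_tests) in;

layout(location = 0) out vec4 outColor;

void main() {
    outColor = vec4(1.0);
}
```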

Viewport and Scissors

  • A viewport basically describes the region of the framebuffer that the output will be rendered to.

  • Viewports define the transformation from the image to the framebuffer, scissor rectangles define in which region pixels will actually be stored. The rasterizer will discard any pixels outside the scissored rectangles. They function like a filter rather than a transformation.

    • The difference is illustrated below.

    • .

    • Note that the left scissored rectangle is just one of the many possibilities that would result in that image, as long as it’s larger than the viewport.

    • So if we wanted to draw to the entire framebuffer, we would specify a scissor rectangle that covers it entirely:

      vk::Rect2D{ vk::Offset2D{ 0, 0 }, swapChainExtent }
      
  • Parameters :

    • This will almost always be the rectangle (0, 0) , (width, height)  and in this tutorial that will also be the case.

      • Remember that the size of the Swapchain and its images may differ from the WIDTH  and HEIGHT  of the window.

      • The Swapchain images will be used as framebuffers later on, so we should stick to their size.

    • The minDepth  and maxDepth  values specify the range of depth values to use for the framebuffer. These values must be within the [0.0f, 1.0f]  range, but minDepth  may be higher than maxDepth .

      • If you aren’t doing anything special, then you should stick to the standard values of 0.0f  and 1.0f .

  • As a Dynamic State or Static State :

    • Viewport(s) and scissor rectangle(s) can either be specified as a static part of the pipeline or as a dynamic state set in the command buffer.

    • Independent of how you set them, it’s possible to use multiple viewports and scissor rectangles on some graphics cards, so the structure members reference an array of them. Using multiple requires enabling a GPU feature (see logical device creation).

    • It’s often convenient to make viewport and scissor state dynamic as it gives you a lot more flexibility.

    • With dynamic state :

      • It’s even possible to specify different viewports and/or scissor rectangles within a single command buffer.

      • This is widespread and all implementations can handle this dynamic state without  a performance penalty.

      • When opting for dynamic viewport(s) and scissor rectangle(s), you need to enable the respective dynamic states for the pipeline:

        std::vector<vk::DynamicState> dynamicStates = {
            vk::DynamicState::eViewport,
            vk::DynamicState::eScissor
        };
        vk::PipelineDynamicStateCreateInfo dynamicState({}, static_cast<uint32_t>(dynamicStates.size()), dynamicStates.data());
        
      • And then you only need to specify their count at pipeline creation time:

        vk::PipelineViewportStateCreateInfo viewportState({}, 1, {}, 1);
        
      • The actual viewport(s) and scissor rectangle(s) will then later be set up at drawing time.

    • Without dynamic state :

      • The viewport and scissor rectangle need to be set in the pipeline using the VkPipelineViewportStateCreateInfo  struct. This makes the viewport and scissor rectangle for this pipeline immutable. Any changes required to these values would require a new pipeline to be created with the new values.

    • What should you use?

      • USE DYNAMIC. There's no  performance penalty.

      • Supported since launch.

      • LunarG:

        • See LunarG’s white paper on Vulkan dynamic state.

Multi-Sampling

Setup
  • VkPipelineMultisampleStateCreateInfo .

    • rasterizationSamples

      • If the bound pipeline was created without a VkAttachmentSampleCountInfoAMD  or VkAttachmentSampleCountInfoNV  structure, and the multisampledRenderToSingleSampled  feature is not enabled, and the current render pass instance was begun with vkCmdBeginRendering  with a VkRenderingInfo::colorAttachmentCount  parameter greater than 0, then each element of the VkRenderingInfo::pColorAttachments  array with an imageView  not equal to NULL_HANDLE  must have been created with a sample count equal to the value of rasterizationSamples  for the bound graphics pipeline.

      • Is a VkSampleCountFlagBits  value specifying the number of samples used in rasterization. This value is ignored for the purposes of setting the number of samples used in rasterization if the pipeline is created with the DYNAMIC_STATE_RASTERIZATION_SAMPLES_EXT  dynamic state set, but if DYNAMIC_STATE_SAMPLE_MASK_EXT  dynamic state is not set, it is still used to define the size of the pSampleMask  array as described below.

    • sampleShadingEnable

      • Can be used to enable Sample Shading.

    • minSampleShading

      • Specifies a minimum fraction of sample shading if sampleShadingEnable  is TRUE .

    • pSampleMask

      • Is a pointer to an array of VkSampleMask  values used in the sample mask test.

    • alphaToCoverageEnable

      • Controls whether a temporary coverage value is generated based on the alpha component of the fragment’s first color output as specified in the Multisample Coverage  section.

    • alphaToOneEnable

      • Controls whether the alpha component of the fragment’s first color output is replaced with one as described in Multisample Coverage .

    • flags

      • Reserved for future use.

Resolving
  • VkRenderingAttachmentInfo .

    • resolveMode

      • Is a VkResolveModeFlagBits  value defining how data written to imageView  will be resolved into resolveImageView .

      • If resolveMode  is not RESOLVE_MODE_NONE , and resolveImageView  is not NULL_HANDLE , a render pass multisample resolve operation is defined for the attachment subresource.

      • RESOLVE_MODE_NONE

        • Specifies that no resolve operation is done.

      • RESOLVE_MODE_SAMPLE_ZERO

        • Specifies that the result of the resolve operation is equal to the value of sample 0.

      • RESOLVE_MODE_AVERAGE

        • Specifies that the result of the resolve operation is the average of the sample values.

      • RESOLVE_MODE_MIN

        • Specifies that the result of the resolve operation is the minimum of the sample values.

      • RESOLVE_MODE_MAX

        • Specifies that the result of the resolve operation is the maximum of the sample values.

      • RESOLVE_MODE_EXTERNAL_FORMAT_DOWNSAMPLE_ANDROID

        • Specifies that rather than a multisample resolve, a single sampled color attachment will be downsampled into a Y′CBCR format image specified by an external Android format. Unlike other resolve modes, implementations can resolve multiple times during rendering, or even bypass writing to the color attachment altogether, as long as the final value is resolved to the resolve attachment. Values in the G, B, and R channels of the color attachment will be written to the Y, CB, and CR channels of the external format image, respectively. Chroma values are calculated as if sampling with a linear filter from the color attachment at full rate, at the location the chroma values sit according to VkPhysicalDeviceExternalFormatResolvePropertiesANDROID :: externalFormatResolveChromaOffsetX , VkPhysicalDeviceExternalFormatResolvePropertiesANDROID :: externalFormatResolveChromaOffsetY , and the chroma sample rate of the resolved image.

        • No range compression or Y′CBCR model conversion is performed by RESOLVE_MODE_EXTERNAL_FORMAT_DOWNSAMPLE_ANDROID ; applications have to do these conversions themselves. Value outputs are expected to match those that would be read through a Y′CBCR sampler using SAMPLER_YCBCR_MODEL_CONVERSION_RGB_IDENTITY . The color space that the values should be in is defined by the platform and is not exposed via Vulkan.

    • resolveImageView

      • Is an image view used to write resolved data at the end of rendering.

    • resolveImageLayout

      • Is the layout that resolveImageView  will be in during rendering.

      • If imageView  is not NULL_HANDLE  and resolveMode  is not RESOLVE_MODE_NONE , resolveImageLayout  must not be IMAGE_LAYOUT_UNDEFINED , IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL , IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL , IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL , IMAGE_LAYOUT_ZERO_INITIALIZED_EXT , IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL , or IMAGE_LAYOUT_PREINITIALIZED .

  • From multisample to single sample.

  • Combine sample values from a single pixel in a multisample attachment and store the result to the corresponding pixel in a single sample attachment.

  • Multisample resolve operations for attachments execute in the PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT  pipeline stage. A final resolve operation for all pixels in the render area happens-after any recorded command which writes a pixel via the multisample attachment to be resolved or an explicit alias of it in the subpass that it is specified.

  • Any single sample attachment specified for use in a multisample resolve operation may  have its contents modified at any point once rendering begins for the render pass instance.

  • Reads from the multisample attachment can be synchronized with ACCESS_COLOR_ATTACHMENT_READ . Access to the single sample attachment can be synchronized with ACCESS_COLOR_ATTACHMENT_READ  and ACCESS_COLOR_ATTACHMENT_WRITE . These pipeline stage and access types are used whether the attachments are color or depth/stencil attachments.

  • When using render pass objects, a subpass dependency specified with the above pipeline stages and access flags will ensure synchronization with multisample resolve operations for any attachments that were last accessed by that subpass. This allows later subpasses to read resolved values as input attachments.

  • Resolve operations only update values within the defined render area for the render pass instance. However, any writes performed by a resolve operation (as defined by its access masks) to a given attachment may  read and write back any memory locations within the image subresource bound for that attachment. For depth/stencil images, if separateDepthStencilAttachmentAccess  is FALSE , writes to one aspect may  also result in read-modify-write operations for the other aspect. If the subresource is bound to an attachment with feedback loop enabled , implementations must  not access pixels outside of the render area.

  • As entire subresources could be accessed by multisample resolve operations, applications cannot safely access values outside of the render area via aliased resources during a render pass instance when a multisample resolve operation is performed.

  • If RESOLVE_MODE_AVERAGE  is used, and the source format is a floating-point or normalized type, the sample values for each pixel are resolved with implementation-defined numerical precision.

  • If the numeric format  of the resolve attachment uses sRGB encoding, the implementation should  convert samples from nonlinear to linear before averaging samples as described in the “sRGB EOTF” section of the Khronos Data Format Specification . In this case, the implementation must  convert the linear averaged value to nonlinear before writing the resolved result to resolve attachment.

  • The resolve mode and store operation are independent; it is valid to write both resolved and unresolved values, and equally valid to discard the unresolved values while writing the resolved ones.

Multisample Anti-Aliasing (MSAA)
  • By default, only one sample per pixel is used, which is equivalent to no multisampling.

  • Maximum supported :

    • Can be extracted from VkPhysicalDeviceProperties  associated with our selected physical device.

    • The highest sample count supported by both the color image and the depth image (buffer) is the maximum we can use.

  • What to Multisample :

    • The render target.

    • If using a depth image, it should also be multisampled.

  • Limitations :

    • The multisampled image should only have one mip level.

      • This is enforced by the Vulkan specification in case of images with more than one sample per pixel.

    • Multi-sampled images cannot be presented directly.

      • This requirement does not apply to the depth buffer, since it won’t be presented at any point.

  • DOs :

    • Use 4x MSAA if possible; it’s not expensive and provides good image quality improvements.

    • Use loadOp = LOAD_OP_CLEAR  or loadOp = LOAD_OP_DONT_CARE  for multisampled images.

    • Use storeOp = STORE_OP_DONT_CARE  for multisampled images.

    • Use LAZILY_ALLOCATED  memory to back the allocated multisampled images; they do not need to be persisted into main memory and therefore do not need physical backing storage.

    • Use pResolveAttachments  in a subpass to automatically resolve a multisampled color buffer into a single-sampled color buffer.

    • Use KHR_depth_stencil_resolve  in a subpass to automatically resolve a multisampled depth buffer into a single-sampled depth buffer. Typically this is only useful if the depth buffer is going to be used further, in most cases it is transient and does not need to be resolved.

  • Avoid :

    • Avoid using vkCmdResolveImage() ; this has a significant negative impact on bandwidth and performance.

    • Avoid using loadOp = LOAD_OP_LOAD  for multisampled image attachments.

    • Avoid using storeOp = STORE_OP_STORE  for multisampled image attachments.

    • Avoid using more than 4x MSAA without checking performance.

  • Impact :

    • Failing to get an inline resolve can result in substantially higher memory bandwidth and reduced performance.

      • Manually writing and resolving a 4x MSAA 1080p surface at 60 FPS requires 3.9GB/s of memory bandwidth compared to just 500MB/s when using an inline resolve.

  • Sample Shading :

    • There are certain limitations of our current MSAA implementation which may impact the quality of the output image in more detailed scenes. For example, we're currently not addressing potential problems caused by shader aliasing: MSAA only smooths out the edges of geometry, not the interior filling. This may lead to a situation where a polygon renders with smooth edges on screen, but the applied texture still looks aliased if it contains high-contrast colors. One way to approach this problem is to enable Sample Shading , which will improve the image quality even further, though at an additional performance cost:

    void createLogicalDevice() {
        ...
        deviceFeatures.sampleRateShading = VK_TRUE; // enable sample shading feature for the device
        ...
    }

    void createGraphicsPipeline() {
        ...
        multisampling.sampleShadingEnable = VK_TRUE; // enable sample shading in the pipeline
        multisampling.minSampleShading = 0.2f; // min fraction for sample shading; closer to one is smoother
        ...
    }
    


  • Performance Tests :

    • (2025-09-07)

      • Done anyway, very approximate.

    • MSAA x8 = 900 fps

    • MSAA x4 = 1250 fps

    • MSAA x2 = 1550 fps

    • MSAA off = 2100 fps

    • As samples increase, frame time increases approximately by factors 1.35 (x2), 1.68 (x4) and 2.33 (x8) compared to the case without MSAA — this is consistent with substantial per-sample cost increase, but is not  strictly linear with the number of samples (e.g.: x4 is not exactly 4× nor x8 exactly 8×).

Fragment Shader

  • Programmable stage.

  • Is invoked for every fragment that survives and determines which framebuffer(s) the fragments are written to and with which color and depth values. It can do this using the interpolated data from the vertex shader, which can include things like texture coordinates and normals for lighting.

  • The VkShaderModule  objects are created from shader byte code.

Color Blending

  • Fixed-function stage.

  • Controls how the GPU combines the fragment shader’s output with what is already in the framebuffer.

  • Applies operations to mix different fragments that map to the same pixel in the framebuffer. Fragments can simply overwrite each other, add up or be mixed based upon transparency.

  • After a fragment shader has returned a color, it needs to be combined with the color that is already in the framebuffer.

  • This transformation is known as color blending, and there are two ways to do it:

    • Mix the old and new value to produce a final color

    • Combine the old and new value using a bitwise operation

  • Example :

    • If blending is enabled in the pipeline, the fragment shader result is blended with the render target’s previous contents.

    • So if the fragment result has alpha < 1.0, the clear color is blended with the fragment shader result, giving it a "transparent" look against the clear color.

  • VkPipelineColorBlendAttachmentState .

    • Contains the configuration per attached framebuffer.

    • This per-framebuffer struct allows you to configure the first way of color blending:

      // Pseudo-code
      if (blendEnable) {
          finalColor.rgb = (srcColorBlendFactor * newColor.rgb) <colorBlendOp> (dstColorBlendFactor * oldColor.rgb);
          finalColor.a = (srcAlphaBlendFactor * newColor.a) <alphaBlendOp> (dstAlphaBlendFactor * oldColor.a);
      } else {
          finalColor = newColor;
      }
      
      finalColor = finalColor & colorWriteMask;
      
    • The most common way to use color blending is to implement alpha blending, where we want the new color to be blended with the old color based on its opacity.

      • The finalColor  should then be computed as follows:

        finalColor.rgb = newAlpha * newColor + (1 - newAlpha) * oldColor;
        finalColor.a = newAlpha;
        
      • This can be achieved with the following parameters:

        colorBlendAttachment.blendEnable = vk::True;
        colorBlendAttachment.srcColorBlendFactor = vk::BlendFactor::eSrcAlpha;
        colorBlendAttachment.dstColorBlendFactor = vk::BlendFactor::eOneMinusSrcAlpha;
        colorBlendAttachment.colorBlendOp = vk::BlendOp::eAdd;
        colorBlendAttachment.srcAlphaBlendFactor = vk::BlendFactor::eOne;
        colorBlendAttachment.dstAlphaBlendFactor = vk::BlendFactor::eZero;
        colorBlendAttachment.alphaBlendOp = vk::BlendOp::eAdd;
        
    • blendEnable

      • If set to FALSE , then the new color from the fragment shader is passed through unmodified. Otherwise, the two mixing operations are performed to compute a new color.

      • The resulting color is AND’d with the colorWriteMask  to determine which channels are actually passed through.

  • VkPipelineColorBlendStateCreateInfo .

    • Contains the global  color blending settings.

    • References the array of structures for all the framebuffers and allows you to set blend constants that you can use as blend factors in the aforementioned calculations.

    • attachmentCount

      • Is the number of VkPipelineColorBlendAttachmentState  elements in pAttachments .

      • It is ignored if the pipeline is created with DYNAMIC_STATE_COLOR_BLEND_ENABLE_EXT , DYNAMIC_STATE_COLOR_BLEND_EQUATION_EXT , and DYNAMIC_STATE_COLOR_WRITE_MASK_EXT  dynamic states set, and either DYNAMIC_STATE_COLOR_BLEND_ADVANCED_EXT  set or the advancedBlendCoherentOperations  feature is not enabled.

    • pAttachments

      • Is a pointer to an array of VkPipelineColorBlendAttachmentState  structures defining blend state for each color attachment.

      • It is ignored if the pipeline is created with DYNAMIC_STATE_COLOR_BLEND_ENABLE_EXT , DYNAMIC_STATE_COLOR_BLEND_EQUATION_EXT , and DYNAMIC_STATE_COLOR_WRITE_MASK_EXT  dynamic states set, and either DYNAMIC_STATE_COLOR_BLEND_ADVANCED_EXT  set or the advancedBlendCoherentOperations  feature is not enabled.

    • logicOpEnable

      • Controls whether to apply Logical Operations.

    • logicOp

      • Selects which logical operation to apply.

      • If you want to use the second method of blending (a bitwise combination), then you should set logicOpEnable  to TRUE .

        • Note that this will automatically disable the first method, as if you had set blendEnable  to FALSE  for every attached framebuffer.

      • colorWriteMask  will also be used in this mode to determine which channels in the framebuffer will actually be affected.

      • If both modes are disabled, the fragment colors will be written to the framebuffer unmodified.

    • blendConstants

      • Is a pointer to an array of four values used as the R, G, B, and A components of the blend constant that are used in blending, depending on the blend factor .

    • flags

      • Is a bitmask of VkPipelineColorBlendStateCreateFlagBits  specifying additional color blending information.

Creation

Setup
  • VkGraphicsPipelineCreateInfo .

    • flags

      • DISABLE_OPTIMIZATION

        • Specifies that the created pipeline will not be optimized.

        • Using this flag may  reduce the time taken to create the pipeline.

    • renderPass

      • Is set to nullptr  because we’re using dynamic rendering instead of a traditional render pass.

    • basePipelineHandle

    • basePipelineIndex

    • Graphics Pipelines Inheritance :

      • Vulkan allows you to create a new graphics pipeline by deriving from an existing pipeline.

      • The idea of pipeline derivatives is that it is less expensive to set up pipelines when they have much functionality in common with an existing pipeline and switching between pipelines from the same parent can also be done quicker.

      • You can either specify the handle of an existing pipeline with basePipelineHandle  or reference another pipeline that is about to be created by index with basePipelineIndex .

      • These values are only used if the PIPELINE_CREATE_DERIVATIVE  flag is also specified in the flags  field of VkGraphicsPipelineCreateInfo .

  • vkCreateGraphicsPipelines() .

    • device

      • Is the logical device that creates the graphics pipelines.

    • pipelineCache

      • Is either NULL_HANDLE , indicating that pipeline caching is disabled, or to enable caching, the handle of a valid VkPipelineCache  object. The implementation must  not access this object outside of the duration of this command.

      • A pipeline cache can be used to store and reuse data relevant to pipeline creation across multiple calls to vkCreateGraphicsPipelines  and even across program executions if the cache is stored to a file. This makes it possible to significantly speed up pipeline creation at a later time.

    • createInfoCount

      • Is the length of the pCreateInfos  and pPipelines  arrays.

    • pCreateInfos

      • Is a pointer to an array of VkGraphicsPipelineCreateInfo  structures.

    • pAllocator

      • Controls host memory allocation; left as nullptr  in this tutorial.

    • pPipelines

      • Is a pointer to an array of VkPipeline  handles in which the resulting graphics pipeline objects are returned.

Dynamic Rendering Extra Steps
  • Changes to the VkGraphicsPipelineCreateInfo :

    • The VkGraphicsPipelineCreateInfo  must be created without a VkRenderPass .

    • The VkPipelineRenderingCreateInfo  must be included in the pNext .

      • If a graphics pipeline is created with a valid VkRenderPass , the parameters of the VkPipelineRenderingCreateInfo  are ignored.

  • VkPipelineRenderingCreateInfo .

    • colorAttachmentCount

      • Is the number of entries in pColorAttachmentFormats

    • pColorAttachmentFormats

      • Is a pointer to an array of VkFormat  values defining the format of color attachments used in this pipeline.

    • depthAttachmentFormat

      • Is a VkFormat  value defining the format of the depth attachment used in this pipeline.

    • stencilAttachmentFormat

      • Is a VkFormat  value defining the format of the stencil attachment used in this pipeline.

    • viewMask

      • Is a bitfield of view indices describing which views are active during rendering.

      • It must  match VkRenderingInfo::viewMask  when rendering.

        • As defined in VkRenderingInfo :

          • Is a bitfield of view indices describing which views are active during rendering, when it is not 0 .

          • If viewMask  is not 0 , multiview is enabled.

    • Formats :

      • If depthAttachmentFormat , stencilAttachmentFormat , or any element of pColorAttachmentFormats  is UNDEFINED , it indicates that the corresponding attachment is unused within the render pass.

      • Valid formats indicate that an attachment can  be used - but it is still valid to set the attachment to NULL  when beginning rendering.

Managing Pipelines and Reducing Overhead

  • Tips and Tricks: Vulkan Dos and Don’ts .

    • Use pipeline cache.

    • Use specialization constants.

      • This may decrease the number of instructions and registers used by the shader.

      • Specialization constants can also be used instead of offline shader permutations to minimize the amount of bytecode that needs to be shipped with an application.

    • Switching pipelines:

      • Avoid frequently switching between pipelines that use different sets of pipeline stages.

      • Minimize the number of vkCmdBindPipeline  calls; each call has significant CPU and GPU cost.

        • Consider sorting  of drawcalls and/or using a low number of dynamic states.

      • Switching on/off the tessellation, geometry, task and mesh shaders is an expensive operation.

    • Draw calls:

      • Group draw calls, taking into account what kinds of shaders they use.

The Problem
  • Immutable Pipelines.

  • Each combination of inputs requires a dedicated pipeline.

    • Shader, topology, blend mode, vertex layout, cull mode, etc.

    • So if we want to do things like toggle depth-testing on and off, we will need 2 pipelines.

  • Causes a combinatorial explosion of variants.

    • Tens of thousands of pipelines for shipping titles.

  • Building pipelines is a very expensive operation, and we want to minimize the number of pipelines used, as it's critical for performance.

My decisions
  • (2025-08-10)

  • Dynamic State is a must.

  • The use of Shader Object still seems new and may introduce some extra complexity in certain cases.

    • I don't know about mobile support.

  • The use of Graphics Pipeline Libraries sounds interesting, but at the same time it seems limiting in some moments, for Geometry and Tessellation Shaders.

    • I don't know about mobile support.

  • Overall, I believe that refactoring a game object to use Shader Object or Graphics Pipeline Libraries sounds "simple", since it's more about how the pipeline is constructed than how one interacts with shaders or descriptor sets. In other words, it seems like an okay decision to make in the future.

  • Considering the low support, and the fact that I don't have so many pipelines in mind that actually make these solutions necessary, I prefer to use graphics pipelines manually, in the "default" way.

  • Regardless, I believe that using Shader Object or Graphics Pipeline Libraries does not  remove the need to worry about pipeline caching or precautions to avoid switching the pipeline binding all the time.

    • Correct. Extensions change how pipelines are created/linked but do not remove the performance considerations around pipeline creation, pipeline cache usage, or minimizing pipeline re-binding at draw time. Vendors and platform docs recommend pipeline caches, pre-creation, and minimizing pipeline binds.

  • What I will do, therefore: caching and sorting of pipelines based on similarity. I will worry more about binding the pipeline in command buffers and their descriptor sets, than the process of facilitating the creation of new pipelines.

    • This plan aligns with widely recommended practical strategies: use pipeline caches (persist to disk where possible), sort and batch by pipeline/descriptor similarity, and create pipelines asynchronously (background threads) to avoid stutter. These practices address the main runtime pain points regardless of whether you later adopt shader-object or pipeline-library extensions.

    • Your current decisions are internally consistent and align with common, pragmatic industry practice: prefer stable/default graphics pipelines with pipeline caching, sorting, and background creation as the primary strategy, while keeping code organized so you can adopt EXT_shader_object  or EXT_graphics_pipeline_library  later if/when device support and measured benefits justify the switch.

Mutability with VkDynamicState
  • Implemented.

  • It's a must .

  • Not everything has to be immutable.

  • Set desired state while recording command buffers.

  • Over 70 states can be dynamic.

  • If we didn't use this, we would need to create new pipelines whenever we wanted to change the resolution of our rendering.

No pipelines, with EXT_shader_object
  • EXT_shader_object .

  • Sample .

  • Article .

  • Support :

    • Coverage .

    • (2025-09-08) 11.29%.

      • 33.8% Windows.

      • 26.3% Linux.

      • 0% Android.

  • Shader Object and implementation in Odin {7:30 -> 11:56} .

    • Questions :

      • I don't know where pColorAttachmentFormats  and depthAttachmentFormat  are specified.

        • I don't know if it's even necessary to specify them anywhere.

        • The words attachment  or format  do not appear anywhere in the sample or in the spec of the extension.

          pipeline_rendering_create_info := vk.PipelineRenderingCreateInfo{
              sType                   = .PIPELINE_RENDERING_CREATE_INFO,
              colorAttachmentCount    = 1,
              pColorAttachmentFormats = format,
              depthAttachmentFormat   = .D24_UNORM_S8_UINT,
              stencilAttachmentFormat = {},
              viewMask                = 0,
          }
      
    • Code .

    create_shaders :: proc() {
        push_constant_ranges := []vk.PushConstantRange {    // Pipeline
            {
                stageFlags = {.VERTEX, .FRAGMENT},
                size = 128,
            }
        }
        
        /*
        This is not used in the Shader Object.
        The only place that needs this in its code, is when making the call `vk.CmdPushConstants(cmd, g.pipeline_layout, {.VERTEX, .FRAGMENT}, 0, size_of(push), &push)`.
        */
        pipeline_layout_ci := vk.PipelineLayoutCreateInfo {
            sType = .PIPELINE_LAYOUT_CREATE_INFO,
            // flags                  = {},
            // setLayoutCount         = 1,
            // pSetLayouts            = {},
            pushConstantRangeCount = u32(len(push_constant_ranges)),
            pPushConstantRanges = raw_data(push_constant_ranges),
        }
        check(vk.CreatePipelineLayout(g.device, &pipeline_layout_ci, nil, &g.pipeline_layout))  // Pipeline
    
    
        vert_code := load_file("shaders/shader.vert.spv", context.temp_allocator)  // Shader_Info
        frag_code := load_file("shaders/shader.frag.spv", context.temp_allocator)  // Shader_Info
        shader_cis := [2]vk.ShaderCreateInfoEXT {
            {
                sType = .SHADER_CREATE_INFO_EXT,
                codeType = .SPIRV,
                codeSize = len(vert_code),
                pCode = raw_data(vert_code),
                pName = "main",
                stage = {.VERTEX},
                nextStage = {.FRAGMENT},
                flags = {.LINK_STAGE},
                // setLayoutCount:         u32,
                // pSetLayouts:            [^]DescriptorSetLayout,
                pushConstantRangeCount = u32(len(push_constant_ranges)),
                pPushConstantRanges = raw_data(push_constant_ranges),
                // pSpecializationInfo:    ^SpecializationInfo,
            },
            {
                sType = .SHADER_CREATE_INFO_EXT,
                codeType = .SPIRV,
                codeSize = len(frag_code),
                pCode = raw_data(frag_code),
                pName = "main",
                stage = {.FRAGMENT},
                // nextStage:              ShaderStageFlags,
                flags = {.LINK_STAGE},
                // setLayoutCount:         u32,
                // pSetLayouts:            [^]DescriptorSetLayout,
                pushConstantRangeCount = u32(len(push_constant_ranges)),
                pPushConstantRanges = raw_data(push_constant_ranges),
                // pSpecializationInfo:    ^SpecializationInfo,
            },
        }
        check(vk.CreateShadersEXT(g.device, 2, raw_data(&shader_cis), nil, raw_data(&g.shaders)))
    }
    
    destroy_shaders :: proc() {
        vk.DestroyPipelineLayout(g.device, g.pipeline_layout, nil)
        for shader in g.shaders do vk.DestroyShaderEXT(g.device, shader, nil)
    }
    
    render :: proc(cmd: vk.CommandBuffer) {
        shader_stages := [2]vk.ShaderStageFlags { {.VERTEX}, {.FRAGMENT} }
        vk.CmdBindShadersEXT(cmd, 2, raw_data(&shader_stages), raw_data(&g.shaders))
        
        vk.CmdSetVertexInputEXT(cmd, 0, nil, 0, nil) // Shader_Info: vk.VertexInputBindingDescription, vk.VertexInputAttributeDescription.
    
        vk.CmdSetViewportWithCount(cmd, 1, &vk.Viewport {  // Dynamic
            width = f32(g.swapchain.width),
            height = f32(g.swapchain.height),
            minDepth = 0,
            maxDepth = 1,
        })
        vk.CmdSetScissorWithCount(cmd, 1, &vk.Rect2D {
            extent = {width = g.swapchain.width, height = g.swapchain.height}  // Dynamic
        })
        vk.CmdSetRasterizerDiscardEnable(cmd, false) // Pipeline
    
        vk.CmdSetPrimitiveTopology(cmd, .TRIANGLE_LIST)  // Pipeline
        vk.CmdSetPrimitiveRestartEnable(cmd, false)      // Pipeline
    
        vk.CmdSetRasterizationSamplesEXT(cmd, {._1})     // Pipeline
        sample_mask := vk.SampleMask(1)
        vk.CmdSetSampleMaskEXT(cmd, {._1}, &sample_mask) // Pipeline
        vk.CmdSetAlphaToCoverageEnableEXT(cmd, false)    // Pipeline
    
        vk.CmdSetPolygonModeEXT(cmd, .FILL)              // Pipeline
        vk.CmdSetCullMode(cmd, {})                       // Pipeline
        vk.CmdSetFrontFace(cmd, .COUNTER_CLOCKWISE)      // Pipeline
    
        vk.CmdSetDepthTestEnable(cmd, false)             // Pipeline
        vk.CmdSetDepthWriteEnable(cmd, false)            // Pipeline
        vk.CmdSetDepthBiasEnable(cmd, false)             // Pipeline
        vk.CmdSetStencilTestEnable(cmd, false)           // Pipeline
    
        b32_false := b32(false)
        vk.CmdSetColorBlendEnableEXT(cmd, 0, 1, &b32_false) // Pipeline
    
        color_mask := vk.ColorComponentFlags { .R, .G, .B, .A }
        vk.CmdSetColorWriteMaskEXT(cmd, 0, 1, &color_mask)  // Pipeline
    
        Push :: struct {
            color: [3]f32,
        }
        push := Push { color = { 0, 0.5, 0 } }
        vk.CmdPushConstants(cmd, g.pipeline_layout, {.VERTEX, .FRAGMENT}, 0, size_of(push), &push)
        
        // vk.CmdBindDescriptorSets                         // Dynamic
    
        vk.CmdDraw(cmd, 3, 1, 0, 0)
    }
    
  • Ditch pipelines entirely.

  • Bind compiled shader stages.

  • It was created primarily for the Nintendo Switch, to reduce the performance gap between Vulkan and NVN (the Switch's native API), which doesn't even have the concept of pipeline state objects and maps almost 1:1 to how Nvidia hardware works.

  • If you want to use Shader Objects, the reason should be "I find it much easier to use/maintain". Once your engine grows you'll encounter friction, as the extension is meant for porting old engines and goes against newer features.

  • Support :

    • Hard to recommend due to its limited support.

    • Currently only available on AMD & Nvidia.

    • An emulation layer is provided, which makes shader objects usable on any device that doesn't support them natively, but you need to ship the layer's DLL alongside the application.

  • Shaders :

    • This extension introduces a new object type VkShaderEXT  which represents a single compiled shader stage. VkShaderEXT  objects may be created either independently or linked with other VkShaderEXT  objects created at the same time. To create VkShaderEXT  objects, applications call vkCreateShadersEXT() .

    • This function compiles the source code for one or more shader stages into VkShaderEXT  objects.

    • Optional Linking :

      • Whenever createInfoCount  is greater than one, the shaders being created may optionally be linked together. Linking allows the implementation to perform cross-stage optimizations based on a promise by the application that the linked shaders will always be used together.

      • Though a set of linked shaders may perform anywhere from the same to substantially better than equivalent unlinked shaders, this tradeoff is left to the application and linking is never mandatory.

      • To specify that shaders should be linked, include the SHADER_CREATE_LINK_STAGE_EXT  flag in each of the VkShaderCreateInfoEXT  structures passed to vkCreateShadersEXT() . The presence or absence of SHADER_CREATE_LINK_STAGE_EXT  must match across all VkShaderCreateInfoEXT  structures passed to a single vkCreateShadersEXT()  call: i.e., if any member of pCreateInfos  includes SHADER_CREATE_LINK_STAGE_EXT  then all other members must include it too. SHADER_CREATE_LINK_STAGE_EXT  is ignored if createInfoCount  is one, and a shader created this way is considered unlinked.

    • The stage of the shader being compiled is specified by stage . Applications must also specify which stage types will be allowed to immediately follow the shader being created. For example, a vertex shader might specify a nextStage  value of SHADER_STAGE_FRAGMENT  to indicate that the vertex shader being created will always be followed by a fragment shader (and never a geometry or tessellation shader). Applications that do not know this information at shader creation time or need the same shader to be compatible with multiple subsequent stages can specify a mask that includes as many valid next stages as they wish. For example, a vertex shader can specify a nextStage  mask of SHADER_STAGE_GEOMETRY | SHADER_STAGE_FRAGMENT  to indicate that the next stage could be either a geometry shader or fragment shader (but not a tessellation shader).

    • etc, see the spec .

Reducing compilation overhead, with EXT_graphics_pipeline_libraries
  • EXT_graphics_pipeline_library .

  • Sample .

  • Support :

    • Release: (2022-06-03).

    • Coverage .

    • (2025-09-08) 18.7% coverage.

      • 40.7% Windows.

      • 40.6% Linux.

      • 4.88% Android.

  • Extra info .

    • I've read until the Dynamic State header.

  • Allows separate compilation of different parts of the graphics pipeline. With this it’s now possible to split up the monolithic pipeline creation into different steps and re-use common parts shared across different pipelines.

  • Compared to monolithic pipeline state, this results in faster pipeline creation times, making this extension a good fit for applications and games that do a lot of pipeline creation at runtime.

  • Libraries are partial pipeline objects which cannot be bound directly; they are linked together to form a final executable pipeline.

  • Encourages reuse of compilation work and reduces startup/runtime stutter for games with many similar pipelines.

  • Because libraries are precompiled partial pipelines, linking is generally cheaper than compiling whole pipelines from scratch.

  • Individual pipelines stages :

    • The monolithic pipeline state has been split into distinct parts that can be compiled independently.

    • Vertex Input Interface :

      • Contains the information that would normally be provided to the full pipeline state object by VkPipelineVertexInputStateCreateInfo and VkPipelineInputAssemblyStateCreateInfo.

      • "For our engine, this information is not known until draw time, so a pipeline for this stage is still hashed and created at draw time."

      • This stage contains no shader code, so the driver can create it quickly, and only a fairly small number of these objects is needed.

    • Pre-Rasterization Shaders :

      • Contains the shaders that run before rasterization (vertex, tessellation, geometry), along with state such as VkPipelineViewportStateCreateInfo and VkPipelineRasterizationStateCreateInfo.

    • Fragment Shader :

      • Contains the fragment shader along with the state in VkPipelineDepthStencilStateCreateInfo and VkRenderPass (or dynamic rendering - although in that case only the viewMask is required).

      • If combined with dynamic rendering you can create the fragment shader pipeline with only the SPIR-V and the pipeline layout.
        This allows the driver to do the heavy lifting of lowering to hardware instructions for the pre-rasterization and fragment shaders with very little information.

    • Fragment Output Interface :

      • Contains the VkPipelineColorBlendStateCreateInfo, VkPipelineMultisampleStateCreateInfo, and VkRenderPass (or dynamic rendering)

      • Like with the Vertex Input Interface, this stage requires information that we don’t know until draw time, so this state is also hashed and the Fragment Output Interface pipeline is created at draw time.

      • It is expected to be very quick to create and also relatively small in number.

  • Final link :

    • With all four individual pipeline library stages created, an application can perform a final link to a full pipeline. This final link is expected to be extremely fast - the driver will have done the shader compilation for the individual stages and thus the link can be performed at draw time at a reasonable cost.

    • This is where the big benefit of the extension comes in: we’ve pre-created all of our pre-rasterization and fragment shaders, hashed the small number of vertex input/fragment output interfaces, and can on-demand create a fast linked pipeline library at draw time, thus avoiding a dreaded hitch.

  • If shader compilation stutter is your concern, this extension is the way to go. This extension lets you create partially-constructed PSOs (Pipeline State Objects) (e.g. one for the Vertex Shader, another for the Pixel Shader), and then combine them to generate the final PSO. This allows splitting the huge monolithic block into smaller monolithic blocks that are easier to handle and design around, making the API more D3D11-like (D3D11 has monolithic Rasterizer State blocks and Blend State blocks).

  • Creating pipeline libraries :

    • Creating a pipeline library (part) is similar to creating a pipeline, with the difference that you only need to specify the properties required for that specific pipeline state.

      • E.g. for the vertex input interface you only specify input assembly and vertex input state, which is all required to define the interfaces to a vertex shader.

    VkGraphicsPipelineLibraryCreateInfoEXT library_info{};
    library_info.sType = STRUCTURE_TYPE_GRAPHICS_PIPELINE_LIBRARY_CREATE_INFO_EXT;
    library_info.flags = GRAPHICS_PIPELINE_LIBRARY_VERTEX_INPUT_INTERFACE_EXT;
    
    VkPipelineInputAssemblyStateCreateInfo       input_assembly_state  = vkb::initializers::pipeline_input_assembly_state_create_info(PRIMITIVE_TOPOLOGY_TRIANGLE_LIST, 0, FALSE);
    VkPipelineVertexInputStateCreateInfo         vertex_input_state    = vkb::initializers::pipeline_vertex_input_state_create_info();
    std::vector<VkVertexInputBindingDescription> vertex_input_bindings = {
        vkb::initializers::vertex_input_binding_description(0, sizeof(Vertex), VERTEX_INPUT_RATE_VERTEX),
    };
    std::vector<VkVertexInputAttributeDescription> vertex_input_attributes = {
        vkb::initializers::vertex_input_attribute_description(0, 0, FORMAT_R32G32B32_SFLOAT, 0),
        vkb::initializers::vertex_input_attribute_description(0, 1, FORMAT_R32G32B32_SFLOAT, sizeof(float) * 3),
        vkb::initializers::vertex_input_attribute_description(0, 2, FORMAT_R32G32_SFLOAT, sizeof(float) * 6),
    };
    vertex_input_state.vertexBindingDescriptionCount   = static_cast<uint32_t>(vertex_input_bindings.size());
    vertex_input_state.pVertexBindingDescriptions      = vertex_input_bindings.data();
    vertex_input_state.vertexAttributeDescriptionCount = static_cast<uint32_t>(vertex_input_attributes.size());
    vertex_input_state.pVertexAttributeDescriptions    = vertex_input_attributes.data();
    
    VkGraphicsPipelineCreateInfo pipeline_library_create_info{};
    pipeline_library_create_info.sType               = STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
    pipeline_library_create_info.flags               = PIPELINE_CREATE_LIBRARY_KHR | PIPELINE_CREATE_RETAIN_LINK_TIME_OPTIMIZATION_INFO_EXT;
    pipeline_library_create_info.pNext               = &library_info;
    pipeline_library_create_info.pInputAssemblyState = &input_assembly_state;
    pipeline_library_create_info.pVertexInputState   = &vertex_input_state;
    
    vkCreateGraphicsPipelines(get_device().get_handle(), pipeline_cache, 1, &pipeline_library_create_info, nullptr, &pipeline_library.vertex_input_interface);
    
  • Deprecating shader modules :

    • With this extension, creating shader modules with vkCreateShaderModule  has been deprecated and you can instead just pass the shader module create info via pNext  into your pipeline shader stage create info. This change bypasses a useless copy and is recommended.

    • You can see this in the pre-rasterization and fragment shader library setup parts of the sample below.

    VkShaderModuleCreateInfo shader_module_create_info{};
    shader_module_create_info.sType    = STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
    shader_module_create_info.codeSize = static_cast<uint32_t>(spirv.size()) * sizeof(uint32_t);
    shader_module_create_info.pCode    = spirv.data();
    
    VkPipelineShaderStageCreateInfo shader_stage_create_info{};
    shader_stage_create_info.sType = STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    // Chain the shader module create info
    shader_stage_create_info.pNext = &shader_module_create_info;
    shader_stage_create_info.stage = SHADER_STAGE_VERTEX;
    shader_stage_create_info.pName = "main";
    
    VkGraphicsPipelineCreateInfo pipeline_library_create_info{};
    pipeline_library_create_info.stageCount = 1;
    pipeline_library_create_info.pStages    = &shader_stage_create_info;
    
  • Linking executables :

    • Once all pipeline (library) parts have been created, the pipeline executable can be linked together from them:

    std::vector<VkPipeline> libraries = {
        pipeline_library.vertex_input_interface,
        pipeline_library.pre_rasterization_shaders,
        fragment_shader,
        pipeline_library.fragment_output_interface
    };
    
    // Link the library parts into a graphics pipeline
    VkPipelineLibraryCreateInfoKHR linking_info{};
    linking_info.sType        = STRUCTURE_TYPE_PIPELINE_LIBRARY_CREATE_INFO_KHR;
    linking_info.libraryCount = static_cast<uint32_t>(libraries.size());
    linking_info.pLibraries   = libraries.data();
    
    VkGraphicsPipelineCreateInfo executable_pipeline_create_info{};
    executable_pipeline_create_info.sType = STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
    executable_pipeline_create_info.pNext = &linking_info;
    executable_pipeline_create_info.flags = PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_EXT;
    
    VkPipeline executable = NULL_HANDLE;
    vkCreateGraphicsPipelines(get_device().get_handle(), thread_pipeline_cache, 1, &executable_pipeline_create_info, nullptr, &executable);
    
    • This produces the pipeline state object that will be bound at draw time.

    • A note on PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_EXT : This is an optimization flag. If specified, implementations are allowed to do additional optimization passes. This may increase build times but can in turn result in lower runtime costs.

  • Independent Descriptor Sets :

    • Imagine a situation where the vertex and fragment stage accesses two different descriptor sets.

    // Vertex Shader
    layout(set = 0) UBO_X;
    
    // Fragment Shader
    layout(set = 1) UBO_Y;
    
    • Normally when compiling a pipeline, both stages are together and internally a driver will reserve 2 separate descriptor slots for UBO_X  and UBO_Y . When using graphics pipeline libraries, the driver will see the fragment shader only uses a single descriptor set. It might internally map it to set 0 , but when linking the two libraries, there will be a collision. The PIPELINE_LAYOUT_CREATE_INDEPENDENT_SETS_EXT  flag ensures the driver will be able to handle this case and not have any collisions. There are some extra constraints when using this flag, but the Validation Layers will detect them for you.

  • Explanation .

    • .

    • .

      • Same number of pipelines, but acquired through reuse, instead of recompilation.

      • Think of the link step as additive, instead of multiplicative.

    • .

    • .

    • Considerations :

      • At the time it was said there would be an impact on CPU.

      • It was unknown whether it was compatible with mobile or not.

      • No libraries were made for Geometry and Tessellation Shaders, as they are difficult.

~One pipeline per shader variant
  • It is the cause of the problem listed above.

  • Causes a combinatorial explosion of variants.

Single pipeline, branch inside shader (material ID / push constant)
  • No way, seems horrible.

Optimizations

Pipeline Cache, with VkPipelineCache
  • Pipeline cache sample .

  • Pipeline Cache .

  • Pipeline Cache .

  • It allows the driver to reuse previously computed pipeline artifacts across pipeline creations (and you can persist cache data between runs).

  • Avoids repeating expensive driver  work; shortens startup time by reusing previously compiled artifacts.

  • Creating a Vulkan pipeline requires compiling the VkShaderModule  internally, which can significantly increase frame time if done at runtime. To reduce this time, you can provide a previously initialised VkPipelineCache  object when calling the vkCreateGraphicsPipelines  or vkCreateComputePipelines  functions. This object behaves like a cache container which stores the pipeline internal representation for reuse. In order to benefit from using a VkPipelineCache  object, the data recorded during pipeline creation needs to be saved to disk and reused between application runs.

  • Vulkan allows an application to obtain the binary data of a VkPipelineCache  object and save it to a file on disk before terminating the application. This operation can be achieved using two calls to the vkGetPipelineCacheData  function to obtain the size and VkPipelineCache  object’s binary data. In the next application run, the VkPipelineCache  can be initialised with the previous run’s data. This will allow the vkCreateGraphicsPipelines  or vkCreateComputePipelines  functions to reuse the baked state and avoid repeating costly operations such as shader compilation.

  • How to use it :

    • Create one VkPipelineCache  for related pipeline creation operations (often one per device).

    • Pass it into vkCreateGraphicsPipelines  for every create call.

    • On exit (or periodically) call vkGetPipelineCacheData  and write to disk; on startup feed that blob into vkCreatePipelineCache  to prepopulate the cache.

  • KHR_pipeline_binary

    • VkPipelineCache  objects were designed to enable a Vulkan driver to reuse blobs of state or shader code between different pipelines. Originally, the idea was that the driver would know best which parts of state could be reused, and applications only needed to manage storage and threading, simplifying developer code.

    • Over time however, VkPipelineCache  objects proved to be too opaque, prompting the Vulkan Working Group to release a number of extensions to provide more application control over them. The current capabilities of VkPipelineCache  objects satisfy many applications, but have shortcomings in more advanced use cases.

    • Previous difficulties :

      • The VkPipelineCache  API provides no control over the lifetime of the binary objects that it contains. An application wanting to implement an LRU cache, for example, has a hard time using VkPipelineCache  objects.

      • Some applications maintain a cache of VkPipeline objects. The VkPipelineCache API makes it impossible to efficiently associate the cached binary objects within a VkPipelineCache object with the application’s own cache entries.

    • What’s more, most drivers maintain an internal cache of pipeline-derived binary objects. In some cases, it would be beneficial for the application to directly interact with that internal cache, especially on some specialized platforms.

    • The new KHR_pipeline_binary  extension introduces a clean new approach that provides applications with access to binary blobs and the information necessary for optimal caching, while smoothly integrating with the application’s own caching mechanisms.

    • It’s worth noting that the EXT_shader_object  extension already includes analogous functionality to KHR_pipeline_binary . The two extensions were worked on concurrently to provide a universally available solution, including devices where the EXT_shader_object  extension cannot yet be supported.

    • Applications that do not need the advanced functionality of the new KHR_pipeline_binary extension can continue to use VkPipelineCache objects for their simplicity and optimized implementation. But developers that are not satisfied with the VkPipelineCache API should read on to learn more about this powerful new approach.

    • Article .

      • Read up to 'Caching With KHR_pipeline_binary'.

Optimizing the Shader with KHR_buffer_device_address
Pipeline derivatives
  • A creation mechanism to tell the driver that one pipeline is a parent and others are children (derivatives).

  • The driver may avoid redoing expensive compile/link steps and reuse intermediate data from the parent, reducing creation time.

  • The intent is faster creation of children by reusing work/data from the parent.

  • The pipeline creation API provides no way to tell it what state will change. The idea being that, since the implementation can see the parent's state, and it can see what you ask of the child's state, it can tell what's different.

  • Is it worth it?  NO.

    • TLDR :

      • No vendor is actually recommending the use of pipeline derivatives, except maybe to speed up pipeline creation.

    • Tips and Tricks: Vulkan Dos and Don’ts .

      • Don’t expect speedup from Pipeline Derivatives.

    • Vulkan Usage Recommendations , Samsung

      • Pipeline derivatives let applications express "child" pipelines as incremental state changes from a similar "parent"; on some architectures, this can reduce the cost of switching between similar states.

      • Many mobile GPUs gain performance primarily through pipeline caches, so pipeline derivatives often provide no  benefit to portable mobile applications.

      • Recommendations:

        • Create pipelines early in application execution. Avoid pipeline creation at draw time.

        • Use a single pipeline cache  for all pipeline creation.

        • Write the pipeline cache to a file between application runs.

        • Avoid pipeline derivatives.

    • Vulkan Best Practice for Mobile Developers - Pipeline Management , Arm Software, Jul 11, 2019

      • Don't create pipelines at draw time without a pipeline cache (introduces performance stutters).

      • Don't use pipeline derivatives as they are not supported.

    • Vulkan Samples, LunarG - API-Samples/pipeline_derivative/pipeline_derivative.cpp

      • This sample creates a pipeline derivative and draws with it. Pipeline derivatives should allow for faster creation of pipelines.

      • In this sample, we'll create the default pipeline, but then modify it slightly and create a derivative.

      • The derivative will be used to render a simple cube. We may later find that the pipeline is too simple to show any speedup, or that replacing the fragment shader is too expensive, so this sample can be updated then.

  • Typical use case :

    • Many pipelines that differ only by a few fields (e.g., different specializations or small state changes).

  • How to use :

    • Create a base pipeline with PIPELINE_CREATE_ALLOW_DERIVATIVES .

    • For similar pipelines (small shader or state differences), create child pipelines with PIPELINE_CREATE_DERIVATIVE  and set basePipelineHandle  or basePipelineIndex  pointing to the base.

  • How it affects the pipeline workflow :

    • Can materially reduce pipeline creation cost when many similar pipelines are needed.

    • Useful at runtime if you must create many variants quickly.

    • Still creates separate pipeline objects (state memory + driver bookkeeping).

  • Not guaranteed to be implemented with identical performance gains on all drivers; behavior is driver-dependent.

Compute Pipeline

Use cases
  • Calculate images from complex postprocessing chains.

  • Raytracing or other non-geometry drawing.

Creation
  • We first need to create the pipeline layout for it, and then hook up a single shader module for its code.

  • Once it's built, we can execute the compute shader by first calling vkCmdBindPipeline  and then vkCmdDispatch .

Using
  • You generally want to use a memory barrier after the dispatch of the compute shader, so that any subsequent access to its data waits for the compute shader to finish.

    • In OpenGL, glMemoryBarrier  with GL_SHADER_STORAGE_BARRIER_BIT  is used.

Workgroup
  • vkCmdDispatch .

  • For an image, I decided to use only 2 of those dimensions, so that we can execute one workgroup per group of pixels in the image.

  • When executing compute shaders, they will get executed in groups of N lanes/threads.

  • The most difficult part is deciding how to partition the work between the Workgroup count and the Local Size.

  • Local Size is also called Workgroup Size, representing the number of threads inside each Workgroup.

  • .

    • The code is in OpenGL, but the concept is the same.

  • Ideally, the local_size should be a multiple of the GPU's warp/wavefront size, so you don't waste processing power.

  • For layout(local_size_x = 3, local_size_y = 4, local_size_z = 2) , you'll use 3 * 4 * 2 = 24  threads, which is not ideal for NVIDIA's warp size of 32.

  • .

GLSL Built-in Variables
Examples
  • The shader code is a very simple shader that creates a gradient from the coordinates of the global invocation ID.

//GLSL version to use
#version 460

//size of a workgroup for compute
layout (local_size_x = 16, local_size_y = 16) in;

//descriptor bindings for the pipeline
layout(rgba16f,set = 0, binding = 0) uniform image2D image;


void main() 
{
    ivec2 texelCoord = ivec2(gl_GlobalInvocationID.xy);
    ivec2 size = imageSize(image);

    if(texelCoord.x < size.x && texelCoord.y < size.y)
    {
        vec4 color = vec4(0.0, 0.0, 0.0, 1.0);

        if(gl_LocalInvocationID.x != 0 && gl_LocalInvocationID.y != 0)
        {
            color.x = float(texelCoord.x)/(size.x);
            color.y = float(texelCoord.y)/(size.y); 
        }
    
        imageStore(image, texelCoord, color);
    }
}
  • Inside the shader itself, we can see layout (local_size_x = 16, local_size_y = 16) in;  (z=1 by default).

    • By doing that, we are setting the size of a single workgroup.

    • This means that for every work unit from the vkCmdDispatch , we will have 16x16 lanes of execution, which works well to write into a 16x16 pixel square.

  • The next layout statement is for the shader input through descriptor sets. We are setting a single image2D as set 0 and binding 0 within that set.

  • If local invocation ID is 0 on either X or Y, we will just default to black. This is going to create a grid that will directly display our shader workgroup invocations.

  • On the shader code, we can access what the lane index is through gl_LocalInvocationID  variable.

  • There are also gl_GlobalInvocationID  and gl_WorkGroupID . By using those variables we can find out exactly which pixel each lane writes to.

Compute Shader Raytracing

Resources

  • Resources are views of memory with associated formatting and dimensionality.

  • Nvidia: Make sure to always use the minimum set of resource usage flags. Redundant flags may trigger redundant flushes and stalls in barriers and slow down your app unnecessarily.

  • Resource Creation .

Primary resources
  • Buffers.

    • Provide access to raw arrays of bytes.

  • Images.

    • Can  be multidimensional and may  have associated metadata.

  • Tensors.

    • Can  be multidimensional, contain format information like images and may  have associated metadata.

  • Samplers.

    • Used to sample from images at certain coordinates, producing interpolated color values.

  • Micromaps .

    • Use buffers as the backing store for opaque data structures.

  • Acceleration Structures .

    • Use buffers as the backing store for opaque data structures.

    • Used for realtime raytracing.

Buffers

  • Buffers in Vulkan are regions of memory used for storing arbitrary data that can be read by the graphics card.

  • They are essentially unformatted arrays of bytes.

  • Types of Buffers :

    • Unformatted array .

    • Uniform Buffer :

      • It remains uniform during the execution of a command (like a draw call).

      • Only load operations (read only).

        • "Read" == "Load".

        • This allows the GPU to cache them efficiently.

      • Loaded into the L2 cache, and further into an L1 cache.

    • Storage Buffers :

      • Allow Load and Store operations.

      • Supports atomic operations.

      • Data can be loaded from GPU memory into the L2->L1 caches, and shaders can also store data back into memory.

    • Texel Buffers :

      • Uniform Texel Buffer.

      • Storage Texel Buffer.

      • Formatted view.

    • Dynamic Buffers :

      • Dynamic Uniform Buffer.

      • Dynamic Texel Buffer.

    • etc.

  • Queues :

    • Just like the images in the Swapchain, buffers can also be owned by a specific queue family or be shared between multiple at the same time.

      • The buffer will only be used from the graphics queue, so we can stick to exclusive access.

Create
  • vkCreateBuffer()

    • VkBuffer

      • A chunk of GPU-visible memory.

    • VkBufferCreateInfo

      • size

        • Specifies the size of the buffer in bytes. Calculating the byte size of the vertex data is straightforward with sizeof .

      • usage

        • Indicates for which purposes the data in the buffer is going to be used.

        • It is possible to specify multiple purposes using a bitwise or.

      • flags

        • Is used to configure sparse buffer memory, which is not relevant right now. We'll leave it at the default value of 0 .

      • sharingMode

        • Specifying the sharing mode of the buffer when it will be accessed by multiple queue families.

        • The buffer will only be used from the graphics queue, so we can stick to exclusive access.

        • NVIDIA:

          • VkSharingMode  is ignored by the driver, so SHARING_MODE_CONCURRENT  incurs no overhead relative to SHARING_MODE_EXCLUSIVE .

        • SHARING_MODE_EXCLUSIVE

          • Specifies that access to any range or image subresource of the object will be exclusive to a single queue family at a time.

        • SHARING_MODE_CONCURRENT

          • Specifies that concurrent access to any range or image subresource of the object from multiple queue families is supported.

Copy

Images

  • Images contain format information. Can be multidimensional and may have associated metadata.

  • An Image, unlike a Buffer, is almost always used within a View.

  • A texture you can write to and read from.

  • VkImage .

  • Stored as :

    • .

Create
  • VkImageCreateInfo .

    • ImageType

    • extent

      • Specifies the dimensions of the image, basically how many texels there are on each axis.

      • That’s why, for a 2D image, extent.depth  must be 1  instead of 0 .

    • format

    • tiling

    • initialLayout

      • Can only  be one of these 3:

        • UNDEFINED

          • Not usable by the GPU and the very first transition will discard the texels.

        • PREINITIALIZED

          • Not usable by the GPU, but the first transition will preserve the texels.

        • ZERO_INITIALIZED_EXT

      • There are a few situations where it is necessary for the texels to be preserved during the first transition.

        • One example would be if you wanted to use an image as a staging image in combination with the TILING_LINEAR  layout. In that case, you’d want to upload the texel data to it and then transition the image to be a transfer source without losing the data.

      • However, we usually don't need this property and can use UNDEFINED , as we can transition the image to be a transfer destination and then copy texel data to it from a buffer object.

    • usage

    • samples

      • For multisampling.

      • Only relevant for images that will be used as attachments.

      • The default for non-multisampled images is one sample.

    • mipLevels

      • For mipmapping.

    • flags

      • Related to sparse images.

      • Sparse images are images where only certain regions are actually backed by memory.

      • If you were using a 3D texture for a voxel terrain, for example, then you could use this to avoid allocating memory to store large volumes of "air" values.

    • sharingMode

      • Specifies the sharing mode of the image when it will be accessed by multiple queue families.

    • queueFamilyIndexCount

      • Is the number of entries in the pQueueFamilyIndices  array.

    • pQueueFamilyIndices

      • Is a pointer to an array of queue families that will access this image. It is ignored if sharingMode  is not SHARING_MODE_CONCURRENT .
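
The members above fit together as follows; a minimal sketch for a 2D sampled texture filled from a staging buffer (full VK_-prefixed names are used here, and device , width , and height  are assumed to already exist — the values are illustrative):

```cpp
// Illustrative values; format and usage depend on your asset.
VkImageCreateInfo imageInfo{};
imageInfo.sType         = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO;
imageInfo.imageType     = VK_IMAGE_TYPE_2D;
imageInfo.extent        = {width, height, 1};        // depth must be 1 for 2D
imageInfo.mipLevels     = 1;
imageInfo.arrayLayers   = 1;
imageInfo.format        = VK_FORMAT_R8G8B8A8_SRGB;
imageInfo.tiling        = VK_IMAGE_TILING_OPTIMAL;
imageInfo.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED; // first transition discards texels
imageInfo.usage         = VK_IMAGE_USAGE_TRANSFER_DST_BIT | VK_IMAGE_USAGE_SAMPLED_BIT;
imageInfo.samples       = VK_SAMPLE_COUNT_1_BIT;     // not an attachment: one sample
imageInfo.sharingMode   = VK_SHARING_MODE_EXCLUSIVE; // single queue family

VkImage image;
if (vkCreateImage(device, &imageInfo, nullptr, &image) != VK_SUCCESS) {
    // handle error
}
```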

Types
  • Tells Vulkan what kind of coordinate system the texels in the image are going to be addressed with.

  • 1D images

    • Can be used to store an array of data or a gradient.

  • 2D images

    • Are mainly used for textures.

  • 3D images

    • Can be used to store voxel volumes, for example.

Usages
  • Storage Image :

    • Load and Store.

    • Similar to a Storage Buffer.

  • Sampled Image :

    • Only load operations (read only).

    • Similar to Uniform Buffers.

    • The coordinates are between 0.0 and 1.0.

    • If a coordinate doesn't exactly match a pixel, the result is an interpolation between the neighbouring pixels.

  • Input Attachment :

    • Only load operations (read only).

    • Within a renderpass.

    • Framebuffer-local.

      • Access to single coordinate only.

      • No access to other coordinates in that image.

Formats
  • Formats .

  • Compatible Formats .

  • Numeric Format .

  • R8G8B8_SRGB

    • Channels stored as 0–255.

    • After conversion, the values are in the 0-1 floating-point range.

    • Interpreted using the sRGB nonlinear transfer function (gamma correction).

    • When sampled, values are converted to linear color space in the shader automatically.

  • R8G8B8_UNORM

    • Each 8-bit channel is an unsigned  normalized integer.

    • Storage range: 0–255.

    • Interpreted as floating-point in the shader:

      • 0 → 0.0

      • 255 → 1.0

      • Linear mapping between.

  • R8G8B8_SNORM

    • Each 8-bit channel is a signed  normalized integer.

    • Storage range: –128 to +127.

    • Interpreted as floating-point in the shader:

      • –128 → –1.0

      • +127 → +1.0

      • Linear mapping between.

Tiling
  • Nvidia: Always use TILING_OPTIMAL .

    • TILING_LINEAR  is not optimal. Use a staging buffer and vkCmdCopyBufferToImage()  to update images on the device.

  • Unlike the layout of an image, the tiling mode cannot  be changed at a later time.

  • TILING_OPTIMAL

    • The layout is opaque/driver-chosen.

    • Described as an implementation-dependent (opaque) arrangement in which the driver/GPU may reorder/tile texels for efficient access; it is the intended layout for GPU use.

    • When to use :

      • Image is used as a framebuffer attachment, sampled texture, or otherwise heavily used by the GPU (most rendering targets).

      • You want the GPU/driver to choose a layout that maximizes memory locality and bandwidth for rendering.

      • You will perform GPU-side post-processing / tonemapping / sampling / blits before presentation.

  • TILING_LINEAR

    • The layout is row-major/predictable.

    • Lays out texels in row-major order (with row padding possible) and is the layout for which vkGetImageSubresourceLayout  returns meaningful offsets for host access; that is the mechanism used when an application needs direct CPU mapping/reading of image memory.

      • However, in practice applications usually do GPU render → copy to a host-visible staging buffer/image rather than render directly into a linear-host-visible image.

    • LINEAR tiling has functional and performance limitations (fewer supported formats/usages and worse GPU access patterns), which is why it is rarely used for main rendering. Typical use cases are CPU upload/download, debugging, or very small offscreen images; CPU readback is not just theoretically possible but the primary practical use. You must query format/usage support for linear tiling, because many formats and usages are unsupported in LINEAR.

    • When to use :

      • You explicitly need to map the image memory from the CPU (direct host read/write) and the driver reports support for the requested format/usage in linear tiling.

      • Use cases: readback for screenshots/debugging, direct CPU uploads for small resources, or special interop scenarios where a row-major layout is required.

  • GPU OPTIMAL to Host-Visible :

    • Strategy applied for 'creating a texture from file' .

      • If you want to be able to directly access texels in the memory of the image, then you must use TILING_LINEAR . We will be using a staging buffer instead of a staging image, so this won't be necessary. We will be using TILING_OPTIMAL  for efficient access from the shader.

    • TLDR : OPTIMAL  + explicit transfer to a host-visible staging resource when needed.

    • Create your render target as OPTIMAL  and allocate DEVICE_LOCAL  memory (fast GPU local). After rendering, copy  or blit  the image to a host-visible staging resource (either a buffer via vkCmdCopyImageToBuffer  or a LINEAR image) and map that staging resource for CPU access. This avoids depending on limited linear support and keeps the GPU path fast.
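
A sketch of that strategy as command recording ( cmd , image , stagingBuffer , width , and height  are assumed, and the image is assumed to have already been transitioned to TRANSFER_SRC_OPTIMAL  by a barrier):

```cpp
// Copy the rendered OPTIMAL image into a host-visible staging buffer.
VkBufferImageCopy region{};
region.bufferOffset      = 0;
region.bufferRowLength   = 0;   // 0 = tightly packed rows
region.bufferImageHeight = 0;
region.imageSubresource  = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1};
region.imageOffset       = {0, 0, 0};
region.imageExtent       = {width, height, 1};

vkCmdCopyImageToBuffer(cmd, image,
                       VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
                       stagingBuffer, 1, &region);
// After submitting and waiting, vkMapMemory on the staging allocation
// gives the CPU a row-major view of the pixels.
```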

Layouts
  • GENERAL

    • Supports all types of device access, unless specified otherwise.

    • If the unifiedImageLayouts  feature is enabled, the GENERAL  image layout may  be used in place of the other layouts where allowed with no loss of performance.

      • VkPhysicalDeviceUnifiedImageLayoutsFeaturesKHR .

        • Can be included in the pNext  chain of the VkPhysicalDeviceFeatures2  structure passed to vkGetPhysicalDeviceFeatures2 .

        • KHR_unified_image_layouts .

          • This extension significantly simplifies synchronization in Vulkan by removing the need for image layout transitions in most cases. In particular, it guarantees that using the GENERAL  layout everywhere possible is just as efficient as using the other layouts.

          • In the interest of simplifying synchronization in Vulkan, this extension removes image layouts altogether as much as possible. As such, this extension is fairly simple.

          • Proposal .

          • Article .

          • Interacts with :

            • VERSION_1_3

            • EXT_attachment_feedback_loop_layout

            • KHR_dynamic_rendering

          • Support :

        • unifiedImageLayouts  (boolean)

          • Specifies whether usage of GENERAL , where valid, incurs no loss in efficiency.

          • Additionally, it indicates whether it can  be used in place of ATTACHMENT_FEEDBACK_LOOP_OPTIMAL_EXT .

        • unifiedImageLayoutsVideo  (boolean)

          • Specifies whether GENERAL  can be used in place of any of the following image layouts with no loss in efficiency.

          • VIDEO_DECODE_DST

          • VIDEO_DECODE_SRC

          • VIDEO_DECODE_DPB

          • VIDEO_ENCODE_DST

          • VIDEO_ENCODE_SRC

          • VIDEO_ENCODE_DPB

          • VIDEO_ENCODE_QUANTIZATION_MAP

    • It can be a useful catch-all image layout, but there are situations where a dedicated image layout must be used instead. For example:

      • PRESENT_SRC .

      • SHARED_PRESENT .

      • VIDEO_DECODE_SRC , VIDEO_DECODE_DST , and VIDEO_DECODE_DPB  without the unifiedImageLayoutsVideo  feature.

      • VIDEO_ENCODE_SRC , VIDEO_ENCODE_DST , and VIDEO_ENCODE_DPB  without the unifiedImageLayoutsVideo  feature.

      • VIDEO_ENCODE_QUANTIZATION_MAP  without the unifiedImageLayoutsVideo  feature.

    • While GENERAL  suggests that all types of device access are possible, it does not mean that all patterns of memory accesses are safe in all situations.

      • Common Render Pass Data Races  outlines some situations where data races are unavoidable. For example, when a subresource is used as both an attachment and a sampled image (i.e., not an input attachment), enabling feedback loop  adds extra guarantees which GENERAL  alone does not.

  • Only in initialLayout :

    • UNDEFINED

      • Specifies that the layout is unknown.

      • This layout can  be used as the initialLayout  member of VkImageCreateInfo .  Image memory cannot  be transitioned into this layout.

      • This layout can  be used in place of the current image layout in a layout transition, but doing so will cause the contents of the image’s memory to be undefined.

    • PREINITIALIZED

      • Specifies that an image’s memory is in a defined layout and can  be populated by data, but that it has not yet been initialized by the driver.

      • This layout can  be used as the initialLayout  member of VkImageCreateInfo .  Image memory cannot  be transitioned into this layout.

      • This layout is intended to be used as the initial layout for an image whose contents are written by the host, and hence the data can  be written to memory immediately, without first executing a layout transition.

      • Currently, PREINITIALIZED  is only useful with linear  images because there is not a standard layout defined for TILING_OPTIMAL  images.

    • ZERO_INITIALIZED_EXT

      • Specifies that an image’s memory is in a defined layout and is zeroed, but that it has not yet been initialized by the driver.

      • This layout can  be used as the initialLayout  member of VkImageCreateInfo . Image memory cannot  be transitioned into this layout.

      • This layout is intended to be used as the initial layout for an image whose contents are already zeroed, either from being explicitly set to zero by an application or from being allocated with MEMORY_ALLOCATE_ZERO_INITIALIZE_EXT .

      • Only if zeroInitializeDeviceMemory  feature is enabled.

  • Transfer :

    • TRANSFER_SRC_OPTIMAL

      • It must  only be used as a source image of a transfer command (see the definition of PIPELINE_STAGE_TRANSFER ).

      • This layout is valid only  for image subresources of images created with the USAGE_TRANSFER_SRC  usage bit enabled.

    • TRANSFER_DST_OPTIMAL

      • It must  only be used as a destination image of a transfer command.

      • This layout is valid only for image subresources of images created with the USAGE_TRANSFER_DST  usage bit enabled.

  • Present :

    • PRESENT_SRC

      • It must  only be used for presenting a presentable image for display.

    • SHARED_PRESENT

      • Is valid only for shared presentable images, and must  be used for any usage the image supports.

  • Read :

    • READ_ONLY_OPTIMAL

      • Specifies a layout allowing read only access as an attachment, or in shaders as a sampled image, combined image/sampler, or input attachment.

    • DEPTH_READ_ONLY_OPTIMAL

      • Specifies a layout for the depth aspect of a depth/stencil format image allowing read-only access as a depth attachment or in shaders as a sampled image, combined image/sampler, or input attachment.

    • STENCIL_READ_ONLY_OPTIMAL

      • Specifies a layout for the stencil aspect of a depth/stencil format image allowing read-only access as a stencil attachment or in shaders as a sampled image, combined image/sampler, or input attachment.

    • DEPTH_STENCIL_READ_ONLY_OPTIMAL

      • Specifies a layout for both  the depth and stencil aspects of a depth/stencil format image allowing read only access as a depth/stencil attachment or in shaders as a sampled image, combined image/sampler, or input attachment.

      • It is equivalent to DEPTH_READ_ONLY_OPTIMAL  and STENCIL_READ_ONLY_OPTIMAL .

    • SHADER_READ_ONLY_OPTIMAL

      • Specifies a layout allowing read-only access in a shader as a sampled image, combined image/sampler, or input attachment.

      • This layout is valid only  for image subresources of images created with the USAGE_SAMPLED  or USAGE_INPUT_ATTACHMENT  usage bits enabled.

  • Attachments :

    • ATTACHMENT_OPTIMAL

      • Specifies a layout that must  only be used with attachment accesses in the graphics pipeline.

    • COLOR_ATTACHMENT_OPTIMAL

      • It must  only be used as a color or resolve attachment in a VkFramebuffer .

      • This layout is valid only for image subresources of images created with the COLOR_ATTACHMENT  usage bit enabled.

      • Nvidia: Use COLOR_ATTACHMENT_OPTIMAL  image layout for color attachments.

    • DEPTH_ATTACHMENT_OPTIMAL

      • Specifies a layout for the depth aspect of a depth/stencil format image allowing read and write access as a depth attachment.

    • STENCIL_ATTACHMENT_OPTIMAL

      • Specifies a layout for the stencil aspect of a depth/stencil format image allowing read and write access as a stencil attachment.

    • DEPTH_STENCIL_ATTACHMENT_OPTIMAL

      • Specifies a layout for both  the depth and stencil aspects of a depth/stencil format image allowing read and write access as a depth/stencil attachment.

      • Equivalent to DEPTH_ATTACHMENT_OPTIMAL  and STENCIL_ATTACHMENT_OPTIMAL .

    • ATTACHMENT_FEEDBACK_LOOP_OPTIMAL_EXT

      • It must  only be used as either a color attachment or depth/stencil attachment and/or read-only access in a shader as a sampled image, combined image/sampler, or input attachment.

      • This layout is valid only  for image subresources of images created with the USAGE_ATTACHMENT_FEEDBACK_LOOP  usage bit enabled and either the USAGE_COLOR_ATTACHMENT  or USAGE_DEPTH_STENCIL_ATTACHMENT  and either the USAGE_INPUT_ATTACHMENT  or USAGE_SAMPLED  usage bits enabled.

    • LAYOUT_RENDERING_LOCAL_READ

      • It must  only be used as either a storage image, or a color or depth/stencil attachment and an input attachment.

      • This layout is valid only  for image subresources of images created with either USAGE_STORAGE , or both USAGE_INPUT_ATTACHMENT  and either of USAGE_COLOR_ATTACHMENT  or USAGE_DEPTH_STENCIL_ATTACHMENT .

    • Attachment Fragment Shading Rate

    • Fragment Density Map :

      • FRAGMENT_DENSITY_MAP_OPTIMAL_EXT

        • It must  only be used as a fragment density map attachment in a VkRenderPass .

        • This layout is valid only  for image subresources of images created with the USAGE_FRAGMENT_DENSITY_MAP  usage bit enabled.

  • Read / Attachment :

    • DEPTH_READ_ONLY_STENCIL_ATTACHMENT_OPTIMAL

      • Specifies a layout for depth/stencil format images allowing read and write access to the stencil aspect as a stencil attachment, and read only access to the depth aspect as a depth attachment or in shaders as a sampled image, combined image/sampler, or input attachment.

      • Equivalent to DEPTH_READ_ONLY_OPTIMAL  and STENCIL_ATTACHMENT_OPTIMAL .

    • DEPTH_ATTACHMENT_STENCIL_READ_ONLY_OPTIMAL

      • Specifies a layout for depth/stencil format images allowing read and write access to the depth aspect as a depth attachment, and read only access to the stencil aspect as a stencil attachment or in shaders as a sampled image, combined image/sampler, or input attachment.

      • Equivalent to DEPTH_ATTACHMENT_OPTIMAL  and STENCIL_READ_ONLY_OPTIMAL .

  • Video :

  • TENSOR_ALIASING_ARM

Image Views
  • Image Views .

  • An image view references a specific part of an image to be used.

  • VkImageViewCreateInfo

    • viewType

      • Allows you to treat images as 1D textures, 2D textures, 3D textures and cube maps.

    • format

    • components

      • Allows you to swizzle the color channels around. For example, you can map all of the channels to the red channel for a monochrome texture. You can also map constant values of 0  and 1  to a channel. In our case we'll stick to the default mapping.

    • subresourceRange

      • Describes what the image's purpose is and which part of the image should be accessed. Our images will be used as color targets without any mipmapping levels or multiple layers.

      • If you were working on a stereographic 3D application, then you would create a Swapchain with multiple layers. You could then create multiple image views for each image representing the views for the left and right eyes by accessing different layers.
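
As a sketch, creating a plain 2D color view over a swapchain image could look like this ( device , swapchainImage , and swapchainFormat  are assumed to already exist):

```cpp
VkImageViewCreateInfo viewInfo{};
viewInfo.sType    = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;
viewInfo.image    = swapchainImage;
viewInfo.viewType = VK_IMAGE_VIEW_TYPE_2D;
viewInfo.format   = swapchainFormat;
// Identity swizzle: keep every channel as-is (the default mapping).
viewInfo.components = {VK_COMPONENT_SWIZZLE_IDENTITY, VK_COMPONENT_SWIZZLE_IDENTITY,
                       VK_COMPONENT_SWIZZLE_IDENTITY, VK_COMPONENT_SWIZZLE_IDENTITY};
// Color aspect, no mipmapping, a single layer.
viewInfo.subresourceRange = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1};

VkImageView view;
vkCreateImageView(device, &viewInfo, nullptr, &view);
```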

Copy: Blit (Copy image to image)
  • Transfer a rectangular region of pixel data from one image to another.

  • Unlike a raw copy ( vkCmdCopyImage ), a blit can perform scaling and apply filtering ( FILTER_LINEAR  or FILTER_NEAREST ), which is consistent with the historical meaning of bit block transfer with optional transformations.

  • Name :

    • Comes from bit block transfer  (sometimes shortened to blt ).

    • It was introduced in the 1970s in the context of 2D graphics systems, particularly at Xerox PARC.

    • The idea was to copy rectangular blocks of bits (pixels)  from one place in memory to another, often with operations like scaling, masking, or raster operations.

  • vkCmdBlitImage2 .

    • commandBuffer

    • pBlitImageInfo

      • VkBlitImageInfo2 .

      • srcImage

        • Is the source image.

      • srcImageLayout

        • Is the layout of the source image subresources for the blit.

      • dstImage

        • Is the destination image.

      • dstImageLayout

        • Is the layout of the destination image subresources for the blit.

      • regionCount

        • Is the number of regions to blit.

      • pRegions

        • VkImageBlit2 .

        • Defines source and destination subresources, offsets, and extents.

        • Can define multiple regions in a single blit call.

        • For each element of the pRegions  array, a blit operation is performed for the specified source and destination regions.

        • Offset :

          • The offset entries specify two corners of the rectangular/box region to blit (one corner and the opposite corner).

          • You normally set offsets[0]  to the region origin (frequently {0,0,0} ) and offsets[1]  to the region end ( {width, height, depth} ), i.e. the bounds.

          • Setting offsets[1]  to the full extent produces the common {0,0,0} -> {w,h,1}  box for a 2D image.

          • The Vulkan spec requires both offsets be provided and documents constraints on them (e.g. for 2D images z  must be 0/1).

        • srcSubresource

          • Is the subresource to blit from.

        • srcOffsets

          • Is a pointer to an array of two VkOffset3D  structures specifying the bounds of the source region within srcSubresource .

        • dstSubresource

          • Is the subresource to blit into.

        • dstOffsets

          • Is a pointer to an array of two VkOffset3D  structures specifying the bounds of the destination region within dstSubresource .

      • filter

        • Is a VkFilter  specifying the filter to apply if the blit requires scaling.

        • Determines how pixels are sampled if scaling occurs.

        • FILTER_NEAREST  for nearest-neighbor scaling.

        • FILTER_LINEAR  for linear interpolation.

      • The source and destination image layouts must be valid for transfer operations (typically TRANSFER_SRC_OPTIMAL  and TRANSFER_DST_OPTIMAL ; GENERAL  is also permitted).

  • Restrictions

    • Blit operations are supported only if the physical device reports FORMAT_FEATURE_BLIT_SRC  for the source format and FORMAT_FEATURE_BLIT_DST  for the destination format.

    • Some formats (like depth/stencil) do not support blitting.

    • Multisampled images cannot be used directly as source or destination.
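
A hedged sketch of a full-image, half-size downscaling blit with vkCmdBlitImage2  ( cmd , srcImage , dstImage , srcW , and srcH  are assumed, and both images are assumed to already be in the transfer layouts):

```cpp
VkImageBlit2 region{};
region.sType = VK_STRUCTURE_TYPE_IMAGE_BLIT_2;
region.srcSubresource = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1};
region.srcOffsets[0]  = {0, 0, 0};
region.srcOffsets[1]  = {srcW, srcH, 1};          // source bounds; z must be 1 for 2D
region.dstSubresource = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1};
region.dstOffsets[0]  = {0, 0, 0};
region.dstOffsets[1]  = {srcW / 2, srcH / 2, 1};  // half size: the blit scales

VkBlitImageInfo2 blitInfo{};
blitInfo.sType          = VK_STRUCTURE_TYPE_BLIT_IMAGE_INFO_2;
blitInfo.srcImage       = srcImage;
blitInfo.srcImageLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;
blitInfo.dstImage       = dstImage;
blitInfo.dstImageLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
blitInfo.regionCount    = 1;
blitInfo.pRegions       = &region;
blitInfo.filter         = VK_FILTER_LINEAR;       // interpolate while scaling

vkCmdBlitImage2(cmd, &blitInfo);
```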

Compression

Depth

Depth Tests

Shader
  • gl_FragDepth

    • Available only in the fragment shader.

    • Is an output  variable that is used to establish the depth value for the current fragment.

    • It is a float .

    • If depth buffering is enabled and no shader writes to gl_FragDepth , then the fixed-function value for depth will be used (this value is contained in the z component of gl_FragCoord ); otherwise, the value written to gl_FragDepth  is used.

    • If a shader statically assigns to gl_FragDepth , then the value of the fragment's depth may be undefined for executions of the shader that don't take that path. That is, if the set of linked fragment shaders statically contains a write to gl_FragDepth , then it is responsible for always writing it.

    • Available in all versions of GLSL.

  • gl_FragCoord

    • Available only in the fragment shader.

    • Is an input  variable that contains the window relative coordinate (x, y, z, 1/w) values for the fragment.

    • This value is the result of fixed functionality that interpolates primitives after vertex processing to generate fragments.

    • Multi-sampling :

      • If multi-sampling, this value can be for any location within the pixel, or one of the fragment samples.

    • Depth :

      • The z  component is the depth value that would be used for the fragment's depth if no shader contained any writes to gl_FragDepth .

      • gl_FragCoord.z  is the depth value of the fragment that your shader is operating on, not  the current value of the depth buffer at the fragment position.

    • Changing the origin, by redeclaring it :

      • gl_FragCoord  may be redeclared with the additional layout qualifier identifiers origin_upper_left  or pixel_center_integer . By default, gl_FragCoord  assumes a lower-left origin for window coordinates and assumes pixel centers are located at half-pixel centers.

      • Example :

        • The (x, y)  location (0.5, 0.5)  is returned for the lower-left-most pixel in a window. The origin of gl_FragCoord  may be changed by redeclaring gl_FragCoord  with the origin_upper_left  identifier. The values returned can also be shifted by half a pixel in both x and y by pixel_center_integer  so it appears the pixels are centered at whole number pixel offsets. This moves the (x, y) value returned by gl_FragCoord  of (0.5, 0.5)  by default to (0.0, 0.0)  with pixel_center_integer .

      • If gl_FragCoord  is redeclared in any fragment shader in a program, it must be redeclared in all fragment shaders in that program that have static use of gl_FragCoord .

      • Redeclaring gl_FragCoord  with any accepted qualifier affects only gl_FragCoord.x  and gl_FragCoord.y .

      • It has no effect on rasterization, transformation or any other part of the OpenGL pipeline or language features.

    • Available in all versions of GLSL.

  • Depth Execution Modes :

    • (2025-10-07) Vulkan supports this.

      • Conservative depth can be enabled in Vulkan the same way as in OpenGL (i.e. with layout(depth_<condition>) out float gl_FragDepth ).

      • You can test it and look at the SPIR-V output.

    • Allows for a possible optimization for implementations that relies on an early depth test to be run before the fragment.

    // assume it may be modified in any way
    layout(depth_any) out float gl_FragDepth;
    
    // assume it may be modified such that its value will only increase
    layout(depth_greater) out float gl_FragDepth;
    
    // assume it may be modified such that its value will only decrease
    layout(depth_less) out float gl_FragDepth;
    
    // assume it will not be modified
    layout(depth_unchanged) out float gl_FragDepth;
    
    • GL_ARB_conservative_depth .

    • Violating the condition yields undefined behavior.

    • The layout qualifier for gl_FragDepth  specifies constraints on the final value of gl_FragDepth  written by any shader invocation.  GL implementations may perform optimizations assuming that the depth test fails (or passes)  for a given fragment if all values of gl_FragDepth  consistent with the layout qualifier would fail (or pass).  If the final value of gl_FragDepth  is inconsistent with its layout qualifier, the result of the depth test for the corresponding fragment is undefined.  However, no error will be generated in this case.  When the depth test passes and depth writes are enabled, the value written to the depth buffer is always the value of gl_FragDepth , whether or not it is consistent with the layout qualifier.

    • <depth_any>

      • The shader compiler will note any assignment to gl_FragDepth  modifying it in an unknown way, and depth testing will always be performed after the shader has executed.

      • By default, gl_FragDepth  assumes the <depth_any>  layout qualifier.

    • <depth_greater>

      • The GL will assume that the final value of gl_FragDepth  is greater than or equal to the fragment's interpolated depth value, as given by the <z>  component of gl_FragCoord .

    • <depth_less>

      • The GL will assume that any modification of gl_FragDepth  will only decrease its value.

    • <depth_unchanged>

      • The shader compiler will honor any modification to gl_FragDepth , but the rest of the GL assume that gl_FragDepth  is not assigned a new value.

    • If gl_FragDepth  is redeclared in any fragment shader in a program, it must be redeclared in all fragment shaders in that program that have static assignments to gl_FragDepth . All redeclarations of gl_FragDepth  in all fragment shaders in a single program must have the same set of qualifiers. Within any shader, the first redeclarations of gl_FragDepth  must appear before any use of gl_FragDepth . The built-in gl_FragDepth  is only predeclared in fragment shaders, so redeclaring it in any other shader stage will be illegal.

Depth Test
  • If the test fails, the fragment is discarded.

  • If the test passes, the depth attachment will be updated with the fragment’s output depth.

Depth Bias
  • Requires the VkPhysicalDeviceFeatures::depthBiasClamp  feature to be supported; otherwise VkPipelineRasterizationStateCreateInfo::depthBiasClamp  must be 0.0f .

  • The depth bias values can be set dynamically  using DYNAMIC_STATE_DEPTH_BIAS  or the DYNAMIC_STATE_DEPTH_BIAS_ENABLE_EXT  from EXT_extended_dynamic_state2 .

  • The rasterizer can alter the depth values by adding a constant value or biasing them based on a fragment’s slope.

  • Controls whether to bias fragment depth values.

  • This is sometimes used for shadow mapping.

  • Bias Constant Factor :

    • Is a scalar factor controlling the constant depth value added to each fragment.

    • Scales the minimum resolvable depth difference r  of the depth attachment's format.

    • " depthBiasConstantFactor  is a scalar factor controlling the constant depth value added to each fragment. The value is in floating point and a typical value seems to be around 2.0-3.0."

  • Bias Slope Factor :

    • Is a scalar factor applied to a fragment’s slope in depth bias calculations.

    • Scales the maximum depth slope m  of the polygon.

    • "I stumbled upon some Vulkan samples that used a much smaller constant bias, but the slope bias was quite high. However, because the slope bias has a much larger weight than the constant one it pretty much worked the same."

  • Bias Clamp :

    • Is the maximum (or minimum) depth bias of a fragment.

    • The scaled depthBiasConstantFactor  and depthBiasSlopeFactor  terms are summed to produce a bias, which is then clamped to the minimum or maximum value specified by depthBiasClamp .

Depth Bounds
  • If the value is not within the depth bounds, the coverage mask is set to zero.

  • Requires the VkPhysicalDeviceFeatures::depthBounds  feature to be supported.

  • The depth bound values can be set dynamically  using DYNAMIC_STATE_DEPTH_BOUNDS  or the DYNAMIC_STATE_DEPTH_BOUNDS_TEST_ENABLE_EXT  from EXT_extended_dynamic_state .

Depth Clamp
  • Controls whether to clamp the fragment’s depth values as described in Depth Test.

  • Before the sample’s Zf  is compared to Za , Zf  is clamped to [min(n,f), max(n,f)] , where n  and f  are the minDepth  and maxDepth  depth range values of the viewport used by this fragment, respectively.

  • If set to TRUE , then fragments that are beyond the near and far planes are clamped to them as opposed to discarding them.

  • This is useful in some special cases like shadow maps .

  • Requires the VkPhysicalDeviceFeatures::depthClamp  feature to be supported.

Depth Attachment

Clearing
  • It is always better to clear a depth buffer at the start of the pass with loadOp  set to ATTACHMENT_LOAD_OP_CLEAR .

  • Depth images can also be cleared outside a render pass using vkCmdClearDepthStencilImage .

  • When clearing, notice that VkClearValue  is a union and VkClearDepthStencilValue depthStencil  should be set instead of the color clear value.
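
As a sketch, selecting the right union member when clearing a depth attachment ( 1.0f  assumes a conventional, non-reversed depth range):

```cpp
VkClearValue clearDepth{};
clearDepth.depthStencil = {1.0f, 0};  // depth cleared to the far plane, stencil to 0
// Not clearDepth.color: VkClearValue is a union, and a depth attachment
// reads the depthStencil member.
```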

Multi-sampling
  • The following post-rasterization operations occur "per-sample". This means that when doing multisampling with a color attachment, any depth buffer VkImage  used alongside it must also have been created with the same VkSampleCountFlagBits  value.

  • A coverage mask  is generated for each fragment, based on which samples within that fragment are determined to be within the area of the primitive that generated the fragment.

  • If a fragment operation results in all bits of the coverage mask being 0 , the fragment is discarded.

  • Resolving :

    • It is possible in Vulkan using the KHR_depth_stencil_resolve  extension (promoted to Vulkan core in 1.2) to resolve multisampled depth/stencil attachments in a subpass in a similar manner as for color attachments.

Depth Image

Formats
  • Nvidia: Prefer using D24_UNORM_S8_UINT  or D32_SFLOAT  depth formats, D32_SFLOAT_S8_UINT  is not optimal.

  • There are a few different depth formats an implementation may expose support for in Vulkan.

  • For reading  from a depth image only D16_UNORM  and D32_SFLOAT  are required to support being read via sampling or blit operations.

  • For writing  to a depth image FORMAT_D16_UNORM  is required to be supported. From here at least one of ( FORMAT_X8_D24_UNORM_PACK32   or   FORMAT_D32_SFLOAT ) and  ( FORMAT_D24_UNORM_S8_UINT   or   FORMAT_D32_SFLOAT_S8_UINT ) must also be supported. This will involve some extra logic when trying to find which format to use if both  the depth and stencil are needed in the same format.

Aspect Masks
  • Required when performing operations such as image barriers or clearing.

  • DEPTH

Sharing Mode
  • Nvidia: VkSharingMode  is ignored by the driver, so SHARING_MODE_CONCURRENT  incurs no overhead relative to SHARING_MODE_EXCLUSIVE .

Layout Transition
// Example of going from undefined layout to a depth attachment to be read and written to

// Core Vulkan example
srcAccessMask = 0;
dstAccessMask = ACCESS_DEPTH_STENCIL_ATTACHMENT_READ | ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE;
sourceStage = PIPELINE_STAGE_TOP_OF_PIPE;
destinationStage = PIPELINE_STAGE_EARLY_FRAGMENT_TESTS | PIPELINE_STAGE_LATE_FRAGMENT_TESTS;

// KHR_synchronization2
srcAccessMask = ACCESS_2_NONE_KHR;
dstAccessMask = ACCESS_2_DEPTH_STENCIL_ATTACHMENT_READ_KHR | ACCESS_2_DEPTH_STENCIL_ATTACHMENT_WRITE_KHR;
sourceStage = PIPELINE_STAGE_2_NONE_KHR;
destinationStage = PIPELINE_STAGE_2_EARLY_FRAGMENT_TESTS_KHR | PIPELINE_STAGE_2_LATE_FRAGMENT_TESTS_KHR;
  • If unsure whether to use only early or only late fragment tests for your application, use both.

Copying
  • Nvidia: Copy both depth and stencil to avoid a slow path for copying.

Reverse Depth Buffer

Normal Reconstruction from Depth

  • You can infer the normals by calculating the derivatives on x and y between pixels of the depth buffer.

  • Discussion .

  • Implementation - Wicked Engine (János Turánszki (turanszkij)) .

  • Implementation - Yuwen Wu (atyuwen) .

  • Need :

    • "In screen-space decals rendering, normal buffer is required to reject pixels projected onto near-perpendicular surfaces. But back then I was working on a forward pipeline, so no normal buffer was outputted. It seemed the best choice was to reconstruct it directly from depth buffer, as long as we could avoid introducing errors, which was not easy though."

    • So, for a forward shading, this could  be necessary.

    • It could be avoided if saving the normals in a texture to be sent to a post-processing pass; aka, if introduced a bit of deferred in the forward renderer.

  • Performance :

    • There's a lot of discussion if this is worthwhile. On a deferred renderer, this could be good, but the gain in performance is not obvious. It really depends on how it was implemented.

Stencil

  • .

  • 1 or 0, depending on whether a fragment from our object covers the pixel.

Used in

Stencil Attachment

  • The PipelineRenderingCreateInfo  asks for a stencilAttachmentFormat , and RenderingInfo  asks for pStencilAttachment .

  • This is for cases where you want separate depth and stencil images, instead of merging them as in a depth image with D24_UNORM_S8_UINT , where the S8_UINT  part holds the stencil.

  • KHR_separate_depth_stencil_layouts .

    • Core in Vulkan 1.2.

    • This extension allows image memory barriers for 'depth+stencil' images to have just one of the IMAGE_ASPECT_DEPTH  or IMAGE_ASPECT_STENCIL  aspect bits set, rather than require both. This allows their layouts to be set independently. Image Layouts IMAGE_LAYOUT_DEPTH_ATTACHMENT_OPTIMAL , IMAGE_LAYOUT_DEPTH_READ_ONLY_OPTIMAL , IMAGE_LAYOUT_STENCIL_ATTACHMENT_OPTIMAL , or IMAGE_LAYOUT_STENCIL_READ_ONLY_OPTIMAL  can be used.

    • To support depth+stencil images with different layouts for the depth and stencil aspects, the depth+stencil attachment interface has been updated to support a separate layout for stencil.

    • VkPhysicalDeviceSeparateDepthStencilLayoutsFeatures .

      • Structure describing whether the implementation can do depth and stencil image barriers separately.

      • It's just a struct with a bool telling if the feature is supported.

    • For render passes / subpasses:

Formats
  • S8_UINT

    • It makes sense, as it's the same format used for stencil in the depth format D24_UNORM_S8_UINT .

Mapping Data to Shaders

Shader Alignment

Minimum Dynamic-Offset / CBV Allocation Granularity
  • GPUs and drivers require that, when you bind or use a portion of a large buffer as a uniform/constant buffer, the start address and/or size line up to an alignment.

  • That alignment is the “minimum dynamic-offset” (Vulkan) or the CBV/constant buffer granularity (D3D12).

  • It lets the driver map many small logical buffers into a single big GPU buffer efficiently.

  • If you bind at an unaligned offset, the API/driver will reject it, or you will get wrong data or degraded performance.

  • Drivers can report 64, 128, 256, or other powers of two.

  • UBO alignment is usually larger than SSBO alignment because UBO usage and caches are handled differently by the hardware.

  • Value :

    • Many APIs and drivers use 256 bytes as the Minimum Dynamic-Offset on common desktop GPUs.

      • VkGuide:

      struct MaterialConstants {  // written into uniform buffers later
          glm::vec4 colorFactors; // multiply the color texture
          glm::vec4 metal_rough_factors;
          glm::vec4 extra[14];
              /*
              Padding: uniform buffers need it anyway, since the structure
              must meet a minimum alignment requirement. 256 bytes is a good
              default that all the GPUs we target meet, so we add these
              vec4s to pad the structure to 256 bytes.
              */
      };
      
    • But not every platform or GPU guarantees 256. Mobile or integrated GPUs may have different values.

    • VkPhysicalDeviceLimits .

      • minUniformBufferOffsetAlignment

        • Is the minimum required  alignment, in bytes, for the offset  member of the VkDescriptorBufferInfo  structure for uniform buffers.

        • When a descriptor of type DESCRIPTOR_TYPE_UNIFORM_BUFFER  or DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC  is updated, the offset   must  be an integer multiple of this limit.

        • Similarly, dynamic offsets for uniform buffers must  be multiples of this limit.

        • The value must  be a power of two.

      • minStorageBufferOffsetAlignment

        • Is the minimum required  alignment, in bytes, for the offset  member of the VkDescriptorBufferInfo  structure for storage buffers.

        • When a descriptor of type DESCRIPTOR_TYPE_STORAGE_BUFFER  or DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC  is updated, the offset   must  be an integer multiple of this limit.

        • Similarly, dynamic offsets for storage buffers must  be multiples of this limit.

        • The value must  be a power of two.

      • minTexelBufferOffsetAlignment

  • Best practice :

    • Query the GPU at runtime and align your buffer ranges to the reported value.

    • Assert size at compile time:

    static_assert(sizeof(MaterialConstants) == 256, "MaterialConstants must be 256 bytes");
    
Default Layouts
Alignment Options
  • Offset and Stride Assignment .

  • There are different alignment requirements depending on the specific resources and on the features enabled.

  • Platform dependency :

    • 32-bit IEEE-754

      • The scalar value is 4 bytes.

      • The standard for desktop, mobile, OpenGL ES and Vulkan.

    • 16-bit half precision :

      • The scalar value is 2 bytes.

      • In rare cases, like embedded or custom OpenGL drivers.

    • 64-bit IEEE-754 double :

      • The scalar value is 8 bytes.

      • Non-standard case.

      • Would require headers redefining GLfloat  as double , not compliant with spec.

  • C layout ≈ std430  only if you manually match packing and alignment. Otherwise, it’s platform-dependent.

| GLSL type                        | C equivalent                                        | Typical C (x86_64) - Alignment |            Typical C (x86_64) - Size | Typical C (x86_64) - Stride |                                                                     std140 - Base Alignment |                std140 - Occupied Size |                          std140 - Stride | std430 - Base Alignment |                                std430 - Occupied Size |                             std430 - Stride |
| -------------------------------- | --------------------------------------------------- | -----------------------------: | -----------------------------------: | --------------------------: | -----------------------------------------------------------------------------------------: | ------------------------------------: | ---------------------------------------: | ----------------------: | ----------------------------------------------------: | ------------------------------------------: |
| bool                            | C _Bool  (native) — or use int32_t  to match GLSL |       _Bool : 1; int32_t : 4 |             _Bool : 1; int32_t : 4 |     _Bool : 1; int32_t : 4 |                                                                                          4 |                                     4 | 16 (std140 rounds scalar arrays to vec4) |                       4 |                                                     4 |                                           4 |
| int  / uint                    | int32_t  / uint32_t                               |                              4 |                                    4 |                           4 |                                                                                          4 |                                     4 |                                       16 |                       4 |                                                     4 |                                           4 |
| float                           | float                                              |                              4 |                                    4 |                           4 |                                                                                          4 |                                     4 |                                       16 |                       4 |                                                     4 |                                           4 |
| double                          | double                                             |                              8 |                                    8 |                           8 |                                                                                          8 |                                     8 |          32 (rounded to dvec4 alignment) |                       8 |                                                     8 |                                           8 |
| vec2  / ivec2                  | float[2]  / int32_t[2]                            |                              4 |                                    8 |                           8 |                                                                                          8 |                                     8 |                                       16 |                       8 |                                                     8 |                                           8 |
| vec3  / ivec3                  | float[3]  / int32_t[3]                            |                              4 |                                   12 |                          12 |                                                                                         16 |                                    16 |                                       16 |                      16 |                                                    16 |                                          16 |
| vec4  / ivec4                  | float[4]  / int32_t[4]                            |                              4 |                                   16 |                          16 |                                                                                         16 |                                    16 |                                       16 |                      16 |                                                    16 |                                          16 |
| dvec2                           | double[2]                                          |                              8 |                                   16 |                          16 |                                                                                         16 |                                    16 |                                       32 |                      16 |                                                    16 |                                          16 |
| dvec3                           | double[3]                                          |                              8 |                                   24 |                          24 |                                                                                         32 |                                    32 |                                       32 |                      32 |                                                    32 |                                          32 |
| dvec4                           | double[4]                                          |                              8 |                                   32 |                          32 |                                                                                         32 |                                    32 |                                       32 |                      32 |                                                    32 |                                          32 |
| mat2  (2×2 float, column-major) | float[2][2]  (2 columns of vec2 )                 |                              4 |                                   16 |             8 (column size) |                                                                                         16 |                           16 × 2 = 32 |      each column has vec4 as stride (16) |                       8 |                                            8 × 2 = 16 |          each column has vec2 as stride (8) |
| mat3  (3×3 float, column-major) | float[3][3]  (3 columns of vec3 )                 |                              4 |                                   36 |            12 (column size) |                                                                                         16 |                           16 × 3 = 48 |      each column has vec4 as stride (16) |                      16 |                                           16 × 3 = 48 |         each column has vec3 as stride (16) |
| mat4  (4×4 float)               | float[4][4]                                        |                              4 |                                   64 |            16 (column size) |                                                                                         16 |                           16 × 4 = 64 |      each column has vec4 as stride (16) |                      16 |                                           16 × 4 = 64 |         each column has vec4 as stride (16) |
| T[]  (Array of T)               | T[]                                                |                     alignof(T) |                            sizeof(T) |                   sizeof(T) | base_align(T), rounded up to vec4 base align (16 for 32-bit scalars; 32 for 64-bit/double) | occupied per element = rounded stride |          base_align(T), rounded up to 16 |           base_align(T) | occupied per element = sizeof(T) rounded to alignment |                               base_align(T) |
| vec3[]  (Array of vec3)         | float[3][]                                         |                              4 |                                   12 |                          12 |                                                                                         16 |                                    16 |                                       16 |                      16 |                                                    16 |                                          16 |
| struct                          | struct { ... }                                     |          max(member alignment) | struct size padded to that alignment |     sizeof(struct) (padded) |                                                  max(member align) rounded up to vec4 (16) |  struct size padded to multiple of 16 |          sizeof(struct) rounded up to 16 |       max(member align) |                  struct size padded to that alignment | sizeof(struct) (padded to member alignment) |

Scalar Alignment
  • Looks like std430 , but its vectors are even more compact?

  • Also known as (?) The spec doesn't say.

  • EXT_scalar_block_layout .

    • Core in Vulkan 1.2.

    • This extension allows most storage types to be aligned in scalar  alignment.

    • Make sure to set --scalar-block-layout  when running the SPIR-V Validator.

    • A big difference is being able to straddle the 16-byte boundary.

    • In GLSL this can be used with the scalar  keyword after enabling the GL_EXT_scalar_block_layout  extension.

Extended Alignment (std140)
  • Source .

  • Conservative, padded layout used for uniform blocks.

  • Widely supported.

  • Caveats :

    • "Avoiding usage of vec3"

      • Usually applies to std140, because some hardware vendors seem to not follow the spec strictly; everything should work when using std430, though.

      • Array of vec3  (ARRAY) :

        • Alignment will be 4× the size of a float  (16 bytes).

        • Size will be alignment × number of elements .

// Scalars
    float ->  4 bytes // for 32-bit IEEE-754
    int   ->  4 bytes // for 32-bit IEEE-754
    uint  ->  4 bytes // for 32-bit IEEE-754
    bool  ->  4 bytes // for 32-bit IEEE-754
    
// Vectors
    // Base alignments
    vec2  ->  8 bytes  // 2 times the underlying scalar type.
    vec3  -> 16 bytes  // 4 times the underlying scalar type.
    vec4  -> 16 bytes  // 4 times the underlying scalar type.
    
// Arrays
    // Size of the element type, rounded up to a multiple of the size of `vec4` (behave like `vec4` slots).
    // Arrays of types are not necessarily tightly packed.
    // An array of floats in such a block will not be equivalent to an array of floats in C/C++. Arrays only match their C/C++ definitions if the element type's size is a multiple of 16 bytes.
    // Ex: `float arr[N]` uses 16 bytes per element.

// Matrices
    // Treated as arrays of vectors. 
    // They are column-major by default; you can change it with `layout(row_major)` or `layout(column_major)`.

// Struct
    // Alignment is that of the biggest struct member, rounded up to a multiple of the size of `vec4` (behaves like `vec4` slots).
    // Struct members are effectively padded so that each member starts on a 16-byte boundary when necessary.
    // The struct size is the space needed by its members, padded to a multiple of the struct's alignment.
  • Examples :

    layout(std140) uniform U { float a[3]; }; // size = 3 * 16 = 48 bytes
    
Base Alignment (std430)
  • Allowed usage :

    • SSBOs, Push Constants.

    • KHR_uniform_buffer_standard_layout .

      • Core in Vulkan 1.2.

      • Allows the use of std430  memory layout in UBOs.

      • These memory layout changes are only applied to Uniforms .

    • KHR_relaxed_block_layout .

      • Core in Vulkan 1.1; all Vulkan 1.1+ devices support relaxed block layout.

      • This extension allows implementations to indicate they can support more variation in block Offset  decorations.

      • This comes up when using std430  memory layout where a vec3  (which is 12 bytes) is still defined as a 16 byte alignment.

      • With relaxed block layout an application can fit a float  on either side of the vec3  and maintain the 16 byte alignment between them.

      • Currently there is no way in GLSL to legally express relaxed block layout, but a developer can use the --hlsl-offsets  with glslang  to produce the desired offsets.

  • Relaxed layout used for shader-storage blocks and allows much tighter packing.

  • Requires newer GLSL 4.3+ or equivalent support.

// Scalars
    float ->  4 bytes // for 32-bit IEEE-754
    int   ->  4 bytes // for 32-bit IEEE-754
    uint  ->  4 bytes // for 32-bit IEEE-754
    bool  ->  4 bytes // for 32-bit IEEE-754
    
// Vectors
    // Base alignments
    vec2  ->  8 bytes  // 2 times the underlying scalar type.
    vec3  -> 16 bytes  // 4 times the underlying scalar type.
    vec4  -> 16 bytes  // 4 times the underlying scalar type.
    
// Arrays
    // array stride is the natural alignment of the element. 
    // Ex: `float arr[N]` uses 4 bytes per element.

// Matrices
    // Treated as arrays of vectors. 
    // They are column-major by default; you can change it with `layout(row_major)` or `layout(column_major)`.
    
// Struct
    // Struct packing is tighter and follows natural alignments (less padding).
  • Examples :

    layout(std430, binding=0) buffer B { float a[3]; }; // size = 3 * 4 = 12 bytes
    

Comparisons

  • Input Attributes vs Uniform Buffers :

    • "If you want a shader to access a resource that is not  on vertex buffers, they must go through a descriptor set".

    • With Vertex Buffers we can pass arbitrary attributes to the vertex shader for each vertex.

    • Including global variables as vertex data would be a waste of memory, and it would require us to update the vertex buffer whenever the transformation changes.

    • The right way to tackle this in Vulkan is to use Resource Descriptors

  • Uniform Buffers vs Push Constants :

    • ChatGPT...

    • "Lets say I want to send the time, would it be optimal to send it as a ubo or a push constant? The time should not change between draw calls (it's irrelevant) in the same frame. Seems like a UBO I only need to set it once and every object would benefit from it, but a push constant needs to be send for every object?"

      • Exactly. For your “time” value, a Uniform Buffer is generally the more optimal choice in this scenario.

      • Push Constants

        • You must call vkCmdPushConstants  for each command buffer section where shaders need it.

        • Since push constants are set per draw/dispatch scope, if you have many objects, you’d be redundantly re-sending the same value (time) multiple times in the same frame.

        • There’s no automatic “shared” state — every pipeline that uses it must get the value pushed explicitly.

      • Uniform Buffers

        • You can store the time in a uniform buffer once per frame, bind it once in a descriptor set, and then every draw call will see the same value without re-uploading.

        • Works well for “global” frame data (view/proj matrices, time, frame index, etc.).

        • Binding a pre-allocated UBO in a descriptor set has low overhead and avoids per-draw constant pushing.

      • Performance implication:

        • If the data is the same for all draws in a frame, a UBO avoids redundant driver calls and state changes, and makes it easier to keep the command buffer lean. Push constants are better suited for per-object or per-draw small data.

  • Storage Image vs. Storage Buffer :

    • While both storage images and storage buffers allow for read-write access in shaders, they have different use cases:

    • Storage Images :

      • Ideal for 2D or 3D data that benefits from texture operations like filtering or addressing modes.

    • Storage Buffers :

      • Better for arbitrary structured data or when you need to access data in a non-uniform pattern.

  • Texel Buffer vs. Storage Buffer :

    • Texel buffers and storage buffers also have different strengths:

    • Texel Buffers :

      • Provide texture-like access to buffer data, allowing for operations like filtering.

    • Storage Buffers :

      • More flexible for general-purpose data storage and manipulation.

  • Do

    • Do keep constant data small, where 128 bytes is a good rule of thumb.

    • Do use push constants if you do not want to set up a descriptor set/UBO system.

    • Do make constant data directly available in the shader if it is pre-determinable, such as with the use of specialization constants.

  • Avoid

    • Avoid indexing in the shader if possible, such as dynamically indexing into buffer  or uniform  arrays, as this can disable shader optimisations on some platforms.

  • Impact

    • Failing to use the correct method for constant data will negatively impact performance, causing reduced FPS and/or increased bandwidth and load/store activity.

    • On Mali, register mapped uniforms are effectively free. Any spilling to buffers in memory will increase load/store cache accesses to the per thread uniform fetches.

Input Attributes

About
  • The only shader stage in core Vulkan that has an input attribute controlled by Vulkan is the vertex shader stage ( SHADER_STAGE_VERTEX ).

    #version 450
    layout(location = 0) in vec3 inPosition;
    
    void main() {
        gl_Position = vec4(inPosition, 1.0);
    }
    
  • Other shader stages, such as a fragment shader stage, have input attributes, but the values are determined from the output of the previous stages run before it.

  • This involves declaring the interface slots when creating the VkPipeline  and then binding the VkBuffer  before draw time with the data to map.

  • Before calling vkCreateGraphicsPipelines  a VkPipelineVertexInputStateCreateInfo  struct will need to be filled out with a list of VkVertexInputAttributeDescription  mappings to the shader.

    VkVertexInputAttributeDescription input = {};
    input.location = 0;
    input.binding  = 0;
    input.format   = FORMAT_R32G32B32_SFLOAT; // maps to vec3
    input.offset   = 0;
    
  • The only thing left to do is bind the vertex buffer and optional index buffer prior to the draw call.

    vkBeginCommandBuffer();
    // ...
    vkCmdBindVertexBuffer();
    vkCmdDraw();
    // ...
    vkCmdBindVertexBuffer();
    vkCmdBindIndexBuffer();
    vkCmdDrawIndexed();
    // ...
    vkEndCommandBuffer();
    
  • Limits :

    • maxVertexInputAttributes

    • maxVertexInputAttributeOffset

Memory Layout
  • .

  • .

  • .

    • Single binding.

  • .

    • One binding per attribute.

  • One binding or many bindings? It doesn't matter that much; in some cases one is better than the other, but don't worry too much about it.

Vertex Input Binding / Vertex Buffer
  • Tell Vulkan how to pass this data format to the vertex shader once it's been uploaded into GPU memory.

  • A vertex binding describes at which rate to load data from memory throughout the vertices.

  • It specifies the number of bytes between data entries and whether to move to the next data entry after each vertex or after each instance.

  • VkVertexInputBindingDescription .

    • binding

      • Specifies the index of the binding in the array of bindings.

    • stride

      • Specifies the number of bytes from one entry to the next.

    • inputRate

      • VERTEX_INPUT_RATE_VERTEX

        • Move to the next data entry after each vertex.

      • VERTEX_INPUT_RATE_INSTANCE

        • Move to the next data entry after each instance.

      • We're not going to use instanced rendering, so we'll stick to per-vertex data.

  • VkVertexInputAttributeDescription

    • Describes how to handle vertex input.

    • An attribute description struct describes how to extract a vertex attribute from a chunk of vertex data originating from a binding description.

    • We have two attributes, position and color, so we need two attribute description structs.

    • binding

      • Tells Vulkan from which binding the per-vertex data comes.

    • location

      • References the location  directive of the input in the vertex shader.

        • The input in the vertex shader with location 0  is the position, which has two 32-bit float components.

    • format

      • Describes the type of data for the attribute.

      • Implicitly defines the byte size of attribute data.

      • A bit confusingly, the formats are specified using the same enumeration as color formats.

      • The following shader types and formats are commonly used together:

        • float : FORMAT_R32_SFLOAT

        • vec2 : FORMAT_R32G32_SFLOAT

        • vec3 : FORMAT_R32G32B32_SFLOAT

        • vec4 : FORMAT_R32G32B32A32_SFLOAT

      • As you can see, you should use the format whose number of color channels matches the number of components in the shader data type.

      • It is allowed to use more channels than the number of components in the shader, but they will be silently discarded.

        • If the number of channels is lower than the number of components, then the BGA components will use default values of (0, 0, 1) .

      • The color type ( SFLOAT , UINT , SINT ) and bit width should also match the type of the shader input. See the following examples:

        • ivec2 : FORMAT_R32G32_SINT , a 2-component vector of 32-bit signed integers

        • uvec4 : FORMAT_R32G32B32A32_UINT , a 4-component vector of 32-bit unsigned integers

        • double : FORMAT_R64_SFLOAT , a double-precision (64-bit) float

    • offset

      • Specifies the number of bytes since the start of the per-vertex data to read from.

  • Graphics Pipeline Vertex Input Binding :

    • For the following vertices:

      Vertex :: struct {
          pos:   eng.Vec2,
          color: eng.Vec3,
      }
      
      vertices := [?]Vertex{
          { {  0.0, -0.5 }, { 1.0, 0.0, 0.0 } },
          { {  0.5,  0.5 }, { 0.0, 1.0, 0.0 } },
          { { -0.5,  0.5 }, { 0.0, 0.0, 1.0 } },
      }
      
    • We setup this in the Graphics Pipeline creation:

      vertex_binding_descriptor := vk.VertexInputBindingDescription{
          binding   = 0,
          stride    = size_of(Vertex),
          inputRate = .VERTEX,
      }
      vertex_attribute_descriptor := [?]vk.VertexInputAttributeDescription{
          {
              binding  = 0,
              location = 0,
              format   = .R32G32_SFLOAT,
              offset   = cast(u32)offset_of(Vertex, pos),
          },
          {
              binding  = 0,
              location = 1,
              format   = .R32G32B32_SFLOAT,
              offset   = cast(u32)offset_of(Vertex, color),
          },
      }
      vertex_input_create_info := vk.PipelineVertexInputStateCreateInfo {
          sType                           = .PIPELINE_VERTEX_INPUT_STATE_CREATE_INFO,
          vertexBindingDescriptionCount   = 1,
          pVertexBindingDescriptions      = &vertex_binding_descriptor,
          vertexAttributeDescriptionCount = len(vertex_attribute_descriptor),
          pVertexAttributeDescriptions    = &vertex_attribute_descriptor[0],
      }
      
    • The pipeline is now ready to accept vertex data in the format of the vertices  container and pass it on to our vertex shader.

  • Vertex Buffer :

    • If you run the program now with validation layers enabled, you'll see that it complains that there is no vertex buffer bound to the binding.

    • The next step is to create a vertex buffer and move the vertex data to it so the GPU is able to access it.

    • Creating :

      • Follow the tutorial for creating a buffer, specifying BUFFER_USAGE_VERTEX_BUFFER  as the BufferCreateInfo   usage .

Index Buffer
  • Motivation :

    • Drawing a rectangle takes two triangles, which means that we need a vertex buffer with six vertices. The problem is that the data of two vertices needs to be duplicated, resulting in redundancies.

    • The solution to this problem is to use an index buffer.

    • An index buffer is essentially an array of pointers into the vertex buffer.

    • It allows you to reorder the vertex data, and reuse existing data for multiple vertices.

    • .

      • The first three indices define the upper-right triangle, and the last three indices define the vertices for the bottom-left triangle.

    • It is possible to use either uint16_t  or uint32_t  for your index buffer depending on the number of entries in vertices . We can stick to uint16_t  for now because we’re using fewer than 65535 unique vertices.

    • Just like the vertex data, the indices need to be uploaded into a VkBuffer  for the GPU to be able to access them.

  • Creating :

    • Follow the tutorial for creating a buffer, specifying BUFFER_USAGE_INDEX_BUFFER  as the BufferCreateInfo   usage .

  • Using :

    • We first need to bind the index buffer, just like we did for the vertex buffer.

    • The difference is that you can only have a single  index buffer. It’s unfortunately not possible to use different indices for each vertex attribute, so we do still have to completely duplicate vertex data even if just one attribute varies.

    • An index buffer is bound with vkCmdBindIndexBuffer  which has the index buffer, a byte offset into it, and the type of index data as parameters.

      • As mentioned before, the possible types are INDEX_TYPE_UINT16  and INDEX_TYPE_UINT32 .

    • Just binding an index buffer doesn’t change anything yet; we also need to change the drawing command to tell Vulkan to use the index buffer.

    • Remove  the vkCmdDraw  line and replace it with vkCmdDrawIndexed .

Push Constants

  • A Push Constant is a small bank of values accessible in shaders.

  • These are designed for small amounts (a few dwords) of high-frequency data, updated per recording of the command buffer.

  • So that the shader knows where this data will arrive, we specify a special push_constant  layout in our shader code.

layout(push_constant) uniform MeshData {
    mat4 model;
} mesh_data;
  • Choosing to use Push Constants :

    • In early implementations of Vulkan on Arm Mali, this was usually the fastest way of pushing data to your shaders. In more recent times, we have observed on Mali devices that overall  they can be slower. If performance is something you are trying to maximise on Mali devices, descriptor sets may be the way to go. However, other devices may still favour push constants.

    • Having said this, descriptor sets are one of the more complex features of Vulkan, making the convenience of push constants still worth considering as a go-to method, especially if working with trivial data.

  • Limits :

    • maxPushConstantsSize

      • Guaranteed to be at least 128  bytes on all devices.

      • In Vulkan 1.4 the minimum was increased to 256 bytes.

  • Push Constants .

Offsets
  • .

  • Ex1 :

    layout(push_constant, std430) uniform pc {
        layout(offset = 32) vec4 data;
    };
    
    layout(location = 0) out vec4 outColor;
    
    void main() {
       outColor = data;
    }
    
    VkPushConstantRange range = {};
    range.stageFlags = SHADER_STAGE_FRAGMENT;
    range.offset = 32;
    range.size = 16;
    
Updating
  • Ex1 :

    • Push constants can be incrementally updated over the course of a command buffer.

    // vkBeginCommandBuffer()
    vkCmdBindPipeline();
    vkCmdPushConstants(offset: 0, size: 16, values: [0, 0, 0, 0]);
    vkCmdDraw(); // values = [0, 0, 0, 0]
    
    vkCmdPushConstants(offset: 4, size: 8, values: [1, 1]);
    vkCmdDraw(); // values = [0, 1, 1, 0]
    
    vkCmdPushConstants(offset: 8, size: 8, values: [2, 2]);
    vkCmdDraw(); // values = [0, 1, 2, 2]
    // vkEndCommandBuffer()
    
    • Note how old values are kept: values that were not overwritten are preserved across draws.

Lifetime
  • vkCmdPushConstants  is tied to the VkPipelineLayout  it was recorded with, which is why the layouts must match before a command such as vkCmdDraw() .

  • Because push constants are not tied to descriptors, the use of vkCmdBindDescriptorSets  has no effect on the lifetime or pipeline layout compatibility  of push constants.

  • The same way it is possible to bind descriptor sets that are never used by the shader, the same is true for push constants.

CPU Performance
  • Push one struct once per draw instead of many separate vkCmdPushConstants calls (one call writing a small struct is far cheaper).

  • Many small state changes cause the driver to update internal tables, validate, or patch commands — that’s CPU work and cannot be avoided without batching.

  • Observations :

    • 5 push calls were taking 7.65 µs; after grouping them into a single push call, they now take 3.08 µs.

    • This was substantial, as at the time I was issuing these push calls hundreds of times per frame; I later reduced that number, but it could still be significant.

Descriptor Sets

About

  • VkDescriptorSet

  • One Descriptor -> One Resource.

  • They are always organized in Descriptor Sets.

    • One or more descriptors contained.

    • Combine descriptors which are used in conjunction.

  • A handle or pointer into a resource.

    • Note that it is not just a pointer, but a pointer plus metadata.

  • A core mechanism used to bind resources to shaders.

  • Holds the binding information that connects shader inputs to data such as VkBuffer  resources and VkImage  textures.

  • Think of it as a set of GPU-side pointers that you bind once.

  • The internal representation of a descriptor set is whatever the driver wants it to be.

  • Article by Arseny Kapoulkine .

  • Sample talking about best practices .

  • Content :

    • Where to find a Resource.

    • Usage type of a Resource.

    • Offsets, sometimes.

    • Some metadata, sometimes.

  • Example :

    • .

    // Note - only set 0 and 2 are used in this shader
    layout(set = 0, binding = 0) uniform sampler2D myTextureSampler;
    
    layout(set = 0, binding = 2) uniform uniformBuffer0 {
        float someData;
    } ubo_0;
    
    layout(set = 0, binding = 3) uniform uniformBuffer1 {
        float moreData;
    } ubo_1;
    
    layout(set = 2, binding = 0) buffer storageBuffer {
        float myResults;
    } ssbo;
    
  • API :

    • .

    • .

  • Limits :

    • maxBoundDescriptorSets

    • Per stage limit

    • maxPerStageDescriptorSamplers

    • maxPerStageDescriptorUniformBuffers

    • maxPerStageDescriptorStorageBuffers

    • maxPerStageDescriptorSampledImages

    • maxPerStageDescriptorStorageImages

    • maxPerStageDescriptorInputAttachments

    • maxPerStageResources

    • Per set limit

    • maxDescriptorSetSamplers

    • maxDescriptorSetUniformBuffers

    • maxDescriptorSetUniformBuffersDynamic

    • maxDescriptorSetStorageBuffers

    • maxDescriptorSetStorageBuffersDynamic

    • maxDescriptorSetSampledImages

    • maxDescriptorSetStorageImages

    • maxDescriptorSetInputAttachments

    • VkPhysicalDeviceDescriptorIndexingProperties  if using Descriptor Indexing

    • VkPhysicalDeviceInlineUniformBlockPropertiesEXT  if using Inline Uniform Block

  • Visual explanation {0:00 -> 5:35} .

    • Nice.

    • The rest of the video is meh.

Difficulties
  • Problems :

    • "They are not bad but they very much force a specific rendering style: you have triple / quadrupled nested for loops, binding your things based on usage and then rebind descriptor sets as needed."

    • "Many of us are moving towards bindless rendering, where you just bind everything once in one big descriptor set, and then index into it at will; tho, Vulkan 1.0 does not greatly support, and also the descriptor count for it was quite low".

    • Cannot update descriptors after binding in a command buffer.

    • All descriptors must be valid, even if not used.

    • Descriptor arrays must be sampled uniformly.

      • Different invocations can’t use different indices.

      • Can sample “dynamically uniform”, e.g. runtime-based index.

    • Upper limit on descriptor counts.

    • Discourages GPU-driven rendering architectures.

      • Due to the need to set up descriptor sets per draw call it’s hard to adapt any of the aforementioned schemes to GPU-based culling or command submission.

  • Solutions :

    • Descriptor Indexing :

      • Available in 1.3, optional in 1.2, or EXT_descriptor_indexing .

      • Update descriptors after binding.

      • Update unused descriptors.

      • Relax requirement that all descriptors must be valid, even if unused.

      • Non-uniform array indexing.

    • Buffer Device Address :

      • Available in 1.3, optional in 1.2, or KHR_buffer_device_address .

      • Directly access buffers through addresses without a descriptor.

      • See [[#Physical Storage Buffer]] below.

    • Descriptor Buffers – EXT_descriptor_buffer :

      • Manage descriptors directly.

      • Similar to D3D12’s descriptor model .

Allocation

  • A scheme that works well is to use free lists of descriptor set pools; whenever you need a descriptor set pool, you allocate one from the free list and use it for subsequent descriptor set allocations in the current frame on the current thread. Once you run out of descriptor sets in the current pool, you allocate a new pool. Any pools that were used in a given frame need to be kept around; once the frame has finished rendering, as determined by the associated fence objects, the descriptor set pools can be reset via vkResetDescriptorPool  and returned to the free list. While it’s possible to free individual descriptor sets from a pool via DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET , this complicates memory management on the driver side and is not recommended.

  • When a descriptor set pool is created, the application specifies the maximum number of descriptor sets allocated from it, as well as the maximum number of descriptors of each type that can be allocated from it. In Vulkan 1.1, the application doesn’t have to handle accounting for these limits – it can just call vkAllocateDescriptorSets and handle the error from that call by switching to a new descriptor set pool. Unfortunately, in Vulkan 1.0 without any extensions, it’s an error to call vkAllocateDescriptorSets if the pool does not have available space, so the application must track the number of sets and descriptors of each type to know beforehand when to switch to a different pool.

  • Different pipeline objects may use different numbers of descriptors, which raises the question of pool configuration. A straightforward approach is to create all pools with the same configuration that uses the worst-case number of descriptors for each type – for example, if each set can use at most 16 texture and 8 buffer descriptors, one can allocate all pools with maxSets=1024, and pool sizes 16*1024 for texture descriptors and 8*1024 for buffer descriptors. This approach can work but in practice it can result in very significant memory waste for shaders with different descriptor counts – you can’t allocate more than 1024 descriptor sets out of a pool with the aforementioned configuration, so if most of your pipeline objects use 4 textures, you’ll be wasting 75% of texture descriptor memory.

  • Strategies :

    • Two alternatives provide a better balance of memory use:

    1. Measure an average number of descriptors used in a shader pipeline per type for a characteristic scene and allocate pool sizes accordingly. For example, if in a given scene we need 3000 descriptor sets, 13400 texture descriptors, and 1700 buffer descriptors, then the average number of descriptors per set is 4.47 textures (rounded up to 5) and 0.57 buffers (rounded up to 1), so a reasonable configuration of a pool is maxSets=1024, 5*1024 texture descriptors, 1024 buffer descriptors. When a pool is out of descriptors of a given type, we allocate a new one – so this scheme is guaranteed to work and should be reasonably efficient on average.

    2. Group shader pipeline objects into size classes, approximating common patterns of descriptor use, and pick descriptor set pools using the appropriate size class. This is an extension of the scheme described above to more than one size class. For example, it’s typical to have large numbers of shadow/depth prepass draw calls, and large numbers of regular draw calls in a scene – but these two groups have different numbers of required descriptors, with shadow draw calls typically requiring 0 to 1 textures per set and 0 to 1 buffers when dynamic buffer offsets are used. To optimize memory use, it’s more appropriate to allocate descriptor set pools separately for shadow/depth and other draw calls. Similarly to general-purpose allocators that can have size classes that are optimal for a given application, this can still be managed in a lower-level descriptor set management layer as long as it’s configured with application specific descriptor set usages beforehand.

Implementation
  • Descriptors are like pointers; as with any pointer, space for them has to be allocated ahead of time.

  • How many :

    • It's possible to have 1 very big descriptor pool that handles the entire engine, but that means we need to know what descriptors we will be using for everything ahead of time.

    • That can be very tricky to do at scale. Instead, we will keep it simpler: we will have multiple descriptor pools for different parts of the project, and try to size them more accurately.

      • I don't know what that actually means in practice.

  • VkDescriptorPool .

    • Maintains a pool of descriptors, from which descriptor sets are allocated.

    • Descriptor pools are externally synchronized, meaning that the application must  not allocate and/or free descriptor sets from the same pool in multiple threads simultaneously.

    • They are very opaque.

    • VkDescriptorPoolCreateInfo .

      • Contains a type of descriptor (same VkDescriptorType  as on the bindings above ), alongside a ratio that the maxSets  parameter is multiplied by.

      • This lets us directly control how big the pool is going to be. maxSets  controls how many VkDescriptorSets  we can create from the pool in total, and the pool sizes give how many individual bindings of a given type are owned.

      • flags .

        • Is a bitmask of VkDescriptorPoolCreateFlagBits  specifying certain supported operations on the pool.

        • DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET

          • Determines if individual descriptor sets can be freed or not:

          • We're not going to touch the descriptor set after creating it, so we don't need this flag. You can leave flags  at its default value of 0 .

        • DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND

          • Descriptor pool creation may  fail with the error ERROR_FRAGMENTATION  if the total number of descriptors across all pools (including this one) created with this bit set exceeds maxUpdateAfterBindDescriptorsInAllPools , or if fragmentation of the underlying hardware resources occurs.

      • maxSets

        • Is the maximum number of descriptor sets that can  be allocated from the pool.

      • poolSizeCount

        • Is the number of elements in pPoolSizes .

      • pPoolSizes

        • Is a pointer to an array of VkDescriptorPoolSize  structures, each containing a descriptor type and number of descriptors of that type to be allocated in the pool.

        • If multiple VkDescriptorPoolSize  structures containing the same descriptor type appear in the pPoolSizes  array then the pool will be created with enough storage for the total number of descriptors of each type.

        • VkDescriptorPoolSize .

          • type

            • Is the type of descriptor.

          • descriptorCount

            • Is the number of descriptors of that type to allocate. If type  is DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK  then descriptorCount  is the number of bytes to allocate for descriptors of this type.

  • VkDescriptorSetAllocateInfo

    • descriptorPool

      • Is the pool which the sets will be allocated from.

    • descriptorSetCount

      • Determines the number of descriptor sets to be allocated from the pool.

    • pSetLayouts

      • Is a pointer to an array of descriptor set layouts, with each member specifying how the corresponding descriptor set is allocated.

  • vkAllocateDescriptorSets() .

    • The allocated descriptor sets are returned in pDescriptorSets .

    • When a descriptor set is allocated, the initial state is largely uninitialized and all descriptors are undefined, with the exception that samplers with a non-null pImmutableSamplers  are initialized on allocation.

    • Descriptors also become undefined if the underlying resource or view object is destroyed.

    • Descriptor sets containing undefined descriptors can  still be bound and used, subject to the following conditions:

      • For descriptor set bindings created with the PARTIALLY_BOUND  bit set:

        • All descriptors in that binding that are dynamically used must  have been populated before the descriptor set is consumed .

      • For descriptor set bindings created without the PARTIALLY_BOUND  bit set:

        • All descriptors in that binding that are statically used must  have been populated before the descriptor set is consumed .

      • Descriptor bindings with descriptor type of DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK   can  be undefined when the descriptor set is consumed ; though values in that block will be undefined.

      • Entries that are not used by a pipeline can  have undefined descriptors.

    • pAllocateInfo

    • pDescriptorSets

      • Is a pointer to an array of VkDescriptorSet  handles in which the resulting descriptor set objects are returned.

  • Multithreading :

    • Descriptor pools are externally synchronized, meaning that the application must  not allocate and/or free descriptor sets from the same pool in multiple threads simultaneously.

    • Descriptor pools are used to allocate, free, reset, and update descriptor sets. By creating multiple descriptor pools, each application host thread can manage descriptor sets in its own pool at the same time.

Best Practices
  • Don’t allocate descriptor sets if nothing in the set changed. In the model with slots that are shared between different stages, this can mean that if no textures are set between two draw calls, you don’t need to allocate the descriptor set with texture descriptors.

  • Don't allocate descriptor sets from descriptor pools on performance critical code paths.

  • Don't allocate, free or update descriptor sets every frame, unless it is necessary.

  • Don't set DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET  if you do not need to free individual descriptor sets.

    • Setting DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET  may prevent the implementation from using a simpler (and faster) allocator.

Descriptor Types

Overview
  • For buffers, the application must choose between uniform and storage buffers, and whether to use dynamic offsets or not. Uniform buffers have a limit on the maximum addressable size – on desktop hardware, you get up to 64 KB of data, however on mobile hardware some GPUs only provide 16 KB (which is also the guaranteed minimum by the specification). The buffer resource can be larger than that, but a shader can only access this much data through one descriptor.

  • On some hardware, there is no difference in access speed between uniform and storage buffers, however for other hardware depending on the access pattern uniform buffers can be significantly faster. Prefer uniform buffers for small to medium sized data especially if the access pattern is fixed (e.g. for a buffer with material or scene constants). Storage buffers are more appropriate when you need large arrays of data that need to be larger than the uniform buffer limit and are indexed dynamically in the shader.

  • For textures, if filtering is required, there is a choice of a combined image/sampler descriptor (where, like in OpenGL, the descriptor specifies both the source of the texture data and the filtering/addressing properties), separate image and sampler descriptors (which maps better to the Direct3D 11 model), and an image descriptor with an immutable sampler, where the sampler properties must be specified when the pipeline object is created.

  • The relative performance of these methods is highly dependent on the usage pattern; however, in general immutable samplers map better to the recommended usage model in newer APIs like Direct3D 12, and give the driver more freedom to optimize the shader. This does alter renderer design to a certain extent, making it necessary to implement certain dynamic portions of the sampler state, like per-texture LOD bias for texture fade-in during streaming, using shader ALU instructions.

Storage Images
  • DESCRIPTOR_TYPE_STORAGE_IMAGE

  • Is a descriptor type that allows shaders to read from and write to an image without using a fixed-function graphics pipeline.

  • This is particularly useful for compute shaders and advanced rendering techniques.

  • Storage Images and Implementation .

// FORMAT_R32_UINT
layout(set = 0, binding = 0, r32ui) uniform uimage2D storageImage;

// example usage for reading and writing in GLSL
const uvec4 texel = imageLoad(storageImage, ivec2(0, 0));
imageStore(storageImage, ivec2(1, 1), texel);
  • Use cases :

    • Image Processing :

      • Storage images are ideal for image processing tasks like filters, blurs, and other post-processing effects.

Sampler
  • DESCRIPTOR_TYPE_SAMPLER  and DESCRIPTOR_TYPE_SAMPLED_IMAGE .

layout(set = 0, binding = 0) uniform sampler samplerDescriptor;
layout(set = 0, binding = 1) uniform texture2D sampledImage;

// example usage of using texture() in GLSL
vec4 data = texture(sampler2D(sampledImage,  samplerDescriptor), vec2(0.0, 0.0));
Combined Image Sampler
  • DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER

  • On some implementations, it may  be more efficient to sample from an image using a combination of sampler and sampled image that are stored together in the descriptor set in a combined descriptor.

layout(set = 0, binding = 0) uniform sampler2D combinedImageSampler;

// example usage of using texture() in GLSL
vec4 data = texture(combinedImageSampler, vec2(0.0, 0.0));
Uniform Buffer / UBO (Uniform Buffer Object)
layout(set = 0, binding = 0) uniform uniformBuffer {
    float a;
    int b;
} ubo;

// example of reading from UBO in GLSL
int x = ubo.b + 1;
vec3 y = vec3(ubo.a);
  • Uniform buffers commonly use the std140  layout (strict alignment rules, predictable padding; e.g. array elements and structs are rounded up to 16-byte alignment).

    • Confirmed by the GLSL spec: uniform blocks default to std140 , and std430  is not allowed on uniform blocks (only on buffer  and push_constant  blocks).

/* UBO: small read-only data (std140) */
layout(set = 0, binding = 0, std140) uniform SceneParams {
    mat4 viewProj;
    vec4 lightPos;
    float time;
} scene;
  • UBO (Uniform Buffer Object) :

    • “Uniform buffer object” is more of an OpenGL-era name, but some Vulkan tutorials and developers still use it informally to mean the same thing — the buffer that holds uniform data.

Storage Buffer / SSBO (Shader Storage Buffer Object)
  • DESCRIPTOR_TYPE_STORAGE_BUFFER

  • GLSL uses distinct address spaces: uniform  → UBO, buffer  → SSBO.

  • Use std430  layout by default (tighter packing, fewer padding requirements).

  • SSBO (Shader Storage Buffer Object) is an OpenGL term.

// Implicit std430 (default)
layout(set = 0, binding = 0) buffer storageBuffer {
    float a;
    int b;
} ssbo;

// Explicit std430
layout(set = 0, binding = 1, std430) buffer ParticleData {
    vec4 pos[];
} particles;

// Reading and writing to a SSBO in GLSL
ssbo.a = ssbo.a + 1.0;
ssbo.b = ssbo.b + 1;
  • BufferBlock  and Uniform  would have been seen prior to KHR_storage_buffer_storage_class .

  • Storage buffers can also have dynamic offsets at bind time DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC .

  • Why SSBO for dynamic arrays :

    • std430  allows tight packing and runtime-sized arrays (T data[]) , which is ideal for dynamic-length storage.

    • SSBOs allow arbitrary indexing, read/write, and atomics.

    • maxStorageBufferRange is usually much larger than maxUniformBufferRange .

    • You can use *_DYNAMIC  descriptors to bind multiple subranges of one large backing buffer cheaply.

  • Many arrays :

    • A buffer block may contain multiple arrays, but only the last member of the block may be a runtime-sized (unsized) array T x[] . All other arrays must be fixed-size (compile-time constant) or you must implement sizing/offsets yourself.

      • This is invalid, even with descriptor indexing:

      layout(std430, set = 0, binding = 0) buffer FixedArrays { 
          vec4 A[]; 
          vec2 B[]; 
          mat4 C[]; 
          some_struct D[];
      } fixedArrays;
      
    1. Use a uint x[] :

      • 32-bit words; simplest and portable.

      • This is effectively an untyped byte/word blob stored in the SSBO, and you manually reinterpret (cast) it in the shader.

      layout(std430, set = 0, binding = 0) buffer PackedBytes {
          uint countA;   // number of A elements
          uint offsetA;  // offset into data[] in uint words
          uint countB;
          uint offsetB;  // offset into data[] in uint words
          uint countC;
          uint offsetC;
      
          uint data[];   // payload in 32-bit words
      } pb;
      
      // helpers
      float readFloat(uint baseWordIndex) {
          return uintBitsToFloat(pb.data[baseWordIndex]);
      }
      
      vec2 readVec2(uint baseWordIndex) {
          return vec2(
              uintBitsToFloat(pb.data[baseWordIndex + 0]),
              uintBitsToFloat(pb.data[baseWordIndex + 1])
          );
      }
      
      vec3 readVec3(uint baseWordIndex) {
          return vec3(
              uintBitsToFloat(pb.data[baseWordIndex + 0]),
              uintBitsToFloat(pb.data[baseWordIndex + 1]),
              uintBitsToFloat(pb.data[baseWordIndex + 2])
          );
      }
      
      vec4 readVec4(uint baseWordIndex) {
          return vec4(
              uintBitsToFloat(pb.data[baseWordIndex + 0]),
              uintBitsToFloat(pb.data[baseWordIndex + 1]),
              uintBitsToFloat(pb.data[baseWordIndex + 2]),
              uintBitsToFloat(pb.data[baseWordIndex + 3])
          );
      }
      
      mat4 readMat4(uint baseWordIndex) {
          // mat4 stored column-major as 16 floats (4 columns of vec4)
          return mat4(
              readVec4(baseWordIndex + 0),
              readVec4(baseWordIndex + 4),
              readVec4(baseWordIndex + 8),
              readVec4(baseWordIndex + 12)
          );
      }
      
    2. Use a vec4 x[] :

      • 128-bit blocks; simpler alignment for vec4/mat4 data.

      // Pack everything into vec4 blocks for simple alignment
      layout(std430, set = 0, binding = 0) buffer Packed {
          uint countA;
          uint offsetA; // in vec4-blocks
          uint countB;
          uint offsetB; // in vec4-blocks
          uint countC;
          uint offsetC; // in vec4-blocks
          uint countD;
          uint offsetD; // in vec4-blocks
      
          vec4 blocks[]; // single runtime-sized array (last member)
      } packed;
      
      // helpers
      vec4 getA(uint i) {
          return packed.blocks[packed.offsetA + i];
      }
      
      vec2 getB(uint i) {
          return packed.blocks[packed.offsetB + i].xy; // we store each B in one vec4 block
      }
      
      mat4 getC(uint i) {
          uint base = packed.offsetC + i * 4; // mat4 occupies 4 vec4 blocks
          return mat4(packed.blocks[base + 0],
                      packed.blocks[base + 1],
                      packed.blocks[base + 2],
                      packed.blocks[base + 3]);
      }
      
      // for some_struct D that we store as 1 vec4 per element:
      some_struct getD(uint i) {
          vec4 v = packed.blocks[packed.offsetD + i];
          // decode v -> some_struct fields; a real shader must return the
          // decoded struct here, e.g. return some_struct(v.x, v.yz);
      }
      
    3. Use many SSBOs:

      layout(std430, set=0, binding=0) buffer BufA { vec4 A[]; } bufA;
      layout(std430, set=0, binding=1) buffer BufB { vec2 B[]; } bufB;
      layout(std430, set=0, binding=2) buffer BufC { mat4 C[]; } bufC;
      layout(std430, set=0, binding=3) buffer BufD { some_struct D[]; } bufD;
      
Texel Buffer
  • Texel buffers are a way to access buffer data with texture-like operations in shaders.

  • Texel Buffers and Implementation .

  • Compatibility Requirements .

    • The format specified in the shader (SPIR-V Image Format) must exactly match  the format used when creating the VkImageView  (Vulkan Format).

  • Best Practices .

  • Uniform Texel Buffer :

    • DESCRIPTOR_TYPE_UNIFORM_TEXEL_BUFFER

    • Read-only access.

    layout(set = 0, binding = 0) uniform textureBuffer uniformTexelBuffer;
    
    // example of reading texel buffer in GLSL
    vec4 data = texelFetch(uniformTexelBuffer, 0);
    
    • Use cases :

      • Lookup Tables :

        • Uniform texel buffers are useful for implementing lookup tables that need to be accessed with texture-like operations.

  • Storage Texel Buffer :

    • DESCRIPTOR_TYPE_STORAGE_TEXEL_BUFFER

    • Read-write access.

    // FORMAT_R8G8B8A8_UINT
    layout(set = 0, binding = 0, rgba8ui) uniform uimageBuffer storageTexelBuffer;
    
    // example of reading and writing texel buffer in GLSL
    int offset = int(gl_GlobalInvocationID.x);
    uvec4 data = imageLoad(storageTexelBuffer, offset);
    imageStore(storageTexelBuffer, offset, uvec4(0));
    
    • Use cases :

      • Particle Systems :

        • Storage texel buffers can be used to store and update particle data in a compute shader, which can then be read by a vertex shader for rendering.

Input Attachment
  • DESCRIPTOR_TYPE_INPUT_ATTACHMENT

layout (input_attachment_index = 0, set = 0, binding = 0) uniform subpassInput inputAttachment;

// example loading the attachment data in GLSL
vec4 data = subpassLoad(inputAttachment);

Updates

Implementation
  • A Descriptor Set, even though created and allocated, is still empty. We need to fill it up with data.

  • Updates must  happen outside of a command record and execution.

    • No update after vkCmdBindDescriptorSets() .

    • Usually you update before vkBeginCommandBuffer()  or after vkQueueSubmit()  has finished executing (i.e. once synchronization confirms the command buffer is done).

  • If using Descriptor Indexing :

    • Descriptors can be updated after binding in command buffers.

      • Command buffer execution will use most recent updates.

    • .

  • VkWriteDescriptorSet .

    • dstSet

      • Is the destination descriptor set to update.

    • dstBinding

      • Is the descriptor binding within that set.

    • dstArrayElement

      • Remember that descriptors can be arrays, so we also need to specify the first index in the array that we want to update.

      • If not using an array, the index is simply 0 .

      • Is the starting element in that array.

      • If the descriptor binding identified by dstSet  and dstBinding  has a descriptor type of DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK  then dstArrayElement  specifies the starting byte offset within the binding.

    • descriptorCount

      • It's a descriptor count, not  a descriptor SET count!!

      • Is the number of descriptors to update.

      • If the descriptor binding identified by dstSet  and dstBinding  has a descriptor type of DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK , then descriptorCount  specifies the number of bytes to update.

      • Otherwise, descriptorCount  is the number of elements in pImageInfo , pBufferInfo , or pTexelBufferView  (whichever array descriptorType  selects).

    • descriptorType

      • We need to specify the type of descriptor again.

      • Is a VkDescriptorType  specifying the type of each descriptor in pImageInfo , pBufferInfo , or pTexelBufferView .

      • It must  be the same type as the descriptorType  specified in VkDescriptorSetLayoutBinding  for dstSet  at dstBinding , except  if VkDescriptorSetLayoutBinding  for dstSet  at dstBinding  is equal to DESCRIPTOR_TYPE_MUTABLE_EXT .

      • The type of the descriptor also controls which array the descriptors are taken from.

    • pBufferInfo

      • Is a pointer to an array of VkDescriptorBufferInfo  structures or is ignored, as described below.

      • VkDescriptorBufferInfo .

        • Structure specifying descriptor buffer information

        • Specifies the buffer and the region within it that contains the data for the descriptor.

        • buffer

        • offset

          • Is the offset in bytes from the start of buffer .

          • Access to buffer memory via this descriptor uses addressing that is relative to this starting offset.

          • For DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC  and DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC  descriptor types:

            • offset  is the base offset from which the dynamic offset is applied.

        • range

          • Is the size in bytes that is used for this descriptor update, or WHOLE_SIZE  to use the range from offset  to the end of the buffer.

            • When range  is WHOLE_SIZE  the effective range is calculated at vkUpdateDescriptorSets  by taking the size of buffer  minus the offset .

          • For DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC  and DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC  descriptor types:

            • range  is the static size used for all dynamic offsets.

    • pImageInfo

      • Is a pointer to an array of VkDescriptorImageInfo  structures or is ignored, as described below.

      • VkDescriptorImageInfo .

        • imageLayout

          • Is the layout that the image subresources accessible from imageView  will be in at the time this descriptor is accessed.

          • Is used in descriptor updates for types DESCRIPTOR_TYPE_SAMPLED_IMAGE , DESCRIPTOR_TYPE_STORAGE_IMAGE , DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER , and DESCRIPTOR_TYPE_INPUT_ATTACHMENT .

        • imageView

          • Is an image view handle or NULL_HANDLE .

          • Is used in descriptor updates for types DESCRIPTOR_TYPE_SAMPLED_IMAGE , DESCRIPTOR_TYPE_STORAGE_IMAGE , DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER , and DESCRIPTOR_TYPE_INPUT_ATTACHMENT .

        • sampler

          • Is a sampler handle.

          • Is used in descriptor updates for types DESCRIPTOR_TYPE_SAMPLER  and DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER  if the binding being updated does not use immutable samplers.

    • pTexelBufferView

  • vkUpdateDescriptorSets() .

    • descriptorWriteCount

      • Is the number of elements in the pDescriptorWrites  array.

    • pDescriptorWrites

    • descriptorCopyCount

      • Is the number of elements in the pDescriptorCopies  array.

    • pDescriptorCopies

      • Is a pointer to an array of VkCopyDescriptorSet  structures describing the descriptor sets to copy between.

Best Practices
  • Don’t update descriptor sets if nothing in the set changed. In the model with slots that are shared between different stages, this can mean that if no textures are set between two draw calls, you don’t need to update the descriptor set with texture descriptors.

  • When rendering dynamic objects the application will need to push some amount of per-object data to the GPU, such as the MVP matrix. This data may not fit into the push constant limit for the device, so it becomes necessary to send it to the GPU by putting it into a VkBuffer  and binding a descriptor set that points to it.

  • Materials also need their own descriptor sets, which point to the textures they use. We can either bind per-material and per-object descriptor sets separately or collate them into a single set. Either way, complex applications will have a large amount of descriptor sets that may need to change on the fly, for example due to textures being streamed in or out.

  • Not-good Solution: One or more pools per-frame, resetting the pool :

    • The simplest approach to circumvent the issue is to have one or more VkDescriptorPool s per frame, reset them at the beginning of the frame and allocate the required descriptor sets from it. This approach will consist of a vkResetDescriptorPool()  call at the beginning, followed by a series of vkAllocateDescriptorSets()  and vkUpdateDescriptorSets()  to fill them with data.

    • This is very useful for things like per-frame descriptors. That way we can have descriptors that are used just for one frame, allocated dynamically, and then before we start the frame we completely delete all of them in one go.

    • This is confirmed to be a fast path by GPU vendors, and recommended to use when you need to handle per-frame descriptor sets.

    • The issue is that these calls can add a significant overhead to the CPU frame time, especially on mobile. In the worst cases, for example calling vkUpdateDescriptorSets()  for each draw call, the time it takes to update descriptors can be longer than the time of the draws themselves.
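    • As a sketch of why bulk reset is cheap: a pool created without the FREE_DESCRIPTOR_SET  flag can be little more than a bump allocator. The C below models that with hypothetical names (nothing here is real Vulkan API); the allocate function stands in for vkAllocateDescriptorSets() , the reset for vkResetDescriptorPool() :

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Illustrative stand-in for a VkDescriptorPool that hands out sets by
       * bumping a counter -- roughly what a driver can do when individual
       * frees are not requested. All names are hypothetical mocks. */
      typedef struct {
          uint32_t capacity;  /* maxSets given at pool creation */
          uint32_t next;      /* bump pointer: sets handed out this frame */
      } MockDescriptorPool;

      /* vkAllocateDescriptorSets analogue: O(1) bump, no per-set bookkeeping. */
      static int mock_allocate_set(MockDescriptorPool *pool, uint32_t *out_set) {
          if (pool->next == pool->capacity)
              return -1;  /* like VK_ERROR_OUT_OF_POOL_MEMORY */
          *out_set = pool->next++;
          return 0;
      }

      /* vkResetDescriptorPool analogue: recycle ALL sets in one go. */
      static void mock_reset_pool(MockDescriptorPool *pool) {
          pool->next = 0;
      }
      ```

      Resetting is a single store, which is why per-frame pool resets are a fast path even though per-draw descriptor updates are not.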

  • Solution: Caching descriptor sets :

    • A major way to reduce descriptor set updates is to re-use them as much as possible. Instead of calling vkResetDescriptorPool()  every frame, the app will keep the VkDescriptorSet  handles stored with some caching mechanism to access them.

    • The cache could be a hashmap with the contents of the descriptor set (images, buffers) as key. This approach is used in our framework by default. It is possible to remove another level of indirection by storing descriptor set handles directly in the materials and/or meshes.

    • Caching descriptor sets has a dramatic effect on frame time for our CPU-heavy scene.

    • In this game on a 2019 mobile phone it went from 44ms (23fps) to 27ms (37fps). This is a 38% decrease in frame time.

    • This system is reasonably easy to implement for a static scene, but it becomes harder when you need to delete descriptor sets. Complex engines may implement techniques to figure out which descriptor sets have not been accessed for a certain number of frames, so they can be removed from the map.

    • This may correspond to calling vkFreeDescriptorSets() , but this solution poses another issue: in order to free individual descriptor sets, the pool has to be created with the DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET  flag. Mobile implementations may use a simpler allocator if that flag is not set, relying on the fact that pool memory will only be recycled as a block.

    • It is possible to avoid using that flag by updating descriptor sets instead of deleting them. The application can keep track of recycled descriptor sets and re-use one of them when a new one is requested.
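    • A minimal sketch of such a cache, assuming the key is a hash over the resource handles written into the set. All types and names below are hypothetical (handles are modeled as uint64_t ); on a miss the real code would run vkAllocateDescriptorSets()  + vkUpdateDescriptorSets()  and insert the result:

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <string.h>

      #define CACHE_SLOTS 64  /* power of two, linear probing */

      /* Key = hash of the resource handles written into the set; value = the
       * "VkDescriptorSet" (modeled as uint64_t) allocated the first time. */
      typedef struct {
          uint64_t key;  /* 0 = empty slot */
          uint64_t set;
      } CacheEntry;

      typedef struct { CacheEntry slots[CACHE_SLOTS]; } SetCache;

      /* FNV-1a over the image/buffer handles bound into the set. */
      static uint64_t hash_set_contents(const uint64_t *handles, size_t count) {
          uint64_t h = 1469598103934665603ull;
          for (size_t i = 0; i < count; i++) {
              h ^= handles[i];
              h *= 1099511628211ull;
          }
          return h ? h : 1;  /* reserve 0 for "empty" */
      }

      /* Returns 1 on a hit (no allocate/update needed), 0 on a miss. */
      static int cache_lookup(const SetCache *c, uint64_t key, uint64_t *out_set) {
          size_t i = (size_t)(key & (CACHE_SLOTS - 1));
          while (c->slots[i].key != 0) {
              if (c->slots[i].key == key) { *out_set = c->slots[i].set; return 1; }
              i = (i + 1) & (CACHE_SLOTS - 1);
          }
          return 0;
      }

      /* On a miss the caller allocates + writes the real set, then inserts it. */
      static void cache_insert(SetCache *c, uint64_t key, uint64_t set) {
          size_t i = (size_t)(key & (CACHE_SLOTS - 1));
          while (c->slots[i].key != 0 && c->slots[i].key != key)
              i = (i + 1) & (CACHE_SLOTS - 1);
          c->slots[i].key = key;
          c->slots[i].set = set;
      }
      ```

      A production cache would also need eviction (the "not accessed for N frames" scheme above) and collision handling on the full key, not just the hash.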

  • Solution: One buffer per-frame :

    • We will now explore an alternative approach that is in some ways complementary to descriptor caching. Especially for applications in which descriptor caching is not feasible, buffer management is another lever for optimizing performance.

    • As discussed at the beginning, each rendered object will typically need some uniform data that must be pushed to the GPU somehow. A straightforward approach is to store a VkBuffer  per object and update that data each frame.

    • This already poses an interesting question: is one buffer enough? The problem is that this data will change dynamically and will be in use by the GPU while the frame is in flight.

    • Since we do not want to flush the GPU pipeline between each frame, we will need to keep several copies of each buffer, one for each frame in flight.

    • Another similar option is to use just one buffer per object, but with a size equal to num_frames * buffer_size , then offset it dynamically based on the frame index.

      • For each frame, one buffer per object is created and filled with data. This means that we will have many descriptor sets to create, since every object will need one that points to its VkBuffer . Furthermore, we will have to update many buffers separately, meaning we cannot control their memory layout and we might lose some optimization opportunities with caching.

    • We can address both problems by reverting the approach: instead of having a VkBuffer  per object containing per-frame data, we will have a VkBuffer  per frame containing per-object data. The buffer will be cleared at the beginning of the frame, then each object will record its data and will receive a dynamic offset to be used at vkCmdBindDescriptorSets()  time.

    • With this approach we will need fewer descriptor sets, as more objects can share the same one: they will all reference the same VkBuffer , but at different dynamic offsets. Furthermore, we can control the memory layout within the buffer.

    • Using a single large VkBuffer  in this case shows a performance improvement similar to descriptor set caching.

    • For this relatively simple scene stacking the two approaches does not provide a further performance boost, but for a more complex case they do stack nicely:

      • Descriptor caching is necessary when the number of descriptor sets is not just due to VkBuffer s with uniform data, for example if the scene uses a large amount of materials/textures.

      • Buffer management will help reduce the overall number of descriptor sets, thus cache pressure will be reduced and the cache itself will be smaller.

    • (2025-09-08)

      • I personally liked this technique much more than descriptor caching.

      • It sounds more concrete than fiddling with descriptor sets.

      • Reminds me of Buffer Device Address.
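    • The bookkeeping for the per-frame buffer reduces to offset arithmetic. A sketch, assuming a hypothetical 256-byte minUniformBufferOffsetAlignment  (the real value comes from VkPhysicalDeviceLimits ): each object's block is padded to the alignment, and the padded stride times the object index becomes the dynamic offset passed at bind time.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Vulkan requires every dynamic offset to be a multiple of
       * minUniformBufferOffsetAlignment, so the per-object stride is the
       * object size rounded up to that alignment. Names are illustrative. */

      static uint32_t align_up(uint32_t value, uint32_t alignment) {
          /* alignment must be a power of two (Vulkan guarantees this) */
          return (value + alignment - 1) & ~(alignment - 1);
      }

      /* Dynamic offset for one object inside the frame's buffer. */
      static uint32_t object_dynamic_offset(uint32_t object_index,
                                            uint32_t object_size,
                                            uint32_t min_ubo_alignment) {
          return object_index * align_up(object_size, min_ubo_alignment);
      }

      /* Total size of one frame's buffer. The single-buffer variant described
       * above would multiply by frames-in-flight and add frame_index times
       * this size as a base offset. */
      static uint32_t frame_buffer_size(uint32_t object_count,
                                        uint32_t object_size,
                                        uint32_t min_ubo_alignment) {
          return object_count * align_up(object_size, min_ubo_alignment);
      }
      ```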

  • Do

    • Update already allocated but no longer referenced descriptor sets, instead of resetting descriptor pools and reallocating new descriptor sets.

    • Prefer reusing already allocated descriptor sets, and not updating them with the same information every time.

    • Consider caching your descriptor sets when feasible.

    • Consider using a single (or few) VkBuffer  per frame with dynamic offsets.

    • Batch calls to vkAllocateDescriptorSets  if possible – on some drivers, each call has measurable overhead, so if you need to allocate multiple sets, allocating them in one call can be faster.

    • To update descriptor sets, either use vkUpdateDescriptorSets  with a descriptor write array, or use vkUpdateDescriptorSetWithTemplate  from Vulkan 1.1. Using the descriptor copy functionality of vkUpdateDescriptorSets  is tempting with dynamic descriptor management for copying most descriptors out of a previously allocated array, but this can be slow on drivers that allocate descriptors out of write-combined memory. Descriptor templates can reduce the amount of work the application needs to do to perform updates – since in this scheme you need to read descriptor information out of shadow state maintained by the application, descriptor templates allow you to tell the driver the layout of your shadow state, making updates substantially faster on some drivers.

    • Prefer dynamic uniform buffers to updating uniform buffer descriptors. Dynamic uniform buffers allow you to specify offsets into buffer objects via the pDynamicOffsets  argument of vkCmdBindDescriptorSets  without allocating and updating new descriptors. This works well with dynamic constant management where constants for draw calls are allocated out of large uniform buffers; it substantially reduces CPU overhead and can be more efficient on the GPU. While on some GPUs the number of dynamic buffers must be kept small to avoid extra overhead in the driver, one or two dynamic uniform buffers should work well in this scheme on all architectures.

    • On some drivers, unfortunately the allocate & update path is not very optimal – on some mobile hardware, it may make sense to cache descriptor sets based on the descriptors they contain if they can be reused later in the frame.

Descriptor Set Layout

  • Contains the information about what that descriptor set holds.

  • Specifies the types of resources that are going to be accessed by the pipeline, just like a render pass specifies the types of attachments that will be accessed.

  • How many :

    • You need to specify a descriptor set layout for each descriptor set when creating the pipeline layout.

      • You can use this feature to put descriptors that vary per-object and descriptors that are shared into separate descriptor sets.

      • In that case, you avoid rebinding most of the descriptors across draw calls, which is potentially more efficient.

    • Since the buffer structure is identical across frames, one layout suffices.

      • Create only 1 descriptor set layout, regardless of frames in-flight.

      • This layout defines the type of resource (e.g., DESCRIPTOR_TYPE_UNIFORM_BUFFER ) and its binding point.

  • VkDescriptorSetLayout .

    • Opaque handle to a descriptor set layout object.

    • Is defined by an array of zero or more descriptor bindings.

    • VkDescriptorSetLayoutBinding .

      • Structure specifying a descriptor set layout binding.

      • Each individual descriptor binding is specified by a descriptor type, a count (array size) of the number of descriptors in the binding, a set of shader stages that can access the binding, and (if using immutable samplers) an array of sampler descriptors.

      • Bindings that are not specified have a descriptorCount  and stageFlags  of zero, and the value of descriptorType  is undefined.

      • binding

        • Is the binding number of this entry and corresponds to a resource of the same binding number in the shader stages.

        • This is the binding number referenced in the shader, e.g. layout(binding = 0) .

      • descriptorType

        • Is a VkDescriptorType  specifying which type of resource descriptors are used for this binding.

      • descriptorCount

        • Insight :

          • It's a descriptor count, not a descriptor SET count !! It just specifies how many resources are expected to be in that binding.

          • It makes complete sense to be used for arrays.

          • Caio:

            • What happens if the values don't match? For example, trying to get the index 5 of the array, when the binding was described having descriptorCount = 1  ?

          • Oni:

            • I don't know if this is specified. I guess it's only going to update the first element. So you're going to read bogus data. Maybe it changes between different drivers, no idea.

        • What value to use :

          • An MVP transformation fits in a single uniform buffer, so we use a descriptorCount  of 1 .

          • In other words, a whole struct counts as 1 .

        • Is the number of descriptors contained in the binding, accessed in a shader as an array.

          • Except if descriptorType  is DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK  in which case descriptorCount  is the size in bytes of the inline uniform block.

        • If descriptorCount  is zero this binding entry is reserved and the resource must  not be accessed from any stage via this binding within any pipeline using the set layout.

        • It is possible for the shader variable to represent an array of uniform buffer objects, and this property specifies the number of values in the array.

        • Examples :

          • This could be used to specify a transformation for each of the bones in a skeleton for skeletal animation.

      • stageFlags

        • Is a bitmask of VkShaderStageFlagBits  specifying which pipeline shader stages can  access a resource for this binding.

          • SHADER_STAGE_ALL  is a shorthand specifying all defined shader stages, including any additional stages defined by extensions.

        • If a shader stage is not included in stageFlags , then a resource must  not be accessed from that stage via this binding within any pipeline using the set layout.

        • Other than input attachments which are limited to the fragment shader, there are no limitations on what combinations of stages can  use a descriptor binding, and in particular a binding can  be used by both graphics stages and the compute stage.

      • pImmutableSamplers

        • Affects initialization of samplers.

        • If descriptorType  specifies a DESCRIPTOR_TYPE_SAMPLER  or DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER  type descriptor, then pImmutableSamplers   can  be used to initialize a set of immutable samplers .

        • If descriptorType  is not one of these descriptor types, then pImmutableSamplers  is ignored .

        • Immutable samplers are permanently bound into the set layout and must  not be changed; updating a DESCRIPTOR_TYPE_SAMPLER  descriptor with immutable samplers is not allowed and updates to a DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER  descriptor with immutable samplers does not modify the samplers (the image views are updated, but the sampler updates are ignored).

        • If pImmutableSamplers  is not NULL , then it is a pointer to an array of sampler handles that will be copied into the set layout and used for the corresponding binding. Only the sampler handles are copied; the sampler objects must  not be destroyed before the final use of the set layout and any descriptor pools and sets created using it.

        • If pImmutableSamplers  is NULL , then the sampler slots are dynamic and sampler handles must  be bound into descriptor sets using this layout.

    • VkDescriptorSetLayoutCreateInfo .

      • pBindings

        • A pointer to an array of VkDescriptorSetLayoutBinding  structures.

      • bindingCount

        • Is the number of elements in pBindings .

      • flags

    • vkCreateDescriptorSetLayout() .

      • Create a new descriptor set layout.

      • pCreateInfo

      • pAllocator

      • pSetLayout

        • Is a pointer to a VkDescriptorSetLayout  handle in which the resulting descriptor set layout object is returned.

  • VkPipelineLayoutCreateInfo .

    • Structure specifying the parameters of a newly created pipeline layout object.

    • setLayoutCount

      • Is the number of descriptor sets included in the pipeline layout.

      • How it works :

        • It's possible to have multiple descriptor sets ( set = 0 , set = 1 , etc).

        • "You can have set = 0 being a set that is always bound and never changes, set = 1 is something specific to the current object being rendered, etc."

    • pSetLayouts

      • Is a pointer to an array of VkDescriptorSetLayout  objects.

      • The implementation must  not access these objects outside of the duration of the command this structure is passed to.
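    • To show the shape of the two structures together, here is a sketch using mock stand-ins for the Vulkan types (field names mirror the real API, but the enum and bit values are placeholders): one UBO binding at binding 0 for the vertex stage, then a pipeline layout that takes two set layouts, where the array index is the set number.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Mock stand-ins; real code uses VkDescriptorSetLayoutBinding and
       * VkPipelineLayoutCreateInfo with proper sType/pNext fields. */
      typedef enum { MOCK_DESCRIPTOR_TYPE_UNIFORM_BUFFER = 6 } MockDescriptorType;
      enum { MOCK_SHADER_STAGE_VERTEX_BIT = 0x1 };

      typedef struct {
          uint32_t           binding;          /* matches layout(binding = N) in GLSL */
          MockDescriptorType descriptorType;
          uint32_t           descriptorCount;  /* 1 unless the shader declares an array */
          uint32_t           stageFlags;
          const void        *pImmutableSamplers;
      } MockDescriptorSetLayoutBinding;

      typedef struct {
          uint32_t        setLayoutCount;  /* index in pSetLayouts = set number in GLSL */
          const uint64_t *pSetLayouts;     /* VkDescriptorSetLayout handles */
      } MockPipelineLayoutCreateInfo;

      /* One UBO at the given binding, visible to the vertex stage (e.g. an
       * MVP block): a whole struct counts as descriptorCount = 1. */
      static MockDescriptorSetLayoutBinding make_ubo_binding(uint32_t binding_index) {
          MockDescriptorSetLayoutBinding b;
          b.binding            = binding_index;
          b.descriptorType     = MOCK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
          b.descriptorCount    = 1;
          b.stageFlags         = MOCK_SHADER_STAGE_VERTEX_BIT;
          b.pImmutableSamplers = 0;  /* only meaningful for sampler descriptor types */
          return b;
      }

      static MockPipelineLayoutCreateInfo make_pipeline_layout_info(
              const uint64_t *layouts, uint32_t count) {
          MockPipelineLayoutCreateInfo info;
          info.setLayoutCount = count;
          info.pSetLayouts    = layouts;
          return info;
      }
      ```

      Real code would pass these through vkCreateDescriptorSetLayout()  and vkCreatePipelineLayout() ; here set 0 could be the global layout and set 1 the per-object one.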

Binding

  • Descriptor state is tracked only inside a command buffer; descriptor sets are always bound at command buffer level, and their state is local to the command buffer.

    • They are not bound at queue level or global level, only to command buffers.

  • Which set index to choose :

    • According to GPU vendors, each descriptor set slot has a cost, so the fewer we have, the better.

    • "Organize shader inputs into "sets" by update frequency."

    • Rarely changes -> low index.

    • Changes frequently -> high index.

    • Usually Descriptor Set 0 is used to always bind some global scene data, which will contain some uniform buffers and some special textures, and Descriptor Set 1 will be used for per-object data.

  • vkCmdBindDescriptorSets .

    • It needs to be done before the vkCmdDrawIndexed()  calls, for example.

    • commandBuffer

      • Is the command buffer that the descriptor sets will be bound to.

    • pipelineBindPoint

      • Is a VkPipelineBindPoint  indicating the type of the pipeline that will use the descriptors. There is a separate set of bind points for each pipeline type, so binding one does not disturb the others.

      • Unlike vertex and index buffers, descriptor sets are not unique to graphics pipelines, therefore, we need to specify if we want to bind descriptor sets to the graphics or compute pipeline.

      • Examples :

        • A raytracing command takes the currently bound descriptors from the raytracing bind point.

        • A draw command takes the currently bound descriptors from the graphics bind point.

        • The two don't interfere with each other.

    • layout

    • firstSet

      • Is the set number  of the first descriptor set  to be bound.

    • descriptorSetCount

      • Is the number of elements in the pDescriptorSets  array.

    • pDescriptorSets

      • Is a pointer to an array of handles to VkDescriptorSet  objects describing the descriptor sets to bind.

    • dynamicOffsetCount

      • Is the number of dynamic offsets in the pDynamicOffsets  array.

    • pDynamicOffsets

      • Is a pointer to an array of uint32_t  values specifying dynamic offsets.
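    • A sketch of the per-draw pattern these parameters enable, with a hypothetical recorder in place of the real call: set 0 (globals) is bound once, and each draw re-binds only set 1 at a different dynamic offset instead of binding a different descriptor set.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Hypothetical recorder standing in for vkCmdBindDescriptorSets: it
       * only captures the arguments so the parameter roles can be seen. */
      typedef struct {
          uint32_t first_set;       /* firstSet: set number of the first set bound */
          uint32_t set_count;       /* descriptorSetCount */
          uint32_t dynamic_offset;  /* the single entry of pDynamicOffsets */
      } RecordedBind;

      /* Per-draw: re-bind only set 1 (per-object data) at a new dynamic
       * offset; set 0 bound earlier is undisturbed because firstSet = 1. */
      static RecordedBind record_bind_per_object(uint32_t object_index,
                                                 uint32_t aligned_object_size) {
          RecordedBind r;
          r.first_set      = 1;
          r.set_count      = 1;
          r.dynamic_offset = object_index * aligned_object_size;
          return r;
      }
      ```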

Strategy: Descriptor Indexing ( EXT_descriptor_indexing )

Plan
  • SSBOs and UBOs.

    • Can I just put different data without restriction?

      • Yes. See the SSBO section for that.

    • SSBOs or UBOs?

      • Using storage buffers exclusively instead of uniform buffers can increase GPU time on some architectures.

      • I'll use SSBO, as that was the general recommendation.

      • Maybe I'll mix both.

  • Material Data:

    • The Material index is used to look up material data from material storage buffer. The textures can then be accessed using the indices from the material data and the descriptor array.

    • Could be sent via push constants, but if I choose to go for indirect rendering (I should), then I cannot use push constants. I'd use the instance index (or similar) to index into a []Material_Data .

  • Model Matrix / Transforms:

    • Same as material data. I can send via push constants if direct drawing, or via []model_matrix  if indirect drawing.

  • Globals:

    • Camera view/proj, lights, ambient, etc.

    • I could just bind this once as well.

  • Vertex:

    • Indirect vs Full bindless:

      • I'm not sure. I'll use Indirect Drawing for now. ChatGPT deep search didn't give me much.

    1. Indirect Drawing:

      • For indirect drawing, it makes sense to just vkCmdBindIndexBuffer , as I NEED the vertex shader to be called the number of times I specified.

      • Plan: go for bindless first, drawing direct. Instead of using the instanceID  or similar, I just send the draw_data index via push constants. This way, the shader will be completely finalized; then I batch the draws via draw indirect and use the instanceID  instead of the push-constant ID.

      • Indirect Drawing will be the last thing

      • Why not invert and do indirect first? I cannot do that, as the instanceID  is useless without a bindless design! I NEED a use for the ID, as I cannot bind descriptor sets or push constants for each individual draw! Bindless first is a MUST.

    2. Full bindless:

      • Using a large index buffer: We need to bind index data. If, just like the vertex data, index data is allocated in one large index buffer, we only need to bind it once using vkCmdBindIndexBuffer .

      • While Vulkan provides a first-class way to specify vertex data by calling vkCmdBindVertexBuffers , having to bind vertex buffers per-draw would not work for a fully bindless design.

        • Additionally, some hardware doesn’t support vertex buffers as a first-class entity, and the driver has to emulate vertex buffer binding, which causes some CPU-side slowdowns when using vkCmdBindVertexBuffers .

      • In a fully bindless design, we need to assume that all vertex buffers are suballocated in one large buffer and either use per-draw vertex offsets ( vertexOffset  argument to vkCmdDrawIndexed ) to have hardware fetch data from it, or pass an offset in this buffer to the shader with each draw call and fetch data from the buffer in the shader. Both approaches can work well, and might be more or less efficient depending on the GPU.
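      • A sketch of the CPU-side command this leads to. The struct below mirrors VkDrawIndexedIndirectCommand  field-for-field (the helper name is made up); using firstInstance  to carry the draw-data index assumes the drawIndirectFirstInstance  feature, so the shader can read it back via gl_InstanceIndex  (or use gl_DrawIDARB  as in the snippets below).

        ```c
        #include <assert.h>
        #include <stdint.h>

        /* Field order matches the real VkDrawIndexedIndirectCommand, so an
         * array of these can be written straight into the indirect buffer. */
        typedef struct {
            uint32_t indexCount;
            uint32_t instanceCount;
            uint32_t firstIndex;    /* offset (in indices) into the shared index buffer */
            int32_t  vertexOffset;  /* base vertex inside the one large vertex buffer */
            uint32_t firstInstance; /* here: the draw_data index for the shader */
        } DrawIndexedIndirectCommand;

        static DrawIndexedIndirectCommand make_draw(uint32_t index_count,
                                                    uint32_t first_index,
                                                    int32_t base_vertex,
                                                    uint32_t draw_data_index) {
            DrawIndexedIndirectCommand c;
            c.indexCount    = index_count;
            c.instanceCount = 1;
            c.firstIndex    = first_index;
            c.vertexOffset  = base_vertex;
            c.firstInstance = draw_data_index;
            return c;
        }
        ```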

    3. Mesh Shaders.

      • Mesh shaders are probably what is most true to the bindless strategy, but I won't go that way yet (too soon, too new).

    4. Compute

      • Maybe I could use a compute shader to do this for me, but then I'd lose the rasterizer.

  • Draw Data:

    • Indices to index into the other arrays.

    struct DrawData
    {
        uint materialIndex;
        uint transformOffset;
        uint vertexOffset;
        uint unused0; // vec4 padding
    
        // ... extra gameplay data goes here
    };
    
    • Vertex Shader:

      DrawData dd = drawData[gl_DrawIDARB];
      TransformData td = transformData[dd.transformOffset];
      vec4 positionLocal = vec4(positionData[gl_VertexIndex + dd.vertexOffset], 1.0);
      vec3 positionWorld = mat4x3(td.transform[0], td.transform[1], td.transform[2]) * positionLocal;
      
    • Frag Shader:

      DrawData dd = drawData[drawId];
      MaterialData md = materialData[dd.materialIndex];
      vec4 albedo = texture(sampler2D(materialTextures[md.albedoTexture], albedoSampler), uv * vec2(md.tilingX, md.tilingY));
      
  • Slots:

    • Will the texture buffer and the material data buffer live in the same set 0, or should they be split into sets 0/1?

    • Probably every bind is on descriptor set 0.

    • The slots are based on frequency, but every single binding I'm talking about might just be bound once globally without problems.

  • Overall:

    • []textures

    • []material_data

      • uv, flip, modulate, etc.

    • []model_matrices

      • transforms.

    • []draw_data

      • Indices to index into the other arrays.

    • vertex/indices

      • As input attributes, to then use Indirect Drawing.

About
  • Descriptor indexing is also known by the term "bindless", which refers to the fact that binding individual descriptor sets and descriptors is no longer the primary way we keep shader pipelines fed. Instead, we can bind a huge descriptor set once and just index into a large number of descriptors.

  • Adds a lot  of flexibility to how resources are accessed.

  • "Bindless algorithms" are generally built around this flexibility where we either index freely into a lot of descriptors at once, or update descriptors where we please. In this model, "binding" descriptors is not a concern anymore.

  • The core functionality of this extension is that we can treat descriptor memory as one massive array, and we can freely access any resource we want at any time, by indexing.

  • If an array is large enough, an index into that array is indistinguishable from a pointer.

  • At most, we need to write/copy descriptors to where we need them and we can now consider descriptors more like memory blobs rather than highly structured API objects.

  • The introduction of descriptor indexing revealed that the descriptor model is all just smoke and mirrors. A descriptor is just a blob of binary data that the GPU can interpret in some meaningful way. The API calls to manage descriptors really just boil down to “copy magic bits here.”

  • Support :

    • Descriptor Indexing was created in 2018, so all hardware 2018+ should support it.

    • Core in Vulkan 1.2+

    • Limits queried using VkPhysicalDeviceDescriptorIndexingPropertiesEXT .

    • Features queried using VkPhysicalDeviceDescriptorIndexingFeaturesEXT .

    • Features toggled using VkPhysicalDeviceDescriptorIndexingFeaturesEXT .

  • Required for :

    • Raytracing.

    • Many GPU Driven Rendering approaches.

  • Advantages :

    • No costly transfer of descriptors to the GPU every frame (this shows up as a lot of time spent in vkUpdateDescriptorSets ).

    • More flexible / dynamic rendering architecture

    • No manual tracking of per-object resource groups

    • Updating matrices and material data can be done in bulk before command recording

    • CPU and GPU refer to resources the same way, by index

    • GPU can store Texture IDs in a buffer for reference later in the frame – many uses

    • Easy Vertex Pulling – gets rid of binding vertex buffers

    • Write resource indexes from one shader into a buffer that another shader reads & uses

    • G-Buffer can use material ID instead of values

    • Terrain Splatmap contains material IDs allowing many materials to be used, instead of 4

    • And more…

  • Disadvantages :

    • Requires hardware support

      • May be too new for widespread use

      • Different “feature levels” can help ease transition

    • Different Performance Penalties

      • Array indexing can cause memory indirections

        • Fetching texture descriptors from an array, indexed by material data that is itself indexed by a material index, can add an extra indirection on the GPU compared to some alternative designs

    • “With great power comes great responsibility”

      • GPU can't verify that valid descriptors are bound

      • Validation is costlier: happens inside shaders

      • Can be difficult to debug

      • Descriptor management is up to the Application

    • On some hardware, various descriptor set limits may make this technique impractical to implement; to be able to index an arbitrary texture dynamically from the shader, maxPerStageDescriptorSampledImages  should be large enough to accommodate all material textures - while many desktop drivers expose a large limit here, the specification only guarantees a limit of 16, so bindless remains out of reach on some hardware that otherwise supports Vulkan.

  • Comparison: Indexing resources without the extension :

    • Descriptor Indexing, explanation of "dynamic non-uniform" .

      • Good read.

    • Constant Indexing :

      layout(set = 0, binding = 0) uniform sampler2D Tex[4];
      
      texture(Tex[0], ...);
      texture(Tex[2], ...);
      
      // We can trivially flatten a constant-indexed array into individual resources,
      // so, constant indexing requires no fancy hardware indexing support.
      layout(set = 0, binding = 0) uniform sampler2D Tex0;
      layout(set = 0, binding = 1) uniform sampler2D Tex1;
      layout(set = 0, binding = 2) uniform sampler2D Tex2;
      layout(set = 0, binding = 3) uniform sampler2D Tex3;
      
    • Image Array Dynamic Indexing :

      • The dynamic indexing features allow us to use a non-constant expression to index an array.

        • This has been supported since Vulkan 1.0.

      • The restriction is that the index must be dynamically uniform .

      layout(set = 0, binding = 0) uniform sampler2D Tex[4];
      
      texture(Tex[dynamically_uniform_expression], ...);
      
    • Non-uniform vs Texture Atlas vs Texture Array :

      • Accessing arbitrary textures in a draw call is not a new problem, and graphics programmers have found ways over the years to workaround restrictions in older APIs. Rather than having multiple textures, it is technically possible to pack multiple textures into one texture resource, and sample from the correct part of the texture. This kind of technique is typically referred to as "texture atlas". Texture arrays (e.g. sampler2DArray) is another feature which can be used for similar purposes.

      • Problems with atlas:

        • Mip-mapping is hard to implement, and most likely must be done manually with derivatives and math.

        • Anisotropic filtering is basically impossible.

        • Any sampler addressing mode other than CLAMP_TO_EDGE  is very awkward to implement.

        • Cannot use different texture formats.

      • Problems with texture array:

        • All resolutions must match.

        • Number of array layers is limited (just 256 in min-spec).

        • Cannot use different texture formats.

      • Non-uniform indexing solves these issues since we can freely use multiple sampled image descriptors instead. Atlases and texture arrays still have their place. There are many use cases where these restrictions do not cause problems.

      • Non-uniform indexing is not just limited to textures (although that is the most relevant use case). Any descriptor type can be used as long as the device supports it.

Features
  • Update-after-bind :

    • In Vulkan, you generally have to create a VkDescriptorSet  and update it with all descriptors before you call vkCmdBindDescriptorSets . After a set is bound, the descriptor set cannot be updated again until the GPU is done using it. This gives drivers a lot of flexibility in how they access the descriptors. They are free to copy the descriptors and pack them somewhere else, promote them to hardware registers, the list goes on.

    • Update-After-Bind gives flexibility to applications instead. Descriptors can be updated at any time as long as they are not actually accessed by the GPU. Descriptors can also be updated while the descriptor set is bound to a command buffer, which enables a "streaming" use case.

      • This means the application doesn’t have to unbind or re-record command buffers just to change descriptors—reducing CPU overhead in some streaming-resource scenarios.

    • Concurrent Updates :

      • Another "hidden" feature of update-after-bind is that it is possible to update the descriptor set from multiple threads. This is very useful for true "bindless" since unrelated tasks might want to update descriptors in different parts of the streamed/bindless descriptor set.

  • Non-uniform indexing :

    • While update-after-bind adds flexibility to descriptor management, non-uniform indexing adds great flexibility for shaders.

    • It completely removes all restrictions on how we index into arrays, but we must notify our intent to the compiler.

    • Normally, drivers and hardware can assume that the dynamically uniform guarantee holds, and optimize for that case.

    • If we use the nonuniformEXT  decoration in GL_EXT_nonuniform_qualifier  we can let the compiler know that the guarantee does not necessarily hold, and the compiler will deal with it in the most efficient way possible for the target hardware. The rationale for having to annotate like this is that driver compiler backends would be forced to be more conservative than necessary if applications were not required to use nonuniformEXT .

    • When to use it :

      • The invocation group :

        • The invocation group is a set of threads (invocations) which work together to perform a task.

        • In graphics pipelines, the invocation group is all threads which are spawned as part of a single draw command. This includes multiple instances, and for multi-draw-indirect it is limited to a single gl_DrawID .

        • In compute pipelines, the invocation group is a single workgroup, so it’s very easy to know when it is safe to avoid nonuniformEXT.

        • An expression is considered dynamically uniform  if all invocations in an invocation group have the same value.

          • In other words, dynamically uniform  means that the index is the same across all threads spawned by a draw command.

      • Interaction with Subgroups :

        • It is very easy to think that dynamically uniform just means "as long as the index is uniform in the subgroup, it’s fine!". This is certainly true for most (desktop) architectures, but not all.

        • It is technically possible that a value can be subgroup uniform, but still not dynamically uniform. Consider a case where we have a workgroup size of 128 threads, with a subgroup size of 32. Even if each subgroup does subgroupBroadcastFirst()  on the index, each subgroup might have different values, and thus, we still technically need nonuniformEXT  here. If you know that you have only one subgroup per workgroup however, subgroupBroadcastFirst()  is good enough.

        • The safe thing to do is to just add nonuniformEXT  if you cannot prove the dynamically uniform property. If the compiler knows that it only really cares about subgroup uniformity, it could trivially optimize away nonuniformEXT(subgroupBroadcastFirst())  anyways.

        • A common reason to use subgroups here in the first place is that they were an old workaround for the lack of true non-uniform indexing, especially on desktop GPUs. A common pattern would be something like:
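
That common pattern is the subgroup "waterfall" loop: scalarize the index by looping over the distinct values present in the subgroup. A hedged GLSL sketch (`Textures` is a hypothetical bindless array; `subgroupBroadcastFirst` comes from `GL_KHR_shader_subgroup_ballot`):

```glsl
// Hypothetical sketch: scalarize a divergent index with subgroup ops instead
// of using nonuniformEXT. Each loop iteration handles one distinct index value.
#extension GL_KHR_shader_subgroup_ballot : require

vec4 sample_waterfall(uint index, vec2 uv)
{
    for (;;)
    {
        // Broadcast the first active lane's index; the result is subgroup uniform.
        uint uniform_index = subgroupBroadcastFirst(index);
        if (uniform_index == index)
        {
            // Every lane taking this branch shares the same index value.
            // Lanes that return become inactive; the loop continues with the
            // remaining lanes until every distinct index has been handled.
            return texture(Textures[uniform_index], uv);
        }
    }
}
```

As the bullets above note, this only establishes subgroup uniformity, which is technically weaker than the dynamically uniform guarantee.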

Implementation
  • Examples :

    • odin_cool_engine:

      • odin_cool_engine/src/rp_ui.odin

        • It just sends an index to the compute pipeline via push constants.

      • odin_cool_engine/src/renderer.odin:725

        • It just sends an index to the compute pipeline via push constants.

    • Descriptor Indexing Sample .

  • Setup :

    1. Check availability of the extension through vk.EXT_DESCRIPTOR_INDEXING_EXTENSION_NAME  + vk.EnumerateDeviceExtensionProperties .

    2. Check supported features of the extension through vk.GetPhysicalDeviceFeatures2  + vk.PhysicalDeviceDescriptorIndexingFeatures  as the pNext  term.
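
A hedged Odin sketch of these two steps, assuming the same vendored `vk` bindings as the snippets below (`physical_device` is a placeholder handle):

```odin
// 1. Check availability of VK_EXT_descriptor_indexing on the device.
count: u32
vk.EnumerateDeviceExtensionProperties(physical_device, nil, &count, nil)
props := make([]vk.ExtensionProperties, count, context.temp_allocator)
vk.EnumerateDeviceExtensionProperties(physical_device, nil, &count, raw_data(props))

has_descriptor_indexing := false
for &p in props {
    if string(cstring(&p.extensionName[0])) == vk.EXT_DESCRIPTOR_INDEXING_EXTENSION_NAME {
        has_descriptor_indexing = true
        break
    }
}

// 2. Query the extension's individual features via the pNext chain.
indexing_features := vk.PhysicalDeviceDescriptorIndexingFeatures{
    sType = .PHYSICAL_DEVICE_DESCRIPTOR_INDEXING_FEATURES,
}
features2 := vk.PhysicalDeviceFeatures2{
    sType = .PHYSICAL_DEVICE_FEATURES_2,
    pNext = &indexing_features,
}
vk.GetPhysicalDeviceFeatures2(physical_device, &features2)
// Inspect e.g. indexing_features.runtimeDescriptorArray,
// indexing_features.descriptorBindingPartiallyBound, etc.
```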

  • VkDescriptorSetLayoutCreateInfo .

    • flags

      • UPDATE_AFTER_BIND_POOL

        • Specifies that descriptor sets using this layout must be allocated from a descriptor pool created with the UPDATE_AFTER_BIND  bit set.

        • Descriptor set layouts created with this bit set have alternate limits for the maximum number of descriptors per-stage and per-pipeline layout.

        • The non-UpdateAfterBind limits only count descriptors in sets created without this flag. The UpdateAfterBind limits count all descriptors, but the limits may be higher than the non-UpdateAfterBind limits.

  • VkDescriptorBindingFlagBits :

    • PARTIALLY_BOUND

      • Specifies that descriptors in this binding that are not dynamically used do not need to contain valid descriptors at the time the descriptors are consumed.

        • A descriptor is 'dynamically used' if any shader invocation executes an instruction that performs any memory access using the descriptor.

        • If a descriptor is not dynamically used, any resource referenced by the descriptor is not considered to be referenced during command execution.

      • This means it is not necessary to bind every descriptor; a descriptor array binding can function even when not all array elements are written or valid.

      • This is critical if we want to make use of descriptor "streaming". A descriptor only has to be bound if it is actually used by a shader.

      • Without this feature, if you have an array of N descriptors and your shader indexes [0..N-1], all descriptors must be valid; otherwise behavior is undefined even if the shader never touches the uninitialized ones.

      • When enabled, you only need to write descriptors that the shader will index. “Holes” in the array are allowed, provided shader indices never touch them.

      • Use this when you want to leave “holes” in a large descriptor array (i.e. not update every element) without pre-filling unused slots with a fallback texture. When this flag is set, descriptors that are not dynamically used by the shader need not contain valid descriptors — but if the shader actually accesses an unwritten descriptor you still get undefined/invalid results. This is a convenience to avoid writing N fallback descriptors each time.

    • VARIABLE_DESCRIPTOR_COUNT

      • Allows a descriptor binding to have a variable number of descriptors.

      • Use a variable number of descriptors in an array.

      • Specifies that this is a variable-sized descriptor binding, whose size will be specified when a descriptor set is allocated using this layout.

      • This must only  be used for the last binding in the descriptor set layout (i.e. the binding with the largest value of binding).

      • vk.DescriptorSetLayoutBinding.descriptorCount

        • The value is treated as an upper bound on the size of the binding.

        • The actual count is supplied at allocation time via VkDescriptorSetVariableDescriptorCountAllocateInfo .

        • For the purposes of counting against limits such as maxDescriptorSet  and maxPerStageDescriptor , the full value of descriptorCount  is counted, except for descriptor bindings with a descriptor type of DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK , when VkDescriptorSetLayoutCreateInfo.flags  does not contain DESCRIPTOR_SET_LAYOUT_CREATE_DESCRIPTOR_BUFFER . In this case, descriptorCount  specifies the upper bound on the byte size of the binding; thus it counts against the maxInlineUniformBlockSize  and maxInlineUniformTotalSize  limits instead.

      • When we later allocate the descriptor set, we can declare how large we want the array to be.

      • Be aware that there is a global limit to the number of descriptors that can be allocated at any one time.

      • This is extremely useful when using EXT_descriptor_indexing , since we do not have to allocate a fixed amount of descriptors for each descriptor set.

      • In many cases, it is far more flexible to use runtime sized descriptor arrays.

      • Use this when you want the shader-visible length of a descriptor-array binding to be chosen per descriptor set at allocation time (i.e. different sets expose different array lengths) instead of using a single compile-time/layout upper bound. At allocation you pass the actual count with VkDescriptorSetVariableDescriptorCountAllocateInfo . This reduces bookkeeping/pool usage and lets you avoid allocating the full upper bound for every set. Requires the descriptor-indexing feature to be enabled, and the variable-size binding must be the last binding in the set.

    • UPDATE_AFTER_BIND

      • Specifies that if descriptors in this binding are updated between when the descriptor set is bound in a command buffer and when that command buffer is submitted to a queue, then the submission will use the most recently set descriptors for this binding and the updates do not invalidate the command buffer. Descriptor bindings created with this flag are also partially exempt from the external synchronization requirement in vkUpdateDescriptorSetWithTemplateKHR  and vkUpdateDescriptorSets . Multiple descriptors with this flag set can be updated concurrently in different threads, though the same descriptor must not be updated concurrently by two threads. Descriptors with this flag set can be updated concurrently with the set being bound to a command buffer in another thread, but not concurrently with the set being reset or freed.

      • Update-after-bind is another critical component of descriptor indexing, which allows us to update descriptors after a descriptor set has been bound to a command buffer.

      • This is critical for streaming descriptors, but it also relaxes threading requirements: multiple threads can update descriptors concurrently on the same descriptor set.

      • UPDATE_AFTER_BIND  descriptors are somewhat of a precious resource, but the min-spec in Vulkan is at least 500k descriptors, which should be more than enough.

    • UPDATE_UNUSED_WHILE_PENDING

      • Specifies that descriptors in this binding can be updated after a command buffer has bound this descriptor set, or while a command buffer that uses this descriptor set is pending execution, as long as the descriptors that are updated are not used by those command buffers. Descriptor bindings created with this flag are also partially exempt from the external synchronization requirement in vkUpdateDescriptorSetWithTemplateKHR and vkUpdateDescriptorSets in the same way as for UPDATE_AFTER_BIND . If PARTIALLY_BOUND  is also set, then descriptors can be updated as long as they are not dynamically used by any shader invocations. If PARTIALLY_BOUND  is not set, then descriptors can be updated as long as they are not statically used by any shader invocations.

      • Update-Unused-While-Pending is somewhat subtle, and allows you to update a descriptor while a command buffer is executing.

      • The only restriction is that the descriptor cannot actually be accessed by the GPU.

    • UPDATE_AFTER_BIND  vs UPDATE_UNUSED_WHILE_PENDING

      • Both involve updates to descriptor sets after they are bound, but UPDATE_UNUSED_WHILE_PENDING  is the weaker requirement: it only concerns descriptors that are not used, whereas UPDATE_AFTER_BIND  requires the implementation to observe updates to descriptors that are used.

  • Enabling Non-Uniform Indexing :

    1. Enable runtimeDescriptorArray  and shaderSampledImageArrayNonUniformIndexing  (required for indexing an array of COMBINED_IMAGE_SAMPLER ), descriptorBindingPartiallyBound  (optional, to avoid undefined behavior on not fully populated arrays).

      • On Vulkan <1.2, the features must be enabled via vk.PhysicalDeviceDescriptorIndexingFeatures .

      • On Vulkan >=1.2, the features must be enabled via vk.PhysicalDeviceVulkan12Features .

        • If this is not followed, you'll get:

        [ERROR] --- vkCreateDevice(): pCreateInfo->pNext chain includes a VkPhysicalDeviceVulkan12Features structure, then it must not include a VkPhysicalDeviceDescriptorIndexingFeatures structure. The features in VkPhysicalDeviceDescriptorIndexingFeatures were promoted in Vulkan 1.2 and is also found in VkPhysicalDeviceVulkan12Features. To prevent one feature setting something to TRUE and the other to FALSE, only one struct containing the feature is allowed.
        pNext chain: VkDeviceCreateInfo::pNext -> [STRUCTURE_TYPE_LOADER_DEVICE_CREATE_INFO] -> [STRUCTURE_TYPE_LOADER_DEVICE_CREATE_INFO] -> [VkPhysicalDeviceVulkan13Features] -> [VkPhysicalDeviceVulkan12Features] -> [VkPhysicalDeviceDynamicRenderingUnusedAttachmentsFeaturesEXT] -> [VkPhysicalDeviceDescriptorIndexingFeatures].
        The Vulkan spec states: If the pNext chain includes a VkPhysicalDeviceVulkan12Features structure, then it must not include a VkPhysicalDevice8BitStorageFeatures, VkPhysicalDeviceShaderAtomicInt64Features, VkPhysicalDeviceShaderFloat16Int8Features, VkPhysicalDeviceDescriptorIndexingFeatures, VkPhysicalDeviceScalarBlockLayoutFeatures, VkPhysicalDeviceImagelessFramebufferFeatures, VkPhysicalDeviceUniformBufferStandardLayoutFeatures, VkPhysicalDeviceShaderSubgroupExtendedTypesFeatures, VkPhysicalDeviceSeparateDepthStencilLayoutsFeatures, VkPhysicalDeviceHostQueryResetFeatures, VkPhysicalDeviceTimelineSemaphoreFeatures, VkPhysicalDeviceBufferDeviceAddressFeatures, or VkPhysicalDeviceVulkanMemoryModelFeatures structure (https://vulkan.lunarg.com/doc/view/1.4.328.0/windows/antora/spec/latest/chapters/devsandqueues.html#VUID-VkDeviceCreateInfo-pNext-02830)
        
      vulkan12_features := vk.PhysicalDeviceVulkan12Features{
          // etc
      
          descriptorIndexing                        = true,
              // Descriptor Indexing: aggregate "all of descriptor indexing" flag.
              // This field only exists in PhysicalDeviceVulkan12Features (Vulkan >= 1.2);
              // the EXT struct exposes per-feature bits instead.
      
          runtimeDescriptorArray                    = true,
              // Descriptor Indexing:
      
          shaderSampledImageArrayNonUniformIndexing = true,
              // Descriptor Indexing: required for indexing an array of `COMBINED_IMAGE_SAMPLER`.
      
          descriptorBindingPartiallyBound           = true,
              // Descriptor Indexing: optional, to avoid undefined behavior on not fully populated arrays.
      
          descriptorBindingVariableDescriptorCount  = true,
              // Descriptor Indexing: Allows a descriptor binding to have a variable number of descriptors.
      
          // etc
      }
      
    2. In GLSL use the GL_EXT_nonuniform_qualifier  extension and wrap the index with nonuniformEXT(...)  (or apply nonuniformEXT  to the loaded value) so the compiler emits the SPIR-V NonUniformEXT  decoration.

    • In the shader :

      • Constructors and builtin functions, which all have return types that are not qualified by nonuniformEXT , will not generate nonuniform results.

        • Shaders need to use the constructor syntax (or assignment to a nonuniformEXT -qualified variable) to re-add the nonuniformEXT  qualifier to the result of builtin functions.

        • Correct:

          • It is important to note that to be 100% correct, we must use:

          • nonuniformEXT(sampler2D()) .

          • It is the final argument to a call like texture()  which determines if the access is to be considered non-uniform.

        • Wrong:

          • It is very common in the wild to see code like:

          • sampler2D(Textures[nonuniformEXT(in_texture_index)], ...)

          • This looks very similar to HLSL, but it is somewhat wrong.

          • Generally, it will work on drivers, but it is not technically correct.

        • Examples:

          • sampler2D()  is such a constructor, so we must add nonuniformEXT  afterwards.

            • out_frag_color = texture(nonuniformEXT(sampler2D(Textures[in_texture_index], ImmutableSampler)), in_uv);

      • Other use cases:

        • The nonuniform qualifier will propagate up to the final argument which is used in the load/store or atomic operation.

        • Examples:

          // At the top
          #extension GL_EXT_nonuniform_qualifier : require
          
          uniform UBO { vec4 data; } UBOs[];   
          vec4 foo = UBOs[nonuniformEXT(index)].data;
          
          buffer  SSBO { vec4 data; } SSBOs[]; 
          vec4 foo = SSBOs[nonuniformEXT(index)].data;
          
          uniform sampler2D Tex[];
          vec4 foo = texture(Tex[nonuniformEXT(index)], uv);
          
          uniform uimage2D Img[];              
          uint count = imageAtomicAdd(Img[nonuniformEXT(index)], uv, val);
          
          #version 450
          #extension GL_EXT_nonuniform_qualifier : require
          layout(local_size_x = 64) in;
          
          layout(set = 0, binding = 0) uniform sampler2D Combined[];
          layout(set = 1, binding = 0) uniform texture2D Tex[];
          layout(set = 2, binding = 0) uniform sampler Samp[];
          layout(set = 3, binding = 0) uniform U { vec4 v; } UBO[];
          layout(set = 4, binding = 0) buffer S { vec4 v; } SSBO[];
          layout(set = 5, binding = 0, r32ui) uniform uimage2D Img[];
          
          void main()
          {
              uint index = gl_GlobalInvocationID.x;
              vec2 uv = vec2(gl_GlobalInvocationID.yz) / 1024.0;
          
              vec4 a = textureLod(Combined[nonuniformEXT(index)], uv, 0.0);
              vec4 b = textureLod(nonuniformEXT(sampler2D(Tex[index], Samp[index])), uv, 0.0);
              vec4 c = UBO[nonuniformEXT(index)].v;
              vec4 d = SSBO[nonuniformEXT(index)].v;
          
              imageAtomicAdd(Img[nonuniformEXT(index)], ivec2(0), floatBitsToUint(a.x + b.y + c.z + d.w));
          }
          
      • Caveats:

        • LOD:

          • Using implicit LOD with nonuniformEXT can be spicy! If the threads in a quad do not have the same index, LOD might not be computed correctly.

          • The quadDivergentImplicitLod  property lets you know if it will work.

          • In this case however, it is completely fine, since the helper lanes in a quad must come from the same primitive, which all have the same flat fragment input.

      • Avoiding nonuniformEXT :

        • You might consider using subgroup operations to implement nonuniformEXT  on your own.

        • This is technically out of spec, since the SPIR-V specification states that to avoid nonuniformEXT , the shader must guarantee that the index is "dynamically uniform".

        • "Dynamically uniform" means the value is the same across all invocations in an "invocation group".

        • The invocation group is defined to be all invocations (threads) for:

          • An entire draw command (for graphics)

          • A single workgroup (for compute).

        • Avoiding nonuniformEXT  with clever programming is far more likely to succeed when writing compute shaders, since the workgroup boundary serves as a much easier boundary to control than entire draw commands.

        • It is often possible to match workgroup to subgroup 1:1, unlike graphics where you cannot control how quads are packed into subgroups at all.

        • The recommended approach here is to just let the compiler do its thing to avoid horrible bugs in the future.

  • Enabling Update-After-Bind :

    1. In VkDescriptorSetLayoutCreateInfo  we must pass down binding flags in a separate struct with pNext .

      bindings_count := len(stage_set_layout.bindings)
      descriptor_bindings_flags := make([]vk.DescriptorBindingFlagsEXT, bindings_count, context.temp_allocator)
      for i in 0..<len(descriptor_bindings_flags) {
          descriptor_bindings_flags[i] = { .PARTIALLY_BOUND }
      }
      descriptor_bindings_flags[bindings_count - 1] += { .VARIABLE_DESCRIPTOR_COUNT }
          // Only the last binding supports VARIABLE_DESCRIPTOR_COUNT.
      
      descriptor_binding_flags_create_info := vk.DescriptorSetLayoutBindingFlagsCreateInfoEXT{
          sType         = .DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO_EXT,
          bindingCount  = u32(bindings_count),
          pBindingFlags = raw_data(descriptor_bindings_flags),
          pNext         = nil,
      }
      descriptor_set_layout_create_info := vk.DescriptorSetLayoutCreateInfo{
          sType        = .DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
          flags        = {  },
      
          bindingCount = u32(bindings_count),
          pBindings    = raw_data(stage_set_layout.bindings),
      
          pNext        = &descriptor_binding_flags_create_info,
      }
      
      // Num Descriptors
      static constexpr uint32_t NumDescriptorsStreaming  = 2048;
      static constexpr uint32_t NumDescriptorsNonUniform = 64;
      
      // Pool
      uint32_t poolCount = NumDescriptorsStreaming + NumDescriptorsNonUniform;
      VkDescriptorPoolSize       pool_size = vkb::initializers::descriptor_pool_size(VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE, poolCount);
      VkDescriptorPoolCreateInfo pool      = vkb::initializers::descriptor_pool_create_info(1, &pool_size, 2);
      
      // Allocate
      VkDescriptorSetVariableDescriptorCountAllocateInfoEXT variable_info{};
      allocate_info.pNext              = &variable_info;
      
      variable_info.sType              = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_VARIABLE_DESCRIPTOR_COUNT_ALLOCATE_INFO_EXT;
      variable_info.descriptorSetCount = 1;
      variable_info.pDescriptorCounts = &NumDescriptorsStreaming;
      CHECK(vkAllocateDescriptorSets(get_device().get_handle(), &allocate_info, &descriptors.descriptor_set_update_after_bind));
      variable_info.pDescriptorCounts = &NumDescriptorsNonUniform;
      CHECK(vkAllocateDescriptorSets(get_device().get_handle(), &allocate_info, &descriptors.descriptor_set_nonuniform));
      
    2. The VkDescriptorPool  must also be created with UPDATE_AFTER_BIND . Note that there is a global limit to how many UPDATE_AFTER_BIND  descriptors can be allocated at any point. The min-spec here is 500k, which should be good enough.

Strategy: Descriptor Buffers ( EXT_descriptor_buffer )

  • Article .

  • Sample .

  • Released on (2022-11-21).

  • TLDR :

    • Descriptor sets are now backed by VkBuffer  objects where you memcpy  in descriptors. Delete VkDescriptorPool  and VkDescriptorSet  from the API, and have fun!

    • Performance is either equal or better.

  • Coming from Descriptor Indexing, where we use plain uints instead of actual descriptor sets, some design questions come up.

  • Do we assign one uint per descriptor, or do we try to group them together such that we only need to push one base offset?

  • If we go with the latter, we might end up having to copy descriptors around. If we go with one uint per descriptor, we just added extra indirection on the GPU. GPU throughput might suffer with the added latency.

  • On the other hand, having to group descriptors linearly one after the other can easily lead to copy hell. Copying descriptors is still an abstracted operation that requires API calls to perform, and we cannot perform it on the GPU. The overhead of all these calls in the driver can be quite significant, especially in API layering. I’ve seen up to 10 million calls to “copy descriptor” per second which adds up.

  • Managing descriptors really starts looking more and more like just any other memory management problem. Let’s try translating existing API concepts into what they really are under the hood.

  • vkCreateDescriptorPool

    • vkAllocateMemory . Memory type unknown, but likely HOST_VISIBLE  and DEVICE_LOCAL . Size of pool computed from pool entries.

  • vkAllocateDescriptorSets

    • Linear or arena allocation from pool. Size and alignment computed from VkDescriptorSetLayout .

  • vkUpdateDescriptorSets

    • Writes raw descriptor data by copying payload from VkImageView  / VkSampler  / VkBufferView . Write offset is deduced from VkDescriptorSetLayout  and binding. The VkDescriptorSet  contains a pointer to HOST_VISIBLE  mapped CPU memory. Copies are similar.

  • vkCmdBindDescriptorSets

    • Binds the GPU VA of the VkDescriptorSet  somehow.

  • The descriptor buffer API effectively removes VkDescriptorPool  and VkDescriptorSet . The APIs now expose lower level detail.

  • For example, there’s now a bunch of properties to query:

    typedef struct VkPhysicalDeviceDescriptorBufferPropertiesEXT {
        …
        size_t             samplerDescriptorSize;
        size_t             combinedImageSamplerDescriptorSize;
        size_t             sampledImageDescriptorSize;
        size_t             storageImageDescriptorSize;
        size_t             uniformTexelBufferDescriptorSize;
        size_t             robustUniformTexelBufferDescriptorSize;
        size_t             storageTexelBufferDescriptorSize;
        size_t             robustStorageTexelBufferDescriptorSize;
        size_t             uniformBufferDescriptorSize;
        size_t             robustUniformBufferDescriptorSize;
        size_t             storageBufferDescriptorSize;
        size_t             robustStorageBufferDescriptorSize;
        size_t             inputAttachmentDescriptorSize;
        size_t             accelerationStructureDescriptorSize;
        …
    } VkPhysicalDeviceDescriptorBufferPropertiesEXT;
    

Strategy: Push Descriptor ( VK_KHR_push_descriptor )

  • Promoted to core in Vulkan 1.4.

  • Last modified date: (2017-09-12).

  • This extension allows descriptors to be written into the command buffer, while the implementation is responsible for managing their memory. Push descriptors may enable easier porting from older APIs and in some cases can be more efficient than writing descriptors into descriptor sets.

  • Sample .

  • New Commands

    • vkCmdPushDescriptorSetKHR

  • If Vulkan Version 1.1 or VK_KHR_descriptor_update_template  is supported:

    • vkCmdPushDescriptorSetWithTemplateKHR

  • New Structures

    • Extending VkPhysicalDeviceProperties2 :

      • VkPhysicalDevicePushDescriptorPropertiesKHR

  • New Enum Constants

    • VK_KHR_PUSH_DESCRIPTOR_EXTENSION_NAME

    • VK_KHR_PUSH_DESCRIPTOR_SPEC_VERSION

    • Extending VkDescriptorSetLayoutCreateFlagBits :

      • VK_DESCRIPTOR_SET_LAYOUT_CREATE_PUSH_DESCRIPTOR_BIT_KHR

    • Extending VkStructureType:

      • VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PUSH_DESCRIPTOR_PROPERTIES_KHR

  • If Vulkan Version 1.1 or VK_KHR_descriptor_update_template is supported:

    • Extending VkDescriptorUpdateTemplateType :

      • VK_DESCRIPTOR_UPDATE_TEMPLATE_TYPE_PUSH_DESCRIPTORS_KHR
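
In use, the extension boils down to creating the set layout with the push-descriptor flag and then writing descriptors straight into the command buffer. A hedged Odin sketch (`cmd`, `pipeline_layout`, and `buffer` are placeholder handles; the set layout is assumed to have been created with VK_DESCRIPTOR_SET_LAYOUT_CREATE_PUSH_DESCRIPTOR_BIT_KHR ):

```odin
// No descriptor pool and no vkAllocateDescriptorSets: the write goes straight
// into the command buffer and the driver manages the backing memory.
buffer_info := vk.DescriptorBufferInfo{
    buffer = buffer,
    offset = 0,
    range  = 256, // placeholder size of the uniform data
}
write := vk.WriteDescriptorSet{
    sType           = .WRITE_DESCRIPTOR_SET,
    // dstSet is ignored for push descriptors.
    dstBinding      = 0,
    descriptorCount = 1,
    descriptorType  = .UNIFORM_BUFFER,
    pBufferInfo     = &buffer_info,
}
vk.CmdPushDescriptorSetKHR(cmd, .GRAPHICS, pipeline_layout, 0, 1, &write)
```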

Strategy: Bindful / Classic strategy (Slot-based / Frequency-based)

  • mna (midmidmid):

    • The reason you split up resources into multiple sets is actually to reduce  the cost of vkCmdBindDescriptorSets . The idea being that if you've got one set that holds scene-wide data and a different set that holds object-specific data, you only bind the scene stuff once  and then just leave it bound. Then the per-object updates go faster because you're pushing much smaller descriptor sets into whatever special silicon descriptor sets map to on your particular GPU. Note: there are rules about how you have to arrange your sets (so like the scene-wide one has to be at a lower index than the per-object one), and all of the pipelines you use must have compatible  layouts for the sets you aren't rebinding every time you switch to a different pipeline. Someone can correct me if I'm wrong, but if you switch to a pipeline that's got an incompatible layout for some descriptor set at index n  then all  descriptor sets at indices >= n  need to be rebound.

    • I think the only reason I'd change any of my stuff to bindless is if I hit however many hundreds of thousands of calls to vkCmdBindDescriptorSets  it takes for descriptors to be a per-frame bottleneck.

    • But I find descriptors pretty intuitive and easy to work with.

    • I didn't  find them easy to work with when I first  came to VK (from GL/D3D11-world), but now that I've got some scaffolding set up to manage them, they're easy sauce.

    • (They actually map pretty well to having worked with old  console GPUs where you manage the command queue directly and have to think about resource bindings in terms of physical registers on the GPU. It was helpful to have that background.)

    • If you're working with descriptor sets, then you have lots of little objects whose lifetimes you need to track and manage. Getting them grouped into the appropriate set of pools  cuts that number down to something that's not hard to manage. So, for me, I've got a dynamically allocated and recycled set of descriptor pools for stuff that changes every frame, and then I've got my materials grouped into pack files (for fast content loading) and each of those has one descriptor pool for all the sets for all of its materials. Easy peasy. For bindless, you need to figure out how you're going to divide up the big array of descriptors in your one mega set. There's different strategies for doing that. But you'll get a better description of them out of the bindless fans on the server.

    • Implementation-wise, I  don't think there's a huge complexity difference between the two approaches. Bindless might be conceptually  simpler since "it's just a big array" doesn't require as big of a mental shift as dividing resources up by usage and update frequency and thinking in those  terms.

  • In the “classic” model, before you draw or dispatch, you must bind each resource to a specific descriptor binding or slot.

  • Example:

    • vkCmdBindDescriptorSets(...)

    • Binding texture #0 for this draw, texture #1 for that draw, etc.

  • The shader uses a fixed binding index:

    • layout(set = 0, binding = 3) uniform sampler2D tex;

  • If you want to change which texture is used, you re-bind that descriptor.


Specialization Constants

  • Allows a constant value in SPIR-V to be specified at VkPipeline  creation time.

  • This is powerful as it replaces the idea of doing preprocessor macros in the high level shading language (GLSL, HLSL, etc).

  • A way to provide constant values to a SPIR-V shader at pipeline creation time so the compiler can constant-fold, inline, and eliminate branches.

    • This yields code equivalent to having compiled separate shader variants with those constant values baked in.

  • This is not Vulkan exclusive, but an optimization from SPIR-V. OpenGL 4.6 can also use this feature.

  • Sample .

  • UBOs and Push Constants suffer from limited optimizations during shader compilation. Specialization Constants can provide those optimizations:

    • Uniform buffer objects (UBOs) are one of the most common approaches when it is necessary to set values within a shader at run-time and are used in many tutorials. UBOs are pushed to the shader just prior to its execution, this is after shader compilation which occurs during vkCreateGraphicsPipelines . As these values are set after the shader has been compiled, the driver’s shader compiler has limited scope to perform optimizations to the shader during its compilation. This is because optimizations such as loop unrolling or unused code removal require the compiler to have knowledge of the values controlling them which is not possible with UBOs. Push constants also suffer from the same problems as UBOs, as they are also provided after the shader has been compiled.

    • Specialization Constants  are set before pipeline creation meaning these values are known during shader compilation, this allows the driver’s shader compiler to perform optimizations. In this optimisation process the compiler has the ability to remove unused code blocks and statically unroll which reduces the fragment cycles required by the shader which results in increased performance.

    • While specialization constants rely on knowing the required values before pipeline creation occurs, by trading off this flexibility and allowing the compiler to perform these optimizations you can increase the performance of your application easily and reduce shader code size.

  • Do :

    • Use compile-time specialization constants for all control flow. This allows compilation to completely remove unused code blocks and statically unroll loops.

  • Don’t :

    • Use control-flow which is parameterized by uniform values; specialize shaders for each control path needed instead.

  • Impact :

    • Reduced performance due to less efficient shader programs.

  • Example :

    #version 450
    layout (constant_id = 0) const float myColor = 1.0;
    layout(location = 0) out vec4 outColor;
    
    void main() {
        outColor = vec4(myColor);
    }
    
    struct myData {
        float myColor = 1.0f;
    } myData;
    
    VkSpecializationMapEntry mapEntry = {};
    mapEntry.constantID = 0; // matches constant_id in GLSL and SpecId in SPIR-V
    mapEntry.offset     = 0;
    mapEntry.size       = sizeof(float);
    
    VkSpecializationInfo specializationInfo = {};
    specializationInfo.mapEntryCount = 1;
    specializationInfo.pMapEntries   = &mapEntry;
    specializationInfo.dataSize      = sizeof(myData);
    specializationInfo.pData         = &myData;
    
    VkGraphicsPipelineCreateInfo pipelineInfo = {};
    // (Simplified: `stages` is your array of VkPipelineShaderStageCreateInfo;
    // attach the specialization info to the fragment stage before creation.)
    stages[fragIndex].pSpecializationInfo = &specializationInfo;
    pipelineInfo.pStages = stages;
    
    // Create first pipeline with myColor as 1.0
    vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &pipelineInfo, nullptr, &pipeline1);
    
    // Create second pipeline with the same shader, but a different value
    myData.myColor = 0.5f;
    vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &pipelineInfo, nullptr, &pipeline2);
    
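
Tying the Do/Don't above together: parameterize control flow with a specialization constant rather than a uniform, so the dead branch is removed at pipeline creation. A minimal sketch (`USE_FOG` is a hypothetical feature toggle):

```glsl
#version 450
// Value is supplied (or defaulted) at pipeline creation time.
layout(constant_id = 0) const bool USE_FOG = false;

layout(location = 0) in vec4 in_color;
layout(location = 0) out vec4 out_color;

void main()
{
    out_color = in_color;
    // Known at compile time: the whole branch is eliminated when USE_FOG is
    // false, unlike a branch on a uniform or push-constant value.
    if (USE_FOG) {
        out_color.rgb = mix(out_color.rgb, vec3(0.5), 0.25);
    }
}
```
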
  • Use cases :

    • Toggling features:

      • Support for a feature in Vulkan isn’t known until runtime. This usage of specialization constants is to prevent writing two separate shaders, but instead embedding a constant runtime decision.

    • Improving backend optimizations:

      • Optimizing shader compilation  from SPIR-V to GPU.

      • The “backend” here refers to the implementation’s compiler that takes the resulting SPIR-V and lowers it down to some ISA to run on the device.

      • Constant values allow a set of optimizations such as constant folding , dead code elimination , etc. to occur.

    • Affecting types and memory sizes:

      • It is possible to set the length of an array or a variable type used through a specialization constant.

      • It is important to notice that a compiler will need to allocate registers depending on these types and sizes. This means it is likely that a pipeline cache will fail if the difference is significant in registers allocated.

  • How they work :

    • The values are supplied using VkSpecializationInfo  attached to the VkPipelineShaderStageCreateInfo .

    • In GLSL (or HLSL → SPIR-V) mark a constant with a constant id, e.g. layout(constant_id = 0) const int MATERIAL_MODE = 0;

    • Create VkSpecializationMapEntry  entries mapping constantID  → offset/size in your data block.

    • Fill a contiguous data buffer with the specialization values and set up VkSpecializationInfo .

    • Put the VkSpecializationInfo*  into the shader stage VkPipelineShaderStageCreateInfo  before calling vkCreateGraphicsPipelines . The backend finalizes (specializes/compiles) the shader at pipeline creation time.

  • How it affects the pipeline workflow :

    • TLDR :

      • It does not solve the pipeline workflow problem. It provides a system for shader optimization at SPIR-V→GPU compile time.

      • Specialization lets you get near-compile-time optimizations while still selecting variants at runtime, but it does not avoid having multiple created pipelines if you need multiple different specialized behaviors.

    • They do not, by themselves, precompile every possible branch permutation and keep them all resident for you. Each distinct set of specialization values that you want available at runtime normally corresponds to a separately created pipeline (the specialization values are applied during pipeline creation).

    • If you need multiple variants you must create (or reuse) the pipelines for those values.

    • If you have N independent boolean specialization choices, the number of possible specialized pipelines is 2^N (exponential growth). Creating many pipelines increases driver/state memory and creation time; use caching/derivatives/libraries if creation cost or count is a concern.

    • You cannot change a specialization constant per draw without binding a different pipeline: the specialization is fixed for the pipeline object, so per-draw changes require binding another pipeline or using a different strategy (uniforms, push constants, dynamic branching).

    • Different values mean different pipeline creation (driver work / memory).

    • "Is this a way to precompile every branching of a shader?"

      • Yes, but only if you actually create a pipeline for each variant.

      • Specialization constants let the driver compile-away branches at pipeline-creation time, but they do not magically produce all variants for you at draw time.
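
    • The boolean-combinatorics point above can be sketched in plain C. This is a hypothetical illustration (the names are made up): each set of boolean specialization choices packs into an integer key, and a real renderer would create and cache one specialized pipeline per key.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example: pack N independent boolean specialization choices
 * into one integer key for a pipeline lookup table. With N booleans there
 * are 2^N possible specialized pipelines. */
#define SPEC_USE_FOG     0
#define SPEC_USE_SHADOWS 1
#define SPEC_USE_SSAO    2
#define SPEC_BOOL_COUNT  3

uint32_t pack_variant_key(const int choices[SPEC_BOOL_COUNT])
{
    uint32_t key = 0;
    for (int i = 0; i < SPEC_BOOL_COUNT; ++i)
        if (choices[i])
            key |= 1u << i;
    return key; /* index into a table of variant_count() pipelines */
}

uint32_t variant_count(void)
{
    return 1u << SPEC_BOOL_COUNT; /* 2^N */
}
```

    • Even three independent booleans already require up to eight pipelines, which is why pipeline caches and pipeline libraries matter as N grows.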

  • Recommendations :

    • Improving Shader Performance with Vulkan’s Specialization Constants .

      • When we create the Vulkan pipeline, we pass this specialization information using the pSpecializationInfo  field of VkPipelineShaderStageCreateInfo . At that point, the driver will override the default value of the specialization constant with the value provided here before the shader code is optimized and native GPU code is generated, which allows the driver compiler backend to generate optimal code.

      • It is possible to compile the same shader with different constant values in different pipelines, so even if a value changes often, so long as we have a finite number of combinations, we can generate optimized pipelines for each one ahead of the start of the rendering loop and just swap pipelines as needed while rendering.

      • "promote the UBO array to a push constant".

      • Applying specialization constants in a small number of shaders allowed me to benefit from loop unrolling and, most importantly, UBO promotion to push constants in the SSAO pass, obtaining performance improvements that ranged from 10% up to 20% depending on the configuration.

      • In other words:

        • The article shows how it's possible to pass a value to the shader during graphics pipeline creation so the shader is compiled from SPIR-V to GPU with that constant altered.

        • This helps by allowing the SPIR-V→GPU compiler to make optimization choices such as unrolling loops and removing branches; it can also enable UBO promotion.

        • The article does not suggest specialization constants solve the pipeline workflow problem. It focuses on compile-time shader optimizations.

Physical Storage Buffer ( KHR_buffer_device_address )

  • Impressions :

    • (2025-09-08)

    • No descriptor sets.

      • Cool.

    • Very easy to set up.

    • Shader usage is a bit tricky; push constants are required to access buffers in many patterns.

    • More prone to programmer errors because there is no automatic bounds checking.

    • Hmm, idk, for now not sure.

  • Adds the ability to have “pointers in the shader”.

  • Buffer device address is a powerful and unique feature of Vulkan. It exposes GPU virtual addresses directly to the application, and the application can then use those addresses to access buffer data freely through pointers rather than descriptors.

  • This feature lets you place addresses in buffers and load and store to them inside shaders, with full capability to perform pointer arithmetic and other tricks.

  • Support :

    • Core in Vulkan 1.3.

    • Submitted at (2019-01-06), core at (2019-11-25).

    • Coverage :

      • (2025-09-08) 71.6%

      • 79.8% Windows

      • 70.9% Linux

      • 68.7% Android

  • Lack of safety :

    • A critical thing to note is that a raw pointer has no idea of how much memory is safe to access. Unlike SSBOs with robustness/bounds-checking features enabled, you must either do range checks yourself or avoid relying on out-of-bounds behavior.

  • Creating a buffer :

    • To be able to grab a device address from a VkBuffer , you must create the buffer with SHADER_DEVICE_ADDRESS  usage.

    • The memory you bind that buffer to must be allocated with the corresponding flag via pNext .

    VkMemoryAllocateFlagsInfoKHR flags_info{VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_FLAGS_INFO_KHR};
    flags_info.flags             = VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT_KHR;
    memory_allocation_info.pNext = &flags_info;
    
    • After allocating and binding the buffer, query the address:

    VkBufferDeviceAddressInfoKHR address_info{VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO_KHR};
    address_info.buffer = buffer.buffer;
    buffer.gpu_address  = vkGetBufferDeviceAddressKHR(device, &address_info);
    
    • This address behaves like a normal address; you can offset the VkDeviceAddress  value as you see fit since it is a uint64_t .

    • There is no host-side alignment requirement enforced by the API for this value.

    • When using this pointer in shaders, you must provide and respect alignment semantics yourself, because the shader compiler cannot infer anything about a raw pointer loaded from memory.

    • You can place this pointer inside another buffer and use it as an indirection.

  • GL_EXT_buffer_reference :

    • In Vulkan GLSL, the GL_EXT_buffer_reference  extension allows declaring buffer blocks as pointer-like types rather than SSBOs. GLSL lacks true pointer types, so this extension exposes pointer-like behavior.

    #extension GL_EXT_buffer_reference : require
    
    • You can forward-declare types. Useful for linked lists and similar structures.

    layout(buffer_reference) buffer Position;
    
    • You can declare a buffer reference type. This is not an SSBO declaration, but effectively a pointer-to-struct.

    layout(std430, buffer_reference, buffer_reference_align = 8) writeonly buffer Position {
        vec2 positions[];
    };
    
    • buffer_reference  tags the type accordingly. buffer_reference_align  marks the minimum alignment for pointers of this type.

    • You can place the Position  type inside another buffer or another buffer reference type:

    layout(std430, buffer_reference, buffer_reference_align = 8) readonly buffer PositionReferences {
        Position buffers[];
    };
    
    • Now you have an array of pointers.

    • You can also place a buffer reference inside push constants, an SSBO, or a UBO.

    layout(std430, set = 0, binding = 0) readonly buffer Pointers {
        Position positions[];
    };
    
    layout(std430, push_constant) uniform Registers {
        PositionReferences references;
    } registers;
    
  • Casting pointers :

    • A key aspect of buffer device address is that we gain the capability to cast pointers freely.

    • While it is technically possible (and useful in some cases!) to "cast pointers" with SSBOs with clever use of aliased declarations like so:

    layout(set = 0, binding = 0) buffer SSBO { float v1[]; };
    layout(set = 0, binding = 0) buffer SSBO2 { vec4 v4[]; };
    
    • It gets kind of hairy quickly, and not as flexible when dealing with composite types.

    • When we have casts between integers and pointers, we get the full madness  that is pointer arithmetic. Nothing stops us from doing:

    #extension GL_EXT_buffer_reference : require
    layout(buffer_reference) buffer PointerToFloat { float v; };
    
    PointerToFloat pointer = load_pointer();
    uint64_t int_pointer = uint64_t(pointer);
    int_pointer += offset;
    pointer = PointerToFloat(int_pointer);
    pointer.v = 42.0;
    
    • Not all GPUs support 64-bit integers, so it is also possible to use uvec2  to represent pointers. This way, we can do raw pointer arithmetic in 32-bit, which might be more optimal anyways.

    #extension GL_EXT_buffer_reference_uvec2 : require
    layout(buffer_reference) buffer PointerToFloat { float v; };
    PointerToFloat pointer = load_pointer();
    uvec2 int_pointer = uvec2(pointer);
    uint carry;
    uint lo = uaddCarry(int_pointer.x, offset, carry);
    uint hi = int_pointer.y + carry;
    pointer = PointerToFloat(uvec2(lo, hi));
    pointer.v = 42.0;
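
    • The carry arithmetic in the uvec2 variant can be checked on the CPU. This standalone C sketch mirrors what uaddCarry does in the shader; it is not Vulkan API code.

```c
#include <assert.h>
#include <stdint.h>

/* CPU-side sketch of the GLSL uvec2 trick: a 64-bit device address split
 * into {lo, hi} 32-bit halves, advanced by an offset with an explicit
 * carry, mirroring uaddCarry in the shader. */
uint64_t advance_split_address(uint32_t lo, uint32_t hi, uint32_t offset)
{
    uint32_t new_lo = lo + offset;
    uint32_t carry  = (new_lo < lo) ? 1u : 0u; /* unsigned wrap detects carry */
    uint32_t new_hi = hi + carry;
    return ((uint64_t)new_hi << 32) | new_lo;
}
```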
    
  • Debugging :

    • When debugging or capturing an application that uses buffer device addresses, there are some special driver requirements that are not universally supported. Essentially, to be able to capture application buffers which contain raw pointers, we must ensure that the device address for a given buffer remains stable when the capture is replayed in a new process. Applications do not have to do anything here, since tools like RenderDoc will enable the bufferDeviceAddressCaptureReplay  feature for you and deal with all the magic associated with address capture behind the scenes. If the bufferDeviceAddressCaptureReplay  feature is not present, however, tools like RenderDoc will mask out the bufferDeviceAddress  feature, so beware.

  • Sample .


Memory Allocation

Info

  • Memory Management .

    • Talk by AMD.

    • Shows no code.

    • The video is useful.

    • Memory Heaps, Memory Types.

    • Memory Blocks.

    • Suballocations.

    • Dos and Don'ts.

    • VMA.

    • VmaDumpVis.py to visualize the json file dumped by VMA.

  • Memory Management .

    • Sounds more technical; I only saw parts of the talk.

    • Talk by AMD.

    • Shows code.

    • Memory Heaps, Memory Types.

    • Dos and Don'ts.

    • VMA.

  • There is an additional level of indirection: VkDeviceMemory  is allocated separately from creating VkBuffer / VkImage , and the two must be bound together.

  • The driver must be queried for supported memory heaps and memory types; different GPU vendors expose different configurations.

  • It is recommended to allocate bigger chunks of memory and assign parts of them to particular resources, as there is a limit on the maximum number of memory blocks that can be allocated.

  • When memory is over-committed on Windows, the OS memory manager may move allocations from video memory to system memory; it may also temporarily suspend a process from the GPU runlist in order to page out its allocations and make room for another process’s allocations. On Linux there is no OS memory manager that mitigates over-commitment by automatically performing paging operations on memory objects.

  • Use EXT_pageable_device_local_memory  to avoid demotion of critical resources by assigning memory priority. It’s also a good idea to set low priority to non-critical resources such as vertex and index buffers; the app can verify the performance impact by placing the resources in system memory. 

  • Use EXT_pageable_device_local_memory  to also disable automatic promotion of allocations from system memory to video memory.

  • Use dedicated memory allocations ( KHR_dedicated_allocation , core in VK 1.1) when appropriate.

  • Using dedicated memory may improve performance for color and depth attachments, especially on pre-Turing GPUs.

  • Use KHR_get_memory_requirements2  (core in VK 1.1) to check whether an image/buffer requires dedicated allocation.

  • Use host-visible video memory to write data directly to video memory from the CPU. Such a heap can be detected using DEVICE_LOCAL | HOST_VISIBLE . Take into account that CPU writes to such memory may be slower than to normal system memory, and CPU reads are significantly slower. Check BAR1 traffic using Nsight Systems for possible issues.

  • Explicitly look for the MEMORY_PROPERTY_DEVICE_LOCAL  when picking a memory type for resources, which should be stored in video memory.

  • Don’t assume fixed heap configuration, always query and use the memory properties using vkGetPhysicalDeviceMemoryProperties() .

  • Don’t assume memory requirements of an image/buffer, use vkGet*MemoryRequirements()

  • Don’t put every resource into a Dedicated Allocation.

  • For memory objects that are intended to be device-local, do not just pick the first memory type; pick one that is actually device-local.

  • The benefit is that we avoid CPU memory costs for lots of tiny buffers, as well as cache misses by using just the same buffer object and varying the offset.

  • This optimization applies to all buffers, but in the previous blog post on shader resource binding it was mentioned that the offsets are particularly good for uniform buffers.

  • Software developers use custom memory management for various reasons:

    • Making allocations often involves the operating system which is rather costly.

    • It is usually faster to re-use existing allocations rather than to free and reallocate new ones.

    • Objects that live in a continuous chunk of memory can enjoy better cache utilization.

    • Data that is aligned well for the hardware can be processed faster.

  • Memory is a precious resource, and it can involve several indirect costs imposed by the operating system. For example, some operating systems have a cost linear in the number of allocations for each submission to a Vulkan queue. Another scenario is that the operating system handles the paging state of allocations depending on other processes; we therefore encourage not using too many allocations and organizing them “wisely”.

  • Device Memory: This memory is used for buffers and images and the developer is responsible for their content.

  • Resource Pools: Objects such as CommandBuffers and DescriptorSets are allocated from pools; the actual content is indirectly written by the driver.

  • Custom Host Allocators: Depending on your control-freak level you may also want to provide your own host allocator that the driver can use for the api objects.

  • Heap: Depending on the hardware and platform, the device will expose a fixed number of heaps, from which you can allocate a certain amount of memory in total. Discrete GPUs with dedicated memory differ from mobile or integrated solutions that share memory with the CPU. Heaps support different memory types, which must be queried from the device.

  • Memory type: When creating a resource such as a buffer, Vulkan will provide information about which memory types are compatible with the resource. Depending on additional usage flags, the developer must pick the right type, and based on the type, the appropriate heap.

  • Memory property flags: These flags encode caching behavior and whether we can map the memory to the host (CPU), or if the GPU has fast access to the memory.

  • Memory: This object represents an allocation from a certain heap with a user-defined size.

  • Resource (Buffer/Image): After querying for the memory requirements and picking a compatible allocation, the memory is associated with the resource at a certain offset. This offset must fulfill the provided alignment requirements. After this we can start using our resource for actual work.

  • Sub-Resource (Offsets/View): It is not required to use a resource only in its full extent, just like in OpenGL we can bind ranges (e.g. varying the starting offset of a vertex-buffer) or make use of views (e.g. individual slice and mipmap of a texture array).

  • The fact that we can manually bind resources to actual memory addresses gives rise to the following points:

    • Resources may alias (share) the same region of memory.

    • Alignment requirements for offsets into an allocation must be manually managed.

  • Store multiple buffers, like the vertex and index buffer, into a single VkBuffer  and use offsets in commands like vkCmdBindVertexBuffers .

  • The advantage is that your data is more cache friendly in that case, because it’s closer together. It is even possible to reuse the same chunk of memory for multiple resources if they are not used during the same render operations, provided that their data is refreshed, of course.

  • This is known as aliasing and some Vulkan functions have explicit flags to specify that you want to do this.

  • Uniform Buffer Binding: As part of a DescriptorSet this would be the equivalent of an arbitrary glBindBufferRange(GL_UNIFORM_BUFFER, dset.binding, dset.bufferOffset, dset.bufferSize) in OpenGL. All information for the actual binding by the CommandBuffer is stored within the DescriptorSet itself.

  • Uniform Buffer Dynamic Binding: Similar to the above, but with the ability to provide the bufferOffset later when recording the CommandBuffer, a bit like this pseudo code: CommandBuffer->BindDescriptorSet(setNumber, descriptorSet, &offset). It is very practical to use when sub-allocating uniform buffers from a larger buffer allocation.

  • Push Constants: PushConstants are uniform values that are stored within the CommandBuffer and can be accessed from the shaders similar to a single global uniform buffer. They provide enough bytes to hold some matrices or index values, and the interpretation of the raw data is up to the shader. You may recall glProgramEnvParameter from OpenGL providing something similar. The values are recorded with the CommandBuffer and cannot be altered afterwards: CommandBuffer->PushConstant(offset, size, &data)

  • Dynamic offsets are very fast for NVIDIA hardware. Re-using the same DescriptorSet with just different offsets is rather CPU-cache friendly as well compared to using and managing many DescriptorSets. NVIDIA’s OpenGL driver actually also optimizes uniform buffer binds where just the range changes for a binding unit.
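
  • The memory-type bullets above (query with vkGetPhysicalDeviceMemoryProperties() , then pick a type that is both allowed by memoryTypeBits  and has all the required property flags) can be sketched as plain C. The flag values and arrays here are stand-ins so the logic runs without vulkan.h; real code reads them from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of memory-type selection: scan the types permitted by
 * VkMemoryRequirements::memoryTypeBits and return the first one whose
 * property flags contain everything we asked for. In real code
 * type_property_flags comes from VkPhysicalDeviceMemoryProperties. */
#define NO_SUITABLE_TYPE 0xFFFFFFFFu

uint32_t find_memory_type(uint32_t memory_type_bits,
                          const uint32_t *type_property_flags,
                          uint32_t type_count,
                          uint32_t required_flags)
{
    for (uint32_t i = 0; i < type_count; ++i) {
        int allowed = (memory_type_bits & (1u << i)) != 0;
        int has_all = (type_property_flags[i] & required_flags) == required_flags;
        if (allowed && has_all)
            return i;
    }
    return NO_SUITABLE_TYPE; /* caller must fall back or fail */
}
```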

Sub-allocation
  • In a real world application, you’re not supposed to actually call vkAllocateMemory  for every individual buffer.

  • The maximum number of simultaneous memory allocations is limited by the maxMemoryAllocationCount  physical device limit, which may be as low as 4096  even on high end hardware like an NVIDIA GTX 1080.

  • The right way to allocate memory for a large number of objects at the same time is to create a custom allocator that splits up a single allocation among many different objects by using the offset  parameters that we’ve seen in many functions.

  • You can either implement such an allocator yourself, or use the VMA  library provided by the GPUOpen initiative.

  • Sub-allocation is a first-class approach when working in Vulkan.

  • Memory is allocated in pages with a fixed size; sub-allocation reduces the number of OS-level allocations.

  • You should use memory sub-allocation.

  • Memory allocation and deallocation at OS/driver level is expensive.

  • vkAllocateMemory()  is costly on the CPU.

  • Cost can be reduced by suballocating from a large memory object.

  • Also note the maxMemoryAllocationCount  limit which constrains the number of simultaneous allocations an application can have.

  • A Vulkan app should aim to create large allocations and then manage them itself.
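
  • A minimal sub-allocator along these lines only needs the offset math: one big allocation, a cursor, and power-of-two alignment rounding. This is a sketch, not VMA; real code also handles freeing, multiple blocks, and bufferImageGranularity.

```c
#include <assert.h>
#include <stdint.h>

/* Bump sub-allocator over one large VkDeviceMemory block: one OS-level
 * allocation, many aligned offsets. Real code binds resources with
 * vkBindBufferMemory at the returned offset. */
typedef struct {
    uint64_t size;   /* size of the one big allocation */
    uint64_t cursor; /* next free offset inside it     */
} SubAllocator;

#define SUBALLOC_FAILED 0xFFFFFFFFFFFFFFFFull

/* alignment must be a power of two (Vulkan alignments are). */
uint64_t suballoc(SubAllocator *a, uint64_t size, uint64_t alignment)
{
    uint64_t offset = (a->cursor + alignment - 1) & ~(alignment - 1);
    if (offset + size > a->size)
        return SUBALLOC_FAILED; /* would need another big block */
    a->cursor = offset + size;
    return offset;
}
```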

Arenas

Discussion around the availability of arenas in Vulkan
  • (2025-12-07)

  • Caio:

    • hello, is it possible to create a memory arena, placing all new objects in this region, and then freeing all this region without having to call the vkDestroyX functions? I'm having the impression that Vulkan memory management is rooted in RAII, which I don't like. All my games are managed through arenas, which I think is perfect, but for Vulkan I'm having to track each individual allocation and free each one at a time. I'm already treating memory as a big arena, but I'm having the overhead of calling the destruction of each resource separately.

  • CharlesG:

    • You don’t own the memory that backs vulkan objects. For command buffers and descriptors there are pools so the driver can do a good job with the backing memory scheme.

    • For VkDeviceMemory, you decide how to sub allocate them

  • Caio:

    • do I need to call destroy for objects like vkPipeline, VkPipelineLayout, VkDescriptorSetLayout, VkShaderModule, VkRenderPass, etc? I have lots of objects that should die exactly at the same time, but I'm having to free them one by one. I heard about suballocations for buffers and images, but what about these types of objects I mentioned?

  • VkIpotrick:

    • they require actual cleanup, they are not just some memory

    • they might be referenced within other internal structures of the driver and have to be removed from those for example

  • CharlesG:

    • anything that you vkCreate must be vkDestroyed; Except command buffers and descriptors where it is sufficient to just destroy the pools.

    • Using Vulkan is a lot like networking with a remote server, lots of driver internals have implementation requirements that make arenas not the “obvious choice” (otherwise we’d see more of them)

  • Caio:

    • Is there a future in Vulkan where the decision of how to free the memory is not bound to the driver, but for the programmer? You mentioned how this is limited by what the driver allows, but could this change in the future and move towards being more low-level?

  • VkIpotrick:

    • no. i don't think that is feasible.

    • that would handcuff drivers so badly that you would be too low level. At that point a proper spec could be impossibly hard to create and maintain between vendors

    • vulkan drivers still have to do a loooot of things internally. its still highish level api

  • CharlesG:

    • I concur.

    • I want to reiterate that drivers deal with much more than host memory allocations, but device memory, external memory (to the process), OS api’s, display hardware, shader compilers. Some objects don’t actually DO anything on deletion (sampler come to mind because the handle stores the entire state for some implementations, when the private data ext isnt active)

    • Drivers get to ask the os on your behalf to map device memory into the host address space. And deal with you forgetting to unmap it during shutdown (though the OS is more likely to also clean up after user mode drivers…)

    • I mention that some objects are “free” to leak cause they didn’t allocate anything internally because that is an implementation detail that isnt possible on all hardware, so the API cant guarantee “free” sampler cleanup without screwing over some hardware. And it just ties their hands when it is no longer possible to put all the state into the handle any more in the future with extensions to the API

  • Caio:

    • well, I imagine this was the case, but still, I was hopeful there was some alternative for bulk deletion. Currently I just wrapped around the concept of shared lifetimes and created a pseudo-arena, which internally frees all the memory for me by calling each respective destructor. Still, it annoys me a bit knowing the design could be faster if I could bulk delete the content instead of being bound by what the driver exposes

    • I understand why it's not possible due to the current design by drivers, but I wish it were

    • my concern now is not the performance per se, but more about the freedom of having the option of managing memory in a way that could logically be faster (logically, as freeing a memory region is quite obviously faster than having to manage the state of different objects before deleting each of them individually). I'm not currently bound by the deletion times of those calls. I'm speaking more from a philosophical standpoint.

  • CharlesG:

    • Inb4 going all in on bindless and gpu driven where there just arent as many vulkan objects to manage

    • Fences and semaphores come to mind as prime examples of not just memory

  • Caio:

    • I'm trying to move it that way after trying bindful for a while, it's being much nicer and aligns with the vision I have of how memory is better managed;

  • CharlesG:

    • Suggestions for the API can be made in the vulkan-pain-points channel (although itd be good to link to this convo) and an issue can be made in the Vulkan-Docs github repo as thats the home of the specification. That said, this ask is not easily actionable so hard to quantify what “success” means.

    • All good, and going towards bindless is definitely going to suit your tastes better!

  • VkIpotrick:

    • bindless is simply better at this point

    • descriptor sets, layouts, pools etc made sense for old hardware, but now they are just very clunky oddly behaving abstractions

    • also with bindless you can have one static allocation for all descriptors

    • the ultimate memory management is static lifetime after all.

Alternatives and half-solutions
  • On a conformant Vulkan implementation, you cannot safely get the behavior you want: allocate many Vulkan resources, then free one big memory region while leaving the Vulkan object handles alive and never destroying them. Freeing VkDeviceMemory that backs resources while those resources are still live or still in use is undefined behavior (and a validation error) unless you guarantee the resources are never used again and the driver allows that. The Vulkan spec requires you to manage object lifetimes; drivers may have internal bookkeeping tied to those object handles that is not cleaned up just by freeing the raw memory.

  • That said, you can achieve the practical “free everything by freeing a small number of objects/regions” without peppering vkDestroy* calls everywhere by changing how you structure resources. Options that actually give you region-like semantics:

  • Mega-backings (buffers)

    • Avoid creating one Vulkan resource handle per logical allocation. In practice that means: create a small number of real Vulkan resources (big backing buffers / big images or sparse resources), suballocate from them, and operate using offsets/array-layer indices. When the region should die, you destroy the backing objects (a few destroys) and free their VkDeviceMemory. No per-suballocation vkDestroy* calls are necessary because there are no per-suballocation Vulkan handles to destroy.

    • Create a small set of large backing VkDeviceMemory + VkBuffer objects (one per memory type/usage class you need).

    • Suballocate ranges from those big buffers and use offsets everywhere:

    • For vertex/index bindings: vkCmdBindVertexBuffers(..., firstBinding, 1, &bigBuffer, &offset).

    • For descriptors: VkDescriptorBufferInfo{ bigBuffer, offset, range } — descriptors can point at a buffer + offset without creating new VkBuffer handles.

    • When you’re done, you only need to vkDestroyBuffer / vkFreeMemory for a few big buffers, not for every tiny allocation.

    • Constraints: alignment, memoryRequirements and usage flags must be compatible for all suballocations placed in a given big buffer. If two allocations need different usage flags or memory types, they must go into different backing buffers.

  • Texture atlases / arrays (images)

    • Replace many small VkImage objects with a single large image (or texture array/array layers / atlas) and pack multiple textures into it. Use UV/array-layer indices in shader, or use VkImageView / descriptor indexing accordingly.

    • You then destroy and free one big image rather than many small ones. Tradeoffs: packing, mipmapping, filtering artifacts, and sampler/view creation.

Host Memory

Allocator ( VkAllocationCallbacks )
  • VkAllocationCallbacks  only control host (CPU) allocations the loader/driver makes for Vulkan bookkeeping and temporary objects.

  • They do not give you a direct view or control of device (GPU) memory payloads.

  • Passing a non-NULL pAllocator  to a vkCreateX  function causes the driver to call your callbacks for those host allocations. They do not switch the driver from using device heaps to host malloc; they only replace the host allocator functions used by the implementation. The allocation scope rules determine whether the allocation is command-scoped or object-scoped.

  • Passing a custom VkAllocationCallbacks  to vkCreateBuffer  lets you intercept and control the host memory the driver uses to represent the buffer object — but it does not tell you how many bytes of GPU heap were (or will be) consumed by the buffer’s storage. For the latter you must intercept device allocations (see below).

  • To track real GPU memory you must track vkAllocateMemory / vkFreeMemory  (and any driver-internal device allocations) and/or use VK_EXT_device_memory_report  / VK_EXT_memory_budget  to observe what the driver actually commits.
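
  • As a sketch of what host-allocation interception looks like, here is a counting allocator in the shape Vulkan expects for pfnAllocation / pfnFree . The VkSystemAllocationScope  enum is replaced by a plain int so the example compiles without vulkan.h; in real code these functions go into a VkAllocationCallbacks  struct passed as pAllocator , and they count host bookkeeping only, never GPU memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Host-side allocation statistics, passed as pUserData. */
typedef struct {
    size_t live_allocations;
    size_t bytes_requested;
} HostAllocStats;

/* Shape of VkAllocationCallbacks::pfnAllocation (scope enum as int here). */
void *counting_alloc(void *user_data, size_t size, size_t alignment,
                     int allocation_scope /* VkSystemAllocationScope */)
{
    (void)allocation_scope;
    HostAllocStats *stats = (HostAllocStats *)user_data;
    /* C11 aligned_alloc requires size to be a multiple of alignment. */
    size_t rounded = (size + alignment - 1) / alignment * alignment;
    void *p = aligned_alloc(alignment, rounded);
    if (p) {
        stats->live_allocations += 1;
        stats->bytes_requested  += size;
    }
    return p;
}

/* Shape of VkAllocationCallbacks::pfnFree. */
void counting_free(void *user_data, void *memory)
{
    HostAllocStats *stats = (HostAllocStats *)user_data;
    if (memory) {
        stats->live_allocations -= 1;
        free(memory);
    }
}
```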

  • Examples :

    • vkCreateBuffer(...) :

      • This call creates a buffer object handle and the driver's host-side bookkeeping for that object (descriptor, small metadata).

      • Those host allocations are the things pAllocator  on vkCreateBuffer  controls.

      • The call does not  allocate GPU payload memory for the buffer contents.

      • The buffer becomes usable on the device only after you allocate VkDeviceMemory  and bind  it (or the driver performs some implicit allocation in non-standard implementations).

      • The implementation goes as:

        • vk.CreateBuffer

          • Creates the buffer handle and host-side (CPU) bookkeeping; no device memory yet.

          vk_check(vk.CreateBuffer(_device.handle, &buffer_create_info, &arena.gpu_alloc, &buffer_handle))
          
          
        • vk.GetBufferMemoryRequirements

          • Prepare allocation_info for VkDeviceMemory. Choose a memoryTypeIndex with the desired properties.

          • allocationSize and memoryTypeIndex determine whether the allocation will be device-local, host-visible, coherent, etc.

          • These properties decide whether the memory is mappable from the CPU.

          • This call doesn't allocate anything.

          mem_requirements: vk.MemoryRequirements
          vk.GetBufferMemoryRequirements(_device.handle, buffer_handle, &mem_requirements)
          mem_allocation_info := vk.MemoryAllocateInfo{
              sType           = .MEMORY_ALLOCATE_INFO,
              allocationSize  = mem_requirements.size,
              memoryTypeIndex = device_find_memory_type(mem_requirements.memoryTypeBits, properties),
          }
          
        • vk.AllocateMemory

          • This is the call that requests a VkDeviceMemory  allocation from a particular memory type/heap.

          • Memory type is HOST_VISIBLE :

            • The driver will allocate from the heap that provides host mappings (which is typically system RAM or a host-visible region).

            • Effect: device payload is created — the VkDeviceMemory  object represents committed device memory (counts against the heap’s budget).

            • On discrete GPUs this is often a segment of system memory that is mapped by the driver, or on integrated GPUs it may be the same physical RAM but treated as both host- and device-accessible.

            • The pAllocator  you pass to vkAllocateMemory only affects host-side allocations the driver does while processing the call; it does not change whether the allocation consumes device heap bytes.

          • Memory type is DEVICE_LOCAL :

            • Driver allocates a VkDeviceMemory from the device-local heap (on discrete GPUs this is the GPU VRAM heap). That is the device payload and consumes heap budget. The allocation is not host-visible, so you cannot vkMapMemory this memory.

            • Note: on integrated GPUs device-local may still be mappable because physical memory is shared — but that depends entirely on memory type flags exposed by the driver.

          • Memory type is HOST VISIBLE + DEVICE_LOCAL :

            • The allocation is created in a heap that the driver marks both device-local and host-visible. Physically this can mean: shared system RAM (integrated GPU) or a special heap the driver exposes that is accessible by both CPU and GPU. The VkDeviceMemory is committed and counts against that heap’s budget.

            • You may be able to vkMapMemory  this memory because it is host-visible. Performance characteristics vary: host-visible+device-local memory can be slower to CPU-access than pure host memory or slower to GPU-access than pure device-local VRAM.

            • On PC discrete GPUs this commonly corresponds to the GPU memory that is accessible through the PCIe BAR (Resizable BAR / ReBAR) or a special small window the driver exposes. Allocation behavior: vkAllocateMemory allocates from that BAR-exposed heap (it consumes VRAM or a BAR-mapped window of VRAM).

          vk_check(vk.AllocateMemory(_device.handle, &mem_allocation_info, nil, &buffer_memory))
          
        • vk.BindBufferMemory

          • Binds the buffer to the memory object. Doesn't allocate anything.

          • Binds the previously allocated device memory to the buffer object. Binding itself normally does not allocate additional device heap bytes; it just associates that payload region with the buffer handle.

          • After bind the buffer is usable for CPU mapping (if host-visible) and/or device operations.

          vk_check(vk.BindBufferMemory(_device.handle, buffer_handle, buffer_memory, 0))
          
    • vkCreateGraphicsPipelines(...)

      • Pipeline creation can be expensive and opaque.

      • During pipeline creation the driver may:

        • allocate host-side structures for the pipeline object (controlled by pAllocator  passed to vkCreateGraphicsPipelines ),

        • compile/optimize shaders, build internal representations,

        • and may allocate internal device resources (driver-controlled device memory, shader/kernel upload, caches) that are not the same as application VkDeviceMemory  allocations. The spec explicitly allows drivers to perform internal device allocations for things like pipelines; those allocations are not controlled by VkAllocationCallbacks . If you need to see them, use VK_EXT_device_memory_report .

Allocation, Reallocation, Free, Internal Alloc, Internal Free
  • pfnAllocation  or pfnReallocation  may be called in the following situations:

    • Allocations scoped to a VkDevice  or VkInstance  may be allocated from any API command.

    • Allocations scoped to a command may be allocated from any API command.

    • Allocations scoped to a VkPipelineCache  may only be allocated from:

      • vkCreatePipelineCache

      • vkMergePipelineCaches  for dstCache

      • vkCreateGraphicsPipelines  for pipelineCache

      • vkCreateComputePipelines  for pipelineCache

    • Allocations scoped to a VkValidationCacheEXT  may only be allocated from:

      • vkCreateValidationCacheEXT

      • vkMergeValidationCachesEXT  for dstCache

      • vkCreateShaderModule  for validationCache in VkShaderModuleValidationCacheCreateInfoEXT

    • Allocations scoped to a VkDescriptorPool  may only be allocated from:

      • any command that takes the pool as a direct argument

      • vkAllocateDescriptorSets  for the descriptorPool  member of its pAllocateInfo  parameter

      • vkCreateDescriptorPool

    • Allocations scoped to a VkCommandPool  may only be allocated from:

      • any command that takes the pool as a direct argument

      • vkCreateCommandPool

      • vkAllocateCommandBuffers  for the commandPool  member of its pAllocateInfo  parameter

      • any vkCmd*  command whose commandBuffer  was allocated from that VkCommandPool

    • Allocations scoped to any other object may only be allocated in that object’s vkCreate*  command.

  • pfnFree , or pfnReallocation  with zero size, may be called in the following situations:

    • Allocations scoped to a VkDevice  or VkInstance may be freed from any API command.

    • Allocations scoped to a command must be freed by any API command which allocates such memory.

    • Allocations scoped to a VkPipelineCache  may be freed from vkDestroyPipelineCache .

    • Allocations scoped to a VkValidationCacheEXT  may be freed from vkDestroyValidationCacheEXT .

    • Allocations scoped to a VkDescriptorPool  may be freed from

      • any command that takes the pool as a direct argument

    • Allocations scoped to a VkCommandPool  may be freed from:

      • any command that takes the pool as a direct argument

      • vkResetCommandBuffer  whose commandBuffer  was allocated from that VkCommandPool

    • Allocations scoped to any other object may be freed in that object’s vkDestroy*  command.

    • Any command that allocates host memory may also free host memory of the same scope.

  • pfnAllocation

    • If pfnAllocation  is unable to allocate the requested memory, it must return NULL.

    • If the allocation was successful, it must return a valid pointer to memory allocation containing at least size  bytes, and with the pointer value being a multiple of alignment .

  • pfnReallocation

    • If the reallocation was successful, pfnReallocation  must return an allocation with enough space for size bytes, and the contents of the original allocation from bytes zero to min(original size, new size) - 1 must be preserved in the returned allocation.

    • If size is larger than the old size, the contents of the additional space are undefined .

    • If satisfying these requirements involves creating a new allocation, then the old allocation should be freed.

    • If pOriginal  is NULL, then pfnReallocation  must behave equivalently to a call to PFN_vkAllocationFunction  with the same parameter values (without pOriginal ).

    • If size  is zero, then pfnReallocation  must behave equivalently to a call to PFN_vkFreeFunction  with the same pUserData  parameter value, and pMemory  equal to pOriginal .

    • If pOriginal  is non-NULL, the implementation must ensure that alignment  is equal to the alignment  used to originally allocate pOriginal .

    • If this function fails and pOriginal  is non-NULL the application must not free the old allocation.
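The pfnReallocation rules above can be sketched as a plain-C helper. This is an illustrative stand-in, not a Vulkan callback: the name vk_style_realloc is hypothetical, and old_size is passed in by the caller because a real allocator would track it itself (e.g. in a header or a side table).

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Hypothetical helper mirroring the pfnReallocation rules:
   - pOriginal == NULL  -> behave like pfnAllocation
   - size == 0          -> behave like pfnFree, return NULL
   - otherwise allocate a new aligned block, preserve min(old, new)
     bytes, free the old block; on failure keep the old allocation. */
static void *vk_style_realloc(void *pOriginal, size_t old_size,
                              size_t size, size_t alignment) {
    /* aligned_alloc requires size to be a multiple of alignment */
    size_t padded = (size + alignment - 1) / alignment * alignment;
    if (pOriginal == NULL)
        return aligned_alloc(alignment, padded);
    if (size == 0) {
        free(pOriginal);
        return NULL;
    }
    void *p = aligned_alloc(alignment, padded);
    if (p == NULL)            /* failure: the old allocation stays valid */
        return NULL;
    memcpy(p, pOriginal, old_size < size ? old_size : size);
    free(pOriginal);
    return p;
}
```

The key subtlety is the failure path: when the new allocation cannot be made, the old pointer must be left untouched so the application does not free it twice.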

  • pfnFree

    • pMemory  may be NULL , which the callback must handle safely.

    • If pMemory  is non-NULL, it must be a pointer previously allocated by pfnAllocation  or pfnReallocation .

    • The application should free this memory.

  • pfnInternalAllocation

    • Upon allocation of executable memory, pfnInternalAllocation  will be called.

    • This is a purely informational callback.

  • pfnInternalFree

    • Upon freeing executable memory, pfnInternalFree  will be called.

    • This is a purely informational callback.

  • If either of pfnInternalAllocation  or pfnInternalFree  is not NULL, both must be valid callbacks

Creating the allocator
  • VkAllocationCallbacks  are for host-side allocations the Vulkan loader/driver makes (CPU memory for driver bookkeeping, loader structures, etc.) — not for GPU-side VkDeviceMemory .

  • Using malloc / free :

    • Is common and acceptable for many apps — but you must meet Vulkan’s callback semantics (alignment, reallocation behavior, thread-safety) and consider performance.

    • This is a normal, valid approach. It satisfies most apps and is what many people do in practice.

    • Discussion .

    • Caveats :

      • Alignment:

        • Vulkan allocators must return memory suitably aligned for any type the driver might need. Use posix_memalign/aligned_alloc on POSIX, _aligned_malloc on Windows, or otherwise ensure alignment. The Vulkan spec expects allocation functions to behave like platform allocators.

      • Reallocation semantics:

        • pfnReallocation  must implement C-like realloc semantics (grow/shrink, preserve contents if requested). If your platform realloc does not support required alignment, implement reallocation by allocating new aligned memory, copying the old contents, freeing the old pointer.

      • Thread-safety & performance:

        • Drivers can call the callbacks from multiple threads. The system malloc is usually thread-safe but can have global locks and contention. For high-frequency allocation patterns, a custom pool or thread-local allocator can reduce contention and improve predictable performance.

      • Internal allocation tracking:

        • VkAllocationCallbacks  provide pUserData  so you can route allocations to a custom pool/context for tracking or to implement more efficient pooling per object type.

  • The GPU VkDeviceMemory allocations (the ones created with vkAllocateMemory) are a separate resource and must be managed with Vulkan APIs and counted against the appropriate memory heap

  • If you use malloc for VkAllocationCallbacks , you are only providing host-allocator behavior for driver/loader-side allocations.

Scope
  • Each allocation has an allocation scope defining its lifetime and which object it is associated with. Possible values passed to the allocationScope parameter of the callback functions specified by VkAllocationCallbacks , indicating the allocation scope, are:

  • COMMAND

    • Specifies that the allocation is scoped to the duration of the Vulkan command.

    • The most specific allocator available is used ( OBJECT , else DEVICE , else INSTANCE ).

  • OBJECT

    • Specifies that the allocation is scoped to the lifetime of the Vulkan object that is being created or used.

    • The most specific allocator available is used ( OBJECT , else DEVICE , else INSTANCE ).

  • CACHE

    • Specifies that the allocation is scoped to the lifetime of a VkPipelineCache  or VkValidationCacheEXT  object.

    • If an allocation is associated with a VkValidationCacheEXT  or VkPipelineCache  object, the allocator will use the CACHE  allocation scope.

    • The most specific allocator available is used ( CACHE , else DEVICE , else INSTANCE ).

  • DEVICE

    • Specifies that the allocation is scoped to the lifetime of the Vulkan device.

    • If an allocation is scoped to the lifetime of a device, the allocator will use an allocation scope of DEVICE .

    • The most specific allocator available is used ( DEVICE , else INSTANCE ).

  • INSTANCE

    • Specifies that the allocation is scoped to the lifetime of the Vulkan instance.

    • If the allocation is scoped to the lifetime of an instance and the instance has an allocator, its allocator will be used with an allocation scope of INSTANCE .

    • Otherwise an implementation will allocate memory through an alternative mechanism that is unspecified.

  • Most Vulkan commands operate on a single object, or there is a sole object that is being created or manipulated. When an allocation uses an allocation scope of OBJECT  or CACHE , the allocation is scoped to the object being created or manipulated.

  • When an implementation requires host memory, it will make callbacks to the application using the most specific allocator and allocation scope available:

  • Pools :

    • Objects that are allocated from pools do not specify their own allocator. When an implementation requires host memory for such an object, that memory is sourced from the object’s parent pool’s allocator.

Device Memory

  • Device memory is memory that is visible to the device — for example the contents of the image or buffer objects, which can be natively used by the device.

  • A Vulkan device operates on data in device memory via memory objects that are represented in the API by a VkDeviceMemory  handle.

  • VkDeviceMemory .

    • Opaque handle to a device memory object.

Properties
  • Memory properties of a physical device describe the memory heaps and memory types available.

  • To query memory properties, call vkGetPhysicalDeviceMemoryProperties .

  • VkPhysicalDeviceMemoryProperties

    • Describes a number of memory heaps as well as a number of memory types that can be used to access memory allocated in those heaps.

    • Each heap describes a memory resource of a particular size, and each memory type describes a set of memory properties (e.g. host cached vs. uncached) that can be used with a given memory heap. Allocations using a particular memory type will consume resources from the heap indicated by that memory type’s heap index. More than one memory type may share each heap, and the heaps and memory types provide a mechanism to advertise an accurate size of the physical memory resources while allowing the memory to be used with a variety of different properties.

    • At least one heap must include MEMORY_HEAP_DEVICE_LOCAL  in VkMemoryHeap.flags

    • memoryTypeCount  is the number of valid elements in the memoryTypes  array.

    • memoryTypes  is an array of MAX_MEMORY_TYPES   VkMemoryType  structures describing the memory types that can be used to access memory allocated from the heaps specified by memoryHeaps.

    • memoryHeapCount  is the number of valid elements in the memoryHeaps  array.

    • memoryHeaps  is an array of MAX_MEMORY_HEAPS   VkMemoryHeap  structures describing the memory heaps from which memory can be allocated.

Device Memory Allocation
  • Memory requirements :

    • vkGetBufferMemoryRequirements

      • Returns the memory requirements for the specified Vulkan object

      • device

        • Is the logical device that owns the buffer.

      • buffer

        • Is the buffer to query.

      • pMemoryRequirements

        • Is a pointer to a VkMemoryRequirements  structure in which the memory requirements of the buffer object are returned.

    • VkMemoryRequirements

      • size

        • Is the size, in bytes, of the memory allocation required for the resource.

        • The size of the required memory in bytes may differ from bufferInfo.size .

      • alignment

        • The alignment, in bytes, required for the offset at which the buffer begins within the allocated region of memory; it depends on bufferInfo.usage  and bufferInfo.flags .

      • memoryTypeBits

        • Bit field of the memory types that are suitable for the buffer.

        • Bit i  is set if and only if the memory type i  in the VkPhysicalDeviceMemoryProperties  structure for the physical device is supported for the resource.

    • vkGetPhysicalDeviceMemoryProperties

      • Reports memory information for the specified physical device

      • We'll use it to find a memory type that is suitable for the buffer itself.

      • vkGetPhysicalDeviceMemoryProperties2  behaves similarly to vkGetPhysicalDeviceMemoryProperties , with the ability to return extended information in a pNext  chain of output structures.

      • memoryHeaps

        • Are distinct memory resources like dedicated VRAM and swap space in RAM for when VRAM runs out.

        • The different types of memory exist within these heaps.

        • Right now we’ll only concern ourselves with the type of memory and not the heap it comes from, but you can imagine that this can affect performance.

      • memoryTypes

        • Consists of VkMemoryType  structs that specify the heap and properties of each memory type.

        • The properties define special features of the memory, like being able to map it so we can write to it from the CPU.

        • VkMemoryType

          • Structure specifying memory type

          • heapIndex

          • propertyFlags

      • typeFilter

        • Specifies the bit field of memory types that are suitable.

        • That means that we can find the index of a suitable memory type by simply iterating over them and checking if the corresponding bit is set to 1 .

        • However, we’re not just interested in a memory type that is suitable for the vertex buffer.

        • We also need to be able to write our vertex data to that memory.

      • We may have more than one desirable property, so we should check if the result of the bitwise AND is not just non-zero, but equal to the desired properties bit field. If there is a memory type suitable for the buffer that also has all the properties we need, then we return its index, otherwise we throw an exception.
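The lookup described above can be sketched in C. The structs here are minimal mocks of the relevant VkPhysicalDeviceMemoryProperties fields (the names mirror Vulkan, but the types are stand-ins so the logic can run without a driver):

```c
#include <stdint.h>
#include <stdbool.h>

/* Mocked subset of VkMemoryType / VkPhysicalDeviceMemoryProperties */
typedef struct { uint32_t propertyFlags; uint32_t heapIndex; } MemoryType;
typedef struct { uint32_t memoryTypeCount; MemoryType memoryTypes[32]; } MemoryProperties;

/* Return the first memory type whose bit is set in type_filter AND whose
   propertyFlags contain *all* wanted flags (bitwise AND equals wanted,
   not merely non-zero); -1 if no type qualifies. */
static int find_memory_type(const MemoryProperties *props,
                            uint32_t type_filter, uint32_t wanted) {
    for (uint32_t i = 0; i < props->memoryTypeCount; i++) {
        bool suitable  = (type_filter & (1u << i)) != 0;
        bool has_props = (props->memoryTypes[i].propertyFlags & wanted) == wanted;
        if (suitable && has_props)
            return (int)i;
    }
    return -1;
}
```

Note the `== wanted` comparison: testing only for a non-zero AND would accept a type that has *some* of the requested properties but not all of them.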

  • Allocation :

    • VkMemoryAllocateInfo

      • allocationSize

        • Is the size of the allocation in bytes.

      • memoryTypeIndex

        • Is an index identifying a memory type from the memoryTypes  array of the VkPhysicalDeviceMemoryProperties  structure (queried with vkGetPhysicalDeviceMemoryProperties ), as defined in the 'memory requirements'.

    • vkAllocateMemory .

      • To allocate memory objects.

      • device

        • Is the logical device that owns the memory.

      • pAllocateInfo

        • Is a pointer to a VkMemoryAllocateInfo  structure describing parameters of the allocation. A successfully returned allocation must use the requested parameters — no substitution is permitted by the implementation.

      • pAllocator

        • Controls host  memory allocation.

      • pMemory

        • Is a pointer to a VkDeviceMemory  handle in which information about the allocated memory is returned.

    • Allocations returned by vkAllocateMemory  are guaranteed to meet any alignment requirement of the implementation. For example, if an implementation requires 128 byte alignment for images and 64 byte alignment for buffers, the device memory returned through this mechanism would be 128-byte aligned. This ensures that applications can correctly suballocate objects of different types (with potentially different alignment requirements) in the same memory object.
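Suballocation of the kind described above boils down to rounding each resource's offset up to its required alignment. A minimal sketch (the helper names are mine, not Vulkan's):

```c
#include <stdint.h>

/* Round an offset up to the next multiple of alignment. */
static uint64_t align_up(uint64_t offset, uint64_t alignment) {
    return (offset + alignment - 1) / alignment * alignment;
}

/* Place a second resource after a first one inside a single
   VkDeviceMemory allocation: it starts at the first offset past the
   first resource that satisfies the second resource's alignment. */
static uint64_t suballoc_second_offset(uint64_t first_size,
                                       uint64_t second_alignment) {
    return align_up(first_size, second_alignment);
}
```

For example, a 1000-byte buffer followed by a resource requiring 128-byte alignment puts the second resource at offset 1024, wasting 24 bytes of padding.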

    • When memory is allocated, its contents are undefined with the following constraint:

      • The contents of unprotected memory must not be a function of the contents of data protected memory objects, even if those memory objects were previously freed.

      • The contents of memory allocated by one application should not be a function of data from protected memory objects of another application, even if those memory objects were previously freed.

    • The maximum number of valid memory allocations that can exist simultaneously within a VkDevice may be restricted by implementation- or platform-dependent limits. The maxMemoryAllocationCount feature describes the number of allocations that can exist simultaneously before encountering these internal limits.

  • Freeing :

    • To free a memory object, call vkFreeMemory .

    • Before freeing a memory object, an application must ensure the memory object is no longer in use by the device — for example by command buffers in the pending state. Memory can be freed whilst still bound to resources, but those resources must not be used afterwards. Freeing a memory object releases the reference it held, if any, to its payload. If there are still any bound images or buffers, the memory object’s payload may not be immediately released by the implementation, but must be released by the time all bound images and buffers have been destroyed. Once all references to a payload are released, it is returned to the heap from which it was allocated.

    • How memory objects are bound to Images and Buffers is described in detail in the [Resource Memory Association] section.

    • If a memory object is mapped at the time it is freed, it is implicitly unmapped.

    • Host writes are not implicitly flushed when the memory object is unmapped, but the implementation must guarantee that writes that have not been flushed do not affect any other memory.

Resource Memory Association
  • Resources are initially created as virtual allocations with no backing memory. Device memory is allocated separately and then associated with the resource. This association is done differently for sparse and non-sparse resources.

  • Resources created with any of the sparse creation flags are considered sparse resources. Resources created without these flags are non-sparse. The details on resource memory association for sparse resources is described in Sparse Resources.

  • Non-sparse resources must be bound completely and contiguously to a single VkDeviceMemory object before the resource is passed as a parameter to any of the following operations:

    • creating buffer, image, or tensor views

    • updating descriptor sets

    • recording commands in a command buffer

  • Once bound, the memory binding is immutable for the lifetime of the resource.

  • In a logical device representing more than one physical device, buffer and image resources exist on all physical devices but can be bound to memory differently on each. Each such replicated resource is an instance of the resource. For sparse resources, each instance can be bound to memory arbitrarily differently. For non-sparse resources, each instance can either be bound to the local or a peer instance of the memory, or for images can be bound to rectangular regions from the local and/or peer instances. When a resource is used in a descriptor set, each physical device interprets the descriptor according to its own instance’s binding to memory.

  • Sparse Resources .

  • Sparse resources let you create VkBuffer  and VkImage  objects which are bound non-contiguously to one or more VkDeviceMemory  allocations.

Host Access

  • Also check GPU .

  • Memory objects created with vkAllocateMemory  are not directly host accessible.

  • Memory objects created with the memory property MEMORY_PROPERTY_HOST_VISIBLE  are considered mappable. Memory objects must be mappable in order to be successfully mapped on the host.

  • vkMapMemory

    • This function allows us to access a region of the specified memory resource defined by an offset and size.

    • Used to retrieve a host virtual address pointer to a region of a mappable memory object.

    • It is also possible to specify the special value WHOLE_SIZE  to map all of the memory.

    • device

      • Is the logical device that owns the memory.

    • memory

      • Is the VkDeviceMemory  object to be mapped.

    • offset

      • Is a zero-based byte offset from the beginning of the memory object.

    • size

      • Is the size of the memory range to map, or WHOLE_SIZE  to map from offset to the end of the allocation.

    • flags

      • Is a bitmask of VkMemoryMapFlagBits  specifying additional parameters of the memory map operation.

    • ppData

      • Is a pointer to a void*  variable in which a host-accessible pointer to the beginning of the mapped range is returned. The value of the returned pointer minus offset must be aligned to VkPhysicalDeviceLimits.minMemoryMapAlignment .

      • Acts like regular RAM from the CPU's point of view; physically it may point to GPU memory (e.g. through a BAR window) or to driver-managed system memory.

  • After a successful call to vkMapMemory  the memory object memory is considered to be currently host mapped.

  • It is an application error to call vkMapMemory on a memory object that is already host mapped.

  • vkMapMemory  does not check whether the device memory is currently in use before returning the host-accessible pointer.

  • If the device memory was allocated without the MEMORY_PROPERTY_HOST_COHERENT  set, these guarantees must be made for an extended range: the application must round down the start of the range to the nearest multiple of VkPhysicalDeviceLimits.nonCoherentAtomSize , and round the end of the range up to the nearest multiple of VkPhysicalDeviceLimits.nonCoherentAtomSize .
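The non-coherent rounding rule can be sketched as follows. This is an illustrative helper (the struct and function names are mine): it expands a mapped range to nonCoherentAtomSize boundaries and clamps the end to the allocation size, which is what an application must do before vkFlushMappedMemoryRanges / vkInvalidateMappedMemoryRanges on non-coherent memory.

```c
#include <stdint.h>

typedef struct { uint64_t offset, size; } Range;

/* Expand [offset, offset + size) so both ends land on multiples of
   atom (nonCoherentAtomSize), never extending past alloc_size. */
static Range expand_to_atom(uint64_t offset, uint64_t size,
                            uint64_t atom, uint64_t alloc_size) {
    uint64_t start = offset / atom * atom;                     /* round down */
    uint64_t end   = (offset + size + atom - 1) / atom * atom; /* round up   */
    if (end > alloc_size)
        end = alloc_size;   /* the range must stay inside the allocation */
    Range r = { start, end - start };
    return r;
}
```

With atom = 64, a write at offset 100 of 10 bytes must be flushed as the range [64, 128).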

  • Problem :

    • The driver may not immediately copy the data into the buffer memory, for example, because of caching.

    • It is also possible that writes to the buffer are not visible in the mapped memory yet.

    • There are two ways to deal with that problem:

      • Use a memory heap that is host coherent, indicated with MEMORY_PROPERTY_HOST_COHERENT

      • Call vkFlushMappedMemoryRanges  after writing to the mapped memory, and call vkInvalidateMappedMemoryRanges  before reading from the mapped memory.

    • Flushing memory ranges or using a coherent memory heap means that the driver will be aware of our writings to the buffer, but it doesn’t mean that they are actually visible on the GPU yet. The transfer of data to the GPU is an operation that happens in the background, and the specification simply tells us  that it is guaranteed to be complete as of the next call to vkQueueSubmit .

  • Minimum Alignment :

    • VkPhysicalDeviceLimits .

    • ChatGPT:

      • Dynamic offsets:

        • If you used DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC  or DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC  in your VkDescriptorSetLayoutBinding .

          • That is the definition of a dynamic descriptor.

        • If you call vkCmdBindDescriptorSets(..., dynamicOffsetCount, pDynamicOffsets) . If dynamicOffsetCount > 0  and pDynamicOffsets  is non-null you are supplying dynamic offsets at bind time.

      • How offsets are applied:

        • Non-dynamic descriptor:

          • The VkDescriptorBufferInfo.offset  you gave to vkUpdateDescriptorSets  is baked into the descriptor.

          • That offset  must be a multiple of minUniformBufferOffsetAlignment .

        • Dynamic descriptor:

          • The descriptor stores a base offset / range , and the runtime adds the dynamic offset(s) you pass to vkCmdBindDescriptorSets .

          • Each dynamic offset must be a multiple of minUniformBufferOffsetAlignment .

      • If you are not using Dynamic Offsets in the vkCmdBindDescriptorSets , nor using offsets in the VkDescriptorBufferInfo , then you don't need to worry about this limit.

Staging buffer
  • Use a host-visible buffer  as a temporary (staging) buffer and a device-local buffer  as the actual buffer.

  • The host-visible buffer should have usage BUFFER_USAGE_TRANSFER_SRC , and the device-local buffer should have usage BUFFER_USAGE_TRANSFER_DST .

  • The contents of the host-visible buffer are copied to the device-local buffer using vkCmdCopyBuffer .


  • Data Transfer .

  • Buffer copy requirements :

    • Requires a queue family that supports transfer operations, which is indicated using QUEUE_TRANSFER .

      • Any queue family with QUEUE_GRAPHICS  or QUEUE_COMPUTE  capabilities already implicitly supports QUEUE_TRANSFER  operations.

      • A different queue family specifically for transfer operations could be used.

        • It will require you to make the following modifications to your program:

          • Modify QueueFamilyIndices  and findQueueFamilies  to explicitly look for a queue family with the QUEUE_TRANSFER  bit, but not the QUEUE_GRAPHICS .

          • Modify createLogicalDevice  to request a handle to the transfer queue

          • Create a second command pool for command buffers that are submitted on the transfer queue family

          • Change the sharingMode  of resources to be SHARING_MODE_CONCURRENT  and specify both the graphics and transfer queue families

          • Submit any transfer commands like vkCmdCopyBuffer  (which we’ll be using in this chapter) to the transfer queue instead of the graphics queue

      • This will teach you a lot about how resources are shared between queue families.

      • Caio: OK, but what are the benefits of using different queues? (One answer: a dedicated transfer queue often maps to the GPU's DMA engines, so copies can overlap with graphics work instead of serializing on the graphics queue.)

BAR (Base Address Register)
Memory Aliasing
  • A range of a VkDeviceMemory allocation is aliased if it is bound to multiple resources simultaneously, as described below, via vkBindImageMemory , vkBindBufferMemory , vkBindAccelerationStructureMemoryNV , vkBindTensorMemoryARM , via sparse memory bindings, or by binding the memory to resources in multiple Vulkan instances or external APIs using external memory handle export and import mechanisms.

  • Consider two resources, resourceA and resourceB, bound respectively to memory rangeA and rangeB. Let paddedRangeA and paddedRangeB be, respectively, rangeA and rangeB aligned to bufferImageGranularity. If the resources are both linear or both non-linear (as defined in the Glossary), then the resources alias the memory in the intersection of rangeA and rangeB. If one resource is linear and the other is non-linear, then the resources alias the memory in the intersection of paddedRangeA and paddedRangeB.

  • The implementation-dependent limit bufferImageGranularity also applies to tensor resources.

  • Memory aliasing can be useful to reduce the total device memory footprint of an application, if some large resources are used for disjoint periods of time.
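The padded-range rule above can be sketched as a small C check. The Binding struct and function names are illustrative (a real application would derive the linear/non-linear classification from resource type and tiling):

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint64_t offset, size; bool linear; } Binding;

static uint64_t round_down(uint64_t x, uint64_t g) { return x / g * g; }
static uint64_t round_up(uint64_t x, uint64_t g)   { return (x + g - 1) / g * g; }

/* Do two bindings into the same VkDeviceMemory alias? If one resource
   is linear and the other non-linear, each range is first padded to
   bufferImageGranularity before testing for a non-empty intersection. */
static bool ranges_alias(Binding a, Binding b, uint64_t granularity) {
    uint64_t a0 = a.offset, a1 = a.offset + a.size;
    uint64_t b0 = b.offset, b1 = b.offset + b.size;
    if (a.linear != b.linear) {
        a0 = round_down(a0, granularity); a1 = round_up(a1, granularity);
        b0 = round_down(b0, granularity); b1 = round_up(b1, granularity);
    }
    return a0 < b1 && b0 < a1;   /* ranges overlap */
}
```

With a granularity of 1024, a linear buffer at [0, 512) and a non-linear image at [512, 1024) alias after padding, even though their raw ranges are disjoint; two linear buffers at those same offsets do not.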

  • vkBindBufferMemory() .

    • If memory allocation was successful, then we can now associate this memory with the buffer using this function.

    • offset

      • Offset within the region of memory.

      • Since this memory is allocated specifically for this vertex buffer, the offset is simply 0 .

      • If the offset is non-zero, then it is required to be divisible by memRequirements.alignment .

  • Memory Aliasing .

Lazily Allocated Memory

  • If the memory object is allocated from a heap with the MEMORY_PROPERTY_LAZILY_ALLOCATED  bit set, that object’s backing memory may be provided by the implementation lazily. The actual committed size of the memory may initially be as small as zero (or as large as the requested size), and monotonically increases as additional memory is needed.

  • A memory type with this flag set is only allowed to be bound to a VkImage whose usage flags include IMAGE_USAGE_TRANSIENT_ATTACHMENT .

Protected Memory

  • Protected memory divides device memory into protected device memory and unprotected device memory.

  • Unprotected Device Memory :

    • Unprotected device memory, which can be visible to the device and can be visible to the host

    • Unprotected images, unprotected tensors, and unprotected buffers, to which unprotected memory can be bound

    • Unprotected command buffers, which can be submitted to a device queue to execute unprotected queue operations

    • Unprotected device queues, to which unprotected command buffers can be submitted

    • Unprotected queue submissions, through which unprotected command buffers can be submitted

    • Unprotected queue operations

  • Protected Device Memory :

    • Protected device memory, which can be visible to the device but must not be visible to the host

    • Protected images, protected tensors, and protected buffers, to which protected memory can be bound

    • Protected command buffers, which can be submitted to a protected-capable device queue to execute protected queue operations

    • Protected-capable device queues, to which unprotected command buffers or protected command buffers can be submitted

    • Protected queue submissions, through which protected command buffers can be submitted

    • Protected queue operations

  • Protected Memory .

Tracking GPU Memory

  • Vulkan does not expose fixed per-object byte counts for most objects — exact memory use is implementation and driver-dependent. Some objects ( VkImage , VkBuffer ) must be bound to VkDeviceMemory  you allocate (so you can know their size). Many other objects (pipelines, command buffers, descriptor sets, semaphores, image views, pipeline layouts, etc.) often cause hidden driver allocations that may live in host memory, device memory, or both — and those allocations’ size and placement vary by driver and GPU.

By object
  • VkInstance  / VkPhysicalDevice  / VkDevice  (handles):

    • Small host-side allocations (process RAM). Measure via your VkAllocationCallbacks or by tracking driver host allocations. These are host-visible (they are just process memory).

  • VkImageView  / VkBufferView  / VkSampler :

    • Lightweight, usually host memory (small driver structures). They rarely allocate large device memory; they may cause small host allocations. Implementation dependent but small (tens to a few hundred bytes each in many drivers).

  • VkDescriptorSetLayout  / VkPipelineLayout  / VkDescriptorSet  (layout vs sets):

    • Layout and pipeline layout are small host structures (host memory). Descriptor sets and descriptor pools may be implemented in host memory or device memory; larger descriptor usage (large arrays, inline uniform blocks, inline immutable samplers, or driver internal structures) can cause real device allocations. Behavior is driver dependent.

  • VkPipeline  (graphics/compute):

    • Creation can cause hidden device and/or host allocations (compiled device binaries, GPU resident state). The spec explicitly allows implementations to allocate device memory during pipeline creation; the pipeline cache and pipeline executable properties APIs can help quantify some of this. Pipeline objects range from a few KB to multiple MB depending on driver, the number/complexity of shaders, and whether the driver stores compiled GPU blobs. Use VK_KHR_pipeline_executable_properties  and pipeline cache queries to inspect pipeline internals.

  • VkPipelineCache :

    • Contains data you can query with vkGetPipelineCacheData  — that returns host-visible data you can size and persist.

  • VkCommandPool  / VkCommandBuffer :

    • Command buffers are allocated from a pool; actual memory holding recorded commands is driver-managed and may be placed in device local memory (GPU command stream) or host memory, depending on driver and OS. Sizes vary widely and are not exposed directly; instrument via driver callbacks or VK_EXT_device_memory_report .

  • VkSemaphore  / VkFence :

    • Binary semaphores and fences may use kernel/OS constructs or small host/device allocations; timeline semaphores hold a 64-bit value and may be backed by device memory on some implementations. Typically small (a few bytes to some KB) but driver dependent.

  • VkSwapchainKHR  and presentable images:

    • Swapchain images are VkImage objects with memory managed by the WSI/driver; they are typically DEVICE_LOCAL and can live in special presentable heaps. Their size is roughly width × height × bytes per pixel × layers/levels, plus alignment padding (obtainable from vkGetImageMemoryRequirements  for images you allocate yourself; for WSI images use provided queries and VK_EXT_memory_budget  to monitor heap consumption).

  • Typical magnitude examples (illustrative only)

    • Instance / layouts / view objects: tens to hundreds of bytes each (host).

    • Small buffers (uniform buffers) / small images: KBs to MBs, depending on dimensions and format — these are the allocations you make explicitly.

    • Pipelines: KBs → multiple MBs (depends on shader complexity and driver caching). Use pipeline executable queries to get an estimate.

    • Command buffer pools / driver command memory: KBs → MBs per many command buffers; driver dependent.

    • These numbers must be measured on your target hardware — they are not constant across drivers.

Tracking
  1. Centralize and wrap all vkAllocateMemory  / vkFreeMemory  calls.

    • Record: VkDeviceMemory  handle, VkMemoryAllocateInfo  size/flags, chosen memory type index, and optionally the VkDeviceSize  and offset for any suballocator logic. Suballocation (one VkDeviceMemory  used for many buffers/images) means you must additionally record your suballocations. Use this table as the authoritative record of committed GPU bytes. (Spec: vkAllocateMemory  produces the device memory payload.)

  2. Track suballocation bookkeeping in your allocator.

    • If you allocate large VkDeviceMemory  blocks and suballocate slices for many buffers/images, account the slices into your counters (otherwise counting only VkDeviceMemory  handles will under- or over-count usage).

  3. Hook creation / bind points to attribute usage.

    • When you vkBindBufferMemory  / vkBindImageMemory , attach which application object is consuming which suballocation — this lets you produce per-buffer/per-image committed usage.

  4. Use VK_EXT_memory_budget  for driver-reported heap usage/budgets.

    • Query VkPhysicalDeviceMemoryBudgetPropertiesEXT  via vkGetPhysicalDeviceMemoryProperties2  to get heapBudget  and heapUsage  values per heap.

    • These are implementation-provided and reflect other processes and driver internal usage; use them as cross-checks and to warn when you approach limits.

    • Use it to see heap usage and budget per heap (useful to spot overall device local vs host mapped heap pressure). This is not per-object, but shows total heap usage and remaining budget. Combine with device_memory_report events to attribute heap changes to objects.

  5. Enable VK_EXT_device_memory_report  for visibility into driver-internal  allocations.

    • This extension gives callbacks for driver-side device memory events (allocate/free/import) including allocations not exposed as VkDeviceMemory (for example, allocations made internally during pipeline creation). Use it for debugging and to catch allocations that your vkAllocateMemory wrapper would miss.

  6. Account for dedicated allocations and imports.

    • You can use VK_KHR_dedicated_allocation  to force one allocation per resource. If you allocate one VkDeviceMemory  per resource you know exactly how many bytes each resource consumes.

    • If an allocation is made with VkMemoryDedicatedAllocateInfo  or via external memory import, count that device memory appropriately — it typically represents a whole allocation tied to a single image/buffer.

  7. Use VK_KHR_pipeline_executable_properties  for pipeline internals.

    • Create the pipeline with the capture flag ( VK_PIPELINE_CREATE_CAPTURE_STATISTICS_BIT_KHR ) and call vkGetPipelineExecutablePropertiesKHR  / vkGetPipelineExecutableStatisticsKHR  to obtain compile-time statistics and sizes for pipeline executables that the driver produced. This helps measure how much space pipeline compilation produced (but it may not show every byte the driver reserved at runtime).

  8. Vendor tools + RenderDoc / NSight / Radeon GPU Profiler.

    • These tools often show GPU memory usage, allocations, and sometimes attribute memory to API objects. Use them to validate your in-process accounting.

Device Memory Report ( VK_EXT_device_memory_report )
  • Last updated (2021-01-06).

  • Info .

  • Allows registration of device memory event callbacks upon device creation, so that applications or middleware can obtain detailed information about memory usage and how memory is associated with Vulkan objects. This extension exposes the actual underlying device memory usage, including allocations that are not normally visible to the application, such as memory consumed by vkCreateGraphicsPipelines . It is intended primarily for use by debug tooling rather than for production applications.

Memory Budget ( EXT_memory_budget )
  • Last updated (2018-10-08).

  • Coverage .

    • Coverage is poor on Android; on other platforms it is above 80%.

  • Sample

  • Query video memory budget for the process from the OS memory manager.

  • It’s important to keep usage below the budget to avoid stutters caused by demotion of video memory allocations.

  • While running a Vulkan application, other processes on the machine might also be attempting to use the same device memory, which can pose problems.

  • This extension adds support for querying the amount of memory used and the total memory budget for a memory heap. The values returned by this query are implementation-dependent and can depend on a variety of factors including operating system and system load.

  • The VkPhysicalDeviceMemoryBudgetPropertiesEXT.heapBudget  values can be used as a guideline for how much total memory from each heap the current process can use at any given time, before allocations may start failing or causing performance degradation. The values may change based on other activity in the system that is outside the scope and control of the Vulkan implementation.

  • The VkPhysicalDeviceMemoryBudgetPropertiesEXT.heapUsage  will display the current process estimated heap usage.

  • With this information, the idea is for an application at some interval (once per frame, every few seconds, etc.) to query heapBudget and heapUsage. From there the application can notice if it is over budget and decide how it wants to handle the memory situation (free memory, move resources to host memory, change mipmap levels, etc.).

  • This extension is designed to be used in concert with VK_EXT_memory_priority  to help with this part of memory management.

Vulkan Memory Allocator (VMA)

  • VMA in Odin .

  • VMA (vulkan memory allocator) .

  • Implements memory allocators for Vulkan, header only. In Vulkan, the user has to deal with the memory allocation of buffers, images, and other resources on their own. This can be very difficult to get right in a performant and safe way. Vulkan Memory Allocator does it for us and allows us to simplify the creation of images and other resources. Widely used in personal Vulkan engines or smaller scale projects like emulators. Very high end projects like Unreal Engine or AAA engines write their own memory allocators.

  • There are cases like the PCSX3 emulator project, which replaced its hand-rolled allocator with VMA and gained roughly 20% extra framerate.

  • Critiques :

    • .

HDR Support

  • Shader code converts high-dynamic-range (HDR) linear color values (often stored in floating formats like R16G16B16A16_SFLOAT ) into display-referred low-dynamic-range (LDR) values (sRGB or the swapchain format).

  • Operations include exposure, clamping, tone curve (Reinhard, ACES, filmic), and gamma or sRGB conversion.

  • Each monitor manufacturer does this differently; it's not standardized .

  • Inputs:

    • HDR color (linear), optionally exposure/exposure texture, bloom, eye adaptation.

  • Steps (example minimal):

    1. Multiply by exposure.

    2. Apply curve (e.g. Reinhard: c/(1+c) , or ACES approximation).

    3. Convert to sRGB/gamma ( pow(color, 1.0/2.2) ) or use proper sRGB conversion.

    4. Output vec4  clamped to [0,1]  into swapchain format (e.g. FORMAT_B8G8R8A8_UNORM ).

Drawing to a High Precision Image ( R16G16B16A16_SFLOAT )
  • Rendering into an R16G16B16A16_SFLOAT  (FP16) image provides:

    • Higher dynamic range and precision (light accumulation > 1.0, less banding, better tone mapping).

    • Freedom to tone-map and convert later.

  • This is the engine-side HDR pipeline .

  • Described technique .

    • From "New draw loop" until the end.

  • Rendering into a separate high-precision offscreen target and then copying/blitting/tonemapping into the swapchain is the standard approach when you need arbitrary internal resolution, higher precision, HDR processing, or when the swapchain does not expose desired formats/usages. The trade-off is the extra memory and an explicit copy/blit or import step; the benefit is control over precision and size. The Vulkan command vkCmdBlitImage  / transfer usage or a shader-based blit/resolve are the usual mechanisms to move from the internal target to the presentable image.

  • The image we will be using is going to be in the RGBA 16-bit float format.

    • R16G16B16A16_SFLOAT  is a common intermediate HDR format (16-bit float per channel). It increases memory and bandwidth (roughly 2× vs 8-bit RGBA) and may affect GPU/VRAM usage and upload/download costs; it also reduces quantization/banding and supports HDR/light-accumulation workflows without clamping at 1.0. The choice is an explicit trade-off: more precision (and headroom for lighting) vs more memory/bandwidth. The format is widely supported for offscreen images but may not be available as a swapchain format on all platforms, which reinforces the decision to render offscreen then convert/tonemap for presentation.

    • This is slightly overkill, but will provide us with a lot of extra pixel precision that will come in handy when doing lighting calculations and better rendering.

  • It's possible to apply low-latency techniques where we could be rendering into a different image from the swapchain image, and then directly push that image to the swapchain with very low latency.

    • Techniques like NVIDIA's "Latency Markers" / Reflex or AMD's Anti-Lag rely on starting rendering work as early as possible, often before  the presentation engine signals readiness for the next frame via vkAcquireNextImageKHR  (Vulkan) or AcquireNextFrame  (DXGI). This necessitates rendering into a separate, persistently available image. The swapchain image index is only provided at acquisition time, making pre-rendering impossible with direct swapchain targets. Documentation for these low-latency SDKs implicitly requires separate render targets.

  • Choosing the image tiling:

    • We can then copy that image into the swapchain image and present it to the screen.

    • VkCmdCopyImage

      • It is faster, but much more restricted; for example, the resolution of both images must match.

    • VkCmdBlitImage

      • Lets you copy images of different formats and different sizes into one another.

      • You have a source rectangle and a target rectangle, and the system copies it into its position.

  • New code for transitioning :

    _drawExtent.width = _drawImage.imageExtent.width;
    _drawExtent.height = _drawImage.imageExtent.height;
    
    CHECK(vkBeginCommandBuffer(cmd, &cmdBeginInfo)); 
    
    // transition our main draw image into general layout so we can write into it
    // we will overwrite it all so we don't care what the older layout was
    vkutil::transition_image(cmd, _drawImage.image, IMAGE_LAYOUT_UNDEFINED, IMAGE_LAYOUT_GENERAL);
    
    draw_background(cmd);
    
    //transition the draw image and the swapchain image into their correct transfer layouts
    vkutil::transition_image(cmd, _drawImage.image, IMAGE_LAYOUT_GENERAL, IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL);
    vkutil::transition_image(cmd, _swapchainImages[swapchainImageIndex], IMAGE_LAYOUT_UNDEFINED, IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL);
    
    // execute a copy from the draw image into the swapchain
    vkutil::copy_image_to_image(cmd, _drawImage.image, _swapchainImages[swapchainImageIndex], _drawExtent, _swapchainExtent);
    
    // set swapchain image layout to Present so we can show it on the screen
    vkutil::transition_image(cmd, _swapchainImages[swapchainImageIndex], IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, IMAGE_LAYOUT_PRESENT_SRC_KHR);
    
    //finalize the command buffer (we can no longer add commands, but it can now be executed)
    CHECK(vkEndCommandBuffer(cmd));
    
    • The main difference we have in the render loop is that we no longer do the clear on the swapchain image. Instead, we do it on the _drawImage.image .

    • Once we have cleared the image, we transition both the swapchain and the draw image into their layouts for transfer, and we execute the copy command. Once we are done with the copy command, we transition the swapchain image into present layout for display. As we are always drawing on the same image, our draw_background  does not need to access the swapchain image index; it just clears the draw image. We are also writing the _drawExtent  that we will use for our draw region.

Etc
  • But  this image still has to be copied/tonemapped into the swapchain format , which is typically limited to 8-bit UNORM unless the OS/driver supports HDR swapchain formats.

  • To actually output HDR to the screen, all of the following conditions must be met:

  1. Swapchain format must support HDR bit depth .

    • Example formats: FORMAT_A2B10G10R10_UNORM_PACK32 , FORMAT_R16G16B16A16_SFLOAT , or platform-specific HDR surface formats.

    • You query available swapchain formats via vkGetPhysicalDeviceSurfaceFormatsKHR .

    • If only 8-bit formats are exposed, you cannot present HDR directly.

  2. Swapchain color space must be HDR-capable

    • Vulkan allows specifying a VkColorSpaceKHR  (e.g., COLOR_SPACE_HDR10_ST2084_EXT , COLOR_SPACE_HDR10_HLG_EXT ).

    • These correspond to HDR transfer functions (PQ/HLG).

    • If the driver/surface does not expose them, the system compositor won’t accept HDR content.

  3. OS and display pipeline must be HDR-enabled

    • Windows: HDR toggle must be enabled in system settings, compositor configured for HDR10.

    • Linux/Wayland: requires HDR support in compositor + driver (still emerging).

    • Android: requires AHardwareBuffer  / SurfaceView  with HDR formats.

    • macOS: Metal swapchains expose extended sRGB/PQ output modes.

    • (Platform docs confirm HDR availability is compositor-driven).

  4. Application side tone mapping & gamut mapping

    • Even if swapchain supports HDR, you generally still render into FP16, then apply:

      • Tone mapping (map wide dynamic range → HDR10/HLG range).

      • Color gamut conversion (usually Rec.709 → Rec.2020 for HDR10).

    • Only then write into the HDR swapchain image.

Profiling

  • Provides your application with a mechanism to time the execution of commands on the GPU.

  • You can specify any pipeline stage at which the timestamp should be written, but many stage combinations and orderings won’t give meaningful results.

    • So while it may sound reasonable to write timestamps for the vertex and fragment shader stages directly one after another, that will usually not return meaningful results due to how the GPU works.

  • You can’t compare timestamps taken on different queues.

  • Sample .

    • We’ll be using 6 time points, one for the start and one for the end of three render passes.

    • The code samples/api/timestamp_queries :

      • Uses QUERY_RESULT_64 | QUERY_RESULT_WAIT , so it's not optimal.

      • The query is made after vkQueueSubmit() .

  • GPU Timing Basics .

    • Vulkan and DX12.

    • Uses QUERY_RESULT_64  and enables the hostQueryReset  for vk.PhysicalDeviceVulkan12Features , using vk.ResetQueryPool()  right after creating the QueryPool .

  • Queries .

  • vkCmdWriteTimestamp2 .

    • This is pretty much the same as the vkCmdWriteTimestamp  function used in this sample, but adds support for some additional pipeline stages using VkPipelineStageFlags2 .

Support

  • Device limits:

    • timestampPeriod

      • If the limit of the physical device is greater than zero, timestamp queries are supported.

      • A timestampPeriod  of 1 means one increment in the result maps to exactly one nanosecond.

      • It contains the number of nanoseconds it takes for a timestamp query value to be increased by 1 ("tick").

    • timestampComputeAndGraphics

      • If it is TRUE , timestamps are supported by every queue family that supports either graphics or compute operations.

      • If not, we need to check if the queue we want to use supports timestamps.

Query Pool

  • A query pool is then used to either directly fetch or copy over the results to the host.

  • Used to store and read back the results.

  • queryType

    • We set to QUERY_TYPE_TIMESTAMP  for using timestamp queries

  • queryCount

    • The maximum number of timestamp query results this pool can store.

Reset
  • Before we can start writing data to the query pool, we need to reset it.

  • vkCmdResetQueryPool

    • At the start of the command buffer.

    • Sets the status of query indices [ firstQuery , firstQuery  + queryCount  - 1] to unavailable.

    • Defines an execution dependency between other query commands that reference the same query.

  • vkResetQueryPool() .

  • QUERY_POOL_CREATE_RESET_KHR

    • During Query Pool creation.

Writing

  • vkCmdWriteTimestamp

    • Will request a timestamp to be written from the GPU for a certain pipeline stage and write that value to memory.

Reading

  • Reading back the results can be done in two ways:

    • Copy the results into a VkBuffer  inside the command buffer using vkCmdCopyQueryPoolResults

    • Get the results after the command buffer has finished executing using vkGetQueryPoolResults

  • vkGetQueryPoolResults()

    • QUERY_RESULT_64

      • Will tell the api that we want to get the results as 64-bit values. Without this flag, we would only get 32-bit values. And since timestamp queries can operate in nanoseconds, using only 32 bits could result in an overflow.

      • If your device has a timestampPeriod  of 1, so that one increment in the result maps to exactly one nanosecond, with 32-bit precision you’d run into such an overflow after only about 4.29 seconds (2^32 ns).

    • QUERY_RESULT_WAIT

      • Tells the api to wait for all results to be available. So when using this flag the values written to our time_stamps  vector are guaranteed to be available after calling vkGetQueryPoolResults .

      • This is fine for our use-case where we want to immediately access the results, but may introduce unnecessary stalls in other scenarios.

    • QUERY_RESULT_WITH_AVAILABILITY

      • Will let you poll the availability of the results and defer writing new timestamps until the results are available.

      • This should be the preferred way of fetching the results in a real-world application. Using this flag an additional availability value is inserted after each query value. If that value becomes non-zero, the result is available. You then check availability before writing the timestamp again.

Occlusion Queries

  • Occlusion queries track the number of samples that pass the per-fragment tests for a set of drawing commands. As such, occlusion queries are only available on queue families supporting graphics operations. The application can  then use these results to inform future rendering decisions.

  • An occlusion query is begun and ended by calling vkCmdBeginQuery  and vkCmdEndQuery , respectively.

  • When an occlusion query begins, the count of passing samples always starts at zero.

  • For each drawing command, the count is incremented as described in Sample Counting . If flags  does not contain QUERY_CONTROL_PRECISE  an implementation may  generate any non-zero result value for the query if the count of passing samples is non-zero.

Pipeline Statistics Queries

  • Pipeline statistics queries allow the application to sample a specified set of VkPipeline  counters. These counters are accumulated by Vulkan for a set of either drawing or dispatching commands while a pipeline statistics query is active. As such, pipeline statistics queries are available on queue families supporting either graphics or compute operations.

  • The availability of pipeline statistics queries is indicated by the pipelineStatisticsQuery  member of the VkPhysicalDeviceFeatures  object (see vkGetPhysicalDeviceFeatures  and vkCreateDevice  for detecting and requesting this query type on a VkDevice ).

  • A pipeline statistics query is begun and ended by calling vkCmdBeginQuery  and vkCmdEndQuery , respectively.

  • When a pipeline statistics query begins, all statistics counters are set to zero. While the query is active, the pipeline type determines which set of statistics are available, but these must  be configured on the query pool when it is created. If a statistic counter is issued on a command buffer that does not support the corresponding operation, or the counter corresponds to a shading stage which is missing from any of the pipelines used while the query is active, the value of that counter is undefined  after the query has been made available. At least one statistic counter relevant to the operations supported on the recording command buffer must  be enabled.

Performance Queries

  • Provide applications with a mechanism for getting performance counter information about the execution of command buffers, render passes, and commands.

  • Each queue family advertises the performance counters that can  be queried on a queue of that family via a call to vkEnumeratePhysicalDeviceQueueFamilyPerformanceQueryCountersKHR . Implementations may  limit access to performance counters based on platform requirements or only to specialized drivers for development purposes.

  • Performance queries use the existing vkCmdBeginQuery  and vkCmdEndQuery  to control what command buffers, render passes, or commands to get performance information for.

Mesh Shaders Queries

  • When a generated mesh primitives query is active, the mesh-primitives-generated count is incremented every time a primitive emitted from the mesh shader stage reaches the fragment shader stage. When a generated mesh primitives query begins, the mesh-primitives-generated count starts from zero.

  • Mesh and task shader pipeline statistics queries function the same way that invocation queries work for other shader stages, counting the number of times the respective shader stage has been run. When the statistics query begins, the invocation counters start from zero.

Result Status Queries

  • Result status queries serve a single purpose: allowing the application to determine whether a set of operations have completed successfully or not, as indicated by the VkQueryResultStatusKHR  value written when retrieving the result of a query using the QUERY_RESULT_WITH_STATUS_KHR  flag.

  • Unlike other query types, result status queries do not track or maintain any other data beyond the completion status, thus no other data is written when retrieving their results.

  • Support for result status queries is indicated by VkQueueFamilyQueryResultStatusPropertiesKHR :: queryResultStatusSupport  , as returned by vkGetPhysicalDeviceQueueFamilyProperties2  for the queue family in question.

Other Queries

  • Transform Feedback Queries.

  • Primitives Generated Queries.

  • Intel Performance Queries.

  • Video Encode Feedback Queries.

Mobile

GLFW
  • An unfortunate disadvantage is that GLFW doesn’t work on Android or iOS; it is a desktop-only solution.

  • SDL does offer mobile support; however, mobile windowing support is best done by interfacing directly with the operating system, such as using the JNI  on Android.

  • While mobile is beyond the scope of this initial tutorial, plans exist to eventually cover it in detail, and Google has excellent documentation .

Pre-Rotation
  • .

  • .

    • You can only query surfaceCapabilities.currentTransform , you cannot set it.

    • If they don't match, the presentation engine will have to do the pre-rotation for you, which has a performance cost.

  • Implementing a full pre-rotate system is reportedly difficult, so many engines avoid it.

  • .

  • .

    • This is a simpler option to implement.

    • "Many engines already do a blit to the final image to the swapchain image, so this is the perfect place to do the pre-rotation".

      • "Basically free and you get performance benefits".

VR

Video Decoding

SPIR-V

  • Standard Portable Intermediate Representation V .

  • SPIR-V .

  • Vulkan’s official shader format (portable, efficient).

  • SPIR-V is a binary format.

  • Works with Metal via MoltenVK.

Compiling
  • You can write GLSL or HLSL and compile to SPIR-V.

    • GLSL to SPIR-V:

      • glslangValidator (from Khronos)

      # Compile GLSL → SPIR-V (Vulkan)
      glslangValidator -V vertex_shader.vert -o vert.spv
      glslangValidator -V fragment_shader.frag -o frag.spv
      
    • HLSL to SPIR-V:

      • DXC (DirectX Shader Compiler)

      dxc -T vs_6_0 -E VSMain -spirv shader.hlsl -Fo vert.spv
      
      • Requires HLSL shaders with Vulkan-compatible semantics.

    • Convert SPIR-V to other formats:

      • SPIRV-Cross (converts SPIR-V back to GLSL/HLSL/MSL)

  • Compiling shaders on the command line is one of the most straightforward options and is the one we'll use in this tutorial, but it's also possible to compile shaders directly from your own code.

    • The Vulkan SDK includes libshaderc , which is a library to compile GLSL code to SPIR-V from within your program.

Advantages
  • The advantage of using a bytecode format is that the compilers written by GPU vendors to turn shader code into native code are significantly less complex. The past has shown that with human-readable syntax like GLSL, some GPU vendors were rather flexible with their interpretation of the standard. If you happen to write non-trivial shaders with a GPU from one of these vendors, then you’d risk another vendor’s drivers rejecting your code due to syntax errors, or worse, your shader running differently because of compiler bugs. With a straightforward bytecode format like SPIR-V that will hopefully be avoided.

Tooling

spirv-cross
  • Cross-compilation

    • Converts SPIR-V shader binaries into high-level shading languages:

      • GLSL (various versions)

      • HLSL

      • MSL (Metal Shading Language for Apple platforms)

      • WGSL (WebGPU shading language)

    • This lets you write shaders once (e.g. in GLSL or HLSL), compile to SPIR-V, then regenerate source for other backends.

  • Reflection

    • Inspects SPIR-V binaries and reports metadata about:

      • Descriptor sets and bindings

      • Push constants

      • Vertex input/output attributes

      • Specialization constants

    • With the --reflect  flag, it outputs this data as JSON , making it easy to drive engine code-generation or runtime Vulkan setup.

  • Ex :

    • spirv-cross scene_vert.spv --reflect > scene_vert.json .

Web

  • No Vulkan support in browsers; you must port to WebGPU or use translation layers.

  • WebAssembly - WASM .

WebGPU (wgpu)

  • WebGPU is a cross-platform graphics API, aiming to unify GPU access across:

    • Browsers (via native support)

    • Native apps (via libraries like wgpu, Dawn, etc.)