Vulkan

Starting

Versions
  • Version 1.3 (2024-02-22).

Is OOP?
  • No; the core API is plain C built around opaque handles, not object-oriented. Vulkan-Hpp layers C++ bindings on top.

API Structs
  • Many structures in Vulkan require you to explicitly specify the type of structure in the sType  member.

  • Functions that create or destroy an object take a VkAllocationCallbacks  parameter that allows you to use a custom allocator for driver memory; it will be left as nullptr  in this tutorial.

  • Almost all functions return a VkResult  that is either VK_SUCCESS  or an error code. The specification describes which error codes each function can return and what they mean.

  • Objects whose names have the KHR  postfix are part of a Vulkan extension.

  • The pNext  member can point to an extension structure.

Compatibility

Support
  • Platform Support .

  • Checking for Vulkan support .

  • Windows (7 and later)

    • Yes, via the official SDK and drivers.

  • Linux

    • Yes. Native support via Mesa and vendor drivers.

  • Android (5.0+)

    • Yes, most devices from Android 7.0+ support Vulkan.

  • macOS

    • No native support — requires MoltenVK (Vulkan-to-Metal wrapper).

  • iOS

    • No native support — requires MoltenVK.

  • Web

    • No native support — experimental via WebGPU or Emscripten with translation layers.

  • Consoles

    • Partially supported; depends on platform SDKs and NDAs (e.g., Nintendo Switch uses a Vulkan-like API).

Driver support
  • Vulkan requires updated GPU drivers.

  • Older or integrated GPUs (especially pre-2013) may lack Vulkan support.

  • Vendor support varies: NVIDIA, AMD, and Intel generally support Vulkan on most modern hardware.

Compatibility Layers
  • Layers and wrappers used to increase compatibility across platforms:

  • MoltenVK :

    • Runs Vulkan on Metal (required for macOS/iOS).

  • gfx-rs / wgpu / bgfx :

    • Abstraction layers to use Vulkan when available, fallback to other APIs.

  • ANGLE / Zink :

    • Can translate other APIs (e.g., OpenGL) to Vulkan and vice-versa.

Tutorials

Tutorials in Docs
  • Docs Vulkan Guide .

    • I already read everything before the memory allocation section.

  • Docs Vulkan Tutorial .

    • Based on the vulkan-tutorial, with differences:

      • Vulkan 1.4 as a baseline

      • Dynamic rendering instead of render passes

      • Timeline semaphores

      • Slang  as the primary shading language

      • Modern C++ (20) with modules

      • Vulkan-Hpp  with RAII

      • It also contains Vulkan usage clarifications, improved synchronization and new content.

      • "This tutorial will use RAII with smart pointers and it will endeavor to demonstrate the latest methods and extensions which should hopefully make Vulkan a joy to use."

    • Does not require knowledge of previous APIs, but you need to know C++ and graphics math.

    • Impressions :

      • Holy moly the new C++ API is a pain.

      • I preferred to go back to the vulkan-tutorial  several times and check how it's used in the C API.

      • I used this tutorial only as a base to consider the new features.

      • I didn't use Slang, I didn't like it; I stayed with GLSL.

  • vulkan-tutorial .

    • Does not require knowledge of previous APIs, but you need to know C++ and graphics math.

    • You can use C, but the tutorial is in C++.

    • Vulkan 1.0; shown here .

    • Uses GLSL for shaders.

  • ~ Vulkan Guide .

    • For people with previous experience with Graphics APIs.

    • I'm not a big fan of this guide.

    • Uses :

      • Vulkan 1.3.

      • C++, Visual Studio, CMake.

      • SDL to create a window.

      • Vk Bootstrap .

        • Abstracts a big amount of boilerplate that Vulkan has when setting up. Most of that code is written once and never touched again, so we will skip most of it using this library. This library simplifies instance creation, swapchain creation, and extension loading. It will be removed from the project eventually in an optional chapter that explains how to initialize that Vulkan boilerplate the “manual” way.

      • VMA (Vulkan Memory Allocator)

        • Implements memory allocators for Vulkan, header only. In Vulkan, the user has to deal with the memory allocation of buffers, images, and other resources on their own. This can be very difficult to get right in a performant and safe way. Vulkan Memory Allocator does it for us and allows us to simplify the creation of images and other resources. Widely used in personal Vulkan engines or smaller scale projects like emulators. Very high end projects like Unreal Engine or AAA engines write their own memory allocators.

    • Impressions :

      • The tutorial gives you a project with many things already done, and holds your hand for every syntax, file, folder, methodology, etc.

        • It simply throws a lot of stuff at you.

        • It's a pretty bloated  experience, for sure.

        • I consider that a pain.

  • Samples Collections in C++ .

  • Vulkan Barriers Explained .

  • Vulkan AMD Blog Posts .

  • Writing an Efficient Vulkan Renderer .

Playlists
  • Playlist Vulkan with Odin - Nadako .

    • Vulkan 1.3, with Dynamic Rendering.

    • I watched videos 1 through 11.

    • They are good videos.

    • I do not recommend them to someone who has never seen anything before, because they are not exactly for beginners and their explanations lack some foundation.

    • I recommend them as a reference for how to set up in Odin.

  • Playlist Vulkan - OGLDEV .

    • C++, with Visual Studio.

    • Assumes you have seen another GPU API before.

    • Video 1:

      • Window with GLFW, not explained.

    • Video 8:

      • Theory explanation ok; code explanation meh.

    • Video 12:

      • Synchronization with 1 frame in-flight.

      • Good video.

    • Video 16:

      • Descriptor Sets.

      • Nope. See the spec, guides, or other videos on the subject; I think that's better.

    • Video 21:

      • Dynamic Rendering.

      • {0:00 -> 12:14}

        • Explanation of the code to obtain the EXT for Vulkan 1.2, and ignore it for Vulkan 1.3

      • The rest of the video is irrelevant, it does not explain anything beyond what to change if someone is following his code line by line.

  • Playlist Vulkan 2024 - GetIntoGameDev .

    • Overall :

      • The person seems nice and I like when he draws things.

      • Unfortunately 95% of the series videos are code in C++ and he does not do a good job explaining the code.

      • I listed some videos below that I considered interesting.

    • Vulkan 1.3.

    • Video 12:

      • Synchronization, with 1 frame in-flight.

      • The drawings are nice.

    • ~Video 13:

      • Multithreaded rendering.

      • Nope. See the Multithreading Rendering section to understand why "nope".

    • Video 26:

      • Barycentric coordinates.

    • Only code, so nope :

      • Videos: 9, 10, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 27, 28, 29.

    • Playlist Vulkan - GetIntoGameDev .

      • Vulkan 1.2, (2022-01-22).

      • Watch the new 2024 version of the tutorials.

      • The person sometimes explains on a sheet of paper, which is nice.

  • Playlist Vulkan - Computer Graphics at TU Wien .

    • Vulkan 1.2.

    • Video 1:

      • SDK, Instances, extensions, physical devices, logical devices.

      • Ok.

    • Video 2:

      • Presentation Modes, Swapchain.

      • {10:20 -> 21:45}

        • Explanation of all Presentation Modes.

    • Video 3:

      • Explanation of Buffers and Images.

      • The explanation seemed a bit rushed and the definitions are poorly established.

      • I can return and rewatch the video after reading the documentation.

    • Video 4:

      • Commands, Command Pools, Command Buffers.

      • Ok, sure.

      • I skipped the descriptor sets part.

    • Video 5:

      • Pipelines.

      • I skipped it.

    • Video 6:

      • Synchronization.

      • Skipped.

    • Impressions :

      • I don't like the illustrations, nor the tone of the explanation.

      • I simply feel I learn more and feel more confident reading the documentation or the spec.

      • The videos are "more technical", but when that is the case documentation is better.

      • I prefer a simpler playlist to learn some basic concepts, and to read the documentation for advanced topics.

  • Playlist Vulkan - Brendan Galea .

    • Vulkan 1.0.

    • C++, with Visual Studio.

    • It's a pain to see C++ code.

    • The sketch explanations in the middle of the videos are ok, but the rest is very bad; all code-related parts are unpleasant and with a LOT of mess in C++.

    • Video 1:

      • Window with GLFW.

    • Video 2:

      • Light explanation of the graphics pipeline.

      • {9:54}

        • Shader compilation, to SPIR-V.

    • Video 20:

      • Descriptor Sets

      • {0:00 -> 5:35} Nice explanation.

      • The rest of the video is nah.

  • Vulkan playlist - Cakez .

    • C++

    • Starts by teaching how to install Visual Studio and Git...

    • Does not use GLFW, instead creates its own platform layer on Windows to create a window.

  • Vulkan playlist - Francesco Piscani .

    • He uses the vulkan-tutorial.

    • Spends the first 4 episodes doing basically nothing, just setting up CMake and Linux.

    • Nope, it sounds bad as a tutorial.

Talks
  • Vulkan in Doom 3 .

    • Use RenderDoc extensively.

    • 1 Render Pass, 1 subpass, 3 attachments.

    • Buffers and Images

    • Allocations:

      • VMA for allocators.

    • 28 shaders + changes => 100 pipelines total at runtime.

    • Synchronization:

      • Not much of it. Doom 3 was single-threaded; it didn't require multithreading.

Samples

  • To run :

    • Clone the repo recursively ( git clone --recursive ).

    • Build the entire solution.

    • The binaries end up in Vulkan-Samples\build\windows\app\bin\debug\AMD64 .

    • Copy the shaders  and assets  folders from Vulkan-Samples  to the folder above.

    • Run .\vulkan_samples sample sample_name .

  • Note :

    • The normal and the hpp samples have the same performance; it does not matter.

  • Impressions :

    • The extension samples were more visually "uninteresting".

    • I saw all API samples, but I didn't see all Extensions.

    • There were still other folders besides these two, but I was lazy to check.

API

  • instancing

    • Wow, awesome.

    • The fps is very high.

  • oit_linked_lists (Order Independent Transparency)

  • oit_depth_peeling (Order Independent Transparency)

    • The object in the center rotates with the mouse.

  • compute_nbody

  • dynamic_uniform_buffers.

  • hdr

    • Allows changing the object, toggling the skybox, changing the exposure, toggling bloom.

  • terrain_tessellation

    • Increasing the tessellation factor made it look like the terrain polycount increased.

  • timestamp_queries

    • Allows changing the object, toggling the skybox, changing the exposure, toggling bloom.

  • separate_image_sampler

    • Allows selecting linear or nearest filtering.

  • texture_loading

    • Allows increasing the LOD bias, reducing image quality.

  • texture_mipmap_generation

    • Allows calibrating the LOD bias, and choosing between mipmap off, bilinear and anisotropic.

  • hello_triangle_1_3 / hello_triangle

    • Nothing special

    • No dynamic resize.

Extensions

  • dynamic_line_rasterization

    • This sample demonstrates functions from various extensions related to dynamic line rasterization.

    • These functions can be useful for developing CAD applications.

    • From the EXT_line_rasterization  extension.

      • vkCmdSetLineStippleEXT  - sets the stipple pattern.

    • From the EXT_extended_dynamic_state3  extension:

      • vkCmdSetPolygonModeEXT  - sets how defined primitives should be rasterized.

      • vkCmdSetLineRasterizationModeEXT  - sets the algorithm for line rasterization.

      • vkCmdSetLineStippleEnableEXT  - toggles stippling for lines.

    • And also from core Vulkan:

      • vkCmdSetLineWidth  - sets the line width.

      • vkCmdSetPrimitiveTopologyEXT  - defines which type of primitive is being drawn.

  • debug utils

    • Toggle bloom, toggle skybox.

    • Uses the EXT_debug_utils  extension to set up a validation layer messenger callback and pass additional debugging information to debuggers like RenderDoc.

    • EXT_debug_utils  has been introduced based on feedback for the initial Vulkan debugging extensions EXT_debug_report  and EXT_debug_marker , combining these into a single instance extension with some added functionality.

    • Procedure examples :

      • vkCmdBeginDebugUtilsLabelEXT

      • vkCmdInsertDebugUtilsLabelEXT

      • vkCmdEndDebugUtilsLabelEXT

      • vkQueueBeginDebugUtilsLabelEXT

      • vkQueueInsertDebugUtilsLabelEXT

      • vkQueueEndDebugUtilsLabelEXT

      • vkSetDebugUtilsObjectNameEXT

      • vkSetDebugUtilsObjectTagEXT

  • conditional_rendering

    • A list of 235 parts of the car, which can be disabled to not render.

    • The EXT_conditional_rendering  extension allows the execution of rendering commands to be conditional based on a value taken from a dedicated conditional buffer.

    • This may help an application reduce latency by conditionally discarding rendering commands without application intervention.

    • This sample demonstrates usage of this extension for conditionally toggling the visibility of sub-meshes of a complex glTF model.

    • Instead of having to update command buffers, this is done by updating the aforementioned buffer.

  • conservative_rasterization

    • Enabling the conservative rasterization option causes this blending effect.

    • EXT_conservative_rasterization  changes the way fragments are generated.

    • Enables overestimation to generate fragments for every pixel touched  instead of only pixels that are fully covered.

  • color_write_enable

    • Color picker to change the background color.

    • Some options for "bit", changing the triangle color.

    • The EXT_color_write_enable  extension allows toggling the output color attachments using a pipeline dynamic state.

    • It allows the program to prepare an additional framebuffer populated with the data from a defined color blend attachment which can be blended dynamically to the final scene.

    • The final results are comparable to those obtained with vkCmdSetColorWriteMaskEXT , but it does not require the GPU driver to support EXT_extended_dynamic_state3 .

  • dynamic_blending

    • This sample demonstrates the functionality of EXT_extended_dynamic_state3  related to blending.

    • It includes the following features:

      • vkCmdSetColorBlendEnableEXT : toggles blending on and off.

      • vkCmdSetColorBlendEquationEXT : modifies blending operators and factors.

      • vkCmdSetColorBlendAdvancedEXT : utilizes more complex blending operators.

      • vkCmdSetColorWriteMaskEXT : toggles individual channels on and off.

  • descriptor_indexing

  • ~descriptor_buffer_basic

    • Just boxes rotating, I didn't understand.

    • Just textures rotating, I didn't understand.

  • dynamic_multisample_rasterization

    • This sample demonstrates one of the functionalities of EXT_extended_dynamic_state3  related to rasterization samples.

    • The extension can be used to dynamically change sampling without the need to swap pipelines.

    • This thing took quite a while to open, generating binary files, etc.

  • dynamic_primitive_clipping

    • This sample demonstrates how to apply depth clipping  using the vkCmdSetDepthClipEnableEXT()  command which is a part of the EXT_extended_dynamic_state3  extension.

    • Additionally, it shows how to apply primitive clipping  using the gl_ClipDistance[]  builtin shader variable.

    • It is worth noting that primitive clipping  and depth clipping  are two separate features of the fixed-function vertex post-processing stage.

    • They're both described in the same chapter of the Vulkan specification (chapter 27.4, "Primitive clipping").

    • What is primitive clipping

      • Primitives produced by vertex/geometry/tessellation shaders are sent to fixed-function vertex post-processing.

      • Primitive clipping is a part of the post-processing pipeline in which primitives such as points/lines/triangles are culled against the cull volume and then clipped to the clip volume.

      • And then they might be further clipped by results stored in the gl_ClipDistance[]  array - values in this array must be calculated in a vertex/geometry/tessellation shader.

      • In the past, the fixed-function version of the OpenGL API provided a method to specify parameters for up to 6 clipping planes (half-spaces) that could perform additional primitive clipping. Fixed-function hardware calculated proper distances to these planes and made a decision - should the primitive be clipped against these planes or not (for historical study - search for the glClipPlane()  description).

      • Vulkan inherited the idea of primitive clipping, but with one important difference: the user has to calculate the distance to the clip planes on their own in the vertex shader.

      • And - because the user does it in a shader - they do not have to use clip planes at all. It can be any kind of calculation, as long as the results are put in the gl_ClipDistance[]  array.

      • Values that are less than 0.0 cause the vertex to be clipped. In the case of a triangle primitive the whole triangle is clipped if all of its vertices have values stored in gl_ClipDistance[]  below 0.0. When some of these values are above 0.0 - the triangle is split into new triangles as described in the Vulkan specification.

    • What is depth clipping

      • When depth clipping is disabled then effectively there is no near or far plane clipping.

      • Depth values of primitives that are behind the far plane are clamped to the far plane depth value (usually 1.0).

      • Depth values of primitives that are in front of the near plane are clamped to the near plane depth value (by default it's 0.0, but may be set to -1.0 if we use settings defined in VkPipelineViewportDepthClipControlCreateInfoEXT  structure. This requires the presence of the EXT_depth_clip_control  extension which is not part of this tutorial).

      • In this sample the result of depth clipping (or lack of it) is not clearly visible at first. Try to move the viewer position closer to the object and see how the "use depth clipping" checkbox changes object appearance.

  • ~buffer_device_address.

    • I didn't understand. It's just things moving.

  • ~calibrated_timestamps

    • timestamp_queries, but with other timings.

Core

Instance / Extensions

Instance
  • VkInstance

    • The Vulkan context, used to access drivers.

  • The instance is the connection between your application and the Vulkan library.

  • VkApplicationInfo .

    • Optional, but it may provide some useful information to the driver to optimize our specific application.

  • VkInstanceCreateInfo .

    • Tells the Vulkan driver which global extensions and validation layers we want to use.

Instance Level Extensions
  • vkEnumerateInstanceExtensionProperties()

    • Retrieve a list of supported extensions before creating an instance.

    • Each VkExtensionProperties  struct contains the name and version of an extension.

Debugging

Validation Layers
  • Layers .

  • Vulkan is designed for high performance and low driver overhead; therefore, it includes very limited error checking and debugging capabilities by default.

  • The driver will often crash  instead of returning an error code if you do something wrong, or worse, it will appear to work on your graphics card and completely fail  on others.

  • Vulkan allows you to enable extensive checks through a feature known as validation layers .

  • Validation layers are pieces of code that can be inserted between the API and the graphics driver to do things like running extra checks on function parameters and tracking memory management problems.

  • The nice thing is that you can enable them during development and then completely disable them when releasing your application for zero overhead. Anyone can write their own validation layers, but the Vulkan SDK by LunarG provides a standard set of validation layers. You also need to register a callback function to receive debug messages from the layers.

  • Because Vulkan is so explicit about every operation and the validation layers are so extensive, it can actually be a lot easier  to find out why your screen is black compared to OpenGL and Direct3D!

  • Common operations in validation layers are:

    • Checking the values of parameters against the specification to detect misuse

    • Tracking the creation and destruction of objects to find resource leaks

    • Checking thread safety by tracking the threads that calls originate from

    • Logging every call and its parameters to the standard output

    • Tracing Vulkan calls for profiling and replaying

  • There were formerly two different types of validation layers in Vulkan: instance  and device  specific.

  • The idea was that instance layers would only check calls related to global Vulkan objects like instances, and device-specific layers would only check calls related to a specific GPU.

  • Device-specific layers have now been deprecated , which means that instance validation layers apply to all Vulkan calls.

  • We don’t really need to check for the existence of the debug utils extension, because it should be implied by the availability of the validation layers.

  • vkEnumerateInstanceLayerProperties

  • RenderDoc :

    • Do not run validation at the same time as RenderDoc, otherwise you'll also be validating RenderDoc.

  • Vulkan Configurator :

    • Overwrites the normal Layer setup.

    • Implicitly loads layers.

    • How to use :

      • RIGHT-CLICK.

  • Performance :

    • Ensure validation layers and debug callbacks are off for performance runs. Use pipeline cache objects to avoid repeated pipeline creation cost.

    • I noticed how each 'push', 'descriptor set bind', 'vertex bind', 'indices bind' and 'draw' was a lot slower with validations on.

Message Callback
  • The validation layers will print debug messages to the standard output by default, but we can also handle them ourselves by providing an explicit callback in our program.

  • This will also allow you to decide which kind of messages you would like to see.

  • messageSeverity

  • messageType

  • pfnUserCallback

    • messageSeverity

      • DEBUG_UTILS_MESSAGE_SEVERITY_VERBOSE_EXT

        • Diagnostic message

      • DEBUG_UTILS_MESSAGE_SEVERITY_INFO_EXT

        • Informational message like the creation of a resource

      • DEBUG_UTILS_MESSAGE_SEVERITY_WARNING_EXT

        • Message about behavior that is not necessarily an error, but very likely a bug in your application

      • DEBUG_UTILS_MESSAGE_SEVERITY_ERROR_EXT

        • Message about behavior that is invalid and may cause crashes.

    • messageType

      • DEBUG_UTILS_MESSAGE_TYPE_GENERAL_EXT

        • Some event has happened that is unrelated to the specification or performance

      • DEBUG_UTILS_MESSAGE_TYPE_VALIDATION_EXT

        • Something has happened that violates the specification or indicates a possible mistake

      • DEBUG_UTILS_MESSAGE_TYPE_PERFORMANCE_EXT

        • Potential non-optimal use of Vulkan

    • pCallbackData

      • Refers to a VkDebugUtilsMessengerCallbackDataEXT  struct containing the details of the message itself, with the most important members being:

      • pMessage

        • The debug message as a null-terminated string

      • pObjects

        • Array of Vulkan object handles related to the message

      • objectCount

        • Number of objects in the array

    • pUserData

      • Contains a pointer specified during the setup of the callback and allows you to pass your own data to it.

Debug Utils ( VK_EXT_debug_utils )
// Give a Vulkan object a human-readable name so it shows up
// in debuggers like RenderDoc and in validation messages.
must(
    vk.SetDebugUtilsObjectNameEXT(
        dev,
        &vk.DebugUtilsObjectNameInfoEXT {
            sType = .DEBUG_UTILS_OBJECT_NAME_INFO_EXT,
            objectType = obj,
            objectHandle = handle,
            pObjectName = strings.clone_to_cstring(name, context.temp_allocator),
        },
    ),
)

Window / Surface / GLFW

Window
  • The Vulkan API itself is completely platform-agnostic, which is why we need to use the standardized WSI (Window System Integration) extensions to interact with the window manager.

  • Windows can be created with the native platform APIs or libraries like GLFW  and SDL .

  • Some platforms allow you to render directly to a display without interacting with any window manager through the KHR_display  and KHR_display_swapchain  extensions.

  • These allow you to create a surface that represents the entire screen and could be used to implement your own window manager, for example.

GLFW
  • GLFW Reference .

  • The very first call in initWindow  should be glfwInit() , which initializes the GLFW library. Because GLFW was originally designed to create an OpenGL context, we need to tell it to not create an OpenGL context with a later call:

  • Because handling resized windows takes special care that we’ll look into later, disable it for now with another window hint call:

glfwWindowHint(GLFW_CLIENT_API, GLFW_NO_API);
glfwWindowHint(GLFW_RESIZABLE, GLFW_FALSE);
  • All that’s left now is creating the actual window. Add a GLFWwindow* window;  private class member to store a reference to it and initialize the window with:

window = glfwCreateWindow(WIDTH, HEIGHT, "Vulkan", nullptr, nullptr);
  • The first three parameters specify the width, height and title of the window. The fourth parameter allows you to optionally specify a monitor to open the window on, and the last parameter is only relevant to OpenGL.

  • Init:

void initWindow() {
    glfwInit();

    glfwWindowHint(GLFW_CLIENT_API, GLFW_NO_API);
    glfwWindowHint(GLFW_RESIZABLE, GLFW_FALSE);

    window = glfwCreateWindow(WIDTH, HEIGHT, "Vulkan", nullptr, nullptr);
}
  • Main loop:

void mainLoop() {
    while (!glfwWindowShouldClose(window)) {
        glfwPollEvents();
    }
}
  • Destroy:

void cleanup() {
    glfwDestroyWindow(window);

    glfwTerminate();
}
Surface
  • A VkSurfaceKHR  is an opaque handle representing a platform-specific presentation target (for example, a window on Windows, an X11 window on Linux, or a UIView on iOS). It is created directly from the Vulkan instance together with a native window handle. Conceptually, a surface is:

    • Instance-level: it lives above any physical or logical device.

    • Window abstraction: it wraps the OS window or drawable so that Vulkan knows where to submit images for display.

    • Device-agnostic: you can create a surface before choosing which GPU you will use.

  • Once created, the surface is used by a chosen physical device to query presentation support, formats and capabilities, and then by the logical device to build a Swapchain.

  • A surface itself is not intrinsically tied to any particular physical or logical device, because:

    • Creation: you call vkCreateSurfaceKHR(instance, …)  without involving a VkPhysicalDevice  or VkDevice  handle.

    • Lifetime: it exists even before you pick or create a device, and you destroy it with vkDestroySurfaceKHR(instance, surface, …) .

  • Lifetime :

    • The surface is tied to the GLFW window's lifecycle.

    • It does not change  when the window is resized, minimized, or restored.

    • The same surface handle remains valid until you destroy it (e.g., when closing the window).

  • "Window surfaces are part of the larger topic of render targets and presentation".

  • Surface Formats .

Extensions
  • To establish the connection between Vulkan and the window system to present results to the screen, we need to use the WSI (Window System Integration) extensions.

  • The KHR_surface  exposes a VkSurfaceKHR  object that represents an abstract type of surface to present rendered images to.

  • The surface in our program will be backed by the window that we’ve already opened with GLFW.

  • The KHR_surface  extension is an instance level extension, and we’ve actually already enabled it, because it’s included in the list returned by glfwGetRequiredInstanceExtensions . The list also includes some other WSI extensions that we’ll use in the next couple of chapters.

  • The window surface needs to be created right after  the instance creation, because it can actually influence the physical device selection.

  • It should also be noted that window surfaces are an entirely optional component in Vulkan if you just need off-screen rendering.

    • Vulkan allows you to do that without hacks like creating an invisible window (necessary for OpenGL).

  • Vulkan also allows you to render headlessly on a non-presenting GPU, render remotely over the internet, or run compute acceleration for AI without a render or presentation target.

  • Although the VkSurfaceKHR  object and its usage is platform-agnostic, its creation isn’t because it depends on window system details. For example, it needs the HWND  and HMODULE  handles on Windows. Therefore, there is a platform-specific addition to the extension, which on Windows is called KHR_win32_surface  and is also automatically included in the list from glfwGetRequiredInstanceExtensions .

  • GLFW actually has glfwCreateWindowSurface  that handles the platform differences for us.

Blocking the thread
  • Difficulties due to GLFW .

  • A callback glfw.SetWindowRefreshCallback  allows the swapchain to be recreated while resizing.

    • See [[#Swapchain Recreation]].

Physical Device / Logical Device

Physical Device
  • VkPhysicalDevice

  • A GPU. Used to query physical GPU details, like features, capabilities, memory size, etc.

Device Level Extensions
Queue Families
  • Most operations performed with Vulkan, like draw commands and memory operations, are asynchronously executed by submitting them to a VkQueue .

  • Queues are allocated from queue families, where each queue family supports a specific set of operations in its queues.

    • For example, there could be separate queue families for graphics, compute and memory transfer operations.

  • The availability of queue families could also be used as a distinguishing factor in physical device selection.

    • It is possible for a device with Vulkan support to not offer any graphics functionality; however, all graphics cards with Vulkan support today will generally support all queue operations that we’re interested in.

  • We need to check which queue families are supported by the device and which one of these supports the commands that we want to use.

Presentation support
  • Although the Vulkan implementation may support window system integration, that does not mean that every device in the system supports it. Therefore, we need to extend createLogicalDevice  to ensure that a device can present images to the surface we created.

  • Since the presentation is a queue-specific feature, the problem is actually about finding a queue family that supports presenting to the surface we created.

  • It’s actually possible that the queue families supporting drawing  commands and the queue families supporting presentation  do not  overlap.

    • It’s very likely that these end up being the same queue family after all, but throughout the program we will treat them as if they were separate queues for a uniform approach.

    • Nevertheless, you could add logic to explicitly prefer a physical device that supports drawing and presentation in the same queue for improved  performance.

  • Therefore, we have to take into account that there could be a distinct presentation queue.

  • We’ll look for a queue family that has the capability of presenting to our window surface. The function to check for that is vkGetPhysicalDeviceSurfaceSupportKHR , which takes the physical device, queue family index and surface as parameters.

  • It should be noted that the availability of a presentation queue, as we checked in the previous chapter, implies that the Swapchain extension must be supported. However, the extension does have to be explicitly  enabled.

  • Not all graphics cards are capable of presenting images directly to a screen for various reasons, for example, because they are designed for servers and don’t have any display outputs. Secondly, since image presentation is heavily tied into the window system and the surfaces associated with windows, it is not part of the Vulkan core. You have to enable the KHR_swapchain  device extension after querying for its support.

Surface Capabilities
  • The extents can change when resizing, so you should re-query the surface properties each time. Note that if the current extent is {UINT32_MAX, UINT32_MAX}  (happens on some platforms), the surface lets the application pick the size, and you need to ask the windowing system for an appropriate one (with GLFW, glfwGetFramebufferSize  returns the window size in pixels).

Logical Device
  • VkDevice

  • The “logical” GPU context that you actually execute things on.

  • Where you describe more specifically which VkPhysicalDeviceFeatures you will be using, like multi viewport rendering and 64-bit floats.

  • You also need to specify which queue families you would like to use.

Queues
  • Queues .

  • VkQueue

    • Execution “port” for commands.

    • GPUs will have a set of queues with different properties.

      • Some allow only graphics commands, others only allow memory commands, etc.

    • Command buffers are executed by submitting them into a queue, which will copy the rendering commands onto the GPU for execution.

  • The queues are automatically created along with the logical device, but we don’t have a handle to interface with them yet.

  • Device queues are implicitly cleaned up when the device is destroyed.

  • We can use the vkGetDeviceQueue  function to retrieve queue handles for each queue family. The parameters are the logical device, queue family, queue index and a pointer to the variable to store the queue handle in. Because we’re only creating a single queue from this family, we’ll simply use index 0 .

  • Vulkan Guide:

    • It is common to see engines using 3 queue families:

      • One for drawing the frame, another for async compute, and another for data transfer.

    • In this tutorial, we use a single queue that will run all our commands for simplicity.

Multi-queue
  • .

  • Some hardware only has one queue.

Render Loop

  • Now that everything is ready for rendering, you first ask the VkSwapchainKHR  for an image to render to. Then you allocate a VkCommandBuffer  from a VkCommandPool  or reuse an already allocated command buffer that has finished execution, and “start” the command buffer, which allows you to write commands into it.

  • Next, you begin rendering by using Dynamic Rendering.

  • Then create a loop where you bind a VkPipeline , bind some VkDescriptorSet  resources (for the shader parameters), bind the vertex buffers, and then execute a draw call.

  • If there is nothing more to render, you end the VkCommandBuffer . Finally, you submit the command buffer into the queue for rendering. This begins execution of the commands in the command buffer on the GPU. If you want to display the result, you “present” the rendered image to the screen. Because execution may not have finished yet, you use a semaphore to make the presentation wait until rendering is finished.

  • At a high level, rendering a frame in Vulkan consists of a common set of steps:

    • Wait for the previous frame to finish

    • Acquire an image from the Swapchain

    • Record a command buffer which draws the scene onto that image

      • Re-recording the command buffer every frame has a negligible performance cost.

    • Submit the recorded command buffer

      • This is the step that actually costs performance.

    • Present the Swapchain image

      • Puts it up on the screen.
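The steps above can be traced with a minimal sketch; renderOneFrame and the step strings are purely illustrative stand-ins, with the real Vulkan calls named in the comments:

```cpp
#include <string>
#include <vector>

// Illustrative stand-in: returns the ordered steps of one frame. The comments
// name the real Vulkan calls each step corresponds to.
std::vector<std::string> renderOneFrame() {
    std::vector<std::string> trace;
    trace.push_back("wait for previous frame fence");  // vkWaitForFences
    trace.push_back("acquire swapchain image");        // vkAcquireNextImageKHR
    trace.push_back("record command buffer");          // vkBeginCommandBuffer ... vkEndCommandBuffer
    trace.push_back("submit command buffer");          // vkQueueSubmit
    trace.push_back("present swapchain image");        // vkQueuePresentKHR
    return trace;
}
```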

Swapchain

  • Vulkan does not have the concept of a "default framebuffer," hence it requires an infrastructure that will own the buffers we will render to before we visualize them on the screen.

  • This infrastructure is known as the swapchain  and must be created explicitly in Vulkan.

  • The Swapchain is essentially a queue of images that are waiting to be presented to the screen.

  • Our application will acquire such an image to draw to it, and then return it to the queue.

  • The conditions for presenting an image from the queue depend on how the Swapchain is set up.

  • The general purpose of the Swapchain is to synchronize the presentation of images with the refresh rate of the screen.

    • This is important to make sure that only complete images are shown.

  • Every time we want to draw a frame, we have to ask the Swapchain to provide us with an image to render to. When we’ve finished drawing a frame, the image is returned to the Swapchain for it to be presented to the screen at some point.

  • "Is a collection of render targets".

    • Render Targets is not a well-defined term.

  • The number of render targets and conditions for presenting finished images to the screen depends on the present mode.

  • VkSwapchainKHR

    • Holds the images for the screen.

    • It allows you to render things into a visible window.

    • The KHR  suffix shows that it comes from an extension, which in this case is KHR_swapchain .

  • Swapchains .

    • Good video.

    • Pre-rotate on mobile.

    • When to recreate, recreation problems, recreation strategies, maintenance.

    • Present modes.

  • Support :

    • There are basically three kinds of properties we need to check:

      • Basic surface capabilities (min/max number of images in Swapchain, min/max width and height of images)

      • Surface formats (pixel format, color space)

      • Available presentation modes

    • It is important that we only try to query for Swapchain support after verifying that the extension is available.

Swapchain Creation
  • VkSwapchainCreateInfoKHR .

    • surface

      • Is the surface onto which the swapchain will present images. If the creation succeeds, the swapchain becomes associated with surface .

    • minImageCount

      • We also have to decide how many images we would like to have in the Swapchain. Simply sticking to the minimum means that we may sometimes have to wait on the driver to complete internal operations before we can acquire another image to render to. Therefore, it is recommended to request at least one more image than the minimum:

      uint32_t imageCount = surfaceCapabilities.minImageCount + 1;
      
      • We should also make sure to not exceed the maximum number of images while doing this, where 0  is a special value that means that there is no  maximum

      if (surfaceCapabilities.maxImageCount > 0 && imageCount > surfaceCapabilities.maxImageCount) {
          imageCount = surfaceCapabilities.maxImageCount;
      }
      
    • imageFormat

      • For the color space we’ll use SRGB if it is available, because it results in more accurate perceived colors . It is also pretty much the standard color space for images, like the textures we’ll use later on.

      • Because of that we should also use an SRGB color format, of which one of the most common ones is FORMAT_B8G8R8A8_SRGB .
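A format chooser along these lines is common. The SurfaceFormat struct and the constants below are simplified stand-ins for VkSurfaceFormatKHR and the vulkan_core.h enum values; in real code the candidate list comes from vkGetPhysicalDeviceSurfaceFormatsKHR:

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for VkSurfaceFormatKHR.
struct SurfaceFormat {
    uint32_t format;      // stand-in for VkFormat
    uint32_t colorSpace;  // stand-in for VkColorSpaceKHR
};

constexpr uint32_t FORMAT_B8G8R8A8_SRGB       = 50;  // mirrors VK_FORMAT_B8G8R8A8_SRGB
constexpr uint32_t COLOR_SPACE_SRGB_NONLINEAR = 0;   // mirrors VK_COLOR_SPACE_SRGB_NONLINEAR_KHR

// Prefer B8G8R8A8_SRGB with the sRGB non-linear color space; otherwise fall
// back to the first format the surface offers, which is always a valid choice.
SurfaceFormat chooseSurfaceFormat(const std::vector<SurfaceFormat>& available) {
    for (const SurfaceFormat& f : available) {
        if (f.format == FORMAT_B8G8R8A8_SRGB &&
            f.colorSpace == COLOR_SPACE_SRGB_NONLINEAR) {
            return f;
        }
    }
    return available.front();
}
```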

    • imageColorSpace

      • Is a VkColorSpaceKHR  value specifying the way the swapchain interprets image data.

    • imageExtent

      • Is the size (in pixels) of the swapchain image(s).

      • The swap extent is the resolution  of the Swapchain images. It’s almost always exactly equal to the resolution of the window that we’re drawing to in pixels .

      • The range of the possible resolutions is defined in the VkSurfaceCapabilitiesKHR  structure.

      • On some platforms, it is normal that maxImageExtent   may  become (0, 0) , for example when the window is minimized. In such a case, it is not possible to create a swapchain due to the Valid Usage requirements , unless scaling is selected through VkSwapchainPresentScalingCreateInfoKHR , if supported .

      • We’ll pick the resolution that best matches the window within the minImageExtent  and maxImageExtent  bounds. But we must specify the resolution in the correct unit.

      • GLFW uses two units when measuring sizes: pixels and screen coordinates . For example, the resolution {WIDTH, HEIGHT}  that we specified earlier when creating the window is measured in screen coordinates. But Vulkan works with pixels, so the Swapchain extent must be specified in pixels as well.

      • Unfortunately, if you are using a high DPI display (like Apple’s Retina display), screen coordinates don’t correspond to pixels. Instead, due to the higher pixel density, the resolution of the window in pixels will be larger than the resolution in screen coordinates. So if Vulkan doesn’t fix the swap extent for us, we can’t just use the original {WIDTH, HEIGHT} . Instead, we must use glfwGetFramebufferSize  to query the resolution of the window in pixels before matching it against the minimum and maximum image extent.

      • The surface capabilities change every time the window resizes, and they are only used for creating the Swapchain, so it doesn't make sense to cache them.
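The extent selection described above reduces to a clamp. chooseSwapExtent is a hypothetical helper over a stand-in Extent2D struct; real code reads these bounds from VkSurfaceCapabilitiesKHR and gets the framebuffer size in pixels from glfwGetFramebufferSize:

```cpp
#include <algorithm>
#include <cstdint>

// Stand-in for VkExtent2D.
struct Extent2D { uint32_t width, height; };

// If currentExtent is the special value {UINT32_MAX, UINT32_MAX}, the surface
// lets the application pick; we clamp the framebuffer size (in pixels) into
// the [minImageExtent, maxImageExtent] range. Otherwise the surface dictates
// the extent and we must use it as-is.
Extent2D chooseSwapExtent(Extent2D currentExtent, Extent2D minExtent,
                          Extent2D maxExtent, Extent2D framebufferSize) {
    if (currentExtent.width != UINT32_MAX) {
        return currentExtent;  // fixed by the surface
    }
    return Extent2D{
        std::clamp(framebufferSize.width,  minExtent.width,  maxExtent.width),
        std::clamp(framebufferSize.height, minExtent.height, maxExtent.height),
    };
}
```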

    • imageUsage

    • imageSharingMode  (Handling multiple queues):

      • We need to specify how to handle Swapchain images that will be used across multiple queue families. That will be the case in our application if the graphics queue family is different from the presentation queue. We’ll be drawing on the images in the Swapchain from the graphics queue and then submitting them on the presentation queue. There are two ways to handle images that are accessed from multiple queues:

        • SHARING_MODE_EXCLUSIVE :

          • An image is owned by one queue family at a time, and ownership must be explicitly transferred before using it in another queue family.

          • This option offers the best  performance.

        • SHARING_MODE_CONCURRENT :

          • Images can be used across multiple queue families without explicit ownership transfers.

          • Concurrent mode requires you to specify in advance between which queue families ownership will be shared using the queueFamilyIndexCount  and pQueueFamilyIndices  parameters.

      • If the queue families differ, then we’ll be using concurrent mode in this tutorial to avoid having to perform ownership transfers, because these involve some concepts that are better explained at a later time.

      • If the graphics queue family and presentation queue family are the same, which will be the case on most hardware, then we should stick to exclusive mode. Concurrent mode requires you to specify at least two distinct queue families.
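The rule above can be sketched as a small decision helper; SharingInfo and the constants are stand-ins for the VkSharingMode enum and the queueFamilyIndexCount / pQueueFamilyIndices fields of VkSwapchainCreateInfoKHR:

```cpp
#include <cstdint>
#include <vector>

// Mirrors VK_SHARING_MODE_* values.
constexpr uint32_t SHARING_MODE_EXCLUSIVE  = 0;
constexpr uint32_t SHARING_MODE_CONCURRENT = 1;

// Stand-in for the sharing-related fields of VkSwapchainCreateInfoKHR; the
// index list feeds queueFamilyIndexCount / pQueueFamilyIndices.
struct SharingInfo {
    uint32_t imageSharingMode;
    std::vector<uint32_t> queueFamilyIndices;
};

// Concurrent mode only when the graphics and present families differ (it
// requires at least two distinct families); otherwise exclusive mode, which
// needs no index list.
SharingInfo chooseSharingMode(uint32_t graphicsFamily, uint32_t presentFamily) {
    if (graphicsFamily != presentFamily) {
        return {SHARING_MODE_CONCURRENT, {graphicsFamily, presentFamily}};
    }
    return {SHARING_MODE_EXCLUSIVE, {}};
}
```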

    • queueFamilyIndexCount

      • Is the number of queue families having access to the image(s) of the swapchain when imageSharingMode  is SHARING_MODE_CONCURRENT .

    • pQueueFamilyIndices

      • Is a pointer to an array of queue family indices having access to the images(s) of the swapchain when imageSharingMode  is SHARING_MODE_CONCURRENT .

    • imageArrayLayers

      • Is the number of views in a multiview/stereo surface. For non-stereoscopic-3D applications, this value is 1.

    • presentMode

    • preTransform

      • We can specify that a certain transform should be applied to images in the Swapchain if it is supported ( supportedTransforms  in capabilities ), like a 90-degree clockwise rotation or horizontal flip. To specify that you do not want any transformation, simply specify the current transformation.

      • IDENTITY

        • This would not  be optimal on devices that support rotation and will lead to measurable performance loss.

        • It is strongly recommended that surface_properties.currentTransform  be used instead. However, the application is required to handle preTransform  elsewhere accordingly.

    • compositeAlpha

      • Specifies if the alpha channel should be used for blending with other windows in the window system.

      • You’ll almost always want to simply ignore the alpha channel, hence OPAQUE .

    • clipped

      • If set to TRUE , then that means that we don’t care about the color of pixels that are obscured, for example, because another window is in front of them.

      • Unless you really need to be able to read these pixels back and get predictable results, you’ll get the best performance by enabling clipping.

    • oldSwapchain

      • Can be an existing non-retired  swapchain currently associated with surface , or NULL_HANDLE .

      • If the oldSwapchain  is NULL_HANDLE :

        1. And if the native window referred to by pCreateInfo->surface  is already associated with a Vulkan swapchain, ERROR_NATIVE_WINDOW_IN_USE   must  be returned.

      • If the oldSwapchain  is valid:

        1. This may  aid in the resource reuse, and also allows the application to still present any images that are already acquired from it.

        2. And the oldSwapchain  has exclusive full-screen access, that access is released from pCreateInfo->oldSwapchain . If the command succeeds in this case, the newly created swapchain will automatically acquire exclusive full-screen access from pCreateInfo->oldSwapchain .

        3. And there are outstanding calls to vkWaitForPresent2KHR , then vkCreateSwapchainKHR   may  block until those calls complete.

        4. Any images from oldSwapchain  that are not acquired by the application may  be freed by the implementation, upon calling vkCreateSwapchainKHR , which may  occur even if creation of the new swapchain fails.

        5. The oldSwapchain  will be retired upon calling vkCreateSwapchainKHR , even if creation of the new swapchain fails.

          • After oldSwapchain  is retired, the application can  pass to vkQueuePresentKHR  any images it had already acquired from oldSwapchain .

            • An application may present an image from the old swapchain before an image from the new swapchain is ready to be presented.

            • As usual, vkQueuePresentKHR   may  fail if oldSwapchain  has entered a state that causes ERROR_OUT_OF_DATE  to be returned.

        6. The application can  continue to use a shared presentable image obtained from oldSwapchain  until a presentable image is acquired from the new swapchain, as long as it has not entered a state that causes it to return ERROR_OUT_OF_DATE .

        7. The application can  destroy oldSwapchain  to free all memory associated with oldSwapchain .

      • Regardless if the oldSwapchain  is valid or not:

        1. The new swapchain is created in the non-retired  state.

    • flags

      • Is a bitmask of VkSwapchainCreateFlagBitsKHR  indicating parameters of the swapchain creation.

      • SWAPCHAIN_CREATE_DEFERRED_MEMORY_ALLOCATION_EXT

        • When EXT_swapchain_maintenance1  is available, you can optionally amortize the cost of swapchain image allocations over multiple frames.

        • When this is used, image views cannot be created until the first time the image is acquired.

          • The idea is that normally the images and image views are acquired when a Swapchain recreation happens, but if this flag is enabled, they must instead be created after vkAcquireNextImageKHR  returns SUCCESS  or SUBOPTIMAL_KHR  for that image.

Present Modes
  • Common present modes are double buffering (vsync) and triple buffering.

  • The presentation mode is arguably the most important setting for the Swapchain, because it represents the actual conditions for showing images to the screen. There are four possible modes available in Vulkan:

    • PRESENT_MODE_IMMEDIATE_KHR

      • Images submitted by your application are transferred to the screen right away, which may result in tearing.

    • PRESENT_MODE_FIFO_KHR

      • The Swapchain is a queue where the display takes an image from the front of the queue when the display is refreshed, and the program inserts rendered images at the back of the queue. If the queue is full, then the program has to wait. This is most similar to vertical sync as found in modern games. The moment that the display is refreshed is known as "vertical blank".

    • PRESENT_MODE_FIFO_RELAXED_KHR

      • This mode only differs from the previous one if the application is late and the queue was empty at the last vertical blank. Instead of waiting for the next vertical blank, the image is transferred right away when it finally arrives. This may result in visible tearing.

    • PRESENT_MODE_MAILBOX_KHR

      • This is another variation of the second mode. Instead of blocking the application when the queue is full, the images that are already queued are simply replaced with the newer ones. This mode can be used to render frames as fast as possible while still avoiding tearing, resulting in fewer latency issues than standard vertical sync. This is commonly known as "triple buffering," although the existence of three buffers alone does not necessarily mean that the framerate is unlocked.

  • Only the PRESENT_MODE_FIFO_KHR  mode is guaranteed to be available, so we’ll again have to write a function that looks for the best mode that is available:
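A sketch of such a function, using stand-in constants in place of the VkPresentModeKHR enum (the candidate list would come from vkGetPhysicalDeviceSurfacePresentModesKHR):

```cpp
#include <cstdint>
#include <vector>

// Stand-ins mirroring VkPresentModeKHR values.
constexpr uint32_t PRESENT_MODE_IMMEDIATE = 0;
constexpr uint32_t PRESENT_MODE_MAILBOX   = 1;
constexpr uint32_t PRESENT_MODE_FIFO      = 2;

// Prefer MAILBOX when the surface offers it; FIFO is the only mode the spec
// guarantees to be available, so it is the fallback.
uint32_t choosePresentMode(const std::vector<uint32_t>& available) {
    for (uint32_t mode : available) {
        if (mode == PRESENT_MODE_MAILBOX) {
            return mode;
        }
    }
    return PRESENT_MODE_FIFO;
}
```

On mobile you would likely skip the MAILBOX preference entirely and return FIFO, per the energy-usage note below.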

  • .

  • Options :

    • I think that PRESENT_MODE_MAILBOX_KHR  is a very nice trade-off if energy usage is not a concern. It allows us to avoid tearing while still maintaining fairly low latency by rendering new images that are as up to date as possible right until the vertical blank.

    • On mobile devices, where energy usage is more important, you will probably want to use PRESENT_MODE_FIFO_KHR  instead.

    • .

    • .

      • Slide from the Samsung talk on (2025-02-25).

      • It recommends FIFO and says that mailbox is not as good as it seems because it induces a lot of stutter.

Drawing directly to the Swapchain vs Blitting to the Swapchain
  • Source .

  • Drawing directly into the swapchain :

    • Is fine for many projects, and it can even be optimal in some cases such as phones.

    • Restrictions :

      • Their resolution is fixed to whatever your window size is.

        • If you want to have higher or lower resolution, and then do some scaling logic, you need to draw into a different image.

        • Swapchain image size (imageExtent / surface extent) is part of swapchain creation and is tied to the surface. If you want an internal render at a different resolution (supersampling, dynamic resolution, lower-res upscaling), you create an offscreen image/render-target at the desired size and then copy/blit/resolve/tone-map into the swapchain image for presentation. The spec and WSI notes treat imageExtent as the surface-presentable size.

      • The formats of the image used in the swapchain are not guaranteed.

        • Different OS, drivers, and windowing modes can have different optimal swapchain formats.

        • The WSI model exposes the surface’s supported formats to the application via vkGetPhysicalDeviceSurfaceFormatsKHR  (or equivalent WSI queries); the returned list is implementation- and surface-dependent, so you must choose from what the platform/driver exposes. That means formats available for swapchains vary by OS, driver, and surface.

        • Vulkan explicitly states this via VkSurfaceFormatKHR  and vkGetPhysicalDeviceSurfaceFormatsKHR . The specification (Section 30.5 "WSI Swapchain", Vulkan 1.3.275) and tutorials emphasize that the application must query and choose from available formats supported by the surface/device combination. Android documentation (Vulkan on Android) and Windows (DXGI_FORMAT) similarly highlight platform-specific format requirements and HDR needs (e.g., FORMAT_A2B10G10R10_UNORM_PACK32  or DXGI_FORMAT_R10G10B10A2_UNORM  for HDR10). This variability makes direct rendering inflexible.

      • HDR support needs its own very specific formats.

        • HDR output requires specific color formats and color-space metadata (examples: 10-bit packed UNORM formats or explicit HDR color-space support such as ST2084/Perceptual Quantizer). WSI and sample repos treat HDR as a distinct case (e.g. A2B10G10 formats and HDR color spaces). Support is platform- and driver-dependent.

        • HDR Sample discussion .

      • Swapchain formats are, for the most part, low precision.

        • Some platforms with High Dynamic Range rendering have higher precision formats, but you will often default to 8 bits per color.

        • So if you want high precision light calculations, systems that would prevent banding, or to be able to go past 1.0 on the normalized color range, you will need a separate image for drawing.

          • HDR/high-dynamic-range lighting typically uses floating-point or extended-range render targets (e.g. R16G16B16A16_SFLOAT  or higher) for intermediate lighting accumulation; final tonemapping reduces values into the presentable format. Because presentable swapchain images are often limited (8-bit), the offscreen high-precision image plus a conversion/tonemap pass is the usual pattern.

        • Many surfaces expose 8-bit UNORM or sRGB formats (e.g. B8G8R8A8_UNORM / SRGB ) as commonly returned swapchain formats. Higher-precision formats (16-bit float per channel or 10-bit packed) exist and are used for HDR/high-precision pipelines, but they are not guaranteed by every surface/driver. Therefore applications that need high-precision lighting/accumulation commonly render into a 16-bit-float render target and tonemap/convert for presentation.

        • Banding artifacts in gradients or low-light scenes are a well-known consequence of limited precision. High-precision rendering (HDR, complex lighting, deferred shading G-Buffers) requires formats like FORMAT_R16G16B16A16_SFLOAT  (RGBA16F) to store values outside the [0.0, 1.0] range and prevent banding. While some  swapchains can  support HDR formats (e.g., 10:10:10:2), they are less universally available and not the default. Using RGBA16F directly in a swapchain is often unsupported or inefficient for presentation.

  • Drawing to a different image and copying/blitting to the swapchain image :

    • Advantages :

      • Decouples tonemapping from presentation timing

        • Tonemap into an intermediate LDR image that you control. You can finish the tonemap pass earlier and defer the actual transfer/present of the swapchain image to a later point, reducing risk of stalling the present path or blocking on swapchain ownership.

      • Avoids writing directly to the swapchain

        • Writing directly into the swapchain can introduce stalls (wait-for-acquire or present-time synchronization). Using an intermediate LDR image lets you do the heavy work off-swapchain and only do a cheap transfer/present step when convenient.

      • Enables batching / chaining of postprocesses without touching the swapchain

        • If you need further LDR processing (dithering, temporal AA, UI composite, overlays, readback for screenshots, or additional filters), do those against the intermediate image. This allows composing multiple passes without repeatedly transitioning the swapchain.

      • Easier support for multiple outputs or different sizes/formats

        • You can tonemap once to an LDR image and then blit/copy to different-size or different-format targets (screenshots, streaming encoder, secondary displays) without re-running tonemap.

      • Allows use of transient/optimized memory for the intermediate

        • The intermediate image can be created as transient (e.g., MEMORY_PROPERTY_LAZILY_ALLOCATED  or tiled transient attachment) to reduce memory pressure and bandwidth compared with always keeping a full persistent LDR buffer.

      • Better control over final conversion semantics

        • In shader you control quantization, gamma conversion, ordered/temporal dithering, and color-space tagging. After producing the controlled LDR image you can choose the transfer method (exact copy vs scaled blit) that matches target capabilities, improving visual consistency across vendors.

      • Improved cross-queue / async workflows

        • You can produce the LDR image on a graphics/compute queue and then perform a transfer on a transfer-only queue (or use a dedicated present queue) with explicit ownership transfers, possibly improving throughput if hardware supports it.

      • Facilitates deterministic screenshots / capture

        • Saving an intermediate LDR image for file export is safer (format/bit-depth known) than capturing the swapchain which may have platform-specific transforms applied.

    • Trade-offs :

      • Extra GPU memory usage

        • You need memory for the intermediate LDR image (unless you use transient attachments), which increases resident memory footprint.

      • Extra GPU bandwidth and a copy step

        • Creating an LDR image then copying/blitting to the swapchain costs memory bandwidth and GPU cycles. This can increase frame time if the transfer is on the critical path.

      • More layout transitions and synchronization complexity

        • You must manage transitions and possibly ownership transfers (if different queues are used). Incorrect synchronization can cause stalls or correctness bugs.

      • Potential increased latency if done poorly

        • If the copy/blit is done synchronously right before present, it can add latency compared with rendering directly to the swapchain; the intended decoupling only helps if scheduling is arranged to avoid the critical path.

      • Implementation complexity

        • Managing an extra render target, transient allocation, and copy logic is more code than rendering directly to the swapchain.

Swapchain Recreation

When to recreate
  • If the window surface changed such that the Swapchain is no longer compatible with it.

  • If the window resizes.

  • If the window minimizes.

    • This case is special because it will result in a framebuffer size of 0 .

    • We can handle by waiting for the framebuffer size to be back to something greater than 0 , indicating that the window is no longer minimized.
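The minimized-window wait can be sketched as a loop; getFramebufferSize and waitEvents are hypothetical stand-ins for glfwGetFramebufferSize and glfwWaitEvents:

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Blocks swapchain recreation while the window is minimized (framebuffer size
// 0x0) and returns the first non-zero size. getFramebufferSize stands in for
// glfwGetFramebufferSize, waitEvents for glfwWaitEvents.
std::pair<uint32_t, uint32_t> waitWhileMinimized(
        const std::function<std::pair<uint32_t, uint32_t>()>& getFramebufferSize,
        const std::function<void()>& waitEvents) {
    std::pair<uint32_t, uint32_t> size = getFramebufferSize();
    while (size.first == 0 || size.second == 0) {
        waitEvents();  // sleep until the windowing system reports a change
        size = getFramebufferSize();
    }
    return size;
}
```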

  • If the swapchain image format changed during an application's lifetime, for example, when moving a window from a standard range to a high dynamic range monitor.

Finding out that a recreation is needed
  • The vkAcquireNextImageKHR  and vkQueuePresentKHR  functions can return the following special values to indicate this.

    • ERROR_OUT_OF_DATE_KHR

      • The Swapchain has become incompatible with the surface and can no longer be used for rendering. Usually happens after a window resize.

    • SUBOPTIMAL_KHR

      • The Swapchain can still be used to successfully present to the surface, but the surface properties are no longer matched exactly.

      • You should ALWAYS  recreate the swapchain if the result is suboptimal.

      • This result means that it's a "success" but there will be performance penalties.

      • Both SUCCESS  and SUBOPTIMAL_KHR  are considered "success" return codes.

  • If the Swapchain turns out to be out of date when attempting to acquire an image, then it is no longer possible to present to it. Therefore, we should immediately recreate the Swapchain and try again in the next drawFrame  call.

  • You could also decide to do that if the Swapchain is suboptimal, but I’ve chosen to proceed anyway in that case because we’ve already acquired an image.

result = presentQueue.presentKHR( presentInfoKHR );
if (result == vk::Result::eErrorOutOfDateKHR || result == vk::Result::eSuboptimalKHR || framebufferResized) {
    framebufferResized = false;
    recreateSwapChain();
} else if (result != vk::Result::eSuccess) {
    throw std::runtime_error("failed to present Swapchain image!");
}

currentFrame = (currentFrame + 1) % MAX_FRAMES_IN_FLIGHT;
  • The vkQueuePresentKHR  function returns the same values with the same meaning. In this case, we will also recreate the Swapchain if it is suboptimal, because we want the best possible result.

  • Finding out explicitly :

    • Although many drivers and platforms trigger ERROR_OUT_OF_DATE_KHR  automatically after a window resize, it is not guaranteed to happen.

    • That’s why we’ll add some extra  code to also handle resizes explicitly:

      glfw.SetWindowUserPointer(vulkan_context.glfw_window, vulkan_context)
      glfw.SetFramebufferSizeCallback(vulkan_context.glfw_window, proc "c" (window: glfw.WindowHandle, _, _: i32) {
          vulkan_context := cast(^Vulkan_Context)glfw.GetWindowUserPointer(window)
          vulkan_context.glfw_framebuffer_resized = true
      })
      
    • "Usually it's not the best idea to depend on this".

      • Problems with multithreading.

      • You depend on the windowing system to notify changes correctly; this can be really tricky on mobile.

Recreating
void recreateSwapChain() {
    device.waitIdle();

    cleanupSwapChain();

    createSwapChain();
    createImageViews();
}
  • Synchronization :

    1. ~Flush and Recreate:

      • "We first call vkDeviceWaitIdle , because just like in the last chapter, we shouldn’t touch resources that may still be in use."

        • This is not enough.

        • .

      • The whole app has to stop and wait for synchronization.

      • .

      • .

    2. Recreate and check:

      • .

      • You do not  need to stop your rendering at any given point.

      • The reason why you are allowed to pass the old swapchain when recreating the new swapchain, is due to this strategy.

      • This is the recommendation.

      • Strategy .

        • This issue is resolved by deferring the destruction of the old swapchain and its remaining present semaphores to the time when the semaphore corresponding to the first present of the new swapchain can be destroyed. Because once the first present semaphore of the new swapchain can be destroyed, the first present operation of the new swapchain is done, which means the old swapchain is no longer being presented.

        • The destruction of both old swapchains must now be deferred to when the first QP of the new swapchain has been processed. If an application resizes the window constantly and at a high rate, we would keep accumulating old swapchains and not free them until it stops.

          • This potentially accumulates a lot of memory, I think.

        • So what's the correct moment then? Only after the new swapchain has completed one full cycle of presentations, that is, when I acquire image index 0  for the second  time.

      • Analysis :

        • (2025-08-19)

        • Holy, now I understand the problem.

        • I cannot delete anything from the old swapchain until I am sure that everything from the previous one has been presented. I thought that by acquiring the first image of the new swapchain, that would already indicate that it was safe to delete the old swapchain, but that's not true; by doing that, I only guarantee that 1 (ONE) image from the old swapchain has been presented, but the old swapchain may have several images in the queue.

        • However, as made clear, that is not the case.

        • Dealing with this can be a nightmare. Potentially having to handle multiple old swapchains at the same time in case of very frequent resizes (smooth swapchain).
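One way to implement the deferred destruction described above is a small retirement queue. RetiredSwapchainQueue is a hypothetical sketch: handles are plain integers standing in for VkSwapchainKHR, and the "safe frame" signal must come from the strategy above (e.g. when the first present of the new swapchain is known to be done):

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// A retired swapchain together with the frame index at which it was replaced.
struct RetiredSwapchain {
    uint64_t handle;        // stand-in for VkSwapchainKHR
    uint64_t retiredFrame;  // frame index when it was replaced
};

// Accumulates retired swapchains (frequent resizes can stack several) and
// releases every one that was retired at or before a caller-supplied safe
// frame. In real code, each released handle goes to vkDestroySwapchainKHR.
class RetiredSwapchainQueue {
public:
    void retire(uint64_t handle, uint64_t frame) {
        queue_.push_back({handle, frame});
    }
    // Collect (i.e. destroy) every swapchain retired at or before safeFrame.
    std::vector<uint64_t> collectDestroyable(uint64_t safeFrame) {
        std::vector<uint64_t> destroyed;
        while (!queue_.empty() && queue_.front().retiredFrame <= safeFrame) {
            destroyed.push_back(queue_.front().handle);  // vkDestroySwapchainKHR here
            queue_.pop_front();
        }
        return destroyed;
    }
    std::size_t pending() const { return queue_.size(); }
private:
    std::deque<RetiredSwapchain> queue_;
};
```

Without an upper bound on `pending()`, rapid continuous resizing accumulates memory, which is exactly the concern noted above.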

    3. EXT_swapchain_maintenance1 .

      • "You should always use this extension if available".

      • Support :

        • Introduced in 2023.

        • (2025-02-25)

          • Only 25% of Android devices and 20% of desktop GPUs support it.

          • It was added on Android 14.

      • Adds a collection of window system integration features that were intentionally left out or overlooked in the original KHR_swapchain  extension.

      • Features :

        • Allow applications to release previously acquired images without presenting them.

        • Allow applications to defer swapchain memory allocation for improved startup time and memory footprint.

        • Specify a fence that will be signaled when the resources associated with a present operation can  be safely destroyed.

        • Allow changing the present mode a swapchain is using at per-present granularity.

        • Allow applications to define the behavior when presenting a swapchain image to a surface with different dimensions than the image.

          • Using this feature may  allow implementations to avoid returning ERROR_OUT_OF_DATE_KHR  in this situation.

        • This extension makes vkQueuePresentKHR  more similar to vkQueueSubmit , allowing it to specify a fence that the application can wait on.

      • The problem with vkDeviceWaitIdle  or vkQueueWaitIdle :

        • Typically, applications call these functions and assume it’s safe to delete swapchain semaphores and the swapchain itself.

        • The problem is that WaitIdle  functions are defined in terms of fences - they only wait for workloads submitted through functions that accept a fence.

        • Unextended vkQueuePresent  does not provide a fence parameter.

        • Therefore, vkDeviceWaitIdle  can’t guarantee that it’s safe to delete swapchain resources.

          • The validation layers don't trigger errors in this case, but it's just because so many people use it and there's no good alternative.

          • When EXT_swapchain_maintenance1  is enabled the validation layer will report an error if the application shutdown sequence relies on vkDeviceWaitIdle  or vkQueueWaitIdle  to release swapchain resources instead of using a presentation fence.

        • The extension fixes this problem.

        • By waiting on the presentation fence, the application can safely release swapchain resources.
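
        • A sketch of how the presentation fence is attached, assuming EXT_swapchain_maintenance1  is enabled (variable names like presentFence  follow the tutorial style and are assumptions; error handling omitted):

```c
// Chain VkSwapchainPresentFenceInfoEXT into the present call
// (requires VK_EXT_swapchain_maintenance1).
VkSwapchainPresentFenceInfoEXT fenceInfo = {
    .sType = VK_STRUCTURE_TYPE_SWAPCHAIN_PRESENT_FENCE_INFO_EXT,
    .swapchainCount = 1,
    .pFences = &presentFence, // signaled when this present's resources can be freed
};

VkPresentInfoKHR presentInfo = {
    .sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR,
    .pNext = &fenceInfo,
    .waitSemaphoreCount = 1,
    .pWaitSemaphores = &renderFinishedSemaphore,
    .swapchainCount = 1,
    .pSwapchains = &swapChain,
    .pImageIndices = &imageIndex,
};
vkQueuePresentKHR(presentQueue, &presentInfo);

// Later: once presentFence is signaled, it is safe to destroy the
// semaphores (and eventually the retired swapchain) used by that present.
vkWaitForFences(device, 1, &presentFence, VK_TRUE, UINT64_MAX);
```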

    • To avoid a deadlock, only reset the fence if we are submitting work:

      • If the fence is reset right after waiting on it, but the swapchain turns out to be out of date (window resized) and we return early without submitting, nothing will ever signal the fence again, and the next wait on it deadlocks.

      • The fence is signaled by the completion of the work from vkQueueSubmit , and unsignaled by vkResetFences .

      vkWaitForFences(device, 1, &inFlightFences[currentFrame], VK_TRUE, UINT64_MAX);
      
      uint32_t imageIndex;
      VkResult result = vkAcquireNextImageKHR(device, swapChain, UINT64_MAX, imageAvailableSemaphores[currentFrame], VK_NULL_HANDLE, &imageIndex);
      
      if (result == VK_ERROR_OUT_OF_DATE_KHR) {
          recreateSwapChain();
          return;
      } else if (result != VK_SUCCESS && result != VK_SUBOPTIMAL_KHR) {
          throw std::runtime_error("failed to acquire swapchain image!");
      }
      
      // Only reset the fence if we are submitting work
      vkResetFences(device, 1, &inFlightFences[currentFrame]);
      
  • What to recreate :

    • The image views need to be recreated because they are based directly on the Swapchain images.

  • Smooth Swapchain Resizing :

    • "Don't bother with smooth swapchain resizing, it's not worth it".

    • My experience :

      • (2025-08-04)

      • A callback glfw.SetWindowRefreshCallback  allows the swapchain to be recreated while resizing.

      • Synchronization :

        • Since the swapchain is recreated all the time, it becomes difficult to manage when the old swapchain should be destroyed along with its resources.

        • At the moment I'm handling the old_swapchain in a "bad" way, and I feel that recreating it every resize frame only worsens synchronization.

          • It is not necessary to deal with the old_swapchain when using vkDeviceWaitIdle() .

      • My current implementation:

        eng.window_init(1280, 720, "Expedicao Hover", proc "c" (window: glfw.WindowHandle) {
            context = eng.global_context
            // fmt.printfln("REFRESHED")
            eng.swapchain_resize()
            game_draw(&game, game.cycle_draw.dt_cycles_s)
        })
        
Updating resources after recreating
  • Destroy every image and view created from the old swapchain (the swapchain destroys its own images).

  • Update everything that holds a reference to either of those.

    • If anything was created using the swapchain's size you also have to destroy and recreate those and update anything that references them.

    • There's no getting around it.

Frames In-Flight

Motivation
  • The render loop has one glaring flaw: unnecessary idling  of the host. We are required to wait on the previous frame to finish before we can start rendering the next.

  • To fix this we allow multiple frames to be in-flight at once, allowing the rendering of one frame to not interfere with the recording of the next.

  • This control over the number of frames in flight is another example of Vulkan being explicit.

Frame
  • There is no concept of a frame in Vulkan. This means that the way you render is entirely up to you. The only thing that matters is when you have to display the frame to the screen, which is done through a swapchain. But there is no fundamental difference between rendering and then sending the images over the network, saving them to a file, or displaying them on the screen through the swapchain.

  • This means it is possible to use Vulkan in an entirely headless mode, where nothing is displayed to the screen. You can render the images and then store them on disk (very useful for testing) or use Vulkan as a way to perform GPU calculations such as a raytracer or other compute tasks.

How many Frames In-Flight
  • We choose the number 2 because we don’t want the CPU to get too  far ahead of the GPU.

    • With two frames in flight, the CPU and the GPU can be working on their own tasks at the same time. If the CPU finishes early, it will wait till the GPU finishes rendering before submitting more work.

    • With three or more frames in flight, the CPU could get ahead of the GPU, adding frames of latency. Generally, extra latency isn’t desired.
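
  • The latency claim can be checked with a toy timing model (plain C, no Vulkan involved; all numbers illustrative): when the workload is GPU-bound, each extra frame in flight adds one full GPU frame of latency between recording a frame and its completion.

```c
#include <assert.h>

/* Toy timing model, not Vulkan: the CPU records frames instantly, the
 * GPU takes GPU_MS per frame, and the CPU may only run N frames ahead
 * (it waits on the fence of frame i - N before recording frame i). */
#define GPU_MS 10
#define MAX_SIM 64

static int steady_latency_ms(int frames_in_flight, int total_frames) {
    int submit[MAX_SIM], done[MAX_SIM];
    int gpu_free = 0;
    for (int i = 0; i < total_frames && i < MAX_SIM; i++) {
        /* CPU-side fence wait on frame i - N */
        submit[i] = (i >= frames_in_flight) ? done[i - frames_in_flight] : 0;
        /* GPU executes submissions in order */
        int start = submit[i] > gpu_free ? submit[i] : gpu_free;
        done[i]  = start + GPU_MS;
        gpu_free = done[i];
    }
    /* submit-to-completion latency of the last simulated frame */
    return done[total_frames - 1] - submit[total_frames - 1];
}
```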

One Per Frame In-Flight
  • Duplicate :

    • Resources :

      • Uniform Buffers.

        • If modified while a previous frame uses it, corruption occurs.

      • Dynamic Storage Buffers.

        • GPU-computed results (e.g., particle positions). Writing to a buffer while an older frame reads it causes hazards.

      • Color/Depth Attachments.

      • Staging Buffers

        • If updated per frame (e.g., vkMapMemory ), duplication avoids overwriting mid-transfer.

      • Compute Shader Output Buffers:

        • If frame N  writes, and frame N+1  reads, duplicate to prevent read-before-write.

        • Use ping-pong buffers (count = frames in-flight).

    • Command pool.

      • I have doubts about this; some people do it differently.

    • Command buffer.

    • 'present_finished_semaphore'.

    • 'render_finished_fence'.

  • Don't duplicate :

    • Resources :

      • Static Vertex/Index Buffers:

        • Initialized once, read-only. No per-frame updates.

      • Immutable Textures

        • Loaded once (e.g., via VkDeviceMemory ).

        • Not mapped for change.

        • It's device local.

    • Static BRDF LUTs.

      • Initialized once, read by all frames.
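
  • The split above can be sketched as a per-frame struct (plain C, illustrative names, not a real Vulkan API): everything the CPU rewrites each frame gets one copy per frame in flight, while static read-only resources exist once and are shared.

```c
#include <assert.h>

/* Sketch with illustrative names: resources the CPU rewrites every
 * frame get one copy per frame in flight; static, read-only resources
 * exist once and are shared by all frames. */
#define MAX_FRAMES_IN_FLIGHT 2

typedef struct {
    int commandBuffer;            /* stands in for VkCommandBuffer */
    int uniformBuffer;            /* rewritten by the CPU each frame */
    int presentFinishedSemaphore; /* stands in for VkSemaphore */
    int renderFinishedFence;      /* stands in for VkFence */
} FrameData;

static FrameData frames[MAX_FRAMES_IN_FLIGHT];
static int staticVertexBuffer; /* one copy: initialized once, read-only */

/* Select this frame's copies; static resources are used as-is. */
static FrameData *frame_for(unsigned frameNumber) {
    return &frames[frameNumber % MAX_FRAMES_IN_FLIGHT];
}
```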

Advancing a frame
void drawFrame() {
    ...

    currentFrame = (currentFrame + 1) % MAX_FRAMES_IN_FLIGHT;
}
  • By using the modulo ( % ) operator, we ensure that the frame index loops around after every MAX_FRAMES_IN_FLIGHT  enqueued frames.

Acquire Next Image

  • vkWaitForFences()

    • Waits on the previous frame.

    • Takes an array of fences and waits on the host for either any or all of the fences to be signaled before returning.

    • The TRUE  we pass here indicates that we want to wait for all fences, but in the case of a single one it doesn’t matter.

    • This function also has a timeout parameter that we set to the maximum value of a 64 bit unsigned integer, UINT64_MAX , which effectively disables the timeout.

  • vkAcquireNextImageKHR()

    • Acquire the index of an available image from the swapchain for rendering .

    • If an image was acquired, then it means that this image is idle  (i.e., not  currently being displayed or written to).

    • If no image is ready, the call blocks (or returns an error if non-blocking).

    • The returned image index is now " owned " by your app for rendering.

    • We only get a swapchain image index back; which image becomes available is decided by the presentation engine (the windowing system).

    • A semaphore/fence is signaled when the image is safe to use.

    • timeout

      • If the swapchain doesn’t have any image we can use, the call blocks the thread, up to the specified timeout.

      • The measurement unit is nanoseconds.

      • 1 second is fine: 1_000_000_000 .

    • semaphore

      • Semaphore to signal.

    • fence

      • Fence to signal.

      • It is possible to specify a semaphore, fence or both.

    • pImageIndex

      • Specifies a variable to output the index of the Swapchain image that has become available  to use.

      • The index refers to the VkImage  in the swapChainImages  array.

Image Layout Transitions
  • See Vulkan#Images .

  • Before we can start rendering to an image, we need to transition its layout to one that is suitable for rendering.

  • Before rendering, we transition the image layout to IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL .

// Before starting rendering, transition the swapchain image to COLOR_ATTACHMENT_OPTIMAL
transition_image_layout(
    imageIndex,
    vk::ImageLayout::eUndefined,
    vk::ImageLayout::eColorAttachmentOptimal,
    {},                                                     // srcAccessMask (no need to wait for previous operations)
    vk::AccessFlagBits2::eColorAttachmentWrite,                // dstAccessMask
    vk::PipelineStageFlagBits2::eTopOfPipe,                   // srcStage
    vk::PipelineStageFlagBits2::eColorAttachmentOutput        // dstStage
);
  • After rendering, we need to transition the image layout back to IMAGE_LAYOUT_PRESENT_SRC_KHR  so it can be presented to the screen:

// After rendering, transition the swapchain image to PRESENT_SRC
transition_image_layout(
    imageIndex,
    vk::ImageLayout::eColorAttachmentOptimal,
    vk::ImageLayout::ePresentSrcKHR,
    vk::AccessFlagBits2::eColorAttachmentWrite,                 // srcAccessMask
    {},                                                      // dstAccessMask
    vk::PipelineStageFlagBits2::eColorAttachmentOutput,        // srcStage
    vk::PipelineStageFlagBits2::eBottomOfPipe                  // dstStage
);

Render Targets

Attachments
  • Nvidia: Use storeOp = DONT_CARE  rather than UNDEFINED  layouts to skip unneeded render target writes.

  • Nvidia: Don't transition color attachments from "safe" to "unsafe" unless required by the algorithm.

Transient Resources
  • Transient attachments (or Transient Resources) are render targets (like color/depth buffers) designed to exist only temporarily during a render pass, with their contents discarded afterward. They're optimized for fast on-chip memory access and avoid unnecessary memory operations.
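
  • A hedged sketch of how such an attachment is typically created (the width / height  variables are assumptions; error handling omitted):

```c
// Transient color attachment: usable only within a render pass, backed
// by lazily allocated memory so tilers can keep it in on-chip memory.
VkImageCreateInfo imageInfo = {
    .sType       = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
    .imageType   = VK_IMAGE_TYPE_2D,
    .format      = VK_FORMAT_B8G8R8A8_UNORM,
    .extent      = { width, height, 1 },
    .mipLevels   = 1,
    .arrayLayers = 1,
    .samples     = VK_SAMPLE_COUNT_4_BIT, // e.g. an MSAA target resolved in-pass
    .usage       = VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT |
                   VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT,
};
// Back it with a memory type that has
// VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT, and use loadOp/storeOp =
// DONT_CARE (or CLEAR) so its contents never touch main memory.
```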

Render Target
  • A Render Target is not a term in Vulkan but it's a term in graphics programming.

  • It's a term for an image you render into. In Vulkan this is a VkImage  + VkImageView  used as a color/depth attachment in a render pass or as a color attachment in dynamic rendering.

  • Examples :

  • Drawing a UI :

    • The UI texture must preserve alpha in the areas you want to be transparent, for later compositing.

    1. Draw UI directly to the final render target (swapchain image, or image to blit to the swapchain image) :

      • After tonemap, enable blending and draw UI.

      • Oni:

        • For the scene, I render into an RGBA16 image, then I draw on the swapchain  with a tonemapper, then I draw the UI on the swapchain  with blending enabled.

    2. Composite in a shader :

      • Sample scene image and UI image, compute out = scene * (1 - alpha_ui) + ui * alpha_ui  (or use premultiplied alpha: out = scene + ui ).

        • Both ways work; premultiplied alpha avoids some edge artifacts if UI already uses premultiplied data.

  • Compositing :

    • Used to combine render targets, or any other images.

    1. Fragment shader :

      • Render to an image and draw a full-screen triangle/quad that samples the HDR image and outputs LDR color.

        • Could be the swapchain image if supported, or an intermediate image then blit/copy to swapchain.

      • Pros :

        • Simple and guaranteed compatible with swapchain color attachment usage.

        • Useful if you want to draw the UI while making this final composition.

          • Seems like I'm mixing responsibilities, even though I'm reducing one render pass.

      • Cons :

        • Less flexible for arbitrary per-pixel work that requires many conditionals or random write patterns.

        • Need to issue a draw call and set up graphics pipeline.

    2. Compute shader :

      • Sample HDR image(s), write the LDR pixels to an output image.

        • Could be the swapchain image if supported, or an intermediate image then blit/copy to swapchain.

      • Pros :

        • Flexible: can read multiple inputs and write arbitrary outputs (random writes, multiple passes) without needing geometry.

        • Easy to implement multi-image compositing in one dispatch (read N sampled images + write to storage image).

      • Cons :

        • On some GPUs a simple full-screen fragment pass can be faster due to fixed-function hardware for rasterization and blending.

      #version 450
      
      layout(local_size_x = 16, local_size_y = 16) in;
      layout(set=0, binding=0) uniform sampler2D gameTex;
      layout(set=0, binding=1) uniform sampler2D uiTex;
      layout(set=0, binding=2, rgba8) uniform writeonly image2D swapchainImg;
      
      void main() {
          ivec2 coord = ivec2(gl_GlobalInvocationID.xy);
          vec2 uv = vec2(coord) / textureSize(gameTex, 0);
          
          // Sample inputs
          vec3 game = texture(gameTex, uv).rgb;
          vec4 ui = texture(uiTex, uv);
          
          // Tonemap game (example: Reinhard)
          game = game / (game + vec3(1.0));
          
          // Composite: UI over game
          vec3 final = mix(game, ui.rgb, ui.a);
          
          // Write to swapchain
          imageStore(swapchainImg, coord, vec4(final, 1.0));
      }
      
      #version 450
      
      layout(local_size_x = 16, local_size_y = 16) in;
      
      layout(binding = 0) uniform sampler2D uSceneHDR;
      layout(binding = 1) uniform sampler2D uUI; // optional
      layout(binding = 2, rgba8) writeonly uniform image2D outImage; // target LDR image (could be swapchain-compatible image)
      
      vec3 reinhardTonemap(vec3 c) {
          return c / (1.0 + c);
      }
      
      vec3 toSRGB(vec3 linear) {
          return pow(linear, vec3(1.0/2.2));
      }
      
      void main() {
          ivec2 pix = ivec2(gl_GlobalInvocationID.xy);
          ivec2 size = imageSize(outImage);
          if (pix.x >= size.x || pix.y >= size.y) return;
      
          vec2 uv = (vec2(pix) + 0.5) / vec2(size);
          vec3 hdr = texture(uSceneHDR, uv).rgb;
          float exposure = 1.0;
          vec3 mapped = reinhardTonemap(hdr * exposure);
          mapped = toSRGB(mapped);
      
          // Optionally composite UI
          // vec4 ui = texture(uUI, uv);
          // vec3 outc = mix(mapped, ui.rgb, ui.a);
      
          imageStore(outImage, pix, vec4(mapped, 1.0));
      }
      
      // Dispatch: round the workgroup count up so edge pixels are covered
      vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, computePipe);
      vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, ...);
      vkCmdDispatch(cmd, (swapchain_width + 15) / 16, (swapchain_height + 15) / 16, 1);
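
      • The math in those shaders is easy to sanity-check on the CPU (plain C mirrors of the GLSL above; no GPU involved):

```c
#include <assert.h>

/* CPU mirrors of the math in the shaders above (plain C, no GPU).
 * reinhard maps [0, inf) into [0, 1), so any HDR value fits an 8-bit
 * target; over_blend mirrors the mix(game, ui.rgb, ui.a) compositing. */
static float reinhard(float c) {
    return c / (1.0f + c);
}

static float over_blend(float dst, float src, float a) {
    return dst * (1.0f - a) + src * a;
}
```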
      

Dynamic Rendering

  • Support :

  • VkRenderingAttachmentInfo

    • Structure specifying attachment information

    • imageView

      • Is the image view that will be used for rendering.

    • imageLayout

      • Is the layout that imageView  will be in during rendering.

    • resolveMode

      • Is a VkResolveModeFlagBits  value defining how data written to imageView  will be resolved into resolveImageView .

    • resolveImageView

      • Is an image view used to write resolved data at the end of rendering.

    • resolveImageLayout

      • Is the layout that resolveImageView  will be in during rendering.

    • loadOp

      • Specifies what to do with the image before rendering.

      • Is a VkAttachmentLoadOp  value defining the load operation  for the attachment.

      • We’re using ATTACHMENT_LOAD_OP_CLEAR  to clear the image to black before rendering.

    • storeOp

      • Specifies what to do with the image after rendering.

      • Is a VkAttachmentStoreOp  value defining the store operation  for the attachment.

      • We're using ATTACHMENT_STORE_OP_STORE  to store the rendered image for later use.

    • clearValue

      • Is a VkClearValue  structure defining values used to clear imageView  when loadOp  is ATTACHMENT_LOAD_OP_CLEAR .

  • VkRenderingInfo

    • Structure specifying render pass instance begin info.

    • Specifies the attachments to render to and the render area.

    • Combines the RenderingAttachmentInfo  with other rendering parameters.

    • flags

    • renderArea

      • Is the render area that is affected by the render pass instance.

      • Extent Requirements :

        • The rendering_info.renderArea.extent  has to fit inside the rendering_attachment.imageView  and hence the image.

      • If there is an instance of VkDeviceGroupRenderPassBeginInfo  included in the pNext  chain and its deviceRenderAreaCount  member is not 0 , then renderArea  is ignored, and the render area is defined per-device by that structure.

      • CharlesG - LunarG:

        • Viewports & scissors let you specify a size smaller than the full image, as well as redefining the origin & scale to use. Whereas the renderArea is specifying the actual image dimensions to use. This allows flexibility in how the backing VkImage is used in contrast to the viewport/scissor needs of the rendering itself. In most cases they are going to be “full” so its not like it comes into play always

        • More clarity: viewport & scissor are inputs to the rasterization stage, while the render area is an input for the attachment read/write.

      • Caio:

        • So, when comparing these two cases:

          • 1- I use a 1080p image for the renderArea  and a 640p  viewport and center the offset

          • 2- I use a 640p image for the renderArea  and a 640p  viewport and center the offset

        • Is there a difference between the quality and performance of these two? Or even, is there a visual difference?

      • CharlesG - LunarG:

        • I don't know tbh.

    • colorAttachmentCount

      • Is the number of elements in pColorAttachments .

    • pColorAttachments

      • Is a pointer to an array of colorAttachmentCount   VkRenderingAttachmentInfo  structures describing any color attachments used.

      • Each element of the pColorAttachments  array corresponds to an output location in the shader, i.e. if the shader declares an output variable decorated with a Location  value of X , then it uses the attachment provided in pColorAttachments[X] .

      • If the imageView  member of any element of pColorAttachments  is NULL_HANDLE , and resolveMode  is not RESOLVE_MODE_EXTERNAL_FORMAT_DOWNSAMPLE_ANDROID , writes to the corresponding location by a fragment are discarded.

    • pDepthAttachment

    • pStencilAttachment

    • viewMask

      • Is a bitfield of view indices describing which views are active during rendering, when it is not 0 .

    • layerCount

      • Is the number of layers rendered to in each attachment when viewMask  is 0 .

      • Specifies the number of layers to render to, which is 1 for a non-layered image.

Multi-view
Render Cmds

Drawing Commands

Draw Direct
  • Specify the Viewport and Scissor.

  • Bind the pipeline.

  • Bind the descriptor sets.

  • vkCmdDraw()

    • vertexCount

      • Even though we don’t have a vertex buffer, we technically still have 3 vertices to draw.

    • instanceCount

      • Used for instanced rendering, use 1  if you’re not doing that.

    • firstVertex

      • Used as an offset into the vertex buffer; defines the lowest value of gl_VertexIndex .

    • firstInstance

      • Used as an offset for instanced rendering; defines the lowest value of gl_InstanceIndex .

  • vkCmdDrawIndexed .

    • indexCount

      • The number of vertices to draw.

    • instanceCount

      • The number of instances to draw.

      • We’re not using instancing, so just specify 1  instance.

    • firstIndex

      • The base index within the index buffer.

      • Specifies an offset into the index buffer, using a value of 1  would cause the graphics card to start reading at the second index.

    • vertexOffset

      • The value added to the vertex index before indexing into the vertex buffer.

    • firstInstance

      • The instance ID of the first instance to draw.

Draw Indirect
  • "In some ways, Indirect Rendering is a more advanced form of instancing".

  • buffer + offset + (stride * index)

  • Executing a draw-indirect call will be equivalent to doing this.

    // Pseudocode: what vkCmdDrawIndexedIndirect effectively does with
    // the contents of the indirect buffer.
    void FakeDrawIndirect(VkCommandBuffer commandBuffer, void* buffer, VkDeviceSize offset, uint32_t drawCount, uint32_t stride)
    {
        char* memory = (char*)buffer + offset;
    
        for (uint32_t i = 0; i < drawCount; i++)
        {
            VkDrawIndexedIndirectCommand* command = (VkDrawIndexedIndirectCommand*)(memory + (i * stride));
    
            vkCmdDrawIndexed(commandBuffer,
                command->indexCount,
                command->instanceCount,
                command->firstIndex,
                command->vertexOffset,
                command->firstInstance);
        }
    }
    
  • It does not carry vertex data itself — it only supplies counts and base indices/instances. The actual vertex data and indices come from the buffers you previously bound with vkCmdBindVertexBuffers  and vkCmdBindIndexBuffer .

  • Vertex :

    • To move vertex and index buffers to bindless, generally you do it by merging the meshes into really big buffers. Instead of having 1 buffer per vertex buffer and index buffer pair, you have 1 buffer for all vertex buffers in a scene. When rendering, then you use BaseVertex offsets in the drawcalls. In some engines, they remove vertex attributes from the pipelines entirely, and instead grab the vertex data from buffers in the vertex shader. Doing that makes it much easier to keep 1 big vertex buffer for all drawcalls in the engine even if they use different vertex attribute formats. It also allows some advanced unpacking/compression techniques, and it’s the main use case for Mesh Shaders.

    • We also change the way the meshes work. After loading a scene, we create a BIG vertex buffer, and stuff all of the meshes of the entire map into it. This way we will avoid having to rebind vertex buffers.
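
    • The merged-buffer layout boils down to simple offset bookkeeping, sketched here in plain C with illustrative names: each mesh records where its indices and vertices start in the big buffers, and those offsets become the firstIndex / vertexOffset arguments of the draw call.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch with illustrative names: appending meshes into one big vertex
 * buffer and one big index buffer, recording the per-mesh offsets that
 * later become the firstIndex / vertexOffset arguments of
 * vkCmdDrawIndexed (or the fields of VkDrawIndexedIndirectCommand). */
typedef struct {
    uint32_t firstIndex;   /* where this mesh's indices start  */
    uint32_t indexCount;
    int32_t  vertexOffset; /* added to each index before fetch */
} MergedMesh;

static uint32_t totalIndices  = 0;
static uint32_t totalVertices = 0;

static MergedMesh merge_mesh(uint32_t vertexCount, uint32_t indexCount) {
    MergedMesh m = {
        .firstIndex   = totalIndices,
        .indexCount   = indexCount,
        .vertexOffset = (int32_t)totalVertices,
    };
    totalIndices  += indexCount;
    totalVertices += vertexCount;
    return m;
}
```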

  • Implementation :

    • If the device supports multi-draw indirect ( VkPhysicalDeviceFeatures::multiDrawIndirect ), then the entire array of draw commands can be executed through a single call to vkCmdDrawIndexedIndirect . Otherwise, each draw command must be executed through a separate call to vkCmdDrawIndexedIndirect  with a draw count of 1 :

      // m_enable_mci: supports multiDrawIndirect
      if (m_enable_mci && m_supports_mci)
      {
          vkCmdDrawIndexedIndirect(draw_cmd_buffers[i], indirect_call_buffer->get_handle(), 0, cpu_commands.size(), sizeof(cpu_commands[0]));
      }
      else
      {
          for (size_t j = 0; j < cpu_commands.size(); ++j)
          {
              vkCmdDrawIndexedIndirect(draw_cmd_buffers[i], indirect_call_buffer->get_handle(), j * sizeof(cpu_commands[0]), 1, sizeof(cpu_commands[0]));
          }
      }
      
    • vkCmdDrawIndexedIndirectCount .

      • Behaves similarly to vkCmdDrawIndexedIndirect except that the draw count is read by the device from a buffer during execution. The command will read an unsigned 32-bit integer from countBuffer located at countBufferOffset and use this as the draw count.

  • Textures :

    • Due to the fact that you want to have as many things on the GPU as possible, this pipeline maps very well if you combine it with “Bindless” techniques, where you stop needing to bind descriptor sets per material or change vertex buffers. Having a bindless renderer also makes Raytracing much more performant and effective.

    • On this guide we will not use bindless textures as their support is limited, so we will do 1 draw-indirect call per material used.

    • To move textures into bindless, you use texture arrays.

    • With the correct extension, the size of the texture array can be unbounded in the shader, like when you use SSBOs.

    • Then, when accessing the textures in the shader, you access them by index which you grab from another buffer. If you don’t use the Descriptor Indexing extensions, you can still use texture arrays, but they will need a bounded size. Check your device limits to see how big can that be.

    • To make materials bindless, you need to stop having 1 pipeline per material. Instead, you want to move the material parameters into SSBOs, and go with an ubershader  approach.

    • In the Doom engines, they have a very low number of pipelines for the entire game. Doom Eternal has less than 500 pipelines, while Unreal Engine games often have 100,000+ pipelines. If you use ubershaders to massively lower the number of unique pipelines, you will be able to increase efficiency in a huge way, as vkCmdBindPipeline  is one of the most expensive calls when drawing objects in Vulkan.

  • Push Constants :

    • Push Constants and Dynamic Descriptors can be used, but they have to be “global”. Using push constants for things like camera location is perfectly fine, but you can't use them for object ID, as that’s a per-object value and you specifically want to draw as many objects as possible in 1 draw.

Multithreading Rendering

  • I'm not sure, I don't think it's necessary.

  • From what I understand, it's about using multiple CPU threads to handle submissions and presentations, etc.

  • It has nothing to do with frames in flight, btw.

  • Explanation .

    • The video explains okay, but nah.

    • ->  In the next video he says it wasn't exactly a good idea and reverted  what he did in that video.

      • "It was technically slower and more confusing to do synchronizations".

Render Passes and Framebuffers

Dynamic Rendering: Features and differences from Render Passes
  • Replaces VkRenderPass  and Framebuffers.

    • Instead, we can specify the color, depth, and stencil attachments directly when we begin rendering.

  • Describe renderpasses inline with command buffer recording.

  • Provides more flexibility by allowing us to change the attachments we’re rendering to without creating new render pass objects.

  • Greatly simplifies application architecture.

  • Synchronization still needs to be done, but now it's even more explicit, truer to its stated nature.

    • We had to do that with Render Passes, but that was bound up in the Render Pass creation.

    • Now, the synchronization is more explicit.

  • Tiling GPUs aren't left behind.

    • The Vulkan 1.4 dynamicRenderingLocalRead  feature ( KHR_dynamic_rendering_local_read ) brings tiling GPUs to the same capabilities, without them needing to state the Render Passes up front.

  • I wouldn't say that "You should use Render Passes if your hardware isn't new enough", because it isn't fun.

  • Better compatibility with modern rendering techniques.

Subpasses
  • External subpass dependencies :

    • Explained by TheMaister 2019; he is part of the Khronos Group.

    • The main purpose of external subpass dependencies is to deal with initialLayout and finalLayout of an attachment reference. If initialLayout != layout used in the first subpass, the render pass is forced to perform a layout transition.

    • If you don’t specify anything else, that layout transition will wait for nothing before it performs the transition. Or rather, the driver will inject a dummy subpass dependency for you with srcStageMask = TOP_OF_PIPE. This is not what you want since it’s almost certainly going to be a race condition. You can set up a subpass dependency with the appropriate srcStageMask and srcAccessMask.

    • The external subpass dependency is basically just a vkCmdPipelineBarrier injected for you by the driver.

    • The whole premise here is that it’s theoretically better to do it this way because the driver has more information, but this is questionable, at least on current hardware and drivers.

    • There is a very similar external subpass dependency setup for finalLayout. If finalLayout differs from the last use in a subpass, driver will transition into the final layout automatically. Here you get to change dstStageMask / dstAccessMask . If you do nothing here, you get BOTTOM_OF_PIPE , which can actually be just fine. A prime use case here is swapchain images which have finalLayout = PRESENT_SRC_KHR .

    • Essentially, you can ignore external subpass dependencies .

    • Their added complexity gives very little gain. Render pass compatibility rules also imply that if you change even minor things like which stages to wait for, you need to create new pipelines!

    • This is dumb, and will hopefully be fixed at some point in the spec.

    • However, while the usefulness of external subpass dependencies is questionable, they have some convenient use cases I’d like to go over:

      • Automatically transitioning TRANSIENT_ATTACHMENT  images :

        • If you’re on mobile, you should be using transient images where possible. When using these attachments in a render pass, it makes sense to always have them as initialLayout = UNDEFINED. Since we know that these images can only ever be used in COLOR_ATTACHMENT_OUTPUT  or EARLY / LATE_FRAGMENT_TEST  stages depending on their image format, the external subpass dependency writes itself, and we can just use transient attachments without having to think too hard about how to synchronize them. This is what I do in my Granite engine, and it’s quite useful. Of course, we could just inject a pipeline barrier for this exact same purpose, but that’s more boilerplate.

      • Automatically transitioning swapchain images :

        • Typically, swapchain images are always just used once per frame, and we can deal with all synchronization using external subpass dependencies. We want initialLayout = UNDEFINED , and finalLayout = PRESENT_SRC_KHR .

        • srcStageMask  is COLOR_ATTACHMENT_OUTPUT  which lets us link up with the swapchain acquire semaphore. For this case, we will need an external subpass dependency. For the finalLayout  transition after the render pass, we are fine with BOTTOM_OF_PIPE  being used. We’re going to use semaphores here anyways.

        • I also do this in Granite.

Framebuffers
  • VkFramebuffer

    • Holds the target images for a renderpass.

    • Only used in legacy tutorials.

  • Just wrappers to image views.

  • The attachments of a Framebuffer are the Image Views.

  • The Framebuffers are used within a Render Pass.

  • LunarG / Vulkan: "Kinda of a bad name, it's just a couple of image views".

  • Only exists to combine images and renderpasses.

Render Passes
  • VkRenderPass

    • Holds information about the images you are rendering into. All drawing commands have to be done inside a renderpass.

    • Only used in legacy tutorials.

  • Render passes in Vulkan describe the type  of images that are used during rendering operations, how  they will be used, and how  their contents should be treated.

  • All drawing commands happen inside a "render pass".

  • Acts as pseudo render graph.

  • Allows tiling GPUs to use memory efficiently.

    • Efficient scheduling.

  • Describes the image attachments.

  • Defines the subpasses.

  • Declares dependencies between subpasses.

  • Requires VkFramebuffers .

    • Whereas a render pass only describes the type of images, a VkFramebuffer  actually binds specific images to these slots.

  • Problem :

    • Great in theory, not so great to use in practice.

    • Single object with many responsibilities.

      • Made the API harder to reason about when looking at the code.

    • Hard to architect into a renderer.

      • Yet another input for pipelines.

    • The main benefit is for tiling based GPUs.

      • Commonly found in mobile.

    • "Use Dynamic Rendering, it's much better".

Submit

  • Submits the Command Buffers recorded.

  • VkSubmitInfo

    • The first three parameters specify which semaphores to wait on before execution begins and in which stage(s) of the pipeline to wait.

    • We want to wait for writing colors to the image until it’s available, so we’re specifying the stage of the graphics pipeline that writes to the color attachment.

    • That means that theoretically, the implementation can already start executing our vertex shader and such while the image is not yet available.

    • Each entry in the waitStages  array corresponds to the semaphore with the same index in pWaitSemaphores .

    • pCommandBuffers

      • Specifies which command buffers to actually submit for execution. We simply submit the single command buffer we have.

    • pSignalSemaphores

      • Specifies which semaphores to signal once the command buffer(s) have finished execution.

      • In our case we’re using the renderFinishedSemaphore  for that purpose.

  • vkQueueSubmit()

    • fence

      • Is an optional handle to a fence to be signaled  once all submitted command buffers have completed execution.

    • The function takes an array of VkSubmitInfo  structures, which is more efficient when the workload is much larger.

    • The last parameter references an optional fence that will be signaled when the command buffers finish execution.

    • This allows us to know when it is safe for the command buffer to be reused, thus we want to give it drawFence . Now we want the CPU to wait while the GPU finishes rendering that frame we just submitted:

Presentation

  • The last step of drawing a frame is submitting the result back to the Swapchain to have it eventually show up  on the screen.

  • Presentation Engine :

    • .

  • VkPresentInfoKHR

    • pWaitSemaphores

      • Which semaphores to wait on before presentation can happen, just like VkSubmitInfo .

      • Since we want to wait for the command buffer to finish execution (and thus for our triangle to be drawn), we wait on the semaphores it will signal, so we use signalSemaphores .

    • The next two parameters specify the Swapchains to present images to and the index of the image for each Swapchain.

    • This will almost always be a single swapchain.

    • pResults

      • It allows you to specify an array of VkResult  values to check for every Swapchain if presentation was successful.

      • It’s not necessary if you’re only using a single Swapchain, because you can use the return value of the present function.

  • vkQueuePresentKHR()

    • Submits a rendered image to the presentation queue.

    • Used after queueing all rendering commands and transitioning the image to the correct layout.

    • Vulkan transfers ownership of the image to the 'presentation engine'.

  • How a presentation happens :

    • Who :

      • The GPU  (via the display controller/hardware), orchestrated by the OS/window system .

    • When :

      • At the next vertical blanking interval ( Vblank ).

        • Vblank  is the moment between screen refreshes (e.g., at 60 Hz, every 16.67 ms).

      • In a Vulkan workflow, we can be sure that the presentation happened between vkQueuePresentKHR()  and the next vkAcquireNextImageKHR() .

        • The job of the present_complete_semaphore  is to hold this information.

    • How :

      • The GPU's display controller  reads the image from GPU memory.

      • The OS/window system (e.g., X11/Wayland on Linux, Win32 on Windows) composites the image into the application window.

      • The final output is scanned out to the display.

  • Image recycling :

    • After presentation, the image is released back to the swapchain.

    • It becomes available for re-acquisition via vkAcquireNextImageKHR  (after the next vblank).

Synchronization and Cache Control

KHR_synchronization2
  • KHR_synchronization2 .

  • Nvidia: Use KHR_synchronization2 , the new functions allow the application to describe barriers more accurately.

  • Highlights :

    • One main change with the extension is to have pipeline stages and access flags now specified together in memory barrier structures.

      • This makes the connection between the two more obvious.

    • Because the 32 bits of VkAccessFlags  were running out, the VkAccessFlags2KHR  type was created with a 64-bit range. To prevent the same issue for VkPipelineStageFlags , the VkPipelineStageFlags2KHR  type was also created with a 64-bit range.

    • Adds 2 new image layouts, IMAGE_LAYOUT_ATTACHMENT_OPTIMAL_KHR  and IMAGE_LAYOUT_READ_ONLY_OPTIMAL_KHR , to make layout transitions easier.

    • etc.

Queues

  • Any synchronization applies globally to a VkQueue ; there is no concept of only-inside-this-command-buffer synchronization.

  • Graphics pipelines are executable on queues supporting QUEUE_GRAPHICS . Stages executed by graphics pipelines can  only be specified in commands recorded for queues supporting QUEUE_GRAPHICS .

QueueIdle and DeviceIdle
  • These functions can be used as a very rudimentary way to perform synchronization.

  • Closing the program :

    • We should wait for the logical device to finish operations before exiting mainLoop  and destroying the window.

    • You can also wait for operations in a specific command queue to be finished with vkQueueWaitIdle .

    • You’ll see that the program now exits without problems when closing the window.

  • Problem :

    • vkDeviceWaitIdle  or vkQueueWaitIdle  end up being needed because vkQueuePresentKHR  offers no fence to signal.

  • Solution :

  • .

Queue Family Ownership Transfer
  • Resources created with a VkSharingMode  of SHARING_MODE_EXCLUSIVE   must  have their ownership explicitly transferred from one queue family to another in order to access their content in a well-defined manner on a queue in a different queue family.

  • Resources shared with external APIs or instances using external memory must  also explicitly manage ownership transfers between local and external queues (or equivalent constructs in external APIs) regardless of the VkSharingMode  specified when creating them.

  • If you need to transfer ownership to a different queue family, you need memory barriers, one in each queue to release/acquire ownership.

  • If memory dependencies are correctly expressed between uses of such a resource between two queues in different families, but no ownership transfer is defined, the contents of that resource are undefined for any read accesses performed by the second queue family.

  • A queue family ownership transfer consists of two distinct parts:

    1. Release exclusive ownership from the source queue family

    2. Acquire exclusive ownership for the destination queue family

    • An ownership transfer is defined if the two queue family values are not equal; either value may also be one of the special queue family values reserved for external memory ownership transfers.

    • An application must  ensure that these operations occur in the correct order by defining an execution dependency between them, e.g. using a semaphore.

    • A release operation  is used to release exclusive ownership of a range of a buffer or image subresource range. A release operation is defined by executing a buffer memory barrier  (for a buffer range) or an image memory barrier  (for an image subresource range) using a pipeline barrier command, on a queue from the source queue family.

    • Etc, I haven't read much about it.

Command Buffers
  • The specification states that commands start execution in-order, but complete out-of-order. Don’t get confused by this. The fact that commands start in-order is simply convenient language to make the spec language easier to write.

  • Unless you add synchronization yourself, all commands in a queue execute out of order. Reordering may happen across command buffers and even vkQueueSubmits .

  • This makes sense, considering that Vulkan only sees a linear stream of commands once you submit. It is a pitfall to assume that splitting command buffers or submits adds some magic synchronization for you.

  • Frame buffer operations inside a render pass happen in API-order, of course. This is a special exception which the spec calls out.

Queue Submissions (vkQueueSubmit)
  • Queue submission commands

  • It automatically performs a domain operation from host to device  for all writes performed before the command executes, so in most cases an explicit memory barrier is not needed.

  • In the few circumstances where a submit does not occur between the host write and the device read access, writes can  be made available by using an explicit memory barrier.

Example
  • vkCmdDispatch (PIPELINE_STAGE_COMPUTE_SHADER)

  • vkCmdCopyBuffer (PIPELINE_STAGE_TRANSFER)

  • vkCmdDispatch (PIPELINE_STAGE_COMPUTE_SHADER)

  • vkCmdPipelineBarrier (srcStageMask = PIPELINE_STAGE_COMPUTE_SHADER)

  • We would be referring to the two vkCmdDispatch  commands, as they perform their work in the COMPUTE stage. Even if we split these 4 commands into 4 different vkQueueSubmits , we would still consider the same commands for synchronization.

  • Essentially, the work we are waiting for is all commands which have ever been submitted to the queue including any previous commands in the command buffer we’re recording.

Blocking Operations

  • .

    • By Samsung 2019.

    • I don't know if this information is still valid.

    • See the Mobile section for optimizations of vkQueuePresent .

Examples

  • Synchronization examples .

  • Example 1 :

    • vkCmdDispatch  – writes to an SSBO, ACCESS_SHADER_WRITE

    • vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = TRANSFER, srcAccessMask = SHADER_WRITE, dstAccessMask = 0)

    • vkCmdPipelineBarrier(srcStageMask = TRANSFER, dstStageMask = COMPUTE, srcAccessMask = 0, dstAccessMask = SHADER_READ)

    • vkCmdDispatch  – read from the same SSBO, ACCESS_SHADER_READ

    • While StageMask  cannot be 0, AccessMask  can be 0.

  • Recently allocated image, to use in a compute shader as a storage image :

    • The pipeline barrier looks like:

      • oldLayout = UNDEFINED

        • Input is garbage

      • newLayout = GENERAL

        • Storage image compatible layout

      • srcStageMask = TOP_OF_PIPE

        • Wait for nothing

      • srcAccessMask = 0

        • This is key, there are no pending writes to flush out.

        • This is the only way to use TOP_OF_PIPE  in a memory barrier.

      • dstStageMask = COMPUTE

        • Unblock compute after the layout transition is done

      • dstAccessMask = SHADER_READ | SHADER_WRITE

  • Swapchain Image Transition to PRESENT_SRC :

    • We have to transition them into IMAGE_LAYOUT_PRESENT_SRC_KHR  before passing the image over to the presentation engine.

    • Having dstStageMask = BOTTOM_OF_PIPE  and dstAccessMask = 0  is perfectly fine. We don’t care about making this memory visible  to any stage beyond this point. We will use semaphores to synchronize with the presentation engine anyways.

    • The pipeline barrier looks like:

      • srcStageMask = COLOR_ATTACHMENT_OUTPUT

        • Assuming we rendered to swapchain in a render pass.

      • srcAccessMask = COLOR_ATTACHMENT_WRITE

      • dstStageMask = BOTTOM_OF_PIPE

        • After transitioning into this PRESENT  layout, we’re not going to touch the image again until we reacquire the image, so dstStageMask = BOTTOM_OF_PIPE  is appropriate.

      • dstAccessMask = 0

      • oldLayout = COLOR_ATTACHMENT_OPTIMAL

      • newLayout = PRESENT_SRC_KHR

    • Setting dstAccessMask = 0  on the final TRANSFER_DST → PRESENT_SRC_KHR  barrier means “there is no GPU  access after this barrier that we are ordering/expressing.” For swapchain-present that is intentional and common: presentation is outside the GPU pipeline, so the barrier only needs to make the producer  writes (e.g. your blit TRANSFER_WRITE ) available/visible; the presentation engine performs its own, external visibility semantics.

  • Example 1 :

    • vkCmdPipelineBarrier(srcStageMask = FRAGMENT_SHADER, dstStageMask = ?)

    • Vertex shading for future commands can begin executing early, we only need to wait once FRAGMENT_SHADER  is reached.

  • Example 2 :

    1. vkCmdDispatch

    2. vkCmdDispatch

    3. vkCmdDispatch

    4. vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = COMPUTE)

    5. vkCmdDispatch

    6. vkCmdDispatch

    7. vkCmdDispatch

    • {5, 6, 7} must wait for {1, 2, 3}.

    • A possible execution order here could be:

      • #3

      • #2

      • #1

      • #7

      • #6

      • #5

    • {1, 2, 3} can execute out-of-order, and so can {5, 6, 7}, but these two sets of commands can not interleave execution.

    • In spec lingo {1, 2, 3} happens-before  {5, 6, 7}.

  • Chain of Dependencies (1) :

    1. vkCmdDispatch

    2. vkCmdDispatch

    3. vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = TRANSFER)

    4. vkCmdPipelineBarrier(srcStageMask = TRANSFER, dstStageMask = COMPUTE)

    5. vkCmdDispatch

    6. vkCmdDispatch

    • {5, 6} must wait for {1, 2}.

    • We created a chain of dependencies between COMPUTE -> TRANSFER -> COMPUTE.

    • When we wait for TRANSFER in 4, we must also wait for anything which is currently blocking TRANSFER.

  • Chain of dependencies (2) :

    1. vkCmdDispatch

    2. vkCmdDispatch

    3. vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = TRANSFER)

    4. vkCmdMagicDummyTransferOperation

    5. vkCmdPipelineBarrier(srcStageMask = TRANSFER, dstStageMask = COMPUTE)

    6. vkCmdDispatch

    7. vkCmdDispatch

    • {4} must wait for {1, 2}.

    • {6, 7} must wait for {4}.

    • The chain is {1, 2} -> {4} -> {6, 7}, and if {4} is noop (no operation), {1, 2} -> {6, 7} is achieved.

Execution Dependencies, Memory Dependencies, Memory Model

Data hazards
  • Execution dependencies  and memory dependencies  are used to solve data hazards, i.e. to ensure that read and write operations occur in a well-defined order.

    • An operation  is an arbitrary amount of work to be executed on the host, a device, or an external entity such as a presentation engine.

  • Write-after-read hazards :

    • Can be solved with just an execution dependency

  • Read-after-write hazards :

    • Need appropriate memory dependencies to be included between them.

  • Write-after-write hazards :

    • Need appropriate memory dependencies to be included between them.

  • If an application does not include dependencies to solve these hazards, the results and execution orders of memory accesses are undefined .

Execution Dependencies
  • An execution dependency  is a guarantee that for two sets of operations, the first set must  happen-before the second set. If an operation happens-before another operation, then the first operation must  complete before the second operation is initiated.

  • Execution dependencies  alone are not sufficient to guarantee that values resulting from writes in one set of operations can  be read from another set of operations.

Memory Available
  • Availability operations :

    • Cause the values generated by specified memory write accesses to become available  to a memory domain for future access. Any available value remains available until a subsequent write to the same memory location occurs (whether it is made available or not) or the memory is freed.

    • Even with coherent mapping, you still need to have a dependency between the host writing that memory and the GPU operation reading it.

  • We can say “making memory available” is all about flushing caches.

  • vkFlushMappedMemoryRanges()

    • Guarantees that host writes to the memory ranges described by pMemoryRanges   can  be made available  to device access, via availability operations  from the ACCESS_HOST_WRITE  access type.

    • This is required after CPU writes to non-coherent memory; HOST_COHERENT  effectively provides it automatically.

  • Cache example :

    • When our L2 cache contains the most up-to-date data there is, we can say that memory is available , as L1 caches connected to L2 can pull in the most up-to-date data there is.

    • Once a shader stage writes to memory, the L2 cache no longer has the most up-to-date data there is, so that memory is no longer considered available .

      • If other caches try to read from L2, they will see undefined data.

      • Whatever wrote that data must make those writes available  before the data can be made visible  again.

Memory Domain
  • Memory domain operations :

    • Cause writes that are available to a source memory domain to become available to a destination memory domain (an example of this is making writes available to the host domain available to the device domain).

Memory Visible
  • Visibility operations :

    • Cause values available  to a memory domain to become visible  to specified memory accesses.

    • Memory barriers are visibility operations. Without them, you wouldn’t have visibility of the memory.

      • The execution barrier ensures the completion of commands, but the srcStageMask , dstStageMask , srcAccessMask  and dstAccessMask  are what handle availability and visibility.

  • Once written values are made visible to a particular type of memory access, they can  be read or written by that type of memory access.

  • We can say “making memory visible” is all about invalidating caches.

  • Availability is a necessary part of visibility, but availability alone is not sufficient.

    • You can do things that might have caused visibility, but because the write was not available, they don’t actually make the write visible.

  • Under the hood, visibility is implementation-specific. The pure-visibility parts typically involve forcing lines out of caches and/or invalidating them. But some kinds of visibility may not require even that.

  • vkInvalidateMappedMemoryRanges() .

    • Guarantees that device writes to the memory ranges described by pMemoryRanges , which have been made available  to the host memory domain using the ACCESS_HOST_WRITE  and ACCESS_HOST_READ  access types, are made visible  to the host.

    • If a range of non-coherent memory is written by the host and then invalidated without first being flushed, its contents are undefined.

Host Coherent
  • MEMORY_PROPERTY_HOST_COHERENT

    • If a memory object does  have this property:

      • Writes  to the memory object from the host are automatically made available  to the host domain.

      • This means you don't need vkFlushMappedMemoryRanges()  or vkInvalidateMappedMemoryRanges() .

      • This property alone is insufficient for availability. You still need to use synchronization to make sure that reads and writes from CPU and GPU happen in the right order, and you need memory barriers on the GPU side to manage GPU caches (make CPU writes visible to GPU reads, and make GPU writes available to CPU reads).

      • Coherency is about "visibility", but you still need availability.

    • If a memory object does not  have this property:

      • vkFlushMappedMemoryRanges() must  be called in order to guarantee that writes to the memory object from the host are made available  to the host domain, where they can  be further made available to the device domain via a domain operation.

      • vkInvalidateMappedMemoryRanges()   must  be called to guarantee that writes which are available to the host domain are made visible  to host operations.

Memory Dependency
  • Memory Dependency  is an execution dependency  which includes availability  and visibility  operations such that:

    • The first set of operations happens-before the availability  operation.

    • The availability operation happens-before the visibility  operation.

    • The visibility operation happens-before the second set of operations.

  • It enforces availability  and visibility  of memory accesses and execution order  between two sets of operations.

  • Most synchronization commands in Vulkan define a memory dependency.

  • The specific memory accesses that are made available  and visible  are defined by the access scopes  of a memory dependency.

  • Any type of access that is in a memory dependency’s first access scope  is made available .

  • Any type of access that is in a memory dependency’s second access scope  has any available writes made visible  to it.

  • Any type of operation that is not in a synchronization command’s access scopes will not be included in the resulting dependency.

Execution Stages

  • The stage masks are bit-masks, so it's perfectly fine to wait for both X and Y work.

  • By specifying the source and target stages, you tell the driver what operations need to finish before the transition can execute, and what must not have started yet.

  • Nvidia: Use optimal srcStageMask  and dstStageMask . Most important cases: If the specified resources are accessed only in compute or fragment shaders, use the compute or the fragment stage bits for both masks, to make the barrier fragment-only or compute-only.

  • Caio: "Wait for srcStageMask  to finish, before dstStageMask  can start".

First synchronization scope
  • srcStageMask

  • This represents what we are waiting for.

  • "What operations need to finish before the transition can execute".

Second synchronization scope
  • dstStageMask

  • "What operations must not have started yet".

  • Any work submitted after this barrier will need to wait for the work represented by srcStageMask  before it can execute.

Stages
  • VkPipelineStageFlagBits2 .

  • TOP_OF_PIPE  and BOTTOM_OF_PIPE :

    • These stages are essentially “helper” stages, which do no actual work, but serve some important purposes. Every command will first execute the TOP_OF_PIPE  stage. This is basically the command processor on the GPU parsing the command. BOTTOM_OF_PIPE  is where commands retire after all work has been done.

    • Both these pipeline stages are deprecated, and applications should prefer ALL_COMMANDS  and NONE .

    • Memory Access :

      • Never use AccessMask != 0  with these stages. These stages do not perform memory accesses . Any srcAccessMask  and dstAccessMask  combination with either stage is meaningless, and the spec disallows this.

      • TOP_OF_PIPE  and BOTTOM_OF_PIPE  are purely there for the sake of execution barriers, not memory barriers.

  • TOP_OF_PIPE

    • In the first scope:

      • Equivalent to NONE

      • Is basically saying “wait for nothing”, or to be more precise, we’re waiting for the GPU to parse all commands.

        • We had to parse all commands before getting to the pipeline barrier command to begin with.

    • In the second scope:

      • Equivalent to ALL_COMMANDS  with VkAccessFlags2  set to 0 .

  • BOTTOM_OF_PIPE

    • In the first scope:

      • Equivalent to ALL_COMMANDS , with VkAccessFlags2  set to 0 .

    • In the second scope:

      • Equivalent to NONE .

      • Basically translates to “block the last stage of execution in the pipeline”.

      • “No work after this barrier is going to wait for us”.

  • NONE

    • Specifies no stages of execution.

  • ALL_COMMANDS

    • Specifies all operations performed by all commands supported on the queue it is used with.

    • Basically drains the entire queue for work.

  • ALL_GRAPHICS

    • Specifies the execution of all graphics pipeline stages.

    • It's the same as ALL_COMMANDS , but only for render passes.

    • Is equivalent to the logical OR of:

      • DRAW_INDIRECT

      • COPY_INDIRECT

      • TASK_SHADER

      • MESH_SHADER

      • VERTEX_INPUT

      • VERTEX_SHADER

      • TESSELLATION_CONTROL_SHADER

      • TESSELLATION_EVALUATION_SHADER

      • GEOMETRY_SHADER

      • FRAGMENT_SHADER

      • EARLY_FRAGMENT_TESTS

      • LATE_FRAGMENT_TESTS

      • COLOR_ATTACHMENT_OUTPUT

      • CONDITIONAL_RENDERING

      • TRANSFORM_FEEDBACK

      • FRAGMENT_SHADING_RATE_ATTACHMENT

      • FRAGMENT_DENSITY_PROCESS

      • SUBPASS_SHADER

      • INVOCATION_MASK

      • CLUSTER_CULLING_SHADER

Order of execution stages
  • Ignoring TOP_OF_PIPE  and BOTTOM_OF_PIPE .

  • Graphics primitive pipeline :

    • DRAW_INDIRECT

      • Parses indirect buffers.

    • COPY_INDIRECT

    • INDEX_INPUT

    • VERTEX_ATTRIBUTE_INPUT

      • Consumes fixed-function VBOs and IBOs.

    • VERTEX_SHADER

    • TESSELLATION_CONTROL_SHADER

    • TESSELLATION_EVALUATION_SHADER

    • GEOMETRY_SHADER

    • TRANSFORM_FEEDBACK

    • FRAGMENT_SHADING_RATE_ATTACHMENT

    • EARLY_FRAGMENT_TESTS

      • Early  depth/stencil tests.

      • Render pass performs its loadOp  of a depth/stencil attachment.

      • This stage isn’t all that useful or meaningful except in some very obscure scenarios with frame buffer self-dependencies (aka, GL_ARB_texture_barrier ).

      • When blocking a render pass with dstStageMask , just use a mask of EARLY_FRAGMENT_TESTS | LATE_FRAGMENT_TESTS .

      • dstStageMask = EARLY_FRAGMENT_TESTS  alone might work since that will block loadOp , but there might be shenanigans with memory barriers if you are 100% pedantic about any memory access happening in LATE_FRAGMENT_TESTS . If you’re blocking an early stage, it never hurts to block a later stage as well.

    • FRAGMENT_SHADER

    • LATE_FRAGMENT_TESTS

      • Late  depth-stencil tests.

      • Render pass performs its storeOp  of a depth/stencil attachment when a render pass is done.

      • When you’re waiting for a depth map to have been rendered in an earlier render pass, you should use srcStageMask = LATE_FRAGMENT_TESTS , as that will wait for the storeOp  to finish its work.

    • COLOR_ATTACHMENT_OUTPUT

      • This one is where loadOp , storeOp , MSAA resolves, and the framebuffer blend stage take place.

      • Basically anything that touches a color attachment in a render pass in some way.

      • If you’re waiting for a render pass which uses color to be complete, use srcStageMask = COLOR_ATTACHMENT_OUTPUT , and similar for dstStageMask  when blocking render passes from execution.

      • Usage as dstStageMask :

        • COLOR_ATTACHMENT_OUTPUT  is the appropriate dstStageMask  when you are transitioning an image so it can be written as a color attachment.

  • Graphics mesh pipeline :

    • DRAW_INDIRECT

    • TASK_SHADER

    • MESH_SHADER

    • FRAGMENT_SHADING_RATE_ATTACHMENT

    • EARLY_FRAGMENT_TESTS

    • FRAGMENT_SHADER

    • LATE_FRAGMENT_TESTS

    • COLOR_ATTACHMENT_OUTPUT

  • Compute pipeline :

    • DRAW_INDIRECT

    • COPY_INDIRECT

    • COMPUTE_SHADER

  • Transfer pipeline :

    • COPY_INDIRECT

    • TRANSFER

  • Subpass shading pipeline :

    • SUBPASS_SHADER

  • Graphics pipeline commands executing in a render pass with a fragment density map attachment : (almost unordered)

    • The following pipeline stage where the fragment density map read happens has no particular order  relative to the other stages.

    • It is logically earlier than EARLY_FRAGMENT_TESTS , so:

      • FRAGMENT_DENSITY_PROCESS

      • EARLY_FRAGMENT_TESTS

  • Conditional rendering stage : (unordered)

    • Is formally part of both the graphics, and the compute pipeline.

    • The predicate read has unspecified order relative to other stages of these pipelines:

    • CONDITIONAL_RENDERING

  • Host operations :

    • Only one pipeline stage occurs.

    • HOST

  • Command preprocessing pipeline :

    • COMMAND_PREPROCESS

  • Acceleration structure build operations :

    • Only one pipeline stage occurs.

    • ACCELERATION_STRUCTURE_BUILD

  • Acceleration structure copy operations :

    • Only one pipeline stage occurs.

    • ACCELERATION_STRUCTURE_COPY

  • Opacity micromap build operations :

    • Only one pipeline stage occurs.

    • MICROMAP_BUILD

  • Ray tracing pipeline :

    • DRAW_INDIRECT

    • RAY_TRACING_SHADER

  • Video decode pipeline :

    • VIDEO_DECODE

  • Video encode pipeline :

    • VIDEO_ENCODE

  • Data graph pipeline :

    • DATA_GRAPH

Memory Access

  • Access scopes  do not interact with the logically earlier or later stages for either scope - only the stages the application specifies are considered part of each access scope.

  • These flags represent memory access that can be performed.

  • Each pipeline stage can perform certain memory accesses, and thus we take the combination of pipeline stage + access mask and we get potentially a very large number of incoherent caches on the system.

  • Each GPU core has its own set of L1 caches as well.

  • Real GPUs will only have a fraction of the possible caches here, but as long as we are explicit about this in the API, any GPU driver can simplify this as needed.

  • Access masks either read from a cache, or write to an L1 cache in our mental model.

  • Certain access types are only performed by a subset of pipeline stages.

  • "Had this access ( srcAccessMask ) and it's going to have this access ( dstAccessMask )".

  • srcAccessMask

    • Lists the access types that happened before  the barrier (the producer accesses) and that must be made available/visible by the barrier.

    • Must describe the kinds of accesses that actually happened before the barrier (the producer accesses you need to make available/visible) .

    • It does not  describe what you want the resource to become after the barrier — that is expressed by dstAccessMask  (what will happen after).

    • The stage masks (src/dst stage) specify the pipeline stages that contain those accesses.

    • srcAccessMask = 0  means “there are no prior GPU memory accesses that this barrier needs to make available” (i.e. nothing to claim as the producer side).

  • dstAccessMask

    • Lists the access types that will happen after  the barrier (the consumer accesses) and that must see the producer’s writes.

    • dstAccessMask = 0  means “there are no subsequent GPU memory accesses that this barrier needs to order/make visible to” (i.e. no GPU consumer to describe with access bits).

Access Flags
  • VkAccessFlagBits2 .

  • MEMORY_READ

    • Specifies all read accesses.

    • It is always valid in any access mask, and is treated as equivalent to setting all READ  access flags that are valid where it is used.

  • MEMORY_WRITE

    • Specifies all write accesses.

    • It is always valid in any access mask, and is treated as equivalent to setting all WRITE  access flags that are valid where it is used.

  • SHADER_READ

    • Same as SAMPLED_READ  + STORAGE_READ  + TILE_ATTACHMENT_READ .

  • SHADER_SAMPLED_READ

  • HOST_READ

    • Specifies read access by a host operation. Accesses of this type are not performed through a resource, but directly on memory.

    • Such access occurs in the PIPELINE_STAGE_2_HOST  pipeline stage.

  • HOST_WRITE

    • Specifies write access by a host operation. Accesses of this type are not performed through a resource, but directly on memory.

    • Such access occurs in the PIPELINE_STAGE_2_HOST  pipeline stage.

Access Flag -> Pipeline Stages

| Access flag                             | Pipeline stages                                                                                                                                                                                                                                                                                         |
|-----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| NONE                                   | Any                                                                                                                                                                                                                                                                                                     |
| INDIRECT_COMMAND_READ                  | DRAW_INDIRECT , ACCELERATION_STRUCTURE_BUILD , COPY_INDIRECT                                                                                                                                                                                                                                         |
| INDEX_READ                             | VERTEX_INPUT , INDEX_INPUT                                                                                                                                                                                                                                                                            |
| VERTEX_ATTRIBUTE_READ                  | VERTEX_INPUT , VERTEX_ATTRIBUTE_INPUT                                                                                                                                                                                                                                                                 |
| UNIFORM_READ                           | VERTEX_SHADER , TESSELLATION_CONTROL_SHADER , TESSELLATION_EVALUATION_SHADER , GEOMETRY_SHADER , FRAGMENT_SHADER , COMPUTE_SHADER , RAY_TRACING_SHADER , TASK_SHADER , MESH_SHADER , SUBPASS_SHADER , CLUSTER_CULLING_SHADER                                                                 |
| INPUT_ATTACHMENT_READ                  | FRAGMENT_SHADER , SUBPASS_SHADER                                                                                                                                                                                                                                                                      |
| SHADER_READ                            | ACCELERATION_STRUCTURE_BUILD , MICROMAP_BUILD , VERTEX_SHADER , TESSELLATION_CONTROL_SHADER , TESSELLATION_EVALUATION_SHADER , GEOMETRY_SHADER , FRAGMENT_SHADER , COMPUTE_SHADER , RAY_TRACING_SHADER , TASK_SHADER , MESH_SHADER , SUBPASS_SHADER , CLUSTER_CULLING_SHADER               |
| SHADER_WRITE                           | VERTEX_SHADER , TESSELLATION_CONTROL_SHADER , TESSELLATION_EVALUATION_SHADER , GEOMETRY_SHADER , FRAGMENT_SHADER , COMPUTE_SHADER , RAY_TRACING_SHADER , TASK_SHADER , MESH_SHADER , SUBPASS_SHADER , CLUSTER_CULLING_SHADER                                                                 |
| COLOR_ATTACHMENT_READ                  | FRAGMENT_SHADER , COLOR_ATTACHMENT_OUTPUT                                                                                                                                                                                                                                                             |
| COLOR_ATTACHMENT_WRITE                 | COLOR_ATTACHMENT_OUTPUT                                                                                                                                                                                                                                                                                |
| DEPTH_STENCIL_ATTACHMENT_READ          | FRAGMENT_SHADER , EARLY_FRAGMENT_TESTS , LATE_FRAGMENT_TESTS                                                                                                                                                                                                                                         |
| DEPTH_STENCIL_ATTACHMENT_WRITE         | EARLY_FRAGMENT_TESTS , LATE_FRAGMENT_TESTS                                                                                                                                                                                                                                                            |
| TRANSFER_READ                          | ALL_TRANSFER , COPY , RESOLVE , BLIT , ACCELERATION_STRUCTURE_BUILD , ACCELERATION_STRUCTURE_COPY , MICROMAP_BUILD , CONVERT_COOPERATIVE_VECTOR_MATRIX                                                                                                                                          |
| TRANSFER_WRITE                         | ALL_TRANSFER , COPY , RESOLVE , BLIT , CLEAR , ACCELERATION_STRUCTURE_BUILD , ACCELERATION_STRUCTURE_COPY , MICROMAP_BUILD , CONVERT_COOPERATIVE_VECTOR_MATRIX                                                                                                                                 |
| HOST_READ                              | HOST                                                                                                                                                                                                                                                                                                   |
| HOST_WRITE                             | HOST                                                                                                                                                                                                                                                                                                   |
| MEMORY_READ                            | Any                                                                                                                                                                                                                                                                                                     |
| MEMORY_WRITE                           | Any                                                                                                                                                                                                                                                                                                     |
| SHADER_SAMPLED_READ                    | VERTEX_SHADER , TESSELLATION_CONTROL_SHADER , TESSELLATION_EVALUATION_SHADER , GEOMETRY_SHADER , FRAGMENT_SHADER , COMPUTE_SHADER , RAY_TRACING_SHADER , TASK_SHADER , MESH_SHADER , SUBPASS_SHADER , CLUSTER_CULLING_SHADER                                                                 |
| SHADER_STORAGE_READ                    | VERTEX_SHADER , TESSELLATION_CONTROL_SHADER , TESSELLATION_EVALUATION_SHADER , GEOMETRY_SHADER , FRAGMENT_SHADER , COMPUTE_SHADER , RAY_TRACING_SHADER , TASK_SHADER , MESH_SHADER , SUBPASS_SHADER , CLUSTER_CULLING_SHADER                                                                 |
| SHADER_STORAGE_WRITE                   | VERTEX_SHADER , TESSELLATION_CONTROL_SHADER , TESSELLATION_EVALUATION_SHADER , GEOMETRY_SHADER , FRAGMENT_SHADER , COMPUTE_SHADER , RAY_TRACING_SHADER , TASK_SHADER , MESH_SHADER , SUBPASS_SHADER , CLUSTER_CULLING_SHADER                                                                 |
| VIDEO_DECODE_READ                      | VIDEO_DECODE                                                                                                                                                                                                                                                                                           |
| VIDEO_DECODE_WRITE                     | VIDEO_DECODE                                                                                                                                                                                                                                                                                           |
| VIDEO_ENCODE_READ                      | VIDEO_ENCODE                                                                                                                                                                                                                                                                                           |
| VIDEO_ENCODE_WRITE                     | VIDEO_ENCODE                                                                                                                                                                                                                                                                                           |
| TRANSFORM_FEEDBACK_WRITE               | TRANSFORM_FEEDBACK                                                                                                                                                                                                                                                                                     |
| TRANSFORM_FEEDBACK_COUNTER_READ        | DRAW_INDIRECT , TRANSFORM_FEEDBACK                                                                                                                                                                                                                                                                    |
| TRANSFORM_FEEDBACK_COUNTER_WRITE       | TRANSFORM_FEEDBACK                                                                                                                                                                                                                                                                                     |
| CONDITIONAL_RENDERING_READ             | CONDITIONAL_RENDERING                                                                                                                                                                                                                                                                                  |
| COMMAND_PREPROCESS_READ                | COMMAND_PREPROCESS                                                                                                                                                                                                                                                                                     |
| COMMAND_PREPROCESS_WRITE               | COMMAND_PREPROCESS                                                                                                                                                                                                                                                                                     |
| FRAGMENT_SHADING_RATE_ATTACHMENT_READ  | FRAGMENT_SHADING_RATE_ATTACHMENT                                                                                                                                                                                                                                                                       |
| ACCELERATION_STRUCTURE_READ            | VERTEX_SHADER , TESSELLATION_CONTROL_SHADER , TESSELLATION_EVALUATION_SHADER , GEOMETRY_SHADER , FRAGMENT_SHADER , COMPUTE_SHADER , RAY_TRACING_SHADER , TASK_SHADER , MESH_SHADER , CLUSTER_CULLING_SHADER , ACCELERATION_STRUCTURE_BUILD , ACCELERATION_STRUCTURE_COPY , SUBPASS_SHADER  |
| ACCELERATION_STRUCTURE_WRITE           | ACCELERATION_STRUCTURE_BUILD , ACCELERATION_STRUCTURE_COPY                                                                                                                                                                                                                                            |
| FRAGMENT_DENSITY_MAP_READ              | FRAGMENT_DENSITY_PROCESS                                                                                                                                                                                                                                                                               |
| COLOR_ATTACHMENT_READ_NONCOHERENT      | COLOR_ATTACHMENT_OUTPUT                                                                                                                                                                                                                                                                                |
| DESCRIPTOR_BUFFER_READ                 | VERTEX_SHADER , TESSELLATION_CONTROL_SHADER , TESSELLATION_EVALUATION_SHADER , GEOMETRY_SHADER , FRAGMENT_SHADER , COMPUTE_SHADER , RAY_TRACING_SHADER , TASK_SHADER , MESH_SHADER , SUBPASS_SHADER , CLUSTER_CULLING_SHADER                                                                 |
| INVOCATION_MASK_READ                   | INVOCATION_MASK                                                                                                                                                                                                                                                                                        |
| MICROMAP_READ                          | MICROMAP_BUILD , ACCELERATION_STRUCTURE_BUILD                                                                                                                                                                                                                                                         |
| MICROMAP_WRITE                         | MICROMAP_BUILD                                                                                                                                                                                                                                                                                         |
| OPTICAL_FLOW_READ                      | OPTICAL_FLOW                                                                                                                                                                                                                                                                                           |
| OPTICAL_FLOW_WRITE                     | OPTICAL_FLOW                                                                                                                                                                                                                                                                                           |
| SHADER_TILE_ATTACHMENT_READ            | FRAGMENT_SHADER , COMPUTE_SHADER                                                                                                                                                                                                                                                                      |
| SHADER_TILE_ATTACHMENT_WRITE           | FRAGMENT_SHADER , COMPUTE_SHADER                                                                                                                                                                                                                                                                      |
| DATA_GRAPH_READ                        | DATA_GRAPH                                                                                                                                                                                                                                                                                             |
| DATA_GRAPH_WRITE                       | DATA_GRAPH                                                                                                                                                                                                                                                                                             |

Pipeline Stage -> Access Flags

| Pipeline stage                      | Access flags                                                                                                                                                                                                                                                                                                                   |
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| ACCELERATION_STRUCTURE_BUILD       | ACCELERATION_STRUCTURE_READ , ACCELERATION_STRUCTURE_WRITE , INDIRECT_COMMAND_READ , MICROMAP_READ , SHADER_READ , TRANSFER_READ , TRANSFER_WRITE                                                                                                                                                                       |
| ACCELERATION_STRUCTURE_COPY        | ACCELERATION_STRUCTURE_READ , ACCELERATION_STRUCTURE_WRITE , TRANSFER_READ , TRANSFER_WRITE                                                                                                                                                                                                                                |
| ALL_TRANSFER                       | TRANSFER_READ , TRANSFER_WRITE                                                                                                                                                                                                                                                                                               |
| ANY                                | MEMORY_READ , MEMORY_WRITE , NONE                                                                                                                                                                                                                                                                                           |
| BLIT                               | TRANSFER_READ , TRANSFER_WRITE                                                                                                                                                                                                                                                                                               |
| CLEAR                              | TRANSFER_WRITE                                                                                                                                                                                                                                                                                                                |
| CLUSTER_CULLING_SHADER             | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                                                    |
| COLOR_ATTACHMENT_OUTPUT            | COLOR_ATTACHMENT_READ , COLOR_ATTACHMENT_READ_NONCOHERENT , COLOR_ATTACHMENT_WRITE                                                                                                                                                                                                                                          |
| COMMAND_PREPROCESS                 | COMMAND_PREPROCESS_READ , COMMAND_PREPROCESS_WRITE                                                                                                                                                                                                                                                                           |
| COMPUTE_SHADER                     | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_TILE_ATTACHMENT_READ , SHADER_TILE_ATTACHMENT_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                     |
| CONDITIONAL_RENDERING              | CONDITIONAL_RENDERING_READ                                                                                                                                                                                                                                                                                                    |
| CONVERT_COOPERATIVE_VECTOR_MATRIX  | TRANSFER_READ , TRANSFER_WRITE                                                                                                                                                                                                                                                                                               |
| COPY                               | TRANSFER_READ , TRANSFER_WRITE                                                                                                                                                                                                                                                                                               |
| COPY_INDIRECT                      | INDIRECT_COMMAND_READ                                                                                                                                                                                                                                                                                                         |
| DATA_GRAPH                         | DATA_GRAPH_READ , DATA_GRAPH_WRITE                                                                                                                                                                                                                                                                                           |
| DRAW_INDIRECT                      | INDIRECT_COMMAND_READ , TRANSFORM_FEEDBACK_COUNTER_READ                                                                                                                                                                                                                                                                      |
| EARLY_FRAGMENT_TESTS               | DEPTH_STENCIL_ATTACHMENT_READ , DEPTH_STENCIL_ATTACHMENT_WRITE                                                                                                                                                                                                                                                               |
| FRAGMENT_DENSITY_PROCESS           | FRAGMENT_DENSITY_MAP_READ                                                                                                                                                                                                                                                                                                     |
| FRAGMENT_SHADER                    | ACCELERATION_STRUCTURE_READ , COLOR_ATTACHMENT_READ , DEPTH_STENCIL_ATTACHMENT_READ , DESCRIPTOR_BUFFER_READ , INPUT_ATTACHMENT_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_TILE_ATTACHMENT_READ , SHADER_TILE_ATTACHMENT_WRITE , SHADER_WRITE , UNIFORM_READ  |
| FRAGMENT_SHADING_RATE_ATTACHMENT   | FRAGMENT_SHADING_RATE_ATTACHMENT_READ                                                                                                                                                                                                                                                                                         |
| GEOMETRY_SHADER                    | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                                                    |
| HOST                               | HOST_READ , HOST_WRITE                                                                                                                                                                                                                                                                                                       |
| INDEX_INPUT                        | INDEX_READ                                                                                                                                                                                                                                                                                                                    |
| INVOCATION_MASK                    | INVOCATION_MASK_READ                                                                                                                                                                                                                                                                                                          |
| LATE_FRAGMENT_TESTS                | DEPTH_STENCIL_ATTACHMENT_READ , DEPTH_STENCIL_ATTACHMENT_WRITE                                                                                                                                                                                                                                                               |
| MESH_SHADER                        | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                                                    |
| MICROMAP_BUILD                     | MICROMAP_READ , MICROMAP_WRITE , SHADER_READ , TRANSFER_READ , TRANSFER_WRITE                                                                                                                                                                                                                                             |
| OPTICAL_FLOW                       | OPTICAL_FLOW_READ , OPTICAL_FLOW_WRITE                                                                                                                                                                                                                                                                                       |
| RAY_TRACING_SHADER                 | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                                                    |
| RESOLVE                            | TRANSFER_READ , TRANSFER_WRITE                                                                                                                                                                                                                                                                                               |
| SUBPASS_SHADER                     | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , INPUT_ATTACHMENT_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                           |
| TASK_SHADER                        | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                                                    |
| TESSELLATION_CONTROL_SHADER        | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                                                    |
| TESSELLATION_EVALUATION_SHADER     | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                                                    |
| TRANSFORM_FEEDBACK                 | TRANSFORM_FEEDBACK_COUNTER_READ , TRANSFORM_FEEDBACK_COUNTER_WRITE , TRANSFORM_FEEDBACK_WRITE                                                                                                                                                                                                                               |
| VERTEX_ATTRIBUTE_INPUT             | VERTEX_ATTRIBUTE_READ                                                                                                                                                                                                                                                                                                         |
| VERTEX_INPUT                       | INDEX_READ , VERTEX_ATTRIBUTE_READ                                                                                                                                                                                                                                                                                           |
| VERTEX_SHADER                      | ACCELERATION_STRUCTURE_READ , DESCRIPTOR_BUFFER_READ , SHADER_READ , SHADER_SAMPLED_READ , SHADER_STORAGE_READ , SHADER_STORAGE_WRITE , SHADER_WRITE , UNIFORM_READ                                                                                                                                                    |
| VIDEO_DECODE                       | VIDEO_DECODE_READ , VIDEO_DECODE_WRITE                                                                                                                                                                                                                                                                                       |
| VIDEO_ENCODE                       | VIDEO_ENCODE_READ , VIDEO_ENCODE_WRITE                                                                                                                                                                                                                                                                                       |

Pipeline Barriers

  • Pipeline barriers also provide synchronization control within a command buffer, but at a single point, rather than with separate signal and wait operations. Pipeline barriers can  be used to control resource access within a single queue.

  • Gives control over which pipeline stages need to wait on previous pipeline stages when a command buffer is executed.

  • Nvidia: Minimize the use of barriers. A barrier may cause a GPU pipeline flush. We have seen redundant barriers and associated wait for idle operations as a major performance problem for ports to modern APIs.

  • Nvidia: Prefer a buffer/image barrier over a global memory barrier so the driver can better optimize and schedule it, unless a single memory barrier lets you merge many buffer/image barriers together.

  • Nvidia: Group barriers in one call to vkCmdPipelineBarrier2() . This way, the worst case can be picked instead of sequentially going through all barriers.

  • Nvidia: Don’t insert redundant barriers, as this limits parallelism; avoid read-to-read barriers.

  • vkCmdPipelineBarrier2() .

    • When submitted to a queue, it defines memory dependencies between commands that were submitted to the same queue before  it, and those submitted to the same queue after  it.

    • commandBuffer

      • Is the command buffer into which the command is recorded.

    • pDependencyInfo

      • VkDependencyInfo .

      • Specifies the dependency information for a synchronization command.

      • This structure defines a set of memory dependencies , as well as queue family ownership transfer operations  and image layout transitions .

      • Each member of pMemoryBarriers , pBufferMemoryBarriers , and pImageMemoryBarriers  defines a separate memory dependency .

      • dependencyFlags

      • memoryBarrierCount

        • Is the length of the pMemoryBarriers  array.

      • pMemoryBarriers

        • VkMemoryBarrier2 .

        • Specifies a global memory barrier.

        • srcStageMask

        • srcAccessMask

        • dstStageMask

        • dstAccessMask

      • bufferMemoryBarrierCount

        • Is the length of the pBufferMemoryBarriers  array.

      • pBufferMemoryBarriers

        • VkBufferMemoryBarrier2 .

        • Specifies a buffer memory barrier.

        • Defines a memory dependency  limited to a range of a buffer, and can  define a queue family ownership transfer operation  for that range.

        • Both access scopes  are limited to only memory accesses to buffer  in the range defined by offset  and size .

        • srcStageMask

        • srcAccessMask

        • dstStageMask

        • dstAccessMask

        • srcQueueFamilyIndex

        • dstQueueFamilyIndex

        • buffer

          • Is a handle to the buffer whose backing memory is affected by the barrier.

        • offset

          • Is an offset in bytes into the backing memory for buffer ; this is relative to the base offset as bound to the buffer (see vkBindBufferMemory ).

        • size

          • Is a size in bytes of the affected area of backing memory for buffer , or WHOLE_SIZE  to use the range from offset  to the end of the buffer.

      • imageMemoryBarrierCount

        • Is the length of the pImageMemoryBarriers  array.

      • pImageMemoryBarriers

        • VkImageMemoryBarrier2 .

        • Specifies an image memory barrier.

        • Defines a memory dependency  limited to an image subresource range, and can  define a queue family ownership transfer operation  and image layout transition  for that subresource range.

        • Image Transition :

          • If oldLayout  is not equal to newLayout , then the memory barrier defines an image layout transition for the specified image subresource range.

          • If this memory barrier defines a queue family ownership transfer operation , the layout transition is only executed once between the queues.

          • When the old and new layout are equal, the layout values are ignored - data is preserved no matter what values are specified, or what layout the image is currently in.

        • srcStageMask

        • srcAccessMask

        • dstStageMask

        • dstAccessMask

        • srcQueueFamilyIndex

        • dstQueueFamilyIndex

        • oldLayout

        • newLayout

        • image

          • Is a handle to the image affected by this barrier.

        • subresourceRange
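The parameters above can be sketched as a single image barrier recorded with vkCmdPipelineBarrier2() . A minimal fragment, assuming Vulkan 1.3 and pre-existing cmd  and image  handles (names are illustrative):

```cpp
// Transition a color attachment so a later fragment shader can sample it.
VkImageMemoryBarrier2 imageBarrier{};
imageBarrier.sType         = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2;
imageBarrier.srcStageMask  = VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT;
imageBarrier.srcAccessMask = VK_ACCESS_2_COLOR_ATTACHMENT_WRITE_BIT; // make writes available
imageBarrier.dstStageMask  = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT;
imageBarrier.dstAccessMask = VK_ACCESS_2_SHADER_SAMPLED_READ_BIT;    // make them visible to sampling
imageBarrier.oldLayout     = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
imageBarrier.newLayout     = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
imageBarrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED; // no ownership transfer
imageBarrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
imageBarrier.image            = image;
imageBarrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

VkDependencyInfo depInfo{};
depInfo.sType                   = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
depInfo.imageMemoryBarrierCount = 1;
depInfo.pImageMemoryBarriers    = &imageBarrier;

vkCmdPipelineBarrier2(cmd, &depInfo);
```

Setting both queue family indices to VK_QUEUE_FAMILY_IGNORED  means no queue family ownership transfer is performed.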

Execution Barrier

  • Every command you submit to Vulkan goes through a set of stages. Draw calls, copy commands and compute dispatches all go through pipeline stages one by one. This represents the heart of the Vulkan synchronization model.

  • Operations performed by synchronization commands (e.g. availability operations  and visibility operations ) are not executed by a defined pipeline stage. However other commands can still synchronize with them by using the synchronization scopes to create a dependency chain.

  • When we synchronize work in Vulkan, we synchronize work happening in these pipeline stages as a whole, and not individual commands of work.

  • Vulkan does not let you add fine-grained dependencies between individual commands. Instead you get to look at all work which happens in certain pipeline stages.

Memory Barriers

  • Execution order and memory order are two different things.

  • Memory barriers are the tools we can use to ensure that caches are flushed and our memory writes from commands executed before the barrier are available to the pending after-barrier commands. They are also the tool we can use to invalidate caches so that the latest data is visible to the cores that will execute after-barrier commands.

  • In contrast to execution barriers, these access masks only apply to the precise stages set in the stage masks, and are not extended to logically earlier and later stages.

  • GPUs are notorious for having multiple, incoherent caches which all need to be carefully managed to avoid glitched out rendering.

  • This means that just synchronizing execution alone is not enough to ensure that different units on the GPU can transfer data between themselves.

  • Memory being available  and memory being visible  are an abstraction over the fact that GPUs have incoherent caches.

  • For GPU reading operations from CPU-written data, a call to vkQueueSubmit  acts as a host memory dependency on any CPU writes to GPU-accessible memory, so long as those writes were made prior  to the function call.

  • If you need more fine-grained write dependency (you want the GPU to be able to execute some stuff in a batch while you're writing data, for example), or if you need to read data written by the GPU, you need an explicit dependency.

  • For in-batch GPU reading, this could be handled by an event; the host sets the event after writing the memory, and the command buffer operation that reads the memory first issues vkCmdWaitEvents  for that event. And you'll need to set the appropriate memory barriers and source/destination stages.

  • For CPU reading of GPU-written data, this could be an event, a timeline semaphore, or a fence.

  • But overall, CPU writes to GPU-accessible memory still need some form of synchronization.

Global Memory Barriers

  • VkMemoryBarrier2 .

  • A global memory barrier deals with access to any resource, and it’s the simplest form of a memory barrier.

  • In vkCmdPipelineBarrier2 , we are specifying 4 things to happen in order:

    • Wait for srcStageMask  to complete

    • Make all writes performed in possible combinations of srcStageMask  + srcAccessMask   available

    • Make available  memory visible  to possible combinations of dstStageMask  + dstAccessMask .

    • Unblock work in dstStageMask.

  • A common misconception I see is that _READ  flags are passed into srcAccessMask , but this is redundant .

    • It does not make sense to make reads available.

    • Ex : you don’t flush caches when you’re done reading data.
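The four steps above can be sketched as a global barrier between two compute dispatches. Note that srcAccessMask  contains only write bits, as described. Assumes an existing command buffer cmd  (illustrative name):

```cpp
// Dispatch A writes a storage buffer; dispatch B reads it.
VkMemoryBarrier2 barrier{};
barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER_2;
barrier.srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
barrier.srcAccessMask = VK_ACCESS_2_SHADER_STORAGE_WRITE_BIT; // writes only; no _READ flags
barrier.dstStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
barrier.dstAccessMask = VK_ACCESS_2_SHADER_STORAGE_READ_BIT;

VkDependencyInfo depInfo{};
depInfo.sType              = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
depInfo.memoryBarrierCount = 1;
depInfo.pMemoryBarriers    = &barrier;

vkCmdDispatch(cmd, 64, 1, 1);        // dispatch A: produces data
vkCmdPipelineBarrier2(cmd, &depInfo); // wait, make available, make visible, unblock
vkCmdDispatch(cmd, 64, 1, 1);        // dispatch B: consumes data
```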

Buffer Memory Barrier

  • We’re just restricting memory availability and visibility to a specific buffer.

  • TheMaister: No GPU I know of actually cares, I think it makes more sense to just use VkMemoryBarrier rather than bothering with buffer barriers.

Image Memory Barrier / Image Layout Transition

  • VkImageLayout .

  • Image subresources can be transitioned from one layout to another as part of a memory dependency  (e.g. by using an image memory barrier ).

  • Image layout transitions are done as part of an image memory barrier.

  • The layout transition happens in-between the make available  and make visible  stages of a memory barrier.

  • The layout transition itself is considered a read/write operation, and the rules are basically that memory for the image must be available  before the layout transition takes place.

  • After a layout transition, that memory is automatically made available  (but not visible !).

  • Basically, think of the layout transition as some kind of in-place data munging which happens in L2 cache somehow.

  • How :

    • If a layout transition is specified in a memory dependency.

  • When :

    • It happens-after the availability  operations in the memory dependency, and happens-before the visibility  operations.

    • Layout transitions that are performed via image memory barriers execute in their entirety in submission order , relative to other image layout transitions submitted to the same queue, including those performed by render passes.

    • This ordering of image layout transitions only applies if the implementation performs actual read/write operations during the transition.

    • An application must  not rely on ordering of image layout transitions to influence ordering of other commands.

  • Ensure :

    • Image layout transitions may perform read and write accesses on all memory bound to the image subresource range, so applications must  ensure that all memory writes have been made available  before a layout transition is executed.

  • Available  memory is automatically made visible  to a layout transition, and writes performed by a layout transition are automatically made available .

Old Layout
  • The old layout must  either be UNDEFINED , or match the current layout of the image subresource range.

    • If the old layout matches the current layout of the image subresource range, the transition preserves the contents of that range.

    • If the old layout is UNDEFINED , the contents of that range may  be discarded. This can provide performance or power benefits.

      • Nvidia: Use UNDEFINED  when the previous content of the image is not needed.

  • Tile-based architectures may be able to avoid flushing tile data to memory, and immediate style renderers may be able to achieve fast metadata clears to reinitialize frame buffer compression state, or similar.

  • If the contents of an attachment are not needed after a render pass completes, then applications should  use DONT_CARE .

  • Why Need the Old Layout in Vulkan Image Transitions .

    • Cool.

Recently allocated image
  • If we just allocated an image and want to start using it, all we need is a layout transition, and we don’t need to wait for anything before performing that transition.

  • It’s important to note that freshly allocated memory in Vulkan is always considered available  and visible  to all stages and access types. You cannot have stale caches when the memory was never accessed.
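A sketch of that first transition: since there is nothing to wait on and nothing to make available, the source scopes can be empty. Assumes Vulkan 1.3 and an illustrative image  handle:

```cpp
// First-use transition of a freshly allocated image, e.g. before a copy into it.
VkImageMemoryBarrier2 barrier{};
barrier.sType         = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2;
barrier.srcStageMask  = VK_PIPELINE_STAGE_2_NONE;  // nothing to wait for
barrier.srcAccessMask = VK_ACCESS_2_NONE;          // fresh memory: nothing to make available
barrier.dstStageMask  = VK_PIPELINE_STAGE_2_TRANSFER_BIT;
barrier.dstAccessMask = VK_ACCESS_2_TRANSFER_WRITE_BIT;
barrier.oldLayout     = VK_IMAGE_LAYOUT_UNDEFINED; // previous contents not needed
barrier.newLayout     = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.image            = image;
barrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

VkDependencyInfo depInfo{};
depInfo.sType                   = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
depInfo.imageMemoryBarrierCount = 1;
depInfo.pImageMemoryBarriers    = &barrier;

vkCmdPipelineBarrier2(cmd, &depInfo);
```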

Events / "Split Barriers"

  • A way to get overlapping work in-between barriers.

  • The idea of VkEvent  is to get some unrelated commands in-between the “before” and “after” sets of commands.

  • For advanced compute, this is a very important thing to know about, but not all GPUs and drivers can take advantage of this feature.

  • Nvidia: Use vkCmdSetEvent2  and vkCmdWaitEvents2  to issue an asynchronous barrier to avoid blocking execution.

Example
  • Example 1 :

    1. vkCmdDispatch

    2. vkCmdDispatch

    3. vkCmdSetEvent(event, srcStageMask = COMPUTE)

    4. vkCmdDispatch

    5. vkCmdWaitEvent(event, dstStageMask = COMPUTE)

    6. vkCmdDispatch

    7. vkCmdDispatch

    • The " before " set is now { 1 , 2 }, and the " after " set is { 6 , 7 }.

    • 4  here is not affected by any synchronization and it can fill in the parallelism “bubble” we get when draining the GPU of work from 1 , 2 , 3 .
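Example 1 could be recorded roughly as follows (synchronization2 API; cmd , event , and the dispatch sizes are illustrative):

```cpp
// Memory dependency between the "before" set {1,2} and the "after" set {6,7}:
// compute writes made available and visible to compute reads.
VkMemoryBarrier2 barrier{};
barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER_2;
barrier.srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
barrier.srcAccessMask = VK_ACCESS_2_SHADER_STORAGE_WRITE_BIT;
barrier.dstStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
barrier.dstAccessMask = VK_ACCESS_2_SHADER_STORAGE_READ_BIT;

VkDependencyInfo dep{};
dep.sType              = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
dep.memoryBarrierCount = 1;
dep.pMemoryBarriers    = &barrier;

vkCmdDispatch(cmd, 64, 1, 1);            // 1: "before" set
vkCmdDispatch(cmd, 64, 1, 1);            // 2: "before" set
vkCmdSetEvent2(cmd, event, &dep);        // 3: signal half of the split barrier
vkCmdDispatch(cmd, 64, 1, 1);            // 4: unrelated work, fills the bubble
vkCmdWaitEvents2(cmd, 1, &event, &dep);  // 5: wait half of the split barrier
vkCmdDispatch(cmd, 64, 1, 1);            // 6: "after" set
vkCmdDispatch(cmd, 64, 1, 1);            // 7: "after" set
```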


Semaphores and Fences

  • These objects are signaled as part of a vkQueueSubmit .

  • To signal a semaphore or fence, all previously submitted commands to the queue must complete.

  • If this were a regular pipeline barrier, we would have srcStageMask = ALL_COMMANDS . However, we also get a full memory barrier, in the sense that all pending writes are made available.  Essentially, srcAccessMask = MEMORY_WRITE .

  • Signaling a fence or semaphore works like a full cache flush. Submitting commands to the Vulkan queue makes all memory access performed by host visible to all stages and access masks. Basically, submitting a batch issues a cache invalidation on host visible memory.

  • A common mistake is to think that you need to do this invalidation manually when the CPU is writing into staging buffers or similar:

    • srcStageMask = HOST

    • dstStageMask = TRANSFER

    • srcAccessMask = HOST_WRITE

    • dstAccessMask = TRANSFER_READ

    • If the write happened before vkQueueSubmit , this is automatically done for you.

    • This kind of barrier is necessary if you are using vkCmdWaitEvents  where you wait for host to signal the event with vkSetEvent . In that case, you might be writing the necessary host data after   vkQueueSubmit  was called, which means you need a pipeline barrier like this. This is not exactly a common use case, but it’s important to understand when these API constructs are useful.

Semaphore
  • VkSemaphore

  • Semaphores facilitate GPU <-> GPU synchronization across Vulkan queues.

    • Used for syncing multiple command buffer submissions one after the other.

    • The CPU continues running without blocking.

  • Implicit memory guarantees when waiting for a Semaphore :

    • While signalling a semaphore makes all memory available , waiting for a semaphore makes memory visible .

    • This basically means you do not need a memory barrier if you use synchronization with semaphores, since signal/wait pairs of semaphores work like a full memory barrier.

    • Example :

      • Queue 1 writes to an SSBO in compute, and consumes that buffer as a UBO in a fragment shader in queue 2.

      • We’re going to assume the buffer was created with QUEUE_FAMILY_CONCURRENT .

      • Queue 1

        • vkCmdDispatch

        • vkQueueSubmit(signal = my_semaphore)

        • There is no pipeline barrier needed here.

        • Signalling the semaphore waits for all commands, and all writes in the dispatch are made available  to the device before the semaphore is actually signaled.

      • Queue 2

        • vkCmdBeginRenderPass

        • vkCmdDraw

        • vkCmdEndRenderPass

        • vkQueueSubmit(wait = my_semaphore, pDstWaitStageMask = FRAGMENT_SHADER)

        • When we wait for the semaphore, we specify which stages should wait for this semaphore, in this case the FRAGMENT_SHADER  stage.

        • All relevant memory access is automatically made visible , so we can safely access UNIFORM_READ  in FRAGMENT_SHADER  stage, without having extra barriers.

        • The semaphores take care of this automatically, nice!

  • Examples :

    • Basic signaling / waiting :

      • Let’s say we have semaphore S and queue operations A and B that we want to execute in order.

      • What we tell Vulkan is that operation A will 'signal' semaphore S when it finishes executing, and operation B will 'wait' on semaphore S before it begins executing.

      • When operation A finishes, semaphore S will be signaled, while operation B won’t start until S is signaled.

      • After operation B begins executing, semaphore S is automatically reset back to being unsignaled, allowing it to be used again.

    • Image Transition on Swapchain Images :

      • We need to wait for the image to be acquired, and only then can we perform a layout transition.

      • The best way to do this is to use pDstWaitStageMask = COLOR_ATTACHMENT_OUTPUT , and then use srcStageMask = COLOR_ATTACHMENT_OUTPUT  in a pipeline barrier which transitions the swapchain image after semaphore is signaled.
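The basic signal/wait pattern maps onto vkQueueSubmit  like this (classic pre-synchronization2 VkSubmitInfo ; handle names are illustrative):

```cpp
// Wait on imageAvailable before color attachment output; signal renderFinished
// once every command in this submission completes.
VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;

VkSubmitInfo submit{};
submit.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit.waitSemaphoreCount   = 1;
submit.pWaitSemaphores      = &imageAvailable;  // "wait" semaphore
submit.pWaitDstStageMask    = &waitStage;       // which stages block on the wait
submit.commandBufferCount   = 1;
submit.pCommandBuffers      = &cmd;
submit.signalSemaphoreCount = 1;
submit.pSignalSemaphores    = &renderFinished;  // "signal" semaphore

vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);
```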

  • Types of Semaphores :

    • Binary Semaphores :

      • A binary semaphore is either unsignaled or signaled.

      • It begins life as unsignaled.

      • The way we use a binary semaphore to order queue operations is by providing the same semaphore as a 'signal' semaphore in one queue operation and as a 'wait' semaphore in another queue operation.

      • Only binary semaphores will be used in this tutorial; further mentions of the term semaphore refer exclusively to binary semaphores.

    • Timeline Semaphores :

      • A timeline semaphore carries a monotonically increasing 64-bit value; signal and wait operations target specific counter values (core in Vulkan 1.2).

  • Correctly using the Semaphore for vkQueuePresent :

    • Swapchain Semaphore Reuse .

    • Since Vulkan SDK 1.4.313 , the validation layer reports cases where the present wait semaphore is not used safely:

      • This is currently reported as VUID-vkQueueSubmit-pSignalSemaphores-00067  or you may see "your VkSemaphore is being signaled by VkQueue, but it may still be in use by VkSwapchainKHR"

    • In this context, safely  means that the Vulkan specification guarantees the semaphore is no longer in use and can be reused.

    • The problem :

      • vkQueuePresentKHR  is different from the vkQueueSubmit  family of functions in that it does not provide a way to signal a semaphore or a fence (without additional extensions).

      • This means there is no way to wait for the presentation signal directly. It also means we don’t know whether VkPresentInfoKHR::pWaitSemaphores  are still in use by the presentation operation.

      • If vkQueuePresentKHR  could signal, then waiting on that signal would confirm that the present queue operation has finished — including the wait on VkPresentInfoKHR::pWaitSemaphores .

      • In summary, it’s not obvious when it’s safe to reuse present wait semaphores.

      • The Vulkan specification does not guarantee that waiting on a vkQueueSubmit  fence also synchronizes presentation operations.

    • The reuse of presentation resources should rely on vkAcquireNextImageKHR  or additional extensions, rather than on vkQueueSubmit  fences.

    • Solution options :

      1. Allocate one "submit finished" semaphore per swapchain image instead of per in-flight frame.

        • Allocate the submit_semaphores  array based on the number of swapchain images (instead of the number of in-flight frames)

        • Index this array using the acquired swapchain image index (instead of the current in-flight frame index)

      2. Using EXT_swapchain_maintenance1 .
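Option 1 can be sketched as indexing the signal semaphore by the acquired image index rather than the in-flight frame index (all names illustrative):

```cpp
// acquireSemaphores is sized per in-flight frame;
// submitSemaphores is sized per swapchain image (option 1).
uint32_t imageIndex = 0;
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                      acquireSemaphores[frameIndex], // per in-flight frame
                      VK_NULL_HANDLE, &imageIndex);

// ... record commands, then signal the semaphore owned by this *image*:
submit.signalSemaphoreCount = 1;
submit.pSignalSemaphores    = &submitSemaphores[imageIndex];
vkQueueSubmit(queue, 1, &submit, inFlightFence);

VkPresentInfoKHR present{};
present.sType              = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
present.waitSemaphoreCount = 1;
present.pWaitSemaphores    = &submitSemaphores[imageIndex]; // safe: reused only after
present.swapchainCount     = 1;                             // this image is re-acquired
present.pSwapchains        = &swapchain;
present.pImageIndices      = &imageIndex;
vkQueuePresentKHR(queue, &present);
```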

Fences
  • VkFence

  • Fences facilitate GPU -> CPU synchronization.

    • Used to know if a command buffer has finished being executed on the GPU.

  • While signalling a fence makes all memory available, it does not make that memory available to the CPU, only within the device. This is where the dstStageMask = PIPELINE_STAGE_HOST  and dstAccessMask = ACCESS_HOST_READ  flags come in: if you intend to read back data on the CPU, you must issue a pipeline barrier that makes the memory available to the HOST as well.

  • In our mental model, we can think of this as flushing the GPU L2 cache out to GPU main memory, so that CPU can access it over some bus interface.

  • In order to signal that fence, any pending writes to that memory must have been made available, so even recycled memory can be safely reused without a memory barrier. This point is kind of subtle, but it really helps your sanity not having to inject memory barriers everywhere.

  • Usage :

    • Similar to semaphores, fences are either in a signaled or unsignaled state.

    • Whenever we submit work to execute, we can attach a fence to that work. When the work is finished, the fence will be signaled.

    • Then we can make the CPU wait for the fence to be signaled, guaranteeing that the work has finished before the CPU continues.

    • Fences must be reset manually to put them back into the unsignaled state.

      • This is because fences are used to control the execution of the CPU, and so the CPU gets to decide when to reset the fence.

      • Contrast this to semaphores which are used to order work on the GPU without the CPU being involved.

    • Unlike the semaphore, the fence does  block CPU execution.

      • In general, it is preferable to not block the host unless necessary.

      • We want to feed the GPU and the host with useful work to do. Waiting on fences to signal is not useful work.

      • Thus, we prefer semaphores, or other synchronization primitives not yet covered, to synchronize our work.
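The usage pattern above, sketched for a typical frame loop (illustrative names):

```cpp
// Block until the previous submission using this fence has finished on the GPU.
vkWaitForFences(device, 1, &inFlightFence, VK_TRUE, UINT64_MAX);
// Fences must be reset manually back to the unsignaled state.
vkResetFences(device, 1, &inFlightFence);

// ... record this frame's command buffer ...

// Attach the fence to the submission; it is signaled when the work completes.
vkQueueSubmit(queue, 1, &submit, inFlightFence);
```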

  • Example :

    • Taking a screenshot :

      • Once we have already done the necessary work on the GPU, we now need to transfer the image from the GPU over to the host and then save the memory to a file.

      • We have command buffer A which executes the transfer and fence F. We submit command buffer A with fence F, then immediately tell the host to wait for F to signal. This causes the host to block until command buffer A finishes execution.

      • Thus, we are safe to let the host save the file to disk, as the memory transfer has completed.

      • Unlike the semaphore example, this example does  block host execution. This means the host won’t do anything except wait until the execution has finished. For this case, we had to make sure the transfer was complete before we could save the screenshot to disk.

Main Loop Synchronization

Command Buffers

  • Commands in Vulkan, like drawing operations and memory transfers, are not executed directly using function calls. You have to record all the operations you want to perform in command buffer objects.

  • The advantage of this is that when we are ready to tell Vulkan what we want to do, all the commands are submitted together. Vulkan can more efficiently process the commands since all of them are available together.

  • In addition, this allows command recording to happen in multiple threads  if so desired.

Command Pools

  • Command pools are used to create and allocate Command Buffers.

  • Command pools are opaque objects that command buffer memory is allocated from, and which allow the implementation to amortize the cost of resource creation across multiple command buffers.

Creation
  • vkCreateCommandPool() .

    • device

      • Is the logical device that creates the command pool.

    • pAllocator

    • pCommandPool

      • Is a pointer to a VkCommandPool  handle in which the created pool is returned.

    • pCreateInfo

      • VkCommandPoolCreateInfo .

      • queueFamilyIndex

        • Designates a queue family as described in section Queue Family Properties . All command buffers allocated from this command pool must  be submitted on queues from the same queue family.

        • Command buffers are executed by submitting them on one of the device queues (graphics and presentation queues, for example).

        • Each command pool can only allocate command buffers that are submitted on a single type of queue.

      • flags

        • Is a bitmask indicating usage behavior for the pool and command buffers allocated from it.

        • COMMAND_POOL_CREATE_TRANSIENT

          • Hint that command buffers are rerecorded with new commands very often (may change memory allocation behavior)

        • COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER

          • Allow command buffers to be rerecorded individually, without this flag they all have to be reset together

          • If we record a command buffer every frame, we want to be able to reset and rerecord over it, thus, this flag should be enabled so a command buffer can be reset individually.

        • COMMAND_POOL_CREATE_PROTECTED

          • Specifies that command buffers allocated from the pool are protected command buffers.
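Putting the creation parameters together, a minimal sketch (assuming device  and graphicsQueueFamilyIndex  already exist):

```cpp
VkCommandPoolCreateInfo poolInfo{};
poolInfo.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
// Allow command buffers from this pool to be reset individually.
poolInfo.flags = VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT;
// All buffers allocated here must be submitted to queues of this family.
poolInfo.queueFamilyIndex = graphicsQueueFamilyIndex;

VkCommandPool commandPool;
if (vkCreateCommandPool(device, &poolInfo, nullptr, &commandPool) != VK_SUCCESS) {
    // handle error
}
```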

Management
  • Command pools manage the memory that is used to store the buffers, and command buffers are allocated from them.

  • Destroying a Command Pool destroys the Command Buffers allocated from it.

  • Reset the whole Command Pool :

    • vkResetCommandPool .

      • Resetting a command pool recycles all of the resources from all of the command buffers allocated from the command pool back to the command pool. All command buffers that have been allocated from the command pool are put in the initial state .

      • Any primary command buffer allocated from another VkCommandPool  that is in the recording or executable state  and has a secondary command buffer allocated from commandPool  recorded into it, becomes invalid .

  • Free individual Command Buffers :

    • vkFreeCommandBuffers() .

      • device

        • Is the logical device that owns the command pool.

      • commandPool

        • Is the command pool from which the command buffers were allocated.

      • commandBufferCount

        • Is the length of the pCommandBuffers  array.

      • pCommandBuffers

        • Is a pointer to an array of handles of command buffers to free.

      • Any primary command buffer that is in the recording or executable state  and has any element of pCommandBuffers  recorded into it, becomes invalid .

Command Buffer

Creation / Allocation
  • VkCommandBuffer .

    • Encodes GPU commands.

    • All execution that is performed on the GPU itself (not in the driver) has to be encoded in a command buffer.

  • vkAllocateCommandBuffers() .

    • pAllocateInfo

      • VkCommandBufferAllocateInfo .

      • commandPool

        • Is the command pool from which the command buffers are allocated.

      • level

        • VkCommandBufferLevel .

        • Specifies if the allocated command buffers are primary or secondary command buffers.

        • COMMAND_BUFFER_LEVEL_PRIMARY

          • Command Buffer Primary.

        • COMMAND_BUFFER_LEVEL_SECONDARY

          • Command Buffer Secondary.

      • commandBufferCount

        • Is the number of command buffers to allocate from the pool.

    • pCommandBuffers

      • Is a pointer to an array of Command Buffer handles in which the resulting command buffer objects are returned. The array must  be at least the length specified by the commandBufferCount  member of pAllocateInfo . Each allocated command buffer begins in the initial state.
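A minimal allocation sketch, assuming an existing device  and commandPool :

```cpp
VkCommandBufferAllocateInfo allocInfo{};
allocInfo.sType              = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
allocInfo.commandPool        = commandPool;
allocInfo.level              = VK_COMMAND_BUFFER_LEVEL_PRIMARY; // directly submittable
allocInfo.commandBufferCount = 1;

VkCommandBuffer cmd; // begins in the initial state
if (vkAllocateCommandBuffers(device, &allocInfo, &cmd) != VK_SUCCESS) {
    // handle error
}
```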

Lifecycle
  • Lifecycle .


  • Reset a single Command Buffer :

    • Once a command buffer has been submitted, it is still “alive” and being consumed by the GPU; at this point it is NOT safe to reset it yet. You need to make sure that the GPU has finished executing all of the commands from that command buffer before you can reset and reuse it.

    • vkResetCommandBuffer() .

      • commandBuffer

        • Is the command buffer to reset. The command buffer can  be in any state other than pending , and is moved into the initial state .

      • flags

      • Any primary command buffer that is in the recording or executable state  and has commandBuffer  recorded into it, becomes invalid .

      • After a command buffer is reset, any objects or memory specified by commands recorded into the command buffer must  no longer be accessed when the command buffer is accessed by the implementation.

    • If the command buffer was already recorded once, then a call to vkBeginCommandBuffer  will implicitly reset it.

Levels
  • Primary :

    • Only these can be submitted to queues for execution.

    • Cannot be called from other command buffers.

  • Secondary :

    • Cannot be submitted directly, but can be called from primary command buffers.

    • "We won’t make use of the secondary command buffer functionality here, but you can imagine that it’s helpful to reuse common operations from primary command buffers."

    • vkCmdExecuteCommands() .

      • A primary command buffer would use this to execute a secondary command buffer.

    • Re-recording :

      • If a secondary moves to the invalid state or the initial state, then all primary buffers it is recorded in move to the invalid state. A primary moving to any other state does not affect the state of a secondary recorded in it.

      • So, when a secondary command buffer is re-recorded, the primary becomes invalid.

      • Eve: "It is not capturing a reference to a command buffer, it is going through and copying all the commands in the command buffer into itself."

Command Types
  • Action-Type, State-Type, Sync-Type.


Command Buffer Recording

  • Writes the commands we want to execute into a command buffer.

  • It’s not possible to append commands to a buffer at a later time.

  • vkBeginCommandBuffer() .

    • commandBuffer

      • Is the handle of the command buffer which is to be put in the recording state.

    • pBeginInfo

      • VkCommandBufferBeginInfo .

      • Specifies some details about the usage of this specific command buffer.

      • flags

        • VkCommandBufferUsageFlagBits .

        • Specifies how we’re going to use the command buffer.

        • COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT

          • The command buffer will be rerecorded right after executing it once.

        • COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE

          • This is a secondary command buffer that will be entirely within a single render pass.

        • COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE

          • The command buffer can be resubmitted while it is also already pending execution.

        • None of these flags are applicable for us right now.

      • pInheritanceInfo

        • VkCommandBufferInheritanceInfo .

          • If the command buffer is a secondary command buffer, then the VkCommandBufferInheritanceInfo  structure defines any state that will be inherited from the primary command buffer:

        • Used if commandBuffer  is a secondary command buffer. If this is a primary command buffer, then this value is ignored.

        • It specifies which state to inherit from the calling primary command buffers.

  • vkEndCommandBuffer() .

    • The command buffer must  have been in the recording state , and, if successful, is moved to the executable state .

    • If there was an error during recording, the application will be notified by an unsuccessful return code returned by vkEndCommandBuffer , and the command buffer will be moved to the invalid state .
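The begin/end pair sketched together (assuming an allocated command buffer cmd ):

```cpp
VkCommandBufferBeginInfo beginInfo{};
beginInfo.sType            = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
beginInfo.flags            = 0;       // no usage flags needed here
beginInfo.pInheritanceInfo = nullptr; // only relevant for secondary command buffers

vkBeginCommandBuffer(cmd, &beginInfo); // moves cmd to the recording state

// ... record draw/copy/dispatch commands ...

if (vkEndCommandBuffer(cmd) != VK_SUCCESS) { // executable state on success,
    // invalid state on error; handle it here
}
```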

Pre-recording

  • "Many early Vulkan tutorials and documents recommended writing a command buffer once and re-using it wherever possible. In practice however re-use rarely has the advertized performance benefit while incurring a non-trivial development burden due to the complexity of implementation. While it may appear counterintuitive, as re-using computed data is a common optimization, managing a scene with objects being added and removed as well as techniques such as frustum culling which vary the draw calls issued on a per frame basis make reusing command buffers a serious design challenge. It requires a caching scheme to manage command buffers and maintaining state for determining if and when re-recording becomes necessary. Instead, prefer to re-record fresh command buffers every frame. If performance is a problem, recording can be multithreaded as well as using secondary command buffers for non-variable draw calls, like post processing."

Multi-threading Recording

  • Usage of secondary command buffers for Vulkan Multithreaded Recording .

    • There's an example code section.

  • External synchronization

    • A type of synchronization required  of the application, where parameters defined to be externally synchronized must  not be used simultaneously in multiple threads.

  • Internal Synchronization

    • A type of synchronization required  of the implementation, where parameters not defined to be externally synchronized may  require internal mutexing to avoid multithreaded race conditions.

  • Any object parameters that are not labeled as externally synchronized are either not mutated by the command or are internally synchronized.

  • Additionally, certain objects related to a command’s parameters (e.g. command pools and descriptor pools) may  be affected by a command, and must  also be externally synchronized.

Queues
  • Only a single thread can be submitting to a given queue at any time. If you want multiple threads doing VkQueueSubmit , then you need to create multiple queues.

  • As the number of queues can be as low as 1 on some devices, what engines tend to do is something similar to the pipeline compile thread or the OpenGL API call thread: have a thread dedicated to just doing VkQueueSubmit .

  • As VkQueueSubmit  is a very expensive operation, this can bring a very nice speedup: the time spent executing that call is spent on a second thread, and the main logic of the engine doesn’t have to stop.

  • Data upload is another area that is very often multithreaded. Here, you have a dedicated IO thread that loads assets from disk, and said IO thread has its own queue and command allocators, ideally a transfer queue. This way it is possible to upload assets at a speed completely separate from the main frame loop, so if it takes half a second to upload a set of big textures, you don’t get a hitch. To do that, you need to create a transfer or async-compute queue (if available) and dedicate it to the loader thread. Once you have that, it’s similar to what was described for the pipeline compiler thread: the IO thread communicates through a parallel queue with the main simulation loop to upload data asynchronously. Once a transfer has been uploaded, and a Fence confirms it has finished, the IO thread can send the info to the main loop, and the engine can connect the new textures or models into the renderer.
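
  • The dedicated submit thread described above can be sketched without any real Vulkan calls. All names and types here ( Submission , SubmitThread ) are hypothetical placeholders; the point is only that a single worker thread is ever the one "submitting", which is what the one-thread-per-queue rule requires:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Placeholder for whatever data a real engine would need per submission.
struct Submission { int frameIndex; };

class SubmitThread {
public:
    SubmitThread() : worker_(&SubmitThread::run, this) {}
    ~SubmitThread() { if (worker_.joinable()) stopAndCount(); }

    // Called from the main loop: hand work to the submit thread and return
    // immediately, so the main logic never blocks on the expensive submit.
    void enqueue(Submission s) {
        { std::lock_guard<std::mutex> lock(mutex_); pending_.push(s); }
        cv_.notify_one();
    }

    // Drain remaining work, join the worker, and report how many
    // submissions it performed.
    int stopAndCount() {
        { std::lock_guard<std::mutex> lock(mutex_); done_ = true; }
        cv_.notify_one();
        worker_.join();
        return submitted_;
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            cv_.wait(lock, [this] { return done_ || !pending_.empty(); });
            while (!pending_.empty()) {
                pending_.pop();  // a real engine would call vkQueueSubmit here
                ++submitted_;
            }
            if (done_) return;
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<Submission> pending_;
    int submitted_ = 0;
    bool done_ = false;
    std::thread worker_;  // declared last so the other members exist first
};
```

  • The same shape works for the IO/upload thread: replace the "submit" step with a staging-buffer copy on a transfer queue and a fence wait.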

Command Pools
  • When you record command buffers, their command pools can only be used from one thread at a time. While you can create multiple command buffers from a command pool, you can’t record into those command buffers from multiple threads. If you want to record command buffers from multiple threads, then you will need more command pools, one per thread.

  • Secondary Command Buffers :

    • Vulkan command buffers have a system for primary and secondary command buffers. The primary buffers are the ones that open and close RenderPasses, and can get directly submitted to a queue. Secondary command buffers are used as “child” command buffers that execute as part of a primary one.

    • Their main purpose is multithreading.

    • Secondary command buffers can’t be submitted to a queue on their own.

  • Command Pools are a system to allow recording command buffers across multiple threads.

    • They enable different threads to use different allocators, without internal synchronization on each use.

  • A single command pool must be externally synchronized ; it must not be accessed simultaneously from multiple threads.

    • That includes use via recording commands on any command buffers allocated from the pool, as well as operations that allocate, free, and reset command buffers or the pool itself.

  • If you want multithreaded command recording, you need more VkCommandPool  objects. By using a separate command pool in each host-thread the application can create multiple command buffers in parallel without any costly locks.

    • For that reason, we will pair a command buffer with its command allocator.

  • You can allocate as many VkCommandBuffer  as you want from a given pool, but you can only record commands from one thread at a time.

  • Command buffers can be recorded on multiple threads while having a relatively light thread handle the submissions.

  • If two commands access the same object or memory and at least one of the commands declares the object to be externally synchronized, then the caller must  guarantee not only that the commands do not execute simultaneously, but also that the two commands are separated by an appropriate memory barrier (if needed).

  • Similarly, if a Vulkan command accesses a non-const memory parameter and the application also accesses that memory, or if the application writes to that memory and the command accesses it as a const memory parameter, the application must  ensure the accesses are properly synchronized with a memory barrier if needed.

  • Memory barriers are particularly relevant for hosts based on the ARM CPU architecture, which is more weakly ordered than many developers are accustomed to from x86/x64 programming. Fortunately, most higher-level synchronization primitives (like the pthread library) perform memory barriers as a part of mutual exclusion, so mutexing Vulkan objects via these primitives will have the desired effect.
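
  • The one-pool-per-thread pattern above can be sketched with placeholder types (no real Vulkan; CommandPool  and CommandBuffer  here are stand-ins): each thread records only into its own pool, so external synchronization is satisfied without any mutex, and a single thread then gathers everything for submission:

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <vector>

struct CommandBuffer { std::vector<std::string> commands; };

// Stand-in for VkCommandPool: owns the buffers allocated from it.
struct CommandPool {
    std::vector<CommandBuffer> buffers;
    CommandBuffer& allocate() { buffers.emplace_back(); return buffers.back(); }
};

std::vector<CommandBuffer> recordInParallel(int threadCount, int drawsPerThread) {
    std::vector<CommandPool> pools(threadCount);  // one pool per thread
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t) {
        threads.emplace_back([&pools, t, drawsPerThread] {
            // Each thread touches only pools[t], so no locking is needed.
            CommandBuffer& cb = pools[t].allocate();
            for (int d = 0; d < drawsPerThread; ++d)
                cb.commands.push_back("draw " + std::to_string(t) + ":" +
                                      std::to_string(d));
        });
    }
    for (auto& th : threads) th.join();
    // After the join, a single (light) thread gathers the recorded buffers
    // for submission, mirroring the dedicated submit thread.
    std::vector<CommandBuffer> result;
    for (auto& p : pools)
        for (auto& cb : p.buffers) result.push_back(cb);
    return result;
}
```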

Pipelines

  • In Vulkan, to execute code on the GPU, we need to set up a pipeline.

  • There are two types of pipelines, Graphics and Compute:

    • Compute pipelines :

      • Are much simpler, because they only require the data for the shader code, and the layout for the descriptors used for data bindings.

    • Graphics pipelines :

      • Have to configure a considerable amount of state for all of the fixed-function hardware in the GPU such as color blending, depth testing, or geometry formats.

  • Both types of pipelines share the shader modules and the layouts, which are built in the same way.

  • VkPipeline

Pipeline Layout
  • A collection of DescriptorSetLayouts  and PushConstantRanges  defining its push constant usage.

  • PipelineLayouts for a graphics and compute pipeline are made in the same way, and they must be created before the pipeline itself.

  • VkPipelineLayout .

    • VkPipelineLayoutCreateInfo .

      • Structure specifying the parameters of a newly created pipeline layout object.

      • setLayoutCount

        • Is the number of descriptor sets included in the pipeline layout.

      • pSetLayouts

        • Is a pointer to an array of VkDescriptorSetLayout  objects.

        • The implementation must  not access these objects outside of the duration of the command this structure is passed to.

    • vkCreatePipelineLayout() .

      • pCreateInfo

      • flags

      • setLayoutCount

      • pSetLayouts

        • Is a pointer to an array of VkDescriptorSetLayout  objects. The implementation must  not access these objects outside of the duration of the command this structure is passed to.

      • pushConstantRangeCount

        • Is the number of push constant ranges included in the pipeline layout.

      • pPushConstantRanges

        • Is a pointer to an array of VkPushConstantRange  structures defining a set of push constant ranges for use in a single pipeline layout. In addition to descriptor set layouts, a pipeline layout also describes how many push constants can  be accessed by each stage of the pipeline.
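
  • The spec constrains each push constant range: offset  and size  must be multiples of 4, size  must be nonzero, and the range must fit within the maxPushConstantsSize  device limit (whose required minimum is 128 bytes). A sketch of those rules, using a stand-in struct rather than the real Vulkan header:

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for VkPushConstantRange (stage flags omitted for brevity).
struct PushConstantRange {
    uint32_t offset;
    uint32_t size;
};

bool isValidPushConstantRange(const PushConstantRange& r,
                              uint32_t maxPushConstantsSize) {
    if (r.size == 0) return false;
    if (r.offset % 4 != 0 || r.size % 4 != 0) return false;  // multiples of 4
    if (r.offset >= maxPushConstantsSize) return false;
    if (r.size > maxPushConstantsSize - r.offset) return false;  // must fit
    return true;
}
```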

Mesh Shaders

Support
  • (2025-09-12)

  • It is important to note that while portability between APIs can be achieved, portability in performance among vendors is much harder. This is one of the reasons why this extension has not been released as a ratified KHR extension and Khronos continues to investigate improvements to geometry rasterization.

  • There are further aspects that can influence the performance of mesh shaders in a vendor dependent way:

    • The number of maximum output vertices and primitives that a mesh shader is compiled with.

    • The number of per-vertex and per-primitive output attributes that are passed to fragment shaders. For example, it may be beneficial to fetch additional attributes in the fragment shader and interpolate them via hardware barycentrics to reduce the output space of the mesh shader.

    • The complexity of the culling performed in the mesh shader. For example details regarding the per-vertex and/or per-primitive culling with compact outputs compared to letting the hardware perform culling.

    • The usage of additional shared memory. If possible developers should use subgroup operations (such as shuffle) instead.

    • The task payload size.

    • Task shaders may add overhead, use them only when they can cull a meaningful number of primitives or when actual geometry amplification is desired.

    • Do not try to reimplement the fixed-function pipeline, strive for simpler algorithms instead.

Motivation
  • The current state of the Graphics Pipeline is not a direct mapping of how a GPU operates.

  • There's a lot of Per Vertex -> Per Primitive -> Per Vertex -> Per Primitive happening inside a Graphics Pipeline.

  • The idea is to use the flexibility of Compute Shaders and use the GPU closer to how it actually operates.

  • Mesh and Task shaders follow the compute programming model and use threads cooperatively to generate meshes within a workgroup. The vertex and index data for these meshes are written similarly to shared memory in compute shaders.

  • Mesh shader output is directly consumed by the rasterizer, as opposed to the previous approach of using a compute dispatch followed by an indirect draw.

  • Mesh Shading applications can avoid preallocation of output buffers.

  • Before deciding to use mesh shaders, developers should ensure they are a good fit for their application. The traditional pipeline may still be best suited to many use cases, and it may not be trivial to improve performance using the mesh shading pipeline given the long evolution and optimization efforts applied to the traditional pipeline stages.

  • Applications and games dealing with high geometric complexity  can, however, benefit from the flexibility of the two-stage approach, which allows efficient culling , level-of-detail  techniques as well as procedural generation .

  • Compared to the traditional pipeline, the mesh shaders allow easy access to the topology of the generated primitives and developers are free to repurpose the threads to do both vertex shading and primitive shading work. This is in contrast to tessellation shaders, which, while fast, provide very limited control over the triangles created, and geometry shaders, which use a single-thread programming model that is inefficient for modern streaming processors.

Task Shader
  • Is optional and provides a way to implement geometry amplification by creating variable mesh shader workgroups directly in the pipeline. Task shader workgroups can output an optional payload, which is visible as read-only input to all its child mesh shader workgroups.

  • A Task Shader decides how many Mesh Shaders you would like to run.

Meshlets / Triangle Clusters
  • When rasterizing geometry, mesh shaders typically make use of pre-computed triangle clusters with an upper bound on the number of vertices and triangles, also sometimes referred to as meshlets. Because task and mesh shaders, like compute, have only workgroup and invocation indices as input, all data fetching is handled by the application directly, which entirely removes fixed-function vertex processing and input assembly. This allows developers to be flexible in the storage of mesh data in both vertex and primitive topology representations. Another very common technique is to leverage the task shader and let one local invocation test one cluster for visibility. Through the use of subgroup operations developers can compute and write out information about the visible clusters into the task shader payload.

  • The meshlet / primitive cluster dimensions can have an especially big impact for the developer, as when streaming it is ideal to store assets with a fixed clustering in advance. Vendors may have different performance recommendations and so we suggest the use of smaller cluster sizes that work equally well across multiple vendors and process multiple small clusters at once on implementations that perform better with larger clusters. In this area we advise developers to experiment and consult with their hardware vendors for recommendations.

Using it
What a Mesh Shader enables
  • You can do very early culling.

  • It can be faster than the classical Graphics Pipeline, if correctly optimized.

  • Mesh Shader output Execution Mode :

    • The mesh stage will set either OutputPoints , OutputLinesEXT , or OutputTrianglesEXT

    #extension GL_EXT_mesh_shader : require
    
    // Only 1 of the 3 is allowed
    layout(points) out;
    layout(lines) out;
    layout(triangles) out;
    

Cluster Culling Shader

Graphics Pipeline

  • The graphics pipeline is required for all common drawing operations.

  • Holds the state of the GPU needed to draw. For example: shaders, rasterization options, depth settings.

  • It describes the configurable state of the graphics card, like the viewport size and depth buffer operation and the programmable state using VkShaderModule objects.

Stages
  • Disabling stages :

    • The tessellation and geometry stages can be disabled if you are just drawing simple geometry.

    • If you are only interested in depth values, then you can disable the fragment shader stage, which is useful for shadow map  generation.

  • Fixed-function stages :

    • Allow you to tweak their operations using parameters, but the way they work is predefined.

    • Dynamic State :

      • While most  of the pipeline state needs to be baked into the pipeline state, a limited amount of the state can actually be changed without recreating the pipeline at draw time.

      • Examples are the size of the viewport, line width and blend constants.

      • If you want to use dynamic state and keep these properties out, then you’ll have to fill in a VkPipelineDynamicStateCreateInfo  struct.

      • This will cause the configuration of these values to be ignored , and you will be able (and required) to specify the data at drawing time.

      • This results in a more flexible setup and is widespread for things like viewport and scissor state, which would result in a more complex setup when being baked into the pipeline state.

  • Programmable stages :

    • Means that you can upload your own code to the graphics card to apply exactly the operations you want.

    • This allows you to use fragment shaders, for example, to implement anything from texturing and lighting to ray tracers. These programs run on many GPU cores simultaneously to process many objects, like vertices and fragments in parallel.

  • Immutability :

    • Is almost completely immutable, so you must recreate the pipeline from scratch if you want to change shaders, bind different framebuffers or change the blend function.

    • The disadvantage is that you’ll have to create a number of pipelines (many VkPipeline objects) that represent all the different combinations of states you want to use in your rendering operations. However, because all the operations you’ll be doing in the pipeline are known in advance, the driver can optimize for it much better.

      • Runtime performance is more predictable because large state changes like switching to a different graphics pipeline are made very explicit.

    • Only some basic configuration, like viewport size and clear color, can be changed dynamically.

Shader Compilation

Shader Module
  • A VkShaderModule  is a processed shader file.

  • We create it from a pre-compiled SPIR-V file.

  • We can call vkDestroyShaderModule  after they are used for the graphics pipeline creation.
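
  • A sketch of preparing SPIR-V bytes for shader module creation: pCode  is a uint32_t  pointer and codeSize  must be a multiple of 4, so the raw file bytes are repacked into 32-bit words and sanity-checked against the SPIR-V magic number ( 0x07230203 ). The helper name is hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

std::vector<uint32_t> toSpirvWords(const std::vector<uint8_t>& bytes) {
    if (bytes.size() < 4 || bytes.size() % 4 != 0)
        throw std::runtime_error("SPIR-V size must be a nonzero multiple of 4");
    std::vector<uint32_t> words(bytes.size() / 4);
    for (size_t i = 0; i < words.size(); ++i) {
        // Assemble each word little-endian, the common on-disk layout.
        words[i] = static_cast<uint32_t>(bytes[4 * i]) |
                   (static_cast<uint32_t>(bytes[4 * i + 1]) << 8) |
                   (static_cast<uint32_t>(bytes[4 * i + 2]) << 16) |
                   (static_cast<uint32_t>(bytes[4 * i + 3]) << 24);
    }
    if (words[0] != 0x07230203u)
        throw std::runtime_error("not a SPIR-V module (bad magic)");
    return words;
}
```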

Input Assembly

  • Fixed-function stage.

  • Collects the raw vertex data from the buffers you specify and may also use an index buffer to repeat certain elements without having to duplicate the vertex data itself.

  • VkPipelineVertexInputStateCreateInfo

    • Describes the format of the vertex data that will be passed to the vertex shader.

    • pVertexBindingDescriptions

      • Spacing between data and whether the data is per-vertex or per-instance (see instancing ).

    • pVertexAttributeDescriptions

      • Type of the attributes passed to the vertex shader, which binding to load them from and at which offset.

  • VkPipelineInputAssemblyStateCreateInfo .

    • Describes two things: what kind of geometry will be drawn from the vertices and if primitive restart should be enabled.

    • topology

      • PRIMITIVE_TOPOLOGY_POINT_LIST

        • points from vertices

      • PRIMITIVE_TOPOLOGY_LINE_LIST

        • line from every two vertices without reuse

      • PRIMITIVE_TOPOLOGY_LINE_STRIP

        • the end vertex of every line is used as start vertex for the next line

      • PRIMITIVE_TOPOLOGY_TRIANGLE_LIST

        • triangle from every three vertices without reuse

      • PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP

        • the second and third vertex of every triangle are used as the first two vertices of the next triangle

    • primitiveRestartEnable

      • Normally, the vertices are loaded from the vertex buffer by index in sequential order, but with an element buffer  you can specify the indices to use yourself.

        • This allows you to perform optimizations like reusing vertices.

      • If you set this to TRUE , then it’s possible to break up lines and triangles in the _STRIP  topology modes by using a special index of 0xFFFF  or 0xFFFFFFFF .
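
  • The topology and primitive-restart rules above can be sketched as plain arithmetic: how many primitives a vertex count yields per topology, and how the 0xFFFFFFFF  restart index (for 32-bit indices) cuts a triangle strip:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Topology { PointList, LineList, LineStrip, TriangleList, TriangleStrip };

size_t primitiveCount(Topology t, size_t vertexCount) {
    switch (t) {
        case Topology::PointList:     return vertexCount;
        case Topology::LineList:      return vertexCount / 2;
        case Topology::LineStrip:     return vertexCount < 2 ? 0 : vertexCount - 1;
        case Topology::TriangleList:  return vertexCount / 3;
        case Topology::TriangleStrip: return vertexCount < 3 ? 0 : vertexCount - 2;
    }
    return 0;
}

// With primitiveRestartEnable, each restart index ends the current strip;
// count the triangles every sub-strip produces.
size_t restartedTriStripCount(const std::vector<uint32_t>& indices) {
    size_t total = 0, run = 0;
    for (uint32_t idx : indices) {
        if (idx == 0xFFFFFFFFu) { total += run < 3 ? 0 : run - 2; run = 0; }
        else ++run;
    }
    total += run < 3 ? 0 : run - 2;
    return total;
}
```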

Primitive Topology

Vertex Shader

  • Programmable stage.

  • Is run for every vertex and generally applies transformations to turn vertex positions from model space to screen space. It also passes per-vertex data down the pipeline.

  • The VkShaderModule  objects are created from shader byte code.

  • Accesses and computes one vertex at a time.

Tessellation Shader

  • Programmable stage.

  • Subdivides patches of vertices to generate additional geometry, increasing the detail of the mesh on the GPU.

  • You can do tessellation in the Geometry Shader, but the Tessellation Shader is more appropriate and efficient.

  • .

    • Sending that many vertices to the Vertex Shader would be much more expensive than generating them in the Tessellation Shader.

  • Tessellation Evaluation Shader.

    • Kinda like a Vertex Shader, after the Tessellation.

  • Tessellation Shader .

    • I was too lazy to watch it all.

    • The inputs are complicated, etc.

  • Tessellation output Execution Mode :

    • The tessellation evaluation stage will set either Triangles , Quads , or Isolines

    // Only 1 of the 3 is allowed
    layout(quads) in;
    layout(isolines) in;
    layout(triangles) in;
    

Geometry Shader

  • Programmable stage.

  • It operates on primitives .

  • Is run on every primitive (triangle, line, point) and can discard it or output more primitives than came in. This is similar to the tessellation shader but much more flexible.

  • However, it is used little in today’s applications because the performance is not that good  on most graphics cards except for Intel’s integrated GPUs.

    • Also, almost all geometry shader use cases can be replaced with a more modern Mesh shader pipeline, which like ray tracing is a wholly new pipeline solution, so it exists outside the standard graphics pipeline setup.

  • A Vertex Shader is more parallelized than a Geometry Shader.

  • A Vertex Shader computes one vertex at a time, while a geometry shader gets all the vertices that compose a primitive .

    • It does not  have access to the whole mesh, just the vertices that compose the current primitive.

  • OpenGL Primitives :

    • May be useful.

  • Think of the Primitive Inputs as just the amount of vertices you are sending at a time.

  • The reason for this is that you can get any primitive input and have any primitive output.

  • .

    • Use EndPrimitive()  so the line strips are separated.

  • .

    • The Vertex Shader can output data to the Geometry Shader, in the form of an array.

    • The Geometry Shader can output data to the Fragment Shader, in the form of a value interpolated using barycentric coordinates.

  • Instancing :

    • You can have many instances of a Geometry Shader, where the input is the same but the output changes.

  • .

    • The smoke is a quad facing the camera (billboard).

    • The points are converted to quads.

  • Geometry output Execution Mode :

    • A geometry stage will set either OutputPoints , OutputLineStrip , or OutputTriangleStrip

    // Only 1 of the 3 is allowed
    layout(points) out;
    layout(line_strip) out;
    layout(triangle_strip) out;
    

Rasterization

  • Fixed-function stage.

  • Breaks the primitives into fragments .

  • Fragments are the pixel-sized elements that the primitives cover on the framebuffer.

  • Any fragments that fall outside the screen are discarded, and the attributes outputted by the vertex shader are interpolated across the fragments.

  • Fragments that are behind other primitive fragments can also be discarded here because of depth testing.

  • VkPipelineRasterizationStateCreateInfo .

    • polygonMode

    • lineWidth

      • Is the width of rasterized line segments.

      • The maximum line width that is supported depends on the hardware.

      • Any line thicker than 1.0f  requires you to enable the wideLines  GPU feature.

      • If set to 0.0f , the validation layers report an error like: “ lineWidth  is 0.0, but the line width state is static ( pCreateInfos[0].pDynamicState->pDynamicStates  does not contain DYNAMIC_STATE_LINE_WIDTH ) and the wideLines  feature was not enabled.” The Vulkan spec states: if the pipeline requires pre-rasterization shader state, and the wideLines  feature is not enabled, and no element of the pDynamicStates  member of pDynamicState  is DYNAMIC_STATE_LINE_WIDTH , the lineWidth  member of pRasterizationState  must  be 1.0.

      • So, set it to 1.0f  by default.

    • cullMode

      • NONE

        • Specifies that no triangles are discarded

      • FRONT

        • Specifies that front-facing triangles are discarded

      • BACK

        • Specifies that back-facing triangles are discarded

      • FRONT_AND_BACK

        • Specifies that all triangles are discarded.

      • Following culling, fragments are produced for any triangles which have not been discarded.

    • frontFace

      • Specifies the vertex order for the faces to be considered front-facing.

      • COUNTER_CLOCKWISE

        • Specifies that a triangle with positive area is considered front-facing.

      • CLOCKWISE

        • Specifies that a triangle with negative area is considered front-facing.

      • Any triangle which is not front-facing is back-facing, including zero-area triangles.

    • rasterizerDiscardEnable .

      • When enabled, primitives are discarded after they are processed by the last active shader stage in the pipeline before rasterization.

      • Controls whether primitives are discarded immediately before the rasterization stage. This is important because when this is set to TRUE  the rasterization hardware is not executed.

      • There are many Validation Usage errors that will not occur if this is set to TRUE  because some topology hardware is unused and can be ignored.

      • Enabling this state is meant for very specific use cases. Prior to compute shaders, this was a common technique for writing geometry shader output to a buffer.

      • It can be used to debug/profile non-rasterization bottlenecks.

    • flags

      • Reserved for future use.

    • depthClampEnable

      • See the Depth section for details.

    • depthBiasEnable

      • See the Depth section for details.

    • depthBiasConstantFactor

      • See the Depth section for details.

    • depthBiasSlopeFactor

      • See the Depth section for details.

    • depthBiasClamp

      • See the Depth section for details.
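
  • The frontFace  rule above can be sketched as a signed-area test. Note this uses the conventional mathematical orientation (y up); Vulkan's framebuffer y axis points down, which flips the visual sense of "counter-clockwise", but the sign logic is the same:

```cpp
#include <cassert>

struct Vec2 { float x, y; };

// Twice-signed-area cross product, halved; sign encodes the winding.
float signedArea(Vec2 a, Vec2 b, Vec2 c) {
    return 0.5f * ((b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y));
}

// With FRONT_FACE_COUNTER_CLOCKWISE, a positive area means front-facing;
// zero-area triangles count as back-facing, per the notes above.
bool isFrontFacingCCW(Vec2 a, Vec2 b, Vec2 c) {
    return signedArea(a, b, c) > 0.0f;
}
```

  • Combined with cullMode , this determines which triangles survive to produce fragments (e.g. CULL_MODE_BACK_BIT  discards the back-facing ones).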

Polygon Mode
  • Determines how fragments are generated for geometry.

  • These modes affect only the final  rasterization of polygons. The polygon’s vertices are shaded and the polygon is clipped and possibly culled before these modes are applied.

  • FILL

    • Fill the area of the polygon with fragments.

  • LINE

    • Polygon edges are drawn as lines

  • POINT

    • Polygon vertices are drawn as points

    • If VkPhysicalDeviceMaintenance5Properties :: polygonModePointSize  is TRUE , the point size of the final rasterization of polygons is taken from PointSize .

    • Otherwise, the point size of the final rasterization of polygons is 1.0.

  • FILL_RECTANGLE_NV

    • Specifies that polygons are rendered using polygon rasterization rules, modified to consider a sample within the primitive if the sample location is inside the axis-aligned bounding box of the triangle after projection.

    • Note that the barycentric weights used in attribute interpolation can  extend outside the range [0,1]  when these primitives are shaded.

    • Special treatment is given to a sample position on the boundary edge of the bounding box. In such a case, if two rectangles lie on either side of a common edge (with identical endpoints) on which a sample position lies, then exactly one of the triangles must  produce a fragment that covers that sample during rasterization.

    • Polygons rendered in FILL_RECTANGLE_NV  mode may  be clipped by the frustum or by user clip planes. If clipping is applied, the triangle is culled rather than clipped.

    • Area calculation and facingness are determined for FILL_RECTANGLE_NV  mode using the triangle’s vertices.

  • If you have a vertex shader that has PRIMITIVE_TOPOLOGY_TRIANGLE_LIST  input and then during rasterization uses POLYGON_MODE_LINE , the effective topology is the Line Topology Class  at that time. This means something like lineWidth  would be applied when filling in the polygon with POLYGON_MODE_LINE .

Fragment Operations

Order
  1. Discard rectangles test

  2. Scissor test

  3. Exclusive scissor test

  4. Sample mask test

  5. Certain Fragment shading operations:

    • Sample Mask Accesses

    • Tile Image Reads

    • Depth Replacement

    • Stencil Reference Replacement

    • Interlocked Operations

  6. Multisample coverage

  7. Depth bounds test

  8. Stencil test

  9. Depth test

  10. Representative fragment test

  11. Sample counting

  12. Coverage to color

  13. Coverage reduction

  14. Coverage modulation

Early Per-Fragment Tests
  • OpenGL 4.6:

    • Once fragments are produced by rasterization, a number of per-fragment operations are performed prior to fragment shader execution. If a fragment is discarded during any of these operations, it will not be processed by any subsequent Stage, including fragment shader execution.

    • Three fragment operations are performed, and a further three are optionally performed on each fragment, in the following order:

      • the pixel ownership test (see section 14.9.1);

      • the scissor test (see section 14.9.2);

      • multisample fragment operations (see section 14.9.3);

    • If early per-fragment operations are enabled, these tests are also performed:

      • the stencil test (see section 17.3.3);

      • the depth buffer test (see section 17.3.4);

        • The depth buffer test discards the incoming fragment if a depth comparison fails. The comparison is enabled or disabled with the generic Enable and Disable commands using target DEPTH_TEST. When disabled, the depth comparison and subsequent possible updates to the depth buffer value are bypassed and the fragment is passed to the next operation. The stencil value, however, is modified as indicated below as if the depth buffer test passed. If enabled, the comparison takes place and the depth buffer and stencil value may subsequently be modified.

      • occlusion query sample counting (see section 17.3.5)

    • Early fragment tests, as an optimization, exist to prevent unnecessary executions of the Fragment Shader. If a fragment will be discarded based on the Depth Test (due perhaps to being behind other geometry), it saves performance to avoid executing the fragment shader. There is specialized hardware that makes this particularly efficient in many GPUs.

    • The most effective way to use early depth test hardware is to run a depth-only pre-processing pass. This means to render all available geometry, using minimal shaders and a rendering pipeline that only writes to the depth buffer. The Vertex Shader should do nothing more than transform positions, and the Fragment Shader does not even need to exist.

    • This provides the best performance gain if the fragment shader is expensive, or if you intend to use multiple passes across the geometry.

    • Limitations :

      • The Spec states that these operations happen after fragment processing. However, a specification only defines apparent behavior, so the implementation is only required to behave "as if" it happened afterwards.

      • Therefore, an implementation is free to apply early fragment tests if the Fragment Shader being used does not do anything that would impact the results of those tests. So if a fragment shader writes to gl_FragDepth, thus changing the fragment's depth value, then early testing cannot take place, since the test must use the newly computed value.

      • Do recall that if a fragment shader writes to gl_FragDepth, even conditionally, it must write to it at least once on all codepaths.

      • There can be other hardware-based limitations as well. For example, some hardware will not execute an early depth test if the (deprecated) alpha test is active, as these use the same hardware on that platform. Because this is a hardware-based optimization, OpenGL has no direct controls that will tell you if early depth testing will happen.

      • Similarly, if the fragment shader discards the fragment with the discard keyword, this will almost always turn off early depth tests on some hardware. Note that even conditional  use of discard will mean that the FS will turn off early depth tests.

      • All of the above limitations apply only to early testing as an optimization. They do not apply to anything below.

    • More recent hardware can force early depth tests, using a special fragment shader layout qualifier:

      • layout(early_fragment_tests) .

        • Vulkan:

          • Specifying this is a way for the application programmer to promise the implementation that it is algorithmically safe to kill the fragments early; you explicitly allow the change in application-visible behavior.

          • Specifying this will make per-fragment tests be performed before fragment shader execution. If this is not declared, per-fragment tests will be performed after fragment shader execution. Only one fragment shader (compilation unit) need declare this, though more than one can. If at least one declares this, then it is enabled.

        • OpenGL 4.6:

          • An explicit control is provided to allow fragment shaders to enable early fragment tests. If the fragment shader specifies the early_fragment_tests  layout qualifier, the per-fragment tests will be performed prior to fragment shader execution. Otherwise, they will be performed after fragment shader execution.

          • This will also perform early stencil tests.

          • There is a caveat with this. This feature cannot  be used to violate the sanctity of the depth test. When this is activated, any writes to gl_FragDepth  will be ignored . The value written to the depth buffer will be exactly what was tested against  the depth buffer: the fragment's depth computed through rasterization.

          • This feature exists to ensure proper behavior when using Image Load Store  or other incoherent memory writing . Without turning this on, fragments that fail the depth test would still perform their Image Load/Store operations, since the fragment shader that performed those operations successfully executed. However, with early fragment tests, those tests were run before the fragment shader. So this ensures that image load/store operations will only happen on fragments that pass the depth test.

          • Enabling this feature has consequences for the results of a discarded fragment.
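
  • As with the execution-mode snippets earlier in these notes, the qualifier is a single declaration in the fragment shader source:

```glsl
#version 450

// Force the per-fragment tests (depth, stencil, ...) to run before this
// shader executes; any write to gl_FragDepth would then be ignored.
layout(early_fragment_tests) in;

layout(location = 0) out vec4 outColor;

void main() {
    outColor = vec4(1.0);
}
```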

Viewport and Scissors

  • A viewport basically describes the region of the framebuffer that the output will be rendered to.

  • Viewports define the transformation from the image to the framebuffer, scissor rectangles define in which region pixels will actually be stored. The rasterizer will discard any pixels outside the scissored rectangles. They function like a filter rather than a transformation.

    • The difference is illustrated below.

    • .

    • Note that the left scissored rectangle is just one of the many possibilities that would result in that image, as long as it’s larger than the viewport.

    • So if we wanted to draw to the entire framebuffer, we would specify a scissor rectangle that covers it entirely:

      vk::Rect2D{ vk::Offset2D{ 0, 0 }, swapChainExtent }
      
  • Parameters :

    • This will almost always be the rectangle (0, 0) , (width, height)  and in this tutorial that will also be the case.

      • Remember that the size of the Swapchain and its images may differ from the WIDTH  and HEIGHT  of the window.

      • The Swapchain images will be used as framebuffers later on, so we should stick to their size.

    • The minDepth  and maxDepth  values specify the range of depth values to use for the framebuffer. These values must be within the [0.0f, 1.0f]  range, but minDepth  may be higher than maxDepth .

      • If you aren’t doing anything special, then you should stick to the standard values of 0.0f  and 1.0f .

  • As a Dynamic State or Static State :

    • Viewport(s) and scissor rectangle(s) can either be specified as a static part of the pipeline or as a dynamic state set in the command buffer.

    • Independent of how you set them, it’s possible to use multiple viewports and scissor rectangles on some graphics cards, so the structure members reference an array of them. Using multiple requires enabling a GPU feature (see logical device creation).

    • It’s often convenient to make viewport and scissor state dynamic as it gives you a lot more flexibility.

    • With dynamic state :

      • It’s even possible to specify different viewports and/or scissor rectangles within a single command buffer.

      • This is widespread and all implementations can handle this dynamic state without  a performance penalty.

      • When opting for dynamic viewport(s) and scissor rectangle(s), you need to enable the respective dynamic states for the pipeline:

        std::vector<vk::DynamicState> dynamicStates = {
            vk::DynamicState::eViewport,
            vk::DynamicState::eScissor
        };
        vk::PipelineDynamicStateCreateInfo dynamicState({}, static_cast<uint32_t>(dynamicStates.size()), dynamicStates.data());
        
      • And then you only need to specify their count at pipeline creation time:

        vk::PipelineViewportStateCreateInfo viewportState({}, 1, {}, 1);
        
      • The actual viewport(s) and scissor rectangle(s) will then later be set up at drawing time.

    • Without dynamic state :

      • The viewport and scissor rectangle need to be set in the pipeline using the VkPipelineViewportStateCreateInfo  struct. This makes the viewport and scissor rectangle for this pipeline immutable. Any changes required to these values would require a new pipeline to be created with the new values.

    • What should you use?

      • USE DYNAMIC. There's no  performance penalty.

      • Supported since launch.

      • LunarG:

        • See LunarG’s white paper on Vulkan dynamic state.

Multi-Sampling

Setup
  • VkPipelineMultisampleStateCreateInfo .

    • rasterizationSamples

      • If the bound pipeline was created without a VkAttachmentSampleCountInfoAMD  or VkAttachmentSampleCountInfoNV  structure, and the multisampledRenderToSingleSampled  feature is not enabled, and the current render pass instance was begun with vkCmdBeginRendering  with a VkRenderingInfo::colorAttachmentCount  parameter greater than 0, then each element of the VkRenderingInfo::pColorAttachments  array with an imageView  not equal to NULL_HANDLE  must have been created with a sample count equal to the value of rasterizationSamples  for the bound graphics pipeline.

      • Is a VkSampleCountFlagBits  value specifying the number of samples used in rasterization. This value is ignored for the purposes of setting the number of samples used in rasterization if the pipeline is created with the DYNAMIC_STATE_RASTERIZATION_SAMPLES_EXT  dynamic state set, but if DYNAMIC_STATE_SAMPLE_MASK_EXT  dynamic state is not set, it is still used to define the size of the pSampleMask  array as described below.

    • sampleShadingEnable

      • Can be used to enable Sample Shading.

    • minSampleShading

      • Specifies a minimum fraction of sample shading if sampleShadingEnable  is TRUE .

    • pSampleMask

      • Is a pointer to an array of VkSampleMask  values used in the sample mask test.

    • alphaToCoverageEnable

      • Controls whether a temporary coverage value is generated based on the alpha component of the fragment’s first color output as specified in the Multisample Coverage  section.

    • alphaToOneEnable

      • Controls whether the alpha component of the fragment’s first color output is replaced with one as described in Multisample Coverage .

    • flags

      • Reserved for future use.

Resolving
  • VkRenderingAttachmentInfo .

    • resolveMode

      • Is a VkResolveModeFlagBits  value defining how data written to imageView  will be resolved into resolveImageView .

      • If resolveMode  is not RESOLVE_MODE_NONE , and resolveImageView  is not NULL_HANDLE , a render pass multisample resolve operation is defined for the attachment subresource.

      • RESOLVE_MODE_NONE

        • Specifies that no resolve operation is done.

      • RESOLVE_MODE_SAMPLE_ZERO

        • Specifies that the result of the resolve operation is equal to the value of sample 0.

      • RESOLVE_MODE_AVERAGE

        • Specifies that the result of the resolve operation is the average of the sample values.

      • RESOLVE_MODE_MIN

        • Specifies that the result of the resolve operation is the minimum of the sample values.

      • RESOLVE_MODE_MAX

        • Specifies that the result of the resolve operation is the maximum of the sample values.

      • RESOLVE_MODE_EXTERNAL_FORMAT_DOWNSAMPLE_ANDROID

        • Specifies that rather than a multisample resolve, a single sampled color attachment will be downsampled into a Y′CBCR format image specified by an external Android format. Unlike other resolve modes, implementations can resolve multiple times during rendering, or even bypass writing to the color attachment altogether, as long as the final value is resolved to the resolve attachment. Values in the G, B, and R channels of the color attachment will be written to the Y, CB, and CR channels of the external format image, respectively. Chroma values are calculated as if sampling with a linear filter from the color attachment at full rate, at the location the chroma values sit according to VkPhysicalDeviceExternalFormatResolvePropertiesANDROID :: externalFormatResolveChromaOffsetX , VkPhysicalDeviceExternalFormatResolvePropertiesANDROID :: externalFormatResolveChromaOffsetY , and the chroma sample rate of the resolved image.

        • No range compression or Y′CBCR model conversion is performed by RESOLVE_MODE_EXTERNAL_FORMAT_DOWNSAMPLE_ANDROID ; applications have to do these conversions themselves. Value outputs are expected to match those that would be read through a Y′CBCR sampler using SAMPLER_YCBCR_MODEL_CONVERSION_RGB_IDENTITY . The color space that the values should be in is defined by the platform and is not exposed via Vulkan.

    • resolveImageView

      • Is an image view used to write resolved data at the end of rendering.

    • resolveImageLayout

      • Is the layout that resolveImageView  will be in during rendering.

      • If imageView  is not NULL_HANDLE  and resolveMode  is not RESOLVE_MODE_NONE , resolveImageLayout  must not be IMAGE_LAYOUT_UNDEFINED , IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL , IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL , IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL , IMAGE_LAYOUT_ZERO_INITIALIZED_EXT , IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL , or IMAGE_LAYOUT_PREINITIALIZED .

  • From multisample to single sample.

  • Combine sample values from a single pixel in a multisample attachment and store the result to the corresponding pixel in a single sample attachment.

  • Multisample resolve operations for attachments execute in the PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT  pipeline stage. A final resolve operation for all pixels in the render area happens-after any recorded command which writes a pixel via the multisample attachment to be resolved or an explicit alias of it in the subpass that it is specified.

  • Any single sample attachment specified for use in a multisample resolve operation may  have its contents modified at any point once rendering begins for the render pass instance.

  • Reads from the multisample attachment can be synchronized with ACCESS_COLOR_ATTACHMENT_READ . Access to the single sample attachment can be synchronized with ACCESS_COLOR_ATTACHMENT_READ  and ACCESS_COLOR_ATTACHMENT_WRITE . These pipeline stage and access types are used whether the attachments are color or depth/stencil attachments.

  • When using render pass objects, a subpass dependency specified with the above pipeline stages and access flags will ensure synchronization with multisample resolve operations for any attachments that were last accessed by that subpass. This allows later subpasses to read resolved values as input attachments.

  • Resolve operations only update values within the defined render area for the render pass instance. However, any writes performed by a resolve operation (as defined by its access masks) to a given attachment may  read and write back any memory locations within the image subresource bound for that attachment. For depth/stencil images, if separateDepthStencilAttachmentAccess  is FALSE , writes to one aspect may  also result in read-modify-write operations for the other aspect. If the subresource is bound to an attachment with feedback loop enabled , implementations must  not access pixels outside of the render area.

  • As entire subresources could be accessed by multisample resolve operations, applications cannot safely access values outside of the render area via aliased resources during a render pass instance when a multisample resolve operation is performed.

  • If RESOLVE_MODE_AVERAGE  is used, and the source format is a floating-point or normalized type, the sample values for each pixel are resolved with implementation-defined numerical precision.

  • If the numeric format  of the resolve attachment uses sRGB encoding, the implementation should  convert samples from nonlinear to linear before averaging samples as described in the “sRGB EOTF” section of the Khronos Data Format Specification . In this case, the implementation must  convert the linear averaged value to nonlinear before writing the resolved result to resolve attachment.

  • The resolve mode and store operation are independent; it is valid to write both resolved and unresolved values, and equally valid to discard the unresolved values while writing the resolved ones.

Multisample Anti-Aliasing (MSAA)
  • By default, only one sample per pixel is used, which is equivalent to no multisampling.

  • Maximum supported :

    • Can be extracted from VkPhysicalDeviceProperties  associated with our selected physical device.

    • The highest sample count supported by both the color image and the depth image (buffer) is the maximum we can use.

  • What to Multisample :

    • The render target.

    • If using a depth image, it should also be multisampled.

  • Limitations :

    • The multisampled image should only have one mip level.

      • This is enforced by the Vulkan specification in case of images with more than one sample per pixel.

    • Multi-sampled images cannot be presented directly.

      • This requirement does not apply to the depth buffer, since it won’t be presented at any point.

  • DOs :

    • Use 4x MSAA if possible; it’s not expensive and provides good image quality improvements.

    • Use loadOp = LOAD_OP_CLEAR  or loadOp = LOAD_OP_DONT_CARE  for multisampled images.

    • Use storeOp = STORE_OP_DONT_CARE  for multisampled images.

    • Use LAZILY_ALLOCATED  memory to back the allocated multisampled images; they do not need to be persisted into main memory and therefore do not need physical backing storage.

    • Use pResolveAttachments  in a subpass to automatically resolve a multisampled color buffer into a single-sampled color buffer.

    • Use KHR_depth_stencil_resolve  in a subpass to automatically resolve a multisampled depth buffer into a single-sampled depth buffer. Typically this is only useful if the depth buffer is going to be used further, in most cases it is transient and does not need to be resolved.

  • Avoid :

    • Avoid using vkCmdResolveImage() ; this has a significant negative impact on bandwidth and performance.

    • Avoid using loadOp = LOAD_OP_LOAD  for multisampled image attachments.

    • Avoid using storeOp = STORE_OP_STORE  for multisampled image attachments.

    • Avoid using more than 4x MSAA without checking performance.

  • Impact :

    • Failing to get an inline resolve can result in substantially higher memory bandwidth and reduced performance.

      • Manually writing and resolving a 4x MSAA 1080p surface at 60 FPS requires 3.9GB/s of memory bandwidth compared to just 500MB/s when using an inline resolve.

  • Sample Shading :

    • There are certain limitations of our current MSAA implementation which may impact the quality of the output image in more detailed scenes. For example, we're currently not addressing potential problems caused by shader aliasing: MSAA only smooths out the edges of geometry, not the interior filling. This may lead to a situation where a polygon renders with smooth edges on screen, but the applied texture still looks aliased if it contains high-contrast colors. One way to approach this problem is to enable Sample Shading , which will improve the image quality even further, though at an additional performance cost:

    void createLogicalDevice() {
        ...
        deviceFeatures.sampleRateShading = VK_TRUE; // enable sample shading feature for the device
        ...
    }

    void createGraphicsPipeline() {
        ...
        multisampling.sampleShadingEnable = VK_TRUE; // enable sample shading in the pipeline
        multisampling.minSampleShading = 0.2f; // min fraction for sample shading; closer to one is smoother
        ...
    }
    


  • Performance Tests :

    • (2025-09-07)

      • Done anyway, very approximate.

    • MSAA x8 = 900 fps

    • MSAA x4 = 1250 fps

    • MSAA x2 = 1550 fps

    • MSAA off = 2100 fps

    • As samples increase, frame time increases approximately by factors 1.35 (x2), 1.68 (x4) and 2.33 (x8) compared to the case without MSAA — this is consistent with substantial per-sample cost increase, but is not  strictly linear with the number of samples (e.g.: x4 is not exactly 4× nor x8 exactly 8×).

Fragment Shader

  • Programmable stage.

  • Is invoked for every fragment that survives and determines which framebuffer(s) the fragments are written to and with which color and depth values. It can do this using the interpolated data from the vertex shader, which can include things like texture coordinates and normals for lighting.

  • The VkShaderModule  objects are created from shader byte code.

Color Blending

  • Fixed-function stage.

  • Controls how the GPU combines the fragment shader’s output with what is already in the framebuffer.

  • Applies operations to mix different fragments that map to the same pixel in the framebuffer. Fragments can simply overwrite each other, add up or be mixed based upon transparency.

  • After a fragment shader has returned a color, it needs to be combined with the color that is already in the framebuffer.

  • This transformation is known as color blending, and there are two ways to do it:

    • Mix the old and new value to produce a final color

    • Combine the old and new value using a bitwise operation

  • Example :

    • If blending is enabled in the pipeline, the fragment shader result is blended with the render target’s previous contents.

    • So if the fragment result has alpha < 1.0, the clear color is blended with the fragment shader result, giving it a "transparent" look against the clear color.

  • VkPipelineColorBlendAttachmentState .

    • Contains the configuration per attached framebuffer.

    • This per-framebuffer struct allows you to configure the first way of color blending:

      // Pseudo-code
      if (blendEnable) {
          finalColor.rgb = (srcColorBlendFactor * newColor.rgb) <colorBlendOp> (dstColorBlendFactor * oldColor.rgb);
          finalColor.a = (srcAlphaBlendFactor * newColor.a) <alphaBlendOp> (dstAlphaBlendFactor * oldColor.a);
      } else {
          finalColor = newColor;
      }
      
      finalColor = finalColor & colorWriteMask;
      
    • The most common way to use color blending is to implement alpha blending, where we want the new color to be blended with the old color based on its opacity.

      • The finalColor  should then be computed as follows:

        finalColor.rgb = newAlpha * newColor + (1 - newAlpha) * oldColor;
        finalColor.a = newAlpha;
        
      • This can be achieved with the following parameters:

        colorBlendAttachment.blendEnable = vk::True;
        colorBlendAttachment.srcColorBlendFactor = vk::BlendFactor::eSrcAlpha;
        colorBlendAttachment.dstColorBlendFactor = vk::BlendFactor::eOneMinusSrcAlpha;
        colorBlendAttachment.colorBlendOp = vk::BlendOp::eAdd;
        colorBlendAttachment.srcAlphaBlendFactor = vk::BlendFactor::eOne;
        colorBlendAttachment.dstAlphaBlendFactor = vk::BlendFactor::eZero;
        colorBlendAttachment.alphaBlendOp = vk::BlendOp::eAdd;
        
    • blendEnable

      • If set to FALSE , then the new color from the fragment shader is passed through unmodified. Otherwise, the two mixing operations are performed to compute a new color.

      • The resulting color is AND’d with the colorWriteMask  to determine which channels are actually passed through.

  • VkPipelineColorBlendStateCreateInfo .

    • Contains the global  color blending settings.

    • References the array of structures for all the framebuffers and allows you to set blend constants that you can use as blend factors in the aforementioned calculations.

    • attachmentCount

      • Is the number of VkPipelineColorBlendAttachmentState  elements in pAttachments .

      • It is ignored if the pipeline is created with DYNAMIC_STATE_COLOR_BLEND_ENABLE_EXT , DYNAMIC_STATE_COLOR_BLEND_EQUATION_EXT , and DYNAMIC_STATE_COLOR_WRITE_MASK_EXT  dynamic states set, and either DYNAMIC_STATE_COLOR_BLEND_ADVANCED_EXT  set or the advancedBlendCoherentOperations  feature is not enabled.

    • pAttachments

      • Is a pointer to an array of VkPipelineColorBlendAttachmentState  structures defining blend state for each color attachment.

      • It is ignored if the pipeline is created with DYNAMIC_STATE_COLOR_BLEND_ENABLE_EXT , DYNAMIC_STATE_COLOR_BLEND_EQUATION_EXT , and DYNAMIC_STATE_COLOR_WRITE_MASK_EXT  dynamic states set, and either DYNAMIC_STATE_COLOR_BLEND_ADVANCED_EXT  set or the advancedBlendCoherentOperations  feature is not enabled.

    • logicOpEnable

      • Controls whether to apply Logical Operations.

    • logicOp

      • Selects which logical operation to apply.

      • If you want to use the second method of blending (a bitwise combination), then you should set logicOpEnable  to TRUE .

        • Note that this will automatically disable the first method, as if you had set blendEnable  to FALSE  for every attached framebuffer.

      • colorWriteMask  will also be used in this mode to determine which channels in the framebuffer will actually be affected.

      • If both modes are disabled, the fragment colors will be written to the framebuffer unmodified.

    • blendConstants

      • Is a pointer to an array of four values used as the R, G, B, and A components of the blend constant that are used in blending, depending on the blend factor .

    • flags

      • Is a bitmask of VkPipelineColorBlendStateCreateFlagBits  specifying additional color blending information.

Creation

Setup
  • VkGraphicsPipelineCreateInfo .

    • flags

      • DISABLE_OPTIMIZATION

        • Specifies that the created pipeline will not be optimized.

        • Using this flag may  reduce the time taken to create the pipeline.

    • renderPass

      • Is set to nullptr  because we’re using dynamic rendering instead of a traditional render pass.

    • basePipelineHandle

    • basePipelineIndex

    • Graphics Pipelines Inheritance :

      • Vulkan allows you to create a new graphics pipeline by deriving from an existing pipeline.

      • The idea of pipeline derivatives is that it is less expensive to set up pipelines when they have much functionality in common with an existing pipeline and switching between pipelines from the same parent can also be done quicker.

      • You can either specify the handle of an existing pipeline with basePipelineHandle  or reference another pipeline that is about to be created by index with basePipelineIndex .

      • These values are only used if the PIPELINE_CREATE_DERIVATIVE  flag is also specified in the flags  field of VkGraphicsPipelineCreateInfo .

  • vkCreateGraphicsPipelines() .

    • device

      • Is the logical device that creates the graphics pipelines.

    • pipelineCache

      • Is either NULL_HANDLE , indicating that pipeline caching is disabled, or to enable caching, the handle of a valid VkPipelineCache  object. The implementation must  not access this object outside of the duration of this command.

      • A pipeline cache can be used to store and reuse data relevant to pipeline creation across multiple calls to vkCreateGraphicsPipelines  and even across program executions if the cache is stored to a file. This makes it possible to significantly speed up pipeline creation at a later time.

    • createInfoCount

      • Is the length of the pCreateInfos  and pPipelines  arrays.

    • pCreateInfos

      • Is a pointer to an array of VkGraphicsPipelineCreateInfo  structures.

    • pAllocator

      • Controls host memory allocation; left as nullptr  in this tutorial.

    • pPipelines

      • Is a pointer to an array of VkPipeline  handles in which the resulting graphics pipeline objects are returned.

Dynamic Rendering Extra Steps
  • Changes to the VkGraphicsPipelineCreateInfo :

    • The VkGraphicsPipelineCreateInfo  must be created without a VkRenderPass .

    • The VkPipelineRenderingCreateInfo  must be included in the pNext .

      • If a graphics pipeline is created with a valid VkRenderPass , the parameters of the VkPipelineRenderingCreateInfo  are ignored.

  • VkPipelineRenderingCreateInfo .

    • colorAttachmentCount

      • Is the number of entries in pColorAttachmentFormats

    • pColorAttachmentFormats

      • Is a pointer to an array of VkFormat  values defining the format of color attachments used in this pipeline.

    • depthAttachmentFormat

      • Is a VkFormat  value defining the format of the depth attachment used in this pipeline.

    • stencilAttachmentFormat

      • Is a VkFormat  value defining the format of the stencil attachment used in this pipeline.

    • viewMask

      • Is a bitfield of view indices describing which views are active during rendering.

      • It must  match VkRenderingInfo::viewMask  when rendering.

        • As defined in VkRenderingInfo :

          • Is a bitfield of view indices describing which views are active during rendering, when it is not 0 .

          • If viewMask  is not 0 , multiview is enabled.

    • Formats :

      • If depthAttachmentFormat , stencilAttachmentFormat , or any element of pColorAttachmentFormats  is UNDEFINED , it indicates that the corresponding attachment is unused within the render pass.

      • Valid formats indicate that an attachment can  be used - but it is still valid to set the attachment to NULL  when beginning rendering.

Managing Pipelines and Reducing Overhead

  • Tips and Tricks: Vulkan Dos and Don’ts .

    • Use pipeline cache.

    • Use specialization constants.

      • This may decrease the number of instructions and registers used by the shader.

      • Specialization constants can also be used instead of offline shader permutations to minimize the amount of bytecode that needs to be shipped with an application.

    • Switching pipelines:

      • Avoid frequently switching between pipelines that use different sets of pipeline stages.

      • Minimize the number of vkCmdBindPipeline  calls; each call has significant CPU and GPU cost.

        • Consider sorting  of drawcalls and/or using a low number of dynamic states.

      • Switching on/off the tessellation, geometry, task and mesh shaders is an expensive operation.

    • Draw calls:

      • Group draw calls, taking into account what kinds of shaders they use.

The Problem
  • Immutable Pipelines.

  • Each combination of inputs requires a dedicated pipeline.

    • Shader, topology, blend mode, vertex layout, cull mode, etc.

    • So if we want to do things like toggle depth-testing on and off, we will need 2 pipelines.

  • Causes a combinatorial explosion of variants.

    • Tens of thousands of pipelines for shipping titles.

  • Building pipelines is a very expensive operation, and we want to minimize the number of pipelines used, as it's critical for performance.

My decisions
  • (2025-08-10)

  • Dynamic State is a must.

  • The use of Shader Object still seems new and may introduce some extra complexity in certain cases.

    • I don't know about mobile support.

  • The use of Graphics Pipeline Libraries sounds interesting, but at the same time it seems limiting in some moments, for Geometry and Tessellation Shaders.

    • I don't know about mobile support.

  • Overall, I believe that refactoring a game object to use Shader Object or Graphics Pipeline Libraries sounds "simple", since it's more about how the pipeline is constructed than how one interacts with shaders or descriptor sets. In other words, it seems like an okay decision to make in the future.

  • Considering the low support, and the fact that I don't have so many pipelines in mind that actually make these solutions necessary, I prefer to use graphics pipelines manually, in the "default" way.

  • Regardless, I believe that using Shader Object or Graphics Pipeline Libraries does not  remove the need to worry about pipeline caching or precautions to avoid switching the pipeline binding all the time.

    • Correct. Extensions change how pipelines are created/linked but do not remove the performance considerations around pipeline creation, pipeline cache usage, or minimizing pipeline re-binding at draw time. Vendors and platform docs recommend pipeline caches, pre-creation, and minimizing pipeline binds.

  • What I will do, therefore: caching and sorting of pipelines based on similarity. I will worry more about binding the pipeline in command buffers and their descriptor sets, than the process of facilitating the creation of new pipelines.

    • This plan aligns with widely recommended practical strategies: use pipeline caches (persist to disk where possible), sort and batch by pipeline/descriptor similarity, and create pipelines asynchronously (background threads) to avoid stutter. These practices address the main runtime pain points regardless of whether you later adopt shader-object or pipeline-library extensions.

    • Your current decisions are internally consistent and align with common, pragmatic industry practice: prefer stable/default graphics pipelines with pipeline caching, sorting, and background creation as the primary strategy, while keeping code organized so you can adopt EXT_shader_object  or EXT_graphics_pipeline_library  later if/when device support and measured benefits justify the switch.

Mutability with VkDynamicState
  • Implemented.

  • It's a must .

  • Not everything has to be immutable.

  • Set desired state while recording command buffers.

  • Over 70 states can be dynamic.

  • If we didn't use this, we would need to create new pipelines whenever we wanted to change the resolution of our rendering.

No pipelines, with EXT_shader_object
  • EXT_shader_object .

  • Sample .

  • Article .

  • Support :

    • Coverage .

    • (2025-09-08) 11.29%.

      • 33.8% Windows.

      • 26.3% Linux.

      • 0% Android.

  • Shader Object and implementation in Odin {7:30 -> 11:56} .

    • Questions :

      • I don't know where pColorAttachmentFormats  and depthAttachmentFormat  are specified.

        • I don't know if it's even necessary to specify them anywhere.

        • The words attachment  or format  do not appear anywhere in the sample or in the spec of the extension.

          pipeline_rendering_create_info := vk.PipelineRenderingCreateInfo{
              sType                   = .PIPELINE_RENDERING_CREATE_INFO,
              colorAttachmentCount    = 1,
              pColorAttachmentFormats = format,
              depthAttachmentFormat   = .D24_UNORM_S8_UINT,
              stencilAttachmentFormat = {},
              viewMask                = 0,
          }
      
    • Code .

    create_shaders :: proc() {
        push_constant_ranges := []vk.PushConstantRange {    // Pipeline
            {
                stageFlags = {.VERTEX, .FRAGMENT},
                size = 128,
            }
        }
        
        /*
        This is not used in the Shader Object.
        The only place that needs this in its code, is when making the call `vk.CmdPushConstants(cmd, g.pipeline_layout, {.VERTEX, .FRAGMENT}, 0, size_of(push), &push)`.
        */
        pipeline_layout_ci := vk.PipelineLayoutCreateInfo {
            sType = .PIPELINE_LAYOUT_CREATE_INFO,
            // flags                  = {},
            // setLayoutCount         = 1,
            // pSetLayouts            = {},
            pushConstantRangeCount = u32(len(push_constant_ranges)),
            pPushConstantRanges = raw_data(push_constant_ranges),
        }
        check(vk.CreatePipelineLayout(g.device, &pipeline_layout_ci, nil, &g.pipeline_layout))  // Pipeline
    
    
        vert_code := load_file("shaders/shader.vert.spv", context.temp_allocator)  // Shader_Info
        frag_code := load_file("shaders/shader.frag.spv", context.temp_allocator)  // Shader_Info
        shader_cis := [2]vk.ShaderCreateInfoEXT {
            {
                sType = .SHADER_CREATE_INFO_EXT,
                codeType = .SPIRV,
                codeSize = len(vert_code),
                pCode = raw_data(vert_code),
                pName = "main",
                stage = {.VERTEX},
                nextStage = {.FRAGMENT},
                flags = {.LINK_STAGE},
                // setLayoutCount:         u32,
                // pSetLayouts:            [^]DescriptorSetLayout,
                pushConstantRangeCount = u32(len(push_constant_ranges)),
                pPushConstantRanges = raw_data(push_constant_ranges),
                // pSpecializationInfo:    ^SpecializationInfo,
            },
            {
                sType = .SHADER_CREATE_INFO_EXT,
                codeType = .SPIRV,
                codeSize = len(frag_code),
                pCode = raw_data(frag_code),
                pName = "main",
                stage = {.FRAGMENT},
                // nextStage:              ShaderStageFlags,
                flags = {.LINK_STAGE},
                // setLayoutCount:         u32,
                // pSetLayouts:            [^]DescriptorSetLayout,
                pushConstantRangeCount = u32(len(push_constant_ranges)),
                pPushConstantRanges = raw_data(push_constant_ranges),
                // pSpecializationInfo:    ^SpecializationInfo,
            },
        }
        check(vk.CreateShadersEXT(g.device, 2, raw_data(&shader_cis), nil, raw_data(&g.shaders)))
    }
    
    destroy_shaders :: proc() {
        vk.DestroyPipelineLayout(g.device, g.pipeline_layout, nil)
        for shader in g.shaders do vk.DestroyShaderEXT(g.device, shader, nil)
    }
    
    render :: proc(cmd: vk.CommandBuffer) {
        shader_stages := [2]vk.ShaderStageFlags { {.VERTEX}, {.FRAGMENT} }
        vk.CmdBindShadersEXT(cmd, 2, raw_data(&shader_stages), raw_data(&g.shaders))
        
        vk.CmdSetVertexInputEXT(cmd, 0, nil, 0, nil) // Shader_Info: vk.VertexInputBindingDescription, vk.VertexInputAttributeDescription.
    
        vk.CmdSetViewportWithCount(cmd, 1, &vk.Viewport {  // Dynamic
            width = f32(g.swapchain.width),
            height = f32(g.swapchain.height),
            minDepth = 0,
            maxDepth = 1,
        })
        vk.CmdSetScissorWithCount(cmd, 1, &vk.Rect2D {
            extent = {width = g.swapchain.width, height = g.swapchain.height}  // Dynamic
        })
        vk.CmdSetRasterizerDiscardEnable(cmd, false) // Pipeline
    
        vk.CmdSetPrimitiveTopology(cmd, .TRIANGLE_LIST)  // Pipeline
        vk.CmdSetPrimitiveRestartEnable(cmd, false)      // Pipeline
    
        vk.CmdSetRasterizationSamplesEXT(cmd, {._1})     // Pipeline
        sample_mask := vk.SampleMask(1)
        vk.CmdSetSampleMaskEXT(cmd, {._1}, &sample_mask) // Pipeline
        vk.CmdSetAlphaToCoverageEnableEXT(cmd, false)    // Pipeline
    
        vk.CmdSetPolygonModeEXT(cmd, .FILL)              // Pipeline
        vk.CmdSetCullMode(cmd, {})                       // Pipeline
        vk.CmdSetFrontFace(cmd, .COUNTER_CLOCKWISE)      // Pipeline
    
        vk.CmdSetDepthTestEnable(cmd, false)             // Pipeline
        vk.CmdSetDepthWriteEnable(cmd, false)            // Pipeline
        vk.CmdSetDepthBiasEnable(cmd, false)             // Pipeline
        vk.CmdSetStencilTestEnable(cmd, false)           // Pipeline
    
        b32_false := b32(false)
        vk.CmdSetColorBlendEnableEXT(cmd, 0, 1, &b32_false) // Pipeline
    
        color_mask := vk.ColorComponentFlags { .R, .G, .B, .A }
        vk.CmdSetColorWriteMaskEXT(cmd, 0, 1, &color_mask)  // Pipeline
    
        Push :: struct {
            color: [3]f32,
        }
        push := Push { color = { 0, 0.5, 0 } }
        vk.CmdPushConstants(cmd, g.pipeline_layout, {.VERTEX, .FRAGMENT}, 0, size_of(push), &push)
        
        // vk.CmdBindDescriptorSets                         // Dynamic
    
        vk.CmdDraw(cmd, 3, 1, 0, 0)
    }
    
  • Ditch pipelines entirely.

  • Bind compiled shader stages.

  • It was created primarily for the Nintendo Switch, to reduce the performance gap between Vulkan and NVN (the Switch's native API), which doesn't even have the concept of pipeline state objects and maps almost 1:1 to how Nvidia hardware works.

  • If you want to use Shader Objects, the reason should be "I find it much easier to use/maintain". Once your engine grows you'll encounter friction, as the extension is meant for porting old engines and goes against newer features.

  • Support :

    • Hard to recommend due to its limited support.

    • Currently only available on AMD & Nvidia.

    • An emulation layer is provided, which makes shader objects usable on any device that doesn't support them natively, but you need to ship the layer's DLL alongside the application.

  • Shaders :

    • This extension introduces a new object type VkShaderEXT  which represents a single compiled shader stage. VkShaderEXT  objects may be created either independently or linked with other VkShaderEXT  objects created at the same time. To create VkShaderEXT  objects, applications call vkCreateShadersEXT() .

    • This function compiles the source code for one or more shader stages into VkShaderEXT  objects.

    • Optional Linking :

      • Whenever createInfoCount  is greater than one, the shaders being created may optionally be linked together. Linking allows the implementation to perform cross-stage optimizations based on a promise by the application that the linked shaders will always be used together.

      • Though a set of linked shaders may perform anywhere from the same to substantially better than equivalent unlinked shaders, this tradeoff is left to the application and linking is never mandatory.

      • To specify that shaders should be linked, include the SHADER_CREATE_LINK_STAGE_EXT  flag in each of the VkShaderCreateInfoEXT  structures passed to vkCreateShadersEXT() . The presence or absence of SHADER_CREATE_LINK_STAGE_EXT  must match across all VkShaderCreateInfoEXT  structures passed to a single vkCreateShadersEXT()  call: i.e., if any member of pCreateInfos  includes SHADER_CREATE_LINK_STAGE_EXT  then all other members must include it too. SHADER_CREATE_LINK_STAGE_EXT  is ignored if createInfoCount  is one, and a shader created this way is considered unlinked.

    • The stage of the shader being compiled is specified by stage . Applications must also specify which stage types will be allowed to immediately follow the shader being created. For example, a vertex shader might specify a nextStage  value of SHADER_STAGE_FRAGMENT  to indicate that the vertex shader being created will always be followed by a fragment shader (and never a geometry or tessellation shader). Applications that do not know this information at shader creation time or need the same shader to be compatible with multiple subsequent stages can specify a mask that includes as many valid next stages as they wish. For example, a vertex shader can specify a nextStage  mask of SHADER_STAGE_GEOMETRY | SHADER_STAGE_FRAGMENT  to indicate that the next stage could be either a geometry shader or fragment shader (but not a tessellation shader).

    • etc, see the spec .

Reducing compilation overhead, with EXT_graphics_pipeline_libraries
  • EXT_graphics_pipeline_library .

  • Sample .

  • Support :

    • Release: (2022-06-03).

    • Coverage .

    • (2025-09-08) 18.7% coverage.

      • 40.7% Windows.

      • 40.6% Linux.

      • 4.88% Android.

  • Extra info .

    • I've read until the Dynamic State header.

  • Allows separate compilation of different parts of the graphics pipeline. With this it’s now possible to split up the monolithic pipeline creation into different steps and re-use common parts shared across different pipelines.

  • Compared to monolithic pipeline state, this results in faster pipeline creation times, making this extension a good fit for applications and games that do a lot of pipeline creation at runtime.

  • Libraries are partial pipeline objects which cannot be bound directly; they are linked together to form a final executable pipeline.

  • Encourages reuse of compilation work and reduces startup/runtime stutter for games with many similar pipelines.

  • Because libraries are precompiled partial pipelines, linking is generally cheaper than compiling whole pipelines from scratch.

  • Individual pipelines stages :

    • The monolithic pipeline state has been split into distinct parts that can be compiled independently.

    • Vertex Input Interface :

      • Contains the information that would normally be provided to the full pipeline state object by VkPipelineVertexInputStateCreateInfo and VkPipelineInputAssemblyStateCreateInfo.

      • "For our engine, this information is not known until draw time, so a pipeline for this stage is still hashed and created at draw time."

      • This stage contains no shader code, so the driver can create it quickly, and only a fairly small number of these objects is needed.

    • Pre-Rasterization Shaders :

      • Contains the shaders that run before rasterization (vertex, tessellation, geometry), along with state such as VkPipelineViewportStateCreateInfo and VkPipelineRasterizationStateCreateInfo.

    • Fragment Shader :

      • Contains the fragment shader along with the state in VkPipelineDepthStencilStateCreateInfo and VkRenderPass (or dynamic rendering - although in that case only the viewMask is required).

      • If combined with dynamic rendering you can create the fragment shader pipeline with only the SPIR-V and the pipeline layout.
        This allows the driver to do the heavy lifting of lowering to hardware instructions for the pre-rasterization and fragment shaders with very little information.

    • Fragment Output Interface :

      • Contains the VkPipelineColorBlendStateCreateInfo, VkPipelineMultisampleStateCreateInfo, and VkRenderPass (or dynamic rendering)

      • Like with the Vertex Input Interface, this stage requires information that we don’t know until draw time, so this state is also hashed and the Fragment Output Interface pipeline is created at draw time.

      • It is expected to be very quick to create and also relatively small in number.

  • Final link :

    • With all four individual pipeline library stages created, an application can perform a final link to a full pipeline. This final link is expected to be extremely fast - the driver will have done the shader compilation for the individual stages and thus the link can be performed at draw time at a reasonable cost.

    • This is where the big benefit of the extension comes in: we’ve pre-created all of our pre-rasterization and fragment shaders, hashed the small number of vertex input/fragment output interfaces, and can on-demand create a fast linked pipeline library at draw time, thus avoiding a dreaded hitch.

  • If shader compilation stutter is your concern, this extension is the way to go. This extension lets you create partially-constructed PSOs (Pipeline State Objects) (e.g. one for the Vertex Shader, another for the Pixel Shader), and then combine them to generate the final PSO. This allows splitting the huge monolithic block into smaller monolithic blocks that are easier to handle and design around, making the API more D3D11-like (D3D11 has monolithic Rasterizer State blocks and Blend State blocks).

  • Creating pipeline libraries :

    • Creating a pipeline library (part) is similar to creating a pipeline, with the difference that you only need to specify the properties required for that specific pipeline state.

      • E.g. for the vertex input interface you only specify input assembly and vertex input state, which is all required to define the interfaces to a vertex shader.

    VkGraphicsPipelineLibraryCreateInfoEXT library_info{};
    library_info.sType = STRUCTURE_TYPE_GRAPHICS_PIPELINE_LIBRARY_CREATE_INFO_EXT;
    library_info.flags = GRAPHICS_PIPELINE_LIBRARY_VERTEX_INPUT_INTERFACE_EXT;
    
    VkPipelineInputAssemblyStateCreateInfo       input_assembly_state  = vkb::initializers::pipeline_input_assembly_state_create_info(PRIMITIVE_TOPOLOGY_TRIANGLE_LIST, 0, FALSE);
    VkPipelineVertexInputStateCreateInfo         vertex_input_state    = vkb::initializers::pipeline_vertex_input_state_create_info();
    std::vector<VkVertexInputBindingDescription> vertex_input_bindings = {
        vkb::initializers::vertex_input_binding_description(0, sizeof(Vertex), VERTEX_INPUT_RATE_VERTEX),
    };
    std::vector<VkVertexInputAttributeDescription> vertex_input_attributes = {
        vkb::initializers::vertex_input_attribute_description(0, 0, FORMAT_R32G32B32_SFLOAT, 0),
        vkb::initializers::vertex_input_attribute_description(0, 1, FORMAT_R32G32B32_SFLOAT, sizeof(float) * 3),
        vkb::initializers::vertex_input_attribute_description(0, 2, FORMAT_R32G32_SFLOAT, sizeof(float) * 6),
    };
    vertex_input_state.vertexBindingDescriptionCount   = static_cast<uint32_t>(vertex_input_bindings.size());
    vertex_input_state.pVertexBindingDescriptions      = vertex_input_bindings.data();
    vertex_input_state.vertexAttributeDescriptionCount = static_cast<uint32_t>(vertex_input_attributes.size());
    vertex_input_state.pVertexAttributeDescriptions    = vertex_input_attributes.data();
    
    VkGraphicsPipelineCreateInfo pipeline_library_create_info{};
    pipeline_library_create_info.sType               = STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
    pipeline_library_create_info.flags               = PIPELINE_CREATE_LIBRARY_KHR | PIPELINE_CREATE_RETAIN_LINK_TIME_OPTIMIZATION_INFO_EXT;
    pipeline_library_create_info.pNext               = &library_info;
    pipeline_library_create_info.pInputAssemblyState = &input_assembly_state;
    pipeline_library_create_info.pVertexInputState   = &vertex_input_state;
    
    vkCreateGraphicsPipelines(get_device().get_handle(), pipeline_cache, 1, &pipeline_library_create_info, nullptr, &pipeline_library.vertex_input_interface);
    
  • Deprecating shader modules :

    • With this extension, creating shader modules with vkCreateShaderModule  has been deprecated and you can instead just pass the shader module create info via pNext  into your pipeline shader stage create info. This change bypasses a useless copy and is recommended.

    • You can see this in the pre-rasterization and fragment shader library setup parts of the sample below.

    VkShaderModuleCreateInfo shader_module_create_info{};
    shader_module_create_info.sType    = STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
    shader_module_create_info.codeSize = static_cast<uint32_t>(spirv.size()) * sizeof(uint32_t);
    shader_module_create_info.pCode    = spirv.data();
    
    VkPipelineShaderStageCreateInfo shader_stage_create_info{};
    shader_stage_create_info.sType = STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    // Chain the shader module create info
    shader_stage_create_info.pNext = &shader_module_create_info;
    shader_stage_create_info.stage = SHADER_STAGE_VERTEX;
    shader_stage_create_info.pName = "main";
    
    VkGraphicsPipelineCreateInfo pipeline_library_create_info{};
    pipeline_library_create_info.stageCount = 1;
    pipeline_library_create_info.pStages    = &shader_stage_create_info;
    
  • Linking executables :

    • Once all pipeline (library) parts have been created, the pipeline executable can be linked together from them:

    std::vector<VkPipeline> libraries = {
        pipeline_library.vertex_input_interface,
        pipeline_library.pre_rasterization_shaders,
        fragment_shader,
        pipeline_library.fragment_output_interface
    };
    
    // Link the library parts into a graphics pipeline
    VkPipelineLibraryCreateInfoKHR linking_info{};
    linking_info.sType        = STRUCTURE_TYPE_PIPELINE_LIBRARY_CREATE_INFO_KHR;
    linking_info.libraryCount = static_cast<uint32_t>(libraries.size());
    linking_info.pLibraries   = libraries.data();
    
    VkGraphicsPipelineCreateInfo executable_pipeline_create_info{};
    executable_pipeline_create_info.sType = STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
    executable_pipeline_create_info.pNext = &linking_info;
    executable_pipeline_create_info.flags = PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_EXT;
    
    VkPipeline executable = NULL_HANDLE;
    vkCreateGraphicsPipelines(get_device().get_handle(), thread_pipeline_cache, 1, &executable_pipeline_create_info, nullptr, &executable);
    
    • This produces the pipeline state object that will be bound at draw time.

    • A note on PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_EXT : This is an optimization flag. If specified, implementations are allowed to do additional optimization passes. This may increase build times but can in turn result in lower runtime costs.

  • Independent Descriptor Sets :

    • Imagine a situation where the vertex and fragment stage accesses two different descriptor sets.

    // Vertex Shader
    layout(set = 0) UBO_X;
    
    // Fragment Shader
    layout(set = 1) UBO_Y;
    
    • Normally when compiling a pipeline, both stages are together and internally a driver will reserve 2 separate descriptor slots for UBO_X  and UBO_Y . When using graphics pipeline libraries, the driver will see the fragment shader only uses a single descriptor set. It might internally map it to set 0 , but when linking the two libraries, there will be a collision. The PIPELINE_LAYOUT_CREATE_INDEPENDENT_SETS_EXT  flag ensures the driver will be able to handle this case and not have any collisions. There are some extra constraints when using this flag, but the Validation Layers will detect them for you.

  • Explanation .

    • .

    • .

      • Same number of pipelines, but acquired through reuse, instead of recompilation.

      • Think of the link step as additive, instead of multiplicative.

    • .

    • .

    • Considerations :

      • At the time it was said there would be an impact on CPU.

      • It was unknown whether it was compatible with mobile or not.

      • No libraries were made for Geometry and Tessellation Shaders, as they are difficult.

~One pipeline per shader variant
  • It is the cause of the problem listed above.

  • Causes a combinatorial explosion of variants.

Single pipeline, branch inside shader (material ID / push constant)
  • No way, seems horrible.

Optimizations

Pipeline Cache, with VkPipelineCache
  • Pipeline cache sample .

  • Pipeline Cache .

  • Pipeline Cache .

  • It allows the driver to reuse previously computed pipeline artifacts across pipeline creations (and you can persist cache data between runs).

  • Avoids repeating expensive driver  work; shortens startup time by reusing previously compiled artifacts.

  • Creating a Vulkan pipeline requires compiling the VkShaderModule  internally, which can significantly increase frame time if done at runtime. To reduce this time, you can provide a previously initialised VkPipelineCache  object when calling the vkCreateGraphicsPipelines  or vkCreateComputePipelines  functions. This object behaves like a cache container which stores the pipeline internal representation for reuse. In order to benefit from using a VkPipelineCache  object, the data recorded during pipeline creation needs to be saved to disk and reused between application runs.

  • Vulkan allows an application to obtain the binary data of a VkPipelineCache  object and save it to a file on disk before terminating the application. This operation can be achieved using two calls to the vkGetPipelineCacheData  function to obtain the size and VkPipelineCache  object’s binary data. In the next application run, the VkPipelineCache  can be initialised with the previous run’s data. This will allow the vkCreateGraphicsPipelines  or vkCreateComputePipelines  functions to reuse the baked state and avoid repeating costly operations such as shader compilation.

  • How to use it :

    • Create one VkPipelineCache  for related pipeline creation operations (often one per device).

    • Pass it into vkCreateGraphicsPipelines  for every create call.

    • On exit (or periodically) call vkGetPipelineCacheData  and write to disk; on startup feed that blob into vkCreatePipelineCache  to prepopulate the cache.

  • KHR_pipeline_binary

    • VkPipelineCache  objects were designed to enable a Vulkan driver to reuse blobs of state or shader code between different pipelines. Originally, the idea was that the driver would know best which parts of state could be reused, and applications only needed to manage storage and threading, simplifying developer code.

    • Over time however, VkPipelineCache  objects proved to be too opaque, prompting the Vulkan Working Group to release a number of extensions to provide more application control over them. The current capabilities of VkPipelineCache  objects satisfy many applications, but have shortcomings in more advanced use cases.

    • Previous difficulties :

      • The VkPipelineCache  API provides no control over the lifetime of the binary objects that it contains. An application wanting to implement an LRU cache, for example, has a hard time using VkPipelineCache  objects.

      • Some applications maintain a cache of VkPipeline objects. The VkPipelineCache API makes it impossible to efficiently associate the cached binary objects within a VkPipelineCache object with the application’s own cache entries.

    • What’s more, most drivers maintain an internal cache of pipeline-derived binary objects. In some cases, it would be beneficial for the application to directly interact with that internal cache, especially on some specialized platforms.

    • The new KHR_pipeline_binary  extension introduces a clean new approach that provides applications with access to binary blobs and the information necessary for optimal caching, while smoothly integrating with the application’s own caching mechanisms.

    • It’s worth noting that the EXT_shader_object  extension already includes analogous functionality to KHR_pipeline_binary . The two extensions were worked on concurrently to provide a universally available solution, including devices where the EXT_shader_object  extension cannot yet be supported.

    • Applications that do not need the advanced functionality of the new KHR_pipeline_binary extension can continue to use VkPipelineCache objects for their simplicity and optimized implementation. But developers that are not satisfied with the VkPipelineCache API should read on to learn more about this powerful new approach.

    • Article .

      • Read up to 'Caching With KHR_pipeline_binary'.

Optimizing the Shader with KHR_buffer_device_address
Pipeline derivatives
  • A creation mechanism to tell the driver that one pipeline is a parent and others are children (derivatives).

  • The driver may avoid redoing expensive compile/link steps and reuse intermediate data from the parent, reducing creation time.

  • The intent is faster creation of children by reusing work/data from the parent.

  • The pipeline creation API provides no way to tell it what state will change. The idea being that, since the implementation can see the parent's state, and it can see what you ask of the child's state, it can tell what's different.

  • Is it worth it?  NO.

    • TLDR :

      • No vendor is actually recommending the use of pipeline derivatives, except maybe to speed up pipeline creation.

    • Tips and Tricks: Vulkan Dos and Don’ts .

      • Don’t expect speedup from Pipeline Derivatives.

    • Vulkan Usage Recommendations , Samsung

      • Pipeline derivatives let applications express "child" pipelines as incremental state changes from a similar "parent"; on some architectures, this can reduce the cost of switching between similar states.

      • Many mobile GPUs gain performance primarily through pipeline caches, so pipeline derivatives often provide no  benefit to portable mobile applications.

      • Recommendations:

        • Create pipelines early in application execution. Avoid pipeline creation at draw time.

        • Use a single pipeline cache  for all pipeline creation.

        • Write the pipeline cache to a file between application runs.

        • Avoid pipeline derivatives.

    • Vulkan Best Practice for Mobile Developers - Pipeline Management , Arm Software, Jul 11, 2019

      • Don't create pipelines at draw time without a pipeline cache (introduces performance stutters).

      • Don't use pipeline derivatives as they are not supported.

    • Vulkan Samples, LunarG - API-Samples/pipeline_derivative/pipeline_derivative.cpp

      • This sample creates a pipeline derivative and draws with it. Pipeline derivatives should allow for faster creation of pipelines.

      • In this sample, we'll create the default pipeline, but then modify it slightly and create a derivative.

      • The derivative will be used to render a simple cube. We may later find that the pipeline is too simple to show any speedup, or that replacing the fragment shader is too expensive, so this sample can be updated then.

  • Typical use case :

    • Many pipelines that differ only by a few fields (e.g., different specializations or small state changes).

  • How to use :

    • Create a base pipeline with PIPELINE_CREATE_ALLOW_DERIVATIVES .

    • For similar pipelines (small shader or state differences), create child pipelines with PIPELINE_CREATE_DERIVATIVE  and set basePipelineHandle  or basePipelineIndex  pointing to the base.

  • How it affects the pipeline workflow :

    • Can materially reduce pipeline creation cost when many similar pipelines are needed.

    • Useful at runtime if you must create many variants quickly.

    • Still creates separate pipeline objects (state memory + driver bookkeeping).

  • Not guaranteed to be implemented with identical performance gains on all drivers; behavior is driver-dependent.

Compute Pipeline

Use cases
  • Calculate images from complex postprocessing chains.

  • Raytracing or other non-geometry drawing.

Creation
  • We first need to create the pipeline layout for it, and then hook up a single shader module for its code.

  • Once it's built, we can execute the compute shader by first calling vkCmdBindPipeline  and then vkCmdDispatch .

Using
  • You generally want to use a memory barrier after the dispatch of the compute shader, so that any subsequent access to its data waits for the compute shader to finish.

    • In OpenGL, glMemoryBarrier  with GL_SHADER_STORAGE_BARRIER_BIT  is used.

Workgroup
  • vkCmdDispatch .

  • For an image, I decided to use only 2 of those dimensions, so that we can execute one workgroup per group of pixels in the image.

  • When executing compute shaders, they will get executed in groups of N lanes/threads.

  • The most difficult part is deciding how to partition the work between the Workgroup count and the Local Size.

  • Local Size is also called Workgroup Size, representing the number of threads inside each Workgroup.

  • .

    • The code is in OpenGL, but the concept is the same.

  • Ideally, the local_size should be a multiple of the GPU's warp/wavefront size, so you don't waste processing power.

  • For layout(local_size_x = 3, local_size_y = 4, local_size_z = 2) , you'll use 3 * 4 * 2 = 24  threads, which is not ideal for NVIDIA's warp size of 32.

  • .

GLSL Built-in Variables
Examples
  • The shader code is a very simple shader that creates a gradient from the coordinates of the global invocation ID.

//GLSL version to use
#version 460

//size of a workgroup for compute
layout (local_size_x = 16, local_size_y = 16) in;

//descriptor bindings for the pipeline
layout(rgba16f,set = 0, binding = 0) uniform image2D image;


void main() 
{
    ivec2 texelCoord = ivec2(gl_GlobalInvocationID.xy);
    ivec2 size = imageSize(image);

    if(texelCoord.x < size.x && texelCoord.y < size.y)
    {
        vec4 color = vec4(0.0, 0.0, 0.0, 1.0);

        if(gl_LocalInvocationID.x != 0 && gl_LocalInvocationID.y != 0)
        {
            color.x = float(texelCoord.x)/(size.x);
            color.y = float(texelCoord.y)/(size.y); 
        }
    
        imageStore(image, texelCoord, color);
    }
}
  • Inside the shader itself, we can see layout (local_size_x = 16, local_size_y = 16) in;  (z=1 by default).

    • By doing that, we are setting the size of a single workgroup.

    • This means that for every work unit from the vkCmdDispatch , we will have 16x16 lanes of execution, which works well to write into a 16x16 pixel square.

  • The next layout statement is for the shader input through descriptor sets. We are setting a single image2D as set 0 and binding 0 within that set.

  • If local invocation ID is 0 on either X or Y, we will just default to black. This is going to create a grid that will directly display our shader workgroup invocations.

  • On the shader code, we can access what the lane index is through gl_LocalInvocationID  variable.

  • There are also gl_GlobalInvocationID  and gl_WorkGroupID . By using those variables we can find out exactly which pixel each lane writes to.

Compute Shader Raytracing

Resources

  • Resources are views of memory with associated formatting and dimensionality.

  • Nvidia: Make sure to always use the minimum set of resource usage flags. Redundant flags may trigger redundant flushes and stalls in barriers and slow down your app unnecessarily.

  • Resource Creation .

Primary resources
  • Buffers.

    • Provide access to raw arrays of bytes.

  • Images.

    • Can  be multidimensional and may  have associated metadata.

  • Tensors.

    • Can  be multidimensional, contain format information like images and may  have associated metadata.

  • Samplers.

    • Used to sample from images at certain coordinates, producing interpolated color values.

  • Micromaps .

    • Use buffers as the backing store for opaque data structures.

  • Acceleration Structures .

    • Use buffers as the backing store for opaque data structures.

    • Used for realtime raytracing.

Buffers

  • Buffers in Vulkan are regions of memory used for storing arbitrary data that can be read by the graphics card.

  • They are essentially unformatted arrays of bytes.

  • Types of Buffers :

    • Unformatted array .

    • Uniform Buffer :

      • It remains uniform during the execution of a command (like a draw call).

      • Only load operations (read only).

        • "Read" == "Load".

        • This allows the GPU to cache them efficiently.

      • Loaded into the L2 cache, and further into an L1 cache.

    • Storage Buffers :

      • Allow Load and Store operations.

      • Supports atomic operations.

      • Data can be loaded from GPU memory into the L2->L1 caches, and shaders can also store data back into memory.

    • Texel Buffers :

      • Uniform Texel Buffer.

      • Storage Texel Buffer.

      • Formatted view.

    • Dynamic Buffers :

      • Dynamic Uniform Buffer.

      • Dynamic Texel Buffer.

    • etc.

  • Queues :

    • Just like the images in the Swapchain, buffers can also be owned by a specific queue family or be shared between multiple at the same time.

      • The buffer will only be used from the graphics queue, so we can stick to exclusive access.

Create
  • vkCreateBuffer()

    • VkBuffer

      • A chunk of GPU-visible memory.

    • VkBufferCreateInfo

      • size

        • Specifies the size of the buffer in bytes. Calculating the byte size of the vertex data is straightforward with sizeof .

      • usage

        • Indicates for which purposes the data in the buffer is going to be used.

        • It is possible to specify multiple purposes using a bitwise or.

      • flags

        • Is used to configure sparse buffer memory, which is not relevant right now. We'll leave it at the default value of 0 .

      • sharingMode

        • Specifying the sharing mode of the buffer when it will be accessed by multiple queue families.

        • The buffer will only be used from the graphics queue, so we can stick to exclusive access.

        • NVIDIA:

          • VkSharingMode  is ignored by the driver, so SHARING_MODE_CONCURRENT  incurs no overhead relative to SHARING_MODE_EXCLUSIVE .

        • SHARING_MODE_EXCLUSIVE

          • Specifies that access to any range or image subresource of the object will be exclusive to a single queue family at a time.

        • SHARING_MODE_CONCURRENT

          • Specifies that concurrent access to any range or image subresource of the object from multiple queue families is supported.

Copy

Images

  • Images contain format information. Can be multidimensional and may have associated metadata.

  • An Image, unlike a Buffer, is almost always used within a View.

  • A texture you can write to and read from.

  • VkImage .

  • Stored as :

    • .

Create
  • VkImageCreateInfo .

    • ImageType

    • extent

      • Specifies the dimensions of the image, basically how many texels there are on each axis.

      • That’s why, for a 2D image, extent.depth  must be 1  instead of 0 .

    • format

    • tiling

    • initialLayout

      • Can only  be one of these 3:

        • UNDEFINED

          • Not usable by the GPU and the very first transition will discard the texels.

        • PREINITIALIZED

          • Not usable by the GPU, but the first transition will preserve the texels.

        • ZERO_INITIALIZED_EXT

      • There are a few situations where it is necessary for the texels to be preserved during the first transition.

        • One example would be if you wanted to use an image as a staging image in combination with the TILING_LINEAR  layout. In that case, you’d want to upload the texel data to it and then transition the image to be a transfer source without losing the data.

      • However, we usually don't need this property and can use UNDEFINED , as we can transition the image to be a transfer destination and then copy texel data to it from a buffer object.

    • usage

    • samples

      • For multisampling.

      • Only relevant for images that will be used as attachments.

      • The default for non-multisampled images is one sample.

    • mipLevels

      • For mipmapping.

    • flags

      • Related to sparse images.

      • Sparse images are images where only certain regions are actually backed by memory.

      • If you were using a 3D texture for a voxel terrain, for example, then you could use this to avoid allocating memory to store large volumes of "air" values.

    • sharingMode

      • Specifies the sharing mode of the image when it will be accessed by multiple queue families.

    • queueFamilyIndexCount

      • Is the number of entries in the pQueueFamilyIndices  array.

    • pQueueFamilyIndices

      • Is a pointer to an array of queue families that will access this image. It is ignored if sharingMode  is not SHARING_MODE_CONCURRENT .
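
The members above fit together as follows; a minimal sketch for a 2D sampled texture filled from a staging buffer (full VK_-prefixed names are used here, and device , width , and height  are assumed to already exist — the values are illustrative):

```cpp
// Illustrative values; format and usage depend on your asset.
VkImageCreateInfo imageInfo{};
imageInfo.sType         = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO;
imageInfo.imageType     = VK_IMAGE_TYPE_2D;
imageInfo.extent        = {width, height, 1};        // depth must be 1 for 2D
imageInfo.mipLevels     = 1;
imageInfo.arrayLayers   = 1;
imageInfo.format        = VK_FORMAT_R8G8B8A8_SRGB;
imageInfo.tiling        = VK_IMAGE_TILING_OPTIMAL;
imageInfo.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED; // first transition discards texels
imageInfo.usage         = VK_IMAGE_USAGE_TRANSFER_DST_BIT | VK_IMAGE_USAGE_SAMPLED_BIT;
imageInfo.samples       = VK_SAMPLE_COUNT_1_BIT;     // not an attachment: one sample
imageInfo.sharingMode   = VK_SHARING_MODE_EXCLUSIVE; // single queue family

VkImage image;
if (vkCreateImage(device, &imageInfo, nullptr, &image) != VK_SUCCESS) {
    // handle error
}
```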

Types
  • Tells Vulkan what kind of coordinate system the texels in the image are going to be addressed with.

  • 1D images

    • Can be used to store an array of data or a gradient.

  • 2D images

    • Are mainly used for textures.

  • 3D images

    • Can be used to store voxel volumes, for example.

Usages
  • Storage Image :

    • Load and Store.

    • Similar to a Storage Buffer.

  • Sampled Image :

    • Only load operations (read only).

    • Similar to Uniform Buffers.

    • The coordinates are between 0.0 and 1.0.

    • If a coordinate doesn't exactly match a pixel, the result is an interpolation between the neighbouring pixels.

  • Input Attachment :

    • Only load operations (read only).

    • Within a renderpass.

    • Framebuffer-local.

      • Access to single coordinate only.

      • No access to other coordinates in that image.

Formats
  • Formats .

  • Compatible Formats .

  • Numeric Format .

  • R8G8B8_SRGB

    • Channels stored as 0–255.

    • After conversion, the values are in the 0-1 floating-point range.

    • Interpreted using the sRGB nonlinear transfer function (gamma correction).

    • When sampled, values are converted to linear color space in the shader automatically.

  • R8G8B8_UNORM

    • Each 8-bit channel is an unsigned  normalized integer.

    • Storage range: 0–255.

    • Interpreted as floating-point in the shader:

      • 0 → 0.0

      • 255 → 1.0

      • Linear mapping between.

  • R8G8B8_SNORM

    • Each 8-bit channel is a signed  normalized integer.

    • Storage range: –128 to +127.

    • Interpreted as floating-point in the shader:

      • –128 → –1.0

      • +127 → +1.0

      • Linear mapping between.

Tiling
  • Nvidia: Always use TILING_OPTIMAL .

    • TILING_LINEAR  is not optimal. Use a staging buffer and vkCmdCopyBufferToImage()  to update images on the device.

  • Unlike the layout of an image, the tiling mode cannot  be changed at a later time.

  • TILING_OPTIMAL

    • The layout is opaque/driver-chosen.

    • Described as an implementation-dependent (opaque) arrangement in which the driver/GPU may reorder/tile texels for efficient access; it is the intended layout for GPU use.

    • When to use :

      • Image is used as a framebuffer attachment, sampled texture, or otherwise heavily used by the GPU (most rendering targets).

      • You want the GPU/driver to choose a layout that maximizes memory locality and bandwidth for rendering.

      • You will perform GPU-side post-processing / tonemapping / sampling / blits before presentation.

  • TILING_LINEAR

    • The layout is row-major/predictable.

    • Lays out texels in row-major order (with row padding possible) and is the layout for which vkGetImageSubresourceLayout  returns meaningful offsets for host access; that is the mechanism used when an application needs direct CPU mapping/reading of image memory.

      • However, in practice applications usually do GPU render → copy to a host-visible staging buffer/image rather than render directly into a linear-host-visible image.

    • LINEAR tiling has functional and performance limitations (fewer supported formats/usages and worse GPU access patterns), which is why it is rarely used for main rendering. Typical use cases are CPU upload/download, debugging, or very small offscreen images; CPU readback is not just theoretically possible but the primary practical use. You must query format/usage support for linear tiling, because many formats and usages are unsupported in LINEAR.

    • When to use :

      • You explicitly need to map the image memory from the CPU (direct host read/write) and the driver reports support for the requested format/usage in linear tiling.

      • Use cases: readback for screenshots/debugging, direct CPU uploads for small resources, or special interop scenarios where a row-major layout is required.

  • GPU OPTIMAL to Host-Visible :

    • Strategy applied for 'creating a texture from file' .

      • If you want to be able to directly access texels in the memory of the image, then you must use TILING_LINEAR . We will be using a staging buffer instead of a staging image, so this won't be necessary. We will be using TILING_OPTIMAL  for efficient access from the shader.

    • TLDR : OPTIMAL  + explicit transfer to a host-visible staging resource when needed.

    • Create your render target as OPTIMAL  and allocate DEVICE_LOCAL  memory (fast GPU local). After rendering, copy  or blit  the image to a host-visible staging resource (either a buffer via vkCmdCopyImageToBuffer  or a LINEAR image) and map that staging resource for CPU access. This avoids depending on limited linear support and keeps the GPU path fast.
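
A sketch of that strategy as command recording ( cmd , image , stagingBuffer , width , and height  are assumed, and the image is assumed to have already been transitioned to TRANSFER_SRC_OPTIMAL  by a barrier):

```cpp
// Copy the rendered OPTIMAL image into a host-visible staging buffer.
VkBufferImageCopy region{};
region.bufferOffset      = 0;
region.bufferRowLength   = 0;   // 0 = tightly packed rows
region.bufferImageHeight = 0;
region.imageSubresource  = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1};
region.imageOffset       = {0, 0, 0};
region.imageExtent       = {width, height, 1};

vkCmdCopyImageToBuffer(cmd, image,
                       VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
                       stagingBuffer, 1, &region);
// After submitting and waiting, vkMapMemory on the staging allocation
// gives the CPU a row-major view of the pixels.
```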

Layouts
  • GENERAL

    • Supports all types of device access, unless specified otherwise.

    • If the unifiedImageLayouts  feature is enabled, the GENERAL  image layout may  be used in place of the other layouts where allowed with no loss of performance.

      • VkPhysicalDeviceUnifiedImageLayoutsFeaturesKHR .

        • Can be included in the pNext  chain of the VkPhysicalDeviceFeatures2  structure passed to vkGetPhysicalDeviceFeatures2 .

        • KHR_unified_image_layouts .

          • This extension significantly simplifies synchronization in Vulkan by removing the need for image layout transitions in most cases. In particular, it guarantees that using the GENERAL  layout everywhere possible is just as efficient as using the other layouts.

          • In the interest of simplifying synchronization in Vulkan, this extension removes image layouts altogether as much as possible. As such, this extension is fairly simple.

          • Proposal .

          • Article .

          • Interacts with :

            • VERSION_1_3

            • EXT_attachment_feedback_loop_layout

            • KHR_dynamic_rendering

          • Support :

        • unifiedImageLayouts  (boolean)

          • Specifies whether usage of GENERAL , where valid, incurs no loss in efficiency.

          • Additionally, it indicates whether it can  be used in place of ATTACHMENT_FEEDBACK_LOOP_OPTIMAL_EXT .

        • unifiedImageLayoutsVideo  (boolean)

          • Specifies whether GENERAL  can be used in place of any of the following image layouts with no loss in efficiency.

          • VIDEO_DECODE_DST

          • VIDEO_DECODE_SRC

          • VIDEO_DECODE_DPB

          • VIDEO_ENCODE_DST

          • VIDEO_ENCODE_SRC

          • VIDEO_ENCODE_DPB

          • VIDEO_ENCODE_QUANTIZATION_MAP

    • It can be a useful catch-all image layout, but there are situations where a dedicated image layout must be used instead. For example:

      • PRESENT_SRC .

      • SHARED_PRESENT .

      • VIDEO_DECODE_SRC , VIDEO_DECODE_DST , and VIDEO_DECODE_DPB  without the unifiedImageLayoutsVideo  feature.

      • VIDEO_ENCODE_SRC , VIDEO_ENCODE_DST , and VIDEO_ENCODE_DPB  without the unifiedImageLayoutsVideo  feature.

      • VIDEO_ENCODE_QUANTIZATION_MAP  without the unifiedImageLayoutsVideo  feature.

    • While GENERAL  suggests that all types of device access are possible, it does not mean that all patterns of memory accesses are safe in all situations.

      • Common Render Pass Data Races  outlines some situations where data races are unavoidable. For example, when a subresource is used as both an attachment and a sampled image (i.e., not an input attachment), enabling feedback loop  adds extra guarantees which GENERAL  alone does not.

  • Only in initialLayout :

    • UNDEFINED

      • Specifies that the layout is unknown.

      • This layout can  be used as the initialLayout  member of VkImageCreateInfo .  Image memory cannot  be transitioned into this layout.

      • This layout can  be used in place of the current image layout in a layout transition, but doing so will cause the contents of the image’s memory to be undefined.

    • PREINITIALIZED

      • Specifies that an image’s memory is in a defined layout and can  be populated by data, but that it has not yet been initialized by the driver.

      • This layout can  be used as the initialLayout  member of VkImageCreateInfo .  Image memory cannot  be transitioned into this layout.

      • This layout is intended to be used as the initial layout for an image whose contents are written by the host, and hence the data can  be written to memory immediately, without first executing a layout transition.

      • Currently, PREINITIALIZED  is only useful with linear  images because there is not a standard layout defined for TILING_OPTIMAL  images.

    • ZERO_INITIALIZED_EXT

      • Specifies that an image’s memory is in a defined layout and is zeroed, but that it has not yet been initialized by the driver.

      • This layout can  be used as the initialLayout  member of VkImageCreateInfo . Image memory cannot  be transitioned into this layout.

      • This layout is intended to be used as the initial layout for an image whose contents are already zeroed, either from being explicitly set to zero by an application or from being allocated with MEMORY_ALLOCATE_ZERO_INITIALIZE_EXT .

      • Only if zeroInitializeDeviceMemory  feature is enabled.

  • Transfer :

    • TRANSFER_SRC_OPTIMAL

      • It must  only be used as a source image of a transfer command (see the definition of PIPELINE_STAGE_TRANSFER ).

      • This layout is valid only  for image subresources of images created with the USAGE_TRANSFER_SRC  usage bit enabled.

    • TRANSFER_DST_OPTIMAL

      • It must  only be used as a destination image of a transfer command.

      • This layout is valid only for image subresources of images created with the USAGE_TRANSFER_DST  usage bit enabled.

  • Present :

    • PRESENT_SRC

      • It must  only be used for presenting a presentable image for display.

    • SHARED_PRESENT

      • Is valid only for shared presentable images, and must  be used for any usage the image supports.

  • Read :

    • READ_ONLY_OPTIMAL

      • Specifies a layout allowing read only access as an attachment, or in shaders as a sampled image, combined image/sampler, or input attachment.

    • DEPTH_READ_ONLY_OPTIMAL

      • Specifies a layout for the depth aspect of a depth/stencil format image allowing read-only access as a depth attachment or in shaders as a sampled image, combined image/sampler, or input attachment.

    • STENCIL_READ_ONLY_OPTIMAL

      • Specifies a layout for the stencil aspect of a depth/stencil format image allowing read-only access as a stencil attachment or in shaders as a sampled image, combined image/sampler, or input attachment.

    • DEPTH_STENCIL_READ_ONLY_OPTIMAL

      • Specifies a layout for both  the depth and stencil aspects of a depth/stencil format image allowing read only access as a depth/stencil attachment or in shaders as a sampled image, combined image/sampler, or input attachment.

      • It is equivalent to DEPTH_READ_ONLY_OPTIMAL  and STENCIL_READ_ONLY_OPTIMAL .

    • SHADER_READ_ONLY_OPTIMAL

      • Specifies a layout allowing read-only access in a shader as a sampled image, combined image/sampler, or input attachment.

      • This layout is valid only  for image subresources of images created with the USAGE_SAMPLED  or USAGE_INPUT_ATTACHMENT  usage bits enabled.

  • Attachments :

    • ATTACHMENT_OPTIMAL

      • Specifies a layout that must  only be used with attachment accesses in the graphics pipeline.

    • COLOR_ATTACHMENT_OPTIMAL

      • It must  only be used as a color or resolve attachment in a VkFramebuffer .

      • This layout is valid only for image subresources of images created with the COLOR_ATTACHMENT  usage bit enabled.

      • Nvidia: Use COLOR_ATTACHMENT_OPTIMAL  image layout for color attachments.

    • DEPTH_ATTACHMENT_OPTIMAL

      • Specifies a layout for the depth aspect of a depth/stencil format image allowing read and write access as a depth attachment.

    • STENCIL_ATTACHMENT_OPTIMAL

      • Specifies a layout for the stencil aspect of a depth/stencil format image allowing read and write access as a stencil attachment.

    • DEPTH_STENCIL_ATTACHMENT_OPTIMAL

      • Specifies a layout for both  the depth and stencil aspects of a depth/stencil format image allowing read and write access as a depth/stencil attachment.

      • Equivalent to DEPTH_ATTACHMENT_OPTIMAL  and STENCIL_ATTACHMENT_OPTIMAL .

    • ATTACHMENT_FEEDBACK_LOOP_OPTIMAL_EXT

      • It must  only be used as either a color attachment or depth/stencil attachment and/or read-only access in a shader as a sampled image, combined image/sampler, or input attachment.

      • This layout is valid only  for image subresources of images created with the USAGE_ATTACHMENT_FEEDBACK_LOOP  usage bit enabled and either the USAGE_COLOR_ATTACHMENT  or USAGE_DEPTH_STENCIL_ATTACHMENT  and either the USAGE_INPUT_ATTACHMENT  or USAGE_SAMPLED  usage bits enabled.

    • LAYOUT_RENDERING_LOCAL_READ

      • It must  only be used as either a storage image, or a color or depth/stencil attachment and an input attachment.

      • This layout is valid only  for image subresources of images created with either USAGE_STORAGE , or both USAGE_INPUT_ATTACHMENT  and either of USAGE_COLOR_ATTACHMENT  or USAGE_DEPTH_STENCIL_ATTACHMENT .

    • Attachment Fragment Shading Rate

    • Fragment Density Map :

      • FRAGMENT_DENSITY_MAP_OPTIMAL_EXT

        • It must  only be used as a fragment density map attachment in a VkRenderPass .

        • This layout is valid only  for image subresources of images created with the USAGE_FRAGMENT_DENSITY_MAP  usage bit enabled.

  • Read / Attachment :

    • DEPTH_READ_ONLY_STENCIL_ATTACHMENT_OPTIMAL

      • Specifies a layout for depth/stencil format images allowing read and write access to the stencil aspect as a stencil attachment, and read only access to the depth aspect as a depth attachment or in shaders as a sampled image, combined image/sampler, or input attachment.

      • Equivalent to DEPTH_READ_ONLY_OPTIMAL  and STENCIL_ATTACHMENT_OPTIMAL .

    • DEPTH_ATTACHMENT_STENCIL_READ_ONLY_OPTIMAL

      • Specifies a layout for depth/stencil format images allowing read and write access to the depth aspect as a depth attachment, and read only access to the stencil aspect as a stencil attachment or in shaders as a sampled image, combined image/sampler, or input attachment.

      • Equivalent to DEPTH_ATTACHMENT_OPTIMAL  and STENCIL_READ_ONLY_OPTIMAL .

  • Video :

  • TENSOR_ALIASING_ARM

Image Views
  • Image Views .

  • An image view references a specific part of an image to be used.

  • VkImageViewCreateInfo

    • viewType

      • Allows you to treat images as 1D textures, 2D textures, 3D textures and cube maps.

    • format

    • components

      • Allows you to swizzle the color channels around. For example, you can map all of the channels to the red channel for a monochrome texture. You can also map constant values of 0  and 1  to a channel. In our case we'll stick to the default mapping.

    • subresourceRange

      • Describes what the image's purpose is and which part of the image should be accessed. Our images will be used as color targets without any mipmapping levels or multiple layers.

      • If you were working on a stereographic 3D application, then you would create a Swapchain with multiple layers. You could then create multiple image views for each image representing the views for the left and right eyes by accessing different layers.
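
As a sketch, creating a plain 2D color view over a swapchain image could look like this ( device , swapchainImage , and swapchainFormat  are assumed to already exist):

```cpp
VkImageViewCreateInfo viewInfo{};
viewInfo.sType    = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;
viewInfo.image    = swapchainImage;
viewInfo.viewType = VK_IMAGE_VIEW_TYPE_2D;
viewInfo.format   = swapchainFormat;
// Identity swizzle: keep every channel as-is (the default mapping).
viewInfo.components = {VK_COMPONENT_SWIZZLE_IDENTITY, VK_COMPONENT_SWIZZLE_IDENTITY,
                       VK_COMPONENT_SWIZZLE_IDENTITY, VK_COMPONENT_SWIZZLE_IDENTITY};
// Color aspect, no mipmapping, a single layer.
viewInfo.subresourceRange = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1};

VkImageView view;
vkCreateImageView(device, &viewInfo, nullptr, &view);
```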

Copy: Blit (Copy image to image)
  • Transfer a rectangular region of pixel data from one image to another.

  • Unlike a raw copy ( vkCmdCopyImage ), a blit can perform scaling and apply filtering ( FILTER_LINEAR  or FILTER_NEAREST ), which is consistent with the historical meaning of bit block transfer with optional transformations.

  • Name :

    • Comes from bit block transfer  (sometimes shortened to blt ).

    • It was introduced in the 1970s in the context of 2D graphics systems, particularly at Xerox PARC.

    • The idea was to copy rectangular blocks of bits (pixels)  from one place in memory to another, often with operations like scaling, masking, or raster operations.

  • vkCmdBlitImage2 .

    • commandBuffer

    • pBlitImageInfo

      • VkBlitImageInfo2 .

      • srcImage

        • Is the source image.

      • srcImageLayout

        • Is the layout of the source image subresources for the blit.

      • dstImage

        • Is the destination image.

      • dstImageLayout

        • Is the layout of the destination image subresources for the blit.

      • regionCount

        • Is the number of regions to blit.

      • pRegions

        • VkImageBlit2 .

        • Defines source and destination subresources, offsets, and extents.

        • Can define multiple regions in a single blit call.

        • For each element of the pRegions  array, a blit operation is performed for the specified source and destination regions.

        • Offset :

          • The offset entries specify two corners of the rectangular/box region to blit (one corner and the opposite corner).

          • You normally set offsets[0]  to the region origin (frequently {0,0,0} ) and offsets[1]  to the region end ( {width, height, depth} ), i.e. the bounds.

          • Setting offsets[1]  to the full extent produces the common {0,0,0} -> {w,h,1}  box for a 2D image.

          • The Vulkan spec requires both offsets be provided and documents constraints on them (e.g. for 2D images z  must be 0/1).

        • srcSubresource

          • Is the subresource to blit from.

        • srcOffsets

          • Is a pointer to an array of two VkOffset3D  structures specifying the bounds of the source region within srcSubresource .

        • dstSubresource

          • Is the subresource to blit into.

        • dstOffsets

          • Is a pointer to an array of two VkOffset3D  structures specifying the bounds of the destination region within dstSubresource .

      • filter

        • Is a VkFilter  specifying the filter to apply if the blit requires scaling.

        • Determines how pixels are sampled if scaling occurs.

        • FILTER_NEAREST  for nearest-neighbor scaling.

        • FILTER_LINEAR  for linear interpolation.

      • The source and destination image layouts must be valid for transfer operations (typically TRANSFER_SRC_OPTIMAL  and TRANSFER_DST_OPTIMAL ; GENERAL  is also permitted).

  • Restrictions

    • Blit operations are supported only if the physical device reports FORMAT_FEATURE_BLIT_SRC  for the source format and FORMAT_FEATURE_BLIT_DST  for the destination format.

    • Some formats (like depth/stencil) do not support blitting.

    • Multisampled images cannot be used directly as source or destination.
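
A hedged sketch of a full-image, half-size downscaling blit with vkCmdBlitImage2  ( cmd , srcImage , dstImage , srcW , and srcH  are assumed, and both images are assumed to already be in the transfer layouts):

```cpp
VkImageBlit2 region{};
region.sType = VK_STRUCTURE_TYPE_IMAGE_BLIT_2;
region.srcSubresource = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1};
region.srcOffsets[0]  = {0, 0, 0};
region.srcOffsets[1]  = {srcW, srcH, 1};          // source bounds; z must be 1 for 2D
region.dstSubresource = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1};
region.dstOffsets[0]  = {0, 0, 0};
region.dstOffsets[1]  = {srcW / 2, srcH / 2, 1};  // half size: the blit scales

VkBlitImageInfo2 blitInfo{};
blitInfo.sType          = VK_STRUCTURE_TYPE_BLIT_IMAGE_INFO_2;
blitInfo.srcImage       = srcImage;
blitInfo.srcImageLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;
blitInfo.dstImage       = dstImage;
blitInfo.dstImageLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
blitInfo.regionCount    = 1;
blitInfo.pRegions       = &region;
blitInfo.filter         = VK_FILTER_LINEAR;       // interpolate while scaling

vkCmdBlitImage2(cmd, &blitInfo);
```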

Compression

Depth

Depth Tests

Shader
  • gl_FragDepth

    • Available only in the fragment shader.

    • Is an output  variable that is used to establish the depth value for the current fragment.

    • It is a float .

    • If depth buffering is enabled and no shader writes to gl_FragDepth , then the fixed-function value for depth will be used (this value is contained in the z component of gl_FragCoord ); otherwise, the value written to gl_FragDepth  is used.

    • If a shader statically assigns to gl_FragDepth , then the value of the fragment's depth may be undefined for executions of the shader that don't take that path. That is, if the set of linked fragment shaders statically contains a write to gl_FragDepth , then it is responsible for always writing it.

    • Available in all versions of GLSL.

  • gl_FragCoord

    • Available only in the fragment shader.

    • Is an input  variable that contains the window relative coordinate (x, y, z, 1/w) values for the fragment.

    • This value is the result of fixed functionality that interpolates primitives after vertex processing to generate fragments.

    • Multi-sampling :

      • If multi-sampling, this value can be for any location within the pixel, or one of the fragment samples.

    • Depth :

      • The z  component is the depth value that would be used for the fragment's depth if no shader contained any writes to gl_FragDepth .

      • gl_FragCoord.z  is the depth value of the fragment that your shader is operating on, not  the current value of the depth buffer at the fragment position.

    • Changing the origin, by redeclaring it :

      • gl_FragCoord  may be redeclared with the additional layout qualifier identifiers origin_upper_left  or pixel_center_integer . By default, gl_FragCoord  assumes a lower-left origin for window coordinates and assumes pixel centers are located at half-pixel centers.

      • Example :

        • The (x, y)  location (0.5, 0.5)  is returned for the lower-left-most pixel in a window. The origin of gl_FragCoord  may be changed by redeclaring gl_FragCoord  with the origin_upper_left  identifier. The values returned can also be shifted by half a pixel in both x and y by pixel_center_integer  so it appears the pixels are centered at whole number pixel offsets. This moves the (x, y) value returned by gl_FragCoord  of (0.5, 0.5)  by default to (0.0, 0.0)  with pixel_center_integer .

      • If gl_FragCoord  is redeclared in any fragment shader in a program, it must be redeclared in all fragment shaders in that program that have static use of gl_FragCoord .

      • Redeclaring gl_FragCoord  with any accepted qualifier affects only gl_FragCoord.x  and gl_FragCoord.y .

      • It has no effect on rasterization, transformation or any other part of the OpenGL pipeline or language features.

    • Available in all versions of GLSL.

  • Depth Execution Modes :

    • (2025-10-07) Vulkan supports this.

      • Conservative depth can be enabled in Vulkan the same way as in OpenGL (i.e. with layout(depth_<condition>) out float gl_FragDepth ).

      • You can test it and look at the SPIR-V output.

    • Allows for a possible optimization for implementations that relies on an early depth test to be run before the fragment.

    // assume it may be modified in any way
    layout(depth_any) out float gl_FragDepth;
    
    // assume it may be modified such that its value will only increase
    layout(depth_greater) out float gl_FragDepth;
    
    // assume it may be modified such that its value will only decrease
    layout(depth_less) out float gl_FragDepth;
    
    // assume it will not be modified
    layout(depth_unchanged) out float gl_FragDepth;
    
    • GL_ARB_conservative_depth .

    • Violating the condition yields undefined behavior.

    • The layout qualifier for gl_FragDepth  specifies constraints on the final value of gl_FragDepth  written by any shader invocation.  GL implementations may perform optimizations assuming that the depth test fails (or passes)  for a given fragment if all values of gl_FragDepth  consistent with the layout qualifier would fail (or pass).  If the final value of gl_FragDepth  is inconsistent with its layout qualifier, the result of the depth test for the corresponding fragment is undefined.  However, no error will be generated in this case.  When the depth test passes and depth writes are enabled, the value written to the depth buffer is always the value of gl_FragDepth , whether or not it is consistent with the layout qualifier.

    • <depth_any>

      • The shader compiler will note any assignment to gl_FragDepth  modifying it in an unknown way, and depth testing will always be performed after the shader has executed.

      • By default, gl_FragDepth  assumes the <depth_any>  layout qualifier.

    • <depth_greater>

      • The GL will assume that the final value of gl_FragDepth  is greater than or equal to the fragment's interpolated depth value, as given by the <z>  component of gl_FragCoord .

    • <depth_less>

      • The GL will assume that any modification of gl_FragDepth  will only decrease its value.

    • <depth_unchanged>

      • The shader compiler will honor any modification to gl_FragDepth , but the rest of the GL assume that gl_FragDepth  is not assigned a new value.

    • If gl_FragDepth  is redeclared in any fragment shader in a program, it must be redeclared in all fragment shaders in that program that have static assignments to gl_FragDepth . All redeclarations of gl_FragDepth  in all fragment shaders in a single program must have the same set of qualifiers. Within any shader, the first redeclarations of gl_FragDepth  must appear before any use of gl_FragDepth . The built-in gl_FragDepth  is only predeclared in fragment shaders, so redeclaring it in any other shader stage will be illegal.

Depth Test
  • If the test fails, the fragment is discarded.

  • If the test passes, the depth attachment will be updated with the fragment’s output depth.

Depth Bias
  • Requires the VkPhysicalDeviceFeatures::depthBiasClamp  feature to be supported; otherwise VkPipelineRasterizationStateCreateInfo::depthBiasClamp  must be 0.0f .

  • The depth bias values can be set dynamically  using DYNAMIC_STATE_DEPTH_BIAS  or the DYNAMIC_STATE_DEPTH_BIAS_ENABLE_EXT  from EXT_extended_dynamic_state2 .

  • The rasterizer can alter the depth values by adding a constant value or biasing them based on a fragment’s slope.

  • Controls whether to bias fragment depth values.

  • This is sometimes used for shadow mapping.

  • Bias Constant Factor :

    • Is a scalar factor controlling the constant depth value added to each fragment.

    • Scales the minimum resolvable depth difference r  of the depth attachment's format.

    • " depthBiasConstantFactor  is a scalar factor controlling the constant depth value added to each fragment. The value is in floating point and a typical value seems to be around 2.0-3.0."

  • Bias Slope Factor :

    • Is a scalar factor applied to a fragment’s slope in depth bias calculations.

    • Scales the maximum depth slope m  of the polygon.

    • "I stumbled upon some Vulkan samples that used a much smaller constant bias, but the slope bias was quite high. However, because the slope bias has a much larger weight than the constant one it pretty much worked the same."

  • Bias Clamp :

    • Is the maximum (or minimum) depth bias of a fragment.

    • The scaled depthBiasConstantFactor  and depthBiasSlopeFactor  terms are summed to produce a bias, which is then clamped to the minimum or maximum value specified by depthBiasClamp .

Depth Bounds
  • If the value is not within the depth bounds, the coverage mask is set to zero.

  • Requires the VkPhysicalDeviceFeatures::depthBounds  feature to be supported.

  • The depth bound values can be set dynamically  using DYNAMIC_STATE_DEPTH_BOUNDS  or the DYNAMIC_STATE_DEPTH_BOUNDS_TEST_ENABLE_EXT  from EXT_extended_dynamic_state .

Depth Clamp
  • Controls whether to clamp the fragment’s depth values as described in Depth Test.

  • Before the sample’s Zf  is compared to Za , Zf  is clamped to [min(n,f), max(n,f)] , where n  and f  are the minDepth  and maxDepth  depth range values of the viewport used by this fragment, respectively.

  • If set to TRUE , then fragments that are beyond the near and far planes are clamped to them as opposed to discarding them.

  • This is useful in some special cases like shadow maps .

  • Requires the VkPhysicalDeviceFeatures::depthClamp  feature to be supported.

Depth Attachment

Clearing
  • It is always better to clear a depth buffer at the start of the pass with loadOp  set to ATTACHMENT_LOAD_OP_CLEAR .

  • Depth images can also be cleared outside a render pass using vkCmdClearDepthStencilImage .

  • When clearing, notice that VkClearValue  is a union and VkClearDepthStencilValue depthStencil  should be set instead of the color clear value.
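
As a sketch, selecting the right union member when clearing a depth attachment ( 1.0f  assumes a conventional, non-reversed depth range):

```cpp
VkClearValue clearDepth{};
clearDepth.depthStencil = {1.0f, 0};  // depth cleared to the far plane, stencil to 0
// Not clearDepth.color: VkClearValue is a union, and a depth attachment
// reads the depthStencil member.
```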

Multi-sampling
  • The following post-rasterization operations occur "per-sample". This means that when doing multisampling with a color attachment, any depth buffer VkImage  used alongside it must also have been created with the same VkSampleCountFlagBits  value.

  • A coverage mask  is generated for each fragment, based on which samples within that fragment are determined to be within the area of the primitive that generated the fragment.

  • If a fragment operation results in all bits of the coverage mask being 0 , the fragment is discarded.

  • Resolving :

    • It is possible in Vulkan using the KHR_depth_stencil_resolve  extension (promoted to Vulkan core in 1.2) to resolve multisampled depth/stencil attachments in a subpass in a similar manner as for color attachments.

Depth Image

Formats
  • Nvidia: Prefer using D24_UNORM_S8_UINT  or D32_SFLOAT  depth formats, D32_SFLOAT_S8_UINT  is not optimal.

  • There are a few different depth formats an implementation may expose support for in Vulkan.

  • For reading  from a depth image only D16_UNORM  and D32_SFLOAT  are required to support being read via sampling or blit operations.

  • For writing  to a depth image FORMAT_D16_UNORM  is required to be supported. From here at least one of ( FORMAT_X8_D24_UNORM_PACK32   or   FORMAT_D32_SFLOAT ) and  ( FORMAT_D24_UNORM_S8_UINT   or   FORMAT_D32_SFLOAT_S8_UINT ) must also be supported. This will involve some extra logic when trying to find which format to use if both  the depth and stencil are needed in the same format.

Aspect Masks
  • Required when performing operations such as image barriers or clearing.

  • DEPTH

Sharing Mode
  • Nvidia: VkSharingMode  is ignored by the driver, so SHARING_MODE_CONCURRENT  incurs no overhead relative to SHARING_MODE_EXCLUSIVE .

Layout Transition
// Example of going from undefined layout to a depth attachment to be read and written to

// Core Vulkan example
srcAccessMask = 0;
dstAccessMask = ACCESS_DEPTH_STENCIL_ATTACHMENT_READ | ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE;
sourceStage = PIPELINE_STAGE_TOP_OF_PIPE;
destinationStage = PIPELINE_STAGE_EARLY_FRAGMENT_TESTS | PIPELINE_STAGE_LATE_FRAGMENT_TESTS;

// KHR_synchronization2
srcAccessMask = ACCESS_2_NONE_KHR;
dstAccessMask = ACCESS_2_DEPTH_STENCIL_ATTACHMENT_READ_KHR | ACCESS_2_DEPTH_STENCIL_ATTACHMENT_WRITE_KHR;
sourceStage = PIPELINE_STAGE_2_NONE_KHR;
destinationStage = PIPELINE_STAGE_2_EARLY_FRAGMENT_TESTS_KHR | PIPELINE_STAGE_2_LATE_FRAGMENT_TESTS_KHR;
  • If unsure whether to use only early or only late fragment tests for your application, use both.

Copying
  • Nvidia: Copy both depth and stencil to avoid a slow path for copying.

Reverse Depth Buffer

Normal Reconstruction from Depth

  • You can infer the normals by calculating the derivatives on x and y between pixels of the depth buffer.

  • Discussion .

  • Implementation - Wicked Engine (János Turánszki (turanszkij)) .

  • Implementation - Yuwen Wu (atyuwen) .

  • Need :

    • "In screen-space decals rendering, normal buffer is required to reject pixels projected onto near-perpendicular surfaces. But back then I was working on a forward pipeline, so no normal buffer was outputted. It seemed the best choice was to reconstruct it directly from depth buffer, as long as we could avoid introducing errors, which was not easy though."

    • So, for a forward shading, this could  be necessary.

    • It could be avoided if saving the normals in a texture to be sent to a post-processing pass; aka, if introduced a bit of deferred in the forward renderer.

  • Performance :

    • There's a lot of discussion if this is worthwhile. On a deferred renderer, this could be good, but the gain in performance is not obvious. It really depends on how it was implemented.

Stencil

  • .

  • 1 or 0, depending on whether a fragment from our object covers the pixel.

Used in

Stencil Attachment

  • The PipelineRenderingCreateInfo  asks for a stencilAttachmentFormat , and RenderingInfo  asks for pStencilAttachment .

  • This is for cases where you want separate depth and stencil images, instead of merging them as in a depth image with D24_UNORM_S8_UINT , where the S8_UINT  part holds the stencil.

  • KHR_separate_depth_stencil_layouts .

    • Core in Vulkan 1.2.

    • This extension allows image memory barriers for 'depth+stencil' images to have just one of the IMAGE_ASPECT_DEPTH  or IMAGE_ASPECT_STENCIL  aspect bits set, rather than require both. This allows their layouts to be set independently. Image Layouts IMAGE_LAYOUT_DEPTH_ATTACHMENT_OPTIMAL , IMAGE_LAYOUT_DEPTH_READ_ONLY_OPTIMAL , IMAGE_LAYOUT_STENCIL_ATTACHMENT_OPTIMAL , or IMAGE_LAYOUT_STENCIL_READ_ONLY_OPTIMAL  can be used.

    • To support depth+stencil images with different layouts for the depth and stencil aspects, the depth+stencil attachment interface has been updated to support a separate layout for stencil.

    • VkPhysicalDeviceSeparateDepthStencilLayoutsFeatures .

      • Structure describing whether the implementation can do depth and stencil image barriers separately.

      • It's just a struct with a bool telling if the feature is supported.

    • For render passes / subpasses:

Formats
  • S8_UINT

    • It makes sense, as it's the same format used for stencil in the depth format D24_UNORM_S8_UINT .

Mapping Data to Shaders

Shader Alignment

Minimum Dynamic-Offset / CBV Allocation Granularity
  • GPUs and drivers require that, when you bind or use a portion of a large buffer as a uniform/constant buffer, the start address and/or size line up to an alignment.

  • That alignment is the “minimum dynamic-offset” (Vulkan) or the CBV/constant buffer granularity (D3D12).

  • It lets the driver map many small logical buffers into a single big GPU buffer efficiently.

  • If you bind at an unaligned offset, the API/driver will reject it, or you will get wrong data or degraded performance.

  • Drivers can report 64, 128, 256, or other powers of two.

  • UBO alignment is usually larger than SSBO alignment because UBO usage and caches are handled differently by the hardware.

  • Value :

    • Many APIs and drivers use 256 bytes as the Minimum Dynamic-Offset on common desktop GPUs.

      • VkGuide:

      struct MaterialConstants {  // written into uniform buffers later
          glm::vec4 colorFactors; // multiply the color texture
          glm::vec4 metal_rough_factors;
          glm::vec4 extra[14];
              /*
              Padding: uniform buffers need it anyway, since the structure
              must meet a minimum alignment requirement. 256 bytes is a good
              default that all the GPUs we target meet, so we add these
              vec4s to pad the structure to 256 bytes.
              */
      };
      
    • But not every platform or GPU guarantees 256. Mobile or integrated GPUs may have different values.

    • VkPhysicalDeviceLimits .

      • minUniformBufferOffsetAlignment

        • Is the minimum required  alignment, in bytes, for the offset  member of the VkDescriptorBufferInfo  structure for uniform buffers.

        • When a descriptor of type DESCRIPTOR_TYPE_UNIFORM_BUFFER  or DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC  is updated, the offset   must  be an integer multiple of this limit.

        • Similarly, dynamic offsets for uniform buffers must  be multiples of this limit.

        • The value must  be a power of two.

      • minStorageBufferOffsetAlignment

        • Is the minimum required  alignment, in bytes, for the offset  member of the VkDescriptorBufferInfo  structure for storage buffers.

        • When a descriptor of type DESCRIPTOR_TYPE_STORAGE_BUFFER  or DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC  is updated, the offset   must  be an integer multiple of this limit.

        • Similarly, dynamic offsets for storage buffers must  be multiples of this limit.

        • The value must  be a power of two.

      • minTexelBufferOffsetAlignment

  • Best practice :

    • Query the GPU at runtime and align your buffer ranges to the reported value.

    • Assert size at compile time:

    static_assert(sizeof(MaterialConstants) == 256, "MaterialConstants must be 256 bytes");
    
Default Layouts
Alignment Options
  • Offset and Stride Assignment .

  • There are different alignment requirements depending on the specific resources and on the features enabled.

  • Platform dependency :

    • 32-bit IEEE-754

      • The scalar value is 4 bytes.

      • The standard for desktop, mobile, OpenGL ES and Vulkan.

    • 16-bit half precision :

      • The scalar value is 2 bytes.

      • In rare cases, like embedded or custom OpenGL drivers.

    • 64-bit IEEE-754 double :

      • The scalar value is 8 bytes.

      • Non-standard case.

      • Would require headers redefining GLfloat  as double , not compliant with spec.

  • C layout ≈ std430  only if you manually match packing and alignment. Otherwise, it’s platform-dependent.

| GLSL type                        | C equivalent                                        | Typical C (x86_64) - Alignment |            Typical C (x86_64) - Size | Typical C (x86_64) - Stride |                                                                     std140 - Base Alignment |                std140 - Occupied Size |                          std140 - Stride | std430 - Base Alignment |                                std430 - Occupied Size |                             std430 - Stride |
| -------------------------------- | --------------------------------------------------- | -----------------------------: | -----------------------------------: | --------------------------: | -----------------------------------------------------------------------------------------: | ------------------------------------: | ---------------------------------------: | ----------------------: | ----------------------------------------------------: | ------------------------------------------: |
| bool                            | C _Bool  (native) — or use int32_t  to match GLSL |       _Bool : 1; int32_t : 4 |             _Bool : 1; int32_t : 4 |     _Bool : 1; int32_t : 4 |                                                                                          4 |                                     4 | 16 (std140 rounds scalar arrays to vec4) |                       4 |                                                     4 |                                           4 |
| int  / uint                    | int32_t  / uint32_t                               |                              4 |                                    4 |                           4 |                                                                                          4 |                                     4 |                                       16 |                       4 |                                                     4 |                                           4 |
| float                           | float                                              |                              4 |                                    4 |                           4 |                                                                                          4 |                                     4 |                                       16 |                       4 |                                                     4 |                                           4 |
| double                          | double                                             |                              8 |                                    8 |                           8 |                                                                                          8 |                                     8 |          32 (rounded to dvec4 alignment) |                       8 |                                                     8 |                                           8 |
| vec2  / ivec2                  | float[2]  / int32_t[2]                            |                              4 |                                    8 |                           8 |                                                                                          8 |                                     8 |                                       16 |                       8 |                                                     8 |                                           8 |
| vec3  / ivec3                  | float[3]  / int32_t[3]                            |                              4 |                                   12 |                          12 |                                                                                         16 |                                    16 |                                       16 |                      16 |                                                    16 |                                          16 |
| vec4  / ivec4                  | float[4]  / int32_t[4]                            |                              4 |                                   16 |                          16 |                                                                                         16 |                                    16 |                                       16 |                      16 |                                                    16 |                                          16 |
| dvec2                           | double[2]                                          |                              8 |                                   16 |                          16 |                                                                                         16 |                                    16 |                                       32 |                      16 |                                                    16 |                                          16 |
| dvec3                           | double[3]                                          |                              8 |                                   24 |                          24 |                                                                                         32 |                                    32 |                                       32 |                      32 |                                                    32 |                                          32 |
| dvec4                           | double[4]                                          |                              8 |                                   32 |                          32 |                                                                                         32 |                                    32 |                                       32 |                      32 |                                                    32 |                                          32 |
| mat2  (2×2 float, column-major) | float[2][2]  (2 columns of vec2 )                 |                              4 |                                   16 |             8 (column size) |                                                                                         16 |                           16 × 2 = 32 |      each column has vec4 as stride (16) |                       8 |                                            8 × 2 = 16 |          each column has vec2 as stride (8) |
| mat3  (3×3 float, column-major) | float[3][3]  (3 columns of vec3 )                 |                              4 |                                   36 |            12 (column size) |                                                                                         16 |                           16 × 3 = 48 |      each column has vec4 as stride (16) |                      16 |                                           16 × 3 = 48 |         each column has vec3 as stride (16) |
| mat4  (4×4 float)               | float[4][4]                                        |                              4 |                                   64 |            16 (column size) |                                                                                         16 |                           16 × 4 = 64 |      each column has vec4 as stride (16) |                      16 |                                           16 × 4 = 64 |         each column has vec4 as stride (16) |
| T[]  (Array of T)               | T[]                                                |                     alignof(T) |                            sizeof(T) |                   sizeof(T) | base_align(T), rounded up to vec4 base align (16 for 32-bit scalars; 32 for 64-bit/double) | occupied per element = rounded stride |          base_align(T), rounded up to 16 |           base_align(T) | occupied per element = sizeof(T) rounded to alignment |                               base_align(T) |
| vec3[]  (Array of vec3)         | float[3][]                                         |                              4 |                                   12 |                          12 |                                                                                         16 |                                    16 |                                       16 |                      16 |                                                    16 |                                          16 |
| struct                          | struct { ... }                                     |          max(member alignment) | struct size padded to that alignment |     sizeof(struct) (padded) |                                                  max(member align) rounded up to vec4 (16) |  struct size padded to multiple of 16 |          sizeof(struct) rounded up to 16 |       max(member align) |                  struct size padded to that alignment | sizeof(struct) (padded to member alignment) |

Scalar Alignment
  • Looks like std430 , but its vectors are even more compact?

  • Also known as (?) The spec doesn't say.

  • EXT_scalar_block_layout .

    • Core in Vulkan 1.2.

    • This extension allows most storage types to be aligned in scalar  alignment.

    • Make sure to set --scalar-block-layout  when running the SPIR-V Validator.

    • A big difference is being able to straddle the 16-byte boundary.

    • In GLSL this can be used with the scalar  keyword after enabling the GL_EXT_scalar_block_layout  extension.

Extended Alignment (std140)
  • Source .

  • Conservative, padded layout used for uniform blocks.

  • Widely supported.

  • Caveats :

    • "Avoiding usage of vec3"

      • Usually applies to std140, because some hardware vendors seem to not follow the spec strictly; everything should work when using std430, though.

      • Array of vec3  (ARRAY) :

        • Alignment will be 4× the size of a float  (16 bytes).

        • Size will be alignment × number of elements .

// Scalars
    float ->  4 bytes // for 32-bit IEEE-754
    int   ->  4 bytes // for 32-bit IEEE-754
    uint  ->  4 bytes // for 32-bit IEEE-754
    bool  ->  4 bytes // for 32-bit IEEE-754
    
// Vectors
    // Base alignments
    vec2  ->  8 bytes  // 2 times the underlying scalar type.
    vec3  -> 16 bytes  // 4 times the underlying scalar type.
    vec4  -> 16 bytes  // 4 times the underlying scalar type.
    
// Arrays
    // Size of the element type, rounded up to a multiple of the size of `vec4` (behave like `vec4` slots).
    // Arrays of types are not necessarily tightly packed.
    // An array of floats in such a block will not be equivalent to an array of floats in C/C++. Arrays only match their C/C++ definitions if the element type's size is a multiple of 16 bytes.
    // Ex: `float arr[N]` uses 16 bytes per element.

// Matrices
    // Treated as arrays of vectors. 
    // They are column-major by default; you can change it with `layout(row_major)` or `layout(column_major)`.

// Struct
    // Alignment is that of the biggest struct member, rounded up to a multiple of the size of `vec4` (behaves like `vec4` slots).
    // Struct members are effectively padded so that each member starts on a 16-byte boundary when necessary.
    // The struct size is the space needed by its members, padded to a multiple of the struct's alignment.
  • Examples :

    layout(std140) uniform U { float a[3]; }; // size = 3 * 16 = 48 bytes
    
Base Alignment (std430)
  • Allowed usage :

    • SSBOs, Push Constants.

    • KHR_uniform_buffer_standard_layout .

      • Core in Vulkan 1.2.

      • Allows the use of std430  memory layout in UBOs.

      • These memory layout changes are only applied to Uniforms .

    • KHR_relaxed_block_layout .

      • Core in Vulkan 1.1; all Vulkan 1.1+ devices support relaxed block layout.

      • This extension allows implementations to indicate they can support more variation in block Offset  decorations.

      • This comes up when using std430  memory layout where a vec3  (which is 12 bytes) is still defined as a 16 byte alignment.

      • With relaxed block layout an application can fit a float  on either side of the vec3  and maintain the 16 byte alignment between them.

      • Currently there is no way in GLSL to legally express relaxed block layout, but a developer can use the --hlsl-offsets  with glslang  to produce the desired offsets.

  • Relaxed layout used for shader-storage blocks and allows much tighter packing.

  • Requires newer GLSL 4.3+ or equivalent support.

// Scalars
    float ->  4 bytes // for 32-bit IEEE-754
    int   ->  4 bytes // for 32-bit IEEE-754
    uint  ->  4 bytes // for 32-bit IEEE-754
    bool  ->  4 bytes // for 32-bit IEEE-754
    
// Vectors
    // Base alignments
    vec2  ->  8 bytes  // 2 times the underlying scalar type.
    vec3  -> 16 bytes  // 4 times the underlying scalar type.
    vec4  -> 16 bytes  // 4 times the underlying scalar type.
    
// Arrays
    // array stride is the natural alignment of the element. 
    // Ex: `float arr[N]` uses 4 bytes per element.

// Matrices
    // Treated as arrays of vectors. 
    // They are column-major by default; you can change it with `layout(row_major)` or `layout(column_major)`.
    
// Struct
    // Struct packing is tighter and follows natural alignments (less padding).
  • Examples :

    layout(std430, binding=0) buffer B { float a[3]; }; // size = 3 * 4 = 12 bytes
    

Comparisons

  • Input Attributes vs Uniform Buffers :

    • "If you want a shader to access a resource that is not  on vertex buffers, they must go through a descriptor set".

    • With Vertex Buffers we can pass arbitrary attributes to the vertex shader for each vertex.

    • Including global variables as vertex data would be a waste of memory, and it would require us to update the vertex buffer whenever the transformation changes.

    • The right way to tackle this in Vulkan is to use Resource Descriptors

  • Uniform Buffers vs Push Constants :

    • ChatGPT...

    • "Lets say I want to send the time, would it be optimal to send it as a ubo or a push constant? The time should not change between draw calls (it's irrelevant) in the same frame. Seems like a UBO I only need to set it once and every object would benefit from it, but a push constant needs to be send for every object?"

      • Exactly. For your “time” value, a Uniform Buffer is generally the more optimal choice in this scenario.

      • Push Constants

        • You must call vkCmdPushConstants  for each command buffer section where shaders need it.

        • Since push constants are set per draw/dispatch scope, if you have many objects, you’d be redundantly re-sending the same value (time) multiple times in the same frame.

        • There’s no automatic “shared” state — every pipeline that uses it must get the value pushed explicitly.

      • Uniform Buffers

        • You can store the time in a uniform buffer once per frame, bind it once in a descriptor set, and then every draw call will see the same value without re-uploading.

        • Works well for “global” frame data (view/proj matrices, time, frame index, etc.).

        • Binding a pre-allocated UBO in a descriptor set has low overhead and avoids per-draw constant pushing.

      • Performance implication:

        • If the data is the same for all draws in a frame, a UBO avoids redundant driver calls and state changes, and makes it easier to keep the command buffer lean. Push constants are better suited for per-object or per-draw small data.

  • Storage Image vs. Storage Buffer :

    • While both storage images and storage buffers allow for read-write access in shaders, they have different use cases:

    • Storage Images :

      • Ideal for 2D or 3D data that benefits from texture operations like filtering or addressing modes.

    • Storage Buffers :

      • Better for arbitrary structured data or when you need to access data in a non-uniform pattern.

  • Texel Buffer vs. Storage Buffer :

    • Texel buffers and storage buffers also have different strengths:

    • Texel Buffers :

      • Provide texture-like access to buffer data, allowing for operations like filtering.

    • Storage Buffers :

      • More flexible for general-purpose data storage and manipulation.

  • Do

    • Do keep constant data small, where 128 bytes is a good rule of thumb.

    • Do use push constants if you do not want to set up a descriptor set/UBO system.

    • Do make constant data directly available in the shader if it is pre-determinable, such as with the use of specialization constants.

  • Avoid

    • Avoid indexing in the shader if possible, such as dynamically indexing into buffer  or uniform  arrays, as this can disable shader optimisations on some platforms.

  • Impact

    • Failing to use the correct method for constant data will negatively impact performance, causing reduced FPS and/or increased bandwidth and load/store activity.

    • On Mali, register mapped uniforms are effectively free. Any spilling to buffers in memory will increase load/store cache accesses to the per thread uniform fetches.

Input Attributes

About
  • The only shader stage in core Vulkan that has an input attribute controlled by Vulkan is the vertex shader stage ( SHADER_STAGE_VERTEX ).

    #version 450
    layout(location = 0) in vec3 inPosition;
    
    void main() {
        gl_Position = vec4(inPosition, 1.0);
    }
    
  • Other shader stages, such as a fragment shader stage, have input attributes, but the values are determined from the output of the previous stages run before it.

  • This involves declaring the interface slots when creating the VkPipeline  and then binding the VkBuffer  before draw time with the data to map.

  • Before calling vkCreateGraphicsPipelines  a VkPipelineVertexInputStateCreateInfo  struct will need to be filled out with a list of VkVertexInputAttributeDescription  mappings to the shader.

    VkVertexInputAttributeDescription input = {};
    input.location = 0;
    input.binding  = 0;
    input.format   = FORMAT_R32G32B32_SFLOAT; // maps to vec3
    input.offset   = 0;
    
  • The only thing left to do is bind the vertex buffer and optional index buffer prior to the draw call.

    vkBeginCommandBuffer();
    // ...
    vkCmdBindVertexBuffer();
    vkCmdDraw();
    // ...
    vkCmdBindVertexBuffer();
    vkCmdBindIndexBuffer();
    vkCmdDrawIndexed();
    // ...
    vkEndCommandBuffer();
    
  • Limits :

    • maxVertexInputAttributes

    • maxVertexInputAttributeOffset

Memory Layout
  • .

  • .

  • .

    • Single binding.

  • .

    • One binding per attribute.

  • One binding or many bindings? It doesn't matter that much; in some cases one is better than the other, but don't worry too much about it.

Vertex Input Binding / Vertex Buffer
  • Tell Vulkan how to pass this data format to the vertex shader once it's been uploaded into GPU memory.

  • A vertex binding describes at which rate to load data from memory throughout the vertices.

  • It specifies the number of bytes between data entries and whether to move to the next data entry after each vertex or after each instance.

  • VkVertexInputBindingDescription .

    • binding

      • Specifies the index of the binding in the array of bindings.

    • stride

      • Specifies the number of bytes from one entry to the next.

    • inputRate

      • VERTEX_INPUT_RATE_VERTEX

        • Move to the next data entry after each vertex.

      • VERTEX_INPUT_RATE_INSTANCE

        • Move to the next data entry after each instance.

      • We're not going to use instanced rendering, so we'll stick to per-vertex data.

  • VkVertexInputAttributeDescription

    • Describes how to handle vertex input.

    • An attribute description struct describes how to extract a vertex attribute from a chunk of vertex data originating from a binding description.

    • We have two attributes, position and color, so we need two attribute description structs.

    • binding

      • Tells Vulkan from which binding the per-vertex data comes.

    • location

      • References the location  directive of the input in the vertex shader.

        • The input in the vertex shader with location 0  is the position, which has two 32-bit float components.

    • format

      • Describes the type of data for the attribute.

      • Implicitly defines the byte size of attribute data.

      • A bit confusingly, the formats are specified using the same enumeration as color formats.

      • The following shader types and formats are commonly used together:

        • float : FORMAT_R32_SFLOAT

        • vec2 : FORMAT_R32G32_SFLOAT

        • vec3 : FORMAT_R32G32B32_SFLOAT

        • vec4 : FORMAT_R32G32B32A32_SFLOAT

      • As you can see, you should use the format whose number of color channels matches the number of components in the shader data type.

      • It is allowed to use more channels than the number of components in the shader, but they will be silently discarded.

        • If the number of channels is lower than the number of components, then the BGA components will use default values of (0, 0, 1) .

      • The color type ( SFLOAT , UINT , SINT ) and bit width should also match the type of the shader input. See the following examples:

        • ivec2 : FORMAT_R32G32_SINT , a 2-component vector of 32-bit signed integers

        • uvec4 : FORMAT_R32G32B32A32_UINT , a 4-component vector of 32-bit unsigned integers

        • double : FORMAT_R64_SFLOAT , a double-precision (64-bit) float

    • offset

      • Specifies the number of bytes since the start of the per-vertex data to read from.

  • Graphics Pipeline Vertex Input Binding :

    • For the following vertices:

      Vertex :: struct {
          pos:   eng.Vec2,
          color: eng.Vec3,
      }
      
      vertices := [?]Vertex{
          { {  0.0, -0.5 }, { 1.0, 0.0, 0.0 } },
          { {  0.5,  0.5 }, { 0.0, 1.0, 0.0 } },
          { { -0.5,  0.5 }, { 0.0, 0.0, 1.0 } },
      }
      
    • We setup this in the Graphics Pipeline creation:

      vertex_binding_descriptor := vk.VertexInputBindingDescription{
          binding   = 0,
          stride    = size_of(Vertex),
          inputRate = .VERTEX,
      }
      vertex_attribute_descriptor := [?]vk.VertexInputAttributeDescription{
          {
              binding  = 0,
              location = 0,
              format   = .R32G32_SFLOAT,
              offset   = cast(u32)offset_of(Vertex, pos),
          },
          {
              binding  = 0,
              location = 1,
              format   = .R32G32B32_SFLOAT,
              offset   = cast(u32)offset_of(Vertex, color),
          },
      }
      vertex_input_create_info := vk.PipelineVertexInputStateCreateInfo {
          sType                           = .PIPELINE_VERTEX_INPUT_STATE_CREATE_INFO,
          vertexBindingDescriptionCount   = 1,
          pVertexBindingDescriptions      = &vertex_binding_descriptor,
          vertexAttributeDescriptionCount = len(vertex_attribute_descriptor),
          pVertexAttributeDescriptions    = &vertex_attribute_descriptor[0],
      }
      
    • The pipeline is now ready to accept vertex data in the format of the vertices  container and pass it on to our vertex shader.

  • Vertex Buffer :

    • If you run the program now with validation layers enabled, you'll see that it complains that there is no vertex buffer bound to the binding.

    • The next step is to create a vertex buffer and move the vertex data to it so the GPU is able to access it.

    • Creating :

      • Follow the tutorial for creating a buffer, specifying BUFFER_USAGE_VERTEX_BUFFER  as the BufferCreateInfo   usage .

Index Buffer
  • Motivation :

    • Drawing a rectangle takes two triangles, which means that we need a vertex buffer with six vertices. The problem is that the data of two vertices needs to be duplicated, resulting in redundancies.

    • The solution to this problem is to use an index buffer.

    • An index buffer is essentially an array of pointers into the vertex buffer.

    • It allows you to reorder the vertex data, and reuse existing data for multiple vertices.

    • .

      • The first three indices define the upper-right triangle, and the last three indices define the vertices for the bottom-left triangle.

    • It is possible to use either uint16_t  or uint32_t  for your index buffer depending on the number of entries in vertices . We can stick to uint16_t  for now because we’re using fewer than 65535 unique vertices.

    • Just like the vertex data, the indices need to be uploaded into a VkBuffer  for the GPU to be able to access them.

  • Creating :

    • Follow the tutorial for creating a buffer, specifying BUFFER_USAGE_INDEX_BUFFER  as the BufferCreateInfo   usage .

  • Using :

    • We first need to bind the index buffer, just like we did for the vertex buffer.

    • The difference is that you can only have a single  index buffer. It’s unfortunately not possible to use different indices for each vertex attribute, so we do still have to completely duplicate vertex data even if just one attribute varies.

    • An index buffer is bound with vkCmdBindIndexBuffer  which has the index buffer, a byte offset into it, and the type of index data as parameters.

      • As mentioned before, the possible types are INDEX_TYPE_UINT16  and INDEX_TYPE_UINT32 .

    • Just binding an index buffer doesn’t change anything yet; we also need to change the drawing command to tell Vulkan to use the index buffer.

    • Remove  the vkCmdDraw  line and replace it with vkCmdDrawIndexed .

Push Constants

  • A Push Constant is a small bank of values accessible in shaders.

  • These are designed for small amounts (a few dwords) of high-frequency data, updated per recording of the command buffer.

  • So that the shader knows where this data will arrive, we specify a special push_constant  layout in our shader code.

layout(push_constant) uniform MeshData {
    mat4 model;
} mesh_data;
  • Choosing to use Push Constants :

    • In early implementations of Vulkan on Arm Mali, this was usually the fastest way of pushing data to your shaders. In more recent times, we have observed on Mali devices that overall  they can be slower. If performance is something you are trying to maximise on Mali devices, descriptor sets may be the way to go. However, other devices may still favour push constants.

    • Having said this, descriptor sets are one of the more complex features of Vulkan, making the convenience of push constants still worth considering as a go-to method, especially if working with trivial data.

  • Limits :

    • maxPushConstantsSize

      • Guaranteed to be at least 128  bytes on all devices.

      • In Vulkan 1.4 the minimum was increased to 256 bytes.

  • Push Constants .

Offsets
  • .

  • Ex1 :

    layout(push_constant, std430) uniform pc {
        layout(offset = 32) vec4 data;
    };
    
    layout(location = 0) out vec4 outColor;
    
    void main() {
       outColor = data;
    }
    
    VkPushConstantRange range = {};
    range.stageFlags = SHADER_STAGE_FRAGMENT;
    range.offset = 32;
    range.size = 16;
    
Updating
  • Ex1 :

    • Push constants can be incrementally updated over the course of a command buffer.

    // vkBeginCommandBuffer()
    vkCmdBindPipeline();
    vkCmdPushConstants(offset: 0, size: 16, values: [0, 0, 0, 0]);
    vkCmdDraw(); // values = [0, 0, 0, 0]
    
    vkCmdPushConstants(offset: 4, size: 8, values: [1, 1]);
    vkCmdDraw(); // values = [0, 1, 1, 0]
    
    vkCmdPushConstants(offset: 8, size: 8, values: [2, 2]);
    vkCmdDraw(); // values = [0, 1, 2, 2]
    // vkEndCommandBuffer()
    
    • Note how old values are kept: values that were not overwritten are preserved across draws.

Lifetime
  • vkCmdPushConstants  is tied to the VkPipelineLayout  it was recorded with, which is why the layouts must match before a command such as vkCmdDraw() .

  • Because push constants are not tied to descriptors, the use of vkCmdBindDescriptorSets  has no effect on the lifetime or pipeline layout compatibility  of push constants.

  • The same way it is possible to bind descriptor sets that are never used by the shader, the same is true for push constants.

CPU Performance
  • Push one struct once per draw instead of many separate vkCmdPushConstants calls (one call writing a small struct is far cheaper).

  • Many small state changes cause the driver to update internal tables, validate, or patch commands — that’s CPU work and cannot be avoided without batching.

  • Observations :

    • 5 push calls were taking 7.65 µs; after grouping them into a single push call, they now take 3.08 µs.

    • This was substantial, as at the time I was issuing these push calls hundreds of times per frame; I later reduced that number, but it could still be significant.

Descriptor Sets

About

  • VkDescriptorSet

  • One Descriptor -> One Resource.

  • They are always organized in Descriptor Sets.

    • One or more descriptors contained.

    • Combine descriptors which are used in conjunction.

  • A handle or pointer into a resource.

    • Note that it is not just a pointer, but a pointer plus metadata.

  • A core mechanism used to bind resources to shaders.

  • Holds the binding information that connects shader inputs to data such as VkBuffer  resources and VkImage  textures.

  • Think of it as a set of GPU-side pointers that you bind once.

  • The internal representation of a descriptor set is whatever the driver wants it to be.

  • Article by Arseny Kapoulkine .

  • Sample talking about best practices .

  • Content :

    • Where to find a Resource.

    • Usage type of a Resource.

    • Offsets, sometimes.

    • Some metadata, sometimes.

  • Example :

    • .

    // Note - only set 0 and 2 are used in this shader
    layout(set = 0, binding = 0) uniform sampler2D myTextureSampler;
    
    layout(set = 0, binding = 2) uniform uniformBuffer0 {
        float someData;
    } ubo_0;
    
    layout(set = 0, binding = 3) uniform uniformBuffer1 {
        float moreData;
    } ubo_1;
    
    layout(set = 2, binding = 0) buffer storageBuffer {
        float myResults;
    } ssbo;
    
  • API :

    • .

    • .

  • Limits :

    • maxBoundDescriptorSets

    • Per stage limit

    • maxPerStageDescriptorSamplers

    • maxPerStageDescriptorUniformBuffers

    • maxPerStageDescriptorStorageBuffers

    • maxPerStageDescriptorSampledImages

    • maxPerStageDescriptorStorageImages

    • maxPerStageDescriptorInputAttachments

    • maxPerStageResources

    • Per set limit

    • maxDescriptorSetSamplers

    • maxDescriptorSetUniformBuffers

    • maxDescriptorSetUniformBuffersDynamic

    • maxDescriptorSetStorageBuffers

    • maxDescriptorSetStorageBuffersDynamic

    • maxDescriptorSetSampledImages

    • maxDescriptorSetStorageImages

    • maxDescriptorSetInputAttachments

    • VkPhysicalDeviceDescriptorIndexingProperties  if using Descriptor Indexing

    • VkPhysicalDeviceInlineUniformBlockPropertiesEXT  if using Inline Uniform Block

  • Visual explanation {0:00 -> 5:35} .

    • Nice.

    • The rest of the video is meh.

Difficulties
  • Problems :

    • "They are not bad but they very much force a specific rendering style: you have triple / quadrupled nested for loops, binding your things based on usage and then rebind descriptor sets as needed."

    • "Many of us are moving towards bindless rendering, where you just bind everything once in one big descriptor set, and then index into it at will; tho, Vulkan 1.0 does not greatly support, and also the descriptor count for it was quite low".

    • Cannot update descriptors after binding in a command buffer.

    • All descriptors must be valid, even if not used.

    • Descriptor arrays must be sampled uniformly.

      • Different invocations can’t use different indices.

      • Can sample “dynamically uniform”, e.g. runtime-based index.

    • Upper limit on descriptor counts.

    • Discourages GPU-driven rendering architectures.

      • Due to the need to set up descriptor sets per draw call it’s hard to adapt any of the aforementioned schemes to GPU-based culling or command submission.

  • Solutions :

    • Descriptor Indexing :

      • Available in 1.3, optional in 1.2, or EXT_descriptor_indexing .

      • Update descriptors after binding.

      • Update unused descriptors.

      • Relax requirement that all descriptors must be valid, even if unused.

      • Non-uniform array indexing.

    • Buffer Device Address :

      • Available in 1.3, optional in 1.2, or KHR_buffer_device_address .

      • Directly access buffers through addresses without a descriptor.

      • See [[#Physical Storage Buffer]] below.

    • Descriptor Buffers – EXT_descriptor_buffer :

      • Manage descriptors directly.

      • Similar to D3D12’s descriptor model .

Allocation

  • A scheme that works well is to use free lists of descriptor set pools; whenever you need a descriptor set pool, you allocate one from the free list and use it for subsequent descriptor set allocations in the current frame on the current thread. Once you run out of descriptor sets in the current pool, you allocate a new pool. Any pools that were used in a given frame need to be kept around; once the frame has finished rendering, as determined by the associated fence objects, the descriptor set pools can be reset via vkResetDescriptorPool  and returned to the free list. While it’s possible to free individual descriptor sets from a pool via DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET , this complicates memory management on the driver side and is not recommended.

  • When a descriptor set pool is created, the application specifies the maximum number of descriptor sets allocated from it, as well as the maximum number of descriptors of each type that can be allocated from it. In Vulkan 1.1, the application doesn’t have to handle accounting for these limits – it can just call vkAllocateDescriptorSets and handle the error from that call by switching to a new descriptor set pool. Unfortunately, in Vulkan 1.0 without any extensions, it’s an error to call vkAllocateDescriptorSets if the pool does not have available space, so the application must track the number of sets and descriptors of each type to know beforehand when to switch to a different pool.

  • Different pipeline objects may use different numbers of descriptors, which raises the question of pool configuration. A straightforward approach is to create all pools with the same configuration that uses the worst-case number of descriptors for each type – for example, if each set can use at most 16 texture and 8 buffer descriptors, one can allocate all pools with maxSets=1024, and pool sizes 16*1024 for texture descriptors and 8*1024 for buffer descriptors. This approach can work but in practice it can result in very significant memory waste for shaders with different descriptor counts – you can’t allocate more than 1024 descriptor sets out of a pool with the aforementioned configuration, so if most of your pipeline objects use 4 textures, you’ll be wasting 75% of texture descriptor memory.

  • Strategies :

    • Two alternatives provide a better balance of memory use:

    1. Measure an average number of descriptors used in a shader pipeline per type for a characteristic scene and allocate pool sizes accordingly. For example, if in a given scene we need 3000 descriptor sets, 13400 texture descriptors, and 1700 buffer descriptors, then the average number of descriptors per set is 4.47 textures (rounded up to 5) and 0.57 buffers (rounded up to 1), so a reasonable configuration of a pool is maxSets=1024, 5*1024 texture descriptors, 1024 buffer descriptors. When a pool is out of descriptors of a given type, we allocate a new one – so this scheme is guaranteed to work and should be reasonably efficient on average.

    2. Group shader pipeline objects into size classes, approximating common patterns of descriptor use, and pick descriptor set pools using the appropriate size class. This is an extension of the scheme described above to more than one size class. For example, it’s typical to have large numbers of shadow/depth prepass draw calls, and large numbers of regular draw calls in a scene – but these two groups have different numbers of required descriptors, with shadow draw calls typically requiring 0 to 1 textures per set and 0 to 1 buffers when dynamic buffer offsets are used. To optimize memory use, it’s more appropriate to allocate descriptor set pools separately for shadow/depth and other draw calls. Similarly to general-purpose allocators that can have size classes that are optimal for a given application, this can still be managed in a lower-level descriptor set management layer as long as it’s configured with application specific descriptor set usages beforehand.

Implementation
  • Descriptors are like pointers; as with any pointer, space for them has to be allocated ahead of time.

  • How many :

    • It's possible to have 1 very big descriptor pool that handles the entire engine, but that means we need to know what descriptors we will be using for everything ahead of time.

    • That can be very tricky to do at scale. Instead, we will keep it simpler: we will have multiple descriptor pools for different parts of the project, and try to size them more accurately.

      • I don't know what that actually means in practice.

  • VkDescriptorPool .

    • Maintains a pool of descriptors, from which descriptor sets are allocated.

    • Descriptor pools are externally synchronized, meaning that the application must  not allocate and/or free descriptor sets from the same pool in multiple threads simultaneously.

    • They are very opaque.

    • VkDescriptorPoolCreateInfo .

      • Contains a type of descriptor (same VkDescriptorType  as on the bindings above ), alongside a ratio that the maxSets  parameter is multiplied by.

      • This lets us directly control how big the pool is going to be. maxSets  controls how many VkDescriptorSets  we can create from the pool in total, and the pool sizes give how many individual bindings of a given type are owned.

      • flags .

        • Is a bitmask of VkDescriptorPoolCreateFlagBits  specifying certain supported operations on the pool.

        • DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET

          • Determines if individual descriptor sets can be freed or not:

          • We're not going to touch the descriptor set after creating it, so we don't need this flag. You can leave flags  at its default value of 0 .

        • DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND

          • Descriptor pool creation may  fail with the error ERROR_FRAGMENTATION  if the total number of descriptors across all pools (including this one) created with this bit set exceeds maxUpdateAfterBindDescriptorsInAllPools , or if fragmentation of the underlying hardware resources occurs.

      • maxSets

        • Is the maximum number of descriptor sets that can  be allocated from the pool.

      • poolSizeCount

        • Is the number of elements in pPoolSizes .

      • pPoolSizes

        • Is a pointer to an array of VkDescriptorPoolSize  structures, each containing a descriptor type and number of descriptors of that type to be allocated in the pool.

        • If multiple VkDescriptorPoolSize  structures containing the same descriptor type appear in the pPoolSizes  array then the pool will be created with enough storage for the total number of descriptors of each type.

        • VkDescriptorPoolSize .

          • type

            • Is the type of descriptor.

          • descriptorCount

            • Is the number of descriptors of that type to allocate. If type  is DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK  then descriptorCount  is the number of bytes to allocate for descriptors of this type.

  • VkDescriptorSetAllocateInfo

    • descriptorPool

      • Is the pool which the sets will be allocated from.

    • descriptorSetCount

      • Determines the number of descriptor sets to be allocated from the pool.

    • pSetLayouts

      • Is a pointer to an array of descriptor set layouts, with each member specifying how the corresponding descriptor set is allocated.

  • vkAllocateDescriptorSets() .

    • The allocated descriptor sets are returned in pDescriptorSets .

    • When a descriptor set is allocated, the initial state is largely uninitialized and all descriptors are undefined, with the exception that samplers with a non-null pImmutableSamplers  are initialized on allocation.

    • Descriptors also become undefined if the underlying resource or view object is destroyed.

    • Descriptor sets containing undefined descriptors can  still be bound and used, subject to the following conditions:

      • For descriptor set bindings created with the PARTIALLY_BOUND  bit set:

        • All descriptors in that binding that are dynamically used must  have been populated before the descriptor set is consumed .

      • For descriptor set bindings created without the PARTIALLY_BOUND  bit set:

        • All descriptors in that binding that are statically used must  have been populated before the descriptor set is consumed .

      • Descriptor bindings with descriptor type of DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK   can  be undefined when the descriptor set is consumed ; though values in that block will be undefined.

      • Entries that are not used by a pipeline can  have undefined descriptors.

    • pAllocateInfo

    • pDescriptorSets

      • Is a pointer to an array of VkDescriptorSet  handles in which the resulting descriptor set objects are returned.

  • Multithreading :

    • Descriptor pools are externally synchronized, meaning that the application must  not allocate and/or free descriptor sets from the same pool in multiple threads simultaneously.

    • Descriptor pools are used to allocate, free, reset, and update descriptor sets. By creating multiple descriptor pools, each application host thread can manage descriptor sets in its own pool at the same time.

Best Practices
  • Don’t allocate descriptor sets if nothing in the set changed. In the model with slots that are shared between different stages, this can mean that if no textures are set between two draw calls, you don’t need to allocate the descriptor set with texture descriptors.

  • Don't allocate descriptor sets from descriptor pools on performance critical code paths.

  • Don't allocate, free or update descriptor sets every frame, unless it is necessary.

  • Don't set DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET  if you do not need to free individual descriptor sets.

    • Setting DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET  may prevent the implementation from using a simpler (and faster) allocator.

Descriptor Types

Overview
  • For buffers, the application must choose between uniform and storage buffers, and whether to use dynamic offsets or not. Uniform buffers have a limit on the maximum addressable size – on desktop hardware, you get up to 64 KB of data, however on mobile hardware some GPUs only provide 16 KB (which is also the guaranteed minimum by the specification). The buffer resource can be larger than that, but a shader can only access this much data through one descriptor.

  • On some hardware, there is no difference in access speed between uniform and storage buffers, however for other hardware depending on the access pattern uniform buffers can be significantly faster. Prefer uniform buffers for small to medium sized data especially if the access pattern is fixed (e.g. for a buffer with material or scene constants). Storage buffers are more appropriate when you need large arrays of data that need to be larger than the uniform buffer limit and are indexed dynamically in the shader.

  • For textures, if filtering is required, there is a choice of a combined image/sampler descriptor (where, like in OpenGL, the descriptor specifies both the source of the texture data and the filtering/addressing properties), separate image and sampler descriptors (which maps better to the Direct3D 11 model), and an image descriptor with an immutable sampler, where the sampler properties must be specified when the pipeline object is created.

  • The relative performance of these methods is highly dependent on the usage pattern; however, in general immutable samplers map better to the recommended usage model in newer APIs like Direct3D 12, and give the driver more freedom to optimize the shader. This does alter renderer design to a certain extent, making it necessary to implement certain dynamic portions of the sampler state, like per-texture LOD bias for texture fade-in during streaming, using shader ALU instructions.

Storage Images
  • DESCRIPTOR_TYPE_STORAGE_IMAGE

  • Is a descriptor type that allows shaders to read from and write to an image without using a fixed-function graphics pipeline.

  • This is particularly useful for compute shaders and advanced rendering techniques.

  • Storage Images and Implementation .

// FORMAT_R32_UINT
layout(set = 0, binding = 0, r32ui) uniform uimage2D storageImage;

// example usage for reading and writing in GLSL
const uvec4 texel = imageLoad(storageImage, ivec2(0, 0));
imageStore(storageImage, ivec2(1, 1), texel);
  • Use cases :

    • Image Processing :

      • Storage images are ideal for image processing tasks like filters, blurs, and other post-processing effects.

Sampler
  • DESCRIPTOR_TYPE_SAMPLER  and DESCRIPTOR_TYPE_SAMPLED_IMAGE .

layout(set = 0, binding = 0) uniform sampler samplerDescriptor;
layout(set = 0, binding = 1) uniform texture2D sampledImage;

// example usage of using texture() in GLSL
vec4 data = texture(sampler2D(sampledImage,  samplerDescriptor), vec2(0.0, 0.0));
Combined Image Sampler
  • DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER

  • On some implementations, it may  be more efficient to sample from an image using a combination of sampler and sampled image that are stored together in the descriptor set in a combined descriptor.

layout(set = 0, binding = 0) uniform sampler2D combinedImageSampler;

// example usage of using texture() in GLSL
vec4 data = texture(combinedImageSampler, vec2(0.0, 0.0));
Uniform Buffer / UBO (Uniform Buffer Object)
layout(set = 0, binding = 0) uniform uniformBuffer {
    float a;
    int b;
} ubo;

// example of reading from UBO in GLSL
int x = ubo.b + 1;
vec3 y = vec3(ubo.a);
  • Uniform buffers commonly use the std140  layout (strict alignment rules, predictable padding; e.g. array elements and structs are rounded up to 16-byte alignment).

    • Confirmed by the GLSL spec: uniform blocks default to std140 , and std430  is not allowed on uniform blocks (only on buffer  and push_constant  blocks).

/* UBO: small read-only data (std140) */
layout(set = 0, binding = 0, std140) uniform SceneParams {
    mat4 viewProj;
    vec4 lightPos;
    float time;
} scene;
  • UBO (Uniform Buffer Object) :

    • “Uniform buffer object” is more of an OpenGL-era name, but some Vulkan tutorials and developers still use it informally to mean the same thing — the buffer that holds uniform data.

Storage Buffer / SSBO (Shader Storage Buffer Object)
  • DESCRIPTOR_TYPE_STORAGE_BUFFER

  • GLSL uses distinct address spaces: uniform  → UBO, buffer  → SSBO.

  • Use std430  layout by default (tighter packing, fewer padding requirements).

  • SSBO (Shader Storage Buffer Object) is an OpenGL term.

// Implicit std430 (default)
layout(set = 0, binding = 0) buffer storageBuffer {
    float a;
    int b;
} ssbo;

// Explicit std430
layout(set = 0, binding = 1, std430) buffer ParticleData {
    vec4 pos[];
} particles;

// Reading and writing to a SSBO in GLSL
ssbo.a = ssbo.a + 1.0;
ssbo.b = ssbo.b + 1;
  • BufferBlock  and Uniform  would have been seen prior to KHR_storage_buffer_storage_class .

  • Storage buffers can also have dynamic offsets at bind time DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC .

  • Why SSBO for dynamic arrays :

    • std430  allows tight packing and runtime-sized arrays (T data[]) , which is ideal for dynamic-length storage.

    • SSBOs allow arbitrary indexing, read/write, and atomics.

    • maxStorageBufferRange is usually much larger than maxUniformBufferRange .

    • You can use *_DYNAMIC  descriptors to bind multiple subranges of one large backing buffer cheaply.

  • Many arrays :

    • A buffer block may contain multiple arrays, but only the last member of the block may be a runtime-sized (unsized) array T x[] . All other arrays must be fixed-size (compile-time constant) or you must implement sizing/offsets yourself.

      • This is invalid, even with descriptor indexing:

      layout(std430, set = 0, binding = 0) buffer FixedArrays { 
          vec4 A[]; 
          vec2 B[]; 
          mat4 C[]; 
          some_struct D[];
      } fixedArrays;
      
    1. Use a uint x[] :

      • 32-bit words; simplest and portable.

      • This is effectively an untyped byte/word blob stored in the SSBO, and you manually reinterpret (cast) it in the shader.

      layout(std430, set = 0, binding = 0) buffer PackedBytes {
          uint countA;   // number of A elements
          uint offsetA;  // offset into data[] in uint words
          uint countB;
          uint offsetB;  // offset into data[] in uint words
          uint countC;
          uint offsetC;
      
          uint data[];   // payload in 32-bit words
      } pb;
      
      // helpers
      float readFloat(uint baseWordIndex) {
          return uintBitsToFloat(pb.data[baseWordIndex]);
      }
      
      vec2 readVec2(uint baseWordIndex) {
          return vec2(
              uintBitsToFloat(pb.data[baseWordIndex + 0]),
              uintBitsToFloat(pb.data[baseWordIndex + 1])
          );
      }
      
      vec3 readVec3(uint baseWordIndex) {
          return vec3(
              uintBitsToFloat(pb.data[baseWordIndex + 0]),
              uintBitsToFloat(pb.data[baseWordIndex + 1]),
              uintBitsToFloat(pb.data[baseWordIndex + 2])
          );
      }
      
      vec4 readVec4(uint baseWordIndex) {
          return vec4(
              uintBitsToFloat(pb.data[baseWordIndex + 0]),
              uintBitsToFloat(pb.data[baseWordIndex + 1]),
              uintBitsToFloat(pb.data[baseWordIndex + 2]),
              uintBitsToFloat(pb.data[baseWordIndex + 3])
          );
      }
      
      mat4 readMat4(uint baseWordIndex) {
          // mat4 stored column-major as 16 floats (4 columns of vec4)
          return mat4(
              readVec4(baseWordIndex + 0),
              readVec4(baseWordIndex + 4),
              readVec4(baseWordIndex + 8),
              readVec4(baseWordIndex + 12)
          );
      }
      
    2. Use a vec4 x[] :

      • 128-bit blocks; simpler alignment for vec4/mat4 data.

      // Pack everything into vec4 blocks for simple alignment
      layout(std430, set = 0, binding = 0) buffer Packed {
          uint countA;
          uint offsetA; // in vec4-blocks
          uint countB;
          uint offsetB; // in vec4-blocks
          uint countC;
          uint offsetC; // in vec4-blocks
          uint countD;
          uint offsetD; // in vec4-blocks
      
          vec4 blocks[]; // single runtime-sized array (last member)
      } packed;
      
      // helpers
      vec4 getA(uint i) {
          return packed.blocks[packed.offsetA + i];
      }
      
      vec2 getB(uint i) {
          return packed.blocks[packed.offsetB + i].xy; // we store each B in one vec4 block
      }
      
      mat4 getC(uint i) {
          uint base = packed.offsetC + i * 4; // mat4 occupies 4 vec4 blocks
          return mat4(packed.blocks[base + 0],
                      packed.blocks[base + 1],
                      packed.blocks[base + 2],
                      packed.blocks[base + 3]);
      }
      
      // for some_struct D that we store as 1 vec4 per element:
      some_struct getD(uint i) {
          vec4 v = packed.blocks[packed.offsetD + i];
          // decode v -> some_struct fields; a real shader must return the
          // decoded struct here, e.g. return some_struct(v.x, v.yz);
      }
      
    3. Use many SSBOs:

      layout(std430, set=0, binding=0) buffer BufA { vec4 A[]; } bufA;
      layout(std430, set=0, binding=1) buffer BufB { vec2 B[]; } bufB;
      layout(std430, set=0, binding=2) buffer BufC { mat4 C[]; } bufC;
      layout(std430, set=0, binding=3) buffer BufD { some_struct D[]; } bufD;
      
Texel Buffer
  • Texel buffers are a way to access buffer data with texture-like operations in shaders.

  • Texel Buffers and Implementation .

  • Compatibility Requirements .

    • The format specified in the shader (SPIR-V Image Format) must exactly match  the format used when creating the VkImageView  (Vulkan Format).

  • Best Practices .

  • Uniform Texel Buffer :

    • DESCRIPTOR_TYPE_UNIFORM_TEXEL_BUFFER

    • Read-only access.

    layout(set = 0, binding = 0) uniform textureBuffer uniformTexelBuffer;
    
    // example of reading texel buffer in GLSL
    vec4 data = texelFetch(uniformTexelBuffer, 0);
    
    • Use cases :

      • Lookup Tables :

        • Uniform texel buffers are useful for implementing lookup tables that need to be accessed with texture-like operations.

  • Storage Texel Buffer :

    • DESCRIPTOR_TYPE_STORAGE_TEXEL_BUFFER

    • Read-write access.

    // FORMAT_R8G8B8A8_UINT
    layout(set = 0, binding = 0, rgba8ui) uniform uimageBuffer storageTexelBuffer;
    
    // example of reading and writing texel buffer in GLSL
    int offset = int(gl_GlobalInvocationID.x);
    uvec4 data = imageLoad(storageTexelBuffer, offset);
    imageStore(storageTexelBuffer, offset, uvec4(0));
    
    • Use cases :

      • Particle Systems :

        • Storage texel buffers can be used to store and update particle data in a compute shader, which can then be read by a vertex shader for rendering.

Input Attachment
  • DESCRIPTOR_TYPE_INPUT_ATTACHMENT

layout (input_attachment_index = 0, set = 0, binding = 0) uniform subpassInput inputAttachment;

// example loading the attachment data in GLSL
vec4 data = subpassLoad(inputAttachment);

Updates

Implementation
  • A Descriptor Set, even though created and allocated, is still empty. We need to fill it up with data.

  • Updates must  happen outside of a command record and execution.

    • No update after vkCmdBindDescriptorSets() .

    • Usually you update before vkBeginCommandBuffer()  or after vkQueueSubmit()  has finished executing (i.e. once synchronization confirms the command buffer is done).

  • If using Descriptor Indexing :

    • Descriptors can be updated after binding in command buffers.

      • Command buffer execution will use most recent updates.

    • .

  • VkWriteDescriptorSet .

    • dstSet

      • Is the destination descriptor set to update.

    • dstBinding

      • Is the descriptor binding within that set.

    • dstArrayElement

      • Remember that descriptors can be arrays, so we also need to specify the first index in the array that we want to update.

      • If not using an array, the index is simply 0 .

      • Is the starting element in that array.

      • If the descriptor binding identified by dstSet  and dstBinding  has a descriptor type of DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK  then dstArrayElement  specifies the starting byte offset within the binding.

    • descriptorCount

      • It's a descriptor count, not  a descriptor SET count!!

      • Is the number of descriptors to update.

      • If the descriptor binding identified by dstSet  and dstBinding  has a descriptor type of DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK , then descriptorCount  specifies the number of bytes to update.

      • Otherwise, descriptorCount  is the number of elements in pImageInfo , pBufferInfo , or pTexelBufferView  (whichever array descriptorType  selects).

    • descriptorType

      • We need to specify the type of descriptor again.

      • Is a VkDescriptorType  specifying the type of each descriptor in pImageInfo , pBufferInfo , or pTexelBufferView .

      • It must  be the same type as the descriptorType  specified in VkDescriptorSetLayoutBinding  for dstSet  at dstBinding , except  if VkDescriptorSetLayoutBinding  for dstSet  at dstBinding  is equal to DESCRIPTOR_TYPE_MUTABLE_EXT .

      • The type of the descriptor also controls which array the descriptors are taken from.

    • pBufferInfo

      • Is a pointer to an array of VkDescriptorBufferInfo  structures or is ignored, as described below.

      • VkDescriptorBufferInfo .

        • Structure specifying descriptor buffer information

        • Specifies the buffer and the region within it that contains the data for the descriptor.

        • buffer

        • offset

          • Is the offset in bytes from the start of buffer .

          • Access to buffer memory via this descriptor uses addressing that is relative to this starting offset.

          • For DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC  and DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC  descriptor types:

            • offset  is the base offset from which the dynamic offset is applied.

        • range

          • Is the size in bytes that is used for this descriptor update, or WHOLE_SIZE  to use the range from offset  to the end of the buffer.

            • When range  is WHOLE_SIZE  the effective range is calculated at vkUpdateDescriptorSets  by taking the size of buffer  minus the offset .

          • For DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC  and DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC  descriptor types:

            • range  is the static size used for all dynamic offsets.

    • pImageInfo

      • Is a pointer to an array of VkDescriptorImageInfo  structures or is ignored, as described below.

      • VkDescriptorImageInfo .

        • imageLayout

          • Is the layout that the image subresources accessible from imageView  will be in at the time this descriptor is accessed.

          • Is used in descriptor updates for types DESCRIPTOR_TYPE_SAMPLED_IMAGE , DESCRIPTOR_TYPE_STORAGE_IMAGE , DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER , and DESCRIPTOR_TYPE_INPUT_ATTACHMENT .

        • imageView

          • Is an image view handle or NULL_HANDLE .

          • Is used in descriptor updates for types DESCRIPTOR_TYPE_SAMPLED_IMAGE , DESCRIPTOR_TYPE_STORAGE_IMAGE , DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER , and DESCRIPTOR_TYPE_INPUT_ATTACHMENT .

        • sampler

          • Is a sampler handle.

          • Is used in descriptor updates for types DESCRIPTOR_TYPE_SAMPLER  and DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER  if the binding being updated does not use immutable samplers.

    • pTexelBufferView

  • vkUpdateDescriptorSets() .

    • descriptorWriteCount

      • Is the number of elements in the pDescriptorWrites  array.

    • pDescriptorWrites

    • descriptorCopyCount

      • Is the number of elements in the pDescriptorCopies  array.

    • pDescriptorCopies

      • Is a pointer to an array of VkCopyDescriptorSet  structures describing the descriptor sets to copy between.

Best Practices
  • Don’t update descriptor sets if nothing in the set changed. In the model with slots that are shared between different stages, this can mean that if no textures are set between two draw calls, you don’t need to update the descriptor set with texture descriptors.

  • When rendering dynamic objects the application will need to push some amount of per-object data to the GPU, such as the MVP matrix. This data may not fit into the push constant limit for the device, so it becomes necessary to send it to the GPU by putting it into a VkBuffer  and binding a descriptor set that points to it.

  • Materials also need their own descriptor sets, which point to the textures they use. We can either bind per-material and per-object descriptor sets separately or collate them into a single set. Either way, complex applications will have a large amount of descriptor sets that may need to change on the fly, for example due to textures being streamed in or out.

  • Not-good Solution: One or more pools per-frame, resetting the pool :

    • The simplest approach to circumvent the issue is to have one or more VkDescriptorPool s per frame, reset them at the beginning of the frame and allocate the required descriptor sets from it. This approach will consist of a vkResetDescriptorPool()  call at the beginning, followed by a series of vkAllocateDescriptorSets()  and vkUpdateDescriptorSets()  to fill them with data.

    • This is very useful for things like per-frame descriptors. That way we can have descriptors that are used just for one frame, allocated dynamically, and then before we start the frame we completely delete all of them in one go.

    • This is confirmed to be a fast path by GPU vendors, and recommended to use when you need to handle per-frame descriptor sets.

    • The issue is that these calls can add a significant overhead to the CPU frame time, especially on mobile. In the worst cases, for example calling vkUpdateDescriptorSets()  for each draw call, the time it takes to update descriptors can be longer than the time of the draws themselves.
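    • As a sketch of why bulk reset is cheap: a pool created without the FREE_DESCRIPTOR_SET  flag can be little more than a bump allocator. The C below models that with hypothetical names (nothing here is real Vulkan API); the allocate function stands in for vkAllocateDescriptorSets() , the reset for vkResetDescriptorPool() :

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Illustrative stand-in for a VkDescriptorPool that hands out sets by
       * bumping a counter -- roughly what a driver can do when individual
       * frees are not requested. All names are hypothetical mocks. */
      typedef struct {
          uint32_t capacity;  /* maxSets given at pool creation */
          uint32_t next;      /* bump pointer: sets handed out this frame */
      } MockDescriptorPool;

      /* vkAllocateDescriptorSets analogue: O(1) bump, no per-set bookkeeping. */
      static int mock_allocate_set(MockDescriptorPool *pool, uint32_t *out_set) {
          if (pool->next == pool->capacity)
              return -1;  /* like VK_ERROR_OUT_OF_POOL_MEMORY */
          *out_set = pool->next++;
          return 0;
      }

      /* vkResetDescriptorPool analogue: recycle ALL sets in one go. */
      static void mock_reset_pool(MockDescriptorPool *pool) {
          pool->next = 0;
      }
      ```

      Resetting is a single store, which is why per-frame pool resets are a fast path even though per-draw descriptor updates are not.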

  • Solution: Caching descriptor sets :

    • A major way to reduce descriptor set updates is to re-use them as much as possible. Instead of calling vkResetDescriptorPool()  every frame, the app will keep the VkDescriptorSet  handles stored with some caching mechanism to access them.

    • The cache could be a hashmap with the contents of the descriptor set (images, buffers) as key. This approach is used in our framework by default. It is possible to remove another level of indirection by storing descriptor set handles directly in the materials and/or meshes.

    • Caching descriptor sets has a dramatic effect on frame time for our CPU-heavy scene.

    • In this game on a 2019 mobile phone it went from 44ms (23fps) to 27ms (37fps). This is a 38% decrease in frame time.

    • This system is reasonably easy to implement for a static scene, but it becomes harder when you need to delete descriptor sets. Complex engines may implement techniques to figure out which descriptor sets have not been accessed for a certain number of frames, so they can be removed from the map.

    • This may correspond to calling vkFreeDescriptorSets() , but this solution poses another issue: in order to free individual descriptor sets, the pool has to be created with the DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET  flag. Mobile implementations may use a simpler allocator if that flag is not set, relying on the fact that pool memory will only be recycled as a block.

    • It is possible to avoid using that flag by updating descriptor sets instead of deleting them. The application can keep track of recycled descriptor sets and re-use one of them when a new one is requested.
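    • A minimal sketch of such a cache, assuming the key is a hash over the resource handles written into the set. All types and names below are hypothetical (handles are modeled as uint64_t ); on a miss the real code would run vkAllocateDescriptorSets()  + vkUpdateDescriptorSets()  and insert the result:

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <string.h>

      #define CACHE_SLOTS 64  /* power of two, linear probing */

      /* Key = hash of the resource handles written into the set; value = the
       * "VkDescriptorSet" (modeled as uint64_t) allocated the first time. */
      typedef struct {
          uint64_t key;  /* 0 = empty slot */
          uint64_t set;
      } CacheEntry;

      typedef struct { CacheEntry slots[CACHE_SLOTS]; } SetCache;

      /* FNV-1a over the image/buffer handles bound into the set. */
      static uint64_t hash_set_contents(const uint64_t *handles, size_t count) {
          uint64_t h = 1469598103934665603ull;
          for (size_t i = 0; i < count; i++) {
              h ^= handles[i];
              h *= 1099511628211ull;
          }
          return h ? h : 1;  /* reserve 0 for "empty" */
      }

      /* Returns 1 on a hit (no allocate/update needed), 0 on a miss. */
      static int cache_lookup(const SetCache *c, uint64_t key, uint64_t *out_set) {
          size_t i = (size_t)(key & (CACHE_SLOTS - 1));
          while (c->slots[i].key != 0) {
              if (c->slots[i].key == key) { *out_set = c->slots[i].set; return 1; }
              i = (i + 1) & (CACHE_SLOTS - 1);
          }
          return 0;
      }

      /* On a miss the caller allocates + writes the real set, then inserts it. */
      static void cache_insert(SetCache *c, uint64_t key, uint64_t set) {
          size_t i = (size_t)(key & (CACHE_SLOTS - 1));
          while (c->slots[i].key != 0 && c->slots[i].key != key)
              i = (i + 1) & (CACHE_SLOTS - 1);
          c->slots[i].key = key;
          c->slots[i].set = set;
      }
      ```

      A production cache would also need eviction (the "not accessed for N frames" scheme above) and collision handling on the full key, not just the hash.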

  • Solution: One buffer per-frame :

    • We will now explore an alternative approach that is in some ways complementary to descriptor caching. Especially for applications in which descriptor caching is not feasible, buffer management is another lever for optimizing performance.

    • As discussed at the beginning, each rendered object will typically need some uniform data that must be pushed to the GPU somehow. A straightforward approach is to store a VkBuffer  per object and update that data each frame.

    • This already poses an interesting question: is one buffer enough? The problem is that this data will change dynamically and will be in use by the GPU while the frame is in flight.

    • Since we do not want to flush the GPU pipeline between each frame, we will need to keep several copies of each buffer, one for each frame in flight.

    • Another similar option is to use just one buffer per object, but with a size equal to num_frames * buffer_size , then offset it dynamically based on the frame index.

      • For each frame, one buffer per object is created and filled with data. This means that we will have many descriptor sets to create, since every object will need one that points to its VkBuffer . Furthermore, we will have to update many buffers separately, meaning we cannot control their memory layout and we might lose some optimization opportunities with caching.

    • We can address both problems by reverting the approach: instead of having a VkBuffer  per object containing per-frame data, we will have a VkBuffer  per frame containing per-object data. The buffer will be cleared at the beginning of the frame, then each object will record its data and will receive a dynamic offset to be used at vkCmdBindDescriptorSets()  time.

    • With this approach we will need fewer descriptor sets, as more objects can share the same one: they will all reference the same VkBuffer , but at different dynamic offsets. Furthermore, we can control the memory layout within the buffer.

    • Using a single large VkBuffer  in this case shows a performance improvement similar to descriptor set caching.

    • For this relatively simple scene stacking the two approaches does not provide a further performance boost, but for a more complex case they do stack nicely:

      • Descriptor caching is necessary when the number of descriptor sets is not just due to VkBuffer s with uniform data, for example if the scene uses a large amount of materials/textures.

      • Buffer management will help reduce the overall number of descriptor sets, thus cache pressure will be reduced and the cache itself will be smaller.

    • (2025-09-08)

      • I personally liked this technique much more than descriptor caching.

      • It sounds more concrete than fiddling with descriptor sets.

      • Reminds me of Buffer Device Address.
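    • The bookkeeping for the per-frame buffer reduces to offset arithmetic. A sketch, assuming a hypothetical 256-byte minUniformBufferOffsetAlignment  (the real value comes from VkPhysicalDeviceLimits ): each object's block is padded to the alignment, and the padded stride times the object index becomes the dynamic offset passed at bind time.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Vulkan requires every dynamic offset to be a multiple of
       * minUniformBufferOffsetAlignment, so the per-object stride is the
       * object size rounded up to that alignment. Names are illustrative. */

      static uint32_t align_up(uint32_t value, uint32_t alignment) {
          /* alignment must be a power of two (Vulkan guarantees this) */
          return (value + alignment - 1) & ~(alignment - 1);
      }

      /* Dynamic offset for one object inside the frame's buffer. */
      static uint32_t object_dynamic_offset(uint32_t object_index,
                                            uint32_t object_size,
                                            uint32_t min_ubo_alignment) {
          return object_index * align_up(object_size, min_ubo_alignment);
      }

      /* Total size of one frame's buffer. The single-buffer variant described
       * above would multiply by frames-in-flight and add frame_index times
       * this size as a base offset. */
      static uint32_t frame_buffer_size(uint32_t object_count,
                                        uint32_t object_size,
                                        uint32_t min_ubo_alignment) {
          return object_count * align_up(object_size, min_ubo_alignment);
      }
      ```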

  • Do

    • Update already allocated but no longer referenced descriptor sets, instead of resetting descriptor pools and reallocating new descriptor sets.

    • Prefer reusing already allocated descriptor sets, and not updating them with the same information every time.

    • Consider caching your descriptor sets when feasible.

    • Consider using a single (or few) VkBuffer  per frame with dynamic offsets.

    • Batch calls to vkAllocateDescriptorSets  if possible – on some drivers, each call has measurable overhead, so if you need to allocate multiple sets, allocating them in one call can be faster.

    • To update descriptor sets, either use vkUpdateDescriptorSets  with a descriptor write array, or use vkUpdateDescriptorSetWithTemplate  from Vulkan 1.1. Using the descriptor copy functionality of vkUpdateDescriptorSets  is tempting with dynamic descriptor management for copying most descriptors out of a previously allocated array, but this can be slow on drivers that allocate descriptors out of write-combined memory. Descriptor templates can reduce the amount of work the application needs to do to perform updates – since in this scheme you need to read descriptor information out of shadow state maintained by the application, descriptor templates allow you to tell the driver the layout of your shadow state, making updates substantially faster on some drivers.

    • Prefer dynamic uniform buffers to updating uniform buffer descriptors. Dynamic uniform buffers allow you to specify offsets into buffer objects via the pDynamicOffsets  argument of vkCmdBindDescriptorSets  without allocating and updating new descriptors. This works well with dynamic constant management where constants for draw calls are allocated out of large uniform buffers; it substantially reduces CPU overhead and can be more efficient on the GPU. While on some GPUs the number of dynamic buffers must be kept small to avoid extra overhead in the driver, one or two dynamic uniform buffers should work well in this scheme on all architectures.

    • On some drivers, unfortunately the allocate & update path is not very optimal – on some mobile hardware, it may make sense to cache descriptor sets based on the descriptors they contain if they can be reused later in the frame.

Descriptor Set Layout

  • Contains the information about what that descriptor set holds.

  • Specifies the types of resources that are going to be accessed by the pipeline, just like a render pass specifies the types of attachments that will be accessed.

  • How many :

    • You need to specify a descriptor set layout for each descriptor set when creating the pipeline layout.

      • You can use this feature to put descriptors that vary per-object and descriptors that are shared into separate descriptor sets.

      • In that case, you avoid rebinding most of the descriptors across draw calls, which is potentially more efficient.

    • Since the buffer structure is identical across frames, one layout suffices.

      • Create only 1 descriptor set layout, regardless of frames in-flight.

      • This layout defines the type of resource (e.g., DESCRIPTOR_TYPE_UNIFORM_BUFFER ) and its binding point.

  • VkDescriptorSetLayout .

    • Opaque handle to a descriptor set layout object.

    • Is defined by an array of zero or more descriptor bindings.

    • VkDescriptorSetLayoutBinding .

      • Structure specifying a descriptor set layout binding.

      • Each individual descriptor binding is specified by a descriptor type, a count (array size) of the number of descriptors in the binding, a set of shader stages that can access the binding, and (if using immutable samplers) an array of sampler descriptors.

      • Bindings that are not specified have a descriptorCount  and stageFlags  of zero, and the value of descriptorType  is undefined.

      • binding

        • Is the binding number of this entry and corresponds to a resource of the same binding number in the shader stages.

        • This is the binding number referenced in the shader, e.g. layout(binding = 0) .

      • descriptorType

        • Is a VkDescriptorType  specifying which type of resource descriptors are used for this binding.

      • descriptorCount

        • Insight :

          • It's a descriptor count, not a descriptor SET count !! It just specifies how many resources are expected to be in that binding.

          • It makes complete sense to be used for arrays.

          • Caio:

            • What happens if the values don't match? For example, trying to get the index 5 of the array, when the binding was described having descriptorCount = 1  ?

          • Oni:

            • I don't know if this is specified. I guess it's only going to update the first element. So you're going to read bogus data. Maybe it changes between different drivers, no idea.

        • What value to use :

          • An MVP transformation fits in a single uniform buffer, so we use a descriptorCount  of 1 .

          • In other words, a whole struct counts as 1 .

        • Is the number of descriptors contained in the binding, accessed in a shader as an array.

          • Except if descriptorType  is DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK  in which case descriptorCount  is the size in bytes of the inline uniform block.

        • If descriptorCount  is zero this binding entry is reserved and the resource must  not be accessed from any stage via this binding within any pipeline using the set layout.

        • It is possible for the shader variable to represent an array of uniform buffer objects, and this property specifies the number of values in the array.

        • Examples :

          • This could be used to specify a transformation for each of the bones in a skeleton for skeletal animation.

      • stageFlags

        • Is a bitmask of VkShaderStageFlagBits  specifying which pipeline shader stages can  access a resource for this binding.

          • SHADER_STAGE_ALL  is a shorthand specifying all defined shader stages, including any additional stages defined by extensions.

        • If a shader stage is not included in stageFlags , then a resource must  not be accessed from that stage via this binding within any pipeline using the set layout.

        • Other than input attachments which are limited to the fragment shader, there are no limitations on what combinations of stages can  use a descriptor binding, and in particular a binding can  be used by both graphics stages and the compute stage.

      • pImmutableSamplers

        • Affects initialization of samplers.

        • If descriptorType  specifies a DESCRIPTOR_TYPE_SAMPLER  or DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER  type descriptor, then pImmutableSamplers   can  be used to initialize a set of immutable samplers .

        • If descriptorType  is not one of these descriptor types, then pImmutableSamplers  is ignored .

        • Immutable samplers are permanently bound into the set layout and must  not be changed; updating a DESCRIPTOR_TYPE_SAMPLER  descriptor with immutable samplers is not allowed and updates to a DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER  descriptor with immutable samplers does not modify the samplers (the image views are updated, but the sampler updates are ignored).

        • If pImmutableSamplers  is not NULL , then it is a pointer to an array of sampler handles that will be copied into the set layout and used for the corresponding binding. Only the sampler handles are copied; the sampler objects must  not be destroyed before the final use of the set layout and any descriptor pools and sets created using it.

        • If pImmutableSamplers  is NULL , then the sampler slots are dynamic and sampler handles must  be bound into descriptor sets using this layout.

    • VkDescriptorSetLayoutCreateInfo .

      • pBindings

        • A pointer to an array of VkDescriptorSetLayoutBinding  structures.

      • bindingCount

        • Is the number of elements in pBindings .

      • flags

    • vkCreateDescriptorSetLayout() .

      • Create a new descriptor set layout.

      • pCreateInfo

      • pAllocator

      • pSetLayout

        • Is a pointer to a VkDescriptorSetLayout  handle in which the resulting descriptor set layout object is returned.

  • VkPipelineLayoutCreateInfo .

    • Structure specifying the parameters of a newly created pipeline layout object.

    • setLayoutCount

      • Is the number of descriptor sets included in the pipeline layout.

      • How it works :

        • It's possible to have multiple descriptor sets ( set = 0 , set = 1 , etc).

        • "You can have set = 0 being a set that is always bound and never changes, set = 1 is something specific to the current object being rendered, etc."

    • pSetLayouts

      • Is a pointer to an array of VkDescriptorSetLayout  objects.

      • The implementation must  not access these objects outside of the duration of the command this structure is passed to.
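    • To show the shape of the two structures together, here is a sketch using mock stand-ins for the Vulkan types (field names mirror the real API, but the enum and bit values are placeholders): one UBO binding at binding 0 for the vertex stage, then a pipeline layout that takes two set layouts, where the array index is the set number.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Mock stand-ins; real code uses VkDescriptorSetLayoutBinding and
       * VkPipelineLayoutCreateInfo with proper sType/pNext fields. */
      typedef enum { MOCK_DESCRIPTOR_TYPE_UNIFORM_BUFFER = 6 } MockDescriptorType;
      enum { MOCK_SHADER_STAGE_VERTEX_BIT = 0x1 };

      typedef struct {
          uint32_t           binding;          /* matches layout(binding = N) in GLSL */
          MockDescriptorType descriptorType;
          uint32_t           descriptorCount;  /* 1 unless the shader declares an array */
          uint32_t           stageFlags;
          const void        *pImmutableSamplers;
      } MockDescriptorSetLayoutBinding;

      typedef struct {
          uint32_t        setLayoutCount;  /* index in pSetLayouts = set number in GLSL */
          const uint64_t *pSetLayouts;     /* VkDescriptorSetLayout handles */
      } MockPipelineLayoutCreateInfo;

      /* One UBO at the given binding, visible to the vertex stage (e.g. an
       * MVP block): a whole struct counts as descriptorCount = 1. */
      static MockDescriptorSetLayoutBinding make_ubo_binding(uint32_t binding_index) {
          MockDescriptorSetLayoutBinding b;
          b.binding            = binding_index;
          b.descriptorType     = MOCK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
          b.descriptorCount    = 1;
          b.stageFlags         = MOCK_SHADER_STAGE_VERTEX_BIT;
          b.pImmutableSamplers = 0;  /* only meaningful for sampler descriptor types */
          return b;
      }

      static MockPipelineLayoutCreateInfo make_pipeline_layout_info(
              const uint64_t *layouts, uint32_t count) {
          MockPipelineLayoutCreateInfo info;
          info.setLayoutCount = count;
          info.pSetLayouts    = layouts;
          return info;
      }
      ```

      Real code would pass these through vkCreateDescriptorSetLayout()  and vkCreatePipelineLayout() ; here set 0 could be the global layout and set 1 the per-object one.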

Binding

  • Descriptor state is tracked only inside a command buffer; descriptor sets are always bound at command buffer level, and their state is local to the command buffer.

    • They are not bound at queue level or global level, only to command buffers.

  • Which set index to choose :

    • According to GPU vendors, each descriptor set slot has a cost, so the fewer we have, the better.

    • "Organize shader inputs into "sets" by update frequency."

    • Rarely changes -> low index.

    • Changes frequently -> high index.

    • Usually Descriptor Set 0 is used to always bind some global scene data, which will contain some uniform buffers and some special textures, and Descriptor Set 1 will be used for per-object data.

  • vkCmdBindDescriptorSets .

    • It needs to be done before the vkCmdDrawIndexed()  calls, for example.

    • commandBuffer

      • Is the command buffer that the descriptor sets will be bound to.

    • pipelineBindPoint

      • Is a VkPipelineBindPoint  indicating the type of the pipeline that will use the descriptors. There is a separate set of bind points for each pipeline type, so binding one does not disturb the others.

      • Unlike vertex and index buffers, descriptor sets are not unique to graphics pipelines, therefore, we need to specify if we want to bind descriptor sets to the graphics or compute pipeline.

      • Examples :

        • A raytracing command takes the currently bound descriptors from the raytracing bind point.

        • A draw command takes the currently bound descriptors from the graphics bind point.

        • The two don't interfere with each other.

    • layout

    • firstSet

      • Is the set number  of the first descriptor set  to be bound.

    • descriptorSetCount

      • Is the number of elements in the pDescriptorSets  array.

    • pDescriptorSets

      • Is a pointer to an array of handles to VkDescriptorSet  objects describing the descriptor sets to bind.

    • dynamicOffsetCount

      • Is the number of dynamic offsets in the pDynamicOffsets  array.

    • pDynamicOffsets

      • Is a pointer to an array of uint32_t  values specifying dynamic offsets.
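    • A sketch of the per-draw pattern these parameters enable, with a hypothetical recorder in place of the real call: set 0 (globals) is bound once, and each draw re-binds only set 1 at a different dynamic offset instead of binding a different descriptor set.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Hypothetical recorder standing in for vkCmdBindDescriptorSets: it
       * only captures the arguments so the parameter roles can be seen. */
      typedef struct {
          uint32_t first_set;       /* firstSet: set number of the first set bound */
          uint32_t set_count;       /* descriptorSetCount */
          uint32_t dynamic_offset;  /* the single entry of pDynamicOffsets */
      } RecordedBind;

      /* Per-draw: re-bind only set 1 (per-object data) at a new dynamic
       * offset; set 0 bound earlier is undisturbed because firstSet = 1. */
      static RecordedBind record_bind_per_object(uint32_t object_index,
                                                 uint32_t aligned_object_size) {
          RecordedBind r;
          r.first_set      = 1;
          r.set_count      = 1;
          r.dynamic_offset = object_index * aligned_object_size;
          return r;
      }
      ```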

Strategy: Descriptor Indexing ( EXT_descriptor_indexing )

Plan
  • SSBOs and UBOs.

    • Can I just put different data without restriction?

      • Yes. See the SSBO section for that.

    • SSBOs or UBOs?

      • Using storage buffers exclusively instead of uniform buffers can increase GPU time on some architectures.

      • I'll use SSBO, as that was the general recommendation.

      • Maybe I'll mix both.

  • Material Data:

    • The Material index is used to look up material data from material storage buffer. The textures can then be accessed using the indices from the material data and the descriptor array.

    • Could be sent via push constants, but if I choose to go for indirect rendering (I should), then I cannot use push constants. I'd use the instance index (or similar) to index into a []Material_Data .

  • Model Matrix / Transforms:

    • Same as material data. I can send via push constants if direct drawing, or via []model_matrix  if indirect drawing.

  • Globals:

    • Camera view/proj, lights, ambient, etc.

    • I could just bind this once as well.

  • Vertex:

    • Indirect vs Full bindless:

      • I'm not sure. I'll use Indirect Drawing for now. ChatGPT deep search didn't give me much.

    1. Indirect Drawing:

      • For indirect drawing, it makes sense to just vkCmdBindIndexBuffer , as I NEED the vertex shader to be called the number of times I specified.

      • Plan: go for bindless first, drawing direct. Instead of using the instanceID  or similar, I just send the draw_data index via push constants. This way, the shader will be completely finalized; then I batch the draws via draw indirect and use the instanceID  instead of the push-constant ID.

      • Indirect Drawing will be the last thing

      • Why not invert and do indirect first? I cannot do that, as the instanceID  is useless without a bindless design! I NEED a use for the ID, as I cannot bind descriptor sets or push constants for each individual draw! Bindless first is a MUST.

    2. Full bindless:

      • Using a large index buffer: We need to bind index data. If, just like the vertex data, index data is allocated in one large index buffer, we only need to bind it once using vkCmdBindIndexBuffer .

      • While Vulkan provides a first-class way to specify vertex data by calling vkCmdBindVertexBuffers , having to bind vertex buffers per-draw would not work for a fully bindless design.

        • Additionally, some hardware doesn’t support vertex buffers as a first-class entity, and the driver has to emulate vertex buffer binding, which causes some CPU-side slowdowns when using vkCmdBindVertexBuffers .

      • In a fully bindless design, we need to assume that all vertex buffers are suballocated in one large buffer and either use per-draw vertex offsets ( vertexOffset  argument to vkCmdDrawIndexed ) to have hardware fetch data from it, or pass an offset in this buffer to the shader with each draw call and fetch data from the buffer in the shader. Both approaches can work well, and might be more or less efficient depending on the GPU.
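      • A sketch of the CPU-side command this leads to. The struct below mirrors VkDrawIndexedIndirectCommand  field-for-field (the helper name is made up); using firstInstance  to carry the draw-data index assumes the drawIndirectFirstInstance  feature, so the shader can read it back via gl_InstanceIndex  (or use gl_DrawIDARB  as in the snippets below).

        ```c
        #include <assert.h>
        #include <stdint.h>

        /* Field order matches the real VkDrawIndexedIndirectCommand, so an
         * array of these can be written straight into the indirect buffer. */
        typedef struct {
            uint32_t indexCount;
            uint32_t instanceCount;
            uint32_t firstIndex;    /* offset (in indices) into the shared index buffer */
            int32_t  vertexOffset;  /* base vertex inside the one large vertex buffer */
            uint32_t firstInstance; /* here: the draw_data index for the shader */
        } DrawIndexedIndirectCommand;

        static DrawIndexedIndirectCommand make_draw(uint32_t index_count,
                                                    uint32_t first_index,
                                                    int32_t base_vertex,
                                                    uint32_t draw_data_index) {
            DrawIndexedIndirectCommand c;
            c.indexCount    = index_count;
            c.instanceCount = 1;
            c.firstIndex    = first_index;
            c.vertexOffset  = base_vertex;
            c.firstInstance = draw_data_index;
            return c;
        }
        ```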

    3. Mesh Shaders.

      • Mesh shaders are probably what is most true to the bindless strategy, but I won't go that way yet (too soon, too new).

    4. Compute

      • Maybe I could use a compute shader to do this for me, but then I'd lose the rasterizer.

  • Draw Data:

    • Indices to index into the other arrays.

    struct DrawData
    {
        uint materialIndex;
        uint transformOffset;
        uint vertexOffset;
        uint unused0; // vec4 padding
    
        // ... extra gameplay data goes here
    };
    
    • Vertex Shader:

      DrawData dd = drawData[gl_DrawIDARB];
      TransformData td = transformData[dd.transformOffset];
      vec4 positionLocal = vec4(positionData[gl_VertexIndex + dd.vertexOffset], 1.0);
      vec3 positionWorld = mat4x3(td.transform[0], td.transform[1], td.transform[2]) * positionLocal;
      
    • Frag Shader:

      DrawData dd = drawData[drawId];
      MaterialData md = materialData[dd.materialIndex];
      vec4 albedo = texture(sampler2D(materialTextures[md.albedoTexture], albedoSampler), uv * vec2(md.tilingX, md.tilingY));
      
  • Slots:

    • Will the texture buffer and the material data buffer live in the same set 0, or should they be split into sets 0/1?

    • Probably every bind is on descriptor set 0.

    • The slots are based on frequency, but every single binding I'm talking about might just be bound once globally without problems.

  • Overall:

    • []textures

    • []material_data

      • uv, flip, modulate, etc.

    • []model_matrices

      • transforms.

    • []draw_data

      • Indices to index into the other arrays.

    • vertex/indices

      • As input attributes, to then use Indirect Drawing.

About
  • Descriptor indexing is also known by the term "bindless", which refers to the fact that binding individual descriptor sets and descriptors is no longer the primary way we keep shader pipelines fed. Instead, we can bind a huge descriptor set once and just index into a large number of descriptors.

  • Adds a lot  of flexibility to how resources are accessed.

  • "Bindless algorithms" are generally built around this flexibility where we either index freely into a lot of descriptors at once, or update descriptors where we please. In this model, "binding" descriptors is not a concern anymore.

  • The core functionality of this extension is that we can treat descriptor memory as one massive array, and we can freely access any resource we want at any time, by indexing.

  • If an array is large enough, an index into that array is indistinguishable from a pointer.

  • At most, we need to write/copy descriptors to where we need them and we can now consider descriptors more like memory blobs rather than highly structured API objects.

  • The introduction of descriptor indexing revealed that the descriptor model is all just smoke and mirrors. A descriptor is just a blob of binary data that the GPU can interpret in some meaningful way. The API calls to manage descriptors really just boil down to “copy magic bits here.”

  • Support :

    • Descriptor Indexing was created in 2018, so all hardware 2018+ should support it.

    • Core in Vulkan 1.2+

    • Limits queried using VkPhysicalDeviceDescriptorIndexingPropertiesEXT .

    • Features queried using VkPhysicalDeviceDescriptorIndexingFeaturesEXT .

    • Features toggled using VkPhysicalDeviceDescriptorIndexingFeaturesEXT .

  • Required for :

    • Raytracing.

    • Many GPU Driven Rendering approaches.

  • Advantages :

    • No costly transfer of descriptors to the GPU every frame (this shows up as a lot of time spent in vkUpdateDescriptorSets ).

    • More flexible / dynamic rendering architecture

    • No manual tracking of per-object resource groups

    • Updating matrices and material data can be done in bulk before command recording

    • CPU and GPU refer to resources the same way, by index

    • GPU can store Texture IDs in a buffer for reference later in the frame – many uses

    • Easy Vertex Pulling – gets rid of binding vertex buffers

    • Write resource indexes from one shader into a buffer that another shader reads & uses

    • G-Buffer can use material ID instead of values

    • Terrain Splatmap contains material IDs allowing many materials to be used, instead of 4

    • And more…

  • Disadvantages :

    • Requires hardware support

      • May be too new for widespread use

      • Different “feature levels” can help ease transition

    • Different Performance Penalties

      • Array indexing can cause memory indirections

        • Fetching texture descriptors from an array, indexed by material data that is itself indexed by a material index, can add an extra indirection on the GPU compared to some alternative designs

    • “With great power comes great responsibility”

      • GPU can't verify that valid descriptors are bound

      • Validation is costlier: happens inside shaders

      • Can be difficult to debug

      • Descriptor management is up to the Application

    • On some hardware, various descriptor set limits may make this technique impractical to implement; to be able to index an arbitrary texture dynamically from the shader, maxPerStageDescriptorSampledImages  should be large enough to accommodate all material textures - while many desktop drivers expose a large limit here, the specification only guarantees a limit of 16, so bindless remains out of reach on some hardware that otherwise supports Vulkan.

  • Comparison: Indexing resources without the extension :

    • Descriptor Indexing, explanation of "dynamic non-uniform" .

      • Good read.

    • Constant Indexing :

      layout(set = 0, binding = 0) uniform sampler2D Tex[4];
      
      texture(Tex[0], ...);
      texture(Tex[2], ...);
      
      // We can trivially flatten a constant-indexed array into individual resources,
      // so, constant indexing requires no fancy hardware indexing support.
      layout(set = 0, binding = 0) uniform sampler2D Tex0;
      layout(set = 0, binding = 1) uniform sampler2D Tex1;
      layout(set = 0, binding = 2) uniform sampler2D Tex2;
      layout(set = 0, binding = 3) uniform sampler2D Tex3;
      
    • Image Array Dynamic Indexing :

      • The dynamic indexing features allow us to use a non-constant expression to index an array.

        • This has been supported since Vulkan 1.0.

      • The restriction is that the index must be dynamically uniform .

      layout(set = 0, binding = 0) uniform sampler2D Tex[4];
      
      texture(Tex[dynamically_uniform_expression], ...);
      
    • Non-uniform vs Texture Atlas vs Texture Array :

      • Accessing arbitrary textures in a draw call is not a new problem, and graphics programmers have found ways over the years to workaround restrictions in older APIs. Rather than having multiple textures, it is technically possible to pack multiple textures into one texture resource, and sample from the correct part of the texture. This kind of technique is typically referred to as "texture atlas". Texture arrays (e.g. sampler2DArray) is another feature which can be used for similar purposes.

      • Problems with atlas:

        • Mip-mapping is hard to implement, and most likely must be done manually with derivatives and math.

        • Anisotropic filtering is basically impossible.

        • Any sampler addressing mode other than CLAMP_TO_EDGE  is very awkward to implement.

        • Cannot use different texture formats.

      • Problems with texture array:

        • All resolutions must match.

        • Number of array layers is limited (just 256 in min-spec).

        • Cannot use different texture formats.

      • Non-uniform indexing solves these issues since we can freely use multiple sampled image descriptors instead. Atlases and texture arrays still have their place. There are many use cases where these restrictions do not cause problems.

      • Non-uniform indexing is not just limited to textures (although that is the most relevant use case). Any descriptor type can be used as long as the device supports it.

Features
  • Update-after-bind :

    • In Vulkan, you generally have to create a VkDescriptorSet  and update it with all descriptors before you call vkCmdBindDescriptorSets . After a set is bound, the descriptor set cannot be updated again until the GPU is done using it. This gives drivers a lot of flexibility in how they access the descriptors. They are free to copy the descriptors and pack them somewhere else, promote them to hardware registers, the list goes on.

    • Update-After-Bind gives flexibility to applications instead. Descriptors can be updated at any time as long as they are not actually accessed by the GPU. Descriptors can also be updated while the descriptor set is bound to a command buffer, which enables a "streaming" use case.

      • This means the application doesn’t have to unbind or re-record command buffers just to change descriptors—reducing CPU overhead in some streaming-resource scenarios.

    • Concurrent Updates :

      • Another "hidden" feature of update-after-bind is that it is possible to update the descriptor set from multiple threads. This is very useful for true "bindless" since unrelated tasks might want to update descriptors in different parts of the streamed/bindless descriptor set.

  • Non-uniform indexing :

    • While update-after-bind adds flexibility to descriptor management, non-uniform indexing adds great flexibility for shaders.

    • It completely removes all restrictions on how we index into arrays, but we must notify our intent to the compiler.

    • Normally, drivers and hardware can assume that the dynamically uniform guarantee holds, and optimize for that case.

    • If we use the nonuniformEXT  decoration in GL_EXT_nonuniform_qualifier  we can let the compiler know that the guarantee does not necessarily hold, and the compiler will deal with it in the most efficient way possible for the target hardware. The rationale for having to annotate like this is that driver compiler backends would be forced to be more conservative than necessary if applications were not required to use nonuniformEXT .

    • When to use it :

      • The invocation group :

        • The invocation group is a set of threads (invocations) which work together to perform a task.

        • In graphics pipelines, the invocation group is all threads which are spawned as part of a single draw command. This includes multiple instances, and for multi-draw-indirect it is limited to a single gl_DrawID .

        • In compute pipelines, the invocation group is a single workgroup, so it’s very easy to know when it is safe to avoid nonuniformEXT.

        • An expression is considered dynamically uniform  if all invocations in an invocation group have the same value.

          • In other words, dynamically uniform  means that the index is the same across all threads spawned by a draw command.

      • Interaction with Subgroups :

        • It is very easy to think that dynamically uniform just means "as long as the index is uniform in the subgroup, it’s fine!". This is certainly true for most (desktop) architectures, but not all.

        • It is technically possible that a value can be subgroup uniform, but still not dynamically uniform. Consider a case where we have a workgroup size of 128 threads, with a subgroup size of 32. Even if each subgroup does subgroupBroadcastFirst()  on the index, each subgroup might have different values, and thus, we still technically need nonuniformEXT  here. If you know that you have only one subgroup per workgroup however, subgroupBroadcastFirst()  is good enough.

        • The safe thing to do is to just add nonuniformEXT  if you cannot prove the dynamically uniform property. If the compiler knows that it only really cares about subgroup uniformity, it could trivially optimize away nonuniformEXT(subgroupBroadcastFirst())  anyways.

        • A common reason to use subgroups here in the first place is that they were an old workaround for the lack of true non-uniform indexing, especially on desktop GPUs. A common pattern would be something like:
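
That common pattern is the subgroup "waterfall" loop: scalarize the index by looping over the distinct values present in the subgroup. A hedged GLSL sketch (`Textures` is a hypothetical bindless array; `subgroupBroadcastFirst` comes from `GL_KHR_shader_subgroup_ballot`):

```glsl
// Hypothetical sketch: scalarize a divergent index with subgroup ops instead
// of using nonuniformEXT. Each loop iteration handles one distinct index value.
#extension GL_KHR_shader_subgroup_ballot : require

vec4 sample_waterfall(uint index, vec2 uv)
{
    for (;;)
    {
        // Broadcast the first active lane's index; the result is subgroup uniform.
        uint uniform_index = subgroupBroadcastFirst(index);
        if (uniform_index == index)
        {
            // Every lane taking this branch shares the same index value.
            // Lanes that return become inactive; the loop continues with the
            // remaining lanes until every distinct index has been handled.
            return texture(Textures[uniform_index], uv);
        }
    }
}
```

As the bullets above note, this only establishes subgroup uniformity, which is technically weaker than the dynamically uniform guarantee.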

Implementation
  • Examples :

    • odin_cool_engine:

      • odin_cool_engine/src/rp_ui.odin

        • It just sends an index to the compute pipeline via push constants.

      • odin_cool_engine/src/renderer.odin:725

        • It just sends an index to the compute pipeline via push constants.

    • Descriptor Indexing Sample .

  • Setup :

    1. Check availability of the extension through vk.EXT_DESCRIPTOR_INDEXING_EXTENSION_NAME  + vk.EnumerateDeviceExtensionProperties .

    2. Check supported features of the extension through vk.GetPhysicalDeviceFeatures2  + vk.PhysicalDeviceDescriptorIndexingFeatures  as the pNext  term.
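
A hedged Odin sketch of these two steps, assuming the same vendored `vk` bindings as the snippets below (`physical_device` is a placeholder handle):

```odin
// 1. Check availability of VK_EXT_descriptor_indexing on the device.
count: u32
vk.EnumerateDeviceExtensionProperties(physical_device, nil, &count, nil)
props := make([]vk.ExtensionProperties, count, context.temp_allocator)
vk.EnumerateDeviceExtensionProperties(physical_device, nil, &count, raw_data(props))

has_descriptor_indexing := false
for &p in props {
    if string(cstring(&p.extensionName[0])) == vk.EXT_DESCRIPTOR_INDEXING_EXTENSION_NAME {
        has_descriptor_indexing = true
        break
    }
}

// 2. Query the extension's individual features via the pNext chain.
indexing_features := vk.PhysicalDeviceDescriptorIndexingFeatures{
    sType = .PHYSICAL_DEVICE_DESCRIPTOR_INDEXING_FEATURES,
}
features2 := vk.PhysicalDeviceFeatures2{
    sType = .PHYSICAL_DEVICE_FEATURES_2,
    pNext = &indexing_features,
}
vk.GetPhysicalDeviceFeatures2(physical_device, &features2)
// Inspect e.g. indexing_features.runtimeDescriptorArray,
// indexing_features.descriptorBindingPartiallyBound, etc.
```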

  • VkDescriptorSetLayoutCreateInfo .

    • flags

      • UPDATE_AFTER_BIND_POOL

        • Specifies that descriptor sets using this layout must be allocated from a descriptor pool created with the UPDATE_AFTER_BIND  bit set.

        • Descriptor set layouts created with this bit set have alternate limits for the maximum number of descriptors per-stage and per-pipeline layout.

        • The non-UpdateAfterBind limits only count descriptors in sets created without this flag. The UpdateAfterBind limits count all descriptors, but the limits may be higher than the non-UpdateAfterBind limits.

  • VkDescriptorBindingFlagBits :

    • PARTIALLY_BOUND

      • Specifies that descriptors in this binding that are not dynamically used do not need to contain valid descriptors at the time the descriptors are consumed.

        • A descriptor is 'dynamically used' if any shader invocation executes an instruction that performs any memory access using the descriptor.

        • If a descriptor is not dynamically used, any resource referenced by the descriptor is not considered to be referenced during command execution.

      • This means it is not necessary to bind every descriptor; a descriptor array binding can function even when not all array elements are written or valid.

      • This is critical if we want to make use of descriptor "streaming". A descriptor only has to be bound if it is actually used by a shader.

      • Without this feature, if you have an array of N descriptors and your shader indexes [0..N-1], all descriptors must be valid; otherwise behavior is undefined even if the shader never touches the uninitialized ones.

      • When enabled, you only need to write descriptors that the shader will index. “Holes” in the array are allowed, provided shader indices never touch them.

      • Use this when you want to leave “holes” in a large descriptor array (i.e. not update every element) without pre-filling unused slots with a fallback texture. When this flag is set, descriptors that are not dynamically used by the shader need not contain valid descriptors — but if the shader actually accesses an unwritten descriptor you still get undefined/invalid results. This is a convenience to avoid writing N fallback descriptors each time.

    • VARIABLE_DESCRIPTOR_COUNT

      • Allows a descriptor binding to have a variable number of descriptors.

      • Use a variable number of descriptors in an array.

      • Specifies that this is a variable-sized descriptor binding, whose size will be specified when a descriptor set is allocated using this layout.

      • This must only  be used for the last binding in the descriptor set layout (i.e. the binding with the largest value of binding).

      • vk.DescriptorSetLayoutBinding.descriptorCount

        • The value is treated as an upper bound on the size of the binding.

        • The actual count is supplied at allocation time via VkDescriptorSetVariableDescriptorCountAllocateInfo .

        • For the purposes of counting against limits such as maxDescriptorSet  and maxPerStageDescriptor , the full value of descriptorCount  is counted, except for descriptor bindings with a descriptor type of DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK , when VkDescriptorSetLayoutCreateInfo.flags  does not contain DESCRIPTOR_SET_LAYOUT_CREATE_DESCRIPTOR_BUFFER . In this case, descriptorCount  specifies the upper bound on the byte size of the binding; thus it counts against the maxInlineUniformBlockSize  and maxInlineUniformTotalSize  limits instead.

      • When we later allocate the descriptor set, we can declare how large we want the array to be.

      • Be aware that there is a global limit to the number of descriptors that can be allocated at any one time.

      • This is extremely useful when using EXT_descriptor_indexing , since we do not have to allocate a fixed amount of descriptors for each descriptor set.

      • In many cases, it is far more flexible to use runtime sized descriptor arrays.

      • Use this when you want the shader-visible length of a descriptor-array binding to be chosen per descriptor set at allocation time (i.e. different sets expose different array lengths) instead of using a single compile-time/layout upper bound. At allocation you pass the actual count with VkDescriptorSetVariableDescriptorCountAllocateInfo . This reduces bookkeeping/pool usage and lets you avoid allocating the full upper bound for every set. Requires the descriptor-indexing feature to be enabled, and the variable-size binding must be the last binding in the set.

    • UPDATE_AFTER_BIND

      • Specifies that if descriptors in this binding are updated between when the descriptor set is bound in a command buffer and when that command buffer is submitted to a queue, then the submission will use the most recently set descriptors for this binding and the updates do not invalidate the command buffer. Descriptor bindings created with this flag are also partially exempt from the external synchronization requirement in vkUpdateDescriptorSetWithTemplateKHR  and vkUpdateDescriptorSets . Multiple descriptors with this flag set can be updated concurrently in different threads, though the same descriptor must not be updated concurrently by two threads. Descriptors with this flag set can be updated concurrently with the set being bound to a command buffer in another thread, but not concurrently with the set being reset or freed.

      • Update-after-bind is another critical component of descriptor indexing, which allows us to update descriptors after a descriptor set has been bound to a command buffer.

      • This is critical for streaming descriptors, but it also relaxes threading requirements: multiple threads can update descriptors concurrently on the same descriptor set.

      • UPDATE_AFTER_BIND  descriptors are somewhat of a precious resource, but the min-spec in Vulkan is at least 500k descriptors, which should be more than enough.

    • UPDATE_UNUSED_WHILE_PENDING

      • Specifies that descriptors in this binding can be updated after a command buffer has bound this descriptor set, or while a command buffer that uses this descriptor set is pending execution, as long as the descriptors that are updated are not used by those command buffers. Descriptor bindings created with this flag are also partially exempt from the external synchronization requirement in vkUpdateDescriptorSetWithTemplateKHR and vkUpdateDescriptorSets in the same way as for UPDATE_AFTER_BIND . If PARTIALLY_BOUND  is also set, then descriptors can be updated as long as they are not dynamically used by any shader invocations. If PARTIALLY_BOUND  is not set, then descriptors can be updated as long as they are not statically used by any shader invocations.

      • Update-Unused-While-Pending is somewhat subtle, and allows you to update a descriptor while a command buffer is executing.

      • The only restriction is that the descriptor cannot actually be accessed by the GPU.

    • UPDATE_AFTER_BIND  vs UPDATE_UNUSED_WHILE_PENDING

      • Both involve updates to descriptor sets after they are bound, but UPDATE_UNUSED_WHILE_PENDING  is the weaker requirement: it only concerns descriptors that are not used, whereas UPDATE_AFTER_BIND  requires the implementation to observe updates to descriptors that are used.

  • Enabling Non-Uniform Indexing :

    1. Enable runtimeDescriptorArray  and shaderSampledImageArrayNonUniformIndexing  (required for indexing an array of COMBINED_IMAGE_SAMPLER ), descriptorBindingPartiallyBound  (optional, to avoid undefined behavior on not fully populated arrays).

      • On Vulkan <1.2, the features must be enabled via vk.PhysicalDeviceDescriptorIndexingFeatures .

      • On Vulkan >=1.2, the features must be enabled via vk.PhysicalDeviceVulkan12Features .

        • If this is not followed, you'll get:

        [ERROR] --- vkCreateDevice(): pCreateInfo->pNext chain includes a VkPhysicalDeviceVulkan12Features structure, then it must not include a VkPhysicalDeviceDescriptorIndexingFeatures structure. The features in VkPhysicalDeviceDescriptorIndexingFeatures were promoted in Vulkan 1.2 and is also found in VkPhysicalDeviceVulkan12Features. To prevent one feature setting something to TRUE and the other to FALSE, only one struct containing the feature is allowed.
        pNext chain: VkDeviceCreateInfo::pNext -> [STRUCTURE_TYPE_LOADER_DEVICE_CREATE_INFO] -> [STRUCTURE_TYPE_LOADER_DEVICE_CREATE_INFO] -> [VkPhysicalDeviceVulkan13Features] -> [VkPhysicalDeviceVulkan12Features] -> [VkPhysicalDeviceDynamicRenderingUnusedAttachmentsFeaturesEXT] -> [VkPhysicalDeviceDescriptorIndexingFeatures].
        The Vulkan spec states: If the pNext chain includes a VkPhysicalDeviceVulkan12Features structure, then it must not include a VkPhysicalDevice8BitStorageFeatures, VkPhysicalDeviceShaderAtomicInt64Features, VkPhysicalDeviceShaderFloat16Int8Features, VkPhysicalDeviceDescriptorIndexingFeatures, VkPhysicalDeviceScalarBlockLayoutFeatures, VkPhysicalDeviceImagelessFramebufferFeatures, VkPhysicalDeviceUniformBufferStandardLayoutFeatures, VkPhysicalDeviceShaderSubgroupExtendedTypesFeatures, VkPhysicalDeviceSeparateDepthStencilLayoutsFeatures, VkPhysicalDeviceHostQueryResetFeatures, VkPhysicalDeviceTimelineSemaphoreFeatures, VkPhysicalDeviceBufferDeviceAddressFeatures, or VkPhysicalDeviceVulkanMemoryModelFeatures structure (https://vulkan.lunarg.com/doc/view/1.4.328.0/windows/antora/spec/latest/chapters/devsandqueues.html#VUID-VkDeviceCreateInfo-pNext-02830)
        
      vulkan12_features := vk.PhysicalDeviceVulkan12Features{
          // etc
      
          descriptorIndexing                        = true,
              // Descriptor Indexing: aggregate "all of descriptor indexing" flag.
              // This field only exists in PhysicalDeviceVulkan12Features (Vulkan >= 1.2);
              // the EXT struct exposes per-feature bits instead.
      
          runtimeDescriptorArray                    = true,
              // Descriptor Indexing:
      
          shaderSampledImageArrayNonUniformIndexing = true,
              // Descriptor Indexing: required for indexing an array of `COMBINED_IMAGE_SAMPLER`.
      
          descriptorBindingPartiallyBound           = true,
              // Descriptor Indexing: optional, to avoid undefined behavior on not fully populated arrays.
      
          descriptorBindingVariableDescriptorCount  = true,
              // Descriptor Indexing: Allows a descriptor binding to have a variable number of descriptors.
      
          // etc
      }
      
    2. In GLSL use the GL_EXT_nonuniform_qualifier  extension and wrap the index with nonuniformEXT(...)  (or apply nonuniformEXT  to the loaded value) so the compiler emits the SPIR-V NonUniformEXT  decoration.

    • In the shader :

      • Constructors and builtin functions, which all have return types that are not qualified by nonuniformEXT , will not generate nonuniform results.

        • Shaders need to use the constructor syntax (or assignment to a nonuniformEXT -qualified variable) to re-add the nonuniformEXT  qualifier to the result of builtin functions.

        • Correct:

          • It is important to note that to be 100% correct, we must use:

          • nonuniformEXT(sampler2D()) .

          • It is the final argument to a call like texture()  which determines if the access is to be considered non-uniform.

        • Wrong:

          • It is very common in the wild to see code like:

          • sampler2D(Textures[nonuniformEXT(in_texture_index)], ...)

          • This looks very similar to HLSL, but it is somewhat wrong.

          • Generally, it will work on drivers, but it is not technically correct.

        • Examples:

          • sampler2D()  is such a constructor, so we must add nonuniformEXT  afterwards.

            • out_frag_color = texture(nonuniformEXT(sampler2D(Textures[in_texture_index], ImmutableSampler)), in_uv);

      • Other use cases:

        • The nonuniform qualifier will propagate up to the final argument which is used in the load/store or atomic operation.

        • Examples:

          // At the top
          #extension GL_EXT_nonuniform_qualifier : require
          
          uniform UBO { vec4 data; } UBOs[];   
          vec4 foo = UBOs[nonuniformEXT(index)].data;
          
          buffer  SSBO { vec4 data; } SSBOs[]; 
          vec4 foo = SSBOs[nonuniformEXT(index)].data;
          
          uniform sampler2D Tex[];
          vec4 foo = texture(Tex[nonuniformEXT(index)], uv);
          
          uniform uimage2D Img[];              
          uint count = imageAtomicAdd(Img[nonuniformEXT(index)], uv, val);
          
          #version 450
          #extension GL_EXT_nonuniform_qualifier : require
          layout(local_size_x = 64) in;
          
          layout(set = 0, binding = 0) uniform sampler2D Combined[];
          layout(set = 1, binding = 0) uniform texture2D Tex[];
          layout(set = 2, binding = 0) uniform sampler Samp[];
          layout(set = 3, binding = 0) uniform U { vec4 v; } UBO[];
          layout(set = 4, binding = 0) buffer S { vec4 v; } SSBO[];
          layout(set = 5, binding = 0, r32ui) uniform uimage2D Img[];
          
          void main()
          {
              uint index = gl_GlobalInvocationID.x;
              vec2 uv = vec2(gl_GlobalInvocationID.yz) / 1024.0;
          
              vec4 a = textureLod(Combined[nonuniformEXT(index)], uv, 0.0);
              vec4 b = textureLod(nonuniformEXT(sampler2D(Tex[index], Samp[index])), uv, 0.0);
              vec4 c = UBO[nonuniformEXT(index)].v;
              vec4 d = SSBO[nonuniformEXT(index)].v;
          
              imageAtomicAdd(Img[nonuniformEXT(index)], ivec2(0), floatBitsToUint(a.x + b.y + c.z + d.w));
          }
          
      • Caveats:

        • LOD:

          • Using implicit LOD with nonuniformEXT can be spicy! If the threads in a quad do not have the same index, LOD might not be computed correctly.

          • The quadDivergentImplicitLod  property lets you know if it will work.

          • In this case however, it is completely fine, since the helper lanes in a quad must come from the same primitive, which all have the same flat fragment input.

      • Avoiding nonuniformEXT :

        • You might consider using subgroup operations to implement nonuniformEXT  on your own.

        • This is technically out of spec, since the SPIR-V specification states that to avoid nonuniformEXT , the shader must guarantee that the index is "dynamically uniform".

        • "Dynamically uniform" means the value is the same across all invocations in an "invocation group".

        • The invocation group is defined to be all invocations (threads) for:

          • An entire draw command (for graphics)

          • A single workgroup (for compute).

        • Avoiding nonuniformEXT  with clever programming is far more likely to succeed when writing compute shaders, since the workgroup boundary serves as a much easier boundary to control than entire draw commands.

        • It is often possible to match workgroup to subgroup 1:1, unlike graphics where you cannot control how quads are packed into subgroups at all.

        • The recommended approach here is to just let the compiler do its thing to avoid horrible bugs in the future.

  • Enabling Update-After-Bind :

    1. In VkDescriptorSetLayoutCreateInfo  we must pass down binding flags in a separate struct with pNext .

      bindings_count := len(stage_set_layout.bindings)
      descriptor_bindings_flags := make([]vk.DescriptorBindingFlagsEXT, bindings_count, context.temp_allocator)
      for i in 0..<len(descriptor_bindings_flags) {
          descriptor_bindings_flags[i] = { .PARTIALLY_BOUND }
      }
      descriptor_bindings_flags[bindings_count - 1] += { .VARIABLE_DESCRIPTOR_COUNT }
          // Only the last binding supports VARIABLE_DESCRIPTOR_COUNT.
      
      descriptor_binding_flags_create_info := vk.DescriptorSetLayoutBindingFlagsCreateInfoEXT{
          sType         = .DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO_EXT,
          bindingCount  = u32(bindings_count),
          pBindingFlags = raw_data(descriptor_bindings_flags),
          pNext         = nil,
      }
      descriptor_set_layout_create_info := vk.DescriptorSetLayoutCreateInfo{
          sType        = .DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
          flags        = {  },
      
          bindingCount = u32(bindings_count),
          pBindings    = raw_data(stage_set_layout.bindings),
      
          pNext        = &descriptor_binding_flags_create_info,
      }
      
      // Num Descriptors
      static constexpr uint32_t NumDescriptorsStreaming  = 2048;
      static constexpr uint32_t NumDescriptorsNonUniform = 64;
      
      // Pool
      uint32_t poolCount = NumDescriptorsStreaming + NumDescriptorsNonUniform;
      VkDescriptorPoolSize       pool_size = vkb::initializers::descriptor_pool_size(VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE, poolCount);
      VkDescriptorPoolCreateInfo pool      = vkb::initializers::descriptor_pool_create_info(1, &pool_size, 2);
      
      // Allocate
      VkDescriptorSetVariableDescriptorCountAllocateInfoEXT variable_info{};
      allocate_info.pNext              = &variable_info;
      
      variable_info.sType              = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_VARIABLE_DESCRIPTOR_COUNT_ALLOCATE_INFO_EXT;
      variable_info.descriptorSetCount = 1;
      variable_info.pDescriptorCounts = &NumDescriptorsStreaming;
      CHECK(vkAllocateDescriptorSets(get_device().get_handle(), &allocate_info, &descriptors.descriptor_set_update_after_bind));
      variable_info.pDescriptorCounts = &NumDescriptorsNonUniform;
      CHECK(vkAllocateDescriptorSets(get_device().get_handle(), &allocate_info, &descriptors.descriptor_set_nonuniform));
      
    2. The VkDescriptorPool  must also be created with UPDATE_AFTER_BIND . Note that there is a global limit to how many UPDATE_AFTER_BIND  descriptors can be allocated at any point. The min-spec here is 500k, which should be good enough.

Strategy: Descriptor Buffers ( EXT_descriptor_buffer )

  • Article .

  • Sample .

  • Released on (2022-11-21).

  • TLDR :

    • Descriptor sets are now backed by VkBuffer  objects where you memcpy  in descriptors. Delete VkDescriptorPool  and VkDescriptorSet  from the API, and have fun!

    • Performance is either equal or better.

  • Coming from Descriptor Indexing, where we use plain uints instead of actual descriptor sets, some design questions come up.

  • Do we assign one uint per descriptor, or do we try to group them together such that we only need to push one base offset?

  • If we go with the latter, we might end up having to copy descriptors around. If we go with one uint per descriptor, we just added extra indirection on the GPU. GPU throughput might suffer with the added latency.

  • On the other hand, having to group descriptors linearly one after the other can easily lead to copy hell. Copying descriptors is still an abstracted operation that requires API calls to perform, and we cannot perform it on the GPU. The overhead of all these calls in the driver can be quite significant, especially in API layering. I’ve seen up to 10 million calls to “copy descriptor” per second which adds up.

  • Managing descriptors really starts looking more and more like just any other memory management problem. Let’s try translating existing API concepts into what they really are under the hood.

  • vkCreateDescriptorPool

    • vkAllocateMemory . Memory type unknown, but likely HOST_VISIBLE  and DEVICE_LOCAL . Size of pool computed from pool entries.

  • vkAllocateDescriptorSets

    • Linear or arena allocation from pool. Size and alignment computed from VkDescriptorSetLayout .

  • vkUpdateDescriptorSets

    • Writes raw descriptor data by copying payload from VkImageView  / VkSampler  / VkBufferView . Write offset is deduced from VkDescriptorSetLayout  and binding. The VkDescriptorSet  contains a pointer to HOST_VISIBLE  mapped CPU memory. Copies are similar.

  • vkCmdBindDescriptorSets

    • Binds the GPU VA of the VkDescriptorSet  somehow.

  • The descriptor buffer API effectively removes VkDescriptorPool  and VkDescriptorSet . The APIs now expose lower level detail.

  • For example, there’s now a bunch of properties to query:

    typedef struct VkPhysicalDeviceDescriptorBufferPropertiesEXT {
        …
        size_t             samplerDescriptorSize;
        size_t             combinedImageSamplerDescriptorSize;
        size_t             sampledImageDescriptorSize;
        size_t             storageImageDescriptorSize;
        size_t             uniformTexelBufferDescriptorSize;
        size_t             robustUniformTexelBufferDescriptorSize;
        size_t             storageTexelBufferDescriptorSize;
        size_t             robustStorageTexelBufferDescriptorSize;
        size_t             uniformBufferDescriptorSize;
        size_t             robustUniformBufferDescriptorSize;
        size_t             storageBufferDescriptorSize;
        size_t             robustStorageBufferDescriptorSize;
        size_t             inputAttachmentDescriptorSize;
        size_t             accelerationStructureDescriptorSize;
        …
    } VkPhysicalDeviceDescriptorBufferPropertiesEXT;
    

Strategy: Push Descriptor ( VK_KHR_push_descriptor )

  • Promoted to core in Vulkan 1.4.

  • Last modified date: (2017-09-12).

  • This extension allows descriptors to be written into the command buffer, while the implementation is responsible for managing their memory. Push descriptors may enable easier porting from older APIs and in some cases can be more efficient than writing descriptors into descriptor sets.

  • Sample .

  • New Commands

    • vkCmdPushDescriptorSetKHR

  • If Vulkan Version 1.1 or VK_KHR_descriptor_update_template  is supported:

    • vkCmdPushDescriptorSetWithTemplateKHR

  • New Structures

    • Extending VkPhysicalDeviceProperties2 :

      • VkPhysicalDevicePushDescriptorPropertiesKHR

  • New Enum Constants

    • VK_KHR_PUSH_DESCRIPTOR_EXTENSION_NAME

    • VK_KHR_PUSH_DESCRIPTOR_SPEC_VERSION

    • Extending VkDescriptorSetLayoutCreateFlagBits :

      • VK_DESCRIPTOR_SET_LAYOUT_CREATE_PUSH_DESCRIPTOR_BIT_KHR

    • Extending VkStructureType:

      • VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PUSH_DESCRIPTOR_PROPERTIES_KHR

  • If Vulkan Version 1.1 or VK_KHR_descriptor_update_template is supported:

    • Extending VkDescriptorUpdateTemplateType :

      • VK_DESCRIPTOR_UPDATE_TEMPLATE_TYPE_PUSH_DESCRIPTORS_KHR
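
In use, the extension boils down to creating the set layout with the push-descriptor flag and then writing descriptors straight into the command buffer. A hedged Odin sketch (`cmd`, `pipeline_layout`, and `buffer` are placeholder handles; the set layout is assumed to have been created with VK_DESCRIPTOR_SET_LAYOUT_CREATE_PUSH_DESCRIPTOR_BIT_KHR ):

```odin
// No descriptor pool and no vkAllocateDescriptorSets: the write goes straight
// into the command buffer and the driver manages the backing memory.
buffer_info := vk.DescriptorBufferInfo{
    buffer = buffer,
    offset = 0,
    range  = 256, // placeholder size of the uniform data
}
write := vk.WriteDescriptorSet{
    sType           = .WRITE_DESCRIPTOR_SET,
    // dstSet is ignored for push descriptors.
    dstBinding      = 0,
    descriptorCount = 1,
    descriptorType  = .UNIFORM_BUFFER,
    pBufferInfo     = &buffer_info,
}
vk.CmdPushDescriptorSetKHR(cmd, .GRAPHICS, pipeline_layout, 0, 1, &write)
```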

Strategy: Bindful / Classic strategy (Slot-based / Frequency-based)

  • mna (midmidmid):

    • The reason you split up resources into multiple sets is actually to reduce  the cost of vkCmdBindDescriptorSets . The idea being that if you've got one set that holds scene-wide data and a different set that holds object-specific data, you only bind the scene stuff once  and then just leave it bound. Then the per-object updates go faster because you're pushing much smaller descriptor sets into whatever special silicon descriptor sets map to on your particular GPU. Note: there are rules about how you have to arrange your sets (so like the scene-wide one has to be at a lower index than the per-object one), and all of the pipelines you use must have compatible  layouts for the sets you aren't rebinding every time you switch to a different pipeline. Someone can correct me if I'm wrong, but if you switch to a pipeline that's got an incompatible layout for some descriptor set at index n  then all  descriptor sets at indices >= n  need to be rebound.

    • I think the only reason I'd change any of my stuff to bindless is if I hit however many hundreds of thousands of calls to vkCmdBindDescriptorSets  it takes for descriptors to be a per-frame bottleneck.

    • But I find descriptors pretty intuitive and easy to work with.

    • I didn't  find them easy to work with when I first  came to VK (from GL/D3D11-world), but now that I've got some scaffolding set up to manage them, they're easy sauce.

    • (They actually map pretty well to having worked with old  console GPUs where you manage the command queue directly and have to think about resource bindings in terms of physical registers on the GPU. It was helpful to have that background.)

    • If you're working with descriptor sets, then you have lots of little objects whose lifetimes you need to track and manage. Getting them grouped into the appropriate set of pools  cuts that number down to something that's not hard to manage. So, for me, I've got a dynamically allocated and recycled set of descriptor pools for stuff that changes every frame, and then I've got my materials grouped into pack files (for fast content loading) and each of those has one descriptor pool for all the sets for all of its materials. Easy peasy. For bindless, you need to figure out how you're going to divide up the big array of descriptors in your one mega set. There's different strategies for doing that. But you'll get a better description of them out of the bindless fans on the server.

    • Implementation-wise, I  don't think there's a huge complexity difference between the two approaches. Bindless might be conceptually  simpler since "it's just a big array" doesn't require as big of a mental shift as dividing resources up by usage and update frequency and thinking in those  terms.

  • In the “classic” model, before you draw or dispatch, you must bind each resource to a specific descriptor binding or slot.

  • Example:

    • vkCmdBindDescriptorSets(...)

    • Binding texture #0 for this draw, texture #1 for that draw, etc.

  • The shader uses a fixed binding index:

    • layout(set = 0, binding = 3) uniform sampler2D tex;

  • If you want to change which texture is used, you re-bind that descriptor.


Specialization Constants

  • Allows a constant value in SPIR-V to be specified at VkPipeline  creation time.

  • This is powerful as it replaces the idea of doing preprocessor macros in the high level shading language (GLSL, HLSL, etc).

  • A way to provide constant values to a SPIR-V shader at pipeline creation time so the compiler can constant-fold, inline, and eliminate branches.

    • This yields code equivalent to having compiled separate shader variants with those constant values baked in.

  • This is not Vulkan exclusive, but an optimization from SPIR-V. OpenGL 4.6 can also use this feature.

  • Sample .

  • UBOs and Push Constants suffer from limited optimizations during shader compilation. Specialization Constants can provide those optimizations:

    • Uniform buffer objects (UBOs) are one of the most common approaches when it is necessary to set values within a shader at run-time and are used in many tutorials. UBOs are pushed to the shader just prior to its execution, this is after shader compilation which occurs during vkCreateGraphicsPipelines . As these values are set after the shader has been compiled, the driver’s shader compiler has limited scope to perform optimizations to the shader during its compilation. This is because optimizations such as loop unrolling or unused code removal require the compiler to have knowledge of the values controlling them which is not possible with UBOs. Push constants also suffer from the same problems as UBOs, as they are also provided after the shader has been compiled.

    • Specialization Constants  are set before pipeline creation meaning these values are known during shader compilation, this allows the driver’s shader compiler to perform optimizations. In this optimisation process the compiler has the ability to remove unused code blocks and statically unroll which reduces the fragment cycles required by the shader which results in increased performance.

    • While specialization constants rely on knowing the required values before pipeline creation occurs, by trading off this flexibility and allowing the compiler to perform these optimizations you can increase the performance of your application easily and reduce shader code size.

  • Do :

    • Use compile-time specialization constants for all control flow. This allows compilation to completely remove unused code blocks and statically unroll loops.

  • Don’t :

    • Use control-flow which is parameterized by uniform values; specialize shaders for each control path needed instead.

  • Impact :

    • Reduced performance due to less efficient shader programs.

  • Example :

    #version 450
    layout (constant_id = 0) const float myColor = 1.0;
    layout(location = 0) out vec4 outColor;
    
    void main() {
        outColor = vec4(myColor);
    }
    
    struct myData {
        float myColor = 1.0f;
    } myData;
    
    VkSpecializationMapEntry mapEntry = {};
    mapEntry.constantID = 0; // matches constant_id in GLSL and SpecId in SPIR-V
    mapEntry.offset     = 0;
    mapEntry.size       = sizeof(float);
    
    VkSpecializationInfo specializationInfo = {};
    specializationInfo.mapEntryCount = 1;
    specializationInfo.pMapEntries   = &mapEntry;
    specializationInfo.dataSize      = sizeof(myData);
    specializationInfo.pData         = &myData;
    
    VkGraphicsPipelineCreateInfo pipelineInfo = {};
    // (Simplified: `stages` is your array of VkPipelineShaderStageCreateInfo;
    // attach the specialization info to the fragment stage before creation.)
    stages[fragIndex].pSpecializationInfo = &specializationInfo;
    pipelineInfo.pStages = stages;
    
    // Create first pipeline with myColor as 1.0
    vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &pipelineInfo, nullptr, &pipeline1);
    
    // Create second pipeline with the same shader, but a different value
    myData.myColor = 0.5f;
    vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &pipelineInfo, nullptr, &pipeline2);
    
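
Tying the Do/Don't above together: parameterize control flow with a specialization constant rather than a uniform, so the dead branch is removed at pipeline creation. A minimal sketch (`USE_FOG` is a hypothetical feature toggle):

```glsl
#version 450
// Value is supplied (or defaulted) at pipeline creation time.
layout(constant_id = 0) const bool USE_FOG = false;

layout(location = 0) in vec4 in_color;
layout(location = 0) out vec4 out_color;

void main()
{
    out_color = in_color;
    // Known at compile time: the whole branch is eliminated when USE_FOG is
    // false, unlike a branch on a uniform or push-constant value.
    if (USE_FOG) {
        out_color.rgb = mix(out_color.rgb, vec3(0.5), 0.25);
    }
}
```
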
  • Use cases :

    • Toggling features:

      • Support for a feature in Vulkan isn’t known until runtime. This usage of specialization constants is to prevent writing two separate shaders, but instead embedding a constant runtime decision.

    • Improving backend optimizations:

      • Optimizing shader compilation  from SPIR-V to GPU.

      • The “backend” here refers to the implementation’s compiler that takes the resulting SPIR-V and lowers it down to some ISA to run on the device.

      • Constant values allow a set of optimizations such as constant folding , dead code elimination , etc. to occur.

    • Affecting types and memory sizes:

      • It is possible to set the length of an array or a variable type used through a specialization constant.

      • It is important to notice that a compiler will need to allocate registers depending on these types and sizes. This means it is likely that a pipeline cache will fail if the difference is significant in registers allocated.

  • How they work :

    • The values are supplied using VkSpecializationInfo  attached to the VkPipelineShaderStageCreateInfo .

    • In GLSL (or HLSL → SPIR-V) mark a constant with a constant id, e.g. layout(constant_id = 0) const int MATERIAL_MODE = 0;

    • Create VkSpecializationMapEntry  entries mapping constantID  → offset/size in your data block.

    • Fill a contiguous data buffer with the specialization values and set up VkSpecializationInfo .

    • Put the VkSpecializationInfo*  into the shader stage VkPipelineShaderStageCreateInfo  before calling vkCreateGraphicsPipelines . The backend finalizes (specializes/compiles) the shader at pipeline creation time.

  • How it affects the pipeline workflow :

    • TLDR :

      • It does not solve the pipeline workflow problem. It provides a system for shader optimization at SPIR-V→GPU compile time.

      • Specialization lets you get near-compile-time optimizations while still selecting variants at runtime, but it does not avoid having multiple created pipelines if you need multiple different specialized behaviors.

    • They do not, by themselves, precompile every possible branch permutation and keep them all resident for you. Each distinct set of specialization values that you want available at runtime normally corresponds to a separately created pipeline (the specialization values are applied during pipeline creation).

    • If you need multiple variants you must create (or reuse) the pipelines for those values.

    • If you have N independent boolean specialization choices, the number of possible specialized pipelines is 2^N (exponential growth). Creating many pipelines increases driver/state memory and creation time; use caching/derivatives/libraries if creation cost or count is a concern.

    • You cannot change a specialization constant per draw without binding a different pipeline: the specialization is fixed for the pipeline object, so per-draw changes require binding another pipeline or using a different strategy (uniforms, push constants, dynamic branching).

    • Different values mean different pipeline creation (driver work / memory).

    • "Is this a way to precompile every branching of a shader?"

      • Yes, but only if you actually create a pipeline for each variant.

      • Specialization constants let the driver compile-away branches at pipeline-creation time, but they do not magically produce all variants for you at draw time.
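
    • The boolean-combinatorics point above can be sketched in plain C. This is a hypothetical illustration (the names are made up): each set of boolean specialization choices packs into an integer key, and a real renderer would create and cache one specialized pipeline per key.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example: pack N independent boolean specialization choices
 * into one integer key for a pipeline lookup table. With N booleans there
 * are 2^N possible specialized pipelines. */
#define SPEC_USE_FOG     0
#define SPEC_USE_SHADOWS 1
#define SPEC_USE_SSAO    2
#define SPEC_BOOL_COUNT  3

uint32_t pack_variant_key(const int choices[SPEC_BOOL_COUNT])
{
    uint32_t key = 0;
    for (int i = 0; i < SPEC_BOOL_COUNT; ++i)
        if (choices[i])
            key |= 1u << i;
    return key; /* index into a table of variant_count() pipelines */
}

uint32_t variant_count(void)
{
    return 1u << SPEC_BOOL_COUNT; /* 2^N */
}
```

    • Even three independent booleans already require up to eight pipelines, which is why pipeline caches and pipeline libraries matter as N grows.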

  • Recommendations :

    • Improving Shader Performance with Vulkan’s Specialization Constants .

      • When we create the Vulkan pipeline, we pass this specialization information using the pSpecializationInfo  field of VkPipelineShaderStageCreateInfo . At that point, the driver will override the default value of the specialization constant with the value provided here before the shader code is optimized and native GPU code is generated, which allows the driver compiler backend to generate optimal code.

      • It is possible to compile the same shader with different constant values in different pipelines, so even if a value changes often, so long as we have a finite number of combinations, we can generate optimized pipelines for each one ahead of the start of the rendering loop and just swap pipelines as needed while rendering.

      • "promote the UBO array to a push constant".

      • Applying specialization constants in a small number of shaders allowed me to benefit from loop unrolling and, most importantly, UBO promotion to push constants in the SSAO pass, obtaining performance improvements that ranged from 10% up to 20% depending on the configuration.

      • In other words:

        • The article shows how it's possible to pass a value to the shader during graphics pipeline creation so the shader is compiled from SPIR-V to GPU with that constant altered.

        • This helps by allowing the SPIR-V→GPU compiler to make optimization choices such as unrolling loops and removing branches; it can also enable UBO promotion.

        • The article does not suggest specialization constants solve the pipeline workflow problem. It focuses on compile-time shader optimizations.

Physical Storage Buffer ( KHR_buffer_device_address )

  • Impressions :

    • (2025-09-08)

    • No descriptor sets.

      • Cool.

    • Very easy to set up.

    • Shader usage is a bit tricky; push constants are required to access buffers in many patterns.

    • More prone to programmer errors because there is no automatic bounds checking.

    • Hmm, idk, for now not sure.

  • Adds the ability to have “pointers in the shader”.

  • Buffer device address is a powerful and unique feature of Vulkan. It exposes GPU virtual addresses directly to the application, and the application can then use those addresses to access buffer data freely through pointers rather than descriptors.

  • This feature lets you place addresses in buffers and load and store to them inside shaders, with full capability to perform pointer arithmetic and other tricks.

  • Support :

    • Core in Vulkan 1.3.

    • Submitted at (2019-01-06), core at (2019-11-25).

    • Coverage :

      • (2025-09-08) 71.6%

      • 79.8% Windows

      • 70.9% Linux

      • 68.7% Android

  • Lack of safety :

    • A critical thing to note is that a raw pointer has no idea of how much memory is safe to access. Unlike SSBOs with robustness/bounds-checking features enabled, you must either do range checks yourself or avoid relying on out-of-bounds behavior.

  • Creating a buffer :

    • To be able to grab a device address from a VkBuffer , you must create the buffer with SHADER_DEVICE_ADDRESS  usage.

    • The memory you bind that buffer to must be allocated with the corresponding flag via pNext .

    VkMemoryAllocateFlagsInfoKHR flags_info{VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_FLAGS_INFO_KHR};
    flags_info.flags             = VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT_KHR;
    memory_allocation_info.pNext = &flags_info;
    
    • After allocating and binding the buffer, query the address:

    VkBufferDeviceAddressInfoKHR address_info{VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO_KHR};
    address_info.buffer = buffer.buffer;
    buffer.gpu_address  = vkGetBufferDeviceAddressKHR(device, &address_info);
    
    • This address behaves like a normal address; you can offset the VkDeviceAddress  value as you see fit since it is a uint64_t .

    • There is no host-side alignment requirement enforced by the API for this value.

    • When using this pointer in shaders, you must provide and respect alignment semantics yourself, because the shader compiler cannot infer anything about a raw pointer loaded from memory.

    • You can place this pointer inside another buffer and use it as an indirection.

  • GL_EXT_buffer_reference :

    • In Vulkan GLSL, the GL_EXT_buffer_reference  extension allows declaring buffer blocks as pointer-like types rather than SSBOs. GLSL lacks true pointer types, so this extension exposes pointer-like behavior.

    #extension GL_EXT_buffer_reference : require
    
    • You can forward-declare types. Useful for linked lists and similar structures.

    layout(buffer_reference) buffer Position;
    
    • You can declare a buffer reference type. This is not an SSBO declaration, but effectively a pointer-to-struct.

    layout(std430, buffer_reference, buffer_reference_align = 8) writeonly buffer Position {
        vec2 positions[];
    };
    
    • buffer_reference  tags the type accordingly. buffer_reference_align  marks the minimum alignment for pointers of this type.

    • You can place the Position  type inside another buffer or another buffer reference type:

    layout(std430, buffer_reference, buffer_reference_align = 8) readonly buffer PositionReferences {
        Position buffers[];
    };
    
    • Now you have an array of pointers.

    • You can also place a buffer reference inside push constants, an SSBO, or a UBO.

    layout(std430, set = 0, binding = 0) readonly buffer Pointers {
        Position positions[];
    };
    
    layout(std430, push_constant) uniform Registers {
        PositionReferences references;
    } registers;
    
  • Casting pointers :

    • A key aspect of buffer device address is that we gain the capability to cast pointers freely.

    • While it is technically possible (and useful in some cases!) to "cast pointers" with SSBOs with clever use of aliased declarations like so:

    layout(set = 0, binding = 0) buffer SSBO { float v1[]; };
    layout(set = 0, binding = 0) buffer SSBO2 { vec4 v4[]; };
    
    • It gets kind of hairy quickly, and not as flexible when dealing with composite types.

    • When we have casts between integers and pointers, we get the full madness  that is pointer arithmetic. Nothing stops us from doing:

    #extension GL_EXT_buffer_reference : require
    layout(buffer_reference) buffer PointerToFloat { float v; };
    
    PointerToFloat pointer = load_pointer();
    uint64_t int_pointer = uint64_t(pointer);
    int_pointer += offset;
    pointer = PointerToFloat(int_pointer);
    pointer.v = 42.0;
    
    • Not all GPUs support 64-bit integers, so it is also possible to use uvec2  to represent pointers. This way, we can do raw pointer arithmetic in 32-bit, which might be more optimal anyways.

    #extension GL_EXT_buffer_reference_uvec2 : require
    layout(buffer_reference) buffer PointerToFloat { float v; };
    PointerToFloat pointer = load_pointer();
    uvec2 int_pointer = uvec2(pointer);
    uint carry;
    uint lo = uaddCarry(int_pointer.x, offset, carry);
    uint hi = int_pointer.y + carry;
    pointer = PointerToFloat(uvec2(lo, hi));
    pointer.v = 42.0;
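
    • The carry arithmetic in the uvec2 variant can be checked on the CPU. This standalone C sketch mirrors what uaddCarry does in the shader; it is not Vulkan API code.

```c
#include <assert.h>
#include <stdint.h>

/* CPU-side sketch of the GLSL uvec2 trick: a 64-bit device address split
 * into {lo, hi} 32-bit halves, advanced by an offset with an explicit
 * carry, mirroring uaddCarry in the shader. */
uint64_t advance_split_address(uint32_t lo, uint32_t hi, uint32_t offset)
{
    uint32_t new_lo = lo + offset;
    uint32_t carry  = (new_lo < lo) ? 1u : 0u; /* unsigned wrap detects carry */
    uint32_t new_hi = hi + carry;
    return ((uint64_t)new_hi << 32) | new_lo;
}
```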
    
  • Debugging :

    • When debugging or capturing an application that uses buffer device addresses, there are some special driver requirements that are not universally supported. Essentially, to be able to capture application buffers which contain raw pointers, we must ensure that the device address for a given buffer remains stable when the capture is replayed in a new process. Applications do not have to do anything here, since tools like RenderDoc will enable the bufferDeviceAddressCaptureReplay  feature for you and deal with all the magic associated with address capture behind the scenes. If the bufferDeviceAddressCaptureReplay  feature is not present, however, tools like RenderDoc will mask out the bufferDeviceAddress  feature, so beware.

  • Sample .


Memory Allocation

Info

  • Memory Management .

    • Talk by AMD.

    • Shows no code.

    • The video is useful.

    • Memory Heaps, Memory Types.

    • Memory Blocks.

    • Suballocations.

    • Dos and Don'ts.

    • VMA.

    • VmaDumpVis.py to visualize the json file dumped by VMA.

  • Memory Management .

    • Sounds more technical; I only saw parts of the talk.

    • Talk by AMD.

    • Shows code.

    • Memory Heaps, Memory Types.

    • Dos and Don'ts.

    • VMA.

  • There is an additional level of indirection: VkDeviceMemory  is allocated separately from creating VkBuffer / VkImage , and the two must be bound together.

  • The driver must be queried for supported memory heaps and memory types; different GPU vendors expose different configurations.

  • It is recommended to allocate bigger chunks of memory and assign parts of them to particular resources, as there is a limit on the maximum number of memory blocks that can be allocated.

  • When memory is over-committed on Windows, the OS memory manager may move allocations from video memory to system memory; it may also temporarily suspend a process from the GPU runlist in order to page out its allocations and make room for another process’s allocations. On Linux there is no OS memory manager that mitigates over-commitment by automatically performing paging operations on memory objects.

  • Use EXT_pageable_device_local_memory  to avoid demotion of critical resources by assigning memory priority. It’s also a good idea to set low priority to non-critical resources such as vertex and index buffers; the app can verify the performance impact by placing the resources in system memory. 

  • Use EXT_pageable_device_local_memory  to also disable automatic promotion of allocations from system memory to video memory.

  • Use dedicated memory allocations ( KHR_dedicated_allocation , core in VK 1.1) when appropriate.

  • Using dedicated memory may improve performance for color and depth attachments, especially on pre-Turing GPUs.

  • Use KHR_get_memory_requirements2  (core in VK 1.1) to check whether an image/buffer requires dedicated allocation.

  • Use host-visible video memory to write data directly to video memory from the CPU. Such a heap can be detected using DEVICE_LOCAL | HOST_VISIBLE . Take into account that CPU writes to such memory may be slower than to normal system memory, and CPU reads are significantly slower. Check BAR1 traffic using Nsight Systems for possible issues.

  • Explicitly look for the MEMORY_PROPERTY_DEVICE_LOCAL  when picking a memory type for resources, which should be stored in video memory.

  • Don’t assume fixed heap configuration, always query and use the memory properties using vkGetPhysicalDeviceMemoryProperties() .

  • Don’t assume memory requirements of an image/buffer, use vkGet*MemoryRequirements()

  • Don’t put every resource into a Dedicated Allocation.

  • For memory objects that are intended to be device-local, do not just pick the first memory type; pick one that is actually device-local.

  • The benefit is that we avoid CPU memory costs for lots of tiny buffers, as well as cache misses by using just the same buffer object and varying the offset.

  • This optimization applies to all buffers, but in the previous blog post on shader resource binding it was mentioned that the offsets are particularly good for uniform buffers.

  • Software developers use custom memory management for various reasons:

    • Making allocations often involves the operating system which is rather costly.

    • It is usually faster to re-use existing allocations rather than to free and reallocate new ones.

    • Objects that live in a continuous chunk of memory can enjoy better cache utilization.

    • Data that is aligned well for the hardware can be processed faster.

  • Memory is a precious resource, and it can involve several indirect costs imposed by the operating system. For example, some operating systems have a cost linear in the number of allocations for each submission to a Vulkan queue. Another scenario is that the operating system handles the paging state of allocations depending on other processes; we therefore encourage not using too many allocations and organizing them “wisely”.

  • Device Memory: This memory is used for buffers and images and the developer is responsible for their content.

  • Resource Pools: Objects such as CommandBuffers and DescriptorSets are allocated from pools; the actual content is indirectly written by the driver.

  • Custom Host Allocators: Depending on your control-freak level you may also want to provide your own host allocator that the driver can use for the api objects.

  • Heap: Depending on the hardware and platform, the device will expose a fixed number of heaps, from which you can allocate a certain amount of memory in total. Discrete GPUs with dedicated memory differ from mobile or integrated solutions that share memory with the CPU. Heaps support different memory types, which must be queried from the device.

  • Memory type: When creating a resource such as a buffer, Vulkan will provide information about which memory types are compatible with the resource. Depending on additional usage flags, the developer must pick the right type, and based on the type, the appropriate heap.

  • Memory property flags: These flags encode caching behavior and whether we can map the memory to the host (CPU), or if the GPU has fast access to the memory.

  • Memory: This object represents an allocation from a certain heap with a user-defined size.

  • Resource (Buffer/Image): After querying for the memory requirements and picking a compatible allocation, the memory is associated with the resource at a certain offset. This offset must fulfill the provided alignment requirements. After this we can start using our resource for actual work.

  • Sub-Resource (Offsets/View): It is not required to use a resource only in its full extent, just like in OpenGL we can bind ranges (e.g. varying the starting offset of a vertex-buffer) or make use of views (e.g. individual slice and mipmap of a texture array).

  • The fact that we can manually bind resources to actual memory addresses gives rise to the following points:

    • Resources may alias (share) the same region of memory.

    • Alignment requirements for offsets into an allocation must be manually managed.

  • Store multiple buffers, like the vertex and index buffer, into a single VkBuffer  and use offsets in commands like vkCmdBindVertexBuffers .

  • The advantage is that your data is more cache friendly in that case, because it’s closer together. It is even possible to reuse the same chunk of memory for multiple resources if they are not used during the same render operations, provided that their data is refreshed, of course.

  • This is known as aliasing and some Vulkan functions have explicit flags to specify that you want to do this.

  • Uniform Buffer Binding: As part of a DescriptorSet this would be the equivalent of an arbitrary glBindBufferRange(GL_UNIFORM_BUFFER, dset.binding, dset.bufferOffset, dset.bufferSize) in OpenGL. All information for the actual binding by the CommandBuffer is stored within the DescriptorSet itself.

  • Uniform Buffer Dynamic Binding: Similar to the above, but with the ability to provide the bufferOffset later when recording the CommandBuffer, a bit like this pseudo code: CommandBuffer->BindDescriptorSet(setNumber, descriptorSet, &offset). It is very practical to use when sub-allocating uniform buffers from a larger buffer allocation.

  • Push Constants: PushConstants are uniform values that are stored within the CommandBuffer and can be accessed from the shaders similar to a single global uniform buffer. They provide enough bytes to hold some matrices or index values, and the interpretation of the raw data is up to the shader. You may recall glProgramEnvParameter from OpenGL providing something similar. The values are recorded with the CommandBuffer and cannot be altered afterwards: CommandBuffer->PushConstant(offset, size, &data)

  • Dynamic offsets are very fast for NVIDIA hardware. Re-using the same DescriptorSet with just different offsets is rather CPU-cache friendly as well compared to using and managing many DescriptorSets. NVIDIA’s OpenGL driver actually also optimizes uniform buffer binds where just the range changes for a binding unit.
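
  • The memory-type bullets above (query with vkGetPhysicalDeviceMemoryProperties() , then pick a type that is both allowed by memoryTypeBits  and has all the required property flags) can be sketched as plain C. The flag values and arrays here are stand-ins so the logic runs without vulkan.h; real code reads them from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of memory-type selection: scan the types permitted by
 * VkMemoryRequirements::memoryTypeBits and return the first one whose
 * property flags contain everything we asked for. In real code
 * type_property_flags comes from VkPhysicalDeviceMemoryProperties. */
#define NO_SUITABLE_TYPE 0xFFFFFFFFu

uint32_t find_memory_type(uint32_t memory_type_bits,
                          const uint32_t *type_property_flags,
                          uint32_t type_count,
                          uint32_t required_flags)
{
    for (uint32_t i = 0; i < type_count; ++i) {
        int allowed = (memory_type_bits & (1u << i)) != 0;
        int has_all = (type_property_flags[i] & required_flags) == required_flags;
        if (allowed && has_all)
            return i;
    }
    return NO_SUITABLE_TYPE; /* caller must fall back or fail */
}
```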

Sub-allocation
  • In a real world application, you’re not supposed to actually call vkAllocateMemory  for every individual buffer.

  • The maximum number of simultaneous memory allocations is limited by the maxMemoryAllocationCount  physical device limit, which may be as low as 4096  even on high end hardware like an NVIDIA GTX 1080.

  • The right way to allocate memory for a large number of objects at the same time is to create a custom allocator that splits up a single allocation among many different objects by using the offset  parameters that we’ve seen in many functions.

  • You can either implement such an allocator yourself, or use the VMA  library provided by the GPUOpen initiative.

  • Sub-allocation is a first-class approach when working in Vulkan.

  • Memory is allocated in pages with a fixed size; sub-allocation reduces the number of OS-level allocations.

  • You should use memory sub-allocation.

  • Memory allocation and deallocation at OS/driver level is expensive.

  • vkAllocateMemory()  is costly on the CPU.

  • Cost can be reduced by suballocating from a large memory object.

  • Also note the maxMemoryAllocationCount  limit which constrains the number of simultaneous allocations an application can have.

  • A Vulkan app should aim to create large allocations and then manage them itself.
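
  • A minimal sub-allocator along these lines only needs the offset math: one big allocation, a cursor, and power-of-two alignment rounding. This is a sketch, not VMA; real code also handles freeing, multiple blocks, and bufferImageGranularity.

```c
#include <assert.h>
#include <stdint.h>

/* Bump sub-allocator over one large VkDeviceMemory block: one OS-level
 * allocation, many aligned offsets. Real code binds resources with
 * vkBindBufferMemory at the returned offset. */
typedef struct {
    uint64_t size;   /* size of the one big allocation */
    uint64_t cursor; /* next free offset inside it     */
} SubAllocator;

#define SUBALLOC_FAILED 0xFFFFFFFFFFFFFFFFull

/* alignment must be a power of two (Vulkan alignments are). */
uint64_t suballoc(SubAllocator *a, uint64_t size, uint64_t alignment)
{
    uint64_t offset = (a->cursor + alignment - 1) & ~(alignment - 1);
    if (offset + size > a->size)
        return SUBALLOC_FAILED; /* would need another big block */
    a->cursor = offset + size;
    return offset;
}
```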

Arenas

Discussion around the availability of arenas in Vulkan
  • (2025-12-07)

  • Caio:

    • hello, is it possible to create a memory arena, placing all new objects in this region, and then freeing all this region without having to call the vkDestroyX functions? I'm having the impression that Vulkan memory management is rooted in RAII, which I don't like. All my games are managed through arenas, which I think is perfect, but for Vulkan I'm having to track each individual allocation and free each one at a time. I'm already treating memory as a big arena, but I'm having the overhead of calling the destruction of each resource separately.

  • CharlesG:

    • You don’t own the memory that backs vulkan objects. For command buffers and descriptors there are pools so the driver can do a good job with the backing memory scheme.

    • For VkDeviceMemory, you decide how to sub allocate them

  • Caio:

    • do I need to call destroy for objects like vkPipeline, VkPipelineLayout, VkDescriptorSetLayout, VkShaderModule, VkRenderPass, etc? I have lots of objects that should die exactly at the same time, but I'm having to free them one by one. I heard about suballocations for buffers and images, but what about these types of objects I mentioned?

  • VkIpotrick:

    • they require actual cleanup, they are not just some memory

    • they might be referenced within other internal structures of the driver and have to be removed from those for example

  • CharlesG:

    • anything that you vkCreate must be vkDestroyed; Except command buffers and descriptors where it is sufficient to just destroy the pools.

    • Using Vulkan is a lot like networking with a remote server, lots of driver internals have implementation requirements that make arenas not the “obvious choice” (otherwise we’d see more of them)

  • Caio:

    • Is there a future in Vulkan where the decision of how to free the memory is not bound to the driver, but for the programmer? You mentioned how this is limited by what the driver allows, but could this change in the future and move towards being more low-level?

  • VkIpotrick:

    • no. i don't think that is feasible.

    • that would handcuff drivers so badly that you would be too low level. At that point a proper spec could be impossibly hard to create and maintain between vendors

    • vulkan drivers still have to do a loooot of things internally. its still highish level api

  • CharlesG:

    • I concur.

    • I want to reiterate that drivers deal with much more than host memory allocations, but device memory, external memory (to the process), OS api’s, display hardware, shader compilers. Some objects don’t actually DO anything on deletion (sampler come to mind because the handle stores the entire state for some implementations, when the private data ext isnt active)

    • Drivers get to ask the os on your behalf to map device memory into the host address space. And deal with you forgetting to unmap it during shutdown (though the OS is more likely to also clean up after user mode drivers…)

    • I mention that some objects are “free” to leak cause they didn’t allocate anything internally because that is an implementation detail that isnt possible on all hardware, so the API cant guarantee “free” sampler cleanup without screwing over some hardware. And it just ties their hands when it is no longer possible to put all the state into the handle any more in the future with extensions to the API

  • Caio:

    • well, I imagine this was the case, but still, I was hopeful there was some alternative for bulk deletion. Currently I just wrapped around the concept of shared lifetimes and created a pseudo-arena, which internally frees all the memory for me by calling each respective destructor. Still, it annoys me a bit knowing the design could be faster if I could bulk delete the content instead of being bound by what the driver exposes

    • I understand why it's not possible due to the current design by drivers, but I wish it were

    • my concern now is not the performance per se, but more about the freedom of having the option of managing memory in a way that could logically be faster (logically, as freeing a memory region is quite obviously faster than having to manage the state of different objects before deleting each of them individually). I'm not currently bound by the deletion times of those calls. I'm speaking more from a philosophical standpoint.

  • CharlesG:

    • Inb4 going all in on bindless and gpu driven where there just arent as many vulkan objects to manage

    • Fences and semaphores come to mind as prime examples of not just memory

  • Caio:

    • I'm trying to move it that way after trying bindful for a while, it's being much nicer and aligns with the vision I have of how memory is better managed;

  • CharlesG:

    • Suggestions for the API can be made in the vulkan-pain-points channel (although itd be good to link to this convo) and an issue can be made in the Vulkan-Docs github repo as thats the home of the specification. That said, this ask is not easily actionable so hard to quantify what “success” means.

    • All good, and going towards bindless is definitely going to suit your tastes better!

  • VkIpotrick:

    • bindless is simply better at this point

    • descriptor sets, layouts, pools etc made sense for old hardware, but now they are just very clunky oddly behaving abstractions

    • also with bindless you can have one static allocation for all descriptors

    • the ultimate memory management is static lifetime after all.

Alternatives and half-solutions
  • On a conformant Vulkan implementation, you cannot safely get the behavior you want: allocate many Vulkan resources, then free one big memory region while leaving the Vulkan object handles alive and never destroying them. Freeing VkDeviceMemory that backs resources while those resources are still live or still in use is undefined behavior (and a validation error) unless you guarantee the resources are never used again and the driver allows that. The Vulkan spec requires you to manage object lifetimes; drivers may have internal bookkeeping tied to those object handles that is not cleaned up just by freeing the raw memory.

  • That said, you can achieve the practical “free everything by freeing a small number of objects/regions” without peppering vkDestroy* calls everywhere by changing how you structure resources. Options that actually give you region-like semantics:

  • Mega-backings (buffers)

    • Avoid creating one Vulkan resource handle per logical allocation. In practice that means: create a small number of real Vulkan resources (big backing buffers / big images or sparse resources), suballocate from them, and operate using offsets/array-layer indices. When the region should die, you destroy the backing objects (a few destroys) and free their VkDeviceMemory. No per-suballocation vkDestroy* calls are necessary because there are no per-suballocation Vulkan handles to destroy.

    • Create a small set of large backing VkDeviceMemory + VkBuffer objects (one per memory type/usage class you need).

    • Suballocate ranges from those big buffers and use offsets everywhere:

    • For vertex/index bindings: vkCmdBindVertexBuffers(..., firstBinding, 1, &bigBuffer, &offset).

    • For descriptors: VkDescriptorBufferInfo{ bigBuffer, offset, range } — descriptors can point at a buffer + offset without creating new VkBuffer handles.

    • When you’re done, you only need to vkDestroyBuffer / vkFreeMemory for a few big buffers, not for every tiny allocation.

    • Constraints: alignment, memoryRequirements and usage flags must be compatible for all suballocations placed in a given big buffer. If two allocations need different usage flags or memory types, they must go into different backing buffers.

  • Texture atlases / arrays (images)

    • Replace many small VkImage objects with a single large image (or texture array/array layers / atlas) and pack multiple textures into it. Use UV/array-layer indices in shader, or use VkImageView / descriptor indexing accordingly.

    • You then destroy and free one big image rather than many small ones. Tradeoffs: packing, mipmapping, filtering artifacts, and sampler/view creation.

Host Memory

Allocator ( VkAllocationCallbacks )
  • VkAllocationCallbacks  only control host (CPU) allocations the loader/driver makes for Vulkan bookkeeping and temporary objects.

  • They do not give you a direct view or control of device (GPU) memory payloads.

  • Passing a non-NULL pAllocator  to a vkCreateX  function causes the driver to call your callbacks for those host allocations. They do not switch the driver from using device heaps to host malloc; they only replace the host allocator functions used by the implementation. The allocation scope rules determine whether the allocation is command-scoped or object-scoped.

  • Passing a custom VkAllocationCallbacks  to vkCreateBuffer  lets you intercept and control the host memory the driver uses to represent the buffer object — but it does not tell you how many bytes of GPU heap were (or will be) consumed by the buffer’s storage. For the latter you must intercept device allocations (see below).

  • To track real GPU memory you must track vkAllocateMemory / vkFreeMemory  (and any driver-internal device allocations) and/or use VK_EXT_device_memory_report  / VK_EXT_memory_budget  to observe what the driver actually commits.
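
  • As a sketch of what host-allocation interception looks like, here is a counting allocator in the shape Vulkan expects for pfnAllocation / pfnFree . The VkSystemAllocationScope  enum is replaced by a plain int so the example compiles without vulkan.h; in real code these functions go into a VkAllocationCallbacks  struct passed as pAllocator , and they count host bookkeeping only, never GPU memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Host-side allocation statistics, passed as pUserData. */
typedef struct {
    size_t live_allocations;
    size_t bytes_requested;
} HostAllocStats;

/* Shape of VkAllocationCallbacks::pfnAllocation (scope enum as int here). */
void *counting_alloc(void *user_data, size_t size, size_t alignment,
                     int allocation_scope /* VkSystemAllocationScope */)
{
    (void)allocation_scope;
    HostAllocStats *stats = (HostAllocStats *)user_data;
    /* C11 aligned_alloc requires size to be a multiple of alignment. */
    size_t rounded = (size + alignment - 1) / alignment * alignment;
    void *p = aligned_alloc(alignment, rounded);
    if (p) {
        stats->live_allocations += 1;
        stats->bytes_requested  += size;
    }
    return p;
}

/* Shape of VkAllocationCallbacks::pfnFree. */
void counting_free(void *user_data, void *memory)
{
    HostAllocStats *stats = (HostAllocStats *)user_data;
    if (memory) {
        stats->live_allocations -= 1;
        free(memory);
    }
}
```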

  • Examples :

    • vkCreateBuffer(...) :

      • This call creates a buffer object handle and the driver's host-side bookkeeping for that object (descriptor, small metadata).

      • Those host allocations are the things pAllocator  on vkCreateBuffer  controls.

      • The call does not  allocate GPU payload memory for the buffer contents.

      • The buffer becomes usable on the device only after you allocate VkDeviceMemory  and bind  it (or the driver performs some implicit allocation in non-standard implementations).

      • The implementation goes as:

        • vk.CreateBuffer

          • Creates the buffer handle and host-side (CPU) bookkeeping; no device memory yet.

          vk_check(vk.CreateBuffer(_device.handle, &buffer_create_info, &arena.gpu_alloc, &buffer_handle))
          
          
        • vk.GetBufferMemoryRequirements

          • Prepare allocation_info for VkDeviceMemory. Choose a memoryTypeIndex with the desired properties.

          • allocationSize and memoryTypeIndex determine whether the allocation will be device-local, host-visible, coherent, etc.

          • These properties decide whether the memory is mappable from the CPU.

          • This call doesn't allocate anything.

          mem_requirements: vk.MemoryRequirements
          vk.GetBufferMemoryRequirements(_device.handle, buffer_handle, &mem_requirements)
          mem_allocation_info := vk.MemoryAllocateInfo{
              sType           = .MEMORY_ALLOCATE_INFO,
              allocationSize  = mem_requirements.size,
              memoryTypeIndex = device_find_memory_type(mem_requirements.memoryTypeBits, properties),
          }
          
        • vk.AllocateMemory

          • This is the call that requests a VkDeviceMemory  allocation from a particular memory type/heap.

          • Memory type is HOST_VISIBLE :

            • The driver will allocate from the heap that provides host mappings (which is typically system RAM or a host-visible region).

            • Effect: device payload is created — the VkDeviceMemory  object represents committed device memory (counts against the heap’s budget).

            • On discrete GPUs this is often a segment of system memory that is mapped by the driver, or on integrated GPUs it may be the same physical RAM but treated as both host- and device-accessible.

            • The pAllocator  you pass to vkAllocateMemory only affects host-side allocations the driver does while processing the call; it does not change whether the allocation consumes device heap bytes.

          • Memory type is DEVICE_LOCAL :

            • Driver allocates a VkDeviceMemory from the device-local heap (on discrete GPUs this is the GPU VRAM heap). That is the device payload and consumes heap budget. The allocation is not host-visible, so you cannot vkMapMemory this memory.

            • Note: on integrated GPUs device-local may still be mappable because physical memory is shared — but that depends entirely on memory type flags exposed by the driver.

          • Memory type is HOST VISIBLE + DEVICE_LOCAL :

            • The allocation is created in a heap that the driver marks both device-local and host-visible. Physically this can mean: shared system RAM (integrated GPU) or a special heap the driver exposes that is accessible by both CPU and GPU. The VkDeviceMemory is committed and counts against that heap’s budget.

            • You may be able to vkMapMemory  this memory because it is host-visible. Performance characteristics vary: host-visible+device-local memory can be slower to CPU-access than pure host memory or slower to GPU-access than pure device-local VRAM.

            • On PC discrete GPUs this commonly corresponds to the GPU memory that is accessible through the PCIe BAR (Resizable BAR / ReBAR) or a special small window the driver exposes. Allocation behavior: vkAllocateMemory allocates from that BAR-exposed heap (it consumes VRAM or a BAR-mapped window of VRAM).

          vk_check(vk.AllocateMemory(_device.handle, &mem_allocation_info, nil, &buffer_memory))
          
        • vk.BindBufferMemory

          • Binds the buffer to the memory object. Doesn't allocate anything.

          • Binds the previously allocated device memory to the buffer object. Binding itself normally does not allocate additional device heap bytes; it just associates that payload region with the buffer handle.

          • After bind the buffer is usable for CPU mapping (if host-visible) and/or device operations.

          vk_check(vk.BindBufferMemory(_device.handle, buffer_handle, buffer_memory, 0))
          
    • vkCreateGraphicsPipelines(...)

      • Pipeline creation can be expensive and opaque.

      • During pipeline creation the driver may:

        • allocate host-side structures for the pipeline object (controlled by pAllocator  passed to vkCreateGraphicsPipelines ),

        • compile/optimize shaders, build internal representations,

        • and may allocate internal device resources (driver-controlled device memory, shader/kernel upload, caches) that are not the same as application VkDeviceMemory  allocations. The spec explicitly allows drivers to perform internal device allocations for things like pipelines; those allocations are not controlled by VkAllocationCallbacks . If you need to see them, use VK_EXT_device_memory_report .

Allocation, Reallocation, Free, Internal Alloc, Internal Free
  • pfnAllocation  or pfnReallocation  may be called in the following situations:

    • Allocations scoped to a VkDevice  or VkInstance  may be allocated from any API command.

    • Allocations scoped to a command may be allocated from any API command.

    • Allocations scoped to a VkPipelineCache  may only be allocated from:

      • vkCreatePipelineCache

      • vkMergePipelineCaches  for dstCache

      • vkCreateGraphicsPipelines  for pipelineCache

      • vkCreateComputePipelines  for pipelineCache

    • Allocations scoped to a VkValidationCacheEXT  may only be allocated from:

      • vkCreateValidationCacheEXT

      • vkMergeValidationCachesEXT  for dstCache

      • vkCreateShaderModule  for validationCache in VkShaderModuleValidationCacheCreateInfoEXT

    • Allocations scoped to a VkDescriptorPool  may only be allocated from:

      • any command that takes the pool as a direct argument

      • vkAllocateDescriptorSets  for the descriptorPool  member of its pAllocateInfo  parameter

      • vkCreateDescriptorPool

    • Allocations scoped to a VkCommandPool  may only be allocated from:

      • any command that takes the pool as a direct argument

      • vkCreateCommandPool

      • vkAllocateCommandBuffers  for the commandPool  member of its pAllocateInfo  parameter

      • any vkCmd*  command whose commandBuffer  was allocated from that VkCommandPool

    • Allocations scoped to any other object may only be allocated in that object’s vkCreate*  command.

  • pfnFree , or pfnReallocation  with zero size, may be called in the following situations:

    • Allocations scoped to a VkDevice  or VkInstance may be freed from any API command.

    • Allocations scoped to a command must be freed by any API command which allocates such memory.

    • Allocations scoped to a VkPipelineCache  may be freed from vkDestroyPipelineCache .

    • Allocations scoped to a VkValidationCacheEXT  may be freed from vkDestroyValidationCacheEXT .

    • Allocations scoped to a VkDescriptorPool  may be freed from

      • any command that takes the pool as a direct argument

    • Allocations scoped to a VkCommandPool  may be freed from:

      • any command that takes the pool as a direct argument

      • vkResetCommandBuffer  whose commandBuffer  was allocated from that VkCommandPool

    • Allocations scoped to any other object may be freed in that object’s vkDestroy*  command.

    • Any command that allocates host memory may also free host memory of the same scope.

  • pfnAllocation

    • If pfnAllocation  is unable to allocate the requested memory, it must return NULL.

    • If the allocation was successful, it must return a valid pointer to memory allocation containing at least size  bytes, and with the pointer value being a multiple of alignment .

  • pfnReallocation

    • If the reallocation was successful, pfnReallocation  must return an allocation with enough space for size bytes, and the contents of the original allocation from bytes zero to min(original size, new size) - 1 must be preserved in the returned allocation.

    • If size is larger than the old size, the contents of the additional space are undefined .

    • If satisfying these requirements involves creating a new allocation, then the old allocation should be freed.

    • If pOriginal  is NULL, then pfnReallocation  must behave equivalently to a call to PFN_vkAllocationFunction  with the same parameter values (without pOriginal ).

    • If size  is zero, then pfnReallocation  must behave equivalently to a call to PFN_vkFreeFunction  with the same pUserData  parameter value, and pMemory  equal to pOriginal .

    • If pOriginal  is non-NULL, the implementation must ensure that alignment  is equal to the alignment  used to originally allocate pOriginal .

    • If this function fails and pOriginal  is non-NULL the application must not free the old allocation.
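The pfnReallocation rules above can be sketched as a plain-C helper. This is an illustrative stand-in, not a Vulkan callback: the name vk_style_realloc is hypothetical, and old_size is passed in by the caller because a real allocator would track it itself (e.g. in a header or a side table).

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Hypothetical helper mirroring the pfnReallocation rules:
   - pOriginal == NULL  -> behave like pfnAllocation
   - size == 0          -> behave like pfnFree, return NULL
   - otherwise allocate a new aligned block, preserve min(old, new)
     bytes, free the old block; on failure keep the old allocation. */
static void *vk_style_realloc(void *pOriginal, size_t old_size,
                              size_t size, size_t alignment) {
    /* aligned_alloc requires size to be a multiple of alignment */
    size_t padded = (size + alignment - 1) / alignment * alignment;
    if (pOriginal == NULL)
        return aligned_alloc(alignment, padded);
    if (size == 0) {
        free(pOriginal);
        return NULL;
    }
    void *p = aligned_alloc(alignment, padded);
    if (p == NULL)            /* failure: the old allocation stays valid */
        return NULL;
    memcpy(p, pOriginal, old_size < size ? old_size : size);
    free(pOriginal);
    return p;
}
```

The key subtlety is the failure path: when the new allocation cannot be made, the old pointer must be left untouched so the application does not free it twice.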

  • pfnFree

    • pMemory  may be NULL , which the callback must handle safely.

    • If pMemory  is non-NULL, it must be a pointer previously allocated by pfnAllocation  or pfnReallocation .

    • The application should free this memory.

  • pfnInternalAllocation

    • Upon allocation of executable memory, pfnInternalAllocation  will be called.

    • This is a purely informational callback.

  • pfnInternalFree

    • Upon freeing executable memory, pfnInternalFree  will be called.

    • This is a purely informational callback.

  • If either of pfnInternalAllocation  or pfnInternalFree  is not NULL, both must be valid callbacks

Creating the allocator
  • VkAllocationCallbacks  are for host-side allocations the Vulkan loader/driver makes (CPU memory for driver bookkeeping, loader structures, etc.) — not for GPU-side VkDeviceMemory .

  • Using malloc / free :

    • Is common and acceptable for many apps — but you must meet Vulkan’s callback semantics (alignment, reallocation behavior, thread-safety) and consider performance.

    • This is a normal, valid approach. It satisfies most apps and is what many people do in practice.

    • Discussion .

    • Caveats :

      • Alignment:

        • Vulkan allocators must return memory suitably aligned for any type the driver might need. Use posix_memalign/aligned_alloc on POSIX, _aligned_malloc on Windows, or otherwise ensure alignment. The Vulkan spec expects allocation functions to behave like platform allocators.

      • Reallocation semantics:

        • pfnReallocation  must implement C-like realloc semantics (grow/shrink, preserve contents if requested). If your platform realloc does not support required alignment, implement reallocation by allocating new aligned memory, copying the old contents, freeing the old pointer.

      • Thread-safety & performance:

        • Drivers can call the callbacks from multiple threads. The system malloc is usually thread-safe but can have global locks and contention. For high-frequency allocation patterns, a custom pool or thread-local allocator can reduce contention and improve predictable performance.

      • Internal allocation tracking:

        • VkAllocationCallbacks  provide pUserData  so you can route allocations to a custom pool/context for tracking or to implement more efficient pooling per object type.

  • The GPU VkDeviceMemory allocations (the ones created with vkAllocateMemory) are a separate resource and must be managed with Vulkan APIs and counted against the appropriate memory heap

  • If you use malloc for VkAllocationCallbacks , you are only providing host-allocator behavior for driver/loader-side allocations.

Scope
  • Each allocation has an allocation scope defining its lifetime and which object it is associated with. Possible values passed to the allocationScope parameter of the callback functions specified by VkAllocationCallbacks , indicating the allocation scope, are:

  • COMMAND

    • Specifies that the allocation is scoped to the duration of the Vulkan command.

    • The most specific allocator available is used ( OBJECT , else DEVICE , else INSTANCE ).

  • OBJECT

    • Specifies that the allocation is scoped to the lifetime of the Vulkan object that is being created or used.

    • The most specific allocator available is used ( OBJECT , else DEVICE , else INSTANCE ).

  • CACHE

    • Specifies that the allocation is scoped to the lifetime of a VkPipelineCache  or VkValidationCacheEXT  object.

    • If an allocation is associated with a VkValidationCacheEXT  or VkPipelineCache  object, the allocator will use the CACHE  allocation scope.

    • The most specific allocator available is used ( CACHE , else DEVICE , else INSTANCE ).

  • DEVICE

    • Specifies that the allocation is scoped to the lifetime of the Vulkan device.

    • If an allocation is scoped to the lifetime of a device, the allocator will use an allocation scope of DEVICE .

    • The most specific allocator available is used ( DEVICE , else INSTANCE ).

  • INSTANCE

    • Specifies that the allocation is scoped to the lifetime of the Vulkan instance.

    • If the allocation is scoped to the lifetime of an instance and the instance has an allocator, its allocator will be used with an allocation scope of INSTANCE .

    • Otherwise an implementation will allocate memory through an alternative mechanism that is unspecified.

  • Most Vulkan commands operate on a single object, or there is a sole object that is being created or manipulated. When an allocation uses an allocation scope of OBJECT  or CACHE , the allocation is scoped to the object being created or manipulated.

  • When an implementation requires host memory, it will make callbacks to the application using the most specific allocator and allocation scope available:

  • Pools :

    • Objects that are allocated from pools do not specify their own allocator. When an implementation requires host memory for such an object, that memory is sourced from the object’s parent pool’s allocator.

Device Memory

  • Device memory is memory that is visible to the device — for example the contents of the image or buffer objects, which can be natively used by the device.

  • A Vulkan device operates on data in device memory via memory objects that are represented in the API by a VkDeviceMemory  handle.

  • VkDeviceMemory .

    • Opaque handle to a device memory object.

Properties
  • Memory properties of a physical device describe the memory heaps and memory types available.

  • To query memory properties, call vkGetPhysicalDeviceMemoryProperties .

  • VkPhysicalDeviceMemoryProperties

    • Describes a number of memory heaps as well as a number of memory types that can be used to access memory allocated in those heaps.

    • Each heap describes a memory resource of a particular size, and each memory type describes a set of memory properties (e.g. host cached vs. uncached) that can be used with a given memory heap. Allocations using a particular memory type will consume resources from the heap indicated by that memory type’s heap index. More than one memory type may share each heap, and the heaps and memory types provide a mechanism to advertise an accurate size of the physical memory resources while allowing the memory to be used with a variety of different properties.

    • At least one heap must include MEMORY_HEAP_DEVICE_LOCAL  in VkMemoryHeap.flags

    • memoryTypeCount  is the number of valid elements in the memoryTypes  array.

    • memoryTypes  is an array of MAX_MEMORY_TYPES   VkMemoryType  structures describing the memory types that can be used to access memory allocated from the heaps specified by memoryHeaps.

    • memoryHeapCount  is the number of valid elements in the memoryHeaps  array.

    • memoryHeaps  is an array of MAX_MEMORY_HEAPS   VkMemoryHeap  structures describing the memory heaps from which memory can be allocated.

Device Memory Allocation
  • Memory requirements :

    • vkGetBufferMemoryRequirements

      • Returns the memory requirements for the specified Vulkan object

      • device

        • Is the logical device that owns the buffer.

      • buffer

        • Is the buffer to query.

      • pMemoryRequirements

        • Is a pointer to a VkMemoryRequirements  structure in which the memory requirements of the buffer object are returned.

    • VkMemoryRequirements

      • size

        • Is the size, in bytes, of the memory allocation required for the resource.

        • The size of the required memory in bytes may differ from bufferInfo.size .

      • alignment

        • The alignment, in bytes, required for the offset at which the buffer begins within the allocated region of memory; it depends on bufferInfo.usage  and bufferInfo.flags .

      • memoryTypeBits

        • Bit field of the memory types that are suitable for the buffer.

        • Bit i  is set if and only if the memory type i  in the VkPhysicalDeviceMemoryProperties  structure for the physical device is supported for the resource.

    • vkGetPhysicalDeviceMemoryProperties

      • Reports memory information for the specified physical device

      • We'll use it to find a memory type that is suitable for the buffer itself.

      • vkGetPhysicalDeviceMemoryProperties2  behaves similarly to vkGetPhysicalDeviceMemoryProperties , with the ability to return extended information in a pNext  chain of output structures.

      • memoryHeaps

        • Are distinct memory resources like dedicated VRAM and swap space in RAM for when VRAM runs out.

        • The different types of memory exist within these heaps.

        • Right now we’ll only concern ourselves with the type of memory and not the heap it comes from, but you can imagine that this can affect performance.

      • memoryTypes

        • Consists of VkMemoryType  structs that specify the heap and properties of each memory type.

        • The properties define special features of the memory, like being able to map it so we can write to it from the CPU.

        • VkMemoryType

          • Structure specifying memory type

          • heapIndex

          • propertyFlags

      • typeFilter

        • Specifies the bit field of memory types that are suitable.

        • That means that we can find the index of a suitable memory type by simply iterating over them and checking if the corresponding bit is set to 1 .

        • However, we’re not just interested in a memory type that is suitable for the vertex buffer.

        • We also need to be able to write our vertex data to that memory.

      • We may have more than one desirable property, so we should check if the result of the bitwise AND is not just non-zero, but equal to the desired properties bit field. If there is a memory type suitable for the buffer that also has all the properties we need, then we return its index, otherwise we throw an exception.
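The lookup described above can be sketched in C. The structs here are minimal mocks of the relevant VkPhysicalDeviceMemoryProperties fields (the names mirror Vulkan, but the types are stand-ins so the logic can run without a driver):

```c
#include <stdint.h>
#include <stdbool.h>

/* Mocked subset of VkMemoryType / VkPhysicalDeviceMemoryProperties */
typedef struct { uint32_t propertyFlags; uint32_t heapIndex; } MemoryType;
typedef struct { uint32_t memoryTypeCount; MemoryType memoryTypes[32]; } MemoryProperties;

/* Return the first memory type whose bit is set in type_filter AND whose
   propertyFlags contain *all* wanted flags (bitwise AND equals wanted,
   not merely non-zero); -1 if no type qualifies. */
static int find_memory_type(const MemoryProperties *props,
                            uint32_t type_filter, uint32_t wanted) {
    for (uint32_t i = 0; i < props->memoryTypeCount; i++) {
        bool suitable  = (type_filter & (1u << i)) != 0;
        bool has_props = (props->memoryTypes[i].propertyFlags & wanted) == wanted;
        if (suitable && has_props)
            return (int)i;
    }
    return -1;
}
```

Note the `== wanted` comparison: testing only for a non-zero AND would accept a type that has *some* of the requested properties but not all of them.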

  • Allocation :

    • VkMemoryAllocateInfo

      • allocationSize

        • Is the size of the allocation in bytes.

      • memoryTypeIndex

        • Is an index identifying a memory type from the memoryTypes  array of the VkPhysicalDeviceMemoryProperties  structure (queried with vkGetPhysicalDeviceMemoryProperties ), as defined in the 'memory requirements'.

    • vkAllocateMemory .

      • To allocate memory objects.

      • device

        • Is the logical device that owns the memory.

      • pAllocateInfo

        • Is a pointer to a VkMemoryAllocateInfo  structure describing parameters of the allocation. A successfully returned allocation must use the requested parameters — no substitution is permitted by the implementation.

      • pAllocator

        • Controls host  memory allocation.

      • pMemory

        • Is a pointer to a VkDeviceMemory  handle in which information about the allocated memory is returned.

    • Allocations returned by vkAllocateMemory  are guaranteed to meet any alignment requirement of the implementation. For example, if an implementation requires 128 byte alignment for images and 64 byte alignment for buffers, the device memory returned through this mechanism would be 128-byte aligned. This ensures that applications can correctly suballocate objects of different types (with potentially different alignment requirements) in the same memory object.
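Suballocation of the kind described above boils down to rounding each resource's offset up to its required alignment. A minimal sketch (the helper names are mine, not Vulkan's):

```c
#include <stdint.h>

/* Round an offset up to the next multiple of alignment. */
static uint64_t align_up(uint64_t offset, uint64_t alignment) {
    return (offset + alignment - 1) / alignment * alignment;
}

/* Place a second resource after a first one inside a single
   VkDeviceMemory allocation: it starts at the first offset past the
   first resource that satisfies the second resource's alignment. */
static uint64_t suballoc_second_offset(uint64_t first_size,
                                       uint64_t second_alignment) {
    return align_up(first_size, second_alignment);
}
```

For example, a 1000-byte buffer followed by a resource requiring 128-byte alignment puts the second resource at offset 1024, wasting 24 bytes of padding.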

    • When memory is allocated, its contents are undefined with the following constraint:

      • The contents of unprotected memory must not be a function of the contents of data protected memory objects, even if those memory objects were previously freed.

      • The contents of memory allocated by one application should not be a function of data from protected memory objects of another application, even if those memory objects were previously freed.

    • The maximum number of valid memory allocations that can exist simultaneously within a VkDevice may be restricted by implementation- or platform-dependent limits. The maxMemoryAllocationCount feature describes the number of allocations that can exist simultaneously before encountering these internal limits.

  • Freeing :

    • To free a memory object, call vkFreeMemory .

    • Before freeing a memory object, an application must ensure the memory object is no longer in use by the device — for example by command buffers in the pending state. Memory can be freed whilst still bound to resources, but those resources must not be used afterwards. Freeing a memory object releases the reference it held, if any, to its payload. If there are still any bound images or buffers, the memory object’s payload may not be immediately released by the implementation, but must be released by the time all bound images and buffers have been destroyed. Once all references to a payload are released, it is returned to the heap from which it was allocated.

    • How memory objects are bound to Images and Buffers is described in detail in the [Resource Memory Association] section.

    • If a memory object is mapped at the time it is freed, it is implicitly unmapped.

    • Host writes are not implicitly flushed when the memory object is unmapped, but the implementation must guarantee that writes that have not been flushed do not affect any other memory.

Resource Memory Association
  • Resources are initially created as virtual allocations with no backing memory. Device memory is allocated separately and then associated with the resource. This association is done differently for sparse and non-sparse resources.

  • Resources created with any of the sparse creation flags are considered sparse resources. Resources created without these flags are non-sparse. The details on resource memory association for sparse resources is described in Sparse Resources.

  • Non-sparse resources must be bound completely and contiguously to a single VkDeviceMemory object before the resource is passed as a parameter to any of the following operations:

    • creating buffer, image, or tensor views

    • updating descriptor sets

    • recording commands in a command buffer

  • Once bound, the memory binding is immutable for the lifetime of the resource.

  • In a logical device representing more than one physical device, buffer and image resources exist on all physical devices but can be bound to memory differently on each. Each such replicated resource is an instance of the resource. For sparse resources, each instance can be bound to memory arbitrarily differently. For non-sparse resources, each instance can either be bound to the local or a peer instance of the memory, or for images can be bound to rectangular regions from the local and/or peer instances. When a resource is used in a descriptor set, each physical device interprets the descriptor according to its own instance’s binding to memory.

  • Sparse Resources .

  • Sparse resources let you create VkBuffer  and VkImage  objects which are bound non-contiguously to one or more VkDeviceMemory  allocations.

Host Access

  • Also check GPU .

  • Memory objects created with vkAllocateMemory  are not directly host accessible.

  • Memory objects created with the memory property MEMORY_PROPERTY_HOST_VISIBLE  are considered mappable. Memory objects must be mappable in order to be successfully mapped on the host.

  • vkMapMemory

    • This function allows us to access a region of the specified memory resource defined by an offset and size.

    • Used to retrieve a host virtual address pointer to a region of a mappable memory object.

    • It is also possible to specify the special value WHOLE_SIZE  to map all of the memory.

    • device

      • Is the logical device that owns the memory.

    • memory

      • Is the VkDeviceMemory  object to be mapped.

    • offset

      • Is a zero-based byte offset from the beginning of the memory object.

    • size

      • Is the size of the memory range to map, or WHOLE_SIZE  to map from offset to the end of the allocation.

    • flags

      • Is a bitmask of VkMemoryMapFlagBits  specifying additional parameters of the memory map operation.

    • ppData

      • Is a pointer to a void*  variable in which a host-accessible pointer to the beginning of the mapped range is returned. The value of the returned pointer minus offset must be aligned to VkPhysicalDeviceLimits.minMemoryMapAlignment .

      • Acts like regular RAM from the CPU's point of view; physically it may point to GPU memory (e.g. through a BAR window) or to driver-managed system memory.

  • After a successful call to vkMapMemory  the memory object memory is considered to be currently host mapped.

  • It is an application error to call vkMapMemory on a memory object that is already host mapped.

  • vkMapMemory  does not check whether the device memory is currently in use before returning the host-accessible pointer.

  • If the device memory was allocated without the MEMORY_PROPERTY_HOST_COHERENT  set, these guarantees must be made for an extended range: the application must round down the start of the range to the nearest multiple of VkPhysicalDeviceLimits.nonCoherentAtomSize , and round the end of the range up to the nearest multiple of VkPhysicalDeviceLimits.nonCoherentAtomSize .
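The non-coherent rounding rule can be sketched as follows. This is an illustrative helper (the struct and function names are mine): it expands a mapped range to nonCoherentAtomSize boundaries and clamps the end to the allocation size, which is what an application must do before vkFlushMappedMemoryRanges / vkInvalidateMappedMemoryRanges on non-coherent memory.

```c
#include <stdint.h>

typedef struct { uint64_t offset, size; } Range;

/* Expand [offset, offset + size) so both ends land on multiples of
   atom (nonCoherentAtomSize), never extending past alloc_size. */
static Range expand_to_atom(uint64_t offset, uint64_t size,
                            uint64_t atom, uint64_t alloc_size) {
    uint64_t start = offset / atom * atom;                     /* round down */
    uint64_t end   = (offset + size + atom - 1) / atom * atom; /* round up   */
    if (end > alloc_size)
        end = alloc_size;   /* the range must stay inside the allocation */
    Range r = { start, end - start };
    return r;
}
```

With atom = 64, a write at offset 100 of 10 bytes must be flushed as the range [64, 128).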

  • Problem :

    • The driver may not immediately copy the data into the buffer memory, for example, because of caching.

    • It is also possible that writes to the buffer are not visible in the mapped memory yet.

    • There are two ways to deal with that problem:

      • Use a memory heap that is host coherent, indicated with MEMORY_PROPERTY_HOST_COHERENT

      • Call vkFlushMappedMemoryRanges  after writing to the mapped memory, and call vkInvalidateMappedMemoryRanges  before reading from the mapped memory.

    • Flushing memory ranges or using a coherent memory heap means that the driver will be aware of our writings to the buffer, but it doesn’t mean that they are actually visible on the GPU yet. The transfer of data to the GPU is an operation that happens in the background, and the specification simply tells us  that it is guaranteed to be complete as of the next call to vkQueueSubmit .

  • Minimum Alignment :

    • VkPhysicalDeviceLimits .

    • ChatGPT:

      • Dynamic offsets:

        • If you used DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC  or DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC  in your VkDescriptorSetLayoutBinding .

          • That is the definition of a dynamic descriptor.

        • If you call vkCmdBindDescriptorSets(..., dynamicOffsetCount, pDynamicOffsets) . If dynamicOffsetCount > 0  and pDynamicOffsets  is non-null you are supplying dynamic offsets at bind time.

      • How offsets are applied:

        • Non-dynamic descriptor:

          • The VkDescriptorBufferInfo.offset  you gave to vkUpdateDescriptorSets  is baked into the descriptor.

          • That offset  must be a multiple of minUniformBufferOffsetAlignment .

        • Dynamic descriptor:

          • The descriptor stores a base offset / range , and the runtime adds the dynamic offset(s) you pass to vkCmdBindDescriptorSets .

          • Each dynamic offset must be a multiple of minUniformBufferOffsetAlignment .

      • If you are not using Dynamic Offsets in the vkCmdBindDescriptorSets , nor using offsets in the VkDescriptorBufferInfo , then you don't need to worry about this limit.

Staging buffer
  • Use a host-visible buffer  as a temporary (staging) buffer and a device-local buffer  as the actual buffer.

  • The host-visible buffer should have usage BUFFER_USAGE_TRANSFER_SRC , and the device-local buffer should have usage BUFFER_USAGE_TRANSFER_DST .

  • The contents of the host-visible buffer are copied to the device-local buffer using vkCmdCopyBuffer .


  • Data Transfer .

  • Buffer copy requirements :

    • Requires a queue family that supports transfer operations, which is indicated using QUEUE_TRANSFER .

      • Any queue family with QUEUE_GRAPHICS  or QUEUE_COMPUTE  capabilities already implicitly supports QUEUE_TRANSFER  operations.

      • A different queue family specifically for transfer operations could be used.

        • It will require you to make the following modifications to your program:

          • Modify QueueFamilyIndices  and findQueueFamilies  to explicitly look for a queue family with the QUEUE_TRANSFER  bit, but not the QUEUE_GRAPHICS .

          • Modify createLogicalDevice  to request a handle to the transfer queue

          • Create a second command pool for command buffers that are submitted on the transfer queue family

          • Change the sharingMode  of resources to be SHARING_MODE_CONCURRENT  and specify both the graphics and transfer queue families

          • Submit any transfer commands like vkCmdCopyBuffer  (which we’ll be using in this chapter) to the transfer queue instead of the graphics queue

      • This will teach you a lot about how resources are shared between queue families.

      • Caio: OK, but what are the benefits of using different queues? (One answer: a dedicated transfer queue often maps to the GPU's DMA engines, so copies can overlap with graphics work instead of serializing on the graphics queue.)

BAR (Base Address Register)
Memory Aliasing
  • A range of a VkDeviceMemory allocation is aliased if it is bound to multiple resources simultaneously, as described below, via vkBindImageMemory , vkBindBufferMemory , vkBindAccelerationStructureMemoryNV , vkBindTensorMemoryARM , via sparse memory bindings, or by binding the memory to resources in multiple Vulkan instances or external APIs using external memory handle export and import mechanisms.

  • Consider two resources, resourceA and resourceB, bound respectively to memory rangeA and rangeB. Let paddedRangeA and paddedRangeB be, respectively, rangeA and rangeB aligned to bufferImageGranularity. If the resources are both linear or both non-linear (as defined in the Glossary), then the resources alias the memory in the intersection of rangeA and rangeB. If one resource is linear and the other is non-linear, then the resources alias the memory in the intersection of paddedRangeA and paddedRangeB.

  • The implementation-dependent limit bufferImageGranularity also applies to tensor resources.

  • Memory aliasing can be useful to reduce the total device memory footprint of an application, if some large resources are used for disjoint periods of time.
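The padded-range rule above can be sketched as a small C check. The Binding struct and function names are illustrative (a real application would derive the linear/non-linear classification from resource type and tiling):

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint64_t offset, size; bool linear; } Binding;

static uint64_t round_down(uint64_t x, uint64_t g) { return x / g * g; }
static uint64_t round_up(uint64_t x, uint64_t g)   { return (x + g - 1) / g * g; }

/* Do two bindings into the same VkDeviceMemory alias? If one resource
   is linear and the other non-linear, each range is first padded to
   bufferImageGranularity before testing for a non-empty intersection. */
static bool ranges_alias(Binding a, Binding b, uint64_t granularity) {
    uint64_t a0 = a.offset, a1 = a.offset + a.size;
    uint64_t b0 = b.offset, b1 = b.offset + b.size;
    if (a.linear != b.linear) {
        a0 = round_down(a0, granularity); a1 = round_up(a1, granularity);
        b0 = round_down(b0, granularity); b1 = round_up(b1, granularity);
    }
    return a0 < b1 && b0 < a1;   /* ranges overlap */
}
```

With a granularity of 1024, a linear buffer at [0, 512) and a non-linear image at [512, 1024) alias after padding, even though their raw ranges are disjoint; two linear buffers at those same offsets do not.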

  • vkBindBufferMemory() .

    • If memory allocation was successful, then we can now associate this memory with the buffer using this function.

    • offset

      • Offset within the region of memory.

      • Since this memory is allocated specifically for this vertex buffer, the offset is simply 0 .

      • If the offset is non-zero, then it is required to be divisible by memRequirements.alignment .

  • Memory Aliasing .

Lazily Allocated Memory

  • If the memory object is allocated from a heap with the MEMORY_PROPERTY_LAZILY_ALLOCATED  bit set, that object’s backing memory may be provided by the implementation lazily. The actual committed size of the memory may initially be as small as zero (or as large as the requested size), and monotonically increases as additional memory is needed.

  • A memory type with this flag set is only allowed to be bound to a VkImage whose usage flags include IMAGE_USAGE_TRANSIENT_ATTACHMENT .

Protected Memory

  • Protected memory divides device memory into protected device memory and unprotected device memory.

  • Unprotected Device Memory :

    • Unprotected device memory, which can be visible to the device and can be visible to the host

    • Unprotected images, unprotected tensors, and unprotected buffers, to which unprotected memory can be bound

    • Unprotected command buffers, which can be submitted to a device queue to execute unprotected queue operations

    • Unprotected device queues, to which unprotected command buffers can be submitted

    • Unprotected queue submissions, through which unprotected command buffers can be submitted

    • Unprotected queue operations

  • Protected Device Memory :

    • Protected device memory, which can be visible to the device but must not be visible to the host

    • Protected images, protected tensors, and protected buffers, to which protected memory can be bound

    • Protected command buffers, which can be submitted to a protected-capable device queue to execute protected queue operations

    • Protected-capable device queues, to which unprotected command buffers or protected command buffers can be submitted

    • Protected queue submissions, through which protected command buffers can be submitted

    • Protected queue operations

  • Protected Memory .

Tracking GPU Memory

  • Vulkan does not expose fixed per-object byte counts for most objects — exact memory use is implementation and driver-dependent. Some objects ( VkImage , VkBuffer ) must be bound to VkDeviceMemory  you allocate (so you can know their size). Many other objects (pipelines, command buffers, descriptor sets, semaphores, image views, pipeline layouts, etc.) often cause hidden driver allocations that may live in host memory, device memory, or both — and those allocations’ size and placement vary by driver and GPU.

By object
  • VkInstance  / VkPhysicalDevice  / VkDevice  (handles):

    • Small host-side allocations (process RAM). Measure via your VkAllocationCallbacks or by tracking driver host allocations. These are host-visible (they are just process memory).

  • VkImageView  / VkBufferView  / VkSampler :

    • Lightweight, usually host memory (small driver structures). They rarely allocate large device memory; they may cause small host allocations. Implementation dependent but small (tens to a few hundred bytes each in many drivers).

  • VkDescriptorSetLayout  / VkPipelineLayout  / VkDescriptorSet  (layout vs sets):

    • Layout and pipeline layout are small host structures (host memory). Descriptor sets and descriptor pools may be implemented in host memory or device memory; larger descriptor usage (large arrays, inline uniform blocks, inline immutable samplers, or driver internal structures) can cause real device allocations. Behavior is driver dependent.

  • VkPipeline  (graphics/compute):

    • Creation can cause hidden device and/or host allocations (compiled device binaries, GPU resident state). The spec explicitly allows implementations to allocate device memory during pipeline creation; the pipeline cache and pipeline executable properties APIs can help quantify some of this. Pipeline objects range from a few KB to multiple MB depending on driver, the number/complexity of shaders, and whether the driver stores compiled GPU blobs. Use VK_KHR_pipeline_executable_properties  and pipeline cache queries to inspect pipeline internals.

  • VkPipelineCache :

    • Contains data you can query with vkGetPipelineCacheData  — that returns host-visible data you can size and persist.

  • VkCommandPool  / VkCommandBuffer :

    • Command buffers are allocated from a pool; actual memory holding recorded commands is driver-managed and may be placed in device local memory (GPU command stream) or host memory, depending on driver and OS. Sizes vary widely and are not exposed directly; instrument via driver callbacks or VK_EXT_device_memory_report .

  • VkSemaphore  / VkFence :

    • Binary semaphores and fences may use kernel/OS constructs or small host/device allocations; timeline semaphores hold a 64-bit value and may be backed by device memory on some implementations. Typically small (a few bytes to some KB) but driver dependent.

  • VkSwapchainKHR  and presentable images:

    • Swapchain images are VkImage objects with memory managed by the WSI/driver; they are typically DEVICE_LOCAL and can live in special presentable heaps. Their size is roughly width × height × bytes per pixel × layers/levels, plus alignment padding (obtainable from vkGetImageMemoryRequirements  for images you allocate yourself; for WSI images use provided queries and VK_EXT_memory_budget  to monitor heap consumption).

  • Typical magnitude examples (illustrative only)

    • Instance / layouts / view objects: tens to hundreds of bytes each (host).

    • Small buffers (uniform buffers) / small images: KBs to MBs, depending on dimensions and format — these are the allocations you make explicitly.

    • Pipelines: KBs → multiple MBs (depends on shader complexity and driver caching). Use pipeline executable queries to get an estimate.

    • Command buffer pools / driver command memory: KBs → MBs per many command buffers; driver dependent.

    • These numbers must be measured on your target hardware — they are not constant across drivers.

Tracking
  1. Centralize and wrap all vkAllocateMemory  / vkFreeMemory  calls.

    • Record: VkDeviceMemory  handle, VkMemoryAllocateInfo  size/flags, chosen memory type index, and optionally the VkDeviceSize  and offset for any suballocator logic. Suballocation (one VkDeviceMemory  used for many buffers/images) means you must additionally record your suballocations. Use this table as the authoritative record of committed GPU bytes. (Spec: vkAllocateMemory  produces the device memory payload.)

  2. Track suballocation bookkeeping in your allocator.

    • If you allocate large VkDeviceMemory  blocks and suballocate slices for many buffers/images, account the slices into your counters (otherwise counting only VkDeviceMemory  handles will under- or over-count usage).

  3. Hook creation / bind points to attribute usage.

    • When you vkBindBufferMemory  / vkBindImageMemory , attach which application object is consuming which suballocation — this lets you produce per-buffer/per-image committed usage.

  4. Use VK_EXT_memory_budget  for driver-reported heap usage/budgets.

    • Query VkPhysicalDeviceMemoryBudgetPropertiesEXT  via vkGetPhysicalDeviceMemoryProperties2  to get heapBudget  and heapUsage  values per heap.

    • These are implementation-provided and reflect other processes and driver internal usage; use them as cross-checks and to warn when you approach limits.

    • Use it to see heap usage and budget per heap (useful to spot overall device local vs host mapped heap pressure). This is not per-object, but shows total heap usage and remaining budget. Combine with device_memory_report events to attribute heap changes to objects.

  5. Enable VK_EXT_device_memory_report  for visibility into driver-internal  allocations.

    • This extension gives callbacks for driver-side device memory events (allocate/free/import) including allocations not exposed as VkDeviceMemory (for example, allocations made internally during pipeline creation). Use it for debugging and to catch allocations that your vkAllocateMemory wrapper would miss.

  6. Account for dedicated allocations and imports.

    • You can use VK_KHR_dedicated_allocation  to force one allocation per resource. If you allocate one VkDeviceMemory  per resource you know exactly how many bytes each resource consumes.

    • If an allocation is made with VkMemoryDedicatedAllocateInfo  or via external memory import, count that device memory appropriately — it typically represents a whole allocation tied to a single image/buffer.

  7. Use VK_KHR_pipeline_executable_properties  for pipeline internals.

    • Create the pipeline with the capture flag ( VK_PIPELINE_CREATE_CAPTURE_STATISTICS_BIT_KHR ) and call vkGetPipelineExecutablePropertiesKHR  / vkGetPipelineExecutableStatisticsKHR  to obtain compile-time statistics and sizes for pipeline executables that the driver produced. This helps measure how much space pipeline compilation produced (but it may not show every byte the driver reserved at runtime).

  8. Vendor tools + RenderDoc / NSight / Radeon GPU Profiler.

    • These tools often show GPU memory usage, allocations, and sometimes attribute memory to API objects. Use them to validate your in-process accounting.

Device Memory Report ( VK_EXT_device_memory_report )
  • Last updated (2021-01-06).

  • Info .

  • Allows registration of device memory event callbacks upon device creation, so that applications or middleware can obtain detailed information about memory usage and how memory is associated with Vulkan objects. This extension exposes the actual underlying device memory usage, including allocations that are not normally visible to the application, such as memory consumed by vkCreateGraphicsPipelines . It is intended primarily for use by debug tooling rather than for production applications.

Memory Budget ( EXT_memory_budget )
  • Last updated (2018-10-08).

  • Coverage .

    • Coverage is poor on Android; on other platforms it is above 80%.

  • Sample

  • Query video memory budget for the process from the OS memory manager.

  • It’s important to keep usage below the budget to avoid stutters caused by demotion of video memory allocations.

  • While running a Vulkan application, other processes on the machine might also be attempting to use the same device memory, which can pose problems.

  • This extension adds support for querying the amount of memory used and the total memory budget for a memory heap. The values returned by this query are implementation-dependent and can depend on a variety of factors including operating system and system load.

  • The VkPhysicalDeviceMemoryBudgetPropertiesEXT.heapBudget  values can be used as a guideline for how much total memory from each heap the current process can use at any given time, before allocations may start failing or causing performance degradation. The values may change based on other activity in the system that is outside the scope and control of the Vulkan implementation.

  • The VkPhysicalDeviceMemoryBudgetPropertiesEXT.heapUsage  will display the current process estimated heap usage.

  • With this information, the idea is for an application at some interval (once per frame, every few seconds, etc.) to query heapBudget and heapUsage. From there the application can notice if it is over budget and decide how it wants to handle the memory situation (free memory, move resources to host memory, change mipmap levels, etc.).

  • This extension is designed to be used in concert with VK_EXT_memory_priority  to help with this part of memory management.

Vulkan Memory Allocator (VMA)

  • VMA in Odin .

  • VMA (vulkan memory allocator) .

  • Implements memory allocators for Vulkan, header only. In Vulkan, the user has to deal with the memory allocation of buffers, images, and other resources on their own. This can be very difficult to get right in a performant and safe way. Vulkan Memory Allocator does it for us and allows us to simplify the creation of images and other resources. Widely used in personal Vulkan engines or smaller scale projects like emulators. Very high end projects like Unreal Engine or AAA engines write their own memory allocators.

  • There are cases like the PCSX3 emulator project, which replaced its hand-rolled allocator with VMA and gained roughly 20% extra framerate.

  • Critiques :

    • .

HDR Support

  • Shader code converts high-dynamic-range (HDR) linear color values (often stored in floating formats like R16G16B16A16_SFLOAT ) into display-referred low-dynamic-range (LDR) values (sRGB or the swapchain format).

  • Operations include exposure, clamping, tone curve (Reinhard, ACES, filmic), and gamma or sRGB conversion.

  • Each monitor manufacturer does this differently; it's not standardized .

  • Inputs:

    • HDR color (linear), optionally exposure/exposure texture, bloom, eye adaptation.

  • Steps (example minimal):

    1. Multiply by exposure.

    2. Apply curve (e.g. Reinhard: c/(1+c) , or ACES approximation).

    3. Convert to sRGB/gamma ( pow(color, 1.0/2.2) ) or use proper sRGB conversion.

    4. Output vec4  clamped to [0,1]  into swapchain format (e.g. FORMAT_B8G8R8A8_UNORM ).

Drawing to a High Precision Image ( R16G16B16A16_SFLOAT )
  • Rendering into an R16G16B16A16_SFLOAT  (FP16) image provides:

    • Higher dynamic range and precision (light accumulation > 1.0, less banding, better tone mapping).

    • Freedom to tone-map and convert later.

  • This is the engine-side HDR pipeline .

  • Described technique .

    • From "New draw loop" until the end.

  • Rendering into a separate high-precision offscreen target and then copying/blitting/tonemapping into the swapchain is the standard approach when you need arbitrary internal resolution, higher precision, HDR processing, or when the swapchain does not expose desired formats/usages. The trade-off is the extra memory and an explicit copy/blit or import step; the benefit is control over precision and size. The Vulkan command vkCmdBlitImage  / transfer usage or a shader-based blit/resolve are the usual mechanisms to move from the internal target to the presentable image.

  • The image we will be using is going to be in the RGBA 16-bit float format.

    • R16G16B16A16_SFLOAT  is a common intermediate HDR format (16-bit float per channel). It increases memory and bandwidth (roughly 2× vs 8-bit RGBA) and may affect GPU/VRAM usage and upload/download costs; it also reduces quantization/banding and supports HDR/light-accumulation workflows without clamping at 1.0. The choice is an explicit trade-off: more precision (and headroom for lighting) vs more memory/bandwidth. The format is widely supported for offscreen images but may not be available as a swapchain format on all platforms, which reinforces the decision to render offscreen then convert/tonemap for presentation.

    • This is slightly overkill, but will provide us with a lot of extra pixel precision that will come in handy when doing lighting calculations and better rendering.

  • It's possible to apply low-latency techniques where we could be rendering into a different image from the swapchain image, and then directly push that image to the swapchain with very low latency.

    • Techniques like NVIDIA's "Latency Markers" / Reflex or AMD's Anti-Lag rely on starting rendering work as early as possible, often before  the presentation engine signals readiness for the next frame via vkAcquireNextImageKHR  (Vulkan) or AcquireNextFrame  (DXGI). This necessitates rendering into a separate, persistently available image. The swapchain image index is only provided at acquisition time, making pre-rendering impossible with direct swapchain targets. Documentation for these low-latency SDKs implicitly requires separate render targets.

  • Choosing the image tiling:

    • We can then copy that image into the swapchain image and present it to the screen.

    • VkCmdCopyImage

      • It is faster, but much more restricted; for example, the resolution of both images must match.

    • VkCmdBlitImage

      • Lets you copy images of different formats and different sizes into one another.

      • You have a source rectangle and a target rectangle, and the system copies it into its position.

  • New code for transitioning :

    _drawExtent.width = _drawImage.imageExtent.width;
    _drawExtent.height = _drawImage.imageExtent.height;
    
    CHECK(vkBeginCommandBuffer(cmd, &cmdBeginInfo)); 
    
    // transition our main draw image into general layout so we can write into it
    // we will overwrite it all so we don't care what the older layout was
    vkutil::transition_image(cmd, _drawImage.image, IMAGE_LAYOUT_UNDEFINED, IMAGE_LAYOUT_GENERAL);
    
    draw_background(cmd);
    
    //transition the draw image and the swapchain image into their correct transfer layouts
    vkutil::transition_image(cmd, _drawImage.image, IMAGE_LAYOUT_GENERAL, IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL);
    vkutil::transition_image(cmd, _swapchainImages[swapchainImageIndex], IMAGE_LAYOUT_UNDEFINED, IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL);
    
    // execute a copy from the draw image into the swapchain
    vkutil::copy_image_to_image(cmd, _drawImage.image, _swapchainImages[swapchainImageIndex], _drawExtent, _swapchainExtent);
    
    // set swapchain image layout to Present so we can show it on the screen
    vkutil::transition_image(cmd, _swapchainImages[swapchainImageIndex], IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, IMAGE_LAYOUT_PRESENT_SRC_KHR);
    
    //finalize the command buffer (we can no longer add commands, but it can now be executed)
    CHECK(vkEndCommandBuffer(cmd));
    
    • The main difference we have in the render loop is that we no longer do the clear on the swapchain image. Instead, we do it on the _drawImage.image .

    • Once we have cleared the image, we transition both the swapchain and the draw image into their layouts for transfer, and we execute the copy command. Once we are done with the copy command, we transition the swapchain image into present layout for display. As we are always drawing on the same image, our draw_background  does not need to access the swapchain image index; it just clears the draw image. We are also writing the _drawExtent  that we will use for our draw region.

Etc
  • But  this image still has to be copied/tonemapped into the swapchain format , which is typically limited to 8-bit UNORM unless the OS/driver supports HDR swapchain formats.

  • To actually output HDR to the screen, all of the following conditions must be met:

  1. Swapchain format must support HDR bit depth .

    • Example formats: FORMAT_A2B10G10R10_UNORM_PACK32 , FORMAT_R16G16B16A16_SFLOAT , or platform-specific HDR surface formats.

    • You query available swapchain formats via vkGetPhysicalDeviceSurfaceFormatsKHR .

    • If only 8-bit formats are exposed, you cannot present HDR directly.

  2. Swapchain color space must be HDR-capable

    • Vulkan allows specifying a VkColorSpaceKHR  (e.g., COLOR_SPACE_HDR10_ST2084_EXT , COLOR_SPACE_HDR10_HLG_EXT ).

    • These correspond to HDR transfer functions (PQ/HLG).

    • If the driver/surface does not expose them, the system compositor won’t accept HDR content.

  3. OS and display pipeline must be HDR-enabled

    • Windows: HDR toggle must be enabled in system settings, compositor configured for HDR10.

    • Linux/Wayland: requires HDR support in compositor + driver (still emerging).

    • Android: requires AHardwareBuffer  / SurfaceView  with HDR formats.

    • macOS: Metal swapchains expose extended sRGB/PQ output modes.

    • (Platform docs confirm HDR availability is compositor-driven).

  4. Application side tone mapping & gamut mapping

    • Even if swapchain supports HDR, you generally still render into FP16, then apply:

      • Tone mapping (map wide dynamic range → HDR10/HLG range).

      • Color gamut conversion (usually Rec.709 → Rec.2020 for HDR10).

    • Only then write into the HDR swapchain image.

Profiling

  • Provides your application with a mechanism to time the execution of commands on the GPU.

  • You can specify any pipeline stage at which the timestamp should be written, but many stage combinations and orderings won’t give meaningful results.

    • So while it may sound reasonable to write timestamps for the vertex and fragment shader stages directly one after another, that will usually not return meaningful results due to how the GPU works.

  • You can’t compare timestamps taken on different queues.

  • Sample .

    • We’ll be using 6 time points, one for the start and one for the end of three render passes.

    • The code samples/api/timestamp_queries :

      • Uses QUERY_RESULT_64 | QUERY_RESULT_WAIT , so it's not optimal.

      • The query is made after vkQueueSubmit() .

  • GPU Timing Basics .

    • Vulkan and DX12.

    • Uses QUERY_RESULT_64  and enables the hostQueryReset  for vk.PhysicalDeviceVulkan12Features , using vk.ResetQueryPool()  right after creating the QueryPool .

  • Queries .

  • vkCmdWriteTimestamp2 .

    • This is pretty much the same as the vkCmdWriteTimestamp  function used in this sample, but adds support for some additional pipeline stages using VkPipelineStageFlags2 .

Support

  • Device limits:

    • timestampPeriod

      • If the limit of the physical device is greater than zero, timestamp queries are supported.

      • A timestampPeriod  of 1 means one increment in the result maps to exactly one nanosecond.

      • It contains the number of nanoseconds it takes for a timestamp query value to be increased by 1 ("tick").

    • timestampComputeAndGraphics

      • If it is TRUE , timestamps are supported by every queue family that supports either graphics or compute operations.

      • If not, we need to check if the queue we want to use supports timestamps.

Query Pool

  • A query pool is then used to either directly fetch or copy over the results to the host.

  • Used to store and read back the results.

  • queryType

    • We set to QUERY_TYPE_TIMESTAMP  for using timestamp queries

  • queryCount

    • The maximum number of timestamp query results this pool can store.

Reset
  • Before we can start writing data to the query pool, we need to reset it.

  • vkCmdResetQueryPool

    • At the start of the command buffer.

    • Sets the status of query indices [ firstQuery , firstQuery  + queryCount  - 1] to unavailable.

    • Defines an execution dependency between other query commands that reference the same query.

  • vkResetQueryPool() .

  • QUERY_POOL_CREATE_RESET_KHR

    • During Query Pool creation.

Writing

  • vkCmdWriteTimestamp

    • Will request a timestamp to be written from the GPU for a certain pipeline stage and write that value to memory.

Reading

  • Reading back the results can be done in two ways:

    • Copy the results into a VkBuffer  inside the command buffer using vkCmdCopyQueryPoolResults

    • Get the results after the command buffer has finished executing using vkGetQueryPoolResults

  • vkGetQueryPoolResults()

    • QUERY_RESULT_64

      • Will tell the api that we want to get the results as 64-bit values. Without this flag, we would only get 32-bit values. And since timestamp queries can operate in nanoseconds, using only 32 bits could result in an overflow.

      • If your device has a timestampPeriod  of 1, so that one increment in the result maps to exactly one nanosecond, with 32-bit precision you’d run into such an overflow after only about 4.29 seconds (2^32 ns).

    • QUERY_RESULT_WAIT

      • Tells the api to wait for all results to be available. So when using this flag the values written to our time_stamps  vector are guaranteed to be available after calling vkGetQueryPoolResults .

      • This is fine for our use-case where we want to immediately access the results, but may introduce unnecessary stalls in other scenarios.

    • QUERY_RESULT_WITH_AVAILABILITY

      • Will let you poll the availability of the results and defer writing new timestamps until the results are available.

      • This should be the preferred way of fetching the results in a real-world application. Using this flag an additional availability value is inserted after each query value. If that value becomes non-zero, the result is available. You then check availability before writing the timestamp again.

Occlusion Queries

  • Occlusion queries track the number of samples that pass the per-fragment tests for a set of drawing commands. As such, occlusion queries are only available on queue families supporting graphics operations. The application can  then use these results to inform future rendering decisions.

  • An occlusion query is begun and ended by calling vkCmdBeginQuery  and vkCmdEndQuery , respectively.

  • When an occlusion query begins, the count of passing samples always starts at zero.

  • For each drawing command, the count is incremented as described in Sample Counting . If flags  does not contain QUERY_CONTROL_PRECISE  an implementation may  generate any non-zero result value for the query if the count of passing samples is non-zero.

Pipeline Statistics Queries

  • Pipeline statistics queries allow the application to sample a specified set of VkPipeline  counters. These counters are accumulated by Vulkan for a set of either drawing or dispatching commands while a pipeline statistics query is active. As such, pipeline statistics queries are available on queue families supporting either graphics or compute operations.

  • The availability of pipeline statistics queries is indicated by the pipelineStatisticsQuery  member of the VkPhysicalDeviceFeatures  object (see vkGetPhysicalDeviceFeatures  and vkCreateDevice  for detecting and requesting this query type on a VkDevice ).

  • A pipeline statistics query is begun and ended by calling vkCmdBeginQuery  and vkCmdEndQuery , respectively.

  • When a pipeline statistics query begins, all statistics counters are set to zero. While the query is active, the pipeline type determines which set of statistics are available, but these must  be configured on the query pool when it is created. If a statistic counter is issued on a command buffer that does not support the corresponding operation, or the counter corresponds to a shading stage which is missing from any of the pipelines used while the query is active, the value of that counter is undefined  after the query has been made available. At least one statistic counter relevant to the operations supported on the recording command buffer must  be enabled.

Performance Queries

  • Provide applications with a mechanism for getting performance counter information about the execution of command buffers, render passes, and commands.

  • Each queue family advertises the performance counters that can  be queried on a queue of that family via a call to vkEnumeratePhysicalDeviceQueueFamilyPerformanceQueryCountersKHR . Implementations may  limit access to performance counters based on platform requirements or only to specialized drivers for development purposes.

  • Performance queries use the existing vkCmdBeginQuery  and vkCmdEndQuery  to control what command buffers, render passes, or commands to get performance information for.

Mesh Shaders Queries

  • When a generated mesh primitives query is active, the mesh-primitives-generated count is incremented every time a primitive emitted from the mesh shader stage reaches the fragment shader stage. When a generated mesh primitives query begins, the mesh-primitives-generated count starts from zero.

  • Mesh and task shader pipeline statistics queries function the same way that invocation queries work for other shader stages, counting the number of times the respective shader stage has been run. When the statistics query begins, the invocation counters start from zero.

Result Status Queries

  • Result status queries serve a single purpose: allowing the application to determine whether a set of operations have completed successfully or not, as indicated by the VkQueryResultStatusKHR  value written when retrieving the result of a query using the QUERY_RESULT_WITH_STATUS_KHR  flag.

  • Unlike other query types, result status queries do not track or maintain any other data beyond the completion status, thus no other data is written when retrieving their results.

  • Support for result status queries is indicated by VkQueueFamilyQueryResultStatusPropertiesKHR :: queryResultStatusSupport  , as returned by vkGetPhysicalDeviceQueueFamilyProperties2  for the queue family in question.

Other Queries

  • Transform Feedback Queries.

  • Primitives Generated Queries.

  • Intel Performance Queries.

  • Video Encode Feedback Queries.

Mobile

GLFW
  • An unfortunate disadvantage is that GLFW doesn’t work on Android or iOS; it is a desktop-only solution.

  • SDL does offer mobile support; however, mobile windowing support is best done by interfacing directly with the operating system, such as using the JNI  on Android.

  • While mobile is beyond the scope of this initial tutorial, plans exist to eventually cover it in detail, and Google has excellent documentation .

Pre-Rotation
  • .

  • .

    • You can only query surfaceCapabilities.currentTransform , you cannot set it.

    • If they don't match, the presentation engine will have to do the pre-rotation for you, which has a performance cost.

  • Implementing a full pre-rotate system is reportedly difficult, so many engines avoid it.

  • .

  • .

    • This is a simpler option to implement.

    • "Many engines already do a blit to the final image to the swapchain image, so this is the perfect place to do the pre-rotation".

      • "Basically free and you get performance benefits".

VR

Video Decoding

SPIR-V

  • Standard Portable Intermediate Representation V .

  • SPIR-V .

  • Vulkan’s official shader format (portable, efficient).

  • SPIR-V is a binary format.

  • Works with Metal via MoltenVK.

Compiling
  • You can write GLSL or HLSL and compile to SPIR-V.

    • GLSL to SPIR-V:

      • glslangValidator (from Khronos)

      # Compile GLSL → SPIR-V (Vulkan)
      glslangValidator -V vertex_shader.vert -o vert.spv
      glslangValidator -V fragment_shader.frag -o frag.spv
      
    • HLSL to SPIR-V:

      • DXC (DirectX Shader Compiler)

      dxc -T vs_6_0 -E VSMain -spirv shader.hlsl -Fo vert.spv
      
      • Requires HLSL shaders with Vulkan-compatible semantics.

    • Convert SPIR-V to other formats:

      • SPIRV-Cross (converts SPIR-V back to GLSL/HLSL/MSL)

  • Compiling shaders on the command line is one of the most straightforward options and is the one we'll use in this tutorial, but it's also possible to compile shaders directly from your own code.

    • The Vulkan SDK includes libshaderc , which is a library to compile GLSL code to SPIR-V from within your program.

Advantages
  • The advantage of using a bytecode format is that the compilers written by GPU vendors to turn shader code into native code are significantly less complex. The past has shown that with human-readable syntax like GLSL, some GPU vendors were rather flexible with their interpretation of the standard. If you happen to write non-trivial shaders with a GPU from one of these vendors, then you’d risk another vendor’s drivers rejecting your code due to syntax errors, or worse, your shader running differently because of compiler bugs. With a straightforward bytecode format like SPIR-V that will hopefully be avoided.

Tooling

spirv-cross
  • Cross-compilation

    • Converts SPIR-V shader binaries into high-level shading languages:

      • GLSL (various versions)

      • HLSL

      • MSL (Metal Shading Language for Apple platforms)

      • WGSL (WebGPU shading language)

    • This lets you write shaders once (e.g. in GLSL or HLSL), compile to SPIR-V, then regenerate source for other backends.

  • Reflection

    • Inspects SPIR-V binaries and reports metadata about:

      • Descriptor sets and bindings

      • Push constants

      • Vertex input/output attributes

      • Specialization constants

    • With the --reflect  flag, it outputs this data as JSON , making it easy to drive engine code-generation or runtime Vulkan setup.

  • Ex :

    • spirv-cross scene_vert.spv --reflect > scene_vert.json .

Web

  • No Vulkan support in browsers; you must port to WebGPU or use translation layers.

  • WebAssembly - WASM .

WebGPU (wgpu)

  • WebGPU is a cross-platform graphics API, aiming to unify GPU access across:

    • Browsers (via native support)

    • Native apps (via libraries like wgpu, Dawn, etc.)