UBG: Vulkan and hardware ray tracing

Exciting news! In an earlier post I described my conflict between doing stuff and having something to show for it. Well, it seems that I managed to get over it - at least partly. For the past 1.5 months or so I've been rather busy learning Vulkan and hardware ray tracing, and I've gotten rather far already!
Though this time I had a bit of help. Instead of doing everything from scratch, I decided to build upon the assets of a Finnish Game Jam 2015 game I was part of, WatDo. Btw, that year was a bit special: I didn't write a single line of (game) code, focusing solely on building the game tilemap with Tiled. Anyway - let's start by briefly talking about how I went about doing the new things, and then about the things themselves: learning Vulkan and RT, plus other engine-level stuff such as asset loading and abstractions.

I got inspired (more on that in a later post?) to finally build my other long-time dream - a game with dynamic 2D soft shadows and diffuse global illumination. Considering everything, I thought that my best bet was to learn hardware ray tracing, which in turn meant learning Vulkan. I had zero knowledge of RT, but I had meant to learn Vulkan on many occasions, as it is a modern high-performance graphics API with many improvements over even the most modern OpenGL - especially in the area of multithreading, which is a major focus area in my new "engine" in USGE. I was still rather sceptical of RT, but decided to start with learning Vulkan, as it was the prerequisite and would be very useful just by itself, too.

Learning Vulkan

As I knew very little, it was easy to just start working through a tutorial and let it speak for itself. And for the longest time there wasn't even anything that could have been shared, as it was all just initialization and more initialization. In the end it took me two weeks to complete the tutorial and have a multisampled and textured triangle on the screen.

Quite a pleasant surprise, actually. A long time ago, when I initially stumbled upon Vulkan and that very same tutorial, I estimated that it would take at least a month or two to complete it! And now I finished it in just two weeks! I even managed to backport my code hot reloading system, and with the new explicit APIs of Vulkan (and my own GLFW windowing code), it was slightly easier to accomplish than in USGE, which used OpenGL and Silk.NET.Windowing. Of course the initial discovery work done on the topic was extremely important; I'm hoping to share some details in a VLOG post later. But no promises.

The tutorial I used was perhaps intentionally focused on NOT building abstractions on top of the Vulkan functionality, so after I finished it I slowly began building those missing pieces for the most common operations, concurrently with learning new stuff. But with an asterisk: I intentionally tried not to abstract too many things, as I still know very little about Vulkan and its use cases. I've also read too many horror stories of people building engine abstractions and losing all will to continue after that work is "done". And in the olden times that's exactly how I rolled, and it was glorious! Cool stuff can be done even without great abstractions, to a point. So let's do just that and see where we end up - especially when there are some rather major engine-level things I'm yet to do (and don't yet know how best to do them), and all of those can affect things by a lot.

Learning ray tracing

Confident in my abilities and the success with Vulkan, I started learning about ray tracing. A task I was expecting to fail at. But after reading a few papers and watching a couple of videos on the topic (some of which were far more helpful than others), I managed to astonish myself even more. It took me only a week to produce an image containing ray traced elements, and that's on top of the work spent on tilemap rendering stuff.
But three weeks. For learning Vulkan and ray tracing from scratch. I'm extremely happy with such an accomplishment.

And at this point I was eager to publish something, too! Something small. But shooting a whole VLOG post still felt like a daunting task, and even writing a blog post would have been too much. A microblog, however, was just the right size, so I tweeted about my accomplishments :)

I continued working on RT at a great pace, and shared a few more images on Twitter in a short timeframe. The images also deserved some kind of descriptions, so I did my best while staying inside the 280 character limit. But that felt too restrictive, and I yearned for a good old blog post. Which was still too much. So I stopped posting, and turned my full attention to just doing stuff.

And just kept on doing stuff. While the initial ray traced images were easy to produce, further improvements were increasingly harder. At some point I finally decided that the RT work had reached a checkpoint, and I could / should / had to work on some other areas for a change.

The final upgrades were about adding glowing things to the map, and the process of producing the final map meshes was taking longer than I was happy with. Short iteration time is my thing, and now it was broken. But hope was not lost, no way.

Porting the asset system

For the previous iteration of my "unicorn" game project USGE, I had produced an asset pipeline system which handled asset loading and metadata, and generated an Android-inspired R-file. The system was also meant to do initial asset pre-processing automatically, but I hadn't gotten around to implementing that just yet. And now I clearly needed it, so I got to work.

I started by drafting a fluent API for configuring such a processing pipeline, while on a train. But when I got to implementing it, I encountered something shameful. While I've started to feel confident in my own programming abilities, the kind of object-oriented principles - and especially the soup of generic constraints - needed to implement the API in a compile-time safe way eventually proved too difficult :( Or at least in that state of mind I didn't manage to finish it. So I took a step back and implemented the configuration API in a way that is validated only at runtime, and at least got the stuff done. A great psychological win, actually.
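To show the trade-off, here's a minimal sketch of what such a runtime-validated fluent pipeline config could look like. This is my illustration in Python, not the actual USGE API - all names (`Pipeline`, `then`, `validate`) are made up, and the point is just that type compatibility between steps is checked when the pipeline runs, not by the compiler:

```python
# Hypothetical runtime-validated fluent pipeline config. In the
# compile-time safe version, the input/output type pairing would be
# enforced by generic constraints instead of the validate() check.

class Pipeline:
    def __init__(self, name):
        self.name = name
        self.steps = []  # list of (step_fn, input_type, output_type)

    def then(self, step_fn, input_type, output_type):
        self.steps.append((step_fn, input_type, output_type))
        return self  # enables fluent chaining

    def validate(self):
        # Runtime check: each step's output type must match the next input.
        for (_, _, out_t), (_, in_t, _) in zip(self.steps, self.steps[1:]):
            if out_t is not in_t:
                raise TypeError(
                    f"{self.name}: {out_t.__name__} -> {in_t.__name__} mismatch")

    def run(self, value):
        self.validate()
        for step_fn, _, _ in self.steps:
            value = step_fn(value)
        return value

# Usage: bytes -> str -> list of lines
pipeline = (Pipeline("tilemap")
            .then(lambda b: b.decode("utf-8"), bytes, str)
            .then(str.splitlines, str, list))
print(pipeline.run(b"a\nb"))  # ['a', 'b']
```

The mismatches still surface as clear errors - just at configuration time rather than at build time, which in practice catches them on the first run.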

In the end I now have a system where I have a bunch of .meta.json files in the assets directory. They contain pipeline configs and references to concrete files. The pipelines themselves can then transform the definitions and loaded assets, and finally produce a set of asset definitions and data files for the game to load - plus the R-file and an accompanying asset manifest.

In the previous iteration I had all the metadata codegen'd into the R-file, but that was a lot of work. So now that metadata lives in the manifest file, which is read early during application startup. The manifest itself mostly just contains file names for each asset, but it can also hold special instructions for the asset loaders (like precomputed image sizes), as well as stuff required during development for asset "hot-hot" reloading (yet to be implemented).
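For illustration, a .meta.json entry and the resulting manifest entry could look roughly like this - the field names here are my guesses for the sake of the example, not the actual schema:

```json
// assets/terrain.meta.json — pipeline config referencing a concrete file
{ "pipeline": "texture", "files": ["tiles/terrain.png"] }

// manifest entry — file name, loader hints, and dev-time metadata
{
  "name": "terrain",
  "file": "terrain.bin",
  "loaderHints": { "width": 512, "height": 512 },
  "checksum": "sha256:…"
}
```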

The asset loaders themselves work in parallel most of the time and use the manifest to know what to load. For example, in the case of textures:
  • Allocate Vulkan image handles and get memory requirements; pixel size known via manifest (currently non-threaded, but rather easy to improve).
  • Allocate one large buffer for image data.
  • Load all images in parallel from disk (or later on, from memory). This includes IO, parsing, GPU uploads and mipmap generation.
  • Cleanup staging buffers.
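The steps above can be sketched roughly like this. The Vulkan side is stubbed out and the names are illustrative; only the orchestration shape matters here - serial allocation driven by manifest sizes first, then parallel per-image work into one shared buffer:

```python
# Rough sketch of the texture loading sequence: serial handle/offset
# setup, one large staging buffer, then parallel per-image loads.
from concurrent.futures import ThreadPoolExecutor

def load_textures(manifest_entries):
    # Step 1 (serial): compute offsets — byte sizes come from the
    # manifest, so no file has to be opened yet.
    offsets, total = [], 0
    for entry in manifest_entries:
        offsets.append(total)
        total += entry["byte_size"]

    # Step 2: one large staging buffer for all image data.
    staging = bytearray(total)

    # Step 3 (parallel): IO + decode per image; in the real thing this
    # also covers the GPU upload and mipmap generation.
    def load_one(i):
        entry = manifest_entries[i]
        data = entry["read"]()  # stand-in for disk/memory read + decode
        staging[offsets[i]:offsets[i] + len(data)] = data
        return entry["name"]

    with ThreadPoolExecutor() as pool:
        loaded = list(pool.map(load_one, range(len(manifest_entries))))

    # Step 4: staging cleanup happens after the GPU copies complete.
    return loaded, bytes(staging)
```

Because every image's offset is fixed up front, the parallel workers never contend for buffer space.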
Asset loading is one area where I'm especially satisfied with Vulkan's multithreading support. Things just work out of the box, no magic required.

Oh, and all the different types of assets are loaded in parallel, too. The loading order is guided by the manifest. It typically doesn't matter, but profiling showed that the map data took a lot longer to load than any other asset, so I implemented a special configuration option for marking some assets as "Expensive". They are loaded first, which saves me about 20ms of loading time with that simple change. The very first loader thread starts loading the map, and while that is happening all the other parallel threads manage to finish their work. If, on the other hand, the other assets were loaded first, they would finish loading a bit faster, but we'd end up waiting for the big asset for a lot longer.
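The effect is easy to demonstrate with a toy scheduling simulation. The durations below are made up (only the "one big asset among small ones" shape matches my situation), but they show why starting the expensive asset first shortens the total load time:

```python
# Toy simulation of the "Expensive first" ordering: with a fixed number
# of loader threads, starting the longest job first lowers the total
# time, because the other workers finish the small jobs while it runs.
import heapq

def makespan(durations, workers=2):
    # Greedy list scheduling: each job goes to the earliest-free worker.
    free = [0.0] * workers
    heapq.heapify(free)
    finish = 0.0
    for d in durations:
        start = heapq.heappop(free)
        finish = max(finish, start + d)
        heapq.heappush(free, start + d)
    return finish

assets = [5, 5, 5, 40]  # ms; the 40 ms "map" dwarfs the rest
naive = makespan(assets)                           # big asset last
expensive_first = makespan(sorted(assets, reverse=True))
print(naive, expensive_first)  # 45.0 40.0
```

With the big asset last, a worker only starts it after burning time on small jobs; expensive-first lets the small jobs hide entirely inside the big one's load time.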
There's a lot of things I could improve in the asset system, but for now it is good enough, allowing me to focus on other things.

Other load time improvements

While the load times were the driving reason for the asset system, the system of course also has its primary use :p But performance is a nice focus area. And in parallel with the finishing touches on the assets, I also worked on improving other things which impact the loading times. As I briefly mentioned, I've built a hot-reload system for code. When the game host starts, it creates the desktop window and sets up the Vulkan context. Then it dynamically loads the game's assemblies and executes the code. When a change is detected, the same window is reused, and the code is reloaded. Initially the host was not aware of the assets, but I've since improved things, allowing me to cache them.

Also, I managed to improve the reload process in the host by parallelizing assembly reloading, Vulkan context recycling (not required, but helps to point out resource leaks) and asset manifest processing. This alone saved me about 300ms.

Currently the asset caching is limited to just the file data, but even that yielded an improvement of about 100-200 ms. The rather unoptimized map asset is 150 MB, and sadly takes a while to load even from an NVMe disk. But by caching the data in the host, that can be skipped. Thanks to the checksums built into the manifest, the host needs to re-read only the files which have changed between reloads. When the game code itself runs, it can use the data from memory, skipping the disk reads. As a final tiny optimization, the cached asset data is allocated in long-lived pinned arrays, hopefully reducing GC pressure by just a tiny bit.
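The checksum-gated cache boils down to something like the sketch below (my illustration, not the actual host code - the names are made up): files are re-read only when the checksum recorded in the manifest no longer matches the cached copy.

```python
# Minimal sketch of a checksum-gated host-side asset cache.
import hashlib

class AssetCache:
    def __init__(self):
        self._data = {}  # asset name -> (checksum, bytes)

    def get(self, name, checksum, read_fn):
        cached = self._data.get(name)
        if cached and cached[0] == checksum:
            return cached[1]  # unchanged since last reload: skip the disk
        data = read_fn()      # changed (or first load): hit the disk
        self._data[name] = (checksum, data)
        return data

def checksum_of(data):
    # Stand-in for the checksums baked into the manifest.
    return hashlib.sha256(data).hexdigest()
```

On a reload, only the handful of assets whose manifest checksums changed trigger actual reads; everything else is served straight from the host's memory.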

I'd like to improve things by another notch by having the host keep the textures themselves in memory, but currently it's not worth it, as the multithreaded loading is already fast - plus it would greatly complicate recycling that Vulkan context. The slowest asset was the map data, and now it's fast thanks to being cached. So all the assets are now loaded in about 50-100ms via the cache.

After the assets (including shader binaries) have been loaded, the game can concurrently compile the graphics and ray tracing pipelines (the latter taking 250ms even with a pipeline cache!) and build the initial ray tracing acceleration structures. And once the pipelines are ready, the game can also start initializing (later on with even more threading) all the game objects while waiting for the AS build.

In the future I could further optimize things by starting to build the AS immediately once the map asset is loaded (before textures), and the pipelines could start compiling right after the required shader bytecodes are loaded. But at this point I deemed it best not to complicate things too much. Oh, and of course I'm yet to optimize the map mesh itself. There's A LOT of shapes that could be merged. There's also no index buffer yet.

I probably forgot something, but with all these changes I managed to cut the total reloading time from almost 4 seconds down to about 600 ms, so it's almost instant again. The initial load takes about 2.6 seconds, but thankfully that needs to happen only rarely. So it's about 1 - 1.5 seconds from pressing "Build" in Visual Studio to reloading all game code and assets and getting the first new frames on the screen :)

It's a joy to develop with an iteration time that short.

Further RT work

With the basics in check once again, I delved into another must-have feature: ray tracing denoising. Initially I tried to do my own filtering and got surprisingly close by building a pseudo k-nearest filter, but the performance was awful. When I later read more about the techniques, it seemed that I had been very close to how things should be done properly. Instead of having an adaptive window size in a single pass, multiple smaller passes should be made.
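The multi-pass idea can be shown with a deliberately simplified 1D sketch: several cheap small-radius passes spread a sample about as far as one wide-window pass, which is roughly how à-trous-style denoisers avoid a single expensive adaptive window. This is only the spatial core - real denoisers also weight samples by normals, depth and albedo:

```python
# 1D toy: iterated small blurs vs. one wide blur.

def box_blur(signal, radius):
    # Plain normalized box filter with clamped edges.
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def multi_pass_blur(signal, radius, passes):
    # Several cheap passes instead of one wide adaptive window.
    for _ in range(passes):
        signal = box_blur(signal, radius)
    return signal

# A single noisy spike gets spread out either way; three radius-1
# passes reach as far as one radius-3 pass, at lower per-pass cost.
noisy = [0.0] * 4 + [9.0] + [0.0] * 4
wide = box_blur(noisy, 3)
cheap = multi_pass_blur(noisy, 1, 3)
print(max(wide), max(cheap))
```

The per-pixel cost of each small pass stays constant, so the total work grows linearly with the number of passes instead of quadratically with the window size.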

After giving up on implementing my own filter, I managed to gather enough courage to try out Nvidia's OptiX denoiser. Unfortunately it was only available via their driver, and required using their C++ SDK. Before committing to it fully, I did some manual work with their command line sample: exported some frames from my game, converted them to EXR, ran the denoiser, and converted the files back to PNG. And it all looked rather promising!

Then, to my astonishment, I managed to build a C DLL wrapping the functionality, and integrate it into my game. The performance sucked due to expensive buffer copies via the CPU, but I could see denoised stuff, and it all looked correct! That was most excellent. I then spent a few days optimizing the buffer copies, eventually managing to share the Vulkan buffer directly with CUDA thanks to VK_KHR_external_memory_win32 and cuImportExternalMemory. I still have one more, smaller buffer to skip the copy on, but on a 1200x1000 R32G32B32(A32 unused) buffer the denoising process now takes about 4 - 5 ms on my RTX 3080 Ti. That's still a bit slow, but perfectly usable!

Unfortunately the high frame rates allowed me to see that even in OptiX's temporal mode there's considerable variance between frames, because I just don't have enough good samples per pixel. Only at about 64 spp do things even begin to look passable - and 1 - 2 spp is the generally agreed upon target for real-time...

I had hoped that I wouldn't need to implement importance sampling in a game this simple, but it seems I was wrong. I was victorious with Vulkan and the basics of RT, but I fear this is a battle I might not be able to win. It was nice knowing you all.

Next I'll be researching ReSTIR, DDGI techniques and the like. Just wanted to write this advance-eulogy first.

* * *

But perhaps not all is lost. I mentioned having some success with my own denoiser. I should also try AMD's open source one for comparison - or perhaps write a special version combining it with my own. While it might sound a bit arrogant to even try writing my own denoiser that beats the state-of-the-art from industry leaders, there's one thing on my side: I haven't yet spoken about this in depth, but I'm not trying to ray trace the whole image - just the lighting. Once I have the ray traced per-frame lightmap, I can render the game objects with a normal rasterizer, and look up the lighting from the smoothed-out RT image. For example, the rather distorted yellow grids in the image at the start of this post are of no concern, as a separate raster step will draw over them.

Oh, almost forgot. I've spent extra effort in making the ray tracing happen in real 2D space. But if I want to truly simulate how light behaves, I must do it in 3D, and have all the game objects have a 3D representation: in real life, if a light beam hits a floor, some photons bounce to the ceiling and the walls, and then back to another point on the floor. That isn't currently happening, leading to light beams that don't illuminate rooms the way they should. It might be possible to do some post processing to simulate it, but I'm not too hopeful. "Fun" times ahead.

Anyway. Quite an update. I really hope to be able to return victorious some day.
