Hope you all enjoyed the _journey_ - I for one am definitely glad it's over. What's the worst dumb mistake you've ever made which cost you too much time?
Sometimes I make my functions way shorter and simpler just to debug the whole program more easily, exactly like you did when you modified your fragment shader. I probably make between 5 and 10 of these simplifications. And when it's time to put things back to normal, I forget one of them, and it generates a bug 400 new lines of code later. Then it takes me several days to understand that the problem comes from a wrong simplification I made on purpose during a previous debugging session...
"I don't know. Maybe I'm just doing errors only. Who knows. Maybe i'm a clown. I am a clown." This is the most relatable programming video ever lmaoooo
This just goes to show: whether you have 1 year or 10+ years of programming experience, it happens to all of us. One little oversight, typo, you name it... and it will haunt us forever.
Man, let's appreciate for a second the legend who knew what was up from just a screenshot of the profiler. Game engine programmers are a different breed!
@@The101Superman Also realizing that someone could accidentally place all their vertex data in system RAM. People coding in Vulkan usually really emphasize GPU RAM.
@@gileee I know what you are saying... but you wouldn't be saying the same thing if you were engineering a CPU and had to account for every single wire, connection, transistor, etc. Then when all of that is properly connected and you think you are done, it's a matter of designing the ISA and how both instructions and data are represented... and from there it's a matter of writing your own assembler. Granted, there are many highly sophisticated tools today to help streamline that process, but imagine having to do it by hand without the aid of any modern computer, device or software! They may teach the basics of this in some colleges and universities, but there is just as much that they leave out! Also, I'm 100% self-taught, 0 college education! I took the initiative to follow my own ambitions, desires and goals. I've always been intrigued by electronics.
OpenGL holds your hand and makes many assumptions: "Give me your data and I'll try to draw it accordingly." Vulkan, on the other hand, is explicit: "Tell me everything you want to do and how you want it done, and I'll do it exactly that way!"
Cherno: "Zooms in on CPU only" Me: "Starts laughing both hilarious and empathetic, feeling the pain and relief of not finding that one bug for a week, that someone else points out in a second"
Man. Computer programming is like blindly assembling a machine. You never actually know how all of it works. In fact, you might not even remember how you assembled each piece of code together.
@@gittawat6986 No... just no... Programming is about knowing exactly how everything works. The problem is that schedules are often made assuming you don't need to know everything, and those schedules get broken, because you do need to know how everything works; otherwise you're not engineering, you're gambling (that something will work the way you want it to). When you fly on a modern fly-by-wire airliner, you'd better hope the programmers understood how everything works, because that airliner can't stay in the air unless the software works perfectly. This is why every programmer needs to know assembly, and why interpreters are a fundamentally broken concept (they are obfuscation that makes it inherently more difficult to understand how everything works).
Something eerily similar happened to me, only with my DirectX 12 implementation. It was waaaay slower than the OpenGL one. Like you, I fired up the Nsight profiler and saw PCI throughput being the bottleneck. It turned out I was still using upload heaps for my vertex buffers (instead of default heaps). I even had a TODO comment there saying I needed to fix that. Oh well, learning happened.
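For anyone curious, here's a minimal, hypothetical sketch of what that fix looks like on the D3D12 side, assuming the CD3DX12 helpers from the official samples; device/command list setup is omitted and all names are illustrative:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
#include <cstring>
#include "d3dx12.h" // helper header shipped with the official D3D12 samples
using Microsoft::WRL::ComPtr;

// Sketch: the vertex buffer lives in a DEFAULT heap (VRAM); the UPLOAD heap
// is only a one-time staging buffer, not something shaders read every frame.
void CreateGpuLocalVertexBuffer(ID3D12Device* device, ID3D12GraphicsCommandList* cmdList,
                                const void* vertices, UINT64 size,
                                ComPtr<ID3D12Resource>& vertexBuffer,
                                ComPtr<ID3D12Resource>& staging)
{
    auto defaultHeap = CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_DEFAULT);
    auto uploadHeap  = CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_UPLOAD);
    auto desc        = CD3DX12_RESOURCE_DESC::Buffer(size);

    // GPU-local buffer the vertex shader will actually read from
    device->CreateCommittedResource(&defaultHeap, D3D12_HEAP_FLAG_NONE, &desc,
        D3D12_RESOURCE_STATE_COPY_DEST, nullptr, IID_PPV_ARGS(&vertexBuffer));

    // CPU-visible staging buffer, written once
    device->CreateCommittedResource(&uploadHeap, D3D12_HEAP_FLAG_NONE, &desc,
        D3D12_RESOURCE_STATE_GENERIC_READ, nullptr, IID_PPV_ARGS(&staging));

    void* mapped = nullptr;
    staging->Map(0, nullptr, &mapped);
    memcpy(mapped, vertices, size);
    staging->Unmap(0, nullptr);

    // One-time copy into VRAM, then transition for use as a vertex buffer
    cmdList->CopyBufferRegion(vertexBuffer.Get(), 0, staging.Get(), 0, size);
    auto barrier = CD3DX12_RESOURCE_BARRIER::Transition(vertexBuffer.Get(),
        D3D12_RESOURCE_STATE_COPY_DEST, D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER);
    cmdList->ResourceBarrier(1, &barrier);
}
```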
I have so much respect for graphics programmers. I had to write some OpenGL stuff at uni and was dramatically overwhelmed by the fundamentals and terminology. You can spend so much time and energy on this topic; I can't even begin to imagine how brutal Vulkan must be. Also, really cool content, thank you!
The LunarG Vulkan debug layers do in fact give a bit of performance validation if you ask, quite handy, if fairly minimal right now. My first guess was that it was just hitting V-Sync 😄. It was pretty obvious that it was buffer residency as soon as I saw that PCI bandwidth cheese block, but I wondered if you had somehow managed to allocate your render target images as CPU-only!
I feel you! I knew what was coming when you showed us that PCI throughput graph :) I love Vulkan and the control you get, but with great power comes great responsibility. Looking forward to your next video.
Hi Cherno, thanks for this graphics programming content. Partly inspired by you, I've started learning Vulkan and D3D12, and I found VMA's brother, D3D12MA, and have now submitted pull requests for 2 features, all in CMake. I followed the CMake path instead of the Premake you prefer, but I really thank you for introducing me to this content.
Thank you for uploading this video. It's very educational. I mean, many people face problems like that, and once it's over they breathe a sigh of relief and move on. However, you took the time to tell us about it, and that's really important.
As soon as I saw the PCIe graph, I screamed NOOOO and laughed. I knew exactly what was coming. You had good intuition about what you might be doing wrong, but you were guessing the wrong target.
Amazing video! I think the idea is great. Seeing you solve a problem within your own project is nice, especially performance issues and the like that make you reach for these profiling tools! Keep these coming! PS: There are also the Intel Graphics Performance Analyzers; maybe you should check them out.
Honestly, this was pretty comforting, because I was expecting some far stranger Vulkan behavior. This is certainly unfortunate and hard to debug, but the reason for it being slow is very obvious.
Hehe, the joys of having all of the control with Vulkan! Nsight is such an incredibly useful tool for such things. I spent a few hours last weekend wondering why vkCmdDrawIndexed() was not drawing anything, only to realise that even when not using instancing, instanceCount needs to be set to 1, not 0. D'oh! As someone writing a new Vulkan-based engine, I'm really enjoying the series. Keep going! :)
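For anyone else who hits this, the call in question looks like the sketch below; the argument values are illustrative, the key is the instanceCount parameter:

```cpp
// Non-instanced draws still need instanceCount = 1; 0 silently draws nothing.
vkCmdDrawIndexed(commandBuffer,
                 indexCount, // number of indices to draw
                 1,          // instanceCount: must be >= 1 even without instancing
                 0,          // firstIndex
                 0,          // vertexOffset
                 0);         // firstInstance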
One time I set the RenderPass attachment store op to DONT_CARE and didn't even realize it. I was banging my head for like an hour wondering why my window was displaying garbage. Vulkan basically did my rendering, saw the enum and just dumped the result into undefined territory, because, after all, the programmer said to discard the output. This was before I had access to NSight. I wonder how fast I would've solved the mistake if I did have it.
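For reference, this is the field in question; a minimal sketch assuming a typical single-sample swapchain color attachment:

```cpp
// STORE keeps the rendered result; DONT_CARE lets the driver throw it away.
VkAttachmentDescription colorAttachment{};
colorAttachment.format         = swapchainFormat; // assumed from swapchain setup
colorAttachment.samples        = VK_SAMPLE_COUNT_1_BIT;
colorAttachment.loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR;
colorAttachment.storeOp        = VK_ATTACHMENT_STORE_OP_STORE; // not DONT_CARE!
colorAttachment.stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
colorAttachment.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
colorAttachment.initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED;
colorAttachment.finalLayout    = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;
```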
Thank you for sharing this, and most importantly the investigation process/method. It's really helpful! Now I need to get a Windows PC; there are no tools like that for macOS.
In the Vulkan validation layers you can enable the "PERF" level, which gives hints about various sub-optimal uses of the Vulkan API. I am not sure it would catch this one, but it is worth a try. There is also VK_LAYER_LUNARG_assistant_layer, which is essentially designed for these purposes and should detect this issue. And there is nothing wrong with Vulkan allowing you to do this or other silly things; using CPU-only memory for some things sometimes makes perfect sense. I am glad you found the issue, fixed it, and learned something new.
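A rough sketch of wiring this up at instance creation; note this assumes the newer setup, where the assistant-layer checks were folded into VK_LAYER_KHRONOS_validation as "best practices" validation:

```cpp
#include <vulkan/vulkan.h>

// Sketch: enable the "best practices" (former assistant layer) checks.
const char* layers[] = { "VK_LAYER_KHRONOS_validation" };

VkValidationFeatureEnableEXT enables[] = {
    VK_VALIDATION_FEATURE_ENABLE_BEST_PRACTICES_EXT
};
VkValidationFeaturesEXT features{ VK_STRUCTURE_TYPE_VALIDATION_FEATURES_EXT };
features.enabledValidationFeatureCount = 1;
features.pEnabledValidationFeatures    = enables;

VkInstanceCreateInfo createInfo{ VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
createInfo.pNext               = &features; // chain the validation features in
createInfo.enabledLayerCount   = 1;
createInfo.ppEnabledLayerNames = layers;

VkInstance instance = VK_NULL_HANDLE;
vkCreateInstance(&createInfo, nullptr, &instance);
```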
I've had multiple of those small things in my Vulkan renderer, fortunately never this one; the tutorial I followed while implementing mine used staging buffers. (: Anyway, what differences did you notice when switching to the Vulkan Memory Allocator? Just lifting the limit on the number of allocations, or were there performance differences too? Implementing that allocator has been on my to-do list for some time now. (:
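For anyone who hasn't tried it, allocating a device-local vertex buffer through VMA looks roughly like this; a sketch using the classic VMA_MEMORY_USAGE_GPU_ONLY flag (newer VMA versions prefer VMA_MEMORY_USAGE_AUTO), with allocator and vertexDataSize assumed to exist:

```cpp
#include "vk_mem_alloc.h"

// The flag this whole video hinges on: GPU_ONLY = device-local, not system RAM.
VkBufferCreateInfo bufferInfo{ VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
bufferInfo.size  = vertexDataSize;
bufferInfo.usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;

VmaAllocationCreateInfo allocInfo{};
allocInfo.usage = VMA_MEMORY_USAGE_GPU_ONLY; // device-local memory

VkBuffer buffer;
VmaAllocation allocation;
vmaCreateBuffer(allocator, &bufferInfo, &allocInfo, &buffer, &allocation, nullptr);
```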
Hey, thanks for taking on the trouble of Vulkan for us, man. I stopped developing with Vulkan (for now I only use OpenGL + DirectX 11), because I tried it and saw how much time it takes to deal with even the simplest things.
24:42 This is implementation-defined. The GL_STREAM/STATIC/DYNAMIC_DRAW hints are exactly that: hints. From what I understand, these used to mean something, but driver vendors have more or less stopped caring about them, since misusing them was so common that it was better to just let the driver decide.
I think the newer ARB_buffer_storage API (core in OpenGL 4.4) has a more detailed take on this with glBufferStorage(): instead of hints, it uses flags really similar to Vulkan's. GL_CLIENT_STORAGE_BIT will make the buffer CPU-side if possible, GL_MAP_PERSISTENT_BIT will make the buffer persistently mapped but likely CPU-side too, GL_DYNAMIC_STORAGE_BIT will make the buffer writable from CPU data while still being GPU-side, and 0 will make the buffer fully invisible to the CPU, and therefore fully on the GPU. Persistent mapping is often used for staging buffers, btw, and you can use glCopyBufferSubData to copy the buffer's contents.
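A small sketch of how those flags map onto a staging setup; size and vertexData are assumed, and the driver still has the final say on placement:

```cpp
#include <cstring>

GLuint staging, vbo;
glGenBuffers(1, &staging);
glGenBuffers(1, &vbo);

// Persistently mapped, CPU-side staging buffer
glBindBuffer(GL_COPY_READ_BUFFER, staging);
glBufferStorage(GL_COPY_READ_BUFFER, size, nullptr,
    GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
void* ptr = glMapBufferRange(GL_COPY_READ_BUFFER, 0, size,
    GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);

// Device-local vertex buffer: flags = 0 means no CPU access after creation
// (GPU-side copies like glCopyBufferSubData are still allowed)
glBindBuffer(GL_COPY_WRITE_BUFFER, vbo);
glBufferStorage(GL_COPY_WRITE_BUFFER, size, nullptr, 0);

// Stream: write into the mapped pointer, then copy staging -> vbo on the GPU
memcpy(ptr, vertexData, size);
glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0, size);
```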
I like those types of hard-won nuggets of information. People spend their lives looking for the most useful and precious nuggets. I have it in my head to build a model of all the pieces of, say, the Hazel engine, and animate it. I know there are always multiple layers of abstraction between the conceptual object and the metal, and that may be useful to illustrate as well. We'll see if I ever make any actual progress; I've got other demands on my time.
Okay, that's the best kind of video I want to see you making: having a bug that's extremely hard to track down, analyzing it, and actually solving it. That is the most valuable thing for me, especially when using a modern graphics API such as Vulkan. As a matter of fact, I started getting into Vulkan, and even clearing the screen to a color does not work in all cases. On Win32 it works perfectly, including re-creating the swapchain, but on Linux X11 it renders but crashes on XDestroyDisplay(). Such bugs are so annoying, because they prevent you from continuing with other things :-( Also, I don't understand why validation simply does not work on Linux ("No instance extension detected") while on Win32 it works just fine; both have the SDK installed and it's the same system (multiboot) O_o
Hello from Thailand. I am by no means a programmer of any sort, but watching this I can see how developers actually work. This is a good video to listen to while working from home... it sets me in a kind of working mode.
The Vulkan road is rocky at best. I am struggling with synchronization, but the knowledge you gain is worth it, because you learn how the GPU actually works. Super cool video.
Love this video. I always tell young developers: "No matter how long you've been programming, you will make silly mistakes." Unit and regression tests are your friends. If you don't have good test coverage, simple typos will bite your butt.
Here's a fun fact: say you have some HOST_COHERENT memory that you're reading from a shader; say you're storing a vertex buffer for your big point cloud (50M+ points) there. You can take the total size of your vertex buffer, divide it by the time of the GPU pass using that vertex buffer, and you'll get your PCIe speed almost exactly (give or take 2%). Your shader literally reads your RAM by streaming it over PCIe at full bandwidth, introducing practically no extra latency. I don't know about you, but I find it practically magical how they (the hardware guys) achieve that.
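To make that concrete with made-up numbers (both the vertex size and the pass time below are hypothetical):

```cpp
// 50M points at 16 bytes each (e.g. vec3 position + padding), streamed once per pass
double bytes     = 50e6 * 16.0;      // = 800 MB of vertex data
double passTime  = 0.055;            // hypothetical GPU pass time: 55 ms
double bandwidth = bytes / passTime; // ~14.5 GB/s, right around PCIe 3.0 x16
```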
The first thing I was thinking of was not the pixel shader but the vertex shader. Then, after looking at the PCI occupancy, it was definitely obvious that some buffers were stored in CPU memory. I was developing a DX12 renderer, and using an upload (staging) buffer is like the first thing you do. I hadn't used GPU Trace until now, though. Now I know, thank you.
Two minutes into the video I had a clue what your problem was (yeah, it's annoying that among all the verbosity Vulkan exposes, one little thing was the game changer :sadface:). It's nice that you went through the journey and pointed out the importance of understanding the hardware. Vulkan and DX12 are very explicit about management and control over the hardware involved, and it was also nice to see a capture used to dig deep into the problem. Nice video, keep up the good work. Hazel is looking very cool!
Well, I knew that uploading vertex data to GPU memory is important, but not that it has such an impact. Good to know, and great to see how you tackle bugs like this.
That is possible in OpenGL with similar settings, and there are tons of details in these rendering pipelines that all need super careful attention! It's not always an error, either, because sometimes you do want to set it up that way (e.g. a water or cloth simulation that updates the vertices on the CPU). With these low-level APIs you have to say exactly what you want done.
Anyone happen to know if the Vulkan Memory Allocator can be used to "protect" graphics memory? I remember watching a WebGPU video where the presenter mentioned that vertex buffers might not make it into the spec because they can't control the security around them (preventing access beyond bounds). But I was thinking the lower-level APIs (or, more likely, the WebAssembly host) could be used to protect GPU memory from being used maliciously. Perhaps the WebAssembly runtimes/hosts need a memory manager that works like the VMA to be able to control access to memory.
I had a similar issue recently in DX12. Staging memory in an UPLOAD heap is not CPU-cached, so CPU reads from it are slow. It turned out I needed host-cached memory in some cases, because client code was reading the mapped memory back.
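If anyone hits the same thing, a quick sketch of the fix as I understand it: D3D12_HEAP_TYPE_READBACK gives you write-back cached CPU pages, so reads are fast (names are illustrative, device setup omitted):

```cpp
#include <d3d12.h>
#include "d3dx12.h" // helper header from the official D3D12 samples

// Hypothetical sketch: a buffer the CPU will read back from.
void CreateReadbackBuffer(ID3D12Device* device, UINT64 size, ID3D12Resource** outBuffer)
{
    // READBACK heaps are CPU-cached (write-back), unlike UPLOAD heaps,
    // which are write-combined and painfully slow for CPU reads.
    auto heapProps = CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_READBACK);
    auto desc      = CD3DX12_RESOURCE_DESC::Buffer(size);
    device->CreateCommittedResource(&heapProps, D3D12_HEAP_FLAG_NONE, &desc,
        D3D12_RESOURCE_STATE_COPY_DEST, nullptr, IID_PPV_ARGS(outBuffer));
}
```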
I've considered switching my code to use VMA, but I haven't gotten around to it yet and my current DIY pool allocation scheme seems ok. This is 100% something that would happen to me, though, and I'll be extra careful if I do switch lol. Glad you found that one wrong flag instead of deciding to rewrite some large unrelated portion of your code.
Wasn't the GPU memory almost empty then? No one noticed? It's pretty scary that such a small difference can cause such a big change... I heard recently that someone got an order-of-magnitude performance increase by adding some no-op instructions to his functions so they would have a better layout in cache (hot code in one cache line) xD Programming is brutal sometimes.
Well, it's almost empty one way or another. Sponza geometry is, what, like 8 MB worth of buffers? Textures and render targets were in VRAM anyway, and they're a good chunk bigger altogether.
Fellow Vulkan renderer veteran here. I'm sorry, I feel your pain. As soon as I saw the PCI throughput that high, I was thinking: he must be copying everything over every frame. Congrats on fixing this one-line hell.
Many moons ago I was using DirectX and trying to get my game to recover from alt-tab, or from minimizing and restoring the game window. When the window was reactivated, I had to reload all the textures into the GPU from system memory, which meant I had to keep a copy in system memory. And I didn't have enough GPU RAM to store all the textures, so I killed two birds with one stone by implementing a caching system with an LRU list. Whenever I went to draw something, if my GPU buffer for a given texture was either not loaded or invalid, I'd reload it from its system memory buffer and keep track of how much total GPU RAM was in use (for textures); when that passed a threshold, I would repeatedly unload the GPU buffer for the least recently used texture (that was still loaded) until I (theoretically) had enough GPU RAM to load the texture I was trying to load. Before I created that system, I suffered from a lot of artifacts and slowness and headaches. After I implemented it, things were much smoother and faster. Many, many moons ago.
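For fun, here's a toy version of that LRU scheme; all names are hypothetical, and the actual GPU load/unload calls are left as comments since they'd depend on the graphics API:

```cpp
#include <list>
#include <unordered_map>
#include <cstddef>

// Sketch: evict least-recently-used textures until the new one fits the budget.
struct TextureCache {
    std::list<int> lru;                                    // front = most recently used
    std::unordered_map<int, std::list<int>::iterator> pos; // id -> position in lru
    std::unordered_map<int, size_t> sizes;                 // id -> GPU size in bytes
    size_t used = 0, budget;

    explicit TextureCache(size_t budgetBytes) : budget(budgetBytes) {}

    // Called on every draw that references texture `id`.
    void Touch(int id, size_t bytes) {
        if (pos.count(id)) {
            lru.erase(pos[id]);                            // already resident: just bump
        } else {
            while (used + bytes > budget && !lru.empty())  // make room first
                Evict(lru.back());
            used += bytes;                                 // (re)load into GPU RAM here
            sizes[id] = bytes;
        }
        lru.push_front(id);
        pos[id] = lru.begin();
    }

    void Evict(int id) {
        used -= sizes[id];                                 // unload GPU buffer here
        lru.erase(pos[id]);
        pos.erase(id);
        sizes.erase(id);
    }
};
```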
Why did this video come on my dash today, if it was posted 2 years ago? I keep doing that. I keep necro-posting without realizing it. Argh. Oh well. (I guess it must be because I'm just now finally looking at Vulkan.)
We've all been there; it doesn't matter how experienced you are. My favourite bug hunt was when my colleague and I sat in front of the screen for a day trying to figure out why the theme colors were wrong (we were developing a low-level UI engine at the time). Every damn piece of code was perfectly fine. Then night fell and the whole theme changed. That's when it hit us: we had been looking at the wrong dataset all along, because the theming engine switched at a certain time of day to compensate for daylight (which is important in navigation software). Such a facepalm moment :D
A bit late to the party, but I got hung up on the "#type" preprocessor directive in the shader, which doesn't appear in any documentation. Is this just an addition of your own, to let the engine split one file into the respective shader types before compiling them?
For anybody reading this in the future: not only is it possible to store things like vertex buffers in system RAM, the data HAS to be in RAM before it is copied to the GPU; that's just how computers work. What you are really controlling in a graphics API is a hint about when and how you would like data like vertex buffers to be passed around, and Cherno accidentally set it up to basically re-copy the vertex buffer on every draw command. I took one look at that profile, saw the high Async Copy Engine usage, and immediately realized it was excess, repeated data copying. I had my suspicions before that, but only because CPUs and GPUs are so fast that literally 99% of performance problems are data copies from backing memory; calculations are often basically free compared to I/O. I call it a hint because the graphics API is just that, an API, and the underlying OpenGL/Vulkan/D3D implementation has an active program running in the background on the CPU that governs the actual behaviour of the graphics context. While you might request some copying, it will just do it when it makes sense, as long as it conforms to the spec. Technically, on some systems you can use DMA (Direct Memory Access) to write directly into the GPU buffer, but this is objectively worse than having a well-orchestrated graphics context manage the copy when it is sensible. Anyway.
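In Vulkan, the explicit version of that one-time copy is a staging upload; a minimal sketch, with buffer creation and synchronization omitted and all names illustrative:

```cpp
// Assumes stagingBuffer was created host-visible (HOST_VISIBLE | HOST_COHERENT)
// and already filled, and deviceLocalVertexBuffer was created with
// VK_BUFFER_USAGE_TRANSFER_DST_BIT on DEVICE_LOCAL memory.
VkBufferCopy region{};
region.srcOffset = 0;
region.dstOffset = 0;
region.size      = vertexDataSize;
vkCmdCopyBuffer(commandBuffer, stagingBuffer, deviceLocalVertexBuffer, 1, &region);
// Once this submission completes, every draw reads VRAM instead of
// streaming the vertices over PCIe each frame.
```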
A few questions: Do you have any good tips for learning Vulkan? What exactly is a renderer, and what would a renderer interface look like? And what defines a Vulkan "context" in your renderer; in other words, what does Vulkan "context" mean?
Insane story haha, just a testament to how useful these debugging and profiling tools are. Hope this is the kick we need to stop debugging with printfs.
I wish school/college would teach you to use debuggers effectively. Instead we get tested in an IDE with no debugger, and the only help you get is printfs.
The validation layers have performance suggestions; they should have raised a flag for this one, since it is one of the most common mistakes. I totally understand your frustration! If you ask me, Nsight also should have told you to look at where your resources were uploaded and that your app was limited by it. Great share! Thanks!
I'm not a programmer, but do you have a common Z-axis denominator for all the stuff that's placed on the ground? That could save a lot of clock cycles, just by having all that stuff optimized to be more logical. Like I said, I'm not a programmer.
Hello Cherno, great video. At 12:25 you talk about how you can't use some Nsight features on non-RTX cards, besides the whole tool being chained to Nvidia GPUs. Won't GPUOpen provide a good alternative in these cases, or at least a complement? Does GPUOpen replace Nsight and do everything Nsight does? No idea, but it runs on everything, which is better than nothing. I recommend checking out the video on the AMD YouTube channel titled 'AMD RDNA™ 2 - Radeon™ GPU Profiler 1.10'. It is an overview of what the GPUOpen tools can do.
Yeah, reading your buffers from system memory will give you nasty performance issues. I optimized a little Pong clone simply by adding staging buffers to move the data from host memory to device memory. It was nice watching a game go from 10 fps to over 200 fps just by moving to device-local memory.
Great video. Welcome to Vulkan memory flags xD Complexity explosion xDD And then there are the depth-buffer flags... by changing a single flag you can go from 100x faster to 100x slower.