Is "now try it with the fast inverse sqrt" the programmer version of how every musician content creator is asked/forced to attempt Rush E and other meme songs
Modern CPUs actually have built-in optimized instructions for exactly these types of things. For example, Intel CPUs have an operation that approximates 8 inverse square roots at once with a throughput of one per clock cycle. These types of operations are called SIMD, and they are majorly underutilized. Edit: not quite perfect accuracy - 2^-14 is the maximum relative error they guarantee. For a game though, that difference will be practically indistinguishable.
Right? Game developers need to pay attention. We wouldn't have so many issues with modern games being unoptimized if they used more advanced optimization. Doom Eternal is a perfect example of that
@@mariotheundying Layers and layers and layers of virtualization, security, memory safety, compatibility, APIs, threading, and more before you even get close to running instructions on the bare metal
Nice video! Just one small point: If you want to use invsqrt(x) to calculate sqrt(x), you can use x*invsqrt(x) instead of 1/invsqrt(x). That might save a few cycles? But I still agree that the quake3 fast inverse square root algorithm is probably not that useful on N64.
To do a regular sqrt you would just use a different magic constant:

float fast_sqrt(float x) {
    int i = *(int *)&x;
    i = 0x1fbd1df5 + (i >> 1);
    return *(float *)&i;
}
If I understood that table from the start of the video correctly, does that imply that with 0 Newton iterations, just the fast inverse portion would take 6 cycles? How long does * take? That may deserve another video...
Did a little reading; it sounds as though hardware implementations of sqrt may have taken a few different routes. Common ones apparently were a lookup table for a rough approximation followed by a number of Newton iterations, or alternatively a process similar to long division. Without digging too deeply, my guess is that might be why the cycle counts for division and sqrt are the same. Since long division is one of the slower approaches to dealing with floats, it could be that the fast inverse sqrt is the way to go: the N64's hardware was developed before the algorithm was discovered, so its implementation could be beaten in software. Newton iterations roughly double the precision of the result, so a better initial guess can rapidly decrease the time cost.
I'm kind of surprised it was made popular by Quake III. I could've sworn Mike Abrash put it in the original Quake, but it's been years since I saw the source.
@@KazeN64 ah yes, DOOM with all of its real time reflections and shading and floating point numbers that it definitely used. /s Seriously though, I thought it was funny, because DOOM is pretty much always the first game people think of when id Software is mentioned. So sometimes people think the FastInverseSqrt is also from DOOM, even though the game doesn't even use floats at all, because they were way too slow at the time.
Silas' idea with the error cancelling is very cool, there are probably many other examples where we can reduce the error of one problem by dividing it into two sub-problems with opposite error
Error is a fickle issue. While I don't know for sure, strategies like these usually have smaller error in the typical range, but in return have larger error when values get extremely big or small.
@@M0liusX I definitely believe that, there are usually very specific cases where one algorithm is better than another, and in general there is no "optimal" algorithm which always works best for all situations, it always depends on the specific example.
We use similar algorithms in land surveying to estimate coordinates with high accuracy. By taking a GPS reading and comparing it to a GPS reading at a control station, we can see which satellites are visible in both readings. By subtracting the differences between the readings, the accuracy of the initial GPS reading goes from 5-10m (15-30 feet) to 2-3cm (1-1.5 inches). This only works when the same satellites are visible in both readings; if the control reading is too far away, the algorithm wouldn't work. But it cancels out errors like the density of the atmosphere, refraction errors, and random errors, since you work with more data.
The truth about why my dad never came back from the grocery store 😢😢😢 Edit: Yooo my mind was blown w/ Silas’s idea. Mathematically it seems obvious but getting that much accuracy improvement with the 2 fourth-root calcs multiplied together is insane! Thanks for the good content as always!
I know the project is aimed at maintaining real-hardware compatibility, but maybe consider patching an N64 emulator and giving the console an extra Rambus channel to see how far your game could go?
I don't know where your journey takes you, but the amount of knowledge you've taken upon yourself is painfully large. You are like a character from literature: the Giver. In that story, a certain character carries the memory of the world as it was. At this point it's safe to bet you are among the top 5 most knowledgeable N64 programmers/developers in the world.
He has the relative luxury of only concerning himself with one set of hardware (the N64) and one set of software (the relevant programming language(s)) for decades. Almost no (game) programmer out there enjoys this kind of laser focus. Instead it's 2-3 things at a time and every other year one thing is discarded and another thing added.
I used to think there was no reason for a Mario 64 sequel to exist, but seeing how much there was to improve on the base engine, and how many level possibilities it opens up, it's obvious they should have made one
8:57 ooo, reminds me of Romberg integration (more generally Richardson extrapolation), add together approximations at different step sizes to cancel out one error term in the taylor expansion. Takes me back to intro to scientific computing :)
8:46 calling an inverse fourth root a "fourth inverse square root" really made me lose track of what was happening for a moment, lol. Great video though
At 7:23: I’m not an expert on this, but wouldn’t any number 1.17549435082e-38 or smaller be rounded up due to floating point accuracy, eliminating any potential issue?
that number is the lowest possible floating point number before the exponent hit -127 - if the number was smaller than this, it'd loop around and you'd get a number closer to the max float representable
At 1:30 you seem to be suggesting that the best way to get sqrt(x) from 1/sqrt(x) is to take the reciprocal via a division, but you could also just multiply by x since x/sqrt(x) = sqrt(x). I don't know if this changes anything you're saying.
When I was tutoring, I told students that there is a time and place to use the fast inverse square root: the place is on 32-bit Windows PCs and the time was 1999. The trick only works for single-precision IEEE 754 floats, and essentially any modern PC made after probably 2007 or so is so fast that the gains from the float hack aren't worth it lol. It's important to teach though, because I think it demystifies the IEEE 754 standard and helps students understand that it's essentially just scientific notation, but with 2 instead of 10 as your base. And if you go into embedded systems, you do need to understand how that stuff works, because you will eventually need to convert a 754 float to a DEC float (or vice versa), and that has its own little hacks.
The same trick works for doubles too, but it needs a different magic constant (which is listed on Wikipedia). Otherwise, yeah, its role has been obsoleted by the dedicated reciprocal square root approximation instruction, where available.
this is a great example of how optimization is a very specific problem that always requires profiling before you can say something is gonna be a "good optimization"
Is there any way to upgrade the RAM's speed on an N64? I'm a bit curious just how much horsepower is locked away behind having 4 miners share one pickaxe.
The N64 uses a type of RAM called RDRAM. That's where you would need to start your search. There are faster versions of it that were made for PCs, but they were uncommon and the last ones were made around 2003. The PS2 also uses RDRAM, so that might be an easier source to find. After you find some RDRAM, you would need to find out how to make it work. The RDRAM is connected to the RCP, so you would need to find some way to speed up the RDRAM without affecting the RCP's timings.
I'm using fixed-point, so I kinda need the algorithmic versions, not the mantissa hack. Interested in researching these... in a few years, when I have time to spare for this issue.
Couldn’t you multiply x * invsqrt(x) instead of doing 1 / invsqrt(x) ? I think that would be a little faster but I’ve never programmed for the N64 specifically.
You can use the inverse square root algorithm in more cases. The other best example is probably physical simulations, more exactly simulations of gravity, electromagnetism, etc., since those follow the inverse square law. It's probably obsolete by now though, and I have no idea why you would ever want to build physical simulations on an N64 lmao.
@@KazeN64 Hmm... I believe it has less to do with the texture format and more to do with how the texture is rendered. I still think the odds of this working (and working well) aren't completely likely but it could be worth a try.
Hi Kaze, very cool video. I have a question about a Mario 64 port for the PlayStation Classic: it runs very well (60fps), but in underwater scenes the framerate drops below 60fps. I watched your videos and asked myself: is it possible to build a port for the PlayStation Classic with your enhanced version of the source code, and would it reduce or completely remove the stutters (framerate drops)?
My best guess: some optimizations might be helpful, but Kaze has made a lot of very N64-hardware-specific improvements, and since the hardware is different, it's not at all certain they would work as well.
1:11 well use builtin sqrt, then run a parser in cpu cache that grabs this result, and performs inversion within cpu cache, so you only would calculate the square root and write to cache buffer periodically when needed
As a starting coder, I really do find this all really interesting, but boy is it hard to understand. I love how dedicated you are, and I believe people could write college theses about your optimizations alone, but truthfully, I don't even get why the square root is needed here. Oh well, there is always more to learn, and I hope in the future I can write code that is optimal enough that it doesn't clog weaker machines
The square root isn't really a coding thing, it's a math thing. Vectors are things that have magnitudes and directions. Let's say we have a vector (3,4); if we wanted to find its size, we would basically use Pythagoras: sqrt(3^2 + 4^2) = 5. That's why the square root is needed here. As for why he needs to find the square root, he does mention in the video that it's for vector normalization.
It is not surprising that the unique challenges of PC games, which relied heavily on an awful x87 floating point unit, do not necessarily translate to a completely different architecture
Quick question from a rando: Are you an emulator developer or something? This video showed up seemingly randomly in my recc's, and while I find it interesting enough on its own, I get the feeling I'm not your usual target audience...
Question for Kaze or anyone else: why is making an N64 hack console-compatible difficult? Figured I would ask here, so excuse me being somewhat off topic. I understand emulators aren't completely accurate, but what makes code that runs on an imperfect emulator different from code that runs on console? I've searched this a lot, and community forums are so terribly toxic and get offended by everything! It's a real turnoff, and I never see an answer besides "wahh stop offending the creators" or "duh, the emulators are different" 😮💨🥴. Of course I mean no disrespect for the hard work put into this! :) I legitimately prefer using a console but never get a proper answer. Kaze is amazing, much respect, so I figured I'd ask here.
I think there are multiple things making it a lot harder to develop for the N64 directly:
1. Emulators are easier to test on. That means that if you make a change or develop a tool, you will test it on emulator first. By default it will already work on emulator, and then you have to put in additional work to make it go well on console.
2. A real console can't be run alongside a debugger, so any incompatibilities are very hard to find.
3. Emulators are a lot less picky than real consoles. Many exceptions might not be emulated correctly, so where a console would just freeze, an emulator might produce a slightly wrong result or even just work. (E.g. you can see this with ROM reads - they need to be 8-byte aligned on N64, but can have any alignment on emulator.)
4. Not every creator has access to an N64 to begin with to even test that their mod works on it.
5. The N64 has a pretty tight performance budget, and emulators just don't.
@@KazeN64 Thanks for the wonderful answer, much appreciated. This really sums everything up nice and neatly. Hopefully this helps others who have the same question as well. I really respect the hard work put into hacking/production, so naturally I'm not trying to stir the pot :) haha. This puts things in perspective very well. Your second point about debugging seems very important; that makes a lot of sense. The trial and error of using a real console definitely suggests much more intense work. Not having access to the hardware seems to be a bigger issue than I thought, understandable. Thanks Kaze, much appreciated.
So, I noticed at 8:54 that you used pointers to alias the floats like the original algorithm, which is UB. So, the language lawyer in me was wondering why you didn't use a union instead. Does the compiler generate worse code in that case?
i don't think it'd make a difference, even a memcpy compiles the same. i didn't even realize this was UB at the time (and i do plenty of floating point bithacks in this codebase, so i'm not sure this type of UB can be avoided without compromising performance)
@@KazeN64 Yeah, I guess that makes sense. After all, C++'s bit_cast is just a constexpr memcpy, so compilers should know how to optimize it. The only other problem I can think of with pointer casting is that if floats and u32s have different alignments on N64, one of the pointers might end up being unaligned, but since everything ended up working, I guess that's not the case. Or maybe it is, but MIPS doesn't care?
yeah, most games that run at 60fps are static camera games that can afford to run with no z-buffer. that halves the memory reads/writes when rendering a pixel.
if you take the Taylor expansion of sqrt and invert it, it's something like 1/sqrt(x) = 1/(1 + x/2 + ...); the shift-right-by-1 is the divide-by-2. Doom? It was Quake. If you want the best guess, set the derivative of the error to zero, f'(x) = 0, to find the best-guess limits
you could always also do the sine trick: store some function/derivative values of sqrt() and 1/sqrt() over the range [0,2], then quadratically interpolate within the range. Very fast combined with bit shifts; divide by 4 (or shift right by 2, also in the FPU)
yep, the bit shift + quadratic interpolation through sampled derivative points gets you below 1e-5 accuracy for both normal sqrt and inverse sqrt, and it's super fast: only bit shift, add and multiply operations. If you divide by 2 (or 4) to get the argument between 1 and 2, you get the best sqrt approximation; below 1 the quadratic approximation gets worse.
>why aren't you using the famous algorithm that optimized quake for x86 instruction sets? >10 minutes explaining that mario 64 isn't quake and an N64 isn't x86 in a way normal people would understand this channel always fun, people always stupid.
I didn't actually expect it to work at all, because I remember you said that square roots on the N64 are relatively fast. Of course Kaze will find a way to eke out that tiny extra bit of performance and then some.
The gas tank's just fine, actually, especially with the Expansion Pak ... The problem is that the hose from the gas tank to the engine is the size of a curly straw.
They started with designs used for SGI workstations and cut stuff waaay down. They were not going to make a new CPU design for the N64 and clocking it slower wouldn't have saved any money. It's not like today where we have ARM cores at every performance level you might want.
Never underestimate the value of a single cycle to a bedrock function that gets called endlessly. Like back when one of my teenage hobby projects was a primitive 3D renderer (for modding a game with), written in BASIC (specifically VB4), with basically zero access to any external, actual 3D APIs. Displaying textures in real time was far outside my abilities, but I had the coordinate transformations for wireframe rendering optimized as much as I could think of, including a few scenarios where I resorted to the oft-maligned GOSUB/RETURN type calls instead of making a function call to handle it, simply because it was the faster mechanism.
Both editions of the book Hacker's Delight mention the fast inverse square root (or as they call it, an Approximate Reciprocal Square Root Routine) and give various improvements of the algorithm. In the books they already mentioned FISR without Newton iterations: > deleting the Newton step results in a substantially faster function with a relative error within ±0.035, using a constant of 0x5F37642F.
It's amazing to me how you can make deeply complex topics so easy to understand by explaining them based on a use case. Programming is like black magic to me, yet I can follow your videos along without any issue. God bless.
All of your videos are so incredible. I love how you mix maths and humour in the way you do. Even if I can't comprehend everything, I love each and every second
have you thought about writing some small research papers for these findings and experiments? like, even if they're extremely specific for your use case, they're still cool as hell and might even help someone some day
For the graph at 9:19, it probably would have been more clear if you'd labeled it as "Error (%) vs Cycles," since that's what the numbers actually represent. In both cases, a lower number is better, which is the inverse of what is implied by "Accuracy vs Performance" (which suggests that a higher value is more accurate or has higher performance).
I was certain it was absolutely "useless" in the Nintendo 64 hardware, I'm amazed you actually found a place to use it! Also, you know your audience very well, @06:58 I chuckled and @07:23 I almost laughed. Great content!
I wonder if you’ll ever do optimization for the n64 bios… I know Nintendo didn’t give very many developers access to the bios but there are a few games which load a different bios into the n64 and allow even more optimized code to be run for whatever game was developed…
You mean the RSP microcode, and that's a whole different problem. I don't know how documented the microcode is. Nintendo certainly didn't want developers messing with it; they were just supposed to use the ones Nintendo provided.
The inverse square root sure went through a journey, didn't it? From being cumbersome to calculate, to an ingenious bit hack, to becoming its own CPU instruction.
Kaze Emanuar: The only person that can explain to me how to optimise a 26 year old game in high technical detail I can't even begin to understand, while keeping me invested until the end.
I made my own using weighted quadratic Beziers. It's only 4% less accurate than the sine and cosine operations using the sqrt, at a fraction of the performance cost. I know it can be improved, but so far so good. :3c
To summarize: _Script kiddie:_ "HAY, you should use this really famous algorithm because it's a more efficient way of performing floating point calculations!" _Nintendo:_ "Yeah, we thought of that, dude. We stuck a chip in the system that does that _specific_ calculation all on its own because doing it any other way was hella inefficient." _Kaze:_ "Amateurs..." _Nintendo/kid:_ "What was that?" _Kaze, LVL. 99 Script Wizard:_ "AMATEURS!"