Finding the BEST sine function for Nintendo 64 

Kaze Emanuar
269K subscribers · 317K views

Published: 28 Sep 2024

Comments: 1.3K
@KazeN64 · 1 year ago
To try everything Brilliant has to offer, free, for a full 30 days, visit brilliant.org/KazeEmanuar. The first 200 of you will get 20% off Brilliant's annual premium subscription.
@official-obama · 1 year ago
5:30 It's Wikipedia, you can edit it
@CandiedC · 1 year ago
Considering you've added hi-res models into SM64 before, is it possible to add Render96 to it?
@SuperM789 · 1 year ago
@@CandiedC Render96 looks like shit, so even if you could, it would be better not to
@RanEncounter · 1 year ago
@17:34 the derivative is wrong. The correct derivative is 3ax^2 + b, not 3ax^2 + x.
@KazeN64 · 1 year ago
@@RanEncounter Oh whoops, that's a typo
@prismavoid3764 · 1 year ago
Using ancient Indian mathematical formulas to make the funny red man go bing bing wahoo even faster, even after five massive optimizations that already give wild speedups. Classic Kaze.
@lerikhkl · 1 year ago
"funny red man go bing bing wahoo" is a great band name
@williamdrum9899 · 1 year ago
The trick to fast computing is to use integers as much as possible
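A hedged sketch of what "use integers as much as possible" often means in practice: fixed-point arithmetic. Everything below (the Q16.16 format, the helper names) is illustrative, not from the video; whether it beats floats depends entirely on the target CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative Q16.16 fixed-point format: 16 integer bits, 16 fractional
   bits, stored in a plain 32-bit integer. All names here are hypothetical. */
typedef int32_t q16_16;

#define Q_ONE (1 << 16)

static inline q16_16 q_from_double(double d) { return (q16_16)(d * Q_ONE); }
static inline double q_to_double(q16_16 q)   { return (double)q / Q_ONE; }

/* Multiply two Q16.16 values: widen to 64 bits so the product doesn't
   overflow, then shift the extra 16 fractional bits back out. */
static inline q16_16 q_mul(q16_16 a, q16_16 b) {
    return (q16_16)(((int64_t)a * b) >> 16);
}
```

On CPUs without an FPU (or with a slow one, like many 90s machines) this replaces a float multiply with an integer multiply and a shift; on modern hardware with fast FPUs it is usually not a win.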
@Swenglish · 1 year ago
@@lerikhkl I love their song "everybody cheating but me".
@l3rvn0 · 1 year ago
@@williamdrum9899 I wonder if this is really true. A float has about 7 decimal digits of precision; if I use an integer and treat the first 7 positions as decimals, would that make my code much faster and more efficient?
@ThePondermatic · 1 year ago
This is part of Bhaskara's legacy, one he could not have possibly imagined hundreds of years ago.
@totalphantasm · 1 year ago
Man, it's always crazy how even a game that's considered "poorly optimized" like Mario 64 is still optimized incredibly well for its time. Nowadays developers say "oh, most people's computers these days should have about 16 gigs of RAM? Why waste time making it run on 8 then?" And then the game still comes out rushed and bad.
@augustdahlkvist3998 · 1 year ago
Lol, Mario 64 isn't well optimised at all. They didn't even turn on the automatic compiler optimisations, and it regularly drops below 30fps even though the N64 should be able to handle it fine. Modern games at least shipped with compiler optimisations enabled.
@benjaminoechsli1941 · 1 year ago
What you have to remember is that this is a console game, from the age of cartridges. There was no such thing as "updates" or bug fixes, so studios were forced to give their devs time to get a game good enough to sell.
@FlamespeedyAMV · 1 year ago
@@dekufiremage7808 because they keep hiring morons
@incognitoman3656 · 1 year ago
@@benjaminoechsli1941 There were re-releases, thankfully, but changing the engine is a drastic thing to do.
@totalphantasm · 1 year ago
@@augustdahlkvist3998 I'm not super familiar with this topic in particular, but it's a fact that there's almost nothing in game development that's as simple as "turning on an optimization". I would bet money that the issue isn't as simple as turning on a feature. There are some things that only look obvious and easy with 3 decades of retrospect. Even then, most modern game studios probably couldn't get Mario 64 to run at "drops below 30 with big objects", if I'm being honest.
@darkwalker19 · 1 year ago
Me (a person who has a literal degree in computer science, has built multiple games, and has over 5 years of industry experience): this guy is insane and so much is going in one ear and out the other, but this is so interesting to watch
@johnmoser3594 · 1 year ago
Take a computer engineering degree and learn about signal processing; stuff like this will look more familiar.
@jackmeyergarvey759 · 24 days ago
To be fair to yourself, this is not a topic most computer scientists would deal with. It is much closer to engineering.
@cerulity32k · 7 months ago
4:20 Small correction: the table uses 16KB (20KB with cos) of memory, because floats are 4 bytes wide.
@KazeN64 · 7 months ago
True! I was thinking of 0x4000 and just said 4KB because I'm so hexbrained lmao
@CRITICALHITRU · 2 months ago
@@KazeN64 Pin a comment with the corrections, or update the description?
@Creatively_Bored · 1 year ago
I have a hunch about the angle-flipping necessity: tangent. The angles in the 1st and 3rd quadrants (going ccw around a circle on the Cartesian grid) have the same tangent values, and the same goes for the 2nd and 4th quadrants. So you have to distinguish whether you want the angle from the 1st or 3rd quadrant (and the 2nd or 4th quadrant). MS Excel (and Google Sheets, iirc) have the two-argument arctan function to treat the angle calculation as a vector problem, but since Mario 64 doesn't have this, you have to use a from-scratch angle discrimination setup, much like what Kaze ended up using.
@a2e5 · 1 year ago
atan2 is a godsend when it comes to not having to roll your head everywhere to figure things out. You can define it from a classic arctan though -- just some if cases for signs and stuff.
@porterleete · 1 year ago
The video's great! I love seeing how things like this are done on old hardware. It seems to me like it would be hard to understand how anything would be best optimized for modern hardware, with speculative execution, weird hardware tricks, and strange security flaws: more like magic than science. Even though I don't really program much, let alone optimize anything, optimizing code for old hardware seems like something that can actually be learned in a concrete and non-mystical way by a human being, even if it takes effort.
@KazeN64 · 1 year ago
Modern hardware is optimized for bad code. I think optimizing code on more modern hardware is less about nitty-gritty details like this and more about using best practices. There's also the issue that most modern games run on many different consoles, so you don't just have one metric to maximize. A lot of these N64 optimizations would slow modern machines down.
@incognitoman3656 · 1 year ago
@@KazeN64 Ever since the PlayStation, gaming has gotten more techy, to the point of modern renderers being mostly power and less "magicks". Thanks to you, we can see their bag of tricks!
@KopperNeoman · 1 year ago
@KazeN64 The more powerful the hardware, the harder it is to run anything bare metal. On modern systems, it can even be impossible, requiring you to bypass even the UEFI itself.
@Ehal256 · 1 year ago
@@KazeN64 It's still optimized for good code, it just also handles bad code a bit less poorly. Bad code on modern hardware can still easily waste 90% of memory bandwidth, etc. Where I think modern hardware shines is making decent code (not optimized, but not wasteful) run very fast.
@aldendwyer · 1 year ago
This channel is slowly becoming the best showcase of the general outline of optimizing code on YouTube. I love it.
@tumm1192 · 1 year ago
I remember a book called "Linear Algebra Done Right"; on page 115 it says the best fifth-degree approximation for sin(x) on the interval [-pi, pi] is given by 0.987862x − 0.155271x^3 + 0.00564312x^5.
@KazeN64 · 1 year ago
Looks like a minimax approximation, which has the issues I've outlined in the video.
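For reference, the quintic quoted above is cheap to evaluate: in Horner form on x² it costs four multiplies and two adds. A sketch (function name mine; coefficients copied from the comment):

```c
#include <assert.h>
#include <math.h>

/* The degree-5 approximation quoted from "Linear Algebra Done Right",
   valid on [-pi, pi], evaluated in Horner form on x^2. */
static double sin_approx5(double x) {
    double x2 = x * x;
    return x * (0.987862 + x2 * (-0.155271 + x2 * 0.00564312));
}
```

Note that, as an odd polynomial, it is exact at 0 and antisymmetric for free.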
@tumm1192 · 1 year ago
@@KazeN64 Awesome.
@youdj_app · 1 year ago
When I was coding 3D DOS games 25 years ago, I couldn't understand why my sin table was not that fast; it's just one line of code with no CPU computation... Cache/memory bottlenecks make a huge difference, I can confirm.
@williamdrum9899 · 1 year ago
See, I program on CPUs that didn't have a cache. So I don't need to worry about it, since memory access is equally slow!
@Anon.G · 1 year ago
@@williamdrum9899 Then you have to start worrying about other problems!
@andrewdunbar828 · 1 year ago
Yeah, you'd have to go back to the '80s, when we had no caches, to avoid that. I used to play with this stuff in the 8-bit days until about the Amiga era.
@williamdrum9899 · 1 year ago
@@Anon.G Haha yeah. At least with C compilers you can actually finish a project. Writing an entire game in assembly is a monumental task even on an 8-bit platform.
@williamdrum9899 · 1 year ago
@@andrewdunbar828 I love the 68000, it's so easy to work with
@AsymptoteInverse · 1 year ago
It's always interesting to see how constraints inspire creativity and inventiveness. I have very limited programming experience, but a fair bit of writing experience, and having a constraint (the syllable count of a haiku, a particular meter in a poem, a more arbitrary constraint like not using a certain letter at all) really pushes you to play around with the elements you've got and try to achieve a pleasant result in spite of the constraint. It strikes me as weirdly similar to working around memory and cache constraints. Reminds me, too, of a book I read written by an ex-Atari programmer. (I think it was Chris Crawford.) He talked about the clever things they tried to do with graphics by executing code in the handful of machine cycles between CRT raster lines.
@squiddlecomplains9829 · 1 year ago
Thanks Kaze! This was the perfect video to watch while running head first into a wall!
@jimmerjammy · 1 year ago
I've always been fascinated by how computers implement trigonometry. It's more difficult than arithmetic, and there are many ways to do trig functions. Another method is the CORDIC algorithm, which involves adding smaller and smaller angles until it reaches the input. It was developed for use in 8- and 16-bit processors. The TI-83 calculator, which uses a chip similar to the Game Boy's, uses this algorithm.
@christopherdigirolamo9879 · 1 year ago
"I never expected anything I've learned in school to actually be a useful life skill." Apparently hacking out a polynomial with a low-error approximation of sine for SM64 optimization in 2023 is a useful life skill lol
@MrMeow-dk2tx · 1 year ago
It is, it's called fun. You ever had it in your life?
@christopherdigirolamo9879 · 1 year ago
@@MrMeow-dk2tx Yessir
@Johncw87 · 1 year ago
It is when making Mario 64 romhacks is one of your bodily functions.
@spyczech · 1 year ago
His brain has a built-in RISC coprocessor just for romhacking
@vinesthemonkey · 1 year ago
Math is useful for physics and statistics, if you want to consider "real world applications"
@Frenchnostalgique · 1 year ago
I got a welcome ego boost watching this video by thinking of "only use one fourth of the curve" and "interweave sine and cosine" before you mentioned them. Thanks for that, I love these optimization-trick videos
@CAEC64 · 1 year ago
Thanks for enlightening me on the vertical and horizontal speed in 2D games, Kaze
@BGP00 · 1 year ago
NO WAY CAEC
@Rihcterwilker · 1 year ago
This is a coding channel that uses Mario to show how programming works.
@amineaitsaidi5919 · 1 year ago
Exactly, and this is why I am here.
@marioluigijam3612 · 1 year ago
Coding AND MATH!!!! My favorites
@amineaitsaidi5919 · 1 year ago
I just realized how unoptimized Mario 64 is, storing a whole table of sines. It was just pure rush.
@telanis9 · 10 months ago
@@amineaitsaidi5919 No, a lookup table is actually quite a good optimization. Like Kaze said, computing the actual Taylor series to a high degree of accuracy would be incredibly slow. Nintendo didn't understand just how much their memory bus design constrained everything until much, much later, so at the time it made perfect sense to try to save on CPU with the table.
@JonathanThe501st · 1 year ago
"What if I told you there's room for improvement?" Kaze, this is YOU we're talking about. You'll probably be able to get Super Mario 64 to run on a TI-86 calculator if given enough time.
@Diablokiller999 · 1 year ago
Have you ever tried the CORDIC algorithm? It needs a (quarter?) sine table but calculates everything just by adding and shifting, so basically one-cycle operations, and the accuracy depends on the number of iterations you give the algorithm. Could be too taxing on the bus, but maybe give it a try? Not hard to implement, though... I used it on FPGAs to reduce computation unit requirements by a lot, since multipliers are rare on those. Golem has a great article about it (and in German :3)!
@MiguelDiaz_ · 1 year ago
I love that Kaze's romhacking has evolved into the most insane solutions to code optimization problems
@gzusfishlives8897 · 1 year ago
@2:12 I was today years old when I learned you can get on the castle. I never even considered it before. The vid was cool, but thanks for that
@MindGoblin41 · 5 months ago
Really interesting. It's easy to understand the x/y speed part at the beginning, but sadly I'm having a hard time with the use of sin/cos in 3D bone movement, and I wish I understood from the video why it's necessary.
@alexiosasclpios4830 · 1 year ago
Funny, I actually worked on a project that used the same sine approximation you came up with for the fifth order, but for the third equation (for the last coefficient) we calculated it by minimising the distance between the function and the ideal sine. To do so, we ended up equating both integrals between 0 and pi/2 and deduced the result. In the end we had: f(x) = a*(pi/2)*x + b*((pi/2)*x)^3 + c*((pi/2)*x)^5 with a = 4*(3/pi - 9/16), b = 5/2 - 2*a, c = a - 3/2. I haven't checked whether it's better or worse than the function you found, but when we tried a strictly derivative-based approach (what you did, kinda), it ended up doing worse for our purpose. Edit: didn't see the new video, my bad.
@KazeN64 · 1 year ago
Yeah, that's more accurate. This video was made before I even knew polynomial solvers were a thing, and I just kinda went with the approach I knew worked somewhat.
@costelinha1867 · 1 year ago
20:10 While these bugged bones are not desirable normally, I think they could make for a pretty funny mechanic in another romhack... maybe a power-up?
@grantsparks6554 · 1 year ago
Hey Kaze, this was another solid deep-dive. If I could make a suggestion: I found the additional audio of the speedrunning/glitching balanced a little too loud. Loud, glitched "wa-wa-wa-wa-wahoo"s into a wall + music + mathematics VO is a lot to listen to all at once, and this is the only time I can recall a video of yours feeling this loud and overstimulating. I'd like to point to 16:45 as an example, to see if others feel the same. Otherwise, great topic! Love the feeling of progression toward the algorithm.
@KazeN64 · 1 year ago
Yeah, that's fair! I'll make sure to turn the audio down a bit next video.
@grantsparks6554 · 1 year ago
Thanks, I appreciate it 👍
@The4Crawler · 1 year ago
Great video. In a past life, I wrote display device drivers for the GEM Desktop VDI interface, mainly on 286 and 386 systems. That interface uses angle parameters, mainly for circular arcs, specified in 0.1-degree increments. The supplied driver code used sine lookup tables and was based on octants, so 0.0 to 45.0 degrees. There was some front-end code that would reduce angles into one of 8 octants, even for angles over 360 degrees. I started out using this code but found some strange behavior: every once in a while the display update would pause for a noticeable delay.
I ended up writing a test program that would load the driver and send it a range of parameters for all the graphics primitive functions. This way I could load my program (and the driver) with a debugger (displaying on a separate system via serial port). I finally narrowed down the problem to the arc-drawing code. While the comment in the code (originally written in a Fortran variant, whose assembler output was then used) said to reduce the angle to octant 0..7, it was actually doing octant 1..8. This was in a loop where the code would take the angle, check for > 360, and if larger, subtract 360 and repeat. Then it would check for octant 0..7, but if the angle reduced to 8, it would hit the outer loop and keep going until it finally got into the 0..7 range.
First, I fixed the assembler code to do the proper 0..7 calculation, and it worked fine. However, there were over 900 bytes of lookup table (and this was in the 16-bit / 640KB days), so that was a huge chunk of memory just sitting there in case a rarely used arc call was made. I was also in the process of converting almost all this assembler code to C and realized this whole mess could be replaced with a call to the sin() library function.
Since we were selling high-end graphics cards, almost every customer of ours had a 287 or 387 math coprocessor in their system, so this was a fairly low-overhead call. Doing this trimmed about 1KB from the driver's memory footprint, and it ran smoother, without that annoying pause every time it got a strange angle parameter.
@kirswords8587 · 1 year ago
The Mario Kart music in the background 🔥 Nintendo has so many good jams!
@cmyk8964 · 1 year ago
Oh WOW, that's clever. Storing sine and cosine together in the interval [0°, 45°) because the rotation matrix uses both sine and cosine of the same number!
@bruhweirdo7992 · 1 year ago
You should make that big-fist Mario at 12:23 a power-up in the game
@lovalmidas · 1 year ago
It is a good optimization, considering a program usually deals with only a handful of values as inputs per frame. You are not likely to calculate cos(2) after cos(1), making caching cos(2) irrelevant, but you often calculate cos(1) and sin(1) together, especially for calculating projections of the same vector onto two axes. Returning both values at once is very useful for 3D-driven games, as both are generally consumed by the same function. You get potentially twice the cache utilization before a cache invalidation.
This is what get_sin_and_cos() does for the LUT method, and what the F32x2 trick does in the numerical method. Both cases can give you savings outside the function by reducing the number of total function calls. These savings exist independently of naive instruction call / cycle counting. The comparison between the OoT sine and the interwoven sine is quite illustrative.
Of course, in the real game there may be times when only one of sin/cos is needed, but that seems to be a minority. I would be interested to see how much of the savings from the latter numerical methods can be attributed to the use of F32x2 (consumes one register, assuming the consumer code can work without that register space), compared to fetching one of them and converting to the other via an identity (consumes at least one multiply op, an add/sub op, and a sqrt op, which you can take out of the main sincosXXXX function since you no longer need to perform the sqrt(ONE - x * x) conversion there).
@RorroYT-SAR · 1 year ago
To all people thinking that the Nintendo Switch is weak: there are no weak consoles, only bad developers who don't unlock a console's full potential. Thanks for improving SM64, your mods are amazing, keep it up :)
@kriskeersmaekers233 · 1 year ago
How about using a lookup table where you only store 16 or 8 bits per entry, which is just the mantissa of the float?
@whtiequillBj · 1 year ago
The words of a true computer scientist: "You are going to learn a lot of math today, and you will like it."
@tsobf242 · 1 year ago
Heh, I spent way too long making an efficient and accurate sine function for modern processors. If you want a way to find coefficients for polynomial functions, I suggest using black-box function optimization. I used Julia and optimized for both reducing the maximum error and reducing the total error over the section of the function I needed. I shot for a half-sine centered on 0 and did funky bit manipulation on floats... I really don't know how much of what I figured out is useful for N64 hardware, but it's nice seeing someone suffer in a similar way as I did :D
@quentincorradi5646 · 1 year ago
Coefficient finding using Remez or Chebyshev is the best, no need for manual search. I would have liked to see a comparison of these algorithms with CORDIC, even though it's table-based.
@tsobf242 · 1 year ago
@@quentincorradi5646 I'm certainly not an expert! I'm not surprised there's a better way to find coefficients. I'll have to see if I can grok any of that...
@AntonioNoack · 1 year ago
You can export Excel tables to PDF, or disable auto-correction to get rid of the squiggly red lines.
@Seamusoboyle · 1 year ago
Hello! Floating-point/computer maths engineer here. I've developed trig functions for quite a lot of architectures over my life at this point; maybe I can give some suggestions.
In our world, the minimax approximation is mostly king. I say mostly because mathematically it's usually the provably optimal polynomial; however, for us there are the usual floating-point issues you get trying to apply maths on computers. But if there's an issue with a minimax approximation, you usually look inwards before looking outwards (especially if your sin/cos ends up being greater than 1).
If you have problems around 0, you should probably be optimizing for relative error, as opposed to absolute error, which is what it seems you've done here (I might be wrong). In the business we tend to use an error metric called the ULP error (units in the last place), which roughly measures how many bits at the end of your floating-point answer are incorrect. (It's very close to a relative error measure, but not quite the same. Optimising for relative error usually works well.) Because of how floating-point numbers scale, this means that you need tiny errors for tiny floats, and can tolerate much larger errors for larger floats. Using absolute error as your metric allows values near 0 to blow up to however large you want in a relative-error sense. Without an exact application in mind, this ULP metric is the one people like myself would use to write e.g. maths libraries.
Another issue is that if you take the coefficients of an optimal real-valued minimax polynomial and convert them straight to floats, you lose a pile of accuracy, unfortunately for computer scientists everywhere. Probably a lot more than you might think, when compounded with floating-point evaluations. We have some tools developed to try to get around these limitations; a good starting point is one called sollya. There you can ask for relative/absolute error and give all kinds of constraints to your problem.
Importantly, you can restrict the output coefficients to be floats, and it will try to optimise around this, so you won't lose large amounts of accuracy converting real coefficients. For example, on the interval [0 -> pi/4], for sin() it gives a cubic polynomial of:
x * (0.9995915889739990234375f + x^2 * (-0.16153512895107269287109375f))
Or by forcing the first coefficient to be 1 (generally sollya knows best, but sometimes storing a value of 1 is free on hardware, and it also might make really tiny values more accurate at the expense of larger ones):
x * (1.0f + x^2 * (-0.16242791712284088134765625f))
And similarly the next higher polynomial gives:
x * (0.999998509883880615234375f + x^2 * (-0.16662393510341644287109375f + x^2 * 8.150202222168445587158203125e-3f))
Or with forced 1:
x * (1.0f + x^2 * (-0.166633903980255126953125f + x^2 * 8.16328264772891998291015625e-3f))
Note that sollya does not produce the absolute best polynomial possible, but it's the best you'll get without too much work, and the results it does give are usually very decent.
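The two unconstrained polynomials quoted above, written out as C functions (coefficients copied verbatim from the comment; the Horner arrangement and function names are mine):

```c
#include <assert.h>
#include <math.h>

/* sollya's cubic for sin(x) on [0, pi/4] */
static float sin_sollya3(float x) {
    float x2 = x * x;
    return x * (0.9995915889739990234375f
                + x2 * (-0.16153512895107269287109375f));
}

/* sollya's quintic for sin(x) on [0, pi/4] */
static float sin_sollya5(float x) {
    float x2 = x * x;
    return x * (0.999998509883880615234375f
                + x2 * (-0.16662393510341644287109375f
                        + x2 * 8.150202222168445587158203125e-3f));
}
```

The long decimal strings are exact representations of 32-bit floats, which is the point of the tool: the coefficients lose nothing when stored as `float` constants.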
@KazeN64 · 1 year ago
Others have suggested sollya as well! The problem with the polynomials you've suggested, though, is that you are adding additional coefficients. I've purposely chosen only-odd and only-even polynomials because they give you a lot of accuracy with fewer coefficients. Instead of a 3rd-order sine, an even 4th-order cosine will always be cheaper and more accurate, for example. Considering the constraints I gave my polynomials, I also think it's entirely impossible to come up with a different one; we've used up all the degrees of freedom, after all. Although someone suggested easing up on some of the conditions for the higher-order polynomials, and that did end up improving them.
@wowimoldaf · 1 year ago
Replying so I can find this golden comment again via my comment history. Thanks for this comment, I learned a lot from it too.
@Immadeus · 1 year ago
Who needs math class when you have Super Mario 64?
@thomas-ux8co · 1 year ago
I actually find it pretty clever to use two different functions with different speed/accuracy tradeoffs for the different situations
@flameofthephoenix8395 · 1 year ago
One way of calculating sine that I came up with uses the fact that in both 3D and 2D you can find the angle halfway between two angles by normalizing both, averaging their positions, and normalizing again. With this you can calculate any value: first you start with 3 stored values of rotation, (0,1) or 0 degrees, (1,0) or 90 degrees, and (0,-1) or 180 degrees; then, using the aforementioned function, you can take the in-between of 90 degrees and 0 degrees to get 45 degrees, then 22.5 degrees, and so on. You then convert the angle you are trying to find the sine of into binary, and use each 1 bit as a reference to one of the computed angles; you add those angles together with rotation matrices, and you'll get the answer. This works quite well but isn't designed for speed or to be compact in memory.
@angeldude101 · 1 year ago
Even better, you can get the components of a rotation matrix with twice the angle between two vectors with just their dot and cross products (technically a single geometric product). Even if the vectors aren't normalized, the factor by which the sine and cosine are off is the same for each, so they effectively cancel out. If you want the rotation only from one to the other, then you need to either normalize the vectors themselves and add them together to get their average, or take the double rotation, normalize it, add 1, and then normalize again, which is both the halfway interpolation of the rotation with the identity _and_ the rotation's square root. It's possible, with algebraic manipulation, to prove that these two methods give the same result.
Often when finding rotations, you compute the angle between two vectors using the arcsin of the magnitude of the cross product, the arccos of the dot product, or the arctan of the ratio between the two. This is then usually plugged straight into a sincos to get the components for the rotation, leaving you having called 2 or 3 different transcendental functions which do _absolutely nothing but cancel each other out._
"In both 3d and 2d"? This strategy works in _Nd;_ it's completely dimension-agnostic. You can even multiply the result by more rotations to compose them, which enables a new behavior in 4d that doesn't happen in 3d or 2d. I didn't even specify that the vectors be linearly independent, since the double rotation that results is still a valid 360° rotation, meaning it technically even works in _1d!_ Kaze mentioned in other comments that he was working on converting things to quaternions, which are a natural part of this system, and (an obfuscated form of) the actual result given by the geometric product of two vectors, or more accurately _mirrors._
@drowned309 · 1 year ago
This video makes me feel bad at math
@Myako · 10 months ago
Your degrees of enthusiasm, investigation, knowledge and actual results are absolutely amazing. They come shining through the video, making it even more entertaining. Congratulations, and I can't wait to see these implemented in your game. 👏🏻👏🏻👏🏻💪🏻💪🏻💪🏻
@AbAb-th5qe · 1 year ago
gcc has a sincos function that does this. The optimizer can detect calls to cos and sin close to each other with the same parameter and output a call to sincos instead. The ZX Spectrum uses Chebyshev series, IIRC. How about CORDIC? The lookup table is smaller, so maybe you'd get more cache hits.
@renakunisaki · 1 year ago
I'm surprised I haven't seen more systems with an instruction that does both at once.
@AbAb-th5qe · 1 year ago
@@renakunisaki I vaguely recall seeing one as part of SPIR-V, but I might be misremembering.
@faranocks · 1 year ago
I made a sine function with 16-bit sin/cos interwoven into 32 bits (4 bytes) for an FPGA. It worked pretty well, with fewer cache misses since sine and cosine were interwoven (an unintended effect, tbh; I just wanted one input value to return both sin and cos). I also shortened the table to 1/8 of the wave and swapped values around as necessary using hardware muxes. Data size was 32 x 256 bits. On an FPGA, flipping signals based on the value is pretty fast relative to software calculations for this sort of thing (only around 2 more cycles to swap values vs. storing the entire sin/cos wave, and no need to worry about pipeline hazard optimizations as in code). Another optimization was using 2048 angles rather than a degree-based calculation, so there was no need for a modulo (0 = 0 degrees/360 degrees, 1024 = 180 degrees, etc.). I restricted the angle to an 11-bit register so that overflows would just happen, and the angle would wrap from positive to negative, or from 2pi + 1 to +1, without any additional calculations.
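The overflow trick in the comment above is standard binary angular measurement: with a power-of-two number of units per turn, wraparound is free. A sketch using 16-bit angles (65536 units per turn, like SM64's own angle format; the helper names are mine):

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t angle_t;  /* 0..65535 == one full turn */

/* Unsigned arithmetic wraps mod 65536, which is exactly the angle
   wraparound: no modulo or range check needed. */
static angle_t angle_add(angle_t a, angle_t b) {
    return (angle_t)(a + b);
}

/* Reinterpreting the difference as int16_t yields the shortest signed
   rotation between two angles, again without branches. (The narrowing
   conversion is implementation-defined in strict pre-C23 C, but wraps
   on common compilers.) */
static int16_t angle_diff(angle_t a, angle_t b) {
    return (int16_t)(a - b);
}
```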
@strawberrylemonadelioness
@strawberrylemonadelioness Год назад
I may be confused as hell by this but I'm still fascinated enough to keep watching
@twl148
@twl148 Год назад
11:20 kerning moment
@perguto
@perguto Год назад
Wouldn't it be easies to take for xpi/8 the 4th order Taylor expansion at x=pi/4 (which is just the cosine Taylor series around 0)? One should however slightly modify the coefficients so they mach up better at x=pi/8
@KazeN64
@KazeN64 Год назад
the taylor series has inaccuracies where it matters so i dont like this
@yockanookany
@yockanookany Год назад
"You're going to learn a lot of math today" why did my fight or flight response kick in?
@cad97
@cad97 Год назад
I wonder if anyone interleaves sin/cos tables on modern hardware. Most of the time you'd just use floats and your stdlib/libm sin/cos, but quantized angles still have their applications.
@renakunisaki
@renakunisaki Год назад
Do microcontrollers count? I had to do it recently. I should try some of these other methods to compare, but it's tricky without floating point...
@FairyKid64
@FairyKid64 Год назад
I wish Nintendo would hire people like you to make new N64 games for the Switch. Although, actually at the same, I'm glad you're independent because it means we get this excellent work explained and get real console fan projects!
@antivanti
@antivanti Год назад
I was hoping the Quake inverse square root would make an appearance but ancient Indian maths is also very cool 😊
@KazeN64
@KazeN64 Год назад
inverse square root is actually a lot slower on the n64 unfortunately
@mrkosmos9421
@mrkosmos9421 Год назад
That wikipedia information for the memory bandwidth is probably in megabits per second, not megabytes per second, given the maths
@stanislavponomarev8829
@stanislavponomarev8829 8 месяцев назад
i believe there is a mistake in table sizes - mario 64 has 5K sine values, not 5K bytes; these are 4-byte floats i.e. 20KB total size same for OoT
@bitskit3476
@bitskit3476 1 year ago
@4:26 - If you're storing 4096 sine values as 32-bit floats, it's not 4 KiB, it's 16 KiB. In the case of the Zelda64 method, you're looking at 4 KiB. I highly doubt that you're using 8-bit chars to store angles, because you'd be looking at a precision of only 1.4 degrees.
@KazeN64
@KazeN64 1 year ago
yeah i just misspoke. im so used to seeing it as 0x4000 that i just read the 4 out lol
@Bolpat
@Bolpat 1 year ago
2:01 A transcendental number is _NOT_ a number that does not repeat. √2 is a number that does not repeat, but it’s not transcendental either. The opposite of _transcendental_ is _algebraic._ An algebraic number is one that is the solution to a polynomial with integer coefficients, i.e. something like x⁵−x−1 = 0 is well-known to have a solution, which is algebraic by definition and its value is approximately 1.1673, but it cannot be described in a closed form with integers and basic operators including roots (square root, cube root, etc.).
@KazeN64
@KazeN64 1 year ago
yeah that's true. honestly chatgpt wrote most of the script and i just missed this mistake.
@Bolpat
@Bolpat 1 year ago
@@KazeN64 😂
@BradenBest
@BradenBest 1 year ago
4:25 are you _sure_ it's 4KiB? We're talking C here, and AFAIK the C standard requires that floats are IEEE 754, which AFAIK only has definitions of 32-bit (single precision) and 64 bit (double precision) floats, which would mean that sizeof (float) == 4 must be true. If sizeof (float) == 4, then an array of 4096 of them would take up 16384 bytes (16KiB) of space, not 4096 (4KiB), which is still small compared to 4 MiB, but it's noteworthy.
@KazeN64
@KazeN64 1 year ago
yeah its 16kb, i just had 0x4000 in my script and read it off wrong.
@BradenBest
@BradenBest 1 year ago
@@KazeN64 ah, understandable
@deltapi8859
@deltapi8859 1 year ago
it's interesting how many weird choices they made. The N64 has mipmapping and pretty advanced texture filtering, but not enough memory to hold large textures, which made everything look like stew. But they couldn't put a sine function into the FPU. Odd, those times were.
@rtg_onefourtwoeightfiveseven
The Kaze LUT method is genius. You should be proud of yourself for that one.
@rosebarrett1267
@rosebarrett1267 1 year ago
This is bringing me back almost a decade to calc in high school damn. Good thing I have pannenkoek as supplemental learning or I'd be lost. I'm very much enjoying learning about this :3
@TheSpiffyNeoStar
@TheSpiffyNeoStar 1 year ago
That interwoven table is pure genius.
@johnnygodoy8329
@johnnygodoy8329 1 year ago
This might require a big rewrite to implement, but Wildberger's rational trigonometry can be used to avoid transcendental functions for geometric problems, and could eventually help for this. A short review is "RATIONAL TRIGONOMETRY: COMPUTATIONAL VIEWPOINT" by Olga Kosheleva.
@angeldude101
@angeldude101 1 year ago
Using RT is just using geometric algebra without calling it geometric algebra. If you have two vectors at a desired angle apart from each other, you can just multiply the vectors together and get a quaternion that rotates by twice the angle between the vectors (because it's technically composing two reflections across said vectors). No transcendental functions like in RT, but you also get to work with abstract geometric objects rather than components. (Of course, GA lets you use transcendental functions anyways if you want to, but you don't have to.) Kaze has mentioned in other comments that he's transitioning some code to use quaternions, so he'll likely get much of the benefit of RT anyways just from that.
@runforitman
@runforitman 1 year ago
I love how the mario 64 guy was cracked and then ocarina of time was just 8:11
@master_matthew
@master_matthew 1 year ago
This might be the most clever math solution in software development since the inverse squareroot.
@kyramonnix1520
@kyramonnix1520 1 year ago
"I found myself oscillating" xD
@BernardBernouli
@BernardBernouli 1 year ago
2:04 Small error here. The fact that pi is irrational is not related to the fact that sin is transcendental. Transcendental means that you cannot write a number as the root of a finite polynomial with rational coefficients. There are numbers like sqrt(2) that do not repeat and yet are not transcendental. Though, the reason for sin being transcendental is that pi is transcendental, and also e.
@KazeN64
@KazeN64 1 year ago
oh true. admittedly i used chatgpt to write a lot of this script and just ended up not catching that haha. i just fed it a note that said "sin is hard to calculate similar to pi"
@isogash
@isogash 1 year ago
You can rewrite your function in a branchless form to avoid all of the conditional sign-flipping. Instead of using != 0, shift the result of the "flipsign" AND into the sign-bit position of the floating-point value (the most significant bit) and then use bitwise XOR on the numbers (cast to ints, then back to floats) when you need to conditionally flip the sign. Removing the branches should reduce the number of instructions the compiler needs to produce in each case from 2 to 1. Getting rid of the conditional angle = 0x8000 - angle is also possible, but certainly harder.
@KazeN64
@KazeN64 1 year ago
doing AND operations on floating point numbers would require moving the value from a floating point register to a general register first. that transition back and forth would be more expensive than the branch.
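As a sketch of the suggestion, here is the sign-bit XOR in portable C (illustrative names, not the shipped N64 code; and per the reply above, on the N64 the move between float and integer registers can cost more than the branch it saves):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Branchless conditional negation: XOR the IEEE-754 sign bit (the most
 * significant bit of the 32-bit float) with a 0/1 flag.  memcpy is the
 * strict-aliasing-safe bit cast; compilers turn it into a register move. */
static float flip_sign(float x, uint32_t flip) {  /* flip must be 0 or 1 */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits ^= flip << 31;               /* flip the sign bit iff flip == 1 */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```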
@bedoop3870
@bedoop3870 1 year ago
watching Animation vs Math earlier today, just before this, has made me understand so much more than if I hadn't. youtube's teaching me so much
@SullySadface
@SullySadface 10 months ago
That's a lot of math I never learned. I like the random movement tech in the footage.
@Gunbudder
@Gunbudder 1 year ago
now do it with scaled semicircles and special sin and cos functions that don't require conversion to or from a float! that is the true stretching of angle data to its maximum potential with limited space. and you can even do it without a table if you want to lose your last remaining bits of sanity
@atomictraveller
@atomictraveller 1 year ago
hope you read me ;) 20 years of audio dsp. i have a short video on this method.
// initialise: float s0 = 1.f, s1 = 0.f;
// loop: s0 -= w * s1; s1 += w * s0;
this method is an iterative sine and cosine pair, for two mults and two additions per frame (the cosine is a half frame off). it needs to be normalised every couple thousand iterations: t = 1.f / sqrt(s0 * s0 + s1 * s1); s0 *= t; s1 *= t; w is angular frequency, set to 2 * pi * freq_in_hertz / samplerate, and is stable
@KazeN64
@KazeN64 1 year ago
that doesn't seem like it'd get the sine or cosine of an angle though,... that looks like its just having 2 values oscillate in sine/cosine patterns. not what we need here
@straubulous9511
@straubulous9511 1 year ago
You are so amazing and inspiring. Thank you for all the care that you've given Mario. You're a truly one of a kind.
@James2210
@James2210 1 year ago
About halfway through and anticipating an explanation of CORDIC. Super excited. Edit: guess I was overestimating your need for accuracy
@jeffmejia707
@jeffmejia707 1 year ago
Can you show us how the original SM64 will look with all your optimizations and improvements!?
@kevintyrrell7409
@kevintyrrell7409 1 year ago
That blows my mind that the macro actually can return two values. C/C++ is absolutely insane. You think there are defined rules (e.g. under no circumstances can you return two values), but bam, it happens.
@SMCwasTaken
@SMCwasTaken 10 months ago
My brain broke
@FlamespeedyAMV
@FlamespeedyAMV 1 year ago
I wish we could just go back to these types of games again
@CheaterCodes
@CheaterCodes 1 year ago
Question: Did you consider/investigate the CORDIC algorithm? While it would certainly require more instructions, you would only need simple integer computation instructions (no floats). No multiplications, no divisions, no square root, and you get both sine and cosine out of it. I'm not sure if it's actually beneficial, but it feels like it might be close.
@TazAlonzo
@TazAlonzo 1 year ago
Hey man! You're doing great! We are all patiently waiting and watching you make the DLC Mario 64 deserves! Love all your content, especially optimizing SM64!
@timpz
@timpz 1 year ago
I'm using the Padé approximant, which only uses 1 divide and deviates only 1.78% for all angles under pi (Bhaskara deviates 1.86%), and even less if it's under 0.5*pi. It's probably a tiny bit slower though, since you need to calculate x^2, x^3, and all the way to x^6 for a good approximation. I'd post a link but it seems to delete my comment, so just look up the example on Wikipedia. Really good video btw, love your work 😄
@KazeN64
@KazeN64 1 year ago
yeah at that point the 5th order sine seems a lot more accurate and cheaper
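The commenter's exact formula isn't quoted, so as an illustration here is the classic [3/2] Padé approximant of sine about 0, which matches the Taylor series through x^5 and costs a single divide:

```c
#include <assert.h>
#include <math.h>

/* [3/2] Pade approximant of sin(x) around 0:
 *     sin(x) ~ (x - 7x^3/60) / (1 + x^2/20)
 * Expanding the quotient reproduces x - x^3/6 + x^5/120 exactly, so the
 * error only appears at the x^7 term (roughly 0.4% at pi/2). */
static double pade_sin(double x) {
    double x2 = x * x;
    return (x - 7.0 / 60.0 * x2 * x) / (1.0 + x2 / 20.0);
}
```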
@jmi967
@jmi967 1 year ago
I could see using 2-3 of those functions in the code and choosing which to use based on what the code is needing to do.
@bitskit3476
@bitskit3476 1 year ago
@KazeN64 I just finished implementing the Zelda64 version of the sin/cos function in C and did a bit of profiling. It's 70% faster than glibc's version. There are a couple of things that even modern GCC is too stupid to optimize away (even when compiling with -O2). For example, when I did `(FTOS(f) / 16) % 1024` to calculate the table index, I was only getting about a 10% improvement over glibc. I replaced it with `(FTOS(f) >> 4) & 0xFFF` and performance shot up to a 68% improvement over glibc. I then got another 2% improvement by precalculating `(1.0 / (2.0 * M_PI) / 65536)`, which presumably eliminated a division operation.
@KazeN64
@KazeN64 1 year ago
those operations are not guaranteed to be the same, for example if you are returning s16s, >>4 and /16 aren't equivalent. that last term should compile to a constant so that wouldn't give you any speedup (unless M_PI was a variable)
@bitskit3476
@bitskit3476 1 year ago
​​​@@KazeN64 I'm not sure what you mean, but right shifting by 4 is the same as dividing by 2^4, which is 16. And I've already checked the results, so I know it works :p
@bitskit3476
@bitskit3476 1 year ago
The only thing that doesn't work is predividing by four, for some reason, idk. So I do the bitshift after multiplying by FMAGIC and casting to an int
@KazeN64
@KazeN64 1 year ago
@@bitskit3476 right shifting by 4 is not the same as dividing by 16 for signed integers. e.g. -1/16 is 0 but -1>>4 is -1
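Both sides of this exchange can be seen in a few lines: arithmetic right shift rounds toward negative infinity, while C's signed integer division truncates toward zero, so the two agree on exact multiples of 16 but disagree on values like -1. A quick demonstration (helper names are mine):

```c
#include <assert.h>

/* Right-shifting a negative value is implementation-defined in C, but
 * is an arithmetic shift (sign-preserving) on common compilers.  Even
 * then, the rounding direction differs from division: >> rounds toward
 * negative infinity, / truncates toward zero. */
static int by_div(int x)   { return x / 16; }
static int by_shift(int x) { return x >> 4; }
```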
@bitskit3476
@bitskit3476 1 year ago
​​​@@KazeN64shifting right is division by two regardless of whether the integer is signed or unsigned. An arithmetic shift just preserves the sign bit. For example, consider 0xFE. This is the two's complement of -2. If you do an arithmetic right shift by 1, you get 0xFF, which is -1. The two's complement of -4 is 0xFC. If we arithmetic shift that to the right, we get 0xFE, which is -2.
@7up64
@7up64 1 year ago
These people really ain’t even seen the whole video and going, nice video! NPCS do indeed exist.
@CAEC64
@CAEC64 1 year ago
me when im on the highway and realize the nearby cars are actual people with a life and story to tell
@MayVeryWellBeep
@MayVeryWellBeep 8 days ago
Mmm, yes. Sine and cosine. So true bestie. 🧐 I understand all of this of course. Please do not ask any follow up questions regarding that claim; I am tired today.
@TheZombiesAreComing
@TheZombiesAreComing 1 year ago
2:45 You WILL learn math today and you WILL like it yea... those words don't bode well
@YEWCHENGYINMoe
@YEWCHENGYINMoe 1 year ago
Just use the sqrt directly!!! sin(x)^2= a simple up and down linear function, and cos(x)=sin(x+pi/2)
@Reonu
@Reonu 1 year ago
This kind of sine function is only possible because you switched to PC Port. I'm so glad you made the switch!
@WhiteThumbs
@WhiteThumbs 1 year ago
Windowed mode ftw
@hoo2042
@hoo2042 1 year ago
Huh? What does this sine function have to do with PC port? Do you just mean using the mario64 decomp code? If so, that codebase was also used for the PC port, but the decomp project is not the same as the PC port project. Everything he was talking about performance-wise was about optimizing for real N64 hardware, especially the specifics of its memory bandwidth and cache line size, and the speed of various math operators on its CPU.
@thecozies
@thecozies 1 year ago
PC port is superior because it is on the PC meaning that porting is particularly yes, especially when the
@KazeN64
@KazeN64 1 year ago
This is all an N64 rom and I won't drop n64 support obviously. There'd be no point in any of these optimizations if I used the PC port. Besides, if I wanted to make a PC port, i'd just copypaste mario's code into unreal engine 5 and use that.
@Yeetboii
@Yeetboii 1 year ago
@@KazeN64 once you finish rewriting the entire game (seeing as it appears that's where things are going), does that mean that, textures and sound effects aside, you could theoretically release/sell the code itself, seeing as you own it?
@iyziejane
@iyziejane 1 year ago
I imagine many of the sine and cosine calculations could be further optimized using continuity. So if Mario's face angle is theta + epsilon on the current frame, then this is not far from the face angle theta he had on the previous frame, for which the sine and cosine were already used. You can find the new value by f(x + e) = f(x) + e f'(x) for small e, so sin(theta + epsilon) = sin(theta) + epsilon*cos(theta). Of course sometimes you need a sinusoid value "from scratch" so one of the current methods will be needed.
@KazeN64
@KazeN64 1 year ago
that will just introduce an additional error on top of the old error and the extra logic for this would make it more expensive. plus evaluating the +e*f'(x) part would be just as expensive as just evaluating f(x+e)
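The proposed first-order update, written out as a sketch (hypothetical helper, not from the game): advancing a cached sin/cos pair by a small angle costs a couple of multiply-adds, at the price of a quadratic error term that accumulates until a full recompute:

```c
#include <assert.h>
#include <math.h>

/* First-order incremental update: given cached sin(x) and cos(x),
 * approximate the pair at x + e via
 *     sin(x+e) ~ sin x + e cos x
 *     cos(x+e) ~ cos x - e sin x
 * The O(e^2) error grows as deltas accumulate, so the values must be
 * recomputed from scratch periodically. */
static void advance_pair(double *s, double *c, double e) {
    double s0 = *s, c0 = *c;
    *s = s0 + e * c0;
    *c = c0 - e * s0;
}
```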
@iyziejane
@iyziejane 1 year ago
​@@KazeN64 thanks for considering it. The reason + e*f'(x) would be cheaper than f(x + e) is because you already have f(x) and f'(x) cached (sin(x) and cos(x) from angles nearby to the one you want to compute now). So it's just one multiplication and one addition if you have those values, compared to +25 operations to compute the new approximation value from scratch. But I am guessing you mean that the access time for these previous values is slower than computing new ones from scratch, which is somewhat suggested by the video, but I wasn't sure. It's true that this introduces additional errors, based on how fast that accumulates one could correct it with a more expensive computation after every 10 frames or something.
@Eckster
@Eckster 1 year ago
At 20:20 when you said the polynomial math was something you learned in 9th grade in Germany I originally thought you were talking about the derivatives, and had a moment where I lost what little faith I had in US schools.
@todorstojanov3100
@todorstojanov3100 9 months ago
Well derivatives are done in 10th grade in Germany
@HouseMD93
@HouseMD93 1 year ago
For the MinMaxPolynomial, you can use a gradient descent algorithm (as a Python man, I access this via the scipy.optimize.minimize function) to get a better answer. I got a = 3.096133 and b = 4.42047, resulting in a maximum error of 0.0045 - although sin(pi/2) = 0.997. It also has a similar problem to your approximation where the maximum is slightly before pi/2, but it is not above 1, so this method may work.
@HouseMD93
@HouseMD93 1 year ago
For the 5th order sine, using a similar method of gradient descent and a bit of grid search, it's possible to get the error down to 7e-5 (0.00007). Using your notation, the variables for the 5th order sine should be as such a = 0.00750356, c = -0.165646 and e = 0.999684. The value of sine is 1.00004915 at pi/2.
@iankoberlein1974
@iankoberlein1974 1 year ago
You are an absolute genius, Kaze.
@nicknevco215
@nicknevco215 1 year ago
Wish there was a emu version that had these changes into the engine to see it working
@BusinessWolf1
@BusinessWolf1 1 year ago
There has to be a way of using 2 or 3 functions together for different ranges in the sine wave. A lot of it is a very straight curve, so you could use an inaccurate solution there and it wouldn't matter.
@mattsgamingstuff5867
@mattsgamingstuff5867 1 year ago
There is sin(x) ~= x for sufficiently small x...but branching logic is expensive too; which would be doubly so if you broke it into smaller intervals such that linear or quadratic approximations were sufficiently accurate (it'd be a lot of intervals). So it may very well be that branching in a manual calculation is going to be slower than doing a slightly harder calculation with no branching. Your real options are a lookup table or some form of approximations that avoid high powers and divisions. The only reason the lookup table didn't flat out win is the memory, bandwidth, and cache limitations of the platform.
@13231wmw
@13231wmw 1 year ago
Well, at least N64 has floating-point hardware. You did not have to experience the joys of DDS and CORDIC 😅.
@twinsunianlp7359
@twinsunianlp7359 1 year ago
Can you use the fact that (cos x)' = -sin(x) to avoid using a sqrt? I barely understand whether this is more efficient on this architecture, and I know it's more total code, but it might be fewer assembly instructions if you do it this way. Like maybe store 4*a and 2*b in cosinusefficient, then calculate -sin(x) as the derivative of the first function like so: f32 y = _i; y = (y*(cosinusefficient[3] + (y*y*cosinusefficient[2]))). Then flip the sign assignment. Also, since we're using constants, you can consider reducing the error linearly by adding or subtracting a constant at certain points.
@twinsunianlp7359
@twinsunianlp7359 1 year ago
Of course you could interweave both calculations by calculating something like: x = _i; a = _i*cosinusefficient[1]; b = _i*_i*cosinusefficient[0]; x = (ONE + x*(a + b)); y = ( a
@KazeN64
@KazeN64 1 year ago
the additional multiplications would be a lot more expensive unfortunately
@vileshaft9730
@vileshaft9730 1 year ago
uh, hard not to notice your Super Mario Bros image aspect ratio is stretched; it's 16:9 when the game itself is either 4:3 or 5:3
@VideyoJunkei
@VideyoJunkei 1 year ago
Does NOT using float numbers help with speed? I made a routine myself years ago that treats 0-360 deg as 0-255, 512, or 1024 (which allows for seamless transitions/wrapping through 0; no checks needed). I did use a precalculated table though; all numbers were stored huge, I think +-1024, output was scaled to screen, and the whole program stored huge numbers to avoid floats everywhere! Yes, 1 pixel of movement was 1024 'behind the scenes'
@KazeN64
@KazeN64 1 year ago
float multiplication and division is a lot faster on the n64
@flatfingertuning727
@flatfingertuning727 1 year ago
How about computing each sine/cosine pair by using a third-order polynomial to compute the sine of either x or 16384-x, whichever is smaller, and then using the distance formula to compute the other?
@AccAkut1987
@AccAkut1987 1 year ago
Okay, I only understand like 10% of this... but just how much more deep-level programming is involved here compared to every current developer just throwing assets into Unity or Unreal 5? How much faster & better could modern games run if they had bespoke engines?
@FizzyFelidae
@FizzyFelidae 1 year ago
how did you calculate the peak bandwidth of the Nintendo 64?
@KazeN64
@KazeN64 1 year ago
i did the fastest data load i could possibly do and measured how long it'd take, then extrapolated to a full second.
@keiyakins
@keiyakins 1 year ago
Of course, none of this makes sense to a programmer coming from the SNES, where computation was expensive and memory transfer was basically free.
@strider_hiryu850
@strider_hiryu850 1 year ago
i'm guessing the 4,096 movement angles are also unnecessary. my guess is that at most, 512 would be required for an imperceptible difference. but you could probably get away with 256. those are just guesses, however. the 4,096 may turn out to be completely necessary for the smooth movement of SM64, for all i know. it just felt like another one of those numbers that were way too high, for no good reason.
@KazeN64
@KazeN64 1 year ago
yeah its not unthinkable that this was just a completely random number they chose