The Folded Polynomial - N64 Optimization 

Kaze Emanuar
Subscribers: 269K
Views: 237K

Published: 27 Sep 2024

Comments: 1K
@KazeN64 1 year ago
Getting a lot of comments about making the code branchless - so let me explain why that's a bad idea: a branch takes a single cycle on N64 and we have no branch prediction. Doing bit manipulations on floats requires us to move the float from a float register to a general-purpose register first, so that will always be a penalty of 2 cycles. This means that just doing the conditions is WAY faster than doing bit manipulation on floats. Let's compare a branchless version to a branching version: if (shifter & 0x8000) { cosx = -cosx; } compiles to: andi t0, a0, $8000; beq t0, r0, DontInvert; neg.s f0, f0; DontInvert: (3 cycles). cosx = cosx ^ ((shifter&0x8000)
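The comment above is cut off before the branchless version, so the following is only a rough C sketch of the two approaches being compared, not the author's code. The function names and the exact XOR form are assumptions; the `memcpy` calls stand in for the float-register to general-purpose-register moves that the comment says cost extra cycles on N64.

```c
#include <stdint.h>
#include <string.h>

/* Branching variant, as quoted in the comment: negate when bit 15 is set. */
static float flip_branch(float cosx, uint16_t shifter) {
    if (shifter & 0x8000) {
        cosx = -cosx;
    }
    return cosx;
}

/* Branchless variant (assumed form): XOR the sign bit of the float. */
static float flip_branchless(float cosx, uint16_t shifter) {
    uint32_t bits;
    memcpy(&bits, &cosx, sizeof bits);           /* float -> integer bits */
    bits ^= (uint32_t)(shifter & 0x8000) << 16;  /* move bit 15 to bit 31 (sign) */
    memcpy(&cosx, &bits, sizeof bits);           /* integer bits -> float */
    return cosx;
}
```

Both functions return the same value for every input; the point made in the comment is about cycle counts on this particular CPU, which C source alone cannot show.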
@fastestdino2 1 year ago
Dude I only vaguely understand half that math but I can tell you know your stuff. You literally put more effort and thought into fixing a 20 year old game than most triple A devs put into making theirs. Keep up the good work.
@taviethestick 1 year ago
Kaze, my -2 braincells are gonna EXPLODE 💀💀
@multiplysixbynine 1 year ago
Instead of going branchless, try using only one branch with a switch statement jump table to distinguish the 8 cases up front. That should remove all of the bit tests and swaps and conditional branches at the cost of inlining the polynomial calculation 8 times. Code size would increase but not by much.
@andremaldonado7410 1 year ago
What song did you use for the background music in chapter 3? So familiar but I just can't remember the name
@shukterhousejive 1 year ago
It's cool to see that branch delay slot get some work, another reminder why you can't always optimize the N64 like a modern processor
@GameDevYal 1 year ago
"We run the computations first and THEN figure out which one we computed" You know you're pushing against the limits of what's possible when your code starts implementing quantum mechanics
@jaysefgames1155 1 year ago
Finally... Quantum computing...
@ThompYT 1 year ago
my brains hurts
@notNajimi 1 year ago
If only they marketed the system as the Nintendo Quantum
@Gestersmek 1 year ago
They don't call it the Reality Coprocessor for nothing.
@MrGreatDane2 1 year ago
How did you escape your designated Gamemaker corner?
@ChaunceyGardener 1 year ago
All math books should have the Mario font.
@DaVince21 1 year ago
Super Maths 64
@V_r0 1 year ago
tht's wht i'm saying
@cerealnuee8189 1 year ago
Conversely, imagine a version of Mario 64 that uses LaTeX
@EndorJedi985 1 year ago
Would make it more bearable
@SanaeKochiya 1 year ago
with subway surfers in the corner
@undefined06855 1 year ago
kaze on his way to save literally 0.000096 microseconds on a console thats 27 years old
@kannolotl 1 year ago
Just you wait until you hear about Super Mario Bros. speedrunners
@jess648 1 year ago
the benefit wasn’t even fps this time, the sine function is what makes the 3D math of the N64 and 3 dimensional games in general tick basically so physics, rendering and animation all benefit from improvements in accuracy
@exylic 1 year ago
It's .096µs. The “.000096” is in seconds
@DanielFerreira-ez8qd 1 year ago
​@@jash21222n64 but it runs on an intel i12 12th gen
@MrGamelover23 1 year ago
@@jess648 So does that mean that he can do better animation or physics or something like that?
@FairyKid64 1 year ago
I really appreciate how open minded you are and how you give credit where credit is due and don't try to make people with "worse" ideas look bad. Keep up the good work!
@KazeN64 1 year ago
I try my best to invite any type of discussion! I guess a confident demeanor in people is often associated with being unwilling to change one's mind, which is an unfortunate vibe to give off. I wish more people would just come in and try to tell me where I'm wrong just so we can discuss and learn.
@ALZlper 1 year ago
@@KazeN64 You have every right to be confident based on your results. If someone else proposes a measurably better solution, of course it's time to upgrade, otherwise the right to be confident is gone :) Love that mindset, exactly mine too.
@carltheshivan 1 year ago
It's a good idea to not completely dismiss the "worse" ideas because sometimes these things are a two steps forward, one step back situation, and maybe that worse idea will become useful later with a little modification or refinement or in a different context.
@3333218 1 year ago
@@KazeN64 The best way to create something great is to start out by trying something stupid and correcting why it went wrong. Which is why it's important to have people willing to suggest, try and discuss anything that might seem worth giving a chance. It seems you understand that. ^ ^
@howard_blast 11 months ago
Lol, extremely passive aggressive comment. Get over yourself with that toxic "all solutions are beautiful" mentality. Take some responsibility when you're in the wrong. No one is owed anything just because they tried (less hard than others at that).
@fders938 1 year ago
This stuff reminds me of when I was learning x86 programming and attempted to use my new-found powers to beat libc's sin/cos. After hand-writing an asm implementation of a 7th order taylor polynomial it was...2x slower and less accurate than libc's version. These videos might help in the future when I get into DS programming.
@kintustis 1 year ago
Isn't there an ASM instruction for that anyways?
@henke37 1 year ago
@@kintustis x87 does indeed have sin and cos as instructions. I'm sure they were great back in 1995.
@oscarsmith3942 10 months ago
@@henke37 They are actually surprisingly bad. For unknown reasons, Intel only used a 66 bit approximation of Pi, so near multiples of Pi they are only correct to 1 significant figure instead of the 16 that they are supposed to reach.
@jhgvvetyjj6589 8 months ago
@@oscarsmith3942 In real use cases that won't matter since the error inherent in rounding the near-pi value will be much larger than the error of fsin and fcos instructions
@SpringDavid 1 year ago
Kaze when he accidentally creates a movement of optimizing old games to the point they cannot lag:
@benjaminoechsli1941 1 year ago
Speaking it into existence!
@IceYetiWins 1 year ago
Infinite frames per second
@awemowe2830 1 year ago
He might be the first person to finally remove frames from games entirely. We now measure performance in "speed of light", as frames don't exist, and lag is only something the human brain can suffer from now....
@novarender_ 1 year ago
The game is just a function of t
@dudono1744 8 months ago
@@novarender_ That's called a TAS
@GamerOverThere 1 year ago
Kaze is slowly recreating the Ship of Theseus problem in SM64 😂
@pacomatic9833 1 year ago
Now that you say it...
@AROAH 1 year ago
At some point he could just swap out Mario and it’s not even the same game anymore
@ZeroUm_ 1 year ago
If we change all the parts, but it ends up sailing 15 nanoseconds faster, is it the same ship?
@vespertinnee 1 year ago
i get what you're saying. but the mechanics and genre of gameplay ain't changing at all.
@MitchelGatzke 1 year ago
@@AROAH he already did that, he replaced that mario with a brand new optimized mario
@CynicPlacebo 1 year ago
Oh, I remember! I'm so glad to hear a programmer that still cares about performance. Too often I've had coworkers do something in a disgustingly inefficient way because they just don't even think about it. I'll rewrite a query, or a function, or just flatten some nested loops, and suddenly it's 3 to 100 orders of magnitude faster (usually when someone is 1,000x slower they start reaching out for help, but that's about it)
@krystostheoverlord1261 1 year ago
I feel that! I have had coworkers complain to me that I do not need to optimize the code, but I just go ahead anyways since it usually does not take much longer. They end up liking the faster, optimized code much better (usually running real time instead of 1 frame a second LOL)
@CynicPlacebo 1 year ago
@@krystostheoverlord1261 there are dangers on both sides, but it largely boils down to personality types. If you are a perfectionist that wants everything to run perfectly, then there is some use in pushing yourself to go faster and be less perfect (especially early on or during a proof of concept). ...but I think most people fall into the other category of people that want to write it once with whatever pops into their head first and then never revisit it as long as it technically provides accurate information. Those people need to be pushed into taking a little more time to not just do the first thing but at least weigh a couple options. More importantly, they need to go back, actually test the speed, and do a round 2 specifically intended for optimizing, simplifying, commenting, and making the code more elegant (yes, I'm lumping all those sins together, but I realize some people can just have 1 or 2 of those problems)
@adamsoft7831 1 year ago
I think you mean 3-100x slower? 100 orders of magnitude would be 1 followed by 100 zeros.
@CynicPlacebo 1 year ago
@@adamsoft7831 I do not mean 100 times slower. I mean 1000x slower to an incalculably slow but probably exaggerated 100 orders of magnitude (since we never knew how long it would take since it was essentially stuck, I gave it a fake artificial top number for emphasis). They usually start asking for help around 1,000x slower, which was why I listed that, but there are literally processes that they tried to run, estimated it would take a few days, and then 3 months later the process still hadn't even hit 1% success. I'm talking about *really* big data (many many petabytes). Whereas if you cleverly divide and conquer, suddenly we can do the whole thing in less than 24 hours (I know, still slow, but we are talking about many Petabytes across about 300k servers) My point was that many things were literally 100 times slower (2 orders of magnitude) and no one would care or ask for help. They would just deal with the fact that this tool only got run once or twice a year. There was a data sync tool that was running monthly, because it took a week or 2 to run. After I fixed it, it ran hourly (every now and then it would go over the hour mark, so it'd skip 1 cron sync. It was just a simple flock, but it hardly ever got triggered. Usually only after a batch update that touched a ton of datapoints all at once).
@CynicPlacebo 1 year ago
I'm not claiming I'm a genius either. The biggest problem is that a dev would literally try to run a script off their personal machine that would then loop through every server and try to do something. Just by writing the script so it could run on the server itself and rsyncing it everywhere, that gives me a 300,000x boost because each server can do its own thing simultaneously (yes, I mean THAT dumb of mistakes)
@Armameteus 1 year ago
In short:
- not actually faster
- _way_ more accurate
I'd say that's a decent trade-off.
@FloydMaxwell 1 year ago
Summarizing this video is a crime to this video
@NerdTheBox 6 months ago
@@FloydMaxwell but it saves so many cycles
@Ray-pp7zo 3 months ago
It's not even trading anything, you're getting both for free
@Nicoya 1 year ago
Optimizing trig functions is great, but the fastest trig function is the one you never call. Have you taken the time to step back and see how many places where you can avoid entering degree/polar space, and instead simply stay in linear (vector/matrix/quaternion) space?
@KazeN64 1 year ago
yeah, im planning to translate the whole game to quaternion animations for example. animation sin/cos calls are the bulk of this right now. unfortunately the entire engine and every behavior runs on euler angles so i dont want to refactor the actor rotation into quats if i can prevent it.
@kr1v 7 months ago
@@KazeN64 (joke) you've already rewritten the entire source once, why not twice?
@MrGamelover23 2 months ago
@@KazeN64 One, what does that actually mean? And two, what is the benefit in terms of performance?
@andrewliu6592 11 days ago
@@MrGamelover23 based on my limited understanding:
1) quaternions are a different way to represent rotations; you have four floats but you only need 2 trig operations (one sine and one cosine for one angle), while for euler angles you have three floats, each one representing an angle, which requires 6 trig operations (one sine and one cosine per angle)
2) benefit is you don't have to do as many trig function calls, which is good
@Ragesauce 1 year ago
You have no idea how excited I am to play the original SM64 when you remake it with all the improvements. I have held off playing it for years all for this moment. I cannot wait!
@danielpope6498 1 year ago
I thought he said he wasn't releasing these fixes applied to the original game, just using it to make his sequel
@aftdawn 1 year ago
@@danielpope6498 nah, he said at one point that sometime in the future he is gonna backport the upgrades and patches into the vanilla game with no custom levels, but that's probs still gonna be like 6 months after "Return to Yoshi's Island" is out, and there's no ETA on the hack
@Tabu11211 1 year ago
​@@danielpope6498 not if we pressure him enough.
@bretayerstorm 1 year ago
​@@Tabu11211 Not trying to offend or anything but, most of us have witnessed what could happen if we "pressure" someone (or a company) to release or publish an app or game just cause we are impatient (looking at you Cyberpunk 2077, NoManSky... ) We all hate a buggy mess. With that said, I rather be the kind of viewer / customer to actually encourage developers and studios to take their time to make the app, game or whatever they are trying to develop so we the consumers get what we paid for. Pressure will only fuel the crunch culture in Programming jobs (or any other field where this exists...) So no. I rather wait few more months.. HELL a YEAR, as long as final product is stable and efficient enough for us to enjoy. Just my humble two cents. Take care
@goob8945 1 year ago
@@bretayerstorm real shiz bruh
@jfa4771 1 year ago
imagine if Nintendo discovered your rom hack in 1996, how shocked the devs would be
@icyz1ne456 1 year ago
romhack takedown origin story
@Flying_Titor 1 year ago
Forget dmca, they'd send a hitman his way
@mariotheundying 1 year ago
Prob would pay him money for the code or try to hire him, and also have him work on new consoles other than games
@bootortle 1 year ago
Yeah, it would prove backward time travel to be possible!
@keaton718 1 year ago
Maybe the original original Mario 64 ran at like 2 frames per second and ruined Nintendo's reputation and they went bankrupt. Then Kaze came along decades later and improved Mario 64 into what it is today and Nintendo's time travellers got ahold of it, took it back to 1996, published it and saved Nintendo. Now Kaze is basing his improvements off that version of Mario 64, the version he unknowingly wrote himself in a parallel reality, and any day now Nintendo's time travelling spies will bring it back to 1996 and Mario 64 will blow everyone's minds and Nintendo will bankrupt Sony because no one wants a Sony crapstation after they see Kaze's v2 Mario 64.
@lexacutable 1 year ago
I'm enjoying imagining an alternate universe in which commercial n64 games were this efficient
@mizurazu 11 months ago
This. I'm trying to imagine Turok 2 could have actually worked now.
@matthewtalbot6505 1 year ago
Alright, you’re definitely going to be dipping deeper into advanced and/or theoretical mathematics going forwards with this project. Folded 4th order polynomials to approximate the sine and cosine graphs. You, and the community members who assisted, are mad geniuses.
@supersmily5811 1 year ago
I want this rom hack so badly. The levels look so huge, and clean! You have ziplines! And workplace accidents!
@unique_two 1 year ago
I think you could use a double angle identity of cos here: cos(2x) = 2cos(x)^2 - 1. Compute cos in the interval [0, pi/4] via quadratic polynomial, then use the identity to expand to the interval [0, pi/2]. From there you get all values of cos via the usual symmetries. This might get rid of the square root, but I don't understand the details of the implementation.
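A rough sketch of the double-angle idea, assuming the input is already reduced to [0, pi/2]. The inner polynomial here is a plain truncated Taylor series used as a placeholder, not a fitted polynomial and not the one from the video, so the accuracy is only illustrative; both function names are hypothetical.

```c
#include <math.h>

/* Placeholder approximation of cos on [0, pi/4]: 1 - x^2/2 + x^4/24. */
static float cos_small(float x) {
    float x2 = x * x;
    return 1.0f + x2 * (-0.5f + x2 * (1.0f / 24.0f));
}

/* cos on [0, pi/2] via cos(2y) = 2*cos(y)^2 - 1 with y = x/2 in [0, pi/4]. */
static float cos_double_angle(float x) {
    float c = cos_small(0.5f * x);
    return 2.0f * c * c - 1.0f;
}
```

One thing to keep in mind with this approach: squaring and doubling roughly quadruples the error of the inner approximation near pi/2, so the inner polynomial has to be correspondingly more accurate.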
@schlega2 1 year ago
That would be more efficient if you only need the cos. You'd still need the sqrt to get the sin though.
@autodidact7127 1 year ago
Having followed this for years I am never ever EVER disappointed when you upload. One of the only ongoing projects that just ROCK!
@Phoenix_1991 1 year ago
Even though I didn't understand the technical bits, I respect you immensely for your dedication to such old and limited hardware. Hopefully someone will recognize this hard work and give you the credit you deserve, whatever that may be.
@excitedbox5705 1 year ago
It is no different than any other sport. Think about it, you set a limit and then try to better your skill by seeing how hard you can push it. In target shooting you try to get as close as possible to the center of the target, in F1 racing you try to decrease your lap time, here you try to max your FPS. Using an N64 game is just a fun way to set the rules for the "competition" that he has a nostalgic connection to and forces him to think outside the box.
@inthefade 1 year ago
I will be playing the hell out of this ROMhack. I hope that is the kind of appreciation he is looking for (and a really good job, if he doesn't already have his dream job).
@rebmcr 1 year ago
Do you plan to release a version of your optimised engine at some point, which can run the original Super Mario 64 ROM? It would make for a very interesting comparison.
@KazeN64 1 year ago
of course, this mod will be open source after release.
@Apostolinen 1 year ago
Out of curiosity... when will this mod release? It's absolutely phenomenal.
@HawtDawg420 1 year ago
@@Apostolinen it'll release when it's done ;)
@jmssun 1 year ago
I hope you can maintain a parallel release of original M64 patch that includes all your performance mods, this way the community will constantly referring to your channel if they want the most current on their M64 and the best performant build. The side effect is that it encourages more people to your channel and discover your incredible works, as well as knowing your new content
@ruie.34 1 year ago
Yeah but then he’ll get struck by dmca
@kvdrr 1 year ago
@@ruie.34 nah he wouldnt, but its nice Kaze has a community so eager to defend him i guess 😅
@enochliu8316 1 year ago
@@kvdrr Many of his mods have been struck down. 😢
@soviut303 1 year ago
I'd love to see you collaborate with James Lambert who's building Portal 64 to see what kind of performance gains he could potentially see with your optimizations.
@4.0.4 1 year ago
Can't imagine they don't already watch each other's content
@McWickyyyy 1 year ago
How do you figure all this stuff out lmao. I am a full stack web dev but when I see stuff like this I’m just like I’m a fraud 😂
@NoNameAtAll2 1 year ago
learn C, join the system side. you'll be angry at electron just like us!
@KazeN64 1 year ago
i dont even know what electron is... :D
@McWickyyyy 1 year ago
Lmaoo. I did learn some C in college and I actually loved it and was one of the few to do well 😂 and a little bit of assembly. It is def no easy task lmao. I’m tryna make a fullstack browser game actually. Got the prototype phase done of getting everything I need in place. But I just know imma run into optimization issues later. I always love watching videos like this to see how the pros do it 😭
@OhluhKayTall 1 year ago
Web devs sticking together 😤. In the same boat and feel just as fraudulent watching these optimization videos. One of these days I'll learn C or something ...
@fungo6631 1 year ago
@@NoNameAtAll2 OP should learn HolyC instead, like a divine intellect individual would do.
@angeldude101 1 year ago
"// imaginary part in the cosine to give the reader mental damage" It's a critical hit! Not quite the quaternion video I was hoping for (you mentioned adding quaternions in the comments of your prior video), but I i will always accept more math optimization content on this channel. And yes, I was not joking when I i said I i was hoping for _quaternions._ Quaternions are actually pretty simple when not obfuscated or divorced from their connection to the composition of reflections.
@adiel_loiola 1 year ago
I have LITERALLY no IDEA what you are talking about, but i love those videos lmao.
@myggmastaren3365 1 year ago
when anyone asks if math is useful, I'll just redirect them to your videos
@samahearn770 1 year ago
I think my fave way to avoid doing sqrt is the famous Quake 3 fast inverse square root function, which uses the mantissa of the float itself through a dubious cast and some bitwise black magic so you can calculate normals faster.
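For reference, a sketch of the widely published fast inverse square root, written with `memcpy` instead of the original pointer cast so the type punning is well defined in modern C. Note that the trick operates on the whole bit pattern of the float (exponent and mantissa together). The function name is arbitrary, and as a reply further down the thread points out, this is not expected to beat the N64's hardware square root.

```c
#include <stdint.h>
#include <string.h>

/* Approximates 1/sqrt(x) for positive, normal floats. */
static float fast_rsqrt(float x) {
    uint32_t i;
    float y = x;
    memcpy(&i, &y, sizeof i);           /* reinterpret float bits as integer */
    i = 0x5f3759df - (i >> 1);          /* initial guess via the magic constant */
    memcpy(&y, &i, sizeof y);           /* back to float */
    y = y * (1.5f - 0.5f * x * y * y);  /* one Newton-Raphson refinement step */
    return y;
}
```

With a single refinement step the relative error stays within a fraction of a percent, which was acceptable for lighting normals but is far less accurate than a hardware `sqrt` followed by a divide.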
@AwesomeGames56 1 year ago
This is wild, not only is it more accurate but it’s also fast enough that the console doesn’t even know there’s a difference. Pushing the 64 like this makes me wonder what kind of games we could have if AAA devs still put games out on older systems.
@Crigence 1 year ago
Made timestamps for the video. @KazeN64, can you please implement these when you got a minute?
0:00 Intro
0:56 Chapter 1: Refresher
2:49 Chapter 2: Numerics
3:20 Chapter 3: The Square Root...
4:42 Chapter 4: The Folded Polynomial
(I quickly scrubbed the video several times and there was no chapter 5, I think Kaze just saw the length between chapters 4 and 6 and assumed it was there)
10:29 Chapter 6: Other Ideas
12:23 Conclusion
@KazeN64 1 year ago
done! ty!
@MayVeryWellBeep 7 days ago
I'm so glad I watched this follow up video as well, or I would never have proceeded from being at the start of my lunch break having no understanding of this issue to where I am now: at the end of my lunch break.
@codcouch1 1 year ago
I'm new to programming and I was shocked that interpolating between 2 values is slower than calculating all that stuff. I would have just assumed that interpolation was faster and never even investigated. Nice job
@PsychorGames 1 year ago
You think I would just ignore Mario doing the soyjak face in the thumbnail? You think I would just let that go? You're a fool.
@geno_purple 1 year ago
I'm consistently blown away at your programming skills. Keep up the great work!
@mongus2 1 year ago
I hope the community has access to all of these clever optimizations one day
@renakunisaki 1 year ago
Nintendo, hire this man once he manages to optimize this game so much it runs in reverse and works as a time machine.
@DavidRomigJr 1 year ago
This was an interesting watch. I love these types of optimizations, pushing the limits. My favorite optimization has to be the fast inverted square root since it's so simple and so fast, obvious in hindsight but not very if you don’t already know it. All the cache talk reminded me of the issues we had with PC to PS2 ports, having DMAs constantly stalling on instruction and data cache fills. In the end there wasn’t a lot we could easily do. Fun times.
@Gestersmek 1 year ago
I guess the only thing to do now is Fast Approximate Square Root. For real though, that folded polynomial is crazy. The math nerd in me was more than impressed at the ingenuity.
@KazeN64 1 year ago
any square root approximation will be a lot slower than the hardware one i think. the famous inverse square root algorithm is in the ballpark for 3 - 20x slower (depending on use case)
@MDaveUK 1 year ago
​@KazeN64 is that the quake 3 fast inverse square root trick?
@Gestersmek 1 year ago
@@KazeN64 Well, that's unfortunate, but hey, at least you got a few cycles saved with the current implementation.
@blarghblargh 1 year ago
@@MDaveUK if you've heard of it, it's the famous one :P
@anonymouscommentator 1 year ago
i absolutely love your videos. not only was mario n64 my childhood game, your videos have the perfect amount of nerdiness and math in them to be interesting while your jokes are hilarious. Keep it up!
@karlosk5773 1 year ago
Random question: Who composed and programmed the music in the Return to Yoshis Island demo and Peachs Fury? The music is amazing in these games! Thank you for your incredible work!
@StopChangingUsernamesYouTube
I almost caught myself asking why, but no, this is cool as hell. While the heft of our operating system-scale browsers (at least relative to OSes [checks notes] 20 years ago? Holy crap.) may give the impression that memory and compute are cheap and plentiful for the moment, we're running into for-now limits on just how much more memory we can pack into a space, how densely we can pack storage on a drive of any type we currently use, and just plain how many transistors we can fit on a given die using today's tech. People enjoying the sandboxes of constrained computing that old consoles offer will probably provide some much-needed optimization hints for the giant and heavy programs of tomorrow. But most of all, I know we all pursue our own niches for personal enjoyment and it makes me happy to see someone making strides in their chosen area.
@timseguine2 1 year ago
After your last video on this topic, I was convinced that the accuracy could still be improved considerably. So I am glad you found a way to get it without sacrificing performance, even if none of my suggestions were what got you there.
@gameisrigged6942 1 year ago
Quaternions 😢
@KazeN64 1 year ago
quaternions will be the final animation format used here, no worry! but yeah at the moment it's still euler angles
@angeldude101 1 year ago
Fingers crossed for a video. I will however ask if your planned use of quaternions still uses the 360° fixed point angle format, since the individual components are no longer just raw angles, but floats are also twice as large.
@gameisrigged6942 1 year ago
@@KazeN64 that would be insanely cool!
@TaranAlvein 1 year ago
That was super cool. I liked watching how everything was broken down to extrapolate positions based on just a few calculations. It was amazing, and very interesting to watch!
@yvendous 1 year ago
MARIO IS A MENACE IN THIS GAME Mario making that goomba and get squished by the plank he was sitting on?? and Mario breaking the glass the Koopas were carrying??? Hilarious
@Diablokiller999 1 year ago
You really should look again into CORDIC, there are implementations of this algorithm only using addition/subtraction, specifically for FPGAs. I used it a couple of years ago to calculate a sine for a 40MHz ADC input for phase shift detection (dual phase lock in) and only needed ~30 clock cycles for a 64 Bit input signal.
@KazeN64 1 year ago
i'll need to see some code before i can give it a consideration
@cartoonhead9222 1 year ago
Someone needs to show this to Todd Howard so he knows what optimisation is.
@BBWahoo 1 year ago
He's too busy putting his energy into convincing people the ridiculous CPU tax is fine
@TheInfiniteAmo 1 year ago
Kaze, your channel and insane romhacking ability was a big inspiration for me picking up Decomp romhacking myself and making my first Pokemon romhack. Just like your videos I barely understand what's going on and I'm enjoying every second of it. Thanks for being awesome.
@DorE3k 1 year ago
These optimizations are getting ridiculous at this point, great stuff Kaze! The folded polynomial approach is brilliant and elegant, nice job to the guy who came up with it
@cartanfan-youtube 7 months ago
JavaScript devs: just add a package, who cares if this website takes a few more seconds to load
N64 devs: by exploiting the heavenly symmetry of sines and cosines, i can save 50 nanoseconds.
@anjoliebarrios8906 1 year ago
3:45 WHAT!!! I didn't know this!! My math teacher just made us memorize everything. Well, we knew the first set of equations, not the 2nd set (quarter rotation ahead)
@LavaCreeperPeople 1 year ago
You're still doing this stuff to this day? Nice
@AllothTian 1 year ago
Exploiting the mirror and translation symmetries of sine/cosine is standard in any half-decent math library. Those libraries typically also implement an arbitrary precision division by pi, as that is the biggest contributor to inaccuracy in naive division-by-constant implementations. That being said, if you can ensure the numbers you feed into your sine/cosine never grow too large, the error should be acceptable for most real-time applications. Lastly, the more accurate (compared to Taylor) polynomial coefficients are obtained via the Remez algorithm, and you've got frameworks like Sollya that can compute that for a desired precision and/or order.
@jansenart0 1 year ago
Imagine what we could be capable of if this Talmudic-level of scrutiny were applied to modern programming.
@rdgfb 1 year ago
mario 64 is now so technical, math teachers would get confused without trying
@prototypez4343 1 year ago
nintendo 64
@madlikov747 1 year ago
Nintendo ultra 64
@Slenderquil 1 year ago
I was never good at math so i have no clue what's going on, but this seems like a really cool find
@mariotheundying 1 year ago
I'm good at math but haven't reached this level of math in school, I'm in one of the last grades and I'm wondering when will it be taught or if it starts in college
@lior_haddad 1 year ago
That's an awesome idea for approximating, sad that it's basically the same speed-wise, and n64 specific...
@KazeN64 1 year ago
its a lot more accurate so i think it's still a huge bonus! i bet theres other architectures that benefit from this approach too
@lior_haddad 1 year ago
@@KazeN64 yeah, I guess more older hardware would probably benefit from this! I just don't think there's a lot of hardware where you both use sin/cos/tan a lot, yet those operations are not super-optimized in hardware. Accuracy is great though! How close is this to being the perfect 1ULP function?
@timmygilbert4102 1 year ago
I was on a discord where they had discussed porting Mario 64 to the GBA, the same discord where tomb raider GBA was presented, I wonder how much it's compatible with that console, they had Bob-omb Battlefield rendered with textures. The fill rate is even more of a bottleneck 😂
@micalobia1515 1 year ago
@@lior_haddad Only use case I could think of is GPU stuff, where that style of sin/cos would be integrated into the hardware, I've no idea if it would be better than what they use though
@lorebz 1 year ago
every time I think about how Kaze's videos make me wanna code, Kaze releases a new video about how coding includes even more math, like sine and curves and more and more math, then I get overwhelmed, then I get hopeful, then I get overwhelmed again, then I get hopeful and then I g
@That_0ne_Dev 3 months ago
I like the idea that a math chap joined the discord. Dumped a bunch of math pages to improve the sin function in images and did not elaborate
@bmenrigh 1 year ago
Also note that you can represent a*x^4 + b*x^2 + c*x + d in another form: d + x*(c + x*(b + x*(a*x))). For higher order polynomials this nested form needs far fewer multiplications than the more traditional naive representation.
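A minimal sketch of the two forms from the comment above, in C. The function names `poly_naive` and `poly_nested` are hypothetical and are not taken from the video's code; the point is only the multiplication count, which drops from 7 (powers written out) to 4 in the nested form.

```c
/* Naive form: a*x^4 + b*x^2 + c*x + d with every power written out.
 * As written this costs 7 multiplications (4 + 2 + 1). */
static float poly_naive(float a, float b, float c, float d, float x) {
    return a * x * x * x * x + b * x * x + c * x + d;
}

/* Nested form from the comment: d + x*(c + x*(b + x*(a*x))).
 * Only 4 multiplications; there is no x^3 term, so the innermost
 * step is a*x multiplied by x once more. */
static float poly_nested(float a, float b, float c, float d, float x) {
    return d + x * (c + x * (b + x * (a * x)));
}
```

Both functions return the same value up to floating-point rounding; an optimizing compiler may or may not turn the first into the second on its own, since reordering float arithmetic changes rounding.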
@KazeN64 1 year ago
that is what the code does. but regardless of which way around im typing it in, pretty sure GCC optimized that anyway.
@zoiosilva 1 year ago
I can't wait to start seeing speedruns of one of your fixed sm64 versions, and listening to the speedrunner's comments on the run.
@alec_almartson 1 year ago
Thank You for teaching us in this Masterclass 😮💯🎮👍🏻 By the way, this is the same train of thought that our Legendary friend John Carmack followed when he conducted his exhaustive research to simplify the "Square Root" function to improve the Doom (and Quake) game engine, you know, to make us happier in the end. Very good, please keep up the good work (I like this kind of video)👍🏻
@BlueFinch 1 year ago
I watch this video the same way my dog watches me - I know you're saying complicated things, but I don't understand them amidst my enjoyment.
@sem_aki 1 year ago
This is the Oppenheimer of Mario 64
@macksnotcool 1 year ago
Possible optimization: This is going to sound ridiculous, but in many programming languages, multiplying by 0.5 can be faster than dividing by 2. I know this is the case in C# but I don't know about C or C++.
@Octobeann 1 year ago
I don’t remember for sure but I think he might’ve said he’s already doing that in the previous video
@angeldude101 1 year ago
On a bit level, they're just adding and subtracting 1 from the exponent, though if the instructions always take the same number of cycles, then dividing by 2 would slow down to accommodate non powers of 2 that get passed in, which are much slower than a single multiply, but can be more accurate. Multiplying or dividing by a power of two on the other hand is always perfectly accurate as long as your floats aren't subnormal. That said, this doesn't actually seem relevant as the given code doesn't include a single division, nor multiplication by a half.
@KazeN64 1 year ago
GCC compiles a division by 2 into a multiplication by 0.5 - but i don't divide anywhere in this anyway so i don't know where you got that idea.
@macksnotcool 1 year ago
@@KazeN64 Yeah, you're right. Also, I meant it as a general optimization and not one for calculating sin functions.
@cmyk8964 1 year ago
The idea to not only fold the circle into quarters, but to fold the quarters again into eighths, is very cool
@Snazzysnail15 1 year ago
I do not understand 😁👍
@MagusArtStudios 1 year ago
Your videos have improved my coding skills :)
@lowenevvan8619 1 year ago
0:33 anyone curious about this: I believe a signed integer uses the 1st bit as the sign. Thus using a shift right on any negative signed integer drops the sign, divides by 2, then adds a very large number. In other words, a very unexpected result. You can maybe make it work, but it's more involved than just shifting right.
@KazeN64 1 year ago
it'll divide by 2 just fine - but it will round the division result wrong
@Skyliner_369 1 year ago
I love that the benefit of this code isn't so much speed, but just pure accuracy.
@JeffACornell 5 months ago
What about a piecewise quadratic approximation? You can minimize the lookup table by using complex numbers for rotation (cos(x)+i*sin(x)). For example, if you pre-compute the sine and cosine of pi/4, pi/8, pi/16, and pi/32, then you can generate the sine and cosine for any multiple of pi/32 up to pi/2 by complex-multiplying the appropriate combination of pre-computed angles. You then complex-multiply this result by a small-angle approximation (presumably sin(x)=x and cos(x)=1-0.5x^2) to get to full resolution. At some point all those complex-multiplications will cost more than the square root function, but I wonder how much precision this could get before that happens. And I wonder if this binary-search style of lookup can be represented as just constants within the instructions and live in the instruction cache.
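A rough sketch of what the comment above proposes, in C, assuming the input is already folded into [0, pi/2). The helper name `sincos_rot` and the table layout are made up for illustration; this is not code from the video, and no claim is made about how it would perform on the N64. The angle is split into a multiple of pi/32 (4 bits) plus a small remainder; the remainder uses the small-angle approximation and each set bit applies one complex multiplication by a precomputed rotation.

```c
/* cos and sin of pi/32, pi/16, pi/8, pi/4 (index = bit position of the step count). */
static const float ROT_COS[4] = {0.99518473f, 0.98078528f, 0.92387953f, 0.70710678f};
static const float ROT_SIN[4] = {0.09801714f, 0.19509032f, 0.38268343f, 0.70710678f};

#define STEP     0.09817477f   /* pi/32 */
#define INV_STEP 10.18591636f  /* 32/pi, so no division is needed */

/* Hypothetical helper: x is assumed to lie in [0, pi/2). */
static void sincos_rot(float x, float *s, float *c) {
    int k = (int)(x * INV_STEP);   /* number of whole pi/32 steps, 0..15 */
    if (k > 15) k = 15;            /* guard against rounding at the upper edge */
    float r = x - (float)k * STEP; /* remainder, roughly in [0, pi/32) */

    /* Small-angle start value: cos(r) ~ 1 - r^2/2, sin(r) ~ r. */
    float cs = 1.0f - 0.5f * r * r;
    float sn = r;

    /* One complex multiplication (cs + i*sn) * (cos a + i*sin a) per set bit. */
    for (int bit = 0; bit < 4; bit++) {
        if (k & (1 << bit)) {
            float t = cs * ROT_COS[bit] - sn * ROT_SIN[bit];
            sn = sn * ROT_COS[bit] + cs * ROT_SIN[bit];
            cs = t;
        }
    }
    *s = sn;
    *c = cs;
}
```

With sin(r) ~ r the worst-case error is about r^3/6, which is roughly 1.6e-4 at r = pi/32; a finer table or a better remainder polynomial would be needed to go below that.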
@newbornkilik 1 year ago
Wow, I have no idea what Kaze just said to me the last 15 minutes, but I am happy for it!
@musaran2 1 year ago
1) "If"s are notorious perf killers. Test with a representative set of angles & corresponding conditional execution. 2) Float sign flip optimizes to a XOR. The "if"'s flag could serve as mask, bypassing a test & jump. Make sure the end code uses those tricks. 3) If conditional jumps can't be avoided, what if we 1st test angle octant, jump to near duplicates of calculation code, and DON'T have to swap or flip sign at the end? 4) Floats store values as logarithms. This opens to optimizations turning ×/÷ into +/-, as in Doom's famous inverse square root.
@KazeN64 1 year ago
1) not on n64, branches are a single cycle. we have no branch prediction 2) to do float bit manipulation on n64 you first need to move them to a general purpose register. flipping the sign through that is a 6 cycle operation. doing it with branches is 3 cycles. 3) like before, flipping the sign is very cheap using a conditional. 4) that again requires bit manipulation which is 2 moves and will be slower. besides, we don't multiply or divide by a power of 2 anywhere here so that is also not useful.
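To make points 1) and 2) concrete, here is a sketch of the two ways to flip a float's sign conditionally, in C. The function names are hypothetical; the cycle counts quoted in the thread are Kaze's figures for the N64 and are not measured here. The branchless version has to view the float as an integer, which is the float-register to general-purpose-register round trip the reply describes.

```c
#include <stdint.h>
#include <string.h>

/* Branching version: test the flag bit, negate if it is set. */
static float negate_if_branch(float x, uint16_t shifter) {
    if (shifter & 0x8000) {
        x = -x;
    }
    return x;
}

/* Branchless version: XOR the flag into the IEEE 754 sign bit.
 * The two memcpy calls stand in for the register moves between
 * the float and integer register files. */
static float negate_if_bits(float x, uint16_t shifter) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits ^= (uint32_t)(shifter & 0x8000) << 16; /* bit 15 -> bit 31 (sign) */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Both return the same result; which one is cheaper depends entirely on the target CPU, which is the point being made in the reply above.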
@musaran2 1 year ago
@@KazeN64 Gee, who would have thought that you knew your stuff? No one could see that coming! :> Oh well. As long as we leave no stone unturned…
@KazeN64 1 year ago
@@musaran2 haha yeah don't worry. my first programming language was mips assembly. i always think in terms of compiled code and not the english you see when writing C
@cmon200 1 year ago
Kaze in school. Kaze's classmate: "ugh why do we need to learn this, I'll never use it" Kaze: "mario"
@cameronabshire1195 1 year ago
Hmm! How accurate were the original Mario 64 calculations for sin/cos? Depending on the answer, maybe it would be feasible to sacrifice some of that newfound accuracy for extra performance that could *potentially* be measurable.
@DallinBackstrom 1 year ago
Kaze mentioned that the original LUT had a max error of 15 angle units or so, which is an order of magnitude higher than the new accuracy of the folded 4th order polynomial. So, maybe some performance can be wrangled out here, although I have my doubts. But doubts don't make proofs. Let's be rigorous! For gameplay, we'll aim for a max error in sine of 0.001 or less, which should equate to around ~15 angle units of error, close to the original LUT. For graphical rendering, we'll aim for a max error in sine of 0.03 or less, which would be a little bit worse than the 3rd-order approximation from the last video, and MUCH worse than the new folded 2nd order polynomial, but probably not that noticeable as long as the function is 0 at x=0 and 1 at x=pi/2. The 2nd order folded polynomial is currently used for graphical rendering, and the 4th order folded polynomial is presently used for game logic. Let's consider the graphics first: the 2nd order polynomial can be reduced to a 1st order polynomial. This is literally the sin(x) = x approximation, which is probably unacceptably inaccurate, but let's be rigorous. The identity cos(x) = sqrt(1 - sin(x)^2) becomes cos(x) = sqrt(1 - x^2). This needs to be valid over the range 0 to pi/4, or -pi/4 to pi/4. I plugged these into Desmos and compared them to the built-in Sin() and Cos() functions over the range {-pi/4 to pi/4}, and the average error is actually not as bad as you might think. However, the maximum error is very bad, at 0.088 for the cosine approximation, which is well above the limit we set for ourselves earlier. But it gets worse. Remember, we're only able to use the restricted domain of 1/4 pi because we use the square-root identity to complete the graph. There's an ugly discontinuity where x transitions into sqrt(1-x^2). The gap between the two graphs at this point is about 0.16, which is a huge and unacceptable gap in the function.
I wish I could include a picture in my comment to really clearly illustrate this, but you can try the following for yourself in desmos or wolfram or whatever: function 1: sin(x) {0
@KazeN64 1 year ago
reducing the instruction count by 5 won't help unfortunately. The main driver of lag on the n64 is memory bandwidth. the compiled function is exactly 32 instructions and unless you can get below 24, you'll still load 32 instructions into the cache. this is why these new functions made basically no difference regarding performance. (and keep in mind the CPU performance does not matter at all - all that matters is how much of that performance translated to the renderer,... which is just a fraction of that)
@DallinBackstrom 1 year ago
okay, scratch what I said earlier about the first-order polynomial! I was able to get the discontinuity to disappear **and** get the maximum error in sine down to 0.03, which is acceptable for rendering purposes. I just had to add a slope constant of 0.9 to the sin(x) = x approximation, so it ends up looking like the following: function 1 = 0.9*x, function 2 = sqrt(1 - function1(x)^2), and the error is totally manageable???? this is wild, lol. It still only saves one multiplication and one subtraction, though, so it's up in the air whether the added error is worth it.
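The 0.03 figure and the vanishing gap can be checked by hand. The short derivation below is a sanity check under the comment's own setup (sine branch 0.9x, cosine branch sqrt(1 - (0.9x)^2), both on [0, pi/4]); the symbol e(x) for the sine error is introduced here only for illustration.

```latex
% Error of the sine branch on [0, \pi/4]
e(x) = \sin x - 0.9\,x, \qquad e'(x) = \cos x - 0.9 = 0
\;\Rightarrow\; x^\ast = \arccos 0.9 \approx 0.451

e(x^\ast) = \sqrt{1 - 0.9^2} - 0.9 \cdot 0.451 \approx 0.4359 - 0.4059 \approx 0.030

% Gap between the two branches where they meet, at x = \pi/4
0.9 \cdot \tfrac{\pi}{4} \approx 0.7069, \qquad
\sqrt{1 - 0.7069^2} \approx 0.7074, \qquad
\text{gap} \approx 0.0005

% For comparison, with slope 1 (the plain \sin x \approx x approximation)
\tfrac{\pi}{4} \approx 0.7854, \qquad
\sqrt{1 - 0.7854^2} \approx 0.6190, \qquad
\text{gap} \approx 0.166
```

So the slope of 0.9 happens to make 0.9x land almost exactly on sin(pi/4) at the seam, which is why the discontinuity of about 0.16 shrinks to well under 0.001.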
@KazeN64 1 year ago
@@DallinBackstrom that is pretty funny that that works lol, but yeah not worth it at all
@Hinguckah 1 year ago
I have no idea what this actually means but it sounded exciting to save a few nanoseconds on stuff! I guess this removes some of the jank with the collisions and stuff? Maybe the random invisible wall bonks? Or is that not what has become 3 times more accurate?
@KazeN64 1 year ago
the animations and marios directions. e.g. when you press up-right on the stick, mario will move more in the direction you hold. in the vanilla game he is off by about 0.2 degrees sometimes.
@Hinguckah 1 year ago
@@KazeN64 Oh, yeah that does sound great, thanks for answering : )
@Davidevgen 1 year ago
the accuracy improvement is more important than the performance improvement imo.
@caiocc12 1 year ago
Next step is exploring the MIPS ISA and finding instructions that can accelerate the computations, i.e. in code that uses sin/cos more than once in the same equation, employ SIMD to compute multiple sin/cos at once.
@oldtools6089 1 year ago
I've got a 300lb toenail collection which justifies my ice-cream and chicken-skin diet.
@f.n.8540 1 year ago
kaze embrace the lag and make touhou mario with 2000 bullets per spell card
@ValentineShevelev 9 months ago
I'm sorry for my stupid question, but can't we just precalculate sine and cosine for some values (for, like, 1000 of them, more or less) and then just use this as a lookup table. Maybe you can also combine this with your current method, so that you use this to calculate all the parts where you would have used square root. I'm sorry if this has obvious problems (such as costing too much memory to be viable), I'm not really experienced and/or knowledgeable enough in this field 😅 Upd: uh, oh, found another comment on this topic which has been answered. For anyone else who is curious, the problem IS indeed the lack of cache (as far as I understand). It has been mentioned in the previous video as well. Anyways, thank you for your awesome content!
@koolgamzstudio 2 months ago
If Cosine is just sin but +90 on input, why use more instructions and math when you can just add 90 to the direction
@Orzorn 1 year ago
I'm a professional software engineer and this is some god damned black magic tech wizardry. I enjoy optimizing things myself, but this recognizing all of the symmetries available and abusing each of them to get this kind of accuracy is beautiful.
@victorvillacis6764 1 year ago
My man is obsessed with optimizing N64
@ArchieN1761 1 year ago
HAHAHA I had a "silly playlist" on in the background while watching this, and when the folded polynomial was revealed the Mario RPG victory theme played. I didn't even blink until I thought more about it
@guy_th18 1 year ago
"some things might randomly not be on the right cache line anymore" piqued my interest. never heard the term "cache line", do you talk about it in another video? is it similar to memory paging?
@angeldude101 1 year ago
A cache line is usually much smaller than a page, though they're also usually aligned to page boundaries on systems with paging. A page might be 4096 bytes, while a cache line might be only 16 bytes, with exactly 256 cache lines per page. Paging is more about abstracting physical memory into a virtual address space, while cache lines are about prefetching data from physical memory to store in the CPU so it can be accessed faster. Generally when you read from one address, you're very likely to be reading from other nearby addresses as well.
@SensSword 1 year ago
7:30 why not do the XOR trick to swap the values without a temp variable? That used to be quicker back in Quake 1 days....
@KazeN64 1 year ago
check the pinned comment
@HydratedBeans 1 year ago
This makes me wish we had the source for all games. Imagine what the community could do.
@xdanic3 1 year ago
We could probably run so many games on potato laptops, but nowadays we would need the codebase, the project files and the engine they were built on... And we would need another Kaze for each game
@SoyAntonioGaming 1 year ago
we already have CoD games , we dont need any other games
@HydratedBeans 1 year ago
@@xdanic3 I’ll be the kaze for Command & Conquer
@hyperteleXii 1 year ago
I'm guessing this wouldn't work on a modern, deeply pipelined processor, due to the branches. It's kind of funny that IF ELSE is faster than math on N64 🙂
@KazeN64 1 year ago
wouldn't really matter, you could just swap some stuff out for a sign copy on most modern processors i think. but modern processors are a lot faster at polynomials anyway so you probably dont even need this stuff
@ferrarikangaroo9271 1 year ago
This video was amazing and made me a bit depressed that I don't have time to code in my spare time anymore. Is there a Mario 64 rom hack that you guys recommend that plays on real hardware?
@nicholaskroeplin81 1 year ago
kaze made a version of Super Mario Star Road that runs on real hardware. you will need a flash cart though
@_DRMR_ 1 year ago
This massively improved accuracy for essentially no penalty (under known game conditions) is really a fantastic outcome!
@multicoloredwiz 1 year ago
wild how much you guys can come up with. god bless the information superhighway baby
@timmygilbert4102 1 year ago
Mmm, recently I found out about Chebyshev polynomials in the computation of a cosine distance field. I haven't investigated yet; my use case is a bit different, I'm trying to find an algorithm with no loop to raytrace a helix. Fun how both look alike.
@LordZero666 1 year ago
Someone hire this guy and we will soon run PS5 games on a phone with performance to spare.
@dedvzer 1 year ago
Takeaway: "to write the best performing code you should always ask..." *your mathematician friend*. Great work on all these videos!
@emceebois 1 year ago
IMO this should be spread far and wide throughout all of retro gaming and vintage computing, nothing about this seems like it would ONLY be an improvement on the N64. Fewer cycles for greater accuracy in computing sin(x) and cos(x) without huge LUTs would improve homebrew for...damn near everything! Where were YOU when Kaze Emanuar and SilasLock re-wrote the book on estimating trig functions?
@DanielDugovic 1 year ago
8:39 for the math-confused, this is about 0.096ms or 96us per frame.
@Yoshistar95 1 year ago
So the improvement is like you add salt to your dish, to make it taste a bit better
@isogash 1 year ago
Uhhh, most C/C++ compilers will interpret the >> operator applied to a signed integer as an arithmetic shift right, which *is* a division by power of 2 for both positive and negative numbers. For unsigned integers a logical shift right is the same as a division by power of 2 regardless, so the commenter was correct and you are wrong, for most compilers. The actual defined behaviour of >> for signed integers is that it's left up to the implementation of the compiler.
@KazeN64 1 year ago
They are almost the same - but the C standard calls for rounding towards 0 while >> will round towards negative infinity. He was arguing they were exactly identical. Generally /2^n will compile into a right shift by reg size, an addition and then a right shift by n instead of just a right shift by n.
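The rounding difference described above is easy to see in C. The sketch below assumes that `>>` on a negative signed integer is an arithmetic shift, which is implementation-defined in C but is what GCC and other mainstream compilers do; `shift_only` and `div_pow2` are hypothetical names. `div_pow2` mirrors the general shape of the sequence a compiler emits for x / 2^n: extract the sign, add a bias of 2^n - 1 for negative inputs, then shift.

```c
#include <stdint.h>

/* Plain shift: rounds towards negative infinity for negative x
 * (assuming the compiler implements >> on signed values arithmetically). */
static int32_t shift_only(int32_t x, int n) {
    return x >> n;
}

/* Division by 2^n with C semantics (rounds towards zero), for 1 <= n <= 31.
 * sign is 0 for x >= 0 and -1 (all ones) for x < 0;
 * bias is therefore 0 or 2^n - 1. */
static int32_t div_pow2(int32_t x, int n) {
    int32_t sign = x >> 31;
    int32_t bias = (int32_t)((uint32_t)sign >> (32 - n));
    return (x + bias) >> n;
}
```

For example, -3 / 2 is -1 in C, while -3 >> 1 gives -2 under the arithmetic-shift assumption; for non-negative inputs the two functions agree.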
@tef_ebooks 1 year ago
Did you end up trying out this new partitioning method on lookup tables too? You mentioned in an earlier video that sin/cos could use 1/4 the table size through mirroring/reflections, but I'd be curious to know how much of an improvement you could make by using the folding trick on lookup tables
@KazeN64 1 year ago
yeah i showcased exactly that on my previous sine video! i did something even better actually - i had an 1/8th sine and 1/8th cosine table interwoven for cache locality.
@tef_ebooks 1 year ago
@@KazeN64 Nice! I only remembered seeing the 1/4th thing from Zelda in the last video, rather than "lookup table for the top of the wave and using squares to calculate the missing parts". I know you're probably tired of backseat optimizations, but I am wondering if there's a useful compression method for a table to save more space, like "packing more floats in a cache entry by using deltas" or "one small lookup table of approximate floats and one larger lookup table of deltas to correct the value". The only other thing that springs to mind is trying to correct the error by improving the approximation, like a round of Newton-Raphson, but that includes a divide, so, no dice. Anyway I've been nerd sniped, great video!
@KazeN64 1 year ago
@@tef_ebooks it's not actually using squares to get all the values in the 1/8th implementation! i explain it a bit in the video if you want to understand it. it's a pretty cool concept i'm using.