
Is this New Quadratic Formula Actually Faster than the OG Quadratic Formula? (For a Computer) 

Creel
97K subscribers
94K views

In this video we look at how Po-Shen Loh's new quadratic formula compares to the original, tried-and-true quadratic formula in terms of speed. The new formula certainly looks simpler, but which is faster for a computer?
Po-Shen Loh's explanation of his formula: • Examples: A Different ...
Support What's a Creel? on Patreon: / whatsacreel
Facebook: / whatsacreel
Background images for animations from HDRI Haven: hdrihaven.com/
Image of Lesley 'Twiggy' Lawson from Wikipedia
Software used to make this vid: Visual Studio 2019 Community: www.visualstudio.com/downloads/
Blender: www.blender.org/
Audacity: www.audacityteam.org/
Davinci Resolve 16: www.blackmagicdesign.com/prod...
OpenOffice: www.openoffice.org/
Gimp: www.gimp.org/

Published: Aug 30, 2022

Comments: 376
@gideonmaxmerling204 · 1 year ago
Would probably be faster to just return 1 for every entry. It would be accurate enough on astronomical scales.
@rayniac211 · 1 year ago
Unless your quadratics are astronomical in nature ;)
@WitherLele · 1 year ago
@@rayniac211 then it would be accurate on a bigger scale
@RobertLoweIsAwesome · 1 year ago
@@WitherLele ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-ICv6GLwt1gM.html
@airkami · 6 days ago
You remind me of the algorithm I designed to check an array of positive integers and return true for each value divisible by 3. My algorithm: return false. It is fast, it doesn't contain unpredictable failure, and it is correct most of the time given the set of all positive integers.
@josemonge4604 · 1 year ago
A dark background and blue aren't a good combination of colors for readable text. Try white or yellow next time, or something like that. Anyway, great comparison and great job on the optimizations!
@HansLemurson · 1 year ago
If you want Blue to be seen, add some Green!
@firstname4337 · 1 year ago
LOL, he said blue code and I said "what code?" I had to change the brightness of my monitor to see it.
@sdjhgfkshfswdfhskljh3360 · 1 year ago
Not only is it bad by itself, it is also further damaged by YouTube's compression algorithms.
@NoNameAtAll2 · 10 months ago
yellow on white? ha ha
@professortrog7742 · 10 months ago
@@NoNameAtAll2 No, on the dark background.
@47Mortuus · 1 year ago
9:35 Don't divide by 2, multiply by 0.5. The compiler is not allowed to do that for you, as it can sometimes produce different bitwise results... unless you have fast unsafe math enabled, which also allows FMA reordering, a pretty significant optimization.

Also, much of the runtime is actually float division. It's not pipelined: you can only start a division after another has finished, which is 14 clock cycles at best. That is extremely relevant because you can replace an FP division with "RCP_PS" plus some Newton-Raphson, which has _WAY_ higher throughput than FP division (relevant for tight loops such as this one), and can be FMA-fused when multiplying by the reciprocal instead.

Here's the code for an accurate reciprocal (only _possibly_ worth it if your CPU supports FMA; total latency (fastest possible): 4 (rcp) + 2 * 4 (FMA) = 12 cycles vs 11 cycles (fastest possible division), but way higher throughput and FMA "fusable", meaning up to another 3 clock cycles saved):

__m128 rcp23_ps(__m128 a) {
    const __m128 ONE = _mm_set1_ps(1.0f);
    __m128 r = _mm_rcp_ps(a);
    return _mm_fnmadd_ps(_mm_fmsub_ps(r, a, ONE), r, r);
}
@DDlol01 · 1 year ago
I was looking for this one. I also want to add: even if fast unsafe math is enabled, on some compilers (MSVC: /fp:fast) you should also enable optimizations for speed so constants are rewritten as needed (done when using /O2 or /Ot but NOT /Ox). /Ot is the default; /O2 does some more afaik. (No in-depth knowledge, use at your own risk. Please correct me if I'm wrong.)

P.S.: AFAIK in GCC constant rewriting is auto-enabled when using fast unsafe math (-ffast-math), and optimization levels do not affect constants; -ffast-math takes precedence?
@cH3rtzb3rg · 1 year ago
Dividing by powers of two can and will actually be replaced by multiplying with the inverse by the compiler even without fast-math flags, since the result will be exactly the same. One could replace the divisions by a with multiplications by (1/a) -- this is a thing the compiler won't do in exact-math mode. In fast-math mode, it may even use the fast rcpps (reciprocal packed single) instruction with a refinement step to do this faster.
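A minimal sketch of the distinction drawn above (helper names are made up for illustration): division by a power of two is exact in binary floating point, so the compiler may substitute a multiply unconditionally, while division by an arbitrary value only becomes a multiply under fast-math because the reciprocal itself rounds.

```c
/* Dividing by a power of two is exact, so b / 2.0f and b * 0.5f are
   bit-identical; the compiler can strength-reduce without fast-math. */
float half_div(float b) { return b / 2.0f; }
float half_mul(float b) { return b * 0.5f; }

/* Dividing by an arbitrary a is NOT the same as multiplying by (1/a),
   because 1/a rounds first; this rewrite requires fast-math flags. */
float div_general(float b, float a) { return b / a; }
float mul_recip(float b, float a)   { return b * (1.0f / a); }
```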
@JamesEdwards780 · 1 year ago
bit shift >> may be faster still
@vytah · 1 year ago
@@JamesEdwards780 You can't just bitshift floats.
@PremierSullivan · 1 year ago
@@vytah Not with that attitude!
@pyromen321 · 1 year ago
I'm guessing the new version might actually use less power! They should be practically the same speed due to ILP, but the new version requires fewer calculations.
@-mwolf · 1 year ago
What's ILP?
@leocomerford · 1 year ago
@@-mwolf Instruction-level parallelism en.wikipedia.org/wiki/Instruction-level_parallelism .
@MenkoDany · 1 year ago
Thank u for the wisdom, dolan
@ABaumstumpf · 1 year ago
But it simply does not take less computational power.
@kingacrisius · 9 months ago
​@@ABaumstumpf I believe they mean literal power, like electricity. It might be less expensive in terms of electricity.
@bronzecomeshome9517 · 1 year ago
My man really putting the work in on the Blender effects 😂 but damn I was so shocked to see deleting a useless branch cut the processing time in half
@protonmaster76 · 1 year ago
He's done an interesting video on branchless coding. In modern CPUs, a branch can be very expensive.
@bronzecomeshome9517 · 1 year ago
@@protonmaster76 I've seen it and enjoyed it, I just assumed that as a false branch it would have been optimized out by the compiler.
@chainingsolid · 1 year ago
The combo of a branch + floating point may have kept the compiler from realizing it was redundant. And a branch that probably didn't predict very well (it's based solely on incoming data) may have done major performance damage. It would be cool to see the pattern of which way the branch goes. I wonder if "sorting" the input to make the branch predictable would let the CPU's branch predictor skip the divide for some of the solve attempts and thus maybe go even faster in those cases. @Creel, looking at the branch predictability could be an interesting video.
@pyromen321 · 1 year ago
26:24 It's worth noting that because you're dividing by a constant power of two for the third div in the new version, all modern compilers will optimize that to a single mul instruction. I'm almost certain that -x/y will always become an xor and mul when y is a power of two.

Edit: I don't want to litter the comment section too much, so I'll just put this here: awesome video, man! Some of my work involves optimization, and I absolutely love when other people make this knowledge super accessible!
@WhatsACreel · 1 year ago
I think you're right about the division being swapped for a mul! I can't remember what I did in the ASM, but hopefully a compiler would figure that one out... Cheers for watching mate, have a good one :)
@PhilippeVerdy · 1 year ago
An actual implementation would distinguish 4 main cases, depending on whether 'a' is zero and whether 'b' is zero:
- 1. When (a=b=0), if (c=0) there's a single root (r1=r2=0), otherwise there's no solution (r1=r2=NaN).
- 2. Otherwise, when only (a=0), there's a single solution (r1=r2=-c/b).
- 3. Otherwise, when only (b=0), there are two solutions of opposite signs (r2=-(r1=√(c/a))).
- 4. Otherwise, we have the two quadratic roots, possibly complex depending on the sign of the discriminant (b²-4ac).

For the 4th case, you need **numeric stability** (due to the limited precision of the "float", "double" or "long double" types used by the FPU), and the two quadratic solutions should be given by (r{1,2}=[√(1-4ac/b²)±1]*b/2a), where the square root may be imaginary if the "normalized" discriminant (1-4ac/b²) is negative, i.e. when the usual discriminant (b²-4ac) is negative (both conditions are equivalent because b≠0 in that 4th case).

This gives a total of 7 cases, where conditionals (if statements) are justified. But the test on a=1 in this video is just useless: it tries to avoid a trivial division by 1, but at the cost of a branch, which disables optimization or parallelization by the compiler (divisions by 1 are trivial for FPUs too, may cost just 1 cycle, LESS than the cost of a branch).
@PhilippeVerdy · 1 year ago
(-x/y) cannot be "optimized" with a xor and mul: we're working here with floating-point values, not integers! Negating a floating-point value just flips a single bit, the sign bit; it does not perform any "xor" on the whole value (the exponent and mantissa parts are not modified), and this can be done before the division on either operand x or y, or on the result. In fact the result sign is sign(x) XOR sign(y), and the absolute quotient is the separate division abs(x)/abs(y).

To compute that (positive) quotient of the two positive floating-point values:
- the FPU separately computes the quotient of the two mantissas, which are in (0.5, 1.0], so that the non-normalized quotient is in (0.5, 2], and the difference of the exponent parts;
- finally it renormalizes the quotient into (0.5, 1.0] and accordingly adjusts the difference of exponent parts by conditionally subtracting 1.

The final result may become a denormal (or zero) value, or an "infinite" value, depending on whether the final exponent part underflows or overflows its allowed range, but it will be a "NaN" (or would raise a "division-by-zero" exception) if both mantissa parts (including the implicit most significant bit if it's not stored in the floating-point format) are zero. In all cases, the separately computed result sign is blindly applied as-is (possibly giving "positive or negative zero", "positive or negative NaN", or "positive or negative infinite" values; some libraries may or may not discard the sign distinction for zeroes and NaNs, but at some extra cost using conditional branches).
@pyromen321 · 1 year ago
@@PhilippeVerdy, you probably should have checked godbolt before that whole explanation. I just confirmed that on some compilers it becomes a mul and xor, on others it just becomes a mul by -0.5. Remember, in this case we are dividing by a positive constant power of two and negating. This means the compiler can take shortcuts.
@PhilippeVerdy · 1 year ago
@@pyromen321 If you are dividing a float by a power of two, you don't need a mul, just a decrement of the exponent part and a flip of the sign bit. That takes two instructions, worse than a single FPU instruction performing a multiplication by a constant like -0.5 (consider the data path on the same word containing the sign bit and the exponent; plus you need to test for underflow when decrementing the exponent, or you'll get wrong results!).
@larswilms8275 · 1 year ago
Just so everybody knows: the OG quadratic formula is the closed form of the Po-Shen Loh algorithm. The PSL algorithm is simpler for a mere human to understand and perform, but I would not expect optimized code for the two variants to differ all that much.
@sabriath · 1 year ago
CPU latencies on modern computers are roughly:
fadd 4 (as well as fsub)
fmul 4
fdiv 11-25
fsqrt 35
At the very least, this puts the original at between 97 and 125 cycles of latency, while the new formula takes between 88 and 130. This means with naive programming it is both slower and faster, depending on division requirements.

The issue I have with your assessment at the end is that you say the new formula takes 3 divisions; it actually takes 2 to normalize the equation off the first coefficient. You are dividing by 2.0f, which wastes 11-25 cycles of latency where you could simply multiply by 0.5f, which is 4. Even worse, floats are in powers of 2, so dividing by 2 is just decrementing the exponent, which is 1 cycle for an integer decrement prior to loading the float into a register (even before any normalizing division in the FPU). Some compilers would notice the 2.0f division and possibly switch to an fmul by 0.5f for speed, but they wouldn't figure out that you could do a dec prior to any of it, which means that you didn't actually test the right code.

If done correctly, the new method should use 1xDEC (1), 2xFDIV (22-50), 1xFMUL (4), 1xFSQRT (35), and 3xFADD/FSUB (12), which comes to 74-102, faster than the original by 21%. Also, complex numbers are automatically formatted correctly with the new method; the old method has a division after the sqrt, which takes longer to simplify (squaring it, dividing the sqrt answer by it to get the complex part formatted).

You can make the code a little faster by taking the reciprocal of the leading coefficient (1xFDIV) and then multiplying it by the other coefficients, but both methods can use that trick... despite that, the new method becomes almost 23% faster. The real bottleneck comes down to the square root function; it is the most expensive part in both scenarios. If we could get that down to 15 cycles of latency, the new method would be 28% faster than the old.
@FrostMonolith · 1 year ago
I suspect you haven't watched Po-Shen Loh's video in full, because he explicitly mentions this is NOT his own equation. It's his way of interpreting the logic of solving quadratic equations. He even mentions who originally had this idea.
@alexwolski3344 · 1 year ago
What a great idea! Gamifying the benchmarks makes it so much fun. Love the animations and benchmarking song. Really interesting stuff.
@InvaderMik · 1 year ago
As a tip: at 3:45 the contrast with the blue text is so low that I can’t read it. So I’m following along with your commentary by imagining the route in my head rather than reading along with you! EDIT: ok so this happens a lot in this video.
@WolfgangBrehm · 1 year ago
4:31 Skip the if. Just always divide. If the coefficients are random, the probability of a being equal to 1 is very slim. Also, it might be worthwhile to multiply by the precomputed inverse of a.
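The suggestion above, sketched as a hypothetical helper: always normalize, with no branch on a == 1 (dividing by one is exact, so results are unchanged). The precomputed-inverse variant is also shown; note that multiplying by 1/a rounds differently than dividing by a, as other comments point out.

```c
/* Branchless normalization: always divide, even when a == 1.0f. */
void normalize_div(float *b, float *c, float a) {
    *b /= a;
    *c /= a;
}

/* Variant with a precomputed inverse: one division, two multiplies.
   Slightly different rounding in general, exact for powers of two. */
void normalize_recip(float *b, float *c, float a) {
    float inv = 1.0f / a;
    *b *= inv;
    *c *= inv;
}
```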
@rrshier · 11 months ago
I was just thinking what you put down here. The if statement is probably the biggest contributor of time here! And usually vectorization wants pure math, nothing with conditionals.
@matthewkolar7560 · 1 year ago
CPUs and compilers can vectorize Array-of-Structures as well; the difference between this and SoA is more about optimizing cache access patterns for functions that don't use every element in the structure. You should double-check your compiler optimization settings if you are getting wildly differing results from a structure this small.
@marknn3 · 1 year ago
Instead of:
b[i] /= a[i];
c[i] /= a[i];
do:
float a_inverse = 1.0f / a[i];
b[i] *= a_inverse;
c[i] *= a_inverse;
This removes one division, replacing it with a much faster multiply.
@radeklew1 · 1 year ago
That replaces a division with TWO multiplications, which... might still be faster? You'd have to benchmark, and it may depend on the CPU.
@teambellavsteamalice · 1 year ago
I like the *= steps. Someone else remarked that instead of b /= 2 you can do b *= 0.5. Will using r2 instead of u be faster still? So:
r2 = sqrt(mid*mid - c)
r1 = mid - r2
r2 += mid
@thomasbrotherton4556 · 1 year ago
And even faster to not be doing I/O back to the arrays.
@ABaumstumpf · 1 year ago
And that change will give slightly different results
@Aras14 · 1 year ago
After a little bit of tinkering, I got the Po-Shen Loh method down to:
2 float subtractions
2 transfers (a := b)
1 integer subtraction (from a float, to reduce the exponent by 1, aka dividing by 2)
1 XOR (to multiply by -1)
1 float addition
1 square root
It also only uses 3 registers. It goes as follows (from x² + Ax + B; A, B, C are floats; p is the number of significand bits of the float type, l is the total size, for example 32 or 64):
1. Subtract 2^p from A
2. XOR A with 2^l
3. Set C to A
4. Multiply C by A
5. Subtract B from C
6. Take the square root of C
7. Set B to C
8. Add A to B
9. Subtract C from A
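The exponent-decrement and sign-flip tricks above can be sketched with bit casts like this. This is a hypothetical illustration assuming IEEE-754 binary32; subnormals, zeros, infinities and NaNs are deliberately not handled.

```c
#include <stdint.h>
#include <string.h>

/* Divide by 2 by decrementing the exponent field (bits 23..30).
   Valid only for normal, nonzero finite inputs. */
float half_by_exponent(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);   /* bit-cast without aliasing UB */
    u -= 1u << 23;              /* exponent -= 1  =>  x / 2 */
    memcpy(&x, &u, sizeof u);
    return x;
}

/* Multiply by -1 by XOR-ing the sign bit (bit 31). */
float negate_by_bits(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    u ^= 1u << 31;
    memcpy(&x, &u, sizeof u);
    return x;
}
```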
@procedupixel213 · 1 year ago
If the least significant bit of precision can be sacrificed in the new method, the two divisions by a can be replaced by one reciprocal (i.e. 1.0/a) and two multiplications with that reciprocal. Thus one division can be traded for two multiplications, which may be faster.
@SmallSpoonBrigade · 1 year ago
Honestly, that's the real question here: how much accuracy are you allowed to sacrifice in order to speed things up? I'm taking accounting courses right now and there are a bunch of shortcuts taken for similar reasons. The result is somewhat different from what it would be with proper math, but it's usually off by a small amount, and you're rarely dealing with more than 4 decimal places when dealing with currency anyway. Plus, since people are using one of two sets of rules regarding how these things are done, there usually isn't much trouble.
@attilatorok5767 · 1 year ago
Accuracy is also why std::complex is usually slow. The C++ standard library has to work in all cases, and for complex numbers not sacrificing precision means a lot slower code. You also can't use a fast-math-optimised version because the standard library is precompiled (although you can work around this by compiling a glibc with fast-math and using static linking, it would probably still perform worse than hand-written speed-for-accuracy-optimised code). I would personally not recommend std::complex in any case unless you are sure you need super-duper accurate answers, but then you would probably go with an arbitrary-precision library which has complex numbers.
@teambellavsteamalice · 1 year ago
Isn't the step B = b/a and then mid = B/2 the real waste here? You only need mid. Also, you can use a trick from the quadratic formula, where r2 holds the discriminant term: r1 = (-b + r2)/2a and r2 = (-b - r2)/2a. For Po-Shen Loh it works even better:
mid = b/2a
c = c/a
r2 = sqrt(mid*mid - c)
r1 = mid - r2
r2 = mid + r2
The last step I imagine can be faster, as it's adding mid to an existing value (r2 += mid?). This should already always beat the quadratic formula. Would using reciprocals be even faster?
@derfogel3491 · 1 year ago
The Po-Shen Loh method is not new. I learned that method over ten years ago in school.
@omegaroguelp · 1 year ago
yea like same
@Kebabrulle4869 · 1 year ago
Same. It’s called the pq formula in Swedish schools. It’s basically just dividing the polynomial by a and deriving the quadratic formula again.
@furyzenblade3558 · 1 year ago
Your videos are always super interesting! Thank you
@GiveAcademy · 1 year ago
Love all the content!!! Reminds me of my random coding adventures! And very inspiring!!! Very happy that you are active again! I am struggling to get myself back up and going... would love to see a continuation of your GraceAI neural network series!! ❤
@algorithminc.8850 · 1 year ago
Great channel ... really enjoying it ... thank you. Cheers
@josefaguilar2955 · 1 year ago
I absolutely loved this, thanks for making such great content!
@HDX30 · 1 year ago
Super interesting video! At the end, you mention how multiplications versus divisions may affect running time, and list Po-Shen Loh as having 3 divisions. However, one of those divisions is a simple division by 2, which is effectively a multiplication by 0.5. With that, Po-Shen Loh would have the same additions, 1 fewer subtraction, 3 fewer multiplications, the same divisions, and the same square root, making it strictly better in terms of operation counts! I'd be curious to hear what you think.
@WhatsACreel · 1 year ago
I found it very hard to count the operations. The negatives can become subtractions, the divisions can be inverted (as you mentioned), and the FMADD instructions can perform an add/sub along with a multiplication. My operation counts were a little hit and miss! I think you're right :)
@astrobullivant5908 · 1 year ago
When I first saw Po-Shen Loh's method, I thought it looked like he was solving by substitution, just as Tartaglia and Cardano first solved cubics before generalizing the formula. Po-Shen Loh's way is cool and nifty even though it's slower than the regular quadratic formula. Plus, there are situations where we'd get solutions in human-friendly form faster using Po-Shen Loh's method than we would with the regular quadratic formula.
@duckymomo7935 · 1 year ago
It's also just the quadratic formula with the fluff removed.
@corscheid · 1 year ago
This video was absolutely amazing. Beautifully done. WOW.
@DOROnoDORO · 1 year ago
Love the format, very nice animation, and even nicer maths!
@shanehebert396 · 1 year ago
For the CPU/FPU code, if the bottleneck is reading from RAM, you might try putting in some prefetch loads of the next set of data (particularly for your ASM code). It may not make much difference, though, given the loads are a cache line at a time.

I remember way back in the day we were looking at the Fastest Fourier Transform in the West (FFTW) on the UltraSparc and telling the C compiler to heavily optimize it (including using static profile-guided optimization), and the ASM it generated was pretty slick, including prefetch loads so the data for the next loop iteration was in L1 when it was needed. If the compiler you used didn't use PGO but has the option, it might be interesting to see if it helps.

Also, the dark blue text used for the Po-Shen Loh code is pretty hard (for me at least) to read. The red was OK, even if a little dark.
@booom4849 · 1 year ago
Manual prefetching will be slower because there are now additional instructions to be executed. He is just iterating over arrays, the hardware prefetcher likes that pattern.
@FaZekiller-qe3uf · 1 year ago
What a great way to pad the runtime
@CjqNslXUcM · 1 year ago
awesome video, well done!
@mikefochtman7164 · 11 months ago
I worked on old 32-bit mainframes and we used to pore over our code for hours to hand-optimize the FORTRAN, looking for division operations that could be replaced with multiplication by an inverse. And the bane of our work was pesky 'implicit type conversions' where someone would do something like * instead of *. Yeah, those old compilers weren't very good at optimization, so we had to do it ourselves. Love your channel and loved this dive into an interesting bit of math.
@Andrew90046zero · 1 year ago
Really interesting. I was really hoping Po-Shen Loh would be faster across the board :P But the results and the overall lesson of the video were still very interesting. It would be cool to see a video where you only talk about the op-times table you showed at 26:36, maybe showing other examples of algorithms and how they stack up in terms of op times. On top of that, I also found myself wondering about "compiler vectorization". I don't know if you have any videos on that subject, but I am interested in learning more about it, especially the parts where you talked about "Structure of Arrays" and how the compiler can more easily take advantage of SIMD instructions.
@JamesJones-zt2yx · 1 year ago
Would really like to see a discussion of whether Po-Shen Loh's method is preferable to the one we all learned from a numerical analysis POV. Not sure error analysis leads to quite as spiffy graphics, though. :)
@AMan-xz7tx · 1 year ago
Love the video, but some of the text is a little hard to read. Maybe add a centered drop-shadow text effect to make it pop a bit, and lighten the text colours a small amount too?
@sotrh7974 · 1 year ago
Yeah the blue on dark gray in particular is really hard to see
@falklumo · 1 year ago
Po-Shen Loh did NOT derive a different quadratic formula; his is the same! He just derives it a little differently. The way he writes the formula just reshuffles common factors.
@93hothead · 1 year ago
Nobody said anything about deriving
@falknfurter · 1 year ago
When doing computations, not only is performance important but also the accuracy of the results. Quadratic equations have two cases which need to be treated carefully from a numerical point of view, or you might suffer a loss of accuracy: (1) where -b ~ sqrt(b^2 - 4ac), and (2) where b^2 ~ 4ac. In both cases you basically subtract two almost equal numbers. In combination with the finite precision of floats, the significant digits of the result are reduced. Both algorithms (as shown here) suffer from these problems. Obviously, for numerically stable code these cases should be handled appropriately.
@MatthewJacksonTheMuuj · 1 year ago
I do remember this coming up in Numerical Analysis class. Avoid subtracting two numbers if the result might be close to zero! College may have been a long time ago for me but that one guideline has stuck with me.
10 months ago
Totally underrated comment. This is overlooked so often.
@turun_ambartanen · 11 months ago
I was excited for a new formula, but it's just the regular one we are taught in school here in Germany :( It's called "pq-Formel", with p=b/a and q=c/a
@josephvictory9536 · 1 year ago
An idea for future videos: can you put the standard deviation with the speed result if it's less than 2? I found myself really wanting to see it for some of those closer matches. It's only us nerds watching, so there's no shame! Awesome video btw!
@prof.salomonibarra6402 · 1 year ago
Excellent work! You deserve the gold medal in the computer olympics. For real numbers the OG performs very well. I'm going to try this on ARMv8 and compare with similar Intel. Cheers
@syntaxerorr · 1 year ago
I love your videos. I don't want to critique your style, and this is the first I've seen like this, but if you sped up the animations (mainly the countdowns, by about 5 seconds) I think it would flow better. I know animation is hard and sometimes a trial-and-error process.
@lordadamson · 1 year ago
man I love your videos haha
@stevegremory3849 · 1 year ago
bloody awesome! never knew this existed, extremely cool, hell it'll be useful for school work too ;)
@WhatsACreel · 1 year ago
I reckon! Such a clever formula! Cheers for watching :)
@stevegremory3849 · 1 year ago
@@WhatsACreel Cheers :D
@nyaromeneko4455 · 1 year ago
For completeness, I would suggest doing the benchmark calculations in both float and double. I would have liked to see complex. 🙃
@Mernom · 1 year ago
That branch really muddied the water. I think the first test should've been redone with the optimized code.
@protonmaster76 · 1 year ago
It's interesting that the std::complex is so much slower than the scalar. It would have been interesting to see a complex implementation in C that didn't use the std::complex methods.
@zactron1997 · 1 year ago
It does make sense that for complex numbers you're looking at much more compute. In the complex domain you have to store 2 values per number for the same level of precision. You'd think that means you only have double the work, but you actually end up with substantially more in certain operations. Consider addition and subtraction: that maps quite nicely to ReC = ReA + ReB, and ImC = ImA + ImB. So double the operations. Now consider multiplication: ReC = ReA × ReB - ImA × ImB, and ImC = ReA × ImB + ImA × ReB. Now you have 4 multiplications instead of 1, and two additions. Division and square roots are even worse. Now granted, you could speed this up by working in polar form, but then you need trig functions to handle addition and subtraction.
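The operation counts described above, written out as a sketch with a hand-rolled struct (not std::complex): a complex multiply costs 4 real multiplies and 2 adds/subs, versus a single multiply for real numbers.

```c
typedef struct { float re, im; } cfloat;

/* Complex addition: 2 real adds (double the real-number cost). */
cfloat cadd(cfloat a, cfloat b) {
    cfloat r = { a.re + b.re, a.im + b.im };
    return r;
}

/* Complex multiplication: 4 real multiplies, 1 sub, 1 add. */
cfloat cmul(cfloat a, cfloat b) {
    cfloat r;
    r.re = a.re * b.re - a.im * b.im;
    r.im = a.re * b.im + a.im * b.re;
    return r;
}
```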
@Mernom · 1 year ago
@@zactron1997 How fast is conversion between polar and rectangular forms?
@protonmaster76 · 1 year ago
@@zactron1997 It is true that when operating in the complex domain, some maths operations require a lot more work than their real counterparts. My point is that the std::complex implementation took 10,000 ms to run as opposed to a few hundred ms, but his real vs complex ASM implementations did not have such a wide time difference between the two.
@protonmaster76 · 1 year ago
@@Mernom Cartesian/polar conversion requires about the same amount of trig as you're trying to avoid in the add and sub.
@monad_tcp · 1 year ago
std::complex is pretty terrible
@christopherhume1631 · 1 year ago
The two algorithms are practically indistinguishable. One can regard the principal difference between them as an algebraically motivated optimization. The main value of this investigation is to underscore the fact that divisions are more expensive than multiplications. This difference would be substantially amplified by the use of double-precision math. I haven't looked closely, but it may be possible to reduce the number of divisions in the Po-Shen Loh approach. Regardless of approach, trading division for multiplication should improve performance, especially for double-precision math. Note, however, that I am not referring to constant division by powers of two; though I am not certain, it is possible those will be converted to arithmetic shifts.
@PhilippeVerdy · 1 year ago
That's no longer true with modern FPUs and GPUs, where divisions are now actually a bit FASTER than multiplications! But this is actually visible only with GPUs, not with existing x64 CPUs, where both perform the same in one cycle... except when using "vectorized" expressions: AVX2 processor extensions (or newer), if used by your C/C++ compiler or its libraries, may show that divisions CAN perform faster than multiplications! You may also get different figures depending on CPU brands and models (how their internal FPU units are built). Do not assume things from the time when divisions still used "naive" algorithms (with bit-by-bit compares, shifts, and subtraction), when we had no FPU to compute them in a single hardware instruction. As well, there are optimizations in modern FPUs to efficiently compute square roots. Modern FPUs can use "FFT-based" algorithms implemented in hardware (using many more silicon gates, but structured in a much larger "tree" of parallel branches with lower depth, so with faster total propagation time): this is true for multiplication, division, square root, and the other trigonometric/exponential/logarithmic functions natively supported by FPU or GPU instructions.
@PhilippeVerdy · 1 year ago
As well, there are new types of processing units, notably for AI and signal processing, using optimization of tensor flows. They are starting to appear in CPUs for smartphones, will soon be available on desktop PCs, and are already used in large computers performing massive computation (think about what CERN needs for the LHC, what Google needs for its AI with big data, or what is needed in astronomy, fluid mechanics, or computer-aided design, notably in the aerospace and defense sectors). These processing units have new specialized instructions for efficiently computing "convolution products" and FFTs in hardware. Here also, they can accelerate basic divisions faster than basic multiplications (and the difference is even more visible with massively parallel computing, with very large vectors or matrices rather than individual scalars).
@ProjectPhysX
@ProjectPhysX Год назад
For the red code at 3:30, wouldn't it be faster to compute the sqrt(...) part only once in a variable and then insert the variable twice for the +/- solutions? Or does the compiler do that automatically? Edit: Nevermind, just saw you test this later in the video and the compiler does indeed optimize it :) I personally never trust the compiler :D
@Wyoming307
@Wyoming307 Год назад
is this actually a legal optimisation for the compiler though? if r1 points to the same address as a, b or c then the first computation would change the parameters for the second. So optimising this would lead to the wrong result. I guess if the function is only called once and it knows all the arguments at compile time, then it knows this can't happen, but I'm still surprised it would do this.
@mt0software
@mt0software Год назад
I wish this was the kind of game show they'd show on TV. great video
@michaelwilkes0
@michaelwilkes0 8 месяцев назад
just getting close to such a tiny, well-established formula is amazing. epic.
@plokki456
@plokki456 Год назад
interesting video! Though I find it hard to believe that the whole gpu is only twice as fast as a single cpu core for such a massively parallel task. Is there some cuda optimization flag that you did not turn on? Also the gpu may not have had the time to increase its frequency to the max if the task was too short. Did you compute the effective bandwidth vs theoretical bandwidth as you did for the cpu test? I'm curious about that as well.
@aliengenie8896
@aliengenie8896 Год назад
When you're doing stuff with GPGPU, if you simply time from when you "submit" the work to the GPU to when it finishes, you end up also measuring the overhead that is involved with deploying kernels and eventually having all of them hit the synchronization point at the end. There's a good chance that the amount of work that was submitted was not enough to keep the GPU busy for long, and so the vast majority of the time is spent on that overhead. Depending on how things are done, you may end up also measuring the transfer of the inputs/outputs to/from the GPU, which would make the results look even more similar. The "dumb" way to measure how much overhead there is would be to give it a ridiculously small amount of work, like only one calculation, and see if it ends up being ~75ms again. The better way is that Nvidia supplies tools for measuring how long each part of the process takes. (Launching the kernel vs memory transfers vs the actual work).
@kasuha
@kasuha Год назад
So, you take an (A ± B)/C expression, separate it into A/C ± B/C, and call it a new revolutionary formula? Seriously? The whole comparison is about how smart the optimizer in your compiler is and how efficiently or inefficiently you can put the two into source code. But at the end of the day, it's exactly the same formula. Programmers go to far greater lengths in deviating from formulae optimized for viewing to formulae optimized for execution when it's really important.
@Daniel_VolumeDown
@Daniel_VolumeDown Год назад
@@jeremiahbullfrog9288 idk because I didn't watch it, but the formula was proposed by someone else, whose video is in the description
@spencergouw3316
@spencergouw3316 Год назад
The formula is meant for pedagogy, to teach a more intuitive approach to the quadratic equation to students in the original video. Your complaint is pretty silly considering no one claimed it was "revolutionary", only that it was "easier on pen and paper", which it is. Po Shen Loh even derives the quadratic equation from his reformulation in his video.
@effektor2000
@effektor2000 Год назад
I would try to calculate 1 / (2.0f * a[i]) and then do multiplications instead of the two divisions. Also, I think aliasing rules prevent further compiler optimizations; making the pointers restrict should help in this case, in particular in Quad2 as it does more memory accesses.
@andersgoude
@andersgoude Год назад
I am a bit curious about the aliasing rules here. If the compiler was reusing the value for the square root in the first case (at around 7:00), wouldn't this be against the aliasing rules, as if r1 would be the same as a, b or c, it would have changed the values for the square root for the second calculation. So the compiler may already assume that there is no aliased pointers
@haoming3430
@haoming3430 Год назад
Where can I download the assembly version of the code? Many thanks!
@heaslyben
@heaslyben Год назад
Thrilling stuff 😃
@kestans
@kestans Год назад
brain neglects reading darkblue code :D
@asailijhijr
@asailijhijr Год назад
Is one of those divisions by four? Can you optimise it with a bit shift? Or is that not how floats and std::complex numbers work?
@stevenlynch3456
@stevenlynch3456 Год назад
The way floats are structured means that bit shifts don't work for division or multiplication by powers of 2. Floats are laid out like this:
SignBit | BiasedExponent | FractionalPart
SignBit is 0 for a positive float, 1 for a negative float (1 bit).
BiasedExponent is exponentYouWantToStore + bias = exponentYouWantToStore + 127 (8 bits).
FractionalPart is like the .345 in the number 12.345 (23 bits).
Why these kinda random numbers? It's standardized, and together they use exactly 32 bits, which is a nice power of 2.
Example float in memory: 1 10001000 11000000000000000000000
SignBit = 1, so the float is negative.
10001000 = 136. However, to get the exponent we intended, subtract the bias: 136 - 127 = 9.
11000... = (1/2) + (1/4) = .75
So the example float in memory is -1.75 * 2^9 = -1.75 * 512 = -896.0
1 10001000 11000000000000000000000 = -896.0
Why 1.75 instead of .75? The leading 1 is always present in binary scientific notation, and the value 1 is the same no matter which base you're in (except Base 0 and Base 1, which are extraordinarily difficult/annoying to deal with).
Now imagine shifting it to the right because you think dividing a float by 2 is the same as dividing an integer by 2. What you're doing is moving the sign bit into the exponent, dividing the EXPONENT by 2, and then (your original intention) dividing the FractionalPart by 2.
1 10001000 11000000000000000000000 shifted to the right by 1 is 0 11000100 01100000000000000000000
Putting this into www.h-schmidt.net/FloatConverter/IEEE754.html shows that it's equal to 8.11656739243e+20. -896 is pretty different when compared to +8 with 20 digits after it.
HOWEVER, if you could find a way to decrease the exponent by 1 (i.e. subtract 1 from the exponent field) WITHOUT modifying any other part of the float, you would successfully divide by 2.
@Chris-op7yt
@Chris-op7yt 11 месяцев назад
that's the form for finding x-axis intercepts. if you're just computing the y value for any particular a, b, c and x, then just plug it into y = a*x*x + b*x + c. no square roots and no division.
@XfStef
@XfStef Год назад
Glad round 2 took a Data Oriented approach!
@KerbalLauncher
@KerbalLauncher Год назад
What were the compiler settings?
@zaydabbas1609
@zaydabbas1609 Год назад
Just to be clear here, these are algebraically identical. There was no 'new math' discovered here as some people claim. This is merely an optimization for computation
@guylaingreer4190
@guylaingreer4190 Год назад
I didn't catch if you mentioned it, but when you switched to std::complex were the coefficients complex or were they still real?
@selvapk5567
@selvapk5567 Год назад
Can anyone explain these instructions, which I found in a shellcode document for getting a bind shell?
function hashes (executable as nop-equivalent)
_emit 0x59 ; LoadLibraryA ; pop ecx
_emit 0x81 ; CreateProcessA ; or ecx, 0x203062d3
_emit 0xc9 ; ExitProcess
_emit 0xd3 ; WSAStartup
_emit 0x62 ; WSASocketA
_emit 0x30 ; bind
_emit 0x20 ; listen
_emit 0x41 ; accept ; inc ecx
@Knirin
@Knirin Год назад
What is the standard deviation of the tests?
@CanKarakuzulu
@CanKarakuzulu Год назад
Just a thought, but using branched code in CUDA kernels severely slows things down. Maybe you could try a branchless version of the Po-Shen Loh CUDA code?
@Zero-4793
@Zero-4793 Год назад
with Po-Shen Loh's, you don't need to calculate with complex numbers. you just put an i out the front: if u^2 < 0, take u = sqrt(|u^2|) and ans = mid ± ui
@Zooiest
@Zooiest 11 месяцев назад
How does the tie rule work? Is it something like abs(resultsA[i] - resultsB[i]) < stddev(resultsA)? I'm not really a statistician, but I like learning these things
@antovfukov9005
@antovfukov9005 Год назад
Very well done video; the color coding needs fixing to be perfect. I demand a part-two short video switching the divide-by-2 to a multiply-by-0.5. :) Love the other optimization comments, and it would be great to compare them. Fun game show graphics.
@ExaltedDuck
@ExaltedDuck 11 месяцев назад
fascinating. I haven't watched the other video yet to better understand this other method, but from the 1-sentence description early on in your video, it doesn't sound like they're doing different things. The description of Po-Shen Loh's method as finding the middle and then adding and subtracting the difference to find the roots... that's exactly what the quadratic formula does. -b/2a is the middle, and then the difference is sqrt(b^2-4ac)/2a. You might be able to speed up the quadratic code by a tiny amount by calculating the -b/2a as well as the discriminant. Although in the scalar version, the compiler probably already has. I'm really curious to go see the more explanatory video soon, though. Maybe there's something I'm missing...
@mikeblaszczak5346
@mikeblaszczak5346 Год назад
Dark blue on a grey background? Good going.
@MichaelJamesActually
@MichaelJamesActually Год назад
Could you use something like the inverse root hack from quake 3? I'm sure there's other approximations of square roots that could also improve that calc.
@MichaelJamesActually
@MichaelJamesActually Год назад
Scratch that. Both methods have the same number of square root operations
@dananskidolf
@dananskidolf Год назад
I think processor architecture plays a big role in this. The operation times depend on FPU implementation - for example Intel improved division by a factor of two and sqrt by 4 in the Penryn architecture if I remember correctly. More cache or memory bandwidth will target the key bottleneck here - that's probably the main gain from using a GPU. Branch prediction is powerful, but difficult and I think still evolving, so this affects different CPUs differently, as well as depending on the data fed in, and the order it's in. (The branch was only there to demonstrate bad code but I thought I'd include it.)
@TECHN01200
@TECHN01200 2 месяца назад
Once you are doing GPGPU, your limit is becoming the speed of copying to and from the card over the PCI bus rather than the compute.
@devinschlegel1763
@devinschlegel1763 Год назад
are there any plans to continue the neural net series?
@jerryplayz101
@jerryplayz101 Год назад
15:09 - my guess is that the compiler is using branch prediction, meaning that the program is already precomputing "a" and prepares for the branch as part of low-level optimisation
@Misteribel
@Misteribel Год назад
I might have missed it in the video, but did you consider rsqrtss or vrsqrtss? You’d need to change the algorithm, but it may improve speed of your slowest instruction: square root. Maybe you used it in the simd variant asm version.
@badhombre4942
@badhombre4942 10 месяцев назад
√(b²-4ac)/(2a) == √((b/2a * b/2a) - c/a) == u. Therefore both formulas are the same. If a is > 1 in the majority of your dataset, then the if block adds 2 division operations. Your red code can be optimized to make it even faster.
@thomasolson7447
@thomasolson7447 Год назад
Suppose every program is just a number. I'm wondering how that number would be calculated. Is it left to right, or right to left? Then there is the stack and heap. Is there a stage before it enters memory where it is one complete number? Suppose we added 1 to it, where would it be? Is there a halt at the end? Does the increment have to take place before the halt?
@PhilippeVerdy
@PhilippeVerdy Год назад
In summary, we cannot decide which approach ("original" from maths on real or complex numbers with infinite precision, or "PSL"-optimized) is faster or slower, as it largely depends on how you program your code, the programming language you use (notably how it rounds results per its semantics), the compiler you use (and its compilation options, notably about rounding of intermediate results, and how it may cache multiple reads from external structures/variables into local registers), the library you use, and the hardware you use: finally you can only decide after running a local benchmark. NEVER forget the language semantics (effects of "implicit" roundings, and of "if" branches) when creating ANY implementation in that language.
@graealex
@graealex Год назад
The font used in the animation always reminds me of the animations from surreal entertainment. Just waiting for "Sheldon is kill" to appear.
@armandosreis
@armandosreis Год назад
I don't get how you can get rid of the if branch.... It says if a is not 1 then you adjust b and c. so for all values of a that are not 1 you need that adjustment. How can you just remove it? That's changing the formula. Am I missing something here?
@ABaumstumpf
@ABaumstumpf Год назад
The branch said to skip the division when it would divide by 1, which is completely irrelevant because dividing by 1 does not change the result — but having a branch can seriously impact performance
@barakap2372
@barakap2372 11 месяцев назад
Hey friend. Why didnt you upload for such a long time? You were a reallllllllly good introduction for me for assembly C++ and in general compsci. Please reply
@boelwerkr
@boelwerkr Год назад
That "if" breaks a lot of processor optimizations, and I don't think it's needed. If "quads[i].a" is 1.0f, the statements inside the if are b /= 1.0f; c /= 1.0f; which don't change the values. Removing it would speed up the code a lot. With that it should have a fighting chance. The slowest part of both versions is the "sqrt"; it is a magnitude slower than any other operation in that code, so using it as little as possible is always preferable. So this code:
void Po_Shen_Loh_Simple_AoS(QuadStruct* quads, int count) {
    for (int i = 0; i < count; i++) {
        float mid = (quads[i].b / quads[i].a) * -0.5f;
        float u = sqrt(mid * mid - quads[i].c / quads[i].a);
        quads[i].r1 = mid + u;
        quads[i].r2 = mid - u;
    }
}
should be faster, if I haven't made a stupid logical error. 🙂
@phizc
@phizc 8 месяцев назад
26:12 One of the divisions in Po-Shen Loh is x/2. It can be turned into a multiplication x*0.5. I'm kinda surprised if the compiler doesn't do that. So PSL vs "old" should have 1 less sub, 3 less mul and all the others the same. I.e. 7 clocks less than the "old" in the first table, and 2.5 less in the second table.
@PplsChampion
@PplsChampion Год назад
"benchmarking as entertainment" is next to godliness
@TerjeMathisen
@TerjeMathisen Год назад
First, the two algorithms really are not separate: They have exactly the same numerical qualities and effectively do the exact same operations, only in a slightly modified order! On a modern CPU, sqrt() and FDIV are both far slower than fadd/fsub/fmul & fmadd, so I would start by calculating the a_reciprocal = (1/a) term: The FDIV in that operation will then be nearly or completely free because it can overlap with all the rest of the calculations. In the PSL version we immediately use the result of that division to scale b and c, which simplifies the logic but creates a latency barrier. If we instead calculate half_a_recip = (0.5/a), then we can multiply the final answers by that value. The sqrt term is sqrt(b*b-4*a*c); here the b2=b*b term and the ac=a*c term can run at the same time, then we combine them in tsqrt=sqrt(fmadd(ac,4.0,-b2)). By the time that has finished, the fdiv will be all done, so we just have to do the final res0 = (-b + tsqrt) * half_a_recip and res1 = (-b - tsqrt) * half_a_recip calculations, which will also overlap. Since sqrt is slightly slower than fdiv, we can in fact delay the half_a_recip calculation a smidgen, starting it just after the b2=b*b; and ac=a*c; operations; it will still finish before the sqrt. On an x64 CPU with at least two fmadd pipelines, but in scalar mode, these calculations should take ~40 clock cycles (I'm assuming 5-cycle fmadd and 20-cycle fsqrt).
@cH3rtzb3rg
@cH3rtzb3rg Год назад
I doubt the latency of the instructions matters at all in case you calculate millions of independent roots -- only instruction throughput and port-usage should matter (and in fact memory throughput). Instruction latency would matter, if the output of the previous calculation influences the input of the next.
@TerjeMathisen
@TerjeMathisen Год назад
@@cH3rtzb3rg I mostly agree: When processing huge amounts of data, typically more than last-level cache, almost any calculation will be totally dominated by memory bandwidth. OTOH, that is pretty much unimaginable for the case of solving quadratic roots. :-)
@cH3rtzb3rg
@cH3rtzb3rg Год назад
@@TerjeMathisen I also agree. The benchmark in the video is pretty much meaningless for most practical purposes. One could compare the latency of the algorithms by calculating new coefficients based on the previous roots (that way, there would also be no memory bottleneck).
@thrownchance
@thrownchance Год назад
Which CPU and GPU was used? A speedup of only factor 2 to go to the GPU seems to be barely worth it imho.
@user-sl6gn1ss8p
@user-sl6gn1ss8p Год назад
specially considering the time it takes to get stuff into the gpu
@maxvilla5005
@maxvilla5005 Год назад
Nice vid, wish the code text in red and blue (especially blue) was easier to read though.
@beaconofwierd1883
@beaconofwierd1883 Год назад
At first I was angry with you for implementing it using a branch, then I was very happy that you did, because it made me feel good spotting the issue from the start :D So thank you for making me feel clever ;D
@monad_tcp
@monad_tcp Год назад
In the end, both are good enough, most of the time goes to pure moving data around, and the square root itself. So it doesn't matter much. But it was a cool thing to know.
@wolfgangreichl3361
@wolfgangreichl3361 10 месяцев назад
The way I learnt it, the expression (double==constant) is a bug anyway. (9:12) Before going to assembly I would have replaced std::complex, which seems to me to be the culprit messing up memory access.
@decky1990
@decky1990 Год назад
Could you do a bit shift right (SHR?) to help reduce the clock cycles for dividing by 2 or would it need to be an integer to preserve accuracy? Thanks for putting up these videos - desperately trying to finish a new book on C++ for the 3rd time and you’re my only real motivation at the minute.
@andersgoude
@andersgoude Год назад
This only works for integers. If you really want to achieve this through bit manipulation with a float, what you could do is something like: float a = 5; *reinterpret_cast&lt;int*&gt;(&amp;a) -= 0x800000; and a will now have the value 2.5 instead (assuming that int is 32 bits). What this code does is decrease the exponent of the floating point number by 1, which mathematically is the same as dividing by 2. (Check the IEEE 754 standard for more details on how floating point numbers are stored in memory.)
@decky1990
@decky1990 Год назад
@@andersgoude thanks for sharing, Anders! Will check this out further, thank you
@andersgoude
@andersgoude Год назад
@@decky1990 I just want to remark that my example is mainly for educational purposes, and is generally not something that is recommended. The normal solution for divisions with a constant is to replace the expression a/2 with a*(1.0/2.0) (or just write a*0.5 which will generate the same code). This works for any constant and for most modern systems, multiplications are very fast operations, compared to divisions
@felixstuber8046
@felixstuber8046 10 месяцев назад
You can save one division in the Po-Shen Loh formula if, instead of dividing b by a and then by 2, you simply divide it by 2*a. On the other hand, multiplication and division by 2 shouldn't be a problem at all for a binary machine.
@JustATempest
@JustATempest 9 месяцев назад
I'd be interested in you using the fast inverse square root algorithm from Quake as a substitute for the square root in both of these
@mattewlefty991
@mattewlefty991 Год назад
From a numerical point of view, the quadratic formula might give you roots which are very distant from the true ones, depending on the values of a, b, c. So we may need different methods for computing the roots. Btw the "red and blue tower thing" was really anxiety-inducing 😂
@user-sl6gn1ss8p
@user-sl6gn1ss8p Год назад
"From a numerical point of view, the quadratic formula might give you roots which are very distant from the true ones, depending on the values of a,b,c" Really? Like, in a way that Po-Shen Loh's wouldn't?
@mattewlefty991
@mattewlefty991 Год назад
@@user-sl6gn1ss8p That's a good question, po's formula divides all the coefficients by a, and then does the calculations (but when abs(a) is small there might be some problems in general). @Creel would you share how you generated the test cases?
@user-sl6gn1ss8p
@user-sl6gn1ss8p Год назад
@@mattewlefty991 now that you mention it, comparing the results might be fun as well (if fairly slow)
@monad_tcp
@monad_tcp Год назад
why would it do that? The roots are basically where the curve crosses zero. We are talking about a parabola; it must cross the origin
@mattewlefty991
@mattewlefty991 Год назад
@@monad_tcp Not all parabolas cross the x axis... But even then there are some operations (like subtraction, sum and division) that introduces errors because the computer works with a finite number of digits
@sharkysharkerson
@sharkysharkerson Год назад
Based on code complexity, I would think Po-Shen Lo might do better on an FPGA. It could probably be worked out that both take the same time, but even so, Po-Shen Lo looks like it would use less blocks on the FPGA.
@smergthedargon8974
@smergthedargon8974 Год назад
I find it both interesting and very funny how the GPU doing complex numbers is almost exactly twice the time it took for it to do scalars.
@PhilippeVerdy
@PhilippeVerdy Год назад
So suppress the "if", and just use ONLY two intermediate variables "mid" and "u" declared as "long double", storing their complete expression; then you get "r1=mid+u" and "r2=mid-u"; you can even assign the intermediate values of "mid" and "u" within the expression of "r1", and reuse "r1" in the expression of "r2" in a single statement, so you get "r2=(r1=(mid=sqrt(b*b-a*c*4)/a/2)+(u=...))-u*2;" As well, avoid converting the "float" read from your "quad" structure multiple times to "double" or "long double"; just read them once into a local temp variable of type "long double" (this can be done in BOTH "formulas" to drop that overhead). Note that if you use assembly, those temp variables would be FPU registers of type "long double" (but you need a good C/C++ compiler to do the same optimization, and use "IEEE rounding modes" flags in the compiler or macros in the source code to avoid all unnecessary intermediate conversions and roundings). You can redo your test as well by changing your "quad" structure to declare "double" or "long double" fields (in both "formulas"). So what you are really testing is how C/C++ compilers optimize the generated code. There's in fact NO difference between the formulas, except that if you incorrectly round intermediate results in your naive "Po-Shen Loh" implementation, you get DIFFERENT results between the two "formulas", where your "Po-Shen Loh" implementation is LESS PRECISE than your "original" implementation! You did not even compare the results between the two formulas to see that their results have differences (visible here on the 9th significant digit, which can deviate by +/-2!)
@DanikaOliver
@DanikaOliver Год назад
I can't see the code, the contrast is not good. The letters are small.
@mhcbon4606
@mhcbon4606 Год назад
at 4:30, friend, the code in blue is unreadable unless i use 1080p.
@rogo7330
@rogo7330 Год назад
Is storing computation result into previously defined variable (thinking that we won't need its content anymore) faster than defining new (local to function) variable? And does this even matter?
@cH3rtzb3rg
@cH3rtzb3rg Год назад
For the compiler it should not matter at all, at least for simple types. Do whatever is easier to read.
@monad_tcp
@monad_tcp Год назад
@@cH3rtzb3rg its usually easier to store duplicated terms in a variable to make the code simpler, not for performance, the compiler can do that
@cH3rtzb3rg
@cH3rtzb3rg Год назад
@@monad_tcp I think Rogo was asking about code like: int x1 = SOME_EXPRESSION; // use x1 int x2 = SOME_OTHER_EXPRESSION; // use x2 If one should reuse x1 instead of introducing x2 for that.
@monad_tcp
@monad_tcp Год назад
@@cH3rtzb3rg oh, in that case it really doesn't matter, variables are renamed either way, the compiler either do register allocation or store in the stack. After data-flow analysis, it will reuse stack variables if possible. Unless your compiler is really, really dumb. Like Microchip ones.
@camofelix
@camofelix Год назад
I didn’t notice what chip you’re running this on! If it’s AVX512 with FP16 (early alder lake with ecores off) enabled there’s some SIMD fast paths for complex calculations that can do it all much faster, and because you’re using a smaller data type (fp16 instead of fp32) you get much better bang per buck on memory bandwidth
@monad_tcp
@monad_tcp Год назад
its an x86 probably
@dannygjk
@dannygjk Год назад
Why are you using floating point?