
computers suck at division (a painful discovery) 

Low Level
691K subscribers
1.7M views

I tried to take on a simple task. I TRIED to do a simple assembly problem. But the flaws of the ARM architecture ultimately almost defeated me. Computers suck at division, and there are a few reasons for that. Division is so hard for computers that the ARM processor core didn't have a divide instruction until 2004. Even now, certain ARM Cortex-M series processors don't have a divide instruction.
So then, how do the processors do division? Watch the video and find out ;)
🏫 COURSES 🏫 Learn to code in C at lowlevel.academy
📰 NEWSLETTER 📰 Sign up for our newsletter at mailchi.mp/low...
🛒 GREAT BOOKS FOR THE LOWEST LEVEL🛒
Blue Fox: Arm Assembly Internals and Reverse Engineering: amzn.to/4394t87
Practical Reverse Engineering: x86, x64, ARM, Windows Kernel, Reversing Tools, and Obfuscation: amzn.to/3C1z4sk
Practical Malware Analysis: The Hands-On Guide to Dissecting Malicious Software: amzn.to/3C1daFy
The Ghidra Book: The Definitive Guide: amzn.to/3WC2Vkg
🔥🔥🔥 SOCIALS 🔥🔥🔥
Low Level Merch!: lowlevel.store/
Follow me on Twitter: / lowleveltweets
Follow me on Twitch: / lowlevellearning
Join me on Discord!: / discord

Science

Published: 25 Sep 2024

Comments: 2.7K
@LowLevel-TV (1 year ago)
Wanna learn MORE awesome stuff while helping out the channel? Go get a FREE month of Skillshare using my link: join.skillshare.com/learn-1/?coupon=AFF30D23
@LithiumDeuteride-6 (10 months ago)
The "Degrees" class unambiguously requires floating-point numbers.
@Jblow-u2m (9 months ago)
What about the NEON DSP code? That handles div. Wide accumulate also.
@Finkelfunk (1 year ago)
What you discovered here is something called "Division by Invariant Multiplication". That topic is so insanely complicated that a good number of people have written dissertations in computer sciences about this stuff. Long story short: It makes division go vrooooom on your processor.
@null6482 (1 year ago)
Where can I find more info about this
@solomontan1524 (1 year ago)
This comment should be pinned so everyone can learn about the name for this concept. Very interesting stuff
@adamrak7560 (1 year ago)
if you want something even more crazy, look up fast inverse square root algorithm
@Smorb42 (1 year ago)
@@null6482 en.wikipedia.org/wiki/Division_algorithm#Division_by_a_constant
@okkoheinio5139 (1 year ago)
@@MustacheMerlin wow, that sounds cool
@chsovi7164 (1 year ago)
"I didn't want to use an incredibly low level library, because it felt like cheating" Been there before, I know where this road leads. I wish you luck on your future breadboard CPU project!
@MayorMcC666 (1 year ago)
Either breadboard or TempleOS, one of two roads
@mikicerise6250 (1 year ago)
RNJesus speaks to me.
@EmptyZoo393 (1 year ago)
Seriously, there is a reason high level languages exist. If you can let the compiler and the language take care of your branch tables for you, you can save a lot of development time. These days, the only reasons to work with assembly are for SIMD optimization, extremely optimized library functions (often involving SIMD), and occasionally interrupt routines. There are occasional times with small micro-controllers like this where you need to use assembly, but the code the compiler creates is typically roughly equivalent to what you'd create, and often better. Don't underestimate the power of auto-loop unrolling and function inlining. The linker can take all sorts of shortcuts that you wouldn't think to.
@NoX-512 (1 year ago)
@@EmptyZoo393 What's the fun in that?
@TheDavidlloydjones (1 year ago)
@@mikicerise6250 Do you speak Aramaic? What has He been telling you? Lemme guess. "Donald Trump has been sent to save you all" maybe?
@snoupix3332 (1 year ago)
I love how we programmers spend hours and nights avoiding an existing external lib just to say "I've made it myself" 🤣
@themalaysiandude3903 (1 year ago)
nothing beats making your own function all day instead of just copy-pasting
@OhhCrapGuy (1 year ago)
I think the real reason we do this isn't because we don't like libraries, it's because we want to understand. There are so many times that I've written something myself, fully understood the entire problem, gotten the bugs worked out, and as soon as I was entirely satisfied with my understanding, I simply immediately replaced my solution with a library, because I don't want to maintain any of that nonsense!
@JonJenkins1982 (1 year ago)
If you do that, you acquire skills that easily translate to other realms of computational problem solving.
@OhhCrapGuy (1 year ago)
@@JonJenkins1982 not only that, but even for higher level programmers, every time you deal with bit twiddling and such, you learn to think that way even better. Case in point, I had mostly figured out the solution for this only a few moments after he pointed out that ARM M0 cores don't have division capabilities, or at least close to it. I was thinking the solution was overflowing the irrelevant bits out of the multiplication register, but this is effectively the same thing, just putting them into a word that we simply ignore. It's not that I, or anyone else, have some sort of unique insight into how it works, I never would have thought of that earlier in my career. It's just that being exposed to these sorts of cool hacks over time gives you the tools you need to come up with new bit hacks when you encounter a problem.
@ricardocasimiro6424 (1 year ago)
agree
@fireball2275 (1 year ago)
I like how it changed to night with the easy bit at the start
@LowLevel-TV (1 year ago)
FINALLY someone noticed that XD
@fireball2275 (1 year ago)
@@LowLevel-TV legend
@Xevos701 (1 year ago)
The Firestorm core in Apple M1 has a dedicated Divider ALU.
@foobar1500 (1 year ago)
@@Xevos701 The Firestorm implementation of integer division is surprisingly efficient, but it's nonetheless 2.5-4 times worse in performance than the corresponding multiplication.
@DavidGarcia-se3ko (1 year ago)
Multiply by 555, then bit shift
@JustAnotherAlchemist (1 year ago)
The generalization of this concept is called "division by multiplicative inverse" or, less commonly, "multiplication by the reciprocal", and is a relatively common practice when you have a division where the divisor is a constant. It's talked about a lot in low-level books, like "Hacker's Delight." Basically, you precompute the division by encoding it as a reciprocal. Its fatal flaw is that you have to know what you are dividing by from the start. To go farther, you have to use something like the Newton-Raphson method for division. This is the same as the above, in that it finds the reciprocal, but with two major differences. The first is that it is a runtime operation: it FINDS the needed reciprocal during execution, with no precomputation. The second is that it uses an approximation to division _as a function_ to get close to the correct result, then uses the Newton-Raphson method to refine that guess to as many digits as you desire. For 32 bits, I find that 3 iterations is plenty. Incidentally, all of this is possible because floating point/fixed point number systems encode multiplication and division _into_ the number itself, in the same sense that having a sign bit encodes addition and subtraction into a number. Floating point, specifically, also encodes exponentiation into the number, which is why it is the usual go-to number system. ... The comment claiming this is called "Division by Invariant Multiplication" is the first time I have ever heard it called specifically by that name, and there is just one research paper with that exact title... Wikipedia calls it "multiplication by [the] reciprocal", which someone responding to said comment linked to as well. Multiplication by the reciprocal is a basic math operation, and gets you a lot more hits than "Division by Invariant Multiplication." It's also certainly not complicated to understand, as this ~5 min video has clearly demonstrated.
@idonnow2 (1 year ago)
Based informative comment
@rickroller1566 (10 months ago)
Doesn't Newton's method use division? What am I missing?
@JustAnotherAlchemist (10 months ago)
@@rickroller1566 en.wikipedia.org/wiki/Division_algorithm#Newton%E2%80%93Raphson_division ...specifically, you want "Xi =(2-DXi) ... which can be calculated from Xi using only multiplication and subtraction, or using two fused multiply-adds. "
@pulakgautam3536 (9 months ago)
you rlly are an alchemist
@Fujui (7 months ago)
Why dont you divide by 1.8
@jensknudsen4222 (1 year ago)
This brings back painful memories of 8-bit computing. Consider yourself lucky to have a multiply instruction.
@svampebob007 (1 year ago)
I remember building an 8-bit CPU; the division circuit was by far the longest to figure out. I started with, obviously, +, -, *... then tried for months to figure out how to do division (without, obviously, looking for the answer online). Once I realized how you could do division with subtractions and additions, it took me just a couple of weeks to deal with IO and R/W to RAM.
@linuxization4205 (1 year ago)
just add the number multiple times if you need multiplication.
@stinchjack (1 year ago)
I dabble with Z80 assembly. Multiplication is a charm, e.g. to multiply the HL register by 10:
push de
ld d, h
ld e, l
add hl, hl
add hl, hl
add hl, de
add hl, hl
pop de
I've written something like that so often by now it's just muscle memory :D
@daishi5571 (1 year ago)
The Motorola 6809 (1978) has a multiply instruction: it takes register A, multiplies it by register B, then combines the A & B registers into register D (8+8 bits, so 16-bit) holding the multiplication result.
@robertherndon4351 (1 year ago)
Ah, tables of squares and difference-of-squares multiplication methods. And then there are Newton-Raphson approximations for computing reciprocals so you can pipeline divisions using multiplication. All good fun...
@sprytnychomik (1 year ago)
Other solution: don't use Fahrenheit.
@LowLevel-TV (1 year ago)
Imperial Unit Gang
@PRIMEVAL543 (1 year ago)
The good old freedom units that got forced on you by the UK, because nothing stands more for freedom than the British Empire. Did you learn about picas and points, btw? And can you tell me how many miles 1 pica + 3 points are?
@NinjaRunningWild (1 year ago)
F that!
@devinnall2284 (1 year ago)
For non-Americans, an easy way to understand Fahrenheit is to think of it as a percentage: anything below 40% is cold, 40% to around 75% is a nice medium temperature, and anything above that is hot as hell!
@gwenyurick9663 (1 year ago)
@Devin Nall Yeah, like most metric stuff it is just better, full stop, but Celsius is not more beneficial for a human. F's range of 0-100 being "about dangerously cold but manageable with layers" to "about dangerously hot but manageable with hydration and limiting sun exposure" makes it much easier to understand what a temperature actually means for planning my day. Celsius is based on water's state changes, but I'm not water; I don't care when it changes state for weather forecasts.
@Soupie62 (1 year ago)
5 / 9 is roughly 569 / 1024. Dividing by a power of 2 is an arithmetic shift right. Multiply by 569, shift the result right by ten bits. Multiply by 36409, shift 16 bits... and 2386092942 with a shift of 32 bits merges nicely with your code.
@VictorCampos87 (1 year ago)
Sir, this is what I call an elegant solution. Easy to understand. Easy to code. Easy to process. And it gives the most precise results available for this architecture (I think).
@Kyle-ke5fx (1 year ago)
@@VictorCampos87 This is exactly why they don't have these operations on the processor. Because skilled embedded developers don't need it and it would make it too expensive or large for the applications it was built for. People that do this every day generally don't need it.
@joshuahudson2170 (1 year ago)
@@Kyle-ke5fx What do you do when a variable divide comes up?
@TrackedHiker (1 year ago)
@@joshuahudson2170 do it the same way we do long division by hand in decimal
@joshuahudson2170 (1 year ago)
@@TrackedHiker That's a guess and check pattern.
@RokeJulianLockhart.s13ouq (1 year ago)
This is why, whenever you're programming, and realize a solution that seems like cheating, do it. Cheat.
@sovietshnuckums5345 (3 months ago)
Real
@thenermer (1 month ago)
He isn't trying to make something, hes doing it to learn
@JohnnyThund3r (11 days ago)
Yes... but also no... Like right now I could cheat and just take some random Water Shader someone made for Godot, copy that code and put in my game... BUT... I actually wanna know how water shaders work because it's fundamental to recreating a submarine simulator that's actually good and doesn't feel like an asset flip.
@RokeJulianLockhart.s13ouq (11 days ago)
​@@JohnnyThund3r If you utilize a library which is well-designed, the combined manpower of its developers shall render it better than anything that a single developer like yourself or I could create. Plagiarism doesn't inherently mean inferiority.
@cewla3348 (8 days ago)
@@RokeJulianLockhart.s13ouq they're not talking about using a library, they're talking about ripping other assets into their project
@daylen577 (6 months ago)
The thing people often don't understand is that every operation eventually leads to an addition or a subtraction; multiplying is just adding the same number multiple times, and dividing is just looping through a bunch of potential numbers to see which one is closest.
@amigalemming (1 month ago)
Addition and Subtraction eventually end up in pure binary logic. It would be enough to have bit shift, AND, OR, and setting and testing bits.
@Trooperos90 (28 days ago)
Yes but actually no. Thatd be too slow. Its logical bitwise operations.
@richardericlope3341 (1 year ago)
Even as late as the Nintendo DS's ARM9, we needed to use fixed-point mathematics. The hardware registers themselves accepted only fixed-point values. Same with transcendental trig functions. We had to use lookup tables in BRADIAN (binary radian) units, interpolated on the fly. Fun times.
@circuit10 (1 year ago)
Did you program DS games?
@raymundhofmann7661 (1 year ago)
Fixed point, or fractional integers, can in fact do adequate calculations for most practical things and in signal processing; it just gets increasingly burdensome for the programmer to decide where the point is best put for each number represented in the program, so as not to lose precision or the S/N. Consequently, it is easier for the programmer to have this done at runtime with floating point, but that is also more wasteful, and possibly numerically more unstable/indeterministic when dealing with feedback or iteration over data sets. Scientific calculations or an FFT (and other transformations) are examples where floating-point calculations may be needed. In an FFT, for example, all data points in the time domain may contribute to only one data point in the frequency domain and vice versa; floating point may be the better choice for dealing with the bigger dynamic range of the transform compared to the source data, rather than just increasing integer bit precision overall.
@tomcombe4813 (1 year ago)
Yep, one of the hardest problems in computer design has always been finding a good way to implement division. There have been some creative solutions, like the Cray-1 supercomputer, which would instead compute a reciprocal that you could then multiply by to do your division. But often, designers just omit division altogether from their CPUs.
@grimonce (1 year ago)
Well this was just division... By constant....
@declanmoore (1 year ago)
The DS also has a divider unit, which makes up somewhat for the lack of support by the CPU :p
@razorstone3088 (1 year ago)
bro's just casually programming some ARM assembly. love it man stuff like this makes me motivated to learn low level stuff even though it looks kinda scary to me
@LowLevel-TV (1 year ago)
Do it!
@Yas-gs8cm (1 year ago)
I think it's a waste of time beyond knowing "what is assembly and how it works + basics" (Note: I, myself, know the "basics" I mentioned above. It's logical that "knowing" is always superior comparing to "not knowing"...)
@andrewdunbar828 (1 year ago)
Learning it will improve your high-level coding. Worth it!
@LuskyMJ (1 year ago)
@@Yas-gs8cm Once you learn assembly your thought process never becomes the same again. Even if you work with higher level languages you still constantly think about: "What happens once this gets converted to assembly". It helps you develop new ways to solve problems that you wouldn't have if you never coded in assembly.
@ImperatorZed (1 year ago)
@@Yas-gs8cm this isn't even the basics though, this video is super simple stuff
@jameshogge (1 year ago)
A couple of extra notes: - This division method (multiply+shift) is attempted by compilers whenever the divisor is known at compile time. It isn't used to divide by an unknown number as the constants can't be predetermined. - Another alternative to implementing these things in hardware is to translate the instruction into micro ops on the cpu. (I can't say for certain which instructions do this but I believe all exponential/trig functions usually do.) I believe the reciprocal is usually calculated in hardware so a division might be converted to this plus a multiplication And yes, regular divisions can take 100s of cycles
@weetabixharry (1 year ago)
There are a lot of iterative algorithms for computing divisions. My favourite is successive approximation: x = a/b, we rearrange to x.b = a. We then *guess* the value of x 1 bit at a time, by setting each bit to '1' from MSB to LSB and checking if x_guess.b is larger than a. If it is smaller (or equal), then we leave that bit set. If it is larger, then we reset that bit to '0'. This algorithm works nicely for a lot of monotonic functions with simple inverses (e.g. division --> multiplication, square root --> square, and there is even a clever trick for calculating logarithms this way).
@weetabixharry (1 year ago)
Of course, that algorithm is slow, but I think it is elegant. In digital hardware, it needs N clock cycles to calculate an N-bit result (but uses an incredibly small amount of silicon). In a CPU, much longer.
@absalomdraconis (1 year ago)
@@weetabixharry : I wonder if anyone's tried to shorten it with a "binary search" or priority encoder?
@weetabixharry (1 year ago)
@@absalomdraconis There are plenty of ways of doing it faster in hardware, it just costs more transistors. Depending on the clock frequency, it may even be possible in 1 clock cycle. The best solution depends on the use case.
@weetabixharry (1 year ago)
@@absalomdraconis And it already is a binary search (because we're setting 1 bit at a time, not incrementing the value +1 for each guess).
@kensmith5694 (1 year ago)
As someone who does microcontroller stuff, I often do that sort of thing. There is even a method that allows you to effectively divide by a variable amount. It is based on two observations: 1) A/B = (N*A)/(N*B) so you can pick an N to make the denominator something easier to deal with 2) For X very small 1/(1+X) = 1-X this allows you to deal with any number near a power of two without a divide.
@crashhappy7296 (4 months ago)
That second approximation, and the others that result from that Taylor series expansion, are all over physics.
@kensmith5694 (4 months ago)
@@crashhappy7296 The 2nd thing is really more like an observation Taylor started with. The two parts together effectively do a divide but it is not Taylor.
@mervmartin2112 (10 months ago)
At the machine level, computers don't subtract, multiply, or divide. They add, complement, and shift. So, in binary: shift the dividend left past the radix point as many times as you want significant digits. Two's-complement the divisor and add it to the dividend until the quotient sign bit goes negative. Shift the quotient right the same amount you shifted the dividend left. Voilà! The point is only called a decimal point in base 10; it's the radix point in any number base. The video was great! Good job digging into the software that deep!!!!
@__christopher__ (4 months ago)
At the machine level, they don't really add, either. Addition is built out of purely logical operations.
@mervmartin2112 (4 months ago)
@@__christopher__ Yep. And even the logic can be expressed in discrete components. And those in gathered natural materials. Sand, dirt, burnt things and wizardry ;-D
@filipesaz (1 year ago)
For those of you who, like me, had some difficulty understanding his explanation, his final calculation is something like: C = (F - 32) * 5 * Const / (2^32), where Const = ceil((2^32) / 9). He divides by 2^32 by grabbing the highest 32 bits of the 64-bit product (on a little-endian architecture), because that is equivalent to bit-shifting the 64-bit positive number 32 times to the right.
@danielschneider9358 (10 months ago)
Thx! That actually cleared things up nicely :D
@ashwiniabhishek1504 (9 months ago)
Thanks, it cleared up my doubts
@User-je7gf (6 months ago)
You are a wizard
@maggiejetson7904 (4 months ago)
For something that's only C to F or the reverse, just calculate it ahead of time in Excel and put a lookup table inside the code. It's an embedded system, and a table will be faster and smaller than pulling in a math library.
@watsonwrote (1 year ago)
I love how this highlights how division is fundamentally different from the other basic operations. It opens up a whole new domain of numbers and we have to contend with that.
@goldenhate6649 (1 year ago)
The mathematical proofs for multiplication and division are what made me hate math. I am a chem engineer by education and environmental by trade. I love equations, but screw any sort of proof. Hated every second of every math class I took in college for that reason. I would have loved to complete matrix theory (still have nightmares there), but proofs made me quit. I think I made it less than a month. Kicker is, the class even said it WASN'T going to have proofs.
@Your_choise (1 year ago)
In a sense, subtraction also opens up a new domain as it gives negative numbers.
@coopergates9680 (1 year ago)
@@goldenhate6649 Pretty much, I do recreational math all day, but forget deep dives into multivar calculus or other long routes by hand just to demonstrate a simple conjecture
@weirdlyspecific302 (1 year ago)
It simply isn't. Subtraction does the same thing.
@antoinem3659 (11 months ago)
@@weirdlyspecific302 but division open up more
@Ma1ne2 (1 year ago)
Loving this type of video format where you just tell a short story about something you have been working on yourself and learning something new! Would love to see more of this format!
@LowLevel-TV (1 year ago)
Thank you! I really had fun with this one
@MrZcar350 (1 year ago)
And has the magic number (at least in the 32-bit version) 0x5F3759DF.
@mrbutish (1 month ago)
Learnt this early on. Computers divide by left-shifting/right-shifting the number in a register. It is multiplication by the reciprocal of the divisor.
@mrbutish (1 month ago)
That is why Python gives 0.1 + 0.2 = 0.30000000000000004
@johnnyvsx (1 year ago)
I used this same technique years ago. Another way of thinking about it is that you're converting your fraction to a close fraction which has a power of 2 as the denominator, since that will just be a bit shift operation. It was an old Intel 8088 microprocessor. It had a divide opcode, but execution took over 160 clock cycles. The real-time code couldn't finish processing one block of data before the next arrived. Converting all the multiplies & divides into multiplies & shifts saved the day.
@evan7721 (1 year ago)
Once you described how that weird arbitrarily large number was found/represented, it reminded me of the Fast Inverse Square Root function in Quake's engine, which tells the computer to cast a float to an int with no conversion
@endergamer7444 (1 year ago)
//Evil floating point bit level hacking.
@dylon4906 (1 year ago)
@@endergamer7444 // what the fuck?
@paulkirjonen1226 (1 year ago)
@@endergamer7444 // what the duck
@fire_man3173 (1 year ago)
//Newton Iteration
@goldenhate6649 (1 year ago)
The real kicker is the matrix that allows you to do square roots without doing a square root, in only a few FLOPs. It's the foundation of all modern video games that use triangular modeling.
@0x574c (1 year ago)
I've got some bad news... looks like the Cortex M0, M0+ & M1 don't support the SMULL instruction (32-bit multiplication with 64-bit result). You only get the lower 32 bits. If you set the compiler flag `-march=armv6-m` in Godbolt, you'll see it calls a library function for division (`__aeabi_idiv`) instead.
@nahco3994 (1 year ago)
Yup, the same thing tripped me up when I was fooling around with some playful attempts to do baremetal stuff on a Raspberry Pi 3B, which has a Cortex A53 I believe. I didn't bother to include or link any libraries, because I was only doing very basic arithmetic and pointer shit anyway, outputting over UART. I was quite surprised that I was getting an error referencing this function call, in a place where I fully expected a plain integer division to be.
@foobar1500 (1 year ago)
I strongly doubt if he actually needs to convert millions of Fahrenheit to Celsius. For a reduced range a 32-bit multiplication with a 16-bit constant and a 16-bit shift is enough for correctness.
@Carewolf (1 year ago)
Dont code for armv6... That stuff is ancient.
@ABaumstumpf (1 year ago)
@@Carewolf "Dont code for armv6... That stuff is ancient." ah yes - the "ancient" 9 years ago... cause nobody would ever use something that was first released that long ago............................................................ yeah no.
@Carewolf (1 year ago)
​@@ABaumstumpf You don't, because ARM before v7 is not a single platform; it makes no sense to target armv6 because there is no such single ISA, it is a huge mess. If you target arm32 you target armv7 or a VERY specific ARM CPU. And besides, armv6 is from 2002, so 20+ years old, not 9.
@gapplegames1604 (1 year ago)
My favorite trick is to take some huge power of 2, let's say 1024. Turn 5/9 into the closest fraction with 1024 as the denominator. In this case you would get 570/1024. Now all you have to do is multiply your binary number by 570, and then shift the result right 10 bits (moving the binary point 10 places). I'm sure you guys know way better ways to do this, but this was my trick when I had to write code in assembly for my computer science class.
@gapplegames1604 (1 year ago)
Just realized this is almost exactly what you did in your video!
@user-dh8oi2mk4f (1 year ago)
@@gapplegames1604 This is exactly what he did in the video
@meneldal (1 year ago)
@@gapplegames1604 Depending on the architecture and the number of bits you have there are tradeoffs with precision/max value you can accept. Since 32 bit multiplies tend to be faster (on cheap hardware), something with a lower bit shift like your example (but most likely 16 to use tricks with using just high/low word) has its uses in some situations. If you're stuck with an 8-bit AVR, you're definitely not doing 64 bits multiplies (and probably not 32 either).
@gregorymorse8423 (1 year ago)
Except multiplying by 590 reduces the domain and causes overflow in cases the better method doesn't. You usually like to write crappy buggy code, without consideration of any problems or limitations or domain restrictions. Yes we know. Hopefully nobody is paying you for such garbage tho
@rileyalden723 (1 year ago)
@@gregorymorse8423 jesus christ bro chill, he's explaining how he did it "for my computer science class", he's just working with the closest possible precision he can given the limitations of practical time constraints and diminishing returns. For the purposes of understanding how assembly works, the goal of most courses that are going to require a basic understanding of this kind of math, this is a perfectly sufficient method.
@baxtermullins1842 (3 months ago)
In the “old” days of fixed-point arithmetic, machines came in fractional or integer models. So we kept track of powers of 2 in a note by each statement so as not to lose track: x(2^n). The “n” was physically typed in the comment beside each statement. As far as timing, a multiply took 4 cycles while a divide took 13. This was for a 16-bit processor; double, triple, quad etc. precision timing grew to 2n cycles, 3n, etc. So we always tried to find ways of using multiplies, especially for time-critical simulations.
@LittleBlue42 (1 month ago)
-has library that handles said issue easily -decides not to use it, and wastes a bunch of time reinventing the wheel.
@rparl (1 year ago)
The first computer which I programmed in Assembly was the IBM 704, in 1960. There were two floating point division instructions, Divide AND Stop & Divide OR Stop. This had to do with the requirement that the result had to be less than one. So the inputs had to be scaled to make sure.
@KillerSpud (1 year ago)
I've been developing on a Motorola/freescale/NXP DSC family chip for a few years now and I'm always having to do multiply shift approximations. A simple divide operation takes a few hundred microseconds, and when I'm running a one millisecond loop that can blow things out quick. This is just one of those cases where close enough really is close enough.
@LowLevel-TV (1 year ago)
Yeah I’ll probably do another video for this but it’s crazy that even on modern CISC chips DIV instructions take sometimes up to 12-20 clock cycles
@johntrevy1 (1 year ago)
@@LowLevel-TV But then the human brain typically struggles with division the most too. I still can't even get my head around long division lol.
@slicer95 (1 year ago)
@@LowLevel-TV That is because Division is one of the hardest problems in Computer Arithmetic.
@UnblockMind (1 year ago)
Maybe we should teach them short division method.
@andrewdunbar828 (1 year ago)
@@LowLevel-TV If you want to realize you're still coding at a high level even in asm, then try to implement a divide yourself in FPGA (-:
@henriksundt7148 (1 year ago)
If you had divided by a variable instead of 9 in the C source, the true face of division would show up. The code would then have to call a runtime division algorithm instead of multiplying by a constant. The ARM assembler code for efficient division is worth a look.
@EvilidelRio (1 year ago)
In the not-so-good old days, on an IBM mainframe and before IEEE math, we had a lot of fun adding 100 events, each one with a weight of 1/100... just to discover, after many hours reviewing the code, that 1/100 is a periodic fraction in binary! So we changed it and weighted 128 events.
@senatuspopulusqueromanum
@senatuspopulusqueromanum 1 month ago
@@EvilidelRio how does that work? could I get an elaboration on what that means?
@EvilidelRio
@EvilidelRio 1 month ago
@@senatuspopulusqueromanum if you added 1/100 100 times you didn't get back 1.0 but something like 0.999996798...
@minhazulislam4682
@minhazulislam4682 1 year ago
0:37 "I felt I was on my way to a quick LOW LEVEL victory" nice one.
@hayfahvytsen
@hayfahvytsen 1 year ago
Before the prevalence of div instructions and FPUs this was standard practice in embedded systems and decades of products used fixed point math to do incredible things, including keeping your car on the road in the winter!
@dr.strangelove5622
@dr.strangelove5622 1 year ago
Really cool video!! It feels kind of great when I get to see someone else passionate about assembly language. This video reminded me of the time I was figuring out how to multiply and divide a number in the assembly language of the ATMega8A microcontroller (and later with an 8085 processor): I had to use repeated addition and subtraction to get the job done. Later on, when I worked with the 8086 microprocessor, I learned that division is an expensive operation (just like the LOOP and REPNE instructions on x86. Even in the x86-64 family of processors, they are just artifacts which no one has ever optimized, so everyone favours JMP instead of LOOP). Again, this was a great video! 👍👍
@LowLevel-TV
@LowLevel-TV 1 year ago
Thanks for watching! :D
@SansNeural
@SansNeural 1 year ago
Old timer here... visiting you from the days before MUL and DIV instructions in microprocessors. I should be well versed in all manner of integer math operations, but thankfully I've been able to forget (or at least not recall and use) all that for years now. Heck, for some of the industrial machines (Z80 based) I programmed in the '80s we splurged and added a math co-processor card! Probably the main takeaway in your journey shown here is that ARM is older than we tend to think. Born in the '80s itself, ARM was radically new (RISC) but still influenced by its times. Also, adding hardware DIV might have been considered antithetical to the new RISC concept.
@HappyBeezerStudios
@HappyBeezerStudios 1 year ago
And nowadays RISC and CISC aren't really meaningful anymore. The most widely used RISC designs have tons of instructions and extensions and the most widely used CISC designs translate CISC instructions into RISC-like µOps
@Acid_Viking
@Acid_Viking 1 year ago
When confronted with similar issues in CS, I adopted the more straightforward solution of giving up coding forever.
@Chris-hf2sl
@Chris-hf2sl 3 months ago
That's a more common solution than many folk realise.
@sailor5853
@sailor5853 1 year ago
Will Smith: "You can't do division" Robot: "Can you?"
@Olav_Hansen
@Olav_Hansen 1 year ago
As soon as you aren't working with whole numbers, most humans start to break at division as well. 90/9 is something most people can do, 90/8 becomes a bit tricky for some, but 90/7.6 will require most people to pull out a sheet of paper to 'cache partial results', and even then we struggle past this point. And this is while we are the creatures that learn how to divide things evenly with others even before we learn math.
@farfa2937
@farfa2937 1 year ago
I'd say that most people (including myself) would not even try to do 90/7.6 by themselves at all, sheet or not.
@nerobernardino88
@nerobernardino88 10 months ago
@@farfa2937 My solution would be: "90/7.6=Fuckyou"
@llllNEOllllchannel
@llllNEOllllchannel 9 months ago
90/7.6=900/76
@ss3nm0dn4r8
@ss3nm0dn4r8 8 months ago
The big difference here is that 8 and 9 have one digit while 7.6 has two, not the fact that it's not a whole number. If you say divide by .9, now it's a cakewalk for most people again.
@Olav_Hansen
@Olav_Hansen 8 months ago
I wouldn't say most would find it easy again (although definitely easier than 90/7.6), but that's also because the solution is a whole, natural number again.
@menotworking
@menotworking 1 year ago
Those of us who work with smaller micros do this on an almost daily basis. Create a multiplier/divisor pair with the divisor a power of 2. 2^8, 2^16 and 2^32 are especially nice because they just involve discarding low bytes. The only caution is to make sure your intermediate multiplication cannot overflow. For "normal" temperatures, a 2^8 divisor is all you need. 5/9 = 142.22, so the multiplier can be 142 or 143. Now you work back to determine the maximum input temperature that will avoid overflow, so 65535/142 = 461 and change. The conversion will work with inputs up to 461 degrees, with an error of about 0.16%. You can also add a rounding factor of 256/2 to the multiply before the 2^8 "division".
@RoseHayes-321
@RoseHayes-321 1 year ago
I strongly recommend the book "Hacker's Delight" by Henry S. Warren, Jr. It's particularly valuable for anyone doing assembly language programming. I wish it had been around in the 80s when I started out.
@ShimonDanilov
@ShimonDanilov 1 year ago
I also thought of that immediately, because I remember that in Hacker’s Delight there is this exact algorithm. Such a good book.
@kevintan5497
@kevintan5497 1 year ago
Dividing using assembly was one of my least favorite assignments when I took a class with assembly.
@alst4817
@alst4817 2 months ago
I had a heart attack when after 2 seconds I realised you were writing assembly. Just a wee bit over my head😂
@nandomax3
@nandomax3 1 year ago
Next time I get lost in a lucid dream, I'll try to divide with an old chipset. If division is supported I'm dreaming
@falcon-ng6sd
@falcon-ng6sd 11 days ago
Divception!
@_dekinci
@_dekinci 1 year ago
When we manually tried different division approaches in a computer engineering course, I was fascinated by how hard division actually is. It makes sense though - division is also harder to do on paper. P.S. There's also square root.
@akallio9000
@akallio9000 1 year ago
That's why they call it a TRIAL divisor, if it's too big you get to start over again.
@Mueller3D
@Mueller3D 1 year ago
Division in binary is actually not that difficult, since the (appropriately shifted) divisor either does or doesn't go into the dividend (determined using subtraction). There are even shortcuts you can do to make this fairly efficient. However, it still takes a number of cycles equal to the quotient length.
@_dekinci
@_dekinci 1 year ago
@@Mueller3D Yeah, but it's much less intuitive than multiplication. It also takes longer algorithms with more steps and branching. In multiplication you just go with a naive approach; in division I don't think any approach can be considered naive.
@MrEnyecz
@MrEnyecz 1 year ago
It is nice to see young kids discovering the wonders of the world! :) What if I told you that the famous 6502 CPU, which powered machines like the C64 (and many, many others), had not only no division but no multiplication either? And it was still able to rule the 80s.
@kc9scott
@kc9scott 1 year ago
When I was using Apple II, I wrote a macro library for 16- and 32-bit int operations like add/subtract/multiply/divide. But when I got to wanting to do floating point from assembler, I cheated and just bought a floating-point processor board.
@w花b
@w花b 10 months ago
You simply have to implement that in software as far as I know. It will just be slower, but that's it.
@karmatical5837
@karmatical5837 10 months ago
@@w花b Well, by deciding to work with floating point, you kinda accepted the fact that you will sacrifice memory and performance. After some time studying hardware, especially old designs, I got why it's so fundamental and common.
@TheNvipy
@TheNvipy 10 months ago
Traditionally, we used a log/exp approach on 8-bit/16-bit. Hitachi and MIPS RISC CPUs worked around the problem. On MIPS, multiplication and division are carried out in parallel and store the result in specific registers. Hitachi "SH" CPUs provide division in parts, so we can set the required precision: one instruction to initialize, another to perform a step.
@MrEnyecz
@MrEnyecz 10 months ago
​@@w花bYes, of course, it was Turing complete. I just said that it was even more primitive, but still a CPU. Getting surprised that some CPU has no division shows that you're way too young. :)
@brianfox7067
@brianfox7067 1 year ago
I can't get out of bed and brush my teeth and this guy is doing this with his time
@xennial7408
@xennial7408 3 months ago
Being in management today, I watch videos like this to get back to my roots and just feel good. Thanks for posting this! 🙂
@Ghoetic
@Ghoetic 1 year ago
I usually pre-scale a fraction by an acceptable binary power, e.g. 2^8, giving (5/9)*256 = 142, multiply with that, then right-shift the result by 8 to fast-divide.
@capability-snob
@capability-snob 1 year ago
This was a triumph! FWIW most x86_64 CPUs have floating point division hardware but integer division is implemented using micro ops for newton-raphson method. This bugs me to no end because you normally want high throughput for floating division, so you use n/r for that, but for ints it usually is a single division on the critical path. The difference is huge - unpipelined trial subtraction hardware has an 8 cycle typical latency, but n/r is 26 cycles.
@soonts
@soonts 1 year ago
On modern AMD CPUs, integer division only takes 2 micro-ops. Still slow though, for 64-bit integers the latency is up to 19 cycles.
@capability-snob
@capability-snob 1 year ago
@@soonts ooh, interesting! DIV and IDIV timings going back to Bulldozer appear to be for a slim hardware division unit with up to 18 cycles of latency. Intel are still in the high 20s though. Thanks, I've been team red for over a decade but somehow never noticed this.
@Carewolf
@Carewolf 1 year ago
@@soonts It takes a variable amount of time depending on the value you divide by. Anywhere from 2 to 30 cycles.
@torbjorngranlund6299
@torbjorngranlund6299 1 year ago
Not true. Hardware division makes use of SRT, not Newton's method.
@lolerie
@lolerie 1 year ago
@@capability-snob divq on ice lake is fast as hell. That is 128 bit division.
@wubsyman5796
@wubsyman5796 1 year ago
At 3:10, is that IEEE 754?
@owensmith7530
@owensmith7530 1 year ago
The ARM1 processor didn't have a multiply instruction either. Multiply was added for the ARM2 not really for maths but because of how often the C compiler had to calculate addresses (eg array members) and multiply is very useful for that.
@Autotrope
@Autotrope 5 months ago
I started this on the wrong foot, I had "just use a high level language" yelling in my head until I realized that looking at these from the low level is the point of the video and indeed the channel
@svenmorgenstern9506
@svenmorgenstern9506 1 year ago
Welcome to 6502 assembler with more bits! 😱 Seriously, at the assembly level (especially with RISC type instruction sets) you end up really appreciating how much heavy lifting your compiler does for you. I've never written anything in ARM assembler, but this video takes me back. Thanks for sharing! 👍🍺
@markday3145
@markday3145 1 year ago
One of my favorite optimizations had to do with converting timestamps from 32-bit number of seconds to separate year/month/day/weekday/hour/minute/seconds fields on a 6502. The typical approach involves multiple 32-bit divisions. I realized that most of those quotient bits were always zero. So I essentially unrolled those divisions and eliminated all of the trial subtractions that were guaranteed to result in a zero. It ended up needing a total of something like 34 subtractions, resulting in a speedup of 7X.
@foobar1500
@foobar1500 1 year ago
On systems with fixed instruction length even stuff like loading immediates (that is, constants) to registers is quite funky, even on symbolic assembler. On x86 with variable instruction length but no widely available barrel shifter logic available to instructions these are quite literal "move immediate" instructions, but on ARM, complicated constants either take several instructions to load (if not using constant islands), or generate constants in interesting ways...
@ArneChristianRosenfeldt
@ArneChristianRosenfeldt 1 year ago
JRISC in 1993 (Atari Jaguar) and SH2 in 1994 on the Sega 32X and Saturn offered 32-bit division. It has some latency, but did not block the processor. Similar to how Quake did division on the CISC Pentium in 1996.
@rodneylives
@rodneylives 1 year ago
A breakthrough moment for processor-based math for me was, way back in a college Computer Architecture class, realizing that, internally, computers do addition and multiplication exactly as we do on paper, they just do it in base 2. Changing the base simplifies the operation greatly. There is no deep magic.
@NothingButFish
@NothingButFish 1 year ago
Taking me back to applying newton's method for calculating square root in MIPS assembly as a programming assignment in an introductory computer architecture/CS class. I think more than anything else it taught me to respect all of the crazy shit that's happening at the lowest levels (although of course it goes even lower).
@jasonenns5076
@jasonenns5076 7 months ago
MIPS is dead! The company behind MIPS now makes RISC-V cores.
@unknown_user_gaming
@unknown_user_gaming 1 month ago
im a computer in this case
@paradiselost9946
@paradiselost9946 1 year ago
This is the bit I love. Go deep, go to circuit level... half adders and full adders, shift registers... We have paper, pencils, and extra ways of accessing memory. A 'puter is stuck with whatever figures are in its ALU. It "sees" one digit at a time. It can't see the whole number, it can't go back and "carry the one"... Subtraction by addition and complements has always fascinated me...
@m.hosseinmahmoodi
@m.hosseinmahmoodi 1 year ago
I remember when I first started learning assembly, I started coding for the Game Boy. I decided I'd write a few simple functions to get a feel for assembly, so I wrote a function for the Fibonacci sequence, then I attempted to write a function for “n!” !!! Imagine my surprise when I found out that the GBZ80 (the CPU of the Game Boy) doesn't have any instruction for multiplication. I had to write a multiplication function using bit shifting. I have used a lot of different CPUs' assembly, and most of the very old ones (6502, 65C816, …) don't have mul/div instructions; old ones (8086, 80186) only have integer mul/div; and newer ones (from the Pentium until now) are very hard to program (if you want to make them fast, that is) because of SIMD.
@maksymiliank5135
@maksymiliank5135 1 year ago
That's actually a pretty cool solution. I thought you were going to talk about floating point division and how inaccurate it is, but this was also interesting to watch
@edgarbonet1
@edgarbonet1 1 year ago
Nice video, thanks! It is worth noting that, even when division is supported by the hardware, it can be quite slow. I once timed the basic floating point operations on a Cortex-A5, which has an FPU implementing the VFPv4-D16 instruction set. Multiplications turned out to be fast (one/two cycles in single/double precision respectively), but divisions took more than seven times longer than multiplications.
@zlcoolboy
@zlcoolboy 1 year ago
Shows that ARM was right to forego a division module.
@watchm4ker
@watchm4ker 1 year ago
​@@zlcoolboy That's still faster than doing it in software. This method only works because it's a known constant that can be approximated. Once you divide by a variable, you've got to use much more sophisticated approximations, which will likely take multiple times as long as a hardware divide unit.
@JaseewaJasee
@JaseewaJasee 5 months ago
the depth you go into each topic is remarkable, truly enlightening!
@REgamesplayer
@REgamesplayer 1 year ago
Typical programming task. Write a ''Hello World" program. Me: So, in order to fully understand and to solve this problem, I basically need to understand chip design, translation layer and have an advanced understanding of programming language in which I'm coding. In order to solve this problem you will need an expert understanding in computer science.
@QuantenMagier
@QuantenMagier 1 year ago
It's called RISC (Reduced Instruction Set Computing) for a reason, if you want hardware division operations use a CISC (Complex Inst..) processor or a mathematical co-processor. ;)
@GeorgeTsiros
@GeorgeTsiros 1 year ago
RISC does not mean that it won't have certain instructions, or that it lacks functionality, or that it is a _limited_ instruction set computer. It just means load/store arch, orthogonal reg file and instr set.
@QuantenMagier
@QuantenMagier 1 year ago
@@GeorgeTsiros Well it's not limited it's just reduced, but it still is general purpose. ;) The idea behind RISC was to improve the efficiency per transistor count and a division unit would mean more transistors with almost no benefit regarding the most programs.
@jesseparrish1993
@jesseparrish1993 1 year ago
That's why I stick to multiplication by inverses
@LowLevel-TV
@LowLevel-TV 1 year ago
So much easier
@jesseparrish1993
@jesseparrish1993 1 year ago
@@LowLevel-TV Obviously I posted a witless comment as quickly as I could to get attention. Thanks for the content!
@LowLevel-TV
@LowLevel-TV 1 year ago
Anytime man
@jasonknight1085
@jasonknight1085 1 year ago
As an old assembly hound, it's fun to see something like ARM. I don't do a lot of RISC stuff in ASM because of the old adage "RISC is for people who write compilers. CISC is for people who write programs." On the 8086 and 8088 we too couldn't rely on there even being an FPU, and our MUL and DIV instructions were painfully slow -- in excess of 150 clocks EACH. An old trick for doing percentages in that case was to do a multiply very like this as a fraction of 65536. What made that "fun" was you ended up with the integer result in DX, and the numerator (65536 as the denominator) in AX. We were further hobbled by there being no MUL reg, immed:
; assumes temperature in AX
sub ax, 0x20
mov bx, 0x8E38 ; roughly 5/9ths of 0x10000
mul bx ; DX = integer result, AX = numerator over 0x10000
Because it uses only the one multiply, it takes half the time that MUL 5 and DIV 9 would. Which is good when we're talking turning 300+ clocks into 150.
@customsongmaker
@customsongmaker 1 year ago
So true, typical MUL reg
@ArneChristianRosenfeldt
@ArneChristianRosenfeldt 1 year ago
What are the 8086 and 68000 even doing in all those cycles? The factors are 16 bit, as is the ALU. This only makes sense to me if MUL and DIV were just an afterthought which had to be cheap. Indeed, I have seen large stretches of code without any of these instructions. Typical IBM-PC business code does not MUL or DIV. Maybe once, to calculate taxes.
@jasonknight1085
@jasonknight1085 1 year ago
@@ArneChristianRosenfeldt A hardware multiply requires a lot... no, A LOT... of silicon. We're talking as much as 20 times the silicon of an addition. To that end a lot of early processors left it out entirely, or implemented it in "microcode". In fact that's exactly what the 8086 / 8088 does: it uses microcode, a small bit of code on the processor that pretends to be an opcode and runs on the other instructions of the processor. Even with a hardware multiply the operation is intensive enough to require a lot of clock cycles, as multiply and divide -- even just as integers -- are expensive operations. A MUL reg16 takes anywhere from 25 to 40 clocks depending on the CPU. (The laugh is that the 486 is slower than the 286 or 386.) Thus it wasn't uncommon for stuff like VGA video routines to still use shifts instead:
; ACCEPTS
; AX = Y
; CORRUPTS
; AX
; RETURNS
; DI = Y * 320
xchg ah, al ; == y * 256
mov di, ax
shr ax, 2 ; == y * 64
add di, ax ; di == y * 320
Runs in about half the clock cycles of a hardware multiply right up through the first Pentium. Though still not as fast as an XLAT.
@ArneChristianRosenfeldt
@ArneChristianRosenfeldt 1 year ago
@@jasonknight1085 still does not explain why anybody would need more than 20 cycles in microcode. With a barrel shifter and some look ahead for zeros all those shift special cases would go away. Zero flag running from msb to lsb for early out (like in 386). Silicon usage goes up with the square of bits. Hence it makes sense to implement 8x8 as MUL ( costs like a 64 bit add ), and then run the microcode over this. 4 cycles for 68k like 16x16 -> 32. ARM7 in the GBA has hardware 32x8 and needs 4 cycles for 32x32.
@jasonknight1085
@jasonknight1085 1 year ago
​@@ArneChristianRosenfeldt Do you realize just how much silicon an ARMv4 has that an 8086 does not? We're talking about a ~29,000 transistor and ~5000 gate processor here. That ARM7TDMI (the 7 is misleading, it's a v4) is what, 280,000 transistors and 90k or so gates? What you describe in silicon alone would be a fifth the die space the 8086 used. JUST for the multiple logic alone. Remember, this is an age where a DRAM memory access was 4 clocks per bus width (so a word was 8 clocks on the 8088), and even a simple byte shift by fixed one place took two clocks. Memory shift was 15 + and effective address calculation, and shift reg16 by CL was 8+4n. But yeah, once transistor counts grew these problems went away. By the time of the 80186 you're looking at 26 to 33 clocks, and the 286 it's 13 to 26 clocks. The 186 was 55,000 transistors, the 286 was 134,000. Also worthy of mention is the NEC V20 and V30, drop-in replacements for the 8088 and 8086 respectively, with timings closer to and in some places even faster than the 286. See its flat 21 clocks for all 8 bit multiplies and 26 clocks for 16 bit. It achieved this by having over 60,000 transistors, more than double the stock intel part. The type of infrastructure needed for the microcode you describe by the time you do both 8 bit and 16 bit mul and div is more silicon than many processors even had when the 8086 started being designed in 1976. That's no joke either, look at the 8080 with its whopping 4500 transistors-- and it was considered a chonky boy on release. The Z80 was around 8500, 6502 was 4500. And IF contemporary processors had a multiple opcode (many did not) they were inherently 8 bit with a 16 bit result (in two registers). There's a reason the Motorola 68000 with its amazing fast hardware multiply via its 16 bit ALU needed 68000 transistors (hence its name), an absurd number for the time that took five years to even come close to looking viable outside of big iron use... 
because manufacturing processes for that many transistors reduced yields resulting in elevated prices.
@kenhaley4
@kenhaley4 1 year ago
I'm old enough to remember the days when all calculators were mechanical (sometimes electrically powered, but not always). Unless they were the more expensive variety, they did not support division. I specifically remember my first hand-held 4-function electronic calculator (circa 1960) and being truly amazed that it could divide! There was actually a short pause (maybe 1/8 second) before it displayed the answer to a division problem.
@e3498-v7l
@e3498-v7l 9 months ago
“Division is slow, we’re just gonna ditch that” - I love the approach. Back in university assembly course, instead of dividing by 1000 I just bit shifted by 10 because it’s faster and good enough. My professor wondered about the loss of precision and made me fix it, though. 😂
@abdox86
@abdox86 1 year ago
I remember when I first hit this wall, but mine was bigger. Just imagine: an 8-bit processor with no multiply instruction. It was a PIC MCU, and it nearly boiled my brain to write 64-bit floating-point division and multiplication macros, but it was an enjoyable and hard challenge to tackle.
@cpK054L
@cpK054L 1 year ago
PIC is trash, move to ARM master race.
@abdox86
@abdox86 1 year ago
@@cpK054L Design one and then call it trash. For studying fundamentals PIC MCUs are the best; ARM is more advanced and complex.
@cpK054L
@cpK054L 1 year ago
@@abdox86 I used PICs for undergraduate programs and for the price to performance ARM gave me a better result as a general purpose MCU
@stevenpersoon
@stevenpersoon 1 year ago
I encountered such a thing in a school project with FPGAs. With an FPGA you allocate logic units (AND, OR etc.) to build your own logic circuit. Using a single division operation took like a thousand logic units, which is a lot. Instead we used a division algorithm which got it down to under 200. It probably used multiplication and bit shifting, but I'm not sure anymore.
@phenanrithe
@phenanrithe 1 year ago
The RISC philosophy, love it or hate it 😉 Yes, I saw the interesting multiply-to-divide trick in printf routines to display numbers, where it has to divide by 10 (or powers of 10). I was impressed, it was packed with such little techniques to save cycles (on x86_64).
@linuxization4205
@linuxization4205 1 year ago
Literally the only CISC cpu that is halfway alive and not x86 is the Motorola 680x0 architecture.
@phenanrithe
@phenanrithe 1 year ago
@@linuxization4205 It depends which categories you are willing to consider. There are microcontrollers and CPUs derived for ex. from the 6809, 8051 or 6502 and so on which are borderline or plain CISC. Though many designs, including the x86, only have CISC instructions but a RISC architecture, so it's not as straightforward as when the term was initially coined. 😀 I believe Intel started translating the instructions into micro-instructions in their Pentium series, otherwise they wouldn't have achieved the same level of performance. It was bold and brilliant. Nexgen, which was bought by AMD to save their failing K5 architecture, did the same back then.
@HappyBeezerStudios
@HappyBeezerStudios 1 year ago
@@phenanrithe It was the Pentium Pro, (P6 or i686), which all modern x86 chips still identify as. So yeah, the big x86 CISC architecture uses RISC-like tricks under the hood and only displays itself as CISC on the outside for compatibility's sake. And that whole thing was also supposed to be replaced. After IA-16 and IA-32 should be IA-64, but that went to nowhere just like i960, i860 or i432. And obviously Itanium which, as P7 was somewhat supposed to be the successor in 98
@boptillyouflop
@boptillyouflop 1 year ago
@@HappyBeezerStudios Yeah... what happened was that IA-64/Itanium was a disaster, it tried to optimize all the wrong things and ended up being somehow SLOWER than x86 if you can believe such a thing.
@Kruimeldief
@Kruimeldief 1 year ago
Thank you for not making this video unnecessarily long like so many YouTubers do (8+ minute videos with boring history and whatnot).
@terone
@terone 2 months ago
This reminds me of a Corridor Crew video where they open Photoshop 1.0, see a convolution matrix, and have no idea what it is. One of the assembly programming assignments I had to do in college was a calculator. Not everyone figured out that their algorithm for division was incomplete, but at least they learned that there was an algorithm for it, as it wasn't a standard function.
@1nbp
@1nbp 1 year ago
As a computer engineering student currently learning about CPU design, it’s also important to bring up that new CPU architectures (arm & RISC-V) are shifting towards reduced instruction sets to increase efficiency. This is why division isn’t prioritised, as it can be done in *very* roundabout ways that may still be faster than a physical hardware implementation. As long as you have the fundamental instructions, you can do essentially everything.
@embeddedroom
@embeddedroom 1 year ago
True story. Encounter this frequently when working with code generators! Good job!
@LowLevel-TV
@LowLevel-TV 1 year ago
:D
@_--_--_
@_--_--_ 1 year ago
Also important to know is that both integer divide and FPU divide are *very* slow even on architectures that supports it in hardware. An integer divide is up to 12 cycles on Cortex M4 and 10-50 cycles on modern X86-64 for example, Skylake needed like 40-100 cycles for integer div just a couple generations ago. FPU divide on Cortex M4 is 14 cycles. Compare that to 1 cycle for integer add/sub/mul/shift and 1 cycle for FPU add/sub and 3 cycles for FPU mul on M4. Also noteworthy is that even architectures such as M0/M0+ can implement external division hardware, such as the case for the RP2040 using M0+ architecture which integer division hardware needs only 8 cycles and therefore is actually faster than standard Cortex M4 for integer division.
@ewenchan1239
@ewenchan1239 1 year ago
If integer ADD/SUB/MUL is so fast, then why isn't it used everywhere?
@_--_--_
@_--_--_ 1 year ago
@@ewenchan1239 Can you restate your question, as it doesnt really make any sense. Are you asking why a div instruction isnt substituted with a combination of add/sub/mul instructions? They are, the compiler will try to replace a divide either by a shift, a combination of muls and shifts or adds and shifts, depending on if your target supports mul instructions. However this only works if the divisor is a compile time constant. For other cases the compiler most likely will use div instruction unless you tell it otherwise. For example you could use libdivide instead, which can be faster than hardware divide on a lot of targets. Or you could tell it to use its default inbuild algorithm, which most likely will be quite a bit slower than a hardware divider *if* the compiler cant optimize it down, which in a lot of cases it can.
@ewenchan1239
@ewenchan1239 1 year ago
@@_--_--_ "Can you restate your question, as it doesnt really make any sense." You state: "FPU divide on Cortex M4 is 14 cycles. Compare that to 1 cycle for integer add/sub/mul/shift and 1 cycle for FPU add/sub and 3 cycles for FPU mul on M4." If your statement above is true, then why isn't it used in lieu of using the ACTUAL divide instruction, ALL the time? (You've answered it for the most part in the second half of your reply above.) But for said second half of your reply to be true, you would have to know: a) the divisor is a compile time constant (which means that if it isn't a compile time constant, it won't work) and b) you explicitly tell the compiler what to do/how to handle the DIV instructions, which typically means that you would have to know the specific purpose or application that you are programming for, AND that the difference in the quotient is not significant nor critical enough IF the approximation is sufficient. (i.e. you're programming for yourself rather than for someone else to use where the importance of the epsilon value is not known nor what someone else's tolerance limit for epsilon is)
@_--_--_
@_--_--_ 1 year ago
@@ewenchan1239 "You state: "FPU divide on Cortex M4 is 14 cycles. Compare that to 1 cycle for integer add/sub/mul/shift and 1 cycle for FPU add/sub and 3 cycles for FPU mul on M4."" I mentioned the integer divide instruction a bit before that, lets not mix up integer operations with floating point. As far as I am aware we are talking about integer arithmetics specifically. "a) the divisor is a compile time constant (which means that if it isn't a compile time constant, it won't work)" Wont work is not the right wording here. Yes if the divisor isnt a compile time constant than the compiler cant replace the div instruction with a faster combination of other instructions by default, however thats an optimization problem, it still "works". Also its often a good idea to rethink your code, often a lot of divisions are unnecessary (especially with floating point, but we are talking integer here), the compiler cant do this for you most of the time. This was mostly my intent of the original comment anyway, the compiler cant optimize all your mistakes away and using divides carelessly is often a mistake if performance is of any importance. "which typically means that you would have to know the specific purpose or application that you are programming for" As this discussion centers around low level programming, mostly embedded programming, we are talking about target specific code, where you know exactly what target you are progamming for and its specific compiler. Portable code is not relevant to this discussion. "AND that the difference in the quotient is not significant nor critical enough IF the approximation is sufficient." I am not talking about approximation, thats a good way of potentially optimizing code if its feasible, but thats not part of this discussion. All compiler optimizations or replacement software algorithms will give the exact same output as the hardware divider.
@ewenchan1239
@ewenchan1239 1 year ago
@@_--_--_ "I mentioned the integer divide instruction a bit before that, lets not mix up integer operations with floating point." But you also mentioned the "cost" (in terms of cycles) of FP ADD/SUB and FP MUL on said Cortex M4 as well: "FPU divide on Cortex M4 is 14 cycles." "As far as I am aware we are talking about integer arithmetics specifically." My understanding is that we can be talking about both, given that you can represent floating point numbers as a sum-product of pairs of two integers, can you not? (i.e. the approximation for 1/9 ~= 1 * 2^-4 + 1 * 2^-5 + 1*2^-6 ~= 0.109375) And the mul/shift - it's doing floating point math with integers, isn't it? (unless I am missing something here and/or misunderstanding what @Low Level Learning is talking about in this video) "Wont work is not the right wording here. Yes if the divisor isnt a compile time constant than the compiler cant replace the div instruction with a faster combination of other instructions by default, however thats an optimization problem, it still "works"." The "won't work" here refers to the substitution or the replacement of the slower div operation with the faster combination of other instructions. That's the "work" that you were talking about it doing. Therefore, pursuant to your statement, if you aren't dividing by a compile time constant, then said replacement/substitution won't take place - i.e. the "work" that I am referring to, based on what you wrote, won't take place. Ergo, it "won't 'work'" (the div operation won't get replaced by a "combination of other instructions", which should help to perform the division faster without using the div operation/instruction). "often a lot of divisions are unnecessary" Varies, I would think. "This was mostly my intent of the original comment anyway, the compiler cant optimize all your mistakes away and using divides carelessly is often a mistake if performance is of any importance." Again, I think that it varies.
(depends on what you are doing/why you are using division in the first place) re: mistakes I wouldn't necessarily call using divisions a mistake. I agree that this is where a skilled programmer comes in, but if you are learning or you are an unskilled programmer, then that's something that they will have to learn. Take for example, the inversion of 1/(1/9). If you use exact arithmetic, your E(X) is 9. If you approximate (1/9) ~= 1*2^-4 + 1*2^-5 + 1*2^-6 ~= 0.109375, then 1/0.109375 = 9.142857142857142.... Therefore, your epsilon would be 9.142857142857142... - 9 = 0.142857142857142..., or about 1.6% error. And therefore, the question that you would have to ask yourself is "is that 1.6% error significant or not?" In some cases, it might not be. In other cases, it can be VERY significant. It really depends on what you are doing. (Note that when people advocate for replacing the div operation with a combination of other operations, I have yet to see them automatically state (or calculate) their epsilon and let the user know what epsilon arises from said replacement/substitution. I think that if you are going to implement such a replacement of the div operation with a combination of other operations, where and whenever possible, you SHOULD be REQUIRED to tell the user what your epsilon is, when appropriate.) "As this discussion centers around low level programming, mostly embedded programming, we are talking about target specific code, where you know exactly what target you are progamming for and its specific compiler. Portable code is not relevant to this discussion." That's an assumption on your part though. There is nothing that explicitly states this.
As I said, it is important for people NOT to have it as a key takeaway from watching this video that they are going to implement this everywhere they see because they see it as a speed/performance improvement whilst they fail to recognise that this speed/performance improvement comes at the expense of precision and accuracy. Even for assembly code (cf. GROMACS/Folding@Home), maintaining the precision and the accuracy can still be vital for the program to perform the calculations that is asked of it properly and correctly. "I am not talking about approximation, thats a good way of potentially optimizing code if its feasible, but thats not part of this discussion." The replacement of a division operation with a combination of other operations is almost always an approximation of the actual division operation. (cf. the example that @Low Level Learning shows above with respect to the approximation of 1/9 ~= 1 * 2^-4 + 1 * 2^-5 + 1 * 2^-6 ~= 0.109375) That IS an approximation. That IS what he did and that IS what is shown.
@TomsodM
@TomsodM 28 days ago
I think I've seen this even in x86 assembly. You can learn a lot from a good compiler: division by constant can be substituted with a multiplication, multiplication by (some) constants can be left to the address generator, and I've even seen them converting float to int by adding a large constant to it and throwing out the high-order bytes -- while I get the math, I'm not sure how it's better than the built-in x86 FPU instruction that does the same, but it was still pretty cool.
@mani_mincraft
@mani_mincraft 3 months ago
This is a YouTube classic. I love when YouTube recommends me this video lol
@adailtonjn
@adailtonjn 1 year ago
Amazing content. It would also be interesting to see videos about division of non-constant numbers, the implementation of a floating point unit (showing how to add and multiply floats), and also the implementation of algorithms like sine and square root. All of that comparing C and assembly.
@SolomonUcko
@SolomonUcko 1 year ago
It looks like the library function __aeabi_idiv/__aeabi_uidiv is generally used on ARM to handle division by a non-constant denominator.
@thewhitefalcon8539
@thewhitefalcon8539 1 year ago
Binary long division isn't as complicated as you might think, though it's fiddly
@jlewwis1995
@jlewwis1995 1 year ago
For the multiplication part you could optimize it even further by removing multiplication entirely: just store f-32 in a register, then bit shift left by 2, followed by adding the f-32, which would be the same as multiplying by 5. The divide by 9 is way tougher to optimize though, because it seems like you can't do the same with right bit shifts, unless I'm missing something :/ Also I guess this would only work if you wanted the whole integer part of the result; if you wanted some decimal places too I guess you would need to use fixed point
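The strength-reduction described in this comment — multiply by 5 as a shift plus an add — can be sketched in a few lines (Python for readability; the function name is mine):

```python
def times5(x):
    # 5*x == 4*x + x == (x << 2) + x, so no multiply instruction is needed
    return (x << 2) + x

# sanity check against a plain multiply for (f - 32) values
for f in range(-200, 1000):
    assert times5(f - 32) == 5 * (f - 32)
```

On ARM this is even cheaper than it looks, since the add can fold the shift into its second operand (`add r0, r0, r0, lsl #2`).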
@jhgvvetyjj6589
@jhgvvetyjj6589 1 year ago
Why not combine multiplication by 5 and multiplication by 1÷9 (477218589÷4294967296) to do multiplication by 5÷9 (2386092943÷4294967296)
@computationaltrinitarianism
@@jhgvvetyjj6589 because division in this case is floor division which is not associative (x*y)/z != x*(y/z). For example (2*3)/4 = 6/4 = 1 != 0 = 2*0 = 2*(3/4). EDIT: Ah, I see that is not what you are asking now.
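For what it's worth, the combined constant proposed in this sub-thread does work for non-negative inputs over realistic ranges; here is a sketch checking it against floored integer division (the range bound is my own back-of-the-envelope analysis, not from the thread):

```python
MAGIC = (5 * 2**32 + 8) // 9   # ceil(5 * 2**32 / 9) == 2386092943

def c_from_f(f):
    # one fixed-point multiply-and-shift replaces both the *5 and the /9;
    # valid for non-negative (f - 32), i.e. unsigned arithmetic
    return ((f - 32) * MAGIC) >> 32

assert MAGIC == 2386092943
for f in range(32, 10000):
    assert c_from_f(f) == (f - 32) * 5 // 9
```

The rounding error in the magic constant only matters for very large inputs (roughly beyond 2^32/7), far outside any temperature range, which is why the floored results match exactly here.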
@vmarcelo49
@vmarcelo49 1 year ago
Damn you guys are so nerd
@thewhitefalcon8539
@thewhitefalcon8539 1 year ago
this is how binary multiplication already works!
@cpK054L
@cpK054L 1 year ago
Wouldn't your register just end up trashing your bits if you go beyond scope? xxxxooooxxxxoooo >> 6 bits for example, would you not end up with 000000111100? Then if you were to shift left it'd come back as zeros?
@CallousCoder
@CallousCoder 1 year ago
Been there done that on 6502, z80 a whole lot. And fixed point is the easiest and fastest approach. Also many CPUs don’t have multiply, they are both mathematical monsters 🤪
@gdclemo
@gdclemo 1 year ago
One method I've seen on 6502 is to use a lookup table of squares; (x+y)^2 - (x-y)^2 == (x*y*4). Useful for 8-bit * 8-bit -> 16-bit multiplies. You can do the divide by four with shifts or just build it into the table.
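The quarter-square identity in this comment can be sketched directly; note that real 6502 versions typically store n²/4 so the table entries stay small, but the idea is the same:

```python
# table of squares for every possible sum of two 8-bit operands (0..510)
SQUARES = [n * n for n in range(511)]

def mul8(x, y):
    # x*y == ((x+y)^2 - (x-y)^2) / 4: two lookups, one subtract, one shift
    return (SQUARES[x + y] - SQUARES[abs(x - y)]) >> 2

# exhaustive check over all 8-bit operand pairs
for x in range(256):
    for y in range(256):
        assert mul8(x, y) == x * y
```

Since (x+y)² - (x-y)² is always a multiple of 4, the shift loses nothing and the result is exact.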
@amigalemming
@amigalemming 1 year ago
Some architectures like the Motorola 56k DSP have an instruction for doing one step of a division. If you call it often enough it will yield the correct quotient, still sufficiently fast. This can be a good compromise between speed and saving chip space.
@elkanaajowi9093
@elkanaajowi9093 1 year ago
Now, what kind of function is this? I thought I was an average programmer, but this seems to be in a league of its own with push, sub sp, pop... Hats off to you guys for abstracting this jargon away for us, so we can just write our plain code peacefully. And Google recommending this to me is also strange; maybe it knows what I should be knowing which I still don't know.
@kool2btrue
@kool2btrue 1 year ago
Learned this at university while studying computer engineering. That is one of the advantages of a 4 year education over general programming bootcamps. You get into the specific details of why things happen at a low level.
@jaxcoop10
@jaxcoop10 1 year ago
I'm learning this rn in my computer systems class and I still can't wrap my head around it
@housemana
@housemana 1 year ago
in a vacuum yeah, who wouldn't want to? but realistically, you wasted a ton of your money and life for something a "general programming bootcamp" could have gotten you into your field a LOT faster for a LOT less money. Funny how that works. but in order to suck off the tit that is high level education grant money, one must pay the blood price of an undergrad + grad degree. so funny how that works. :")
@kool2btrue
@kool2btrue 1 year ago
@@housemana No chance anyone out of a coding boot camp could get my type of job. It requires quite a bit of hardware knowledge as well as low level programming knowledge. Let alone the amount of research we have to do, including advanced mathematics, to invent tech that is not currently in use. This is just one of many upon many things covered in an extended program. In case you are not familiar with software development, there are many branches and specialties that require varying levels of knowledge. Usually the more difficult jobs tend to have higher requirements, but also have higher pay ceilings. Except maybe game development (High standards, shit pay and low quality of life). I'm not saying this to shit on people who go to boot camps. In fact I have a ton of friends who have advanced their careers and greatly benefited from these. But if your goal is to be a software dev, a 4 year university will undoubtedly provide you with a more thorough education and better job prospects as a starting point. I just see a lot of videos essentially spreading your same message, and it is just not accurate.
@ameliabuns4058
@ameliabuns4058 1 year ago
@@kool2btrue why can't you just read the textbooks at home? I am a self-taught software engineer and I just read books on this stuff
@KashTheKingYT
@KashTheKingYT 9 months ago
@@kool2btrue I bet I could get your job
@tommyhuffman7499
@tommyhuffman7499 1 year ago
Learned more here than in most of my CS courses
@LowLevel-TV
@LowLevel-TV 1 year ago
Hell yeah
@jhgvvetyjj6589
@jhgvvetyjj6589 1 year ago
Meanwhile x86 has native sine, cosine, and tangent, and can convert between extended precision binary floating point and 18 digit decimal integers
@andrewdunbar828
@andrewdunbar828 1 year ago
680x0 had those too in the FPU for '020 and '030, but they removed them again for the '040 because they're slow even in hardware and add so much complexity and use up so much silicon.
@MrWaalkman
@MrWaalkman 10 months ago
At Saturn we ran DEC VAX computers and they ran GE Cimplicity for doing our SCADA screens and getting data from the PLCs in the plant. Apparently it never occurred to GE that we might need to do math at some point, and they didn't bother to include that capability in Cimplicity. IT's solution was to use seven GE Series Six PLCs to do the math for Cimplicity. They were affectionately known either as the "Seven Dwarfs" or "VAX math coprocessors". :)
@doger944
@doger944 1 year ago
When I was first taught about logic gates and how to make a full adder circuit, I went on an absolute rampage trying to figure out how pure logic can be used to perform mathematical functions. Subtraction is easy once you know how to add, and multiplication is just adding with extra steps, so I figured them out quite quickly, but I remember spending weeks on division. I ended up figuring out how the division process worked, but was left with the problem that: a) like half the results of dividing in binary end with recurring decimals b) Three decimal places would only give my answer a resolution of 1/8th This resulted in some divisions appearing to give the same answer, because the difference between them was too small to see. So I thought "I know, I'll just represent the remainder as a fraction, that should be easier." The problem with that is that nobody wants to be told that: 27÷18=27/18 So I would have to find some way to simplify the fraction. Which is not easier at all. As it turns out, the only way to find the prime factors of a number is to divide it by every prime number smaller than half of the original, smallest to largest, and see if the result is a whole number. Which means for each digit you add to the input the list of prime numbers needed to check grows exponentially. And then you have the added problem that once you find that a number is divisible by one of the primes, you then have to check whether you can divide it by that prime again, and keep checking indefinitely until you get a remainder, and while that's easy enough to show in a flow chart, making a circuit to do that is a whole different matter. The "solution" I decided on was to make a circuit that only divides three digit binary numbers, and have the remainder be displayed up to three decimal (binamal?) places as long as there is no "loss" (I.e. There is no data carried off the end of the divider circuit).
If there is loss, the result is displayed as a fraction which can then be simplified. The reason this is easier to simplify is that now the only fractions that will ever need simplification are 3/6 and 2/6. This means that my simplification circuit only needs to check if the numerator and denominator are both even, and if so divide them by two, and if not, check if they're divisible by 3, and if so divide them. If they are neither divisible by 2 nor 3 they are displayed as they are, since I know from my meta-knowledge of all possible divisions of three digit binary numbers that any other fraction I could encounter would either be already in its simplest form, or be possible to display to three decimal places without loss. This wasn't really a solution, but I'm still pretty proud that I made it at least look like it can divide.
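One note on the factor-hunting problem described above: simplifying a fraction doesn't actually require prime factorization. Euclid's algorithm finds the greatest common divisor with nothing but comparison and subtraction, which maps nicely onto adder-only hardware. A sketch, for positive integers:

```python
def gcd(a, b):
    # subtraction-only Euclid: no divide (or even shift) instruction needed
    while a != b:
        if a > b:
            a -= b
        else:
            b -= a
    return a

def simplify(num, den):
    # the final divides by g could themselves be done by repeated subtraction
    g = gcd(num, den)
    return num // g, den // g

assert gcd(27, 18) == 9
assert simplify(27, 18) == (3, 2)   # so 27/18 reduces to 3/2
```

The subtraction-only form is slow for very unbalanced inputs, but for small operands like the 3-bit case above it stays tiny.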
@GoofyGeekGang
@GoofyGeekGang 9 months ago
0:21 Or you could put C= (F-32) * 0.5555556 because I’m pretty sure that 5/9 will never change
@GoofyGeekGang
@GoofyGeekGang 9 months ago
Print (Var=C)
@GoofyGeekGang
@GoofyGeekGang 9 months ago
Var=C
@GoofyGeekGang
@GoofyGeekGang 9 months ago
Lol
@thisisreallyme3130
@thisisreallyme3130 1 year ago
YES - more of this please! I'm learning 6502 (cc65 but no stdlib emulation), and this kind of trickery totally relates. 😊
@54egg
@54egg 1 year ago
Lookup tables are your friend on 6502 where practical. Check this video, which has a link to a page (as I recall) that shows how to do an 8x8 bit multiply on 6502 with 4 table lookups, 2 8-bit add/subtracts, and a 16-bit subtract in under 40 clock cycles: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-WEliEAc3ZyA.html
@paulwomack5866
@paulwomack5866 1 year ago
If you're gonna use "SCARY FIXED POINT" to do a division, you may as well fold the multiply by five in, and save a whole multiply instruction. So instead of setting up 1/9 as your magic constant, set up 5/9 (just a different constant). But, yeah, depending on your CPU, some things need to be coded using multiple instructions. Might be multiply, might be divide, might be floating point. I worked on a graphics system coded in 68000/C where all pixels were held in fixed point 24.8 (a whole number of 256ths of a pixel, if you like)
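A minimal sketch of the 24.8 fixed-point scheme mentioned above (helper names are mine): the low 8 bits hold 256ths of a pixel, addition works unchanged, and a multiply just needs one renormalizing shift.

```python
FRAC_BITS = 8  # 24.8 fixed point: the low byte holds 256ths of a pixel

def to_fixed(x):
    return int(round(x * (1 << FRAC_BITS)))

def to_float(f):
    return f / (1 << FRAC_BITS)

def fx_mul(a, b):
    # the raw product has 16 fractional bits; shift right by 8 to renormalize
    return (a * b) >> FRAC_BITS

a, b = to_fixed(1.5), to_fixed(2.25)
assert to_float(a + b) == 3.75           # addition needs no adjustment
assert to_float(fx_mul(a, b)) == 3.375   # 1.5 * 2.25, exact in 24.8
```

On a 68000 this keeps everything in plain integer registers, which is exactly why the format was popular for sub-pixel coordinates.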
@veroxid
@veroxid 1 month ago
You know someone's a low level programmer at heart the moment they look at the option "use a library" and immediately think "that's cheating," rofl.
@xaiano794
@xaiano794 7 months ago
Most Americans don't even know that the Fahrenheit scale is 0-96 and falsely assume it's 0-100, because they underestimate what a dumb scale it is. He literally used the coldest temperature in his home town as 0 and the average human body temperature as 96, which, as you probably know, he got wrong.
@RandomGeometryDashStuff
@RandomGeometryDashStuff 1 year ago
what about dividing by number not known at compile time?
@LowLevel-TV
@LowLevel-TV 1 year ago
You can probably write an algorithm to develop the known fixed point multiplier, but I’ll have to think about that one :)
@richardericlope3341
@richardericlope3341 1 year ago
Multiple subtractions in a loop.
@KeinNiemand
@KeinNiemand 1 year ago
I guess you just have to use an algorithm to do division.
@easylemon6640
@easylemon6640 1 year ago
Sooo... you need to divide, to... divide????
@HansStrijker
@HansStrijker 1 year ago
If you need to do many chained calculations containing different operators, it can be faster to do them as rationals, and only do the final division at the end. Summation and division will end up being three operations minimally, but it can pay off quite a bit for lower complexity with other operations. Though it can quickly run into rollover issues, so you might need to keep an eye out for that, and apply a simplify step (gcd algorithm) where necessary.
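The deferred-division idea above can be sketched with bare (numerator, denominator) pairs; a gcd-based simplify step (not shown) would be applied whenever the terms threaten to overflow:

```python
# keep values as (numerator, denominator) pairs; divide only once at the end
def r_add(a, b):
    # a/b + c/d == (a*d + c*b) / (b*d): three multiplies and an add
    return (a[0] * b[1] + b[0] * a[1], a[1] * b[1])

def r_mul(a, b):
    # multiplication is just two multiplies, numerator and denominator
    return (a[0] * b[0], a[1] * b[1])

# (1/3 + 1/6) * 2 == 1, computed with no division until the final step
num, den = r_mul(r_add((1, 3), (1, 6)), (2, 1))
assert num / den == 1.0
```

On hardware without a divider this trades one divide per operation for a single divide at the very end of the chain.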
@HansLemurson
@HansLemurson 1 year ago
I knew the "Multiply and Shift" technique, but I hadn't thought to use the Overflow Register deliberately to do an implicit shift.
@chrisnatale5901
@chrisnatale5901 1 year ago
Nice solution! I remember reading a book back in the 90's, Michael Abrash's "Graphics Programming Black Book", that detailed workarounds for faster division performance on the 486's and Pentiums of the day.
@deusexaethera
@deusexaethera 1 year ago
Yeah, pretty much your only alternative would be to build a nested loop that adds the divisor to a running total, counts the number of times the loop runs before the total reaches the dividend, then moves the decimal point on the divisor and the loop counter, and repeats the whole process until you run out of space to store decimal places.
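That nested-loop scheme can be sketched directly (loop structure per the comment; the function name is mine):

```python
def divide_digits(dividend, divisor, places):
    # integer part: count how many times the divisor fits, by subtraction
    q = 0
    while dividend >= divisor:
        dividend -= divisor
        q += 1
    # each decimal place: "move the decimal point" and count again
    digits = []
    for _ in range(places):
        dividend *= 10
        d = 0
        while dividend >= divisor:
            dividend -= divisor
            d += 1
        digits.append(d)
    return q, digits

assert divide_digits(1, 8, 3) == (0, [1, 2, 5])   # 1/8 = 0.125
assert divide_digits(45, 9, 2) == (5, [0, 0])     # 45/9 = 5.00
```

It's dead simple but slow: the inner loop runs up to nine times per digit, and up to `dividend/divisor` times for the integer part.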
@Quantris
@Quantris 1 year ago
binary search is also an alternative
@modolief
@modolief 1 year ago
This was super-fun, thanks! Kinda reminds me of the mysterious code from Doom that does fast inverse square root.
@NostraDavid2
@NostraDavid2 1 year ago
That was Quake 3, but agreed.
@paulwomack5866
@paulwomack5866 1 year ago
This is way more trivial (or fundamental) than that (amazing) trick. Reminiscent though.
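For reference, the Quake III trick this thread is recalling, transliterated into Python with `struct` doing the float-to-int bit reinterpretation:

```python
import struct

def q_rsqrt(x):
    # reinterpret the float's bits as an integer, apply the magic constant,
    # reinterpret back, then refine with one Newton-Raphson step
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    return y * (1.5 - 0.5 * x * y * y)

for x in (0.25, 1.0, 2.0, 100.0):
    assert abs(q_rsqrt(x) * x ** 0.5 - 1.0) < 0.002  # within ~0.2% of 1/sqrt(x)
```

Same family of idea as the reciprocal-multiply division: replace an expensive operation with bit manipulation plus a cheap refinement.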
@fabiangiesen306
@fabiangiesen306 1 year ago
Dividers don't need to be big or complicated; a basic shift-and-subtract divider is little more than a wide adder, it's just slow to do it that way - depending on the implementation, typically something like 1-2 cycles per quotient bit, so for a 32:32 bit divide, it would take something like 32-64 cycles. Pretty much the same hardware can also do multiplication, just as slowly. These are usually the first multiplication/division algorithms you learn about when you study HW for arithmetic circuits. This long runtime is awkward to deal with; having operations that long-running is a pain if you try to fit them in a simple microcontroller design that otherwise has nothing of the sort, and also makes it harder to guarantee timely servicing of interrupts etc. For this reason early RISC CPUs either didn't have any multiplication/division support at all or effectively treated it like a coprocessor which sometimes (e.g. when interrupts happened) could "lose" results and then you'd have to retry. With ARMv1 in particular, the instruction set was such that you could just implement the basic repeated-doubling-and-add multiplication and shift-and-subtract division operations in software. That would typically take something like 3-4 cycles per bit for the straightforward way, which is 3-4 times slower than dedicated hardware would be, but it avoids all the complexity and consequences of dealing with long-running arithmetic operations. ARMv2 added multiplication instructions, and did get a better multiplication algorithm than software shift-and-add - but it still only handled 2 bits of the product per cycle, so (since it was a 32-bit CPU) multiplications could take as much as 16 cycles! This used modified Booth recoding, which is a staple in multipliers to this day, but otherwise pretty much just ran on the regular ALU. Over time this got improved further. 
The problem with division is that it's not as gradual: the 1-bit-at-a-time algorithms are simple, there are 2-bit-at-a-time algorithms that are already a lot more complicated (and need a fair amount of tables), and then there's yet another giant complexity jump if you want faster than that. So historically, there was a reason to be wary of including either multipliers or dividers, and if you're gonna do one of the two, multipliers are used a lot more often and also have a lot more options for implementers. This is especially important because most divisions (and remainders) in real-world code are by constants and for those, we can use the multiply-with-reciprocal method which just requires a decently fast multiplier and not a divider. Real-world integer divisions by non-constants are not unheard of, but they're rare enough that to this day, if you're going to do a simple chip, it's doubtful whether a dedicated hardware divider is worth it; software division routines are usually good enough for those cases, and especially when you have a fast multiplier available (which enables better algorithms). The really fun part is that for a long time, until fairly recently, Intel x86/x64 CPUs didn't have a real integer divider either. Sure they have an integer divide _instruction_, but pretty much everything Intel released between the original 1993 Pentium and the Ice Lake microarchitecture (released in 2019) does not have an integer divide _unit_. All these CPUs have a floating-point unit with a floating-point divider, and would use that floating-point divider for integers too. (Running a short, or in some cases not-so-short, microcode routine to do so.) The reasons for this are similar to why ARM Cortex cores don't have it: this isn't particularly fast, but integer divide instructions are rare in practice, and a dedicated integer divider just wasn't worth it. They did change their mind on that eventually as mentioned, but it took them over 25 years!
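For concreteness, here is the basic restoring shift-and-subtract divider described above, producing one quotient bit per step, sketched in Python:

```python
def udiv32(n, d):
    # restoring shift-and-subtract: one quotient bit per iteration, which is
    # why a 32-bit divide done this way costs on the order of 32+ cycles
    assert d != 0
    q, r = 0, 0
    for i in range(31, -1, -1):
        r = (r << 1) | ((n >> i) & 1)   # bring down the next dividend bit
        q <<= 1
        if r >= d:                      # trial subtract succeeded
            r -= d
            q |= 1
    return q, r                          # quotient, remainder

assert udiv32(100, 9) == (11, 1)
assert udiv32(0xFFFFFFFF, 7) == (0xFFFFFFFF // 7, 0xFFFFFFFF % 7)
```

As the comment notes, the hardware for this is little more than a wide adder plus shift registers; the cost is entirely in the iteration count.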
@piotrcurious1131
@piotrcurious1131 1 year ago
What is a pity, though, is that almost no one implements "fuzzy" (analog or asynchronous-digital backed) divide instructions that could be used in DSP/AI/GFX etc. cases where error is not a problem. There are neat asynchronous methods exploiting propagation/delay effects which take just a little silicon and are fairly precise. Of course most people argue that divide can be emulated or a real hardware unit used, but then it cannot do deep pipelines, SIMD, or other bulk matrix transformations. DSP and GFX suffer most: not only do people need beefy chips to do fairly simple operations, but power consumption hurts too.
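One classic "multiplies only" approach in this spirit is Newton-Raphson reciprocal refinement, which pipelines and vectorizes in ways a sequential divider can't; an illustrative sketch (the seed choice and iteration count are my own, not from the comment):

```python
def approx_div(a, b, iters=4):
    # seed 1/b with a power of two (cheap in hardware: just an exponent),
    # then Newton-Raphson y <- y * (2 - b*y); correct bits double each pass
    y = 1.0 / (1 << b.bit_length())
    for _ in range(iters):
        y = y * (2.0 - b * y)
    return a * y

assert abs(approx_div(100, 9) - 100 / 9) < 1e-3
```

Fewer iterations give a rougher but faster answer, which is exactly the accuracy-for-throughput trade the comment is describing for DSP/graphics workloads.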
@SerunaXI
@SerunaXI 7 months ago
It's been well over a decade since I took a basic Boolean logic class, and I remember the process of subtraction being complex compared to addition. Now consider that division can be treated as multiple instances of subtracting the same value until there's nothing left to subtract; well...