
The fastest matrix multiplication algorithm 

Dr. Trefor Bazett
420K subscribers · 286K views
Published: 22 Aug 2024

Comments: 231
@planaritytheory · A year ago
It's important to note that, although additions _are_ considered cheaper than multiplications, what we _aren't_ doing is making a tradeoff between the two operations (accepting many more additions to save a few multiplications). You didn't explicitly say that's what's going on, but someone could get the impression from this video that that's the deal---I myself used to think that. Instead, the number of additions, multiplications, and total number of operations are all lower for Strassen's algorithm, and it's because the number of "multiplications" at each step is counting the number of recursive calls that are made.
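A minimal sketch of this point (my own, not from the video): tally the operation counts implied by the two recurrences, assuming the recursion goes all the way down to 1x1 blocks. Strassen always uses fewer multiplications, and for large enough n its addition count and total count drop below the naive ones as well.

```python
def naive_counts(k):
    # n = 2**k; schoolbook algorithm
    n = 2 ** k
    return n ** 3, n * n * (n - 1)         # (multiplications, additions)

def strassen_counts(k):
    # n = 2**k; Strassen recursed down to 1x1 blocks
    if k == 0:
        return 1, 0                         # a 1x1 product is one scalar multiply
    m, a = strassen_counts(k - 1)
    half_block = 4 ** (k - 1)               # entries in a half-size block
    return 7 * m, 7 * a + 18 * half_block   # 7 recursive calls, 18 block add/subs

for k in (2, 5, 10, 15):
    print(f"n = 2^{k}: naive {naive_counts(k)}  strassen {strassen_counts(k)}")
```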
@DrTrefor · A year ago
Thanks for the clarification. Indeed, I left the induction proof for additions as an "exercise" so this got glossed over.
@samueljehanno · A year ago
69th
@gregorymorse8423 · A year ago
Considering that addition is O(n) and multiplication is O(n log n) in the best case, given a two's complement number system, treating multiplication and addition as like operations is completely incorrect, especially as the values in the matrix grow larger. So it's not just the recursion that is relevant.
@destiny_02 · A year ago
If I'm not wrong, multiplication is cheaper than addition in the case of floating point numbers on a modern CPU.
@gregorymorse8423 · A year ago
@Jotaro Kujo just because it is simpler to implement floating point multiplication doesn't make it cheaper. It's not faster even if addition requires resources to shift the smaller argument and the result; there is still an integer multiply or add in the middle. My point was only for unbounded big integers as well. Treating addition and multiplication as like ops when using a finite bit width like 32 or 64, be they integers or floats, does make sense, even if their clock cycles are slightly different. The story just gets more complicated given that a fused multiply-add (FMA) is used in modern hardware to improve accuracy and speed, and that applies to complex-number and matrix multiplication very specifically in fact. But unbounded data types must treat addition and multiplication separately.
@Mattes_H · A year ago
I think it should be noted that all of these faster algorithms are so-called galactic algorithms, meaning the constant that is hidden by the phrase "order of" is so big they become useless for any practical application. In practice the challenge is more about efficient and parallel implementation. In fact GEMM, the subroutine most often used for matrix-matrix multiplication, usually implements either the naive or a Strassen-based algorithm but uses all the hardware tricks available on modern processors to get its high performance.
@maxrush206 · A year ago
nice icon lol
@vaff69420 · A year ago
I don't exactly understand your comment. Why aren't library developers implementing these algorithms if they are simply faster than the naive or other slower algorithms?
@pablote325 · A year ago
@@vaff69420 Because of the constants. Take for example the last algorithm, the one that does O(n^2.37285) multiplications. What this means is that the number of multiplications is roughly going to be Mult = A * n^2.37285 + B, for some A and B, usually unknown; that's exactly what the big O notation hides. So like the previous comment said, these weird galactic algorithms always have HUGE constants, so it's not even practical to use one to multiply a 64x64 matrix (I'm just putting numbers here, but that's usually what happens; the real benefits of using these algorithms show up when n is literally an astronomical number xd)
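A toy illustration of the constants problem, with a completely made-up constant factor (the real ones are unknown and far larger): even a modest overhead pushes the crossover point out to enormous n.

```python
def ops_naive(n):
    return n ** 3                       # schoolbook multiplication count

def ops_galactic(n, constant=1000.0):   # 'constant' is hypothetical, for illustration
    return constant * n ** 2.37286

for n in (100, 10_000, 100_000, 1_000_000):
    winner = "naive" if ops_naive(n) < ops_galactic(n) else "galactic"
    print(f"n = {n:>9}: cheaper in raw operation count -> {winner}")
```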
@porschepanamera92 · A year ago
@@pablote325 ... and in that case it might be much faster but still extremely slow using current hardware?
@luckyw4ss4bi · A year ago
@@porschepanamera92 I think the idea he's hinting at in general here is that the size of matrices you'd have to be multiplying to gain a performance boost from the current fastest algorithm is too large for modern computers to handle, anyway.
@michaelsaba6769 · A year ago
Just read about this algorithm in CLRS. It's a perfect teaching tool to show why constant factors might not matter in asymptotic notation, but how they often do matter when working with recurrences. Like in this case, where 7 multiplications are preferred over 8, even though at each level that looks like a mere constant-factor saving. Strassen's yields a Theta(n^log2(7)) time complexity, while the standard divide-and-conquer approach yields Theta(n^3) time.
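For reference, the two recurrences behind these bounds (my restatement via the master theorem, not quoted from CLRS):

```latex
\[
T_{\text{naive}}(n) = 8\,T(n/2) + \Theta(n^2) \;\Rightarrow\; \Theta\!\left(n^{\log_2 8}\right) = \Theta\!\left(n^3\right),
\qquad
T_{\text{Strassen}}(n) = 7\,T(n/2) + \Theta(n^2) \;\Rightarrow\; \Theta\!\left(n^{\log_2 7}\right) \approx \Theta\!\left(n^{2.807}\right).
\]
```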
@user-pk4hn1uz1k · A month ago
Thanks!
@frfr1022 · A year ago
It reminds me of this algorithm for natural-number multiplication which is theoretically a lot faster than any other algorithm we know of, but only becomes worth using at ridiculously huge numbers. I can't remember what it's called, but I think it was mentioned in 3b1b's latest video on convolutions
@DrTrefor · A year ago
If I recall correctly, some of the ideas of large number multiplications are related to ideas from this very Strassen algorithm.
@adamel-sawaf4045 · A year ago
@@DrTrefor Strassen also worked on a multiplication algorithm for numbers (the Schönhage-Strassen algorithm), so that would make sense!
@oyaoya2468 · A year ago
The Karatsuba algorithm, is this what you mean? Because it utilizes the same technique as the Strassen algorithm
@magicflour · A year ago
I had a class where a whole third of it was dedicated to these linear algebra algorithms. The most I learned really was "optimize the hardware and optimize with hardware" to get good results lol. This was a computer engineering class so it made sense.
@DrTrefor · A year ago
Ha, yes that has been my sense too even as mathematicians have fun with asymptotic limits:D
@marcotroster8247 · A year ago
Finally someone said it so that I don't need to comment it myself 😂
@alphalunamare · A year ago
Fascinating! It brings back old memories. In 1977 I was mulling over the problem and managed, over a few days, to create an algorithm for 3x3 multiplications requiring only 24 multiplications and 66 additions. It reduced to Strassen's method in the 2x2 case but was no improvement, as that would have required 21 multiplications. I did this longhand over yards of line printer paper. Unfortunately a Canadian published an algorithm with 23 multiplications and 96 additions. It pissed me off no end, especially as it had no information regarding its derivation. (If I could trade 1 multiplication for 29 additions I could improve, but I left it on the shelf.) Sulking I guess. Over the years I wondered if it was just a novelty, then I realised in an object-oriented sense that the savings are manifestly more than for just simple arithmetic multiplications. An intriguing thought is whether the intermediate Pn multiplications of 'objects' have a real meaning in the physics of the model under analysis that perhaps could shed some light? Perhaps in the geometric algebra treatment of spacetime?
@ThunderAppeal · A year ago
No one cares.
@alphalunamare · A year ago
@@ThunderAppeal lol says who?
@planaritytheory · A year ago
ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-u2yUSvqH0lc.html
@boltez6507 · 9 months ago
@@ThunderAppeal I care
@cbunix23 · A year ago
I wrote a cache friendly matrix multiplication algorithm that improves performance by 40% when matrices are very large; basically, the multiplications and additions are the same, but the order is done in a way that respects cache memory.
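A minimal sketch of that idea (my own, not the commenter's code): the i-k-j loop order does the same multiplications and additions as the textbook i-j-k order, but the inner loop sweeps B and C row by row, which is the cache-friendly direction for row-major storage. (In pure Python this only illustrates the access pattern; the 40% kind of gain shows up in compiled implementations.)

```python
def matmul_ikj(A, B):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            a_ik = A[i][k]             # reused across the whole inner loop
            row_b, row_c = B[k], C[i]
            for j in range(p):         # contiguous, cache-friendly sweep
                row_c[j] += a_ik * row_b[j]
    return C
```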
@kruksog · A year ago
I remember learning the Karatsuba algorithm for multiplication. It is also computationally cheaper than the standard way of multiplying numbers. It had a very similar feel to this. Cool stuff. Thanks!
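For anyone curious, a small sketch of Karatsuba (my own, for non-negative integers): three recursive multiplications instead of four, the same counting trick Strassen plays with seven instead of eight.

```python
def karatsuba(x, y):
    if x < 10 or y < 10:
        return x * y
    m = max(x.bit_length(), y.bit_length()) // 2
    hi_x, lo_x = x >> m, x & ((1 << m) - 1)
    hi_y, lo_y = y >> m, y & ((1 << m) - 1)
    z0 = karatsuba(lo_x, lo_y)
    z2 = karatsuba(hi_x, hi_y)
    z1 = karatsuba(lo_x + hi_x, lo_y + hi_y) - z0 - z2   # both cross terms, one multiply
    return (z2 << (2 * m)) + (z1 << m) + z0

assert karatsuba(123456789, 987654321) == 123456789 * 987654321
```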
@adamel-sawaf4045 · A year ago
This also reminded me of Karatsuba! I learned it in my algorithms class. I implemented it in a C++ program for extra credit.
@eti-iniER · A year ago
As a first year CS student, I'm completely blown away by the fact that humans know and understand concepts like this. Amazing video, btw. Cue the mild depression
@discreet_boson · A year ago
Please make more computational math videos like this! Programming documentation never explains the math like you do
@DrTrefor · A year ago
I'd love to!
@wcdeich4 · A year ago
Interesting! I previously knew some algorithms could do matrix multiplication faster for large sparse matrices, but those algorithms are actually slower for large dense matrices.
@adamel-sawaf4045 · A year ago
Great vid! Just as informative as it needed to be, while sparking inspiration to learn more. One thing I think you should’ve mentioned was how much the space complexity grows for the trade-off in speed.
@ThunderAppeal · A year ago
No one cares.
@airsquid8532 · A year ago
The amount of power I wield is unrivaled with this knowledge, thank you for yet another interesting video full of passion!!
@laurenpinschannels · A year ago
I'm curious what you're researching!
@prashantsingh6370 · A year ago
As someone who absolutely loved solving matrices & determinants problems in higher secondary mathematics, way back in 2011-12, a huge thumbs up for the excellent video. 👍
@DrTrefor · A year ago
Thanks for sharing!
@freshrockpapa-e7799 · 9 months ago
enjoying computations doesn't make you special in any way, there was no reason to mention that
@L4ZY2137 · A year ago
There is a thing called a "hierarchical matrix" that lets you convert a matrix to a more "compact" form in which many operations like inversion or multiplication can be computed in almost O(n) time. If I remember correctly, if we ignore the constant (k) that controls the approximation, the complexity is about O(n * (log n)^2)
@cmilkau · A year ago
One important question with this algorithm: how many additions does the recursive algorithm use? Because in the end, we want a faster algorithm, not just a reduction in elementary multiplications. It's also a bit counterintuitive that this strategy *reduces* the total number of additions asymptotically.
@rezamiau · A year ago
Absolutely brilliant! One of the major problems that engineers like me are dealing with is time-consuming eigenvalue calculations, because of the increased precision needed due to ill-conditioning of a matrix. Please let us know if there is a good solution to it. Thank you so much.
@ddognine · A year ago
Interesting video of an area of active research of which I was unaware. I would be interested in a similar video covering the state of the art in matrix inversion which I believe is much more problematic for large matrices than multiplication and is typically the first operation before any multiplication.
@quarkonium3795 · A year ago
One thing to keep in mind is that most of these improvements past the Strassen algorithm are theoretical. Most only improve on the Strassen algorithm for extremely large matrices, so large that these algorithms will probably never be practical
@christossymA3A2 · A year ago
Man, it feels almost illegal for such great knowledge to be available for free
@DrTrefor · A year ago
ha!
@noot3316 · 8 months ago
I just finished my first term of Engineering at UVic (you were an amazing calc prof by the way!) and I can FINALLY understand how cool this video is!
@DrTrefor · 8 months ago
Haha that’s awesome, thank you!
@flexeos · A year ago
I am not sure that the pure optimization of operation counts is very relevant. The limiting factor is often moving the data in and out of memory. What you have to optimize is how many operations you can do for each load.
@Miguel_Noether · A year ago
Given the same amount of memory and the same allocation speed, doing faster operations is going to be the most efficient way to optimize everything else (that's all the hype of quantum computing)
@johnniefujita · A year ago
Single operations are not really what is expensive and increases algorithm complexity, the "O"s; complexity is directly related to nested computations. Usually we want an algorithm to be linear or linearithmic. Polynomial can be tractable, but never exponential or factorial; those cannot scale. Also algorithms may have different worst, average and best case scenarios for different implementations, and sometimes, like in the case of sorting algorithms, the best choice can vary based on how well organized the data already is.
@changjeffreysinto3872 · 9 months ago
Solution to 7:17: Let f(n) be the number of scalar additions used for the matrix product of 2^n x 2^n matrices. Since we compute 7 products of smaller matrices of size 2^(n-1) x 2^(n-1), these contribute 7f(n-1) additions. We also compute sums and differences of these 2^(n-1) x 2^(n-1) matrices 18 times in total, each costing 4^(n-1) scalar additions, i.e. 18*4^(n-1) = (9/2)*4^n overall. Therefore the recurrence is f(n) = 7f(n-1) + (9/2)*4^n, a first-order linear difference equation. Substituting f(n) = 7^n * f_n gives f_n = f_(n-1) + (9/2)*(4/7)^n, so f_n = (9/2) * sum_{k=1..n} (4/7)^k, which is less than (9/2) * sum_{k=1..inf} (4/7)^k = 6. Finally, f(n) = 7^n * f_n < 6 * 7^n.
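The same calculation in compact form (my own restatement of the comment above):

```latex
\[
f(n) = 7\,f(n-1) + 18\cdot 4^{\,n-1},\quad f(0)=0
\;\;\Longrightarrow\;\;
f(n) = 6\left(7^{\,n} - 4^{\,n}\right) = O\!\left(7^{\,n}\right) = O\!\left(N^{\log_2 7}\right)\ \text{for}\ N = 2^{n}.
\]
```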
@MrHaggyy · A year ago
A huge deal in practical matrix multiplication is taking the naive approach: look for redundant operations, look for dependencies in the computation, group up independent operations and parallelize them. A 3x3 matrix on a GPU would schedule every single one of those 27 multiplies on one 32-core chunk. Then you would copy the data into another 32-core chunk doing the addition. You could also do the additions of three matrix multiplies in one execution of 32 additions. That's the reason why Strassen is such a big deal: 8 multiplies and 4 additions fit beautifully into 32-core chunks. And once you have set up your execution pipeline you can feed it any matrix. If a rank is odd the CPU will make it even by adding zeros before passing it to the GPU.
@dvir-ross · A year ago
According to Wikipedia, there is already an improvement from 2022.
@DrTrefor · A year ago
Oh nice! Good to be obsolete already!
@user-yk7kq3rf7z · A year ago
Congratulations on 100pi subscribers! :)
@DrTrefor · A year ago
Oh cool, thank you!
@guigazalu · A year ago
While watching, I asked myself if you would cite the AlphaTensor article and results, as I had read it in recent months. It turned out you did! It is a very cool article, even though it's a "heavy read" for me.
@gooball2005 · A year ago
I do enjoy a good video on complexity theory! Also, your thumbnail game is on point!
@jebarijihed · A year ago
This reminds me a lot of the FFT and DFT algorithms!
@lebdesmath2510 · A year ago
nice vid, keep up the good work
@DrTrefor · A year ago
Thanks, will do!
@U.Inferno · A year ago
"This doesn't sound like a big improvement." Me, a computer scientist: "No. This does."
@GreatJO · A year ago
Is SIMD faster?
@mohammadhomsee8640 · A year ago
That's a great video, I really love to hear your lessons. Good job professor!
@garrettkajmowicz · A year ago
One area I found missing from your video was in showing that using block matrices got us the correct result. That is, that treating a 4x4 matrix as a 2x2 matrix of 2x2 matrices would, when multiplied, get us the same result as we would otherwise expect.
@rob876 · 9 months ago
Here's a better version of Strassen's algorithm, due to Winograd, for A.B = C:
a = A21 + A22
b = a - A11
c = B12 - B11
d = B22 - c
s = A11.B11
t = a.c
u = s + b.d
v = u + (A11 - A21).(B22 - B12)
C11 = s + A12.B21
C12 = u + t + (A12 - b).B22
C21 = v - A22.(d - B21)
C22 = v + t
This has 7 multiplications and 15 additions/subtractions.
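A quick numerical check of those identities (my own, on a random example with scalar entries; the same formulas apply when the entries are blocks):

```python
import random

A11, A12, A21, A22 = (random.randint(-9, 9) for _ in range(4))
B11, B12, B21, B22 = (random.randint(-9, 9) for _ in range(4))

a = A21 + A22
b = a - A11
c = B12 - B11
d = B22 - c
s = A11 * B11                        # 1st of the 7 multiplications
t = a * c                            # 2nd
u = s + b * d                        # 3rd
v = u + (A11 - A21) * (B22 - B12)    # 4th

C11 = s + A12 * B21                  # 5th
C12 = u + t + (A12 - b) * B22        # 6th
C21 = v - A22 * (d - B21)            # 7th
C22 = v + t

assert C11 == A11 * B11 + A12 * B21
assert C12 == A11 * B12 + A12 * B22
assert C21 == A21 * B11 + A22 * B21
assert C22 == A21 * B12 + A22 * B22
```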
@kristas-not · A year ago
ok, you officially kick ass in my book.
@Moinsdeuxcat · A year ago
8:40 THIS HAS TO BECOME A SUMMONING SALT VIDEO
@Yupppi · 9 months ago
Since this is specifically computation related, it might be worth talking about not only the algorithm for calculating sub-matrices, but also cache memory (its layout and use): how transposing a matrix gives better performance because the elements are laid out in memory more efficiently, and how the relevant ones stay in the cache over the time you need them, which avoids cache misses. Furthermore, copying the matrices (to a local buffer) interestingly enough helps as well. And lastly, splitting the work across multiple threads. Together these can make performance leap from, say, 2000 ms to 20 ms execution time. I feel like n^3 and n^2.4 aren't quite as concrete demonstrations compared to that. But of course Intel has a Basic Linear Algebra Subprograms implementation that does it in roughly 2 ms. These are, in my understanding, matrices of size 1000.
@taotaotan5671 · A year ago
Very very interesting topic! Thanks for sharing this, Trefor! Wondering if the algorithm applies to the more general case of matrix multiplication, e.g. (m x n) * (n x p). Another thing I found interesting is that, after some research, matrix inversion and matrix multiplication have the same time complexity. The optimization method you mentioned in this video also applies to matrix inversion. That contradicts my impression that matrix inversion causes more headaches than multiplication.
@DrTrefor · A year ago
For non-square matrices you can just imagine they are square with extra zeros so the same upper bound applies here.
@Dawg1561 · A year ago
Didn't know Dr. Strange left sorcery and started teaching mathematics. Btw, this was a great explanation
@codethinki · A year ago
title: fastest matrix multiplication algorithm. Video explains: The first improvement of matrix multiplication 😆. Oh and on top of clickbait a fat sponsorship. I love quality content
@streampunksheep · A year ago
ok this is cool. it makes me want to study matrices again
@jcsjcs2 · A year ago
The reason why you don't care about additions is not that additions are "much faster" than multiplications. An FPU will usually take the same amount of time for either and will probably even feature combined "multiply and accumulate" instructions. The reason is that the number of multiplications is of order n^3 and the number of additions is too, so adding both together is still of order n^3.
@gofaonepaul · A year ago
This is so cool. ❤
@realdragon · A year ago
Speaking from a hardware perspective: if possible, try to avoid jumping between columns. Even a 2D array is really one very long row, so jumping between columns means jumping across huge parts of the array, while fetching the next number in the same row is very fast
@gzhang207 · A year ago
One specialization of matrix multiplication is the matrix multiplication with a vector (or filter) that is widely used in convolution based ML. Winograd convolution seems to follow the same optimization idea.
@ericphantri96734 · A year ago
Use divider into 2×2 or 3×3 the do all the way to use the grid point inverter and addition to perform by electronic across the board mean reducing to the determinant of each small matrix then move them together into smaller matrix with all the determinant and repeat until final determinant each is only single number then multiple ply out. And find the angle between the two number by trigonometric value then we had final vector so if n = 50 numbers each matrix then use the 2× 2 because even so far total operations about 50×49 estimate 2450 calculation by electronic of 20 mhz then it take 0.00000085 second and if use slower CPU like 1 mhz then it take 0.005 second for 50×50
@rumasc9854 · A year ago
dude, I was waiting for the cool dance the thumbnail portrayed to happen
@cmilkau · A year ago
I don't like title cheating. I somehow expected this would present a state-of-the-art algorithm, not just explain Strassen and then continue with "there are better ones this fast".
@Miguel_Noether · A year ago
It's the fastest, where is the cheating part?
@SimGunther · A year ago
@@Miguel_Noether An explanation on the latest Williams equation would've sufficed for this video, but he chooses not to elaborate on its mechanics whatsoever.
@soonts · A year ago
Modern computers don't necessarily do additions much faster than multiplications. A computer I use to watch your video has AMD Zen 3 CPU. For 32-bit floating point numbers, on my computer the performance is absolutely equal. Specifically, both vaddps (adds 8 numbers) and vmulps (multiplies 8 numbers) AVX instructions have 3 cycles latency, and 0.5 cycles throughput.
@aliuncu2815 · A year ago
You forgot to mention that not a week after the AlphaTensor group's paper, Kauers and Moosbauer improved their results on 5x5 matrices.
@dr.rahulgupta7573 · A year ago
Excellent presentation 👌
@DrTrefor · A year ago
Thanks a lot!
@jaimeduncan6167 · A year ago
If one is working with not-so-big numbers (in terms of the number of relevant digits), like 64-bit floating point numbers, modern computers will produce multiplications in the same time as additions for this kind of problem (fully pipelined); in fact many will produce a fused Ax + y at 1 per cycle per pipeline. That means that additions can't be ignored.
@ultimatedude5686 · A year ago
I think the reason reducing multiplications is so important is that multiplying matrices is way more computationally expensive than adding them. That means when you’re working with block matrices, you want the minimum number of multiplications of the components; extra additions are fine and won’t really affect the asymptotic complexity.
@jaimeduncan6167 · 2 months ago
@@ultimatedude5686 Yes, but that is not his point. His point is "when multiplying matrices I can ignore the scalar additions because they are much faster", which is wrong. It was once true; it's no longer the case. Almost all modern architectures since POWER1 have some version of fused multiply-add, fully pipelined, again when dealing with floats and big matrices (latency is different).
@ultimatedude5686 · 2 months ago
@@jaimeduncan6167 Addition of numbers is about the same speed as multiplication of numbers, yes. But the matrices might actually be block matrices, in which case the components themselves are matrices, in which case adding the components is much faster than multiplying them.
@jimwinchester339 · A year ago
Interesting! I was class of 1980, so I learned only of the Strassen algorithm.
@notgabby604 · A year ago
The Walsh-Hadamard transform is kind of the fastest dense matrix multiply you can do. It's a fixed thing, but you can somewhat unfix it with cheap operations like prior sign flipping of the input or prior permutations etc. The number of multiplies required is then zero.
@D_VAULTZ · A year ago
Does this only apply to large ints? I'm wondering if there is an improvement available for much smaller values, such as 8-bit integers.
@jonatanlind5408 · A year ago
These algorithms only tend to make sense for large ints. The algorithms presented in this video were developed to speed up computation; your 8-bit example can already be calculated incredibly quickly, and even if you have a large number of calculations it is possible to precompute every single 8-bit multiply and simply look up the answer when you need it. I am unaware of any algorithm that can beat such a method.
@swagatochatterjee7104 · A year ago
Hey, the AlphaTensor record at 9:50 has been beaten by two researchers at the University of Linz, who showed that you can multiply 5x5 matrices in 95 multiplications instead of the 96 the paper showed.
@dougpark1025 · A year ago
An interesting set of algorithms. However, as it turns out, getting the numbers from memory into the CPU is a bottleneck that is easy to forget. If you can arrange for data to load into the processor's cache in contiguous blocks, there is a great deal of performance to be gained over just pulling the numbers from memory without any consideration. This was a technique used decades ago to do matrix multiplies efficiently when computer memory wasn't large enough to hold all of the matrices: blocks would be loaded from disk, multiplied, and then written back out to disk. Same idea, but memory is a lot faster than disk. Today we usually don't run into anything that big, but the concept still works. Next step, enter the multi-core processor. You can take the same idea and subdivide the work of multiplication into parts that run independently on separate cores. In the end you can approach an N-times speedup where N is the number of cores (subject to Amdahl's law of course). The nice thing about matrix multiplication is that it falls into the category of being embarrassingly parallel, which is to say it doesn't require much in the way of synchronization constructs. I have not tried any of these faster algorithms in a parallel implementation. It might be interesting to benchmark that sort of implementation.
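A rough sketch of that row-block split (my own, using numpy and multiprocessing as stand-ins): each worker multiplies one horizontal block of A by all of B, with no synchronization needed beyond collecting the results.

```python
from multiprocessing import Pool
import numpy as np

def multiply_block(args):
    A_block, B = args
    return A_block @ B          # each worker computes its slice of C independently

if __name__ == "__main__":
    n, workers = 2048, 4
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    row_blocks = np.array_split(A, workers, axis=0)
    with Pool(workers) as pool:
        C = np.vstack(pool.map(multiply_block, [(blk, B) for blk in row_blocks]))
    assert np.allclose(C, A @ B)
```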
@DiabloQFDB · A year ago
I need to research and see if these methods can be implemented with AVX.
@yutubl · A year ago
Thanks for this very interesting information.
@saitougin7210 · A year ago
7:05 "using some log rules": starting from the change-of-base formula,
(log_b c) / (log_b a) = (ln c) / (ln a)
(log_b c) * (ln a) = (log_b a) * (ln c)
ln[a^(log_b c)] = ln[c^(log_b a)]
a^(log_b c) = c^(log_b a)
q.e.d. You're welcome. Hm, maybe someone should add this formula on Wikipedia or something. It might come in handy, if one just knew it.
@iwersonsch5131 · A year ago
At 5:32, if I look at matrices of size n=2k+1 for k approaching infinity, the complexity with this naive extension approaches (2n-1)^log2(7) = (n-0.5)^log2(49) > (n-1)^5, doesn't it? Of course there are ways to reconcile this but this argument appears to be insufficient for the claim
@wilurbean · A year ago
I feel like I've made it in my physics education when that slightly extended decimal exponent got me excited
@DrTrefor · A year ago
lol
@wilurbean · A year ago
@@DrTrefor fr tho, I went back to school at 30. Started at a tiny community college in rural northern Michigan and got accepted to a top tier engineering physics program (#1 in the US in my particular sub-field). My current physics classes are Modern Physics and lab, where we're doing a lot of modeling/data analysis with big matrices, and E&M based around Griffiths' Electrodynamics. Wouldn't have passed Vector Calc or Linear Algebra without your videos. It's coming full circle now with Modern and Linear Algebra and E&M with the vector calc. Ty!
@lunalildragon241 · A year ago
Well, calculations in themselves don't really take a long time; the biggest problem is the memory read time you incur if you read and write more data than the CPU can store. Meaning, if you are running such huge operations you have to take care to split the data into chunks that fit in that area without having to reread.
@Oscar1618033 · A year ago
I have the feeling that such algorithms might involve some kind of matrix log and Fourier transforms. Could it be?
@David-gu8hv · A year ago
No wonder donuts and coffee go so well together...
@ankurraj6560 · A year ago
Can we somehow use the concept of a Matrix as a linear transformation of a space to simplify our Matrix multiplication problems!? 🤔🤔
@DrTrefor · A year ago
Well, composition of linear transformations is just matrix multiplication, so they are definitely related.
@subhodeepmondal7937 · A year ago
Let me tell you a story. I just started learning CUDA, and matrix multiplication is a basic algorithm you have to learn. But they were using an algorithm which launches an nxn thread grid to calculate each sum(a^i * b^j) term, which is virtually O(n) complexity if you have more CUDA cores than the order of the matrix. But I thought, can I do it better? So what I did is calculate all the a^i * b^j terms and then do a rolling sum, which has complexity log(n). So the point is, if we can somehow parallelize our problem we can process a lot faster.
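A small sketch of that log(n)-depth "rolling sum" (my own, in plain Python rather than CUDA): once all the elementwise products exist, pairs are combined in rounds, so a row of n products takes about log2(n) sequential steps, with each round's pair-sums being the part a GPU would do in parallel.

```python
def tree_sum(values):
    vals = list(values)
    while len(vals) > 1:
        if len(vals) % 2:
            vals.append(0)                 # pad odd lengths
        # one round: every pair here could be handled by a separate thread
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]

products = [i * i for i in range(1, 9)]    # stand-in for the a*b terms of one entry
assert tree_sum(products) == sum(products)
```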
@TheViperZed · A year ago
Multiplication and addition having different impacts from a computational complexity standpoint is incorrect from the perspective of modern hardware. If you're not doing this on embedded hardware, the multiply, int or fp, is going to take one cycle, sometimes even just fractional cycles; most often it's a fractional-cycle multiply-accumulate doing both the mult and the add. Computational complexity really needs to be combined with an analysis of memory throughput and memory footprint to be a viable analytical strategy these days; on its own it's an anachronism from a time when the main performance bound wasn't IO throughput into the CPU. It's really worth mentioning these facts if you're going to approach a topic from a computational complexity standpoint.
@jessstuart7495 · A year ago
Do modern CPUs detect when you are trying to multiply by zero and return the result (zero) faster than multiplying two non-zero numbers?
@DrTrefor · A year ago
Yes I believe so, but it likely depends on the exact algorithm one uses.
@jonatanlind5408 · A year ago
Yes, speed of computation is dependent on the data itself. Floating point numbers actually have 5 distinct data representations they can assume, each of which must be handled separately. These are zero, denormal, normal, infinity and NaN. I believe computations involving denormal numbers are usually slower than normal or zero, and significantly so.
@unlucky-777 · A year ago
As a computer engineer, this didn't help me at all because my coworkers are still using the most basic methods in our code
@frankyin8509 · A year ago
Love his T-shirt, 👕 topology meme🎉
@abdulshabazz8597 · 5 months ago
I would think the computational lower bound is less than O(N^2) if any digit repeats in more than one quadrant...
@iiiiii-w8h · A year ago
Which algorithm is implemented in MATLAB?
@DrTrefor · A year ago
The normal one by default I would presume. It takes truly gigantic matrices for this kind of asymptotic difference to make a measurable advantage.
@deeveshjain4246 · 5 months ago
is this the same thing as batch matrix multiplication?
@chxyike · A year ago
You can improve from n^3 to n^2.807, which is not significant, while the implementation complexity increases a lot. The gain is small with this algorithm, in my opinion.
@MrHamof · A year ago
Say you have a matrix that's 100x100. With n^3, you need 1,000,000 multiplications to solve it. With n^2.807 you need about 411,150 multiplications, and with the current record holder of 2.3728596 we're down to about 55,683 multiplications. I don't know enough to speak to the real world performance difference, but it definitely has a huge impact on the number of multiplications needed for large matrix multiplication. Depending on how much harder implementation is, I would agree that it's probably not worth it for smaller matrices; like most of these, the difference between O(n^3), O(n^2) and O(n) isn't that big if n is a small number.
@chxyike · A year ago
@@MrHamof You are right. It makes sense for large matrices.
@Rin-qj7zt · A year ago
apparently the link for brilliant is still active cuz it gave me the discount lol
@DrTrefor · A year ago
Awesome! Hope you enjoy:)
@lucavogels · A year ago
Can you also explain the Alman-Williams algorithm please! 🤓
@gnomeba12 · A year ago
Ok now show us the best algorithm for SVD!
@DrTrefor · A year ago
ha, actually I've been wanting to do an SVD video
@dmitrykazakov2829 · A year ago
This misses the question of accuracy. Additions (and subtractions) suffer precision loss in floating point arithmetic. If you have accumulating additions in your algorithm you are risking catastrophic precision loss for some inputs. So shuffling a multiplication into a bunch of additions might not be such a bright idea as it appears. In general floating-point operations are not associative, thus all optimization methods based on reshuffling operations, like this one, must be applied with great care.
@holyshit922 · A year ago
Standard multiplication maybe is not the fastest, but it is still not bad because it has polynomial growth. Compare it with algorithms like calculating a determinant by cofactor expansion or by summing products over all permutations, which have factorial growth. If we want to calculate the coefficients of the characteristic polynomial by expanding the determinant via minors to avoid symbolic computation, we get exponential growth. 2:16 and here I catch you: we accelerate the algorithm but at the cost of memory, and both RAM and disk space, or paper and ink, cost dollars. Was it worth it? The acceleration isn't significant; you won't be able to accelerate the algorithm to O(n^2)
@joshualickey3426 · A year ago
I think I found a mistake at 2:44. In the bottom right entry it might be p1 - p2 + p3 + p6.
@artus9022 · A year ago
But what about SIMD? Isn't this n²? How would the algorithms perform in this case?
@Petch85 · A year ago
What if you have a matrix times a vector, are there similar tricks?
@myzel394 · 9 months ago
Either I did not understand this video at all, or did he not show how the current fastest method actually works? :o
@dmitryincog7455 · A year ago
Is there a matrix multiplication library that uses these algorithms? And if not, what would that mean?
@farfa2937 · A year ago
If you hardcode a matrix, then you can get the result in constant time, and you'll be right as long as you only try to multiply matrices that result in your chosen one.
@tissuepaper9962 · A year ago
This is the well-known "divine intervention" algorithm which is also employed in "miracle sort" and the "birthday attack" on hashing algorithms.
@dabbler1 · A year ago
Correction: it is p1 + p3 - p2 + p6, not p1 + p2 - p3 + p6.
@muskyoxes · A year ago
I'm not believing that a few multiplications are worse than a huge pile of additions. Multiplying two 10,000 digit numbers is a pain, but a 10,000-dimension matrix doesn't imply 10,000 digit numbers in the entries. Has anyone found which of these algorithms is _actually_ fastest when used for real problems on real computers?
@DrTrefor · A year ago
Maybe one way to think about it is that multiplication IS just iterated addition, so every multiplication itself is hiding a huge pile of additions.
@muskyoxes · A year ago
@@DrTrefor But those additions are parallel and hardware makes mincemeat of them. (Unlike division, which needs sequential subtraction and is thus slow.) Multiplication often takes only one clock cycle. Sometimes three. Rarely much more than that as far as i've seen
@calvindang7291 · A year ago
You don't actually calculate based on the number of multiplications. You calculate asymptotic complexity the normal way, which gives the same result as just counting the number of multiplications. Of course, which one is actually fastest depends on how large your matrices are. Multiplying matrices that are 100k in each dimension is probably enough that the extra overhead of the fast methods is worth it, but if they're significantly smaller than that it might not be needed.
@muskyoxes · A year ago
@@calvindang7291 If you're replacing a 3-clock-cycle operation with a dozen 1-clock-cycle operations, aren't you losing at all sizes? It's not that simple (what about vector operations, pipelining, or GPUs?), but all the more reason to know the _practically_ best algorithm. I'd also only be interested in large matrix sizes that are used in real problems, and not report that some algorithm is best on matrices orders of magnitude bigger than any application has ever worked with.
@calvindang7291 · A year ago
@@muskyoxes As I said, you count the operations the normal way. The way I learned in school, that includes O(1) multiplication, since we aren't assuming big numbers. Nobody cares about the exact number of operations. If a twice-as-large matrix always takes only approximately 7 times more operations, that's going to be treated as better than one where it's 8 times more (algorithm analysis doesn't care about any other overhead either, just whichever part of the algorithm is slowest, which you can usually figure out by looking at it for 5 seconds)
@jimnewton4534 · A year ago
I'm surprised that if the number of rows is not a power of 2, then you can simply add rows and columns of zeros. I'd always thought I needed to adjoin rows and columns of zeros but with 1's on the main diagonal.
@pierreabbat6157 · A year ago
What happens to the accuracy of computation with these matrix multiplication algorithms? I use the standard n³ algorithm, with pairwise addition for accuracy.
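For readers unfamiliar with it, a minimal sketch of pairwise (cascade) summation (my own): recursively splitting the sum keeps rounding-error growth near O(log n) rather than the O(n) of a left-to-right running total.

```python
def pairwise_sum(xs, lo=0, hi=None):
    if hi is None:
        hi = len(xs)
    if hi - lo <= 2:
        return sum(xs[lo:hi])              # tiny base case: plain summation
    mid = (lo + hi) // 2
    return pairwise_sum(xs, lo, mid) + pairwise_sum(xs, mid, hi)

terms = [0.1 * k * 0.3 for k in range(1000)]   # stand-in for one entry's products
total = pairwise_sum(terms)
```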
@calvindang7291 · A year ago
It's assumed that you want an exact result (and so won't get hit by rounding errors) when doing this - if you only need approximate answers and so are using an approximate data type, that's something else entirely.
@alekthpariso8606 · A year ago
Still trying to figure out where I am gonna need this.
@DrTrefor · A year ago
haha probably never tbh:D
@telotawa · A year ago
9:09 but you can do integer multiplication in n log n tho
@mathematicia · A year ago
This could be an easier method, but learning, memorising and using it in practical scenarios is really painful. As far as calculations are concerned it could be useful somewhere, but when we do any type of analysis with matrices, it couldn't replace ordinary multiplication.
@MrTyty527 · A year ago
I can come up with an O(1) algorithm for this, but the result might be less accurate!
@jimjacobs2817 · A year ago
Is there a quicker way to calculate a 4 x 4 product? That is, is there a similar approach to the one that increases the speed for 2 x 2 matrices?
@iwersonsch5131 · A year ago
"Asymptotically" could mean "asymptotically as the number of trials with 3x3 matrices approaches infinity", but I think I get the idea
@Observer_detector · 12 days ago
Tensor Algebra : 😈😈
@noahwilliams8996 · A year ago
Why would multiplication take more time than addition?
@squeezy8414 · A year ago
Is there a practical application to multiplying such large matrices? As far as I know current matrix applications are limited to 4x4 matrices right? Correct me if I'm wrong