
The Hardware Lottery (Paper Explained) 

Yannic Kilcher
266K subscribers
11K views

Published: 27 Oct 2024

Comments: 76
@veedrac · 4 years ago
My comment from Reddit: www.reddit.com/r/MachineLearning/comments/iuw094/r_the_hardware_lottery_the_advent_of_domain/g5q7fos/

Architectures generally trend from specialized to general, growing from a specialization around a task to a specialization around a particular computation graph. For example, the traditional von Neumann architecture started out as the simplest way to get programmability, and through optimization gradually turned into latency-specialized machines, approximating dataflow, because the computation graphs built on the early machines had sequential bottlenecks. The truly early GPUs were things like sprite rendering hardware, and then later we got the 'GPU' GPUs which rendered triangles in fixed-function pipelines, but over time, because the computation graphs they took were homogeneous and throughput-limited, they turned into general-purpose machines for homogeneous and throughput-limited computation graphs. Multicore CPUs came later. Naturally these were already general computing devices, but you can see they fit a niche between the other two, for workloads that needed throughput but were heterogeneous and contained long latency chains.

But then you have other pieces that are still task specialized. GPUs still contain triangle rasterization hardware. You've got hardware video encoding and decoding. Your CPUs contain encryption-accelerating routines. What's that about? Ultimately there are two types of new hardware. There's the fixed-function hardware that's logically specialized to one type of operation. And then there's the hardware that's specialized to a type of computation. The question for AI is: do we need the former, or the latter? The TPU or the tensor cores in a GPU suggest the former, that what we want is an operation accelerator for this particular niche. If you look at other hardware, like from Graphcore or Cerebras or even Groq, you can see that there is very little that prevents getting almost all the speedup on much more general-purpose architectures. The computations these devices offer are still fully programmable. Very roughly, these are bandwidth- and pipelining-optimized architectures, optimal for computations where the computational graph is very short end-to-end, very wide, and readily pipelined.

History would say that the latter should win out, and if you're concerned about overspecialization, this is what you should want to win out, too. But unusually, the former is backed by huge names, like NVIDIA and Google, so it's unclear if the same will happen. In Google's case this is presumably just shortsightedness, but in NVIDIA's case I think there's a legitimate data ingress/egress bandwidth issue with the spatially spread-out hardware.

You might then say, but why these architectures, and not others? This all boils down to something similar to the Church-Turing thesis, in that there aren't that many types of computation graph, and most stuff just boils down to the closest general-purpose architecture. It's not a coincidence that triangle-drawing hardware turned out to be really good at fluid simulations and speech synthesis. Computation is intrinsically fungible, and problems can only get so diverse before general-purpose hardware is the optimal solution. Also something something obligatory Moore's law is not dead (docs.google.com/spreadsheets/d/1NNOqbJfcISFyMd0EsSrhppW7PT6GCfnrVGhxhLA5PVw/edit#gid=1953273658).
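A minimal sketch of the contrast between latency-bound and throughput-bound computation graphs (my illustration, not part of the comment; the sizes are arbitrary and NumPy on a CPU stands in for a throughput-oriented device): the same number of multiply-adds is easy work when arranged as one wide, independent batch, and hopeless when arranged as a single dependency chain.

```python
import time
import numpy as np

n = 2048

# (a) Latency-bound: every step depends on the previous result; nothing to parallelize.
x = 1.0
t0 = time.perf_counter()
for _ in range(n * n):
    x = x * 1.0000001 + 1e-9
t_chain = time.perf_counter() - t0

# (b) Throughput-bound: the same n*n multiply-adds, all independent, as one wide op.
a = np.random.rand(n, n)
t0 = time.perf_counter()
b = a * 1.0000001 + 1e-9
t_wide = time.perf_counter() - t0

print(f"serial chain: {t_chain:.3f}s  wide batch: {t_wide:.3f}s")
```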
@veedrac · 4 years ago
To clarify a bit more the point I'm trying to make: GPUs didn't occur around the time they did by coincidence, as one of many possible avenues. They occurred roughly as early as they could, according to the physics of the time. If they were much earlier, or started out much less specialized, they would have been pointless. The first 3D accelerators had only a few million transistors! You *have* to build hardware that does what hardware can do fast, and when you only have a handful of transistors, there's no significant benefit from GPU-style architectures, because single threads of execution, à la CPUs, are scaling just as well, but are more general. And so it was for multicore CPUs. And you might think, well, at least with neural network accelerators, isn't it the case that those could have been sped up much earlier? To a degree, yes, but less than you might think. This sort of spatially laid out processing 1) requires a huge number of transistors, and large local caches, which has only become practical in the last decade or so, and 2) is mostly solving a memory bandwidth issue, which again is mostly a product of the last decade.
@veedrac · 4 years ago
Oh, and since the video highlighted the FPGA part of the paper: FPGAs are a distraction. They're about ideal for emulating hardware designs, and markedly little else. People treat them as magic 'unspecialized silicon' when in reality they're silicon that's hyperspecialized into this one function of pretending to be unspecialized. This is typically going to be significantly worse than a generalized variant of a Graphcore/Cerebras-style architecture, which offers the same sort of distributed computing architecture, as long as you have a high enough level description of your problem that you can tackle it in a more general way.
@blanamaxima · 4 years ago
Not to forget that the memory was not available until the beginning of the 2000s. Building HW without that memory would have pushed you into an SRAM dataflow where deep structures are almost impossible to do. You can only pipeline so much... What enabled this long pipeline on realistically big data was DDR. I think it was very smart to revisit the algos given new advances in compute. My credit goes to Alex et al. for doing the work and trying things out. Ten years prior they might just have shut down the training after a week and labeled it as unsuccessful.
@piotrekmalpa1 · 4 years ago
When you read this paper you know it was written by an ML practitioner. Standard ML practice is to:
>take an old idea from statistics or control theory
>rename it as "machine learning"
>publish
>profit
The author seems to take a basic idea from economics - opportunity cost - rename it as a "lottery", and publish the paper.
@visionscaper · 4 years ago
This made me think immediately of the upswing in Transformer research and results and the downturn of research and results with RNNs: TPUs are very good at training and dealing with Transformers, while they perform poorly on RNNs (or at least RNNs are harder to program on TPUs).
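A toy sketch of why that happens (mine, not the commenter's; shapes are arbitrary): self-attention over a whole sequence reduces to a few large matrix multiplies with no dependency between timesteps, which is exactly what matmul-centric hardware is built for, while an RNN is a chain of small, strictly sequential matmuls.

```python
import numpy as np

T, d = 128, 64                                   # sequence length, model width
x = np.random.randn(T, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

# Transformer self-attention: all T timesteps processed at once as big matmuls.
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d)
scores -= scores.max(axis=1, keepdims=True)      # numerically stable softmax
att = np.exp(scores)
att /= att.sum(axis=1, keepdims=True)
out_attn = att @ V                               # (T, d), no sequential dependency

# RNN: T dependent steps, each a small matmul waiting on the previous hidden state.
Wx, Wh = np.random.randn(d, d), np.random.randn(d, d)
h = np.zeros(d)
hs = []
for t in range(T):
    h = np.tanh(x[t] @ Wx + h @ Wh)
    hs.append(h)
out_rnn = np.stack(hs)                           # (T, d), but computed serially
```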
@yizhe7512 · 4 years ago
What this essay describes has pretty much already been well understood in the CS field. Your empirical ML research does depend on the hardware, but theoretical research may not. Hardware also evolves to better suit new research directions, applications, etc.
@herp_derpingson · 4 years ago
23:28 This is the exploration/exploitation dilemma. I don't see anything wrong with what humanity has done. We continued on a promising venture until it stopped progressing. Then we backtracked and started exploring other opportunities. Isn't this the optimal way of doing it? I believe we should do the same once we run out of ideas in deep learning.
42:28 I wish there were some cheap way to fabricate microchips at home. It doesn't have to be 7nm or something crazy like that. It would be like a 3D printer, but one that prints specialized hardware. Just imagine the amount of new creative hardware people would make in their bedrooms. Open source, but for hardware designs.
Modern quantum computer research uses a lot of modern-day CPUs and GPUs. You just couldn't build quantum computers back in the 1970s.
Very good video, keep it coming!
@trisimix · 4 years ago
So like, an FPGA?
@herp_derpingson · 4 years ago
@@trisimix Yeah, an FPGA, but affordable and something you can fabricate at home.
@oreganorx7 · 4 years ago
Re: the exploration/exploitation dilemma - I wonder if we can design an optimization algorithm that helps us humans plan a collectively optimal research path (one that maximizes AI advancement per year, for example). This would be in contrast to how things currently move along in an unconscious/ad-hoc way. I.e., no one says, for example, "this year we will dedicate 20% of resources to scaling existing NN architectures, 25% to pursuing new architectures, 10% to improving existing hardware, etc."
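The allocation problem described here is essentially a multi-armed bandit. A toy epsilon-greedy sketch (my illustration; the payoffs and the 10% exploration split are made-up numbers, nothing here comes from the paper):

```python
import random

true_payoff = [0.3, 0.5, 0.8]     # unknown "return" of three research directions
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]
epsilon = 0.1                     # fraction of effort reserved for exploration

for _ in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(3)                              # explore
    else:
        arm = max(range(3), key=lambda a: estimates[a])        # exploit best so far
    reward = 1.0 if random.random() < true_payoff[arm] else 0.0
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean

print(counts, [round(e, 2) for e in estimates])
```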
@volotat · 4 years ago
I came here to learn how to win the lottery, became an ML expert instead. Good video.
@dermitdembrot3091 · 4 years ago
:D
@stella5322 · 3 years ago
I agree with all the points you made. It's definitely true that the choices "we" or society made in the past limit the possible wins and losses of the future. And that is part of life, or of any other research area. The slowing and ending of Moore's law makes ICs really expensive to make, which makes the hardware lottery extremely expensive. However, I think a potential solution is to create a flexible general-purpose hardware platform (some kind of processor) and open up the operator and data-transfer library functions to algorithm researchers, allowing them to develop these libraries themselves.
@SianaGearz · 3 years ago
Something where the paper slightly trips over itself is when it claims that the sequential processor is constrained by the sequential memory device, which is... true. But it's also only half the picture, because at any point, only a given kind of memory device is practical at a given price point! So the processor and memory co-develop together. Memory doesn't develop uniformly either: improvements in memory latency over the last 30 years have been snail-paced while bandwidth exploded, so which kind of computing device is an adequate pairing depends on the memory of the era. So of course in the early 80s, straightforward CISC implementations were adequate to the memory at hand; then we got pipelining, prefetch, caches, out-of-order. When a 6502 is within a factor of 2 of the best possible system you can build with regard to RAM bandwidth efficiency, and costs only a couple of dollars, why would you build something else? Substitute the 6502 with a more era-adequate processor - a bit later you might just throw a 68k, and then a Hitachi SH2, at almost any task, and you wouldn't be terribly wrong.

And you can't just juxtapose CPU + sequential memory with the GPU, because the GPU's memory is also sequential. It's just very, very wide, very high bandwidth. I think the prime reason we got GPGPU and shader hardware in the early 2000s is that it became practical, from a cost and footprint perspective, to put a 256-bit-wide memory device in a commodity PC, and the pixel shader was a relatively cheap and straightforward way to utilise it! You're also fundamentally constrained by RAM bandwidth. On a GPU, hardware scheduling performs very effective latency hiding, because it can request memory well in advance of when it will actually be used, but your working set, your task, just doesn't fit into cache and scratch RAM, so you'll be reading it from RAM and writing it to RAM entirely during each substantial processing step.

So let's assume a CPU is what held neural networks back - how would you work around that? Well, you can do local processing: bind each DSP to its own memory domain, interconnect them, no centralised memory, all distributed, which... well, picture that, that's a typical supercomputer! Sure you can build one, but it'll cost you, because all these discrete subunits and all the communication between them add up in cost, and quickly. In the 2000s, suddenly every one of us had a little supercomputer at home. IT WAS EXCITING.

So for most of history there really wasn't a relevant, meaningful "what if the hardware was different" question that doesn't boil down to "what if you had unlimited budget". Which is not to say architecturally more fitting hardware of a given complexity level couldn't exist at all, but its benefit would historically have been so limited as to not be very relevant - trivial enough to handwave away.
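A back-of-envelope version of the bandwidth argument (the peak numbers below are illustrative assumptions, not from the comment): an operation is RAM-bandwidth-bound whenever its FLOPs-per-byte falls below the device's peak-FLOPs-to-peak-bandwidth ratio, which is why elementwise work starves while big matrix multiplies keep the chip busy.

```python
peak_flops = 10e12          # assume a 10 TFLOP/s device
peak_bw = 500e9             # assume 500 GB/s of DRAM bandwidth
ridge = peak_flops / peak_bw        # FLOPs per byte needed to be compute-bound (20)

def elementwise_intensity(n, bytes_per=4):
    # y = a*x + b over n floats: 2n FLOPs, roughly 3n values moved through DRAM
    return (2 * n) / (3 * n * bytes_per)

def matmul_intensity(n, bytes_per=4):
    # n x n matmul: 2n^3 FLOPs over roughly 3n^2 values moved (A, B, C once each)
    return (2 * n**3) / (3 * n**2 * bytes_per)

for n in (256, 1024, 4096):
    print(n, round(elementwise_intensity(n), 3), round(matmul_intensity(n), 1), "ridge:", ridge)
```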
@quebono100 · 4 years ago
At first I thought, huh, winning the lottery? Am I on the right channel here? But then you clarified it.
@SebastianStabinger · 4 years ago
The comparison with GPUs from the 70s is not really valid. The code a GPU could execute was fixed in hardware until around 2001, when the first cards supporting shaders came out, and even then they were restricted to graphics tasks and could not really be used effectively for other things. It took until around 2007 until GPUs could be properly used for general tasks, which coincides quite nicely with the point when neural networks started to become more interesting again. I do agree, though, that even if people had been building specialized hardware for neural networks in the 80s, it would probably not have changed much, since semiconductor technology just was not advanced enough to give the performance that is actually needed.
@DamianReloaded · 4 years ago
I liked the part where it says that for a promising idea that cannot be implemented due to hardware limitations, we could at least have a way to tell how much more expensive it would be to follow that route. Maybe what the paper is trying to say is to spend more resources before we branch (make more informed decisions). But in a society based on profit, I don't think something that isn't guaranteed to give returns would prosper, at least not for the general public.
@melo2722 · 4 years ago
Can you do a video on how you select which papers to read? I'm guessing there's only a small number of papers you really care about / that would be useful to your research. Thanks.
@florianhonicke5448 · 4 years ago
I still watch each of your videos. Thanks for the content.
@TheAIEpiphany · 3 years ago
I am surprised that GNNs were not mentioned. GNNs are NNs, right, but the current HW we have is not suitable for them (we're good at dense matrix multiplication but not so much at sparse matrix multiplication). There are companies such as Graphcore that are working on creating specialized HW for GNNs. Once an idea shows promise, specialized HW will be built because of commercial needs. The world is making general-purpose computing chips, but we see more and more ASICs (application-specific chips) being built (a good example is Tesla's custom self-driving chip) - so we're coming full circle here, historically speaking. I don't completely agree with the sentiment that just because some idea was rejected in the past, it will get harder to implement as time goes by - we can simulate stuff, and we have FPGAs, which make it much easier to create whatever custom HW you need, on the fly, using HDLs (HW description languages). Yes, it will take some more money, but we're in a much better position HW-wise compared to the early days of computing. Co-evolution (of ideas and HW/SW in this example) is as natural as nature itself. If an idea is really that good it will find its way around, sooner or later - that's my hypothesis. As soon as we hit the saturation point with NNs, if and when that happens, the world will undoubtedly turn its head towards different models (restricted Boltzmann machines and symbolic manipulation, of course).
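A small sketch of the dense-vs-sparse point (mine, not the commenter's; sizes are arbitrary): a GNN layer is roughly A @ X @ W with a mostly-zero adjacency matrix A, and dense matmul hardware happily multiplies all those zeros, while sparse kernels skip them at the cost of irregular memory access that today's matmul-centric chips handle poorly.

```python
import numpy as np
import scipy.sparse as sp

n, d, avg_deg = 2000, 64, 8
A = sp.random(n, n, density=avg_deg / n, format="csr")   # sparse adjacency, ~8 edges per node
X = np.random.randn(n, d)
W = np.random.randn(d, d)

out_sparse = A @ (X @ W)            # SpMM: work proportional to the number of edges
out_dense = A.toarray() @ (X @ W)   # dense version also multiplies the ~4M zeros

print(np.allclose(out_sparse, out_dense))
```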
@TheNuttyNetterAlexLamb · 4 years ago
I started doing ML stuff around 2011. I don't agree that you need GPUs to make neural nets work well. There are lots of people who get nice improvements from using moderate-sized MLPs, which train alright on CPUs, albeit a bit slowly. Without GPUs, I think you'd still have successful neural nets, but it would definitely cut out some of the big applications. In my opinion, one of the biggest barriers back then to using neural networks was a lack of knowledge and understanding about why they were useful. Very few people understood arguments about distributed representations, so they thought that methods like random forests were just as good. Only a handful of people understood why deep networks were valuable. The other perception was that deep networks were unreliable compared to other techniques, which was true, but which seems to have been reduced as techniques have gotten better (optimizers, activations, residual connections). Another big bias was that other branches of research were seen as more prestigious because they had fancier math or were harder to understand. So I think that's probably one of the biggest mistakes that we made in hindsight.
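A quick sanity check of the "moderate-sized MLPs train alright on CPUs" point (my example, not Alex's): a small scikit-learn MLP on the digits dataset trains in a few seconds with no GPU involved.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Two modest hidden layers; CPU-only training finishes in seconds on this small dataset.
clf = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=300, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```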
@qeter129 · 4 years ago
so this isn't a paper on how to get good silicon?
@qeter129 · 4 years ago
Having listened to the full video, it seems obvious to me. Popular techniques get more development time and hardware specialization. Perhaps a model that attempted to measure the opportunity cost of past development decisions would have been more substantial. More broadly, I think industries do a pretty good job of identifying and pursuing research and development paths before hitting a bottleneck and pursuing something new, so I'm not totally convinced that "basic research" should receive a multiple of its current funding like the author seems to be arguing. Though of course that sentiment won't fly well with academia.
@GeekProdigyGuy · 4 years ago
@@qeter129 Industry fundamentally pursues a greedy (as in greedy/incremental search) approach, because it doesn't particularly like high variance (risk). Basic research is high risk, high reward. If we would like to optimize scientific progress as a whole, then basic research is a clear winner, but for any profit-motivated corporation, the exact opposite is true. Of course, in reality we have a balance between these approaches, but the argument that we haven't landed on the ideal mix of strategies is, I think, pretty uncontroversial. We can see that in how many papers are about incremental, ultimately inconsequential improvements over SotA. However, this doesn't actually solve the difficult problems, which are 1) determining what exact mix of strategies is optimal for progress and 2) modifying the incentives so that these strategies are optimal for individuals and corporations.
@qeter129 · 4 years ago
@@GeekProdigyGuy Yeah, that's why I would have appreciated a model that approximated the opportunity cost. Obviously any such model would be making a lot of assumptions, but it would be a great way to get the ball rolling down a more solid/actionable path.
@AbgezocktXD · 4 years ago
We could call it the framework lottery instead. I think the word "framework" works for both software and hardware.
@Erotemic · 4 years ago
I think infrastructure might be a better word.
@Zantorc · 4 years ago
As someone who started their career in software back in the 70s, it was obvious to me when I looked at expert systems in the 80s that they were a dead end. On the other hand, neural networks always seemed promising; it was the inability to do backpropagation through multiple layers which held them back. I don't think you will ever get general AI by this approach - that will take something more along the lines of the work being done by Numenta - but it will be useful.
@liammcfadden7760 · 3 years ago
I didn't read the paper, so I'm not sure if they touch on this...but aren't ideas pursued in parallel? Maybe neural networks became useful later than they would have otherwise (given more amenable hardware), but there are always 'rebels' like Hinton pursuing ideas that aren't popular at the time. I would argue this is actually the norm of all science: an idea becomes popular and useful, which attracts more and more attention until there are severe diminishing returns, then a new novel idea is adopted, becomes the norm, and the cycle continues. I'm sure this cycle has been pointed out before. To me, this paper is a rehashing and isn't really useful.
@YannicKilcher · 3 years ago
You're right. It's just that you can't pursue every possible idea in parallel, so you necessarily have to decide on some. But I largely agree with your point.
@egparker5 · 4 years ago
I didn’t see the lottery in the title - I was too enthralled by the Byte magazine cover!
@yaxiongzhao6640 · 4 years ago
I thought everyone understood this phenomenon already? Well, the author probably is not well exposed to theoretical computer science and computational complexity research. Modern computing infrastructure is largely aligned with a theoretical foundation that is optimized for general efficiency; in other words, algorithmic complexity is usually not penalized by hardware limitations, as long as you stay within the boundary of the classic computing model (like the Turing machine, etc.).
@yaxiongzhao6640 · 4 years ago
What follows is that DNNs are complex, so they demand GPUs - not that GPUs weren't invented early enough, so DNNs naturally matured later in history. Not the other way around.
@Marcos10PT · 4 years ago
This is more about TPUs and the focus on more specialized operations like matrix multiplication, as opposed to the simple atomic operations one would imagine being part of a Turing machine. I do wonder if this is a general enough framework that we can specify any algorithm in terms of tensors. Disclaimer: I don't have much knowledge about what kind of operation is optimized by TPUs, but I assume it has to do with matrix operations on arbitrary dimensions (hence the "tensor" part).
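For the "specify any algorithm in terms of tensors" question, a classic trick is rewriting an operation as a matrix multiply so that matmul hardware can run it. A toy illustration (mine; real TPU/XLA kernels are far more involved): a 1-D convolution recast as one matrix product over an im2col patch matrix.

```python
import numpy as np

x = np.random.randn(1000)           # signal
k = np.random.randn(5)              # filter
out_len = len(x) - len(k) + 1

# im2col: gather every sliding window into a row of a patch matrix
patches = np.stack([x[i:i + len(k)] for i in range(out_len)])   # (out_len, 5)
conv_as_matmul = patches @ k                                    # one matrix-vector product

# reference: direct sliding correlation (np.convolve flips the kernel, so pre-flip it)
reference = np.convolve(x, k[::-1], mode="valid")
print(np.allclose(conv_as_matmul, reference))
```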
@al8-.W · 3 years ago
"Being too early is the same as being wrong." Science needs to be rooted in reality. It's its very pragmatism that makes it so powerful, so there is no reason to be sad about great ideas that did not meet their public. What I like to say is that the object of science is to describe and predict. Whenever you are trying to explain something, you are just making up a bedtime story. There is no way any parasitic (e.g. religious or similar) debate can exist in a reference frame where thoughts can only either describe or predict our environment. Let's stop answering questions like "Why does X happen?". Instead we need to focus more on "What would we observe if we did Y?". That's the whole point of making technological and societal changes.
@SimonJackson13 · 3 years ago
Reminds me of the vapourware paper "Introduction to professorial blowseeding in life"
@XetXetable · 4 years ago
If the problem with CPUs was the von Neumann bottleneck, then you could, theoretically, have general-purpose hardware which provably doesn't have anything like that at all. If an FPGA-like architecture executes the same way something like symmetric interaction nets do, then it cannot have any asymptotic disadvantages compared to other architectures. They have a universality property that guarantees preservation of space and time complexity when compiling from any other physically realizable model of computation. I don't know how easy this is to implement in hardware. Most research on hardware interaction nets I've seen seems to be for hypothetical chemical-based computation, which I don't think will go anywhere any time soon. It seems like FPGAs, etc., are not trying to innovate in terms of their underlying model of computation, so they may indeed have their own form of this problem. That being said, this problem can be solved once and for all without entering an infinite regress, in theory. Just make sure your model of computation is universal in an appropriate sense, which CPUs are not.
@woowooNeedsFaith · 4 years ago
29:30 This looks exactly like evolution. I would expect a lottery to look flat...
@migkillerphantom · 4 years ago
The author of this paper needs to open up their computer and take a glance at all the specific purpose digital signal processing hardware on their motherboard. Ranging from the prominent GPU to everything else, with storage devices, memory devices, network cards, board controllers, IO controllers, BIOS, comm protocol ASICs, etc. Many of these things have processor cores in them but they're still application specific, and have bits of non-von Neumann stuff doing computation.
@opiido · 4 years ago
I love the intro to this video
@G12GilbertProduction · 4 years ago
Next Thomson predictable recurrential lottery algorithm who's only looks for classic paper? If we loose more of 1280×650 open entries in this letter, and alpha of this draw means n = 4, till then CPU crashes at 60% memory lose, when will add more entries for this circuits, instead?
@mindaugaspranckevicius9505 · 4 years ago
ASICs? E.g. nobody is mining bitcoins nowadays with GPUs, but there were times when it was feasible; mining then shifted to FPGAs and then to ASICs. Maybe suitable specialized network-training operations could be baked into an ASIC?
@migkillerphantom · 4 years ago
They already are. TPUs are used for both inference and training.
@sq9340 · 3 years ago
31:02 Interesting ideas like NNs are not abandoned (as in the tree); they reappear again and again, until the hardware lottery comes around.
@paulkirkland9860 · 4 years ago
I want to win the lottery, but also enjoy your videos about DL
@sehbanomer8151 · 4 years ago
I thought they would talk about neuromorphic chips.
@swordwaker7749 · 3 years ago
I think the leading machine learning powerhouses already know that they have to build their own hardware to continue developing better AI.
@L33TNINJA51 · 4 years ago
Are you going to go back to making almost daily videos? If so, when?
@jasdeepsinghgrover2470 · 4 years ago
The Road Not Taken.... This is how way leads on to way.
@zikunchen6303 · 4 years ago
Reminds me of this particular comment from Ilya Sutskever: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-13CZPWmke6A.html
@zikunchen6303 · 4 years ago
Also this discussion here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-13CZPWmke6A.html&ab_channel=LexFridman
@ScottLeGrand · 3 years ago
This is a great paper about Google's inductive bias in the hardware wars, but there's no coverage of the biggest hardware lottery win of them all: the continued survival of the x86 architecture, which created the opportunity for GPUs across the board in the mid-naughts. Even back in the 1980s, Atari's graphics-oriented computers were dismissed as "game machines" by the very serious people(tm) who chose to elevate the IBM PC, and that cemented x86's dominance for the next 3 decades. Sad... so many better architectures fell by the wayside on the way there... CUDA was created to break into HPC, and it stormed the figurative walled garden of supercomputing when its only opponent was the Maginot Line of x86. But given I was there at the time, I would say its rise to dominance is due to the happy accident that a processor that kicks ass at LINPACK - through fast BLAS ops, cheap fast FP32, and superfast approximate transcendentals - was the prepared architecture that favored the chance arrival of ImageNet in 2012. If it hadn't been CUDA, it would have been SSE on steroids (see AVX512), and far less cost-efficient.
@444haluk · 3 years ago
I listen to Yannic at 3x speed. It's the only comfortable speed for me.
@jianjianh_ · 4 years ago
This world is just a big lottery.
@LouisChiaki · 4 years ago
Nah, it's more like SGD.
@IoannisNousias · 4 years ago
FPGAs?
@YannicKilcher · 4 years ago
en.wikipedia.org/wiki/Field-programmable_gate_array
@automatescellulaires8543 · 3 years ago
CPUs are very abstract machines. FPGAs are much more closely related to physical reality. So no, you won't go down the rabbit hole ad infinitum; at some point you end up at the most fundamental general hardware level. FPGAs are much closer to this fundamental point than CPUs or GPUs are.
@natesh31588 · 4 years ago
The work in this paper reminds me of this talk LeCun gave at ISSCC 2019 - ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-YzD7Z2yRL7Y.html&ab_channel=ISSCCVideos
@sujanshrestha718 · 4 years ago
yes
@migkillerphantom · 4 years ago
Your videos get recommended to people who think you can tell them how to win the lottery? lol
@YannicKilcher · 3 years ago
no they search for it :D
@ArkadeepBanerjeeOfficial · 4 years ago
1 author... Still "we" 😅
@aduasarebaffour759 · 4 years ago
"We" in research papers refers to the author(s) and the reader.
@tomw4688 · 3 years ago
I thought this video was about winning the genetic lottery on your brain hardware. The last name of the author doesn't help either.
@doritoflake9375 · 4 years ago
So no one's going to talk about the fact that the author's last name is Hooker?
@DistortedV12 · 4 years ago
When quantum machine learning comes to fruition, we will move out of the fourth AI winter
@dermitdembrot3091 · 4 years ago
Quantum computing has limited applications
@NN-kd6wg · 4 years ago
18:10 "... what he calls 'The Lost Decades' and this is ..." - I'm pretty sure the author's pronouns are she/her (edit: I misheard). It's an interesting paper. I guess we're "lucky" that graphics processing has characteristics that overlap with machine learning workloads (not to imply that it is a random coincidence, of course) and that demand for better graphics has been so consistently strong over the last few decades. Otherwise we may have had to wait a fair while longer before deep learning (as a whole) "won" the lottery to the extent that it has.
@YannicKilcher · 4 years ago
I say what *it* [the paper] calls the lost decade :)
@NN-kd6wg · 4 years ago
@@YannicKilcher Oh hah just re-listened and I hear it correctly now