
Next-Gen CPUs/GPUs have a HUGE problem! 

High Yield
Subscribe · 54K
199K views

Published: Sep 6, 2024

Comments: 896
@flioink 1 year ago
Nowadays CPUs have more cache memory than my first PC had RAM. It's amazing how far we've come in terms of processing power.
@uvuvwevwevweossaswithglasses
486 :D
@TremereTT 1 year ago
I think once we find a better approach than speculatively computing all possible outcomes of a program ahead of time in parallel, then throwing away every cached result but one once a result further ahead in the pipeline resolves, we will need far less cache.
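[Editor's note: a minimal Python sketch of the idea TremereTT describes. This is purely conceptual; real CPUs speculate in hardware with branch prediction, and the function below is made up for illustration.]

```python
# Toy model of speculative execution: compute both possible outcomes of a
# branch "ahead of time", then discard the one the resolved condition
# doesn't need. The wasted work is the price paid for hiding latency.

def speculate(condition, taken_path, not_taken_path):
    result_if_taken = taken_path()          # work that may be thrown away
    result_if_not_taken = not_taken_path()  # work that may be thrown away
    # Only once the condition resolves do we know which result to keep.
    return result_if_taken if condition else result_if_not_taken

# Both lambdas execute, but only one result survives:
print(speculate(3 > 2, lambda: "taken", lambda: "not taken"))  # -> taken
```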
@soylentgreenb 1 year ago
But the L1 is still small, and shrinking it no longer makes it faster.
@bricaaron3978 1 year ago
@@uvuvwevwevweossaswithglasses How much RAM did a gaming 486 have, and roughly how much did a megabyte of RAM cost?
@kellyshea92 1 year ago
I just built my first PC the other day and got it to POST on the first try. It literally takes 1 second to boot up. I didn't think the new i9 was so strong.
@Chillst0rm 1 year ago
This is why MCM (multi-chip modules) combined with 3D V-Cache will be so important moving forward. L4 cache will probably make a return too, as something much farther from the die compared to L1 through L3.
@GewelReal 1 year ago
If L4 could work as RAM, that would be a revolution. A few GB of L4 would make buying RAM for light use obsolete. And even with extra RAM it would be a massive performance benefit.
@Coecoo 1 year ago
You say "important", but there is legitimately no excuse for more powerful consumer hardware outside of extreme VR / 4K. Graphical fidelity peaked years ago at the polygon counts that are mainstream today. If anyone bought a remotely mid-range computer within the last 1-2 years and experiences performance issues in games, it is 100% an optimization / functionality problem.
@Technicellie 1 year ago
@@Coecoo I agree with you from today's vantage point, but I wouldn't set it in stone just yet. I don't see what can be improved in graphical fidelity, but just because we don't see it doesn't mean there is nothing.
@dkis8730 1 year ago
@@Coecoo Completely path-traced games are the future, though. And you need the most powerful hardware today to run 4K/144 fps, which gives you optimal smoothness with visibly much better graphics.
@ThylineTheGay 1 year ago
@@Coecoo Companies should definitely be aiming for efficiency, but they don't, and probably won't, because "this won't destroy the planet" doesn't market as well as "oooooh, shiiiiny". Classic capitalism 🙃
@mkatakm 1 year ago
That's why AMD is starting to use 3D V-Cache, which basically stacks multiple cache RAM layers vertically in the same space. As it did with the Ryzen 7 5800X3D, the same technology is coming to the 7000-series AMD CPUs soon as well.
@ClaimClam 1 year ago
techno gobbledygook
@baldwindomestic2267 1 year ago
@@ClaimClam more cache, but stack like burger patty, more stack, more cache/burger
@ClaimClam 1 year ago
@@baldwindomestic2267 understand
@robojimtv 1 year ago
Wouldn't be surprised if GPUs get V-Cache one day too. I think it could solve a number of issues with the RDNA3 chips.
@guytech7310 1 year ago
The issue is dealing with heat when stacking dies vertically. I don't know how much heat SRAM produces, but I suspect it will be a problem. Maybe they can get by with a double stack, but I suspect any additional layers won't have the means to dissipate the heat.
@DigBipper188 1 year ago
AMD cited cache scaling as one of a few reasons they decided to split dies. Cache and some interfaces such as the memory controller don't scale well on their chips when going down a node, which is why their later EPYC and Ryzen 7000 parts have the IO and some cache levels split from the cores: the actual core dies can stay diminutive, while anything that doesn't scale well (e.g. the memory controller, L3 cache and so forth) is produced on a cheaper, lower-resolution process node (say, 5nm for the CCDs and then 16nm or even 20nm for the MCD / IO dies). This is also why the Memory Cache Die (MCD) of RDNA3 is a thing: it doesn't scale well on the current 5nm node, so AMD opted for a larger node for those parts to reduce cost, reserving 5nm for the GCD itself, where they still see density benefits from the finer lithography.
@Jaker788 1 year ago
Well, they don't quite want to go as far back as 16-20nm; they've been progressing their non-logic dies too. For Ryzen 7000 the IO die is 6nm (as is the IO + L3 die for RDNA3): a high-yielding, efficient node that's cheaper than 7nm, basically a refined, faster-to-manufacture 7nm thanks to several layers using EUV. It seems they'll stay there for a while for IO (and cache for RDNA) and keep shrinking logic on new cutting-edge nodes. While density no longer scales for IO, and now memory, supply, tooling, and energy efficiency still factor in. 20nm planar silicon wouldn't be as efficient for L3 or IO.
@rocket2739 1 year ago
''Reduce cost'', yeah, for them. Because on the consumer end, we have yet to see the prices go down...
@Jaker788 1 year ago
@@rocket2739 Technically we saw RX 7000 prices drop a bit below the previous-generation RX 6000. But really, reduced cost means prices won't increase as much as for any competitor that isn't doing the same thing. If this pays off for AMD, and Nvidia takes years to get their own implementation, then Nvidia will be at a cost disadvantage.
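[Editor's note: a toy die-cost model that makes the chiplet cost argument in this thread concrete. A rough sketch only: the defect density and die sizes below are made-up illustrative numbers, not TSMC's or AMD's actual figures.]

```python
import math

def poisson_yield(area_mm2, defects_per_mm2=0.001):
    # Classic Poisson yield model: Y = exp(-D0 * A).
    # Smaller dies are exponentially more likely to come out defect-free.
    return math.exp(-defects_per_mm2 * area_mm2)

monolithic_area = 600.0      # one big hypothetical 600 mm^2 die
chiplet_area = 80.0          # a small CCD-style compute die instead

print(f"monolithic yield:  {poisson_yield(monolithic_area):.1%}")  # ~54.9%
print(f"per-chiplet yield: {poisson_yield(chiplet_area):.1%}")     # ~92.3%
```

Splitting the design also lets the poorly-scaling parts sit on a cheaper node, which compounds the saving.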
@josephsteffen2378 1 year ago
@@Jaker788 Nvidia enjoyed its day in the sun. I remember when the Titanium series (or whatever it was) was released... It was just by chance that I read an article in some online computer magazine, and I recognized the jump in technology/speed/value: Nvidia shot ahead of the pack, not by a few feet or seconds, more like they "lapped" the competition. The stock had just moved from $17/share to $20/share. Somehow I got it all together and told everyone I knew "BUY NVIDIA!". It had just reached $27. I guessed that it could go up to maybe about $129; I figured that was as far as my skill could guess. I don't know jack about the stock market or trading. It was the only time in my life I predicted a stock profit, or suggested a purchase... NAILED IT!
@peceed 10 months ago
@@josephsteffen2378 The same with AMD. Unfortunately I didn't have money to invest.
@damienlobb85 1 year ago
AMD definitely doesn't get enough credit for their forward thinking in this regard. And as highly regarded as Jim Keller and his work on Zen are, it was an engineer, Sam Naffziger, who was responsible for persuading the senior execs to use chiplets on Zen and future AMD products.
@ledoynier3694 1 year ago
.. maybe because they did not invent the wheel? Every foundry has had MCM designs and chip-stacking technologies in the works for the past 10-15 years. We're only just starting to see them hit the market.
@BruceCarbonLakeriver 1 year ago
@@ledoynier3694 And yet Intel was talking about "we're not gluing our chips together..." (even though they've been doing it on Xeon for a while...)
@CommanderRiker0 1 year ago
Didn't Intel do this long ago with the "Crystal Well" 128 MB cache chip, years and years ago?
@HighYield 1 year ago
Broadwell i7-5775C
@1000area 1 year ago
@@HighYield But that's an L4 cache, a known solution that adds cache next to the chip, not stacked cache like what AMD and TSMC are working on right now.
@jabadahut50 1 year ago
Magnetoresistive memory is nearly as fast as SRAM, and there are methods for using it in an analog mode that lets a single cell hold 8 bits. It would be interesting to see this tech get adopted in the future.
@diegorosario2040 1 year ago
Wouldn't it require on-chip error correction to store 8 bits per cell?
@jabadahut50 1 year ago
@@diegorosario2040 Depends on the design, but it might. I'm not 100% sure how it works, but to my understanding it's sort of a magnetic potentiometer with a sort of ADC hardwired to the 256 possible outputs.
@diegorosario2040 1 year ago
@@jabadahut50 The deal with non-binary encoding is that it worsens the signal-to-noise ratio. Error-correcting codes would be needed to mitigate that problem.
@jabadahut50 1 year ago
@@diegorosario2040 Likely, and I'm sure that might trade off some speed, but ECC memory is already usually denser and slower than non-ECC memory anyway, so I don't think it'd be a huge trade-off for 8x the capacity per chip.
@diegorosario2040 1 year ago
@@jabadahut50 It will work storage-wise, but I'm curious whether it could compromise bandwidth.
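[Editor's note: for scale, 8 bits per cell means resolving 2^8 = 256 distinct analog levels, and the standard quantization rule of thumb puts the required signal-to-noise ratio at roughly 6 dB per bit. A back-of-the-envelope sketch, not a statement about any specific MRAM design:]

$$L = 2^{b} = 2^{8} = 256\ \text{levels}, \qquad \mathrm{SNR}_{\min} \approx 6.02\,b + 1.76 \approx 50\ \mathrm{dB}$$

This is why multi-level cells typically pay for their density with slower, more careful sensing plus ECC overhead.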
@coladict 1 year ago
Engineering is always a balancing act. Improving one aspect comes with drawbacks in another. There may be ways to mitigate those drawbacks, but eventually, staying within the same underlying principle of a technology, you will hit its physical boundaries.
@miweneia 1 year ago
This channel is criminally underrated; presenting so much data and such key points in such a digestible, short manner is commendable! That aside, it's actually crazy to think that humanity has existed for thousands of years, yet in only the past 50 we've gone from creating the first CPU to hitting the actual physical limits of its cache, and in another 15 or so years we'll probably hit the physical limits of the CPU's transistor size. Really makes me wonder what technology and chips will look like 50 years from now... Hopefully I'll find out firsthand!
@manojramesh4598 1 year ago
True
@aylim3088 1 year ago
I'd really want to see what a more mature chiplet GPU with 3D cache could do. A bit of a shame that the RX 7900 was a bad launch, but I'm definitely hopeful for the future; besides, I would have been suspicious if the first-ever chiplet GPU didn't launch with teething problems. A shame its issues can't really be called 'just' teething problems, but I'll keep playing the waiting game.
@TheCustomFHD 1 year ago
It seems the hotspot temperature of the AMD GPUs is relatively easy to reduce. Mounting the card vertically seems to fix it, as does more thermal paste. Look at der8auer's video.
@JJAB91 1 year ago
The hotspot issue only seems to affect AMD's own cards; partner cards don't have such issues.
@marsovac 1 year ago
Nice video! But you didn't explain what "SRAM scaling" means in this context and why it is happening. I guess it means that the size of an SRAM cell no longer gets smaller as the process node shrinks. But considering the same applies to some other parts of the chip, like interconnects, this is nothing new. Currently TSMC 7nm and 5nm have almost the same feature sizes, but density is increased on the smaller node. Logic circuits are not packed as close together as physically possible, and that is where their scaling comes from. SRAM has no room to get denser: it is already as dense as it can be, laid out in a perfect grid. At some point the logic circuits in the chip will hit the same problem. So the real problem is that processes keep getting smaller numbers in their names while the transistor gate distance stays the same. The numbers keep decreasing, but they aren't nanometers anymore, and this is what is causing the SRAM problem: something that is already as dense as it gets cannot benefit from increased packing density, only from smaller transistors. Maybe you want to talk about this "cheating" in process naming: the name of a process no longer correlates with the distance between transistor gates. A video about process shrinking and how it has changed over the last 10 years would be informative.
@adityasalunkhe8156 1 year ago
^ Exactly. He should have said SRAM stopped scaling in density, rather than just "scaling". Also remember that the register file and the microcode controller are implemented as SRAM in the execution pipeline; more delay accessing the register file would mean less IPC, and why would you pair faster ALUs with a slower register file or microcode controller? It makes no sense.
@larion2336 1 year ago
Yeah, I don't know that this is as significant as he makes it sound. The entire reason AMD went with a chiplet design in RDNA 3 is that things like IO, and to an extent memory, already don't scale well to lower-nm designs, so they build the core GPU die on the smaller node and use a larger node for the other parts, where shrinking brings no real performance benefit, saving them money. Well, that and it means they can stitch chips together.
@dex6316 1 year ago
The video mentioned that other components of a processor also suffer from scaling issues. However, this is especially problematic for SRAM cells. SRAM not scaling means that boosting performance requires more silicon. That's very bad for the high-performance microprocessor industry, which is the premise of this video. Other components not scaling well isn't as impactful on final designs, because processors don't depend on massive growth of those components; look at cache growth over the years to see why SRAM not scaling is really bad. Also, logic cells don't get denser by optimizing how they are packed together; the cells are reconstructed using different materials to hit desired performance targets at smaller sizes. Logic transistors are in fact getting smaller.
@kotekzot 1 year ago
If feature sizes remain almost the same, what is it about new processes that enables them to reduce wasted space and increase density?
@johndododoe1411 1 year ago
@@dex6316 How do material changes allow smaller logic gates without allowing smaller SRAM cells?
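[Editor's note: a rough sense of the numbers behind this thread, using the widely reported ~0.021 µm² high-density SRAM bitcell of TSMC N5 (N3E reportedly shows almost no improvement). This counts raw bitcells only and ignores sense amps, decoders, and redundancy:]

$$64\ \mathrm{MiB} = 64 \times 2^{20} \times 8 \approx 5.4 \times 10^{8}\ \text{bits}, \qquad A \approx 5.4 \times 10^{8} \times 0.021\ \mu\mathrm{m}^2 \approx 11.3\ \mathrm{mm}^2$$

With peripheral circuitry a real 64 MB L3 array lands meaningfully larger, and that area barely shrinks on the next node, which is exactly the problem the video describes.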
@6SoulHunter9 1 year ago
The information quality of this channel is astounding; I cannot believe it has only 3.4k subscribers. The presentation quality is also very good, and it's improving :)
@marsovac 1 year ago
You would be astonished by how many US viewers won't watch these videos simply because of the accent. I've seen people who won't watch videos by Aussie or British English creators because of the accent, and those are much closer to American English.
@6SoulHunter9 1 year ago
@@marsovac I know. The accent was always fine by me, but after seeing some harsh criticism I started paying attention, and I think the channel is improving in that regard; the accent used to be thicker. And while I don't mind the accent, I know there are channels I sometimes watch without being very interested, just because the voice is smooth and mesmerizing. I'm sure that would help this channel take off. Me? I don't mind; my English accent isn't the best either.
@RM-el3gw 1 year ago
Yes, it's crazy underrated. The YouTube algorithm is what sometimes fails to bring quality content like this to the front, where it belongs.
@padnomnidprenon9672 1 year ago
Lol, yes. I just realized he has 4k subs. I thought it was 90k at least.
@stevewiley3832 1 year ago
For me it is the use of sensationalist wording. He used the words "...approaching death", which implies that SRAM has a functionality problem, even though the issue is a scaling problem.
@K11... 1 year ago
Your channel will grow through the roof soon. You have amazing content.
@theminer49erz 1 year ago
I know, it's great to see so many people interacting and "liking" so fast. The number has grown steadily for some time now, which is great! He deserves it for sure!
@HighYield 1 year ago
It also makes the whole video creation process a lot more fun knowing people are actually going to watch it!
@nutzeeer 1 year ago
Just got a front-page recommendation and I will sub.
@nutzeeer 1 year ago
3841st sub :)
@Hunter_Bidens_Crackpipe_ 1 year ago
Nah
@JosephArata 1 year ago
Die stacking will get rid of this problem: they can use a larger process node for the SRAM while the GPU/CPU cores use the smallest node possible. They'll also likely start using HBM once they go to full-PC-on-a-single-chip designs.
@pacifi5t 1 year ago
Thank you for breaking down this issue. I thought I knew a lot about hardware, but it seems I've only seen the tip of this iceberg.
@dascandy 1 year ago
This finally explains why the CPU core is made on a smaller process than the memory chips, when it used to be that memory chips were the first to shrink (because of their much simpler design).
@alwanexus 1 year ago
You may be thinking of DRAM, which requires different process features.
@JoeLion55 9 months ago
DRAM has always been on an older process than logic, because 1) DRAM cost control is much more critical than for logic and can't afford bleeding-edge fab processes, and 2) the DRAM array has features (like wordlines and bitlines) that use entirely different fab processes and aren't able to scale at the same rate as logic transistor processes. But historically, SRAM was used as the test vehicle for new processes, because SRAM uses (or can use) "normal" logic transistors.
@b130610 1 year ago
AMD certainly seems to have an advantage in the chiplet space because of their past successes with Zen, but I have to wonder how much longer that advantage will last. It would be pretty ironic if Nvidia integrated chiplets into their cards before AMD could leverage that advantage for a clear win at the high end. It seemed like they really had a golden opportunity with RDNA3, but it obviously hasn't worked out that well so far.
@ag687 1 year ago
It's not a chiplet, but Nvidia is already leveraging entire datacenters of cards working together as though they were one supersized GPU. Which means they probably already have the tech they need to do chiplets without too much of an issue.
@b130610 1 year ago
@@ag687 AFAIK the chiplet tech AMD is using has at least a couple of orders of magnitude higher bandwidth than Nvidia's datacenter networking solutions (although those are impressive in their own right). The chiplet interconnects are developed in coordination with TSMC, though, so it's not inconceivable that Nvidia could use tech similar to AMD's as long as they stay in TSMC's good graces.
@sudeshryan8707 1 year ago
I think AMD has already patented the most practical approaches to chiplet design, which leaves others very little room for innovation. Intel struggling for years with their tile design shows it's much harder for others to be competitive.
@b130610 1 year ago
@@sudeshryan8707 I'm inclined to agree with you there, but I'm not ready to rule out something new built on TSMC's packaging technologies for high-speed interconnects. Last year I thought no other chip design firm was even close to AMD on mass-market chiplet designs, but then we saw the M1 Ultra from Apple, with very impressive performance scaling over a whole new fabric. I wouldn't count Nvidia out, but I'm certainly no expert on the matter, just an armchair critic.
@aravindpallippara1577 1 year ago
@@b130610 While Apple's M1 Ultra is very impressive, it has less bandwidth per silicon area, and the interposer itself is extremely expensive tech compared to AMD's Infinity Fabric based inter-die communication. AMD might go patent troll on other companies going forward; not a fan of that happening.
@bananaboy482 1 year ago
The small amount of attention this video has is criminal. Best video I've watched all day! Entertaining, informative in an easy-to-understand way, and well made!
@tqrules01 1 year ago
I don't think it will be an issue for AMD; they are using 3D caching. The 5800X3D is still a beast. Oh, never mind, you already mentioned it. I think in the future they will be able to keep stacking with faster and faster interconnects, i.e. next-gen Infinity Fabric.
@Yuriel1981 1 year ago
I was going to say pretty much the same thing. 3D cache increases the amount of SRAM a chip can hold. It doesn't fix the scaling problem, but it does solve some of the size issues, which is why AMD's chiplet tech is most likely the next step.
@kotekzot 1 year ago
Pretty sure Infinity Fabric is slower than the through-silicon vias used in 3D V-Cache.
@daxconnell7661 1 year ago
Even when early computers were being developed, some discovered you could double the amount of memory in a computer by stacking RAM chips, e.g. piggybacking 4464 RAM chips in the Commodore 64/Apple era.
@spamcheck9431 1 year ago
THIS, right here. I think AMD and Nvidia are going to diverge in terms of utility: Nvidia is going to have to focus on CUDA cores, while AMD focuses on parallel processing. The only thing that might save Intel is if they somehow adopted Apple's chip methodology of targeting specific use cases, with portions of the CPU hard-wired for specific tasks instead of relying on general-purpose transistor logic.
@kotekzot 1 year ago
@@spamcheck9431 Would you explain what hardware features Apple integrates that Intel doesn't? AFAIK Intel and AMD include a lot of extra instruction sets and some accelerators (e.g. for encryption).
@kiri101 1 year ago
I already knew about the topic, but this was such a well-organised video that it was still worth watching. Your pacing, delivery of speech, and the information density of the video are very well balanced. Thank you.
@omegaprime223 1 year ago
My only thought is: "Oh no, application developers will have to learn how to optimize again... the horror." Companies have been offloading optimization work because technology could brute-force things for so long; now that we're starting to see limitations that might stick around for more than one chip generation, corporations will have to optimize existing features if they want to cram in even more.
@zthemythz 1 year ago
We're probably just going to see stagnation.
@towb0at 1 year ago
Super interesting topic. It seems like whoever comes up with the best successor to SRAM will take the cake, once chiplet scaling is fully utilized.
@anepicotter4595 1 year ago
Fortunately we can get a lot more SRAM with AMD's 3D cache method, and it'll definitely work well in chiplet designs even as the core chiplets continue to scale down.
@mnomadvfx 1 year ago
This has been known for a while, and ARM has been looking at using some variant of MRAM to replace SRAM for CPU caches. While this is difficult in a monolithic die, it becomes easier with chiplet stacking, as AMD has already demonstrated with X3D. Not only would MRAM offer non-volatility/persistence for potentially higher power efficiency, it would also offer dramatically superior area scaling to SRAM for larger caches.
@ytviewer267 1 year ago
Apple already has a CPU using chiplet tech: the M1 Ultra, introduced back in March, which stitches two M1 Max chips together into a single package. They aren't currently using it to split off SRAM, but the M1 Max is an extremely large die comparatively.
@HighYield 1 year ago
That's true, but since it's "just" two of the same M1 Max fused together, I put it in a different category from chiplet designs like AMD's, which combine chiplets of different sizes.
@scaryhobbit211 1 year ago
Eh... they'll find a way around the SRAM bottleneck, like they always do. There are the chiplet designs like you mentioned, but I'm also interested to see what IBM's light-based CPU leads to.
@soylentgreenb 1 year ago
Single-core scaling ended when Dennard scaling died. Multicore scaling isn't really working that well either, as real-time consumer applications like games cannot take good advantage of it without increasing latency (hence why 144 FPS today doesn't feel better than 72 FPS did in the 90s; engines are more pipelined). Moore's-law scaling is not holding up that well either; it is about cost per transistor, and wafer prices are almost outpacing density scaling. Light is a piss-poor medium for density of storage and density of logic. Light is very large: a blue photon is 350 nm big, and when you approach that sort of scale you get weird effects like surface plasmon resonance and quantum tunneling. So you either incorporate the weirdness and do something with plasmons, or you make a bus with micron-sized waveguides, a lithography size that hasn't been in vogue since the 1980s.
@amineabdz 1 year ago
@@soylentgreenb So the absolute best photonics can do is non-ionizing radiation? That is the very near ultraviolet range. Either that, or find some way to mitigate the material degradation from using an ionizing wavelength (which AFAIK is impossible, or else the shielding at nuclear power plants would no longer be a concern).
@davidmckean955 1 year ago
Considering we're quickly reaching the physical limits of scaling for all parts of the CPU, we have much bigger problems to worry about in the medium term.
@amentco8445 1 year ago
@@soylentgreenb And what would be the big issue in utilizing UV for this?
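[Editor's note: the size argument in this thread can be made concrete with the classic diffraction limit. A back-of-the-envelope sketch; integrated photonics uses high refractive indices and plasmonic tricks that shift the numbers, but not the order of magnitude:]

$$d_{\min} \approx \frac{\lambda}{2\,\mathrm{NA}} \approx \frac{450\ \mathrm{nm}}{2 \times 1} \approx 225\ \mathrm{nm}$$

So even a blue-light waveguide feature is tens of times larger than a 5nm-class transistor pitch, and going to shorter (ionizing) wavelengths brings the material-damage problem raised above.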
@TheDoomerBlox 1 year ago
7:14 - Probably worth noting that '6nm', in spite of being "adjacent" to '5nm' in its name, is actually a refined-refined version of the older N7 TSMC node seen on Zen 2 chiplets.
@HighYield 1 year ago
You are correct: 6nm is based on 7nm, just like 4nm is based on 5nm.
@sharktooh76 1 year ago
The Nvidia 4000 series is made on 5nm, not on 4nm. 4N is Nvidia's customized node based on TSMC's N5 5nm node. TSMC N4 is 4nm; 4N is *NOT* N4.
@davidgunther8428 1 year ago
I think 2.5D chiplets will stay at the L3 cache level, not the L2 level. There's so much data transfer, and the latency needs to be so low, that L2 on a chiplet would need to be closer/stacked to perform well.
@Eskoxo 1 year ago
I think this could have many possible solutions; how the IBM Telum CPU handles the different caches across a cluster of CPUs comes to mind, or perhaps a separate chip with slightly slower L4 cache, etc.
@MarianRambo1 1 year ago
4:10 You forgot to mention the Ryzen 5800X3D, which has 96 MB of L3 cache.
@SpencerHHO 1 year ago
I thought scaling had pretty much died around the 28nm nodes. It seems AMD has already addressed this issue with chiplets and 3D V-Cache: on all the RDNA3 variants released so far, the L3 cache and the memory controllers (which also don't scale much anymore) sit on separate chiplets built on a cheaper, older node than the main compute die. We will see larger packages from AMD, and costs will continue to rise, but their chiplet designs give them a huge advantage, and Intel is already trying to implement their own version. A lot of the tech AMD uses is co-developed with TSMC and isn't that different from what Apple is using in its M2 chips. I suspect this will only accelerate the transition to multi-die SoCs and 3D stacking. Cache is a lot less energy-hungry than logic, so it makes sense that it's what gets 3D-stacked first.
@youcrew 1 year ago
I think this is why chiplet/tile designs are essential. We will start seeing SoC packages get larger.
@BruceCarbonLakeriver 1 year ago
It is a matter of time until the whole von Neumann architecture fits within a chiplet design. The motherboard will just hold RAM and the peripherals connected to the SoC.
@electronash 1 year ago
This is weird. I just bought a Ryzen 9 5900X to upgrade a 3200G in my second PC. When I was comparing it to chips like the 5800X3D, I noticed the difference in L3 cache sizes and wondered how much area the cache must take up on the chip. I figured that a BIG part of the cost of the chip is the cache, since even 32 MB will take up quite a large area of the silicon. I didn't realize there was a problem with SRAM cell size on the smaller nodes, though. Interesting vid. If only SRAM were somehow smaller and simpler to produce, we would likely never have needed DRAM at all. I've often wondered how fast a PC would be if its main RAM were SRAM instead of DRAM. (Modern DDR SDRAM is FAST, but the latency is still high compared to what I would think SRAM could do.)
@jazzochannel 1 year ago
5:40 "isn't there anything that can be done? great question, so glad you asked" smoothest transition of the year.
@Nahrix 1 year ago
Use SRAM as a physical buffer between cores, and build vertically. The relative physical size difference would mean a larger distance between the cores, separating the hottest parts and allowing better thermals.
@zonemyparkour 1 year ago
When your channel becomes famous, I want to leave this here as proof I was here from the beginning. Great content. Loved your graphical explanations.
@joehorecny7835 1 year ago
Amazing content and analysis! Hopefully they are working on the bandwidth of the chiplets; sounds like that might be the next bottleneck.
@BlenderRookie 1 year ago
Bigger dies are inevitable, along with wider memory buses. Transistors and D latches (or whatever they are called these days) can only get so small, and transistors can only switch so fast. The eventual step is wider word processing and wider memory-word access. But hey, I am old, and when I was into the nitty-gritty of this stuff, CPUs were running typical TTL voltages of about 5 volts. So yeah, I'm expired.
@johnsavard7583 1 year ago
At about 5:55 in your video you finally mentioned chiplet design: if you can't scale static RAM, just put it off the chip. Of course, that involves some additional delay, so you still need the L1 cache on the die with the logic, but it helps a lot.
@xeridea 1 year ago
Yeah, L1 and L2 are probably still best on the same chip since latency is critical, but L3 is a great candidate.
@Tigerfox_ 1 year ago
I feel like we're back in the days of the Pentium II and III.
@BenjaminCronce 1 year ago
@@Tigerfox_ Except that for many workloads, the P2/P3 with smaller on-chip L2 cache was faster than with the larger off-chip cache. The Celeron with 128 KiB of on-chip L2 was faster than the Pentium with 512 KiB of off-chip cache. In that case, the off-chip cache ran at half frequency: much faster than DRAM, but a few factors lower bandwidth and higher latency than on-chip. Going off memory from two decades ago, so take it with a grain of salt.
@Tigerfox_ 1 year ago
@@BenjaminCronce I know all that, but I don't understand what you're trying to say. Of course, for some workloads more cache is better than faster cache; for some it's the other way around. I haven't seen an in-depth analysis of which applications profit more from Raptor Lake's increased cache yet, but I know that, for example, only some games profit greatly from the 5800X3D's 3D cache, just as some games run faster on the Broadwell i7-5775C with its eDRAM L4 cache than on an i7-7700K. They'll have to find a compromise. AMD slightly reduced the size of the Infinity Cache on RDNA3 but vastly increased its speed.
@7rich79 1 year ago
One of the typically advertised advantages of a process node shrink is increased performance, increased power efficiency, or a combination of both. Does this mean that if you cannot continue to shrink the process, SRAM performance will be the bottleneck for newer architectures? What are the alternatives to SRAM?
@46three 1 year ago
Gamers Nexus has an interview with one of AMD's lead engineers, Sam Naffziger, who explains this exact issue as one of the key concerns that chiplet design (and 3D V-Cache) aims to mitigate. An interesting chat for sure.
@46three 1 year ago
ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-8XBFpjM6EIY.html
@TheEVEInspiration 1 year ago
I think some caches will become nearly obsolete to make room for the more essential caches. Think of the separate cache for code, which is indirectly fed from a data cache: it could be changed to store just pre-decoded metadata (like instruction boundaries on x64, or other decoding hints) while fetching the actual code from the data cache when needed. There are more such trade-offs to make, for sure, like cache complexity versus cache size. If cache size is under pressure from this scaling development, expect more complex/smarter caching systems that until now did not make economic sense.
@stevetodd7383 1 year ago
There's a very good reason for split I and D caches: they allow simultaneous fetching of instructions and data. A pure von Neumann design (shared instruction and data memory) can only execute one instruction every other clock cycle (an instruction fetch followed by the data access for that instruction). Modern cores are all modified Harvard designs that allow simultaneous instruction fetch and data access via the two different caches. They are also quite small compared to the later caches in the scheme, so unifying them would save little space. The better solution to the problem is 3D stacking and using simpler/cheaper process nodes to create the cache layers. That actually gets the cache closer to the point of use while letting you increase sizes.
@TheEVEInspiration 1 year ago
@@stevetodd7383 I understand those points, and I think that argument has been losing validity for some time now. Ever since the introduction of the L0 uOp cache, the effect of large L1 instruction caches has been going down, and those L0 caches get bigger every generation! There is a saving to be had there for sure, by making the L1I smaller but smarter: for example by increasing set associativity, or, as I suggested, by storing only meta-info tagging the relevant cache lines in L2 as being used for code. As both L1 caches are fed from L2, there is already concurrent fetch capability at that level. The L1I cache is virtually all about lowering latency for non-decoded instructions! A smaller cache that speeds up decoding would give the same benefit as today's caches. Putting some of that cache area towards a bigger uOp cache would see more benefit, I think (at least that is the trend right now). As for die stacking, that is all about level 2, not level 1 caches. This also speaks in favor of a smaller L1 instruction cache, as the code will be in that extra-large L2 anyway. The L1 instruction cache is simply between a rock and a hard place (the much faster, already-decoded uOp cache and the much larger, extendable L2 I+D cache). And there is another looming trend: sharing massive L2 caches between cores! That could be a huge transistor-count-saving architectural feature.
@stevetodd7383 1 year ago
@@TheEVEInspiration A cache only accesses the next higher level in the case of a miss. At that point there is typically a burst of activity while a cache line is written or read. Because of this, the I and D caches don't typically access the L2 concurrently. Each level of cache has a progressively higher miss cost, and adding multi-port access costs more again. The I and D caches are deliberately small and fast; L2 is larger and slower, L3 larger and slower again. The job of the I cache is to keep the instruction decode pipe fed as much as possible. That pipe produces the L0 uOps, but there's a higher penalty if L0 misses and you have to go all the way to L2. The job of the D cache is to keep the data needs of the uOps fed as much as possible, again avoiding the need to go to L2. There's a reason we don't have just a single layer of cache: big and complicated caches are slow. Cache models are a trade-off between the need to maximise hits and the time to return cached data. Oh, and to add to that: the L0 cache holds VLIW-style decoded instructions that are far from compact. You'll not get efficient use of space if you try to boost L0 enough to make up for having no I cache.
@horusfalcon 1 year ago
An interesting presentation! I wondered when something like this would happen. Now whoever develops a more scalable SRAM will wind up being the performance leader, unless other techniques prove much more cost-effective.
@Kevin-jb2pv 1 year ago
Unless we see some new paradigm shift in computer hardware, these limitations are why I think we're probably heading into an era of offloading CPU functions to dedicated co-processors. We already did it with GPUs, and bitmining proved that certain functions are better handled by dedicated hardware and can be done cost-effectively. Plus, Nvidia has been selling dedicated, specialized GPU hardware for AI for years now. I think we're going to start seeing more processing handled by specialized units as demands grow. Exactly which functions? I can't say. For gaming, physics is the first thing that comes to mind, but PhysX was already a thing that failed and then got absorbed back into GPU hardware. Perhaps we'll see a return of discrete physics units? We also have dedicated AI chips out there, and I believe one of the things they get used for extensively is processing image data in some phones. It was heavily marketed a few years ago by several major players, but I don't know if that's still done on current-gen phones. The point is, manufacturers have already done it and are at least trying to find other applications to offload to dedicated silicon. So far, the physical limits of semiconductors haven't hit the brick wall we've been warned about for years. Progress has slowed, but manufacturers have been able to use other tricks to get generational improvements in computing power, so the wider industry and enthusiast community hasn't had to feel the pain quite yet. Who knows, maybe manufacturers will be able to keep squeezing more cycles out of what we have right now for many years, simply because they will actually have to do real work on architecture now that they can't just fall back on shrinking their transistors (and this, for the most part, is what we have been seeing; it's just a matter of how long they can keep doing it). But I think that when backs are really pushed up against the wall, we'll start seeing more radical solutions brought to market. The fact that Moore's law is just about done likely means we're about to see a _boom_ in innovative and creative solutions, because the "safe" path is no longer viable and corporate leadership will be forced to try new things to stay competitive.
@samghost13 1 year ago
A big light switched ON in my head. Thank you very much, Sir!
@RM-el3gw 1 year ago
Very informative as always. I believe there are multiple physics aspects of semiconductor tech being pushed to their limits right now. Cheers.
@jtjames79 1 year ago
Good. Necessity is the mother of invention. It's actually a problem that substrates only change when they absolutely have to.
@Raven-lg7td 1 year ago
OMG, I never heard about this before, and I'm subbed to MLID, AdoredTV, Coreteks... You're a real hidden gem, please keep it up! This is so interesting.
@zxuiji 1 year ago
Well, they can potentially create "ERAM": using electrowetting and light rays, it is possible to create a fast read/write byte with minimal power usage. Using just the position of the light caught, one can determine 0 or 1; one could also try storing an entire unsigned integer/float via the strength of the light caught.
@abdulhkeem.alhadhrami 1 year ago
I thought that by 2015 we would already have high capacities, near 500 MB, and 1 GB by 2020, and sure enough here we are, barely close to 100 MB on AMD and under 50 MB on Intel consumer chips, not counting server or Threadripper chips.
@NoneofyourBusiness-ii1ps 1 year ago
Well, there is also a physical limit on how densely you can store information, which happens to be the number of bits found by counting the Planck squares on the surface of a black hole. Basically, if you pack too much information into a given space, it will collapse into a black hole, literally...
@kotekzot 1 year ago
I wonder if Zen 5 is going to have any L2/L3 cache on the die, or are they going to stack it all on top of the die.
@nezbrun872 1 year ago
Good video, but I would have liked the "physics problem" explained, and why it specifically affects SRAM cache. I can understand the analog limitation, as "lumped" parts like resistors, capacitors and inductors need chip area, but why is SRAM cache special? It's digital logic, just like the CPU. An SRAM cache bit is typically six or eight transistors: what is the "physics problem" that stops this from scaling?
@HighYield 1 year ago
If you are looking for more in-depth information, here's an article from 2015 that talks about the problems of SRAM scaling in detail: semiengineering.com/moore-memory-problems/
@Razor2048 1 year ago
What are your thoughts on CPU makers adding HBM to CPUs, where it effectively becomes a massive level-4-type cache?
@tyaty 1 year ago
Intel is already planning to launch those in the near future (Xeon Max).
@Themisterdee 1 year ago
Very interesting, thank you. A dumb thought, I know, but won't that mean rectangular chips are soon to be obsolete? There must be a finite limit to SRAM gates/wires per nm along an edge. As in, if the shrunken dies get smaller, higher logic cell density means more ports per nm, and thus more 'wires' to the SRAM edges. I'm assuming you would quickly run out of room.
@tomtomkowski7653 1 year ago
Let's wait and see how well this 1nm non-silicon process TSMC and MIT are working on will perform. And yes, chiplets are the way to go; the question is how well the different companies will develop this idea with their different approaches.
@runeoveras3966 1 year ago
Great video! Thank you. Hope you enjoy the holidays.
@chibby0ne 4 months ago
This answers why chiplet designs have become so popular lately. Thanks a lot for the well-conveyed and duly researched video.
@jabezhane 1 year ago
I remember "the issue with going lower than XXnm" back in the mid 90s, and then "the near impossible task of going past XXnm" in the early 2000s... and so on. We keep going somehow.
@xeschire706 3 months ago
An easy answer to the SRAM cache debacle: switch to 1T-SRAM, a hybrid SRAM/DRAM embedded memory, essentially an embedded version of PSRAM (pseudo-static RAM) and the best of both worlds put into practice, as it has performance comparable to SRAM while offering memory densities comparable to eDRAM. And since the cells pack into roughly a sixth of the area, this type of memory, used as a cache, would offer more cache memory in a smaller area, while also having lower latency than a standard SRAM cache. 1T-SRAM-based caches are the way to go onwards, especially when combined with vertical stacking and MCM.
@ChiquitaSpeaks 1 year ago
I'd like to know if the importance of cache/SRAM has different implications in an SoC, but I guess Apple's decision-making offers some insight into that?
@NootNoot. 1 year ago
As for chiplets, and specifically future RDNA designs, I wonder if moving the MCD from N6 to N5/N3E would even be worth it. And although it seems TSMC has hit a dead end with SRAM scaling, I wonder how well the other foundries are doing. For example, as you say, Intel is using some TSMC manufacturing for Meteor Lake, and I wonder if Intel has more efficient SRAM scaling. This also raises questions for Nvidia's Blackwell. They've benefited a lot from moving from Samsung's 8nm node to a custom TSMC N4 node. While I don't doubt Nvidia will take the performance crown again, I feel like the 4000 series has benefited a lot from the silicon. Will they also go for a disaggregated design, or will they pull some black magic with further increased power draw? Btw, I think the thumbnail is great lol.
@dra6o0n 1 year ago
Nvidia hasn't got much CPU experience to do proper chiplet designs like AMD or Intel do, and Apple just brute-forces its engineering with lots and lots of R&D money, poaching talent for it. Otherwise Nvidia would have pushed for chiplets sooner, instead of showing a proof of concept once and then forgetting about it.
@635574 1 year ago
Maybe even more impressive are neuromorphic chips, where compute and memory sit in the same place on the chip and processing is asynchronous.
@tjtjmich16p 1 year ago
Dude, your channel is going to explode with subscribers and viewers. It's already happening: YouTube's algorithm is recommending your channel to me and many more tech nerds out there, so expect huge growth; you'll reach 100 thousand subs before you know it. Awesome content, by the way. Really well-edited and well-thought-out videos. And I really like your accent; it makes you sound like a tech company owner.
@HighYield 1 year ago
It's a bit overwhelming right now, to be honest, but I'll manage. Thanks for the kind words!
@HazzyDevil 1 year ago
Love the way you present these videos. About time I subscribed :)
@rayraycthree5784 1 year ago
Why can't the same transistors used in the ALUs, LUTs and controller be used to build flip-flop cache memory?
@okman9684 1 year ago
If the processor is the office building for the employees, RAM is the CPU's parking lot; quoting crazy numbers for process nodes means nothing if you end up needing ever more cache and RAM.
@kenohara4574 1 year ago
This channel has 5.16k subscribers on Dec 27. I am writing this as proof of how good and informative this channel is and how fast it will grow. This channel will hit 1 million within a year, mark my words :)
@deusexaethera 1 year ago
SRAM not being able to scale down anymore doesn't mean it's dead; it means it's fully optimized. Those are almost, but not quite, exact opposites.
@paulsim7589 1 year ago
I knew this from other hardware videos, but I watched anyway as it's quite relaxing and easy to listen to. Your format for explanation is very good. Thank you.
@SupraSav 1 year ago
Solid video. Hope your channel blows up, brotha.
@ralfbaechle 1 year ago
Another strategy to work around limited cache performance is to add another cache level. I've worked on a product that had a rather large L4 cache implemented with DRAMs. It worked well enough to avoid the immediate issue, which was respinning some custom chips that would have been crazy expensive. In the bigger picture, L4 caches are rather rare. Another limitation of cache implementations is size. High-end microprocessors of the 1980s had a 64 KB I-cache and a 64 KB D-cache, and with only small deviations those sizes have remained almost a natural constant of microprocessor design; basically, 64 KB per cache turned out to be the sweet spot. Smaller may make sense for cost reasons, but larger primary caches get physically larger and thus move further away from the CPU, resulting in slower access, which is rarely a good design choice. The result is the epidemic of multilevel caches: L2 caches popped up in the early 90s, L3 not much later. I think the Alpha 21164 (EV5), released in 1995, was the first microprocessor with an L3 cache.
@frankklemm1471 1 year ago
L1 and L2 cache on-die, L3 cache tags on-die, L3 cache data off-die but on-package. Plus a DRAM-based, OS-controlled L4 cache with several thousand banks.
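[Editor's note: the textbook way to see why an extra cache level helps is average memory access time (AMAT). The hit times and miss rates below are made-up illustrative numbers:]

$$\mathrm{AMAT} = t_{L1} + m_{L1}\bigl(t_{L2} + m_{L2}\,(t_{L3} + m_{L3}\,t_{\mathrm{mem}})\bigr)$$

For example, with $t_{L1}=4$, $t_{L2}=12$, $t_{L3}=40$, $t_{\mathrm{mem}}=300$ cycles and miss rates of 10%, 20%, 20%, AMAT = 4 + 0.1(12 + 0.2(40 + 0.2 · 300)) = 7.2 cycles. An added level pays off whenever its hit time is lower than the miss penalty it hides, which is exactly the reasoning behind a DRAM-based L4.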
@anomalousresult 1 year ago
SRAM size scaling has been slowing down ever since FinFET was adopted.
@HighYield 1 year ago
Do you think it might change with GAA?
@cyber_robot889 1 year ago
Wow, I've been into PC hardware since about 2003 and never ever heard about SRAM. Thank you for the new and interesting information. A real revelation right under my nose, lmao.
@agw5425 1 year ago
Just like in SSD/NVMe storage, the solution is to go vertical. A 3D SRAM cache on a separate die can be 10x taller just by adapting how the compute die is mounted. Even 100x the thickness only adds about 0.5 mm in height, easy to accommodate with a tailored heat spreader. Just imagine how much SRAM cache you could fit into 1 cm³ when each layer is built on a 4nm-class node; a custom heat-removal design would make even that possible, if demand for that much SRAM cache ever appears.
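[Editor's note: the height claim checks out if each stacked die is thinned aggressively. Assuming, say, ~5 µm per thinned layer (an illustrative figure; actual stacked-die thicknesses vary):]

$$100 \times 5\ \mu\mathrm{m} = 500\ \mu\mathrm{m} = 0.5\ \mathrm{mm}$$

Note that "4nm" names the node, not the physical layer thickness; a bonded die is thousands of times thicker than its transistors, and, as raised earlier in the thread, heat extraction from the inner layers remains the hard part.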
@TopHatProductions115 1 year ago
An SRAM cache chiplet? Not sure how to address this issue... EDIT: lucky guess, made while watching the first half of the video!
@HighYield 1 year ago
Don't sell yourself short! Your guess was smart, not lucky.
@JorenVaes 1 month ago
I was told one of the reasons SRAM no longer scales well is that SRAM used to be drawn at much higher density than regular logic. SRAM usually has separate DRC rules in a process, making use of all the fancy regular-pattern techniques and so on to reach that higher density, so the density achievable in SRAM was much higher than in logic. When the process scaled, the minimum sizes of general structures mostly scaled with it, so by using the same 'hacks' as in the last generation you could scale the SRAM too. But now the logic itself is starting to use these 'physics hacks' to scale, with things like fixed metal-line patterns plus cut masks for the lower-metal wiring to reach higher density on newer nodes. So in a sense it is not that SRAM is no longer scaling; it is that, to get logic to scale, they are using the same tricks cache has relied on for a long time already. As a result, the toolbox used to make high-density SRAM is no longer scaling, because the fundamentals are not changing.
@rocktech7144 1 year ago
The dependability of SRAM is paramount to the dependable operation of the CPU. The physical limits of static RAM technology are hard to surpass with current materials. 3D-stacked RAM will be the only way to temporarily circumvent these limits and still get the RAM throughput the new CPUs require. The next step will require a quantum leap in materials technology. What will it be?
@Vermiliontea 1 year ago
Well, it's all true, of course, but I don't know if focusing so much on the slowdown of SRAM scaling amounts to a complete explanation. The thing is that scaling is slowing down overall, due to physics, and those parts which were aggressively scaled down from the start are going to slow down first. Chiplets solve the problem of keeping yield up and costs down as total silicon area increases. On the other hand, they add complexity, steps and distances in manufacturing. But obviously the costs and problems of that are much more reasonable and under control.
@Clark-Mills 1 year ago
Chiplets mean better yield as well. Trying to make new silicon that size in one piece is impractical; being able to cherry-pick the known-good bits and glue it all together is a nice workaround.
@DivusMagus 1 year ago
This could mean AMD will have a big advantage, as they have already done a lot of the research on chiplet design and manufacturing, so they are ahead of the curve. But with both Intel's and Nvidia's insanely deep pockets, they can just throw a ton of money at the problem and catch up quickly.
@necromax13 1 year ago
AMD's chiplet design is co-developed with TSMC, so anyone who has their silicon produced by TSMC will benefit directly and indirectly...
@abelgerli 1 year ago
Actually I am not surprised, because when you divide the structural size of the manufacturing nodes by the ~0.22 nm van der Waals radius of a silicon atom, you get only about 14 atoms. The structures are getting so small that you get manufacturing problems, as well as quantum-mechanical problems, once you get to sizes below about 5 atoms. This is a guess, but chemical bonds work via quantum-mechanical properties, and getting that small puts you in that ballpark, size-wise.
@GreggRoberts 1 year ago
My old Packard Bell used SIMM RAM. I remember it because the Cyrix CPU required it to be installed in pairs, like Rambus did years later.
@griffmason8591 1 year ago
If the chiplets get smaller and you use more SRAM, then the die is still scaling. 3D or vertical build processes also still allow scaling. Let's say the die is 2 chiplets and 60 SRAM modules: if you stack it like a skyscraper, 10 chiplets high, that is still scaling. Adding multiple chips, like servers do, is still scaling. If you cannot go smaller, you go with more. That is scaling too, not just the smallest process of the current day.
@rightwingsafetysquad9872 1 year ago
As AMD has been unable or unwilling to give themselves a market price/performance advantage over Nvidia with RDNA3, and having lost their CPU advantage to Intel Raptor Lake, the future once again looks bleak for them once Intel and Nvidia figure out chiplets.
@growthmonger4341 1 year ago
Great information and no BS; will definitely drop by again.
@intetx 1 year ago
3D stacking might never use a different node for the stacked die. The problem is that two different nodes bend differently, which I imagine could cause issues when bonding them directly.
@rahcxyoutube 1 year ago
I absolutely love your videos, keep it up!
@mjdevlog 1 year ago
Great video! I really appreciated the thorough analysis of the potential problems with next-gen CPUs and GPUs. It's important to consider these issues and keep a critical eye on new technology. Keep up the excellent work!
@stefanbanev 1 year ago
CPU-RAM latency and throughput these days are remarkably low and high, respectively; 5%, 10% or even 20% relative improvements are still possible, but we are asymptotically approaching saturation... The next improvements lie on the massively multi-core frontier, and after that, quantum computing in our gadgets...
@freuk_ 1 year ago
Very interesting video. I don't think this situation is a problem; software and hardware technologies have mostly taken a direction of modularizing every system. It may get complex, but I think that's actually good news.
@shyamdevadas6099 1 year ago
Very fascinating video. Well done!
@pirojfmifhghek566 1 year ago
I'm actually looking forward to the day we finally reach the limits of what a manufacturing node can accomplish in terms of feature size. At that point, the costs and methods for making the most cutting-edge chips will really start to proliferate. The only thing that can improve a chip from then on will be the architecture of the silicon itself, which is where we sorely need improvement. It'll also be a good time for the Windows developers and the chip designers to come together and finally pare the old x86 instruction set down to a _standardized_ reduced format. We are also going to need more purpose-built components in our computers soon, to give them more integrated utility rather than raw speed. I'm most interested in what companies like Mythic have been doing with analog chips, because they've managed to use older process nodes to create insanely efficient chips that do very complex AI compute tasks. We've been leaning too heavily on CPUs and GPUs for these tasks, and a lot of that work could be offloaded to newer, purpose-built components.
@ilyarepin7750 1 year ago
Or they could stop wasting money on the diminishing returns of miniaturizing silicon, which is already close to its theoretical limits, and instead work on commercializing a new approach to computing, like photonic or carbon-based chips.
@pirojfmifhghek566 1 year ago
@@ilyarepin7750 That would be a welcome change. I don't know how far along the research is on those technologies, though. It may be that we hit the limits of silicon miniaturization long before carbon or photonic chips make their way into consumer devices. I just hope hitting the manufacturing wall means cheaper silicon for a while.
@zyxyuv1650 1 year ago
One thing I don't understand: each chip on my DDR5 stick stores 1 or 2 gigabytes, and my microSD card stores 1 terabyte... but somehow 64 MB of L3 takes up that much space on a CPU/GPU die? Is the die area used by 64 MB of SRAM really 16 to 32 times smaller in mm² than the die area used by the 1-2 GB modules on a DDR5 stick? Or does SRAM somehow use 10 times more die area than the equivalent DRAM? Or do SRAM and DRAM use the same area, and CPUs/GPUs just have so little space available that even allocating the tiniest area for 64 MB is a huge deal? If SRAM doesn't use more physical space than DRAM, then why can't we have a chiplet with 1 gigabyte of L3? Does SRAM cost 10x more to manufacture than DRAM?
@proxis9980 1 year ago
The data doesn't take more space; the logic to access each and every bit takes a lot of space. Think of it like this: on the stick you have one guy patrolling a whole train of people, and you tell him "please pull out the guy with the red shirt", while in the CPU you have a guy standing next to every seat asking "please pull out the guy with the red shirt". You have to employ a ton of guys for that one train, AND you drastically reduce the number of seats the train can reasonably have if you always have to make room for the guys standing next to each seat. :D Also, in practice, on your stick the procedure is that every passenger in a train car (a row access) has to leave the train and go through a terminal, so it's slow by comparison.
@zyxyuv1650 1 year ago
@@proxis9980 It sounds like you're saying that, unlike DRAM's tiny bus width, SRAM's "bus" is so wide that most of the die area is taken up by logic allowing 1:1, low-latency access to any SRAM bit, and not by the actual storage of the bits. I wonder what the die-area multiplier for low-latency SRAM is, i.e. does it take 10x more space in total to store the same data while allowing low-latency access to any SRAM bit?
@sudeshryan8707 1 year ago
A 1-bit SRAM cell consists of 6 transistors, while a 1-bit DRAM cell is just 1 transistor + 1 capacitor. SRAM caches are also hundreds of times faster than DRAM. Modern DRAM chips are made on a 20-30nm-class process, which can't be properly scaled to smaller nodes; the capacitor of a DRAM cell is not that scalable.
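[Editor's note: rough numbers behind this answer, for scale only. A 6T SRAM bitcell on TSMC N5 is widely reported at ~0.021 µm², while a modern 6F² DRAM cell at F ≈ 15 nm is ~0.00135 µm², roughly 15x denser per bit; 3D NAND then adds 3-4 bits per cell across 100+ stacked layers.]

$$1\ \mathrm{GB\ of\ SRAM} \approx 8 \times 2^{30}\ \text{bits} \times 0.021\ \mu\mathrm{m}^2 \approx 180\ \mathrm{mm}^2\ \text{of raw bitcells alone}$$

Which answers the question above: a 1 GB L3 chiplet isn't practical because SRAM's six transistors plus the low-latency access wiring simply occupy far more area per bit than a DRAM capacitor or a NAND string, before cost per wafer even enters the picture.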
@Ryan.Lohman 1 year ago
I remember when CPUs had a 256-512 KB L2 cache limit back in the late 90s / early 2000s.
@SB-pf5rc 1 year ago
As someone who follows both computer channels and bike channels, the thumbnail for this video was very alarming. SRAM is, like, the biggest brand in the mountain bike space.
@HighYield 1 year ago
Oh, I didn't know that. Hope there are no "clickbaited" viewers who are mad when I don't talk about bikes 😬
@SB-pf5rc 1 year ago
@@HighYield No problem, friend! I thought it was funny once I realized. 'SRAM' is a weird combination of letters; what are the odds? I discovered your channel recently and love what you're doing. Thank you.
@tsclly2377 1 year ago
Heat will force lower speeds with chiplets. Not such a problem for L3 with predictive loading, but critical for internal stacks, registers and L2, in that order.
@Enkaptaton 7 months ago
So we should be satisfied with what we already have? Do programmers finally have to make their code efficient?
@nagi603 1 year ago
I wonder how difficult it would be to do 3D V-Cache but with the layers sourced from different nodes.
@HighYield 1 year ago
Actually, that's already being worked on by TSMC and AMD. I'm not sure if the 3D V-Cache on the new Zen 4 X3D CPUs is also produced on 5nm; it could be a 6/7nm node. I'm trying to figure that out right now.
@Vatharian 1 year ago
Modern CPUs have more L2 cache than my first PC had hard drive capacity. That's insane to me. Chiplets are only a temporary workaround; the only way out I can see is stacking the SRAM under the compute. 50-layer ICs and heat-management headaches: go!