@@longrove5710 I wonder, if AMD officially stated they were going to do so, as in "we're releasing it next quarter" coming from Lisa Su, what would the reaction from Intel and Nvidia be? Maybe the former would turn to their Arc department, or rather the Ponte Vecchio team with Xe-HPC, for tiled integration into the CPU design, an in-house updated i7-8809G descendant, and the latter would release to the general (as in average Joe) public something like a souped-up child of the Jetson industrial "board" and the Grace Hopper HPC module...
Instinct is GCN evolved minus all the graphics bits... so there's no way it's going to be any use for everyday computing (and CDNA also evolved from the observation that graphics workloads don't fill up the pipeline)
@@marcin_karwinski They already are, but Intel's graphics capability is stretched rather thin (Meteor Lake will run proper Arc Alchemist) by graphics ambitions that aren't exactly turning a profit
6:10 Another reason for the -S SKU is for some customers like my last company. Because of processor data leakage vulns, we refused to do sensitive computing (and 60% of what we did was sensitive) in the cloud on shared hardware with SMT enabled. That meant we ran on expensive dedicated hardware instances until the CSP gave us non-SMT instance types.
1. Why did you remove your Cerebras+G42 video? Did you accidentally say something you shouldn't have? 2. Please ask Cerebras about the shape of their chips. Have they investigated the possibility of making them full-wafer (circular) rather than square? Unusual shapes don't make sense for most chips because of dicing difficulties, but in the case of a full-wafer chip, the form factor doesn't seem to be a limiting factor.
So back then, did AMD run lower clock speeds as a strategy to pack in more cores, or was performance actually limited by their architecture? Very curious about that
AMD is clearly going BIG in HPC with the MI300 variants by the end of this year. It's quite obvious that healthy competition with Nvidia's, as well as Intel's, enterprise accelerators is much needed. My amateur "noob" prediction is that within 10 years we will see chips with enough cache to replace main memory, starting with enterprise and followed by the consumer market. The race to high-speed, high-capacity cache-like memory is just as important as chip architecture improvements, especially now that it's been confirmed the current cache design can't be shrunk much further, which is why it makes sense to stack it vertically like AMD's V-Cache. Future chips running with memory that has no, or only tiny, bottlenecks are the dream of every enterprise and consumer computer designer. Chips with more than enough cache, with or without HBM alongside, are in my opinion the right way to fully utilize processor capabilities and performance, unless a disruptive new technology like memristors is broadly adopted. We'll see what comes first, but memory improvements are very much needed as well.
You know, when they teased Genoa-X I said something along the lines of "Remember the Xeon X5698, that -HPC- HFT-focused Westmere-EP that had 4 of its 6 cores disabled so the remaining 2 cores could have all that L3 cache and run at 4.4 GHz instead of 3.6 GHz? I would love to see a spiritual successor to that in Genoa-X." And well, thank you AMD. I know not that many customers need that cache, but I sure love that you did it anyway
With the next shrink and Zen 5c we may be back to 12 CPU chiplets (with 16 cores each) = 192 cores / 384 threads, maybe even with 2 or 3 layers of V-Cache on top, if the latency penalty is dwarfed by performance gains in some apps. That's the beauty of AMD's modular system: AMD can react very quickly to customer needs. AI accelerators will be the next thing AMD integrates; AMD is currently behind Intel in this regard, but I guess we will see it with Zen 5
ROCm totally sucks right now; I went and bought an Nvidia GPU for that reason. Communication with customers is also lacking. I really hope they put the required investment into the software side of things for AI, because great hardware goes to waste without software support
Will they be bringing the shrunken Xilinx media engine and AI to consumer GPUs next gen? It's very much needed for AMD to compete on equal footing with Nvidia in the consumer GPU space
Now if only they released a Bergamo-based dual-chiplet Ryzen processor with 32 highly efficient cores and 64 threads on a desktop platform... and a mixed processor with one Genoa-X-based chiplet, so X3D, and one with an efficient multitude of cores... 24 mixed cores, all supporting SMT and with exactly the same instruction support: 8 favoring games and memory-intensive loads, and 16 for more efficient multicore, workstation-y loads. Since both of these chiplets are more power-efficiency oriented than the regular high-clocked Ryzen ones, maybe the power draw would be the same as the current X3D SKUs... and the higher-clocked RAM in a desktop system would help somewhat negate, or at least alleviate, the Bergamo-based chiplet's L3 deficit.
AMD has given themselves a level of design flexibility never seen before... At this point they can actually use Zen 4c/5c for desktop and laptop machines, or a mix of 4/4c or 5/5c... I think they left a lot out of the Instinct presentation to keep Nvidia in the dark... The biggest change needed is a mixed-precision engine that can do 4/8-bit INT for inference... But it seems realistic that they will look to Xilinx for inference and push Instinct for training and AI... Having AVX-512 is a huge boost for EPYC, and for the MI300A I think it will be great for local execution: rather than loading instructions through the root CPU, the APU can load the instructions locally for an even greater speedup... They will need aware software and applicable drivers, but it should speed up execution by 2x at least...
For heavy HPC workloads (stochastic dynamic programming for hydropower plants) that are floating-point limited, the SMT threads are worse than useless; they slow down the computation. So we typically turn them off in cloud environments that give us access to such machine configurations. Machines with no HT will be very welcome!
You don't need to turn off SMT (HT is the Intel term); you can use cgroups to limit which CPUs your workload runs on. This also applies to vCPUs (you can control through cgroups which CPUs the guest runs on). So really it's a trade-off between a ~9% reduction in CPU cost (which is only part of the cost of a blade) and the flexibility to either supply or not supply SMT to your guests.
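The idea of confining a workload to specific CPUs without disabling SMT can be sketched without a full cgroup setup. This minimal Python example (Linux only) uses `os.sched_setaffinity` as a stand-in for a cpuset cgroup, and reads the kernel's sysfs topology files to find a core's SMT siblings; the fallback when sysfs isn't available is an assumption for portability, not part of the cgroups mechanism itself:

```python
import os

def smt_siblings(cpu):
    """Return the set of hardware threads sharing a core with `cpu` (Linux sysfs)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    try:
        with open(path) as f:
            text = f.read().strip()
    except OSError:
        return {cpu}  # topology not exposed; assume no SMT siblings
    cpus = set()
    # sysfs format is e.g. "0,64" or "0-1"
    for part in text.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return cpus

def pin_to_cpus(cpus):
    """Restrict the current process to the given CPUs, like a cpuset cgroup would."""
    os.sched_setaffinity(0, set(cpus))
    return os.sched_getaffinity(0)

# Example: keep the workload on CPU 0 only, leaving its SMT sibling(s) idle.
allowed = pin_to_cpus({0})
```

A real multi-tenant host would do the equivalent by writing to `cpuset.cpus` in the guest's cgroup, but the effect on the scheduler is the same kind of restriction shown here.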
Multiple threads per core (hyper-threading) are also a potential vector for side-channel attacks. Remember how Intel recommended disabling HTT during the Spectre/Meltdown fiasco? Having HTT disabled in hardware already is a definite security benefit for a number of applications run on dedicated cloud hardware. And these folks care _much_ more about security than CPT.
AMD designed for SMT from the start, which meant threads were tagged so that every thread had its own execution space and couldn't get access to data from any other thread, real or virtual...
@@ChristianHowell While they dodged the OG Spectre/Meltdown, there have been many speculative-execution vulnerabilities found in AMD CPUs since. It doesn't inspire confidence, even if SMT is (or rather currently appears to be) technically safe.
Imagine a 4x4 Zen 4c setup with a 1 GB "Adamantine"/"Telum"-like virtual L2 cache layer per CPU, and 1 MB of L1 per core, in a ~50 mm² silicon package. I really want 32 cores + 64 RDNA3 CUs in a single APU package, probably 120-170 W, and 16 cores + 40 CUs in a 15-28 W package, but most importantly a 6-core/12-thread package + 24 CUs at 7-12 W or so for laptops (these would be dies with defective cores, run at low frequency)
Saying Intel's HBM parts are exclusively for HPC and not for "technical computing" so you shouldn't compare HBM to X3D is certainly very clever from AMD's marketing. Kinda horseshit, but clever.
L3 cache vs HBM would mean even more niche wins for AMD vs more general-utility wins for Intel. It's a question of size vs speed: L3 is the smallest but fastest, HBM is next, and then RAM. So it's just a matter of picking whichever one suits your workload.
50 mm² dies? Rock on! Where do we go once we get smaller than 1 nm? It's getting hot and stuffy in here. Can't we press an AI Easy Button? I need to take a nap.
For ROCm's best chance of success, it would help if they embraced the smaller-scale creative market with their GPUs. CUDA is the main reason I don't buy AMD GPUs, since I need it for 3D work and other creative apps; there's also loads of local AI stuff Nvidia GPUs are good at. It's a difficult choice, I'd imagine, but that extra energy and inflow would assure ROCm gets regular attention and broader adoption.
Yeah, I've decided not to go AMD on my next upgrade for the same reason. It's taken them too long to do anything to catch up. I still think that for games they have the edge in cost effectiveness, but at this point I'm very unwillingly going to pay the Nvidia tax to get CUDA on my desktop, so I don't have to keep using my much slower laptop for that.
They've been starting to (there is official support for the RDNA workstation GPUs and latest gen consumer ones, alongside unofficial support for the last couple generations of consumer ones), but progress has been painfully slow... I wonder if this is one of those areas where AMD needs to be willing to bite the bullet and invest a lot of $$ into their software/library dev/packaging side in order to gain a much larger reward/not lose out on future market share. Presently that doesn't appear to be the case, but I'd be real happy to be proven wrong.
AMD's current problem is that they have 2 different architectures: CDNA for massive compute and RDNA for graphics. Improvements for rock stable and highly optimized ROCm on CDNA won't do much for home and small scale compute on RDNA.
At the driver and lowest level software side maybe, but the gap being discussed here is mostly at higher levels (e.g. think PyTorch and its direct dependencies). In that context differences between programming against CDNA and RDNA are not as significant, certainly not much more so than working with consumer vs data center cards on the Nvidia side. That's why AMD is able to maintain ROCm support on RDNA for a lot of the AI and HPC stack with a relatively tiny software team.
Will you cover Intel's recent announcements regarding APX and AVX10, too? It would also be great to know what AMD thinks of these extensions and if it will take half a decade for them to support these or if adoption will be faster than with AVX-512.
The good thing about AVX10.1 is that it is feature-equivalent to Sapphire Rapids’ implementation of AVX-512. So a piece of software that targets AVX10.1/512 or AVX10.1/256 can also check the relevant AVX-512 feature bits and can use that codepath if either one is supported.
@@Anton1699 AVX10.2 is the more interesting of the two, IMHO, as it brings advanced vector capabilities across the whole range of products, not just server SKUs. It also probably means no proper AVX-512 support with desktop products for quite some time still.
@@seylaw My point was that if you write a piece of software that targets AVX10.1/256 now, and you don't use any of the instructions not supported by Zen 4, then your code will run on AMD CPUs from Zen 4 onwards, Intel Server CPUs and Intel Client CPUs once they introduce AVX10 support.
@@Anton1699 Sorry, I was talking in more general terms about both ISAs from a user's perspective. And while I get the value you describe for developers who want to write forward-looking code now, I cannot see this being the optimal way, as the 512-bit vector length was a major feature before that just got downgraded (and integrating that feature into the E-cores would have checked that box). So Zen 4 users won't be served optimally with that approach everywhere. And as I understand it (I am not an engineer), AVX10.1 and 10.2 are still not as flexible as Arm's SVE2, which means that code would not automagically make use of 512-bit vector lengths when found on the P-cores, or have I missed something?
@@seylaw You've got that right. A developer has to check the AVX10 CPUID leaf to determine whether they can execute 512-bit wide AVX10 instructions. So you do need two separate code paths if you wish to target both AVX10/256 & AVX10/512. I personally think the 512-bit vectors were the least exciting feature of AVX-512 and I honestly don't understand why Intel chose to make that the eponymous feature. AVX-512 introduced so many basic instructions that were missing from SSE/AVX for no good reason. One of the more noteworthy applications where AVX-512 provides large speedups over AVX2 is the PS3 emulator RPCS3, and it lets you choose between 256-bit wide AVX-512 and "full-fat" AVX-512. I just hope that Intel and AMD manage to introduce AVX10 quickly, and I really hope they do not produce SKUs that do not support it. Intel made SSE4.2-only Celeron and Pentium CPUs until very recently.
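The "two code paths" pattern described above can be sketched abstractly. This is a hedged illustration, not real CPUID code: `max_vector_bits()` is a hypothetical stub standing in for the detection step (early AVX10 documentation describes enumeration via a CPUID feature bit plus a new leaf reporting the supported vector length), and the loop merely mimics how a dispatcher would choose a lane count per detected width:

```python
def max_vector_bits():
    """Hypothetical stub for CPUID-based detection.

    On real hardware you would query the AVX10 CPUID leaf for the
    supported vector length; here we just pretend the CPU caps out
    at 256-bit vectors (the AVX10/256 case).
    """
    return 256

def sum_simd_style(data):
    """Accumulate in chunks sized to the detected vector width.

    512-bit vectors hold 16 floats, 256-bit vectors hold 8; a real
    dispatcher would jump to a separately compiled kernel per width.
    """
    lanes = max_vector_bits() // 32  # 32-bit floats per vector register
    total = 0.0
    for i in range(0, len(data), lanes):
        total += sum(data[i:i + lanes])  # stands in for one vector accumulate
    return total
```

The point of the sketch is that the width is a runtime property, so the 256-bit and 512-bit kernels must both exist in the binary and be selected after detection, unlike SVE-style code that adapts to the hardware's vector length automatically.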
Of course you're going to pronounce non-English words wrong, you're British. If I ever hear a Brit pronounce a whole sentence of French, Spanish or Italian correctly, I just may die of shock.
I'm curious about the MI300A, and specifically why AMD landed on 24 CPU cores as the correct amount. Why not use Zen 4c chiplets to go to 48? This isn't a criticism; I just want to understand what objectives were driving this specific design. On a separate note: hi Ian, would you be up for doing an explainer, or even a deep dive, on Intel's recent AVX10 & APX announcements? I'm curious about your thoughts on their long-term benefits, or whether these changes will only really benefit compilers.
I plan to build a workstation around October this year, and I feel very unsure about every decision I've made so far. I got an A750 from an Intel giveaway, but that's currently pretty useless for model inference (it's great for Stable Diffusion, but language-model inference is really broken). Intel's own numbers have their Gaudi2 300% ahead of GPU Max, and I can't buy either. It seems like getting an Nvidia L40 or RTX 6000 Ada is the simple solution because the software will just work, but I haven't looked at AMD. Or just CPU: Intel has done a lot of marketing for SPR inference, but that technology isn't available on their client CPUs, which I want for gaming. Also, ROCm PyTorch support is Linux only. It's native, but Linux only, making it not very useful for end-user inference. Intel said that IPEX Windows native is happening near the "end of the year", but they weren't able to give me any timeline directly. Intel is publishing more and more stories about selling tiny GPU clusters with PVC, but I feel like they are missing GPU model inference for end users on Windows.
Linux is now pretty user friendly if you don't need advanced features. A simple dual boot, if you don't want to deal with the Wine stuff, with a shared drive for everything you need on both OSes.
@@tominmoreau8546 That seems to be a likely reality as a developer. And WSL still requires dual boot, since anti-cheats don't allow Hyper-V. But this will not help the end-user experience. Right now the only viable experience for end users is to own an Nvidia GPU, as CUDA will run by default without installing any toolkit, setting up WSL2 or getting Docker (maybe the OpenVINO runtime). It seems like Intel is focusing on CPU, and perhaps even VPU, for client inference. No confidence in GPU inference means their iGPU still struggles.
If you're looking to do any serious ML, especially if you're looking at data center GPUs, I'd *highly* recommend running it on a dedicated (headless) Linux box. You should run the numbers, though: even for inference, once you account for power costs and depreciate your hardware at 2-3 years, you'll probably be better off from a cost perspective using cloud GPUs vs local (but everyone should run their own utilization numbers).
@@lhl This is a purely developer machine; I don't plan to run inference all day or do week-long training runs. Perhaps a few hours of fine-tuning, but that's all. Yes, it's much more cost effective to simply run on a cloud instance, and in fact I could do that right now for free. But it's very far removed from efficient local development, and I can't run a model server on my cloud instances since they are really limited in terms of internet access, so sending all of that via SSH seems like a stretch. Might look into it, but the one node I can easily access is just a 24 GB RTX 5000, which won't run 15B models, and you have to keep caching off. For large evaluation runs I can request the largest instances, but I need to develop all my tasks first and make sure they come up with results that mean something. Using a tiny model on CPU locally for development doesn't do the trick anymore, because the tasks are difficult enough that GPT-2 fails 100% of the time.
@@Veptis For your use cases, a 24GB card sounds fine. If you need a local card, your best bang for buck will probably be a used 3090 (~$700) right now. It will run SDXL easily, and you can fit 4-bit quants of 33B models with no problem (exllama is most memory efficient; llama.cpp will let you offload layers if you are extending context or something). You should also have no problem running QLoRA or other 4-bit fine-tuning. I have a 24GB card in my workstation and have no problem running StarCoder 15B (@q8, basically no perplexity loss, but also check out CodeGen2.5-7B, which performs pretty close at half the size) or any of the 30B-class models at 2K+ context; however, you *will* run out of VRAM if you are running Windows and driving a display/running apps like browsers from your card. You can always get a second 3090, get good scaling, and be able to run a 70B at 16K context. Remember that the cards you've mentioned are $7,000-10,000, a lot to pay if you're just dabbling (e.g., it sounds like you could pay about 10% of that and do what you want locally). Also, for those following along at home, $7,000 is ~10 h/day of A100 80GB cloud compute for 2 years at Runpod's spot prices right now. Plug your home kWh power costs into a spreadsheet if you're looking at things from a cost/perf perspective, but suffice to say, for most people it'd be a lot cheaper to rent GPUs when needed. (I understand how having local hardware can be convenient, though. I'm building a bunch of latency-sensitive apps myself and it's nice to have something under the desk. Toasty and pricey, though.)
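The rent-vs-buy back-of-envelope above is easy to put in a spreadsheet, or a few lines of Python. This sketch uses the comment's illustrative figures (a ~$7,000 card, a spot rate near $1/h implied by "10 h/day for 2 years"); the power-draw and electricity-price defaults are my own assumptions, not quotes:

```python
def breakeven_hours(card_cost_usd, cloud_rate_usd_per_hour,
                    local_power_kw=0.4, power_cost_per_kwh=0.30):
    """Hours of use at which buying the card beats renting.

    The purchase is treated as a fixed cost and local running cost as
    electricity only; resale value and cooling are ignored for simplicity.
    """
    # Each local hour saves the cloud rate but costs some electricity.
    hourly_saving = cloud_rate_usd_per_hour - local_power_kw * power_cost_per_kwh
    return card_cost_usd / hourly_saving

# ~$7,000 card vs a ~$1/h cloud rate: break-even lands near 8,000 hours,
# i.e. past the "10 h/day for 2 years" (7,300 h) mark in the comment.
hours = breakeven_hours(7000, 1.00)
```

Which is exactly the point being made: at low or bursty utilization the cloud wins on cost, and the local box is paid for in convenience, not dollars.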
Is AMD using different chiplets for Consumer and Enterprise a good thing for consumers? Or do you think we'll see consumer products using the more power efficient chiplets too?
@@shepardpolska They are going to prioritise profit, so if they can make a wafer of enterprise-specific chiplets or a wafer of consumer-specific chiplets, they will prioritise enterprise. But I've done a bit of googling, and Zen 4's integrated graphics is part of the I/O die, so if Zen 5 is the same, the chiplets will be the same for both, and consumers can get the cast-off chiplets that don't make the spec for enterprise. Still interested to see if any of the "c" variants make it to consumer products
@@Pegaroo_ It's like this currently. Ryzen is using the same CCDs as EPYC but with a different IO die, and probably ones that are binned worse for energy consumption. It's like this with each chiplet based Zen.
About the Zen 4 vs Zen 4c chiplet designs, and regarding the limits of the I/O, memory and compute parts, where each scales differently with transistor shrinks: for example, I/O stops scaling around 12 nm and cache (SRAM) around 7-10 nm, but compute logic still scales well at even smaller sizes. With that said, the question that pops into my head is *"Wouldn't it be better to have each CCD with one CCX that is performance- and memory-dense while the other CCX is core-dense like Zen 4c?"* In a sense that would make the design of the two different CCXs (core complexes) in a single CCD (chiplet die) an "internally" heterogeneous design, a mix of different CCX types in one die, where half the CPU is like Zen 4 and the other half Zen 4c. I believe Zen 5 (or perhaps later) could have two different CCX designs in a single CCD: the first CCX with something around 4-8 high-performance cores and a lot of L2 and L3 cache, using an architecture optimized for performance instead of die area, while the other CCX has very little cache, uses logic transistor libraries aimed at density (lower frequency and energy), and spends the die space on more cores instead, cores that actually *do* shrink with smaller nodes. With further transistor shrinks, this gap between a "performance CCX" and an "efficiency CCX" will only increase. I am guessing this second CCX could have up to 32 cores by itself, or something around that number. Remember, SRAM cache doesn't shrink well below about 7 nm, but logic transistors (the compute) do, and that is the key to this idea; the gap will widen over time. Going further, you could perhaps remove L3 cache altogether in the "efficiency CCX" and rely only on 3D V-Cache, so that one CCX spends essentially no die area on cache and fills it with a dense, efficient, high-compute-per-area design.
This would open up additional levels of heterogeneous design, where AMD could build different chiplets (CCDs) from different CCX designs. One CCD could have 2 performance CCXs (e.g. 8 cores), another 2 efficiency CCXs (e.g. 32-64 cores), and one could have a mix of one of each (e.g. 24+ cores). The next heterogeneous level after the CCX would be the CCD level, where AMD could mix these different types of CCDs in the same package, along with HBM chiplets, AI chiplets, etc. Maybe a Zen 5c part would have only the efficiency chiplets, EPYC Zen 5 a mix, and Threadripper a bias toward the performance-oriented, power-hungry chiplets? I know I've been rambling :) but I do think this is an interesting idea at its core, working around the different die-shrink limitations of I/O, cache and logic.
The issue with that is I don't think there's any benefit to it. The whole idea of Zen 4c is to be used where you need high density. If you split the CCX like you suggest, you might not even fit 4 Zen 4 cores, since that would make the CCD bigger than it already is; the Zen 4c CCD is already bigger than the Zen 4 one. It would be more realistic to have one Zen 4 and one Zen 4c CCD in consumer CPUs. That can be done already and would probably benefit the platform more overall.
AI acceleration is half hardware and half software. Nvidia has CUDA, and Intel is working on its oneAPI. I wonder what AMD has to offer in this space? If nothing, then (in my opinion) throwing "AI" around is nothing but poor marketing that might make some fanboys happy, but experts are unlikely to be fooled.
@@shepardpolska Interesting. Something is certainly better than nothing. It seems very basic; perhaps that's why OpenCL is still very popular among AMD users.
It's nice to see AMD push into AI workloads, but they really need to beef up their software support for ML. George Hotz tried to get RDNA working with tinygrad a few months back and it sucked; the driver kernel-panicked, but it looks like Lisa responded to George. You really need a full software stack and good support for the ML frameworks. Without that, adoption isn't going to compete against CUDA. I find it interesting that AMD's AI chip has the same memory as the M2 Ultra. My money is on Apple getting their ML software stack into good shape before AMD.
Did you see the latest video on ROCm from Wendell, the one from like 3 days ago? He showed that AMD's implementation has really moved a lot in the past 6 months. If it keeps going at this rate, it should be a very good competitor to CUDA in 12 months or so. For now it "only" runs Stable Diffusion at better quality than Nvidia. SD is not as important to AI as inference or training, but it is a good sign. Yes, they are still behind. But maybe if you're buying a computer privately in a year, it will be a decent competitor? For HPC it's not a problem; you just write your own software. But for "normal" users? You need a hell of a lot of good support, and you also need to get people from other companies to support it well. CUDA mostly runs so well because it's supported by like a million 3rd-party applications. If AMD can get to the level where they have even 10% of that support, just in the important places, they can start taking market share from Nvidia. Mindshare will be more difficult. I really hope AMD creates something like the Jetson Nano: affordable, cheap, small units that can be used for learning purposes but aren't completely useless in and of themselves.
4 vs. 4c: Looking at those floorplans, if they're to scale, there's no way that's just a change of node corner, there's going to be some re-pipelining and re-balancing in there at least.
We also run our vCPUs with either HT deactivated, or we only sell the real cores and use the HT threads for overhead. The second option becomes useless, however, when you have 128 HT threads as "overhead". You can also save some power by not running HT. In our tests, HT threads only perform at around 20-25% of a real core anyway.
I think it's a mistake to design for reduced frequency; it should be 4.5 GHz all-core, which would give ~50% more cycles and add value and performance, and by today's standards that isn't even a high frequency, so it shouldn't be a big deal. I don't see a design bottleneck at all. IMHO SMT is ~1.3x core performance across 2 threads of execution, so just present that consideration to customers. I understand the requirement, but it's irrelevant if the software is tailored properly to the client's needs. For Zen 4 EPYC Genoa, I think the core counts should go 24/48/96; that makes more sense and is more beneficial for users.
Phoronix benchmarks do include the accelerators, and there were only a couple AI inference workloads that SPR was even competitive in. Across the hundreds of benchmarks in the Phoronix test suite, that was less than 10 workloads, and on average the Zen4C system was almost 2x what SPR was able to do. Unless your workload specifically benefits from those accelerators 90% of the time, SPR is dead on arrival.
@@benjaminlynch9958 There is more to acceleration than AI inference. The majority of the internet still runs on regular ol' crypto and databases. I'm talking about QAT/IAA/DSA.
Why didn't Intel figure this out: shrink Golden Cove (and leave AVX-512 intact) to run at a lower frequency and thus consume less power, rather than introduce Gracemont with no AVX-512 capability? AMD clearly understands what levers they can pull to optimize a design for a given workload. AMD is solutions-oriented and seems to have a desire to build customized silicon for a given workload rather than remaining "general purpose". Intel needs to adopt a similar mentality. For instance, AMD figured out that gaming workloads could benefit significantly from additional L3 cache, hence Zen 3D. Data centers could benefit from additional density, hence Bergamo. Where are the Intel equivalents for gaming and data center? It will be interesting to see what Intel does with chiplets/tiles in Meteor Lake and beyond.
Probably coz Intel was a mess internally and scrambling for any idea that would work, and some engineers working on the P-core/E-core approach had more success than the shrinking approach. E-cores also suited laptops and NUCs, so that got greenlit over other projects.
As long as they keep it off my rig and my privately owned network, I will not have to sue them. I only use the basics for AMD graphics, as I do not need more fucking PC companies trying to take my data and use my bandwidth when they do not have permission or any legal right in Australia. I pay for my bandwidth. I also paid for $7,500 worth of AMD hardware, which does not give them legal access to any of my equipment or anything past the socket. If I catch them I will SUE them, then sell off the company's assets. There will be no more AMD.
The problem I personally have with these Cloud cores from AMD is the fact that, if they were able to fit everything into the 4C footprint, then why didn’t they ship all Zen 4 chips using the C cores? Because they perform worse (less cache) and are clocked lower…
Uh, you answered your own question. They have completely different target markets, Zen4 is performance oriented while 4C is efficiency oriented. When you need to do a lot of compute, Zen4 will be better. When you're running node and pushing out web stuff, 4C will be good enough while being more efficient and lots cheaper.
From benchmarks I've seen, I think Zen 4c is like the console APUs and cuts the vector execution width in half. Also interesting that they don't list the vector latency, just the FPU.