At some point people will stop caring about CPU cores, and they'll build out a 5,000 core chip where there's a core that accelerates every single application that runs on the platform ('Minecraft core', 'ffmpeg core', 'MS Word core', 'Torrent core'). There will just be one CPU core to coordinate all the app cores. Heck, Apple's kind of down that path with how they use like 6 or 8 different specific cores for neural, ISP, prores, etc.
100%. Accelerators will continue to be a big deal. Desktop Meteor Lake next year will start taking folks down that path since, in theory, you could have a different Core chip for HP and for Dell.
@@ServeTheHomeVideo I just hope we don't end up with a thousand proprietary cores that can't really be used unless you're locked into one specific OS/driver layout. I don't think we'll have enough people willing to do the insane feats of reverse engineering the Asahi Linux group is doing on the M-series processors.
@@JeffGeerling probably not - developing a new core - and getting the silicon made in quantity, running w/o too many 'errata' - is much more expensive than just replicating an existing design where you increase the # of cores
I wonder what this would look like for some games I play :P Could I get game-specific cores for Flight Simulator, which really likes that one thread... Maybe someone has figured out a way to offload it to some FPGA already :D
Interesting, thank you! Ultimately it all boils down to code - many devs are happy as long as their creation runs w/o problems. Even if you give them the compiler updates for QAT, they might not use them; validating the code and keeping it interchangeable with other platforms will hold them back.
Awesome to see this video! Very excited to see SPR ship! Also, very minor nit: your TLS throughput graph for perf per thread says QAT hardware at 18T rather than 16T. I just started working as a firmware engineer at Intel on OpenBMC (feels like I just started, but it's been over a year, sheesh), and it's definitely exciting to see the platform we worked on getting into the hands of reviewers and customers haha.
AMD should be building a response to QAT which is a chiplet that they can place next to the IO Die, even on embedded chipsets. Possibly even a CCD which replaces a normal CPU CCD, perhaps creating an asymmetric design with one CCD focused on performance and the other on acceleration.
The 2% performance deficit is what I saw with my Ryzen 1700X back in the day when I was testing with Cinebench R15 multithread, setting the 1700X in the BIOS as 4+0 or 2+2. A 2-3% difference, to measure the fabric impact. Edit: I never really posted about it, cuz I thought it was too close to the margin of error at the time, and I was too lazy to do more than 3 runs for each of the 2 config types.
@@ServeTheHomeVideo that's quite a performance downgrade, but then again, that's a use case that taxes the fabric more, so it introduces a bottleneck there. It's interesting to test this kind of stuff nonetheless! :D
One thing that could be a catch: for any new instructions, virtualisation support is crucial. If the hypervisor doesn't support the new instructions, or the overhead is too great, that would defeat the purpose. Granted, this isn't an issue for physical appliances, but many appliances these days are virtual.
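A quick way to check from inside a guest whether the hypervisor actually exposes the new instructions is to read the feature flags. A minimal sketch, assuming a Linux x86 guest; the amx_* names are the /proc/cpuinfo flags for the Sapphire Rapids AMX bits, and 'aes' is AES-NI:

```python
# Minimal sketch: verify inside a Linux guest which instruction set
# extensions the hypervisor actually exposes. Flag names follow the
# /proc/cpuinfo conventions on x86 (amx_tile/amx_int8/amx_bf16 are the
# Sapphire Rapids AMX flags, 'aes' is AES-NI).

def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

wanted = {"aes", "avx512f", "amx_tile", "amx_int8", "amx_bf16"}
flags = cpu_flags()
for feature in sorted(wanted):
    print(f"{feature}: {'present' if feature in flags else 'MISSING'}")
```

If the flags show up in the guest, the remaining question is just the virtualisation overhead, not availability.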
Sounds like these new Intel SPR chips with on-die QAT will be great for high-density front-end web or proxy servers. It will be interesting to see how general performance stacks up against current and next-gen EPYC, and whether AMD decides to add a similar hardware function to the IO die.
Hi John. Thank you for being a member! That is the latest. The 8970 is the faster x16 adapter. The 8960 is the slower x8 adapter that I sometimes flash on screen in these videos. I am not sure if there are going to be new adapters since Xeon D has it built in, and Sapphire Rapids will have a much faster version built in. Top-end SKUs are >2x what the 8970 can do in some of our tests. A few reference pieces for you:
- Intel QAT Cards by Generation: www.servethehome.com/intel-quickassist-parts-and-cards-by-qat-generation/
- This week's Sapphire Rapids AMX and QAT performance: www.servethehome.com/hands-on-with-intel-sapphire-rapids-xeon-accelerators-qct/
The EPYC 3451 is essentially just a low-clocked Threadripper 1950X. There is a Zen 2 variant of the low-power EPYC series (the 7D12; there are retail versions of it), so I'm curious why they didn't suggest using it. Does the motherboard support setting the memory into "channel" or "die" interleave modes? This forces NUMA on the Zen 1/1.5 platform, which can offer better performance in some applications/operating systems that have trouble controlling NUMA associations by themselves (Zen 1/1.5 also has a bit of an issue here on its own, so the application/OS is not always at fault). Both modes need to be tested, as which one works best depends on the CPU and memory loadout. For a 1950X, for example, "channel" is the correct mode, while a 2990WX needs to use "die" mode.
The 7D12 is not a soldered chip and is physically *much* larger. In many embedded applications, those are completely different segments. On the NUMA side, the reason this took days to do was partially the process of going through different thread placements and core configurations. This was not just a "run once and see how it goes" type of exercise. One number may be presented, but a lot more work goes on behind the scenes.
@@ServeTheHomeVideo Setting thread association at the OS level isn't good enough for Zen 1/1.5 CPUs; they will often send data to the wrong memory channel if the correct NUMA mode is not forced at the BIOS level. Not knocking anything you're doing; it's all good work, and I know how much testing you're doing, as I've spent several months playing with my 1700X, 1950X, and 2990WX learning what makes them perform best for what I use them for, and the NUMA bug sits deeper than most software I've tested seems to be able to handle. All that said, the EPYC Zen 1/1.5 parts might behave a little differently, and the software you're testing might not be affected by the problem at all; I'm just concerned that it is. Don't take my comments too hard. Zen 1/1.5 is old, as we both know, so the soldered-to-the-board segment really does need a replacement from a newer architecture.
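For anyone who wants to poke at this kind of thread-placement testing themselves, here's a minimal sketch; it assumes a Linux box with the usual sysfs NUMA layout, discovers the node-to-CPU mapping, and pins the process to one node, similar to the 4+0 / 2+2 style sweeps described above:

```python
# Minimal sketch (Linux-only): discover the NUMA node -> CPU mapping
# from sysfs and pin the current process to a single node, so a
# benchmark can be run per-node. Note: memory page placement still
# follows the BIOS interleave setting; this only controls where the
# threads run.
import glob
import os

def parse_cpulist(s):
    """Expand a sysfs cpulist like '0-7,16-23' into a set of CPU ids."""
    cpus = set()
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

nodes = {}
for path in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    node_id = int(path.rsplit("node", 1)[1])
    with open(os.path.join(path, "cpulist")) as f:
        nodes[node_id] = parse_cpulist(f.read())

print("NUMA topology:", nodes)

# Pin this process (pid 0 = self) to node 0's CPUs before benchmarking.
os.sched_setaffinity(0, nodes[0])
```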
Unless the QAT software ecosystem has improved drastically since I last looked at it a few years ago, I don't see QAT really happening outside of niches like SAN vendors with big dev teams. If they don't integrate QAT with, say, the mainline Linux kernel network stack (for kTLS or IPsec), it will see very limited uptake. No normies are going to run IPsec under some weird DPDK setup.
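For what it's worth, the mainline hook already exists: the in-tree intel_qat driver registers its algorithms with the Linux kernel crypto API, which userspace can reach through AF_ALG. A minimal sketch, Linux-only; whether the transform below is actually backed by QAT depends on which driver registered the highest-priority "cbc(aes)" implementation on your box:

```python
# Minimal sketch: drive the Linux kernel crypto API from userspace via
# AF_ALG. If a hardware driver (e.g. the in-tree intel_qat module)
# registers a higher-priority implementation of the transform, the
# kernel uses it transparently -- no DPDK userland needed.
import socket

alg = socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET)
alg.bind(("skcipher", "cbc(aes)"))          # (type, algorithm name)
alg.setsockopt(socket.SOL_ALG, socket.ALG_SET_KEY, b"\x00" * 16)

op, _ = alg.accept()
plaintext = b"sixteen byte msg"              # AES block-aligned
op.sendmsg_afalg([plaintext], op=socket.ALG_OP_ENCRYPT,
                 iv=b"\x00" * 16)
ciphertext = op.recv(len(plaintext))
print(ciphertext.hex())
```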
Later in this video, we showed this is going into mainstream Xeons starting with Sapphire Rapids. That will fix the chicken-and-egg problem with software adoption.
It's been 9 months since Ice Lake Xeon D was released, and I still can't buy one. Supermicro lists them as "Coming soon". Asrock lists them as "Preliminary". I don't know of a retailer that's selling them. Seems like vaporware.
Not just the Ice Lake D's. I've had a X11SDV-8C-TP8F on backorder for the last 3 months, after waiting from around November last year for them to become backorderable 😕
The point of this was built-in integration where you do not need add-in cards. That is a big deal for the embedded market that these parts are targeted at. We already added a 100GbE NIC for the EPYC because the onboard networking was too slow. If we did an additional QAT card for the AMD EPYC 3451, then we have two extra cards for AMD. Do we then need an add-in QAT card for the Xeon D since we added one for AMD?
@@ServeTheHomeVideo It would be an interesting performance comparison. There was another video about QAT on the server platform which did not show benchmarks for a QAT external card on the AMD platform. It seems like someone is trying hard not to show the full picture. With that said, I generally like your videos/articles/insights. It just seemed odd that the external card was shown but not used for benchmarking.
If you're doing 30+Gbps of IPsec traffic, then you're not going to be a home user in the first place :) You'll be able to pick and choose your platform. None of this really matters at home...
100%. STH was the first site to review the EPYC 3000 series, and I was one of the few folks at the launch in London. We have reviewed several platforms based on them, too. That is why I asked AMD if there was a replacement before doing this.
"it's like with aes-ni [...] today everybody just assumes it's there" Cries in cheap arm NAS CPUs that are basically useless for encryption, because the don't have AES-NI. Seriously though, Synology still maintains a feature table that features AES-NI. You might not think about it, but if you buy one of those without, you have use everything unencrypted... Which is of course, really bad, even for a consumer deployment.
The point is more that this is integrated, not an add-in PCIe card. In the embedded market, there are very limited PCIe slots and often smaller form factors.
Which new chip? Hopper has not been announced beyond the H100. Ada Lovelace parts will start with the L40 in the data center. These are more for compression/crypto, so they are aimed at a different segment.
@@ServeTheHomeVideo I have not been staying up to date on the chip names, but it's the same chip as in the 4090. PNY revealed what the new Quadros will be: the exact same as the Ampere cards, but with the Ada Lovelace chip.
Yes, the RTX 6000 Ada and the L40 have the same architecture as the 4090. That is very common for the old "Quadro" line and the newer professional cards.
I do hope that AMD makes something with Pensando, because it's ridiculous that you have to add an Intel NIC just to even the odds a little. AFAIK Genoa should be coming this quarter (though to which markets is hard to say), and it does have some accelerators, but more are coming from AMD only with Turin. Whether something equivalent to QAT will be available, I don't know; I leave that to experts like you.

That is one of the reasons why AMD, despite their performance advantages, has something like 10-15% of the server market share (I think; I honestly don't know which source is credible): Intel has most of the ecosystem already in place. Even though SPR is delayed, people are still waiting for it. A really missed opportunity by AMD, though I don't know how fast they could have built their own ecosystem, given that their financial problems only went away maybe 18 months ago, so they were careful with spending. But it really hurts their market share.

I hope that you will soon be able to review Genoa. I'd like both companies to be competitive and have approximately even market share; that would give them money for R&D and they could innovate.
@@ServeTheHomeVideo That is at least some good news. One would hope that they will also have something for this particular sector of the market, because being forced to use first-gen Zen EPYCs without anything to replace them is a lost opportunity. I do realize that AMD has to be careful with how much capacity they order so they don't lose money, but I think they could be a little more flexible now, given that they will use N6 and N5 nodes for all of their stuff.
At work we run 100Gbit/s of AES on EPYC Rome CPUs with plenty of capacity to spare. I really don't see the point of this crypto accelerator; AES-NI is plenty fast for almost any workload. If you need to go to 400GbE, it makes more sense to run crypto on the network card, as with Mellanox ConnectX-6, to avoid pushing extra data copies through the CPU's L3 cache.
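If anyone wants to sanity-check the software-only side of that claim, here's a rough single-core AES throughput probe; a sketch assuming the third-party Python "cryptography" package (pip install cryptography), which dispatches to OpenSSL and thus to AES-NI where available. Scale the per-core number by the core count for a back-of-the-envelope platform figure:

```python
# Rough single-core AES throughput check. The 'cryptography' package
# calls into OpenSSL, which uses AES-NI when the CPU supports it.
import os
import time
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(32)
nonce = os.urandom(16)
buf = os.urandom(16 * 1024 * 1024)  # 16 MiB per pass

enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
passes = 64
start = time.perf_counter()
for _ in range(passes):
    enc.update(buf)
elapsed = time.perf_counter() - start

gbits = passes * len(buf) * 8 / elapsed / 1e9
print(f"~{gbits:.1f} Gbit/s AES-256-CTR on one core")
```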
@@Mr.Leeroy CDN nodes. Probably around 250W to push 100Gbit/s of HTTPS traffic, but that's previous-gen Zen. With an offloading NIC you can get much better figures. Check Netflix's IBC talk about how they build their 400G and 800G nodes.
@@ServeTheHomeVideo Agreed, but one downside of putting workload-specific accelerators into the CPU itself is the burden it adds to chips that aren't used for those workloads. That's particularly true for features added to every core: you had better be darn sure those transistors will be enabled and used broadly, since the area takes away from the number of cores, which are widely useful. AMD cores generally use less silicon area than Intel's for a given level of performance. That area/performance advantage, along with the chiplet approach, is what has enabled AMD to deliver much higher core-count server CPUs.
Which vendors will be providing the least expensive way to get into a two-socket Sapphire Rapids server running Windows Server 2019/2022? I was hoping to get one from Dell, but my Dell server technical sales guy doesn't think anything new will be happening until 2024! I need AI server acceleration now: PCIe Gen5 and DDR5 memory at 4400MHz 😢
This is the problem with Intel: they keep adding more and more facilities, instructions, etc. to try to speed up their chips, instead of stripping them down and making the core instructions and architecture more efficient.
It's the game Intel has been playing for years, and it's where the industry is heading. People want to keep seeing performance increase exponentially, but as we hit the limits of transistor and power scaling, the only way to keep improving performance and efficiency is through dedicated hardware. By that logic, we would be encoding videos in software and rendering games on our CPUs, because adding more accelerators and instructions is bad. Clearly, that's a flawed argument. As we move forward, accelerators will only become more important to improving performance and efficiency.
"Not strip them down and make the core instructions and architecture more efficient." Did you forget IA64? Intel tried that. They realized x86 had been extended way too far and it was time to make a break from it and start with a fresh architecture. But the public rejected it. Then AMD comes along and extends x86 yet again, even further adding 64 bit extensions to already overloaded architecture. So, Intel had no choice but follow along. You can't blame Intel for this. You have to blame the public in general for demanding backward compatibility and to a lesser extent AMD for adding the 64 bit extensions and thus continuing to keep it alive. Intel's plan was to kill off x86 and move us to IA64 instead. The P4 was going to be the end of the line.