You are correct, so-called smart NICs will include different accelerators, but the best will be re-programmable ones such as FPGA-based designs where you can load the different accelerators you actually require, including bug fixes and upgrades.
Patrick, I love it when you do such awesome stories on niche products. The reviews of a $75 device one day and a $150,000 device two days later really make me enjoy your website and YouTube channel. I wonder if this would be a dream job for me...
We will probably be adding an 18th team member in Q4. Shoot me a note if you are serious, with what you are looking to do and whether you are thinking part time or more full time. We usually have folks start part-time to see how they actually like reviewing hardware.
Thank you so much for this video!! I started diving into QAT a few months ago and learned the hard way about the support for the different generations of cards lol!
You can. I just tested a QAT add-in card on an AMD system using SQL Server 2022 RC0, which can offload backup compression to QAT. Works like a charm. No idea how much faster or slower the CPU version of QAT is since I don't have access to that kind of hardware. I also have a lot of questions about how this is going to work with a hypervisor sitting in between. I don't want to pass the QAT device to just one VM, I want all VMs to be able to use QAT acceleration. Does the CPU version show up as a discrete PCIe device or is it more like an instruction set extension?
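From what I have read since posting, even the Xeon-integrated QAT appears as PCIe endpoints on the SoC rather than as new instructions, which is also what should make SR-IOV-style sharing across VMs possible instead of passing the whole device to one guest. A rough sketch of what that discovery looks like from Linux userspace; the QAT device IDs below are my own assumptions from memory of the qatlib docs, so double-check them against lspci:

/* Hedged sketch: list PCI functions whose vendor/device IDs look like Intel QAT
 * endpoints. The IDs below (37c8/37c9 for the chipset-era QAT PF/VF, 4940/4941 for
 * the on-die QAT on newer Xeons) are my best recollection, not an authoritative list. */
#include <dirent.h>
#include <stdio.h>

static unsigned int read_hex(const char *path) {
    unsigned int v = 0;
    FILE *f = fopen(path, "r");
    if (f) { fscanf(f, "%x", &v); fclose(f); }
    return v;
}

int main(void) {
    const unsigned int qat_ids[] = { 0x37c8, 0x37c9, 0x4940, 0x4941 }; /* assumed IDs */
    DIR *d = opendir("/sys/bus/pci/devices");
    struct dirent *e;
    if (!d) { perror("opendir"); return 1; }
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.') continue;
        char path[512];
        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/vendor", e->d_name);
        if (read_hex(path) != 0x8086) continue;                 /* Intel vendor ID */
        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/device", e->d_name);
        unsigned int dev = read_hex(path);
        for (unsigned int i = 0; i < sizeof(qat_ids) / sizeof(qat_ids[0]); i++)
            if (dev == qat_ids[i])
                printf("%s looks like a QAT endpoint (8086:%04x)\n", e->d_name, dev);
    }
    closedir(d);
    return 0;
}

If that holds, the virtual functions are what you would hand to individual VMs, the same way you would with any other SR-IOV device.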
@@ServeTheHomeVideo the question was not about AMD hardware as such, but rather about using the Intel QAT card in an AMD system, which is a supported configuration. It is a shame that STH didn't think of that...
What does this mean for latency, interrupt budget, DMA, etc.? These are valid benchmarks, but what happens in total system testing? Do you really free up resources that can be used without immediately hitting the next bottleneck, for example interrupts bogging down some subsystems of the platform?
Thanks for your great video! Just to make sure: this card is attached over PCIe, right? Can an instruction trigger accelerator ops on this card? How do you use it, as an I/O call or through instructions?
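From the documentation I have skimmed, it looks closer to an I/O call than to instructions: the card sits on PCIe and you reach it through Intel's userspace libraries (qatlib, QATzip for compression, the QAT engine for OpenSSL), which submit work to the device rather than executing new CPU instructions. A hedged sketch of the compression path via QATzip, as I understand its API; treat the exact signatures as approximate:

/* Sketch of the QATzip (libqatzip) compression path. The point is that QAT is
 * driven through library calls that hand buffers to the device, not through new
 * CPU instructions. Check qatzip.h for the exact signatures before relying on this. */
#include <qatzip.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    QzSession_T sess;
    memset(&sess, 0, sizeof(sess));

    /* The second argument asks for software fallback if no QAT hardware is found. */
    if (qzInit(&sess, 1) != QZ_OK) {
        fprintf(stderr, "qzInit failed\n");
        return 1;
    }

    unsigned char src[64 * 1024];
    memset(src, 'A', sizeof(src));               /* highly compressible dummy data */
    unsigned int src_len = sizeof(src);

    unsigned int dst_len = src_len * 2 + 1024;   /* generous worst-case output bound */
    unsigned char *dst = malloc(dst_len);

    /* last = 1 marks this as the final chunk; a default session is set up if none exists. */
    int rc = qzCompress(&sess, src, &src_len, dst, &dst_len, 1);
    if (rc == QZ_OK)
        printf("compressed %u bytes down to %u bytes\n", (unsigned)sizeof(src), dst_len);
    else
        fprintf(stderr, "qzCompress returned %d\n", rc);

    free(dst);
    qzTeardownSession(&sess);
    qzClose(&sess);
    return 0;
}

So from the application's point of view it is an I/O-style call into a library, and the library and driver handle moving the buffers to and from the accelerator.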
Is the acceleration used when running Java, C#, or nginx with their main libraries out of the box, or did you implement specific Intel dependencies/libraries to take advantage of QuickAssist?
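My impression (from the docs, not from the video) is that it is not fully out of the box: for the TLS side you install Intel's QAT engine for OpenSSL and point your stack at it, and nginx, Java, or C# then benefit through whatever crypto library they sit on, sometimes via an async-enabled build. A hedged sketch of what that looks like at the OpenSSL level, assuming the separately installed QAT_Engine registers itself under the id "qatengine":

/* Sketch: ask OpenSSL to route supported algorithms through the Intel QAT engine,
 * falling back to the default software implementations if it is not present.
 * The engine id "qatengine" is an assumption based on Intel's QAT_Engine project. */
#include <openssl/engine.h>
#include <openssl/err.h>
#include <stdio.h>

int main(void) {
    ENGINE_load_builtin_engines();                 /* initialize the engine subsystem */

    ENGINE *qat = ENGINE_by_id("qatengine");
    if (qat && ENGINE_init(qat)) {
        /* Route everything the engine implements (RSA, AES-GCM, etc.) through it. */
        ENGINE_set_default(qat, ENGINE_METHOD_ALL);
        printf("QAT engine active\n");
        ENGINE_finish(qat);
        ENGINE_free(qat);
    } else {
        printf("no QAT engine found, staying on software crypto\n");
        if (qat) ENGINE_free(qat);
        ERR_clear_error();
    }
    return 0;
}

nginx has an ssl_engine directive for the same purpose, so the web server itself does not need QAT-specific code, but it is still an extra component you install and configure rather than something the stock packages do on their own.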
Interestingly, Intel has a more "mainstream" accelerator called Quick Sync for video encoding and decoding. When it works (i.e. on supported video codecs), it makes a huge difference. AMD seems to completely neglect this market for some reason.
@@jaffarbh It is not part of AES-NI, but it is a competing technology in the workflow that was presented in the video. Also, the video uses CPU + dedicated QAT PCIe hardware to compare against AMD without using AES-NI.
@@aliancemd Fair point. Intel has already embedded QAT into the latest Xeons. The real question is whether equivalent QAT acceleration exists in AMD processors. In any case, this is a specialist market and not something everyone needs. Maybe AMD doesn't see the need to dedicate silicon for it.
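To make the distinction concrete: AES-NI is an instruction set extension inside the core on both AMD and Intel, so there is no device, driver, or passthrough involved, which is exactly why it virtualizes so easily compared to an offload card. A tiny compiler-intrinsics sketch of a single AES round (dummy data, not a real cipher implementation):

/* AES-NI is an x86 ISA extension, so one AES round is a single instruction executed
 * in the core. Build with: gcc -maes aesni_round.c
 * Works the same on AMD and Intel CPUs that expose the aes CPUID flag. */
#include <wmmintrin.h>   /* AES-NI intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Dummy 128-bit state and round key; a real cipher runs the key schedule and
     * all 10/12/14 rounds. This only shows where the work happens. */
    __m128i block     = _mm_set1_epi32(0x01234567);
    __m128i round_key = _mm_set1_epi32(0x76543210);

    block = _mm_aesenc_si128(block, round_key);   /* one AES encryption round */

    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, block);
    printf("%08x %08x %08x %08x\n", out[0], out[1], out[2], out[3]);
    return 0;
}

QAT, by contrast, is a separate device behind a driver, which is why hypervisor support and vendor lock-in come up at all.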
I am very much not an expert in these things. With QAT's compression acceleration, could such hardware be used to accelerate disk access in a desktop environment? I'm not necessarily asking from a practical standpoint, merely wondering if we might see something like it in a future chipset/CPU so they can advertise that your SSD will be XX% larger or faster since less data would be going across a bottleneck.
Cool. I had not heard of these but have been looking for a long time for a way to cheaply accelerate my older servers. My servers run a lot of modern hardware like high speed NVMe and they can't keep up. I think this is what I need. Will these basically work on any machine? I mean it's just a PCIe card, right?
@@ServeTheHomeVideo Well, no, not really; there'd still be no CPU core complex on the card, so it wouldn't pass your own 8-point 'is this a DPU?' checklist.
Ah I meant more like Mt. Evans would be the closest product, or the FPGAs. But if you went FPGA + EPYC embedded, then you would get Xilinx most likely.
Can you please do an update on this? Not your full-blown speed tests, but more of a proper mix-and-match of what's out there to get it functioning and somewhat future-proof.
Loved the video, but IMHO one thing is missing: if a customer already has EPYC servers or is planning to buy the upcoming Genoa CPUs, is there something for him related to acceleration? I'm sure there's some serious NDA that has been signed by you and AMD, but still, some hints ... ;)
@@ServeTheHomeVideo That's exactly what I thought while you were talking about the accelerator card. The functionality on there will probably be folded into the DPU which also means we will have more vendors putting out products which can do this.
As for ciphers, ChaCha20-Poly1305 is not going anywhere really; it is seriously overpowered (20 rounds of ChaCha is silly overkill, 12 is sensible overkill, 8 is probably fine).
17:35 That is so misleading. A reminder that AMD (and Intel) supports AES-NI, which has significantly better support, including from hypervisors. There is really no reason to compare hardware acceleration against no hardware acceleration at all on hardware that has it.
Kind of rubs me the wrong way that you didn't mention AMD's solution, Xilinx/Pensando (is this available now or soon?), and that the Intel QAT card can be used in an AMD system. Looking at the video, one could easily think Intel has a huge advantage over AMD. Hard to believe Intel didn't have a say in this, or maybe you were influenced by their sponsorship. I'm not saying do something that sours the relationship, but just mentioning it would have been much better already. Honestly, I feel the AMD results should have been removed as you are comparing apples to oranges; the only thing it does is make AMD look bad. Hope you can keep that in mind in the future.
@@ServeTheHomeVideo So will Xilinx and BlueField solutions be able to do this same sort of thing? I'm curious what the implementations will be like on the software stack in order to utilize the offload, whether there is direct hardware support in those cards to accelerate these functions (specific ciphers, etc.), and, either way, how that affects bandwidth, latency, max number of connections, and power efficiency. I'd be VERY curious to see these same tests and cases with the same basic base hardware paired with NVIDIA and AMD accelerator/DPU cards benchmarked the same way (if possible), so that these numbers could be put side by side and show how Intel compares to other vendor solutions (and how much work it would be on the software side to implement; for example, is there native support for each solution in popular products like pfSense, web server stacks, etc.?)
Wow, that is super weird! A HW accelerator that actually accelerates something! But seriously guys, why the f.. are you trying to make this a comparison against AMD with no acceleration? This is just plain silly.
We used a faster AMD CPU, so when we did things like acceleration via ISA-L, AMD was faster due to the clock speed and extra TDP it had. The TDP difference between the lower-power Intel parts and the higher-power AMD SKUs we used is about the same as the QAT card's TDP. AMD has promised the Pensando solution, for example, but has yet to deliver cards, and we cannot eBay them. When Pensando cards arrive, we will look at those.
@@ServeTheHomeVideo Showing that a specialized HW unit + CPU can draw the same power as another CPU isn't any less silly, and AMD not yet delivering similar HW isn't a very good excuse for making a silly and utterly useless "comparison".
What is the alternative, though? No major server vendor supports QAT cards in EPYC systems. You can put them in, but that is a one-off unsupported configuration that would be a lab project, not something that people would really deploy. That is why we need AMD's accelerators, so we can have real solutions, not lab experiments.
@@ServeTheHomeVideo The alternative is to not do silly stuff and only report on what QAT can do. The fact that QAT doesn't work on EPYC is partially (if not fully) Intel's fault, so trying to put this on AMD is just even more silly. BTW, an accelerator from AMD will perform very differently, and comparing the two would also make little sense; these are very specific SW accelerators, but you probably already know that?
@@brynyard It's not silly at all. If there is no industry-supported alternative for AMD systems, this could mean the difference between choosing an Intel or an AMD platform for a specific application, based purely on the amount of resources we've just been shown being used. This may be a huge realization for a lot of people and may affect purchasing decisions for variously sized projects. In larger data centers, optimizing for a specific use case can potentially mean the difference of a ton of power, latency, and the number of connections a server can handle while still performing work with those connections, so users per server, so the total number of servers, so data center sizing, etc. This may have huge implications for our very Internet-oriented data centers, with all kinds of encryption and very little inter-data-center machine-to-machine trust.
The QAT cards are basically server PCHs on a card, so they are like 23W TDP parts. That is why I wanted to use EPYC CPUs with 15W more each (30W total) to at least bridge some of the gap.
If you're having to spend the dev time to implement QAT within your application, why marry yourself to a hardware-specific component when there are fast real-time algorithms like LZ4 and ZSTD that can get 1 GB/s+ per core? I don't get the feeling that forward-looking storage vendors are continuing down the hardware-accelerated path here, as they get locked into a specific technology and are then unable to port elsewhere, i.e. to the cloud.
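For reference, the portable path being argued for is just a couple of libzstd calls, which is part of why the lock-in trade-off is a real question. A minimal sketch, assuming libzstd is installed (link with -lzstd):

/* The CPU-only alternative: compress a buffer with libzstd. No vendor-specific
 * hardware or driver, portable to any machine or cloud instance. */
#include <zstd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char src[256 * 1024];
    memset(src, 'A', sizeof(src));                  /* highly compressible dummy data */

    size_t bound = ZSTD_compressBound(sizeof(src)); /* worst-case compressed size */
    void *dst = malloc(bound);

    /* Level 3 is the usual speed-oriented default; LZ4 would be the even faster choice. */
    size_t csize = ZSTD_compress(dst, bound, src, sizeof(src), 3);
    if (ZSTD_isError(csize)) {
        fprintf(stderr, "zstd error: %s\n", ZSTD_getErrorName(csize));
        free(dst);
        return 1;
    }
    printf("compressed %zu bytes to %zu bytes\n", sizeof(src), csize);
    free(dst);
    return 0;
}

Whether that per-core throughput is enough, or whether you would rather have those cores back for application work, is basically the whole QAT debate in miniature.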