As a large datacenter storage admin, this completely blew my mind. I spent probably an hour hypothesizing breathlessly with colleagues about the options and possibilities this introduces, especially for highly composable infrastructure build-outs. But then I realized the hitch: what's the price tag on these drives? Adding the hardware to make these things work as network endpoints is probably not going to make the cost of NVMe flash storage go down.
Notice they also offer an adapter, meaning normal NVMe drives are supported. On a TCO basis, I think it will be cheaper once you consider the cost of a storage server with CPUs, RAM, HBA, NIC, etc., plus a switch.
There will be an SoC that handles the network and flash storage logic. Since the requirements will be put down in a spec, the SoC and firmware can be highly optimised for the task, so it could become as cheap as a consumer-grade router SoC. But it has to become an industry standard to drive the economies of scale; till then it will be expensive.
Have you used any networked PCI-Express? In our test lab we have some PCIe-over-fabric (I won't say NVMe-oF, as there are other devices on endpoints, from accelerators to RAM tanks) and genuine PCI-Express networks. While the reach is very short (no optical stuff, as far as I know), within a single aisle at best, the scalability and ease of reconfiguration this introduced into our systems is insane. The cost of PCIe switches is also starting to fall, and with networked Gen4 it actually makes sense.
Would you be able to do away with storage servers with this and just have racks of network switches? I know people are moving more towards software-defined storage, but cutting out the middlemen seems risky to data integrity. Other than that, this blew my mind too because it seems so scalable compared to current configurations.
I mean..... I feel like that would be pretty far off. PCIe Gen 4 x4 (typical NVMe SSD) is rated at 8 GB/s (that's a big B as in bytes). A 10GbE switch is small b as in bits, or about 1250 MB/s. So 10GbE is a ~6x slower interface than PCIe Gen 4 x4. So really, you kinda need to be at the 100GbE scale for this to make sense, because even a 40GbE switch is still only 5 GB/s. There are some "affordable" switches with 40GbE at the moment, but it's usually just uplink ports. Dunno. I feel like it would be a long time before it either gets retired or reaches the scale needed to be cheap to prosumers.
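For anyone who wants to redo the arithmetic, here is a quick sketch. These are nominal line rates only; real throughput will be lower once protocol overhead is added.

```python
# Compare Ethernet line rates against a PCIe Gen4 x4 NVMe SSD (~8 GB/s nominal).

def gbps_to_mbytes(gbps: float) -> float:
    """Convert a line rate in Gbit/s to MB/s (1 byte = 8 bits)."""
    return gbps * 1000 / 8

PCIE_GEN4_X4 = 8000  # MB/s, nominal

for name, gbps in [("10GbE", 10), ("25GbE", 25), ("40GbE", 40), ("100GbE", 100)]:
    mbs = gbps_to_mbytes(gbps)
    print(f"{name}: {mbs:.0f} MB/s ({PCIE_GEN4_X4 / mbs:.1f}x slower than PCIe Gen4 x4)")
```

So 10GbE lands at 1250 MB/s (~6.4x slower), 40GbE at 5000 MB/s, and only 100GbE actually outruns a single Gen4 x4 drive.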
@@JeffGeerling it would be cool to see something like an interposer board built around a cheap Arm SoC with native 1GbE and SATA (some have PCIe as well). One of those sub-$20 designs would be great. Perhaps there will be an affordable 2.5GbE SoC soon from one of the vendors.
@@xPakrikx would that be the 504-OSD Ceph cluster blog entry from Sage back in 2016? I did wonder what happened to the He8 Microserver stuff they were testing Ceph on. Would love to see it.
I’m going to watch this three times in a row, and see if I can see how this makes the “don’t trust anything” model of data management easier and more cost-effective to execute. I don’t fancy my chances! Thank you, Patrick, for continuing to bring us the crazy new things!
I'd really want one of these to play with at home and test out various crazy use-cases. Too bad they don't seem to be available to mere mortals. NVMe key-value mode could have some really interesting implications in setups like this.
@@ServeTheHomeVideo Would be nice if the NVMe-to-NVMe-oF adapter were available for testing. There is even a test version of the Seagate X18 out there. Would be really nice to have block storage available like that.
The type of tech that is showcased here isn't going to be available to us plebes for years, and then only second-hand. I really don't understand the name of the channel. I would understand it if he bought and showcased stuff that is finally hitting the used market, for, ya know, hobbyists at home to be looking for.
@@ServeTheHomeVideo "Part of the idea also for homelab folks is that the stuff used at work today trickles down in 3-5 years to homelabs as it is decommissioned." Yeah, so what, people are supposed to remember content from a video from 3-5 years ago? THAT's my whole point.
I was wondering the same thing. If the Pi had more than a single PCIe lane and a faster NIC, I'm sure you'd create a Pi version of this @Jeff ;) But yeah, nvme-cli and NVMe over TCP have been available in Linux for a while now. Seems like if someone made a board that attaches directly to an NVMe drive and translates to TCP, we could DIY something similar. I've searched a bit for an affordable SBC with PCIe lanes + 2.5GbE to create a poor man's version. Unfortunately they're all: unavailable, too expensive, slow NIC, PCIe 2, not enough lanes, etc. It looks like even an ITX/ATX board as the NVMe host would be expensive, because only latest-gen server CPUs or HEDT support a lot of PCIe lanes (with room for a fast NIC) or the denser PCIe 4/5.
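To illustrate the nvme-cli / NVMe-over-TCP side of this, here is roughly what mounting a remote namespace looks like on a stock Linux box. The target address, port, and NQN below are made-up placeholders; substitute whatever your own target exports.

```shell
# Load the NVMe/TCP initiator (mainline kernels have had this for a while)
modprobe nvme-tcp

# Ask a target what subsystems it exports (address/port are hypothetical)
nvme discover -t tcp -a 192.168.1.50 -s 4420

# Connect to one of the discovered subsystems by its NQN
nvme connect -t tcp -a 192.168.1.50 -s 4420 \
    -n nqn.2014-08.org.example:nvme:target0

# The remote namespace now appears as a local block device, e.g. /dev/nvme1n1
nvme list
```

The kernel's `nvmet` target side can export any local block device the same way, which is essentially the poor man's version of what these drives do in silicon.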
@@j0hn7r0n in the industrial computing world there are some backplanes for video applications which have PCIe switches managing up to 14 x16 PCIe slots, and those are more affordable than any server with Xeons for the PCIe lanes. I have installed some of those, and with a Core i5 or Pentium G we can connect many to the same system.
This is a pretty cool evolution of a storage device and goes along well with the evolution of nvmet in the kernel. It'll be interesting to see what processing you'll be able to do on the way to the SSD. If they're on the network, users are going to want trust/encryption, and then they'll want to do something else. Perhaps the next video on DPUs will cover that and more!
The big thing I noticed is that each unit has its own controller, thus moving the bottleneck to the switch, provided you can max out the throughput of the switch. Everything else seems straightforward. Though at first I thought I saw 2.5Gb and not 25/50Gb. Probably due to lots of talk about 2.5Gb for home stuff. :)
Very cool. I am picturing a lower cost adapter for SATA SSD or HDD and maybe a couple of 1 or 10 gig connectors being used as a disk shelf in a home lab. My size constraints led me to a case that I am not wild about and offloading the storage would be nice.
I'd just make an adapter for our current NVMe drives. We have super fast, super big, and super cheap NVMe drives already available to everyone - here they've just wrapped them in an industrial-grade carrier/package. Try buying an industrial RPi and it will blow your mind what they're asking for what is, at heart, a Raspberry Pi.
Was thinking of using AoE with a number of Raspberry Pi 4s connected to inexpensive SSDs, and using ZFS instead of software RAID. Unfortunately, getting RPis is not too easy now, but there are cheap (if bulky) SFF PCs. Something cheap to experiment with for network storage.
Yeah, I still use AOE myself at the data center because it can basically do what is discussed here - except you don't get as many ethernet ports lol. I wonder if they are using it in the firmware.
As much as we do not like this translation between server and storage disks, it also allows you to offload the management and disk IO to another box. It would be interesting to see how it impacts the performance of the servers now that they have to manage the disks as well.
Hi there! Kind of..... FC has a SCSI payload in the FC frame. NVMe over Fabrics wraps up an NVMe payload. And here that is NVMe essentially wrapped up in InfiniBand over Ethernet. RoCE v2 = RDMA (remote direct memory access, used for InfiniBand clustering by calling on memory addressing of resources across a network) wrapped up in UDP, IP, and Ethernet. This way you can make direct memory calls to the NVMe drives as if they were a local resource, but wrap it all up and send it out across the Ethernet network.
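To put a rough number on that layering, here is a sketch of the per-packet header cost of the RoCE v2 stack described above. Header sizes are the nominal ones; real frames may add VLAN tags, RDMA extended headers, etc.

```python
# Rough per-packet framing cost of RoCE v2 (Ethernet/IP/UDP/InfiniBand BTH).

headers = {
    "Ethernet": 14,    # dst/src MAC + EtherType
    "IPv4": 20,        # no options
    "UDP": 8,          # RoCE v2 uses UDP destination port 4791
    "IB BTH": 12,      # InfiniBand Base Transport Header (the RDMA part)
    "ICRC + FCS": 8,   # invariant CRC + Ethernet frame check sequence
}

payload = 4096  # a typical 4K block
overhead = sum(headers.values())
print(f"{overhead} framing bytes per packet, "
      f"{100 * overhead / (payload + overhead):.1f}% of a 4K transfer")
```

The point is that for 4K-block I/O the encapsulation tax is small, which is part of why direct memory calls over Ethernet are viable at all.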
A few years ago, around 2015, there was the Kinetic Open Storage project. It was launched with several vendors, but only Seagate ever made a drive, and only one model. The tech was not limited to HDD, 4TB, and 1GbE, but that was unfortunately the only product. It supported OpenStack Swift and Ceph OSD. Maybe a Linux server running on each drive was too resource-inefficient in 2015. Hope Kioxia makes the Ethernet-direct-to-drive approach work commercially this time. The entire solution with dual Ethernet switches in the same chassis is appealing.
The object storage/KV implementation in Kinetic was super useful for many workloads and gives you the ability to grow horizontally across many, many disks. Combined with how Ceph does the hash map and replication, you get some really cool capabilities.
So the bottom line is having a disk-as-a-node, with a DPU and an interface. You make a configuration and add new chassis-nodes as needed. The only difference I see is the client machine asking the drive/array directly for the data. You still need some compute somewhere to handle that (the DPUs). Therefore, it's still a computer with a bunch of drives. Still, for me, it's like the size of the universe: it's hard to conceive. Maybe it will make more sense with the second video.
I'm a little late to this party, but from a purely tech-based standpoint, this is amazing! With the continuous advancements in speed, multitasking, and the increase in lanes that modern CPUs have, removing any and all pitstops, layovers, and roadblocks to their ability to compute will drastically improve efficiency and workflow, as well as reduce the amount of hardware needed to maintain that "hyperspace" workflow. Reducing power usage, etc. - all great news, right? But after my nerdgasm passed, I realized one thing. We've been giving IP addresses to resources forever, but now we're going to be doing it on steroids. We are dangerously low on properly trained security individuals to monitor and maintain our current networks. What happens when we expand our current network topologies from a couple dozen or a few hundred nodes to thousands or hundreds of thousands by giving even our HDs, GPUs, etc. IPs? What happens when someone hacks into your 22 drives simultaneously and takes the whole dataset offline or holds it for ransom? I'm all for advancement and I would never want to stop progression out of fear. But can someone please start the conversation about how we safeguard ourselves in this "hyperspace" work environment we will be advancing into?
The modularity of the technology is fascinating, but I'm trying to understand the implications for latency. Say you have an array spanning multiple racks with drives acting as network endpoints; I suspect the access speed of the array will only be determined by the furthest endpoint. It's a toss-up between the network layer vs the CPU/silicon layer of traditional systems. How this compares to traditional PCIe-attached storage in a SAN will be interesting.
the speed of any data array appears to you as whatever the latency is of the furthest piece of data you need, regardless of how it's structured or configured. if your data is already where you need it, you can even break the network temporarily without interrupting the work being done. remember when we thought the best computer was the fastest one?
lol, now do ZFS on a DPU. On a more real note, I think this is cool for things that need access to a physical drive but may not have the chassis for it, but I don't think it's going to replace a SAN for things like a clustered filesystem (VMFS/vVols).
This looks like a new variant of AoE (ATA over Ethernet), which was a pretty cool way of inexpensively attaching network-based disks directly to a host.
Thanks Patrick, truly amazing. I was waiting for this review since you showed it last year. It is a great concept and I think it'll have a huge impact on the storage industry. I only wish you could do some benchmarks, because latency is the main factor, and I believe it'll be minimal here since, as you mentioned, the drive is directly attached to the network, hence less translation. Great job, and thanks to the entire STH team!
We were not running stacks on the drives themselves other than NVMeoF, but you get the idea of where this is going. Next up, we will have a DPU version of this.
@@novafire99 Ceph actually did have an experimental drive in partnership with WD that ran the OSDs on the drives themselves and used 2.5G Ethernet (1/10th what these NVMe drives do, but it's HDD so... shrug). It was a really cool idea, but managing OSDs and Ceph on the individual drives is definitely a bit clunky, because now you need extra RAM and compute on each drive controller board. You're not eliminating those extra layers as far as the inefficiencies are concerned so much as just moving the microservers running your OSDs onto the same board. With NVMe-oF direct to the drives, instead of adding a heavyweight Linux daemon to each drive, you're only adding a network stack and an embedded Linux host mostly as a control plane, and your data plane can still offload the bulk of the work. Aside from the reduction in processing, your interface is now much simpler. It's NVMe-oF; that's a much smaller jump to bridge from NVMe to NVMe-oF than it is to bridge from SATA to Ceph OSD. Yes, it's still having to deal with authentication, encryption, session management, etc., and you can expect needing more FW updates, but nothing like having to manage a bunch of Linux servers in your Ceph cluster. Having that logical separation with a clean, stable API avoids the added complexity of combining higher-level storage cluster stuff in Ceph with lower-level drives.
I just started learning about 10G/100G Ethernet and SFP/QSFP and I was like: why don't they make a 100Gb QSFP M.2 slot? Why is networking such an overengineered, underperforming, stagnant mess!
This particular one we did not do parity since it was RAID 0. However, how we had it setup it would be in the server/workstation. In the next video in this series, we will show it being done on a DPU for full offload.
Specifically, this was regular Linux software RAID (mdadm). You could also run ZFS over the block devices after connecting the workstation to them, or perhaps even a shared-disk file system.
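For anyone who wants to picture that step, here is roughly what the mdadm RAID 0 over already-connected NVMe-oF namespaces looks like. The device names and mount point are hypothetical; check `nvme list` for yours.

```shell
# Stripe four NVMe-oF namespaces (visible as local block devices) into RAID 0
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1

# Put a filesystem on the array and mount it as usual
mkfs.xfs /dev/md0
mount /dev/md0 /mnt/fabric-raid
```

From mdadm's point of view the fabric namespaces are just block devices, which is why ZFS or a shared-disk filesystem would layer on the same way.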
Like, that's iSCSI, a thing I learned about this week after having forgotten about it for 20 years. The reason I found it was I need to build a 10-node cluster out of computers from the recycling. Honestly, I'm shocked we haven't had IP-capable NVMe drives from the start. This is obviously because SAN sellers have been blocking it from existing. My whole week has been: how do I duct-tape a $40 100G PCIe card to a $200 2TB NVMe drive, preferably without using any fans.
Please let us run a Ceph OSD daemon (with Clevis/Tang for encryption) directly on this board. That would be so cool. I've dreamed of this for years, but finally we are closer. An octa-core A75 with 8GB of RAM and 4GB of NAND for the OS would be absolutely all I need. Accessing 25Gbps drives from one (or a few) servers using NVMe directly will not scale (how much can you put in one server, 800Gbps maybe?). With Ceph, each client connects independently and you can easily saturate a few Tbps with enough clients.
Are they also looking at adding spinning-rust drive switches as well, for the mass storage needs, as a lower tier of storage with hot data on the NVMe switches? Coming from the age of being some of the first users of NAS and SAN products back in the 90s, and then spending the next 20 years in enterprise storage and enterprise architecture, this new topology is sexy as hell.
@@ServeTheHomeVideo Looking forward to that, and everything else you bring out. STH the site has been a constant joy and so have your videos over the years! Thanks for these!
why do I see U as ten, and we're standing in the middle of a sand lot, talking about new stuff!🤣.....this is very helpful when gathering ideas to assemble the latest N greatest systems!! super-duper! great ideas! thanks Patrick! accessing data right from the storage will really cut down on overhead latency!!! nice! I can't believe they didn't make plastic skid covers though! it's so easy to lose ec's over usage time! good luck!
How does security / access control work in this model? As a traditional server physically has exclusive access to its directly connected drives, it's a natural place to put a security boundary. With every drive as its own node, I guess you would use network security techniques like VLANs. Sort of relatedly, it seems like the traditional model's intermediating layer of server nodes tends to compartmentalize damage from (accidental) misconfiguration. If you're reconfiguring one node's disks you might lose the data on that node, but other nodes will be unaffected. If you're configuring the network between nodes, you might lose access temporarily, but your data is still there on each node and you can recover by reconnecting them to the network. In contrast, if all your disks are in one big pool and all the configuration is in software, what stops a misconfiguration from hosing all your disks at once? In particular, there's a general system assumption that software can assume exclusive access to direct-attached disks (excluding exotic shared-disk filesystems), contrasting with the presumption that network nodes are able to simultaneously serve multiple clients that may connect to them. If you put the disks directly on the network, you would have to be very careful that your network config guarantees mutual exclusion of access to any one disk, or else a subsequent misconfiguration would cause multiple nodes to clobber each other's data on a single disk.
So that little PCB acts as a micro-Linux box and presents the NVMe to the network, right? While this is great in terms of overall management and scalability, doesn't it increase network traffic and address usage by a huge margin? But I guess that's not a problem in the datacenters; they can up the equipment like it's nothing. Something similar that I remember was the very old AoE, but that was probably more similar to FC.
Okay...that's pretty slick. I'm very keen to hear more about how one would actually use these SSDs, DPUs, or NVMEoF in production. I've always been wondering how you'd implement SSD redundancy with those. I'm assuming there isn't much compute power on each SSD so anything must be implemented on the storage consumer system. It does kind of scare me that you have to trust the storage consumer system to not accidentally screw up the disks it can access on the network. Oops...one server was acting up and so it wiped all the other namespaces on the 8 drives it was sharing...
So what I'm hearing, is that each NVMe drive is a near-zero-overhead ethernet-native NAS block device (as opposed to a limited component requiring a direct CPU attachment), and then you'd run the filesystem either on a dedicated, more traditional "heavy" NAS, or on the client itself. I wonder what kind of innovations this will enable elsewhere in the system. Maybe it'll be easier to design purpose-specific RAID accelerators (ideally compatible with a standard format like ZFS), once the drives aren't on the same PCIe bus. Maybe the individual parts of a modern filesystem (drive switching, parity calcs, caching, and file->block associations) could be separated into purpose-specific hardware modules. Or, maybe consumer operating systems start supporting NVMe-over-WiFi/Ethernet. Maybe NVMe gains network discovery features. Maybe drives with gigabit ports start showing up marketed to consumers, cheaper to use as a NAS directly than building an entire server around an NVMe-over-PCIe drive. (A single controller chip will probably under-cut even a Raspberry Pi, once optimized for cost.)
@octet33
> Or, maybe consumer operating systems start supporting NVMe-over-WiFi/Ethernet.
Already possible.
> Maybe NVMe gains network discovery features.
Already possible.
> Maybe drives with gigabit ports start showing up marketed to consumers,
1Gbps is way too slow. You are putting an expensive SSD that can do 500-4000MB/s (even cheapo SSDs can do it) behind a link limited to ~120MB/s, with worse latencies too. 10Gbps minimum for it to be useful.
Wendell from Level1Techs just did a video about how most RAID systems these days don't do error checking at the RAID-controller level; most just wait for the drive to report a data error. This must be true for this as well, correct?
I think Wendell was doing hardware RAID. These are more for software-defined storage solutions. Also, traditional "RAID" has really been used a lot less as folks move to scale-out, since you have other forms of redundancy from additional copies, erasure coding, etc. The next video in this series will be using a controller, but also software, so a bit different from what he was doing.
How is this different in concept from a bunch of ODROID HC1s? Each drive gets its own tiny Debian server and feeds a network port; sounds pretty much like this thing, only this thing is more modern and faster. So rather than managing 4 servers with 24 drives each, you now manage 96 servers with 1 drive each. What am I missing?
@@ServeTheHomeVideo All I'm saying is that sticking a tiny server on each drive has been done before, so the main new thing here is sliding 24 of those single-drive servers into a 2U enclosure with a fancy switch interface, rather than powering them and network-connecting them individually. The really exciting bit is how to manage those 24 servers so you don't need to micromanage each one individually.
Some custom ASIC in the SONiC switches could let the switches do RAID on their own - a RAID controller on steroids, LOL. Or maybe just put a Ryzen APU in there as the x86 control plane :) The integrated Vega should be able to do DPU work, I'd think.
Considering the redundancy is external to the system, this sounds like it's very easy to accidentally remove the wrong drive during a hot-swap replacement and incur very real data loss. :/ I thought this primarily when Patrick started talking about namespace slicing. I don't yet see what the advantage of this topology is.
I'm curious how this scales based on price. Presumably the dual 25GbE controllers "in" each drive are pretty expensive and drive up the cost. Is that cheaper than relying on 6x dual-200GbE DPUs to expose those PCIe lanes to the fabric?
@@ServeTheHomeVideo does it have an effect on power consumption though? Ethernet is meant for longer distances than PCIe, thus requiring more power... How does a rack of these fare power-wise compared to more "conventional" alternatives?
0:16 - Those storage bays have also had SCSI and U320 SCSI and more, I'm sure - unless my history is letting me down and those connection types only shipped with 3.5" form-factor HDDs?
I do wonder, however, if 25Gbps is fast enough, given that PCIe Gen 5 can do that on a single lane. But the configurability does seem worthwhile, and given that PCIe uses the same SerDes tech as electrical high-speed Ethernet, it should be possible to have one controller that can do both 100G Ethernet and PCIe in the same controller silicon.
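The per-lane comparison checks out on nominal numbers. A quick sketch (PCIe Gen4/Gen5 signal at 16/32 GT/s per lane with 128b/130b encoding):

```python
# Usable per-lane PCIe throughput vs. a 25GbE port, nominal figures only.

def pcie_lane_gbps(gt_per_s: float) -> float:
    """Usable Gbit/s per lane after 128b/130b encoding overhead."""
    return gt_per_s * 128 / 130

gen4 = pcie_lane_gbps(16)  # ~15.8 Gbps per lane
gen5 = pcie_lane_gbps(32)  # ~31.5 Gbps per lane
print(f"Gen4 lane: {gen4:.1f} Gbps, Gen5 lane: {gen5:.1f} Gbps, vs 25GbE: 25 Gbps")
```

So a single Gen5 lane does indeed carry more than a 25GbE port, while a Gen4 lane falls a bit short.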
Soooo, instead of having one big server for a bunch of disks, now we have a bunch of disks with an integrated server each; the main benefit which I see is more reliability. Hmmm, can a single drive or a pair be mounted as storage in Windows? :D
The lab was SUPER dark. Then add to the fact that there is a metal box (rack) around these systems. That is a Canon R5 (R5 C was not out yet) at ISO 12800 just to get something somewhat viewable.
@@ServeTheHomeVideo fair enough! Data centers aren’t really known for perfect lighting. Honestly cool to see that the R5 has such usable photos at iso 12800
Ah, so it's not running on x86, it's running on x64.... but there is still a lot of translation going on; granted, where the translation is happening has moved, but it is still happening. The ONLY real difference is the communication between the drive and the controller: rather than being serial, it is instead using Ethernet. NVMe over Fabrics SSD, so it is still a serial device (that would be the NVMe part), then there is a controller sat out in front doing the conversion.... this looks like some sort of iSCSI connection, granted they changed the command set to NVMe. The network diagram @12:15 is also not correct, or is there a large part of the network missing? Or, rather than the big box "switch", have these been connected individually to the network with a device we haven't seen?
Does this solve the problems with concurrency in protocols like iSCSI? Nevermind, I think I answered my own question. It looks like you'd need to use a filesystem or DPU that supports mounting across multiple machines
It's interesting however, I wonder how it would really be in the data center world. We do a lot of high storage types of systems linked up with 40/100g ports. It is fun to dream of the ideas and usage for it.
Don't get me wrong, I love new tech and this has applications, I'm sure, but the applications where this would excel are fairly limited to things like putting lots of these talking directly to a server. Lots of small to mid-range data centers would end up front-ending this with a normal x86 server to even use it and handle things like shares, permissions, etc. In those cases, you'd actually be adding translation steps. I'd see this just being a competitor for FC/FCoE/iSCSI-attached storage.
Is there some way to manage array ownership in a distributed manner? In other words, is it possible to create an array on one machine such that all other storage clients on the network know that disks 1...n belong to array A?
I should clarify: know that disks 1...n belong to array A even if the array is not mounted, so that reallocation to another array would be prevented. Cool vid and tech btw, thanks!
The underlying difference isn't that it uses IP, it's that NVMe is not exposed outside of the disk. It's more like: NVMe over Ethernet, therefore a MAC address is the lowest OSI layer for a directly connected device (ie. the switch chip) to access it. As opposed to: NVMe over PCI, therefore a PCI address is the lowest OSI layer for a directly connected device (ie. the CPU) to access it. No need to jump to OSI layer 3 with IP addresses. TLDR: Normal NVMe devices are "over PCI", which is OSI layer 1. With NVMe over Ethernet, the lowest accessible OSI layer is now layer 2 (Ethernet/MAC). FINAL EDIT (lol): I'm referring to all layer 2 protocols as "Ethernet" for simplicity. "Over Fabric" encompasses all layer 2 connections, such as Infiniband.
Hi Preston! Thank you for joining! I think for the general market, it is going to take some time. A lot of that is just based on how these drives are being marketed. If we saw an industry-wide push, it would be much faster. I also think that as DPUs become more common, something like the EM6 starts to make a lot more sense, since the infrastructure provider can then just pull storage targets over the network, provision/do erasure coding directly on the DPU, and present it to client VMs or even the bare-metal server.
iSCSI: TNG. 20 years later, a man talks excitedly about the next greatest thing from Microsoft while showing a Linux prompt. I like the idea and the hardware, but a small part of me feels like I am experiencing some sort of waking post-truth nightmare.
Sadly this still uses the 'normal' L2, L3, and L4 protocols with their minimum packet sizes, and therefore will not work as wonderfully for low-latency use cases as they make it sound.
This isn't what Linus used at all. Their project is standard NVMe drives directly connected to your everyday x86 server. This is way beyond that type of setup...