Great video, thanks very much! Editorial: The Proxmox UI needs a LOT of work. Having to use the Ceph dashboard to define pools and rule sets manually and having to adjust storage.cfg to detail the Data vs Metadata is just miserable, especially because it doesn't leave any evidence in the Proxmox UI. I'm looking forward to checking out your CephFS video. That's the reason I'm here. Thanks again!
Hyper Convergence has too cool of a name - but it's no joke for storage pools, it just makes sense to keep costs down and get super redundancy and great performance.
I really appreciate the work you put into this video. I looked into Ceph a while ago but I don't think it had the web console that you showed. I really appreciate how clearly you explain it.
Great video but left me with a headache 🙂 I'm just starting with PVE so I will likely leave HA and Ceph for later but just settle for replicating between two nodes and manual failover in case of node loss. Not the best but I need to run work on PVE now and don't want to be tearing down and re-installing constantly as I learn. I guess next step is a virtual PVE lab - I am *assuming* PVE can be installed on PVE 🙂
How fast is this? How would a database like postgres perform inside the cluster on one node compared to running natively on the same hardware node but without ceph? I can't find much information about this, just that people have very diverse opinions on running kubernetes/databases/ceph
Every few months, the thought of ceph for my home servers crosses my mind. I never sat down to truly understand it and now I don’t have to. Your video explained it all for me so I can understand the pros and cons it offers. I’m glad to see you posting again but don’t forget to keep taking care of yourself.
Thanks! I'm keeping to a more reasonable schedule now. I'm still debating whether I want to migrate to Ceph vs my current two-box system (one TrueNAS + one Proxmox).
As usual, this one was excellent and clarified erasure coding. Many thanks. Setting up cephfs is super easy. For anyone curious... NODE -> Ceph -> CephFS, create Metadata Server(s) then Create CephFS, create a Pool, and add that pool to Datacenter -> Storage as type CephFS. Job done. Just don't backup to your ceph-fs pool from VMs that are hosted on your RBD ceph-vm pool if both are on the same OSDs. The read/write contention is massive :-)
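For anyone who prefers the CLI, a rough equivalent of those GUI steps looks something like this (a sketch; the filesystem name and pg_num are just example values, adjust for your cluster):

```
# on each node that should run a metadata server
pveceph mds create

# create the CephFS (data + metadata pools) and register it as a
# storage entry in /etc/pve/storage.cfg in one go
pveceph fs create --name cephfs --pg_num 64 --add-storage
```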
@@apalrdsadventures FWIW when comparing Ceph/FS vs GlusterFS, the ram usage was a massive difference. There were 4 related Gluster daemons and they used 80 MB of ram. The mds, mgr, mon and osd Ceph daemons took up 3.4 GB of ram! Also considering that Gluster works best with XFS, there were no massive ZFS memory requirements either. However, Gluster lacks an RBD-style block storage system similar to ZFS zvols, so considering that Ceph provides block level storage and can almost be completely managed from the Proxmox gui... I'll just make sure I have a minimum of 32 GB of ram per node and just go with Ceph/FS.
That's partially because the Ceph OSD is doing its own caching (since it operates on the block device directly and skips the kernel filesystem page cache), whereas Gluster is relying on XFS and the kernel page cache (which is still there even without ZFS, it just doesn't show up as consumed memory like ZFS does). The default limit is 4G of cache per OSD and it will scale back with system memory pressure. The monitor also does use a decent amount of RAM, although on my cluster it seems to be around 500M which is reasonable I think.
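If the default per-OSD cache is too much for small nodes, the memory target can be lowered cluster-wide. A sketch (2147483648 = 2 GiB is just an example value):

```
# show the current per-OSD memory target (default is 4 GiB)
ceph config get osd osd_memory_target

# lower it to ~2 GiB per OSD for RAM-constrained nodes
ceph config set osd osd_memory_target 2147483648
```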
@@apalrdsadventures Right, got it. At some point I want to do a tiny 3 node nanolab project, a bit like your $250 project, and I'd probably go with Proxmox on LVM (specifically not ZFS) and Gluster and see how few resources could be used to still end up with HA. Gluster can store qcow2 and raw VM images. My current 3 node microlab has been stable for 2 weeks now so if it doesn't require any attention for another month I might try a Gluster based nanolab project.
Thanks for explaining how to INSTALL and set up the CEPH Dashboard! Your instructions on where and how to actually install said dashboard in THIS video are awesome! Can you please mark where in the video these instructions are? Thanks!
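For reference, installing the dashboard on a Proxmox-managed Ceph cluster boils down to something like this (a sketch; the admin username and password file are placeholders):

```
# install the dashboard module and enable it on the active manager
apt install ceph-mgr-dashboard
ceph mgr module enable dashboard
ceph dashboard create-self-signed-cert

# create an admin user; the password is read from a file
echo -n 'ChangeMe123' > /root/dashboard_pw
ceph dashboard ac-user-create admin -i /root/dashboard_pw administrator
rm /root/dashboard_pw

# the dashboard then listens on https://<node>:8443 by default
```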
pls bench the netfs and also compare to other options like ocfs2/gluster over zfs/sshfs/smb/nfs - having a couple NASes may be a good way to have more than 1 netfs and do backups? this is all semi advanced, great to learn and experiment with - most people and SMBs will want the most basic setup but with the best price/perf and an easy ui - ymmv - i think the toughest part is just the setup - maybe have a website with the howtos? #simple machines
running it on 2.5GbE would be nice - making everything nvme would be nice since there is real price parity now - don't leave perf on the table - run some simple benchmarks after you get it all built out - consider spinning up vms on ws to join the cluster as a stopgap until you can get more nodes
Does this mean that you would need separate block devices (drives) so that you would be able to create separate replicated and erasure-coded pools? Or are you able to use the same drives for both? I just bought three mini PCs and each only has a single 2242 M.2 NVMe slot. So right now, I've managed to partition the drive so that I was able to install Proxmox on it and then use the remainder of the drive as the Ceph OSD, but it's only the replicated pool. For me to do this with only one 2242 M.2 NVMe SSD, does that mean that I would have to repartition the SSD so that I would actually split the remaining space into two partitions - one for the Ceph pool replicated rule and one for the Ceph erasure-coded pool? Thoughts? Your help is greatly appreciated. Thank you.
You can have multiple pools on a single set of drives. The rules for each pool will take effect for that pool only, so there need to be enough drives in the system for it to meet the rules without overfilling one drive.
@@apalrdsadventures Agreed. So I have a 512 GB 2242 M.2 NVMe SSD in each of the 3 nodes. 100 GB has been allocated for the Proxmox install + 8 GB for swap (per node). That leaves about ~404 GB (~376 GiB) available for Ceph. Right now, in its current state, I created the first Ceph OSD (per node) which is used in a ceph replicate pool. But after watching your video, maybe I can repartition that such that the fourth partition (/dev/nvme0n1p4) in each of the drives is maybe 50-100 GB which will be the OSD for the Ceph replicate pool and then the remainder (~304 GB/~296 GiB) can be used for the Ceph erasure pool (/dev/nvme0n1p5), right? I just want to double check with you whether I have understood the material that you have presented in your video correctly and properly. And that should still allow for HA failover in case one of my nodes dies -- that the VMs and/or CTs should be able to failover onto the other remaining nodes, correct? Your help is greatly appreciated. Thank you.
A given pool isn't tied to specific OSDs (disks / partitions). All pools can use all free space on all disks, according to their rules. So no need to partition into replicate / erasure. Replicated and erasure coded pools can share the same disks.
@@apalrdsadventures Thank you. I will have to play around with setting that up to see if I would be able to put both the erasure coded pool and the replicated pool on the same OSD. Your help is greatly appreciated. *edit* Just tried it. It works! Yay! Thank you! Happy New Year!
Perfect, I'm in the process of deploying a new Proxmox/Ceph cluster at my company. I created an SSD pool I want to use for I/O hungry machines, but creating a machine on the SSD pool makes no difference in read and write I/O (tested with dd) vs putting it on the default manager pool (maybe because that pool uses the SSDs too, since they are also available in it?). Maybe you can point me in a direction - it would be highly appreciated.
On a pool of mixed drive types, with the failure domain set to Host, Ceph will end up selecting 3 hosts and then from each host selecting an OSD based on the OSD's weight (by default the weight is the capacity in TB). Some of these PGs will end up with one or more SSDs in the mix of course.

All IO goes to the 'first' OSD holding the PG, which will then do IO on the replicas if necessary. So a write op requires at least two of the replicas to complete and a read op can be satisfied directly by the 'first' OSD. If that first OSD is an SSD, then the whole PG will get the read performance of the SSD and the write performance of the second fastest disk.

The disk image will end up spread across many PGs and normal performance will average out, but in a purely sequential, single queue depth workload you'll end up hitting the same PG for a while and probably running into sequential network bandwidth limits, especially when writing (as data needs to pass over the network 3 times). FIO is better for measuring IO bandwidth, especially if you know the block size and queue depth your application supports (e.g. for databases).
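As a concrete example, a fio run approximating a database-style workload might look like this (a sketch; the file path is a placeholder, writes are destructive, and bs/iodepth should be matched to your actual application):

```
fio --name=db-sim --filename=/mnt/test/fio.dat --size=4G \
    --direct=1 --ioengine=libaio --rw=randrw --rwmixread=70 \
    --bs=8k --iodepth=16 --numjobs=1 \
    --runtime=60 --time_based --group_reporting
```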
Good video as usual. Quick question about "data_pool" and "metadata_pool" in erasure-coding mode. Is there a "rule" or "good practice" that would help define a good ratio between the capacity stored in the "data_pool" and the size of the SSD(s) that will host the "metadata_pool"?
In my experience, the metadata pool is extremely small. It's less than a MB in my testing, although I don't have a ton of data on my test setup (the total capacity is only ~300GB).
I have 3 nodes. Right now I have 1 built out with 32 GB RAM, a 2 TB HDD, and a 512 GB drive with Proxmox installed. How should I build the 1st node knowing I want to add 2 more later when I get the money? What filesystem should I put on the 2 TB HDD where the VMs/containers will be stored? I don't want to have to do this over when I get the other 2 ready, so I kinda want it to be ready to add them. I am going to build the other 2 out at the same time.
In general I do ZFS unless I have a good reason not to. This goes for both the OS drive and other drives. You can create separate ZFS pools now to plan on replacing one of them with Ceph later. If you want to add Ceph later you'd need to reformat the 2TB HDD, but it's possible to migrate in place (OS disk remains ZFS, 2 new OSDs get added on 2 new nodes, create pool in degraded state, move all VMs/CTs to new pool, then reformat 2TB HDD and add it as third OSD, let it replicate to third OSD). Depending on if you care about cluster-wide data guarantees you get from Ceph you could just keep zfs on the 2TB drive, it's not a bad solution either.
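A rough sketch of that in-place migration, assuming the HDD shows up as /dev/sdb and the old ZFS pool is called 'tank' (all names here are placeholders):

```
# on each of the two new nodes: add their disk as an OSD
pveceph osd create /dev/sdb

# create the pool; it will run degraded with only 2 of 3 replicas placeable
pveceph pool create vmpool --add_storages

# move VM disks off the old ZFS storage, e.g. for VM 100:
qm move-disk 100 scsi0 vmpool

# once the 2TB HDD is empty, destroy the old ZFS pool and add the disk as the third OSD
zpool destroy tank
pveceph osd create /dev/sdb
```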
great stuff - you need another nas spinning rust node to make the cluster fully redundant - do some home serving with wireguard and an nginx reverse proxy on a cheap vps - or two - think about opnsense ha as independent hw nodes - it can do load balancing to the nodes and it may get better perf than you think - this is a great setup for a small/med biz since it is so easily and cheaply scalable for capacity #live migration
Answering all your comments in one place: - I'd need 3 nodes minimum with the 2+1 code. 1 drive each on 4 hosts would be better than 4 drives all on one host. - NVMe disks need relatively new hardware, something I don't have. But I'm working on a video with new hardware, just not Ceph yet. - I'll get to CephFS and distributed filesystems, it's a big topic and it deserves at least a few videos of its own
I run TrueNAS Core on my proxmox cluster as a VM. I pass thru some spinning disks to the VM and use those for ZFS. Could a CephFS implementation replace this? The TrueNAS is rock solid btw, very happily using that ZFS pool for years; I've had disk issues and replaced disks easily, resilvering and recovering with no data loss. I also use Proxmox Backup Server with an LTO4 tape drive and I back up my proxmox cluster. It gets a bit hokey when I back up my TrueNAS data using this.
CephFS *could* replace it, or you could mount a dataset from the Proxmox host ZFS pool into an LXC container (and manage sharing there), or install samba and manage sharing on the host. Ceph (including CephFS) doesn't really scale down to single servers well though, although it's possible. ZFS is certainly much more optimized for in-memory transactions.
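The bind-mount route looks roughly like this, assuming a host pool named 'tank' and container 101 (both placeholders; unprivileged containers may also need UID mapping / permission tweaks to write):

```
# create a dataset on the host and bind-mount it into the container
zfs create tank/share
pct set 101 -mp0 /tank/share,mp=/srv/share

# then install and configure samba (or whatever sharing you want) inside CT 101
```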
Have you checked out the `pveceph pool create` command with the --erasure-coding parameter? That makes it quite a bit easier to use EC pools in Proxmox. :)
It's about the same amount of work to do it from the command line vs GUI, since you need to do a ceph osd crush rule create for the metadata pool also.
True, for a cluster that requires quite specialized settings, like failure domain on the OSDs instead of hosts and device class specifics, the metadata pool needs a different CRUSH rule if it should match the EC pool in these things :)
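Roughly what that looks like on the CLI, assuming SSD OSDs and an OSD-level failure domain (pool and rule names are made up, and I believe pveceph names the two pools <name>-data and <name>-metadata, so double-check on your version):

```
# replicated CRUSH rule matching the EC pool's device class / failure domain
ceph osd crush rule create-replicated rep-ssd-osd default osd ssd

# create the EC pool pair and register it as Proxmox storage
pveceph pool create ecpool --erasure-coding k=2,m=1,device-class=ssd,failure-domain=osd --add_storages

# point the metadata pool at the matching replicated rule
ceph osd pool set ecpool-metadata crush_rule rep-ssd-osd
```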
This may be a weird/bad idea, but is there any way to make an SSD pool that puts its replicated data on HDDs? The idea is to get 100% efficient use of your SSDs, and have all the replication overhead on the HDDs. Obviously if you were ever to lose an SSD you would then be stuck on HDD speeds for the time being. You could definitely do this manually by simply making two pools, and using some program to just copy all data on the SSD pool to the HDD pool, and setting the SSD pool to have no replication. I was just curious if ceph has anything built in to do something like this.
No, all copies share the same storage rule (which is either any or a specific device class). But you wouldn't want to do this either, since Ceph guarantees writes when 2 out of 3 replicas have completed, the fastest two of three drives will end up determining the write speed of the whole operation. The write will stall until at least 2 replicas are in place, and the pool will be degraded if the third replica is substantially behind the other two.
@@apalrdsadventures I just found out there are features that mix SSD's and HDD's in the same pool. They are called hybrid pools, and there is also primary affinity. With 3 replicated, for example, you can have it write the 1st copy to a SSD OSD, and then the other 2 copies to HDD's. As you point out, this will not speed up write speeds, as you will have to write to the slow HDD still. But it will speed up read speeds substantially. I only just now read about this. But I would be curious if this is viable with erasure coding too, rather than just replication.
You're right, it's possible by writing completely custom CRUSH rules - docs.ceph.com/en/latest/rados/operations/crush-map/#custom-crush-rules Depending on your workload it might be easier / better to use tiering to keep recent data in SSDs. That's my plan, with CephFS and video editing data.
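For the curious, the hybrid rule from the Ceph docs looks roughly like this (a sketch; the rule id and pool name are placeholders):

```
rule ssd-primary-hybrid {
    id 5
    type replicated
    # first replica on an SSD host, remaining replicas on HDD hosts
    step take default class ssd
    step chooseleaf firstn 1 type host
    step emit
    step take default class hdd
    step chooseleaf firstn -1 type host
    step emit
}
```

You'd decompile the CRUSH map, add the rule, recompile, and point a pool at it, something like:

```
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt   # edit crush.txt, add the rule above
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new
ceph osd pool set mypool crush_rule ssd-primary-hybrid
```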
Thank you for the guide. I created an SSD cache pool for my HDD pool with your guide. Which pool do I add in proxmox as the storage for my VMs, the HDD pool?
@@apalrdsadventures I am not using separate metadata + data pools. The metadata is in the osd itself. In this case which pool do I add into the pve storage? The SSD cache or HDD pool. Thank you for the excellent proxmox guides. You have a new sub.
So you're using separate DB/WAL disks? In that case, there is only one pool and you add that to the GUI. If you're using the 'tiered storage' commands from the Ceph documentation, be aware that they specifically recommend not using it for RBD. But you'd select the HDD pool in that case.
Filesystems tend to do a lot of disk IO across the whole device in the background (e.g. ZFS scrubs, other filesystem cleanup) that would pull blocks into cache unnecessarily. Copy-on-write filesystems also tend to pull the entire disk into cache as blocks are allocated and discarded when a file is modified. With CephFS and RGW, on the other hand, Ceph is directly aware of the IO operation and knows which files should be kept in cache.
I noticed that with:

rbd: ceph-ssd
        content rootdir,images
        krbd 0
        pool ceph-nvme
        data-pool ceph-ssd

both display the same disk images. Must be the metadata, but not so elegant if it's shared with a regular pool. Ended up creating a metadata pool:

rbd: ceph-ssd
        content images,rootdir
        krbd 0
        pool ceph-ssd_metadata
        data-pool ceph-ssd

rbd: ceph-ssd_metadata
        content images,rootdir
        krbd 0
        pool ceph-ssd_metadata

Now it makes sense again!