
All about POOLS | Proxmox + Ceph Hyperconverged Cluster fäncy Configurations for RBD 

apalrd's adventures
63K subscribers · 31K views

Published: Oct 2, 2024

Comments: 76
@lawrencerubanka7087 3 months ago
Great video, thanks very much! Editorial: the Proxmox UI needs a LOT of work. Having to use the Ceph dashboard to define pools and rule sets manually, and having to adjust storage.cfg to specify the data vs. metadata pools, is just miserable, especially because it doesn't leave any evidence in the Proxmox UI. I'm looking forward to checking out your CephFS video. That's the reason I'm here. Thanks again!
@spoonikle 1 year ago
Hyper-convergence has too cool of a name, but it's no joke for storage pools. It just makes sense to keep costs down and get super redundancy and great performance.
@apalrdsadventures 1 year ago
Oh yeah, it makes way more sense than loading up on a ton of disk shelves on a single server and pushing that one server harder.
@Mikesco3 1 year ago
I really appreciate the work you put into this video. I looked into Ceph a while ago but I don't think it had the web console that you showed. I really appreciate how clearly you explain it.
@apalrdsadventures 1 year ago
The Ceph dashboard is pretty great, honestly. I'm just using a few modules so far, but I'm sure I'll get to more of it eventually in the series.
@NineKeysDown 1 year ago
Thank you, that was really helpful and filled in some of the gaps I was missing!
@apalrdsadventures 1 year ago
Glad it was helpful!
@junialter 11 months ago
Man, you're the best IT YouTuber I've seen in a long while. Thank you so much.
@mikebakkeyt 1 year ago
Great video, but it left me with a headache 🙂 I'm just starting with PVE, so I will likely leave HA and Ceph for later and just settle for replicating between two nodes with manual failover in case of node loss. Not the best, but I need to run work on PVE now and don't want to be tearing down and re-installing constantly as I learn. I guess the next step is a virtual PVE lab - I am *assuming* PVE can be installed on PVE 🙂
@MultiMunding 8 months ago
How fast is this? How would a database like Postgres perform inside the cluster on one node compared to running natively on the same hardware node but without Ceph? I can't find much information about this, just that people have very diverse opinions on running Kubernetes/databases/Ceph.
@brunosalhuana7431 1 year ago
Can you do a video about using Ceph RBD with Samba?
@hoaxbuster78 3 months ago
I tried to install the Ceph dashboard; do you have a tutorial? Thanks!
@BrianPuccio 1 year ago
Every few months, the thought of Ceph for my home servers crosses my mind. I never sat down to truly understand it and now I don't have to. Your video explained it all for me so I can understand the pros and cons it offers. I'm glad to see you posting again, but don't forget to keep taking care of yourself.
@apalrdsadventures 1 year ago
Thanks! I'm keeping to a more reasonable schedule now. I'm still debating on if I want to migrate to Ceph vs my current two-box system (one TrueNAS + one Proxmox).
@runningcolt 2 months ago
@apalrdsadventures who won the debate?
@MarkConstable 1 year ago
As usual, this one was excellent and clarified erasure coding. Many thanks. Setting up CephFS is super easy. For anyone curious... Node -> Ceph -> CephFS, create Metadata Server(s), then Create CephFS, create a pool, and add that pool to Datacenter -> Storage as type CephFS. Job done. Just don't back up to your ceph-fs pool from VMs that are hosted on your RBD ceph-vm pool if both are on the same OSDs. The read/write contention is massive :-)
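For reference, the GUI steps above can also be done from the CLI. This is a sketch based on the pveceph tool (the storage/FS names here are examples; check the pveceph manpage for your PVE version):

```shell
# On each node that should run a metadata server:
pveceph mds create

# Create the CephFS (this creates its data and metadata pools)
# and register it under Datacenter -> Storage in one step:
pveceph fs create --name cephfs --add-storage
```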
@apalrdsadventures 1 year ago
Glad you liked it!
@MarkConstable 1 year ago
@apalrdsadventures FWIW, when comparing Ceph/FS vs GlusterFS, the RAM usage was a massive difference. There were 4 related Gluster daemons and they used 80 MB of RAM. The mds, mgr, mon and osd Ceph daemons took up 3.4 GB of RAM! Also, considering that Gluster works best with XFS, there were no massive ZFS memory requirements either. However, Gluster lacks an RBD-like block storage system similar to ZFS zvols, so considering that Ceph provides block-level storage and can almost be completely managed from the Proxmox GUI... I'll just make sure I have a minimum of 32 GB of RAM per node and go with Ceph/FS.
@apalrdsadventures 1 year ago
That's partially because the Ceph OSD is doing its own caching (since it operates on the block device directly and skips the kernel filesystem page cache), whereas Gluster is relying on XFS and the kernel page cache (which is still there even without ZFS, it just doesn't show up as consumed memory like ZFS does). The default limit is 4G of cache per OSD and it will scale back with system memory pressure. The monitor also uses a decent amount of RAM, although on my cluster it seems to be around 500M, which is reasonable I think.
@MarkConstable 1 year ago
@apalrdsadventures Right, got it. At some point I want to do a tiny 3-node nanolab project, a bit like your $250 project, and I'd probably go with Proxmox on LVM (specifically not ZFS) and Gluster and see how few resources could be used to still end up with HA. Gluster can store qcow2 and raw VM images. My current 3-node microlab has been stable for 2 weeks now, so if it doesn't require any attention for another month I might try a Gluster-based nanolab project.
@BigBadDodge4x4 11 months ago
Thanks for explaining how to INSTALL and set up the Ceph Dashboard! Your instructions on where and how to actually install said dashboard in THIS video are awesome! Can you please mark where in the video these instructions are? Thanks!
@hpsfresh 1 year ago
Can I change the priority for OSDs so a VM works with the OSD on the same node (since that would be faster via the internal virtual 100Gbit network)?
@shephusted2714 1 year ago
Please bench the netfs and also compare to other options like OCFS2/Gluster over ZFS/SSHFS/SMB/NFS - having a couple of NASes may be a good way to have more than one netfs and do backups? This is all semi-advanced, great to learn and experiment with - most people and SMBs will want the most basic setup but with the best price/perf and an easy UI - YMMV - I think the toughest part is just the setup - maybe have a website with the howtos? #simple machines
@minecrafter9099 8 months ago
Latest Proxmox and Ceph don't seem to get along very well with the dashboard, so to do erasure coding it seems like the CLI is the only option.
@apalrdsadventures 8 months ago
Yeah, unfortunately it's a bug in Python that's impacting Dashboard, so the dashboard doesn't really work for anyone (Proxmox or not).
@tomaszbyczek7611 1 year ago
You are my MASTER!! Thank you from Poland :) Keep doing your great work!
@shephusted2714 1 year ago
Running it on 2.5 would be nice - making everything NVMe would be nice since there is real price parity now - don't leave perf on the table - run some simple benchmarks after you get it all built out - consider spinning up VMs to join the cluster as a stopgap until you can get more nodes
@voldllc9621 1 year ago
Good stuff, but I did not see how you tailor the number of placement groups for the pools.
@geesharp6637 2 months ago
So did you ever make the CephFS and other Ceph videos you mentioned at the end of this one?
@apalrdsadventures 2 months ago
No, mostly because my 3-node cluster hardware ended up losing a node when I lost my workstation board.
@AamraNetworksAWS 1 year ago
Hi, can you please show how to install the Ceph dashboard? I'm getting lots of errors.
@remusvictuelles1669 1 year ago
Very informative and an easy-to-understand explanation... you've got a sub!
@berniemeowmeow 1 year ago
Really enjoy your channel and explanations. Thank you!
@apalrdsadventures 1 year ago
Glad you like them!
@ewenchan1239 9 months ago
Does this mean that you would need separate block devices (drives) to be able to create separate replicated and erasure-coded pools? Or are you able to use the same drives for both? I just bought three mini PCs, and each has only a single 2242 M.2 NVMe slot. So right now, I've managed to partition the drive so that I was able to install Proxmox on it and then use the remainder of the drive as the Ceph OSD, but it's only the replicated pool. For me to do this with only one 2242 M.2 NVMe SSD, does that mean that I would have to repartition the SSD so that I would actually split the remaining space into two partitions - one for the Ceph replicated-rule pool and one for the Ceph erasure-coded pool? Thoughts? Your help is greatly appreciated. Thank you.
@apalrdsadventures 9 months ago
You can have multiple pools on a single set of drives. The rules for each pool will take effect for that pool only, so there need to be enough drives in the system for it to meet the rules without overfilling one drive.
@ewenchan1239 9 months ago
@apalrdsadventures Agreed. So I have a 512 GB 2242 M.2 NVMe SSD in each of the 3 nodes. 100 GB has been allocated for the Proxmox install + 8 GB for swap (per node). That leaves about ~404 GB (~376 GiB) available for Ceph. Right now, in its current state, I created the first Ceph OSD (per node), which is used in a Ceph replicated pool. But after watching your video, maybe I can repartition that such that the fourth partition (/dev/nvme0n1p4) on each of the drives is maybe 50-100 GB, which will be the OSD for the Ceph replicated pool, and then the remainder (~304 GB/~296 GiB) can be used for the Ceph erasure pool (/dev/nvme0n1p5), right? I just want to double check with you whether I have understood the material that you have presented in your video correctly and properly. And that should still allow for HA failover in case one of my nodes dies -- the VMs and/or CTs should be able to fail over onto the remaining nodes, correct? Your help is greatly appreciated. Thank you.
@apalrdsadventures 9 months ago
A given pool isn't tied to specific OSDs (disks / partitions). All pools can use all free space on all disks, according to their rules. So there's no need to partition into replicated / erasure; replicated and erasure-coded pools can share the same disks.
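As a rough sanity check on how pool rules translate into usable space, here is a back-of-the-envelope calculation (the 376 GiB-per-OSD figure comes from the thread above; real clusters lose some space to BlueStore overhead and full ratios):

```python
def usable_capacity(raw_per_osd, num_osds, *, replicas=None, k=None, m=None):
    """Rough usable capacity of a Ceph pool, ignoring overhead and full ratios."""
    raw = raw_per_osd * num_osds
    if replicas is not None:
        return raw / replicas      # size=N replicated pool
    return raw * k / (k + m)       # k+m erasure-coded pool

raw_gib = 376  # free space per OSD, from the thread above
print(usable_capacity(raw_gib, 3, replicas=3))  # 3-way replication -> 376.0 GiB
print(usable_capacity(raw_gib, 3, k=2, m=1))    # 2+1 erasure coding -> 752.0 GiB
```

Both pools draw from the same raw space, so their effective ratios only apply to whatever data each one actually stores.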
@ewenchan1239 9 months ago
@apalrdsadventures Thank you. I will have to play around with setting that up to see if I would be able to put both the erasure-coded pool and the replicated pool on the same OSD. Your help is greatly appreciated. *edit* Just tried it. It works! Yay! Thank you! Happy New Year!
@krzycieslik6650 1 year ago
CrystalDiskMark shows me 2.26 (read) and 1.53 (write) on random 4KiB transfers. Where can I find instructions on how to make it faster?
@GrishTech 1 year ago
Ceph has very nice redundancy and scalability, but not great random I/O, especially at queue depth 1.
@datenkralle1406 1 year ago
Perfect, I'm in the process of deploying a new Proxmox/Ceph cluster at my company. I created an SSD pool I want to use for I/O-hungry machines, but creating a machine on the SSD pool makes no difference in read and write I/O (tested with dd) vs. putting it on the default pool (maybe because that pool uses the SSDs too, since they are also available in it?). If you could point me in a direction, it would be highly appreciated.
@apalrdsadventures 1 year ago
On a pool of mixed drive types, with the failure domain set to Host, Ceph will end up selecting 3 hosts and then, from each host group, selecting an OSD based on the OSD's weight (by default the weight is the capacity in TB). Some of these PGs will end up with one or more SSDs in the mix, of course. All IO goes to the 'first' OSD holding the PG, which will then do IO on the replicas if necessary. So a write op requires at least two of the replicas to complete, and a read op can be satisfied directly by the 'first' OSD. If that first OSD is an SSD, then the whole PG will get the read performance of the SSD and the write performance of the second-fastest disk.

The disk image will end up spread across many PGs and normal performance will average out, but in a purely sequential, single-queue-depth workload you'll end up hitting the same PG for a while and probably running into sequential network bandwidth limits, especially when writing (as data needs to pass over the network 3 times). FIO is better for measuring IO bandwidth, especially if you know the block size and queue depth your application supports (i.e. for databases).
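As a starting point, a minimal fio run along those lines might look like this (filename, size, and runtime are placeholders; run it inside the VM against a scratch file, and match --bs/--iodepth to your application):

```shell
# 4KiB random-write test at queue depth 1 (roughly the worst case
# for Ceph), bypassing the guest page cache with --direct=1:
fio --name=randwrite --rw=randwrite --bs=4k --iodepth=1 \
    --ioengine=libaio --direct=1 --size=1G \
    --runtime=60 --time_based --filename=/tmp/fio-test.dat
```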
@peronik349 1 year ago
Good video as usual. Quick question about "data_pool" and "metadata_pool" in erasure-coding mode: is there a rule or good practice for choosing a good ratio between the capacity stored in the data pool and the size of the SSD(s) that will host the metadata pool?
@apalrdsadventures 1 year ago
In my experience, the metadata pool is extremely small. It's less than a MB in my testing, although I don't have a ton of data on my test setup (the total capacity is only ~300GB).
@drdaddydanger1546 1 year ago
Thank you. I wonder how I can back up a Ceph pool in a good way.
@thestreamreader 1 year ago
I have 3 nodes. Right now I have 1 built out with 32 GB of RAM, a 2 TB HDD, and a 512 GB drive with Proxmox installed. How should I build the first node knowing that later I want to add 2 more when I get the money? What filesystem should I put on the 2 TB HDD where the VMs/containers will be stored? I don't want to have to redo this when I get the other 2 ready, so I want it to be ready for adding them. I am going to build the other 2 out at the same time.
@apalrdsadventures 1 year ago
In general I do ZFS unless I have a good reason not to. This goes for both the OS drive and other drives. You can create separate ZFS pools now and plan on replacing one of them with Ceph later. If you want to add Ceph later you'd need to reformat the 2TB HDD, but it's possible to migrate in place (the OS disk remains ZFS, 2 new OSDs get added on the 2 new nodes, create the pool in a degraded state, move all VMs/CTs to the new pool, then reformat the 2TB HDD, add it as the third OSD, and let it replicate). Depending on whether you care about the cluster-wide data guarantees you get from Ceph, you could just keep ZFS on the 2TB drive; it's not a bad solution either.
@shephusted2714 1 year ago
Great stuff - you need another NAS spinning-rust node to make the cluster fully redundant - do some home serving with WireGuard and an nginx reverse proxy on a cheap VPS - or two - think about OPNsense HA as independent hardware nodes - it can do load balancing to the nodes and it may get better perf than you think - this is a great setup for a small/medium biz since it is so easily and cheaply scalable for capacity #live migration
@apalrdsadventures 1 year ago
Answering all your comments in one place:
- I'd need 3 nodes minimum with the 2+1 code. 1 drive each on 4 hosts would be better than 4 drives all on one host.
- NVMe disks need relatively new hardware, something I don't have. But I'm working on a video with new hardware, just not Ceph yet.
- I'll get to CephFS and distributed filesystems; it's a big topic and it deserves at least a few videos of its own.
@pavlovsky0 1 year ago
I run TrueNAS Core on my Proxmox cluster as a VM. I pass through some spinning disks to the VM and use those for ZFS. Could a CephFS implementation replace this? The TrueNAS is rock solid, btw; I've very happily used that ZFS pool for years, and when I've had disk issues I replaced disks easily, resilvering and recovering with no data loss. I also use Proxmox Backup Server with an LTO4 tape drive and back up my Proxmox cluster with it. It gets a bit hokey when I back up my TrueNAS data this way.
@pavlovsky0 1 year ago
Also, thank you for your content. You and ElectronicsWizardry do some excellent Proxmox videos.
@apalrdsadventures 1 year ago
CephFS *could* replace it, or you could mount a dataset from the Proxmox host ZFS pool into an LXC container (and manage sharing there), or install Samba and manage sharing on the host. Ceph (including CephFS) doesn't really scale down to single servers well though, although it's possible. ZFS is certainly much more optimized for in-memory transactions.
@nevermetme 1 year ago
Have you checked out the `pveceph pool create` command with the --erasure-coding parameter? That makes it quite a bit easier to use EC pools in Proxmox. :)
@apalrdsadventures 1 year ago
It's about the same amount of work from the command line vs. the GUI, since you need to do a ceph osd crush rule create for the metadata pool anyway.
@nevermetme 1 year ago
True - for a cluster that requires quite specialized settings, like a failure domain on OSDs instead of hosts, or device-class specifics, the metadata pool needs a different CRUSH rule if it should match the EC pool in those respects :)
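A rough sketch of the CLI flow being discussed (pool and rule names here are made up, and the -metadata suffix pveceph generates may differ between versions, so verify against your setup):

```shell
# Create a 2+1 erasure-coded pool; pveceph also creates the
# replicated metadata pool and can register both in storage.cfg:
pveceph pool create ecpool --erasure-coding k=2,m=1 --add_storages

# If the metadata pool should match the EC pool's placement,
# give it its own CRUSH rule (here: OSD-level failure domain):
ceph osd crush rule create-replicated ecpool-meta-rule default osd
ceph osd pool set ecpool-metadata crush_rule ecpool-meta-rule
```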
@seccentral 1 year ago
Loved this, keep 'em coming :D
@copper4eva 1 year ago
This may be a weird/bad idea, but is there any way to make an SSD pool that puts its replicated data on HDDs? The idea is to get 100% efficient use of your SSDs and have all the replication overhead on the HDDs. Obviously, if you were ever to lose an SSD, you would be stuck at HDD speeds for the time being. You could definitely do this manually by simply making two pools and using some program to copy all data from the SSD pool to the HDD pool, with the SSD pool set to have no replication. I was just curious if Ceph has anything built in to do something like this.
@apalrdsadventures 1 year ago
No, all copies share the same storage rule (which is either any or a specific device class). But you wouldn't want to do this either: since Ceph acknowledges writes when 2 out of 3 replicas have completed, the fastest two of the three drives end up determining the write speed of the whole operation. The write will stall until at least 2 replicas are in place, and the pool will be degraded if the third replica is substantially behind the other two.
@copper4eva 1 year ago
@apalrdsadventures I just found out there are features that mix SSDs and HDDs in the same pool. They are called hybrid pools, and there is also primary affinity. With 3 replicas, for example, you can have it write the 1st copy to an SSD OSD and the other 2 copies to HDDs. As you point out, this will not speed up writes, as you still have to write to the slow HDDs, but it will speed up reads substantially. I only just now read about this, but I would be curious if this is viable with erasure coding too, rather than just replication.
@apalrdsadventures 1 year ago
You're right, it's possible by writing completely custom CRUSH rules - docs.ceph.com/en/latest/rados/operations/crush-map/#custom-crush-rules
Depending on your workload it might be easier/better to use tiering to keep recent data on SSDs. That's my plan, with CephFS and video editing data.
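For the curious, a hybrid rule of that kind boils down to roughly this shape in the decompiled CRUSH map (rule name and id are illustrative; see the page linked above for the authoritative syntax):

```
rule ssd_primary {
    id 5
    type replicated
    # first replica from SSD-class OSDs
    step take default class ssd
    step chooseleaf firstn 1 type host
    step emit
    # remaining replicas from HDD-class OSDs
    step take default class hdd
    step chooseleaf firstn -1 type host
    step emit
}
```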
@ap5672 1 year ago
Thank you for the guide. I created an SSD cache pool for my HDD pool with your guide. Which pool do I add in Proxmox as the storage for my VMs - the HDD pool?
@apalrdsadventures 1 year ago
If you're using metadata + data pools, you'll add the SSD pool in the GUI and add the HDD pool in the config file ("data-pool" is the HDD).
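In /etc/pve/storage.cfg, that combination would look something like this (storage and pool names are placeholders):

```
rbd: vm-storage
	content images,rootdir
	krbd 0
	pool ssd-pool
	data-pool hdd-pool
```

Here "pool" holds the replicated metadata on the SSDs, while "data-pool" points RBD object data at the HDD pool.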
@ap5672 1 year ago
@apalrdsadventures I am not using separate metadata + data pools; the metadata is in the OSD itself. In this case, which pool do I add to the PVE storage - the SSD cache or the HDD pool? Thank you for the excellent Proxmox guides. You have a new sub.
@apalrdsadventures 1 year ago
So you're using separate DB/WAL disks? In that case, there is only one pool, and you add that to the GUI. If you're using the 'tiered storage' commands from the Ceph documentation, be aware that they specifically recommend not using it for RBD. But you'd select the HDD pool in that case.
@ap5672 1 year ago
@apalrdsadventures I am using the tiered storage from the Ceph documentation. Thanks for the warning. I wonder why it isn't recommended.
@apalrdsadventures 1 year ago
Filesystems tend to do a lot of disk IO across the whole device in the background (i.e. ZFS scrubs, other filesystem cleanup) that would pull blocks into cache unnecessarily, and copy-on-write filesystems also tend to pull the entire disk into cache as blocks are allocated and discarded when a file is modified, whereas with CephFS and RGW, Ceph is directly aware of the IO operation and knows which files should be kept in cache.
@martinhryniewiecki 1 year ago
Fantastic explanation
@apalrdsadventures 1 year ago
Glad you like it!
@allards 1 year ago
I noticed that with:

rbd: ceph-ssd
	content rootdir,images
	krbd 0
	pool ceph-nvme
	data-pool ceph-ssd

both storages display the same disk images. Must be the metadata, but not so elegant if it's shared with a regular pool. Ended up creating a metadata pool:

rbd: ceph-ssd
	content images,rootdir
	krbd 0
	pool ceph-ssd_metadata
	data-pool ceph-ssd

rbd: ceph-ssd_metadata
	content images,rootdir
	krbd 0
	pool ceph-ssd_metadata

Now it makes sense again!
@apalrdsadventures 1 year ago
Yes, the metadata is shared, so Proxmox sees it as two sets of metadata. I don't really mind, but your solution does fix that.
@yevhenbryukhov 4 months ago
White theme is a big misconfiguration 😜😄