The limitations of today’s SSDs | The Backend Engineering Show 

Hussein Nasser
431K subscribers
13K views

Published: Sep 30, 2024

Comments: 29
@hnasr · 2 years ago
Head to database.husseinnasser.com for a discount coupon to my Introduction to Database Engineering course. Link redirects to udemy with coupon applied.
@ch94086 · 2 years ago
Hey Hussein, thanks for your great videos. I think the problem is the inverse: why are DBs bad for SSDs? The problems you describe arise because SSDs try to emulate a rotating disk or drum, and writing emulated "sectors" means reading and rewriting elsewhere. In an SSD, random reads are cheap and appends are cheap, but random rewrites are expensive. Some DBs use a read-only bulk store layer with a RAM-cached overlay layer for updates to improve performance, and those actually fit SSDs better. Rather than rethink SSDs, we should rethink DBs. How about a video on that?

The same problem exists in microcomputers, like a temperature logger. Mostly people use SPIFFS, a flash file-disk emulation, to rewrite sectors in order to append a new record of temperature, pressure, and humidity readings. But that's really bad for the flash ROM: it rewrites and relocates blocks instead of appending within an erased block. A time-based DB could be written to use batch-erased superblocks and append (individual bytes can be written in flash over erased blocks of all 1s). Worse, microcomputers often don't bother writing to on-chip flash at all and instead send a message to a cloud server. When the cloud server goes out of business, the product gets bricked. Writing to self-contained flash rather than a cloud server means no external dependency.
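A minimal sketch of the append-only logging idea described in the comment above, assuming a hypothetical flash HAL: flash_erase_block and flash_program are placeholder names, not a real SPIFFS or vendor API, and the block size and record layout are illustrative assumptions. Because erased flash reads as all 1s, each record can be programmed into the remaining 0xFF space of a block, and the erase cost is paid once per full block rather than once per record.

#include <stdint.h>

/* Hypothetical flash HAL for the target MCU; placeholders, not a real SPIFFS or vendor API. */
extern void flash_erase_block(uint32_t block_addr);                       /* sets the block to all 0xFF */
extern void flash_program(uint32_t addr, const void *data, uint32_t len); /* programs erased bytes */

#define BLOCK_SIZE 4096u   /* assumed erase-block size */

struct temp_record {
    uint32_t timestamp;     /* seconds since boot or epoch */
    int16_t  temp_c_x100;   /* temperature in 0.01 C units */
    uint16_t rh_x100;       /* relative humidity in 0.01 % units */
};

static uint32_t write_offset = 0;  /* next free byte inside the current block */

/* Append one record; the erase is paid once per full block, not per record. */
void log_append(uint32_t block_addr, const struct temp_record *rec)
{
    if (write_offset + sizeof(*rec) > BLOCK_SIZE) {
        flash_erase_block(block_addr);  /* real code would rotate to the next superblock instead */
        write_offset = 0;
    }
    flash_program(block_addr + write_offset, rec, sizeof(*rec));
    write_offset += sizeof(*rec);
}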
@hjups · 2 years ago
I completely agree. However, that falls into his last point about complexity. Rewriting the way databases behave is probably too cumbersome in most cases, and it's easier to complain about the underlying storage technology instead.
@Stoney_Eagle · 2 years ago
I really enjoy these talks. When are you gonna do tech talks or keynotes on stage? 😊😉
@sanjeev2003 · 2 years ago
Yeah, cosmic rays. 🤣
@user-qz5is9dc2n · 2 years ago
It's like a teacher explaining things, but I actually understand it.
@danielwei9933 · 2 years ago
Awesome talk. I was also curious whether you'll be doing a discussion on the recent Atlassian outage?
@hjups · 2 years ago
There are a lot of complaints in this video that, while valid, are a consequence of the memory technology (and surrounding architecture) and not necessarily some inherent flaw. Every technology has tradeoffs, and these are the tradeoffs of using NAND flash memory. That said, improving SSD performance for database applications is an active area of research (a focus of myself and others in my research group). However, a large portion of that research is not focused on improving write performance, because that's a technology limitation... If you want to do a lot of database insertion, the consensus is to do that on a copy in RAM and then dump to the disk (SSD) once the write sequence is complete.

For clarification on some technical points: your comment about blocks isn't quite right. NAND flash read granularity is on the order of pages, which are typically 16KB (even though the OS presents them as 4KB - that means to read a 4KB page, you actually need to read 16KB). They are, however, "byte addressable"; it depends on how you want to define that... you could 1) read the 16KB page into DRAM on the controller chip and then return the byte that you want, or 2) use the byte-read instruction that many NAND flash parts have, which copies the 16KB page into the on-die buffer and can return a specific byte from it (this saves bandwidth on the NVDDR bus). There are also some NAND manufacturers (Micron comes to mind) that can read a 4KB sub-page, which is faster than the full 16KB page. If you write a single byte though (or a 4KB page), you can only do that on an erased block. The erase process takes significantly longer and can only be done at the block level.

When you say cost for DRAM, do you mean parts cost? The DRAM in these devices is only ~1GB or so of DDR3, which you can get for $1 in quantity. Also, it's far too small and slow to be used by any other application outside of the SSD. Furthermore... it's sort of important to have as a page cache, so that you can write pages back and modify them at the 4KB level (rather than the 16KB level) without killing your NAND blocks with writes. That said, what you are describing with DRAM-less, FTL-less SSDs, those do exist. They typically run the FTL at the driver level in the host and use host DRAM for the LBA table. So if you wanted to, you could just not run the provided driver and write your own which does all of this low-level management for you.

Btw, namespacing is a concept of NVMe, not of SSDs. NVMe is the communication layer that sits on top of the FTL layer, but it is not required. The drives with the FTL on the host do not use NVMe, and therefore would need to implement namespacing on the host rather than in the SSD itself.

For overprovisioning, that is something you paid for, though. That's why SSDs come in odd sizes compared to HDDs. Essentially, an SSD is almost always a power-of-two multiple in size, but the FTL hides some of those pages/blocks to be used by garbage collection and as a reserve for wear-out. So you can look at it two ways... you are either paying for a 512GB drive but only getting 510GB, or you are paying for a 510GB drive and actually getting 512GB with 2GB inaccessible. Because of marketing, though, I believe the latter is more applicable. Again, if you want to access all of the flash, then you can get one of those host-controlled drives. Note that this overprovisioning area is controlled via the LBA, so you don't write to it, erase, move, and erase again. You just map the old invalid blocks as the new overprovisioning area and only do the write once. Although you write at the page level, not the block level, so what you mean by multiple writes is ambiguous here. The write amplification comes from the fact that you need to move the good pages from the erase unit into the overprovisioned area, which involves a bunch of reads as well (a smart controller would allocate them horizontally across the flash, but that comes with its own challenges).

Another problem with SSD speed and its fluctuations, though, is the ECC stage. The higher the density per cell, and the more the cell is used, the more important the ECC becomes, and the more complex it needs to be. Typically, ECC cannot read/write at the bandwidth of the NVDDR link, which slows down the operation. On top of that, the NAND needs to go through a read-retry process if the ECC returns an error. This is essentially a form of read amplification.

In terms of the complexity of exposing lower levels to the client... that's the consequence of chasing the law of diminishing returns. If you want it to perform better, you need to do something more complex. If you want to do something more complex, you can either fix it to a general solution that works well for problem A and not for problem B, or you can expose a level that lets the application for problem A optimize for that problem and the application for problem B optimize for its own.
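To make the garbage-collection write amplification described above concrete, here is a back-of-the-envelope model in C. The geometry (256 pages per erase block) and the fraction of still-valid pages in the victim block are illustrative assumptions, not the specs of any particular drive.

#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions, not real drive parameters. */
    const double pages_per_block = 256.0;  /* pages in one erase block */
    const double valid_fraction  = 0.75;   /* still-valid pages in the victim block */

    /* To reclaim a block, the FTL first relocates its valid pages, then erases it.
       Only the previously invalid pages become free space for new host data. */
    double relocated  = pages_per_block * valid_fraction;       /* 192 pages copied by the FTL */
    double host_pages = pages_per_block - relocated;            /* 64 pages of new host data */
    double write_amp  = (host_pages + relocated) / host_pages;  /* total NAND writes / host writes */

    printf("pages relocated per GC cycle: %.0f\n", relocated);
    printf("write amplification:          %.2f\n", write_amp);  /* prints 4.00 with these numbers */
    return 0;
}

With these assumed numbers, every 64 pages of host data cost 256 page writes inside the drive, which is the kind of amplification the overprovisioned area and smarter victim selection try to keep down.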
@hjups · 2 years ago
To add a tl;dr: most of the complaints posed are due to the way NAND flash works, not SSDs. So if you want something better, you should complain to Micron, Samsung, SK, etc., and get them to change their flash designs (never going to happen, though). SSD architects have to work with what they are given. Note that it is possible to do single-byte R/W on flash, but the die size would be significantly larger, which means it would either be more expensive or the density would be much lower. I would not be surprised if, to facilitate flash R/W at the byte level, you ended up paying for 128MB of flash at the cost of a 16GB flash die.
@hnasr · 2 years ago
Thank you for the in-depth, detailed comments. Really valuable input. And apologies if my video came off as complaining; I appreciate the work all engineers across the entire stack are doing to improve the tech. The goal of the video was to state the current limitations of this technology so we watch out for them when building apps, and in retrospect I think I missed the mark on that one.
@hjups · 2 years ago
@@hnasr You may have missed the mark a little there, as you said. Though you mentioned a follow-up video, which would be a good place to clarify (that would also fit well with the topic, so win-win). I apologize if my comment made it seem like I thought you were being disrespectful toward other engineers in the stack. I was trying to show that many of the limitations are caused by the underlying flash technology and not by architectural decisions (so the points you brought up were either directly related to the flash technology itself, or restrictions imposed by the SSD architecture in an attempt to make the flash restrictions less punitive). But those restrictions are the tradeoff for density (the ideal memory technology for database applications simply doesn't exist, and may be physically impossible, so you have to choose the least-worst option).

I do agree that you should take the limitations of the entire stack into account, though (and having a database app manage the flash blocks directly may be the way to go). That's why much of the bleeding-edge database research has been on ways to minimize writes (via write coalescing) in an attempt to remove pressure from the flash (they do struggle with the OS and FTL, since you only get so much control at the file system level). Sharding also does that to some degree, by allowing the data to be distributed instead of requiring coalescing. Lots of interesting areas of improvement, but they all add complexity.
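As a rough illustration of the write-coalescing idea mentioned above (a sketch, not how any particular database engine implements it), small updates can be accumulated in RAM and flushed as one large, page-aligned append, so the SSD sees fewer, bigger, sequential writes. The buffer size below is an illustrative assumption.

#include <stdio.h>
#include <string.h>

#define FLUSH_THRESHOLD (16 * 1024)   /* assumed flash page size; illustrative only */

static char   buf[FLUSH_THRESHOLD];
static size_t buf_used = 0;

/* Flush whatever has accumulated as one large append. */
static void coalesced_flush(FILE *f)
{
    if (buf_used > 0) {
        fwrite(buf, 1, buf_used, f);
        fflush(f);
        buf_used = 0;
    }
}

/* Buffer small writes in RAM; the underlying file only sees large sequential appends. */
void coalesced_write(FILE *f, const void *data, size_t len)
{
    if (len >= sizeof(buf)) {             /* oversized writes bypass the buffer */
        coalesced_flush(f);
        fwrite(data, 1, len, f);
        return;
    }
    if (buf_used + len > sizeof(buf))     /* not enough room: flush first */
        coalesced_flush(f);
    memcpy(buf + buf_used, data, len);
    buf_used += len;
}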
@ayex86 · 2 years ago
SSD tech is bad, but the worst thing about it is the durability of the cells. And it's still not solved.
@mindBytesAiShorts · 2 years ago
You're pointing out all of these "problems" like they are some conspiracy drive makers pull to screw you over, rather than the engineering that is done to make my drive run. Yes, if you are designing SSDs and storage media you would have to consider this, but as a developer it's all abstracted away and doesn't affect us meaningfully. Furthermore, not all workloads are tiny, fast writes. If I write a large file, let's say 40MB, then I'm writing 10,000 FULL BLOCKS at once. What about this use case? Are you recommending a mass-market, cheap, and permanent storage medium specifically for single-byte operations? Have you designed such a thing?
@Unicorn_Bank · 1 year ago
I'm just going to say it: Intel's decision to axe Optane (chalcogenide phase-change) "storage class memory" was one of the stupidest business decisions they ever made. I really can't wait for the patents to expire. I hope someone else, like IBM, picks up the torch 🔥. I had the 14G for running "experiments" when they first came out. It was like $35. I ran DB tests on it and was mind-blown (compared to my $400 SSDs of that era). Sure, it didn't offer anything near the GB/$ of Samsung's multi-layer NAND SSDs! But you know what? I don't care! The advantages are worth it, and when you're in it for the long haul - in 3 or 4 generations it won't matter. This is the same fundamental chemistry that was used in the DVD-RW technology of the '90s, just miniaturized and put in a "cell". These cells had been "designed" for > 1,000,000 read-erase cycles, and in testing the very 1st gen got over 250,000, which is still an achievement. They allow byte addressing, something NAND flash will never do, and they are the perfect storage technology for high-throughput databases!

We've been waiting forever for real revolutionary memory technologies to hit the market. I'm tired of all the empty promises from researchers, "startups" & universities! (Cough, Nantoro, Everspin, ...) Just look at these pages: en.wikipedia.org/wiki/Universal_memory en.wikipedia.org/wiki/Magnetoresistive_RAM#See_also When the hell are we ever going to see a real disruptive storage technology reach mass market & volume production status? 2123? 2230? I've been tracking this market, and every single company & "breakthrough" that has emerged in it, for over 12 years. And still nothing. Nada... You can't go on Newegg and buy any of this. It's for "special customers only" like defense & aerospace (where radiation hardening matters). And yet this one was real; I had it in my hands. Why cancel it? Beats me.

They're hemorrhaging money (and talent) right now, and it may take them 5-10 years to course correct. They really should've kept it alive, or provided fab space for Micron to continue making them and doing R&D towards perfecting the technology, and should've looked at it as a long-term strategic investment in conquering the enterprise data space. But "stockholders" with their short-term quarterly-earnings goals win. Speaking of stockholders: HFT and banks are about the perfect use for it. Because of this nonsense, and how unpredictable SSD lifetimes & bit rot are, I've had to add tape backups as a final "peace of mind" tier in my storage strategy. Wish we could trust solid state, or go "all in" on flash, but I can't. Great & informative video, as always, Hussein! 😃
@Cosines · 2 years ago
Ramadan Kareem, Hussein!
@hnasr · 2 years ago
Wishing you well every year ❤️
@jupyter5k647 · 2 years ago
The amount of general back-end concepts I have learned from your videos in such a short amount of time is just mind-blowing to me, so ~ thank you. I was going to explain some things that I thought you didn't deliver correctly, but there's already a comment here explaining them in depth, much better than I could :^], so I'm interested in that follow-up video about this.
@hnasr · 2 years ago
Thanks and appreciate all the feedback and corrections!
@jansiranis4480 · 2 years ago
Enjoyed your video on SSDs. Can you make another video on NAT and related concepts?
@hnasr · 2 years ago
Thanks! I made a video on NAT here: Network Address Translation - NAT Explained ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-RG97rvw1eUo.html
@heitormbonfim · 2 years ago
Great video! I don't know if I should recommend a video from another channel, but there's a video on Brunch Education with an animation of how SSDs work, showing the processes inside and everything else. It's like a complement to this video.
@edgararakelyan9326 · 2 years ago
From my previous (limited) understanding of SSDs, reads can be done at the page level, and writes are also done at the page level; in the case that the page being written to is already dirty, the entire block is erased and rewritten elsewhere. Thus sequential write IO in SSDs is faster than random write IO, because it limits the impact of this issue (which is called write amplification and does not exist for reads). Just wanted to make some corrections, since you said this impacts all reads/writes and that the LBA maps to blocks, which I don't think is correct; it maps to a set of pages. You discuss this halfway through; I spoke too soon.
@belos.2020 · 2 years ago
Nice rrrrollin' R, my man 😁 and good explanation 💪😎
@acagastya · 2 years ago
Thanks for this video!
@tremolony4924 · 2 years ago
first
@shiewhun1772 · 2 years ago
Love this, but more web3 content, please.
@apidas · 2 years ago
Of course, if you're trying to optimize for a SQL workload, nothing is gonna be enough; you have to move on to other technologies. Sticking with SQL for a customer-facing application is crazy.
@randomguy7403 · 2 years ago
Can you list a few of those other technologies to be used instead of SQL? Are they better for saving simple customer data on a server?
@33KK · 2 years ago
SQL is just a query language; all the SQL DBs have different storage engines. And NoSQL DBs have to store data and indexes too, yk?