
QCon London '23 - A New Era for Database Design with TigerBeetle 

TigerBeetle
Pivotal moments in database durability, I/O, systems programming languages and testing techniques, and how they influenced our design decisions for TigerBeetle.
This is the pre-recording of our talk, which was later given live at QCon London '23, in the Innovations in Data Engineering track hosted by Sid Anand... and a stone's throw from Westminster Abbey!
The live QCon talks were in-person only this year, and were not recorded, but thankfully, we stuck to the script, so that what you see here is what you would have seen, if you had been there.
The cover art is a special illustration by Joy Machs. We wanted to bring together the London skyline to showcase old and new design in the form of the historic London Bridge alongside the futuristic Shard. If you happen to be in London, take a walk across the Millennium Footbridge, and see if you can see Joy's vision as you look across the water.
Thanks to Sid Anand for the special invitation. It was an honor to present TigerBeetle alongside DynamoDB, StarTree, Gunnar Morling, and our friends in the Animal Database Alliance: Redpanda and DuckDB!
qconlondon.com/presentation/m...

Science

Published: 10 Jul 2024

Comments: 23
@themichaelw 2 months ago
18:00 that's the same Andres Freund who discovered the XZ backdoor. Neat.
@luqmansen 4 months ago
Great talk, thanks for sharing! ❤
@dannykopping 1 year ago
Super talk! Very information dense and clear, with a strong narrative. Also so great to hear a South African accent in highly technical content on the big stage 😊
@jorandirkgreef 1 year ago
Thank you Danny!
@asssheeesh2 1 year ago
That was really great!
@youtux2 9 months ago
Absolutely amazing.
@timibolu 5 months ago
Amazing. Really amazing
@LewisCampbellTech 8 months ago
Every month or so I'll watch this talk while cooking. The first time I didn't really understand part I. This time around I got most of it. Crazy how the Linux kernel prioritised users yanking USB sticks out over database durability.
@jorandirkgreef 8 months ago
Thanks Lewis, special to hear this, and I hope that the “durability” of the flavors in your cooking is all the better! ;)
@rabingaire 11 months ago
Wow what an amazing talk
@YuruCampSupermacy 1 year ago
absolutely loved the talk.
@uncleyour3994 1 year ago
Really good stuff
@jonathanmarler5808 11 months ago
Great talk. I'm at 15:20 and have to comment. Even if you crash and restart to handle fsync failure, that still doesn't address the problem, because another process could have called fsync and marked the pages as clean, meaning the database process would never see an fsync failure.
@jorandirkgreef 11 months ago
Hey Jonathan, thanks! Agreed, for sure. I left that out to save time, and because it's nuanced (a few kernel patches ameliorate this). Ultimately, Direct I/O is the blanket fix for all of these issues with buffered I/O. Awesome to see you here and glad you enjoyed the talk! Milan '24?! :)
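For readers curious what that looks like in practice, here is a minimal C sketch (not TigerBeetle's actual code, which is written in Zig) of opening a file with O_DIRECT | O_DSYNC on Linux, so that a failed write surfaces to the process issuing it rather than being absorbed by the page cache. The 4096-byte alignment is an assumption standing in for the device's real logical block size.

```c
// Minimal sketch: bypass the kernel page cache with O_DIRECT so that write
// errors surface directly to the caller, rather than being absorbed (and
// possibly cleared) by page-cache writeback. Not TigerBeetle's actual code.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    // O_DSYNC: each write reaches stable storage before returning.
    // O_DIRECT: buffers must be aligned to the logical block size (assumed 4096 here).
    int fd = open("journal.dat", O_CREAT | O_RDWR | O_DIRECT | O_DSYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const size_t block = 4096; // assumption: device logical block size
    void *buf = NULL;
    if (posix_memalign(&buf, block, block) != 0) { close(fd); return 1; }
    memset(buf, 0, block);
    memcpy(buf, "hello, direct i/o", 18);

    // With O_DIRECT, a failed write is reported here, to this process,
    // instead of showing up later (or never) as a writeback error.
    ssize_t n = pwrite(fd, buf, block, 0);
    if (n != (ssize_t)block) { perror("pwrite"); free(buf); close(fd); return 1; }

    free(buf);
    close(fd);
    return 0;
}
```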
@Peter-bg1ku 1 month ago
I never thought Redis AOF was this simple.
@tenthlegionstudios1343 1 year ago
Very good talk. It takes a lot of the previous deep dives I have watched and puts them all together. I am curious about the points made about the advantages of the single-threaded execution model, especially in the context of using the VOPR / having deterministic behaviors. When you look at something like Redpanda with a thread-per-core architecture, using Seastar and a bunch of advanced Linux features, are design choices like this making it harder to test and to have some sense of deterministic bug reproduction? This is not a tradeoff I have ever considered before, and for a DB that is most concerned about strict serializability and no data loss, this must have greatly changed the design. I am curious about the potential speed-ups at the cost of losing the deterministic nature of TigerBeetle, not to mention the cognitive load of a more complex code base.
@tigerbeetledb 1 year ago
Thanks, great to hear that! We are huge fans of Redpanda, and indeed RP and TB share a similar philosophy (direct async I/O, single binary, and of course, single thread per core). In fact, we did an interview on these things with Alex Gallego, CEO of Redpanda, last year: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-jC_803mW448.html

With care, it's possible to design a system from the outset for concurrency, that can then run either single-threaded or in parallel, or with varying degrees of parallelism determined by the operator at runtime, with the same deterministic result, even across the cluster as a whole. Dominik Tornow has a great post comparing concurrency with parallelism and determinism (the latter two are orthogonal, which is what makes this possible): dominik-tornow.medium.com/a-tale-of-two-spectrums-df6035f4f0e1

For example, within TigerBeetle's own LSM-Forest storage engine, we are planning to have parts of the compaction process eventually run across threads, but with deterministic effects on the storage data file. For now, we're focusing on single-core performance, to see how far we can push that before we introduce CPU thread pools (separated by ring buffers) for things like sorting or cryptography.

The motivation for this is Frank McSherry's paper, “Scalability! But at what COST?”, which is a great read! www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
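As an aside, to make the point that determinism and parallelism are orthogonal a little more concrete, here is a toy C sketch (my own illustration, not TigerBeetle code): work is split into fixed partitions, and the partial results are combined in a fixed order, so the final value is identical whether the partitions run on one thread or several.

```c
// Toy illustration: a computation whose result is identical regardless of how
// many threads execute it, because the partitioning and the combine order are
// fixed up front. Not TigerBeetle code. Build with: cc demo.c -pthread
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define PARTITIONS 8
#define ITEMS_PER_PARTITION 1000000

typedef struct { int partition; uint64_t result; } Task;

// An order-sensitive mix, so any change in combine order would show up.
static uint64_t mix(uint64_t h, uint64_t x) {
    h ^= x + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
    return h;
}

static void *run_partition(void *arg) {
    Task *t = arg;
    uint64_t h = 0;
    uint64_t base = (uint64_t)t->partition * ITEMS_PER_PARTITION;
    for (uint64_t i = 0; i < ITEMS_PER_PARTITION; i++) h = mix(h, base + i);
    t->result = h;
    return NULL;
}

static uint64_t run(int threads) {
    Task tasks[PARTITIONS];
    for (int p = 0; p < PARTITIONS; p++) tasks[p].partition = p;

    // Execute partitions in batches of `threads`; with threads == 1 this is
    // plain sequential execution. The partition boundaries never change.
    for (int start = 0; start < PARTITIONS; start += threads) {
        pthread_t tids[PARTITIONS];
        int batch = (start + threads <= PARTITIONS) ? threads : PARTITIONS - start;
        for (int i = 0; i < batch; i++)
            pthread_create(&tids[i], NULL, run_partition, &tasks[start + i]);
        for (int i = 0; i < batch; i++)
            pthread_join(tids[i], NULL);
    }

    // Combine in a fixed order: the result does not depend on which thread
    // finished first, only on the partition order chosen up front.
    uint64_t h = 0;
    for (int p = 0; p < PARTITIONS; p++) h = mix(h, tasks[p].result);
    return h;
}

int main(void) {
    printf("1 thread:  %016llx\n", (unsigned long long)run(1));
    printf("4 threads: %016llx\n", (unsigned long long)run(4));
    return 0;
}
```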
@tenthlegionstudios1343 1 year ago
@tigerbeetledb These articles are gold. Thanks for the in-depth reply! Can't wait to see where this all goes.
@dwylhq874 24 days ago
This is one of the few channels I have *notifications on* for. 🔔 TigerBeetle is _sick_ !! Your whole team is _awesome_ !! 😍 So stoked to _finally_ be using this in a real project! 🎉 Keep up the great work. 🥷
@jorandirkgreef 23 days ago
Thank you so much! You're sicker still! :) And we're also so stoked to hear that!
@pervognsen_bitwise 10 months ago
Thanks for the talk, Joran. Genuine question since I don't know and it's very surprising to me: Is there really no way to get buffered write syscall backpressure on Linux? The Windows NT kernel has a notoriously outdated (and slow) IO subsystem which has always provided mandatory disk write backpressure by tracking the number of outstanding dirty pages. So if disk block writes cannot keep up with the rate of dirty pages, the dirty page counter will reach the cap and will start applying backpressure by blocking write syscalls (and also blocking page-faulting writes to file-mapped pages, though the two cases differ in the implementation details). I'm assuming Linux's choice to not have backpressure must be based on fundamental differences in design philosophy, closely related to the situation with memory overcommit? Certainly the NT design here hurts bursty write throughput in cases where you want to write an amount that is large enough that it exceeds the dirty page counter limit but not so large that you're worried about building up a long-term disk backlog (a manually invoked batch-mode program like a linker would fall in this category). Or you're worried about accumulating more than a desired amount of queueing-induced latency that would kill the throughput of fsync-dependent applications; considering this point makes me think that you wouldn't want to rely on any fixed dirty page backpressure policy anyway, since you want to control the max queuing-induced latency.
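One technique sometimes used on Linux to approximate this backpressure from the application side is sync_file_range: periodically initiate writeback of the range just written and wait on the range before it, so the dirty backlog stays bounded and the application chooses its own latency/throughput trade-off. A rough sketch of that idea (my own illustration, assuming Linux and glibc; the 32 MiB window and file name are arbitrary):

```c
// Sketch: application-imposed backpressure on buffered writes on Linux.
// After every `window` bytes, initiate writeback for the newly written range
// and block until the *previous* range has been written back, so dirty data
// in the page cache stays bounded at roughly one window.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("bulk.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    static char chunk[1 << 20];              // 1 MiB buffered writes
    memset(chunk, 'x', sizeof(chunk));
    const off_t window = 32 << 20;           // flush/wait every 32 MiB (arbitrary)
    off_t written = 0, flushed = 0;

    for (int i = 0; i < 1024; i++) {         // write 1 GiB in total
        if (write(fd, chunk, sizeof(chunk)) != (ssize_t)sizeof(chunk)) {
            perror("write"); return 1;
        }
        written += sizeof(chunk);

        if (written - flushed >= window) {
            // Start writeback of the range we just filled (non-blocking)...
            sync_file_range(fd, flushed, window, SYNC_FILE_RANGE_WRITE);
            // ...and block until the range before it has been written back:
            // this blocking is where the backpressure comes from.
            if (flushed >= window) {
                sync_file_range(fd, flushed - window, window,
                                SYNC_FILE_RANGE_WAIT_BEFORE |
                                SYNC_FILE_RANGE_WRITE |
                                SYNC_FILE_RANGE_WAIT_AFTER);
            }
            flushed = written;
        }
    }
    fsync(fd);   // sync_file_range gives no durability guarantee; fsync still needed
    close(fd);
    return 0;
}
```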
@stevesteve8098 1 month ago
Yes... I remember back in the '90s Oracle tried this system of "direct I/O", blew lots of trumpets, and announced it HAS to be better & faster because... 'insert reasoning here'. Well, you know what, it was complete bullshit, because they made lots of assumptions and did very little real testing. Even if you THINK you are writing directly to the "disk", YOU ARE NOT: you are writing to a BLACK BOX. You have absolutely NO idea of HOW or WHAT is implemented in that black box. There may be a thousand buffer levels in that box, with all sorts of swings and roundabouts. So, no, you are NOT directly writing to disk. Such a basic lack of insight and depth of thought is a worry with this sort of "data" evangelism...
@jorandirkgreef 23 days ago
Thanks Steve, I think we're actually in agreement here. That's why TigerBeetle was designed with an explicit storage fault model, where we expect literally nothing of the "disk" (whether physical or virtualized). For example, we fully expect that I/O may be sent to the wrong sector or corrupted, and we test this to extreme lengths with the storage fault injection that we do. Again, we fully expect to be running in virtualized environments or across the network, or on firmware that doesn't fsync etc., and pretty much all of TigerBeetle was designed with this in mind.

However, at the same time, to be clear, this talk is not so much about the "disk" as hardware, as about the kernel page cache as software, and what the kernel page cache does in response to I/O errors (whether from a real disk or a virtual disk). We're really trying to shine a spotlight on the terrific work coming out of UW-Madison in this regard: www.usenix.org/system/files/atc20-rebello.pdf

To summarize their findings then: while Direct I/O is (completely) not sufficient, it is still necessary. It's just one of many little things you need to get right, if you have an explicit storage fault model, and if you want to preserve as much durability as you can. At least
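To give a flavour of the storage fault injection mentioned above, here is a toy C sketch (the general idea only, not TigerBeetle's actual simulator): an in-memory "disk" whose write path occasionally corrupts a sector or misdirects it to the wrong sector, driven by a fixed seed so that any failure it provokes is reproducible.

```c
// Toy sketch of storage fault injection: an in-memory "disk" whose write path
// occasionally corrupts a sector or misdirects it, driven by a seeded PRNG so
// every fault sequence is reproducible. Not TigerBeetle's simulator.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SECTOR_SIZE 512
#define SECTOR_COUNT 1024

typedef struct {
    uint8_t data[SECTOR_COUNT][SECTOR_SIZE];
    unsigned seed; // deterministic faults: same seed => same fault sequence
} SimDisk;

static unsigned next_random(SimDisk *d) {
    d->seed = d->seed * 1103515245u + 12345u;
    return d->seed >> 16;
}

// Write one sector, injecting a fault roughly 1% of the time.
static void sim_write(SimDisk *d, uint32_t sector, const uint8_t *buf) {
    uint32_t target = sector;
    uint8_t copy[SECTOR_SIZE];
    memcpy(copy, buf, SECTOR_SIZE);

    unsigned roll = next_random(d) % 200;
    if (roll == 0) {
        // Misdirected write: data lands on the wrong sector.
        target = next_random(d) % SECTOR_COUNT;
        printf("fault: write to sector %u misdirected to %u\n",
               (unsigned)sector, (unsigned)target);
    } else if (roll == 1) {
        // Corruption: flip one bit before it hits the "platter".
        copy[next_random(d) % SECTOR_SIZE] ^= (uint8_t)(1u << (next_random(d) % 8));
        printf("fault: write to sector %u corrupted\n", (unsigned)sector);
    }
    memcpy(d->data[target], copy, SECTOR_SIZE);
}

int main(void) {
    static SimDisk disk;
    disk.seed = 42; // fixed seed: the run (and any bug it finds) is replayable

    uint8_t sector[SECTOR_SIZE];
    for (uint32_t i = 0; i < SECTOR_COUNT; i++) {
        memset(sector, (int)(i & 0xff), sizeof(sector));
        sim_write(&disk, i, sector);
    }
    // A real test harness would now read everything back through checksums
    // and verify that the database above detects and repairs the damage.
    return 0;
}
```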