No video :(

Write-Ahead Logging Challenges and Append only B-Trees

The Geek Narrator

Подписаться 57 тыс.

Просмотров 1,6 тыс.

50% 1

Видео Поделиться Скачать Добавить в

Опубликовано:

25 авг 2024

Ссылка:

Скачать:

Готовим ссылку...

Добавить в:

Мой плейлист

Посмотреть позже

Комментарии : 22

@ajinkyajadhav5058 6 месяцев назад

You explained this very simply thank you and keep it up!!

@AtharvaRao0104 9 месяцев назад

I think including the in-memory page cache in the diagram of checkpoint process will help. This checkpoint process controls both WAL and page cache to ensure that all changes are persisted. 1. When DB crashes, the in-memory page cache is lost. All in-memory updates which have not yet been flushed to Data files on disk are lost. To make sure we can recover these changes , we write the operations (read/write) to WAL append only log file on disk while doing the in-memory page cache updates.. 2. We can replay the WAL contents to restore the in-memory page cache and hence recover the state after crash. 3. We should not be replaying all the operations in WAL as it can take a long time when DB restarts. When the page cache is flushed to disk, it makes the changes durable. Now we can discard the WAL log records associated with the operations that were applied to those cached pages which were flushed.

@vinothraoful 10 месяцев назад

Thanks for this awesome series. it would be more helpful if we can get an example like on what scenarios we use what kind of DBs? When to choose WAL based db? when to choose Append only B-Tree? Though we understand the internals I feel use cases give more clarity.

@TheGeekNarrator 10 месяцев назад

Yes - it is in the plan. Upcoming in this series: - B-Trees - B*, B-link, B-w and so on. - LSM trees - Improving LSM - Columnar vs Row orientation - Isolation levels - Locking and MVCC - Indexing - Consensus models And much more.

@varunvats32 9 месяцев назад

Correct me if I'm wrong, 1. When a update comes, the relevant pages would be loaded up in memory. 2. Then B-Tree processing => copy of pages, creating new links, or a new leaf node would all be done in memory only. 3. And then the fsync happens, which will be sequential I/O on the disk.

@TheGeekNarrator 9 месяцев назад

Yes, that sounds right except that the copy is started from the leaf node which is the actual data. The goal is to keep as many hot pages as possible in memory, but implementation varies in different databases. IIRC CouchDB, MongoDB also uses copy-on-write with Garbage collection (but please verify). But there are some databases that do not use any cache, for example LMDB doesn’t used any cache but instead uses memory mapping which uses OS cache only. I will talk about memory mapping and how it is used in databases in a separate video.

@balajisrinivasan6618 9 месяцев назад

Does append-only B-Tree follow sequential IO on disk? In the multi version B-tree mentioned, looks like we are updating a page that might in turn get into a random IO on the disk block right? My understanding is : Database deals with pages, OS deals with files and Disk deals with blocks. Our goal is to have a sequential IO on disk (the end layer). WAL achieves it through immutability until the disk level. There is no updates on any level, only inserts and deletes (during compaction). Curious to understand how it works for append-only B-Tree. Thanks !

@TheGeekNarrator 9 месяцев назад

Thats a great question. It is still copy-on-write with LMDB. To avoid random IO it keeps these two versions old and new close to each other in a file. Page 0 is the old version Page 1 is the new version. And it keeps this ping pong going with each update. These two pages (0 and 1) are in the same file and are used for the two roots that it uses for the two versions of data. Since the location is predetermined Random IO isn’t involved and it is pretty quick in moving between versions. (There will of course be some head movement but nothing close to random IO). Imagine something like this: +-----------+-----------+-----------+-----------+-----+ | Page 0 | Page 1 | Page 2 | Page 3 | ... | +-----------+-----------+-----------+-----------+-----+ Page 0 and Page 1 are used for the two roots. Predetermined IO vs random IO. Does that make sense? Sorry I should have added more details, but I only realise it now 😀. I will plan another video on this topic.

@balajisrinivasan6618 9 месяцев назад

@@TheGeekNarrator Got it. Does that mean the head needle on the magnetic disk has a capability to move backward as well? thanks for the quick reply, much appreciated !

@TheGeekNarrator 9 месяцев назад

The head can move across the radius and IIUC these pages would be stored such that the head can access them across the radius. I will try to find if there is more information here, but good questions. keep them coming.

@harshitpandey6664 8 месяцев назад

Nice insights. Any durable database which is based on append only B-Trees (and not WAL) ?

@TheGeekNarrator 8 месяцев назад

Thanks- IIUC CouchDb also uses Append only Btrees.

@nareshgb1 8 месяцев назад

@@TheGeekNarrator does it not use a WAL - I assume CouchDB is the same as couchbase? BTW - another advantage of WAL is that the IO for checkpointing can be done in background mode - so commits can be "instantaneous," as only a single write to the WAL is needed. Thanks again for the great video.

@ksramchandani 4 месяца назад

What is the corresponding in-memory update for the log that is written? Log is also written to disk or is it in memory and then flushed to disk? What happens when log flushing to disk fails?

@TheGeekNarrator 4 месяца назад

A Log is flushed to disk, yes. Databases like Postgres also support fsync mode which waits for the log be flushed to the disk, which adds a little performance impact (but nothing comparable to updating data pages and indexes directly). Not using fsync mode OR in simple terms, not waiting for the OS to flush the log to the disk may result in lost committed transactions. Typically a database dealing with concurrent transactions can write log entries to disk with a single fsync which is very efficient. So choose your tradeoff, but the thumb rule I use is, enable fsync by default to ensure maximum reliability and tweak if performance ever become a problem. Does that answer your question?

@ksramchandani 4 месяца назад

@@TheGeekNarrator Yes. Thank you! Wasn't aware of the DB option to wait for log flushing to succeed.

@nareshgb1 8 месяцев назад

WOW.....this is an absolutely brilliant explanation for the append only B-tree. I was starting to get into couchbase (which I believe uses this) and my first search on youtube showed this video. Though I had read about append only b-tree before, this time I think I am not going to forget :) Awesome, keep it up Sir.

@TheGeekNarrator 8 месяцев назад

Thanks a lot 🙏🏻😀

@KumarSahil78 8 месяцев назад

The background music is little loud, please try to lower it or maybe remove all together.

@TheGeekNarrator 8 месяцев назад

Thanks. I appreciate your feedback

@rajeshkumar1149 10 месяцев назад

Great video. Is the alternative to WAL described in th video a LSM tree used by rocksdb? If not how is it different from LSM?

@TheGeekNarrator 10 месяцев назад

No, its not an LSM Tree. Its an Append only B-Tree. LSM (as the name suggests) is a log based structure, but what I described in the video is done on a B-Tree. In LSM (which I am going to cover in depth in this series), you simply append in the log, but in the append only B-Tree, you start from the leaf node and you copy the exact structure of the B-Tree that exists on disk. So they are very different. The only thing common between them is append-only property. This is copy-on-write mechanism on B-Trees, where the entire idea is to avoid random writes and have sequential buffered IO which is far more efficient. I hope it helps.