
Understanding Aggregate Functions Performance | The Backend Engineering Show 

Hussein Nasser
Subscribers: 415K
Views: 19K
The performance of aggregate functions like COUNT, MAX, MIN, and AVG really depends on how you tuned your database for that kind of workload. Let us discuss this.
0:00 Intro
1:22 SELECT COUNT(*)
4:30 SELECT AVG(A)
5:15 SELECT MAX(A)
8:00 Best case scenario
11:30 Clustering
14:00 Clustering Sequential Writes
17:19 Clustering Random Writes
20:30 Summary
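The behaviors the chapters walk through (COUNT visiting every row or index entry, AVG reading every value of the column, MAX being able to seek one edge of an index) can be sketched with Python's built-in sqlite3. The table and data below are made up purely for illustration:

```python
import sqlite3

# Toy in-memory table to illustrate the aggregates from the video.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER)")
con.executemany("INSERT INTO t (a) VALUES (?)", [(i,) for i in range(1, 101)])
con.execute("CREATE INDEX t_a ON t (a)")  # lets MAX/MIN read one edge of the index

count = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]  # must visit every row/index entry
avg   = con.execute("SELECT AVG(a) FROM t").fetchone()[0]    # must read every value of a
mx    = con.execute("SELECT MAX(a) FROM t").fetchone()[0]    # can seek straight to the index tail

print(count, avg, mx)  # 100 50.5 100
```

The point the video makes is that all three return one number, but the work behind each one differs wildly depending on what indexes exist.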
Fundamentals of Database Engineering udemy course (link redirects to udemy with coupon)
database.husseinnasser.com
Introduction to NGINX (link redirects to udemy with coupon)
nginx.husseinnasser.com
Python on the Backend (link redirects to udemy with coupon)
nginx.husseinnasser.com
Become a Member on RU-vid
/ @hnasr
🔥 Members Only Content
• Members-only videos
🏭 Backend Engineering Videos in Order
backend.husseinnasser.com
💾 Database Engineering Videos
• Database Engineering
🎙️Listen to the Backend Engineering Podcast
husseinnasser.com/podcast
Gears and tools used on the Channel (affiliates)
🖼️ Slides and Thumbnail Design
Canva
partner.canva.com/c/2766475/6...
Stay Awesome,
Hussein

Science

Published: 16 Jul 2024

Comments: 34
@hnasr · 2 years ago
Head to database.husseinnasser.com for a discount coupon to my Introduction to Database Engineering course. Link redirects to udemy with coupon applied.
@joshua.hintze · 2 years ago
Already in the class. Definitely recommend to all!
@YourMakingMeNervous · 2 years ago
I am a data engineer and your channel has been invaluable for my learning lately
@SinskariBoi5guys66 · 2 years ago
I've been a database specialist for years and I did not know these intimate details. Great Thanks. You are simply awesome.
@jomarmontuya1516 · 2 years ago
I'm a front-end developer but I'm starting to like backend because of you!
@md.hussainulislamsajib7189 · 2 years ago
I always enjoy the way you dissect things and get to the bare metal to fully understand and explain it! You're amazing!! 👏
@hnasr · 2 years ago
Thank you!
@briankimutai515 · 2 years ago
Thanks a lot @Hussein for this informative video. I do not have a PhD but I'm going to attempt to solve the problems of clustered sequential and random writes. Opinions from the community will be appreciated :)

1) Clustered Sequential Writes: How about we use an in-memory queue-like data structure for fresh new writes? For every new write, we allocate some memory with the new data and atomically append it at the tail end of the queue. Since the writes are strictly ordered by the time they arrive at the database, we can use a lock-free paradigm like compare-and-swap to atomically append data to the queue. This will eliminate the lock overhead that mutexes introduce. When it comes to actually adding the writes to the b-tree index structure, with this ordering there's a big potential of the tree rebuilding for every new write, and this can cause a big performance overhead. As the writes are batched up in the queue, we can add them to the tree at a configurable threshold so that we rebuild the tree much less often than the rate at which the writes arrive. Finally, at some point we need to flush our tree to disk. Since the right sub-tree of our b-tree is the one that's mostly growing, most of the changes are happening sequentially in the same region on disk. We can use fadvise when flushing our changes to disk, just to get some extra performance. This approach has trade-offs in that we'll have superior sequential write performance but poor random reads, since we need to search both the b-tree (O(logN)) and the in-memory queue (O(N) for a size equal to the configurable threshold).

2) Clustered Random Writes: @Hussein, can you elaborate here how exactly flushing the entire contents of memory will cause write amplification, as in, for each record received, how many writes on disk will need to take place for it to be persisted? For this problem, there is one extreme end where every write results in a disk write, which can be very expensive for writes. At the other end, we store up data in memory till we run out of memory, then we flush everything to disk and rebuild the tree afresh if needed. This will also be a problem, since during flushes writes will need to be paused. I can anecdotally say that in most storage engines, at least RocksDB, writes can be batched at a configurable threshold. This threshold will lie somewhere in the middle, where the user weighs the amount of memory they have against the frequency of flushes they can tolerate, so that writes can be flushed to disk in a manner that won't make performance suffer. I welcome all opinions from all you database enthusiasts ;)
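The batching idea sketched in this comment can be illustrated with a toy structure. The class name, flush threshold, and the sorted list standing in for the on-disk tree are all hypothetical, chosen only to show the buffer-then-flush trade-off:

```python
from collections import deque

class BatchedWriter:
    """Toy sketch: buffer writes in memory, flush to the (simulated)
    on-disk tree only once a configurable threshold is reached."""
    def __init__(self, flush_threshold=4):
        self.queue = deque()        # in-memory buffer of fresh writes
        self.tree = []              # stand-in for the on-disk b-tree (kept sorted)
        self.flush_threshold = flush_threshold
        self.flushes = 0            # each flush models one batched disk write

    def write(self, key):
        self.queue.append(key)
        if len(self.queue) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # one batched "disk" write instead of one write per record
        self.tree = sorted(self.tree + list(self.queue))
        self.queue.clear()
        self.flushes += 1

    def read(self, key):
        # reads must check both structures: the trade-off the comment notes
        return key in self.queue or key in self.tree

w = BatchedWriter(flush_threshold=4)
for k in range(10):
    w.write(k)
print(w.flushes, len(w.queue))  # 2 flushes so far, 2 keys still buffered
```

With a threshold of 4, ten writes cost only two "disk" flushes, but a read of a recent key has to scan the in-memory buffer before touching the tree.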
@overflow3789 · 2 years ago
The way you explain and dissect things is really amazing. I haven't found a video as detailed as yours; it's really useful for understanding such beautiful things.
@squirrel1620 · 2 years ago
For the estimates, when guessing within 5-10 rows of error on a table with 100M rows, it would be easier to get the size of a row in bytes and do some math with the total bytes taken by the table
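The estimate suggested here is just arithmetic; with made-up numbers (a 12 GB table and a ~120-byte average row, both hypothetical):

```python
# Hypothetical figures for illustration: total table size and average row width.
table_bytes = 12 * 1024**3   # 12 GB on disk
avg_row_bytes = 120          # rough average row width

estimated_rows = table_bytes // avg_row_bytes
print(f"~{estimated_rows:,} rows")  # ~107,374,182 rows
```

As the reply below this comment notes, dead rows from deletes can inflate the on-disk size, so the estimate skews high unless the space has been reclaimed.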
@hnasr · 2 years ago
Good idea. One thing that might throw the estimate off as well is if many rows were deleted from the table: databases usually don't return the space back to the OS, which will cause the table to over-report its size compared to how many rows it has.
@mrvaibh0 · 2 years ago
I literally made notes, such core details 🙌
@semperfiArs · 2 years ago
I would really love to know where you get ideas for these videos and how much time you put into reading and researching, because this is really awesome and it sounds like you put in a lot of time. Thanks a lot for the awesome content you post here.
@vendetta3953 · 2 years ago
always a pleasure to watch your informative videos :)
@tafarakasparovmhangami2356 · 2 years ago
Quite Insightful😎
@subhamprasad1373 · 2 years ago
Thanks for your effort, I really learned a lot.
@young-ceo · 2 years ago
Thank you for the video! amazing advice
@HarshKapadia · 2 years ago
Loved this video!
@pieter5466 · 1 year ago
5:50 At first I interpreted "smallest" as "of least disk size in bytes", but it more likely means their sequential ordering with respect to the overall range.
@marslogics · 2 years ago
Thank you...
@mudassirhasan9022 · 2 years ago
Thanks
@michaelangelovideos · 2 years ago
Love your content man. Can you talk on the Okta breach??
@nguyentanphat7754 · 2 years ago
Thank you for your awesome discussion in this video, @Hussein! However, I have a question about the part where you talk about the cons of UUID random writes exhausting the RAM: what is the solution for this specific case? Should we just use a clustered index with the sequential writes you mentioned, or is there a particular solution for this? 🙏🙏
@hnasr · 2 years ago
I think we should just understand that this might happen. Sometimes there is no escape and you need UUID to be your primary key, so in that case you would just increase the buffer pool size. Otherwise it's more optimal to use a sequential, lightweight primary key to avoid this. Using a non-clustered table and trying your workload against it is also another idea. Percona wrote a nice blog about this www.percona.com/blog/2019/11/22/uuids-are-popular-but-bad-for-performance-lets-discuss/
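Why random UUID keys dirty far more of a clustered index than sequential keys can be shown with a toy sorted structure. The sorted list here is only a stand-in for index pages, not any engine's real layout:

```python
import bisect
import uuid

def insert_positions(keys):
    """Return where each key lands in a kept-sorted list (toy index pages)."""
    index, positions = [], []
    for k in keys:
        pos = bisect.bisect_left(index, k)
        positions.append(pos)
        index.insert(pos, k)
    return positions

seq_positions = insert_positions(range(1000))
uuid_positions = insert_positions(str(uuid.uuid4()) for _ in range(1000))

# Sequential keys always land at the tail: one "hot" rightmost page stays cached.
print(all(p == i for i, p in enumerate(seq_positions)))   # True
# Random UUIDs scatter across the whole key range, dirtying pages everywhere,
# which is why the buffer pool fills up with partially modified pages.
print(all(p == i for i, p in enumerate(uuid_positions)))  # almost surely False
```

With sequential keys every insert touches the same region; with random UUIDs each insert can land anywhere, so the working set becomes the whole index.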
@dhawaljoshi · 2 years ago
Would it be the best of both? Like write to a buffer and then one process continuously writes to the actual disk, to prevent the buffer checkpoint situation?
@letsmusic3341 · 2 years ago
Will SELECT COUNT(A) FROM T; use an index or will it go for a table scan?
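Whether an index is used depends on the engine and its planner, but note the semantics: COUNT(A) counts only non-NULL values of A, while COUNT(*) counts all rows, which is part of why an index on A can in principle satisfy the former. A quick check with Python's sqlite3 (toy table, made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,), (None,)])

total    = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]  # counts every row
non_null = con.execute("SELECT COUNT(a) FROM t").fetchone()[0]  # skips NULL values of a
print(total, non_null)  # 4 2
```

The two counts differ whenever A is nullable, so the two queries are not interchangeable regardless of which access path the planner picks.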
@henrikschmidt8178 · 2 years ago
Did you make a video about varchar and how it is stored in different systems? I have always wondered how a varchar(255) is handled on disk and how changes are handled. Are all data saved as maximum-size, fixed-size records in total, are records stored in variable length, or are all varchars stored in a separate space altogether?
@hnasr · 2 years ago
I believe varchar works similarly to TOAST for Postgres text. The actual text is stored in another table and the row has a pointer to it. This keeps the row from overflowing the page, since most databases have a fixed page size. I talked about it here ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-_UUFMAZswhU.html and also in my course.
@mohamadabdelrida2866 · 2 years ago
If count is slow then how does auto-increment impact performance?
@shiewhun1772 · 2 years ago
Also, Hussein, I know that it will be time-consuming and will probably not be of much interest to you, but do more videos on web3 and Ethereum and the underlying tech. I've seen the videos on IPFS and Bitcoin mining you did, but it'd be nice if you did more. Also, how is it possible that in ML they're able to process so much data in their Pandas tables, when on the other hand doing the same sort of manipulations on similar data in a regular database will be slower?
@causality5698 · 2 years ago
3:00 Atomic Integers?
@officialpowergamerz · 2 years ago
Hi ☺️ Can you please teach us how Facebook, Twitter, and WhatsApp hide their API secrets in their Android and Apple apps from reverse engineering?
@sarathkumar3749 · 2 years ago
Thanks