Nice video! You could also have talked about the impact on storage cost and the need to configure your bloom filters well, because they can take a lot of space and cost you a lot (for tables with TB/PB of data, the bloom filters might weigh a lot).
Interesting that they decided to use a compressed Parquet file to hold the index bitmaps. I wouldn't have thought that would have been the most performant way of doing it. A set of flat binary files (or using some sort of bespoke structure) where you can just twiddle map bits on disk might have been more efficient, but that might not be "the Spark way" and maybe they needed to support things like Delta time-travel, etc. using the same mechanisms.
Great video! Could I ask why z-ordering by the id column helps? My understanding is that z-ordering colocates related information together to improve reads. But how can this help when each id is unique? (Every row has a different id value.) How can you co-locate data where essentially all values of a column are different?
Interesting, but the 50% performance gain is not that great compared to the SQL Server B-tree index equivalent (not to mention the columnstore one). Still, a move in the right direction.
Parquet has an RLE/Dictionary encoding feature. Why would that not work out of the box for pruning the bytes? Databricks creates a bloom filter file for each data file. Bloom filters alongside parquet files look redundant unless they are kept centrally, like the min, max and z-order stats are kept in the delta logs - that way, at least the number of file-open operations would be reduced. Also, if it is such efficient file pruning, why does Databricks not apply it to all columns by default, just as parquet figures out which encodings to use for each column? It could figure out what size of bloom filter each file needs, rather than leaving it to the user.
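On the "what size should the filter be" point, the standard bloom filter formulas make the sizing trade-off concrete. A rough sketch in plain Python (the function name and the 1M-rows / 10% false-positive numbers are just illustrative assumptions, not Databricks defaults):

```python
import math

def bloom_filter_size(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Back-of-envelope bloom filter sizing using the textbook formulas:
    bits m = -n * ln(p) / (ln 2)^2, hash count k = (m / n) * ln 2."""
    m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_items) * math.log(2)))
    return m, k

# e.g. 1M distinct values per file at a 10% false-positive rate
bits, hashes = bloom_filter_size(1_000_000, 0.1)
print(bits // 8, "bytes,", hashes, "hash functions")
```

So a filter over a million distinct values at 10% false positives needs on the order of hundreds of KB per file - which is exactly why leaving the size/fpp choice to the user (or automating it) matters at PB scale.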
Hrmmm, it would certainly help if you were filtering to a selective column value in your match criteria. Not sure how much benefit you'd see where you're joining to a bloom-indexed column in the match, though... Easy to try! I'll have a play at some point soon! Simon
First of all, I want to thank you for such wonderful videos. I recently became a fan and thank you for them. Secondly, I thought the DataSkipping option of Delta gathered stats and worked as an index on the first 32 columns (by default), or on the n columns mentioned in the DDL. If so, then what was the need for the Bloom filter?
Hey - thanks for watching! The stats gathered as part of data skipping are the minimum & maximum values for your 32 (or n) columns. But if you haven't specifically ordered (or Z-ordered) your data by a specific column, it's unlikely that file skipping will be very efficient. Imagine you have four files containing ID values from 2-11, 1-10, 2-9 and 1-13. If you query for ID = 5, it's going to read all four files using data skipping. The bloom filter index gives an ADDITIONAL chance of data skipping, by saying that the records with ID 5 are definitely not in two of the files, and might be in the remaining two. In a nutshell, it's a way of improving data skipping on columns that haven't been optimised for min-max data skipping :) Simon
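The four-file example above can be sketched in a few lines of plain Python - the file names and stats dictionary are made up for illustration, but the pruning rule is the same one min/max data skipping applies:

```python
# Hypothetical file-level min/max stats for the four files in the example.
file_stats = {
    "part-0": (2, 11),
    "part-1": (1, 10),
    "part-2": (2, 9),
    "part-3": (1, 13),
}

def files_to_read(stats, target):
    """Min/max data skipping: a file can only be skipped when the
    target value falls outside its [min, max] range."""
    return [f for f, (lo, hi) in stats.items() if lo <= target <= hi]

# ID = 5 falls inside every file's range, so min/max stats skip nothing;
# this is the case where a bloom filter can still rule files out.
print(files_to_read(file_stats, 5))
```

Run with `target=5` and all four files survive pruning; run with `target=12` and min/max stats alone narrow it to one file - which shows why the technique only shines on well-clustered columns.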
Yeah, traditional bloom filters are for point lookups, not range filters, so I'd assume you won't see the same performance benefits - but I've not performance-tested bloom range lookups within Delta / Databricks!
Nope, as far as I know it hasn't been added to the OSS Delta implementation at this point. I guess we may see it in a future Delta release - depends on whether they want to open source it!
Cool trick to force an optimize by changing the max file size! But it's not working for me :( I'm playing around with different options for the output file size, and doing so within a single partition (3 GB) rather than the whole table (6 TB). I changed the spark config with SET spark.databricks.delta.optimize.maxFileSize = 33554432; but re-running the OPTIMIZE with ZORDER did not re-organize the files. Is it because I specified a single partition to optimize within?
Hey! Hard to say! I've not tried forcing the re-optimize within a partition - there might be something at the transaction log level that enforces it across the whole table, though I'd be surprised! If I get a chance over the next week I'll take a look - lemme know if you get anywhere! Simon
There's a performance hit on write, as it needs to gather the bloom filter stats for each new file that gets created. Similar to any index really - write gets slower but reads get faster (if they hit the index!)
I should do... but that's basically a video on virtual networking in Azure, which will take quite some time to set up (unless your on-prem servers are all on open ports??). I'll add it to the list when I next get a chance!