Nice video! You could also have talked about the impact on storage cost and the need to configure your bloom filters well, because they can take a lot of space and cost you a lot (for tables with TB/PB of data, the bloom filters might weigh a lot).
Interesting that they decided to use a compressed Parquet file to hold the index bitmaps. I wouldn't have thought that would have been the most performant way of doing it. A set of flat binary files (or using some sort of bespoke structure) where you can just twiddle map bits on disk might have been more efficient, but that might not be "the Spark way" and maybe they needed to support things like Delta time-travel, etc. using the same mechanisms.
Great video! Could I ask why z-ordering by the id column helps? My understanding is that z-ordering colocates related information together to improve reads. But how can this help when each id is unique? (Every row has a different id value.) How can you co-locate data where essentially all values of a column are different?
Interesting, but the 50% performance gain is not that great compared to the SQL Server B-tree index equivalent (not to mention the columnstore one). Still, a move in the right direction.
Parquet has an RLE/Dictionary encoding feature. Why would that not work out of the box for pruning the bytes? Databricks creates a bloom filter file for each data file. Bloom filters alongside parquet files look redundant unless they are kept centrally, like the min, max and z-order stats are kept in the delta logs - that way, at least the number of file-open operations would be reduced. Also, if it is such efficient file pruning, why does Databricks not apply it to all columns by default, just as parquet figures out which encodings to use for each column? It could figure out what size of bloom filter each file needs, rather than leaving it to the user.
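On the "what size should the filter be" point, the standard bloom filter formulas make the sizing trade-off concrete. A rough sketch in plain Python (the function name and the 1M-rows / 10% false-positive numbers are just illustrative assumptions, not Databricks defaults):

```python
import math

def bloom_filter_size(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Back-of-envelope bloom filter sizing using the textbook formulas:
    bits m = -n * ln(p) / (ln 2)^2, hash count k = (m / n) * ln 2."""
    m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_items) * math.log(2)))
    return m, k

# e.g. 1M distinct values per file at a 10% false-positive rate
bits, hashes = bloom_filter_size(1_000_000, 0.1)
print(bits // 8, "bytes,", hashes, "hash functions")
```

So a filter over a million distinct values at 10% false positives needs on the order of hundreds of KB per file - which is exactly why leaving the size/fpp choice to the user (or automating it) matters at PB scale.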
Hrmmm, it would certainly help if you were filtering to a selective column value in your match criteria. Not sure how much benefit you'd see where you're joining to a bloom-indexed column in the match, though... Easy to try! I'll have a play at some point soon! Simon
First of all, I want to thank you for such wonderful videos. I recently became a fan and thank you for them. Secondly, I thought the DataSkipping option of Delta gathered stats and worked as an index on the first 32 columns (by default), or on the n columns mentioned in the DDL. If so, then what was the need for the Bloom filter?
Hey - thanks for watching! The stats gathered as part of data skipping are the minimum & maximum values for your 32 (or n) columns. But if you haven't specifically ordered (or Z-ordered) your data by a specific column, it's unlikely that file skipping will be very efficient. Imagine you have four files containing ID values from 2-11, 1-10, 2-9 and 1-13. If you query for ID = 5, it's going to read all four files using data skipping. The bloom filter index gives an ADDITIONAL chance of data skipping, by saying that the records with ID 5 are definitely not in two of the files, and might be in the remaining two. In a nutshell, it's a way of improving data skipping on columns that haven't been optimised for min-max data skipping :) Simon
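The four-file example above can be sketched in a few lines of plain Python - the file names and stats dictionary are made up for illustration, but the pruning rule is the same one min/max data skipping applies:

```python
# Hypothetical file-level min/max stats for the four files in the example.
file_stats = {
    "part-0": (2, 11),
    "part-1": (1, 10),
    "part-2": (2, 9),
    "part-3": (1, 13),
}

def files_to_read(stats, target):
    """Min/max data skipping: a file can only be skipped when the
    target value falls outside its [min, max] range."""
    return [f for f, (lo, hi) in stats.items() if lo <= target <= hi]

# ID = 5 falls inside every file's range, so min/max stats skip nothing;
# this is the case where a bloom filter can still rule files out.
print(files_to_read(file_stats, 5))
```

Run with `target=5` and all four files survive pruning; run with `target=12` and min/max stats alone narrow it to one file - which shows why the technique only shines on well-clustered columns.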
Yeah, traditional bloom filters are for point lookups, not range filters, so I'd assume you won't see the same performance benefits - but I've not performance-tested bloom range lookups within Delta / Databricks!
Nope, as far as I know it hasn't been added to the OSS Delta implementation at this point. I guess we may see it in a future Delta release - depends on whether they want to open source it!
Cool trick to force an optimize by changing the max file size! But it's not working for me :( I'm playing around with different options for the output file size, and doing so within a single partition (3 GB) rather than the whole table (6 TB). I changed the spark config with SET spark.databricks.delta.optimize.maxFileSize = 33554432; but re-running the OPTIMIZE with ZORDER did not re-organize the files. Is it because I specified a single partition to optimize within?
Hey! Hard to say! I've not tried forcing the re-optimize within a partition - there might be something at the transaction log level that enforces it across the whole table, though I'd be surprised! If I get a chance over the next week I'll take a look - lemme know if you get anywhere! Simon
There's a performance hit on write, as it needs to gather the bloom filter stats for each new file that gets created. Similar to any index really - write gets slower but reads get faster (if they hit the index!)
I should do... but that's basically a video on virtual networking in Azure, which will take quite some time to set up (unless your on-prem servers are all on open ports??). I'll add it to the list when I next get a chance!