
Apache Iceberg Tutorial: Learn the Problem & Solution Behind Iceberg's Origin Story 

Dremio

This Apache Iceberg 101 Course will cover the origin story of the data lakehouse table format. We will explore what Hive is, and why it was beneficial. We will also discuss the problems associated with Hive, as well as the solution that Apache Iceberg provides.
Hive is an open source data warehouse system developed by Facebook in 2008. It was created to provide an efficient way to manage large amounts of data stored in Hadoop clusters. The system was designed to make it easier for developers to write queries and manipulate data without having to learn a new programming language. Hive allowed users to work with structured data using SQL-like commands, making it easier for them to access their data from different sources.
The benefits of Hive were numerous, including its scalability, cost-effectiveness, and ease of use. However, there were some drawbacks associated with it as well. One issue was that Hive wasn’t able to handle unstructured or semi-structured data very well. This caused problems when users needed to access their data from different sources or when they wanted to use more complex queries than what Hive could provide. Another issue was that Hive had a steep learning curve, making it difficult for new users or those who weren’t familiar with SQL-like commands to get started quickly and easily.
To address these issues, Apache Iceberg was created at Netflix in 2017 as an open source successor to the Hive table format and was donated to the Apache Software Foundation in 2018. Apache Iceberg is an open table format for the data lakehouse, not an engine: it specifies how a table's data and metadata files are organized so that many engines (Spark, Flink, Presto, Dremio, and others) can read and write the same tables safely, bringing warehouse-like guarantees such as ACID transactions and schema evolution directly to data lake storage. This makes it easier for developers and analysts alike to get up and running quickly without having to learn a new query language or understand the physical layout of the data.
Apache Iceberg also provides performance benefits over traditional Data Warehouses due to its ability to store and track large amounts of data efficiently. By pairing its file-level metadata with columnar storage formats such as Parquet, Apache Iceberg can dramatically reduce the amount of I/O needed on large datasets while still providing fast query performance over those same datasets.
If you’re looking for more great content on the topic of Data Lakehouses like Apache Iceberg, be sure to check out dremio.com/subsurface where you can find tutorials, articles, webinars, and more!
Connect with us!
Twitter: bit.ly/30pcpE1
LinkedIn: bit.ly/2PoqsDq
Facebook: bit.ly/2BV881V
Community Forum: bit.ly/2ELXT0W
Github: bit.ly/3go4dcM
Blog: bit.ly/2DgyR9B
Questions?: bit.ly/30oi8tX
Website: bit.ly/2XmtEnN

Science

Published: 27 Jun 2024

Comments: 26
@zmihayl · 2 days ago
Your voice is like an angel to fall asleep😇
@sukulmahadik0303 · 3 months ago
[Notes Part-2] Netflix tried to address the problems with the Hive table format by creating a new table format. The goals Netflix wanted to achieve with this format were:
- Table correctness/consistency.
- Faster query planning and inexpensive execution: eliminate file listing and enable better file pruning.
- Allow users to not worry about the physical layout of the data. Users should be able to write queries naturally and still get the benefit of partitioning.
- Table evolution: allow adding and removing table columns and changing the partition scheme.
- Accomplish all of this at scale.
With this we landed on the concept of Iceberg. "With Iceberg, a table is a canonical list of files." Instead of saying a table is all the files in a directory, the files in an Iceberg table could theoretically be anywhere: in different folders, without a nicely organized directory structure. This is because Iceberg maintains the list of files in its metadata, and the engine leverages that metadata, which lets it get to the files faster.
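The "canonical list of files" idea in these notes can be sketched in a few lines of plain Python. This is an illustration only: the paths, field names, and the `plan_scan` helper are invented for the sketch, and real Iceberg keeps this list in metadata and manifest files on storage, not in in-memory objects.

```python
# A snapshot is just an explicit list of data files with per-file column stats.
# Note the files do not have to share a directory - they can live anywhere.
snapshot = [
    {"path": "s3://bucket/a/f1.parquet", "min_ts": 100, "max_ts": 199},
    {"path": "s3://bucket/b/f2.parquet", "min_ts": 200, "max_ts": 299},
    {"path": "s3://bucket/old/f3.parquet", "min_ts": 300, "max_ts": 399},
]

def plan_scan(snapshot, lo, hi):
    """Keep only files whose stats overlap the predicate `ts BETWEEN lo AND hi`.

    No directory listing and no file opening is needed - planning reads only
    the metadata that the table format already maintains.
    """
    return [f["path"] for f in snapshot if f["max_ts"] >= lo and f["min_ts"] <= hi]

# f1's range (100-199) cannot match, so it is pruned without being opened.
print(plan_scan(snapshot, 250, 320))
```

The point of the sketch is the contrast with Hive: membership in the table is answered from the file list itself, so pruning happens before any data file is touched.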
@sukulmahadik0303 · 3 months ago
[Notes Part-1] What is a table format?
- A way to organize a dataset's files to present them as a single table. Our data lake may contain millions of data files that represent multiple datasets, and analytical tools need to know which files are part of which dataset. This is the problem that table formats solve.
- It's a way to answer the question "what data is in this table?"
Hive table format:
- The old table format. Created at Facebook.
- Abstracted the complexity of MapReduce: internally it converted SQL statements into MapReduce jobs.
- Used "directories" on storage (HDFS/object storage) to indicate which files are part of which table. The approach is simple: a particular directory is a table, all the contents of that directory are part of the table, and any subfolders are treated as partitions of the table.
- Pros:
  - Works with almost every engine, since it became the de facto standard and stayed that way for a long time.
  - Partitioning allows more efficient query access patterns than full table scans.
  - File-format agnostic.
  - Can atomically update a whole partition.
  - A single, central answer to "what data is in this table?" for the whole ecosystem.
- Cons:
  - Small updates are very inefficient. Updating a few records wasn't easy: a partition was the smallest unit of transaction, so to update a few records the entire partition (or entire table) had to be rewritten and swapped.
  - No way to atomically update multiple partitions. After one partition is changed and before work starts on the next, someone could query the table and read inconsistent data.
  - Multiple jobs modifying the same dataset cannot do so safely, i.e. safe concurrent writes are not possible.
  - The directory listing needed for a large table takes a long time. Engines have to list all the directory contents before returning results, just to understand which files make up the table/partition.
  - Filtering out files (pruning) required the engines to read those files. Opening and closing every file to check whether it holds the data of interest is time-consuming.
  - Users of the datasets have to know the physical layout of the table. Without it, we might write inefficient queries that lead to costly full table scans.
  - Hive table statistics are often stale, and data engineers have to keep running ANALYZE queries to keep the stats up to date. Collecting stats needs extra compute, and the stats are only as fresh as the last analyze job.
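The directory-as-table convention and its update cost can be illustrated with a toy in-memory stand-in for HDFS/object storage. All paths and function names here are invented for the sketch; it is not how Hive is implemented, only a picture of its contract.

```python
# Hive's convention: a table is a directory, subdirectories are partitions.
# We model storage as {directory: [files]}.
storage = {
    "/warehouse/sales/day=2024-01-01": ["f1.parquet", "f2.parquet"],
    "/warehouse/sales/day=2024-01-02": ["f3.parquet"],
}

def files_in_table(storage, table_root):
    """Answer "what data is in this table?" by listing - O(dirs + files) every time."""
    files = []
    for directory, names in storage.items():
        if directory.startswith(table_root):
            files.extend(f"{directory}/{n}" for n in names)
    return files

def update_one_row(storage, partition):
    """The partition is the smallest unit of change: even a one-row update
    means rewriting the whole partition directory and swapping it in."""
    storage[partition] = ["rewritten-all.parquet"]

update_one_row(storage, "/warehouse/sales/day=2024-01-02")
print(files_in_table(storage, "/warehouse/sales"))
```

Both cons from the notes fall out of the model: every query pays for a full listing, and f3.parquet had to be rewritten wholesale just to touch one record in its partition.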
@joshi1q2w3e · 1 year ago
Thank you for this Series! It’s great!
@M20081986 · 1 year ago
Great start to Iceberg series. Thanks a lot !
@Dremio · 1 year ago
Lots more content here as well: www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/
@sukulmahadik0303 · 3 months ago
[Notes Part-4] Iceberg design benefits:
- Efficiently make smaller updates. Updates no longer happen at the directory level; changes are made at the file level.
- Snapshot isolation for transactions. Every change to the table creates a new snapshot, and all reads go against the newest snapshot. Reads and writes do not interfere with each other, all writes are atomic, and a read never sees a partial snapshot.
- Faster planning and execution. A lot of metadata is maintained at the individual file and partition level, allowing better file pruning while running the query. Column stats are maintained in the manifest files and are used to eliminate files.
- Reliable metrics for CBOs (vs. Hive). The column stats and metrics are calculated on write instead of by frequent, expensive jobs.
- Abstract the physical, expose a logical view. Users don't have to be familiar with the underlying physical structure of the table: hidden partitioning, compaction of small files, tables that can change over time, and the ability to experiment with the table layout without breaking the table.
- Rich schema evolution support.
- All engines see changes immediately.
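The snapshot-isolation point above can be mimicked with a tiny Python model, under the simplifying assumption that a snapshot is just an immutable set of file names. The `Table` class and its methods are invented for this sketch; real Iceberg persists snapshots in metadata files and commits them atomically through a catalog.

```python
class Table:
    """Toy table: every commit appends a new immutable snapshot."""

    def __init__(self):
        self.snapshots = [frozenset()]  # snapshot 0: empty table

    def current(self):
        return self.snapshots[-1]

    def commit(self, added, deleted=frozenset()):
        # Writers never mutate an existing snapshot; they publish a new one
        # atomically, so readers can never observe a half-applied change.
        self.snapshots.append(frozenset((self.current() - deleted) | added))

t = Table()
reader_view = t.current()                    # a reader pins the snapshot it started on
t.commit({"f1.parquet", "f2.parquet"})       # a writer publishes snapshot 1 meanwhile

assert reader_view == frozenset()            # the in-flight reader is unaffected
assert t.current() == {"f1.parquet", "f2.parquet"}  # new readers see the new snapshot
```

The same structure is what makes time travel possible: older snapshots stay in the list until they are explicitly expired.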
@swarajmehta3011 · 5 months ago
amazing content!! great work
@ricardorobertodelima7358 · 1 year ago
thanks very good.
@santhoshreddykesavareddy1078 · 4 days ago
Hi, thanks, this is really great information to start with Apache Iceberg. But I have a question: when modern databases already prune and scan data with such advanced technology, why would we need to store the data as files instead of directly loading it into a table?
@Dremio · 3 days ago
When you start talking about 10TB+ datasets you run into issues with whether a database can hold the dataset, and hold it performantly. Also, different purposes need different tools, so you need your data in a form that can be used by different teams with different tools.
@Dremio · 3 days ago
Also, with data lakehouse tables there doesn't have to be a running database server when no one is querying the dataset, since they are just files in storage, while traditional database tables need a persistently running environment.
@santhoshreddykesavareddy1078 · 3 days ago
@@Dremio wow! Now I have got full clarity. Thank you so much for your response.
@santhoshreddykesavareddy1078 · 3 days ago
@@Dremio cost saving. Thanks for the tip 😀.
@reclaim_kashi_vishweshwar_1195
Great video! Can you please advise where the metadata for an Iceberg-format table is stored?
@AlexMercedCoder · 1 year ago
The location of the metadata is elaborated in later videos in this series, but essentially there are three types of files:
- metadata.json: table-level definitions (table id, list of snapshots, current/previous schema and partitioning scheme, etc.)
- snapshot/manifest list: a list of the manifest files belonging to one table snapshot, with partition stats used for partition pruning
- manifests: files listing a batch of the table's data files, with metadata on each column that can be used for min/max pruning.
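Roughly, those three layers nest like the structures below. Field names only approximate the Iceberg spec and are trimmed for illustration (the real files are JSON and Avro; Python literals are used here just to show the shape and the planning path).

```python
# Layer 1: metadata.json - table-level definitions.
metadata_json = {
    "table-uuid": "example-uuid",          # made-up value for the sketch
    "current-snapshot-id": 2,
    "schemas": ["<current and previous schemas>"],
    "partition-specs": ["<partitioning schemes over time>"],
    "snapshots": [{"snapshot-id": 2, "manifest-list": "snap-2.avro"}],
}

# Layer 2: the manifest list for one snapshot - partition stats per manifest,
# checked for partition pruning before any manifest is even opened.
manifest_list = [
    {"manifest_path": "m1.avro",
     "partition_lower": "2024-01-01", "partition_upper": "2024-01-31"},
]

# Layer 3: a manifest - a batch of data files with per-column stats
# used for min/max pruning of individual files.
manifest = [
    {"file_path": "f1.parquet",
     "lower_bounds": {"ts": 100}, "upper_bounds": {"ts": 199}},
]

# Query planning walks: metadata.json -> manifest list -> manifests -> data files,
# discarding whole branches at each level using the stats stored there.
```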
@reclaim_kashi_vishweshwar_1195
@@AlexMercedCoder Got you. Thank you so much, you are doing us a favor; this is an awesome learning channel.
@sukulmahadik0303 · 3 months ago
[Notes Part-3] What Iceberg is:
1) A table format specification: a standard for how the data and metadata of a table are organized. Any engine reading or writing Iceberg tables should implement this open specification.
2) A set of APIs and libraries for interacting with that specification (Java and Python APIs). These libraries are leveraged by engines and tools to let them interact with Iceberg tables.
What Iceberg is not:
1) Not a storage engine (storage options include HDFS and object stores like S3).
2) Not an execution engine (engine options include Spark, Presto, Flink, etc.).
3) Not a service. We do not have to run a server of some sort; it's just a specification for storing and reading data and metadata files.
@galeop · 5 months ago
18:31 Doesn't the Hive Metastore already allow referencing the data files that make up a table, and collecting column statistics?
@jaredcheung6875 · 1 year ago
Could you please share the slides? Thanks in advance.
@Dremio · 1 year ago
Send an email to alex.merced@dremio.com
@adityanjsg99 · 4 months ago
Please tell me, how is this different from or better than the Databricks Lakehouse?
@Dremio · 4 months ago
I would point you to watching our State of the Data Lakehouse presentation on this channel.
@jean4j_ · 4 months ago
Nice video, but a little mistake :) Java was used to develop MapReduce jobs, not JavaScript. You probably know that, you just got them mixed up.
@Dremio · 4 months ago
100% agreed, I was trying to say a "Java" script, not JavaScript, but said it too fast. I always fumble trying to express writing a single Java "thing", like saying a "Python script" or "Ruby script". - Alex
@jean4j_ · 4 months ago
@@Dremio nice save!! haha Alright. It makes sense. Anyway I knew that you knew.