Fokko Driesprong - PyIceberg: Tipping your toes into the petabyte data-lake | PyData Amsterdam 2023

With Apache Iceberg, you store your big data in the cloud as files (e.g., Parquet), but then query it as if it’s a plain SQL table. You enjoy the endless scalability of the cloud, without having to worry about how to store, partition, or query your data efficiently. PyIceberg is the Python implementation of Apache Iceberg that loads your Iceberg tables into PyArrow (pandas), DuckDB, or any of your preferred engines for doing data science. This means that with PyIceberg, you can tap into big data easily by only using Python. It’s time to say goodbye to the ancient Hadoop-based frameworks of the past! In this talk, you'll learn why you need Iceberg, how to use it, and why it is so fast.
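As a rough illustration of the workflow described above, the sketch below loads a table through a catalog and materializes a filtered scan as a PyArrow table. It uses PyIceberg's `load_catalog`, `load_table`, `scan`, and `to_arrow` calls. The catalog name `default`, the table `nyc.taxis`, and the column names are hypothetical placeholders, not taken from the talk.

```python
from typing import Any, Sequence


def scan_to_arrow(catalog: Any, identifier: str, row_filter: str,
                  columns: Sequence[str] = ("*",)) -> Any:
    """Load an Iceberg table and return a filtered scan as a PyArrow table."""
    table = catalog.load_table(identifier)
    # The filter and the column projection are passed to the scan itself,
    # so only matching files and the selected columns need to be read.
    scan = table.scan(row_filter=row_filter, selected_fields=tuple(columns))
    return scan.to_arrow()


def main() -> None:
    # Imported here so the helper above can be reused without PyIceberg
    # being importable at module load time.
    from pyiceberg.catalog import load_catalog

    # "default" is a hypothetical catalog name from your PyIceberg config.
    catalog = load_catalog("default")
    arrow_table = scan_to_arrow(
        catalog,
        "nyc.taxis",              # hypothetical namespace.table
        "trip_distance > 10.0",   # hypothetical column
        columns=("trip_distance", "fare_amount"),
    )
    # From Arrow, the data can be handed to pandas for analysis.
    print(arrow_table.to_pandas().head())


# Call main() once a catalog has been configured.
```

The same scan object also offers `to_pandas()` and `to_duckdb(table_name=...)` if you prefer to skip the explicit Arrow step.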
Description: Working with high volumes of data has always been complex and challenging. Querying data with Spark requires you to know how the data is partitioned; otherwise, your query performance suffers tremendously. The Apache Iceberg open table format solves this by fixing the underlying storage instead of educating the end users. Iceberg originated at Netflix and provides a cloud-native layer on top of your data files. It solves traditional correctness issues by supporting concurrent reads and writes to the table. Iceberg improves performance dramatically by collecting metrics on the data, making it easy to repartition your data, and compacting the underlying data files. Finally, it supports time travel, so the model that you're training doesn't change because new data has been added. After this talk, you'll be comfortable using Apache Iceberg.
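The two ideas in the description, skipping files based on collected metrics and pinning a query to an older snapshot, can be sketched with a toy model in plain Python. This is a simplified illustration only: the `DataFile` and `Snapshot` classes and the file names are hypothetical and do not mirror Iceberg's actual metadata layout.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataFile:
    path: str
    min_value: int  # lower bound of the filtered column in this file
    max_value: int  # upper bound of the filtered column in this file


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: Tuple[DataFile, ...]


def files_to_read(snapshot: Snapshot, lower: int, upper: int) -> List[str]:
    """Return paths of files whose [min, max] range overlaps [lower, upper]."""
    return [
        f.path
        for f in snapshot.files
        if f.max_value >= lower and f.min_value <= upper
    ]


# Toy table history: snapshot 2 adds one file on top of snapshot 1.
f1 = DataFile("a.parquet", 0, 99)
f2 = DataFile("b.parquet", 100, 199)
f3 = DataFile("c.parquet", 150, 299)
snapshots = {
    1: Snapshot(1, (f1, f2)),
    2: Snapshot(2, (f1, f2, f3)),
}

# Current snapshot: only files that can contain values in 120..180 are opened.
current = files_to_read(snapshots[2], 120, 180)
# Time travel: the same query pinned to the older snapshot ignores newer data.
pinned = files_to_read(snapshots[1], 120, 180)

print(current)  # ['b.parquet', 'c.parquet']
print(pinned)   # ['b.parquet']
```

Because the file list is resolved per snapshot, a training job pinned to snapshot 1 keeps seeing the same data even after new files arrive.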
Minutes 0-5: History and why we need a table format
Minutes 5-15: Overview of Iceberg, and how it works under the hood
Minutes 15-30: Introduction to PyIceberg with code and real examples (notebook!!)
Bio:
Fokko Driesprong
Open Source enthusiast. Committer on Avro, Parquet, Druid, Airflow and Iceberg. Apache Software Foundation member.
===
www.pydata.org
PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.
PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
00:00 Welcome!
00:10 Help us add time stamps or captions to this video! See the description for details.
Want to help add timestamps to our YouTube videos to help with discoverability? Find out more here: github.com/numfocus/RU-vidVi...

Science

Published: 10 Jul 2024

Comments: 2

@istvandarvas3372 7 months ago

Mate, you've made my day! :D Thanks

@symosys 4 months ago

@16:28 print(tbl.schema()): in database terms, a schema would be a container of many tables, so should this be called a table definition rather than a schema? Or, if there were many tables, would that command show all tables in the schema? (Totally new to this; my observation comes from moving from relational and star schemas to columnar.)