Data Engineering Principles - Build frameworks not pipelines - Gatis Seja

PyData London Meetup #54
Tuesday, March 5, 2019
Data pipelines are necessary for the flow of information from its source to its consumers, typically data scientists, analysts and software developers. Managing data flow from many sources is a complex task, and the maintenance cost limits how far a large, reliable data warehouse can scale. This presentation proposes a number of applied data engineering principles that can be used to build robust, easily manageable data pipelines and data products. Examples will be shown using Python on AWS.
Sponsored & Hosted by Man AHL
www.pydata.org
PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.
PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
00:00 Welcome!
00:10 Help us add time stamps or captions to this video! See the description for details.
Want to help add timestamps to our RU-vid videos to help with discoverability? Find out more here: github.com/numfocus/RU-vidVi...

Science

Published:

Jun 19, 2024

Comments: 17

@efeorikpete8774 2 years ago

Fast-forward 3 years: Airflow now has robust documentation for authoring, scheduling and monitoring your data pipelines.

@MrKane101111 2 years ago

Great presentation, really nice analogy and very clear.

@boudehoucherahma8083 2 years ago

Very interesting presentation. Thanks🙏

@severtone263 1 year ago

This was very helpful. That analogy is simply the best.

@jamesattwood3454 2 years ago

Great talk!

@dmytrooliinyk3083 18 days ago

That's a great talk!

@AshokTak 2 years ago

00:00 Welcome
00:34 Merchant John Story
08:17 Need for standardization
22:26 Q&A
Will update it later.

@alexgartner8187 2 years ago

Awesome

@TheSolbiatii 2 years ago

00:00 Welcome
00:34 Merchant John Story
08:17 Need for standardization
10:25 Traditional Pipeline vs Ideal Framework with Validations
18:02 Principles
22:26 Q&A

@augugninfin1034 1 year ago

great...

@mayurarun 1 year ago

Nice

@horaceweatherby2910 1 year ago

To be honest, I didn't find this very helpful. I'm a project manager tasked with redesigning the whole data environment in a small enterprise, technically minded but never formally studied. The presenter didn't seem to make the case for the presentation's title, "Build frameworks, not pipelines": I didn't observe a part where he discounted pipelines. The first 10 minutes, which use the many units of measurement across Britain as an analogy for different technologies and systems in data, didn't reveal any insights and can be safely skipped IMO. After that, the diagramming of a framework from the data source all the way to a data warehouse seems more like an explanation for beginners, but without the clarity that such an explanation should possess. Overall, it seemed like an inadequately organized way to present a basic idea.

Still, some individual points that I took away from this presentation:
- Keep HTML files from web scraping, not just fields, for access to the data at any time without going back to the original source
- Maintain a layer for failed data extractions: this has been my idea for a long time but good to see it articulated by an actual data engineer
- Maintain a layer as a staging data warehouse, prior to the production data warehouse

Instead, I found this recommended video better, even though it was more complex: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-C6Abv87D5dU.html It goes more in-depth about one company's challenges in designing a new data pipeline and offers insights that are generalizable to anyone setting up or upgrading such a pipeline.
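
The three takeaways in the comment above (keep the raw HTML, keep a layer for failed extractions, stage data before production) could look roughly like the sketch below. This is not code from the talk: the layer names `raw`, `staging` and `failed`, the local-directory layout and the `extract_title` parser are all hypothetical stand-ins, and a real setup would more likely write to object storage such as S3.

```python
import json
import re
import tempfile
from pathlib import Path


def extract_title(html: str) -> dict:
    """Hypothetical parser: pull the <title> text out of a scraped page."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("no <title> element found")
    return {"title": match.group(1).strip()}


def process_page(page_id: str, html: str, root: Path) -> str:
    """Store the raw page, then write the parsed record to staging or failed."""
    # Assumed layer names; each layer is just a directory in this sketch.
    for layer in ("raw", "staging", "failed"):
        (root / layer).mkdir(parents=True, exist_ok=True)

    # Keep the full HTML so fields can be re-extracted later without re-scraping.
    (root / "raw" / f"{page_id}.html").write_text(html, encoding="utf-8")

    try:
        record = extract_title(html)
    except ValueError as exc:
        # Failed extractions get their own layer instead of being dropped.
        failure = {"page_id": page_id, "error": str(exc)}
        (root / "failed" / f"{page_id}.json").write_text(
            json.dumps(failure), encoding="utf-8"
        )
        return "failed"

    # Parsed records wait in staging before any load into production.
    (root / "staging" / f"{page_id}.json").write_text(
        json.dumps(record), encoding="utf-8"
    )
    return "staging"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        print(process_page("good", "<html><title>Prices</title></html>", root))
        print(process_page("bad", "<html><body>no title</body></html>", root))
```

Because the raw page is written before parsing is attempted, a page that lands in `failed` can be reprocessed from `raw` once the parser is fixed, without returning to the original source.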

@ooker777 10 months ago

Thanks for taking the time and effort to write a detailed review.

@firefoxmetzger9063 2 years ago

Somehow this makes me think of XKCD's Standards comic.

@julianatlas5172 2 years ago

I like the xkcd about date formats. According to ISO 8601 there is only one good date format, YYYY-MM-DD, e.g. 2021-12-15.
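
As a small illustration of the date-format point in the comment above, the sketch below converts dates from other layouts into YYYY-MM-DD using only Python's standard library. The helper name `to_iso_date` and the sample input formats are assumptions made for the example.

```python
from datetime import date, datetime


def to_iso_date(text: str, source_format: str) -> str:
    """Parse a date written in source_format and return it as YYYY-MM-DD."""
    return datetime.strptime(text, source_format).date().isoformat()


# The same day written day-first and month-first both normalize to one form.
print(to_iso_date("15/12/2021", "%d/%m/%Y"))  # 2021-12-15
print(to_iso_date("12-15-2021", "%m-%d-%Y"))  # 2021-12-15

# ISO dates round-trip directly without a format string.
print(date.fromisoformat("2021-12-15").isoformat())  # 2021-12-15
```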

@vansf3433 1 year ago

It's too simple, and anyone can learn the process of sorting out, transforming and transmitting data without needing a good knowledge of CS.

@RedShipsofSpainAgain 11 months ago

For the first 10 minutes he talks about different measuring units in Britain as an analogy for the importance of standards in modern data engineering, but it has zero relevance to data engineering platforms. Really poor analogy. Just skip to 10:20.