How to test your Python ETL pipelines | Data pipeline | Pytest


In this tutorial we cover how to test ETL pipelines. I have received a number of inquiries about testing, especially testing the data pipelines we build using Python. Testing is an important part of any ETL pipeline: it ensures we deliver accurate information to our stakeholders and that our data is current, consistent and accurate.
Therefore, it is always a good idea to put test cases in place to catch data anomalies. A failing test can tell us that:
• An assumption about our source data is incorrect. For example, a column we expected never to be null contains nulls, or a column we expected to contain unique values contains duplicates.
• Our transformation logic contains a flaw.
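The two kinds of failure listed above can be sketched as pytest tests. This is a minimal illustration and not the code from the video: the sample rows, the `ListPrice` and `StandardCost` columns, and the `add_margin` transformation are all hypothetical.

```python
import pandas as pd
import pytest


def make_sample_df():
    # Hypothetical stand-in for the extracted source table.
    return pd.DataFrame(
        {
            "ProductKey": [1, 2, 3],
            "ListPrice": [10.0, 20.0, 30.0],
            "StandardCost": [4.0, 12.0, 30.0],
        }
    )


def add_margin(df):
    # Hypothetical transformation step: margin = list price - standard cost.
    out = df.copy()
    out["Margin"] = out["ListPrice"] - out["StandardCost"]
    return out


@pytest.fixture
def df():
    # pytest passes this DataFrame to every test that takes a `df` argument.
    return make_sample_df()


def test_key_is_unique(df):
    # Source-data assumption: the key column holds no duplicates.
    assert df["ProductKey"].is_unique


def test_margin_logic(df):
    # Transformation check: margin is never negative
    # and never exceeds the list price.
    result = add_margin(df)
    assert (result["Margin"] >= 0).all()
    assert (result["Margin"] <= result["ListPrice"]).all()
```

Saved in a file whose name starts with `test_`, these functions would be collected and run by the `pytest` command.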
Errata in the tests: One of the viewers pointed out that the null check was always returning true. It has been revised so that it returns false when nulls are present. The test_null_check function is updated as follows:
def test_null_check(df):
    assert df['ProductKey'].notnull().all()
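To see how the revised assertion behaves, the sketch below applies the same expression to a clean frame and to one with a missing key. Both sample frames are made up for illustration, and the `has_no_nulls` helper is a hypothetical wrapper that is not part of the original repo.

```python
import pandas as pd


def has_no_nulls(df, column):
    # True only when every value in the column is present,
    # mirroring the expression used in test_null_check.
    return bool(df[column].notnull().all())


# Hypothetical sample data: one clean frame, one with a missing key.
clean = pd.DataFrame({"ProductKey": [1, 2, 3]})
dirty = pd.DataFrame({"ProductKey": [1, None, 3]})

print(has_no_nulls(clean, "ProductKey"))  # True
print(has_no_nulls(dirty, "ProductKey"))  # False
```

In a pytest suite the `df` argument of test_null_check would typically be supplied by a fixture that loads the data.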
Link to GitHub repo (code & data): github.com/hnawaz007/pythonda...
Link to article on this topic: blog.devgenius.io/how-to-test...
Pytest Docs: docs.pytest.org/en/7.2.x/
#pytest #etl #python
Subscribe to our channel:
/ haqnawaz
---------------------------------------------
Follow me on social media!
Github: github.com/hnawaz007
Instagram: / bi_insights_inc
LinkedIn: / haq-nawaz
---------------------------------------------
Topics covered in this video:
0:00 - Introduction to ETL testing
0:56 - Benefit of testing
1:32 - Pytest testing library overview
2:26 - Pytest setup
3:05 - Import Data
3:36 - First test - column check
6:08 - Primary key column tests
7:22 - Pytest features
8:15 - Data Type check
9:36 - Expected Values check
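The column, data type and expected values checks named in the topic list could look roughly like the sketch below. It is an assumption-laden illustration, not the code from the video: the `Color` and `ListPrice` columns, the allowed values and the sample rows are hypothetical.

```python
import pandas as pd
import pytest
from pandas.api.types import is_float_dtype, is_integer_dtype

# Hypothetical schema; the real column names live in the linked repo.
EXPECTED_COLUMNS = ["ProductKey", "Color", "ListPrice"]
ALLOWED_COLORS = {"Red", "Black", "Silver"}


def make_sample_df():
    # Hypothetical stand-in for the imported data.
    return pd.DataFrame(
        {
            "ProductKey": [1, 2, 3],
            "Color": ["Red", "Black", "Silver"],
            "ListPrice": [10.0, 20.0, 30.0],
        }
    )


@pytest.fixture
def df():
    return make_sample_df()


def test_columns_present(df):
    # Column check: every expected column exists in the frame.
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    assert not missing, f"missing columns: {missing}"


def test_dtypes(df):
    # Data type check: key is an integer column, price is a float column.
    assert is_integer_dtype(df["ProductKey"])
    assert is_float_dtype(df["ListPrice"])


def test_expected_values(df):
    # Expected values check: the column only holds values from a known set.
    assert df["Color"].isin(ALLOWED_COLORS).all()
```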

Category: Science

Published: 15 Jun 2024


Comments: 27

@BiInsightsInc 1 year ago

Part two, Pytest integration with an ETL pipeline: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-7FPksG-LYOA.html
Part three, Pytest data quality report: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-Sv6QWF7J63k.html

@willosullivan3571 1 year ago

The best data engineering YouTuber I've had the pleasure to find. Thanks and please keep it up!

@user-te3oy8mo3u 1 year ago

Heartfelt thanks to you for all these recorded sessions and tutorials. You have made life so simple.

@poojaak1678 1 year ago

Articulate explanation! You're the best!! Thank you so much.

@farhadshakibaca 1 year ago

The best data engineering YouTuber. Thank you

@soheilahg921 1 year ago

Great and very helpful content. Thank you.

@Sreenu1523 1 year ago

You did a great job. I had been looking for material like this for a long time. Thanks, man, for sharing great content. I have many questions on pytest and will ask them once I have gone through all the videos. Thanks

@BillusTinnus 1 year ago

Great video, thanks

@ZarifouDjibrilFrancais 3 months ago

Very helpful. Thank you.

@muddashir 1 year ago

Thanks

@gulnarabekirova4741 5 months ago

Thank you for a great tutorial! You already have a few different videos. Can you add a number to each tutorial (to order them)? It would help show which video is the first and which one is the last.

@BiInsightsInc 4 months ago

Thanks and good suggestion. I have consolidated the data quality videos in their own playlist. Here is the link: ru-vid.com/group/PLaz3Ms051BAkgmoRZEcGFvQzY4YW_SR8b

@ashishvats1515 1 year ago

Could you please do this with Apache Beam, JDBC source to BigQuery? Or could you help me with this? I really need this kind of information.

@MyChannel-ns3ct 3 months ago

Thanks for this video. Is there a video on how to do these runs on SQL Server, pgAdmin or Athena?

@BiInsightsInc 3 months ago

Here is the link to the video in the series that runs data quality tests against SQL Server: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-7FPksG-LYOA.html
Here is the link to the series: ru-vid.com/group/PLaz3Ms051BAkgmoRZEcGFvQzY4YW_SR8b

@bharamkarvivek4632 11 months ago

Thanks for such important info. How to automate these test cases?

@BiInsightsInc 11 months ago

You can embed these tests in your data pipeline; the first link below is an example. Once you schedule the pipeline with an orchestrator, the tests run each time it is triggered. You can use any tool such as Airflow, Dagster, Prefect or cron to schedule Python-based pipelines.
Example: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-7FPksG-LYOA.html&ab_channel=BIInsightsInc
Airflow: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-eZfD6x9FJ4E.html&ab_channel=BIInsightsInc
Dagster: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-f1TbVGdhmYg.html&ab_channel=BIInsightsInc
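As a rough illustration of the idea in the reply above, and not the code from the linked video, the sketch below places the checks between the extract and load steps so that a scheduled run fails when the data is bad. The `extract`, `validate`, `load` and `run_pipeline` functions, the `DataQualityError` exception and the in-memory target are all hypothetical.

```python
import pandas as pd


class DataQualityError(Exception):
    """Raised when a check fails, so the scheduler can mark the run as failed."""


def extract():
    # Hypothetical extract step; a real pipeline would read a database or file.
    return pd.DataFrame({"ProductKey": [1, 2, 3], "ListPrice": [10.0, 20.0, 30.0]})


def validate(df):
    # Run each check and collect the names of the ones that fail.
    checks = {
        "ProductKey has no nulls": df["ProductKey"].notnull().all(),
        "ProductKey is unique": df["ProductKey"].is_unique,
    }
    failed = [name for name, passed in checks.items() if not passed]
    if failed:
        raise DataQualityError(f"failed checks: {failed}")
    return df


def load(df, target):
    # Hypothetical load step: append the rows to an in-memory list.
    target.extend(df.to_dict("records"))


def run_pipeline(target):
    # Validation sits between extract and load, so bad data never reaches the target.
    df = validate(extract())
    load(df, target)
    return len(target)
```

An orchestrator or cron job would then only need to call `run_pipeline`; an uncaught `DataQualityError` ends the process with a non-zero exit status, which most schedulers report as a failed run.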

@SP-db6sh 1 year ago

How can I add a logger to it with a tqdm progress bar?

@BiInsightsInc 1 year ago

If you want to log the tests for review or sharing, then check out the next video. I haven't played around with tqdm, but here are their docs and implementation. Maybe in the future I will implement this in a project. github.com/tqdm/tqdm

@kiranpatil4968 10 months ago

Please make a video on ETL automation testing from scratch and make separate playlists.

@BiInsightsInc 10 months ago

I will try to cover this in the future. In the meantime, you can check out the following videos on testing and automating ETL pipelines.
ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-7FPksG-LYOA.html
ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-Sv6QWF7J63k.html&t
ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-7UQ91Ib7PtU.html&t
How to automate Python-based ETL pipelines:
ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-f1TbVGdhmYg.html&t
ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-eZfD6x9FJ4E.html&t
ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-IsuAltPOiEw.html

@lalalf4535 1 year ago

The function test_null_check(df) will always pass.

@BiInsightsInc 1 year ago

Thanks for spotting this. I have updated the code base. You can use the following assertion.
# check for nulls
def test_null_check(df):
    assert df['ProductKey'].notnull().all()

@lalalf4535 1 year ago

@BiInsightsInc Thank you. Your content is very useful.