
Cricket Statistics Data Pipeline in Google Cloud using Airflow | Data Engineering Project 

19K views

Looking to get in touch?
Drop me a line at vishal.bulbule@gmail.com, or schedule a meeting using the provided link: topmate.io/vishal_bulbule

Cricket Statistics Data Pipeline in Google Cloud using Airflow, Dataflow, Cloud Functions and Looker Studio
Data Retrieval: We fetch cricket statistics from the Cricbuzz API using Python.
Storing Data in GCS: After fetching the data, we store it as a CSV file in Google Cloud Storage (GCS); see the first sketch after this list.
Cloud Function Trigger: A Cloud Function is triggered on file upload to the GCS bucket: it executes whenever a new CSV file is detected.
Cloud Function Execution: Inside the Cloud Function, code launches a Dataflow job, passing the parameters required to initiate it; see the second sketch after this list.
Dataflow Job: The Dataflow job, triggered by the Cloud Function, loads the data from the CSV file in the GCS bucket into BigQuery. Ensure the necessary configurations are set up.
Looker Studio Dashboard: BigQuery serves as the data source for the Looker Studio dashboard. Connect Looker Studio to BigQuery and build the dashboard on the loaded data.
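
A minimal sketch of the first two steps, assuming the Cricbuzz API is accessed through RapidAPI. The endpoint URL, the response shape (rank/name/country), the API key, and the bucket name are placeholders to adapt to your own subscription and project:

import csv
import io

import requests
from google.cloud import storage

# Assumed RapidAPI endpoint for Cricbuzz batting rankings -- verify
# against your own RapidAPI subscription before use.
API_URL = "https://cricbuzz-cricket.p.rapidapi.com/stats/v1/rankings/batsmen"
HEADERS = {
    "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",  # placeholder
    "X-RapidAPI-Host": "cricbuzz-cricket.p.rapidapi.com",
}
BUCKET_NAME = "your-cricket-stats-bucket"  # placeholder

def fetch_and_upload():
    # Step 1: fetch ODI batting rankings as JSON.
    response = requests.get(API_URL, headers=HEADERS, params={"formatType": "odi"})
    response.raise_for_status()
    players = response.json().get("rank", [])  # assumed response shape

    # Build the CSV in memory: rank, name, country.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["rank", "name", "country"])
    for player in players:
        writer.writerow([player.get("rank"), player.get("name"), player.get("country")])

    # Step 2: upload the CSV to GCS -- this upload is what later fires
    # the Cloud Function.
    storage.Client().bucket(BUCKET_NAME).blob("batsmen_rankings.csv").upload_from_string(
        buffer.getvalue(), content_type="text/csv"
    )

if __name__ == "__main__":
    fetch_and_upload()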
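
And a sketch of steps 3 to 5: a background Cloud Function that fires when an object lands in the bucket and launches the Google-provided GCS_Text_to_BigQuery Dataflow template, which loads the CSV into BigQuery. The project ID, region, dataset/table, and the schema/UDF file names are placeholders:

from googleapiclient.discovery import build

PROJECT = "your-project-id"  # placeholder
REGION = "us-central1"
# Google-provided batch template: GCS text -> BigQuery.
TEMPLATE = "gs://dataflow-templates/latest/GCS_Text_to_BigQuery"

def trigger_dataflow(event, context):
    """Entry point for a Cloud Function on google.storage.object.finalize."""
    bucket, file_name = event["bucket"], event["name"]
    if not file_name.endswith(".csv"):
        return  # ignore the schema/UDF files uploaded to the same bucket

    dataflow = build("dataflow", "v1b3")
    dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body={
            "jobName": "load-cricket-stats",
            "parameters": {
                "inputFilePattern": f"gs://{bucket}/{file_name}",
                "JSONPath": f"gs://{bucket}/bq_schema.json",  # placeholder file
                "javascriptTextTransformGcsPath": f"gs://{bucket}/udf.js",  # placeholder file
                # Must match the function name defined in udf.js; leaving it
                # unset is a common cause of "Failed to serialize json to
                # table row" errors with this template.
                "javascriptTextTransformFunctionName": "transform",
                "outputTable": f"{PROJECT}:cricket_dataset.batsmen_rankings",  # placeholder
                "bigQueryLoadingTemporaryDirectory": f"gs://{bucket}/tmp",
            },
        },
    ).execute()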
GitHub repo for all the code used in this project
github.com/vishal-bulbule/cricket-stat-data-engineering-project
============================================
Associate Cloud Engineer - Complete Free Course
ru-vid.com/group/PLLrA_pU9-Gz1FvRc7v-4l4dTG9fMsekNf
Google Cloud Data Engineer Certification Course
ru-vid.com/group/PLLrA_pU9-Gz1TbaEIlUVfAqZ853LmDPib
Google Cloud Platform(GCP) Tutorials
ru-vid.com/group/PLLrA_pU9-Gz2s68wQrEAmhA-cMIK_ANXj
Generative AI
ru-vid.com/group/PLLrA_pU9-Gz0Vu-Ln5jD4eNovJxEoXDWr
Getting Started with Duet AI
ru-vid.com/group/PLLrA_pU9-Gz0NNt3Yxjc7Qmxg-HjBLOJs
Google Cloud Projects
ru-vid.com/group/PLLrA_pU9-Gz1HfqPgNcklB5S7XqHONXCq
Python For GCP
ru-vid.com/group/PLLrA_pU9-Gz3zmg__9iK_nnfzYQAGwgH4
Terraform Tutorials
ru-vid.com/group/PLLrA_pU9-Gz3A4r_kLJw456cGlZHzqxb-
LinkedIn
www.linkedin.com/in/vishal-bulbule/
Medium Blog
medium.com/@VishalBulbule
GitHub Repository for Source Code
github.com/vishal-bulbule
Email - vishal.bulbule@techtrapture.com
#dataengineeringessentials #dataengineers #dataengineeringproject #airflow #dataflow #cloudcomposer #bigquery #looker #googlecloud #datapipeline

Science

Published: 18 Dec 2023

Comments: 39
@sravyam8055 2 months ago
Excellent, sir! I have never watched a full video with this much clarity on batch pipelines. Keep going at the same speed, and very good explanation.🎉🎉🎉
@dhananjaylakkawar4621 9 months ago
I was thinking of building a project on GCP and your video arrived. Great work, sir! Thank you.
@venkatatejanatireddi8018 9 months ago
I sincerely recommend this to people who want to explore DE pipeline orchestration on GCP.
@ed-salinas-97 1 month ago
Just discovered this channel, so there may be more recent videos that address some of this. I worry about CSV files that have 100+ columns; that .js file and .json file will get pretty crazy very fast (although I suppose you could use Python to create the JSON file). I'm also more interested in being able to append new data to a table once it's created, uploaded either daily or weekly. Overall, great job! I definitely learned a few things here!
@shyjukoppayilthiruvoth6568 4 months ago
Very good video. Would recommend it to anyone who is new to GCP.
@ajayagrawal7586 6 months ago
I was looking for this type of video for a long time. Thanks.
@balajichakali9293 8 months ago
Thanks is a small word for you, sir..🙏 This is the best explanation I have ever seen on YouTube. It is very helpful to me. I have completed this project end to end and I have learnt so many things.
@techtrapture 7 months ago
Glad that it helped you.
@brjkumar 8 months ago
Good job. Looks like the best video for GCP ELT & other GCP stuff.
@techtrapture 8 months ago
Glad it was helpful!
@aashishsharma4734 1 month ago
Very good video to understand data engineering workflow
@prabhuduttasahoo7802 6 months ago
Learnt a lot from you. Thank you sir
@bernasiakk 2 months ago
This is great! I followed your video step-by-step, and now it's time for me to do a project of my own based on your stuff! Will use something more European though, like soccer or basketball haha :D Thanks!!!
@techtrapture 2 months ago
True...better for you not to use Cricket 😅😅
@ashishvats1515 2 months ago
Hello, sir! Great video. If we need to implement CDC or append new data to a table, do we have to extract the data date-wise and load it to GCS? And how do we append that data to an existing table in BigQuery?
Cloud Composer: extract data from an API and load it to GCS.
Cloud Function: trigger the event to load a new CSV file to BigQuery using Dataflow.
So where do we need to write the logic to append the new data to an existing table in BigQuery?
@pariyaparesh 7 months ago
Thanks a lot for such a great explanation. Can you please share which video recording/editing tool you use?
@Anushri_M29 4 months ago
Hi Vishal, this is a really great video, but it would be very helpful if you could also explain the code that you have written from 6:01.
@wreckergta5470 7 months ago
Thank you, learned a lot from you sir
@techtrapture 7 months ago
Happy to know. Keep learning brother 🎉
@ShigureMuOnline 4 months ago
Nice video. Just one question: why do you create a Dataflow job? You could insert the rows using Python?
@techtrapture 4 months ago
Yes, I agree, but as a project I want to show the complete orchestration process and use multiple services.
@ShigureMuOnline 4 months ago
@techtrapture Thanks a lot for the fast answer. I will watch all your videos.
@SwapperTheFirst 7 months ago
Hi Vishal, in this and your other Composer videos you use standard Airflow operators (for example, Python or Bash). Do you know how to install Google Cloud Airflow package for Google cloud specific operators? I've tried to upload the wheel to /plugins bucket, but nothing happens. Composer can't import Google Cloud operators (like pubsub) and DAGs with these operators are listed as broken. Thanks!
@techtrapture 7 months ago
I usually refer to this code sample: airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/index.html
@SwapperTheFirst 7 months ago
@techtrapture Thanks! But how do I use these operators in Composer? In Airflow I just pip install the package. How do I do this in Composer?!
@techtrapture 7 months ago
Oh, OK, I get your doubt now... you have to add it to requirements.txt and keep that in the dags folder. Other options are also available here: cloud.google.com/composer/docs/how-to/using/installing-python-dependencies
@SwapperTheFirst 7 months ago
@techtrapture Yes, this is exactly what I needed. I can use both of these options, depending on the DAGs. Great!
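
For reference, a minimal sketch of a DAG using one of the Google provider operators mentioned above, once apache-airflow-providers-google is available in the environment. The project ID and topic name are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.pubsub import PubSubPublishMessageOperator

with DAG(
    dag_id="pubsub_publish_demo",
    start_date=datetime(2023, 12, 18),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Publishes a single message; this import fails (a "broken DAG") if the
    # provider package is not installed in the Composer environment.
    publish = PubSubPublishMessageOperator(
        task_id="publish_message",
        project_id="your-project-id",  # placeholder
        topic="your-topic",  # placeholder
        messages=[{"data": b"hello from Composer"}],
    )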
@rishiraj2548 9 months ago
Thanks
@NirvikVermaBCE 7 months ago
I am getting stuck on the Airflow code. I think it might be an issue with the filename in the Python code: bash_command='python /home/airflow/gcs/dags/scripts/extract_data_and_push_gcs.py'. I have uploaded extract_data_and_push_gcs.py to the scripts folder under dags. However, is there any way to check the path /home/airflow/gcs/dags/scripts/?
@techtrapture 7 months ago
/home/airflow/gcs/dags = your DAGs GCS bucket. It's the same path.
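
For context, a minimal sketch of the DAG wiring being discussed, assuming the script sits in the scripts/ folder of the Composer DAG bucket (the DAG id and schedule are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="cricket_stats_extract",
    start_date=datetime(2023, 12, 18),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Composer mounts the DAG bucket at /home/airflow/gcs/dags on workers,
    # so gs://<dag-bucket>/dags/scripts/... maps to the path below.
    extract = BashOperator(
        task_id="extract_data_and_push_gcs",
        bash_command="python /home/airflow/gcs/dags/scripts/extract_data_and_push_gcs.py",
    )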
@venkatatejanatireddi8018 9 months ago
I have been facing issues invoking the Dataflow job while using the default App Engine service account. Could you let me know if you were using a specific service account to work with the Cloud Function?
@techtrapture 9 months ago
No, I am using the same default service account. What error are you getting?
@sampathgoud8108 6 months ago
I tried the same way as per your video, but I got this error when running the Dataflow job through the template. Could you please help me find out exactly what mistake I have made? I used the same schema you used. Error message from worker: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: Failed to serialize json to table row: 1,Babar Azam,Pakistan
@techtrapture 6 months ago
Are you using the same JSON files?
@sampathgoud8108 6 months ago
@techtrapture Yes. Below is the JSON file:
{
  "BigQuery Schema": [
    { "name": "rank", "type": "STRING" },
    { "name": "name", "type": "STRING" },
    { "name": "country", "type": "STRING" }
  ]
}
@sampathgoud8108 6 months ago
I tried the rank column with both STRING and INTEGER data types. For both I am getting the same issue.
@pankajgurbani1484 6 months ago
@sampathgoud8108 I was getting the same error; it got resolved after I put 'transform' as the JavaScript UDF name under Optional Parameters while setting up the Dataflow job.
@TechwithRen-Z 4 months ago
This tutorial is 😩 a waste of time for beginners. He did not show how to connect Python to GCP before storing data in the bucket. There are a lot of missing steps.
@Rajdeep6452 7 months ago
You didn't show how to connect to GCP before storing data in the bucket. You have jumped over a lot of steps, and your video lacks quality. You should also include which dependencies to use. Just running your code and uploading it to GitHub is not everything.