I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive expertise in developing scalable, high-performance software applications in Python. I run a YouTube channel where I teach people about Data Science, Machine Learning, Elasticsearch, and AWS. I work as the Data Collection and Processing Team Lead at JobTarget, where I spend most of my time developing an ingestion framework and creating microservices and scalable architecture on AWS. I have worked with massive amounts of data, including building data lakes (1.2T) and optimizing data lake queries by creating partitions and using the right file formats and compression. I have also developed a streaming application that ingests real-time stream data via Kinesis and Firehose into Elasticsearch.
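The partition layout mentioned above can be illustrated in plain Python. This is just a sketch of Hive-style partitioning; the bucket, table, and column names are made up for the example, and in a real pipeline Spark's `partitionBy` would typically build these prefixes for you:

```python
# Illustrative sketch of Hive-style partitioning for a data lake.
# Bucket/table/column names here are hypothetical.
def partition_path(base: str, **partitions: str) -> str:
    """Build a Hive-style partition prefix like base/year=2024/month=03."""
    parts = "/".join(f"{k}={v}" for k, v in partitions.items())
    return f"{base}/{parts}"

path = partition_path("s3://my-lake/events", year="2024", month="03")
print(path)  # s3://my-lake/events/year=2024/month=03
```

Laying files out this way lets query engines prune entire prefixes when a filter matches a partition column, which is where most of the query-time savings come from.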
Hi Soumil, perfect solution! In my case I have some streaming tables, just like in your demo; after they land on S3, how can I join them for further real-time analytics? Can Flink do it by selecting data from the sink tables and joining them?
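For reference, Flink SQL does support joining streaming tables directly. A minimal sketch (table and column names are hypothetical, and the connector/watermark setup depends on your actual sources) might look like:

```sql
-- Hypothetical example: join two streaming tables in Flink SQL.
SELECT o.order_id,
       o.amount,
       c.customer_name
FROM orders AS o
JOIN customers AS c
  ON o.customer_id = c.customer_id;
```

Whether you join the sink tables on S3 or the upstream streams before sinking is a design choice; joining upstream avoids a second read of the landed data.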
@SoumilShah you're welcome. Please make a video on where in the stack we should build data objects, e.g. in a metadata layer or somewhere else. The idea is: if we have to replace tech X with Y when X is outdated and Y is new with improved processing speed, how can we keep our tables intact and unchanged (assuming the storage layer remains unchanged)? A full object rewrite is not fun.
Amazing. I have good experience in Python, but no video gave me the right insight or interest to understand these patterns. Thank you Soumil, because of you I have learnt these things; otherwise I was running around here and there...
Hey bro, I cloned a website and I'm opening its code in the VS Code editor, but after making the necessary edits only the text changes, not the images. I'm putting my image URL in place of the website's image URL, but after saving and opening it with Live Server, the preview still shows the cloned website's images, not mine, and Inspect Element shows the cloned website's image code. I've been trying for 6 hours and nothing works. Will you please tell me how I can change the images?
After installing, when I try to run the elasticsearch.bat file it shows an error like '\Java\jdk-21.0.1 was unexpected at this time', but my JDK and Java bin folder paths are set correctly in the environment variables.
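For what it's worth, a "... was unexpected at this time" message from a Windows batch script usually points to a malformed environment variable, e.g. quote characters stored inside the JAVA_HOME value itself or a stray semicolon. A common fix, assuming a default install location (adjust the path to your actual JDK folder), is to reset the variable without quotes in its value:

```bat
:: Windows cmd sketch: the quotes wrap the whole assignment,
:: so the stored value itself contains no quote characters.
:: (Install path is an assumption; use your actual JDK folder.)
set "JAVA_HOME=C:\Program Files\Java\jdk-21.0.1"
set "PATH=%JAVA_HOME%\bin;%PATH%"
elasticsearch.bat
```

If it runs after this, update the permanent value in System Properties → Environment Variables the same way, with no quotes or trailing semicolon in the value.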
Thanks for the video, it's a good one. Do you have any samples for the scenario where we have to read Avro data from a Kafka topic and upsert it into Hudi tables?
Hi Soumil, thanks for the video. Using OpenJDK 11 and Python 3.8, I can't see the table printed when running 'Creating DataFrame from List of Tuples'. I used a Jupyter notebook as well as the VS Code editor. Any idea?
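One common cause of the issue above: creating a DataFrame produces no output by itself, so in a plain script (and often in a notebook cell that doesn't end with the DataFrame expression) you need an explicit `df.show()`. A minimal sketch, assuming PySpark is installed locally:

```python
# Minimal PySpark sketch: createDataFrame() prints nothing by itself;
# call df.show() to render the table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("demo").getOrCreate()

data = [("alice", 1), ("bob", 2)]          # list of tuples
df = spark.createDataFrame(data, ["name", "id"])

df.show()   # without this call, nothing is displayed in a script
spark.stop()
```

If `df.show()` itself hangs or errors, that usually points to a Java/Spark environment problem rather than the DataFrame code.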
Hello, I was looking at your video channel. We may be helping a company that uses secure images to increase supply chain security and help cloud native development. Would you be willing to help try their software, make a video, and help show devs how to use their tools? This is not an offer, but just to start a conversation about your willingness to take on sponsorship. Please provide me with your email if you are interested. You'd have a chance to look at their technology and decide if it's the type of software that you'd be interested in covering in your channel.
Hi Soumil, thank you for the video. I know there are certain catalogs available now for Iceberg. In this case, are we utilizing Glue as the catalog, or OneTable as a catalog? Also, to automatically or incrementally sync data into the Iceberg table, do we have to set up an event-based trigger process to run that Java command?
Thanks for the video. I noticed that the error logs were not marked as errors by Datadog; any idea how to do that? I'm trying to send an artificial error to see if I can create a notification when something fails, but Datadog always marks them as INFO logs.
Hi Soumil, you haven't given any configuration to change in airflow.cfg. Will the solution you gave work when we want to parallelise multiple tasks inside a DAG and parallelise multiple DAGs? Other people suggest solutions like changing the database to MySQL or Postgres and changing the executor to LocalExecutor; what do you think about those?
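For context, the settings those suggestions refer to live in airflow.cfg. A hedged sketch (the values and connection string are examples, not recommendations; section names can differ slightly between Airflow versions):

```ini
[core]
; SequentialExecutor (the SQLite default) runs one task at a time;
; LocalExecutor requires a real database such as Postgres or MySQL.
executor = LocalExecutor
parallelism = 32

[database]
; In older Airflow versions this key lives under [core] instead.
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow
```

The executor and database changes go together: true task parallelism needs an executor that can run concurrent tasks, and that executor needs a database backend that supports concurrent connections.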
We don't really need to attach a ticketId to the user, as a user can exist without a ticket. Also, for fetching all the tickets associated with a given user, we can use a userId GSI on the ticket model.
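The access pattern described above can be illustrated in plain Python: the index keyed by userId answers "all tickets for this user" without the user item storing any ticket ids. The data below is made up, and real code would issue a DynamoDB Query against the index (e.g. boto3 with `IndexName`) rather than build it by hand:

```python
# Toy illustration of what a userId GSI provides: a lookup keyed by
# userId (DynamoDB maintains this automatically for a GSI).
from collections import defaultdict

tickets = [
    {"ticketId": "t1", "userId": "u1", "status": "open"},
    {"ticketId": "t2", "userId": "u2", "status": "closed"},
    {"ticketId": "t3", "userId": "u1", "status": "open"},
]

by_user = defaultdict(list)
for t in tickets:
    by_user[t["userId"]].append(t["ticketId"])

print(by_user["u1"])  # ['t1', 't3']
```

This keeps the user item independent of tickets, so a user with zero tickets needs no special handling.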
@soumilshah thanks for your informative video, but the link you have given in the description for the PDF files is not working. Could you please update it with the right URL?