
Optimize reads from Relational Databases using Spark

The Big Data Show
108K subscribers
4.2K views

In this video, I discuss an optimised way of reading big tables that may live in a database, using Apache Spark to read the data from the source.
Find the source code here
github.com/ankur334/sparkBoot...
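As a rough sketch of the partitioned-read approach discussed in the video (the connection string, table, and column names below are illustrative placeholders, not taken from the repo), the JDBC options look like this in PySpark:

```python
# Hypothetical JDBC options for a partitioned Spark read.
# URL, table, column, and bounds are placeholders for illustration.
jdbc_options = {
    "url": "jdbc:mysql://localhost:3306/shop",  # placeholder connection string
    "dbtable": "orders",                        # placeholder table name
    "user": "reader",
    "password": "secret",
    "partitionColumn": "order_id",  # must be a numeric, date, or timestamp column
    "lowerBound": "0",              # lowest value Spark uses to compute strides
    "upperBound": "10000",          # highest value for strides (not a row filter)
    "numPartitions": "10",          # number of parallel JDBC connections/tasks
    "fetchsize": "1000",            # rows per network round trip of the JDBC driver
}

# With a live SparkSession, this becomes a 10-task parallel read:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```

Note that `lowerBound`/`upperBound` only shape the per-partition queries; rows outside that range are still read by the first and last partitions.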
Join me on Social Media:
🔅 Topmate - (Book 1:1 or other sessions)
topmate.io/ankur_ranjan
🔅 LinkedIn - / thebigdatashow
🔅 Instagram - / ranjan_anku
#DataEngineering #apachespark #bigdata #interview #dataengineerjob #careerswitch #job #scala

Published: 23 Jul 2022

Comments: 22
@amrita5222 · 1 year ago
It's very tough to make these videos after a full day at work. Hats off to your determination. Really inspirational 😇
@TheBigDataShow · 2 months ago
Thank you Amrita 🥳🎉🎊
@DURGESHKUMAR-gd5in · 1 year ago
Your way of teaching is awesome, bro 🤞
@nupoornawathey100 · 1 month ago
Only video on YT that explains these parameters well, thanks Ankur!! I had a query. For example, we have 10 partitions, lowerBound=0, upperBound=10000, and provide fetchSize as 1000. Will fetchSize be used as a LIMIT 1000 here? And if one partition's SQL returns more rows than fetchSize, what happens?
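(For readers with the same question: `fetchsize` is the JDBC driver's batch size per network round trip, not a LIMIT; each partition still reads all of its rows, just 1000 at a time. A simplified sketch, approximating how Spark turns the bounds into per-partition WHERE clauses — the real stride computation in Spark is slightly more involved:)

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Simplified model of how Spark derives per-partition WHERE clauses
    from partitionColumn / lowerBound / upperBound / numPartitions."""
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lower + (i + 1) * stride
        if i == 0:
            # first partition is open below, so rows under lowerBound are still read
            predicates.append(f"{column} < {hi}")
        elif i == num_partitions - 1:
            # last partition is open above, so rows over upperBound are still read
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    return predicates

# 10 partitions over [0, 10000): each partition's query scans ~1000 ids,
# and the driver fetches them 1000 rows per round trip (fetchsize),
# however many rows the partition actually contains.
for p in partition_predicates("order_id", 0, 10000, 10):
    print(p)
```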
@DURGESHKUMAR-gd5in · 1 year ago
Hi Ankur, Durgesh this side 🙂
@mranaljadhav8259 · 1 year ago
Thanks a lot, sir, for making such an awesome video... Keep uploading more; waiting for more such videos.
@shubhambadaya · 1 year ago
Thank you
@princeyjaiswal45 · 1 year ago
Great 👍
@gadgetswisdom9384 · 1 year ago
Nice video, keep it up
@shreyakatti5070 · 2 months ago
Amazing video.
@TheBigDataShow · 2 months ago
Thank you, Shreya, for your kind words
@kamalnayan9157 · 1 year ago
Great!
@dpatel9 · 1 year ago
This is a very useful example. Is there any way to optimise writing/inserting into SQL tables when we have millions of rows in a DataFrame?
@dataenthusiast_ · 1 year ago
Great explanation, Ankur. So in a production scenario, ideally we have to calculate the min/max for the bounds at runtime, right? We cannot hardcode the lowerBound/upperBound.
@TheBigDataShow · 1 year ago
Yes, correct. Most developers write code to dynamically determine these lower and upper bounds.
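One way to do that (a sketch with assumed names: the connection details, `order_id` column, and `orders` table are placeholders, and the sizing heuristic below is hypothetical) is to run one small bounds query first and feed the result into the partitioned read:

```python
import math

# Sketch: fetch MIN/MAX at runtime, then use them as the partition bounds.
# Assumes a live SparkSession `spark` and placeholder connection details:
#
# bounds = (spark.read.format("jdbc")
#           .option("url", "jdbc:mysql://localhost:3306/shop")
#           .option("query", "SELECT MIN(order_id) AS lo, MAX(order_id) AS hi FROM orders")
#           .load()
#           .first())
# lower_bound, upper_bound = bounds["lo"], bounds["hi"]

def suggest_num_partitions(row_count, rows_per_partition=1_000_000):
    """Hypothetical sizing heuristic: aim for roughly one partition per
    million rows; tune rows_per_partition for your cluster and source DB."""
    return max(1, math.ceil(row_count / rows_per_partition))
```

The extra MIN/MAX query is one more round trip per table, but it scans an indexed column and is usually cheap compared with reading the table itself.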
@shivanshudhawan7714 · 11 months ago
@TheBigDataShow I actually did the same: reading from MySQL (1053 tables, some really big, some medium, some small) and writing them to the Databricks raw layer. I was programmatically getting the lower and upper bound for each table and then using them to read the data in parallel, but that way my total hits to the source DB are doubled. Any advice you can provide on this?
@RohanKumar-mh3pt · 11 months ago
Hello sir, this is very helpful. Can you please make a video about what kinds of questions are asked in the data-pipeline design round and the possible ways to handle them?
@TheBigDataShow · 2 months ago
Please check the Data Engineering Mock Interview playlist. We have recorded more than 25 Data Engineering mock interviews.
@kalpeshswain8207 · 1 year ago
I have a doubt here: when we deal with tables from databases, we can use lowerBound and upperBound... but when we read flat files like CSV, can we use lowerBound and upperBound?
@TheBigDataShow · 1 year ago
If it is a file, then use a columnar file format like Parquet or ORC, or a row-based format like Avro. Columnar formats help with predicate pushdown and let you fetch your columns more quickly. CSV is a row-based and very simple format; it is not recommended for storing big data.
@TheBigDataShow · 1 year ago
Check my article for understanding Parquet, ORC and Avro www.linkedin.com/feed/update/urn:li:activity:6972381746185064448?