Apache Spark | Spark Interview Question | Spark Optimization { PartitionBy & Repartition }

Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
-----------------------------------------------------------
Apache Spark Interview Question - In this video, we will learn to answer the interview question: what is the difference between partitionBy and repartition in Apache Spark? We will walk through this Spark optimization technique with a small demo.
-----------------------------------------------------------
Dataset for you to work with:
github.com/aza...
Blog link to learn more on Spark:
www.learntospark.com
LinkedIn profile:
/ azarudeen-s-83652474
FB page:
/ learntospark-104523781...
#apachespark #spark #sparkoptimization

Published: 22 Aug 2024

Comments: 25

@akashprabhakar6353 3 months ago

Awesome video. Greatly explained

@occasionalvisitor6920 3 months ago

Hi, you should clearly explain why the source file shows only 1 partition after reading. Simply saying that the file is small does not give the audience clarity. Technically, many factors can contribute to how many partitions are created, such as parallelism, block size, input format and splitting logic, data locality, and cluster configuration.
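
As a rough illustration of the comment above, here is a simplified plain-Python model of how Spark's DataFrame file sources appear to size input splits for a single splittable file. It is a sketch under assumptions, not Spark's own code: the function name `estimate_read_partitions` is hypothetical, the defaults are assumed to mirror `spark.sql.files.maxPartitionBytes` (128 MB) and `spark.sql.files.openCostInBytes` (4 MB), and the model ignores multi-file packing, compression, and non-splittable formats.

```python
import math

MB = 1024 * 1024


def estimate_read_partitions(file_size_bytes, default_parallelism,
                             max_partition_bytes=128 * MB,
                             open_cost_bytes=4 * MB):
    """Estimate the partition count when reading ONE splittable file.

    Simplified model (an assumption, not Spark's actual implementation):
    the split size is capped by max_partition_bytes and floored by
    open_cost_bytes, with the data spread over the available cores.
    """
    # Each file is padded with an "open cost" before dividing across cores.
    total_bytes = file_size_bytes + open_cost_bytes
    bytes_per_core = total_bytes // default_parallelism
    max_split_bytes = min(max_partition_bytes,
                          max(open_cost_bytes, bytes_per_core))
    # A file is cut into chunks of at most max_split_bytes.
    return max(1, math.ceil(file_size_bytes / max_split_bytes))


small = estimate_read_partitions(1 * MB, default_parallelism=8)
large = estimate_read_partitions(1024 * MB, default_parallelism=8)
print("1 MB file:", small, "partition(s)")
print("1 GB file:", large, "partition(s)")
```

Under this model, a 1 MB file on 8 cores gets a 4 MB split size and therefore lands in a single partition, which is consistent with the small-file behavior the commenter asks about. The actual count reported by `df.rdd.getNumPartitions()` can differ depending on the input format and cluster configuration.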

@nareshvemula2204 3 years ago

Good work, Azar. I have gone through your videos from the last year; there are a lot of them. Could you please make one consolidated series of videos on big data engineer interview questions and answers, starting from HDFS, Sqoop, Hadoop, Apache Spark, Hive, Impala, etc.? Similarly, one more series of lectures for beginners to understand each concept in detail. If this is too big a task, could you please provide a list of sources and a learning path, with a clear-cut strategy and resources for both interviews and learning, and create separate playlists for them?

@vipinkumarjha5587 3 years ago

Superly explained...Thanks for sharing the knowledge

@Balajionceagain 3 years ago

Thanks Azar. It’s really helpful 👍

@prajjwaljaglan1092 3 years ago

Very informative and helpful videos... Thanks for sharing the knowledge 👍👍

@AzarudeenShahul 3 years ago

Thanks for your support :)

@bikersview9926 2 years ago

voice was so good azar

@deepjyotimitra1340 2 years ago

very nice explanation. 👌 👍

@AzarudeenShahul 2 years ago

Thanks for your support

@dhananjayreddy9998 2 years ago

After watching the complete video, I understood that repartition should be used when reading the file and partitionBy should be used when writing. repartition divides the data into the number of partitions passed to it. But I didn't get clarity on partitionBy. Could someone please explain it?
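
For the partitionBy part of the question above, a small plain-Python model may help. It is only a sketch of the on-disk layout, not Spark code: the helper `partition_by_layout` and the sample rows are hypothetical, and the model assumes the Hive-style `column=value` directory naming that `df.write.partitionBy(...)` produces, with the partition column moved out of the stored rows and into the directory name.

```python
from collections import defaultdict


def partition_by_layout(rows, column):
    """Model the directory layout of a write partitioned by `column`.

    Returns a dict mapping a 'column=value' directory name to the rows
    that would be stored under it (hypothetical helper, not a Spark API).
    """
    layout = defaultdict(list)
    for row in rows:
        # The partition value lives in the directory name, so it is
        # dropped from the row that is written into the data file.
        remaining = {k: v for k, v in row.items() if k != column}
        layout[f"{column}={row[column]}"].append(remaining)
    return dict(layout)


rows = [
    {"country": "IN", "amount": 10},
    {"country": "US", "amount": 20},
    {"country": "IN", "amount": 30},
]

layout = partition_by_layout(rows, "country")
for directory in sorted(layout):
    print(directory, layout[directory])
```

In other words, partitionBy decides which subdirectory each row is written to, while repartition decides how rows are spread across in-memory partitions. In Spark itself, each in-memory partition writes its own file into every directory it has rows for, so the two settings together determine how many files end up in each directory.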

@sravankumar1767 2 years ago

Superb

@Humanist1199 3 years ago

Excellent Azar

@ferrerolounge1910 1 year ago

Why do we need files in memory (for repartition)? Aren't files only used when storing to disk?

@SpiritOfIndiaaa 2 years ago

Thank you, please share the notebooks

@shilpasthavarmath5262 3 years ago

Useful.

@153dravid 2 years ago

Hi Azar, which is faster (repartition vs partitionBy vs coalesce) if we are dealing with 1 TB of data? Thank you

@localmartian9047 2 years ago

They do different things. partitionBy is for writing the DataFrame into separate subdirectories based on the partition column, while repartition and coalesce deal with the distribution of data in memory among executors. coalesce is used to reduce the number of partitions and tries to avoid a full shuffle, so it will be faster than repartition, which can either increase or decrease the number of partitions but does so with a full shuffle. But if you reduce the number of partitions so far that each one exceeds an executor's capacity, it can lead to OOM. So it depends on the problem you are solving, i.e. whether you want to reduce skew or distribute the data more widely.
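
The difference between repartition and coalesce described above can be sketched with a toy plain-Python model. This is an illustration under assumptions, not how Spark is implemented: the functions `repartition` and `coalesce` below are hypothetical stand-ins that operate on lists of lists, the round-robin redistribution stands in for a full shuffle, and the way `coalesce` picks which partitions to merge is a simplification.

```python
def repartition(partitions, n):
    """Toy full shuffle: every row may move, spread round-robin over n."""
    rows = [row for part in partitions for row in part]
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


def coalesce(partitions, n):
    """Toy merge: glue whole partitions together, never split or grow."""
    if n >= len(partitions):
        # Asking for more partitions than exist leaves the count unchanged.
        return [list(part) for part in partitions]
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Existing partitions are moved as units, so skew is preserved.
        out[i % n].extend(part)
    return out


# Four in-memory partitions, the first one heavily skewed.
skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]

print("coalesce(2)   :", [len(p) for p in coalesce(skewed, 2)])
print("repartition(2):", [len(p) for p in repartition(skewed, 2)])
print("coalesce(8)   :", len(coalesce(skewed, 8)), "partitions")
print("repartition(8):", len(repartition(skewed, 8)), "partitions")
```

In this model, coalesce is cheaper because it only merges existing partitions, but it keeps the skew and cannot increase the partition count, whereas repartition rebalances the rows at the cost of moving all of them. In PySpark the corresponding calls are `df.repartition(n)` and `df.coalesce(n)`.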

@creativeminds7397 3 years ago

Hello Azarudeen, your videos are awesome. I have one question; can you please provide the code? I want to decrypt files using a private key. All my files are PGP-encrypted, and both the files and the private key are stored in an S3 bucket. Please help me with the code.

@AzarudeenShahul 3 years ago

If possible, can you please share a sample file?