
21 Broadcast Variables and Accumulators in Spark

Ease With Data
3.9K subscribers · 1.3K views

Video explains - What are distributed variables in Spark? How do they work? What is a broadcast variable? What are accumulators?
Chapters
00:00 - Introduction
02:24 - Broadcast Variable
06:57 - Accumulators
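
To get a quick feel for both concepts before watching, here is a minimal PySpark sketch (the department lookup, the accumulator, and the local SparkSession are illustrative, not the video's exact code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
sc = spark.sparkContext

# Broadcast variable: a read-only value shipped once to every executor
dept_names = sc.broadcast({1: "Sales", 2: "HR", 6: "Engineering"})

# Accumulator: a write-only counter that executors add to and the driver reads
unknown_depts = sc.accumulator(0)

def resolve(emp):
    name, dept_id = emp
    if dept_id not in dept_names.value:   # local read of the broadcast copy
        unknown_depts.add(1)              # note: updates inside transformations
    return (name, dept_names.value.get(dept_id, "Unknown"))  # may re-apply on retries

employees = sc.parallelize([("alice", 1), ("bob", 6), ("eve", 99)])
print(employees.map(resolve).collect())
print("Unknown departments seen:", unknown_depts.value)  # readable on the driver only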
Local PySpark Jupyter Lab setup - • 03 Data Lakehouse | Da...
Python Basics - www.learnpython.org/
GitHub URL for code - github.com/subhamkharwal/pysp...
The series provides a step-by-step guide to learning PySpark, the Python API for Apache Spark, a popular open-source distributed computing framework used for big data processing.
New video every 3 days ❤️
#spark #pyspark #python #dataengineering

Published: 15 Jul 2024

Comments: 13
@DEwithDhairy · 5 months ago
AWESOME
@devarajusankruth7115 · 1 month ago
Hi sir, what is the difference between a broadcast join and a broadcast variable? In a broadcast join a copy of the smaller DataFrame is also stored on each executor, so no shuffling happens across the executors.
@easewithdata · 1 month ago
Broadcast joins implement the same concept as broadcast variables; they simplify its use with DataFrames.
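As an illustration of that reply, a hedged sketch of the two forms (the DataFrames and column names here are made up, not from the video):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.master("local[*]").getOrCreate()

emp_df = spark.createDataFrame([("alice", 6), ("bob", 1)], ["name", "department_id"])
dept_df = spark.createDataFrame([(1, "Sales"), (6, "Engineering")],
                                ["department_id", "dept_name"])

# Broadcast join: the hint tells Spark to copy the small DataFrame to every
# executor, so the large side is joined in place without a shuffle
emp_df.join(broadcast(dept_df), "department_id").show()

# Broadcast variable: the same shipping mechanism, but as an explicit
# read-only object you look up yourself (e.g. inside map/UDF code)
dept_lookup = spark.sparkContext.broadcast(
    {r["department_id"]: r["dept_name"] for r in dept_df.collect()}
)
print(dept_lookup.value[6])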
@sushantashow000 · 13 days ago
Can accumulator variables be used to calculate an average as well? When calculating a sum, each executor can accumulate independently, but an average won't work the same way.
@easewithdata · 12 days ago
Hello Sushant, to calculate an average, the simplest approach is to use two accumulators: one for the sum and another for the count. Then divide the sum by the count to get the average. If you like the content, please make sure to share it with your network 🛜
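A small sketch of that two-accumulator approach (the names and sample values are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

total = sc.accumulator(0.0)   # running sum across all executors
count = sc.accumulator(0)     # running row count across all executors

def track(x):
    total.add(x)
    count.add(1)

# foreach is an action, so each value is accumulated exactly once
# (barring task retries, which can re-apply accumulator updates)
sc.parallelize([10.0, 20.0, 30.0]).foreach(track)

print("avg =", total.value / count.value)   # divided on the driver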
@sureshraina321 · 6 months ago
@8:50, I have one small doubt: we have already filtered on department_id == 6, so we won't have any department other than 6. Do we really need to groupBy(department_id) after filtering?
@easewithdata · 6 months ago
Yes, since the data is already filtered you can apply sum on it directly; the groupBy is not mandatory.
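A quick sketch of that equivalence (the DataFrame here is made up; the video's actual dataset is on GitHub):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
emp_df = spark.createDataFrame(
    [("alice", 6, 100), ("bob", 6, 200), ("carol", 2, 150)],
    ["name", "department_id", "salary"],
)

# After the filter, only department 6 remains, so both forms
# produce the same single-row total:
emp_df.filter("department_id = 6").groupBy("department_id").agg(F.sum("salary")).show()
emp_df.filter("department_id = 6").agg(F.sum("salary")).show()   # groupBy not needed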
@sureshraina321 · 6 months ago
@easewithdata Thank you 👍
@TechnoSparkBigData · 6 months ago
In the last video you mentioned that we should avoid UDFs, but here you used one to read the broadcast value. Will it impact performance?
@easewithdata · 6 months ago
Yes, we should avoid Python UDFs as much as possible. This example was just a demonstration of a use case for broadcast variables. You can always use a UDF written in Scala and registered for use in Python.
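For reference, a hedged sketch of the pattern being discussed, a Python UDF reading a broadcast variable (names and data are illustrative, not the video's exact code):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master("local[*]").getOrCreate()
emp_df = spark.createDataFrame([("alice", 6), ("bob", 99)], ["name", "department_id"])

dept_lookup = spark.sparkContext.broadcast({1: "Sales", 6: "Engineering"})

# The UDF reads the executor-local broadcast copy; fine for a demo, but each
# row pays Python serialization overhead, hence the advice to avoid Python UDFs
@udf(returnType=StringType())
def dept_name(dept_id):
    return dept_lookup.value.get(dept_id, "Unknown")

emp_df.withColumn("dept_name", dept_name("department_id")).show()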
@TechnoSparkBigData · 6 months ago
@easewithdata thanks
@at-cv9ky · 5 months ago
Please can you provide the link to download the sample data?
@easewithdata · 5 months ago
All datasets are available on GitHub. Check out the URL in the video description.