38. Databricks | Pyspark | Interview Question | Compression Methods: Snappy vs Gzip 

Raja's Data Engineering
24K subscribers
12K views

Published: 21 Aug 2024

Comments: 26
@47shashank47 · 1 year ago
Just started following your playlist three days ago. The way you explain things is amazing; concepts I couldn't get clear in the last 7-8 months became clear in just three days. Thanks a lot for creating such amazing content.
@rajasdataengineering7585 · 1 year ago
Thanks Shashank, for your comment! Keep watching
@abhaybisht101 · 2 years ago
Nice content Raja 🤟
@rajasdataengineering7585 · 2 years ago
Thanks Abhay
@PavanKumar-tt8mm · 2 years ago
Good one, Raja. Today I learned a new topic. Thank you.
@rajasdataengineering7585 · 2 years ago
Thank you Pavan
@ajinkyamore8359 · 2 years ago
Really nice explanation. Thanks
@rajasdataengineering7585 · 2 years ago
Thank you
@azarudeena6467 · 2 years ago
Easy to understand
@rajasdataengineering7585 · 2 years ago
Thank you
@sravankumar1767 · 2 years ago
Nice explanation Raj 👌 👍
@rajasdataengineering7585 · 2 years ago
Thanks Sravan
@srinubathina7191 · 1 year ago
Thank You Sir
@rajasdataengineering7585 · 1 year ago
Most welcome
@karamveersolanki138 · 2 years ago
Hi Raja, one doubt regarding splittable files: you said more than one core can access them. Doesn't that mean the file is spread over multiple partitions and is available for parallel processing?
@rajasdataengineering7585 · 2 years ago
Good question, Karamveer. The data is distributed across nodes in the form of partitions, but that happens within the cluster environment (within on-heap memory when we talk about Spark). What we are discussing here is file storage in an external system such as DBFS, S3, ADLS, HDFS, etc. When Spark reads data from such an external system, a huge file that is not in a splittable format takes longer to distribute across nodes as partitions, because a non-splittable file can't be read by multiple cores at a time. Hope that is clear. Thanks for this good question.
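A minimal stdlib-only sketch of the property behind this answer: a plain gzip stream has no block boundaries, so decompression must start at byte 0. A second core handed an arbitrary byte range of a gzipped CSV cannot decode it, which is exactly why Spark can't split such a file across cores.

```python
import gzip
import zlib

# Compress some "CSV" data as a single gzip stream.
data = b"\n".join(b"row%d,value%d" % (i, i) for i in range(10_000))
compressed = gzip.compress(data)

# Reading from the start of the stream works fine.
assert gzip.decompress(compressed) == data

# But a worker that starts mid-stream (say, at the halfway point)
# cannot decompress its slice: gzip has no per-block entry points.
half = compressed[len(compressed) // 2:]
try:
    gzip.decompress(half)
    mid_stream_ok = True
except (gzip.BadGzipFile, zlib.error, EOFError):
    mid_stream_ok = False

print(mid_stream_ok)  # False: a second core can't start here
```

Snappy as used by Spark doesn't have this problem in practice, because it is applied per block inside container formats like Parquet and ORC rather than over the whole file.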
@kanstantsinhulevich4313 · 1 year ago
Hey, Raja. Note that a Parquet file with the gzip codec is splittable; of course, if we compress a CSV file with the gzip codec it won't be splittable. It would be nice if you could add some clarification.
@rajasdataengineering7585 · 1 year ago
Hi Kanstantsin, yes, you are right. A Parquet file with gzip is splittable by default, while a CSV with gzip is non-splittable by default. However, there are workarounds to split gzipped CSV files, such as reading them with the TextInputFormat API or pre-splitting the gzipped file into multiple pieces.
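The reason Parquet stays splittable under gzip is that the codec is applied per column chunk/page, not to the file as a whole, so each block can be decompressed independently. A rough stdlib analogy (not the actual Parquet layout): a file made of concatenated, independently compressed gzip members.

```python
import gzip

# Independently compress each block, then concatenate the members,
# mimicking Parquet's per-chunk compression (a loose analogy only).
blocks = [b"block-%d," % i * 100 for i in range(4)]
members = [gzip.compress(b) for b in blocks]
file_bytes = b"".join(members)

# A whole-file reader still sees all the data: gzip transparently
# handles concatenated members...
assert gzip.decompress(file_bytes) == b"".join(blocks)

# ...and, unlike a single-stream gzipped CSV, a reader handed just
# one member's byte range can decompress that block on its own.
offset = len(members[0])
one_member = file_bytes[offset:offset + len(members[1])]
assert gzip.decompress(one_member) == blocks[1]
print("independent blocks decode on their own")
```

In real Parquet files, the footer records each block's offsets, which is how readers find these independent entry points.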
@karthickrajachandrasekar8486
Hi Raja, thanks for the amazing explanation. One doubt: is there any way that, after compressing to gz, the same name will be shown?
@rajasdataengineering7585 · 1 year ago
Yes, the same name will be shown after compression
@karthickrajachandrasekar8486
@@rajasdataengineering7585 It shows up as something like part-00004. How do we keep the same name that the CSV was given?
@rajasdataengineering7585 · 1 year ago
If you want a specific name, the DataFrame can be converted to pandas and written with that name
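Besides the pandas conversion mentioned above, another common workaround is to write with a single partition and rename the resulting part file afterwards. The sketch below only simulates Spark's output layout with placeholder files (the part-file name and `sales_2024.csv.gz` are made up); on DBFS you would typically move the file with `dbutils.fs.mv` rather than `os.rename`.

```python
import glob
import os
import tempfile

out_dir = tempfile.mkdtemp()

# Simulate what df.coalesce(1).write.csv(out_dir, compression="gzip")
# leaves behind: one part file plus a _SUCCESS marker.
part_name = "part-00000-1a2b3c4d.csv.gz"  # hypothetical Spark output name
open(os.path.join(out_dir, part_name), "wb").close()
open(os.path.join(out_dir, "_SUCCESS"), "wb").close()

# Find the single part file and give it the name we actually want.
(src,) = glob.glob(os.path.join(out_dir, "part-*.csv.gz"))
dst = os.path.join(out_dir, "sales_2024.csv.gz")
os.rename(src, dst)

print(os.path.exists(dst))  # True
```

This avoids collecting the data to the driver, which the pandas route requires, so it scales to files that don't fit in driver memory.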
@deevjitsaha3168 · 1 day ago
I tried creating a Parquet file with gzip compression, but it created multiple part files. It was supposed to create one file, right?
@sohelsayyad5572 · 1 year ago
Thank you, sir. If a huge file is not splittable, can we convert its compression format to make it splittable, and if so, how do we do that? Also, is there any scenario where Parquet/ORC/Avro is not splittable and needs a workaround? How do we resolve it? 👍
@abhinavsingh1173 · 1 year ago
Your course is the best, but one problem: you are not attaching a GitHub link for your sample data and code. As part of your audience, I request you to please do this. Thanks.
@Mehtre108 · 7 months ago
Hello sir, what is the sequence to watch the videos in? Some are not in the playlist.