I started following your playlist just 3 days ago. The way you explain things is amazing: concepts I could not get clear in the last 7-8 months became clear to me in just 3 days. Thanks a lot for creating such amazing content.
Hi Raja, one doubt: regarding splittable files, you said more than one core can access them. Doesn't that mean the file is spread over multiple partitions and is available for parallel processing?
Good question Karamveer. The data is distributed across nodes in the form of partitions, but that happens inside the cluster environment (in on-heap memory when we talk about Spark). What we are discussing here is file storage in an external system such as DBFS, S3, ADLS, HDFS, etc. When Spark reads a huge file from external storage and that file is not in a splittable format, it takes more time to distribute the data across nodes as partitions, because a non-splittable file cannot be read by multiple cores at the same time. Hope that is clear. Thanks for this good question.
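To make the idea above concrete, here is a minimal pure-Python sketch (no Spark needed, file names are made up for illustration): with an uncompressed file, each "core" can seek to its own byte range and read independently, which is what Spark does when it assigns one split per task; a gzip file, by contrast, is one continuous stream that a single reader must decompress from the start.

```python
import gzip
import os
import tempfile

# Build a small CSV file (stand-in for a "huge" file on external storage).
rows = "".join(f"id{i},value{i}\n" for i in range(1000))
tmp = tempfile.mkdtemp()
plain_path = os.path.join(tmp, "data.csv")
gz_path = os.path.join(tmp, "data.csv.gz")
with open(plain_path, "w") as f:
    f.write(rows)
with gzip.open(gz_path, "wt") as f:
    f.write(rows)

# Splittable (uncompressed): carve the file into byte ranges and read each
# range independently. (Real readers also align splits to line boundaries;
# that detail is skipped here.)
size = os.path.getsize(plain_path)
num_splits = 4
split_size = size // num_splits
chunks = []
with open(plain_path, "rb") as f:
    for i in range(num_splits):
        start = i * split_size
        end = size if i == num_splits - 1 else (i + 1) * split_size
        f.seek(start)
        chunks.append(f.read(end - start))
assert b"".join(chunks) == rows.encode()  # the ranges cover the whole file

# Non-splittable (gzip): there is no way to start decompressing from the
# middle of the stream, so one sequential reader must process it end to end.
with gzip.open(gz_path, "rt") as f:
    assert f.read() == rows
print("plain file: 4 independent byte-range reads; gzip: 1 sequential reader")
```

This is why a huge gzipped CSV lands in a single task while the same data uncompressed (or in a splittable format) fans out across many cores.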
Hey, Raja. I know that a Parquet file with the gzip codec is splittable. Of course, if we compress a CSV file with the gzip codec it won't be splittable. It would be nice if you could add some clarification.
Hi Kanstantsin, yes, you are right. A Parquet file with gzip is splittable, while a CSV with gzip is non-splittable. The difference is that Parquet applies the codec inside each row group, so readers can still split the file on row-group boundaries, whereas gzipping a CSV produces one continuous compressed stream. However, there are some workarounds for gzipped CSV files, such as reading them via the TextInputFormat API or pre-splitting the gzipped file into multiple pieces.
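The pre-splitting workaround mentioned above can be sketched with just the Python standard library (file names and part sizes are illustrative): stream-decompress the big .csv.gz once and write it back out as several smaller gzipped parts. Each part is still non-splittable on its own, but Spark can then read the parts in parallel, one core per part.

```python
import gzip
import os
import tempfile

# Make a sample gzipped CSV (stand-in for a large .csv.gz that Spark
# would otherwise have to read with a single core).
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "big.csv.gz")
with gzip.open(src, "wt") as f:
    for i in range(10_000):
        f.write(f"{i},row{i}\n")

# Pre-split: one sequential pass over the source, N smaller gzipped parts out.
lines_per_part = 2_500
parts = []
buf = []

def flush(buf, parts):
    # Write the buffered lines as the next gzipped part file.
    path = os.path.join(tmp, f"part-{len(parts):05d}.csv.gz")
    with gzip.open(path, "wt") as out:
        out.writelines(buf)
    parts.append(path)

with gzip.open(src, "rt") as f:
    for line in f:
        buf.append(line)
        if len(buf) == lines_per_part:
            flush(buf, parts)
            buf = []
if buf:  # flush any leftover tail
    flush(buf, parts)

# Sanity check: no rows were lost in the split.
total = sum(1 for p in parts for _ in gzip.open(p, "rt"))
print(len(parts), total)  # 4 10000
```

The split itself is still a single-threaded pass, so you pay that cost once; every downstream read of the parts can then be parallel.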
Thank you sir. If a huge file is not splittable, can we convert its compression format to make it splittable? If yes, how do we do that? Also, is there any scenario where Parquet/ORC/Avro is not splittable and needs a workaround? How do we resolve it? 👍
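On the conversion question, one option is to recompress to bzip2, which is block-based and treated as splittable by Hadoop/Spark. A minimal stdlib sketch (the input file here is a made-up example created in the script):

```python
import bz2
import gzip
import os
import tempfile

# Sample gzipped CSV (hypothetical input; in practice this would be
# an existing .csv.gz on your storage system).
tmp = tempfile.mkdtemp()
gz_path = os.path.join(tmp, "events.csv.gz")
with gzip.open(gz_path, "wt") as f:
    for i in range(5_000):
        f.write(f"{i},event{i}\n")

# Stream-recompress gzip -> bzip2. bzip2 is block-based, so Hadoop/Spark
# can split a .bz2 file across multiple cores; gzip cannot be split.
bz2_path = os.path.join(tmp, "events.csv.bz2")
with gzip.open(gz_path, "rb") as src, bz2.open(bz2_path, "wb") as dst:
    while chunk := src.read(1 << 20):  # 1 MiB at a time, bounded memory
        dst.write(chunk)

# Verify the round trip preserved every byte.
with gzip.open(gz_path, "rb") as a, bz2.open(bz2_path, "rb") as b:
    same = a.read() == b.read()
print(same)  # True
```

That said, on Spark the more common fix is to read the gzipped file once and write it back out as Parquet (`df.write.parquet(...)`), which is splittable regardless of the codec; the bzip2 route is mainly useful when the data must stay as plain CSV.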
Your course is the best. But one problem with your course is that you are not attaching a GitHub link for your sample data and code. As part of your audience, I request that you please do this. Thanks