Eh, so basically any growing dataset can only be partitioned one way: along the dimension of growth, which for many use cases is some meaningless "autoincrement" id. That defeats push-down filtering on every other dimension. Not to mention that if your data keeps growing in small increments and you need access to the latest of it, you'll have to jump through hoops to merge all those small increments into bigger files, because scanning 20,000 tiny files isn't going to be efficient. And that means lots of constant rewriting, which is why write speed DOES matter: it's not "write-once", it's write-many...
Great talk!!! I set up a Spark cluster with 2 workers. I save a DataFrame using partitionBy("column x") in Parquet format to a path on each worker. The thing is that I am able to save it, but if I want to read it back I get these errors:
- Could not read footer for file: FileStatus ......
- Unable to infer schema ...
Any suggestions?