Thank you for the very helpful talk. I had already spent a bit of time trying to speed up my code, which makes heavy use of Pandas. I had made some good progress by using cProfile, avoiding iterrows and apply, and working with Pandas Series columns as much as possible. Based on this talk, I tried line_profiler for the first time, and it is really informative! Also, vectorization with NumPy arrays was so easy and made a huge difference. Thanks again.
Very informative and helpful talk for data scientists. I have been moving more toward vectorization after organically noticing the steep performance hit from iterrows and apply compared to vectorized operations.
Wow, I'm really going to take the NumPy array vs. Series performance gap to heart. All those index alignment operations must add significant overhead compared to strictly elementwise operations. I have a nasty Python Enum.Flag columnwise aggregation in my code that's taking a substantial amount of time. Thanks for the solid tips.
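A minimal sketch of what I mean (hypothetical column names): operating on the underlying NumPy arrays via `.to_numpy()` skips the Series index-alignment machinery while producing the same values.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame, just to illustrate the two code paths.
df = pd.DataFrame({"a": np.arange(1_000_000), "b": np.arange(1_000_000)})

# Series arithmetic goes through pandas' index-alignment machinery.
series_sum = df["a"] + df["b"]

# Dropping to the raw NumPy arrays skips alignment entirely.
array_sum = df["a"].to_numpy() + df["b"].to_numpy()

# Same values either way; the array version just avoids the overhead.
assert (series_sum.to_numpy() == array_sum).all()
```

The payoff grows with the number of elementwise operations chained together, since each Series operation pays the alignment cost again.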
Hi, this is a great talk. I was puzzled by the first part on vectorization: how on earth can the function accept a scalar or a vector without crashing? Then I realized that the haversine function contains only operations that are already defined in NumPy. Any function that is not built from array-native operations would not work this way. It may be worth pointing that out. :-)
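To illustrate the point, here is a sketch of a haversine function in the same spirit as the talk's (details assumed, not taken from the talk): because every operation in the body is a NumPy ufunc or plain arithmetic, the same function body handles scalars and arrays alike.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in km. Works on scalars or arrays
    because every operation here broadcasts via NumPy ufuncs."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371 * 2 * np.arcsin(np.sqrt(a))

# Scalar in, scalar out:
d_scalar = haversine(40.0, -74.0, 42.0, -71.0)

# Arrays in, array out -- the very same function body:
d_vec = haversine(np.array([40.0, 48.9]), np.array([-74.0, 2.3]), 42.0, -71.0)
```

Swap in something like `math.sin` or `math.atan2` and the array case breaks immediately, since the `math` module only accepts Python scalars.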
Well, I would strongly suggest that data scientists extend their teams with professional programmers if code performance becomes an issue. All Python libraries are, at some level, built on C code, and shortening the path to the native platform libraries only improves performance up to a point. The reason lies in the C compilers and the coding itself: C compilers accept dozens of flags that affect performance, and a solution can be written in dozens of ways. Both factors combined offer huge potential for optimization. Recently I tuned a C program that processes 20 GB on average (unfortunately sequentially, on one processor core) from 40 minutes down to 10 minutes without major source code changes. I could have done even better, but I hit the I/O throughput limits of the server...