Uh, yeah, exactly. Why is that a bad thing? Python is a lot faster to write; C/C++/Rust (or any compiled language) is faster to run. Most of the time, when I profile, 5% of my code takes up 95% of the runtime. Rewriting that 5% in Rust or C lets me have my cake and eat it too.
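As a sketch of that workflow (the hot_loop function here is hypothetical, not from any particular project):

```python
import cProfile
import pstats

def hot_loop(n):
    # Hypothetical hotspot: the 5% of code eating 95% of the runtime,
    # and the candidate for a rewrite in Rust or C.
    total = 0.0
    for i in range(n):
        total += i ** 0.5
    return total

# Profile, dump stats to a file, then print the top offenders
cProfile.run("hot_loop(5_000_000)", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(5)
```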
For Polars to replace Pandas, they have to up their game in terms of integration. Pandas is the de facto library for data engineering and data science in Python, meaning tons of other libraries are integrated with pandas (SQLAlchemy, PySpark, Arrow, scikit-learn, matplotlib, etc.); basically any Python data engineering or data science library has an integration with Pandas. And you also have to count all the people who know Pandas and are working at making it faster with vectorization.
Polars also uses vectorization and has a quick and easy way to transform a polars dataframe to a pandas dataframe. In some benchmarks, it is faster to create a pandas dataframe via polars than using pandas directly.
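For anyone curious, the round trip is short. A minimal sketch (the file name is made up; to_pandas() goes through Arrow, which is part of why it can be cheap):

```python
import polars as pl

pl_df = pl.read_csv("data.csv")  # hypothetical file; Polars' CSV reader is often the fast path
pd_df = pl_df.to_pandas()        # Polars -> pandas
back = pl.from_pandas(pd_df)     # pandas -> Polars
```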
It'll take rewriting their interface to match pandas' interface. Then it'd pretty much be a drop-in replacement. Edit: having worked with it, I still think this, but the polars interface is better, so I'm thinking it should be a LazyFrame/DataFrame.pdcompat-type module.
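Nothing like pdcompat exists in polars today; this is purely a hypothetical sketch of the idea:

```python
import polars as pl

class PdCompatFrame:
    # Hypothetical pandas-flavored shim over a Polars DataFrame;
    # not a real polars module, just illustrating the concept.
    def __init__(self, df: pl.DataFrame):
        self._df = df

    def head(self, n=5):
        return self._df.head(n)

    def groupby(self, by):
        # pandas-style name delegating to Polars
        # (spelled group_by in recent polars releases, groupby in older ones)
        return self._df.group_by(by)
```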
Polars needs to be a drop-in replacement for pandas to be used in the field. Data scientists know how to use pandas, and switching over to something else has a steep learning curve; it might not be worth it, especially for a new project.
Another library that's "fast" in Rust because people never cared to learn C, and they squander the shit out of performance. Try writing the fast inverse square root in C vs Rust. You'll face the hard truth: modern programmers are way worse than the older ones were when performance mattered. Doom runs on my fridge. Try to run Rust on your coffee machine... good luck.
@@LordPompinchu666 in the specific case of Polars, one of the main reasons it is faster is multithreading, and Rust catches a lot of the potential bugs in that realm at compile time. Rust makes writing a multithreaded version of pandas feasible. Doing it in C would be a minefield.
@@robmulla me neither. The pandas ecosystem is much larger and more mature, and you can always convert a Polars dataframe to pandas and vice versa, so at the end of the day whatever gets the job done efficiently should be used.
Great question. For some tasks I think it will. It still lacks some functionality like native plotting. Look out for a full-length video I’m going to be making about polars soon.
I think they purposefully wanted to be different. There are already a lot of pandas alternatives that don't work too great. Polars is its own thing entirely.
Ugh, I am literally 2/3 of the way through refactoring an old project created by a former contractor, where I replaced his list and dict comprehensions with pandas... guess I gotta refactor my refactor.
@Rob Mulla the project compares very large datasets. Since pandas has a NumPy backend implemented in C, many of the operations are orders of magnitude faster than using dicts.
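Roughly the kind of difference being described; a toy comparison with made-up data:

```python
import numpy as np
import pandas as pd

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# dict/loop style: pure-Python work per element
slow = {i: a[i] * b[i] for i in range(n)}

# vectorized style: a single call into the C-backed NumPy kernel
df = pd.DataFrame({"a": a, "b": b})
fast = df["a"] * df["b"]
```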
Eventually people will realize that they save more time by just moving to 100% Julia instead of wasting all this time building everything in 1 language to execute it in another.
I used to be a die-hard pandas-user and just recently switched to polars - I am not going back. It's not just speed, it's data types (ok pandas 2.0 has made huge progress here), syntax, and the kind of no-bullshit-fuckarounds with indices. I fell in love with polars, especially with the now available api to hvplot
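To make the syntax point concrete, a small sketch with toy data (argument names vary a bit across polars versions; group_by is the recent spelling):

```python
import polars as pl

df = pl.DataFrame({"city": ["NY", "NY", "LA"], "sales": [10, 20, 5]})

# No index to set, reset, or trip over; everything is explicit expressions
out = (
    df.group_by("city")
      .agg(pl.col("sales").sum().alias("total_sales"))
      .sort("total_sales", descending=True)
)
```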
Yes, I have a longer video where I review polars on my channel and explain. Check it out here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-VHqn7ufiilE.html
Real Data Scientists will wait 10 hours for their Refresh Data in Excel. Patience is a virtue. All these new scallywags with their TikToks and 5-second attention spans looking for the fastest thing possible.
Alright, there are way too many options floating around right now. I spent the last week letting a modest gaming rig run 24/7 to convert a bunch of SAS7BDAT files into parquet files because the pyreadstat multithreaded reading in chunks didn't work as expected. For that same dataset, which is several hundred GBs on disk, I have to do some data wrangling, and I'm growing ill at the thought of how long it would take pandas to loop through it. Now I either risk learning dask, polars, or maybe even SQLite only to not get the desired results at a suitable speed, or stick to pandas. Thoughts?
I agree, it's hard to say what the best option is right now. I think the main questions I ask myself are: how fast do I need it to run, and can I do my computation on a single machine in local memory? The choice really depends on the answers to those questions.
Probably depends on the dataset. My understanding is polars works well for operations in a single machine's memory, pyspark is more for distributing across many nodes, and cudf is fast if your data can fit into GPU memory.
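For the multi-hundred-GB parquet case above, polars' lazy API may be the middle ground before reaching for dask or Spark. A sketch with made-up paths and column names:

```python
import polars as pl

# Nothing is read yet: scan_parquet builds a lazy query over all the files
lazy = pl.scan_parquet("converted/*.parquet")  # hypothetical path

result = (
    lazy.filter(pl.col("year") == 2020)        # hypothetical columns
        .group_by("id")
        .agg(pl.col("value").sum())
        .collect()                             # work happens here, with predicate pushdown
)
```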
@@robmulla ahh cool, interesting, so this polars thing is probably the best way to speed up data wrangling on a single computer (for at-home hobbyists). Another interesting thing would be df.apply(lambda x: ...) operations: how quickly can polars iterate through a dataset? I think that would be a huge game changer.
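For what it's worth, the big win in polars is usually replacing the row-wise lambda with a native expression rather than iterating faster. A sketch:

```python
import polars as pl

df = pl.DataFrame({"x": range(1_000_000)})

# Native expression: runs in parallel Rust, no Python callback per row
fast = df.with_columns((pl.col("x") * 2 + 1).alias("y"))

# Row-wise Python lambda still works but pays the interpreter cost per element
# (named map_elements in newer polars, apply in older releases)
slow = df.with_columns(
    pl.col("x").map_elements(lambda v: v * 2 + 1, return_dtype=pl.Int64).alias("y")
)
```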
@@robmulla thanks. Why isn't it computed on the GPU? Or can data mining tasks not be accelerated that way, just the neural networks themselves? I am a beginner; I bought an RTX 3060 12GB for basic tasks. Got a link to that video, please?
There are a few situations where it might be more appropriate to use Pandas over SQL for a particular task:
- When working with small or medium-sized datasets: Pandas is generally faster and more convenient than SQL for small or medium-sized datasets, especially if the data is already in a structured format (such as a CSV file).
- When the data is not stored in a database: if the data you are working with is not stored in a database, Pandas can read it directly from files without loading it into a database first.
- When you need to perform complex data manipulation tasks: Pandas provides a wide range of functions and methods for manipulating and summarizing data. This can be particularly useful when a task would be difficult or time-consuming to accomplish using SQL alone (see the sketch below).
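An example of that third case: a reshape that is one chained call in pandas but clumsy in vanilla SQL (file and column names are made up):

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical file with date, region, amount columns

# Month-by-region pivot: one chained expression in pandas
monthly = (
    df.assign(month=pd.to_datetime(df["date"]).dt.to_period("M"))
      .pivot_table(index="month", columns="region", values="amount", aggfunc="sum")
)
```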
@@robmulla yeah okay, that all sounds like it makes sense. So basically, SQL is for really large datasets and data that's already in a database, and there are certain "complex manipulation tasks" that can be done in pandas and not SQL.
@@pineapple3832 You got it. Also data exploration can be much easier when working with the data in the computer’s memory. Check my EDA video for some examples.
@@robmulla How would you compare its usefulness to R? I'm currently in college and I've taken a couple of courses using primarily R, and I'm split on which of the two languages I should focus on.
@@robmulla it's not gatekeeping. Seriously, are you dumb? If you want to get into programming, sure, here are some "lightning fast" beginner languages: Lua, Kotlin, Dart, Nim, Go, and many more. It's not gatekeeping, it's the truth. Python is slow, and nothing made with Python should be called "lightning fast" considering the same thing has been created in Go and runs 3 times faster. Also, the language for data science is Julia, not Python.
@@Onrirtopia ok. This package is written in Rust with a Python API. Most Python packages are written in C. Saying Python is slow is hilariously ignorant.
@@robmulla the API overhead still makes it slower than just using a native Go package. Above that, Cython (or CPython) is only as fast as the person it's written by. And yes, Python is interpreted, so if you write C code in Python that also has to be interpreted, making it slower than any compiled language, again. Stop trying to make up lies just to win an argument. Python is slow.