Uh, yeah, exactly. Why is that a bad thing? Python is a lot faster to write; C/C++/Rust (or any compiled language) is faster to run. Most of the time, when I profile, 5% of my code takes up 95% of the runtime. Rewriting that 5% in Rust or C lets me have my cake and eat it too.
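As a sketch of that workflow (the hot_loop function here is hypothetical, not from any particular project):

```python
import cProfile
import pstats

def hot_loop(n):
    # Hypothetical hotspot: the 5% of code eating 95% of the runtime,
    # and the candidate for a rewrite in Rust or C.
    total = 0.0
    for i in range(n):
        total += i ** 0.5
    return total

# Profile, dump stats to a file, then print the top offenders
cProfile.run("hot_loop(5_000_000)", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(5)
```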
For Polars to replace Pandas, they have to up their game in terms of integration. Pandas is the de facto library for data engineering and data science in Python, meaning tons of other libraries are integrated with pandas (SQLAlchemy, PySpark, Arrow, scikit-learn, matplotlib, etc.); basically any Python data engineering or data science library has an integration with Pandas. And you also have to count all the people who know Pandas and are working at making it faster with vectorization.
Polars also uses vectorization and has a quick and easy way to transform a polars dataframe to a pandas dataframe. In some benchmarks, it is faster to create a pandas dataframe via polars than using pandas directly.
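For anyone curious, the round trip is short. A minimal sketch (the file name is made up; to_pandas() goes through Arrow, which is part of why it can be cheap):

```python
import polars as pl

pl_df = pl.read_csv("data.csv")  # hypothetical file; Polars' CSV reader is often the fast path
pd_df = pl_df.to_pandas()        # Polars -> pandas
back = pl.from_pandas(pd_df)     # pandas -> Polars
```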
It'll take rewriting their interface to match pandas' interface. Then it'd pretty much be a drop-in replacement. Edit: having worked with it, I still think this, but the polars interface is better, so I'm thinking it should be a LazyFrame/DataFrame.pdcompat-type module.
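Nothing like pdcompat exists in polars today; this is purely a hypothetical sketch of the idea:

```python
import polars as pl

class PdCompatFrame:
    # Hypothetical pandas-flavored shim over a Polars DataFrame;
    # not a real polars module, just illustrating the concept.
    def __init__(self, df: pl.DataFrame):
        self._df = df

    def head(self, n=5):
        return self._df.head(n)

    def groupby(self, by):
        # pandas-style name delegating to Polars
        # (spelled group_by in recent polars releases, groupby in older ones)
        return self._df.group_by(by)
```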
Polars needs to be a drop-in replacement for pandas to be used in the field. Data scientists know how to use pandas, and switching over to something else has a steep learning curve; it might not be worth it, especially for a new project.
Another library that's "fast" in Rust because people never cared to learn C, and they squander the shit out of performance. Try writing the fast inverse square root in C vs Rust. You'll face the hard truth: modern programmers are way worse than the older ones were when performance mattered. Doom runs on my fridge. Try to run Rust on your coffee machine... good luck.
@@LordPompinchu666 in the specific case of Polars, one of the main reasons it is faster is multithreading, and Rust catches a lot of the potential bugs in that realm at compile time. Rust makes writing a multithreaded version of pandas feasible. Doing it in C would be a minefield.
@@robmulla me neither. The pandas ecosystem is much larger and more mature, and you can always convert a Polars dataframe to pandas and vice versa, so at the end of the day whatever gets the job done efficiently should be used.
Great question. For some tasks I think it will. It still lacks some functionality like native plotting. Look out for a full-length video I’m going to be making about polars soon.
I think they purposefully wanted to be different. There are already a lot of pandas alternatives that don't work too great. Polars is its own thing entirely.
Ugh, I am literally 2/3 of the way through refactoring an old project created by a former contractor, where I replaced his list and dict comprehensions with pandas... guess I gotta refactor my refactor.
@Rob Mulla the project compares very large datasets. Since pandas has a NumPy backend implemented in C, many of the operations are orders of magnitude faster than using dicts.
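Roughly the kind of difference being described; a toy comparison with made-up data:

```python
import numpy as np
import pandas as pd

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# dict/loop style: pure-Python work per element
slow = {i: a[i] * b[i] for i in range(n)}

# vectorized style: a single call into the C-backed NumPy kernel
df = pd.DataFrame({"a": a, "b": b})
fast = df["a"] * df["b"]
```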
Eventually people will realize that they save more time by just moving to 100% Julia instead of wasting all this time building everything in 1 language to execute it in another.
I used to be a die-hard pandas-user and just recently switched to polars - I am not going back. It's not just speed, it's data types (ok pandas 2.0 has made huge progress here), syntax, and the kind of no-bullshit-fuckarounds with indices. I fell in love with polars, especially with the now available api to hvplot
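To make the syntax point concrete, a small sketch with toy data (argument names vary a bit across polars versions; group_by is the recent spelling):

```python
import polars as pl

df = pl.DataFrame({"city": ["NY", "NY", "LA"], "sales": [10, 20, 5]})

# No index to set, reset, or trip over; everything is explicit expressions
out = (
    df.group_by("city")
      .agg(pl.col("sales").sum().alias("total_sales"))
      .sort("total_sales", descending=True)
)
```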
Yes, I have a longer video where I review polars on my channel and explain. Check it out here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-VHqn7ufiilE.html
Real Data Scientists will wait 10 hours for their Refresh Data in Excel. Patience is a virtue. All these new scallywags with their TikToks and 5-second attention spans looking for the fastest thing possible.
Alright, there are way too many options floating around right now. I spent the last week letting a modest gaming rig run 24/7 to convert a bunch of SAS7BDAT files into parquet files because the pyreadstat multithreaded reading in chunks didn't work as expected. For that same dataset, which is several hundred GBs on disk, I have to do some data wrangling, and I'm growing ill at the thought of how long it would take pandas to loop through it. Now I either risk learning dask, polars, or maybe even SQLite only to not get the desired results at a suitable speed, or stick to pandas. Thoughts?
I agree, it's hard to say what the best option is right now. I think the main questions I ask myself are: how fast do I need it to run, and can I do my computation on a single machine in local memory? The choice really depends on the answers to those questions.
Probably depends on the dataset. My understanding is polars works well for operations in a single machine's memory, pyspark is more for distributing across many nodes, and cudf is fast if your data can fit into GPU memory.
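For the multi-hundred-GB parquet case above, polars' lazy API may be the middle ground before reaching for dask or Spark. A sketch with made-up paths and column names:

```python
import polars as pl

# Nothing is read yet: scan_parquet builds a lazy query over all the files
lazy = pl.scan_parquet("converted/*.parquet")  # hypothetical path

result = (
    lazy.filter(pl.col("year") == 2020)        # hypothetical columns
        .group_by("id")
        .agg(pl.col("value").sum())
        .collect()                             # work happens here, with predicate pushdown
)
```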
@@robmulla ahh cool, interesting, so this polars thing is probably the best way to speed up data wrangling on a single computer (for at-home hobbyists). Another interesting thing would be df.apply(lambda x: ...) operations: how quickly can polars iterate through a dataset? I think that would be a huge game changer.
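For what it's worth, the big win in polars is usually replacing the row-wise lambda with a native expression rather than iterating faster. A sketch:

```python
import polars as pl

df = pl.DataFrame({"x": range(1_000_000)})

# Native expression: runs in parallel Rust, no Python callback per row
fast = df.with_columns((pl.col("x") * 2 + 1).alias("y"))

# Row-wise Python lambda still works but pays the interpreter cost per element
# (named map_elements in newer polars, apply in older releases)
slow = df.with_columns(
    pl.col("x").map_elements(lambda v: v * 2 + 1, return_dtype=pl.Int64).alias("y")
)
```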
@@robmulla thanks. Why isn't it computed on the GPU? Or can data mining tasks not be accelerated that way, just the neural networks themselves? I am a beginner; I bought an RTX 3060 12GB for basic tasks. Got a link to that video, please?
There are a few situations where it might be more appropriate to use Pandas over SQL for a particular task:
- When working with small or medium-sized datasets: Pandas is generally faster and more convenient than SQL for small or medium-sized datasets, especially if the data is already in a structured format (such as a CSV file).
- When the data is not stored in a database: if the data you are working with is not stored in a database, Pandas can read it directly from files without loading it into a database first.
- When you need to perform complex data manipulation tasks: Pandas provides a wide range of functions and methods for manipulating and summarizing data. This can be particularly useful when a task would be difficult or time-consuming to accomplish using SQL alone (see the sketch below).
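An example of that third case: a reshape that is one chained call in pandas but clumsy in vanilla SQL (file and column names are made up):

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical file with date, region, amount columns

# Month-by-region pivot: one chained expression in pandas
monthly = (
    df.assign(month=pd.to_datetime(df["date"]).dt.to_period("M"))
      .pivot_table(index="month", columns="region", values="amount", aggfunc="sum")
)
```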
@@robmulla yeah okay, that all sounds like it makes sense. So basically, SQL is for really large datasets and data that's already in a database, and there are certain "complex manipulation tasks" that can be done in pandas and not SQL.
@@pineapple3832 You got it. Also data exploration can be much easier when working with the data in the computer’s memory. Check my EDA video for some examples.
@@robmulla How would you compare its usefulness to R? I'm currently in college and I've taken a couple of courses using primarily R, and I'm split on which of the two languages I should focus on.
@@robmulla it's not gatekeeping. Seriously, are you dumb? If you want to get into programming, sure, here are some "lightning fast" beginner languages: Lua, Kotlin, Dart, Nim, Go, and many more. It's not gatekeeping, it's the truth. Python is slow, and nothing made with Python should be called "lightning fast" considering the same thing has been created in Go and runs 3 times faster. Also, the language for data science is Julia, not Python.
@@Onrirtopia ok. This package is written in Rust with a Python API. Most Python packages are written in C. Saying Python is slow is hilariously ignorant.
@@robmulla the API overhead still makes it slower than just using a native Go package. Above that, Cython (or CPython) is only as fast as the person it's written by. And yes, Python is interpreted, so if you write C code in Python that also has to be interpreted, making it slower than any compiled language, again. Stop trying to make up lies just to win an argument. Python is slow.