I moved my entire stack to Polars from Pandas in the past month. It is more intuitive (I am a relatively new programmer), much faster, has a cleaner API and integrates better with databases. If you are just getting going on data analysis you should start with Polars. And there are plenty of resources out there to help you if you want to migrate.
I have been using pandas, Polars, and DuckDB heavily, and here's my two cents. 1. If you need a dataframe to work with, go with Polars, no doubt. In certain cases you may have to convert a dataframe from Polars to pandas, but that's quite rare and should be done at the end anyway, if necessary. 2. If you need fast queries for analysis, go with DuckDB. It can query Polars and pandas objects in memory directly. Very light, fast, and efficient. 3. Know how to use all of these tools. Don't stick with just one.
I started using it this month, even taking documentation from pandas projects and translating it with Claude and ChatGPT. I still go from datareader to pandas to Polars.
I also enjoy your occasional more exploratory and less explanatory videos. They are a motivation to go out and explore the "wonders of the world" and acquire more in-depth insight by oneself. This applies to both Polars and Hugging Face. Thank you for drawing my attention to both. But I also appreciate the videos in which you take the viewers by the hand. With Polars in particular, a hands-on video would definitely interest me. Your channel is amazing, rich, in-depth, no-nonsense. By now your roughly 800 videos provide more information than I was presented with in years of studying in the 80s at your same university. You truly enrich the world. Big achievement at your age. We are living in truly exciting times. A bottom-up approach based on first principles has long been a driver of human intellectual development, I'd say ever since Newton. AI allows us to tackle problems that are either not governed by first principles (such as Newton's, Maxwell's, Einstein's, or Feynman's laws, maths, and mathematical and logical proofs), or where following the path from first principles to a tangible and useful application is beyond our scope, possibly forever. It allows us to deal with chemistry, biology, and consciousness. There is a lot more to AI than just writing amazing programs like the ones you show. I wish to live on for quite a while longer to see this new form of thought evolve. Your videos make and keep me curious. I see things today that I'd have considered sci-fi until very recently.
Given the good press, I've tried to use Polars on a huge dataset. Unfortunately, I couldn't even read my data in, as no API was available. There isn't a file extension that you won't be able to read with Pandas. At this point, Pandas remains the better choice given its compatibility with the Python data science ecosystem.
Pandas has a ton of useful features that Polars lacks. Last time I used Polars, reading in Excel files and dealing with dates, especially dates with inconsistent formatting, was a major hassle. People like to shit on the pandas API but I don't find Polars that much better. Good thing there's DuckDB, which you can use on top of both. Troubleshooting will be much easier and faster with pandas due to its immense popularity. It almost feels like Polars is for the rare edge case where you have too much data for pandas to handle efficiently, but not enough data to warrant using distributed computing. Good for pipelines in certain cases but not necessarily EDA imho.
This is an excellent topic for a vid. I've not switched yet. Just to reiterate the point, pandas 2 supports a PyArrow backend, but it's not distributed. My bet is that pandas will catch up by version 3; it has a massive following and is under active development. I attended one of the original pandas public meetings before pandas 2, and their direction of travel is clear. Agreed on the apply function in pandas, and it's a great point to make in the video.
Strange comment about "pandas supporting Apache Arrow only since version 2" when pandas 2.0 was released in April 2023. Is there any reason not to use a recent pandas version today? One great feature in Polars is the option to express a query in SQL. It often makes the code much more readable.