This channel is about simple programming, data analytics, and data science. There are also some small tips for working conveniently with the common tools we use for programming here.
@@digitalProgramLife really? I experience about a 10x to 20x performance bump. Beyond performance, pandas is challenged by a decade of accrued technical debt. Lastly, I find the syntax of polars to be far more pleasant and intuitive. All that to say, yeah, pandas is a disaster. It is foolish to start a new project in pandas.
Also, it is very, very bad practice to use sqlalchemy with pandas. I’ve experienced 100x performance lifts by just running SQL to export the queried data from a database / warehouse to local disk (as, say, parquet files) and then reading / scanning those files from disk into memory with polars. Sqlalchemy + pandas results in horrendous bottlenecks that can absolutely ruin basic workflows, even with tiny ~10M record tables, because of memory issues.
@@Charles-m7j I see your point; however, I still can't say that "pandas is a disaster". In many cases, the performance differences may not be as drastic, and the trade-offs between performance and ecosystem support, familiarity, and ease of use may favor pandas. Regarding performance, while you've experienced significant improvements with Polars, the differences can vary depending on the specific use case, dataset characteristics, and the operations being performed. For handling time series data and missing data, and for working with hierarchical and multi-indexed data structures, pandas is more mature and feature-rich. Besides, we should consider compatibility with legacy code. If you have existing codebases or workflows built around pandas, switching to a different library like Polars may require substantial refactoring and can potentially introduce compatibility issues with other libraries that depend on pandas. I agree Polars may offer advantages in certain scenarios, but it is an oversimplification to label pandas as a "disaster" :)
@@digitalProgramLife I disagree. Polars has excellent support for complex data. The Pandas “index” is a nightmare. I have been using Polars for arbitrarily nested hierarchical, sparse time series for nearly two years now and I would never go back to Pandas. The performance gains I am talking about vary, obviously, but they vary between 10x and 100x speedups… I agree that Pandas is “more mature” but that is not a good thing - it makes it harder to modernize its decade of accrued technical debt. It is not more “feature complete”, however. Sure, it has more “integrations” than Polars (e.g. matplotlib, xgboost) but this is relatively trivial to solve by converting a Polars dataframe to an Arrow table before passing that to any other module. Polars -> Arrow is a zero-copy conversion, so it is very memory efficient and fast… Lastly, compatibility is the _only_ reason Pandas should be in a modern codebase. Hence why I said I would not use Pandas for any new project. Frankly, it sounds like you have never tried to use Polars for any real project. Perhaps you should consider doing more research before creating more educational material.