Intro to Python Dask: Easy Big Data Analytics with Pandas! 

Bryan Cafferky
41K subscribers
14K views

In this video, you will learn how to use Dask, a Python module that enables pandas code to run in parallel on your local machine or scaled out to multiple machines. No dataframe or numpy size limits and super-fast execution. Just pip install and go! It's that easy! Does it really work? Find out!
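To give a feel for how little changes, here is a minimal sketch of the pandas-to-Dask switch the video walks through. The file and column names (trips.csv, passenger_count, trip_distance) are placeholders, not necessarily the video's dataset:

```python
# pip install "dask[dataframe]"
import dask.dataframe as dd

# Same call shape as pandas.read_csv, but the file is read
# lazily, in partitions.
ddf = dd.read_csv("trips.csv")  # hypothetical file

# pandas-style operations build a task graph instead of executing...
result = ddf.groupby("passenger_count").trip_distance.mean()

# ...and nothing runs until .compute(), which executes the graph
# in parallel across your local cores.
print(result.compute())
```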
Join my Patreon Community and Watch this Video without Ads!
www.patreon.co...
Twitter: @BryanCafferky
Notebook with Code at:
github.com/bca...
See my Master Databricks and Apache Spark series:
• Master Databricks and ...

Published: Sep 8, 2024

Comments: 22
@Septumsempra8818 2 years ago
The Champ Cafferky!
@atanu4321 2 years ago
Thanks for the awesome introduction to Python Dask
@BryanCafferky 2 years ago
YW. Thanks for watching.
@arturkunz 2 years ago
15:30 the rounding difference is probably because you use the full dataset in ddf and only a part in pdf. Great introduction to Dask! At the moment I am only working with numpy for data engineering (deep learning with images). Would you say it makes sense to save images to pandas dataframes? It would probably make a lot of stuff easier, and with Dask even faster because of the parallelization.
@BryanCafferky 2 years ago
It may make it easier depending on what you are trying to do. Seems like keeping the image attributes with the image might make it easier to use both together.
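A rough sketch of that idea, keeping each image's pixel array next to its metadata in one dataframe. Everything here (file names, labels, random pixels) is made up for illustration:

```python
import numpy as np
import pandas as pd

# Three fake 64x64 RGB images standing in for real files.
images = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(3)]

df = pd.DataFrame({
    "file": ["img0.png", "img1.png", "img2.png"],  # hypothetical names
    "label": ["cat", "dog", "cat"],
    "pixels": images,  # object column holding the numpy arrays
})

# Attributes and pixels travel together, e.g. mean brightness per label.
df["brightness"] = df["pixels"].apply(lambda a: a.mean())
print(df.groupby("label")["brightness"].mean())
```

One caveat: object columns don't vectorize, so for heavy image pipelines dask.array is usually a better fit than a dataframe of arrays.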
@jamiew3986 2 years ago
Thanks for your video! I've recently been exploring Dask and realized that it only has a read_sql_table function, but no read_sql_query. I used to read SQL queries into Python using pyodbc/SQLAlchemy, but it looks like that's not possible with Dask.
@BryanCafferky 2 years ago
Yeah. I mean you could use pandas with SQL but that would not scale. You could create a view in the SQL database and query that with read_sql_table. It is a limitation. I think Spark wins on that one.
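A sketch of that view workaround, assuming a hypothetical view name, connection URI, and index column:

```python
import dask.dataframe as dd

# In the database, wrap the query once:
#   CREATE VIEW sales_summary AS SELECT ... ;
uri = "mssql+pyodbc://user:password@MyDSN"  # hypothetical SQLAlchemy URI

# Then query the view as if it were a table. read_sql_table needs an
# indexed column so Dask can split the read into partitions.
ddf = dd.read_sql_table("sales_summary", uri, index_col="order_id")
```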
@jamiew3986 2 years ago
@@BryanCafferky Thanks. I do have another question, maybe related to Dask or Jupyter Notebook. The data I am currently working with has over 50 million rows. It takes a long time to read in via pyodbc even though I read it in chunks. I then converted it to a Dask dataframe, hoping that would speed up the data manipulation. I called .persist() after adding 3 more columns, and it would run for 30 minutes until I got a memory error. I am using my company desktop (96GB available, 12 processors), so I'm surprised it's taking that much memory. Groupby seems to take a long time as well. I tried R's data.table and haven't hit a memory error yet. Have you encountered this situation before, or do you have any guess as to what could be causing it?
@BryanCafferky 2 years ago
@@jamiew3986 If possible, maybe create a table or view on the source database that limits the columns and rows to what you need. Also, trim large columns, e.g. use substring to get just part of a string. Then read that in and write it directly to a Parquet file. Once that has been done, load the Parquet file. Not sure why it's running out of memory, but you don't have scale-out; this would be easier in a cloud environment. Here are some links that may help:
docs.dask.org/en/stable/dataframe-sql.html
Measuring memory usage: docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.memory_usage.html
A blog use case with detailed info on optimization: blog.dask.org/2021/03/11/dask_memory_usage#:~:text=When%20possible%2C%20you%20can%20fine,wasting%20RAM%20or%20CPU%20cores.
Hope that helps.
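A sketch of that stage-to-Parquet approach; the query, connection string, and paths below are placeholders, not anything from the video:

```python
import os

import pandas as pd
import dask.dataframe as dd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://user:password@MyDSN")  # hypothetical
os.makedirs("staged", exist_ok=True)

# One-time pass: stream the trimmed-down data in chunks and write each
# chunk straight to disk, so the full 50M rows never sit in RAM at once.
query = "SELECT col1, col2, LEFT(big_text, 50) AS big_text FROM my_view"
for i, chunk in enumerate(pd.read_sql_query(query, engine, chunksize=500_000)):
    chunk.to_parquet(f"staged/part-{i:05d}.parquet", index=False)

# From then on, work from Parquet; Dask maps the files to partitions.
ddf = dd.read_parquet("staged/")
```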
@KOMPAJAM 2 years ago
When do you realize you have to leverage Dask on a DF? What error message would you get?
@BryanCafferky 2 years ago
When your dataframes are taking a significant chunk of available memory, it's good to think about trying Dask. You do get an error, though. See this blog: towardsdatascience.com/how-to-avoid-memory-errors-with-pandas-22366e1371b1
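One way to act on that rule of thumb before the error hits: compare the dataframe's footprint against free RAM. The file name and the 25% threshold below are arbitrary choices for illustration:

```python
import pandas as pd
import psutil  # pip install psutil

df = pd.read_csv("trips.csv")  # hypothetical file

df_bytes = df.memory_usage(deep=True).sum()  # deep=True counts strings too
free_bytes = psutil.virtual_memory().available

print(f"dataframe: {df_bytes / 1e9:.2f} GB, free RAM: {free_bytes / 1e9:.2f} GB")
if df_bytes > 0.25 * free_bytes:
    print("Taking a big share of memory -- consider moving this to Dask.")
```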
@KOMPAJAM 2 years ago
@@BryanCafferky Thanks Bryan!
@sawantamang2069 2 years ago
How reliable is it to use in production for data ingestion?
@I677000 2 years ago
Do I need to know Python prior to this course?
@BryanCafferky 2 years ago
Yes. pandas is a Python library, so this video won't be useful if you don't know Python and pandas.
@I677000 2 years ago
@@BryanCafferky I guess I need to finish a Python course first 😅
@BryanCafferky 2 years ago
@@I677000 Focus on pandas. You don't need to become an expert on all of Python. The book Python for Data Analysis by Wes McKinney is a good one.
@ericxls93 1 year ago
Great video as usual, thank you. But after about a week of hammering the subject, I could not load a data table from an Azure SQL database ☹️… back to pandas… (having to do loops to deal with the memory limits 🤦‍♂️)
@BryanCafferky 1 year ago
Hmmm..Ok. Sorry to hear that. For Azure SQL, Databricks might be a better option.
@ericxls93 1 year ago
@@BryanCafferky Would love to use Databricks, but the system is far too mature to change. I contacted Microsoft, and it appears my issue has to do with the version of SQLAlchemy… it only works with version 1.4 and below.
@ButchCassidyAndSundanceKid 1 year ago
Is Dask better than Polars?
@BryanCafferky 1 year ago
Dask is more like Apache Spark, i.e., it scales out. Polars is a single-machine dataframe library.