@DataMites help me understand why we have to stop using Dask around 200GB of data. Couldn't Dask handle terabytes of data if the script was run on a multi-node cluster that could scale up or down to meet data ingestion sizes on the fly?
Thanks for the clear explanation. I would like to practice more on Dask and PySpark... would you be in a position to recommend some tutorials? Thanks.
I need to calculate a function made of pandas and numpy with 10 lakh (1 million) iterations per set. It takes 2 hrs for each set. Can using Dask reduce this time?
Scikit-learn cannot be completely integrated with PySpark because the algorithms in scikit-learn are not implemented to be distributed; they are designed to run on a single machine.
Thanks for the info. I have used a pandas dataframe for fetching and performing metric calculations on ~25 million records on a daily basis. Question: can I still use a pandas dataframe for 200+ GB of data by adding more memory, without using Dask?
"Hi Sridhar Tondapi, thanks for reaching us with your query. Theoretically, you can use more memory processing to encounter large data but we will suggest using either pyspark or dask (there might be others too) according to the requirements. Due to their parallel computing and other internal mechanisms, you can free up remaining processing power for other useful works."
The benchmark is the size of the data, not the system configuration. Pandas can load data into the system's memory without any performance issue up to about 5 GB, but it takes some time to load data larger than 6 GB. If your system has 4 GB of RAM and you are trying to load data via pandas, it will definitely take time, since pandas first loads the data into memory and only then shows the head.
"Dask enables to store data that is larger than RAM unlike Pandas. Each of these is able to use data partitioned between RAM and a hard disk as well distributed across multiple nodes in a cluster. Pandas with chunksize is something where in we are explicitly specifying how data needs to be cut into smaller modules."
Thanks for the info. I am reading a SQL table with the help of a pandas dataframe, but when the table is very large, such as with 14,320,316 rows, pandas does not work. How can I connect to SQL with Dask or PySpark?
Hi, for Dask please go through this: docs.dask.org/en/latest/dataframe-sql.html . For PySpark you might need to set up a Spark environment, and I would suggest you go through the official documentation: spark.apache.org/docs/latest/api/python/getting_started/index.html
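For reference, a rough sketch of both routes; the connection URI, table name and index column ("id") below are placeholders you would replace with your own. Dask partitions the reads on an indexed column, while PySpark reads over JDBC and needs the JDBC driver jar on its classpath:

import dask.dataframe as dd
from pyspark.sql import SparkSession

# Dask: read the table in partitions, split on an indexed numeric/date column.
ddf = dd.read_sql_table(
    "my_table",
    "postgresql://user:password@host:5432/mydb",  # SQLAlchemy-style URI
    index_col="id",
)

# PySpark: read the same table over JDBC.
spark = SparkSession.builder.appName("sql-read").getOrCreate()
sdf = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/mydb")
    .option("dbtable", "my_table")
    .option("user", "user")
    .option("password", "password")
    .load()
)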
Hi Sir, I want to explore data validation between 2 tables and need to compare the data field by field. How can I achieve this using Python pandas? Are there any predefined libraries available to compare data for 2 tables in pandas?
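One way to do this with plain pandas; a sketch, assuming both tables share an "id" key column and the same set of columns (the file names are placeholders):

import pandas as pd

df1 = pd.read_csv("table_a.csv").set_index("id").sort_index()
df2 = pd.read_csv("table_b.csv").set_index("id").sort_index()

# Built-in field-by-field comparison (pandas >= 1.1); it requires both
# frames to have identical labels and shape, and returns only the cells
# that differ, labelled "self" vs "other".
diffs = df1.compare(df2)
print(diffs)

# Alternatively, an outer merge with indicator=True shows which rows
# exist in only one of the two tables.
overlap = df1.merge(df2, how="outer", left_index=True, right_index=True,
                    indicator=True, suffixes=("_a", "_b"))
print(overlap["_merge"].value_counts())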
I think Dask also supports distributed processing, much like Spark, so why is Dask not able to support more than 100 GB of data, or say 1 TB of data, just like Spark?
Dask supports 100+ GB of data (check the official site) and yes, it works as a distributed system. When working on a single laptop it utilizes all cores, and when working on a distributed system it uses all the nodes in the cluster.
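A small sketch of those two modes; the scheduler address and the file pattern below are placeholders:

from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

# Single laptop: a local cluster that uses all available cores.
client = Client(LocalCluster())

# Multi-node cluster: point the client at the running scheduler instead.
# client = Client("tcp://scheduler:8786")

# The same code runs in either mode; only the Client changes.
ddf = dd.read_csv("large_data_*.csv")
print(ddf.groupby("category")["amount"].mean().compute())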