How do I make my pandas DataFrame smaller and faster?

67K views

Are you working with a large dataset in pandas, and wondering if you can reduce its memory footprint or improve its efficiency? In this video, I'll show you how to do exactly that in one line of code using the "category" data type, introduced in pandas 0.15. I'll explain how it works, and how to know when you shouldn't use it.
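
As a rough illustration of the "one line" mentioned above, here is a minimal sketch. The DataFrame and its `continent` column are made-up sample data, not the dataset from the video, and the exact byte counts will vary by pandas version.

```python
import pandas as pd

# Hypothetical data: a string column with few unique values, repeated many times.
df = pd.DataFrame({"continent": ["Asia", "Europe", "Africa", "Europe"] * 10_000})

before = df["continent"].memory_usage(deep=True)

# The "one line": store the column as integer codes plus a small lookup table.
df["continent"] = df["continent"].astype("category")

after = df["continent"].memory_usage(deep=True)
print(f"before: {before} bytes, after: {after} bytes")
```

The saving comes from repetition: the fewer distinct values relative to the number of rows, the more the category type tends to help.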
SUBSCRIBE to learn data science with Python:
www.youtube.co...
JOIN the "Data School Insiders" community and receive exclusive rewards:
/ dataschool
== RESOURCES ==
GitHub repository for the series: github.com/jus...
"info" documentation: pandas.pydata.o...
"memory_usage" documentation: pandas.pydata.o...
"astype" documentation: pandas.pydata.o...
Overview of categorical data in pandas: pandas.pydata.o...
API reference for categorical methods: pandas.pydata.o...
== LET'S CONNECT! ==
Newsletter: www.dataschool...
Twitter: / justmarkham
Facebook: / datascienceschool
LinkedIn: / justmarkham

Published: 12 Sep 2024

Comments: 245

@dataschool 6 years ago

Starting in pandas version 0.19, you can create a category column during the file reading process! Learn more here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE--NbY7E9hKxk.html And starting in pandas 0.21, the method for specifying ordered categories has changed. Learn the new method here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-te5JrSCW-LY.html
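
A minimal sketch of creating a category column while reading a file, assuming a small in-memory CSV in place of a real file. The column names are illustrative only.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (hypothetical contents).
csv_text = "country,continent\nAlbania,Europe\nAlgeria,Africa\nAndorra,Europe\n"

# The dtype argument accepts a per-column mapping, so the column never
# exists as plain strings in memory.
drinks = pd.read_csv(io.StringIO(csv_text), dtype={"continent": "category"})
print(drinks.dtypes)
```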

@WaltterValdez 8 years ago

Thanks, I reduced my data from 592.4 MB to 195.0 MB using categories. That's amazing!!!

@dataschool 8 лет назад

That is awesome!!

@ilyastrojnov7627 3 years ago

Remember, with big data you should use pd.eval and df.query for filtering; these functions don't use memory for a temporary boolean Series.
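
For comparison, a small sketch of the two filtering styles on made-up data. Whether `query` actually saves memory depends on the optional numexpr engine being installed and on the size of the data, so treat that part as an assumption to measure rather than a guarantee.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"beer": [0, 89, 25, 245], "wine": [0, 54, 14, 312]})

# Boolean-mask filtering builds intermediate boolean Series first.
masked = df[(df["beer"] > 20) & (df["wine"] > 50)]

# query() takes the same condition as a string expression.
queried = df.query("beer > 20 and wine > 50")
```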

@BadriNathJK 8 лет назад

I am recommending your channel to all my friends. You are too good.

@dataschool 8 лет назад

Wow, thank you!

@amandal8170 4 года назад

Yes, he is too good. Even our professor recommended learn pandas from him. lol.

@amandal8170 4 года назад

@@dataschool Thanks a lot. Could we have some of R shiny or Python visualisation ? Like your teaching style.

@readtilleternity 6 лет назад

Dude, you are awesome! This is THE best tutorial on Pandas I have come across on the internet. You are really doing the internet a great favor! Thanks a lot!

@dataschool 5 лет назад

Wow! Thank you so much for your kind words! :) You are very welcome.

@andreacazzaniga8488 7 лет назад

very useful! I was still a bit skeptical but the example with the country series made it all very clear! you are good at giving the best frame to understand things

@dataschool 7 лет назад

Excellent! Glad to hear that this video was helpful to you.

@jiwonkim5315 5 лет назад

You’re amazing at explaining, thanks for uploading these content

@dataschool 5 лет назад

You're very welcome! Thanks for your kind comments :)

@JSchellergJ 5 лет назад

Good lord man, this is awesome and your way of teaching is well paced and easy to follow. You're a incredible teacher, keep this way and you will hit the stars!

@dataschool 5 лет назад

Thanks so much for your kind words! Much appreciated!

@JR-di9uk 6 years ago

You should mention that if you perform a df['mycolumn'].astype('category'), you won't be able to enter arbitrary strings into the DataFrame anymore (write ops are limited to the exact categories). This may be an advantage (typo protection) or a disadvantage, depending on the use case! Otherwise, thanks for the concise and clear instructions!
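
A minimal sketch of that restriction, using a made-up Series. The exception type raised for an unknown value has differed between pandas versions, so the example catches both candidates instead of assuming one.

```python
import pandas as pd

s = pd.Series(["good", "excellent", "good"]).astype("category")

# Writing a value that is already a category works.
s.iloc[0] = "excellent"

# Writing an unknown string is rejected.
try:
    s.iloc[1] = "excelent"  # typo on purpose
    rejected = False
except (TypeError, ValueError):
    rejected = True

print(rejected)
```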

@dataschool 6 лет назад

That's a great point, thank you for bringing it up! I really appreciate it.

@FabioRBelotto 1 year ago

I understand that a category column only accepts the values already used in it, but what should I do when I need to edit it? For example, for gender I used to have Male or Female. Now I need to store many other values. How do I edit or extend the category list?
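
One possible approach, sketched with made-up labels: the `cat` accessor has `add_categories` and `rename_categories` methods for changing the set of allowed values.

```python
import pandas as pd

s = pd.Series(["Male", "Female", "Female"]).astype("category")

# Register the new label first; after that it can be assigned like any other value.
s = s.cat.add_categories(["Non-binary"])
s.iloc[2] = "Non-binary"

# rename_categories changes a label everywhere it is used.
s = s.cat.rename_categories({"Male": "M"})
```

The accessor also offers `remove_categories` and `set_categories` for the opposite direction.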

@Diachron 7 лет назад

Well I must sound like a broken record about how good these videos are but they only get better. I've come close on occasion to manually implementing what the category dtype does, so thanks for that revelation.

@dataschool 7 лет назад

Thank you! I'm glad the category tip was helpful to you!

@fruitfcker5351 5 years ago

If anyone is seeing a FutureWarning when specifying categories, instead of:
df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True)
use:
quality_dtype = pd.api.types.CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True)
df['quality'] = df.quality.astype(quality_dtype)
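
Put together as a runnable sketch: the DataFrame here is a small made-up stand-in for the one in the video, and the sort and comparison at the end show what the ordering buys you.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical sample data.
df = pd.DataFrame({
    "ID": [100, 101, 102, 103],
    "quality": ["good", "very good", "good", "excellent"],
})

quality_dtype = CategoricalDtype(
    categories=["good", "very good", "excellent"], ordered=True
)
df["quality"] = df["quality"].astype(quality_dtype)

# Sorting and comparisons now follow the logical order, not the alphabet.
ranked = df.sort_values("quality")
better_than_good = df.loc[df["quality"] > "good", :]
```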

@dataschool 5 лет назад

Right! The API changed in pandas 0.21. More details here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-te5JrSCW-LY.html

@UndoubtablySo Год назад

category feature super powerful, glad i learnt this

@dataschool 11 месяцев назад

Great to hear!

@sibinh 7 лет назад

Really useful tips. Thanks Kevin.

@dataschool 7 лет назад

You're very welcome!

@nitishkumar-bk8kd 4 года назад

loved ur explanation, great teacher

@dataschool 4 года назад

Thank you! 😃

@jolespin 5 лет назад

Possible new topic: Methods in pandas that are not well known to most users. I've been using pandas for years and didn't know about the `cat`, `str`, and `memory_usage` methods. I'm familiar with `groupby`, `applymap`, `map`, etc. but it would be cool if you could show case some other methods that are less well known to the common users. Thanks

@dataschool 5 лет назад

Great suggestion, thanks!

@ahmadmponda3294 Год назад

Thank you a million. being struggling with inplace returning none type df most of the time.

@silverahmad 4 years ago

Amazing as always. This entire playlist is in my favorites bar now! I have a quick question. I tried the bonus tip on the drinks by continent DataFrame just to see how it works: drinks['continent']=drinks.continent.astype('category', categories=['South America', 'Africa', 'North America', 'Europe', 'Asia', 'Oceania'], ordered=True) and I get this error: TypeError: astype() got an unexpected keyword argument 'categories'. Any idea why?

@vinayakmaheshwari3697 6 лет назад

Can you make a video on how to merge, join and concatenate in python and also differences between these. Nice videos by the way!

@dataschool 6 лет назад

Thanks for your suggestion, I'll consider it! :)

@pldeepesh 5 years ago

This is one of the coolest tutorials I have watched on pandas. Thanks for making it. I have a question though: would these categories improve the speed of a for loop if I use iterrows() on the DataFrame?

@dataschool 5 лет назад

Thanks for your kind words! As for your question, I'm not sure, sorry!

@Kralnor 4 года назад

Using iterrows() in pandas is an anti-pattern and should only be done as a last resort. See engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
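
To make the contrast concrete, a tiny sketch on made-up data. Both versions compute the same totals; the second avoids the Python-level loop entirely.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"beer": [89, 25, 245], "wine": [54, 14, 312]})

# Row-by-row loop: each row is materialized as a Series, which is slow.
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["beer"] + row["wine"])

# Vectorized version: one operation over whole columns.
totals_vectorized = df["beer"] + df["wine"]
```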

@uguree 3 years ago

Specifying a custom ordered category is now a bit different:
from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True)
df.quality.astype(cat_type)

@hsrayyar 3 года назад

thanks! It works!

@GregHacob 8 лет назад

Very useful tips. You make pandas easy to understand. Thank you!

@dataschool 8 лет назад

You're very welcome!

@biswajitpatowary5784 7 лет назад

Thats too good. Can you plz come up with tutorial videos of Matplotlib?

@dataschool 7 лет назад

Thanks for the suggestion! :)

@kp9834 4 года назад

Thank you for an excellent video on writing memory efficient code with categorical data in input. I'm interested in understanding various options to read in large dataframes (other than common pandas and spark methods) containing only numerical data, iterate over its length, create smaller dataframe out of it based on a condition and do some processing, all of which in a faster and memory efficient way. Please cover it if possible.

@dataschool 4 года назад

Thanks for your suggestion!

@s.baskaravishnu22 7 лет назад

I very much congratulate you for sharing code used in video with us. Many thanks for that. It is very much useful to me. My warm regards to you.

@dataschool 7 лет назад

You're welcome!

@FabioRBelotto 1 year ago

I usually have to work with very large data samples, even for simple analysis. The main issue I face is that pandas takes more time to read/store the DataFrame than to work on it. Sadly, it is quicker and easier to just run some extractions using SQL, since it runs on the database server, than to import the data to my local machine.

@jaikapoor3666 4 years ago

Why does .info() have parentheses? Isn't it an attribute of the DataFrame?
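
A short sketch of the difference between an attribute and a method, on a made-up DataFrame: `shape` is a stored value, while `info` is a function that does work when called.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# shape is an attribute: it is just a value, so no parentheses.
print(df.shape)

# info is a method: it runs code when called, so it needs parentheses.
df.info()

# Without the parentheses you only get a reference to the method itself.
print(callable(df.info))
```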

@Russel4973 8 лет назад

Great explanation! Never knew about "category" before.

@dataschool 8 лет назад

Thanks! It's so useful, I knew I had to cover it in the video series!

@mrmuranga 3 года назад

Amazing....I enjoy learning from the channel

@dataschool 3 года назад

Thank you!

@jolespin 5 лет назад

Didn't know about the memory_usage, cat, str, etc. Nice!

@dataschool 5 лет назад

Thanks!

@senupranesh 5 years ago

Amazing explanation along with hands-on practice. I am really stunned by the way of teaching. Thank you very much. Your accent sometimes reminds me of Bruce Lee.

@dataschool 5 лет назад

Thank you!

@gcm4312 8 лет назад

Very useful!

@dataschool 8 лет назад

Agreed! It's surprising that it's not more widely known! I'm trying to change that :)

@AbrahamHoffman 8 лет назад

Yeah, this one was totally awesome. Thanks for making the videos!

@dataschool 8 лет назад

Ha! Thank you for the comment! And you are very welcome, I enjoyed making these videos.

@jewel3761 4 года назад

Why did sort_values() method not work in line 9 and instead you used sorted()?

@Leonardo-jv1ls 5 лет назад

Man. You are insanely good.

@dataschool 5 лет назад

Thank you! 😊

@haoshanduan6314 6 лет назад

The way you wrote categories is deprecated, now we need to write like this: from pandas.api.types import CategoricalDtype df['quality']=df.quality.astype('category') CategoricalDtype(['good','very good','excellent'], ordered=True)

@dataschool 6 years ago

Thanks! You're correct that this changed in pandas 0.21. However, I think this is the correct substitute code, which is slightly different from what you wrote:
from pandas.api.types import CategoricalDtype
cat = CategoricalDtype(['good','very good','excellent'], ordered=True)
df['quality']=df.quality.astype(cat)
Hope that helps!

@dataschool 6 лет назад

I discuss the new syntax for specifying categories in my latest video, "5 new changes in pandas you need to know about": ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-te5JrSCW-LY.html

@mmimpositive 5 лет назад

How to make the output to appear in a tabular form as is shown in your video? This gives the better clarity of data.

@dataschool 5 лет назад

The way the output looks is determined by your editor. I'm using the Jupyter notebook, though note that the output varies even across different versions of the notebook.

@jaikishank 3 года назад

Great explanation .Thank you.

@dataschool 3 года назад

You are welcome!

@nelsonmacy1010 3 года назад

Brilliant video! Thx.bonus was awesome

@dataschool 3 года назад

Glad you enjoyed it!

@hoegwonkim1727 5 years ago

I should have found your channel earlier! Thanks for sharing a great video.

@dataschool 5 лет назад

😄

@itsme.samrat 4 года назад

loved this part

@dataschool 3 года назад

Thanks!

@sunoreal 4 years ago

There are no 'categories' or 'ordered' parameters in the astype() method. I use pandas version 0.25.1. So, how do I set a priority in this version? Oh, you did explain in your message. Thank you.

@dataschool 4 года назад

This should help: nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas_changes.ipynb

@FabioRBelotto 1 year ago

What is the number of non-unique values at which it is still worth converting to a category?

@oliverf2924 3 года назад

Great tutorial, thank you

@dataschool 3 года назад

You are welcome!

@LonglongFeng 7 лет назад

question: at 5:20, when you coded drinks.memory_usage(deep=True).sum(), it gave '24920L'. What does the 'L' mean after the figure? I think I seemed to see the 'L' thing appears when using the '.shape' function. what does that 'L' mean?

@dataschool 7 лет назад

L stands for "long", which I believe refers to the "long integer" type, which is the NumPy data type being used to store that data. In other words, it's an implementation detail that you don't really need to know. Hope that helps!

@vvasani 7 лет назад

You are the best! I'm feeling Lucky that I found your channel at right time in my learning path ...Thanks a lot! I have one question here. could you please help understanding general idea behind using 'categories' in astype method since it is not a pre-defined parameter in method documentation (if we click shift+ tab :) )? I mean what all parameters we can use in place of kwargs in an instancemethod just like we used 'categories' here? (All properties/attributes of an object?)

@dataschool 7 лет назад

Glad you like the videos! Please consider subscribing to the Data School mailing list: www.dataschool.io/subscribe/ Regarding your question, I don't know how to explain the technical details behind why you can pass the argument 'categories' in this case, other than to say that it's because the pandas code has been written to allow that argument. I'm sorry if that's not what you were looking for!

@niteshsrivastava6504 4 года назад

Thanks for ur knowledge sharing. My question is how this category is different from label encoding. They do the same thing?

@dataschool 4 года назад

Great question! When using the category data type, you are defining how pandas stores that column of data. However, you still treat that column as strings when working with it within pandas. With label encoding, your goal is to convert categories to numbers so that you can work with the numbers, not the strings. Does that answer your question?
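
A small sketch of that distinction on made-up data: you keep filtering by the string labels, while the integers pandas stores underneath are only visible through the `cat` accessor.

```python
import pandas as pd

s = pd.Series(["Asia", "Europe", "Africa", "Europe"]).astype("category")

# You still work with the labels, even though integers are stored underneath.
europe_rows = s[s == "Europe"]

# The stored integers and the lookup table are available through the cat accessor.
codes = s.cat.codes.tolist()
labels = s.cat.categories.tolist()
print(codes, labels)
```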

@serdarb8995 6 лет назад

You are great Kevin

@dataschool 6 лет назад

Thanks! You are great Serdar!

@ItsWithinYou 2 года назад

As usual, great lesson. Many thanks!

@dataschool 2 года назад

Thank you!

@priyankrajsharma 5 лет назад

awesome tutorial.. you made it so easy

@dataschool 5 лет назад

Thanks!

@virenr5767 6 лет назад

Great Videos. Thank you. Would appreciate your advice on the following - I am attempting to maintain customer-wise product wise monthly sales data. The index would be the product and the columns would be the customer name. Data would have to be captured into the table every month. 1. How would you recommend setting up the structure - As different data frames for each month or as a 3 dimensional array, with the 3rd dimension being the monthly data. 2. How do you set up a blank structure containing all possible products and customers and then populate each data frame with monthly sales data received? 3. Suppose you start dealing with a new customer mid year, how do you populate the entire table with this new customer Series and then start capturing their sales data from the month they start buying? Thank you in advance, for the answers

@dataschool 6 лет назад

I'm sorry, but this is way beyond what I can address in a comment... good luck!

@grijeshmnit 4 года назад

brilliantly explained.

@dataschool 4 года назад

Thank you!

@amish1502 3 года назад

The tutorials are super nice and helpful, but I just got a slight problem that the 'categories' and 'ordered' arguments are not working in python 3.9 and pandas version 1.2.2

@dataschool 3 года назад

See here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-te5JrSCW-LY.html

@AlonsoParejawee 4 года назад

Thank you! Is it possible to create multiple dataframes based on the categories I have in my dataset?
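
One way this might be done, sketched on made-up data: iterating over a `groupby` yields one sub-DataFrame per distinct value, which can be collected into a dictionary. The names `drinks` and `by_continent` are illustrative.

```python
import pandas as pd

# Hypothetical sample data.
drinks = pd.DataFrame({
    "country": ["Albania", "Algeria", "Andorra"],
    "continent": ["Europe", "Africa", "Europe"],
})

# One DataFrame per distinct value, keyed by that value.
by_continent = {name: group for name, group in drinks.groupby("continent")}
```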

@rvg296 4 года назад

Seems like in the latest pandas 1.1.2 version df['quality'] = df.quality.astype('category',categories=['good','verygood','excellent'],ordered=True) this throws an error saying unexpected categories argument. I guess this should work. df['quality'] = pd.Categorical(df.quality,categories=['good','verygood','excellent'],ordered=True)

@dataschool 3 года назад

Thanks for sharing! Yes, the pandas API for ordered categories has changed since I recorded this video.

@jdavis38100 8 лет назад

Great job Kevin!

@dataschool 8 лет назад

Thanks! :)

@RohanB-xg6vg 3 years ago

Hello, I am currently using pandas version 1.2.2, and I get an error while running this code: df.quality.astype('category', categories=['good','very good','excellent'], ordered=True) It says that astype() got an unexpected keyword argument 'categories'. Did they remove those parameters in newer versions of pandas, since this video is a few years old?

@dataschool 3 года назад

See this video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-te5JrSCW-LY.html

@asiftandel8750 3 года назад

Great Video Sir

@dataschool 3 года назад

Thanks!

@tkannab1 6 лет назад

Excellent video!! thank you!

@dataschool 6 лет назад

You're very welcome!

@tugraalp01 3 years ago

(11:00) That method might be useful for data analysis, but if we apply machine learning algorithms, we HAVE TO use label encoding, one-hot encoding, or similar techniques, right? I actually want to know how appropriate it is to convert the attribute to the 'category' type for ML instead of applying encoding techniques.

@dataschool 3 года назад

You are correct that converting to the category type does not prepare it for ML. See this video for more: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-0w78CHM_ubM.html

@user-rj9vs3pr2n 10 месяцев назад

Hi! The data file url doesn't seem to be working all of a sudden. Could you look it up please?

@dataschool 8 месяцев назад

You can get the datasets from here if needed: github.com/justmarkham/pandas-videos

@rephechaun 5 лет назад

Hi Kevin, Does this mean we can throw in this category converted variable into machine learning model like Logistic Regression in sklearn or statmodels?

@dataschool 5 лет назад

No, that's not how it works, sorry!

@geocarvalhont 7 лет назад

Amazing tip, thank you again!

@dataschool 7 лет назад

You're very welcome!

@amitghosh425 4 года назад

at 16:44 I get the error message "ValueError: Got an unexpected argument: categories" for running "df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered =True)" . please help

@dataschool 4 года назад

The pandas API has changed. See this video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-te5JrSCW-LY.html

@danielmayper6548 4 года назад

I've been following along on your examples and they've all been incredible, but I encountered an error I can't see to get around on this one. At about 16:45, the command df['quality'] = df.quality.astype('category', categories=['good','very good','excellent'], ordered=True) is given and whenever I try and submit that line to the compiler I get the error ValueError: Got an unexpected argument: categories Was there an update to Pandas that may have changed this function or is there some kind of error I'm not aware I'm making?

@danielmayper6548 4 года назад

I had tried going to your github and copying the line you used from there, but I was getting the same error

@dataschool 4 года назад

The pandas API has changed, please see this video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-te5JrSCW-LY.html

@ganeshs8522 5 лет назад

Hi Thanks for the nice videos! df[df.quality >'good'] also works Is there any reason you use df.loc[df.quality > 'good'] in the last part of this video? Under what conditions you use df[ condition] vs df.loc[condition]?

@dataschool 5 лет назад

In this case, I use loc to be more explicit. I general, I use loc whenever its flexibility is required.
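
A quick sketch of both forms on made-up data. For a plain row filter they give the same result; `loc` additionally lets you pick columns in the same call, which is one example of the flexibility mentioned.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"ID": [100, 101, 102], "score": [1, 3, 2]})

mask = df["score"] > 1

rows_brackets = df[mask]
rows_loc = df.loc[mask]

# loc also accepts a column selection in the same call.
ids_only = df.loc[mask, "ID"]
```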

@PradeepKumar6 8 лет назад

Amazing always !!! Is it possible to convert these type of data into category while we read the data into python? Also, There is another datatype called datetime. I think it would be great if you may enlighten us with that as well for the purpose of datetime manipulation in future.

@dataschool 8 лет назад

Thanks! Regarding your first question, I haven't figured out a way to do it. Regarding datetimes, I will cover that in an upcoming video :)

@dataschool 8 лет назад

My latest video on the datetime format has been released: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-yCgJGsg0Xa4.html Hope that helps!

@kostasnikoloutsos5172 7 years ago

I am wondering if there is any cryptographic system that can convert strings to integers and then decrypt them back. If so, why doesn't pandas implement that in the background to reduce space? Also, does using astype("category") have any effect when we export the DataFrame to a CSV or Excel file?

@dataschool 7 лет назад

Question 1 - I'm not sure. Question 2 - no effect. Hope that helps!
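
A sketch of the CSV case on made-up data, using an in-memory string in place of a file. The exported text contains the labels themselves, and the category type is not stored in the CSV, so it would need to be requested again when reading.

```python
import io
import pandas as pd

df = pd.DataFrame({"continent": ["Asia", "Europe", "Asia"]})
df["continent"] = df["continent"].astype("category")

# The CSV contains the labels, not the integer codes.
csv_text = df.to_csv(index=False)

# Reading it back gives an ordinary string column unless dtype is specified.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.dtypes)
```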

@KhalilYasser 3 года назад

Thank you very much. Amazing tutorial. When trying this line `df['Quality'] = df.Quality.astype('category', categories = ['good', 'very good', 'excellent'], ordered=True)`, I encountered an error `TypeError: astype() got an unexpected keyword argument 'categories'`

@KhalilYasser 3 года назад

Searched and solve like that: `from pandas.api.types import CategoricalDtype` then I used the line like that `df['Quality'] = df['Quality'].astype(CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True))`

@experimentalhypothesis1137 5 лет назад

these videos are excellent!

@dataschool 5 лет назад

Thanks!

@reazshafqat5504 7 лет назад

first of all thank you for all of your videos! my question would be: in your case the size of the continent category is 488KB but in my case its 744KB. Can you explain the reason behind this difference?

@dataschool 7 лет назад

Glad you like the videos! Regarding your question, it's probably due to the version of pandas or Python.

@hariharamoorthythennetipan2190 7 лет назад

cool. Very nice examples.

@dataschool 7 лет назад

Thanks! Glad it was helpful to you!

@saurabhkhodake 7 лет назад

For the bonus tutorial i got error as "_astype() got an unexpected keyword argument 'categories' " Has the definition to astype() changed? Appreciate if someone could help.

@mleiano 7 лет назад

I had a similar error, I think what you did is you somehow ran the code without the "ordered = True" bit of the code at first or some such partial code and then tried to run it again with all the arguments as shown in the tutorial above, in that case it does show the error you mentioned. Just run the DataFrame creation command; ie, df = pd.DataFrame(...) again and then run the df.quality.astype(...) code, it should work. It did for me anyways. Let me know how it goes. Can anyone explain why it happens though? I am not sure about that.

@dataschool 7 лет назад

What version of pandas are you running?

@KimmoHintikka 7 лет назад

Thanks to re-running the the df creation again worked. My pandas version info from conda. pandas 0.19.2 np112py36_1 ------------------------- file name : pandas-0.19.2-np112py36_1.tar.bz2 name : pandas version : 0.19.2 build string: np112py36_1 build number: 1 channel : defaults size : 8.4 MB arch : x86_64 date : 2017-02-04 license : BSD md5 : 5ce048ed69412b7bec27989c5c963678 noarch : None platform : darwin url : repo.continuum.io/pkgs/free/osx-64/pandas-0.19.2-np112py36_1.tar.bz2 dependencies: numpy 1.12* python 3.6* python-dateutil pytz

@nishitsethi9405 7 лет назад

Thanks for the very informative video. I have one question. How do we convert multiple columns to 'category' data type at once? In my data set, I have 25 categorical columns and 6 integer columns. So is there an efficient way of converting these 25 columns to categorical while importing the data set or after importing? Thanks.

@dataschool 7 лет назад

Great question! There might be an easy way to do this, perhaps with the apply function, but I'm not sure at the moment. Let me know if you figured out an efficient method!
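
One approach that may work here, sketched on made-up data: `DataFrame.astype` accepts a mapping of column name to type, so a list of column names can be converted in a single call. The list `cat_cols` is a hypothetical stand-in for the 25 columns.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size": ["S", "M", "S"],
    "price": [10, 12, 11],
})

cat_cols = ["color", "size"]  # hypothetical list of columns to convert

# A dict passed to astype converts only the listed columns.
df = df.astype({col: "category" for col in cat_cols})
print(df.dtypes)
```

The same kind of mapping can be passed as the `dtype` argument of `read_csv` to do the conversion at import time.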

@kostasnikoloutsos5172 7 лет назад

You used a parameter called categories.This is not in the parameters of astype method. I think its in **kwargs.In docs I found this: kwargs : keyword arguments to pass on to the constructor. Where is the constructor I cannot understand this

@dataschool 7 лет назад

Sorry, I don't know how to answer your question!

@evapatrick3476 5 лет назад

Hi there, thanks for your excellent tutorial. I have a question that I unable to find an answer to, Can you use these columns (ones which have been converted into categories) in analysis, specifically machine learning models? If not how can one do without have to use get_dummies option since I have a column of about 8,000 unique rows?

@dataschool 5 лет назад

I recommend scikit-learn's OneHotEncoder for this case. No, you can't directly feed a category column to scikit-learn. Hope that helps!

@niteshsawant2716 4 года назад

How to autoupdate the ID column

@olabrew 7 лет назад

Hi, could you do a lesson on using the pivot function in Pandas? Haven't seen a good example anywhere.

@dataschool 7 лет назад

Thanks for the suggestion! Maybe this might be helpful to you? pbpython.com/pandas-pivot-table-explained.html

@olabrew 7 лет назад

Thanks! That helps to explain it a bit better. Cheers

@ishaangupta2223 4 года назад

Hey python shows an error whenever I type categories in astype, saying: astype got an unexpected keyword argument 'categories'. Can you please help.

@anngu3086 4 года назад

the syntax got updated, you better check out the first comment he pinned on top

@ishaangupta2223 4 года назад

Ann Gu Thanks

@spacedustpi 5 лет назад

Thanks. Very useful. Why do you prefer df.loc[df.quality > 'good', :] over df[df.quality > 'good']?

@dataschool 5 лет назад

Either is fine. The first is more explicit, whereas the second is more readable, so I go back and forth! :)

@rahulgulati890 8 лет назад

Thanks for sharing such great videos. Can you create one video in explaining pivot table in pandas. That would be really helpful. Regards Rahul

@dataschool 8 лет назад

You're welcome! And, I will do my best to create one on pivot table. In the meantime, here's a good post on it: pbpython.com/pandas-pivot-table-explained.html

@rahulgulati890 8 лет назад

+Data School thank you kevin

@richardanderson8377 8 лет назад

My question is about using categorical variables to build a logistic regression model using statsmodels. I had some 0-1 integer variables that I wanted to use as some of the predictor variables to build a logistic regression model, but converted them to categorical thinking this would avoid being treated as numerical. However, I got a ValueError: unrecognized data structures: / . Do you understand why? I can take this to a different forum if that would be better..

@dataschool 8 лет назад

My video coming out on July 12 will answer that question! I'll let you know when it's posted.

@dataschool 8 лет назад

Check out my latest video, and see if it answers your question: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-0s_1IsROgDc.html Hope that helps!

@richardanderson8377 8 лет назад

Nice video. My question goes a bit further. Suppose you wanted to use your k-1 dummy variables in a statsmodels or sci-kit learn logistic regression. would you leave them as type integers or convert them to type categorical?

@dataschool 8 лет назад

You would leave them as type integer. Good luck!

@Kavyashree40 6 лет назад

Hi, Your videos are superb. Learnt a lot.Could you please explain me about pivot and pivot_table?

@dataschool 6 лет назад

Thanks! I will consider that for future videos.

@annelizabeth728 6 лет назад

Thanks for another fantastic video! I tried the tip at the end, and got a warning message: "FutureWarning: specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead." I checked the pandas documentation and substituted CategoricalDType, e.g. "cat_type = CategoricalDtype(categories=["good", "very good", "excellent"],ordered=True) [newline] df['quality'].astype(cat_type)" but that didn't really work the way I was expecting either. Is there a newer way of accomplishing this?

@dataschool 6 лет назад

Thanks for your kind words! Regarding your question, you are correct that this has changed in the latest versions of pandas. However, your proposed code looks exactly correct to me. What exactly are you expecting that you are not seeing? Just to be clear, you do need to overwrite the existing 'quality' column if you want there to be a permanent change: df['quality'] = df['quality'].astype(cat_type)

@dataschool 6 лет назад

I discuss the new syntax for specifying categories in my latest video, "5 new changes in pandas you need to know about": ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-te5JrSCW-LY.html Hope that helps!

@TheAlderFalder 5 лет назад

This was awesome!

@dataschool 5 лет назад

Thanks!

@lonewolf2547 6 лет назад

For my dataset it reduced the size by approximately 50%. What i wanted to ask is if it has to lookup each time, does this increases the time complexity?

@dataschool 6 лет назад

No, the lookup shouldn't take a meaningful amount of time.

@vasanthnayak4086 7 лет назад

Hi... Thanks for sharing the Greatest series of videos on Pandas...!!! Quick question: Is there a way to convert a csv (size more than 2 GB) to a pandas data frame in the system where the RAM is 2 GB. I am getting 'memory error', while executing the code. I cant use 'category', I need the data as same as in the csv. Thanks...!!!

@dataschool 7 лет назад

Thanks for your kind words! One strategy is to read in only some of the rows and columns (only the ones you need), demonstrated here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-B-r9VuK80dk.html
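
A minimal sketch of that strategy, with an in-memory string standing in for the large file. The `chunksize` part is an addition not mentioned in the reply: it is another `read_csv` option that processes the file in pieces, assuming the analysis can be done incrementally.

```python
import io
import pandas as pd

# Stand-in for a large CSV file (hypothetical contents).
csv_text = "a,b,c\n1,2,3\n4,5,6\n7,8,9\n10,11,12\n"

# Only load the columns and rows that are needed.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["a", "c"], nrows=2)

# Or stream the file in pieces and keep only a running result.
total_b = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_b += chunk["b"].sum()
```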

@safeeqahmed3306 5 лет назад

Great video. I have a doubt. Suppose if i have a dataset about computers. I have a column for number of antivirus installed in a computer. I have total 100 observations but only 3 unique values for this column (1, 2 and 3). So should I consider this column as numeric or categorical?

@dataschool 5 лет назад

It depends - what are you trying to predict?

@safeeqahmed3306 5 лет назад

Data School I am predicting if a particular machine will be attacked by a malware soon, based on its configurations and a number of other parameters including number of antiviruses installed

@dataschool 5 лет назад

You would consider the column numeric.

@safeeqahmed3306 5 лет назад

Data School thanks a lot. May I know the reason please? And why it depends on the predictor?

@patrickmckowen1154 5 лет назад

Over 9000!!!!!

@dataschool 5 лет назад

😄

@mdzahidulislam6857 7 лет назад

I am glad that I came across your videos. It is really helpful for me. However, can we use categorical and numeric features for building decision trees in sklearn? I am getting the following errors: ValueError: could not convert string to float: 'Zimbabwe' Thank you very much for your help.

@dataschool 7 лет назад

You can use categorical features with any scikit-learn model, however you will need to transform them to numeric values. Here are two videos that may help you: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-0s_1IsROgDc.html ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-ylRlGCtAtiE.html

@mdzahidulislam6857 7 лет назад

Thanks a lot! There are awesome..

@dataschool 7 лет назад

You're very welcome! Glad they were helpful to you :)

@aakashkumarnain7592 8 лет назад

Hello Kevin!! How can I rename my columns which I changed to categorical data to the original names of the columns?

@dataschool 8 лет назад

You can use the DataFrame method 'rename', which I talk about in this video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-0uBirYFhizE.html

@muhammadfayyaz7134 6 лет назад

Would grateful if you make some tutorials on big data analytics thanks

@dataschool 6 лет назад

Thanks for your suggestion!

@muhammadfayyaz7134 6 лет назад

Data School i hope will see a great tutorial series from you about big data soon. 😊

@bhanu4187 5 years ago

I want to compare two date and time columns and create a new column based on whether they have the same value: if both columns have the same date and time I need 1, else 0. How can this be done? Please help me.

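@dataschool 5 years ago

df['new'] = (df['first'] == df['second']).astype(int)
Use bracket notation for these column names, since df.first can refer to the DataFrame method of that name rather than a column.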
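
A runnable sketch of that comparison, with made-up datetime columns named `start` and `end` (hypothetical names chosen to avoid clashing with DataFrame methods). The `astype(int)` step is what turns True/False into the 1/0 asked for.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "start": pd.to_datetime(["2020-01-01 10:00", "2020-01-02 09:30"]),
    "end": pd.to_datetime(["2020-01-01 10:00", "2020-01-02 11:00"]),
})

# Element-wise comparison gives True/False; astype(int) turns that into 1/0.
df["same"] = (df["start"] == df["end"]).astype(int)
```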

@srosell100 4 years ago

Hi, why do you have to write memory_usage='deep' and not only memory_usage?

@dataschool 4 года назад

That's how you specify the parameter
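
For reference, a sketch of the two spellings on made-up data: the `info` method takes `memory_usage="deep"`, while the `memory_usage` method takes `deep=True`. The `buf` argument is used here only to capture the printed report.

```python
import io
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"country": ["Albania", "Algeria"], "beer": [89, 25]})

# DataFrame.info takes memory_usage="deep" ...
buffer = io.StringIO()
df.info(memory_usage="deep", buf=buffer)
print(buffer.getvalue())

# ... while DataFrame.memory_usage takes deep=True.
shallow = df.memory_usage().sum()
deep = df.memory_usage(deep=True).sum()
```

Without the deep option, pandas may report only the size of the references to string values rather than the strings themselves.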

@srosell100 4 years ago

@@dataschool Thank you very much!!! I never thought you would answer, and thank you very much in general for your content. You have taught me so much!!!

@vinodkumar-ro7rc 7 лет назад

Excellent Article

@dataschool 7 лет назад

Thanks!

@rdg8268 6 лет назад

I need something like categories for a age range, for example 0-10, 0-20... Is it possible?

@dataschool 6 лет назад

Sure!
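
One way this could look, as a sketch: `pd.cut` bins numeric values into labeled ranges and returns a categorical result. The bin edges and labels below are illustrative choices, not something taken from the video.

```python
import pandas as pd

ages = pd.Series([3, 15, 27, 64])

# By default each bin includes its right edge, e.g. (0, 10], (10, 20], ...
age_group = pd.cut(
    ages,
    bins=[0, 10, 20, 40, 100],
    labels=["0-10", "11-20", "21-40", "41-100"],
)
print(age_group)
```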

@Om-iy9ix 6 years ago

Hi there, great videos. When we wrote drinks.continent.cat.codes.head() we got 1 2 0 2 0, and when I ran drinks.head() after that, it displayed Asia, Europe and so on instead of just numbers pointing to a lookup table of strings. Then I ran drinks.memory_usage(deep=True), which showed a reduced size for continent. How does this work? On one hand the change is not reflected in the DataFrame, and on the other hand the size is reduced. Hope you can help me out soon. Thanks a lot for your amazing videos. Please make more videos on Data Science and ML topics.

@dataschool 6 лет назад

Great question! The integers are the internal encodings for those categories, and the size is reduced due to those encodings. Does that help? You might like this video series: ru-vid.com/group/PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A

@vanmemet 7 лет назад

Thanks for your great videos, I am very enjoying watching, learning a lot. But most of these concepts are already addressed in sql world. I think when you tutor the video, you may reference these subjects to sql subjects. IMHO.

@dataschool 7 лет назад

SQL and pandas can indeed accomplish many of the same tasks. For SQL users, you are right that SQL comparisons might be helpful. You might like resource #5 here: www.dataschool.io/best-python-pandas-resources/