Feature Engineering-How to Perform One Hot Encoding for Multi Categorical Variables

267K views

Hi All,
After completing this video you will understand how to perform one hot encoding for multi-categorical features.
amazon url: www.amazon.in/...
Buy the Best book of Machine Learning, Deep Learning with python sklearn and tensorflow from below
amazon url:
www.amazon.in/...
Connect with me here:
Twitter: / krishnaik06
Facebook: / krishnaik06
instagram: / krishnaik06
Subscribe to my unboxing channel
/ @krishnaikhindi
Below are the various playlists created on ML, Data Science and Deep Learning. Please subscribe and support the channel. Happy Learning!
Deep Learning Playlist: • Tutorial 1- Introducti...
Data Science Projects playlist: • Generative Adversarial...
NLP playlist: • Natural Language Proce...
Statistics Playlist: • Population vs Sample i...
Feature Engineering playlist: • Feature Engineering in...
Computer Vision playlist: • OpenCV Installation | ...
Data Science Interview Question playlist: • Complete Life Cycle of...
You can buy my book on Finance with Machine Learning and Deep Learning from the below url
amazon url: www.amazon.in/...
🙏🙏🙏🙏🙏🙏🙏🙏
YOU JUST NEED TO DO 3 THINGS to support my channel: LIKE, SHARE & SUBSCRIBE TO MY YOUTUBE CHANNEL

Published: Aug 9, 2019

Comments: 173

@ttowelie 4 years ago

I spent my whole week trying to solve the same sort of problem. Thank you for your solution!

@cocum2 4 years ago

Great video! This is the solution I was looking for, very well explained, thank you very much for sharing!

@pradeepc5207 4 years ago

Same here also

@harshithbangera7905 3 years ago

Same here... I always found your videos very useful.

@umakanta7 4 years ago

The best trainer on YouTube, I feel, for simplicity in explaining. Great!

@abdullahalmahfuz6700 2 months ago

Do I have to know feature engineering in 2024?

@ajaykumar-rh2gz 3 years ago

Krish Naik Sir, you are doing an amazing job here. I am closely following you and your channel. I have also taken your paid services: admission to the affordable AI course at iNeuron. So far I have recommended your channel to more than 100 students, and most of them are following you. Thank you once again for this support, sir. Ajay Kumar, Ex Indian Navy.

@niteshmishra3923 4 years ago

I was stuck with a similar kind of data set for my class project...This has been an immense help in making things more clear !!!thanks a ton

@lawrencenanagyan489 2 months ago

You changed my life! God bless you!

@Futureyouth-be1bo 1 month ago

Bro, I have a problem: I am using two different datasets, one from Kaggle and one local. When I one hot encode them like this:
    flightdata = pd.get_dummies(flightdata, columns=['OriginCityName', 'DestCityName'])
    df = pd.get_dummies(df, columns=['OriginCityName', 'DestCityName'])
    # Ensure both datasets have the same dummy variables
    flightdata, df = flightdata.align(df, join='inner', axis=1)
the public dataset has many more categories than the local one. How can I solve it?
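Editor's sketch for the question above: `align(..., join='inner')` keeps only the columns both frames share, so categories present in just one dataset are lost. One possible alternative, assuming the larger dataset is treated as the reference layout, is to reindex the smaller frame to the reference columns and fill the gaps with 0. The toy frames below are hypothetical stand-ins for the two datasets.

```python
import pandas as pd

# Hypothetical toy frames standing in for the Kaggle and local datasets.
flightdata = pd.DataFrame({
    "OriginCityName": ["Boston", "Denver", "Miami"],
    "DestCityName": ["Denver", "Miami", "Boston"],
})
df = pd.DataFrame({
    "OriginCityName": ["Boston", "Austin"],
    "DestCityName": ["Miami", "Boston"],
})

cols = ["OriginCityName", "DestCityName"]
flight_enc = pd.get_dummies(flightdata, columns=cols, dtype=int)
df_enc = pd.get_dummies(df, columns=cols, dtype=int)

# Use the larger dataset's columns as the reference layout.
# Columns missing from df_enc are added and filled with 0;
# columns that exist only in df_enc (here: Austin) are dropped.
df_enc = df_enc.reindex(columns=flight_enc.columns, fill_value=0)
```

Whether dropping the categories that appear only in the smaller dataset is acceptable depends on which dataset the model is trained on.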

@adithyarajagopal1288 4 years ago

Many youtubers have videos on building models and the intuition behind them, not many have a feature engineering playlist as comprehensive as yours.... All the best

@programsolve3053 3 months ago

Thank you so much for the easy explanation of an obscure topic. 🎉🎉🎉🎉

@yosupalex8276 2 years ago

hey dude your feature engineering and stats videos SAVED MY LIFE!!!!!!! THANK YOU SOOOOOOO MUCH!!!!

@ritwikmukherjee3572 1 month ago

@krish Naik ... Hello sir, first of all I would like to thank you for giving us so many wonderful videos from which we learn so much. I would like to request you to provide the link of this file so that I can practice the coding part.

@bhushandhamankar 3 years ago

I'd suggest watching the 2nd video in this playlist first, then coming back to this one... :)

@anupampurkait6066 3 years ago

I think we may not need the 'sort_values' function here, because the 'value_counts' method sorts the values in descending order by default.

@abhishekverma549 4 years ago

Sir i need this .ipynb file, please share with us.

@poornakumar1508 4 years ago

Really cool!! I had got stuck without knowing this. Thanks a lot!!!!!

@datadrix 5 years ago

One stupid question from my side: what are the different roles in Machine Learning? For example, other fields have developer, tester, coder, etc.

@salvindsouza7053 4 years ago

Analytics and analysis of data in all the fields ,for automation!

@agastyasharma1641 2 years ago

2nd day of me learning ML this is the first video i got when i searched for feature engineering. This video is explained in a simple way to get an understanding by student who is new to AI & ML. @Krish Can I share the link of this video on the course I am learning from Udemy.

@chandrashekharbagul5825 1 year ago

Thanks for the help sir. I was facing exactly the same kind of issue with my data at the workplace.

@pushpitkumar99 3 years ago

Your videos are amazing Sir. Very informative and easy to understand. Thank You so much for all your hardwork.

@shaz-z506 5 years ago

Hi Krish, I just have one question: how do we decide between the top 10 and the top 20? Picking a threshold value seems like a tedious way to decide. Also, does the threshold depend on the business and the domain we are applying this technique to? Please let me know.

@aditisrivastava7079 5 years ago

I also have this doubt

@vishal56765 5 years ago

We can see from value_counts(). Where the count number starts dropping too much, we can take till that category
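Editor's sketch of the idea in this reply: look at each label's share of rows and keep the labels above a cutoff. The toy series and the 10% cutoff are assumptions made for illustration, not values from the video.

```python
import pandas as pd

# Hypothetical column with a few dominant labels and a long tail.
s = pd.Series(["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 3 + ["e"] * 2)

# value_counts sorts by frequency in descending order by default;
# normalize=True turns the counts into shares of all rows.
share = s.value_counts(normalize=True)
print(share)

# Keep labels whose share is at least 10% (the cutoff is an assumption).
keep = share[share >= 0.10].index.tolist()
```

Printing `share` makes the point where the counts start dropping visible, which is what the reply suggests inspecting.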

@pradeepc5207 4 years ago

I have been waiting to understand something related to this .Now i have understood the flow .superb explanation :-)

@sreenathgupta6767 3 years ago

Nice. If I am dealing with a dataset similar to an airline dataset, where source and destination airports are important and we need to consider all airports, how can we deal with such a dataset?

@ranasagar699 3 years ago

You can use the same technique: the 10 most frequent categories for source and destination.

@kanhataak1269 4 years ago

All the videos are really very nice and very well explained. How should I explain a project in front of the interviewer? When they ask "tell me about your project" and "tell me about yourself", I get confused about where to start. Please explain with an example, and please make a video on this topic in both Hindi and English. Thanks

@shivaprasadshirawar6235 3 years ago

I'm here for feature engineering after the statistics playlist as Krish sir said but I'm not getting anything. Am I doing it right or I should come back after machine learning playlist. Bcz I'm not getting the purpose of these methods and also the impact.... Please someone help me

@rabintimalsina9263 3 years ago

Same here.

@navneetmohit1 3 years ago

This method is for feature engineering of categorical feature, which means that the data columns that do not have numbers in them, they have strings which means something for the application that captured or uses this data. When this kind of Data is used for model building you have to think the big picture that the models are trained using mathematics in the background so the common sense is that you cannot use string to do calculation so during feature engineering there are techniques like these that convert the categorical feature to numbers using encoding technique. However, using one hot encoder technique brings the added burden of increasing the number of columns in the dataframe which in terms of efficiency of training decreases the overall performance of the trained model. One hot encoder is great to use when you have few categories in a column but if you have hundreds of them then the method of repurposing one hot encoder only for top 10 categories was proven to be beneficial and reduced the burden of added features or dimension. I hope you get the point now. I would always want you to think not just the detail but the bigger picture to understand the reasoning behind them and then things will be easier to understand. All the best
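Editor's sketch of the top-k idea described in this comment: only the most frequent labels get their own 0/1 column, and every other label is left as all zeros. The function name `one_hot_top_k` and the toy data are hypothetical; the video's notebook is not available here, so this is not the author's exact code.

```python
import numpy as np
import pandas as pd

def one_hot_top_k(df, column, k=10):
    """Add 0/1 columns for the k most frequent labels of `column`."""
    # value_counts is already sorted in descending order of frequency.
    top = df[column].value_counts().head(k).index.tolist()
    out = df.copy()
    for label in top:
        out[f"{column}_{label}"] = np.where(out[column] == label, 1, 0)
    return out, top

data = pd.DataFrame({"X2": ["as", "ae", "as", "ai", "m", "as", "ae"]})
encoded, top_labels = one_hot_top_k(data, "X2", k=2)
```

With `k=2` the frame gains two columns instead of four, and the rows holding the rarer labels `ai` and `m` are zero in both.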

@navneetmohit1 3 years ago

@@rabintimalsina9263 check out my response below. Hopefully it helps you

@shivaprasadshirawar6235 3 years ago

@@navneetmohit1 Yes sir I understood.... Thank you for the patience and the explanation 😊. Actually I ws following the path told by krish sir n at that tym I wasn't knowing machine learning models and techniques of data Pre processing that's y I ws confused......

@rabintimalsina9263 3 years ago

@@navneetmohit1 What if we have 7 to 8 categorical features? After one hot encoding it may result in the curse of dimensionality. How should we proceed in that case?

@sujankumar215 3 years ago

Hi Krish, please let me know where can i find Code you have used in these videos ? i also found the code of many videos are not available in description

@debanganabhattacharjee3706 2 years ago

Hi! Could you please explain how do I do the same thing when there are multiple values in each row of each column. For eg. In a genre column there are many genres separated by commas like: Comedy,Drama,Thriller and I need them all as 3 separate columns with 1,0 values wherever applicable. With this approach genres like this are being identified as a single genre but how do I divide them into 3 distinct genres?
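Editor's sketch for the multi-value case asked about here: pandas can split a delimited string column and build one 0/1 column per distinct value with `Series.str.get_dummies`. The toy frame is hypothetical, and the sketch assumes the genres are separated by a bare comma with no surrounding spaces.

```python
import pandas as pd

movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": ["Comedy,Drama,Thriller", "Drama", "Comedy,Thriller"],
})

# Split each cell on commas and build one 0/1 column per distinct genre.
genre_flags = movies["genre"].str.get_dummies(sep=",")
movies = pd.concat([movies, genre_flags], axis=1)
```

If the raw data uses ", " with a space, the separator passed to `sep` would need to match, otherwise labels such as " Drama" and "Drama" end up in different columns.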

@prathameshgurav8313 4 years ago

this video is really helpful for me to gain knowledge thank you..!

@sanyuktabaluni4608 4 years ago

Hi krish! What if we have a dependent variable with Categories: Never, Rarely, Sometimes, Often or dependent variable for weather prediction: "Sunny", "Monsoon", "Windy". How will we deal with a dependent variable with so many categories. Can a dependent variable y have more than 1 column?

@akashravindra.. 2 years ago

use naive bayes

@dineshnaik4904 3 years ago

Amazing!! Thank you very much for solid explanation!!!!

@yikheichan1653 3 months ago

I'm confused about how to use this when I have a dataset. So the labels with lower frequency are all set to 0? And are they still useful for the dataset? Also, when I fit a model like multinomial logistic regression, is your method useful? I need multinomial logistic regression because I have more than two classes, not just 0 and 1.

@yogeshrunthla9350 4 years ago

Very thankful for your efforts 🙌🙌🙌🙌

@sathishsivam635 1 year ago

only one suggestion i wanted to give you bro, that is kindly arrange the videos based on the data science syllabus. it is very difficult to find the frequency.

@mohammaddehghan8762 3 years ago

Thanks a lot for all the tutorials I learned from.

@kishoredev6004 4 years ago

Awesome Video! Krish, Thank You So Much

@debbie2017 3 years ago

great...! thanks for saving lot of time

@shashwatsingh253 4 years ago

Great Explanation Sir !! Thank You Sir ...

@rupambose4830 4 years ago

Amazing explanation

@Raja-tt4ll 4 years ago

It was a very nice video. Thank you.

@rohanchess8332 1 year ago

Very informative!

@AshutoshSingh-do4ts 2 years ago

Thank you sir ! for this video

@janithpanditharathne6196 2 years ago

When there are multi categorical variables, can we use one hot encoding with Support Vector Machine?

@surajrahinj4797 2 years ago

Hey Krish, please provide the notebook in the video description.

@vamsireddy6306 5 years ago

Sir, is one hot encoding with the top labels the only way to improve model performance on datasets with many labels? Suppose a dataset has 100 labels with the same frequency: neglecting 90 of the 100 labels would make our model less efficient.

@krishnaik06 5 years ago

As I said, this will not always work. It works only when you have imbalanced categories in your features. I will be uploading more videos to handle different scenarios.

@georgedong3789 4 years ago

agree with you. Moreover, this kind of encoding can overfit

@anirvansen6591 4 years ago

Learnt this new technique.Thanks

@chowdarybkc1619 5 years ago

Bro no jupyter notebook please upload it

@vijethrai2747 4 years ago

Open cmd as admin and type 'pip install jupyterlab'

@rishilramesh946 3 years ago

Is it fine to One Hot encode before train test split or we should do it only after the split? Does it cause Data Leakage if we use one hot encoding before train test split?
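Editor's sketch relating to this question: with the top-k variant, the list of "most frequent" labels is itself learned from data, so one cautious option is to learn that list on the training split only and reuse it unchanged on the test split. The helper name `apply_top_labels` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"X2": ["as", "as", "ae", "ai", "as", "ae"]})
test = pd.DataFrame({"X2": ["ae", "zz", "as"]})  # "zz" never seen in training

# Learn the label list from the training split only.
top_labels = train["X2"].value_counts().head(2).index.tolist()

def apply_top_labels(df, column, labels):
    """Encode `column` with a fixed, previously learned label list."""
    out = df.copy()
    for label in labels:
        out[f"{column}_{label}"] = np.where(out[column] == label, 1, 0)
    return out

train_enc = apply_top_labels(train, "X2", top_labels)
test_enc = apply_top_labels(test, "X2", top_labels)
```

Both frames end up with the same columns, and the unseen label `zz` simply becomes a row of zeros.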

@gouthamipalarapu909 3 years ago

Hello Krish. I am watching this video on repeat but I could not understand any of it. Can you please use another dataset to explain OHE? Mercedes Benz is really confusing. Awaiting your reply. Please help.

@sidgirase 1 year ago

Hey Krish. I am trying to make an anomaly detection model with many categorical columns. Grouping rare values into a single group would negatively impact my model. Am I thinking right?

@ritvikpant7107 2 years ago

Here as we've considered 10 most occurring labels for the dataset then what is the parameter by the help of which we can makeout that we should use these many labels and that will portray everything right? Anyone can reply.

@dhainik.suthar 3 years ago

How can we handle this data during model deployment? We need to set one value to 1 and all the others to 0, which is very time consuming. If there is a better way, please tell me.

@RBSTREAMS 1 year ago

sir where can i find these jupyter notebooks? i dont see any link in the description..can anybody please help me with that...

@sandipansarkar9211 2 years ago

finished watching

@abelsontenny7537 2 years ago

how do i iterate through the variables(features) names in a for loop to do the entire process without repeating to run the one_hot_top_x function again and again?
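Editor's sketch of one way to loop over the features: compute the top labels per column inside the loop and reuse a single encoding function. `one_hot_top_x` is the function name mentioned in the comment; its body here and the helper `top_labels` are assumptions, since the original notebook is not available.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "X1": ["v", "t", "v", "w", "v", "t"],
    "X2": ["as", "ae", "as", "ai", "as", "ae"],
})

def top_labels(df, column, k):
    """Return the k most frequent labels of `column`."""
    return df[column].value_counts().head(k).index.tolist()

def one_hot_top_x(df, column, labels):
    """Add one 0/1 column per label, modifying `df` in place."""
    for label in labels:
        df[f"{column}_{label}"] = np.where(df[column] == label, 1, 0)

# Snapshot the original column names first, because the loop adds columns.
for column in list(data.columns):
    one_hot_top_x(data, column, top_labels(data, column, k=2))
```

Prefixing each new column with the source column name avoids collisions when two features share a label.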

@anoshkaniskar3117 3 years ago

Hi.. Krish can we also perform mean encoding for this type of problem...please let me knw.. also thanks for sharing and this type of info...

@a.r.s.6301 4 years ago

Sir, I want to ask you something: doesn't your approach lose information from the feature? Let's say I have a dataset that contains lots of car brands and I want to do regression. I think your approach works fine for the 10 most frequent brands, but the other brands always become 0. What if I want the model to learn those brands' values too? How does it work then?
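Editor's sketch of one common variation, not shown in the video: relabel the rare brands as an explicit "other" bucket before encoding, so that they share a column instead of being all zeros. The toy series and the choice of two top brands are assumptions.

```python
import pandas as pd

brands = pd.Series(
    ["toyota"] * 4 + ["ford"] * 3 + ["kia", "saab", "lada"], name="brand"
)

top = brands.value_counts().head(2).index
# Keep frequent labels as they are and relabel everything else as "other".
grouped = brands.where(brands.isin(top), "other")
encoded = pd.get_dummies(grouped, prefix="brand", dtype=int)
```

The rare brands are still indistinguishable from each other; if per-brand effects matter, frequency or mean encoding (mentioned elsewhere in this thread) may be worth comparing.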

@MrDareh 3 years ago

Great! How does this compare to using word embeddings for encoding categorical features?

@deepeshkumarsharma6514 5 years ago

sir if you get time please create a video about mean encoding , that's also a good technique for encoding

@mukulmishra2296 5 years ago

can't we use frequency encoding or target encoding?
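Editor's sketch of frequency encoding, one of the alternatives named here: each label is replaced by its share of rows, which keeps the feature in a single column. The toy frame is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"city": ["pune", "delhi", "pune", "goa", "pune", "delhi"]})

# Frequency encoding: replace each label with how often it occurs.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)
```

One known limitation is that two labels with the same frequency receive the same value and can no longer be told apart.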

@SahilShah-cd5bi 5 months ago

Sir, can you please explain when should we use one hot encoding, label encoding or ordinal encoding?

@SahilShah-cd5bi 5 months ago

What should be the conditions?

@DanishAnsari-sn2sy 3 years ago

Hello Krish, hope you are doing great. Krish as you have shown us how to take the top 10 categories in a variable but you have used the top 10 categories of X2 in all of the variables. If we encoding each variable separately then we should be taking the top 10 categories of each variable? Can you please help me out with this!?

@muskan_bagrecha 3 years ago

You can find out top 10 for each column and pass this in the function.

@abhishekprasad7030 5 years ago

Hello, Can I get to know where are you currently working .. I mean city and Company !!

@pankushkukreja3101 5 years ago

lead Data Scientist, Panasonic , bangalore as per Github

@abhinasneupane2392 2 years ago

How do we handle when there are different numbers of columns created during training and testing? For example, during training we have selected the top 10 and dropped 1 which will create 9 new dummy columns, but how do we handle when there are only two categories in testing new data. If it only has two categories which will only create two new columns.

@akashravindra.. 2 years ago

Thats why you should first merge two datasets if there are like in video train and test and then perform encoding.

@abhinasneupane2392 2 years ago

@@akashravindra.. but you do not know how many values will be getting in real life in production. Let’s say you create 4 different columns A,B,C and others as fourth column based on 4 categories. Now let’s say you get totally new value E on that column where you created dummy when making predictions on production data how will we handle this ? I get it it works if you know all values on the columns and create dummy columns.

@akashravindra.. 2 years ago

@@abhinasneupane2392 You should create column transformer using pipeline. And use it. And One Hot encoder from Sklearn is used for that.
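Editor's sketch of the suggestion in this reply, assuming scikit-learn is installed: a `ColumnTransformer` wrapping `OneHotEncoder(handle_unknown="ignore")` is fitted once on the training data, and a category that first appears in production (the "E" from the question) is encoded as all zeros instead of raising an error. The column names and values are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"category": ["A", "B", "C", "A"],
                      "amount": [1.0, 2.0, 3.0, 4.0]})
prod = pd.DataFrame({"category": ["A", "E"],  # "E" was never seen in training
                     "amount": [5.0, 6.0]})

pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["category"])],
    remainder="passthrough",  # keep the numeric column unchanged
    sparse_threshold=0,       # always return a dense array
)

X_train = pre.fit_transform(train)  # columns: A, B, C, amount
X_prod = pre.transform(prod)        # same 4 columns, whatever labels appear
```

Because the fitted transformer fixes the column layout, the number of output columns in production always matches training.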

@NaveenKumar-fm5yg 6 months ago

if a column contain too much categories we simply use label encoding for that column that also fine right

@vaishnavi4354 1 month ago

i think, label encoding is used for only target variables.

@snehithoddula7905 4 years ago

Instead of doing it separately for X1, X2, X3, ..., can't we do it like this?
    for i in data.columns:
        top_10 = [x for x in data.i.value_counts().sort_values(ascending=False).head(10).index]
        for label in top_10:
            data[label] = np.where(data['i'] == label, 1, 0)
        data[['i'] + top_10]
When I try this, I get an error saying that i is not an attribute of data. How can I resolve this? Can somebody help?

@vaibhavyaramwar 3 years ago

Does we need to perform Encoding only on Train Data or entire dataset? If we need to perform Encoding only on Train dataset at the time applying model on test we will face issue of column mismatch. Can you please brief about this , it would really be helpful.

@akashravindra.. 2 years ago

Encoding is always done on entire dataset because your test data cannot have categories and you expect the model to predict the output based on that.

@nan0mchgaming937 3 years ago

We can also use nunique

@cdhanunjay5497 4 years ago

I have a data set with more than 1500 different labels. What should I do then? If I apply the same thing, there will be many more features.

@littlecutiepiedia2940 4 years ago

take %age ratio by applying 80 20 rule if 80% of data lying in top 10 to 20 then you can apply this otherwise convert into Target guided mean value
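Editor's sketch of the 80/20 check suggested in this reply: compute the cumulative share of rows covered by the most frequent labels and find how many labels are needed to reach 80%. The toy series is hypothetical, and the 80% target is taken from the comment.

```python
import pandas as pd

s = pd.Series(["a"] * 50 + ["b"] * 35 + ["c"] * 10 + ["d"] * 3 + ["e"] * 2)

# Cumulative share of rows covered by the k most frequent labels.
coverage = s.value_counts(normalize=True).cumsum()

# Smallest k whose labels cover at least 80% of the rows.
k = int((coverage >= 0.80).values.argmax()) + 1
top_labels = coverage.index[:k].tolist()
```

If `k` turns out to be a large fraction of all labels, the distribution is fairly flat and the reply's fallback (target-guided mean encoding) may be the better fit.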

@niveshtayal979 5 years ago

Hi Krish I think this technique is not useful when we are working on real time project. So can you please explain the same with some other technique that will be really helpful.

@Jam05_ 4 years ago

Frequency encoding or target encoding maybe

@RahulKumar-lv9yz 3 years ago

Do we have to be a member to get this jupyter notebook and other content?

@manideepgupta2433 4 years ago

Hi Krish, That was really a wonderful video. But I have a question, I have used mean encoding in one of my data containing state,city,ward values on 3 of these columns, So does this method be better that mean encoding? and in the case of mean encoding, if I perform mean encoding on various col(state,city,ward) do they cause high correlation among the data?

@akashravindra.. 2 years ago

I think mean encoding is better in a way because it gives different values for different categories and later you can standardize or normalize them. But in this all the categories are treated as 1 which signifies huge loss of data.
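Editor's sketch of mean (target) encoding as discussed in this exchange: each label is replaced by the mean of the target observed for it in the training data. The toy frames, the column names and the global-mean fallback for unseen labels are assumptions; smoothing and cross-fold schemes are deliberately left out.

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["pune", "pune", "delhi", "delhi", "goa"],
    "price": [10.0, 20.0, 30.0, 50.0, 40.0],
})
test = pd.DataFrame({"city": ["delhi", "agra"]})  # "agra" is unseen

# Mean encoding: per-label mean of the target, learned on training data only.
city_means = train.groupby("city")["price"].mean()
global_mean = train["price"].mean()

train["city_enc"] = train["city"].map(city_means)
# Unseen labels fall back to the global mean (one possible convention).
test["city_enc"] = test["city"].map(city_means).fillna(global_mean)
```

Note that `delhi` and `goa` both map to 40.0 in this toy data, so distinct labels are not guaranteed distinct values.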

@animeshmuduli1043 2 years ago

Please provide us the Jupyter notebook file 🙏

@banankulovski 9 months ago

what about dummy variable trap?
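Editor's sketch on the dummy variable trap: with a full one hot encoding every row sums to 1 across the dummy columns, which makes them perfectly collinear with an intercept in linear models. `pd.get_dummies(..., drop_first=True)` drops one level to avoid that. The toy frame is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"fuel": ["petrol", "diesel", "cng", "petrol"]})

full = pd.get_dummies(df, columns=["fuel"], dtype=int)
# drop_first=True removes one level, which becomes the implicit baseline.
reduced = pd.get_dummies(df, columns=["fuel"], drop_first=True, dtype=int)
```

With the top-k variant from the video, rows holding a label outside the top k are already all zeros, so the columns do not sum to 1 for every row as long as such rows exist.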

@ashishdhiman4097 3 years ago

The sum of all the labels was 123 however shape showed only 117 columns. Were some labels missed ??

@aroaro4963 4 years ago

Once I feed in the encoded data, I receive encoded output. How can I map it back to the real categorical data (decoding)?
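Editor's sketch of decoding a full one hot encoding: when every row contains exactly one 1, the name of the column holding the row maximum is the original label. The toy series is hypothetical.

```python
import pandas as pd

labels = pd.Series(["red", "green", "blue", "green"], name="color")
one_hot = pd.get_dummies(labels, dtype=int)

# Each row has exactly one 1, so the column holding the row maximum
# is the original label.
decoded = one_hot.idxmax(axis=1)
```

This does not carry over to the top-k encoding from the video: an all-zero row would be decoded as the first column, so rare labels cannot be recovered that way.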

@abdmo7281 4 years ago

Great video can i ask,is that multi-hot encoding?

@shreyasaxena5169 4 years ago

Is this right for applying it to the whole data?
    data = pd.read_csv('mercedes.csv', usecols=['X1','X2','X3','X4','X5','X6'])
    usecol = ['X1','X2','X3','X4','X5','X6']
    for a in usecol:
        def cal_top(df, variable):
            tops = [x for x in df[variable].value_counts().sort_values(ascending=False).head(10).index]
            return tops
        top1 = cal_top(data, a)
        def one_hot_top_x(df, variable, top_lables):
            for label in top_lables:
                df[variable+'_'+label] = np.where(data[variable]==label, 1, 0)
        one_hot_top_x(data, a, top1)

@sandipansarkar9211 2 years ago

where is the ipython notebook for practice?I an unable to locate it

@siddharthrao3115 2 years ago

amazing

@mahikhan5716 2 years ago

@krish naik could i have the dataset would be wholesome for me ?

@preenu7528 3 years ago

Could you please provide the link to the jupyter notebook?

@mritunjay3723 2 years ago

I have joined as a member .. How do I get the feature engineering notes ??

@bhanupratapyadav6449 2 years ago

Brother, I am getting the error "maximum recursion depth exceeded while calling a Python object", and the column's data type changes, when I use this code:
    for features in MainData.columns:
        MainData[features].replace(np.nan, MainData[features].mean, inplace=True)
Please help.
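Editor's sketch of a likely fix: the snippet passes `MainData[features].mean` (the method object) rather than `MainData[features].mean()` (the number), which is a plausible cause of both the recursion error and the changed dtype. The version below calls the method and restricts the loop to numeric columns; the toy frame is hypothetical.

```python
import numpy as np
import pandas as pd

MainData = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "city": ["pune", None, "goa"],
})

# Only numeric columns have a meaningful mean.
for feature in MainData.select_dtypes(include="number").columns:
    # Note the parentheses: .mean() returns a number, .mean is a method.
    MainData[feature] = MainData[feature].fillna(MainData[feature].mean())
```

Assigning the result back, instead of relying on `inplace=True` on a column selection, also avoids modifying a temporary copy.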

@priyankapradhan4539 4 years ago

In one of your video (optimizeCNN model) you took fashion_mnist dataset to optimize model. But when use the same code to read dataset from local drive its showing lotz of error.....i did it using glob module.....sir plzzz make one such video in which we can optimizeCNN model using our own dataset from local drive or from google drive when working with colab.....please do needful.....thank u in advance.

@vininitdgp 3 years ago

so, is it recommended to perform one hot encoding to all the binary categorical feature in a dataset? or its ok to let them as feature(column) only ?

@akashravindra.. 2 years ago

Binary its good to use but honestly you can simply replace the one of the binary value with zero instead of one hot encoding and then dropping the original column. You can use str.replace() function and replace one binary value.
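Editor's sketch of the shortcut described in this reply, using `Series.map` with an explicit dictionary so the result is numeric rather than a string. The column name and the choice of which label becomes 1 are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"gearbox": ["manual", "auto", "auto", "manual"]})

# Map the two labels directly to 0/1; the mapping itself is an assumption.
df["gearbox_auto"] = df["gearbox"].map({"manual": 0, "auto": 1})
```

Any label missing from the dictionary would become NaN, which makes unexpected values easy to spot.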

@Aman-lw3vq 4 years ago

won't it be better if we use label encoding for every categorical variable instead of creating new variables and making data messier???

@manojgupta91 3 years ago

First of all Thanks for these wonderful videos. I've a question. Suppose we have a categorical variable with many distinct values. Using One Hot Encoder will add up to too many features/dimensions. Instead of using Top (Most Frequent) approach can add these dimensions and then use dimensionality reduction eg. PCA for this?

@PapunRout-zk9ip 1 year ago

where is the jupyter notebook links

@jagannadhareddykalagotla624 4 years ago

Actually in X4 feature we have only 4 categories so how can we take top 10 Please explain

@JaySingh-qj3hv 4 years ago

So there is no need for this; you can simply use get_dummies, since there are only 4 categories. You only need to apply this method when you have many categories.

@snehithoddula7905 4 years ago

@@JaySingh-qj3hv Instead of doing it separately for X1, X2, X3, ..., can't we do it like this?
    for i in data.columns:
        top_10 = [x for x in data.i.value_counts().sort_values(ascending=False).head(10).index]
        for label in top_10:
            data[label] = np.where(data['i'] == label, 1, 0)
        data[['i'] + top_10]
When I try this, I get an error saying that i is not an attribute of data. How can I resolve this? Can somebody help?

@rupambose4830 4 years ago

@@snehithoddula7905 Here i itself stands for each column in the dataset, so write data[i] instead of data.i or data['i'].

@akatsukidawn 11 months ago

I am currently doing mtech in machine learning but I can't understand anything from this video. I have lots of assignments to do but I am stuck

@rajatjain328 4 years ago

Please update this playlist

@moussaabgaming3463 5 months ago

how can i get your note book sir

@nihalshukla7718 3 years ago

Brother, please try to share your notebook.

@sandipansarkar9211 2 years ago

finished practicing code

@vinjad5672 4 years ago

hi sir but i am not able to get this mercedes dataset from kaggle can you help me with that

@snehithoddula7905 4 years ago

www.kaggle.com/c/mercedes-benz-greener-manufacturing/data you can download from here

@thegamewars8569 2 years ago

data[data['X2']+data[top_10]] Sir this part of code is not working can you please help with this
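Editor's sketch of what this line was probably meant to do: select the original column together with the new 0/1 columns. That needs a list of column names, built by concatenating two Python lists, rather than adding a Series to a DataFrame. The toy frame and the shortened `top_10` list are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({
    "X2": ["as", "ae", "as"],
    "as": [1, 0, 1],
    "ae": [0, 1, 0],
    "y": [1, 2, 3],
})
top_10 = ["as", "ae"]  # shortened list for the example

# Concatenate two lists of column names, then select with that list.
view = data[["X2"] + top_10]
```

Here `["X2"]` is a one-element list, so `["X2"] + top_10` is the list `["X2", "as", "ae"]`.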

@akhibali8405 3 years ago

When to use Label encoding???

@sivalakshmisivaraman8321 4 years ago

Why are we performing feature engineering?can we not take the variable as is...

@lokesh542 4 years ago

Suppose if you need to perform clustering how will you move ahead then without performing this step for fix category

@priyankapradhan4539 4 years ago

Sir plzzz make video on feature extraction from images........so that after feature extraction it can directly be feed to CNN.....plzzzzzzzzzzz

@georgedong3789 4 years ago

Features are extracted automatically by the CNN, so why do you need to extract features separately? All you need to do is specify the kernels, that's it.

@priyankapradhan4539 4 years ago

@@georgedong3789 i want to extract LBP feature from image store them in a pickle file and thn want to use that pickle file( that contains LBP feature of image) to cnn....so that model accuracy can b improved...but unable to do....plz make a video on this....plzzzzz Sir