
Make Your Pandas Code Lightning Fast 

Rob Mulla
166K subscribers
174K views

Speed up slow pandas/python code by 2500x using this simple trick. Face it, your pandas code is slow. Learn how to speed it up! In this video Rob discusses a key trick to making your code faster! Pandas is an essential tool for any python programmer and data scientist. Comparing a plain loop, the pandas apply function, and vectorized functions, the speed difference can be significant. Write faster python code.
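For readers who want to follow along, here is a minimal sketch of the three levels compared in the video. The column names and reward rule are reconstructed from the discussion in the comments below, so details may differ from Rob's exact notebook:

```python
import pandas as pd
import numpy as np

# Toy dataset similar to the one built in the video (names are assumptions).
def get_data(size=10_000):
    df = pd.DataFrame()
    df["age"] = np.random.randint(0, 100, size)
    df["time_in_bed"] = np.random.randint(0, 9, size)
    df["pct_sleeping"] = np.random.rand(size)
    df["favorite_food"] = np.random.choice(["pizza", "taco", "ice-cream"], size)
    df["hate_food"] = np.random.choice(["broccoli", "candy corn", "eggs"], size)
    return df

# Reward rule: if they were in bed more than 5 hours and slept more than 50%,
# or they are 90+, they get their favorite food; otherwise their hated food.
def reward_calc(row):
    if row["age"] >= 90:
        return row["favorite_food"]
    if row["time_in_bed"] > 5 and row["pct_sleeping"] > 0.5:
        return row["favorite_food"]
    return row["hate_food"]

df = get_data()

# Level 1: loop over rows with iterrows (slowest)
for index, row in df.iterrows():
    df.loc[index, "reward"] = reward_calc(row)

# Level 2: apply the function row-wise (better)
df["reward"] = df.apply(reward_calc, axis=1)

# Level 3: vectorized boolean masking (fastest)
df["reward"] = df["hate_food"]
df.loc[
    ((df["pct_sleeping"] > 0.5) & (df["time_in_bed"] > 5)) | (df["age"] >= 90),
    "reward",
] = df["favorite_food"]
```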
Timeline
00:00 Intro
00:46 Creating our Data
02:39 The Problem
03:48 Coding Up the Problem
04:43 Level 1: Loop
06:29 Level 2: Apply
07:27 Level 3: Vectorized
09:31 Plot The Speed Comparison
10:23 Outro
Follow me on twitch for live coding streams: / medallionstallion_
Intro to Pandas video: • A Gentle Introduction ...
Exploratory Data Analysis Video: • Exploratory Data Analy...
* YouTube: youtube.com/@robmulla?sub_con...
* Discord: / discord
* Twitch: / medallionstallion_
* Twitter: / rob_mulla
* Kaggle: www.kaggle.com/robikscube
#python #code #datascience #pandas

Published: 16 May 2024

Comments: 325
@hasijasanskar
@hasijasanskar 2 года назад
Whoa.. 3500 times difference. Vectorised is even faster than apply, will give it a try next time for sure. Awesome video as always.
@robmulla
@robmulla 2 года назад
Thanks Sanskar. Yes, using vectorized functions is always much faster. In some cases it's not possible but then there are other ways to speed it up. I might show that in another video if this one is popular.
@amazingdude9042
@amazingdude9042 3 месяца назад
@@robmulla can you make a video on how to make pandas resample faster ?
@miaandgingerthememebunnyme3397
@miaandgingerthememebunnyme3397 2 года назад
That’s my husband! He’s so cool.
@robmulla
@robmulla 2 года назад
Love you boo. 😘
@FilippoGronchi
@FilippoGronchi 2 года назад
Fully agree!
@sauloncall
@sauloncall 2 года назад
Aww! This is wholesome!
@rahulchoudhary1024
@rahulchoudhary1024 Год назад
I've been watching your videos non-stop for the last week! And I enjoy the comments from your SO!!! Lovely!
@mohammedgt8102
@mohammedgt8102 Год назад
He is awesome. Taking time out of his day to share knowledge 👏
@kip1272
@kip1272 Год назад
Also, a way to speed it up is to not use & and | for 'and' and 'or', but just use the words 'and' and 'or'. These words are made for boolean expressions and thus work faster; & and | are bitwise operators and are made for integers. Using them will force Python to turn the booleans into integers, do the bitwise operation, and then cast the result back to a boolean. That doesn't take much time if you do it once, but in a test scenario inspired by this video it was roughly 45% slower.
@robmulla
@robmulla Год назад
Nice tip! I didn’t know that.
@kazmkazm9676
@kazmkazm9676 Год назад
I made the experiment. It is ready to run. What you have suggested is coded below. It is approximately 20 percent faster. import timeit; setup = 'import random; random_list = random.sample(range(1,101),100)'; # with or: first_code = 'result_1 = [rand for rand in random_list if (rand > 75) or (rand < ...)]'; # with |: second_code = 'result_2 = [rand for rand in random_list if (rand > 75) | (rand < ...)]'
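For reference, a runnable version of that experiment; the lower cutoff was lost to the comment formatting above, so 25 is an assumption:

```python
import timeit

setup = "import random; random_list = random.sample(range(1, 101), 100)"

# Filter using the `or` keyword on plain Python booleans
or_code = "[r for r in random_list if (r > 75) or (r < 25)]"
# Same filter using the bitwise | operator on booleans
pipe_code = "[r for r in random_list if (r > 75) | (r < 25)]"

print("or :", timeit.timeit(or_code, setup=setup, number=100_000))
print("|  :", timeit.timeit(pipe_code, setup=setup, number=100_000))

# Note: this applies to scalar Python booleans only. Element-wise comparisons
# on pandas Series / NumPy arrays still require & and |, as in the video.
```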
@kip1272
@kip1272 Год назад
@@kazmkazm9676 The difference was even bigger between & and 'and', if I remember correctly.
@A372575
@A372575 Год назад
Great, never realized that. Will start using 'and' and 'or' from now on.
@jti107
@jti107 2 года назад
I didn't realize you could write 10k as 10_000. I work with astronomical units, so it makes variables more readable. Great video!
@robmulla
@robmulla 2 года назад
Thanks! Yes, they introduced that functionality with underscores in numbers with python 3.6 - it really helps make numbers more readable.
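For example:

```python
n_rows = 10_000          # same value as 10000, just easier to read
big = 1_000_000_000      # underscores can go anywhere between digits (Python 3.6+)
print(n_rows == 10000, big == 10**9)  # True True
```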
@kailashlate6348
@kailashlate6348 3 месяца назад
😊😊😊
@kailashlate6348
@kailashlate6348 3 месяца назад
😊
@FilippoGronchi
@FilippoGronchi 2 года назад
That's another awesome video....extremely useful in the real world work. Thanks again Rob
@robmulla
@robmulla 2 года назад
Thanks for watching Filippo!
@LimitlesslyUnlimited
@LimitlesslyUnlimited 2 года назад
Haha coincidentally I'd been raving about vectorized to my friends the last few months. It's soo good. The moment I saw your title I figured you're probably talking about vectorize too haha. Awesome video and great content!!
@robmulla
@robmulla 2 года назад
You called it! Thanks for the positive feedback. Hope to create more videos like it soon.
@robertnolte519
@robertnolte519 Год назад
Same! Still hasn't worked on picking up chicks at the bar, but I'm not giving up.
@Zenoandturtle
@Zenoandturtle 6 месяцев назад
That is unbelievable. Astounding time difference. I was recently watching a presentation on a candlestick algorithm, and the presenter used the vectorised method and I was confused (I am new to Python), but this video made it all clear. Fantastic presentation.
@robmulla
@robmulla 6 месяцев назад
Glad you found it interesting. Thanks for watching!
@deepakramani05
@deepakramani05 2 года назад
As I work with Pandas and large datasets, I often come across code that uses iterrows. Most developers just don't care about time, or come from various programming backgrounds that keep them from using efficient methods. I wish more people used vectorization.
@robmulla
@robmulla 2 года назад
Thanks. That’s exactly why I wanted to make this video. Hopefully people will find it helpful.
@pr0skis
@pr0skis Год назад
Some of the biggest bottlenecks are from IO... especially when trying to read and then concat multiple large Excel files. Shaving a few seconds off the algos just isn't gonna make much of a difference.
@allenklingsporn6993
@allenklingsporn6993 Год назад
@@pr0skis Hard to say that definitively, though, right? You have no idea how anyone is using pandas. If they have slow algos running iteratively, it can very easily become much slower than I/O functions. I've seen some pretty wild pandas use in my business, and a lot of it is really terrible at runtime, especially anything that is wrapped in a GUI (sometimes even with multiprocessing...).
@nitinkumar29
@nitinkumar29 Год назад
@@pr0skis You can convert the Excel files to CSV and then use the CSV files, because CSV file IO is faster.
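One way to do that one-off conversion and then work from the faster CSVs (the file paths here are hypothetical):

```python
import glob
import pandas as pd

# One-off: convert each Excel workbook to CSV (much faster to re-read later).
for path in glob.glob("data/*.xlsx"):
    pd.read_excel(path).to_csv(path.replace(".xlsx", ".csv"), index=False)

# Later runs: read and concatenate the CSVs instead of the Excel files.
df = pd.concat(
    (pd.read_csv(p) for p in glob.glob("data/*.csv")),
    ignore_index=True,
)
```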
@jaimeduncan6167
@jaimeduncan6167 Год назад
It's the same with a relational database, we call them the cursor kids. They loop and loop and loop when they can use a set operation to go hundreds of times faster and often with less code.
@MrValleMilton
@MrValleMilton 20 дней назад
Thank you very much for this video Rob. It is very helpful for beginners like me. Have a great day.
@nirbhay_raghav
@nirbhay_raghav Год назад
My man made a df out of the time diff to plot them!! Really useful video. Will definitely keep this in mind from now.
@robmulla
@robmulla Год назад
Haha. Thanks Nirbhay!
@Vonbucko
@Vonbucko Год назад
Awesome video man! Appreciate the tips, I'll definitely be subscribing!
@robmulla
@robmulla Год назад
I appreciate that a ton. Share with a friend too!
@craftydoeseverything9718
@craftydoeseverything9718 5 месяцев назад
Hey, I just thought I'd mention, I really appreciate that you use really huge test datasets, since a lot of the time, test datasets used in tutorials are quite small and don't show how code will scale. This video does it perfectly, though!
@sphericalintegration
@sphericalintegration Год назад
Thank you for this, Rob. This video made me subscribe because in 10 minutes you solved one of my biggest problems. And your Boo is right - you are pretty cool. Thanks again, sir.
@robmulla
@robmulla Год назад
That's awesome that I was able to help you out. Check my other videos where I go over similar tips! Glad you agree with my Boo
@abdulkadirguven1173
@abdulkadirguven1173 9 месяцев назад
Thank you very much Rob.
@ajaybalakrishnan5208
@ajaybalakrishnan5208 Год назад
Awesome. Thanks Rob for introducing this concept to me.
@robmulla
@robmulla Год назад
Happy it helped!
@robertjordan114
@robertjordan114 Год назад
Man where have you been all my Python-Life!?!? Thank you so much for this! Outstanding!!!
@robmulla
@robmulla Год назад
Thanks Robert for watching. Glad you found it helpful!
@robertjordan114
@robertjordan114 Год назад
The problem I'm dealing with is that I am looping through some poorly designed tables and building a SQL statement to be applied, and then appending the output to a list. Not sure if a vectorized approach will work since I have that SQL call, but apply might save me from needing to recreate the df prior to appending every time.
@robmulla
@robmulla Год назад
@@robertjordan114 Interesting. Not sure what your data is like, but a lot of the time it can be better to write a nice SQL statement that puts the data in the correct format first. That way you put the processing demands on the SQL server, and it can usually optimize really well.
@robertjordan114
@robertjordan114 Год назад
Oh you have no idea, my source table has one column with the name of the column in my lookup table and another with the value that I need to filter on in that lookup table. The loop creates the where clause based on the number of related rows in the initial dataset, and then I'm executing that SQL statement to return the values to a Python data frame, which I then convert to a pandas data frame and append. Like I said, amateur hour! 🤣
@nathanielbonini8951
@nathanielbonini8951 10 месяцев назад
This is spot on. I had a filter running that was going to take 2 days to complete on a 12M line CSV file using iteration - clearly not good. Now it takes 6 seconds.
@gabrielfrazer-mckee5095
@gabrielfrazer-mckee5095 Год назад
Great video! I wish I had known not to loop over my array for my machine learning project... going to go improve my code now!
@robmulla
@robmulla Год назад
Glad you learned something new!
@hussamcheema
@hussamcheema Год назад
Wow amazing. Please keep making more videos like this.
@robmulla
@robmulla Год назад
Thanks for the feedback. I’ll try my best.
@OktatOnline
@OktatOnline Год назад
I'm over here as a newbie data scientist, copying the logic step-by-step in order to have good coding habits in the future lmao. Thanks for the video, really valuable!
@robmulla
@robmulla Год назад
Glad you found it helpful!
@anoopbhagat13
@anoopbhagat13 2 года назад
Wow! That's an excellent way of speeding up the code.
@robmulla
@robmulla 2 года назад
Thanks Anoop. Hope your future pandas code is a bit faster because of this video :D
@OPPACHblu_channel
@OPPACHblu_channel Год назад
Somehow I came across the vectorized method first, right at the beginning of my Python and pandas journey. Thanks for sharing your experience, lightning fast!
@robmulla
@robmulla Год назад
It’s a great thing to learn early!
@alexandremachado1014
@alexandremachado1014 2 года назад
Hey man, nice video! Kudos from reddit!
@robmulla
@robmulla 2 года назад
Glad you enjoyed it. So cool that the reddit community liked this video so much. Hopefully my next one will be as popular.
@predstavitel
@predstavitel Год назад
Thanks for the great video! I have a project with some calculations. They take some minutes to run through the loops. I'm going to use the vectorized way, so I'll write another comment with a comparison later. Some days later... I rewrote a significant part of my code, made it vectorized, and I got fantastic results. For example: old code - 1m 3s, new code - 6s. One more: old code - 14m 58s, new code - 11s. Awesome!
@robmulla
@robmulla Год назад
So awesome! It's really satisfying when you are able to improve the speed of code by orders of magnitude.
@LaHoraMaker
@LaHoraMaker Год назад
I loved that you used Madrid Python user group for the pandas logo :)
@robmulla
@robmulla Год назад
I did?! I didn't even realize. What's the timestamp where I show that logo?
@prodmanaiml9317
@prodmanaiml9317 2 года назад
More video tips for pandas would be excellent!
@robmulla
@robmulla 2 года назад
Great suggestion. I'll try to keep the pandas videos coming.
@colmduffy2272
@colmduffy2272 Год назад
There are several videos on pandas vectorization. This is the best.
@robmulla
@robmulla Год назад
I appreciate you saying that! Thanks for watching.
@kanishkpareek6650
@kanishkpareek6650 9 месяцев назад
Your teaching style is awesome. Where can I find your videos in a structured manner?
@kingj5983
@kingj5983 4 дня назад
Wow, awesome video, thanks! Although it takes time to figure out how to turn my limit conditions into a logical calculation that returns a bool dataframe.
@mic9657
@mic9657 Год назад
great tips! and very well presented
@robmulla
@robmulla Год назад
Glad you like it. Thanks for watching.
@GregZoppos
@GregZoppos Год назад
Wow, thanks! I'm a beginner in data science, this is really interesting to me.
@robmulla
@robmulla Год назад
Great to hear! Good luck in your data science journey.
@thebreath6159
@thebreath6159 Год назад
Ok this channel is great for data science, I’ll follow
@robmulla
@robmulla Год назад
Thanks for subbing!
@ledestonilo7274
@ledestonilo7274 Год назад
Interesting. Thank you will try it.
@robmulla
@robmulla Год назад
Awesome! Let me know how it goes.
@artemqqq7153
@artemqqq7153 Год назад
Dude, that row[column] thing was a shock to me, thanks!
@robmulla
@robmulla Год назад
Glad you learned something!
@bm647
@bm647 Год назад
Great video! very useful
@robmulla
@robmulla Год назад
Glad you found it useful!
@Chris_87BC
@Chris_87BC 11 месяцев назад
Great video! I am currently looping through a data frame column for each customer and print the data to PDF. Is there a vectorized version that can be much faster?
@YuanYuan-uk8sz
@YuanYuan-uk8sz 10 месяцев назад
Thank you very much for your extremely helpful video, so so helpful for me, love you so much.
@robmulla
@robmulla 10 месяцев назад
I'm so glad! Share it with a friend or two who you think might also appreciate it.
@PeterSeres
@PeterSeres 2 года назад
Nice video! Thanks for detailed explanation. My only problem with this is that I often have to apply functions that depend on sequential time data and a loop setup makes the most sense since the next time step depends on the previous time steps. Are there some advanced methods on how to set up more complex vectorized functions that don't fit into a one-liner expression?
@robmulla
@robmulla 2 года назад
Yes there are! I think I'll probably make a few more videos on the topic considering how interested people seem in this. But I'd suggest if you can do any of your processing that goes across rows in groups - first do a `groupby()` and then you can multiprocess the processing of each group on a different CPU thread. If you have 8 or 16 CPU threads you can speed things up a lot!
@DrewLevitt
@DrewLevitt Год назад
Pandas has a lot of useful time series methods, but without knowing exactly what you're trying to do, it'd be hard to suggest any specific functions. But if you only need to refer to step (n-1) when processing step n, you can use df.shift() to store step n-1 IN the row for step n. Hope this helps!
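A small sketch of the df.shift() idea (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"price": [100.0, 101.5, 99.0, 102.0]})

# Bring the previous row's value into the current row, then compute the
# row-over-row change without any explicit loop.
df["prev_price"] = df["price"].shift(1)
df["change"] = df["price"] - df["prev_price"]
print(df)
```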
@cbritton27
@cbritton27 2 года назад
I had a similar situation creating a new column based on conditions. My data set has 520,000 records, so apply was very slow. I got good results using the select function from numpy. I'm curious how that would compare to the vectorization in your case. Edit: in my case, the numpy select is slightly faster than the vectorization.
@robmulla
@robmulla 2 года назад
Thanks for sharing. It would be cool to see an example code snippet similar to what I used in this video for comparison.
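Not the commenter's actual code, but a small sketch of what np.select looks like on the same kind of reward problem (column names and thresholds are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [95, 30, 40],
    "pct_sleeping": [0.2, 0.9, 0.1],
    "time_in_bed": [4, 8, 6],
    "favorite_food": ["pizza", "taco", "ice-cream"],
    "hate_food": ["eggs", "broccoli", "candy corn"],
})

conditions = [
    df["age"] >= 90,
    (df["pct_sleeping"] > 0.5) & (df["time_in_bed"] > 5),
]
choices = [df["favorite_food"], df["favorite_food"]]

# np.select evaluates the conditions in order; rows matching none get the default.
df["reward"] = np.select(conditions, choices, default=df["hate_food"])
print(df[["age", "reward"]])
```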
@linkernick5379
@linkernick5379 Год назад
The Polars lib is quite fast with my 1-million-row dataset; I recommend trying it.
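For anyone curious, a rough sketch of the same kind of conditional column in polars (the column names reuse the hypothetical reward example above; this is not code from the video):

```python
import polars as pl

df = pl.DataFrame({
    "age": [95, 30, 40],
    "pct_sleeping": [0.2, 0.9, 0.1],
    "time_in_bed": [4, 8, 6],
    "favorite_food": ["pizza", "taco", "ice-cream"],
    "hate_food": ["eggs", "broccoli", "candy corn"],
})

# when/then/otherwise is polars' vectorized conditional expression.
df = df.with_columns(
    pl.when(
        ((pl.col("pct_sleeping") > 0.5) & (pl.col("time_in_bed") > 5))
        | (pl.col("age") >= 90)
    )
    .then(pl.col("favorite_food"))
    .otherwise(pl.col("hate_food"))
    .alias("reward")
)
print(df)
```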
@spicytuna08
@spicytuna08 Год назад
oh my!!! awesome. thanks!!!
@robmulla
@robmulla Год назад
Thanks 🙏
@ersineser7610
@ersineser7610 Год назад
Thank you very much for great video.
@robmulla
@robmulla Год назад
Glad you liked it! Thanks for the feedback.
@vinitjha_
@vinitjha_ 5 месяцев назад
Which font do you use? That's an awesome font and color scheme.
@A372575
@A372575 Год назад
Thanks, one query: in the vectorized case, which one would be faster, np.where or the method you mentioned?
@balajikrishnamoorthy5464
@balajikrishnamoorthy5464 Год назад
I am a beginner; I admire your sound knowledge of Pandas.
@robmulla
@robmulla Год назад
Thanks for watching. Hope you learned some helpful stuff.
@alanhouston5874
@alanhouston5874 Год назад
Can you save lists using Parquet, or is it only applicable to dataframes?
@diegoalmeida2221
@diegoalmeida2221 Год назад
Nice video, though in some cases we want to use a specific complex function from a library. The apply method works fine for that case. But is there a way to use it with vectorization?
@robmulla
@robmulla Год назад
You can try to vectorize using something like numba. But it depends on the complexity of the function.
@blogmaster7920
@blogmaster7920 Год назад
This can be really helpful when moving data from one source to another over the Internet.
@robmulla
@robmulla Год назад
Absolutely, compressing can make any data transfer faster.
@justsayin...1158
@justsayin...1158 8 месяцев назад
It's a great tip, but I don't feel like I understood what vectorized means, or how I make a function vectorized. Is it just creating the boolean array by applying the conditions to the whole data frame in this way, or are there other ways to vectorize as well?
@pietraderdetective8953
@pietraderdetective8953 2 года назад
I have always struggled to understand how vectorization works... this video of yours is the one that made it crystal clear for me. What a great video! Can you please do more of these efficient pandas videos and use some stock market data? Thanks!
@robmulla
@robmulla 2 года назад
Thanks for the feedback. I’m so happy you found this useful. I’ll try my best to do a future video related to stock market data.
@Sinke_100
@Sinke_100 Год назад
Cool. For really large datasets, and when the conditions aren't too complicated, that vectorized method is amazing. Apply is a nice alternative because you can write a function. There should be a module that converts normal functions into this vectorized syntax, because it's quite complicated to write.
@robmulla
@robmulla Год назад
Glad it was helpful! There are some packages that compile functions (numba/jit), and there is also np.vectorize.
@Sinke_100
@Sinke_100 Год назад
@@robmulla I tried playing a bit with it. Pandas is similar to numpy, and I've worked with numpy quite a bit. I tried putting the boolean calculation in a function with 3 distinct conditions for age, pct_sleeping and time in bed; the final return value was the combined condition. df.loc supports putting a function directly in its statement, so I did that. Finally I compared the dataframes created with both methods, and they are the same. My suggestion is that you should explain that more complex stuff in more depth.
@andrew3068
@andrew3068 Год назад
Super awesome video.
@robmulla
@robmulla Год назад
I appreciate that. Thanks for commenting!
@alexisdebrand6209
@alexisdebrand6209 Год назад
So useful, thank you!!!!!
@robmulla
@robmulla Год назад
You're welcome! Thanks for commenting.
@kennethstephani692
@kennethstephani692 Год назад
Great video!!!
@robmulla
@robmulla Год назад
Thank you!!
@johnidouglasmarangon
@johnidouglasmarangon Год назад
Great video Bob, thanks. I'm curious, which interface for Jupyter Notebook are you using?
@robmulla
@robmulla Год назад
Glad you liked it. This is jupyterlab with the solarized dark theme. Check out my full video on jupyter where I go into detail about it.
@johnidouglasmarangon
@johnidouglasmarangon Год назад
@@robmulla Tks Bob ✌️
@robmulla
@robmulla Год назад
@@johnidouglasmarangon no problem. Jane!
@ehsankiani542
@ehsankiani542 Год назад
Thanks Rob
@robmulla
@robmulla Год назад
Thanks for watching!
@adamleon8504
@adamleon8504 11 месяцев назад
In these cases it is easy to vectorize, but how can you vectorize when the process, or the function that needs the df as input, is more complex? For example, can you vectorize a procedure that uses specific rows rather than one column based on a condition, and then uses those elements to perform calculations with a step offset rather than on the same row, for example df.loc[i,"A"] - df.loc[i-1,"B"]?
@kevincannon2269
@kevincannon2269 Год назад
i _am_ excited! show the solution in machine code next pls thx
@robmulla
@robmulla Год назад
Working on it…
@andrewcoyne9768
@andrewcoyne9768 Год назад
Love the video, thanks Rob! Is there a vectorized way to create a column that is the sum of several columns? I tried df['total'] = df.iloc[:, 5:13].sum(), which was way faster but returned all NaN values. Any help would be appreciated.
@robmulla
@robmulla Год назад
So close! I think all you need to do is to change `sum(axis=1)` and it should work!
@andrewcoyne9768
@andrewcoyne9768 Год назад
@@robmulla Brilliant! Works perfect now. Thanks for the quick reply
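In other words, something like this (the column positions are just an illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5), columns=list("abcde"))

# .sum() with no axis sums each column down the rows;
# axis=1 sums across the selected columns for every row.
df["total"] = df.iloc[:, 1:4].sum(axis=1)
print(df)
```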
@biomedphil
@biomedphil Год назад
Nice! I guess you can call it “vectorization”: it’s Boolean masking and you add the logical vectors to get the final logical selection mask. Does pre allocating the reward column make any difference, or is that automatically done in pandas even when you add values one row at a time as in example level 1 And 2 ?
@robmulla
@robmulla Год назад
Good point. I’m not 100% sure if it makes a difference to pre allocate. I’d need to test it out. Thanks for the comment.
@bgotura
@bgotura 7 месяцев назад
I love how that Pandas logo has cannibalized the city of Madrid (Spain) logo.
@Graham_Wideman
@Graham_Wideman Год назад
1:19 "a random integer between one and 100." I believe that should be from 0 to 99 (ie: inclusive at both ends). In case nobody else mentioned it.
@robmulla
@robmulla Год назад
Good catch! I think you are the first to point that out.
@FabioRBelotto
@FabioRBelotto Год назад
Great video. I am working on a df with millions of rows and pandas apply was struggling. I solved it using a vectorized solution as shown. Much, much better. Could you imagine a situation where vectorization would not be possible?
@robmulla
@robmulla Год назад
Glad this helped! As far as examples where vectorization is not possible: if you need to perform an operation that requires branching, such as selecting different values based on some condition, vectorization may not be possible. In that case, you would need to use a loop or some other non-vectorized approach. Another example is working with datasets that have varying lengths or shapes; there it may not be possible to perform operations on the entire dataset using vectorized methods. Hope that helps.
@incremental_failure
@incremental_failure Год назад
Vectorization is the whole point of Pandas. But there are cases where vectorization is impossible and you need to process row-by-row, in that case it's best to switch to numba for a precompiled function.
@RichieStockholm
@RichieStockholm 2 года назад
I expect a video about moped gangs in the future, Rob.
@robmulla
@robmulla 2 года назад
That’s a great idea Richie! I practically majored in moped gangs in college. 😂
@AkaExcel
@AkaExcel 2 года назад
Dear Rob, thank you for your video. I was following your Kaggle music challenge, and as I am a beginner with audio datasets it was difficult to submit even a single baseline model. Could you kindly show us how to submit a baseline model to this challenge? Sincerely, Akmal
@robmulla
@robmulla 2 года назад
Hey Akmal. Thanks for watching my video. For the kaggle music competition - it looks like people have now published some great starter notebooks in the "code" section. I would suggest checking out some of those. You could also chat with me during my next twitch stream and we could discuss.
@djangoworldwide7925
@djangoworldwide7925 3 месяца назад
As an R user, I use vectorization with mutate without even thinking about the other methods for such a task. R is so much more suitable for data science and wrangling.
@FF-ct5dr
@FF-ct5dr Год назад
The Pandas docs literally tell you that iterrows is slow and should be avoided lol. As for vectorization, Pandas uses (slightly tweaked, so they can hold different types) numpy arrays which are stored in contiguous memory blocks... so of course vectorization will be faster than apply/map.
@robmulla
@robmulla Год назад
Yep. This is obvious to a seasoned veteran, but as I mentioned in the video, for many newbies who haven't read the docs and aren't fully aware of the backend, they don't know that iterrows is a bad idea.
@Geza_Molnar_
@Geza_Molnar_ Год назад
@@robmulla Maybe, when you have time for that, you could publish a video that describes to newbies what "RTFM" means, and what is the benefit of that. You are popular, a role model for some 🙂 (in this case "M" -> docs)
@sweealamak628
@sweealamak628 Год назад
I'm kicking myself now for not finding your video 10 months ago. I'm near the completion of my code and resorted to a mix of iterating for loops and small-scale vectorisation by declaring new columns after applying some logic. I seriously need to adopt your methods and redo my code because mine is just not fast enough!
@robmulla
@robmulla Год назад
I totally feel you. It took me years before I really understood how important it is to avoid iterating over rows. Once you learn it, all your pandas code will be much faster.
@sweealamak628
@sweealamak628 Год назад
@@robmulla I just altered one of my for loops and used your vectorized approach! Not only is it faster, I did it in just 3 lines of code and the syntax is much easier to read! I feel so embarrassed because it's much more straightforward than I thought! Now the tricky thing is, I work on a time series dataset where I compare previous rows of data to the current row to get the "result". I assume I can use the "shift" method to look back at a previous row of data. If it works, I'm gonna vectorize everything! THANKS SO MUCH!
@beastmaroc7585
@beastmaroc7585 Год назад
Thank you so much for these game-changer tips...
@robmulla
@robmulla Год назад
Thanks for watching!
@blakeedwards3582
@blakeedwards3582 Год назад
What theme are you using to get your Jupyter Notebook to look like that?
@robmulla
@robmulla Год назад
Solarized dark theme. I have a whole video about my jupyter setup
@rockwellshabani5180
@rockwellshabani5180 2 года назад
Would vectorization also be faster than an np.where statement with multiple conditions?
@robmulla
@robmulla 2 года назад
Great question! I think someone tested it out in the reddit thread where I posted it and found maybe a slight speed increase over the vectorized version.
@georgebrandon7696
@georgebrandon7696 Год назад
np.where() is what I use almost exclusively. However, it tends to be a little unreadable if you need to use additional if statements to go from binary (either or) to 3 or more possible values. Of course, one could also nest np.where() statements too. :)
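A small sketch of the nested np.where pattern described here (the column name and thresholds are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [10, 55, 80, 95]})

# A two-way split is one np.where; a third bucket means nesting another one.
df["grade"] = np.where(
    df["score"] >= 90, "A",
    np.where(df["score"] >= 70, "B", "C"),
)
print(df)
```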
@amitamola2014
@amitamola2014 Год назад
So what about the scenario where we want to perform the same operation but only on one column? Such as, if pct_sleep
@robmulla
@robmulla Год назад
You can use something like qcut in this case, or a vectorized statement with an and/or condition.
@alysmtech3683
@alysmtech3683 Год назад
Jesus, I'm over here blowing up my laptop. Had no idea, thank you!
@robmulla
@robmulla Год назад
Hah. My name is Rob. But glad you learned something new.
@dreamdeckup
@dreamdeckup 2 года назад
I had to do the same thing in my first internship lol. The script went from 4 hours to like 10 minutes to run
@robmulla
@robmulla 2 года назад
Yea, when I learned this it 100% changed the way I write pandas code.
@tlovestatus1632
@tlovestatus1632 2 года назад
Amazing video
@robmulla
@robmulla 2 года назад
Thanks for the feedback!
@kathrynpanger2289
@kathrynpanger2289 2 года назад
What if I want to apply a more complicated or non-numeric test? Like instead of df['pct-sleeping'] > 0.5, I want to check whether "teeth" is in df['dream-themes'] (a list of tags for the things the sleeper dreamt about, e.g. [teeth, whale, dog, slide, school]). Is the only way to do this with .apply, or can this still be vectorized?
@robmulla
@robmulla 2 года назад
This is a good question. I think it depends on the dtype of the dream-themes column. Would it only contain a single value or potentially multiple ones? Check the 'isin' function in the pandas docs; it's a vectorized way of doing this.
@DrewLevitt
@DrewLevitt Год назад
I haven't tested this but try df['dream-themes'].str.contains('teeth'). If df['dream-themes'] is a bunch of comma-delimited strings, you should be good to go (but watch out for partial matches e.g. "teeth whitening" contains "teeth"); not sure whether this will work if df['dream-themes'] contains a bunch of proper lists. Try it and let me know!
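A quick sketch of both suggestions (the dream_themes contents are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "dream_themes": [["teeth", "whale"], ["dog", "school"], ["slide", "teeth"]],
})

# If the column holds actual Python lists, a per-row membership test still
# needs apply (or explode), since there's no elementwise "in" for list cells.
mask_lists = df["dream_themes"].apply(lambda tags: "teeth" in tags)

# If the column were comma-delimited strings instead, str.contains is vectorized.
# The word boundary avoids partial matches like "teeth whitening".
as_strings = df["dream_themes"].str.join(",")
mask_strings = as_strings.str.contains(r"\bteeth\b")

print(mask_lists.tolist(), mask_strings.tolist())
```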
@MrJak3d
@MrJak3d 2 года назад
Damn, I knew lvl 2 but lvl 3 was awesome!
@robmulla
@robmulla 2 года назад
Thanks Jake! Yea, vectorized functions are super fast. If you can't vectorize then there are other ways to make it faster (like chunking and multiprocessing)... I might make a video about that next!
@saxegothaea_conspicua
@saxegothaea_conspicua 7 месяцев назад
can you vectorize using query? I suppose you can't
@co.n.g.studios5710
@co.n.g.studios5710 Год назад
Nice vid. Wouldn't it be even faster if you used .values for the columns? Is this even applicable in the case presented in the example? Looking forward to your answer, cheers.
@robmulla
@robmulla Год назад
Thanks for the comment. Yes, using .values could be faster, thanks for pointing that out. Not sure about the specific part in this video, but worth a try.
@demosthenessss7850
@demosthenessss7850 Год назад
代码写得好顺滑啊，佩服啊！ [The code is written so smoothly, I admire it!]
@robmulla
@robmulla Год назад
Thanks for your comment. Translation I think is "The code is written so smoothly, I admire it!" I'm glad you liked it!
@demosthenessss7850
@demosthenessss7850 Год назад
@@robmulla Yes, the translation is correct. I commented in Chinese because I want there to be more Chinese voices here. 谢谢你的分享，我会继续关注:) [Thanks for sharing, I'll keep following :)]
@BILALAHMAD-cz9gu
@BILALAHMAD-cz9gu 11 месяцев назад
This man is amazing, but I'm poor with English... but I will definitely learn English because of this man.
@robmulla
@robmulla 11 месяцев назад
Thanks. So glad it helped even though it’s not your native tongue!
@DK-rl1sf
@DK-rl1sf 2 года назад
Is there a way to use np.vectorize() instead of df.loc so things are tidier?
@robmulla
@robmulla 2 года назад
That's a great point. I've used np.vectorize before but not too frequently. I agree the current solution isn't very clean to read and could be much tidier.
@DK-rl1sf
@DK-rl1sf 2 года назад
@@robmulla No, this is not so much a suggestion and more a question. I'm new to this and literally don't know. I had previously read about np.vectorize(). I tried doing your vectorize method but using np.vectorize but couldn't figure out the syntax.
@robmulla
@robmulla 2 года назад
​@@DK-rl1sf Yea, there is a lot of overhead when using pandas instead of numpy - but you get the benefit of named columns, easy filtering and sorting. In my experience np.vectorize worked but i was working with just numpy arrays not pandas dataframes.
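For reference, a sketch of how np.vectorize could wrap a row-wise reward function. The column names and rule are the same hypothetical example as above; note the numpy docs describe np.vectorize as a convenience (essentially a loop), so it usually performs closer to apply than to true vectorization:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [95, 30, 40],
    "pct_sleeping": [0.2, 0.9, 0.1],
    "time_in_bed": [4, 8, 6],
    "favorite_food": ["pizza", "taco", "ice-cream"],
    "hate_food": ["eggs", "broccoli", "candy corn"],
})

def reward(age, pct_sleeping, time_in_bed, fav, hate):
    if age >= 90 or (pct_sleeping > 0.5 and time_in_bed > 5):
        return fav
    return hate

# otypes=[object] avoids numpy guessing a fixed-width string dtype from the
# first result and truncating longer food names.
vec_reward = np.vectorize(reward, otypes=[object])
df["reward"] = vec_reward(
    df["age"], df["pct_sleeping"], df["time_in_bed"],
    df["favorite_food"], df["hate_food"],
)
print(df[["age", "reward"]])
```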
@ddmood2
@ddmood2 Год назад
That's all I do at work; vectorizing is the way to go. I was able to do some complex logic with it.
@robmulla
@robmulla Год назад
Love it.
@elgoogffokcuf
@elgoogffokcuf Год назад
What about Numba? If it can bring some more optimization, it would be nice if you made a video on it.
@robmulla
@robmulla Год назад
Numba/jit is great for speeding up more complex operations. I've had limited experience with it, but in every case it really sped things up. Doing a video on it is a good idea.
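A minimal numba sketch along those lines, restricted to numeric columns since njit's nopython mode doesn't handle Python strings well (the 1/0 reward codes and column names are assumptions):

```python
import numpy as np
import pandas as pd
from numba import njit

@njit
def reward_codes(age, pct_sleeping, time_in_bed):
    # Explicit loop, but compiled by numba so it runs at near-C speed.
    out = np.empty(age.shape[0], dtype=np.int64)
    for i in range(age.shape[0]):
        if age[i] >= 90 or (pct_sleeping[i] > 0.5 and time_in_bed[i] > 5):
            out[i] = 1   # gets favorite food
        else:
            out[i] = 0   # gets hated food
    return out

n = 1_000_000
df = pd.DataFrame({
    "age": np.random.randint(0, 100, n),
    "pct_sleeping": np.random.rand(n),
    "time_in_bed": np.random.randint(0, 9, n),
})

# Pass plain NumPy arrays into the compiled function.
codes = reward_codes(
    df["age"].to_numpy(),
    df["pct_sleeping"].to_numpy(),
    df["time_in_bed"].to_numpy(),
)
df["reward"] = np.where(codes == 1, "favorite food", "hated food")
```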
@mbcebrix
@mbcebrix Год назад
Is vectorization applicable for huge datasets? Like millions of records, for example.
@robmulla
@robmulla Год назад
If it can fit in your computer’s memory then yes!
@danielbrett247
@danielbrett247 Год назад
Not everything can be vectorized, commonly when processing time series data. For those cases, a great tool to know about is numba's njit.
@robmulla
@robmulla Год назад
Agreed. Njit / numba can be great when you need to make pseudo-compiled Python code.
@Hitdouble
@Hitdouble Год назад
What theme do you use in your notebook?
@robmulla
@robmulla Год назад
I have a whole video on my jupyter lab setup. But it’s just the solarized dark theme.
@JoseStev
@JoseStev 3 месяца назад
Use polars?
@Pranavshashi
@Pranavshashi Год назад
I'm new so this might be a silly question. I am using an API to get additional data for each row in my dataset. Can I use vectorized approach while making API calls as well?
@robmulla
@robmulla Год назад
That's kind of different. You just want to gather the results from the api as fast as possible. Check out something like async calls to the api. This might help: stackoverflow.com/questions/71232879/how-to-speed-up-async-requests-in-python
@Pranavshashi
@Pranavshashi Год назад
@@robmulla thanks!
@gabrielgarcia302
@gabrielgarcia302 4 месяца назад
True
@chndrl5649
@chndrl5649 Год назад
Could also use query instead of loc
@robmulla
@robmulla Год назад
Not sure that would work for this case because we aren’t straight filtering.
@lucianodomingues2290
@lucianodomingues2290 Год назад
Very useful!!! Thanks for sharing.
@robmulla
@robmulla Год назад
Thanks for watching Luciano!
@lucienjaegers2028
@lucienjaegers2028 Год назад
Nice trick, but what if you code it completely in C / C++ / Rust? Literature says those are 50 - 80 times faster?
@robmulla
@robmulla Год назад
I have a whole video on polars, which is written in rust. It’s faster for sure. But keep in mind pandas backend is just C code.
@dh00mketu
@dh00mketu Год назад
Why didn't you remove the get rewards function from the other run times?
@robmulla
@robmulla Год назад
Oops. Did I do it incorrectly? Can you share the timestamp?
@krishnapullak
@krishnapullak Год назад
Nice tip
@robmulla
@robmulla Год назад
Thx for watching.
@leejunzhao
@leejunzhao Год назад
Question: I followed your code, and an error came out saying "'reward_calc' is not defined".
@wenbozhao4325
@wenbozhao4325 Год назад
In the video the function name has no letter c at the end.
@robmulla
@robmulla Год назад
Oh. Good catch. Sorry it was confusing
@TeXiCiTy
@TeXiCiTy Год назад
For looping over big datasets I switch to polars when speed becomes an issue.
@robmulla
@robmulla Год назад
I have an entire video on my channel about polars. It’s great! Check it out.
@jrwkc
@jrwkc Год назад
When you vectorize with loc, don't you have to vectorize the right side of the equation too? df['favorite_food'] is not masked; it's the whole array, right? So you are setting the reward to the first N of df['favorite_food'], where N is the length of the mask.
@robmulla
@robmulla Год назад
I don't think so because pandas will use the index when populating. But I'm also not 100% sure.
@jrwkc
@jrwkc Год назад
@@robmulla make github repos so we can test! that would be great
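A tiny hypothetical frame that shows the index alignment Rob is describing:

```python
import pandas as pd

df = pd.DataFrame({
    "favorite_food": ["pizza", "taco", "ice-cream", "sushi"],
    "reward": ["none", "none", "none", "none"],
})
mask = pd.Series([False, True, False, True])

# The right-hand Series is aligned by index label, not truncated to the
# first N values, so each masked row gets its *own* favorite_food.
df.loc[mask, "reward"] = df["favorite_food"]
print(df)
# Rows 1 and 3 get "taco" and "sushi"; rows 0 and 2 stay "none".
```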
@ErikS-
@ErikS- Год назад
3.5 seconds for a for loop with only 10k rows... Is this done in a Docker container or another VM(-like) environment?
@robmulla
@robmulla Год назад
Just done locally on my fairly beefy machine.
@Atlas92936
@Atlas92936 Год назад
Nice glasses! where are they from?
@robmulla
@robmulla Год назад
Haha. Thanks! 🤓 - They are from warby parker, but I accidentally broke this pair :(