How Fast is Python's Sort? Performance Testing 

mCoding
230K subscribers · 42K views

Performance testing Python's builtin sort.
Learning to performance test is an important part of transitioning from a beginner Python level to an intermediate-to-advanced Python level. Measure, measure, measure! Get coding!
― mCoding with James Murphy (mcoding.io)
Source code: github.com/mCo...
SUPPORT ME ⭐
---------------------------------------------------
Patreon: / mcoding
Paypal: www.paypal.com...
Other donations: mcoding.io/donate
BE ACTIVE IN MY COMMUNITY 😄
---------------------------------------------------
Discord: / discord
Github: github.com/mCo...
Reddit: / mcoding
Facebook: / james.mcoding

Science

Published: 14 Aug 2024

Comments: 86
@oddlang687 · 3 years ago
Came for TimSort performance, stayed for the general performance testing knowledge
@mCoding · 3 years ago
I try to include a more general lesson in my videos beyond the answer to the title!
@janknoblich4129 · 3 years ago
Python's built-in sort is btw called TimSort. It's a merge sort and insertion sort hybrid.
@_trbr · 3 years ago
You could say it is a merge of merge and insertion, or just inmergion, which happens to sound like immersion, which is a similar concept.
@mCoding · 3 years ago
Thanks for mentioning! It's named after Tim Peters, by the way!
@MagnumCarta · 3 years ago
@_trbr I have memoized this answer.
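For intuition, here is a toy sketch of that hybrid shape. Real Timsort is far more sophisticated (it detects naturally occurring runs and "gallops" during merges); this just insertion-sorts fixed-size chunks and merges them pairwise:

```python
import random


def toy_timsort(a, min_run=32):
    """Toy merge/insertion hybrid: insertion-sort chunks, then merge them."""
    n = len(a)
    # Phase 1: insertion sort each chunk of size min_run (the "runs").
    for start in range(0, n, min_run):
        end = min(start + min_run, n)
        for i in range(start + 1, end):
            item, j = a[i], i - 1
            while j >= start and a[j] > item:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = item
    # Phase 2: merge neighboring runs, doubling the run width each pass.
    width = min_run
    while width < n:
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            merged, i, j = [], lo, mid
            while i < mid and j < hi:
                if a[i] <= a[j]:
                    merged.append(a[i])
                    i += 1
                else:
                    merged.append(a[j])
                    j += 1
            merged += a[i:mid]
            merged += a[j:hi]
            a[lo:hi] = merged
        width *= 2
    return a


data = [random.random() for _ in range(1000)]
assert toy_timsort(data.copy()) == sorted(data)
```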
@JohnZakaria · 3 years ago
I found your results very insightful and inspiring. I hadn't thought about performance testing that way. Your video is unique. I would be interested in seeing more performance testing stuff.
@mCoding · 3 years ago
Thanks for the kind words!
@shreydixit2690 · 3 years ago
You deserve way more subscribers
@mCoding · 3 years ago
Thanks for the kind words!
@yazanalj1975 · 3 years ago
I'm a Python beginner and your videos help me out a great deal; they're really fascinating and cool.
@mCoding · 3 years ago
Great to hear!
@AJMansfield1 · 3 years ago
Re: other things running at the same time, you can at least mitigate scheduling differences by using `process_time_ns()` instead of `perf_counter_ns()`. That'll let you compare the user+system time (i.e. the amount of time actually spent executing the particular process and its kernel calls, not including pauses in which other processes were scheduled) instead of comparing wall time (i.e. the length of the total real-world interval between when the 'start timer' bit got scheduled and when the 'stop timer' bit got scheduled). That doesn't eliminate _all_ variation (e.g. you still waste process time after a context switch on branch mispredictions and cache misses and page faults and such), but it'd at least let you ignore the _bulk_ of the CPU time allocated to other processes. One other trick: call `os.sched_yield()` right before you capture the start time. This gives the kernel the opportunity to get other pending processes out of the way that it might otherwise decide to schedule during the first millisecond of your test run, and ensures that your test case starts right at the beginning of a jiffy (and therefore has an entire jiffy to run before the next scheduler interrupt, meaning test cases shorter than 1 ms likely won't get interrupted at all).
@mCoding · 3 years ago
Cool sched_yield trick! Hmm... I read somewhere that process_time_ns should not be used for performance testing because it is less accurate than perf_counter despite having the same resolution; I'll have to look into it. It certainly wouldn't be a good idea if your code ever intentionally sleeps, but I suppose it just depends on what you are trying to measure.
@AJMansfield1 · 3 years ago
@mCoding Yeah, I guess it does depend on what you're trying to measure. If you're trying to get a sense for how long a running process is going to take (e.g. updating a progress bar), you'd want to use perf_counter, since there you _want_ the figures to account for scheduling congestion. Something like process_time, on the other hand, isn't even _trying_ to be an accurate representation of how much wall time a program using that algorithm might take - the only purpose is getting a low-noise relative estimate of a process's "cumulative cpu utilization" (which is really the main thing we care about when doing algorithm A/B testing).
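A minimal sketch of the two-clock idea from this thread (note that `os.sched_yield` only exists on some platforms, hence the guard):

```python
import os
import time


def bench(fn, *args):
    """Time fn(*args) with both wall-clock and process-time clocks."""
    if hasattr(os, "sched_yield"):
        os.sched_yield()  # let pending processes run before we start timing
    wall0 = time.perf_counter_ns()
    cpu0 = time.process_time_ns()
    fn(*args)
    cpu = time.process_time_ns() - cpu0
    wall = time.perf_counter_ns() - wall0
    return wall, cpu


data = list(range(100_000, 0, -1))
wall, cpu = bench(sorted, data)
print(f"wall: {wall / 1e6:.2f} ms, cpu: {cpu / 1e6:.2f} ms")
```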
@stilgarfifrawi7155 · 3 years ago
Didn't know performance testing went this deep... another rabbit hole to explore!
@mCoding · 3 years ago
Enjoy!
@MagnumCarta · 3 years ago
The biggest part of learning is not how to learn something new but how to think about something you already know.
@PrashantKg1996 · 3 years ago
I am more interested in knowing why exactly the second sort is slower than the first one, and whether there is a way around it.
@mCoding · 3 years ago
Do you mean the function and function2 difference? It's likely because of having a cold system cache. Look up "cache warming" to read more on this.
@psy8428 · 3 years ago
Cool video. What about using `__slots__` on the classes to be sorted? Should at least beat out the regular class because of the optimization in class attribute lookup, but I wonder how close it comes to the tuple sorting.
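A rough sketch of that experiment (the class and variable names here are made up for illustration, not from the video's code):

```python
import random
import time


class Point:
    def __init__(self, x):
        self.x = x


class SlottedPoint:
    __slots__ = ("x",)  # no per-instance __dict__; faster attribute lookup

    def __init__(self, x):
        self.x = x


def time_sort(items, key):
    start = time.perf_counter()
    sorted(items, key=key)
    return time.perf_counter() - start


xs = [random.random() for _ in range(100_000)]
print("class :", time_sort([Point(x) for x in xs], lambda p: p.x))
print("slots :", time_sort([SlottedPoint(x) for x in xs], lambda p: p.x))
print("tuple :", time_sort([(x,) for x in xs], lambda t: t[0]))
```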
@connorclub6244 · 1 year ago
"nearby electromagnetic fields" got me. Ah yes, thinking about electromagnetic fields when sorting, who doesn't?!
@kiefac · 3 years ago
Is there a reason the nearly sorted ints and fully sorted random strings have those dips in elapsed time at what appears to be powers of 2? (Most notable points on both graphs are at ~256 and ~512 elements)
@mCoding · 3 years ago
Indeed, from listobject.c in CPython: `/* Avoid malloc for small temp arrays. */ #define MERGESTATE_TEMP_SIZE 256`. This is the cause.
@toneser · 3 years ago
TL;DR: It's probably because exponential search, which is used to merge already sorted parts of the array, uses powers of 2. Not quite sure what's going on with random strings (it may have to do with the fact that they are compared symbol by symbol), but here's the deal with almost sorted ints.

The algorithm used in Python is Timsort (en.wikipedia.org/wiki/Timsort). This algorithm tries to take advantage of runs (consecutive elements that are **already sorted**) that are present in real-world data. It keeps track of three runs and merges two of them whenever necessary. While merging, if it thinks a large chunk of the final merged run will be taken from one run, it tries jumping further ahead in that run to find the position where it can skip to before switching to the other run, using exponential search/galloping (en.wikipedia.org/wiki/Exponential_search), which uses powers of 2; example below. And this situation of a bunch of runs in the data is exactly what we see with almost ordered data like these ints.

Exponential search example: we have a run that begins like [1,2,3,4,5,6,7,...] and another one that begins like [396,397,398,...]. We've already checked 7 elements of the first run, and we still haven't used anything from the second one. Then let's try jumping to the 15th element in the first run (call that [15]). Say we see 15 there. Then we can try jumping to [31]. Say we see 31 there. OK, then we jump to [63]. We see 401 there, and 401 > 396. Now we find where we need to start using the second run to build our final merged run via binary search (en.wikipedia.org/wiki/Binary_search) (the position is between [31] and [63] in this case).

In these tests we make exactly 10 swaps. It means that as list sizes get larger we get less randomness in the list (since more numbers end up in runs). At some point the randomness becomes small enough for the algorithm to utilize exponential search better and better. Try increasing the number of swaps - those dips will only start popping up much later, with larger sizes. I tried using 1 swap, and the first dip appeared at size 64 already, but with 100 swaps there were no dips up to 1000 (I didn't bother checking for larger sizes), which is not surprising since our average run size becomes ~10, which is kinda small, so the algorithm rightfully decides to avoid galloping when merging runs.

Now, I'm not exactly sure *why* these dips happen at powers of 2, but I'm assuming it's because the algorithm decides minimum run size using powers of 2 and the list length to increase efficiency (see wiki), plus average run sizes become just large enough at these sizes for the algorithm to efficiently use galloping, since more numbers end up in order. For example at size 127 with 8 swaps the average run size is ≈16, so the algorithm decides it should use it. At sizes like 16 with 8 swaps the average run size is
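For the curious, here is a pure-Python sketch of the exponential (galloping) search idea described above; CPython's real version lives in C inside listobject.c:

```python
import bisect


def exponential_search(run, target, start=0):
    """Find the insertion point for target in sorted run, galloping from start."""
    step = 1
    lo, hi = start, start + 1
    # Double the step until we overshoot the target (or run out of run).
    while hi < len(run) and run[hi] < target:
        lo, hi = hi, hi + step
        step *= 2
    hi = min(hi, len(run))
    # Binary search only within the last bracketed range [lo, hi).
    return bisect.bisect_left(run, target, lo, hi)


run = list(range(1, 100))           # [1, 2, 3, ..., 99]
print(exponential_search(run, 64))  # 63: index where 64 sits in the run
```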
@TheYellowBlueWhite · 3 years ago
Thanks for a very interesting analysis.
@mCoding · 3 years ago
Glad you liked it!
@player6769 · 1 year ago
This channel makes me feel like a genius
@auntwilson · 3 years ago
What's with the dips in time for the nearly sorted ints at around 256 (?) and 512 (?)?
@mCoding · 3 years ago
We can guess the answer from the source code, from listobject.c: `/* Avoid malloc for small temp arrays. */ #define MERGESTATE_TEMP_SIZE 256`
@auntwilson · 3 years ago
@mCoding Thanks!
@jbaidley · 3 years ago
@mCoding Shouldn't that produce upward jumps, not downward ones?
@xanri7673 · 3 years ago
@jbaidley I wonder. It implies that malloc is somehow faster than putting the variables on the stack. Maybe that is due to freeing and reallocating (test order again), in which case malloc can actually get significantly faster (stackoverflow.com/questions/41869662/is-malloc-faster-when-i-freed-memory-before). Not sure if at that point it would be faster than the stack, but who knows? To allocate the array on the stack you need to shift the stack pointer, whereas when you just get the pointer from `malloc` there is no shifting.
@cmyk8964 · 3 years ago
@jbaidley I suspect that is where TimSort changes strategies from one better suited for small lists (such as insertion sort) to one that has extra overhead but is faster for larger lists.
@Marquesian · 3 years ago
What I found surprising is that in the end result, sort appears to perform better on the GeometricString cases than on the UniformInt cases. After all, string comparison is usually/intuitively slower than integer comparison. Then I figured, since you're restricting the characters of the strings to the printable characters, it is similar to the n-ary FewInts cases, but the GeometricString cases outperform those too! Perhaps my first assumption doesn't hold... Any insights on the matter?
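One way to poke at this question yourself (a rough sketch, not the video's harness: fixed-length strings here rather than the geometric lengths the GeometricString case presumably uses):

```python
import random
import string
import timeit

n = 100_000
ints = [random.randrange(1_000_000) for _ in range(n)]
strs = ["".join(random.choices(string.printable, k=8)) for _ in range(n)]

# Time sorting each dataset; note most string comparisons bail out
# after the first differing character.
for name, data in [("ints", ints), ("strs", strs)]:
    t = timeit.timeit(lambda: sorted(data), number=10)
    print(name, t / 10, "seconds per sort")
```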
@tedarcher9120 · 3 years ago
How much faster is Python's builtin than a custom quicksort? Than Linux sort?
@Ovicron · 3 years ago
Much faster. It's optimized, and I heard the sorting algorithm Python uses, "timsort", is a little faster than a regular quicksort on sorted data.
@mCoding · 3 years ago
Indeed, builtins are much faster than writing it yourself, and timsort is closer in spirit to a stable mergesort.
@gregorymorse8423 · 3 years ago
Numpy probably wins over builtins if you use numpy arrays, thanks to vectorization in the implementation.
@von_nobody · 1 year ago
Another caveat: this is a microbenchmark that shows the max possible speed of the given code, but it won't match exactly when the same code runs in a bigger app. Other parts of the code can interfere with its speed, and you will not get the same results in a real environment; one of many reasons is the CPU cache. The conclusion is to start with benchmarks like this, but when you finish the work, benchmark the full application too, to see if your code is better than the old one.
@javiercmh · 3 years ago
Hi, thanks for the videos. What plotting library are you using? I like that it looks interactive.
@mCoding · 3 years ago
This is just matplotlib. Search for the "Legend Picking" example, which shows you how to make it so you can click on lines in the legend to toggle the plots on/off. I also threw in an ax.relim and ax.autoscale_view to make the axes auto adjust on toggle. See the github code for details!
@javiercmh · 3 years ago
@mCoding Thanks!! Will do :)
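For readers who don't want to dig, here is a minimal sketch of that setup, adapted from matplotlib's "Legend picking" example plus the relim/autoscale touch mentioned above (styling details are up to you):

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
t = np.linspace(0, 1, 200)
lines = [ax.plot(t, np.sin(2 * np.pi * f * t), label=f"{f} Hz")[0]
         for f in (1, 2, 3)]
legend = ax.legend()
line_map = dict(zip(legend.get_lines(), lines))
for legline in legend.get_lines():
    legline.set_picker(5)  # 5-point pick radius on the legend entry


def on_pick(event):
    origline = line_map[event.artist]
    visible = not origline.get_visible()
    origline.set_visible(visible)
    event.artist.set_alpha(1.0 if visible else 0.2)  # dim hidden entries
    ax.relim(visible_only=True)  # recompute limits from visible lines only
    ax.autoscale_view()
    fig.canvas.draw_idle()


fig.canvas.mpl_connect("pick_event", on_pick)
plt.show()
```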
@MrRussianComrad · 3 years ago
Does the order matter because there is a cold-start of some sort?
@mCoding · 3 years ago
Yep!
@modernfreeman4228 · 3 years ago
Could you make a video on performance testing some substring search algorithms? Great content 👌
@rosgori · 3 years ago
9:27 I didn't know that you can do that with Matplotlib!
@MrTyty527 · 3 years ago
Very interesting topic, high quality content. The Python world is big, arghhhhhh.
@mCoding · 3 years ago
Thank you! I'm finding it is a small world lately! But lots to learn no matter what!
@AntonioZL · 3 years ago
Off topic, but what do you think are the biggest advantages of 'statically typed Python'?
@mCoding · 3 years ago
Python is a dynamic language and will always be, but type hints that can be checked _before_ running the code is a huge win. Remember that code is read typically many more times than it is written. Type hints are a great way to communicate to the reader what your intentions are. And if you actually check your type hints with a static analyzer, then you can catch bugs _before_ you run the code. No more "you forgot to pass an argument" or "wrong order of arguments" or "passed a str but expected bytes" errors.
@CFSworks · 3 years ago
@mCoding Plus, editors can benefit from the type hints to provide autocomplete!
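As a tiny illustration of the "passed a str but expected bytes" case (hypothetical function; the comment paraphrases roughly what a checker like mypy reports):

```python
def decode(data: bytes) -> str:
    return data.decode("utf-8")


decode("already a str")  # flagged before running: incompatible type
                         # "str" where "bytes" was expected
```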
@carlosmspk · 3 years ago
So... what's with the step-like curve of the nearly sorted?
@doganayozese4163 · 3 years ago
Sorry, kind of irrelevant, but I've gotta ask: that's a great theme you have. Can you share which one it is?
@mCoding · 3 years ago
The PyCharm built-in theme "Darcula"
@doganayozese4163 · 3 years ago
@mCoding Thank you very much
@mikailkhan9166 · 3 years ago
Hey, I would really appreciate videos about languages other than Python, although I understand that Python videos probably get a lot more views.
@ArunNarasimhan_arun1997107 · 3 years ago
Very interesting content!
@mCoding · 3 years ago
Thanks!
@woter4101 · 3 years ago
Isn't there a way to check how much time has passed in a process? That way you can properly profile it in any environment and other running programs don't mess with it.
@mCoding · 3 years ago
Unfortunately there is no way to do this accurately. The best we can do is use the high-precision clock time.perf_counter or time.perf_counter_ns, but an alpha particle can hit your CPU and slow a single execution down at any time, so you really need to run things multiple times. Also, your operating system decides when Python gets CPU cycles, so it can just randomly decide to make Python wait. You can see this in some of the graphs where there is a huge spike relative to nearby points.
@woter4101 · 3 years ago
@mCoding I meant time.process_time(); it gets the time passed in the process, and it doesn't matter if the OS made Python wait.
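Whichever clock you pick, the usual mitigation for the scheduling noise described above is to repeat the measurement and keep the minimum; a small sketch with the stdlib `timeit` module:

```python
import random
import timeit

data = [random.random() for _ in range(10_000)]

# repeat=5 batches of number=100 sorts each; min() discards batches
# that were interrupted by the OS or slowed by one-off events.
times = timeit.repeat(lambda: sorted(data), number=100, repeat=5)
print(min(times) / 100, "seconds per sort (best case)")
```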
@mkhnuser · 3 years ago
It is a pretty good video, you deserved a like ;)
@mCoding · 3 years ago
I appreciate it!
@robertbrummayer4908 · 2 years ago
Excellent
@superoriginalhandle · 3 years ago
Interesting!
@mCoding · 3 years ago
Glad you think so!
@omidgholami2594 · 3 years ago
Great work, well done.
@mCoding · 3 years ago
Thank you!
@Tferdz · 3 years ago
You should have used exponentially distributed numbers for the sizes.
@mCoding · 3 years ago
I was more interested in accurate measurements than saving space in this video. If you want to use exponential spacing (which would be necessary if you wanted to test a higher upper limit) feel free to run it yourself and just change the sizes list!
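If you want to try that, the change is a one-liner in spirit (hypothetical variable name; see the repo for the real one):

```python
# Linear spacing (dense and accurate, but a limited range):
sizes = list(range(1, 1001))

# Exponential spacing (sparse, covers a much larger range):
sizes = [2 ** k for k in range(1, 21)]  # 2 .. ~1M
```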
@unperrier5998 · 2 years ago
At 7:08, wouldn't the order be affected by the test cases, since they each set the pseudo-RNG seed? I know the test cases are using numpy, but 1) numpy could interact with the C runtime in some way and affect the random module, and 2) you don't always want to install numpy just for a test.
@gregorymorse8423 · 3 years ago
So the bottom line is: forget Python classes, just pack your data structures into lists and tuples and use function calls. When references are necessary, I suppose this will not work, though. But of course object references should be avoided if possible. A union-find disjoint set might then be most efficient as just a dictionary of lists (or tuples, at the cost of reconstruction on changes), simply using the keys of the dictionary in the tuples, rather than making a Node class. It's like C-style programming is more efficient than C++-style programming, likely because the interpreter does not have enough optimizations for classes yet.
@MagnumCarta · 3 years ago
At the same time, keep in mind "how much performance do I need?". Readable code is better than fast code if the refactoring time is high. Doing things through classes may have some pitfalls, but if the code is easier to read (provided a new hire has a good comprehension of OOP concepts), then it may be worth the slight performance impact, especially if the size of the input is small. We should also be cognizant of code needing updates over time as we learn edge cases or work with new hardware/software. So just as it's important to measure the worst case, we should also look at the best case and the average case to determine when to refactor/update our code.
@gregorymorse8423 · 3 years ago
@MagnumCarta Yes, only in performance-critical code should such tricks be considered.
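A quick hedged way to check the tuples-vs-objects claim yourself (illustrative names, not from the video's harness):

```python
import operator
import random
import timeit


class Node:
    def __init__(self, key, payload):
        self.key = key
        self.payload = payload


pairs = [(random.random(), i) for i in range(50_000)]
nodes = [Node(k, p) for k, p in pairs]

# Tuples compare element by element, so no key function is needed;
# objects need an attribute lookup per comparison via the key.
print("tuples :", timeit.timeit(lambda: sorted(pairs), number=10))
print("objects:", timeit.timeit(
    lambda: sorted(nodes, key=operator.attrgetter("key")), number=10))
```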
@shambledeggs · 3 years ago
Can we take a moment and think about why the first example, with function and function1, produced two different results?
@simivb · 3 years ago
Performance can also depend on ... "nearby electromagnetic fields"??? Excuse me, what??
@MagnumCarta · 3 years ago
Since we measure things in a binary fashion (each bit of data is either 0 or 1, two possible values), it is easy for outside radiation to potentially cause a bit to flip. This is how electromagnetic fields can cause data corruption. Think of measuring a bit as either 0 or 1 as asking "is the voltage level above {{ THRESHOLD_AMOUNT }}?", where THRESHOLD_AMOUNT is something like 0.5 volts (the actual figure can vary, so don't rely on that particular value; it is a case example commonly seen in textbooks talking about voltage levels of circuits), and the "{{ }}" is a template-style way of interpolating the value of a variable (where THRESHOLD_AMOUNT=0.5). So if we read the value and it is above that amount (e.g. 0.5), we just arbitrarily say it is a 1, else it is a 0. Even a slight uptick at some point in time from an outside radio wave could push the voltage level across the threshold we're measuring against.
@rban123 · 3 years ago
Or we could just look at the source code and review the asymptotic complexity of the function.
@mCoding · 3 years ago
Knowing Python's sort is n log n doesn't really answer the question. As I mention in the video, it depends on many many other factors that are not captured from the time complexity. E.g. 1000000 n log n and 3 n log n are both O(n log n), but one is much worse than the other. And even if you knew it was 3 n log n, this still doesn't take into account cache performance and many other things that can vary the speed of an algorithm by factors of 10000.
@rban123 · 3 years ago
@mCoding Python's sorting algorithm is O(n) but yeah, there are lots of other factors.
@rban123 · 3 years ago
Which are machine-specific, of course.
@cmyk8964 · 3 years ago
@rban123
> sorting
> O(n)
It's time for your evening medication and electroshock therapy, sir.
@patrickglinski329 · 5 months ago
Hey man, I know this is old, but please don't just display a list of bullet points on the screen and say "The first 4...". Many viewers don't watch the video; they listen only. Great content overall though! Cheers
@sk8sbest · 3 years ago
Wtf @ tuples taking 1/10th the time it takes for lists
@mCoding · 3 years ago
Zing! This is why you gotta test yourself!