Great video! Benchmarking is such a powerful tool. Of course people can game the benchmarks, but they go to show that you shouldn't get too attached to one particular tech, because everything can change once a new system shows better performance!
Absolutely. Hopefully my recent benchmarks have shown that there are a lot of factors that can impact performance. It's really important to be clear about the assumptions that go into the test.
I learned about duckdb at posit::conf last year. It seems like a good tool, but I primarily use arrow when I need speed (for larger data) and DBI and dbplyr when I need to work with a database.
Thanks for watching. Check out the pinned comment (be sure to expand it to see the whole thing) where I added arrow to the comparison. For this test, it is actually slower than duckdb!
Pat, nice to see a video on DuckDB! I have been playing with the arrow package (another space-saving type of approach), but it recently stopped working on my Mac (M1). It is another package worth considering.
Thanks for tuning in! Not sure why arrow wouldn't work on an M1. That's what I have and was able to get it to work. Check out the pinned comment (be sure to expand it to see the whole thing) where I added arrow to the comparison. For this test, it is actually slower than duckdb!
@Riffomonas Pat, when I run arrow_info() I get FALSE on every item except the first (acero). I just updated R and RStudio, but that did not fix the issue.
That is remarkably satisfying watching all of those benchmarks jostle for supremacy! I think one question that might still be good to examine (although I really don't know how you'd do it), given that your initial problem was that your data was too big to fit in memory, is how memory efficient each of these methods are? "Slow but fits in memory" might beat "fast but my machine can't handle it".
Thank you for showing the benchmark of their performance (I still recommend the `bench` package, though). How about `tidypolars` (in R, not in Python)?
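For anyone unfamiliar with it, `bench::mark()` runs each expression repeatedly, verifies the results agree, and reports timing and memory allocation in one call. A minimal sketch (the two expressions compared here are placeholders, not the queries from the video):

```r
library(bench)

x <- runif(1e5)

# bench::mark() times each expression, checks their results are
# equivalent, and reports median time plus memory allocated
results <- bench::mark(
  sqrt_fun = sqrt(x),
  power_op = x^0.5
)

print(results[, c("expression", "median", "mem_alloc")])
```

The built-in result check is handy for benchmarks like these, since it catches the case where two "equivalent" implementations quietly return different answers.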
Very nice and thought-provoking. My understanding of DuckDB is that it is basically a way to handle large datasets by storing them on disk, so you don't eat up RAM and slow things down (the larger-than-memory selling point of DuckDB), and by only loading in what you need - not the entire dataset. So maybe asking about speed compared to an in-memory matrix approach may be a bit of an apples-vs-oranges deal?
@Riffomonas Climate scientists have been using NetCDF files for decades. Those are supposed to be very memory efficient. Is that an option for you? I do realize that eventually you have to pick something and move on.
Hello! Thanks a lot for your clarity and these useful tutorials! When I have large data to process, I sometimes try to parallelize my script with packages such as doParallel in R. Any thoughts on that?
I have used the future and furrr packages in the past. They make it easy to work with parallelization when you're trying to speed things up. Thanks for watching!
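For anyone who hasn't tried future/furrr, the basic pattern is: pick a backend with `plan()`, then swap `purrr::map()` for `furrr::future_map()`. A minimal sketch (the toy function and worker count are made up for illustration):

```r
library(future)
library(furrr)

# Choose a parallel backend; 2 workers is an arbitrary choice here
plan(multisession, workers = 2)

# A stand-in for an expensive computation
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# future_map() is a drop-in, parallel version of purrr::map()
result <- future_map(1:4, slow_square)

plan(sequential)  # return to sequential processing when done
unlist(result)
```

The nice part is that switching between sequential and parallel execution only requires changing the `plan()` call, not the mapping code itself.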
Hey Pat, great video! I see you scrolling a lot - wouldn't breaking the script into sections help, since your code is getting soooooo long (and the comments still suggest more benchmarking :P)
I decided not to go with duckplyr since the print output is a bit annoying. I couldn't see enough rows because of all the extra info there. How do you silence this?
Comparing keyed data.tables to non-keyed, non-indexed duckdb tables seems unfair, since duckdb does support keys and indices. Have you tested keyed and/or indexed tables in duckdb? If I'm not mistaken, the duckdb un-keyed versions outperformed the data.table un-keyed versions?
Thanks for watching! I'm not able to find duckdb/duckplyr documentation on setting keys. Can you point it out to me? But you are correct that dt without keys is slower than duckdb. I showed this in the current (and previous) episodes. The get_dt_threek function is keyed and took 421k ns, get_dt_three (not keyed) took 104941k ns, and get_duck_three took 2474k ns.
@Riffomonas I provided a direct link in an earlier comment, but YouTube appears to have dropped it. But if you search for "indexing" on the DuckDB website, you'll find that keys are "implicitly indexed" by adaptive radix trees (ARTs). I expect that keying the duckdb table will improve performance on your query benchmarks, but I'd be interested in learning how much.
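Since duckdb doesn't expose a `setkey()`-style R function, indexes are created through plain SQL via DBI. A minimal sketch (the table and column names here are made up; a PRIMARY KEY or UNIQUE constraint would also create an ART index implicitly):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())  # in-memory database for illustration

# A toy table standing in for the real data
dbWriteTable(con, "samples",
             data.frame(id = 1:1000, value = rnorm(1000)))

# Explicitly build an ART index on the lookup column
dbExecute(con, "CREATE INDEX idx_samples_id ON samples (id)")

# Point lookups like this are what the index is meant to speed up
res <- dbGetQuery(con, "SELECT value FROM samples WHERE id = 42")

dbDisconnect(con, shutdown = TRUE)
```

That would put the duckdb timings on the same footing as the keyed data.table runs, though whether the index helps at this data size would need to be measured.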