Hey, Pat. Happy 2024! We missed you in 2023. I hope you're okay. I started teaching an R class last year, and I always recommend your videos to my students. Best wishes!
Hi Pat, I just wanted to let you know how much I enjoy your work and how valuable it’s been to me. I’ve literally been binging your episodes and I can honestly say I’ve become a better R user because of you. Thank you, and keep up the good work!
Thanks for these videos! I was looking for an introduction to Snakemake that starts from scratch, and this was the perfect walkthrough.

About the conflicts you were running into: something I've seen pretty often is deploying the webpage from a separate branch. You can set up an action that runs whatever workflow renders your webpage documents and then pushes the result to a different branch. Then you change your settings to target that particular branch. The advantage of doing it this way is that you prevent the situation you ran into by keeping the output of the pipeline (the webpage and figure) separate from your code. Then, if you want to make changes to your code, you don't have to worry about pulling down all the revisions resulting from pipeline runs. It's not an issue here, but it also avoids the situation where you're working on a team, everyone is generating their own outputs, and everyone's repo gets out of sync. The action I use is peaceiris/actions-gh-pages. I add a rule to put all of the webpage files into a docs folder, which I target with the action. Maybe a little overkill for this simple website, but this workflow is extensible to more complicated websites (and dovetails nicely with Quarto webpages). You can see my implementation of your project here: github.com/pommevilla/drought_index.

Another comment: you use `snakemake -c 1 ...` to run the workflow, and you've mentioned before that you designed the workflow to work with one processor. Snakemake actually determines which rules can be run together based on the DAG. Rules run as soon as their dependencies are completed, so if a rule doesn't have any (for example, leaf nodes in the DAG), it can run right away. In my modified workflow (see the DAG in the README on my repo), there are 4 child nodes, so I could technically call `snakemake -c 4 ...` to run those four jobs in parallel.
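Roughly, the branch-deploy setup looks like this (just a sketch — the workflow name, trigger branch, and render command are placeholders for whatever your pipeline uses; `github_token` and `publish_dir` are the action's documented inputs):

```yaml
# .github/workflows/deploy.yml (sketch; adapt the render step to your pipeline)
name: render-and-deploy
on:
  push:
    branches: [main]
jobs:
  build-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Placeholder: run whatever renders the site into docs/
      - name: Render site
        run: snakemake -c 1
      # Push the rendered docs/ folder to the gh-pages branch
      - uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./docs
```

In the repo's Pages settings, you'd then point the site at the gh-pages branch.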
Also, when `get_all_archive` runs, it can use one of the extra cores to run one of its two dependencies instead of waiting for the single processor to open up. I'm not sure how much runtime you'd actually gain here, since the biggest chokepoints are the downloads and reading the dly files, but it's something to keep in mind. Again, thank you so much for these videos! I learned a lot of good stuff here, and I'm looking forward to future videos.
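To illustrate the scheduling idea (this is just the concept, not Snakemake's actual internals): a job becomes runnable as soon as all of its dependencies have finished, so all the leaf nodes can start at once. A toy sketch in Python, with made-up rule names:

```python
# Toy DAG: rule name -> list of rules it depends on (names are hypothetical).
deps = {
    "download_a": [],            # leaf nodes: no dependencies,
    "download_b": [],            # so they can start immediately
    "clean_a": ["download_a"],
    "clean_b": ["download_b"],
    "get_all_archive": ["clean_a", "clean_b"],
}

def schedule(deps):
    """Group jobs into waves; jobs in the same wave could run in parallel."""
    done, waves = set(), []
    while len(done) < len(deps):
        ready = sorted(j for j, d in deps.items()
                       if j not in done and all(x in done for x in d))
        waves.append(ready)
        done.update(ready)
    return waves

print(schedule(deps))
# -> [['download_a', 'download_b'], ['clean_a', 'clean_b'], ['get_all_archive']]
```

With `-c 2` or more, each wave's jobs can run concurrently instead of queueing on a single core.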
Pat, I am not sure where this would fit, but since you are dealing with large datasets in your climate series, you are probably already familiar with the arrow and duckdb packages. arrow allows you to work with larger-than-memory datasets in R. One of the main drawbacks of R is that it loads everything into memory and can thus be slow. arrow (which has bindings for a bunch of different languages - Python, Rust, MATLAB, and so forth) offers an interface similar to data.table in R but is much faster. The key is that arrow uses a columnar data structure (Parquet files) that works much more efficiently than the normal row-wise formats (e.g., CSV). DuckDB is a structured database that lives on your local drive (no need for cloud storage, even for large files) and is quickly gaining ground. Both of these have R APIs and are excellent for big data. I would love to see you cover these. Thanks, H
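As a toy illustration of why columnar layouts like Parquet help (not arrow's actual implementation, just the idea): computing on one column touches only that column's contiguous values, rather than visiting every field of every row:

```python
# Row-wise layout (like a CSV): each record keeps all of its fields together.
rows = [
    {"station": "A", "tmax": 20.1},
    {"station": "B", "tmax": 18.4},
]

# Columnar layout (the idea behind Parquet): one contiguous array per column.
columns = {
    "station": ["A", "B"],
    "tmax": [20.1, 18.4],
}

# Averaging tmax only needs the tmax column; the station strings are never
# touched. In the row-wise layout, every record would have to be visited.
mean_tmax = sum(columns["tmax"]) / len(columns["tmax"])
print(mean_tmax)  # 19.25
```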
He hasn't made videos for the last 5 months. His videos were very good, with a lot of useful tips, tricks, and workflows. I hope he's okay and doing well.
Nice videos as usual. Since you are playing around with different programs, I was wondering if you had looked at ImageMagick? As far as I can see, it can do amazing things both inside and outside of R. I would love to learn more about it beyond the rudimentary stuff I know. I hope you are willing to explore it and do a video on it. Thanks!
Hi Pat. You have fantastic episodes about coding. Thank you! However, we scientists use a lot of IC50/EC50 computations. Would it be possible to do an episode on this topic? Maybe using the drc package in R. Thanks again - Kamal