Riffomonas Project
I'm Pat Schloss! I produce videos about how to use data science tools to answer questions about the world around me. I believe that anyone can answer their own questions. Do you?! I'd love to learn more about the world around you. Share the questions you would like to answer and we can take them on together!
Comments
@ianworthington2324 1 day ago
Difficult to believe that the size of a punch card remains the recommended line length all these years later.
@Riffomonas 23 hours ago
Hah!
@user-ps4fb1oy2r 2 days ago
Hello, which package or option did you use to check the integrity of your code after passing it through styler? I'm referring to the one you used via the Build button.
@Riffomonas 1 day ago
That's my package - phylotypr - that I run Build on. Is that what you're asking about?
@user-ps4fb1oy2r 1 day ago
@@Riffomonas yes, thanks
@djangoworldwide7925 2 days ago
Say GitHub Actions detected a lint issue, does it automatically run styler to fix it? If not, what benefit does it have?
@Riffomonas 2 days ago
It doesn't, but there is a separate GHA for running styler. We'll readdress this in the next episode so you can detect problems before pushing. I think the benefit of this particular GHA is for pull requests from others, so that their code can be run through lintr before you pull it into your codebase.
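[Editor's note: a minimal sketch of running the same checks locally before pushing, assuming the {styler} and {lintr} packages are installed; the order shown is just one reasonable workflow, not the one from the video.]

```r
# Style first, then lint whatever styler couldn't fix automatically.
styler::style_pkg()    # rewrites package files in place per the tidyverse style guide
lintr::lint_package()  # reports remaining issues (naming, line length, etc.)
```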
@zjardynliera-hood5609 2 days ago
Hello Patrick, I've watched many of your videos and made my first R package to generate, filter, and sort relative abundance tables and make plots. We do a lot of amplicon sequencing from environmental samples at uWaterloo. The GitHub repo is zjardyn/bubbler and the package is mostly done. I am still scared to submit it to CRAN lol.
@joshstat8114 5 days ago
Thank you for showing the benchmark of their performance (I still recommend the `bench` package, though). How about `tidypolars` (in R, not Python)?
@Riffomonas 2 days ago
I'll have to check out the tidypolars package, this was a new one to me. Thanks for watching!
@matthewson8917 5 days ago
It was surprising that base pipe was generally slower than magrittr pipe
@Riffomonas 5 days ago
Thanks for watching! More experimenting with both suggests that it really depends on the context. Any difference is really minimal
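[Editor's note: a quick sketch of how such a pipe comparison might be run with {microbenchmark}; the subset() example is illustrative, not the benchmark from the video.]

```r
library(magrittr)         # provides %>%
library(microbenchmark)

microbenchmark(
  magrittr = mtcars %>% subset(cyl == 4) %>% nrow(),  # magrittr pipe
  base     = mtcars |> subset(cyl == 4) |> nrow(),    # base pipe (R >= 4.1)
  times = 1000
)
```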
@jlntp1642 9 days ago
Thank you, I love this series. I am wondering whether an R analysis project could be done in a package-driven way. What is your opinion on this?
@Riffomonas 9 days ago
Definitely, I've seen people create papers as R packages - check out this pre-print peerj.com/preprints/3192v2/
@jlntp1642 9 days ago
@@Riffomonas Thanks for sharing. I also tried to embed my analytic code into a package... but at some point I felt that the generated data were too heavy for an R package. Also, I never considered (but would like to try) using testthat for an analytical workflow. In addition, beyond report generation, it would be nice to include GitHub Actions within the workflow.
@Riffomonas 8 days ago
Something like Test Driven Development for data analysis is always rolling around in the back of my head :) It's hard because it requires functions to test, and most data analyses don't use homemade functions; they use functions from other packages, which are hopefully already tested.
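[Editor's note: one hedged sketch of what "tests for an analysis" could look like with {testthat}: instead of testing homemade functions, assert properties the cleaned data must satisfy. The data frame here is made up for illustration.]

```r
library(testthat)

# Stand-in for a data frame produced earlier in an analysis script
clean <- data.frame(sample_id = c("a1", "a2"),
                    rel_abund = c(0.4, 0.6))

test_that("cleaned data meet basic expectations", {
  expect_false(any(duplicated(clean$sample_id)))                 # ids are unique
  expect_true(all(clean$rel_abund >= 0 & clean$rel_abund <= 1))  # proportions in [0, 1]
})
```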
@djangoworldwide7925 9 days ago
Great notes on the upper right of the screen for further reading. Thanks Pat! We're all excited about you submitting your pkg to CRAN 🤞🏻
@Riffomonas 9 days ago
Thanks! A few more weeks 🤓
@MrJL-xk5jb 12 days ago
Your videos are so useful, lots of thanks. How do you comment (#) more than one line at the same time?
@Riffomonas 10 days ago
I highlight the lines and then use the shortcut to comment them. On my Mac it's Shift-Command-C.
@user-vp4ix6ff3b 12 days ago
Ours also uses slurm
@grahamsharpe9812 14 days ago
Why do you use tibble? And not modify data in excel? Or is this a preference thing?
@Riffomonas 14 days ago
It's about reproducibility and transparency. Modifying data in Excel is very much not reproducible or transparent; it's next to impossible to document changes in Excel files. Also, Excel is $$$ whereas R is free. With R, we can document all of the changes in the script and rerun the script multiple times without having to worry about breaking things. And if I make a mistake, I can easily correct it by reprocessing the raw data with the corrected code.
@kpicsoffice4246 15 days ago
Thank you. Great work
@Riffomonas 15 days ago
My pleasure - thanks for watching!🤓
@kkanden 16 days ago
Despite not currently developing an R package or planning to, I really enjoy watching your series on making one, especially since you're showing the backstage, raw and "ugly" side of coding (the typos in particular). If it ever comes to me creating a package, I'll make sure to use this series as a reference. Cheers!
@Riffomonas 16 days ago
wonderful - thanks for watching!
@chooby364 17 days ago
I don't like my axis labels to repeat themselves. Is there a way to have a single centered x and/or y label? I've been trying to figure it out but cannot manage to get it right.
@Riffomonas 17 days ago
This type of thing is much easier using the {patchwork} package. Thanks for watching!
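[Editor's note: a minimal sketch of the {patchwork} approach, assuming patchwork >= 1.2.0 for the axis_titles option; the plots are toy examples.]

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(disp, mpg)) + geom_point()

# axis_titles = "collect" merges identical axis titles across panels,
# so the shared y-axis label appears once, centered on the combined plot.
(p1 | p2) + plot_layout(axis_titles = "collect")
```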
@user-ro9ex5im2p 19 days ago
This was great. Thank you :)
@Riffomonas 17 days ago
My pleasure! thanks for watching🤓
@Universe624 19 days ago
How can I access the glb.ts+dsst.csv data file?
@Riffomonas 17 days ago
You can get it through the links in the blog post linked in the show notes. github.com/riffomonas/climate_viz/tree/a428f64b7db493145bf84ec6f38f8e89da258675/data
@hassanhijazi4757 21 days ago
Hey Pat, what is the best practice when you want to populate a list but you don't know upfront how long it might grow? What do you do in this case?
@Riffomonas 21 days ago
You can certainly grow the list, which really isn't a problem if you don't think it will be long. Alternatively, I've also seen people initialize a list that's larger than you think it will be and then prune it after you know how big it actually should be
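[Editor's note: a sketch of the preallocate-then-prune idea described above; the cutoff used here is arbitrary, just to give the loop an end.]

```r
n_max <- 1000                    # generous guess at the upper bound
results <- vector("list", n_max) # preallocate so the list isn't copied as it grows

i <- 0
repeat {
  value <- runif(1)
  if (value > 0.95) break        # stand-in for "no more data to collect"
  i <- i + 1
  results[[i]] <- value
}

results <- results[seq_len(i)]   # prune the unused tail
```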
@samlawrence4627 22 days ago
Thank you for the video. I just have a question about the @examples section. When I preview the documentation file under the help tab in Rstudio, I see a link that says "Run examples." When I click this link, it takes me to a blank page that says "Example/<function name> not found." When I look at other packages, I can click on this link, and it shows the output given by the examples. Is there something I have to do to get this link to work? Or does this happen after the package is submitted?
@Riffomonas 22 days ago
I think that only works on packages already on CRAN
@samlawrence4627 21 days ago
@@Riffomonas Okay, thank you
@ericagardner8249 26 days ago
Thank you, this is so helpful :)
@Riffomonas 23 days ago
my pleasure! thanks for watching 🤓
@djangoworldwide7925 29 days ago
Using a tmp dir is such a flex. I really gotta use this more often...
@Riffomonas 29 days ago
hah - thanks!
@souIsynapse 1 month ago
I faxed this to the author of my favorite package, whose last update was from the early Devonian. Thanks!
@Riffomonas 29 days ago
lol - thanks for watching 🤓
@markusmuller65656 1 month ago
Thanks for sharing.
@Riffomonas 29 days ago
absolutely - thanks for watching!
@rags3791 1 month ago
excellent
@Riffomonas 29 days ago
my pleasure! thanks for watching :)
@SuperDashdash 1 month ago
Sir, I have always been thrilled by your R techniques and your way of explaining them. Your videos led me to switch to R [and of course RStudio] from Python [Jupyter] and got me genuinely enjoying analytics with R. I work in the aerospace industry. While the organization leverages a couple of premium visualization tools, even for exploratory analysis, I have been using ggplot extensively along with my basic statistical knowledge since discovering your amazing videos, and of course teaching my colleagues. Thank you, @Riffomonas. Please keep posting more videos to enlighten thirsty analysts like me.
@Riffomonas 29 days ago
Thanks for your very kind comment!
@mocabeentrill 1 month ago
Wow! Thanks Pat. This was the most advanced episode in the series and it requires one to be well versed in the intricacies of base R. Thoroughly enjoyed it!
@Riffomonas 29 days ago
wonderful! sometimes it's fun to go into the weeds a bit :)
@chuckbecker4983 1 month ago
Great instructional video, thanks! During the pandemic I became proficient at this but haven't used Git in a couple of years. You provided just the guidance I needed.
@Riffomonas 29 days ago
So glad to hear it - thanks for watching!
@Jeep-d7c 1 month ago
Have you tried the join from the collapse package? It is very fast in my tests. collapse::join(x, y, how="inner", on=c("a"="b"))
@Riffomonas 29 days ago
Thanks - I'll have to check that out
@user-rf9ow8ck2l 1 month ago
One of the frustrating limitations of the conda approach for R is that only a fraction of the packages on CRAN are compiled and installable via conda. Other packages are not on CRAN, and it's not clear how to install those via conda. Any thoughts on this?
@Riffomonas 1 month ago
I have found that the major packages are available in one of the conda collections. It's not a horrible process to contribute a conda package if you need to. This SO link is somewhat helpful stackoverflow.com/questions/52061664/install-r-package-from-github-using-conda
@user-ro9ex5im2p 1 month ago
This was great! Thank you
@Riffomonas 1 month ago
Thanks for watching - I'm glad you enjoyed it!
@meronghirmay4960 1 month ago
Three years since this video was posted, and here I am too making a nice figure. And this is not the only video that has helped me. Thank you very much, Dr Schloss.
@Riffomonas 1 month ago
Hah! Thanks so much for watching. I'm glad you're finding my videos helpful 🤓
@ahmed007Jaber 1 month ago
Hi Pat, thank you for this. Could you please check the blog post? I guess it is not uploaded yet. Thank you so much for the knowledge sharing and the effort you put in, which have helped me immensely.
@Riffomonas 1 month ago
Thanks for the heads up - it's there now
@mariliaamaralmarcondes6943 1 month ago
I am from Brazil, and your explanation is so good that I can understand all of your class. Thank you very much.
@Riffomonas 1 month ago
Wonderful - my pleasure!
@monzerthejoker343 1 month ago
I didn't understand anything
@Riffomonas 1 month ago
Sorry! If there's anything specific let me know what was confusing
@djangoworldwide7925 1 month ago
3 mins ago. I must be your number one fan. Thanks Pat! Your series is great.
@Riffomonas 1 month ago
lol - well, you're #1 today 😂
@AdamHillier-h7p 1 month ago
Hi, I could not load the BiodiversityR package. Error: package 'tcltk' could not be loaded. Any ideas?
@Riffomonas 1 month ago
Sorry, I'm not familiar with the BiodiversityR package. You might try installing tcltk and then try BiodiversityR again
@miissJoceLyn 1 month ago
This is an outstanding explanation of everything you are doing here. This is amazing, this content really contributes to science. Thank you so much <3
@Riffomonas 1 month ago
my pleasure! thanks for watching🤓
@user-sb9oc3bm7u 1 month ago
I love how Julia is always an honoured guest in these debates
@Riffomonas 1 month ago
Hey, if your local community uses Julia, go for it. Same goes for Fortran, Haskell, whatever. There's no debate. The only rule is that people need to learn to program. It's best to learn what your local community uses. No interest in engaging in any type of language wars here 🤓
@PhilippusCesena 1 month ago
Great video as always!
@Riffomonas 1 month ago
Thanks Philippus!
@djangoworldwide7925 1 month ago
11:09 "will" instead of "with" - not sure if you fixed it later. 23:45 should be fixed to @returns.
@Riffomonas 1 month ago
Thanks! I think I got the @returns after editing the video :) It really doesn't matter whether you use @return or @returns, but @returns is the newer convention.
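[Editor's note: for reference, a minimal roxygen2 skeleton using the newer @returns tag; the function and wording are made up for illustration.]

```r
#' Add two numbers
#'
#' @param x A numeric vector.
#' @param y A numeric vector.
#'
#' @returns A numeric vector: the element-wise sum of `x` and `y`.
#' @export
#'
#' @examples
#' add_pair(1, 2)
add_pair <- function(x, y) {
  x + y
}
```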
@djangoworldwide7925 1 month ago
I wonder how the process might look today if you did something like this with ChatGPT: providing the function to ChatGPT, uploading the roxygen documentation (optional), and asking it to write documentation with the important key headers, including examples. That way you can make sure your wording is consistent across functions, as well as argument names. You must provide some context, but I bet it can save a lot of time.
@shadyamigo 1 month ago
I’ve used it for this very purpose. It’s very good and saves so much time
@Riffomonas 1 month ago
I'll make a deal with you and all of my other viewers... I promise that I will never intentionally use ChatGPT or its ilk to generate code, documentation, anything on my channel :)
@djangoworldwide7925 1 month ago
I believe you, seeing how well you recall complex regex ;) @@Riffomonas
@joshstat8114 1 month ago
May I recommend using `bench::mark()` whenever you benchmark expressions?
@Riffomonas 1 month ago
Thanks - I've used it in other episodes, but I find that {microbenchmark} is easier to use for some applications
@elforich 1 month ago
Very well put together, thanks
@Riffomonas 1 month ago
Thanks for watching!
@tedhermann3424 1 month ago
Just to note, using as.data.table() or setDT() will be considerably faster than data.table(). data.table also comes with its own version of merge() so you don't have to use the funky syntax for a full merge.
@Riffomonas 1 month ago
Thanks for the feedback - I'm finding that if I use as.data.table or setDT, I get similar results to plain data.table and inner_join
@tedhermann3424 1 month ago
@@Riffomonas I made synthetic datasets since I don't have your fasta data. In my test, dtA was the fastest, followed by using setDT and as.data.table. data.table() was similar to dplyr with inner_join. Here is my code:

each_num <- 1e4

animal_legs <- map_dfr(
  data.frame(animal = c("cow", "fish", "chicken", "dog", "sheep"),
             n_legs = c(4, 0, 2, 4, 4)),
  rep, each = each_num) %>%
  mutate(n = 1:nrow(.))

animal_sounds <- map_dfr(
  data.frame(animal = c("cow", "chicken", "cat", "sheep", "dog"),
             sounds = c("mooo", "cluck", "meow", "baaa", "bark")),
  rep, each = each_num) %>%
  mutate(n = 1:nrow(.))

# make a copy for setDT because it changes things in place,
# which would affect everything else relying on animal_legs and animal_sounds
# in microbenchmark.
animal_legs_2 <- copy(animal_legs)
animal_sounds_2 <- copy(animal_sounds)

# data.table for dtA test
animal_legs_dt <- data.table::data.table(animal_legs, key = "n")
animal_sounds_dt <- data.table::data.table(animal_sounds, key = "n")

microbenchmark::microbenchmark(
  base = base::merge(animal_legs, animal_sounds, by = "n", all = FALSE),
  ij = dplyr::inner_join(animal_legs, animal_sounds, by = "n"),
  dt = {
    animal_legs_dt_test <- data.table::data.table(animal_legs, key = "n")
    animal_sounds_dt_test <- data.table::data.table(animal_sounds, key = "n")
    animal_legs_dt_test[animal_sounds_dt_test, nomatch = NULL, on = .(n)] # inner join
  },
  dtA = animal_legs_dt[animal_sounds_dt, nomatch = NULL, on = .(n)],
  as_dt = {
    animal_legs_dt_test <- data.table::as.data.table(animal_legs, key = "n")
    animal_sounds_dt_test <- data.table::as.data.table(animal_sounds, key = "n")
    animal_legs_dt_test[animal_sounds_dt_test, nomatch = NULL, on = .(n)]
  },
  set_dt = setDT(animal_legs_2)[setDT(animal_sounds_2), nomatch = NULL, on = .(n)]
)

Here's the results table on my computer:

Unit: milliseconds
   expr       min        lq      mean    median        uq        max neval cld
   base 29.754131 31.562985 33.248288 32.509195 34.936214  38.573338   100 a
     ij  3.847486  4.260528  5.063156  4.359981  4.581972  10.221760   100 b
     dt  3.840432  4.029632  6.483946  4.263007  4.815313 145.038057   100 b
    dtA  2.474058  2.599248  4.473535  2.822705  3.102000 137.916202   100 b
  as_dt  3.255442  3.403854  4.397897  3.637299  4.121683  11.064186   100 b
 set_dt  2.881791  2.995842  3.809361  3.250692  3.767623   7.726401   100 b

Base R was actually the fastest when I did this test with the original animal_* datasets, but it clearly doesn't scale very well.
@tedhermann3424 1 month ago
@@Riffomonas I've tried replying numerous times, but my comment gets removed each time; I think it doesn't like the code snippet I'm trying to share. Anyway, I made synthetic datasets ~50,000 rows long, where each row is a unique group, so that it is comparable to your fasta data. dtA is consistently fastest, followed by setDT and as.data.table. One thing I had to control for was using a copy of the data frame for setDT (e.g., df_copy <- copy(df)) because setDT works in place. If you use the same df reference for all items in microbenchmark, you run the risk of setDT changing your data frame in place to a data.table; then any subsequent runs with data.table() or as.data.table() take the same amount of time because df is already a data.table. Maybe YouTube will let me share the code as a GitHub link: github.com/mrguyperson/joins_example/blob/main/R/joins.R
@PhilippusCesena 1 month ago
Great video. I was used to dplyr, and it is very interesting to see other approaches.
@Riffomonas 1 month ago
Thanks! Glad you enjoyed it 🤓
@rayflyers 1 month ago
A few thoughts come to mind.

1) dplyr always outputs tibbles. If you're going to use dplyr, it might be worth using tibbles throughout your package. The loss in performance is worth the consistent formatting, and tibbles are just better.

2) dplyr allows for multiple backends (dtplyr, dbplyr, duckplyr, arrow, etc). Would those affect your code? If I call duckplyr::methods_overwrite(), and a package has a custom function that calls dplyr::inner_join() under the hood, would it now call duckplyr::inner_join() under the hood instead?

3) Similarly, if I pipe a dataframe into dtplyr::lazy_dt() and then into a custom join function that calls dplyr::inner_join() under the hood, would it work and use the data.table method? Or would dtplyr just not know how to translate the code?

I know that you're not planning to write a custom join function, but your video still sparked these curiosities in me. Lately I've been looking at these dplyr backends as a way to scale up our work for big data projects without making my team learn new syntax, so they've been on my mind a lot. Great video as always!
@Riffomonas 1 month ago
Great - thanks for the feedback. For now, the input to the phylotypr functions will be data.frames, but they should work fine if people provide tibbles or data.tables. The output will be base R structures like lists and character strings.
@jmoggridge 1 month ago
> class(iris)
[1] "data.frame"
> x <- tibble::tibble(Species = iris$Species[1])
> class(x)
[1] "tbl_df"     "tbl"        "data.frame"
> iris |> dplyr::inner_join(x) |> class()
[1] "data.frame"
> iris |> dplyr::inner_join(x) |> tibble::as_tibble() |> class()
[1] "tbl_df"     "tbl"        "data.frame"
@spacelem 1 month ago
That was super helpful, thank you! I'll admit that joining was something I hadn't really got the hang of, and even though I have gone through the tutorials, I didn't really appreciate what was going on. Only rolling joins left to figure out, and then I can say I've mastered data.table! I would add, though, that animal_legs_dt[animal_sounds_dt[uniq_animals]] for a full join is... pretty ugly! Instead, data.table provides its own version of merge that looks exactly like base R.
@Riffomonas 1 month ago
Thanks - I hadn't seen data.table::merge. That would simplify things considerably
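[Editor's note: a small sketch of the merge() method for data.tables, with toy data; merge.data.table dispatches automatically when both inputs are data.tables.]

```r
library(data.table)

animal_legs_dt   <- data.table(animal = c("cow", "fish", "dog"),
                               n_legs = c(4, 0, 4))
animal_sounds_dt <- data.table(animal = c("cow", "dog", "cat"),
                               sound  = c("moo", "bark", "meow"))

# Same interface as base::merge(); all = TRUE gives a full join
merge(animal_legs_dt, animal_sounds_dt, by = "animal", all = TRUE)
```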
@aidanmorales2576 1 month ago
Excellent video as always! You might want to check out the tidytable package by Mark Fairbanks. It provides a fast data.table backend for many dplyr, purrr, and tidyr functions, with tidyverse syntax. I find it works great for speeding up R package development and code, while keeping dependencies down and keeping the code readable/maintainable for those not as familiar with data.table.
@markrandall7631 2 months ago
I have been trying this, and some URLs exit cleanly without being in single quotes while others need the URL in single quotes to exit. I've switched to wrapping all URLs in single quotes.
@Riffomonas 1 month ago
Yeah, bash can sometimes do different things with single vs double quotes. A backslash can be useful for escaping quotes if you need quotes within quotes.
@markrandall7631 1 month ago
@@Riffomonas this was without the URL in any quotes, like your script.
@guani2155 2 months ago
Hi Pat, thanks for the nice video! When I use nmds <- metaMDS(shared, autotransform = FALSE) and then scores(nmds), the output has both $sites (which is the Group here) and $species (OTUs). I cannot directly pipe it to ggplot. I wonder how you deal with it? Thanks!
@Riffomonas 2 months ago
Hmmm, I'm not sure - why are you giving metaMDS shared instead of a distance matrix? Could that be the difference between what you and I are doing? github.com/riffomonas/distances/blob/main/code/nmds.R
@guani2155 2 months ago
@@Riffomonas But at 12:27, you were using nmds <- metaMDS(shared, autotransform = FALSE), with shared as the input?
@Riffomonas 2 months ago
The rest of the video goes on to say that the defaults were not ideal and that rarefaction of the data was necessary
@guani2155 1 month ago
@@Riffomonas I see, thank you Pat!
@danielkwawuvi_tutorials 2 months ago
Thank you for the guidance. Do you have a video on performing Principal Component Analysis on microbial data? It will be helpful to see you do this. I am learning a lot from you, Prof.
@Riffomonas 2 months ago
Thanks for watching! Check out these two videos: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-G5Qckqq5Erw.html and ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-h7OrVmT7Ja8.html