TidyX is a screencast where we discuss Data Science topics and walk through code line-by-line, explaining what the authors did and how the functions they used work. We also break down the visualizations they create and talk about how to apply similar approaches to other data sets. The objective is to help more people learn R and get involved in the TidyTuesday community.
The hosts are Ellis Hughes (@ellis_hughes) and Patrick Ward (@OSPpatrick).
Ellis has been working with R since 2015 and has a background working as a statistical programmer in support of both Statistical Genetics and HIV Vaccines. He also runs the Seattle UseR Group.
Patrick's current work centers on research and development in professional sport with an emphasis on data analysis in American football. Previously, he was a sport scientist within the Nike Sports Research Lab. Research interests include training and competition analysis as they apply to athlete health, injury, and performance.
Very cool and so useful, simply because event-level dataframes are ubiquitous (especially in a work/company setting), particularly when fetching all this data from nested/tree-structured JSONs. You have a big fan in me here, gentlemen - thanks!! (I'm willing to bet Cohen is a data analyst/scientist for a company involved in sports/NBA bookkeeping or analytics lmao)
Great stuff as always! Clear and easy to understand what can easily be confusing, so thanks! (Very) unrelated to this, but would you be interested in diving (either deep or as an introduction) into the logger package? I think it was created with the intent of mimicking Python's version, and while I find Python's pretty straightforward, for some reason R's version is a little more obscure/harder to grasp for me.
Thanks for the comment! I'll have to look at the logger package a bit again - I think I've used it in the past, but there may be a few other things/concepts you need to know before it makes sense ~ Ellis
You can try the \(df) notation for anonymous functions: purrr::map(my_list, \(df) lm(y ~ x1 + x2, data = df)). This expects a list of data frames called my_list and regresses y against x1 + x2 in each one, supplying each data frame in the list as the data for the regression. It is the same as what you guys did, but this lambda notation was introduced in R 4.1 and is now generally preferred over the ~ .x notation.
This was great! Pretty concise and I learnt a lot. Cleaning data is a lot less time consuming and more intuitive for me now. I didn't know base R was so good at dealing with strings.
Awesome. One package I like a lot is the furrr package. You use the function "future_map" as you do with the "map" function from the purrr package, but in parallel. Pretty easy.
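A minimal sketch of what that furrr swap looks like, assuming a hypothetical list of data frames called my_list (purrr's map() becomes future_map() after picking a parallel backend with plan()):

```r
library(furrr)

# choose a parallel backend; multisession works on all platforms
plan(multisession, workers = 2)

# example input: mtcars split into one data frame per cylinder count
my_list <- split(mtcars, mtcars$cyl)

# same shape of call as purrr::map(), but each model fits in parallel
models <- future_map(my_list, \(df) lm(mpg ~ wt, data = df))

# return to sequential processing when done
plan(sequential)
```

For tiny inputs like this the parallel overhead outweighs the gain; the payoff comes when each element takes noticeable time to process.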
Great approach for doing multiple comparisons. Could you not just replace filter(r1 != r2) with filter(as.numeric(r1) > as.numeric(r2))? I think broom::tidy() on the output of the t-test might have made it a bit easier to combine and extract the data you wanted, although your approach works fine.
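A hedged sketch of both suggestions, using made-up group labels r1/r2 rather than the episode's actual data: keeping only r1 > r2 drops self-pairs and mirrored duplicates in one filter, and broom::tidy() turns each t-test result into a one-row data frame that is easy to bind together.

```r
library(dplyr)
library(broom)

# all ordered pairs of three groups, then keep each unordered pair once
pairs <- expand.grid(r1 = 1:3, r2 = 1:3) |>
  filter(as.numeric(r1) > as.numeric(r2))
# pairs now has rows (2,1), (3,1), (3,2) -- no self-pairs, no mirrors

# tidy() flattens the htest object into columns like estimate,
# statistic, p.value, conf.low, conf.high
set.seed(1)
tidy(t.test(rnorm(20), rnorm(20, mean = 1)))
```

Because tidy() returns a regular data frame, many comparisons can be stacked with bind_rows() and filtered or plotted like any other tibble.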
The same solution came to my mind last week for a similar problem: filtering the indices with i > j gave me the upper triangle (excluding the diagonal) of the Cartesian product matrix. 👍
Another great episode. Looking forward to the episode 200 party edition! In your plots the y axis was count data. It irritates me that ggplot will often put decimal points on scales even if you bother to define the variable as an integer. For one plot it's not too hard to manually input the breaks etc. Is there an easy way of getting round this problem when you are generating lots of plots, as in your example? It should be as simple as saying "this is integer data, 10.5 is meaningless!"
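One common workaround, sketched here as an assumption rather than anything from the episode: define a reusable breaks function that keeps only whole numbers, so a label like 10.5 can never appear, and attach it to every plot.

```r
library(ggplot2)

# breaks function: take the usual pretty breaks, then floor and
# de-duplicate so only integers survive
int_breaks <- function(x) unique(floor(pretty(x)))

# a scale object can be stored once and added to many plots
int_y <- scale_y_continuous(breaks = int_breaks)

p <- ggplot(mtcars, aes(factor(cyl))) +
  geom_bar() +
  int_y
```

Because int_y is just an object, it can be reused across a whole list of programmatically generated plots instead of setting breaks manually each time.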
Any plans to do any more NBA stuff? I'd love to see something like trying to predict some of the awards as the season is winding down, maybe Most Improved player as predicted using ML - similar to the HOF pitching
Great video, guys! And it's great that the baseball season is around the corner, so lots of people should be itching for new data points (I mean, games lol)
Thanks for the reply. We have many episodes on ggplot2 and entire series on tidymodels and shiny. Is there something in particular you'd be interested in seeing? ~patrick
If you must use a for loop...

# Using a for loop
library(rlang)
library(stringr)

wgt_stat_list <- list()
for (i in 1:5) {
  wgt_stat_list[[i]] <- get_wgt_score(
    fake_dat,
    sym(str_c("stat", i)),
    sym(str_c("stat", i, "_n"))
  )
}
Hey Ellis - great walk-through. I see you predicted both Beltre and Mauer, both of whom made it. A-Rod won't make it due to external shenanigans, as you note. The other player who made it was Todd Helton. Going to do pitchers next?
Really appreciate the for loop solution, as I often run into this situation myself and always wondered if you could do this without running the same code multiple times. This has been super helpful!
You can accomplish it with a single pivot_longer() if the column names have a consistent pattern/separator. I gave the stat variables a new suffix ("_score") to make that happen.

fake_dat |>
  rename_with(\(x) str_replace(x, "(.+\\d$)", "\\1_score")) |>
  pivot_longer(
    c(ends_with("_score"), ends_with("_n")),
    names_to = c("stat", ".value"),
    names_sep = "_"
  ) |>
  summarize(
    total_obs = sum(n),
    wgt_stat = weighted.mean(score, n),
    .by = c(athlete, stat)
  )
I am right now building a shiny application prototype that I would like to demo to my colleagues. We use Dataiku in my organization and it has the capability to host shiny apps. However, since our tech guys are all pythonistas, most R packages that I am using don't run well on our Dataiku deployment (our tech guys don't care). I explored other alternatives for shiny app deployment but I couldn't find one that is practical enough. I am now going to try this *shinylive* option. Wish me luck.
Thank you for the awesome tutorial. Did you try building an application with raw data from something like a CSV? Edit* I was able to get my personal data to work. Thank you again for your awesome tutorial.
You guys, it's really kinda pointless to teach/show base R. There are so many cool things to learn in broom, purrr, vroom, prophet, gg's extensions with html, more advanced joining with closest and more... Putting the time on base R with functionality that is so much easier in the tidyverse is just... Meh 🤷🏽♀️
Very interesting! I've always used ggplot2 for plotting, but there are some situations in which I wonder if base R could help me. The main setback is that I'm really tidyverse oriented, so my ability to find resources for non-tidy approaches is lacking a bit. I'll drop here a few questions that interest me and might be interesting for others.

1. gganimate is slow! I often work with 2D path data and it's not always easy to "see" the dynamics of the path. An example of this is gaze data from eye tracking, where plotting the gaze plot all at once often results in an unreadable mess. So I've tried rendering them as animations, but the rendering part is so slow that it is unfeasible to use in the "question-wrangle-plot" loop. I wonder if there are ways to do animations in base R that would make the process quicker.

2. The second question is tied to the first one. Since most of my data has a sampling frequency of at most 60 Hz, but could probably be visualised even at lower frequencies, I've often asked myself if there are ways to render animations in real time. I've seen some stuff from coolbutuseless but it's all a bit out of my league. Here are some examples: coolbutuseless.github.io/package/eventloop/articles/stream-plotting.html coolbutuseless.github.io/package/eventloop/articles/demo-particles.html github.com/coolbutuseless/nara

3. How do you save the plots that you make with base R? In ggplot2 you have ggsave() and you pass it the plot object. Can you even save a base plot as an object?

Thank you for what you do, it's truly helpful.
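On the saving question (point 3), a minimal sketch of the usual base R answer, with made-up filenames: base graphics draw to a device rather than building a plot object, so you open a file device, draw, and close it; recordPlot() is the closest thing to saving a plot as an object.

```r
# write straight to a file device instead of the screen
png("scatter.png", width = 800, height = 600)
plot(mtcars$wt, mtcars$mpg, main = "Weight vs MPG")
dev.off()  # closing the device finalises the file

# capture a plot that is currently displayed as an object...
plot(mtcars$wt, mtcars$mpg)
p <- recordPlot()

# ...and redraw it later, e.g. on a different device
replayPlot(p)
```

The recorded object can even be saved with saveRDS() and replayed in a later session, though it is tied to the graphics engine version it was recorded with.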
Some of my colleagues did a card sorting task for survey development and wanted to display the results in a dendrogram. Can you show how to make a dendrogram (both in ggplot2 and whatever else you find that's useful)?
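Not a full episode, but a hedged sketch of one common route, using a built-in dataset as a stand-in for the card-sorting distances: hclust() builds the tree, base plot() draws it directly, and the ggdendro package converts the result for ggplot2.

```r
# hierarchical clustering on a distance matrix; USArrests stands in
# for a co-occurrence/dissimilarity matrix from the card sort
hc <- hclust(dist(USArrests), method = "ward.D2")

# base R dendrogram
plot(hc, cex = 0.6)

# ggplot2 version via ggdendro
library(ggdendro)
ggdendrogram(hc, rotate = TRUE)
```

ggdendro also exposes dendro_data() if you want the raw segment and label coordinates to build a fully customised ggplot.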