Tag: rstats

Visualizing Sampling Distributions in ggplot2: Adding area under the curve

Thank you ggplot2tutor for solving one of my struggles. Apparently this is all it takes:

library(ggplot2)
ggplot(NULL, aes(x = c(-3, 3))) +
  stat_function(fun = dnorm, geom = "line")

I can’t begin to count how often I have wanted to visualize a (normal) distribution in a plot. For instance to show how my sample differs from expectations, or to highlight the skewness of the scores on a particular variable. I wish I’d known earlier that I could just add one simple geom to my ggplot!

Want a different mean and standard deviation? Just pass a list to the args argument:

ggplot(NULL, aes(x = c(0, 20))) +
  stat_function(fun = dnorm,
                geom = "area",
                args = list(
                  mean = 10,
                  sd = 3
                ))
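To show how a sample differs from expectations, the same stat_function() layer can be drawn on top of a histogram. The following is a minimal sketch and is not taken from the ggplot2tutor post: the simulated `scores` data frame and its `value` column are made up for illustration.

```r
library(ggplot2)

# Hypothetical sample: 200 simulated scores (illustration only)
set.seed(42)
scores <- data.frame(value = rnorm(200, mean = 10, sd = 3))

p <- ggplot(scores, aes(x = value)) +
  # Scale the histogram to densities so it is comparable to dnorm()
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "grey80") +
  # Overlay the normal curve implied by the sample mean and sd
  stat_function(fun = dnorm,
                args = list(mean = mean(scores$value), sd = sd(scores$value)),
                colour = "steelblue")

p
```

Because the histogram is drawn on the density scale, bars that rise above or fall below the curve show where the sample departs from a normal distribution with the same mean and standard deviation.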

Need a different distribution? Just pass a different distribution function to stat_function. For instance, an F-distribution, with the df function:

ggplot(NULL, aes(x = c(0, 5))) +
  stat_function(fun = df,
                geom = "area",
                args = list(
                  df1 = 2,
                  df2 = 10
                ))
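The quantile functions that accompany each density (qnorm, qf, and so on) can be combined with the xlim argument of stat_function() to shade only a tail. This is a sketch under my own assumptions: the 5% cut-off and the plotting range of 0 to 8 are arbitrary choices for illustration.

```r
library(ggplot2)

# Critical value that leaves 5% of the F(2, 10) distribution in the upper tail
crit <- qf(0.95, df1 = 2, df2 = 10)

p <- ggplot(NULL, aes(x = c(0, 8))) +
  # Full density curve as a line
  stat_function(fun = df, geom = "line",
                args = list(df1 = 2, df2 = 10)) +
  # Shade only the area beyond the critical value
  stat_function(fun = df, geom = "area", fill = "steelblue",
                args = list(df1 = 2, df2 = 10),
                xlim = c(crit, 8))

p
```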

You can make it as complex as you want. The original ggplot2tutor blog provides this example:

ggplot(NULL, aes(x = c(-3, 5))) +
  stat_function(
    fun = dnorm,
    geom = "area",
    fill = "steelblue",
    alpha = .3
  ) +
  stat_function(
    fun = dnorm,
    geom = "area",
    fill = "steelblue",
    xlim = c(qnorm(.95), 4)
  ) +
  stat_function(
    fun = dnorm,
    geom = "line",
    linetype = 2,
    fill = "steelblue",
    alpha = .5,
    args = list(
      mean = 2
    )
  ) +
  labs(
    title = "Type I Error",
    x = "z-score",
    y = "Density"
  ) +
  scale_x_continuous(limits = c(-3, 5))
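The areas in this plot can also be checked numerically with pnorm(). The sketch below is my own addition rather than part of the ggplot2tutor example; the names `crit`, `alpha`, and `power` are chosen for illustration.

```r
# Cut-off used in the plot: the 95th percentile of the standard normal
crit <- qnorm(0.95)

# Dark shaded area: probability beyond the cut-off under the standard normal
alpha <- 1 - pnorm(crit)

# Area beyond the same cut-off under the dashed curve (mean = 2, sd = 1)
power <- 1 - pnorm(crit, mean = 2)

round(c(crit = crit, alpha = alpha, power = power), 3)
```

Here alpha is 0.05 by construction, while the dashed distribution places roughly 64% of its mass beyond the same cut-off.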

Have a look at the original blog here: https://ggplot2tutor.com/sampling_distribution/sampling_distribution/

StatQuest: Statistical concepts, clearly explained

Josh Starmer is an assistant professor at the genetics department of the University of North Carolina at Chapel Hill.

But more importantly:
Josh is the mastermind behind StatQuest!

StatQuest is a YouTube channel (and website) dedicated to explaining complex statistical concepts (like data distributions, probability, or novel machine learning algorithms) in simple terms.

Once you watch one of Josh’s “Stat-Quests”, you immediately recognize the effort he put into this project. Using great visuals, a just-about-right pace, and relatable examples, Josh makes statistics accessible to everyone. For instance, take this series on logistic regression:

And do you really know what happens under the hood when you run a principal component analysis? After this video you will:

Or are you more interested in learning the fundamental concepts behind machine learning? Then Josh has some videos for you, for instance on bias and variance or gradient descent:

With nearly 200 videos and counting, StatQuest is truly an amazing resource for students and teachers on topics related to statistics and data analytics. For some of the concepts, Josh even posted videos running you through the analysis steps and results interpretation in the R language.

StatQuest started out as an attempt to explain statistics to my co-workers – who are all genetics researchers at UNC-Chapel Hill. They did these amazing experiments, but they didn’t always know what to do with the data they generated. That was my job. But I wanted them to understand that what I do isn’t magic – it’s actually quite simple. It only seems hard because it’s all wrapped up in confusing terminology and typically communicated using equations. I found that if I stripped away the terminology and communicated the concepts using pictures, it became easy to understand.

Over time I made more and more StatQuests and now it’s my passion on YouTube.

Josh Starmer via https://statquest.org/about/

rstudio::conf 2019 summary

Cool intro video!
Thanks to Amelia for pointing to it

Welcome to rstudio::conf 2019

Similar to last year, I was not able to attend rstudio::conf 2019.

Fortunately, so much of the conference is shared on Twitter and media outlets that I still felt included. Here are some things that I liked and learned from, despite the Austin-Tilburg distance.

All presentations are streamed

One great thing about rstudio::conf is that all presentations are streamed and later posted on the RStudio website.

Of what I’ve already reviewed, I really liked Jenny Bryan’s presentation on lazy evaluation, Max Kuhn’s presentation on parsnip, and teaching data science with puzzles by Irene Steves. Also, the gt package is a serious power tool! And I was already a gganimate fanboy, as you know from here and here.

One of the insights shared in Jenny Bryan’s talk that can be a life-saver

I think I’m going to watch all talks over the coming weekends!

Slides & Extra Materials

There’s an official rstudio-conf repository on GitHub hosting many materials in an orderly fashion.

Karl Broman made his own awesome GitHub repository with links to the videos, the slides, and all kinds of extra resources.

Karl’s handy GitHub repo of rstudio::conf

All takeaways in a handy #rstudioconf Shiny app

Garrick Aden-Buie made a fabulous Shiny app that allows you to review all #rstudioconf tweets during and since the conference. It even includes some random statistics about the tweets, and a page with all the shared media.

Some random takeaways

Via this tweet about this rstudio::conf presentation
Some words of wisdom by Emily Robinson (whom we know from here)
You should consider joining #tidytuesday!

Extra: Online RStudio Webinars

Did you know that RStudio also posts all the webinars they host? There really are some hidden pearls among them. For instance, this presentation by Nathan Stephens on rendering rmarkdown to powerpoint will save me tons of work, and those new to broom will also be astonished by this webinar by Alex Hayes.

Mathematical aRt

Marcus Volz is a research fellow at the University of Melbourne, studying geometric networks, optimisation and computational geometry. He’s interested in visualisation, and always looking for opportunities to represent complex information in novel ways to accelerate learning and uncover the unexpected.

One of Marcus’ hobbies is the visualization of mathematical patterns and statistical algorithms via R. He has a whole portfolio full of them, including a GitHub page with all the associated R code. For my recent promotion, my girlfriend asked Marcus to generate a k-nearest neighbors visual and she had it printed on a large canvas.


The picture contains about 10,000 points, uniformly distributed at random across x and y, each connected by lines to its closest other points. Marcus shared the code to generate such k-nearest neighbor plots here on GitHub. So if you know your way around R, you could make your own version:

# Assumes dplyr and purrr are installed and the %>% pipe is attached, e.g. via library(dplyr)

#' k-nearest neighbour graph
#' Computes a k-nearest neighbour graph for a given set of points. Refer to the \href{https://en.wikipedia.org/wiki/Nearest_neighbor_graph}{Wikipedia article} for details.
#' @param points A data frame with x, y coordinates for the points
#' @param k Number of neighbours
#' @keywords nearest neighbour graph
#' @export
#' @examples
#' k_nearest_neighbour_graph()

k_nearest_neighbour_graph <- function(points, k = 8) {
  # Return the k points closest to point number ptnum, one row per edge
  get_k_nearest <- function(points, ptnum, k) {
    xi <- points$x[ptnum]
    yi <- points$y[ptnum]
    points %>%
      dplyr::mutate(dist = sqrt((x - xi)^2 + (y - yi)^2)) %>%
      dplyr::arrange(dist) %>%
      # Row 1 is the point itself (distance 0), so keep rows 2 to k + 1
      dplyr::filter(dplyr::row_number() %in% seq(2, k + 1)) %>%
      dplyr::mutate(xend = xi, yend = yi)
  }

  # Repeat for every point and bind the results into one data frame
  1:nrow(points) %>%
    purrr::map_df(~get_k_nearest(points, ., k))
}
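To get a feel for the idea on a small scale, here is a self-contained sketch that builds the same kind of edge list with base R's dist() instead of Marcus' function. It is my own simplified variant, not his implementation; the point count `n`, neighbour count `k`, and the `pts` and `edges` names are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(1)
n <- 200  # number of random points (assumption: small, so the full distance matrix fits in memory)
k <- 3    # neighbours per point

# Uniformly distributed points across x and y
pts <- data.frame(x = runif(n), y = runif(n))

# Full pairwise distance matrix (n x n)
d <- as.matrix(dist(pts))

# For each point, the indices of its k nearest other points;
# position 1 of the ordering is the point itself (distance 0)
nn <- t(apply(d, 1, function(row) order(row)[2:(k + 1)]))

# One row per edge: from each point to each of its neighbours
edges <- data.frame(
  x    = rep(pts$x, times = k),
  y    = rep(pts$y, times = k),
  xend = pts$x[as.vector(nn)],
  yend = pts$y[as.vector(nn)]
)

p <- ggplot(edges) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend), alpha = 0.5) +
  coord_equal() +
  theme_void()

p
```

The full distance matrix grows with the square of the number of points, so this approach suits a few thousand points at most; it is meant to illustrate the structure of the result, not to reproduce the 10,000-point canvas.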

Those less versed in R can use Marcus’ package mathart. With this package, Marcus shares many more visual depictions of cool algorithms! You can install the package and several dependencies with the following lines of code:

install.packages(c("devtools", "mapproj", "tidyverse", "ggforce", "Rcpp"))
devtools::install_github("marcusvolz/mathart")

Subsequently, you can visualize all kinds of cool stuff, for instance rapidly exploring random trees (see this Wikipedia article for details):

# Generate rrt edges
df <- rapidly_exploring_random_tree() %>% mutate(id = 1:nrow(.))

# Create plot
ggplot() +
  geom_segment(aes(x, y, xend = xend, yend = yend, size = -id, alpha = -id), df, lineend = "round") +
  coord_equal() +
  scale_size_continuous(range = c(0.1, 0.75)) +
  scale_alpha_continuous(range = c(0.1, 1)) +
  theme_blankcanvas(margin_cm = 0)
Via https://github.com/marcusvolz/mathart

This k-d tree (see this Wikipedia article for details) is also amazing:

result <- kdtree(mathart::points)

ggplot() +
  geom_segment(aes(x, y, xend = xend, yend = yend), result) +
  coord_equal() +
  xlim(0, 10000) + ylim(0, 10000) +
  theme_blankcanvas(margin_cm = 0)
Via https://github.com/marcusvolz/mathart

This page of Marcus’ mathart GitHub repository contains the exact code for these and many other visualizations of algorithms and statistical phenomena. Do check it out if you’re interested!


Also, check out the “Fun” section of my R tips and tricks list for more cool visuals you can generate in R!

R tips and tricks

Below are a dozen very specific R tips and tricks. Some are valuable, useful, or boost your productivity. Others are just geeky fun.

More general helpful R packages and resources can be found in this list.

If you have additions, please comment below or contact me!

Completely new to R? → Start here!

Many more keyboard shortcuts are available here online, and in RStudio under Tools → Keyboard Shortcuts Help.


Disclaimer: This page contains links to Amazon’s book shop.
Any purchases through those links provide us with a small commission that helps to host this blog.

Useful base functions

R Markdown

Data manipulation

Data visualization


Easter eggs

rstudio::conf 2018 summary

rstudio::conf is the yearly conference when it comes to R programming and RStudio. In 2017, nearly 500 people attended and, last week, 1100 people went to the 2018 edition. Regrettably, I was on holiday in Cardiff and missed out on meeting all my #rstats heroes. Just browsing through the #rstudioconf Twitter feed, I already learned so many new things that I decided to dedicate a page to it!

Fortunately, you can watch the live streams taped during the conference:

Two people have collected the slides of most rstudio::conf 2018 talks, which you can access via the GitHub repos of matthewravey and simecek. People on Twitter have particularly recommended teach the tidyverse to beginners (by David Robinson), the lesser known stars of the tidyverse (by Emily Robinson), the future of time series and financial analysis in the tidyverse (by Davis Vaughan of business-science.io), Understanding Principal Component Analysis (by Julia Silge), and Deploying TensorFlow models (by Javier Luraschi). Nevertheless, all other presentations are definitely worth checking out as well!

One of the workshops deserves an honorable mention. Jenny Bryan presented on What they forgot to teach you about R, providing some excellent advice on reproducible workflows. It elaborates on her earlier blog on project-oriented workflows, which you should read if you haven’t yet. Some best pRactices Jenny suggests:

  • Restart R often. This ensures your code is still working as intended. Use Cmd+Shift+F10 (Ctrl+Shift+F10 on Windows and Linux) to do so quickly in RStudio.
  • Use stable, project-relative paths instead of absolute paths. This allows you to (1) better manage your imports/exports and folders, and (2) move or share your folders without the code breaking. For instance, here::here("data", "raw-data.csv") builds the path to the raw-data.csv file in the data folder of your project directory. If you are not using the here package yet, you are honestly missing out! Alternatively, you can use fs::path_home(). normalizePath() will make paths work on both Windows and Mac. You can use basename() instead of strsplit() to get the name of a file from a path.
  • To upload an existing git directory to GitHub easily, you can use usethis::use_github().
  • If you include the below YAML header in your .R file, you can easily generate .md files for your GitHub repo.
#' ---
#' output: github_document
#' ---
  • Moreover, Jenny proposed these useful default settings for knitr:
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  out.width = "100%"
)
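The path helpers mentioned in the list above can be tried out with base R alone. This is a small sketch of my own, not code from Jenny's workshop; the data/raw-data.csv path is only reused from the example above and does not need to exist.

```r
# Build a relative path; file.path() uses "/" as separator on every platform
path <- file.path("data", "raw-data.csv")

# basename() extracts the file name directly
file_name <- basename(path)

# The strsplit() route needs an extra step to pick the last piece
file_name_split <- tail(strsplit(path, "/", fixed = TRUE)[[1]], 1)

# normalizePath() resolves the working directory to an absolute path;
# winslash = "/" keeps forward slashes on Windows as well
project_dir <- normalizePath(".", winslash = "/")
full_path <- file.path(project_dir, path)

file_name
full_path
```

With the here package installed, here::here("data", "raw-data.csv") would build a comparable path from the project root rather than from the current working directory, which is what makes it robust when scripts live in subfolders.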

Another of Jenny Bryan’s talks was named Data Rectangling and, although you might not get much out of her slides without her presenting them, you should definitely try the associated repurrrsive tutorial if you haven’t done so yet. It’s a poweR up for any useR!

Here’s a Shiny dashboard made by Garrick Aden-Buie including all the #rstudioconf tweets so you can browse the posts yourself. If you want to download the tweets, Mike Kearney (author of rtweet) shares the data here on his GitHub. Some highlights:

These probably only present a minimal portion of the thousands of tips and tricks you could have learned by simply attending rstudio::conf. I will definitely try to attend next year’s edition. Nevertheless, I hope the above has been useful. If I missed out on any tips, presentations, tweets, or other materials, please reply below, tweet me or pop me a message!