Category: r

Animating causal inference methods

Some time back the animations below went sort of viral in the statistical programming community. In them, economics professor Nick Huntington-Klein demonstrates step-by-step how statistical tests estimate effect sizes.

You will find several other animations in Nick’s original blog and in the associated Twitter thread.

Moreover, if you are interested in the R code that generates these animations, have a look at this GitHub repository for the causal graphs.

Controlling for a variable

Via http://nickchk.com/causalgraphs.html
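To make the idea behind the animation a bit more tangible, here is a minimal base R sketch of what controlling for a binary variable does. This is not Nick's code: the data are simulated and the variable names (`w`, `x`, `y`) and effect sizes are assumptions chosen purely for illustration.

```r
# Simulated data (hypothetical): w is a binary confounder that shifts both x and y.
set.seed(42)
n <- 200
w <- rbinom(n, 1, 0.5)
x <- 2 * w + rnorm(n)
y <- 1 * x + 3 * w + rnorm(n)   # the true effect of x on y is 1

# Raw relationship: biased upwards, because w drives both x and y.
raw_slope <- coef(lm(y ~ x))[["x"]]

# "Controlling for w": subtract the w-group means from x and y,
# then relate what is left of y to what is left of x.
x_resid <- x - ave(x, w)
y_resid <- y - ave(y, w)
controlled_slope <- coef(lm(y_resid ~ x_resid))[["x_resid"]]

# A regression that includes w directly yields the same slope.
regression_slope <- coef(lm(y ~ x + w))[["x"]]
```

Removing the group means by hand and adding `w` to the regression give the same slope, which is the point the animation makes visually.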

Matching on a variable

Differences in differences
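As a rough companion to the animation, the sketch below computes a difference-in-differences estimate in base R, once from the four cell means and once from a regression with an interaction term. The data are simulated and every name and effect size in it is an assumption for illustration, not something taken from Nick's repository.

```r
# Simulated two-group, two-period data (hypothetical numbers).
set.seed(1)
n <- 400
treated <- rep(c(0, 1), each = n / 2)    # control vs. treated group
post    <- rep(c(0, 1), times = n / 2)   # before vs. after the treatment
# Group gap of 2, common time trend of 1, treatment effect of 3.
y <- 2 * treated + 1 * post + 3 * treated * post + rnorm(n)

# By hand: change in the treated group minus change in the control group.
cell_means <- tapply(y, list(treated = treated, post = post), mean)
did_by_hand <- (cell_means["1", "1"] - cell_means["1", "0"]) -
               (cell_means["0", "1"] - cell_means["0", "0"])

# By regression: the interaction coefficient is the same estimate.
did_by_lm <- coef(lm(y ~ treated * post))[["treated:post"]]
```

Both numbers should land close to the simulated effect of 3, and they agree with each other because the regression is fully saturated.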

Link to the Twitter thread:

I've been getting used to gganimate and thought it would be useful to put together some illustrations of what various causal inference methods *actually do to data* and how they work. Here, for example, is what it means to control for a (binary) variable pic.twitter.com/lmEvJSPQgY
— Nick HK (@nickchk) November 26, 2018

Learn from the Pros: How media companies visualize data

Over the past months, multiple companies have shared their approaches to data visualization and their lessons learned.

Click the companies in the list below to jump to their respective section

The Financial Times
The British Broadcasting Corporation
The Economist
FiveThirtyEight

Financial Times

The Financial Times (FT) released a searchable database of the many data visualizations they produced over the years. Some lovely examples include:

Graphic showing what May needs to happen to get her deal over the line when MPs vote on Friday: a data visualization from a recent Brexit piece by the FT, via https://www.ft.com/graphics

Dutch housing graphic — Searching the FT database for *European House Prices* via https://www.ft.com/graphics returns this map of the Netherlands.

BBC

The BBC released a free cookbook for data visualization using R programming. Here is the associated Medium post announcing the book.

The BBC data team developed an R package (bbplot) which makes the process of creating publication-ready graphics in their in-house style using R’s ggplot2 library a more reproducible process, as well as making it easier for people new to R to create graphics.

Apart from sharing several best practices related to data visualization, they walk you through the steps and R code to create graphs such as the one below:

One of the graphs the BBC cookbook will help you create, via https://bbc.github.io/rcookbook/

Ultimately, you should be able to reproduce these BBC style graphics, via https://medium.com/bbc-visual-and-data-journalism/how-the-bbc-visual-and-data-journalism-team-works-with-graphics-in-r-ed0b35693535

Economist

The data team at the Economist also felt a need to share their lessons learned via Medium. They show some of their most misleading, confusing, and failing graphics of the past years, and share the following mistakes and their remedies:

Truncating the scale (image #1 below)
Forcing a relationship by cherry-picking scales
Choosing the wrong visualisation method (image #2 below)
Taking the “mind-stretch” a little too far (image #3 below)
Confusing use of colour (image #4 below)
Including too much detail
Lots of data, not enough space
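To get a feel for the first mistake on this list, here is a small base R sketch that quantifies how much a truncated axis exaggerates a difference between two bars. The helper `visual_ratio()` and the example values are hypothetical and only meant as an illustration; they do not come from the Economist's post.

```r
# How many times taller does the tallest bar look than the shortest one,
# given where the axis starts? (Hypothetical helper for illustration.)
visual_ratio <- function(values, axis_start = 0) {
  drawn <- values - axis_start   # bar heights as actually drawn
  max(drawn) / min(drawn)
}

values <- c(a = 50, b = 55)      # made-up numbers: b is 10% larger than a

honest    <- visual_ratio(values)                   # axis starts at 0
truncated <- visual_ratio(values, axis_start = 45)  # axis starts at 45
```

With the axis starting at zero, bar b is drawn 1.1 times as tall as bar a. Start the axis at 45 and it is suddenly drawn twice as tall, although the data did not change.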

Moreover, they share the data behind these failing and repaired data visualizations:

Via https://medium.economist.com/mistakes-weve-drawn-a-few-8cdd8a42d368

FiveThirtyEight

I could not resist including this (older) overview of the 52 best charts FiveThirtyEight claimed they made.

All of 538’s data visualizations are stunningly beautiful and often very ingenious, using new chart formats to display complex patterns. Moreover, the range of topics they cover is huge: anything from their traditional background, politics, to great cover stories on sumo wrestling and pricey wine.

Via https://fivethirtyeight.com/features/the-52-best-and-weirdest-charts-we-made-in-2016/

Via https://fivethirtyeight.com/features/the-52-best-and-weirdest-charts-we-made-in-2016/

You should definitely check out the original cover story via https://projects.fivethirtyeight.com/sumo/

StatQuest: Statistical concepts, clearly explained

Josh Starmer is an assistant professor at the genetics department of the University of North Carolina at Chapel Hill.

But more importantly:
Josh is the mastermind behind StatQuest!

StatQuest is a YouTube channel (and website) dedicated to explaining complex statistical concepts, such as data distributions, probability, or novel machine learning algorithms, in simple terms.

Once you watch one of Josh’s “Stat-Quests”, you immediately recognize the effort he put into this project. Using great visuals, a just-about-right pace, and relatable examples, Josh makes statistics accessible to everyone. For instance, take this series on logistic regression:

And do you really know what happens under the hood when you run a principal component analysis? After this video you will:

Or are you more interested in the fundamental concepts behind machine learning? Then Josh has some videos for you, for instance on bias and variance or gradient descent:

With nearly 200 videos and counting, StatQuest is truly an amazing resource for students and teachers on topics related to statistics and data analytics. For some of the concepts, Josh even posted videos running you through the analysis steps and results interpretation in the R language.
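To give a flavour of what "under the hood" means for principal component analysis, here is a minimal base R sketch. It is not taken from Josh's videos: it simply assumes the built-in `iris` measurements as example data and checks that `prcomp()` agrees with an eigendecomposition of the covariance matrix.

```r
# Example data (an assumption for illustration): the four iris measurements.
measurements <- iris[, 1:4]

# PCA the convenient way.
pca <- prcomp(measurements)

# PCA "by hand": eigendecomposition of the covariance matrix.
eig <- eigen(cov(measurements))

# The eigenvalues are the variances of the principal components ...
variances_match <- all.equal(eig$values, pca$sdev^2)

# ... and the eigenvectors are the loadings, up to an arbitrary sign flip.
loadings_match <- all.equal(abs(eig$vectors), abs(unname(pca$rotation)))
```

Both checks should come back `TRUE`, which shows that a principal component is nothing more mysterious than a direction of maximal variance in the centered data.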

StatQuest started out as an attempt to explain statistics to my co-workers – who are all genetics researchers at UNC-Chapel Hill. They did these amazing experiments, but they didn’t always know what to do with the data they generated. That was my job. But I wanted them to understand that what I do isn’t magic – it’s actually quite simple. It only seems hard because it’s all wrapped up in confusing terminology and typically communicated using equations. I found that if I stripped away the terminology and communicated the concepts using pictures, it became easy to understand.
Over time I made more and more StatQuests and now it’s my passion on YouTube.
Josh Starmer via https://statquest.org/about/

Free Programming Books (I still need to read)

There are multiple unread e-mails in my inbox.

Links to books.

Just sitting there. Waiting to be opened, read. For months already.

The sender, you ask? Me. Paul van der Laken.

A nuisance that guy, I tell you. He keeps sending me reminders, of stuff to do, books to read. Books he’s sure a more productive me would enjoy.

Now, I could wipe my inbox. Be done with it. But I don’t want to lose this digital to-do list… Perhaps I should put them here instead. So you can help me read them!

Each of the below links represents a formidable book on programming! (I hear)
And there are free versions! Have a quick peek. A peek won’t hurt you:

Disclaimer: This page contains one or more links to Amazon.
Any purchases made through those links provide us with a small commission that helps to host this blog.

Applied Predictive Modeling – by Max Kuhn & Kjell Johnson
Feature Engineering and Selection: A Practical Approach for Predictive Models – by Max Kuhn & Kjell Johnson
- http://www.feat.engineering/
The Pragmatic Programmer – by Andrew Hunt & David Thomas
- Buy this book via Amazon to support its authors
- https://www.nceclusters.no/globalassets/filer/nce/diverse/the-pragmatic-programmer.pdf
Clean Code – by Robert Martin
- Buy this book via Amazon to support its authors
- https://www.investigatii.md/uploads/resurse/Clean_Code.pdf
R for Data Science – by Hadley Wickham
- Buy this book via Amazon to support its authors
- https://r4ds.had.co.nz/

Advanced R – by Hadley Wickham
- Buy this book via Amazon to support its authors
- https://adv-r.hadley.nz/index.html

R Markdown: The Definitive Guide – by Yihui Xie, J. J. Allaire, & Garrett Grolemund
- Buy this book via Amazon to support its authors
- https://bookdown.org/yihui/rmarkdown/
Bookdown: Authoring Books and Technical Documents with R Markdown – by Yihui Xie
- Buy this book via Amazon to support its authors
- https://bookdown.org/yihui/bookdown/
blogdown – by Yihui Xie
- Buy this book via Amazon to support its authors
- https://bookdown.org/yihui/blogdown/
The Hundred-Page Machine Learning Book – by Andriy Burkov
- Buy this book via Amazon to support its authors
- https://file.ai100.com.cn/files/file-code/original/cd136ebe-0e34-4e43-966b-224acff83005/100MLBOOK/Chapter8.pdf
An Introduction to Statistical Learning – by Gareth James, Daniela Witten, Trevor Hastie, & Robert Tibshirani
- Buy this book via Amazon to support its authors
- http://www-bcf.usc.edu/~gareth/ISL/

The Elements of Statistical Learning – by Trevor Hastie, Robert Tibshirani, & Jerome Friedman
- Buy this book via Amazon to support its authors
- http://web.stanford.edu/~hastie/ElemStatLearn/

Interpretable Machine Learning – by Christoph Molnar
- Buy this book via LeanPub to support its authors
- https://christophm.github.io/interpretable-ml-book/index.html
Deep Learning – by Ian Goodfellow, Yoshua Bengio, & Aaron Courville
- Buy this book via Amazon to support its authors
- https://www.deeplearningbook.org/
Deep Learning with Python – by Francois Chollet
Pro Git – by Scott Chacon & Ben Straub
- Buy this book via Amazon to support its authors
- https://git-scm.com/book/en/v2

The books listed above have a publicly accessible version linked. Some are legitimate. Other links are somewhat shady.
If you feel like you learned something from reading one of the books (which you surely will), please buy a hardcopy version. Or an e-book. At the very least, reach out to the author and share what you appreciated in his/her work.
It takes valuable time to write a book, and we should encourage and cherish those who take that time.

For more books on R programming, check out my R resources overview.

For books on data analytics and (behavioural) psychology in (HR) management, check out Books for the modern data-driven HR professional.

R Image Art, by Michael Freeman

Michael Freeman, an information researcher at the University of Washington, was asked whether he could manipulate images using only R, and he decided to give it a try. In his blog, Michael demonstrates how he used the ggplot2 and imager packages, among others, to go from this original photo:

Spain, Vitoria-Gasteiz, Graffiti, Painting, Art — Via https://pixabay.com/photos/spain-vitoria-gasteiz-graffiti-88385/

To this dot representation:

And this voronoi diagram:
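For those curious how a dot representation can come about, below is a rough base R sketch of the general idea: sample an image on a coarse grid and draw one dot per sample, sized by pixel intensity. This is not Michael's code. It avoids ggplot2 and imager altogether and uses the built-in `volcano` matrix as a stand-in for a photo, and the helper `draw_dots()` is a hypothetical name.

```r
# Stand-in "image": the built-in volcano elevation matrix, rescaled to [0, 1].
# (An assumption for illustration; the blog post works with a real photo.)
img <- (volcano - min(volcano)) / (max(volcano) - min(volcano))

# Sample the image on a coarse grid: one dot per `step` x `step` block.
step <- 4
dots <- expand.grid(
  row = seq(1, nrow(img), by = step),
  col = seq(1, ncol(img), by = step)
)
dots$intensity <- img[cbind(dots$row, dots$col)]

# Draw each sampled pixel as a dot whose size follows its intensity.
draw_dots <- function(dots, max_cex = 2) {
  plot(dots$col, -dots$row,              # flip rows so row 1 ends up on top
       pch = 16, cex = max_cex * dots$intensity,
       asp = 1, axes = FALSE, xlab = "", ylab = "")
}

if (interactive()) draw_dots(dots)
```

A smaller `step` gives more, finer dots. With a real photo you would first convert it to a grayscale matrix, which is where a package such as imager comes in.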

rstudio::conf 2019 summary

Welcome to rstudio::conf 2019

Similar to last year, I was not able to attend rstudio::conf 2019.

Fortunately, so much of the conference is shared on Twitter and media outlets that I still felt included. Here are some things that I liked and learned from, despite the Austin-Tilburg distance.

All presentations are streamed

One great thing about rstudio::conf is that all presentations are streamed and later posted on the RStudio website.

Of what I’ve already reviewed, I really liked Jenny Bryan’s presentation on lazy evaluation, Max Kuhn’s presentation on parsnip, and teaching data science with puzzles by Irene Steves. Also, the gt package is a serious power tool! And I was already a gganimate fanboy, as you know from here and here.

Pass the dots = mindblown. #rstats #RStudioConf #timesaver #elegantcode pic.twitter.com/bV1gCDVZS5
— Jennifer Chunn (@jchunn206) January 18, 2019

One of the insights shared in Jenny Bryan’s talk that can be a life-saver
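The tweet itself does not spell out the details, so take the following as a generic base R sketch of what "passing the dots" usually refers to: a wrapper function that forwards any extra arguments to an inner function via `...`. The function `mean_rounded()` and the example values are hypothetical.

```r
# A thin wrapper around mean(): it adds rounding and passes every other
# argument (na.rm, trim, ...) through untouched via the dots.
mean_rounded <- function(x, digits = 1, ...) {
  round(mean(x, ...), digits)
}

values <- c(1.26, 2.5, NA, 100)

with_na    <- mean_rounded(values)                # NA, because mean() sees the NA
without_na <- mean_rounded(values, na.rm = TRUE)  # na.rm travels through the dots
```

The wrapper never has to list `na.rm` (or any other argument of `mean()`) itself, which keeps it short and means it automatically supports whatever the inner function accepts.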

I think I’m going to watch all talks over the coming weekends!

Slides & Extra Materials

There’s an official rstudio-conf repository on GitHub hosting many materials in an orderly fashion.

Karl Broman made his own awesome GitHub repository with links to the videos, the slides, and all kinds of extra resources.

Karl’s handy GitHub repo of rstudio::conf

All takeaways in a handy #rstudioconf Shiny app

Garrick Aden-Buie made a fabulous Shiny app that allows you to review all #rstudioconf tweets during and since the conference. It even includes some random statistics about the tweets, and a page with all the shared media.

Some random takeaways

Via this tweet about this rstudio::conf presentation

Data scientists can fail by:
❌not saying no enough
❌not providing anything more than a cursory analysis
❌assuming PM knows enough to ask question in the right way and not collaborating with them
❌caring more about using fancy method than solving business problems #rstudioconf
— Emily Robinson (@robinson_es) January 18, 2019

Some words of wisdom by Emily Robinson (whom we know from here)

Recently heard about #tidytuesday from @drob Keynote at #RStudioConf and want to know how to get started?

Check out the GitHub repo for all the details!

New data drops every Monday morning! https://t.co/8NaXR93uIX
— Tom Mock (@thomas_mock) January 18, 2019

You should consider joining #tidytuesday!

Extra: Online RStudio Webinars

Did you know that RStudio also posts all the webinars they host? There really are some hidden pearls among them. For instance, this presentation by Nathan Stephens on rendering rmarkdown to powerpoint will save me tons of work, and those new to broom will also be astonished by this webinar by Alex Hayes.
