Category: r

The Causal Inference Book: DAGS and more

Harvard (bio)statisticians Miguel Hernan and Jamie Robins just released their new book, online and accessible for free!

The Causal Inference book provides a cohesive presentation of causal inference, its concepts and its methods. The book is divided in 3 parts of increasing difficulty: causal inference without models, causal inference with models, and causal inference from complex longitudinal data. Here’s the official Harvard page for the book release.

Some of the book’s (NHEFS) data is accesible too:

In SAS, Stata, MS Excel, and CSV formats
Codebook

As is the associated computer code for the analyses, in multiple languages:

R by Joy Shi and Sean McGrath. Rendered version by Tom Palmer.
Python by James Fiedler
SAS by Roger Logan
Stata by Eleanor Murray and Roger Logan

This is definitely an interesting read for epidemiologists, statisticians, psychologists, economists, sociologists, political scientists, data scientists, computer scientists, and any other person with a love for proper data analysis!

Our revised #CausalInferenceBook is now freely available.

The book is organized in 3 parts of increasing difficulty: From counterfactuals and causal diagrams to treatment-confounder feedback and g-methods.

Thanks to everyone who sent us comments/typos.https://t.co/bRPFYazK2D
— Miguel Hernán (@_MiguelHernan) January 1, 2019

Sam Finalyson visualized some of the Directed Acyclic Graphs (DAG) covered in the book, and these also look quite nice. The visuals and other notes and glossary items here.

Last week my wife was out of town, research was slow, and summer self-study aspirations were high, so I sat down and organized:

ALL THE 50+ DAGS from @_MiguelHernan and Robins Causal Inference book (P1):https://t.co/vKcdclTQbe

Some notes on key conceptshttps://t.co/IcLZCLlrZW pic.twitter.com/UZWSauvQxT
— Sam Finlayson (@IAmSamFin) June 20, 2019

Cover image via blytheadamson.com

An Introduction to Docker for R Users, by Colin Fay

In this awesome 8-minute read, R-progidy Colin Fay explains in laymen’s terms what Docker images, Docker containers, and Volumes are; what Rocker is; and how to set up a Docker container with an R image and run code on it:

On your machine, you’re going to need two things: images, and containers. Images are the definition of the OS, while the containers are the actual running instances of the images. […] To compare with R, this is the same principle as installing vs loading a package: a package is to be downloaded once, while it has to be launched every time you need it. And a package can be launched in several R sessions at the same time easily.
Colin Fay, via https://colinfay.me/docker-r-reproducibility

In his blog, Colin also refers to some great additional resources on Rocker/Docker…

… as well as reading list for those interested in learning more about Docker:

Overview of built-in colors in R

Most of my data visualizations I create using R programming — as you might have noticed from the content of my website.

Though I am colorblind myself, I love to work with colors and color palettes in my visualizations. And I’ve come across quite some neat tricks in my time.

For instance, did you it’s super easy to create a reproducible though custom color palette? Or that there’s a quick reference card for ggplot2’s built-in colors? Or, and this is this blog post’s main subject, that you can access all built-in base colors using colors()!

This last trick, I learned in this recent blog post I came across, by Chisato. She explored all colors() base R incorporates, using the new ggforce and ggraph packages (thank you Thomas Lin Petersen!). Her exploration resulted in some nice visual overviews, which you can view in more detail in the original blog here.

Colors() that have at least 5 colors in their family

Causal Random Forests, by Mark White

I stumbled accros this incredibly interesting read by Mark White, who discusses the (academic) theory behind, inner workings, and example (R) applications of causal random forests:

EXPLICITLY OPTIMIZING ON CAUSAL EFFECTS VIA THE CAUSAL RANDOM FOREST: A PRACTICAL INTRODUCTION AND TUTORIAL (By Mark White)

These so-called “honest” forests seem a great technique to identify opportunities for personalized actions: think of marketing, HR, medicine, healthcare, and other personalized recommendations. Note that an experimental setup for data collection is still necessary to gather the right data for these techniques.

https://www.markhw.com/blog/causalforestintro

Tidy Machine Learning with R’s purrr and tidyr

Jared Wilber posted this great walkthrough where he codes a simple R data pipeline using purrr and tidyr to train a large variety of models and methods on the same base data, all in a non-repetitive, reproducible, clean, and thus tidy fashion. Really impressive workflow!

Comparison between R dplyr and data.table code

Atrebas created this extremely helpful overview page showing how to program standard data manipulation and data transformation routines in R’s famous packages dplyr and data.table.

The document has been been inspired by this stackoverflow question and by the data.table cheat sheet published by Karlijn Willems.
Resources for data.table can be found on the data.table wiki, in the data.table vignettes, and in the package documentation. Reference documents for dplyr include the dplyr cheat sheet, the dplyr vignettes, and the package documentation.

Here’s a hyperlinked table of contents: