Tag: causal

Simulating data with Bayesian networks, by Daniel Oehm

Daniel Oehm wrote this interesting blog about how to simulate realistic data using a Bayesian network.

Bayesian networks are a type of probabilistic graphical model that uses Bayesian inference for probability computations. Bayesian networks aim to model conditional dependence, and therefore causation, by representing conditional dependence by edges in a directed graph. Through these relationships, one can efficiently conduct inference on the random variables in the graph through the use of factors.

Devin Soni via Medium

As Bayes nets represent data as a probabilistic graph, it is very easy to use that structure to simulate new data that demonstrate the realistic patterns of the underlying causal system. Daniel’s post shows how to do this with bnlearn.

New data is simulated from a Bayes net (see above) by first sampling from each of the root nodes, in this case sex. Then followed by the children conditional on their parent(s) (e.g. sport | sex and hg | sex) until data for all nodes has been drawn. The numbers on the nodes below indicate the sequence in which the data is simulated, noting that rcc is the terminal node.

Daniel Oehms in his blog

The original and simulated datasets are compared in a couple of ways 1) observing the distributions of the variables 2) comparing the output from various models and 3) comparing conditional probability queries. The third test is more of a sanity check. If the data is generated from the original Bayes net then a new one fit on the simulated data should be approximately the same. The more rows we generate the closer the parameters will be to the original values.

As you can see, a Bayesian network allows you to generate data that looks, feels, and behaves a lot like the data on which you based your network on in the first place.

This can be super useful if you want to generate a synthetic / fake / artificial dataset without sharing personal or sensitive data.

Moreover, the underlying Bayesian net can be very useful to compute missing values. In Daniel’s example, he left out some values on purpose (pretending they were missing) and imputed them with the Bayes net. He found that the imputed values for the missing data points were quite close to the original ones:

In the original blog, Daniel goes on to show how to further check the integrity of the simulated data using statistical models and shares all his code so you can try this out yourself. Please do give his website a visit as Daniel has many more interesting statistics blogs!

Causal Random Forests, by Mark White

I stumbled accros this incredibly interesting read by Mark White, who discusses the (academic) theory behind, inner workings, and example (R) applications of causal random forests:

EXPLICITLY OPTIMIZING ON CAUSAL EFFECTS VIA THE CAUSAL RANDOM FOREST: A PRACTICAL INTRODUCTION AND TUTORIAL (By Mark White)

These so-called “honest” forests seem a great technique to identify opportunities for personalized actions: think of marketing, HR, medicine, healthcare, and other personalized recommendations. Note that an experimental setup for data collection is still necessary to gather the right data for these techniques.

Propensity Score Matching Explained Visually

Propensity score matching (wiki) is a statistical matching technique that attempts to estimate the effect of a treatment (e.g., intervention) by accounting for the factors that predict whether an individual would be eligble for receiving the treatment. The wikipedia page provides a good example setting:

Say we are interested in the effects of smoking on health. Here, smoking would be considered the treatment, and the ‘treated’ are simply those who smoke. In order to find a cause-effect relationship, we would need to run an experiment and randomly assign people to smoking and non-smoking conditions. Of course such experiments would be unfeasible and/or unethical, as we can’t ask/force people to smoke when we suspect it may do harm.
We will need to work with observational data instead. Here, we estimate the treatment effect by simply comparing health outcomes (e.g., rate of cancer) between those who smoked and did not smoke. However, this estimation would be biased by any factors that predict smoking (e.g., social economic status). Propensity score matching attempts to control for these differences (i.e., biases) by making the comparison groups (i.e., smoking and non-smoking) more comparable.

Lucy D’Agostino McGowan is a post-doc at Johns Hopkins Bloomberg School of Public Health and co-founder of R-Ladies Nashville. She wrote a very nice blog explaining what propensity score matching is and showing how to apply it to your dataset in R. Lucy demonstrates how you can use propensity scores to weight your observations in such a way that accounts for the factors that correlate with receiving a treatment. Moreover, her explainations are strenghtened by nice visuals that intuitively demonstrate what the weighting does to the “pseudo-populations” used to estimate the treatment effect.