
# Analytics in HR case study: Behind the scenes

Last week, Analytics in HR published a guest blog about one of my People Analytics projects, which you can read here. In the blog, I explain why and how I examined the turnover of management trainees in light of the international work assignments they go on.

For the analyses, I used a statistical technique called survival analysis, also referred to as event history analysis, reliability analysis, duration analysis, or time-to-event analysis. Proportional hazards models are a well-known member of this family. Survival analysis estimates the likelihood of an event occurring at time t, potentially as a function of certain data.
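
To make "the likelihood of an event occurring at time t" a bit more concrete, here is a minimal sketch. It assumes a tiny, made-up dataset of seven employees (the `tenure` and `left` values are invented for illustration, not taken from the project). It computes the Kaplan-Meier survival estimate by hand and compares it with what `survfit()` from the `survival` package returns for the same data.

```r
library(survival)

# Hypothetical toy data: tenure at last observation and whether the person left
tenure <- c(3, 5, 5, 8, 12, 12, 15)
left   <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)

# Distinct tenures at which at least one person left
event_times <- sort(unique(tenure[left]))

# People still under observation just before each of those tenures
at_risk <- sapply(event_times, function(t) sum(tenure >= t))

# People who left at exactly each of those tenures
events <- sapply(event_times, function(t) sum(tenure == t & left))

# Kaplan-Meier: multiply the share that "survives" each event time
km_by_hand <- cumprod(1 - events / at_risk)

# The same estimate from the survival package (no grouping variable)
fit <- survfit(Surv(tenure, left) ~ 1)
km_survfit <- as.numeric(summary(fit)$surv)

print(data.frame(tenure = event_times, at_risk, events, km_by_hand, km_survfit))
```

People who have not (yet) left still count towards `at_risk` until their last observed tenure, which is why the method copes with incomplete follow-up.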

The basic version of survival analysis is a relatively simple model that requires very little data. You can come a long way if you only have the time of observation (in this case tenure), and whether or not an event (turnover in this case) occurred. For my own project, I had two organizations, so I added a source column as well (see below).

```r
# LOAD REQUIRED PACKAGES ####
library(tidyverse)
library(ggfortify)
library(survival)

# SET PARAMETERS ####
set.seed(2)
sources <- c("Organization Red", "Organization Blue")
prob_leave <- c(0.5, 0.5)  # leavers are split evenly over both organizations
prob_stay <- c(0.8, 0.2)   # stayers come mostly from Organization Red
n <- 60

# SIMULATE DATASETS ####
data_surv <- bind_rows(
  # employees who left (event observed)
  tibble(
    Tenure = sample(1:80, n * 2, replace = TRUE),
    Source = sample(sources, n * 2, replace = TRUE, prob = prob_leave),
    Turnover = TRUE
  ),
  # employees who are still employed (censored)
  tibble(
    Tenure = sample(1:85, n * 25, replace = TRUE),
    Source = sample(sources, n * 25, replace = TRUE, prob = prob_stay),
    Turnover = FALSE
  )
)

# RUN SURVIVAL MODEL ####
sfit <- survfit(Surv(Tenure, event = Turnover) ~ Source, data = data_surv)

# PLOT SURVIVAL ####
autoplot(sfit, censor = FALSE, surv.geom = 'line', surv.size = 1.5, conf.int.alpha = 0.2) +
  scale_x_continuous(breaks = seq(0, max(data_surv$Tenure), 12)) +
  coord_cartesian(xlim = c(0, 72), ylim = c(0.4, 1)) +
  scale_color_manual(values = c("blue", "red")) +
  scale_fill_manual(values = c("blue", "red")) +
  theme_light() +
  theme(legend.background = element_rect(fill = "transparent"),
        legend.justification = c(0, 0),
        legend.position = c(0, 0),
        legend.text = element_text(size = 12)
  ) +
  labs(x = "Length of service",
       y = "Percentage employed",
       title = "Survival model applied to the retention of new trainees",
       fill = "",
       color = "")
```

Using the code above, you should be able to conduct a survival analysis and visualize the results for your own projects. Please do share your results!
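
If you want numbers to go with the plot, the sketch below shows one possible follow-up. It is an addition for illustration, not part of the original analysis. It builds a small hypothetical `example_data` frame in the same three-column layout (Tenure, Source, Turnover) and reads the estimated retention at a few tenures from `summary()`. It then runs a log-rank test with `survdiff()` to compare the two organizations. The cut-offs 12, 24 and 36 are arbitrary example values in the same units as `Tenure`.

```r
library(survival)

set.seed(2)
n_obs <- 400

# Hypothetical data in the same layout as data_surv (values are simulated)
example_data <- data.frame(
  Tenure   = sample(1:80, n_obs, replace = TRUE),
  Source   = sample(c("Organization Blue", "Organization Red"), n_obs, replace = TRUE),
  Turnover = sample(c(TRUE, FALSE), n_obs, replace = TRUE, prob = c(0.3, 0.7))
)

fit <- survfit(Surv(Tenure, Turnover) ~ Source, data = example_data)

# Estimated share still employed at three example tenures, per organization
retention <- summary(fit, times = c(12, 24, 36))
retention_table <- data.frame(
  Source   = retention$strata,
  Tenure   = retention$time,
  Employed = round(retention$surv, 3)
)
print(retention_table)

# Log-rank test: do the two survival curves differ more than chance would suggest?
logrank <- survdiff(Surv(Tenure, Turnover) ~ Source, data = example_data)
p_value <- 1 - pchisq(logrank$chisq, df = length(logrank$n) - 1)
print(p_value)
```

Because `Source` is assigned at random in this simulated data, the test should usually not find a difference. With real data, swap in your own data frame and column names.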

# Must read: Computer Age Statistical Inference (Efron & Hastie, 2016)

Statistics, and statistical inference in particular, are becoming an ever greater part of our daily lives. Models estimate anything from (future) consumer behaviour to optimal steering behaviours, and we need these models to be as accurate as possible. Trevor Hastie is a major contributor to the development of the field, and I highly recommend the machine learning books and courses that he developed together with Robert Tibshirani. You can find these in my list of R Resources (Cheatsheets, Tutorials, & Books).

Today I want to share another book Hastie wrote, together with Bradley Efron, another colleague of his at Stanford University. It is called Computer Age Statistical Inference (Efron & Hastie, 2016) and is a definite must-read for every aspiring data scientist because it illustrates most algorithms commonly used in modern-day statistical inference. Hastie and his colleagues at Stanford developed many of these algorithms themselves, and the book covers, among others:

• Regression:
  • Logistic regression
  • Poisson regression
  • Ridge regression
  • Jackknife regression
  • Least angle regression
  • Lasso regression
  • Regression trees
• Bootstrapping
• Boosting
• Cross-validation
• Random forests
• Survival analysis
• Support vector machines
• Kernel smoothing
• Neural networks
• Deep learning
• Bayesian statistics

# R resources (free courses, books, tutorials, & cheat sheets)

Help yourself to these free books, tutorials, packages, cheat sheets, and many more materials for R programming. There’s a separate overview for handy R programming tricks. If you have additions, please comment below or contact me!


LAST UPDATED: 2021-09-24

Completely new to R? → Start learning here!

# Cheat Sheets

Many of the above cheat sheets are hosted in the official RStudio cheat sheet overview.

# Data Visualization

## Interactive / HTML / JavaScript widgets

• R HTML Widgets Gallery***
• `plotly` – interactive plots
• `billboarder` – easy interface to billboard.js, a JavaScript chart library based on D3
• `d3heatmap` – interactive D3 heatmaps
• `altair` – Vega-Lite visualizations via Python
• `DT` – interactive tables
• `DiagrammeR` – interactive diagrams (DiagrammeR cheat sheet)
• `dygraphs` – interactive time series plots
• `formattable` – formattable data structures
• `ggvis` – interactive ggplot2
• `highcharter` – interactive Highcharts plots
• `leaflet` – interactive maps
• `metricsgraphics` – interactive JavaScript bare-bones line, scatterplot and bar charts
• `networkD3` – interactive D3 network graphs
• `scatterD3` – interactive scatterplots with D3
• `rbokeh` – interactive Bokeh plots
• `rCharts` – interactive JavaScript charts
• `rcdimple` – interactive JavaScript bar charts and others
• `rglwidget` – interactive 3d plots
• `threejs` – interactive 3d plots and globes
• `visNetwork` – interactive network graphs
• `wordcloud2` – interface to wordcloud2.js.
• `timevis` – interactive timelines

## ggplot2

### ggplot2 extensions

• ggplot2 extensions overview***
• `ggthemes` – plot style themes
• `hrbrthemes` – opinionated, typographic-centric themes
• `ggmap` – maps with Google Maps, Open Street Maps, etc.
• `ggiraph` – interactive ggplots
• `gghighlight` – highlight lines or values, see vignette
• `ggstance` – horizontal versions of common plots
• `GGally` – scatterplot matrices
• `ggalt` – additional coordinate systems, geoms, etc.
• `ggbeeswarm` – column scatter plots or violin scatter plots
• `ggforce` – additional geoms, see visual guide
• `ggrepel` – prevent plot labels from overlapping
• `ggraph` – graphs, networks, trees and more
• `ggpmisc` – photo-biology related extensions
• `geomnet` – network visualization
• `ggExtra` – marginal histograms for a plot
• `gganimate` – animations, see also the gganimate wiki page
• `ggpage` – page-styled visualizations of text-based data
• `ggpmisc` – useful additional `geom_*` and `stat_*` functions
• `ggstatsplot` – include details from statistical tests in plots
• `ggspectra` – tools for plotting light spectra
• `ggnetwork` – geoms to plot networks
• `ggpointdensity` – cross between a scatter plot and a 2D density plot
• `ggradar` – radar charts
• `ggsurvplot (survminer)` – survival curves
• `ggseas` – seasonal adjustment tools
• `ggthreed` – (evil) 3D geoms
• `ggtech` – style themes for plots
• `ggtern` – ternary diagrams
• `ggTimeSeries` – time series visualizations
• `ggtree` – tree visualizations
• `treemapify` – wilkox’s treemaps
• `seewave` – spectrograms

## Miscellaneous

• `coefplot` – visualizes model statistics
• `circlize` – circular visualizations for categorical data
• `clustree` – visualize clustering analysis
• `quantmod` – candlestick financial charts
• `dabestr` – Data Analysis using Bootstrap-Coupled ESTimation
• `devoutsvg` – an SVG graphics device (with pattern fills)
• `devoutpdf` – a PDF graphics device
• `cartography` – create and integrate maps in your R workflow
• `colorspace` – HSL based color palettes
• `viridis` – Matplotlib viridis color palette for R
• `munsell` – Munsell color palettes for R
• `Cairo` – high-quality display output
• `igraph` – Network Analysis and Visualization
• `graphlayouts` – new layout algorithms for network visualization
• `lattice` – Trellis graphics
• `tmap` – thematic maps
• `trelliscopejs` – interactive alternative for `facet_wrap`
• `rgl` – interactive 3D plots
• `corrplot` – graphical display of a correlation matrix
• `googleVis` – Google Charts API
• `plotROC` – interactive ROC plots
• `extrafont` – fonts in R graphics
• `rvg` – produces Vector Graphics that allow further editing in PowerPoint or Excel
• `showtext` – text using system fonts
• `animation` – animated graphics using ImageMagick.
• `misc3d` – 3d plots, isosurfaces, etc.
• `xkcd` – xkcd style graphics
• `imager` – CImg library to work with images
• `ungeviz` – tools for visualizing uncertainty
• `waffle` – square pie charts a.k.a. waffle charts
• Creating spectrograms in R with `hht`, `warbleR`, `soundgen`, `signal`, `seewave`, or `phonTools`

# Markdown & Other Output Formats

• `tidystats` – automating updating of model statistics
• `papaja` – preparing APA journal articles
• `blogdown` – build websites with Markdown & Hugo
• `huxtable` – create Excel, HTML, & LaTeX tables
• `xaringan` – make slideshows via remark.js and markdown
• `summarytools` – produces neat, quick data summary tables
• `citr` – RStudio Addin to Insert Markdown Citations

# Statistical Modeling & Machine Learning

## Miscellaneous

• `corrr` – easier correlation matrix management and exploration

# Integrated Development Environments (IDEs) & Graphical User Interfaces (GUIs)

Descriptions mostly taken from their own websites:

• RStudio*** – Open source and enterprise ready professional software
• Jupyter Notebook*** – open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text across dozens of programming languages.
• Microsoft R tools for Visual Studio – turn Visual Studio into a powerful R IDE
• R Plugins for Vim, Emacs, and Atom editors
• Rattle*** – GUI for data mining
• esquisse – RStudio add-in to interactively explore and visualize data
• R Analytic Flow – data flow diagram-based IDE
• RKWard – easy to use and easily extensible IDE and GUI
• Eclipse StatET – Eclipse-based IDE
• OpenAnalytics Architect – Eclipse-based IDE
• TinnR – open source GUI and IDE
• DisplayR – cloud-based GUI
• BlueSkyStatistics – GUI designed to look like SPSS and SAS
• Deducer – GUI for everyone
• R commander (Rcmdr) – easy and intuitive GUI
• JGR – Java-based GUI for R
• jamovi & `jmv` – free and open statistical software to bridge the gap between researcher and statistician
• Exploratory.io – cloud-based data science focused GUI
• Stagraph – GUI for ggplot2 that allows you to visualize and connect to databases and/or basic file types
• ggraptr – GUI for visualization (Rapid And Pretty Things in R)
• ML Studio – interactive Shiny platform for data visualization, statistical modeling and machine learning

# R & other software and languages

### R & SQL
