Author: Paul van der Laken

Data Science vs. Data Alchemy – by Lucas Vermeer

How do scurvy, astronomy, alchemy and data science relate to each other?

In this GOTO conference presentation, Lucas Vermeer, Director of Experimentation at Booking.com, uses some amazing storytelling to demonstrate how the value of data (science) is largely determined by an organization’s capability to gather the right data: the data it actually needs.

I definitely recommend watching it if you are a data scientist or data science leader.

Here are the slides, which contain some great one-liners:

@lucasvermeer
Book tip: On the Clock

Suppose you operate a warehouse where workers work 11-hour shifts. In order to meet your productivity KPIs, a significant number of them need to take painkillers multiple times per shift. Do you…

  1. Decrease or change the KPI (goals)
  2. Make shifts shorter
  3. Increase the number or duration of breaks
  4. Increase the medical staff
  5. Install vending machines to dispense painkillers more efficiently

Nobody in their right mind would take option 5… Right?

Yet, this is precisely what Amazon did according to Emily Guendelsberger in her insanely interesting and relevant book “On the Clock” (note the paradoxical link to Amazon’s webshop here).

Emily went undercover as an employee at several organizations to experience blue-collar jobs first-hand. In her book, she discusses how tech and data have changed low-wage jobs in ways that are simply dehumanizing.

These days, with sensors, timers, and smart nudging, employees are constantly being monitored and continue working (hard), sometimes at the cost of their own health and well-being.

I really enjoyed the book, despite the harsh picture it sketches of low-wage jobs and harmful working conditions these days. The book poses several dilemmas and asks multiple reflective questions that made me re-evaluate and re-appreciate my own job. Truly an interesting read!

Some quotes from the book to get you excited:

“As more and more skill is stripped out of a job, the cost of turnover falls; eventually, training an ever-churning influx of new unskilled workers becomes less expensive than incentivizing people to stay by improving the experience of work or paying more.”

Emily Guendelsberger, On the Clock

“Q: Your customer-service representatives handle roughly sixty calls in an eight-hour shift, with a half-hour lunch and two fifteen-minute breaks. By the end of the day, a problematic number of them are so exhausted by these interactions that their ability to focus, read basic conversational cues, and maintain a peppy demeanor is negatively affected. Do you:

A. Increase staffing so you can scale back the number of calls each rep takes per shift — clearly, workers are at their cognitive limits

B. Allow workers to take a few minutes to decompress after difficult calls

C. Increase the number or duration of breaks

D. Decrease the number of objectives workers have for each call so they aren’t as mentally and emotionally taxing

E. Install a program that badgers workers with corrective pop-ups telling them that they sound tired.

Seriously—what kind of fucking sociopath goes with E?”

Emily Guendelsberger, On the Clock
My copy of the book

Cover via Freepik

Create a publication-ready correlation matrix, with significance levels, in R

In most (observational) research papers you read, you will probably run into a correlation matrix. Often it looks something like this:

FACTOR ANALYSIS

In Social Sciences, like Psychology, researchers like to denote the statistical significance levels of the correlation coefficients, often using asterisks (i.e., *). Then the table will look more like this:

Table 4 from Family moderators of relation between community ...

Regardless of my personal preferences and opinions, I had to make many of these tables for the scientific (non-)publications of my Ph.D.

I remember that, when I first started using R, I found it quite difficult to generate these correlation matrices automatically.

Yes, there is the cor function, but it does not include significance levels.
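To make that gap concrete, here is a minimal base-R sketch. The choice of the built-in mtcars data and of these three columns is mine, purely for illustration: cor returns coefficients only, while cor.test does give a p-value, but for a single pair of variables at a time.

```r
# Illustrative data: three numeric columns from the built-in mtcars dataset
vars <- mtcars[, c("mpg", "hp", "wt")]

# cor() returns a matrix of coefficients, without any significance information
r <- cor(vars)
print(round(r, 2))

# cor.test() reports a p-value, but only for one pair of variables per call
test <- cor.test(vars$mpg, vars$wt)
print(test$estimate)  # Pearson correlation for mpg vs. wt
print(test$p.value)   # corresponding p-value
```

Filling a whole table this way means looping over every pair of columns, which is exactly the tedious part.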

Then there is the (in)famous Hmisc package, with its rcorr function. But this tool brings a whole new range of issues.

What’s this storage.mode, and what are we trying to coerce again?

Soon you figure out that Hmisc::rcorr only takes in matrices (thus with only numeric values). Hurray, now you can run a correlation analysis on your dataframe, you think…

Yet, the output is anything but publication-ready!

You wanted one correlation matrix, but now you have two… Double the trouble?
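A rough sketch of what that looks like, again with mtcars columns I picked as an example. The call is guarded so the snippet still runs when Hmisc is not installed:

```r
# Hmisc::rcorr expects a numeric matrix, so convert the data frame first
x <- as.matrix(mtcars[, c("mpg", "hp", "wt")])

if (requireNamespace("Hmisc", quietly = TRUE)) {
  res <- Hmisc::rcorr(x, type = "pearson")

  # The result is a list, not a single table
  print(names(res))       # includes "r" (coefficients) and "P" (p-values)
  print(round(res$r, 2))  # matrix of correlation coefficients
  print(round(res$P, 4))  # matrix of p-values; the diagonal is NA
}
```

The coefficients and the p-values live in separate matrices, so combining them into one formatted table is left to you.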

To spare future scholars the struggle of my early days of R programming, I would like to share my custom function correlation_matrix.

My correlation_matrix takes in a dataframe, selects only the numeric (and boolean/logical) columns, calculates the correlation coefficients and p-values, and outputs a fully formatted publication-ready correlation matrix!

You can specify many formatting options in correlation_matrix.

For instance, you can use only 2 decimals. You can focus on the lower triangle (as the lower and upper triangle values are identical). And you can drop the diagonal values:
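To show what these options boil down to, here is a small base-R sketch of the same idea. It is not the function itself, and the mtcars columns are an arbitrary choice of mine: round to 2 decimals, then blank out the upper triangle and the diagonal.

```r
# Illustrative data: three numeric columns from mtcars
vars <- mtcars[, c("mpg", "hp", "wt")]
R <- cor(vars)

# Format with 2 decimals; formatC keeps the matrix dimensions and names
formatted <- formatC(R, format = "f", digits = 2)

# Keep only the lower triangle: blank the upper triangle and the diagonal
formatted[upper.tri(formatted, diag = TRUE)] <- ""

print(formatted, quote = FALSE)
```

With the function from this post loaded, the equivalent call would be correlation_matrix(mtcars, digits = 2, use = "lower", replace_diagonal = TRUE).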

Or maybe you are interested in a different type of correlation coefficients, and not so much in significance levels:
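As a plain base-R illustration of a different coefficient type (the data and columns are again just an example), Spearman rank correlations without any significance stars look like this:

```r
# Illustrative data: three numeric columns from mtcars
vars <- mtcars[, c("mpg", "hp", "wt")]

# Spearman rank correlations instead of the default Pearson coefficients
rho <- cor(vars, method = "spearman")

print(round(rho, 2))
```

With the function from this post loaded, the corresponding call would be correlation_matrix(mtcars, type = "spearman", show_significance = FALSE).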

For other formatting options, do have a look at the source code below.

Now, to make matters even easier, I wrote a second function (save_correlation_matrix) to directly save any created correlation matrices:
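The saving step itself is a thin wrapper around write.csv2, which writes semicolon-separated files (the convention Excel expects in locales that use a decimal comma). A minimal sketch of that step, using a temporary file and a small formatted matrix that I made up for the example:

```r
# A small formatted (character) correlation matrix, for illustration only
m <- formatC(cor(mtcars[, c("mpg", "wt")]), format = "f", digits = 2)

# Write it to a temporary semicolon-separated file
path <- tempfile(fileext = ".csv")
write.csv2(m, file = path)

# Read it back as text to check that nothing was lost along the way
back <- read.csv2(path, row.names = 1, colClasses = "character")
print(back)
```

With the functions from this post loaded, save_correlation_matrix(df = mtcars, filename = "mtcars-correlation-matrix.csv", digits = 2, use = "lower") would do the same in one go.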

Once you open your new correlation matrix file in Excel, it is immediately ready to be copy-pasted into Word!

If you are looking for ways to visualize your correlations, do have a look at the packages corrr and corrplot.

I hope my functions are of help to you!

Do reach out if you get to use them in any of your research papers!

I would be super interested and feel honored.

correlation_matrix

#' correlation_matrix
#' Creates a publication-ready / formatted correlation matrix, using `Hmisc::rcorr` in the backend.
#'
#' @param df dataframe; containing numeric and/or logical columns to calculate correlations for
#' @param type character; specifies the type of correlations to compute; gets passed to `Hmisc::rcorr`; options are `"pearson"` or `"spearman"`; defaults to `"pearson"`
#' @param digits integer/double; number of decimals to show in the correlation matrix; gets passed to `formatC`; defaults to `3`
#' @param decimal.mark character; which decimal.mark to use; gets passed to `formatC`; defaults to `.`
#' @param use character; which part of the correlation matrix to display; options are `"all"`, `"upper"`, `"lower"`; defaults to `"all"`
#' @param show_significance boolean; whether to add `*` to represent the significance levels for the correlations; defaults to `TRUE`
#' @param replace_diagonal boolean; whether to replace the correlations on the diagonal; defaults to `FALSE`
#' @param replacement character; what to replace the diagonal and/or upper/lower triangles with; defaults to `""` (empty string)
#'
#' @return a correlation matrix
#' @export
#'
#' @examples
#' correlation_matrix(iris)
#' correlation_matrix(mtcars)
correlation_matrix <- function(df, 
                               type = "pearson",
                               digits = 3, 
                               decimal.mark = ".",
                               use = "all", 
                               show_significance = TRUE, 
                               replace_diagonal = FALSE, 
                               replacement = ""){
  
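  # check arguments; each condition is a separate argument so all of them are tested
  stopifnot(
    is.numeric(digits),
    digits >= 0,
    use %in% c("all", "upper", "lower"),
    is.logical(replace_diagonal),
    is.logical(show_significance),
    is.character(replacement)
  )
  # we need the Hmisc package for this
  if (!requireNamespace("Hmisc", quietly = TRUE)) {
    stop("Package 'Hmisc' is required; install it with install.packages('Hmisc')")
  }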
  
  # retain only numeric and boolean columns
  isNumericOrBoolean = vapply(df, function(x) is.numeric(x) || is.logical(x), logical(1))
  if (sum(!isNumericOrBoolean) > 0) {
    cat('Dropping non-numeric/-boolean column(s):', paste(names(isNumericOrBoolean)[!isNumericOrBoolean], collapse = ', '), '\n\n')
  }
  df = df[isNumericOrBoolean]
  
  # transform input data frame to matrix
  x <- as.matrix(df)
  
  # run correlation analysis using Hmisc package
  correlation_matrix <- Hmisc::rcorr(x, type = type)
  R <- correlation_matrix$r # matrix of correlation coefficients
  p <- correlation_matrix$P # matrix of p-values
  
  # transform correlations to specific character format
  Rformatted = formatC(R, format = 'f', digits = digits, decimal.mark = decimal.mark)
  
  # if there are any negative numbers, put a space before the non-negative ones to align all
  if (sum(!is.na(R) & R < 0) > 0) {
    Rformatted = ifelse(!is.na(R) & R >= 0, paste0(" ", Rformatted), Rformatted)
  }

  # add significance levels if desired
  if (show_significance) {
    # define notions for significance levels; spacing is important.
    stars <- ifelse(is.na(p), "", ifelse(p < .001, "***", ifelse(p < .01, "**", ifelse(p < .05, "*", ""))))
    Rformatted = paste0(Rformatted, stars)
  }
  
  # make all character strings equally long
  max_length = max(nchar(Rformatted))
  Rformatted = vapply(Rformatted, function(x) {
    current_length = nchar(x)
    difference = max_length - current_length
    return(paste0(x, strrep(" ", difference)))
  }, FUN.VALUE = character(1))
  
  # build a new matrix that includes the formatted correlations and their significance stars
  Rnew <- matrix(Rformatted, ncol = ncol(x))
  rownames(Rnew) <- colnames(Rnew) <- colnames(x)
  
  # replace undesired values
  if (use == 'upper') {
    Rnew[lower.tri(Rnew, diag = replace_diagonal)] <- replacement
  } else if (use == 'lower') {
    Rnew[upper.tri(Rnew, diag = replace_diagonal)] <- replacement
  } else if (replace_diagonal) {
    diag(Rnew) <- replacement
  }
  
  return(Rnew)
}

save_correlation_matrix

#' save_correlation_matrix
#' Creates and saves to file a fully formatted correlation matrix, using `correlation_matrix` and `Hmisc::rcorr` in the backend
#' @param df dataframe; passed to `correlation_matrix`
#' @param filename either a character string naming a file or a connection open for writing. "" indicates output to the console; passed to `write.csv2`
#' @param ... any other arguments passed to `correlation_matrix`
#'
#' @return NULL
#'
#' @examples
#' save_correlation_matrix(df = iris, filename = 'iris-correlation-matrix.csv')
#' save_correlation_matrix(df = mtcars, filename = 'mtcars-correlation-matrix.csv', digits = 3, use = 'lower')
save_correlation_matrix = function(df, filename, ...) {
  return(write.csv2(correlation_matrix(df, ...), file = filename))
}

Best Tech & Programming Talks Ever

Every now and then, Twitter will offer these golden resources.

Ashley Willis recently asked people to name the best tech talk they’ve ever seen and the results are a resource I don’t want to lose.

Hundreds of people responded, sharing their contenders for the title.

Below, I selected some of the top-rated talks and clustered them by category.


Big Idea & Programming Meta-Talks

The Future of Programming

Growing a Language

The Mess We’re In

Making Users Awesome

Ethical Dilemmas in Software Engineering


Testing Code

Adding Eyes to Your Test Automation Framework

TATFT – Test All The F*cking Time


Language-Specific Talks

Concurrency (Python)

How we program multicores (Erlang)

Y Not - Adventures in Functional Programming (Ruby)

JavaScript: The Good Parts


Code Design

Core Design Principles for Software Developers

Design Patterns vs Anti-patterns in APL


Containers & Kubernetes

The Container Operator’s Manual

Write a Container in Go From Scratch

Container Hacks and Fun Images

Kubernetes and the Path to Serverless

Let’s Build Kubernetes, With a Spreadsheet and Volunteers

Cover image via: https://toggl.com/blog/best-tech-websites

Learn to style HTML using CSS — Tutorials by Mozilla

Cascading Stylesheets — or CSS — is the first technology you should start learning after HTML. While HTML is used to define the structure and semantics of your content, CSS is used to style it and lay it out. For example, you can use CSS to alter the font, color, size, and spacing of your content, split it into multiple columns, or add animations and other decorative features.

https://developer.mozilla.org/en-US/docs/Learn/CSS

I personally encountered CSS at multiple stages of my Data Science career:

  • When I started using (R) markdown (see here, or here), I could present my data science projects as HTML pages, styled through CSS.
  • When I got more accustomed to building web applications (e.g., Shiny) on top of my data science models, I had to use CSS to build more beautiful dashboard layouts.
  • When I was scraping data from eBay, Amazon, WordPress, and Goodreads, my prior experiences with CSS & HTML helped greatly to identify and interpret the elements you see when you look under the hood of a webpage (try pressing CTRL + SHIFT + C).
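To illustrate that last point, here is a small sketch of how a CSS selector picks elements out of a page during scraping. The HTML snippet, class names, and book titles are hypothetical, and I am assuming the rvest package for the parsing (the call is guarded in case it is not installed):

```r
# Hypothetical page fragment: two "book" blocks with a title and a price each
html <- '
  <div class="book"><span class="title">Dune</span><span class="price">9.99</span></div>
  <div class="book"><span class="title">Emma</span><span class="price">7.50</span></div>
'

titles <- character(0)

if (requireNamespace("rvest", quietly = TRUE)) {
  page <- rvest::minimal_html(html)

  # The same selector syntax you would write in a stylesheet:
  # every span with class "title" directly inside a div with class "book"
  nodes <- rvest::html_elements(page, "div.book > span.title")
  titles <- rvest::html_text2(nodes)
}

print(titles)
```

Knowing how classes, ids, and nesting work in CSS is what makes writing such selectors straightforward.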

I know others agree with me when I say that the small investment in learning the basics behind HTML & CSS pays off big time.

I read that Mozilla offers some great tutorials for those interested in learning more about “the web”, so here are some quicklinks to their free tutorials:

Screenshot via developer.mozilla.org/en-US/docs/Learn/CSS/CSS_layout/Introduction
What does a tech lead do? – by Jake Voytko

According to Jake Voytko, data science and engineering teams run more efficiently and spread knowledge more quickly when there is a single person setting the technical direction of a team: the so-called tech lead.

Sometimes tech lead is an official title, referring to the position between an engineering manager and the engineering team. Oftentimes it is just an unofficial role one grows into.

Now, according to Jake, you can learn to become a tech lead. And you can be good at it too. Somebody has to do it, so it might as well be you! It could allow you to leverage your time to move the organization forward, and enable you to influence data science or engineering throughout the entire team!

In the original blog post, which I thoroughly enjoyed reading, Jake explains in more detail what it takes to be(come) a good tech lead. Here are just the headers, but if you’re interested, take a look at the full article:

  • Less time writing code
  • Helping others often (esp. juniors)
  • Helping others first
  • Doing unsexy, unthankful work to enable the team
  • Being an ally (of underrepresented groups)
  • Spreading knowledge, or making sure it spreads

And this is what Jake feels his work week looks like as a tech lead:

Snapshot from the original article

Cover image via TeamGantt.com