The Institute for Ethical Machine Learning compiled this amazing curated list of open source libraries that will help you deploy, monitor, version, scale, and secure your production machine learning.

The Institute for Ethical Machine Learning is a think-tank that brings together with technology leaders, policymakers & academics to develop standards for ML.

In most (observational) research papers you read, you will probably run into a correlation matrix. Often it looks something like this:

In Social Sciences, like Psychology, researchers like to denote the statistical significance levels of the correlation coefficients, often using asterisks (i.e., *). Then the table will look more like this:

Regardless of my personal preferences and opinions, I had to make many of these tables for the scientific (non-)publications of my Ph.D..

I remember that, when I first started using R, I found it quite difficult to generate these correlation matrices automatically.

Yes, there is the cor function, but it does not include significance levels.

Then there the (in)famous Hmisc package, with its rcorr function. But this tool provides a whole new range of issues.

What’s this storage.mode, and what are we trying to coerce again?

Soon you figure out that Hmisc::rcorr only takes in matrices (thus with only numeric values). Hurray, now you can run a correlation analysis on your dataframe, you think…

Yet, the output is all but publication-ready!

You wanted one correlation matrix, but now you have two… Double the trouble?

To spare future scholars the struggle of the early day R programming, I would like to share my custom functioncorrelation_matrix.

My correlation_matrix takes in a dataframe, selects only the numeric (and boolean/logical) columns, calculates the correlation coefficients and p-values, and outputs a fully formatted publication-ready correlation matrix!

For instance, you can use only 2 decimals. You can focus on the lower triangle (as the lower and upper triangle values are identical). And you can drop the diagonal values:

Or maybe you are interested in a different type of correlation coefficients, and not so much in significance levels:

For other formatting options, do have a look at the source code below.

Now, to make matters even more easy, I wrote a second function (save_correlation_matrix) to directly save any created correlation matrices:

Once you open your new correlation matrix file in Excel, it is immediately ready to be copy-pasted into Word!

If you are looking for ways to visualizeyour correlations do have a look at the packages corrr and corrplot.

I hope my functions are of help to you!

Do reach out if you get to use them in any of your research papers!

I would be super interested and feel honored.

correlation_matrix

#' correlation_matrix
#' Creates a publication-ready / formatted correlation matrix, using `Hmisc::rcorr` in the backend.
#'
#' @param df dataframe; containing numeric and/or logical columns to calculate correlations for
#' @param type character; specifies the type of correlations to compute; gets passed to `Hmisc::rcorr`; options are `"pearson"` or `"spearman"`; defaults to `"pearson"`
#' @param digits integer/double; number of decimals to show in the correlation matrix; gets passed to `formatC`; defaults to `3`
#' @param decimal.mark character; which decimal.mark to use; gets passed to `formatC`; defaults to `.`
#' @param use character; which part of the correlation matrix to display; options are `"all"`, `"upper"`, `"lower"`; defaults to `"all"`
#' @param show_significance boolean; whether to add `*` to represent the significance levels for the correlations; defaults to `TRUE`
#' @param replace_diagonal boolean; whether to replace the correlations on the diagonal; defaults to `FALSE`
#' @param replacement character; what to replace the diagonal and/or upper/lower triangles with; defaults to `""` (empty string)
#'
#' @return a correlation matrix
#' @export
#'
#' @examples
#' `correlation_matrix(iris)`
#' `correlation_matrix(mtcars)`
correlation_matrix <- function(df,
type = "pearson",
digits = 3,
decimal.mark = ".",
use = "all",
show_significance = TRUE,
replace_diagonal = FALSE,
replacement = ""){
# check arguments
stopifnot({
is.numeric(digits)
digits >= 0
use %in% c("all", "upper", "lower")
is.logical(replace_diagonal)
is.logical(show_significance)
is.character(replacement)
})
# we need the Hmisc package for this
require(Hmisc)
# retain only numeric and boolean columns
isNumericOrBoolean = vapply(df, function(x) is.numeric(x) | is.logical(x), logical(1))
if (sum(!isNumericOrBoolean) > 0) {
cat('Dropping non-numeric/-boolean column(s):', paste(names(isNumericOrBoolean)[!isNumericOrBoolean], collapse = ', '), '\n\n')
}
df = df[isNumericOrBoolean]
# transform input data frame to matrix
x <- as.matrix(df)
# run correlation analysis using Hmisc package
correlation_matrix <- Hmisc::rcorr(x, type = type)
R <- correlation_matrix$r # Matrix of correlation coeficients
p <- correlation_matrix$P # Matrix of p-value
# transform correlations to specific character format
Rformatted = formatC(R, format = 'f', digits = digits, decimal.mark = decimal.mark)
# if there are any negative numbers, we want to put a space before the positives to align all
if (sum(!is.na(R) & R < 0) > 0) {
Rformatted = ifelse(!is.na(R) & R > 0, paste0(" ", Rformatted), Rformatted)
}
# add significance levels if desired
if (show_significance) {
# define notions for significance levels; spacing is important.
stars <- ifelse(is.na(p), "", ifelse(p < .001, "***", ifelse(p < .01, "**", ifelse(p < .05, "*", ""))))
Rformatted = paste0(Rformatted, stars)
}
# make all character strings equally long
max_length = max(nchar(Rformatted))
Rformatted = vapply(Rformatted, function(x) {
current_length = nchar(x)
difference = max_length - current_length
return(paste0(x, paste(rep(" ", difference), collapse = ''), sep = ''))
}, FUN.VALUE = character(1))
# build a new matrix that includes the formatted correlations and their significance stars
Rnew <- matrix(Rformatted, ncol = ncol(x))
rownames(Rnew) <- colnames(Rnew) <- colnames(x)
# replace undesired values
if (use == 'upper') {
Rnew[lower.tri(Rnew, diag = replace_diagonal)] <- replacement
} else if (use == 'lower') {
Rnew[upper.tri(Rnew, diag = replace_diagonal)] <- replacement
} else if (replace_diagonal) {
diag(Rnew) <- replacement
}
return(Rnew)
}

save_correlation_matrix

#' save_correlation_matrix
#' Creates and save to file a fully formatted correlation matrix, using `correlation_matrix` and `Hmisc::rcorr` in the backend
#' @param df dataframe; passed to `correlation_matrix`
#' @param filename either a character string naming a file or a connection open for writing. "" indicates output to the console; passed to `write.csv`
#' @param ... any other arguments passed to `correlation_matrix`
#'
#' @return NULL
#'
#' @examples
#' `save_correlation_matrix(df = iris, filename = 'iris-correlation-matrix.csv')`
#' `save_correlation_matrix(df = mtcars, filename = 'mtcars-correlation-matrix.csv', digits = 3, use = 'lower')`
save_correlation_matrix = function(df, filename, ...) {
return(write.csv2(correlation_matrix(df, ...), file = filename))
}

Sign up to keep up to date on the latest R, Data Science & Tech content:

Cohen’s d (wiki) is a statistic used to indicate the standardised difference between two means. Resarchers often use it to compare the averages between groups, for instance to determine that there are higher outcomes values in a experimental group than in a control group.

Researchers often use general guidelines to determine the size of an effect. Looking at Cohen’s d, psychologists often consider effects to be small when Cohen’s d is between 0.2 or 0.3, medium effects (whatever that may mean) are assumed for values around 0.5, and values of Cohen’s d larger than 0.8 would depict large effects (e.g., University of Bath).

By the way, Kristoffer hosts many other interesting visualization tools (most made with JavaScript’s D3 library) on statistics and statistical phenomena on his website, have a look!

These days, I am often programming in multiple different languages for my projects. I will do some data generation and machine learning in Python. The data exploration and some quick visualizations I prefer to do in R. And if I’m feeling adventureous, I might add some Processing or JavaScript visualizations.

Obviously, I want to track and store the versions of my programs and the changes between them. I probably don’t have to tell you that git is the tool to do so.

Normally, you’d have a .gitignore file in your project folder, and all files that are not listed (or have patterns listed) in the .gitignore file are backed up online.

However, when you are working in multiple languages simulatenously, it can become a hassle to assure that only the relevant files for each language are committed to Github.

Each language will have their own “by-files”. R projects come with .Rdata, .Rproj, .Rhistory and so on, whereas Python projects generate pycaches and what not. These you don’t want to commit preferably.

Here you simply enter the operating systems, IDEs, or Programming languages you are working with, and it will generate the appropriate .gitignore contents for you.

Let’s try it out

For my current project, I am working with Python and R in Visual Studio Code. So I enter:

And Voila, I get the perfect .gitignore including all specifics for these programs and languages:

# Created by https://www.gitignore.io/api/r,python,visualstudiocode
# Edit at https://www.gitignore.io/?templates=r,python,visualstudiocode
### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# Mr Developer
.mr.developer.cfg
.project
.pydevproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
### R ###
# History files
.Rhistory
.Rapp.history
# Session Data files
.RData
.RDataTmp
# User-specific files
.Ruserdata
# Example code in package build process
*-Ex.R
# Output files from R CMD build
/*.tar.gz
# Output files from R CMD check
/*.Rcheck/
# RStudio files
.Rproj.user/
# produced vignettes
vignettes/*.html
vignettes/*.pdf
# OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3
.httr-oauth
# knitr and R markdown default cache directories
*_cache/
/cache/
# Temporary files created by R markdown
*.utf8.md
*.knit.md
### R.Bookdown Stack ###
# R package: bookdown caching files
/*_files/
### VisualStudioCode ###
.vscode/*
!.vscode/settings.json
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json
### VisualStudioCode Patch ###
# Ignore all local history of files
.history
# End of https://www.gitignore.io/api/r,python,visualstudiocode

Regular expression (also abbreviated to regex) really is a powertool any programmer should know. It was and is one of the things I most liked learning, as it provides you with immediate, godlike powers that can speed up your (data science) workflow tenfold.

I’ve covered many regex related topics on this blog already, but thought I’d combine them and others in a nice curated overview — for myself, and for you of course, to use.

If you have any materials you liked, but are missing, please let me know!