Category: report

Create a publication-ready correlation matrix, with significance levels, in R

Create a publication-ready correlation matrix, with significance levels, in R

In most (observational) research papers you read, you will probably run into a correlation matrix. Often it looks something like this:

FACTOR ANALYSIS

In Social Sciences, like Psychology, researchers like to denote the statistical significance levels of the correlation coefficients, often using asterisks (i.e., *). Then the table will look more like this:

Table 4 from Family moderators of relation between community ...

Regardless of my personal preferences and opinions, I had to make many of these tables for the scientific (non-)publications of my Ph.D..

I remember that, when I first started using R, I found it quite difficult to generate these correlation matrices automatically.

Yes, there is the cor function, but it does not include significance levels.

Then there the (in)famous Hmisc package, with its rcorr function. But this tool provides a whole new range of issues.

What’s this storage.mode, and what are we trying to coerce again?

Soon you figure out that Hmisc::rcorr only takes in matrices (thus with only numeric values). Hurray, now you can run a correlation analysis on your dataframe, you think…

Yet, the output is all but publication-ready!

You wanted one correlation matrix, but now you have two… Double the trouble?

To spare future scholars the struggle of the early day R programming, I would like to share my custom function correlation_matrix.

My correlation_matrix takes in a dataframe, selects only the numeric (and boolean/logical) columns, calculates the correlation coefficients and p-values, and outputs a fully formatted publication-ready correlation matrix!

You can specify many formatting options in correlation_matrix.

For instance, you can use only 2 decimals. You can focus on the lower triangle (as the lower and upper triangle values are identical). And you can drop the diagonal values:

Or maybe you are interested in a different type of correlation coefficients, and not so much in significance levels:

For other formatting options, do have a look at the source code below.

Now, to make matters even more easy, I wrote a second function (save_correlation_matrix) to directly save any created correlation matrices:

Once you open your new correlation matrix file in Excel, it is immediately ready to be copy-pasted into Word!

If you are looking for ways to visualize your correlations do have a look at the packages corrr and corrplot.

I hope my functions are of help to you!

Do reach out if you get to use them in any of your research papers!

I would be super interested and feel honored.

correlation_matrix

#' correlation_matrix
#' Creates a publication-ready / formatted correlation matrix, using `Hmisc::rcorr` in the backend.
#'
#' @param df dataframe; containing numeric and/or logical columns to calculate correlations for
#' @param type character; specifies the type of correlations to compute; gets passed to `Hmisc::rcorr`; options are `"pearson"` or `"spearman"`; defaults to `"pearson"`
#' @param digits integer/double; number of decimals to show in the correlation matrix; gets passed to `formatC`; defaults to `3`
#' @param decimal.mark character; which decimal.mark to use; gets passed to `formatC`; defaults to `.`
#' @param use character; which part of the correlation matrix to display; options are `"all"`, `"upper"`, `"lower"`; defaults to `"all"`
#' @param show_significance boolean; whether to add `*` to represent the significance levels for the correlations; defaults to `TRUE`
#' @param replace_diagonal boolean; whether to replace the correlations on the diagonal; defaults to `FALSE`
#' @param replacement character; what to replace the diagonal and/or upper/lower triangles with; defaults to `""` (empty string)
#'
#' @return a correlation matrix
#' @export
#'
#' @examples
#' `correlation_matrix(iris)`
#' `correlation_matrix(mtcars)`
correlation_matrix <- function(df, 
                               type = "pearson",
                               digits = 3, 
                               decimal.mark = ".",
                               use = "all", 
                               show_significance = TRUE, 
                               replace_diagonal = FALSE, 
                               replacement = ""){
  
  # check arguments
  stopifnot({
    is.numeric(digits)
    digits >= 0
    use %in% c("all", "upper", "lower")
    is.logical(replace_diagonal)
    is.logical(show_significance)
    is.character(replacement)
  })
  # we need the Hmisc package for this
  require(Hmisc)
  
  # retain only numeric and boolean columns
  isNumericOrBoolean = vapply(df, function(x) is.numeric(x) | is.logical(x), logical(1))
  if (sum(!isNumericOrBoolean) > 0) {
    cat('Dropping non-numeric/-boolean column(s):', paste(names(isNumericOrBoolean)[!isNumericOrBoolean], collapse = ', '), '\n\n')
  }
  df = df[isNumericOrBoolean]
  
  # transform input data frame to matrix
  x <- as.matrix(df)
  
  # run correlation analysis using Hmisc package
  correlation_matrix <- Hmisc::rcorr(x, type = type)
  R <- correlation_matrix$r # Matrix of correlation coeficients
  p <- correlation_matrix$P # Matrix of p-value 
  
  # transform correlations to specific character format
  Rformatted = formatC(R, format = 'f', digits = digits, decimal.mark = decimal.mark)
  
  # if there are any negative numbers, we want to put a space before the positives to align all
  if (sum(!is.na(R) & R < 0) > 0) {
    Rformatted = ifelse(!is.na(R) & R > 0, paste0(" ", Rformatted), Rformatted)
  }

  # add significance levels if desired
  if (show_significance) {
    # define notions for significance levels; spacing is important.
    stars <- ifelse(is.na(p), "", ifelse(p < .001, "***", ifelse(p < .01, "**", ifelse(p < .05, "*", ""))))
    Rformatted = paste0(Rformatted, stars)
  }
  
  # make all character strings equally long
  max_length = max(nchar(Rformatted))
  Rformatted = vapply(Rformatted, function(x) {
    current_length = nchar(x)
    difference = max_length - current_length
    return(paste0(x, paste(rep(" ", difference), collapse = ''), sep = ''))
  }, FUN.VALUE = character(1))
  
  # build a new matrix that includes the formatted correlations and their significance stars
  Rnew <- matrix(Rformatted, ncol = ncol(x))
  rownames(Rnew) <- colnames(Rnew) <- colnames(x)
  
  # replace undesired values
  if (use == 'upper') {
    Rnew[lower.tri(Rnew, diag = replace_diagonal)] <- replacement
  } else if (use == 'lower') {
    Rnew[upper.tri(Rnew, diag = replace_diagonal)] <- replacement
  } else if (replace_diagonal) {
    diag(Rnew) <- replacement
  }
  
  return(Rnew)
}

save_correlation_matrix

#' save_correlation_matrix
#' Creates and save to file a fully formatted correlation matrix, using `correlation_matrix` and `Hmisc::rcorr` in the backend
#' @param df dataframe; passed to `correlation_matrix`
#' @param filename either a character string naming a file or a connection open for writing. "" indicates output to the console; passed to `write.csv`
#' @param ... any other arguments passed to `correlation_matrix`
#'
#' @return NULL
#'
#' @examples
#' `save_correlation_matrix(df = iris, filename = 'iris-correlation-matrix.csv')`
#' `save_correlation_matrix(df = mtcars, filename = 'mtcars-correlation-matrix.csv', digits = 3, use = 'lower')`
save_correlation_matrix = function(df, filename, ...) {
  return(write.csv2(correlation_matrix(df, ...), file = filename))
}

Sign up to keep up to date on the latest R, Data Science & Tech content:

Top-19 articles of 2019

Top-19 articles of 2019

With only one day remaining in 2019, let’s review the year. 2019 was my third year of blogging and it went by even quicker than the previous two!

Personally, it has been a busy year for me: I started a new job, increased my speaking and teaching activities, bought and moved to my new house, and got married op top of that!

Fortunately, I also started working parttime. This way, I could still reserve some time for learning and sharing my learnings. And sharing I did:

I posted 95 blogs in 2019!
That means one new post every 4 days!

paulvanderlaken.com improved its online footprint as well. We received over 100k visitors in 2019! And many of you subscribed and sticked around. Our little community now includes 55 more members than it did last year! And that is not even including the followers to my new twitter bot Artificial Stupidity!

Thank you for your continued interest!

Join 262 other followers

Now, I am always curious as to what brings you to my website, so let’s have a look at some 2019 statistics (which I downloaded via my new Python scraper).

Most read articles

There is clearly a power distribution in the quantity with which you read my blogs.

Some blogs consistently attract dozens of visitors each day. Others have only handful of visitors over the course of a year.

These are the 19 articles which were most read in 2019. Hyperlinks are included below the bar chart. It’s a nice combination of R programming, machine learning, HR-related materials, and some entertainment (games & gambling) in between.

Which have and haven’t you read?

  1. R resources
  2. R tips and tricks
  3. New to R?
  4. Books for the modern, data-driven HR professional
  5. The house always wins
  6. Visualization innovations
  7. Simple correlation analysis in R
  8. Beating battleships with algorithms and AI
  9. Regular expressions in R
  10. Simpson’s paradox
  11. Visualizing the k-means clustering algorithm
  12. Survival of the best fit
  13. Datasets to practice and learn data science
  14. Identifying dirty twitter bots
  15. Game of Thrones map
  16. Screeps
  17. Northstar
  18. The difference between DS, ML, and AI visualized
  19. Light GBM vs. XGBoost

Rising stars

Half of these most read articles have actually been published in 2017 or ’18 already. However, of the 95 articles published in 2019, some also demonstrate promising visitor patterns:

The People Analytics books, Visual innovations, and AI Battleships are in the top 19, and several others made it too.

Some of these newer blogs haven’t had the time to mature and redeem their place yet though. Regardless, I have high hopes!

Particularly for Neural Synesthesia, which was easily one of my greatest WOW-moments for ML applications in 2019. It’s truly mesmerizing to see a GAN traverse its latent space.

Reading & posting patterns

I have been posting quite regularly throughout the year. Apart from a holiday to Thailand during the start of January, and the start of my new job in February.

While I write and post most of my blogs in the weekend, I guess I should consider postponing publishing. As you guys are mostly active during Tuesdays and Wedsnesdays!

Statistical summary of 2019

What better way to end 2019 than with a statistical summary?

I have posted more and shorter blogs, and you’ve rewarded me with visits and more likes (also per post). However, we need more discussion!

Statistic2018 2019 Δ
Views85614107388+25%
Unique visitors5759470615+23%
Posts6195+56%
Words / post518371-40%
Likes51111118%
Comments2416-33%
As of 29/12/2019

2020 Outlook

It took some time to get started, but halfway 2017 my blog started attracting an audience. People stayed on during 2018, and visitor number continued to increase through 2019.

With an ongoing expansion from R into Python, and an increased focus on sharing resources, applications, and novelties related to data visualization and machine learning, I have a lot more in store for 2020!

I hope you stick around for the ride!

Please like, subscribe, share, and comment, and we’ll make sure 2020 will be at least as interesting and full of (machine) learning as 2019 has been!

Join 262 other followers

Zeit’s interactive visualization of the 2019 European election results

Zeit’s interactive visualization of the 2019 European election results

Zeit — the German newspaper — analyzed recent election results in over 80,000 regions of Europe. They discovered many patterns – from the radical left to the extremist right. Moreover, they allow you to find patterns yourself, among others in your own region.

They published the summarized election results in this beautiful interactive map of Europe.

The map is beautifully color-coded for the dominant political view (Conservative, Green, Liberal, Socialist, Far left, or Far right) per region. Moreover, you can select these views and look for regions where they received respectively many votes. Like in the below, where I opted for the Liberal view, which finds strongest support in regions of the Netherlands, France, Czechia, Romania, Denmark, Estonia, and Finland.

For instance, the region of Tilburg in the Netherlands — where I live — voted mostly Liberal, as depicted by the yellow Netherlands. In contrast, in the German border regions conservative and socialist parties received most votes, whereas in the Belgian border regions uncategorizable parties received most votes.

Zeit discovered some cool patterns themselves as well, as discussed in the original article. These include:

  • Right-Wing Populists in Poland
  • North-South divides in Italy and Spain
  • Considerable support for regional parties in Catalonia, Belgium, Scotland and Italy
  • Dominant Green and Liberal views in the Netherlands, France, and Germany

Have a look yourself, it’s a great example of open access data-driven journalism!

Pimp my RMD: Tips for R Markdown – by Yan Holtz

Pimp my RMD: Tips for R Markdown – by Yan Holtz

R markdown creates interactive reports from R code, including interactive reports, documents, dashboards, presentations, and even books. Have a look at this awesome gallery of R markdown examples.

Yan Holtz recently created a neat little overview of handy R Markdown tips and tricks that improve the appearance of output documents. He dubbed this overview Pimp my RMD. Have a look, it’s worth it!

Via https://rmarkdown.rstudio.com/authoring_quick_tour.html

Two years of paulvanderlaken.com

Yesterday was the second anniversary of my website. I also reflected on this moment last year, and I thought to continue the tradition in 2019.

Let me start with a great, big
THANK YOU
to all my readers for continuing to visit my website!

You are the reason I continue to write down what I read. And maybe even the reason I continued reading and learning last year, despite all other distractions [my “real” job and my PhD : )].

Also a big thank you to all my followers on Twitter and LinkedIn, and those who have taken the time to comment or like my blogs. All of you make that I gain energy from writing this blog!

With that said, let’s start the review of the past year on my blog.

Most popular blog posts of 2018

Most importantly, let’s examine what you guys liked. Which blogs attracted the most visitors? What did you guys read?

Unfortunately, WordPress does not allow you to scrape their statistics pages. However, I was able to download monthly data manually, which I could then visualize to show you some trends.

The visual below shows the cumulative amount of visitors attracted by each blog I’ve written in 2018. Here follow links to the top 8 blogs in terms of visitor numbers this year:

  1. “What’s the difference between data science, machine learning, and artificial intelligence?”, visualized. received 4355 visits. Following a viral blog by David Robinson, I try to demystify the popular terminology.
  2. The House Always Wins: Simulating 5,000,000 Games of Baccarat a.k.a. Punto Banco received 3079 views. After a visit to Holland Casino, I thought it’d be fun to approximate the odds of gambling through statistical simulation.
  3. Bayesian data analysis for newcomers received 2253 views. It contains the link to an open access paper explaining the basics of Bayesian analysis.
  4. Identifying “Dirty” Twitter Bots with R and Python received 2247 views. It tells the story of two programmers who uncover networks of filthy social media accounts.
  5. rstudio::conf 2018 summary received 1514 views. It provides links to the most salient talks and presentations of the yearly R gathering.
  6. R tips & tricks is relatively new and has only yet received 1212 views. Seperate from the R resources guide, this new list contains all the quick tricks that help you program more effectively in R.
  7. Super Resolution: A Photo Enhancer AI received 891 views and elaborates on the development of new tools that can upgrade photo and video data quality.
  8. ggstatsplot: Creating graphics including statistical details is also relatively new but already attracted 810 visitors. It explains the novel visualization package in R that allows you to quickly create elaborate statistical plots.

Biggest failures of 2018

Where there’s success, there’s failure. Some of my posts did not get a lot of attention by my readership. That’s unfortunate, as I really only take the time to blog about the stuff that I deem interesting enough. Were these failed blog posts just unlucky, or am I biased and were they simply really bad and uninteresting?

You be the judge! Here are some of the least read posts of 2018:

General statistics

Now, let’s move to some general statistics: in 2018, paulvanderlaken.com received 85.614 views, by 57.594 unique visitors. I posted 61 new blogs, consisting of a total of 31.598 words. Fifty-one visitors liked one of my posts, and 24 visitors took the time to post a comment of their own (my replies included, probably).

Compared to last year, my website did pretty well!

20172018Δ
Views3849085614122%
Unique visitors2694957594114%
Posts10061-39%
Words / post625518-17%
Likes355146%
Comments9924-76%

However, the above statistics do not properly reflect the development of my website. For instance, I only really started generating traffic after my first viral post (i.e., Harry Plotter). The below graph takes that into account and better reflects the development of the traffic to my website.

The upward trend in traffic looks promising!

All time favorites

Looking back to the start of paulvanderlaken.com, let’s also examine which blogs have been performing well ever since their conception.

Clearly, most people have been coming for the R resources overview, as demonstrated by the visual below. Moreover, the majority of blog posts has not been visited much — only a handful ever cross the 1000 views mark.

The blogs that attracted a large public in 2017 (such as the original Harry Plotter and its sequal, and the Kaggle 2017 DS survey) have phased out a bit.

Fortunately, the introductory guide for newcomers to R is still kickstarting many programming careers! And on an additional positive note, more and more visitors seem to inspect the homepage and archives.

Redirected visitors

Finally , let’s have a closer look as to what brought people to my website.The below visualizes the main domains that redirected visitors.

Search engines provided the majority of traffic in both 2017 and 2018 –
mainly Google; to a lesser extent, DuckDuckGo and Bing (who in his right mind uses Norton Safe Search?!). My Twitter visitors increased in 2018 as compared to 2017, as did my traffic from this specific Quora page.

And that concludes my two year anniversary of paulvanderlaken.com review. I hope you enjoyed it, and that you will return to my website for the many more years to come : )

I end with a big shout out to my most loyal readers!
104 people have subscribed to my website (as of 2019-01-22)
and receive an update wherener I post a new blog.

Thank you for your continued support!

Want to join this group of elite followers?
Press the Follow button 
in the right toolbar, or at the bottom of this blog post.

IBM’s Watson for Oncology: A Biased and Unproven Recommendation System in Cancer Treatment?

IBM’s Watson for Oncology: A Biased and Unproven Recommendation System in Cancer Treatment?

The below reiterates and summarizes this Stat article.

Recently, I addressed how bias may slip into Machine Learning applications and this weekend I came across another real-life example: IBM’s Watson, specifically Watson for Oncology. With a single machine, IBM intended to tackle humanity’s most vexing diseases and revolutionize medicine and they quickly zeroed in on a high-profile target: cancer.

However, three years later now, a STAT investigation has found that the supercomputer isn’t living up to the lofty expectations IBM created for it. IBM claims that, through Artificial Intelligence, Watson for Oncology can generate new insights and identify “new approaches” to cancer care. However, the STAT investigation (video below) concludes that the system doesn’t create new knowledge and is artificially intelligent only in the most rudimentary sense of the term. Similarly, cancer specialists using the product argue Watson is still in its “toddler stage” when it comes to oncology.

Let’s start with the positive side. For specific treatments, Watson can scan academic literature, immediately providing the “best data” about a treatment — survival rates, for example — thereby relieving doctors of tedious literature searches. Due to this transparency, Watson may level the hierarchy commonly found in hospital settings, by holding (senior) doctors accountable to the data and empowering junior physicians to back up their arguments. Furthermore, Watson’s information may empower patients as they can be offered a comprehensive packet of treatment options, including potential treatment plans along with relevant scientific articles. Patients can do their own research about these treatments, and maybe even disagree with the doctor about the right course of action.

Although study results demonstrate that Watson saves doctors time and can have a high concordance rate with their treatment recommendations, much more research is needed. The studies were all conference abstracts, which haven’t been published in peer-reviewed journals — and all but one was either conducted by a paying customer or included IBM staff on the author list, or both. More importantly, IBM has failed to exposed Watson for Oncology to critical review by outside scientists nor have they conducted clinical trials to assess its effectiveness. It would be very interesting to examine whether Watson’s implementation is actually saving lives or making healthcare more efficient/effective.

imaging-video[1].jpg
IBM Watson Health
Such validation is especially necessary because several issues are identified. First, the actual capabilities of Watson for Oncology are not well-understood by the public, and even by some of the hospitals that use it. It’s taken nearly six years of painstaking work by data engineers and doctors to train Watson in just seven types of cancer, and keep the system updated with the latest knowledge. Moreover, because of the complexity of the underlying machine learning algorithms, the recommendations Watson puts out are a black box, and Watson can not provide the specific reasons for picking treatment A over treatment B.

Second, the system is essentially Memorial Sloan Kettering in a portable box. IBM celebrates Memorial Sloan Kettering’s role as the only trainer of Watson. After all, who better to educate the system than doctors at one of the world’s most renowned cancer hospitals? However, doctors claim that Memorial Sloan Kettering’s training has caused bias in the system, because the treatment recommendations it puts into Watson don’t always comport with the practices of doctors elsewhere in the world. When users ask Watson for advice, the system also searches published literature — some of which is curated by Memorial Sloan Kettering — to provide relevant studies and background information to support its recommendation. But the recommendation itself is derived from the training provided by the hospital’s doctors, not the outside literature.

 

Doctors at Memorial Sloan Kettering acknowledged their influence on Watson. “We are not at all hesitant about inserting our bias, because I think our bias is based on the next best thing to prospective randomized trials, which is having a vast amount of experience,” said Dr. Andrew Seidman, one of the hospital’s lead trainers of Watson. “So it’s a very unapologetic bias.

However, this bias causes serious problems when Watson for Oncology is implemented in other countries/hospitals. The generally affluent population treated at Memorial Sloan Kettering doesn’t reflect the diversity of people around the world. According to Martijn van Oijen, an epidemiologist and associate professor at Academic Medical Center in the Netherlands, Watson has not been implemented in because of country level differences in treatment approaches. Similarly, oncologists at one hospital in Denmark said they have dropped implementation altogether after finding that local doctors agreed with Watson in only about 33 percent of cases. Different problems occurred in South Korea, where researchers reported that the treatment Watson most often recommended for breast cancer patients simply wasn’t covered by their national insurance system.

Kris, the lead trainer at Memorial Sloan Kettering, says nobody wants to hear the problems. “All they want to hear is that Watson is the answer. And it always has the right answer, and you get it right away, and it will be cheaper. But like anything else, it’s kind of human.