Tag: r

calmcode.io > video tutorials for open source tools

calmcode.io is an e-learning platform that I really really really recommend to programmers and data scientists:

It is free.

It involves open source tools.

It uses bite-sized tutorial videos.

It explains tools clearly.

It explains everything calmly.

There’s tons of content about computer programming, data science, and personal productivity.

On top of this all, it’s by Vincent Warmerdam, and everything he touches seems to turn to gold.

Check it out here: https://calmcode.io/

You can subscribe to their newsletter here, to receive new content fresh from the presses, straight in your inbox!

There are just so many tutorials on so many different topics! Here are some quick glances at some topics and tools:

How to confuse your shareholders by bad data visualization

Like many people during the COVID19 crisis, I turned to the stock market as a new hobby.

Like the ignorant investor that I am, I thought it wise to hop on the cloud computing bandwagon.

Hence, I bought, among others, a small position in Rackspace Technologies.

A long way down

Now, my Rackspace shares have plummeted in price since I bought them.

Screenshot of Google Finance on August 25th 2021: https://www.google.com/finance/quote/RXT:NASDAQ?sa=X&ved=2ahUKEwjxqdr0oczyAhWKtqQKHZk3A90Q_AUoAXoECAEQAw&window=6M

Obviously, this is less than ideal for me, but also, I should not be surprised.

Clearly, I knew nothing about the company I bought shares in. Apparently they are going through some big time reorganization, and this is not good price-wise.

Fast forward to yesterday.

Doing research

To re-evalute my investment, I thought it wise to have a look at Rackspace’s Quarterly Report.

According to Investopedia: A quarterly report is a summary or collection of unaudited financial statements, such as balance sheets, income statements, and cash flow statements, issued by companies every quarter (three months). In addition to reporting quarterly figures, these statements may also provide year-to-date and comparative (e.g., last year’s quarter to this year’s quarter) results. Publicly-traded companies must file their reports with the Securities Exchange Committee (SEC).

Fortunately these quarterly reports are readily available on the investors relation page, and they are not that hard to read once you have seen a few.

Visualizing financial data

I was excited to see that Rackspace offered their financial performance in bite-sized bits to me as a laymen, through their usage of nice visualizations of the financial data.

Please take a moment to process the below copy of page 11 of their 2021 Q2 report:

Screenshot of page 11 of the 2021 Q2 Quarterly Report of Rackspace Technologies: https://ir.rackspace.com/static-files/474fde80-f203-4227-a438-57b062992d46

Though… the longer I looked at these charts… the more my head started to hurt…

How can the growth line be about the same in the three charts Total Revenue (top-left), Core Revenue (top-right), and Non-GAAP EPS (bottom-right)? They represent different increments: 13%, 17%, and 14% respectively.

Zooming in on the top left: how does the $657 revenue of 2Q20 fit inside the $744 revenue of 2Q21 almost three times?!

The increase is only 13%, not 300%!

Recreating the problem

I decided to recreate the vizualizations of the quarterly report.

To see what the visualization should have actually looked like. And to see how they could have made this visualization worse.

You can find the R ggplot2 code for these plots here on Github.

If you know me, you know I can’t do something 50%, so I decided to make the plots look as closely to the original Rackspace design as possible.

Here are the results:

Here are all three combined, along with two simple questions:

This I shared on social media (LinkedIn, Twitter), to ask for people’s opinions:

And I tagged Rackspace and offered them my help!

I hope they’re not offended and respond : )

ppsr live on CRAN!

Finding predictive patterns in your dataset with one line of code!

Today — March 2nd 2021 — my first R package was published on the comprehensive R archive network (CRAN).

ppsr is the R implementation of the Predictive Power Score (PPS).

The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two variables. You can read more about the concept in earlier blog posts (here and here), or here on Github, or via Medium.

With the ppsr package live on CRAN, it is now super easy to install the package and examine the predictive relationships in your dataset:

Here’s the official ppsr CRAN page for those interested:

ppsr: An R implementation of the Predictive Power Score

Update March, 2021: My R package for the predictive power score (ppsr) is live on CRAN!
Try install.packages("ppsr") in your R terminal to get the latest version.

A few months ago, I wrote about the Predictive Power Score (PPS): a handy metric to quickly explore and quantify the relationships in a dataset.

As a social scientist, I was taught to use a correlation matrix to describe the relationships in a dataset. Yet, in my opinion, the PPS provides three handy advantages:

PPS works for any type of data, also nominal/categorical variables
PPS quantifies non-linear relationships between variables
PPS acknowledges the asymmetry of those relationships

Florian Wetschoreck came up with the PPS idea, wrote the original blog, and programmed a Python implementation of it (called ppscore).

Yet, I work mostly in R and I was very keen on incorporating this powertool into my general data science workflow.

So, over the holiday period, I did something I have never done before: I wrote an R package!

It’s called ppsr and you can find the code here on github.

Installation

# You can get the official version from CRAN:
install.packages("ppsr")

## Or you can get the development version from GitHub:
# install.packages('devtools')
# devtools::install_github('https://github.com/paulvanderlaken/ppsr')

Usage

The ppsr package has three main functions that compute PPS:

score() – which computes an x-y PPS
score_predictors() – which computes X-y PPS
score_matrix() – which computes X-Y PPS

Visualizing PPS

Subsequently, there are two main functions that wrap around these computational functions to help you visualize your PPS using ggplot2:

visualize_predictors() – producing a barplot of all X-y PPS

visualize_matrix() – producing a heatmap of all X-Y PPS

Note that Species is a nominal/categorical variable, with three character/text options.

A correlation matrix would not be able to show us that the type of iris Species can be predicted extremely well by the petal length and width, and somewhat by the sepal length and width. Yet, particularly sepal width is not easily predicted by the type of species.

Exploring mtcars

It takes about 10 seconds to run 121 decision trees with visualize_matrix(mtcars). Yet, the output is much more informative than the correlation matrix:

cyl can be much better predicted by mpg than the other way around
the classification of vs can be done well using nearly all variables as predictors, except for am
yet, it’s hard to predict anything based on the vs classification
a cars’ am can’t be predicted at all using these variables

The correlation matrix does provides insights that are not provided by the PPS matrix. Most importantly, the sign and strength of any linear relationship that may exist. For instance, we can deduce that mpg relates strongly negatively with cyl.

Yet, even though half of the matrix does not provide any additional information (due to the symmetry), I still find it hard to derive the most important relations and insights at a first glance.

Moreover, the rows and columns for vs and am are not very informative in this correlation matrix as it contains pearson correlations coefficients by default, whereas vs and am are binary variables. The same can be said for cyl, gear and carb, which contain ordinal categories / integer data, so you can discuss the value of these coefficients depicted here.

Exploring trees

In R, there are many datasets built in via the datasets package. Let’s explore some using the ppsr::visualize_matrix() function.

datasets::trees has data on 31 trees’ girth, height and volume.

visualize_matrix(datasets::trees) shows that both girth and volume can be used to predict the other quite well, but not perfectly.

Let’s have a look at the correlation matrix.

The scores here seem quite higher in general. A near perfect correlation between volume and girth.

Is it near perfect though? Let’s have a look at the underlying data and fit a linear model to it.

You will still be pretty far off the real values when you use a linear model based on Girth to predict Volume. This is what the original PPS of 0.65 tried to convey.

Actually, I’ve run the math for this linaer model and the RMSE is still 4.11. Using just the mean Volume as a prediction of Volume will result in 16.17 RMSE. If we map these RMSE values on a linear scale from 0 to 1, we would get the PPS of our linear model, which is about 0.75.

So, actually, the linear model is a better predictor than the decision tree that is used as a default in the ppsr package. That was used to generate the PPS matrix above.

Yet, the linear model definitely does not provide a perfect prediction, even though the correlation may be near perfect.

Conclusion

In sum, I feel using the general idea behind PPS can be very useful for data exploration.

Particularly in more data science / machine learning type of projects. The PPS can provide a quick survey of which targets can be predicted using which features, potentially with more complex than just linear patterns.

Yet, the old-school correlation matrix also still provides unique and valuable insights that the PPS matrix does not. So I do not consider the PPS so much an ~~alternative~~, as much as a complement in the toolkit of the data scientist & researcher.

Enjoy the R package, or the Python module for that matter, and let me know if you see any improvements!

JavaScript for R — ebook

The R programming language has seen the integration of many languages; C, C++, Python, to name a few, can be seamlessly embedded into R so one can conveniently call code written in other languages from the R console. Little known to many, R works just as well with JavaScript—this book delves into the various ways both languages can work together.
https://book.javascript-for-r.com/

John Coene is an well-known R and JavaScript developer. He recently wrote a book on JavaScript for R users, of which he published an online version free to access here.

The book is definitely worth your while if you want to better learn how to develop front-end applications (in JavaScript) on top of your statistical R programs. Think of better understanding, and building, yourself Shiny modules or advanced data visualizations integrated right into webpages.

A nice step on your development path towards becoming a full stack developer by combining R and JavaScript!

Yet most R developers are not familiar with one of web browsers’ core technology: JavaScript. This book aims to remedy that by revealing how much JavaScript can greatly enhance various stages of data science pipelines from the analysis to the communication of results.
https://book.javascript-for-r.com/

Want to learn more about JavaScript in general, then I recommend this book:

Bayesian Statistics using R, Python, and Stan

For a year now, this course on Bayesian statistics has been on my to-do list. So without further ado, I decided to share it with you already.

Richard McElreath is an evolutionary ecologist who is famous in the stats community for his work on Bayesian statistics.

At the Max Planck Institute for Evolutionary Anthropology, Richard teaches Bayesian statistics, and he was kind enough to put his whole course on Statistical Rethinking: Bayesian statistics using R & Stan open access online.

You can find the video lectures here on Youtube, and the slides are linked to here: