Computerphile is a Youtube sister channel of Numberphile. Where Numberphile’s videos are about the magic behind match and numbers, Computerphile’s videos are all about computers and computer stuff. I recommend both channels in general, and have watched many of their videos already.

Yet, over the past weeks I specifically enjoyed what seems to be several series of videos on Cyber Security related topics.

What makes a good password?

One series is all about passwords.

What are strong passwords, which are bad? How can hackers crack yours? And how do websites secure user passwords?

The videos below are in somewhat of the right order and they make for an interesting insight in the world of password management. They give you advice on how to pick you password, and even a nice tool to check whether your password has ever been leaked.

Probably, you will want to change your password afterwards!

Hacking and attacking

If you are up to no good, please do not watch this second series, which revolves all around hacks and computer attacks.

How do people get access to a websites database? How can we prevent it? How can we recognize security dangers?

You might know of SQL injections, but do you know what a slow loris attack is? Or how ransomware works? Or what exploitX is?

These videos nicely continue the line of a previous post on Try Hack Me’s Cyber Security Challenges, where you can learn how computers work and where there vulnerabilities lie.

Update March, 2021: My R package for the predictive power score (ppsr) is live on CRAN! Try install.packages("ppsr") in your R terminal to get the latest version.

A few months ago, I wrote about the Predictive Power Score (PPS): a handy metric to quickly explore and quantify the relationships in a dataset.

As a social scientist, I was taught to use a correlation matrix to describe the relationships in a dataset. Yet, in my opinion, the PPS provides three handy advantages:

PPS works for any type of data, also nominal/categorical variables

PPS quantifies non-linear relationships between variables

PPS acknowledges the asymmetry of those relationships

# You can get the official version from CRAN:
install.packages("ppsr")
## Or you can get the development version from GitHub:
# install.packages('devtools')
# devtools::install_github('https://github.com/paulvanderlaken/ppsr')

Usage

The ppsr package has three main functions that compute PPS:

score() – which computes an x-y PPS

score_predictors() – which computes X-y PPS

score_matrix() – which computes X-Y PPS

Visualizing PPS

Subsequently, there are two main functions that wrap around these computational functions to help you visualize your PPS using ggplot2:

visualize_predictors() – producing a barplot of all X-y PPS

visualize_matrix() – producing a heatmap of all X-Y PPS

Note that Species is a nominal/categorical variable, with three character/text options.

A correlation matrix would not be able to show us that the type of iris Species can be predicted extremely well by the petal length and width, and somewhat by the sepal length and width. Yet, particularly sepal width is not easily predicted by the type of species.

Exploring mtcars

It takes about 10 seconds to run 121 decision trees with visualize_matrix(mtcars). Yet, the output is much more informative than the correlation matrix:

cyl can be much better predicted by mpg than the other way around

the classification of vs can be done well using nearly all variables as predictors, except for am

yet, it’s hard to predict anything based on the vs classification

a cars’ am can’t be predicted at all using these variables

The correlation matrix does provides insights that are not provided by the PPS matrix. Most importantly, the sign and strength of any linear relationship that may exist. For instance, we can deduce that mpg relates strongly negatively with cyl.

Yet, even though half of the matrix does not provide any additional information (due to the symmetry), I still find it hard to derive the most important relations and insights at a first glance.

Moreover, the rows and columns for vs and am are not very informative in this correlation matrix as it contains pearson correlations coefficients by default, whereas vs and am are binary variables. The same can be said for cyl, gear and carb, which contain ordinal categories / integer data, so you can discuss the value of these coefficients depicted here.

Exploring trees

In R, there are many datasets built in via the datasets package. Let’s explore some using the ppsr::visualize_matrix() function.

datasets::trees has data on 31 trees’ girth, height and volume.

visualize_matrix(datasets::trees) shows that both girth and volume can be used to predict the other quite well, but not perfectly.

Let’s have a look at the correlation matrix.

The scores here seem quite higher in general. A near perfect correlation between volume and girth.

Is it near perfect though? Let’s have a look at the underlying data and fit a linear model to it.

You will still be pretty far off the real values when you use a linear model based on Girth to predict Volume. This is what the original PPS of 0.65 tried to convey.

Actually, I’ve run the math for this linaer model and the RMSE is still 4.11. Using just the mean Volume as a prediction of Volume will result in 16.17 RMSE. If we map these RMSE values on a linear scale from 0 to 1, we would get the PPS of our linear model, which is about 0.75.

So, actually, the linear model is a better predictor than the decision tree that is used as a default in the ppsr package. That was used to generate the PPS matrix above.

Yet, the linear model definitely does not provide a perfect prediction, even though the correlation may be near perfect.

Conclusion

In sum, I feel using the general idea behind PPS can be very useful for data exploration.

Particularly in more data science / machine learning type of projects. The PPS can provide a quick survey of which targets can be predicted using which features, potentially with more complex than just linear patterns.

Yet, the old-school correlation matrix also still provides unique and valuable insights that the PPS matrix does not. So I do not consider the PPS so much an alternative, as much as a complement in the toolkit of the data scientist & researcher.

Enjoy the R package, or the Python module for that matter, and let me know if you see any improvements!

A flowchart is a picture of the separate steps of a process in sequential order. It it super useful to organize and interpret business processes, IT systems, or computer algorithms.

I draw flowcharts and process diagrams all the time in my daily work as a data scientist!

Drawing out the business process is often a first step in any project, in order to really understand the underlying business workflow and problems. I feel doing so greatly facilitates opportunity finding.

Moreover, when designing a machine learning or data science architecture — with data coming from different sources, being manipulated using different workflows, and ending up in models feeding multiple business processes — drawing the whole she-bang out really helps me personally to keep overview.

There are licensed software programs such as Microsoft Visio that allow you to create flowcharts. But there are also numerous free applications that can help you draw up a flow chart.

I’ve recently transitioned to VS Code for all my Python programming, so I really welcome this cool feature. It integrates all the flow chart functionality of draw.io right there in your IDE. Incredible!

Here’s a demo:

Here’s another demo, but with a light theme, showing how easy it is to export your diagrams to a shareable png file.

Moreover, due to VS Code’s amazing “LiveShare” feature, you can even collaborate with colleagues and build a flow chart together, simulatenously:

Now there are many more features to this plugin. You can write and change the JavaScript code behind the objects to tailor it completely to your theme and tastes. Or if you prefer working with XML, you can just alter that code. Everything seems to work as a charm.

Google’s guidebook to human-centered AI design refered to the Design Kit, containing numerous helpful tools to help you design products with user experience in mind.

The design kit website contains many practical methods, tools, case studies and much more resources to help you in the design process.

Human-centered design is a practical, repeatable approach to arriving at innovative solutions. Think of these Methods as a step-by-step guide to unleashing your creativity, putting the people you serve at the center of your design process to come up with new answers to difficult problems.

The design kit methods section provides some seriously handy guidelines to help you design your products with the customer in mind. A step-by-step process guideline is offered, as well as neat worksheets to records the information you collect in the process, and a video explanation of the method.