Category: programming

Try Hack Me – Cyber Security Challenges

Try Hack Me – Cyber Security Challenges

Sometimes I just stumble across these random resources that I immediately want to share with fellow geeks. If you like computers and programming, you should definitely have a look at…

https://tryhackme.com

TryHackMe started in 2018 by two cyber security enthusiasts, Ashu Savani and Ben Spring, who met at a summer internship. When getting started with in the field, they found learning security to be a fragmented, inaccessable and difficult experience; often being given a vulnerable machine’s IP with no additional resources is not the most efficient way to learn, especially when you don’t have any prior knowledge. When Ben returned back to University he created a way to deploy machines and sent it to Ashu, who suggested uploading all the notes they’d made over the summer onto a centralised platform for others to learn, for free.

To allow users to share their knowledge, TryHackMe allows other users (at no charge) to create a virtual room, which contains a combination of theoretical and practical learning components.. In early 2019, Jon Peters started creating rooms and suggested the platform build up a community, a task he took on and succeeded in.

The platform has never raised any capital and is entirely bootstrapped.

https://tryhackme.com/about

I don’t have any affiliation or whatever with the platform, but I just think it’s a super cool resource if you want to learn more about hands-on computer stuff.

Here’s a nice demo on an advanced programmer taking on one of the first challenges. I definitely still have a long way to go, but it’s fun to watch someone sneak into a (dummy) server and look for clues! Like a proper detective, but then an extra nerdy one!

There are many “hacktivities” you can try on the platform.

And if you’re serious about learning this stuff, there are learning paths set out for you!

If you like their content, do consider taking a paid subscription and share this great initiative!

ppsr live on CRAN!

ppsr live on CRAN!

Finding predictive patterns in your dataset with one line of code!

Today — March 2nd 2021 — my first R package was published on the comprehensive R archive network (CRAN).

ppsr is the R implementation of the Predictive Power Score (PPS).

The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two variables. You can read more about the concept in earlier blog posts (here and here), or here on Github, or via Medium.

With the ppsr package live on CRAN, it is now super easy to install the package and examine the predictive relationships in your dataset:

Here’s the official ppsr CRAN page for those interested:

Awesome R Shiny Resources & Extensions

Awesome R Shiny Resources & Extensions

Rob Gilmore curates a github repo listing resources for working with Shiny, the R web framework and dashboarding tool.

Nan Xiao curates a second repository, listing awesome R packages offer that extensions to Shiny, like extended UI or server components.

They should be your go-to resources when looking for anything Shiny!

Shiny Resources

Extensions

Become a Data Science Professional

Become a Data Science Professional

Amit Ness gathered an impressive list of learning resources for becoming a data scientist.

It’s great to see that he shares them publicly on his github so that others may follow along.

But beware, this learning guideline covers a multi-year process.

Amit’s personal motto seems to be “Becoming better at data science every day“.

Completing the hyperlinked list below will take you several hundreds days at the least!

Learning Philosophy:

Index

How a File Format Exposed a Crossword Scandal

Vincent Warmerdam shared this Youtube video which I thoroughly enjoyed watched. It’s about Saul Pwanson, a software engineer whose hobby project got a little out of hand.

In 2016, Saul Pwanson designed a plain-text file format for crossword puzzle data, and then spent a couple of months building a micro-data-pipeline, scraping tens of thousands of crosswords from various sources.

After putting all these crosswords in a simple uniform format, Saul used some simple command line commands to check for common patterns and irregularities.

Surprisingly enough, after visualizing the results, Saul discovered egregious plagiarism by a major crossword editor that had gone on for years.

Ultimately, 538 even covered the scandal:

I thoroughly enjoyed watching this talk on Youtube.

Saul covers the file format, data pipeline, and the design choices that aided rapid exploration; the evidence for the scandal, from the initial anomalies to the final damning visualization; and what it’s like for a data project to get 15 minutes of fame.

I tried to localize the dataset online, but it seems Saul’s website has since gone offline. If you do happen to find it, please do share it in the comments!

ppsr: An R implementation of the Predictive Power Score

ppsr: An R implementation of the Predictive Power Score

Update March, 2021: My R package for the predictive power score (ppsr) is live on CRAN!
Try install.packages("ppsr") in your R terminal to get the latest version.

A few months ago, I wrote about the Predictive Power Score (PPS): a handy metric to quickly explore and quantify the relationships in a dataset.

As a social scientist, I was taught to use a correlation matrix to describe the relationships in a dataset. Yet, in my opinion, the PPS provides three handy advantages:

  1. PPS works for any type of data, also nominal/categorical variables
  2. PPS quantifies non-linear relationships between variables
  3. PPS acknowledges the asymmetry of those relationships

Florian Wetschoreck came up with the PPS idea, wrote the original blog, and programmed a Python implementation of it (called ppscore).

Yet, I work mostly in R and I was very keen on incorporating this powertool into my general data science workflow.

So, over the holiday period, I did something I have never done before: I wrote an R package!

It’s called ppsr and you can find the code here on github.

Installation

# You can get the official version from CRAN:
install.packages("ppsr")

## Or you can get the development version from GitHub:
# install.packages('devtools')
# devtools::install_github('https://github.com/paulvanderlaken/ppsr')

Usage

The ppsr package has three main functions that compute PPS:

  • score() – which computes an x-y PPS
  • score_predictors() – which computes X-y PPS
  • score_matrix() – which computes X-Y PPS

Visualizing PPS

Subsequently, there are two main functions that wrap around these computational functions to help you visualize your PPS using ggplot2:

  • visualize_predictors() – producing a barplot of all X-y PPS
  • visualize_matrix() – producing a heatmap of all X-Y PPS
PPS matrix for iris

Note that Species is a nominal/categorical variable, with three character/text options.

A correlation matrix would not be able to show us that the type of iris Species can be predicted extremely well by the petal length and width, and somewhat by the sepal length and width. Yet, particularly sepal width is not easily predicted by the type of species.

Correlation matrix for iris

Exploring mtcars

It takes about 10 seconds to run 121 decision trees with visualize_matrix(mtcars). Yet, the output is much more informative than the correlation matrix:

  • cyl can be much better predicted by mpg than the other way around
  • the classification of vs can be done well using nearly all variables as predictors, except for am
  • yet, it’s hard to predict anything based on the vs classification
  • a cars’ am can’t be predicted at all using these variables
PPS matrix for mtcars

The correlation matrix does provides insights that are not provided by the PPS matrix. Most importantly, the sign and strength of any linear relationship that may exist. For instance, we can deduce that mpg relates strongly negatively with cyl.

Yet, even though half of the matrix does not provide any additional information (due to the symmetry), I still find it hard to derive the most important relations and insights at a first glance.

Moreover, the rows and columns for vs and am are not very informative in this correlation matrix as it contains pearson correlations coefficients by default, whereas vs and am are binary variables. The same can be said for cyl, gear and carb, which contain ordinal categories / integer data, so you can discuss the value of these coefficients depicted here.

Correlation matrix for mtcars

Exploring trees

In R, there are many datasets built in via the datasets package. Let’s explore some using the ppsr::visualize_matrix() function.

datasets::trees has data on 31 trees’ girth, height and volume.

visualize_matrix(datasets::trees) shows that both girth and volume can be used to predict the other quite well, but not perfectly.

Let’s have a look at the correlation matrix.

The scores here seem quite higher in general. A near perfect correlation between volume and girth.

Is it near perfect though? Let’s have a look at the underlying data and fit a linear model to it.

You will still be pretty far off the real values when you use a linear model based on Girth to predict Volume. This is what the original PPS of 0.65 tried to convey.

Actually, I’ve run the math for this linaer model and the RMSE is still 4.11. Using just the mean Volume as a prediction of Volume will result in 16.17 RMSE. If we map these RMSE values on a linear scale from 0 to 1, we would get the PPS of our linear model, which is about 0.75.

So, actually, the linear model is a better predictor than the decision tree that is used as a default in the ppsr package. That was used to generate the PPS matrix above.

Yet, the linear model definitely does not provide a perfect prediction, even though the correlation may be near perfect.

Conclusion

In sum, I feel using the general idea behind PPS can be very useful for data exploration.

Particularly in more data science / machine learning type of projects. The PPS can provide a quick survey of which targets can be predicted using which features, potentially with more complex than just linear patterns.

Yet, the old-school correlation matrix also still provides unique and valuable insights that the PPS matrix does not. So I do not consider the PPS so much an alternative, as much as a complement in the toolkit of the data scientist & researcher.

Enjoy the R package, or the Python module for that matter, and let me know if you see any improvements!