A new blog post every Tuesday!

A new blog post every Tuesday!

Hi dear readers!

I’ve been blogging for just over three years now. Writing my first blogs in January 2017 in a small café in London out of pastime, I had never imagined that I would actually attract a readership. And while my blog grew in visitors and followers, it always remained just a hobby for me to summarize and post stuff I was reading or working on at the time.

Still, I want to make sure that you — my readership — get the most out of the time I invest in my blogs. Some blogs are more interesting than others, I am aware. However, I notice that it’s not only the content, but also the time and momentum of posting that draws in you readers.

Hence, I propose the following: Rather than erratically posting everything at the time of writing, I’m going to provide you with some regularity. As of today…

I will post a new blog every Tuesday, at 15:00 GMT+1

“Why that specific time?” I hear you asking. Well, it’s the perfect time as all my followers are still awake on the same day!

  • My Asian followers can read the blog just before or in bed,
  • My African and European readers can look forward to their commute home, and
  • My North- and South-American fans can enjoy it with their morning coffee!

Join 206 other followers

Over the course of last year, I posted nearly 100 blogs. If you do some quick maths, you will conclude that 1 post every Tuesday is a lot less.

And this is where you come in:

Please let me know at what time you would prefer to have me post a second blog every (other) week. You can do so in the comments, by sending me a direct message, or by reaching out to me on twitter or LinkedIn.

Image via uoe.co.uk/uoe-post-office-services

Probability Distributions mapped and explained by their relationships

Probability Distributions mapped and explained by their relationships

Sean Owen created this handy cheat sheet that shows the most common probability distributions mapped by their underlying relationships.

Probability distributions are fundamental to statistics, just like data structures are to computer science. They’re the place to start studying if you mean to talk like a data scientist. 

Sean Owen (via)

Owen argues that the probability distributions relate to each other in intuitive and interesting ways that makes it easier for you to recall them. For instance, several follow naturally from the Bernoulli distribution. Having this map by hand should thus help you really understand what these distributions imply.

On top of that, it’s just a nice geeky network poster!

Sean’s map of the relationships between probability distributions (via)

Now, Sean didn’t just make a fancy map. In the original blog he also explains each of the distributions and how it relates to the others. Having this knowledge is vital to being a good data scientist / analyst.

You can sometimes get away with simple analysis using R or scikit-learn without quite understanding distributions, just like you can manage a Java program without understanding hash functions. But it would soon end in tears, bugs, bogus results, or worse: sighs and eye-rolling from stats majors.

Sean Owen (via)

For instance, here’s Sean explaining the Binomial distribution:

The binomial distribution may be thought of as the sum of outcomes of things that follow a Bernoulli distribution. Toss a fair coin 20 times; how many times does it come up heads? This count is an outcome that follows the binomial distribution. Its parameters are n, the number of trials, and p, the probability of a “success” (here: heads, or 1). Each flip is a Bernoulli-distributed outcome, or trial. Reach for the binomial distribution when counting the number of successes in things that act like a coin flip, where each flip is independent and has the same probability of success.

Sean Owen (via)

Header image via Alison-Static

Simulating data with Bayesian networks, by Daniel Oehm

Simulating data with Bayesian networks, by Daniel Oehm

Daniel Oehm wrote this interesting blog about how to simulate realistic data using a Bayesian network.

Bayesian networks are a type of probabilistic graphical model that uses Bayesian inference for probability computations. Bayesian networks aim to model conditional dependence, and therefore causation, by representing conditional dependence by edges in a directed graph. Through these relationships, one can efficiently conduct inference on the random variables in the graph through the use of factors.

Devin Soni via Medium

As Bayes nets represent data as a probabilistic graph, it is very easy to use that structure to simulate new data that demonstrate the realistic patterns of the underlying causal system. Daniel’s post shows how to do this with bnlearn.

Daniel’s example Bayes net

New data is simulated from a Bayes net (see above) by first sampling from each of the root nodes, in this case sex. Then followed by the children conditional on their parent(s) (e.g. sport | sex and hg | sex) until data for all nodes has been drawn. The numbers on the nodes below indicate the sequence in which the data is simulated, noting that rcc is the terminal node.

Daniel Oehms in his blog

The original and simulated datasets are compared in a couple of ways 1) observing the distributions of the variables 2) comparing the output from various models and 3) comparing conditional probability queries. The third test is more of a sanity check. If the data is generated from the original Bayes net then a new one fit on the simulated data should be approximately the same. The more rows we generate the closer the parameters will be to the original values.

The original data alongside the generated data in Daniel’s example

As you can see, a Bayesian network allows you to generate data that looks, feels, and behaves a lot like the data on which you based your network on in the first place.

This can be super useful if you want to generate a synthetic / fake / artificial dataset without sharing personal or sensitive data.

Moreover, the underlying Bayesian net can be very useful to compute missing values. In Daniel’s example, he left out some values on purpose (pretending they were missing) and imputed them with the Bayes net. He found that the imputed values for the missing data points were quite close to the original ones:

For two variables, the original values plotted against the imputed replacements.

In the original blog, Daniel goes on to show how to further check the integrity of the simulated data using statistical models and shares all his code so you can try this out yourself. Please do give his website a visit as Daniel has many more interesting statistics blogs!

Building a realistic Reddit AI that get upvoted in Python

Building a realistic Reddit AI that get upvoted in Python

Sometimes I find these AI / programming hobby projects that I just wished I had thought of…

Will Stedden combined OpenAI’s GPT-2 deep learning text generation model with another deep-learning language model by Google called BERT (Bidirectional Encoder Representations from Transformers) and created an elaborate architecture that had one purpose: posting the best replies on Reddit.

The architecture is shown at the end of this post — copied from Will’s original blog here. Moreover, you can read this post for details regarding the construction of the system. But let me see whether I can explain you what it does in simple language.

The below is what a Reddit comment and reply thread looks like. We have str8cokane making a comment to an original post (not in the picture), and then tupperware-party making a reply to that comment, followed by another reply by str8cokane. Basically, Will wanted to create an AI/bot that could write replies like tupperware-party that real people like str8cokane would not be able to distinguish from “real-people” replies.

Note that with 4 points, str8cokane‘s original comments was “liked” more than tupperware-party‘s reply and str8cokane‘s next reply, which were only upvoted 2 and 1 times respectively.

gpt2-bert on China
Example reddit comment and replies (via bonkerfield.org/)

So here’s what the final architecture looks like, and my attempt to explain it to you.

  1. Basically, we start in the upper left corner, where Will uses a database (i.e. corpus) of Reddit comments and replies to fine-tune a standard, pretrained GPT-2 model to get it to be good at generating (red: “fake”) realistic Reddit replies.
  2. Next, in the upper middle section, these fake replies are piped into a standard, pretrained BERT model, along with the original, real Reddit comments and replies. This way the BERT model sees both real and fake comments and replies. Now, our goal is to make replies that are undistinguishable from real replies. Hence, this is the task the BERT model gets. And we keep fine-tuning the original GPT-2 generator until the BERT discriminator that follows is no longer able to distinguish fake from real replies. Then the generator is “fooling” the discriminator, and we know we are generating fake replies that look like real ones!
    You can find more information about such generative adversarial networks here.
  3. Next, in the top right corner, we fine-tune another BERT model. This time we give it the original Reddit comments and replies along with the amount of times they were upvoted (i.e. sort of like likes on facebook/twitter). Basically, we train a BERT model to predict for a given reply, how much likes it is going to get.
  4. Finally, we can go to production in the lower lane. We give a real-life comment to the GPT-2 generator we trained in the upper left corner, which produces several fake replies for us. These candidates we run through the BERT discriminator we trained in the upper middle section, which determined which of the fake replies we generated look most real. Those fake but realistic replies are then input into our trained BERT model of the top right corner, which predicts for every fake but realistic reply the amount of likes/upvotes it is going to get. Finally, we pick and reply with the fake but realistic reply that is predicted to get the most upvotes!
What Will’s final architecture, combining GPT-2 and BERT, looked like (via bonkerfield.org)

The results are astonishing! Will’s bot sounds like a real youngster internet troll! Do have a look at the original blog, but here are some examples. Note that tupperware-party — the Reddit user from the above example — is actually Will’s AI.

COMMENT: 'Dune’s fandom is old and intense, and a rich thread in the cultural fabric of the internet generation' BOT_REPLY:'Dune’s fandom is overgrown, underfunded, and in many ways, a poor fit for the new, faster internet generation.'
bot responds to specific numerical bullet point in source comment

Will ends his blog with a link to the tutorial if you want to build such a bot yourself. Have a try!

Moreover, he also notes the ethical concerns:

I know there are definitely some ethical considerations when creating something like this. The reason I’m presenting it is because I actually think it is better for more people to know about and be able to grapple with this kind of technology. If just a few people know about the capacity of these machines, then it is more likely that those small groups of people can abuse their advantage.

I also think that this technology is going to change the way we think about what’s important about being human. After all, if a computer can effectively automate the paper-pushing jobs we’ve constructed and all the bullshit we create on the internet to distract us, then maybe it’ll be time for us to move on to something more meaningful.

If you think what I’ve done is a problem feel free to email me , or publically shame me on Twitter.

Will Stedden via bonkerfield.org/2020/02/combining-gpt-2-and-bert/

Learn Julia for Data Science

Learn Julia for Data Science

Most data scientists favor Python as a programming language these days. However, there’s also still a large group of data scientists coming from a statistics, econometrics, or social science and therefore favoring R, the programming language they learned in university. Now there’s a new kid on the block: Julia.

Image result for julia programming"
Via Medium

Advantages & Disadvantages

According to some, you can think of Julia as a mixture of R and Python, but faster. As a programming language for data science, Julia has some major advantages:

  1. Julia is light-weight and efficient and will run on the tiniest of computers
  2. Julia is just-in-time (JIT) compiled, and can approach or match the speed of C
  3. Julia is a functional language at its core
  4. Julia support metaprogramming: Julia programs can generate other Julia programs
  5. Julia has a math-friendly syntax
  6. Julia has refined parallelization compared to other data science languages
  7. Julia can call C, Fortran, Python or R packages

However, others also argue that Julia comes with some disadvantages for data science, like data frame printing, 1-indexing, and its external package management.

Comparing Julia to Python and R

Open Risk Manual published this side-by-side review of the main open source Data Science languages: Julia, Python, R.

You can click the links below to jump directly to the section you’re interested in. Once there, you can compare the packages and functions that allow you to perform Data Science tasks in the three languages.

GeneralDevelopmentAlgorithms & Datascience
History and CommunityDevelopment EnvironmentGeneral Purpose Mathematical Libraries
Devices and Operating SystemsFiles, Databases and Data ManipulationCore Statistics Libraries
Package ManagementWeb, Desktop and Mobile DeploymentEconometrics / Timeseries Libraries
Package DocumentationSemantic Web / Semantic DataMachine Learning Libraries
Language CharacteristicsHigh Performance ComputingGeoSpatial Libraries
Using R, Python and Julia togetherVisualization
Via openriskmanual.org/wiki/Overview_of_the_Julia-Python-R_Universe

Starting with Julia for Data Science

Here’s a very well written Medium article that guides you through installing Julia and starting with some simple Data Science tasks. At least, Julia’s plots look like:

Via Medium
Building a $86 million car theft AI in 57 lines of JavaScript

Building a $86 million car theft AI in 57 lines of JavaScript

Tait Brown was annoyed at the Victoria Police who had spent $86 million Australian dollars on developing the BlueNet system which basically consists of an license-plate OCR which crosschecks against a car theft database.

Tait was so disgruntled as he thought he could easily replicate this system without spending millions and millions of tax dollars. And so he did. In only 57 lines of JavaScript, though, to be honest, there are many more lines of code hidden away in abstraction and APIs…

Anyway, he built a system that can identify license plates, read them, and should be able to cross check them with a criminal database.

Via Medium

I really liked reading about this project, so please do so if you’re curious via the links below:

Part 1: How I replicated an $86 million project in 57 lines of code

Part 2: Remember the $86 million license plate scanner I replicated?

Part X: the code on Github

Cover image via Medium via Freepik