This zine explains SELECT queries step by step with tons of examples that will show you exactly what’s happening when you run a query. You’ll be able to easily translate your questions about your data into queries and get answers fast.
I’ve been blogging for just over three years now. Writing my first blogs in January 2017 in a small café in London out of pastime, I had never imagined that I would actually attract a readership. And while my blog grew in visitors and followers, it always remained just a hobby for me to summarize and post stuff I was reading or working on at the time.
Still, I want to make sure that you — my readership — get the most out of the time I invest in my blogs. Some blogs are more interesting than others, I am aware. However, I notice that it’s not only the content, but also the time and momentum of posting that draws in you readers.
Hence, I propose the following: Rather than erratically posting everything at the time of writing, I’m going to provide you with some regularity. As of today…
I will post a new blog every Tuesday, at 15:00 GMT+1
“Why that specific time?” I hear you asking. Well, it’s the perfect time as all my followers are still awake on the same day!
My Asian followers can read the blog just before or in bed,
My African and European readers can look forward to their commute home, and
My North- and South-American fans can enjoy it with their morning coffee!
Over the course of last year, I posted nearly 100 blogs. If you do some quick maths, you will conclude that 1 post every Tuesday is a lot less.
And this is where you come in:
Please let me know at what time you would prefer to have me post a second blog every (other) week. You can do so in the comments, by sending me a direct message, or by reaching out to me on twitter or LinkedIn.
Owen argues that the probability distributions relate to each other in intuitive and interesting ways that makes it easier for you to recall them. For instance, several follow naturally from the Bernoulli distribution. Having this map by hand should thus help you really understand what these distributions imply.
On top of that, it’s just a nice geeky network poster!
Now, Sean didn’t just make a fancy map. In the original blog he also explains each of the distributions and how it relates to the others. Having this knowledge is vital to being a good data scientist / analyst.
You can sometimes get away with simple analysis using R or scikit-learn without quite understanding distributions, just like you can manage a Java program without understanding hash functions. But it would soon end in tears, bugs, bogus results, or worse: sighs and eye-rolling from stats majors.
For instance, here’s Sean explaining the Binomial distribution:
The binomial distribution may be thought of as the sum of outcomes of things that follow a Bernoulli distribution. Toss a fair coin 20 times; how many times does it come up heads? This count is an outcome that follows the binomial distribution. Its parameters are n, the number of trials, and p, the probability of a “success” (here: heads, or 1). Each flip is a Bernoulli-distributed outcome, or trial. Reach for the binomial distribution when counting the number of successes in things that act like a coin flip, where each flip is independent and has the same probability of success.
Bayesian networks are a type of probabilistic graphical model that uses Bayesian inference for probability computations. Bayesian networks aim to model conditional dependence, and therefore causation, by representing conditional dependence by edges in a directed graph. Through these relationships, one can efficiently conduct inference on the random variables in the graph through the use of factors.
As Bayes nets represent data as a probabilistic graph, it is very easy to use that structure to simulate new data that demonstrate the realistic patterns of the underlying causal system. Daniel’s post shows how to do this with bnlearn.
New data is simulated from a Bayes net (see above) by first sampling from each of the root nodes, in this case sex. Then followed by the children conditional on their parent(s) (e.g. sport | sex and hg | sex) until data for all nodes has been drawn. The numbers on the nodes below indicate the sequence in which the data is simulated, noting that rcc is the terminal node.
The original and simulated datasets are compared in a couple of ways 1) observing the distributions of the variables 2) comparing the output from various models and 3) comparing conditional probability queries. The third test is more of a sanity check. If the data is generated from the original Bayes net then a new one fit on the simulated data should be approximately the same. The more rows we generate the closer the parameters will be to the original values.
As you can see, a Bayesian network allows you to generate data that looks, feels, and behaves a lot like the data on which you based your network on in the first place.
This can be super useful if you want to generate a synthetic / fake / artificial dataset without sharing personal or sensitive data.
Moreover, the underlying Bayesian net can be very useful to compute missing values. In Daniel’s example, he left out some values on purpose (pretending they were missing) and imputed them with the Bayes net. He found that the imputed values for the missing data points were quite close to the original ones:
In the original blog, Daniel goes on to show how to further check the integrity of the simulated data using statistical models and shares all his code so you can try this out yourself. Please do give his website a visit as Daniel has many more interesting statistics blogs!
Sometimes I find these AI / programming hobby projects that I just wished I had thought of…
Will Stedden combined OpenAI’s GPT-2 deep learning text generation model with another deep-learning language model by Google called BERT (Bidirectional Encoder Representations from Transformers) and created an elaborate architecture that had one purpose: posting the best replies on Reddit.
The architecture is shown at the end of this post — copied from Will’s original bloghere. Moreover, you can read this post for details regarding the construction of the system. But let me see whether I can explain you what it does in simple language.
The below is what a Reddit comment and reply thread looks like. We have str8cokane making a comment to an original post (not in the picture), and then tupperware-party making a reply to that comment, followed by another reply by str8cokane. Basically, Will wanted to create an AI/bot that could write replies like tupperware-party that real people like str8cokane would not be able to distinguish from “real-people” replies.
Note that with 4 points, str8cokane‘s original comments was “liked” more than tupperware-party‘s reply and str8cokane‘s next reply, which were only upvoted 2 and 1 times respectively.
So here’s what the final architecture looks like, and my attempt to explain it to you.
Basically, we start in the upper left corner, where Will uses a database (i.e. corpus) of Reddit comments and replies to fine-tune a standard, pretrained GPT-2 model to get it to be good at generating (red: “fake”) realistic Reddit replies.
Next, in the upper middle section, these fake replies are piped into a standard, pretrained BERT model, along with the original, real Reddit comments and replies. This way the BERT model sees both real and fake comments and replies. Now, our goal is to make replies that are undistinguishable from real replies. Hence, this is the task the BERT model gets. And we keep fine-tuning the original GPT-2 generator until the BERT discriminator that follows is no longer able to distinguish fake from real replies. Then the generator is “fooling” the discriminator, and we know we are generating fake replies that look like real ones! You can find more information about such generative adversarial networks here.
Next, in the top right corner, we fine-tune another BERT model. This time we give it the original Reddit comments and replies along with the amount of times they were upvoted (i.e. sort of like likes on facebook/twitter). Basically, we train a BERT model to predict for a given reply, how much likes it is going to get.
Finally, we can go to production in the lower lane. We give a real-life comment to the GPT-2 generator we trained in the upper left corner, which produces several fake replies for us. These candidates we run through the BERT discriminator we trained in the upper middle section, which determined which of the fake replies we generated look most real. Those fake but realistic replies are then input into our trained BERT model of the top right corner, which predicts for every fake but realistic reply the amount of likes/upvotes it is going to get. Finally, we pick and reply with the fake but realistic reply that is predicted to get the most upvotes!
The results are astonishing! Will’s bot sounds like a real youngster internet troll! Do have a look at the original blog, but here are some examples. Note that tupperware-party — the Reddit user from the above example — is actually Will’s AI.
I know there are definitely some ethical considerations when creating something like this. The reason I’m presenting it is because I actually think it is better for more people to know about and be able to grapple with this kind of technology. If just a few people know about the capacity of these machines, then it is more likely that those small groups of people can abuse their advantage.
I also think that this technology is going to change the way we think about what’s important about being human. After all, if a computer can effectively automate the paper-pushing jobs we’ve constructed and all the bullshit we create on the internet to distract us, then maybe it’ll be time for us to move on to something more meaningful.
If you think what I’ve done is a problem feel free to email me , or publically shame me on Twitter.
Most data scientists favor Python as a programming language these days. However, there’s also still a large group of data scientists coming from a statistics, econometrics, or social science and therefore favoring R, the programming language they learned in university. Now there’s a new kid on the block: Julia.
Advantages & Disadvantages
According to some, you can think of Julia as a mixture of R and Python, but faster. As a programming language for data science, Julia has some major advantages:
Julia is light-weight and efficient and will run on the tiniest of computers
Julia is just-in-time (JIT) compiled, and can approach or match the speed of C
Julia is a functional language at its core
Julia support metaprogramming: Julia programs can generate other Julia programs
Julia has a math-friendly syntax
Julia has refined parallelization compared to other data science languages
Julia can call C, Fortran, Python or R packages
However, others also argue that Julia comes with some disadvantages for data science, like data frame printing, 1-indexing, and its external package management.
You can click the links below to jump directly to the section you’re interested in. Once there, you can compare the packages and functions that allow you to perform Data Science tasks in the three languages.
Tait Brown was annoyed at the Victoria Police who had spent $86 million Australian dollars on developing the BlueNet system which basically consists of an license-plate OCR which crosschecks against a car theft database.
Anyway, he built a system that can identify license plates, read them, and should be able to cross check them with a criminal database.
I really liked reading about this project, so please do so if you’re curious via the links below: