Northstar: The interactive, drag-and-drop data science platform by MIT

MIT researchers have spent years developing the new drag-and-drop analytics tools they call Northstar.

Northstar is an interactive data science platform that rethinks how people interact with data. It empowers users without programming experience, background in statistics or machine learning expertise to explore and mine data through an intuitive user interface, and effortlessly build, analyze, and evaluate machine learning (ML) pipelines.
northstar.mit.edu/

Northstar starts as a blank, white interface. Users upload datasets into the system, which appear in a “datasets” box on the left. Any data labels will automatically populate a separate “attributes” box below. There’s also an “operators” box that contains various algorithms, as well as the new AutoML tool. All data are stored and analyzed in the cloud.
news.mit.edu/2019/drag-drop-data-analytics-0627

You can read more about the tool’s functionalities in this MIT news article, which includes several promising GIFs.

Moreover, on the Northstar website you can find this longer video explaining the tool in detail.

https://vimeo.com/342787403

While Northstar looks insanely cool and promising, I do worry about putting such power in the hands of people who may not have much experience with statistics and/or machine learning. We all know how easily errors and bias may slip into data-driven processes, so I am curious to see how these kinds of next-gen tools will be deployed and used.

Survival of the Best Fit: A webgame on AI in recruitment

Survival of the Best Fit is a webgame that simulates what happens when companies automate their recruitment and selection processes.

You – playing as the CEO of a starting tech company – are asked to select your favorite candidates from a line-up, based on their resumés.

As your simulated company grows, the time pressure increases, and you are forced to automate the selection process.

Fortunately, some smart techies working for your company propose training a computer to hire just like you did.

They don’t need anything but the data you just generated and some good old supervised machine learning!

To avoid spoilers, try the game yourself and see what happens!

The game only takes a few minutes, and is best played on mobile.

www.survivalofthebestfit.com/ via Medium

Survival of the Best Fit was built by Gabor Csapo, Jihyun Kim, Miha Klasinc, and Alia ElKattan. They are software engineers, designers and technologists, advocating for better software that allows members of the public to question its impact on society.

You don’t need to be an engineer to question how technology is affecting our lives. The goal is not for everyone to be a data scientist or machine learning engineer, though the field can certainly use more diversity, but to have enough awareness to join the conversation and ask important questions.
With Survival of the Best Fit, we want to reach an audience that may not be the makers of the very technology that impact them everyday. We want to help them better understand how AI works and how it may affect them, so that they can better demand transparency and accountability in systems that make more and more decisions for us.
survivalofthebestfit.com

I found that the game provides a great intuitive explanation of how (human) bias can slip into A.I. or machine learning applications in recruitment, selection, or other human resource management practices and processes.
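To make that mechanism concrete, here is a minimal Python sketch of how a model trained on biased decisions reproduces them. It is not how the game works internally. The groups "A" and "B", the skill scale, the bonus of 2, and the threshold of 7 are all hypothetical choices made for illustration, and the "model" is just a table of historical hire rates.

```python
import random

random.seed(1)

def historical_decision(skill, group):
    """Hypothetical biased recruiter: group 'A' gets an unearned bonus."""
    bonus = 2 if group == "A" else 0
    return skill + bonus >= 7

# Simulated past decisions; skill is drawn identically for both groups.
history = []
for _ in range(2000):
    group = random.choice(["A", "B"])
    skill = random.randint(1, 10)
    history.append((skill, group, historical_decision(skill, group)))

# "Training": count applicants and hires per (skill, group) cell.
counts = {}
for skill, group, hired in history:
    n, k = counts.get((skill, group), (0, 0))
    counts[(skill, group)] = (n + 1, k + int(hired))

def predict_hire(skill, group):
    """Predict 'hire' when most past candidates in this cell were hired."""
    n, k = counts[(skill, group)]
    return k / n >= 0.5

# Two equally skilled candidates, different groups.
print(predict_hire(6, "A"), predict_hire(6, "B"))  # prints "True False"
```

The model never sees the recruiter's rule, only the outcomes, yet it treats two candidates with identical skill differently because the training labels did.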

If you want to read more about people analytics and machine learning in HR, I wrote my dissertation on the topic and have many great books I strongly recommend.

Finally, here’s a nice Medium post about the game.

https://www.survivalofthebestfit.com/game/

Note, as Joachin replied below, that the game apparently does not learn from user input, but is programmed to always result in bias towards blues.
I kind of hoped that there was actually an algorithm “learning” in the backend, and while the developers could argue that the bias arises from the added external training data (you picked either Google, Apple, or Amazon to learn from), it feels like a bit of a disappointment that there is no real interactivity here.

Recreating graphics from the Fundamentals of Data Visualization

Claus Wilke wrote the Fundamentals of Data Visualization – a great resource that’s definitely high on my list of recommended data visualization books.

In a recent post, Claus shared the link to a GitHub repository where he hosts some of the R code with which he made the graphics for his dataviz book. The repository is named practical_ggplot2, after the R package Claus used to make many of his visuals.

Check it out: the page contains some pearls and the code behind them, which will help you learn to create fabulous visualizations yourself. Some examples:

Via https://htmlpreview.github.io/?https://github.com/clauswilke/practical_ggplot2/blob/master/health_status.html

Via https://htmlpreview.github.io/?https://github.com/clauswilke/practical_ggplot2/blob/master/corruption_human_development.html

Here’s the original tweet in case you want to see the responses.

I prepared step-by-step instructions for reproducing a few of the figures from my book. #rstats #ggplot2 #dataviz https://t.co/fe7vO3wwAN pic.twitter.com/G5Lk4K91gU
— Claus Wilke (@ClausWilke) June 4, 2019

Python vs. R for Data Science, by Norm Matloff

Cover image via Hacker Noon.

Norm Matloff is a professor of Computer Science at the University of California, Davis. He recently updated his viewpoint on whether R or Python is the best language for Data Science. While I normally hate those opinionated comparisons, Norm’s outline of the two languages’ (dis)advantages is actually quite balanced and well-reasoned.

I strongly recommend that you read Norm’s original piece here.

I can mostly agree with Norm, although the blog reads as if he has a (slight) bias in favor of R. In his original blog, Norm discusses many different programming topics and provides detailed information on why he considers certain topics big wins, slight edges, or ties between the two programming languages.

In the table below, I’ve tried to summarize Norm’s opinions by converting his wording to 0-100 scores per topic for a quicker overview: his huge win became 100-0, a big win 80-20, a win 70-30, an edge 60-40, and a tie 50-50.


Topic	Python	R
Elegance	100	0
Learning curve	0	100
Data Science libraries	40	60
Machine Learning	60	40
Statistical correctness	20	80
Parallel computing	50	50
C/C++ interface	40	60
Object orientation, metaprogramming	40	60
Language unity	100	0
Linked data structures	70	30
Online help	20	80
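
For those who want to redo the conversion themselves, here is a minimal Python sketch of the wording-to-score mapping described above. The names `VERDICT_SCORES` and `to_scores` are my own, hypothetical helpers, and the example verdicts are simply read back from the scores in the table.

```python
# Verdict wording mapped to (winner score, loser score).
VERDICT_SCORES = {
    "huge win": (100, 0),
    "big win": (80, 20),
    "win": (70, 30),
    "edge": (60, 40),
    "tie": (50, 50),
}

def to_scores(verdict, winner=None):
    """Return (python_score, r_score) for a verdict and its winner."""
    high, low = VERDICT_SCORES[verdict]
    if verdict == "tie" or winner == "Python":
        return (high, low)
    if winner == "R":
        return (low, high)
    raise ValueError("winner must be 'Python' or 'R' unless the verdict is a tie")

# A few rows of the table, expressed as verdicts.
print(to_scores("huge win", "Python"))  # Elegance
print(to_scores("big win", "R"))        # Statistical correctness
print(to_scores("tie"))                 # Parallel computing
```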

I personally started my career with R, so that’s definitely my favorite programming language. However, I think that Python is more convenient and faster on certain topics, and closer to more mainstream programming languages, which is why I’m currently learning it next to using R.

If you want to learn R, I can recommend you follow my quick 6-step guide to learning R programming. Alternatively, Norm points to his quick tutorial on R for non-programmers, and a tutorial on Python, for learners with a programming background.

Happy learning!

PS. This tweet by John summarizes the whole discussion quite well.

Someone asked me "R vs. Python", so:
1. It depends what you're trying to do
2. If you're trying to capitalise the letter r, I'd go with R, but if you're trying to strangle a woodland animal, I'd say python
3. Java is better than either. It's a huge island! Tropical rainforests!
— John Burn-Murdoch (@jburnmurdoch) June 18, 2019

Generalized Additive Models Tutorial in R, by Noam Ross

Generalized Additive Models, or GAMs for short, have been somewhat of a mystery to me. I’ve known about them, but didn’t know exactly what they did, or when they’re useful. That came to an end when I found out about this tutorial by Noam Ross.

In this beautiful, online, interactive course, Noam allows you to program several GAMs yourself (in R) and to progressively learn about the different functions and features. I am currently halfway through, but I already enjoy it very much.

If you’re already familiar with linear models and want to learn something new, I strongly recommend this course!

The interactive course asks you to program several GAMs yourself https://noamross.github.io/gams-in-r-course/

You progressively learn how to run, interpret, and visualize GAMs yourself https://noamross.github.io/gams-in-r-course/

After a while you are even able to visualize smoothed interactions between variables https://noamross.github.io/gams-in-r-course/

Logistic regression is not fucked, by Jake Westfall

Recently, I came across a social science paper that had used linear probability regression. I had never heard of linear probability models (LPM), but they seem to be simply ordinary least squares regression applied to a binary dependent variable.

According to some, LPM is a commonly used alternative to logistic regression, which is what I was taught to use when the outcome is binary.

Potentially because of my own social science background (HRM), using linear regression without a link transformation on binary data just seems very unintuitive and error-prone to me. Hence, I looked for more information.
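
To make the concern concrete, here is a minimal Python sketch that fits both models to simulated data and compares their predictions at extreme predictor values. Everything in it is an assumption for illustration: the data-generating slope of 1.5, the sample size, the seed, and the helper names `fit_lpm` and `fit_logistic`. It is not taken from Jake's article.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_lpm(xs, ys):
    """Linear probability model: OLS of y on x with an intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def fit_logistic(xs, ys, n_iter=25):
    """Logistic regression with one predictor via Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p          # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w             # negative Hessian entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Simulated binary outcome whose true log-odds are 1.5 * x.
random.seed(42)
xs = [random.gauss(0, 1) for _ in range(500)]
ys = [1 if random.random() < sigmoid(1.5 * x) else 0 for x in xs]

a, b = fit_lpm(xs, ys)
b0, b1 = fit_logistic(xs, ys)

grid = [-4.0, 0.0, 4.0]
lpm_preds = [a + b * x for x in grid]
logit_preds = [sigmoid(b0 + b1 * x) for x in grid]
print("LPM:     ", lpm_preds)
print("Logistic:", logit_preds)
```

In this setup the LPM predictions at the extremes fall below 0 and above 1, which cannot be probabilities, while the logistic predictions always stay between 0 and 1.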

I particularly liked this article by Jake Westfall, which he dubbed “Logistic regression is not fucked”, following a series of blogs in which he talks about methods that are fucked and not useful.

Jake explains the classification problem and both methods’ inner workings in a very straightforward way, using great visual aids. He shows how LPM would differ from logistic models, and why its proposed benefits are actually not so beneficial. Maybe I’m in my bubble, but Jake’s arguments resonated.

Read his article yourself:
http://jakewestfall.org/blog/index.php/2018/03/12/logistic-regression-is-not-fucked/

Here’s the summary:
Arguments against the use of logistic regression due to problems with “unobserved heterogeneity” proceed from two distinct sets of premises. The first argument points out that if the binary outcome arises from a latent continuous outcome and a threshold, then observed effects also reflect latent heteroskedasticity. This is true, but only relevant in cases where we actually care about an underlying continuous variable, which is not usually the case. The second argument points out that logistic regression coefficients are not collapsible over uncorrelated covariates, and claims that this precludes any substantive interpretation. On the contrary, we can interpret logistic regression coefficients perfectly well in the face of non-collapsibility by thinking clearly about the conditional probabilities they refer to.
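
The non-collapsibility mentioned in the summary can be illustrated with a small exact calculation. This is my own minimal Python sketch, not code from Jake's article. The coefficients are hypothetical, and the second covariate z is assumed to be a fair coin flip that is independent of x.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    return math.log(p / (1.0 - p))

# Hypothetical model: logit P(Y=1 | x, z) = b0 + b1*x + b2*z
b0, b1, b2 = -1.0, 1.0, 2.0

def p_y(x, z):
    return sigmoid(b0 + b1 * x + b2 * z)

# Conditional effect of x: log odds ratio within each level of z.
conditional = [log_odds(p_y(1, z)) - log_odds(p_y(0, z)) for z in (0, 1)]

def p_y_marginal(x):
    """Average over z, assuming P(z=1) = 0.5 independently of x."""
    return 0.5 * p_y(x, 0) + 0.5 * p_y(x, 1)

# Marginal effect of x: log odds ratio after z is left out.
marginal = log_odds(p_y_marginal(1)) - log_odds(p_y_marginal(0))

print("conditional log odds ratios:", conditional)
print("marginal log odds ratio:    ", marginal)
```

Within each level of z the log odds ratio for x equals b1 = 1, but after averaging over z it shrinks to roughly 0.80, even though z is uncorrelated with x. Both numbers are correct; they simply refer to different conditional probabilities, which is the point of the quoted summary.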