MIT researchers have spent years developing the new drag-and-drop analytics tools they call Northstar.
Northstar is an interactive data science platform that rethinks how people interact with data. It empowers users without programming experience, a background in statistics, or machine learning expertise to explore and mine data through an intuitive user interface, and to effortlessly build, analyze, and evaluate machine learning (ML) pipelines.
Northstar starts as a blank, white interface. Users upload datasets into the system, which appear in a “datasets” box on the left. Any data labels will automatically populate a separate “attributes” box below. There’s also an “operators” box that contains various algorithms, as well as the new AutoML tool. All data are stored and analyzed in the cloud.
You can read more about the tool’s functionalities in this MIT news article, which includes several promising GIFs:
Moreover, on the Northstar website you can find this longer video explaining the tool in detail.
While Northstar looks insanely cool and promising, I do worry about putting such power in the hands of people who may not have much experience with statistics and/or machine learning. We all know how easily errors and bias may slip into data-driven processes, so I am curious to see how these kinds of next-gen tools will be deployed and used.
Josh Starmer is an assistant professor in the genetics department of the University of North Carolina at Chapel Hill.
But more importantly: Josh is the mastermind behind StatQuest!
StatQuest is a YouTube channel (and website) dedicated to explaining complex statistical concepts, such as data distributions, probability, or novel machine learning algorithms, in simple terms.
Once you watch one of Josh’s “StatQuests”, you immediately recognize the effort he has put into this project. Using great visuals, a just-about-right pace, and relatable examples, Josh makes statistics accessible to everyone. For instance, take this series on logistic regression:
And do you really know what happens under the hood when you run a principal component analysis? After this video you will:
Or, if you are more interested in learning the fundamental concepts behind machine learning, Josh has some videos for you too, for instance on bias and variance or gradient descent:
With nearly 200 videos and counting, StatQuest is truly an amazing resource for students and teachers on topics related to statistics and data analytics. For some of the concepts, Josh even posted videos running you through the analysis steps and results interpretation in the R language.
StatQuest started out as an attempt to explain statistics to my co-workers – who are all genetics researchers at UNC-Chapel Hill. They did these amazing experiments, but they didn’t always know what to do with the data they generated. That was my job. But I wanted them to understand that what I do isn’t magic – it’s actually quite simple. It only seems hard because it’s all wrapped up in confusing terminology and typically communicated using equations. I found that if I stripped away the terminology and communicated the concepts using pictures, it became easy to understand.
Over time I made more and more StatQuests and now it’s my passion on YouTube.
Xeno.graphics is the collection of unusual charts and maps Maarten Lambrechts maintains. It’s a repository of novel, innovative, and experimental visualizations to inspire you, to fight xenographphobia, and popularize new chart types.
For instance, have you ever heard of a time curve? These are very useful to visualize the development of a relationship over time.
The upset plot is another example of an up-and-coming visualization. It can show the overlap or intersection between sets in a dataset, for instance in the social network of #rstats Twitter heroes, as the example below from the Xenographics website does.
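To make the idea behind an upset plot a bit more concrete: each bar in such a plot counts the items that belong to exactly one particular combination of sets. Below is a minimal sketch of that counting step in Python. The hero names and follower IDs are made up for illustration, and the function name `exclusive_intersections` is my own, not part of any plotting library.

```python
from itertools import combinations

def exclusive_intersections(sets):
    """Count items that belong to exactly each combination of the named sets."""
    names = sorted(sets)
    counts = {}
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            # Items present in every set of this combination ...
            inside = set.intersection(*(sets[n] for n in combo))
            # ... minus items that also appear in any set outside the combination.
            outside = set().union(*(sets[n] for n in names if n not in combo))
            size = len(inside - outside)
            if size:
                counts[combo] = size
    return counts

# Hypothetical data: which users follow which (made-up) Twitter hero.
followers = {
    "hero_a": {"u1", "u2", "u3", "u4"},
    "hero_b": {"u3", "u4", "u5"},
    "hero_c": {"u4", "u6"},
}

result = exclusive_intersections(followers)
for combo, size in sorted(result.items(), key=lambda kv: -kv[1]):
    print(" & ".join(combo), size)
```

Because every user lands in exactly one combination, the counts add up to the total number of unique users, which is a handy sanity check before plotting.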
The below necklace map is new to me too. What it does precisely is unclear to me as well.
There are hundreds of other interesting charts, maps, figures, and plots, so do have a look yourself. Moreover, the xenographics collection is still growing. If you know of one that isn’t here already, please submit it. You can also expect some posts about certain topics around xenographics.
Tilburg University has set up a masterclass Predictive HR Analytics. In 3 days, the Professional Learning program will teach you all you need to know to implement predictive analytics and take HR to the next level. More information can be found here.
What makes this program unique?
The masterclass Predictive HR Analytics goes beyond HR analytics and focuses on transformational people predictions. You learn how to embed predictive HR analytics into your HR Strategy and how to use your findings to convince others.
The masterclass is developed at the prestigious Human Resources department at Tilburg University, which has obtained international recognition with its high-quality academic research in the HRM field.
The mix of professors in conjunction with leading HR professionals leads to a strong academic program with a practical approach.
Your peer participants will make sure that the class opens up a high-quality network of HR specialists. The diversity of leading companies from different sectors in the classroom creates new insights for all the participants.
The program is like a 3-day pressure cooker. By combining online and offline components, we can create more in-depth discussions in the classroom.
You will experience a high impact on your daily practice, since the program is focused on direct implementation.
This course is ideal for anyone in HR seeking to become more adept in using quantitative data for decision making. Typical participants are (future) HR analysts, HR managers, HR business partners, HR consultants and (financial) business analysts with a strong link to people resources. Participants are from various sectors, such as financial services, healthcare institutions, government agencies and business services.
Things can get confusing quite quickly if you’re a layman. People boast about boosting while deep, brain-like networks are used to play child’s games. Data gurus speak of mighty, though random woodlands and the media simultaneously praise and criticize IBM Watson. To create even more confusion, consultancy firms introduce a new type of analytics every year, each one more valuable than its predecessor. I am not even kidding, I counted seven, eight, nine, ten, eleven types: descriptive, diagnostic, exploratory, inferential, strategic, causal, enterprise, advanced, predictive, prescriptive, adaptive, and cognitive analytics, roughly in that order of complexity.
I experience the resulting confusion firsthand in my work. In my workshops, people ask questions like “How can I use data mining to make our dashboards more predictive?” or “How can I build neural networks to understand our customer needs?”. Similarly, I’ve heard managers ask for more “cognitive solutions” or “one of those fancy neural networks”. However, things can get pretty ugly, pretty soon, once unnecessary complexity is introduced without good reasons (e.g., superior performance, processing speed), appropriate foundations (e.g., accurate, valid, and sufficient data), or good research designs (e.g., control conditions, random assignment, out-of-sample validation).
It is high time to demystify the data domain. If people outside the direct domain know what’s what, they will better understand what can and can’t be done with data. Moreover, they will not be as easily fooled by the cognitive AI mumbo jumbo of consultants. A recent blog made me very happy. David Robinson, data scientist at Stack Overflow, proposes very simple definitions of three interrelated domains (data science, machine learning, and artificial intelligence) and highlights their differences. If you haven’t yet, do read it, but to summarize David’s take:
Data science produces insights
Machine learning produces predictions
Artificial intelligence produces actions
These definitions are overly simplistic, David acknowledges, and not without their flaws: “A fortune teller makes predictions, but we’d never say that they’re doing machine learning!”. However, I feel it’s a great first attempt at demystification. In particular, the applied example with which David continues makes matters clearer:
Suppose we were building a self-driving car, and were working on the specific problem of stopping at stop signs. We would need skills drawn from all three of these fields.
Machine learning: The car has to recognize a stop sign using its cameras. We construct a dataset of millions of photos of streetside objects, and train an algorithm to predict which have stop signs in them.
Artificial intelligence: Once our car can recognize stop signs, it needs to decide when to take the action of applying the brakes. It’s dangerous to apply them too early or too late, and we need it to handle varying road conditions (for example, to recognize on a slippery road that it’s not slowing down quickly enough), which is a problem of control theory.
Data science: In street tests, we find that the car’s performance isn’t good enough, with some false negatives in which it drives right by a stop sign. After analyzing the street test data, we gain the insight that the rate of false negatives depends on the time of day: it’s more likely to miss a stop sign before sunrise or after sunset. We realize that most of our training data included only objects in full daylight, so we construct a better dataset including nighttime images and go back to the machine learning step.
Around the same time I read David’s blog, I came across the picture below, and its brother:
This got me thinking about how I would explain the field to a layman. In Human Resource Management (my PhD domain), there is enormous confusion around what’s what. When HR professionals speak of analytics they can mean about anything from a group average or a bar chart up to a deep neural network. I hoped that a simple diagram could help solve some of the confusion in terminology. Here’s my attempt:
Note that this diagram reflects my personal, implicit definitions of the concepts. Hence, in many ways, it may be biased, incorrect, or plain stupid. Fortunately, the r/datascience and r/MachineLearning communities were very willing to help me improve it. I should also stress that David’s blog inspired the attempt in the first place. While the diagram still greatly oversimplifies matters (and is in conflict with the purist academic definitions), I hope it helps as a layman’s introduction to the field.
How to read it? From left to right, we start out with raw data. Often, we’d first transform this data into usable features/variables: discriminatory characteristics of the objects we’re trying to analyze. On the one hand, a researcher may engineer these features, for instance through some (statistical) transformation such as taking the average X within groups or reducing the number of categories for Z. On the other hand, unsupervised machine learning techniques may be applied to (semi-)automatically engineer features by identifying relevant clusters or dimensions in the data.
Next, the features can be input into statistical analysis. Taking the upper path, both unsupervised and supervised machine learning techniques can be used to build models that can be interpreted to gain insights about phenomena. This process is what business people usually mean when they say “analytics”. Mostly, it involves descriptive, causal or inferential analyses in order to gain insights into some process or phenomenon. Taking the lower path, supervised learning may be applied to build a predictive model and retrieve predictions for a dependent variable. These predictions may also be evaluated using further analysis to retrieve insights, for instance to gain understanding about what’s driving the predictions or how the predictions may be leveraged in practice.
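The two hand-engineered transformations mentioned above (a within-group average of X and a reduced set of categories for Z) can be sketched in a few lines of plain Python. Everything here is illustrative: the rows, the column names `team`, `x`, and `z`, and the category mapping are assumptions of mine, not taken from any real dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw data: one dictionary per observation.
rows = [
    {"team": "sales", "x": 4.0, "z": "bachelor"},
    {"team": "sales", "x": 6.0, "z": "master"},
    {"team": "it", "x": 3.0, "z": "phd"},
    {"team": "it", "x": 5.0, "z": "secondary"},
]

def add_group_mean(rows, group_key, value_key, new_key):
    """Add a feature holding the average of value_key within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    means = {group: mean(values) for group, values in groups.items()}
    return [{**row, new_key: means[row[group_key]]} for row in rows]

def collapse_categories(rows, key, mapping, default):
    """Reduce the number of categories by mapping them onto broader ones."""
    return [{**row, key: mapping.get(row[key], default)} for row in rows]

features = add_group_mean(rows, "team", "x", "x_team_mean")
features = collapse_categories(
    features,
    "z",
    {"bachelor": "higher", "master": "higher", "phd": "higher"},
    default="lower",
)

for row in features:
    print(row)
```

In practice you would probably reach for a data frame library for this, but the logic stays the same: derive a new column from group-level statistics, or recode an existing one into fewer levels.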
Finally, both predictions and insights may form the basis of actions, which can be taken by a human agent or by a computer agent. In the latter case, we would deal with AI by some definitions.
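As a toy illustration of the lower path (features, then a supervised model, then a prediction, then an action taken by a computer agent), consider the sketch below. It is deliberately simplistic: the training data are made up, the "model" is just a cut-off halfway between the two class means, and the action names are hypothetical.

```python
from statistics import mean

# Hypothetical labelled training data: (feature value, observed outcome).
training = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]

def fit_threshold(data):
    """Learn a cut-off halfway between the class means.

    Assumes the positive class (1) tends to have higher feature values.
    """
    mean_negative = mean(x for x, y in data if y == 0)
    mean_positive = mean(x for x, y in data if y == 1)
    return (mean_negative + mean_positive) / 2

def predict(threshold, x):
    """Machine learning step: turn a feature value into a prediction."""
    return 1 if x > threshold else 0

def act(prediction):
    """Action step: a single if-then rule executed by the computer agent."""
    return "send alert" if prediction == 1 else "do nothing"

threshold = fit_threshold(training)
for new_value in (4.0, 6.5):
    prediction = predict(threshold, new_value)
    print(new_value, prediction, act(prediction))
```

A human agent could just as well read the prediction and decide what to do; the point is only that the action is a separate step that builds on the prediction.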
There is one more route in the diagram, going directly from the raw data to the predictions: deep learning. Here, a neural network may take in complex data (e.g., text, images, sound) and engineer relevant features autonomously to base predictions on.
Disclaimer: The diagram is a major oversimplification! In particular, the placement of and overlap between the domains in this diagram are simplified and not very good by purist, academic standards. For instance, despite it being an extremely important field of innovation, I excluded reinforcement learning as I was unable to place it without making the figure considerably more complex. Similarly, the other domains do not have demarcations as clear as this figure suggests, and their placement follows my own definition of them. Data science, in my opinion, reflects the diffusion of insights or knowledge from data, particularly the (human) decisions and actions made in that process. Much of data science relies on machine learning, which involves how algorithms learn a model of reality from data, observations, or experiences. This learning can occur in different forms (e.g., supervised, unsupervised, deep, and reinforcement learning) and, unlike in David’s definition, thus does not always output predictions (it may also output, e.g., dimensions or clusters). Finally, machine learning is a specific branch of artificial intelligence, a label that has had many definitions. In my eyes, it includes any (partially) automated process where seemingly intelligent actions are automatically executed based on decision rules. An action can be as simple as a single if-then statement or as complex as a smart fridge ordering new milk. Whether AI is or should be considered a part of data science is food for a different discussion. For much more straightforward definitions of the fields, please consult this slide shared by u/mmcmtl:
If you have any thoughts on how the above diagram and/or blog could or should be improved, feel free to comment below, reach out, or share your own attempts!
The first programs for (scientific) text mining are already over 50 years old. More recent efforts, such as the Linguistic Inquiry and Word Count (LIWC; Tausczik & Pennebaker, 2010), have greatly improved our text analytical capabilities. Moreover, several single-purpose programs have been developed, which also consider syntactic text structures (e.g., Syntactic Complexity Analyzer [Lu, 2010], TAALES [Kyle & Crossley, 2015]). However, the widespread use of many of these programs has been hampered by two major barriers.
First, considerable technical expertise is required, which obstructs researchers without statistical backgrounds. For example, packages such as tm in R (Meyer et al., 2015) have been developed to conduct natural-language processing, but the steep learning curve forms a challenge. Additionally, the constant increase of computational processing power and the proliferation of new algorithms make it difficult for researchers to maintain working knowledge of state-of-the-art methods.
Second, most of the existing user-friendly NLP programs (and packages), such as RapidMiner (Akthar & Hahne, 2012), SAS Text Miner (Abell, 2014), or SPSS Modeler (IBM Corp., 2011), charge either a large software fee up front or a subscription fee. The cost of these programs can be prohibitive for junior researchers and researchers looking to integrate new techniques into their research toolbox.
In the attached article, TACIT is introduced: Text Analysis, Crawling and Investigation Tool. TACIT is an open-source architecture that establishes a pipeline between the various stages of text-based research by integrating tools for text mining, data cleaning, and analysis under a single user-friendly architecture. In addition to being prepackaged with a range of easily applied, cutting-edge methods, TACIT’s design also allows other researchers to write their own plugins.
The authors’ hope is that TACIT can facilitate the integration and use of advancements in computational linguistics in psychological research, and by doing so can help researchers make use of the ever-growing documents of our social discourse in ways that have previously not been possible.