In optimizing their transportation services, Uber uses evolutionary strategies and genetic algorithms to train deep neural networks through reinforcement learning. A lot of difficult words in one sentence; you can imagine the complexity of the process.
Because it is particularly difficult to observe the underlying dynamics of this learning process in neural network optimization, Uber built VINE – a Visual Inspector for NeuroEvolution. VINE helps to discover how evolutionary strategies and genetic optimizing are performing under the hood. In a recent article, they demonstrate how VINE works on the MujocoHumanoid Locomotion task.
[…] In the Humanoid Locomotion Task, each pseudo-offspring neural network controls the movement of a robot, and earns a score, called its fitness, based on how well it walks. [Evolutionary principles] construct the next parent by aggregating the parameters of pseudo-offspring based on these fitness scores […]. The cycle then repeats.
In the video below, one of my favorite YouTube channels (Two Minute Papers) discusses a new super resolution project where academic scholars taught a neural network to improve low quality photo’s. The researchers took the same picture with multiple camera’s of varying quality and allowed a neural network to learn how the lowest quality pictures can be adjusted to more closely resemble their high quality counterparts. A very interesting approach and the results are just mind-boggling:
Professor John Kruschke and Torrin Liddell – one of his Ph.D. students at Indiana University – wrote a fantastically useful scientific paper introducing Bayesian data analysis to the masses. Kruschke and Liddell explain the main ideas behind Bayesian statistics, how Bayesians deal with continuous and binary variables, how to use and set meaningful priors, the differences between confidence and credibility intervals, how to perform model comparison tests, and many more. The paper is published open access so you can read it here.
I found it incredibly useful, providing me with a better understanding of how Bayesian analysis works, what kind of questions you can answer with it, and what the resulting insights would comprise of. After reading it, I was honestly asking myself why I don’t use Bayesian methods more often… So what’s next, how to learn more?
Kaggle conducts industry-wide surveys to assess the state of data science and machine learning. Over 17,000 individuals worldwide participated in the survey, myself included, and 171 countries and territories are represented in the data.
There is an ongoing debate regarding whether R or Python is better suited for Data Science (probably the latter, but I nevertheless prefer the former). The thousands of responses to the Kaggle survey may provide some insights into how the preferences for each of these languages are dispersed over the globe. At least, that was what I thought when I wrote the code below.
### PAUL VAN DER LAKEN### 2017-10-31### KAGGLE DATA SCIENCE SURVEY### VISUALIZING WORLD WIDE RESPONSES### AND PYTHON/R PREFERENCES# LOAD IN LIBRARIESlibrary(ggplot2)library(dplyr)library(tidyr)library(tibble)# OPTIONS & STANDARDIZATIONoptions(stringsAsFactors=F)theme_set(theme_light())dpi=600w=12h=8wm_cor=0.8hm_cor=0.8capt="Kaggle Data Science Survey 2017 by paulvanderlaken.com"# READ IN KAGGLE DATAmc<-read.csv("multipleChoiceResponses.csv")%>%as.tibble()# READ IN WORLDMAP DATAworldMap<-map_data(map="world")%>%as.tibble()# ALIGN KAGGLE AND WORLDMAP COUNTRY NAMESmc$Country[!mc$Country%in%worldMap$region]%>%unique()worldMap$region%>%unique()%>%sort(F)mc$Country[mc$Country=="United States"]<-"USA"mc$Country[mc$Country=="United Kingdom"]<-"UK"mc$Country[grepl("China|Hong Kong", mc$Country)]<-"China"# CLEAN UP KAGGLE DATAlvls=c("","Rarely", "Sometimes", "Often", "Most of the time")labels=c("NA", lvls[-1])ind_data<-mc%>%select(Country, WorkToolsFrequencyR, WorkToolsFrequencyPython)%>%mutate(WorkToolsFrequencyR=factor(WorkToolsFrequencyR,
levels=lvls, labels=labels))%>%filter(!(Country==""|is.na(WorkToolsFrequencyR)|is.na(WorkToolsFrequencyPython)))# AGGREGATE TO COUNTRY LEVELcountry_data<-ind_data%>%group_by(Country)%>%summarize(N=n(),
Python=sum(WorkToolsFrequencyPython%>%as.numeric()))# CREATE THEME FOR WORLDMAP PLOTtheme_worldMap<-theme(plot.background=element_rect(fill="white"),
After aligning some country names (above), I was able to start visualizing the results. A first step was to look at the responses across the globe. The greener the more responses and the grey countries were not represented in the dataset. A nice addition would have been to look at the response rate relative to country population.. any volunteers?
Now, let’s look at how frequently respondents use Python and R in their daily work. I created two heatmaps: one excluding the majority of respondents who indicated not using either Python or R, probably because they didn’t complete the survey.
# AGGREGATE DATA TO WORKTOOL RESPONSESworktool_data<-ind_data%>%group_by(WorkToolsFrequencyR, WorkToolsFrequencyPython)%>%count()# HEATMAP OF PREFERRED WORKTOOLSggplot(worktool_data, aes(x=WorkToolsFrequencyR, y=WorkToolsFrequencyPython))+geom_tile(aes(fill=log(n)))+geom_text(aes(label=n), col="black")+scale_fill_gradient(low="red", high="yellow")+labs(title="Heatmap of Python and R usage",
subtitle="Most respondents indicate not using Python or R (or did not complete the survey)",
# HEATMAP OF PREFERRED WORKTOOLS# EXCLUSING DOUBLE NA'Sworktool_data%>%filter(!(WorkToolsFrequencyPython=="NA"&WorkToolsFrequencyR=="NA"))%>%ungroup()%>%mutate(perc=n/sum(n))%>%ggplot(aes(x=WorkToolsFrequencyR, y=WorkToolsFrequencyPython))+geom_tile(aes(fill=n))+geom_text(aes(label=paste0(round(perc,3)*100,"%")), col="black")+scale_fill_gradient(low="red", high="yellow")+labs(title="Heatmap of Python and R usage (non-users excluded)",
subtitle="There is a strong reliance on Python and less users focus solely on R",
Okay, now let’s map these frequency data on a worldmap. Because I’m interested in the country level differences in usage, I look at the relative usage of Python compared to R. So the redder the country, the more Python is used by Data Scientists in their workflow whereas R is the preferred tool in the bluer countries. Interesting to see, there is no country where respondents really use R much more than Python.
# WORLDMAP OF RELATIVE WORKTOOL PREFERENCEggplot(country_data)+geom_map(data=worldMap,
aes(map_id=region, x=long, y=lat),
map=worldMap, fill="grey")+geom_map(aes(map_id=Country, fill=Python/R),
map=worldMap, size=0.3)+scale_fill_gradient(low="blue", high="red", name="Python/R")+theme_worldMap+labs(title="Relative usage of Python to R per country",
subtitle="Focus on Python in Russia, Israel, Japan, Ukraine, China, Norway & Belarus",
Thank you for reading my visualization report. Please do try and extract some other interesting insights from the data yourself.
Max Woolf writes machine learning blogs on his personal blog, minimaxir, and posts open-source code repositories on his GitHub. He is a former Apple Software QA Engineer and graduated from Carnegie Mellon University. I have published his work before, for instance, this short ggplot2 tutorial by MiniMaxir, but his new project really amazed me.
Max developed a Facebook web scaper in Python. This tool gathers all the posts and comments of Facebook Pages (or Open Facebook Groups) and the related metadata, including post message, post links, and counts of each reaction on the post. The data is then exported to a CSV file, which can be imported into any data analysis program like Excel, or R.
Max put his scraper to work and gathered a ton of publicly available Facebook posts and their metadata between 2016 and 2017.