Evolving Floorplans – by Joel Simon

Joel Simon is the genius behind an experimental project exploring optimized school blueprints. Joel used graph-contraction and ant-colony pathing algorithms as growth processes, which could generate elementary school designs optimized for all kinds of characteristics: walking time, hallway usage, outdoor views, and escape routes just to name a few.

Two generated designs, minimizing the traffic flow (left) as well as escape routes (right) [original]
Other designs tried to maximize the number of windows, resulting in seemingly random open courtyards [original]

The original floor plan [original]
Definitely check out the original write-up if you are interested in the details behind the generation process! Or have a look at some of Joel’s other projects.

Identifying “Dirty” Twitter Bots with R and Python

This past week, I came across two programming initiatives to uncover Twitter bots and one attempt to identify fake Instagram accounts.

Mike Kearney developed the R package botornot, which applies machine learning to estimate the probability that a Twitter user is a bot. His default model is a gradient boosted model trained on both user-level information (bio, location, number of followers and friends, etc.) and tweet-level information (number of hashtags, mentions, capital letters, etc.). This model is 93.53% accurate when classifying bots and 95.32% accurate when classifying non-bots. His faster model uses only the user-level data and is 91.78% accurate when classifying bots and 92.61% accurate when classifying non-bots. Unfortunately, the models did not classify my account correctly (see below), but you should definitely test yourself and your friends via this Shiny application.

Fun fact: botornot can be integrated with Mike’s rtweet package

Scraping Dirty Bots

At around the same time, I read this very interesting blog post by Andy Patel. Annoyed by the fake Twitter accounts that kept liking and sharing his tweets, Andy wrote a Python script called pronbot_search. It is an iterative search algorithm, which Andy seeded with the dozen fake Twitter accounts he had originally identified. The program then iterated over the friends and followers of each of these fake users, looking for other accounts displaying similar traits (e.g., a similar description, including a URL to a sex website called “Dirty Tinder”).

Whenever a new account was discovered, it was added to the query list, and the process continued. Because of Twitter API restrictions, the crawl ran for literally days before Andy manually terminated it. The results are just amazing:

After a day, the results looked like so. Notice the weird clusters of relationships in this network. [original]
The full bot network uncovered by Andy included 22,000 fake Twitter accounts:

At the end of the weekend of March 10th, Andy had to stop the scraper after it had run for several days, even though it had only processed 18% of the networks of the 22,000 Twitter bots included [original]
The bot network on Twitter is probably enormous! Zooming in on the network, Andy notes that:

Pretty much the same pattern I’d seen after one day of crawling still existed after one week. Just a few of the clusters weren’t “flower” shaped.

Andy Patel, March 2018, link

Zoomed in on a specific part of the network, you can see the separate clusters of bots doing little more than liking each other’s messages. [original]
In his blog post, Andy goes on to examine all kinds of data on these fake accounts. What I found most striking is that many of these accounts are already years old. Potentially, Twitter could use Mike Kearney’s botornot application to spot and remove them!

Most of the bots in the Dirty Tinder network found by Andy Patel were 3 to 8 years old already. [original]
Andy was nice enough to share the data on these bot accounts here, for you to play with. His Python code is stored in the same GitHub repo, and you can read more details about this project in his original blog post.
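
The iterative search described above is essentially a breadth-first crawl that only expands accounts passing a trait check. Below is a minimal base R sketch of that idea on a made-up toy network. It is not Andy's Python script: the accounts, the `toy_network` list, and the `looks_like_bot()` check are all hypothetical stand-ins, and no Twitter API calls are made.

```r
# Toy network: each name is an account, each value lists its friends/followers.
# All accounts here are invented for illustration.
toy_network <- list(
  bot_a   = c("bot_b", "human_1"),
  bot_b   = c("bot_a", "bot_c"),
  bot_c   = c("bot_b", "human_2"),
  human_1 = c("bot_a", "human_2"),
  human_2 = c("human_1")
)

# Hypothetical trait check (a real one might look for a suspicious URL in the bio).
looks_like_bot <- function(account) grepl("^bot_", account)

crawl_bots <- function(seeds, network, is_bot) {
  found <- seeds  # accounts flagged so far
  queue <- seeds  # flagged accounts whose connections still need checking
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    for (neighbour in network[[current]]) {
      # Only add accounts that are new and pass the trait check
      if (!(neighbour %in% found) && is_bot(neighbour)) {
        found <- c(found, neighbour)
        queue <- c(queue, neighbour)
      }
    }
  }
  found
}

bots <- crawl_bots("bot_a", toy_network, looks_like_bot)
print(bots)
```

Because every newly flagged account is appended to the queue, the search keeps growing for as long as it finds new matches, which helps explain why a rate-limited crawl of a large network can run for days.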

Fake Instagram Accounts

Finally, SRFdata (Timo Grossenbacher) used R to try to uncover fake Instagram followers among the 7 million followers in the network of 115 important Swiss Instagram influencers. Magi Metrics was used to retrieve information for public Instagram accounts and rvest for private accounts. Next, clearly fake accounts (e.g., few followers, following many, no posts, no profile picture, numbers in name) were labelled manually, and approximately 10% of the 1,000 inspected accounts appeared fake. Finally, they trained a random forest model to classify fake accounts with a sensitivity (true positive rate) of 77.4% and an overall accuracy of around 94%.
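
Sensitivity and overall accuracy measure different things, which matters when only about one in ten accounts is fake. The base R sketch below computes both from a confusion matrix. The counts are hypothetical and chosen only for illustration; they are not the SRF Data results.

```r
# Hypothetical counts for 1,000 hand-labelled accounts (illustration only)
tp <- 80   # fake accounts correctly flagged as fake
fn <- 20   # fake accounts the model missed
tn <- 880  # real accounts correctly left alone
fp <- 20   # real accounts wrongly flagged as fake

sensitivity <- tp / (tp + fn)                # share of fake accounts caught
specificity <- tn / (tn + fp)                # share of real accounts passed
accuracy    <- (tp + tn) / (tp + fn + tn + fp)

print(round(c(sensitivity = sensitivity,
              specificity = specificity,
              accuracy    = accuracy), 3))
```

With imbalanced classes like these, accuracy is dominated by the large group of real accounts, so it can sit well above the sensitivity.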

Network Visualization with igraph and ggraph

Eiko Fried, researcher at the University of Amsterdam, recently blogged about personal collaborator networks. I came across his post on Twitter, which discusses how to conduct such an analysis in R, and got inspired.

Unfortunately, my own publication record is quite boring to analyse, containing only a handful of papers. However, my promotors – Prof. dr. Jaap Paauwe and Prof. dr. Marc van Veldhoven – have more extensive publication lists. Although I did not manage to retrieve those using the scholar package, I was able to scrape Jaap Paauwe’s publication list from his Google Scholar page. Jaap has 141 publications listed with one or more citations on Google Scholar. More than enough for an analysis!

While Eiko uses his colleague Sacha Epskamp’s R package qgraph, I found an alternative in the packages igraph and ggraph.

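### 2017-10-31

# Load the packages used throughout this script
library(readxl)
library(dplyr)
library(igraph)
library(ggraph)

# Plot dimensions (inches) and resolution for the saved image
w = 14
h = 7
dpi = 900

pub_history <- read_excel("paauwe_wos.xlsx")

# Clean the author strings: drop initials, split on commas, trim, lowercase
pub_history %>%
  filter(condition == 1) %>%
  select(name) %>%
  .$name %>%
  gsub("[A-Z]{2,}|[A-Z][ ]", "", .) %>%
  strsplit(",") %>%
  lapply(function(x) gsub("\\..*", "", x)) %>%
  lapply(function(x) gsub("^[ ]+","",x)) %>%
  lapply(function(x) x[x != ""]) %>%
  lapply(function(x) tolower(x))->
  authors

# Add Jaap Paauwe to any publication where his name is missing
authors <- lapply(authors, function(x){
  if(!"paauwe" %in% x){
    return(c(x, "paauwe"))
  } else{
    return(x)
  }
})

# Extract the unique author names, sorted alphabetically
authors_unique <- authors %>% unlist() %>% unique() %>% sort(F)

# Capitalize the first letter of each word in a name
simpleCap <- function(x) {
  s <- strsplit(x, " ")[[1]]
  names(s) <- NULL
  paste(toupper(substring(s, 1,1)), substring(s, 2),
        sep="", collapse=" ")
}
authors_unique_names <- sapply(authors_unique, simpleCap)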
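
To see what the cleaning steps do to a single record, here is the same sequence of substitutions applied step by step in base R. The input string is a made-up example of an author field, assuming initials come before surnames and long lists are truncated with dots.

```r
# Hypothetical author string (illustration only)
raw <- "J Paauwe, AB Smith, C Jones..."

no_initials <- gsub("[A-Z]{2,}|[A-Z][ ]", "", raw)  # drop initials such as "J " and "AB"
parts <- strsplit(no_initials, ",")[[1]]            # one element per author
parts <- gsub("\\..*", "", parts)                   # cut everything from the first period
parts <- gsub("^[ ]+", "", parts)                   # trim leading spaces
parts <- parts[parts != ""]                         # drop empty leftovers
cleaned <- tolower(parts)

print(cleaned)
```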

The code above retrieves the names of every unique author from the Excel file I got from Google Scholar. Now we need to examine to what extent the author names co-occur. We do that with the code below, storing all co-occurrence data in a matrix, which we then transform into an adjacency matrix that igraph can deal with. The output graph data looks like this:

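# Build a binary matrix: one row per unique author, one column per publication
coauthorMatrix <- do.call(
  cbind,
  lapply(authors, function(x){
  1*(authors_unique %in% x)
}))

# Multiply by the transpose to count shared publications per author pair
adjacencyMatrix <- coauthorMatrix %*% t(coauthorMatrix)

# graph_from_adjacency_matrix() replaces the deprecated graph.adjacency()
g <- graph_from_adjacency_matrix(adjacencyMatrix, 
                                 mode = "undirected", 
                                 diag = FALSE)
V(g)$Degree <- degree(g, mode = 'in') # CALCULATE DEGREE
V(g)$Name <- authors_unique_names # ADD NAMES
g # print network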
## IGRAPH f1b50a7 U--- 168 631 -- 
## + attr: Degree (v/n), Name (v/c)
## + edges from f1b50a7:
##  [1]  1-- 21  1--106  2-- 44  2-- 52  2--106  2--110  3-- 73  3--106
##  [9]  4-- 43  4-- 61  4-- 78  4-- 84  4--106  5-- 42  5--106  6-- 42
## [17]  6-- 42  6-- 97  6-- 97  6--106  6--106  6--125  6--125  6--127
## [25]  6--127  6--129  6--129  7--106  7--106  7--150  7--150  8-- 24
## [33]  8-- 38  8-- 79  8-- 98  8-- 99  8--106  9-- 88  9--106  9--133
## [41] 10-- 57 10--106 10--128 11-- 76 11-- 85 11--106 12-- 30 12-- 80
## [49] 12--106 12--142 12--163 13-- 16 13-- 16 13-- 22 13-- 36 13-- 36
## [57] 13--106 13--106 13--106 13--166 14-- 70 14-- 94 14--106 14--114
## + ... omitted several edges
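
The matrix step can be checked by hand on a tiny example. The base R sketch below uses three made-up publications with invented author names. Multiplying the author-by-publication matrix by its transpose gives the number of shared publications for every pair of authors.

```r
# Three hypothetical publications with invented author names
toy_authors <- list(
  c("adams", "baker"),
  c("adams", "baker", "clark"),
  c("adams", "davis")
)
toy_unique <- sort(unique(unlist(toy_authors)))

# Rows are authors, columns are publications (1 = author is on the paper)
incidence <- do.call(cbind, lapply(toy_authors, function(x) 1 * (toy_unique %in% x)))
rownames(incidence) <- toy_unique

# Off-diagonal cells count shared papers; the diagonal counts papers per author
adjacency <- incidence %*% t(incidence)
print(adjacency)
```

Here `adams` and `baker` share two papers, so their cell equals 2. When such a matrix is turned into an unweighted graph, a count above 1 becomes multiple parallel edges, which is presumably why some pairs appear more than once in the edge list printed above.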

This graph data we can now feed into ggraph:

theme_networkMap <- theme(
  plot.background = element_rect(fill = "beige"),
  panel.border = element_blank(),
  panel.grid = element_blank(),
  panel.background = element_blank(),
  legend.background = element_blank(),
  legend.position = "none",
  legend.title = element_text(colour = "black"),
  legend.text = element_text(colour = "black"),
  legend.key = element_blank(),
  axis.text = element_blank(), 
  axis.title = element_blank(),
  axis.ticks = element_blank()
)

network_plot <- ggraph(g, layout = "auto") +
  # geom_edge_density() +
  geom_edge_diagonal(alpha = 1, label_colour = "blue") +
  geom_node_label(aes(label = Name, size = sqrt(Degree), fill = sqrt(Degree))) +
  theme_networkMap +
  scale_fill_gradient(high = "blue", low = "lightblue") +
  labs(title = "Coauthorship Network of Jaap Paauwe",
       subtitle = "Publications with more than one Google Scholar citation included",
       caption = "paulvanderlaken.com")

# Save the plot; ggsave() is called on its own rather than added with `+`
ggsave("Paauwe_Coauthorship_Network.png", plot = network_plot,
       dpi = dpi, width = w, height = h)


Feel free to use the code to look at your own coauthorship networks or to share this further.

Datasets to practice and learn Programming, Machine Learning, and Data Science

Many requests have come in regarding “training datasets” to practice programming. Fortunately, the internet is full of open-source datasets! I compiled a selected list of datasets and repositories below. If you have any additions, please comment or contact me! For information on programming languages or algorithms, visit the overviews for R, Python, SQL, or Data Science, Machine Learning, & Statistics resources.

This list is no longer being maintained. There are other, more frequently updated repositories of useful datasets included in bold below:

LAST UPDATED: 2019-12-23
A Million News Headlines: News headlines published over a period of 14 years.
AggData | Datasets
Aligned Hansards of the 36th Parliament of Canada
Amazon Web Services: Public Datasets
American Community Survey
ArcGIS Hub Open Data
arXiv.org help – arXiv Bulk Data Access – Amazon S3
Asset Macro: Financial & Macroeconomic Historical Data
Awesome JSON Datasets
Awesome Public Datasets
Behavioral Risk Factor Surveillance System
British Oceanographic Data Center
Bureau of Justice
Causality | Data Repository
CDC Wonder Online Database
Census Bureau Home Page
Center for Disease Control
City of Chicago
Click Dataset | Center for Complex Networks and Systems Research
CommonCrawl 2013 Web Crawl
Consumer Finance: Mortgage Database
CRCNS – Collaborative Research in Computational Neuroscience
Data Download
Data is Plural
Data.Seattle.Gov | Seattle’s Data Site
Data.World datasets
Datasets for Data Mining
DELVE datasets
DMOZ open directory (mirror)
Enigma Public
Enron Email Dataset
European Environment Agency (EEA) | Data and maps
Eurostat Database
Eurovision YouTube Comments: YouTube comments on entries from the 2003-2008 Eurovision Song Contests
FAA Data
Face Recognition Homepage – Databases
FBI Crime Data Explorer
FEMA Data Feeds
Flickr personal taxonomies
Fraudulent E-mail Corpus: CLAIR collection of “Nigerian” fraud emails
Freebase (last datadump)
Gene Expression Omnibus (GEO) Main page
GeoJSON files for real-time Virginia transportation data.
Golem Dataset
Google Books n-gram dataset
Google Public Data Explorer
Google Research: A Web Research Corpus Annotated with Freebase Concepts
Health Intelligence
Healthcare Cost and Utilization Project
Human Fertility Database
Human Mortality Database
ICPSR Social Science Studies
ICWSM Spinnr Challenge 2011 dataset
IIE.org Open Doors Data Portal
IMDB dataset
IMF Data and Statistics
Informatics Lab Open Data
Inside AirBnB
Internet Archive: Digital Library
Ironic Corpus: 1950 sentences labeled for ironic content
Kaggle Datasets
KAPSARC Energy Data Portal
KDNuggets Datasets
Lahman’s Baseball Database
Lending Club Loan Data
Linking Open Data
London Datastore
Makeover Monday
Medical Expenditure Panel Survey
Million Song Dataset | scaling MIR research
MLDATA | Machine Learning Dataset Repository
MLvis Scientific Data Repository
MovieLens Data Sets | GroupLens Research
NASA Earth Data
National Health and Nutrition Examination Survey
National Hospital Ambulatory Medical Care Survey Data
New York State
NYPD Crash Data Band-Aid
ODI Leeds
Office for National Statistics
Old Newspapers: A cleaned subset of HC Corpora newspapers
Open Data Inception Portals
Open Data Nederland
Open Data Network
OpenDataSoft Repository
Our World in Data
Pajek datasets
PermID from Thomson Reuters
Pew Research Center
Princeton University Library
Project Gutenberg
Reddit Datasets
Registry of Research Data Repositories
Satori OpenData
SCOTUS Opinions Corpus: Lots of Big, Important Words
Sharing PyPi/Maven dependency data « RTFB
SMS Spam Collection
St. Louis Federal Reserve
Stanford Large Network Dataset Collection
State of the Nation Corpus (1990 – 2017): Full texts of the South African State of the Nation addresses
Substance Abuse and Mental Health Services Administration 
Swiss Open Government Data
Tableau Public
The Association of Religious Data Archives
The Economist
The General Social Survey
The Huntington’s Early California Population Project
The World Bank | Data
The World Bank Data Catalog
Toronto Open Data
Translation Task Data
Transport for London
Twitter Data 2010
Ubuntu Dialogue Corpus: 26 million turns from natural two-person dialogues
UC Irvine Knowledge Discovery in Databases Archive
UC Irvine Machine Learning Repository
UC Irvine Network Data Repository
UN Comtrade Database
UN General Debates: Transcriptions of general debates at the UN from 1970 to 2016
Uniform Crime Reporting
United States Exam Data
University of Michigan ICPSR
University of Rochester LibGuide “Data-Stats”
US Bureau of Labor Statistics
US Census Bureau Data
US Energy Information Administration
US Government Web Services and XML Data Sources
USA Facts
USENET corpus (2005-2011)
Utah Open Data
Varieties of Democracy
Western Pennsylvania Regional Data Center
WHO Data Repository
Wikipedia List of Datasets for Machine Learning
World Values Survey
World Wealth & Income Database
World Wide Web: 3.5 billion web pages and their relations
Yahoo Data for Researchers
YouTube Network 2007-2008