Joel Simon is the genius behind an experimental project exploring optimized school blueprints. Joel used graph-contraction and ant-colony pathing algorithms as growth processes to generate elementary school designs optimized for all kinds of characteristics: walking time, hallway usage, outdoor views, and escape routes, to name just a few.
Identifying “Dirty” Twitter Bots with R and Python
This past week, I came across two programming initiatives to uncover Twitter bots and one attempt to identify fake Instagram accounts.
Mike Kearney developed the R package botornot, which applies machine learning to estimate the probability that a Twitter user is a bot. His default model is a gradient boosted model trained on both user-level (bio, location, number of followers and friends, etc.) and tweet-level information (number of hashtags, mentions, capital letters, etc.). This model is 93.53% accurate when classifying bots and 95.32% accurate when classifying non-bots. His faster model uses only the user-level data and is 91.78% accurate when classifying bots and 92.61% accurate when classifying non-bots. Unfortunately, the models did not classify my account correctly (see below), but you should definitely test yourself and your friends via this Shiny application.
Fun fact: botornot can be integrated with Mike's rtweet package.
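Based on my reading of the package README, calling it looks roughly like the snippet below; the screen names are placeholders, you need valid Twitter API credentials via rtweet, and the exact interface may have changed since:
# ESTIMATE BOT PROBABILITIES FOR A FEW SCREEN NAMES (PLACEHOLDER HANDLES)
# devtools::install_github("mkearney/botornot")
library(botornot)
users <- c("netflix_bot", "some_user", "another_user")
botornot(users)               # DEFAULT MODEL: USER-LEVEL + TWEET-LEVEL FEATURES
botornot(users, fast = TRUE)  # FASTER MODEL: USER-LEVEL FEATURES ONLY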
Scraping Dirty Bots
At around the same time, I read this very interesting blog by Andy Patel. Annoyed by the fake Twitter accounts that kept liking and sharing his tweets, Andy wrote a Python script called pronbot_search. It's an iterative search algorithm which Andy seeded with the dozen fake Twitter accounts he had identified originally. The program then iterated over the friends and followers of each of these fake users, looking for other accounts displaying similar traits (e.g., a similar description, including a URL to a sex website called "Dirty Tinder"). Whenever a new account was discovered, it was added to the query list, and the process continued. Because of the Twitter API restrictions, the whole crawling process literally took days before Andy manually terminated it. The results are just amazing:
Pretty much the same pattern I’d seen after one day of crawling still existed after one week. Just a few of the clusters weren’t “flower” shaped.
Andy Patel, March 2018, link
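Andy's actual script is written in Python, but the idea is easy to approximate in R. Below is a rough sketch of such an iterative crawl using rtweet; the seed handles are placeholders, the trait check is deliberately naive, and API rate limiting is ignored for brevity:
# ROUGH SKETCH OF AN ITERATIVE BOT CRAWL WITH rtweet (NOT ANDY'S ACTUAL SCRIPT)
library(rtweet)
seeds <- c("placeholder_bot_1", "placeholder_bot_2")  # HYPOTHETICAL SEED ACCOUNTS
queue <- seeds
found <- character(0)
# SIMPLISTIC TRAIT CHECK: A "DIRTY TINDER" MENTION IN THE PROFILE DESCRIPTION
looks_fake <- function(users_df) {
  !is.na(users_df$description) & grepl("dirty tinder", tolower(users_df$description))
}
while (length(queue) > 0) {
  current <- queue[1]
  queue <- queue[-1]
  found <- union(found, current)
  # COLLECT THE FRIENDS AND FOLLOWERS OF THE CURRENT ACCOUNT
  ids <- c(get_friends(current)$user_id, get_followers(current)$user_id)
  candidates <- lookup_users(unique(ids))
  # ACCOUNTS SHOWING THE SAME TRAITS JOIN THE QUERY LIST
  new_hits <- candidates$screen_name[looks_fake(candidates)]
  queue <- union(queue, setdiff(new_hits, found))
}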
Fake Instagram Accounts
Finally, SRFdata (Timo Grossenbacher) used R to attempt to uncover fake Instagram followers among the 7 million accounts in the network of 115 important Swiss Instagram influencers. Magi Metrics was used to retrieve information for public Instagram accounts and rvest for private accounts. Next, obvious fake accounts (e.g., few followers, following many, no posts, no profile picture, numbers in the name) were labelled manually, and approximately 10% of the 1,000 inspected accounts appeared fake. Finally, they trained a random forest model to classify fake accounts with a sensitivity (true positive rate) of 77.4% and an overall accuracy of around 94%.
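The snippet below is only a minimal sketch of that final modelling step, assuming a manually labelled data frame labelled_accounts with hypothetical feature names; the actual SRF analysis is more involved:
# MINIMAL SKETCH OF THE CLASSIFICATION STEP (FEATURE NAMES ARE HYPOTHETICAL)
library(randomForest)
# labelled_accounts: ONE ROW PER INSPECTED ACCOUNT, WITH A MANUAL fake LABEL (ASSUMED TO EXIST)
labelled_accounts$fake <- factor(labelled_accounts$fake)
rf <- randomForest(
  fake ~ followers + following + posts + has_profile_picture + digits_in_name,
  data = labelled_accounts,
  ntree = 500
)
# CONFUSION MATRIX ON THE OUT-OF-BAG PREDICTIONS GIVES SENSITIVITY AND ACCURACY
table(predicted = predict(rf), actual = labelled_accounts$fake)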
Network Visualization with igraph and ggraph
Eiko Fried, a researcher at the University of Amsterdam, recently blogged about personal collaborator networks. I came across his post on Twitter, discussing how to conduct such an analysis in R, and got inspired.
Unfortunately, my own publication record is quite boring to analyse, containing only a handful of papers. However, my promotors – Prof. dr. Jaap Paauwe and Prof. dr. Marc van Veldhoven – have more extensive publication lists. Although I did not manage to retrieve those using the scholar package, I was able to scrape Jaap Paauwe's publication list from his Google Scholar page. Jaap has 141 publications listed with one or more citations on Google Scholar. More than enough for an analysis!
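For reference, the scholar-package route would look roughly like the snippet below; the profile ID is a placeholder, and this is exactly the approach that did not work for me here:
# THE scholar-PACKAGE ROUTE (DID NOT WORK IN MY CASE; THE ID IS A PLACEHOLDER)
library(scholar)
scholar_id <- "XXXXXXXXXXXX"          # GOOGLE SCHOLAR PROFILE ID (PLACEHOLDER)
pubs <- get_publications(scholar_id)  # ONE ROW PER PUBLICATION, INCL. AUTHOR STRING
head(pubs[, c("title", "author", "cites")])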
While Eiko uses his colleague Sacha Epskamp's R package qgraph, I found an alternative in the packages igraph and ggraph.
### PAUL VAN DER LAKEN
### 2017-10-31
### COAUTHORSHIP NETWORK VISUALIZATION
# LOAD IN PACKAGES
library(readxl)
library(dplyr)
library(ggraph)
library(igraph)
# STANDARDIZE VISUALIZATIONS
w = 14
h = 7
dpi = 900
# LOAD IN DATA
pub_history <- read_excel("paauwe_wos.xlsx")
# RETRIEVE AUTHORS
pub_history %>%
  filter(condition == 1) %>%                      # KEEP ONLY ROWS FLAGGED WITH condition == 1
  select(name) %>%
  .$name %>%                                      # EXTRACT THE RAW AUTHOR STRINGS
  gsub("[A-Z]{2,}|[A-Z][ ]", "", .) %>%           # STRIP INITIALS AND ALL-CAPS ABBREVIATIONS
  strsplit(",") %>%                               # SPLIT AUTHOR STRINGS ON COMMAS
  lapply(function(x) gsub("\\..*", "", x)) %>%    # DROP ANYTHING AFTER A PERIOD
  lapply(function(x) gsub("^[ ]+", "", x)) %>%    # TRIM LEADING WHITESPACE
  lapply(function(x) x[x != ""]) %>%              # REMOVE EMPTY ELEMENTS
  lapply(function(x) tolower(x)) ->
  authors
# ADD JAAP PAAUWE WHERE MISSING
authors <- lapply(authors, function(x) {
  if (!"paauwe" %in% x) {
    return(c(x, "paauwe"))
  } else {
    return(x)
  }
})
# EXTRACT UNIQUE AUTHORS
authors_unique <- authors %>% unlist() %>% unique() %>% sort(F)
# FORMAT AUTHOR NAMES
# CAPITALIZE
simpleCap <- function(x) {
  s <- strsplit(x, " ")[[1]]
  names(s) <- NULL
  paste(toupper(substring(s, 1, 1)), substring(s, 2),
        sep = "", collapse = " ")
}
authors_unique_names <- sapply(authors_unique, simpleCap)
The code above retrieves the names of every unique author from the Excel file I got from Google Scholar. Now we need to examine to what extent the author names co-occur. We do that with the code below, storing all co-occurrence data in a matrix, which we then transform into an adjacency matrix that igraph can deal with. The output graph data looks like this:
# CREATE COAUTHORSHIP MATRIX
# ONE COLUMN PER PUBLICATION, ONE ROW PER UNIQUE AUTHOR (1 = AUTHOR ON THAT PAPER)
coauthorMatrix <- do.call(
  cbind,
  lapply(authors, function(x) {
    1 * (authors_unique %in% x)
  }))
# TRANSFORM TO ADJACENCY MATRIX: CELL (i, j) COUNTS THE PUBLICATIONS AUTHORS i AND j SHARE
adjacencyMatrix <- coauthorMatrix %*% t(coauthorMatrix)
# CREATE NETWORK GRAPH
g <- graph.adjacency(adjacencyMatrix,
                     mode = "undirected",
                     diag = FALSE)
V(g)$Degree <- degree(g, mode = 'in') # CALCULATE DEGREE
V(g)$Name <- authors_unique_names # ADD NAMES
g # print network
## IGRAPH f1b50a7 U--- 168 631 --
## + attr: Degree (v/n), Name (v/c)
## + edges from f1b50a7:
##  [1]  1-- 21  1--106  2-- 44  2-- 52  2--106  2--110  3-- 73  3--106
##  [9]  4-- 43  4-- 61  4-- 78  4-- 84  4--106  5-- 42  5--106  6-- 42
## [17]  6-- 42  6-- 97  6-- 97  6--106  6--106  6--125  6--125  6--127
## [25]  6--127  6--129  6--129  7--106  7--106  7--150  7--150  8-- 24
## [33]  8-- 38  8-- 79  8-- 98  8-- 99  8--106  9-- 88  9--106  9--133
## [41] 10-- 57 10--106 10--128 11-- 76 11-- 85 11--106 12-- 30 12-- 80
## [49] 12--106 12--142 12--163 13-- 16 13-- 16 13-- 22 13-- 36 13-- 36
## [57] 13--106 13--106 13--106 13--166 14-- 70 14-- 94 14--106 14--114
## + ... omitted several edges
This graph data we can now feed into ggraph:
# SET THEME FOR NETWORK VISUALIZATION
theme_networkMap <- theme(
  plot.background = element_rect(fill = "beige"),
  panel.border = element_blank(),
  panel.grid = element_blank(),
  panel.background = element_blank(),
  legend.background = element_blank(),
  legend.position = "none",
  legend.title = element_text(colour = "black"),
  legend.text = element_text(colour = "black"),
  legend.key = element_blank(),
  axis.text = element_blank(),
  axis.title = element_blank(),
  axis.ticks = element_blank()
)
# VISUALIZE NETWORK
ggraph(g, layout = "auto") +
  # geom_edge_density() +
  geom_edge_diagonal(alpha = 1, label_colour = "blue") +
  geom_node_label(aes(label = Name, size = sqrt(Degree), fill = sqrt(Degree))) +
  theme_networkMap +
  scale_fill_gradient(high = "blue", low = "lightblue") +
  labs(title = "Coauthorship Network of Jaap Paauwe",
       subtitle = "Publications with more than one Google Scholar citation included",
       caption = "paulvanderlaken.com")
# SAVE THE LAST PLOT TO DISK
ggsave("Paauwe_Coauthorship_Network.png", dpi = dpi, width = w, height = h)
Feel free to use the code to look at your own coauthorship networks or to share this further.
Datasets to practice and learn Programming, Machine Learning, and Data Science
Many requests have come in for "training datasets" to practice programming with. Fortunately, the internet is full of open-source datasets! I compiled a selected list of datasets and repositories below. If you have any additions, please comment or contact me! For information on programming languages or algorithms, visit the overviews for R, Python, SQL, or Data Science, Machine Learning, & Statistics resources.
This list is no longer being maintained. Other, more frequently updated repositories of useful datasets are included in bold below: