Tag: network

Datasets to practice and learn Programming, Machine Learning, and Data Science

Datasets to practice and learn Programming, Machine Learning, and Data Science

Many requests have come in regarding “training datasets” – to practice programming. Fortunately, the internet is full of open-source datasets! I compiled a selected list of datasets and repositories below. If you have any additions, please comment or contact me! For information on programming languages or algorithms, visit the overviews for RPython, SQL, or Data Science, Machine Learning, & Statistics resources.

This list is no longer being maintained. There are other, more frequently updated repositories of useful datasets included in bold below:

LAST UPDATED: 2019-12-23
A Million News Headlines: News headlines published over a period of 14 years.
AggData | Datasets
Aligned Hansards of the 36th Parliament of Canada
Amazon Web Services: Public Datasets
American Community Survey
ArcGIS Hub Open Data
arXiv.org help – arXiv Bulk Data Access – Amazon S3
Asset Macro: Financial & Macroeconomic Historical Data
Awesome JSON Datasets
Awesome Public Datasets
Behavioral Risk Factor Surveillance System
British Oceanographic Data Center
Bureau of Justice
Canada
Causality | Data Repository
CDC Wonder Online Database
Census Bureau Home Page
Center for Disease Control
ChEMBLdb
ChemDB
City of Chicago
Click Dataset | Center for Complex Networks and Systems Research
CommonCrawl 2013 Web Crawl
Consumer Finance: Mortgage Database
CRCNS – Collaborative Research in Computational Neuroscience
Data Download
Data is Plural
Data.gov
Data.gov.au
Data.gov.nz
Data.gov.sg
Data.gov.uk
Data.Seattle.Gov | Seattle’s Data Site
Data.world
Data.World datasets
DataHub
Datasets for Data Mining
DataSF
Dataverse
DELVE datasets
DMOZ open directory (mirror)
DRYAD
Enigma Public
Enron Email Dataset
European Environment Agency (EEA) | Data and maps
Eurostat
Eurostat Database
Eurovision YouTube Comments: YouTube comments on entries from the 2003-2008 Eurovision Song Contests
FAA Data
Face Recognition Homepage – Databases
FAOSTAT Data
FBI Crime Data Explorer
FEMA Data Feeds
Figshare
FiveThirthyEight.com
Flickr personal taxonomies
FlowingData
Fraudulent E-mail Corpus: CLAIR collection of “Nigerian” fraud emails
Freebase (last datadump)
Gapminder.org
Gene Expression Omnibus (GEO) Main page
GeoJSON files for real-time Virginia transportation data.
Golem Dataset
Google Books n-gram dataset
Google Public Data Explorer
Google Research: A Web Research Corpus Annotated with Freebase Concepts
Health Intelligence
Healthcare Cost and Utilization Project
HealthData.gov
Human Fertility Database
Human Mortality Database
ICPRS Social Science Studies 
ICWSM Spinnr Challenge 2011 dataset
IIE.org Open Doors Data Portal
ImageNet
IMDB dataset
IMF Data and Statistics
Informatics Lab Open Data
Inside AirBnB
Internet Archive: Digital Library
IPUMS
Ironic Corpus: 1950 sentences labeled for ironic content
Kaggle Datasets
KAPSARC Energy Data Portal
KDNuggets Datasets
Knoema
Lahman’s Baseball Database
Lending Club Loan Data
Linking Open Data
London Datastore
Makeover Monday
Medical Expenditure Panel Survey
Million Song Dataset | scaling MIR research
MLDATA | Machine Learning Dataset Repository
MLvis Scientific Data Repository
MovieLens Data Sets | GroupLens Research
NASA
NASA Earth Data
National Health and Nutrition Examination Survey
National Hospital Ambulatory Medical Care Survey Data
New York State
NYPD Crash Data Band-Aid
ODI Leeds
OECD Data
OECD.Stat
Office for National Statistics
Old Newspapers: A cleaned subset of HC Corpora newspapers
Open Data Inception Portals
Open Data Nederland
Open Data Network
OpenDataSoft Repository
Our World in Data
Pajek datasets
PermID from Thomson Reuters
Pew Research Center
Plenar.io
PolicyMap
Princeton University Library
Project Gutenberg
Quandl
re3data.org
Reddit Datasets
Registry of Research Data Repositories
Retrosheet.org
Satori OpenData
SCOTUS Opinions Corpus: Lots of Big, Important Words
Sharing PyPi/Maven dependency data « RTFB
SMS Spam Collection
Socrata
St. Louis Federal Reserve
Stanford Large Network Dataset Collection
State of the Nation Corpus (1990 – 2017): Full texts of the South African State of the Nation addresses
Statista
Substance Abuse and Mental Health Services Administration 
Swiss Open Government Data
Tableau Public
The Association of Religious Data Archives
The Economist
The General Social Survey
The Huntington’s Early California Population Project
The World Bank | Data
The World Bank Data Catalog
Toronto Open Data
Translation Task Data
Transport for London
Twitter Data 2010
Ubuntu Dialogue Corpus: 26 million turns from natural two-person dialogues
UC Irvine Knowledge Discovery in Databases Archive
UC Irvine Machine Learning Repository –
UC Irvine Network Data Repository
UN Comtrade Database
UN General Debates:Transcriptions of general debates at the UN from 1970 to 2016
UNdata
Uniform Crime Reporting
UniGene
United States Exam Data
University of Michigan ICPSR
University of Rochester LibGuide “Data-Stats”
US Bureau of Labor Statistics
US Census Bureau Data
US Energy Information Administration
US Government Web Services and XML Data Sources
USA Facts
USENET corpus (2005-2011)
Utah Open Data
Varieties of Democracy.
Western Pennsylvania Regional Data Center
WHO Data Repository
Wikipedia List of Datasets for Machine Learning
WordNet
World Values Survey
World Wealth & Income Database
World Wide Web: 3.5 billion web pages and their relations
Yahoo Data for Researchers
YouTube Network 2007-2008
Neural Networks 101

Neural Networks 101

Last month, a video by 3Blue1Brown has been trending on YouTube, accumulating already over a quarter of a million views. It only lasts 10 minutes but provides a very good and intuitive explanation of the inner workings of Neural Networks (NN):

The Machine Learning & Deep Learning book I wrote about recently provides a more substantial explanation of the different NNs and their inner workings. Neural nets come in various different flavors and my list of Data Science, Machine Learning, & Statistics Resources includes useful cheatsheets and other information, such as the architecture map below.

If you still haven’t had enough, Daniel Shiffman demonstrates how to code Neural Networks in Processing (Java), and the video displays precisely what happens behind the scenes. Finally, MIT has made their AI course material open-source, and it includes two 45 minute lectures on NNs. The lecturing professor – Patrick Winston – isn’t much of a fan of these “bulldozer” algorithms. He has a stronger preference for “more sophisticated” mathematical learning through, for instance, Support Vector Machines.

The Dataviz Project: Find just the right visualization

The Dataviz Project: Find just the right visualization

Do you have a bunch of data but you can’t seem to figure out how to display it? Or looking for that one specific visualization of which you can’t remember the name?

www.datavizproject.com provides a most comprehensive overview of all the different ways to visualize your data. You can sort all options by Family, Input, Function, and Shape to find that one dataviz that best conveys your message.

datavizproject overview

Update: look at some of these other repositories here or here.

Networks Among #rstats Twitterers

Networks Among #rstats Twitterers

Reposted from Kasia Kulma’s github with minor modifications.

Have you ever wondered whether the most active/popular R-twitterers are virtual friends? 🙂 And by friends here I simply mean mutual followers on Twitter. In this post, I score and pick top 30 #rstats twitter users and analyse their Twitter network. You’ll see a lot of applications of rtweet and ggraph packages, as well as a very useful twist using purrr library, so let’s begin!

 

IMPORTING #RSTATS USERS

After loading my precious packages…

library(rtweet)
library(dplyr)
library(purrr)
library(igraph)
library(ggraph)

… I searched for Twitter users that have rstats termin their profile description. It definitely doesn’t include ALL active and popular R – users, but it’s a pretty reliable way of picking R – fans.

r_users <- search_users("#rstats", n = 1000)

It’s important to say, that in rtweet::search_users() even if you specify 1000 users to be extracted, you end up with quite a few duplicates and the actual number of users I got was much smaller: 564

r_users %>% summarise(n_users = n_distinct(screen_name))
##   n_users
## 1     564

Funnily enough, even though my profile description contains #rstats I was not included in the search results (@KKulma), sic! Were you? 🙂

SCORING AND CHOOSING TOP #RSTATS USERS

Now, let’s extract some useful information about those users:

r_users_info <- lookup_users(r_users$screen_name)

You’ll notice, that created data frame holds information about the number of followers, friends (users they follow), lists they belong to, the number of tweets (statuses) or how many times were they marked favourite.

r_users_info %>% select(dplyr::contains("count")) %>% head()
##   followers_count friends_count listed_count favourites_count
## 1            8311           366          580             9325
## 2           44474            11         1298                3
## 3           11106           524          467            18495
## 4           12481           431          542             7222
## 5           15345          1872          680            27971
## 6            5122           700          549             2796
##   statuses_count
## 1          66117
## 2           1700
## 3           8853
## 4           6388
## 5          22194
## 6          10010

And these variables that I used for building my ‘top score’: I simply calculate a percentile for each of those variables and sum it all together for each user. Given that each variable’s percentile will give me a value between 0 and 1, The final score can have a maximum value of 5.

r_users_ranking <- r_users_info %>%
  filter(protected == FALSE) %>% 
  select(screen_name, dplyr::contains("count")) %>% 
  unique() %>% 
  mutate(followers_percentile = ecdf(followers_count)(followers_count),
         friends_percentile = ecdf(friends_count)(friends_count),
         listed_percentile = ecdf(listed_count)(listed_count),
         favourites_percentile = ecdf(favourites_count)(favourites_count),
         statuses_percentile = ecdf(statuses_count)(statuses_count)
         ) %>% 
  group_by(screen_name) %>% 
  summarise(top_score = followers_percentile + friends_percentile + listed_percentile + favourites_percentile + statuses_percentile) %>% 
  ungroup() %>% 
  mutate(ranking = rank(-top_score))

Finally, I picked top 30 users based on the score I calculated. Tada!

top_30 <- r_users_ranking %>% arrange(desc(top_score)) %>% head(30) %>% arrange(desc(top_score))
top_30 
## # A tibble: 30 x 3
##        screen_name top_score ranking
##              <chr>     <dbl>   <dbl>
##  1          hspter  4.877005       1
##  2    RallidaeRule  4.839572       2
##  3         DEJPett  4.771836       3
##  4 modernscientist  4.752228       4
##  5 nicoleradziwill  4.700535       5
##  6      tomhouslay  4.684492       6
##  7    ChetanChawla  4.639929       7
##  8   TheSmartJokes  4.627451       8
##  9   Physical_Prep  4.625668       9
## 10       Cataranea  4.602496      10
## # ... with 20 more rows

I must say I’m incredibly impressed by these scores: @hpster, THE top R – twitterer managed to obtain a score of nearly 4.9 out of 5! WOW!

Anyway! To add some more depth to my list, I tried to identify top users’ gender, to see how many of them are women. I had to do it manually (ekhem!), as the Twitter API’s data doesn’t provide this, AFAIK. Let me know if you spot any mistakes!

top30_lookup <- r_users_info %>%
  filter(screen_name %in% top_30$screen_name) %>% 
  select(screen_name, user_id)

top30_lookup$gender <- c("M", "F", "F", "F", "F",
                         "M", "M", "M", "F", "F", 
                         "F", "M", "M", "M", "F", 
                         "F", "M", "M", "M", "M", 
                         "M", "M", "M", "F", "M",
                         "M", "M", "M", "M", "M")

table(top30_lookup$gender)
## 
##  F  M 
## 10 20

It looks like a third of all top users are women, but in the top 10 users, there are 6 women. Better than I expected, to be honest. So, well done, ladies!

GETTING FRIENDS NETWORK

Now, this was the trickiest part of this project: extracting top users’ friends list and putting it all in one data frame. As you may be aware, Twitter API allows you to download information only on 15 accounts in 15 minutes. So for my list, I had to break it up into 2 steps, 15 users each and then I named each list according to the top user they refer to:

top_30_usernames <- top30_lookup$screen_name

friends_top30a <-   map(top_30_usernames[1:15 ], get_friends)
names(friends_top30a) <- top_30_usernames[1:15]

# 15 minutes later....
friends_top30b <- map(top_30_usernames[16:30], get_friends)

After this I end up with two lists, each containing all friends’ IDs for top and bottom 15 users respectively. So what I need to do now is i) append the two lists, ii) create a variable stating top users’ name in each of those lists and iii) turn lists into data frames. All this can be done in 3 lines of code. And brace yourself: here comes the purrr trick I’ve been going on about! Simply using purrr:::map2_df I can take a single list of lists, create a name variable in each of those lists based on the list name (twitter_top_user) and convert the result into the data frame. BRILLIANT!!

# turning lists into data frames and putting them together
friends_top30 <- append(friends_top30a, friends_top30b)
names(friends_top30) <- top_30_usernames

# purrr - trick I've been banging on about!
friends_top <- map2_df(friends_top30, names(friends_top30), ~ mutate(.x, twitter_top_user = .y)) %>% 
  rename(friend_id = user_id) %>% select(twitter_top_user, friend_id)

Here’s the last bit that I need to correct before we move on to plotting the friends networks: for some reason, using purrr::map() with rtweet:::get_friends() gives me max only 5000 friends, but in case of @TheSmartJokes the true value is over 8000. As it’s the only top user with more than 5000 friends, I’ll download his friends separately…

# getting a full list of friends
SJ1 <- get_friends("TheSmartJokes")
SJ2 <- get_friends("TheSmartJokes", page = next_cursor(SJ1))

# putting the data frames together 
SJ_friends <-rbind(SJ1, SJ2) %>%  
  rename(friend_id = user_id) %>% 
  mutate(twitter_top_user = "TheSmartJokes") %>% 
  select(twitter_top_user, friend_id)

# the final results - over 8000 friends, rather than 5000
str(SJ_friends) 
## 'data.frame':    8611 obs. of  2 variables:
##  $ twitter_top_user: chr  "TheSmartJokes" "TheSmartJokes" "TheSmartJokes" "TheSmartJokes" ...
##  $ friend_id       : chr  "390877754" "6085962" "88540151" "108186743" ...

… and use it to replace those friends that are already in the final friends list.

friends_top30 <- friends_top %>% 
  filter(twitter_top_user != "TheSmartJokes") %>% 
  rbind(SJ_friends) 

Finally, let me do some last data cleaning: filtering out friends that are not among the top 30 R – users, replacing their IDs with twitter names and adding gender for top users and their friends… Tam, tam, tam: here we are! Here’s the final data frame we’ll use for visualising the friend networks!

# select friends that are top30 users
final_friends_top30 <- friends_top  %>% 
  filter(friend_id %in% top30_lookup$user_id)

# add friends' screen_name
final_friends_top30$friend_name <- top30_lookup$screen_name[match(final_friends_top30$friend_id, top30_lookup$user_id)]

# add users' and friends' gender
final_friends_top30$user_gender <- top30_lookup$gender[match(final_friends_top30$twitter_top_user, top30_lookup$screen_name)]
final_friends_top30$friend_gender <- top30_lookup$gender[match(final_friends_top30$friend_name, top30_lookup$screen_name)]

## final product!!!
final <- final_friends_top30 %>% select(-friend_id)

head(final)
##   twitter_top_user     friend_name user_gender friend_gender
## 1         hrbrmstr nicoleradziwill           M             F
## 2         hrbrmstr        kara_woo           M             F
## 3         hrbrmstr      juliasilge           M             F
## 4         hrbrmstr        noamross           M             M
## 5         hrbrmstr      JennyBryan           M             F
## 6         hrbrmstr     thosjleeper           M             M

VISUALIZING FRIENDS NETWORKS

After turning our data frame into something more usable by igraph and ggraph

f1 <- graph_from_data_frame(final, directed = TRUE, vertices = NULL)
V(f1)$Popularity <- degree(f1, mode = 'in')

… let’s have a quick overview of all the connections:

ggraph(f1, layout='kk') + 
  geom_edge_fan(aes(alpha = ..index..), show.legend = FALSE) +
  geom_node_point(aes(size = Popularity)) +
  theme_graph( fg_text_colour = 'black') 

generic_pure

Keep in mind that Popularity – defined as the number of edges that go into the node – determines node size. It’s all very pretty, but I’d like to see how nodes correspond to Twitter users’ names:

ggraph(f1, layout='kk') + 
  geom_edge_fan(aes(alpha = ..index..), show.legend = FALSE) +
  geom_node_point(aes(size = Popularity)) +
  geom_node_text(aes(label = name, fontface='bold'), 
                 color = 'white', size = 3) +
  theme_graph(background = 'dimgray', text_colour = 'white',title_size = 30) 

generic_names

So interesting! You can see the core of the graph consists mainly of female users: @hpster, @JennyBryan, @juliasilge, @karawoo, but also a couple of male R – users: @hrbrmstr and @noamross. Who do they follow? Men or women?

ggraph(f1, layout='kk') + 
  geom_edge_fan(aes(alpha = ..index..), show.legend = FALSE) +
  geom_node_point(aes(size = Popularity)) +
  theme_graph( fg_text_colour = 'black') +
  geom_edge_link(aes(colour = friend_gender)) +
  scale_edge_color_brewer(palette = 'Set1') + 
  labs(title='Top 30 #rstats users and gender of their friends')

generic_with_gender

It’s difficult to say definitely, but superficially I see A LOT of red, suggesting that our top R – users often follow female top twitterers. Let’s have a closer look and split graphs by user gender and see if there’s any difference in the gender of users they follow:

ggraph(f1, layout='kk') + 
  geom_edge_fan(aes(alpha = ..index..), show.legend = FALSE) +
  geom_node_point(aes(size = Popularity)) +
  theme_graph( fg_text_colour = 'black') +
  facet_edges(~user_gender) +
  geom_edge_link(aes(colour = friend_gender)) +
  scale_edge_color_brewer(palette = 'Set1') +
  labs(title='Top 30 #rstats users and gender of their friends', subtitle='Graphs are separated by top user gender, edge colour indicates their friend gender' )

gender_with_gender

Ha! look at this! Obviously, female users’ graph will be less dense as there are fewer of them in the dataset, however, you can see that they tend to follow male users more often than male top users do. Is that impression supported by raw numbers?

final %>% 
  group_by(user_gender, friend_gender) %>% 
  summarize(n = n()) %>% 
  group_by(user_gender) %>% 
  mutate(sum = sum(n),
         percent = round(n/sum, 2)) 
## # A tibble: 4 x 5
## # Groups:   user_gender [2]
##   user_gender friend_gender     n   sum percent
##         <chr>         <chr> <int> <int>   <dbl>
## 1           F             F    26    57    0.46
## 2           F             M    31    57    0.54
## 3           M             F    55   101    0.54
## 4           M             M    46   101    0.46

It seems so, although to the lesser extent than suggested by the network graphs: Female top users follow other female top users 46% of the time, whereas male top users follow female top user 54% of the time. So what do you have to say about that?

About the author:

Kasia Kulma states she’s an overall, enthusiastic science enthusiast. Formally, a doctor in evolutionary biology, professionally, a data scientist, and, privately, a soppy mum and outdoors lover.