Tag: images

Datasets to practice and learn Programming, Machine Learning, and Data Science

Many requests have come in for “training datasets” to practice programming with. Fortunately, the internet is full of open-source datasets! I compiled a selected list of datasets and repositories below. If you have any additions, please comment or contact me! For information on programming languages or algorithms, visit the overviews for R, Python, SQL, or Data Science, Machine Learning, & Statistics resources.

This list is no longer being maintained. There are other, more frequently updated repositories of useful datasets included in bold below:

LAST UPDATED: 2019-12-23
A Million News Headlines: News headlines published over a period of 14 years.
AggData | Datasets
Aligned Hansards of the 36th Parliament of Canada
Amazon Web Services: Public Datasets
American Community Survey
ArcGIS Hub Open Data
arXiv.org help – arXiv Bulk Data Access – Amazon S3
Asset Macro: Financial & Macroeconomic Historical Data
Awesome JSON Datasets
Awesome Public Datasets
Behavioral Risk Factor Surveillance System
British Oceanographic Data Center
Bureau of Justice
Causality | Data Repository
CDC Wonder Online Database
Census Bureau Home Page
Center for Disease Control
City of Chicago
Click Dataset | Center for Complex Networks and Systems Research
CommonCrawl 2013 Web Crawl
Consumer Finance: Mortgage Database
CRCNS – Collaborative Research in Computational Neuroscience
Data Download
Data is Plural
Data.Seattle.Gov | Seattle’s Data Site
Data.World datasets
Datasets for Data Mining
DELVE datasets
DMOZ open directory (mirror)
Enigma Public
Enron Email Dataset
European Environment Agency (EEA) | Data and maps
Eurostat Database
Eurovision YouTube Comments: YouTube comments on entries from the 2003-2008 Eurovision Song Contests
FAA Data
Face Recognition Homepage – Databases
FBI Crime Data Explorer
FEMA Data Feeds
Flickr personal taxonomies
Fraudulent E-mail Corpus: CLAIR collection of “Nigerian” fraud emails
Freebase (last datadump)
Gene Expression Omnibus (GEO) Main page
GeoJSON files for real-time Virginia transportation data.
Golem Dataset
Google Books n-gram dataset
Google Public Data Explorer
Google Research: A Web Research Corpus Annotated with Freebase Concepts
Health Intelligence
Healthcare Cost and Utilization Project
Human Fertility Database
Human Mortality Database
ICPSR Social Science Studies
ICWSM Spinn3r Challenge 2011 dataset
IIE.org Open Doors Data Portal
IMDB dataset
IMF Data and Statistics
Informatics Lab Open Data
Inside AirBnB
Internet Archive: Digital Library
Ironic Corpus: 1950 sentences labeled for ironic content
Kaggle Datasets
KAPSARC Energy Data Portal
KDNuggets Datasets
Lahman’s Baseball Database
Lending Club Loan Data
Linking Open Data
London Datastore
Makeover Monday
Medical Expenditure Panel Survey
Million Song Dataset | scaling MIR research
MLDATA | Machine Learning Dataset Repository
MLvis Scientific Data Repository
MovieLens Data Sets | GroupLens Research
NASA Earth Data
National Health and Nutrition Examination Survey
National Hospital Ambulatory Medical Care Survey Data
New York State
NYPD Crash Data Band-Aid
ODI Leeds
Office for National Statistics
Old Newspapers: A cleaned subset of HC Corpora newspapers
Open Data Inception Portals
Open Data Nederland
Open Data Network
OpenDataSoft Repository
Our World in Data
Pajek datasets
PermID from Thomson Reuters
Pew Research Center
Princeton University Library
Project Gutenberg
Reddit Datasets
Registry of Research Data Repositories
Satori OpenData
SCOTUS Opinions Corpus: Lots of Big, Important Words
Sharing PyPi/Maven dependency data « RTFB
SMS Spam Collection
St. Louis Federal Reserve
Stanford Large Network Dataset Collection
State of the Nation Corpus (1990 – 2017): Full texts of the South African State of the Nation addresses
Substance Abuse and Mental Health Services Administration 
Swiss Open Government Data
Tableau Public
The Association of Religious Data Archives
The Economist
The General Social Survey
The Huntington’s Early California Population Project
The World Bank | Data
The World Bank Data Catalog
Toronto Open Data
Translation Task Data
Transport for London
Twitter Data 2010
Ubuntu Dialogue Corpus: 26 million turns from natural two-person dialogues
UC Irvine Knowledge Discovery in Databases Archive
UC Irvine Machine Learning Repository
UC Irvine Network Data Repository
UN Comtrade Database
UN General Debates: Transcriptions of general debates at the UN from 1970 to 2016
Uniform Crime Reporting
United States Exam Data
University of Michigan ICPSR
University of Rochester LibGuide “Data-Stats”
US Bureau of Labor Statistics
US Census Bureau Data
US Energy Information Administration
US Government Web Services and XML Data Sources
USA Facts
USENET corpus (2005-2011)
Utah Open Data
Varieties of Democracy
Western Pennsylvania Regional Data Center
WHO Data Repository
Wikipedia List of Datasets for Machine Learning
World Values Survey
World Wealth & Income Database
World Wide Web: 3.5 billion web pages and their relations
Yahoo Data for Researchers
YouTube Network 2007-2008
Generating 3D Faces from 2D Photographs

Aaron Jackson, Adrian Bulat, Vasileios Argyriou and Georgios Tzimiropoulos of the Computer Vision Laboratory of the University of Nottingham built a neural network that generates a full 3D image from a single portrait photograph. They turn a photograph like this…

… into an accurately creepy 3D image like this.


You can try it with your own or other photographs here. I found that images with a white background get the best results. On their project website you can read more about the underlying convolutional neural network.

Update 21-10-2017: One of my favorite YouTube channels explains how the models were trained and the data used:

t-SNE, the Ultimate Drum Machine and more

This blog explains t-Distributed Stochastic Neighbor Embedding (t-SNE) through the story of programmers joining forces with musicians to create the ultimate drum machine (if you are here just for the fun, you may start playing right away).

Kyle McDonald, Manny Tan, and Yotam Mann found it difficult to pinpoint why some sounds are similar (ding, dong) and others are not (ding, beep), and they wanted to examine how we humans determine and experience this similarity among sounds. They teamed up with some friends at Google’s Creative Lab and the London Philharmonia to realize what they have named “the Infinite Drum Machine”, turning the most random set of sounds into a musical instrument.

The project team wanted to include as many different sounds as they could, but had less appetite for comparing, contrasting and arranging all those sounds into musical chords themselves. Instead, they imagined that a computer could perform such a laborious task. To determine the similarities among their dataset of sounds – which literally includes a thousand different sounds, from the ngaaarh of a photocopier to the zing of an anvil – they used a fairly novel unsupervised machine learning technique called t-Distributed Stochastic Neighbor Embedding, or t-SNE for short (t-SNE Wiki; developer: Laurens van der Maaten). t-SNE specializes in dimensionality reduction for visualization purposes, as it transforms high-dimensional data into a two- or three-dimensional space. For a rapid introduction to high-dimensional data and t-SNE by some smart Googlers, please watch the video below.

As the video explains, t-SNE maps complex data to a two- or three-dimensional space and was therefore really useful for comparing and grouping similar sounds. Sounds are very high-dimensional, as they are essentially an elaborate sequence of waves, each with a pitch, a duration, a frequency, a bass, an overall length, etcetera (clearly I am no musician). You would need a lot of information to describe a specific sound accurately. The project team compared sounds to fingerprints, as there is an immense amount of data in a single padamtss.

t-SNE takes into account all this information about a sound and compares all sounds in the dataset. Next, it creates two or three new dimensions and assigns each sound values on these new dimensions in such a way that sounds which were similar in the original high-dimensional data are also close together on the new two or three dimensions. You could say that t-SNE summarizes (most of) the information that was stored in the original complex data. This is what dimensionality reduction techniques do: they reduce the number of dimensions you need to describe data (sufficiently). Fortunately, techniques such as t-SNE are unsupervised, meaning that the project team did not have to tag or describe the sounds in their dataset manually but could just let the computer do the heavy lifting.
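To make the paragraph above a bit more concrete, here is a minimal sketch of the idea in pure Python. It is not the Infinite Drum Machine's code and not a full t-SNE implementation. It assumes a made-up toy dataset of six four-dimensional "sounds", one fixed Gaussian bandwidth for all points (real t-SNE tunes a bandwidth per point through a perplexity setting), and plain gradient descent without the usual optimization tricks. All function names are hypothetical.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gaussian_affinities(points, sigma=1.0):
    """Similarities p_ij in the original space (assumption: one global sigma)."""
    n = len(points)
    w = [[0.0 if i == j else
          math.exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]


def student_t_affinities(embedding):
    """Similarities q_ij in the low-dimensional map, plus the raw t-kernel weights."""
    n = len(embedding)
    w = [[0.0 if i == j else
          1.0 / (1.0 + sq_dist(embedding[i], embedding[j]))
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w], w


def kl_divergence(p, q):
    """Mismatch between the two sets of similarities; t-SNE tries to minimize it."""
    return sum(pij * math.log(pij / qij)
               for prow, qrow in zip(p, q)
               for pij, qij in zip(prow, qrow) if pij > 0)


def tsne(points, dims=2, steps=300, learning_rate=0.1, seed=0):
    """Place every point in `dims` dimensions by gradient descent on the KL divergence."""
    rng = random.Random(seed)
    n = len(points)
    p = gaussian_affinities(points)
    # Start from a tiny random cloud around the origin.
    y = [[rng.gauss(0, 0.01) for _ in range(dims)] for _ in range(n)]
    for _ in range(steps):
        q, w = student_t_affinities(y)
        grad = [[0.0] * dims for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                # Pairs that are too far apart in the map (p > q) attract,
                # pairs that are too close (p < q) repel.
                coeff = 4.0 * (p[i][j] - q[i][j]) * w[i][j]
                for d in range(dims):
                    grad[i][d] += coeff * (y[i][d] - y[j][d])
        for i in range(n):
            for d in range(dims):
                y[i][d] -= learning_rate * grad[i][d]
    return y


def toy_sounds(seed=1):
    """Six invented 4-dimensional 'sounds': three near 0 and three near 5."""
    rng = random.Random(seed)
    low = [[rng.gauss(0, 0.1) for _ in range(4)] for _ in range(3)]
    high = [[5 + rng.gauss(0, 0.1) for _ in range(4)] for _ in range(3)]
    return low + high


if __name__ == "__main__":
    sounds = toy_sounds()
    for point in tsne(sounds):
        print([round(value, 2) for value in point])
```

Running it prints one two-dimensional coordinate per toy sound. Sounds from the same group should end up next to each other on the map, which is the property the drum machine relies on.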

The result of this project is fantastic and rightfully bears the name of Infinite Drum Machine (click to play)! You can use the two-dimensional map to explore similar sounds and you can even make beats using the sequencing tool. The video below summarizes the creation process.

Amazed by this application, I wanted to know how t-SNE is being used in other projects. I found a tremendous number of applications that demonstrate how to implement t-SNE in Python, R, and even JS, and the method also seems popular in academia.

Luke Metz argues that implementation in Python is fairly easy, and Analytics Vidhya and a visualized blog by O’Reilly back this claim. Superstar Andrej Karpathy has an interactive t-SNE demo which allows you to compare the similarity among top Twitter users using t-SNE (I think in JavaScript). A Kaggle user and Data Science Heroes have demonstrated how to apply t-SNE in R and have compared the method to other unsupervised methods, for instance to PCA.

Clusters of similar cats/dogs in Luke Metz’ application of t-SNE.
Cho et al., 2014 have used t-SNE in their natural language processing projects as it allows for an easy examination of the similarity among words and phrases. Mnih and colleagues (2015) have used t-SNE to examine how neural networks were playing video games.

Two-dimensional t-SNE visualization of the hidden layer activity of a neural network playing Space Invaders (Mnih et al., 2015)

On a final note, while acknowledging its potential, this blog warns about the inaccuracies in t-SNE due to the aesthetic adjustments it often seems to make. The authors have some lovely interactive visualizations to back up their claim. They conclude that its incredible flexibility allows t-SNE to find structure where other methods cannot. Unfortunately, this makes it tricky to interpret t-SNE results, as the algorithm makes all sorts of opaque adjustments to tidy its visualizations and make the complex information fit on just two or three dimensions.

Generating images from scratch: Parallel Multiscale Autoregressive Density Estimation

A while ago, I blogged about this new algorithm, pix2code, which takes in pictures of graphical user interfaces and outputs the underlying code. Today, I discovered another fantastic algorithm, by Scott Reed and his colleagues at Google DeepMind. txt2pix would be a catchy name for this algorithm, as it can take in a fairly complex sentence (e.g., “a grey bird with a black head, orange eyes, and a yellow beak”) and generate a completely new and unique image based on its content. In their recently published paper, they elaborate on the algorithm’s inner workings.

An example of the training and generation process reported in the paper

Scott and his team have been working on this project for quite some time. The early version of the algorithm generated an image one pixel at a time, but it had difficulties generating large or high-quality images. After picking a starting pixel, every subsequent pixel the algorithm generates needs to align with its neighbours. For example, if pixel A is the first pixel in the generation of the yellow beak of a bird, any pixels that are created in the neighbourhood of that pixel should take into account that pixel A is trying to visualize a yellow beak, and behave accordingly: either continuing the beak, or ending the beak and starting on another element of the image.
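The pixel-by-pixel idea can be illustrated with a toy sampler. This is only a sketch under strong assumptions, not the model from the paper: the "image" is a grid of 0s and 1s, each pixel looks only at its left and upper neighbours, and a made-up `stickiness` parameter stands in for what a trained network would have learned about how likely a pixel is to continue its surroundings.

```python
import random


def sample_image(height, width, stickiness=0.9, seed=0):
    """Generate a binary image in raster order, one pixel at a time."""
    rng = random.Random(seed)
    image = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            # Collect the neighbours that have already been generated.
            neighbours = []
            if col > 0:
                neighbours.append(image[row][col - 1])
            if row > 0:
                neighbours.append(image[row - 1][col])
            if neighbours and rng.random() < stickiness:
                # Continue what a neighbour started (e.g. keep drawing the beak).
                image[row][col] = rng.choice(neighbours)
            else:
                # Start something new.
                image[row][col] = rng.randint(0, 1)
    return image


if __name__ == "__main__":
    for line in sample_image(8, 16):
        print("".join("#" if pixel else "." for pixel in line))
```

Because every pixel has to wait for the pixels before it, the two nested loops cannot be run side by side, which is exactly the bottleneck discussed next.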

The problem with such an iterative approach (i.e., pixel by pixel) is that it can take a very long time for a computer to generate an image. Consider that a fairly small image, say 256 by 256 pixels, already contains 65,536 pixels, each of which needs to be generated while considering all its neighbours and keeping in mind the bigger picture. In the most recent, updated version of the algorithm, Scott and his team have allowed the generation of multiple unrelated pixels simultaneously in different ‘zones’ of the image. Hence the Parallel in Parallel Multiscale Autoregressive Density Estimation. With this parallel approach, the algorithm can now generate the pixels representing the yellow beak in one area of the image, while simultaneously generating pixels for the bird’s wings and the branch it’s sitting on in other sections of the image. This speeds up the process considerably, demanding less computation time and thus allowing for quicker image generation.
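A back-of-the-envelope count shows why this matters. The sketch below compares the number of sequential generation steps under two schemes. The coarse-to-fine scheme is a simplification for illustration: the 4 by 4 starting size and the three parallel steps per doubling of the resolution are assumptions of mine, not figures taken from the paper.

```python
def sequential_steps(height, width):
    """Pixel-by-pixel generation: one step per pixel."""
    return height * width


def multiscale_steps(size, base=4, groups_per_doubling=3):
    """Coarse-to-fine generation (illustrative assumption, see the text above).

    A small base image is generated pixel by pixel; after that, every
    doubling of the resolution costs only a fixed number of parallel steps.
    """
    steps = base * base
    current = base
    while current < size:
        current *= 2
        steps += groups_per_doubling
    return steps


if __name__ == "__main__":
    print(sequential_steps(256, 256))  # 65536
    print(multiscale_steps(256))       # 34
```

Under these assumptions, the step count grows with the number of doublings rather than with the number of pixels, so making the image larger adds only a handful of steps.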

I can definitely recommend that you check out Scott Reed’s Twitter feed for some amazing animated GIFs of the generation process:

If you want to know more about the details behind the algorithm but do not fancy reading the entire paper, I recommend this short explanation video by Károly Zsolnai-Fehér (what a name!) of Two Minute Papers: