Google’s Dataset Search: Direct access to 25 million interesting datasets

I used to keep a repository of links to interesting datasets for learning data science. I can now retire that page, as Google has launched its new service, Dataset Search.

The “world wide web” hosts millions of datasets, on nearly any topic you can think of. Google’s Dataset Search has indexed almost 25 million of these datasets, giving you a single entry point to search for datasets online. After a year of testing, Dataset Search is now officially out of beta.

After the testing phase, Dataset Search now includes filters based on the type of dataset you want (e.g., tables, images, text) and on whether the dataset is open source/access. For datasets about a geographic area, you can see the map. The quality of the dataset descriptions has improved greatly, and the tool now has a mobile version.

Anyone who publishes data can make their datasets discoverable in Dataset Search by describing the properties of their dataset using a special schema on their own web page.
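To give an idea of what such a description can look like, here is a minimal sketch in Python. It assumes the schema in question is the schema.org `Dataset` type embedded in the page as JSON-LD. The helper names (`dataset_jsonld`, `as_script_tag`) and the example dataset, URL, and license are hypothetical and only there for illustration; check Google’s own documentation for the properties it actually expects.

```python
import json


def dataset_jsonld(name, description, url, keywords=(), license_url=None):
    """Build a schema.org Dataset description as a JSON-LD string."""
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
    }
    # Optional properties are only included when provided.
    if keywords:
        record["keywords"] = list(keywords)
    if license_url:
        record["license"] = license_url
    return json.dumps(record, indent=2)


def as_script_tag(jsonld):
    """Wrap the JSON-LD so it can be pasted into the HTML of a web page."""
    return '<script type="application/ld+json">\n' + jsonld + "\n</script>"


# Hypothetical dataset, for illustration only.
example = dataset_jsonld(
    name="Example rainfall measurements",
    description="Daily rainfall totals from a fictional weather station.",
    url="https://example.org/datasets/rainfall",
    keywords=["rainfall", "weather"],
    license_url="https://creativecommons.org/licenses/by/4.0/",
)
print(as_script_tag(example))
```

The printed `<script>` element would go in the HTML of the page that hosts the dataset.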

How to Read Scientific Papers

Reddit is a treasure trove of random stuff. However, every now and then, in the better groups, quite valuable topics pop up. Here’s one I came across on r/statistics:

The advice by u/grandzooby in particular seemed worth a like; he linked to several useful resources, which I’ve summarized for you below.

An 11-step guide to reading a paper

Jennifer Raff — assistant professor at the University of Kansas — wrote this 3-page guide on how to read papers. It elaborates on 11 main pieces of advice for reading academic papers:

  1. Begin by reading the introduction; skip the abstract.
  2. Identify the general problem: “What problem is this research field trying to solve?”
  3. Try to uncover the reason and need for this specific study.
  4. Identify the specific problem: “What problems is this paper trying to solve?”
  5. Identify what the researchers are going to do to solve that problem
  6. Read & identify the methods: draw the studies in diagrams
  7. Read & identify the results: write down the main findings
  8. Determine whether the results solve the specific problem
  9. Read the conclusions and determine whether you agree
  10. Read the abstract
  11. Find out what others say about this paper

Jennifer also dedicated a more elaborate blog post to the matter (to which u/grandzooby refers).

4-step Infographic

Natalia Rodriguez made a beautiful infographic for Elsevier with some general advice:


How to take notes while reading

Mary Purugganan and Jan Hewitt of Rice University propose slightly different steps for reading academic papers. To me, though, they seem more like general pointers to keep in mind:

  1. Skim the article and identify its structure
  2. Distinguish its main points
  3. Generate questions before and during reading
  4. Draw inferences while reading
  5. Take notes while reading

For note-taking, Mary and Jan propose the following template, which may prove useful:

  • Citation:
  • URL:
  • Keywords:
  • General subject:
  • Specific subject:
  • Hypotheses:
  • Methodology:
  • Results:
  • Key points:
  • Context (in the broader field/your work):
  • Significance (to the field/your work):
  • Important figures/tables (description/page numbers):
  • References for further reading:
  • Other comments:

Scholars sharing their experiences

Science Magazine dedicated a long read to how to seriously read scientific papers, in which they asked multiple scholars to share their experiences and tips.

Anatomy of a scientific paper

This 13-page guide by the American Society of Plant Biologists was recommended by some, but I personally don’t find it as useful as the other advice here. Nevertheless, for laypeople, it does include a nice visualization of the anatomy of scientific papers:


Learning How to Learn

One Reddit user recommended the Coursera course Learning How to Learn: Powerful mental tools to help you master tough subjects. It’s free and can be taken in English, Portuguese, Spanish, or Chinese.

This course gives you easy access to the invaluable learning techniques used by experts in art, music, literature, math, science, sports, and many other disciplines. We'll learn about how the brain uses two very different learning modes and how it encapsulates ("chunks") information. We'll also cover illusions of learning, memory techniques, dealing with procrastination, and best practices shown by research to be most effective in helping you master tough subjects.

Last week I cohosted a professional learning course on data visualization at JADS. My fellow host was Prof. Jack van Wijk, and together we organized an amazing workshop and poster event. Jack gave two lectures on data visualization theory and resources and mentioned, among others, a resource I was unfamiliar with until then: treevis.

treevis is a lot like the dataviz project in the sense that it is an extensive overview of different types of data visualizations. It is unique, however, in that it focuses specifically on visualizations of hierarchical data: multi-level or nested data structures.

Hans-Jörg Schulz, professor of Computer Science at Aarhus University in Denmark, maintains the treevis repo. At the moment of writing, he has compiled over 300 different types of hierarchical data visualizations and displays them on this website.

As an added bonus, the repo is interactive: there are several ways to filter and search for the visualization type that best fits your data and needs.

Most resources come with added links to the original authors and the original papers they were first published in, so this is truly a great resource for those interested in doing a deep dive into data visualization. Do have a look yourself!
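If “hierarchical data” sounds abstract, here is a minimal sketch in Python of what such a nested structure can look like and of two quantities a tree visualization typically needs: the aggregated size of each node and the depth of the tree. The folder tree, its sizes, and the function names are all made up for illustration; nothing here comes from treevis itself.

```python
# Hypothetical nested data: a folder tree with file sizes (in KB) at the leaves.
tree = {
    "name": "project",
    "children": [
        {"name": "data", "children": [
            {"name": "raw.csv", "size": 120},
            {"name": "clean.csv", "size": 80},
        ]},
        {"name": "src", "children": [
            {"name": "plot.py", "size": 10},
        ]},
        {"name": "README.md", "size": 5},
    ],
}


def total_size(node):
    """Sum the leaf sizes below a node (what a treemap would use as area)."""
    if "children" not in node:
        return node["size"]
    return sum(total_size(child) for child in node["children"])


def depth(node):
    """Number of levels in the hierarchy, counting the root as level 1."""
    children = node.get("children", [])
    if not children:
        return 1
    return 1 + max(depth(child) for child in children)


def outline(node, level=0):
    """Render the hierarchy as indented text lines, one per node."""
    lines = [f"{'  ' * level}{node['name']} ({total_size(node)})"]
    for child in node.get("children", []):
        lines.extend(outline(child, level + 1))
    return lines


print("\n".join(outline(tree)))
print("levels:", depth(tree))
```

The indented outline is about the simplest “visualization” of such data; the types collected on treevis are far richer ways of showing the same kind of structure.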

Anomaly Detection Resources

Carnegie Mellon PhD student Yue Zhao maintains this great GitHub repository of anomaly detection resources:

The repository consists of tools for multiple languages (R, Python, Matlab, Java) and resources in the form of:

  1. Books & Academic Papers
  2. Online Courses and Videos
  3. Outlier Datasets
  4. Algorithms and Applications
  5. Open-source and Commercial Libraries/Toolkits
  6. Key Conferences & Journals

Outlier Detection (also known as Anomaly Detection) is an exciting yet challenging field, which aims to identify outlying objects that are deviant from the general data distribution. Outlier detection has been proven critical in many fields, such as credit card fraud analytics, network intrusion detection, and mechanical unit defect detection.
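For a feel of what “deviant from the general data distribution” can mean in practice, here is a minimal sketch of one simple, classic approach: the modified z-score based on the median and the median absolute deviation (MAD). This is my own illustration, not code from the repository. The function name, the example amounts, and the conventional 3.5 cutoff are assumptions you would tune for real data.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    Modified z-score: 0.6745 * (x - median) / MAD, where MAD is the
    median absolute deviation. The 3.5 cutoff is a common rule of thumb.
    """
    center = median(values)
    mad = median(abs(x - center) for x in values)
    if mad == 0:
        # More than half of the values are identical; the score is undefined.
        return []
    return [x for x in values if abs(0.6745 * (x - center) / mad) > threshold]


# Hypothetical transaction amounts, with one suspicious value.
amounts = [10, 12, 11, 13, 12, 11, 95, 10, 12]
print(mad_outliers(amounts))  # → [95]
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value barely moves them. The tools in the repository go well beyond this kind of univariate rule.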


The Causal Inference Book: DAGs and more

Harvard (bio)statisticians Miguel Hernán and Jamie Robins just released their new book, online and accessible for free!

The Causal Inference book provides a cohesive presentation of causal inference, its concepts and its methods. The book is divided into three parts of increasing difficulty: causal inference without models, causal inference with models, and causal inference from complex longitudinal data. Here’s the official Harvard page for the book release.

Some of the book’s (NHEFS) data is accessible too:

As is the associated computer code for the analyses, in multiple languages:

This is definitely an interesting read for epidemiologists, statisticians, psychologists, economists, sociologists, political scientists, data scientists, computer scientists, and any other person with a love for proper data analysis! 

Sam Finlayson visualized some of the Directed Acyclic Graphs (DAGs) covered in the book, and these also look quite nice. You can find the visuals, along with other notes and glossary items, here.
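In case DAGs are new to you, here is a minimal sketch in Python of a toy causal DAG with a covariate L, a treatment A, and an outcome Y. The graph and the helper functions are my own illustration, not taken from the book’s code. It uses the standard library’s `graphlib` (Python 3.9+) to check that the graph really is acyclic, and it lists shared ancestors as a rough first look at possible common causes, not as a test for confounding.

```python
from graphlib import CycleError, TopologicalSorter

# Toy causal DAG, stored as node -> set of direct causes (parents).
# L -> A, L -> Y, A -> Y (illustrative only)
parents = {
    "L": set(),
    "A": {"L"},
    "Y": {"L", "A"},
}


def is_acyclic(graph):
    """True if the graph has no directed cycles, i.e. it is a valid DAG."""
    try:
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


def ancestors(graph, node):
    """All direct and indirect causes of a node."""
    seen = set()
    stack = list(graph.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, ()))
    return seen


def shared_ancestors(graph, a, b):
    """Nodes upstream of both a and b: candidates to inspect, nothing more."""
    return ancestors(graph, a) & ancestors(graph, b)


print(is_acyclic(parents))                  # → True
print(shared_ancestors(parents, "A", "Y"))  # → {'L'}
```

Whether a shared ancestor actually confounds the effect of A on Y depends on the paths in the graph, which is exactly what the book teaches you to reason about.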

Overviews of Graph Classification and Network Clustering methods

Thanks to Sebastian Raschka I am able to share this great GitHub overview page of relevant graph classification techniques, and the scientific papers behind them. The overview divides the algorithms into four groups:

  1. Factorization
  2. Spectral and Statistical Fingerprints
  3. Deep Learning
  4. Graph Kernels

Moreover, the overview contains links to similar collections on community detection, classification/regression trees, and gradient boosting papers with implementations, as well as a link to relevant graph classification benchmark datasets.