Sean Owen created this handy cheat sheet that shows the most common probability distributions mapped by their underlying relationships. Probability distributions are fundamental to statistics, just like data structures are to computer science. They’re the place to start studying if you mean to talk like a data scientist. Sean Owen (via) Owen argues that the…

# Tag: data

## Simulating data with Bayesian networks, by Daniel Oehm

Daniel Oehm wrote this interesting blog about how to simulate realistic data using a Bayesian network. Bayesian networks are a type of probabilistic graphical model that uses Bayesian inference for probability computations. Bayesian networks aim to model conditional dependence, and therefore causation, by representing conditional dependence by edges in a directed graph. Through these relationships, one…

## Google's Dataset Search: Direct access to 25 million interesting datasets

I used to keep a repository of links to interesting datasets to learn data science. However, that page I can retire, as Google has launched its new service Dataset Search. The “world wide web” hosts millions of datasets, on nearly any topic you can think of. Google’s Dataset Search has indexed almost 25 million of these…

## Animated Machine Learning Classifiers

Ryan Holbrook made awesome animated GIFs in R of several classifiers learning a decision rule boundary between two classes. Basically, what you see is a machine learning model in action, learning how to distinguish data of two classes, say cats and dogs, using some X and Y variables. These visuals can be great to understand…

## Caselaw Access Project: Structured data of over 6 million U.S. court decisions

Case.law seems like a very interesting data source for a machine learning or text mining project: The Caselaw Access Project (“CAP”) expands public access to U.S. law. Our goal is to make all published U.S. court decisions freely available to the public online, in a consistent format, digitized from the collection of the Harvard Law…

## Simulate Datasets with DrawData.xyz

Vincent Warmerdam shared his new tool to quickly simulate artificial datasets: http://www.drawdata.xyz. The drawdata.xyz tool allows you to easily create your own line- and scatter-plot with different groups of datapoints following specific x-y patterns. After drawing your data, you can just click to export your new dataset to csv or json format. x y 106.04…

## Comparison between R dplyr and data.table code

Atrebas created this extremely helpful overview page showing how to program standard data manipulation and data transformation routines in R’s famous packages dplyr and data.table. The document has been been inspired by this stackoverflow question and by the data.table cheat sheet published by Karlijn Willems. Resources for data.table can be found on the data.table wiki, in the data.table vignettes,…

## Understanding Data Distributions

Having trouble understanding how to interpret distribution plots? Or struggling with Q-Q plots? Sven Halvorson penned down a visual tutorial explaining distributions using visualisations of their quantiles. Because each slice of the distribution is 5% of the total area and the height of the graph is changing, the slices have different widths. It’s like we’re…

## What Every Programmer Needs To Know About Encodings

Kunststube wrote this great introduction to text encoding. Ever wondered why your Word document sometimes starts with ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔÇµÇ≠Ç»Ç¢? Well, encoding‘s why. Kunststube introduces you to the wonderful world of ASCII, WLatin, Mac Latin, and UTF-8, -16 and -32. Read the original articla via http://kunststube.net/encoding/