Tag: openaccess

Practice your Data Science skills on real-life data

Maven Analytics now provides open access to their datasets through what they call their “DATA PLAYGROUND“!

They offer 21 datasets including a range of different data (time series, geospatial, user preferences) on a variety of topics like business, sports, wine, financial stocks, transportation and whatnot.

This is a great starting point if you want to practice your data science, machine learning and analysis skills on real life data!

Maven Analytics provides e-learnings in analysis and programming software. To provide a practical learning experience, their courses are often accompanied by real-life datasets for students to analyze.

Bayesian Statistics using R, Python, and Stan

For a year now, this course on Bayesian statistics has been on my to-do list. So without further ado, I decided to share it with you already.

Richard McElreath is an evolutionary ecologist who is famous in the stats community for his work on Bayesian statistics.

At the Max Planck Institute for Evolutionary Anthropology, Richard teaches Bayesian statistics, and he was kind enough to put his whole course on Statistical Rethinking: Bayesian statistics using R & Stan open access online.

You can find the video lectures here on Youtube, and the slides are linked to here:

Richard also wrote a book that accompanies this course:

For more information abou the book, click here.

For the Python version of the code examples, click here.

Google’s Dataset Search: Direct access to 25 million interesting datasets

I used to keep a repository of links to interesting datasets to learn data science. However, that page I can retire, as Google has launched its new service Dataset Search.

The “world wide web” hosts millions of datasets, on nearly any topic you can think of. Google’s Dataset Search has indexed almost 25 million of these datasets, giving you a single entry point to search for datasets online. After a year of testing, Dataset Search is now officially out of beta.

After alpha testing, Dataset Search now includes filter based on the types of dataset that you want (e.g., tables, images, text), on whether the dataset is open source/access. For dataset on geographic area’s, you can see the map. The quality of dataset’s descriptions has improved greatly, and the tool now has a mobile version.

Anyone who publishes data can make their datasets discoverable in Dataset Search by describe the properties of their dataset using a special schema on their own web page.

Caselaw Access Project: Structured data of over 6 million U.S. court decisions

Case.law seems like a very interesting data source for a machine learning or text mining project:

The Caselaw Access Project (“CAP”) expands public access to U.S. law. Our goal is to make all published U.S. court decisions freely available to the public online, in a consistent format, digitized from the collection of the Harvard Law Library.
The capstone of the Caselaw Access Project is a robust set of tools which facilitate access to the cases and their associated metadata. We currently offer five ways to access the data: API, bulk downloads, search, browse, and a historical trends viewer.
https://case.law/about/

Our open-source API is the best option for anybody interested in programmatically accessing our metadata, full-text search, or individual cases.
If you need a large collection of cases, you will probably be best served by our bulk data downloads. Bulk downloads for Illinois and Arkansas are available without a login, and unlimited bulk files are available to research scholars.
https://case.law/about/

Case metadata, such as the case name, citation, court, date, etc., is freely and openly accessible without limitation. Full case text can be freely viewed or downloaded but you must register for an account to do so, and currently you may view or download no more than 500 cases per day. In addition, research scholars can qualify for bulk data access by agreeing to certain use and redistribution restrictions. You can request a bulk access agreement by creating an account and then visiting your account page.
Access limitations on full text and bulk data are a component of Harvard’s collaboration agreement with Ravel Law, Inc. (now part of Lexis-Nexis). These limitations will end, at the latest, in March of 2024. In addition, these limitations apply only to cases from jurisdictions that continue to publish their official case law in print form. Once a jurisdiction transitions from print-first publishing to digital-first publishing, these limitations cease. Thus far, Illinois and Arkansas have made this important and positive shift and, as a result, all historical cases from these jurisdictions are freely available to the public without restriction. We hope many other jurisdictions will follow their example soon.
https://case.law/about/

A different project altogether is helping the team behind Caselaw improve its data quality:

Our data inevitably includes countless errors as part of the digitization process. The public launch of this project is only the start of discovering errors, and we hope you will help us in finding and fixing them.
Some parts of our data are higher quality than others. Case metadata, such as the party names, docket number, citation, and date, has received human review. Case text and general head matter has been generated by machine OCR and has not received human review.
You can report errors of all kinds at our Github issue tracker, where you can also see currently known issues. We particularly welcome metadata corrections, feature requests, and suggestions for large-scale algorithmic changes. We are not currently able to process individual OCR corrections, but welcome general suggestions on the OCR correction process.
https://case.law/about/

Dynamic Programming MIT Course

Cover image by xkcd

Over the last months I’ve been working my way through Project Euler in my spare time. I wanted to learn Python programming, and what better way than solving mini-problems and -projects?!

Well, Project Euler got a ton of these, listed in increasing order of difficulty. It starts out simple: to solve the first problem you need to write a program to identify multiples of 3 and 5. Next, in problem two, you are asked to sum the first thousand even Fibonacci numbers. Each problem, the task at hand gets slighly more difficult…

For me, Project Euler combines math, programming, and stats in a way that really keeps me motivated to continue and learn new concepts and programming / problem-solving approaches.

However, at problem 31, I really got stuck. For several hours, I struggled to solve it in a satisfactory fashion, even though most other problems only take 5-90 minutes.

After hours of struggling, I pretty much gave up, and googled some potential solutions. Aparently, the way to solve problem 31, is to take a so-called dynamic programming approach.

Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics. In both contexts it refers to simplifying a complicated problem by breaking it down into simpler sub-problems in a recursive manner. While some decision problems cannot be taken apart this way, decisions that span several points in time do often break apart recursively. Likewise, in computer science, if a problem can be solved optimally by breaking it into sub-problems and then recursively finding the optimal solutions to the sub-problems, then it is said to have optimal substructure.
https://en.wikipedia.org/wiki/Dynamic_programming

Now, this sounded like something I’d like to learn more about! I was already quite familiar with recursive problems and solutions, but this dynamic programming sounded next-level.

So I googled and googled for tutorials and other resources, and I finally came across this free 2011 MIT course that I intend to view over the coming weeks.

There’s even a course website with additional materials and assignments (in Python).

ASSN #	TOPICS	PROBLEM SETS	SOLUTIONS
1	Asymptotic complexity, recurrence relations, peak finding	Problem Set 1 (PDF) Problem Set 1 Code (ZIP)	Problem Set 1 Solutions (PDF)
2	Fractal rendering, digital circuit simulation	Problem Set 2 (PDF) Problem Set 2 Code (ZIP)	Problem Set 2 Solutions (PDF) Problem Set 2 Code Solutions (ZIP – 7.7MB)
3	Range queries, digital circuit layout	Problem Set 3 (PDF) Problem Set 3 Code (ZIP – 3.2MB)	Problem Set 3 Solutions (PDF) Problem Set 3 Code Solutions (ZIP – 15.7MB)
4	Hash functions, Python dictionaries, matching DNA sequences	Problem Set 4 (PDF) Problem Set 4 Code (GZ – 12.4MB) (kfasta.py courtesy of Kevin Kelley, and used with permission.)	Problem Set 4 Solutions (PDF) Problem Set 4 Code Solutions (ZIP)
5	The Knight’s Shield, RSA public key encryption, image decryption	Problem Set 5 (PDF) Problem Set 5 Code (ZIP) Problem Set 5 Grading Explanation (PDF)	Problem Set 5 Solutions (PDF) Problem Set 5 Code Solutions (ZIP)
6	Social networks, Rubik’s Cube, Dijkstra	Problem Set 6 (PDF) Problem Set 6 Code (ZIP – 2.9MB) (nhpn.py courtesy of Punyashloka Biswal and Michael Lieberman; Pocket Cube Solver courtesy of Huan Liu and Anh Nguyen. Used with permission.)	Problem Set 6 Solutions (PDF) Problem Set 6 Code Solutions (ZIP)
7	Seam carving, stock purchasing and knapsack	Problem Set 7 (PDF) Seam Carving for Content-Aware Image Resizing Problem Set 7 Code (ZIP) (Sunset image © source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/fairuse.)Problem Set 7 Answer Template (ZIP)Problem Set 7 Grading Explanation (PDF)	Problem Set 7 Solutions (PDF) Problem Set 7 Code Solutions (ZIP)

Will you join me? And let me know what you think!

For those less interested in (dynamic) programming but mostly in machine learning, there’s this other great MIT OpenCourseWare youtube playlist of their Artificial Intelligence course. I absolutely loved that course and I really powered through it in a matter of weeks (which is why I am already psyched about this new one). I learned so much new concepts, and I strongly recommend it. Unfortunately, the professor recently passed away.

Google Fonts: 915 free font families

Looking for a custom typeface to use in your data visualizations? Google Fonts is an awesome databank of nearly a thousands font families you can access, download, and use for free.

If you’re into design, the website includes a blog featuring articles on font design.

Google Fonts among others provided the font for my dissertation cover so I definitely recommend it.