Category: datasets

Wordle with NLP for Data Scientists

I have played my fair share of Wordle.

I’m not necessarily good at it, but most days I get to solve the puzzle.

The experience is completely different with Semantle — a Wordle-inspired puzzle in which you also need to guess the word of the day.

Unlike in Wordle, Semantle gives you unlimited guesses though. And, boy, you will need many!

Like Wordle, Semantle gives you hints as to how close your guesses were to the secret word of the day.

However, where Wordle shows you how good your guesses were in terms of the letters used, Semantle evaluates the semantic similarity of your guesses to the secret word. For the 1000 most similar words to the secret word, it will show you its closeness like in the picture above.

This semantic similarity comes from the domain of Natural Language Processing — NLP — and this basically reflects how often words are used in similar contexts in natural language.

For instance, the words “love” and “hate” may seem like opposites, but they will often score similarly in grammatical sentences. According to the semantle FAQ the actual opposite of “love” is probably something like “Arizona Diamondbacks”, or “carburetor”.

Another example is last day’s solution (15 March 2022), when the secret word was circle. The ten closest words you could have guessed include circles and semicircle, but more distinctive words such as corner and clockwise.

Further downfield you could have guessed relatively close words like saucer, dot, parabola, but I would not have expected words like outwaited, weaved, and zipped.

The creator of Semantle scored the semantic similarity for almost all words used in the English language, by training a so-called word2vec model based on a very large dataset of news articles (GoogleNews-vectors-negative300.bin from late 2021).

Now, every day, one word is randomly selected as the secret word, and you can try to guess which one it is. I usually give up after 300 to 400 guesses, but my record was 76 guesses for uncovering the secret word world.

Try it out yourself: https://semantle.novalis.org/

And do share your epic wins and fails!

Practice your Data Science skills on real-life data

Maven Analytics now provides open access to their datasets through what they call their “DATA PLAYGROUND“!

They offer 21 datasets including a range of different data (time series, geospatial, user preferences) on a variety of topics like business, sports, wine, financial stocks, transportation and whatnot.

This is a great starting point if you want to practice your data science, machine learning and analysis skills on real life data!

Maven Analytics provides e-learnings in analysis and programming software. To provide a practical learning experience, their courses are often accompanied by real-life datasets for students to analyze.

How a File Format Exposed a Crossword Scandal

Vincent Warmerdam shared this Youtube video which I thoroughly enjoyed watched. It’s about Saul Pwanson, a software engineer whose hobby project got a little out of hand.

In 2016, Saul Pwanson designed a plain-text file format for crossword puzzle data, and then spent a couple of months building a micro-data-pipeline, scraping tens of thousands of crosswords from various sources.

After putting all these crosswords in a simple uniform format, Saul used some simple command line commands to check for common patterns and irregularities.

Surprisingly enough, after visualizing the results, Saul discovered egregious plagiarism by a major crossword editor that had gone on for years.

Ultimately, 538 even covered the scandal:

I thoroughly enjoyed watching this talk on Youtube.

Saul covers the file format, data pipeline, and the design choices that aided rapid exploration; the evidence for the scandal, from the initial anomalies to the final damning visualization; and what it’s like for a data project to get 15 minutes of fame.

I tried to localize the dataset online, but it seems Saul’s website has since gone offline. If you do happen to find it, please do share it in the comments!

Google’s Dataset Search: Direct access to 25 million interesting datasets

I used to keep a repository of links to interesting datasets to learn data science. However, that page I can retire, as Google has launched its new service Dataset Search.

The “world wide web” hosts millions of datasets, on nearly any topic you can think of. Google’s Dataset Search has indexed almost 25 million of these datasets, giving you a single entry point to search for datasets online. After a year of testing, Dataset Search is now officially out of beta.

After alpha testing, Dataset Search now includes filter based on the types of dataset that you want (e.g., tables, images, text), on whether the dataset is open source/access. For dataset on geographic area’s, you can see the map. The quality of dataset’s descriptions has improved greatly, and the tool now has a mobile version.

Anyone who publishes data can make their datasets discoverable in Dataset Search by describe the properties of their dataset using a special schema on their own web page.

Caselaw Access Project: Structured data of over 6 million U.S. court decisions

Case.law seems like a very interesting data source for a machine learning or text mining project:

The Caselaw Access Project (“CAP”) expands public access to U.S. law. Our goal is to make all published U.S. court decisions freely available to the public online, in a consistent format, digitized from the collection of the Harvard Law Library.
The capstone of the Caselaw Access Project is a robust set of tools which facilitate access to the cases and their associated metadata. We currently offer five ways to access the data: API, bulk downloads, search, browse, and a historical trends viewer.
https://case.law/about/

Our open-source API is the best option for anybody interested in programmatically accessing our metadata, full-text search, or individual cases.
If you need a large collection of cases, you will probably be best served by our bulk data downloads. Bulk downloads for Illinois and Arkansas are available without a login, and unlimited bulk files are available to research scholars.
https://case.law/about/

Case metadata, such as the case name, citation, court, date, etc., is freely and openly accessible without limitation. Full case text can be freely viewed or downloaded but you must register for an account to do so, and currently you may view or download no more than 500 cases per day. In addition, research scholars can qualify for bulk data access by agreeing to certain use and redistribution restrictions. You can request a bulk access agreement by creating an account and then visiting your account page.
Access limitations on full text and bulk data are a component of Harvard’s collaboration agreement with Ravel Law, Inc. (now part of Lexis-Nexis). These limitations will end, at the latest, in March of 2024. In addition, these limitations apply only to cases from jurisdictions that continue to publish their official case law in print form. Once a jurisdiction transitions from print-first publishing to digital-first publishing, these limitations cease. Thus far, Illinois and Arkansas have made this important and positive shift and, as a result, all historical cases from these jurisdictions are freely available to the public without restriction. We hope many other jurisdictions will follow their example soon.
https://case.law/about/

A different project altogether is helping the team behind Caselaw improve its data quality:

Our data inevitably includes countless errors as part of the digitization process. The public launch of this project is only the start of discovering errors, and we hope you will help us in finding and fixing them.
Some parts of our data are higher quality than others. Case metadata, such as the party names, docket number, citation, and date, has received human review. Case text and general head matter has been generated by machine OCR and has not received human review.
You can report errors of all kinds at our Github issue tracker, where you can also see currently known issues. We particularly welcome metadata corrections, feature requests, and suggestions for large-scale algorithmic changes. We are not currently able to process individual OCR corrections, but welcome general suggestions on the OCR correction process.
https://case.law/about/

Datasets to practice and learn Programming, Machine Learning, and Data Science

Many requests have come in regarding “training datasets” – to practice programming. Fortunately, the internet is full of open-source datasets! I compiled a selected list of datasets and repositories below. If you have any additions, please comment or contact me! For information on programming languages or algorithms, visit the overviews for R, Python, SQL, or Data Science, Machine Learning, & Statistics resources.

This list is no longer being maintained. There are other, more frequently updated repositories of useful datasets included in bold below: