In 2016, Saul Pwanson designed a plain-text file format for crossword puzzle data, and then spent a couple of months building a micro-data-pipeline, scraping tens of thousands of crosswords from various sources.
After putting all these crosswords into a simple, uniform format, Saul used a few command-line tools to check for common patterns and irregularities.
Surprisingly enough, after visualizing the results, Saul discovered years of egregious plagiarism by a major crossword editor.
I thoroughly enjoyed watching this talk on YouTube.
Saul covers the file format, data pipeline, and the design choices that aided rapid exploration; the evidence for the scandal, from the initial anomalies to the final damning visualization; and what it’s like for a data project to get 15 minutes of fame.
I tried to locate the dataset online, but it seems Saul’s website has since gone offline. If you do happen to find it, please share it in the comments!
In this tutorial, you will be introduced to the command line. We have selected a set of commands we think will be useful in general to a wide range of audience. […] after completing this tutorial, readers should be able to use the shell for version control, managing cloud services (like deploying your own shiny server etc.), execute commands in R & RMarkdown and execute R scripts in the shell.
If you want a deeper understanding of using the command line for data science, the original authors suggest you read Data Science at the Command Line. Software Carpentry also has a lesson on the shell. More references are listed at the end of the original tutorial. Use the clickable table of contents to jump quickly to the topic that interests you: