Category: news

Simulating Corona Virus Outbreaks – with and without social distancing

Simulating Corona Virus Outbreaks – with and without social distancing

I don’t want to participate in the general debate on COVID19 as there are enough, much more knowledgeable experts doing so already.

However, I did want to share something that sparked my interest: this great article by the Washington Post where they show the importance of social distancing in case of viral outbreaks with four simple simulations:

  1. Regular viral outbreak
  2. Viral outbreak with forced (temporary) quarantaine
  3. Viral outbreak with moderate social distancing
  4. Viral outbreak with extensive social distancing

While these are obviously much oversimplified models of reality, the results convey a powerful and very visual message showing the importance of our social behavior in such a crisis.

1. Simulation of regular viral outbreak
2. Simulation with temporary quarantaine opening up.

As these simulations are randomized, you will get your own personalized results when you read the article! Try it out yourself: washingtonpost.com/graphics/2020/world/corona-simulator/?tid=

A comparison of my results:

Caselaw Access Project: Structured data of over 6 million U.S. court decisions

Caselaw Access Project: Structured data of over 6 million U.S. court decisions

Case.law seems like a very interesting data source for a machine learning or text mining project:

The Caselaw Access Project (“CAP”) expands public access to U.S. law. Our goal is to make all published U.S. court decisions freely available to the public online, in a consistent format, digitized from the collection of the Harvard Law Library.

The capstone of the Caselaw Access Project is a robust set of tools which facilitate access to the cases and their associated metadata. We currently offer five ways to access the data: APIbulk downloadssearchbrowse, and a historical trends viewer.

https://case.law/about/

Our open-source API is the best option for anybody interested in programmatically accessing our metadata, full-text search, or individual cases.

If you need a large collection of cases, you will probably be best served by our bulk data downloads. Bulk downloads for Illinois and Arkansas are available without a login, and unlimited bulk files are available to research scholars.

https://case.law/about/

Case metadata, such as the case name, citation, court, date, etc., is freely and openly accessible without limitation. Full case text can be freely viewed or downloaded but you must register for an account to do so, and currently you may view or download no more than 500 cases per day. In addition, research scholars can qualify for bulk data access by agreeing to certain use and redistribution restrictions. You can request a bulk access agreement by creating an account and then visiting your account page.

Access limitations on full text and bulk data are a component of Harvard’s collaboration agreement with Ravel Law, Inc. (now part of Lexis-Nexis). These limitations will end, at the latest, in March of 2024. In addition, these limitations apply only to cases from jurisdictions that continue to publish their official case law in print form. Once a jurisdiction transitions from print-first publishing to digital-first publishing, these limitations cease. Thus far, Illinois and Arkansas have made this important and positive shift and, as a result, all historical cases from these jurisdictions are freely available to the public without restriction. We hope many other jurisdictions will follow their example soon.

https://case.law/about/

A different project altogether is helping the team behind Caselaw improve its data quality:

Our data inevitably includes countless errors as part of the digitization process. The public launch of this project is only the start of discovering errors, and we hope you will help us in finding and fixing them.

Some parts of our data are higher quality than others. Case metadata, such as the party names, docket number, citation, and date, has received human review. Case text and general head matter has been generated by machine OCR and has not received human review.

You can report errors of all kinds at our Github issue tracker, where you can also see currently known issues. We particularly welcome metadata corrections, feature requests, and suggestions for large-scale algorithmic changes. We are not currently able to process individual OCR corrections, but welcome general suggestions on the OCR correction process.

https://case.law/about/
Northstar: The interactive, drag-and-drop data science platform by MIT

Northstar: The interactive, drag-and-drop data science platform by MIT

MIT researchers have spent years developing the new drag-and-drop analytics tools they call Northstar.

Northstar is an interactive data science platform that rethinks how people interact with data. It empowers users without programming experience, background in statistics or machine learning expertise to explore and mine data through an intuitive user interface, and effortlessly build, analyze, and evaluate machine learning (ML) pipelines.

northstar.mit.edu/

Northstar starts as a blank, white interface. Users upload datasets into the system, which appear in a “datasets” box on the left. Any data labels will automatically populate a separate “attributes” box below. There’s also an “operators” box that contains various algorithms, as well as the new AutoML tool. All data are stored and analyzed in the cloud.

news.mit.edu/2019/drag-drop-data-analytics-0627

You can read more about the tool’s functionalities in this MIT news article, which includes several promising GIFs:

Moreover, on the Northstar website you can find this longer video explaining the tool in detail.

https://vimeo.com/342787403

While Northstar looks insanely cool and promising, I do worry about putting such power in the hands of people who may not have much experience with statistics and/or machine learning. We all know how easily errors and bias may slip into data-driven processes, so I am curious to see how these next-gen kind of tools will be deployed and used.

Zeit’s interactive visualization of the 2019 European election results

Zeit’s interactive visualization of the 2019 European election results

Zeit — the German newspaper — analyzed recent election results in over 80,000 regions of Europe. They discovered many patterns – from the radical left to the extremist right. Moreover, they allow you to find patterns yourself, among others in your own region.

They published the summarized election results in this beautiful interactive map of Europe.

The map is beautifully color-coded for the dominant political view (Conservative, Green, Liberal, Socialist, Far left, or Far right) per region. Moreover, you can select these views and look for regions where they received respectively many votes. Like in the below, where I opted for the Liberal view, which finds strongest support in regions of the Netherlands, France, Czechia, Romania, Denmark, Estonia, and Finland.

For instance, the region of Tilburg in the Netherlands — where I live — voted mostly Liberal, as depicted by the yellow Netherlands. In contrast, in the German border regions conservative and socialist parties received most votes, whereas in the Belgian border regions uncategorizable parties received most votes.

Zeit discovered some cool patterns themselves as well, as discussed in the original article. These include:

  • Right-Wing Populists in Poland
  • North-South divides in Italy and Spain
  • Considerable support for regional parties in Catalonia, Belgium, Scotland and Italy
  • Dominant Green and Liberal views in the Netherlands, France, and Germany

Have a look yourself, it’s a great example of open access data-driven journalism!

Learn from the Pros: How media companies visualize data

Learn from the Pros: How media companies visualize data

Past months, multiple companies shared their approaches to data visualization and their lessons learned.

Click the companies in the list below to jump to their respective section


Financial Times

The Financial Times (FT) released a searchable database of the many data visualizations they produced over the years. Some lovely examples include:

Graphic showing what May needs to happen to get her deal over the line when MPs vote on Friday
Data visualization belonging to a recent Brexit piece by the FT, viahttps://www.ft.com/graphics
Dutch housing graphic
Searching the FT database for European House Prices via https://www.ft.com/graphics returns this map of the Netherlands.

BBC

The BBC released a free cookbook for data visualization using R programming. Here is the associated Medium post announcing the book.

The BBC data team developed an R package (bbplot) which makes the process of creating publication-ready graphics in their in-house style using R’s ggplot2 library a more reproducible process, as well as making it easier for people new to R to create graphics.

Apart from sharing several best practices related to data visualization, they walk you through the steps and R code to create graphs such as the below:

One of the graphs the BBC cookbook will help you create, via https://bbc.github.io/rcookbook/

Economist

The data team at the Economist also felt a need to share their lessons learned via Medium. They show some of their most misleading, confusing, and failing graphics of the past years, and share the following mistakes and their remedies:

  • Truncating the scale (image #1 below)
  • Forcing a relationship by cherry-picking scales
  • Choosing the wrong visualisation method (image #2 below)
  • Taking the “mind-stretch” a little too far (image #3 below)
  • Confusing use of colour (image #4 below)
  • Including too much detail
  • Lots of data, not enough space

Moreover, they share the data behind these failing and repaired data visualizations:

Via https://medium.economist.com/mistakes-weve-drawn-a-few-8cdd8a42d368
Via https://medium.economist.com/mistakes-weve-drawn-a-few-8cdd8a42d368
Via https://medium.economist.com/mistakes-weve-drawn-a-few-8cdd8a42d368
Via https://medium.economist.com/mistakes-weve-drawn-a-few-8cdd8a42d368

FiveThirtyEight

I could not resist including this (older) overview of the 52 best charts FiveThirtyEight claimed they made.

All 538’s data visualizations are just stunningly beautiful and often very
ingenious, using new chart formats to display complex patterns. Moreover, the range of topics they cover is huge. Anything ranging from their traditional background — politics — to great cover stories on sumo wrestling and pricy wine.

Viahttps://fivethirtyeight.com/features/the-52-best-and-weirdest-charts-we-made-in-2016/
Via https://fivethirtyeight.com/features/the-52-best-and-weirdest-charts-we-made-in-2016/ You should definitely check out the original cover story via https://projects.fivethirtyeight.com/sumo/
Via https://fivethirtyeight.com/features/the-52-best-and-weirdest-charts-we-made-in-2016/

Analysis of Media Coverage on Refugees

Analysis of Media Coverage on Refugees

Hannah Yan Han is doing #100dayprojects on data science and visual storytelling and I can only recommend that you take a look yourself. Below you find her R text analysis (#41) of UNHCR speeches and TV coverage on refugees.

Unsurprisingly, nouns like asylum, repatriation, displacement, persecution, plight, and crisis appear significantly more often in UNHCR speeches on refugees than in general English texts. The first visualization below shows the action-oriented verbs most commonly used in combination with these nouns.

This second visualization shows the most occurring verb-noun pairs.

Hannah used newsflash to retrieve the GDELT data on US TV news. Some channels seem to cover refugees more than others. I would have loved to see which topics occurred on each channel, but unfortunately she did not report on this.