Author: Paul van der Laken

Text Mining: Pythonic Heavy Metal

This blog summarized work that has been posted here, here, and here.

Iain of degeneratestate.org wrote a three-piece series where he applied text mining to the lyrics of 222,623 songs from 7,364 heavy metal bands spread over 22,314 albums that he scraped from darklyrics.com. He applied a broad range of different analyses in Python, the code of which you can find here on Github.

For example, he starts part 1 by calculated the difficulty/complexity of the lyrics of each band using the Simple Measure of Gobbledygook or SMOG and contrasted this to the number of swearwords used, finding a nice correlation.

Ratio of swear words vs readability — Lyric complexity relates positive to swearwords used.

Furthermore, he ran some word importance analysis, looking at word frequencies, log-likelihood ratios, and TF-IDF scores. This allowed him to contrast the word usage of the different bands, finding, for instance, one heavy metal band that was characterized by the words “oh yeah baby got love“: fans might recognize either Motorhead, Machinehead, or Diamondhead.

Using cosine distance measures, Iain could compare the word vectors of the different bands, ultimately recognizing band similarity, and song representativeness for a band. This allowed interesting analysis, such as a clustering of the various bands:

However, all his analysis worked out nicely. While he also applied t-SNE to visualize band similarity in a two-dimensional space, the solution was uninformative due to low variance in the data.

He could predict the band behind a song by training a one-vs-rest logistic regression classifier based on the reduced lyric space of 150 dimensions after latent semantic analysis. Despite classifying a song to one of 120 different bands, the classifier had a precision and recall both around 0.3, with negligible hyper parameter tuning. He used the classification errors to examine which bands get confused with each other, and visualized this using two network graphs.

In part 2, Iain tried to create a heavy metal lyric generator (which you can now try out).

His first approach was to use probabilistic distributions known as language models. Basically he develops a Markov Chain, in his opinion more of a “unsmoothed maximum-likelihood language model“, which determines the next most probable word based on the previous word(s). This model is based on observed word chains, for instance, those in the first two lines to Iron Maiden’s Number of the Beast:

Another approach would be to train a neural network. Iain used Keras, which ran on an amazon GPU instance. He recognizes the power of neural nets, but says they also come at a cost:

“The maximum likelihood models we saw before took twenty minutes to code from scratch. Even using powerful libraries, it took me a while to understand NNs well enough to use. On top of this, training the models here took days of computer time, plus more of my human time tweeking hyper parameters to get the models to converge. I lack the temporal, financial and computational resources to fully explore the hyperparameter space of these models, so the results presented here should be considered suboptimal.” – Iain

He started out with feed forward networks on a character level. His best try consisted of two feed forward layers of 512 units, followed by a softmax output, with layer normalisation, dropout and tanh activations, which he trained for 20 epochs to minimise the mean cross-entropy. Although it quickly beat the maximum likelihood Markov model, its longer outputs did not look like genuine heavy metal songs.

So he turned to recurrent neural network (RNN). The RNN Iain used contains two LSTM layers of 512 units each, followed by a fully connected softmax layer. He unrolled the sequence for 32 characters and trained the model by predicting the next 32 characters, given their immediately preceding characters, while minimizing the mean cross-entropy:

“To generate text from the RNN model, we step character-by-character through a sequence. At each step, we feed the current symbol into the model, and the model returns a probability distribution over the next character. We then sample from this distribution to get the next character in the sequence and this character goes on to become the next input to the model. The first character fed into the model at the beginning of generation is always a special start-of-sequence character.” – Iain

This approach worked quite well, and you can compare and contrast it with the earlier models here. If you’d just like to generate some lyrics, the models are hosted online at deepmetal.io.

In part 3, Iain looks into emotional arcs, examining the happiness and metalness of words and lyrics.

When applied to the combined lyrics of albums, you could examine how bands developed their signature sound over time. For example, the lyrics of Metallica’s first few albums seem to be quite heavy metal and unhappy, before moving to a happier place. The Black album is almost sentiment-neutral, but after that they became ever more darker and more metal, moving back to the style to their first few albums. He applied the same analysis on the text of the Harry Potter books, of which especially the first and last appear especially metal.

The Dataviz Project: Find just the right visualization

Do you have a bunch of data but you can’t seem to figure out how to display it? Or looking for that one specific visualization of which you can’t remember the name?

www.datavizproject.com provides a most comprehensive overview of all the different ways to visualize your data. You can sort all options by Family, Input, Function, and Shape to find that one dataviz that best conveys your message.

Update: look at some of these other repositories here or here.

Generating 3D Faces from 2D Photographs

Aaron Jackson, Adrian Bulat, Vasileios Argyriou and Georgios Tzimiropoulos
of the Computer Vision Laboratory of the University of Nottingham built a neural network that generates a full 3D image of a single portrait photograph. They turn a photograph like this…

… into an accurately creepy 3D image like this.

You can try it with your own or other photographs here. I found that images with white background get the best results. On their project website you can read more about the underlying convolutional neural network.

Update 21-10-2017: One of my favorite YouTube channels explains how the models were trained and the data used:

Analysis of Media Coverage on Refugees

Hannah Yan Han is doing #100dayprojects on data science and visual storytelling and I can only recommend that you take a look yourself. Below you find her R text analysis (#41) of UNHCR speeches and TV coverage on refugees.

Unsurprisingly, nouns like asylum, repatriation, displacement, persecution, plight, and crisis appear significantly more often in UNHCR speeches on refugees than in general English texts. The first visualization below shows the action-oriented verbs most commonly used in combination with these nouns.

This second visualization shows the most occurring verb-noun pairs.

Hannah used newsflash to retrieve the GDELT data on US TV news. Some channels seem to cover refugees more than others. I would have loved to see which topics occurred on each channel, but unfortunately she did not report on this.

Visualizing #IRMA Tweets

Reddit user LucasCu90 used the R package twitteR to retrieve all tweets that were sent with #Irma and a Geocode of central Miami (25 mile radius) from Saturday September 9, to Sunday September 10, 2017 (the period of Irma’s approach and initial landfall on the Florida Keys and the mainland). From the 29,000 tweets he collected, Lucas then retrieved the 600 most common words and overlaid them on a map of Florida, with their size relative to their frequency in the data. The result is quite nice!

Coexisting Languages in Australia: An Interactive map

Jack Zhao from Small Multiples – a multidisciplinary team of data specialists, designers and developers – retrieved the Language Spoken at Home (LANP) data from the 2016 Census and turned it into a dot density map that vividly shows how people from different cultures coexist (or not) in ultra high resolution (using Python, englewood library, QGIS, Carto). Each colored dot in the visualizations below represents five people from the same language group in the area. Highly populated areas have a higher density of dots; while language diversity is shown through the number of different colors in the given area.

Good news: the maps are interactive! Here’s Sydney:

Here is the original webpage on Small Multiples and you can browse the interactive map in full screen in your browser. The below language groups are included:

Eastern Asian: Chinese, Japanese, Korean, Other Eastern Asian Languages
Southeast Asian: Burmese and Related Languages, Hmong-Mien, Mon-Khmer, Tai, Southeast Asian Austronesian Languages, Other Southeast Asian Languages
Southern Asian: Dravidian, Indo-Aryan, Other Southern Asian Languages
Southwest And Central Asian: Iranic, Middle Eastern Semitic Languages, Turkic, Other Southwest and Central Asian Languages
Northern European: Celtic, English, German and Related Languages, Dutch and Related Languages, Scandinavian, Finnish and Related Languages
Southern European: French, Greek, Iberian Romance, Italian, Maltese, Other Southern European Languages
Eastern European: Baltic, Hungarian, East Slavic, South Slavic, West Slavic, Other Eastern European Languages
Australian Indigenous: Arnhem Land and Daly River Region Languages, Yolngu Matha, Cape York Peninsula Languages, Torres Strait Island Languages, Northern Desert Fringe Area Languages, Arandic, Western Desert Languages, Kimberley Area Languages, Other Australian Indigenous Languages