recommendations – paulvanderlaken.com

PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

April 2018, a PyData conference was held in London, with three days of super interesting sessions and hackathons. While I couldn’t attend in person, I very much enjoy reviewing the sessions at home as all are shared open access on YouTube channel PyDataTV!

In the following section, I will outline some of my favorites as I progress through the channel:

Winning with simple, even linear, models:

One talk that really resonated with me is Vincent Warmerdam‘s talk on “Winning with Simple, even Linear, Models“. Working at GoDataDriven, a data science consultancy firm in the Netherlands, Vincent is quite familiar with deploying deep learning models, but is also midly annoyed by all the hype surrounding deep learning and neural networks. Particularly when less complex models perform equally well or only slightly less. One of his quote’s nicely sums it up:

“Tensorflow is a cool tool, but it’s even cooler when you don’t need it!”

— Vincent Warmerdam, PyData 2018

In only 40 minutes, Vincent goes to show the finesse of much simpler (linear) models in all different kinds of production settings. Among others, Vincent shows:

how to solve the XOR problem with linear models
how to win at timeseries with radial basis features
how to use weighted regression to deal with historical overfitting
how deep learning models introduce a new theme of horror in production
how to create streaming models using passive aggressive updating
how to build a real-time video game ranking system using mere histograms
how to create a well performing recommender with two SQL tables
how to rock at data science and machine learning using Python, R, and even Stan

Introduction

Market Basket Analysis or association rules mining can be a very useful technique to gain insights in transactional data sets, and it can be useful for product recommendation. The classical example is data in a supermarket. For each customer we know what the individual products (items) are that he has bought. With association rules mining we can identify items that are frequently bought together. Other use cases for MBA could be web click data, log files, and even questionnaires.

In R there is a package arules to calculate association rules, it makes use of the so-called Apriori algorithm. For data sets that are not too big, calculating rules with arules in R (on a laptop) is not a problem. But when you have very huge data sets, you need to do something else, you can:

use more computing power (or cluster of computing nodes).

use another algorithm, for example…