
Repository of Production Machine Learning

The Institute for Ethical Machine Learning compiled this curated list of open-source libraries that help you deploy, monitor, version, scale, and secure production machine learning.

Direct links to the sections of the GitHub repo:

  • 🔍 Explaining predictions & models
  • 🔏 Privacy preserving ML
  • 📜 Model & data versioning
  • 🏁 Model Training Orchestration
  • 💪 Model Serving and Monitoring
  • 🤖 Neural Architecture Search
  • 📓 Reproducible Notebooks
  • 📊 Visualisation frameworks
  • 🔠 Industry-strength NLP
  • 🧵 Data pipelines & ETL
  • 🏷️ Data Labelling
  • 🗞️ Data storage
  • 📡 Functions as a service
  • 🗺️ Computation distribution
  • 📥 Model serialisation
  • 🧮 Optimized calculation frameworks
  • 💸 Data Stream Processing
  • 🔴 Outlier and Anomaly Detection
  • 🌀 Feature engineering
  • 🎁 Feature Stores
  • ⚔ Adversarial Robustness
  • 💰 Commercial Platforms

The Institute for Ethical Machine Learning is a think-tank that brings together technology leaders, policymakers & academics to develop standards for ML.

How to Write a Git Commit Message, in 7 Steps

Version control is an essential tool for any software developer. Hence, any respectable data scientist should make sure their analysis programs and machine learning pipelines are reproducible and maintainable through version control.

Often, we use git for version control. If you don’t know what git is yet, I advise you to begin here. If you work in R, start here and here. If you work in Python, start here.

This blog is intended for those already familiar with git who want to learn how to write better, more informative commit messages. It is essentially a summary of this original blog by Chris Beams, which I thought deserved a wider audience.

Chris’ 7 rules of great Git commit messaging

  1. Separate subject from body with a blank line
  2. Limit the subject line to 50 characters
  3. Capitalize the subject line
  4. Do not end the subject line with a period
  5. Use the imperative mood in the subject line
  6. Wrap the body at 72 characters
  7. Use the body to explain what and why vs. how

For example:

Summarize changes in around 50 characters or less

More detailed explanatory text, if necessary. Wrap it to about 72
characters or so. In some contexts, the first line is treated as the
subject of the commit and the rest of the text as the body. The
blank line separating the summary from the body is critical (unless
you omit the body entirely); various tools like `log`, `shortlog`
and `rebase` can get confused if you run the two together.

Explain the problem that this commit is solving. Focus on why you
are making this change as opposed to how (the code explains that).
Are there side effects or other unintuitive consequences of this
change? Here's the place to explain them.

Further paragraphs come after blank lines.

 - Bullet points are okay, too

 - Typically a hyphen or asterisk is used for the bullet, preceded
   by a single space, with blank lines in between, but conventions
   vary here

If you use an issue tracker, put references to them at the bottom,
like this:

Resolves: #123
See also: #456, #789

If you’re having a hard time summarizing your commits in a single line or message, you might be committing too many changes at once. Instead, you should try to aim for what’s called atomic commits.

Cover image by xkcd #1296

How Do I…? R Code Snippets by Sharon Machlis

Sharon Machlis is the author of Practical R for Mass Communication and Journalism. While writing this book, she naturally wrote a lot of R code, and she has been kind enough to share all 195 tips and tricks she collected along the way in this handy table.

Sharon’s list contains many neat tricks, some of which are less well-known base functions, others features of more niche packages. Here are the ones I am definitely adding to my R tricks overview and want to highlight here as well (a combined sketch follows the list):

  • Categorize values into intervals with cut()
  • Convert numbers that came in as strings with commas to R numbers with readr::parse_number(mydf$mycol)
  • Create a searchable, sortable HTML table in 1 line of code with DT::datatable(mydf, filter = 'top')
  • Display a fraction between 0 and 1 as a percentage with scales::percent(myfraction)
  • Generate a vector of 1:length(myvec) with seq_along(myvec)
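
To see a few of these in action together, below is a small self-contained sketch; the data frame and column names are invented here purely for illustration:

```r
library(readr)   # parse_number()
library(scales)  # percent()
library(DT)      # datatable()

# Toy data frame (hypothetical names) with comma-formatted number strings
mydf <- data.frame(
  revenue_raw = c("1,200", "850", "12,400"),
  share       = c(0.12, 0.085, 0.803)
)

# Convert "1,200"-style strings to proper R numbers
mydf$revenue <- parse_number(mydf$revenue_raw)

# Categorize values into intervals
mydf$size <- cut(mydf$revenue,
                 breaks = c(0, 1000, 10000, Inf),
                 labels = c("small", "medium", "large"))

# Display fractions between 0 and 1 as percentages
mydf$share_pct <- percent(mydf$share)

# seq_along() is a safe alternative to 1:length(x):
# it returns integer(0) for empty vectors instead of c(1, 0)
for (i in seq_along(mydf$revenue)) print(mydf$revenue[i])

# Searchable, sortable HTML table in one line of code
datatable(mydf, filter = "top")
```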

And as if one list were not enough: scrolling through her Twitter feed, I found another list of R tips and tricks by Sharon:

A Categorical Spatial Interpolation Tutorial in R

Timo Grossenbacher works as reporter/coder for SRF Data, the data journalism unit of Swiss Radio and TV. He analyzes and visualizes data and investigates data-driven stories. On his website, he hosts a growing list of cool projects. One of his recent blogs covers categorical spatial interpolation in R. The end result of that blog looks amazing:

This map was built with data Timo crowdsourced for one of his projects. With this data, Timo took the following steps, which are covered in his tutorial:

  • Read in the data: first the geometries (Germany’s political boundaries), then the point data on which the interpolation will be based.
  • Preprocess the data (simplify geometries, convert CSV point data into an sf object, reproject the geodata into the ETRS CRS, clip the point data to Germany, so data outside of Germany is discarded).
  • Then, a regular grid (a raster without “data”) is created. Each grid point in this raster will later be interpolated from the point data.
  • Run the spatial interpolation with the kknn package. Since this is quite computationally and memory intensive, the resulting raster is split up into 20 batches, and each batch is computed by a single CPU core in parallel (a simplified single-core sketch follows this list).
  • Visualize the resulting raster with ggplot2.
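
To give a flavour of the interpolation step, here is a minimal, single-core sketch of k-nearest-neighbour classification over a regular grid with the kknn package. The synthetic points, variant names and grid size are invented here; Timo’s actual code additionally handles the sf geometries, reprojection, clipping and the 20-batch parallelization described above.

```r
library(kknn)     # weighted k-nearest-neighbour classification
library(ggplot2)

set.seed(1)

# Synthetic stand-in for the crowdsourced survey points: a location (x, y)
# plus the pronunciation variant each respondent chose (names invented)
points_df <- data.frame(x = runif(800, 0, 100), y = runif(800, 0, 100))
points_df$variant <- factor(ifelse(points_df$x + rnorm(800, sd = 15) > 50,
                                   "variant_a", "variant_b"))

# Regular grid (a raster without "data") whose cells we want to interpolate
grid <- expand.grid(x = seq(0, 100, length.out = 150),
                    y = seq(0, 100, length.out = 150))

# Each grid cell is assigned the distance-weighted majority category
# of its k nearest survey points
fit <- kknn(variant ~ x + y, train = points_df, test = grid,
            k = 100, kernel = "gaussian")
grid$variant <- fitted(fit)

# Visualize the interpolated surface
ggplot(grid, aes(x, y, fill = variant)) +
  geom_raster() +
  coord_equal()
```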

All code for the above process can be accessed on Timo’s GitHub. The georeferenced points underlying the interpolation look like the below, where each point represents the location of a person who selected a certain pronunciation in an online survey. More details on the crowdsourced pronunciation project can be found here.

Another of Timo’s R maps, before he applied k-nearest neighbors to these crowdsourced data. [original]
If you want to know more, please read the original blog or follow Timo’s new DataCamp course called Communicating with Data in the Tidyverse.