Category: best practices

The Mental Game of Python, by Raymond Hettinger

YouTube recommended I watch this recorded presentation by Raymond Hettinger at PyBay 2019 last October. It’s quite a long presentation compared to what I’d normally watch, but what eye-openers it contains!

Raymond Hettinger is a Python core developer, and in this hour-long video he presents 10 programming strategies, all with live examples. Some are quite obvious, but the presentation and examples make them very clear. Raymond presents some serious programming truths, and I think they’ll stick.

First, Raymond discusses chunking and aliasing. He brings up the theory that the human mind can only handle/remember 7 pieces of information at a time, give or take 2. Anything above that creates too much cognitive load, causing discomfort as well as errors. Hence, in a programming context, we need to make sure programmers can spend all 7 on improving the code, rather than on deciphering what’s in front of them. We do so by modularizing and standardizing through functions, modules, and packages. Raymond uses the Python random module to highlight the importance of chunking and modular code. This part was quite long, but still interesting.
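
To illustrate the point (my own minimal sketch, not Raymond’s live demo): each well-named call into the random module occupies a single “chunk” of working memory, whereas hand-rolled arithmetic has to be deciphered first.

```python
import random

# Hand-rolled: the reader must decode arithmetic to recognize "a die roll".
roll = int(random.random() * 6) + 1

# Chunked: every named call is one concept, one slot of working memory.
roll = random.randint(1, 6)                 # roll a die
card = random.choice(["A", "K", "Q", "J"])  # pick one element
hand = random.sample(range(52), 5)          # deal 5 distinct cards
```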

For the next two strategies, Raymond quotes the Feynman method of solving problems: “(1) write down a clear problem specification; (2) think very, very hard; (3) write down a solution”. Using the example of a tree walker, Raymond shows how the strategies of incremental development and solving simpler problems can help you build programs that solve complex problems. This part only lasts a couple of minutes, but it really underlines the immense value of these strategies.
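
Raymond’s live-coded walker isn’t reproduced here, but a minimal sketch of the incremental approach might look like this: first solve the simpler, single-directory problem, then extend it.

```python
import os

# Step 1: solve the simpler problem first: list a single directory.
def list_dir(path):
    return os.listdir(path)

# Step 2: extend incrementally: recurse into subdirectories.
def walk(path):
    for name in list_dir(path):
        full = os.path.join(path, name)
        if os.path.isdir(full):
            yield from walk(full)  # the complex problem reuses the simple one
        else:
            yield full
```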

Next, Raymond touches on the DRY principle: Don’t Repeat Yourself. But he does so in a context I hadn’t seen it in before: object-oriented programming (OOP), classes, and inheritance.
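
The talk’s own examples aren’t reproduced here, but as a minimal sketch of DRY via inheritance: the shared logic lives once in a base class, and subclasses state only their differences.

```python
class Report:
    """Shared pipeline written once (DRY)."""
    def render(self, rows):
        return self.header() + "\n" + "\n".join(self.fmt(r) for r in rows)

    def header(self):
        return "report"

    def fmt(self, row):
        return str(row)

class CsvReport(Report):
    """A subclass repeats only what actually differs."""
    def fmt(self, row):
        return ",".join(map(str, row))

print(CsvReport().render([(1, 2), (3, 4)]))
```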

Raymond continues to build his arsenal of programming strategies in the next 10 minutes, where he argues that programmers should repeat tasks manually until patterns emerge, before they start moving code into functions. Even though I might not fully agree with him here, he does have some fun examples of file conversion that speak in his favor.
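
His point, roughly (a toy illustration of my own, not Raymond’s actual files): convert a couple of files by hand in the REPL first, and only capture the pattern in a function once it has emerged.

```python
import csv
import json

# After converting two or three files by hand in the REPL,
# the repeated steps crystallize into one function:
def csv_to_json(src, dst):
    with open(src, newline="") as f:
        rows = list(csv.DictReader(f))
    with open(dst, "w") as f:
        json.dump(rows, f, indent=2)
```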

Lastly, Raymond uses the graph below to make the case that OOP is a graph traversal problem. According to Raymond, the Python ecosystem is so rich that there is often no need to write new classes. You can simply consult the graph: find the island you are currently on, check which island you need to get to, and use the methods that are already available, or write a few new ones.
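
The idea, in a minimal sketch of my own: the “islands” are existing types, and the bridges are the methods that already connect them.

```python
# The "islands" are built-in types; their methods are the bridges.
text = "b,2\na,1"                                       # start on the str island
rows = [line.split(",") for line in text.splitlines()]  # str -> list of lists
rows.sort()                                             # travel within the list island
table = dict(rows)                                      # list -> dict
text = "\n".join(f"{k},{v}" for k, v in table.items())  # dict -> back to str
```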

While there were several more strategies that Raymond wanted to discuss, he doesn’t make it to the end of his list, as he spent too much time on the first, chunking part. Super curious about the rest? Contact Raymond on Twitter.

An Introduction to Docker for R Users, by Colin Fay

In this awesome 8-minute read, R prodigy Colin Fay explains in layman’s terms what Docker images, Docker containers, and volumes are; what Rocker is; and how to set up a Docker container with an R image and run code on it:

On your machine, you’re going to need two things: images, and containers. Images are the definition of the OS, while the containers are the actual running instances of the images. […] To compare with R, this is the same principle as installing vs loading a package: a package is to be downloaded once, while it has to be launched every time you need it. And a package can be launched in several R sessions at the same time easily.

Colin Fay, via https://colinfay.me/docker-r-reproducibility
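
Colin works with the Docker command line; purely as a cross-language aside (my own sketch, assuming the Docker SDK for Python, installed via pip install docker), the install-once/run-many distinction he describes looks like this:

```python
import docker  # assumed: the Docker SDK for Python (pip install docker)

client = docker.from_env()

# Pull the image once: the "install the package" step.
client.images.pull("rocker/r-base")

# Start a container from it: the "load the package" step,
# repeatable as many times and as concurrently as you like.
output = client.containers.run(
    "rocker/r-base",
    ["Rscript", "-e", "print(mean(1:10))"],
    remove=True,  # clean up the container when it exits
)
print(output.decode())
```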

In his blog, Colin also refers to some great additional resources on Rocker/Docker, as well as a reading list for those interested in learning more about Docker.

Overviews of Graph Classification and Network Clustering methods

Thanks to Sebastian Raschka, I am able to share this great GitHub overview page of relevant graph classification techniques, and the scientific papers behind them. The overview divides the algorithms into four groups:

  1. Factorization
  2. Spectral and Statistical Fingerprints
  3. Deep Learning
  4. Graph Kernels

Moreover, the overview contains links to similar collections of community detection, classification/regression tree, and gradient boosting papers with implementations.

It also links to relevant graph classification benchmark datasets.

17 Principles of (Unix) Software Design

I came across this e-book by Eric Raymond, written between 1999 and 2003, on the Art of Unix Programming. It contains several relevant overviews of the basic principles behind the Unix philosophy, which are probably useful for anybody working in hardware, software, or other algorithmic design.

First up, is a great list of 17 design rules, explained in more detail in the original article:

  1. Rule of Modularity: Write simple parts connected by clean interfaces.
  2. Rule of Clarity: Clarity is better than cleverness.
  3. Rule of Composition: Design programs to be connected to other programs.
  4. Rule of Separation: Separate policy from mechanism; separate interfaces from engines.
  5. Rule of Simplicity: Design for simplicity; add complexity only where you must.
  6. Rule of Parsimony: Write a big program only when it is clear by demonstration that nothing else will do.
  7. Rule of Transparency: Design for visibility to make inspection and debugging easier.
  8. Rule of Robustness: Robustness is the child of transparency and simplicity.
  9. Rule of Representation: Fold knowledge into data so program logic can be stupid and robust.
  10. Rule of Least Surprise: In interface design, always do the least surprising thing.
  11. Rule of Silence: When a program has nothing surprising to say, it should say nothing.
  12. Rule of Repair: When you must fail, fail noisily and as soon as possible.
  13. Rule of Economy: Programmer time is expensive; conserve it in preference to machine time.
  14. Rule of Generation: Avoid hand-hacking; write programs to write programs when you can.
  15. Rule of Optimization: Prototype before polishing. Get it working before you optimize it.
  16. Rule of Diversity: Distrust all claims for “one true way”.
  17. Rule of Extensibility: Design for the future, because it will be here sooner than you think.
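
Two of these rules are easy to make concrete. Here is a tiny sketch of my own (not from the book) of the Rules of Silence and Repair in Python:

```python
import sys

def process(path):
    # Placeholder task: just check that the file is readable.
    with open(path) as f:
        f.read()

def main():
    try:
        process(sys.argv[1])
        # Rule of Silence: nothing surprising happened, so say nothing.
    except (IndexError, OSError) as err:
        # Rule of Repair: when you must fail, fail noisily and early.
        sys.exit(f"error: {err}")

if __name__ == "__main__":
    main()
```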

Moreover, the book contains a shortlist of some of the philosophical principles behind Unix (and software design in general): 

  • Everything that can be a source- and destination-independent filter should be one.
  • Data streams should if at all possible be textual (so they can be viewed and filtered with standard tools).
  • Database layouts and application protocols should if at all possible be textual (human-readable and human-editable).
  • Complex front ends (user interfaces) should be cleanly separated from complex back ends.
  • Whenever possible, prototype in an interpreted language before coding C.
  • Mixing languages is better than writing everything in one, if and only if using only that one is likely to overcomplicate the program.
  • Be generous in what you accept, rigorous in what you emit.
  • When filtering, never throw away information you don’t need to.
  • Small is beautiful. Write programs that do as little as is consistent with getting the job done.
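
The filter principle and the “generous in, rigorous out” principle combine naturally. A minimal sketch of a Unix-style text filter (again my own illustration, in Python):

```python
import sys

# A classic Unix-style filter: text in on stdin, text out on stdout.
for line in sys.stdin:
    # Be generous in what you accept: tolerate case and stray whitespace.
    word = line.strip().lower()
    if word:
        # Be rigorous in what you emit: exactly one clean token per line.
        print(word)
```

Run it in a pipeline, e.g. `cat words.txt | python normalize.py | sort | uniq`, and it composes with every other tool that speaks plain text.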

If you want to read the full book, or if you just want to support the original author, you can buy the book here:

Let me know which of these and other rules and principles you apply in your daily programming/design job.

Data Visualization Style Guide Repositories

Amy Cesal put together (1) this great overview of style guides for data visualization practice. Moreover, in the original tweet, Amy refers to other great repositories, such as (2) this PolicyViz one and (3) this humongous one by Adele.

Amy’s list includes many references to the best practices used by some of the leading data journalism organizations, such as the BBC, and by professional data companies like Salesforce and IBM.

As I’m worried that this great repository may not stand the test of time at its current Google Docs location, here are the base URLs once more:

  • Sunlight Foundation: https://sunlightfoundation.com/2014/03/12/datavizguide
  • Consumer Financial Protection Bureau: https://cfpb.github.io/design-manual/data-visualization/data-visualization.html
  • Dallas Morning News: https://knightcenter.utexas.edu/mooc/file/tdmn_graphics.pdf
  • The Urban Institute: https://urbaninstitute.github.io/graphics-styleguide/
  • MinnPost: http://code.minnpost.com/minnpost-styles/
  • BBC Audiences: https://public.tableau.com/profile/bbc.audiences#!/vizhome/BBCAudiencesTableauStyleGuide/Hello
  • IBM: https://www.ibm.com/design/v1/language/experience/data-visualization/
  • Office for National Statistics: https://style.ons.gov.uk/category/data-visualisation/
  • International Business Communication Standards (IBCS®): https://www.ibcs.com/standards
  • London City Intelligence: https://data.london.gov.uk/blog/city-intelligence-data-design-guidelines/
  • BBC: https://www.bbc.co.uk/gel/guidelines/how-to-design-infographics
  • Shopify: https://polaris.shopify.com/design/data-visualizations
  • Opower: https://ux.opower.com/opattern/how-to-charts.html
  • Consults-IoT.Com LLP: https://www.consults-iot.com
  • MailChimp: https://ux.mailchimp.com/patterns/data
  • Google (Material Design): https://material.io/design/communication/data-visualization.html
  • Salesforce: https://lightningdesignsystem.com/guidelines/charts/
  • Cato Institute: https://github.com/glosophy/CatoDataVizGuidelines/blob/master/PocketStyleBook.pdf
  • BBC: https://bbc.github.io/rcookbook/
  • Microsoft: https://docs.microsoft.com/en-us/office/dev/add-ins/design/data-visualization-guidelines
  • ACI: https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception

If you have any resources or style guides to contribute to Amy’s list, you can do so via this link.

Causal Random Forests, by Mark White

I stumbled across this incredibly interesting read by Mark White, who discusses the (academic) theory behind, the inner workings of, and example (R) applications of causal random forests:

Explicitly Optimizing on Causal Effects via the Causal Random Forest: A Practical Introduction and Tutorial (by Mark White)

These so-called “honest” forests seem like a great technique for identifying opportunities for personalized action: think of marketing, HR, medicine, healthcare, and other personalized recommendations. Note that an experimental setup for data collection is still necessary to gather the right data for these techniques.
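
Mark’s tutorial itself works in R with the grf package; purely as a hedged Python-flavoured illustration (assuming the econml library, whose CausalForestDML implements a causal forest), estimating heterogeneous treatment effects on simulated experimental data might look like:

```python
import numpy as np
from econml.dml import CausalForestDML  # assumed: pip install econml

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))           # individual covariates
T = rng.binomial(1, 0.5, size=n)      # randomized treatment assignment
# Simulated outcome with a heterogeneous effect driven by X[:, 0]:
Y = X[:, 0] * T + X[:, 1] + rng.normal(size=n)

est = CausalForestDML(discrete_treatment=True, random_state=0)
est.fit(Y, T, X=X)

# Estimated individual treatment effects: who benefits most from the action?
cate = est.effect(X)
print(cate[:5])
```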

https://www.markhw.com/blog/causalforestintro