Category: learning

Comparison between R dplyr and data.table code

Comparison between R dplyr and data.table code

Atrebas created this extremely helpful overview page showing how to program standard data manipulation and data transformation routines in R’s famous packages dplyr and data.table.

The document has been been inspired by this stackoverflow question and by the data.table cheat sheet published by Karlijn Willems.

Resources for data.table can be found on the data.table wiki, in the data.table vignettes, and in the package documentation. Reference documents for dplyr include the dplyr cheat sheet, the dplyr vignettes, and the package documentation.

Here’s a hyperlinked table of contents:

Learn Git Branching: An Interactive Tutorial

Learn Git Branching: An Interactive Tutorial

Peter Cottle built this great interactive Git tutorial that teaches you all vital branching skills right in your browser. It’s interactive, beautiful, and very informative, introducing every concept and Git command in a step-by-step fashion.

Have a look yourself: https://learngitbranching.js.org/

Here’s the associated GitHub repository for those interested in forking.

The tutorial includes many levels that progressively teach you the Git commands you’ll need to apply version control on a daily basis:

There’s also a sandbox mode where you can interactively explore and build your own Git tree.


LearnGitBranching is a git repository visualizer, sandbox, and a series of educational tutorials and challenges. Its primary purpose is to help developers understand git through the power of visualization (something that’s absent when working on the command line). This is achieved through a game with different levels to get acquainted with the different git commands.

You can input a variety of commands into LearnGitBranching (LGB) — as commands are processed, the nearby commit tree will dynamically update to reflect the effects of each command.

This visualization combined with tutorials and “levels” can help both beginners and intermediate developers polish their version control skills. A quick demo is available here: https://pcottle.github.com/learnGitBranching/?demo
Or, you can launch the application normally here: https://pcottle.github.com/learnGitBranching/

https://github.com/pcottle/learnGitBranching
Python for R users

Python for R users

Wanting to broaden your scope and learn a new programming language? This great workshop was given at EARL 2018 by Mango Solutions and helps R programmers transition into Python building on their existing R knowledge. The workshop includes exercises that introduce you to the key concepts of Python and some of its most powerful packages for data science, including numpy, pandas, sklearn, and seaborn.

Have a look at the associated workshop guide that walk you through the assignments, or at the github repo with all materials in Jupyter notebooks.

E-Book: Probabilistic Programming & Bayesian Methods for Hackers

E-Book: Probabilistic Programming & Bayesian Methods for Hackers

The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. Nevertheless, mathematical analysis is only one way to “think Bayes”. With cheap computing power, we can now afford to take an alternate route via probabilistic programming.

Cam Davidson-Pilon wrote the book Bayesian Methods for Hackers as a introduction to Bayesian inference from a computational and understanding-first, mathematics-second, point of view.

The book is available via Amazon, but you can access an online e-book for free. There’s also an associated GitHub repo.

The book explains Bayesian principles with code and visuals. For instance:

%matplotlib inline
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt
figsize(11, 9)

import scipy.stats as stats

dist = stats.beta
n_trials = [0, 1, 2, 3, 4, 5, 8, 15, 50, 500]
data = stats.bernoulli.rvs(0.5, size=n_trials[-1])
x = np.linspace(0, 1, 100)

for k, N in enumerate(n_trials):
    sx = plt.subplot(len(n_trials)/2, 2, k+1)
    plt.xlabel("$p$, probability of heads") \
        if k in [0, len(n_trials)-1] else None
    plt.setp(sx.get_yticklabels(), visible=False)
    heads = data[:N].sum()
    y = dist.pdf(x, 1 + heads, 1 + N - heads)
    plt.plot(x, y, label="observe %d tosses,\n %d heads" % (N, heads))
    plt.fill_between(x, 0, y, color="#348ABD", alpha=0.4)
    plt.vlines(0.5, 0, 4, color="k", linestyles="--", lw=1)

    leg = plt.legend()
    leg.get_frame().set_alpha(0.4)
    plt.autoscale(tight=True)


plt.suptitle("Bayesian updating of posterior probabilities",
             y=1.02,
             fontsize=14)

plt.tight_layout()

I can only recommend you start with the online version of Bayesian Methods for Hackers, but note that the print version helps sponsor the author ánd includes some additional features:

  • Additional Chapter on Bayesian A/B testing
  • Updated examples
  • Answers to the end of chapter questions
  • Additional explanation, and rewritten sections to aid the reader.

If you’re interested in learning more about Bayesian analysis, I recommend these other books:

Helpful resources for A/B testing

Helpful resources for A/B testing

Brandon Rohrer — (former) data scientist at Microsoft, iRobot, and Facebook — asked his network on Twitter and LinkedIn to share their favorite resources on A/B testing. It produced a nice list, which I summarized below.

The order is somewhat arbitrary, and somewhat based on my personal appreciation of the resources.

Cover image via Optimizely

5 Quick Tips for Coding in the Classroom, by Kelly Bodwin

5 Quick Tips for Coding in the Classroom, by Kelly Bodwin

Kelly Bodwin is an Assistant Professor of Statistics at Cal Poly (San Luis Obispo) and teaches multiple courses in statistical programming. Based on her experiences, she compiled this great shortlist of five great tips to teach programming.

Kelly truly mentions some best practices, so have a look at the original article, which she summarized as follows:

1. Define your terms

Establish basic coding vocabulary early on.

  • What is the console, a script, the environment?
  • What is a function a variable, a dataframe?
  • What are strings, characters, and integers?

2. Be deliberate about teaching versus bypassing peripheral skills

Use tools like RStudio Cloud, R Markdown, and the usethis package to shelter students from setup.

Personally, this is what kept me from learning Python for a long time — the issues with starting up.

Kelly provides this personal checklist of peripherals skills including which ones she includes in her introductory courses:

Course TypeInstall/Update R and RStudioR Markdown fluencyPackage managementData managementFile and folder organizationGitHub
Intro Stat for Non-Majors⚠️⚠️
Intro Stat for Majors⚠️⚠️⚠️⚠️
Advanced Statistics⚠️⚠️
Intro to Statistical Computation

✅ = required course skill
⚠️ = optional, proceed with caution
❌ = avoid entirely
via https://teachdatascience.com/teaching_programming_tips/

3. Read code like English

The best way to debug is to read your process out loud as a sentence.

Basically Kelly argues that you should learn students to be able to translate their requirements into (R) code.

When you continuously read out your code as step-by-step computer instructions, students will learn to translate their own desires to computer instructions.

4. Require good coding practices from Day One

Kelly refers to this great talk by Jenny Bryan on “good” code and how to recognize it.

Kelly’s personal best practice included:

  • Clear code formatting
  • Object names follow consistent conventions
  • Lack of unnecessary code repetition
  • Reproducibility
  • Unit tests before large calculations
  • Commenting and/or documentation

For more R style guides, see my R resources overview.

5. Leave room for creativity

Open-ended questions (like “here’s a dataset, do a cool analysis“) let students explore and shine.


Large parts of the above were copied from this original article by Kelly Boldwin. I highly recommend you have a look at the original, and at the website hosting it: teachdatascience.com

Cover picture by freecodecamp.org.