Data types are one of those things that you don’t tend to care about until you get an error or some unexpected results. It is also one of the first things you should check once you load a new data into pandas for further analysis.
Chris Moffitt
In this short tutorial, Chris shows how the pandas dtypes map to the NumPy and base Python data types.
Moreover, Chris demonstrates how to handle and convert data types, using both custom functions and anonymous lambda functions, so you can speed up your data analysis.
This blog by Gordon Shotwell has passed my Twitter feed a couple of times now and I thought I’d share it here: blog.shotwell.ca/posts/why_i_use_r
In it, Gordon presents his reasons for using R, describing R’s four unique selling points and outlining a discussion full of perfectly quotable thoughts and opinions.
Do have a look at the original blog as well, but here’s my 3-minute summary:
Gordon finds that there are four main features of the R programming language that are essential to his work and in a sense unique to the R language. Here they are, along with quotes by Gordon explaining R’s unique selling points in his words:
(1) Native data science structures
It’s relatively easy to do data science in R without any external libraries. You can read data from a csv into a data frame, plot and clean that data, and analyse it using built-in statistical models.
(2) Non-standard evaluation
Non-standard evaluation lets you do things like use a variable name in a plot title, or evaluate a user-supplied expression in a different environment.
[…]
For example, R lets you specify models with a formula interface like this: lm(mtcars, mpg ~ cyl). This is a natural way for statisticians to specify statistical models because they’re usually familiar with the syntax, but without NSE there’s no way to make that function work as written because mpg and cyl are not objects in the calling environment.
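To make the idea a bit more tangible, here is a minimal sketch of non-standard evaluation in base R. The helper label_of is a hypothetical name introduced purely for illustration, and the model call uses the documented lm(formula, data) argument order.

```r
# label_of() is a hypothetical helper: substitute() captures the unevaluated
# argument expression and deparse() turns that expression into a string.
label_of <- function(x) deparse(substitute(x))

label_of(mpg)   # "mpg", even though no object called mpg has to exist

# The formula is stored unevaluated; mpg and cyl are only looked up
# inside mtcars when lm() builds its model frame.
fit <- lm(mpg ~ cyl, data = mtcars)
names(coef(fit))   # "(Intercept)" "cyl"
```

Because the argument of label_of is never evaluated, R does not complain that mpg is missing from the calling environment.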
(3) Packaging consensus
R let me get up and running, installing packages, filtering data, and printing plots in under 20 minutes, which meant that I stayed interested in the language and eventually started using it professionally. I had actually started to learn Python at around the same time but just found it too difficult. […]
The user that I care the most about only has 20 minutes of attention and no real programming skill, so the only thing they can “just” do is copy and paste one line of code into a console. If that doesn’t work, I’ve lost them, and they’ll spend another lonely year renewing their SPSS licenses.
(4) Functional programming
I really like this pattern of [functional] programming because breaking complicated jobs down into small functional bricks gives me confidence that the overall solution is correct. I can work on the small functions, verify that they’re correct through tests, and then know that combining those building blocks together won’t change their behaviour.
Although I personally do not fully agree with these four points (e.g., I very much like to leverage functional programming in Python and it works like a charm!) I very much liked the outline Gordon provides. I’d love to hear your thoughts as well, so do share them in the comments.
For now, let’s end with some other lovely quotes by Gordon:
The thing is, I don’t use R out of some blind brand loyalty but because I don’t like working hard.
I came to R from an Excel background, and for a long time I had internalized the feeling that serious engineers used Python, while analysts or researchers could use languages like R. Over time I’ve realized that the people making that statement often aren’t really informed. They rarely know anything about R, and often don’t really write production-quality code themselves.
In contrast, most of the very senior engineers I’ve met understand that all programming languages are basically just bundles of trade-offs, and so no single language is going to be globally superior to another. There really are no production languages – only production engineers.
I was training a predictive model for work for use in a Shiny App. However, as the training set was quite large (700k+ obs.), the model object to save was also quite large in size (500mb). This slows down your operation significantly!
Basically, all you really need are the coefficients (and a link function, in case of glm()). However, I can imagine that you are not eager to write new custom prediction functions, and would rather rely on R’s predict.lm and predict.glm. Hence, you’ll need to save some more object information.
Via Google I came to this blog, which provides this great custom R function (below) to decrease the object size of trained generalized linear models considerably! It retains only those object data that are necessary to make R’s predict functions work.
My saved linear model went from taking up half a GB to only 27kb! That’s a 99.995% reduction!
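As a rough sketch of the idea (not the function from the blog referred to above), one could drop the model components that grow with the number of training rows and check that predictions on new data stay the same. The name strip_model, the example model, and the choice of components are my own assumptions.

```r
# strip_model() is a hypothetical helper, NOT the blog's original function.
# It removes components whose size scales with the training data.
strip_model <- function(model) {
  model$y <- NULL
  model$model <- NULL
  model$residuals <- NULL
  model$fitted.values <- NULL
  model$effects <- NULL
  model$linear.predictors <- NULL
  model$weights <- NULL
  model$prior.weights <- NULL
  model$data <- NULL
  model$qr$qr <- NULL   # keep rank and pivot, drop the large matrix
  model
}

fit <- glm(am ~ hp + wt, data = mtcars, family = binomial())
small_fit <- strip_model(fit)

# predictions on new data should be unaffected by the stripping
new_cars <- data.frame(hp = c(100, 200), wt = c(2.5, 3.5))
all.equal(predict(fit, new_cars, type = "response"),
          predict(small_fit, new_cars, type = "response"))

object.size(small_fit) < object.size(fit)
```

Note that this sketch only covers predictions on new data without standard errors. Environments attached to the formula and terms may also end up in a saved file, which this sketch does not address.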
The R for Data Science (R4DS) book by Hadley Wickham is a definite must-read for every R programmer. Amongst others, the power of functional programming is explained in it very well in the chapter on Iteration. I wrote about functional programming before, but I recently re-read the R4DS book section after coming across some new valuable resources on particularly R’s purrr functions.
The purpose of this blog post is twofold. First, I wanted to share these new resources I came across, along with the other resources I already have collected over time on functional programming. Second, I wanted to demonstrate via code why functional programming is so powerful, and how it can speed up, clean, and improve your own workflow.
1. Resources
So first things first, “what are these new functional programming resources?”, you must be wondering. Well, here they are:
Thomas Mock was as inspired by the R4DS book as I was, and will run you through the details behind some of the examples in this tutorial.
Hadley Wickham himself gave a talk at a 2016 EdinbR meetup, explaining why and how to (1) use tidyr to make nested data frames, (2) use purrr for functional programming instead of for loops, and (3) visualise models by converting them to tidy data with broom:
Via YouTube.
Colin Fay dedicated several blogs to purrr. Some are very helpful as introduction — particularly this one — others demonstrate more expert applications of the power of purrr — such as this sequence of six blogs on web mining.
This GitHub repository by Dan Ovando does a fantastic job of explaining functional programming and demonstrating the functionality of purrr.
Cormac Nolan made a beautiful RPub Markdown where he shows how functional programming in combination with purrr’s functions can result in very concise, fast, and supercharged code.
Last, but not least, part of Duke University 2017’s statistical programming course can be found here, related to functional programming with and without purrr.
2. Functional programming example
I wanted to run you through the basics behind functional programming, the apply family and their purrring successors. I try to do so by providing you with some code which you can run in R yourself alongside this read. The content is very much inspired by the R4DS book chapter on iteration.
Let’s start with some data
# let's grab a subset of the mtcars dataset
mtc <- mtcars[ , 1:3] # store the first three columns in a new object
Say we would like to know the average (mean) value of the data in each of the columns of this new dataset. A starting programmer would usually write something like the below:
#### basic approach:
mean(mtc$mpg)
mean(mtc$cyl)
mean(mtc$disp)
However, this approach breaks the rule of three! Basically, we want to avoid copying and pasting anything more than twice.
A basic solution would be to use a for-loop to iterate through each column’s data one by one, and calculate and store the mean for each. Here, we first want to pre-allocate an output vector, so that we do not grow the vector (and copy it in memory) in each iteration of our for-loop. Details regarding why you do not want to grow a vector can be found here. A similar memory issue you can create with for-loops is described here.
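To illustrate the difference, here is a small sketch that fills a vector both ways. The vector length n and the squared values are arbitrary choices for the example.

```r
n <- 1e4

# growing: every c() call may copy the whole vector to a new location
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# pre-allocating: the vector is created once and filled in place
filled <- vector("double", n)
for (i in seq_len(n)) {
  filled[i] <- i^2
}

identical(grown, filled)   # TRUE: same result, different memory behaviour
```

Wrapping each loop in system.time() is a simple way to see how the gap develops as n grows.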
In the end, our for-loop approach to calculating column means could look something like this:
#### for loop approach:
output <- vector("double", ncol(mtc)) # pre-allocate an empty vector
# replace each value in the vector by the column mean using a for loop
for(i in seq_along(mtc)){
  output[i] <- mean(mtc[[i]])
}
# print the output
output
[1] 20.09062 6.18750 230.72188
This output is obviously correct, and the for-loop does the job. However, we are left with some unnecessary objects in our global environment, which not only take up memory but also create clutter.
ls() # inspect global environment
[1] "i" "mtc" "output"
Let’s remove the clutter and move on.
rm(i, output) # remove clutter
Now, R is a functional programming language, which means that we can write our own function with a for-loop in it! This way, helper variables like i and output no longer linger in our global environment. For instance, take the example below, where we create a custom function to calculate the column means. Note that we still want to pre-allocate a vector to store our results.
#### functional programming approach:
col_mean <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] <- mean(df[[i]])
  }
  output
}
Now, we can call this standardized piece of code by calling the function in different contexts:
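For instance, calls on a few built-in datasets might look like the sketch below. The function definition is repeated so the snippet runs on its own, and the choice of iris as a second dataset is mine.

```r
# same definition as above, repeated to keep this snippet self-contained
col_mean <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] <- mean(df[[i]])
  }
  output
}

col_mean(mtcars[, 1:3])   # the three columns stored in mtc
col_mean(mtcars)          # all columns of mtcars
col_mean(iris[, 1:4])     # the four numeric columns of iris
```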
This way we avoid having to write the same code multiple times, which prevents errors and typos and guarantees a standardized output.
Moreover, this functional programming approach does not create unnecessary clutter in our global environment. The variables created in the for loop (i and output) only exist in the local environment of the function, and are removed once the function call finishes. Check for yourself: only our dataset and our user-defined function col_mean remain.
ls()
[1] "col_mean" "mtc"
For the specific purpose we are demonstrating here, base R already offers a more flexible approach than our custom function: the apply family. It’s a set of functions with internal loops that “apply” a function over the elements of an object. Let’s look at some example applications for our specific problem, where we want to calculate the mean values for all columns of our dataset.
#### apply approach:
# apply loops a function over the margin of a dataset
apply(mtc, MARGIN = 1, mean) # either by its rows (MARGIN = 1)
apply(mtc, MARGIN = 2, mean) # or over the columns (MARGIN = 2)
# in both cases apply returns the results in a vector
# sapply loops a function over the columns, returning the results in a vector
sapply(mtc, mean)
      mpg       cyl      disp
 20.09062   6.18750 230.72188
# lapply loops a function over the columns, returning the results in a list
lapply(mtc, mean)
Sidenote: sapply and lapply both loop their input function over a dataframe’s columns by default as R dataframes are actually lists of equal-length vectors (see Advanced R [Wickham, 2014]).
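A quick way to convince yourself of this, using only base R and the built-in mtcars dataset:

```r
is.list(mtcars)                  # TRUE: a data frame is a list underneath
length(mtcars) == ncol(mtcars)   # TRUE: one list element per column
unique(lengths(mtcars))          # 32: every column has the same length
```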
# tapply loops a function over a vector
# grouping it by a second INDEX vector
# and returning the results in a vector
tapply(mtc$mpg, INDEX = mtc$cyl, mean)
       4        6        8
26.66364 19.74286 15.10000
These apply functions are a cleaner approach than the prior for-loops, as the output is more predictable (by default a vector or a list) and no unnecessary variables are allocated in our global environment.
Performing the same action to each element of an object and saving the results is so common in programming that our friends at RStudio decided to create the purrr package. It provides another family of functions to do these actions for you in a cleaner and more versatile approach building on functional programming.
install.packages("purrr")
library("purrr")
Like the apply family, there are multiple functions that each return a specific output:
# map_lgl returns a logical vector
# as numeric means aren't often logical, I had to call a different function
map_lgl(mtc, is.logical) # mtc's columns are numerical, hence FALSE
  mpg   cyl  disp
FALSE FALSE FALSE
# map_int returns an integer vector
# as numeric means aren't often integers, I had to call a different function
map_int(mtc, is.integer) # returned FALSE, which is converted to integer (0)
 mpg  cyl disp
   0    0    0
# map_dbl returns a double vector.
map_dbl(mtc, mean)
      mpg       cyl      disp
 20.09062   6.18750 230.72188
# map_chr returns a character vector.
# (recent purrr versions warn that this automatic coercion from double to character is deprecated)
map_chr(mtc, mean)
        mpg         cyl        disp
"20.090625"  "6.187500" "230.721875"
All purrr functions are implemented in C. This makes them a little faster at the expense of readability. Moreover, the purrr functions pass any additional arguments on to the function they apply. For instance, in the example below, the na.rm argument is passed on to the mean function.
map_dbl(rbind(mtc, c(NA, NA, NA)), mean) # returns NA due to the row of missing values
 mpg  cyl disp
  NA   NA   NA
map_dbl(rbind(mtc, c(NA, NA, NA)), mean, na.rm = TRUE) # handles those NAs
      mpg       cyl      disp
 20.09062   6.18750 230.72188
Once you get familiar with purrr, it becomes a very powerful tool. For instance, in the below example, we split our little dataset in groups for cyl and then run a linear model within each group, returning these models as a list (standard output of map). All with only three lines of code!
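A sketch of what those three lines might look like, reconstructed from the extended pipeline used further on in this post. The object name models is my own addition.

```r
library(purrr)   # also makes the %>% pipe available

mtc <- mtcars[, 1:3]   # same subset as used throughout this post

models <- mtc %>%
  split(.$cyl) %>%                    # one data frame per value of cyl
  map(~ lm(mpg ~ disp, data = .))     # one linear model per group

names(models)   # "4" "6" "8"
```

The result is a plain named list, so each model can be inspected individually, for example with models[["4"]].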
We can expand this as we go, for instance, by inputting this list of linear models into another map function where we run a model summary, and then extract the model coefficient using another subsequent map:
mtc %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ disp, data = .)) %>%
  map(summary) %>% # returns a list of linear model summaries
  map("coefficients")
$4
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 40.8719553 3.58960540 11.386197 1.202715e-06
disp        -0.1351418 0.03317161 -4.074021 2.782827e-03
$6
                Estimate Std. Error   t value    Pr(>|t|)
(Intercept) 19.081987419 2.91399289 6.5483988 0.001243968
disp         0.003605119 0.01555711 0.2317344 0.825929685
$8
               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) 22.03279891 3.345241115  6.586311 2.588765e-05
disp        -0.01963409 0.009315926 -2.107584 5.677488e-02
The possibilities are endless, our code is fast and readable, our function calls provide predictable return values, and our environment stays clean!
PS. sorry for the terrible layout but WordPress really has been acting up lately… I really should move to some other blog hosting method. Any tips? Potentially Jekyll?
I recently came across this lovely article where Ali Spittel provides 7 tips for writing cleaner JavaScript code. Enthusiastic about her guidelines, I wanted to translate them to the R programming environment. However, since R is not primarily an object-oriented programming language, not all tips were equally relevant in my opinion. Here’s what really stood out for me.
1. Use clear variable and function names
Suppose we want to create our own custom function to derive the average value of a vector v (please note that there is a base::mean function to do this much more efficiently). We could use the R code below to compute that the average of vector 1 through 10 is 5.5.
avg <- function(v){
  s <- 0
  for(i in seq_along(v)) {
    s <- s + v[i]
  }
  return(s / length(v))
}
avg(1:10) # 5.5
However, Ali rightfully argues that this code can be improved by making the variable and function names much more explicit. For instance, the refactored code below makes much more sense at first glance, while doing exactly the same.
averageVector <- function(vector){
  sum <- 0
  for(i in seq_along(vector)){
    sum <- sum + vector[i]
  }
  return(sum / length(vector))
}
averageVector(1:10) # 5.5
Of course, you don’t want to make variable and function names unnecessarily long (e.g., average would have been a great alternative function name, whereas computeAverageOfThisVector is probably too long). I like Ali’s principle:
Don’t minify your own code; use full variable names that the next developer can understand.
2. Write short functions that only do one thing
Ali argues “Functions are more understandable, readable, and maintainable if they do one thing only. If we have a bug when we write short functions, it is usually easier to find the source of that bug. Also, our code will be more reusable.” It thus helps to break up your code into custom functions that all do one thing and do that thing well!
For instance, our earlier function averageVector actually did two things. It first summed the vector, and then took the average. We can split this into two separate functions in order to standardize our operations.
sumVector <- function(vector){
  sum <- 0
  for(i in seq_along(vector)){
    sum <- sum + vector[i]
  }
  return(sum)
}
averageVector <- function(vector){
  sum <- sumVector(vector)
  average <- sum / length(vector)
  return(average)
}
sumVector(1:10) # 55
averageVector(1:10) # 5.5
If you are writing a function that could be named with an “and” in it — it really should be two functions.
3. Documentation
Personally, I am terrible at commenting and documenting my work. I am always in too much of a hurry, I tell myself. However, no more excuses! Everybody should make sure to write good documentation for their code so that future developers, including future you, understand what your code is doing and why!
Ali uses the following great example, of a piece of code with magic numbers in it.
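Translated to R, such a function with a magic number in it might look like the sketch below. This is my own R rendering of the idea, not Ali’s original JavaScript.

```r
# a function with an unexplained "magic number" in its return statement
areaOfCircle <- function(radius) {
  return(3.14 * radius ** 2)
}

areaOfCircle(2)   # 12.56
```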
Now, you might immediately recognize the number Pi in this return statement, but others may not. And maybe you will need the value Pi somewhere else in your script as well, but you accidentally use three decimals the next time. Best to standardize and comment!
PI <- 3.14 # PI rounded to two decimal places
areaOfCircle <- function(radius) {
  # Implements the mathematical equation for the area of a circle:
  # Pi times the radius of the circle squared.
  return(PI * radius ** 2)
}
The above is much clearer. And by making PI a variable, you make sure that you use the same value in other places in your script! Unfortunately, R has no dedicated syntax for constants (unchangeable variables), but I try to denote my constants by using ALL CAPITAL variable names such as PI, MAX_GROUP_SIZE, or COLOR_EXPERIMENTAL_GROUP.
Do note that R has a built in variable pi for purposes such as the above.
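If you want R to actually enforce that a value stays fixed, base R’s lockBinding() is one option. The sketch below assumes it is run at the top level of a script. The variable result is only there to show what happens.

```r
MAX_GROUP_SIZE <- 10
lockBinding("MAX_GROUP_SIZE", environment())   # make the binding read-only

# any later attempt to overwrite the value now raises an error
result <- tryCatch({
  MAX_GROUP_SIZE <- 20
  "changed"
}, error = function(e) "locked")

result           # "locked"
MAX_GROUP_SIZE   # still 10
```

A matching unlockBinding() call lifts the lock again, so this is a guard against accidents rather than a true constant.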
I love Ali’s general rule that:
Your comments should describe the “why” of your code.
However, more elaborate R programming commenting guidelines are given in the Google R coding guide, stating that:
Functions should contain a comments section immediately below the function definition line. These comments should consist of a one-sentence description of the function; a list of the function’s arguments, denoted by Args:, with a description of each (including the data type); and a description of the return value, denoted by Returns:. The comments should be descriptive enough that a caller can use the function without reading any of the function’s code.
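Applied to the circle example, a comment section along those lines could look like the sketch below. The exact wording of the comments is my own, and the built-in pi is used instead of a custom constant.

```r
areaOfCircle <- function(radius) {
  # Computes the area of a circle.
  #
  # Args:
  #   radius: Radius of the circle, a non-negative numeric vector.
  #
  # Returns:
  #   A numeric vector with the area for each radius.
  return(pi * radius^2)
}

areaOfCircle(1)   # equals pi
```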
Either way, avoid comments that only describe “what” your code does:
# EXAMPLE OF BAD COMMENTING ####
PI <- 3.14 # PI
areaOfCircle <- function(radius) {
  # custom function for area of circle
  return(PI * radius ** 2) # radius squared times PI
}
5. Be Consistent
I do not have as strong a sentiment about consistency as Ali does in her article, but I do agree that it’s nice if code is at least somewhat in line with the common style guides. For R, I like to refer to my R resources list which includes several common style guides, such as Google’s or Hadley Wickham’s Advanced R style guide.