Tag: maps

A Categorical Spatial Interpolation Tutorial in R

A Categorical Spatial Interpolation Tutorial in R

Timo Grossenbacher works as reporter/coder for SRF Data, the data journalism unit of Swiss Radio and TV. He analyzes and visualizes data and investigates data-driven stories. On his website, he hosts a growing list of cool projects. One of his recent blogs covers categorical spatial interpolation in R. The end result of that blog looks amazing:

This map was built with data Timo crowdsourced for one of his projects. With this data, Timo took the following steps, which are covered in his tutorial:

  • Read in the data, first the geometries (Germany political boundaries), then the point data upon which the interpolation will be based on.
  • Preprocess the data (simplify geometries, convert CSV point data into an sf object, reproject the geodata into the ETRS CRS, clip the point data to Germany, so data outside of Germany is discarded).
  • Then, a regular grid (a raster without “data”) is created. Each grid point in this raster will later be interpolated from the point data.
  • Run the spatial interpolation with the kknn package. Since this is quite computationally and memory intensive, the resulting raster is split up into 20 batches, and each batch is computed by a single CPU core in parallel.
  • Visualize the resulting raster with ggplot2.

All code for the above process can be accessed on Timo’s Github. The georeferenced points underlying the interpolation look like the below, where each point represents the location of a person who selected a certain pronunciation in an online survey. More details on the crowdsourced pronunciation project van be found here, .

Another of Timo’s R map, before he applied k-nearest neighbors on these crowdsourced data. [original]
If you want to know more, please read the original blog or follow Timo’s new DataCamp course called Communicating with Data in the Tidyverse.

Pixel Maps in R

Pixel Maps in R

Those of you who follow my blog know I love world maps. Particularly when they are used to visualize data, like these maps of Kaggle programming language preferencesUS household incomes, rush hour travel times, or Shazam recognitions. Those who share this passion will probably like this blog’s topic: mapping data to pixel maps! In R obviously!

A pixel map of holiday and living locations made by Taras Kaduk in R [original]

Taras Kaduk seems as excited about R and the tidyverse as I am, as he built the beautiful map above. It flags all the cities he has visited and, in red, the cities he has lived. Taras was nice enough to share his code here, in the original blog post.

Now, I am not much of a globetrotter, but I do like programming. Hence, I immediately wanted to play with the code and visualize my own holiday destinations. Below you can find my attempt. The updated code I also posted below, but WordPress doesn’t handle code well, so you better look here.

The code to make your own map you can find here.

Let’s run you through the steps to make such a map. First, we need to load some packages. I use the apply family to install and/or load a set of packages so that if I/you run the script on a different computer, it will still work. In terms of packages, the tidyverse (read more) includes some nice data manipulation packages as well as the famous ggplot2 package for visualizations. We need maps and ggmap for their mapping functionalities. here is a great little package for convenient project management, as you will see (read more).

### setup ----------------------------------------------------------------------

# install and/or load packages
pkg <- c("tidyverse", "maps", "ggmap", "here")
sapply(pkg, function(x){
  if(! x %in% installed.packages()) install.packages(x)
  require(x, character.only = TRUE)

Next, we need to load in the coordinates (longitudes and latitudes) of our holiday destinations. Now, I started out creating a dataframe with city coordinates by hand. However, this was definitely not a scale-able solution. Fortunately, after some Googling, I came across ggmap::geocode(). This function allows you to query the Google maps API(no longer works) Data Science Toolkit, which returns all kinds of coordinates data for any character string you feed it.

Although, I ran into two problems with this approach, this was nothing we couldn’t fix. First, my home city of Breda apparently has a name-city in the USA, which Google favors. Accordingly, you need to be careful and/or specific regarding the strings you feed to geocode() (e.g., “Breda NL“). Second, API’s often have a query limit, meaning you can only ask for data every so often. geocode() will quickly return NAs when you feed it more than two, three values. Hence, I wrote a simple while loop to repeat the query until the API retrieves coordinates. The query will pause shortly in between every attempt. Returned coordinates are then stored in the empty dataframe I created earlier. Now, we can easily query a couple dozen of locations without errors.

You can try it yourself: all you need to change is the city_name string.

### cities data ----------------------------------------------------------------

# cities to geolocate
city_name <- c("breda NL", "utrecht", "rotterdam", "tilburg", "amsterdam",
               "london", "singapore", "kuala lumpur", "zanzibar", "antwerp",
               "middelkerke", "maastricht", "bruges", "san fransisco", "vancouver", 
               "willemstad", "hurghada", "paris", "rome", "bordeaux", 
               "berlin", "kos", "crete", "kefalonia", "corfu", 
               "dubai", " barcalona", "san sebastian", "dominican republic", 
               "porto", "gran canaria", "albufeira", "istanbul", 
               "lake como", "oslo", "riga", "newcastle", "dublin", 
               "nice", "cardiff", "san fransisco", "tokyo", "kyoto", "osaka",
               "bangkok", "krabi thailand", "chang mai thailand", "koh tao thailand")   

# initialize empty dataframe   
  city = city_name, 
  lon = rep(NA, length(city_name)), 
  lat = rep(NA, length(city_name)) 
) ->

# loop cities through API to overcome SQ limit
# stop after if unsuccessful after 5 attempts
for(c in city_name){
  temp <- tibble(lon = NA)
  # geolocate until found or tried 5 times
  attempt <- 0 # set attempt counter
  while(is.na(temp$lon) & attempt < 5) {
    temp <- geocode(c, source = "dsk")
    attempt <- attempt + 1 
    cat(c, attempt, ifelse(!is.na(temp[[1]]), "success", "failure"), "\n") # print status
    Sys.sleep(runif(1)) # sleep for random 0-1 seconds
  # write to dataframe
  cities[cities$city == c, -1] <- temp

Now, Taras wrote a very convenient piece of code to generate the dotted world map, which I borrowed from his blog:

### dot data -------------------------------------------------------------------

# generate worldwide dots
lat <- data_frame(lat = seq(-90, 90, by = 1))
lon <- data_frame(lon = seq(-180, 180, by = 1))
dots <- merge(lat, lon, all = TRUE)  
# exclude water-based dots 
dots %>%
mutate(country = map.where("world", lon, lat),
       lakes = map.where("lakes", lon, lat)) %>%
  filter(!is.na(country) & is.na(lakes)) %>% 
  select(-lakes) ->

With both the dot data and the cities’ geocode() coordinates ready, it is high time to visualize the map. Note that I use one geom_point() layer to plot the dots, small and black, and another layer to plot the cities data in transparent red. Taras added a third layer for the cities he had actually lived in; I purposefully did not as I have only lived in the Netherlands and the UK. Note that I again use the convenient here::here() function to save the plot in my current project folder.

### visualize ------------------------------------------------------------------

# plot the data
dots %>% ggplot(aes(x = lon, y = lat)) + 
  geom_point(col = "black", size = 0.25) +
  geom_point(data = cities, col = "red", size = 3, alpha = 0.7) + 
  theme_void() +
    panel.background = element_rect(fill = "#006994"),
    plot.background = element_rect(fill = "#006994")
  ) -> dot_map

# save plot
ggsave(here("worlmap_dots.png"), dot_map, 
       dpi = 600, width = 8, height = 4.5)

I very much like the look of this map and I’d love to see what innovative, other applications you guys can come up with. To copy the code, please look here on RPubs. Do share your personal creations and also remember to take a look at Taras original blog!

Kaggle Data Science Survey 2017: Worldwide Preferences for Python & R

Kaggle Data Science Survey 2017: Worldwide Preferences for Python & R

Kaggle conducts industry-wide surveys to assess the state of data science and machine learning. Over 17,000 individuals worldwide participated in the survey, myself included, and 171 countries and territories are represented in the data.

There is an ongoing debate regarding whether R or Python is better suited for Data Science (probably the latter, but I nevertheless prefer the former). The thousands of responses to the Kaggle survey may provide some insights into how the preferences for each of these languages are dispersed over the globe. At least, that was what I thought when I wrote the code below.

View the Kaggle Kernel here.

### 2017-10-31


options(stringsAsFactors = F)
dpi = 600
w = 12
h = 8
wm_cor = 0.8
hm_cor = 0.8
capt = "Kaggle Data Science Survey 2017 by paulvanderlaken.com"

mc <- read.csv("multipleChoiceResponses.csv") %>%

worldMap <- map_data(map = "world") %>% as.tibble()

mc$Country[!mc$Country %in% worldMap$region] %>% unique()
worldMap$region %>% unique() %>% sort(F)
mc$Country[mc$Country == "United States"] <- "USA"
mc$Country[mc$Country == "United Kingdom"] <- "UK"
mc$Country[grepl("China|Hong Kong", mc$Country)] <- "China"

lvls = c("","Rarely", "Sometimes", "Often", "Most of the time")
labels = c("NA", lvls[-1])
ind_data <- mc %>% 
  select(Country, WorkToolsFrequencyR, WorkToolsFrequencyPython) %>%
  mutate(WorkToolsFrequencyR = factor(WorkToolsFrequencyR, 
                                      levels = lvls, labels = labels)) %>% 
  mutate(WorkToolsFrequencyPython = factor(WorkToolsFrequencyPython, 
                                           levels = lvls, labels = labels)) %>% 
  filter(!(Country == "" | is.na(WorkToolsFrequencyR) | is.na(WorkToolsFrequencyPython)))

country_data <- ind_data %>%
  group_by(Country) %>%
  summarize(N = n(),
            R = sum(WorkToolsFrequencyR %>% as.numeric()),
            Python = sum(WorkToolsFrequencyPython %>% as.numeric()))

theme_worldMap <- theme(
    plot.background = element_rect(fill = "white"),
    panel.border = element_blank(),
    panel.grid = element_blank(),
    panel.background = element_blank(),
    legend.background = element_blank(),
    legend.position = c(0, 0.2),
    legend.justification = c(0, 0),
    legend.title = element_text(colour = "black"),
    legend.text = element_text(colour = "black"),
    legend.key = element_blank(),
    legend.key.size = unit(0.04, "npc"),
    axis.text = element_blank(), 
    axis.title = element_blank(),
    axis.ticks = element_blank()

After aligning some country names (above), I was able to start visualizing the results. A first step was to look at the responses across the globe. The greener the more responses and the grey countries were not represented in the dataset. A nice addition would have been to look at the response rate relative to country population.. any volunteers?

ggplot(country_data) + 
  geom_map(data = worldMap, 
           aes(map_id = region, x = long, y = lat),
           map = worldMap, fill = "grey") +
  geom_map(aes(map_id = Country, fill = N),
           map = worldMap, size = 0.3) +
  scale_fill_gradient(low = "green", high = "darkgreen", name = "Response") +
  theme_worldMap +
  labs(title = "Worldwide Response Kaggle DS Survey 2017",
       caption = capt) +


Now, let’s look at how frequently respondents use Python and R in their daily work. I created two heatmaps: one excluding the majority of respondents who indicated not using either Python or R, probably because they didn’t complete the survey.

worktool_data <- ind_data %>%
  group_by(WorkToolsFrequencyR, WorkToolsFrequencyPython) %>%

ggplot(worktool_data, aes(x = WorkToolsFrequencyR, y = WorkToolsFrequencyPython)) +
  geom_tile(aes(fill = log(n))) +
  geom_text(aes(label = n), col = "black") +
  scale_fill_gradient(low = "red", high = "yellow") +
  labs(title = "Heatmap of Python and R usage",
       subtitle = "Most respondents indicate not using Python or R (or did not complete the survey)",
       caption = capt, 
       fill = "Log(N)") 


worktool_data %>%
  filter(!(WorkToolsFrequencyPython == "NA" & WorkToolsFrequencyR == "NA")) %>%
  ungroup() %>%
  mutate(perc = n / sum(n)) %>%
  ggplot(aes(x = WorkToolsFrequencyR, y = WorkToolsFrequencyPython)) +
  geom_tile(aes(fill = n)) +
  geom_text(aes(label = paste0(round(perc,3)*100,"%")), col = "black") +
  scale_fill_gradient(low = "red", high = "yellow") +
  labs(title = "Heatmap of Python and R usage (non-users excluded)",
       subtitle = "There is a strong reliance on Python and less users focus solely on R",
       caption = capt, 
       fill = "N") 


Okay, now let’s map these frequency data on a worldmap. Because I’m interested in the country level differences in usage, I look at the relative usage of Python compared to R. So the redder the country, the more Python is used by Data Scientists in their workflow whereas R is the preferred tool in the bluer countries. Interesting to see, there is no country where respondents really use R much more than Python.

ggplot(country_data) + 
  geom_map(data = worldMap, 
           aes(map_id = region, x = long, y = lat),
           map = worldMap, fill = "grey") +
  geom_map(aes(map_id = Country, fill = Python/R),
           map = worldMap, size = 0.3) +
  scale_fill_gradient(low = "blue", high = "red", name = "Python/R") +
  theme_worldMap +
  labs(title = "Relative usage of Python to R per country",
       subtitle = "Focus on Python in Russia, Israel, Japan, Ukraine, China, Norway & Belarus",
       caption = capt) +

Countries are color-coded for their relative preference for Python (red/purple) or R (blue) as a Data Science tool. 167 out of 171 countries (98%) demonstrate a value of > 1, indicating a preference for Python over R.

Thank you for reading my visualization report. Please do try and extract some other interesting insights from the data yourself.

If you liked my analysis, please upvote my Kaggle Kernel here!

Geographical Maps in ggplot2: Rectangle World Map

Geographical Maps in ggplot2: Rectangle World Map

Maarten Lambrechts posted a tutorial where he demonstrates the steps through which he created a Eurovision Song Festival map in R.

Maarten’s ggplot2 Songfestival map

Maarten’s ggplot2 worldmap

Inspired by his tutorial, I decided to create a worldmap of my own, the R code for which you may find below.

options(stringsAsFactors = F) # options
library(tidyverse# packages
# retrieve data file
link = "https://gist.githubusercontent.com/maartenzam/787498bbc07ae06b637447dbd430ea0a/raw/9a9dafafb44d8990f85243a9c7ca349acd3a0d07/worldtilegrid.csv"

geodata <- read.csv(link) %>% as.tibble() # load in geodata
str(geodata) # examine geodata
## Classes 'tbl_df', 'tbl' and 'data.frame':    192 obs. of  11 variables:
##  $ name           : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
##  $ alpha.2        : chr  "AF" "AL" "DZ" "AO" ...
##  $ alpha.3        : chr  "AFG" "ALB" "DZA" "AGO" ...
##  $ country.code   : int  4 8 12 24 10 28 32 51 36 40 ...
##  $ iso_3166.2     : chr  "ISO 3166-2:AF" "ISO 3166-2:AL" "ISO 3166-2:DZ" "ISO 3166-2:AO" ...
##  $ region         : chr  "Asia" "Europe" "Africa" "Africa" ...
##  $ sub.region     : chr  "Southern Asia" "Southern Europe" "Northern Africa" "Middle Africa" ...
##  $ region.code    : int  142 150 2 2 NA 19 19 142 9 150 ...
##  $ sub.region.code: int  34 39 15 17 NA 29 5 145 53 155 ...
##  $ x              : int  22 15 13 13 15 7 6 20 24 15 ...
##  $ y              : int  8 9 11 17 23 4 14 6 19 6 ...
# create worldmap
worldmap <- ggplot(geodata)

# add rectangle grid + labels
worldmap + 
  geom_rect(aes(xmin = x, ymin = y, 
                xmax = x + 1, ymax = y + 1)) +
  geom_text(aes(x = x, y = y, 
                label = alpha.3))

download (13)

# improve geoms
worldmap + 
  geom_rect(aes(xmin = x, ymin = y, 
                xmax = x + 1, ymax = y + 1,
                fill = region)) +
  geom_text(aes(x = x, y = y, 
                label = alpha.3),
            size = 2, 
            nudge_x = 0.5, nudge_y = -0.5,
            vjust = 0.5, hjust = 0.5) +

download (12)

# finalize plot look
colors = c('yellow', 'red', 'white', 'pink', 'green', 'orange')
worldmap + 
  geom_rect(aes(xmin = x, ymin = y, 
                xmax = x + 1, ymax = y + 1,
                fill = region)) +
  geom_text(aes(x = x, y = y, 
                label = alpha.3),
            size = 3,
            nudge_x = 0.5, nudge_y = -0.5,
            vjust = 0.5, hjust = 0.5) +
  scale_y_reverse() +
  scale_fill_manual(values = colors) +
  guides(fill = guide_legend(ncol = 2), col = F) +
  theme(plot.background = element_rect(fill = "blue"),
        panel.grid = element_blank(),
        panel.background = element_blank(),
        legend.background = element_blank(),
        legend.position = c(0, 0),
        legend.justification = c(0, 0),
        legend.title = element_text(colour = "white"),
        legend.text = element_text(colour = "white"),
        legend.key = element_blank(),
        legend.key.size = unit(0.06, "npc"),
        axis.text = element_blank(), 
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        text = element_text(colour = "white", size = 16)
        ) +
  labs(title = "ggplot2: Worldmap",
       fill = "Region", 
       caption = "paulvanderlaken.com")

ggplo2 Rectangle Worldmap

What would you add to your worldmap? If you end up making one, please send me a copy on paulvanderlaken@gmail.com!

The Dataviz Project: Find just the right visualization

The Dataviz Project: Find just the right visualization

Do you have a bunch of data but you can’t seem to figure out how to display it? Or looking for that one specific visualization of which you can’t remember the name?

www.datavizproject.com provides a most comprehensive overview of all the different ways to visualize your data. You can sort all options by Family, Input, Function, and Shape to find that one dataviz that best conveys your message.

datavizproject overview

Update: look at some of these other repositories here or here.