Tag: worldmap

Pixel Maps in R

Those of you who follow my blog know I love world maps. Particularly when they are used to visualize data, like these maps of Kaggle programming language preferences, US household incomes, rush hour travel times, or Shazam recognitions. Those who share this passion will probably like this blog’s topic: mapping data to pixel maps! In R obviously!

A pixel map of holiday and living locations made by Taras Kaduk in R [original]

Taras Kaduk seems as excited about R and the tidyverse as I am, as he built the beautiful map above. It flags all the cities he has visited and, in red, the cities he has lived. Taras was nice enough to share his code here, in the original blog post.

Now, I am not much of a globetrotter, but I do like programming. Hence, I immediately wanted to play with the code and visualize my own holiday destinations. Below you can find my attempt. The updated code I also posted below, but WordPress doesn’t handle code well, so you better look here.

worlmap_dots — The code to make your own map you can find here.

Let’s run you through the steps to make such a map. First, we need to load some packages. I use the apply family to install and/or load a set of packages so that if I/you run the script on a different computer, it will still work. In terms of packages, the tidyverse (read more) includes some nice data manipulation packages as well as the famous ggplot2 package for visualizations. We need maps and ggmap for their mapping functionalities. here is a great little package for convenient project management, as you will see (read more).

### setup ----------------------------------------------------------------------

# install and/or load packages
pkg <- c("tidyverse", "maps", "ggmap", "here")
sapply(pkg, function(x){
  if(! x %in% installed.packages()) install.packages(x)
  require(x, character.only = TRUE)
})

Next, we need to load in the coordinates (longitudes and latitudes) of our holiday destinations. Now, I started out creating a dataframe with city coordinates by hand. However, this was definitely not a scale-able solution. Fortunately, after some Googling, I came across ggmap::geocode(). This function allows you to query the ~~Google maps API~~(no longer works) Data Science Toolkit, which returns all kinds of coordinates data for any character string you feed it.

Although, I ran into two problems with this approach, this was nothing we couldn’t fix. First, my home city of Breda apparently has a name-city in the USA, which Google favors. Accordingly, you need to be careful and/or specific regarding the strings you feed to geocode() (e.g., “Breda NL“). Second, API’s often have a query limit, meaning you can only ask for data every so often. geocode() will quickly return NAs when you feed it more than two, three values. Hence, I wrote a simple while loop to repeat the query until the API retrieves coordinates. The query will pause shortly in between every attempt. Returned coordinates are then stored in the empty dataframe I created earlier. Now, we can easily query a couple dozen of locations without errors.

You can try it yourself: all you need to change is the city_name string.

### cities data ----------------------------------------------------------------

# cities to geolocate
city_name <- c("breda NL", "utrecht", "rotterdam", "tilburg", "amsterdam",
               "london", "singapore", "kuala lumpur", "zanzibar", "antwerp",
               "middelkerke", "maastricht", "bruges", "san fransisco", "vancouver", 
               "willemstad", "hurghada", "paris", "rome", "bordeaux", 
               "berlin", "kos", "crete", "kefalonia", "corfu", 
               "dubai", " barcalona", "san sebastian", "dominican republic", 
               "porto", "gran canaria", "albufeira", "istanbul", 
               "lake como", "oslo", "riga", "newcastle", "dublin", 
               "nice", "cardiff", "san fransisco", "tokyo", "kyoto", "osaka",
               "bangkok", "krabi thailand", "chang mai thailand", "koh tao thailand")   

# initialize empty dataframe   
tibble( 
  city = city_name, 
  lon = rep(NA, length(city_name)), 
  lat = rep(NA, length(city_name)) 
) ->
  cities

# loop cities through API to overcome SQ limit
# stop after if unsuccessful after 5 attempts
for(c in city_name){
  temp <- tibble(lon = NA)
  # geolocate until found or tried 5 times
  attempt <- 0 # set attempt counter
  while(is.na(temp$lon) & attempt < 5) {
    temp <- geocode(c, source = "dsk")
    attempt <- attempt + 1 
    cat(c, attempt, ifelse(!is.na(temp[[1]]), "success", "failure"), "\n") # print status
    Sys.sleep(runif(1)) # sleep for random 0-1 seconds
  } 
  # write to dataframe
  cities[cities$city == c, -1] <- temp
}

Now, Taras wrote a very convenient piece of code to generate the dotted world map, which I borrowed from his blog:

### dot data -------------------------------------------------------------------

# generate worldwide dots
lat <- data_frame(lat = seq(-90, 90, by = 1))
lon <- data_frame(lon = seq(-180, 180, by = 1))
dots <- merge(lat, lon, all = TRUE)  
# exclude water-based dots 
dots %>%
mutate(country = map.where("world", lon, lat),
       lakes = map.where("lakes", lon, lat)) %>%
  filter(!is.na(country) & is.na(lakes)) %>% 
  select(-lakes) ->
  dots

With both the dot data and the cities’ geocode() coordinates ready, it is high time to visualize the map. Note that I use one geom_point() layer to plot the dots, small and black, and another layer to plot the cities data in transparent red. Taras added a third layer for the cities he had actually lived in; I purposefully did not as I have only lived in the Netherlands and the UK. Note that I again use the convenient here::here() function to save the plot in my current project folder.

### visualize ------------------------------------------------------------------

# plot the data
dots %>% ggplot(aes(x = lon, y = lat)) + 
  geom_point(col = "black", size = 0.25) +
  geom_point(data = cities, col = "red", size = 3, alpha = 0.7) + 
  theme_void() +
  theme(
    panel.background = element_rect(fill = "#006994"),
    plot.background = element_rect(fill = "#006994")
  ) -> dot_map

# save plot
ggsave(here("worlmap_dots.png"), dot_map, 
       dpi = 600, width = 8, height = 4.5)

I very much like the look of this map and I’d love to see what innovative, other applications you guys can come up with. To copy the code, please look here on RPubs. Do share your personal creations and also remember to take a look at Taras original blog!

Kaggle Data Science Survey 2017: Worldwide Preferences for Python & R

Kaggle conducts industry-wide surveys to assess the state of data science and machine learning. Over 17,000 individuals worldwide participated in the survey, myself included, and 171 countries and territories are represented in the data.

There is an ongoing debate regarding whether R or Python is better suited for Data Science (probably the latter, but I nevertheless prefer the former). The thousands of responses to the Kaggle survey may provide some insights into how the preferences for each of these languages are dispersed over the globe. At least, that was what I thought when I wrote the code below.

View the Kaggle Kernel here.

### PAUL VAN DER LAKEN
### 2017-10-31
### KAGGLE DATA SCIENCE SURVEY
### VISUALIZING WORLD WIDE RESPONSES
### AND PYTHON/R PREFERENCES

# LOAD IN LIBRARIES
library(ggplot2)
library(dplyr)
library(tidyr)
library(tibble)

# OPTIONS & STANDARDIZATION
options(stringsAsFactors = F)
theme_set(theme_light())
dpi = 600
w = 12
h = 8
wm_cor = 0.8
hm_cor = 0.8
capt = "Kaggle Data Science Survey 2017 by paulvanderlaken.com"

# READ IN KAGGLE DATA
mc <- read.csv("multipleChoiceResponses.csv") %>%
  as.tibble()

# READ IN WORLDMAP DATA
worldMap <- map_data(map = "world") %>% as.tibble()

# ALIGN KAGGLE AND WORLDMAP COUNTRY NAMES
mc$Country[!mc$Country %in% worldMap$region] %>% unique()
worldMap$region %>% unique() %>% sort(F)
mc$Country[mc$Country == "United States"] <- "USA"
mc$Country[mc$Country == "United Kingdom"] <- "UK"
mc$Country[grepl("China|Hong Kong", mc$Country)] <- "China"


# CLEAN UP KAGGLE DATA
lvls = c("","Rarely", "Sometimes", "Often", "Most of the time")
labels = c("NA", lvls[-1])
ind_data <- mc %>% 
  select(Country, WorkToolsFrequencyR, WorkToolsFrequencyPython) %>%
  mutate(WorkToolsFrequencyR = factor(WorkToolsFrequencyR, 
                                      levels = lvls, labels = labels)) %>% 
  mutate(WorkToolsFrequencyPython = factor(WorkToolsFrequencyPython, 
                                           levels = lvls, labels = labels)) %>% 
  filter(!(Country == "" | is.na(WorkToolsFrequencyR) | is.na(WorkToolsFrequencyPython)))

# AGGREGATE TO COUNTRY LEVEL
country_data <- ind_data %>%
  group_by(Country) %>%
  summarize(N = n(),
            R = sum(WorkToolsFrequencyR %>% as.numeric()),
            Python = sum(WorkToolsFrequencyPython %>% as.numeric()))

# CREATE THEME FOR WORLDMAP PLOT
theme_worldMap <- theme(
    plot.background = element_rect(fill = "white"),
    panel.border = element_blank(),
    panel.grid = element_blank(),
    panel.background = element_blank(),
    legend.background = element_blank(),
    legend.position = c(0, 0.2),
    legend.justification = c(0, 0),
    legend.title = element_text(colour = "black"),
    legend.text = element_text(colour = "black"),
    legend.key = element_blank(),
    legend.key.size = unit(0.04, "npc"),
    axis.text = element_blank(), 
    axis.title = element_blank(),
    axis.ticks = element_blank()
  )

After aligning some country names (above), I was able to start visualizing the results. A first step was to look at the responses across the globe. The greener the more responses and the grey countries were not represented in the dataset. A nice addition would have been to look at the response rate relative to country population.. any volunteers?

# PLOT WORLDMAP OF RESPONSE RATE
ggplot(country_data) + 
  geom_map(data = worldMap, 
           aes(map_id = region, x = long, y = lat),
           map = worldMap, fill = "grey") +
  geom_map(aes(map_id = Country, fill = N),
           map = worldMap, size = 0.3) +
  scale_fill_gradient(low = "green", high = "darkgreen", name = "Response") +
  theme_worldMap +
  labs(title = "Worldwide Response Kaggle DS Survey 2017",
       caption = capt) +
  coord_equal()

Now, let’s look at how frequently respondents use Python and R in their daily work. I created two heatmaps: one excluding the majority of respondents who indicated not using either Python or R, probably because they didn’t complete the survey.

# AGGREGATE DATA TO WORKTOOL RESPONSES
worktool_data <- ind_data %>%
  group_by(WorkToolsFrequencyR, WorkToolsFrequencyPython) %>%
  count()

# HEATMAP OF PREFERRED WORKTOOLS
ggplot(worktool_data, aes(x = WorkToolsFrequencyR, y = WorkToolsFrequencyPython)) +
  geom_tile(aes(fill = log(n))) +
  geom_text(aes(label = n), col = "black") +
  scale_fill_gradient(low = "red", high = "yellow") +
  labs(title = "Heatmap of Python and R usage",
       subtitle = "Most respondents indicate not using Python or R (or did not complete the survey)",
       caption = capt, 
       fill = "Log(N)")

# HEATMAP OF PREFERRED WORKTOOLS
# EXCLUSING DOUBLE NA'S
worktool_data %>%
  filter(!(WorkToolsFrequencyPython == "NA" & WorkToolsFrequencyR == "NA")) %>%
  ungroup() %>%
  mutate(perc = n / sum(n)) %>%
  ggplot(aes(x = WorkToolsFrequencyR, y = WorkToolsFrequencyPython)) +
  geom_tile(aes(fill = n)) +
  geom_text(aes(label = paste0(round(perc,3)*100,"%")), col = "black") +
  scale_fill_gradient(low = "red", high = "yellow") +
  labs(title = "Heatmap of Python and R usage (non-users excluded)",
       subtitle = "There is a strong reliance on Python and less users focus solely on R",
       caption = capt, 
       fill = "N")

Okay, now let’s map these frequency data on a worldmap. Because I’m interested in the country level differences in usage, I look at the relative usage of Python compared to R. So the redder the country, the more Python is used by Data Scientists in their workflow whereas R is the preferred tool in the bluer countries. Interesting to see, there is no country where respondents really use R much more than Python.

# WORLDMAP OF RELATIVE WORKTOOL PREFERENCE
ggplot(country_data) + 
  geom_map(data = worldMap, 
           aes(map_id = region, x = long, y = lat),
           map = worldMap, fill = "grey") +
  geom_map(aes(map_id = Country, fill = Python/R),
           map = worldMap, size = 0.3) +
  scale_fill_gradient(low = "blue", high = "red", name = "Python/R") +
  theme_worldMap +
  labs(title = "Relative usage of Python to R per country",
       subtitle = "Focus on Python in Russia, Israel, Japan, Ukraine, China, Norway & Belarus",
       caption = capt) +
  coord_equal()

Countries are color-coded for their relative preference for **Python (red/purple)** or **R (blue)** as a Data Science tool. 167 out of 171 countries (98%) demonstrate a value of > 1, indicating a preference for Python over R.

Thank you for reading my visualization report. Please do try and extract some other interesting insights from the data yourself.

If you liked my analysis, please upvote my Kaggle Kernel here!

Geographical Maps in ggplot2: Rectangle World Map

Maarten Lambrechts posted a tutorial where he demonstrates the steps through which he created a Eurovision Song Festival map in R.

Inspired by his tutorial, I decided to create a worldmap of my own, the R code for which you may find below.

options(stringsAsFactors = F) # options
library(tidyverse) # packages

# retrieve data file
link = "https://gist.githubusercontent.com/maartenzam/787498bbc07ae06b637447dbd430ea0a/raw/9a9dafafb44d8990f85243a9c7ca349acd3a0d07/worldtilegrid.csv"

geodata <- read.csv(link) %>% as.tibble() # load in geodata
str(geodata) # examine geodata

## Classes 'tbl_df', 'tbl' and 'data.frame':    192 obs. of  11 variables:
##  $ name           : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
##  $ alpha.2        : chr  "AF" "AL" "DZ" "AO" ...
##  $ alpha.3        : chr  "AFG" "ALB" "DZA" "AGO" ...
##  $ country.code   : int  4 8 12 24 10 28 32 51 36 40 ...
##  $ iso_3166.2     : chr  "ISO 3166-2:AF" "ISO 3166-2:AL" "ISO 3166-2:DZ" "ISO 3166-2:AO" ...
##  $ region         : chr  "Asia" "Europe" "Africa" "Africa" ...
##  $ sub.region     : chr  "Southern Asia" "Southern Europe" "Northern Africa" "Middle Africa" ...
##  $ region.code    : int  142 150 2 2 NA 19 19 142 9 150 ...
##  $ sub.region.code: int  34 39 15 17 NA 29 5 145 53 155 ...
##  $ x              : int  22 15 13 13 15 7 6 20 24 15 ...
##  $ y              : int  8 9 11 17 23 4 14 6 19 6 ...

# create worldmap
worldmap <- ggplot(geodata)

# add rectangle grid + labels
worldmap + 
  geom_rect(aes(xmin = x, ymin = y, 
                xmax = x + 1, ymax = y + 1)) +
  geom_text(aes(x = x, y = y, 
                label = alpha.3))

# improve geoms
worldmap + 
  geom_rect(aes(xmin = x, ymin = y, 
                xmax = x + 1, ymax = y + 1,
                fill = region)) +
  geom_text(aes(x = x, y = y, 
                label = alpha.3),
            size = 2, 
            nudge_x = 0.5, nudge_y = -0.5,
            vjust = 0.5, hjust = 0.5) +
  scale_y_reverse()

# finalize plot look
colors = c('yellow', 'red', 'white', 'pink', 'green', 'orange')
worldmap + 
  geom_rect(aes(xmin = x, ymin = y, 
                xmax = x + 1, ymax = y + 1,
                fill = region)) +
  geom_text(aes(x = x, y = y, 
                label = alpha.3),
            size = 3,
            nudge_x = 0.5, nudge_y = -0.5,
            vjust = 0.5, hjust = 0.5) +
  scale_y_reverse() +
  scale_fill_manual(values = colors) +
  guides(fill = guide_legend(ncol = 2), col = F) +
  theme(plot.background = element_rect(fill = "blue"),
        panel.grid = element_blank(),
        panel.background = element_blank(),
        legend.background = element_blank(),
        legend.position = c(0, 0),
        legend.justification = c(0, 0),
        legend.title = element_text(colour = "white"),
        legend.text = element_text(colour = "white"),
        legend.key = element_blank(),
        legend.key.size = unit(0.06, "npc"),
        axis.text = element_blank(), 
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        text = element_text(colour = "white", size = 16)
        ) +
  labs(title = "ggplot2: Worldmap",
       fill = "Region", 
       caption = "paulvanderlaken.com")

What would you add to your worldmap? If you end up making one, please send me a copy on paulvanderlaken@gmail.com!

Visualizing #IRMA Tweets

Reddit user LucasCu90 used the R package twitteR to retrieve all tweets that were sent with #Irma and a Geocode of central Miami (25 mile radius) from Saturday September 9, to Sunday September 10, 2017 (the period of Irma’s approach and initial landfall on the Florida Keys and the mainland). From the 29,000 tweets he collected, Lucas then retrieved the 600 most common words and overlaid them on a map of Florida, with their size relative to their frequency in the data. The result is quite nice!

Coexisting Languages in Australia: An Interactive map

Jack Zhao from Small Multiples – a multidisciplinary team of data specialists, designers and developers – retrieved the Language Spoken at Home (LANP) data from the 2016 Census and turned it into a dot density map that vividly shows how people from different cultures coexist (or not) in ultra high resolution (using Python, englewood library, QGIS, Carto). Each colored dot in the visualizations below represents five people from the same language group in the area. Highly populated areas have a higher density of dots; while language diversity is shown through the number of different colors in the given area.

Good news: the maps are interactive! Here’s Sydney:

Here is the original webpage on Small Multiples and you can browse the interactive map in full screen in your browser. The below language groups are included:

Eastern Asian: Chinese, Japanese, Korean, Other Eastern Asian Languages
Southeast Asian: Burmese and Related Languages, Hmong-Mien, Mon-Khmer, Tai, Southeast Asian Austronesian Languages, Other Southeast Asian Languages
Southern Asian: Dravidian, Indo-Aryan, Other Southern Asian Languages
Southwest And Central Asian: Iranic, Middle Eastern Semitic Languages, Turkic, Other Southwest and Central Asian Languages
Northern European: Celtic, English, German and Related Languages, Dutch and Related Languages, Scandinavian, Finnish and Related Languages
Southern European: French, Greek, Iberian Romance, Italian, Maltese, Other Southern European Languages
Eastern European: Baltic, Hungarian, East Slavic, South Slavic, West Slavic, Other Eastern European Languages
Australian Indigenous: Arnhem Land and Daly River Region Languages, Yolngu Matha, Cape York Peninsula Languages, Torres Strait Island Languages, Northern Desert Fringe Area Languages, Arandic, Western Desert Languages, Kimberley Area Languages, Other Australian Indigenous Languages

Mapping Median Household Incomes in the US

The US Census Download Center contains rich information on its countries demographic data. Here you can find a piece of R code that uses the highcharter package in R to create an interactive map showing the median household per country.