Tag: rstats

A/B testing and Statistics at Etsy, by Emily Robinson


Generating numbers is easy; generating numbers you should trust is hard!

Emily Robinson is a data scientist at Etsy, an e-commerce website for handmade and vintage products. In the #rstats community, Emily is nearly as famous as her brother David Robinson, whom we know from the tidytext R-package.

Like any large tech company, Etsy relies heavily on statistics to improve its way of doing business. In their case, data from real-life experiments provide the business intelligence that allows effective decision-making. For instance, they experiment with the layout of their buttons, with the text shown near products, or with the suggestions made after a search query. To detect whether such changes have (ever so) small effects on Etsy's KPIs (e.g., conversion), data scientists such as Emily rely on traditional A/B testing.

In a 40-minute presentation, Emily explains how statistical issues such as skewed distributions, outliers, and power are dealt with at Etsy, using, among other techniques, bootstrapping and simulations. Moreover, about 30 minutes in, Emily shares her lessons on working with (less stats-savvy) business stakeholders: for instance, how to translate business questions into data questions and the answers back into business solutions, or how to deal with the urge to peek at experiment results early.
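
To give a flavour of what such a bootstrap looks like in R, here is my own minimal sketch with made-up conversion rates (not Etsy's actual code or data):

# Bootstrapping the difference in conversion between two experiment groups
# (hypothetical data; a sketch of the general technique, not Etsy's code)
set.seed(123)
ab_data <- data.frame(
  group     = rep(c("A", "B"), each = 10000),
  converted = c(rbinom(10000, 1, 0.030), rbinom(10000, 1, 0.032))
)

boot_diff <- replicate(2000, {
  resample <- ab_data[sample(nrow(ab_data), replace = TRUE), ]
  rates <- tapply(resample$converted, resample$group, mean)
  rates["B"] - rates["A"]
})

quantile(boot_diff, c(0.025, 0.975))  # bootstrap 95% interval for the lift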

Overall, I can recommend the presentation below, the slides of which you can find on Emily's GitHub.

 

New to R? Kickstart your learning and career with these 6 steps!


For newcomers, R code can look like ancient Egyptian hieroglyphs with its weird operators (%in%, <-, ||, or %/%). The R language is said to have a steep learning curve, and although there are many introductory courses and books (see R Resources), it's hard to decide where to start.
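
As a small taster, here is what those four operators actually do (the comments show the results):

x <- c(2, 4, 6)   # <-   assigns the vector c(2, 4, 6) to x
3 %in% x          # %in% tests membership: FALSE
4 %in% x          #                        TRUE
7 %/% 2           # %/%  is integer division: 3
TRUE || FALSE     # ||   is the (scalar) logical OR: TRUE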

Fortunately, I am here to help! Below is a six-step guide to learning R, using only open-access (i.e., free!) materials.

Although aimed at complete newcomers, it will have you writing your own practical scripts and programs in no time: just start at #1 and work your way up to coding mastery!

If you already feel comfortable with the basics of R — or don’t like basics — you can start at #5 and jump into practical learning via the tidyverse.

Good luck!!!

Step 1: An R Folder (15 min)

Create a directory for your R learning stuff somewhere on your computer. Download this (very) short introduction to R by Paul Torfs and Claudia Bauer and store it in that folder. Now read the introduction and follow the steps. It will help you install all R software on your own computer and familiarize you with the standard data types.

Step 2: Handy Cheat Sheets (15 min)

Many standard functions exist in R, and after a while you will know them by heart. For now, it's good to have a dictionary or reference close at hand. Download and read the cheat sheets for base R (Mhairi McNeill) and R base functions (Tom Short). Because you'll be writing most of your R scripts in RStudio, it's also recommended to keep an RStudio cheat sheet and an RStudio keyboard shortcuts cheat sheet at hand.

Step 3: swirl Away in RStudio (8h)

Now you’re ready to really start learning and we’re going to accelerate via swirl. Open up your RStudio and enter the two lines of code below in your console window.

install.packages('swirl') #download swirl package 
library(swirl) #load in swirl package

swirl (webpage) will start automatically, and after a couple of prompts you will be able to choose the learning course called 1: R Programming: The basics of programming in R (see below). This course consists of 15 modules through which you will master the basics of R in the R environment itself. Start with module 1 and complete one to three modules per day, so that you finish the swirl course in about a week.

Starting up swirl in RStudio.
swirl's four R learning courses, and the 15 modules belonging to the R Programming course.

Step 4: A Pirate’s Guide to R (10h)

OK, you should now be familiar with the basics of R. However, knowledge is crystallized through repetition. I therefore suggest you walk through the book YaRrr! The Pirate's Guide to R (Phillips, 2017), starting in chapter 3. It's a fun book and will teach you how to program custom functions and loops, as well as some basic statistical modelling techniques (the thing R was actually designed for).
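
To give you a feel for what that covers, here is a small self-contained example of my own (not from the book) combining a custom function, a loop, and a simple linear model on R's built-in mtcars data:

# A custom function: convert miles per gallon to kilometres per litre
mpg_to_kmpl <- function(mpg) mpg * 0.425144

# A loop: print the conversion for a few values
for (m in c(15, 25, 35)) {
  cat(m, "mpg is about", round(mpg_to_kmpl(m), 1), "km/l\n")
}

# Basic statistical modelling: regress fuel efficiency on car weight
model <- lm(mpg ~ wt, data = mtcars)
summary(model)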

Step 5: R for Data Science (16h)

By now, you might say you are an adept R programmer with some statistical modelling experience. However, you have mostly been working with base R functions, knowledge of which is a must to really understand the language. In practice, though, R programmers rely heavily on contributed packages. A very useful group of packages is commonly referred to as the tidyverse, and you will be amazed at how much this set of packages simplifies working in R. The next step, therefore, is to work through the book R for Data Science (Grolemund & Wickham, 2017) (hardcopy here).
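
To give you a small taste of why the tidyverse is worth the effort, here is a minimal dplyr/ggplot2 pipeline of my own on the built-in mtcars data (the book goes much deeper):

library(dplyr)
library(ggplot2)

# Summarise fuel efficiency per number of cylinders
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n_cars = n())

# Plot weight against fuel efficiency, coloured by number of cylinders
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")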

Step 6: Specialize (∞)

You are now several steps and a couple of weeks further. You possess basic knowledge of the R language, know how to write scripts in RStudio, are capable of programming in base R as well as using the advanced functionality of the tidyverse, and you have even made a start with some basic statistical modelling.

It's time to set you loose in the wonderful world of the R community. If you have not done so already, get accounts on Stack Overflow and Cross Validated. You might also want to subscribe to the R Help Mailing List, R-Bloggers, and, obviously, my website.


On Twitter, have a look at #rstats and, on Reddit, subscribe to the rstats, rstudio, and statistics subreddits. From here on, I can only advise you to return to the R Resources Overview and continue broadening your R programming skills. Pick materials in whatever area interests you.

Beer-in-hand Data Science


Obviously, analysing beer data is high on everybody's list of favourite things to do on a weekend. Amanda Dobbyn wanted to examine whether she could provide us with an informative categorization of the 45,000+ beers in her data set, without having to taste them all herself.

You can find the full report here but you may also like to interactively discover beer similarities yourself in Amanda’s Beer Clustering Shiny App. Or just have a quick look at some of Amanda’s wonderful visualizations below.

A density map of the bitterness (y-axis) and alcohol percentages (x-axis) in the most popular beer styles.

A k-means clustering of the 45,000 beers into 10 clusters. Try out other settings in Amanda's Beer Clustering Shiny App.

The alcohol percentages (x), bitterness (y) and cluster assignments of some popular beer styles.

 

Modelling beer bitterness (y) by the number of hops used (x).
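
If you would like to try something similar yourself, a clustering like Amanda's boils down to a few lines of R. Below is a minimal sketch on made-up data with hypothetical abv and ibu columns (not Amanda's actual code; see her report for the real analysis):

# k-means on hypothetical beer data (abv = alcohol %, ibu = bitterness)
set.seed(1234)
beers <- data.frame(
  abv = runif(500, 3, 12),
  ibu = runif(500, 5, 120)
)

scaled <- scale(beers[, c("abv", "ibu")])  # put both variables on a common scale
clusters <- kmeans(scaled, centers = 10)   # 10 clusters, as in the Shiny app

beers$cluster <- factor(clusters$cluster)
table(beers$cluster)                       # how many beers end up in each cluster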

 

Networks Among #rstats Twitterers


Reposted from Kasia Kulma’s github with minor modifications.

Have you ever wondered whether the most active/popular R twitterers are virtual friends? 🙂 And by friends here I simply mean mutual followers on Twitter. In this post, I score and pick the top 30 #rstats Twitter users and analyse their Twitter network. You'll see a lot of applications of the rtweet and ggraph packages, as well as a very useful twist using the purrr library, so let's begin!

 

IMPORTING #RSTATS USERS

After loading my precious packages…

library(rtweet)
library(dplyr)
library(purrr)
library(igraph)
library(ggraph)

… I searched for Twitter users that have the term rstats in their profile description. This definitely doesn't include ALL active and popular R users, but it's a pretty reliable way of picking out R fans.

r_users <- search_users("#rstats", n = 1000)

It's important to say that with rtweet::search_users(), even if you specify 1000 users to be extracted, you end up with quite a few duplicates; the actual number of unique users I got was much smaller: 564.

r_users %>% summarise(n_users = n_distinct(screen_name))
##   n_users
## 1     564

Funnily enough, even though my profile description contains #rstats, I (@KKulma) was not included in the search results! Were you? 🙂

SCORING AND CHOOSING TOP #RSTATS USERS

Now, let’s extract some useful information about those users:

r_users_info <- lookup_users(r_users$screen_name)

You'll notice that the resulting data frame holds information about the number of followers, friends (users they follow), lists they belong to, the number of tweets (statuses), and how many times they were marked as a favourite.

r_users_info %>% select(dplyr::contains("count")) %>% head()
##   followers_count friends_count listed_count favourites_count
## 1            8311           366          580             9325
## 2           44474            11         1298                3
## 3           11106           524          467            18495
## 4           12481           431          542             7222
## 5           15345          1872          680            27971
## 6            5122           700          549             2796
##   statuses_count
## 1          66117
## 2           1700
## 3           8853
## 4           6388
## 5          22194
## 6          10010

And these are the variables I used for building my 'top score': I simply calculate a percentile for each of those variables and sum them for each user. Given that each variable's percentile gives a value between 0 and 1, the final score can have a maximum value of 5.

r_users_ranking <- r_users_info %>%
  filter(protected == FALSE) %>% 
  select(screen_name, dplyr::contains("count")) %>% 
  unique() %>% 
  mutate(followers_percentile = ecdf(followers_count)(followers_count),
         friends_percentile = ecdf(friends_count)(friends_count),
         listed_percentile = ecdf(listed_count)(listed_count),
         favourites_percentile = ecdf(favourites_count)(favourites_count),
         statuses_percentile = ecdf(statuses_count)(statuses_count)
         ) %>% 
  group_by(screen_name) %>% 
  summarise(top_score = followers_percentile + friends_percentile + listed_percentile + favourites_percentile + statuses_percentile) %>% 
  ungroup() %>% 
  mutate(ranking = rank(-top_score))

Finally, I picked the top 30 users based on the score I calculated. Ta-da!

top_30 <- r_users_ranking %>% arrange(desc(top_score)) %>% head(30)
top_30 
## # A tibble: 30 x 3
##        screen_name top_score ranking
##              <chr>     <dbl>   <dbl>
##  1          hspter  4.877005       1
##  2    RallidaeRule  4.839572       2
##  3         DEJPett  4.771836       3
##  4 modernscientist  4.752228       4
##  5 nicoleradziwill  4.700535       5
##  6      tomhouslay  4.684492       6
##  7    ChetanChawla  4.639929       7
##  8   TheSmartJokes  4.627451       8
##  9   Physical_Prep  4.625668       9
## 10       Cataranea  4.602496      10
## # ... with 20 more rows

I must say I'm incredibly impressed by these scores: @hspter, THE top R twitterer, managed to obtain a score of nearly 4.9 out of 5! WOW!

Anyway! To add some more depth to my list, I tried to identify the top users' gender, to see how many of them are women. I had to do it manually (ahem!), as the Twitter API doesn't provide this information, AFAIK. Let me know if you spot any mistakes!

top30_lookup <- r_users_info %>%
  filter(screen_name %in% top_30$screen_name) %>% 
  select(screen_name, user_id)

top30_lookup$gender <- c("M", "F", "F", "F", "F",
                         "M", "M", "M", "F", "F", 
                         "F", "M", "M", "M", "F", 
                         "F", "M", "M", "M", "M", 
                         "M", "M", "M", "F", "M",
                         "M", "M", "M", "M", "M")

table(top30_lookup$gender)
## 
##  F  M 
## 10 20

It looks like a third of all top users are women, but in the top 10 users, there are 6 women. Better than I expected, to be honest. So, well done, ladies!

GETTING FRIENDS NETWORK

Now, this was the trickiest part of the project: extracting the top users' friends lists and putting them all in one data frame. As you may be aware, the Twitter API only lets you request friend lists for 15 accounts per 15 minutes. So for my list of 30, I had to break the job into two batches of 15 users each, and I then named each list according to the top user it refers to:

top_30_usernames <- top30_lookup$screen_name

friends_top30a <- map(top_30_usernames[1:15], get_friends)
names(friends_top30a) <- top_30_usernames[1:15]

# 15 minutes later....
friends_top30b <- map(top_30_usernames[16:30], get_friends)

After this I end up with two lists, each containing all friends' IDs for the first and second 15 top users respectively. So what I need to do now is i) append the two lists, ii) create a variable stating the top user's name in each of those lists, and iii) turn the lists into data frames. All this can be done in a few lines of code. And brace yourself: here comes the purrr trick I've been going on about! Simply using purrr::map2_df I can take a single list of lists, create a name variable in each of those lists based on the list name (twitter_top_user), and convert the result into a data frame. BRILLIANT!!

# turning lists into data frames and putting them together
friends_top30 <- append(friends_top30a, friends_top30b)
names(friends_top30) <- top_30_usernames

# purrr - trick I've been banging on about!
friends_top <- map2_df(friends_top30, names(friends_top30), ~ mutate(.x, twitter_top_user = .y)) %>% 
  rename(friend_id = user_id) %>% select(twitter_top_user, friend_id)

Here's the last bit that I need to correct before we move on to plotting the friends networks: using purrr::map() with rtweet::get_friends() gives me at most 5000 friends per user (the Twitter API returns friend IDs in pages of up to 5000), but in the case of @TheSmartJokes the true value is over 8000. As it's the only top user with more than 5000 friends, I'll download his friends separately…

# getting a full list of friends
SJ1 <- get_friends("TheSmartJokes")
SJ2 <- get_friends("TheSmartJokes", page = next_cursor(SJ1))

# putting the data frames together 
SJ_friends <- rbind(SJ1, SJ2) %>%  
  rename(friend_id = user_id) %>% 
  mutate(twitter_top_user = "TheSmartJokes") %>% 
  select(twitter_top_user, friend_id)

# the final results - over 8000 friends, rather than 5000
str(SJ_friends) 
## 'data.frame':    8611 obs. of  2 variables:
##  $ twitter_top_user: chr  "TheSmartJokes" "TheSmartJokes" "TheSmartJokes" "TheSmartJokes" ...
##  $ friend_id       : chr  "390877754" "6085962" "88540151" "108186743" ...

… and use it to replace his entries in the final friends list.

friends_top30 <- friends_top %>% 
  filter(twitter_top_user != "TheSmartJokes") %>% 
  rbind(SJ_friends) 

Finally, let me do some last data cleaning: filtering out friends that are not among the top 30 R users, replacing their IDs with Twitter names, and adding gender for the top users and their friends… Tam, tam, tam: here we are! Here's the final data frame we'll use for visualising the friend networks!

# select friends that are top30 users
final_friends_top30 <- friends_top30  %>% 
  filter(friend_id %in% top30_lookup$user_id)

# add friends' screen_name
final_friends_top30$friend_name <- top30_lookup$screen_name[match(final_friends_top30$friend_id, top30_lookup$user_id)]

# add users' and friends' gender
final_friends_top30$user_gender <- top30_lookup$gender[match(final_friends_top30$twitter_top_user, top30_lookup$screen_name)]
final_friends_top30$friend_gender <- top30_lookup$gender[match(final_friends_top30$friend_name, top30_lookup$screen_name)]

## final product!!!
final <- final_friends_top30 %>% select(-friend_id)

head(final)
##   twitter_top_user     friend_name user_gender friend_gender
## 1         hrbrmstr nicoleradziwill           M             F
## 2         hrbrmstr        kara_woo           M             F
## 3         hrbrmstr      juliasilge           M             F
## 4         hrbrmstr        noamross           M             M
## 5         hrbrmstr      JennyBryan           M             F
## 6         hrbrmstr     thosjleeper           M             M

VISUALIZING FRIENDS NETWORKS

After turning our data frame into something more usable by igraph and ggraph

f1 <- graph_from_data_frame(final, directed = TRUE, vertices = NULL)
V(f1)$Popularity <- degree(f1, mode = 'in')

… let’s have a quick overview of all the connections:

ggraph(f1, layout='kk') + 
  geom_edge_fan(aes(alpha = ..index..), show.legend = FALSE) +
  geom_node_point(aes(size = Popularity)) +
  theme_graph( fg_text_colour = 'black') 

The network of all connections among the top 30 #rstats users.

Keep in mind that Popularity – defined as the number of edges that go into the node – determines node size. It’s all very pretty, but I’d like to see how nodes correspond to Twitter users’ names:

ggraph(f1, layout='kk') + 
  geom_edge_fan(aes(alpha = ..index..), show.legend = FALSE) +
  geom_node_point(aes(size = Popularity)) +
  geom_node_text(aes(label = name, fontface='bold'), 
                 color = 'white', size = 3) +
  theme_graph(background = 'dimgray', text_colour = 'white',title_size = 30) 

The same network, now with each node labelled with the user's Twitter name.

So interesting! You can see the core of the graph consists mainly of female users: @hspter, @JennyBryan, @juliasilge, @kara_woo, but also a couple of male R users: @hrbrmstr and @noamross. Who do they follow? Men or women?

ggraph(f1, layout='kk') + 
  geom_edge_fan(aes(alpha = ..index..), show.legend = FALSE) +
  geom_node_point(aes(size = Popularity)) +
  theme_graph( fg_text_colour = 'black') +
  geom_edge_link(aes(colour = friend_gender)) +
  scale_edge_color_brewer(palette = 'Set1') + 
  labs(title='Top 30 #rstats users and gender of their friends')

The same network, with edge colour indicating the gender of the friend being followed.

It's difficult to say definitively, but superficially I see A LOT of red, suggesting that our top R users often follow female top twitterers. Let's have a closer look, split the graph by top user gender, and see if there's any difference in the gender of the users they follow:

ggraph(f1, layout='kk') + 
  geom_edge_fan(aes(alpha = ..index..), show.legend = FALSE) +
  geom_node_point(aes(size = Popularity)) +
  theme_graph( fg_text_colour = 'black') +
  facet_edges(~user_gender) +
  geom_edge_link(aes(colour = friend_gender)) +
  scale_edge_color_brewer(palette = 'Set1') +
  labs(title='Top 30 #rstats users and gender of their friends', subtitle='Graphs are separated by top user gender, edge colour indicates their friend gender' )

The network split by top user gender, with edge colour indicating the friend's gender.

Ha! Look at this! Obviously, the female users' graph is less dense, as there are fewer of them in the dataset; however, you can see that they tend to follow male users more often than male top users do. Is that impression supported by the raw numbers?

final %>% 
  group_by(user_gender, friend_gender) %>% 
  summarize(n = n()) %>% 
  group_by(user_gender) %>% 
  mutate(sum = sum(n),
         percent = round(n/sum, 2)) 
## # A tibble: 4 x 5
## # Groups:   user_gender [2]
##   user_gender friend_gender     n   sum percent
##         <chr>         <chr> <int> <int>   <dbl>
## 1           F             F    26    57    0.46
## 2           F             M    31    57    0.54
## 3           M             F    55   101    0.54
## 4           M             M    46   101    0.46

It seems so, although to a lesser extent than the network graphs suggested: female top users follow other female top users 46% of the time, whereas male top users follow female top users 54% of the time. So what do you have to say about that?

About the author:

Kasia Kulma describes herself as an all-round science enthusiast: formally a doctor in evolutionary biology, professionally a data scientist, and privately a soppy mum and outdoors lover.