Tag: ggplot

rstudio::conf 2019 summary

rstudio::conf 2019 summary

Cool intro video!
Thanks to Amelia for pointing to it

Welcome to rstudio::conf 2019

Similar to last year, I was not able to attend rstudio::conf 2019.

Fortunately, so much of the conference is shared on Twitter and media outlets that I still felt included. Here are some things that I liked and learned from, despite the Austin-Tilburg distance.

All presentations are streamed

One great thing about rstudio::conf is that all presentations are streamed and later posted on the RStudio website.

Of what I’ve already reviewed, I really liked Jenny Bryan’s presentation on lazy evaluation, Max Kuhn’s presentation on parsnip, and teaching data science with puzzles by Irene Steves. Also, the gt package is a serious power tool! And I was already a gganimate fanboy, as you know from here and here.

One of the insights shared in Jenny Bryan’s talk that can be a life-saver

I think I’m going to watch all talks over the coming weekends!

Slides & Extra Materials

There’s an official rstudio-conf repository on Github hosting many materials in an orderly fashion.

Karl Broman made his own awesome GitHub repository with links to the videos, the slides, and all kinds of extra resources.

Karl’s handy github repo of rstudio::conf

All takeaways in a handy #rstudioconf Shiny app

Garrick Aden-Buie made a fabulous Shiny app that allows you to review all #rstudioconf tweets during and since the conference. It even includes some random statistics about the tweets, and a page with all the shared media.

Some random takeaways

Image
Via this tweet about this rstudio::conf presentation
Some words of wisdom by Emily Robinson (whom we know from here)
You should consider joining #tidytuesday!

Extra: Online RStudio Webinars

Did you know that RStudio also posts all the webinars they host? There really are some hidden pearls among them. For instance, this presentation by Nathan Stephens on rendering rmarkdown to powerpoint will save me tons of work, and those new to broom will also be astonished by this webinar by Alex Hayes.

Become a data-driven Sommelier by text mining wine reviews

Become a data-driven Sommelier by text mining wine reviews

Aleszu Bajak at Storybench.org published a great demonstration of the power of text mining. He used the R tidytext package to analyse 150,000 wine reviews which Zach Thoutt had scraped from Wine Enthusiast in November of 2017.

Aleszu started his analysis on only the French wines, with a simple word count per region:

[orginal blog]
Next, he applied TF-IDF to surface the words that are most characteristic for specific French wine regions — words used often in combination with that specific region, but not in relation to other regions.

[orginal blog]
The data also contained some price information, which Aleszu mapped France with ggplot2 and the maps package to demonstrate which French wine regions are generally more costly.

[orginal blog]
On the full dataset, Alezsu also demonstrated that there is a strong relationship between price and points, meaning that, in general, more expensive wines seem to get better reviews:

[orginal blog]
The full script and more details you can find in the orginal blog.

Short ggplot2 tutorial by MiniMaxir

Short ggplot2 tutorial by MiniMaxir

The following was reposted from minimaxir.com

 

QUICK INTRODUCTION TO GGPLOT2

ggplot2 uses a more concise setup toward creating charts as opposed to the more declarative style of Python’s matplotlib and base R. And it also includes a few example datasets for practicing ggplot2 functionality; for example, the mpg dataset is a dataset of the performance of popular models of cars in 1998 and 2008.

Let’s say you want to create a scatter plot. Following a great example from the ggplot2 documentation, let’s plot the highway mileage of the car vs. the volume displacement of the engine. In ggplot2, first you instantiate the chart with the ggplot() function, specifying the source dataset and the core aesthetics you want to plot, such as x, y, color, and fill. In this case, we set the core aesthetics to x = displacement and y = mileage, and add a geom_point() layer to make a scatter plot:

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
            geom_point()

As we can see, there is a negative correlation between the two metrics. I’m sure you’ve seen plots like these around the internet before. But with only a couple of lines of codes, you can make them look more contemporary.

ggplot2 lets you add a well-designed theme with just one line of code. Relatively new to ggplot2 is theme_minimal(), which generates a muted style similar to FiveThirtyEight’s modern data visualizations:

p <- p +
    theme_minimal()

But we can still add color. Setting a color aesthetic on a character/categorical variable will set the colors of the corresponding points, making it easy to differentiate at a glance.

p <- ggplot(mpg, aes(x = displ, y = hwy, color=class)) +
            geom_point() +
            theme_minimal()

Adding the color aesthetic certainly makes things much prettier. ggplot2 automatically adds a legend for the colors as well. However, for this particular visualization, it is difficult to see trends in the points for each class. A easy way around this is to add a least squares regression trendline for each class using geom_smooth() (which normally adds a smoothed line, but since there isn’t a lot of data for each group, we force it to a linear model and do not plot confidence intervals)

p <- p +
    geom_smooth(method = "lm", se = F)

Pretty neat, and now comparative trends are much more apparent! For example, pickups and SUVs have similar efficiency, which makes intuitive sense.

The chart axes should be labeled (always label your charts!). All the typical labels, like titlex-axis, and y-axis can be done with the labs() function. But relatively new to ggplot2 are the subtitle and caption fields, both of do what you expect:

p <- p +
    labs(title="Efficiency of Popular Models of Cars",
         subtitle="By Class of Car",
         x="Engine Displacement (liters)",
         y="Highway Miles per Gallon",
         caption="by Max Woolf — minimaxir.com")

That’s a pretty good start. Now let’s take it to the next level.

HOW TO SAVE A GGPLOT2 CHART FOR WEB

Something surprisingly undiscussed in the field of data visualization is how to save a chart as a high quality image file. For example, with Excel charts, Microsoft officially recommends to copy the chart, paste it as an image back into Excel, then save the pasted image, without having any control over image quality and size in the browser (the real best way to save an Excel/Numbers chart as an image for a webpage is to copy/paste the chart object into a PowerPoint/Keynote slide, and export the slideas an image. This also makes it extremely easy to annotate/brand said chart beforehand in PowerPoint/Keynote).

R IDEs such as RStudio have a chart-saving UI with the typical size/filetype options. But if you save an image from this UI, the shapes and texts of the resulting image will be heavily aliased (R renders images at 72 dpi by default, which is much lower than that of modern HiDPI/Retina displays).

The data visualizations used earlier in this post were generated in-line as a part of an R Notebook, but it is surprisingly difficult to extract the generated chart as a separate file. But ggplot2 also has ggsave(), which saves the image to disk using antialiasing and makes the fonts/shapes in the chart look much better, and assumes a default dpi of 300. Saving charts using ggsave(), and adjusting the sizes of the text and geoms to compensate for the higher dpi, makes the charts look very presentable. A width of 4 and a height of 3 results in a 1200x900px image, which if posted on a blog with a content width of ~600px (like mine), will render at full resolution on HiDPI/Retina displays, or downsample appropriately otherwise. Due to modern PNG compression, the file size/bandwidth cost for using larger images is minimal.

p <- ggplot(mpg, aes(x = displ, y = hwy, color=class)) + 
    geom_smooth(method = "lm", se=F, size=0.5) +
    geom_point(size=0.5) +
    theme_minimal(base_size=9) +
    labs(title="Efficiency of Popular Models of Cars",
         subtitle="By Class of Car",
         x="Engine Displacement (liters)",
         y="Highway Miles per Gallon",
         caption="by Max Woolf — minimaxir.com")

ggsave("tutorial-0.png", p, width=4, height=3)

Compare to the previous non-ggsave chart, which is more blurry around text/shapes:

For posterity, here’s the same chart saved at 1200x900px using the RStudio image-saving UI:

Note that the antialiasing optimizations assume that you are not uploading the final chart to a service like Medium or WordPress.com, which will compress the images and reduce the quality anyways. But if you are uploading it to Reddit or self-hosting your own blog, it’s definitely worth it.

FANCY FONTS

Changing the chart font is another way to add a personal flair. Theme functions like theme_minimal()accept a base_family parameter. With that, you can specify any font family as the default instead of the base sans-serif. (On Windows, you may need to install the extrafont package first). Fonts from Google Fonts are free and work easily with ggplot2 once installed. For example, we can use Roboto, Google’s modern font which has also been getting a lot of usage on Stack Overflow’s great ggplot2 data visualizations.

p <- p +
    theme_minimal(base_size=9, base_family="Roboto")

A general text design guideline is to use fonts of different weights/widths for different hierarchies of content. In this case, we can use a bolder condensed font for the title, and deemphasize the subtitle and caption using lighter colors, all done using the theme() function.

p <- p + 
    theme(plot.subtitle = element_text(color="#666666"),
          plot.title = element_text(family="Roboto Condensed Bold"),
          plot.caption = element_text(color="#AAAAAA", size=6))

It’s worth nothing that data visualizations posted on websites should be easily legible for mobile-device users as well, hence the intentional use of larger fonts relative to charts typically produced in the desktop-oriented Excel.

Additionally, all theming options can be set as a session default at the beginning of a script using theme_set(), saving even more time instead of having to recreate the theme for each chart.

THE “GGPLOT2 COLORS”

The “ggplot2 colors” for categorical variables are infamous for being the primary indicator of a chart being made with ggplot2. But there is a science to it; ggplot2 by default selects colors using the scale_color_hue() function, which selects colors in the HSL space by changing the hue [H] between 0 and 360, keeping saturation [S] and lightness [L] constant. As a result, ggplot2 selects the most distinct colors possible while keeping lightness constant. For example, if you have 2 different categories, ggplot2 chooses the colors with h = 0 and h = 180; if 3 colors, h = 0, h = 120, h = 240, etc.

It’s smart, but does make a given chart lose distinctness when many other ggplot2 charts use the same selection methodology. A quick way to take advantage of this hue dispersion while still making the colors unique is to change the lightness; by default, l = 65, but setting it slightly lower will make the charts look more professional/Bloomberg-esque.

p_color <- p +
        scale_color_hue(l = 40)

RCOLORBREWER

Another coloring option for ggplot2 charts are the ColorBrewer palettes implemented with the RColorBrewer package, which are supported natively in ggplot2 with functions such as scale_color_brewer(). The sequential palettes like “Blues” and “Greens” do what the name implies:

p_color <- p +
        scale_color_brewer(palette="Blues")

A famous diverging palette for visualizations on /r/dataisbeautiful is the “Spectral” palette, which is a lighter rainbow (recommended for dark backgrounds)

However, while the charts look pretty, it’s difficult to tell the categories apart. The qualitative palettes fix this problem, and have more distinct possibilities than the scale_color_hue() approach mentioned earlier.

Here are 3 examples of qualitative palettes, “Set1”, “Set2”, and “Set3,” whichever fit your preference.

VIRIDIS AND ACCESSIBILITY

Let’s mix up the visualization a bit. A rarely-used-but-very-useful ggplot2 geom is geom2d_bin(), which counts the number of points in a given 2d spatial area:

p <- ggplot(mpg, aes(x = displ, y = hwy)) + 
    geom_bin2d(bins=10) +
    [...theming options...]

We see that the largest number of points are centered around (2,30). However, the default ggplot2 color palette for continuous variables is boring. Yes, we can use the RColorBrewer sequential palettes above, but as noted, they aren’t perceptually distinct, and could cause issues for readers who are colorblind.

The viridis R package provides a set of 4 high-contrast palettes which are very colorblind friendly, and works easily with ggplot2 by extending a scale_fill_viridis()/scale_color_viridis() function.

The default “viridis” palette has been increasingly popular on the web lately:

p_color <- p +
        scale_fill_viridis(option="viridis")

“magma” and “inferno” are similar, and give the data visualization a fiery edge:

Lastly, “plasma” is a mix between the 3 palettes above:

If you’ve been following my blog, I like to use R and ggplot2 for data visualization. A lot.

One of my older blog posts, An Introduction on How to Make Beautiful Charts With R and ggplot2, is still one of my most-trafficked posts years later, and even today I see techniques from that particular post incorporated into modern data visualizations on sites such as Reddit’s /r/dataisbeautiful subreddit.

NEXT STEPS

FiveThirtyEight actually uses ggplot2 for their data journalism workflow in an interesting way; they render the base chart using ggplot2, but export it as as a SVG/PDF vector file which can scale to any size, and then the design team annotates/customizes the data visualization in Adobe Illustrator before exporting it as a static PNG for the article (in general, I recommend using an external image editor to add text annotations to a data visualization because doing it manually in ggplot2 is inefficient).

For general use cases, ggplot2 has very strong defaults for beautiful data visualizations. And certainly there is a lot more you can do to make a visualization beautiful than what’s listed in this post, such as using facets and tweaking parameters of geoms for further distinction, but those are more specific to a given data visualization. In general, it takes little additional effort to make something unique with ggplot2, and the effort is well worth it. And prettier charts are more persuasive, which is a good return-on-investment.

Max Woolf (@minimaxir) is a former Apple Software QA Engineer living in San Francisco and a Carnegie Mellon University graduate. In his spare time, Max uses Python to gather data from public APIs and ggplot2 to plot plenty of pretty charts from that data. You can learn more about Max here, view his data analysis portfolio here, or view his coding portfolio here.

R resources (free courses, books, tutorials, & cheat sheets)

R resources (free courses, books, tutorials, & cheat sheets)

Help yourself to these free books, tutorials, packages, cheat sheets, and many more materials for R programming. There’s a separate overview for handy R programming tricks. If you have additions, please comment below or contact me!


Join 262 other followers

LAST UPDATED: 2020-08-24


Table of Contents (clickable)

Completely new to R? → Start learning here!


Introductory R

Introductory Books

Online Courses

Style Guides

BACK TO TABLE OF CONTENTS


Advanced R

Package Development

Non-standard Evaluation

Functional Programming

BACK TO TABLE OF CONTENTS

Cheat Sheets

Many of the above cheat sheets are hosted in the official RStudio cheat sheet overview.


Data Manipulation


Data Visualization

Colors

Interactive / HTML / JavaScript widgets

ggplot2

ggplot2 extensions

Miscellaneous

  • coefplot – visualizes model statistics
  • circlize – circular visualizations for categorical data
  • clustree – visualize clustering analysis
  • quantmod – candlestick financial charts
  • dabestr– Data Analysis using Bootstrap-Coupled ESTimation
  • devoutsvg – an SVG graphics device (with pattern fills)
  • devoutpdf – an PDF graphics device
  • cartography – create and integrate maps in your R workflow
  • colorspace – HSL based color palettes
  • viridis – Matplotlib viridis color pallete for R
  • munsell – Munsell color palettes for R
  • Cairo – high-quality display output
  • igraph – Network Analysis and Visualization
  • graphlayouts – new layout algorithms for network visualization
  • lattice – Trellis graphics
  • tmap – thematic maps
  • trelliscopejs – interactive alternative for facet_wrap
  • rgl – interactive 3D plots
  • corrplot – graphical display of a correlation matrix
  • googleVis – Google Charts API
  • plotROC – interactive ROC plots
  • extrafont – fonts in R graphics
  • rvg – produces Vector Graphics that allow further editing in PowerPoint or Excel
  • showtext – text using system fonts
  • animation – animated graphics using ImageMagick.
  • misc3d – 3d plots, isosurfaces, etc.
  • xkcd – xkcd style graphics
  • imager – CImg library to work with images
  • ungeviz – tools for visualize uncertainty
  • waffle – square pie charts a.k.a. waffle charts
  • Creating spectograms in R with hht, warbleR, soundgen, signal, seewave, or phonTools

BACK TO TABLE OF CONTENTS


Shiny, Dashboards, & Apps


Markdown & Other Output Formats

  • tidystats – automating updating of model statistics
  • papaja – preparing APA journal articles
  • blogdown – build websites with Markdown & Hugo
  • huxtable – create Excel, html, & LaTeX tables
  • xaringan – make slideshows via remark.js and markdown
  • summarytools – produces neat, quick data summary tables
  • citr – RStudio Addin to Insert Markdown Citations

Cloud, Server, & Database

BACK TO TABLE OF CONTENTS


Statistical Modeling & Machine Learning

Books

Courses

Cheat sheets

Time series

Survival analysis

Bayesian

Miscellaneous

  • corrr – easier correlation matrix management and exploration

BACK TO TABLE OF CONTENTS


Natural Language Processing & Text Mining

Regular Expressions

BACK TO TABLE OF CONTENTS


Geographic & Spatial mapping


Bioinformatics & Computational Biology

BACK TO TABLE OF CONTENTS


Integrated Development Environments (IDEs) &
Graphical User Inferfaces (GUIs)

Descriptions mostly taken from their own websites:

  • RStudio*** – Open source and enterprise ready professional software
  • Jupyter Notebook*** – open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text across dozens of programming languages.
  • Microsoft R tools for Visual Studio – turn Visual Studio into a powerful R IDE
  • R Plugins for Vim, Emax, and Atom editors
  • Rattle*** – GUI for data mining
  • equisse – RStudio add-in to interactively explore and visualize data
  • R Analytic Flow – data flow diagram-based IDE
  • RKWard – easy to use and easily extensible IDE and GUI
  • Eclipse StatET – Eclipse-based IDE
  • OpenAnalytics Architect – Eclipse-based IDE
  • TinnR – open source GUI and IDE
  • DisplayR – cloud-based GUI
  • BlueSkyStatistics – GUI designed to look like SPSS and SAS 
  • ducer – GUI for everyone
  • R commander (Rcmdr) – easy and intuitive GUI
  • JGR – Java-based GUI for R
  • jamovi & jmv – free and open statistical software to bridge the gap between researcher and statistician
  • Exploratory.io – cloud-based data science focused GUI
  • Stagraph – GUI for ggplot2 that allows you to visualize and connect to databases and/or basic file types
  • ggraptr – GUI for visualization (Rapid And Pretty Things in R)
  • ML Studio – interactive Shiny platform for data visualization, statistical modeling and machine learning

R & other software and languages

R & Excel

R & Python

R & SQL

  • sqldf – running SQL statements on R data frames

BACK TO TABLE OF CONTENTS


Join 262 other followers


R Help, Connect, & Inspiration


R Blogs


R Conferences, Events, & Meetups

R Jobs

BACK TO TABLE OF CONTENTS

Animated GIFs in R

Sometimes, it can be of interest to examine how two variables correlate over time. For example, how people in a social network (e.g., an organization) behave or move over the course of time. However, it can be hard to display multi-dimensional data in a single plot. Instead of including time as an additional dimension and providing stakeholders with complicated 3-D plots, ggplot2 now has a support package called gganimate, which allows you to create custom GIFs. Particularly helpful when you seek to demonstrate trends over time.

See this recent post by Analytics Vidhya for a tutorial on the implementation.