Tag: statistics

Talent Works: Data Science to improve Job Application Chances

Searching and applying for jobs can be a costly activity, requiring hours upon hours of perfecting your motivation letter and CV. Hence, it can be very frustrating to get ghosted (not receiving any reply) after applying for a job. Luckily, Talent Works is able to give us some general tips when it comes to improving the success of your applications. You might remember them from their Interactive Map of the US Job Market.

Using a sample of about 1600 job applications, Talent Works recently conducted all kinds of statistical analyses to look at the hiring process. For instance, they examined the time it takes to get from the application stage to your first day on the job. Split out by job type, it seems mechanical engineers spend quite a while in the interview stage, whereas software developers are put to work within three weeks.

[Figure: The number of days spent in each application stage per job (Talent.Works)]

In a different analysis, Talent Works examined how to minimize your risk of getting ghosted on a job application. For instance, they found that during the “Golden Hours” (the first 96 hours after a job gets posted), your chances of getting an invitation for an interview are up to 8 times higher than afterwards.

If you submit a job application in the first 96 hours, you’re up to 8x more likely to get an interview. After that, every day you wait reduces your chances by 28% (Talent.Works)

Based on the above they come to the following three timeframes in the application cycle:

“Golden Hours”: Applications submitted between 2-4 days after a job is posted have the highest chance of getting an interview. Not only is there a difference, there’s a big difference: you have up to an 8x higher chance of getting an interview during this period, even if you’re submitting the same application.
Twilight Zone: Chances quickly decrease from OK to really bad: every day you wait after the “Golden Hours” reduces your chances by 28%. The longer you wait, the higher the risk that employers have already checked their inboxes and set up interviews with candidates that met their “good enough”-bar.
Resume Blackhole: According to Talent.Works it’s hardly worth applying after 10 days. On average, job applications during this phase have a meager ~1.5% chance of getting an interview. Put another way, if you send out 50 job applications, you might hear back from one (if you’re lucky).
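
To get a feel for what a 28% daily drop means, here is a small back-of-the-envelope sketch in R. It assumes the drop compounds from day to day, which is my reading of the statement and not something Talent.Works spells out; the variable names are made up for illustration.

```r
# Assumption: each extra day multiplies your chances by (1 - 0.28),
# relative to an application sent within the "Golden Hours".
daily_drop <- 0.28
days_late <- 0:7
relative_chance <- (1 - daily_drop)^days_late

# Share of the "Golden Hours" chance that is left per day of waiting
data.frame(days_late, relative_chance = round(relative_chance, 2))
```

Under this assumption, waiting a full week leaves you with roughly a tenth of the chance you had during the “Golden Hours”.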

Next, Talent.Works investigated on a more granular level what the best time of day to apply for a job would be. This resulted in the following figure:

[Figure: The best time to apply for a job is between 6am and 10am. During this time, you have a 13% chance of getting an interview, nearly 5x your chance if you applied to the same job after work. Whatever you do, don’t apply after 4pm (Talent.Works)]

Again, they provide a summary of their conclusions:

The best time to apply for a job is between 6am and 10am. During this time, you have a 13% chance of getting an interview.
After that morning window, your interview odds start falling by 10% every 30 minutes. If you’re late, you’re going to pay dearly.
There’s a brief reprieve during lunchtime, where your odds climb back up to 11% at around 12:30pm but then start falling precipitously again.
The single-worst time to apply for a job is after work — if you apply at 7:30pm, you have less than a 3% chance of getting an interview.

If you want to see more, please visit Talent.Works. Here, you can let them process your CV and help you improve your hiring chances (see also this blog post).

Simpson’s Paradox: Two HR examples with R code.

Simpson (1951) demonstrated that a statistical relationship observed within a population—i.e., a group of individuals—could be reversed within all subgroups that make up that population. This phenomenon, where X seems to relate to Y in a certain way, but flips direction when the population is split for W, has since been referred to as Simpson’s paradox. Other names, according to Wikipedia, include the Simpson-Yule effect, reversal paradox or amalgamation paradox.

The most famous example has to be the seemingly gender-biased Berkeley admission rates:

“Examination of aggregate data on graduate admissions to the University of California, Berkeley, for fall 1973 shows a clear but misleading pattern of bias against female applicants. Examination of the disaggregated data reveals few decision-making units that show statistically significant departures from expected frequencies of female admissions, and about as many units appear to favor women as to favor men. If the data are properly pooled, taking into account the autonomy of departmental decision making, thus correcting for the tendency of women to apply to graduate departments that are more difficult for applicants of either sex to enter, there is a small but statistically significant bias in favor of women. […] The bias in the aggregated data stems not from any pattern of discrimination on the part of admissions committees, which seem quite fair on the whole, but apparently from prior screening at earlier levels of the educational system.” – part of abstract of Bickel, Hammel, & O’Connell (1975)

In a table, the effect becomes clear. While it seems as if women are rejected more often overall, women are actually less often rejected on a departmental level. Women simply applied to more selective departments more often (E & C below), resulting in the overall lower admission rate for women (35% as opposed to 44% for men).

[Image: Berkeley admissions table illustrating Simpson’s paradox. Copied from Bits of Pi]
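
You can reproduce the gist of this table yourself, because base R ships a version of these data as `UCBAdmissions`. Note that this built-in dataset only covers the six largest departments, so the aggregate rates it yields will differ somewhat from the 35% and 44% quoted above. The object names below are my own; treat this as a minimal sketch.

```r
# UCBAdmissions is a 3-way table from the datasets package: Admit x Gender x Dept
by_gender <- apply(UCBAdmissions, c(1, 2), sum)        # collapse over departments
overall_rate <- by_gender["Admitted", ] / colSums(by_gender)

admitted <- UCBAdmissions["Admitted", , ]              # Gender x Dept counts
applied <- apply(UCBAdmissions, c(2, 3), sum)          # Gender x Dept totals
dept_rate <- admitted / applied

round(overall_rate, 2)   # admission rate per gender, all departments pooled
round(dept_rate, 2)      # admission rate per gender within each department
```

Pooled over departments, men have the higher admission rate, yet within most of the individual departments it is women who are admitted more often.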

Examples in HR

Simpson’s paradox can easily occur in organizational or human resources settings as well. Let me run you through two illustrated examples that I simulated:

Assume you run a company of 1000 employees and you have asked all of them to fill out a Big Five personality survey. Per individual, you therefore have a score depicting his/her personality characteristic Neuroticism, which can run from 0 (not at all neurotic) to 7 (very neurotic). Now you are interested in the extent to which this Neuroticism of employees relates to their Job Performance (measured 0 – 100) and their Salary (measured in euros per year). In order to get a sense of the effects, you may decide to visualize both these relations in scatter plots:

From these visualizations it would look like Neuroticism relates significantly and positively to both employees’ performance and their yearly salary. Should you select more neurotic people to improve your overall company performance? Or are you discriminating against emotionally stable (non-neurotic) employees when it comes to salary?

Taking a closer look at the subgroups in your data, you might however find very different relationships. For instance, the positive relationship between neuroticism and performance may only apply to technical positions, but not to those employees in service-oriented jobs.

Similarly, splitting the employees by education level, it becomes clear that there is a relationship between neuroticism and education level that may explain the earlier association with salary. More educated employees receive higher salaries and within these groups, neuroticism is actually related to lower yearly income.

If you’d like to see the code used to simulate these data and generate the examples, you can find the R markdown file here on Rpubs.
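
Separately from that file, here is a minimal sketch of how such a reversal can be simulated. It is not the original simulation code: the group means, the within-group slope of -3000, the noise levels, and the seed are all made-up values chosen only to produce the pattern described above.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical setup: three education levels of 333 employees each
education <- rep(c("low", "medium", "high"), each = 333)
mean_neuro <- c(low = 2.0, medium = 3.5, high = 5.0)         # assumed group means
mean_salary <- c(low = 30000, medium = 45000, high = 60000)  # assumed group means

neuroticism <- rnorm(length(education), mean = mean_neuro[education], sd = 0.5)
# Within each group, higher neuroticism goes with LOWER salary
salary <- unname(mean_salary[education]) -
  3000 * (neuroticism - unname(mean_neuro[education])) +
  rnorm(length(education), sd = 2000)
hr <- data.frame(education, neuroticism, salary)

# Slope over all employees versus slope within each education level
overall_slope <- coef(lm(salary ~ neuroticism, data = hr))[["neuroticism"]]
within_slopes <- sapply(split(hr, hr$education), function(d) {
  coef(lm(salary ~ neuroticism, data = d))[["neuroticism"]]
})

overall_slope
within_slopes
```

Because the more educated groups score higher on both variables in this made-up setup, the pooled slope comes out positive even though every within-group slope is negative.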

Solving the paradox

Kievit and colleagues (2013) argue that Simpson’s paradox may occur in a wide variety of research designs, methods, and questions, particularly within the social and medical sciences. As such, they propose several means to “control” or minimize the risk of it occurring. The paradox may be prevented from occurring altogether by more rigorous research design: testing mechanisms in longitudinal or intervention studies. However, this is not always feasible. Alternatively, the researchers pose that data visualization may help recognize the patterns and subgroups and thereby diagnose paradoxes. This may be easy if your data looks like this:

But rather hard, or even impossible, when your data looks more like the below:

Clustering may nevertheless help to detect Simpson’s paradox when it is not directly observable in the data. To this end, Kievit and Epskamp (2012) have developed a tool to facilitate the detection of hitherto undetected patterns of association in existing datasets. It is written in R, a language specifically tailored for a wide variety of statistical analyses, which makes it very suitable for integration into the regular analysis workflow. As an R package, the tool is freely available and specializes in the detection of cases of Simpson’s paradox for bivariate continuous data with categorical grouping variables (also known as Robinson’s paradox), a very common inference type for psychologists. Finally, its code is open source and can be extended and improved upon depending on the nature of the data being studied.

One example of application is provided in the paper, for a dataset on coffee and neuroticism. A regression analysis would suggest a significant positive association between coffee and neuroticism overall. However, when the detection algorithm of the R package is applied, a different picture appears: the analysis shows that there are three latent clusters present and that the purported positive relationship only holds for one cluster whereas it is negative in the others.
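
To illustrate the idea of cluster-based detection without relying on the package itself, the sketch below uses base R's `kmeans()` as a stand-in for the package's own detection algorithm, which may well work differently. The data are simulated by me and are not the coffee dataset from the paper; all values are assumptions.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulated data: three latent groups, one with a positive and two with a
# negative within-group relationship between coffee and neuroticism
n <- 100
centers <- c(2, 5, 8)
slopes <- c(0.8, -0.8, -0.8)
true_group <- rep(1:3, each = n)
coffee <- rnorm(3 * n, mean = centers[true_group], sd = 0.5)
neuro <- centers[true_group] +
  slopes[true_group] * (coffee - centers[true_group]) +
  rnorm(3 * n, sd = 0.3)

# Regression on the pooled data
overall_slope <- coef(lm(neuro ~ coffee))[["coffee"]]

# Recover groups with k-means, then fit one regression per recovered cluster
km <- kmeans(cbind(coffee, neuro), centers = 3, nstart = 25)
cluster_slopes <- sapply(split(data.frame(coffee, neuro), km$cluster), function(d) {
  coef(lm(neuro ~ coffee, data = d))[["coffee"]]
})

overall_slope
cluster_slopes
```

Here the number of clusters is given up front; with real data you would have to estimate it, which is part of what makes a dedicated tool convenient.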

Update 24-10-2017: minutephysics – one of my favorite YouTube channels – uploaded a video explaining Simpson’s paradox very intuitively in a medical context:

Update 01-11-2017: minutephysics uploaded a follow-up video:

The paradox is that we remain reluctant to fight our biases, even when they are put in plain sight.

Data Science, Machine Learning, & Statistics resources (free courses, books, tutorials, & cheat sheets)

Welcome to my repository of data science, machine learning, and statistics resources. Software-specific material has to a large extent been listed under their respective overviews: R Resources & Python Resources. I also host a list of SQL Resources and datasets to practice programming. If you have any additions, please comment or contact me!

LAST UPDATED: 21-05-2018

Courses:

Video:

Books:

Sentiment Lexicons:

Cheatsheets:

Other:

Google Fonts – huge collection of text fonts
Checkmycolours.com – check whether your colours have enough contrast
Vischeck.com – check whether your images are colorblind-friendly
Coblis – Color Blind Simulation
Color Oracle – color blind simulation
Chrome color enhancer – customizable color filter for website browsing

Must read: Computer Age Statistical Inference (Efron & Hastie, 2016)

Statistics, and statistical inference in particular, are becoming an ever greater part of our daily lives. Models are trying to estimate anything from (future) consumer behaviour to optimal steering behaviours and we need these models to be as accurate as possible. Trevor Hastie is a great contributor to the development of the field, and I highly recommend the machine learning books and courses that he developed, together with Robert Tibshirani. These you may find in my list of R Resources (Cheatsheets, Tutorials, & Books).

Today I wanted to share another book Hastie wrote, together with Bradley Efron, another colleague of his at Stanford University. It is called Computer Age Statistical Inference (Efron & Hastie, 2016) and is a definite must-read for every aspiring data scientist because it illustrates most algorithms commonly used in modern-day statistical inference. Many of these algorithms Hastie and his colleagues at Stanford developed themselves, and the book covers, among others:

Regression:
- Logistic regression
- Poisson regression
- Ridge regression
- Jackknife regression
- Least angle regression
- Lasso regression
- Regression trees
Bootstrapping
Boosting
Cross-validation
Random forests
Survival analysis
Support vector machines
Kernel smoothing
Neural networks
Deep learning
Bayesian statistics

R resources (free courses, books, tutorials, & cheat sheets)

Help yourself to these free books, tutorials, packages, cheat sheets, and many more materials for R programming. There’s a separate overview for handy R programming tricks. If you have additions, please comment below or contact me!

LAST UPDATED: 2021-09-24

Table of Contents (clickable)

Beginner
Advanced
Cheat sheets
Data manipulation
Data visualization
Dashboards & Shiny
Markdown
Database connections
Machine learning
Text mining
Geospatial analysis
Bioinformatics
R IDEs
Software & language connections
Help
Blogs
Conferences, Events, & Groups
Jobs
Other tips & tricks

Completely new to R? → Start learning here!

Introductory R

Introductory Books

Online Courses

Youtube R classes by Chris Bilder
37 Youtube R Tutorials by Flavio Azevedo***
Essential R tutorials by Gilad Feldman
Data Carpentry Social Science in R
Statistics and R, by Rafael Irizarry and Michael Love
Learn R via R-coder.com

Style Guides

Google’s R style guide
Tidyverse style guide by Hadley Wickham
Advanced R style guide by Hadley Wickham
R style guide for stat405 by Hadley Wickham
R style guide by Colin Gillespie
Best practices for R Coding by Arnaud Amsellem / The R Trader
The State of Naming Conventions in R (Bååth, 2012)
A guide for switching from base R to the tidyverse

BACK TO TABLE OF CONTENTS

Advanced R

Package Development

Mastering Software Development in R (Peng, Kross, & Anderson, 2017)
R Packages (Wickham & Bryan, ???)
rOpenSci Packages: Development, Maintenance, and Peer Review
How to develop good R packages (for open science) by Maëlle Salmon
Tutorial on creating R packages by Friedrich Leisch
Developing R Packages by Jeff Leek
Writing an R package from scratch by Hilary Parker
Write your own R package by STAT545
Making an R Package, by R.M. Ripley
Prepare your package for CRAN
Introduction to roxygen2 by Hadley Wickham
How to build package vignettes with knitr by Yihui Xie
knitr in a nutshell: a minimal tutorial by Karl Broman
Rtools: Building R for Windows by Brian Ripley, Duncan Murdoch, and Jeroen Ooms
devtools – tools to make an R developer’s life easier
roxygen2 – tools for describing functions in comments next to their definitions
Rd2roxygen – tools for converting Rd to roxygen documentation
testthat – tools that simplify the testing of R packages

Non-standard Evaluation

Functional Programming

Writing Functions in R by Hadley Wickham via DataCamp.com
R for Data Science chapters on Functions and Iteration (Grolemund & Wickham, 2018)***
Advanced R chapter on Functions (Wickham, 2014)
Lesson on writing, testing, and documenting custom functions by Software-Carpentry.org
User-defined R functions tutorial by Carlo Fanara via DataCamp.com
Functional programming lecture by Duke University
purrr tutorial by Jenny Bryan***
Intro to purrr tutorial by Emorie Beck
Learn purrr tutorial by Dan Ovando
purrr cheat sheet by RStudio

BACK TO TABLE OF CONTENTS

Cheat Sheets

Many of the above cheat sheets are hosted in the official RStudio cheat sheet overview.

Data Manipulation

Data Visualization

Colors

R Color Guide***
colourpicker – widget that allows users to choose colours
paletteer – comprehensive collection of color palettes in R***
ggplot2 colour guide***
Canva’s 100 color palette included in ggthemes::scale_color_canva
Wes Anderson color palettes
Multicolored annotated text in ggplot2 by Andrew Whitby & Visuelle Data
Picular.co – Google, but for colors

Interactive / HTML / JavaScript widgets

R HTML Widgets Gallery***
plotly – interactive plots
billboarder – easy interface to billboard.js, a JavaScript chart library based on D3
d3heatmap – interactive D3 heatmaps
altair – Vega-Lite visualizations via Python
DT – interactive tables
DiagrammeR – interactive diagrams (DiagrammeR cheat sheet)
dygraphs – interactive time series plots
formattable – formattable data structures
ggvis – interactive ggplot2
highcharter – interactive Highcharts plots
leaflet – interactive maps
metricsgraphics – interactive JavaScript bare-bones line, scatterplot and bar charts
networkD3 – interactive D3 network graphs
scatterD3 – interactive scatterplots with D3
rbokeh – interactive Bokeh plots
rCharts – interactive Javascript charts
rcdimple – interactive JavaScript bar charts and others
rglwidget – interactive 3d plots
threejs – interactive 3d plots and globes
visNetwork – interactive network graphs
wordcloud2 – interface to wordcloud2.js.
timevis – interactive timelines

ggplot2

Code examples of top-50 ggplot2 visualizations***
ggplot2 Cheatsheet by RStudio
ggplot2 Quick Reference Guide
ggplot2 Code Snippets
ggplot2 Code Snippets 2
Hitchhiker’s Guide to ggplot2 in R (Burchell & Vargas, 2016)
A practical introduction with R and ggplot2 (Healy, 2017)
Data Visualization: A practical introduction (Healy, 2018)
Complete ggplot2 Tutorial
Principles & Practice of Data Visualization CS631 at Oregon Health & Science University
Data visualization cheat sheet by RStudio with ggplot2
Setting custom ggplot themes with ggthemr
Creating custom, reproducible color palettes by Simon Jackson
Rearranging values within ggplot2 facets
Combine plots using patchwork or cowplot
esquisse – RStudio addin to interactively explore data with ggplot2 without coding

ggplot2 extensions

ggplot2 extensions overview***
ggthemes – plot style themes
hrbrthemes – opinionated, typographic-centric themes
ggmap – maps with Google Maps, Open Street Maps, etc.
ggiraph – interactive ggplots
gghighlight – highlight lines or values, see vignette
ggstance – horizontal versions of common plots
GGally – scatterplot matrices
ggalt – additional coordinate systems, geoms, etc.
ggbeeswarm – column scatter plots or violin scatter plots
ggforce – additional geoms, see visual guide
ggrepel – prevent plot labels from overlapping
ggraph – graphs, networks, trees and more
ggpmisc – photo-biology related extensions
geomnet – network visualization
ggExtra – marginal histograms for a plot
gganimate – animations, see also the gganimate wiki page
ggpage – pagestyled visualizations of text based data
ggpmisc – useful additional geom_* and stat_* functions
ggstatsplot – include details from statistical tests in plots
ggspectra – tools for plotting light spectra
ggnetwork – geoms to plot networks
ggpointdensity – cross between a scatter plot and a 2D density plot
ggradar – radar charts
ggsurvplot (survminer) – survival curves
ggseas – seasonal adjustment tools
ggthreed – (evil) 3D geoms
ggtech – style themes for plots
ggtern – ternary diagrams
ggTimeSeries – time series visualizations
ggtree – tree visualizations
treemapify – wilcox’s treemaps
seewave – spectrograms

Miscellaneous

coefplot – visualizes model statistics
circlize – circular visualizations for categorical data
clustree – visualize clustering analysis
quantmod – candlestick financial charts
dabestr– Data Analysis using Bootstrap-Coupled ESTimation
devoutsvg – an SVG graphics device (with pattern fills)
devoutpdf – a PDF graphics device
cartography – create and integrate maps in your R workflow
colorspace – HSL based color palettes
viridis – Matplotlib viridis color palette for R
munsell – Munsell color palettes for R
Cairo – high-quality display output
igraph – Network Analysis and Visualization
graphlayouts – new layout algorithms for network visualization
lattice – Trellis graphics
tmap – thematic maps
trelliscopejs – interactive alternative for facet_wrap
rgl – interactive 3D plots
corrplot – graphical display of a correlation matrix
googleVis – Google Charts API
plotROC – interactive ROC plots
extrafont – fonts in R graphics
rvg – produces Vector Graphics that allow further editing in PowerPoint or Excel
showtext – text using system fonts
animation – animated graphics using ImageMagick.
misc3d – 3d plots, isosurfaces, etc.
xkcd – xkcd style graphics
imager – CImg library to work with images
ungeviz – tools for visualizing uncertainty
waffle – square pie charts a.k.a. waffle charts
Creating spectrograms in R with hht, warbleR, soundgen, signal, seewave, or phonTools

BACK TO TABLE OF CONTENTS

Shiny, Dashboards, & Apps

Shiny Cheat Sheet by RStudio
Shiny Tutorial
A collection of links to Shiny applications that have been shared on Twitter.
Enterprise-ready dashboards with Shiny and databases
Several packages to upgrade your Shiny dashboards
More Shiny Resources by Rob Gilmore
More Shiny Resources for Statistics by Yingjie Hu
Building Shiny apps – an interactive tutorial by Dean Attali
Advanced Shiny tips & tricks by Dean Attali (version 2)
flexdashboard – dashboard creation simplified
colourpicker – widget that allows users to choose colours
brighter – toolbox with helpful functions for shiny development
DesktopDeployR – self-contained R-based desktop applications

Markdown & Other Output Formats

R Markdown cheat sheet by RStudio
R Markdown reference guide by RStudio
R Markdown Basics
R Markdown tutorial by RStudio
R Markdown gallery by RStudio
The knitr book (Xie, 2015)
Getting used to R, RStudio, and R Markdown (2016)
R Markdown: The Definitive Guide (Xie, Allaire, & Grolemund, 2018)
Introduction to R Markdown (Clark, 2018)
R Markdown for Scientists (Tierney, 2019)
R Markdown Tips and Tricks
Pimp my RMD by Holtz Yan
Pandoc syntax highlighting examples by Garrick Aden-Buie
Creating slides with R Markdown (Video) by Brian Caffo
Introduction to xaringan by Yihui Xie
A quick demonstration of xaringan
General Markdown cheat sheet
blogdown websites with R Markdown (Xie, Thomas, & Hill, 2018)
blogdown tutorials
How to build a website with blogdown in R, by Storybench
radix – online publication format designed for scientific and technical communication
A template RStudio project with data analysis and manuscript writing by Thomas Julou
Multiple reports from a single Markdown file (example 1) (example 2)

tidystats – automating updating of model statistics
papaja – preparing APA journal articles
blogdown – build websites with Markdown & Hugo
huxtable – create Excel, html, & LaTeX tables
xaringan – make slideshows via remark.js and markdown
summarytools – produces neat, quick data summary tables
citr – RStudio Addin to Insert Markdown Citations

Cloud, Server, & Database

Access and manage Google spreadsheets from R with googlesheets
Tutorial: Database Queries with R
Introduction to sparklyr by DataCamp
Running R on AWS
AWS EC2 Tutorial For Beginners
Using RStudio on Amazon EC2 under the Free Usage Tier
Getting started with databases using R, by RStudio
- RMySQL – connects to MySQL and MariaDB
- RPostgreSQL – connects to Postgres and Redshift.
- RSQLite – embeds a SQLite database.
- odbc – connects to many commercial databases via the open database connectivity protocol.
- bigrquery – connects to Google’s BigQuery.
- DBI – separates the connectivity to the DBMS into a “front-end” and a “back-end”.
- dbplot – leverages dplyr to process calculations of plot inside database
- dplyr – also works with remote on-disk data stored in databases
- tidypredict – run predictions inside the database

BACK TO TABLE OF CONTENTS

Statistical Modeling & Machine Learning

Books

Courses

Introduction to Statistical Learning*** at Stanford University by Trevor Hastie and Rob Tibshirani
Introduction to R for Data Science @Microsoft
Introduction to R for Data Science @FutureLearn by Hadley Wickham
PSY2002: Advanced Statistics at University of Toronto by Elizabeth Page-Gould
STAT 450/870: Regression Analysis at University of Nebraska-Lincoln by Chris Bilder
STAT 850: Computing Tools for Statisticians at University of Nebraska-Lincoln by Chris Bilder
STAT 873: Applied Multivariate Statistical Analysis at University of Nebraska-Lincoln by Chris Bilder
STAT 875: Categorical Data Analysis at University of Nebraska-Lincoln by Chris Bilder
STAT 950: Computational Statistics at University of Nebraska-Lincoln by Chris Bilder
Joint Statistical Meetings: Analysis of Categorical Data by Chris Bilder

Cheat sheets

Time series

CRAN Task View – TimeSeries
R xts cheat sheet
Forecasting: Principles and Practice (Hyndman & Athanasopoulos, 2017)
A little book of R for time series (tutorial)
ARIMA forecasting in R (6-part Youtube series)
Introduction to the tsfeatures package
Tutorials: Part 1, Part 2, Part 3, & Part 4 of tidy time series @Business-Science.io with tidyquant
Packages:
- xts – extensible time series
- tsfeatures – methods for extracting various features from time series data
- tidyquant – tidyverse-style financial analysis

Survival analysis

CRAN Task View – Survival
R survival analysis cheat sheet by Przemysław Biecek
Packages:
- survival – functionality for survival and hazard models
- ggsurvplot (survminer) – survival curves

Bayesian

Miscellaneous

corrr – easier correlation matrix management and exploration

BACK TO TABLE OF CONTENTS

Natural Language Processing & Text Mining

Text Mining Tutorial with tm
Tidy Text Mining (Silge & Robinson, 2017) with tidytext
Text Analysis with R for Students of Literature (Jockers, 2014)
Tidytext tutorials by computational journalism
21 Recipes for Mining Twitter Data (Rudis, 2017) with rtweet
Emil Hvitfeldt’s R-text-data GitHub repository
Course: Introduction to Text Analytics with R @DataScienceDojo
Course: Twitter Text Mining and Social Network Analysis (Zhao, 2016) @RDataMining with twitteR
Quantitative Analysis of Textual Data with quanteda cheat sheet by Stefan Müller and Kenneth Benoit
List of resources for NLP & Text Mining by Stephen Thomas
Packages — for an overview: CRAN Task View – Natural Language Processing:
- tm – text mining.
- tidytext – text mining using tidyverse principles
- quanteda – framework for quantitative text analysis
- gutenbergr – public domain works (free books to practice on)
- corpora – statistics and data sets for corpus frequency data.
- tau – Text Analysis Utilities
- Sentiment140 – headache-free sentiment analysis
- sentimentr – sentiment analysis using text polarity
- openNLP – sentence detector, tokenizer, pos-tagger, shallow and full syntactic parser, named-entity detector, and maximum entropy models with OpenNLP.
- cleanNLP – natural language processing via tidy data models
- RSentiment – English lexicon-based sentiment analysis with negation and sarcasm detection functionalities.
- RWeka – data mining tasks with Weka
- wordnet – a large lexical database of English with WordNet.
- stringi – language processing wrappers
- textcat – provides support for n-gram based text categorization.
- text2vec – text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), and similarities.
- lsa – Latent Semantic Analysis
- topicmodels – Latent Dirichlet Allocation (LDA) and Correlated Topics Models (CTM)
- lda – Latent Dirichlet Allocation and related models

Regular Expressions

R Regular Expression cheat sheet by Lise Vaudor
R Regular Expression cheat sheet
R Regular Expression cheat sheet (page 2) by RStudio
regexplain – interactive RStudio addin for regular expressions
Regular Expressions in R – Part 1: Introduction and base R functions
R Regular Expressions by Jon M. Calder in swirl()
R Regular Expression Video Tutorial by Roger Peng
General Regular Expression cheat sheet
General Regular Expression Video Tutorial by Roger Peng
General Regular Expression cheat sheet by OverAPI.com

BACK TO TABLE OF CONTENTS

Geographic & Spatial mapping

Making Maps with R (tutorial) with ggmap, maps, and mapdata
Importing OpenStreetMap data (tutorial) with osmar
Geocomputation with R (Lovelace, Nowosad, & Muenchow, 2018)
Spatial manipulation with Simple Features (sf) cheat sheet by Ryan Garnett

Bioinformatics & Computational Biology

BACK TO TABLE OF CONTENTS

Integrated Development Environments (IDEs) & Graphical User Interfaces (GUIs)

Descriptions mostly taken from their own websites:

RStudio*** – Open source and enterprise ready professional software
Jupyter Notebook*** – open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text across dozens of programming languages.
Microsoft R tools for Visual Studio – turn Visual Studio into a powerful R IDE
R Plugins for Vim, Emacs, and Atom editors
Rattle*** – GUI for data mining
esquisse – RStudio add-in to interactively explore and visualize data
R Analytic Flow – data flow diagram-based IDE
RKWard – easy to use and easily extensible IDE and GUI
Eclipse StatET – Eclipse-based IDE
OpenAnalytics Architect – Eclipse-based IDE
TinnR – open source GUI and IDE
DisplayR – cloud-based GUI
BlueSkyStatistics – GUI designed to look like SPSS and SAS
Deducer – GUI for everyone
R commander (Rcmdr) – easy and intuitive GUI
JGR – Java-based GUI for R
jamovi & jmv – free and open statistical software to bridge the gap between researcher and statistician
Exploratory.io – cloud-based data science focused GUI
Stagraph – GUI for ggplot2 that allows you to visualize and connect to databases and/or basic file types
ggraptr – GUI for visualization (Rapid And Pretty Things in R)
ML Studio – interactive Shiny platform for data visualization, statistical modeling and machine learning

R & other software and languages

R & Excel

BERT – Basic Excel R Toolkit
A Comprehensive Guide to Transitioning from Excel to R by Alyssa Columbus
readxl – package to load in Excel data
xlsx – package to read and write Excel data
rvg – produces Vector Graphics which can be modified in Excel
devoutpdf – a PDF graphics device
tidyxl – imports non-tabular (e.g., format) data from Excel files into R
unpivotr – unpivot complex and irregular data layouts in R
unheadr – handle data with embedded subheaders

R & Python

Python for R users
reticulate cheat sheet by RStudio
reticulate – tools for interoperability between Python and R

R & SQL

sqldf – running SQL statements on R data frames

BACK TO TABLE OF CONTENTS

R Help, Connect, & Inspiration

RStudio Community
R help mailing list
R seek – search engine for R-related websites
R site search – search engine for help files, manuals, and mailing lists
Nabble – mailing list archive and forum
R User Groups & Conferences
R for Data Science Online Learning Community
Stack Overflow – a FAQ for all your R struggles (programming)
Cross Validated – a FAQ for all your R struggles (statistics)
CRAN Task Views – discover new packages per topic
The R Journal – open access, refereed journal of R
Twitter: #rstats, RStudio, Hadley Wickham, Yihui Xie, Mara Averick, Julia Silge, Jenny Bryan, David Smith, Hilary Parker, R-bloggers
Facebook: R Users Psychology
Youtube: Ben Lambert, Roger Peng
Reddit: rstats, rstudio, statistics, machinelearning, dataisbeautiful

R Blogs

R Conferences, Events, & Meetups

R Jobs

BACK TO TABLE OF CONTENTS

Gradient Descent 101

Gradient Descent is, in essence, a simple optimization algorithm. In the context of this post, it searches for the intercept and slope of the straight line that best fits the observed data, meaning the line that results in the smallest error. It is one way to arrive at the linear functions we get taught in university statistics courses; however, many of us will finish our Master’s (business) degree without having heard the term. Hence, this blog.

Linear regression is among the simplest and most frequently used supervised learning algorithms. It reduces observed data to a linear function (Y = a + bX) in order to retrieve a set of general rules, or to predict the Y-values for instances where the outcome is not observed.

One can define various linear functions to model a set of data points (e.g. below). However, each of these may fit the data better or worse than the others. How can you determine which function fits the data best? Which function is an optimal representation of the data? Enter Gradient Descent. By iteratively testing values for the intercept (a; where the line crosses the Y-axis (X = 0)) and the gradient (b; the slope of the line; the difference in Y when X increases by 1) and comparing the resulting predictions against the actual data, Gradient Descent finds the optimal values for the intercept and the slope. These optimal values can be found because they result in the smallest difference between the predicted values and the actual data – the least error.

[Image: linear regression plot in R]

The video below is part of a Coursera machine learning course of Stanford University and it provides a very intuitive explanation of the algorithm and its workings:

A recent blog demonstrates how you could program the gradient descent algorithm in R yourself. Indeed, the code copied below provides the same results as the linear modelling function in R’s base environment.

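gradientDesc <- function(x, y, learn_rate, conv_threshold, n, max_iter) {  # signature inferred from the call below
  # [... the start of the function body, up to the max_iter check, is missing from this copy ...]
    if (iterations > max_iter) {
      abline(c, m)
      converged = T
      return(paste("Optimal intercept:", c, "Optimal slope:", m))
    }
  }
}

# compare resulting coefficients
coef(lm(mpg ~ disp, data = mtcars))
gradientDesc(x = mtcars$disp, y = mtcars$mpg, learn_rate = 0.0000293, conv_threshold = 0.001, n = 32, max_iter = 2500000)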

Although gradient descent may in general end up in a so-called “local optimum”, a set of values (a & b) that fits best only within a limited region of the values tried, such issues can be handled but deserve a separate discussion.
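
For readers who want something they can run right away, below is a minimal sketch of gradient descent for a simple linear regression. It is my own illustration and not the code from the blog mentioned above: the function name, the default learning rate, and the stopping rule are all assumptions. One deliberate design choice is that the predictor is standardized first, so that a moderate learning rate converges within a few hundred iterations; the coefficients are then converted back to the original scale.

```r
# Minimal gradient descent for y = a + b * x, minimizing the mean squared error.
# All names and defaults here are illustrative assumptions.
gradient_descent <- function(x, y, learn_rate = 0.1, tol = 1e-12, max_iter = 10000) {
  a <- 0  # intercept, starting value
  b <- 0  # slope, starting value
  n <- length(y)
  mse <- mean((y - (a + b * x))^2)
  for (iter in seq_len(max_iter)) {
    error <- (a + b * x) - y                         # prediction errors
    a <- a - learn_rate * (2 / n) * sum(error)       # step along dMSE/da
    b <- b - learn_rate * (2 / n) * sum(error * x)   # step along dMSE/db
    mse_new <- mean((y - (a + b * x))^2)
    if (mse - mse_new < tol) break                   # stop once the error no longer improves
    mse <- mse_new
  }
  c(intercept = a, slope = b)
}

# Standardize the predictor so both coefficients converge at a similar speed
disp_mean <- mean(mtcars$disp)
disp_sd <- sd(mtcars$disp)
gd_std <- gradient_descent((mtcars$disp - disp_mean) / disp_sd, mtcars$mpg)

# Convert the coefficients back to the original scale of disp
gd <- c(intercept = gd_std[["intercept"]] - gd_std[["slope"]] * disp_mean / disp_sd,
        slope = gd_std[["slope"]] / disp_sd)

# Compare with R's built-in linear model
ols <- coef(lm(mpg ~ disp, data = mtcars))
rbind(gradient_descent = gd, lm = ols)
```

Without the standardization step, the very different scales of the intercept and of disp force a tiny learning rate and a very large number of iterations.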
