Tag: data

# Understanding Data Distributions

Having trouble understanding how to interpret distribution plots? Or struggling with Q-Q plots? Sven Halvorson penned down a visual tutorial explaining distributions using visualisations of their quantiles.

Because each slice of the distribution is 5% of the total area and the height of the graph is changing, the slices have different widths. It’s like we’re trying to cut a strange shaped cake into 20 equal pieces using parallel cuts. The slices at the center must be thinner since the distribution is denser (taller) than on the edges.

Sven on distribution signatures

Here is the plot of matching the quantiles of the chi-squared(4) and normal distributions. I’ve again plotted these quantiles over 98% of each distribution’s range. The chi-squared distribution is skewed so its quantiles are packed into a smaller portion of its axis.

What is this graph telling us? It shows that the exchange rate between the quantiles of the two distributions is not constant.

Sven on distribution signatures

Here’s the link to the original article, and the R markdown code on github to generate the webpage.

# What Every Programmer Needs To Know About Encodings

Kunststube wrote this great introduction to text encoding. Ever wondered why your Word document sometimes starts with `ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔÇµÇ≠Ç»Ç¢`? Well, encoding‘s why. Kunststube introduces you to the wonderful world of ASCII, WLatin, Mac Latin, and UTF-8, -16 and -32.

Read the original articla via http://kunststube.net/encoding/

# People Analytics: Is nudging goed werkgeverschap of onethisch?

In Dutch only:

Voor Privacyweb schreef ik onlangs over people analytics en het mogelijk resulterende nudgen van medewerkers: kleine aanpassingen of duwtjes die mensen in de goede richting zouden moeten sturen. Medewerkers verleiden tot goed gedrag, als het ware. Maar wie bepaalt dan wat goed is, en wanneer zouden werkgevers wel of niet mogen of zelfs moeten nudgen?

# Mathematical aRt

Marcus Volz is a research fellow at the University of Melbourne, studying geometric networks, optimisation and computational geometry. He’s interested in visualisation, and always looking for opportunities to represent complex information in novel ways to accelerate learning and uncover the unexpected.

One of Marcus’ hobbies is the visualization of mathematical patterns and statistical algorithms via R. He has a whole portfolio full of them, including a Github page with all the associated R code. For my recent promotion, my girlfriend asked Marcus to generate a K-nearest neighbors visual and she had it printed on a large canvas.

The picture contains about 10.000 points, randomly uniformly distributed across x and y, connected by lines with their closest other points. Marcus shared the code to generate such k-nearest neighbor algorithm plots here on Github. So if you know your way around R, you could make your own version:

```#' k-nearest neighbour graph
#'
#' Computes a k-nearest neighbour graph for a given set of points. Refer to the \href{https://en.wikipedia.org/wiki/Nearest_neighbor_graph}{Wikipedia article} for details.
#' @param points A data frame with x, y coordinates for the points
#' @param k Number of neighbours
#' @keywords nearest neightbour graph
#' @export
#' @examples
#' k_nearest_neighbour_graph()

k_nearest_neighbour_graph <- function(points, k=8) {
get_k_nearest <- function(points, ptnum, k) {
xi <- points\$x[ptnum]
yi <- points\$y[ptnum]     points %>%
dplyr::mutate(dist = sqrt((x - xi)^2 + (y - yi)^2)) %>%
dplyr::arrange(dist) %>%
dplyr::filter(row_number() %in% seq(2, k+1)) %>%
dplyr::mutate(xend = xi, yend = yi)
}

1:nrow(points) %>%
purrr::map_df(~get_k_nearest(points, ., k))
}```

Those less versed in R can use Marcus package `mathart`. With this package, Marcus shares many more visual depictions of cool algorithms! You can install the package and several dependencies with the following lines of code:

```install.packages(c("devtools", "mapproj", "tidyverse", "ggforce", "Rcpp"))
devtools::install_github("marcusvolz/mathart")
devtools::install_github("marcusvolz/ggart")```

Subsequently, you can visualize all kinds of cool stuff, like for instance rapidly exploring random trees (see this Wikipedia article for details):

```# Generate rrt edges
set.seed(1)
df <- rapidly_exploring_random_tree() %>% mutate(id = 1:nrow(.))

# Create plot
ggplot() +
geom_segment(aes(x, y, xend = xend, yend = yend, size = -id, alpha = -id), df, lineend = "round") +
coord_equal() +
scale_size_continuous(range = c(0.1, 0.75)) +
scale_alpha_continuous(range = c(0.1, 1)) +
theme_blankcanvas(margin_cm = 0)
```

This k-d tree (see this Wikipedia article for details) is also amazing:

```result <- kdtree(mathart::points)

ggplot() +
geom_segment(aes(x, y, xend = xend, yend = yend), result) +
coord_equal() +
xlim(0, 10000) + ylim(0, 10000) +
theme_blankcanvas(margin_cm = 0)```

This page of Marcus’ mathart Github repository contains the code exact code for these and many other visualizations of algorithms and statistical phenomena. Do check it out if you’re interested!

Also, check out the “Fun” section of my R tips and tricks list for more cool visuals you can generate in R!

# Privacy, Compliance, and Ethical Issues with Predictive People Analytics

November 9th 2018, I defended my dissertation on data-driven human resource management, which you can read and download via this link. On page 149, I discuss several of the issues we face when implementing machine learning and analytics within an HRM context. For the references and more detailed background information, please consult the full dissertation. More interesting reads on ethics in machine learning can be found here.

# Privacy, Compliance, and Ethical Issues

Privacy can be defined as “a natural right of free choice concerning interaction and communication […] fundamentally linked to the individual’s sense of self, disclosure of self to others and his or her right to exert some level of control over that process” (Simms, 1994, p. 316). People analytics may introduce privacy issues in many ways, including the data that is processed, the control employees have over their data, and the free choice experienced in the work place. In this context, ethics would refer to what is good and bad practice from a standpoint of moral duty and obligation when organizations collect, analyze, and act upon HRM data. The next section discusses people analytics specifically in light of data privacy, legal boundaries, biases, and corporate social responsibility and free choice.

## Data Privacy

Technological advancements continue to change organizational capabilities to collect, store, and analyze workforce data and this forces us to rethink the concept of privacy (Angrave et al., 2016; Bassi, 2011; Martin & Freeman, 2003). For the HRM function, data privacy used to involve questions such as “At what team size can we use the average engagement score without causing privacy infringements?” or “How long do we retain exit interview data?” In contrast, considerably more detailed information on employees’ behaviors and cognitions can be processed on an almost continuous basis these days. For instance, via people analytics, data collected with active monitoring systems help organizations to improve the accuracy of their performance measurement, increasing productivity and reducing operating costs (Holt, Lang, & Sutton, 2016). However, such systems seem in conflict with employees’ right to solitude and their freedom from being watched or listened to as they work (Martin & Freeman, 2003) and are perceived as unethical and unpleasant, affecting employees’ health and morale (Ball, 2010; Faletta, 2014; Holt et al., 2016; Martin & Freeman, 2003; Sánchez Abril, Levin, & Del Riego, 2012). Does the business value such monitoring systems bring justify their implementation? One could question whether business value remains when a more long-term and balanced perspective is taken, considering the implications for employee attraction, well-being, and retention. These can be difficult considerations, requiring elaborate research and piloting.

Faletta (2014) asked American HRM professionals which of 21 data sources would be appropriate for use in people analytics. While some were considered appropriate from an ethical perspective (e.g., performance ratings, demographic data, 360-degree feedback), particularly novel data sources were considered problematic: data of e-mail and video surveillance, performance and behavioral monitoring, and social media profiles and messages. At first thought, these seem extreme, overly intrusive data that are not and will not be used for decision-making. However, in reality, several organizations already collect such data (e.g., Hoffmann, Hartman, & Rowe, 2003; Roth et al., 2016) and they probably hold high predictive value for relevant business outcomes. Hence, it is not inconceivable that future organizations will find ways to use these data for personnel-related decisions – legally or illegally. Should they be allowed to? If not, who is going to monitor them? What if the data are used for mutually beneficial goals – to prevent problems or accidents? These and other questions deserve more detailed discussion by scholars, practitioners, and governments – preferably together.

## Legal Boundaries

Although HRM professionals should always ensure that they operate within the boundaries of the law, legal compliance does not seem sufficient when it comes to people analytics. Frequently, legal systems are unprepared to defend employees’ privacy against the potential invasions via the increasingly rigorous data collection systems (Boudreau, 2014; Ciocchetti, 2011; Sánchez Abril et al., 2012). Initiatives such as the General Data Protection Regulation in the European Union somewhat restore the power balance, holding organizations and their HRM departments accountable to inform employees what, why, and how personal data is processed and stored. The rights to access, correct, and erase their information is returned to employees (GDPR, 2016). However, such regulation may not always exist and, even if it does, data usage may be unethical, regardless of its legality.

For instance, should organizations use all personnel data for which they have employee consent? One could argue that there are cases where the power imbalance between employers and employees negates the validity of consent. For instance, employees may be asked to sign written elaborate declarations or complex agreements as part of their employment, without being fully aware of what they consent to. Moreover, employees may feel pressured to provide consent in fear of losing their job, losing face, or peer pressure. Relatedly, employees may be incentivized to provide consent because of the perks associated with doing so, without fully comprehending the consequences. For instance, employees may share access to personal behavioral data in exchange for mobile devices, wellness, or mobility benefits, in which case these direct benefits may bias their perception and judgement. In such cases, data usage may not be ethically responsible, regardless of the legal boundaries, and HRM departments in general and people analytics specialists in specific should take the responsibility to champion the privacy and the interests of their employees.

## Automating Historic Biases

While ethics can be considered an important factor in any data analytics project, it is particularly so in people analytics projects. HRM decisions have profound implications in an imbalanced relationship, whereas the data within the HRM field often suffer from inherent biases. This becomes particularly clear when exploring applications of predictive analytics in the HRM domain.

For example, imagine that we want to implement a decision-support system to improve the efficiency of our organization’s selection process. A primary goal of such a system could be to minimize the human time (both of our organizational agents and of the potential candidates) wasted on obvious mismatches between candidates and job positions. Under the hood, a decision-support system in a selection setting could estimate a likelihood (i.e., prediction) for each candidate that he/she makes it through the selection process successfully. Recruiters would then only have to interview the candidates that are most likely to be successful, and save valuable time for both themselves and for less probable candidates. In this way, an artificially intelligent system that reviews candidate information and recommends top candidates could considerably decrease the human workload and thereby the total cost of the selection process.

For legal compliance as well as ethical considerations, we would not want such a decision-support system to be biased towards any majority or minority group. Should we therefore exclude demographic and socio-economic factors from our predictive model? What about the academic achievements of candidates, the university they attended, or their performance on our selection tests? Some of those are scientifically validated predictors of future job performance (e.g., Hunter & Schmidt, 1998). However, they also relate to demographic and socio-economic factors and would therefore introduce bias (e.g., Hough, Oswald, & Ployhart, 2001; Pyburn, Ployhart, & Kravitz, 2008; Roth & Bobko, 2000). Do we include or exclude these selection data in our model?

Maybe the simplest solution would be to include all information, to normalize our system’s predictions within groups afterwards (e.g., gender), and to invite the top candidates per group for follow-up interviews. However, which groups do we consider? Do we only normalize for gender and nationality, or also for age and social class? What about combinations of these characteristics? Moreover, if we normalize across all groups and invite the best candidate within each, we might end up conducting more interviews than in the original scenario. Should we thus account for the proportional representation of each of these groups in the whole labor population? As you notice, both the decision-support system and the subject get complicated quickly.

Even more problematic is that any predictive decision-support system in HRM is likely biased from the moment of conception. HRM data is frequently infested with human biases as bias was present in the historic processes that generated the data. For instance, the recruiters in our example may have historically favored candidates with a certain profile, for instance, red hair. After training our decision-support system (i.e., predictive model) on these historic data, it will recognize and copy the pattern that candidates with red hair (or with correlated features, such as a Northwest European nationality) are more likely successful. The system thus learns to recommend those individuals as the top candidates. While this issue could be prevented by training the model on more objective operationalization of candidate success, most HRM data will include its own specific biases. For example, data on performance ratings will include not only the historic preferences of recruiters (i.e., only hired employees received ratings), but also the biases of supervisors and other assessors in the performance evaluation processes. Similar and other biases may occur in data regarding promotions, training courses, talent assessments, or compensation. If we use these data to train our models and systems, we would effectively automate our historic biases. Such issues greatly hinder the implementation of (predictive) people analytics without causing compliance and ethical issues.

## Corporate Social Responsibility versus Free Choice

Corporate social responsibility also needs to be discussed in light of people analytics. People analytics could allow HRM departments to work on social responsibility agendas in many ways. For instance, people analytics can help to demonstrate what causes or prevents (un)ethical behavior among employees, to what extent HRM policies and practices are biased, to what extent they affect work-life balance, or how employees can be stimulated to make decisions that benefit their health and well-being. Regarding the latter case, a great practical example comes from Google’s people analytics team. They uncovered that employees could be stimulated to eat more healthy snacks by color-coding snack containers, and that smaller cafeteria plate sizes could prevent overconsumption and food loss (ABC News, 2013). However, one faces difficult ethical dilemmas in this situation. Is it organizations’ responsibility to nudge employees towards good behavior? Who determines what good entails? Should employees be made aware of these nudges? What do we consider an acceptable tradeoff between free choice and societal benefits?

When we consider the potential of predictive analytics in this light, the discussion gets even more complicated. For instance, imagine that organizations could predict work accidents based on historic HRM information, should they be forbidden, allowed, or required to do so? What about health issues, such as stress and burnout? What would be an acceptable accuracy for such models? How do we feel about false positive and false negatives? Could they use individual-level information if that resulted in benefits for employees?

In conclusion, analytics in the HRM domain quickly encounters issues related to privacy, compliance, and ethics. In bringing (predictive) analytics into the HRM domain, we should be careful not to copy and automate the historic biases present in HRM processes and data. The imbalance in the employment relationship puts the responsibility in the hands of organizational agents. The general message is that what can be done with people analytics may differ from what should be done from a corporate social responsibility perspective. The spread of people analytics depends on our collective ability to harness its power ethically and responsibility, to go beyond the legal requirements and champion both the privacy as well as the interests of employees and the wider society. A balanced approach to people analytics – with benefits beyond financial gain for the organization – will be needed to make people analytics accepted by society, and not just another management tool.

# Tidy Missing Data Handling

A recent open access paper by Nicholas Tierney and Dianne Cook — professors at Monash University — deals with simpler handling, exploring, and imputation of missing values in data.They present new methodology building upon tidy data principles, with a goal to integrating missing value handling as an integral part of data analysis workflows. New data structures are defined (like the `nabular`) along with new functions to perform common operations (like `gg_miss_case`).

These new methods have bundled among others in the R packages `naniar` and `visdat`, which I highly recommend you check out. To put in the author’s own words:

The `naniar` and `visdat` packages build on existing tidy tools and strike a compromise between automation and control that makes analysis efficient, readable, but not overly complex. Each tool has clear intent and effects – plotting or generating data or augmenting data in some way. This reduces repetition and typing for the user, making exploration of missing values easier as they follow consistent rules with a declarative interface.

The below showcases some of the highly informational visuals you can easily generate with `naniar`‘s `nabular`s and the associated functionalities.

For instance, these heatmap visualizations of missing data for the airquality dataset. (A) represents the default output and (B) is ordered by clustering on rows and columns. You can see there are only missings in ozone and solar radiation, and there appears to be some structure to their missingness.

Another example is this upset plot of the patterns of missingness in the airquality dataset. Only Ozone and Solar.R have missing values, and Ozone has the most missing values. There are 2 cases where both Solar.R and Ozone have missing values.

You can also generate a histogram using nabular data in order to show the values and missings in Ozone. Values are imputed below the range to show the number of missings in Ozone and colored according to missingness of ozone (‘Ozone_NA‘). This displays directly that there are approximately 35-40 missings in Ozone.

Alternatively, scatterplots can be easily generated. Displaying missings at 10 percent below the minimum of the airquality dataset. Scatterplots of ozone and solar radiation (A), and ozone and temperature (B). These plots demonstrate that there are missings in ozone and solar radiation, but not in temperature.

Finally, this parallel coordinate plot displays the missing values imputed 10% below range for the oceanbuoys dataset. Values are colored by missingness of humidity. Humidity is missing for low air and sea temperatures, and is missing for one year and one location.

Please do check out the original open access paper and the CRAN vignettes associated with the packages!