# Logistic regression is not fucked, by Jake Westfall

Recently, I came across a social science paper that used linear probability regression. I had never heard of linear probability models (LPM), but an LPM seems to be simply ordinary least squares regression applied to a binary dependent variable.

According to some, LPM is a commonly used alternative to logistic regression, which is what I was taught to use when the outcome is binary.

Perhaps because of my own social science background (HRM), using linear regression without a link transformation on binary data seems very unintuitive and error-prone to me. Hence, I went looking for more information.
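To make that intuition concrete, here is a minimal sketch of my own (not taken from any of the sources mentioned here). It fits both models to a small, made-up dataset with a single predictor: the LPM through the closed-form OLS solution, and the logistic model through a few Newton-Raphson steps. The data, variable names, and number of iterations are all assumptions chosen for illustration.

```python
import math

# Hypothetical toy data: one predictor x and a binary outcome y.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]


def fit_lpm(x, y):
    """Linear probability model: plain OLS of y on x (closed form)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_logistic(x, y, iterations=25):
    """Logistic regression with one predictor, fitted by Newton-Raphson."""
    a = b = 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
        # Gradient of the log-likelihood with respect to (a, b)
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix, built from the weights p * (1 - p)
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system by hand
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


lpm_a, lpm_b = fit_lpm(x, y)
logit_a, logit_b = fit_logistic(x, y)

lpm_at_9 = lpm_a + lpm_b * 9
logit_at_9 = 1.0 / (1.0 + math.exp(-(logit_a + logit_b * 9)))

print("LPM prediction at x = 9:     ", round(lpm_at_9, 3))
print("Logistic prediction at x = 9:", round(logit_at_9, 3))
```

On this toy data the straight line of the LPM runs past 1 at the upper end of x, whereas the logistic prediction stays between 0 and 1 by construction. Whether that matters in practice depends on what the model is used for.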

I particularly liked this article by Jake Westfall, which he titled “Logistic regression is not fucked”, following a series of blog posts in which he discusses methods that are fucked and not useful.

Jake explains the classification problem and the inner workings of both methods in a very straightforward way, using great visual aids. He shows how LPM differs from logistic models, and why its proposed benefits are actually not so beneficial. Maybe I’m in my bubble, but Jake’s arguments resonated with me.

http://jakewestfall.org/blog/index.php/2018/03/12/logistic-regression-is-not-fucked/

Here’s the summary:
Arguments against the use of logistic regression due to problems with “unobserved heterogeneity” proceed from two distinct sets of premises. The first argument points out that if the binary outcome arises from a latent continuous outcome and a threshold, then observed effects also reflect latent heteroskedasticity. This is true, but only relevant in cases where we actually care about an underlying continuous variable, which is not usually the case. The second argument points out that logistic regression coefficients are not collapsible over uncorrelated covariates, and claims that this precludes any substantive interpretation. On the contrary, we can interpret logistic regression coefficients perfectly well in the face of non-collapsibility by thinking clearly about the conditional probabilities they refer to.
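
The non-collapsibility point in the summary can be illustrated with a little arithmetic. The sketch below is my own and not taken from Jake's post. It assumes a hypothetical logistic model with a binary predictor X and a binary covariate Z that is independent of X, and compares the log odds ratio for X conditional on Z with the one obtained after averaging over Z. All coefficient values are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    return math.log(p / (1.0 - p))


# Assumed (hypothetical) model: logit P(Y=1 | X, Z) = b0 + b1*X + b2*Z
b0, b1, b2 = -1.0, 1.0, 2.0
p_z = 0.5  # P(Z = 1) is the same for every X, so Z is independent of X


def marginal_prob(x):
    """P(Y=1 | X=x), averaging the conditional probabilities over Z."""
    return (1 - p_z) * sigmoid(b0 + b1 * x) + p_z * sigmoid(b0 + b1 * x + b2)


# Holding Z fixed, the log odds ratio for X is simply b1.
conditional_log_or = b1
# Ignoring Z, the log odds ratio for X is computed from the averaged probabilities.
marginal_log_or = log_odds(marginal_prob(1)) - log_odds(marginal_prob(0))

print("Conditional log odds ratio:", conditional_log_or)
print("Marginal log odds ratio:   ", round(marginal_log_or, 3))
```

With these made-up numbers the marginal log odds ratio comes out at about 0.80 instead of 1, even though Z is unrelated to X. Both numbers are correct; they simply refer to different conditional probabilities, which is the reading the summary argues for.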

# Time Series Analysis 101

A time series can be considered an ordered sequence of values of a variable at equally spaced time intervals. To model such data, one can use time series analysis (TSA). TSA accounts for the fact that data points taken over time may have an internal structure, such as autocorrelation, trend, or seasonal variation.
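
As a small illustration of such internal structure, the sketch below computes the sample autocorrelation of a made-up monthly series that consists of nothing but a 12-month cycle. The series and the function name are assumptions for the sake of the example.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag (lag >= 1)."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((v - mean) ** 2 for v in series)
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


# Hypothetical monthly series: six years of a pure 12-month seasonal cycle
series = [10 + 5 * math.sin(2 * math.pi * t / 12) for t in range(72)]

acf_6 = autocorrelation(series, 6)    # half a cycle apart
acf_12 = autocorrelation(series, 12)  # a full cycle apart

print("Autocorrelation at lag 6: ", round(acf_6, 3))
print("Autocorrelation at lag 12:", round(acf_12, 3))
```

Observations half a cycle apart are strongly negatively correlated and observations a full cycle apart strongly positively correlated, which is exactly the kind of dependence that methods assuming independent observations ignore.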

TSA has several purposes:

1. Description: Identify patterns in correlated data, such as trends and seasonal variations.
2. Explanation: These patterns may help in obtaining an understanding of the underlying forces and structure that produced the data.
3. Forecasting: In modelling the data, one may obtain accurate predictions of future (short-term) trends.
4. Intervention analysis: One can examine how (single) events have influenced the time series.
5. Quality control: Deviations in the time series may indicate problems in the process reflected by the data.

TSA has many applications, including:

• Economic Forecasting
• Sales Forecasting
• Budgetary Analysis
• Stock Market Analysis
• Yield Projections
• Process and Quality Control
• Inventory Studies
• Utility Studies
• Census Analysis
• Strategic Workforce Planning

AlgoBeans has a nice tutorial on implementing a simple TS model in Python. They explain and demonstrate how to deconstruct a time series into daily, weekly, monthly, and yearly trends, how to create a forecasting model, and how to validate such a model.
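
The sketch below is not the AlgoBeans code. It is a minimal, standard-library illustration of the same three steps (decompose, forecast, validate) under simplifying assumptions: a made-up daily series with a linear trend and a fixed weekly pattern, no noise, a centered moving average as the trend estimate, and a holdout of the last two weeks for validation.

```python
def moving_average(series, window):
    """Centered moving average (odd window); None where the window does not fit."""
    half = window // 2
    out = [None] * len(series)
    for t in range(half, len(series) - half):
        out[t] = sum(series[t - half:t + half + 1]) / window
    return out


def seasonal_effects(series, trend, period):
    """Average detrended value for each position in the seasonal cycle."""
    buckets = [[] for _ in range(period)]
    for t, (value, level) in enumerate(zip(series, trend)):
        if level is not None:
            buckets[t % period].append(value - level)
    return [sum(b) / len(b) for b in buckets]


# Hypothetical daily data: linear trend plus a weekly pattern that sums to zero
weekly = [3, 1, 0, -1, -2, -3, 2]
series = [100 + 0.5 * t + weekly[t % 7] for t in range(70)]

# Hold out the last two weeks for validation
train, test = series[:56], series[56:]

# Step 1: decompose the training data into trend and weekly effects
trend = moving_average(train, 7)
season = seasonal_effects(train, trend, 7)

# Step 2: forecast by extending the trend linearly and adding the weekly effect
valid = [t for t, level in enumerate(trend) if level is not None]
first, last = valid[0], valid[-1]
slope = (trend[last] - trend[first]) / (last - first)
forecast = [
    trend[last] + slope * (t - last) + season[t % 7]
    for t in range(len(train), len(series))
]

# Step 3: validate on the holdout with the mean absolute error
mae = sum(abs(f - a) for f, a in zip(forecast, test)) / len(test)
print("Estimated weekly effects:", [round(s, 2) for s in season])
print("Holdout MAE:", round(mae, 6))
```

Because the made-up series contains no noise, the decomposition recovers the weekly pattern and the forecast matches the holdout almost exactly. Real data will not behave this neatly, which is why the validation step matters.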

Analytics Vidhya hosts a more comprehensive tutorial on TSA in R. They elaborate on the concepts of a random walk and stationarity, and compare autoregressive and moving average models. They also provide some insight into the metrics one can use to assess TS models. This web tutorial runs through TSA in R as well, showing how to perform seasonal adjustments on the data. Although the datasets they use have limited practical value (for businesses), the stepwise introduction of the different models and their modelling steps may come in handy for beginners. Finally, business-science.io has four amazing posts on how to implement time series analysis in R following the tidyverse principles using the tidyquant package (Part 1; Part 2; Part 3; Part 4).
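
For readers who prefer to see these concepts in code first, here is a minimal simulation sketch in Python rather than R. It is not taken from any of the tutorials above. It simulates a random walk, an AR(1) process, and an MA(1) process from the same noise and compares their autocorrelations. The seed, series length, and coefficients (0.7 in both cases) are arbitrary assumptions.

```python
import random


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag (lag >= 1)."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((v - mean) ** 2 for v in series)
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


rng = random.Random(42)  # fixed seed so that runs are repeatable
n = 5000
noise = [rng.gauss(0, 1) for _ in range(n)]

# Random walk: each value is the previous value plus noise (non-stationary)
walk = []
level = 0.0
for e in noise:
    level += e
    walk.append(level)

# First differences of a random walk give back the stationary noise
diffed = [walk[t] - walk[t - 1] for t in range(1, n)]

# AR(1): the current value depends on the previous value
phi = 0.7
ar = [noise[0]]
for t in range(1, n):
    ar.append(phi * ar[-1] + noise[t])

# MA(1): the current value depends on the previous noise term
theta = 0.7
ma = [noise[0]] + [noise[t] + theta * noise[t - 1] for t in range(1, n)]

for name, s in [("random walk", walk), ("differenced", diffed),
                ("AR(1)", ar), ("MA(1)", ma)]:
    print(name, [round(autocorrelation(s, lag), 2) for lag in (1, 2, 3)])
```

In theory, the autocorrelation of an AR(1) process decays gradually (phi, phi squared, and so on), whereas that of an MA(1) process equals theta / (1 + theta^2) at lag 1 and zero afterwards. The random walk stays close to 1 at every lag shown, and differencing removes that dependence.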

# TACIT: An open-source Text Analysis, Crawling, and Interpretation Tool

The first programs for (scientific) text mining are already over 50 years old. More recent efforts, such as the Linguistic Inquiry and Word Count (LIWC; Tausczik & Pennebaker, 2010), have greatly improved our text analytical capabilities. Moreover, several single-purpose programs have been developed, which also consider syntactic text structures (e.g., Syntactic Complexity Analyzer [Lu, 2010], TAALES [Kyle & Crossley, 2015]). However, the widespread use of many of these programs has been hampered by two major barriers.

First, considerable technical expertise is required, which obstructs researchers without statistical backgrounds. For example, packages such as tm in R (Meyer et al., 2015) have been developed to conduct natural-language processing, but the steep learning curve forms a challenge. Additionally, the constant increase of computational processing power and the proliferation of new algorithms make it difficult for researchers to maintain working knowledge of state-of-the-art methods.

Second, most of the existing user-friendly NLP programs (and packages), such as RapidMiner (Akthar & Hahne, 2012), SAS Text Miner (Abell, 2014), or SPSS Modeler (IBM Corp., 2011), charge either a large software fee up front or a subscription fee. The cost of these programs can be prohibitive for junior researchers and researchers looking to integrate new techniques into their research toolbox.

In the attached article, TACIT is introduced: Text Analysis, Crawling, and Interpretation Tool. TACIT is an open-source architecture that establishes a pipeline between the various stages of text-based research by integrating tools for text mining, data cleaning, and analysis under a single user-friendly architecture. In addition to being prepackaged with a range of easily applied, cutting-edge methods, TACIT’s design also allows other researchers to write their own plugins.
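
To give a feel for what such a pipeline does, the sketch below chains a cleaning step, a tokenizing step, and a dictionary-based word count in plain Python. It is not TACIT's code or interface, and the two word categories are made up for illustration. They are not taken from LIWC or any other real dictionary.

```python
import re
from collections import Counter

# Hypothetical category dictionary, invented for this example
CATEGORIES = {
    "positive": {"happy", "great", "love"},
    "negative": {"sad", "awful", "hate"},
}


def clean(text):
    """Lowercase the text and replace everything except letters and spaces."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize(text):
    """Split cleaned text into words on whitespace."""
    return text.split()


def category_shares(text):
    """Share of all words in the text that fall into each category."""
    tokens = tokenize(clean(text))
    counts = Counter()
    for token in tokens:
        for category, words in CATEGORIES.items():
            if token in words:
                counts[category] += 1
    total = len(tokens)
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}


doc = "I LOVE this tool; it is great! But the old one was awful."
shares = category_shares(doc)
print(shares)
```

Each step takes the output of the previous one, so a step can be swapped out (for instance, a different tokenizer) without touching the rest. That is the same idea, on a much smaller scale, as a pipeline with plugins.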

The authors hope that TACIT can facilitate the integration and use of advancements in computational linguistics in psychological research and, by doing so, can help researchers make use of the ever-growing documents of our social discourse in ways that have previously not been possible.

# Expanding the methodological toolbox of HRM researchers

Update 26-10-2017: the paper has been published open access and is freely available here: http://onlinelibrary.wiley.com/doi/10.1002/hrm.21847/abstract.

The HR technology landscape is evolving rapidly and with it, the HR function is becoming more and more data-driven (though not fast enough, some argue). HRM research, however, is still characterized by a strong reliance on general linear models like linear regression and ANOVA. In our forthcoming article in the special issue on Workforce Analytics of Human Resource Management, my co-authors and I argue that HRM research would benefit from an outside-in perspective, drawing on techniques that are commonly used in fields other than HRM.

Our article first outlines how the current developments in the measurement of HRM implementation and employee behaviors and cognitions may cause the more traditional statistical techniques to fall short. Using the relationship between work engagement and performance as a worked example, we then provide two illustrations of alternative methodologies that may benefit HRM research:

1. Bathtub models: Using latent variables, bathtub models are put forward as the solution to examine multi-level mechanisms with outcomes at the team or organizational level without decreasing the sample size or neglecting the variation inherent in employees’ responses to HRM activities (see figure 1).
2. Optimal matching analysis: This technique is proposed as particularly useful to examine the longitudinal patterns that occur in repeated observations over a prolonged timeframe.

We describe both methods in a fair amount of detail, touching on elements such as the data requirements all the way up to the actual modeling steps and limitations.
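
As a rough illustration of the core idea behind optimal matching, the sketch below computes the edit distance between sequences of states by dynamic programming. It is not the code or the data from our article. The employee sequences, the state labels (H, M, L for high, medium, and low engagement), and the costs for insertion/deletion and substitution are all assumptions made for this example.

```python
def om_distance(seq_a, seq_b, indel=1.0, substitution=2.0):
    """Minimal total cost of turning seq_a into seq_b using insertions,
    deletions (cost: indel each) and substitutions (cost: substitution each)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel
    for j in range(1, cols):
        dist[0][j] = j * indel
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq_a[i - 1] == seq_b[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + indel,                              # deletion
                dist[i][j - 1] + indel,                              # insertion
                dist[i - 1][j - 1] + (0.0 if same else substitution),
            )
    return dist[-1][-1]


# Hypothetical engagement states per employee over six measurement moments
sequences = {
    "emp_a": "HHHHMM",
    "emp_b": "HHHMMM",
    "emp_c": "LLLLLL",
}

names = sorted(sequences)
distances = {
    (a, b): om_distance(sequences[a], sequences[b]) for a in names for b in names
}
for pair, d in distances.items():
    print(pair, d)
```

In this toy example, emp_a and emp_b end up close together (one substitution apart), while emp_c is far from both. In an actual analysis, a distance matrix like this is typically the input for a clustering step that groups similar trajectories, and the choice of costs deserves careful thought.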

I want to thank my co-authors and Shell colleagues Zsuzsa Bakk, Vasileios Giagkoulas, Linda van Leeuwen, and Esther Bongenaar for writing this, in my own biased opinion, wonderful article with me, and I hope you will enjoy reading it as much as we enjoyed writing it.