Sentiment Analysis: Analyzing Lexicon Quality and Estimation Errors

Sentiment analysis is a topic I cover regularly, for instance, with regard to Harry Plotter, Stranger Things, or Facebook. Usually I stick to the three sentiment dictionaries (i.e., lexicons) included in the tidytext R package (Bing, NRC, and AFINN), but there are many more one could use. Heck, I’ve even tried building one myself using a synonym/antonym network (unsuccessful, though a nice challenge). Two lexicons that did become famous are SentiWordNet, accessible via the lexicon R package, and the Loughran lexicon, designed specifically for the analysis of shareholder reports.

Josh Yazman did the world a favor and compared the quality of the five lexicons mentioned above. He assessed their validity against the millions of restaurant reviews in the Yelp dataset, which includes both textual reviews and 1 to 5 star ratings. Here’s a summary of Josh’s findings, including two visualizations (read Josh’s full blog + details here):

  • NRC overestimates the positive sentiment.
  • AFINN also provides overly positive estimates, but to a lesser extent.
  • Loughran seems unreliable altogether (on Yelp data).
  • Bing estimates are accurate as long as texts are long enough (e.g., 200+ words).
  • SentiWordNet’s estimates are mostly valid and precise, also on shorter texts, but may include minor outliers.
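
To make this kind of comparison concrete, here is a minimal Python sketch of how a lexicon can be checked against star ratings. It is not Josh's actual code (his analysis was done on the full Yelp dataset). The tiny lexicon, the example reviews, and the function names are all made up for illustration.

```python
import re
from statistics import mean

# Toy lexicon for illustration only; real lexicons (Bing, AFINN, ...) are far larger.
TOY_LEXICON = {"great": 1, "delicious": 1, "friendly": 1, "bad": -1, "rude": -1, "cold": -1}

# Made-up (text, stars) pairs standing in for Yelp reviews.
REVIEWS = [
    ("Great food and friendly staff", 5),
    ("Delicious pasta but cold bread", 4),
    ("Cold food and rude staff", 1),
    ("Bad service, rude waiter, cold soup", 1),
    ("Friendly staff, great and delicious pizza", 5),
]

def sentiment_score(text, lexicon):
    """Mean lexicon score over the words found in the lexicon (0.0 if none match)."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[w] for w in words if w in lexicon]
    return mean(hits) if hits else 0.0

def mean_score_by_rating(reviews, lexicon):
    """Group reviews by star rating and average their sentiment scores."""
    grouped = {}
    for text, stars in reviews:
        grouped.setdefault(stars, []).append(sentiment_score(text, lexicon))
    return {stars: mean(scores) for stars, scores in sorted(grouped.items())}

by_rating = mean_score_by_rating(REVIEWS, TOY_LEXICON)
for stars, score in by_rating.items():
    print(f"{stars} stars: mean sentiment {score:+.2f}")
```

A valid lexicon should produce scores that rise with the star rating. A lexicon that scores 1-star reviews as clearly positive would be overestimating sentiment in the way described for NRC above.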

Sentiment scores by Yelp rating, estimated using each lexicon. [original]
The average sentiment score estimated using lexicons, where words are randomly sampled from the Yelp dataset. Note that, although both NRC and Bing scores are relatively positive on average, they also demonstrate a larger spread of scores (which is a good thing if you assume that reviews vary in terms of sentiment). [original]
On a more detailed level, David Robinson demonstrated how to uncover performance errors or quality issues in lexicons, in his 2016 blog on the AFINN lexicon. Using only the most common words (i.e., used in 200+ reviews for at least 10 businesses) of the same Yelp dataset, David visualized the inconsistencies between the AFINN sentiment lexicon and the Yelp ratings in two very smart and appealing ways:

Words’ AFINN sentiment score by the average rating of the reviews they are used in [original]
As the figure above shows, David found a strong positive correlation between the sentiment score assigned to words in the AFINN lexicon and the way they are used in Yelp reviews. However, there are some exceptions: words that do not have the same meaning in the lexicon as in the observed data. Examples of words that seem to cause errors are die and bomb (both have negative AFINN scores but are used in positive Yelp reviews) or, the other way around, joke and honor (positive AFINN scores but negative meanings on Yelp).

A graph of the frequency with which words are used in reviews, by the average rating of the reviews they occur in, colored for their AFINN sentiment score [original]
With the graph above, it is easy to see which words cause inaccuracies. Blue words should be in the upper section of this visual while red words should be closer to the bottom. If this is not the case, a word likely has a different meaning in the lexicon than it has on Yelp. These lexicon-data differences become increasingly important as words are located closer to the right side of the graph, which means they more frequently screw up your sentiment estimates. For instance, fine, joke, fuck and hope cause much overestimation of positive sentiment, while the lexicon does not capture the positive sentiment that fresh carries on Yelp, and die causes many negative errors.
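
The same diagnostic can be sketched without a plot. The Python snippet below is a rough illustration, not David's code (his analysis was done in R on the full dataset). The scores are AFINN-style but invented, the reviews are made up, and the cut-offs (`min_reviews`, `neutral_rating`) are assumptions you would tune to your own data.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical AFINN-style scores (-5..+5); illustrative values, not copied from AFINN.
TOY_SCORES = {"die": -3, "joke": 2, "awful": -3, "amazing": 4}

# Made-up (text, stars) pairs standing in for Yelp reviews.
REVIEWS = [
    ("The brisket is to die for", 5),
    ("Tacos to die for, so fresh", 5),
    ("Fresh and amazing sushi", 5),
    ("The service was a joke", 1),
    ("What a joke, awful food", 1),
    ("Awful wait but fresh salad", 3),
]

def word_stats(reviews, scores, min_reviews=2):
    """Per scored word: how many reviews it occurs in and their mean star rating."""
    ratings = defaultdict(list)
    for text, stars in reviews:
        # set() so a word counts once per review, however often it is repeated
        for word in set(re.findall(r"[a-z']+", text.lower())):
            if word in scores:
                ratings[word].append(stars)
    return {
        w: {"n": len(r), "avg_rating": mean(r), "score": scores[w]}
        for w, r in ratings.items()
        if len(r) >= min_reviews  # keep only reasonably common words
    }

def flag_mismatches(stats, neutral_rating=3.0):
    """Words whose lexicon sign disagrees with the ratings they occur in.

    Sorted most frequent first, because frequent words distort estimates the most.
    """
    flagged = [
        w for w, s in stats.items()
        if (s["score"] > 0 and s["avg_rating"] < neutral_rating)
        or (s["score"] < 0 and s["avg_rating"] > neutral_rating)
    ]
    return sorted(flagged, key=lambda w: (-stats[w]["n"], w))

stats = word_stats(REVIEWS, TOY_SCORES)
print(flag_mismatches(stats))
```

On this toy data the flagged words are die and joke, mirroring the examples above. Because Yelp ratings skew positive, the overall mean rating may be a more sensible midpoint than 3.0. Words missing from the lexicon altogether (such as fresh here) need a separate check.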

TL;DR: Sentiment lexicons vary in terms of their quality/performance. If your texts run to a few hundred words, you might be best off using Bing (tidytext). In other cases, including shorter texts, opt for SentiWordNet (lexicon), which considers a broader vocabulary. If possible, try to evaluate inaccuracies, outliers, and/or prediction errors via data visualizations.

EAWOP 2017 – Takeaways

Last week, I attended the 2017 conference of the European Association of Work and Organizational Psychology (EAWOP), which was hosted by University College Dublin. There were many interesting sessions, the venue was amazing, and Dublin is a lovely city. Personally, I mostly enjoyed the presentations on selection and assessment test validity, and below are my main takeaways:

  • Professor Stephen Woods gave a most interesting presentation on the development of a periodic table of personality. You can find the related 2016 JAP article here. Woods compares the most commonly used personality indices, “plotting” each scale on a two-dimensional circumplex of the most strongly related Big-Five OCEAN scales. This creates a structure that closely resembles a periodic table, with which he demonstrates which elements of personality are well-researched and which require more scholarly attention. In the presentation, Woods furthermore reviewed the relationships of several of these elements with job-related outcomes. You can find the abstracts of the larger personality & analytics symposium here.
  • One of the symposia focused on social desirability, impression management, and faking behaviors in personality measurement. The first presentation, by Patrick Dunlop, elaborated on the various ways in which to measure faking, such as with bogus items, social desirability scales, or by measuring blatant extreme responses. Dunlop’s example study on repeat applicants to firefighter positions was highly amusing. Second, Nicolas Roulin demonstrated how the perceived competitive climate in organizations can cause applicants to positively inflate most of their personality scores, with the exception of their self-reported Extraversion and Conscientiousness, which seemed quite stable no matter the perceived competitiveness. Third, Pelt (Ph.D. at Erasmus University and IXLY) demonstrated how (after some statistical corrections) the level of social desirability in personality tests can be reduced by using forced-choice instead of Likert scales. If practitioners catch on, this will likely become the new status quo. The fourth presentation was also highly relevant, proposing to use items that are less biased in their formulation towards specific personality traits: Extraversion is often promoted, whereas items on Introversion inherently have negative connotations (e.g., “shyness”). Fifth and most interestingly, Van der Linden (also Erasmus) showed how a higher-order factor analysis on the Big-Five OCEAN scales results in a single factor of personality, commonly referred to as the Big-One or the general factor of personality. This one factor could represent some sort of social desirability, but according to meta-analytical results presented by Van der Linden, the factor correlates .88 with emotional intelligence! Moreover, it consistently predicts performance behaviors (also as rated by supervisors or in 360 assessments) better than the Big-Five factors separately, with only Conscientiousness retaining some incremental validity.
You can find the abstracts and the author details of the symposium here.


  • Schäpers (Free University Berlin) demonstrated with three independent experiments that the situational or contextual prompts in a situational judgment test (SJT) do not matter for its validity. In other words, excluding the work-related critical incidents before the item did not affect the test's validity: not in relation to general mental ability, personality dimensions, emotional intelligence, or job performance. Actually, the validity improved a little for certain outcomes. These results suggest that SJTs may measure something completely different from what was previously assumed. Schäpers found similar effects for written and video-based SJTs. The abstract of Schäpers’ paper can be found here.
  • Finally, assessment vendor cut-e was the main sponsor of the conference. Among other things, they presented their new tool chatAssess, which brings SJTs to a mobile environment. Via this link (https://maptq.com/default/home/nl/start/2tkxsmdi) you can run a demo using the password demochatassess. The abstract of this larger session on game-based assessment can be found here.


The rest of the 2017 EAWOP program can be viewed here.