Benchmarking Sentiment Analysis Systems

Gone are the days when consumers depended on word of mouth from their near and dear ones before buying a product. Gen Y shoppers now largely turn to online reviews, not only to get a virtual look and feel of the product but also to understand its specifications and drawbacks. These reviews come from many sources, such as forum discussions, blogs, microblogs, Twitter and other social networks, and their sheer volume has led to the inception and rapid growth of sentiment analysis.

Sentiment analysis helps to understand the opinion of people towards a product or an issue. It has grown to be one of the most active research areas in Natural Language Processing (NLP), and it is also widely studied in data mining, web mining and text mining. In this post, we discuss techniques to evaluate and benchmark the sentiment analysis feature of NLP products.

8KMiles’ recent engagement with a leading cloud provider involved running some of the top products available in the market on different review datasets and assessing how effectively they classify each review as positive, negative or neutral. Our own team had tracked opinions across a very large number of movie reviews from IMDb and product reviews from Amazon and Yelp, and predicted sentiment polarity with very accurate results. Tweets are different from reviews because of their purpose: while reviews represent the summarized thoughts of their authors, tweets are more casual and limited to 140 characters of text. Because of this, accuracy results for tweets vary significantly from those for other datasets. A systematic approach to benchmarking the accuracy of sentiment polarity helps to reveal the strengths and weaknesses of various products under different scenarios. Here, we share some of the top-performing products and key information on how accuracy is evaluated for various NLP APIs and how a comparison report is prepared.

A wide range of products is available in the market; a few important ones are shown below along with the languages they support.

Sentiment analysis language support:

  • Google NL API: English, Spanish, Japanese
  • Microsoft Linguistic Analysis API: English, Spanish, French, Portuguese
  • IBM AlchemyAPI: English, French, Italian, German, Portuguese, Russian and Spanish
  • Stanford CoreNLP: English
  • Rosette Text Analytics: English, Spanish, Japanese
  • Lexalytics: English, Spanish, French, Japanese, Portuguese, Korean, etc.


Not all products return sentiment polarity directly. Some return a polarity label such as Positive, Negative or Neutral, whereas others return a score, and the score ranges have to be converted to a polarity before products can be compared. The following sections explain the results returned by some of the APIs.

Google’s NL API Sentiment Analyzer returns numerical score and magnitude values, which represent the overall attitude of the text. After analyzing the results for various ranges, we found the range from -0.1 to 0.1 to be appropriate for neutral sentiment. Any score greater than 0.1 was considered positive and any score less than -0.1 was considered negative.

Microsoft Linguistic Analysis API returns a numeric score between 0 and 1. Scores closer to 1 indicate positive sentiment, while scores closer to 0 indicate negative sentiment. Scores between 0.45 and 0.60 may be treated as neutral, scores less than 0.45 as negative and scores greater than 0.60 as positive.
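The two score-to-polarity conversions described above can be sketched as a single helper with a per-product neutral band. This is a minimal illustration, not the code used in the engagement: the function names and the `NEUTRAL_BANDS` dictionary are hypothetical, and the boundary values themselves (for example exactly 0.45) are assumed here to fall in the neutral band.

```python
# Neutral bands taken from the text: scores inside the band are neutral,
# scores below it are negative, scores above it are positive.
# The dictionary keys are illustrative names, not API identifiers.
NEUTRAL_BANDS = {
    "google": (-0.1, 0.1),
    "microsoft": (0.45, 0.60),
}


def score_to_polarity(score, neutral_low, neutral_high):
    """Map a numeric sentiment score to a polarity label."""
    if score < neutral_low:
        return "negative"
    if score > neutral_high:
        return "positive"
    return "neutral"  # assumption: band edges count as neutral


def to_polarity(product, score):
    """Look up the product's neutral band and convert the score."""
    low, high = NEUTRAL_BANDS[product]
    return score_to_polarity(score, low, high)


print(to_polarity("google", 0.3))      # positive
print(to_polarity("microsoft", 0.2))   # negative
```

Keeping the thresholds in one table makes it easy to re-run the benchmark with a different neutral band and see how sensitive the accuracy is to that choice.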

IBM Alchemy API returns a score as well as sentiment polarity (positive, negative or neutral). So, the sentiment label can be used directly to calculate the accuracy.

Similarly, Stanford CoreNLP API returns five labels: very positive, positive, neutral, negative and very negative. For comparison with other products, very positive and positive may be combined into a single group called positive, and very negative and negative may be combined into a single group called negative.
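The five-to-three label grouping described above amounts to a small lookup table. The sketch below assumes the labels arrive as plain strings spelled as in the text; the exact strings and casing a given CoreNLP client returns may differ, so the input is normalized first. The names `FIVE_TO_THREE` and `collapse_label` are hypothetical.

```python
# Mapping from the five-way labels to the three-way polarity used for comparison.
FIVE_TO_THREE = {
    "very positive": "positive",
    "positive": "positive",
    "neutral": "neutral",
    "negative": "negative",
    "very negative": "negative",
}


def collapse_label(label):
    """Collapse a five-way sentiment label into positive/negative/neutral."""
    # Normalizing case and whitespace is an assumption about the input format.
    return FIVE_TO_THREE[label.strip().lower()]


print(collapse_label("Very positive"))  # positive
print(collapse_label("very negative"))  # negative
```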

After the above conversion, we need a clean way to show the actual and predicted results for the sentiment polarities. This is explained with examples in the following section.

Confusion Matrix

A confusion matrix contains information about the actual and predicted classifications made by a classification system. Let’s consider the confusion matrix below to get a better understanding.

The above example is based on a dataset of 1,500 reviews, of which 780 are actually positive, 492 negative and 228 neutral. Product A predicted 871 positives, 377 negatives and 252 neutrals, whereas Product B predicted 753 positives, 404 negatives and 343 neutrals.

From the above table, we can easily see that Product A correctly identifies 225 negative reviews as negative, but wrongly classifies 157 negative reviews as positive and 110 negative reviews as neutral.

Note that all the correct predictions are located along the diagonal of the table (614, 225 and 55 for Product A). This makes errors easy to spot, as they are the values outside the diagonal.
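As a rough sketch, a confusion matrix can be built by counting (actual, predicted) pairs. The Product A matrix below is reconstructed rather than quoted: the diagonal, the negative row and all row and column totals come from the figures in this post, while the remaining off-diagonal cells (79, 87, 100 and 73) are derived from those totals by subtraction and should be read as an inference. The function and variable names are hypothetical.

```python
from collections import Counter

LABELS = ["positive", "negative", "neutral"]


def confusion_matrix(actual, predicted, labels=LABELS):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Product A, rows = actual, columns = predicted (positive, negative, neutral).
# Cells 79, 87, 100 and 73 are inferred from the row and column totals.
PRODUCT_A = [
    [614, 79, 87],    # actual positive (780)
    [157, 225, 110],  # actual negative (492)
    [100, 73, 55],    # actual neutral  (228)
]

row_totals = [sum(row) for row in PRODUCT_A]        # actual count per class
col_totals = [sum(col) for col in zip(*PRODUCT_A)]  # predicted count per class
diagonal = [PRODUCT_A[i][i] for i in range(len(LABELS))]

print(row_totals)  # [780, 492, 228]
print(col_totals)  # [871, 377, 252]
print(diagonal)    # [614, 225, 55]
```

Checking that the row totals match the actual class counts and the column totals match the predicted counts is a cheap sanity test before computing any metric from the matrix.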

Precision, Recall & F-Measure
Precision measures the exactness of a classifier. Higher precision means fewer false positives, while lower precision means more false positives. Recall measures the completeness, or sensitivity, of a classifier. Higher recall means fewer false negatives, while lower recall means more false negatives.

  • Precision = True Positive / (True Positive + False Positive)
  • Recall = True Positive / (True Positive + False Negative)
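The two formulas above translate directly into code. The sketch below applies them to the positive class of Product A, using only counts quoted in this post: 614 correct positives, 871 predicted positives and 780 actual positives. The function names are illustrative.

```python
def precision(tp, fp):
    """True positives over everything predicted as the class."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """True positives over everything that actually belongs to the class."""
    return tp / (tp + fn) if (tp + fn) else 0.0


tp = 614        # positive reviews predicted as positive
fp = 871 - 614  # predicted positive, but actually negative or neutral
fn = 780 - 614  # actually positive, but predicted negative or neutral

pos_precision = precision(tp, fp)
pos_recall = recall(tp, fn)

print(f"precision={pos_precision:.2f}, recall={pos_recall:.2f}")
# precision=0.70, recall=0.79
```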

F1 score is a measure of a test’s accuracy. It considers both the precision and the recall of the test: the F1 score is the harmonic mean of precision and recall. It gives a single number that tells you how your system is performing.

  • F1-Measure = 2 * (Precision * Recall) / (Precision + Recall)
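As a small illustration of the formula, the function below computes F1 from a precision and a recall value. The inputs used here are the positive-class figures for Product A (614 of 871 predicted positives correct, 614 of 780 actual positives found); the function name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


pos_precision = 614 / 871
pos_recall = 614 / 780
pos_f1 = f1_score(pos_precision, pos_recall)

print(f"F1 (positive, Product A) = {pos_f1:.2f}")  # 0.74
```

Because it is a harmonic mean, F1 is pulled towards the lower of the two values, so a product cannot score well by being strong on precision alone or recall alone.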

Here are the precision, recall and F1 score for Product A and Product B.
Product A achieves 70% precision in finding positive sentiment. This is calculated as 614 divided by 871 (refer to the confusion matrix table). This means that, out of the 871 reviews that Product A identified as positive, 70% are correct (precision) and 30% are incorrect.

Product A achieves 79% recall in finding positive sentiment. This is calculated as 614 divided by 780 (refer to the confusion matrix table). This means that, out of the 780 reviews that Product A should have identified as positive, it identified 79% correctly (recall) and 21% ((79 + 87)/780) incorrectly.

It is desirable to have both high precision and high recall to get a high final accuracy. The F1 score considers both precision and recall and gives a single number to compare across products. Based on the F1 score comparison, we arrive at the following for the given dataset.

  • Product B is slightly better than Product A in finding positive sentiment.
  • Product B is better than Product A in finding negative sentiment.
  • Product B is slightly better than Product A in finding neutral sentiment.

Final accuracy can be calculated as the number of correct predictions divided by the total number of records.
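For Product A, the accuracy calculation can be sketched from numbers already given above: the correct predictions are the diagonal of the confusion matrix (614, 225 and 55) and the dataset has 1,500 records. The function name is illustrative.

```python
def accuracy(correct_per_class, total_records):
    """Number of correct predictions divided by total number of records."""
    return sum(correct_per_class) / total_records


# Diagonal of Product A's confusion matrix: positive, negative, neutral.
product_a_accuracy = accuracy([614, 225, 55], 1500)

print(f"Product A accuracy = {product_a_accuracy:.3f}")  # 0.596
```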

To conclude, for the given dataset, Product B performs better than Product A. It is important to consider multiple datasets and take the average accuracy to determine a product’s final standing.

Author Credits: This article was written by Kalyan Nandi, Lead, Data Science at Big Data Analytics SBU, 8KMiles Software Services.