The objective of the task is to test an automatic system’s ability to predict a sentiment intensity (also known as evaluativeness or sentiment association) score for a word or a phrase. Phrases include negators, modals, intensifiers, and diminishers: categories known to be challenging for sentiment analysis. Specifically, participants will be given a list of terms (single words and multiword phrases) and asked to provide a score between 0 and 1 that indicates the strength of the term’s association with positive sentiment. A score of 1 indicates maximum association with positive sentiment (or least association with negative sentiment), and a score of 0 indicates least association with positive sentiment (or maximum association with negative sentiment). If a term is more positive than another, it should receive a higher score.
We introduced this task as part of SemEval-2015 Task 10, Sentiment Analysis in Twitter, Subtask E (Rosenthal et al., 2015), where the target terms were taken from Twitter. In SemEval-2016, we broaden the scope of the task to include three different domains: general English, English Twitter, and Arabic Twitter. The Twitter domain differs significantly from the general English domain: it includes hashtags that are often composed of several words (e.g., #feelingood), misspellings, shortenings, slang, etc.
We will have three subtasks, one for each of the three domains:
— General English Sentiment Modifiers Set: This test set has phrases formed by combining a word and a modifier, where a modifier is a negator, an auxiliary verb, a degree adverb, or a combination of those. For example, ‘would be very easy’, ‘did not harm’, and ‘would have been nice’. (See the development data for more examples.) The test set also includes single-word terms (as separate entries). These terms are chosen from the set of words that are part of the multi-word phrases, for example, ‘easy’, ‘harm’, and ‘nice’. The terms in the test set will have the same form as the terms in the development set, but can involve different words and modifiers.
— English Twitter Mixed Polarity Set: This test set focuses on phrases made up of opposite-polarity terms, for example, ‘lazy sundays’, ‘best winter break’, ‘happy accident’, and ‘couldn’t stop smiling’. Observe that ‘lazy’ is associated with negative sentiment whereas ‘sundays’ is associated with positive sentiment. Automatic systems have to determine the degree of association of the whole phrase with positive sentiment. The test set also includes single-word terms (as separate entries). These terms are chosen from the set of words that are part of the multi-word phrases, for example, ‘lazy’, ‘sundays’, ‘best’, ‘winter’, and so on. This allows the evaluation to assess how well automatic systems determine the sentiment association of individual words as well as of the phrases formed by their combinations. The multi-word phrases and single-word terms are drawn from a corpus of tweets and may include a small number of hashtag words and creatively spelled words; however, most of the terms are ones used in everyday English.
— Arabic Twitter Set: This test set includes single words and phrases commonly found in Arabic tweets. The phrases in this set are formed only by combining a negator and a word. See development data for examples.
In each subtask the target terms are chosen from the corresponding domain. We will provide a development set and a test set for each domain. No separate training data will be provided. The development sets will be large enough to be used for tuning or even for training. The test sets and the development sets will have no terms in common. The participants are free to use any additional manually or automatically generated resources; however, we will require that all resources be clearly identified in the submission files and in the system description paper.
All of these terms are manually annotated to obtain their strength of association scores. We use CrowdFlower to crowdsource the annotations. We use the MaxDiff method of annotation. Kiritchenko et al. (2014) showed that even though annotators might disagree about answers to individual questions, the aggregated scores produced with MaxDiff and the corresponding term ranking are consistent. We verified this by randomly selecting ten groups of five answers to each question and comparing the scores and rankings obtained from these groups of annotations. On average, the scores of the terms from the data we have previously annotated (SemEval-2015 Subtask E Twitter data and SemEval-2016 general English terms) differed only by 0.02-0.04 per term, and the Spearman rank correlation coefficient between two sets of rankings was 0.97-0.98.
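The MaxDiff aggregation can be sketched in a few lines. In MaxDiff, each annotator sees a small set of terms and picks the most positive and the least positive one; a term’s raw score is the fraction of times it was picked as best minus the fraction of times it was picked as worst, linearly rescaled to [0, 1]. The function name and the tuple format for annotations below are illustrative, not the official annotation pipeline, but the counting logic follows the approach described in Kiritchenko et al. (2014).

```python
from collections import defaultdict

def maxdiff_scores(annotations):
    """Aggregate MaxDiff annotations into real-valued scores in [0, 1].

    Each annotation is a tuple (terms, best, worst): the terms shown to
    the annotator, the term picked as most positive, and the term picked
    as least positive.  A term's raw score is the fraction of its
    appearances in which it was chosen as best, minus the fraction in
    which it was chosen as worst; raw scores are then rescaled to [0, 1].
    """
    best_counts = defaultdict(int)
    worst_counts = defaultdict(int)
    appearances = defaultdict(int)
    for terms, best, worst in annotations:
        for t in terms:
            appearances[t] += 1
        best_counts[best] += 1
        worst_counts[worst] += 1
    # Raw score in [-1, 1]: net best-picks per appearance.
    raw = {t: (best_counts[t] - worst_counts[t]) / appearances[t]
           for t in appearances}
    # Rescale linearly so the lowest term maps to 0 and the highest to 1.
    lo, hi = min(raw.values()), max(raw.values())
    return {t: (s - lo) / (hi - lo) for t, s in raw.items()}
```

With enough overlapping questions per term, this aggregation yields a full ranking with real-valued scores, which is what makes the stability check across annotator subsets (the 0.02–0.04 score differences reported above) possible.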
The participants can submit results for any one, two, or all three subtasks. We will provide separate test files for each subtask. The test file will contain a list of terms from the corresponding domain. The participating systems are expected to assign a sentiment intensity score to each term. The order of the terms in the submissions can be arbitrary.
System ratings for terms in each subtask will be evaluated by first ranking the terms according to sentiment score and then comparing this ranked list to a ranked list obtained from human annotations. Kendall’s Tau (Kendall, 1938) will be used as the metric to compare the ranked lists. We will provide scores for Spearman’s Rank Correlation as well, but participating teams will be ranked by Kendall’s Tau.
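For intuition about the metric, Kendall’s Tau can be computed by counting concordant and discordant term pairs across the two rankings. The pure-Python sketch below illustrates this; it ignores tied pairs for simplicity, whereas the released evaluation script is the authoritative implementation and may handle ties differently.

```python
from itertools import combinations

def kendall_tau(gold, predicted):
    """Kendall's tau between two score dicts over the same set of terms.

    A pair of terms is concordant when both score assignments order it
    the same way, and discordant when they order it oppositely.
    tau = (C - D) / (n * (n - 1) / 2).  Tied pairs are skipped here.
    """
    terms = list(gold)
    concordant = discordant = 0
    for a, b in combinations(terms, 2):
        g = gold[a] - gold[b]
        p = predicted[a] - predicted[b]
        if g * p > 0:
            concordant += 1
        elif g * p < 0:
            discordant += 1
    n = len(terms)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Note that only the induced ranking matters: a system that gets every pairwise ordering right scores tau = 1.0 even if its absolute scores differ from the gold scores.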
We have released an evaluation script so that participants can:
— make sure their output is in the right format;
— track the progress of their system’s performance on the development data.
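As an illustration of the kind of format check such a script performs, here is a minimal validator. The tab-separated ‘term<TAB>score’ line layout is an assumption for the sake of the example; the released evaluation script defines the authoritative submission format.

```python
def check_submission(lines):
    """Validate submission lines, assuming a 'term<TAB>score' layout.

    (The tab-separated layout is an assumption; consult the released
    evaluation script for the authoritative format.)  Returns a list of
    (line_number, message) problems; an empty list means the file passed.
    """
    problems = []
    for i, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            problems.append((i, "expected exactly one tab separator"))
            continue
        term, score = parts
        try:
            value = float(score)
        except ValueError:
            problems.append((i, "score is not a number"))
            continue
        if not 0.0 <= value <= 1.0:
            problems.append((i, "score outside [0, 1]"))
    return problems
```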
IMPORTANT DATES
— Development data ready: September 4, 2015
— Test data ready: Dec 15, 2015
— Evaluation start: January 10, 2016
— Evaluation end: January 31, 2016
— Paper submission due: February 28, 2016
— Paper reviews due: March 31, 2016
— Camera ready due: April 30, 2016
— SemEval workshop: Summer 2016
BACKGROUND AND MOTIVATION
Many of the top-performing sentiment analysis systems in recent SemEval competitions (2013 Task 2, 2014 Task 4, and 2014 Task 9) rely on automatically generated sentiment lexicons. Sentiment lexicons are lists of words (and phrases) with prior associations with positive and negative sentiment. Some lexicons additionally provide a sentiment score for a term to indicate its strength of evaluative intensity; higher scores indicate greater intensity. Existing manually created sentiment lexicons tend to have only discrete labels for terms (positive, negative, neutral) but no real-valued scores indicating the intensity of sentiment. Here, for the first time, we manually create a dataset of words and phrases with real-valued intensity scores. The goal of this task is to evaluate automatic methods for determining sentiment scores of words and phrases. Many of the phrases in the test set will include negators (such as ‘no’ and ‘doesn’t’), modals (such as ‘could’ and ‘may be’), and intensifiers and diminishers (such as ‘very’ and ‘slightly’). This task will enable researchers to examine methods for estimating how each of these word categories affects the intensity of sentiment.