Subjectivity is ubiquitous in our use of natural language and thus is also a crucial aspect to consider in natural language processing (NLP). But subjectivity in NLP has not been explored in many contexts where it is prevalent, such as question answering (QA). In this blog post, we discuss our novel data collection method and a new dataset called SubjQA to investigate the relationship between subjectivity and QA in the context of customer reviews.
The Vital Role of Subjectivity in Question Answering
Simply put, subjectivity is the expression of internal opinions or beliefs that cannot be objectively observed or verified. It plays an important role in sentiment analysis and word sense disambiguation. Recently, the NLP community has expressed renewed interest in exploring subjectivity for several research areas, such as aspect extraction, opinion mining, and community QA. When we examine current trends, it is easy to see why.
Many domains like products and services generate data that is highly subjective. Recent studies show that as many as 69% of user queries in such domains are subjective, and the customer reviews that can answer these queries also tend to be highly subjective. Arguably, the subjectivity of an answer is correlated with the subjectivity of the user query and changes from review to review. Therefore, both the dataset and the QA system must capture how subjectivity is expressed in the user query and in the review in order to find an answer. However, existing review-based QA datasets are not large and diverse enough to study the interaction between subjectivity and QA.
The Current Constraints of Question Answering Datasets and Architectures
Because they are built on factual data, the majority of QA datasets and systems are agnostic to subjectivity. Modern QA systems use representation learning architectures that are trained on large-scale factual datasets such as Wikipedia articles, news posts, or books. It is unclear if these architectures can handle subjective statements such as those that appear in reviews. To comprehensively investigate research questions dealing with subjectivity in reviews, it became readily apparent that a large-scale dataset is needed. Since no such dataset exists, we constructed SubjQA, a new challenge QA dataset.
Existing data collection methods suffer from two limitations:
- They rely on the linguistic similarity between the questions and the reviews, resulting in easy datasets. Subjective questions, however, may not always use the same words and phrases as the review.
- They create small datasets that are neither diverse nor targeted at understanding subjectivity in text.
These limitations motivated us to devise a novel data collection method to build SubjQA.
Building SubjQA With a New Data Collection Method
The figure below outlines our proposed data collection method. First, we find opinion extractions from the reviews of a target domain. Each opinion extraction is a tuple of the form (modifier, aspect). Next, we mine associations between the extractions using matrix factorization. For example, we find that the extraction (impressive, character development) is related to the extraction (good, writing). We then use these associations to build the dataset.
Specifically, given an association (head extraction, tail extraction), we find a review that mentions the head extraction and ask crowd workers to write a question using the tail extraction as the topic. We then ask them to select the span of text from the review that answers the given question. They also provide subjectivity labels for both the question and answer. Below are some more examples from the dataset:
Question: How was the plot of the movie?
Review: …simply because there’s so much going on, so much action, so many complex …
Question: Is the restaurant vegan friendly?
Review: ….many vegan dishes on the menu. We had a lovely time here with our friends…
Question: Does the restaurant have a romantic vibe?
Review: Amazing selection of wines, perfect for a date night.
As these examples show, a QA system must now reason about subjective expressions instead of relying solely on lexical overlap between the question and the review to find the answer span.
Characteristics of SubjQA
SubjQA includes over 10,000 examples spanning six different domains: hotels, restaurants, movies, books, electronics, and groceries. A large percentage of questions and answers are subjective: 73% of the questions are subjective and 74% of the answers are subjective. As shown in the figure below, SubjQA has a good distribution of interactions between subjectivity in questions and subjectivity in reviews.
About 65% of the questions in SubjQA are answerable from the reviews they are paired with. SubjQA also contains diverse questions not present in other benchmark QA datasets such as SQuAD and AmazonQA. This is depicted in the figure below by the diversity of question prefixes. The outermost ring shows unigram prefixes. The middle and innermost rings correspond to bigrams and trigrams, respectively.
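The prefix-diversity view described above can be computed by counting the first one, two, and three tokens of each question. The sketch below uses a few made-up questions (not drawn from SubjQA) purely to show the counting step behind the rings.

```python
# Hedged sketch: counting question prefixes of increasing length,
# as used for the unigram/bigram/trigram rings. Questions are illustrative.
from collections import Counter

questions = [
    "How was the plot of the movie?",
    "How is the battery life?",
    "Is the restaurant vegan friendly?",
    "Does the restaurant have a romantic vibe?",
]

def prefix_counts(questions, n):
    """Count the first n tokens (lowercased) of each question."""
    return Counter(tuple(q.lower().split()[:n]) for q in questions)

unigrams = prefix_counts(questions, 1)  # outermost ring
bigrams = prefix_counts(questions, 2)   # middle ring
trigrams = prefix_counts(questions, 3)  # innermost ring
print(unigrams.most_common(3))
```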
Performance of QA systems on SubjQA
Pre-trained models achieve F1 scores as high as 92.9% on the popular SQuAD benchmark. On the other hand, the best model achieves an average F1 of 30.5% across domains in SubjQA. The difference in performance can be attributed to both differences in domain (Wikipedia vs. customer reviews) and how subjectivity is expressed across different domains. Even after fine-tuning on each domain, the best model achieves an average F1 score of 74.1% across the different domains. This is significantly lower than the F1 score on SQuAD. We attribute this result to the fact that the models are agnostic about subjective expressions in questions and reviews.
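For readers unfamiliar with the metric behind these numbers, span-based QA is typically scored with token-overlap F1 between the predicted and gold answer spans. The sketch below is a simplified version of that computation (full SQuAD-style evaluation also normalizes punctuation and articles, which is omitted here).

```python
# Simplified token-overlap F1 for span QA, in the spirit of SQuAD evaluation.
from collections import Counter

def token_f1(prediction, gold):
    """F1 over lowercased whitespace tokens of two answer spans."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

# A prediction with one extra token relative to the gold span:
print(token_f1("many vegan dishes on the menu", "vegan dishes on the menu"))
```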
To examine whether subjectivity is a useful signal in QA, we optimize a QA model jointly on answer-span selection and answer-subjectivity classification in a multi-task learning paradigm. The model achieves an average F1 score of 76.3%, a 2-point absolute gain over the subjectivity-agnostic model. This shows that even simple techniques to incorporate subjectivity in reasoning can boost model performance across domains.
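One common way to realize such a multi-task objective is to add the span-selection loss and the subjectivity-classification loss, with a weight on the auxiliary task. The sketch below illustrates this with hand-written logits and a hypothetical weight `lam`; in a real system the logits would come from a shared encoder, and the weight would be tuned.

```python
# Hedged sketch of a multi-task loss: answer-span selection plus
# answer-subjectivity classification. All logits are illustrative.
import numpy as np

def cross_entropy(logits, target):
    """Negative log-likelihood of the target index under softmax(logits)."""
    logits = logits - logits.max()  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target]

# Span-selection head: start/end position logits over review tokens.
start_logits = np.array([0.1, 2.0, 0.3, -1.0])
end_logits = np.array([-0.5, 0.2, 1.5, 0.0])
span_loss = cross_entropy(start_logits, 1) + cross_entropy(end_logits, 2)

# Subjectivity head: binary classification (0 = objective, 1 = subjective).
subj_logits = np.array([0.4, 1.2])
subj_loss = cross_entropy(subj_logits, 1)

# Multi-task loss: weighted sum; lam is a tunable hyperparameter (assumed).
lam = 0.5
total_loss = span_loss + lam * subj_loss
print(f"total loss: {total_loss:.3f}")
```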
Opening Up New Avenues for Research in Question Answering Subjectivity
We have made SubjQA publicly available. You can find it here. Not only does this QA dataset contain subjectivity labels for both questions and answers, but it also enables the following:
- Evaluation and development of architectures for subjective content.
- Investigation of subjectivity and its interactions in broad and diverse contexts.
There is still much more work to be done that can help the NLP community understand the relationship between subjectivity and QA. Fortunately, SubjQA opens up several opportunities to conduct research in this area efficiently and effectively.