When making plans to purchase a gift, travel, or dine out, how do you decide among the plethora of options available? Thanks to the internet, customer reviews are now widely available to help with purchasing decisions for nearly every product or service you can imagine. But reading through endless reviews is a tedious and time-consuming process. Wouldn’t it be nice if you could press a button to summarize all the reviews or, even better, select which opinions to summarize and ask for explanations of the resulting summary? We are excited to describe OpinionDigest, our recent work at Megagon Labs that can selectively summarize the opinions in a large set of reviews and also explain the summaries it generates. Previously, we shared our efforts in developing ExtremeReader, an interactive explorer for customizable and explainable review summarization. ExtremeReader uses OpinionDigest as its textual summary generation component, which we did not describe in detail at the time. In this blog post, we’ll explore how OpinionDigest offers a powerful, interpretable, and controllable opinion summarization solution for customer reviews — without relying on human-written reference summaries.
The Current State of Opinion Summarization
To understand why we developed OpinionDigest, we must first explore where research on opinion summarization currently stands.
Removing the Need for Reference Summaries
Sequence-to-sequence (seq2seq) generation frameworks are a common choice for text summarization techniques. A seq2seq model consists of two components: An encoder that converts input text into a latent vector in a high-dimensional space, and a decoder that converts this vector back into output text. Hence, seq2seq models are also known as encoder-decoder models.
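To make this concrete, here is a minimal sketch of an encoder-decoder model in PyTorch. It is for illustration only and is not OpinionDigest’s actual architecture (see our paper for those details):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """A minimal GRU-based encoder-decoder, for illustration only."""

    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids: torch.Tensor, tgt_ids: torch.Tensor) -> torch.Tensor:
        # Encoder: compress the input token sequence into a latent vector.
        _, latent = self.encoder(self.embed(src_ids))
        # Decoder: unfold that latent vector back into an output sequence.
        dec_out, _ = self.decoder(self.embed(tgt_ids), latent)
        return self.out(dec_out)  # per-token vocabulary logits
```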
Seq2seq models can be used for any type of “translation” task. In text summarization, for instance, we use them to “translate” long documents into shorter, more concise summaries. To accomplish this, traditional seq2seq summarization models are supervised by reference summaries, which are carefully written by human annotators. However, it is difficult and expensive to collect enough reference summaries to build an accurate seq2seq summarization model. To circumvent this obstacle, the machine learning (ML) and natural language processing (NLP) research communities have recently developed unsupervised opinion summarization techniques that do not rely on reference summaries.
How Unsupervised Opinion Summarization Works
With unsupervised opinion summarization, the model learns how to “reconstruct” the original reviews. During the training phase, its encoder converts an input review into a vector representation. The decoder then learns to reconstruct the original review from the vector representation. This process lets the model learn a meaningful latent representation space. Consequently, similar reviews tend to be located close to one another in the vector space.
During the summary generation phase, the model encodes each input review and aggregates the resulting latent vectors by averaging them. The decoder then generates a summary from the averaged vector.
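Continuing the sketch above, the conventional aggregation step might look like the following: encode each review, average the latent vectors, and decode a summary greedily from the average. The `bos_id`/`eos_id` special tokens and the greedy decoding loop are illustrative assumptions:

```python
import torch

def summarize_by_averaging(model, reviews, bos_id, eos_id, max_len=60):
    """Encode each review, average the latent vectors, decode greedily.

    `reviews` is a list of (1, seq_len) token-id tensors; `bos_id` and
    `eos_id` are assumed special-token ids. Reuses the Seq2Seq sketch above.
    """
    latents = [model.encoder(model.embed(ids))[1] for ids in reviews]
    h = torch.mean(torch.stack(latents), dim=0)  # aggregation = averaging

    tokens = [bos_id]
    for _ in range(max_len):
        prev = torch.tensor([[tokens[-1]]])
        dec_out, h = model.decoder(model.embed(prev), h)
        next_id = model.out(dec_out[:, -1]).argmax(dim=-1).item()
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # summary token ids (detokenization omitted)
```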
Limitations of Conventional Unsupervised Opinion Summarization Techniques
There are some caveats to this approach that are worth mentioning:
- The aggregated vector representations are not interpretable.
- It is not clear how averaged latent representations preserve or discard the original meaning.
- It is difficult to control the generation of summaries.
The last point is especially limiting when a user wants to examine only specific types of information, such as a hotel’s location or service. This is infeasible with the conventional vector aggregation method because we cannot control the generation step. That’s why we developed OpinionDigest.
OpinionDigest: A Framework for Controllable and Interpretable Summarization
The core idea behind OpinionDigest is to use opinion phrases, rather than vectors in a high-dimensional space, as intermediate representations. This not only helps users better understand the original reviews, but also allows us to explain a summary by showing the opinion phrases that were selected for it. With this representation, our summarization framework becomes more interpretable and controllable.
Opinion phrases hold key information of reviews. Teaching a model to “verbalize” these phrases should result in a more natural solution for the opinion summarization problems we’ve discussed. We can also leverage Snippext, our state-of-the-art opinion extraction model, to streamline this process.
By using opinion phrases as intermediate representations, OpinionDigest delivers two major benefits:
- Humans can interpret the intermediate and aggregated representations. A user can clearly see how opinions are extracted and aggregated from the original reviews.
- A user can easily control a generated summary by selecting opinion phrases based on aspect category (e.g., service, location, food quality) and/or sentiment polarity (i.e., positive or negative opinions).
We will investigate how OpinionDigest performs and examine controlled summary generation examples later. But first, let’s explore how OpinionDigest trains and uses a seq2seq model to generate a summary from multiple reviews.
Training Step #1: Opinion Extraction via Snippext
Opinion extraction is the first step of the training phase. It converts an input review into a set of opinion phrases. Any aspect-based sentiment analysis model can be employed as the opinion extractor; in OpinionDigest, we used Snippext. Please see our blog post about Snippext for more details about its unique opinion extraction technique.
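Conceptually, the extractor maps a review to a set of opinion phrases, each tagged with an aspect category and a sentiment polarity. The interface below is a hypothetical stand-in, not Snippext’s actual API:

```python
from dataclasses import dataclass

@dataclass
class OpinionPhrase:
    text: str       # e.g. "very friendly staff"
    aspect: str     # e.g. "service"
    sentiment: str  # e.g. "positive" or "negative"

def extract_opinions(review: str) -> list[OpinionPhrase]:
    """Stand-in for a Snippext-style extractor; the real API differs.

    For "The staff were very friendly, but the room was tiny.", an
    aspect-based sentiment analysis model would return something like:
    [OpinionPhrase("very friendly staff", "service", "positive"),
     OpinionPhrase("tiny room", "room", "negative")]
    """
    raise NotImplementedError("plug in Snippext or any ABSA model here")
```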
Training Step #2: Training a seq2seq Model
Next, we train a seq2seq model to reconstruct the original review. This essentially follows the conventional “reconstruction” training procedure, but with one key difference in the model’s input: instead of the review text, OpinionDigest feeds in the extracted opinion phrases. By training a seq2seq model on a large number of reviews and their extracted opinion phrases, the model learns to “verbalize” a set of opinion phrases into text.
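A single training step might look like the sketch below, which reuses the Seq2Seq model from earlier. The `tokenizer` helper and the `<sep>` phrase separator are illustrative assumptions, not the exact setup from our paper:

```python
import torch
import torch.nn.functional as F

def training_step(model, tokenizer, optimizer, review, opinion_phrases):
    """One reconstruction step: learn to verbalize phrases into the review.

    `tokenizer` is an assumed helper mapping text to a (1, seq_len) id
    tensor with BOS prepended and EOS appended; " <sep> " is an assumed
    phrase separator.
    """
    src = tokenizer(" <sep> ".join(p.text for p in opinion_phrases))
    tgt = tokenizer(review)

    logits = model(src, tgt[:, :-1])  # teacher forcing on the review text
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (num_tokens, vocab_size)
        tgt[:, 1:].reshape(-1),               # next-token targets
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```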
Summarization Step #1: Opinion Aggregation
Since OpinionDigest needs to summarize multiple input reviews, it first extracts opinion phrases from all of them with the same opinion extractor used in the training phase. It then aggregates the extracted opinion phrases into representative phrases.
The aggregation step works by clustering similar opinion phrases and choosing the most frequent phrase in each cluster. The user can also filter the opinion phrases to match an intent (e.g., only location-related opinions, or only negative opinions), as sketched below.
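Here is one way the aggregation and filtering could look, reusing the OpinionPhrase type from the extraction sketch above. The greedy clustering and the `embed_phrase` helper are illustrative stand-ins rather than the exact procedure from the paper:

```python
import torch
from collections import Counter

def aggregate_opinions(phrases, embed_phrase, sim_threshold=0.8,
                       aspect=None, sentiment=None, top_k=15):
    """Cluster similar phrases and keep one representative per cluster.

    `phrases` are OpinionPhrase objects; `embed_phrase` is an assumed
    text -> 1-D tensor embedding helper.
    """
    # Controllability: keep only phrases matching the user's intent.
    if aspect is not None:
        phrases = [p for p in phrases if p.aspect == aspect]
    if sentiment is not None:
        phrases = [p for p in phrases if p.sentiment == sentiment]

    clusters = []  # each entry: (centroid vector, member phrases)
    for p in phrases:
        v = embed_phrase(p.text)
        for centroid, members in clusters:
            if torch.cosine_similarity(v, centroid, dim=0) > sim_threshold:
                members.append(p)
                break
        else:
            clusters.append((v, [p]))

    # Keep the largest clusters; represent each by its most frequent phrase.
    clusters.sort(key=lambda c: len(c[1]), reverse=True)
    return [Counter(m.text for m in members).most_common(1)[0][0]
            for _, members in clusters[:top_k]]
```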
Summarization Step #2: Summary Generation
The trained seq2seq model takes the selected opinion phrases as input and “verbalizes” them into a textual summary.
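The decoding loop mirrors the averaging sketch from earlier, but note the key difference: the input is the selected opinion phrases themselves rather than an opaque averaged vector, which is what makes the summary explainable. Again, the `tokenizer` and special-token ids are illustrative assumptions:

```python
import torch

def generate_summary(model, tokenizer, selected_phrases, bos_id, eos_id,
                     max_len=60):
    """Greedily verbalize the selected opinion phrases into a summary."""
    src = tokenizer(" <sep> ".join(selected_phrases))
    _, h = model.encoder(model.embed(src))

    tokens = [bos_id]
    for _ in range(max_len):
        prev = torch.tensor([[tokens[-1]]])
        dec_out, h = model.decoder(model.embed(prev), h)
        next_id = model.out(dec_out[:, -1]).argmax(dim=-1).item()
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # summary token ids (detokenization omitted)
```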
How Well Does OpinionDigest Work?
To assess OpinionDigest’s performance, we conducted a set of experiments and used automatic and human evaluation methods on two benchmark datasets. Our key observations from the experiments include:
- Compared to alternative methods, OpinionDigest produces high-quality summaries for restaurant reviews on the Yelp benchmark dataset.
- Human evaluation confirms that OpinionDigest generates more informative, coherent, and less redundant summaries than the alternative methods, and that it is less likely to generate irrelevant summaries.
- OpinionDigest allows users to easily include or exclude aspect and sentiment information in the generated summary.
Automatic Evaluation
We first evaluated OpinionDigest on a publicly available Yelp dataset of 624,000 reviews and 200 reference summaries, using the standard ROUGE metrics (R1, R2, and RL, which measure 1-gram, 2-gram, and longest-common-subsequence overlap, respectively). As shown in Table 1 below, OpinionDigest outperforms all baseline approaches.
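As a side note, if you want to compute these metrics yourself, Google’s open-source rouge-score package provides a convenient implementation (this is not necessarily the toolkit used in our experiments):

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
reference = "The staff were friendly and the location was great."
generated = "Friendly staff and a great location."
for metric, score in scorer.score(reference, generated).items():
    print(metric, round(score.fmeasure, 3))  # F1 for R1, R2, RL
```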
Although OpinionDigest is not a fully unsupervised framework, only the opinion extractor requires labeled data, which is much easier to acquire than reference summaries. For example, the opinion extraction models used with the Yelp dataset are trained on a publicly available aspect-based sentiment analysis (ABSA) dataset.
Human Evaluation
For the second evaluation, human judges selected the best and worst summaries according to three criteria: informativeness (I), coherence (C), and non-redundancy (R). Besides the Yelp dataset, summaries were also generated from HOTEL, a private dataset of 688,000 reviews from multiple hotel booking websites. We used Best-Worst Scaling to compute each system’s score, with values ranging from -100 (unanimously worst) to +100 (unanimously best).
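The scoring itself is simple: a system’s score is the percentage of judgments in which it was selected as best, minus the percentage in which it was selected as worst. For example:

```python
def best_worst_score(n_best: int, n_worst: int, n_judgments: int) -> float:
    """Best-Worst Scaling: % of times chosen best minus % chosen worst."""
    return 100.0 * (n_best - n_worst) / n_judgments

# A system chosen best in 30 of 100 judgments and worst in 10:
print(best_worst_score(30, 10, 100))  # 20.0
```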
As shown in Table 2 above, OpinionDigest’s generated summaries achieved the best informativeness and coherence scores when compared to the baselines. However, OpinionDigest may still generate redundant phrases in its summaries.
We also performed a summary content support study. Judges were given 8 input reviews and a corresponding summary produced either by MeanSum (a strong abstractive opinion summarization baseline) or by OpinionDigest. They were then asked to evaluate the extent to which each summary sentence’s content was supported by the input reviews.
Table 3 below shows the proportion of summary sentences that were fully, partially, or not supported for each system. OpinionDigest not only produced significantly more sentences with full or partial support but also fewer sentences without any support.
Lastly, we evaluated OpinionDigest’s ability to generate controllable output. We produced aspect-specific summaries and asked participants to judge if they discussed the specified aspect exclusively, partially, or not at all.
Table 4 above shows the results from this evaluation: 46.6% of the summaries exclusively discussed the specified aspect, and only 10.3% failed to mention the aspect at all.
Controlled Summarization With OpinionDigest
As we’ve noted, OpinionDigest can also generate “controlled” summaries by filtering opinion phrases during the summarization phase. Figure 7 below contains a few examples of these summaries.
The top-left summary is a general summary of a hotel, generated from all of the extracted opinion phrases. The top-right summary is the result of OpinionDigest considering only phrases about the staff. The two bottom summaries result from filtering opinion phrases by positive or negative sentiment.
Vector representations used by conventional unsupervised opinion summarization techniques are neither interpretable nor controllable. In contrast, we would like to emphasize that OpinionDigest is the first neural-network-based model that can generate controlled summaries.
The Future of Opinion Summarization
We believe that OpinionDigest is a promising paradigm shift from typical opinion summarization approaches. By using opinion phrases as intermediate representations, it offers a powerful, interpretable, and controllable opinion summarization solution for customer reviews without relying on any human-written reference summaries.
Our experimental results demonstrate that OpinionDigest can perform as well as, if not better than, current state-of-the-art baselines. Our study of controlled summarization also shows that users can easily control output summaries in an intuitive manner.
Lastly, OpinionDigest is not just about opinion summarization. Our ongoing work shows that the framework can also be used for a variety of automated text generation applications.
Stay tuned for updates on our work with OpinionDigest! We’ll be sure to share any exciting developments in the near future through our blog. In the meantime, we have open-sourced the OpinionDigest code. You can find it here.
Interested in learning more about OpinionDigest? Check out our research paper! Do you have any questions about how it works? Contact us today!
Written by Yoshi Suhara, Xiaolan Wang and Megagon Labs