Understanding public sentiment can yield valuable insights for a business. Consequently, opinion mining, the process of analyzing text to extract what is being discussed and whether the sentiment toward it is positive or negative, has rapidly grown in popularity. With fine-tuning, pre-trained language models can obtain high-quality extractions from user reviews, but not all organizations have access to enough training data to do so.
To address this, we developed Snippext, an opinion mining system built over a language model that is fine-tuned through semi-supervised learning with augmented data. Our results show that you can considerably reduce the amount of labeled training data required to fine-tune language models for opinion mining while maintaining state-of-the-art performance.
Labeled Training Data Makes Opinion Mining Costly and Time-Consuming
Online services like Amazon and Yelp are constantly extracting and analyzing aspects, opinions, and sentiments, such as (“room”, “very small”, negative), from reviews and other sources of user-generated information. This data can provide profound insight into various facets of their operations, consumers, and products. On the other side of the equation, aggregating these extractions into user experience summaries can aid customers in decision-making and save them the hassle of reading reviews themselves.
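To make the idea of extractions and summaries concrete, here is a minimal Python sketch. It is not Snippext's actual data model: the `Extraction` type, the `summarize` helper, and every example tuple other than (“room”, “very small”, negative) are hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Extraction(NamedTuple):
    """One (aspect, opinion, sentiment) triple pulled from a review."""
    aspect: str
    opinion: str
    sentiment: str  # "positive" or "negative"


def summarize(extractions):
    """Count positive and negative mentions for each aspect."""
    summary = defaultdict(Counter)
    for ex in extractions:
        summary[ex.aspect][ex.sentiment] += 1
    return {aspect: dict(counts) for aspect, counts in summary.items()}


# Hypothetical extractions; only the first tuple appears in the text above.
extractions = [
    Extraction("room", "very small", "negative"),
    Extraction("room", "spotless", "positive"),
    Extraction("room", "cramped", "negative"),
    Extraction("staff", "friendly", "positive"),
]

print(summarize(extractions))
```

A summary like this lets a reader see at a glance that most mentions of the room are negative, without reading each review.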
One common opinion mining method is to use pre-trained language models such as BERT or XLNet. After some fine-tuning, they can achieve state-of-the-art (SOTA) performance in several extraction tasks. But fine-tuning language models still requires a substantial amount of high-quality labeled training data.
For instance, the SOTA model for restaurant aspect term extraction is trained on 3,841 sentences. Acquiring this much accurate data is non-trivial, as labeling may require linguistic expertise. In many cases, crowdsourcing can lower the cost of obtaining labeled data. But preparing the task, launching it, and processing the results are still extremely time-consuming and effort-intensive. To make matters more complex, you often must repeat the labeling process for each domain (e.g., hotels, restaurants, employers).
All of these obstacles pose a significant barrier for numerous organizations that want to leverage opinion mining. But to stay competitive, many modern companies need the ability to accurately conduct opinion mining. To address this, we must make the process of fine-tuning language models less expensive and more efficient.
Motivated by these issues, we sought a way to drastically reduce the amount of labeled training data required to fine-tune language models such as BERT while maintaining SOTA performance. The big question: could we still reach SOTA performance with only half (or even less) of the labeled training data typically used?
There are two main questions Snippext had to address to overcome the need for large training datasets:
- Can we gain more value from small sets of labeled training data by generating more high-quality examples from them?
- Can we use both labeled and unlabeled data to fine-tune an in-domain language model and obtain better results for specific tasks?
How Snippext Makes Opinion Mining More Viable
Snippext answers the questions above and achieves SOTA performance with less training data by combining two components: data augmentation and semi-supervised learning.
Data augmentation (DA) is a technique that allows us to automatically increase the training data amount without human annotation. It has been effective for a variety of tasks in computer vision; by applying simple operators such as “rotate”, “crop”, or “flip” to labeled images, researchers have been able to generate more labeled training data for neural networks.
In natural language processing (NLP), DA takes a similar approach. For example, an effective way to augment training data for a sentence classifier is to replace key words or phrases with synonyms. But synonyms can be limited, and other operators can distort the meaning of an augmented sentence. To overcome these limitations, we took inspiration from MixUp, a popular DA technique in computer vision, and created a new technique known as MixDA.
MixDA generates augmented data through two steps:
1. MixDA “partially” transforms text sequences through a specific set of DA operators suitable for opinion mining tasks. This makes the resulting sequences less likely to be distorted.
Data operators of Snippext:
- Replace a non-target token with a new token
- Insert a new token before or after a non-target token
- Delete a non-target token
- Swap two non-target tokens
- Replace a target span with a new span
2. MixUpNL, a version of MixUp we created for text, performs a convex interpolation over word embeddings on the augmented data with the original data to further reduce any potential noise in the augmented data.
The resulting interpolation is used as the training signal.
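The two MixDA steps can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Snippext implementation: the function names, the two-dimensional `TOY_EMBEDDINGS` table, the tag names, and the Beta parameter 0.8 are all hypothetical. The sketch uses the "swap two non-target tokens" operator because it preserves sequence length, so the original and augmented sequences line up position by position; operators that change the length would need extra alignment that is not shown here. The mixing weight is drawn from a Beta distribution, as in the original MixUp.

```python
import random


def swap_non_target_tokens(tokens, labels, rng):
    """DA operator: swap two tokens that lie outside any target span (label "O")."""
    candidates = [i for i, label in enumerate(labels) if label == "O"]
    if len(candidates) < 2:
        return list(tokens), list(labels)  # nothing to swap
    i, j = rng.sample(candidates, 2)
    swapped = list(tokens)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    # Both swapped positions carry the label "O", so the labels are unchanged.
    return swapped, list(labels)


def embed(tokens, table):
    """Look up one embedding vector per token."""
    return [table[token] for token in tokens]


def interpolate(original, augmented, lam):
    """Convex interpolation per position: lam * original + (1 - lam) * augmented."""
    return [
        [lam * x + (1.0 - lam) * y for x, y in zip(u, v)]
        for u, v in zip(original, augmented)
    ]


# Hypothetical 2-dimensional embeddings; a real model has hundreds of dimensions.
TOY_EMBEDDINGS = {
    "the": [1.0, 0.0],
    "room": [0.5, 0.5],
    "was": [0.0, 1.0],
    "very": [0.2, 0.8],
    "small": [0.9, 0.1],
}

rng = random.Random(0)
tokens = ["the", "room", "was", "very", "small"]
labels = ["O", "B-ASPECT", "O", "B-OPINION", "I-OPINION"]  # hypothetical tag names

aug_tokens, aug_labels = swap_non_target_tokens(tokens, labels, rng)
lam = rng.betavariate(0.8, 0.8)  # mixing weight in [0, 1]
mixed = interpolate(
    embed(tokens, TOY_EMBEDDINGS), embed(aug_tokens, TOY_EMBEDDINGS), lam
)
print(aug_tokens)
```

Because the operator only touches non-target tokens, the aspect and opinion spans keep their labels, and the interpolation pulls the augmented embeddings back toward the original sentence.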
For the second component, Snippext leverages semi-supervised learning (SSL). SSL is a paradigm in which models learn from both a small amount of labeled training data and a large amount of unlabeled training data. To do this, we created MixMatchNL, a novel adaptation of MixMatch from images to text. MixMatch is a computer vision technique for training high-accuracy image classifiers with a limited amount of labeled images.
Unlabeled training data helps the trained model generalize better to the entire data distribution. It also helps to avoid overfitting to the small training set. MixMatchNL can exploit these advantages and leverage massive amounts of unlabeled training data through two steps:
- It guesses the labels for unlabeled data.
- It uses MixUpNL to interpolate data with guessed labels and data with known labels.
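The two steps above can be sketched as follows. This is a simplified illustration, not the MixMatchNL implementation: `toy_model` is a hypothetical stand-in for the fine-tuned language model, the feature vectors are made up, and the mixing weight is fixed rather than sampled. The `sharpen` step is borrowed from the original MixMatch for images and is an assumption here, since the text above only says that labels are guessed.

```python
def toy_model(features):
    """Hypothetical stand-in classifier: returns [p_negative, p_positive]."""
    score = sum(features) / len(features)
    p_positive = min(max(score, 0.0), 1.0)
    return [1.0 - p_positive, p_positive]


def sharpen(probs, temperature=0.5):
    """Make a guessed distribution more confident (as in MixMatch for images)."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def mix(a, b, lam):
    """Convex interpolation of two equal-length vectors."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


# One labeled example (known one-hot label) and one unlabeled example.
labeled_x, labeled_y = [0.9, 0.7], [0.0, 1.0]
unlabeled_x = [0.2, 0.4]

# Step 1: guess a soft label for the unlabeled example.
guessed_y = sharpen(toy_model(unlabeled_x))

# Step 2: interpolate features and labels of the two examples.
lam = 0.75  # fixed here for clarity; MixUp samples this weight from a Beta distribution
mixed_x = mix(labeled_x, unlabeled_x, lam)
mixed_y = mix(labeled_y, guessed_y, lam)

print(mixed_x, mixed_y)
```

The mixed label stays a valid probability distribution because it is a convex combination of two distributions, so it can be used directly as a soft training target.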
The field of computer vision has implemented this concept successfully. But this is the first time that it has been adapted for text.
We evaluated the performance of Snippext’s MixDA and MixMatchNL modules by applying them to two Aspect-Based Sentiment Analysis (ABSA) tasks, Aspect Extraction (AE) and Aspect Sentiment Classification (ASC), with four different sizes of training data. It’s important to note that Snippext is not tied to any language model. For implementation and experimentation purposes, we used BERT.
Surprisingly, Snippext was able to reach SOTA performance when supplied with only half or even a third of the original dataset. In other words, this opinion mining system allows for a reduction of 50% or more in training data while still matching current SOTA results. When all training data is leveraged, Snippext outperforms SOTA models by up to 3.55%.
We also evaluated the practicality of Snippext by applying it to a large real-world hotel review corpus consisting of 842,260 reviews of 494 San Francisco hotels. Our experiment and analysis show that Snippext can extract more fine-grained opinions and customer experiences that other systems usually miss. More specifically, the baseline pipeline extracted 3.16M aspect-opinion pairs, while Snippext extracted 3.49M, roughly 10% more.
Future Work in Opinion Mining
With capabilities to reduce the amount of labeled training data required and extract more fine-grained information, Snippext opens up a plethora of opportunities for opinion mining in various applications. This new system is already making a tangible impact on our ongoing collaborations with a hotel review aggregation platform and a job-seeking company.
These examples are really just the beginning of what’s possible with opinion mining. Soon, we will explore optimization opportunities such as multitask learning and active learning to further reduce the labeled data requirements for Snippext.
Are you interested in learning more about Snippext? Check out our research paper! Do you have any questions about how it works? Contact us today!
Written by Megagon Labs