Extracting Salient Facts from Online Reviews with Scarce Labels

Online reviews have become an essential source of information: more than 90% of people read online reviews about local businesses [1]. Job seekers also rely on reviews to find the best prospective employers before applying to open positions. With previous projects like Coop and Extreme Reader, Megagon Labs has already worked to ease the burden of going through thousands of reviews in which information can be superfluous or repetitive. Because job seekers increasingly rely on online reviews, extracting salient facts relevant to job seeking (e.g. employer benefits) from those reviews is an intriguing problem. Yet we noticed that little work focuses on this audience, even though job seekers want this information. We therefore expanded our summarization work to find salient facts within reviews, particularly company reviews. We define salient facts as review comments that present unique or distinctive information about a company.

To ease the headache of searching through reviews, we worked on extracting the most salient facts from them, since this information is not always presented to applicants in job descriptions. We adopted supervised learning methods to develop the extractor. However, the very definition of salient facts leads to an inherent problem: scarce labels. Our annotation study showed that the percentage of salient facts in raw reviews is extremely low (< 10%), which makes it difficult to collect labels at the scale needed to train supervised methods. We applied pre-trained models and a series of optimizations to address the problem, and found that our methods improved F1 scores by more than 40%. We also evaluated our methods on similar existing tasks that deal with minority review comment extraction (e.g. suggestion mining). The experiments showed that our methods can reduce the number of labels by 75% when compared with UDA semi-supervised learning.

Defining Salient Facts

We first inspected raw online reviews to derive a definition of salient facts. Then, we analyzed expert reviews to ensure that the derived definition aligns with the knowledge of professional reviewers [2][3] from Business Insider. 

We obtained sentences from 43,000 reviews about Google, Amazon, Facebook, and Apple. We then invited three annotators to select a set of sentences that contain the most salient and factual statements. Two examples of salient facts, about Google and Amazon respectively, are “Google also has 25+ cafes and micro kitchens every 100ft” and “Dogs allowed in all the buildings I’ve been to (including some dog parks in the buildings!)”. The annotators also selected a set of non-salient sentences. Two examples are “awesome place to work, great salary, smart people” and “I couldn’t imagine a better large corporate culture that still tries to be agile.”

We observed that sentences deemed to be salient facts exhibit two characteristics:

  1. They describe relatively infrequent attributes.
  2. They tend to contain more quantitative facts and fewer subjective sentiments than the majority of sentences.

We analyzed reviews from professional reviewers to investigate whether they exhibit the characteristics of salient facts: do they describe an uncommon attribute and/or use quantitative descriptions? Using a set of attribute words found in expert reviews, we calculated the frequencies at which the attributes appeared in user-generated review sentences. We found that expert reviews used attribute words that were infrequent in the user-generated review sentences. For example, the attribute “death” appeared in only 0.01% of user-generated review sentences. “Death” is extremely infrequent when compared to other attributes, such as “place”, which appeared in 3.44% of the user-generated review sentences.
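The frequency measurement described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name, the whole-word matching rule, and the toy sentences are all assumptions made for the example.

```python
import re

def attribute_frequency(sentences, attribute):
    """Return the fraction of sentences that mention `attribute` as a whole word."""
    if not sentences:
        return 0.0
    # Whole-word, case-insensitive match (an assumption; the post does not
    # say how attribute words were matched).
    pattern = re.compile(r"\b" + re.escape(attribute) + r"\b", re.IGNORECASE)
    hits = sum(1 for sentence in sentences if pattern.search(sentence))
    return hits / len(sentences)

# Toy sentences for illustration only; not drawn from the review dataset.
sentences = [
    "Great place to work",
    "The place has smart people",
    "Death benefits are paid to the family",
    "Nice salary and interesting projects",
]

print(attribute_frequency(sentences, "place"))  # 0.5
print(attribute_frequency(sentences, "death"))  # 0.25
```

On a real corpus, attributes whose frequency falls near the bottom of the distribution would be the "infrequent" ones in the sense used here.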

We then investigated whether expert sentences use more quantitative descriptions than randomly selected user-generated review sentences. We found that 4 of the 8 expert sentences [2] describing the benefits of Google used quantitative descriptions such as 10 years, 25 days, 18-22 weeks, and 50% match. On the other hand, none of the 8 sentences randomly sampled from users mentioned any quantities. In fact, most of them used subjective vocabulary such as nice, interesting, and great. The results suggest that our derived definition of salient facts reflects the knowledge of professional reviewers.
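A rough way to check a sentence for quantitative descriptions is a regular expression over digits. The sketch below is only an approximation under stated assumptions: the pattern and the helper name `find_quantities` are hypothetical, and the post does not describe how quantities were actually identified.

```python
import re

# Assumed pattern: a run of digits, optionally a range such as 18-22,
# optionally followed by a percent or plus sign.
QUANTITY_PATTERN = re.compile(r"\d+(?:-\d+)?[%+]?")

def find_quantities(sentence):
    """Return every quantity-like token found in the sentence."""
    return QUANTITY_PATTERN.findall(sentence)

salient = "Google also has 25+ cafes and micro kitchens every 100ft"
non_salient = "awesome place to work, great salary, smart people"

print(find_quantities(salient))      # ['25+', '100']
print(find_quantities(non_salient))  # []
```

Quantities written out as words ("ten years") would be missed by this pattern, so it undercounts by design.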


The percentage of salient facts in raw reviews is extremely low (<10%), which means we cannot collect a large number of salient fact examples to train an extractor. To remedy this problem, we leveraged pre-trained models and applied a set of optimizations to reduce the number of labels needed. Owing to the recent success of pre-trained models in information-extraction tasks, we adopted these models for salient fact extraction. We first modeled salient fact extraction as a sequence classification task over pre-trained models. We then employed two major optimizations, representation enrichment and label propagation, to prepare a better training set. Representation enrichment expands the input tokens of a training example, while label propagation enlarges the training set using unlabeled sentences. The effect of each optimization is reported in our published paper.

Our first goal was to build a model that would predict the correct label for an unseen review sentence (salient or not). We trained the model using a set of labeled text instances, each of which consisted of a sentence and a binary label. We used pre-trained models, including BERT, ALBERT, and RoBERTa, to project a text instance into a high-dimensional vector (e.g. 768 dimensions in BERT). By leveraging these models, we were able to train extractors with far fewer salient sentences than supervised models trained from scratch would need.

Limitations in Scarcity

Although pre-trained models show excellent generalizability, we found that they struggle to make correct predictions for sentences with unseen attributes or quantities. The models can overfit to specific examples in the training set instead of learning the general characteristics of a salient fact. This weakness is due to the inherent scarcity of infrequent attributes and quantitative descriptions in the datasets. We therefore employed representation enrichment and label propagation methods to address these issues.

Representation Enrichment and Label Propagation Methods

To solve the issue with infrequent attributes, we first developed the representation enrichment method to help pre-trained models recognize unseen attributes and quantities at prediction time. The method appends a special tag to text instances, so that the model can learn to recognize the tag and make more accurate predictions.
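To make the idea of appending a tag concrete, here is a minimal sketch. The tag strings, the digit-based trigger, and the list of rare attribute words are all assumptions chosen for illustration; the actual tags and rules used in our method are described in the paper, not here.

```python
import re

# Hypothetical tag tokens; the real tag vocabulary is not given in this post.
QUANTITY_TAG = "[QUANTITY]"
ATTRIBUTE_TAG = "[ATTRIBUTE]"

def enrich(sentence, rare_attributes):
    """Append tags when the sentence has a digit or a known rare attribute word."""
    tags = []
    if re.search(r"\d", sentence):
        tags.append(QUANTITY_TAG)
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if words.intersection(rare_attributes):
        tags.append(ATTRIBUTE_TAG)
    # The enriched text, not the raw sentence, would be fed to the classifier.
    return " ".join([sentence] + tags)

rare = {"dogs", "death"}  # assumed list of infrequent attribute words
print(enrich("Google also has 25+ cafes", rare))
# Google also has 25+ cafes [QUANTITY]
print(enrich("Dogs allowed in all the buildings", rare))
# Dogs allowed in all the buildings [ATTRIBUTE]
```

The point of the tag is that it stays the same across sentences, so a model can pick up on it even when the specific number or attribute word never appeared in training.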

Second, we implemented label propagation to supplement the training data. We used the Jaccard score to search the unlabeled text instances for the most similar ones. Because the Jaccard score favors frequent word tokens such as stopwords, we introduced a reranking operator that sorts all candidates by their relative affinity to positive and negative examples in the training set. This operator returns the top-ranked candidates as positive examples and the tail-ranked candidates as negative examples. In this way, we were able to enlarge the set of training examples.
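The two steps, Jaccard search followed by reranking, can be sketched as follows. This is a simplified illustration under several assumptions: whitespace tokenization, a "keep anything that overlaps" candidate filter, and an affinity defined as best similarity to positives minus best similarity to negatives. None of these details are specified in this post, and all function names are hypothetical.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set (assumed tokenization)."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard score: size of the token intersection over size of the union."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def affinity(candidate, positives, negatives):
    """Assumed relative affinity: closeness to positives minus closeness to negatives."""
    pos = max((jaccard(candidate, p) for p in positives), default=0.0)
    neg = max((jaccard(candidate, n) for n in negatives), default=0.0)
    return pos - neg

def propagate(unlabeled, positives, negatives, k=1):
    """Return (new_positives, new_negatives) taken from the unlabeled pool."""
    labeled = list(positives) + list(negatives)
    # Step 1: keep unlabeled sentences that overlap with some labeled example.
    candidates = [s for s in unlabeled
                  if any(jaccard(s, ex) > 0 for ex in labeled)]
    # Step 2: rerank candidates by relative affinity, highest first.
    ranked = sorted(candidates,
                    key=lambda s: affinity(s, positives, negatives),
                    reverse=True)
    k = min(k, len(ranked) // 2)  # avoid giving one sentence both labels
    if k == 0:
        return [], []
    return ranked[:k], ranked[-k:]

positives = ["free lunch at 25 cafes every day"]
negatives = ["great place to work with smart people"]
unlabeled = [
    "free dinner at 3 cafes on campus",
    "great place to work and nice people",
    "zzz",
]
new_pos, new_neg = propagate(unlabeled, positives, negatives, k=1)
print(new_pos)  # ['free dinner at 3 cafes on campus']
print(new_neg)  # ['great place to work and nice people']
```

Subtracting the similarity to negative examples is what reduces the pull of stopwords in this sketch: tokens shared with both classes contribute to both terms and largely cancel out.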


We conducted performance evaluations on two datasets: reviews of Google and Amazon. We compared the performance of our model with various other machine learning models: 

  1. Simple models (Logistic Regression (LR) and Support Vector Machine (SVM))
  2. Non pre-trained models (Convolutional Neural Network (CNN), Recurrent Neural Network with Long Short-Term Memory (LSTM))
  3. Pre-trained models (BERT, ALBERT, and RoBERTa)

We applied the same configuration to all models. Unsurprisingly, we found that pre-trained models consistently outperform other models on the two datasets (as shown in Table 1). Among all pre-trained models, ours achieves the highest F1 score on both datasets.

The results suggest that existing pre-trained models are not optimized thoroughly for datasets with scarce labels. Optimizations and training techniques are needed for the best prediction quality.


Table 1: The best F1 score of simple models (LR/SVM), non pre-trained models (CNN/LSTM), pre-trained models (BERT/ALBERT/RoBERTa) and ours.
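For readers less familiar with the metric, the sketch below shows how an F1 score is computed for the positive (salient) class and why it is more informative than accuracy when positives are rare. The labels are made up for illustration and are unrelated to the numbers in Table 1.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Made-up labels: 1 salient sentence out of 10.
y_true = [1] + [0] * 9
always_negative = [0] * 10

print(accuracy(y_true, always_negative))  # 0.9
print(f1_score(y_true, always_negative))  # 0.0
```

A model that never predicts "salient" looks 90% accurate here but has an F1 of zero, which is why F1 is the headline metric for this kind of imbalanced task.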

Similar Public Tasks

We evaluated our method on similar tasks to verify whether it is applicable to the extraction of other minority comments. 

We obtained four public datasets: 

  1. SUGG, for predicting whether a software review contains user suggestions.
  2. HOTEL, for extracting customer-to-customer suggestions for accommodation.
  3. SENT, for recognizing sentences that carry tips for PHP API design.
  4. PARA, which comes from the same source as SENT but contains paragraph-level examples.

The ratio of positive examples for SUGG, HOTEL, SENT, and PARA is 26%, 5%, 10%, and 17% respectively. In all four datasets, we aimed to extract minority comments from raw reviews. 

We adopted Unsupervised Data Augmentation (UDA; Xie et al., 2020) as a strong baseline method. UDA uses BERT as the base model and augments every training example using back translation. The example and its back translations are used together to train the base model. We report F1 scores of BERT, UDA, and our method in Table 2 below. BERT performs the worst because it is not optimized for scarce labels. UDA and our approach perform similarly across all the datasets, yet UDA uses the full set of training examples, while ours uses only 23.52%, 33.33%, 21.97%, and 38.46% of the examples on SUGG, HOTEL, SENT, and PARA respectively. The back translation in UDA is mild in that it usually changes only one or two word tokens of the source example. Such a mild strategy is too conservative to augment minority examples in an extremely imbalanced dataset. A more aggressive design that can augment examples with many new words is therefore needed for minority comment extraction tasks.


Table 2: F1 scores of BERT, UDA, and our method when using the full set or 2,000 training examples.

We also evaluated the statistical significance of the performance differences among BERT, UDA, and our method, as shown in Figure 2 below. In this comparison, our method uses only 2,000 training examples. Compared with our method, BERT shows no significant difference on SUGG and SENT and performs worse on HOTEL and PARA. The results suggest that our method can match or outperform BERT even with fewer training examples. When comparing UDA with our method, the two show no significant difference on SENT and PARA. On HOTEL, UDA is better, but on SUGG it performs worse.

The results demonstrate, with statistical significance, that our method can perform as well as UDA with significantly fewer training examples.


Figure 2: Statistical significance test.


In this blog post, we introduced a novel review-mining task that extracts salient facts from online company reviews. In contrast to reviews written by experts, only a few online reviews contain useful and salient information about a particular company. As a result, a model of “salience” has to be learned from highly skewed and scarce training data. To address the data-scarcity issue, we applied two data enrichment techniques: (1) representation enrichment and (2) label propagation. These techniques boosted the performance of supervised learning models. Experimental results suggest that our method can successfully help train a high-quality salient fact extraction model with fewer human annotations. We also demonstrated that our approach works for similar tasks that deal with minority comment extraction.

Interested in learning more? Check out our research paper! 

Follow us on LinkedIn and Twitter to stay up to date.

Written by Jinfeng Li and Megagon Labs

