Snippext: An Opinion Mining Pipeline that Uses Less Training Data

Understanding public sentiment can unlock unprecedented insights for every business. Consequently, opinion mining has rapidly grown in popularity. But building high-precision, high-recall opinion mining pipelines capable of high-quality information extraction and analysis usually requires an immense amount of labeled training data.

Snippext is a state-of-the-art (SOTA) opinion mining pipeline that extracts aspects, opinions, and sentiments from user-generated content such as online reviews. It reduces the labeled training data usually required by 50% or more through:

  1. Data augmentation that automatically generates additional labeled training examples from existing ones, inspired by a sentence-classifier training method commonly used in natural language processing (NLP); see the sketch after this list.

  2. Semi-supervised learning that leverages massive amounts of unlabeled data.
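
To make the first idea concrete, here is a minimal sketch of token-level data augmentation for a labeled review sentence. The SYNONYMS table, the operator probabilities, and the "B-AS"/"I-AS" tag names are illustrative assumptions, not Snippext's actual augmentation operators.

```python
import random

# Toy synonym table; a real augmenter would draw replacements from a lexical
# resource or a language model rather than a hand-written dictionary.
SYNONYMS = {
    "great": ["excellent", "fantastic"],
    "small": ["tiny", "cramped"],
    "service": ["staff"],
}

def augment(tokens, labels, p_replace=0.2, p_delete=0.1, seed=None):
    """Derive one new labeled example from an existing (tokens, labels) pair.

    Working token by token keeps the per-token labels aligned, which is what
    sequence-tagging tasks such as aspect extraction need.
    """
    rng = random.Random(seed)
    new_tokens, new_labels = [], []
    for tok, lab in zip(tokens, labels):
        r = rng.random()
        if r < p_delete and lab == "O":        # only drop un-annotated tokens
            continue
        if r < p_delete + p_replace and tok.lower() in SYNONYMS:
            tok = rng.choice(SYNONYMS[tok.lower()])
        new_tokens.append(tok)
        new_labels.append(lab)
    return new_tokens, new_labels

# One labeled sentence: "B-AS" marks the start of an aspect span.
tokens = ["The", "service", "was", "great", "but", "the", "room", "was", "small"]
labels = ["O", "B-AS", "O", "O", "O", "O", "B-AS", "O", "O"]
print(augment(tokens, labels, seed=0))
```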

With these optimizations, Snippext matches, and on several opinion mining tasks even outperforms, previous SOTA results. It also extracts significantly more fine-grained opinions, enabling new opportunities for downstream applications.

The Megagon Labs team evaluated two of Snippext’s modules on two aspect-based sentiment analysis (ABSA) tasks: aspect extraction (AE) and aspect sentiment classification (ASC). Snippext achieved SOTA performance with only half, or even a third, of the original labeled data. When the entire dataset was used, Snippext outperformed previous SOTA models by up to 3.55% on these tasks.
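
To illustrate how the two tasks fit together, the sketch below shows the input/output format of AE (BIO tags over tokens) and a small helper that turns tagged aspect spans into ASC inputs. The "B-AS"/"I-AS" tag names and the helper function are illustrative assumptions, not Snippext's code.

```python
def ae_to_asc_inputs(tokens, tags):
    """Collect the aspect spans found by AE so each can be classified by ASC."""
    aspects, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B-AS":                 # a new aspect span starts here
            if current:
                aspects.append(" ".join(current))
            current = [tok]
        elif tag == "I-AS" and current:   # continue the current span
            current.append(tok)
        else:                             # outside any aspect span
            if current:
                aspects.append(" ".join(current))
                current = []
    if current:
        aspects.append(" ".join(current))
    sentence = " ".join(tokens)
    return [{"sentence": sentence, "aspect": a} for a in aspects]

# AE output for one review sentence: "pasta" and "service" are aspects.
tokens = ["The", "pasta", "was", "delicious", "but", "the", "service", "was", "slow"]
tags   = ["O", "B-AS", "O", "O", "O", "O", "B-AS", "O", "O"]

# Each pair below would then be fed to the ASC model, which predicts a
# sentiment polarity ("positive" for pasta, "negative" for service here).
print(ae_to_asc_inputs(tokens, tags))
```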

Snippext has been successfully deployed across numerous domains for information extraction and sentiment analysis, including hospitality, food, and e-commerce. This is just the beginning of what’s possible with this system. We are currently exploring optimization opportunities such as multitask learning and active learning to further reduce labeled data requirements for Snippext.

Rotom: A multi-purposed data augmentation framework for training high-quality machine learning models

We propose Rotom, a multi-purposed data augmentation framework for training high-quality machine learning models while requiring only a small number (e.g., 200) of labeled examples.

ExtremeReader: An Interactive Explorer for Customizable and Explainable Review Summarization

ExtremeReader generates summaries that are both structured and abstractive, making them easier to interpret. It also lets users explore these summaries and see explanations for them by drilling down or up to the desired level of granularity. Users can even see the sentences from which the opinion features were extracted.
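
One way to picture such a drill-down summary is as a tree whose nodes carry an aspect, an aggregate sentiment, supporting sentences, and finer-grained child aspects. The OpinionNode class and the hotel example below are illustrative assumptions, not ExtremeReader's actual data model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OpinionNode:
    """One node of a drill-down summary: an aspect, its aggregate sentiment,
    the review sentences supporting it, and finer-grained child aspects."""
    aspect: str
    sentiment: str                                        # "positive" / "negative" / "mixed"
    evidence: List[str] = field(default_factory=list)     # source sentences
    children: List["OpinionNode"] = field(default_factory=list)

# A toy summary for a hotel: drilling into "room" exposes finer aspects,
# and each leaf keeps the sentences its opinion was extracted from.
summary = OpinionNode("hotel", "mixed", children=[
    OpinionNode("room", "mixed", children=[
        OpinionNode("cleanliness", "positive",
                    evidence=["The room was spotless when we arrived."]),
        OpinionNode("size", "negative",
                    evidence=["Our room felt cramped for two people."]),
    ]),
    OpinionNode("staff", "positive",
                evidence=["Front-desk staff were friendly and quick."]),
])

def print_tree(node, depth=0):
    """Print the summary top-down, one level of granularity per indent."""
    print("  " * depth + f"{node.aspect}: {node.sentiment}")
    for child in node.children:
        print_tree(child, depth + 1)

print_tree(summary)
```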

HappyDB: a happiness database of 100,000 happy moments

We built HappyDB, a crowd-sourced collection of 100,000 happy moments that we make publicly available. Our goal is to build NLP technology that understands how people express their happiness in text, while gaining insights, at scale, into the events and scenarios that lead to happiness.

OpineDB and Voyageur: How Subjective Databases and Experiential Search Can Improve Customer Experiences

We developed OpineDB, a subjective database system that addresses the challenges of querying subjective data by interpreting subjective predicates against a database schema through a combination of natural language processing (NLP) and information retrieval (IR) techniques.
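
As a rough illustration of that idea, the sketch below maps a free-text subjective predicate onto the best-matching attribute of a small subjective schema using simple lexical similarity. The schema, phrase lists, and Jaccard scoring are illustrative assumptions; OpineDB's actual interpretation combines richer NLP and IR techniques.

```python
# Toy subjective schema for hotels: each attribute is described by a few
# phrases mined from reviews (hand-written here for illustration).
SCHEMA = {
    "quietness":   ["quiet room", "no street noise", "peaceful at night"],
    "cleanliness": ["clean room", "spotless bathroom", "fresh linens"],
    "staff":       ["friendly staff", "helpful front desk"],
}

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two short phrases."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def interpret(predicate: str):
    """Map a free-text subjective predicate to the closest schema attribute."""
    return max(
        ((attr, max(jaccard(predicate, p) for p in phrases))
         for attr, phrases in SCHEMA.items()),
        key=lambda pair: pair[1],
    )

# A query predicate such as "a really quiet room" resolves to 'quietness'.
print(interpret("a really quiet room"))
```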