Making Review Analysis Easier with Teddy

User interaction with Teddy.

Reviews hold a wealth of data that can help improve user experiences with e-commerce services. But unlocking insights from reviews can be a daunting process. In our recent CHI’20 paper, we introduced Teddy, an interactive system that aims to make review mining easier. Teddy enables data scientists to explore reviews at scale, iteratively refine extraction schemas of opinion mining pipelines, and quickly obtain insights. We designed Teddy using insights from an interview study in which we spoke with data scientists to understand the workflows and challenges in review analysis. Our study’s findings also go beyond Teddy and provide useful information about common practices and challenges in text mining, which can inform the development of future systems for review analysis and text analysis in general.

Market size of e-commerce has been exponentially increasing in the US (left) and worldwide (right) [src:] ​

Why Care About Review Analysis?

User reviews now play an essential role in practically every type of business. While their prevalence was accelerated by the rapid growth of e-commerce services and products around the world, the ubiquity and power of user reviews now extend to all sectors. 

The market size of e-commerce is increasing exponentially; it’s expected to reach one trillion dollars in the US alone and more than six trillion dollars worldwide by 2023. This unprecedented growth is directly reflected in the amount of available user-generated text (reviews, Q&As, discussions, etc.) for various products and services.

Reviews often contain a plethora of useful information about their subjects. A large number of reviews can usually provide relevant, reliable signals regarding the quality of a particular service or product. Consequently, consumers regularly check reviews and use these signals to inform their purchasing choices.

Businesses benefit from reviews in numerous ways, too. Many e-commerce platforms display reviews alongside summaries to help facilitate buyer decision-making. Reviews also contain feedback and insights on how to improve products and services. Several companies use reviews not only to track and respond to customer feedback but also to adjust their offerings.

There is, however, more to consumer reviews: they are a new kind of representation of products.

The true and effective representation of a product is the collection of all user experiences with the product.

User Experiences as Distributional Representations of Products

So what exactly is an effective representation of a product? We know that seller-provided specifications contain important information, but they’re quite limited as a means to predict a product’s performance in use, and thus of limited value in informing consumer decision-making as well.

We propose that the true representation of a product is the collection of all user experiences with that product. This proposition carries quite a few ramifications regarding the temporal and distributional nature of the representation, but we’ll save that discussion for another blog post. Ideally, both consumer and enterprise software services for products must leverage collections of user experiences as distributional representations for products. Properly incorporating and modeling such representations will be crucial for the improvement of e-commerce in years to come. 

Reviews as Manifestations of User Experiences

But how can we access the experiential data locked in consumers’ experiences? This brings us back to user reviews. Reviews, albeit noisy and sparse, are accounts of user experiences with products. Thus, consumer and enterprise services for products can use reviews to better represent the totality of a product.

Reviews are manifestations of users’ experiences with products.

Today, many e-commerce companies, especially those providing aggregation and search services, employ data scientists to analyze, extract, and summarize information from reviews. However, as we previously noted, consumer-generated review text is notoriously noisy (fragmented, incomplete, spammy, fraudulent, etc.) and often sparse in informational content. For instance, a hotel review typically mentions only a few pertinent aspects of the lodging, such as cleanliness or location, out of numerous other potentially relevant features. Without effective tools, reading and searching through thousands of reviews in order to analyze, understand, and interpret them is an arduous and time-intensive task.

To inform the design of Teddy, we first wanted to better comprehend the current practices and challenges in review mining. So we conducted interviews with 15 data scientists who regularly work on review text mining at various technology companies.

Interview Study: Understanding Workflows and Challenges in Review Analysis

During the interviews, we asked each of our participants open-ended questions about their work experiences. We also requested that they walk us through specific examples of projects they recently worked on.

Our questions ran the gamut from broad inquiries such as “How do you prepare your data?” and “What do you spend most of your time on?” to more specific ones like “What data sources and formats do you employ?” In some cases, additional questions arose from a participant’s unique response.

We highlight some of our important findings below.

Our study revealed three overarching task types spanning review mining workflows:

1. Classification

In classification tasks, analysts develop algorithms to classify entire reviews or individual sentences into predefined categories. Two prime examples of classification tasks are fraud detection and spam detection.

2. Extraction

In this task, analysts build algorithms to detect relevant entities from reviews as well as any descriptors in the text that directly describe or modify those entities. Automatically identifying text spans of opinions about food quality in restaurant reviews is an example of extraction.

3. Representation

For representation tasks, analysts create structured representations to accommodate information and insights related to review text corpora. Developing a schema of amenities offered by specific hotels is a great example of representation tasks.

It’s important to note that these task archetypes are not mutually exclusive. For instance, supervised opinion extraction often initially requires the construction of a schema of aspects (e.g., topics) in order to extract opinions from reviews later.
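To make the dependency between representation and extraction concrete, here is a minimal sketch, assuming a hand-written aspect schema and a tiny opinion lexicon. The names `ASPECT_SCHEMA`, `OPINION_WORDS`, and `extract_opinions` are hypothetical and chosen for illustration; this is not how Teddy or any production pipeline performs extraction, which would typically rely on a trained model.

```python
import re

# Hypothetical aspect schema for restaurant reviews (illustrative only).
# Each aspect maps to a few surface terms that may mention it.
ASPECT_SCHEMA = {
    "food": ["food", "pizza", "pasta", "dessert"],
    "service": ["service", "waiter", "staff"],
    "ambience": ["ambience", "atmosphere", "decor"],
}

# Tiny illustrative opinion lexicon (an assumption, not a real resource).
OPINION_WORDS = {"great", "delicious", "slow", "rude", "cozy", "bland"}


def split_sentences(review):
    """Naive sentence splitter on '.', '!' and '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", review) if s.strip()]


def extract_opinions(review, schema=ASPECT_SCHEMA, opinions=OPINION_WORDS):
    """Return (aspect, term, opinion) triples that co-occur in a sentence."""
    results = []
    for sentence in split_sentences(review):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        found_opinions = [t for t in tokens if t in opinions]
        for aspect, terms in schema.items():
            for term in terms:
                if term in tokens:
                    for opinion in found_opinions:
                        results.append((aspect, term, opinion))
    return results


review = "The pizza was delicious. Our waiter was rude and slow!"
print(extract_opinions(review))
print(extract_opinions("The parking was great."))  # no schema aspect matches
```

In this sketch, an opinion about something the schema does not cover (parking, in the second call) is silently dropped. That is one reason the schema has to be built, and usually revised, before extraction can be trusted.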

Bottlenecks reported by data scientists aggregated (left) and separated by the type of task (right).


Besides the three main task types we just discussed, our interviews also identified several common bottlenecks in participants’ workflows. Unsurprisingly, data scientists dedicate the majority of their time to data preparation. We found that the most time-consuming and challenging steps in this process are labeling, designing crowdsourcing tasks, and data exploration, but not necessarily cleaning.

Data scientists are largely satisfied with using existing language models such as BERT. They’re less concerned with developing new models or optimizing hyperparameters; instead, they tend to focus on improving their training data quality. And they need better tools to do so.

Similar to findings from earlier work, we observed a distinction between what senior and junior data scientists consider to be major hurdles. Senior data scientists, motivated by the cost of context switching, rank pipeline management and provenance among their top concerns. Junior data scientists, on the other hand, are more worried about the bottlenecks that data labeling and crowdsourcing present.

The division of labor between senior and junior data scientists is perhaps only one aspect of the dichotomy here. Another possibility is that, with more experience, data scientists adopt different perspectives on what is most relevant for successful outcomes.

Surprisingly, our study participants, even those who regularly work with large review collections, didn’t report scalability as a significant challenge; they often resorted to sampling and were content with it.

Desired capabilities reported, aggregated (left) and separated by the type of task (right).

So, What Do Data Scientists Want?

Besides identifying bottlenecks, our interviews also homed in on the capabilities that participants considered necessary for improving opinion mining over review text. Unsurprisingly, the gathered responses map to many of the challenges we just discussed. But they’re not exactly one-to-one.

As shown in the charts above, better tools for visualization, pipeline management, and training data curation (labeling) topped the list of desired capabilities. Tools for creating structured representations (topical or thematic content organization) and search (lexical and semantic) followed behind these three categories. Requests for tools to help with data cleaning came in last place.

Teddy review exploration pipeline.

Teddy: A System for Interactive Review Analysis

Findings from the interview study, as well as our own experience conducting review analysis, suggest that data scientists could considerably benefit from interactive data exploration tools for improved review analysis.

In response, we developed Teddy, an interactive visual analysis system for data scientists to quickly explore reviews and iteratively refine extraction schemas of opinion mining pipelines. Informed by the results of our study, Teddy enables similarity-based multiscale exploration of reviews using fine-grained opinions extracted from them. Teddy is extensible, supporting common text search, filtering, regular expression matching, sorting operations, and their visualizations through programmatic extensions. It sustains an interactive user experience for the analysis of large numbers of reviews by using a combination of pre-computation, indexing, and user-controlled separation of front- and back-end computations. Finally, Teddy enables data scientists to interactively revise and expand domain-specific schemas used for aspect-based opinion mining.
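As a rough illustration of the kinds of operations described above (search, regular expression matching, sorting, and a pre-computed index that keeps queries fast), here is a minimal sketch in plain Python. It does not use Teddy’s code or API. The toy data, the field names, and the functions `build_index`, `search`, `regex_filter`, and `sort_by_rating` are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Hypothetical toy reviews; the field names are assumptions for illustration.
REVIEWS = [
    {"id": 1, "rating": 5, "text": "Clean room and a great location near the station."},
    {"id": 2, "rating": 2, "text": "The room was not clean and check-in took 45 minutes."},
    {"id": 3, "rating": 4, "text": "Friendly staff, great breakfast, noisy street."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(reviews):
    """Pre-compute an inverted index: token -> set of review ids."""
    index = defaultdict(set)
    for review in reviews:
        for token in tokenize(review["text"]):
            index[token].add(review["id"])
    return index


def search(index, query):
    """Return ids of reviews that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


def regex_filter(reviews, pattern):
    """Keep reviews whose text matches the regular expression."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return [r for r in reviews if compiled.search(r["text"])]


def sort_by_rating(reviews, descending=True):
    """Sort reviews by their star rating."""
    return sorted(reviews, key=lambda r: r["rating"], reverse=descending)


INDEX = build_index(REVIEWS)  # built once, reused for every query
print(search(INDEX, "clean room"))
print([r["id"] for r in regex_filter(REVIEWS, r"\b\d+\s+minutes\b")])
print([r["id"] for r in sort_by_rating(REVIEWS)])
```

Note that the lexical search for "clean room" also returns the review that says the room was "not clean". This is the sort of case where keyword matching alone falls short and exploring fine-grained extracted opinions becomes useful.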

To evaluate Teddy’s efficacy, we conducted two case studies involving exploratory analysis of hotel reviews and iterative schema generation for opinion extraction from restaurant reviews, respectively. For instance, we observed that, by exploring reviews through Teddy and iteratively improving the opinion extraction schema, a data scientist was able to improve the accuracy of a trained BERT classifier by 16.9%.

Above, we briefly summarized our study results and gave an even more condensed description of Teddy. Please see the paper for details and further discussion.

The Future of Interactive Systems for Text Analysis

Products or services purchased on the web are increasingly defined by user-generated texts such as reviews that exemplify consumer experiences. As a result, these texts are effectively becoming distributional representations of products or services, akin to distributional semantics. Information technology services that best leverage these emerging distributional representations will transform the way e-commerce is conducted and experienced. Therefore, research into better tools such as Teddy for understanding and mining the rich information embodied in user-generated text is essential for improving the user experience with e-commerce.

Exploratory insight and hypothesis generation are useful and often necessary. But they are not sufficient for addressing the challenges facing text mining. Data scientists need more integrative systems that couple exploratory visual analysis with confirmatory and model-based analysis (supervised as well as unsupervised) across various stages of data analysis.

Teddy represents a step forward in this direction. We look forward to continuing our research into interactive integrative systems that improve text mining pipelines. In the meantime, we have made Teddy and the data collected from our interview study publicly available.

Interested in learning more about Teddy? Check out our research paper! Do you have any questions about how it works? Contact us today!

Written by Çağatay Demiralp, Jonathan Engel, Sara Evensen, and Megagon Labs.

