The scale and availability of digital text on the internet have increased drastically over the past decade. Online enterprises often apply text data analysis to make sense of this information and improve their services and products. But text data analysis is an iterative and non-linear process involving numerous steps: data preparation, feature extraction, visualization, and model building, to name a few. To streamline this, we’re building Leam, a system that treats text analysis as a single continuum by combining the advantages of computational notebooks, spreadsheets, and visualization tools.
Through interactive workflows and visualizations, Leam provides an integrated experience that facilitates accessible, rapid text data analysis. Our vision towards developing Leam will be presented at the Conference on Innovative Data Systems Research (CIDR) 2021. In this blog post, we’ll explore current challenges facing text analysis, how Leam works, and potential enhancements for this system. The following figure shows an example of user interaction in Leam.
Motivation and Usage Example
The rapid rise of e-commerce in the last few years has made the internet our main platform for everyday activities such as shopping, dating, travel booking, and job searching. And this exponential growth shows no signs of slowing down: Global e-commerce sales are projected to reach six trillion dollars by 2023 — almost a 50% increase over the current market value.
This digital transformation has contributed to the proliferation of user-generated text (e.g., reviews, Q&As, discussions), which often contains valuable insights. Researchers at Megagon Labs explore text analysis problems such as question understanding, text summarization, and opinion mining to help enterprises extract value from such unstructured text. Our vision, therefore, is to develop a system that facilitates workflows for solving these problems and provides a seamless text analysis experience for practitioners.
Text data analysis involves multiple steps. Analysts must prepare raw data (direct manipulation) and implement workflows (writing code). They also have to explore and analyze the resulting information and features (visualization). Therefore, we characterize the text data analysis process more formally as visual interactive text analysis (VITA). The following usage scenario captures a typical VITA workflow:
Let’s assume Cathy, a data scientist in the e-commerce department of a retail business, has been tasked with analyzing customer reviews of products purchased from the company website. To characterize the review corpus better, Cathy would like to capture its underlying topics by performing topic modeling and clustering. Figure 2 captures this use case, which involves five steps: preprocessing the data (clean), creating feature vectors from the text reviews (featurize), creating topic vectors from the corpus (topic modeling), clustering reviews into topics (cluster assignment), and finally, visualizing the clusters by projecting the topic vectors to two dimensions using feature transformation techniques such as PCA (visualize). We refer to this example in subsequent discussions.
The Current Challenges of Text Data Analysis
Based on the usage scenario above, we now turn to the challenges of visual interactive text analysis. In particular, practitioners commonly encounter the following challenges while implementing VITA workflows:
A Disconnect Between Tools
VITA workflows often require users to employ many tools, such as spreadsheets, computational notebooks or scripts, and BI tools or visualization libraries [1, 2]. For example, as shown in the usage scenario above, Cathy may (a) visually inspect the data in a spreadsheet, (b) clean and featurize the text reviews in a computational notebook, and (c) evaluate the quality of the featurization step by visualizing top-ranked words as a bar chart using a visualization library. If Cathy wants to revise the cleaning or featurization steps, she needs to repeat the entire process. Moving back and forth between these tools while iterating over various workflow steps therefore imposes a heavy cognitive load on users. Other challenges compound the problem, including data incompatibility among tools and variations in user interfaces and user actions. This context switching is not only cumbersome but also unnecessarily overwhelming.
Lack of Interactivity
The disconnect between VITA tools also creates a lack of coordination among the three crucial aspects of a VITA workflow: data, code, and visualizations. For instance, to simplify the interpretation of high-dimensional text data, users often map different facets of this information to visualizations (e.g., Cathy views top-ranked words in the review corpus as a bar chart). But visualizations generated via scripts or notebooks are static: they cannot be mapped back to the raw data through direct manipulation. This is but one example of how a lack of interactivity makes it arduous to relate different facets of the same entities on demand.
Limited Support for Operator Reusability
VITA workflows involve various custom-built operations such as cleaning, featurization, visualization, and classification. For example, the cleaning and featurization operations used by Cathy work well for her company’s data, and she wants to use them for similar analysis on a different project. But sharing and reusing these operations across projects can be difficult, as she has to either rewrite them from scratch or copy them from the previous project.
Leam: An Integrated System to Enhance the VITA Experience
We developed Leam to address the aforementioned challenges. Designed to be a one-stop solution for visual interactive text analysis, Leam leverages several design considerations distilled from ideal VITA workflow requirements and our own experiences working in an industrial research lab setting.
Leam integrates three paradigms (spreadsheets, computational notebooks, and visualization tools) into a single system. Such integration facilitates VITA workflows that operate on both data and visual representations via GUI-based interactions and code. Leam also implements a suite of operators formalized with visual text algebra (VTA). VTA makes VITA operations and workflows easier to execute and reuse, and potentially easier to optimize. The VTA operators span various stages of a VITA workflow, such as data cleaning, featurization, and interactive visualization.
Let’s now explore Leam’s front-end, the integral role that VTA plays in this system, and its back-end architecture:
The Leam User Interface
As depicted below in Figure 3, Leam’s user interface contains four components to help users perform in-place text data analysis: Operator View, Visualization View, Table View, and Notebook View.
We now explain how Cathy can use various features of Leam to complete the tasks in the usage scenario discussed earlier. She can load the review dataset in Leam and instantly view the data in the Table View (Figure 3C). She can then use the Operator View (Figure 3A) to select and execute appropriate operations, e.g., cleaning and featurization of the reviews, via simple button clicks. Alternatively, she can write scripts — using a Python-based VTA library — in the Notebook View (Figure 3D) to perform similar operations. To evaluate the quality of the cleaning and featurization operations, Cathy can use the Operator View to add charts, such as a bar chart of top-ranked words, in the Visualization View (Figure 3B).
Using Leam, Cathy can complete various steps within the workflow in-place without having to move back and forth between different tools. Moreover, she can reuse the same operation across projects by leveraging the VTA library or the Operator View. Finally, she can interact with the charts in the Visualization View to inspect relevant reviews in the data (discussed later).
Visual Text Algebra (VTA)
VTA abstracts the suite of operators implemented in Leam. It enables Leam to support an array of diverse tasks within the VITA workflow, like cleaning, featurization, classification, visualization, and view coordination. The four types of high-level VTA operators are selection, transformation, composition, and coordination.
Selection Operator: This type of operator selects data points of interest on which subsequent workflow operations may be performed. Examples of data that can be used are raw data (e.g., rows) in Table View or visualization marks (e.g., select, filter) in Visualization View.
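To make the idea of selection concrete, here is a minimal Python sketch. It is not Leam's actual VTA API: the `select_rows` function and the list-of-dictionaries table are assumptions made purely for illustration. It returns the indices of the rows of interest so that later operations can be applied to just those rows.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def select_rows(table: List[Row], predicate: Callable[[Row], bool]) -> List[int]:
    """Return the indices of the rows that satisfy the predicate.

    Hypothetical helper for illustration; not part of Leam's VTA library.
    """
    return [i for i, row in enumerate(table) if predicate(row)]


reviews = [
    {"id": "r1", "text": "Battery life is great"},
    {"id": "r2", "text": "Screen cracked after a week"},
    {"id": "r3", "text": "Great screen, poor battery"},
]

# Select the reviews that mention "battery" (case-insensitive).
selected = select_rows(reviews, lambda row: "battery" in row["text"].lower())
print(selected)  # [0, 2]
```

Returning indices rather than copies of the rows keeps the selection tied to the original table, which is one plausible way for a later transformation or visualization to act on the same data points.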
Transformation Operator: As suggested by its name, this operator type changes the actual data. For instance, cleaning operators remove noisy elements (e.g., HTML tags, emojis, punctuation), while featurization operators create vector representations of text. Leam’s transformation operators have five subclasses: project, mutate, aggregate, set, and visualize.
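The two kinds of transformation mentioned above, cleaning and featurization, can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not Leam's implementation: the function names `clean` and `tfidf` are hypothetical, the emoji pattern covers only a rough set of Unicode ranges, and the featurizer uses the textbook formula tf × log(N / df) rather than whatever variant Leam applies.

```python
import html
import math
import re
import string
from collections import Counter
from typing import Dict, List

TAG_RE = re.compile(r"<[^>]+>")
# Rough emoji ranges; an assumption for illustration, not an exhaustive list.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean(text: str) -> str:
    """Remove HTML tags, emojis, and ASCII punctuation, then lowercase."""
    text = TAG_RE.sub(" ", text)          # drop tags such as <p> or </b>
    text = html.unescape(text)            # turn &amp; into & before stripping
    text = EMOJI_RE.sub(" ", text)        # drop emojis in the ranges above
    text = text.translate(PUNCT_TABLE)    # drop ASCII punctuation
    return " ".join(text.lower().split()) # lowercase, collapse whitespace


def tfidf(docs: List[str]) -> List[Dict[str, float]]:
    """Map each whitespace-tokenized document to a sparse TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


raw = ["<p>Great phone!!! 😀 &amp; fast</p>", "<b>Great</b> screen."]
cleaned = [clean(text) for text in raw]
print(cleaned)  # ['great phone fast', 'great screen']
vectors = tfidf(cleaned)
```

With this formula, a term that appears in every document (here, "great") receives a weight of zero, while terms unique to one review receive positive weights.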
Composition Operator: VTA currently supports two composition operators: combine and synthesize. With these, users can create customized operators by combining multiple existing ones. For example, users can combine multiple cleaning operators to build a “clean web page” operator and reuse it later from the Operator View. Figure 4 below shows the JSON specification for a TF-IDF operator that takes a set of reviews as input and generates corresponding TF-IDF vectors. The JSON specification requires other parameters related to storage and presentation of the output which we explain in detail in the research paper.
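The “clean web page” example can be sketched in Python as a chain of small cleaning functions. The `combine` helper below is a hypothetical stand-in written for illustration: it assumes that combining means applying operators left to right, which may differ from how Leam's own combine operator is specified, and it does not attempt to model synthesize.

```python
import re
from functools import reduce
from typing import Callable

TextOp = Callable[[str], str]


def combine(*operators: TextOp) -> TextOp:
    """Chain operators left to right into a single reusable operator.

    Hypothetical helper; Leam's combine operator may behave differently.
    """
    def combined(text: str) -> str:
        return reduce(lambda acc, op: op(acc), operators, text)
    return combined


def strip_tags(text: str) -> str:
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def lowercase(text: str) -> str:
    return text.lower()


def collapse_whitespace(text: str) -> str:
    return " ".join(text.split())


# Build the composite operator once, then reuse it wherever it is needed.
clean_web_page = combine(strip_tags, lowercase, collapse_whitespace)

print(clean_web_page("<h1>Big  SALE</h1>\n<p>Today only</p>"))  # big sale today only
```

Because the composite is itself an ordinary operator, it can be passed to `combine` again, so larger operators can be built from smaller ones.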
Coordination Operator: These operators are designed to enable coordination among views. Leam has two types of coordination operators: internal and external. The internal operators let users set the selection type of an existing visualization. For instance, users can change a bar chart’s selection type from a single bar to multiple bars. The external operators enable users to add coordination among multiple views in the Leam user interface. For example, Figure 5 below shows how to link a bar chart with the reviews in the Table View. Clicking a bar, which represents a top-ranked word in the review corpus, filters the Table View down to the relevant reviews.
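The bar-chart-to-table link described above can be sketched as a simple listener pattern in Python. Everything here is an assumption made for illustration: the `BarChart` and `TableView` classes and the `link` function are hypothetical and are not Leam's user interface code or its coordination operator API.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


class BarChart:
    """Hypothetical chart with one bar per top-ranked word."""

    def __init__(self, words: List[str]) -> None:
        self.words = list(words)
        self._listeners: List[Callable[[str], None]] = []

    def on_select(self, listener: Callable[[str], None]) -> None:
        """Register a callback to run whenever a bar is clicked."""
        self._listeners.append(listener)

    def click(self, word: str) -> None:
        """Simulate a click on the bar for the given word."""
        if word not in self.words:
            raise ValueError(f"no bar for word: {word!r}")
        for listener in self._listeners:
            listener(word)


class TableView:
    """Hypothetical table that can show a filtered subset of its rows."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = list(rows)
        self.visible = list(rows)

    def filter_by_word(self, word: str) -> None:
        """Show only the rows whose text contains the word as a token."""
        self.visible = [
            row for row in self.rows if word in row["text"].lower().split()
        ]


def link(chart: BarChart, table: TableView) -> None:
    """External coordination: a bar selection filters the table."""
    chart.on_select(table.filter_by_word)


table = TableView([
    {"id": "r1", "text": "battery life is great"},
    {"id": "r2", "text": "screen cracked after a week"},
    {"id": "r3", "text": "great screen poor battery"},
])
chart = BarChart(["battery", "screen", "great"])
link(chart, table)

chart.click("battery")
print([row["id"] for row in table.visible])  # ['r1', 'r3']
```

The chart knows nothing about the table; the link is added from outside, which mirrors the idea of an external operator that coordinates otherwise independent views.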
Implementation of Leam and Future Work
With Leam, we initially focused on operationalizing our design considerations into a comprehensive system for authoring interactive and reusable VITA workflows. Leam is implemented as an in-memory system with a web-based front-end. The research paper contains more details on Leam’s implementation. We have made the current version of Leam open-source. You can find it on our GitHub here.
Leam is a promising step towards the development of an integrated system that supports the entire text data analysis lifecycle. There are several exciting research opportunities related to the scalability of Leam, versioning of VITA workflows, and coverage of VTA. Since Leam’s current implementation is main-memory-based, an immediate challenge is to develop an efficient storage layer to support large datasets. Moreover, we need to adopt optimization strategies to ensure Leam remains interactive and responsive as the scale of the data increases. Because VITA workflows are iterative, developing a version control system for data, code, model, and visualizations is crucial for ensuring reproducibility and provenance of user actions. Finally, to increase the coverage of VTA and support a wider range of text analysis workflows, we need to implement more useful operators that abstract popular machine learning and natural language processing libraries.
Addressing the challenges above will require interdisciplinary research efforts from the DB, NLP, HCI, and VIS communities. Solving each one can yield vast improvements for how we conduct text data analysis in the future.
[1] Chattopadhyay, S., Prasad, I., Henley, A. Z., Sarma, A., & Barik, T. (2020, April). What’s Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (pp. 1-12).
[2] Zhang, A. X., Muller, M., & Wang, D. (2020). How Do Data Science Workers Collaborate? Roles, Workflows, and Tools. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW1), 1-23.
Note: The research work was done when all the authors were at Megagon Labs.