MEGAnno: Exploratory Labeling for NLP in Jupyter Notebooks

Data labeling is a key part of the machine learning (ML) life cycle. The more relevant and high-quality labels you can collect, the better your model will be. However, labeling can be expensive and time-consuming, which means there is an opportunity for novel tools to reduce cost and burden. In this blog post, we present MEGAnno, our flexible, exploratory, efficient, and seamless labeling framework for NLP researchers and practitioners. In short, MEGAnno aims to reduce cost while improving the quality of labeling.

What Is MEGAnno?

Before diving into the details, let’s highlight MEGAnno’s unique characteristics at a high level.

  • MEGAnno is a framework consisting of customizable UI and programmatic interfaces, plus backend storage and management for your data, annotations, and auxiliary information.
  • MEGAnno is with you throughout the entire life cycle of your annotation project, from early-stage exploration to analysis, project evolution, and large-scale deployment*, meeting you where your everyday data science work takes place.
  • MEGAnno provides a useful set of out-of-the-box “power tools” that give you extra leverage and reach. These tools can easily be extended to fit the specific needs of your project.

 

* Deployment and project exporting to existing crowdsourcing annotation platforms like Amazon MTurk and Appen are still under development.

Why MEGAnno?

Lately, numerous labeling tools and solutions have emerged to collect labels more effectively and cost-efficiently. Unfortunately, most of these tools focus only on the labeling step and don’t consider how data science researchers and practitioners use them within the whole machine learning life cycle. In practice, steps like collecting, exploring, and labeling data, training a model, and evaluating it don’t happen sequentially [Rahman and Kandogan, 2022]. It’s more of a back-and-forth process to continuously improve the data, task definition, annotations, and models [Hohman et al., 2020]. So, when thinking about labeling tools, it’s important to think about how they fit into the bigger picture of the ML model development workflow.

Figure 1: Our dual-loop model for data annotation: (1) data understanding/exploration loop (yellow) (2) model evaluation loop (green).

To better capture data labeling and model-building workflows, we interviewed a group of NLP practitioners and observed that annotation happens in a dual-loop model: a data understanding and exploration loop and a model evaluation loop (Figure 1). More specifically, after cleaning up the data (step 1), users decide what labels to collect and how many data points they need (step 2). As they explore and label the data (step 3), they may change their minds about what labels to use based on what they see (step 4). For example, in document classification, a user may start with broad categories and add more specific ones as she discovers relevant documents. We call these back-and-forth iterations the “data understanding/exploration loop” (yellow). Next, the labeled data is exported from the annotation tool and used to train a model (step 5). However, ML model development is not a one-time effort; it usually goes through many iterations of labeling, training, and data and model debugging. A user may need to collect more data (e.g., for under-represented classes when the downstream model’s predictions are sub-optimal) or further refine the schema (step 6); we call this the “model evaluation loop” (green). Most tools focus on the labeling step only (red box), while MEGAnno aims to capture both loops seamlessly within one framework (light green box). We observed that this dual-loop workflow of NLP researchers and practitioners is rarely supported in existing tools. Specifically, we identified three main challenges:
  • Most tools don’t fit well into the rest of the machine-learning process, so users have to keep switching back and forth between tools, making it hard to move data around.
  • Users cannot customize the tools to focus on labeling what’s most important to them (e.g., class or domain coverage, uncertainty, subtopics, or similarity).
  • Most tools aren’t good at handling projects that evolve over time.
MEGAnno aims to address these challenges. Our tool supports pre-processing, annotation, analysis, model development, and evaluation seamlessly, all in a single notebook. You can also flexibly change your labeling schema as the project evolves, without worrying about managing and storing annotations. It comes with out-of-the-box “power tools” for rich exploration, including heuristic-based search, automatic suggestions, active learning recommendations, and efficient batch labeling.

MEGAnno Framework

MEGAnno provides 1) a client library with both interactive in-notebook UI widgets and Python programmatic interfaces, and 2) a back-end service that stores and manages all needed information with language-agnostic REST APIs.

Figure 2. Overview of MEGAnno framework. 

Using MEGAnno is easy. Our pip-installable client library turns your notebook into a single interface for all your annotation needs: pre-processing, labeling, analysis, and even model training and debugging. The back-end service keeps track of, and provides rich query interfaces over, the data, annotations, and auxiliary information (e.g., text embeddings or annotation time). Next, let’s go over some highlighted features of MEGAnno.
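To give a feel for the in-notebook workflow, here is a short sketch stitched together from the calls shown later in this post. The connection step is intentionally left as a placeholder comment, since the exact client class and its parameters depend on your MEGAnno deployment and are not spelled out here.

# Connect to the MEGAnno back-end service here (client class, project name,
# and authentication token depend on your deployment; see the project docs).
# demo = ...

# Explore: pull a small subset matching a keyword and inspect it in the widget.
subset = demo.search(keyword='delay', limit=10, start=0)
subset.show({'view': 'table'})

# After labeling in the widget, the same notebook can move straight on to
# training and evaluating a model on the collected annotations.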

Interactive Jupyter Widgets

MEGAnno’s annotation widget features 1) a table view to facilitate exploratory and batch labeling and 2) a more zoomed-in single view with more space for each data example, as in most existing labeling tools.


Figure 3. Table view of the annotation widget. Data examples are organized in a table for better exploration and comparison. Users can also search over, sort, or filter data and annotations.


Figure 4. Single view of the annotation widget showing one data example at a time. With more space, the single view is more suitable for span-level tasks like extraction.

MEGAnno also provides a built-in visual monitoring dashboard (Fig. 5) to help users get the real-time status of their annotation project. As a project evolves, users often need to understand its status to make decisions about next steps, such as collecting more data points with certain characteristics or adding a new class to the task definition. To aid such analysis, the dashboard widget packs common statistics and analytical visualizations (e.g., annotation progress, distribution of labels, annotator agreement) based on a survey of our pilot users.

Figure 5. Dashboard widget to monitor the progress and statistics of the project and aid decision-making.
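Under the hood, these dashboard statistics boil down to simple aggregations over the stored annotations. As a rough illustration only (the DataFrame below is a hypothetical export, not MEGAnno’s actual dashboard code), label distribution and two-annotator agreement could be computed like this:

# A minimal sketch, assuming a hypothetical exported table of annotations
# with one column of labels per annotator. Not MEGAnno's internal code.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

labels = pd.DataFrame({
    "annotator_1": ["pos", "neg", "neg", "pos", "neu"],
    "annotator_2": ["pos", "neg", "pos", "pos", "neu"],
})

# Distribution of labels assigned by annotator_1.
print(labels["annotator_1"].value_counts())

# Pairwise agreement between two annotators (Cohen's kappa).
print(cohen_kappa_score(labels["annotator_1"], labels["annotator_2"]))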

Exploratory Labeling

Not all data points are equally important for downstream models and applications. There are often cases where users might want to prioritize a particular batch (e.g., to achieve better class or domain coverage, or to focus on the data points that the downstream model cannot predict well). MEGAnno provides a flexible and controllable way of organizing annotation projects through exploratory labeling. In this annotation process, users first identify an interesting subset and then assign labels to the data in that subset. We provide a set of “power tools” to help identify valuable subsets.

Searching for data subsets to label

MEGAnno supports sophisticated searches over data records, annotations, and user-defined metadata through its search APIs. Users can search data records by keywords (e.g., documents mentioning “customer service”) or by regular expressions for more complex patterns. Users can also search by already-assigned labels (e.g., records with a positive sentiment label). MEGAnno acknowledges the value of auxiliary information for ML projects and provides advanced search functionalities over metadata. For example, users can query with patterns combining regular expressions and POS tags, like demo.search("(best|amazing) <NOUN>", by="POS"). The code snippet below brings up a widget like the one shown in Figure 3, with the new subset containing negative keywords and patterns.

				
s1 = demo.search(keyword='delay', limit=10, start=0)
s2 = demo.search("(bad|worst) <ADJ> <NOUN>", by="POS")
# bring up a widget with results combining subsets s1 and s2
s = union(s1, s2)
s.show({'view': 'table'})
Suggesting data subsets to label

Searches initiated by users can help them explore the dataset in a controlled way, but the quality of searches is only as good as users’ knowledge about the data and domain. MEGAnno provides an automated subset suggestion engine to assist with exploration. Users can customize the engine by plugging in external suggestion models as needed. Currently, the engine provides two types of techniques:

  • Embedding-based suggestions make recommendations based on data-embedding vectors provided by the user. For example, suggest_similar suggests neighbors of the data in the querying subset, while suggest_coverage examines all data records in the embedding space in an unsupervised way and suggests points from less-annotated regions to improve the annotation coverage of the corpus.
  • Active suggestions utilize active learning techniques to recommend the most informative data for the downstream model. With libraries like modAL, users can choose from various selection strategies (e.g., based on model uncertainty or entropy). Since MEGAnno’s seamless notebook experience covers the whole loop from annotation to model training and debugging, users can actively select a subset, annotate it, update the model, and test again in the same notebook without switching environments, as in the sketch below.
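Here is a minimal sketch of the active-selection step using modAL. The embeddings, the seed labels, and the hand-off back to MEGAnno are hypothetical placeholders; only the general pattern (train a learner, query the most uncertain examples, label them, repeat) reflects the workflow described above.

# A minimal active-learning selection sketch with modAL (not MEGAnno's built-in code).
# The pool of embeddings and the seed labels below are made-up placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling

X_pool = np.random.rand(500, 32)   # embeddings of unlabeled records (placeholder)
X_seed = np.random.rand(20, 32)    # embeddings of already-labeled records (placeholder)
y_seed = np.array([0, 1] * 10)     # their labels (placeholder)

learner = ActiveLearner(
    estimator=LogisticRegression(),
    query_strategy=uncertainty_sampling,
    X_training=X_seed,
    y_training=y_seed,
)

# Pick the 10 examples the model is least certain about; their indices could then
# be used to build a MEGAnno subset to annotate in the widget (hand-off is illustrative).
query_idx, _ = learner.query(X_pool, n_instances=10)
print(query_idx)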

Working with Multiple Annotators

In real projects, annotation is rarely done by a single person. As an initial step toward collaborative annotation, MEGAnno provides virtually separated namespaces for each annotator. Users identify themselves with a unique authentication token when connecting to the service and can only update their own labels through the widgets. MEGAnno provides a reconciliation view (Fig. 6) to look at labels from different individuals and resolve potential conflicts.

Figure 6. Reconciliation view showing the existing label distribution for data points.

Conclusion

MEGAnno is an annotation framework designed specifically with NLP researchers and practitioners in mind. Through MEGAnno’s programmatic interfaces and interactive widgets, users can iteratively explore and search for interesting data subsets, annotate data, train models, and evaluate and debug models, all within a Jupyter notebook and without the overhead of context switching or data migration. With the AI-assisted and customizable “power tools,” users can take full control of their annotation projects, accelerating annotation collection based on their specific needs.

Check out our paper, which received the Best Paper Award at the Data Science with Human-in-the-Loop (DaSH) workshop at EMNLP 2022.

Written by: Hannah Kim, Dan Zhang, and Megagon Labs

Follow us on LinkedIn and Twitter to stay up to date.
