Data labeling is a key part of the machine learning (ML) life cycle. The more relevant and high-quality labels you can collect, the better your model will be. However, labeling can be expensive and time-consuming, which means there is an opportunity for novel tools to reduce cost and burden. In this blog post, we present MEGAnno, our flexible, exploratory, efficient, and seamless labeling framework for NLP researchers and practitioners. In short, MEGAnno aims to reduce cost while improving the quality of labeling.
What Is MEGAnno?
Before diving into the details, let’s highlight MEGAnno’s unique characteristics at a high level.
- MEGAnno is a framework consisting of customizable UI and programmatic interfaces, plus backend storage and management for your data, annotations, and auxiliary information.
- MEGAnno is with you throughout the entire life cycle of your annotation project, from early-stage exploration to analysis, project evolution, and large-scale deployment*, meeting you where your everyday data science work takes place.
- MEGAnno provides a useful set of out-of-the-box “power tools” that give you extra leverage and reach. The tools can be easily extended to fit a project’s special needs.
* Deployment and project exporting to existing crowdsourcing annotation platforms like Amazon MTurk and Appen are still under development.
Lately, numerous labeling tools and solutions have emerged to collect labels more effectively and cost-efficiently. Unfortunately, most of these tools focus only on the labeling step and don’t consider how data science researchers and practitioners use them within the whole machine learning life cycle. In practice, steps like collecting data, exploring it, labeling it, training a model, and evaluating it don’t happen sequentially [Rahman and Kandogan, 2022]. Instead, they form a back-and-forth process that continuously improves the data, task definition, annotations, and models [Hohman et al., 2020]. So, when thinking about labeling tools, it’s important to consider how they fit into the bigger picture of the ML model development workflow.
Figure 1: Our dual-loop model for data annotation: (1) data understanding/exploration loop (yellow) (2) model evaluation loop (green).
Existing labeling tools fall short in a few ways:
- Most tools don’t fit well into the rest of the machine-learning process, so users have to keep switching between tools and moving data from one to another.
- Users cannot customize the tools to focus on labeling what’s most important to them (e.g., class or coverage, uncertainty, subtopics, or similarity).
- Most tools aren’t good at handling projects that evolve over time.
MEGAnno provides 1) a client library with both interactive in-notebook UI widgets and Python programmatic interfaces, and 2) a back-end service that stores and manages all needed information with language-agnostic REST APIs.
Figure 2. Overview of MEGAnno framework.
Using MEGAnno is easy. Our pip-installable client library turns your notebook into a single interface for all your annotation needs: pre-processing, labeling, analysis, and even model training and debugging. The back-end service keeps track of the data, annotations, and auxiliary information (e.g., text embeddings or annotation time) and provides rich query interfaces over them. Next, let’s go over some highlighted features of MEGAnno.
Interactive Jupyter Widgets
MEGAnno’s annotation widget features 1) a table view to facilitate exploratory and batch labeling and 2) a more zoomed-in single view with more space for each data example, as in most existing labeling tools.
Figure 3. Table view of the annotation widget. Data examples are organized in a table for better exploration and comparison. Users can also search over, sort or filter on data and annotations.
Figure 4. Single view of the annotation widget showing one data example at a time. With more space, the single view is more suitable for span-level tasks like extraction.
MEGAnno also provides a built-in visual monitoring dashboard (Fig. 5) to help users get the real-time status of the annotation project. As projects evolve, users often need to understand the project’s status to decide on next steps, like collecting more data points with certain characteristics or adding a new class to the task definition. To aid such analysis, the dashboard widget packs common statistics and analytical visualizations (e.g., annotation progress, distribution of labels, and annotator agreement), which we selected based on a survey of our pilot users.
Figure 5. Dashboard widget to monitor the progress and statistics of the project and aid decision-making.
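To make the dashboard statistics concrete, here is a minimal sketch of how annotation progress, label distribution, and a simple annotator agreement score could be computed. It is not MEGAnno’s implementation: the `annotations` layout and the function names are assumptions made for illustration, and the agreement measure is plain pairwise agreement rather than whatever the dashboard actually reports.

```python
from collections import Counter
from itertools import combinations

# Hypothetical layout: record id -> {annotator: label}.
# This is an illustrative stand-in, not MEGAnno's storage format.
annotations = {
    "r1": {"alice": "pos", "bob": "pos"},
    "r2": {"alice": "neg", "bob": "pos"},
    "r3": {"alice": "neg"},
    "r4": {},
}


def progress(annotations):
    """Fraction of records that have at least one label."""
    labeled = sum(1 for labels in annotations.values() if labels)
    return labeled / len(annotations)


def label_distribution(annotations):
    """Count of each label across all records and annotators."""
    return Counter(
        label for labels in annotations.values() for label in labels.values()
    )


def pairwise_agreement(annotations):
    """Share of annotator pairs on the same record that chose the same label."""
    agree = total = 0
    for labels in annotations.values():
        for a, b in combinations(labels.values(), 2):
            total += 1
            agree += a == b
    return agree / total if total else None


print(progress(annotations))
print(label_distribution(annotations))
print(pairwise_agreement(annotations))
```

A low agreement score or a skewed label distribution is the kind of signal that might prompt collecting more data for a class or revisiting the task definition.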
Not all data points are equally important for downstream models and applications. Users often want to prioritize a particular batch (e.g., to achieve better class or domain coverage, or to focus on the data points that the downstream model cannot predict well). MEGAnno provides a flexible and controllable way of organizing annotation projects through exploratory labeling. In this process, users first identify an interesting subset and then assign labels to the data in that subset. We provide a set of “power tools” to help identify valuable subsets.
Searching for data subsets to label
MEGAnno supports sophisticated searches over data records, annotations, and user-defined metadata through its search APIs. Users can search data records by keyword (e.g., documents mentioning “customer service”) or by regular expression to express more complex patterns. They can also search the database by already assigned labels (e.g., records with a positive sentiment label). MEGAnno acknowledges the value of auxiliary information for ML projects and provides advanced search functionality over metadata. For example, users can query with patterns that combine regular expressions and POS tags, like demo.search("(best|amazing) ", by="POS"). The code snippet below brings up a widget like the one shown in Figure 2, with a new subset of records that match negative keywords and patterns.
s1 = demo.search(keyword='delay', limit=10, start=0)
s2 = demo.search("(bad|worst)
# bring up a widget with results combining subset s1 and s2
s = union(s1,s2)
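For readers who want to see the idea without a running MEGAnno service, the sketch below mimics keyword search, regular-expression search, and subset union over a tiny in-memory corpus. Everything in it is an assumption made for illustration: the `records` dictionary, the `search_keyword` and `search_regex` helpers, and this local `union` are toy stand-ins, not the MEGAnno client API.

```python
import re

# Toy corpus; with MEGAnno these records would live in the back-end service.
records = {
    0: "The flight had a two-hour delay.",
    1: "Worst customer service ever.",
    2: "Great crew and a smooth landing.",
    3: "Bad food, and another delay at the gate.",
}


def search_keyword(records, keyword, limit=10, start=0):
    """Ids of records containing the keyword (case-insensitive), paginated."""
    hits = [
        rid for rid, text in sorted(records.items())
        if keyword.lower() in text.lower()
    ]
    return hits[start:start + limit]


def search_regex(records, pattern):
    """Ids of records whose text matches the regular expression."""
    return [
        rid for rid, text in sorted(records.items())
        if re.search(pattern, text, re.IGNORECASE)
    ]


def union(*subsets):
    """Combine subsets, keeping first-seen order and dropping duplicates."""
    combined = []
    for subset in subsets:
        for rid in subset:
            if rid not in combined:
                combined.append(rid)
    return combined


s1 = search_keyword(records, "delay", limit=10, start=0)
s2 = search_regex(records, r"\b(bad|worst)\b")
s = union(s1, s2)
print(s)  # [0, 3, 1]
```

Record 3 matches both searches but appears only once in the combined subset, which is the behavior you would want before handing the subset to an annotation widget.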
Suggesting data subsets to label
Searches initiated by users can help them explore the dataset in a controlled way, but the quality of searches is only as good as users’ knowledge about the data and domain. MEGAnno provides an automated subset suggestion engine to assist with exploration. Users can customize the engine by plugging in external suggestion models as needed. Currently, the engine provides two types of techniques:
- Embedding-based suggestions make suggestions based on data-embedding vectors provided by the user. For example, suggest_similar suggests neighbors of the data in the querying subset, while suggest_coverage examines all the data records within the embedding space in an unsupervised way and suggests data points from the less annotated regions to improve the annotation coverage of the corpus.
- Active suggestions use active learning techniques to recommend the most informative data for the downstream model. With libraries like modAL, users can choose among various selection strategies, such as those based on model uncertainty or entropy. Since MEGAnno’s seamless notebook experience covers the whole loop from annotation to model training and debugging, users can actively select a subset, annotate it, update the model, and test again in the same notebook without switching environments.
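The two suggestion styles above can be sketched in a few lines of plain Python. This is only an illustration of the underlying ideas (nearest neighbors by cosine similarity, and entropy-based uncertainty ranking), not the logic behind suggest_similar or modAL. The function names `similar_to` and `most_uncertain`, the embeddings, and the predicted probabilities are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_to(query_ids, embeddings, k=2):
    """Rank records outside the query subset by their best similarity to it."""
    scores = {
        rid: max(cosine(vec, embeddings[q]) for q in query_ids)
        for rid, vec in embeddings.items()
        if rid not in query_ids
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain(predictions, k=2):
    """Records whose predicted distribution has the highest entropy."""
    return sorted(
        predictions, key=lambda rid: entropy(predictions[rid]), reverse=True
    )[:k]


# Hypothetical user-provided embeddings and model class probabilities.
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.7, 0.7]}
predictions = {"a": [0.5, 0.5], "b": [0.9, 0.1], "c": [0.99, 0.01], "d": [0.6, 0.4]}

print(similar_to(["a"], embeddings))  # ['b', 'd']
print(most_uncertain(predictions))    # ['a', 'd']
```

In a real project the probabilities would come from the downstream model after each training round, so the ranking changes as new labels arrive.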
Working with Multiple Annotators
In real projects, annotation is rarely done by a single person. As an initial step towards collaborative annotation, MEGAnno provides virtually separated namespaces for each annotator. Users identify themselves with a unique authentication token when connecting to the service and can update only their own labels through the widgets. MEGAnno provides a reconciliation view (Fig. 6) to compare labels from different individuals and resolve potential conflicts.
Figure 6. Reconciliation view showing the existing label distribution for data points.
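As a rough illustration of what reconciliation involves, the sketch below builds a per-record label distribution from separate annotator namespaces, flags records with conflicting labels, and proposes a majority label. Majority voting is our assumption for the example, not necessarily how MEGAnno resolves conflicts, and the `labels_by_annotator` layout and function names are hypothetical.

```python
from collections import Counter

# Hypothetical per-annotator namespaces: annotator -> {record id: label}.
labels_by_annotator = {
    "alice": {"r1": "pos", "r2": "neg", "r3": "neg"},
    "bob": {"r1": "pos", "r2": "pos", "r3": "neg"},
    "carol": {"r1": "pos", "r2": "neg"},
}


def label_distributions(labels_by_annotator):
    """Per-record count of each label across annotators."""
    dist = {}
    for labels in labels_by_annotator.values():
        for rid, label in labels.items():
            dist.setdefault(rid, Counter())[label] += 1
    return dist


def reconcile(dist):
    """Majority label per record; None on a tie, left for manual resolution."""
    resolved = {}
    for rid, counts in dist.items():
        ranked = counts.most_common(2)
        tie = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
        resolved[rid] = None if tie else ranked[0][0]
    return resolved


dist = label_distributions(labels_by_annotator)
conflicts = sorted(rid for rid, counts in dist.items() if len(counts) > 1)
resolved = reconcile(dist)

print(conflicts)  # ['r2']
print(resolved)   # {'r1': 'pos', 'r2': 'neg', 'r3': 'neg'}
```

Returning None on ties keeps ambiguous records visible to a human reviewer instead of silently picking a label.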
MEGAnno is an annotation framework designed specifically with NLP researchers and practitioners in mind. Through MEGAnno’s programmatic interfaces and interactive widget, users can iteratively explore and search for interesting data subsets, annotate data, train models, plus evaluate and debug models within a Jupyter notebook without the overhead of context switching or data migration. With the AI-assisted and customizable “power tools,” users can take full control over their annotation project, thereby accelerating annotation collection based on specific needs.
Check out our paper, which won the best paper award at the Data Science with Human-in-the-loop (DaSH) workshop at EMNLP 2022.