LLMs as Data Annotators (Part 2) – MEGAnno+: A Human-LLM Collaborative Annotation System

This is the second post in our series on LLM-powered data annotation. In the first part, we explored how to use LLMs as annotation agents, along with the associated challenges and opportunities. In this article, we introduce MEGAnno+, our human-LLM collaborative annotation tool, which addresses the challenges of LLM annotation by combining human expertise with LLM capabilities.

What is MEGAnno+?

MEGAnno+ is a data annotation system based on a collaborative setup where humans and LLMs work together to produce reliable and high-quality labels. Unlike existing annotation tools that just focus on collecting LLM labels, we investigate how to onboard LLMs as annotation agents within a human-in-the-loop labeling framework. Specifically, MEGAnno+ offers:

  • Effective LLM agent and annotation management
  • Convenient and robust LLM annotation
  • Exploration and verification of LLM labels by humans
  • Seamless annotation experience within Jupyter notebooks

Human-LLM collaborative annotation workflow

In MEGAnno+, we offer a simple human-LLM collaborative annotation workflow: LLM annotation followed by human verification. Put simply, LLM agents label data first (Figure 1, step ①), and humans verify LLM labels as needed. For most tasks and datasets one can use LLM labels as is; for some subset of difficult or uncertain instances (Figure 1, step ②), humans can verify LLM labels – confirm the right ones and correct the wrong ones (Figure 1, step ③). In this way, the LLM annotation part can be automated, and human efforts can be directed to where they are most needed to improve the quality of final labels.

Unlabeled data is labeled by an LLM, a subset is selected, and then the subset is verified by humans.

Figure 1. Our human-LLM collaborative workflow.

Then, the question is, how do we pick which instances to verify (Figure 1, step ②)? There are different strategies, such as prioritizing diversity or minimizing uncertainty. For example, to measure an LLM’s uncertainty, researchers often use logit values as an approximate measure of model confidence, or they apply self-verification and self-consistency techniques, in which the LLM is asked to check its own answer or is sampled several times to see whether its answers agree. Another way is for users to apply their own expertise and query specific data slices (e.g., instances assigned a certain label, or instances containing specific keywords).
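To make the self-consistency idea concrete, here is a minimal sketch that treats the share of the majority label across repeated samples as a rough confidence score. The function name `self_consistency`, the sample data, and the 0.8 threshold are all hypothetical and are not part of MEGAnno+.

```python
from collections import Counter


def self_consistency(sampled_labels):
    """Return (majority_label, agreement) for labels from repeated LLM calls."""
    if not sampled_labels:
        raise ValueError("need at least one sampled label")
    counts = Counter(sampled_labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(sampled_labels)


# Hypothetical samples: each review was labeled five times by the same LLM.
samples = {
    "review_1": ["positive", "positive", "positive", "positive", "positive"],
    "review_2": ["neutral", "negative", "neutral", "positive", "negative"],
}

scores = {key: self_consistency(labels) for key, labels in samples.items()}

# Instances with low agreement are natural candidates for human verification.
to_verify = [key for key, (_, agreement) in scores.items() if agreement < 0.8]
print(to_verify)
```

The threshold is a judgment call: a lower value sends fewer instances to humans, at the cost of letting more uncertain labels through.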

Design considerations

Let’s revisit the challenges and inconveniences you may encounter in an LLM annotation workflow (we partly discussed these in the previous article). First, without any guidance on prompting, you may resort to trial and error to find a suitable prompt for the task. Even then, you need additional checks to ensure that the annotated labels fall within the predefined label space. Second, API calls to the selected LLM can be unreliable, failing with timeouts or rate-limit violations that require manual error handling. Third, you may lack the confidence to train a downstream model without verifying the LLM annotations. However, without any assistance in choosing which annotations to verify, you have to go through all of them, which is time-consuming. Lastly, you may want to save model configurations that work well for future use.

From this example workflow, we came up with the following design requirements for a human-LLM collaborative annotation system.

  • For LLM annotation
    • [Convenient] Automate the annotation workflow, including pre-processing, API calling, and post-processing.
    • [Customizable] Support flexible modification of model configuration and prompt templates.
    • [Robust] Resolve errors with no or minimal interaction.
    • [Reusable] Store used LLM agents (models and prompt templates) for reuse.
    • [Metadata] Capture and store LLM artifacts as annotation metadata.
  • For human verification
    • [Selective] Select verification candidates by search query or recommendation.
    • [Exploratory] Filter, sort, and search by labels and available metadata programmatically and in a UI.

To satisfy these requirements, we implemented our system as an extension of MEGAnno, an exploratory annotation tool within a Jupyter notebook environment. MEGAnno has flexible search and recommendation functions that enable selective and exploratory verification, and it contains a comprehensive backend capable of storing data, labels, and auxiliary metadata generated from LLM annotation.

In the next section, we introduce MEGAnno+’s features in detail.

MEGAnno+ System

System overview

MEGAnno+ offers (1) a Python client library with interactive widgets and (2) a back-end service consisting of web API and database servers. As seen in Figure 2, a user works in a Jupyter Notebook with the MEGAnno+ client installed. Through programmatic interfaces and UI widgets, the MEGAnno+ client interacts with the MEGAnno+ service.

Within a notebook session, you can obtain LLM annotations conveniently and verify the obtained labels selectively (Figure 2). This in-notebook workflow can be seamlessly integrated into existing model training and debugging environments.

We refer to LLM annotation agents simply as agents to distinguish them from human annotators. An agent’s output depends on the underlying LLM, its parameters, and the prompt used, so we define an agent as a model configuration (e.g., the model’s name, version, and hyper-parameters) together with a prompt template. Once an agent is registered, you can use it to run an annotation job on a data subset.
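One way to picture this definition is as a small immutable record that pairs a model configuration with a prompt template. The sketch below is illustrative only: the class names, fields, `{text}` placeholder, and in-memory `registry` are assumptions and do not reflect MEGAnno+’s actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical model configuration: name plus a few hyper-parameters."""
    name: str
    temperature: float = 0.0
    max_tokens: int = 16


@dataclass(frozen=True)
class Agent:
    """An agent is a model configuration together with a prompt template."""
    config: ModelConfig
    prompt_template: str

    def build_prompt(self, text):
        # Fill the (assumed) {text} placeholder with one data instance.
        return self.prompt_template.replace("{text}", text)


# A simple in-memory registry so that agents can be looked up and reused.
registry = {}


def register_agent(agent_id, agent):
    if agent_id in registry:
        raise ValueError(f"agent {agent_id!r} is already registered")
    registry[agent_id] = agent
    return agent


agent = register_agent(
    "sentiment-v1",
    Agent(ModelConfig(name="gpt-3.5-turbo"), "Classify the sentiment.\nText: {text}"),
)
print(agent.build_prompt("I love it"))
```

Making the records frozen means a registered agent cannot be changed silently, so labels stay traceable to the exact configuration and prompt that produced them.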

For human verification, users can first select a subset of labels from an LLM annotation job (Figure 1, step ②), explore them in a UI widget, and confirm or update the LLM labels (Figure 1, step ③).

Now, let’s go over our LLM annotation → human verification workflow step-by-step.

Overview of the meganno system

Figure 2. Overview of MEGAnno+ system.*

LLM annotation

Let’s look at an example annotation workflow, highlighting the common problems we discussed in the previous article, and see how MEGAnno+ helps in a more efficient and convenient process.

Figure 3. Steps in the LLM annotation workflow.

Step 1: Pre-processing

The pre-processing step of MEGAnno+ helps you with defining your task, label schema, and prompt generation for LLM annotation. Furthermore, this step also allows you to select an available LLM model to utilize for annotation and select its model configurations.

We consider the sentiment-analysis task from Part 1 of the article, where you have chosen a data subset and defined a label schema of {name: “sentiment”; options: “positive”, “negative”, or “neutral”}.

Figure 4. Prompt Template UI. Users can customize task instructions and preview generated prompts.

Based on the task and label schema you provided, MEGAnno+ generates a prompt template that specifies the task instruction as well as some formatting instructions (to make label extraction easier). If you are unhappy with the prompt, you can modify it to suit your needs. Note that a few things, such as the task name, label options, and formatting instructions, can’t be altered.

Once you are happy with the prompt template, MEGAnno+ will use this to generate prompts for all the data in your subset.
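As a rough illustration of how a template can be derived from a label schema and then expanded over a subset, consider the sketch below. The wording of the template, the `<label>` tag format, and the function names are assumptions made for this example; MEGAnno+’s real template may look different.

```python
def build_template(task_instruction, schema):
    """Combine an editable task instruction with fixed schema and format lines."""
    options = ", ".join(schema["options"])
    return (
        f"{task_instruction}\n"
        f"Label name: {schema['name']}\n"
        f"Allowed labels: {options}\n"
        "Answer with exactly one label in the form <label>LABEL</label>.\n"
        "Text: {text}"  # left as a literal placeholder, filled per instance
    )


def build_prompts(template, texts):
    """Generate one prompt per data instance in the subset."""
    return [template.replace("{text}", text) for text in texts]


schema = {"name": "sentiment", "options": ["positive", "negative", "neutral"]}
template = build_template("Decide the sentiment of the text.", schema)
prompts = build_prompts(template, ["Great battery life.", "It broke in a week."])
print(prompts[0])
```

Only the task instruction is passed in by the user here, which mirrors the idea that the label options and formatting instructions stay fixed.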

Figure 5. An LLM agent is registered with its model configuration and prompt template.

Next, you need to select a valid LLM model** and define its configurations, as per your choice. When you are registering an LLM annotation agent (Figure 5), MEGAnno+ will perform all validity checks like verifying API keys (if applicable), ensuring the model configurations conform to the LLM API, and also ensuring the prompts generated are within the context limit.
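The kind of checks described above can be sketched as a function that collects problems before any API call is made. Everything here is an assumption for illustration: the 0 to 2 temperature range follows OpenAI-style APIs, the default context limit of 4096 tokens is a placeholder, and the four-characters-per-token estimate is only a crude heuristic (a real check would use the model’s tokenizer).

```python
def validate_agent(config, prompts, context_limit=4096, chars_per_token=4):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    if not config.get("model"):
        problems.append("model name is missing")
    temperature = config.get("temperature", 0.0)
    if not 0.0 <= temperature <= 2.0:
        problems.append(f"temperature {temperature} is outside [0, 2]")
    for index, prompt in enumerate(prompts):
        # Crude token estimate; replace with a real tokenizer in practice.
        estimated_tokens = len(prompt) // chars_per_token + 1
        if estimated_tokens > context_limit:
            problems.append(
                f"prompt {index} needs ~{estimated_tokens} tokens, "
                f"limit is {context_limit}"
            )
    return problems


ok = validate_agent({"model": "gpt-3.5-turbo", "temperature": 0.0}, ["short prompt"])
bad = validate_agent({"temperature": 3.0}, ["x" * 40], context_limit=5)
print(ok, bad)
```

Returning all problems at once, rather than raising on the first, lets a user fix the configuration in a single pass.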

Step 2: Calling an LLM API for an annotation job

Using the LLM agent registered in Step 1, MEGAnno+ calls the LLM API to obtain annotations for your subset. To ensure a robust workflow, MEGAnno+ also handles any errors that may be encountered when calling the LLM API.
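A common way to handle timeouts and rate-limit errors without user intervention is to retry with exponential backoff. The sketch below shows that general pattern; it is not MEGAnno+’s implementation, and `TransientAPIError`, `call_with_retries`, and `flaky_call` are hypothetical names standing in for a real client’s exceptions and request function.

```python
import random
import time


class TransientAPIError(Exception):
    """Stand-in for retryable failures such as timeouts or rate limits."""


def call_with_retries(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(); on transient errors, wait with exponential backoff and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientAPIError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            # Double the wait each time and add jitter to avoid retry bursts.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
            sleep(delay)


# Simulated API call that fails twice before succeeding.
attempts = {"n": 0}


def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientAPIError("rate limit exceeded")
    return "positive"


# Sleeping is disabled here so the demonstration finishes instantly.
result = call_with_retries(flaky_call, sleep=lambda seconds: None)
print(result, attempts["n"])
```

Only errors known to be transient are retried; anything else propagates immediately so that genuine bugs are not hidden behind retries.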

But hold on! Just before calling the LLM, you realize you want not only labels but also the model’s confidence scores. MEGAnno+ allows you to obtain LLM metadata artifacts and log them. You can tell MEGAnno+ to return “conf” (the logit-based confidence scores) along with the annotations (Figure 6).

Figure 6. Running an LLM annotation job on your selected subset using the chosen LLM agent.

Once the LLM responses are obtained, we move to the next step of extracting the annotations. 

Step 3: Post-processing

As discussed in Part 1, post-processing LLM responses can be tricky. Using our formatting instructions and a robust post-processing technique, MEGAnno+ eases this process: it extracts the labels and metadata and performs any additional computation (e.g., calculating confidence scores from logits).
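As an example of such a computation, one common way to turn token log-probabilities into a confidence score is to exponentiate their sum, which gives the probability the model assigned to the whole label string. This is a sketch of that general approach and not necessarily the formula MEGAnno+ uses; the function name and the example values are hypothetical.

```python
import math


def confidence_from_logprobs(token_logprobs):
    """Probability of the generated label: exp of the summed token log-probs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs))


# Hypothetical log-probabilities for the two tokens that spell out one label.
confidence = confidence_from_logprobs([-0.05, -0.02])
print(round(confidence, 3))
```

Because the score is a product of per-token probabilities, labels that span more tokens tend to score lower, which is worth keeping in mind when comparing confidence across labels.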

Our post-processing mechanism handles several common errors we described in Part 1. These issues fall into two groups: formatting errors (i.e., output not conforming to the formatting instructions) and label class errors (e.g., invalid labels or typos in label names). See Figure 7 for examples.

Figure 7. Example LLM responses and post-processing results by MEGAnno+
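The two error classes handled in post-processing can be illustrated with a small extractor built on the standard library. This is a sketch under assumptions, not MEGAnno+’s actual logic: the `<label>` tag format, the 0.8 similarity cutoff, and the fallback rule are all choices made for this example.

```python
import difflib
import re

LABELS = ["positive", "negative", "neutral"]


def extract_label(response, labels=LABELS, cutoff=0.8):
    """Map a raw LLM response to a valid label, or None if it is unusable."""
    match = re.search(r"<label>(.*?)</label>", response, re.IGNORECASE | re.DOTALL)
    # Formatting error: no tags found, so fall back to the whole response.
    candidate = match.group(1) if match else response
    candidate = candidate.strip().strip(".\"'").lower()
    if candidate in labels:
        return candidate
    # Label class error: tolerate small typos such as "postive".
    close = difflib.get_close_matches(candidate, labels, n=1, cutoff=cutoff)
    if close:
        return close[0]
    # Free-text answer: accept it only if exactly one valid label is mentioned.
    mentioned = [
        label for label in labels
        if re.search(rf"\b{re.escape(label)}\b", candidate)
    ]
    return mentioned[0] if len(mentioned) == 1 else None


for raw in ["<label>Positive</label>", "<label>postive</label>",
            "The sentiment is negative.", "<label>happy</label>"]:
    print(repr(raw), "->", extract_label(raw))
```

Returning `None` for ambiguous or out-of-schema answers keeps invalid responses visible, so they can be counted and reviewed instead of being silently coerced into a label.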

MEGAnno+ also allows you to monitor the progress of the annotation task (Figure 8). You are provided with information about any API errors encountered while calling the LLM or any post-processing errors. Furthermore, you can view a broad summary of the annotation task, which tells you the number of responses corresponding to each label, and the frequency of invalid LLM responses (if encountered during post-processing). This summary should give you a good idea of how the LLM performed on your dataset for this task.

Figure 8. Summary for an LLM annotation job.
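A job summary of this kind boils down to counting labels and invalid responses. The sketch below assumes that post-processing yields either a valid label or `None` for an invalid response; the function name and the output layout are hypothetical.

```python
from collections import Counter


def summarize(labels):
    """Count responses per label and report the share of invalid responses."""
    counts = Counter("invalid" if label is None else label for label in labels)
    total = len(labels)
    return {
        "total": total,
        "counts": dict(counts),
        "invalid_rate": counts["invalid"] / total if total else 0.0,
    }


# Hypothetical post-processed results for a five-instance job.
summary = summarize(["positive", "negative", None, "positive", "neutral"])
print(summary)
```

A high invalid rate or a heavily skewed label distribution is often the first hint that the prompt or the model configuration needs another iteration.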

You can now go back and make any modifications to your task as you see fit. For example, you may make minor modifications to the prompt, or change the LLM configuration to define a new agent to perform your annotation task. You may also want to change your dataset or the task itself, and then define a new schema. You may also use MEGAnno+’s mechanism to retrieve your previously saved agents and use one of them to annotate a new dataset.

And with that, you are now able to inspect the LLM annotations using our UI widget.

Human Verification

We have discussed the benefits of a human-LLM collaborative environment. Now, we discuss how we include humans in the loop for the LLM annotation process. Once LLM annotations are obtained, MEGAnno+ allows humans to verify the annotations (i.e., either confirm the label as correct or reject the label and specify a correct label). 

Now, how do we choose which instances a human should verify? Having a human verify every annotation in the dataset would defeat the purpose of using LLMs for cheaper and faster annotation. Instead, MEGAnno+ aids human verifiers by computing a confidence score for each annotation. If you specified that your annotation job should store confidence scores, as in Figure 6, you can retrieve and visualize them alongside the LLM labels in our UI widget (Figure 9). You can sort and filter by confidence score to surface only the annotations for which the LLM had low confidence, which makes human verification more efficient.

Figure 9. Verification UI for exploring data and confirming/correcting LLM labels.
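The same sort-and-filter step can also be expressed programmatically. The sketch below assumes each annotation is a plain dictionary with a `conf` field; the record layout, the function name, and the 0.7 threshold are hypothetical and do not reflect the MEGAnno+ client API.

```python
def verification_queue(annotations, threshold=0.7):
    """Return low-confidence annotations, least confident first."""
    low = [item for item in annotations if item["conf"] < threshold]
    return sorted(low, key=lambda item: item["conf"])


# Hypothetical LLM annotations with stored confidence scores.
annotations = [
    {"id": 1, "label": "positive", "conf": 0.98},
    {"id": 2, "label": "neutral", "conf": 0.55},
    {"id": 3, "label": "negative", "conf": 0.91},
    {"id": 4, "label": "neutral", "conf": 0.42},
]

queue = verification_queue(annotations)
print([item["id"] for item in queue])
```

Ordering the queue from least to most confident means that a verifier with limited time spends it on the labels most likely to be wrong.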

Try out our demo!

MEGAnno+ is now available for you to try here in a few simple steps. A video demo of our system is also available on the same website. If you like this demo version, there is more good news: a full version of MEGAnno is coming soon, in which you can install MEGAnno+ and create your own annotation projects. So stay tuned for more updates!


We presented MEGAnno+, a human-LLM collaborative annotation system. With our LLM annotation → human verification workflow, reliable and high-quality labels can be collected conveniently and efficiently. Our tool supports robust LLM annotation, selective human verification, and effective management of LLMs, labels, and metadata for reusability.

*For a detailed description of MEGAnno+’s core concepts and data model, kindly refer to Section 4 of our paper.

**MEGAnno+ supports select LLM models. For the purposes of our demo, we use OpenAI ChatCompletion models.

For more details, check out our EACL 2024 demo paper and our MEGAnno webpage.

Written by: Hannah Kim, Kushan Mitra, and Megagon Labs.

Follow us on LinkedIn and Twitter to stay up to date.

