LLMs as Data Annotators (Part 1) – Challenges and Opportunities

If you are familiar with data annotation, then you know it can be a painful process that slows down your projects. Many researchers and practitioners have tried to automate data annotation by employing ML models as annotation agents: annotate a small amount of data, train a simple ML model, and use that model for labeling. While this approach saves time, you still need to train a new ML model whenever your task changes (for example, topic classification → sentiment analysis) or your dataset changes (news articles → social media posts).

Enter LLMs, the jack of all trades for text-based tasks. You don’t need to train them, since they are already pre-trained. Out of the box, they can label your data (Wang et al. 2021, Ding et al. 2023) as long as you provide clear instructions – with some caveats, of course.

Sounds promising, right? In this two-part article, we’ll explore data annotation using LLMs, including potential challenges, tools to overcome them, and the potential of human-LLM collaborative annotation. In Part 1 (this article), we discuss how to leverage LLMs as data annotation agents and the practical challenges that may arise. In Part 2, we elaborate on how we address these challenges with our tool, MEGAnno+.

How to use LLMs for data annotation

Before starting data annotation for your project (say, sentiment analysis), you need to define your labeling task and labeling schema. A labeling task is usually decided based on the overall project objective. A labeling schema is a set of label options along with the rules for how to assign a label to each data sample. For example, your labeling task may be sentiment classification, and the corresponding labeling schema could be “positive,” “neutral,” or “negative,” where each label represents the sentiment of a given sample.
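
In code, a labeling schema can be as simple as the set of valid label names, optionally paired with a short description of each. Here is a minimal sketch (the names and wording are illustrative):

```python
# A minimal representation of a sentiment-classification labeling schema
LABEL_SCHEMA = {
    "positive": "The text expresses a favorable opinion or emotion.",
    "neutral": "The text expresses no clear sentiment.",
    "negative": "The text expresses an unfavorable opinion or emotion.",
}

VALID_LABELS = list(LABEL_SCHEMA)  # used later to validate annotation outputs
```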

Figure 1: Inputs and outputs of steps in human annotation and LLM annotation. LLM annotation requires additional pre-processing and post-processing steps.

Once your task definition and labeling schema are ready for human annotators, the next step is simple: human annotators go through the data samples and assign labels (Figure 1a: Human Annotation). In addition, they may be given an annotation guideline containing the overall task objective and specific labeling instructions, which gives them context and guidance for the labeling task.

In the case of using LLMs as annotation agents, the process is not so simple, as seen in Figure 1b: LLM Annotation. You need to decide which LLM to use, pre-process the data, call the selected LLM to annotate it, and post-process the LLM responses to finally obtain clean labels. We will go through these steps one by one.

Model selection and configuration

When using LLMs for annotation, you first need to select an LLM and specify its configuration. There are multiple LLMs to choose from, including commercial ones like ChatGPT and open-source ones such as Llama or Vicuna. Different models expose different parameters. A common hyperparameter is temperature, which controls whether the response should be more random and creative (high temperature), more predictable (low temperature), or deterministic (zero temperature).
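
For example, here is a minimal sketch of selecting a model and setting the temperature with the OpenAI Python client (v1+); the model name and prompt are placeholders, and other providers expose similar knobs:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use any chat model you have access to
    temperature=0,        # 0 = (near-)deterministic labels; raise it for more varied output
    messages=[{"role": "user", "content": "Label the sentiment of: 'The weather today is beautiful'"}],
)
print(response.choices[0].message.content)
```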

Pre-processing: preparing prompts

Next, you need to create a prompt. A prompt is a textual input to an LLM: given a prompt, the LLM generates a response text. So we need to instruct the LLM what to do within the prompt, just as you would teach human annotators how to label. The instructions can include an explanation of the task (e.g., sentiment analysis), the valid label options (e.g., “positive,” “neutral,” or “negative”), and the data sample you want to annotate. Optionally, you can also include some labeled examples (this is called few-shot learning) to help the LLM better carry out the labeling task. Additionally, you can include formatting instructions in the prompt if specifically formatted responses (e.g., in JSON) are needed.

Figure 2 shows an example prompt for the sentiment analysis task.

Figure 2: Prompting LLMs for sentiment analysis task with instruction, labeled demonstration examples, input, and LLM output.
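
As a rough sketch, a prompt like the one in Figure 2 can be assembled from a template. The wording and few-shot examples below are illustrative only:

```python
LABELS = ["positive", "neutral", "negative"]

# A few labeled demonstrations for few-shot prompting (illustrative examples)
FEW_SHOT = [
    ("I love this product!", "positive"),
    ("The package arrived on time.", "neutral"),
    ("The service was terrible.", "negative"),
]

def build_prompt(text: str) -> str:
    lines = [
        f"Label the sentiment of the following text as {', '.join(LABELS)}.",
        "Your answer should be in the following format: 'Label: <sentiment>'",
        "",
    ]
    for example_text, example_label in FEW_SHOT:
        lines += [f"Text: {example_text}", f"Label: {example_label}", ""]
    lines.append(f"Text: {text}")
    return "\n".join(lines)

print(build_prompt("The weather today is beautiful"))
```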

Calling an LLM API

Once your prompt is ready, you can send it to the selected LLM. For testing a few examples, an online interface such as the OpenAI Playground may be sufficient. For large-scale annotation, we recommend using LLM APIs.

Caveat: Note that each LLM has a context limit, which is the number of tokens (not the same as the number of words) it can process, including both the input and the output. So, if your generated prompt is long, you need to ensure it is within the token limit of the selected LLM while accounting for the expected length of its output. For commercial LLMs, make sure you have a valid authentication key and understand the cost structure, especially before annotating a huge dataset. Also, there may be limits on how many times or how frequently you can call the LLM APIs, depending on the API provider and your account tier.
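
As a rough sketch, you can estimate a prompt’s length before sending it. This assumes the tiktoken library and an illustrative context limit; check the actual limit of your chosen model:

```python
import tiktoken

CONTEXT_LIMIT = 8192           # illustrative; look up your model's real limit
EXPECTED_OUTPUT_TOKENS = 20    # rough budget for a short "Label: <sentiment>" answer

encoding = tiktoken.get_encoding("cl100k_base")  # tokenizer used by many recent OpenAI models

def fits_in_context(prompt: str) -> bool:
    n_tokens = len(encoding.encode(prompt))
    return n_tokens + EXPECTED_OUTPUT_TOKENS <= CONTEXT_LIMIT

prompt = (
    "Label the sentiment of the following text as positive, neutral or negative.\n"
    "Text: The weather today is beautiful"
)
print(fits_in_context(prompt))  # True here; long few-shot prompts may need trimming
```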

Post-processing

Once a response is received from the LLM, you will need to parse the response text (e.g., “Label: positive” as in Figure 2), extract the label (e.g., “positive”), and map it to your labeling schema (e.g., check whether “positive” ∈ [“positive”, “negative”, “neutral”]). This post-processing step can be challenging because the free-text outputs of LLMs are often noisy and do not always follow the given labeling instructions.
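
Here is a minimal post-processing sketch, assuming responses follow the 'Label: <sentiment>' format requested in the prompt:

```python
import re

LABEL_SCHEMA = {"positive", "neutral", "negative"}

def extract_label(response: str) -> str | None:
    """Parse a response like 'Label: positive' and validate it against the schema."""
    match = re.search(r"Label:\s*(\w+)", response, flags=re.IGNORECASE)
    if match is None:
        return None  # response did not follow the requested format
    label = match.group(1).strip().lower()
    return label if label in LABEL_SCHEMA else None

print(extract_label("Label: positive"))   # -> "positive"
print(extract_label("It sounds upbeat"))  # -> None (needs a retry or manual review)
```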

What could go wrong with LLM annotation?

LLMs can be unpredictable sometimes. Even when you give them clear instructions, they may go off script. There are several things that you need to keep in mind when using LLMs for annotation – here we talk about some of the issues you may encounter.

Let’s consider the following prompt. 

Label the sentiment of the following text as positive, neutral or negative. Your answer should be in the following format: 'Label: <sentiment>'

Text: The weather today is beautiful

Problem 1: Incorrect response formats and invalid labels

Using the prompt above, you make a call to an LLM and successfully retrieve an annotation response:

Label: positive

That seems good. You now decide to annotate 100 data points, so you write a script that uses the same prompt template as above but substitutes each sentence in your dataset into the Text field. You make one API call per prompt and, as a result, receive 100 responses. You then write another script to post-process the responses and map the extracted labels to your label schema (a sketch of such a batch loop is shown below).
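
A rough, self-contained sketch of that batch script, again using the OpenAI Python client with placeholder data and model name:

```python
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = (
    "Label the sentiment of the following text as positive, neutral or negative. "
    "Your answer should be in the following format: 'Label: <sentiment>'\n\n"
    "Text: {text}"
)

dataset = ["The weather today is beautiful", "My flight got delayed again"]  # placeholder data

raw_responses = []
for text in dataset:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(text=text)}],
    )
    raw_responses.append(completion.choices[0].message.content)

# raw_responses now holds one free-text answer per data point, ready for post-processing
```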

However, you notice that some of the LLM responses deviate from your labeling and formatting instructions (see Figure 3). For instance, a response may not adhere to the specified format (e.g., “Label is positive”), and even minor deviations in punctuation or capitalization (e.g., Label: “Positive” or Label: Positive.) can break the label extraction process. In some cases, the generated response may not be related to the annotation task at all. Even with recent LLMs that are trained to follow instructions, this can still happen.

Therefore, a robust post-processing mechanism that can handle these types of errors is vital to a smooth and efficient data annotation process. For example, regular expressions or word-similarity matching can cover minor mishaps like spelling errors or punctuation problems (see the sketch below). Some LLMs also offer structured output modes (e.g., JSON-formatted responses), which can alleviate the incorrect-format issue.
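
As one possible approach (a sketch, not a definitive implementation), you can normalize the response and fall back to fuzzy string matching against the schema:

```python
import difflib
import re

LABEL_SCHEMA = ["positive", "neutral", "negative"]

def robust_extract(response: str) -> str | None:
    """Extract a label even when formatting or capitalization deviates slightly."""
    # Strip an optional "Label ..." prefix and surrounding punctuation/quotes.
    text = re.sub(r"^\s*label\s*(is|:)?\s*", "", response.strip(), flags=re.IGNORECASE)
    candidate = text.strip(" .\"'“”").lower()
    if candidate in LABEL_SCHEMA:
        return candidate
    # Fall back to closest-match search to absorb small spelling errors.
    close = difflib.get_close_matches(candidate, LABEL_SCHEMA, n=1, cutoff=0.8)
    return close[0] if close else None

for resp in ["Label is positive", 'Label: "Positive"', "Label: Positive.", "positve"]:
    print(resp, "->", robust_extract(resp))
```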

Figure 3: Example LLM responses and post-processing results.

Problem 2: Uncommon tasks and schemas

Imagine that you want to modify your annotation task, e.g., altering the labeling instruction (to improve the prompt) or the label schema (if you want to add more label options). Let’s say, for example, you update the label schema to “super-positive,” “positive,” “neutral,” “negative,” and “super-negative.” You change the prompt accordingly and call your LLM again. Figure 4 shows an example summary of the annotation responses and post-processing results:

Figure 4. LLM annotation summary showing invalid labels and their frequencies.

Do you notice that the super-positive and super-negative labels appear far less frequently than the other labels? It is likely that the LLM does not understand the subtle differences between “positive” and “super-positive” and between “negative” and “super-negative.” One reason may be that the new five-label schema is far less common than the widely used [“positive”, “neutral”, “negative”] schema.

LLMs know what they were pre-trained on. They are trained on common NLP tasks such as document classification, topic modeling, named entity recognition, and so on. For tasks with common labeling schemas in particular, LLMs have already seen tons of samples and learned how to solve them. So, if your task can be framed as a common NLP problem, we recommend using a conventional label schema and presenting the task to the LLM as that problem – for example, classifying an email as “spam” or “not spam.” Another example: instead of asking whether a comment agrees with the original article, frame the task as Natural Language Inference (NLI), where the label schema is “entailment,” “neutral,” and “contradiction.” Another trick is to write a very detailed and specific labeling guideline in your prompt. For example, you can explain what kind of emotions should be treated as “super-positive” (e.g., ecstatic) as opposed to “positive” (e.g., satisfied). Adding an extra label option of “Other” or “N/A” can also help capture uncertain or confusing data points (see the sketch below).
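
For illustration, here is one way such a detailed prompt with per-label guidance and an “Other” option might look; the wording is a hypothetical sketch, not a validated guideline:

```python
DETAILED_PROMPT = """Label the sentiment of the following text as super-positive, positive, neutral, negative, super-negative, or other.
- super-positive: extremely strong positive emotion (e.g., ecstatic, thrilled)
- positive: mildly positive or satisfied
- neutral: no clear sentiment
- negative: mildly negative or disappointed
- super-negative: extremely strong negative emotion (e.g., furious, devastated)
- other: sarcasm, mixed feelings, or anything you are unsure about
Your answer should be in the following format: 'Label: <sentiment>'

Text: {text}"""
```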

Trial and Error

Please keep in mind that LLM annotation is iterative in nature. It’s unlikely that everything will be perfect on the first go. Our recommendation is to start simple and small. Use default model settings and basic prompts first. If you run into some errors, improve as you go.

Here’s an example workflow. After reviewing the initial annotations, you may want to re-run the process to validate the labels, either with the same model, with a modified LLM configuration, or with a different LLM altogether. Or you can revise the labeling schema after observing the LLM labels. Once LLM annotation is complete, you can export the labels and use them to train a downstream model. When evaluating the trained model, you may want to revisit the annotation step to collect more data.

Can LLMs and humans work together?

The underlying question is this: can we completely trust LLM labels as long as we take care in designing the label schema, prepare a perfect prompt, and build robust post-processing for LLM responses? The answer is no. While LLMs have shown remarkable capabilities, they cannot replace human annotators (Ziems et al. 2024). A viable alternative is to have humans and LLMs work together on the annotation task (Wang et al. 2021).

LLMs rely on their internal knowledge, which is learned from extensive training data. As a result, they may struggle with contexts or nuances that require socio-cultural understanding, or with domains that are not well covered in their training data. They may also fall short on subjective tasks or undervalue ethical considerations, cases where human intervention is more beneficial.

Figure 5. Human-LLM collaborative annotation.

So, rather than completely relying on LLMs, humans and LLMs should collaborate on data labeling (Figure 5). This way, we can leverage both LLMs’ efficiency and human expertise to ensure a reliable, robust data annotation process.

Conclusion

In this article, we explored detailed steps to use LLMs for data annotation. We also discussed some of the practical issues you may encounter and the reliability of LLM annotations. To address these challenges, we will introduce MEGAnno+, our human-LLM collaborative annotation tool, in Part 2 of the series. Our tool is designed for an iterative LLM annotation process as well as human collaboration so that users can obtain high-quality annotations more conveniently and efficiently without manually writing code snippets. Read Part 2 of this article to try our MEGAnno demo and stay tuned for the full release!

Written by: Hannah Kim, Kushan Mitra, and Megagon Labs.
