MEGAnno in Action: Human-LLM Collaborative Annotation

MEGAnno is a data annotation framework that combines the power of large language models (LLMs) with human expertise to streamline and enhance the data labeling process. Throughout this article, we’ll showcase MEGAnno’s capabilities through detailed code snippets that show you how to:

  • Perform LLM annotations, 
  • Conduct confidence-aided human verification, 
  • Iteratively select models and refine prompts, 
  • Compare and aggregate results from different LLM agents. 
 

If you are new to MEGAnno, check out our previous blog post for a bird’s eye view of the human-LLM collaborative experiences, and refer to our documentation for basic usage.

Let's Get Started

You can follow along here or connect to the demo project using our shared Colab with pre-populated data and schema (limited to non-admin access). Following the previous example notebooks, this demonstration uses the Twitter US Airline Sentiment dataset.

Setup

First, we pip install MEGAnno and import the meganno_client modules.

				
!pip install "meganno_client[ui] @ git+https://github.com/megagonlabs/meganno-client.git" -q
from meganno_client import Service, Authentication, Controller, PromptTemplate

Then, we need to set up the MEGAnno access token (request a token here) and the OpenAI API key, and connect to our hosted MEGAnno service.

				
import os

token = ''  # token shared through email, request at https://meganno.github.io/#request_form
# OpenAI key; we don't collect or store OpenAI keys
os.environ['OPENAI_API_KEY'] = 'sk-...'
# optional
# os.environ['OPENAI_ORGANIZATION'] = 'your_organization_key'

auth = Authentication(project='blog_post', token=token)
demo = Service(project='blog_post', auth=auth)

controller = Controller(demo, auth)

Initial LLM Annotation

To perform LLM annotations, you need to specify an agent. The agent is defined by the configuration of the LLM and prompt used. MEGAnno supports OpenAI models and provides a visual interface to make prompt construction easier.

Configure the Model and Prompt

We start with the basic built-in prompt template, which uses the label schema to construct the system instruction and later concatenates the content of each data entry in the subset. Users can preview the concatenation with the provided sample input and revise the template as needed. Templates also provide a get_template function for prompt inspection.

				
schema = demo.get_schemas().value(active=True)
label_name = 'sentiment'
prompt_template = PromptTemplate(label_schema=schema[0]["schemas"]["label_schema"], label_names=[label_name])
prompt_template.preview(records=["Sample Airline tweets1", "Sample Airline tweets2"])
				
prompt_template.get_template()

Then, we need to configure the model. (Only the model field is mandatory.)

				
model_config = {
    "model": "gpt-3.5-turbo",
    "temperature": 0,
}

Run an LLM Annotation Job on Subsets

To run an LLM annotation job on subsets, you first register an agent. Like human annotations, LLM annotations are organized around subsets, and an LLM job is the execution of an agent over a subset. Registering the agent returns a unique identifier, which lets us better manage executions and enables agent reuse (over the same or different subsets).

You can refer to Appendix A2 for additional APIs to list and explore existing agents.

				
# register agent
agent_gpt_3p5 = controller.create_agent(model_config, prompt_template, provider_api="openai:chat")

Here, we start with a subset from a search for the keyword “good”, expecting to get mostly positive examples but also some neutral or sarcastically negative examples.

				
# selecting subset to run the job with
subset = demo.search(keyword="good")
subset.show({"view": "table"})

Then, we ask our basic agent (GPT-3.5 with the minimal prompt) to label the subset by calling the run_job function. To set human verification up for success later on, we also collect a metadata field, conf, as a measure of the model’s confidence in each prediction; the built-in function computes it from the logprobs in the OpenAI response. Successful execution returns a unique identifier, job_uuid, as a reference to the run. Appendix A2 provides additional APIs to list and explore existing jobs.

				
job_uuid = controller.run_job(agent_gpt_3p5, subset, label_name, label_meta_names=["conf"])

Before the final job is persisted in the database, statistics about the automated labeling steps are printed: the number of valid prompts constructed, the number of successful OpenAI calls, and the number of responses with valid results, along with the distribution of predicted labels. This information provides an initial sense of the labeling quality.
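As an aside on the conf score collected above, here is a rough, illustrative sketch of how a confidence value can be derived from per-token log probabilities. This is not MEGAnno’s built-in implementation; it only assumes you have the logprobs of the generated label tokens from the OpenAI response.

import math

def confidence_from_logprobs(token_logprobs):
    # Illustration only: multiply the probabilities of the generated label's
    # tokens, i.e., exponentiate the sum of their log probabilities.
    return math.exp(sum(token_logprobs))

confidence_from_logprobs([-0.02, -0.01])  # ~0.97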

Verification

Human Verification

Next, we want to take a look at the LLM’s output in order to verify the predictions and correct them if necessary. To perform a basic verification, we can start with the job_uuid. While verifying, you can sort by the conf metadata associated with each prediction to start with the agent’s most uncertain predictions.

				
verf_subset = demo.search_by_job(job_id=job_uuid)
verf_subset.show({"mode": "verifying",
                  "label_meta_names": ["conf"]})

In this case, we correct two misclassifications (one sarcastic, one irrelevant) and confirm the rest of the predictions. After clicking the save button, the verifications are persisted in the backend, together with the data and annotations.

MEGAnno gives you control over the data entries you’d like to verify through finer-grained searches. For example, the scripts below search for all the unverified predictions from a certain job with a confidence score lower than 0.99:

				
args = {
    "job_id": job_uuid,
    "label_metadata_condition": {
        "label_name": "sentiment",
        "name": "conf",
        "operator": "<",
        "value": 0.99,
    },
    "verification_condition": {
        "label_name": label_name,
        "search_mode": "UNVERIFIED",  # "ALL"|"UNVERIFIED"|"VERIFIED"
    },
}
verf_subset2 = demo.search_by_job(**args)
verf_subset2.show({"mode": "verifying",
                   "label_meta_names": ["conf"]})

Verifications can be retrieved by various predicates. For example, the scripts below reveal all labels where the human verifier disagreed with the agent and made a correction on the predicted label:

				
# further filter by type of verification (CONFIRMS|CORRECTS)
# CONFIRMS: where the verification confirms the original label
# CORRECTS: where the verification is different from the original label
verf_subset.get_verification_annotations(
    label_name="sentiment",
    label_level="record",
    annotator=job_uuid,
    verified_status="CORRECTS",  # CONFIRMS|CORRECTS
)

Automated Verification

As an alternative to human verification, we can always introduce a second opinion from another (potentially more powerful) model. Here we run the same subset using a gpt-4 model and collect the data points with conflicting predictions.

				
model_config2 = {'model': 'gpt-4',
                 'temperature': 1}

agent_gpt_4 = controller.create_agent(model_config2, prompt_template, provider_api='openai:chat')
job_uuid2 = controller.run_job(agent_gpt_4,
                               subset,
                               label_name,
                               label_meta_names=["conf"],
                               fuzzy_extraction=True)
				
s_conflict = demo.search(annotator_list=[job_uuid, job_uuid2],
                         label_condition={'name': label_name,
                                          'operator': 'conflicts'})
s_conflict.show({'mode': 'reconciling'})

Using the above scripts, we can also identify the data entries that most need human verification.
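Beyond comparing, one way to aggregate the two agents’ results is to keep the labels both agents agree on and leave only the conflicts for human reconciliation. The sketch below is illustrative, not a built-in MEGAnno feature: it assumes the record structure returned by .value() (the same structure used in the evaluation snippet later in this post) and that each record carries a uuid identifier; adjust that key to whatever your records actually expose.

def get_predictions(results, label_name, job):
    # Map each record's identifier to the label predicted by a given job.
    # The "uuid" key is an assumption; adjust to your records' identifier field.
    preds = {}
    for item in results:
        for annotation in item["annotation_list"]:
            if annotation["annotator"] != job:
                continue
            for label in annotation["labels_record"]:
                if label["label_name"] == label_name:
                    preds[item["uuid"]] = label["label_value"][0]
    return preds

preds_gpt_3p5 = get_predictions(demo.search_by_job(job_id=job_uuid).value(), label_name, job_uuid)
preds_gpt_4 = get_predictions(demo.search_by_job(job_id=job_uuid2).value(), label_name, job_uuid2)

# keep the labels both agents agree on; the rest go to human reconciliation
agreed = {uid: label for uid, label in preds_gpt_3p5.items() if preds_gpt_4.get(uid) == label}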

Revision

Using the results of the initial annotation and human verification, we found that the model struggles with sarcastic tweets and with neutral sentiment toward the entity (the airline), so we revise the prompt to add explicit instructions and provide examples:

				
# building new prompt
prompt_template2 = PromptTemplate(label_schema=schema[0]['schemas']['label_schema'], label_names=[label_name])
# preview templates with sample input, switchable with the "input" dropdown
prompt_template2.preview(records=['Sample Airline tweets1', 'Sample Airline tweets2'])

To do this, append the revised instructions to the “Task Inst:” field of the prompt update widget and save the prompt. Then, register a new agent with the revised template:

				
# register an agent with the revised prompt template
agent_gpt_3p5_revised = controller.create_agent(model_config, prompt_template2, provider_api='openai:chat')
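As a quick sanity check on the revision, you can inspect the stored template again with the get_template function introduced earlier and confirm that the new instructions and examples are included.

prompt_template2.get_template()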

As an initial estimate of the revised prompt’s performance, we compare prediction accuracy over a small test sample: the last 30 of the 1,000 imported tweets, sorted by data id. This is not a thorough evaluation, but it should provide some initial insights and suggest directions for the next iteration. The next cells load the sample test data with labels, run the experiments with the GPT-3.5 agents using the initial and revised prompts, and then compare the accuracy of their predictions.

				
test_subset = demo.search(skip=970, limit=30)
test_subset.show()
test_job_3p5 = controller.run_job(agent_gpt_3p5,
                                  test_subset,
                                  label_name,
                                  label_meta_names=["conf"],
                                  fuzzy_extraction=True)
test_job_3p5_revised = controller.run_job(agent_gpt_3p5_revised,
                                          test_subset,
                                          label_name,
                                          label_meta_names=["conf"],
                                          fuzzy_extraction=True)
				
def evaluate(results, label_name, job_uuid):
    total, match = 0, 0
    for item in results:
        ground_truth = list(filter(lambda x: x['name'] == 'pseudo_label', item['record_metadata']))[0]['value']
        prediction = list(filter(lambda x: x['annotator'] == job_uuid, item['annotation_list']))[0]
        predicted_label = list(filter(lambda x: x['label_name'] == label_name, prediction['labels_record']))[0]['label_value'][0]
        total += 1
        match += 1 if ground_truth == predicted_label else 0
    return f"{match} out of {total} correct."

print("Before: ", evaluate(demo.search_by_job(job_id=test_job_3p5, limit=100).value(), label_name, test_job_3p5))

print("After: ", evaluate(demo.search_by_job(job_id=test_job_3p5_revised, limit=100).value(), label_name, test_job_3p5_revised))
# Output
# Before:  22 out of 30 correct.
# After:  24 out of 30 correct.

Wrap Up

We hope this walkthrough gave you a good understanding of MEGAnno’s human-LLM collaborative annotation workflow. For support with using the platform and more detailed information, refer to our documentation page. There, you can find more on our advanced features, the API client docs, and more details on our LLM integration. We are available for support through our Contact Us page. To read more about how MEGAnno came about, refer to the research paper and the related blog articles. Follow us on Twitter and LinkedIn to stay up to date with our research and open-source projects.

Appendix

Beyond our demo, here are some extra pointers for further exploration. 

A1. Setup for Locally-Deployed Projects

As an alternative to using pre-defined projects hosted on our infrastructure, you can follow the instructions here to deploy your own project. Connect with your preferred notebook environment (Jupyter or Colab) to enjoy full access, including the ability to use customized data and task schemas. The code blocks below are what we used to set up this Colab notebook: they specify a sentiment analysis task with three candidate labels and import data from the example tweet dataset.

				
demo.get_schemas().set_schemas({
    'label_schema': [
        {
            "name": "sentiment",
            "level": "record",
            "options": [
                {"value": "pos", "text": "positive"},
                {"value": "neg", "text": "negative"},
                {"value": "neu", "text": "neutral"},
            ]
        }
    ]
})
demo.get_schemas().value(active=True)
				
# import data from a dataframe
import pandas as pd

df = pd.read_csv("tweets.csv").loc[:1000]
demo.import_data_df(df, column_mapping={
    'id': 'id',
    'content': 'content',
    'metadata': 'pseudo_label',  # optional metadata
})
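If you are starting from the raw Kaggle CSV instead of a pre-processed tweets.csv, a small preprocessing step like the sketch below can produce the id, content, and pseudo_label columns used above. The column names tweet_id, text, and airline_sentiment are assumptions about the raw file; adjust them to match your copy of the dataset.

import pandas as pd

# hypothetical preprocessing of the raw Kaggle file; column names are assumptions
raw = pd.read_csv("Tweets.csv")
df = pd.DataFrame({
    "id": raw["tweet_id"].astype(str),
    "content": raw["text"],
    # map the full sentiment strings to the short label values in the schema
    "pseudo_label": raw["airline_sentiment"].map(
        {"positive": "pos", "negative": "neg", "neutral": "neu"}
    ),
}).loc[:1000]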

A2. List Agents & Jobs

At any time in the pipeline, we can ask the controller to list existing agents and jobs with their detailed configurations, and filter them by predicates (currently, filtering by provider is supported).

				
# list agents
controller.list_my_agents()
# job_list = controller.list_jobs('agent_uuid', [agent_uuid])
				
# filter over agent properties and get jobs
ret = controller.list_agents(provider_filter="openai", show_job_list=True)
job_list = [val for sublist in ret for val in sublist["job_list"]]
job_list

Written by Dan Zhang and Megagon Labs
