MEGAnno in Action: Human-LLM Collaborative Annotation

MEGAnno combines the power of large language models (LLMs) with human expertise to streamline and enhance the data labeling process. This article walks you through an end-to-end example that demonstrates MEGAnno’s capabilities and shows how to fit the tool into your annotation workflow. Through this article and the accompanying notebook, we provide detailed code snippets that show you how to:

  • Perform LLM annotations, 
  • Conduct confidence-aided human verification, 
  • Iteratively select models and refine prompts, 
  • Compare and aggregate results from different LLM agents. 

If you are new to MEGAnno, check out our previous blog post for a bird’s-eye view of the human-LLM collaborative experience, and refer to our documentation for basic usage.

Let's get started

First, we enable Colab’s custom widget manager and install meganno_client via pip.

from google.colab import output
output.enable_custom_widget_manager()
# Installation might take ~1 min
# You might get an "ERROR: pip's dependency resolver does not ..." message; please ignore it
!pip install "meganno_client[ui] @ git+https://github.com/megagonlabs/meganno-client.git" -q

from meganno_client import Service, Authentication, Controller, PromptTemplate

Setup

  • Option 1: Connect to the demo project using our shared Colab with prepopulated data and schema (limited to non-admin access).
  • Option 2: Follow the instructions here to deploy your own project. Connect with your preferred notebook environment (Jupyter or Colab) to enjoy full access, including the ability to use customized data and task schema.

Following the previous example notebooks, we continue the demonstration using the Twitter US Airline Sentiment dataset. The remainder of the notebook provides code snippets for option 1. Please see Appendix A1 for project setup instructions for option 2.

import os
token = '...'  # token shared through email, request at https://meganno.github.io/#request_form
# OpenAI key (we don't collect or store OpenAI keys)
os.environ['OPENAI_API_KEY'] = 'sk-...'
# optional
# os.environ['OPENAI_ORGANIZATION'] = 'your_organization_key'

auth = Authentication(project='blog_post', token=token)
demo = Service(project='blog_post', auth=auth)

controller = Controller(demo, auth)

Initial LLM Annotation

To perform LLM annotations, a user needs to specify an agent, defined by the configuration of the model and the prompt used. The current version of MEGAnno supports the chat and (deprecated) completion interfaces of OpenAI models, and provides a visual interface for easier prompt construction.

Configure model and prompt

We start with the basic built-in prompt template, which uses the label schema to construct the system instruction and later concatenates the content of each data entry in the subset. Users can preview the concatenation with the provided sample input and revise the template as needed. Templates also provide a get_template function for prompt inspection.

schema = demo.get_schemas().value(active=True)
label_name = 'sentiment'
prompt_template = PromptTemplate(label_schema=schema[0]["schemas"]["label_schema"], label_names=[label_name])
prompt_template.preview(records=["Sample Airline tweets1", "Sample Airline tweets2"])

prompt_template.get_template()

Then we need to configure the model (only the model field is mandatory):

model_config = {
    "model": "gpt-3.5-turbo",
    "temperature": 0,
}

Run an LLM annotation job on subsets

Like human annotations, LLM annotations are organized around subsets. LLM jobs are defined as the execution of agents over a subset. To better manage executions and provide opportunities for agent reuse (over the same or different subsets), agents first need to be registered with a MEGAnno Controller by calling create_agent. Upon creation, a unique identifier will be returned to the user for future reference. Please refer to Appendix A2 for additional APIs to list and explore existing agents.

# register agent
agent_gpt_3p5 = controller.create_agent(model_config, prompt_template, provider_api="openai:chat")

Here we start with a subset from a search by the keyword “good”, expecting to get mostly positive examples along with some “harder” examples that are neutral or sarcastically negative.

# selecting the subset to run the job on
subset = demo.search(keyword="good")
subset.show({"view": "table"})

First, we ask our basic agent with gpt-3.5 and a minimal prompt to label the subset by calling the run_job function. Currently, the framework only supports a single label annotation per job, so a label_name is required to pick the label to work on from the multi-label schema. To better assist later human verification, we also collect a metadata field, conf, as a measure of the model’s confidence in its prediction. The built-in function computes it from the logprobs in the OpenAI response. Future versions may support user-defined functions, potentially on other open-source or on-premise models. Upon successful execution, a unique identifier job_uuid will be returned as a reference to the run. Please refer to Appendix A2 for additional APIs to list and explore existing jobs.

# *Make sure OPENAI_API_KEY is set as an env var!*
job_uuid = controller.run_job(agent_gpt_3p5, subset, label_name, label_meta_names=["conf"])

Before the final job is returned and persisted in the database, statistics about the automated labeling steps are printed: for example, the number of valid prompts constructed, the number of successful OpenAI calls, and the number of responses with valid results, along with their distribution. This information provides an initial sense of the labeling quality.
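
If you would like to recompute a similar label distribution yourself once the job has finished, a small sketch like the one below works. It relies on the record structure (annotation_list, labels_record) that also appears in the evaluation code later in this post, and is only an illustration rather than part of MEGAnno’s built-in reporting.

from collections import Counter

# Recompute the label distribution for this job from the persisted annotations.
records = demo.search_by_job(job_id=job_uuid, limit=100).value()
label_counts = Counter()
for item in records:
    # pick the annotation produced by this job (same structure as in the evaluation code below)
    prediction = next(a for a in item["annotation_list"] if a["annotator"] == job_uuid)
    label = next(l for l in prediction["labels_record"] if l["label_name"] == label_name)["label_value"][0]
    label_counts[label] += 1
print(label_counts)  # e.g., Counter({'pos': ..., 'neu': ..., 'neg': ...})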

Verification

Human verifications

Next, we want to take a look at the LLM’s output, verify the predictions, and correct them where necessary. To perform a basic verification, we can start with the job_uuid. While verifying, users can choose to sort by the conf metadata associated with each prediction, so they can start with the predictions the agent is least certain about.

verf_subset = demo.search_by_job(job_id=job_uuid)
verf_subset.show({"mode": "verifying",
                  "label_meta_names": ["conf"]})

In this case, we correct two misclassifications (one sarcastic, one irrelevant) and confirm the rest of the predictions. After clicking the save button, verifications will be persisted in the backend, together with data and annotations. 

Users also have control over the data entries they’d like to verify through finer-grained searches. For example, the scripts below search for all the unverified predictions from a certain job, with a confidence score lower than 0.99:

args = {
    "job_id": job_uuid,
    "label_metadata_condition": {
        "label_name": "sentiment",
        "name": "conf",
        "operator": "<",
        "value": 0.99,
    },
    "verification_condition": {
        "label_name": label_name,
        "search_mode": "UNVERIFIED",  # "ALL"|"UNVERIFIED"|"VERIFIED"
    },
}
verf_subset2 = demo.search_by_job(**args)
verf_subset2.show({"mode": "verifying",
                   "label_meta_names": ["conf"]})

Verifications can be retrieved by various predicates. For example, the scripts below reveal all labels where the human disagreed with the agent and made a correction on the predicted label:

# further filter by type of verification (CONFIRMS|CORRECTS)
# CONFIRMS: where the verification confirms the original label
# CORRECTS: where the verification differs from the original label
verf_subset.get_verification_annotations(
    label_name="sentiment",
    label_level="record",
    annotator=job_uuid,
    verified_status="CORRECTS",  # CONFIRMS|CORRECTS
)
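
To get a quick sense of how often the human agreed with the agent, one could call the same API once per verification status and compare the sizes of the returned results. This is only a sketch; it assumes the call returns a list-like collection of verification records.

# count confirmations vs. corrections for this job (assumes the return value is list-like)
for status in ["CONFIRMS", "CORRECTS"]:
    results = verf_subset.get_verification_annotations(
        label_name="sentiment",
        label_level="record",
        annotator=job_uuid,
        verified_status=status,
    )
    print(status, len(results))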

Automated verification

As an alternative to human verification, we can always introduce a second opinion from another (potentially more powerful) model. Here we run the same subset using a gpt-4 model and collect the data points with conflicting predictions.

model_config2 = {'model': 'gpt-4',
                 'temperature': 1}

agent_gpt_4 = controller.create_agent(model_config2, prompt_template, provider_api='openai:chat')
job_uuid2 = controller.run_job(agent_gpt_4,
                               subset,
                               label_name,
                               label_meta_names=["conf"],
                               fuzzy_extraction=True)

s_conflict = demo.search(annotator_list=[job_uuid, job_uuid2],
                         label_condition={'name': label_name,
                                          'operator': 'conflicts'})
s_conflict.show({'mode': 'reconciling'})

Using the above scripts, we were also able to identify the data entries most in need of human verification. 
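
If you just want a headline number before diving into reconciliation, the conflict subset can be counted directly. This is a small sketch that assumes value() returns the list of matching records, as it does for the search results used in the evaluation section below.

# number of data entries where gpt-3.5 and gpt-4 disagree
conflicts = s_conflict.value()
print(f"{len(conflicts)} entries have conflicting predictions between the two agents.")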

Revision

With the results of the initial annotation and human verification, we identified the model’s weakness in handling sarcastic tweets and neutral sentiment toward the entity (the airline), so we revise the prompt to add explicit instructions and provide examples:

# building the new prompt
prompt_template2 = PromptTemplate(label_schema=schema[0]['schemas']['label_schema'], label_names=[label_name])
# preview templates with sample input, switchable with the "input" drop-down
prompt_template2.preview(records=['Sample Airline tweets1', 'Sample Airline tweets2'])
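
As an illustration of what such an instruction could look like (this is hypothetical wording, not the exact text used in the demo), something along these lines addresses the two failure modes we observed:

# Illustrative only: the kind of instruction one might paste into the "Task Inst:" field below.
# This is NOT the exact wording used in the demo.
revised_instruction = (
    "Label the sentiment expressed toward the airline. "
    "Watch out for sarcasm: tweets that praise the airline sarcastically should be labeled negative. "
    "If the tweet does not express sentiment about the airline itself, label it neutral."
)
print(revised_instruction)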

Append the revised prompt into the “Task Inst:” field of the prompt update widget and save the prompt:

# register an agent with the revised prompt
agent_gpt_3p5_revised = controller.create_agent(model_config, prompt_template2, provider_api='openai:chat')

As an initial estimate of the revised agent’s performance, we compare prediction accuracy over a small sample of 30 tweets (the last 30 of the 1,000 imported records, sorted by data id). This is not a thorough evaluation, but it should provide some initial insights and suggest directions for the next iteration. The next cells load the sample test data with labels, run the experiments with the gpt-3.5 agents using the initial and revised prompts, and then compare the accuracy of their predictions.

test_subset = demo.search(skip=970, limit=30)
test_subset.show()
test_job_3p5 = controller.run_job(agent_gpt_3p5,
                                  test_subset,
                                  label_name,
                                  label_meta_names=["conf"],
                                  fuzzy_extraction=True)
test_job_3p5_revised = controller.run_job(agent_gpt_3p5_revised,
                                          test_subset,
                                          label_name,
                                          label_meta_names=["conf"],
                                          fuzzy_extraction=True)

def evaluate(results, label_name, job_uuid):
    """Compare the job's predicted labels against the pseudo_label ground truth stored as record metadata."""
    total, match = 0, 0
    for item in results:
        ground_truth = list(filter(lambda x: x['name'] == 'pseudo_label', item['record_metadata']))[0]['value']
        prediction = list(filter(lambda x: x['annotator'] == job_uuid, item['annotation_list']))[0]
        predicted_label = list(filter(lambda x: x['label_name'] == label_name, prediction['labels_record']))[0]['label_value'][0]
        total += 1
        match += 1 if ground_truth == predicted_label else 0
    return f"{match} out of {total} correct."

print("Before: ", evaluate(demo.search_by_job(job_id=test_job_3p5, limit=100).value(), label_name, test_job_3p5))

print("After: ", evaluate(demo.search_by_job(job_id=test_job_3p5_revised, limit=100).value(), label_name, test_job_3p5_revised))
# Output
# Before:  22 out of 30 correct.
# After:   24 out of 30 correct.
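
Accuracy alone hides which labels the agents confuse. A small extension of the same loop, again only a sketch built on the record structure used above, tallies (ground truth, prediction) pairs:

from collections import Counter

def confusion_counts(results, label_name, job_uuid):
    """Tally (ground_truth, predicted) label pairs for one job's results."""
    counts = Counter()
    for item in results:
        ground_truth = next(m['value'] for m in item['record_metadata'] if m['name'] == 'pseudo_label')
        prediction = next(a for a in item['annotation_list'] if a['annotator'] == job_uuid)
        predicted = next(l['label_value'][0] for l in prediction['labels_record'] if l['label_name'] == label_name)
        counts[(ground_truth, predicted)] += 1
    return counts

print(confusion_counts(demo.search_by_job(job_id=test_job_3p5_revised, limit=100).value(), label_name, test_job_3p5_revised))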

Appendix

Our demo exploration ends here, but below are some extra pointers for further exploration.

A1. Setup for self-deployed project (option 2)

If you choose to explore with your own deployed project, you’ll start with an empty project where the data and task schema are not pre-populated. The code blocks below are what we used to set up this Colab notebook: they specify a sentiment analysis task with three candidate labels and import data from the example tweet dataset.

demo.get_schemas().set_schemas({
    'label_schema': [
        {
            "name": "sentiment",
            "level": "record",
            "options": [
                {"value": "pos", "text": "positive"},
                {"value": "neg", "text": "negative"},
                {"value": "neu", "text": "neutral"},
            ]
        }
    ]
})
demo.get_schemas().value(active=True)

# import data from a dataframe
import pandas as pd

df = pd.read_csv("tweets.csv").loc[:1000]
demo.import_data_df(df, column_mapping={
    'id': 'id',
    'content': 'content',
    "metadata": 'pseudo_label'  # optional metadata
})
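
If you don’t have the tweets.csv file at hand, a tiny stand-in dataframe with the columns expected by the mapping above (id, content, plus an optional pseudo_label metadata column) is enough to try the import. The rows below are made up purely for illustration.

import pandas as pd

# Made-up example rows; the real data comes from the Twitter US Airline Sentiment dataset.
df = pd.DataFrame([
    {"id": "1", "content": "The flight was great, thanks @airline!", "pseudo_label": "pos"},
    {"id": "2", "content": "Two hours on the tarmac. Good job, @airline.", "pseudo_label": "neg"},
    {"id": "3", "content": "Landing at 3pm, anyone around for coffee?", "pseudo_label": "neu"},
])
demo.import_data_df(df, column_mapping={
    'id': 'id',
    'content': 'content',
    "metadata": 'pseudo_label'  # optional metadata
})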

A2. List agents & jobs

At any time in the pipeline, we can ask the controller to list existing agents and jobs with their detailed configurations, and filter them over predicates (currently, filtering by provider is supported).

# list agents
controller.list_my_agents()
# job_list = controller.list_jobs('agent_uuid', [agent_uuid])

# filter over agent properties and get jobs
ret = controller.list_agents(provider_filter="openai", show_job_list=True)
job_list = [val for sublist in ret for val in sublist["job_list"]]
job_list

Wrap up

We hope this walkthrough gave you a good understanding of MEGAnno’s potential. For support with using the platform and more detailed information, you can refer to our Documentation Page, where you will find more on our advanced features, the API client docs, and further details on our LLM integration. We are also available through our contact us page. To read more about how MEGAnno came about, see the research paper and the related blog articles.

Written by Dan Zhang and Megagon Labs

Follow us on Twitter and LinkedIn to stay up to date with our research and open source projects. 
