Publications

NAACL – Findings
2025
Catarina C. Belem, Pouya Pezeshkpour, Hayate Iso, Seiji Maekawa, Nikita Bhutani, Estevam Hruschka
Although many studies have investigated and reduced hallucinations in large language models (LLMs) for single-document tasks, hallucination in multi-document summarization (MDS) tasks remains largely unexplored. Specifically, it is unclear how the challenges arising from handling multiple documents (e.g., repetition and diversity of information) affect model outputs. In this work, we investigate how hallucinations manifest in LLMs when summarizing topic-specific information from multiple documents. Since no benchmarks exist for investigating hallucinations in MDS, we use existing news and conversation datasets, annotated with topic-specific insights, to create two novel multi-document benchmarks. When evaluating 5 LLMs on our benchmarks, we observe that, on average, up to 75% of the content in LLM-generated summaries is hallucinated, with hallucinations more likely to occur towards the end of the summaries. Moreover, when summarizing non-existent topic-related information, gpt-3.5-turbo and GPT-4o still generate summaries about 79.35% and 44% of the time, raising concerns about their tendency to fabricate content. To understand the characteristics of these hallucinations, we manually evaluate 700+ insights and find that most errors stem from either failing to follow instructions or producing overly generic insights. Motivated by these observations, we investigate the efficacy of simple post-hoc baselines in mitigating hallucinations but find them only moderately effective. Our results underscore the need for more effective approaches to systematically mitigate hallucinations in MDS. We release our dataset and code.
NAACL
2025
Pouya Pezeshkpour, Estevam Hruschka
Utilizing large language models (LLMs) to rank a set of items has become a common approach in recommendation and retrieval systems. Typically, these systems focus on ordering a substantial number of documents in a monotonic order based on a given query. However, real-world scenarios often present a different challenge: ranking a comparatively smaller set of items, but according to a variety of diverse and occasionally conflicting conditions. In this paper, we define and explore the task of multi-conditional ranking by introducing MCRank, a benchmark tailored for assessing multi-conditional ranking across various item types and conditions. Our analysis of LLMs using MCRank indicates a significant decrease in performance as the number and complexity of items and conditions grow. To overcome this limitation, we propose a novel decomposed reasoning method, consisting of EXtracting and Sorting the conditions, and then Iteratively Ranking the items (EXSIR). Our extensive experiments show that this decomposed reasoning method enhances LLMs' performance significantly, achieving up to a 14.4% improvement over existing LLMs. We also provide a detailed analysis of LLMs' performance across various condition categories and examine the effectiveness of the decomposition step. Furthermore, we compare our method with existing approaches such as Chain-of-Thought and existing ranking models, demonstrating the superiority of our approach and the complexity of the multi-conditional ranking task. We release our dataset and code.
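To make the decomposed reasoning concrete, here is a minimal Python sketch of an EXSIR-style flow, assuming a generic call_llm helper and illustrative prompts; it is a sketch of the extract-sort-then-iteratively-rank idea, not the authors' implementation.

```python
# Hypothetical sketch of an EXSIR-style decomposed ranking flow. `call_llm` is
# a placeholder for any chat-completion client; the prompts are illustrative,
# not the ones used in the paper.
from typing import Callable, List


def exsir_rank(query: str, items: List[str],
               call_llm: Callable[[str], str]) -> List[str]:
    # Step 1: EXtract the conditions stated in the query and Sort them by priority.
    extract_prompt = (
        "List the ranking conditions expressed in the query below, ordered "
        f"from highest to lowest priority, one per line.\nQuery: {query}"
    )
    conditions = [c.strip() for c in call_llm(extract_prompt).splitlines() if c.strip()]

    # Step 2: Iteratively Rank the items, applying one condition at a time so
    # the model never has to satisfy every condition in a single pass.
    ranking = list(items)
    for condition in conditions:
        rank_prompt = (
            f"Re-rank the items below according to this condition: {condition}\n"
            "Return the items, one per line, best first:\n" + "\n".join(ranking)
        )
        ranking = [x.strip() for x in call_llm(rank_prompt).splitlines() if x.strip()]
    return ranking
```

The point of the decomposition is that each LLM call handles a single, simpler sub-problem, which is what the abstract credits for the performance gain.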
ICLR
2025
Seiji Maekawa, Hayate Iso, Nikita Bhutani
The rapid increase in textual information means we need more efficient methods to sift through, organize, and understand it all. While retrieval-augmented generation (RAG) models excel at accessing information from large document collections, they struggle with complex tasks that require aggregation and reasoning over information spanning multiple documents, what we call holistic reasoning. Long-context language models (LCLMs) have great potential for managing large-scale documents, but their holistic reasoning capabilities remain unclear. In this work, we introduce HoloBench, a novel framework that brings database reasoning operations into text-based contexts, making it easier to systematically evaluate how LCLMs handle holistic reasoning across large documents. Our approach adjusts key factors such as context length, information density, distribution of information, and query complexity to evaluate LCLMs comprehensively. Our experiments show that the amount of information in the context has a bigger influence on LCLM performance than the actual context length. Furthermore, the complexity of queries affects performance more than the amount of information, particularly for different types of queries. Interestingly, queries that involve finding maximum or minimum values are easier for LCLMs and are less affected by context length, even though they pose challenges for RAG systems. However, tasks requiring the aggregation of multiple pieces of information show a noticeable drop in accuracy as context length increases. Additionally, we find that while grouping relevant information generally improves performance, the optimal positioning varies across models. Our findings surface both the advancements and the ongoing challenges in achieving a holistic understanding of long contexts. These can guide future developments in LCLMs and set the stage for creating more robust language models for real-world applications.
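As an illustration of the kind of controlled context construction the abstract describes (varying the amount and position of query-relevant information inside a long context), here is a hedged Python sketch; the function name, position options, and whitespace token proxy are assumptions, not HoloBench's code.

```python
# Illustrative context builder, not HoloBench itself: mixes query-relevant
# facts with filler text up to a rough token budget and controls where the
# relevant facts appear in the context.
import random
from typing import List


def build_context(relevant_facts: List[str], filler_sentences: List[str],
                  target_tokens: int, position: str = "uniform",
                  seed: int = 0) -> str:
    rng = random.Random(seed)
    # Reserve room for the relevant facts, then fill the rest with distractors.
    budget = max(target_tokens - sum(len(f.split()) for f in relevant_facts), 0)
    filler = []
    for sentence in filler_sentences:
        cost = len(sentence.split())
        if cost > budget:
            break
        filler.append(sentence)
        budget -= cost

    if position == "start":
        ordered = relevant_facts + filler
    elif position == "end":
        ordered = filler + relevant_facts
    elif position == "middle":
        half = len(filler) // 2
        ordered = filler[:half] + relevant_facts + filler[half:]
    else:  # "uniform": scatter the facts at random positions among the filler
        ordered = list(filler)
        for fact in relevant_facts:
            ordered.insert(rng.randrange(len(ordered) + 1), fact)
    return "\n".join(ordered)
```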
IEEE – ICDE
2025
Sajjadur Rahman, Dan Zhang, Nikita Bhutani, Estevam Hruschka, Eser Kandogan
Large language models (LLMs), despite their impressive capabilities in open-domain natural language understanding tasks, often lack effectiveness on similar tasks in enterprise applications due to potential hallucinations, weak multi-hop reasoning ability, and limitations in adapting to heterogeneous data types, among others. Such issues primarily arise from the absence of private, on-premises enterprise data in an LLM's training corpus. Knowledge-intensive tasks in enterprises often require multi-step reasoning, deep contextual understanding, and integration of information stored and accessed in heterogeneous formats (e.g., tables, graphs, documents, and JSON), which LLMs are not inherently equipped to handle without significant adaptation. To this end, retrieval-augmented generation (RAG) offers promise for instrumenting such adaptations on demand. While RAG-based approaches focus on controlling the generation and mitigating hallucinations, existing solutions are not sufficient for the requirements of enterprise settings. In this paper, we outline our approaches toward understanding and implementing a more effective RAG workflow in the wild. To achieve this goal, we draw on the cognitive science concepts of System 1 (fast, intuitive thinking) and System 2 (slow, deliberate, analytical thinking). In particular, we discuss how existing RAG approaches are more aligned with System 1 and propose to shift from traditional single-model architectures to compound AI systems within a System 2 framework to improve RAG, especially in complex enterprise applications. Such compound AI systems adopt a more systematic approach by assigning specialized tasks to different intelligent agents, optimizing retrieval and generation performance within a retrieval-augmented generation workflow.
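A hedged sketch of what a System 2-style compound RAG flow could look like, with a planner step, per-sub-question retrieval, and a synthesis step; the agent roles, prompts, and the call_llm and search helpers are illustrative assumptions rather than the paper's architecture.

```python
# Hypothetical sketch of a deliberate, multi-step ("System 2"-style) RAG flow:
# plan sub-questions, retrieve evidence for each, then synthesize an answer.
# `call_llm` and `search` are placeholders, not APIs from the paper.
from typing import Callable, List


def compound_rag(question: str,
                 call_llm: Callable[[str], str],
                 search: Callable[[str], List[str]]) -> str:
    # Step 1: plan sub-questions instead of answering in a single pass.
    plan = call_llm(f"Break this question into sub-questions, one per line:\n{question}")
    sub_questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # Step 2: retrieve evidence separately for each sub-question.
    evidence = []
    for sq in sub_questions:
        passages = search(sq)[:3]
        evidence.append(f"Sub-question: {sq}\n" + "\n".join(passages))

    # Step 3: synthesize a grounded answer from the collected evidence.
    return call_llm(
        "Answer the question using only the evidence below.\n"
        f"Question: {question}\n\n" + "\n\n".join(evidence)
    )
```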
EMNLP
2024
Ayana Niwa, Hayate Iso
In this study, we introduce AmbigNLG, a new task designed to tackle the challenge of task ambiguity in instructions for Natural Language Generation (NLG) tasks. Despite the impressive capabilities of Large Language Models (LLMs) in understanding and executing a wide range of tasks through natural language interaction, their performance is significantly hindered by the ambiguity present in real-world instructions. To address this, AmbigNLG seeks to identify and mitigate such ambiguities, aiming to refine instructions to match user expectations better. We introduce a dataset, AmbigSNI-NLG, consisting of 2,500 instances, and develop an ambiguity taxonomy for categorizing and annotating instruction ambiguities. Our approach demonstrates substantial improvements in text generation quality, highlighting the critical role of clear and specific instructions in enhancing LLM performance in NLG tasks.
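The following is a small, hypothetical sketch of the two-step idea the abstract describes: first diagnose which aspects of an NLG instruction are underspecified, then refine the instruction accordingly. The prompts and example categories are assumptions, not the AmbigNLG taxonomy or code.

```python
# Illustrative ambiguity identification and mitigation for NLG instructions.
# `call_llm` is a placeholder LLM client; prompts and categories are assumed.
from typing import Callable


def refine_instruction(instruction: str, preferences: str,
                       call_llm: Callable[[str], str]) -> str:
    # Step 1: identify ambiguous aspects of the instruction.
    diagnosis = call_llm(
        "List which aspects of this instruction are ambiguous "
        f"(for example: length, style, audience, format):\n{instruction}"
    )
    # Step 2: rewrite the instruction so those aspects are pinned down
    # according to the user's stated preferences.
    return call_llm(
        "Rewrite the instruction so it is unambiguous.\n"
        f"Instruction: {instruction}\n"
        f"Ambiguous aspects: {diagnosis}\n"
        f"User preferences: {preferences}"
    )
```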
INLG
2024
Hayate Iso, Xiaolan Wang, Yoshi Suhara
Opinion summarization research has primarily focused on generating summaries reflecting important opinions from customer reviews without paying much attention to the writing style. In this paper, we propose the stylized opinion summarization task, which aims to generate a summary of customer reviews in the desired (e.g., professional) writing style. To tackle the difficulty in collecting customer and professional review pairs, we develop a non-parallel training framework, Noisy Pairing and Partial Supervision (NAPA), which trains a stylized opinion summarization system from non-parallel customer and professional review sets. We create a benchmark ProSum by collecting customer and professional reviews from Yelp and Michelin. Experimental results on ProSum and FewSum demonstrate that our non-parallel training framework consistently improves both automatic and human evaluations, successfully building a stylized opinion summarization model that can generate professionally-written summaries from customer reviews.
INLG
2024
Hayate Iso
Lexically constrained text generation is a constrained text generation task that aims to generate text covering all of the given constraint lexicons. While existing approaches tackle this problem using a lexically constrained beam search algorithm or a dedicated model with non-autoregressive decoding, there is a trade-off between generated text quality and hard constraint satisfaction. We introduce AutoTemplate, a simple yet effective lexically constrained text generation framework divided into template generation and lexicalization tasks. Template generation produces text containing placeholders, and lexicalization replaces those placeholders with the constraint lexicons to perform lexically constrained text generation. We conduct experiments on two tasks: keywords-to-sentence generation and entity-guided summarization. Experimental results show that AutoTemplate outperforms competitive baselines on both tasks while satisfying the hard lexical constraints.
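A minimal sketch of the two-stage split described above, assuming a numbered-placeholder format and a generic call_llm helper; it illustrates the template-then-lexicalize idea rather than reproducing AutoTemplate itself.

```python
# Hypothetical template-then-lexicalize flow: generate text with numbered
# placeholders, then deterministically swap the placeholders for the
# constraint keywords so every constraint is guaranteed to appear.
import re
from typing import Callable, List


def template_then_lexicalize(keywords: List[str],
                             call_llm: Callable[[str], str]) -> str:
    # Stage 1: template generation with placeholders <0>, <1>, ... instead of
    # the actual keywords.
    slots = ", ".join(f"<{i}>" for i in range(len(keywords)))
    template = call_llm(
        f"Write one fluent sentence that uses the placeholders {slots} "
        "exactly once each, in any order."
    )

    # Stage 2: lexicalization, replacing each placeholder with its keyword.
    def fill(match: re.Match) -> str:
        index = int(match.group(1))
        return keywords[index] if index < len(keywords) else match.group(0)

    return re.sub(r"<(\d+)>", fill, template)
```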
DATAI Workshop – VLDB
2024
Chenjie Li, Dan Zhang, Jin Wang
Detecting semantic types of columns in data lake tables is an important application. A key bottleneck in semantic type detection is the availability of human annotation due to the inherent complexity of data lakes. In this paper, we propose using programmatic weak supervision to assist in annotating the training data for semantic type detection by leveraging labeling functions. One challenge in this process is the difficulty of manually writing labeling functions due to the large volume and low quality of the data lake table datasets. To address this issue, we explore employing Large Language Models (LLMs) for labeling function generation and introduce several prompt engineering strategies for this purpose. We conduct experiments on real-world web table datasets. Based on the initial results, we perform extensive analysis and provide empirical insights and future directions for researchers in this field.
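For illustration, here is a hypothetical example of what an LLM-generated labeling function for column semantic type detection might look like in a programmatic weak supervision setting, where each function votes a label or abstains; the semantic type, regex, threshold, and label constants are assumptions, not generated by or taken from the paper.

```python
# Example shape of a labeling function over a column of cell values: it votes
# PHONE_NUMBER when most non-empty cells look phone-like, else abstains so
# other labeling functions can weigh in. All specifics here are illustrative.
import re
from typing import List

ABSTAIN, PHONE_NUMBER = -1, 1


def lf_phone_number(column_values: List[str]) -> int:
    pattern = re.compile(r"^\+?\d[\d\s().-]{6,14}\d$")
    cells = [v.strip() for v in column_values if v and v.strip()]
    if not cells:
        return ABSTAIN
    hits = sum(bool(pattern.match(v)) for v in cells)
    return PHONE_NUMBER if hits / len(cells) >= 0.8 else ABSTAIN
```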
ACL – Findings
2024
Aditi Mishra, Sajjadur Rahman, Hannah Kim, Kushan Mitra, Estevam Hruschka
Large language models (LLMs) are proficient at generating fluent text with minimal task-specific supervision. Yet, their ability to provide well-grounded rationalizations for knowledge-intensive tasks remains under-explored. Such tasks, like commonsense multiple-choice questions, require rationales based on world knowledge to support predictions and refute alternate options. We consider the task of generating knowledge-guided rationalization in natural language by using expert-written examples in a few-shot manner. Surprisingly, crowd-workers preferred knowledge-grounded rationales over crowdsourced rationalizations, citing their factuality, sufficiency, and comprehensive refutations. Although LLM-generated rationales were preferable, further improvements in conciseness and novelty are required. In another study, we show how rationalization of incorrect model predictions erodes humans' trust in LLM-generated rationales. Motivated by these observations, we create a two-stage pipeline to review task predictions and eliminate potentially incorrect decisions before rationalization, enabling trustworthy rationale generation.
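Below is a hedged sketch of the review-then-rationalize idea described above, assuming a generic call_llm helper; the prompts and the reviewer criterion are illustrative, not the authors' pipeline.

```python
# Hypothetical two-stage flow: review the prediction first, and only generate
# a knowledge-grounded rationale when the reviewer accepts it, so the model
# does not rationalize a likely error.
from typing import Callable, List, Optional


def reviewed_rationale(question: str, options: List[str], prediction: str,
                       call_llm: Callable[[str], str]) -> Optional[str]:
    # Stage 1: review the prediction before committing to a rationale.
    verdict = call_llm(
        f"Question: {question}\nOptions: {options}\nPredicted answer: {prediction}\n"
        "Is the predicted answer correct? Reply YES or NO."
    )
    if not verdict.strip().upper().startswith("YES"):
        return None  # withhold the rationale instead of justifying the prediction

    # Stage 2: generate a rationale that supports the answer and refutes the
    # alternative options.
    return call_llm(
        f"Question: {question}\nAnswer: {prediction}\n"
        "Explain, using relevant world knowledge, why this answer is correct "
        f"and why each of the other options {options} is not."
    )
```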
CIKM – Demo
2024
Chen Shen, Jin Wang, Sajjadur Rahman, Eser Kandogan
The Text-to-SQL problem aims at developing natural language query interfaces for relational database systems by converting text input into executable SQL queries. Recently, using Large Language Models (LLMs) has emerged as a new paradigm for the Text-to-SQL problem. To this end, the LLM needs to understand not only the user input but also information from the database. In this demo, we present multi-agent SQL (MageSQL), an LLM-based Text-to-SQL approach that tackles the task by orchestrating multiple agents in a pipeline. We will showcase a user-friendly interface that demonstrates the inner workings of our approach and allows users to add and modify agents with different functionalities, customize prompts, and see their impact on specific examples. Through several use cases, we will demonstrate how to (i) construct a Text-to-SQL pipeline with multiple agents; (ii) generate prompts for the LLM with various templates and strategies; and (iii) monitor the results of natural language queries and perform debugging.
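To show the general shape of such a pipeline, here is an illustrative Python sketch that chains schema-linking, generation, and correction steps as prompt-wrapped LLM calls; the agent roles, prompts, and call_llm helper are assumptions, not MageSQL's actual components.

```python
# Illustrative multi-agent Text-to-SQL pipeline: each "agent" is a prompted
# LLM call, chained so later agents can inspect and repair earlier output.
from typing import Callable


def text_to_sql_pipeline(question: str, schema: str,
                         call_llm: Callable[[str], str]) -> str:
    # Agent 1: schema linking, selecting the tables/columns the question needs.
    linked = call_llm(
        f"Schema:\n{schema}\nQuestion: {question}\n"
        "List only the tables and columns needed to answer the question."
    )
    # Agent 2: SQL generation from the linked schema elements.
    draft = call_llm(
        f"Relevant schema:\n{linked}\nQuestion: {question}\n"
        "Write a single SQL query that answers the question."
    )
    # Agent 3: correction, reviewing the draft for syntax or logic issues.
    return call_llm(
        f"Schema:\n{schema}\nQuestion: {question}\nDraft SQL:\n{draft}\n"
        "Return a corrected SQL query (or the draft unchanged if it is correct)."
    )
```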