Publications

CHI - HEAL Workshop
2025
Yoo Yeon Sung, Hannah Kim, Dan Zhang
AI practitioners increasingly use large language model (LLM) agents in compound AI systems to solve complex reasoning tasks. However, these agent executions often fail to meet human standards, leading to errors that compromise the system's overall performance. Addressing these failures through human intervention is challenging due to the agents' opaque reasoning processes, misalignment with human expectations, the complexity of agent dependencies, and the high cost of manual inspection. This paper therefore introduces a human-centered evaluation framework for Verifying LLM Agent failures (VeriLA), which systematically assesses agent failures to reduce human effort and make these failures interpretable to humans. The framework first defines clear expectations for each agent by curating human-designed agent criteria. It then develops a human-aligned agent verifier module, trained with human gold standards, to assess each agent's execution output. This approach enables granular evaluation of each agent's performance by revealing failures against a human standard, offering clear guidelines for revision, and reducing human cognitive load. Our case study results show that VeriLA is both interpretable and efficient in helping practitioners interact more effectively with the system. By upholding accountability in human-agent collaboration, VeriLA paves the way for more trustworthy and human-aligned compound AI systems.
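The paper's verifier is a module trained on human gold standards, which is not reproduced here. As a minimal sketch under that assumption, with hypothetical criteria names and a simple threshold rule standing in for the learned verifier, a per-agent verification step might look like this:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class AgentCriterion:
    """A human-designed expectation for one agent (hypothetical structure)."""
    name: str
    check: Callable[[str], float]  # returns a score in [0, 1]

@dataclass
class VerificationResult:
    agent_id: str
    scores: Dict[str, float]
    passed: bool

def verify_agent_output(agent_id: str, output: str,
                        criteria: List[AgentCriterion],
                        threshold: float = 0.5) -> VerificationResult:
    """Score one agent's execution output against its human-designed criteria."""
    scores = {c.name: c.check(output) for c in criteria}
    # Flag the agent as failed if any criterion falls below the threshold,
    # so a human only inspects the agents that need revision.
    passed = all(s >= threshold for s in scores.values())
    return VerificationResult(agent_id, scores, passed)

if __name__ == "__main__":
    # Toy criteria standing in for a trained, human-aligned verifier.
    criteria = [
        AgentCriterion("non_empty", lambda out: 1.0 if out.strip() else 0.0),
        AgentCriterion("cites_input", lambda out: 1.0 if "source:" in out else 0.0),
    ]
    result = verify_agent_output("retriever", "source: doc_12 ...", criteria)
    print(result.passed, result.scores)
```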
ICLR
2025
The rapid increase in textual information means we need more efficient methods to sift through, organize, and understand it all. While retrieval-augmented generation (RAG) models excel at accessing information from large document collections, they struggle with complex tasks that require aggregation and reasoning over information spanning multiple documents, which we call holistic reasoning. Long-context language models (LCLMs) have great potential for managing large-scale documents, but their holistic reasoning capabilities remain unclear. In this work, we introduce HoloBench, a novel framework that brings database reasoning operations into text-based contexts, making it easier to systematically evaluate how LCLMs handle holistic reasoning across large documents. Our approach adjusts key factors such as context length, information density, distribution of information, and query complexity to evaluate LCLMs comprehensively. Our experiments show that the amount of information in the context has a bigger influence on LCLM performance than the actual context length. Furthermore, the complexity of queries affects performance more than the amount of information, particularly for different types of queries. Interestingly, queries that involve finding maximum or minimum values are easier for LCLMs and are less affected by context length, even though they pose challenges for RAG systems. However, tasks requiring the aggregation of multiple pieces of information show a noticeable drop in accuracy as context length increases. Additionally, we find that while grouping relevant information generally improves performance, the optimal positioning varies across models. Our findings surface both the advancements and the ongoing challenges in achieving a holistic understanding of long contexts. These findings can guide future developments in LCLMs and set the stage for creating more robust language models for real-world applications.
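The exact construction of HoloBench is described in the paper; the following is only an illustrative sketch, with made-up templates and helper names, of how a database-style query (here a MAX aggregation) can be turned into a text-based holistic-reasoning instance whose information density and distribution are controllable:

```python
import random

def verbalize(row: dict) -> str:
    """Turn one table row into a natural-language fact (toy template)."""
    return f"{row['name']} was released in {row['year']} and sold {row['units']} units."

def build_context(rows, distractors, target_facts: int, seed: int = 0) -> str:
    """Mix task-relevant facts with distractor sentences up to a target size."""
    rng = random.Random(seed)
    facts = [verbalize(r) for r in rows]
    filler = rng.sample(distractors, max(0, target_facts - len(facts)))
    mixed = facts + filler
    rng.shuffle(mixed)  # how information is distributed is one controlled factor
    return "\n".join(mixed)

rows = [
    {"name": "Alpha", "year": 2019, "units": 120},
    {"name": "Beta", "year": 2021, "units": 340},
    {"name": "Gamma", "year": 2020, "units": 90},
]
distractors = [f"Filler sentence number {i}." for i in range(100)]

context = build_context(rows, distractors, target_facts=20)
# Gold answer for a MAX-style query ("Which product sold the most units?"),
# computed directly from the structured source for evaluation.
gold = max(rows, key=lambda r: r["units"])["name"]
print(gold)
```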
NAACL
2025
Utilizing large language models (LLMs) to rank a set of items has become a common approach in recommendation and retrieval systems. Typically, these systems focus on ordering a substantial number of documents in a monotonic order based on a given query. However, real-world scenarios often present a different challenge: ranking a comparatively smaller set of items, but according to a variety of diverse and occasionally conflicting conditions. In this paper, we define and explore the task of multi-conditional ranking by introducing MCRank, a benchmark tailored for assessing multi-conditional ranking across various item types and conditions. Our analysis of LLMs using MCRank indicates a significant decrease in performance as the number and complexity of items and conditions grow. To overcome this limitation, we propose a novel decomposed reasoning method, consisting of EXtracting and Sorting the conditions, and then Iteratively Ranking the items (EXSIR). Our extensive experiments show that this decomposed reasoning method enhances LLMs' performance significantly, achieving up to a 14.4% improvement over existing LLMs. We also provide a detailed analysis of LLMs' performance across various condition categories and examine the effectiveness of the decomposition step. Furthermore, we compare our method with existing approaches such as Chain-of-Thought and existing ranking models, demonstrating the superiority of our approach and the complexity of the multi-conditional ranking task. We release our dataset and code.
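As a hedged sketch of the decomposed reasoning idea, not the released implementation: the prompts and the `llm` callable below are assumptions, but the control flow mirrors the described steps of extracting and sorting the conditions, then iteratively ranking the items:

```python
from typing import Callable, List

# A stand-in for any LLM completion call; the real prompts and model used in
# the paper are not reproduced here.
LLM = Callable[[str], str]

def extract_and_sort_conditions(query: str, llm: LLM) -> List[str]:
    """Steps 1-2: pull individual conditions out of the query and order them."""
    raw = llm("List the ranking conditions in this query, one per line, "
              f"most important first:\n{query}")
    return [line.strip() for line in raw.splitlines() if line.strip()]

def iterative_rank(items: List[str], conditions: List[str], llm: LLM) -> List[str]:
    """Step 3: re-rank the items one condition at a time."""
    ranking = list(items)
    for cond in conditions:
        raw = llm(f"Rank these items by the condition '{cond}', best first, "
                  "one per line:\n" + "\n".join(ranking))
        reranked = [line.strip() for line in raw.splitlines() if line.strip()]
        # Keep only known items so a noisy completion cannot corrupt the list.
        ranking = [it for it in reranked if it in ranking] + \
                  [it for it in ranking if it not in reranked]
    return ranking

def exsir(query: str, items: List[str], llm: LLM) -> List[str]:
    conditions = extract_and_sort_conditions(query, llm)
    return iterative_rank(items, conditions, llm)
```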
NAACL - Findings
2025
Although many studies have investigated and reduced hallucinations in large language models (LLMs) for single-document tasks, hallucination in multi-document summarization (MDS) remains largely unexplored. Specifically, it is unclear how the challenges arising from handling multiple documents (e.g., repetition and diversity of information) affect models' outputs. In this work, we investigate how hallucinations manifest in LLMs when summarizing topic-specific information from multiple documents. Since no benchmarks exist for investigating hallucinations in MDS, we use existing news and conversation datasets, annotated with topic-specific insights, to create two novel multi-document benchmarks. When evaluating 5 LLMs on our benchmarks, we observe that, on average, up to 75% of the content in LLM-generated summaries is hallucinated, with hallucinations more likely to occur towards the end of the summaries. Moreover, when summarizing non-existent topic-related information, gpt-3.5-turbo and GPT-4o still generate summaries about 79.35% and 44% of the time, raising concerns about their tendency to fabricate content. To understand the characteristics of these hallucinations, we manually evaluate 700+ insights and find that most errors stem from either failing to follow instructions or producing overly generic insights. Motivated by these observations, we investigate the efficacy of simple post-hoc baselines in mitigating hallucinations but find them only moderately effective. Our results underscore the need for more effective approaches to systematically mitigate hallucinations in MDS. We release our dataset and code.
NAACL - Industry
2025
Advances in Natural Language Processing (NLP) have the potential to transform HR processes, from recruitment to employee management. While recent breakthroughs in NLP have generated significant interest in its industrial applications, a comprehensive overview of how NLP can be applied across HR activities is still lacking. This paper identifies opportunities for researchers and practitioners to harness NLP's transformative potential in this domain. We analyze key fundamental tasks such as information extraction and text classification, and their roles in downstream applications like recommendation and language generation, while also discussing ethical concerns. Additionally, we identify gaps in current research and encourage future work to explore holistic approaches for achieving broader objectives in this field.
Annual Meeting of the Association for Natural Language Processing (NLP)
2025
Dialogue management agents built on tool-augmented language models have become popular, but existing benchmark data diverges from real business tasks and is unsuitable for commercial chatbots. The data creation method proposed in this work automatically generates a large amount of high-quality data tailored to requirements from only the minimal information provided by the data creator, realizing made-to-order dialogue management. The generated data can be used as training data to train and improve dialogue management agents.
Annual Meeting of the Association for Natural Language Processing (NLP)
2025
Skill mapping is the task of identifying skills, defined in an ontology, that are mentioned in sentences from job descriptions and resumes, and it is essential for labor market analysis. Because manually building training data annotated with the fine-grained skills required for detailed analysis is costly, prior work uses synthetic data generated by a Large Language Model (LLM) to train a bi-encoder. To further improve accuracy with synthetic data, the proposed method (kNNBE) retrieves, at inference time, the labeled synthetic sentences used for training via k-nearest neighbors (kNN) and adds their similarity to the input sentence to the bi-encoder score. Experiments confirm that kNNBE improves the bi-encoder's accuracy and, compared with the existing state-of-the-art method that reranks skills with an LLM, achieves higher accuracy while maintaining higher throughput.
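The following is a minimal sketch of the scoring idea only, assuming precomputed embeddings and an illustrative blending weight; the actual kNNBE training setup and retrieval details are in the paper:

```python
import numpy as np

def knnbe_scores(input_emb: np.ndarray,
                 skill_embs: np.ndarray,    # (num_skills, dim) skill representations
                 synth_embs: np.ndarray,    # (num_synth, dim) labeled synthetic sentences
                 synth_labels: np.ndarray,  # (num_synth,) skill index of each sentence
                 k: int = 8,
                 alpha: float = 0.5) -> np.ndarray:
    """Blend bi-encoder skill scores with kNN evidence from synthetic training sentences."""
    def cos(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T

    # Standard bi-encoder score: input sentence vs. each skill representation.
    bi_scores = cos(input_emb[None, :], skill_embs)[0]

    # kNN step: find the k most similar labeled synthetic sentences and credit
    # their skills with the similarity to the input sentence.
    sims = cos(input_emb[None, :], synth_embs)[0]
    knn_scores = np.zeros_like(bi_scores)
    for idx in np.argsort(-sims)[:k]:
        knn_scores[synth_labels[idx]] += sims[idx]

    return alpha * bi_scores + (1.0 - alpha) * knn_scores

# Toy usage with random embeddings, only to show the shapes involved.
rng = np.random.default_rng(0)
dim, num_skills, num_synth = 16, 5, 50
scores = knnbe_scores(rng.normal(size=dim),
                      rng.normal(size=(num_skills, dim)),
                      rng.normal(size=(num_synth, dim)),
                      rng.integers(0, num_skills, size=num_synth))
print(scores.argmax())  # index of the highest-scoring skill
```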
IEEE - ICDE
2025
Large language models (LLMs), despite their impressive capabilities in open-domain natural language understanding tasks, often lack effectiveness on similar tasks in enterprise applications due to potential hallucinations, weak multi-hop reasoning ability, and limitations in adapting to heterogeneous data types, among others. Such issues primarily arise from the absence of private, on-premises enterprise data in an LLM's training corpus. Knowledge-intensive tasks in enterprises often require multi-step reasoning, deep contextual understanding, and integration of information stored and accessed in heterogeneous formats (e.g., tables, graphs, documents, and JSON), which LLMs are not inherently equipped to handle without significant adaptation. To this end, retrieval-augmented generation (RAG) offers promise in instrumenting such adaptations on demand. While RAG-based approaches focus on controlling the generation and mitigating hallucinations, existing solutions are not sufficient for the requirements of enterprise settings. In this paper, we outline our approaches toward understanding and implementing a more effective RAG workflow in the wild. To achieve this goal, we draw on the cognitive science concepts of System 1 (fast, intuitive thinking) and System 2 (slow, deliberate, analytical thinking). In particular, we discuss how existing RAG approaches are more aligned with System 1 and propose shifting from traditional single-model architectures to compound AI systems within a System 2 framework to improve RAG, especially in complex enterprise applications. Such compound AI systems adopt a more systematic approach by assigning specialized tasks to different intelligent agents, optimizing retrieval and generation performance within a retrieval-augmented generation workflow.
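To make the System 1 versus System 2 contrast concrete, here is a hedged sketch rather than the paper's architecture: the `retrieve` and `generate` callables and the planning prompt are assumptions, and the point is only that the deliberate path decomposes the question and gathers evidence per step before synthesizing:

```python
from typing import Callable, List

# Hypothetical hooks: any retriever / generator pair could be plugged in here.
Retrieve = Callable[[str], List[str]]
Generate = Callable[[str], str]

def system1_rag(question: str, retrieve: Retrieve, generate: Generate) -> str:
    """Fast, intuitive path: one retrieval pass, one generation pass."""
    context = "\n".join(retrieve(question))
    return generate(f"Answer using the context:\n{context}\n\nQ: {question}")

def system2_rag(question: str, retrieve: Retrieve, generate: Generate) -> str:
    """Deliberate path: plan sub-questions, gather evidence per step, then synthesize."""
    plan = generate(f"Break this question into short sub-questions, one per line:\n{question}")
    sub_questions = [q.strip() for q in plan.splitlines() if q.strip()]
    notes = []
    for sub in sub_questions:
        context = "\n".join(retrieve(sub))
        notes.append(generate(f"Context:\n{context}\n\nAnswer briefly: {sub}"))
    combined = "\n".join(notes)
    return generate(f"Combine these intermediate answers into a final answer for '{question}':\n{combined}")
```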
EMNLP
2024
Ayana Niwa, Hayate Iso
In this study, we introduce AmbigNLG, a new task designed to tackle the challenge of task ambiguity in instructions for Natural Language Generation (NLG) tasks. Despite the impressive capabilities of Large Language Models (LLMs) in understanding and executing a wide range of tasks through natural language interaction, their performance is significantly hindered by the ambiguity present in real-world instructions. To address this, AmbigNLG seeks to identify and mitigate such ambiguities, aiming to refine instructions to match user expectations better. We introduce a dataset, AmbigSNI-NLG, consisting of 2,500 instances, and develop an ambiguity taxonomy for categorizing and annotating instruction ambiguities. Our approach demonstrates substantial improvements in text generation quality, highlighting the critical role of clear and specific instructions in enhancing LLM performance in NLG tasks.
CIKM - Demo
2024
Chen Shen, Jin Wang, Sajjadur Rahman, Eser Kandogan
The Text-to-SQL problem aims at developing natural language query interfaces for relational database systems by converting the text input into executable SQL queries. Recently, using Large Language Models (LLMs) has emerged as a new paradigm for the Text-to-SQL problem. To this end, the LLM needs to understand not only the user input but also information from the database. In this demo, we present multi-agent SQL (MageSQL), an LLM-based Text-to-SQL approach that tackles the task by orchestrating multiple agents in a pipeline. We will showcase a user-friendly interface that demonstrates the inner workings of our approach and allows users to add and modify agents with different functionalities, customize prompts, and see their impact on specific examples. Through several use cases, we will demonstrate how to (i) construct a Text-to-SQL pipeline with multiple agents; (ii) generate prompts for the LLM with various templates and strategies; and (iii) monitor the results of natural language queries and perform debugging.
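As an illustrative sketch of the pipeline orchestration idea only: the agent roles below (schema linking, SQL generation) and their toy logic are assumptions rather than the demo's actual agents, and the trace kept in the shared state hints at how results could be monitored for debugging:

```python
from typing import Callable, Dict, List, Tuple

# Each "agent" is a function that reads and updates a shared state dict.
Agent = Callable[[Dict], Dict]

class TextToSQLPipeline:
    """Minimal orchestration of Text-to-SQL agents in sequence."""
    def __init__(self):
        self.agents: List[Tuple[str, Agent]] = []

    def add_agent(self, name: str, agent: Agent) -> "TextToSQLPipeline":
        self.agents.append((name, agent))
        return self

    def run(self, question: str, schema: str) -> Dict:
        state = {"question": question, "schema": schema, "trace": []}
        for name, agent in self.agents:
            state = agent(state)
            state["trace"].append(name)  # keep a trace for monitoring/debugging
        return state

# Toy agents; real ones would call an LLM with customizable prompt templates.
def schema_linker(state):
    state["linked_tables"] = [t for t in state["schema"].split(",")
                              if t.strip().lower() in state["question"].lower()]
    return state

def sql_generator(state):
    tables = state.get("linked_tables") or ["unknown_table"]
    state["sql"] = f"SELECT * FROM {tables[0].strip()} LIMIT 10;"
    return state

pipeline = TextToSQLPipeline().add_agent("schema_linking", schema_linker) \
                              .add_agent("sql_generation", sql_generator)
print(pipeline.run("show me rows from orders", "customers, orders")["sql"])
```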