Research

LLM & NLP
Large language models (LLMs) have transformed NLP from task-specific approaches to generalized, data-driven methods, enabling more flexible and intelligent AI applications. Our work enhances LLMs with a multi-agent retrieval-augmented generation (RAG) approach that integrates external, domain-specific data to better handle complex, heterogeneous data environments, improving the fairness, transparency, and reliability of LLM-powered applications.
Compound AI Systems
The emergence of large language models (LLMs) as proficient agents has ushered in a new era of compound AI systems, and we address the challenges of building such systems for enterprises. Our systems support agentic workflows in which agents interact with tools and data retrievers to solve complex tasks involving natural language understanding, code generation, and complex reasoning.
Data-AI Symbiosis
Data-AI Symbiosis (DAIS) explores research problems at the intersection of data management and AI. We focus on enterprise data cataloging, fact-checking and verification, data lake usability, and benchmarking multi-agent systems to enable effective knowledge grounding and contextualization for knowledge-guided generation with LLMs. At its core, the DAIS group is working toward building the next-generation data platform that enables self-service data analytics at scale within compound AI systems involving multi-agent workflows.
Human-Centered AI
We conduct research and development to enable more effective and seamless human-AI collaboration. Our efforts focus on planning for complex tasks while incorporating human feedback, developing conversational interfaces for interacting with compound AI systems, and designing tools and algorithms to enhance data annotation using large language models (LLMs). We aim to redefine how humans and AI systems work together, enabling more intuitive, transparent, and impactful collaborations in complex, real-world contexts.

Compound AI Systems

The emergence of large language models (LLMs) as proficient agents has ushered in a new era of compound AI systems. We are working toward building a blueprint architecture of compound AI systems tailored for enterprises.

LLM & NLP

We develop techniques and algorithms that advance NLP applications across domains and levels of complexity through a multi-agent approach over a multi-modal data lake. We also work to improve the core capabilities of LLMs.

Human-Centered AI

We work on planning for complex tasks while incorporating human feedback. We develop conversational interfaces for interacting with compound AI systems and design tools to enhance data annotation using LLMs.

Data-AI Symbiosis

We tackle research problems at the intersection of data management and AI, such as data discovery and natural language query generation, to enable self-service data exploration and analytics at scale over heterogeneous data sources.

Related

Publications

Association for Natural Language Processing (NLP)
2026
The effect of prompt variation on the outputs of large language models has been studied in many contexts, under a multitude of terms. This work restructures concepts that prior studies have conflated along two axes: robustness and controllability. Moreover, unlike conventional analyses premised on public datasets, we focus on business service environments that demand complex task compositions and descriptions of additional knowledge, and we systematically evaluate the importance of both concepts. Our experiments show that, for the tasks we examined, what matters most in practical operation is not the robustness emphasized in prior work but controllability, i.e., reliably reflecting the prompt's intent and stably eliciting the required information. This study contributes to rethinking prompt-design guidelines suited to business environments and offers implications for future evaluation metrics and model improvements.
Association for Natural Language Processing (NLP)
2026
Label noise is an unavoidable challenge in extreme multi-label classification. The existing Generalized Cross Entropy loss judges noise by loss magnitude, so it mistakes hard-to-learn examples for noise and suppresses their influence on training, hurting coverage. We propose a self-estimated loss-weighting method based on semantic similarity. The method adaptively assigns a weight to each example-label pair according to the semantic similarity between the example and the label as estimated by the model during training. Specifically, it suppresses learning on positive labels with low semantic similarity while softening the contribution of semantically similar negative labels, thereby protecting latent positives. In experiments under artificial label noise, the proposed method improved precision on high-frequency labels (P@1) and propensity-scored precision, which removes frequency bias (PSP@1), over existing methods.
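The example-label weighting scheme can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the binary cross-entropy base loss, the cosine-similarity embeddings, and the similarity-to-weight mapping are all assumptions made for the sketch.

```python
import numpy as np

def weighted_bce(logits, targets, ex_emb, label_embs):
    """Per-(example, label) binary cross-entropy, re-weighted by the model's
    own estimate of example-label semantic similarity."""
    # Cosine similarity between the example and every label, mapped to [0, 1].
    sims = label_embs @ ex_emb / (
        np.linalg.norm(label_embs, axis=1) * np.linalg.norm(ex_emb) + 1e-12
    )
    sims01 = (sims + 1.0) / 2.0

    p = 1.0 / (1.0 + np.exp(-logits))
    bce = -(targets * np.log(p + 1e-12) + (1 - targets) * np.log(1 - p + 1e-12))

    # Positives are trusted in proportion to similarity (a low-similarity
    # positive looks like label noise); negatives in proportion to
    # dissimilarity (a high-similarity negative may be a latent positive).
    weights = np.where(targets == 1, sims01, 1.0 - sims01)
    return (weights * bce).mean()
```

Under this sketch, a dissimilar positive label contributes almost nothing to the loss, while a dissimilar negative keeps its full weight.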
Association for Natural Language Processing (NLP)
2026
Mai Omura (Osaka Shoin Women's University), Aya Wakasa (Tohoku University), Hiroshi Matsuda, Masayuki Asahara (National Institute for Japanese Language and Linguistics)
This paper reports on UD Japanese-CEJC, a treebank of spoken Japanese constructed by converting the Corpus of Everyday Japanese Conversation (CEJC) into the Universal Dependencies (UD) format. The CEJC is a large-scale speech corpus covering a wide variety of everyday Japanese conversations, annotated with word segmentation and part-of-speech information. For UD Japanese-CEJC, we newly annotated the CEJC with long-unit-word morphological information and bunsetsu-based dependency information. The treebank was built from the Japanese morphological information and bunsetsu-based dependency structures according to conversion rules manually curated for the CEJC. We evaluated the resulting treebank by comparing it with a corpus of written Japanese and by measuring UD dependency-parsing accuracy, and we discuss various issues in constructing UD resources from the CEJC.
Association for Natural Language Processing (NLP)
2026
Hiroshi Matsuda, Masayuki Asahara (National Institute for Japanese Language and Linguistics)
Improvements in large language model (LLM) performance and the spread of fine-tuning techniques have raised performance on a range of downstream tasks, and they also contribute to advancing syntactic parsing, a foundational NLP technology. This paper proposes a multilingual syntactic-parsing model fine-tuned with LoRA SFT. The proposed method consists of a language-identification and sentence-segmentation task that takes a document as input, a word-segmentation and language-specific POS-tagging task that takes a sentence as input, and a dependency-parsing task that takes a sentence and a word list as input; by chaining these tasks, dependency parses can be obtained from raw text in any language. Experiments on datasets covering 40 languages from Universal Dependencies yielded several findings, including that sentence-segmentation accuracy becomes the bottleneck in multi-task learning and that performing language-specific POS tagging jointly with word segmentation improves word-segmentation accuracy. The resulting models and parsing library are to be released under a license permitting commercial use.
ICLR
2026
Table Question Answering (Table QA) in real-world settings must operate over both structured databases and semi-structured tables containing textual fields. However, existing benchmarks are tied to fixed data formats and have not systematically examined how representation itself affects model performance. We present the first controlled study that isolates the role of table representation by holding content constant while varying structure. Using a verbalization pipeline, we generate paired structured and semi-structured tables, enabling direct comparisons across modeling paradigms. To support detailed analysis, we introduce RePairTQA, a diagnostic benchmark with splits along table size, join requirements, query complexity, and schema quality. Our experiments reveal consistent trade-offs: SQL-based methods achieve high accuracy on structured inputs but degrade on semi-structured data, LLMs exhibit flexibility but reduced precision, and hybrid approaches strike a balance, particularly under noisy schemas. These effects intensify with larger tables and more complex queries. Ultimately, no single method excels across all conditions, and we highlight the central role of representation in shaping Table QA performance. Our findings provide actionable insights for model selection and design, paving the way for more robust hybrid approaches suited for diverse real-world data formats.
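The verbalization step that produces content-matched pairs can be illustrated with a toy sketch. The function and the example record below are hypothetical and only mimic the pipeline's role: the same cell values appear in both representations, and only the structure changes.

```python
def verbalize_row(row):
    """Render one structured record as free text, preserving every value so
    the structured and semi-structured versions stay content-matched."""
    return "; ".join(f"{col.replace('_', ' ')} is {val}"
                     for col, val in row.items()) + "."

# Structured form: typed columns.
structured = {"name": "Acme", "founded": 1999, "hq_city": "Osaka"}

# Semi-structured form: a key column plus a textual field holding the rest.
semi_structured = {
    "name": structured["name"],
    "description": verbalize_row(
        {k: v for k, v in structured.items() if k != "name"}),
}
```

Because content is held constant, any accuracy gap between the two forms can be attributed to representation rather than information loss.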
ICLR
2026
As language models gain access to external tools via structured function calls, they become increasingly capable of solving complex, multi-step tasks. However, existing benchmarks for tool-augmented language models (TaLMs) provide insufficient control over factors such as the number of functions accessible, task complexity, and input size, and remain vulnerable to data contamination. We present FuncBenchGen, a unified, contamination-free framework that evaluates TaLMs by generating synthetic multi-step tool-use tasks. The key idea is to cast tool use as traversal over a hidden function-dependency DAG, where nodes are function calls and an edge between nodes represents one function consuming the output of another. Given a set of external function schemas, initial variable values, and a target variable, models must compose the correct call sequence to compute the target variable. FuncBenchGen allows users to precisely control task difficulty (e.g., graph size, dependency depth, and distractor functions) while avoiding data leakage. We apply our FuncBenchGen framework to evaluate seven LLMs on tool-use tasks of varying difficulty. Reasoning-optimized models consistently outperform general-purpose models, with GPT-5 significantly outperforming the rest. Performance declines sharply as dependency depth increases. Furthermore, connected irrelevant functions prove especially difficult to handle. We find that strong models often make syntactically valid function calls but propagate incorrect or stale argument values across steps, revealing brittle state tracking by LLMs in multi-turn tool use. Motivated by this observation, we introduce a simple mitigation strategy that explicitly restates prior variable values to the agent at each step. Surprisingly, this lightweight change yields substantial gains across models, e.g., raising GPT-5's success rate from 62.5% to 81.3%.
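The DAG formulation can be made concrete with a toy example. The function names and arithmetic below are invented for illustration and are not drawn from the benchmark; they just show how computing a target variable reduces to traversing the dependency graph in the right order.

```python
# A tiny function-dependency DAG in the spirit of the setup described above.
# Each schema maps a function name to (dependencies, operation); an edge
# u -> v means v consumes u's output.
funcs = {
    "f1": ([], lambda: 3),             # source: yields an initial value
    "f2": (["f1"], lambda x: x + 4),   # consumes f1's output
    "f3": (["f1"], lambda x: x * 2),   # distractor branch, unused for f4
    "f4": (["f2"], lambda x: x * 10),  # target depends on f2, hence on f1
}

def solve(target):
    """Compute a target variable by traversing the DAG bottom-up,
    memoizing each function's output (the correct call sequence)."""
    memo = {}
    def ev(name):
        if name not in memo:
            deps, op = funcs[name]
            memo[name] = op(*[ev(d) for d in deps])
        return memo[name]
    return ev(target)
```

Difficulty knobs such as graph size, dependency depth, and the number of distractor branches correspond to how `funcs` is generated.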