Symbiosis of Data and AI

The Symbiosis of Data and AI (DAIS) group at Megagon Labs explores research problems at the intersection of data management and AI. The group's central goal is to build next-generation data platforms that enable large-scale, self-service data analytics in compound AI systems, including multi-agent workflows.

Advances in large language models (LLMs), particularly their improved deep language understanding, open new opportunities to tackle classic data management challenges such as data integration, entity matching, and data discovery. In our research on AI-powered data management, we leverage language models and state-of-the-art machine learning techniques, focusing on data discovery in data lakes, table understanding, data augmentation for data management, and generating domain-specific queries from natural language.

At the same time, as LLMs are increasingly deployed in enterprise systems, accuracy, privacy, reliability, governance, and explainability have become critical requirements. This calls for systematic approaches to challenges such as strengthening knowledge-intensive query understanding, optimizing knowledge retrieval across heterogeneous data sources, optimizing search and query processing, improving the robustness of fact-checking and verification, and increasing the flexibility of domain adaptation. Our research focuses not only on AI-powered data management but also on advancing AI through data management. Specifically, by working on enterprise data catalog management, fact-checking and verification, improving the usability of data lakes, and benchmarking multi-agent systems, we aim to strengthen knowledge grounding and contextualization for knowledge-grounded generation with LLMs.

Highlights

Projects

The Watchog framework employs contrastive learning to learn robust representations of tabular data from a large unlabeled table corpus with minimal overhead.
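Watchog's training objective is not reproduced here; as a minimal sketch of the contrastive-learning idea it builds on, the following computes an InfoNCE-style loss over toy embeddings (the function names and vectors are hypothetical, and a real system would use model-produced column embeddings):

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def info_nce(anchor, positive, negatives, temperature=0.07):
    # InfoNCE: pull the positive view toward the anchor, push negatives away.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))
```

In a contrastive setup for tables, the anchor and positive would be two augmented views of the same column, and negatives would come from other columns in the batch.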

We explore knowledge-grounded reasoning for complex task decisions by leveraging knowledge graphs and large language models (LLMs). This work builds a two-stage pipeline that reviews task decisions and eliminates potentially incorrect ones before rationalization, enabling trustworthy rationale generation.
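The two-stage idea can be sketched as follows; the reviewer and rationalizer below are hypothetical stand-ins for the LLM-backed components:

```python
def two_stage_pipeline(predictions, reviewer, rationalizer):
    """Stage 1: the reviewer filters out likely-incorrect task decisions.
    Stage 2: only vetted decisions are passed on for rationale generation."""
    vetted = [p for p in predictions if reviewer(p)]
    return {p["answer"]: rationalizer(p) for p in vetted}

# Toy components: accept only confident predictions, then "rationalize" them.
reviewer = lambda p: p["confidence"] >= 0.8
rationalizer = lambda p: f"Chosen because it scored {p['confidence']:.2f}."

preds = [
    {"answer": "umbrella", "confidence": 0.93},
    {"answer": "spoon", "confidence": 0.41},  # eliminated before rationalization
]
rationales = two_stage_pipeline(preds, reviewer, rationalizer)
```

The point of the staging is that rationales are never generated for decisions the reviewer rejects, which avoids fluent justifications of wrong answers.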

To facilitate evaluating the data discovery performance of multimodal data retrieval in compound AI systems in real-world settings, we propose a benchmark that models the complexity of enterprise data platforms.


Multi-Agent SQL (MageSQL) tackles the text-to-SQL task with a pipeline-based approach that orchestrates multiple agents powered by large language models (LLMs). It provides a user-friendly interface for adding and editing agents, customizing prompts, and visualizing the impact of changes.
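MageSQL's actual agents are LLM-driven; the hypothetical sketch below only illustrates the orchestration pattern, with each agent as a function that reads and extends a shared state:

```python
def run_pipeline(agents, question):
    # Each agent receives the shared state and returns an extended copy.
    state = {"question": question}
    for agent in agents:
        state = agent(state)
    return state

# Hypothetical stages of a text-to-SQL pipeline.
def schema_linker(state):
    # A real linker would pick relevant tables/columns from the schema.
    return {**state, "table": "employees", "columns": ["name", "salary"]}

def sql_generator(state):
    cols = ", ".join(state["columns"])
    return {**state, "sql": f"SELECT {cols} FROM {state['table']}"}

def sql_fixer(state):
    # A real fixer would validate against the database and repair errors.
    return {**state, "sql": state["sql"] + ";"}

result = run_pipeline([schema_linker, sql_generator, sql_fixer],
                      "List employee names and salaries")
```

Because each agent is an independent step over a shared state, adding, removing, or editing an agent (as the interface supports) changes the pipeline without touching the other stages.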

Related

Research Papers

ICLR
2026
Table Question Answering (Table QA) in real-world settings must operate over both structured databases and semi-structured tables containing textual fields. However, existing benchmarks are tied to fixed data formats and have not systematically examined how representation itself affects model performance. We present the first controlled study that isolates the role of table representation by holding content constant while varying structure. Using a verbalization pipeline, we generate paired structured and semi-structured tables, enabling direct comparisons across modeling paradigms. To support detailed analysis, we introduce RePairTQA, a diagnostic benchmark with splits along table size, join requirements, query complexity, and schema quality. Our experiments reveal consistent trade-offs: SQL-based methods achieve high accuracy on structured inputs but degrade on semi-structured data, LLMs exhibit flexibility but reduced precision, and hybrid approaches strike a balance, particularly under noisy schemas. These effects intensify with larger tables and more complex queries. Ultimately, no single method excels across all conditions, and we highlight the central role of representation in shaping Table QA performance. Our findings provide actionable insights for model selection and design, paving the way for more robust hybrid approaches suited for diverse real-world data formats.
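The paper's verbalization pipeline is not reproduced here; as a hypothetical sketch, one way to pair a structured table with a semi-structured counterpart is to verbalize each row into a textual field while holding the content constant:

```python
def verbalize_table(header, rows):
    # Turn each structured row into a single textual description,
    # keeping the content identical while changing only the representation.
    return [
        "; ".join(f"{col} is {val}" for col, val in zip(header, row)) + "."
        for row in rows
    ]

header = ["city", "population"]
rows = [["Tokyo", "14M"], ["Osaka", "2.7M"]]
texts = verbalize_table(header, rows)
```

Paired inputs like these allow comparing, say, a SQL-based method on the structured form against an LLM reading the verbalized form, with representation as the only variable.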
ACL - Findings
2024
Aditi Mishra, Sajjadur Rahman, Hannah Kim, Kushan Mitra, Estevam Hruschka
Large language models (LLMs) are proficient at generating fluent text with minimal task-specific supervision. Yet, their ability to provide well-grounded rationalizations for knowledge-intensive tasks remains under-explored. Such tasks, like commonsense multiple-choice questions, require rationales based on world knowledge to support predictions and refute alternate options. We consider the task of generating knowledge-guided rationalization in natural language by using expert-written examples in a few-shot manner. Surprisingly, crowd-workers preferred knowledge-grounded rationales over crowdsourced rationalizations, citing their factuality, sufficiency, and comprehensive refutations. Although LLM-generated rationales were preferable, further improvements in conciseness and novelty are required. In another study, we show how rationalization of incorrect model predictions erodes humans' trust in LLM-generated rationales. Motivated by these observations, we create a two-stage pipeline to review task predictions and eliminate potential incorrect decisions before rationalization, enabling trustworthy rationale generation.
ICDE
2024
Nima Shahbazi, Jin Wang, Zhengjie Miao, Nikita Bhutani
Entity matching is a crucial task in many real applications. Despite the substantial body of research that focuses on improving the effectiveness of entity matching, enhancing its fairness has received scant attention. To fill this gap, this paper introduces a new problem of preparing fairness-aware datasets for entity matching. We formally outline the problem, drawing upon the principles of group fairness and statistical parity. We devise three highly efficient algorithms to accelerate the process of identifying an unbiased dataset from the vast search space. Our experiments on four real-world datasets show that our proposed algorithms can significantly improve fairness in the results while achieving comparable effectiveness to existing fairness-agnostic methods. Furthermore, we conduct case studies to demonstrate that our proposed techniques can be seamlessly integrated into end-to-end entity matching pipelines to support fairness requirements in real-world applications.
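As a hedged illustration of the group-fairness criterion this line of work builds on, the sketch below computes a statistical parity gap over a toy matching dataset (the record format and function name are hypothetical):

```python
def statistical_parity_gap(records):
    """|P(matched | protected) - P(matched | not protected)| over
    records shaped as (protected: bool, matched: bool) tuples."""
    def match_rate(group):
        return sum(1 for _, m in group if m) / len(group) if group else 0.0
    protected = [r for r in records if r[0]]
    others = [r for r in records if not r[0]]
    return abs(match_rate(protected) - match_rate(others))

# 1/2 of protected pairs matched vs. 3/4 of the rest: gap = 0.25.
data = [(True, True), (True, False),
        (False, True), (False, True), (False, True), (False, False)]
gap = statistical_parity_gap(data)
```

A fairness-aware dataset preparation step, as the paper proposes, would search for a subset of the data that drives a gap like this toward zero while preserving matching effectiveness.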
SIGMOD
2024
Zhengjie Miao, Jin Wang
Relational Web tables provide valuable resources for numerous downstream applications, making table understanding, especially column annotation that identifies semantic types and relations of columns, a hot topic in the field of data management. Despite recent efforts to improve different tasks in table understanding by using the power of large pre-trained language models, existing methods heavily rely on large-scale and high-quality labeled instances, while they still suffer from the data sparsity problem due to the imbalanced data distribution among different classes. In this paper, we propose the Watchog framework, which employs contrastive learning techniques to learn robust representations for tables by leveraging a large-scale unlabeled table corpus with minimal overhead. Our approach enables the learned table representations to enhance fine-tuning with much fewer additional labeled instances than in prior studies for downstream column annotation tasks. Besides, we further propose optimization techniques for semi-supervised settings. Experimental results on popular benchmarking datasets illustrate the superiority of our proposed techniques in two column annotation tasks under different settings. In particular, our Watchog framework effectively alleviates the class imbalance issue caused by a long-tailed label distribution. In the semi-supervised setting, Watchog outperforms the best-known method by up to 26% and 41% in Micro and Macro F1 scores, respectively, on the task of semantic type detection.
DASFAA
2024
Investigation of Simple-but-Effective Architecture for Long-form Text Matching with Transformers
Chen Shen, Jin Wang
Long-form text matching plays a significant role in many real-world Natural Language Processing (NLP) and Information Retrieval (IR) applications. Recently, Transformer-based models such as BERT have been widely applied to this problem and have achieved promising results. However, they are all based on the Siamese network architecture and thus require extra techniques to capture matching signals and remedy the problem of late interaction. In this paper, we investigate the use of the sequence pair classification architecture as a solution to long-form text matching: we concatenate a pair of long-form texts into one sequence and feed it into a pre-trained language model for fine-tuning. Initial experimental results show that this simple baseline can outperform state-of-the-art approaches in this field without further optimization. These findings illustrate that sequence pair classification, which previous studies have not yet explored for this problem, is a promising choice. We also conduct an in-depth empirical analysis to present more comprehensive results that support our claim and provide more insights for researchers in this direction.
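The sequence pair classification setup can be illustrated with a hypothetical input-construction sketch: instead of encoding each document separately (Siamese style), the two texts are concatenated into one sequence with separator markers before being fed to the encoder, so self-attention can model cross-document interactions directly:

```python
def build_pair_input(text_a, text_b, max_len=512,
                     cls_tok="[CLS]", sep_tok="[SEP]"):
    # Concatenate both documents into a single token sequence, BERT-style:
    # [CLS] tokens_a [SEP] tokens_b [SEP], truncated to the model's limit.
    tokens = [cls_tok] + text_a.split() + [sep_tok] + text_b.split() + [sep_tok]
    if len(tokens) > max_len:
        tokens = tokens[: max_len - 1] + [sep_tok]  # keep a closing separator
    return tokens

pair = build_pair_input("long form document one", "its candidate match")
```

A real implementation would use the pre-trained model's own tokenizer and special tokens; the whitespace split here is only a placeholder for illustration.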
New Frontiers in Graph Learning Workshop - NeurIPS
2023
Nedelina Teneva, Estevam Hruschka
Despite the recent popularity of knowledge graph (KG) related tasks and benchmarks such as KG embeddings, link prediction, entity alignment and evaluation of the reasoning abilities of pretrained language models as KGs, the structure and properties of real KGs are not well studied. In this paper, we perform a large scale comparative study of 29 real KG datasets from diverse domains such as the natural sciences, medicine, and NLP to analyze their properties and structural patterns. Based on our findings, we make several recommendations regarding KG-based model development and evaluation. We believe that the rich structural information contained in KGs can benefit the development of better KG models across fields and we hope this study will contribute to breaking the existing data silos between different areas of research (e.g., ML, NLP, AI for sciences).
6 Min Read
June 26, 2024
Long-form text matching is an important problem in Natural Language Processing (NLP) and Information Retrieval (IR). We propose a simple yet effective solution based on sequence pair classification with Transformer models and demonstrate that it outperforms state-of-the-art Siamese network-based methods.
6 Min Read
June 3, 2024
With data now the most critical resource, this framework for harnessing the vast information contained in relational Web tables has become indispensable. Watchog enables enterprises to extract valuable insights from product catalogs, price lists, and customer data repositories, and to use them to optimize pricing strategies and deliver personalized recommendations that improve customer satisfaction and loyalty.
11 Min Read
January 9, 2024
We take a detailed look at our knowledge graph (KG) construction and learning platform and its role in enriching machine learning. We walk through the design of our pipeline, delving into data provenance and the granularity of GNN training, and show how it facilitates seamlessly integrating KGs into practical real-world tasks and use cases.