Data-AI Symbiosis

The Data-AI Symbiosis (DAIS) group at Megagon Labs explores research problems at the intersection of data management and AI. At its core, the DAIS group is working toward building the next-generation data platform that enables self-service data analytics at scale within compound AI systems involving multi-agent workflows.

Advances in large language models (LLMs), specifically their deep language-understanding capabilities, offer new opportunities to tackle classic data-management problems such as data integration, entity matching, and data discovery. Our work in the AI-for-data-management area focuses on exploiting language models and state-of-the-art machine-learning approaches for data discovery in data lakes, tabular data understanding, data augmentation for data management, and translating natural language into domain-specific queries.

Conversely, as LLMs are increasingly adopted in enterprise systems — where accuracy, privacy, trust, governance, and explainability are of the utmost importance — it is necessary to develop systematic approaches toward enhancing knowledge-intensive query understanding, knowledge retrieval over heterogeneous data sources, optimization during retrieval and querying, robustness in fact-checking and verification, and flexibility in domain adaptation. Our work in the data-management-for-AI area focuses on enterprise data cataloging, fact-checking and verification, data lake usability, and benchmarking multi-agent systems to enable effective knowledge grounding and contextualization for knowledge-guided generation with LLMs.

Highlighted Projects

The Watchog framework employs contrastive learning techniques to learn robust representations for tables by leveraging a large-scale unlabeled table corpus with minimal overhead. 

We explore knowledge-guided rationales for complex task decisions using knowledge graphs and LLMs. We create a two-stage pipeline to review task decisions and eliminate potentially incorrect decisions before rationalization, enabling trustworthy rationale generation.
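
For illustration, the sketch below shows one way such a review-then-rationalize pipeline can be wired together: predictions that fail a verification step are filtered out rather than rationalized. This is a minimal Python sketch under assumptions; the `call_llm` placeholder, the prompts, and the function names are hypothetical and do not reflect the pipeline's actual implementation.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for any chat-completion client."""
    raise NotImplementedError("plug in an LLM client here")


def review_prediction(question: str, prediction: str, facts: list[str]) -> bool:
    """Stage 1: check whether the task prediction is supported by retrieved knowledge."""
    prompt = (
        "Given the facts below, answer YES if the prediction is supported, otherwise NO.\n"
        f"Facts: {'; '.join(facts)}\n"
        f"Question: {question}\n"
        f"Prediction: {prediction}\n"
        "Answer:"
    )
    return call_llm(prompt).strip().upper().startswith("YES")


def rationalize(question: str, prediction: str, facts: list[str]) -> str:
    """Stage 2: generate a knowledge-grounded rationale for verified predictions only."""
    prompt = (
        "Using only the facts below, explain why the prediction is correct and "
        "refute the alternative options.\n"
        f"Facts: {'; '.join(facts)}\n"
        f"Question: {question}\n"
        f"Prediction: {prediction}\n"
        "Rationale:"
    )
    return call_llm(prompt)


def review_then_rationalize(question: str, prediction: str, facts: list[str]) -> str | None:
    # Predictions that fail the review stage are dropped instead of rationalized.
    if not review_prediction(question, prediction, facts):
        return None
    return rationalize(question, prediction, facts)
```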

To catalyze research on evaluating the data-discovery performance of multimodal data retrievers in compound AI systems in a real-world setting, we propose a benchmark that models the complexity of enterprise data platforms.
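
As a concrete illustration of what evaluating data-discovery retrievers involves, the helper below computes a standard recall@k score over a set of queries with known relevant items (for example, tables or columns). This is an assumed, illustrative metric only, not the benchmark's actual evaluation suite.

```python
def recall_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int = 10) -> float:
    """Mean fraction of each query's relevant items that appear in its top-k results."""
    scores = [
        len(gold & set(ranked[:k])) / len(gold)
        for ranked, gold in zip(retrieved, relevant)
        if gold  # skip queries with no labeled relevant items
    ]
    return sum(scores) / len(scores) if scores else 0.0
```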

We present multi-agent SQL (MageSQL), an LLM-based text-to-SQL approach that tackles the task by orchestrating multiple agents in a pipeline. Our user-friendly interface lets users add and modify agents, customize prompts, and observe their impact on specific examples.
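
To make the idea of an agent pipeline concrete, the sketch below runs a schema-linking agent, a SQL-generation agent, and a correction agent in sequence over a shared context, with per-agent prompt templates that a user could edit. The agent names, prompts, and the LLM placeholder are illustrative assumptions, not MageSQL's actual agents or prompts.

```python
from dataclasses import dataclass
from typing import Callable

LLM = Callable[[str], str]  # placeholder for any chat-completion client


@dataclass
class Agent:
    name: str
    prompt_template: str  # users could customize this template per agent

    def run(self, llm: LLM, context: dict) -> dict:
        # Fill the template from the shared context, call the LLM, and record the output.
        context[self.name] = llm(self.prompt_template.format(**context))
        return context


def build_pipeline() -> list[Agent]:
    # Each agent handles one sub-task; agents can be added, removed, or reordered.
    return [
        Agent("schema_linking",
              "Schema: {schema}\nQuestion: {question}\nList the relevant tables and columns:"),
        Agent("sql_generation",
              "Relevant schema: {schema_linking}\nQuestion: {question}\nWrite a SQL query:"),
        Agent("sql_correction",
              "Question: {question}\nCandidate SQL: {sql_generation}\nFix any errors and return the final SQL:"),
    ]


def text_to_sql(llm: LLM, question: str, schema: str) -> str:
    context = {"question": question, "schema": schema}
    for agent in build_pipeline():
        context = agent.run(llm, context)
    return context["sql_correction"]
```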

Related Publications

ACL Findings 2024
Aditi Mishra, Sajjadur Rahman, Hannah Kim, Kushan Mitra, Estevam Hruschka
Large language models (LLMs) are proficient at generating fluent text with minimal task-specific supervision. Yet, their ability to provide well-grounded rationalizations for knowledge-intensive tasks remains under-explored. Such tasks, like commonsense multiple-choice questions, require rationales based on world knowledge to support predictions and refute alternate options. We consider the task of generating knowledge-guided rationalization in natural language by using expert-written examples in a few-shot manner. Surprisingly, crowd-workers preferred knowledge-grounded rationales over crowdsourced rationalizations, citing their factuality, sufficiency, and comprehensive refutations. Although LLM-generated rationales were preferable, further improvements in conciseness and novelty are required. In another study, we show how rationalization of incorrect model predictions erodes humans’ trust in LLM-generated rationales. Motivated by these observations, we create a two-stage pipeline to review task predictions and eliminate potentially incorrect decisions before rationalization, enabling trustworthy rationale generation.
SIGMOD 2024
Zhengjie Miao, Jin Wang
Relational Web tables provide valuable resources for numerous downstream applications, making table understanding, especially column annotation that identifies semantic types and relations of columns, a hot topic in the field of data management. Despite recent efforts to improve different tasks in table understanding by using the power of large pre-trained language models, existing methods heavily rely on large-scale and high-quality labeled instances, while they still suffer from the data sparsity problem due to the imbalanced data distribution among different classes. In this paper, we propose the Watchog framework, which employs contrastive learning techniques to learn robust representations for tables by leveraging a large-scale unlabeled table corpus with minimal overhead. Our approach enables the learned table representations to enhance fine-tuning with far fewer additional labeled instances than in prior studies for downstream column annotation tasks. In addition, we propose optimization techniques for semi-supervised settings. Experimental results on popular benchmarking datasets illustrate the superiority of our proposed techniques in two column annotation tasks under different settings. In particular, our Watchog framework effectively alleviates the class imbalance issue caused by a long-tailed label distribution. In the semi-supervised setting, Watchog outperforms the best-known method by up to 26% and 41% in Micro and Macro F1 scores, respectively, on the task of semantic type detection.
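
As background, a minimal sketch of the kind of contrastive objective such table-representation learning typically relies on: an InfoNCE-style loss that pulls together embeddings of two augmented views of the same column and pushes apart all other pairs. The NumPy code below is an illustrative example under that assumption, not Watchog's training code.

```python
import numpy as np


def info_nce_loss(view_a: np.ndarray, view_b: np.ndarray, temperature: float = 0.07) -> float:
    """InfoNCE-style contrastive loss: view_a[i] and view_b[i] are embeddings of two
    augmented views of the same column; all other pairs are treated as negatives."""
    # L2-normalize so dot products are cosine similarities.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                       # (n, n) similarity matrix
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # maximize matched-pair probability


# Example: 4 columns embedded in 16 dimensions under two augmentations.
rng = np.random.default_rng(0)
print(info_nce_loss(rng.normal(size=(4, 16)), rng.normal(size=(4, 16))))
```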
January 9, 2024 (6 min read)
This blog post will peel back the layers of our KG building and learning platform, illuminating its role in enriching machine learning. As we explore our distinctive pipelines and delve into the granularities of data provenance and GNN training, we’ll showcase how our system facilitates the seamless integration of KGs into practical, real-world tasks for production use cases.
October 31, 2023 (3 min read)
Our experiments show ZETT advances state-of-the-art extraction accuracy while providing a conceptually simple and stable solution. Going forward, we believe methods like ZETT that leverage self-supervised pre-training will play a key role in adapting information extraction to open-domain settings.
April 27, 2023 (6 min read)
To help NLP researchers and practitioners understand and improve their data, we introduce Weedle, an exploratory text analysis tool for data-centric NLP. Here are Weedle’s biggest strengths…