Data AI Symbiosis

The Data-AI Symbiosis (DAIS) group at Megagon Labs explores research problems at the intersection of data management and AI. At its core, the DAIS group is working toward a next-generation data platform that enables self-service data analytics at scale within compound AI systems involving multi-agent workflows.

Advances in large language models (LLMs), specifically deep language understanding capabilities, offer new opportunities to tackle classic data-management problems such as data integration, entity matching, and data discovery. Our work in the AI-for-data-management area focuses on exploiting language models and state-of-the-art machine-learning approaches for data discovery in data lakes, tabular data understanding, data augmentation for data management, and natural language to domain-specific query generation.

Conversely, as LLMs are increasingly adopted in enterprise systems — where accuracy, privacy, trust, governance, and explainability are of the utmost importance — it is necessary to develop systematic approaches toward enhancing knowledge-intensive query understanding, knowledge retrieval over heterogeneous data sources, optimization during retrieval and querying, robustness in fact-checking and verification, and flexibility in domain adaptation. Our work in the data-management-for-AI area focuses on enterprise data cataloging, fact-checking and verification, data lake usability, and benchmarking multi-agent systems to enable effective knowledge grounding and contextualization for knowledge-guided generation with LLMs.

Highlighted Projects

The Watchog framework employs contrastive learning techniques to learn robust representations for tables by leveraging a large-scale unlabeled table corpus with minimal overhead. 
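As a rough illustration of the kind of contrastive objective such a framework builds on, the sketch below computes an InfoNCE-style loss over toy column embeddings. It is not Watchog's implementation: the function names, the temperature value, and the two-dimensional vectors are all assumptions chosen for readability.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i], not positives[j] for j != i."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Two hypothetical "views" of the same two columns (e.g. after light augmentation).
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce_loss(view_a, view_b)         # matching views are paired
shuffled_loss = info_nce_loss(view_a, view_b[::-1])  # views are paired incorrectly
```

The loss is small when each embedding is closest to its own augmented view and large when the pairing is wrong, which is the signal that lets representations be learned from unlabeled tables.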

We explore knowledge-guided rationales for complex task decisions using knowledge graphs and LLMs. We create a two-stage pipeline to review task decisions and eliminate potential incorrect decisions before rationalization, enabling trustworthy rationale generation.

To catalyze research on evaluating the data discovery performance of multimodal data retrievers in Compound AI Systems within a real-world setting, we propose a benchmark modeling the complexity of enterprise data platforms.

We present multi-agent SQL (MageSQL), an LLM-based text-to-SQL approach that tackles the task by orchestrating multiple agents in a pipeline. Our user-friendly interface lets users add and modify agents, customize prompts, and observe their impact on specific examples.
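To make the idea of an editable agent pipeline concrete, here is a minimal sketch in which each agent is a function that reads and extends a shared state. The agent names (`schema_linker`, `sql_generator`, `sql_corrector`), the state keys, and the rule-based bodies are hypothetical stand-ins for LLM calls; they are not taken from MageSQL itself.

```python
from typing import Callable, Dict, List

State = Dict[str, str]
Agent = Callable[[State], State]

def schema_linker(state: State) -> State:
    # Hypothetical agent: pick the known tables mentioned in the question.
    known = ("employees", "departments")
    hits = [t for t in known if t in state["question"].lower()]
    return {**state, "tables": ",".join(hits)}

def sql_generator(state: State) -> State:
    # Stand-in for an LLM call: emit a count query over the first linked table.
    table = state["tables"].split(",")[0]
    return {**state, "sql": f"SELECT COUNT(*) FROM {table}"}

def sql_corrector(state: State) -> State:
    # Hypothetical post-processing agent: make sure the statement is terminated.
    sql = state["sql"].rstrip()
    return {**state, "sql": sql if sql.endswith(";") else sql + ";"}

def run_pipeline(agents: List[Agent], question: str) -> State:
    """Run the agents in order, threading the shared state through each one."""
    state: State = {"question": question}
    for agent in agents:
        state = agent(state)
    return state

pipeline = [schema_linker, sql_generator]
draft = run_pipeline(pipeline, "How many employees are there?")

# Adding an agent is a list edit; rerunning shows its effect on the same example.
pipeline.append(sql_corrector)
final = run_pipeline(pipeline, "How many employees are there?")
```

Keeping agents as plain functions over a shared state is one simple way to let a user insert, remove, or swap a step and compare outputs on a specific example.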

Related Publications

Moin Amin-Naseri, Hannah Kim, Estevam Hruschka
The extraction of structured information from raw text is a fundamental component of many NLP applications, including document retrieval, ranking, and relevance estimation. High-quality extractions often require domain-specific accuracy, up-to-date understanding of specialized taxonomies, and the ability to incorporate emerging jargon and rare outliers. In many domains, such as medical, legal, and HR, the extraction model must also adapt to shifting terminology and benefit from explicit reasoning over structured knowledge. We propose DySECT, a Dynamic Self-Evolving Extraction and Curation Toolkit, which continually improves as it is used. The system incrementally populates a versatile, self-expanding knowledge base (KB) with triples extracted by the LLM. The KB further enriches itself through the integration of probabilistic knowledge and graph-based reasoning, gradually accumulating domain concepts and relationships. The enriched KB then feeds back into the LLM extractor via prompt tuning, sampling of relevant few-shot examples, or fine-tuning using KB-derived synthetic data. As a result, the system forms a symbiotic closed-loop cycle in which extraction continuously improves knowledge, and knowledge continuously improves extraction.
ICLR
2026
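One leg of the closed loop described in the DySECT abstract above, sampling relevant few-shot examples from the KB for the extractor's prompt, might look roughly like the sketch below. The token-overlap ranking, the prompt wording, and all names (`select_few_shot`, `build_prompt`, the toy triples) are assumptions for illustration, not the toolkit's actual design.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

def tokens(text: str) -> Set[str]:
    """Lowercased word set with trailing punctuation stripped."""
    return {w.strip(".,;:").lower() for w in text.split()}

def select_few_shot(kb: List[Triple], text: str, k: int = 2) -> List[Triple]:
    """Rank KB triples by how many subject/object tokens appear in the text."""
    text_tokens = tokens(text)

    def overlap(triple: Triple) -> int:
        subj, _, obj = triple
        return len((tokens(subj) | tokens(obj)) & text_tokens)

    ranked = sorted(kb, key=overlap, reverse=True)
    # Keep at most k triples, and only those that share a token with the text.
    return [t for t in ranked[:k] if overlap(t) > 0]

def build_prompt(text: str, examples: List[Triple]) -> str:
    """Assemble an extraction prompt that shows the selected triples as examples."""
    lines = ["Extract (subject, relation, object) triples."]
    for subj, rel, obj in examples:
        lines.append(f"Example: ({subj}, {rel}, {obj})")
    lines.append(f"Text: {text}")
    return "\n".join(lines)

# Toy KB; in the closed loop, newly extracted triples would be appended here.
kb = [
    ("Python", "is_a", "programming language"),
    ("SQL", "is_a", "query language"),
    ("nurse", "works_in", "hospital"),
]
text = "Seeking an analyst fluent in SQL and Python."
examples = select_few_shot(kb, text)
prompt = build_prompt(text, examples)
```

Triples returned by the extractor would then be added back to `kb`, so later prompts can draw on a richer set of examples.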
Table Question Answering (Table QA) in real-world settings must operate over both structured databases and semi-structured tables containing textual fields. However, existing benchmarks are tied to fixed data formats and have not systematically examined how representation itself affects model performance. We present the first controlled study that isolates the role of table representation by holding content constant while varying structure. Using a verbalization pipeline, we generate paired structured and semi-structured tables, enabling direct comparisons across modeling paradigms. To support detailed analysis, we introduce RePairTQA, a diagnostic benchmark with splits along table size, join requirements, query complexity, and schema quality. Our experiments reveal consistent trade-offs: SQL-based methods achieve high accuracy on structured inputs but degrade on semi-structured data, LLMs exhibit flexibility but reduced precision, and hybrid approaches strike a balance, particularly under noisy schemas. These effects intensify with larger tables and more complex queries. Ultimately, no single method excels across all conditions, and we highlight the central role of representation in shaping Table QA performance. Our findings provide actionable insights for model selection and design, paving the way for more robust hybrid approaches suited for diverse real-world data formats.
VLDB
2025
Yihao Hu, Jin Wang, Sajjadur Rahman
Data discovery from data lakes is an essential application in modern data science. While many previous studies focused on improving the efficiency and effectiveness of data discovery, little attention has been paid to the usability of such applications. In particular, exploring data discovery results can be cumbersome due to the cognitive load involved in understanding raw tabular results and identifying insights to draw conclusions. To address this challenge, we introduce a new problem, visualization recommendation for data discovery over data lakes, which aims at automatically identifying visualizations that highlight relevant or desired trends in the results returned by data discovery engines. We propose LakeVisage, an end-to-end framework as the first solution to this problem. Given a data lake, a data discovery engine, and a user-specified query table, LakeVisage intelligently explores the space of visualizations and recommends the most useful and “interesting” visualization plans. To this end, we developed (i) approaches to smartly construct the candidate visualization plans from the results of the data discovery engine and (ii) effective pruning strategies to filter out less interesting plans so as to accelerate the visual analysis. Experimental results on real data lakes show that our proposed techniques can lead to an order of magnitude speedup in visualization recommendation. We also conduct a comprehensive user study to demonstrate that LakeVisage offers convenience to users in real data analysis applications by enabling them to get started with tasks seamlessly and to perform explorations flexibly.
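The two steps named in the LakeVisage abstract above, constructing candidate visualization plans and filtering out the less interesting ones, can be pictured with the toy sketch below. The plan shape (a bar chart of one text column against one numeric column), the variance-of-group-means score, and the simple top-k cut are assumptions standing in for the paper's own construction and pruning strategies, which are not detailed here.

```python
from itertools import product
from statistics import mean, pvariance

def candidate_plans(table):
    """Enumerate bar-chart plans: one text column on x, one numeric column on y."""
    cat_cols = [c for c, vals in table.items() if all(isinstance(v, str) for v in vals)]
    num_cols = [c for c, vals in table.items() if all(isinstance(v, (int, float)) for v in vals)]
    return list(product(cat_cols, num_cols))

def interestingness(table, plan):
    """Assumed score: variance of per-group means (flat charts score near zero)."""
    x, y = plan
    groups = {}
    for key, value in zip(table[x], table[y]):
        groups.setdefault(key, []).append(value)
    means = [mean(vals) for vals in groups.values()]
    return pvariance(means) if len(means) > 1 else 0.0

def recommend(table, k=1):
    """Score every candidate plan and keep only the k highest-scoring ones."""
    plans = candidate_plans(table)
    ranked = sorted(plans, key=lambda p: interestingness(table, p), reverse=True)
    return ranked[:k]

# Hypothetical result table returned by a data discovery engine.
table = {
    "region": ["east", "east", "west", "west"],
    "tier": ["a", "b", "a", "b"],
    "sales": [10, 12, 50, 52],
}
top = recommend(table, k=1)
```

Here sales differ sharply by region but barely by tier, so the region plan is the one retained.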
The rapid increase in textual information means we need more efficient methods to sift through, organize, and understand it all. While retrieval-augmented generation (RAG) models excel in accessing information from large document collections, they struggle with complex tasks that require aggregation and reasoning over information spanning multiple documents, which we call holistic reasoning. Long-context language models (LCLMs) have great potential for managing large-scale documents, but their holistic reasoning capabilities remain unclear. In this work, we introduce HoloBench, a novel framework that brings database reasoning operations into text-based contexts, making it easier to systematically evaluate how LCLMs handle holistic reasoning across large documents. Our approach adjusts key factors such as context length, information density, distribution of information, and query complexity to evaluate LCLMs comprehensively. Our experiments show that the amount of information in the context has a bigger influence on LCLM performance than the actual context length. Furthermore, the complexity of queries affects performance more than the amount of information, particularly for different types of queries. Interestingly, queries that involve finding maximum or minimum values are easier for LCLMs and are less affected by context length, even though they pose challenges for RAG systems. However, tasks requiring the aggregation of multiple pieces of information show a noticeable drop in accuracy as context length increases. Additionally, we find that while grouping relevant information generally improves performance, the optimal positioning varies across models. Our findings surface both the advancements and the ongoing challenges in achieving a holistic understanding of long contexts. These can guide future developments in LCLMs and set the stage for creating more robust language models for real-world applications.
Guide AI - SIGMOD
2024
Yanlin Feng, Sajjadur Rahman, Aaron Feng, Vincent Chen, Eser Kandogan
Compound AI systems (CASs) that employ LLMs as agents to accomplish knowledge-intensive tasks via interactions with tools and data retrievers have garnered significant interest within database and AI communities. While these systems have the potential to supplement typical analysis workflows of data analysts in enterprise data platforms, unfortunately, CASs are subject to the same data discovery challenges that analysts have encountered over the years — silos of multimodal data sources, created across teams and departments within an organization, make it difficult to identify appropriate data sources for accomplishing the task at hand. Existing data discovery benchmarks do not model such multimodality and multiplicity of data sources. Moreover, benchmarks of CASs prioritize only evaluating end-to-end task performance. To catalyze research on evaluating the data discovery performance of multimodal data retrievers in CASs within a real-world setting, we propose CMDBench, a benchmark modeling the complexity of enterprise data platforms. We adapt existing datasets and benchmarks in open-domain — from question answering and complex reasoning tasks to natural language querying over structured data — to evaluate coarse- and fine-grained data discovery and task execution performance. Our experiments reveal the impact of data retriever design on downstream task performance — a 46% drop in task accuracy on average — across various modalities, data sources, and task difficulty. The results indicate the need to develop optimization strategies to identify appropriate LLM agents and retrievers for efficient execution of CASs over enterprise data.
VLDB
2021
Yuliang Li, Xiaolan Wang, Zhengjie Miao, Wang Chiew Tan
In recent years, we have witnessed the development of novel data augmentation (DA) techniques for creating additional training data needed by machine learning based solutions. In this tutorial, we will provide a comprehensive overview of techniques developed by the data management community for data preparation and data integration. In addition to surveying task-specific DA operators that leverage rules, transformations, and external knowledge for creating additional training data, we also explore advanced DA techniques such as interpolation, conditional generation, and DA policy learning. Finally, we describe the connection between DA and other machine learning paradigms such as active learning, pre-training, and weakly-supervised learning. We hope that this discussion can shed light on future research directions for a holistic data augmentation framework for high-quality dataset creation. PVLDB Reference Format: Yuliang Li, Xiaolan Wang, Zhengjie Miao, and Wang-Chiew Tan. Data Augmentation for ML-driven Data Preparation and Integration.
6 Min Read
January 9, 2024
This blog post will peel back the layers of our KG building and learning platform, illuminating its role in enriching machine learning. As we explore our distinctive pipelines and delve into the granularities of data provenance and GNN training, we’ll showcase how our system facilitates the seamless integration of KGs into practical, real-world tasks for production use cases.
3 Min Read
October 31, 2023
Our experiments show ZETT advances state-of-the-art extraction accuracy while providing a conceptually simple and stable solution. Going forward, we believe methods like ZETT that leverage self-supervised pre-training will play a key role in adapting information extraction to open-domain settings.
3 Min Read
August 24, 2023
Dataset discovery from data lakes is a critical way to utilize open-domain data within the enterprise. To overcome the issues stemming from data quality and incomplete metadata in data lakes, it is essential to support table union search: given a query table and a collection of data lake tables, find all tables that are unionable with the query table. In this work, we propose an end-to-end framework named Starmie.
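For readers unfamiliar with table union search, the sketch below ranks data lake tables by how well their columns line up with a query table's columns. It uses plain Jaccard overlap of cell values as a stand-in for learned column representations, and it lets several query columns match the same candidate column; both choices, along with every name and value in the example, are simplifying assumptions rather than Starmie's method.

```python
def jaccard(a, b):
    """Jaccard similarity between the value sets of two columns."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def unionability(query, candidate):
    """Average, over query columns, of the best-matching candidate column score."""
    scores = [
        max(jaccard(q_col, c_col) for c_col in candidate.values())
        for q_col in query.values()
    ]
    return sum(scores) / len(scores)

def union_search(query, lake, k=1):
    """Return the names of the k lake tables most unionable with the query table."""
    ranked = sorted(lake, key=lambda name: unionability(query, lake[name]), reverse=True)
    return ranked[:k]

# Toy query table and data lake; column names differ, so only values can be used.
query = {"city": ["tokyo", "osaka"], "country": ["japan", "japan"]}
lake = {
    "cities_b": {"town": ["tokyo", "kyoto"], "nation": ["japan", "japan"]},
    "products": {"sku": ["a1", "b2"], "price": ["10", "20"]},
}
top = union_search(query, lake)
```

The example hints at why metadata alone is insufficient: `cities_b` shares no column names with the query table, yet its contents make it the natural union partner.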