Data AI Symbiosis

Advances in large language models (LLMs), particularly their deep language understanding capabilities, offer new opportunities to tackle classic data-management problems such as data integration, entity matching, and table discovery. Our recent work in the AI-for-data-management area focuses on exploiting language models and state-of-the-art machine learning approaches. We apply large language models in novel settings: learning table representations for discovering datasets in data lakes, developing data augmentation techniques for data management tasks, and designing declarative explanation approaches for data integration tasks.

Conversely, as LLM adoption grows, applying these models within enterprise systems, where accuracy, privacy, trust, governance, and explainability are of utmost importance, requires better knowledge retrieval across heterogeneous data sources, optimized retrieval (query processing), robust fact generation and verification, and flexible domain adaptation. For example, the HR domain introduces new problems related to bias, factuality, and explainability that require careful consideration. Our work in the data-management-for-AI area focuses on knowledge grounding and contextualization for knowledge-guided generation, fact-checking and verification, data lake usability, and benchmarking multi-agent systems for enterprise applications, among others.

Recent Publications:

CMDBench: A Benchmark for Coarse-to-fine Multimodal Data Discovery in Compound AI Systems

Characterizing Large Language Models as Rationalizers of Knowledge-intensive Tasks

Watchog: A Light-weight Contrastive Learning based Framework for Column Annotation

A Blueprint Architecture of Compound AI Systems for Enterprise

Fairness-aware Data Preparation for Entity Matching

Related Projects:

Sudowoodo

Sudowoodo can improve the efficiency of model engineering because its learned representation can be applied to all stages of a typical entity matching pipeline, such as blocking, labeling, and matching. Sudowoodo also supports a variety of other use cases, such as data cleaning and semantic type detection, which suggests its versatility.

Ditto

Entity Matching (EM) refers to the problem of finding pairs of entity records that refer to the same real-world entity such as customers, products, businesses, or publications. As one of the most fundamental problems in data integration, EM has a wide range of applications including data cleaning, data integration, knowledge base construction, and entity similarity search. We present Ditto, a novel entity matching system based on pre-trained Transformer-based language models (LMs) such as BERT.
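
To make the EM problem statement concrete, here is a toy sketch of pairwise matching over two small record lists. It is not Ditto's method: it swaps in plain string similarity from Python's standard difflib module in place of a pre-trained language model, and the names serialize, similarity, match, and the 0.8 threshold are hypothetical choices made only for illustration.

```python
from difflib import SequenceMatcher
from itertools import product


def serialize(record: dict) -> str:
    """Flatten a record into one lowercase string of its attribute values."""
    return " ".join(str(value).lower() for value in record.values())


def similarity(a: dict, b: dict) -> float:
    """Character-level similarity in [0, 1] between two serialized records."""
    return SequenceMatcher(None, serialize(a), serialize(b)).ratio()


def match(left: list, right: list, threshold: float = 0.8) -> list:
    """Return index pairs (i, j) whose similarity meets the threshold.

    The threshold is an illustrative assumption, not a tuned value.
    """
    return [
        (i, j)
        for (i, a), (j, b) in product(enumerate(left), enumerate(right))
        if similarity(a, b) >= threshold
    ]


# Hypothetical product records from two sources.
left = [{"name": "Apple iPhone 13 128GB", "brand": "Apple"}]
right = [
    {"name": "apple iphone 13 128gb", "brand": "apple"},
    {"name": "Google Pixel 6", "brand": "Google"},
]
candidate_pairs = match(left, right)
```

Comparing every pair, as this sketch does, scales quadratically with the number of records, which is why practical EM pipelines usually add a blocking step before matching.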

Starmie

Dataset discovery from data lakes is essential in many real-world applications that require table search over open datasets. There are many important downstream tasks for dataset discovery, such as table union search, finding joinable tables, and column clustering. Starmie is an end-to-end framework for dataset discovery, with table union search as the main use case.
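
As a rough illustration of what table union search asks for, the sketch below ranks tables in a tiny in-memory "data lake" by how well their columns overlap with those of a query table. It is not Starmie's approach: it uses simple Jaccard overlap of cell values as a stand-in scoring function, and the names column_overlap, unionability, union_search, and the sample tables are all hypothetical.

```python
def column_overlap(col_a, col_b) -> float:
    """Jaccard overlap between the distinct values of two columns."""
    a, b = set(col_a), set(col_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def unionability(query: dict, candidate: dict) -> float:
    """Average, over query columns, of the best overlap with any candidate column."""
    if not query or not candidate:
        return 0.0
    best = [
        max(column_overlap(q_col, c_col) for c_col in candidate.values())
        for q_col in query.values()
    ]
    return sum(best) / len(best)


def union_search(query: dict, lake: dict, k: int = 1) -> list:
    """Return the names of the k tables in the lake most unionable with the query."""
    ranked = sorted(lake, key=lambda name: unionability(query, lake[name]), reverse=True)
    return ranked[:k]


# Tables are modeled as {column name: list of cell values}.
query = {"city": ["Tokyo", "Paris"], "country": ["Japan", "France"]}
lake = {
    "t1": {"town": ["Paris", "Tokyo", "Rome"], "nation": ["France", "Japan", "Italy"]},
    "t2": {"id": [1, 2], "price": [9.5, 3.0]},
}
top = union_search(query, lake, k=1)
```

Value overlap ignores column names on purpose, since tables in a data lake often label the same attribute differently; it also misses columns that are semantically related but share no values, which is the kind of gap learned representations aim to close.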