AI for Data Management

Advances in language models (LMs), specifically their deep language understanding capabilities, offer new opportunities to tackle classic data-management problems such as data integration, entity matching, and table discovery. The HR domain poses new problems that demand explainability. It also lets us broaden classic problems into new ones, such as generalized entity matching, which identifies binary relations between entities of different types with heterogeneous data.

Our work in the AI-for-data-management area has recently focused on exploiting language models and state-of-the-art machine learning approaches. We apply large language models in novel settings: learning table representations to discover datasets in data lakes, developing data augmentation techniques for data-management tasks, and exploring declarative explanation approaches for data-integration tasks.

Related Projects:


Sudowoodo can improve the efficiency of model engineering because its learned representation can be applied to all stages of a typical entity matching pipeline, such as blocking, labeling, and matching. Sudowoodo also supports other use cases, such as data cleaning and semantic type detection, which suggests its versatility.
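To illustrate how a single record representation can serve the blocking stage, here is a minimal sketch. It is not Sudowoodo's implementation: the `embed` function is a hypothetical stand-in that uses hashed bag-of-words vectors instead of a learned representation, and `block`, `cosine`, and `DIM` are names introduced only for this example.

```python
import math
import re
import zlib

DIM = 64  # size of the toy vector space (assumption for this sketch)


def embed(text):
    """Stand-in for a learned representation: a normalized hashed bag of words."""
    vec = [0.0] * DIM
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        # crc32 gives a hash that is stable across Python processes.
        vec[zlib.crc32(tok.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(u, v):
    """Cosine similarity of two already-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))


def block(left, right, k=1):
    """For each left record, keep the k most similar right records as candidates."""
    right_vecs = [embed(r) for r in right]
    candidates = []
    for i, record in enumerate(left):
        vec = embed(record)
        ranked = sorted(
            range(len(right)),
            key=lambda j: cosine(vec, right_vecs[j]),
            reverse=True,
        )
        candidates.extend((i, j) for j in ranked[:k])
    return candidates


left = ["sony bravia tv 55 inch", "dell xps 13 laptop"]
right = ["sony bravia 55 television", "dell xps laptop 13 inch", "apple macbook air"]
candidate_pairs = block(left, right, k=2)
```

Only the candidate pairs returned by `block` would be passed on to labeling and matching, which is what keeps the later, more expensive stages tractable.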


Entity Matching (EM) refers to the problem of finding pairs of entity records that refer to the same real-world entity, such as a customer, product, business, or publication. As one of the most fundamental problems in data integration, EM has a wide range of applications, including data cleaning, knowledge base construction, and entity similarity search. We present Ditto, a novel entity matching system based on pre-trained Transformer-based language models (LMs) such as BERT.
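The EM problem statement above can be sketched as pairwise binary classification over record pairs. The example below is not Ditto: it swaps in token-level Jaccard similarity where a system like Ditto would use a language model's prediction, and the function names, sample records, and the 0.5 threshold are assumptions made for illustration.

```python
import re


def tokens(record):
    """Lowercased word tokens drawn from all attribute values of a record."""
    text = " ".join(str(value) for value in record.values())
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_pairs(left, right, threshold=0.5):
    """Return (i, j, score) for every pair whose similarity meets the threshold."""
    matches = []
    for i, l_rec in enumerate(left):
        for j, r_rec in enumerate(right):
            score = jaccard(tokens(l_rec), tokens(r_rec))
            if score >= threshold:
                matches.append((i, j, score))
    return matches


# Two sources with different schemas describing overlapping products.
left = [
    {"title": "Apple iPhone 12 64GB", "brand": "Apple"},
    {"title": "Samsung Galaxy S21", "brand": "Samsung"},
]
right = [
    {"name": "iPhone 12 (64GB) by Apple"},
    {"name": "Google Pixel 5"},
]

matches = match_pairs(left, right)
```

Here only the first record of each source is reported as a match. A token-overlap score like this misses paraphrases and abbreviations, which is the kind of gap that language understanding is meant to close.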


Dataset discovery from data lakes is essential in many real-world applications that require table search over open datasets. There are many important downstream tasks for dataset discovery, such as table union search, finding joinable tables, and column clustering. Starmie is an end-to-end framework for dataset discovery, with table union search as the main use case.
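As a rough illustration of table union search, the sketch below ranks tables in a toy data lake by how well their columns can be paired with the columns of a query table. It is not Starmie's method: value-set Jaccard similarity stands in for learned column representations, the greedy one-to-one column matching is an assumption of this example, and all names and sample tables are hypothetical.

```python
def column_similarity(col_a, col_b):
    """Jaccard similarity between the value sets of two columns."""
    a, b = set(col_a), set(col_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def unionability(query, candidate):
    """Greedily pair columns one-to-one and average the scores over query columns."""
    pairs = sorted(
        (
            (column_similarity(q_vals, c_vals), q_name, c_name)
            for q_name, q_vals in query.items()
            for c_name, c_vals in candidate.items()
        ),
        reverse=True,
    )
    used_q, used_c, total = set(), set(), 0.0
    for score, q_name, c_name in pairs:
        if q_name in used_q or c_name in used_c:
            continue  # each column may be matched at most once
        used_q.add(q_name)
        used_c.add(c_name)
        total += score
    return total / len(query) if query else 0.0


def search(query, lake, k=1):
    """Return the names of the k tables in the lake most unionable with the query."""
    ranked = sorted(
        lake.items(),
        key=lambda item: unionability(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Tables are modeled as dicts mapping column name -> list of cell values.
query = {
    "city": ["Tokyo", "Paris", "Lima"],
    "country": ["Japan", "France", "Peru"],
}
lake = {
    "capitals": {
        "capital": ["Tokyo", "Paris", "Rome"],
        "nation": ["Japan", "France", "Italy"],
    },
    "products": {"item": ["pen", "cup"], "price": ["1", "2"]},
}

top = search(query, lake, k=1)
```

Note that the column names differ between the query and the `capitals` table, so the ranking relies entirely on column contents. Exact value overlap fails when unionable columns share no literal values, which is why representation-based approaches are attractive for this task.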