Category: Research Area

KnowledgeHub

In the KnowledgeHub (KH) project, we draw on knowledge from many sources, both structured and unstructured, to address current limitations of LLMs. Specifically, we envision KH as a symbiotic system that couples the knowledge represented in LLMs with structured data in knowledge graphs and relational databases and with unstructured data in text corpora.

Weedle

For data-centric NLP, we present Weedle: Widget-Enabled Exploratory Data analysis for NLP Experts. Weedle offers global and local exploration of text data via built-in and customizable transformation operations.

Characterizing Human-Centered Information Extraction

We proposed feature- and system-specific guidelines for designing human-centered data systems. The feature-specific guidelines, inspired by cognitive engineering principles for enhancing human-computer performance, recommend automating the workload that humans do not want to carry.

Sudowoodo

Sudowoodo can improve the efficiency of model engineering because its learned representation can be applied to every stage of a typical entity matching pipeline: blocking, labeling, and matching. It also supports other use cases, such as data cleaning and semantic type detection, which suggests its versatility.
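To make the blocking stage concrete, here is a minimal sketch of how a learned representation could be used to shortlist candidate pairs. It assumes each record has already been embedded as a vector; the toy vectors, the `block` function, and the top-k cosine rule are illustrative assumptions, not Sudowoodo's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def block(left, right, k=1):
    """For each left record, keep the k most similar right records as candidates."""
    candidates = []
    for lid, lvec in left.items():
        ranked = sorted(right, key=lambda rid: cosine(lvec, right[rid]), reverse=True)
        candidates.extend((lid, rid) for rid in ranked[:k])
    return candidates

# Hypothetical 2-dimensional embeddings, standing in for learned representations.
left = {"a1": [0.9, 0.1], "a2": [0.1, 0.9]}
right = {"b1": [0.2, 0.8], "b2": [1.0, 0.0], "b3": [0.5, 0.5]}

pairs = block(left, right, k=1)
print(pairs)  # [('a1', 'b2'), ('a2', 'b1')]
```

Only the shortlisted pairs would then go on to labeling and matching, which is where the efficiency gain would come from under these assumptions.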

CoCoSum: Contrastive Summary for Two Comparable Entities

We developed a novel decoding algorithm, co-decoding. To generate distinctive opinion summaries, it emphasizes distinctive words by contrasting the token probability distribution of the target entity against that of the counterpart entity. To generate common opinion summaries, it highlights entity-pair-specific words by aggregating the token probability distributions.
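The two operations can be illustrated on a single decoding step. The sketch below assumes that contrasting means dividing the target's token probabilities by the counterpart's (with a hypothetical strength parameter `alpha`) and that aggregating means averaging the two distributions. Both choices, and the toy distributions, are assumptions made for illustration rather than the exact formulation used in CoCoSum.

```python
def normalize(scores):
    """Rescale non-negative scores so that they sum to 1."""
    total = sum(scores.values())
    return {tok: s / total for tok, s in scores.items()}

def contrast(target, counterpart, alpha=1.0, eps=1e-9):
    """Up-weight tokens that are likely for the target but unlikely for the counterpart."""
    scores = {
        tok: p / (counterpart.get(tok, 0.0) + eps) ** alpha
        for tok, p in target.items()
    }
    return normalize(scores)

def aggregate(dist_a, dist_b):
    """Average two next-token distributions to favor tokens both entities support."""
    vocab = set(dist_a) | set(dist_b)
    return normalize({tok: 0.5 * (dist_a.get(tok, 0.0) + dist_b.get(tok, 0.0)) for tok in vocab})

# Hypothetical next-token distributions for two comparable hotels.
hotel_a = {"clean": 0.4, "pool": 0.4, "noisy": 0.2}
hotel_b = {"clean": 0.4, "pool": 0.1, "noisy": 0.5}

distinctive = contrast(hotel_a, hotel_b)
common = aggregate(hotel_a, hotel_b)

print(max(distinctive, key=distinctive.get))  # pool
print(max(common, key=common.get))            # clean
```

In this toy step, "pool" wins the contrastive distribution because it is far more probable for the target than for the counterpart, while "clean" wins the aggregated distribution because both entities assign it high probability.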

Ditto

Entity Matching (EM) refers to the problem of finding pairs of entity records that refer to the same real-world entity, such as a customer, product, business, or publication. As one of the most fundamental problems in data integration, EM has a wide range of applications, including data cleaning, knowledge base construction, and entity similarity search. We present Ditto, a novel entity matching system based on pre-trained Transformer-based language models (LMs) such as BERT.
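To feed structured records to a language model, each record first has to be turned into text. The sketch below shows one way to do this, assuming a `[COL] attribute [VAL] value` serialization and a BERT-style `[CLS] ... [SEP] ... [SEP]` pair layout. The function names are hypothetical, and the format is an assumption for illustration; the text above does not specify it.

```python
def serialize(record):
    """Flatten an attribute-value record into a single string."""
    return " ".join(f"[COL] {attr} [VAL] {val}" for attr, val in record.items())

def make_pair_input(left, right):
    """Join two serialized records into one sequence-pair input.

    Assumption: special tokens are written out by hand for readability;
    real tokenizers usually insert them automatically.
    """
    return f"[CLS] {serialize(left)} [SEP] {serialize(right)} [SEP]"

# Hypothetical product records from two sources.
left = {"title": "iPhone 12", "price": "799"}
right = {"title": "Apple iPhone 12 64GB", "price": "799.00"}

print(serialize(left))
# [COL] title [VAL] iPhone 12 [COL] price [VAL] 799
print(make_pair_input(left, right))
```

A binary classifier fine-tuned on such sequences would then predict whether the two records match.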

MegAnno

We present MegAnno, a novel exploratory annotation framework designed for NLP researchers and practitioners. Unlike existing labeling tools that focus on data labeling only, our framework aims to support a broader, iterative ML workflow, including data exploration and model development.

Starmie

Dataset discovery from data lakes is essential in many real-world applications that require table search over open datasets. There are many important downstream tasks for dataset discovery, such as table union search, finding joinable tables, and column clustering. Starmie is an end-to-end framework for dataset discovery, with table union search as the main use case.
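As a rough illustration of table union search, the sketch below ranks tables in a small lake by how well their columns line up with the columns of a query table. It assumes each column is already represented as a vector and uses a simple greedy one-to-one column matching on cosine similarity. The table names, vectors, and scoring rule are hypothetical simplifications, not Starmie's actual method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def unionability(query_cols, candidate_cols):
    """Greedily match columns one-to-one and average the matched similarities."""
    scored = sorted(
        ((cosine(q, c), i, j)
         for i, q in enumerate(query_cols)
         for j, c in enumerate(candidate_cols)),
        reverse=True,
    )
    used_q, used_c, total = set(), set(), 0.0
    for sim, i, j in scored:
        if i not in used_q and j not in used_c:
            used_q.add(i)
            used_c.add(j)
            total += sim
    return total / len(query_cols)

def search(query_cols, lake, k=1):
    """Return the names of the k tables most unionable with the query."""
    ranked = sorted(lake, key=lambda name: unionability(query_cols, lake[name]), reverse=True)
    return ranked[:k]

# Hypothetical column vectors for a query table and a two-table data lake.
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lake = {
    "sales_2021": [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]],
    "weather": [[0.0, 0.0, 1.0], [0.0, 0.2, 0.98]],
}

print(search(query, lake))  # ['sales_2021']
```

Greedy matching is used here only to keep the example short; an optimal bipartite matching would be a natural replacement.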