In the KnowledgeHub (KH) project, we exploit knowledge from many different sources, both structured and unstructured, to address some limitations of current large language models (LLMs). Specifically, we envision KH as a symbiotic system that couples the knowledge represented in LLMs with structured data in knowledge graphs and relational databases, as well as unstructured data in text corpora.
We developed MAGNETON, a framework for authoring interactive widgets within computational notebooks that enables transparent, reusable, and customizable data science workflows. The framework enhances widgets to support fine-grained interaction history management, reusable states, and user-defined customizations.
We proposed feature-specific and system-specific guidelines for designing human-centered data systems. The feature-specific guidelines, inspired by cognitive engineering principles for enhancing human-computer performance, recommend automating the workload that humans do not want to carry.
Sudowoodo can improve the efficiency of model engineering, since the learned representation can be applied to all stages of a typical entity matching pipeline, such as blocking, labeling, and matching. Sudowoodo also supports a variety of other use cases, such as data cleaning and semantic type detection, which suggests its versatility.
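As a rough illustration of how a learned representation can serve the blocking stage, the sketch below pairs each record with its nearest neighbors by cosine similarity. It is a minimal sketch, not Sudowoodo's implementation: the function names `cosine` and `block`, the parameter `k`, and the toy two-dimensional vectors (stand-ins for learned embeddings) are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def block(left, right, k=1):
    """Return candidate pairs: each left record with its k most similar right records."""
    pairs = []
    for lid, lvec in left.items():
        nearest = sorted(right, key=lambda rid: cosine(lvec, right[rid]), reverse=True)[:k]
        pairs.extend((lid, rid) for rid in nearest)
    return pairs


# Hypothetical embeddings; a real pipeline would obtain them from a trained encoder.
left = {"a1": [1.0, 0.0], "a2": [0.0, 1.0]}
right = {"b1": [0.9, 0.1], "b2": [0.1, 0.9]}
candidates = block(left, right, k=1)
```

Only the candidate pairs returned by `block` would then be passed to the more expensive matching stage.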
We developed Coop, a tool that enables us to generate more specific summaries by finding a better summary vector in the latent space.
We developed a novel decoding algorithm, co-decoding. For distinctive opinion summary generation, it emphasizes distinctive words by contrasting the token probability distribution of the target entity against that of the counterpart entity. For common opinion summary generation, it highlights entity-pair-specific words by aggregating token probability distributions.
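The two operations described above, contrasting and aggregating token probability distributions, can be sketched on toy next-token distributions. This is a sketch under stated assumptions rather than the co-decoding algorithm itself: the ratio-based contrast, the averaging-based aggregation, the parameters `alpha` and `eps`, and all function names are hypothetical choices made for illustration.

```python
def normalize(scores):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {tok: s / total for tok, s in scores.items()}


def contrast(target, counterpart, alpha=1.0, eps=1e-9):
    """Upweight tokens that are likely for the target but unlikely for the counterpart.

    Assumption: the contrast is a probability ratio; alpha controls its strength.
    """
    scores = {
        tok: p / (counterpart.get(tok, 0.0) + eps) ** alpha
        for tok, p in target.items()
    }
    return normalize(scores)


def aggregate(dists):
    """Combine several distributions by averaging (an assumed aggregation rule)."""
    vocab = set().union(*dists)
    scores = {tok: sum(d.get(tok, 0.0) for d in dists) / len(dists) for tok in vocab}
    return normalize(scores)


# Toy next-token distributions for two entities (made-up numbers).
target = {"pool": 0.4, "clean": 0.4, "staff": 0.2}
counterpart = {"pool": 0.05, "clean": 0.5, "staff": 0.45}

distinctive = contrast(target, counterpart)   # favors "pool"
common = aggregate([target, counterpart])     # favors "clean"
```

In a real decoder these adjusted distributions would be computed at every generation step before the next token is chosen.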
Entity Matching (EM) refers to the problem of finding pairs of entity records that refer to the same real-world entity such as customers, products, businesses, or publications. As one of the most fundamental problems in data integration, EM has a wide range of applications including data cleaning, data integration, knowledge base construction, and entity similarity search. We present Ditto, a novel entity matching system based on pre-trained Transformer-based language models (LMs) such as BERT.
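To make the problem setup concrete, the sketch below turns a pair of entity records into a single text sequence, which is one way to present a candidate pair to a sequence-pair classifier built on a language model. The input format is not specified above, so the `[COL]`/`[VAL]` markers, the `[SEP]` separator, and the function names are assumptions made for illustration.

```python
def serialize(record):
    """Flatten one record into text, marking each attribute name and value."""
    return " ".join(f"[COL] {attr} [VAL] {val}" for attr, val in record.items())


def serialize_pair(left, right, sep="[SEP]"):
    """Join two serialized records into one sequence for a match/no-match classifier."""
    return f"{serialize(left)} {sep} {serialize(right)}"


# Two product records that plausibly refer to the same real-world entity.
left = {"title": "iPhone 12 64GB", "brand": "Apple"}
right = {"title": "Apple iPhone 12 (64 GB)", "brand": "Apple Inc."}
sequence = serialize_pair(left, right)
```

The resulting string would be tokenized and classified as a match or a non-match by a fine-tuned model.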
We present MegAnno, a novel exploratory annotation framework designed for NLP researchers and practitioners. Unlike existing labeling tools that focus on data labeling only, our framework aims to support a broader, iterative ML workflow, including data exploration and model development.
Dataset discovery from data lakes is essential in many real-world applications that require table search over open datasets. There are many important downstream tasks for dataset discovery, such as table union search, finding joinable tables, and column clustering. Starmie is an end-to-end framework for dataset discovery, with table union search as the main use case.
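As a toy illustration of the table union search task, the sketch below ranks the tables in a small data lake by how well their columns line up with the columns of a query table. It deliberately swaps in plain Jaccard overlap of cell values as the column similarity, which is not how Starmie represents columns. The function names and the example tables are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between the value sets of two columns."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def unionability(query, candidate):
    """Average, over query columns, of the best-matching candidate column's similarity."""
    if not query:
        return 0.0
    best = [
        max((jaccard(qcol, ccol) for ccol in candidate.values()), default=0.0)
        for qcol in query.values()
    ]
    return sum(best) / len(best)


def union_search(query, lake, k=1):
    """Return the names of the k tables in the lake most unionable with the query."""
    ranked = sorted(lake, key=lambda name: unionability(query, lake[name]), reverse=True)
    return ranked[:k]


# Tables are dicts mapping column name to a list of cell values (made-up data).
query = {"city": ["Paris", "Rome", "Oslo"], "country": ["France", "Italy", "Norway"]}
lake = {
    "cities_eu": {"town": ["Paris", "Rome", "Madrid"], "nation": ["France", "Italy", "Spain"]},
    "products": {"sku": ["A1", "B2"], "price": ["9.99", "4.50"]},
}
top = union_search(query, lake, k=1)
```

Value overlap fails when unionable columns share few exact values, which is the kind of case that a learned column representation is meant to handle.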