Day: December 5, 2022

Ditto

Entity Matching (EM) refers to the problem of finding pairs of entity records that refer to the same real-world entity such as customers, products, businesses, or publications. As one of the most fundamental problems in data integration, EM has a wide range of applications including data cleaning, data integration, knowledge base construction, and entity similarity search. We present Ditto, a novel entity matching system based on pre-trained Transformer-based language models (LMs) such as BERT.

MegAnno

We present MegAnno, a novel exploratory annotation framework designed for NLP researchers and practitioners. Unlike existing labeling tools that focus on data labeling only, our framework aims to support a broader, iterative ML workflow, including data exploration and model development.

Starmie

Dataset discovery from data lakes is essential in many real-world applications that require table search over open datasets. There are many important downstream tasks for dataset discovery, such as table union search, finding joinable tables, and column clustering. Starmie is an end-to-end framework for dataset discovery, with table union search as the main use case.

GiNZA

GiNZA is an open-source Japanese NLP library whose features include a one-step installer, fast and accurate analysis, and sentence structure analysis designed for international compatibility.

ZETT: Zero-shot Triplet Extraction by Template Infilling

In this project, we hypothesize that relation triplet extraction can be reformulated to align with the pre-training objective of large pre-trained language models. This can enable the models to leverage knowledge acquired during pre-training and improve generalization to unseen relations.

ESE: Low Resource Entity Set Expansion

Entity set expansion (ESE) is the task of expanding a given seed set of entities (e.g., ‘mini bar’, ‘tv unit’) for a concept (e.g., room features) using a textual corpus. The task is typically studied under low-resource settings since obtaining large-scale training data is expensive.

Summarizing Community-based Question-Answer Pairs

Megagon Labs researchers proposed a new task, CQA Summarization, which focuses on summarizing QA pairs in Community-based Question Answering (CQA). They also developed a multi-stage annotation framework and created a benchmark, CoQASum, for the task.
