Entity Matching (EM) refers to the problem of finding pairs of entity records that refer to the same real-world entity, such as a customer, product, business, or publication. As one of the most fundamental problems in data integration, EM has a wide range of applications, including data cleaning, data integration, knowledge base construction, and entity similarity search.
We present Ditto, a novel entity matching system based on pre-trained Transformer-based language models (LMs) such as BERT. Ditto uses a simple architecture that casts EM as a sequence classification task, which is solved by fine-tuning the LM. In addition, Ditto applies a set of optimization techniques, including domain knowledge injection, text summarization, and data augmentation, to further boost the matching model’s performance. Our experimental results on real-world EM benchmark datasets show that Ditto consistently achieves state-of-the-art (SOTA) matching quality and outperforms previous EM solutions by up to 29% in F1 score.
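To make the idea of casting EM as sequence classification concrete, the following minimal sketch shows one way a pair of entity records could be flattened into a single text sequence before being passed to a fine-tuned LM classifier. The `[COL]`/`[VAL]` markers, the function names, and the sample records are illustrative assumptions rather than details stated above, and the classifier itself is not shown.

```python
from typing import Dict

# A record is modeled here as an ordered mapping from attribute name to value.
# This representation is an assumption made for illustration.
Record = Dict[str, str]


def serialize_record(record: Record) -> str:
    """Flatten one record into text, marking each attribute name and value."""
    return " ".join(f"[COL] {attr} [VAL] {value}" for attr, value in record.items())


def serialize_pair(left: Record, right: Record, sep: str = "[SEP]") -> str:
    """Join two serialized records into one sequence for a binary classifier."""
    return f"{serialize_record(left)} {sep} {serialize_record(right)}"


# Hypothetical product records that may or may not describe the same item.
left = {"title": "instant immersion spanish deluxe 2.0", "price": "49.99"}
right = {"title": "instant immers spanish dlux 2", "price": "36.11"}

pair_text = serialize_pair(left, right)
print(pair_text)
```

The resulting string would then be tokenized and fed to an LM with a classification head that predicts match or no-match. Keeping attribute names in the sequence is one possible design choice that lets the model see which values belong to which field.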
Compared to existing EM approaches, Ditto is unique in three ways:
- By leveraging pre-trained language models, Ditto has a stronger language understanding capability.
- Ditto is more robust to noisy, small, and text-heavy entity data.
- Ditto is label-efficient: it requires fewer labeled examples to achieve the same matching quality.