Entity Matching (EM)
Entity Matching (EM) is a routine task for data scientists. Given two collections of data entries (e.g., two tables, JSON files, or text documents), the goal of EM is to find all pairs of entries that refer to the same real-world entity, such as a product, a publication, or a business. The problem is simple to state, and EM is one of the most fundamental problems in data integration, with applications that include entity search, data cleaning, and joining data from different sources.
Two phases of EM: Blocking and Matching
Here is an example of an EM problem instance. Given two tables of product records from different data sources, our goal is to find the records that refer to the same product. The first step in a typical EM pipeline is blocking. The goal of blocking is to avoid the quadratic number of pairwise comparisons by selecting only the candidate pairs that are likely to match. In our example, a simple but reasonable heuristic, such as keeping only the pairs that share at least one token, reduces the number of candidates from 9 (all possible pairs) to 3! Blocking methods are typically fast, simple heuristics with high recall, so that most real matches are retained. The second step is matching, which performs the more expensive pairwise comparison: the final match / not-match decision is made for every pair selected during the blocking phase.
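To make the shared-token heuristic concrete, here is a minimal sketch in Python. The records are toy strings written for illustration (they are not the tables from this post), and the function names `tokens` and `block` are our own. The sketch builds an inverted index over one table, so only records that actually share a token are ever paired.

```python
from collections import defaultdict

def tokens(text):
    """Lowercase a record and split it into a set of whitespace tokens."""
    return set(text.lower().split())

def block(left, right):
    """Return sorted (i, j) index pairs whose records share at least one token."""
    # Inverted index: token -> indices of right-hand records containing it.
    index = defaultdict(set)
    for j, record in enumerate(right):
        for tok in tokens(record):
            index[tok].add(j)
    candidates = set()
    for i, record in enumerate(left):
        for tok in tokens(record):
            for j in index.get(tok, ()):
                candidates.add((i, j))
    return sorted(candidates)

# Illustrative toy records (assumed, not taken from the original tables).
table_a = [
    "instant immersion spanish deluxe 2.0",
    "adventure workshop 4th-6th grade 7th edition",
    "sharp printing calculator",
]
table_b = [
    "instant immers spanish dlux 2",
    "encore inc adventure workshop 4th-6th grade 8th edition",
    "new-sharp shr-el1192bl two-color printing calculator 12-digit lcd black red",
]

candidates = block(table_a, table_b)
print(len(candidates), "of", len(table_a) * len(table_b), "pairs kept:", candidates)
```

A heuristic this loose favors recall over precision, which is what blocking is for: the matcher makes the final call on each surviving pair.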
EM is challenging
Although many techniques exist, including both rule-based and learning-based methods, matching remains a challenging task. As the example illustrates, making the right match / not-match decision requires substantial language understanding and domain-specific knowledge. For example, in the above table:
- For the first pair, the EM solution needs to know that “immersion”/“immers” and “deluxe 2.0”/“dlux 2” are synonyms in this context.
- For the second pair, although the two entries look similar, they are actually different because the software editions do not match (7th vs. 8th).
- For the last pair, although the two entries look quite different, they match because the product IDs are the same.
In fact, even the most advanced previous EM solutions failed to classify all of these pairs correctly!
Ditto: EM with Pre-trained Language Models
We present Ditto, a novel EM solution based on pre-trained language models (LMs) such as BERT. Ditto casts EM as a sequence-pair classification problem to leverage LMs, which have been shown to generate highly contextualized embeddings that capture better language understanding compared to traditional word embeddings.
Since pre-trained LMs take text sequences as input, Ditto first needs to serialize a pair of entity entries (e1, e2) into a sequence of tokens. This is done by adding the special tokens [COL] and [VAL], which mark the start of an attribute name and of an attribute value, respectively.
A candidate pair (e1, e2) is serialized as “[CLS] serialize(e1) [SEP] serialize(e2) [SEP]” with separator tokens [SEP] and a special token [CLS] to generate the contextualized sequence embedding.
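The serialization scheme described above can be sketched in a few lines of Python. This is an illustration under assumptions, not Ditto's actual code: the helper names `serialize` and `serialize_pair`, the dictionary representation of an entry, and the example attribute values are all ours.

```python
def serialize(entry):
    """Flatten one entry (attribute name -> value) into a [COL]/[VAL] string."""
    return " ".join(f"[COL] {attr} [VAL] {val}" for attr, val in entry.items())

def serialize_pair(e1, e2):
    """Build the sequence-pair input for a candidate pair of entries."""
    return f"[CLS] {serialize(e1)} [SEP] {serialize(e2)} [SEP]"

# Hypothetical entries, for illustration only.
e1 = {"title": "instant immersion spanish deluxe 2.0", "price": "49.99"}
e2 = {"title": "instant immers spanish dlux 2", "price": "36.11"}

print(serialize_pair(e1, e2))
```

In practice, a BERT-style tokenizer usually inserts [CLS] and [SEP] itself when given two text segments, so only `serialize` would be applied to each entry. The sketch writes the tokens out to show the full layout.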
Next, Ditto follows the typical steps of LM fine-tuning: we add a linear layer and a softmax layer on top of the output of the LM’s Transformer layers, initialize the model with the pre-trained weights, and train the model on a labeled EM dataset until convergence.
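To show what the added linear and softmax layers compute, here is a small pure-Python sketch. Everything numeric in it is assumed: the 4-dimensional vector stands in for the LM's sequence embedding, and the weights are hand-picked rather than learned. The training loop itself is omitted.

```python
import math

def linear(x, weights, bias):
    """One logit per class: logits[k] = dot(weights[k], x) + bias[k]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Convert logits to probabilities; subtracting the max keeps exp() stable."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical stand-in for the [CLS] output embedding of a serialized pair.
cls_embedding = [0.5, -1.0, 0.25, 2.0]

# Hypothetical head parameters; during fine-tuning these would be learned
# jointly with the LM's own weights.
weights = [[0.1, 0.4, -0.2, -0.3],    # class 0: not match
           [-0.1, -0.4, 0.2, 0.3]]    # class 1: match
bias = [0.0, 0.0]

probs = softmax(linear(cls_embedding, weights, bias))
prediction = "match" if probs[1] > probs[0] else "not match"
print(probs, prediction)
```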
Ditto further improves its matching capability through three optimizations:
- Domain Knowledge. To help the LM focus on the information that matters most for matching, Ditto allows domain knowledge to be injected so that important pieces of the input (e.g., a product ID) are highlighted.
- Summarization. One major challenge in applying pre-trained LMs is the maximum sequence length. For example, BERT accepts at most 512 sub-word tokens as input. Because simply truncating the entity entries can discard the most important information, Ditto instead applies a TF-IDF-based summarization technique to long strings so that only the most essential tokens are retained and used for EM.
- Data Augmentation. Finally, to reduce the need for a large, high-quality labeled dataset, Ditto applies data augmentation to generate additional training examples from existing ones. The augmented examples are deliberately more difficult, which pushes the model to learn invariant properties such as column-order invariance.
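As a rough illustration of the summarization idea, the sketch below keeps the highest-scoring TF-IDF tokens of a long string in their original order. It is not Ditto's implementation. The function names, the toy corpus, the whitespace tokenization, and the smoothed IDF formula (the form also used by scikit-learn) are all assumptions.

```python
import math
from collections import Counter

def idf_table(corpus):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc.split()))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1
            for tok, df in doc_freq.items()}

def summarize(text, idf, max_tokens):
    """Keep the max_tokens highest TF-IDF tokens, preserving original order."""
    toks = text.split()
    if len(toks) <= max_tokens:
        return text
    tf = Counter(toks)
    # Rank positions by descending score; ties go to the earlier position.
    # Tokens missing from the table get an assumed default weight of 1.0.
    ranked = sorted(range(len(toks)),
                    key=lambda i: (-tf[toks[i]] * idf.get(toks[i], 1.0), i))
    keep = sorted(ranked[:max_tokens])
    return " ".join(toks[i] for i in keep)

# Hypothetical toy corpus of attribute values.
corpus = [
    "the new sharp printing calculator el1192bl in black and red",
    "the instant immersion spanish deluxe in a box",
    "the adventure workshop for the 4th grade in a box",
]
idf = idf_table(corpus)
summary = summarize(corpus[0], idf, max_tokens=8)
print(summary)
```

Here the two tokens that appear in every record ("the" and "in") score lowest and are dropped first, while rarer tokens such as the product ID survive.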
If you are interested, please see our paper for more details!
With pre-trained LMs and the above optimizations, Ditto achieves significant improvements over previous state-of-the-art solutions such as DeepMatcher. On a standard set of 13 benchmark EM datasets, Ditto achieves an average F1 gain of 9.43% and an F1 improvement of up to 32%. Ditto is also more robust to noisy data (misaligned schemas, misplaced column values), small training sets, and text-heavy data, where we observe the largest performance gain.
Ditto is also label-efficient. On a product matching dataset, Ditto outperforms the previous best solutions while using at most half of the labeled examples.
Deploying Ditto in a complete EM pipeline
We deployed Ditto in a standard EM pipeline, as shown in the diagram, and applied the pipeline to match two large-scale company datasets containing 789K and 412K entries. Using Ditto as the matcher, we achieved a high F1 score of 96.5% on the holdout dataset. Apart from the standard basic blocking mechanism, Ditto provides an optional advanced blocking function obtained by fine-tuning a Siamese sentence-transformer model. This model maps entries from both tables into the same vector space, which allows candidates to be filtered via vector similarity search. The advanced blocking step yields a 3.8x overall speed-up of the entire pipeline.
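The candidate filtering step can be pictured with the following sketch, which keeps, for each entry of one table, its k nearest entries in the other table by cosine similarity. The 3-dimensional vectors are made-up placeholders for the embeddings a fine-tuned model would produce, and the brute-force search is only for clarity. A deployment at the scale described above would more likely use an approximate nearest-neighbor index.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_candidates(left_vecs, right_vecs, k):
    """For each left entry, keep the k most similar right entries."""
    pairs = []
    for i, u in enumerate(left_vecs):
        ranked = sorted(range(len(right_vecs)),
                        key=lambda j: -cosine(u, right_vecs[j]))
        pairs.extend((i, j) for j in ranked[:k])
    return pairs

# Hypothetical embeddings; a real pipeline would obtain these from the
# fine-tuned sentence-transformer model.
left_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
right_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.1, 0.95, 0.0]]

candidates = top_k_candidates(left_vecs, right_vecs, k=1)
print(candidates)
```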
Written by Yuliang Li and Megagon Labs