Entity Matching (EM)
Given two collections of entity entries, such as two tables of product listings, the Entity Matching (EM) problem aims to identify all pairs of entries that refer to the same real-world object, such as a product, publication, or business. Thanks to this simple and broad problem definition, EM is, unsurprisingly, a fundamental problem in data integration with a wide range of real-world applications, from data deduplication to entity similarity search.
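In its classic form, EM can be viewed as binary classification over candidate pairs drawn from the two collections. Here is a minimal sketch; the rule-based `same_entity` predicate stands in for what is in practice a trained model, and the toy rows are made up:

```python
from itertools import product

# Two toy collections of product listings with an aligned schema.
table_a = [{"id": "a1", "title": "iPhone 12 64GB", "brand": "Apple"}]
table_b = [
    {"id": "b1", "title": "Apple iPhone 12 (64 GB)", "brand": "Apple"},
    {"id": "b2", "title": "Galaxy S21 128GB", "brand": "Samsung"},
]

def same_entity(left: dict, right: dict) -> bool:
    """Toy matcher; in practice this predicate is a learned model."""
    shared = set(left["title"].lower().split()) & set(right["title"].lower().split())
    return left["brand"] == right["brand"] and len(shared) >= 2

# EM enumerates candidate pairs (blocking is used at scale) and keeps the matches.
matches = [(a["id"], b["id"]) for a, b in product(table_a, table_b) if same_entity(a, b)]
print(matches)  # [('a1', 'b1')]
```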
Numerous matching techniques have been developed for EM over the last several decades. More recently, machine learning, particularly deep learning, has achieved promising, high-quality results across multiple EM benchmark datasets. Our question is: can we apply these advanced EM techniques to a broader range of matching tasks?
Limitations from the Classic EM Problem Definition
Unfortunately, the problem definition of EM used in previous studies makes several assumptions that limit its applicability in many real-world data science scenarios. It assumes that the input datasets are structured with aligned schemas. The notion of matching is also quite narrow: it simply checks whether two entries refer to the same real-world object.
These assumptions do not hold in many real-world data science applications. Consider job targeting as an example. The goal of job targeting is to find pairs of matching job postings from companies and resumes from applicants. Job postings are typically long, unstructured text documents. Resumes can come in any form, from uploaded PDF documents to structured forms filled in on recruiting sites, and are eventually stored as JSON documents. The application also requires a notion of matching beyond simple equality, for example, checking whether an applicant is qualified for a job.
Can we have a more general problem formulation that works for the above scenarios?
Generalized Entity Matching (GEM)
To address these limitations, we formulate a new research problem called Generalized Entity Matching (GEM). Similar to EM, GEM takes two entity datasets (collections of entity entries) as input and outputs all pairs of matching entries. As shown in the above figure, GEM generalizes EM in the following aspects:
- The two datasets to be matched can contain entity entries that are structured, semi-structured, or unstructured.
- The two collections can come with arbitrary schemas, or the schema can be undefined. Discovering the schema structure and aligning the schemas is part of the matching task.
- The matching criteria can be any general binary relation instead of the identity of real-world objects. For example, if an applicant's resume is suitable for the position described in a job posting, we can regard the pair as matched (see the sketch after this list).
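To make these generalizations concrete, here is a minimal sketch on hypothetical job-targeting data: an unstructured posting, a semi-structured resume with a different schema, and a "qualifies for" relation in place of the classic equality check (all names and values are made up):

```python
# GEM input entries need not share a format or a schema.
job_posting = (
    "Senior Data Engineer, NYC. Requires 5+ years of experience "
    "building ETL pipelines with Python and SQL."
)  # unstructured text
resume = {  # semi-structured JSON whose schema differs from the posting
    "name": "Alex Doe",
    "skills": ["Python", "SQL", "Airflow"],
    "years_experience": 7,
}

def qualifies(resume: dict, posting: str) -> bool:
    """A general binary matching relation ("is qualified for"),
    standing in for the equality check used by classic EM."""
    skills_mentioned = [s for s in resume["skills"] if s.lower() in posting.lower()]
    return len(skills_mentioned) >= 2 and resume["years_experience"] >= 5

print(qualifies(resume, posting=job_posting))  # True
```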
Benchmarking the GEM Problem: Machamp
To study the new research problem of GEM, the first step is to create benchmarking datasets and tasks for it. Based on the definition of GEM, in terms of data format and schema, there are in total 6 tasks of interest, covering the combinations of matching between datasets in structured (REL), semi-structured (SEMI), and unstructured (TEXT) formats with homogeneous (HOMO) or heterogeneous (HETER) schemas:
- REL-REL (HETER): matching two structured tables of heterogeneous schema,
- SEMI-SEMI (HOMO) and SEMI-SEMI (HETER): matching two collections of semi-structured documents of homogeneous or heterogeneous schema,
- SEMI-REL: matching a collection of semi-structured documents with a structured table, and
- SEMI-TEXT and REL-TEXT: matching a semi-structured or structured dataset with an unstructured, textual dataset.
Here are some examples of data entries for job postings from the same domain stored in structured, semi-structured, or unstructured formats.
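A minimal sketch of what such entries can look like; the field names and values below are illustrative, not records from the benchmark:

```python
# Structured (REL): a row in a relational table with a fixed schema.
rel_entry = {"title": "Data Scientist", "company": "Acme Corp",
             "location": "San Francisco, CA", "seniority": "Senior"}

# Semi-structured (SEMI): a JSON document with nested, optional fields.
semi_entry = {
    "position": {"name": "Data Scientist", "level": "Senior"},
    "employer": "Acme Corp",
    "requirements": ["Python", "machine learning", "3+ years experience"],
}

# Unstructured (TEXT): a free-form description.
text_entry = ("Acme Corp is hiring a Senior Data Scientist in San Francisco. "
              "Candidates should have 3+ years of experience in Python and ML.")
```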
As the examples show, job postings in the job targeting application can come in heterogeneous data formats. Matching jobs of the same level and domain requires both language and structural understanding of the entity entries. This can be challenging even for today's most advanced NLP or EM solutions!
To benchmark each of the 6 tasks, we constructed our evaluation suite, Machamp, from existing EM benchmark datasets. Existing benchmarks such as Magellan and the WDC product matching corpus provide high-quality ground-truth labels for the classic EM setting. To fit these datasets to our purpose, we applied a series of data transformations, restructurings, and merges to convert the tasks into matching semi-structured or unstructured data. For one of the tasks (REL-TEXT), we even re-labeled the dataset so that the matching relation has non-equality semantics. You can find more details about our dataset construction process in the paper.
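As a rough illustration of the kind of conversion involved (a sketch of the general idea, not the paper's exact procedure; the record and field names are hypothetical):

```python
import json

# A structured record from a classic EM benchmark (hypothetical values).
row = {"title": "Samsung Galaxy S21", "brand": "Samsung", "price": "799.00"}

def to_semi_structured(row: dict) -> str:
    """Restructure a flat row into a nested JSON document."""
    return json.dumps({"product": {"name": row["title"], "maker": row["brand"]},
                       "offer": {"price_usd": row["price"]}})

def to_text(row: dict) -> str:
    """Flatten a row into an unstructured description."""
    return f"{row['title']} by {row['brand']}, priced at ${row['price']}."

print(to_semi_structured(row))
print(to_text(row))
```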
Experiment Results
Here are some basic statistics of the 7 datasets in the Machamp benchmark. The datasets cover training sets of different sizes and different positive ratios (%pos). We also profile the “difficulty” of each dataset by checking how well a rule-based method that measures textual and structural similarity can separate the positive and negative classes (more details in the paper).
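As a sketch of this profiling idea, here is one way to score pairs with a rule-based signal, assuming token-level Jaccard similarity as the rule (the paper's actual procedure may differ, and the labeled pairs below are made up):

```python
def jaccard(left: str, right: str) -> float:
    """Token-level Jaccard similarity, a simple rule-based signal."""
    a, b = set(left.lower().split()), set(right.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical labeled pairs: (left entry, right entry, is_match).
pairs = [("senior data engineer nyc", "data engineer senior new york", 1),
         ("junior ux designer", "senior data engineer nyc", 0)]

# If a single similarity threshold separates the classes well, the dataset
# is "easy"; heavy score overlap between the classes means "hard".
for left, right, label in pairs:
    print(label, round(jaccard(left, right), 2))
```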
We evaluated 7 baseline methods on the Machamp benchmark, including classic ML methods (SVM, Random Forest), RNN-based methods (DeepER, DeepMatcher), and Transformer-based methods (Transformer, Ditto, SentenceBERT). Among the 3 classes of methods, the Transformer-based methods achieve the overall best results (top-2 results highlighted):
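As a hedged sketch of how a cross-encoder baseline of this kind operates: each entry is serialized into a token sequence (the [COL]/[VAL] markers follow Ditto's serialization scheme) and the pair is classified with a Transformer. The model choice below is illustrative and the fine-tuning loop is omitted; see the paper for the exact setups.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # illustrative choice, not the benchmark's exact setup
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.add_special_tokens({"additional_special_tokens": ["[COL]", "[VAL]"]})
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.resize_token_embeddings(len(tokenizer))

def serialize(entry: dict) -> str:
    """Linearize a (semi-)structured entry into attribute/value markers."""
    return " ".join(f"[COL] {k} [VAL] {v}" for k, v in entry.items())

left = serialize({"title": "Senior Data Engineer", "location": "NYC"})
right = serialize({"position": "Data Engineer (Senior)", "city": "New York"})

# Sequence-pair classification; the model still needs fine-tuning on labeled pairs.
inputs = tokenizer(left, right, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # match probability is uninformative before fine-tuning
```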
Opening Up New Research Opportunities
As the results above show, there is indeed significant room for improvement: existing methods designed for classic EM fail to reach even a moderate level of matching quality (i.e., a 70% average F1 score) on some of the difficult tasks!
We have made Machamp publicly available here. The dataset is just the beginning of the study of the new research problem of GEM; there are more research opportunities to explore next, such as:
- Investigate more application scenarios of the GEM problem.
- Develop new techniques for GEM that jointly utilize textual and structural information to improve matching quality.
If you are interested in learning more about Machamp, please check out our paper!
Written by Jin Wang, Yuliang Li, and Megagon Labs