Real-world applications frequently need to solve a general form of the Entity Matching (EM) problem in order to find associated entities. Such scenarios, which we call Generalized Entity Matching (GEM), include matching jobs to candidates in job targeting, matching students with courses in online education, matching products with user reviews on e-commerce websites, and beyond. These tasks impose new requirements, such as matching data entries that come in diverse formats or supporting a flexible, semantics-rich definition of what counts as a match. Scenarios like these fall outside the current EM task formulation and the approaches developed in the field of data management.
Generalized Entity Matching (GEM) is challenging
Consider the example in the following figure, which matches a job with a resume in a job targeting application. This pair looks like a good match because both share the same job title (nurse practitioner) and a considerable amount of overlapping text, but it is actually not a good match. The reason is that the job requires a nurse with experience in a Level III neonatal intensive care unit, which the resume’s owner doesn’t have. To make this non-match decision correctly, the system needs strong language understanding capabilities to detect, for example, whether “3-5 years of NNP experience” is compatible with “3 years of experience”. The system also needs to understand the structure of each document. For example, the education and certification section should be matched to education and skills on the right (blue boxes). This task is very challenging even for state-of-the-art NLP and EM solutions.
Our solution: fine-tuning pre-trained Language Models (LMs)
Pre-trained language models (LMs) are deep neural networks pre-trained on large text corpora. For example, in 2019, Google published BERT, whose large variant contains over 300 million parameters and was pre-trained on a 3.3-billion-word corpus of English books and Wikipedia. After pre-training, the model can be fine-tuned for downstream tasks on task-specific datasets.
To leverage LMs’ powerful language understanding capabilities, we simply need to convert the input (resume, job description, etc.) into text sequences. Supposing that the inputs are in a semi-structured format such as JSON, we can serialize each entry into a string by inserting special tokens such as [COL] and [VAL] to indicate the start of an attribute name (such as title or benefit) and the start of its value, respectively. After that, since the task is a binary classification of whether the pair matches or not, we concatenate the two serialized entities into a single sequence.
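The serialization step can be sketched in a few lines of Python. This is a minimal illustration rather than Machop's actual code: the function names `serialize` and `serialize_pair`, the sample entries, and the use of `[SEP]` as the separator between the two entities are all assumptions made for the example.

```python
def serialize(entry: dict) -> str:
    """Flatten one semi-structured entry into a single string.

    Each attribute becomes "[COL] <name> [VAL] <value>".
    """
    parts = []
    for name, value in entry.items():
        # Assumption: list-valued attributes are joined with spaces.
        if isinstance(value, (list, tuple)):
            value = " ".join(str(v) for v in value)
        parts.append(f"[COL] {name} [VAL] {value}")
    return " ".join(parts)


def serialize_pair(left: dict, right: dict, sep: str = "[SEP]") -> str:
    """Concatenate two serialized entities into one input sequence.

    The "[SEP]" separator is an assumption borrowed from BERT-style inputs.
    """
    return f"{serialize(left)} {sep} {serialize(right)}"


# Hypothetical entries, for illustration only.
job = {"title": "Nurse Practitioner", "benefit": "paid time off"}
resume = {"title": "Nurse Practitioner", "experience": "3 years of experience"}
print(serialize_pair(job, resume))
```

In practice the LM's tokenizer usually adds its own `[CLS]` and separator tokens, so a real pipeline would more likely pass the two serialized strings to the tokenizer as a sentence pair than join them by hand.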
Next, we simply need to add task-specific layers (i.e., a linear layer followed by softmax) on top of the LM. The special [CLS] token inserted at the head of the sequence now captures the semantics of both entities, providing useful signals for classification.
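To make the role of the task-specific layers concrete, here is a small pure-Python sketch of a linear layer followed by softmax applied to a [CLS] vector. The 4-dimensional vector, the weights, and the label order (match, non-match) are made-up values for illustration; in a real model the [CLS] vector comes from the LM and the weights are learned during fine-tuning.

```python
import math


def linear(x, weights, bias):
    """Compute one output logit per row of `weights`: w . x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn logits into probabilities (max subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical [CLS] representation and classifier parameters.
cls_vector = [0.2, -0.1, 0.4, 0.3]
weights = [[0.5, -0.2, 0.1, 0.0],    # row for the "match" class (assumed order)
           [-0.5, 0.2, -0.1, 0.0]]   # row for the "non-match" class
bias = [0.0, 0.0]

probs = softmax(linear(cls_vector, weights, bias))
print(probs)  # two probabilities that sum to 1
```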
The default fine-tuning setting provides a good baseline for GEM, but there are two key challenges that we need to address to further improve its performance.
Machop: attribute-aware summarization and pooling
The first challenge is that language models typically have a maximum sequence length that limits the number of input tokens. Job targeting tasks can involve sequences of more than 1000 tokens, and simple truncation can fail because key information, such as qualifications and required skills, is often buried in the middle of long, less relevant text such as general company descriptions.
Machop addresses this challenge with attribute-aware summarization. We train a sentence classifier that assigns a topic, such as Qualification, Benefit, or Duty, to each sentence in the job description. We then apply summarization operators such as truncation to shorten the text under each topic. Similarly for resumes, if the document already has some structure, such as education and experience sections, we summarize each section independently. By doing so, we are guaranteed to retain some information for each topic and section.
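The idea can be sketched as follows. This is a toy illustration under several assumptions, not Machop's implementation: `classify_sentence` is a hypothetical keyword-based stand-in for the trained sentence classifier, whitespace-separated words are used as a rough proxy for LM tokens, and truncation is the only summarization operator shown.

```python
# Hypothetical keyword lists standing in for a trained sentence classifier.
KEYWORDS = {
    "Qualification": ("experience", "require", "certif"),
    "Benefit": ("offer", "insurance", "paid"),
}


def classify_sentence(sentence: str) -> str:
    """Assign a topic to a sentence (toy stand-in for the trained classifier)."""
    lowered = sentence.lower()
    for topic, keywords in KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return topic
    return "Other"


def summarize(sentences, budget_per_topic, classify=classify_sentence):
    """Group sentences by topic, then truncate each topic to a word budget."""
    groups = {}
    for s in sentences:
        groups.setdefault(classify(s), []).append(s)
    summary = {}
    for topic, sents in groups.items():
        words = " ".join(sents).split()
        summary[topic] = " ".join(words[:budget_per_topic])
    return summary


sentences = [
    "We are a fast growing healthcare company with offices nationwide.",
    "Requires 3-5 years of NNP experience.",
    "We offer paid time off and health insurance.",
]
summary = summarize(sentences, budget_per_topic=4)
print(summary)
```

Because the budget is applied per topic rather than to the document as a whole, a long company description cannot crowd out the qualification sentence, which is the failure mode of plain head truncation.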
The second challenge is that we want to make better use of the internal structure of the documents. The default fine-tuning setting uses only a single special token to represent both input entities. In Machop, we propose a novel structure-aware pooling layer that applies sum or max pooling operators over the element-wise attribute similarities. By doing so, we can explicitly compare sections and topics within the documents, such as qualification vs. experience, and aggregate their similarities to guide the language model toward better representations.
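One possible reading of this pooling step is sketched below. The details are assumptions made for illustration and are not taken from Machop's code: each attribute is represented by a small made-up vector, the element-wise similarity of two attributes is taken to be their element-wise product, and the pooling operator is applied per dimension across all attribute pairs.

```python
def elementwise_similarity(u, v):
    """Element-wise product of two attribute vectors (assumed similarity)."""
    return [a * b for a, b in zip(u, v)]


def structure_aware_pool(left, right, op="max"):
    """Pool element-wise similarities over all attribute pairs.

    `left` and `right` map attribute names to vectors of equal length.
    Returns one pooled vector with the same length as the inputs.
    """
    reducers = {"max": max, "sum": sum}
    if op not in reducers:
        raise ValueError(f"unsupported pooling operator: {op}")
    sims = [elementwise_similarity(lv, rv)
            for lv in left.values()
            for rv in right.values()]
    # zip(*sims) iterates over dimensions; reduce each across all pairs.
    return [reducers[op](column) for column in zip(*sims)]


# Hypothetical 2-d attribute representations.
job = {"qualification": [1.0, 0.0], "benefit": [0.0, 1.0]}
resume = {"experience": [0.5, 0.5], "education": [1.0, 0.0]}

print(structure_aware_pool(job, resume, op="max"))  # [1.0, 0.5]
print(structure_aware_pool(job, resume, op="sum"))  # [1.5, 0.5]
```

In a fine-tuned model, the attribute vectors would be produced by the LM and the pooled vector would feed into the classification layers, so the comparison between sections is learned end to end.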
Experiment: Job-Job matching and Job-Resume matching
We evaluated Machop on two job targeting tasks: job-to-job matching and job-resume matching. For job-to-job matching, we created our training and evaluation datasets by sampling job descriptions from a pool of 600,000 jobs on indeed.com. We compared Machop with classic ML methods such as logistic regression and SVM, as well as deep learning methods such as RNNs. The results show that LMs like BERT outperform classic ML methods like SVM by 7%, and that Machop significantly outperforms BERT, the previous best model, by over 10% F1 score.
For the job-resume matching task, we used the same set of job descriptions plus 700 synthetic resumes created by experts. We see a similar trend: BERT outperforms SVM by over 11%, and the pooling and summarization techniques in Machop further improve the performance of language models by a large margin of 17% F1 score.
Written by: Yuliang Li and Megagon Labs