Machine learning is playing an increasingly important role in data integration tasks such as entity matching, data cleaning, and table annotation. Recently, pre-trained language models (LMs) have been widely adopted in data integration tasks and have achieved state-of-the-art performance. However, existing learning-based solutions still suffer from two major challenges that make them less attractive in practice. On the one hand, the success of such learning-based approaches comes at the cost of creating large-scale, high-quality annotated datasets, which are not always readily available. On the other hand, practitioners have to build a specialized ML solution for each task, incurring extra model-engineering costs; moreover, the reusability of such task-specific models is limited.
Contrastive Learning-based Framework
Figure 1: Contrastive Learning framework for end-to-end optimization.
In this paper, we introduce Sudowoodo, an end-to-end framework for a variety of data integration applications that resolves the above issues. Sudowoodo addresses the labeling requirement by leveraging contrastive learning to learn a data representation model from a large collection of unlabeled data items. This is realized by a contrastive objective that teaches the model to distinguish pairs of similar data items from pairs that are likely to be distinct. The contrastive pre-training process is lightweight and fully unsupervised. Furthermore, the learned representations of data items can be applied to different sub-tasks in the data integration pipeline, either directly in an unsupervised manner or after fine-tuning with labels. In this way, Sudowoodo drastically reduces both the data-labeling and the model-engineering effort.
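To make the idea concrete, here is a minimal NumPy sketch of a batch-wise contrastive objective in the NT-Xent style: each item's embedding should be close to the embedding of its own augmented view and far from every other item in the batch. This is an illustration of the general technique, not Sudowoodo's exact implementation; the function name and the temperature value are our own assumptions.

```python
import numpy as np

def contrastive_loss(z1, z2, temperature=0.07):
    """Illustrative NT-Xent-style contrastive loss (not Sudowoodo's
    exact objective): row i of z1 should match row i of z2 (two views
    of the same data item) and repel all other rows in the batch."""
    # L2-normalize so dot products become cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives sit on the diagonal: view i of item i matches view i
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss pulls the two views of the same item together while pushing apart items that are likely to be distinct, which is exactly the signal an entity matcher needs, obtained without any labels.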
To further improve the performance of Sudowoodo, we also propose an optional pseudo-labeling step that extracts high-confidence training signals from the learned representations, which is useful for further boosting the fine-tuning performance. In addition, we propose three optimizations for the pre-training process: data augmentation, clustering-based negative sampling, and redundancy regularization.
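One simple way such a pseudo-labeling step can work is to score candidate pairs by the cosine similarity of their learned representations and keep only the confident extremes as training signal. The sketch below illustrates this idea under our own assumptions; the thresholds and function name are hypothetical, not taken from the paper.

```python
import numpy as np

def pseudo_label(emb_a, emb_b, pos_thresh=0.9, neg_thresh=0.3):
    """Illustrative pseudo-labeling sketch (thresholds are assumed):
    pairs whose representations are very similar become confident
    positives, very dissimilar pairs become confident negatives,
    and the uncertain middle band stays unlabeled."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sims = np.sum(a * b, axis=1)       # cosine similarity per pair
    labels = np.full(len(sims), -1)    # -1 = left unlabeled
    labels[sims >= pos_thresh] = 1     # confident match
    labels[sims <= neg_thresh] = 0     # confident non-match
    return labels
```

Only the confidently labeled pairs would then be fed into fine-tuning, so label noise from the ambiguous middle band never reaches the model.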
Figure 2: F1 scores for semi-supervised matching (EM). Sudowoodo uses 500 uniformly sampled pairs from train+valid.
We conducted experiments on the entity matching application under semi-supervised and unsupervised settings. Specifically, we evaluated on five popular datasets: Amazon-Google (AG), DBLP-ACM (DA), DBLP-Scholar (DS), Walmart-Amazon (WA), and Abt-Buy (AB). For the semi-supervised setting, we compared against the previous approaches DeepMatcher, Ditto, and Rotom. The results show that Sudowoodo achieves up to a 16% gain in F1 score over state-of-the-art methods, all while using one-third fewer labeled training instances.
Figure 3: F1 scores for unsupervised matching (EM).
For the unsupervised setting, we compared against the state-of-the-art methods ZeroER and AutoFuzzyJoin; Sudowoodo outperformed them by 7.7% and 8.9% in F1 score on average, respectively. We also conducted experiments on the blocking stage of entity matching, where Sudowoodo again achieved the best performance.
More Use Cases
Beyond entity matching, Sudowoodo can also be applied to other applications such as data cleaning and column type detection. For data cleaning, Sudowoodo provides a holistic solution covering both the error detection and correction stages, working directly on the potentially contaminated data with the help of pre-trained data representations. For column type detection, Sudowoodo pre-trains a column encoder for tables in a fully unsupervised manner, which enables it to find pairs of columns sharing the same semantic type across a large collection of tables. More detailed steps and results for these applications can be found in our technical report.
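Once a column encoder produces one embedding vector per column, finding same-type column pairs reduces to a similarity search over those vectors. The sketch below shows this matching step under our own assumptions (the function name and similarity threshold are hypothetical, and the encoder itself is taken as given):

```python
import numpy as np
from itertools import combinations

def same_type_pairs(col_embeddings, threshold=0.85):
    """Illustrative matching step: given one embedding per table column
    (produced by a pre-trained column encoder, assumed here), return
    index pairs whose cosine similarity exceeds the (assumed) threshold,
    suggesting the two columns share a semantic type."""
    z = col_embeddings / np.linalg.norm(col_embeddings, axis=1, keepdims=True)
    sims = z @ z.T                      # pairwise cosine similarities
    return [(i, j) for i, j in combinations(range(len(z)), 2)
            if sims[i, j] >= threshold]
```

For a large collection of tables, the exhaustive pairwise comparison shown here would typically be replaced by an approximate nearest-neighbor index, but the underlying signal is the same.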
If you are interested in learning more about Sudowoodo, please check out the preprint of our paper. We have also released the source code on GitHub. Sudowoodo will be presented at ICDE 2023, the 39th IEEE International Conference on Data Engineering.