As data continues to grow in volume and importance, the ability to harness the wealth of information encapsulated in relational web tables has become a pressing need. Web tables are structured data representations on web pages, typically organized in rows and columns like spreadsheets. Relational web tables provide valuable resources for various data management applications, e.g., schema matching, dataset discovery, and data cleaning. This makes table annotation, which identifies the semantic types and relations of columns, a hot topic in data management. Given a web table, table annotation aims to annotate different parts of the table to provide useful signals for further querying and analysis. It can operate at various granularities, such as a cell, a row, a column, or the whole table.
In this work, we focus on column annotation tasks, including semantic type detection, relation extraction, and column population. Semantic type detection annotates a single column with a type, while relation extraction annotates a pair of columns with a relation. Column population aims at predicting the likely remaining columns of a table given its first few columns. While many previous studies have employed pre-trained language models (PLMs) for these tasks and achieved promising results, they rely heavily on high-quality annotated datasets for fine-tuning the PLMs. However, acquiring large-scale annotated datasets is challenging for several reasons. First, table corpora are huge and their data quality is relatively low, which makes human annotation difficult. Second, existing benchmarking datasets follow a long-tail distribution: even if the overall cardinality of a dataset is large, many classes still have too few labeled instances. As a result, it is difficult for models to capture sufficient signals for the minority classes, even under supervised settings.
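To make the three tasks concrete, here is a minimal sketch on a toy two-column table. All labels and column names below are hypothetical and chosen purely for illustration; they are not taken from any benchmark.

```python
# A toy relational web table: each column is a list of cell values.
table = {
    "col_0": ["Tokyo", "Paris", "Cairo"],
    "col_1": ["Japan", "France", "Egypt"],
}

# Semantic type detection: annotate a single column with a type.
semantic_types = {"col_0": "city", "col_1": "country"}

# Relation extraction: annotate a pair of columns with a relation.
relations = {("col_0", "col_1"): "isCapitalOf"}

# Column population: given the first column(s), predict which columns
# are likely to complete the table.
predicted_columns = ["country", "population", "mayor"]
```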
Figure 1: The Overall Framework
We propose the Watchog framework, shown in Figure 1, to address these issues. Watchog employs contrastive learning, an important technique in the self-supervised learning paradigm, to automatically learn table representations from a vast corpus of unlabeled tables in a fully unsupervised manner. Contrastive learning trains a column encoder that maps related columns close to each other in the embedding space. This is realized by treating different views of the same column, created by data augmentation (DA) operators, as positive instances and the rest as negatives. Following DA methods for NLP tasks, we developed a series of DA operators for table-related tasks at three levels: token, cell, and column. As a result, the obtained positive instances preserve column-level information. We also leverage table metadata, treating column pairs that share the same metadata as positive instances, which further improves the contrastive learning process.
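The three-level DA operators can be sketched as simple functions over a column represented as a list of cell strings. This is a minimal illustration of the idea, not the actual operators used in Watchog; the function names and parameters are assumptions for this sketch.

```python
import random

def token_level_aug(column, p=0.2, rng=random):
    """Token level: randomly drop a fraction of tokens inside each cell."""
    out = []
    for cell in column:
        tokens = cell.split()
        kept = [t for t in tokens if rng.random() > p] or tokens[:1]
        out.append(" ".join(kept))
    return out

def cell_level_aug(column, rng=random):
    """Cell level: shuffle cell order; order is irrelevant to a column's type."""
    out = list(column)
    rng.shuffle(out)
    return out

def column_level_aug(column, frac=0.7, rng=random):
    """Column level: sample a subset of cells, simulating a partial view."""
    k = max(1, int(len(column) * frac))
    return rng.sample(column, k)

# Two augmented views of the same column form a positive pair for the
# contrastive objective; views of different columns act as negatives.
col = ["New York City", "San Francisco", "Los Angeles", "Chicago"]
view_a = cell_level_aug(col, rng=random.Random(0))
view_b = column_level_aug(col, rng=random.Random(1))
```

The encoder is then trained so that `view_a` and `view_b` map to nearby embeddings while views drawn from other columns are pushed apart.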
The outcome of the contrastive learning process is a column encoder that captures column-aware information. We then fine-tune it on different downstream tasks. Unlike previous approaches, our method requires as little as 10% of the labeled training data used in previous studies. To compensate for the smaller labeled set, we employ pseudo-labeling techniques that take advantage of the unlabeled corpus to provide additional training instances with “soft” labels, which are probability distributions rather than deterministic labels. The idea is to use the target model’s predictions in the current iteration as training signals, combining labeled and unlabeled instances through interpolation. In this way, we obtain richer training signals from huge amounts of unlabeled tables without requiring human annotation.
Getting Results
Figure 2: The Main Results
We conducted experiments over three popular benchmarking datasets: semantic type detection (ST) on WikiTable, relation extraction (RE) on WikiTable, and semantic type detection on Viznet. To evaluate performance in the semi-supervised setting, we uniformly sampled the original datasets and used up to 10% of the labeled training instances for fine-tuning. The results, shown in Figure 2, demonstrate that Watchog outperforms previous approaches by a significant margin when labeled training instances are scarce. This shows the effectiveness of our contrastive learning-based solution, which brings in extra training signals without relying on human annotation. Watchog also shows significant improvement over our previous work Starmie, which likewise used self-supervised learning techniques to learn table representations. This contrast illustrates that the proposed pseudo-labeling techniques improve overall fine-tuning performance when the number of labeled training instances is low.
To learn more about Watchog, please refer to our paper, “Watchog: A Light-weight Contrastive Learning-based Framework for Column Annotation.” Our former colleague, Zhengjie Miao, will present the paper at SIGMOD 2024 on Thursday, June 13th, at 5:00 pm in Session 36: Data Integration + Provenance.
We also released the source code on GitHub.
Written by Jin Wang and Megagon Labs