Dataset discovery from data lakes is a key way to utilize open-domain data within the enterprise. Because data lake tables often suffer from low data quality and incomplete metadata, it is essential to support table union search: given a query table and a collection of data lake tables, find all tables that are unionable with the query table. To determine whether two tables are unionable, a crucial step is identifying unionable columns between them. Existing solutions evaluate column unionability using information from multiple resources, such as syntactic similarity, knowledge bases, and word embeddings. However, their performance still suffers from low-quality column representations and a failure to learn contextual information among columns.
Figure 1: The Overall Framework for Starmie
In this work, we propose an end-to-end framework named Starmie to resolve the above issues. Starmie consists of two stages: offline and online. In the offline stage, it takes all data lake tables as input and trains a column encoder to generate column embeddings. The embeddings produced by the encoder are then materialized, and index structures are built on top of them. Obtaining high-quality representations is essential here, so training the column encoder with the right objective is crucial. Starmie addresses this by leveraging contrastive representation learning to train the encoder in a self-supervised manner. To learn context-aware column representations, it further combines the learning algorithm with a multi-column table Transformer model, which ensures that a column's representation depends not only on its own contents but also on its context within the table. In the online stage, Starmie identifies unionable tables by constructing a weighted bipartite graph between the columns of the query table and those of each data lake table. The unionability score between two tables is then calculated as the maximum bipartite matching of the graph. To accelerate query processing, Starmie employs a filter-and-verification framework and utilizes the HNSW (Hierarchical Navigable Small World) index to reduce vector search time.
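To make the online scoring step concrete, here is a minimal sketch of computing a unionability score as a maximum-weight bipartite matching over column embeddings. This is our own illustration, not code from the Starmie repository: the function names are invented, columns are assumed to already be embedded as vectors (e.g., by the trained column encoder), and we use SciPy's assignment solver for the matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def cosine_sim(a, b):
    """Pairwise cosine similarity between two sets of row vectors."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def table_unionability(query_cols, candidate_cols):
    """Score two tables by max-weight bipartite matching of their columns.

    query_cols, candidate_cols: arrays of shape (num_columns, dim),
    one embedding per column. The score is the total similarity of the
    best one-to-one column alignment.
    """
    sim = cosine_sim(query_cols, candidate_cols)
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return sim[rows, cols].sum()
```

In practice, a threshold on the per-pair similarities would typically be applied so that only sufficiently similar column pairs contribute to the matching.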
Figure 2: Effectiveness of Starmie, with k = 10 for SANTOS small and k = 60 for other datasets.
We conducted experiments on two popular benchmarks for table union search, namely SANTOS and TUS. The effectiveness results are shown in Figure 2 above. Compared with previous table union search approaches such as SANTOS and D3L, as well as column encoding methods such as Sherlock and SATO, Starmie achieves significantly higher MAP and recall under all settings. To illustrate the necessity of learning contextual information among columns, we also compared Starmie with a SingleCol baseline, in which the multi-column table Transformer component is replaced with a single-column Transformer. Starmie outperforms SingleCol by a clear margin as well. Beyond effectiveness, we also evaluated Starmie's efficiency and scalability. With the help of the HNSW index, query execution time remains nearly constant as the data size grows, so Starmie performs well even on a large corpus of 50 million tables.
Figure 3: Dataset discovery for downstream ML tasks
Apart from table union search, Starmie is also useful in many downstream dataset discovery tasks. In the example of predicting the “Rating” attribute shown in Figure 3, we use Starmie to retrieve relevant tables from the data lake, join them with the training dataset, and thereby provide additional features that enrich the training of machine learning models. By joining over the name column, additional features such as voting and interest groups can be added to the original dataset. In contrast, if we use a purely syntactic metric such as Jaccard similarity for the same task, the retrieved tables are less meaningful and may introduce extra noise.
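The enrichment step above amounts to a join over the shared key column. The following is a toy sketch of that join in plain Python; the table contents and column names here are invented for illustration and are not the actual datasets from the paper.

```python
# Hypothetical training table (the table whose "rating" we want to predict)
train = [
    {"name": "Inception", "rating": 8.8},
    {"name": "Tenet", "rating": 7.3},
]

# Hypothetical table retrieved from the data lake, sharing the "name" column
retrieved = [
    {"name": "Inception", "votes": 2_400_000},
    {"name": "Tenet", "votes": 540_000},
]

# Left join over "name": enrich each training row with the extra features
# from the retrieved table, keeping the original columns intact.
extra = {row["name"]: row for row in retrieved}
enriched = [
    {**row, **{k: v for k, v in extra.get(row["name"], {}).items() if k != "name"}}
    for row in train
]
```

After the join, each training row carries the additional `votes` feature alongside the original `rating` label, which is exactly the kind of feature enrichment the example in Figure 3 describes.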
If you are interested in learning more about Starmie, please refer to our paper, “Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning.” Our former intern, Grace Fan, will present the Starmie paper at VLDB 2023 on Tuesday, August 29th, in the 10:30 am session (H1).
We also released the source code on GitHub.