Many data preparation and information retrieval tasks such as data cleaning, schema matching, data discovery, semantic search, and data visualization can benefit from detecting the semantic types of data columns in relational tables. In our recent paper that will be presented at this year’s VLDB conference, we introduced Sato, a new learned model that improves the state of the art in tabular column type prediction by incorporating table context.
How Does Understanding Column Semantic Types Help?
Before diving into Sato’s model details, let’s consider how detecting the semantic types (real-world categories or references) of table columns, or of any other data source for that matter, can help in such an apparently diverse set of tasks, ranging from data cleaning to semantic search to data visualization. Consider data cleaning, the task of correcting errors, typos, and inconsistencies in data through transformation and validation rules. Automated data cleaning would benefit significantly from detecting the semantic types of the columns in a given dataset in order to infer the most appropriate rules to apply. For instance, knowing that the first and third columns of the table above contain country names and their capitals considerably simplifies the job of correcting errors or detecting and filling in missing values in these columns. Similarly, if we know the fourth and fifth columns of the table hold latitude and longitude values, we can automatically suggest a map to visualize the population column. And if we can query the semantics of tables, it becomes much easier to find similar tables, further augmenting data discovery and recommendation in data stores.
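As a toy illustration of how a detected type unlocks type-specific cleaning rules, suppose one column has been recognized as country names and another as capitals. The reference dictionary and function name below are invented for illustration:

```python
# Tiny made-up reference list; a real system would use a curated knowledge base.
KNOWN_CAPITALS = {"France": "Paris", "Germany": "Berlin", "Japan": "Tokyo"}

def validate_capitals(countries, capitals):
    """Flag rows where the capital does not match the known value."""
    issues = []
    for i, (country, capital) in enumerate(zip(countries, capitals)):
        expected = KNOWN_CAPITALS.get(country)
        if expected is not None and capital != expected:
            issues.append((i, capital, expected))
    return issues

print(validate_capitals(["France", "Japan"], ["Paris", "Kyoto"]))
# [(1, 'Kyoto', 'Tokyo')]
```

Without knowing the columns’ semantic types, no such rule could be selected automatically.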
Current Practices and Challenges
Recognizing these benefits, many commercial systems such as Google Data Studio, PowerBI, Tableau, Talend, and Trifacta automatically detect a limited number of semantic types. Existing systems generally use matching-based approaches for semantic type detection. Consider regular expression matching, for example: using predefined character patterns, it checks whether data values conform to an expected structure. For detecting simple data types with a predictable structure, such as dates, this can be sufficient. But approaches like regular expression matching have limited versatility: they are not robust enough to handle noisy real-world data, they support only a limited number of semantic types, and they underperform for data types without strict validation rules.
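A minimal sketch of such a regular-expression detector for ISO-style dates; the pattern and the match threshold are illustrative assumptions, not what any particular product uses:

```python
import re

# A single date pattern; real rule-based detectors use many patterns per type.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def looks_like_date_column(values, threshold=0.8):
    """Declare a date column if enough values match the pattern."""
    matches = sum(bool(DATE_RE.match(v)) for v in values)
    return matches / len(values) >= threshold

print(looks_like_date_column(["2020-01-01", "2020-02-15", "n/a"]))  # False
print(looks_like_date_column(["2020-01-01", "2020-02-15"]))         # True
```

The first call fails precisely because of the noise (“n/a”) that such matching-based approaches handle poorly.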
Recently, Sherlock, a deep learning model trained on a large-scale corpus, showed promising results in addressing the above challenges. Without assuming any column header information, Sherlock, a multi-input feed-forward neural network trained on the VizNet corpus, predicts the semantic type of a given column solely from its values. The improved prediction performance of Sherlock over alternatives demonstrates the potential of high-capacity models to improve the accuracy, robustness, and scale of semantic type detection. However, Sherlock doesn’t consider table context in its column type predictions, and its accuracy is relatively low for types with fewer examples in the training dataset.
Sato: Modeling Table Context to Improve Semantic Type Detection
We developed Sato primarily to improve Sherlock’s predictive performance for data columns whose semantic type is ambiguous or has few training examples available. This matters for extending automated type detection to a larger number of types in practice. Sato operates under the premise that a column’s table context carries additional descriptive power that can improve the accuracy of its type prediction.
The intuition behind this premise is rather straightforward. A table is not a mere collection of random columns but is instead created with a particular intent in mind. The semantic types of the columns in the table can be considered as expressions of that intent with thematic coherence. As a result, we should expect, for example, certain semantic types to co-occur more frequently in the same table than others. A peek into the type co-occurrence frequencies in Sato’s training dataset supports the argument. For example, pairs like (city, state) or (age, weight) appear in the same table more frequently than others.
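Counting such co-occurrences is straightforward; here is a small sketch over a made-up corpus of type-annotated tables:

```python
from collections import Counter
from itertools import combinations

# Toy corpus: each table is represented by the list of its column types.
tables = [
    ["city", "state", "population"],
    ["city", "state", "country"],
    ["name", "age", "weight"],
    ["name", "age", "weight"],
]

pair_counts = Counter()
for types in tables:
    # Count each unordered pair of types appearing in the same table.
    for pair in combinations(sorted(types), 2):
        pair_counts[pair] += 1

print(pair_counts[("city", "state")])  # 2
print(pair_counts[("age", "weight")])  # 2
```

Pairs with high counts are exactly the thematically coherent combinations the text describes.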
Incorporating the context of a column in the prediction of its semantic type is not just a way to improve the prediction accuracy in general. In many cases, it is the only way to correctly disambiguate the column’s type.
Motivated by these considerations, Sato combines topic modeling to capture the “global context” and structured learning to capture the “local context” in addition to single-column type prediction based on the Sherlock model. For modeling purposes, the context of a table column is formed by the types and values of all the other columns in the same table.
Latent Dirichlet Allocation (LDA) is an unsupervised generative probabilistic topic model widely used to quantify thematic structures in text. Sato uses it to learn a topic vector for each table and then appends this vector to the feature set used by Sherlock. As a result, every data column within the same table shares the exact same topic vector.
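A sketch of the idea using scikit-learn’s LDA implementation, with each table serialized into a single bag-of-words “document”. The serialization and hyperparameters here are assumptions, not Sato’s exact preprocessing:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical tables, each serialized as one "document" of its cell values.
tables = [
    "london paris berlin france germany uk capital country",
    "23 45 31 180 165 172 age height weight",
    "tokyo osaka kyoto japan city population",
]

vec = CountVectorizer()
X = vec.fit_transform(tables)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_vectors = lda.fit_transform(X)  # one topic distribution per table

# Every column in table i would share topic_vectors[i]; Sato concatenates
# this table-level vector with the per-column Sherlock features.
print(topic_vectors.shape)  # (3, 2)
```

Because the topic vector is computed per table, it injects the same “global context” into every column’s feature set.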
All of this makes it easier for the model to contextualize. For example, it can “understand” that a data column containing date values within the global context of personal information is much more likely to refer to birth dates than publication dates.
Beyond leveraging a table-wide topic vector, the labels of columns in close proximity to a given column hold discriminative information as well. Given a set of data columns and their corresponding predictions, Sato optimizes the joint likelihood of their semantic types using a Conditional Random Field (CRF).
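The flavor of this joint inference can be shown with a toy example: combine per-column (unary) type scores with pairwise co-occurrence potentials and pick the assignment that maximizes the total score. The scores below are invented, and real CRF inference is far more efficient than this exhaustive search:

```python
import itertools

TYPES = ["city", "state", "name"]

# Toy unary scores (think Sherlock-style per-column type probabilities).
unary = [
    {"city": 0.5, "state": 0.2, "name": 0.3},    # column 1
    {"city": 0.3, "state": 0.35, "name": 0.35},  # column 2
]

# Toy pairwise potentials rewarding types that co-occur in tables.
pairwise = {("city", "state"): 0.4, ("state", "city"): 0.4}

def joint_score(assignment):
    score = sum(unary[i][t] for i, t in enumerate(assignment))
    for a, b in itertools.permutations(assignment, 2):
        score += pairwise.get((a, b), 0.0)
    return score

best = max(itertools.product(TYPES, repeat=len(unary)), key=joint_score)
print(best)  # ('city', 'state')
```

Column 2 alone is a near tie between “state” and “name”; the pairwise term with the neighboring “city” column breaks that tie, which is exactly the local-context effect described above.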
Using the architecture above, Sato is trained on more than 122K table columns with 78 different semantic types gathered from a subset of the Sherlock training dataset.
Experiments confirm that Sato significantly outperforms Sherlock in semantic type prediction by incorporating global and local contextual information of data columns, with respective increases of as much as 14.4% and 5.3% in macro and micro (support-weighted) F1 scores.
Our ablation study also shows that both Sato variants, Sato-NoTopic (LDA module is excluded) and Sato-NoStruct (CRF module is excluded), still outperform the Sherlock model. The study also suggests that the contributions of the LDA and CRF modules are mostly complementary, while the global (table-wise) feature vector obtained through LDA increases the prediction accuracy more than the structured output prediction using CRF.
Crucially, Sato’s performance gains come primarily from improved predictions for data types that are semantically overlapping or underrepresented. Sato resolves the ambiguity between overlapping semantic types by leveraging context. Acting as a data-driven regularizer, incorporating context also ameliorates the “data hunger” of deep learning models. This in turn can facilitate automated detection support for a larger number of semantic types.
Col2Vec: Contextual Semantic Column Embeddings

Deep learning models are essentially representation learners. Using the layerwise weights or activations of trained deep learning models is a common way to obtain vectorial numeric representations (embeddings) of data. In this sense, Sato also provides an opportunity to extract contextual semantic representations of table columns. Let’s verify how the table intent features modeled with LDA help Sato better capture table semantics. To do so, we compare the embedding vectors from the final layer of the Sato model (before the CRF layer) with those of the baseline Sherlock model as column embeddings for a selected set of types.
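The mechanics of reading an embedding off the penultimate layer can be illustrated with a toy feed-forward network in pure Python. The weights are random stand-ins; a real column embedding would come from the trained Sato model:

```python
import random

random.seed(0)

def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, W, b):
    return [sum(wi * xi for wi, xi in zip(row, v)) + bi
            for row, bi in zip(W, b)]

# Toy 2-layer network: 4 input features -> 3 hidden units -> 2 type scores.
W1 = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
b1 = [0.0] * 3
W2 = [[random.gauss(0, 1) for _ in range(3)] for _ in range(2)]
b2 = [0.0] * 2

def forward(features):
    hidden = relu(linear(features, W1, b1))  # penultimate-layer activations
    logits = linear(hidden, W2, b2)          # per-type scores
    return hidden, logits

# The hidden activations, not the logits, serve as the column embedding.
embedding, _ = forward([0.1, 0.2, 0.3, 0.4])
print(len(embedding))  # 3
```

In Sato’s case the analogous vector is the final-layer activation just before the CRF, so it already reflects both the column’s values and the table’s topic context.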
We use t-SNE to reduce the dimensionality of the extracted vectors to two and then visualize them as scatterplots, encoding the reduced dimensions with horizontal and vertical positions. In the scatterplots of the embeddings above, we observe that the column embeddings from Sato separate the given type clusters better than those obtained from the Sherlock model. This lends additional support to the argument that topic-aware prediction helps Sato distinguish semantically similar types by capturing the context of an input table.
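This dimensionality-reduction step can be sketched with scikit-learn’s t-SNE, using synthetic stand-ins for the column embeddings:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in embeddings: two clusters of 32-d vectors
# (the real inputs would be Sato's final-layer column embeddings).
emb = np.vstack([
    rng.normal(0, 1, (20, 32)),
    rng.normal(5, 1, (20, 32)),
])

# Reduce to 2-D; perplexity must be smaller than the number of samples.
xy = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(emb)
print(xy.shape)  # (40, 2): x/y positions for a scatterplot
```

The resulting `xy` coordinates are what get plotted, with each point colored by its column’s semantic type.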
In general, the ability to learn contextual representations of table columns, rows, or cells has much broader applications.
Transfer Learning for Data Preparation and Analysis Pipelines
The availability of massive table corpora thus presents a unique opportunity to develop pretrained representation models for tabular data. To test the viability of using such representation models, we fine-tuned the BERT model for our semantic type detection task. We find that the fine-tuned BERT model achieves better prediction performance than the Sherlock model. This result is promising because a “featurization-free” method with default parameters is able to achieve a prediction accuracy comparable to that of Sherlock. In this context, an interesting follow-up would be to combine Sato’s multi-column model with BERT-like pretrained representation models. Indeed, recent work extends BERT to train a representation model for Internet tables.
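One simple way to feed a table column to a BERT-style model is to serialize its values into a single text sequence before tokenization; a minimal sketch, where the function name and truncation limits are assumptions rather than the exact setup we used:

```python
# Hypothetical serialization: join a column's values into one text sequence
# that a BERT-style classifier can consume; limits keep the input bounded.
def column_to_text(values, max_values=32, max_chars=256):
    text = " ".join(str(v) for v in values[:max_values])
    return text[:max_chars]

col = ["United Kingdom", "France", "Germany", "Japan"]
print(column_to_text(col))  # United Kingdom France Germany Japan
```

The resulting string would then be tokenized and passed to a sequence classifier whose label set is the semantic types.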
To be sure, we will need to go beyond a mere application or extension of existing language models in order to have impact in practice, as tables are neither paragraphs of sentences nor images of pixels. Nevertheless, there is a clear and exciting opportunity in front of us to automate relational data tasks with models learned from massive table corpora, yet again leveraging the unreasonable effectiveness of data for engineering solutions.