Learning to Detect Semantic Types from Large Table Corpora

Many data preparation and information retrieval tasks such as data cleaning, schema matching, data discovery, semantic search, and data visualization can benefit from detecting the semantic types of data columns in relational tables. In our recent paper that will be presented at this year’s VLDB conference, we introduced Sato, a new learned model that improves the state of the art in tabular column type prediction by incorporating table context.

Data systems automatically detect atomic data types such as string, integer, and decimal in order to operate reliably and improve the user experience. Semantic types such as country, population, and latitude provide finer-grained, operational information that can be used to further improve and automate many data tasks.

How Does Understanding Column Semantic Types Help?

Before diving into Sato’s model details, let’s first consider how the ability to detect the semantic types of table columns (real-world categories or references) can help in such an apparently diverse set of tasks, ranging from data cleaning to semantic search to data visualization. Consider data cleaning, the task of correcting errors, typos, inconsistencies, and the like through transformation and validation rules. Automating data cleaning would significantly benefit from detecting the semantic types of the columns in a given dataset, which can be used to infer the most appropriate rules to apply. For instance, knowing that the first and the third columns of the table above contain country names and their capitals considerably simplifies the job of correcting errors or detecting and filling in missing values in these columns. Similarly, if we know that the fourth and the fifth columns of the table contain latitude and longitude values, we can automatically suggest a map to visualize the population column. And if we can query the semantics of tables, it becomes much easier to find similar tables, further augmenting and generating recommendations for data discovery in data stores.

Current Practices and Challenges

Recognizing the benefits, many commercial systems such as Google Data Studio, PowerBI, Tableau, Talend, and Trifacta automatically detect a limited number of semantic types. Existing systems generally use matching-based approaches for semantic type detection. For example, consider regular expression matching. Using predefined character sequences, this method captures data value patterns and checks if a value matches them. For the detection of simple data types with a predictable structure, such as dates, this can be sufficient. But approaches like regular expression matching are limited in versatility: They’re not robust enough to handle noisy real-world data; they only support a limited number of semantic types; and they underperform for data types without strict validation. 
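A minimal sketch illustrates both the appeal and the limits of this approach; the pattern and threshold below are purely illustrative, not the rule set of any particular product:

```python
import re

# A typical pattern-based detector: matches ISO-style dates (YYYY-MM-DD).
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def looks_like_date_column(values, threshold=0.8):
    """Label a column 'date' if most of its values match the pattern."""
    matches = sum(1 for v in values if DATE_PATTERN.match(v))
    return matches / len(values) >= threshold

clean = ["2020-05-01", "2019-12-31", "2021-07-04"]
noisy = ["May 1, 2020", "31/12/2019", "2021-07-04"]  # same semantics, messier surface forms

print(looks_like_date_column(clean))  # True: strict, predictable structure
print(looks_like_date_column(noisy))  # False: the pattern misses valid dates
```

The second column is still a date column, but the rule fails on it, which is exactly the robustness problem described above.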

Recently, Sherlock, a deep learning model trained on a large-scale table corpus, showed promising results in addressing the above challenges. Without assuming any column header information, Sherlock, a multi-input feed-forward neural network trained on the VizNet corpus, predicts the semantic type of a given column solely based on its values. The improved prediction performance of Sherlock over alternatives demonstrates the potential of high-capacity models to improve the accuracy, robustness, and scale of semantic type detection. However, Sherlock doesn’t consider table context in its predictions, and its accuracy is relatively low for types that have fewer examples in the training dataset.

Sato: Modeling Table Context to Improve Semantic Type Detection

We developed Sato to primarily improve Sherlock’s predictive performance for data columns in which the semantic type is ambiguous or has few examples available. This is important for extending automated type detection to a larger number of types in practice. Sato operates under the premise that a column’s table context contains additional descriptive power for improving the accuracy of its type prediction. 

The intuition behind this premise is rather straightforward. A table is not a mere collection of random columns but is instead created with a particular intent in mind. The semantic types of the columns in the table can be considered as expressions of that intent with thematic coherence. As a result, we should expect, for example, certain semantic types to co-occur more frequently in the same table than others. A peek into the type co-occurrence frequencies in Sato’s training dataset supports the argument. For example, pairs like (city, state) or (age, weight) appear in the same table more frequently than others.

The heatmap matrix above shows the co-occurrence frequencies (in log scale) for a selected set of column types in the VizNet corpus. Certain pairs like (city, state) or (age, weight) appear in the same table more frequently than others; modeling these co-occurrence structures improves semantic type prediction.
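The statistics behind such a heatmap are straightforward to compute. Here is a sketch over a toy corpus; the tables and their type labels are made up for illustration:

```python
from collections import Counter
from itertools import combinations

# Toy corpus: each table is represented by the semantic types of its columns.
tables = [
    ["city", "state", "population"],
    ["city", "state", "latitude", "longitude"],
    ["name", "age", "weight"],
    ["age", "weight", "height"],
]

pair_counts = Counter()
for column_types in tables:
    # Count each unordered pair of types that share a table.
    for a, b in combinations(sorted(set(column_types)), 2):
        pair_counts[(a, b)] += 1

print(pair_counts[("city", "state")])   # 2: co-occur in two tables
print(pair_counts[("age", "weight")])   # 2
print(pair_counts[("city", "weight")])  # 0: never share a table
```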

Incorporating the context of a column in the prediction of its semantic type is not just a way to improve the prediction accuracy in general. In many cases, it is the only way to correctly disambiguate the column’s type. 

Two tables (Table A and Table B) from the VizNet corpus. The last column of Table A and the first column of Table B have identical values: ‘Florence,’ ‘Warsaw,’ ‘London,’ and ‘Braunschweig.’ A prediction model based solely on column values (i.e., single-column prediction) cannot resolve the ambiguity to infer the correct semantic types, birthplace and city. Sato incorporates signals from table context and performs a multi-column type prediction to help resolve ambiguities like these and improve the accuracy of semantic type predictions.

Motivated by these considerations, Sato combines topic modeling to capture the “global context” and structured learning to capture the “local context” in addition to single-column type prediction based on the Sherlock model. For modeling purposes, the context of a table column is formed by the types and values of all the other columns in the same table. 

Overview of how Sato incorporates the global context (as topic vector) and local context (inferring the ‘‘type likelihood’’ over a sequence of columns).

Latent Dirichlet Allocation (LDA) is an unsupervised generative probabilistic topic model widely used to quantify thematic structures in text. Sato uses it to learn a topic vector for each table. Sato then adds this vector to the same feature set as used for Sherlock. As a result, each data column within the same table has the exact same topic vector.
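As a sketch of this step, a per-table topic vector can be computed with scikit-learn’s LDA implementation by serializing each table into a “document” of its cell values. The toy tables, vocabulary, and hyperparameters here are illustrative, not Sato’s actual training setup:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each "document" is one table, serialized as the bag of its cell values.
tables = [
    "france paris germany berlin italy rome",    # geography-themed table
    "alice 34 70kg bob 29 82kg carol 41 65kg",   # personal-info table
    "spain madrid poland warsaw austria vienna", # geography-themed table
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tables)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_vectors = lda.fit_transform(X)  # one topic distribution per table

# Every column in a table receives its table's topic vector, concatenated
# with the per-column (Sherlock-style) features.
print(topic_vectors.shape)  # (3, 2): three tables, two topics
```

Because the vector is computed per table, all columns of one table share it, which is how the “global context” enters the per-column feature set.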

All of this makes it easier for the model to contextualize. For example, it can “understand” that a data column containing date values within the global context of personal information is much more likely to refer to birth dates than publication dates.

Besides leveraging a central topic vector, the labels of columns in close proximity to a chosen column hold discriminative information as well. When given a subset of data columns and corresponding predictions, Sato optimizes the likelihood of semantic types over this subset using Conditional Random Fields (CRF).
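The idea can be sketched as combining per-column (unary) type scores with pairwise co-occurrence scores and picking the jointly most likely assignment. The numbers below are hypothetical, and real CRF inference uses dynamic programming rather than brute force, but the objective is the same in spirit:

```python
from itertools import product

TYPES = ["city", "birthplace", "state"]

# Unary scores: per-column type likelihoods from a single-column model
# (hypothetical values, e.g. softmax outputs of a Sherlock-style network).
unary = [
    {"city": 0.45, "birthplace": 0.45, "state": 0.10},  # ambiguous column
    {"city": 0.05, "birthplace": 0.05, "state": 0.90},  # clearly a state column
]

# Pairwise scores: how strongly two types tend to co-occur in one table
# (hypothetical; learned by the CRF in the real model).
pairwise = {("city", "state"): 0.8, ("birthplace", "state"): 0.1}

def map_assignment(unary, pairwise):
    """Brute-force MAP inference: feasible because tables have few columns."""
    best, best_score = None, float("-inf")
    for labels in product(TYPES, repeat=len(unary)):
        score = sum(u[l] for u, l in zip(unary, labels))
        for a, b in zip(labels, labels[1:]):
            score += pairwise.get((a, b), pairwise.get((b, a), 0.0))
        if score > best_score:
            best, best_score = labels, score
    return best

print(map_assignment(unary, pairwise))  # ('city', 'state')
```

The ambiguous first column is pulled toward city rather than birthplace because city co-occurs strongly with the confidently predicted state column.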

Using the architecture above, Sato is trained on more than 122K table columns with 78 different semantic types gathered from a subset of the Sherlock training dataset.

Frequencies of the 78 semantic types in the Sato training data. Most types have fewer than 5K columns, forming a “long tail” in the frequency histogram.

Experiments confirm that Sato significantly outperforms Sherlock in semantic type prediction by incorporating the global and local contextual information of data columns, with increases of as much as 14.4% and 5.3% in macro and support-weighted (micro) F1 scores, respectively.

Performance comparison between Sherlock and Sato variants: Statistical testing indicates Sato, Sato-NoTopic, and Sato-NoStruct perform significantly better than Sherlock. Numbers are the average macro and micro F1 scores over a 5-fold cross validation, and the error bars indicate the 95% confidence intervals (too small to be visible for the micro F1 values shown in the second row of charts). The values in parentheses indicate the percentage improvements over Sherlock.
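The gap between the macro and support-weighted numbers is itself informative: macro F1 weighs every type equally, so it is far more sensitive to the rare “long tail” types where Sato improves most. A toy example makes the difference concrete (sklearn’s "weighted" average corresponds to support-weighted F1; the labels are invented):

```python
from sklearn.metrics import f1_score

# Toy predictions over 3 types with very different support sizes.
y_true = ["city"] * 8 + ["state"] * 8 + ["isbn"] * 2
y_pred = ["city"] * 8 + ["state"] * 8 + ["city"] * 2  # rare type missed entirely

macro = f1_score(y_true, y_pred, average="macro")        # every type weighs equally
weighted = f1_score(y_true, y_pred, average="weighted")  # support-weighted

print(round(macro, 3), round(weighted, 3))
```

Missing the two-example isbn type drags macro F1 down sharply while barely moving the support-weighted score, which is why improving rare types shows up mainly in the macro metric.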

Our ablation study also shows that both Sato variants, Sato-NoTopic (LDA module is excluded) and Sato-NoStruct (CRF module is excluded), still outperform the Sherlock model. The study also suggests that the contributions of the LDA and CRF modules are mostly complementary, while the global (table-wise) feature vector obtained through LDA increases the prediction accuracy more than the structured output prediction using CRF.

Crucially, Sato’s performance gains are primarily due to the improved prediction over data types that are semantically overlapping or underrepresented. Sato solves the problem of ambiguity with overlapping semantic types by leveraging context. Acting as a data-driven regularizer, incorporating context also ameliorates the “data hunger” for training deep learning models. This in turn can facilitate automated detection support for a larger number of semantic types.

F1 scores of Sato and Sherlock for each semantic type, grouped by the relative change when Sato is used for prediction in comparison to Sherlock. Sato outperforms Sherlock for 70 semantic types (left, year to sales). Sato and Sherlock performances tie for 3 semantic types, continent, grades, and currency (middle). Sherlock’s prediction accuracy is better than that of Sato only for 5 types (right, education to address).

Deep learning models are essentially representation learners. Using layerwise weights or activations of trained deep learning models is common for obtaining vectorial numeric representations (embeddings) of data. 

Col2Vec: Contextual Semantic Column Embeddings

In this sense, Sato also provides an opportunity to create (extract) contextual semantic representations of table columns. Let’s examine how the table intent features modeled with LDA help Sato better capture table semantics. For this, we can compare the embedding vectors from the final layer of the Sato model (before the CRF layer) and those of the baseline Sherlock model as column embeddings for a selected set of types.

Scatterplots visualizing the column embeddings by Sato and Sherlock for 4 semantic types. Each point in the scatterplots represents a single column and its color encodes the semantic type of the column. The gray overlays illustrate the areas of overlapping neighborhoods (the regions of potential confusion by the models) among the semantic type clusters. Embedding vectors obtained from Sato better separate the given type clusters than those obtained from the Sherlock model.

We use t-SNE to reduce the dimensionality of the extracted vectors to two and then visualize them as scatterplots, encoding the reduced dimensions with horizontal and vertical positions. In the scatterplots above, we observe that the column embeddings based on Sato separate the given type clusters better than those obtained from the Sherlock model. This lends additional support to the argument that topic-aware prediction helps Sato distinguish semantically similar types by capturing the context of an input table.
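This dimensionality-reduction step can be sketched as follows; the embeddings here are synthetic stand-ins drawn around two centroids rather than actual Sato activations:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for column embeddings: 40 vectors around two type centroids
# (in Sato these would come from the final layer before the CRF).
emb = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 64)),
    rng.normal(loc=3.0, scale=0.5, size=(20, 64)),
])

# Reduce the 64-d embeddings to 2-d positions for a scatterplot.
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(emb)
print(coords.shape)  # (40, 2): one (x, y) position per column
```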

In general, the ability to learn contextual representations of table columns, rows, or cells has much broader applications.

Transfer Learning for Data Preparation and Analysis Pipelines

Pretrained language models such as ELMo, ULMFiT, BERT, T5, XLNet, and GPT, all successful examples of transfer learning, have recently driven great progress on NLP tasks.

The availability of massive table corpora thus presents a similar opportunity to develop pretrained representation models for tables. To test the viability of this approach, we fine-tuned the BERT model for our semantic type detection task and found that it achieves better prediction performance than the Sherlock model. This result is promising because a “featurization-free” method with default parameters is able to match, and even exceed, the accuracy of a carefully feature-engineered model. An interesting follow-up would be to combine Sato’s multi-column model with BERT-like pretrained representation models. Indeed, recent work extends BERT to train a representation model for Web tables.

To be sure, we will need to go beyond a mere application or extension of the existing language models in order to have impact in practice, as tables are neither paragraphs of sentences nor are they images of pixels. Nevertheless, there is a clear and exciting opportunity in front of us to automate relational data tasks with models learned from massive table corpora, yet again leveraging the unreasonable effectiveness of data for engineering solutions. 

Are you interested in learning more about Sato? Check out our research paper! Do you have any questions? Feel free to contact us today!

Written by Çağatay Demiralp, Jinfeng Li and Megagon Labs.

