Semantic type detection aims to identify the real-world references of data sources such as table columns. By establishing correspondences with real-world concepts, semantic types can provide fine-grained data descriptions and improve data preparation and information retrieval tasks such as data cleaning, schema matching, semantic search, and data visualization. For example, if a system were able to detect that the string values in a given column refer to names, it could then automatically apply a transformation rule that capitalizes the first character of each string while leaving the rest lowercase. A person analyzing this dataset would then spend less time and manual effort fixing it.
Figure 1: Data systems automatically detect atomic data types such as string, integer, and decimal to reliably operate and improve user experience. But semantic types such as country, population, and latitude are disproportionately more informative and powerful — and, in several use cases, essential.
Let’s consider data cleaning, the task of generating clean data by using transformation and validation rules. To automate it, we have to be able to detect the data types of a given dataset and infer the most appropriate rules to apply. For instance, knowing that the first two columns of the table in Figure 1 contain country names and their capitals significantly simplifies the job of correcting any errors or detecting and filling in missing values in these columns. Similarly, if we know the third and fourth columns of the table hold latitude and longitude values, we can use a map to visualize the value pairs. This would be much more useful than the scatterplot a system might default to when these two columns are treated as plain decimal columns.
We want a semantic type detection algorithm to have three basic qualities: It must be robust to heterogeneity and noise in data; it should scale well with the number of rows and columns as well as the number of semantic types; and, of course, it has to be accurate. Let’s delve deeper into the complexity of automatic type detection by examining both traditional and novel approaches to this task.
The Limitations of Traditional Semantic Type Detection
Recognizing the benefits, many commercial systems such as Google Data Studio, Tableau, PowerBI, and Trifacta all attempt to automatically detect a limited number of semantic types. In some scenarios, semantic type detection is simple. For instance, ISBNs or credit card numbers lend themselves to straightforward data type detection since they are generated according to strict validation rules. But as shown in Table 1, most types, including location, birth date, and name, do not adhere to such consistent structures.
Table 1: Examples of (partial) data columns with corresponding semantic types sampled from the VizNet corpus.
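To make the "strict validation rules" case concrete, here is a minimal sketch of the Luhn checksum, which credit card numbers are generated to satisfy. The function name `luhn_valid` is our own, and a real detector would likely also check length and issuer prefixes; treat this as an illustration rather than a production validator.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep digits only, so separators such as spaces or dashes are ignored.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("79927398713"))  # True
print(luhn_valid("79927398710"))  # False
```

Types such as name or birthplace have no comparable checksum, which is why this style of rule does not carry over to them.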
Existing systems generally leverage matching-based approaches for semantic type detection. Take regular expression matching, for example. Using predefined character sequences, this method captures data value patterns and checks if a value matches them. For detecting simple data types with a certain structure, like date, this can be sufficient. But approaches like regular expression matching are limited in versatility: They’re not robust enough to handle noisy real-world data; they only support a limited number of semantic types; and they underperform for data types without strict validations. On the other hand, current machine learning approaches to semantic type detection utilizing models like logistic regression and decision trees suffer from limited predictive performance or steep memory requirements.
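The brittleness of pattern matching is easy to see in a small sketch. The pattern and the `match_rate` helper below are illustrative assumptions, not taken from any of the systems named above: a single ISO-style date pattern works on a clean column and then misses most values once real-world formatting variations appear.

```python
import re

# Illustrative pattern: ISO-like dates (YYYY-MM-DD) only.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def match_rate(values, pattern):
    """Fraction of non-empty values that fully match the pattern."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return 0.0
    hits = sum(1 for v in cleaned if pattern.fullmatch(v))
    return hits / len(cleaned)


clean_dates = ["2019-03-01", "2020-11-15", "2021-07-04"]
noisy_dates = ["2019-03-01", "March 1, 2019", "01/03/19", "n/a"]

print(match_rate(clean_dates, DATE_PATTERN))  # 1.0
print(match_rate(noisy_dates, DATE_PATTERN))  # 0.25
```

Every additional format needs another hand-written pattern, and every additional semantic type needs its own set of patterns, which is where the scaling problem comes from.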
Machine learning models trained on large-scale corpora have proven effective at addressing these challenges and carrying out various predictive tasks across domains. Let’s now examine two new approaches to semantic type detection based on training models with a large number of real-world columns.
(1) Sherlock: A Deep Learning Approach to Semantic Type Detection
Sherlock leverages developments in high-capacity models like deep neural networks and advancements in natural language embedding. Developed as a multi-input deep neural network, Sherlock is trained on more than 685K real-world data columns with 78 different semantic types (Figure 2). An interesting aspect of this training dataset is that it was constructed by matching column headers in the VizNet corpus, which contains 31M tables, to their DBpedia types, without resorting to human labeling. Each raw sample (column) is represented by 1,588 features describing statistical properties, character distributions, word embeddings, and paragraph vectors of column values.
Figure 2: Frequencies of the 78 semantic types considered by Sherlock. For example, there were about 10K columns with the semantic type “area” in the Sherlock training data.
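To give a feel for what "statistical properties" and "character distributions" of a column can look like, here is a small sketch that computes a handful of such features. The feature names and the `column_features` function are our own illustrative choices; they are not Sherlock's actual 1,588 features, and the word-embedding and paragraph-vector features are left out entirely.

```python
import statistics
import string


def _is_number(text):
    """True for simple decimal strings such as '42', '-3.5', or '0.25'."""
    stripped = text.strip().lstrip("-").replace(".", "", 1)
    return stripped.isdigit()


def column_features(values):
    """Compute a few illustrative features for a column of values."""
    values = [str(v) for v in values]
    n = max(len(values), 1)  # avoid division by zero on empty columns
    all_chars = "".join(values)
    total_chars = max(len(all_chars), 1)
    lengths = [len(v) for v in values]
    return {
        # Column-level statistics
        "frac_unique": len(set(values)) / n,
        "mean_length": statistics.fmean(lengths) if lengths else 0.0,
        "frac_numeric": sum(_is_number(v) for v in values) / n,
        # Character-level distribution
        "frac_digit_chars": sum(ch in string.digits for ch in all_chars) / total_chars,
        "frac_alpha_chars": sum(ch.isalpha() for ch in all_chars) / total_chars,
    }


features = column_features(["Florence", "Warsaw", "London", "Warsaw"])
print(features["frac_unique"], features["mean_length"])  # 0.75 6.5
```

A learned model consumes a fixed-length vector of features like these for each column, which is what lets it generalize beyond hand-written rules.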
When trained on a sufficient amount of real-world semantic type examples, this kind of model is robust to dirty data and scales well. Figure 3 illustrates the overall approach. Sherlock’s predictive performance exceeds dictionary and regular expression benchmarks, the consensus of crowdsourced annotations, and machine learning baselines using decision trees. In terms of efficiency, more traditional models like decision trees are faster than Sherlock. However, that speed comes at the cost of predictive performance and storage requirements.
Figure 3: Overview of the underlying approach of Sherlock.
While Sherlock represents a significant step forward in applying deep learning to semantic typing, it suffers from two problems. First, it underperforms for types that do not have a sufficiently large number of samples in the training data. This limits Sherlock’s application to infrequent types, which form the majority of data types appearing in tables at large. Second, Sherlock uses only the values of a column to predict its type, without considering the column’s context in the table. However, column type prediction based solely on the column values cannot resolve the ambiguities due to overlap between semantic data types. Known as semantic overlap, this can be problematic since data columns with similar data values might represent different concepts in different table contexts.
(2) Sato: Leveraging Context to Improve Semantic Type Detection
Sato was mainly developed to address the shortcomings of Sherlock, improving the predictive performance for data columns in which the semantic type is ambiguous or has few examples available. This is important for extending automated type detection to a larger number of types. Sato operates under the premise that a column’s table context contains additional descriptive power for improving the column’s type prediction accuracy.
The intuition behind this premise is rather straightforward. A table is not a mere collection of random columns but is instead created with a particular intent in mind. The semantic types of the columns in the table can be considered as expressions of that intent with thematic coherence. As a result, for example, we should expect certain semantic types to co-occur more frequently in the same table than others. A peek into the type co-occurrence frequencies in Sato’s training dataset supports the argument (Figure 4).
Figure 4: Co-occurrence frequencies in log scale for a selected set of column types. Certain pairs like (city, state) or (age, weight) appear in the same table more frequently than others. Sato leverages co-occurrence structures of table columns. There are non-zero diagonal values as tables can have multiple columns of the same semantic type.
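Co-occurrence counts of this kind are straightforward to compute once each table is reduced to the list of its column types. The sketch below is an assumption about how such a matrix could be tallied, using made-up toy tables rather than Sato's training data. Pairs are counted over column positions, so a table with two columns of the same type contributes to the diagonal.

```python
from collections import Counter
from itertools import combinations


def type_cooccurrence(tables):
    """Count how often each unordered pair of column types shares a table."""
    counts = Counter()
    for column_types in tables:
        # Pair up column positions, so repeated types yield (t, t) entries.
        for a, b in combinations(column_types, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Toy tables, each given as the semantic types of its columns.
tables = [
    ["city", "state", "population"],
    ["name", "age", "weight"],
    ["city", "state"],
    ["name", "city", "city"],
]

counts = type_cooccurrence(tables)
print(counts[("city", "state")])  # 2
print(counts[("city", "city")])   # 1
```

Taking the logarithm of these counts would give values on the same kind of scale as the figure.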
Incorporating the context of a column in the prediction of its semantic type can not only improve the prediction accuracy in general but, in many cases, is also the only way to correctly disambiguate the column’s type. Figure 5 demonstrates one such case.
Figure 5: Two actual tables (Table A and Table B) from the VizNet corpus. The last column of Table A and the first column of Table B have identical values: ‘Florence,’ ‘Warsaw,’ ‘London,’ and ‘Braunschweig.’ A prediction model based solely on column values (i.e., single-column prediction) cannot resolve the ambiguity to infer the correct semantic types, birthplace and city. Sato incorporates signals from table context and performs a multi-column type prediction to help effectively resolve ambiguities like these and improve the accuracy of semantic type predictions.
Motivated by these considerations, Sato combines topic modeling to capture “global context” and structured learning to capture “local context” in addition to single-column type prediction based on the Sherlock model. For the modeling purposes of Sato, the context of a table column is formed by the types and values of all the other columns in the same table. Figure 6 illustrates the additional steps needed to incorporate a table’s context in Sato.
Figure 6: Overview of how Sato incorporates the global context (as topic vector) and local context (inferring the ‘‘type likelihood’’ over a sequence of columns).
Latent Dirichlet Allocation is an unsupervised generative probabilistic topic model widely used to quantify thematic structures in text. Sato uses it to learn a topic vector for each table. Sato then adds this vector to the same feature set as used for Sherlock. As a result, each data column within the same table has the exact same topic vector.
All of this makes it easier for the model to contextualize. For example, it can “understand” that a data column containing date values within the global context of personal information is much more likely to refer to birth dates than publication dates.
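The mechanics of the global context step can be sketched in a few lines. Here the topic vector is simply hard-coded; in practice it would come from a topic model such as LDA fitted on table contents, which this sketch does not implement. The function name and the numbers are hypothetical.

```python
def add_table_topic(column_features, topic_vector):
    """Append the same table-level topic vector to every column's features."""
    return [list(features) + list(topic_vector) for features in column_features]


# Hypothetical per-column feature vectors for a three-column table.
columns = [[0.75, 6.5], [1.0, 4.0], [0.5, 10.0]]

# Hypothetical topic distribution for the whole table (assumed to be the
# output of a topic model fitted elsewhere).
table_topic = [0.7, 0.2, 0.1]

augmented = add_table_topic(columns, table_topic)
print(augmented[0])  # [0.75, 6.5, 0.7, 0.2, 0.1]
```

Because every column of the table receives the identical topic vector, the classifier sees the same table-level signal regardless of which column it is currently predicting.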
Beyond the table-level topic vector, the labels of columns in close proximity to a chosen column hold discriminative information as well. When given a subset of data columns and corresponding predictions, Sato optimizes the likelihood of semantic types over this subset using Conditional Random Fields.
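To illustrate how neighboring labels can change a prediction, here is a minimal sketch of Viterbi decoding over a linear chain of columns, one common way to find the most likely label sequence in this kind of model. It assumes a linear-chain structure, and both the per-column scores and the pairwise scores are hand-set for illustration; in a trained Conditional Random Field the pairwise scores would be learned from data.

```python
import math


def viterbi(unary, pairwise, labels):
    """Most likely label sequence under a linear-chain model.

    unary[i][label]: log-score of `label` for column i, for example the
    log-probability from a single-column classifier.
    pairwise[(a, b)]: log-score for label a directly followed by label b;
    missing pairs default to 0.
    """
    best = [{lab: unary[0][lab] for lab in labels}]
    back = []
    for i in range(1, len(unary)):
        scores, pointers = {}, {}
        for cur in labels:
            # Best previous label given that column i takes label `cur`.
            prev = max(labels, key=lambda p: best[-1][p] + pairwise.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + pairwise.get((prev, cur), 0.0) + unary[i][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["name", "city", "birthplace", "country"]
# Hypothetical single-column probabilities for a two-column table.
probs = [
    {"name": 0.90, "city": 0.04, "birthplace": 0.03, "country": 0.03},
    {"name": 0.05, "city": 0.50, "birthplace": 0.40, "country": 0.05},
]
unary = [{lab: math.log(p) for lab, p in col.items()} for col in probs]
# Hand-set compatibility scores (a trained CRF would learn these).
pairwise = {("name", "birthplace"): 1.0, ("country", "city"): 1.0}

print(viterbi(unary, {}, labels))        # ['name', 'city']
print(viterbi(unary, pairwise, labels))  # ['name', 'birthplace']
```

Without pairwise scores the second column falls back to its single-column best guess, city. Once the model knows that a name column tends to sit next to a birthplace column, the joint decoding flips the ambiguous column to birthplace.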
Using the architecture above, Sato is trained on more than 122K table columns with 78 different semantic types using a subset of the Sherlock training dataset (Figure 7).
Figure 7: Frequencies of the 78 semantic types in the Sato training data. Most types have fewer than 5K columns, forming a “long tail” in the frequency histogram.
Experiments confirm that Sato significantly outperforms Sherlock in semantic type prediction by incorporating global and local contextual information of data columns, with respective increases of as much as 14.4% and 5.3% in macro and micro (support-weighted) F1 scores. Crucially, Sato’s performance gains are primarily due to the improved prediction over data types that are semantically overlapping or underrepresented (Figure 8).
Figure 8: F1 scores of Sato and Sherlock for each semantic type, grouped by the relative change when Sato is used for prediction in comparison to Sherlock: increased (left), unchanged (middle), and decreased (right). Sato increases the prediction accuracy for the majority of the types, particularly for the previously hard, underrepresented ones.
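The difference between the two averages reported above matters for long-tailed type distributions, so a small worked example may help. The sketch below computes per-type F1, the macro average (every type counts equally), and the support-weighted average (each type weighted by its number of true columns). The labels and predictions are made up for illustration and are not results from either system.

```python
from collections import Counter


def f1_per_type(y_true, y_pred):
    """F1 score for each label that appears in the truth or the predictions."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-type F1: rare types count as much as common ones."""
    scores = f1_per_type(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def support_weighted_f1(y_true, y_pred):
    """Mean of per-type F1 weighted by each type's number of true columns."""
    scores = f1_per_type(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[lab] * n for lab, n in support.items()) / len(y_true)


# Toy data: a frequent type and a rare, easily confused one.
y_true = ["city"] * 8 + ["birthplace"] * 2
y_pred = ["city"] * 8 + ["city", "birthplace"]

print(round(macro_f1(y_true, y_pred), 3))             # 0.804
print(round(support_weighted_f1(y_true, y_pred), 3))  # 0.886
```

The single mistake on the rare type pulls the macro score down much more than the support-weighted one, which is why improvements on underrepresented types show up most strongly in macro F1.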
Semantic Type Detection Still Has Room for Improvement
Sato solves the problem of ambiguity with overlapping semantic types by leveraging context. Incorporating context also ameliorates the “data hunger” for training deep learning models, facilitating automated detection support for a larger number of semantic types. But two important issues remain:
- Sherlock and Sato both assume semantic types to be mutually exclusive. In other words, each sample corresponds to only one semantic type. But some data types can stand in a parent-child relation to one another. Country, for example, can be a child class of location.
- Due to the way their training data was curated, Sherlock and Sato can detect only a relatively small number of numerical semantic types. But many semantic data types represent numerical data.
At Megagon Labs, we are excited to be working on future learned models and tools in order to address these challenges in semantic type detection.