Recent years have been marked by the increased use of knowledge graphs (KGs) in diverse areas such as natural language processing (NLP), the natural sciences (e.g. biomedicine, physics, and geology), cybersecurity, finance, education, and manufacturing. KGs store information in the form of triples and allow for the encoding of both topological and semantic information. The growing interest in KGs has led to a surge in dataset releases: almost half of the datasets in the popular PyKEEN library have been published in the last three years. Despite this interest, a fundamental question remains: What properties and structures do real KGs have, and how do they compare to each other in terms of these properties?
In our paper published at the NeurIPS 2023 New Frontiers in Graph Learning Workshop (GLFrontiers), we tackle this open problem by performing a large-scale analysis of 29 KGs from different underlying domains. We measured and analyzed various KG properties and described common/distinct structural patterns we observed in the datasets. Based on our findings, we formulated several recommendations for practitioners for future KG model development, evaluation, and dataset construction.
What Datasets Did We Analyze?
The datasets we considered were categorized into the following three groups:
- 9 biomedical KGs storing gene, protein, and pathway data, derived from high-quality public datasets.
- 17 semantic KGs storing data extracted from text sources such as Wikipedia (or its structured counterpart Wikidata).
- 3 societal KGs storing curated data in domains such as geography and international relations (for example, the Nations dataset).
What KG Properties Did We Consider?
We performed a series of data analysis steps to quantify multiple properties and structural features described below.
- Relationship between number of entities, relations, and triples. Here we considered how KGs cluster in terms of the number of their entities, relations, and triples.
- Degree distribution and KG density. The density of KG edges typically affects downstream properties and behavior of graphs. For each KG entity, we summed its in-degree and out-degree to obtain the entity degree, analogously to the way node degrees are computed in undirected graphs. In the plot below, we see that several datasets show a distinct degree distribution in the lower degrees.
Related to the KG density, we also quantified the KG connectivity by plotting the triples per KG relation. The thick tails in the plot below show that two of the semantic datasets, AristoV4 and FB15k, have a high number of relations with a low number of triples per relation. Several other semantic datasets follow the same pattern, while all but one of the biomedical datasets do not.
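The two counts described above (entity degree as in-degree plus out-degree, and triples per relation) can be sketched in a few lines. This is a minimal illustration on hypothetical toy triples, not the analysis code from the paper; the function names `entity_degrees` and `triples_per_relation` are made up for this example.

```python
from collections import Counter

# Hypothetical toy KG: each triple is (head, relation, tail).
triples = [
    ("geneA", "activates", "geneB"),
    ("geneA", "activates", "geneC"),
    ("geneB", "activates", "geneC"),
    ("drugX", "binds", "geneA"),
    ("geneC", "associates", "diseaseY"),
]


def entity_degrees(triples):
    """Return entity -> degree, where degree = in-degree + out-degree."""
    degree = Counter()
    for head, _, tail in triples:
        degree[head] += 1  # the triple leaves the head (out-degree)
        degree[tail] += 1  # the triple enters the tail (in-degree)
    return degree


def triples_per_relation(triples):
    """Return relation -> number of triples that use it."""
    return Counter(relation for _, relation, _ in triples)


degrees = entity_degrees(triples)
per_relation = triples_per_relation(triples)

# Degree distribution: how many entities have each degree value.
degree_distribution = Counter(degrees.values())

print(dict(degrees))
print(dict(per_relation))
print(dict(degree_distribution))
```

Plotting `degree_distribution` and the sorted values of `per_relation` on log axes is one way to look for the low-degree patterns and thick tails mentioned above.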
- Relation cardinality and relational patterns. Relation cardinality describes the numerical relationship between the possible head and tail entities of a KG relation. The possible types are: one-to-one, one-to-many, many-to-many, and many-to-one. For example, the relation “Gene Activates Gene” is many-to-many because various genes can activate multiple other genes. For each dataset, we plotted the proportion of these 4 relation types such as in the examples below.
Based on these plots, we detected several patterns: (a) 1-1 dominance: In 15 of the 28 datasets, the leading relation type is 1-1, including many semantic datasets. Overall, 1-1 relations are more prevalent in the semantic datasets than in the biomedical datasets. (b) M-M dominance: In 11 of the 28 datasets, the leading relation type is M-M. Biomedical datasets are dominated by M-M relations. (c) Mixed cardinalities: We observed that some of the most frequently used semantic datasets have a significant number of all 4 cardinalities, unlike the rest of the analyzed datasets.
We also considered relational patterns such as asymmetric, symmetric, inverse, and composite (such as in the examples below). Across all datasets, we observed a dominance of anti-symmetric relations, with the exception of the societal datasets and some of the most frequently used benchmarking datasets in NLP, such as FB15 and WN18RR.
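To make the cardinality and symmetry categories above concrete, here is a small sketch on hypothetical toy triples. It assumes a heuristic that is common in the KG embedding literature: compute the average number of tails per head and heads per tail for each relation, and call a side "many" when the average is at least 1.5. The threshold, the strict symmetry definitions, and the function names are assumptions for illustration; the paper may define these categories differently.

```python
from collections import defaultdict

# Hypothetical toy KG: each triple is (head, relation, tail).
toy = [
    ("paris", "capital_of", "france"), ("rome", "capital_of", "italy"),
    ("g1", "activates", "g3"), ("g1", "activates", "g4"),
    ("g2", "activates", "g3"), ("g2", "activates", "g4"),
    ("p1", "born_in", "c1"), ("p2", "born_in", "c1"),
    ("a", "married_to", "b"), ("b", "married_to", "a"),
]


def relation_cardinality(triples, threshold=1.5):
    """Label each relation 1-1, 1-M, M-1, or M-M (assumed 1.5 threshold)."""
    tails_of = defaultdict(lambda: defaultdict(set))  # relation -> head -> tails
    heads_of = defaultdict(lambda: defaultdict(set))  # relation -> tail -> heads
    for h, r, t in triples:
        tails_of[r][h].add(t)
        heads_of[r][t].add(h)
    labels = {}
    for r in tails_of:
        # Average number of distinct tails per head, and heads per tail.
        tph = sum(len(s) for s in tails_of[r].values()) / len(tails_of[r])
        hpt = sum(len(s) for s in heads_of[r].values()) / len(heads_of[r])
        head_side = "M" if hpt >= threshold else "1"
        tail_side = "M" if tph >= threshold else "1"
        labels[r] = f"{head_side}-{tail_side}"
    return labels


def symmetry_pattern(triples):
    """Label each relation symmetric, anti-symmetric, or mixed."""
    pairs = defaultdict(set)
    for h, r, t in triples:
        pairs[r].add((h, t))
    labels = {}
    for r, p in pairs.items():
        # Count (h, t) pairs whose reverse (t, h) also holds for this relation.
        reversed_present = sum((t, h) in p for h, t in p)
        if reversed_present == len(p):
            labels[r] = "symmetric"
        elif reversed_present == 0:
            labels[r] = "anti-symmetric"
        else:
            labels[r] = "mixed"
    return labels


print(relation_cardinality(toy))
print(symmetry_pattern(toy))
```

On real datasets the strict all-or-nothing symmetry test is usually too brittle, so a tolerance (for example, a minimum fraction of reversed pairs) would be a natural extension.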
- Metapaths. KG metapaths are a graph-theoretic metric widely used in the biomedical literature for assessing the connectivity of KGs and deriving insights about the clinical or biological relevance of interactions such as gene-gene or drug effects. For example, in the Hetionet dataset, the node types Compound, Gene, and Disease form the metapath Compound (binds) → Gene (associates) → Disease of length 2. The number of metapaths of a given length provides a way of quantifying the level of relational composition without having explicit composite relations encoded in the KG. We found that biomedical KGs have a significantly higher number of metapaths than the other KG types.
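As an illustration of the metapath idea above, the sketch below enumerates length-2 metapaths by joining each triple with the triples that start at its tail, then mapping entities to node types. The toy entities, the `entity_type` mapping, and the function name are hypothetical, and the sketch follows edge direction only; the counting procedure used in the paper may differ.

```python
from collections import Counter, defaultdict

# Hypothetical toy KG modeled on the Compound -> Gene -> Disease example.
triples = [
    ("drugX", "binds", "geneA"),
    ("drugX", "binds", "geneB"),
    ("geneA", "associates", "diseaseY"),
    ("geneB", "associates", "diseaseY"),
    ("geneA", "interacts", "geneB"),
]

# Hypothetical mapping from entity to node type.
entity_type = {
    "drugX": "Compound",
    "geneA": "Gene",
    "geneB": "Gene",
    "diseaseY": "Disease",
}


def length2_metapaths(triples, entity_type):
    """Count path instances for each length-2 metapath.

    A metapath is keyed as (type, relation, type, relation, type).
    """
    outgoing = defaultdict(list)  # entity -> list of (relation, tail)
    for h, r, t in triples:
        outgoing[h].append((r, t))
    counts = Counter()
    for h, r1, mid in triples:
        # Extend the first hop with every edge leaving the middle entity.
        for r2, t in outgoing[mid]:
            key = (entity_type[h], r1, entity_type[mid], r2, entity_type[t])
            counts[key] += 1
    return counts


metapaths = length2_metapaths(triples, entity_type)
for key, n in metapaths.items():
    print(" -> ".join(key), n)
print("distinct length-2 metapaths:", len(metapaths))
```

Here `len(metapaths)` gives the number of distinct metapaths, while the counter values give the number of concrete paths that instantiate each one.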
Key Findings and Recommendations
- KGs are not created equal, which has implications for downstream model evaluation. We recommend that researchers consider a broader set of datasets for KG model development and evaluation.
- We recommend KG libraries and benchmarks consider adding tools for handling/removing certain relation types in order to bring visibility to data problems such as train/test data leakage.
- The success of KG-enhanced Pretrained Language Models (PLMs) may vary depending on the underlying KG structure, so the development of domain-specific PLMs may benefit from different types of KGs.
- Practitioners should consider the role of KG properties and structural patterns during design, hypothesis testing, and model development.
- Analyzing the properties and structural characteristics of existing KGs may benefit the future design of more robust and diverse KG datasets.
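The recommendation above about tooling for train/test leakage can be illustrated with one well-known case: a test triple whose entity pair already appears in reverse in the training set, as happens with symmetric or inverse relations. The sketch below flags such triples; the function name and toy splits are hypothetical, and this is only one possible check rather than a tool proposed in the paper.

```python
# Hypothetical toy splits: each triple is (head, relation, tail).
train = [
    ("a", "married_to", "b"),
    ("paris", "capital_of", "france"),
    ("g1", "activates", "g2"),
]
test = [
    ("b", "married_to", "a"),          # reverse of a training triple
    ("france", "has_capital", "paris"),  # inverse relation of a training triple
    ("g2", "associates", "d1"),        # no reversed pair in training
]


def reverse_pair_leakage(train, test):
    """Return test triples whose (tail, head) pair occurs as (head, tail) in train."""
    train_pairs = {(h, t) for h, _, t in train}
    return [(h, r, t) for h, r, t in test if (t, h) in train_pairs]


leaked = reverse_pair_leakage(train, test)
print(leaked)
print(f"{len(leaked)} of {len(test)} test triples have a reversed pair in train")
```

A flagged triple is not necessarily a problem, but a high fraction of them suggests that a model could score well by memorizing reversed pairs rather than learning the relation.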
Our study has implications for the broader use of KGs across domains. Given the proliferation of models (in KG link prediction, entity alignment, LM-as-KG evaluation) across domains (NLP, natural sciences, medicine, and other disciplines), it is worth investigating whether (and how) structural patterns, as well as their inter-domain variability across KGs, may correlate with or influence KG model performance. We believe that the rich structural information contained in KGs can benefit the development of better KG-based models and KG datasets across fields, and we hope this study will contribute to breaking the existing data silos between different areas of research such as ML, NLP, and AI for sciences.