Knowledge Graph Building and Training: Engineering Aspects

Knowledge graphs are crucial sources of human-curated structured knowledge, especially for improving the representation, reasoning, and explainability of machine learning models in intricate domain-specific tasks. At Megagon Labs, we recognize the transformative power of knowledge graphs (KGs). They are not just repositories of information, but dynamic structures that evolve in their representation and reasoning power. Within the human resources (HR) domain, we are curating continuously growing concept-level and instance-level knowledge graphs that synergize human-curated wisdom with data-driven insights. These KGs serve as knowledge sources for both research projects and real-world HR tasks, e.g., matching, phone screen prediction, and explainability.

Traditionally, the iterative process of refining these KGs and utilizing them in downstream tasks is painstakingly slow, often stifled by manual curation and by the difficulty of assessing how KG content and size affect downstream model performance. We’ve designed a KG platform that not only accelerates the life cycle of KG construction and learning but also ensures that each iteration is infused with the intelligence necessary to support informed decision-making.

This blog post will peel back the layers of our KG building and learning platform, illuminating its role in enriching machine learning. As we explore our distinctive pipelines and delve into the granularities of data provenance and GNN training, we’ll showcase how our system facilitates the seamless integration of KGs into practical, real-world tasks for production use cases.

KG Platform

As shown in Figure 1, at the heart of our KG building and learning platform is the life cycle of knowledge. The figure illustrates the generation of a derived KG from the Megagon KnowledgeHub, a golden dataset that serves as the standard benchmark, multiple derived datasets for model training, and the role of the KG model in downstream tasks.

Figure 1. Components of KG Building and Learning Platform

Let’s go into the details of the components. Our process begins with constructing multiple concept-level and instance-level KGs under a shared hierarchical schema, utilizing both structured and unstructured data sources. Within our system, data from various sources undergo a metamorphosis into knowledge through data preprocessing and KG-building pipelines. For instance, as shown in Figure 2, a resume transcends its original form of mere text to become a node in the KG, enriched with extracted attributes and integrated into the larger knowledge graph. Within our hierarchical KGs, entities and relationships are aggregated at the appropriate level of semantics to develop stronger signals for model development.

Figure 2. Example Subgraph of a Resume in KG
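
To make this concrete, here is a minimal sketch in Python (using networkx) of how a resume might become a small subgraph. The node types, relationship names, and extractor output below are illustrative assumptions, not our actual schema or pipeline code.

```python
import networkx as nx

# A minimal sketch of turning one resume into a small subgraph.
# Labels, relationship names, and the extraction output are assumptions.
G = nx.MultiDiGraph()

resume_id = "resume:123"
G.add_node(resume_id, type="Resume")

# Pretend an upstream extraction model produced these attributes.
extracted = {
    "experiences": [
        {"id": "exp:1", "title": "Data Engineer",
         "skills": ["skill:python", "skill:sql"]},
    ],
}

for exp in extracted["experiences"]:
    G.add_node(exp["id"], type="Experience", title=exp["title"])
    G.add_edge(resume_id, exp["id"], key="hasExperience")
    for skill in exp["skills"]:
        G.add_node(skill, type="Skill")
        G.add_edge(exp["id"], skill, key="hasSkill")

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```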

For every unique downstream task, our pipelines perform graph transformations to distill and refine knowledge from the Megagon KnowledgeHub. This involves knowledge transformation, curation, and integration processes, culminating in the creation of a derived KG tailored specifically for machine learning applications in the given task. For example, in one design option, the original two-hop relation “resume → experience → skill” is compressed into the one-hop “resume → skill” relation and merged with other relation sources for resume skills. The standardization and automation of such transformation operations allow developers to try more ideas. This capability to generate multiple versions of golden and derived datasets enables a thorough comparative analysis of design options, driving informed decision-making backed by solid data and robust evaluation metrics. KG learning pipelines are then executed on these datasets to produce trained models and evaluation results.
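
As an illustration, such a compression could be expressed as a single Cypher MERGE executed through the Neo4j Python driver. The node labels, relationship names, and connection details below are assumptions for illustration, not our production queries.

```python
from neo4j import GraphDatabase

# Illustrative Cypher for compressing the two-hop path
# (resume)-[:hasExperience]->(experience)-[:hasSkill]->(skill)
# into a direct (resume)-[:hasSkill]->(skill) edge in the derived KG.
COMPRESS_QUERY = """
MATCH (r:Resume)-[:hasExperience]->(:Experience)-[:hasSkill]->(s:Skill)
MERGE (r)-[:hasSkill]->(s)
"""

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
with driver.session() as session:
    session.run(COMPRESS_QUERY)
driver.close()
```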

Often in practice, the direct use of KGs in industry may not yield high-quality results due to their large scale and the unbalanced, sparse nature of the data. A key aspect that sets our system apart from other KG building and training frameworks is graph transformation. Much like the data operations in data lakes that transform raw data into model-ready data, the knowledge stored in a KG usually needs to be filtered, aggregated, merged, and transformed before a graph model can apply it to a downstream task. In other words, graph models work best on a graph that emphasizes the entities and relationships relevant to a specific task.

To standardize such graph transformation operations and enable fast iterations, we designed an automatic graph transformation pipeline in the KG building and learning platform. This pipeline operates on a declarative paradigm: it only requires a user-defined query file, and the framework will automatically fetch, transform, and build a new KG for a downstream task. The query file declaratively specifies both the “what” (a high-level triple definition in the generic schema) and the “how” (concrete queries against the underlying databases, e.g., Neo4j). This approach significantly speeds up the process of creating and iterating on KGs.
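
For illustration, a query file of this kind might look like the following sketch, written here as a Python dictionary. The field names and the helper function are hypothetical, not the platform’s actual format.

```python
# A hypothetical declarative query file, sketched as a Python dict.
# "what" names the desired triple in the generic schema; "how" supplies
# the concrete database query (here, Cypher for Neo4j).
derived_kg_spec = {
    "name": "resume_skill_v1",
    "triples": [
        {
            "what": ("Resume", "hasSkill", "Skill"),
            "how": """
                MATCH (r:Resume)-[:hasExperience]->()-[:hasSkill]->(s:Skill)
                RETURN r.id AS head, s.id AS tail
            """,
        },
    ],
}

def build_derived_kg(spec, session):
    """Fetch each declared triple from Neo4j and emit (head, relation, tail) rows."""
    for triple in spec["triples"]:
        head_type, relation, tail_type = triple["what"]
        for record in session.run(triple["how"]):
            yield (record["head"], relation, record["tail"])
```

The value of the split is that the “what” stays stable across design iterations while the “how” can be rewritten, swapped, or optimized per data source.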

Data Provenance

As the data flows from raw sources to KGs, then to training datasets, and eventually into models, tracking the provenance becomes crucial. If a particular configuration of a job-resume matching model demonstrates improved performance, tracking data provenance enables us to pinpoint the underlying factors. Let’s delve further into the specifics of data provenance:

First, within KG construction, we’ve built a generic hierarchical schema that organizes every entity and relationship within the HR domain into a structured hierarchy. For instance, from a specific node like “Python”, we can trace our path back to the most abstract node in the KG. At the most granular level, data-source-specific relationships in the KG represent the finest detail of the relational structure. For any two relationships, we can find common ancestors and merge the knowledge at that juncture. For example, two data-source-specific relationships, “resumeHasExtractedSkillByModelA” and “resumeHasExtractedSkillByModelB,” store skill extraction results from resumes produced by two different models. Our platform will automatically merge them into the “resumeHasSkill” relationship.
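
A toy sketch of this merge logic: given a relationship hierarchy, we can walk up the lineages of two relationships to find their lowest common ancestor. The hierarchy below is an assumption that mirrors the example above, not our full schema.

```python
# Toy relationship hierarchy: child -> parent (None marks the root).
PARENT = {
    "resumeHasExtractedSkillByModelA": "resumeHasSkill",
    "resumeHasExtractedSkillByModelB": "resumeHasSkill",
    "resumeHasSkill": "hasAttribute",
    "hasAttribute": None,
}

def ancestors(relation):
    """Path from a relationship up to the root of the hierarchy."""
    path = []
    while relation is not None:
        path.append(relation)
        relation = PARENT[relation]
    return path

def common_ancestor(rel_a, rel_b):
    """Lowest relationship type shared by both lineages."""
    lineage_a = ancestors(rel_a)
    for rel in ancestors(rel_b):
        if rel in lineage_a:
            return rel
    return None

print(common_ancestor("resumeHasExtractedSkillByModelA",
                      "resumeHasExtractedSkillByModelB"))
# -> resumeHasSkill
```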

Second, within the KG building and learning framework, we’ve established a configuration-file-based versioning system for each framework component, documenting the identifiers of preceding components, raw data sources, KG sources, utilized raw tables, and additional relevant metadata.
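
As a rough illustration, the metadata recorded for one component might resemble the following; the field names and values are assumptions, not our actual configuration format.

```python
# Hypothetical per-component version record in the provenance system.
component_version = {
    "component": "derived_kg_builder",
    "version": "2024.03.1",
    "upstream_components": ["kg_construction:1.8.0"],
    "raw_data_sources": ["resumes_table", "job_postings_table"],
    "kg_sources": ["megagon_knowledgehub"],
    "notes": "Compressed resume -> experience -> skill into resume -> skill.",
}
```

Chaining such records across components lets us reconstruct, for any trained model, exactly which KG version, transformations, and raw tables produced it.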

KG Learning

Within the overarching architecture of our KG building and learning framework, KG learning functions as a critical intermediary, transforming knowledge into direct outcomes. To learn from the rich, interconnected, and deep semantic relationships present within the KG, various customized GNN/LLM-based machine learning approaches can be designed for a variety of downstream tasks. We will use one task to show how our customized GNN demonstrates nuanced domain understanding and enhances the adaptability of our end-to-end solution.

Let’s have a look at one real-world application: phone screen prediction, or forecasting whether a candidate should receive a phone screen interview based on their resume and the job’s criteria.

Our approach utilizes a bespoke two-stage, graph-based framework comprising a pre-training phase for encoder-decoder knowledge graph learning, followed by a fine-tuning phase tailored for phone screen classification. Harnessing the power of the knowledge graph, the model achieves results comparable, and in some cases superior, to the LLM-based approach, while requiring only a fraction of the resources and training time. The model’s rapid inference capability renders it highly suitable for production environments. Iterative refinement of our design has led to several key observations (a simplified sketch of the two-stage design follows the list):

  1. Structural knowledge alone, even without textual data, captures significant semantic information.
  2. The KG model has a compact size and faster inference times, making it a more practical choice for production environments without sacrificing result quality.
  3. The inclusion of conceptual linkages, such as taxonomies, provides clear benefits and enhances the performance of data-driven knowledge graphs.
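
The sketch below illustrates the two-stage idea in PyTorch with PyTorch Geometric: a GNN encoder pre-trained with a dot-product link-prediction decoder, then reused under a small classification head for phone screen prediction. The architecture choices here (GraphSAGE layers, layer sizes, dot-product decoder) are illustrative assumptions, not our exact model.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv

class Encoder(nn.Module):
    """Stage 1 encoder: two GraphSAGE layers over the KG."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hid_dim)
        self.conv2 = SAGEConv(hid_dim, hid_dim)

    def forward(self, x, edge_index):
        h = self.conv1(x, edge_index).relu()
        return self.conv2(h, edge_index)

def link_pred_loss(z, pos_edges, neg_edges):
    # Decoder: a dot product scores each edge; BCE over positive edges
    # from the KG and sampled negative edges drives pre-training.
    pos = (z[pos_edges[0]] * z[pos_edges[1]]).sum(dim=-1)
    neg = (z[neg_edges[0]] * z[neg_edges[1]]).sum(dim=-1)
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
    return nn.functional.binary_cross_entropy_with_logits(
        torch.cat([pos, neg]), labels)

class PhoneScreenClassifier(nn.Module):
    """Stage 2: score a (job, resume) pair using the pre-trained encoder."""
    def __init__(self, encoder, hid_dim):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(2 * hid_dim, 1)

    def forward(self, x, edge_index, job_idx, resume_idx):
        z = self.encoder(x, edge_index)
        pair = torch.cat([z[job_idx], z[resume_idx]], dim=-1)
        return self.head(pair).squeeze(-1)
```

In the fine-tuning stage, the pre-trained encoder weights are loaded into the classifier and trained end to end (or with the encoder frozen) on labeled job-resume pairs.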

Embedding Feature Store

Knowledge graph embeddings offer a powerful foundation for various machine learning-driven real-world applications, such as search, matching, and recommendation systems. Beyond the direct deployment of graph-based ML models, these applications benefit significantly from the rich, domain-specific representations derived from KG learning. These embeddings encapsulate complex semantics and knowledge, making them useful as features for other models.

To effectively leverage these representations and meet the dynamic demands of industrial use cases, we utilize high-dimensional feature stores. These feature stores are engineered to support operations like batch vector fetching, similarity search, and approximate nearest neighbor (ANN) functions. This enables seamless integration of KG-derived insights into broader systems, ensuring that the understanding captured by our KGs can directly contribute to the depth and precision of various downstream applications.
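
As a minimal example, an ANN-backed lookup over KG embeddings could be sketched with FAISS as follows. The HNSW parameters, embedding dimensionality, and random stand-in vectors are illustrative assumptions.

```python
import numpy as np
import faiss  # approximate nearest neighbor search library

dim = 128
# Stand-in for KG embeddings; in practice these come from KG learning.
embeddings = np.random.rand(10_000, dim).astype("float32")

index = faiss.IndexHNSWFlat(dim, 32)  # 32 neighbors per node in the HNSW graph
index.add(embeddings)

# Batch fetch: the 5 approximate nearest neighbors for a batch of queries.
queries = embeddings[:4]
distances, ids = index.search(queries, 5)
print(ids.shape)  # (4, 5)
```

Swapping IndexHNSWFlat for an exact index such as IndexFlatL2 trades query latency for exact results; the right choice depends on the scale of the store and the latency budget of the downstream application.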

Conclusion & Future Work

In this article, we provided a high-level view of our KG building and training framework. Its proficiency in handling the complete KG life cycle — from construction to utilization in machine learning tasks — not only facilitates robust exploration and validation of concepts but also enhances task performance. This iterative approach has led to more informed decision-making processes, underscoring the transformative power of KGs.

As we look to the future, we seek to further integrate GNNs and LLMs within our framework, and introduce a feedback loop that incorporates human feedback and insights, thereby enriching our knowledge graphs and enhancing our learning methodologies.

Written by: Chen Shen, Eser Kandogan, and Megagon Labs.

Follow us on LinkedIn and Twitter to stay up to date.
