Feature Stores: Deep Learning, NLP, and Knowledge Graphs

Feature stores are integral to the machine learning lifecycle. They aim to improve the productivity of data scientists in building, deploying, publishing, and reusing features across the organization. As such, they have become an essential part of the MLOps stack, producing key machine learning artifacts such as training datasets.

With advances in deep learning, such as pre-trained large language models (LLMs), graph neural networks (GNNs), and beyond, embeddings are becoming key by-products that serve as features for new model development and machine learning applications.

In this article, we introduce feature stores, examine the implications of deep learning for them, and discuss their role in the emerging MLOps stack.

Let’s first step back and describe what a feature store is and examine use cases.

What is a feature store?

A feature store is a data system that aims to provide automatic and consistent data processing, feature generation, storage, training, and serving for machine learning models. Feature stores enable features to be registered, shared, re-used, and monitored across the organization, along with their metadata.

Feature stores have the following major components: (1) pipelines to fetch, transform, and compute features from data sources, (2) databases to store and manage features for offline (e.g. training) and online processing (e.g. prediction), (3) a central registry to capture entity metadata for features, (4) APIs and SDKs to process, store, and retrieve features and metadata, and (5) tooling (e.g. web apps) to browse, search, and visualize registry and features. (See Figure below)

A registry is key to supporting effective collaboration, as it defines the schema of the entities (e.g. product, company) and the metadata for domains of interest. What the registry serves are standardized feature definitions, together with the metadata that enables feature search and discovery.
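To make this concrete, a registry entry can be thought of as a small structured record. The sketch below is a minimal, hypothetical illustration in plain Python (the field names and the `search_registry` helper are our own, not any particular feature store's API) of how standardized definitions plus tags enable search and discovery:

```python
from dataclasses import dataclass, field

@dataclass
class FeatureDefinition:
    # Hypothetical registry record; field names are illustrative.
    name: str            # e.g. "revenue"
    entity: str          # e.g. "company"
    dtype: str           # e.g. "float32"
    description: str = ""
    tags: dict = field(default_factory=dict)

def search_registry(registry, **criteria):
    """Return definitions whose tags match all given key/value pairs."""
    return [d for d in registry
            if all(d.tags.get(k) == v for k, v in criteria.items())]

registry = [
    FeatureDefinition("revenue", "company", "float32", tags={"domain": "HR"}),
    FeatureDefinition("profit", "company", "float32", tags={"domain": "HR"}),
    FeatureDefinition("skill_count", "job", "int64", tags={"domain": "HR"}),
]

# Discover all company-level features in the HR domain.
hr_company_features = [d.name for d in search_registry(registry, domain="HR")
                       if d.entity == "company"]
```

In a real registry, these records would also carry versioning, ownership, and provenance information.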

Components of a typical feature store

Feature store implementations and open-source tools vary in their ability to support the above functionality. In practice, depending on the need, a feature store can be as simple as a low-latency key-value store such as Redis, where practitioners agree upon the schema and content of the database, then use the database SDKs or wrappers to programmatically compute, search, and pull features.
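As a minimal sketch of that pattern (an in-memory dict stands in for Redis here, and the key convention is hypothetical):

```python
# Minimal key-value "feature store": a dict stands in for Redis in this sketch.
# Key convention (illustrative): "<entity>:<join_key_value>:<feature_name>"
store = {}

def put_feature(entity, key, feature, value):
    store[f"{entity}:{key}:{feature}"] = value

def get_features(entity, key, features):
    # Missing features come back as None, mirroring a key-value store miss.
    return {f: store.get(f"{entity}:{key}:{f}") for f in features}

put_feature("company", 1000, "revenue", 12.5)
put_feature("company", 1000, "profit", 3.1)

vec = get_features("company", 1000, ["revenue", "profit"])
```

Agreeing on a key convention like this is what lets multiple practitioners compute and pull features consistently; a real deployment would replace the dict with Redis client calls.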

Of course, there are now open-source and commercial feature store solutions that go well beyond a database simply storing and retrieving features. For example, Feast is an open-source project that supports batch and streaming data sources for feature transformation. It can serve features for offline training of machine learning models and for online real-time model prediction, and it provides a registry and associated SDKs for searching and retrieving features.


Let’s go through a couple of examples to illustrate the usage and benefits of feature stores. We will use Feast to demonstrate some basic functionality in feature stores.

Let’s first define an entity (e.g. company). An entity is a collection of semantically related features that are specific to the domain (e.g. HR).

# Define an entity
from feast import Entity, ValueType

company = Entity(
    name="company",
    value_type=ValueType.INT64,
    join_keys=["company_fcc_id"],
    description="Company FCC ID",
)


In Feast, features are defined through feature views, and multiple entities (e.g. company, product) can be associated with a feature. In essence, a feature view represents time-series feature data from a source.

# Define a data source as a file, for example
from feast import FeatureView, Field, FileSource
from feast.types import Float32

company_quarterly_stats = FileSource(
    path="data/company_quarterly_stats.parquet",  # illustrative path
    timestamp_field="event_timestamp",
)

# Define a feature view
company_quarterly_stats_view = FeatureView(
    name="company_quarterly_stats",
    entities=[company],
    schema=[
        Field(name="revenue", dtype=Float32),
        Field(name="profit", dtype=Float32),
    ],
    online=True, source=company_quarterly_stats, tags={},
)


In Feast, feature services allow users to retrieve features as needed, both offline during model training and online during model prediction.

# Initialize a feature store
from feast import FeatureService, FeatureStore

feature_store = FeatureStore(repo_path=".")

# Create a feature service
company_quarterly_fs = FeatureService(
    name="company_quarterly_stats",
    features=[company_quarterly_stats_view],
)

# Retrieve features for online processing
feature_service = feature_store.get_feature_service("company_quarterly_stats")
feature_vector = feature_store.get_online_features(
    features=feature_service,
    entity_rows=[
        {"company_fcc_id": 1000},
        {"company_fcc_id": 1001},
    ],
).to_dict()

# Retrieve features for offline processing (entity_df is a pandas
# DataFrame of entity keys and event timestamps)
training_df = feature_store.get_historical_features(
    features=feature_service, entity_df=entity_df
).to_df()


Our use case: knowledge graphs

At Megagon, we leverage LLMs, knowledge graphs (KGs), and graph learning approaches for tasks such as entity and relation extraction, entity set expansion, and (more broadly) KG development. We learn representations (embeddings), and we use them in various downstream problems such as candidate-job matching. We collaborate internally on these tasks between our research, engineering, and data science teams. We also collaborate externally with partner companies as we apply our research and engineering work.

As such, primary use cases where feature stores come into play for us are (1) collaboration, (2) MLOps, and (3) tech transfer.

Collaboration. Especially in applied research in a specific domain (e.g. HR, healthcare), feature stores become essential for coordination and collaboration. As we pointed out earlier, feature stores can facilitate collaboration among team members, enabling them to work toward the same goal with the help of a common data model and well-defined entities and relationships. Within a team, feature stores become spaces that make both features and feature generation processes shareable. Even teams working on different projects (but sharing data, features, models, or embeddings) can collaborate and reuse each other’s outcomes. A team member can add a new feature to a shared feature store with the simple extra step of including the metadata and tags required to update the feature registry. The feature is then easy for other teams to consume in their models. Features can be published, re-used, and tracked without much human effort.

Without feature stores, features end up stored as individual files in various formats, and the data processing and transformation can be a nightmare for a large team to sort out. Tackling such issues without feature stores creates a lot of redundancy and conflicting tech requirements.

MLOps. MLOps frameworks such as MLflow capture parameters, metrics, and artifacts when training models. Coupled with MLOps frameworks (specifically their experiment tracking and model serving components), feature stores can complete the mission of standardizing data ingestion into and retrieval from the store. Without good integration, each alone fails in that mission, as features of unknown origin, built on unknown underlying datasets and models, can infiltrate feature stores and lead to data quality issues. It may be cumbersome at the start, but as more projects share the same underlying dataset, it becomes easier for projects to share and swap data. A good integration gives the data science team visibility into, and confidence about, the data they have access to.
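One way to picture that integration is sketched below in plain Python (illustrative only, not any particular framework's API): each feature carries the ID of the experiment run that produced it, so features of unknown origin are rejected at registration time.

```python
# Run IDs tracked by a hypothetical experiment tracker.
known_runs = {"run-42", "run-57"}
feature_registry = {}

def register_feature(name, values, run_id):
    # Reject features whose producing experiment is not tracked:
    # this is the provenance check that keeps unknown-origin
    # features out of the store.
    if run_id not in known_runs:
        raise ValueError(f"unknown provenance: {run_id}")
    feature_registry[name] = {"values": values, "run_id": run_id}

register_feature("revenue_growth", [0.1, 0.3], run_id="run-42")

rejected = False
try:
    register_feature("mystery_feature", [1.0], run_id="run-99")
except ValueError:
    rejected = True
```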

Tech Transfer. Across organizations (e.g. between research labs and development groups) and companies (e.g. among partner companies), feature stores can also serve as a tech transfer touch point. With appropriate versioning (and provenance), a set of features can be bundled as a product. With features as a product, feature stores function much like an app store for data- and analytics-oriented companies. 

With regard to the operationalization of ML, we see a strong need for integrating the various pieces of MLOps, since models, data, features, and metadata are distributed across a variety of systems (see figure below). We strongly believe feature stores can be a central place where all valuable data is pulled together. For example, experimental metadata can be captured along with the features. In a similar fashion, high-dimensional indices can be registered along with the entities they represent, and a feature store can even act as a proxy for approximate nearest neighbor (ANN) search over embeddings, keeping track of ids across various systems. Some features can also be created on demand, without being stored, through model prediction. Features can also be bundled as a release, and feature stores can be a front to publish them and keep track of their use.
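For example, a feature store fronting embedding lookup could answer nearest-neighbor queries itself. The sketch below uses exact (brute-force) cosine similarity with NumPy and hypothetical entity ids; a production system would delegate to an ANN index instead:

```python
import numpy as np

# Embeddings keyed by entity id (ids and vectors are illustrative).
ids = [1000, 1001, 1002]
embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.9, 0.1],
])

def nearest(query, k=2):
    # Exact cosine similarity against all stored embeddings
    # (a stand-in for an ANN index behind the feature store).
    q = query / np.linalg.norm(query)
    m = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = m @ q
    top = np.argsort(-sims)[:k]
    return [ids[i] for i in top]

neighbors = nearest(np.array([1.0, 0.05]))
```

The feature store's role here is bookkeeping: it maps the returned internal indices back to entity ids that are consistent across systems.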


Operationalization of ML across data, features, experiments and models for KGs

Tech Survey

Towards the goal of operationalizing machine learning for our use case, we did a tech survey of the feature store landscape. Given how rapidly the tech space moves, we cannot claim it was a comprehensive survey, but it gave us a point-in-time understanding of what is out there. We will not go into a deep analysis in this blog post; essentially, we saw several varieties of applicable technologies, though none completely met all of our needs.

Essentially, the landscape includes: (1) databases, specifically low-latency key-value stores; (2) search engines with support for high-dimensional vectors; (3) high-dimensional indices that directly support ANN search; and (4) full-fledged feature stores, coupled with databases that function as a registry.

Naturally, feature stores come closest to the needed functionality, as they are designed specifically for this purpose. Where they often fall short is in integrating with experiment tracking and model serving, functionality commonly supported by tools like MLflow. If only registry functionality is needed, databases can substitute for feature stores, but careful attention must be given to schema design so that users can keep track of metadata. In a similar fashion, search engines can be used, and some also support high-dimensional search. Specialized high-dimensional indices offer great performance, but they lack support for low-dimensional, traditional features.


In the ecosystem of MLOps tools and systems, feature stores occupy a prime space as the gatekeepers of the data in machine learning. As such, they could play an even larger role in centralizing MLOps functionality under one roof, pulling together all data (e.g. graphs, databases, document collections) and metadata (e.g. schema, experiment, and models) in the ecosystem of machine learning. Doing so would allow for greater access to functionality through common pluggable APIs, such as nearest neighbor search and model serving (e.g. features on demand).

We believe an improved integration of the MLOps tools and systems will greatly increase the compounding factor of productivity and collaboration in data science and research.

Written by: Eser Kandogan, Chen Shen, Aiden Zhao, and Megagon Labs


