Data Version Control for Machine Learning Applications

In the software development world, a project’s most critical component is its code. Code version control is a natural necessity, which is why software developers use tools like Git. In machine learning, however, the lifeblood of a project is its data and its models. Version control for these elements—data version control—is a much more complicated beast.

Deep learning—a branch of machine learning that models its approach after the human brain—has made impressive advances in many Natural Language Processing (NLP) applications. At Megagon Labs, we employ deep learning techniques to develop data-driven solutions for a wide spectrum of NLP-related applications, including:

  • Information extraction
  • Text generation
  • Entity resolution
  • Text classification
  • Reading comprehension

As we conduct experiments, replicability and provenance are integral to sound results. Our research efforts therefore also need to track performance across different hyperparameters and settings. Without the proper tools, this task can be error-prone and overwhelming. Additionally, the sheer volume of data we use to derive our models is massive. Compared to code files, which can be easily tracked by version control tools like Git, data and model files are much larger and may be stored in various remote locations.

To better utilize our time and more easily track the many moving pieces in our NLP research, we have been exploring the latest version control tools applicable to our field and needs. Among those we have explored thus far, DVC shows promise for facing these challenges.

Data Version Control (DVC) is an open-source version control system that extends Git for effective use in the machine learning space. In this article, we’ll touch briefly on the unique features of DVC, the advantages of using DVC for machine learning (ML) projects, and also some of the current limitations of using DVC for real-world applications.

The Unique Features of DVC

First, let’s touch upon some of the features of DVC that make it uniquely situated for managing ML applications.

Git-style commands and architecture

DVC extends the already-popular Git tool, providing similar—and so, familiar—commands.

Users can add a data file to DVC with the command dvc add. Users acquainted with Git know that the push and pull commands ordinarily upload and download files directly to and from the repository. Because data files managed by DVC are stored remotely (not in the repository), dvc push and dvc pull transfer those files to and from the configured remote storage, while Git tracks only the small metafiles that reference them. This analogous, but familiar, style of usage makes for seamless adoption by those with a background in software development. Once installed on the local machine, DVC runs on top of any Git repository.
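A typical first session might look like the following sketch. The file paths and the bucket URL are placeholders; only the commands themselves come from DVC.

```shell
# initialize DVC inside an existing Git repository
dvc init

# track a large data file; DVC writes a small data.csv.dvc metafile
dvc add data/data.csv
git add data/data.csv.dvc data/.gitignore
git commit -m "Track dataset with DVC"

# point DVC at remote storage (bucket name is a placeholder)
dvc remote add -d storage s3://mybucket/dvcstore

# upload the data to the remote; teammates retrieve it with: dvc pull
dvc push
```

Git then versions only the metafile and the remote configuration, while the data itself lives in the bucket.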

The architecture of DVC is shown in the figure below. The most significant characteristic of DVC is that it enables data version control by replacing large data files with small metafiles that reference the remote locations of those files; the metafiles, in turn, are small enough to be handled easily by Git.
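For instance, running dvc add on a file named data.csv produces a metafile, data.csv.dvc, along the following lines (the hash and size shown here are illustrative):

```yaml
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  size: 14445097
  path: data.csv
```

This few-line YAML file is what Git actually versions; the hash identifies the exact data version stored in the remote cache.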


Flexible pipeline for machine learning applications

DVC provides a data pipeline to allow users to track the changes in an ML project. It divides the ML processes into stages, with each stage describing a set of essential behaviors in the project. Let’s consider the application of text classification, for example. Here, a DVC pipeline could be divided into three stages: preprocessing, training, and evaluation. Users can easily modify the stages by editing the configuration files and specifying the hyperparameters for each stage to run the project.
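The three stages above could be declared in a dvc.yaml file roughly as follows. The script names, file paths, and parameter names are illustrative; the stage structure (cmd, deps, params, outs, metrics) is DVC's.

```yaml
stages:
  preprocess:
    cmd: python preprocess.py data/raw.csv data/clean.csv
    deps:
      - preprocess.py
      - data/raw.csv
    params:
      - preprocess.max_seq_len
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv model.pkl
    deps:
      - train.py
      - data/clean.csv
    params:
      - train.learning_rate
      - train.epochs
    outs:
      - model.pkl
  evaluate:
    cmd: python evaluate.py model.pkl data/clean.csv metrics.json
    deps:
      - evaluate.py
      - model.pkl
    metrics:
      - metrics.json:
          cache: false
```

The params entries refer to keys in a params.yaml file, which is where the hyperparameters for each stage are edited.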

The dvc dag command provides a visualization of the pipeline with all of its stages and computational steps; DVC's documentation on command usage includes an example of its output.


Reproducible experiment results

DVC’s data pipeline supports incremental execution, which in turn can yield exact reproduction of experimental results. In the ML research world, this is a game-changer. Once a pipeline is built and executed for the first time, its intermediate results are stored. Upon subsequent executions of the pipeline, DVC only executes those stages whose dependencies or hyperparameters have changed. The ability to segment execution accelerates the process of hyperparameter tuning when conducting experiments.

The results of a pipeline can be easily reproduced with the command dvc repro. Users can track changes to parameters and their results after making modifications with the dvc params diff command. Additionally, users can perform experiments in batches, writing experiment results and their corresponding hyperparameters into a JSON file in one pass.
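In practice, the iteration loop looks something like this sketch (it assumes a project with a dvc.yaml pipeline and a metrics file already in place):

```shell
# rerun the pipeline; only stages whose deps or params changed execute
dvc repro

# compare hyperparameters between the workspace and the last commit
dvc params diff

# show the metrics recorded by the evaluation stage
dvc metrics show
```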

Platform and language agnostic

DVC is platform-agnostic, able to run on Linux, macOS, and Windows. It is also language-agnostic, able to support ML projects, whether they’re written in Python, R, Scala, or other languages. DVC also works independently of the ML libraries (such as TensorFlow or Keras) used in a project. The focus of DVC is on experiment reproducibility and version control for ML data rather than for the lightweight code files (for which Git is already sufficient).

Remote storage support

Lastly, DVC allows users to specify the remote storage location of data files and models for a project. We alluded earlier to the massive size of ML data and model files. This size makes storage within traditional version control platforms (like GitHub or GitLab) infeasible. Instead, ML projects often store data with cloud storage platforms like Amazon S3, Microsoft Azure Blob Storage, and Google Cloud Storage. DVC supports most mainstream cloud storage platforms, connecting project code with remotely stored data sets and models.


The advantages of using DVC

Now that we have covered the core features of DVC, the advantages of using DVC in ML projects are clear. First, users of DVC can separate the management of code from the management of data sets and models. Equipped with DVC as a data management tool, ML researchers can focus their efforts directly on their actual ML work rather than on building custom tools to facilitate the work. Code is properly versioned in a system like Git, while model and data provenance is managed by DVC. Provenance and replicability are further ensured as Git and DVC work together to keep the data, model, and code in sync.

Next, as many data scientists and ML practitioners are already familiar with Git, DVC presents little to no learning curve. Commands and usage are intuitive. Workflow patterns are already understood and easily adopted.

Incremental computation—through DVC’s facilitation of experiment management—vastly increases velocity for ML work and research, in which iterative experiments are the backbone of the process. By only executing those parts of an experiment where parameters have changed, as opposed to repeating the experiment as a whole, DVC users save time and gain focus. An ML project team using DVC can collaborate by examining and reproducing individual parts of an experiment, more effectively managing and dividing the workload.
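The skip-if-unchanged idea behind this incremental computation can be illustrated with a toy sketch. This is not DVC's actual code; it only mimics the principle that a stage reruns when the hash of its inputs changes and is skipped otherwise (DVC hashes file contents, MD5 by default).

```python
import hashlib


def content_hash(content: bytes) -> str:
    # DVC fingerprints inputs by hashing their contents (MD5 by default).
    return hashlib.md5(content).hexdigest()


class Stage:
    """Toy pipeline stage: reruns only when its input's hash changes."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self._last_hash = None  # hash of the inputs at the last run
        self.runs = 0           # how many times the stage actually executed

    def run(self, content: bytes) -> str:
        h = content_hash(content)
        if h == self._last_hash:
            return "skipped"    # inputs unchanged: reuse the cached result
        self._last_hash = h
        self.runs += 1
        self.func(content)      # do the (expensive) work
        return "ran"


train = Stage("train", lambda c: None)
print(train.run(b"data-v1"))    # first execution: runs
print(train.run(b"data-v1"))    # same input: skipped
print(train.run(b"data-v2"))    # input changed: runs again
```

Replacing the single hash with one hash per dependency and parameter gives, in miniature, the bookkeeping that lets dvc repro execute only the stages that need it.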

Lastly, DVC brings effective tracking of experiment results with the help of JSON files. Within DVC, results are coupled to the data sets, models, and hyperparameters that produced them. As the inputs change and are tracked, the resulting outputs are also tracked. Users can issue commands like dvc metrics to show results as scalar numbers or dvc plots to visualize data series. What’s more, commands like dvc metrics diff and dvc plots diff can highlight and visualize the differences between data commits.


The ongoing limitations of DVC for real-world applications

While the advantages of using DVC in Megagon Labs applications are substantial, the present functionality of DVC has certain limitations. Customarily, Git commits are identified by their hashes (for example, 8a2bf39c). This is the case in DVC as well. More helpful within the field of ML research, however, is the referencing of data versions through unique tags (for example, bigram-baseline and trigram-baseline). Presently, DVC users still need to apply tags to versions manually and individually.

Second, DVC is a relatively young tool in the software world (v1.0.0 released in June 2020, and v2.0.0 released in March 2021), and its usability depends on its users’ comfort levels at the command line. While working at the command line may leave some ML practitioners undeterred, the present lack of a graphical user interface for DVC may be a deal-breaker for others.

There is also the concern of data and model files being used in multiple applications. Tying a data file to a single repository is limiting when that file is needed across multiple repositories. Nor is DVC yet a silver bullet for MLOps: service building, deployment, and tracking production environments still present challenges it does not meet.

Lastly, at Megagon Labs, we continue to wrestle with the limited functionality of DVC as it applies to Python APIs. DVC is an open-source project with a growing community, so this limitation may improve with time. For the time being, however, this limitation impacts the effectiveness of DVC in supporting complicated application scenarios, such as grid search for tuning hyperparameters.



Machine learning—particularly in deep learning for NLP applications—is an extremely complicated field. The complexity, however, ought to be in the research itself rather than in the tools needed to facilitate the research. While code version control has solved the problem of managing lightweight code files as they evolve, the data version control problem in ML is not so straightforward. DVC shows strong potential for tackling this problem with its support of features like Git-style commands, experiment reproducibility, and remote cloud storage. Though DVC still lacks the features to make it a full-fledged MLOps solution, its contributions toward experiment replicability and data provenance bring significant gains to the field.

Written by Jin Wang, Alexander Whedon and Megagon Labs
