What MLflow Solves and Doesn’t Solve for Us

As a growing research lab focused on Natural Language Processing, database management and data augmentation, Megagon Labs’ engineering team is always looking to improve its MLOps to better track experimental data, parameters, models and metrics. To find the right tool, we decided to hold a “bake-off”: a process in which engineers work together to compare technologies in search of the best solution for our operations.

In our bake-off we compared a number of MLOps platforms, both for our research and for our applied projects. Given the size of our lab, the bake-off was not an exhaustive study of all MLOps platforms. We did, however, include a few of our research scientists in the process, since they would also use and benefit from MLOps improvements. More and more startups are emerging in this space, which I believe speaks to its importance in machine learning practice today.

My bake-off group tried MLflow on a pet project, extractive summarization using TextRank, and also applied it to a research project that uses transformers for monolingual word alignment. Since then we have started using it more formally for an applied research project on NLG-based content generation and evaluation.

I have found MLflow to be really useful and simple, particularly for provenance. It was becoming increasingly important for us, even in research projects, to collect and keep track of project metadata, e.g. the data, parameters, code, models, and metrics associated with each run.

Below I describe what I like and what I don’t like so far.

What I like about MLflow is that it is...

  • easy to get going. It is very quick and easy to start with MLflow: import a few MLflow libraries, add a project description file that specifies the environment (e.g. Conda, Docker) and entry points (e.g. params, code to run), and that is it. Bring up the web UI that comes with MLflow and examine parameters, metrics, etc. The ROI is very quick, mostly thanks to its integration with commonly used machine learning libraries, such as PyTorch, TensorFlow, Scikit-learn, and many others, which automatically persists metrics, models, and other artifacts. You can also track additional metrics manually by simply calling the relevant Python functions (e.g. log_param, log_metric, log_artifact).
  • non-opinionated. MLflow is fairly non-opinionated about almost everything. You can organize code however you want, use practically any ML/NLP library (you can even implement models as Python functions), launch multiple runs in one program, use local or server mode, and use or develop your own plugins for model storage, tracking store, execution backend, and beyond.
  • super easy to deploy models as services. This feature is just amazing. With almost no additional work you can turn your models, even a Python function model, into web services just by calling mlflow models serve! Optionally, you can specify the input and output signatures of your services. This is particularly useful if your organization would like to create a playground where people can experiment with in-house models.

What I don’t like about MLflow is that...

  • the path from development to production could be better. MLflow supports versioning and tagging, which you can use to mark a model's production stage. But as a mode of operation I would favor a model where runs (their artifacts, metrics, params, etc.) are pushed from a development environment to production (a remote MLflow server). I am arguing for a model like git with remotes. This is certainly possible today: you can run MLflow both locally and on a server, capture runs locally first, and then push a run to the remote server once it is confirmed. A model like this just keeps things tidier. A smaller issue: authentication could also be supported directly in MLflow, which would make production deployment easier.


  • data version control is not directly supported. MLflow does a good job of keeping track of what goes in and out, but organizations in general need more in terms of data versioning, especially for shared datasets, updates, etc. Integration with DVC (also see our blog post about DVC) could be an option. There is also a storage angle to this, which I cover below.


  • storage gets pretty large, pretty quickly. This is to be expected, since MLflow captures everything, every time, even when the experiment fails to run. Hence it makes sense to have policies on when to purge runs and artifacts from the server, perhaps using tags for that purpose, and to periodically delete or archive old experiments that are no longer relevant. It may also make sense to use filesystems that support deduplication (e.g. Btrfs).


  • getting a complete view might be hard. MLflow is centered around a ‘run’ as its primary construct. Organizations typically use the same datasets and models in several projects, though. In that sense it is not easy to get a complete picture of which projects use which model, or a complete lineage of data across projects (think of a complete graph of inputs and outputs). MLflow supports search, so it is possible to get the necessary information, but I expected a better user experience for tracking across runs and across projects.
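
The purge policy suggested in the storage point could look something like the sketch below. It is plain Python and does not call MLflow at all: RunRecord is a hypothetical stand-in for the run metadata a tracking server returns, and the policy itself (a "keep" tag, a 90-day age limit, immediate purge of failed runs) is my assumption, not something MLflow prescribes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class RunRecord:
    """Hypothetical, minimal stand-in for a tracked run's metadata."""
    run_id: str
    status: str          # e.g. "FINISHED" or "FAILED"
    end_time: datetime
    tags: dict = field(default_factory=dict)


def select_runs_to_purge(runs, now, max_age=timedelta(days=90), keep_tag="keep"):
    """Return the IDs of runs that the retention policy would purge.

    Assumed policy: runs tagged keep=true are never purged; failed runs
    are purged right away; all other runs are purged once older than max_age.
    """
    purge = []
    for run in runs:
        if run.tags.get(keep_tag) == "true":
            continue  # explicitly protected by tag
        if run.status == "FAILED" or now - run.end_time > max_age:
            purge.append(run.run_id)
    return purge
```

The returned IDs could then be handed to MlflowClient.delete_run, which marks runs as deleted, followed by the mlflow gc command to permanently remove them and free the storage. Keeping the selection logic separate from the deletion makes it easy to dry-run a policy before anything is actually removed.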

MLflow for a research lab

Despite these gaps, MLflow is a great fit for industrial research labs in general. Provenance and reproducibility are typically top concerns, and MLflow handles both well. Its ease of use and quick start are just great. There are some gaps, but they are not that hard to close given how extensible it is. For more production-oriented organizations MLflow may not address every need, for example production model monitoring. However, MLflow can still serve as an essential part of the overall MLOps framework.

Written by Eser Kandogan and Megagon Labs

