
Introduction
Machine Learning doesn't stop at model training. In production-ready ML projects, we need logging, version control, reproducible pipelines, deployment, monitoring, and governance – in short: MLOps.
What is MLOps and do I need it? If you're asking yourself this question, I can recommend the following summary: https://neptune.ai/blog/mlops
Two popular tools that fulfill MLOps requirements are MLflow and Vertex AI. However, they follow fundamentally different philosophies.
In this article, we analyze both tools from a technical perspective – with a focus on architecture, integration capabilities, pipelines, and real-world deployment.
The article starts with a brief overview of both products, so that I can then go into my own experiences and the problems I ran into in the later sections.
If you're already familiar with both tools, jump directly to my learnings below.
Architecture Overview
MLflow
MLflow is an open-source framework originally developed by Databricks. It's flexible, lightweight, and can be integrated into various ML stacks. It consists of four core components:
- Tracking: Logging API for metrics, parameters, and artifacts. Supports both local runs and remote tracking servers (e.g., via mlflow server). Backends: file system or SQL databases (e.g., SQLite, MySQL, PostgreSQL).
- Projects: Structures code as reusable ML packages via an MLproject YAML file. Supports Conda and Docker environments.
- Models: Standard format for model persistence (mlflow.pyfunc as a generic interface). Export to e.g., ONNX or TensorFlow SavedModel, or even deployment to AWS SageMaker and Azure ML, is possible.
- Model Registry: HTTP API and UI for model versioning, staging/production promotion, and CI/CD integration.
MLflow is not limited to one ecosystem. You can work with PyTorch, Scikit-learn, or HuggingFace – locally, on-premises, or in the cloud.
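To make the components concrete, here is a minimal sketch of a tracked run – assuming a locally running tracking server at http://localhost:5000; the experiment name and parameters are placeholders, and scikit-learn is used purely as an example:

```python
# Minimal MLflow tracking sketch (assumes `pip install mlflow scikit-learn`
# and a tracking server reachable at http://localhost:5000).
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://localhost:5000")  # assumption: server already running
mlflow.set_experiment("iris-demo")                # placeholder experiment name

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    params = {"C": 1.0, "max_iter": 200}
    model = LogisticRegression(**params).fit(X_train, y_train)

    mlflow.log_params(params)                               # Tracking: parameters
    acc = accuracy_score(y_test, model.predict(X_test))
    mlflow.log_metric("accuracy", acc)                      # Tracking: metrics
    mlflow.sklearn.log_model(model, artifact_path="model")  # Models: pyfunc-compatible artifact
```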
Advantages:
- Open source and locally deployable
- Cloud-agnostic
- Large community and many integrations (PyTorch, TensorFlow, XGBoost, etc.)
Disadvantages:
- No integrated compute backend or AutoML
- Scaling and hosting must be organized independently
Vertex AI
→ One Platform to Rule Them All
Vertex AI is a fully managed MLOps platform from Google Cloud that unifies numerous ML services under a consolidated interface. It supports the entire machine learning lifecycle – from data preparation through training to deployment and monitoring.
Vertex AI offers six central components:
- Workbench: JupyterLab-based development environment for prototyping and training – either via notebook instances or custom containers. Tight integration with GCP services like GCS and BigQuery.
- Training: Flexible training with manual jobs or AutoML. Support for custom training pipelines (e.g., with TensorFlow or PyTorch), executed on scalable, serverless resources.
- Pipelines: Workflow orchestration based on the Kubeflow Pipelines SDK or TFX. Enables repeatable and versioned ML processes with GCP-native integration.
- Model Registry: Central management and versioning of models. Support for staging, deployment, and model releases; automatic integration into CI/CD workflows is possible.
- Prediction: Managed online and batch inference with automatic scaling. Models can be deployed as AutoML output or in custom containers.
- Model Monitoring & Explainability: Integrated monitoring of predictions (e.g., data drift, outlier detection) as well as tools for explainable AI (feature attribution, bias detection).
Vertex AI is deeply embedded in the Google Cloud ecosystem – including GCP IAM, Vertex Feature Store, Cloud Logging, GCS, and Pub/Sub – and is based on a Kubernetes-native architecture.
Advantages:
- Fully integrated into Google Cloud
- Scalable infrastructure "out of the box"
- Supports both code-first and no-code approaches
Disadvantages:
- Strong Google Cloud dependency
- Complex onboarding for smaller teams
- Dependent on GCP pricing model
Components Under the Microscope
In the following chapters, I'll share some practical experiences. I've used both local MLflow setups and MLflow in Databricks environments. Other projects, in turn, were implemented entirely with Vertex AI, since the data foundation was already part of the Google Cloud infrastructure.
1. Operations and Scaling
Architecture
MLflow
The MLflow tracking server is relatively easy to install and test locally. For production, however, a local setup is not recommended; instead, a deployment on Kubernetes with a SQL database as the backend (e.g., PostgreSQL or MySQL) is the usual choice.
Since I already had experience with Kubernetes, this was my first approach. However, this was only partially successful. While MLflow was relatively easy to use, the entire infrastructure caused significant overhead. Some MLflow functions couldn't be implemented in a local deployment – more on this in the following chapters.
The clear favorite for using MLflow is Databricks. Since MLflow was originally created by Databricks, the integration into the Databricks platform is seamless. Databricks handles the entire infrastructure and scaling, which simplifies the work considerably – but it also brings a decisive disadvantage that I'll address later.
Vertex AI
Vertex AI is natively integrated into Google infrastructure and fully managed. It runs on a Kubernetes-native structure at Google.
Setup is very simple: you only choose a region and zone (or even multiple regions), and Google handles the load balancing. The platform also takes care of instance scaling and adjusts it automatically as needed.
Costs
MLflow
Since MLflow itself is open source, there are no direct licensing costs. However, in your own infrastructure, costs for compute, storage & network arise.
With a managed Databricks instance, a per-cluster or per-job billing model applies. Costs are highly variable and depend on the respective usage scope.
I'll go into this point in more detail in the "Lessons Learned" chapter.
Vertex AI
Google's cost model is typically based on the pay-as-you-go principle. All services like training, storage, pipelines, and endpoints are billed separately. Therefore, clean configuration is necessary from the beginning to prevent permanently running GPU instances from driving up costs. The same applies to the managed service at Databricks.
Pipelines
MLflow
MLflow has no built-in pipeline orchestration. Databricks closes this gap within its own platform by integrating notebooks and orchestrating them as Databricks Jobs. I, however, preferred to combine MLflow with Airflow, since existing workflows were already implemented with it.
Alternatively, solutions like Prefect or Dagster can be used. Logging and monitoring may need to be implemented independently.
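As an illustration of that combination, here is a hedged sketch of an Airflow DAG whose task logs to MLflow – assuming Airflow 2.x (2.4+ for the schedule argument) and a tracking server reachable from the workers; the URL, schedule, and metric are placeholders:

```python
# Sketch: wrapping an MLflow-tracked training step in an Airflow task.
from datetime import datetime

import mlflow
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def mlflow_training_pipeline():
    @task
    def train_and_log():
        mlflow.set_tracking_uri("http://mlflow.internal:5000")  # placeholder server URL
        with mlflow.start_run(run_name="airflow_daily_training"):
            # ... actual training would happen here ...
            mlflow.log_metric("rmse", 0.42)  # placeholder metric

    train_and_log()


mlflow_training_pipeline()
```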
Vertex AI
Vertex AI Pipelines are based on Kubeflow Pipelines and are tightly integrated into Google Cloud Platform (GCP). Each pipeline consists of containerized components, which promotes reusability.
Monitoring, logging, and model registration are automatically connected.
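For comparison, a hedged sketch of a minimal Vertex AI pipeline built with the Kubeflow Pipelines SDK v2 – project, region, bucket, and display names are placeholders, and the component does nothing useful except illustrate the containerized structure:

```python
# Sketch: compile a KFP v2 pipeline and submit it to Vertex AI Pipelines
# (assumes `pip install kfp google-cloud-aiplatform` and GCP credentials).
from kfp import compiler, dsl
from google.cloud import aiplatform


@dsl.component(base_image="python:3.11")
def train(learning_rate: float) -> float:
    # Each component runs as its own container on Vertex AI.
    print(f"training with lr={learning_rate}")
    return 0.93  # placeholder metric


@dsl.pipeline(name="demo-training-pipeline")
def pipeline(learning_rate: float = 0.01):
    train(learning_rate=learning_rate)


compiler.Compiler().compile(pipeline_func=pipeline, package_path="pipeline.yaml")

aiplatform.init(project="my-gcp-project", location="europe-west3",
                staging_bucket="gs://my-pipeline-bucket")  # placeholders
job = aiplatform.PipelineJob(display_name="demo-training-pipeline",
                             template_path="pipeline.yaml")
job.run()  # metadata, logging, and lineage are captured automatically
```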
2. Data Governance and Feature Management
Feature Store
MLflow
A feature store can be a useful addition in large projects to ensure consistency and data quality. It should be carefully considered whether external tools are required for this. In many cases, simple tables can already provide a sufficient and uncomplicated data foundation.
A feature store can be integrated independently of MLflow. External solutions like Feast or Hopsworks are well-suited for this. In a managed MLflow environment like Databricks, the in-house feature store can also be used, which works seamlessly with MLflow.
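To give an idea of what such an external integration looks like, here is a hedged sketch using Feast – it assumes an already configured Feast repository (feature_store.yaml), and the entity, feature view, and feature names are placeholders:

```python
# Sketch: pulling point-in-time-correct training features from Feast.
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # directory containing feature_store.yaml

# Entities and timestamps for which training features are needed.
entity_df = pd.DataFrame({
    "customer_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2024-05-01", "2024-05-01"]),
})

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["customer_stats:avg_order_value", "customer_stats:order_count"],
).to_df()
```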
Vertex AI
Vertex AI has its own feature store with integrated support for online and offline access. This allows the entire feature process to be well centralized. However, it should also be carefully considered here whether using a feature store is really necessary in the specific project.
A positive point worth noting: the platform automatically ensures that features stay consistent between training and inference. This significantly reduces possible feature drift – and saves a lot of headaches in daily work.
Data Lineage and Reproducibility
MLflow
Through the MLproject format and Git SHA, you can track which code version belongs to which run. This allows good code traceability.
Data like CSVs or Parquet files must be versioned separately – e.g., with Git. In Databricks, you can use Delta Lake directly for this.
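A small, hedged sketch of how runs can be pinned to a data snapshot in MLflow – the dataset path is a placeholder, and MLflow records the Git commit automatically when the code runs from a checked-out repository:

```python
# Sketch: tying an MLflow run to an exact data snapshot via a file hash.
import hashlib
import mlflow

DATA_PATH = "data/train.parquet"  # placeholder path

with open(DATA_PATH, "rb") as f:
    data_sha256 = hashlib.sha256(f.read()).hexdigest()

with mlflow.start_run():
    mlflow.set_tag("data_path", DATA_PATH)
    mlflow.set_tag("data_sha256", data_sha256)  # which snapshot this run was trained on
    # ... training and metric logging as usual ...
```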
Vertex AI
Vertex AI is closely connected to GCS and BigQuery – this allows you to version your datasets relatively easily.
Each step in a Vertex AI pipeline automatically writes metadata. This makes it much easier to trace individual transformations.
To say it plainly once: Vertex AI works best when the data foundation already lives in BigQuery or a data lake on GCS.
3. CI/CD and Automated Deployment
Automatic Model Promotion
MLflow
Through the MLflow Model Registry, you can version models (e.g., Champion/Challenger) and specifically promote them to staging or production.
CI/CD can be implemented with tools like Jenkins or GitLab CI. In Databricks, Databricks Asset Bundles are also worth using, among other things.
In a Databricks environment, you can deploy models directly on a Managed MLflow Serving Endpoint – or alternatively export them, e.g., for a SageMaker or Azure ML endpoint.
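A hedged sketch of such a promotion step, as it might run inside a CI job – it assumes MLflow 2.x with the Model Registry enabled; the run ID, model name, and alias are placeholders:

```python
# Sketch: register a run's model and promote it via a registry alias.
import mlflow
from mlflow import MlflowClient

client = MlflowClient()

run_id = "abc123"  # placeholder: the run that produced the candidate model
model_uri = f"runs:/{run_id}/model"
version = mlflow.register_model(model_uri, name="churn-classifier")

# Promote the challenger once validation has passed, e.g. from a CI job.
client.set_registered_model_alias(
    name="churn-classifier",
    alias="champion",
    version=version.version,
)
```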
Vertex AI
Vertex AI also offers a comparable solution for model management with the Model Registry.
Automated CI/CD pipelines – for example with Cloud Build or GitHub Actions – can deploy new model endpoints and replace the existing version after successful validation.
A rollback is straightforward: either by clicking in the console or via a short script. The underlying infrastructure is completely managed by Google.
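For comparison, a hedged sketch of the equivalent step with the google-cloud-aiplatform SDK – project, region, bucket, display names, and the serving container are placeholders; pick a prebuilt image that matches your framework and version:

```python
# Sketch: upload a model to the Vertex AI Model Registry and deploy it to an endpoint.
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="europe-west3")

model = aiplatform.Model.upload(
    display_name="churn-classifier",
    artifact_uri="gs://my-model-bucket/churn/v3/",  # exported model artifacts
    serving_container_image_uri=(
        "europe-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest"  # placeholder image
    ),
)

endpoint = aiplatform.Endpoint.create(display_name="churn-endpoint")
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-2",
    traffic_percentage=100,  # for canary rollouts, split traffic instead
)
```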
💡 DISCLAIMER: Before a model goes into production, valid quality assurance should be in place!
Validation steps like accuracy, precision, or drift checks help ensure only robust models are promoted. This is especially true when data conditions fluctuate strongly or schemas change regularly – targeted model and data tests are worthwhile here, e.g., with Great Expectations, Deepchecks, or custom Pytest extensions.
The entire pipeline can be automated – from code change to rollback:
Code Change → Training → Evaluation → Registry → Canary Deployment → Monitoring → possibly Rollback.
4. Monitoring
MLflow
MLflow only brings basic functions for experiment tracking out of the box – for example for metrics during training.
For real model monitoring in production – for example for detecting drift or outliers – you need external systems like Prometheus, Grafana, or the ELK Stack. These can be integrated via custom pipelines or inference services.
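As an illustration of the kind of custom check you end up writing around MLflow yourself, here is a hedged sketch of a simple feature-drift test using a Kolmogorov–Smirnov statistic – the file paths are placeholders, the columns are assumed to be numeric, and the threshold is deliberately crude:

```python
# Sketch: naive feature-drift check comparing live traffic against the training distribution.
import pandas as pd
from scipy.stats import ks_2samp

reference = pd.read_parquet("monitoring/train_features.parquet")   # training distribution
current = pd.read_parquet("monitoring/last_24h_features.parquet")  # recent traffic snapshot

for column in reference.columns:
    statistic, p_value = ks_2samp(reference[column], current[column])
    if p_value < 0.01:  # crude threshold; tune to your alerting tolerance
        print(f"possible drift in '{column}' (KS statistic={statistic:.3f})")
```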
In Databricks, you can additionally use jobs, Delta Live Tables, and integrated monitoring for cluster resources. However, external tools are also needed here for deeper metric analyses during operation.
You don't have to rely solely on Databricks' built-in logging, though. In some projects, we also extracted metrics, features, or model descriptions into our own tables so that we wouldn't lose the overview.
Vertex AI
Vertex AI offers integrated monitoring – including automatic detection of data and concept drift.
The platform analyzes statistical changes and provides alerts directly via Cloud Monitoring & Logging.
One feature I always found helpful in my daily work: the integrated explainability, based on Shapley values, shows which features currently contribute most to a prediction – even during live operation.
5. Multi-Cloud and Hybrid Scenarios
MLflow
A major advantage of MLflow is independence from infrastructure. You can train models in AWS, validate them on-premises, and then deploy them somewhere else entirely – without having to change tooling.
Artifact stores like S3, GCS, or Azure Blob can be flexibly exchanged. Backend tracking (e.g., SQLite, MySQL, PostgreSQL) is also freely selectable – depending on environment and requirements.
Databricks brings MLflow natively and supports multi-cloud operation on AWS, Azure, and GCP. Functions remain largely the same – differences are mainly in setting up storage and compute.
Vertex AI
Vertex AI can basically also handle on-premises data sources – for example via VPC Peering, Cloud VPN, or Private Service Connect.
Nevertheless, deployment always takes place in Google Cloud, and you have only limited control over the underlying components. For hybrid or multi-cloud strategies this is a real limitation – if multi-cloud is a hard requirement, Vertex AI is not the right fit.
6. Best Practices
Team Structures & Organizational Culture
MLflow
In projects where I used MLflow, there was often a lot of responsibility directly with the Data Scientists and DevOps teams. We had to build security, monitoring, and CI/CD ourselves – which meant more freedom but also more coordination effort.
The open-source character was a real advantage: We could quickly fall back on community solutions or establish our own standards. This worked well especially in agile teams with a lot of personal responsibility. However, a lot of freedom and few standards also bring problems. It often felt more like tinkering.
Vertex AI
In projects with Vertex AI, a central cloud team was usually involved – especially in larger enterprise setups.
The learning curve was often steeper here, especially because of the more complex IAM and network configurations. But once the GCP processes were set up cleanly, multiple teams could efficiently access the same services and resources. This paid off in the long run.
For smaller projects with a lot of freedom, I would always go with Vertex AI. It's a perfect all-in-one solution for quick success.
Lessons Learned & Pitfalls
MLflow
What I learned early with MLflow: Without clean naming conventions, tracking quickly becomes confusing – especially with many runs and experiments.
In self-hosted setups, we had to deal intensively with security and network topics. Especially in smaller teams, that effort is often not worth the working time.
On Databricks, much was significantly more relaxed since many things work out-of-the-box. But here too, you should keep costs in mind – especially with multiple simultaneously running clusters, it can quickly become expensive.
Vertex AI
With Vertex AI, I experienced how easy it is to fall into cost traps – for example through incorrectly chosen instance types or forgotten endpoints that continue running in the background.
Another point: With very complex pipelines, debugging is sometimes tricky. Google's strong abstraction makes many things easier, but it can also make troubleshooting more difficult when something doesn't run as expected.
Performance Tuning
MLflow
When many experiments run in parallel, it's worth switching to batch logging. In a project with thousands of runs, this noticeably improved performance.
Tuning the underlying database (e.g., PostgreSQL) or data stores was also a decisive factor in larger setups. We worked with partitions, for example, to reduce access problems.
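A hedged sketch of the batch logging mentioned above, using the low-level client API – it assumes MLflow 2.x; the metric values are placeholders:

```python
# Sketch: batching metric writes with MlflowClient.log_batch instead of
# one request per log_metric call (fewer round-trips to the tracking server).
import time

import mlflow
from mlflow import MlflowClient
from mlflow.entities import Metric

client = MlflowClient()

with mlflow.start_run() as run:
    timestamp = int(time.time() * 1000)
    metrics = [
        Metric(key="loss", value=loss, timestamp=timestamp, step=step)
        for step, loss in enumerate([0.9, 0.6, 0.4, 0.3])  # placeholder values
    ]
    client.log_batch(run_id=run.info.run_id, metrics=metrics)
```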
Vertex AI
In Vertex AI, I found: Choosing the right hardware is crucial. An expensive GPU is useless if the model doesn't utilize it.
For scalable training, I had good experiences with distributed training jobs – for example with TensorFlow on multiple VMs. What was always important: First plan scaling cleanly, then scale up – otherwise it becomes unnecessarily expensive.
Practical Tip from Project Experience
Before deciding on a platform, a small proof of concept is worthwhile. I had good experiences setting up a simple pipeline, versioning a small model, and specifically paying attention to the following points:
- How complex is the setup?
- How transparent are the costs?
- How well can security be configured?
- How does handling feel in daily use?
Especially with more complex MLOps projects, it often only becomes clear during concrete construction which tool really fits your team and existing infrastructure. Implementation effort is often more decisive than features.
Conclusion: Your Use Case Decides
MLflow is particularly suitable for:
- Startups or small teams with their own infrastructure
- Projects with high flexibility requirements
- Teams that want to remain independent of a cloud provider
Vertex AI is ideal for:
- Companies already heavily invested in GCP
- Projects with high scaling requirements
- Teams that need a fully integrated MLOps ecosystem