Experiment Tracking and Versioning during Model Training
Experiment Tracking
The following is a short list of things you might want to consider tracking for each experiment during its training process:
• The loss curve corresponding to the train split and each of the eval splits.
• The model performance metrics that you care about on all nontest splits, such as accuracy, F1, and perplexity.
• The log of corresponding samples, predictions, and ground truth labels. This comes in handy for ad hoc analytics and sanity checks.
• The speed of your model, evaluated by the number of steps per second or, if your data is text, the number of tokens processed per second.
• System performance metrics such as memory usage and CPU/GPU utilization. They’re important to identify bottlenecks and avoid wasting system resources.
• The values over time of any parameter and hyperparameter whose changes can affect your model’s performance, such as the learning rate if you use a learning rate schedule; gradient norms (both globally and per layer), especially if you’re clipping your gradient norms; and weight norm, especially if you’re doing weight decay.
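As a rough illustration, here is a minimal sketch of what per-step logging could look like inside a PyTorch training loop. The `tracker.log` call is a placeholder for whichever tracking client you use, and the model, optimizer, scheduler, and dataloader are assumed to already exist.

```python
import time
import torch

def train_one_epoch(model, optimizer, scheduler, train_loader, tracker, max_grad_norm=1.0):
    """Hypothetical training loop that logs the metrics listed above."""
    model.train()
    for step, (inputs, labels) in enumerate(train_loader):
        start = time.time()
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        loss.backward()

        # clip_grad_norm_ returns the global gradient norm before clipping
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()

        step_time = time.time() - start
        # `tracker.log` stands in for mlflow.log_metrics, wandb.log, etc.
        tracker.log({
            "train/loss": loss.item(),
            "train/grad_norm": grad_norm.item(),
            "train/learning_rate": scheduler.get_last_lr()[0],
            "perf/steps_per_second": 1.0 / step_time,
        }, step=step)
```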
Popular & Widely Used Experiment Tracking Tools
These are mature, well-supported, and widely adopted.
1. MLflow
- Why it's good: Open source, flexible, integrates with most ML/DL frameworks, good for experiment tracking, model registry, and reproducibility.
- Best for: Teams that want open-source control and modularity.
- Drawbacks: UI is basic; can require setup overhead.
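A minimal sketch of logging a run with MLflow (the experiment name and values are placeholders; by default results land in a local `mlruns/` directory):

```python
import mlflow

mlflow.set_experiment("demo-experiment")  # placeholder experiment name

with mlflow.start_run():
    # Hyperparameters and losses below are stand-in values
    mlflow.log_params({"lr": 1e-3, "batch_size": 32})
    for epoch, loss in enumerate([0.9, 0.6, 0.4]):
        mlflow.log_metric("train_loss", loss, step=epoch)
```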
2. Weights & Biases (W&B)
- Why it's good: Rich UI, real-time logging, collaborative dashboards, integrates with most ML/DL tools.
- Best for: Teams that want ease of use, visualization, and collaboration.
- Free tier: Generous, but enterprise features are paid.
- Drawbacks: Hosted by default (can be self-hosted, but takes work).
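A minimal W&B sketch, assuming you've already authenticated with `wandb login` (the project name and values are placeholders):

```python
import wandb

run = wandb.init(project="demo-project", config={"lr": 1e-3, "batch_size": 32})
for epoch, loss in enumerate([0.9, 0.6, 0.4]):  # stand-in losses
    wandb.log({"train_loss": loss, "epoch": epoch})
run.finish()
```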
3. Comet
- Why it's good: Similar to W&B, also with strong visualization tools and a model registry.
- Best for: Those needing flexible logging and visual comparison of experiments.
- Bonus: Some users prefer Comet’s UI and offline mode to W&B’s.
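A minimal Comet sketch (the API key and project name are placeholders):

```python
from comet_ml import Experiment

experiment = Experiment(api_key="YOUR_API_KEY", project_name="demo-project")
experiment.log_parameters({"lr": 1e-3, "batch_size": 32})
for epoch, loss in enumerate([0.9, 0.6, 0.4]):  # stand-in losses
    experiment.log_metric("train_loss", loss, step=epoch)
experiment.end()
```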
🧪 More Specialized / Newer Tools
4. Neptune.ai
- Why it's good: Strong metadata management, easy versioning, nice dashboard.
- Best for: Research teams or production teams needing clear recordkeeping.
- Drawbacks: Smaller community compared to W&B or MLflow.
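A minimal Neptune sketch using the 1.x API (the project name and token are placeholders):

```python
import neptune

run = neptune.init_run(project="my-workspace/demo", api_token="YOUR_TOKEN")
run["parameters"] = {"lr": 1e-3, "batch_size": 32}
for loss in [0.9, 0.6, 0.4]:  # stand-in losses
    run["train/loss"].append(loss)
run.stop()
```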
5. Guild AI
- Why it's good: CLI-first, lightweight, good for scripting-heavy or hardcore DevOps-style ML workflows.
- Best for: DevOps-savvy ML engineers who want tight control.
- Drawbacks: Minimal GUI, less "out-of-the-box" than others.
6. DVC (Data Version Control) + CML (Continuous ML)
- Why it's good: Git-like experience for versioning data, models, and experiments.
- Best for: MLOps teams already using Git extensively.
- Drawbacks: More setup and integration needed; less interactive than W&B/Comet.
7. ClearML
- Why it's good: All-in-one suite for experiment tracking, orchestration, and model deployment.
- Free tier: Very generous, and can self-host.
- Drawbacks: Slightly steeper learning curve.
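A minimal ClearML sketch, assuming a configured `clearml.conf` or the hosted server (project and task names are placeholders):

```python
from clearml import Task

task = Task.init(project_name="demo-project", task_name="baseline-run")
task.connect({"lr": 1e-3, "batch_size": 32})  # logs hyperparameters
logger = task.get_logger()
for epoch, loss in enumerate([0.9, 0.6, 0.4]):  # stand-in losses
    logger.report_scalar(title="loss", series="train", value=loss, iteration=epoch)
task.close()
```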
🏢 Enterprise / Cloud-Integrated Tools
8. SageMaker Experiments
- Why it's good: Native to AWS, good for those using SageMaker Studio.
- Best for: Teams all-in on AWS.
- Drawbacks: Less useful outside of the AWS stack.
9. Vertex AI (Google Cloud)
- Why it's good: Fully integrated into Google Cloud with experiment tracking and managed services.
- Best for: GCP-based workflows.
- Drawbacks: Limited outside GCP; costs can add up.
10. Azure ML
- Why it's good: Integrated experiment tracking and pipelines, good for enterprise teams.
- Best for: Microsoft ecosystem users.
🧩 Summary Table
| Tool | Open Source | UI Quality | Cloud Integration | Best For |
|---|---|---|---|---|
| MLflow | ✅ | ⭐⭐ | ✅ (via plugins) | Modular setups, OSS fans |
| Weights & Biases | ❌ (free tier) | ⭐⭐⭐⭐ | ✅ | Research & collab-heavy teams |
| Comet | ❌ (free tier) | ⭐⭐⭐⭐ | ✅ | Deep experiment analysis |
| Neptune.ai | ❌ (free tier) | ⭐⭐⭐ | ✅ | Metadata-heavy workflows |
| ClearML | ✅ | ⭐⭐⭐ | ✅ | Full MLOps stack |
| DVC + CML | ✅ | ⭐⭐ | ✅ (via GitHub CI/CD) | Versioning-heavy workflows |
| SageMaker Experiments | ❌ | ⭐⭐⭐ | AWS-native | AWS-only users |
| Vertex AI | ❌ | ⭐⭐⭐⭐ | GCP-native | GCP-based teams |
----------------
Versioning
ML systems are part code, part data, so you need to version not only your code but also your data.
1. DVC (Data Version Control)
- What it does: Git-style versioning for datasets and models. Tracks data files and connects them to code commits.
- Why it's good: CLI-based, integrates with Git, supports remote storage (S3, GCS, etc.).
- Bonus: Can be used alongside CML for CI/CD workflows.
- Best for: ML engineers and teams already using Git.
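A minimal sketch of pulling a specific data version through DVC's Python API; the repo URL, file path, and tag are placeholders, and the file is assumed to have been tracked with `dvc add` and committed to Git:

```python
import dvc.api

# Read a DVC-tracked file exactly as it was at Git tag "v1.0"
with dvc.api.open(
    "data/train.csv",                           # placeholder path
    repo="https://github.com/example/project",  # placeholder repo
    rev="v1.0",                                 # Git tag/commit pinning the data version
) as f:
    header = f.readline()
```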
2. LakeFS
- What it does: Git-like version control for object storage (e.g., S3), enabling branching and commit-like workflows.
- Why it's good: Allows atomic operations and reproducibility at scale.
- Best for: Teams with large data lakes needing safe experimentation.
- Bonus: Works well in production pipelines, not just experimentation.
3. Pachyderm
- What it does: Data pipelines with versioned data at each stage.
- Why it's good: Combines data versioning and pipeline orchestration.
- Best for: Teams needing data provenance across pipelines.
- Drawbacks: Heavier to set up; often used in enterprise settings.
4. Delta Lake (by Databricks)
- What it does: ACID-compliant storage layer on top of data lakes (like S3 or Azure Blob).
- Why it's good: Tracks versions of data, supports time travel.
- Best for: Spark/Databricks ecosystems.
- Bonus: Excellent for big data workflows and analytics too.
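A minimal sketch of Delta Lake time travel from PySpark, assuming the `delta-spark` package is installed and a Delta table already exists at the placeholder path:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "s3://my-bucket/events"  # placeholder table path

# "Time travel": read the table as it looked at an earlier version
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
```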
5. Quilt
- What it does: Versioning for data stored in S3 with a visual data catalog.
- Why it's good: Data packages, preview in UI, metadata tagging.
- Best for: Teams wanting a searchable catalog + version control.
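A minimal sketch of pushing a versioned data package with `quilt3` (the package name, file, and S3 registry are placeholders):

```python
import quilt3

pkg = quilt3.Package()
pkg.set("train.csv", "data/train.csv")  # logical key -> local file (placeholder)
pkg.push(
    "team/demo-dataset",              # placeholder package name
    registry="s3://my-quilt-bucket",  # placeholder registry
    message="initial version",
)
```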
6. Weights & Biases Artifacts
- What it does: Tracks data, models, and files as “artifacts” alongside experiments.
- Why it's good: Easy integration with their experiment tracking platform.
- Best for: Teams already using W&B.
- Drawbacks: Tied to the W&B ecosystem.
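A minimal sketch of versioning a dataset as a W&B artifact (the project name and file path are placeholders):

```python
import wandb

run = wandb.init(project="demo-project", job_type="dataset-upload")
artifact = wandb.Artifact(name="training-data", type="dataset")
artifact.add_file("data/train.csv")  # placeholder path
run.log_artifact(artifact)           # W&B assigns incrementing versions (v0, v1, ...)
run.finish()
```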
7. Dagshub
- What it does: Git-based platform for ML collaboration that integrates DVC and Git LFS.
- Why it's good: GitHub-like interface for ML projects.
- Best for: Smaller teams wanting a visual + code-based experience.
- Bonus: Hosted solution; great for open-source ML projects.
🧩 Quick Comparison
| Tool | Git-Integrated | UI Available | Storage Backend | Best For |
|---|---|---|---|---|
| DVC | ✅ | Basic (CLI/UI plugin) | Local, S3, GCS, Azure | General-purpose ML pipelines |
| LakeFS | ✅ | ✅ | Object storage | Data lake versioning |
| Pachyderm | ✅ | ✅ | Object storage | Versioned data pipelines |
| Delta Lake | ❌ (SQL-based) | ❌ | Data lakes | Big data + Spark users |
| Quilt | ❌ | ✅ | S3 | Data catalog + lightweight versioning |
| W&B Artifacts | ❌ | ✅ | Cloud | Integrated with W&B tracking |
| Dagshub | ✅ | ✅ | S3, Git LFS | Collaborative ML teams |
If you're already using something like Git + MLflow, pairing it with DVC or LakeFS often makes sense. For larger orgs with big data infra, Delta Lake or Pachyderm might fit better.