
    From Notebook to Production: How We Deploy ML Models in 8 Weeks

    Agentic AI · Cloudmess Team · 8 min read · January 15, 2026

    The Notebook Trap

    We see this pattern constantly: a data science team spends months building a fraud detection model, a recommendation engine, or a forecasting system. It performs well in Jupyter notebooks with offline validation. Everyone is excited. Then the handoff to engineering happens, and everything stalls. The engineering team says they need 3 to 6 months to 'properly containerize and deploy it.' Six months later, the model is still sitting in a notebook, the business is still waiting, and the data science team has already moved on to the next experiment. The core issue is that notebook code is inherently not production-grade. It relies on local file paths, ad-hoc data loading, and manual execution steps that do not translate to a repeatable, automated workflow.

    Why It Takes So Long (And Shouldn't)

    The bottleneck is not the model itself. It is the infrastructure around it. Most teams lack standardized ML deployment pipelines, so every model becomes a bespoke project: custom Dockerfiles, one-off Kubernetes manifests, manual testing processes, and ad-hoc monitoring configured after the first production incident. Each deployment reinvents the wheel. We have audited teams where the Dockerfile for model A uses Python 3.9 with pip, model B uses Python 3.10 with conda, and model C uses Poetry with a completely different base image. This inconsistency creates maintenance nightmares. The fix is not hiring more engineers. It is building a repeatable pipeline using tools like MLflow for experiment tracking, BentoML or Seldon Core for model serving, and ArgoCD for GitOps-based deployment orchestration.

    The Pipeline We Build

    Our approach follows a structured 8-week timeline.

    Weeks 1 to 2: We containerize the model using standardized multi-stage Docker builds. The base image includes CUDA 12.1 for GPU workloads, a pinned Python version, and a health check endpoint built on FastAPI or Flask. We use Poetry for deterministic dependency resolution and export a requirements.txt for the final slim image, keeping container sizes under 2GB even with ML dependencies.

    Weeks 3 to 4: We set up MLflow on ECS Fargate backed by RDS PostgreSQL for the metadata store and S3 for the artifact store. Every model version is registered with its training metrics, hyperparameters, dataset hash, and lineage. Rollback is a single MLflow CLI command.

    Weeks 5 to 6: We build an A/B testing framework using Istio traffic splitting on EKS or ALB weighted target groups on ECS. Traffic is split 90/10 between the baseline and candidate models, and we track business metrics (conversion rate, fraud catch rate) alongside model metrics (accuracy, latency P99).

    Weeks 7 to 8: We deploy to production with Karpenter for GPU node provisioning, HPA configured to scale on custom metrics (requests per second and GPU utilization), and a full observability stack using Prometheus, Grafana, and Langfuse for inference tracing.
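    To make the weeks 1 to 2 health check concrete, here is a minimal sketch of the kind of serving skeleton we bake into the base image, assuming FastAPI. The route names, the module-level flag, and the startup hook are illustrative choices, not a fixed contract:

```python
# Minimal serving skeleton with separate liveness and readiness probes.
# The orchestrator (Kubernetes or ECS) polls these endpoints; route names
# and the readiness flag below are illustrative, not prescribed.
from fastapi import FastAPI

app = FastAPI()
MODEL_READY = False  # flipped to True once the model artifact is loaded


@app.on_event("startup")
def load_model():
    """Load the model artifact at process start.

    In a real image this would pull the artifact from the path baked in
    at build time; here it just flips the readiness flag.
    """
    global MODEL_READY
    MODEL_READY = True


@app.get("/healthz")
def healthz():
    # Liveness: the process is up and able to answer HTTP.
    return {"status": "ok"}


@app.get("/readyz")
def readyz():
    # Readiness: the model is loaded, so traffic can be routed here.
    return {"ready": MODEL_READY}
```

    Splitting liveness from readiness lets the orchestrator keep a slow-starting container out of the load balancer without restarting it.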
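    The 90/10 split in weeks 5 to 6 is enforced at the mesh level by Istio or at the load balancer by ALB weighted target groups, but the underlying policy is just deterministic weighted bucketing. A plain-Python sketch of that policy, assuming sticky assignment keyed by a request or user id (the function name is ours):

```python
import hashlib


def assign_variant(request_id: str, candidate_weight: float = 0.10) -> str:
    """Deterministically bucket a request into baseline or candidate.

    Hashing the id maps it to a uniform point in [0, 1); ids below the
    candidate weight go to the new model. Because the mapping is a pure
    function of the id, a given user is sticky to one variant, which is
    what keeps A/B metrics clean.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "candidate" if bucket < candidate_weight else "baseline"
```

    The same property is what makes the 90/10 comparison of fraud catch rate or conversion rate valid: each user sees exactly one model for the duration of the test.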

    What Changes After

    Once the pipeline is in place, deploying the next model takes hours, not months. Data scientists push code to a Git repository, a GitHub Actions workflow triggers the build, MLflow registers the artifact, and ArgoCD rolls it out to staging automatically. One client went from quarterly model updates to weekly deployments after adopting this pipeline. Their model serving uptime went from 97.2% to 99.7% with auto-failover across availability zones. Inference latency P99 dropped from 850ms to 320ms after we right-sized the serving containers and added request batching with a max batch size of 32 and a 50ms timeout window. The engineering team stopped dreading ML deployments because they became boring, repeatable, and fully automated.
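    The batching policy behind that latency win (flush at 32 requests or after 50ms, whichever comes first) can be sketched in a few lines. This is a simplified synchronous version, assuming the serving loop calls tick() periodically; production servers like BentoML implement the same idea with an async event loop:

```python
import time


class MicroBatcher:
    """Accumulate requests and flush when the batch is full or stale.

    Mirrors the policy described above: flush at max_size requests or
    after max_wait_ms, whichever comes first. flush_fn stands in for the
    model inference call on the assembled batch.
    """

    def __init__(self, flush_fn, max_size=32, max_wait_ms=50):
        self.flush_fn = flush_fn
        self.max_size = max_size
        self.max_wait = max_wait_ms / 1000.0
        self.pending = []
        self.oldest = None  # arrival time of the oldest pending request

    def submit(self, item):
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.pending.append(item)
        self._maybe_flush()

    def tick(self):
        # Called periodically by the serving loop to honor the timeout
        # even when no new requests are arriving.
        self._maybe_flush()

    def _maybe_flush(self):
        full = len(self.pending) >= self.max_size
        stale = (self.oldest is not None
                 and time.monotonic() - self.oldest >= self.max_wait)
        if self.pending and (full or stale):
            self.flush_fn(self.pending)
            self.pending = []
            self.oldest = None
```

    The size cap bounds GPU memory per inference call, while the timeout caps the latency a lone request can be held waiting for a full batch.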

    The Real Cost of Waiting

    Every month a model sits in a notebook is a month of value you are not capturing. If that fraud detection model catches $200K in fraud per month once deployed, a 6-month delay costs $1.2M in preventable losses. We have seen recommendation engines increase average order value by 12 to 18% within the first month of deployment. Forecasting models that reduce inventory waste by 8% pay for the entire infrastructure investment in a single quarter. The pipeline we build typically costs $15K to $25K in AWS infrastructure per month for a production-grade setup, and it pays for itself in the first sprint. The key is treating ML deployment as an engineering discipline with the same rigor as application deployment, not as a one-off science project.
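    The arithmetic above reduces to two simple quantities, sketched here with the article's own illustrative figures (these are examples, not benchmarks):

```python
def delay_cost(monthly_value: float, months_delayed: int) -> float:
    """Value forgone while a finished model sits undeployed."""
    return monthly_value * months_delayed


def net_monthly_value(monthly_value: float, monthly_infra: float) -> float:
    """Monthly value captured, net of serving-infrastructure spend."""
    return monthly_value - monthly_infra


# Fraud-detection example from above: $200K/month caught, 6-month delay.
stalled = delay_cost(200_000, 6)  # $1.2M in preventable losses
# Even at the high end of the quoted infra range ($25K/month), the
# deployed model nets $175K/month.
net = net_monthly_value(200_000, 25_000)
```

    Framed this way, the infrastructure bill is a rounding error next to the cost of leaving the model in a notebook.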