Docker Deployment for AI Applications: Production Patterns That Scale
Deploying AI applications in production is a different problem from training a model or running a notebook. You need reliable process isolation, health monitoring, resource limits, security hardening, and the ability to update models without downtime. Docker solves most of these problems — but only if you use it correctly. We run 82 containers across 20+ AI applications in production. This guide covers the patterns that work, the ones that do not, and the specific configurations we use for deploying LLM-powered applications, Ollama model servers, and multi-container AI stacks.
Why Docker for AI Applications
The core argument for Docker in AI deployment is reproducibility. An AI application typically depends on a specific Python or Node.js runtime version, particular library versions (often with CUDA-specific builds), model files that can be gigabytes in size, and configuration that varies between development and production. Docker packages all of this into a single artifact that runs identically everywhere. "It works on my machine" stops being an excuse.
Beyond reproducibility, Docker gives you process isolation (one container crash does not bring down your other services), resource limits (prevent a runaway inference loop from consuming all available memory), and declarative infrastructure (your entire deployment is defined in code, version-controlled, and reviewable). For teams managing multiple AI services — which is increasingly common as organizations deploy purpose-built models for different tasks — these properties are not optional.
Multi-Stage Dockerfiles: The Four-Stage Pattern
Every AI application we deploy uses a four-stage Dockerfile. This pattern minimizes the final image size (critical when images include model files), separates build dependencies from runtime dependencies, and ensures the production container runs as a non-root user. Here is the pattern:
Four-Stage Build Pattern
- Stage 1 — base: Alpine-based Node.js image (node:22-alpine3.21) with system dependencies and security updates (apk upgrade). This is the foundation both build and runtime share. Pin to the patch version; never use "latest" or even "22-alpine", because image contents can change underneath you.
- Stage 2 — deps: Install application dependencies with npm ci (not npm install — ci is deterministic from lockfile). This stage is cached unless package-lock.json changes, which makes rebuilds fast.
- Stage 3 — builder: Copy source code, run Prisma generate (if using a database), and execute the production build (next build for Next.js apps). The output is the compiled application without source files or dev dependencies.
- Stage 4 — runner: Copy only the standalone build output into a clean base image. Create a non-root user (uid 1001, gid 1001). Set file ownership. Configure the entrypoint. This final image contains nothing except what is needed to run — no build tools, no source, no dev dependencies.
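Concretely, the four stages might look like the following sketch for a Next.js application (the cache paths, user names, and the commented-out Prisma step are illustrative assumptions, not a drop-in file):

```dockerfile
# Stage 1 -- base: pinned Alpine Node.js with security updates
FROM node:22-alpine3.21 AS base
RUN apk upgrade --no-cache

# Stage 2 -- deps: deterministic install from the lockfile
FROM base AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci

# Stage 3 -- builder: compile the app
FROM base AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
# RUN npx prisma generate    # uncomment if the app uses Prisma
RUN npm run build            # next build with output: 'standalone'

# Stage 4 -- runner: minimal runtime image, non-root user
FROM base AS runner
WORKDIR /app
ENV NODE_ENV=production
RUN addgroup -g 1001 nodejs && adduser -S -u 1001 nextjs
COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone ./
COPY --from=builder --chown=nextjs:nodejs /app/.next/static ./.next/static
USER nextjs
EXPOSE 3000
CMD ["node", "server.js"]
```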
The result is a production image that is typically 150-300MB for a Next.js AI application, compared to 1-2GB if you just copy everything into a single stage. Smaller images mean faster deployments, faster rollbacks, and less attack surface.
Health Checks: The Most Overlooked Configuration
Health checks determine whether Docker considers a container "healthy" and routes traffic to it. Without health checks, Docker has no way to know if your application has crashed, frozen, or is stuck in an infinite loop. It just sees that the process is running. This is especially problematic for AI applications where model loading can take 30-60 seconds after the container starts, and where memory leaks from repeated inference can gradually degrade a service.
Every service in our stack has a health check. Here are the patterns by service type:
- Application containers: HTTP health endpoint using wget. The endpoint checks that the application can accept requests and that any required downstream services (database, model server) are reachable. Interval: 30s, timeout: 10s, retries: 3.
- PostgreSQL: pg_isready command. Checks that the database is accepting connections on the expected port. This is more reliable than a TCP port check because pg_isready verifies the Postgres protocol is responding.
- Redis: redis-cli ping. Returns PONG if Redis is operational. Simple, fast, reliable.
- Ollama (model server): HTTP check against the Ollama API health endpoint. This confirms that Ollama is running and that at least one model is loaded and ready for inference — not just that the process started.
Critical: use 127.0.0.1, never localhost
Alpine Linux resolves "localhost" to IPv6 (::1) before IPv4 (127.0.0.1). If your service only listens on IPv4 — which is the default for most Node.js and Python applications — health checks against "localhost" will fail intermittently or consistently. Always use 127.0.0.1 explicitly. This one issue accounts for more "container keeps restarting" tickets than any other single cause in our fleet.
Pair health checks with depends_on conditions. Never use bare depends_on (which only waits for the container to start, not become healthy). Always use depends_on with condition: service_healthy. This ensures your application container does not start until the database and model server are actually ready.
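In Docker Compose, the health checks and the startup ordering combine like this (service names, the /api/health path, and the Redis tag are assumptions):

```yaml
services:
  app:
    image: my-ai-app:1.4.2             # illustrative tag
    depends_on:
      db:
        condition: service_healthy     # wait for healthy, not merely started
      redis:
        condition: service_healthy
    healthcheck:
      # 127.0.0.1, never localhost: Alpine may resolve localhost to ::1
      test: ["CMD", "wget", "-qO-", "http://127.0.0.1:3000/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3
  db:
    image: postgres:16.11-alpine3.21
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U $$POSTGRES_USER"]
      interval: 30s
      timeout: 10s
      retries: 3
  redis:
    image: redis:7.4-alpine            # illustrative pinned tag
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 30s
      timeout: 10s
      retries: 3
```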
Ollama in Docker: Running Local LLMs in Production
Ollama has become the standard way to serve local LLMs, and running it in Docker is straightforward — with a few important considerations for production use.
Shared Model Server Architecture
We run a single Ollama instance that serves multiple applications. Each AI application connects to the shared Ollama container over a dedicated Docker network. This approach is more efficient than running separate Ollama instances per application because model weights are loaded into GPU memory once and shared across requests. With a 24GB GPU, you can serve a 13B parameter model to five or six applications simultaneously without issues.
The Docker Compose configuration uses a named external network (we call it ailab-network) that both the Ollama container and all consuming application containers join. Application containers reference Ollama by its container name (ailab-ollama) and the default port (11434). No host networking, no port publishing to the host — all traffic stays within the Docker network.
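A sketch of that wiring, with the Ollama stack and one consuming application shown side by side (the image tag and the OLLAMA_BASE_URL variable name are assumptions):

```yaml
networks:
  ailab-network:
    external: true    # created once: docker network create ailab-network

services:
  # In the Ollama stack's compose file:
  ollama:
    image: ollama/ollama:0.5.7         # illustrative pinned tag
    container_name: ailab-ollama
    networks: [ailab-network]
    # no ports: entry -- nothing is published to the host

  # In each application's compose file:
  app:
    image: my-ai-app:1.4.2             # illustrative
    environment:
      - OLLAMA_BASE_URL=http://ailab-ollama:11434   # assumed variable name
    networks: [ailab-network, default]
```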
GPU Passthrough Configuration
For GPU inference (which you want — CPU inference on large models is unusably slow for production), Docker needs access to the host GPU. On Linux with NVIDIA GPUs, this requires the NVIDIA Container Toolkit (nvidia-container-toolkit package). Once installed, you add the GPU reservation to your Docker Compose file under the deploy section.
Key considerations for GPU-enabled containers:
- Memory management: Set OLLAMA_MAX_LOADED_MODELS to control how many models stay in GPU memory simultaneously. Default is 1, which means only the most recently used model stays loaded. For multi-application setups serving the same model, this default is fine. For setups serving different models, increase it based on available GPU memory.
- Model storage: Mount a persistent volume for Ollama's model directory. Models can be 4-15GB each — you do not want them downloaded on every container restart. The volume mount ensures models persist across restarts and updates.
- Concurrent requests: Set OLLAMA_NUM_PARALLEL to control how many requests Ollama processes simultaneously. Higher values increase throughput but consume more GPU memory. We typically use 2-4 depending on model size and GPU capacity.
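With the NVIDIA Container Toolkit installed, the GPU reservation plus the settings above look roughly like this (image tag and tuning values are illustrative):

```yaml
services:
  ollama:
    image: ollama/ollama:0.5.7         # illustrative pinned tag
    environment:
      - OLLAMA_MAX_LOADED_MODELS=1     # models kept resident in GPU memory
      - OLLAMA_NUM_PARALLEL=2          # concurrent requests per model
    volumes:
      - ollama-models:/root/.ollama    # persist multi-GB model downloads
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  ollama-models:
```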
Security Hardening: Production-Grade Container Security
Running AI applications in production means they are attack surfaces. A misconfigured container can leak model outputs, expose internal APIs, or provide a foothold for lateral movement in your network. Security hardening is not optional. Here is our standard configuration, applied to every container.
| Hardening Measure | Configuration | Why It Matters |
|---|---|---|
| Drop all capabilities | cap_drop: [ALL] | Prevents privilege escalation. Add back only what is needed. |
| No new privileges | security_opt: [no-new-privileges:true] | Blocks setuid/setgid binaries from gaining elevated privileges. |
| Non-root user | USER nextjs (uid 1001) | Limits blast radius if container is compromised. |
| Read-only filesystem | read_only: true + tmpfs mounts | Prevents writing to container filesystem. Tmpfs for /tmp and caches. |
| Resource limits | deploy.resources.limits (memory + CPU) | Prevents any single container from consuming all host resources. |
| Process limits | deploy.resources.limits.pids: 200 | Prevents fork bombs or runaway process spawning. |
| Log rotation | max-size: 10m, max-file: 3 | Prevents log files from filling the disk. Database and Redis containers get 50m/5 files for their higher log volume. |
| Network isolation | Dedicated network per app, 127.0.0.1 binding | Containers only see services they need. No exposure to host network. |
| Pinned images | postgres:16.11-alpine3.21, not :latest | Prevents unexpected changes when pulling images. Reproducible builds. |
These are not aspirational best practices — this is our standard configuration applied to every container in production. The overhead is minimal (a few extra lines in docker-compose.yml), and the security improvement is significant. Most container security incidents we have seen in the wild would have been prevented by these basic measures.
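Applied to a single service, the table above translates to roughly this compose fragment (the limits, tmpfs paths, and uid are illustrative and should match your own Dockerfile):

```yaml
services:
  app:
    image: my-ai-app:1.4.2             # pinned, illustrative
    user: "1001:1001"                  # non-root, matches the Dockerfile USER
    read_only: true
    tmpfs:
      - /tmp
      - /app/.next/cache               # assumed writable cache path
    cap_drop: [ALL]
    security_opt:
      - no-new-privileges:true
    deploy:
      resources:
        limits:
          memory: 1g
          cpus: "1.5"
          pids: 200
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"
```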
Network Architecture: Isolation Without Complexity
Each application gets its own Docker network. The application container, its database, and its Redis instance all share a network. They can communicate freely within that network but cannot reach containers on other application networks. The only shared resource is the Ollama model server, which sits on its own network that application containers explicitly join.
This architecture means a compromise of one application's container does not give access to other applications' databases. It also means network policies are simple: allow everything within the application network, allow connections to the shared Ollama network, deny everything else.
Port binding is exclusively to 127.0.0.1. No container port is ever published on 0.0.0.0 (all interfaces). External access is handled by a reverse proxy (Caddy or Nginx) that terminates TLS and forwards to the localhost-bound container ports. This means containers are not directly reachable from the network even if firewall rules are misconfigured.
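The loopback-only binding and the per-application networks combine like this (ports and network names are illustrative):

```yaml
services:
  app:
    ports:
      - "127.0.0.1:3000:3000"   # reachable only via the reverse proxy on this host
    networks:
      - app-net                 # private to this application
      - ailab-network           # shared network for the Ollama model server
  db:
    networks: [app-net]         # never published to the host at all

networks:
  app-net: {}
  ailab-network:
    external: true
```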
Resource Management for AI Workloads
AI applications have different resource profiles than typical web services. An inference request can consume 2-8GB of RAM depending on the model, spike CPU usage for 5-30 seconds, and, if a GPU is in use, monopolize GPU memory for the duration. Without resource limits, a burst of inference requests can cause out-of-memory kills on co-located services.
We set explicit memory and CPU limits on every container. For AI application containers, typical limits are 512MB-1GB of RAM and 1-2 CPU cores — the application itself does not need much because inference happens in the Ollama container. For the Ollama container, limits depend on the model size: a 13B model needs roughly 10-12GB RAM when loaded. PostgreSQL gets 256MB-512MB per database instance. Redis gets 128-256MB.
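As a compose sketch using the typical numbers above (these are starting points, not prescriptions, and should be tuned to your own measurements):

```yaml
services:
  app:
    deploy:
      resources:
        limits: { memory: 1g, cpus: "2" }
  ollama:
    deploy:
      resources:
        limits: { memory: 12g }          # ~13B model loaded, per the estimate above
  db:
    deploy:
      resources:
        limits: { memory: 512m, cpus: "1" }
  redis:
    deploy:
      resources:
        limits: { memory: 256m }
```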
Monitor before you limit
Do not guess resource limits. Run your application under realistic load for a week, monitor actual usage with docker stats, then set limits at 1.5x the observed peak. Setting limits too tight causes OOM kills that are hard to debug. Setting them too loose wastes resources and provides no protection. Measure first.
Environment Variables and Secrets
AI applications typically need database credentials, API keys (for external services), model configuration parameters, and application-specific settings. Never hardcode these in Dockerfiles or docker-compose.yml files. We use .env files for development and environment-specific configuration, with required variables enforced using the ${VAR:?error message} syntax in docker-compose.yml. This means docker compose up fails immediately with a clear error if a required variable is missing, rather than starting with broken configuration.
For production secrets (database passwords, API keys), we use Docker secrets or mount files from the host. The .env file is never committed to version control. A .env.example file lists all required variables with placeholder values.
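The required-variable enforcement and file-based secrets might look like this (the variable names and secret file path are assumptions):

```yaml
services:
  app:
    environment:
      # docker compose up fails immediately if either variable is unset
      - DATABASE_URL=${DATABASE_URL:?DATABASE_URL must be set}
      - OLLAMA_BASE_URL=${OLLAMA_BASE_URL:?OLLAMA_BASE_URL must be set}
    secrets:
      - openai_api_key

secrets:
  openai_api_key:
    file: ./secrets/openai_api_key.txt   # never committed; path illustrative
```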
Deployment and Update Patterns
Zero-Downtime Updates
For AI applications where inference requests can take 10-30 seconds, a naive restart (stop old container, start new one) means dropped requests. Our update procedure: build the new image, start the new container alongside the old one, wait for the new container to become healthy (health check passes), switch the reverse proxy to the new container, drain connections from the old container, then stop the old container. For most of our applications, this entire process takes 60-90 seconds with zero dropped requests.
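A condensed sketch of that procedure (container names, the health-wait loop, and the proxy reload step are all assumptions; most reverse proxies have their own graceful-reload mechanism):

```shell
# Build and start the new container alongside the old one
docker build -t my-ai-app:new .
docker run -d --name app-new --network app-net my-ai-app:new

# Wait until Docker reports the new container healthy
until [ "$(docker inspect -f '{{.State.Health.Status}}' app-new)" = "healthy" ]; do
  sleep 2
done

# Point the reverse proxy at the new container, then retire the old one
caddy reload --config /etc/caddy/Caddyfile   # assumed proxy reload step
docker stop --time 30 app-old                # 30s grace to drain in-flight requests
docker rm app-old
```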
Model Updates Without Redeployment
Ollama supports pulling new model versions at runtime. When we update a model, we pull the new version into the running Ollama container, verify it loads correctly, then update the application configuration to use the new model tag. The application container does not need to restart — only the model reference changes. This separation between application code and model artifacts is a major advantage of the shared Ollama architecture.
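With the shared Ollama container, a model update reduces to two commands (container and model names are illustrative):

```shell
# Pull the new model version into the running Ollama container
docker exec ailab-ollama ollama pull llama3.1:8b

# Verify it loads and responds before pointing any application at it
docker exec ailab-ollama ollama run llama3.1:8b "ping"
```

Once verified, each application switches by changing its configured model tag, with no container restart.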
Monitoring and Observability
For a fleet of AI containers, you need visibility into four dimensions:
- Container health: Is the container running? Is the health check passing? When did the last restart happen and why?
- Resource usage: CPU, memory, disk, and GPU utilization over time. Trend lines matter more than point-in-time values — you want to see the memory leak developing before it causes an OOM kill.
- Application metrics: Request latency, error rates, inference time, queue depth. These tell you whether the application is performing correctly even when the container is "healthy."
- Model metrics: Token throughput, model load time, cache hit rate (for repeated prompts), and output quality scores if you have a way to measure them.
We use a combination of Docker's built-in logging (with rotation), application-level structured logging (JSON format, shipped to a central log store), and periodic health check scripts that alert on degraded conditions before they become outages.
Common Mistakes and How to Avoid Them
- Using :latest tags in production. Your build is not reproducible if the base image can change between pulls. Pin to specific patch versions. Yes, it means manual updates — that is the point. You want to control when changes happen.
- Running as root. This is the default, and it means a container escape gives the attacker root on the host. Always create and use a non-root user in your Dockerfile. The three extra lines of configuration are worth the security improvement.
- No resource limits. This works fine until it does not. One runaway process, one memory leak, one exceptionally large inference request — and suddenly your entire server is unresponsive because a single container consumed all available resources.
- Publishing ports on 0.0.0.0. This exposes your container to every network interface on the host. If the host has a public IP, your database is now accessible from the internet. Always bind to 127.0.0.1.
- Skipping health checks on AI containers. AI containers have long startup times (model loading), can hang during inference, and are prone to memory leaks. Without health checks, Docker has no way to detect these conditions and restart the container. A health check that verifies the model is loaded and responding catches problems that a simple process check misses.
Getting Started
If you are deploying AI applications and struggling with reliability, resource management, or security, Docker with the patterns described here will solve most of those problems. The initial investment is a few days of infrastructure work that pays dividends in operational stability for years.
We offer AI infrastructure consulting for teams that want to get this right the first time. We review your current deployment, identify security and reliability gaps, and implement production-grade Docker configurations for your AI stack.
Book a free discovery call to discuss your AI deployment challenges. No sales pitch — just a practical assessment of where your infrastructure stands and what improvements would have the highest impact.