Fleet Dashboard

Container monitoring across the whole Rogue AI fleet

Built by Rogue AI · single-pane container monitoring for the self-hosted fleet · Internal tool

Built incrementally as the fleet grew; used daily as the operations console.

The problem

The fleet grew to roughly sixty containers spread across separate apps, each on its own network and ports, with its own database and Redis. Checking health meant running docker ps and tailing logs per app and holding the whole port-and-subnet map in your head. There was no single place to answer 'what is up, what is struggling, and where does it live?'

What I built

A single-page dashboard that shows the whole fleet at a glance: per-app cards with a color-coded health badge, CPU and memory bars, database sizes, and an expandable log viewer. It groups raw containers back into the apps they belong to, surfaces the cross-app port and subnet map, and lets the operator restart, rebuild, or pull an image straight from the card, with database containers deliberately excluded from those actions.

Architecture

Next.js frontend behind auth

The dashboard and its API routes are a Next.js app protected by NextAuth v5. The app never touches the Docker socket itself; it only calls the sidecar over an internal network, so the public surface holds no privileged access.

Python Flask sidecar for the Docker socket

A small Flask service is the only component that talks to the Docker socket. It enumerates containers, reads stats and logs, and performs lifecycle actions. The socket is mounted read-only, the API is gated by a shared secret, and the container is exposed on no external port.
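The shared-secret gate can be sketched roughly as below. This is an illustration, not the sidecar's actual code: the header name `X-Sidecar-Secret` and the function `is_authorized` are hypothetical, and the check is written framework-free so it could be called from a Flask `before_request` hook or anywhere else.

```python
import hmac

# Hypothetical header name; the write-up only says the API is gated by a shared secret.
SECRET_HEADER = "X-Sidecar-Secret"


def is_authorized(headers: dict[str, str], expected_secret: str) -> bool:
    """Return True only if the request carries the expected shared secret."""
    if not expected_secret:
        # Fail closed: an unset or empty secret must not open the API to everyone.
        return False
    supplied = headers.get(SECRET_HEADER, "")
    # compare_digest takes time independent of where the strings differ,
    # which avoids leaking the secret through response timing.
    return hmac.compare_digest(supplied.encode(), expected_secret.encode())
```

A plain dict lookup is case-sensitive, so a real handler would read the header through the framework's case-insensitive header object instead.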

Containers regrouped into apps

Raw container names mean little on their own, so a metadata layer maps containers back to the app they belong to and renders one card per app, including its port assignments and dedicated subnet, instead of a flat list.
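One way such a metadata layer might work is sketched below, assuming containers can be matched to apps by a name prefix. The app names, prefixes, subnets, and ports are invented for illustration; the real mapping may be keyed on labels or compose projects instead.

```python
from dataclasses import dataclass, field


@dataclass
class AppCard:
    name: str
    subnet: str
    ports: list[int]
    containers: list[str] = field(default_factory=list)


# Hypothetical metadata: every value here is a placeholder, not the real fleet map.
APP_METADATA = {
    "billing": {"prefix": "billing-", "subnet": "172.20.1.0/24", "ports": [3001]},
    "wiki": {"prefix": "wiki-", "subnet": "172.20.2.0/24", "ports": [3002]},
}


def group_into_apps(container_names, metadata=APP_METADATA):
    """Group raw container names into one card per app.

    Returns (cards, unassigned); containers matching no app are reported
    separately rather than silently dropped.
    """
    cards = {
        app: AppCard(app, meta["subnet"], list(meta["ports"]))
        for app, meta in metadata.items()
    }
    unassigned = []
    for cname in sorted(container_names):
        for app, meta in metadata.items():
            if cname.startswith(meta["prefix"]):
                cards[app].containers.append(cname)
                break
        else:
            # No prefix matched this container.
            unassigned.append(cname)
    return cards, unassigned
```

Returning the unmatched containers explicitly means a newly deployed app shows up as "unassigned" instead of vanishing from the page.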

Cached polling on a fixed interval

Stats and container listings are cached on the sidecar for a short window and the dashboard auto-refreshes on an interval. That keeps the socket from being hammered on every page view while still reflecting near-current state.
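The caching idea can be illustrated with a small time-to-live cache. This is a sketch under assumptions: the class name `TTLCache`, the key names, and the window length are hypothetical, and the real sidecar may cache differently.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Cache values for a short window so repeated requests skip the Docker socket."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: dict[str, tuple[float, Any]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            # Still fresh: serve the cached value without calling fetch.
            return hit[1]
        # Missing or expired: fetch once and remember when we did.
        value = fetch()
        self._entries[key] = (now, value)
        return value
```

With a window of a few seconds, any number of page views inside that window cost one socket read, and the data shown is never older than the window itself.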

Guarded lifecycle actions with an audit trail

Restart, rebuild, and pull run through the sidecar with per-action timeouts and are written to a log with the user and outcome. Database containers are blocked by name so a stray click cannot bounce Postgres or Redis.
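A rough sketch of the guard and audit trail follows. The name fragments, timeout values, and the `run_action` signature are assumptions made for illustration; `execute` stands in for whatever actually talks to Docker.

```python
import json
import time

# Hypothetical deny-list fragments; the write-up only says databases are blocked by name.
PROTECTED_FRAGMENTS = ("postgres", "redis")
# Illustrative per-action timeouts in seconds, not the real values.
ACTION_TIMEOUTS = {"restart": 30, "rebuild": 600, "pull": 300}


def is_protected(container_name: str) -> bool:
    name = container_name.lower()
    return any(fragment in name for fragment in PROTECTED_FRAGMENTS)


def run_action(user, action, container, execute, audit_log):
    """Run a lifecycle action if allowed and always record who asked and what happened.

    execute(action, container, timeout) performs the real work; audit_log is
    any object with append(), such as a list or a file-backed writer.
    """
    if action not in ACTION_TIMEOUTS:
        outcome = "rejected: unknown action"
    elif is_protected(container):
        outcome = "rejected: protected container"
    else:
        try:
            execute(action, container, ACTION_TIMEOUTS[action])
            outcome = "ok"
        except TimeoutError:
            outcome = "timeout"
        except Exception as exc:  # record the failure instead of losing it
            outcome = f"error: {exc}"
    audit_log.append(json.dumps({
        "ts": time.time(),
        "user": user,
        "action": action,
        "container": container,
        "outcome": outcome,
    }))
    return outcome
```

Rejected attempts are logged as well, so the trail shows stray clicks on a database container as clearly as successful restarts.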

Tech stack

Next.js · NextAuth 5 · Python Flask sidecar · Docker

What broke first

▸ Reading the Docker socket is a privileged operation, so it does not belong inside the user-facing web app. Isolating it in a separate sidecar with one narrow job keeps the blast radius small if the frontend is ever compromised.
▸ A single number is worth more than a wall of graphs. Most days the only question is 'is everything green?', so the dashboard answers that first and keeps per-container detail one click away.
▸ Polling a few dozen containers on every page view is enough to overwhelm the socket. Caching stats on the sidecar for a short window and refreshing on an interval keeps the host responsive without lying about state.
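The 'is everything green?' question could be answered with a worst-status rollup like the sketch below. The three status names and the rule that an unrecognized status counts as red are assumptions; the write-up only mentions a color-coded health badge.

```python
# Hypothetical severity ordering for the badge colors.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def fleet_status(app_statuses: dict[str, str]) -> tuple[str, list[str]]:
    """Collapse per-app statuses into one fleet-wide color.

    Returns the worst status seen plus the sorted names of apps that are not
    green. An unrecognized status is treated as red so it cannot hide.
    """
    worst = "green"
    needs_attention = []
    for app, status in sorted(app_statuses.items()):
        level = status if status in SEVERITY else "red"
        if level != "green":
            needs_attention.append(app)
        if SEVERITY[level] > SEVERITY[worst]:
            worst = level
    return worst, needs_attention
```

The second return value is what keeps detail one click away: the headline is a single color, and the list names exactly which cards to open.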

Outcome

Day-to-day fleet checks collapsed from a sweep of per-app terminal commands into one glance at one page. Privileged Docker access stays sealed in a single isolated sidecar rather than spread across the apps it watches, and routine restarts now happen from the dashboard with a record of who did what.

Honest limits

This is an internal operations tool, not a product. It runs in the local lab to watch a self-hosted fleet, and it was built solo for one operator. The real trade-off to call out: the Python sidecar holds privileged access to the Docker socket. A read-only mount does not limit which Docker API calls the socket will accept, so even mounted read-only and locked behind a shared secret on an isolated network with no external port, anything that can reach the socket is effectively root-adjacent on the host. That risk is accepted deliberately and contained by keeping it off the user-facing app; it is not eliminated.