Faceless Video Factory

YouTube Shorts pipeline — BullMQ + FFmpeg + Edge TTS

Built by Rogue AI · Topic-to-MP4 Shorts pipeline on free tooling · Pipeline demo · Self-hosted

Built solo in a local lab as a pipeline-pattern demo; iterated over a single focused build cycle.

The problem

Cranking out short-form vertical video by hand is the same chore repeated forever: write a script, record a voiceover, cut captions, stack it into a 9:16 frame, export. None of those steps are hard — they are just slow and easy to get subtly wrong (caption timing drifts, the export ratio is off, the audio clips). I wanted to prove that the whole chain could be automated end to end as a queue of background jobs, and that it could run entirely on free tooling so the cost floor is zero. This is a demonstration of the automation pattern, not a launched content business.

What I built

A Next.js 16 web app where you submit a topic, and a BullMQ worker walks it through a fixed status flow: PENDING → SCRIPTING → DRAFT → VOICING → GENERATING → ASSEMBLING → REVIEW. Each video row in Postgres carries its own status, script, and asset paths, so the row is the source of truth for a job's progress and you can watch it move through the stages. The default 'Faceless' path uses only free services: a local model writes the script, Edge TTS speaks it, stock B-roll fills the frame, and FFmpeg burns the captions in and exports a 1080x1920 MP4. Paid providers (a hosted LLM, ElevenLabs, HeyGen, Kling, fal.ai) are wired in as opt-in upgrades behind env keys, but nothing requires them to produce a finished clip.

Architecture

Pipeline as a status machine

Every video is a Postgres row that advances through an explicit status flow (PENDING → SCRIPTING → DRAFT → VOICING → GENERATING → ASSEMBLING → REVIEW → APPROVED/REJECTED). Execution state lives in the data, not in memory, so a job's progress is always inspectable and a crash mid-pipeline leaves a row you can see and resume rather than a lost in-flight process.
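A minimal sketch of that status flow as a transition table. The status names come from the flow above; the `VideoRow` shape and the `canTransition` and `transition` helpers are hypothetical, and the strictly linear ordering up to REVIEW is an assumption rather than the project's actual code.

```typescript
// Statuses taken from the flow described above.
type VideoStatus =
  | "PENDING" | "SCRIPTING" | "DRAFT" | "VOICING" | "GENERATING"
  | "ASSEMBLING" | "REVIEW" | "APPROVED" | "REJECTED";

// Allowed next statuses (assumption: linear until REVIEW, then a two-way branch).
const NEXT: Record<VideoStatus, VideoStatus[]> = {
  PENDING: ["SCRIPTING"],
  SCRIPTING: ["DRAFT"],
  DRAFT: ["VOICING"],
  VOICING: ["GENERATING"],
  GENERATING: ["ASSEMBLING"],
  ASSEMBLING: ["REVIEW"],
  REVIEW: ["APPROVED", "REJECTED"],
  APPROVED: [],
  REJECTED: [],
};

// Hypothetical row shape; the real table also carries the script and asset paths.
interface VideoRow {
  id: string;
  status: VideoStatus;
}

function canTransition(from: VideoStatus, to: VideoStatus): boolean {
  return NEXT[from].indexOf(to) !== -1;
}

// Returns a new row with the updated status, or throws on an illegal jump.
function transition(row: VideoRow, to: VideoStatus): VideoRow {
  if (!canTransition(row.status, to)) {
    throw new Error(`Illegal transition ${row.status} -> ${to} for video ${row.id}`);
  }
  return { ...row, status: to };
}
```

Because the current status is stored on the row, resuming after a crash would amount to reading the row and continuing from whatever `NEXT` allows, rather than reconstructing in-memory state.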

BullMQ worker, concurrency 1

A separate worker process pulls jobs from a Redis-backed BullMQ queue at concurrency 1 — deliberately serial, because FFmpeg assembly is CPU-heavy and two encodes at once just thrash the box. Decoupling the worker from the web app means the UI stays responsive while a render grinds away in the background, and the worker can be scaled or restarted on its own.
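As a rough sketch, the worker's job processor could look like the following. `StatusStore`, `Stage`, and `makeProcessor` are hypothetical names for illustration, and the stage list is passed in rather than hard-coded because the source does not spell out which statuses the worker drives. The BullMQ wiring is shown only as a comment, since it assumes the `bullmq` package and a reachable Redis.

```typescript
// Hypothetical persistence interface; in the real app this would be a Postgres update.
interface StatusStore {
  setStatus(videoId: string, status: string): Promise<void>;
}

// A worker-driven stage: the status to record, and a hypothetical step function.
type Stage = { status: string; run: (videoId: string) => Promise<void> };

function makeProcessor(store: StatusStore, stages: Stage[]) {
  // BullMQ hands the processor a job object with a `data` payload.
  return async function processVideoJob(job: { data: { videoId: string } }): Promise<void> {
    const { videoId } = job.data;
    for (const stage of stages) {
      // Persist the status before doing the work, so a crash leaves a visible row.
      await store.setStatus(videoId, stage.status);
      await stage.run(videoId);
    }
    // Hand the finished clip over for human review.
    await store.setStatus(videoId, "REVIEW");
  };
}

// One job at a time: FFmpeg is CPU-heavy, so encodes are never overlapped.
const workerOptions = { concurrency: 1 };

// Wiring sketch (queue name and Redis host are assumptions; not executed here):
//   import { Worker } from "bullmq";
//   new Worker("videos", makeProcessor(store, stages), {
//     connection: { host: "redis", port: 6379 },
//     ...workerOptions,
//   });
```

Keeping the processor free of any direct BullMQ dependency also means it can be exercised with an in-memory store, without Redis running.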

Free-first provider chain with paid upgrades

Each step has a free primary and an optional paid upgrade: script from a local model (hosted LLM optional), voice from Edge TTS (ElevenLabs optional), visuals from stock B-roll (Flux/Kling optional), assembly always FFmpeg. The chain degrades gracefully — missing an API key downshifts that step to its free path instead of failing the whole job.
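The downshift rule can be sketched as a small lookup: use the paid provider only when its key is present, otherwise use the free one. The env key names and provider labels below are assumptions for illustration; the source says only that paid providers sit behind env keys.

```typescript
type Env = Record<string, string | undefined>;

interface ProviderChoice {
  step: "script" | "voice" | "visuals";
  provider: string;
  paid: boolean;
}

// Hypothetical env key names and provider labels.
const CHAIN: { step: ProviderChoice["step"]; free: string; paid: string; envKey: string }[] = [
  { step: "script", free: "local-model", paid: "hosted-llm", envKey: "HOSTED_LLM_API_KEY" },
  { step: "voice", free: "edge-tts", paid: "elevenlabs", envKey: "ELEVENLABS_API_KEY" },
  { step: "visuals", free: "stock-broll", paid: "kling", envKey: "KLING_API_KEY" },
];

// Picks a provider per step; a missing or blank key falls back to the free path.
function resolveProviders(env: Env): ProviderChoice[] {
  return CHAIN.map(({ step, free, paid, envKey }) => {
    const key = env[envKey];
    const usePaid = typeof key === "string" && key.trim() !== "";
    return { step, provider: usePaid ? paid : free, paid: usePaid };
  });
}
```

Calling `resolveProviders({})` yields the all-free route, which is the path the demo runs. Passing the environment in as a parameter, rather than reading it globally, keeps the selection easy to test.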

FFmpeg assembly to 9:16 with burned-in captions

FFmpeg is the one non-negotiable stage: it composites the voiceover, B-roll, and captions into a 1080x1920 vertical MP4 with the subtitles burned in so they survive re-upload anywhere. A quirk worth noting — Edge TTS returns word timings as JSON, not SRT, so the pipeline regroups them into short 2-4 word caption phrases before FFmpeg renders them, which is what keeps captions readable rather than one word flashing at a time.
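One way the regrouping step might work, as a sketch: chunk the word timings into phrases of up to four words, rebalance a dangling single word, and emit SRT text. The `WordTiming` shape is an assumed, already-normalized form (milliseconds); the raw Edge TTS JSON may use different field names and units, and the actual pipeline's grouping rules are not shown in the source.

```typescript
// Assumed normalized shape, not the raw Edge TTS payload.
interface WordTiming { text: string; startMs: number; endMs: number; }
interface Caption { text: string; startMs: number; endMs: number; }

// Groups words into phrases of at most `maxWords`, avoiding a too-short final phrase.
function groupWords(words: WordTiming[], minWords = 2, maxWords = 4): Caption[] {
  const chunks: WordTiming[][] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords));
  }
  // Borrow words from the previous chunk so the last caption is not a lone word.
  if (chunks.length > 1) {
    const last = chunks[chunks.length - 1];
    const prev = chunks[chunks.length - 2];
    while (last.length < minWords && prev.length > minWords) {
      last.unshift(prev.pop() as WordTiming);
    }
  }
  return chunks.map((chunk) => ({
    text: chunk.map((w) => w.text).join(" "),
    startMs: chunk[0].startMs,
    endMs: chunk[chunk.length - 1].endMs,
  }));
}

function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Formats milliseconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(ms: number): string {
  const t = Math.round(ms);
  const h = Math.floor(t / 3600000);
  const m = Math.floor((t % 3600000) / 60000);
  const s = Math.floor((t % 60000) / 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(t % 1000, 3)}`;
}

// Renders numbered SRT cues separated by blank lines.
function toSrt(captions: Caption[]): string {
  return captions
    .map((c, i) => `${i + 1}\n${toSrtTime(c.startMs)} --> ${toSrtTime(c.endMs)}\n${c.text}\n`)
    .join("\n");
}
```

The resulting SRT text could then be written to a file and burned in with FFmpeg's `subtitles` filter, though whether the pipeline uses that filter or another rendering route is not stated in the source.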

Self-hosted, isolated, custom JWT

The whole stack runs in Docker on its own network — app, worker, Postgres, and Redis as separate containers, app and worker read-only with tmpfs scratch and named volumes for output and logs. Auth is a custom jose-signed JWT in a video_session cookie rather than a hosted identity provider, keeping the demo self-contained with no external account to depend on.

Tech stack

Next.js 16 · BullMQ · FFmpeg · Edge TTS · Custom JWT

What broke first

▸ Make the database row the source of truth. Storing each job's status on its Postgres row, rather than tracking it in the worker's memory, meant a crashed or restarted render left an inspectable, resumable row instead of a vanished process. Execution state belongs in the data.
▸ Concurrency 1 is a feature, not a limitation. FFmpeg saturates a CPU on its own; letting two encodes run together made everything slower. Matching worker concurrency to what the hardware can actually do beat any clever parallelism.
▸ Design for the free path first. Building the default route entirely on free tooling (local model, Edge TTS, stock B-roll, FFmpeg) and treating paid providers as opt-in upgrades kept the cost floor at zero and forced every step to have a working fallback, so a missing API key downshifts a stage instead of breaking the run.

Outcome

The pattern works: submit a topic, and a finished captioned vertical MP4 comes out the other end without a paid API key in sight. What it proves is the engineering shape — a durable job queue, a status machine you can watch, and a provider chain that prefers free tools and only reaches for paid ones when you opt in. It is a reference implementation of that pipeline, not a content operation; there is no upload scheduler, no channel, and no published volume to quote.

Honest limits

This is a pipeline-pattern demo, not a launched product. It is self-hosted, built solo, and runs in a local lab (the old VPS that once hosted experiments has been retired). The point is to prove the automation chain end to end on 100% free tooling — there are no real users, no published content volume, and no invented metrics. The paid provider integrations exist as opt-in branches; the demo runs the free path. Output quality is bounded by the free tools: stock B-roll and synthetic narration are serviceable for the pattern, not a substitute for crafted footage.