Offline Voice Dictation
System-wide push-to-talk dictation, zero network
Built by Rogue AI · System-wide push-to-talk on Windows · Fully offline
First push-to-talk version: January 2026 with whisper.cpp tiny. F9 hotkey + uIOhook integration in February once global-key capture worked reliably across focus contexts. Daily driver since.
The problem
Windows Voice Access is cloud-based, unreliable, and loses context. Third-party dictation tools either require a subscription or send every word you speak to a remote server. For confidential work — client notes, medical, legal, security research — neither is acceptable.
What I built
An Electron tray app that registers a global hotkey (F9 by default). Hold F9, speak, release — the transcribed text appears in the active application, wherever the cursor is. No internet required. No telemetry. No subscription.
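The hold-to-talk behavior can be sketched as a small state machine. This is a minimal illustration, not the app's actual code: `createPushToTalk`, `Recorder`, and `KeyEvent` are hypothetical names, and the key code 67 is a placeholder for whatever value the hook library reports for F9. The detail worth showing is that a held key usually produces repeated keydown events, so recording should start only on the first one.

```typescript
// Hypothetical event shape; a real global-hook library delivers something similar.
type KeyEvent = { type: "keydown" | "keyup"; keycode: number };

// Hypothetical recorder interface: stop() is assumed to hand the audio to transcription.
interface Recorder {
  start(): void;
  stop(): void;
}

// Returns a handler that turns raw key events into one start/stop pair per hold.
function createPushToTalk(
  hotkey: number,
  recorder: Recorder,
): (event: KeyEvent) => void {
  let held = false;

  return (event: KeyEvent): void => {
    // Ignore every key except the configured hotkey.
    if (event.keycode !== hotkey) return;

    if (event.type === "keydown") {
      // Key repeat sends many keydowns while the key is held; start once.
      if (held) return;
      held = true;
      recorder.start();
    } else if (held) {
      // Only a keyup that follows a tracked keydown ends the recording.
      held = false;
      recorder.stop();
    }
  };
}

// Placeholder key code for F9 (assumption; take the real value from the hook library).
const F9_PLACEHOLDER = 67;

const demoCalls: string[] = [];
const demoHandler = createPushToTalk(F9_PLACEHOLDER, {
  start: () => { demoCalls.push("start"); },
  stop: () => { demoCalls.push("stop"); },
});

demoHandler({ type: "keydown", keycode: F9_PLACEHOLDER });
demoHandler({ type: "keydown", keycode: F9_PLACEHOLDER }); // key repeat, ignored
demoHandler({ type: "keyup", keycode: F9_PLACEHOLDER });
```

In a real app the returned handler would be fed from the global hook's keydown and keyup events. Keeping it free of any hook or Electron import makes the logic easy to test on its own.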
Architecture
Tech stack
What broke first
- uIOhook native bindings on Windows are a build-chain rabbit hole: node-gyp + the right Windows SDK version + Python 3.11 (not 3.12, not 3.10). Documented the exact incantation in the README because I forgot it twice.
- whisper.cpp 'medium' is the sweet spot for English on a consumer GPU. 'small' is fast but drops technical terms; 'large-v3' is accurate but the latency makes the UX feel broken. Medium + Q5 quant on an RTX 4070 is sub-second for short phrases.
- Clipboard paste fails silently in Slack and Zoom because they intercept Ctrl+V. Added a fallback that simulates keystrokes when paste doesn't visibly land within 200 ms.
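The paste-then-fallback idea from the last point could look roughly like the sketch below. Everything here is an assumption for illustration: `deliverText` and the `DeliveryDeps` adapter are hypothetical names, and since the write-up does not say how the app detects that a paste "visibly landed", `hasLanded` is an injected placeholder probe. Only the 200 ms budget comes from the text above.

```typescript
// Hypothetical adapter: none of these names come from the project itself.
interface DeliveryDeps {
  paste(text: string): Promise<void>;        // e.g. set clipboard, then send Ctrl+V
  hasLanded(text: string): Promise<boolean>; // placeholder probe for "paste visibly landed"
  typeKeys(text: string): Promise<void>;     // simulate the text as individual keystrokes
  sleep(ms: number): Promise<void>;          // injected so tests can run without real delays
}

type DeliveryMethod = "paste" | "keystrokes";

// Try clipboard paste first; fall back to simulated keystrokes
// if the paste has not landed once the timeout budget is spent.
async function deliverText(
  text: string,
  deps: DeliveryDeps,
  timeoutMs: number = 200,
  pollMs: number = 25,
): Promise<DeliveryMethod> {
  await deps.paste(text);

  let waited = 0;
  while (true) {
    if (await deps.hasLanded(text)) return "paste";
    if (waited >= timeoutMs) break;
    await deps.sleep(pollMs);
    waited += pollMs;
  }

  // Paste was swallowed by the target application: type the text instead.
  await deps.typeKeys(text);
  return "keystrokes";
}
```

Injecting the four operations keeps the timing logic independent of any clipboard or input-simulation library, so the fallback path can be exercised with fakes instead of a running Slack or Zoom window.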
Outcome
F9 in any application — terminal, browser, email client, document editor — speak, release, text appears. No network calls. No cloud. Used daily for dictating technical notes and long-form content. Median transcription latency under one second per spoken phrase on a consumer GPU.
Honest limits
English-tuned. German dictation works but punctuation restoration is rough: fine for notes, not for client-facing prose without a once-over. No custom vocabulary training (whisper.cpp doesn't expose that cleanly). Latency is sub-second on a recent GPU but climbs to 2-3 s on CPU-only machines, which is fine for short phrases and painful for paragraphs.
