Audio AI

Offline Voice Dictation

System-wide push-to-talk dictation, zero network

Built by Rogue AI · System-wide push-to-talk on Windows · Fully offline

The first push-to-talk version shipped in January 2026 with the whisper.cpp tiny model. The F9 hotkey and uIOhook integration followed in February, once global key capture worked reliably across focus contexts. It has been my daily driver since.

The problem

Windows Voice Access is cloud-based, unreliable, and loses context. Third-party dictation tools either require a subscription or send every word you speak to a remote server. For confidential work — client notes, medical, legal, security research — neither is acceptable.

What I built

An Electron tray app that registers a global hotkey (F9 by default). Hold F9, speak, release — the transcribed text appears in the active application, wherever the cursor is. No internet required. No telemetry. No subscription.
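
The hold-speak-release cycle can be pictured as a two-state machine. The sketch below is illustrative only and is not the app's actual code: `PushToTalk`, its callbacks, and the keycode value are hypothetical names and placeholders. It assumes the hook library reports a numeric keycode on key-down and key-up, and that holding a key produces repeated key-down events that must be ignored.

```typescript
// Illustrative sketch; all names here are hypothetical.
type PttState = "idle" | "recording";

interface PttCallbacks {
  onStart: () => void;              // begin capturing audio
  onStop: (heldMs: number) => void; // stop capture and hand audio to transcription
}

class PushToTalk {
  private state: PttState = "idle";
  private startedAt = 0;
  private readonly hotkey: number;
  private readonly callbacks: PttCallbacks;
  private readonly now: () => number;

  constructor(hotkey: number, callbacks: PttCallbacks, now: () => number = () => Date.now()) {
    this.hotkey = hotkey;
    this.callbacks = callbacks;
    this.now = now;
  }

  keyDown(keycode: number): void {
    // A held key auto-repeats, so only the first key-down starts a recording.
    if (keycode !== this.hotkey || this.state === "recording") return;
    this.state = "recording";
    this.startedAt = this.now();
    this.callbacks.onStart();
  }

  keyUp(keycode: number): void {
    // Ignore other keys and stray key-ups that arrive while idle.
    if (keycode !== this.hotkey || this.state !== "recording") return;
    this.state = "idle";
    this.callbacks.onStop(this.now() - this.startedAt);
  }

  get isRecording(): boolean {
    return this.state === "recording";
  }
}
```

In a real tray app the two methods would be called from the hook library's key-down and key-up events, with the hotkey set to whatever code that library reports for F9. Passing `heldMs` to `onStop` makes it easy to discard accidental taps that are too short to contain speech.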

Architecture

Tray process: Electron, minimal UI, persistent tray icon, global config
Hotkey hook: uIOhook for true system-global key capture (works even when no window is focused)
Audio capture: Node.js audio input stream, 16 kHz mono PCM, recorded while the hotkey is held
Transcription: whisper.cpp with GPU acceleration (CUDA or Metal) and a CPU fallback; configurable model size (tiny/base/small/medium)
Text normalization: punctuation restoration, common-phrase corrections, configurable dictionary
Output: clipboard paste into the active application, or simulated keystrokes for apps that block paste
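
As an illustration of the text-normalization stage, here is a minimal sketch of a dictionary-based phrase correction pass. It is an assumption about how such a step could look, not the app's real implementation: `Dictionary`, `applyDictionary`, and `normalizeTranscript` are hypothetical names, and punctuation restoration is left out entirely.

```typescript
// Illustrative sketch; the dictionary shape and function names are hypothetical.
// Keys are phrases as the recognizer tends to write them, values are the preferred spelling.
type Dictionary = Record<string, string>;

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function applyDictionary(text: string, dict: Dictionary): string {
  const lookup = new Map<string, string>();
  for (const phrase of Object.keys(dict)) {
    lookup.set(phrase.toLowerCase(), dict[phrase]);
  }
  if (lookup.size === 0) return text;

  // Longest phrases first so "node js" is tried before "node".
  const alternatives = Array.from(lookup.keys())
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp)
    .join("|");

  // One pass over the text, matching whole words only, so a replacement
  // is never re-matched by a shorter dictionary entry.
  const pattern = new RegExp(
    `(?<![\\p{L}\\p{N}])(?:${alternatives})(?![\\p{L}\\p{N}])`,
    "giu",
  );
  return text.replace(pattern, (match) => lookup.get(match.toLowerCase()) ?? match);
}

function normalizeTranscript(raw: string, dict: Dictionary): string {
  // Trim and collapse runs of whitespace before applying corrections.
  const tidy = raw.trim().replace(/\s+/g, " ");
  return applyDictionary(tidy, dict);
}
```

For example, with a dictionary mapping "node js" to "Node.js" and "whisper cpp" to "whisper.cpp", a transcript containing those spoken forms comes out with the preferred spellings, while a word like "nodes" is left alone because only whole words match.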

Tech stack

Electron · Node.js · TypeScript · whisper.cpp · uIOhook · WASAPI

What broke first

  • uIOhook native bindings on Windows are a build-chain rabbit hole — node-gyp + the right Windows SDK version + Python 3.11 (not 3.12, not 3.10). Documented the exact incantation in the README because I forgot it twice.

  • whisper.cpp 'medium' is the sweet spot for English on a consumer GPU. 'small' is fast but drops technical terms; 'large-v3' is accurate but the latency makes the UX feel broken. The medium model with Q5 quantization on an RTX 4070 transcribes short phrases in under a second.

  • Clipboard paste fails silently in Slack and Zoom — they intercept Ctrl+V. Added a fallback that simulates keystrokes when paste doesn't visibly land within 200 ms.
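
The paste-then-fallback behaviour from the last item can be sketched as a small polling loop. This is a sketch under stated assumptions, not the app's real code: `deliverText`, `OutputDeps`, and especially `pasteLanded` are hypothetical, since the page does not say how the app detects that a paste arrived. Only the 200 ms window comes from the description above; the 20 ms poll interval is a placeholder.

```typescript
// Illustrative sketch; every name below is hypothetical.
type Delivery = "paste" | "keystrokes";

interface OutputDeps {
  paste(text: string): void;                 // copy to clipboard and send Ctrl+V
  pasteLanded(): boolean | Promise<boolean>; // assumed check that the text arrived
  typeKeys(text: string): void;              // simulate one keystroke per character
  sleep(ms: number): Promise<void>;
  now(): number;
}

// Default timing helpers for real use; tests can inject a fake clock instead.
const realTiming = {
  sleep: (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
  now: () => Date.now(),
};

async function deliverText(
  text: string,
  deps: OutputDeps,
  timeoutMs = 200,
  pollMs = 20,
): Promise<Delivery> {
  deps.paste(text);
  const deadline = deps.now() + timeoutMs;

  // Poll until the paste is confirmed or the window closes.
  while (deps.now() < deadline) {
    if (await deps.pasteLanded()) return "paste";
    await deps.sleep(pollMs);
  }

  // One last check at the deadline before falling back.
  if (await deps.pasteLanded()) return "paste";
  deps.typeKeys(text);
  return "keystrokes";
}
```

One trade-off to keep in mind with this design: if the target application accepts the paste after the window has closed, the text will appear twice, so the timeout has to be long enough for slow applications yet short enough that the fallback does not feel laggy.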

Outcome

F9 in any application — terminal, browser, email client, document editor — speak, release, text appears. No network calls. No cloud. Used daily for dictating technical notes and long-form content. Median transcription latency under one second per spoken phrase on a consumer GPU.

Honest limits

English-tuned. German dictation works, but punctuation restoration is rough: fine for notes, not for client-facing prose without a once-over. No custom vocabulary training (whisper.cpp doesn't expose that cleanly). Latency is sub-second on a recent GPU but climbs to 2 to 3 seconds on CPU-only machines, which is fine for short phrases and painful for paragraphs.
