Executive overview

Classification: Internal · Version: 1.0 · Audience: Leadership, sponsors · Primary focus: Product & architecture alignment

This is the main program document for the Voice Hotkey initiative. It frames the product vision, architecture, and delivery model. Technical depth is in the Software requirements; UX detail is in the Design guide.

1. Executive summary

Voice Hotkey is a Linux desktop application that uses a double-click of a configurable modifier key (default: Right Control) as the trigger to record voice, transcribe it locally with OpenAI Whisper, and copy the resulting text to the clipboard. The MVP is a headless daemon; no GUI is required.
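
As an illustration of the trigger gesture, the sketch below treats a double-click as two presses of the same key within a short time window. The class name, the 0.4-second window, and the injectable clock are assumptions made for this example; none of them come from the requirements.

```python
import time
from typing import Callable, Optional


class DoublePressDetector:
    """Report a double-click when two presses land within `window` seconds."""

    def __init__(self, window: float = 0.4,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window = window   # hypothetical default, not taken from the spec
        self._clock = clock    # injectable so the timing logic can be tested
        self._last_press: Optional[float] = None

    def press(self) -> bool:
        """Register one key press; return True if it completes a double-click."""
        now = self._clock()
        last = self._last_press
        if last is not None and now - last <= self.window:
            self._last_press = None  # reset so a third press starts a new pair
            return True
        self._last_press = now
        return False
```

In a `pynput` listener, `press()` would be called from the `on_press` callback whenever the key equals `keyboard.Key.ctrl_r`. Holding a key may produce repeated press events on some systems, so a production listener would probably also track releases.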

2. Strategic objectives

Frictionless voice-to-text on Linux via a single keyboard gesture.
Local-first transcription: no cloud dependency in the default configuration.
Minimal footprint: background process, configurable via TOML file.
Extensible: future support for TTS, autocomplete, autocorrect, and additional platforms.
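
To show what "configurable via TOML file" could look like in practice, here is a minimal loader sketch. The key names (`trigger_key`, `double_press_window`, `model`) and their defaults are hypothetical. The sketch also assumes the `tomli` backport on Python 3.9 and 3.10, because `tomllib` only joined the standard library in Python 3.11.

```python
from pathlib import Path

try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumed backport dependency on Python 3.9/3.10

CONFIG_PATH = Path.home() / ".config" / "voice-hotkey" / "config.toml"

# Hypothetical keys and defaults, for illustration only.
DEFAULTS = {
    "trigger_key": "ctrl_r",
    "double_press_window": 0.4,
    "model": "base",
}


def parse_config(text: str) -> dict:
    """Merge settings from TOML text over the defaults."""
    return {**DEFAULTS, **tomllib.loads(text)}


def load_config(path: Path = CONFIG_PATH) -> dict:
    """Read the config file if it exists; otherwise fall back to the defaults."""
    if not path.exists():
        return dict(DEFAULTS)
    return parse_config(path.read_text(encoding="utf-8"))
```

Falling back to defaults when the file is missing means the daemon can start with no setup.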

3. Architecture (executive view)

Layer	Decision
Language	Python 3.9+ — ecosystem maturity for audio, ML, and input capture.
Key capture	`pynput` — global keyboard listener (X11/XWayland).
Audio	`sounddevice` + `numpy` — real-time microphone capture.
Transcription	`openai-whisper` — local inference, no API key required.
Output	`pyperclip` → `xclip` / `wl-copy` — clipboard.
Config	TOML at `~/.config/voice-hotkey/config.toml`.
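
To make the table concrete, the sketch below wires the audio, transcription, and output layers into a single record → transcribe → copy pass. The library calls (`sounddevice.rec`, `sounddevice.wait`, `whisper.load_model`, `model.transcribe`, `pyperclip.copy`) are those packages' public functions. The function names, the fixed five-second recording, and the `base` model are assumptions for illustration, not the project's actual design.

```python
from typing import Any

SAMPLE_RATE = 16_000  # Whisper works on 16 kHz mono audio


def record(seconds: float = 5.0) -> Any:
    """Capture mono float32 audio from the default microphone."""
    import sounddevice as sd  # imported lazily: needs PortAudio and a microphone
    frames = int(seconds * SAMPLE_RATE)
    audio = sd.rec(frames, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()               # block until the recording has finished
    return audio.flatten()  # shape (frames, 1) -> (frames,)


def transcribe(audio: Any, model_name: str = "base") -> str:
    """Run local Whisper inference on a 1-D float32 array."""
    import whisper  # provided by the `openai-whisper` package
    model = whisper.load_model(model_name)  # a daemon would load this once
    return model.transcribe(audio, fp16=False)["text"].strip()


def to_clipboard(text: str) -> None:
    """Copy text to the system clipboard."""
    import pyperclip
    pyperclip.copy(text)


def run_once(rec=record, stt=transcribe, out=to_clipboard) -> str:
    """One pass of the pipeline; each stage is injectable for testing."""
    text = stt(rec())
    if text:  # skip the clipboard when nothing was recognized
        out(text)
    return text
```

Keeping each stage behind a plain function makes it possible to exercise `run_once` with stand-ins, without a microphone or a downloaded model.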

4. Governance

Product owns feature scope, key-binding defaults, and UX.
Engineering owns technical requirements, platform support, and security.
Executive approves scope changes that imply new infrastructure or vendor commitments.

5. Investment & risk

Cost: Engineering time only; no cloud inference cost in default config.
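Key risk: Wayland compatibility for global key capture. The XWayland fallback mitigates this only partially, because under Wayland `pynput` receives key events only from applications running under XWayland.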
Dependency risk: larger Whisper models transcribe more accurately but load and run more slowly; the model size is configurable.
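
As a sketch of how the daemon might surface the Wayland risk at startup, the helper below classifies the session from standard environment variables (`XDG_SESSION_TYPE`, `WAYLAND_DISPLAY`, `DISPLAY`). The function names and the decision to treat a set `DISPLAY` under Wayland as "XWayland available" are assumptions for this example.

```python
import os
from typing import Mapping


def session_type(env: Mapping[str, str] = os.environ) -> str:
    """Classify the desktop session as 'wayland', 'x11', or 'unknown'."""
    declared = env.get("XDG_SESSION_TYPE", "").lower()
    if declared in ("wayland", "x11"):
        return declared
    # Fall back to the display variables when the session type is not declared.
    if env.get("WAYLAND_DISPLAY"):
        return "wayland"
    if env.get("DISPLAY"):
        return "x11"
    return "unknown"


def xwayland_available(env: Mapping[str, str] = os.environ) -> bool:
    """Heuristic: a Wayland session that also exposes an X display."""
    return session_type(env) == "wayland" and bool(env.get("DISPLAY"))
```

A daemon could log a warning when `session_type()` returns `"wayland"`, so users know that global key capture may be limited.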

6. Roadmap

MVP: Right Control double-click → record → Whisper → clipboard.
Phase 2: Configurable keys, model selection, language detection.
Phase 3: TTS, autocomplete, autocorrect, active-window injection.
Phase 4: Multi-platform support (Windows, macOS).
