Executive overview
Classification: Internal · Version: 1.0 · Audience: Leadership, sponsors · Primary focus: Product & architecture alignment
This is the main program document for the Voice Hotkey initiative. It frames the product vision, architecture, and delivery model. Technical depth is in the Software requirements; UX detail is in the Design guide.
1. Executive summary
Voice Hotkey is a Linux desktop application that uses a double-click (two quick presses) of a configurable modifier key (default: Right Control) to trigger voice recording, transcribes the audio locally with OpenAI Whisper, and copies the resulting text to the clipboard. The MVP is a headless daemon; no GUI is required.
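The trigger gesture described above comes down to comparing the timestamps of two consecutive key presses. The following is a minimal sketch, not the project's implementation: the class name `DoublePressDetector`, the 0.4-second window, and the injectable `clock` parameter are all assumptions made for illustration. It ignores key auto-repeat by tracking whether the key is still held.

```python
import time
from typing import Callable, Optional


class DoublePressDetector:
    """Report a trigger when the key is pressed twice within a time window."""

    def __init__(self, max_interval: float = 0.4,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_interval = max_interval  # assumed threshold, in seconds
        self._clock = clock               # injectable so the logic can be tested
        self._last_press: Optional[float] = None
        self._held = False

    def on_press(self) -> bool:
        """Call on every press event; returns True when a double press completes."""
        if self._held:
            return False  # auto-repeat while the key is held down: ignore
        self._held = True
        now = self._clock()
        previous = self._last_press
        if previous is not None and now - previous <= self.max_interval:
            self._last_press = None  # reset so a third press starts a new pair
            return True
        self._last_press = now
        return False

    def on_release(self) -> None:
        """Call on every release event of the trigger key."""
        self._held = False
```

In a daemon, `on_press` and `on_release` would typically be called from a pynput `keyboard.Listener` callback whenever the event key equals `keyboard.Key.ctrl_r`; a `True` return would start a recording.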
2. Strategic objectives
- Frictionless voice-to-text on Linux via a single keyboard gesture.
- Local-first transcription: no cloud dependency in the default configuration.
- Minimal footprint: background process, configurable via TOML file.
- Extensible: future support for TTS, autocomplete, autocorrect, and additional platforms (Windows, macOS).
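To make the "configurable via TOML file" objective concrete, here is a hedged sketch of loading settings with built-in defaults. The field names (`trigger_key`, `double_press_interval`, `model`) and their default values are hypothetical; only the file location comes from this document. The standard-library `tomllib` module exists from Python 3.11 onward, so the sketch falls back to the third-party `tomli` backport on 3.9 and 3.10 if it is installed.

```python
from dataclasses import dataclass, fields, replace
from pathlib import Path
from typing import Any, Mapping

try:
    import tomllib as toml_reader  # standard library from Python 3.11
except ModuleNotFoundError:
    try:
        import tomli as toml_reader  # third-party backport for Python 3.9/3.10
    except ModuleNotFoundError:
        toml_reader = None  # no TOML parser available: defaults only

CONFIG_PATH = Path.home() / ".config" / "voice-hotkey" / "config.toml"


@dataclass(frozen=True)
class Config:
    # All field names and defaults below are hypothetical.
    trigger_key: str = "ctrl_r"
    double_press_interval: float = 0.4
    model: str = "base"


def merge_config(overrides: Mapping[str, Any]) -> Config:
    """Apply known keys from overrides on top of the defaults; ignore unknown keys."""
    known = {f.name for f in fields(Config)}
    return replace(Config(), **{k: v for k, v in overrides.items() if k in known})


def load_config(path: Path = CONFIG_PATH) -> Config:
    """Read the TOML file if it exists; otherwise return the defaults."""
    if toml_reader is None or not path.is_file():
        return Config()
    with path.open("rb") as handle:  # TOML parsers require a binary file handle
        return merge_config(toml_reader.load(handle))
```

Returning defaults when the file is missing keeps the daemon usable with zero configuration, which fits the minimal-footprint objective.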
3. Architecture (executive view)
| Layer | Decision |
| Language | Python 3.9+: ecosystem maturity for audio, ML, and input capture. |
| Key capture | pynput: global keyboard listener (X11/XWayland). |
| Audio | sounddevice + numpy: real-time microphone capture. |
| Transcription | openai-whisper: local inference, no API key required. |
| Output | pyperclip → xclip / wl-copy: system clipboard. |
| Config | TOML at ~/.config/voice-hotkey/config.toml. |
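As an illustration of how the audio, transcription, and output layers could fit together, the sketch below chains them into one pass. It is an assumption-laden outline rather than the project's code: the function names are hypothetical, recording uses a fixed duration for simplicity, and the third-party libraries are imported lazily so the orchestration logic can be exercised without audio hardware or a downloaded model.

```python
from typing import Any, Callable

SAMPLE_RATE = 16_000  # Whisper models work on 16 kHz mono audio


def record_audio(seconds: float = 5.0) -> Any:
    """Capture mono float32 audio from the default microphone (fixed length)."""
    import sounddevice as sd  # lazy import: needs PortAudio and a microphone

    frames = int(seconds * SAMPLE_RATE)
    audio = sd.rec(frames, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()  # block until the recording has finished
    return audio.flatten()  # 1-D array, the shape Whisper accepts


def make_transcriber(model_name: str = "base") -> Callable[[Any], str]:
    """Load a local Whisper model once and return a transcribe function."""
    import whisper  # lazy import: provided by the openai-whisper package

    model = whisper.load_model(model_name)

    def transcribe(audio: Any) -> str:
        result = model.transcribe(audio, fp16=False)  # fp16=False suits CPU inference
        return result["text"].strip()

    return transcribe


def copy_to_clipboard(text: str) -> None:
    """Place text on the system clipboard."""
    import pyperclip  # lazy import: delegates to xclip, wl-copy, or similar

    pyperclip.copy(text)


def run_once(record: Callable[[], Any],
             transcribe: Callable[[Any], str],
             copy: Callable[[str], None]) -> str:
    """Record, transcribe, and copy; skip the clipboard when nothing was heard."""
    text = transcribe(record())
    if text:
        copy(text)
    return text
```

A daemon would call something like `run_once(record_audio, make_transcriber("base"), copy_to_clipboard)` each time the trigger fires, building the transcriber once at startup so the model is not reloaded per recording.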
4. Governance
- Product owns feature scope, key-binding defaults, and UX.
- Engineering owns technical requirements, platform support, and security.
- Executive approves scope changes that imply new infrastructure or vendor commitments.
5. Investment & risk
- Cost: Engineering time only; no cloud inference cost in default config.
- Key risk: Wayland compatibility for global key capture (partially mitigated by XWayland fallback, under which pynput receives input events only from applications running under XWayland).
- Dependency risk: Whisper model size trades transcription quality against memory use and latency (the model is configurable).
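The Wayland risk can be surfaced at startup by inspecting the session environment. The sketch below is one possible approach, with hypothetical function names; it relies on the `XDG_SESSION_TYPE`, `WAYLAND_DISPLAY`, and `DISPLAY` environment variables and picks a clipboard command accordingly. The environment and the executable lookup are passed in as parameters so the logic can be checked without a desktop session.

```python
import os
import shutil
from typing import Callable, List, Mapping, Optional


def session_type(env: Mapping[str, str] = os.environ) -> str:
    """Return 'wayland', 'x11', or 'unknown' from standard session variables."""
    declared = env.get("XDG_SESSION_TYPE", "").lower()
    if declared in ("wayland", "x11"):
        return declared
    if env.get("WAYLAND_DISPLAY"):
        return "wayland"
    if env.get("DISPLAY"):
        return "x11"
    return "unknown"


def clipboard_command(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[List[str]]:
    """Pick a clipboard CLI matching the session; None if neither is installed."""
    candidates = [["wl-copy"], ["xclip", "-selection", "clipboard"]]
    if session_type(env) != "wayland":
        candidates.reverse()  # prefer xclip outside Wayland sessions
    for command in candidates:
        if which(command[0]):
            return command
    return None
```

A daemon could log a warning when `session_type()` returns `"wayland"`, since global key capture may then be limited to what XWayland exposes.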
6. Roadmap
- MVP: Right Control double-click → record → Whisper → clipboard.
- Phase 2: Configurable keys, model selection, language detection.
- Phase 3: TTS, autocomplete, autocorrect, active-window injection.
- Phase 4: Multi-platform support (Windows, macOS).