shipped 2026 · private utility · runs in a tray

Vox

Private · Python 3.13 + faster-whisper + Win32 SendInput · Right-Ctrl hotkey · 2026

The internal name is VoiceType; the showcase name is Vox.

Built for:
One person. Me, at the keyboard, all day, every day, when typing is slower than speaking.
Not built for:
Anyone who needs an account, a cloud upload, or a multi-language translation surface. Vox transcribes English at the cursor, full stop.

Hold right-Ctrl, speak, release. Whatever you said appears at the cursor — in any application, in any text field, with sub-second latency on a stock laptop CPU. No account, no cloud, no listening when you don’t want it to.

§ I

The problem

Cloud dictation tools are accurate, fast, and listening. The cost of accuracy and speed is that the audio leaves the machine; the cost of leaving the machine is that someone else gets a copy of every sentence I say at my desk. That’s a bad trade for daily use.

Local dictation has been good enough for a year and a half. Vox is a thin shim around faster-whisper with one trick: a single hotkey that turns on the mic, captures audio while held, transcribes on release, and types the result at the cursor via Win32 SendInput. No window, no menu, no decision to make.

§ II

Decisions

  1. kept · 2026

    Push-to-talk over voice activity detection. PTT is unambiguous; VAD picks up the radio, the dog, the air conditioning, my exhale at 11 PM. The wrong thing transcribed is worse than no transcription.

  2. kept · 2026

    A lockfile (voicetype.active) at the project root that exists only while the mic is hot. Other voice tools on the same machine respect the lockfile and mute themselves; without it, two listeners would fight for the mic and both would lose.

  3. cut · 2026

    A floating UI window. The status feedback is a tray icon — grey idle, blue recording, yellow processing, green flash on success. A real window would invite configuration, and configuration is the enemy of a tool I want to forget I’m using.
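The lockfile handshake in decision 2 needs no protocol beyond existence: the file is there while the mic is hot, gone otherwise. A peer tool's side of the handshake can be sketched in a few lines — the helper name is illustrative, not an API Vox actually exposes, and the path is taken from the text (voicetype.active at the project root):

```python
from pathlib import Path

# Path per the text: voicetype.active at the project root.
LOCKFILE = Path("voicetype.active")

def mic_is_claimed(lockfile: Path = LOCKFILE) -> bool:
    # A peer voice tool checks this before opening the mic and
    # stays muted for as long as Vox holds the file.
    return lockfile.exists()
```

Existence-as-signal keeps the coordination one-way and crash-tolerant enough for a desk tool: if Vox dies mid-recording, deleting one stale file recovers everything.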

§ III

System

Stack — current pins.
Layer      · Implementation       · Purpose
Hotkey     · Global Win32 hook    · Right-Ctrl press / release events
Capture    · PyAudio              · 16 kHz mono WAV ring buffer while held
Transcribe · faster-whisper       · Local Whisper inference, CPU or CUDA
Inject     · Win32 SendInput      · Types into the focused text field
Status     · Tray icon (Tkinter)  · Four colors, no chrome, no decisions
Coord      · Lockfile             · voicetype.active · respected by peer tools
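The capture row's "ring buffer while held" is worth being precise about: the buffer is bounded, so a stuck key cannot grow memory forever. A stdlib sketch of the shape — the 30-second cap and chunk size are assumptions for illustration; Vox's real buffer sits behind PyAudio's callback:

```python
from collections import deque

SAMPLE_RATE = 16_000   # 16 kHz mono, per the capture layer
MAX_SECONDS = 30       # cap assumed for the sketch, not Vox's actual limit
CHUNK = 1_024          # frames per callback chunk (assumed)

class RingBuffer:
    """Bounded audio buffer: oldest chunks fall off once the cap is hit."""

    def __init__(self):
        self.chunks = deque(maxlen=(SAMPLE_RATE * MAX_SECONDS) // CHUNK)

    def write(self, chunk: bytes):
        # Called from the audio callback while the hotkey is held.
        self.chunks.append(chunk)

    def read_all(self) -> bytes:
        # Called once on release; drains the buffer for transcription.
        data = b"".join(self.chunks)
        self.chunks.clear()
        return data
```

A `deque` with `maxlen` gives the drop-oldest behavior for free; the release handler drains it exactly once and hands the bytes to the transcriber.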
vox/hotkey_loop.py · python · push-to-talk loop
from pathlib import Path

from pynput.keyboard import Key          # Key.ctrl_r below is pynput's

LOCKFILE = Path("voicetype.active")      # mic-ownership flag, project root

# One global hotkey: right-Ctrl. Press starts recording and writes
# the lockfile so peer voice tools mute themselves. Release stops,
# transcribes locally, types into whatever has focus.
def on_press(key):
    if key != Key.ctrl_r or recorder.active:
        return
    LOCKFILE.touch()                         # peer tools see this
    set_tray("recording")
    recorder.start()                         # 16 kHz mono ring buf

def on_release(key):
    if key != Key.ctrl_r or not recorder.active:
        return
    audio = recorder.stop()
    LOCKFILE.unlink(missing_ok=True)
    set_tray("processing")
    text = whisper.transcribe(audio, language="en").strip()
    if text:
        SendInput(text)                      # type at the cursor
        set_tray("done")
    else:
        set_tray("idle")
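The `SendInput(text)` call above is a wrapper the listing leaves abstract; the log's `"method":"win32_unicode"` points at `KEYEVENTF_UNICODE` keyboard events. A minimal `ctypes` sketch of that path — the struct layout is the documented Win32 one, but the helper names (`build_inputs`, `send_unicode_text`) are mine, not Vox's:

```python
import ctypes
from ctypes import wintypes

INPUT_KEYBOARD    = 1
KEYEVENTF_UNICODE = 0x0004
KEYEVENTF_KEYUP   = 0x0002

class KEYBDINPUT(ctypes.Structure):
    _fields_ = [("wVk", wintypes.WORD), ("wScan", wintypes.WORD),
                ("dwFlags", wintypes.DWORD), ("time", wintypes.DWORD),
                ("dwExtraInfo", wintypes.WPARAM)]

class _INPUTUNION(ctypes.Union):
    _fields_ = [("ki", KEYBDINPUT),
                ("pad", ctypes.c_byte * 32)]   # pad union to MOUSEINPUT size

class INPUT(ctypes.Structure):
    _anonymous_ = ("u",)
    _fields_ = [("type", wintypes.DWORD), ("u", _INPUTUNION)]

def build_inputs(text: str):
    """One key-down + key-up INPUT pair per UTF-16 code unit,
    so characters outside the BMP inject as surrogate pairs."""
    units = text.encode("utf-16-le")
    events = []
    for i in range(0, len(units), 2):
        scan = int.from_bytes(units[i:i + 2], "little")
        for flags in (KEYEVENTF_UNICODE, KEYEVENTF_UNICODE | KEYEVENTF_KEYUP):
            inp = INPUT()
            inp.type = INPUT_KEYBOARD
            inp.ki = KEYBDINPUT(0, scan, flags, 0, 0)
            events.append(inp)
    return (INPUT * len(events))(*events)

def send_unicode_text(text: str) -> int:
    """Windows-only: inject the events into the focused control."""
    inputs = build_inputs(text)
    user32 = ctypes.WinDLL("user32", use_last_error=True)
    return user32.SendInput(len(inputs), inputs, ctypes.sizeof(INPUT))
```

Using `KEYEVENTF_UNICODE` with `wScan` side-steps keyboard layouts entirely, which is why the typed text lands correctly in any focused field regardless of the active layout.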
vox.session.log · ndjson · 1.4s end-to-end
{"t":"14:02:18.402","event":"press","key":"right_ctrl","lockfile":"created"}
{"t":"14:02:18.418","event":"capture_start","sr":16000,"channels":1}
{"t":"14:02:21.071","event":"release","key":"right_ctrl","duration_ms":2653}
{"t":"14:02:21.084","event":"transcribe_start","model":"distil-whisper-large-v3","device":"cpu"}
{"t":"14:02:21.812","event":"transcribe_done","ms":728,"chars":62}
{"t":"14:02:21.815","event":"send_input","method":"win32_unicode","chars":62}
{"t":"14:02:21.821","event":"done","total_ms":1419,"text":"the cost of accuracy is that the audio leaves the machine"}
FIGURE. Press → release → typed at cursor in 1.4 seconds end-to-end. CPU only; no model warm-up keeping the laptop awake.
Vox tray-icon four states tiled — idle, recording at 2.3s, transcribing on whisper-large-v3, done in 1.4s end-to-end — and a Windows taskbar fragment showing the icon embedded among system tray icons.
FIGURE. The four states of the tray icon. No window, no menu, no decision to make — push, speak, release, and the typed sentence shows up at the cursor.

Acknowledgments

Vox stands on faster-whisper, the Whisper weights from OpenAI, and the long line of small-utility-as-tray-icon tools that prove a useful piece of software does not need a UI to be useful.
