Speech To Text Typing for Wayland users

@zetaphor.com

There are many times I find myself wishing I had a simple, convenient speech-to-text solution, but as a Wayland (Linux display protocol) user, software like Talon, WisprFlow, and SuperWhisper is unavailable to me.

So I decided to just build my own solution with Python and NVIDIA's Canary model. That project became Freespeak.

Canary is NVIDIA's latest automatic speech recognition (ASR) model, and it is smaller, faster, and more accurate than OpenAI's Whisper. In fact, I'm defaulting to the 180 million parameter model rather than the standard 1 billion parameter model. This is a model that could easily be run on a smartphone or a standard consumer CPU.

I now have a keyboard shortcut mapped to a dedicated key (UHK 80 ftw) that activates microphone recording. Once activated, the application uses the Silero VAD model to detect when human speech is present so that it doesn't try to transcribe background noise. Once the utterance has ended, the audio is trimmed and sent to Canary for transcription, and the result is finally typed into whatever window I have focused using ydotool.
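As a rough sketch of that flow (not Freespeak's actual code), the end of an utterance can be found from per-frame speech probabilities, and the final text handed to ydotool. The function names, the 0.5 threshold, and the silence window are my own assumptions for illustration. The probabilities are assumed to come from a VAD model such as Silero, and typing assumes ydotool is installed and its daemon is running.

```python
import subprocess
from typing import List, Optional, Sequence, Tuple


def find_utterance(
    speech_probs: Sequence[float],
    threshold: float = 0.5,        # assumed cutoff, tune for your microphone
    min_silence_frames: int = 10,  # assumed trailing silence that ends speech
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the first utterance, end exclusive.

    speech_probs holds one speech probability per audio frame. Returns None
    when no utterance has both started and finished.
    """
    start = None
    last_speech = None
    for i, prob in enumerate(speech_probs):
        if prob >= threshold:
            if start is None:
                start = i  # first frame that looks like speech
            last_speech = i
        elif start is not None and i - last_speech >= min_silence_frames:
            # Enough trailing silence: trim to the last speech frame.
            return start, last_speech + 1
    return None


def ydotool_command(text: str) -> List[str]:
    """Build the argument list for typing text with `ydotool type`."""
    return ["ydotool", "type", text]


def type_text(text: str) -> None:
    """Type text into the focused window (needs ydotool to be available)."""
    subprocess.run(ydotool_command(text), check=True)


probs = [0.1, 0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.1]
print(find_utterance(probs, min_silence_frames=3))  # → (2, 6)
```

The short dip at frame 4 does not end the utterance because it is shorter than the silence window, which is the behaviour you want for natural pauses between words.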

There's an additional post-processing step that converts spoken phrases like "exclamation mark", "comma", and "question mark" into their respective characters. There's also an optional step that runs the raw transcription through a self-hosted instance of LanguageTool. This helps clean up grammar and capitalization, and lets me easily substitute or exclude words with a personal dictionary. A Docker Compose file is included in the repo.
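Here is a minimal sketch of what those two steps could look like, again not the code from the repo. The phrase table and function names are hypothetical. The LanguageTool part assumes the server's HTTP `/v2/check` endpoint, and the `http://localhost:8010` address is only a guess at a local setup, so adjust it to match your container.

```python
import json
import re
import urllib.parse
import urllib.request

# Hypothetical phrase table; extend it with whatever you dictate often.
SPOKEN_PUNCTUATION = {
    "exclamation mark": "!",
    "question mark": "?",
    "comma": ",",
}

# Match a spoken phrase plus any whitespace before it, so "hello comma"
# becomes "hello," rather than "hello ,".
_PUNCTUATION_RE = re.compile(
    r"\s*\b(" + "|".join(re.escape(p) for p in SPOKEN_PUNCTUATION) + r")\b",
    re.IGNORECASE,
)


def replace_spoken_punctuation(text: str) -> str:
    return _PUNCTUATION_RE.sub(
        lambda m: SPOKEN_PUNCTUATION[m.group(1).lower()], text
    )


def languagetool_matches(text, base_url="http://localhost:8010", language="en-US"):
    """Ask a LanguageTool server for suggestions (base_url is an assumption)."""
    data = urllib.parse.urlencode({"text": text, "language": language}).encode()
    with urllib.request.urlopen(f"{base_url}/v2/check", data=data) as response:
        return json.load(response)["matches"]


def apply_matches(text: str, matches: list) -> str:
    """Apply the first suggested replacement of each match.

    Works from the end of the string so earlier offsets stay valid. Assumes
    the offsets line up with Python string indices, which should hold for
    ordinary text.
    """
    for match in sorted(matches, key=lambda m: m["offset"], reverse=True):
        if not match["replacements"]:
            continue
        start = match["offset"]
        end = start + match["length"]
        text = text[:start] + match["replacements"][0]["value"] + text[end:]
    return text


print(replace_spoken_punctuation("hello comma how are you question mark"))
# → hello, how are you?
```

With a server running, `apply_matches(text, languagetool_matches(text))` would chain the two. Taking the first suggestion blindly is the simplest policy, and a personal dictionary on the LanguageTool side is what keeps it from "correcting" names and jargon.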

I'm pretty happy with how it turned out. The implementation is relatively low latency and uses very little system memory, and the accuracy of Canary blows everything else I've tried out of the water.

I'll probably expand further on the substitution step to enable sending special control keys like media play/pause, volume control, etc. I'll probably also investigate binding special phrases to shell scripts so I can automate other tasks.
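None of this exists yet, but one way it might work is a lookup table from spoken phrases to commands, checked before the text is typed. Everything in this sketch is hypothetical, including the phrases and the choice of `playerctl` and `wpctl` as example commands.

```python
import subprocess
from typing import Dict, List, Optional

# Hypothetical phrase-to-command table; the commands are only examples.
VOICE_COMMANDS: Dict[str, List[str]] = {
    "play pause": ["playerctl", "play-pause"],
    "volume up": ["wpctl", "set-volume", "@DEFAULT_AUDIO_SINK@", "5%+"],
    "volume down": ["wpctl", "set-volume", "@DEFAULT_AUDIO_SINK@", "5%-"],
}


def match_command(transcript: str) -> Optional[List[str]]:
    """Return the command for a transcript that is exactly a known phrase."""
    # Normalize case, whitespace, and trailing punctuation from the ASR output.
    phrase = " ".join(transcript.lower().split()).strip(".,!? ")
    return VOICE_COMMANDS.get(phrase)


def handle_transcript(transcript: str) -> bool:
    """Run a matching command; return False so the caller types the text."""
    command = match_command(transcript)
    if command is None:
        return False
    subprocess.run(command, check=False)
    return True
```

Requiring the whole utterance to match a phrase keeps ordinary dictation that happens to contain "volume up" from triggering anything.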

Further down the line I'd like to look into integrating a prototype I built in a separate project that replicates Microsoft's Mouse Grid feature.
