Voice Assistant Setup

This summer I am starting again on my voice assistant project. I've decided to use my main workstation as the server (for now; if this goes well I'll move it to a dedicated machine) and to make the satellite nodes responsible for wakeword detection only. For now I will be using a Raspberry Pi 4 as the satellite node, but that's overkill. I will move to a lighter, lower-cost solution in the future. I just already have the Pi and don't want to buy a bunch of new hardware before I need it.

I am using Python with FastAPI, Uvicorn, and faster-whisper, with the standard Jarvis wakeword for now. Two years ago, I had phrases and intents mapped to specific functions; this time around, I am architecting the project to be modular and skills-based so that I can easily extend it. I am also starting with an LLM-first architecture, rather than writing hardcoded functions with the intention of adding a local model later.
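To make the skills-based idea concrete, here is a minimal sketch of one way a skill registry could look. It is not the code from my project: the names Skill, SKILLS, skill, dispatch, and the echo skill are all hypothetical, and it assumes the model (or anything else) eventually hands back a skill name plus keyword arguments.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# All names here (Skill, SKILLS, skill, dispatch, echo) are hypothetical
# and only illustrate the pattern.


@dataclass
class Skill:
    name: str
    description: str  # text that could be shown to the LLM when it picks a skill
    handler: Callable[..., str]


SKILLS: Dict[str, Skill] = {}


def skill(name: str, description: str):
    """Decorator that registers a function as a skill under `name`."""
    def register(func: Callable[..., str]) -> Callable[..., str]:
        SKILLS[name] = Skill(name, description, func)
        return func
    return register


@skill("echo", "Repeat the user's text back.")
def echo(text: str) -> str:
    return f"You said: {text}"


def dispatch(name: str, **kwargs) -> str:
    """Run the named skill, or report that it is unknown."""
    entry = SKILLS.get(name)
    if entry is None:
        return f"Unknown skill: {name}"
    return entry.handler(**kwargs)


if __name__ == "__main__":
    print(dispatch("echo", text="hello"))
```

The point of the pattern is that adding a skill means writing one decorated function, with no changes to the dispatch code, which is the opposite of the old phrase-to-function mapping.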

As of today, the Pi listens for the wakeword, waits for user input, sends that input to my llama model, and responds through the satellite's speaker. The process was a headache, since installing and configuring packages is always more of a hassle than expected, but that is an area where Claude can really help. One disclaimer: this time I leaned on Claude, but I do think it was very valuable to do it myself the first time, so that I have some idea of what's going on with the device instead of just prompt-reprompt-cross-fingers-that-nothing-goes-off-the-rails. I would highly recommend doing the same if you're trying something similar.

I have also set up the infrastructure and code conventions for skills beyond a single question:answer from the model, and added a couple of base skills that just make a POST request and print simulated actions.
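As a rough illustration of what a "simulated" base skill can look like, the sketch below builds a JSON POST request with the standard library and prints it instead of sending it. The URL, the action name, and both function names are made up for the example, and the payload shape is an assumption, not my actual convention.

```python
import json
import urllib.request


def build_action_request(url: str, action: str, params: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request describing an action."""
    body = json.dumps({"action": action, "params": params}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_simulated(url: str, action: str, params: dict) -> str:
    """Print what would be sent, without touching the network."""
    req = build_action_request(url, action, params)
    message = f"[simulated] {req.get_method()} {req.full_url} {req.data.decode('utf-8')}"
    print(message)
    return message


if __name__ == "__main__":
    # Hypothetical endpoint and action, for illustration only.
    run_simulated("http://localhost:8000/skills/lights", "lights_on", {"room": "office"})
```

Swapping the print for urllib.request.urlopen(req) would turn the dry run into a real call once there is something on the other end to receive it.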

Hitches and headaches

aplay is kind of a nightmare. Two years ago I did go through it manually, checking device numbers, configuring drivers, the whole nine yards. Again, Claude is a real time saver here, but having at least some idea of where things are going wrong is incredibly useful for guiding the prompts, even if Claude handles the actual syntax.
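For the device-number part specifically, a small helper like the one below can take some of the guesswork out. It parses the output of aplay -l into ALSA device strings. The sample output is illustrative and not copied from my Pi, the function name is made up, and the regular expression assumes the usual "card N: ... device M:" line layout.

```python
import re
from typing import List, Tuple

# Matches lines such as:
#   card 1: Device [USB Audio Device], device 0: USB Audio [USB Audio]
CARD_LINE = re.compile(r"^card (\d+): (\S+) \[(.*?)\], device (\d+):")

# Illustrative sample only; real output varies by hardware and drivers.
SAMPLE_OUTPUT = """**** List of PLAYBACK Hardware Devices ****
card 0: Headphones [bcm2835 Headphones], device 0: bcm2835 Headphones [bcm2835 Headphones]
  Subdevices: 8/8
  Subdevice #0: subdevice #0
card 1: Device [USB Audio Device], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
"""


def list_playback_devices(aplay_output: str) -> List[Tuple[str, str]]:
    """Return (alsa_device, card_name) pairs found in `aplay -l` output."""
    devices = []
    for line in aplay_output.splitlines():
        match = CARD_LINE.match(line)
        if match:
            card, _card_id, card_name, device = match.groups()
            devices.append((f"plughw:{card},{device}", card_name))
    return devices


if __name__ == "__main__":
    for alsa_device, card_name in list_playback_devices(SAMPLE_OUTPUT):
        print(alsa_device, "->", card_name)
```

On the device itself, the input would come from something like subprocess.run(["aplay", "-l"], capture_output=True, text=True).stdout, and the resulting string can be passed to aplay with -D, for example aplay -D plughw:1,0 test.wav.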

Current limitations

As of this writing, the assistant just takes a query and gives the model's answer. This results in frequent hallucinations and is not, in and of itself, that useful.

Next up

A WebSocket server, plus skills like timers and weather.