I built a voice assistant using https://github.com/KoljaB/RealtimeSTT a couple of months ago. I recently started working on some VoIP technology, and there's a desire to do real-time translation: user A speaks in language A, and that speech is translated in real time for user B.
I foresee that the critical need will be low latency. I want to transcribe what the speaker is saying in real time in chunks, translate it in chunks, then send those chunks to my speech generator and play them.
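To make the chunked flow concrete, here is a minimal sketch of the pipeline shape, assuming one worker thread per stage connected by queues so that a later chunk can be transcribed while an earlier one is still being translated or spoken. The `transcribe`, `translate`, and `synthesize` functions are hypothetical placeholders that only tag strings; they are not calls from RealtimeSTT or any other library.

```python
import queue
import threading

_STOP = object()  # sentinel that marks the end of the stream


def stage(fn, inbox, outbox):
    """Apply fn to each item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # pass the end-of-stream marker downstream
            return
        outbox.put(fn(item))


# Hypothetical placeholder stages: swap in a real STT, MT, and TTS engine.
def transcribe(audio_chunk):
    return f"text({audio_chunk})"


def translate(text):
    return f"translated({text})"


def synthesize(text):
    return f"audio({text})"


def run_pipeline(audio_chunks):
    """Push chunks through STT -> translation -> TTS, preserving order."""
    q_audio, q_text, q_translated, q_speech = (queue.Queue() for _ in range(4))
    workers = [
        threading.Thread(target=stage, args=(transcribe, q_audio, q_text)),
        threading.Thread(target=stage, args=(translate, q_text, q_translated)),
        threading.Thread(target=stage, args=(synthesize, q_translated, q_speech)),
    ]
    for worker in workers:
        worker.start()

    for chunk in audio_chunks:
        q_audio.put(chunk)
    q_audio.put(_STOP)

    played = []
    while True:
        item = q_speech.get()
        if item is _STOP:
            break
        played.append(item)  # a real system would play the audio here

    for worker in workers:
        worker.join()
    return played


if __name__ == "__main__":
    for out in run_pipeline(["chunk1", "chunk2", "chunk3"]):
        print(out)
```

With a single worker per stage and FIFO queues, chunks come out in the order they went in, which matters for speech playback. End-to-end latency for a chunk is roughly the sum of the three stage times, so smaller chunks lower latency at the cost of giving the translator less context.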
Has anyone worked on something like this, and can you point me to research I should read or technologies I can use? I've already built a voice assistant that uses wake words to transcribe user questions, passes the text through an LLM, gets a response, and mutates our game environment. So I have wake-word listening and recording STT, plus TTS for the response.
The pieces I don’t have yet:
My current toolchain (for an Alexa-like assistant) lets me take wake-worded STT and process it with appropriate context through ChatGPT to produce an appropriate, controlled result (using structured outputs). So I'm making two major changes: moving to a chunk-based STT model that doesn't use wake words, and doing translation instead of answering queries.
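For the chunk-based side without wake words, one simple way to decide where a chunk ends is to flush on punctuation or after a maximum number of words. This is only a sketch under that assumption: `chunk_words` and its `max_words` parameter are hypothetical names, and it assumes the STT engine emits a stream of words with punctuation attached.

```python
def chunk_words(words, max_words=8):
    """Group a stream of words into chunks that are ready to translate.

    A chunk is flushed when a word ends with punctuation (a natural pause)
    or when the buffer reaches max_words (a cap on added latency).
    """
    buffer = []
    for word in words:
        buffer.append(word)
        if word.endswith((".", "?", "!", ",")) or len(buffer) >= max_words:
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush whatever is left when the stream ends
        yield " ".join(buffer)


if __name__ == "__main__":
    stream = "hello there, how are you doing today my friend and so on".split()
    for chunk in chunk_words(stream, max_words=4):
        print(chunk)
```

A lower `max_words` cuts latency but hands the translator less context per chunk, which can hurt quality for language pairs with very different word order.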
submitted by /u/Rosstin