The Deepgram flavor leverages Deepgram's Speech-to-Text API to transcribe the SIP user's spoken input into text. This transcription is then sent to the ChatGPT API, which interprets the message and generates a response. The response text is subsequently passed back to Deepgram's Text-to-Speech API, which converts it into voice format, allowing the response to be played back to the user.
It is using Deepgram's nova-2 module, with the conversationalai option (by default) to interpret the user's input. In order to determine the correct phrasing, we are relying on the model's logic to determine the phrases and punctuate them accordingly.
By default, the language used is English, but can be tuned to support other languages as well, depending on the Models used. You can find out more about how to tune the Deepgram modules here.
Communication with Deepgram is done over WebSocket channels, ensuring efficient transfer of real-time audio media. Media is encoded using the codec received from the user. Currently supported codecs for STT are:
- g711 PCMU - mulaw
- g711 PCMA - alaw
- Opus
A full list of Deepgram's supported encodings is here.
We are using the asynchronous OpenAI Python library to communicate with ChatGPT backend. By default we are using the gpt-4o model for conversational AI, but others can be used as well. A full list of available models and their capabilities can be found here.
In order to playback the AI's result to the user, we are using Deepgram's Text-to-Speech REST interface.
Codecs used for playing back the audio to the user are the same ones used for STT, with a few constraints enforced by the Deepgram's TTS engine.
The following parameters can be tuned for this engine:
Section | Parameter | Environment | Mandatory | Description | Default |
---|---|---|---|---|---|
deepgram |
key |
DEEPGRAM_API_KEY |
yes | Deepgram API key | not provided |
deepgram |
chatgpt_key or openai_key |
CHATGPT_API_KEY /OPENAI_API_KEY |
yes | OpenAI API key used for ChatGPT | not provided |
deepgram |
chatgpt_model |
CHATGPT_API_MODEL |
no | OpenAI Model used for ChatGPT text interaction | gpt-4o |
deepgram |
speech_model |
DEEPGRAM_SPEECH_MODEL |
no | Deepgram's speech detection model | nova-2-conversationalai |
deepgram |
language |
DEEPGRAM_LANGUAGE |
no | Deepgram's supported language used for speech transcoding | en-US |
deepgram |
voice |
DEEPGRAM_VOICE |
no | Deepgram's voice used for speaking back the response | aura-asteria-en |
deepgram |
welcome_message |
DEEPGRAM_WELCOME_MSG |
no | A welcome message to be played back to the user when the call starts | `` |
deepgram |
disable |
DEEPGRAM_DISABLE |
no | Disables the flavor | false |