WebSockets Guide
WebSockets are the most performant option for text-to-speech.
You can send multiple generation requests over the same WebSocket connection by specifying a different context_id in each outgoing message, and then differentiate the returned audio by the context_id in incoming messages.
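As a minimal sketch of the multiplexing idea above, the Python below tags each outgoing request with a context_id and groups incoming audio by the same field. Only context_id comes from this guide. The transcript and data field names, the base64 audio encoding, and the helper names build_request and demux_audio are assumptions made for illustration, so check the API reference for the actual message schema.

```python
import base64
import json
from collections import defaultdict


def build_request(context_id: str, text: str) -> str:
    """Serialize one generation request tagged with its context_id."""
    # "transcript" is a hypothetical field name for the input text.
    return json.dumps({"context_id": context_id, "transcript": text})


def demux_audio(raw_messages):
    """Group audio bytes from incoming messages by their context_id."""
    audio = defaultdict(bytearray)
    for raw in raw_messages:
        msg = json.loads(raw)
        # Assumption: audio arrives base64-encoded in a "data" field;
        # messages without that field (e.g. status messages) are skipped.
        if "data" in msg:
            audio[msg["context_id"]] += base64.b64decode(msg["data"])
    return {cid: bytes(buf) for cid, buf in audio.items()}
```

Because responses for different contexts can interleave on a single connection, keying a buffer by context_id keeps each generation's audio contiguous without needing one socket per request.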
You can also stream input text in chunks of any size by setting the continue field to true in all but the last chunk.
This helps avoid inconsistencies in prosody that could occur if text chunks were treated completely independently.
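The chunking rule can be sketched as follows: every chunk except the last carries continue set to true, and all chunks share one context_id. The transcript field name and the chunk_messages helper are hypothetical; only context_id and continue are taken from this guide.

```python
import json


def chunk_messages(context_id: str, chunks):
    """Build one message per text chunk, marking all but the last as continued."""
    chunks = list(chunks)
    messages = []
    for i, chunk in enumerate(chunks):
        is_last = i == len(chunks) - 1
        messages.append(json.dumps({
            "context_id": context_id,
            "transcript": chunk,      # hypothetical field name for the text
            "continue": not is_last,  # true on every chunk except the final one
        }))
    return messages
```

This helper needs the full list up front to know which chunk is last. When text arrives incrementally (for example from a language model), one option is to hold back the most recent chunk and send it with continue set to false once the stream ends.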
Here are the best practices for working with WebSockets:
- create the connection beforehand to avoid introducing additional latency for the first generation;
- reuse the connection for multiple generations;
- periodically reopen the connection, because no single connection is guaranteed to stay alive for longer than 24 hours;
- close the connection if it’s unused because idle connections may be closed server-side.
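One way to apply these practices is a small wrapper that opens the connection ahead of time, reuses it, replaces it before it gets too old, and closes it when idle. The sketch below is not tied to any particular WebSocket library: connect is a hypothetical callable that returns an object with a close() method, and the 23-hour maximum age and 5-minute idle timeout are illustrative assumptions, not documented limits.

```python
import time


class ManagedConnection:
    """Reuse one connection, recycling it by age and closing it when idle."""

    def __init__(self, connect, max_age_s=23 * 3600, idle_timeout_s=300,
                 clock=time.monotonic):
        self._connect = connect              # hypothetical connection factory
        self._max_age_s = max_age_s          # assumed margin below 24 hours
        self._idle_timeout_s = idle_timeout_s  # assumed idle threshold
        self._clock = clock
        self._conn = None
        self._opened_at = 0.0
        self._last_used = 0.0

    def get(self):
        """Return a live connection, reopening it if it is missing or too old."""
        now = self._clock()
        if self._conn is not None and now - self._opened_at >= self._max_age_s:
            self.close()
        if self._conn is None:
            self._conn = self._connect()
            self._opened_at = now
        self._last_used = now
        return self._conn

    def close_if_idle(self):
        """Close the connection if unused for idle_timeout_s; report whether it closed."""
        if (self._conn is not None
                and self._clock() - self._last_used >= self._idle_timeout_s):
            self.close()
            return True
        return False

    def close(self):
        """Close the current connection, if any."""
        if self._conn is not None:
            self._conn.close()
            self._conn = None
```

Calling get() once at startup creates the connection before the first generation request, and calling close_if_idle() from a periodic timer releases it when traffic stops.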
See the API reference for details.