# Custom Endpoints (BYO LLM / STT / TTS)
Rymi can route any agent's LLM, transcription, or speech synthesis traffic to a server you host yourself. This is for teams who:
- run their own model behind a private network or compliance boundary,
- need a fine-tuned LLM, voice, or STT model that no public provider hosts,
- want to swap a provider without changing the rest of their pipeline.
> **Plan:** Custom endpoints are available on Pro and Enterprise plans.
## How it works
Each agent has three optional endpoint URL columns:
- `custom_llm_url` — points the agent's language model at your server.
- `custom_voice_url` — points the agent's TTS at your server.
- `custom_transcriber_url` — points the agent's STT at your server.
When an agent's stack picker selects Custom LLM, Custom Voice, or Custom Transcriber, the gateway dispatches that channel's traffic to the configured URL. The other two channels stay on whatever provider you picked.
A bearer token can be attached per channel via Settings → Providers (or from the in-studio Connect API key chip). Rymi forwards it as an `Authorization: Bearer …` header on every dispatch. Leave the token blank for unauthenticated endpoints (private VPC, signed URLs, etc.).
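On your server, checking that header takes a few lines of middleware. A minimal Express sketch, assuming the expected token lives in a `RYMI_BEARER_TOKEN` environment variable (our name for illustration, not something Rymi sets):

```ts
import express from 'express';

const app = express();
// Hypothetical env var for this sketch; use whatever secret store you prefer.
const EXPECTED = process.env.RYMI_BEARER_TOKEN;

// Reject any dispatch that doesn't carry the token configured in Settings → Providers.
app.use((req, res, next) => {
  if (EXPECTED && req.headers.authorization !== `Bearer ${EXPECTED}`) {
    return res.status(401).json({ error: 'invalid bearer token' });
  }
  next();
});

app.listen(8080);
```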
## Configuring an endpoint
1. Open the agent in Studio → Voice & AI Models.
2. Click the expand icon at the top-right of the channel you want to override (Speech Recognition, Language model, or Voice Engine).
3. Pick the matching Custom … model from the provider list.
4. Paste your endpoint URL into Self-hosted endpoint and save.
5. Optional: click Connect API key in the same drawer's provider header to attach a bearer token.
## Wire formats
> **Warning:** The wire format is fixed by Rymi — your server must implement the contract below verbatim. We deliberately keep it small so it's easy to ship.
### LLM — OpenAI-compatible `/v1/chat/completions`
```
POST {custom_llm_url}
Authorization: Bearer <token>        (optional)
Content-Type: application/json
```

The body is the standard OpenAI Chat Completions request:
```json
{
  "model": "your-model-name",
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."}
  ],
  "stream": true,
  "stream_options": {"include_usage": true},
  "temperature": 0.7,
  "max_tokens": 2048
}
```

The response must be a Server-Sent Events (SSE) stream in the OpenAI Chat Completions delta format. Token usage is read from the final chunk's `data.usage` field; if your server can't provide it, return `null` and Rymi falls back to estimated counts.
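For orientation, a conforming stream looks like the following (payloads abridged; the final data chunk carries `usage` because the request sets `stream_options.include_usage`):

```
data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":"Hel"},"finish_reason":null}],"usage":null}

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"lo."},"finish_reason":"stop"}],"usage":null}

data: {"object":"chat.completion.chunk","choices":[],"usage":{"prompt_tokens":42,"completion_tokens":7,"total_tokens":49}}

data: [DONE]
```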
Most self-hosted stacks (vLLM, llama.cpp, Ollama, LiteLLM) and OpenAI-compatible hosted APIs (Together, OpenRouter, Groq, Cerebras, Fireworks, etc.) already speak this protocol. If yours does, you're done.
#### Reference — FastAPI proxy to a local model
```python
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import httpx

app = FastAPI()
UPSTREAM = "http://localhost:8000/v1/chat/completions"  # vLLM, llama.cpp, etc.

@app.post("/v1/chat/completions")
async def proxy(req: Request):
    payload = await req.json()

    async def gen():
        # Pass the upstream SSE bytes through untouched.
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream("POST", UPSTREAM, json=payload) as r:
                async for chunk in r.aiter_raw():
                    yield chunk

    return StreamingResponse(gen(), media_type="text/event-stream")
```

### TTS — HTTP POST with streaming audio bytes
```
POST {custom_voice_url}
Authorization: Bearer <token>        (optional)
Content-Type: application/json
```

Body:
```json
{
  "text": "the sentence to synthesize",
  "voice": "voice-id-from-your-catalog",
  "language": "en-US",
  "format": "pcm_24000",
  "instructions": "speak warmly and slowly"
}
```

Response: HTTP 200 with `Transfer-Encoding: chunked`. The body is a binary stream of PCM 16-bit signed little-endian, mono, at the sample rate encoded in `format` (default 24 kHz). Rymi forwards the chunks to the call as they arrive, so the sooner you start writing bytes, the lower the time-to-first-audio.
`instructions` is sent only when the LLM requests an emotional style. Servers that can't honor it should ignore the field.
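If your server supports more than one output rate, the rate can be parsed out of `format` before synthesis starts. A small sketch, assuming values follow the `pcm_<rate>` pattern shown above:

```ts
// Parse "pcm_24000" into 24000, falling back to the documented 24 kHz default.
function sampleRateFromFormat(format?: string): number {
  const match = /^pcm_(\d+)$/.exec(format ?? '');
  return match ? Number(match[1]) : 24000;
}

sampleRateFromFormat('pcm_24000'); // 24000
sampleRateFromFormat(undefined);   // 24000
```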
#### Reference — Express server proxying to a local TTS engine
```ts
import express from 'express';
import { spawn } from 'child_process';

const app = express();
app.use(express.json());

app.post('/tts', (req, res) => {
  res.setHeader('Content-Type', 'application/octet-stream');
  res.setHeader('Transfer-Encoding', 'chunked');
  const { text, voice } = req.body;
  // Replace with your synthesis pipeline. The piper TTS binary streams
  // raw PCM to stdout, which we forward verbatim.
  const tts = spawn('piper', ['--model', voice, '--output_raw'], { stdio: ['pipe', 'pipe', 'inherit'] });
  tts.stdout.pipe(res);
  tts.stdin.end(text);
});

app.listen(8080);
```
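To smoke-test an endpoint like the one above, POST a sentence and dump the stream to disk. A sketch assuming the server above is listening locally on port 8080 and a piper voice id (both are this example's choices, not Rymi's):

```ts
import { createWriteStream } from 'node:fs';
import { Readable } from 'node:stream';

const res = await fetch('http://localhost:8080/tts', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    text: 'Hello from my own TTS server.',
    voice: 'en_US-amy-medium', // example piper voice; substitute your own catalog
    language: 'en-US',
    format: 'pcm_24000',
  }),
});

// out.pcm is raw 16-bit signed LE mono at 24 kHz; import it as raw data in Audacity to listen.
Readable.fromWeb(res.body as any).pipe(createWriteStream('out.pcm'));
```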
### STT — WebSocket streaming

```
wss://{custom_transcriber_url}
Authorization: Bearer <token>        (optional, sent on the upgrade)
```

Once the WebSocket is open, Rymi sends:
- Binary frames — PCM 16-bit signed little-endian, mono, 16kHz. Variable size; buffer and process them as you like.
- JSON control messages:
  - `{"type":"finalize"}` — Rymi requests a forced final transcript at the end of an utterance.
  - `{"type":"close"}` — terminate the session.
Your server sends:
{"type":"interim", "text": "the user is saying"}
{"type":"final", "text": "the user is saying.", "confidence": 0.92, "language_code": "en-US"}
{"type":"utterance_end"}
{"type":"speech_started"}
{"type":"error", "message": "model unavailable"}confidence and language_code are optional. utterance_end is what tells Rymi the caller has stopped speaking — without it, barge-in and turn-taking won't behave correctly. speech_started is optional and improves barge-in.
#### Reference — Node WebSocket server wrapping a local STT engine
```ts
import { WebSocketServer } from 'ws';

// Stand-in for your STT engine, e.g. an HTTP call to a faster-whisper sidecar
// or a native binding. This is a placeholder, not a real package API.
async function transcribe(pcm: Buffer, opts?: { partial?: boolean }): Promise<string> {
  throw new Error('wire up your STT engine here');
}

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  let buf = Buffer.alloc(0);
  let lastInterimAt = 0;

  ws.on('message', async (data, isBinary) => {
    if (isBinary) {
      buf = Buffer.concat([buf, data as Buffer]);
      // Emit an interim roughly every 600 ms of audio (16000 Hz * 2 bytes * 0.6 s = 19200 bytes).
      if (buf.length - lastInterimAt >= 19200) {
        lastInterimAt = buf.length;
        const interim = await transcribe(buf, { partial: true });
        ws.send(JSON.stringify({ type: 'interim', text: interim }));
      }
      return;
    }
    const msg = JSON.parse(data.toString());
    if (msg.type === 'finalize') {
      const final = await transcribe(buf);
      ws.send(JSON.stringify({ type: 'final', text: final, confidence: 1, language_code: 'en-US' }));
      ws.send(JSON.stringify({ type: 'utterance_end' }));
      buf = Buffer.alloc(0);
      lastInterimAt = 0;
    } else if (msg.type === 'close') {
      ws.close();
    }
  });
});
```

## Validation & error handling
- URLs must use `https://` for LLM/TTS or `wss://` for STT. Plain `http://` and `ws://` are rejected at save time (a sketch of the check follows this list).
- The agent fails to connect (with a clear log line) if a custom slot is selected but the matching URL is empty.
- HTTP 4xx / 5xx from your server bubbles up as a call error and the gateway does not fall back to a default provider — failure is loud, not silent.
- Bearer tokens are encrypted at rest in `tenant_providers` and only decrypted in-memory in the gateway when warming a call.
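The scheme rule is easy to mirror in your own tooling before you save. A minimal sketch (the helper name is ours, not Rymi's):

```ts
// Mirror of the save-time check: LLM/TTS endpoints must be https://, STT must be wss://.
function isValidEndpoint(url: string, channel: 'llm' | 'voice' | 'transcriber'): boolean {
  try {
    const scheme = new URL(url).protocol;
    return channel === 'transcriber' ? scheme === 'wss:' : scheme === 'https:';
  } catch {
    return false; // not a parseable URL at all
  }
}

isValidEndpoint('https://llm.example.com/v1/chat/completions', 'llm'); // true
isValidEndpoint('ws://stt.example.com', 'transcriber');                // false: plain ws:// is rejected
```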
## Observability
Every dispatch records a `provider_usage` event with `provider: 'custom-llm' | 'custom-voice' | 'custom-transcriber'` and the byte / token counts. Look for these in the agent's call detail page or query `call_events` directly.
## When NOT to use custom endpoints
- If your model is already available through a public BYOK provider (OpenAI, Anthropic, Groq, ElevenLabs, Deepgram, etc.) — use that provider's BYOK slot instead. You'll get richer observability and no custom-server maintenance.
- If you need a different wire format — e.g. you want to point Rymi at a Vapi-shaped or Retell-shaped server. The Rymi-native shapes above are the only ones supported today; a compatibility shim is on the roadmap.

