A high-performance, real-time speech-to-speech system designed for low-latency telephony communication. Prodigy integrates Whisper (ASR), LLaMA (LLM), and a generic TTS stage (with hot-pluggable Kokoro, NeuTTS, VITS2, or Matcha-TTS engines) into a linear microservice pipeline, using a standalone SIP client as an RTP gateway. Optimized for Apple Silicon (CoreML/Metal) with no PyTorch runtime dependency.
An optional RAG sidecar (tomedo-crawl) connects to a Tomedo electronic medical records (EMR) server, crawls patient data, and feeds LLaMA with per-caller context so the AI can greet patients by name and give medically-informed responses.
           Telephony Network
                  |
             [SIP Client] ──────────────────────────► tomedo-crawl
              /        \          POST /caller        (port 13181)
         RTP in        RTP out                             │
            |             ^       GET /caller              │
           IAP           OAP      GET /query               │
            |             ^                                ▼
           VAD        TTS stage                      [Ollama embed]
            |          ^    ▲                        [Vector store ]
            |          |    │ (engine dock, port 13143)
            |          |    └─ Kokoro / NeuTTS / VITS2 / Matcha engine
         Whisper ────► LLaMA ◄─────────────────────────────┘
                                  RAG context injection
         [Frontend]  (web UI + log aggregation)
The pipeline is a linear chain of C++ programs. Every adjacent pair communicates over two persistent TCP connections (management + data). The frontend manages all services and provides a web UI at http://0.0.0.0:8080/.
tomedo-crawl is a sidecar — it is not in the audio data path. Communication with other pipeline services is via its own HTTP REST API (port 13181).
- OS: macOS Apple Silicon (M1/M2/M3/M4)
- Language: C++17, Python 3.9+
- Build: CMake 3.22+, Ninja (recommended)
- Dependencies (installed automatically by `runmetoinstalleverythingfirst`):
  - whisper.cpp (compiled with CoreML + Metal)
  - llama.cpp (compiled with Metal)
  - espeak-ng (`brew install espeak-ng`)
  - macOS frameworks: Accelerate, Metal, CoreML, Foundation
# Step 1: Install everything (Homebrew, Miniconda, models, CoreML exports)
./runmetoinstalleverythingfirst
# Step 2: Build all services
./runmetobuildeverything
# Step 3: Launch
cd bin && ./frontend
# Web UI: http://localhost:8080

`runmetobuildeverything` auto-clones whisper-cpp and llama-cpp if missing, detects Ninja for fast parallel builds, and bypasses the macOS Xcode license check using the Command Line Tools SDK directly.
# Build whisper.cpp (CoreML + Metal, static)
cmake -G Ninja -S whisper-cpp -B whisper-cpp/build \
-DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF \
-DWHISPER_COREML=ON -DGGML_METAL=ON \
-DWHISPER_BUILD_TESTS=OFF -DWHISPER_BUILD_EXAMPLES=OFF
cmake --build whisper-cpp/build -j
# Build llama.cpp (Metal, static)
cmake -G Ninja -S llama-cpp -B llama-cpp/build \
-DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF \
-DGGML_METAL=ON
cmake --build llama-cpp/build -j
# Build Prodigy
cmake -G Ninja -S . -B build \
-DCMAKE_BUILD_TYPE=Release -DKOKORO_COREML=ON -DBUILD_TESTS=ON
cmake --build build -j

| Option | Default | Description |
|---|---|---|
| `KOKORO_COREML` | `ON` | Enable CoreML ANE acceleration for Kokoro decoder |
| `BUILD_TESTS` | `ON` | Build unit/integration tests (requires GoogleTest); disable with `-DBUILD_TESTS=OFF` |
| `ESPEAK_NG_DATA_DIR` | auto-detected | Path to espeak-ng-data directory |
All model files are placed in bin/models/. runmetoinstalleverythingfirst downloads and prepares all of these automatically.
| File | Size | Purpose |
|---|---|---|
| `ggml-large-v3-turbo-q5_0.bin` | ~547 MB | Default ASR model (best speed/accuracy balance) |
| `ggml-large-v3-q5_0.bin` | ~1.0 GB | Higher accuracy ASR model |
| `ggml-large-v3-turbo-encoder.mlmodelc/` | varies | CoreML ANE encoder for large-v3-turbo |
| `ggml-large-v3-encoder.mlmodelc/` | varies | CoreML ANE encoder for large-v3 |

| File | Size | Purpose |
|---|---|---|
| `Llama-3.2-1B-Instruct-Q8_0.gguf` | ~1.2 GB | Response generation (Metal-accelerated) |
Located in bin/models/kokoro-german/:
| File | Purpose |
|---|---|
| `coreml/kokoro_duration.mlmodelc/` | Duration model (CoreML ANE) |
| `coreml/kokoro_f0n_{3s,5s,10s}.mlmodelc/` | F0/N predictor buckets (CoreML ANE) |
| `decoder_variants/*.mlmodelc/` | Split decoder models (CoreML ANE) |
| `<voice>_voice.bin` | Voice style embedding (256-dim float32). Available: `df_eva_voice.bin`, `dm_bernd_voice.bin` |
| `vocab.json` | Phoneme-to-token mapping |
Located in bin/models/neutts-nano-german/:
| File | Size | Purpose |
|---|---|---|
| `neutts-nano-german-Q8_0.gguf` | ~241 MB | LLaMA-based speech backbone (Q8_0, near-lossless) |
| `neucodec_decoder.mlmodelc/` | ~3.4 GB | NeuCodec CoreML decoder |
| `ref_codes.bin` | - | Pre-computed reference voice codec codes |
| `ref_text.txt` | - | Reference voice phoneme transcript |
Located in bin/models/vits2-german/:
| File | Purpose |
|---|---|
| `de_DE-thorsten-high.onnx` | Piper VITS2 German voice model (ONNX) |
| `de_DE-thorsten-high.onnx.json` | Model config: sample rate, phoneme set, speaker info |
Download with python3 scripts/setup_vits2_models.py --output-dir bin/models/vits2-german.
Located in bin/models/matcha-german/coreml/:
| File | Purpose |
|---|---|
| `matcha_encoder.mlmodelc/` | Text encoder (CoreML ANE, 512-token input) |
| `matcha_flow_3s.mlmodelc/` | Baked ODE flow, 3-second bucket (CoreML ANE) |
| `matcha_flow_5s.mlmodelc/` | Baked ODE flow, 5-second bucket (CoreML ANE) |
| `matcha_flow_10s.mlmodelc/` | Baked ODE flow, 10-second bucket (CoreML ANE) |
| `matcha_vocoder.mlmodelc/` | HiFi-GAN mel-to-waveform vocoder (CoreML ANE) |
| `vocab.json` | Phoneme-to-token mapping |
Export with python3 scripts/export_matcha_models.py --checkpoint <path> --output-dir bin/models/matcha-german/coreml.
Located in bin/models/:
| File | Size | Purpose |
|---|---|---|
| `moshiko-pytorch-bf16-q8.gguf` | ~8 GB | Q8 quantized LM model (recommended for 16GB Macs) |
| `moshi-en-q8-backend-config.json` | - | English Q8 backend config |
| `moshi-german-backend-config.json` | - | German LoRA-merged backend config |
| `moshiko-german-candle.q8.gguf` | ~8 GB | German LoRA-merged Q8 model (if converted) |
The Moshi backend is built from kyutai-labs/moshi (Rust) with Metal patches for Apple Silicon. The Q8 GGUF model is required on 16GB Macs: the BF16 safetensors weights (15GB) cause swap thrashing.
Performance (Apple M4 Mac mini, 16GB, Q8 GGUF + Metal):
- Steady-state LM step: ~101ms per frame against an 80ms budget (12.5Hz frame rate), i.e. about 1.26× slower than real time
- Model warmup: ~4 seconds
- First step: ~2.2s (Metal shader compilation)
Located in bin/models/g2p/:
| File | Purpose |
|---|---|
| `de_g2p.mlmodelc/` | DeepPhonemizer German G2P model (CoreML, ~5M params) |
| `char_vocab.json` | Input character vocabulary |
| `phoneme_vocab.json` | Output IPA phoneme vocabulary |
Export with python3 scripts/export_g2p_model.py --output-dir bin/models/g2p. Used by Kokoro and Matcha when --g2p neural is set; falls back to espeak-ng for non-German input.
All services bind to 127.0.0.1:
| Service | Mgmt Port | Data Port | Cmd Port | Notes |
|---|---|---|---|---|
| SIP Client | 13100 | 13101 | 13102 | + SIP UDP 5060 + RTP UDP 10000+ |
| IAP | 13110 | 13111 | 13112 | |
| VAD | 13115 | 13116 | 13117 | |
| Whisper | 13120 | 13121 | 13122 | |
| LLaMA | 13130 | 13131 | 13132 | |
| TTS stage (dock) | 13140 | 13141 | 13142 | Engine dock listens on 13143 |
| Kokoro engine | — | — | 13144 | Docks into TTS stage on 13143 |
| NeuTTS engine | — | — | 13174 | Docks into TTS stage on 13143 |
| VITS2 engine | — | — | 13175 | Docks into TTS stage on 13143 |
| Matcha-TTS engine | — | — | 13176 | Docks into TTS stage on 13143 |
| OAP | 13150 | 13151 | 13152 | |
| Moshi service | 13160 | 13161 | 13162 | |
| Moshi backend | — | — | — | WebSocket 8998+ (one per language) |
| Frontend | - | - | - | HTTP 8080, Log UDP 22022 |
| tomedo-crawl | 13180 | 13181 | 13182 | REST API on 13181; 13180/13182 reserved |
RTP gateway and SIP stack. Handles SIP registration with Digest authentication, incoming/outgoing call management, and routes raw RTP audio between the telephony network and the internal pipeline.
Key behaviors:
- Minimal SIP stack over raw UDP (port 5060 by default)
- MD5 Digest authentication with `WWW-Authenticate` challenge parsing
- Re-registers every 60 seconds
- Multi-line: supports N simultaneous SIP registrations (`--lines 0` is valid for test-only mode)
- Inbound RTP forwarded to IAP as raw Packet frames (12-byte RTP header included; IAP strips it)
- Outbound G.711 frames from OAP wrapped in RTP headers (seq, timestamp, SSRC) and sent via UDP
- Stale call auto-hangup after 60 seconds of no RTP traffic
- RTP port base: 10000, incremented by 2 per call
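The RTP handling in the bullets above can be illustrated with a short Python sketch. It is not the project's C++ code: it assumes payload type 0 (PCMU, per RFC 3551), no CSRC entries and no header extension, and the names `build_rtp_packet`, `strip_rtp_header`, and `rtp_port_for_call` are hypothetical.

```python
import struct

RTP_VERSION = 2
PT_PCMU = 0  # assumption: static payload type for G.711 mu-law (RFC 3551)


def build_rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int) -> bytes:
    """Prefix a G.711 payload with a fixed 12-byte RTP header."""
    byte0 = RTP_VERSION << 6   # V=2, P=0, X=0, CC=0
    byte1 = PT_PCMU & 0x7F     # M=0, PT=0
    header = struct.pack(
        "!BBHII",              # network byte order: 1+1+2+4+4 = 12 bytes
        byte0,
        byte1,
        seq & 0xFFFF,
        timestamp & 0xFFFFFFFF,
        ssrc & 0xFFFFFFFF,
    )
    return header + payload


def strip_rtp_header(packet: bytes) -> bytes:
    """Drop the fixed 12-byte header (valid only when CC=0 and X=0)."""
    return packet[12:]


def rtp_port_for_call(call_index: int, base: int = 10000) -> int:
    """RTP port base 10000, incremented by 2 per call."""
    return base + 2 * call_index
```

For a 20ms G.711 frame the payload is 160 bytes, so each packet is 172 bytes on the wire and the timestamp advances by 160 per packet at the 8kHz clock.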
Command-Line Parameters:
| Argument | Default | Description |
|---|---|---|
| `[--lines N] [<user> <server> [port]]` | 0 lines | Lines to register at startup; positional args only needed when lines > 0 |
| `--log-level <LEVEL>` | `INFO` | Log verbosity: ERROR, WARN, INFO, DEBUG, TRACE |

Runtime Commands (cmd port 13102):
| Command | Description |
|---|---|
| `ADD_LINE <user> <server> <port> <password>` | Register a new SIP account dynamically (space-delimited; use `-` for no password) |
| `REMOVE_LINE <index>` | Unregister and remove a SIP line by index |
| `LIST_LINES` | List all registered SIP lines |
| `GET_STATS` | JSON RTP counters for all active calls (rx/tx packets, bytes, forwarded, discarded) |
| `PING` | Health check → `PONG` |
| `STATUS` | Registered lines, active calls, connection state |
| `SET_LOG_LEVEL:<LEVEL>` | Change log verbosity without restart |
Converts G.711 μ-law telephony audio (8kHz) to float32 PCM (16kHz) for the VAD service.
Signal chain:
- G.711 μ-law decode: 256-entry ITU-T lookup table; each byte → float32 in [-1.0, 1.0]
- 8kHz→16kHz upsample: 15-tap Hamming-windowed sinc FIR half-band filter (cutoff ~3.8kHz, ~40dB stopband). Zero-stuffs input, then filters to remove spectral copies above 4kHz.
Each 160-byte RTP payload (20ms @ 8kHz) produces 320 float32 samples (20ms @ 16kHz). Continues processing and discards output if VAD is unavailable; auto-reconnects when VAD comes back online.
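The decode-and-upsample chain can be sketched as follows. This is an illustrative Python version, not the service's implementation: the μ-law expansion is the standard G.711 formula, but the filter design here places the cutoff at exactly one quarter of the output rate for simplicity (the text cites ~3.8kHz), and the names `ULAW_TABLE`, `design_halfband`, `upsample_2x`, and `iap_process` are hypothetical.

```python
import math


def ulaw_decode_byte(u: int) -> int:
    """Standard G.711 mu-law expansion of one byte to a linear PCM value."""
    u = ~u & 0xFF
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


# 256-entry lookup table, scaled to float in [-1.0, 1.0]
ULAW_TABLE = [ulaw_decode_byte(b) / 32768.0 for b in range(256)]


def design_halfband(taps: int = 15) -> list:
    """Hamming-windowed sinc with cutoff at fs_out/4 (assumed design)."""
    mid = taps // 2
    coeffs = []
    for n in range(taps):
        k = n - mid
        ideal = 1.0 if k == 0 else math.sin(math.pi * k / 2) / (math.pi * k / 2)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (taps - 1))
        coeffs.append(ideal * window)
    return coeffs


def upsample_2x(samples: list, coeffs: list) -> list:
    """Zero-stuff by 2, then FIR-filter to remove the spectral image."""
    stuffed = []
    for s in samples:
        stuffed.extend((s, 0.0))
    mid = len(coeffs) // 2
    out = []
    for i in range(len(stuffed)):
        acc = 0.0
        for j, c in enumerate(coeffs):
            idx = i + mid - j
            if 0 <= idx < len(stuffed):
                acc += c * stuffed[idx]
        out.append(acc)
    return out


def iap_process(payload: bytes) -> list:
    """160 mu-law bytes (20 ms @ 8 kHz) -> 320 floats (20 ms @ 16 kHz)."""
    pcm_8k = [ULAW_TABLE[b] for b in payload]
    return upsample_2x(pcm_8k, design_halfband())
```

With this design the original samples pass through unchanged at even output positions and the filter interpolates the odd ones.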
Command-Line Parameters:
| Argument | Default | Description |
|---|---|---|
| `--log-level <LEVEL>` | `INFO` | Log verbosity |

Runtime Commands (cmd port 13112):
| Command | Description |
|---|---|
| `PING` | Health check → `PONG` |
| `STATUS` | Active call count, upstream/downstream state, avg/max per-packet latency (μs) |
| `SET_LOG_LEVEL:<LEVEL>` | Change log verbosity without restart |
Energy-based Voice Activity Detection. Segments continuous 16kHz PCM into speech chunks (0.5–8 seconds) for Whisper.
Algorithm:
- Adaptive noise floor: EMA update (alpha=0.05) during silence frames; time constant ~1 second
- Onset detection: requires 3 consecutive frames above `threshold × noise_floor` to confirm speech start
- End detection: consecutive sub-threshold frames spanning the configured silence duration (`--vad-silence-ms`, default 700ms) trigger speech-end
- Micro-pause detection: short pauses (~400ms) between words trigger early submission rather than waiting for full silence — reduces Whisper inference latency since inference time scales with chunk length
- Smart-split: when max chunk length is reached during speech, finds the lowest-energy frame near the boundary to avoid cutting mid-word
- Pre-speech context: 400ms (8 frames × 50ms) before confirmed onset is prepended to each chunk
- RMS energy gate: chunks with RMS < 0.005 discarded as near-silence
- SPEECH_ACTIVE/SPEECH_IDLE signals: broadcast downstream to the TTS stage (teed to the docked engine) and OAP for TTS interruption and SPEECH_IDLE-driven warm-up
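The core of the algorithm above (adaptive noise floor, 3-frame onset confirmation, silence-based end detection) can be sketched in Python. This is a simplified illustration under assumptions: the class name `EnergyVad`, the initial noise floor, and the choice to return `SPEECH_ACTIVE`/`SPEECH_IDLE` directly from `process()` are hypothetical, and micro-pause detection, smart-split, and pre-speech context are left out.

```python
import math


class EnergyVad:
    def __init__(self, threshold=2.0, alpha=0.05, onset_frames=3,
                 silence_ms=700, window_ms=50, initial_floor=1e-3):
        self.threshold = threshold
        self.alpha = alpha
        self.onset_frames = onset_frames
        self.silence_frames_needed = math.ceil(silence_ms / window_ms)
        self.noise_floor = initial_floor  # assumed starting value
        self.in_speech = False
        self.loud_run = 0
        self.quiet_run = 0

    def process(self, frame):
        """Feed one analysis window; returns an event name or None."""
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        loud = rms > self.threshold * self.noise_floor
        if not self.in_speech:
            if loud:
                self.loud_run += 1
                if self.loud_run >= self.onset_frames:
                    self.in_speech = True
                    self.quiet_run = 0
                    return "SPEECH_ACTIVE"
            else:
                self.loud_run = 0
                # EMA noise-floor update, only on silence frames
                self.noise_floor += self.alpha * (rms - self.noise_floor)
            return None
        if loud:
            self.quiet_run = 0
            return None
        self.quiet_run += 1
        if self.quiet_run >= self.silence_frames_needed:
            self.in_speech = False
            self.loud_run = 0
            return "SPEECH_IDLE"
        return None
```

With 50ms windows and alpha=0.05, the noise floor adapts with a time constant of roughly 50ms / 0.05 = 1 second, which matches the figure quoted above.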
Command-Line Parameters:
| Argument | Default | Description |
|---|---|---|
| `--vad-window-ms <ms>` / `-w <ms>` | `50` | Frame analysis window length |
| `--vad-threshold <mult>` / `-t <mult>` | `2.0` | Energy threshold multiplier over noise floor |
| `--vad-silence-ms <ms>` / `-s <ms>` | `700` | Silence duration to end speech segment |
| `--vad-max-chunk-ms <ms>` / `-c <ms>` | `12000` | Maximum speech chunk duration |
| `--vad-onset-gap <ms>` / `-g <ms>` | `-1` (auto) | Minimum gap between consecutive onsets (negative = auto-derive from silence-ms) |
| `--log-level <LEVEL>` / `-L <LEVEL>` | `INFO` | Log verbosity |

Runtime Commands (cmd port 13117):
| Command | Description |
|---|---|
| `PING` | Health check → `PONG` |
| `STATUS` | Noise floor, threshold, silence_ms, max_chunk_ms, active calls, upstream/downstream state |
| `SET_VAD_THRESHOLD:<mult>` | Update threshold multiplier at runtime |
| `SET_VAD_SILENCE_MS:<ms>` | Update silence detection duration at runtime |
| `SET_VAD_MAX_CHUNK_MS:<ms>` | Update max chunk length at runtime |
| `SET_LOG_LEVEL:<LEVEL>` | Change log verbosity without restart |
Automatic Speech Recognition (ASR). Receives pre-segmented speech chunks from VAD and returns transcribed text to LLaMA.
Inference details:
- Backend: whisper.cpp with CoreML ANE (Apple Neural Engine) + Metal fallback
- Decoding: Greedy strategy (not beam search). On 2–8s segments, greedy is 3–5× faster than beam_size=5 with negligible accuracy difference. Temperature fallback with `temp_inc=0.2` handles uncertain segments.
- Telephony-optimized parameters: `no_speech_thold=0.9` (prevents early decoder stop on G.711-degraded audio), `entropy_thold=2.8` (tolerant of codec uncertainty)
- No audio normalization: audio passed directly to Whisper (matches whisper-cli defaults for optimal accuracy on G.711 input)
- RMS energy pre-check: rejects chunks with RMS < 0.005 to prevent hallucinations on near-silence
- Packet buffering: if LLaMA is disconnected, buffers up to 64 transcription packets and drains them on reconnect
- Hallucination filter (default OFF, runtime-toggleable): exact-match detection of common Whisper hallucination strings (e.g., "Untertitel", "Copyright", "Musik"); repetition detection; trailing suffix stripping
Command-Line Parameters:
| Argument | Default | Description |
|---|---|---|
| `--language <lang>` / `-l <lang>` | `de` | Whisper language code |
| `--model <path>` / `-m <path>` | `models/ggml-large-v3-turbo-q5_0.bin` | Path to GGML model file |
| `--log-level <LEVEL>` | `INFO` | Log verbosity |

Runtime Commands (cmd port 13122):
| Command | Description |
|---|---|
| `PING` | Health check → `PONG` |
| `STATUS` | Model name, upstream/downstream state, hallucination filter state |
| `HALLUCINATION_FILTER:ON` / `OFF` | Enable/disable hallucination filter |
| `HALLUCINATION_FILTER:STATUS` | Query filter state |
| `SET_LOG_LEVEL:<LEVEL>` | Change log verbosity without restart |
Generates a spoken German reply from transcribed text using Llama-3.2-1B-Instruct.
Inference details:
- Model: Llama-3.2-1B-Instruct Q8_0 GGUF, all layers on Metal GPU (`n_gpu_layers=-1`)
- Template: `llama_chat_apply_template()` uses the model's built-in chat template for correct role tagging; no manual prompt formatting
- Sampling: Greedy (`llama_sampler_init_greedy`). Max 96 tokens per response. Stops at sentence-ending punctuation (`.`, `?`, `!`) or EOS.
- Context: 2048 tokens, 4 threads
- German system prompt: enforces always-German, max 1 sentence / 15 words, polite and natural tone. ~320ms average latency on Apple M-series.
- Clause-boundary streaming: triggers TTS synthesis at clause boundaries (`,`, `;`, em-dash, en-dash, `-`) as well as sentence-end punctuation. Reduces perceived latency by ~100–200 ms by starting synthesis before the full sentence is generated. Up to 4 early-streaming chunks per response (`MAX_EARLY_STREAM_CHUNKS = 4`).
- Session isolation: each call gets its own `LlamaCall` struct with independent message history and KV cache sequence ID. Context cleared on `CALL_END`.
- Shut-up mechanism: `SPEECH_ACTIVE` from VAD aborts active generation immediately (~5–13ms interrupt latency). Worker loop defers new responses while speech is active.
- Tokenizer resilience: retries with progressively larger buffer (up to 4×) if `llama_tokenize()` returns a negative value
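Clause-boundary streaming can be illustrated with a small Python generator. This is a sketch of the idea rather than the service's C++ logic: the function name `stream_chunks` is hypothetical, token pieces are assumed to arrive as plain strings, and the bare hyphen is deliberately left out of the boundary set here so that hyphenated words are not split.

```python
CLAUSE_MARKS = {",", ";", "\u2014", "\u2013"}  # comma, semicolon, em dash, en dash
SENTENCE_END = {".", "?", "!"}
MAX_EARLY_STREAM_CHUNKS = 4


def stream_chunks(token_pieces):
    """Yield text chunks for TTS as soon as a clause or sentence boundary appears."""
    buf = ""
    early = 0
    for piece in token_pieces:
        buf += piece
        last = buf.rstrip()[-1:]
        if last in SENTENCE_END:
            # Sentence end: emit the rest and stop generating
            yield buf.strip()
            return
        if last in CLAUSE_MARKS and early < MAX_EARLY_STREAM_CHUNKS:
            # Early chunk: hand this clause to TTS while generation continues
            yield buf.strip()
            buf = ""
            early += 1
    if buf.strip():
        # EOS or token limit reached without sentence-ending punctuation
        yield buf.strip()
```

For example, the pieces `["Guten", " Tag", ",", " wie", " kann", " ich", " helfen", "?"]` produce an early chunk at the comma and a final chunk at the question mark, so synthesis of the first clause can begin before the sentence is complete.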
Command-Line Parameters:
| Argument | Default | Description |
|---|---|---|
| (positional) `<model_path>` | `models/Llama-3.2-1B-Instruct-Q8_0.gguf` | Path to GGUF model file (passed as the trailing positional argument; the directory can be overridden with the `WHISPERTALK_MODELS_DIR` env var) |
| `--rag-host <host>` / `-H <host>` | `127.0.0.1` | tomedo-crawl host for RAG context lookups |
| `--rag-port <port>` / `-P <port>` | `13181` | tomedo-crawl HTTP port |
| `--log-level <LEVEL>` / `-L <LEVEL>` | `INFO` | Log verbosity |

Runtime Commands (cmd port 13132):
| Command | Description |
|---|---|
| `PING` | Health check → `PONG` |
| `STATUS` | Model name, active calls, upstream/downstream state, speech active flag |
| `SET_LOG_LEVEL:<LEVEL>` | Change log verbosity without restart |
The TTS stage is a generic pipeline node that sits between LLaMA and OAP. It
owns the interconnect sockets (mgmt 13140, data 13141, cmd 13142) and a
dedicated engine dock on port 13143. Concrete TTS engines
(bin/kokoro-service, bin/neutts-service, bin/vits2-service, bin/matcha-service)
are not pipeline nodes — they are client processes that connect to the dock,
authenticate via a one-line JSON HELLO, and stream audio back through it.
Engine slot model (last-connect-wins). The dock holds at most one active engine. State transitions:
                 HELLO ok                  TCP close / SHUTDOWN ack
    [NO ENGINE] ─────────────▶ [ACTIVE=X] ─────────────────────▶ [NO ENGINE]
                                    │
                                    │ new engine Y completes HELLO ok
                                    ▼
                             [SWAPPING X→Y]
                                    │ dock sends CUSTOM SHUTDOWN to X;
                                    │ X closes TCP (≤ 2 s) or is force-closed
                                    ▼
                               [ACTIVE=Y]
When no engine is docked, LLaMA text frames are dropped with a
rate-limited WARN log; mgmt signals (CALL_END / SPEECH_ACTIVE /
SPEECH_IDLE) are still auto-forwarded to OAP. On every swap the dock
emits a CUSTOM FLUSH_TTS mgmt frame to OAP so residual PCM is
discarded before the new engine's audio arrives.
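The last-connect-wins slot behaviour can be modelled with a few lines of Python. This is a toy state model under assumptions, not the dock's code: the class `EngineDock` and its method names are hypothetical, messages are recorded in a list instead of being written to sockets, and the 2-second force-close timer is omitted.

```python
class EngineDock:
    def __init__(self):
        self.active = None  # name of the docked engine, or None
        self.sent = []      # (destination, message) pairs, in send order

    def on_hello_ok(self, name):
        """A new engine completed HELLO; it replaces any current engine."""
        previous = self.active
        if previous is not None:
            self.sent.append((previous, "CUSTOM SHUTDOWN"))
            # Swap: tell OAP to discard residual PCM from the old engine
            self.sent.append(("OAP", "CUSTOM FLUSH_TTS"))
        self.active = name

    def on_engine_closed(self, name):
        """TCP close from an engine; only clears the slot if it was active."""
        if self.active == name:
            self.active = None

    def on_llama_text(self, text):
        """Route a text frame; returns None when the frame is dropped."""
        if self.active is None:
            return None
        return (self.active, text)

    def on_mgmt(self, signal):
        """CALL_END / SPEECH_ACTIVE / SPEECH_IDLE always reach OAP."""
        self.sent.append(("OAP", signal))
```

A rejected HELLO never reaches `on_hello_ok` in this model, which is how the current active engine stays untouched.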
Engine dock protocol (TCP, loopback only):
- Engine connects to `127.0.0.1:13143`.
- Engine sends one-line JSON HELLO: `{"name":"kokoro","sample_rate":24000,"channels":1,"format":"f32le"}\n`.
- Dock replies `OK\n` (accepts) or `ERR <reason>\n` (rejects; the current active engine is untouched).
- After OK, frames are tag-prefixed: `0x01` = serialized data `Packet`, `0x02` = mgmt (`MgmtMsgType` + optional payload). The dock ferries LLaMA→engine text, engine→OAP audio, mgmt signals, and PING/PONG keepalives.
- On receipt of `CUSTOM SHUTDOWN` the engine process joins its workers, releases model handles, and calls `std::_Exit(0)`.
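The handshake side of this protocol is simple enough to sketch in Python. The helper names `build_hello`, `parse_reply`, and `frame_kind` are hypothetical; the HELLO fields and the `OK`/`ERR` replies follow the description above, while the layout of frames after the tag byte is not specified here and is therefore not modelled.

```python
import json

TAG_DATA = 0x01  # serialized data Packet
TAG_MGMT = 0x02  # mgmt message


def build_hello(name: str, sample_rate: int = 24000,
                channels: int = 1, fmt: str = "f32le") -> bytes:
    """One-line JSON HELLO, newline-terminated."""
    hello = {
        "name": name,
        "sample_rate": sample_rate,
        "channels": channels,
        "format": fmt,
    }
    return (json.dumps(hello, separators=(",", ":")) + "\n").encode("utf-8")


def parse_reply(line: bytes):
    """Return (accepted, reason) for the dock's one-line reply."""
    text = line.decode("utf-8").rstrip("\n")
    if text == "OK":
        return True, ""
    if text.startswith("ERR"):
        return False, text[3:].strip()
    raise ValueError("unexpected dock reply: %r" % text)


def frame_kind(tag: int) -> str:
    """Classify a post-handshake frame by its leading tag byte."""
    if tag == TAG_DATA:
        return "data"
    if tag == TAG_MGMT:
        return "mgmt"
    raise ValueError("unknown frame tag: 0x%02x" % tag)
```

An engine would send `build_hello("kokoro")` over a TCP connection to `127.0.0.1:13143`, read one line, and continue only if `parse_reply` reports acceptance.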
Runtime Commands (TTS dock cmd port 13142):
| Command | Description |
|---|---|
| `PING` | Health check → `PONG` |
| `STATUS` | `ACTIVE <engine-name>` when docked, `NONE` otherwise |
| `SET_LOG_LEVEL:<LEVEL>` | Change dock log verbosity without restart |
Text-to-speech using the Kokoro model. Receives text from the dock and streams 24kHz float32 PCM audio back through it. No PyTorch dependency — all inference via CoreML on Apple Neural Engine.
Phonemization pipeline:
- espeak-ng (via `libespeak-ng`) converts input text → IPA phoneme string. Language auto-detected (de/en-us) via `detect_german()`.
- Phoneme cache (LRU, 10,000 entries): avoids re-running espeak-ng for repeated phrases.
- KokoroVocab: greedy longest-match scan (up to 4 chars per token, UTF-8 aware) maps phonemes → int64 token IDs from `vocab.json`. Input padded to 512 tokens.
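The greedy longest-match scan can be sketched in Python, where strings are already sequences of code points. Several details are assumptions made for illustration: the function name `tokenize_phonemes`, a pad ID of 0, silently skipping symbols that are not in the vocabulary, and truncating over-long input.

```python
def tokenize_phonemes(phonemes: str, vocab: dict, max_len: int = 4,
                      pad_to: int = 512, pad_id: int = 0) -> list:
    """Map an IPA string to token IDs, always preferring the longest match."""
    ids = []
    i = 0
    while i < len(phonemes):
        # Try the longest candidate first, then shrink down to one character
        for n in range(min(max_len, len(phonemes) - i), 0, -1):
            piece = phonemes[i:i + n]
            if piece in vocab:
                ids.append(vocab[piece])
                i += n
                break
        else:
            i += 1  # assumption: unknown symbols are skipped
    ids = ids[:pad_to]  # assumption: over-long input is truncated
    return ids + [pad_id] * (pad_to - len(ids))
```

With a toy vocabulary `{"a": 1, "ab": 2, "b": 3, "ʃ": 4, "tʃ": 5}`, the input `"abtʃa"` tokenizes to `2, 5, 1` followed by padding, because `"ab"` and `"tʃ"` win over their single-character prefixes.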
Two-stage CoreML inference:
- Stage 1, duration model (`kokoro_duration.mlmodelc`): predicts per-phoneme durations, generates alignment tensors (`pred_dur`, `d`, `t_en`, `s`, `ref_s`). Style encoding from `<voice>_voice.bin` (256-dim reference embedding).
- Stage 1b, F0/N predictor (`kokoro_f0n_{3s,5s,10s}.mlmodelc`): three bucketed models (3s/5s/10s) predict fundamental frequency (`f0_pred`) and voicing (`n_pred`) from the duration model's `d` and `s` outputs. Bucket selected by utterance length. These condition the harmonic/noise excitation signal; without them, speech sounds hoarse/unvoiced.
- Stage 2, decoder (`decoder_variants/*.mlmodelc`): split decoder generates the audio waveform from alignment tensors + F0/N conditioning. All models run with `MLComputeUnitsAll` (ANE + GPU + CPU).
Audio output processing:
- `normalize_audio()`: scales to 0.90 peak ceiling (skips near-silent audio and already-normalized output)
- `apply_fade_in()`: 48-sample linear ramp at onset to prevent click artifacts
- Sends audio to OAP in 4800-sample chunks (200ms @ 24kHz) for smooth buffer filling
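The three output steps can be sketched in Python as below. The function names mirror the ones mentioned above, but the bodies are illustrative guesses: the silence floor (`1e-4`) and the tolerance for "already normalized" (`1e-3`) are assumed values, and `chunk_audio` is a hypothetical helper.

```python
def normalize_audio(samples, ceiling=0.90, silence_floor=1e-4, tolerance=1e-3):
    """Scale so the peak sits at the ceiling; leave silence and normalized audio alone."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak < silence_floor or abs(peak - ceiling) < tolerance:
        return list(samples)
    gain = ceiling / peak
    return [s * gain for s in samples]


def apply_fade_in(samples, ramp=48):
    """Linear ramp from 0 to full scale over the first `ramp` samples."""
    out = list(samples)
    n = min(ramp, len(out))
    for i in range(n):
        out[i] *= i / n
    return out


def chunk_audio(samples, size=4800):
    """Split into 4800-sample chunks (200 ms at 24 kHz); the last may be shorter."""
    return [samples[i:i + size] for i in range(0, len(samples), size)]
```

Skipping near-silent input matters because normalizing it would amplify background noise up to the 0.90 ceiling.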
SPEECH_ACTIVE handling: Abandons synthesis immediately if VAD signals caller speech. Synthesis threads are per-call, so multi-line calls synthesize in parallel.
Prosody state carryover: Each call's ref_s_out (256-dim style tensor) from chunk N is fed as ref_s input to chunk N+1's duration model. This preserves intonation continuity across synthesized chunks without any model re-export.
Command-Line Parameters:
| Argument | Default | Description |
|---|---|---|
| `--voice <NAME>` | `df_eva` | Voice to use (`df_eva`, `dm_bernd`) |
| `--g2p <auto\|neural\|espeak>` | `auto` | G2P backend: `neural` uses DeepPhonemizer CoreML (German), `espeak` forces espeak-ng, `auto` uses neural when available for German |
| `--log-level <LEVEL>` | `INFO` | Log verbosity |

Runtime Commands (Kokoro engine cmd port 13144):
| Command | Description |
|---|---|
| `PING` | Health check → `PONG` |
| `STATUS` | Active calls, dock connection state, current speed |
| `SET_SPEED:<0.5–2.0>` | Set synthesis speed (1.0 = normal, clamped to [0.5, 2.0]) |
| `GET_SPEED` | Query current speed |
| `TEST_SYNTH:<text>` | Synthesize text and return timing/peak/RMS stats (no audio output) |
| `BENCHMARK:<text>\|<N>` | Run N synthesis iterations; returns avg/p50/p95 latency and RTF |
| `SYNTH_WAV:<path>\|<text>` | Synthesize text and save to WAV file at `<path>` (relative paths only) |
| `SET_LOG_LEVEL:<LEVEL>` | Change log verbosity without restart |
On a CUSTOM SHUTDOWN frame from the dock the engine joins its synthesis workers and exits; it does not restart the pipeline.
Alternative TTS engine using the NeuTTS Nano German model. Like Kokoro it is a dock client — the dock's single engine slot means only one TTS engine serves traffic at a time, and starting a second engine transparently swaps it in (last-connect-wins).
Inference pipeline:
- espeak-ng converts input text → IPA phonemes (language `de`, with stress markers)
- Builds a NeuTTS prompt: `user: Convert the text to speech:<|TEXT_PROMPT_START|>{ref_phones} {phones}<|TEXT_PROMPT_END|>\nassistant:<|SPEECH_GENERATION_START|>{ref_codes}`
- Tokenize and feed to NeuTTS backbone (llama.cpp, Q8_0 GGUF) with temperature=1.0, top_k=50 autoregressive sampling
- Extract `<|speech_N|>` tokens as integer codec codes
- Stop at `<|SPEECH_GENERATION_END|>` or EOS
- Decode codes through NeuCodec CoreML decoder → 24kHz float32 PCM
Reference voice: Pre-computed codec codes (ref_codes.bin) and phonemized text (ref_text.txt) loaded at startup to define voice timbre and style.
Audio post-processing: Same as Kokoro — normalize to 0.90 peak, 48-sample fade-in.
Command-Line Parameters:
| Argument | Default | Description |
|---|---|---|
| `--log-level <LEVEL>` | `INFO` | Log verbosity |

Runtime Commands (NeuTTS engine cmd port 13174):
| Command | Description |
|---|---|
| `PING` | Health check → `PONG` |
| `STATUS` | Active calls, dock connection state |
| `TEST_SYNTH:<text>` | Synthesize and return timing stats |
| `SYNTH_WAV:<path>\|<text>` | Synthesize text to a WAV file at the given path |
| `SET_LOG_LEVEL:<LEVEL>` | Change log verbosity without restart |
On a CUSTOM SHUTDOWN frame from the dock the engine joins its synthesis workers and exits.
Alternative TTS engine built on Piper TTS via the libpiper C API. Uses ONNX Runtime internally — no CoreML required. Suitable for high-quality German synthesis with fast ISTFT-based vocoding.
Inference pipeline:
- `piper_create()` loads a Piper `.onnx` model + `.onnx.json` config from `$WHISPERTALK_MODELS_DIR/vits2-german/`
- `piper_synthesize_start()` / `piper_synthesize_next()` loop produces float32 PCM chunks
- PCM resampled to 24kHz if model sample rate differs; normalized and chunked to `kTTSMaxFrameSamples`
- Audio sent to the dock via `EngineClient::send_audio()` with 12-byte header per `tts-common.h`
Phonemization: Piper handles phonemization internally via its bundled espeak-ng. Optional neural G2P pre-phonemization for German when --g2p neural is set.
Command-Line Parameters:
| Argument | Default | Description |
|---|---|---|
| `--model-dir <DIR>` | `$WHISPERTALK_MODELS_DIR/vits2-german` | Directory containing `.onnx` + `.onnx.json` |
| `--voice <NAME>` | `default` | Voice filename base (e.g. `de_DE-thorsten-high`) |
| `--g2p <auto\|neural\|espeak>` | `auto` | G2P backend |
| `--log-level <LEVEL>` | `INFO` | Log verbosity |

Runtime Commands (VITS2 engine cmd port 13175):
| Command | Description |
|---|---|
| `PING` | Health check → `PONG` |
| `STATUS` | Active calls, dock connection state, model path |
| `TEST_SYNTH:<text>` | Synthesize and return timing stats |
| `SYNTH_WAV:<path>\|<text>` | Synthesize text to a WAV file at the given path |
| `SET_LOG_LEVEL:<LEVEL>` | Change log verbosity without restart |
Alternative TTS engine based on Matcha-TTS (flow-matching acoustic model) + HiFi-GAN vocoder, both exported to CoreML. The ODE flow is baked into a static CoreML graph (10 Euler steps unrolled at export time) — no iterative solver at runtime.
Inference pipeline (5 stages):
- Phonemize: espeak-ng or neural G2P → IPA; phoneme cache (LRU)
- Encoder (`matcha_encoder.mlmodelc`): text → latent acoustic sequence (512-token fixed input, ANE)
- Noise sample: Gaussian noise via per-call `std::mt19937` (deterministic, seeded per `call_id`)
- Baked ODE flow (`matcha_flow_{3s,5s,10s}.mlmodelc`): noise + latent → mel-spectrogram (bucket selected by utterance length, ANE)
- HiFi-GAN vocoder (`matcha_vocoder.mlmodelc`): mel → 24kHz float32 PCM (ANE)
Command-Line Parameters:
| Argument | Default | Description |
|---|---|---|
| `--model-dir <DIR>` | `$WHISPERTALK_MODELS_DIR/matcha-german/coreml` | CoreML model bundle directory |
| `--voice <NAME>` | `default` | Voice preset |
| `--g2p <auto\|neural\|espeak>` | `auto` | G2P backend |
| `--log-level <LEVEL>` | `INFO` | Log verbosity |

Runtime Commands (Matcha-TTS engine cmd port 13176):
| Command | Description |
|---|---|
| `PING` | Health check → `PONG` |
| `STATUS` | Active calls, dock connection state, model path |
| `TEST_SYNTH:<text>` | Synthesize and return timing stats |
| `SYNTH_WAV:<path>\|<text>` | Synthesize text to a WAV file at the given path |
| `SET_LOG_LEVEL:<LEVEL>` | Change log verbosity without restart |
Real-time speech-to-speech conversation using the Moshi model. Two-component architecture: moshi-service (C++) handles WebSocket transport, OGG/Opus encoding/decoding, and integration with the frontend pipeline; moshi-backend (Rust) runs the neural model inference on Metal GPU.
Architecture:
- `moshi-service` connects to `moshi-backend` via WebSocket (localhost, plain HTTP)
- Input audio: OGG/Opus encoded, sent as WebSocket binary frames
- Output: interleaved audio (OGG/Opus) and text tokens
- Multi-language support via backend pool (one backend per language, routed by `--backend-config` args)

Key fixes applied (via `patches/moshi-rust-metal-fixes.patch`):
- `matmul_dtype()` returns BF16 on Metal (30× speedup for Q8 models)
- Model loading uses BF16 dtype on Metal (not just CUDA)
- TLS removed for localhost backend (plain HTTP WebSocket)
- HuggingFace download skipped for local model files
- Enhanced logging in processing loop and OGG decoder
Command-Line Parameters (moshi-service):
| Argument | Default | Description |
|---|---|---|
| `--backend-config <lang>:<config>[:<binary>]` | (none) | Backend language, config JSON path, optional binary path |
| `--default-language <lang>` | `en` | Default language when no preference is specified |
| `--log-level <LEVEL>` | `INFO` | Log verbosity |
Converts 24kHz float32 PCM from the TTS stage into 160-byte G.711 μ-law frames for the SIP client. Maintains constant 20ms output cadence.
Signal chain (per call):
- DC blocking (first-order high-pass): α = 0.9947697 (~20Hz cutoff). Removes DC offset and LF rumble. Initialized with the first sample value to avoid onset click.
- Presence boost (optional, default OFF): High-shelf biquad IIR filter, +3dB shelf at 2500Hz (Audio EQ Cookbook, S=1). Adds air/clarity to the telephone band.
- Anti-aliasing FIR (63-tap, Hamming-windowed sinc): Cutoff 3400/12000 (normalized). ~43dB stopband attenuation. Coefficients computed once at startup, shared across all calls. Per-call `fir_history[31]` preserves filter state across chunks.
- 3:1 Decimation: Keep every 3rd filtered sample (24kHz → 8kHz).
- G.711 μ-law encode (ITU-T compliant): `ULAW_CLIP=32635`, `ULAW_BIAS=132`. Encodes int16 PCM to 8-bit μ-law byte.
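The last two stages of this chain (decimation and μ-law encoding) can be sketched in Python. The encoder follows the widely used G.711 μ-law compression routine with the clip and bias constants quoted above; the DC blocker, presence boost, and anti-aliasing FIR are omitted, so this sketch is only safe on already band-limited input. The helper name `encode_frame` is hypothetical.

```python
ULAW_CLIP = 32635
ULAW_BIAS = 132


def ulaw_encode(sample: int) -> int:
    """Compress one int16 PCM sample to an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), ULAW_CLIP) + ULAW_BIAS
    # Find the segment: position of the highest set bit among bits 14..7
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def encode_frame(pcm_24k: list) -> bytes:
    """3:1 decimate float PCM (24 kHz -> 8 kHz) and mu-law encode it."""
    decimated = pcm_24k[::3]  # keep every 3rd sample (filtering omitted here)
    out = bytearray()
    for s in decimated:
        clamped = max(-1.0, min(1.0, s))
        out.append(ulaw_encode(int(round(clamped * 32767))))
    return bytes(out)
```

A 20ms block of 480 samples at 24kHz therefore yields exactly one 160-byte G.711 frame, and digital silence encodes to `0xFF`.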
Output scheduler: Dedicated sender thread fires every 20ms using steady_clock. Sends exactly 160 bytes per tick. If the TTS buffer is empty (silence, no engine docked, or a FLUSH_TTS just drained residual PCM during an engine swap), sends 0xFF (μ-law silence) to maintain RTP clock continuity. Scheduler resync guard: if OS sleep/load spike causes >100ms drift, snaps next_tick to now instead of firing a burst of catch-up frames.
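The tick logic of the output scheduler can be sketched in Python as two pure functions, which keeps the timing rule easy to test. The names `advance_tick` and `next_frame` are hypothetical, and padding a partially filled frame with `0xFF` is an assumption that goes beyond the empty-buffer case described above.

```python
FRAME_SECONDS = 0.020      # one G.711 frame every 20 ms
MAX_DRIFT_SECONDS = 0.100  # resync threshold
FRAME_BYTES = 160
ULAW_SILENCE = 0xFF


def advance_tick(next_tick: float, now: float) -> float:
    """Schedule the following tick, snapping to `now` after a long stall."""
    next_tick += FRAME_SECONDS
    if now - next_tick > MAX_DRIFT_SECONDS:
        # Too far behind: resync instead of sending a burst of catch-up frames
        next_tick = now
    return next_tick


def next_frame(buffer: bytearray) -> bytes:
    """Pop up to 160 bytes from the TTS buffer and pad the rest with silence."""
    frame = bytes(buffer[:FRAME_BYTES])
    del buffer[:FRAME_BYTES]
    return frame + bytes([ULAW_SILENCE]) * (FRAME_BYTES - len(frame))
```

In a sender thread, `now` would come from a monotonic clock (for example `time.monotonic()`), the thread would sleep until `next_tick`, send `next_frame(buffer)`, and then call `advance_tick`.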
SPEECH_ACTIVE handling: Clears all per-call buffers and resets FIR/DC/biquad state immediately when VAD signals caller speech. A configurable sidetone guard (default 1500ms) suppresses flushes arriving shortly after new TTS audio — prevents echo from triggering a spurious flush.
WAV recording (optional): When enabled, records the 8kHz int16 PCM output per call. Written to disk on CALL_END.
Command-Line Parameters:
| Argument | Default | Description |
|---|---|---|
| `--save-wav-dir <dir>` | (disabled) | Enable WAV recording and set output directory |
| `--log-level <LEVEL>` | `INFO` | Log verbosity |

Runtime Commands (cmd port 13152):
| Command | Description |
|---|---|
| `PING` | Health check → `PONG` |
| `STATUS` | Active calls, buffer lengths, upstream/downstream state |
| `SAVE_WAV:ON` / `OFF` / `STATUS` | Toggle WAV recording |
| `SET_SAVE_WAV_DIR:<dir>` | Set WAV output directory |
| `PRESENCE_BOOST:ON` / `OFF` / `STATUS` | Toggle +3dB presence boost biquad |
| `SET_SIDETONE_GUARD_MS:<ms>` | Set SPEECH_ACTIVE guard window (default 800ms) |
| `TEST_ENCODE:<freq>\|<dur_ms>` | Generate sine wave, encode, measure μ-law RMS output |
| `SET_LOG_LEVEL:<LEVEL>` | Change log verbosity without restart |
Central control plane. Serves the web UI, manages service lifecycles, aggregates logs, and exposes all configuration via REST API.
Storage: SQLite database (whispertalk.db) — persists service configurations, log level settings, and test results.
Log aggregation: Each service sends structured log entries as UDP datagrams to port 22022. Frontend stores them in SQLite (ring-buffered in memory for fast /recent queries) and streams them live via SSE.
Full HTTP API (port 8080):
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/services |
List all managed services + status |
| POST | `/api/services/start` | Start a service `{name, args}` |
| POST | `/api/services/stop` | Stop a service `{name}` |
| POST | `/api/services/restart` | Restart a service `{name}` |
| GET/POST | `/api/services/config` | Read/write per-service config (persisted in SQLite) |
| GET | `/api/logs` | Paginated log query `{limit, offset, service, level}` |
| GET | `/api/logs/recent` | Last N entries from in-memory ring buffer |
| GET | `/api/logs/stream` | SSE live log stream |
| POST | `/api/settings/log_level` | Set per-service log level (propagated to running service immediately) |
| POST | `/api/db/query` | Execute SELECT query (read-only guard) |
| POST | `/api/db/write_mode` | Toggle write mode for unsafe queries |
| GET | `/api/db/schema` | Return SQLite schema |
| GET | `/api/whisper/models` | List available GGML model files in `models/` |
| POST | `/api/whisper/accuracy_test` | Run offline Whisper accuracy test on a WAV file |
| POST | `/api/whisper/hallucination_filter` | Enable/disable Whisper hallucination filter |
| GET | `/api/tts/status` | TTS-dock engine slot: `{"engine":"kokoro"}`, `{"engine":"neutts"}`, or `{"engine":null}` when no engine is docked |
| GET | `/api/tts/engine_config?engine=<name>` | Per-engine config: `{engine, voice, g2p_backend, language, disabled_reason}` |
| POST | `/api/tts/engine_config` | Persist engine config `{engine, voice, g2p_backend, language}`; restarts active engine asynchronously |
| GET | `/api/tts/available_voices?engine=<name>` | Voice list from `$WHISPERTALK_MODELS_DIR/<engine>-<lang>/` |
| GET | `/api/tts/available_g2p` | Available G2P backends: always `espeak`; adds `neural` if `de_g2p.mlmodelc` is present |
| GET/POST | `/api/vad/config` | Read/write VAD parameters (propagated to running service) |
| GET/POST | `/api/oap/wav_recording` | Read/write OAP WAV recording settings |
| POST | `/api/sip/add-line` | Register a new SIP account |
| POST | `/api/sip/remove-line` | Remove a SIP account |
| GET | `/api/sip/lines` | List registered SIP lines |
| GET | `/api/sip/stats` | RTP counters per active call |
| POST | `/api/iap/quality_test` | Offline G.711 codec round-trip quality test |
| GET | `/api/testfiles` | List WAV+TXT sample pairs in `Testfiles/` |
| POST | `/api/testfiles/scan` | Rescan `Testfiles/` directory |
| POST | `/api/tests/start` | Run a test binary |
| POST | `/api/tests/stop` | Kill a running test |
| GET | `/api/tests/*/history` | Test run history |
| GET | `/api/tests/*/log` | Test stdout/stderr |
| GET | `/api/test_results` | Pipeline WER test results |
| GET | `/api/status` | System uptime, service health summary |
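
As an illustration of the service-control endpoints in the table above, the following Python sketch builds a `POST /api/services/start` request without sending it. The JSON body shape, the service name `whisper`, and passing `args` as a single string are assumptions made for this example; only the endpoint paths and the `{name, args}` field names come from the table.

```python
import json
import urllib.request

FRONTEND_URL = "http://localhost:8080"  # web UI address from the quick start


def build_service_request(action, name, args=None):
    """Build (but do not send) a POST request for /api/services/<action>."""
    if action not in ("start", "stop", "restart"):
        raise ValueError(f"unsupported action: {action}")
    payload = {"name": name}
    if action == "start" and args is not None:
        payload["args"] = args  # assumption: args is one argument string
    return urllib.request.Request(
        f"{FRONTEND_URL}/api/services/{action}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "whisper" is a hypothetical service name used only for illustration.
req = build_service_request("start", "whisper", "--log-level DEBUG")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With the frontend running, the request could be sent with `urllib.request.urlopen(req, timeout=5)`; the response format is not covered here.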
Web UI features:
- Service management: start/stop/restart each pipeline service independently
- Real-time log streaming with per-service and per-level filtering
- Log level control: checkboxes (ERROR/WARN/INFO/DEBUG/TRACE) applied immediately and persisted
- VAD configuration: threshold, silence duration, max chunk length — runtime update without restart
- Whisper configuration: model selection, hallucination filter toggle
- Kokoro configuration: synthesis speed slider, SYNTH_WAV test, neural G2P and language selection
- VITS2 / Matcha configuration: voice, G2P backend, language dropdowns with Save; engines are greyed out when no compatible model is installed for the selected language
- OAP configuration: WAV recording toggle + directory, presence boost toggle
- SIP management: add/remove SIP lines, view RTP statistics
- Beta testing page: audio injection into live calls via Test SIP Provider
- Test infrastructure: ASR accuracy tests, pipeline WER tests, LLaMA quality tests, codec quality tests
- tomedo-crawl configuration: Tomedo server IP/port, mTLS certificate upload, Ollama subservice management, crawl schedule, vector store status
RAG sidecar that crawls a Tomedo EMR server and provides per-patient context to the LLaMA service.
Components:
- Tomedo crawler: fetches patient list, diagnoses, medications, appointments, and phone numbers via mutual TLS HTTPS.
- Vector store: hnswlib HNSW in-memory ANN index + SQLite persistence (encrypted with SQLCipher).
- Phone index: local SQLite table mapping digit-normalised phone numbers to patient IDs; enables sub-100 ms caller identification from a SIP phone number.
- Ollama client: calls `POST /api/embeddings` to generate float32 embeddings for each text chunk.
- HTTP API (port 13181): serves `/health`, `/query`, `/caller`, `/crawl/trigger`, `/ollama/*`, `/config`.
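
The phone index described above can be pictured with a small Python sketch. The table name, column names, and the "keep digits only" normalisation rule are assumptions for illustration; the real schema is documented in docs/tomedo-crawl.md and lives in the SQLCipher-encrypted database, whereas this example uses a plain in-memory SQLite database.

```python
import sqlite3


def normalise_phone(raw):
    """Keep only ASCII digits (assumed normalisation rule)."""
    return "".join(ch for ch in raw if ch in "0123456789")


# Hypothetical schema, held in memory for the demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE phone_index (phone TEXT PRIMARY KEY, patient_id INTEGER NOT NULL)"
)


def index_phone(raw, patient_id):
    """Store the normalised number so later lookups are a primary-key hit."""
    conn.execute(
        "INSERT OR REPLACE INTO phone_index VALUES (?, ?)",
        (normalise_phone(raw), patient_id),
    )


def lookup_caller(raw):
    """Return the patient ID for a caller number, or None if unknown."""
    row = conn.execute(
        "SELECT patient_id FROM phone_index WHERE phone = ?",
        (normalise_phone(raw),),
    ).fetchone()
    return row[0] if row else None


index_phone("+49 (30) 1234-567", 42)
print(lookup_caller("+49 30 1234567"))   # same digits, different formatting
print(lookup_caller("+49 30 0000000"))   # not indexed
```

Because the lookup is a single primary-key query on a local table, it avoids both the Tomedo server and the embedding path, which is consistent with the fast caller identification described above.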
Command-Line Parameters:
| Argument | Default | Description |
|---|---|---|
| `[db-path]` | `tomedo-crawl.db` | Path to the encrypted SQLite database |
| `--verbose` | off | Enable DEBUG log level |
| `--skip-initial-crawl` | off | Do not crawl at startup |
| `--phone-only` | off | Update phone index only, skip embeddings |
| `--no-embed` | off | Index phone numbers without generating embeddings |
| `--top-k N` | `3` | Default result count for `/query` |
| `--chunk-size N` | `512` | Text chunk size in estimated tokens |
| `--overlap N` | `64` | Token overlap between consecutive chunks |
| `--workers N` | `4` | Embedding worker thread count |
See docs/tomedo-crawl.md for the full API reference, database schema, Tomedo API details, and security model.
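
To make the interaction of `--chunk-size` and `--overlap` concrete, here is a minimal Python sketch of overlapping chunking. It operates on a pre-split token list; how tomedo-crawl actually estimates tokens is not specified here, so the token list and the function name `chunk_tokens` are assumptions.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into chunks that share `overlap` tokens with their neighbour."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # each chunk starts this many tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final chunk already reaches the end of the input
    return chunks


tokens = [f"t{i}" for i in range(1000)]  # stand-in for an estimated token stream
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
print(chunks[0][-64:] == chunks[1][:64])  # neighbouring chunks share 64 tokens
```

With the defaults, each chunk begins 448 tokens after the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.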
All inter-service communication uses interconnect.h (a shared header, no external library):
- Management channel (base port +0): Typed control messages: `CALL_START`, `CALL_END`, `SPEECH_ACTIVE`, `SPEECH_IDLE`, `PING`/`PONG`
- Data channel (base port +1): Binary `Packet` frames, variable-length payloads tagged with `call_id` and `PacketType` (audio PCM, text, G.711)
- Command port (base port +2): TCP command interface, one connection per request, 10s recv timeout
- TCP_NODELAY: Enabled on all connections for minimum latency
- Auto-reconnect: Downstream connections retry every 200ms until reachable; upstream server accepts reconnections at any time
- LogForwarder: Sends structured log entries as UDP datagrams to `FRONTEND_LOG_PORT` (22022)
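
The port layout and reconnect behaviour listed above can be sketched in Python as follows. This is not the `interconnect.h` implementation (which is C++); the function names, the base port 13000, and the `max_attempts` parameter are hypothetical. The demo connects to a throwaway local listener rather than a real service.

```python
import socket
import time


def service_ports(base_port):
    """Derive the three per-service ports from the base port."""
    return {"mgmt": base_port, "data": base_port + 1, "cmd": base_port + 2}


def connect_with_retry(host, port, retry_interval=0.2, max_attempts=None):
    """Retry every 200 ms until the peer is reachable, then enable TCP_NODELAY."""
    attempt = 0
    while True:
        attempt += 1
        try:
            sock = socket.create_connection((host, port), timeout=1.0)
        except OSError:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            time.sleep(retry_interval)
            continue
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        sock.settimeout(None)  # back to blocking mode for normal use
        return sock


# Demo: a local listener on an OS-assigned port stands in for an upstream service.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
sock = connect_with_retry("127.0.0.1", listener.getsockname()[1], max_attempts=5)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print(service_ports(13000), bool(nodelay))
sock.close()
listener.close()
```

Leaving `max_attempts` as `None` matches the "retry until reachable" behaviour; the bound exists only so the demo cannot hang.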
Every service supports 5 levels: ERROR, WARN, INFO, DEBUG, TRACE.
Three ways to set log level:
- Startup argument: `--log-level DEBUG`
- Frontend UI: Log level checkboxes, applied immediately to the running service and persisted in SQLite for restarts
- Direct command: Send `SET_LOG_LEVEL:DEBUG` to the service's cmd port via TCP
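
The direct-command route can be sketched in Python as below. The message framing (raw UTF-8 text with no terminator) and the `OK` reply are assumptions; a stand-in server thread plays the role of a service's command port so the example runs on its own.

```python
import socket
import threading


def send_command(host, cmd_port, command, timeout=10.0):
    """Open one TCP connection, send a single command, and return the reply text."""
    with socket.create_connection((host, cmd_port), timeout=timeout) as sock:
        sock.sendall(command.encode("utf-8"))  # assumption: no terminator needed
        return sock.recv(4096).decode("utf-8", errors="replace")


def _fake_service(listener, received):
    """Stand-in for a service's command port: read one command, reply OK."""
    conn, _ = listener.accept()
    with conn:
        received.append(conn.recv(1024).decode("utf-8"))
        conn.sendall(b"OK")  # hypothetical reply


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
received = []
worker = threading.Thread(target=_fake_service, args=(listener, received), daemon=True)
worker.start()

reply = send_command("127.0.0.1", listener.getsockname()[1], "SET_LOG_LEVEL:DEBUG")
worker.join(timeout=2)
listener.close()
print(received, reply)
```

Against a real service, `cmd_port` would be that service's base port +2.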
python3 tests/run_pipeline_test.py <MODEL_NAME> [TESTFILES_DIR]

Injects WAV samples through the full pipeline via Test SIP Provider, collects Whisper transcriptions from the frontend log API, and computes character-level similarity against ground truth.
- PASS: ≥ 99.5% similarity
- WARN: ≥ 90% similarity
- FAIL: < 90% similarity
Test samples: Testfiles/sample_NN.wav + sample_NN.txt pairs.
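
The PASS/WARN/FAIL thresholds above can be illustrated with a short Python sketch. The similarity metric here is `difflib.SequenceMatcher.ratio()`, chosen as a stand-in; the metric that `run_pipeline_test.py` actually uses may differ, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity_percent(reference, hypothesis):
    """Character-level similarity in percent (stand-in metric)."""
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    return 100.0 * matcher.ratio()


def classify(similarity):
    """Map a similarity percentage onto the documented thresholds."""
    if similarity >= 99.5:
        return "PASS"
    if similarity >= 90.0:
        return "WARN"
    return "FAIL"


reference = "the quick brown fox"
hypothesis = "the quick brown fax"  # one substituted character
score = similarity_percent(reference, hypothesis)
print(f"{score:.1f}% -> {classify(score)}")
```

Note how strict the PASS threshold is: a single wrong character in a short sentence is already enough to drop a sample into WARN.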
B2BUA test tool that injects audio files into the pipeline as if they were real phone calls. Supports WAV recording of both legs of each conference call.
./test_sip_provider --port 5060 --http-port 22011 --testfiles-dir Testfiles

HTTP API (port 22011):
| Method | Endpoint | Description |
|---|---|---|
| POST | `/conference` | Create a test call with optional audio injection |
| POST | `/hangup` | Hang up a call |
| GET | `/calls` | List active calls |
| GET/POST | `/wav_recording` | Read/write WAV recording settings |
| POST | `/inject` | Inject an audio file into a call leg |
End-to-end diagnostic script:
python3 tests/run_stage7.py [--iterations N]

Starts all services, connects test calls, enables WAV recording, injects samples, collects logs, and saves WAV files from both OAP and Test SIP Provider. Produces stage7_output/run_N/ directories with pipeline.log, oap_call_*.wav, and tsp_call_*.wav.
./runmetobuildeverything
cd build && ctest --output-on-failure

Tests are built by default (BUILD_TESTS=ON). To skip them: ./runmetobuildeverything --no-tests
Test binaries: test_sanity, test_interconnect, test_kokoro_cpp, test_integration.
Hardware: Apple M4, macOS 25.2.0
whisper.cpp: v1.8.3 (CoreML + Metal)
All model/backend combinations achieved 5/5 perfect transcription on clean input.
| Model | Size | Backend | Avg Time |
|---|---|---|---|
| large-v3 | 2.9 GB | CoreML + ANE | ~2580ms |
| large-v3-q5_0 | 1.0 GB | CoreML + ANE | ~2075ms |
| large-v3-turbo | 1.5 GB | CoreML + ANE | ~1575ms |
| large-v3-turbo-q5_0 | 547 MB | CoreML + ANE | ~1060ms |
Audio path: WAV → 8kHz μ-law G.711 → RTP → SIP Client → IAP (8→16kHz) → Whisper
| Model | Size | Backend | PASS | WARN | FAIL | Avg ms |
|---|---|---|---|---|---|---|
| large-v3 | 2.9 GB | Metal | 12 | 8 | 0 | 1627 |
| large-v3 | 2.9 GB | CoreML | 11 | 8 | 1* | 1301 |
| large-v3-q5_0 | 1.0 GB | Metal | 11 | 9 | 0 | 1789 |
| large-v3-turbo | 1.5 GB | CoreML | 9 | 10 | 1* | 688 |
| large-v3-turbo-q5_0 | 547 MB | CoreML | 8 | 11 | 1** | 686 |
* CoreML warmup caused first-inference timeout
** Sample_01 failed (41.8%) due to CoreML warmup causing VAD to miss first half
Scoring: PASS = ≥99.5% similarity, WARN = ≥90%, FAIL = <90%
- Accuracy priority: `large-v3` + Metal (1627ms avg, no warmup delay, best accuracy)
- Speed priority: `large-v3-turbo` + CoreML (688ms avg after initial warmup)
- Quantization (q5_0): negligible accuracy impact vs. full-precision models
- CoreML warmup: 20–35s first-inference compilation cost, one-time per service lifetime
- Turbo trade-off: ~2× faster, slightly more WARN results. 4-layer decoder occasionally misses nuances in G.711-degraded audio.
- Consistent failures: some samples fail across all model configs due to G.711 codec artifacts, not model limitations
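
To give a feel for the G.711 degradation mentioned above, the following Python sketch implements the widely used reference-style μ-law encoder and decoder for single 16-bit PCM samples. It is an illustration of the codec's quantization only; it is not taken from the IAP or SIP client code, and it leaves out the 8 kHz resampling step of the audio path.

```python
BIAS = 0x84    # 132, added before the segment search
CLIP = 32635   # largest magnitude that still fits after adding BIAS


def linear_to_ulaw(sample):
    """Encode one signed 16-bit PCM sample as an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def ulaw_to_linear(byte):
    """Decode an 8-bit mu-law byte back to a signed PCM sample."""
    byte = ~byte & 0xFF
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if byte & 0x80 else magnitude


for pcm in (0, 1000, -1000, 32767):
    decoded = ulaw_to_linear(linear_to_ulaw(pcm))
    print(pcm, "->", decoded, "error", decoded - pcm)
```

Each sample keeps only a sign bit, a 3-bit segment, and a 4-bit mantissa, so the quantization step grows with amplitude. Together with the 8 kHz sample rate, this is the kind of loss the Whisper models have to cope with in the pipeline tests.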