Therapeutic AI Companion
We built a multimodal mental-health companion with a 3D avatar, grounded in CBT and DBT, on the OpenAI Realtime API for speech-to-speech. Llama Guard 4 and ShieldGemma check every turn, 988 / SAMHSA crisis routing handles risk, and the stack runs on HIPAA-aligned infrastructure. Session completion is 68% against a 31% industry average.
We shipped a therapeutic AI companion with a 3D avatar and CBT/DBT-grounded flows. 68% of sessions complete against a 31% industry average, and the average session runs about 12 minutes.
The OpenAI Realtime API handles speech-to-speech in one hop, with Cartesia Sonic and ElevenLabs Conversational AI as a redundancy path. WebRTC carries the audio, and a Three.js WebGPU avatar holds 60fps lip-sync while voice-to-response latency stays under 200 ms. Llama Guard 4, ShieldGemma and NeMo Guardrails sit on both input and output, and a 988 / SAMHSA crisis routing path is wired in for risk. The experience was co-designed with clinical advisors from the start rather than retrofitted, and it runs on HIPAA-aligned infrastructure.
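The redundancy path described above can be pictured as an ordered failover. The following is a minimal sketch under stated assumptions, not the production code: `VoiceProvider`, `TurnResult` and `respond_with_failover` are hypothetical names, each provider is modelled as a plain function from input audio to reply audio, and no real vendor SDK is called. A real speech-to-speech session is streaming; this request/response version only illustrates the ordering and the 200 ms budget check.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a provider turns the user's audio into reply audio.
# Real clients would be wrapped in an adapter to fit this shape.
VoiceProvider = Callable[[bytes], bytes]

LATENCY_BUDGET_MS = 200.0  # the voice-to-response target quoted above


@dataclass
class TurnResult:
    provider: str        # name of the provider that produced the reply
    audio: bytes         # reply audio
    elapsed_ms: float    # time spent inside the successful provider call
    within_budget: bool  # True if elapsed_ms is under LATENCY_BUDGET_MS


def respond_with_failover(
    audio_in: bytes,
    providers: Sequence[Tuple[str, VoiceProvider]],
) -> TurnResult:
    """Try each provider in order and return the first reply that succeeds."""
    errors = []
    for name, provider in providers:
        start = time.monotonic()
        try:
            audio_out = provider(audio_in)
        except Exception as exc:  # any provider failure hands over to the next one
            errors.append(f"{name}: {exc}")
            continue
        elapsed_ms = (time.monotonic() - start) * 1000.0
        return TurnResult(name, audio_out, elapsed_ms, elapsed_ms < LATENCY_BUDGET_MS)
    raise RuntimeError("all voice providers failed: " + "; ".join(errors))
```

Listing the primary provider first and the backups after it keeps the failover policy in one place, and the `within_budget` flag lets a caller log turns that answered but missed the latency target.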
AI Delivery Approach
- CBT and DBT on the prompt layer: We mapped therapeutic flows into structured prompts, then had clinical advisors red-team them before a single user session went live.
- Speech-to-speech in one hop: We replaced the legacy STT → LLM → TTS chain with the OpenAI Realtime API for sub-200 ms turn latency, and kept Cartesia Sonic + ElevenLabs Conversational AI as a redundancy path. Avatar lip-sync runs against the same audio stream so the user doesn’t feel the seams.
- Safety as a first-class path: Llama Guard 4, ShieldGemma and NeMo Guardrails on every turn, plus a 988 / SAMHSA crisis routing rule that escalates to a human counsellor at the first risk signal. The model never improvises around a crisis cue.
- Test with the actual users: We ran guided sessions with target users and iterated on tone, pacing and avatar expressiveness before opening the doors.
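The safety path in the list above (checks on every turn, escalation at the first risk signal, no model improvisation on a crisis cue) could be wired roughly as follows. This is an illustrative sketch only: `Verdict`, `guarded_turn`, the classifier interface and both reply strings are assumptions made for the example, and the real guardrail models would sit behind the hypothetical `Classifier` callables.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence


class Verdict(Enum):
    SAFE = 0
    UNSAFE = 1
    CRISIS = 2


# Hypothetical interfaces: each classifier wraps one guardrail model or rule set.
Classifier = Callable[[str], Verdict]
Generator = Callable[[str], str]
Escalator = Callable[[str], None]

# Placeholder wording; real copy would come from the clinical advisors.
CRISIS_REPLY = (
    "I'm connecting you with a human counsellor now. "
    "In the US you can also call or text 988."
)
FALLBACK_REPLY = "I can't help with that, but I'm here to keep talking."


@dataclass
class TurnOutcome:
    reply: str
    escalated: bool


def worst_verdict(text: str, classifiers: Sequence[Classifier]) -> Verdict:
    """Run every classifier and keep the most severe verdict."""
    verdicts = [classify(text) for classify in classifiers]
    return max(verdicts, key=lambda v: v.value, default=Verdict.SAFE)


def guarded_turn(
    user_text: str,
    generate: Generator,
    classifiers: Sequence[Classifier],
    escalate: Escalator,
) -> TurnOutcome:
    """Check the input, generate only if it is safe, then check the output."""
    incoming = worst_verdict(user_text, classifiers)
    if incoming is Verdict.CRISIS:
        # Crisis cue: hand off to a human and skip the model entirely.
        escalate(user_text)
        return TurnOutcome(CRISIS_REPLY, escalated=True)
    if incoming is Verdict.UNSAFE:
        return TurnOutcome(FALLBACK_REPLY, escalated=False)

    draft = generate(user_text)
    # Any non-safe draft is replaced by the fixed fallback reply.
    if worst_verdict(draft, classifiers) is not Verdict.SAFE:
        return TurnOutcome(FALLBACK_REPLY, escalated=False)
    return TurnOutcome(draft, escalated=False)
```

Returning a fixed reply on the crisis branch, before `generate` is ever called, is what keeps the model from improvising around a crisis cue in this sketch.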
What was actually hard
Therapeutic AI has to feel warm and be safe, and the two pull against each other. A model that’s too cautious sounds robotic and users drop off; a model that’s too fluent can drift into unsafe advice. We had to hold CBT and DBT structure, keep latency low enough for a real conversation, and route every risk signal to a safe path without breaking the session.

Project Outcome
68% of users now finish a session, against a 31% industry average, and satisfaction settled at 4.7 out of 5. The avatar plus voice flow is what users actually point to in feedback — they stay because it feels like a conversation, not a form.
4.7/5 satisfaction
68% session completion
<200 ms voice-to-response latency
60fps avatar lip-sync


“They built a companion that feels genuinely supportive — not robotic. The 3D avatar and voice integration turned a concept into something users kept coming back to.”
Dr. Aarav P.
Chief Product Officer — Digital Health Startup



