Why I Built It
I needed text-to-speech for my own SaaS projects. I looked at ElevenLabs and OpenAI TTS, ran the numbers, and immediately started looking for
alternatives. At scale — even modest scale — those prices compound fast.
ElevenLabs charges around $0.30/1K characters on their lowest paid tier. OpenAI TTS is roughly $0.015/1K characters for the standard model.
Neither felt justified when I knew the actual cost of running inference on a GPU I already owned.
So I built Audexum (https://audexum.com) — a TTS REST API with 43 voices across 33 languages, priced at what I think the market actually supports
rather than what VC-backed companies need to charge to cover their runway.
This is a write-up of how I built it, what surprised me, and what I'd do differently.
The Tech Stack
The backend is FastAPI with SQLAlchemy async over PostgreSQL. I chose async SQLAlchemy because the API has a mix of quick metadata queries and
longer-running inference jobs, and I wanted a single event loop handling both without threads. The async driver (asyncpg under the hood) makes
connection pooling significantly simpler to reason about.
Caddy handles TLS termination. I use Cloudflare DNS-01 challenge for certificate issuance, which means port 80 never needs to be open. The server
sits behind a firewall with only 443 reachable publicly — DNS-01 validation happens entirely through Cloudflare's API. If you're running a home
server or a VPS with port 80 blocked by your host, this is the correct approach.
The frontend is React + Vite. Nothing unusual there — it's a documentation site, a dashboard, and a playground. Vite's dev server proxy makes
local development against a FastAPI backend painless.
Stripe handles billing. Resend handles transactional email (signup confirmation, API key delivery, usage alerts). Both have decent Python SDKs and
webhook support that works reliably.
The TTS model itself is Supertonic-3 running as an ONNX graph on CUDA. The model weights take around 1.2 GB of VRAM, leaving headroom on an RTX
3090 (24 GB) for batching and model warm-up. ONNX inference on GPU is typically faster than PyTorch eager mode for fixed-architecture models, and it
sidesteps torch version compatibility headaches in production.
API authentication uses Bearer tokens with sk_live_ prefixed keys — the same convention most developers already know from Stripe and OpenAI. Less
cognitive load when integrating.
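As a rough illustration of the key convention above, here is a minimal standard-library sketch of issuing and checking sk_live_ keys. The helper names (generate_api_key, hash_api_key, extract_bearer_key, is_valid_key) and the choice to store only a SHA-256 digest are assumptions for illustration, not a description of how Audexum actually handles keys.

```python
import hashlib
import hmac
import secrets
from typing import Optional

KEY_PREFIX = "sk_live_"  # prefix from the post; everything else here is hypothetical


def generate_api_key() -> str:
    """Return a new random key: the prefix plus 43 URL-safe characters."""
    return KEY_PREFIX + secrets.token_urlsafe(32)


def hash_api_key(key: str) -> str:
    """Digest to store server-side so the raw key never sits in the database."""
    return hashlib.sha256(key.encode()).hexdigest()


def extract_bearer_key(authorization_header: Optional[str]) -> Optional[str]:
    """Pull the key out of an 'Authorization: Bearer <key>' header value."""
    if not authorization_header:
        return None
    scheme, _, token = authorization_header.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token.startswith(KEY_PREFIX):
        return None
    return token


def is_valid_key(presented: str, stored_digest: str) -> bool:
    """Compare digests in constant time to avoid leaking match positions."""
    return hmac.compare_digest(hash_api_key(presented), stored_digest)
```

In a FastAPI app, something like extract_bearer_key would typically sit in a dependency that reads the Authorization header and looks the digest up in PostgreSQL.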
Output is WAV. I added AI Act Article 50 compliant watermarking in the WAV metadata — a tamper-evident signal embedded in the file header. It's a
legal requirement for AI-generated audio in the EU and takes about 15 lines of code to implement correctly.
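To make the metadata idea concrete, here is a minimal sketch that appends a RIFF LIST/INFO chunk to an in-memory WAV file using only the standard library. The function names, the choice of the ICMT (comment) field, and the label text are all assumptions; the post does not describe Audexum's actual marker, and a plain metadata tag like this one is not tamper-evident on its own.

```python
import io
import struct
import wave


def make_silent_wav(n_frames: int = 1600, rate: int = 16000) -> bytes:
    """Build a tiny mono 16-bit WAV in memory so there is something to tag."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


def add_info_comment(wav_bytes: bytes, comment: str) -> bytes:
    """Append a LIST/INFO chunk holding an ICMT field and fix the RIFF size."""
    if wav_bytes[:4] != b"RIFF" or wav_bytes[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    text = comment.encode("utf-8") + b"\x00"  # INFO strings are NUL-terminated
    if len(text) % 2:                          # keep the chunk length even
        text += b"\x00"
    icmt = b"ICMT" + struct.pack("<I", len(text)) + text
    list_chunk = b"LIST" + struct.pack("<I", 4 + len(icmt)) + b"INFO" + icmt
    body = wav_bytes[8:] + list_chunk          # everything after the 8-byte RIFF header
    return b"RIFF" + struct.pack("<I", len(body)) + body


# Hypothetical label text; the real wording would follow legal advice.
tagged = add_info_comment(make_silent_wav(), "AI-generated audio")
```

The audio samples are untouched, so players that ignore unknown chunks still read the file normally.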
Pricing Math: One RTX 3090 Serving Hundreds of Users
Here are the actual numbers.
An RTX 3090 pulls around 350W under full inference load. At €0.12/kWh (European average), that's:
350W × 24h = 8.4 kWh/day × €0.12 = ~€1.01/day in electricity
Supertonic-3 generates roughly 80–120 characters of speech per second on the 3090. Call it 100 chars/sec, the midpoint, as a working average across
voice types.
100 chars/sec × 3600 sec/hr × 24 hr = 8,640,000 chars/day theoretical max
At the Scale plan price (€30 per 2M characters), that theoretical daily throughput is worth:
8.64M chars / 2M × €30 = €129.60/day in revenue at 100% utilization
Nobody runs at 100% utilization. Real-world API traffic is spiky and bursty. But even at 10% utilization:
€12.96/day revenue vs €1.01/day electricity cost = profitable at low single-digit utilization
The actual server cost (amortized hardware + hosting) matters more than electricity at this scale. But the point is: a single consumer GPU can
serve hundreds of paying users on realistic usage patterns.
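The arithmetic above is easy to re-run with your own power price or throughput. The sketch below uses only the figures quoted in this section; the function names are made up for illustration, and hardware amortization and hosting are deliberately left out.

```python
# Inputs taken from the figures quoted above.
GPU_WATTS = 350
EUR_PER_KWH = 0.12
CHARS_PER_SEC = 100
SCALE_PLAN_EUR = 30
SCALE_PLAN_CHARS = 2_000_000


def daily_electricity_eur() -> float:
    """Cost of running the GPU at full load for 24 hours."""
    kwh_per_day = GPU_WATTS * 24 / 1000
    return kwh_per_day * EUR_PER_KWH


def daily_revenue_eur(utilization: float) -> float:
    """Revenue if `utilization` (0..1) of daily capacity sells at the Scale rate."""
    chars_per_day = CHARS_PER_SEC * 3600 * 24
    return chars_per_day * utilization / SCALE_PLAN_CHARS * SCALE_PLAN_EUR


def break_even_utilization() -> float:
    """Utilization at which revenue just covers electricity (hardware excluded)."""
    return daily_electricity_eur() / daily_revenue_eur(1.0)


print(f"electricity:            EUR {daily_electricity_eur():.2f}/day")
print(f"revenue at 100%:        EUR {daily_revenue_eur(1.0):.2f}/day")
print(f"revenue at 10%:         EUR {daily_revenue_eur(0.10):.2f}/day")
print(f"electricity break-even: {break_even_utilization():.2%} utilization")
```

With these inputs the electricity break-even lands below 1% utilization, which is why the amortized hardware and hosting costs are the numbers that really matter.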
┌─────────┬────────┬────────────┬──────────────┐
│ Plan │ Price │ Characters │ Per 1M chars │
├─────────┼────────┼────────────┼──────────────┤
│ Free │ €0 │ 10K/mo │ — │
├─────────┼────────┼────────────┼──────────────┤
│ Starter │ €4/mo │ 100K/mo │ €40 │
├─────────┼────────┼────────────┼──────────────┤
│ Pro │ €12/mo │ 500K/mo │ €24 │
├─────────┼────────┼────────────┼──────────────┤
│ Scale │ €30/mo │ 2M/mo │ €15 │
├─────────┼────────┼────────────┼──────────────┤
│ PAYG │ €3 │ 1M │ €3 │
└─────────┴────────┴────────────┴──────────────┘
For comparison, ElevenLabs' equivalent tier runs ~€60+ for 500K characters. OpenAI TTS is cheaper than ElevenLabs, but at roughly $15 per 1M
characters it is only about level with the Scale plan rate and still about 5× the PAYG rate here.
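The "Per 1M chars" column can be reproduced from the plan prices and quotas. This is a small sketch with the table values copied in by hand; the dictionary layout and function name are illustrative only.

```python
# name: (price in EUR, included characters), copied from the pricing table
PLANS = {
    "Starter": (4, 100_000),
    "Pro": (12, 500_000),
    "Scale": (30, 2_000_000),
    "PAYG": (3, 1_000_000),
}


def eur_per_million(price_eur: float, chars: int) -> float:
    """Normalize a plan to its price per one million characters."""
    return price_eur / chars * 1_000_000


rates = {name: eur_per_million(*plan) for name, plan in PLANS.items()}
for name, rate in rates.items():
    print(f"{name}: EUR {rate:.0f} per 1M chars")
```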
Three Things That Bit Me
- Passlib + bcrypt 5.x Is Broken — Use bcrypt Directly
I started with Passlib for password hashing because it's the standard FastAPI recommendation. It works fine until bcrypt releases a major version.
Passlib hasn't kept up with bcrypt's API changes, and the result is silent failures or cryptic AttributeError exceptions at runtime.
The fix: drop Passlib entirely and call bcrypt directly.
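import bcrypt

def hash_password(password: str) -> str:
    # bcrypt only considers the first 72 bytes of the password.
    return bcrypt.hashpw(password.encode(), bcrypt.gensalt()).decode()

def verify_password(plain: str, hashed: str) -> bool:
    # checkpw re-hashes with the salt stored in `hashed` and compares.
    return bcrypt.checkpw(plain.encode(), hashed.encode())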
Fewer dependencies, no version mismatch risk, same security. If you're starting a new FastAPI project today, skip Passlib and go straight to
bcrypt.
- Stripe EUR Amounts in Test Mode Round Differently Than You Expect
Stripe stores amounts as integers in the smallest currency unit (cents for EUR). When you create a price object programmatically in test mode,
floating point creeps in if you're not careful.
# Wrong: the float product is not guaranteed to be a whole number
amount_cents = plan_price_eur * 100  # e.g. 19.99 * 100 = 1998.9999999999998
# Correct: round to the nearest whole cent, which also yields an int
amount_cents = round(plan_price_eur * 100)
I hit this while creating prices in test mode. Stripe expects an integer amount in the smallest currency unit, so always use round() before
passing amounts to Stripe.
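An alternative that avoids binary floats altogether is to keep prices as decimal strings and convert them with the decimal module. The sketch below is one way this could be structured, not the code Audexum runs; eur_to_cents is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP


def eur_to_cents(price_eur: str) -> int:
    """Convert a decimal string such as "19.99" to integer cents."""
    cents = (Decimal(price_eur) * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return int(cents)


# The float pitfall, for contrast: truncating the float product loses a cent.
float_cents = int(19.99 * 100)

print(eur_to_cents("19.99"), float_cents)
```

The design choice here is to never let a price exist as a float: it enters as a string, stays a Decimal, and leaves as an int.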
- asyncio.Semaphore(1) Is the Correct GPU Concurrency Fix
Concurrent ONNX inference on a single CUDA device was not safe in my setup. If two requests hit the inference endpoint simultaneously, you get a
CUDA OOM or a CUDA illegal memory access, both of which crash the process.
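import asyncio

gpu_semaphore = asyncio.Semaphore(1)

async def synthesize(text: str, voice_id: str) -> bytes:
    # Only one request holds the GPU at a time; the rest wait here.
    async with gpu_semaphore:
        # run_in_executor is a method of the event loop, not a free function.
        loop = asyncio.get_running_loop()
        audio = await loop.run_in_executor(None, model.run, text, voice_id)
    return audio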
Semaphore(1) means only one inference runs at a time. Requests queue behind it. For a single-GPU server this is correct behavior — inference is
fast enough that queue wait times are low.
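The queueing behaviour can be checked without a GPU. The sketch below swaps the model for a hypothetical fake_inference function that sleeps briefly and records how many calls overlap; all names in it are illustrative and none of this is Audexum's production code.

```python
import asyncio
import threading
import time

active = 0       # fake inferences currently running
max_active = 0   # highest value `active` ever reached
counter_lock = threading.Lock()


def fake_inference(text: str) -> bytes:
    """Stand-in for a blocking model call; runs in a worker thread."""
    global active, max_active
    with counter_lock:
        active += 1
        max_active = max(max_active, active)
    time.sleep(0.01)  # pretend to do GPU work
    with counter_lock:
        active -= 1
    return text.encode()


async def synthesize(text: str, gpu_semaphore: asyncio.Semaphore) -> bytes:
    async with gpu_semaphore:  # at most one caller gets past this line at a time
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, fake_inference, text)


async def main() -> list:
    gpu_semaphore = asyncio.Semaphore(1)  # created inside the running loop
    texts = [f"request {i}" for i in range(5)]
    return await asyncio.gather(*(synthesize(t, gpu_semaphore) for t in texts))


def run_demo() -> int:
    """Fire five concurrent requests and return the peak concurrency seen."""
    global max_active
    max_active = 0
    asyncio.run(main())
    return max_active
```

Calling run_demo() should report a peak of 1. Raising the semaphore to 2 would let two fake inferences overlap, which is the situation the real endpoint needs to avoid.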
Calling the API
curl:
curl -X POST https://audexum.com/api/tts \
-H "Authorization: Bearer sk_live_xxx" \
-H "Content-Type: application/json" \
-d '{"text": "Hello world", "voice_id": "af_heart", "format": "wav"}' \
--output output.wav
Python:
import requests

response = requests.post(
    "https://audexum.com/api/tts",
    headers={"Authorization": "Bearer sk_live_xxx"},
    json={"text": "Hello world", "voice_id": "af_heart", "format": "wav"},
    timeout=30,
)
response.raise_for_status()  # fail loudly instead of saving an error body as audio
with open("output.wav", "wb") as f:
    f.write(response.content)
Node.js:
// Run as an ES module (e.g. tts.mjs) so top-level await works; Node 18+ has fetch built in.
import fs from "node:fs";

const response = await fetch("https://audexum.com/api/tts", {
  method: "POST",
  headers: {
    "Authorization": "Bearer sk_live_xxx",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ text: "Hello world", voice_id: "af_heart", format: "wav" }),
});
if (!response.ok) throw new Error(`TTS request failed: ${response.status}`);
fs.writeFileSync("output.wav", Buffer.from(await response.arrayBuffer()));
What's Next
- Streaming audio output (chunked transfer for low-latency playback)
- MP3 and OGG output formats
- Voice cloning from a reference audio clip
- Batch endpoint — multiple texts in, ZIP of WAV files out
Try It
Free tier is 10,000 characters per month, no credit card required.
If you build something with it, I'm curious what you're using TTS for.