- Book: Decoupled PHP — Clean and Hexagonal Architecture for Applications That Outlive the Framework
- Me: xgabriel.com | GitHub
You ship Reverb. The first thousand connections are quiet. The first ten thousand stay quiet too, on one node. Then you cross fifteen thousand on a single process and the event loop starts hiccuping in ways the dashboard doesn't explain.
A trading dashboard team I talked to last month hit exactly that wall. They were broadcasting price ticks to roughly 50,000 logged-in users at peak, all running on a single Reverb 1.3 node behind a Network Load Balancer. The fix wasn't a bigger box. It was four things: cluster mode, a real drain handler, smarter authz, and an honest answer about whether they should've stayed on Soketi.
This is the playbook they ended up with.
Reverb in 60 seconds
Reverb is Laravel's first-party WebSocket server. It ships in Laravel 11+, speaks the Pusher protocol on the wire, and runs on ReactPHP under the hood. That last part is the one nobody internalizes until they hit a scale wall: Reverb is a single PHP process running an event loop. It doesn't fork per connection. It multiplexes everything on one thread.
That's why it's fast. It's also why a single CPU core ceiling becomes your ceiling.
php artisan reverb:start --host=0.0.0.0 --port=8080
One command, one process, one core. You broadcast from Laravel like always:
broadcast(new OrderShipped($order))->toOthers();
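The broadcaster sends the event to Reverb over its Pusher-compatible HTTP API. Reverb looks the channel up in its in-memory channel map, walks the subscribers, and writes the frame to each open socket. No detour, no separate fan-out service. (Events that implement ShouldBroadcast go through your queue first; ShouldBroadcastNow skips it.) For most Laravel apps this is enough forever.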
It stops being enough somewhere between 10k and 15k concurrent sockets per process. Reverb's docs hint at this. The team I mentioned hit it at 11,400 connections on a 4-core EC2 c7g.xlarge, ARM Graviton.
The single-node ceiling
The math is annoyingly simple. Each WebSocket connection holds:
- One file descriptor (cheap).
- A PHP object representing the connection state (cheaper).
- A subscription set for the channels it's joined (free until the broadcast).
- One slice of the event loop's attention on every broadcast (this is the killer).
When you broadcast(new TickerUpdate($symbol)) and 12,000 sockets are subscribed to private-ticker.AAPL, Reverb iterates that subscriber list and writes 12,000 frames before returning to the loop. Broadcasts that should take 8ms start taking 80ms. The loop falls behind. New connect frames queue up. Heartbeat replies get late. Clients flap.
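To make that arithmetic concrete, here is a rough model of how long one broadcast holds the loop. The per-frame cost is an assumption back-derived from the numbers above (80 ms for 12,000 frames is about 6.7 µs per frame), and the function names are illustrative. Measure your own setup before trusting any of it.

```php
<?php

declare(strict_types=1);

/**
 * Time one broadcast blocks a single-threaded loop, in milliseconds.
 * $usPerFrame is an assumed per-socket write cost, not a Reverb constant.
 */
function fanoutBlockMs(int $subscribers, float $usPerFrame): float
{
    return $subscribers * $usPerFrame / 1000.0;
}

/**
 * Fraction of each second the loop spends inside fan-out writes.
 * Anything near 1.0 leaves no time for pings, connects, or subscribes.
 */
function loopBusyFraction(float $blockMs, float $broadcastsPerSecond): float
{
    return $blockMs * $broadcastsPerSecond / 1000.0;
}

$blockMs = fanoutBlockMs(12_000, 6.67);       // about 80 ms per broadcast
$busy    = loopBusyFraction($blockMs, 10.0);  // about 0.8 of every second

printf("block: %.1f ms, loop busy: %.0f%%\n", $blockMs, $busy * 100);
```

Under these assumptions, ten broadcasts a second to a 12,000-subscriber channel already eat most of the loop, which is why pongs start arriving late.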
The symptom in Reverb logs is harmless-looking:
[reverb] connection 7f3a closed: timeout waiting for pong
That's not a network problem. That's your event loop too busy to ack pings.
The fix is to stop pretending one process can do this.
Cluster mode in Reverb 1.4
Reverb 1.4 added real cluster support via Redis pub/sub. Up to 1.3, scaling out meant standing up multiple Reverb processes behind a load balancer with sticky sessions, which worked until a client on node A needed to receive a broadcast originating on node B. Then nothing.
Cluster mode wires every node to a shared Redis instance. When Laravel broadcasts, the event lands on whichever node the HTTP request hit. That node publishes the payload to Redis. Every Reverb node subscribes, picks the message up, and fans it out to its own local connection table.
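The flow is easier to see in a toy model. This is a sketch of the publish-then-fan-out shape described above, not Reverb's implementation: FakePubSub stands in for Redis, and Node, join(), and broadcast() are hypothetical names.

```php
<?php

declare(strict_types=1);

/** In-memory stand-in for Redis pub/sub: every subscriber gets every message. */
final class FakePubSub
{
    /** @var list<callable(string): void> */
    private array $subscribers = [];

    public function subscribe(callable $handler): void
    {
        $this->subscribers[] = $handler;
    }

    public function publish(string $payload): void
    {
        foreach ($this->subscribers as $handler) {
            $handler($payload);
        }
    }
}

/** One WebSocket node: it knows only its own sockets. */
final class Node
{
    /** @var array<string, list<string>> channel => socket ids on this node */
    private array $local = [];

    /** @var list<string> frames "written", recorded for inspection */
    public array $written = [];

    public function __construct(private FakePubSub $bus)
    {
        // Every node listens on the shared bus.
        $bus->subscribe(fn (string $payload) => $this->deliver($payload));
    }

    public function join(string $channel, string $socketId): void
    {
        $this->local[$channel][] = $socketId;
    }

    /** Entry point for a broadcast: publish to the bus, never write locally first. */
    public function broadcast(string $channel, string $data): void
    {
        $this->bus->publish(json_encode(
            ['channel' => $channel, 'data' => $data],
            JSON_THROW_ON_ERROR
        ));
    }

    /** Fan a bus message out to this node's own subscribers only. */
    private function deliver(string $payload): void
    {
        $message = json_decode($payload, true, 512, JSON_THROW_ON_ERROR);

        foreach ($this->local[$message['channel']] ?? [] as $socketId) {
            $this->written[] = "{$socketId}:{$message['data']}";
        }
    }
}

$bus = new FakePubSub();
$a = new Node($bus);
$b = new Node($bus);
$a->join('private-ticker.AAPL', 'a1');
$b->join('private-ticker.AAPL', 'b1');

$a->broadcast('private-ticker.AAPL', 'tick'); // arrives on node A, reaches both nodes
```

The point of the design is that the node receiving the broadcast does no special-casing: it hears its own message back from the bus like everyone else, so each socket gets the frame exactly once.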
Config lives in config/reverb.php:
return [
    'default' => env('REVERB_SERVER', 'reverb'),
    'servers' => [
        'reverb' => [
            'host' => env('REVERB_SERVER_HOST', '0.0.0.0'),
            'port' => env('REVERB_SERVER_PORT', 8080),
            'hostname' => env('REVERB_HOST'),
            'options' => [
                'tls' => [],
            ],
            'max_request_size' => env('REVERB_MAX_REQUEST_SIZE', 10_000),
            'scaling' => [
                // Off by default; each node opts in through its environment.
                'enabled' => env('REVERB_SCALING_ENABLED', false),
                'channel' => env('REVERB_SCALING_CHANNEL', 'reverb'),
                'server' => [
                    'url' => env('REDIS_URL'),
                    'host' => env('REDIS_HOST', '127.0.0.1'),
                    'port' => env('REDIS_PORT', '6379'),
                    'username' => env('REDIS_USERNAME'),
                    'password' => env('REDIS_PASSWORD'),
                    'database' => env('REDIS_DB', '0'),
                ],
            ],
        ],
    ],
];
Set REVERB_SCALING_ENABLED=true on every node and point them at the same Redis. They form a logical cluster. A broadcast from any HTTP node lands on every WebSocket node.
Two things to know that the migration guide buries:
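Redis pub/sub is fire-and-forget. If a Reverb node is slow to consume from Redis, messages pile up in that client's output buffer on the Redis server. Once the buffer crosses client-output-buffer-limit for the pubsub class (32 MB hard, or 8 MB sustained for 60 seconds, by default), Redis disconnects the subscriber, and anything published before it resubscribes is gone. Monitor client output buffers on the Redis side. The trading team caught a 90-second buffer growth during a market open and added a Redis alert before it bit them.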
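One way to watch for this is to read the text output of Redis's CLIENT LIST command and flag subscribers with a large output buffer. The sketch below assumes you already have that text from whatever Redis client you use. The `flags` field (where `P` marks a pub/sub subscriber) and `omem` (output buffer bytes) are standard CLIENT LIST fields; the function name, the sample lines, and the 4 MB threshold are made up for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Parse the text returned by Redis `CLIENT LIST` and return pub/sub
 * subscribers whose output buffer (omem, in bytes) exceeds a threshold.
 *
 * @return list<array{addr: string, omem: int}>
 */
function slowSubscribers(string $clientList, int $thresholdBytes): array
{
    $slow = [];

    foreach (preg_split('/\R/', trim($clientList)) ?: [] as $line) {
        if ($line === '') {
            continue;
        }

        // Each line is a run of space-separated key=value pairs.
        $fields = [];
        foreach (explode(' ', $line) as $pair) {
            [$key, $value] = array_pad(explode('=', $pair, 2), 2, '');
            $fields[$key] = $value;
        }

        $isSubscriber = str_contains($fields['flags'] ?? '', 'P');
        $omem = (int) ($fields['omem'] ?? 0);

        if ($isSubscriber && $omem > $thresholdBytes) {
            $slow[] = ['addr' => $fields['addr'] ?? '?', 'omem' => $omem];
        }
    }

    return $slow;
}

// Illustrative sample, not captured from a real server.
$sample = <<<TXT
id=7 addr=10.0.1.12:51844 name= flags=P db=0 sub=1 psub=0 obl=0 oll=212 omem=6291456 cmd=subscribe
id=8 addr=10.0.1.13:40112 name= flags=P db=0 sub=1 psub=0 obl=0 oll=0 omem=0 cmd=subscribe
id=9 addr=10.0.2.40:33310 name= flags=N db=0 sub=0 psub=0 obl=0 oll=0 omem=0 cmd=get
TXT;

$slow = slowSubscribers($sample, 4 * 1024 * 1024); // alert above 4 MB, an arbitrary threshold
```

Run it on a schedule and alert when the list is non-empty for more than a few checks in a row; a single spike during a burst is normal.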
Don't put cluster Redis on the same instance as your cache Redis. They have different load profiles. Cluster pub/sub is steady-state high-throughput. Cache is bursty. Sharing the instance means one bad cache key eviction storm chokes the cluster.
On 4 nodes (4 vCPU each, c7g.xlarge), the team comfortably held 48,000 connections steady-state, with peak excursions to 56,000. Per-node CPU sat at 55-65% during heavy broadcasts. That's the headroom you want.
The deploy storm, and why you need a SIGTERM handler
The first thing that broke after they shipped cluster mode wasn't a connection count. It was a deploy.
CI pushes a new release. The orchestrator (ECS in their case) rolls nodes one at a time. Each node gets a SIGTERM, then 30 seconds, then SIGKILL. When a Reverb node dies, every socket on it closes. 12,000 clients see a close frame, then their reconnect logic kicks in.
Pusher-protocol clients reconnect with exponential backoff plus jitter. In theory. In practice, a popular Laravel Echo version had a bug where the initial retry was effectively immediate. 12,000 clients all hit the load balancer in the same 200ms window. The LB happily routed them to the surviving 3 nodes. Each surviving node went from 12,000 connections to 16,000 in two seconds. Event loops choked. Heartbeats timed out. Clients reconnected again. Cascade.
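For reference, this is what "exponential backoff plus jitter" is supposed to compute. Real clients implement it in JavaScript; it is written in PHP here only to match the rest of the post. The function name, the 1-second base, and the 30-second cap are assumptions, not values taken from Echo or pusher-js.

```php
<?php

declare(strict_types=1);

/**
 * Delay before reconnect attempt $attempt (0-based), "full jitter" style:
 * pick a point inside an exponentially growing, capped window.
 *
 * $rand is a number in [0, 1); it is injected so the function is testable.
 */
function reconnectDelayMs(int $attempt, int $baseMs, int $capMs, float $rand): int
{
    $window = min($capMs, $baseMs * (2 ** $attempt));

    return (int) floor($rand * $window);
}

// First retry lands somewhere in [0, 1000) ms instead of at 0 ms for everyone.
$first = reconnectDelayMs(0, 1_000, 30_000, mt_rand() / (mt_getrandmax() + 1));
```

With a window like this, 12,000 first retries spread across a full second rather than one 200ms spike, and later retries spread wider still.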
The fix is to drain gracefully on SIGTERM before the orchestrator hard-kills.
Reverb's reverb:start command doesn't ship a documented drain mode, but you can register a signal handler in a tiny custom command that wraps it. The pattern that worked:
<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;
use Laravel\Reverb\Servers\Reverb\Factory;
use React\EventLoop\Loop;

class ReverbServeGraceful extends Command
{
    protected $signature = 'reverb:serve-graceful
        {--host=0.0.0.0}
        {--port=8080}
        {--drain-timeout=25}';

    protected $description = 'Reverb with SIGTERM-driven graceful drain';

    public function handle(): int
    {
        // Share one ReactPHP loop between Reverb and the drain logic.
        $loop = Loop::get();

        $server = Factory::make(
            host: $this->option('host'),
            port: (int) $this->option('port'),
            loop: $loop,
        );

        $draining = false;
        $drainStart = null;

        // SIGTERM = orchestrator says "wrap it up".
        // The loop dispatches the signal itself (requires ext-pcntl).
        $loop->addSignal(SIGTERM, function () use (&$draining, &$drainStart) {
            if ($draining) {
                return;
            }

            $draining = true;
            $drainStart = microtime(true);

            // Fail the health check so the LB stops routing new sockets here.
            // 'reverb.health' and 'reverb.channels' are bindings this sketch
            // assumes your app registers; adapt them to your own setup.
            app('reverb.health')->markUnhealthy();

            // Tell clients to reconnect elsewhere.
            // Pusher protocol: code 4200 = "generic reconnect".
            foreach (app('reverb.channels')->connections() as $connection) {
                $connection->send(json_encode([
                    'event' => 'pusher:error',
                    'data' => ['code' => 4200, 'message' => 'server draining'],
                ]));
            }
            // Sockets stay open; the timer below gives clients time to leave.
        });

        // Check drain progress twice a second.
        $loop->addPeriodicTimer(0.5, function () use ($loop, &$draining, &$drainStart) {
            if (! $draining) {
                return;
            }

            $elapsed = microtime(true) - $drainStart;
            $open = app('reverb.channels')->connectionCount();

            if ($open === 0 || $elapsed > (int) $this->option('drain-timeout')) {
                // Either everyone left, or we ran out of patience.
                // Stopping the loop lets start() return and handle() exit cleanly.
                $loop->stop();
            }
        });

        $server->start();

        return 0;
    }
}
The shape matters more than the exact code. On SIGTERM you do three things in order:
- Flip the health endpoint to unhealthy. The LB stops sending new connections within one health-check interval (5-10 seconds is typical).
- Send a pusher:error with code 4200 to every connection. Echo and most Pusher clients treat this as a "reconnect to a different node" hint, not a crash.
- Wait up to 25 seconds. SIGKILL hits at 30, so 25 gives you margin.
The actual disconnect window flattens out over those 25 seconds instead of a 200ms cliff. With four nodes rolling one at a time, the surviving three see a slow climb of maybe 3,000 reconnects each, well within headroom.
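One caveat: the handler above notifies every socket in a single pass, so clients that act on the hint right away still move together. If you see that in practice, one option is to send the notices in batches across the drain window. The helper below is a hypothetical sketch of the batch sizing only; it is not part of Reverb, and wiring it into the timer is left to you.

```php
<?php

declare(strict_types=1);

/**
 * Split $connections into per-tick batches so the reconnect notices go out
 * evenly across the drain window instead of all at once.
 *
 * @return list<int> number of connections to notify on each tick
 */
function drainBatches(int $connections, float $windowSeconds, float $tickSeconds): array
{
    $ticks = max(1, (int) floor($windowSeconds / $tickSeconds));
    $base = intdiv($connections, $ticks);
    $extra = $connections % $ticks;

    $batches = [];
    for ($i = 0; $i < $ticks; $i++) {
        // The first $extra ticks carry one more so nothing is left over.
        $batches[] = $base + ($i < $extra ? 1 : 0);
    }

    return $batches;
}

// 12,000 sockets, notices spread over 20 s of the 25 s drain, one batch per 0.5 s tick.
$batches = drainBatches(12_000, 20.0, 0.5);
```

That works out to 300 notices per tick, which leaves the last five seconds of the window for stragglers before the process exits.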
If you only do one thing from this whole post, do the drain handler. It's the difference between "deploys are uneventful" and "deploys page the on-call engineer."
Channel authorization at scale
Private and presence channels are authorized by your Laravel app, not by Reverb itself. Every time a client subscribes to private-orders.42, Echo POSTs to /broadcasting/auth with the channel name, the socket ID, and the user's session. Your channels.php decides yes or no, and Reverb only verifies the signed token the client then sends with its subscribe frame.
use App\Models\Order;
use Illuminate\Support\Facades\Broadcast;

Broadcast::channel('orders.{orderId}', function ($user, int $orderId) {
    return Order::query()
        ->where('id', $orderId)
        ->where('user_id', $user->id)
        ->exists();
});
That's an SQL query per subscription. At 50k concurrent connections, every one of which subscribes to two or three channels on connect, you're looking at 100-150k auth queries during a connection storm. Most of which hit the database in the same 10-second window.
Three patterns to keep this sane:
Cache the authorization result. For most channel types, "can user 42 see order 9001" doesn't change second-to-second. A 30-second cache on the channel auth callback collapses the storm to a query per (user, channel) pair, then nothing for half a minute.
Broadcast::channel('orders.{orderId}', function ($user, int $orderId) {
    return cache()->remember(
        "chan:orders:{$orderId}:user:{$user->id}",
        now()->addSeconds(30),
        fn () => Order::where('id', $orderId)
            ->where('user_id', $user->id)
            ->exists()
    );
});
The cached answer has to be invalidated when the underlying permission changes. Order ownership rarely changes mid-session, so 30 seconds is safe. For organization-membership channels where a user can be removed mid-day, you'll want a shorter TTL or explicit invalidation on the remove path.
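Explicit invalidation on the remove path can be sketched without a framework. TtlCache below is a tiny in-memory stand-in for Laravel's cache store, and the org channel, key format, and membership array are hypothetical. In a real app the same two steps would be your membership update followed by cache()->forget() on the matching key.

```php
<?php

declare(strict_types=1);

/** Tiny in-memory TTL cache standing in for Laravel's cache store. */
final class TtlCache
{
    /** @var array<string, array{value: mixed, expires: float}> */
    private array $items = [];

    /** @param callable(): float $clock returns "now" in seconds */
    public function __construct(private $clock)
    {
    }

    public function remember(string $key, float $ttl, callable $resolve): mixed
    {
        $now = ($this->clock)();

        if (isset($this->items[$key]) && $this->items[$key]['expires'] > $now) {
            return $this->items[$key]['value'];
        }

        $value = $resolve();
        $this->items[$key] = ['value' => $value, 'expires' => $now + $ttl];

        return $value;
    }

    public function forget(string $key): void
    {
        unset($this->items[$key]);
    }
}

$now = 0.0;
$cache = new TtlCache(function () use (&$now): float { return $now; });
$queries = 0;
$members = [42 => true]; // hypothetical membership table: user 42 belongs to org 7

$canJoin = function (int $userId) use ($cache, &$queries, &$members): bool {
    return $cache->remember(
        "chan:org:7:user:{$userId}",
        30.0,
        function () use ($userId, &$queries, &$members): bool {
            $queries++; // stands in for the SQL lookup
            return $members[$userId] ?? false;
        }
    );
};

$first = $canJoin(42);  // miss: one query
$second = $canJoin(42); // hit: no query

// Remove path: change the data and drop the cached answer in the same step.
unset($members[42]);
$cache->forget('chan:org:7:user:42');
$third = $canJoin(42);  // miss again, now denied
```

Without the forget() call, the removed user would keep passing the check until the 30-second entry expired.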
Authorize once per channel, not once per message. Reverb already does this by default. It's worth saying out loud because new teams sometimes write per-message permission checks on the publish side, "just to be safe." That's fine for one publisher with twelve subscribers. It's a quadratic cliff at scale. Trust the subscribe-time check.
Use presence channels only when you actually need presence. Presence channels carry user metadata in every member_added and member_removed event, and Reverb has to track membership state for every channel on every node. A 5,000-member presence channel with churn is expensive in ways a private channel isn't. The team I keep mentioning had a "global online users" presence channel they didn't actually need; they were just reading the count for a UI badge. They moved that to a Redis counter and cut cluster CPU by 40%.
Heartbeats and presence: the cost nobody calculates
Pusher protocol heartbeats default to 30 seconds. Client sends ping, server replies pong. With 50k connections, that's 50k pongs every 30 seconds, distributed across the cluster: roughly 1,700 pongs per second in total, or a little over 400 per node on a 4-node cluster. Cheap individually, but it's constant load. Don't tune the interval down without a reason.
Presence channels add another tax. Each member_added and member_removed event is written to every other member of that channel. A presence channel with 1,000 members means roughly 1,000 frames every time someone joins or leaves. If join/leave churn on that channel is even modest, say 50 events per second, that's 50,000 frames per second of presence chatter alone.
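Both costs are simple multiplications, so it is worth running them for your own numbers before adding a channel. A minimal sketch, with illustrative function names and the figures from this section as inputs:

```php
<?php

declare(strict_types=1);

/** Pongs per second the whole cluster must send. */
function pongsPerSecond(int $connections, float $intervalSeconds): float
{
    return $connections / $intervalSeconds;
}

/** Frames per second written for presence join/leave events on one channel. */
function presenceFramesPerSecond(int $members, float $churnPerSecond): float
{
    // Each join or leave is written to every other member; at this size
    // "members" and "members - 1" are interchangeable.
    return $members * $churnPerSecond;
}

$clusterPongs = pongsPerSecond(50_000, 30.0);          // about 1,667 per second
$perNodePongs = $clusterPongs / 4;                     // about 417 per second per node
$chatter      = presenceFramesPerSecond(1_000, 50.0);  // 50,000 frames per second
$bigChannel   = presenceFramesPerSecond(5_000, 30.0);  // 150,000 frames per second
```

The last line is the 5,000-member channel at 30 joins and leaves per second: two orders of magnitude more frames than the heartbeats for the entire cluster.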
Real numbers from the trading team's load test on the 4-node cluster:
- 48,000 idle connections, no broadcasts: ~12% CPU per node (heartbeats only).
- Same 48,000 + 200 broadcasts/sec on private channels with ~250 subscribers each: ~55% CPU per node.
- Same setup + a 5,000-member presence channel with 30 joins/leaves per second: ~78% CPU per node.
The presence channel was the single biggest cost line. Worth knowing before you sprinkle them through your UI.
When Soketi or Pusher Cloud still wins
Reverb is the right call when you want WebSockets that feel like part of Laravel, you're comfortable running a stateful service, and your scale fits inside a small cluster. That covers a lot of teams.
Three cases where it isn't the right call:
Pusher Cloud wins when you don't want to operate stateful infra at all. $49-$499/month for managed Pusher buys you zero deploy storms, zero Redis pub/sub buffer monitoring, zero "why is one node at 90% CPU." If your team is small and the realtime feature isn't core IP, you're probably overspending engineering time to save money on infra. Pusher Cloud is the boring choice and boring is often correct.
Soketi wins when you want Pusher-protocol compatibility but with Node.js operational characteristics. It's a drop-in for Pusher and Reverb on the protocol side, written in TypeScript, and has been around longer than Reverb's cluster mode. If your team already runs Node services and prefers that operational profile, Soketi is the move. It also has battle-tested horizontal scaling. There are 100k-connection Soketi deploys documented in the wild going back years.
Stay on Pusher when you need globally distributed presence channels. Reverb's cluster mode is single-region. If your users span continents and you need a presence channel that knows about all of them with sub-200ms latency, Pusher's edge network does what Reverb doesn't.
For the trading team (single region, Laravel-shop, mid-size cluster, want to control the wire) Reverb 1.4 with the drain handler turned out to be the right answer. Their bill went from $340/month for Pusher to $190/month for the cluster plus Redis. Not a huge win on dollars; a much bigger win on control and on knowing exactly what the stack is doing on a deploy.
The shortlist
If you take Reverb to production at scale, ship these:
- Cluster mode enabled on every node, shared Redis dedicated to pub/sub.
- A SIGTERM-driven graceful drain that flips health to unhealthy, sends pusher:error code 4200, and waits up to 25 seconds before exiting.
- 30-second cache on channel auth callbacks where ownership is stable.
- Presence channels only where you genuinely need presence; use Redis counters for "how many online" badges.
- Monitor Redis pub/sub client output buffer alongside Reverb's connection count.
- One node = ~12,000 connections is the planning number. Don't push past it without measuring.
The realtime stack stops being magical once you can predict its failure modes. That's the whole game.
If this was useful
The Laravel stack does a lot for you, but the architecture under it is yours to design. Once your WebSocket layer, your queue layer, and your domain logic all start having opinions, the framework's defaults stop being the right answer everywhere. Decoupled PHP is the book about the architecture your codebase reaches for after it outgrows those defaults, with patterns for keeping your domain pure when Reverb, Horizon, and Eloquent all want to live inside it.
What's the biggest single-node Reverb scale you've pushed in production? Drop the connection count and what broke first.