Scale real‑time chat from 2 users to 1 million

Real-time is one of those features that demos in an afternoon and humbles you for a month. Open a WebSocket, push a message, watch it appear on the other screen. Done, right? That's the easy part. The part that makes it engineering is what happens after the demo: when the process dies mid-send, and when the two users in your test become two million users in a million concurrent conversations.

I built a small project to take both of those seriously — an ephemeral one-on-one chat. A host opens a room, a guest joins, the two talk in real time, and the moment either leaves, the room and every message in it are deleted. No history, no archive. The feature is small on purpose. The architecture is not, and that's the point of this post: the same codebase that serves two users is designed to serve millions, and the path between those two numbers is adding boxes, not rewriting code. I'm confident this design carries a million concurrent chats. Here's the reasoning.

The naive design dies twice

Here's the version almost everyone writes first:

1. Save the message to the database.
2. Push the message to the connected clients.

It works every time you test it, and it dies for two independent reasons.

It dies for correctness. The gap between step 1 and step 2 is a crash-shaped hole.

The dual-write problem: when one operation updates two systems that don't share a transaction (a database and a message broker), any crash between the writes leaves them inconsistent, with no clean way to recover.

Crash after the commit but before the push, and the message is saved but never delivered. Push first instead, and a crash leaves it delivered but never saved — gone on the next page load. No ordering of two independent writes survives a crash in the middle.
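To make the crash window concrete, here is a toy simulation, not code from the project. Two in-memory maps stand in for the database and the broker, and a hypothetical `crashAfterFirst` flag models the process dying between the two writes. Whichever order you choose, the two systems disagree afterwards.

```go
package main

import (
	"errors"
	"fmt"
)

// systems holds two independent stores with no shared transaction:
// toy stand-ins for Postgres and a message broker.
type systems struct {
	saved     map[string]bool
	delivered map[string]bool
}

func newSystems() *systems {
	return &systems{saved: map[string]bool{}, delivered: map[string]bool{}}
}

var errCrash = errors.New("process crashed between writes")

// naiveSend does the two writes in the chosen order. With crashAfterFirst
// set, the process "dies" after the first write; nothing undoes it.
func (s *systems) naiveSend(id string, saveFirst, crashAfterFirst bool) error {
	save := func() { s.saved[id] = true }
	push := func() { s.delivered[id] = true }
	first, second := save, push
	if !saveFirst {
		first, second = push, save
	}
	first()
	if crashAfterFirst {
		return errCrash
	}
	second()
	return nil
}

func main() {
	a := newSystems()
	_ = a.naiveSend("m1", true, true)
	fmt.Println("save then push:", a.saved["m1"], a.delivered["m1"]) // true false: saved, never delivered

	b := newSystems()
	_ = b.naiveSend("m1", false, true)
	fmt.Println("push then save:", b.saved["m1"], b.delivered["m1"]) // false true: delivered, never saved
}
```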

It also dies for scale, and this is the one I actually care about. In the naive design the request thread does the fanout — it holds the WebSocket connections and pushes to each recipient inline. That couples two things that must never be coupled at scale:

Request latency is now proportional to recipient count. Sending into a busy room, or broadcasting the lobby's room list to every idle user, makes the sender's request slow. Fanout size should never be in the critical path of a write.
The server is now stateful. It holds connections in memory, so you can't just add a second box — connections for a conversation might live on a different instance than the write. You're one node, or you're building sticky sessions and a custom cross-node fanout. You've painted yourself into a single-server corner on day one.

You can't patch your way out of either of these. You design your way out.

One write, then get out of the way

The fix for both problems is the same move: the request does exactly one thing — a single database transaction — and then returns. It never touches the broker, never holds a connection, never fans out.

That's the transactional outbox pattern (well documented by Chris Richardson and the microservices community). In one transaction you write two facts to the database: the business row, and a row in an outbox table describing what to publish and to whom. One transaction, one source of truth — either both land or neither does.

```go
func (o *messageUsecase) Create(ctx context.Context, msg string, publicRoomId, publicUserId uuid.UUID) (model.Message, error) {
	var createdMsg model.Message
	errCreate := o.txManager.UseTx(ctx, func(ctx context.Context) (err error) {
		// ... resolve room and user ...

		createdMsg, err = o.messageRepo.Create(ctx, newMsg) // (1) the message
		if err != nil {
			return
		}

		// ... work out who needs to receive it ...

		// (2) the delivery intent — same transaction, no broker call
		err = o.outboxUsecase.PublishMessageToUsersInRoom(ctx, createdMsg, user, usersToPublish)
		return
	})
	return createdMsg, errCreate
}
```

Look at what this buys at scale. The request does bounded, local work — render the message for each recipient (O(N) CPU, but cheap and never more than the people in a room) and a single batched insert (O(1) round-trips). It holds no WebSocket connections and performs no broadcast fanout itself, so the app tier stays stateless and scales sideways. The expensive, unbounded part — actually reaching everyone — happens asynchronously downstream of Kafka, where it can be absorbed and parallelized. And the one path where fanout really is large — broadcasting the lobby's room list — is a single outbox row addressed to many channels, so even that stays O(1) on the write.

Statelessness is the property everything else leans on. Every instance does nothing but write to Postgres and return, which is what lets you put fifty app servers behind a load balancer with no coordination between them. Stateless is the precondition for horizontal scale, and the outbox is what makes the app stateless.
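For readers who want to poke at the pattern without Postgres, here is a minimal in-memory sketch. It is not the project's code: `Store`, `OutboxRow`, `CreateMessage`, and `BroadcastLobby` are hypothetical names, and a mutex-guarded commit of staged writes stands in for a real database transaction. It shows the two properties the paragraphs above lean on: the message and its delivery intent commit together or not at all, and a broadcast to many channels is still one outbox row on the write path.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// OutboxRow records what to publish and to whom. In the real system a CDC
// relay (Debezium) ships committed rows onward. Field names are illustrative.
type OutboxRow struct {
	Channels     []string // one row may address many channels
	Payload      string   // rendered HTML, per the article's design
	PartitionKey string   // room id: keeps one room on one Kafka partition
}

type Message struct{ RoomID, Body string }

var errRender = errors.New("render failed")

// Store is a toy database; every write goes through tx.
type Store struct {
	mu       sync.Mutex
	messages []Message
	outbox   []OutboxRow
}

// txn holds staged writes until commit.
type txn struct {
	messages []Message
	outbox   []OutboxRow
}

// tx commits everything fn staged, or nothing if fn returns an error.
func (s *Store) tx(fn func(t *txn) error) error {
	t := &txn{}
	if err := fn(t); err != nil {
		return err // staged writes are discarded: the "rollback"
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.messages = append(s.messages, t.messages...)
	s.outbox = append(s.outbox, t.outbox...)
	return nil
}

// CreateMessage stores the message plus one rendered outbox row per
// recipient in a single transaction (one batched insert, in SQL terms).
func (s *Store) CreateMessage(m Message, recipients []string, render func(Message, string) (string, error)) error {
	return s.tx(func(t *txn) error {
		t.messages = append(t.messages, m)
		for _, r := range recipients {
			html, err := render(m, r)
			if err != nil {
				return err
			}
			t.outbox = append(t.outbox, OutboxRow{Channels: []string{"user:" + r}, Payload: html, PartitionKey: m.RoomID})
		}
		return nil
	})
}

// BroadcastLobby is a large fanout that still costs one row on the write.
func (s *Store) BroadcastLobby(html string, channels []string) error {
	return s.tx(func(t *txn) error {
		t.outbox = append(t.outbox, OutboxRow{Channels: channels, Payload: html, PartitionKey: "lobby"})
		return nil
	})
}

func main() {
	s := &Store{}
	// Toy renderer; a real one must escape Body.
	render := func(m Message, to string) (string, error) { return "<div>" + m.Body + "</div>", nil }
	_ = s.CreateMessage(Message{RoomID: "r1", Body: "hi"}, []string{"a", "b"}, render)

	broken := func(Message, string) (string, error) { return "", errRender }
	_ = s.CreateMessage(Message{RoomID: "r1", Body: "lost"}, []string{"a"}, broken)

	fmt.Println(len(s.messages), len(s.outbox)) // 1 2: the failed send left no trace
}
```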

Why Kafka — the part that actually scales to millions

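Once the app just records intent and walks away, something has to turn those outbox rows into delivered messages. That something is a change-data-capture pipeline: Debezium reads committed outbox rows from Postgres and publishes them to Kafka, and Centrifugo consumes Kafka and pushes each payload to the right WebSocket connections. The choice of Kafka in the middle is the deliberate scale decision of the whole system.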

Every stage in that pipeline scales on its own axis, and Kafka is why. Specifically:

Throughput through partitioning. Kafka is a distributed, partitioned, replicated log. You scale write and read throughput by adding partitions and brokers — there's a well-trodden path from thousands to millions of messages per second. The delivery tier's ceiling is Kafka's ceiling, and Kafka's ceiling is very far away.
A durable buffer, so spikes don't cascade. When a traffic spike hits, or Centrifugo briefly falls behind, messages queue durably in the log instead of backing up into the app or getting dropped. Kafka turns a spike into a slightly longer log, not an outage. That backpressure boundary is exactly what keeps a real-time system standing during the moments that matter.
Parallel consumers via consumer groups. Centrifugo reads as a consumer group; partitions distribute across its instances. More connected users → more Centrifugo nodes → more partitions consumed in parallel, up to one consumer per partition, so the partition count sets the ceiling on consumer parallelism. The WebSocket tier scales independently of the app tier and the database.
Per-room ordering for free. Every outbox row is stamped with its room id as the partition key:
```go
// partitionId = message.RoomId
o.outboxRepo.CreatePublishOutbox(ctx, channel, renderedMessage, message.RoomId)
```
That's the seam that lets a single room's messages stay strictly ordered on one partition while millions of other rooms spread across the cluster for throughput. Ordering and parallelism usually pull against each other; partition-by-room is how you get both.
New consumers cost the write path nothing. The log is retained, so the day you want moderation, search indexing, analytics, or an audit trail, you attach another consumer group and replay. At a million chats you will want those — and none of them require touching the code that sends a message.
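The ordering argument rests on one mechanism: the producer hashes the record key to choose a partition, so every event for a room lands on the same partition, and within a consumer group each partition is read by exactly one member at a time. The sketch below models that with FNV-1a from Go's standard library. Kafka's default partitioner actually uses murmur2, so the partition numbers differ, but the property is the same. `partitionFor` and `assign` are hypothetical helpers, not Kafka or Centrifugo APIs, and `assign` uses plain round-robin in place of Kafka's pluggable assignors.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a key (a room id here) to a partition. Same key, same
// partition, which is what keeps one room's events in order. FNV-1a is a
// stand-in for Kafka's murmur2-based default partitioner.
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(numPartitions))
}

// assign spreads partitions across a consumer group round-robin. Consumers
// beyond the partition count get nothing: partitions cap parallelism.
func assign(numPartitions, numConsumers int) [][]int {
	out := make([][]int, numConsumers)
	for p := 0; p < numPartitions; p++ {
		out[p%numConsumers] = append(out[p%numConsumers], p)
	}
	return out
}

type event struct {
	room string
	seq  int
}

func main() {
	const partitions = 8
	logs := make([][]event, partitions)

	// Interleave sends from three rooms, as concurrent users would.
	for seq := 1; seq <= 3; seq++ {
		for _, room := range []string{"room-a", "room-b", "room-c"} {
			p := partitionFor(room, partitions)
			logs[p] = append(logs[p], event{room, seq})
		}
	}
	// Within each partition, every room's events appear in send order.
	for p, entries := range logs {
		if len(entries) > 0 {
			fmt.Println("partition", p, entries)
		}
	}
	fmt.Println(assign(partitions, 3))  // three consumers share eight partitions
	fmt.Println(assign(partitions, 10)) // two of ten consumers sit idle
}
```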

This is also why I use CDC instead of polling the outbox table. Polling works at small scale, but at large scale it puts a relentless query load on the database and ties delivery latency to the poll interval: a new row can wait up to a full interval before anyone picks it up. Debezium reads Postgres' write-ahead log through logical replication; the database already wrote that log, so capture adds little load and happens close to real time. It's the version of "drain the outbox" that doesn't fight you as the numbers grow.
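A CDC relay can be sketched as a loop over an append-only log that remembers a position, which is roughly Debezium's relationship with the WAL through a replication slot. In this toy version, `walEntry`, `Relay`, and the `publish` callback are hypothetical, and an in-memory slice stands in for the WAL; there is no polling query, the relay simply resumes from where it left off. Notice when the committed position moves: only after a successful publish. That ordering is what makes delivery at-least-once, the trade-off acknowledged in the frontier list below.

```go
package main

import (
	"errors"
	"fmt"
)

// walEntry is one committed change, identified by its log position.
type walEntry struct {
	pos     int
	payload string
}

// Relay tails the log from its last committed position. A real relay
// persists this position (for Debezium, roughly: a stored LSN offset that
// is acknowledged to the replication slot).
type Relay struct {
	committed int
}

// Drain publishes every entry after the committed position, advancing the
// position only after publish succeeds. A crash between the two means the
// entry is sent again on restart: at-least-once, never lost.
func (r *Relay) Drain(wal []walEntry, publish func(walEntry) error) error {
	for _, e := range wal {
		if e.pos <= r.committed {
			continue // handled before the last restart
		}
		if err := publish(e); err != nil {
			return err
		}
		r.committed = e.pos
	}
	return nil
}

var errCrash = errors.New("relay died before committing its position")

func main() {
	wal := []walEntry{{1, "m1"}, {2, "m2"}, {3, "m3"}}
	var delivered []string
	r := &Relay{}

	// First run: m2 reaches the broker, then the relay dies before committing.
	_ = r.Drain(wal, func(e walEntry) error {
		delivered = append(delivered, e.payload)
		if e.pos == 2 {
			return errCrash
		}
		return nil
	})
	// Restart: resumes after position 1, so m2 goes out a second time.
	_ = r.Drain(wal, func(e walEntry) error {
		delivered = append(delivered, e.payload)
		return nil
	})
	fmt.Println(delivered) // [m1 m2 m2 m3]
}
```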

Ephemerality is a scaling feature, not just a product one

The obvious objection to this design: Postgres is your single stateful core, so isn't it the bottleneck? It's the right question, and the design has two answers.

First, rooms are independent. Two conversations share nothing, which makes the room id a natural shard key. Past the ceiling of one primary, you partition or shard by room and the write load spreads cleanly, because nothing needs to join across rooms.
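Because nothing joins across rooms, routing is just a function from room id to shard. The sketch below uses Jump Consistent Hash (Lamping and Veach), swapped in here as one reasonable choice; the post doesn't say which scheme the project would use, and `shardFor` is a hypothetical helper. The property worth noticing: growing from N to N+1 shards moves only the rooms that land on the new shard, about 1/(N+1) of them, and every other room stays put.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// jumpHash is Jump Consistent Hash (Lamping & Veach): it maps a 64-bit key
// to a bucket in [0, numBuckets). numBuckets must be at least 1.
func jumpHash(key uint64, numBuckets int) int {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int(b)
}

// shardFor routes a room id to a Postgres shard (hypothetical helper).
func shardFor(roomID string, shards int) int {
	h := fnv.New64a()
	h.Write([]byte(roomID))
	return jumpHash(h.Sum64(), shards)
}

func main() {
	const rooms = 10000
	moved := 0
	for i := 0; i < rooms; i++ {
		id := fmt.Sprintf("room-%d", i)
		if shardFor(id, 4) != shardFor(id, 5) {
			moved++ // any room that moves lands on the new shard, index 4
		}
	}
	fmt.Printf("%d of %d rooms moved going from 4 to 5 shards\n", moved, rooms)
}
```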

Second — and this is the part I like most — the data is ephemeral by design, and ephemerality bounds the working set. When a conversation ends, the room and its messages are deleted. The database holds only active chats, never the full history of everything ever said. A product with a million concurrent chats stores a million chats' worth of rows, not ten years of accumulated archive. The one stateful tier stays small precisely because nothing is allowed to persist. The feature that makes the product feel private is the same property that keeps the core scalable.

HTML over the wire — and why I didn't reach for React

The default for an app like this in 2026 is a React SPA talking to a JSON API. It's a fine architecture for some products. It also drags in a specific set of bugs that have nothing to do with your feature and everything to do with the shape of the design — and I've debugged all of them more than once:

Two renderers, one truth. The server owns the data; the client owns the markup. So display logic — formatting, "show this badge when…", "hide the delete button unless you're the host" — gets written twice, in two languages, and drifts. The two halves of your app slowly start to disagree.
The state-sync problem. This is the big one. The client keeps a copy of server state and has to keep it fresh: cache invalidation, stale reads, optimistic updates that need rolling back when the server says no. You've created a distributed-state problem inside your own UI. Half of what React Query and Redux do exists to manage a problem you only have because you chose to duplicate the state.
Contract drift. Rename a field on the server and a cached SPA bundle in someone's browser breaks. Now you're versioning APIs and writing contract tests to police an agreement between two things you wrote yourself.
Permissions in the wrong place. "Hide the button unless you're the host" is a client check — but the client can't be trusted, so the real rule has to live on the server too. Put it only on the client and it's a security hole; put it in both and it drifts.
Round-trip waterfalls. A screen needs data from three endpoints, so it's three sequential fetches and three spinners — or a bespoke per-screen endpoint, or GraphQL and its own complexity tax.

Every one of these is a consequence of splitting the view across the wire as data. So I didn't split it. The server is the single source of truth and it sends rendered HTML — the hypermedia-driven approach, HTML over the wire, revived by htmx and Carson Gross's Hypermedia Systems, DHH's Hotwire (literally HTML Over The WIRE), and Phoenix LiveView. The browser's only job is to drop the HTML into the page. There's no client copy of state to sync, no JSON contract to drift, and permissions are decided in the one place that also renders the markup.

I wanted to prove it holds all the way through a real-time system, so even the outbox payload carries HTML, not a message object:

```go
renderedMessage, err := o.roomPresenter.RenderMessage(ctx, message, sender, receiver) // -> HTML string
if err != nil {
	return err // don't silently drop a render failure
}
_, err = o.outboxRepo.CreatePublishOutbox(ctx, channel, renderedMessage, message.RoomId)
```

The thing travelling Debezium → Kafka → Centrifugo → the socket is a <div>, not a {}. And the same ChatMessage templ component renders in all three places a message can appear — the initial page load, the sender's HTTP response, and the live socket push:

```go
// the sender's own message, rendered straight into the HTTP response
err = messageComponent.ChatMessage(newMsg, user, user).Render(ctx, w)
```

One definition of what a message looks like, three delivery mechanisms, zero contract to keep in sync. It composes with the scale story, too: the delivery tier stays dumb — Kafka and Centrifugo shovel an opaque payload with no schema or logic in the hot path — while the one smart step, rendering, happens on the stateless app tier you can add boxes to.

The trade-off

None of this is free, and it's the wrong choice for some apps:

You ship more bytes per update — HTML is heavier than JSON for the same data (though gzip narrows it, and you're not also shipping a megabyte of framework).
The server does the rendering, spending CPU the client would otherwise spend. Cheap and cacheable here, but real.
It's a poor fit for genuinely client-heavy UIs — offline-first, local computation, canvas-style interactivity (Figma, Maps, a spreadsheet). Those need client state; hypermedia would fight them.
There's no free JSON API for, say, a future native mobile client. If I ever need one, I add it then — I'm not paying for it now.

I took those costs on purpose, because this app sits exactly where hypermedia wins: it's server-state-driven (the UI is a reflection of rooms and messages that live in the database, not a client-side computation), and it's built by one person. Trading flexibility I don't need for the elimination of an entire second codebase, a state-sync layer, and whole categories of drift bugs is not a close call. The best code is the code I didn't have to write to keep two copies of the truth agreeing.

The seams that keep each tier swappable

A system with this many collaborators (database, transaction, broker, renderer, object store) only stays sane if the dependencies point inward and are expressed as interfaces. The use cases depend on interfaces such as RoomRepo and TxManager, not on pgx or Centrifugo. That's what let me unit-test the transactional logic with hand-rolled fakes and no database at all, and it's what lets any one tier be replaced as the scale story evolves.

My favourite small seam is how a transaction crosses layers without leaking its type everywhere. The TxManager stashes the live transaction in the context; every repository quietly asks "is one in flight? use it; otherwise use the pool":

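```go
// GetTxIfAny returns the transaction that TxManager.UseTx stashed in ctx,
// or falls back to the pool when no transaction is in flight.
func GetTxIfAny(ctx context.Context, p *pgxpool.Pool) Querier {
	tx, ok := ctx.Value(helper.TxKey).(pgx.Tx)
	if ok {
		return tx
	}
	return p
}
```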

A repository doesn't know whether it's inside UseTx. The use case owns the transactional boundary; the repositories just honour it. Ten lines now, instead of threading a tx argument through every signature later.

What's still the frontier

Confidence isn't pretending the work is finished — it's knowing exactly where the next problems live and seeing that the architecture already has room for them:

Delivery is at-least-once, not exactly-once. A consumer can replay after a crash, so the client needs to de-duplicate by message id. That's a known, bounded piece of work — and at-least-once is the correct default; it's the guarantee that never loses a message.
Postgres sharding is real work past a point. The shard key (room id) and the bounded, ephemeral working set are already in place; turning the crank is an infrastructure step, not a re-architecture.
Identity is a cookie, not real auth. Fine for a demo, first on the list for anything real.
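The de-duplication the first item calls for is small. The real client lives in the browser (for example, skipping a pushed fragment when an element with that message id is already on the page); the logic is shown here in Go for consistency with the rest of the code. `Deduper` is hypothetical, and the bounded window is an assumption: it only needs to cover how far back a replay can reach.

```go
package main

import "fmt"

// Deduper remembers the most recent message ids (a bounded window) and
// reports whether an id is new. Oldest ids are evicted first.
type Deduper struct {
	seen  map[string]struct{}
	order []string // ring buffer of remembered ids
	next  int      // slot to overwrite once the ring is full
}

func NewDeduper(capacity int) *Deduper {
	if capacity < 1 {
		capacity = 1
	}
	return &Deduper{seen: make(map[string]struct{}, capacity), order: make([]string, 0, capacity)}
}

// Accept returns true the first time it sees id and false for a redelivery.
func (d *Deduper) Accept(id string) bool {
	if _, dup := d.seen[id]; dup {
		return false
	}
	if len(d.order) < cap(d.order) {
		d.order = append(d.order, id)
	} else {
		delete(d.seen, d.order[d.next]) // evict the oldest id
		d.order[d.next] = id
		d.next = (d.next + 1) % len(d.order)
	}
	d.seen[id] = struct{}{}
	return true
}

func main() {
	d := NewDeduper(1024)
	for _, id := range []string{"m1", "m2", "m2", "m3", "m1"} {
		if d.Accept(id) {
			fmt.Println("render", id)
		} else {
			fmt.Println("skip duplicate", id)
		}
	}
}
```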

None of these change the shape of the system. They're the next boxes, not the next rewrite — which is the whole claim.

The takeaway

Anyone can wire up a WebSocket for two users. The design question is whether those two users and two million share a codebase or a tombstone. This one shares a codebase, because every decision was made against the larger number: the request does one transaction and gets out of the way, so the app tier is stateless and scales sideways; Kafka absorbs the fanout, buffers the spikes, orders each room, and feeds consumers you haven't written yet; ephemerality keeps the one stateful core small; and the smart rendering happens once, far from the tier that has to reach everyone.

The WebSocket was the easy afternoon. An architecture that grows by adding boxes instead of starting over — that's the month. And that month is the work that's still ours.

Public GitHub repo for the code

https://github.com/finnng/ephemeral-chat

Some screenshots

The main screen: every user sees the lobby, where they can pick a room to join, or create one and wait for someone else to join.

The other user will see the room in the lobby immediately.

They can then click the room name to join and start chatting. The room disappears from the lobby, so only the two of them can chat.
The view from the host

It was fun building this project. I'm thinking of hosting it somewhere to let people come to chat just for fun ;)
