Interactive · System Evolution

Scaling your system.

People often ask what to build first and what to scale first. Here's a simple path you can follow as your application grows. Every system is different, so measure your own traffic before reaching for any of these steps. Use this as a general guide, not a checklist.

1 VM · 2 Split DB · 3 Observe · 4 Load Balance · 5 Cache + CDN · 6 Async · 7 Replicas + Shard
01 / 07

A single VM

[Diagram: Internet (your users) → one VM (Ubuntu, 4 GB) running a web server (nginx, :80), the app (Node, Python, or Go, :3000), the database (Postgres on localhost), and cron + jobs (crontab, pm2).]

You're at zero users and you want to ship today. Everything runs on a single VM. Web server, app code, database, cron jobs, and static files all live on the same machine. One server, one deploy, one place to look when something breaks. Cheap, easy to reason about, and totally fine for a long time. You'll outgrow it the first time the database eats all the memory and takes the app down with it, or the first time you need to deploy without a five-minute outage.

Open source · self-host

A Hetzner cloud server or DigitalOcean droplet running Ubuntu, Nginx, your app, Postgres, and pm2 or systemd to keep things alive.

Enterprise · managed

AWS Lightsail, GCP Compute Engine, Azure VM, Render, Railway. Same shape, someone else handles the drive failure at 3am.

What breaks →

One bad query, one VM. Postgres eats all the memory and the kernel's OOM killer steps in. The app sitting next to it goes down too, because everything shares one machine's memory.

[Diagram: OOM cascade on the single VM. The database hits 100% memory on a sequential scan, web and app are killed by the OOM killer, cron + jobs stall, and the Internet sees 503s and timeouts.]
02 / 07

Split the database off.

[Diagram: Internet (users) → App VM (web + app: nginx, Node, Go; cron + jobs: crontab, pm2) → DB host over TCP 5432 (Postgres with dedicated RAM, backups, PITR).]

The database is the first thing to escape. Move Postgres onto its own host so it has dedicated memory and disk. Now a runaway query can't starve the app, an app crash can't take the database down with it, and you can scale the two independently. If you pick a managed service, backups, point-in-time restore, and minor version upgrades also become features you switch on instead of a weekend project. The one rule: don't put the DB host on the public internet.
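The "one rule" above can be enforced with a small startup check. This is a minimal sketch, assuming the app reads its connection string from a URL and that the DB is addressed by IP literal; the function name `db_host_is_private` and the example URLs are hypothetical, and a DNS name would need to be resolved first.

```python
import ipaddress
from urllib.parse import urlsplit


def db_host_is_private(database_url: str) -> bool:
    """Return True if the DB host in the URL is a non-public IP address.

    Hypothetical helper: only handles 'localhost' and IP literals.
    """
    host = urlsplit(database_url).hostname
    if host is None:
        raise ValueError("no host found in database URL")
    if host == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name cannot be judged without resolving it first.
        raise ValueError(f"{host!r} is a DNS name; resolve it before checking") from None
    # is_private covers RFC 1918 ranges, loopback, and other non-routable blocks.
    return ip.is_private
```

Calling it once at boot and refusing to start on `False` is one way to catch a misconfigured connection string before it reaches production. It complements, and does not replace, firewall rules or a private network.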

Open source · self-host

Postgres or MySQL on a dedicated VM, with pgBackRest or WAL-G for Postgres backups (Litestream covers SQLite, not Postgres or MySQL) and a private network between app and DB.

Enterprise · managed

AWS RDS, Google Cloud SQL, Neon, Supabase, PlanetScale, Aiven. Backups, failover, and patching come included.

What breaks →

The system is alive. You can't see what it's doing. Users say "it's slow." You SSH in, run top, and guess. No logs, no metrics, no idea which query.

[Diagram: a user reports "it's slow"; p95 latency, error rate, slow queries, and the user funnel are all unknown (no logs, no metrics, no traces, no analytics). The App VM and Postgres run silently while an engineer SSHes in, runs `top`, and hopes for the best.]
03 / 07

See what's actually happening.

[Diagram: the App VM (web + app + cron, OTel SDK loaded) and Postgres on its own host (exporter attached, port 5432) feed three sinks: logs (structured JSON; Grafana or Datadog), metrics + alerts (CPU, p95, errors; pages on threshold), and product analytics (events, funnels, retention).]

You have two boxes now and almost no idea what either of them is doing. Add three things at once: structured logs, runtime metrics, and product analytics. Logs tell you what your code did. Metrics tell you how the machine feels. Product analytics tells you what users are actually doing inside the app. Wire alerts to the metrics that hurt the business when they break, and ignore the rest. Pages you snooze are worse than no pages at all.
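As an illustration of what "structured logs" can mean in practice, here is a minimal sketch that uses only Python's standard `logging` module to emit one JSON object per line. The field names (`request_id`, `duration_ms`) and the logger name `app` are assumptions chosen for the example, not a required schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    # Extra fields we choose to copy over if the caller supplied them (assumed names).
    EXTRA_FIELDS = ("request_id", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Return an 'app' logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A call such as `log.info("checkout ok", extra={"request_id": "r-1", "duration_ms": 42})` then produces a line that a log backend can filter by field rather than by regex.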

Open source · logs & metrics

OpenTelemetry SDK, Prometheus, Grafana, Loki, Alertmanager. Self-hosted PostHog for product events.

Enterprise · logs & metrics

Datadog, New Relic, Honeycomb, Sentry, Better Stack. Pick one suite if you can; glue is a tax.

Enterprise · product analytics

PostHog Cloud, Amplitude, Mixpanel, Heap. Send a small set of high-signal events, not everything.

What breaks →

One app box, all the traffic. CPU pegs at 100%, requests queue, timeouts spread. And every deploy is a planned outage.

[Diagram: a traffic spike hits the single App VM (CPU 100%, queue full, deploy = downtime) and returns 504s, while Postgres is healthy and idle.]


04 / 07

More than one app server.

[Diagram: Internet (users) → load balancer (TLS, health checks, round-robin) → App 1, App 2, App 3 (all stateless) → Postgres (single primary).]

One app box can only handle so much traffic, and a single box means downtime every time you deploy. Put a load balancer in front, run two or three app instances behind it, and make the app stateless so any request can land on any box. Now you can roll deploys one instance at a time with no downtime, and a crashed instance just gets traffic pulled until it recovers. The database is still the ceiling, but you bought a lot of headroom on the app tier.
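The routing idea can be sketched in a few lines. This is a toy round-robin picker that skips instances marked unhealthy, assuming health is reported by some external check; the class name and backend names are hypothetical, and a real deployment would use a load balancer rather than application code.

```python
class RoundRobinBalancer:
    """Toy round-robin selection over a fixed list of backends."""

    def __init__(self, backends: list[str]):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._healthy = set(backends)
        self._next = 0

    def mark_down(self, backend: str) -> None:
        """Stop sending traffic to a backend (crashed, or draining for a deploy)."""
        self._healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        """Resume sending traffic to a known backend."""
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self) -> str:
        """Return the next healthy backend in rotation."""
        for _ in range(len(self._backends)):
            backend = self._backends[self._next % len(self._backends)]
            self._next += 1
            if backend in self._healthy:
                return backend
        raise RuntimeError("no healthy backends")
```

A rolling deploy under this model is `mark_down`, upgrade, `mark_up`, one instance at a time. That only works because the app is stateless, so any request can land on any remaining instance.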

Open source · self-host

HAProxy, Caddy, Nginx, or Traefik in front of your own app boxes. Keepalived if you want LB redundancy too.

Enterprise · managed

AWS ALB, Cloudflare Load Balancer, Google Cloud Load Balancing. Or a platform where it's built in: Vercel, Railway, Fly, Render.

What breaks →

Every app box, same hot read, every request. Connection pool full. The same query lands a thousand times per second for rows the DB returned a moment ago. Static files round-trip from origin too.

[Diagram: the load balancer fans out to App 1, App 2, and App 3, each running SELECT user/42 on every request; Postgres sits at 100/100 connections serving 1k rps on the same row, with no cache in between.]


05 / 07

Cache and CDN.

[Diagram: global users → CDN (edge cache, static assets) → load balancer (TLS termination) → app pool → cache (Redis or Valkey: sessions, hot reads) → Postgres on a cache miss.]

Static assets and frequently read data shouldn't touch your origin. A CDN puts your JS, CSS, and images on edge nodes near your users. A cache layer in front of the database lets you skip the round trip for hot reads, sessions, rate limits, and computed results that don't change often. Both buy you raw speed and let the origin breathe. Cache invalidation is the hard part, so start with TTLs before you reach for pub/sub-driven invalidation.
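The "start with TTLs" advice corresponds to the cache-aside pattern: check the cache, fall back to the database on a miss, and store the result with an expiry. Below is a minimal in-process sketch. A real setup would put this in Redis or Valkey so all app instances share it; the class name and the injectable `clock` parameter are assumptions made for the example.

```python
import time
from typing import Any, Callable


class TTLCache:
    """In-process cache-aside helper with a fixed time-to-live per entry."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        """Return the cached value, or call loader() on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the database is never touched
        value = loader()  # cache miss: one round trip to the origin
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a 30-second TTL, a row read a thousand times per second reaches the database about once every 30 seconds per app instance, at the cost of serving data that may be up to 30 seconds old.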

Open source · cache

Redis or Valkey in front of the DB, Varnish in front of the app for full-page caching, KeyDB for high-throughput multi-core.

Enterprise · CDN

Cloudflare, Fastly, AWS CloudFront, Bunny CDN. Vercel and Netlify ship a CDN by default for the static layer.

Enterprise · cache

Upstash Redis, AWS ElastiCache, Redis Cloud, Momento. Pick the one that lives in the same region as your apps.

What breaks →

A slow upstream becomes your slow endpoint. Stripe takes 8 seconds, so your checkout endpoint takes 8 seconds. Every web worker is parked on a network call. The site is "down" without anything actually crashing.

[Diagram: the user gets a 504 timeout. Every thread in the app tier is blocked while Postgres and the cache are healthy; Stripe p95 is 8.2s and the email API p95 is 4.1s.]


06 / 07

Get slow work off the request path.

[Diagram: load balancer → app tier (sync), which uses the cache and Postgres and enqueues jobs onto a queue (durable, with retries); workers pull from the queue and call third-party APIs (Stripe, Resend, etc.).]

Sending email, generating PDFs, calling third-party APIs, processing uploads, building search indexes. None of that belongs in the request the user is waiting on. Push those jobs onto a queue and let a separate worker pool grind through them. The web tier stays fast even when an upstream provider has a bad day, and most queue systems give you retries, dead-letter queues, and rate limiting as built-in features. This is also the point where a slow webhook stops being able to take your whole site down.
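The enqueue, retry, and dead-letter flow can be sketched with the standard library. This version is in-memory only, so it is not durable and loses jobs on restart; it is meant to show the control flow that a real queue provides. The names `Job` and `JobQueue` and the `max_attempts` default are assumptions.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Callable


@dataclass
class Job:
    name: str
    payload: dict
    attempts: int = 0


class JobQueue:
    """In-memory job queue with retries and a dead-letter list (not durable)."""

    def __init__(self, handler: Callable[[Job], None], max_attempts: int = 3):
        self._queue: "queue.Queue[Job]" = queue.Queue()
        self._handler = handler
        self._max_attempts = max_attempts
        self.dead_letter: list[Job] = []

    def enqueue(self, job: Job) -> None:
        """Called from the request path; returns immediately."""
        self._queue.put(job)

    def _work(self) -> None:
        while True:
            job = self._queue.get()
            try:
                job.attempts += 1
                self._handler(job)
            except Exception:
                if job.attempts < self._max_attempts:
                    self._queue.put(job)  # retry later
                else:
                    self.dead_letter.append(job)  # give up, keep for inspection
            finally:
                self._queue.task_done()

    def start(self, workers: int = 2) -> None:
        """Start daemon worker threads that drain the queue."""
        for _ in range(workers):
            threading.Thread(target=self._work, daemon=True).start()

    def wait(self) -> None:
        """Block until every enqueued job has succeeded or been dead-lettered."""
        self._queue.join()
```

The request handler only ever calls `enqueue`, so a slow provider delays the worker, not the user. Real systems usually add backoff between retries, which this sketch omits.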

Open source · self-host

BullMQ on Redis, Sidekiq for Ruby, Celery for Python, RabbitMQ, NATS JetStream, Redis Streams. Pick what your language has the best client for.

Enterprise · managed

AWS SQS plus Lambda, Google Pub/Sub, Vercel Queues, Inngest, Trigger.dev, Temporal Cloud. Durable workflows for anything multi-step.

What breaks →

Every workload meets at one primary. OLTP writes, dashboards, BI exports, the analyst's ad-hoc query. All on one box. Lock waits and write throughput are the new ceiling.

[Diagram: app writes (checkouts, signups), dashboards (heavy reads), BI exports (nightly, big scans), and ad-hoc SQL (analyst notebook) all hit one Postgres primary: I/O saturated, lock waits climbing, write throughput pinned.]


07 / 07

Read replicas and sharding.

[Diagram: the app pool sends reads to replicas A, B, and C (read-only) and sends writes through a shard router (hash, directory, or range) to sharded primaries: Shard 1 (tenants A–F), Shard 2 (tenants G–M), Shard 3 (tenants N–Z).]

Your database is the ceiling now. Add read replicas to absorb read-heavy traffic like dashboards, search, and reports. When writes get hot, shard by tenant or user ID so each shard owns a slice of the data and no single primary takes the whole load. This is the point where you stop treating the database as a single thing and start treating it as a fleet. Cross-shard joins become harder and transactions get scoped to one shard, so it pays to design the partition key before you actually need it.
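A shard router can be as small as one function. The sketch below combines two of the common strategies: a directory lookup for tenants that have been placed explicitly, and a stable hash for everyone else. The shard names and the `directory` parameter are hypothetical. It uses `hashlib` rather than Python's built-in `hash()`, because the built-in is salted per process and would route the same tenant differently after a restart.

```python
import hashlib
from typing import Optional

# Hypothetical shard identifiers; in practice these would map to connection strings.
SHARDS = ["shard-1", "shard-2", "shard-3"]


def shard_for(
    tenant_id: str,
    shards: list[str] = SHARDS,
    directory: Optional[dict[str, str]] = None,
) -> str:
    """Pick the shard that owns a tenant's data.

    Directory entries win (useful for moving one large tenant by hand);
    otherwise a stable hash of the tenant id selects the shard.
    """
    if directory and tenant_id in directory:
        return directory[tenant_id]
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

Plain modulo hashing remaps most tenants when the shard count changes, which is one reason the partition scheme is worth deciding early. A directory or consistent hashing reduces that movement.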

Open source · self-host

Vanilla Postgres streaming replication for replicas, Citus for sharded Postgres, Vitess for sharded MySQL, PgBouncer for connection pooling.

Enterprise · managed

AWS Aurora, Google Cloud Spanner, CockroachDB Cloud, PlanetScale, Neon (branching plus replicas), Yugabyte, MongoDB Atlas with sharding.

What breaks now →

You traded one ceiling for two new flavors of weirdness. Replicas lag, so a read issued right after a write can return stale data. Anything touching two shards becomes a coordination problem.
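One common mitigation for stale reads is to send a user's reads to the primary for a short window after that user writes, often called read-your-writes. The sketch below is an illustration under stated assumptions: the class name, the `pin_seconds` window, and the injectable `clock` are hypothetical, and the window has to be longer than the replica lag you actually observe.

```python
import time
from typing import Callable


class ReadRouter:
    """Route reads to the primary for a short window after a user's write."""

    def __init__(self, pin_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._pin = pin_seconds
        self._clock = clock  # injectable so the window can be tested without sleeping
        self._last_write: dict[str, float] = {}

    def record_write(self, user_id: str) -> None:
        """Call after a successful write on behalf of this user."""
        self._last_write[user_id] = self._clock()

    def target_for_read(self, user_id: str) -> str:
        """Return 'primary' inside the pin window, otherwise 'replica'."""
        last = self._last_write.get(user_id)
        if last is not None and self._clock() - last < self._pin:
            return "primary"
        return "replica"
```

This keeps most read traffic on replicas while the user who just changed something sees their own change. It does nothing for the cross-shard case, which needs a separate approach such as keeping related rows on one shard.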

[Diagram, two panels. Replica lag, stale read: a write at t=0 sets balance = 100 on the primary; with 200ms of lag, a read at t=50ms from the replica still returns 90. Cross-shard transaction: transferring $50 from A to B is one transaction across two shards (Shard 1 debits User A, Shard 2 credits User B); under two-phase commit, a partial failure leaves the debit applied and the credit not.]