System Evolution · how your stack grows with scale

01 / 07

A single VM

You're at zero users and you want to ship today. Everything runs on a single VM. Web server, app code, database, cron jobs, and static files all live on the same machine. One server, one deploy, one place to look when something breaks. Cheap, easy to reason about, and totally fine for a long time. You'll outgrow it the first time the database eats all the memory and takes the app down with it, or the first time you need to deploy without a five-minute outage.

Open source · self-host

Hetzner or DigitalOcean droplet running Ubuntu, Nginx, your app, Postgres, and pm2 or systemd to keep things alive.

Enterprise · managed

AWS Lightsail, GCP Compute Engine, Azure VM, Render, Railway. Same shape, but someone else handles the drive failure at 3am.

What breaks →

One bad query, one VM. Postgres eats all the memory and gets OOM-killed. The app sitting next to it loses its database and goes down too. Everything shares one box, so one failure is every failure.

02 / 07

Split the database off.

The database is the first thing to escape. Move Postgres onto its own host so it has dedicated memory and disk. Now app crashes don't take the database down with them, and you can scale the two independently. If you go managed, you also get backups, point-in-time restore, and minor version upgrades as a feature instead of as a weekend project. The one rule: don't put the DB host on the public internet.
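The "one rule" above can be checked at startup. This is a minimal sketch, assuming the app connects with a URL-style DSN that carries a literal IP address; the function name `db_host_is_private` and the example DSNs are hypothetical. It only inspects the address, so it complements a firewall or private network rather than replacing one.

```python
import ipaddress
from urllib.parse import urlsplit

# RFC 1918 ranges: address space reserved for private networks.
PRIVATE_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def db_host_is_private(dsn: str) -> bool:
    """Return True if the DSN's host is a literal IP inside a private range.

    Assumption: the host is an IP literal. DNS names raise ValueError here;
    resolving them first is left out of this sketch.
    """
    host = urlsplit(dsn).hostname
    if host is None:
        raise ValueError("DSN has no host")
    ip = ipaddress.ip_address(host)
    return any(ip in network for network in PRIVATE_NETWORKS)


if __name__ == "__main__":
    # Hypothetical DSN; refuse to boot if the database is publicly addressed.
    dsn = "postgresql://app:secret@10.0.0.5:5432/appdb"
    if not db_host_is_private(dsn):
        raise SystemExit("refusing to start: DB host is not on a private network")
```

Failing fast at boot is a design choice: a misconfigured DSN shows up in the deploy, not in an incident.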

Open source · self-host

Postgres or MySQL on a dedicated VM, with pgBackRest or WAL-G for Postgres backups (Percona XtraBackup for MySQL) and a private network between app and DB.

Enterprise · managed

AWS RDS, Google Cloud SQL, Neon, Supabase, PlanetScale, Aiven. Backups, failover, and patching come included.

What breaks →

The system is alive. You can't see what it's doing. Users say "it's slow." You SSH in, run top, and guess. No logs, no metrics, no idea which query.

03 / 07

See what's actually happening.

You have two boxes now and almost no idea what either of them is doing. Add three things at once: structured logs, runtime metrics, and product analytics. Logs tell you what your code did. Metrics tell you how the machine feels. Product analytics tells you what users are actually doing inside the app. Wire alerts to the metrics that hurt the business when they break, and ignore the rest. Pages you snooze are worse than no pages at all.
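As a rough illustration of "structured logs", here is a sketch using only Python's standard `logging` module. The `JsonFormatter` class, the `fields` key passed through `extra=`, and the `checkout` logger name are assumptions made for this example, not part of any particular logging stack.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": record.created,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes record.fields.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(name: str, stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the root logger from printing a second copy
    return logger


# Usage: log to an in-memory buffer so the output can be inspected.
buffer = io.StringIO()
log = make_logger("checkout", buffer)
log.info("order placed", extra={"fields": {"order_id": 42, "duration_ms": 118}})
line = json.loads(buffer.getvalue())
```

One JSON object per line is what lets a log store filter on `order_id` or `duration_ms` instead of grepping free text.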

Open source · logs & metrics

OpenTelemetry SDK, Prometheus, Grafana, Loki, Alertmanager. Self-hosted PostHog for product events.

Enterprise · logs & metrics

Datadog, New Relic, Honeycomb, Sentry, Better Stack. Pick one suite if you can; glue is a tax.

Enterprise · product analytics

PostHog Cloud, Amplitude, Mixpanel, Heap. Send a small set of high-signal events, not everything.

What breaks →

One app box, all the traffic. CPU pegs at 100%, requests queue, timeouts spread. And every deploy is a planned outage.

04 / 07

More than one app server.

One app box can only handle so much traffic, and a single box means downtime every time you deploy. Put a load balancer in front, run two or three app instances behind it, and make the app stateless so any request can land on any box. Now you can roll deploys one instance at a time with no downtime, and a crashed instance just gets traffic pulled until it recovers. The database is still the ceiling, but you bought a lot of headroom on the app tier.
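To make the rolling-deploy idea concrete, here is a toy round-robin balancer. It is a sketch of the behavior, not how HAProxy or any managed load balancer is implemented; the class name, the `app-1`-style backend names, and the `deploy` callback are all hypothetical.

```python
class RoundRobinBalancer:
    """Hand out backends in rotation, skipping any marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call, starting where we left off.
        for _ in range(len(self.backends)):
            backend = self.backends[self._next % len(self.backends)]
            self._next += 1
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")


def rolling_deploy(balancer, deploy):
    """Deploy one instance at a time so the rest keep serving traffic."""
    for backend in balancer.backends:
        balancer.mark_down(backend)  # pull traffic first
        deploy(backend)              # hypothetical callback: ship the new build
        balancer.mark_up(backend)    # return it to the pool
```

This only works because the app is stateless: `pick()` can send a user's next request to a different box without losing anything.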

Open source · self-host

HAProxy, Caddy, Nginx, or Traefik in front of your own app boxes. Keepalived if you want LB redundancy too.

Enterprise · managed

AWS ALB, Cloudflare Load Balancer, Google Cloud Load Balancing. Or a platform where it's built in: Vercel, Railway, Fly, Render.

What breaks →

Every app box, same hot read, every request. Connection pool full. The same query lands a thousand times per second on rows the DB just answered for. Static files round-trip from origin too.

05 / 07

Cache and CDN.

Static assets and frequently read data shouldn't touch your origin. A CDN puts your JS, CSS, and images on edge nodes near every user. A cache layer in front of the database lets you skip the round trip for hot reads, sessions, rate limits, and computed results that don't change often. Both buy you raw speed and let the origin breathe. Cache invalidation is the hard part, so start with TTLs before you start with pub/sub-driven busts.
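The "start with TTLs" advice boils down to a cache-aside read with an expiry. The sketch below keeps everything in process memory to stay self-contained; in practice the store would likely be Redis or similar. `TTLCache`, `get_or_load`, and the injectable `clock` are names invented for this example.

```python
import time


class TTLCache:
    """Cache-aside with time-based expiry: load on miss, reuse until the TTL passes."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}    # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # hit: the database is never touched
        value = loader()     # miss or expired: go to the origin once
        self._store[key] = (now + self.ttl, value)
        return value


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=30)

    def load_profile():
        # Hypothetical stand-in for a real database query.
        return {"id": 1, "name": "Ada"}

    print(cache.get_or_load("user:1", load_profile))  # loads from the "database"
    print(cache.get_or_load("user:1", load_profile))  # served from cache
```

The trade is explicit: readers may see data up to one TTL old, and in exchange nothing has to publish invalidation events.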

Open source · cache

Redis or Valkey in front of the DB, Varnish in front of the app for full-page caching, KeyDB for high-throughput multi-core.

Enterprise · CDN

Cloudflare, Fastly, AWS CloudFront, Bunny CDN. Vercel and Netlify ship a CDN by default for the static layer.

Enterprise · cache

Upstash Redis, AWS ElastiCache, Redis Cloud, Momento. Pick the one that lives in the same region as your apps.

What breaks →

A slow upstream becomes your slow upstream. Stripe takes 8 seconds, your checkout endpoint takes 8 seconds. Every web worker is parked on a network call. The site is "down" without anything actually crashing.

06 / 07

Get slow work off the request path.

Sending email, generating PDFs, calling third-party APIs, processing uploads, building search indexes. None of that belongs in the request the user is waiting on. Push those jobs onto a queue and let a separate worker pool grind through them. The web tier stays fast even when an upstream provider has a bad day, and you get retries, dead-letter queues, and rate limiting for free. This is also the point where a slow webhook stops being able to take your whole site down.
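Here is a small in-process sketch of the retry and dead-letter behavior described above, using Python's standard `queue` module in place of a real broker. `Job`, `drain`, and `max_attempts` are hypothetical names; a production worker would also add backoff between attempts and run in a separate process from the web tier.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    payload: dict = field(default_factory=dict)
    attempts: int = 0


def drain(jobs, handler, dead_letters, max_attempts=3):
    """Run jobs until the queue is empty, retrying failures a bounded number of times."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        try:
            handler(job)
        except Exception:
            job.attempts += 1
            if job.attempts >= max_attempts:
                dead_letters.append(job)  # park it for a human to inspect
            else:
                jobs.put(job)             # back of the line for another try


if __name__ == "__main__":
    jobs = queue.Queue()
    dead = []
    jobs.put(Job("send_welcome_email", {"user_id": 1}))

    def handler(job):
        # Hypothetical handler: a real one would call the email provider here.
        print("handled", job.name)

    drain(jobs, handler, dead)
```

The web request only has to enqueue the job and return; whether the upstream takes 80 ms or 8 seconds is now the worker's problem.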

Open source · self-host

BullMQ on Redis, Sidekiq for Ruby, Celery for Python, RabbitMQ, NATS JetStream, Redis Streams. Pick what your language has the best client for.

Enterprise · managed

AWS SQS plus Lambda, Google Pub/Sub, Vercel Queues, Inngest, Trigger.dev, Temporal Cloud. Durable workflows for anything multi-step.

What breaks →

Every workload meets at one primary. OLTP writes, dashboards, BI exports, the analyst's ad-hoc query. All on one box. Lock waits and write throughput are the new ceiling.

07 / 07

Read replicas and sharding.

Your database is the ceiling now. Add read replicas to absorb read-heavy traffic like dashboards, search, and reports. When writes get hot, shard by tenant or user id so each shard owns a slice of the data and no single primary takes the whole load. This is the point where you stop treating the database as a single thing and start treating it as a fleet. Cross-shard joins become harder and transactions get scoped to one shard, so most teams design the partition key before they actually need it.
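Sharding by tenant comes down to a function from partition key to shard. A minimal sketch, assuming four shards and string tenant ids; `shard_for` and the `SHARD_DSNS` hostnames are made up for illustration. It uses `hashlib` rather than the built-in `hash()`, because Python randomizes string hashes per process and every app box has to agree on the answer.

```python
import hashlib

# Hypothetical connection strings, one per shard.
SHARD_DSNS = [
    "postgresql://db-shard-0.internal/app",
    "postgresql://db-shard-1.internal/app",
    "postgresql://db-shard-2.internal/app",
    "postgresql://db-shard-3.internal/app",
]


def shard_for(tenant_id: str, shard_count: int) -> int:
    """Map a tenant id to a shard index, stable across processes and restarts."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def dsn_for(tenant_id: str) -> str:
    """Pick the connection string for the shard that owns this tenant."""
    return SHARD_DSNS[shard_for(tenant_id, len(SHARD_DSNS))]


if __name__ == "__main__":
    print(dsn_for("tenant-42"))
```

Plain modulo hashing moves most tenants when the shard count changes, which is one reason to settle the partition key early and to consider a lookup table or consistent hashing before resharding.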

Open source · self-host

Vanilla Postgres streaming replication for replicas, Citus for sharded Postgres, Vitess for sharded MySQL, PgBouncer for connection pooling.

Enterprise · managed

AWS Aurora, Google Cloud Spanner, CockroachDB Cloud, PlanetScale, Neon (branching plus replicas), Yugabyte, MongoDB Atlas with sharding.

What breaks now →

You traded one ceiling for two new flavors of weirdness. Replicas lag, so a read issued right after a write can return stale data. Anything touching two shards becomes a coordination problem.
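One common way to soften replica lag is to pin a user's reads to the primary for a short window after they write. This sketch shows only the routing decision; `ReplicaAwareRouter`, the five-second `pin_seconds` default, and the `"primary"`/`"replica"` labels are assumptions for illustration. The window should be longer than the lag you actually observe.

```python
import time


class ReplicaAwareRouter:
    """Route reads to a replica unless the user wrote recently (read-your-writes)."""

    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock        # injectable so the window can be tested
        self._last_write = {}     # user_id -> time of most recent write

    def record_write(self, user_id):
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id):
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"      # the replica may not have this write yet
        return "replica"


if __name__ == "__main__":
    router = ReplicaAwareRouter()
    router.record_write("user-1")
    print(router.target_for_read("user-1"))  # recent writer
    print(router.target_for_read("user-2"))  # never wrote
```

Across multiple app boxes the last-write timestamp would have to live somewhere shared, such as a cookie or the cache layer, rather than in one process's memory.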
