System Design Diaries #1: Building the Foundations

🌷 software engineer • ✨ open source contributor • 🩷 builder of developer tools interested in api infrastructure, developer tooling, and scalable systems (along with binge watching Suits)⚙️

Hey Readers.

Spotted: millions of users posting tweets, streaming videos, sending messages, and somehow the internet doesn't completely fall apart.

What's their secret?

No, it's not luck. It's system design.

Today we're pulling back the curtain on the fundamentals every engineer should know before designing systems that can survive real-world traffic.

What Are Interviewers Actually Looking For?

When you're in a system design interview, nobody expects you to reinvent Google.

They're evaluating how you think.

Can you gather requirements? Analyze tradeoffs? Design APIs? Build reliable and scalable systems? Communicate your ideas clearly?

The best candidates don't jump straight into architecture diagrams. They start by asking questions.

Functional Requirements

What should the system do?

For a social media platform:

  • Users can post tweets.

  • Users can follow or unfollow others.

  • Users can view timelines.

Non-Functional Requirements

How should the system behave?

Examples:

  • 99.99% uptime

  • Support for 100 million users

  • Response times under 200 ms

  • High availability
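A number like "100 million users" only becomes useful once you turn it into load. Here's a rough back-of-the-envelope sketch of that conversion. Only the 100 million figure comes from the list above; the active-user fraction, requests per user, and peak multiplier are made-up assumptions you would confirm with your interviewer.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_requests_per_second(total_users, daily_active_fraction, requests_per_user_per_day):
    """Rough average load implied by a user count (every input is an estimate)."""
    daily_requests = total_users * daily_active_fraction * requests_per_user_per_day
    return daily_requests / SECONDS_PER_DAY


avg_rps = average_requests_per_second(
    total_users=100_000_000,        # from the requirements list
    daily_active_fraction=0.2,      # assumption: 20% of users show up each day
    requests_per_user_per_day=50,   # assumption: 50 requests per active user
)
peak_rps = avg_rps * 3              # assumption: peak traffic is about 3x the average

print(f"average: {avg_rps:,.0f} req/s, peak: {peak_rps:,.0f} req/s")
```

Change any of the assumed inputs and the answer moves a lot, which is exactly why you ask the questions first.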

Before designing the solution, understand the problem.

A shocking number of engineers skip this step.

The Hierarchy of Speed

Not all storage is created equal.

Some memories are VIP guests. Others are waiting outside the club.

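| Component     | Speed        | Size         | Persistent? |
|---------------|--------------|--------------|-------------|
| CPU Registers | Fastest      | Tiny (bytes) | No          |
| L1 Cache      | Very fast    | KBs          | No          |
| L2 Cache      | Fast         | KBs to MBs   | No          |
| L3 Cache      | Slower       | Larger MBs   | No          |
| RAM           | Slower still | GBs          | No          |
| SSD           | Slow         | TBs          | Yes         |
| HDD           | Slowest      | TBs          | Yes         |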

L1 Cache

The closest memory to the CPU.

  • Located inside each CPU core

  • Extremely fast

  • Stores frequently used instructions and data

L2 Cache

A little larger, a little slower.

  • Usually dedicated to an individual CPU core

  • Checked when data is not found in L1

  • Reduces trips to slower memory

L3 Cache

The shared gossip hub.

  • Shared among CPU cores

  • Helps cores access common data

  • Faster than RAM but slower than L1 and L2

RAM

The system's working memory.

  • Stores currently running programs

  • Much larger than cache

  • Volatile memory (data disappears when power is lost)

Cache Hits vs Cache Misses

A cache hit means the CPU finds the data in the cache level it checks.

A cache miss means it must travel down the hierarchy looking for it.

And just like searching for a missing group project member five minutes before submission, that takes time.
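To make the cost of a miss concrete, here's a toy model of walking down the hierarchy. The latency numbers are illustrative orders of magnitude that I'm assuming for the sketch, not measurements, and real CPUs are far more sophisticated than a loop over a list.

```python
# Illustrative latencies in nanoseconds (assumed, rough orders of magnitude only).
LEVELS = [
    ("L1", 1),
    ("L2", 4),
    ("L3", 40),
    ("RAM", 100),
]


def lookup(address, contents):
    """Walk down the hierarchy until the address is found.

    `contents` maps each level name to the set of addresses it currently holds.
    Returns (level_where_found, total_latency_ns).
    """
    total_ns = 0
    for name, latency_ns in LEVELS:
        total_ns += latency_ns          # pay the cost of checking this level
        if address in contents[name]:
            return name, total_ns       # hit at this level
    raise KeyError(f"address {address!r} is not in memory")


contents = {
    "L1": {0x10},
    "L2": {0x10, 0x20},
    "L3": {0x10, 0x20, 0x30},
    "RAM": {0x10, 0x20, 0x30, 0x40},
}

for addr in (0x10, 0x30, 0x40):
    level, cost = lookup(addr, contents)
    print(f"{addr:#x}: found in {level} after {cost} ns")
```

In this model an L1 hit costs 1 ns, while data that only lives in RAM costs 145 ns, because every miss along the way adds its own delay.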

Speed Order

Registers → L1 Cache → L2 Cache → L3 Cache → RAM → SSD → HDD

This is why caching is such a big deal in large-scale systems.

Frequently accessed data stays closer to the processor, reducing latency and improving performance.

What Does a Real Production System Look Like?

A simplified request flow:

User → DNS → Load Balancer → Application Servers → Cache → Database

Each component has a role:

  • DNS resolves the domain name to the server's IP address.

  • Load Balancer distributes traffic.

  • Application Servers handle business logic.

  • Cache serves frequently requested data quickly.

  • Database stores persistent information.
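Here's a minimal sketch of the back half of that flow: a round-robin load balancer handing requests to application servers, which check a shared cache before falling back to the database. The class names (`AppServer`, `RoundRobinLoadBalancer`) and the sample data are hypothetical, and plain dictionaries stand in for a real cache and database.

```python
from itertools import cycle


class AppServer:
    def __init__(self, name, cache, database):
        self.name = name
        self.cache = cache          # shared dict standing in for a real cache
        self.database = database    # dict standing in for the persistent database

    def get_profile(self, user_id):
        """Cache-aside read: try the cache first, fall back to the database."""
        if user_id in self.cache:
            return self.name, "cache hit", self.cache[user_id]
        value = self.database[user_id]   # slower, authoritative lookup
        self.cache[user_id] = value      # populate the cache for the next request
        return self.name, "cache miss", value


class RoundRobinLoadBalancer:
    def __init__(self, servers):
        self._servers = cycle(servers)   # hand out servers in a repeating order

    def route(self, user_id):
        server = next(self._servers)
        return server.get_profile(user_id)


database = {"u1": "Serena", "u2": "Blair"}
cache = {}
servers = [AppServer(f"app-{i}", cache, database) for i in (1, 2)]
load_balancer = RoundRobinLoadBalancer(servers)

results = [load_balancer.route("u1") for _ in range(3)]
for result in results:
    print(result)
```

The first request misses the cache and reads from the database. The next two land on alternating servers and are served straight from the cache.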

Simple on paper.

Wildly complicated at scale.

The Unsung Heroes of Production

Building software is one thing.

Keeping it alive is another.

CI/CD

Continuous Integration automatically builds and tests every change. Continuous Deployment automatically releases the changes that pass.

Popular tools:

  • Jenkins

  • GitHub Actions

  • GitLab CI

Because manually deploying code at 2 AM is not a personality trait.

Logging

Logs record system events.

Examples:

  • User logged in

  • Payment failed

  • Server crashed

Popular tools:

  • ELK Stack

  • Datadog

  • Splunk
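Before any of those tools can help, the application has to emit logs in the first place. Here's a small sketch using Python's standard `logging` module to record the three example events. The logger name `shop` and the field names (`user_id`, `order_id`, `host`) are made up for illustration.

```python
import logging

# Timestamp, severity, and logger name on every line make logs searchable later.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("shop")   # hypothetical service name
logger.setLevel(logging.INFO)

# Pick the severity that matches how urgently a human should care.
logger.info("user logged in user_id=%s", 42)
logger.warning("payment failed order_id=%s reason=%s", "A-1001", "card_declined")
logger.critical("server crashed host=%s", "app-1")
```

Putting values in `key=value` pairs is one common convention. It makes the lines easier to filter once they reach a log aggregation tool.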

Monitoring

Monitoring tracks system health.

Metrics include:

  • CPU usage

  • Memory usage

  • Error rate

  • Request latency

Popular tools:

  • Prometheus

  • Grafana
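Two of those metrics, error rate and request latency, are easy to compute by hand, which helps demystify what the dashboards are showing. This is a sketch over invented sample data. The nearest-rank percentile used here is one of several common definitions, so a real monitoring tool may report slightly different numbers.

```python
import math


def error_rate(status_codes):
    """Fraction of requests that ended in a server error (HTTP 5xx)."""
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented samples for ten requests.
status_codes = [200, 200, 500, 200, 404, 200, 503, 200, 200, 200]
latencies_ms = [120, 95, 480, 110, 130, 105, 900, 98, 102, 115]

print(f"error rate: {error_rate(status_codes):.0%}")
print(f"p50 latency: {percentile(latencies_ms, 50)} ms")
print(f"p95 latency: {percentile(latencies_ms, 95)} ms")
```

Notice how the median looks healthy while the p95 is dominated by a single slow request. That gap is why latency is usually tracked as percentiles rather than an average.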

Alerting

When something goes wrong, somebody needs to know.

Tools like Slack and PagerDuty notify engineers when critical thresholds are crossed.

Because servers rarely break during business hours.
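At its core, alerting is a comparison between a metric and a threshold. Here's a minimal sketch of that idea. The metric names and threshold values are invented, and `send` is a placeholder for whatever a real Slack or PagerDuty integration would do, not an actual API call.

```python
def check_thresholds(metrics, thresholds):
    """Return one alert message for every metric above its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name} is {value}, above threshold {limit}")
    return alerts


def notify(alerts, send=print):
    """Deliver alerts; `send` stands in for a real notification integration."""
    for message in alerts:
        send(f"[ALERT] {message}")


# Hypothetical thresholds and a hypothetical snapshot of current metrics.
thresholds = {"cpu_percent": 90, "error_rate": 0.05, "p95_latency_ms": 200}
metrics = {"cpu_percent": 72, "error_rate": 0.2, "p95_latency_ms": 900}

alerts = check_thresholds(metrics, thresholds)
notify(alerts)
```

With these sample values, CPU usage stays quiet while the error rate and latency both page someone. Real alerting systems usually also require the condition to hold for some time before firing, so a single spike doesn't wake anyone up.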

The Core Pillars of System Design

Scalability

Can the system handle growth?

If 1,000 users become 1,000,000 users tomorrow, does the system survive?

A scalable system grows without requiring a complete redesign.

Reliability

Can users trust the system?

Reliable systems continue working correctly even when things fail.

Availability

Can users access the service when they need it?

Availability is usually measured as:

Availability = Uptime / Total Time

Typical targets:

| Availability | Downtime per Year |
|--------------|-------------------|
| 99%          | 3.65 days         |
| 99.9%        | 8.76 hours        |
| 99.99%       | ~52.6 minutes     |
| 99.999%      | ~5.3 minutes      |
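Those downtime figures fall straight out of the availability formula. Here's a short sketch that derives them, assuming a 365-day year. The function name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Allowed downtime per year for a given availability target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for target in (99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Each extra nine cuts the downtime budget by a factor of ten, which is a big part of why each one costs so much more than the last.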

Why not 100%?

Because perfection is expensive.

Sometimes impossibly expensive.

Maintainability

Future engineers should be able to understand and modify the system.

If nobody understands your architecture six months later, you've created a puzzle, not a product.

The CAP Theorem Drama

Every distributed system eventually faces a difficult choice.

The CAP theorem states that a distributed system can guarantee at most two of the following three properties at the same time:

Consistency (C)

Every read returns the most recent write, so every node appears to see the same data.

Availability (A)

Every request receives a non-error response, even if the data is not the latest.

Partition Tolerance (P)

The system continues operating even when network failures stop nodes from talking to each other.

Since network failures are unavoidable, Partition Tolerance is non-negotiable.

That leaves a choice when a partition happens: consistency or availability.

CP Systems

Consistency + Partition Tolerance

Availability may suffer.

Examples:

  • Banking systems

  • Payment systems

Would you rather see an error message or lose money?

Exactly.

AP Systems

Availability + Partition Tolerance

Strong consistency is sacrificed.

Examples:

  • Social media platforms

  • Messaging systems

You might briefly see outdated data, but the system remains available.
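The difference between the two is easiest to see in a toy model. Below is a deliberately simplified sketch with two replicas and a fake network partition. `Replica`, `Unavailable`, `make_pair`, and `partition` are hypothetical names invented for this example, and no real database works this simply.

```python
class Unavailable(Exception):
    """Raised when a CP replica refuses to answer during a partition."""


class Replica:
    def __init__(self, mode):
        self.mode = mode            # "CP" or "AP"
        self.value = None
        self.peer = None
        self.partitioned = False    # True when this replica cannot reach its peer

    def write(self, value):
        if self.partitioned:
            if self.mode == "CP":
                raise Unavailable("cannot reach peer, refusing write")
            self.value = value      # AP: accept locally, replicas now disagree
            return
        self.value = value
        self.peer.value = value     # healthy network: replicate to the peer

    def read(self):
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot confirm latest value, refusing read")
        return self.value           # AP may return stale data


def make_pair(mode):
    a, b = Replica(mode), Replica(mode)
    a.peer, b.peer = b, a
    return a, b


def partition(a, b):
    a.partitioned = b.partitioned = True


for mode in ("CP", "AP"):
    a, b = make_pair(mode)
    a.write("balance=100")
    partition(a, b)
    try:
        a.write("balance=50")
        print(mode, "write accepted; the other replica still reads", b.read())
    except Unavailable as err:
        print(mode, "error:", err)
```

The CP pair answers with an error once the network splits. The AP pair keeps accepting writes, and the other replica keeps serving the old value until the partition heals.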

Final Thoughts

System design isn't about memorizing diagrams.

It's about understanding tradeoffs.

Every decision improves one thing while sacrificing another.

More consistency may reduce availability.

More throughput may increase latency.

More reliability may increase complexity.

The best engineers aren't the ones with the fanciest architecture.

They're the ones who understand what tradeoffs they're making and why.

And just like every good secret in the Coding World, every large-scale system is hiding a thousand design decisions beneath the surface.

XOXO.