System Design Diaries #1: Building the Foundations
Hey Readers.
Spotted: millions of users posting tweets, streaming videos, sending messages, and somehow the internet doesn't completely fall apart.
What's their secret?
No, it's not luck. It's system design.
Today we're pulling back the curtain on the fundamentals every engineer should know before designing systems that can survive real-world traffic.
What Are Interviewers Actually Looking For?
When you're in a system design interview, nobody expects you to reinvent Google.
They're evaluating how you think.
Can you gather requirements? Analyze tradeoffs? Design APIs? Build reliable and scalable systems? Communicate your ideas clearly?
The best candidates don't jump straight into architecture diagrams. They start by asking questions.
Functional Requirements
What should the system do?
For a social media platform:
Users can post tweets.
Users can follow or unfollow others.
Users can view timelines.
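One way to pressure-test a requirements list is to turn it into the smallest API that could satisfy it. Here's a rough in-memory sketch of the three requirements above. The class name `TinySocial` and every method name are invented for illustration, and a real platform would sit behind a network API and a database.

```python
from collections import defaultdict


class TinySocial:
    """Toy in-memory model of the three functional requirements (hypothetical names)."""

    def __init__(self):
        self.posts = []                    # list of (sequence_no, author, text)
        self.following = defaultdict(set)  # user -> set of accounts they follow

    def post(self, author, text):
        # Requirement 1: users can post.
        self.posts.append((len(self.posts), author, text))

    def follow(self, user, target):
        # Requirement 2a: users can follow others.
        self.following[user].add(target)

    def unfollow(self, user, target):
        # Requirement 2b: users can unfollow others.
        self.following[user].discard(target)

    def timeline(self, user, limit=10):
        # Requirement 3: newest posts first, only from followed accounts.
        followed = self.following[user]
        visible = [p for p in reversed(self.posts) if p[1] in followed]
        return [(author, text) for _, author, text in visible[:limit]]


app = TinySocial()
app.follow("alice", "bob")
app.post("bob", "hello world")
app.post("carol", "not followed by alice")
print(app.timeline("alice"))  # [('bob', 'hello world')]
```

If a requirement can't be mapped to a method like this, it probably needs another clarifying question.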
Non-Functional Requirements
How should the system behave?
Examples:
99.99% uptime (high availability)
Support for 100 million users
Response times under 200 ms
Before designing the solution, understand the problem.
A shocking number of engineers skip this step.
The Hierarchy of Speed
Not all storage is created equal.
Some memories are VIP guests. Others are waiting outside the club.
| Component | Relative Speed | Typical Size | Persistent? |
|---|---|---|---|
| CPU Registers | Fastest | Tiny | No |
| L1 Cache | Very fast | KBs | No |
| L2 Cache | Fast | Hundreds of KBs to MBs | No |
| L3 Cache | Moderately fast | Tens of MBs | No |
| RAM | Slower | GBs | No |
| SSD | Slow | GBs to TBs | Yes |
| HDD | Slowest | TBs | Yes |
L1 Cache
The cache closest to the CPU core (only the registers sit closer).
Located inside each CPU core
Extremely fast
Stores frequently used instructions and data
L2 Cache
A little larger, a little slower.
Usually dedicated to an individual CPU core
Holds data that does not fit in L1
Reduces trips to slower memory
L3 Cache
The shared gossip hub.
Shared among CPU cores
Helps cores access common data
Faster than RAM but slower than L1 and L2
RAM
The system's working memory.
Stores currently running programs
Much larger than cache
Volatile memory (data disappears when power is lost)
Cache Hits vs Cache Misses
A cache hit means the CPU finds the data in the cache level it checks.
A cache miss means the request falls through to the next, slower level of the hierarchy.
And just like searching for a missing group project member five minutes before submission, that takes time.
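The same hit-or-miss bookkeeping shows up in software caches. Below is a minimal sketch, assuming a least-recently-used (LRU) eviction policy. That policy is my choice for the example: real CPU caches are managed in hardware and use their own replacement schemes. `LRUCache` and `load_from_slow_tier` are hypothetical names.

```python
from collections import OrderedDict


class LRUCache:
    """Small cache that counts hits and misses (illustrative, not a CPU model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()  # insertion order doubles as recency order
        self.hits = 0
        self.misses = 0

    def get(self, key, load_from_slow_tier):
        if key in self.data:
            self.hits += 1
            self.data.move_to_end(key)  # mark as most recently used
            return self.data[key]
        # Miss: pay the cost of the slower tier, then remember the result.
        self.misses += 1
        value = load_from_slow_tier(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry
        return value


cache = LRUCache(capacity=2)
for key in ["a", "b", "a", "c", "b"]:
    cache.get(key, load_from_slow_tier=str.upper)  # str.upper stands in for a slow lookup

print(cache.hits, cache.misses)  # 1 4
```

Only the second lookup of "a" is a hit. By the time "b" is requested again it has been evicted to make room for "c", so it counts as a miss.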
Speed Order
Registers → L1 Cache → L2 Cache → L3 Cache → RAM → SSD → HDD
This is why caching is such a big deal in large-scale systems.
Frequently accessed data stays closer to the processor, reducing latency and improving performance.
What Does a Real Production System Look Like?
A simplified request flow:
User → DNS → Load Balancer → Application Servers → Cache → Database
Each component has a role:
DNS translates the domain name into an IP address.
Load Balancer distributes traffic.
Application Servers handle business logic.
Cache serves frequently requested data quickly.
Database stores persistent information.
Simple on paper.
Wildly complicated at scale.
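To make the flow concrete, here's a toy version of everything after DNS: a round-robin load balancer, two application servers, a shared cache, and a database. Every name and record in it is made up, the dictionaries stand in for real cache and database services, and round-robin is only one of several balancing strategies.

```python
from itertools import cycle

DATABASE = {"user:1": "Serena", "user:2": "Blair"}  # stand-in for persistent storage
CACHE = {}                                          # stand-in for a shared cache


class AppServer:
    def __init__(self, name):
        self.name = name

    def handle(self, key):
        # Check the cache first; fall back to the database on a miss.
        if key in CACHE:
            return self.name, "cache", CACHE[key]
        value = DATABASE.get(key)
        if value is not None:
            CACHE[key] = value  # populate the cache for the next request
        return self.name, "database", value


# Round-robin: each request goes to the next server in the rotation.
servers = cycle([AppServer("app-1"), AppServer("app-2")])


def load_balancer(key):
    return next(servers).handle(key)


first = load_balancer("user:1")
second = load_balancer("user:1")
print(first)   # ('app-1', 'database', 'Serena')
print(second)  # ('app-2', 'cache', 'Serena')
```

The second request lands on a different server but is still served from the cache, because the cache is shared by both servers.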
The Unsung Heroes of Production
Building software is one thing.
Keeping it alive is another.
CI/CD
Continuous Integration and Continuous Deployment automate testing and releases.
Popular tools:
Jenkins
GitHub Actions
GitLab CI
Because manually deploying code at 2 AM is not a personality trait.
Logging
Logs record system events.
Examples:
User logged in
Payment failed
Server crashed
Popular tools:
ELK Stack
Datadog
Splunk
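Before logs ever reach one of those tools, the application has to emit them. Here's a small sketch using Python's standard `logging` module to record the three example events. The logger name `checkout` and the field values are invented, and the in-memory stream is only there so the output is easy to inspect. A real service would typically write to stderr or a file that a log shipper collects.

```python
import io
import logging

stream = io.StringIO()  # in-memory destination, used here only for the demo
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))

logger = logging.getLogger("checkout")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

# The severity level tells whoever reads the logs how worried to be.
logger.info("user logged in user_id=%s", 42)
logger.error("payment failed order_id=%s reason=%s", "A17", "card_declined")
logger.critical("server crashed host=%s", "app-1")

log_lines = stream.getvalue().splitlines()
print("\n".join(log_lines))
```

Putting identifiers in `key=value` form makes the lines much easier to search once they reach a log platform.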
Monitoring
Monitoring tracks system health.
Metrics include:
CPU usage
Memory usage
Error rate
Request latency
Popular tools:
Prometheus
Grafana
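Two of those metrics, error rate and request latency, can be computed from a plain list of request samples. The sketch below does that with made-up numbers. The nearest-rank percentile is one common definition, and monitoring tools may estimate percentiles differently.

```python
import math

# (latency_ms, http_status) pairs. Made-up samples for illustration only.
requests = [
    (120, 200), (95, 200), (210, 200), (180, 500), (130, 200),
    (90, 200), (400, 503), (110, 200), (150, 200), (105, 200),
]


def error_rate(samples):
    # Fraction of requests that ended in a server-side (5xx) error.
    errors = sum(1 for _, status in samples if status >= 500)
    return errors / len(samples)


def percentile(values, pct):
    # Nearest-rank percentile: smallest value with at least pct% of samples at or below it.
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


latencies = [latency for latency, _ in requests]
print(error_rate(requests))       # 0.2
print(percentile(latencies, 50))  # 120
print(percentile(latencies, 95))  # 400
```

The median looks healthy at 120 ms, while the 95th percentile exposes the one painfully slow request. That gap is why latency is usually tracked at high percentiles and not only as an average.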
Alerting
When something goes wrong, somebody needs to know.
Tools like Slack and PagerDuty notify engineers when critical thresholds are crossed.
Because servers rarely break during business hours.
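At its core, an alert rule is a comparison between a metric and a threshold. Here's a minimal sketch of that check. The threshold values and metric names are assumptions picked for the example, and `send` is a placeholder for whatever actually delivers the page or chat message.

```python
# Hypothetical thresholds. Real values depend on the service's own targets.
THRESHOLDS = {"error_rate": 0.05, "p95_latency_ms": 200, "cpu_usage": 0.90}


def evaluate(metrics, thresholds=THRESHOLDS):
    """Return one alert message per metric that exceeds its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"ALERT {name}={value} exceeds {limit}")
    return alerts


def notify(alerts, send=print):
    # `send` stands in for a real notification channel.
    for message in alerts:
        send(message)


current = {"error_rate": 0.2, "p95_latency_ms": 180, "cpu_usage": 0.95}
alerts = evaluate(current)
notify(alerts)
```

With these sample values, the error rate and CPU usage trigger alerts while latency stays under its limit.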
The Core Pillars of System Design
Scalability
Can the system handle growth?
If 1,000 users become 1,000,000 users tomorrow, does the system survive?
A scalable system grows without requiring a complete redesign.
Reliability
Can users trust the system?
Reliable systems continue working correctly even when individual components fail.
Availability
Can users access the service when they need it?
Availability is usually measured as:
Availability = Uptime / Total Time
Typical targets:
| Availability | Downtime per Year |
|---|---|
| 99% | 3.65 days |
| 99.9% | 8.76 hours |
| 99.99% | ~52.6 minutes |
| 99.999% | ~5.3 minutes |
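The downtime figures follow directly from the availability formula. Here's a quick sketch of the arithmetic, assuming a 365-day year. The function names are mine.

```python
def availability(uptime, total_time):
    # Availability = Uptime / Total Time
    return uptime / total_time


def downtime_minutes_per_year(availability_percent, days_per_year=365):
    # Whatever fraction of the year is not "up" is the allowed downtime.
    minutes_per_year = days_per_year * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_percent / 100)


for target in (99, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.2f} minutes ({minutes / 60:.2f} hours)")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is a big part of why the last few nines cost so much.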
Why not 100%?
Because perfection is expensive.
Sometimes impossibly expensive.
Maintainability
Future engineers should be able to understand and modify the system.
If nobody understands your architecture six months later, you've created a puzzle, not a product.
The CAP Theorem Drama
Every distributed system eventually faces a difficult choice.
CAP Theorem states that a distributed system can guarantee at most two of the following three properties at the same time:
Consistency (C)
Every read returns the most recent write, so all nodes appear to hold the same data.
Availability (A)
Every request to a working node receives a non-error response.
Partition Tolerance (P)
The system continues operating even when messages between nodes are dropped or delayed.
Since network failures are unavoidable, Partition Tolerance is non-negotiable.
That leaves a choice when a partition happens: consistency or availability.
CP Systems
Consistency + Partition Tolerance
Availability may suffer.
Examples:
Banking systems
Payment systems
Would you rather see an error message or lose money?
Exactly.
AP Systems
Availability + Partition Tolerance
Strong consistency is sacrificed.
Examples:
Social media platforms
Messaging systems
You might briefly see outdated data, but the system remains available.
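The difference is easiest to see in a toy model. The sketch below simulates two replicas and a network partition. In CP mode a write is refused when the replicas cannot talk to each other, and in AP mode it is accepted locally and the replicas drift apart. This is a deliberately simplified illustration with invented names, not how any particular database implements either mode.

```python
class Replica:
    def __init__(self):
        self.value = None


class TwoNodeStore:
    """Two replicas that either refuse writes (CP) or diverge (AP) during a partition."""

    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.nodes = [Replica(), Replica()]
        self.partitioned = False

    def write(self, node_index, value):
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent by giving up availability.
                raise RuntimeError("unavailable: cannot reach the other replica")
            # AP: stay available, accept the write on the reachable replica only.
            self.nodes[node_index].value = value
            return
        for node in self.nodes:  # healthy network: replicate everywhere
            node.value = value

    def read(self, node_index):
        return self.nodes[node_index].value


def demo(mode):
    store = TwoNodeStore(mode)
    store.write(0, "balance=100")
    store.partitioned = True  # the network splits
    try:
        store.write(0, "balance=50")
        outcome = "accepted"
    except RuntimeError:
        outcome = "rejected"
    return outcome, store.read(0), store.read(1)


print(demo("CP"))  # ('rejected', 'balance=100', 'balance=100')
print(demo("AP"))  # ('accepted', 'balance=50', 'balance=100')
```

The CP store shows an error but both replicas still agree. The AP store keeps answering, and the second replica serves outdated data until the partition heals.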
Final Thoughts
System design isn't about memorizing diagrams.
It's about understanding tradeoffs.
Every decision improves one thing while sacrificing another.
More consistency may reduce availability.
More throughput may increase latency, for example when requests are batched.
More reliability may increase complexity.
The best engineers aren't the ones with the fanciest architecture.
They're the ones who understand what tradeoffs they're making and why.
And just like every good secret in the Coding World, every large-scale system is hiding a thousand design decisions beneath the surface.
XOXO.
