Good System Design Starts With Understanding the Problem, Not the Solution
by Arif Ikhsanudin, Backend Developer
You Are Already Thinking About Microservices
The feature request lands in your backlog: "Build a notification system." Within an hour, someone has proposed Kafka, a fanout service, per-channel workers, and a preference engine. The whiteboard is full. Everyone is excited. Nobody has asked what a notification actually is in this product, who receives it, how fast it needs to arrive, or what happens if it is late.
This is the default failure mode of experienced engineers. Pattern recognition kicks in before problem understanding does. You have built notification systems before. You know what the solution looks like. So you skip the part where you figure out whether this is actually the same problem.
It usually is not.
The Questions That Change the Design
The notification system problem has at least four meaningfully different shapes:
- Transactional alerts (password reset, payment confirmation) — delivery within seconds, high reliability, low volume
- Marketing campaigns — bulk delivery, timing flexibility, opt-out compliance required, high volume
- Real-time collaboration events (someone edited your document) — sub-second delivery, best-effort acceptable, very high volume
- Regulatory notifications (account suspended, legal hold) — guaranteed delivery, audit trail required, low volume
Each of these shapes calls for a different architecture. Transactional alerts are well served by a transactional email provider such as SendGrid, called synchronously over HTTP from your application. Real-time collaboration events might use WebSockets with a Redis pub/sub backbone. Regulatory notifications need durable storage with delivery confirmation and immutable audit logs.
If you jump to "Kafka fanout service" before asking which shape you have, you will probably build something that handles marketing campaigns well and handles transactional alerts and regulatory notifications poorly, and you will not find out until you are debugging a missing password reset email in production.
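One way to make the "which shape is this?" question concrete is to write the answers down as data and classify them. The sketch below is illustrative only: the `Answers` fields, the one-second threshold for "sub-second", and the precedence order in `classify` are assumptions, not something the product requirements dictate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answers:
    """Answers to the questions the whiteboard session skipped (hypothetical fields)."""
    needs_audit_trail: bool      # must delivery be provable later?
    is_bulk_promotional: bool    # bulk send where opt-out compliance applies?
    latency_budget_s: float      # how late is too late, in seconds


# Starting points named in the text above. The text names none for marketing
# campaigns, so that entry is deliberately left open.
STARTING_POINT = {
    "regulatory": "durable storage, delivery confirmation, immutable audit log",
    "marketing": "not specified in the text",
    "collaboration": "WebSockets with a Redis pub/sub backbone",
    "transactional": "transactional email provider called synchronously over HTTP",
}


def classify(a: Answers) -> str:
    """Map answers to one of the four shapes.

    The ordering is an assumption: the strictest obligation (an audit trail)
    wins when several conditions apply.
    """
    if a.needs_audit_trail:
        return "regulatory"
    if a.is_bulk_promotional:
        return "marketing"
    if a.latency_budget_s < 1.0:  # "sub-second" in the list above
        return "collaboration"
    return "transactional"


if __name__ == "__main__":
    password_reset = Answers(
        needs_audit_trail=False,
        is_bulk_promotional=False,
        latency_budget_s=5.0,
    )
    shape = classify(password_reset)
    # prints: transactional -> transactional email provider called synchronously over HTTP
    print(shape, "->", STARTING_POINT[shape])
```

The value is not in the function itself but in being forced to fill in the three fields before anyone draws a box on the whiteboard.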
How to Actually Understand the Problem
Three questions that force clarity before any architecture is proposed:
What does failure look like, and how bad is it? A notification that arrives 30 seconds late during a flash sale is a business problem. A notification that arrives 30 seconds late for a two-factor authentication code is a user experience problem. A notification that never arrives confirming a wire transfer is a compliance problem. These failures differ in severity, and the design should not treat them identically.
What is the expected volume and growth curve? Not "we want to scale to millions of users" — that is a goal, not a constraint. What is the actual current volume, what is the projected volume in 6 months, and is there a predictable spike pattern (end-of-month billing, market open/close)?
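Turning a daily volume into a per-second rate is simple arithmetic, but it is worth doing explicitly because it usually deflates the architecture. The sketch below assumes a hypothetical spike model in which a fixed share of the day's volume arrives inside a fixed window; the 50%-in-one-hour figure is made up for illustration and should be replaced with your real spike pattern.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_s(per_day: float) -> float:
    """Average messages per second for a given daily volume."""
    return per_day / SECONDS_PER_DAY


def peak_rate_per_s(per_day: float, spike_share: float, spike_window_s: float) -> float:
    """Rate during a spike, assuming `spike_share` of the day's volume
    arrives within `spike_window_s` seconds."""
    return per_day * spike_share / spike_window_s


if __name__ == "__main__":
    # 5,000/day is the six-month projection from the problem definition
    # later in this article.
    print(f"average: {average_rate_per_s(5_000):.3f}/s")         # average: 0.058/s
    # Assumed spike: half of the day's volume inside one hour.
    print(f"peak:    {peak_rate_per_s(5_000, 0.5, 3600):.3f}/s")  # peak:    0.694/s
```

Even under that pessimistic spike assumption the system sees less than one message per second, which is a very different constraint from "millions of users".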
Who owns the downstream complexity? Notification systems touch email providers, SMS gateways, push notification services, in-app rendering, and user preference storage. Each of those integrations has rate limits, delivery semantics, and failure modes. Does your design need to own all of that, or can it delegate to a managed service that handles provider failover and compliance?
# Problem definition before any architecture:
Notification types: transactional only (for now)
Volume: ~500/day today, projected 5,000/day in 6 months
Latency requirement: < 5 s for password reset, < 30 s for receipts
Delivery guarantee: at-least-once; duplicates handled by idempotency key
Channels: email only; SMS is Q3
Failure tolerance: retry up to 3x over 10 minutes, then alert on-call
Compliance: GDPR unsubscribe required; no marketing content
That specification rules out half the "enterprise notification platform" solutions immediately. At 5,000 notifications a day, the system averages roughly one message every 17 seconds, so a simple queue with a dead-letter channel and an email provider integration is probably sufficient. Adding Kafka now is engineering for a future you have not validated.
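To show how little machinery that specification demands, here is a minimal in-memory sketch of a queue with retries, a dead-letter list, and idempotency-key deduplication. Everything in it is hypothetical: `NotificationQueue`, `Message`, and the `send` callback stand in for whatever queue and email provider you actually use, and the waiting between retries (spread over roughly 10 minutes in the specification) is omitted.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable

MAX_RETRIES = 3  # "retry up to 3x" from the problem definition


@dataclass
class Message:
    idempotency_key: str
    recipient: str
    body: str
    attempts: int = 0


class NotificationQueue:
    """In-memory stand-in for a real queue; all names are hypothetical."""

    def __init__(self, send: Callable[[Message], None]) -> None:
        self._send = send                      # e.g. a wrapper around an email provider
        self._pending: deque[Message] = deque()
        self._delivered: set[str] = set()      # idempotency keys already sent
        self.dead_letters: list[Message] = []  # inspect or alert on-call from here

    def enqueue(self, msg: Message) -> None:
        self._pending.append(msg)

    def drain(self) -> None:
        """Process until the queue is empty. A real worker would wait
        between retries; the delay is left out of this sketch."""
        while self._pending:
            msg = self._pending.popleft()
            if msg.idempotency_key in self._delivered:
                continue  # duplicate of a message that was already sent
            msg.attempts += 1
            try:
                self._send(msg)
            except Exception:
                # One initial attempt plus MAX_RETRIES retries, then dead-letter.
                if msg.attempts > MAX_RETRIES:
                    self.dead_letters.append(msg)
                else:
                    self._pending.append(msg)
            else:
                self._delivered.add(msg.idempotency_key)


if __name__ == "__main__":
    def always_fails(msg: Message) -> None:
        raise RuntimeError("provider unavailable")

    q = NotificationQueue(always_fails)
    q.enqueue(Message("reset-42", "user@example.com", "Reset your password"))
    q.drain()
    print(len(q.dead_letters), q.dead_letters[0].attempts)  # prints: 1 4
```

A production version would persist the queue and the delivered-key set, but the control flow stays about this size, which is the point: nothing in the specification asks for a streaming platform.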
The Cost of Skipping This Step
Skipping problem definition does not save time. It relocates the time to the worst possible moment: after the system is built, when changing the architecture means reworking production code under pressure.
The patterns are predictable. Teams that jump to solutions end up with:
- Overly complex systems with operational overhead that exceeds the problem's actual requirements
- Systems that handle the imagined use case well and the real use case poorly
- Architecture decisions made for scale that will never materialize, creating maintenance burden for years
A system that was designed for a problem that was never properly defined is not a technical debt problem — it is a requirements debt problem, and it does not get paid off by refactoring the code.
Start With a Written Problem Statement
Before any architecture discussion, write down the problem in concrete terms. Not user stories — constraints. Volume, latency, consistency requirements, failure tolerance, regulatory surface area, team operational capability. One page maximum.
If you cannot write that page, you are not ready to design the system. If writing it surfaces disagreement about what the system needs to do, you have just saved yourself from building the wrong thing. That disagreement is the design work. Do it before the code, not during.
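If it helps to make the one-page rule mechanical, the constraint list can be sketched as a small record with a check for unanswered fields. The `ProblemStatement` class and its field names are hypothetical; they simply mirror the constraints listed in this section.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ProblemStatement:
    """One field per constraint named above; None means 'not yet answered'."""
    volume: Optional[str] = None
    latency: Optional[str] = None
    consistency: Optional[str] = None
    failure_tolerance: Optional[str] = None
    regulatory_surface: Optional[str] = None
    operational_capability: Optional[str] = None


def unanswered(ps: ProblemStatement) -> list[str]:
    """Return the names of constraints that are still empty."""
    return [f.name for f in fields(ps) if not getattr(ps, f.name)]


if __name__ == "__main__":
    draft = ProblemStatement(volume="~500/day today", latency="< 5 s for password reset")
    # Anything listed here is a conversation to have before the design review.
    print(unanswered(draft))
```

An empty result does not prove the statement is right, only that every constraint has been given an answer someone can disagree with.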