Message Queues: The Part of System Design Most Backends Skip Too Long
by Arif Ikhsanudin, Backend Developer
The Synchronous Coupling Problem
Your order service calls the notification service synchronously via HTTP. When the notification service is slow — a downstream SMTP relay is backing up — order confirmations start timing out. Not the notifications. The orders themselves. A customer can't complete a purchase because an email system is slow. These two things have no business being coupled, but your architecture has made them inseparable.
This is the canonical case for a message queue. The order service publishes an event. The notification service consumes it at whatever rate it can manage. A slow notification service means slightly delayed emails, not failed orders. These are qualitatively different outcomes.
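A minimal in-process sketch of that shape, assuming a `java.util.concurrent.BlockingQueue` as a stand-in for a real broker. The names `OrderConfirmed`, `OrderService`, and `NotificationWorker` are hypothetical and exist only for illustration; the point is that the order path returns as soon as the event is enqueued.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical event type; a real event would carry more fields.
final class OrderConfirmed {
    final String orderId;

    OrderConfirmed(String orderId) {
        this.orderId = orderId;
    }
}

// The order path only enqueues. It never waits on email delivery.
final class OrderService {
    private final BlockingQueue<OrderConfirmed> queue;

    OrderService(BlockingQueue<OrderConfirmed> queue) {
        this.queue = queue;
    }

    String placeOrder(String orderId) {
        // ... persist the order here ...
        queue.add(new OrderConfirmed(orderId)); // returns immediately on an unbounded queue
        return "ACCEPTED";
    }
}

// The consumer drains the queue at whatever rate it can manage.
final class NotificationWorker {
    private final BlockingQueue<OrderConfirmed> queue;
    final List<String> sent = new ArrayList<>();

    NotificationWorker(BlockingQueue<OrderConfirmed> queue) {
        this.queue = queue;
    }

    // Handles everything currently queued and returns how many events were processed.
    int drain() {
        int handled = 0;
        OrderConfirmed event;
        while ((event = queue.poll()) != null) {
            sent.add(event.orderId); // stand-in for a slow SMTP call
            handled++;
        }
        return handled;
    }
}
```

If `drain()` is slow or never runs, `placeOrder` still returns "ACCEPTED"; the cost shows up as queue depth rather than as a failed order.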
Most teams act on this insight only after the first major incident, long after it has become obvious.
What a Message Queue Is Actually Doing
A message queue (or message broker — I'll use the terms interchangeably here for the common use cases) is a durable store for events that decouples the producer from the consumer in three distinct ways:
Temporal decoupling: The producer publishes when it has something to say. The consumer processes when it's ready. They don't need to be running at the same time or at the same rate.
Load leveling: If the producer publishes 10,000 events in a burst, the queue absorbs the spike. The consumer processes at its own pace. Without a queue, that spike hits your downstream system directly.
Reliability decoupling: If the consumer crashes after receiving a message but before acknowledging it, the message remains in the queue (or returns to it) and will be redelivered. A synchronous call that fails is a failed operation. An unconsumed queue message is a pending operation.
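The reliability point can be sketched with a toy in-memory queue that holds on to a message until the consumer acknowledges it. `AckQueue` and its methods are hypothetical names, not any broker's API; real brokers implement the same idea with acknowledgements or visibility timeouts.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy queue: a received message stays "in flight" until it is acked.
final class AckQueue {
    private final Deque<String> ready = new ArrayDeque<>();
    private final Map<Long, String> inFlight = new HashMap<>();
    private long nextTag = 0;

    void publish(String message) {
        ready.addLast(message);
    }

    // Hands out the next message and returns its delivery tag, or -1 if nothing is ready.
    long receive() {
        String message = ready.pollFirst();
        if (message == null) {
            return -1;
        }
        long tag = nextTag++;
        inFlight.put(tag, message);
        return tag;
    }

    String body(long tag) {
        return inFlight.get(tag);
    }

    // Processing succeeded: the message is gone for good.
    void ack(long tag) {
        inFlight.remove(tag);
    }

    // Consumer crashed or timed out: the message goes back for redelivery.
    void requeue(long tag) {
        String message = inFlight.remove(tag);
        if (message != null) {
            ready.addFirst(message);
        }
    }

    // Messages that still represent pending work.
    int pending() {
        return ready.size() + inFlight.size();
    }
}
```

A message only stops counting as pending after `ack`, which is what turns a consumer crash into a delayed operation instead of a lost one.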
The Major Options and When to Use Them
RabbitMQ is a general-purpose broker implementing AMQP (Advanced Message Queuing Protocol). It supports flexible routing — exchanges, bindings, and queues let you build fan-out, topic-based, and direct routing patterns. Quorum queues (added in RabbitMQ 3.8) provide strong durability guarantees with consensus-based replication. Use it when you need flexible routing and your team is comfortable operating a broker.
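To make topic-based routing concrete, here is a small sketch of how a topic pattern is matched against a routing key: words are separated by dots, `*` stands for exactly one word, and `#` for zero or more words. `TopicMatcher` is a hypothetical helper written for illustration, not RabbitMQ's implementation.

```java
// Illustrative matcher for AMQP-style topic patterns.
final class TopicMatcher {

    static boolean matches(String bindingPattern, String routingKey) {
        return match(bindingPattern.split("\\."), 0, routingKey.split("\\."), 0);
    }

    private static boolean match(String[] pattern, int i, String[] key, int j) {
        if (i == pattern.length) {
            return j == key.length; // pattern exhausted: key must be exhausted too
        }
        if (pattern[i].equals("#")) {
            // '#' may absorb zero or more words; try every possible split point.
            for (int next = j; next <= key.length; next++) {
                if (match(pattern, i + 1, key, next)) {
                    return true;
                }
            }
            return false;
        }
        if (j == key.length) {
            return false; // pattern still expects a word, but the key has none left
        }
        boolean wordMatches = pattern[i].equals("*") || pattern[i].equals(key[j]);
        return wordMatches && match(pattern, i + 1, key, j + 1);
    }
}
```

Under these rules a queue bound with `order.*` receives `order.confirmed` but not `order.eu.confirmed`, while a binding of `order.#` receives both.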
AWS SQS is a managed queue with minimal operational overhead. FIFO queues provide exactly-once processing and ordered delivery within a message group, but at a lower throughput ceiling: by default 300 API calls per second per API action, or up to 3,000 messages per second when batching 10 messages per call. Standard queues offer nearly unlimited throughput, with at-least-once delivery and best-effort ordering. SQS is the right choice when you're already on AWS and want to avoid operating infrastructure.
Apache Kafka is not a traditional message queue — it's a distributed log. Messages are retained for a configurable period, consumers track their own position, and messages can be replayed. This makes it appropriate for event sourcing, audit trails, and stream processing use cases. Kafka is operationally heavier than RabbitMQ or SQS and is usually overkill for simple task queues. Use it when replay semantics, high throughput (millions of events/second), or log compaction matter.
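The difference between a log and a queue is easier to see in code. The sketch below is an in-memory toy, not the Kafka client API: `EventLog` and `LogConsumer` are hypothetical names. Reading never removes entries, each consumer keeps its own position, and replay is just moving that position back.

```java
import java.util.ArrayList;
import java.util.List;

// Append-only log: reads never delete anything.
final class EventLog {
    private final List<String> entries = new ArrayList<>();

    // Appends an event and returns the offset it was written at.
    long append(String event) {
        entries.add(event);
        return entries.size() - 1;
    }

    // Returns up to max events starting at the given offset.
    List<String> readFrom(long offset, int max) {
        int from = (int) Math.max(0, Math.min(offset, entries.size()));
        int to = Math.min(from + Math.max(0, max), entries.size());
        return new ArrayList<>(entries.subList(from, to));
    }

    // Offset that the next appended event will receive.
    long endOffset() {
        return entries.size();
    }
}

// Each consumer tracks its own position in the shared log.
final class LogConsumer {
    private final EventLog log;
    private long position = 0;

    LogConsumer(EventLog log) {
        this.log = log;
    }

    List<String> poll(int max) {
        List<String> batch = log.readFrom(position, max);
        position += batch.size();
        return batch;
    }

    // Replay: rewind to an earlier offset and read the same events again.
    void seek(long offset) {
        position = offset;
    }

    // How far this consumer is behind the end of the log.
    long lag() {
        return log.endOffset() - position;
    }
}
```

Two `LogConsumer` instances over the same `EventLog` see every event independently, which is the property that makes audit trails and reprocessing straightforward.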
The Problems Queues Create
Switching from synchronous to async processing introduces a set of problems that synchronous systems don't have:
Observability gap: In a synchronous system, a slow operation is visible immediately — the request is slow. In an async system, a backed-up queue may go unnoticed until queue depth becomes very large. You need explicit monitoring of queue depth, consumer lag, and processing rate from day one — not as a retrofit.
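One way to turn those three numbers into an alert is to estimate how long the backlog would take to drain at current rates. This is a sketch under assumptions: `QueueHealth` is a hypothetical helper, the inputs are metrics you would pull from your broker, and the threshold is a value you would choose for your own workload.

```java
// Derives a simple alert signal from queue depth and throughput rates.
final class QueueHealth {

    // Estimated seconds until the backlog is empty at the current rates.
    // Returns infinity when consumers are not outpacing producers.
    static double secondsToDrain(long depth, double publishPerSec, double consumePerSec) {
        if (depth == 0) {
            return 0.0;
        }
        double netDrainRate = consumePerSec - publishPerSec;
        if (netDrainRate <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return depth / netDrainRate;
    }

    // Alerts when the backlog cannot be cleared within the acceptable delay.
    static boolean shouldAlert(long depth, double publishPerSec, double consumePerSec,
                               double maxDrainSeconds) {
        return secondsToDrain(depth, publishPerSec, consumePerSec) > maxDrainSeconds;
    }
}
```

Alerting on drain time rather than raw depth avoids paging on a large but rapidly shrinking backlog, and still fires when a small backlog is growing without bound.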
At-least-once delivery and idempotency: Most brokers guarantee at-least-once delivery under failure conditions — a message may be delivered more than once. If your consumer is not idempotent, duplicate processing causes duplicate effects: sending two emails, charging a customer twice, recording a transaction twice. Every consumer must handle duplicates.
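    // Idempotent consumer: check before acting.
    // notificationLog and emailService are collaborators injected into the consumer.
    public void handleOrderConfirmation(OrderConfirmedEvent event) {
        if (notificationLog.alreadySent(event.getOrderId())) {
            return; // duplicate delivery, safe to ignore
        }
        emailService.sendConfirmation(event);
        // Record only after a successful send. A crash between these two
        // calls can still produce one duplicate email on redelivery.
        notificationLog.record(event.getOrderId());
    }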
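The check-then-act shape above leaves a window in which two concurrent deliveries of the same event can both pass the check. One possible tightening, sketched here under assumptions, is to claim the order ID atomically before sending and release the claim if the send fails. `IdempotentNotifier` is a hypothetical class, and the in-memory set stands in for a durable store with a unique constraint on the order ID.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Deduplicates deliveries by atomically claiming each order ID.
final class IdempotentNotifier {
    // Stand-in for a durable table with a unique key on orderId.
    private final Set<String> claimed = ConcurrentHashMap.newKeySet();
    private final Consumer<String> sender;

    // sender is whatever actually delivers the email for an order ID.
    IdempotentNotifier(Consumer<String> sender) {
        this.sender = sender;
    }

    // Returns true if the email was sent, false if this was a duplicate delivery.
    boolean handle(String orderId) {
        if (!claimed.add(orderId)) {
            return false; // another delivery already claimed this order
        }
        try {
            sender.accept(orderId);
            return true;
        } catch (RuntimeException e) {
            claimed.remove(orderId); // release the claim so a redelivery can retry
            throw e;
        }
    }
}
```

The trade-off is unchanged in kind: a crash after the send but before the claim becomes durable can still produce a duplicate, so the email itself should tolerate being sent twice.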
Ordering assumptions: Standard queues do not guarantee ordering. If your processing logic assumes event A arrives before event B, standard queuing will eventually violate that assumption. Ordered options (SQS FIFO message groups, Kafka partitions) preserve order only within a single message group or partition, and that scoping limits how far you can parallelize consumers for related events.
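Ordering within a partition works because every event with the same key is routed to the same partition. A minimal sketch, assuming a hypothetical `PartitionedTopic` class and `String.hashCode` as the hash for brevity (Kafka's own default partitioner uses a different hash function):

```java
import java.util.ArrayList;
import java.util.List;

// Toy topic: events with the same key always land in the same partition.
final class PartitionedTopic {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedTopic(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // floorMod keeps the result non-negative even for negative hash codes.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    void publish(String key, String event) {
        partitions.get(partitionFor(key)).add(key + ":" + event);
    }

    // Events of one partition, in the order they were published.
    List<String> read(int partition) {
        return List.copyOf(partitions.get(partition));
    }
}
```

Using the order ID as the key keeps `created` before `paid` for any single order, while different orders can still be processed in parallel on other partitions.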
Dead-letter queues: Messages that fail processing after the maximum retry count need somewhere to go that isn't silent discard. A dead-letter queue (DLQ) receives these messages so they can be inspected and reprocessed. This is non-optional infrastructure — without it, processing failures are invisible.
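The retry-then-dead-letter flow can be sketched in a few lines. `RetryingConsumer` is a hypothetical in-memory model, and `maxAttempts` plays the role that a setting such as `maxReceiveCount` plays in an SQS redrive policy; managed brokers handle this routing for you once it is configured.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

// Retries failed messages and parks them in a dead-letter list after maxAttempts.
final class RetryingConsumer {
    private static final class Delivery {
        final String body;
        int attempts;

        Delivery(String body) {
            this.body = body;
        }
    }

    private final Deque<Delivery> queue = new ArrayDeque<>();
    final List<String> deadLetters = new ArrayList<>();
    private final int maxAttempts;

    RetryingConsumer(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    void publish(String body) {
        queue.addLast(new Delivery(body));
    }

    // Runs the handler over the queue until every message succeeded or was dead-lettered.
    void drain(Consumer<String> handler) {
        Delivery delivery;
        while ((delivery = queue.pollFirst()) != null) {
            delivery.attempts++;
            try {
                handler.accept(delivery.body);
            } catch (RuntimeException e) {
                if (delivery.attempts >= maxAttempts) {
                    deadLetters.add(delivery.body); // kept for inspection, not discarded
                } else {
                    queue.addLast(delivery); // back of the queue for another attempt
                }
            }
        }
    }
}
```

A production setup would also add a delay between attempts and an alert on dead-letter depth, since a DLQ that nobody watches is only a slower form of silent discard.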
When Not to Use a Queue
Queues add latency and complexity. Don't reach for them when:
- The operation must be synchronous from the user's perspective — a real-time authorization check cannot be queued
- The data volume is small and processing is fast: distributing ten records per minute does not justify deploying and operating a broker
- Your team doesn't have operational experience with the chosen broker — operating Kafka in production is a non-trivial commitment
The Practical Takeaway
Identify one place in your current system where a slow or unavailable dependency can fail the operation that triggered it — even though the dependency is not core to that operation. Map out whether the caller actually needs an immediate response, or just confirmation that the request was received. If the latter, that's your first queue candidate. Start with SQS if you're on AWS — it's the lowest operational overhead path to understanding the pattern in production.