Message Queues: The Part of System Design Most Backends Skip Too Long

by Arif Ikhsanudin, Backend Developer

The Synchronous Coupling Problem

Your order service calls the notification service synchronously via HTTP. When the notification service is slow — a downstream SMTP relay is backing up — order confirmations start timing out. Not the notifications. The orders themselves. A customer can't complete a purchase because an email system is slow. These two things have no business being coupled, but your architecture has made them inseparable.

This is the canonical case for a message queue. The order service publishes an event. The notification service consumes it at whatever rate it can manage. A slow notification service means slightly delayed emails, not failed orders. These are qualitatively different outcomes.

Most teams can see this coupling long before it hurts them. Most act on it only after the first major incident.

What a Message Queue Is Actually Doing

A message queue (or message broker — I'll use the terms interchangeably here for the common use cases) is a durable store for events that decouples the producer from the consumer in three distinct ways:

Temporal decoupling: The producer publishes when it has something to say. The consumer processes when it's ready. They don't need to be running at the same time or at the same rate.

Load leveling: If the producer publishes 10,000 events in a burst, the queue absorbs the spike. The consumer processes at its own pace. Without a queue, that spike hits your downstream system directly.

Reliability decoupling: If the consumer crashes after receiving a message but before acknowledging it, the broker keeps the message (or makes it visible again) and redelivers it. A synchronous call that fails is a failed operation. An unconsumed queue message is a pending operation.
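To make load leveling concrete, here is a minimal in-memory sketch. It uses java.util.concurrent.LinkedBlockingQueue as a stand-in for a broker, so it is not durable and illustrates only the shape of the pattern. The class and method names (LoadLevelingSketch, publishBurst, consumeBatch) are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: an in-memory queue standing in for a real broker.
class LoadLevelingSketch {
    // Unbounded queue of order IDs. A real broker would persist these.
    final BlockingQueue<Long> queue = new LinkedBlockingQueue<>();
    final List<Long> processed = new ArrayList<>();

    // Producer side: publishing returns immediately, however slow the consumer is.
    void publishBurst(int count) {
        for (long orderId = 1; orderId <= count; orderId++) {
            queue.add(orderId);
        }
    }

    // Consumer side: takes at most batchSize messages per call, at its own pace.
    // Returns how many messages were actually processed.
    int consumeBatch(int batchSize) {
        List<Long> batch = new ArrayList<>();
        queue.drainTo(batch, batchSize);
        processed.addAll(batch);
        return batch.size();
    }
}
```

Publishing 10,000 events and then calling consumeBatch(100) leaves 9,900 events waiting in the queue. The burst never reaches the consumer as a spike; it only shows up as queue depth.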

The Major Options and When to Use Them

RabbitMQ is a general-purpose broker implementing AMQP (Advanced Message Queuing Protocol). It supports flexible routing — exchanges, bindings, and queues let you build fan-out, topic-based, and direct routing patterns. Quorum queues (added in RabbitMQ 3.8) provide strong durability guarantees with consensus-based replication. Use it when you need flexible routing and your team is comfortable operating a broker.
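Topic-based routing is easier to reason about with the matching rule written out. The sketch below illustrates the documented semantics of topic binding patterns: words are separated by dots, "*" stands for exactly one word, and "#" stands for zero or more words. It is not RabbitMQ's implementation, and TopicMatcher is a hypothetical name.

```java
// Hypothetical sketch of topic-pattern matching semantics.
class TopicMatcher {
    // Returns true if routingKey matches the binding pattern.
    static boolean matches(String pattern, String routingKey) {
        String[] p = pattern.isEmpty() ? new String[0] : pattern.split("\\.");
        String[] k = routingKey.isEmpty() ? new String[0] : routingKey.split("\\.");
        return match(p, 0, k, 0);
    }

    private static boolean match(String[] p, int i, String[] k, int j) {
        if (i == p.length) {
            return j == k.length; // pattern exhausted: key must be exhausted too
        }
        if (p[i].equals("#")) {
            // "#" either matches zero words (skip it) or consumes one word and stays.
            return match(p, i + 1, k, j) || (j < k.length && match(p, i, k, j + 1));
        }
        if (j == k.length) {
            return false; // pattern still needs a word, key has none left
        }
        return (p[i].equals("*") || p[i].equals(k[j])) && match(p, i + 1, k, j + 1);
    }
}
```

Under these rules a binding of "order.*" receives "order.confirmed" but not "order.eu.confirmed", while "order.#" receives both.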

AWS SQS is a managed queue with minimal operational overhead. Standard queues are at-least-once with best-effort ordering and nearly unlimited throughput. FIFO queues provide exactly-once processing and ordered delivery within a message group, but with a throughput ceiling: by default 300 API calls per second per API action, or up to 3,000 messages per second when batching ten messages per call (a high-throughput mode raises these limits). SQS is the right choice when you're already on AWS and want to avoid operating infrastructure.

Apache Kafka is not a traditional message queue — it's a distributed log. Messages are retained for a configurable period, consumers track their own position, and messages can be replayed. This makes it appropriate for event sourcing, audit trails, and stream processing use cases. Kafka is operationally heavier than RabbitMQ or SQS and is usually overkill for simple task queues. Use it when replay semantics, high throughput (millions of events/second), or log compaction matter.
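The difference between a log and a queue is easiest to see in a toy model. The sketch below is an in-memory illustration of the idea only, with no relation to Kafka's client API: records are appended and never removed on read, each consumer has its own offset, and replay is nothing more than moving that offset back. EventLogSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: an append-only log with per-consumer read positions.
class EventLogSketch {
    private final List<String> records = new ArrayList<>();        // append-only
    private final Map<String, Integer> offsets = new HashMap<>();  // consumer -> next offset

    // Appends a record and returns the offset it was stored at.
    int append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Returns up to max records from this consumer's position and advances it.
    List<String> poll(String consumer, int max) {
        int from = offsets.getOrDefault(consumer, 0);
        int to = Math.min(records.size(), from + max);
        offsets.put(consumer, to);
        return new ArrayList<>(records.subList(from, to));
    }

    // Replay: move the consumer's position. Reading never deleted anything.
    void seek(String consumer, int offset) {
        if (offset < 0 || offset > records.size()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        offsets.put(consumer, offset);
    }
}
```

Two consumers polling the same log each see every record, and a consumer that calls seek(name, 0) reads the full history again. A traditional queue offers neither behavior once a message has been acknowledged.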

The Problems Queues Create

Switching from synchronous to async processing introduces a set of problems that synchronous systems don't have:

Observability gap: In a synchronous system, a slow operation is visible immediately — the request is slow. In an async system, a backed-up queue may go unnoticed until queue depth becomes very large. You need explicit monitoring of queue depth, consumer lag, and processing rate from day one — not as a retrofit.
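One way to turn queue depth and processing rate into an alert is to estimate how long the backlog would take to drain at current rates. The sketch below assumes you already collect depth, publish rate, and consume rate from your broker's metrics. BacklogCheck, its method names, and the threshold parameter are hypothetical.

```java
// Hypothetical sketch: derive an alert condition from queue metrics.
class BacklogCheck {
    // Seconds until the backlog clears at current rates.
    // Returns 0 for an empty queue and -1 if the backlog is not shrinking.
    static double secondsToDrain(long depth, double publishPerSec, double consumePerSec) {
        if (depth == 0) {
            return 0.0;
        }
        double netDrainRate = consumePerSec - publishPerSec;
        if (netDrainRate <= 0) {
            return -1.0; // consumers are not keeping up
        }
        return depth / netDrainRate;
    }

    // Alert when the backlog is growing or would take longer than the budget to clear.
    static boolean shouldAlert(long depth, double publishPerSec,
                               double consumePerSec, double maxDrainSeconds) {
        double eta = secondsToDrain(depth, publishPerSec, consumePerSec);
        return eta < 0 || eta > maxDrainSeconds;
    }
}
```

A depth of 6,000 with 100 messages per second arriving and 200 per second leaving drains in 60 seconds, which is fine against a 300-second budget. Swap the two rates and the same depth should alert immediately, because the backlog will only grow.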

At-least-once delivery and idempotency: Most brokers guarantee at-least-once delivery under failure conditions — a message may be delivered more than once. If your consumer is not idempotent, duplicate processing causes duplicate effects: sending two emails, charging a customer twice, recording a transaction twice. Every consumer must handle duplicates.

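// Idempotent consumer: check before acting.
// notificationLog and emailService are injected collaborators (not shown).
public void handleOrderConfirmation(OrderConfirmedEvent event) {
    // Note: this check and the record() below are not atomic. Two consumers
    // handling the same event at the same time can both pass the check.
    if (notificationLog.alreadySent(event.getOrderId())) {
        return; // duplicate delivery, safe to ignore
    }
    emailService.sendConfirmation(event);
    // Record only after a successful send: a crash between these two calls
    // causes one repeat email on redelivery, never a lost one.
    notificationLog.record(event.getOrderId());
}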
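A check followed by a separate write leaves a window in which two concurrent deliveries of the same event can both pass the check. One common way to close that window is to make the check and the claim a single atomic step. The sketch below does this in memory with a concurrent set. In production the set would be a durable store, for example a table with a unique constraint on the order ID. IdempotentNotifier and its members are hypothetical names.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch: claim-then-act with an atomic claim.
class IdempotentNotifier {
    // In-memory stand-in for a durable dedupe store.
    private final Set<String> claimed = ConcurrentHashMap.newKeySet();
    // Records what was "sent" so the behavior can be observed.
    final List<String> sent = new CopyOnWriteArrayList<>();

    // Returns true if the email was sent, false if this delivery was a duplicate.
    boolean handle(String orderId) {
        // add() is atomic: for a given orderId exactly one caller sees true.
        if (!claimed.add(orderId)) {
            return false;
        }
        try {
            send(orderId);
            return true;
        } catch (RuntimeException e) {
            claimed.remove(orderId); // release the claim so a redelivery can retry
            throw e;
        }
    }

    // Placeholder for the real email call.
    private void send(String orderId) {
        sent.add(orderId);
    }
}
```

The trade-off is worth stating. Claiming before sending means a hard crash between the claim and the send can lose the email, while recording after sending can repeat it. Which failure is more acceptable depends on the side effect: a repeated email is usually tolerable, a repeated charge is not.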

Ordering assumptions: Standard queues do not guarantee ordering. If your processing logic assumes event A arrives before event B, standard queuing will eventually violate that assumption. SQS FIFO queues preserve order within a message group, and Kafka preserves order within a partition. Neither orders events across groups or partitions, and each group or partition is processed sequentially, which limits its throughput.
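Per-group ordering works because every event with the same key is routed to the same ordered lane. The sketch below shows the idea with in-memory lists as partitions. It uses String.hashCode() for routing, which is an assumption made for brevity; real systems use their own hash functions. KeyedPartitions and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: route events by key so each key keeps its own order.
class KeyedPartitions {
    private final List<List<String>> partitions = new ArrayList<>();

    KeyedPartitions(int count) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive");
        }
        for (int i = 0; i < count; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // The same key always maps to the same partition.
    // floorMod keeps the index non-negative even when hashCode() is negative.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Appends the event to the partition chosen by its key.
    void publish(String key, String event) {
        partitions.get(partitionFor(key)).add(key + ":" + event);
    }

    // Returns a copy of one partition's contents in append order.
    List<String> read(int partition) {
        return new ArrayList<>(partitions.get(partition));
    }
}
```

If "created", "paid", and "shipped" are published with the key "order-1", a consumer reading that key's partition sees them in that order, regardless of what other keys are interleaved. Events for two different orders have no defined order relative to each other.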

Dead-letter queues: Messages that fail processing after the maximum retry count need somewhere to go that isn't silent discard. A dead-letter queue (DLQ) receives these messages so they can be inspected and reprocessed. This is non-optional infrastructure — without it, processing failures are invisible.
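The retry-then-dead-letter flow can be sketched in a few lines. In a managed broker this is normally configuration rather than code (a maximum receive count and a target queue), so treat the following as an in-memory illustration of the behavior, not as something to hand-roll. RetryingConsumer, Delivery, and the field names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: bounded retries, then route the message to a DLQ.
class RetryingConsumer {
    // A message body plus the number of the current delivery attempt.
    static final class Delivery {
        final String body;
        final int attempt;

        Delivery(String body, int attempt) {
            this.body = body;
            this.attempt = attempt;
        }
    }

    final Deque<Delivery> mainQueue = new ArrayDeque<>();
    final List<String> deadLetters = new ArrayList<>();
    private final int maxAttempts;

    RetryingConsumer(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    void publish(String body) {
        mainQueue.addLast(new Delivery(body, 1));
    }

    // Processes until the main queue is empty. A failed message is requeued
    // until it has been attempted maxAttempts times, then dead-lettered.
    void drain(Consumer<String> handler) {
        Delivery d;
        while ((d = mainQueue.pollFirst()) != null) {
            try {
                handler.accept(d.body);
            } catch (RuntimeException e) {
                if (d.attempt >= maxAttempts) {
                    deadLetters.add(d.body); // kept for inspection, not discarded
                } else {
                    mainQueue.addLast(new Delivery(d.body, d.attempt + 1));
                }
            }
        }
    }
}
```

With maxAttempts set to 3, a message whose handler always throws is attempted three times and then lands in deadLetters, while healthy messages around it are processed normally. A real setup would also add a delay between attempts and an alert on DLQ depth.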

When Not to Use a Queue

Queues add latency and complexity. Don't reach for them when:

  • The operation must be synchronous from the user's perspective — a real-time authorization check cannot be queued
  • The data volume is small and processing is fast: a workload of ten records per minute does not justify deploying and operating a broker
  • Your team doesn't have operational experience with the chosen broker — operating Kafka in production is a non-trivial commitment

The Practical Takeaway

Identify one place in your current system where a slow or unavailable dependency can fail the operation that triggered it — even though the dependency is not core to that operation. Map out whether the caller actually needs an immediate response, or just confirmation that the request was received. If the latter, that's your first queue candidate. Start with SQS if you're on AWS — it's the lowest operational overhead path to understanding the pattern in production.
