Building Payment Processing Systems That Don't Leak Money: Where Queue Design Meets Reconciliation

Most payment systems leak. Not because the code is wrong, but because the queue that orchestrates them treats money like any other message. I'll show you where.

The Problem Nobody Talks About Until It Costs Them Money

You've built a payment processor. It has a worker pool consuming from a queue. It talks to your bank's API. It updates a database. On the happy path, everything works.

Then a worker crashes mid-transaction, or your webhook handler times out, or the bank returns a 500 error you didn't code for. The message sits in the queue. You restart the worker. What happens next is not as predictable as you think.

The real issue isn't the occasional failure. The real issue is that payment queues behave differently from task queues. A delayed email is annoying. A payment that processes twice is a fraud investigation. A payment that silently fails to update the ledger is a reconciliation nightmare you discover three weeks later.

Most payment systems start by reusing the same queue patterns they'd use for anything else. Then they bolt on monitoring, retry logic, and manual reconciliation. What they're actually building is a reconciliation fire hose.

Why Standard Queue Assumptions Break Under Money

A typical queue system prioritizes availability. If a message fails, it retries. If it retries too many times, it goes to a dead letter queue. Someone reviews it eventually.

Payment messages cannot follow this pattern.

Take a concrete example: Stripe webhook delivery. You process a charge.succeeded webhook, update your customer's balance, and issue their reward. The webhook handler crashes before it can mark the webhook as processed. Stripe retries. Your handler runs again. The customer's balance increments twice. The reward is issued twice.

Now your balance table says they have 200 points. Your payments ledger says you charged them once. Your reward table says you issued 100 points to that customer twice. A human must now find the transaction, link it to the duplicate, and decide whether to reverse the extra reward or let it stand.

This is not a remote, theoretical scenario. This is what happens when you treat "mark as processed" as an application concern rather than a transactional concern.

The mechanism that prevents this is idempotency. Not as a buzzword, but as a specific architectural choice: every payment operation must carry an idempotency key (often the bank's transaction ID or a hash derived from the request), and your system must guarantee that processing the same idempotency key twice produces the same outcome as processing it once.

This is straightforward in principle. It is not straightforward in practice, because it means your database must enforce a unique constraint on (operation_type, idempotency_key, customer_id), and every state transition must be transactional. Many systems skip this because "it makes the code more complicated", until a duplicate slips through, and then they're rebuilding trust.
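A minimal sketch of that constraint, assuming SQLite and a 100-point reward. The table names, the `credit_reward` helper, and the `reward_credit` operation type are hypothetical, chosen only for illustration. The key point is that the idempotency record and the balance change commit in the same transaction, so a redelivered webhook hits the unique constraint instead of crediting twice.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE balances (
            customer_id TEXT PRIMARY KEY,
            points INTEGER NOT NULL
        );
        CREATE TABLE processed_operations (
            operation_type TEXT NOT NULL,
            idempotency_key TEXT NOT NULL,
            customer_id TEXT NOT NULL,
            UNIQUE (operation_type, idempotency_key, customer_id)
        );
        """
    )
    return conn


def credit_reward(conn: sqlite3.Connection, idempotency_key: str,
                  customer_id: str, points: int) -> bool:
    """Credit points at most once per key. True if applied, False if duplicate."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            # The unique constraint rejects a second insert for the same key.
            conn.execute(
                "INSERT INTO processed_operations VALUES (?, ?, ?)",
                ("reward_credit", idempotency_key, customer_id),
            )
            conn.execute(
                "INSERT OR IGNORE INTO balances VALUES (?, 0)", (customer_id,)
            )
            conn.execute(
                "UPDATE balances SET points = points + ? WHERE customer_id = ?",
                (points, customer_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # already processed; the balance is untouched
```

Calling `credit_reward(conn, "evt_1", "cust_1", 100)` twice leaves the customer with 100 points, not 200, because the second call fails on the constraint and rolls back before the balance update runs.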

The Reconciliation Drift Problem

Here's a scenario from first-hand experience: a multi-tenant SaaS platform where each tenant has a separate ledger. A payment comes in. The worker dequeues it, processes it against tenant A's ledger, dequeues the next message for tenant B, but crashes before committing tenant B's ledger write.

If that payment had already been recorded in your master ledger (the one you send to the bank for settlement), the master ledger is now out of sync with your per-tenant ledgers by exactly one transaction. When you reconcile at end-of-day, you see a mismatch. Debugging requires walking through the queue, the ledgers, and the bank API simultaneously. If you're running a high-volume system with thousands of transactions per hour, finding the one that drifted is not trivial.

The mechanism that prevents this is explicit reconciliation checkpoints, not just implicit idempotency. Specifically: every N transactions (often 100 or 1000, depending on throughput), your worker publishes a reconciliation marker to a separate audit queue. This marker includes the transaction count and running total so far, plus a hash of all ledger states it has seen. At settlement time, you read all the markers, compute the expected sum, and compare it to what the bank says. If they don't match, you know the drift window: it's between the last marker that agrees (X) and the first that doesn't (Y).
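The checkpoint idea can be sketched as follows. This assumes each marker carries a running count and total alongside a chained hash, and that the bank's export can be replayed in the same order to produce comparable markers; both are assumptions of this example, as are the `Marker`, `build_markers`, and `find_drift_window` names.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Marker:
    last_seq: int      # number of transactions covered so far
    total_cents: int   # running sum of amounts
    chain_hash: str    # hash chained over every (txn_id, amount) seen so far


def build_markers(transactions: Iterable[Tuple[str, int]], every: int) -> List[Marker]:
    """Emit one marker after every `every` transactions."""
    markers = []
    total = 0
    digest = hashlib.sha256(b"").hexdigest()
    for seq, (txn_id, amount_cents) in enumerate(transactions, start=1):
        total += amount_cents
        digest = hashlib.sha256(
            f"{digest}|{txn_id}|{amount_cents}".encode()
        ).hexdigest()
        if seq % every == 0:
            markers.append(Marker(seq, total, digest))
    return markers


def find_drift_window(ours: List[Marker],
                      theirs: List[Marker]) -> Optional[Tuple[int, int]]:
    """Return (after_seq, up_to_seq) for the first disagreeing marker, else None."""
    previous_seq = 0
    for mine, bank in zip(ours, theirs):
        if mine != bank:
            return (previous_seq, mine.last_seq)
        previous_seq = mine.last_seq
    return None
```

With a marker every 100 transactions, a mismatch first appearing at marker 7 narrows the search to transactions 601 through 700. Transactions after the final marker are not covered by this sketch and would need a closing marker at settlement time.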

Without this, reconciliation is a binary choice: either you trust your system (which is often not warranted), or you re-process transactions from the bank's export (which is slow and error-prone).

Queue Semantics Matter More Than Throughput

There are three delivery guarantees for queues: at-most-once, at-least-once, and exactly-once.

At-most-once means a message might not be processed. This is fine for analytics, not for payments.

At-least-once means a message might be processed multiple times. This is what most durable queues provide (RabbitMQ with durable queues and manual acknowledgements, SQS standard queues, Kafka when offsets are committed after processing). It's the default assumption. If you're not being explicit about idempotency, you're gambling.

Exactly-once is a false promise in distributed systems (you can get it from a single machine, but not across networks). What you actually want is exactly-once-visible-effects, which means: side effects only happen once, even if the message is processed twice. This requires idempotency keys, transactional writes, and external consistency checks.

Many teams implement their own queue and get at-least-once. That's the right starting point. The mistake is treating it as solved.

How to Structure Your Payment Queue

Step one: every payment message must carry a unique, deterministic idempotency key. If it comes from a customer action, derive it from the customer ID, operation type, and the timestamp of the original request, captured once when the action is initiated and never regenerated on retry. If it comes from a webhook, use the webhook's own ID if the provider offers one.
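One way to derive such a key, as a sketch. The `derive_idempotency_key` helper, the `webhook:` and `action:` prefixes, and the choice of SHA-256 are assumptions of this example, not a prescribed format.

```python
import hashlib
from typing import Optional


def derive_idempotency_key(customer_id: str, operation_type: str,
                           requested_at_iso: str,
                           provider_event_id: Optional[str] = None) -> str:
    """Build a deterministic key: same inputs always give the same key."""
    if provider_event_id:
        # Prefer the provider's own event ID when one exists.
        return f"webhook:{provider_event_id}"
    # Unit separator avoids ambiguity between e.g. ("ab", "c") and ("a", "bc").
    material = "\x1f".join([customer_id, operation_type, requested_at_iso])
    return "action:" + hashlib.sha256(material.encode("utf-8")).hexdigest()
```

The timestamp here must travel with the message from the moment the customer acts. If a retry path calls the function with a fresh clock reading, every retry gets a new key and the duplicate protection disappears.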

Step two: your database schema must enforce uniqueness on (idempotency_key), or on a composite such as (operation_type, idempotency_key, customer_id) if keys are only unique within that scope. This can be a primary key on a separate idempotency table, or a unique constraint on your transactions table. The point is: the database prevents duplicate processing at the storage layer, not the application layer.

Step three: every dequeue operation is a transaction. Dequeue, process, write state, record the idempotency key, commit, and only then acknowledge the message. If anything fails between dequeue and commit, the message stays in the queue. When the worker recovers, it dequeues the same message again. The idempotency check makes the second attempt harmless: it notices the entry already exists and returns the same result as the first attempt.

Step four: implement Dead Letter Queues (DLQ) for permanent failures, not transient ones. If a payment fails due to insufficient funds, that's not a transient error. Don't retry it. Move it to a DLQ and alert. If a payment fails due to a network timeout, retry with exponential backoff (3-5 attempts across 5-10 minutes, then DLQ).
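A sketch of that split between permanent and transient failures. The two exception classes, the `process_with_retry` helper, and the list used as a DLQ are hypothetical; the defaults (5 attempts, 30-second base, doubling) are one choice that totals 7.5 minutes of waiting, inside the range above.

```python
import time
from typing import Any, Callable, List, Optional, Tuple


class PermanentFailure(Exception):
    """Retrying will not help, e.g. insufficient funds."""


class TransientFailure(Exception):
    """Retrying may help, e.g. a network timeout."""


def process_with_retry(message: Any,
                       charge: Callable[[Any], Any],
                       dead_letters: List[Tuple[Any, str]],
                       sleep: Callable[[float], None] = time.sleep,
                       max_attempts: int = 5,
                       base_seconds: float = 30.0) -> Optional[Any]:
    """Run `charge`, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return charge(message)
        except PermanentFailure as exc:
            # No retry: park it for review straight away.
            dead_letters.append((message, f"permanent: {exc}"))
            return None
        except TransientFailure as exc:
            if attempt == max_attempts:
                dead_letters.append((message, f"retries exhausted: {exc}"))
                return None
            # Delays of 30, 60, 120, 240 seconds with the defaults.
            sleep(base_seconds * 2 ** (attempt - 1))
    return None
```

Passing `sleep` as a parameter keeps the function testable without real waiting. In a real worker you would more likely requeue the message with a delay than block the worker thread, and `charge` itself must be idempotent for the retries to be safe.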

Step five: reconciliation. Separately from your queue, run a periodic job (once per hour, once per day, depending on volume) that pulls your ledger summary and compares it to the bank's settlement file. Log the delta. Alert if it's non-zero. When it's non-zero, you have a bounded window to search.
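A minimal sketch of such a job, assuming the bank's settlement file is a CSV with `transaction_id` and `amount_cents` columns. That file layout is hypothetical, as are the `parse_settlement` and `reconcile` names; real settlement formats vary by bank.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # (transaction_id, amount_cents)


def parse_settlement(csv_text: str) -> List[Row]:
    """Parse the (hypothetical) settlement CSV into rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["transaction_id"], int(row["amount_cents"])) for row in reader]


def reconcile(ledger_rows: Iterable[Row], bank_rows: Iterable[Row]) -> Dict:
    """Compare the ledger to the bank's view and report every difference."""
    ledger = dict(ledger_rows)
    bank = dict(bank_rows)
    missing_in_ledger = sorted(bank.keys() - ledger.keys())
    missing_in_bank = sorted(ledger.keys() - bank.keys())
    amount_mismatch = sorted(
        txn for txn in ledger.keys() & bank.keys() if ledger[txn] != bank[txn]
    )
    delta = sum(ledger.values()) - sum(bank.values())
    return {
        "delta_cents": delta,
        "missing_in_ledger": missing_in_ledger,
        "missing_in_bank": missing_in_bank,
        "amount_mismatch": amount_mismatch,
        "clean": delta == 0 and not (
            missing_in_ledger or missing_in_bank or amount_mismatch
        ),
    }
```

The per-transaction comparison is there because a zero delta alone can hide offsetting errors: one missing transaction and one unexpected transaction of the same amount sum to zero while both are wrong.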

The Trade-off: Simplicity vs. Correctness

All of this adds code. You're writing idempotency checks. You're writing reconciliation jobs. You're handling DLQ review. You're logging more. You're thinking about failure modes that feel premature when your system is small.

Here's why you do it anyway: the cost of not doing it is higher than the cost of doing it, and it compounds.

A single undetected duplicate charge creates one angry customer. A pattern of drift creates audit findings. A reconciliation surprise at year-end creates a financial restatement. These are not theoretical costs; they're real operational drag.

Compare this with what I've seen on multi-tenant SaaS platforms: if one tenant's ledger drifts due to a queue failure, it affects only that tenant's settlement for that period. But if the master ledger drifts, it affects the entire system's trustworthiness. The difference is whether you can quarantine the problem or whether you have to tear everything apart to find it.

Monitoring the Things That Actually Matter

Standard queue monitoring tells you message count, processing rate, error count. None of that tells you if money is leaking.

What actually matters:

  • Idempotency key collision rate (how often a message with the same key is dequeued twice). This should be near zero. If it's not, your worker pool is crashing too often or your queue is re-delivering too aggressively.
  • DLQ accumulation rate (how many messages are moving to DLQ per day). This should be tracked separately from normal throughput. If it's growing, you have a systemic issue.
  • Reconciliation delta (ledger vs. bank export). This should be zero every reconciliation cycle. If it's not, log the exact window where drift occurred so you can trace it.
  • Duplicate hits in the idempotency table (keys that were presented more than once). This count should grow much more slowly than your main ledger. If it's growing fast, something is retrying more than it should.

These metrics tell you whether your queue is functioning as a financial control, not just as a task dispatcher.
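As a rough sketch, three of these numbers can be computed from data you likely already have. The `queue_health` function and its inputs (a log of delivered idempotency keys, a DLQ count, and a list of per-cycle reconciliation deltas) are assumptions of this example, not a standard monitoring interface.

```python
from collections import Counter
from typing import Dict, Iterable


def queue_health(delivered_keys: Iterable[str],
                 dlq_count: int,
                 reconciliation_deltas_cents: Iterable[int]) -> Dict[str, float]:
    """Summarise the financial-control metrics for one reporting period."""
    deliveries = Counter(delivered_keys)
    total = sum(deliveries.values())
    # Every delivery beyond the first for a given key is a redelivery.
    redelivered = sum(count - 1 for count in deliveries.values())
    return {
        "duplicate_delivery_rate": redelivered / total if total else 0.0,
        "dlq_count": dlq_count,
        "nonzero_reconciliation_cycles": sum(
            1 for delta in reconciliation_deltas_cents if delta != 0
        ),
    }
```

For example, four deliveries in which one key appears twice give a duplicate delivery rate of 0.25, which would be far too high in production and worth an alert.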

When to Add Complexity

You don't need all of this on day one. You need idempotency keys and unique constraints on day one. You need DLQ logic before you go to production. You need reconciliation before you have real money flowing. You add reconciliation automation after you've run it manually once and understood what you're looking for.

Most teams get this backwards. They build the whole system, go live, then discover they need reconciliation. Then they retrofit it. By that point, they're already dealing with a backlog of unexplained transactions.

The other direction: build the boring parts first. Idempotency, constraints, DLQ, manual reconciliation. Run it for a week at low volume. Only then add the automation and monitoring layers.

The Reality

Payment processing looks simple until it doesn't. The gap between "code that handles happy paths" and "system that handles financial correctness" is not architectural elegance. It's mechanical rigor.

Queue design in payment systems is not about throughput. It's about eliminating the categories of drift. Every mechanism I've described solves for a specific way money can leak: idempotency solves for duplicate processing, DLQ solves for permanent failures, reconciliation solves for undetected drift.

You build this once, and then it becomes boring. That's the goal.

Written by

Faiz Kasman

Software engineer in Kuala Lumpur. Payments, multi-tenant SaaS, and inventory infrastructure. Currently building the Shell Malaysia ParkEasy app.
