async-await task-queues data-loss production-ops

Async/Await in Production: When Your Task Queue Becomes a Footgun

Most backends start with fire-and-forget async tasks and lose data without knowing it. Here's the concrete difference between "it worked in testing" and actually guaranteed delivery, and why you'll rebuild this twice.

2026-06-14 · 4 min read

The shape of the problem

You write this:

// app.js
async function handlePayment(orderId) {
  const order = await db.orders.get(orderId);
  await stripe.charges.create({
    amount: order.total,
    currency: 'myr'
  });
  
  // Send confirmation email
  emailQueue.send({
    to: order.email,
    template: 'payment-received',
    orderId: order.id
  });
  
  return { success: true };
}

The HTTP response fires immediately. The email queues up. Your local tests pass, staging passes, and you deploy on a Tuesday afternoon. By Wednesday morning you have seventeen customers saying they never got their receipt email.

Here's what happened: your app crashed. Or the queue worker crashed. Or both died, the queue flushed on restart, and nobody told you. Meanwhile, the payment succeeded in Stripe and the database shows the order as paid, but nothing happened after that. The customer got no confirmation, and the support team is now manually checking each order.

This is not theoretical. This is the core shape of async task failure in production.

Why "fire-and-forget" betrays you

The mistake is not async/await itself. The mistake is the word "queue."

You name something a queue. It sounds like it persists. Like it's reliable. Like someone will deliver the message. But if you're doing this:

emailQueue.send({ ... });

And emailQueue is an in-memory array, or a simple Bull job pushed to a Redis server with no AOF enabled, or a message broker you haven't monitored in six months, you don't have a queue. You have a suggestion.

Real failure modes:

Server restarts. In-memory queue evaporates. All pending tasks gone.
Redis restarts or loses power with persistence disabled. Nothing on disk, nothing to recover.
Worker process crashes before the task completes. The task is dequeued but never processed.
You have multiple workers and no acknowledgment mechanism. Two workers grab the same job. One succeeds, one fails, nobody can tell.
The queue is running fine, but you have no visibility into it. You don't know tasks are failing because nobody's logging.

Without monitoring, all of these go undetected until a customer tells you.
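To make the first failure mode concrete, here is a minimal sketch of an in-memory queue. `InMemoryQueue` and `simulateRestart` are hypothetical names invented for illustration, not part of any library, and the "restart" is modeled by constructing a fresh instance the way a new process would.

```javascript
// Hypothetical in-memory "queue": pending jobs live only in this process's heap.
class InMemoryQueue {
  constructor() {
    this.jobs = [];
  }

  send(job) {
    this.jobs.push(job);
  }

  pendingCount() {
    return this.jobs.length;
  }
}

// Models a process restart: the new process builds its queue from scratch,
// so nothing from the old instance carries over.
function simulateRestart(_oldQueue) {
  return new InMemoryQueue();
}

let emailQueue = new InMemoryQueue();
emailQueue.send({ to: 'a@example.com', template: 'payment-received', orderId: 1 });
emailQueue.send({ to: 'b@example.com', template: 'payment-received', orderId: 2 });
const pendingBefore = emailQueue.pendingCount();

emailQueue = simulateRestart(emailQueue);
const pendingAfter = emailQueue.pendingCount();

console.log({ pendingBefore, pendingAfter }); // { pendingBefore: 2, pendingAfter: 0 }
```

No error is thrown and nothing is logged. The two receipts simply stop existing, which is why this failure stays invisible.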

The spectrum of reliability

Let's be concrete about what you're actually buying with different patterns.

Best-effort (what you probably have now)

// Enqueue the job and respond without waiting for it to be processed
await jobQueue.add('send-email', { orderId, email });
res.json({ success: true });

// Worker picks jobs up whenever it gets to them
async function emailWorker() {
  while (true) {
    const job = await jobQueue.next();
    try {
      await mailer.send(job.data);
    } catch (err) {
      console.error('Email failed:', err);
      // That's it. No retry, no record: the job is lost.
    }
  }
}

If the worker crashes, the job is gone. If the mailer fails, it's gone. You lose data. In my experience, this is what production looks like today for many small-to-medium backends.

The advantage: simple, fast, no overhead. The disadvantage: you're bleeding transactions.

Persistent queue with retries

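const Queue = require('bull');

const redis = { host: 'localhost', port: 6379 };
const queue = new Queue('emails', {
  redis,
  defaultJobOptions: {
    attempts: 3,
    backoff: { type: 'exponential', delay: 2000 }
  }
});
// Separate queue for jobs that exhausted their retries
const dlq = new Queue('emails-dead-letter', { redis });

await queue.add('send-email', { orderId, email });

// Named jobs need a processor registered under the same name
queue.process('send-email', async (job) => {
  try {
    await mailer.send(job.data);
  } catch (err) {
    // attemptsMade counts attempts that already failed, so add one for this attempt
    if (job.attemptsMade + 1 < job.opts.attempts) {
      throw err; // Bull retries after the backoff delay
    }
    // Final attempt failed: park it in the dead-letter queue, alert ops
    await dlq.add(job.data);
    throw err;
  }
});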

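Bull now stores the job in Redis (RabbitMQ or Temporal give you comparable durability with their own storage). If the worker crashes mid-job, the job is still in the queue and gets picked up again. Failed jobs retry with backoff. On the third failure, you've deliberately sent it somewhere you can see it. All of this holds only if Redis itself persists to disk.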

Cost: Redis overhead, network calls, slightly more latency. But you're catching real failures now.
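To get a rough sense of what `attempts: 3` with an exponential backoff buys you, the sketch below computes a doubling delay schedule. The formula (base delay doubled on each retry) is an assumption for illustration and may not match the exact schedule your queue library uses, so check its documentation. `backoffDelays` is a hypothetical helper.

```javascript
// Hypothetical helper: delay before each retry, doubling from a base delay.
// `attempts` counts the first try, so attempts = 3 means 2 retries.
function backoffDelays(baseDelayMs, attempts) {
  const delays = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelayMs * 2 ** (retry - 1));
  }
  return delays;
}

const delays = backoffDelays(2000, 3);
const totalWaitMs = delays.reduce((sum, d) => sum + d, 0);

console.log(delays, totalWaitMs); // [ 2000, 4000 ] 6000
```

Under this assumption the job gives up about six seconds after its first failure. That is short if the email provider is down for several minutes, so tune the attempt count and base delay to the length of outage you want to ride out.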

Guaranteed delivery with idempotency

This is where you stop losing data.

const queue = new PersistentQueue(db);

async function handlePayment(orderId) {
  const order = await db.orders.get(orderId);
  
  await stripe.charges.create({
    amount: order.total,
    currency: 'myr'
  });
  
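  // Record the task in the database before publishing, so there is something
  // to reconcile against if the publish or the worker fails.
  // Assumption: a unique index on idempotencyKey rejects a second task for the same order.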
  const taskId = generateId();
  await db.pendingTasks.insert({
    id: taskId,
    type: 'send-email',
    orderId: orderId,
    email: order.email,
    status: 'pending',
    createdAt: now(),
    idempotencyKey: `payment-${orderId}`
  });
  
  // Publish to queue
  await queue.publish('send-email', { taskId, orderId, email: order.email });
  
  return { success: true };
}

// Worker
queue.subscribe('send-email', async (message) => {
  const task = await db.pendingTasks.get(message.taskId);

  // No matching record, or already processed? Don't send again
  if (!task || task.status === 'completed') {
    return;
  }

  try {
    await mailer.send(message.email);
    await db.pendingTasks.update(message.taskId, { status: 'completed' });
  } catch (err) {
    if (shouldRetry(err)) {
      throw err; // Task stays 'pending'; the queue will redeliver
    }
    // Permanent failure: record it so it shows up in reconciliation
    await db.pendingTasks.update(message.taskId, { status: 'failed', error: err.message });
  }
});

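Now when you query your database, you can see exactly which payments have corresponding emails sent. If a worker crashes mid-email, the database still shows the task as pending, and the next attempt picks it up. If the message is delivered twice, the second run sees the task is already complete and skips it. Two gaps remain: a crash between the Stripe charge and the insert leaves no task record at all, and a crash after the email goes out but before the status update means the retry sends it again. What you have is at-least-once delivery with a record to reconcile against, which is the realistic meaning of "guaranteed" here.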

You can query this. You can reconcile. You can backfill. You know what's happening.
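The skip-if-completed check is easy to verify in isolation. Here is a minimal sketch with in-memory stand-ins: a `Map` plays the role of `db.pendingTasks` and an array plays the mailer. Both stand-ins and the `handleSendEmail` name are assumptions for illustration, not the real database or queue API.

```javascript
// In-memory stand-ins (assumptions): a Map for db.pendingTasks, an array for the mailer.
const pendingTasks = new Map();
const sentEmails = [];

function handleSendEmail(message) {
  const task = pendingTasks.get(message.taskId);
  if (!task) return 'unknown-task';
  if (task.status === 'completed') return 'skipped'; // duplicate delivery, do nothing

  sentEmails.push(message.email); // the side effect we must not repeat
  task.status = 'completed';
  return 'sent';
}

pendingTasks.set('t1', { id: 't1', type: 'send-email', status: 'pending' });
const message = { taskId: 't1', orderId: 42, email: 'a@example.com' };

const first = handleSendEmail(message);
const second = handleSendEmail(message); // the queue redelivers the same message

console.log(first, second, sentEmails.length); // sent skipped 1
```

This only covers duplicates that arrive one after another. Two workers handling the same message at the same moment can both read `pending`, so a real implementation would likely need an atomic claim, for example a conditional update that only succeeds while the status is still `pending`.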

When guaranteed delivery is overkill

Not everything needs this.

Logging and analytics? Fire-and-forget. If one Datadog event gets dropped, your business doesn't break.

Cache invalidation? Fire-and-forget. If a cache key doesn't clear, your TTL handles it.

A/B test assignment? Fire-and-forget. A few lost assignments won't hurt.

Payments, orders, user account changes, subscription status? Guaranteed delivery, or you're going to rebuild this module twice.

The rule: if the customer, your finance team, or compliance would care that it didn't happen, it needs persistence and idempotency.

The operational piece you can't skip

Persistence alone isn't enough. You need visibility.

Most teams build a queue, it works for six months, then they notice they've never looked at it. A job fails silently. A Redis key gets accidentally flushed. A worker dies because of a memory leak nobody debugged.

Add this immediately:

// Metrics on every job
queue.on('completed', (job) => {
  metrics.increment('queue.job.completed', { type: job.name });
});

queue.on('failed', (job, err) => {
  metrics.increment('queue.job.failed', { type: job.name });
  alerts.send(`Queue job failed: ${job.name} ${err.message}`);
});

// Depth monitoring
setInterval(async () => {
  const depth = await queue.count('pending');
  metrics.gauge('queue.pending_depth', depth);

  if (depth > 10000) {
    alerts.send('Queue backlog exceeding threshold');
  }
}, 30 * 1000);

// Periodic reconciliation
const ONE_HOUR_MS = 60 * 60 * 1000;
const FIVE_MINUTES_MS = 5 * 60 * 1000;

setInterval(async () => {
  const stale = await db.pendingTasks.findWhere({
    status: 'pending',
    createdAt: { $lt: now() - ONE_HOUR_MS } // assumes now() returns epoch milliseconds
  });

  if (stale.length > 0) {
    alerts.send(`${stale.length} tasks stuck for >1 hour`);
  }
}, FIVE_MINUTES_MS);

This is not optional. This is the part that turns "I built a queue" into "I know if my queue is working."
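The reconciliation rule itself is small enough to test without a database. A sketch, assuming tasks carry a `status` and a `createdAt` in epoch milliseconds; `findStaleTasks` is a hypothetical helper, and the sample tasks are made up.

```javascript
// Hypothetical helper: tasks still pending after maxAgeMs are considered stuck.
function findStaleTasks(tasks, nowMs, maxAgeMs) {
  return tasks.filter(
    (task) => task.status === 'pending' && nowMs - task.createdAt > maxAgeMs
  );
}

const ONE_HOUR_MS = 60 * 60 * 1000;
const nowMs = Date.UTC(2026, 5, 14, 12, 0, 0); // fixed clock so the result is repeatable

const tasks = [
  { id: 'a', status: 'pending', createdAt: nowMs - 2 * ONE_HOUR_MS },   // stuck
  { id: 'b', status: 'pending', createdAt: nowMs - 5 * 60 * 1000 },     // still fresh
  { id: 'c', status: 'completed', createdAt: nowMs - 3 * ONE_HOUR_MS }, // done, ignore
];

const staleIds = findStaleTasks(tasks, nowMs, ONE_HOUR_MS).map((t) => t.id);
console.log(staleIds); // [ 'a' ]
```

Passing the clock in as an argument keeps the rule deterministic, which makes it easy to unit-test the alert threshold before wiring it to a timer.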

Real example: where this bites you hardest

I've seen this exact pattern at payment processors: a charge succeeds, the thank-you email never arrives, the customer re-submits, now there are two charges. The team scrambles to reverse one charge while the customer is already getting angry. It took two hours to track down that the email worker had crashed and nobody was monitoring it.

Or consider a SaaS platform where subscription renewal gets queued but never fires. The customer's account expires quietly. They find out when they try to use it. Churn, support tickets, refund requests.

The smallest async task can become your highest-impact operational liability if you're not treating it like data.

How to migrate an existing backend

If you're already running best-effort queues and want to stop losing data, this is the sequence:

Add persistence to your queue first. If you're using Bull, enabling AOF on Redis is non-negotiable. Or switch to a system like Temporal, which persists workflow state in a database such as PostgreSQL, MySQL, or Cassandra. This alone stops most failures.
Add idempotency keys to your jobs and database records. Check before you act.
Add monitoring immediately. Depth, failure rate, stuck jobs. Alert on any of these going weird.
Backfill. Query your queue and your database side-by-side. Find all the jobs that exist in the queue but not in the database, or vice versa. This tells you what you've already lost.
Document your SLA for each queue. Some tasks can retry forever. Others have a 24-hour deadline. Make that explicit.
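The backfill step boils down to a set difference between two lists of IDs. A minimal sketch, assuming you can export job IDs from the queue and task IDs from the database as plain arrays; `diffQueueAndDb` and the sample IDs are hypothetical.

```javascript
// Hypothetical helper: compare IDs known to the queue with IDs known to the database.
function diffQueueAndDb(queueJobIds, dbTaskIds) {
  const inQueue = new Set(queueJobIds);
  const inDb = new Set(dbTaskIds);
  return {
    // Queued work with no database record: nothing to reconcile against later.
    missingFromDb: queueJobIds.filter((id) => !inDb.has(id)),
    // Recorded work the queue no longer holds: candidates for re-publishing.
    missingFromQueue: dbTaskIds.filter((id) => !inQueue.has(id)),
  };
}

const report = diffQueueAndDb(['j1', 'j2', 'j3'], ['j2', 'j3', 'j4']);
console.log(report);
// { missingFromDb: [ 'j1' ], missingFromQueue: [ 'j4' ] }
```

In practice you would probably restrict the database side to tasks still marked pending, since completed tasks are expected to be absent from the queue.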

This isn't something you do once and forget. Like observability and error budgets, queue management becomes an ongoing operational cost. The sooner you accept that, the sooner you stop losing customer data.

The bottom line

Async/await is not the problem. Fire-and-forget is.

Every time you push a task to a queue and immediately return success to the client without knowing for certain that the task will be processed, you're taking a bet. You're betting that the worker won't crash, that the queue won't flush, that nobody will force-restart the server.

On a Tuesday afternoon with light traffic, you win that bet. By Wednesday morning, you've lost it, and a customer is the one who tells you.

Build it once, build it right. Persistence, idempotency, monitoring. Then you can stop worrying about it, because you can see that it works.

Written by

Faiz Kasman

Software engineer in Kuala Lumpur. Payments, multi-tenant SaaS, and inventory infrastructure.
