Lunara · Faiz Kasman

A seller lists 50 units of a SKU on Shopee. Same SKU on TikTok Shop. Same on Lazada. One unit sells on Shopee, and the other two platforms need to know about it in under three seconds, or the next buyer on either platform just bought stock that no longer exists. That is the problem Lunara solves, and every architectural decision in the system flows from it.

Context

What Lunara is

Lunara is a micro-OMS (order management system) for Malaysian and SEA e-commerce sellers who operate across multiple marketplaces simultaneously. It maintains a single Master Stock figure per SKU and propagates any change outward to connected platforms through a priority queue. Shopee, TikTok Shop, and Lazada each get their own integration adapter with separate OAuth flows, rate-limit handling, and webhook endpoints.

The system is built on NestJS 10, PostgreSQL 15 with Prisma, Redis 7, and BullMQ 5. The full stack runs as a set of Docker Compose services behind Nginx, with Let's Encrypt for SSL. The frontend is a separate Next.js 14 app inside the same Compose network. There is no managed queue service or cloud-specific infrastructure, by design.

What makes this project distinct from JomJual is where the complexity lives. JomJual's hardest problems are multi-tenancy, financial state machines, and frontend DX across a monorepo. Lunara's hardest problems are coordination: how do you keep three platforms with different APIs, different rate limits, and different webhook schemas in sync with a single source of truth, without blocking, without over-locking, and without silently losing a job on a platform error?

Architecture

The queue is the product

Every stock change, every order event, every product import goes through BullMQ before any platform API is touched. The queue is not an implementation detail. It is the mechanism that makes the latency target achievable and makes failures recoverable without data loss.

BullMQ sits on top of Redis 7. Jobs are persisted in Redis, which means they survive a backend restart. Workers pick up jobs concurrently, with concurrency capped per queue. A job that fails with attempts remaining waits in the delayed set and is retried according to a backoff schedule that depends on the error class, not a single global retry policy. Only jobs that exhaust their attempts end up in the failed set.

The three queues map directly to business priority. inventory_update must be fast because a stock discrepancy on a live listing costs money. order_ingest must be reliable because missed webhooks create phantom reservations. product_sync can wait because importing a product listing is not time-sensitive.

queue.config.ts

// src/modules/queue/queue.config.ts (simplified)
export const QUEUE_CONFIGS = {
  inventory_update: {
    priority: 1,
    concurrency: 10,
    defaultJobOptions: { removeOnComplete: 100, removeOnFail: 500 },
  },
  order_ingest: {
    priority: 2,
    concurrency: 10,
    defaultJobOptions: { removeOnComplete: 100, removeOnFail: 500 },
  },
  product_sync: {
    priority: 3,
    concurrency: 2,
    defaultJobOptions: { removeOnComplete: 50, removeOnFail: 200 },
  },
} as const

01 · Three priority queues, three concurrency tiers

The queue setup is not arbitrary. Each tier reflects a different tradeoff between throughput and urgency.

inventory_update is P1 with 10 concurrent workers. A stock push to Shopee competes for the same Redis connection pool as a stock push to Lazada. 10 workers means up to 10 simultaneous outbound API calls, which is about the ceiling before platform rate limits start biting. This is time-sensitive work: a stale stock figure on a live listing is a live overselling risk.

order_ingest is P2 with 10 concurrent workers. Order webhooks arrive in bursts during sales events. 10 workers keeps the queue from backing up during spikes. Order processing is not slower than inventory sync, but it is less time-critical: a 5-second delay in processing a webhook does not cost a sale the way a 5-second delay in a stock update might.

product_sync is P3 with only 2 workers. Importing or updating product listings is a background operation. A seller importing their Lazada catalogue does not need it done in seconds. Two workers keeps this queue from consuming Redis throughput that inventory_update needs.

The concurrency caps act as a coarse ceiling on outbound calls, but they are not the rate limiter. Each platform enforces its API rate limit per shop, and 10 workers may be serving many shops at once, so the per-shop limit is enforced separately in the integration adapters rather than by concurrency alone.
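One way to picture the adapter-level limit is a token bucket keyed by shop. This is a minimal sketch, not Lunara's adapter code: the class name `ShopRateLimiter`, the capacity, and the refill rate are hypothetical, and the numbers are not real platform quotas. The clock is injectable so the behaviour can be checked without waiting.

```typescript
// Hypothetical per-shop token bucket. The limits are illustrative only and
// are not the real Shopee, TikTok Shop, or Lazada quotas.
type Bucket = { tokens: number; lastRefillMs: number }

class ShopRateLimiter {
  private buckets = new Map<string, Bucket>()
  private readonly capacity: number        // maximum burst per shop
  private readonly refillPerSecond: number // sustained calls per second per shop
  private readonly now: () => number

  constructor(capacity: number, refillPerSecond: number, now: () => number = () => Date.now()) {
    this.capacity = capacity
    this.refillPerSecond = refillPerSecond
    this.now = now
  }

  // Returns true if the call may go out now, false if the job should be delayed.
  tryAcquire(shopId: string): boolean {
    const nowMs = this.now()
    const bucket = this.buckets.get(shopId) ?? { tokens: this.capacity, lastRefillMs: nowMs }

    // Refill in proportion to elapsed time, never above capacity.
    const elapsedSeconds = (nowMs - bucket.lastRefillMs) / 1000
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSeconds * this.refillPerSecond)
    bucket.lastRefillMs = nowMs

    const allowed = bucket.tokens >= 1
    if (allowed) bucket.tokens -= 1
    this.buckets.set(shopId, bucket)
    return allowed
  }
}
```

Because each shop has its own bucket, one busy shop hitting its limit does not stall pushes for the others, which a worker concurrency cap alone cannot guarantee.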

02 · Retry schedule depends on the error class

Not every failure is the same. A 429 from Shopee means slow down. A 5xx means the platform is having trouble. A 401 means the seller's OAuth token expired or was revoked. These three situations have completely different appropriate responses.

Error	Retry Schedule
429 Rate Limit	5s, then 10s, then 30s
5xx Server Error	1m, then 5m, then 15m
401 Unauthorized	Disable shop, notify user (no retry)

The 429 backoff is tight because rate limits reset quickly for most SEA marketplace APIs. The 5xx backoff is longer because a platform having server trouble is unlikely to recover in 10 seconds. The 401 case does not retry at all: retrying an expired token just hammers the API, and the seller needs to reauthorise before any further jobs are useful. The shop gets flagged as disconnected and the user gets a notification.

retry.policy.ts

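// src/modules/queue/retry.policy.ts (simplified)
// `delays` is a per-attempt schedule in milliseconds. BullMQ's built-in
// 'fixed' backoff takes a single `delay`, so a schedule like this has to be
// applied through a custom backoff strategy on the worker.
export function getRetryOptions(error: PlatformError) {
  if (error.status === 429) {
    // Rate limited: short waits, the window resets quickly.
    return { attempts: 3, backoff: { type: 'fixed', delays: [5000, 10000, 30000] } }
  }
  if (error.status >= 500) {
    // Platform trouble: give it minutes, not seconds.
    return { attempts: 3, backoff: { type: 'fixed', delays: [60000, 300000, 900000] } }
  }
  if (error.status === 401) {
    // No retry. Disable shop and surface to user.
    return { attempts: 1, backoff: { type: 'fixed', delays: [] } }
  }
  // Anything else: one retry with exponential backoff.
  return { attempts: 2, backoff: { type: 'exponential', delay: 5000 } }
}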

This matters more than it might look. A naive global retry policy either hammers platforms during rate-limit windows or waits too long after a 429 that clears in five seconds.
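The retry table can also be expressed as a pure function from attempt number and HTTP status to a delay, which is the shape a BullMQ custom backoff strategy expects. The sketch below is an illustration under assumptions, not the project's code: `backoffDelayMs` and the constant names are hypothetical, `attemptsMade` is assumed to be 1 when the first retry is scheduled, and -1 is used as the "do not retry" value that BullMQ's custom strategies document (worth checking against the installed version). Note that BullMQ's `attempts` option counts the first try, so three retries need `attempts: 4`.

```typescript
// Delay schedules from the retry table, in milliseconds.
const RATE_LIMIT_DELAYS_MS = [5_000, 10_000, 30_000]
const SERVER_ERROR_DELAYS_MS = [60_000, 300_000, 900_000]

// Sentinel meaning "do not retry" (assumed to match BullMQ's -1 convention).
const NO_RETRY = -1

// attemptsMade is assumed to be 1 when the first retry is being scheduled.
function backoffDelayMs(attemptsMade: number, status: number): number {
  // Expired or revoked token: retrying cannot help, the shop must reauthorise.
  if (status === 401) return NO_RETRY

  const schedule =
    status === 429 ? RATE_LIMIT_DELAYS_MS :
    status >= 500 ? SERVER_ERROR_DELAYS_MS :
    undefined

  if (schedule === undefined) {
    // Fallback for other errors: exponential from 5s (5s, 10s, 20s, ...).
    return 5_000 * 2 ** (attemptsMade - 1)
  }

  // Past the end of the schedule there is nothing left to try.
  const delay = schedule[attemptsMade - 1]
  return delay === undefined ? NO_RETRY : delay
}
```

Keeping the schedule in one function means the table and the behaviour cannot drift apart, and the function can be unit-tested without Redis.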

03 · Soft reserve vs hard deduct

When an order arrives as PENDING, Lunara reserves the stock but does not deduct it. The Master Stock figure stays unchanged, and the reservation has a 60-minute expiry. If the order reaches CONFIRMED within that window, the reservation converts to a hard deduct: Master Stock is decremented, and a sync job is queued immediately to push the updated figure to all connected platforms. If the 60 minutes pass with no confirmation, the reservation expires and the stock is freed.

inventory.service.ts

// src/modules/inventory/inventory.service.ts (simplified)
async function handleOrderStatusChange(order: Order) {
  if (order.status === 'PENDING') {
    await softReserve(order.masterSkuId, order.quantity, {
      expiresInMinutes: 60,
      orderId: order.id,
    })
    // No sync queued. Platforms still show pre-order stock.
  }

  if (order.status === 'CONFIRMED') {
    await hardDeduct(order.masterSkuId, order.quantity)
    await queueInventoryUpdate(order.masterSkuId) // Push to all platforms
  }

  if (order.status === 'CANCELLED') {
    await releaseReservation(order.id)
    await queueInventoryUpdate(order.masterSkuId) // Restock + sync
  }
}

The reason this exists is abandoned carts. On SEA marketplaces, PENDING orders are common and many never confirm. If the system hard-deducted on PENDING, a seller with 10 units and 8 abandoned carts in flight would show 2 units on all platforms, even though 8 of those orders will never pay. The soft-reserve pattern keeps the listings accurate for real buyers while protecting against genuine overselling on confirmed orders.

The 60-minute window is a product decision, not a technical one. Long enough to accommodate slow bank transfers (FPX can take a few minutes to settle), short enough that genuine abandoned carts do not hold stock overnight.
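The reservation lifecycle described above (hold on PENDING, deduct on CONFIRMED, free on expiry) can be modelled in a few lines. This is an in-memory sketch for illustration only: the real system persists reservations in PostgreSQL, and `ReservationBook`, its method names, and the choice to return `false` for a confirmation that arrives after expiry are all assumptions, not Lunara's actual behaviour.

```typescript
// Hypothetical in-memory model of the reservation lifecycle.
type Reservation = { skuId: string; quantity: number; expiresAtMs: number }

class ReservationBook {
  private masterStock = new Map<string, number>()
  private reservations = new Map<string, Reservation>() // keyed by order id
  private readonly now: () => number

  constructor(now: () => number = () => Date.now()) {
    this.now = now
  }

  setMasterStock(skuId: string, units: number): void {
    this.masterStock.set(skuId, units)
  }

  getMasterStock(skuId: string): number {
    return this.masterStock.get(skuId) ?? 0
  }

  // PENDING: record the hold. Master Stock is deliberately left untouched.
  softReserve(orderId: string, skuId: string, quantity: number, expiresInMinutes = 60): void {
    const expiresAtMs = this.now() + expiresInMinutes * 60_000
    this.reservations.set(orderId, { skuId, quantity, expiresAtMs })
  }

  // CONFIRMED: convert a live hold into a hard deduct.
  // Assumption: a missing or expired hold returns false for separate handling.
  confirm(orderId: string): boolean {
    const hold = this.reservations.get(orderId)
    if (hold === undefined || hold.expiresAtMs <= this.now()) return false
    this.reservations.delete(orderId)
    this.masterStock.set(hold.skuId, this.getMasterStock(hold.skuId) - hold.quantity)
    return true
  }

  // Periodic sweep: drop holds past their expiry and report how many were freed.
  sweepExpired(): number {
    const nowMs = this.now()
    let freed = 0
    this.reservations.forEach((hold, orderId) => {
      if (hold.expiresAtMs <= nowMs) {
        this.reservations.delete(orderId)
        freed += 1
      }
    })
    return freed
  }
}
```

The point the model makes is that Master Stock only ever moves in `confirm`, so an abandoned cart never changes the figure the platforms see.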

3 · integrated marketplace platforms (Shopee, TikTok Shop, Lazada)
3 · priority queue tiers with independent concurrency caps
<3s · target sync latency per platform push
60m · soft-reserve window before auto-expiry
4 · internal order statuses mapped across all three platform schemas
5m · platform adjustment confirmation window before propagating seller-side edits

Learnings

Q3 2023
The in-memory mutex for preventing concurrent stock updates on the same Master SKU works fine on a single process. It will not work at all if the backend ever runs as more than one replica. The right solution is conditional writes in the database, or distributed locks via DynamoDB or Redis SETNX. Noted for when the scale warrants it.
Q3 2023
Firebase Auth timeout was set to return null on failure, which means the frontend treats a network hiccup the same as a logged-out user. This fails open. The correct pattern is a discriminated result type so the caller can distinguish 'not authenticated' from 'authentication check failed'. Left as a known issue.
Q4 2023
The MigrationConsent entity accumulated 12 fields for an email workflow that could have been a separate EmailLog table from the start. The entity now carries logic that does not belong to it. A normalisation pass is needed but not yet done. Classic schema bloat from iterating fast without stepping back.
Q1 2024
Per-error retry schedules were added after the first round of live platform testing showed that a single global backoff was either too aggressive (hammering Shopee during a 429 window) or too slow (waiting 5 minutes after a 429 that cleared in 5 seconds). The fix was obvious in retrospect; it should have been in the initial queue design.
Q2 2024
The demo mode (admin@trylunara.app) simulates OAuth connections and seeds synthetic data without touching real platform APIs. This was the right call: real platform OAuth in a demo context creates fragile dependencies on external credentials. The tradeoff is that the demo does not show real webhook latency, which is the thing most worth showing. Still working out how to demonstrate that without a real account.
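For reference, the in-memory mutex mentioned in the first learning can be as small as a promise chain per Master SKU. The sketch below is a generic keyed mutex, not the project's implementation; `KeyedMutex` and `runExclusive` are hypothetical names. It also shows why the approach stops at one process: the `Map` holding the chains lives in that process's memory, so a second replica would have its own, unaware of the first.

```typescript
// Generic per-key mutex: tasks for the same key run one at a time, in order.
// Tasks for different keys do not block each other.
class KeyedMutex {
  // Tail of the promise chain for each key. Lives in this process only.
  private tails = new Map<string, Promise<void>>()

  async runExclusive<T>(key: string, task: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(key) ?? Promise.resolve()

    // `current` resolves when this task is finished, success or failure.
    let release: () => void = () => {}
    const current = new Promise<void>((resolve) => {
      release = resolve
    })

    // The next caller for this key waits on everything queued so far.
    const tail = previous.then(() => current)
    this.tails.set(key, tail)

    await previous
    try {
      return await task()
    } finally {
      release()
      // Clean up if nobody queued behind us, so the map does not grow forever.
      if (this.tails.get(key) === tail) this.tails.delete(key)
    }
  }
}
```

A read-modify-write on stock wrapped in `runExclusive(masterSkuId, ...)` cannot interleave with another update to the same SKU in the same process. Across replicas, the conditional write or distributed lock named in the learning is still required.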

FAQ

Why BullMQ over SQS or another managed queue?

BullMQ runs on the same Redis instance already in the stack for caching. Adding SQS would mean another AWS dependency, per-message cost, and higher network latency on every job dispatch. For a single-node setup targeting sub-3-second sync, the added hop is not free. BullMQ also has built-in job priorities, per-job retry configuration, and good observability through Bull Board. The tradeoff is that Redis becomes a single point of failure for the queue, which is fine at current scale and would need addressing with Redis Sentinel or Cluster at higher load.

Why NestJS over Express or Fastify?

NestJS's module system and dependency injection were the deciding factors. Each integration (Shopee, TikTok, Lazada) is its own NestJS module with its own adapter, and the DI container handles wiring them into the queue workers without manual factory functions. The built-in Guards make JWT and role-based auth clean to add per-endpoint without middleware spaghetti. The tradeoff is boilerplate: NestJS generates more files per feature than Express. Worth it for a project where each new platform integration needs the same auth, retry, and logging structure.

Why Docker Compose and not Kubernetes?

Kubernetes is correctly sized for multi-node deployments with horizontal autoscaling. Lunara currently runs on a single VPS. Docker Compose brings the full stack up with one command, makes local development identical to production, and has no control-plane overhead. The honest answer is also: Kubernetes would take longer to configure correctly than the current scale justifies. When the in-memory mutex needs to become a distributed lock (i.e., when multiple backend replicas are needed), that is the moment to revisit.

What happens when Shopee or TikTok is down?

Failing jobs are retried according to the 5xx schedule (1m, 5m, 15m), waiting in BullMQ's delayed set between attempts. If all retries are exhausted, the job lands in the dead-letter zone and a FailedSync record is written to the database. The seller sees the failed sync in their activity log. When the platform recovers, a reconciliation pass can replay failed syncs from the FailedSync table. The Master Stock figure is always the source of truth; the platform figures are eventually consistent with it.
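Because Master Stock is the source of truth, a reconciliation pass does not need to replay each failed payload. It can collapse the failures into one push per shop and SKU carrying the current figure. The sketch below shows that idea under assumptions: the `FailedSync` fields, the `ReplayJob` shape, and `planReplay` are hypothetical, since the actual table schema is not shown here.

```typescript
// Hypothetical shapes. The real FailedSync columns are not shown in this write-up.
type FailedSync = { id: number; shopId: string; masterSkuId: string }
type ReplayJob = { shopId: string; masterSkuId: string; quantity: number; supersedes: number[] }

// Collapse failed syncs into one job per (shop, SKU), using the current
// Master Stock figure instead of whatever the failed job originally carried.
function planReplay(failed: FailedSync[], masterStock: Map<string, number>): ReplayJob[] {
  const byTarget = new Map<string, ReplayJob>()

  for (const record of failed) {
    const quantity = masterStock.get(record.masterSkuId)
    if (quantity === undefined) continue // SKU no longer tracked: nothing to push

    const key = record.shopId + ":" + record.masterSkuId
    const existing = byTarget.get(key)
    if (existing !== undefined) {
      // Same target already planned: this record is covered by that push.
      existing.supersedes.push(record.id)
    } else {
      byTarget.set(key, {
        shopId: record.shopId,
        masterSkuId: record.masterSkuId,
        quantity,
        supersedes: [record.id],
      })
    }
  }

  return Array.from(byTarget.values())
}
```

After a long outage this turns many stale records into a handful of pushes, and each push is correct by construction because it reads the figure at replay time.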