A national schools competition with a brutal shape: tens of thousands of students, one MCQ quiz each, inside a hard daily window that opens at 8am and slams shut at 10pm. A whole geographic zone starts within minutes of each other, so the system has to survive a stampede, rank everyone deterministically even when thousands tie on a timeout, hand each finisher a certificate, and let non-technical organisers run the event without a developer on call. The integrity has to hold because there are prizes attached. Built solo on Next.js and Postgres, load-tested to ten thousand concurrent before launch.
Context
What it is
A time-boxed online quiz-competition platform for a nationwide public-sector awareness campaign aimed at school students. Participants sign in with their school-issued education email, sit a single 40-question multiple-choice quiz inside a strict daily window assigned to their geographic zone, optionally submit a short slogan as a tiebreaker, then download a participation certificate and view their ranking. A separate admin console manages the question bank, schedules per-zone windows, watches live participation, manages admin users, and exports ranked results for the organising committee. Two categories run in parallel.
The event runs over five days, one day per zone, with everyone in a zone competing in the same 8am-to-10pm window and a 60-minute personal limit once they start. That shape drove almost every technical decision: it is simultaneously a fairness problem, a self-service problem, and a load problem.
Architecture
Postgres as the source of truth
The application is Next.js 16 on Vercel, but the domain logic lives almost entirely in Postgres — nineteen SQL migrations of SECURITY DEFINER functions, accessed only through a server-side service-role client, with functions pinned to the Singapore region to sit next to the database. The app tier is thin on purpose: every state transition that matters (start an attempt, save an answer, finalise a score) is a single database round-trip, because under a stampede the cheapest request is the one that does not happen.
Identity is split. Participants get no auth account — instead a tamper-proof signed JWT (HS256 via jose) in an httpOnly cookie, issued at email entry, carries their identity for eight days. Admins use real Supabase Auth mapped to an admin-users profile. That split keeps the participant path stateless and cheap while the admin path gets full account security.
01 · Default-deny row-level security that cannot regress
The integrity model is default-deny. Every table has row-level security enabled with zero policies for the anon and authenticated roles — the service role bypasses RLS, and nothing else gets in. The risk with that pattern is the obvious one: someone adds a new table later and forgets to enable RLS, and now there is a hole. So an event trigger enables RLS automatically on any new table.
-- Any table created later is locked down the moment it exists.
create or replace function force_rls()
returns event_trigger
language plpgsql as $$
declare obj record;
begin
for obj in
select * from pg_event_trigger_ddl_commands()
where command_tag = 'CREATE TABLE'
loop
execute format('alter table %s enable row level security', obj.object_identity);
end loop;
end $$;
create event trigger on_create_table
on ddl_command_end when tag in ('CREATE TABLE')
execute function force_rls();All the real logic is in SECURITY DEFINER functions with EXECUTE revoked from the public roles, which closes the PostgREST griefing vector entirely. Forgetting a policy fails closed, not open. It is the inverse of the usual default where a missed WHERE clause leaks the whole table.
02 · Deterministic ranking when thousands tie
Ranking sounds trivial until you remember that a hard cutoff means thousands of students can finish with the identical completion time. Anyone who runs out the clock shares exactly 3,600,000 ms. The ranking view orders by score descending, then completion time ascending — and originally that was it, using a plain window function.
row_number() over (
partition by category
order by score desc,
completion_time_ms asc,
attempt_id -- the fix
)The bug surfaced in the export. With thousands of identical (score, time) pairs, row_number() is non-deterministic across queries, so the committee's chunked, paginated CSV export was duplicating some students and skipping others between pages. Appending the unique attempt_id as a final tiebreaker gives a total, stable order — the same every time the query runs — which is what makes paginated export correct. The human-facing tiebreaker, the submitted slogan, decides genuine ties for prizes; the attempt_id just guarantees the machine ordering is reproducible.
03 · Engineering for the stampede
The whole load profile is one spike per day. A few mitigations stack up to absorb it. Saving an answer is a single upsert RPC that re-checks the deadline server-side, replacing an earlier two-call read-then-write. The answer-finalise path is a set-based job run by pg_cron every minute, so abandoned attempts get scored and closed without anyone polling. Rate limiting on the result/certificate identity gate is a DB-backed atomic fixed-window counter that fails open, keyed per target email so it is safe behind shared school NATs.
Time is handled by construction rather than by hope. Vercel runs in UTC, the windows are in Malaysia time, and there is no JavaScript date math anywhere in the gating — absolute instants are compared against the window table inside Postgres, with an inclusive start and an exclusive end. A dev-only clock that advances with real elapsed time made the whole timed flow, including the 10pm cutoff, testable weeks before the event. The whole thing was then run against a k6 harness sized for ten thousand concurrent users, behind a secret-gated test endpoint that 404s unless explicitly enabled.
04 · The last-super-admin you cannot delete
A small one, but it is the kind of safety property that saves an event. There has to always be at least one active super-admin, or the organisers lock themselves out mid-competition. A friendly check in the application handles the message, but the real guarantee is a BEFORE UPDATE OR DELETE trigger that atomically counts the remaining active super-admins and raises if the operation would take the count to zero. That makes it race-proof against two admins demoting each other at once, and proof against someone running the UPDATE directly in SQL.
10k
concurrent users in the load target (p95 < 2s, < 1% errors)
19
SQL migrations; the domain lives in the database
40
questions drawn per attempt, quota-balanced by difficulty
1
attempt per person, enforced by a unique constraint
0
anon/authenticated RLS policies: default-deny
5 days
5 zones, one window per day, 8am to 10pm
Learnings
- Default-deny RLS plus an auto-enable event trigger means a forgotten policy fails closed. Combined with revoking EXECUTE on the definer functions from anon/authenticated, it closes the PostgREST direct-access vector. The usual model leaks on a missed WHERE; this one leaks on nothing unless you explicitly open it.
- Ranking under mass ties needs a deterministic tiebreaker even when humans will never see it. Thousands of timeout students share an identical completion time, and non-deterministic row_number() was duplicating and skipping rows across paginated export pages. A unique attempt_id appended to the sort fixed the export, not the leaderboard.
- For a once-a-day stampede, the cheapest request is the one you delete. Collapsing answer-save to a single upsert RPC, finalising abandoned attempts with a set-based cron job, failing rate limits open, and colocating compute with the database were each a few percent that added up to surviving the spike.
- Do timezone-sensitive gating in the database with absolute instants, never with JavaScript date math on a UTC server. A dev clock that advances in real time let me test the 10pm cutoff and the 60-minute limit weeks before the event, which is the only reason I trusted them on the day.
FAQ
- Why no auth accounts for participants?
- Because a school-issued education email already is the identity, and creating a full auth account per student would add a stateful, rate-limited dependency on the exact path that takes the stampede. A signed httpOnly cookie (a JWT issued at email entry) carries identity statelessly and cheaply, with the one-attempt-per-person rule enforced by a unique constraint in the database rather than by a session. Admins, who are few and need real account security, use full auth.
- Why put so much logic in Postgres instead of the application?
- Three reasons. Correctness: the deadline re-check, the one-attempt constraint, the quota-balanced question draw, and the ranking all have to be authoritative, and the database is the one place every path goes through. Load: a single SECURITY DEFINER RPC is one round-trip, which matters when a whole zone starts at once. Integrity: with row-level security default-deny and EXECUTE revoked from public roles, there is no way to reach the data except through the functions I wrote.
- How do you know it holds at 10,000 concurrent?
- It was load-tested there. A k6 harness, gated behind a secret endpoint that 404s unless explicitly enabled on non-production, drove smoke, ramp, and a full 60-minute acceptance run at the target concurrency against the service-level objectives (p95 under 2 seconds per action, error rate under 1%). The single-RPC answer path and region colocation came directly out of what those runs showed.