During a routine revenue review, one set of session records looked off: users had plugged in, charged for 30 minutes or more, and were billed 0 credits. The chargers had genuinely delivered energy. The money just had not been collected. That was the entry point into a six-figure discrepancy, an automated clawback system, and eventually a wider set of platform engineering work spanning dynamic pricing, a 50,000-user migration with Stripe refunds, and a health monitoring layer built on AWS Lambda and CloudWatch.
Context
What Shell ParkEasy is
Shell Malaysia operates EV charging infrastructure under two product surfaces: ParkEasy, the original platform that handled EV session management and credits, and Shell Recharge, its successor. I joined as a software developer in October 2024 with the platform mid-transition. The work spans both surfaces: patching billing correctness on the live system, building the migration path for 50,000 credit-holding users, and extending the platform with new features while the transition runs.
The platform's data architecture uses DynamoDB for configuration and entity storage, Firebase Realtime Database for live hardware state, and Longship as the SaaS layer that bridges the Shell platform to the physical charger hardware. Longship captures raw session telemetry including energy delivered, and the platform billing engine reads from it. AWS Lambda and CloudWatch handle scheduled jobs and health monitoring. Stripe handles credit refund processing for the migration path.
What I can discuss publicly is mechanism and outcome. Shell's NDA covers system internals, partner-level detail, and production configuration specifics. Where numbers appear in this case study, they come from approved public claims: the six-figure recovery figure, the ~60% clawback rate, and the ~30% off-peak utilisation lift. Everything else is described at the level of architecture and decision-making.
Architecture
How the platform is structured
The platform runs across two databases with a clear division of responsibility. DynamoDB holds configuration, entity records, and session history. Firebase Realtime Database holds live hardware state: charger availability, session status, and telemetry as it streams from the field. Longship sits between the physical hardware and the platform, maintaining the canonical record of what each charger actually delivered in a session.
Scheduled jobs run on AWS Lambda, triggered via CloudWatch schedules. This includes health monitoring across the platform's codebases, dynamic pricing configuration rotation, and periodic reconciliation tasks. The EC2 layer runs the application servers, and S3 holds configuration files that drive runtime behaviour for features like dynamic pricing.
The billing engine reads session records from DynamoDB, cross-referencing Longship's telemetry to calculate energy delivered and compute the charge owed. The 0kWh clawback work exposed a timing dependency in this read path, detailed below.
01 · The 0kWh clawback
The discrepancy surfaced during a routine revenue review. Session records showed users who had charged for 30 minutes or more being billed 0 credits. Cross-referencing those records against Longship confirmed the chargers had genuinely delivered energy. The problem was on the billing side.
The root cause was a sync-timing issue. The billing engine computed session cost against totalKwh, a field that Longship updates on session close. If the billing calculation ran before Longship had fully synced the session to its final state, the field read as zero. The fix was to switch the cost calculation to the meterValue of the latest entry in the chargingPeriods array. That value is updated incrementally during the session, so it reflects actual energy delivered even if the session record has not fully closed on the Longship side.
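The read-path change can be sketched roughly as follows. This is a minimal illustration, not the production code: only totalKwh and chargingPeriods.meterValue come from the description above, while the record shape, the other field names, the creditsPerKwh rate, and the assumptions that the array is in chronological order and that meterValue is cumulative are all hypothetical.

```typescript
// Hypothetical shape of a session record. Only totalKwh and
// chargingPeriods[].meterValue are named in the text; the rest is assumed.
interface ChargingPeriod {
  startedAt: string; // ISO timestamp (assumed field)
  meterValue: number; // assumed to be cumulative kWh at this point in the session
}

interface SessionRecord {
  sessionId: string;
  totalKwh: number; // can still read 0 before the session is finalised upstream
  chargingPeriods: ChargingPeriod[]; // assumed to be in chronological order
}

// Prefer the latest incremental meter reading; fall back to totalKwh
// only when no usable reading exists.
function energyDeliveredKwh(session: SessionRecord): number {
  const latest = session.chargingPeriods[session.chargingPeriods.length - 1];
  if (latest !== undefined && latest.meterValue > 0) {
    return latest.meterValue;
  }
  return session.totalKwh;
}

// creditsPerKwh is a placeholder flat rate, not the platform's real tariff logic.
function sessionCostCredits(session: SessionRecord, creditsPerKwh: number): number {
  return energyDeliveredKwh(session) * creditsPerKwh;
}
```

With this shape, a session whose totalKwh is still 0 but whose last charging period reports energy is priced from the meter reading rather than billed at zero.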
Scoping the historical damage required writing an analysis script and running it month by month across the platform's lifetime. A single-month sample surfaced around RM 20,000 in discrepancies. The full sweep reached six figures.
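A month-by-month sweep of this kind might look like the sketch below. It is an illustration under assumptions, not the actual analysis script: the BilledSession shape, the flat creditsPerKwh rate, and the function name are hypothetical, and the input is assumed to already pair each billed session with the energy the charger reported.

```typescript
// Hypothetical input row: one closed session with the energy reported by
// the charger and the credits the platform actually billed.
interface BilledSession {
  closedAt: string; // ISO timestamp, e.g. "2024-03-14T09:30:00Z"
  energyKwh: number;
  billedCredits: number;
}

// Sum the under-billed amount per calendar month ("YYYY-MM").
// Sessions billed at or above the expected amount are ignored.
function underbilledByMonth(
  sessions: BilledSession[],
  creditsPerKwh: number, // placeholder flat rate (assumption)
): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const s of sessions) {
    const gap = s.energyKwh * creditsPerKwh - s.billedCredits;
    if (gap <= 0) continue;
    const month = s.closedAt.slice(0, 7);
    totals[month] = (totals[month] ?? 0) + gap;
  }
  return totals;
}
```

Grouping by month keeps each run small enough to spot-check against the source records before moving on to the next period.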
The clawback system deducted owed amounts from user balances automatically. Approximately 60% of the outstanding amount was recovered this way. The remaining 40% resulted in negative balances, which were handled with documented communications and evidence trails to maintain a clean audit record.
The remaining ~40% could not be clawed back automatically because of balance coverage: some accounts did not hold enough credits to absorb the full deduction. The decision to document and communicate those cases rather than silently write them off was driven by compliance requirements. A clean audit trail mattered more than fully automated resolution.
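The deduction logic can be illustrated with a small pure function. This is a sketch under assumptions: the names are hypothetical, and it assumes the full owed amount is deducted so that an under-funded account ends up with a negative balance and is flagged for the documented follow-up path.

```typescript
interface ClawbackOutcome {
  recovered: number; // credits actually collected from the existing balance
  newBalance: number; // negative when the balance could not cover the debt
  shortfall: number; // amount left outstanding after the deduction
  needsFollowUp: boolean; // true when the case goes to documented communication
}

// Deduct the owed amount in full and report how much of it was
// covered by credits the user actually held.
function applyClawback(balance: number, owed: number): ClawbackOutcome {
  const available = Math.max(balance, 0);
  const recovered = Math.min(available, owed);
  const shortfall = owed - recovered;
  return {
    recovered,
    newBalance: balance - owed,
    shortfall,
    needsFollowUp: shortfall > 0,
  };
}
```

Keeping recovered and shortfall as separate fields is what makes a recovery rate and an audit trail straightforward to report from the same run.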
02 · 50,000-user migration with Stripe refunds
The ParkEasy sunset required moving approximately 50,000 credit-holding users to Shell Recharge or refunding their balances via Stripe. Original scope was estimated at 3 months. It shipped in approximately 1 month.
The timeline compression came from a scope argument made early: the first wave of users did not require a fully custom admin dashboard built from scratch. Existing internal tooling could handle initial CSV-based processing while the dedicated ops interface was built in parallel. That argument was accepted, which freed the first month for building the core flow rather than infrastructure.
The end-to-end system covered:
- A multi-step consent form with OTP verification, using phone number as the primary identifier.
- Account freezing on consent submission to prevent further top-ups.
- A CSV-based migration path handed off to the Shell Recharge team for balance transfer.
- An automated Stripe refund path with a manual transfer fallback for edge cases.
- A real-time ops dashboard showing opted-in, pending, migrated, and refunded counts live.
- A refund forecasting module tracking daily obligations against Stripe balance to ensure the 14-day SLA was met.
- A transactional email pipeline with four templates covering notice and confirmation for both migration and refund paths.
- An email monitoring section showing delivery status and failure counts with automatic retries.
The Stripe refund path is a state machine: pending (refund initiated) to processing (Stripe accepted it) to completed (funds disbursed) or failed (retry queued). Stripe's refund API is asynchronous. A successful API call does not mean the refund is done. The state machine handles the gap, with retries on failure and visibility into stuck transitions in the ops dashboard.
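A minimal sketch of that state machine is shown below. The four state names come from the description above; the Refund shape, the attempts counter, the failed-to-pending retry edge, and the isStuck helper are assumptions made for illustration, and nothing here calls Stripe.

```typescript
type RefundState = "pending" | "processing" | "completed" | "failed";

interface Refund {
  id: string;
  state: RefundState;
  attempts: number; // how many times the refund has been queued (assumed field)
  updatedAt: number; // epoch ms of the last transition (assumed field)
}

// Legal transitions. The failed -> pending edge models "retry queued".
const ALLOWED: Record<RefundState, RefundState[]> = {
  pending: ["processing"],
  processing: ["completed", "failed"],
  failed: ["pending"],
  completed: [],
};

// Return a new record in the next state, or throw on an illegal jump.
function transition(refund: Refund, next: RefundState, now: number): Refund {
  if (ALLOWED[refund.state].indexOf(next) < 0) {
    throw new Error(`Illegal refund transition: ${refund.state} -> ${next}`);
  }
  const attempts = next === "pending" ? refund.attempts + 1 : refund.attempts;
  return { ...refund, state: next, attempts, updatedAt: now };
}

// A refund that has sat in a non-terminal state longer than maxAgeMs
// is a candidate for the "stuck" view on an ops dashboard.
function isStuck(refund: Refund, now: number, maxAgeMs: number): boolean {
  return refund.state !== "completed" && now - refund.updatedAt > maxAgeMs;
}
```

Making the transition table explicit means a refund can never be marked completed straight from pending just because the initial API call returned successfully.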
The load-testing miss is worth naming directly. The email pipeline hit throughput issues on the first large batch send. The issue was caught and fixed before it caused SLA failures, but it was a fix made on the fly rather than pre-empted. The pipeline should have been load-tested at migration-scale batch volumes before the first send went out.
03 · Frontend transition: ChargeNow rebuild and consent form UX
The transition to Shell Recharge was as much a frontend problem as a backend one. Users had to keep charging sessions running, keep topping up credits, and keep receiving receipts while the systems underneath them got rearranged. Two pieces of user-facing work mattered most.
The ChargeNow flow in the React webapp got a structural rebuild between May and June 2025. The old version conflated locations and bays into a single scrollable list, which made the walk-up experience cumbersome at larger sites. The rebuild split the flow into three explicit steps: a location list fetched on entry, bay availability lazy-loaded per selected location (one targeted fetch instead of fetching every bay at every site), then a confirmation step that locks the chosen bay for the user. The lock prevents race conditions where two users pick the same bay simultaneously. After confirmation, a two-timer pattern runs the session screen: a 1-second local clock for smooth duration display, and a 5-second API poll for telemetry. A single 5-second interval for both would have made the UI feel sluggish. A finite state machine with a VIEW enum replaced an earlier pattern that allowed illegal UI states to exist on screen.
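The two ideas at the end of that paragraph can be sketched together. This is an illustration only: the VIEW member names, the transition table, and the helper names are hypothetical, and the real component wiring (React state, cleanup on unmount) is left out.

```typescript
// Hypothetical VIEW members for the three steps plus the session screen.
enum VIEW {
  LOCATIONS = "LOCATIONS",
  BAYS = "BAYS",
  CONFIRM = "CONFIRM",
  SESSION = "SESSION",
}

// Which views may follow which. Anything not listed is an illegal UI state.
const LEGAL_NEXT: Record<VIEW, VIEW[]> = {
  [VIEW.LOCATIONS]: [VIEW.BAYS],
  [VIEW.BAYS]: [VIEW.LOCATIONS, VIEW.CONFIRM],
  [VIEW.CONFIRM]: [VIEW.BAYS, VIEW.SESSION],
  [VIEW.SESSION]: [VIEW.LOCATIONS],
};

// Ignore requests that would jump to a view the current one cannot reach.
function nextView(current: VIEW, requested: VIEW): VIEW {
  return LEGAL_NEXT[current].indexOf(requested) >= 0 ? requested : current;
}

// Two-timer pattern: a 1-second local tick for the duration display and a
// 5-second poll for telemetry. Returns a function that stops both timers.
function startSessionTimers(onTick: () => void, onPoll: () => void): () => void {
  const clock = setInterval(onTick, 1_000);
  const poll = setInterval(onPoll, 5_000);
  return () => {
    clearInterval(clock);
    clearInterval(poll);
  };
}
```

In a component, the stop function returned by startSessionTimers would typically be called when the session screen is left, so neither interval outlives the view that started it.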
Three supporting changes landed alongside the core rebuild. A centralised error handler replaced the scatter of per-endpoint error handling, so the same error class produced consistent UI treatment across the flow. The signup path was integrated directly into the charge flow so a first-time walk-up user could create an account mid-flow without being bounced out to a separate registration page. A local-development mock mode was added so the UI could be iterated without real charger hardware, because full end-to-end testing on live hardware is slow and expensive.
The webapp's API client was migrated from direct axios calls to a typed v2 API utility. Auto-generated TypeScript types from the backend OpenAPI schema propagated through the webapp, so a backend contract change surfaced as a compile error in the frontend rather than a runtime surprise. The migration was incremental: endpoints moved to the v2 util one at a time as their consumer screens got touched, rather than as a single big-bang swap.
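The shape of such a typed utility might look like the sketch below. Everything in it is hypothetical: the ApiV2 interface stands in for the generated types, the endpoint keys and response shapes are invented for illustration, and the transport is injected so the example does not depend on axios or any real backend.

```typescript
// Hypothetical slice of the types generated from the backend OpenAPI schema.
interface ApiV2 {
  "GET /locations": { id: string; name: string }[];
  "GET /locations/:id/bays": { bayId: string; available: boolean }[];
}

// Injected transport (assumption) so the sketch is independent of axios or fetch.
type Transport = (method: string, path: string) => Promise<unknown>;

// Build a caller whose endpoint argument and return type are both
// checked against the ApiV2 contract at compile time.
function createApiV2(transport: Transport) {
  return function call<K extends keyof ApiV2>(
    endpoint: K,
    params: { [name: string]: string } = {},
  ): Promise<ApiV2[K]> {
    const space = endpoint.indexOf(" ");
    const method = endpoint.slice(0, space);
    const path = endpoint
      .slice(space + 1)
      .replace(/:(\w+)/g, (_match: string, name: string) =>
        encodeURIComponent(params[name] ?? ""),
      );
    // The cast is the one trusted boundary: the schema says what comes back.
    return transport(method, path) as Promise<ApiV2[K]>;
  };
}
```

Under this design, a call to an endpoint key that is not in ApiV2, or code that reads a field the schema no longer declares, is rejected by the compiler once the types are regenerated.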
The migration consent form went through several rounds of copy and UX iteration between October 2025 and February 2026. Small things: internal terminology was aligned with what users actually see (the backend used "balance", the user-facing surface uses "credits"), button copy was tested against what users responded to, and consent-close handling was refined to catch accidental closes without losing in-progress form data. None of those changes is architecturally interesting on its own. In aggregate, they are the difference between a migration flow that loses a meaningful fraction to form abandonment and one that does not.
The frontend work is the quieter layer during a platform transition. Backend reliability carries the structural load. What the frontend determines is whether users feel carried through the change or dropped into the middle of it.
04 · Dynamic pricing pilot
The hypothesis was simple: time-of-day pricing would improve utilisation at underperforming charging sites without hurting revenue at high-performing ones.
The experiment ran across 5 sites: 2 high-performing, 2 low-performing, and 1 control. The control site was necessary to isolate the pricing effect from seasonal variation and unrelated demand shifts. Pricing windows were defined using quartile analysis on historical session data: the 3rd quartile of session volume by hour set the peak threshold, and the 1st quartile set the off-peak threshold. Three pricing tiers (off-peak, normal, peak) mapped to those windows.
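The quartile step can be sketched as follows. This is an illustration under assumptions: the input is a list of session counts per hour, quartiles use linear interpolation (one common convention, not necessarily the one used in the pilot), and hours exactly on a threshold are treated as belonging to the peak or off-peak tier.

```typescript
type Tier = "off-peak" | "normal" | "peak";

// Quantile of already-sorted data using linear interpolation between
// the two nearest ranks (assumed convention).
function quantile(sortedAsc: number[], q: number): number {
  if (sortedAsc.length === 0) throw new Error("quantile of empty data");
  const pos = (sortedAsc.length - 1) * q;
  const lower = Math.floor(pos);
  const upper = Math.ceil(pos);
  const a = sortedAsc[lower] ?? 0;
  const b = sortedAsc[upper] ?? a;
  return a + (b - a) * (pos - lower);
}

// Map each hour's session volume to a tier: at or above Q3 is peak,
// at or below Q1 is off-peak, everything in between is normal.
function classifyHours(sessionsByHour: number[]): Tier[] {
  const sorted = sessionsByHour.slice().sort((x, y) => x - y);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  return sessionsByHour.map((v): Tier =>
    v >= q3 ? "peak" : v <= q1 ? "off-peak" : "normal",
  );
}
```

For hourly volumes 1 through 8, this convention gives Q1 = 2.75 and Q3 = 6.25, so the two quietest hours become off-peak and the two busiest become peak.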
Implementation used a cron job reading configuration from S3. When the active pricing tier changes, the new config is written to S3 and the EC2 service is restarted via the EC2 API to pick up the change. A 2-second buffer before the restart gives in-flight requests time to complete cleanly.
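The rotation decision can be sketched as a pure planning step, which keeps the ordering (write config, wait, restart) easy to test. This is a sketch under assumptions: the PricingWindows shape, the step names, and the function names are hypothetical, and the actual S3 write and EC2 restart calls are deliberately left out rather than guessed at.

```typescript
type PricingTier = "off-peak" | "normal" | "peak";

// Hypothetical window config: which hours of the day fall into each tier.
interface PricingWindows {
  peakHours: number[];
  offPeakHours: number[];
}

type RotationStep =
  | { kind: "write-config"; tier: PricingTier }
  | { kind: "wait"; ms: number }
  | { kind: "restart-service" };

function tierForHour(hour: number, windows: PricingWindows): PricingTier {
  if (windows.peakHours.indexOf(hour) >= 0) return "peak";
  if (windows.offPeakHours.indexOf(hour) >= 0) return "off-peak";
  return "normal";
}

// Plan what the cron run should do. No steps are planned when the tier is
// unchanged, so the service is not restarted on every tick.
function planRotation(
  activeTier: PricingTier,
  hour: number,
  windows: PricingWindows,
  bufferMs: number = 2000, // buffer for in-flight requests before the restart
): RotationStep[] {
  const target = tierForHour(hour, windows);
  if (target === activeTier) return [];
  return [
    { kind: "write-config", tier: target },
    { kind: "wait", ms: bufferMs },
    { kind: "restart-service" },
  ];
}
```

An executor would then walk the steps in order, mapping each kind onto whatever S3 and EC2 client calls the deployment actually uses.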
After one month, off-peak utilisation was up approximately 30% across all 5 sites, including the high-performing ones where the hypothesis had not predicted a large effect. Peak-hour revenue held at pre-pilot levels despite the price increase, which suggests demand in peak windows is relatively inelastic at the charging sites in the pilot. The result prompted a wider rollout across the broader network.
05 · Health monitoring, defense-in-depth
Before the monitoring system existed, the way anyone knew a codebase was down was when users complained. Manual SSH and restart followed. The stopgap was straightforward: a health check endpoint on each codebase, an AWS Lambda function scheduled by CloudWatch to poll those endpoints, and Slack notifications that fire only on status change. The status-change filter mattered. A notification every minute when a service is healthy trains people to ignore Slack. One notification when status flips keeps the channel meaningful.
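The status-change filter is the part worth sketching. The example below is illustrative only: the Status values and function name are hypothetical, a service seen for the first time is assumed to have been up, and where the previous statuses are stored between Lambda invocations is left to the caller.

```typescript
type Status = "up" | "down";

interface StatusChange {
  service: string;
  from: Status;
  to: Status;
}

// Compare the last known statuses with the current poll and return only
// the services whose status flipped. Steady state produces no entries,
// so no notification is sent.
function statusChanges(
  previous: { [service: string]: Status | undefined },
  current: { [service: string]: Status },
): StatusChange[] {
  const changes: StatusChange[] = [];
  for (const service of Object.keys(current)) {
    const to = current[service];
    const from = previous[service] ?? "up"; // assumption: unseen services were up
    if (to !== undefined && from !== to) {
      changes.push({ service, from, to });
    }
  }
  return changes;
}
```

Each returned entry maps to one Slack message, which covers both the "went down" and the "recovered" notifications with the same code path.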
The Slack integration started as a webhook. After evaluating the notification patterns and the need for richer formatting and threading, it was migrated to the Slack bot token API. The history of that change is visible in git, and it is the kind of mid-iteration refinement worth noting: the first version worked, the second version was better, and doing the migration early meant less accumulated dependency on the webhook format.
The root cause investigation on one repeatedly failing codebase followed the monitoring data. The codebase was restarting frequently. Disk usage was high. Tracing it back: excessive logging was filling disk. A full disk caused the server to stall when accepting new requests. Requests queued. Memory exhausted. OOM. The fix was three layers: automated log rotation that trims once storage exceeds a threshold, a memory-based auto-restart that fires before the process reaches OOM, and the Lambda alerting layer that catches the status flip early. Any one layer alone would have been a partial fix. The three together addressed the root cause, the symptom, and the detection gap.
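The two host-side layers reduce to a pair of threshold checks, sketched below. The snapshot shape, the percentage-based thresholds, and the action names are assumptions for illustration; the actual trimming and restart mechanisms are not shown.

```typescript
// Hypothetical point-in-time reading from the host.
interface HostSnapshot {
  diskUsedPct: number;
  memoryUsedPct: number;
}

// Hypothetical thresholds. The memory threshold is meant to sit below the
// point where the process would be killed for running out of memory.
interface Thresholds {
  diskPct: number;
  memoryPct: number;
}

type HostAction = "trim-logs" | "restart-process";

// Decide which protective actions to take for one snapshot. Log trimming
// addresses the root cause; the restart addresses the symptom.
function hostActions(snapshot: HostSnapshot, thresholds: Thresholds): HostAction[] {
  const actions: HostAction[] = [];
  if (snapshot.diskUsedPct >= thresholds.diskPct) actions.push("trim-logs");
  if (snapshot.memoryUsedPct >= thresholds.memoryPct) actions.push("restart-process");
  return actions;
}
```

The alerting layer stays separate from this check so that a host which quietly heals itself still leaves a visible trace when its health status flips.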
- 6 figures: revenue recovered via 0kWh clawback
- ~60%: of outstanding discrepancy recovered via automated clawback
- 50k: users in migration scope across ParkEasy to Shell Recharge
- ~30%: off-peak utilisation lift across all 5 pilot sites
- 1 month: delivery vs 3-month original scope estimate for migration
- 628 lines: CLI replaced by S3 Editor dashboard with versioning and rollback
Learnings
- In the ChargeNow rebuild, the two-timer polling pattern (1-second local clock for smooth duration display, 5-second API polling for telemetry) was the right separation. Trying to use a single 5-second interval for both would have made the UI feel sluggish. The finite state machine with a VIEW enum was worth the upfront design time: it eliminated an entire class of illegal UI states that would have been hard to debug in production.
- The health monitoring migration from Slack webhook to Slack bot token API was worth doing mid-iteration rather than deferring. The first version worked and was the right starting point. Doing the migration while the system was still new meant the refactor cost was low. Deferring it would have accumulated more dependents on the webhook format and made the change harder to justify.
- Root cause investigation on the failing codebase showed that disk-full from excessive logging was the primary failure mode, not the OOM the alerts were surfacing. Alerting on symptoms is table stakes. Building in a path from alert to root cause matters more. The three-layer defense (log rotation, memory-based restart, alerting) came from following the failure chain backward, not from guessing at probable causes.
- The email pipeline for the migration should have been load-tested at migration-scale batch volumes before the first send. It hit throughput issues on the first large batch and required a fix on the fly. The fix went in before any SLA was missed, but it was reactive rather than pre-empted. Any pipeline that has to send tens of thousands of emails in a short window needs a load test at that volume, not at a sample.
- The scope argument for the migration was the right call and worth making early. Pushing back on the 3-month estimate by identifying what existing tooling could absorb in the first wave freed capacity to build the core consent and refund flow properly. The lesson is not that scope should always be cut, but that the right time to challenge scope is before implementation starts, with a specific proposal for what gets deferred and why.
FAQ
- Why Shell?
- The combination of a live production system, real financial consequence, and a platform mid-transition across multiple product surfaces was the draw. Most portfolio projects are greenfield. Shell ParkEasy is an existing system with paying users, accumulated technical decisions, and real operational constraints. The kind of work that matters here is finding a billing bug through routine analysis rather than because someone filed a ticket.
- What is Longship?
- Longship is a SaaS platform that connects EV charger hardware to the cloud. It captures raw session telemetry, energy delivered, and charger status, and exposes that data via API. The Shell platform reads from Longship to price sessions and validate billing. The 0kWh clawback work involved cross-referencing platform billing records against Longship's session data to confirm the hardware had genuinely delivered energy the billing engine had not collected.
- Did you work on the hardware side?
- No. Longship abstracts the hardware layer. My work was on the platform side: the backend services that read from Longship, compute billing, manage user accounts and credits, and handle the product surfaces facing customers and operations. I did not write firmware or configure charger hardware directly.
- How did you find the 0kWh billing issue?
- It came up during routine revenue analysis, not from a user report or a filed bug. A session record showing 0kWh billed after a 30-minute charge stood out. Cross-referencing that record against Longship confirmed energy had been delivered. From one anomaly to a root cause took a day. From root cause to a historical sweep of the platform's lifetime took another day or two. The six-figure total was not expected at the start of that investigation.