Malaysia's 900 AI startups need backend engineers, not prompt engineers

The capital has arrived. The data centers are built. The thing Malaysia is still short of is the decade-accumulated skill of building systems that stay correct.

The numbers that define Malaysia's current AI posture are real and significant, not aspirational projections dressed as progress reports. Data-center capacity grew from roughly 10 MW in 2022 to over 1,500 MW by 2025. Investment into that infrastructure hit $10 billion in 2023 and tripled in 2024. Microsoft, Oracle, and Google have together committed $10.7 billion to Malaysian AI infrastructure. The government has announced a RM500 million startup fund for the AI sector and is targeting 900 AI startups by 2026, against a 2026 budget that allocates RM1.36 billion to digital projects. Concurrently, 5G upgrades are planned for major cities, 5,000 experts are scheduled for AI training, and AI trials are being rolled out across 100 factories.

This is not a country catching up. The cooling towers are going up. The floor space is being commissioned. The foreign capital has made its bet. The thing that has not arrived yet is the engineering depth to fill the layer below the data centers: the people who will build, run, and keep correct the software that actually uses all of that capacity. Every conversation I have heard in the Malaysian engineering market about AI hiring concentrates on the top of the stack. It concentrates on the skill that is easiest to hire for right now. That is the risk.

The two skill curves

Prompt engineering is a real discipline. Composing effective LLM inputs, designing tool-use traces, managing context windows, and reliably coaxing consistent output from a model are skills with genuine craft, and they reward rapid iteration: someone who spends six focused months on this will be measurably better at it by the end. The feedback loop is fast, the iteration surface is large, and the tools for measuring improvement are right there in the output.

Distributed systems engineering compounds differently. The skill of building systems that stay correct under concurrency, partial failure, and sustained load cannot be shortened with templates or agents, because the failure modes are systemic rather than atomic. A prompt produces a bad output. A distributed system produces a correct output in staging and a data-corrupting race condition in production at 2am on a Sunday when queue depth crosses a threshold nobody modelled. The reflexes for handling that take years to accumulate, partly because the failure scenarios only appear at production scale, and partly because each incident teaches something that no course or tutorial can reproduce. The AI-startup boom is hiring the first curve faster than the second. Both skills matter. The imbalance is the capacity problem.

What the capex actually needs underneath

1,500 MW of data-center capacity needs orchestration layers, observability pipelines, SRE practices, and the infrastructure-as-code discipline to make any of it reproducible across regions and recoverable after failure. That work is not automated away by LLMs. It is performed by engineers who have operated infrastructure under pressure and who know which alerts are noise and which ones are the two-minute warning before a full outage.

900 AI startups shipping production software need API gateway design, idempotency at every state-changing boundary, queue design that accounts for failure modes at scale, database migration discipline that does not corrupt live data during schema changes, secrets management that does not end in a credential in a git history, deployment pipelines that make rollback a one-command operation, and cost instrumentation that tells you when an inference call budget is spiking before the invoice does. None of this is prompt work.
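
To make "idempotency at every state-changing boundary" concrete, here is a minimal sketch, assuming the client sends a unique key with each request. The `IdempotentHandler` class, the `req-123` key, and the in-memory dict are all hypothetical stand-ins; a real service would store the key and result atomically, for example behind a unique constraint in the database.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentHandler:
    """Hypothetical wrapper: replays the stored result for a repeated key."""

    _results: dict = field(default_factory=dict)

    def handle(self, idempotency_key, operation):
        # Assumption: a dict stands in for durable storage. In production the
        # lookup and the store must be one atomic step, or two concurrent
        # retries can both run the operation.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = operation()
        self._results[idempotency_key] = result
        return result


balance = {"amount": 100}


def debit_ten():
    # The state-changing operation we must not apply twice.
    balance["amount"] -= 10
    return balance["amount"]


handler = IdempotentHandler()
first = handler.handle("req-123", debit_ten)
retry = handler.handle("req-123", debit_ten)  # client retried after a timeout
```

The retry returns the stored result of the first call and the balance is debited once. The part this sketch leaves out, making the check and the write atomic under concurrency, is the part that takes experience to get right.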

A prompt-first startup with a working demo will need to hire a backend engineer before the second enterprise customer signs. Enterprise procurement asks for SOC 2 readiness, audit logs, data-residency guarantees, and SLA commitments. The answers to those questions live in the system design, not the model configuration. The procurement checklist does not care how clever the prompt chain is. It cares whether the logs exist, whether they are tamper-evident, and whether someone on the engineering team can produce them in 24 hours.
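
One common way to make logs tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below is an illustration under assumptions, not a compliance-ready design: `append_entry`, `verify_chain`, and the event fields are hypothetical names, and a real system would also anchor the latest hash somewhere the application cannot rewrite.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, event):
    # Serialise with sorted keys so the same event always hashes the same way.
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log, event):
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append(
        {"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)}
    )


def verify_chain(log):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, {"actor": "user-1", "action": "export_report"})
append_entry(audit_log, {"actor": "user-2", "action": "delete_record"})
```

`verify_chain(audit_log)` holds for the untouched log and fails as soon as any stored event is edited after the fact. A chain on its own does not stop someone from rewriting every entry and recomputing the hashes, which is why the external anchor matters.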

The agentic layer amplifies the gap rather than closing it. Gartner projects that 40% of enterprise applications will include task-specific AI agents by the end of 2026. Forrester estimates that 30% of enterprise application vendors will launch their own MCP servers in the same period. The MCP ecosystem already has over 10,000 active public servers and 97 million monthly SDK downloads. Agentic systems need state management for long-running tasks, tool authorization and scope enforcement, orchestration retry logic that handles partial completion, and observability that can reconstruct what a multi-step agent actually did when a customer files a complaint. That observability problem is harder than anything in traditional CRUD architecture. It requires engineers who have thought carefully about event sourcing, log correlation, and audit trail design. That is not a six-month skill. That is a decade skill applied to a new problem shape.
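
As a rough illustration of "retry logic that handles partial completion", the sketch below checkpoints each finished step so a rerun skips work that already happened. Everything here is hypothetical: `run_task`, the step names, and the in-memory `checkpoint` dict, which stands in for durable storage. It also assumes a failed step leaves no side effect behind, which real tools do not always guarantee.

```python
def run_task(steps, completed, max_attempts=3):
    """Run steps in order, skipping any already recorded in `completed`."""
    for name, action in steps:
        if name in completed:
            continue  # finished in an earlier run; do not repeat the side effect
        for attempt in range(1, max_attempts + 1):
            try:
                completed[name] = action()
                break
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # give up; the checkpoint keeps what did finish
    return completed


calls = {"fetch": 0, "send": 0}


def fetch():
    calls["fetch"] += 1
    return "invoice-42"


def send():
    calls["send"] += 1
    if calls["send"] == 1:
        raise RuntimeError("tool timed out")  # simulated transient failure
    return "sent"


steps = [("fetch", fetch), ("send", send)]
checkpoint = {}  # assumption: would be a database row per task in practice

try:
    run_task(steps, checkpoint, max_attempts=1)
except RuntimeError:
    pass  # the first run died after "fetch" had completed

run_task(steps, checkpoint, max_attempts=1)  # resume from the checkpoint
```

The second run does not call `fetch` again. The same checkpoint record doubles as the start of an audit trail: it is a durable answer to what the agent actually did before it failed.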

What Malaysian founders should hire for before Series A

1. A backend engineer who has run production with on-call

The credential is not a bootcamp certificate or a framework familiarity checklist. The credential is: has this person been paged at 3am, diagnosed the incident, restored service, and written the postmortem? The reflexes that emerge from that experience are not teachable in a classroom and not readable in a portfolio. You look for it in how someone talks about a production failure: whether they can name the root cause, describe what was ambiguous in the first five minutes, and explain the specific thing they changed to prevent recurrence. That specificity is what you are hiring. Engineers who have never been on-call for a system under load have not yet developed the instinct for fault-tolerant design that comes from knowing what it feels like when the system fails and you are responsible for it.

2. Someone who can own the data migration path

Schema changes on a live database with millions of rows are where startups lose user trust silently and permanently. The correct migration strategy depends on the specific change: adding a nullable column is safe in most databases, renaming a column is a multi-step dance with backward-compatibility constraints, and backfilling a new field in a large table requires a batched job that respects row-level locking and does not time out or lock the table for an interval that affects production traffic. This work is boring in the best sense of the word. Nobody will write a LinkedIn post about the migration that ran cleanly. The alternative outcome is a 90-minute outage during a product launch, and that story does get written. An engineer who has shipped zero-downtime migrations on production systems and can explain the steps they took is worth hiring early, before the first major schema change makes it obvious that nobody on the team knows how to do this safely.
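
The batched backfill described above can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` as a stand-in: the `users` table, the `email_lower` column, and the batch size are hypothetical, and SQLite does not have the row-level locking a production database would, so treat it as the shape of the job rather than a drop-in script.

```python
import sqlite3


def backfill_in_batches(conn, batch_size=1000):
    """Fill users.email_lower in short transactions, walking the primary key."""
    last_id = 0
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id FROM users WHERE id > ? AND email_lower IS NULL "
            "ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return total
        upper = rows[-1][0]
        with conn:  # commits on success, rolls back on error: one batch each
            conn.execute(
                "UPDATE users SET email_lower = lower(email) "
                "WHERE id > ? AND id <= ? AND email_lower IS NULL",
                (last_id, upper),
            )
        total += len(rows)
        last_id = upper  # resume point if the job is interrupted


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, email_lower TEXT)"
)
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"User{i}@Example.com",) for i in range(25)],
)
conn.commit()

updated = backfill_in_batches(conn, batch_size=10)
```

Each batch holds its transaction only briefly, and because the update filters on `email_lower IS NULL` the job is safe to rerun. On a live system you would typically also pause between batches and watch replication lag, which this sketch omits.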

3. An engineer with production payments experience on Malaysian rails

The local payments stack has specific properties that global tutorials do not prepare for. FPX is redirect-based with a distinct webhook delivery model. DuitNow AutoDebit has mandate lifecycle semantics: a mandate moves through states, those states can arrive out of order, and a charge against a revoked mandate is both a failed transaction and a compliance problem. Idempotency here is not just about duplicate charges. It is about reconciling the webhook stream against the T+1 settlement file that PayNet publishes, because those two sources will disagree during gateway maintenance windows and that disagreement needs a defined resolution path. The companion post on FPX and DuitNow covers the specific patterns in detail. The relevant hiring signal is: has the engineer shipped a payments integration that survived a Sunday night AutoDebit failure at scale, and can they describe how they found out what went wrong?
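
The reconciliation step can be sketched in a few lines. The field names (`txn_id`, `status`, `amount_sen`) and status values below are assumptions for illustration only; they are not the actual webhook payload or settlement file format, which you would take from the gateway's own specification.

```python
def reconcile(webhook_events, settlement_rows):
    """Compare two views of the same transactions, keyed by transaction id."""
    hooks = {e["txn_id"]: e for e in webhook_events}
    settled = {r["txn_id"]: r for r in settlement_rows}
    report = {
        "matched": [],
        "mismatch": [],
        "missing_webhook": [],
        "missing_settlement": [],
    }
    for txn_id in sorted(hooks.keys() | settled.keys()):
        hook, row = hooks.get(txn_id), settled.get(txn_id)
        if hook is None:
            report["missing_webhook"].append(txn_id)
        elif row is None:
            report["missing_settlement"].append(txn_id)
        elif (hook["status"], hook["amount_sen"]) != (row["status"], row["amount_sen"]):
            report["mismatch"].append(txn_id)
        else:
            report["matched"].append(txn_id)
    return report


# Hypothetical data: amounts are in sen to avoid floating-point money.
webhooks = [
    {"txn_id": "T1", "status": "SUCCESS", "amount_sen": 5000},
    {"txn_id": "T2", "status": "FAILED", "amount_sen": 1200},
    {"txn_id": "T3", "status": "SUCCESS", "amount_sen": 900},
]
settlement = [
    {"txn_id": "T1", "status": "SUCCESS", "amount_sen": 5000},
    {"txn_id": "T2", "status": "SUCCESS", "amount_sen": 1200},
    {"txn_id": "T4", "status": "SUCCESS", "amount_sen": 700},
]

report = reconcile(webhooks, settlement)
```

The code only finds the disagreements. The "defined resolution path" is a decision about which source wins in each bucket and who gets alerted, and that decision is what the experienced payments engineer brings.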

4. A frontend engineer who understands performance budgets

Not a framework cataloguer. Not someone who can list the React rendering model from memory. Someone who has loaded their own app on a 4G connection in Rawang or Muar, noticed that the LCP was four seconds, tracked the cause to an unoptimized image or a blocking script, and fixed it in a way that held across subsequent deployments. Web performance in Southeast Asia is not a Western-market concern that can be addressed later. A significant share of users accessing a Malaysian startup's product are on mobile, on a variable connection, on a mid-range device. A performance regression on that hardware profile is an experience regression for a large portion of the actual user base. The engineer who knows how to audit a performance budget and enforce it in CI, rather than fixing it once and hoping nothing ships that breaks it again, is the one who delivers durable results.
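
Enforcing a budget in CI can be as small as comparing measured numbers against agreed limits and failing the build on any violation. In this sketch the metric names, the thresholds in `BUDGET`, and the `measured` values are assumptions; in practice the measurements would come from whatever lab tool the team already runs against a throttled mobile profile.

```python
# Hypothetical budget: limits the team has agreed to hold on every deploy.
BUDGET = {"lcp_ms": 2500, "js_kb": 170}


def check_budget(measured, budget):
    """Return a list of human-readable violations; an empty list means pass."""
    violations = []
    for metric, limit in budget.items():
        value = measured.get(metric)
        if value is None:
            # A missing measurement fails the check rather than passing silently.
            violations.append(f"{metric}: no measurement")
        elif value > limit:
            violations.append(f"{metric}: {value} exceeds budget {limit}")
    return violations


# Example run: the four-second LCP from the scenario above.
measured = {"lcp_ms": 4000, "js_kb": 150}
violations = check_budget(measured, BUDGET)
```

A CI step would print `violations` and exit non-zero when the list is not empty, so the regression blocks the merge instead of being discovered by users in Rawang.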

5. A senior engineer who can say no

This is the hire that most clearly separates a startup that scales from one that accumulates technical debt until the debt is the product. Distributed systems expertise is partly the knowledge of what to build and partly the taste for what not to build. A microservices architecture for a fifteen-person startup with a single-region deployment is not an architecture. It is a maintenance surface distributed across six repositories, without a team large enough to operate any of them. An event-driven system where every operation crosses three queues is not resilience. It is fragility at each queue boundary without the observability to know which one failed. The senior engineer who can look at a proposed architecture, identify the accidental complexity, and say "we do not need this yet, and here is what we should build instead" is often the most cost-effective hire on the entire team. They do not generate lines of code. They reduce the lines of code that would otherwise need to be written, debugged, and eventually deleted.

The availability signal

This is the gap I think about a lot, partly because I have worked in the Malaysian engineering market long enough to see it clearly from multiple angles. If any of the roles above describes something your team is actively hiring for, the footer of this site is the way to reach me. I will not pitch further here. The point of this post is the argument, not the advertisement.

Closing

The capital has arrived. The cooling towers are going up. The government has committed the fund, the floor space, the training pipeline, and the international partnerships with Singapore, South Korea, and Japan. The thing Malaysia is still short of cannot be shipped in a container from Oregon. It is the boring, slow, compounding skill of building systems that stay correct when production reality hits them. Malaysia will close this gap. The question is whether it does so before the first wave of AI startups meets its first serious production incident, or after.

Written by

Faiz Kasman

Software engineer in Kuala Lumpur. Payments, multi-tenant SaaS, and inventory infrastructure. Currently building the Shell Malaysia ParkEasy app.
