Designing Failure with State Machines

Sim-Pesa Build Log — Week 5 of 16

We are now at the start of Phase 2, and this week had zero new code committed. Instead, it was pure architecture work: the slow, deliberate kind of design that determines whether the implementation phase goes smoothly or turns into a series of expensive reversals. The central question I had to answer was deceptively simple:

What is the complete, unambiguous lifecycle of a single transaction, from the moment it enters the system to the moment it exits?

If you are new here, Sim-Pesa is a local-first M-Pesa API simulator I am building to sharpen my software development skills. The goal is to give developers a reliable, offline environment for testing Safaricom Daraja payment workflows, without depending on a cloud sandbox that is sometimes unstable. You can read the previous weeks to catch up.

In a simple CRUD application, a status column is fine. You write "success" or "fail" and move on. But Sim-Pesa is a transactional state machine modelling real financial flows. In that context, a status string is not just a label; it is a contract. Every possible value must be defined, every transition must be justified, and the system must be deterministic enough that a developer can look at a stuck transaction and immediately understand why.

Defining the States

The first task was enumerating every state a transaction can occupy and categorising them clearly. I landed on six.

State	Meaning	Where It Lives
`PENDING`	Default entry state — job sits in the queue awaiting a worker.	Ingestion API
`PROCESSING`	Funds reserved, row locks held, awaiting PIN on the Virtual Smartphone.	Worker Pool
`SUCCESS`	Correct PIN entered — balances updated, webhook dispatched.	Worker Pool
`FAILED`	Business-rule violation (e.g. insufficient funds, blocked account).	Anywhere
`CANCELLED`	Developer manually rejected the STK prompt on the Simulation UI.	UI / Worker
`VOIDED`	15-minute timeout elapsed before the worker completed the Lock-Validate phase.	DB / Worker

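The distinction between active and terminal states matters for data integrity. PENDING and PROCESSING are active; SUCCESS, FAILED, CANCELLED, and VOIDED are terminal. Once a transaction enters a terminal state, its state is immutable. No worker process, no API call, and no UI interaction can reverse it. This will be enforced at the database layer, so that even a buggy worker cannot corrupt a finalised record. A check constraint can restrict the column to the six legal values, but it only sees the new row, so rejecting updates to an already-finalised row will also need something like a BEFORE UPDATE trigger.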
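As an application-level mirror of that rule, here is a minimal TypeScript sketch. The `Transaction` shape and the `withState` helper are hypothetical names for illustration, not Sim-Pesa's actual code, and the database remains the final line of defence.

```typescript
// The six states from the table above, as a closed union type.
type TxState =
  | "PENDING"
  | "PROCESSING"
  | "SUCCESS"
  | "FAILED"
  | "CANCELLED"
  | "VOIDED";

// Terminal states: once reached, the record must never change again.
const TERMINAL_STATES: ReadonlySet<TxState> = new Set<TxState>([
  "SUCCESS",
  "FAILED",
  "CANCELLED",
  "VOIDED",
]);

function isTerminal(state: TxState): boolean {
  return TERMINAL_STATES.has(state);
}

// Hypothetical record shape, for illustration only.
interface Transaction {
  readonly id: string;
  readonly state: TxState;
}

// Returns a new record with the requested state, or throws if the
// current record is already finalised.
function withState(tx: Transaction, next: TxState): Transaction {
  if (isTerminal(tx.state)) {
    throw new Error(`Transaction ${tx.id} is finalised as ${tx.state}`);
  }
  return { ...tx, state: next };
}
```

This guard only covers immutability; which non-terminal moves are legal is a separate question.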

Mapping the Possible Transitions

With the states defined, the next question was: which transitions are legal, and what triggers each one? This is where the state machine becomes genuinely useful as a design tool, because it forced me to think about every edge case before writing a single line of code.

Initial	Intermediate	Terminal	Trigger
`PENDING`	`PROCESSING`	`SUCCESS`	Happy path — valid funds, correct PIN, balances moved.
`PENDING`	`PROCESSING`	`CANCELLED`	User-driven — valid funds, but the developer rejected the prompt.
`PENDING`	`PROCESSING`	`FAILED`	Logic-driven — valid funds, but wrong PIN entered.
`PENDING`	—	`FAILED`	System-driven — insufficient funds or blocked account at validation.
`PENDING`	—	`VOIDED`	Infrastructure-driven — timeout before worker acquired locks.
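The table above translates almost directly into a lookup structure. A minimal sketch, assuming the five rows are the complete set of legal transitions; `LEGAL_TRANSITIONS`, `canTransition`, and `assertTransition` are illustrative names.

```typescript
type TxState =
  | "PENDING"
  | "PROCESSING"
  | "SUCCESS"
  | "FAILED"
  | "CANCELLED"
  | "VOIDED";

// Legal next states for each state, read off the transition table.
// Terminal states map to an empty list: nothing may follow them.
const LEGAL_TRANSITIONS: Readonly<Record<TxState, readonly TxState[]>> = {
  PENDING: ["PROCESSING", "FAILED", "VOIDED"],
  PROCESSING: ["SUCCESS", "CANCELLED", "FAILED"],
  SUCCESS: [],
  FAILED: [],
  CANCELLED: [],
  VOIDED: [],
};

function canTransition(from: TxState, to: TxState): boolean {
  return LEGAL_TRANSITIONS[from].includes(to);
}

// Throws on any move the table does not list.
function assertTransition(from: TxState, to: TxState): void {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal transition: ${from} -> ${to}`);
  }
}
```

Keeping the map exhaustive over `TxState` means the compiler flags any state added later without a decision about what may follow it.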

Three of those outcomes deserve a closer look, because they look similar on the surface but represent fundamentally different system behaviours.

The Architecture of Failure

The temptation in any payment system is to collapse all negative outcomes into a single "failed" bucket. This is a mistake that makes debugging miserable — and makes building a realistic simulator impossible. The semantic difference between FAILED, CANCELLED, and VOIDED is the difference between a system you can diagnose in minutes and one that leaves you staring at logs for hours.

FAILED — The System Said No

This transition occurs when the system is healthy and functioning correctly, but the transaction itself violates a business rule. The worker successfully acquired the row locks and ran validation; it simply discovered that the payment cannot proceed. Primary triggers are insufficient funds (ResultCode 1) or a blocked account status. The transition table adds one more, which fires after the prompt has been shown: a wrong PIN. In every case this is a deterministic outcome. The system reached a conclusion, and the developer's application code should handle it the same way it would handle a real M-Pesa response.

CANCELLED — The Human Said No

This is the user-driven rejection. In the real M-Pesa STK Push flow, a subscriber can dismiss the payment prompt on their handset. Sim-Pesa replicates this via the Virtual Smartphone UI: the developer clicks "Cancel" on the simulated STK prompt. The M-Pesa equivalent is ResultCode 1032 (Request Cancelled by User). Importantly, at the point of cancellation the transaction is already in PROCESSING: funds have been reserved, and the rows were locked with SELECT ... FOR UPDATE. The CANCELLED transition must therefore also release those locks and roll back any reserved balances within the same atomic transaction.

VOIDED — The Infrastructure Said Nothing

This is the most dangerous state, because it represents a failure of the environment rather than the transaction data. A transaction becomes VOIDED when the 15-minute processing window expires before the worker ever completed the Lock-Validate phase — typically caused by a crashed Worker container or a backed-up Redis queue. The M-Pesa analogue is ResultCode 1037 (DS Timeout). The critical engineering implication is that we do not know whether the user had sufficient funds, because validation never ran. This non-deterministic outcome must be clearly distinguished from FAILED in the audit log.
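To make the three failure semantics concrete, here is a hedged sketch of how a terminal outcome might map to a Daraja-style ResultCode. Codes 1, 1032, and 1037 come from the sections above; 0 for success is the usual Daraja convention and is my assumption here. The `Outcome` type and both function names are hypothetical, and outcomes whose code this post does not specify return `undefined` rather than a guess.

```typescript
// Hypothetical outcome shape: FAILED carries a reason, the others do not.
type Outcome =
  | { state: "SUCCESS" }
  | { state: "FAILED"; reason: "INSUFFICIENT_FUNDS" | "BLOCKED_ACCOUNT" | "WRONG_PIN" }
  | { state: "CANCELLED" }
  | { state: "VOIDED" };

// Maps an outcome to a ResultCode where one is known.
function resultCodeFor(outcome: Outcome): number | undefined {
  switch (outcome.state) {
    case "SUCCESS":
      return 0; // assumption: conventional Daraja success code
    case "CANCELLED":
      return 1032; // request cancelled by user
    case "VOIDED":
      return 1037; // DS timeout
    case "FAILED":
      // Only insufficient funds has a code given in this post.
      return outcome.reason === "INSUFFICIENT_FUNDS" ? 1 : undefined;
  }
}

// VOIDED is the only terminal state reached without validation running,
// so the audit log cannot say whether the funds were sufficient.
function validationRan(outcome: Outcome): boolean {
  return outcome.state !== "VOIDED";
}
```

Recording `validationRan` alongside the state is one way to keep the FAILED versus VOIDED distinction visible in the audit log.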

Engineering Challenges: The Atomic Lock

The PENDING → PROCESSING Transition

The most consequential architectural decision of this week was defining precisely when a transaction transitions from PENDING to PROCESSING. The question seems trivial. It is not.

The intuitive option is to transition to PROCESSING as soon as the worker picks up the job from the BullMQ queue. But this is fatally flawed. If the status is updated before the database locks are acquired and validation is complete, there is a window where the UI will display a PIN prompt for a transaction that the backend cannot actually service. The balance may be insufficient, or lock contention may prevent progress. The developer sees a ghost prompt: a PROCESSING response that will eventually resolve to FAILED. This is exactly the kind of non-deterministic behaviour Sim-Pesa is designed to eliminate.

The rule: a transaction must not be marked PROCESSING until the system can guarantee it can complete the processing phase. Changing the state and acquiring the lock must be the same atomic operation.

The correct approach is the Lock-Validate-Update pattern, executed within a single PostgreSQL transaction block:

1. BEGIN: open the database transaction.
2. SELECT ... FOR UPDATE: acquire row-level locks on both the User and Merchant rows. No other worker can modify these rows until this transaction completes.
3. Validate: check that the user balance ≥ transaction amount and that the account status is ACTIVE.
4. UPDATE status → 'PROCESSING': only now, after validation passes, is the state changed.
5. COMMIT: the status change becomes visible to other sessions. Because the locks were held from step 2 until this point, the check and the update behave as one atomic step.
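The five steps can be sketched in TypeScript as follows. This is an illustration under stated assumptions rather than Sim-Pesa's implementation: `Db` is a hypothetical minimal interface shaped like a typical SQL client's `query(text, params)` call, the `accounts` and `transactions` tables and their columns are invented names, and locking the rows in `ORDER BY id` order is my own addition to reduce deadlock risk when two workers lock the same pair. The failure branch follows the PENDING → FAILED row of the transition table.

```typescript
// Hypothetical minimal database interface; swap in a real client.
interface Db {
  query(sql: string, params?: unknown[]): Promise<{ rows: any[] }>;
}

// Assumed row shape; balance is an integer amount in the smallest unit.
interface AccountRow {
  id: string;
  balance: number;
  status: string;
}

type Decision = "PROCESSING" | "FAILED";

async function lockValidateUpdate(
  db: Db,
  txId: string,
  userId: string,
  merchantId: string,
  amount: number,
): Promise<Decision> {
  await db.query("BEGIN");
  try {
    // Lock both rows; they stay locked until COMMIT or ROLLBACK.
    const { rows } = await db.query(
      "SELECT id, balance, status FROM accounts WHERE id = ANY($1) ORDER BY id FOR UPDATE",
      [[userId, merchantId]],
    );
    const accounts = rows as AccountRow[];
    const user = accounts.find((row) => row.id === userId);
    const merchant = accounts.find((row) => row.id === merchantId);

    // Validate while holding the locks, so the answer cannot go stale.
    const valid =
      user !== undefined &&
      merchant !== undefined &&
      user.status === "ACTIVE" &&
      user.balance >= amount;

    // Only now is the state changed, and only from PENDING.
    const next: Decision = valid ? "PROCESSING" : "FAILED";
    await db.query(
      "UPDATE transactions SET status = $1 WHERE id = $2 AND status = 'PENDING'",
      [next, txId],
    );
    await db.query("COMMIT");
    return next;
  } catch (err) {
    await db.query("ROLLBACK");
    throw err;
  }
}
```

The sketch stops at the status change; how the funds reservation itself is recorded is left out, since the steps above do not specify it.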

PostgreSQL's FOR UPDATE clause serialises concurrent access at the row level. In a high-concurrency scenario, say ten workers attempting to charge the same user simultaneously, each transaction queues behind the previous one. Expensive, yes. But the alternative is a race condition that results in double-spending, which is catastrophic in a financial system. This is where the ACID guarantees stop being abstract philosophy and become a specific line of SQL.
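The cost of skipping that serialisation can be shown without a database. The following is an in-memory simulation, not PostgreSQL: `KeyedLock` is a hypothetical stand-in for a row lock, and `charge` deliberately yields between the balance check and the write, which is where the race lives.

```typescript
// Callers asking for the same key run one at a time, in arrival order.
class KeyedLock {
  private tails = new Map<string, Promise<void>>();

  async runExclusive<T>(key: string, task: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(key) ?? Promise.resolve();
    let release!: () => void;
    const current = new Promise<void>((resolve) => {
      release = resolve;
    });
    // The next caller waits for everything queued so far, including us.
    this.tails.set(key, previous.then(() => current));
    await previous;
    try {
      return await task();
    } finally {
      release();
    }
  }
}

const balances = new Map<string, number>();

// Check-then-write with a yield in between, like a round trip to the DB.
async function charge(userId: string, amount: number): Promise<boolean> {
  const balance = balances.get(userId) ?? 0;
  await Promise.resolve();
  if (balance < amount) return false;
  balances.set(userId, balance - amount);
  return true;
}

// Ten concurrent charges of 100 against a balance of 500.
async function runTenCharges(
  lock?: KeyedLock,
): Promise<{ accepted: number; balance: number }> {
  balances.set("user-1", 500);
  const attempts = Array.from({ length: 10 }, () =>
    lock
      ? lock.runExclusive("user-1", () => charge("user-1", 100))
      : charge("user-1", 100),
  );
  const results = await Promise.all(attempts);
  return {
    accepted: results.filter(Boolean).length,
    balance: balances.get("user-1") ?? 0,
  };
}
```

Calling `runTenCharges()` without a lock accepts all ten charges while debiting only 100, because every attempt reads the balance before any of them writes. Passing `new KeyedLock()` accepts exactly five and leaves the balance at zero.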

You cannot change a state until you own the row.

Conclusion and Looking Ahead

Week 5 was entirely about drawing lines. Every state named, every transition justified, every ambiguity resolved before a line of production code is written. This kind of upfront design is uncomfortable because it produces no visible output, no working endpoint, no passing tests. But it is the work that determines whether the implementation phase is productive or a series of expensive reversals.

The state machine is now the source of truth for Sim-Pesa's transactional core. Everything in the Worker Pool, from every database write and balance check to every webhook dispatch, will be built to serve this model.

The immediate challenge in Week 6 is implementing the concurrency control described above: specifically, what happens when ten transactions target the same user account at the exact same millisecond. The theory is clear. Getting PostgreSQL locking semantics right under BullMQ's parallel worker model is where the real learning happens.

Week 5 of 16: Building Sim-Pesa, a local M-Pesa transaction appliance. Follow the journey as I learn to build production-grade systems from scratch.
