Designing an idempotent rating engine

Billing systems fail in the most expensive way: quietly, by double-charging or losing records. The fix is not more retries. It is making every operation safe to repeat.

The shape of the problem

A rating engine turns events into money. In telecom that means call detail records: billions a day, arriving out of order, sometimes twice, occasionally not at all. The failure mode that costs the most is not a crash. It is a quiet one, where a retried batch charges a customer twice, or a dropped partition loses revenue nobody notices for a month.

Retries do not fix this. Against a write that is not idempotent, more retries make it worse, because each one is another chance to charge twice. The only durable answer is to make every write safe to repeat, so that replaying the same record changes nothing.

“An idempotent system does not fear the retry. It is built to be replayed.”

graph LR
  A[CDR sources] --> B(["Kafka ingest"])
  B --> C[Rating engine]
  C --> D{{"Idempotency store"}}
  C --> E[("Ledger")]
  E --> F[Reconciliation]
  F -. verifies .-> E

Figure 1: Rating data flow. Every record carries a key; the ledger and reconciliation agree, or the pipeline halts.

Making writes idempotent

The whole scheme rests on a dedup key derived from the record itself, not from the moment it was processed. The ledger upsert is keyed on it, so a second attempt finds the first result and returns it unchanged.

public RatingResult rate(CallDetailRecord cdr) {
    // The dedup key turns a replay into a no-op, not a double charge.
    String key = cdr.accountId() + ":" + cdr.eventId();

    return ledger.upsert(key, () -> {
        Money amount = tariff.price(cdr);
        return new RatingResult(key, amount, cdr.timestamp());
    });
}
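
The snippet above leaves `ledger.upsert` undefined. As a rough sketch of the contract it relies on, and not the production ledger, the following in-memory stand-in shows the behavior that matters: the pricing function runs only when the key is absent, and a replay gets the stored result back. `InMemoryLedger`, `LedgerDemo`, and the sample key are hypothetical names invented for illustration; a real ledger would sit on a durable store rather than a map.

```java
import java.math.BigDecimal;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical in-memory stand-in for the ledger (assumption: not the real store). */
final class InMemoryLedger<R> {
    private final ConcurrentMap<String, R> rows = new ConcurrentHashMap<>();

    /** Runs compute only if the key is absent; otherwise returns the stored result. */
    R upsert(String key, Supplier<R> compute) {
        // computeIfAbsent is atomic per key on ConcurrentHashMap, so two
        // concurrent attempts with the same key still produce one row.
        return rows.computeIfAbsent(key, k -> compute.get());
    }

    int size() {
        return rows.size();
    }
}

final class LedgerDemo {
    public static void main(String[] args) {
        InMemoryLedger<BigDecimal> ledger = new InMemoryLedger<>();
        AtomicInteger pricingCalls = new AtomicInteger();

        // Stand-in for tariff.price(cdr); counts how often pricing actually runs.
        Supplier<BigDecimal> price = () -> {
            pricingCalls.incrementAndGet();
            return new BigDecimal("0.42");
        };

        BigDecimal first = ledger.upsert("acct-1:evt-9", price);
        BigDecimal replay = ledger.upsert("acct-1:evt-9", price); // retried batch

        System.out.println(first.equals(replay)); // replay returns the stored result
        System.out.println(pricingCalls.get());   // number of times pricing ran
        System.out.println(ledger.size());        // number of ledger rows
    }
}
```

The design choice worth noting is that the check and the write happen in one atomic step. A separate "look up, then insert" pair would reopen the window in which two retries both see no row and both charge.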

Configuration keeps the two halves honest. Keys live long enough to cover any realistic replay window, and reconciliation refuses to settle anything it cannot explain.

rating:
  idempotency:
    store: cassandra
    ttl: 90d            # keep keys long enough to cover replays
  reconcile:
    interval: 60s
    on_mismatch: halt   # never settle a record we cannot explain
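
To make `on_mismatch: halt` concrete, here is one way the check could look, sketched under assumptions: both sides are reduced to a map from dedup key to amount in minor units, and any key that is missing, extra, or priced differently stops settlement. `Reconciler`, `MismatchException`, and `ReconcileDemo` are hypothetical names; the source does not describe how the real reconciliation compares records.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical reconciliation check keyed on the dedup key (assumption). */
final class Reconciler {

    /** Thrown to halt settlement when the two sides disagree. */
    static final class MismatchException extends RuntimeException {
        final Set<String> unexplainedKeys;

        MismatchException(Set<String> unexplainedKeys) {
            super("unexplained records: " + unexplainedKeys);
            this.unexplainedKeys = unexplainedKeys;
        }
    }

    /** Returns normally when every key agrees; throws otherwise. */
    static void reconcile(Map<String, Long> systemOfRecord, Map<String, Long> ledger) {
        Set<String> unexplained = new TreeSet<>();

        // Lost or mispriced: in the system of record, absent or different in the ledger.
        for (Map.Entry<String, Long> e : systemOfRecord.entrySet()) {
            if (!e.getValue().equals(ledger.get(e.getKey()))) {
                unexplained.add(e.getKey());
            }
        }
        // Phantom charge: in the ledger with no matching source record.
        for (String key : ledger.keySet()) {
            if (!systemOfRecord.containsKey(key)) {
                unexplained.add(key);
            }
        }
        if (!unexplained.isEmpty()) {
            throw new MismatchException(unexplained);
        }
    }
}

final class ReconcileDemo {
    public static void main(String[] args) {
        Map<String, Long> source = Map.of("acct-1:evt-1", 120L, "acct-1:evt-2", 45L);
        Map<String, Long> ledger = Map.of("acct-1:evt-1", 120L); // evt-2 was dropped

        try {
            Reconciler.reconcile(source, ledger);
            System.out.println("settled");
        } catch (Reconciler.MismatchException e) {
            System.out.println("halted: " + e.unexplainedKeys);
        }
    }
}
```

Comparing by key rather than by totals is deliberate in this sketch: a dropped record and a double charge of the same amount would cancel out in a sum, but not in a per-key comparison.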

What we measured

The engine has run in production across several markets for over a year:

4B records rated per day, reconciled end to end
0 revenue leakage, measured against the system of record
60s reconciliation interval, halt-on-mismatch
1-screen runbook, the one the on-call team actually uses

None of this is clever. It is the boring, provable version, which is exactly why it has never paged anyone at 3am. That is the highest praise a billing system can earn.