Building Resilient Payment Gateways: Lessons from Processing $10 Billion in Transactions
Payment gateway architecture patterns for high availability, idempotency, and reconciliation. Production lessons from processing billions in transaction volume.
Processing payments at scale is one of the hardest problems in distributed systems. Every transaction must be processed exactly once, the system must be available 99.999% of the time, and reconciliation must account for every transaction. A payment gateway that processes 100 million transactions per month cannot afford to lose a single transaction or double-charge a single customer.
We have built payment gateway infrastructure for fintechs processing over $10 billion in annual transaction volume. The architecture decisions that matter are not about speed — they are about correctness, idempotency, and the ability to recover from partial failures without losing money.
Who Is This Guide For?
This guide is for fintech CTOs, payment engineers, and platform architects building or operating payment gateway infrastructure. If you are processing payments at scale — or plan to — this is for you.
By the End of This, You’ll Know…
- Why idempotency is the most important property of a payment system
- How to design reconciliation that catches every discrepancy
- The architecture patterns that enable 99.999% availability without sacrificing correctness
- How to handle partial failures in distributed payment flows
The Correctness Hierarchy
In payment systems, correctness matters more than anything else. A payment gateway can be slow and still function. A payment gateway that loses transactions or double-charges customers does not function at all.
The hierarchy of payment system properties:
- Correctness: Every transaction is processed exactly once — no loss, no duplication, no double-charging
- Availability: The system is accessible when customers need to make payments
- Consistency: All systems agree on the state of every transaction
- Latency: Transactions complete quickly
Most payment system failures are not availability failures. They are correctness failures — transactions that succeed in one system and fail in another, leaving money in an inconsistent state.
Idempotency: The Foundation
Idempotency means that processing the same transaction twice produces the same result as processing it once. Without idempotency, network failures, timeouts, and retries will inevitably cause duplicate charges or lost payments.
Implementation pattern:
Every payment request includes a unique idempotency key — typically a combination of the customer ID, amount, and a client-generated UUID. The payment gateway stores this key with the transaction result. When a duplicate request arrives, the gateway returns the stored result without reprocessing.
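A minimal sketch of this pattern, under stated assumptions: the store is an in-memory dictionary rather than a durable database, `_charge_processor` is a hypothetical stand-in for the real processor call, and all class and field names are illustrative. The sketch also hashes the request body so that a key reused with a different payload is rejected instead of silently returning the wrong result.

```python
import hashlib
import json
import uuid
from dataclasses import dataclass, field


class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different request body."""


def fingerprint(payload: dict) -> str:
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class IdempotentGateway:
    # Hypothetical in-memory store: key -> (request fingerprint, stored result).
    _results: dict = field(default_factory=dict)
    charges_made: int = 0

    def _charge_processor(self, payload: dict) -> dict:
        # Stand-in for the real processor call (assumption, not a real API).
        self.charges_made += 1
        return {"status": "succeeded", "charge_id": f"ch_{self.charges_made}"}

    def charge(self, idempotency_key: str, payload: dict) -> dict:
        fp = fingerprint(payload)
        if idempotency_key in self._results:
            stored_fp, stored_result = self._results[idempotency_key]
            if stored_fp != fp:
                raise IdempotencyConflict(idempotency_key)
            # Duplicate request: return the stored result without reprocessing.
            return stored_result
        result = self._charge_processor(payload)
        self._results[idempotency_key] = (fp, result)
        return result


gateway = IdempotentGateway()
key = str(uuid.uuid4())  # client-generated key
request = {"customer_id": "cus_123", "amount_minor": 2500, "currency": "USD"}
first = gateway.charge(key, request)
retry = gateway.charge(key, request)  # same key, same body: no second charge
```

In production the lookup and the write would need to be atomic and durable. This sketch only shows the control flow.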
Where idempotency breaks:
- Timeout between systems: Your gateway sends a charge request to the payment processor. The processor charges the customer but the response times out. Your gateway does not know if the charge succeeded. The customer retries. Without idempotency, you now have two charges.
- Partial failures: The charge succeeds, the ledger update fails. Without idempotency and compensation, money moves but your records do not reflect it.
- Concurrent requests: Two requests for the same transaction arrive simultaneously. Without proper locking, both may succeed.
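One way to close the concurrent-request gap is to let the database arbitrate: each request tries to insert its key, and a uniqueness constraint guarantees that only one insert wins. The sketch below uses SQLite from the Python standard library purely for illustration. The table name, column names, and the `pending`/`done` statuses are assumptions, not a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_keys ("
    " idem_key TEXT PRIMARY KEY,"
    " status TEXT NOT NULL,"
    " result TEXT)"
)
conn.commit()


def claim(conn: sqlite3.Connection, idem_key: str) -> bool:
    """Return True if this caller won the right to process the key."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO idempotency_keys (idem_key, status) VALUES (?, 'pending')",
                (idem_key,),
            )
        return True
    except sqlite3.IntegrityError:
        # Another request already holds this key.
        return False


def complete(conn: sqlite3.Connection, idem_key: str, result: str) -> None:
    with conn:
        conn.execute(
            "UPDATE idempotency_keys SET status = 'done', result = ? WHERE idem_key = ?",
            (result, idem_key),
        )


def lookup(conn: sqlite3.Connection, idem_key: str):
    return conn.execute(
        "SELECT status, result FROM idempotency_keys WHERE idem_key = ?",
        (idem_key,),
    ).fetchone()


won_first = claim(conn, "key-abc")   # first request claims the key
won_second = claim(conn, "key-abc")  # concurrent duplicate loses the race
complete(conn, "key-abc", "ch_1")
```

A request that loses the race and finds the key still `pending` is in the timeout case described above. It should ask the processor for the status of the original charge rather than charge again.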
Reconciliation: Catching Every Discrepancy
Reconciliation is the process of comparing transaction records across all systems — your gateway, your payment processor, your ledger, and your bank — to ensure they agree. When they disagree, you have a problem that requires investigation.
Three-layer reconciliation:
- Gateway-to-processor reconciliation: Compare every transaction in your gateway with the corresponding record in the payment processor’s system. Discrepancies indicate failed communication, partial failures, or processor errors.
- Processor-to-ledger reconciliation: Ensure every processed transaction is reflected in your accounting ledger. A transaction that was processed but not recorded in the ledger creates an accounting discrepancy.
- Ledger-to-bank reconciliation: Compare your ledger with bank settlement records. This catches processor errors, settlement failures, and timing differences.
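Each layer comes down to the same operation: compare two sets of records keyed by transaction ID and report what is missing or different. The sketch below shows that comparison for one pair of systems. The `Record` fields and the sample data are illustrative assumptions, and real records would carry more attributes (currency, settlement date, fees).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Record:
    txn_id: str
    amount_minor: int  # amount in minor units (for example cents)
    status: str


def reconcile(left: Iterable[Record], right: Iterable[Record]) -> Dict[str, List[str]]:
    left_by_id = {r.txn_id: r for r in left}
    right_by_id = {r.txn_id: r for r in right}
    shared = left_by_id.keys() & right_by_id.keys()
    return {
        # Present on the left side only, for example never reached the processor.
        "missing_in_right": sorted(left_by_id.keys() - right_by_id.keys()),
        # Present on the right side only, for example a charge the gateway lost.
        "missing_in_left": sorted(right_by_id.keys() - left_by_id.keys()),
        # Present on both sides but amount or status disagrees.
        "mismatched": sorted(t for t in shared if left_by_id[t] != right_by_id[t]),
    }


gateway_records = [
    Record("t1", 2500, "succeeded"),
    Record("t2", 1000, "succeeded"),
    Record("t3", 700, "failed"),
]
processor_records = [
    Record("t1", 2500, "succeeded"),
    Record("t3", 700, "succeeded"),
    Record("t4", 300, "succeeded"),
]
report = reconcile(gateway_records, processor_records)
```

Here `t3` is the dangerous case: the gateway recorded a failure while the processor recorded a successful charge.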
Automation is non-negotiable:
Manual reconciliation does not scale. At 100 million transactions per month, even a 0.001% discrepancy rate means 1,000 transactions to investigate. Automated reconciliation runs continuously, flags discrepancies in real time, and routes them to the appropriate investigation queue.
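The arithmetic, and a very small version of the routing step, can be sketched as follows. The discrepancy kinds and queue names are hypothetical. They only illustrate that each kind of discrepancy has an owner and that anything unrecognized falls back to manual review.

```python
from fractions import Fraction

MONTHLY_TRANSACTIONS = 100_000_000
DISCREPANCY_RATE = Fraction(1, 100_000)  # 0.001% expressed exactly

# Expected number of transactions to investigate each month.
expected_discrepancies = MONTHLY_TRANSACTIONS * DISCREPANCY_RATE

# Hypothetical mapping from discrepancy kind to investigation queue.
QUEUE_BY_KIND = {
    "missing_at_processor": "gateway-ops",
    "missing_in_ledger": "finance-ops",
    "amount_mismatch": "finance-ops",
    "settlement_mismatch": "treasury",
}


def route(kind: str) -> str:
    # Unknown kinds go to a catch-all queue instead of being dropped.
    return QUEUE_BY_KIND.get(kind, "manual-review")
```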
High Availability Architecture
Active-Active Across Regions
Payment gateways must be available across all time zones and all business hours. Active-active deployment across multiple regions provides this:
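- Traffic routing: Global traffic management, such as DNS-based routing with health checks (AWS Route 53, GCP Cloud DNS), routes requests to the nearest healthy region.
- Data replication: Transaction state replicates synchronously within a region and asynchronously across regions. Synchronous within-region replication ensures consistency for the most common case (requests routed to a single region). Because cross-region replication is asynchronous, a region that takes over may briefly lack the most recent writes, so retries after a failover must still pass through idempotency checks.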
- Failover: If one region fails, traffic routes to the remaining regions. The failover must complete in under 30 seconds to meet payment processing SLAs.
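The routing and failover decision itself is simple. The hard parts are health detection and data replication. As an illustration only, the sketch below picks the lowest-latency healthy region and falls back when that region is marked unhealthy. The region names and latency figures are made up, and in practice this decision would typically live in the DNS or load-balancing layer rather than in application code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    name: str
    latency_ms: float  # measured latency from the client's vantage point
    healthy: bool = True


def pick_region(regions: List[Region]) -> Region:
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        raise RuntimeError("no healthy region available")
    # Nearest healthy region, approximated here by lowest latency.
    return min(healthy, key=lambda r: r.latency_ms)


regions = [
    Region("eu-west", 18.0),
    Region("us-east", 85.0),
    Region("ap-south", 140.0),
]
normal_choice = pick_region(regions).name

regions[0].healthy = False  # simulate a regional failure
failover_choice = pick_region(regions).name
```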
Circuit Breakers for Payment Processors
Payment processors (Stripe, Adyen, Worldpay) have their own availability characteristics. A circuit breaker pattern protects your gateway from cascading failures:
- Closed state (healthy): All requests pass through to the processor.
- Open state: If the processor’s error rate exceeds a threshold (e.g., 5% of requests), the circuit opens and requests fail fast.
- Half-open state: After a cooldown period, a small percentage of requests test the processor. If they succeed, the circuit closes. If they fail, it opens again.
Key insight: Always have a secondary processor configured. When the primary circuit opens, fail over to the secondary automatically. This provides processor-level redundancy without vendor lock-in.
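A simplified sketch of the breaker and the failover decision follows. Several things here are assumptions for the sake of a short example: the error rate is counted since the circuit last closed rather than over a rolling window, the first result after the cooldown decides whether the circuit closes (instead of a sampled percentage of requests), and `primary` and `secondary` are hypothetical callables standing in for processor clients.

```python
import time


class CircuitBreaker:
    def __init__(self, error_threshold=0.05, min_calls=20, cooldown_s=30.0,
                 clock=time.monotonic):
        self.error_threshold = error_threshold
        self.min_calls = min_calls  # avoid tripping on a tiny sample
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.state = "closed"
        self.calls = 0
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                return False  # fail fast while the circuit is open
            self.state = "half_open"  # cooldown elapsed: allow a trial
        return True

    def record(self, ok: bool) -> None:
        if self.state == "half_open":
            if ok:
                self.state, self.calls, self.failures = "closed", 0, 0
            else:
                self._open()
            return
        self.calls += 1
        if not ok:
            self.failures += 1
        if (self.calls >= self.min_calls
                and self.failures / self.calls > self.error_threshold):
            self._open()

    def _open(self) -> None:
        self.state = "open"
        self.opened_at = self.clock()


def charge_with_failover(breaker, primary, secondary, request):
    if not breaker.allow():
        # Primary circuit is open: route to the secondary processor.
        return secondary(request)
    try:
        result = primary(request)
    except ConnectionError:
        breaker.record(False)
        raise  # outcome is ambiguous; do not blindly re-charge elsewhere
    breaker.record(True)
    return result


now = [0.0]  # fake clock so the example is deterministic
breaker = CircuitBreaker(min_calls=4, clock=lambda: now[0])


def failing_primary(request):
    raise ConnectionError("primary unavailable")


def healthy_primary(request):
    return {"processor": "primary", "status": "succeeded"}


def secondary(request):
    return {"processor": "secondary", "status": "succeeded"}


for _ in range(4):
    try:
        charge_with_failover(breaker, failing_primary, secondary, {"amount_minor": 2500})
    except ConnectionError:
        pass

routed = charge_with_failover(breaker, failing_primary, secondary, {"amount_minor": 2500})
```

Note the design choice in `charge_with_failover`: a request that actually reached the primary and failed is not retried on the secondary, because the primary may have charged the customer before the error. Only requests rejected by the open circuit, which never reached the primary, are sent to the secondary.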
The Ledger Pattern
The ledger is the single source of truth for all financial transactions. Every payment, refund, and settlement is recorded as a debit or credit entry in the ledger.
Double-entry bookkeeping:
Every transaction creates at least two entries — a debit and a credit. This ensures the ledger always balances. If a debit exists without a corresponding credit, you have a discrepancy that must be investigated.
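The balancing rule can be enforced at write time, so that an unbalanced transaction never reaches the ledger. The sketch below is a minimal illustration. The account names, the `Entry` fields, and the list used as a ledger are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entry:
    txn_id: str
    account: str
    debit_minor: int = 0   # amounts in minor units (for example cents)
    credit_minor: int = 0


def post(ledger: List[Entry], entries: List[Entry]) -> None:
    """Append entries only if they form a balanced transaction."""
    debits = sum(e.debit_minor for e in entries)
    credits = sum(e.credit_minor for e in entries)
    if len(entries) < 2 or debits != credits:
        raise ValueError(f"unbalanced transaction: debits={debits}, credits={credits}")
    ledger.extend(entries)


ledger: List[Entry] = []
post(ledger, [
    Entry("t1", "processor_receivable", debit_minor=2500),
    Entry("t1", "merchant_payable", credit_minor=2500),
])
total_debits = sum(e.debit_minor for e in ledger)
total_credits = sum(e.credit_minor for e in ledger)
```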
Immutable ledger entries:
Ledger entries are never modified or deleted. Corrections are made through new entries that offset the original. This provides a complete audit trail for regulatory compliance.
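To make the idea concrete, here is a sketch of an append-only ledger in which a correction is a new entry that swaps the debit and credit of the original and points back to it. The class and field names are illustrative assumptions, and a real ledger would enforce immutability in the database as well, not only in application code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)  # entries cannot be mutated after creation
class LedgerEntry:
    entry_id: int
    account: str
    debit_minor: int
    credit_minor: int
    reverses: Optional[int] = None  # entry_id this entry offsets, if any


class AppendOnlyLedger:
    def __init__(self) -> None:
        self._entries: List[LedgerEntry] = []

    def append(self, account: str, debit_minor: int = 0, credit_minor: int = 0,
               reverses: Optional[int] = None) -> LedgerEntry:
        entry = LedgerEntry(len(self._entries) + 1, account,
                            debit_minor, credit_minor, reverses)
        self._entries.append(entry)
        return entry

    def reverse(self, entry_id: int) -> LedgerEntry:
        original = self._entries[entry_id - 1]
        # Offset the original by swapping its debit and credit amounts.
        return self.append(original.account,
                           debit_minor=original.credit_minor,
                           credit_minor=original.debit_minor,
                           reverses=original.entry_id)

    def entries(self) -> Tuple[LedgerEntry, ...]:
        return tuple(self._entries)  # read-only view for auditors

    def balance(self, account: str) -> int:
        return sum(e.debit_minor - e.credit_minor
                   for e in self._entries if e.account == account)


ledger = AppendOnlyLedger()
debit_leg = ledger.append("processor_receivable", debit_minor=2500)
credit_leg = ledger.append("merchant_payable", credit_minor=2500)

# The charge was recorded in error: offset both legs instead of deleting them.
ledger.reverse(debit_leg.entry_id)
ledger.reverse(credit_leg.entry_id)
```

After the correction the account balances return to zero, while all four entries remain visible in the audit trail.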
What You Can Actually Use Today
- Stripe: Best for developer experience and rapid integration. Good for startups and mid-market fintechs.
- Adyen: Best for global payments with local acquiring. Strong in Europe and Asia-Pacific.
- Worldpay: Best for high-volume enterprise processing. Strong in card-not-present and cross-border.
- Reconciliation tools: Reconciliation features in platforms such as GoCardless or Chargebee, or custom reconciliation engines built on event sourcing
- Idempotency frameworks: Redis-backed idempotency keys with database persistence for durability
FAQ
What is the most common cause of payment system failures?
Partial failures — where a transaction succeeds in one system (e.g., the payment processor) and fails in another (e.g., the ledger). Without proper idempotency and reconciliation, these discrepancies can go undetected for hours or days.
How do you handle payment processor outages?
Active-active deployment with circuit breakers and automatic failover to a secondary processor. The secondary processor should be a different provider (e.g., Stripe as primary, Adyen as secondary) to avoid correlated failures.
How long does reconciliation take?
Automated reconciliation runs continuously. Most discrepancies are detected within minutes. Full end-of-day reconciliation completes within 2 hours for most payment volumes.
We help fintechs design and build payment gateway infrastructure that processes billions in transaction volume with zero tolerance for correctness failures. If you are building or scaling a payment system, get in touch.