June 15, 2024

Designing Cloud-Native Trading Systems for Sub-Millisecond Latency

Architecture patterns for low-latency trading systems on cloud infrastructure. FPGA, kernel bypass, and network optimisation strategies for capital markets.

The belief that cloud cannot deliver sub-millisecond trading latency is outdated. The constraint is not the cloud provider — it is how you architect within the cloud. Firms that treat AWS, GCP, or Azure as a data centre with better networking get data-centre performance. Firms that treat the cloud as a programmable substrate get latency numbers that surprise their counterparties.

We have deployed trading systems on AWS and GCP that consistently achieve round-trip latencies under 500 microseconds for order-to-acknowledge paths. The architecture is fundamentally different from on-premise trading infrastructure, but the performance is comparable.

Who Is This Guide For?

This guide is for trading system architects, infrastructure engineers, and CTOs at capital markets firms evaluating cloud deployment for latency-sensitive trading workloads. If you need to move trading infrastructure to the cloud or optimise existing cloud-based trading systems, this is for you.

By the End of This, You’ll Know…

Why cloud-native trading architecture differs fundamentally from on-premise approaches
How to achieve sub-millisecond latency using FPGA, kernel bypass, and network optimisation
Which cloud provider features matter most for trading workloads
The real cost-performance trade-offs between on-premise and cloud trading infrastructure

Why Cloud Trading Latency Is a Solvable Problem

The latency argument against cloud trading has always been about network hops. On-premise trading systems colocate with exchange matching engines to minimise physical distance. Cloud data centres are further from exchange data centres, so the raw network latency is higher.

But this argument conflates two different latency components:

Wire latency — the time a signal takes to travel through physical media. This is a function of distance and speed of light. You cannot optimise this in the cloud.
Software latency — the time your trading application takes to process a market data event, make a decision, and send an order. This is a function of your architecture, and it is dramatically optimisable in the cloud.

Most trading systems spend 80-90% of their latency budget on software processing. The wire latency to the exchange is 50-100 microseconds for on-premise and 200-500 microseconds for cloud. If your software latency is 500 microseconds, moving to the cloud adds roughly 200 microseconds to a 500-microsecond path: a 40% increase. If you reduce software latency to 50 microseconds, the same 200 microseconds is four times the software path, so the total is 5x what it was, but the absolute latency is 250 microseconds, which is still under 300 microseconds and half of the unoptimised software path on its own.
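
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the budget calculation: the function name `cloud_penalty` is made up for illustration, and the 200-microsecond figure is the added wire latency used in the paragraph above.

```python
def cloud_penalty(software_us: float, added_wire_us: float) -> tuple[float, float]:
    """Return (total latency in microseconds, added wire latency as a
    fraction of the software-only path)."""
    total_us = software_us + added_wire_us
    return total_us, added_wire_us / software_us


# The two scenarios from the text: slow software vs. optimised software.
for software_us in (500.0, 50.0):
    total_us, increase = cloud_penalty(software_us, 200.0)
    print(f"software={software_us:.0f} us  total={total_us:.0f} us  increase={increase:.0%}")
```

The relative penalty grows as the software gets faster, while the absolute total keeps falling. That is why the relative figure alone is a poor guide to whether a strategy can run in the cloud.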

The real question is not “how fast is the cloud?” It is “how fast can you make your software?”

Architecture Patterns

Kernel Bypass Networking

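Standard network stacks introduce latency through context switches, buffer copies, and interrupt handling. Kernel bypass frameworks such as DPDK remove these costs by letting your trading application poll the network interface card directly from user space. io_uring is a related but different tool: traffic still goes through the kernel network stack, but submissions and completions are batched through shared ring buffers, which cuts system-call overhead.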

Performance impact:

Standard TCP/IP stack: 15-30 microseconds per packet
DPDK kernel bypass: 1-3 microseconds per packet
Improvement: 10-20x reduction in network processing latency

Implementation on AWS:

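Use Enhanced Networking through the Elastic Network Adapter (ENA), which has a DPDK poll-mode driver
Launch instances in a cluster placement group to minimise network hop count
Pin network interrupts to dedicated CPU cores, away from the cores running trading logic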
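
As a rough illustration of the interrupt-pinning step on Linux, the sketch below builds the hexadecimal CPU mask that `/proc/irq/<n>/smp_affinity` expects and writes it for one IRQ. The helper names and the IRQ number in the usage comment are hypothetical; look up the real IRQ numbers for your NIC queues in `/proc/interrupts`. Writing the file requires root.

```python
from pathlib import Path


def irq_affinity_mask(cores: list[int]) -> str:
    """Build the hex CPU bitmask for /proc/irq/<n>/smp_affinity.

    Masks wider than 32 bits are written as comma-separated 32-bit groups,
    most significant group first.
    """
    if not cores or min(cores) < 0:
        raise ValueError("need at least one non-negative core index")
    mask = 0
    for core in cores:
        mask |= 1 << core
    hex_digits = format(mask, "x")
    groups = []
    while hex_digits:
        groups.append(hex_digits[-8:])   # take 8 hex digits (32 bits) at a time
        hex_digits = hex_digits[:-8]
    return ",".join(reversed(groups))


def pin_irq(irq: int, cores: list[int], proc_root: Path = Path("/proc")) -> None:
    """Write the affinity mask for a single IRQ (needs root on a real host)."""
    target = proc_root / "irq" / str(irq) / "smp_affinity"
    target.write_text(irq_affinity_mask(cores))


# Example (hypothetical IRQ number): pin_irq(42, [2, 3]) writes the mask "c".
print(irq_affinity_mask([2, 3]))
```

If the `irqbalance` daemon is running it may rewrite these masks, so it is usually stopped or configured to leave the chosen IRQs alone.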

FPGA Acceleration

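Field-programmable gate arrays provide hardware-level processing for the most latency-critical path, typically market data decoding and order encoding. FPGA cards from Xilinx (now AMD) and Intel attach to the host over PCIe. In on-premise deployments the card usually terminates the network link itself and processes packets in hardware, bypassing the CPU entirely. On cloud FPGA instances such as AWS F1, the FPGA is reached over PCIe from the host and does not sit directly on the network, so packets cross the host's network interface first; expect a smaller saving than an inline on-premise card delivers.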

Latency comparison:

Software market data decode: 5-15 microseconds
FPGA market data decode: 0.5-2 microseconds
Best for: Market data parsing, pre-trade risk checks, order encoding
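
To make "market data decode" concrete, here is a minimal software decoder for a fixed-width binary quote message. The wire layout (`<cIqIQ`), the field names, and the sample values are all hypothetical and do not correspond to any real exchange protocol. The point is only to show the kind of fixed-offset field extraction that an FPGA pipeline performs in hardware.

```python
import struct
from typing import NamedTuple

# Hypothetical layout, little-endian with no padding:
# 1-byte message type, uint32 instrument id, int64 price in ticks,
# uint32 quantity, uint64 exchange timestamp in nanoseconds.
QUOTE_LAYOUT = struct.Struct("<cIqIQ")


class Quote(NamedTuple):
    msg_type: bytes
    instrument_id: int
    price_ticks: int
    quantity: int
    exchange_ts_ns: int


def decode_quote(buf: bytes, offset: int = 0) -> Quote:
    """Decode one quote starting at `offset` without copying the buffer."""
    return Quote._make(QUOTE_LAYOUT.unpack_from(buf, offset))


# Round-trip a sample message to show the decoder in use.
packet = QUOTE_LAYOUT.pack(b"Q", 1001, 1_234_500, 300, 1_718_409_600_000_000_000)
quote = decode_quote(packet)
print(quote.instrument_id, quote.price_ticks, quote.quantity)
```

A production software decoder would be written in C++ or Rust rather than Python, but the structure is the same: fixed offsets, integer prices, no allocation on the hot path.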

Cloud availability:

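AWS: F1 instances with FPGA support
GCP: no generally available FPGA instance family at the time of writing; A2 is a GPU family (NVIDIA A100), not FPGA hardware
On F1, custom logic is loaded into a running instance as an Amazon FPGA Image (AFI), so logic updates do not require a reboot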

Smart NIC Offload

Cloud providers now build their platforms on Smart NIC-style offload hardware (AWS Nitro; on GCP, the IPU offload used by newer machine families such as C3 and A3) that moves networking, storage, and security processing off the host CPU. For trading workloads, this frees CPU cycles for application logic and helps keep latency deterministic.

Key capabilities:

Hardware-accelerated VPC processing
Encrypted network traffic at line rate
Direct memory access for low-latency storage

Network Topology for Trading

Placement Groups

The major cloud providers offer placement controls that keep your instances physically close to each other within a data centre (placement groups on AWS, compact placement policies on GCP, proximity placement groups on Azure). For trading systems, this is non-negotiable. AWS offers three strategies:

Cluster placement groups: Instances packed close together within a single Availability Zone. Lowest latency, highest risk (a single rack failure can take out every instance).
Spread placement groups: Instances on separate hardware. Higher latency, better resilience.
Partition placement groups: Groups of racks with defined failure domains. Best balance for trading systems.
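
On AWS, the three strategies map onto the `--strategy` option of `aws ec2 create-placement-group`. The sketch below only builds the command line and does not run it. The function name and the group names are hypothetical.

```python
from typing import Optional

VALID_STRATEGIES = {"cluster", "spread", "partition"}


def create_placement_group_cmd(name: str, strategy: str,
                               partition_count: Optional[int] = None) -> list[str]:
    """Build the argv for `aws ec2 create-placement-group` without running it."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    if partition_count is not None and strategy != "partition":
        raise ValueError("partition_count only applies to the partition strategy")
    cmd = ["aws", "ec2", "create-placement-group",
           "--group-name", name, "--strategy", strategy]
    if partition_count is not None:
        cmd += ["--partition-count", str(partition_count)]
    return cmd


# Hypothetical group names for a matching-path cluster and a set of gateways.
print(" ".join(create_placement_group_cmd("matching-core", "cluster")))
print(" ".join(create_placement_group_cmd("order-gateways", "partition", 3)))
```

The resulting list can be passed to `subprocess.run` once AWS credentials are configured. Instances are then launched into the group by name.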

Dedicated Infrastructure

For the most latency-sensitive workloads, cloud providers offer dedicated hosts for lease:

AWS Dedicated Hosts: Physical servers dedicated to your account
GCP Sole-Tenant Nodes: Dedicated Compute Engine servers
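Latency impact: Removes noisy-neighbour contention from other tenants and gives more consistent CPU behaviour. A hypervisor is still present unless you use bare-metal instance types.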

Cross-Region Connectivity

For firms trading across multiple exchanges or geographies:

AWS Direct Connect: Dedicated network connection from your data centre to AWS
GCP Cloud Interconnect: Similar dedicated connectivity
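Latency: Sub-millisecond within a region; cross-region latency is dominated by distance, from single-digit milliseconds between neighbouring regions to well over 100 ms between continents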
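
Distance sets a hard floor on these numbers. The sketch below estimates one-way propagation delay through optical fibre. The refractive index of 1.468 is an assumed typical value for silica fibre, and the route lengths are hypothetical. Real routes are longer than the straight-line distance and add switching delay on top.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458   # in vacuum
FIBRE_REFRACTIVE_INDEX = 1.468          # assumption: typical silica fibre


def one_way_fibre_delay_us(route_km: float) -> float:
    """Lower bound on one-way propagation delay over `route_km` of fibre."""
    speed_in_fibre_km_per_s = SPEED_OF_LIGHT_KM_PER_S / FIBRE_REFRACTIVE_INDEX
    return route_km / speed_in_fibre_km_per_s * 1_000_000


# Hypothetical route lengths: same campus, same metro area, neighbouring regions.
for route_km in (1, 50, 1000):
    print(f"{route_km:>5} km  >= {one_way_fibre_delay_us(route_km):8.1f} us one way")
```

This works out to roughly 5 microseconds per kilometre, which is why no amount of software tuning recovers latency lost to a distant region.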

Real-World Latency Numbers

From production deployments at capital markets firms:

Path	On-Premise	Cloud (Optimised)	Cloud (Standard)
Market data decode	3-8 μs	1-4 μs	15-30 μs
Trading decision	10-50 μs	10-50 μs	10-50 μs
Order encode + send	5-15 μs	2-6 μs	20-40 μs
Total software	18-73 μs	13-60 μs	45-120 μs
Wire to exchange	30-80 μs	200-500 μs	200-500 μs
Total	48-153 μs	213-560 μs	245-620 μs

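On these figures, optimised cloud trading systems land at roughly 4x the end-to-end latency of on-premise systems (213 vs 48 μs at the low end, 560 vs 153 μs at the high end). That is the upper end of a 2-4x range, not the 10-100x commonly assumed.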
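
To build a per-stage table like the one above for your own system, measure each stage in isolation and report percentiles instead of averages. The harness below is a minimal sketch using `time.perf_counter_ns`. The function name `measure_ns` and the stand-in workload are illustrative only, and Python adds far more overhead than a production trading path would tolerate.

```python
import time
from typing import Callable


def measure_ns(fn: Callable[[], object], iterations: int = 10_000,
               warmup: int = 1_000) -> dict[str, int]:
    """Time `fn` repeatedly and return p50, p99, and max in nanoseconds."""
    for _ in range(warmup):          # let caches and branch predictors settle
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    samples.sort()

    def percentile(p: float) -> int:
        return samples[min(len(samples) - 1, int(p * len(samples)))]

    return {"p50": percentile(0.50), "p99": percentile(0.99), "max": samples[-1]}


# Stand-in workload; replace with a decode, decision, or encode stage.
stats = measure_ns(lambda: sum(range(100)))
print({name: f"{value / 1000:.2f} us" for name, value in stats.items()})
```

The gap between p50 and p99 is the jitter that dedicated hosts and core pinning are meant to shrink, so track it alongside the median.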

Cost Considerations

Cloud trading infrastructure costs more per unit of compute than on-premise, but the total cost of ownership depends on how you use it:

Capital expenditure vs operational expenditure: On-premise requires upfront hardware investment. Cloud converts this to monthly operational costs.
Elastic capacity: Trading volume is spiky. Cloud lets you scale compute during market hours and reduce it overnight.
Development velocity: Cloud-native tooling (infrastructure as code, automated provisioning) accelerates development cycles.
Total cost: For a typical trading desk (5-10 trading engines), cloud costs 20-40% more than on-premise, but development velocity improvements offset this within 12-18 months.

What You Can Actually Use Today

AWS: EC2 F1 for FPGA workloads (Trn1 and Inf2 are machine-learning accelerator instances, not FPGAs), placement groups for latency optimisation, Enhanced Networking with DPDK
GCP: compute-optimised machine families such as C2 and H3 for high-performance computing (A3 is a GPU family), gVNIC for enhanced networking, Sole-Tenant Nodes for dedicated hardware
FPGA: Xilinx Alveo U50/U280 for market data processing, Intel Agilex for custom trading logic
Networking: DPDK for kernel bypass, io_uring for async I/O, Solarflare OpenOnload for TCP acceleration

FAQ

Can cloud trading systems match on-premise latency?

No — cloud systems are 2-4x slower than on-premise for the most latency-sensitive paths. However, this gap has narrowed from 10-100x to 2-4x over the past five years, and most trading strategies do not require the absolute lowest latency.

Which cloud provider is best for trading workloads?

AWS has the most mature FPGA support and the largest capital markets customer base. GCP offers better network performance in some regions. Both are suitable — the choice depends on your existing infrastructure and team expertise.

Is cloud trading cost-effective?

For new trading desks and algorithmic strategies, yes. For established desks with existing on-premise infrastructure, migration economics depend on your specific cost structure. We typically see break-even at 12-18 months for greenfield deployments.

We help capital markets firms design and deploy low-latency trading infrastructure on AWS and GCP. If you are evaluating cloud for trading workloads or need to optimise existing cloud-based trading systems, get in touch.
