Kubernetes GPU Scheduling for Quantitative Research Workloads
GPU orchestration patterns for quantitative research. NVIDIA GPU Operator, vGPU sharing, and cost optimisation for ML infrastructure in capital markets.
Quantitative research teams consume GPU compute differently from standard ML teams. A single backtest of a reinforcement learning strategy may require 8 H100 GPUs for 72 hours, then nothing for a week. A risk model training run may consume 4 A100s for 6 hours, but the researcher needs interactive access to the dashboard throughout. Peak demand is unpredictable and hit-driven.
We have built GPU infrastructure for quant hedge funds and bank research desks on Kubernetes. Here is what we learned about scheduling, sharing, and cost management for financial ML workloads.
Who Is This Guide For?
This guide is for platform engineers, quantitative developers, and infrastructure leads at capital markets firms building GPU infrastructure for research and trading. If you are moving from dedicated GPU servers to a shared Kubernetes cluster, this is for you.
By the End of This, You’ll Know…
- How to configure the NVIDIA GPU Operator for multi-tenant quant workflows
- When to use vGPU partitioning vs MIG vs time-slicing
- How to implement cost chargeback for GPU-intensive research teams
- The scheduling patterns that keep GPU utilisation above 70%
The GPU Operator on Kubernetes
The NVIDIA GPU Operator automates the management of GPU resources on Kubernetes. It replaces the manual process of installing NVIDIA drivers, container runtime hooks, and monitoring agents on each GPU node.
Key components for quant research:
- GPU Feature Discovery: Automatically detects GPU model, memory, and compute capability on each node. This matters for mixed clusters where A100, H100, and older GPU generations coexist.
- Time-Slicing: Multiple pods can share a single GPU by time-multiplexing access. Each pod gets a fraction of the GPU’s compute time, with no memory or fault isolation between the pods sharing it. Good for interactive research workloads where latency is not critical.
- MIG (Multi-Instance GPU): GPU partitioning at the hardware level. Each partition has dedicated compute, memory, and bandwidth. Supported on A100 and H100. Required when workloads need guaranteed isolation.
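To make the time-slicing option concrete, here is a minimal sketch of the sharing config that the device plugin reads from a ConfigMap. The layout follows NVIDIA's time-slicing documentation as we understand it, so check it against the operator version you run. The replica count of 4 and the helper names are assumptions made for illustration.

```python
import json


def time_slicing_config(replicas: int) -> dict:
    """Build a sharing config that advertises each physical GPU as
    `replicas` schedulable nvidia.com/gpu resources."""
    if replicas < 2:
        raise ValueError("time-slicing needs at least 2 replicas per GPU")
    return {
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                "resources": [{"name": "nvidia.com/gpu", "replicas": replicas}]
            }
        },
    }


def schedulable_gpus(physical_gpus: int, replicas: int) -> int:
    """Number of nvidia.com/gpu resources the node would advertise."""
    return physical_gpus * replicas


config = time_slicing_config(replicas=4)  # 4 is an illustrative choice
print(json.dumps(config, indent=2))       # JSON is also valid YAML
print(schedulable_gpus(physical_gpus=8, replicas=4))
```

The printed JSON can be pasted into the ConfigMap data as-is, because YAML accepts JSON syntax. Oversubscribing this way changes only what the scheduler sees; pods on the same GPU still compete for its memory.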
GPU Sharing Strategies
| Strategy | Isolation Level | Supported GPUs | Best For |
|---|---|---|---|
| MIG | Hardware-level | A100, H100 | Multi-tenant production workloads |
| Time-slicing | Software-level | All | Interactive research, batch jobs |
| vGPU | Hypervisor-level | All (via VMware) | VMware-centric shops (Tanzu) |
| Dedicated GPU | Full | All | Single-tenant training runs |
When to Use Each
- Dedicated GPU for production training runs that need deterministic performance
- MIG for serving multiple inference models on a single GPU with guaranteed SLAs
- Time-slicing for interactive Jupyter notebooks, ad-hoc research queries, and low-priority batch jobs
- vGPU for firms running K8s on VMware Tanzu
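The rules of thumb above can be written down as a small decision helper. This is only a sketch: the `Workload` fields and the order in which the rules are checked are our assumptions, not a fixed policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a GPU workload, used only for this sketch."""
    runs_on_vmware: bool = False
    needs_deterministic_performance: bool = False
    needs_guaranteed_isolation: bool = False


def choose_sharing_strategy(workload: Workload) -> str:
    """Map workload traits to one of the four strategies in the table."""
    if workload.runs_on_vmware:
        return "vGPU"
    if workload.needs_deterministic_performance:
        return "Dedicated GPU"
    if workload.needs_guaranteed_isolation:
        return "MIG"
    # Notebooks, ad-hoc queries, and low-priority batch jobs land here.
    return "Time-slicing"


print(choose_sharing_strategy(Workload(needs_guaranteed_isolation=True)))  # MIG
print(choose_sharing_strategy(Workload()))  # Time-slicing
```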
Scheduling Configuration
Node Labelling
Every GPU node should be labelled with its GPU model, memory, and partitioning strategy.
Researchers use node selectors to target the right GPU type for their workload.
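The labelling and node-selector pattern could look roughly like the sketch below. The `nvidia.com/gpu.product` and `nvidia.com/gpu.memory` keys are the kind of labels GPU Feature Discovery sets on its own. The label values, the `example.com/gpu-sharing` key, the node name, and the image are hypothetical and only illustrate the shape.

```python
import json

# Labels of the kind GPU Feature Discovery sets automatically (values illustrative).
GFD_LABELS = {
    "nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3",
    "nvidia.com/gpu.memory": "81559",
}
# Hypothetical team-defined label recording the partitioning strategy.
CUSTOM_LABELS = {"example.com/gpu-sharing": "dedicated"}


def label_command(node, labels):
    """Render a kubectl command that applies the given labels to a node."""
    pairs = " ".join(f"{key}={value}" for key, value in labels.items())
    return f"kubectl label node {node} {pairs} --overwrite"


def training_pod(name, image, gpus, node_selector):
    """Build a Pod manifest that requests GPUs on nodes matching the selector."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": dict(node_selector),
            "restartPolicy": "Never",
            "containers": [{
                "name": "main",
                "image": image,
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


def selector_matches(node_labels, node_selector):
    """nodeSelector semantics: every key/value pair must match exactly."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())


selector = {
    "nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3",
    "example.com/gpu-sharing": "dedicated",
}
pod = training_pod("rl-backtest", "registry.example.com/quant/rl-backtest:latest", 8, selector)
print(label_command("gpu-node-01", CUSTOM_LABELS))
print(json.dumps(pod, indent=2))
```

Only the custom label is applied by hand here. The `nvidia.com/*` labels are left to GPU Feature Discovery, so they stay in sync with the hardware.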
Priority Scheduling
Not all GPU workloads have the same urgency. A production trading model retraining has higher priority than a research backtest. Use Kubernetes priority classes to encode that ordering.
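Production workloads preempt research workloads when GPU capacity is limited. The preempted research job should save a checkpoint and shut down gracefully within its termination grace period, which only happens if the job handles the termination signal it receives.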
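A minimal sketch of two priority classes for this split follows, built as plain manifests. The `scheduling.k8s.io/v1` PriorityClass fields are standard Kubernetes. The class names, the numeric values, and the 300-second grace period are assumptions chosen for illustration.

```python
import json


def priority_class(name, value, description, preempts=True):
    """Build a PriorityClass manifest; a higher value is scheduled first."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "preemptionPolicy": "PreemptLowerPriority" if preempts else "Never",
        "description": description,
    }


# Hypothetical names and values: only their relative order matters.
PRODUCTION = priority_class(
    "production-retraining", 100000,
    "Production trading model retraining; may preempt research jobs.",
)
RESEARCH = priority_class(
    "research-backtest", 1000,
    "Research backtests; never preempt other pods.",
    preempts=False,
)


def research_pod_spec(grace_seconds=300):
    """Pod spec fragment for a research job that needs time to checkpoint."""
    return {
        "priorityClassName": RESEARCH["metadata"]["name"],
        # Time between the termination signal and a forced kill.
        "terminationGracePeriodSeconds": grace_seconds,
    }


for manifest in (PRODUCTION, RESEARCH):
    print(json.dumps(manifest, indent=2))
print(json.dumps(research_pod_spec(), indent=2))
```

The grace period should cover the time a job needs to write its checkpoint, so it is worth sizing from the largest model a team trains.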
Cost Chargeback
GPU cost is the most visible expense in any quant infrastructure budget. Implement chargeback at the pod level:
- Label every GPU workload with cost centre, research team, project, and experiment ID.
- Export GPU metrics (utilisation, memory, power draw) at the pod level via DCGM exporter.
- Attribute cost by converting each pod's GPU usage (GPU-seconds) to GPU-hours and multiplying by the blended cost per GPU-hour (compute + amortised infrastructure + cooling).
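The attribution step is simple arithmetic, sketched below. The two usage records reuse the workloads from the introduction (8 GPUs for 72 hours, 4 GPUs for 6 hours). The team names and the per-hour rate components are made-up numbers for illustration, not real prices.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuUsage:
    """One pod's usage, as it might be derived from labels and exported metrics."""
    team: str
    project: str
    gpu_seconds: float  # GPUs allocated multiplied by seconds the pod ran


def blended_rate(compute, infrastructure, cooling):
    """Blended cost per GPU-hour: compute + amortised infrastructure + cooling."""
    return compute + infrastructure + cooling


def chargeback(records, rate_per_gpu_hour):
    """Sum the cost per team: GPU-seconds / 3600 * rate per GPU-hour."""
    totals = defaultdict(float)
    for record in records:
        totals[record.team] += record.gpu_seconds / 3600.0 * rate_per_gpu_hour
    return dict(totals)


rate = blended_rate(compute=2.0, infrastructure=0.75, cooling=0.25)  # hypothetical
records = [
    GpuUsage("rl-strategies", "backtest", gpu_seconds=8 * 72 * 3600),  # 576 GPU-hours
    GpuUsage("risk-models", "training", gpu_seconds=4 * 6 * 3600),     # 24 GPU-hours
]
costs = chargeback(records, rate)
print(costs)  # {'rl-strategies': 1728.0, 'risk-models': 72.0}
```

Grouping by team is one choice. The same loop can key on cost centre, project, or experiment ID if those labels are present on the pod.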
A dashboard that shows each research team’s GPU spend alongside their compute-hours consumed tends to change behaviour without management intervention.
What You Can Actually Use Today
| Tool | Purpose | Source |
|---|---|---|
| NVIDIA GPU Operator | K8s GPU resource management | Open source |
| DCGM Exporter | GPU metrics for Prometheus | Open source |
| Kubeflow | ML workflow orchestration on K8s | Open source |
FAQ
Can I mix A100 and H100 nodes in the same cluster? Yes, but it requires careful scheduling. Node affinity rules ensure that inference workloads target A100 nodes and training workloads target H100 nodes. The GPU Operator’s node labels record each node’s GPU type and capacity.
How do I handle GPU memory fragmentation? MIG partitioning and time-slicing both introduce memory overhead. For workloads with large model footprints, use dedicated GPU nodes. For smaller models (BERT-sized and below), MIG partitions with 20-40 GB slices work well.
What is the cost difference between on-premise and cloud GPUs? Cloud GPUs cost 2-3x more than equivalent on-premise compute for sustained workloads, but offer elastic scaling. The total cost comparison depends on utilisation: at 70%+ utilisation, on-premise is cheaper. At lower utilisation levels, cloud is cost-effective.
Further Reading
For a deeper discussion of ML infrastructure in capital markets, see our Financial Data Platforms service page.