February 22, 2026

Kubernetes GPU Scheduling for Quantitative Research Workloads

GPU orchestration patterns for quantitative research. NVIDIA GPU Operator, vGPU sharing, and cost optimisation for ML infrastructure in capital markets.

Quantitative research teams consume GPU compute differently from standard ML teams. A single backtest of a reinforcement learning strategy may require 8 H100 GPUs for 72 hours, then nothing for a week. A risk model training run may consume 4 A100s for 6 hours, but the researcher needs interactive access to the dashboard throughout. Peak demand is unpredictable and hit-driven.

We have built GPU infrastructure for quant hedge funds and bank research desks on Kubernetes. Here is what we learned about scheduling, sharing, and cost management for financial ML workloads.

Who Is This Guide For?

This guide is for platform engineers, quantitative developers, and infrastructure leads at capital markets firms building GPU infrastructure for research and trading. If you are moving from dedicated GPU servers to a shared Kubernetes cluster, this is for you.

By the End of This, You’ll Know…

How to configure the NVIDIA GPU Operator for multi-tenant quant workflows
When to use vGPU partitioning vs MIG vs time-slicing
How to implement cost chargeback for GPU-intensive research teams
The scheduling patterns that keep GPU utilisation above 70%

The GPU Operator on Kubernetes

The NVIDIA GPU Operator automates the management of GPU resources on Kubernetes. It replaces the manual process of installing NVIDIA drivers, container runtime hooks, and monitoring agents on each GPU node.

Key components for quant research:

GPU Feature Discovery: Automatically detects GPU model, memory, and compute capability on each node. This matters for mixed clusters where A100, H100, and older GPU generations coexist.
Time-Slicing: Multiple pods can share a single GPU by time-multiplexing access. Each pod gets a share of the GPU’s compute time, but there is no memory or fault isolation between the pods. Good for interactive research workloads where latency is not critical.
MIG (Multi-Instance GPU): GPU partitioning at the hardware level. Each partition has dedicated compute, memory, and bandwidth. Supported on A100 and H100. Required when workloads need guaranteed isolation.

GPU Sharing Strategies

Strategy	Isolation Level	Supported GPUs	Best For
MIG	Hardware-level	A100, H100	Multi-tenant production workloads
Time-slicing	Software-level	All	Interactive research, batch jobs
vGPU	Hypervisor-level	All (via VMware)	VMware-centric shops (Tanzu)
Dedicated GPU	Full	All	Single-tenant training runs

When to Use Each

Dedicated GPU for production training runs that need deterministic performance
MIG for serving multiple inference models on a single GPU with guaranteed SLAs
Time-slicing for interactive Jupyter notebooks, ad-hoc research queries, and low-priority batch jobs
vGPU for firms running K8s on VMware Tanzu
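
As a rough illustration of the time-slicing option, the GPU Operator's device plugin can read a sharing configuration from a ConfigMap. The sketch below assumes the Operator runs in a gpu-operator namespace; the ConfigMap name, the `any` key, and the replica count of 4 are illustrative choices, not recommendations.

```yaml
# Sketch only: time-slicing configuration for the NVIDIA device plugin.
# Name, namespace, and replica count are assumptions for illustration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config        # hypothetical name
  namespace: gpu-operator          # assumes the Operator's usual namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4            # each physical GPU is advertised as 4 schedulable GPUs
```

The Operator's ClusterPolicy has to reference this ConfigMap in its device plugin settings before it takes effect. With four replicas, four pods can each request `nvidia.com/gpu: 1` on the same physical GPU, with no memory isolation between them. MIG is configured differently, typically through a node label read by the Operator's MIG manager, so check the documentation for your Operator version before relying on either mechanism.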

Scheduling Configuration

Node Labelling

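Every GPU node should be labelled with its GPU model, memory, and partitioning strategy. GPU Feature Discovery applies labels such as nvidia.com/gpu.product automatically; check the exact keys and value units on your own nodes, as they can differ by Operator version: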

labels:
  nvidia.com/gpu.product: "NVIDIA-H100-80GB"
  nvidia.com/gpu.memory: "80"
  nvidia.com/gpu.mig-enabled: "true"

Researchers use node selectors to target the right GPU type for their workload.
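
A minimal sketch of such a node selector, reusing the product label shown above. The pod name and container image are placeholders, and the GPU count is arbitrary.

```yaml
# Sketch only: pin a training pod to H100 nodes via the product label.
apiVersion: v1
kind: Pod
metadata:
  name: risk-model-training        # hypothetical name
spec:
  restartPolicy: Never
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-H100-80GB"   # must match the node label exactly
  containers:
    - name: trainer
      image: registry.example.com/quant/trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 4        # whole GPUs requested from the device plugin
```

If no node carries a matching label, the pod stays Pending, so it is worth confirming the label value with `kubectl get nodes --show-labels` first.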

Priority Scheduling

Not all GPU workloads have the same urgency. A production trading model retraining has higher priority than a research backtest. Use Kubernetes priority classes:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-training
value: 1000
preemptionPolicy: PreemptLowerPriority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: research-backtest
value: 500

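Production workloads preempt research workloads when GPU capacity is limited. The preempted research job receives SIGTERM and should save a checkpoint and shut down gracefully, which works only if the job handles the signal and its termination grace period is long enough to finish writing the checkpoint.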
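
To make that concrete, here is a sketch of a research pod that opts into the lower priority class and asks for a longer shutdown window. The pod name, image, and the 300-second grace period are assumptions; the right value depends on how long your checkpoints take to write.

```yaml
# Sketch only: a preemptible backtest pod with time to checkpoint.
apiVersion: v1
kind: Pod
metadata:
  name: rl-backtest                # hypothetical name
spec:
  priorityClassName: research-backtest   # the class defined above
  terminationGracePeriodSeconds: 300     # assumed: time allowed after SIGTERM before SIGKILL
  restartPolicy: Never
  containers:
    - name: backtest
      image: registry.example.com/quant/backtest:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 8
```

The container's entrypoint still has to trap SIGTERM and write the checkpoint itself; Kubernetes only provides the signal and the waiting period.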

Cost Chargeback

GPU cost is the most visible expense in any quant infrastructure budget. Implement chargeback at the pod level:

Label every GPU workload with cost centre, research team, project, and experiment ID.
Export GPU metrics (utilisation, memory, power draw) at the pod level via DCGM exporter.
Attribute cost by multiplying GPU usage (GPU-seconds allocated to the pod, converted to GPU-hours) by the blended cost per GPU-hour (compute + amortised infrastructure + cooling).
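
The labelling step might look like the sketch below. The label keys and values are hypothetical; what matters is that every GPU pod carries the same keys so that usage can be grouped consistently.

```yaml
# Sketch only: chargeback labels on a GPU pod. Keys and values are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: rl-backtest-0412           # hypothetical name
  labels:
    cost-centre: "cc-1234"
    research-team: "stat-arb"
    project: "rl-execution"
    experiment-id: "exp-0412"
spec:
  restartPolicy: Never
  containers:
    - name: backtest
      image: registry.example.com/quant/backtest:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 8
```

As a worked example of the attribution step, a pod holding 8 GPUs for 72 hours accumulates 8 × 72 = 576 GPU-hours (2,073,600 GPU-seconds). At an assumed blended rate of $4.00 per GPU-hour, that run would be charged 576 × 4.00 = $2,304 to the cost centre on its labels. The rate is purely illustrative.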

A dashboard that shows each research team’s GPU spend alongside their compute-hours consumed changes behaviour without management intervention.

What You Can Actually Use Today

Tool	Purpose	Source
NVIDIA GPU Operator	K8s GPU resource management	Open source
DCGM Exporter	GPU metrics for Prometheus	Open source
Kubeflow	ML workflow orchestration on K8s	Open source

FAQ

Can I mix A100 and H100 nodes in the same cluster? Yes, but it requires careful scheduling. Node affinity rules can steer, for example, inference workloads to A100 nodes and training workloads to H100 nodes. The GPU Operator’s node labels expose each node’s GPU type and capacity for those rules to match on.
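
A sketch of such an affinity rule for an inference pod follows. The A100 product string is an assumption, since the exact label value depends on the card variant in your nodes; the pod name and image are placeholders.

```yaml
# Sketch only: require an A100 node for an inference pod.
apiVersion: v1
kind: Pod
metadata:
  name: inference-server           # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                  - "NVIDIA-A100-SXM4-80GB"   # assumed value; check your node labels
  containers:
    - name: server
      image: registry.example.com/quant/inference:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```

Unlike a plain node selector, the `In` operator accepts several values, so one rule can cover more than one A100 variant.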

How do I handle GPU memory fragmentation? MIG partitioning and time-slicing both introduce memory overhead. For workloads with large model footprints, use dedicated GPU nodes. For smaller models (BERT-sized and below), MIG partitions with 20-40 GB slices work well.

What is the cost difference between on-premise and cloud GPUs? For sustained workloads, cloud GPUs typically cost 2-3x more than equivalent on-premise compute, but they offer elastic scaling. The comparison depends on utilisation: at 70%+ utilisation, on-premise is cheaper; at lower utilisation, cloud is usually the more cost-effective option.
