Scaling and Performance
This guide explains how to scale Virtual MCP Server (vMCP) deployments. For MCPServer scaling, see Horizontal scaling in the Kubernetes operator guide.
Vertical scaling
Vertical scaling (increasing CPU/memory per instance) is the simplest approach and works for all use cases, including stateful backends.
To increase resources, configure podTemplateSpec in your VirtualMCPServer:
spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          resources:
            requests:
              cpu: '500m'
              memory: 512Mi
            limits:
              cpu: '1'
              memory: 1Gi
Vertical scaling is recommended as the starting point for most deployments.
Horizontal scaling
Horizontal scaling (adding more replicas) can improve availability and handle higher request volumes.
How to scale horizontally
Set the replicas field in your VirtualMCPServer spec to control the number of
vMCP pods:
spec:
  replicas: 3
If you omit replicas, the operator defers replica management to an HPA or
other external controller. You can also scale manually or with an HPA:
Option 1: Manual scaling
kubectl scale deployment vmcp-<VMCP_NAME> -n <NAMESPACE> --replicas=3
Option 2: Autoscaling with HPA
kubectl autoscale deployment vmcp-<VMCP_NAME> -n <NAMESPACE> \
--min=2 --max=5 --cpu-percent=70
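As a declarative alternative to `kubectl autoscale`, you can apply a HorizontalPodAutoscaler manifest. This is a sketch with the same thresholds as the command above; it assumes the operator's `vmcp-<VMCP_NAME>` Deployment naming shown earlier:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vmcp-<VMCP_NAME>
  namespace: <NAMESPACE>
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vmcp-<VMCP_NAME>    # the Deployment created by the operator
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

A manifest is easier to keep in version control than an imperative `kubectl autoscale` command, and `autoscaling/v2` also lets you add memory or custom metrics later.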
Session storage for multi-replica deployments
When running multiple replicas, configure Redis session storage so that sessions are shared across pods. Without session storage, a request routed to a different replica than the one that established the session will fail.
spec:
  replicas: 3
  sessionStorage:
    provider: redis
    address: redis-master.toolhive-system.svc.cluster.local:6379
    db: 0
    keyPrefix: vmcp-sessions
    passwordRef:
      name: redis-secret
      key: password
See Redis Sentinel session storage for a complete Redis deployment guide.
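The `passwordRef` above expects a Kubernetes Secret in the same namespace as the VirtualMCPServer. A minimal sketch, assuming the `redis-secret` name and `password` key from the example:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: redis-secret
type: Opaque
stringData:
  password: <REDIS_PASSWORD>   # placeholder; substitute your actual Redis password
```

`stringData` accepts the plain-text value and Kubernetes base64-encodes it on write; in production, prefer creating the Secret from a secrets manager rather than committing it to Git.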
If you configure multiple replicas without session storage, the operator sets a
SessionStorageWarning status condition (with reason SessionStorageMissingForReplicas)
on the resource but still applies the replica count. Pods will start, but
requests routed to a replica that did not establish the session will fail. Ensure
Redis is available before scaling beyond a single replica.
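To check whether the operator has flagged this on an existing resource, you can inspect its status conditions. A sketch using the condition type described above:

```shell
kubectl get virtualmcpserver <VMCP_NAME> -n <NAMESPACE> \
  -o jsonpath='{.status.conditions[?(@.type=="SessionStorageWarning")]}'
```

An empty result means no warning is set; otherwise the output includes the `SessionStorageMissingForReplicas` reason and a human-readable message.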
When horizontal scaling is challenging
Horizontal scaling works well for stateless backends (fetch, search, read-only operations) where sessions can be resumed on any instance.
However, stateful backends make horizontal scaling difficult:
- Stateful backends (Playwright browser sessions, database connections, file system operations) require requests to be routed to the same instance that established the session
- Session resumption may not work reliably for stateful backends
The VirtualMCPServer CRD includes a sessionAffinity field that controls how
the Kubernetes Service routes repeated client connections. By default, it uses
ClientIP affinity, which routes connections from the same client IP to the
same pod:
spec:
  sessionAffinity: ClientIP # default
ClientIP affinity relies on the source IP reaching kube-proxy. When clients
sit behind a NAT gateway, corporate proxy, or cloud load balancer (common in
EKS, GKE, and AKS), all traffic appears to originate from the same IP —
routing every client to the same pod and eliminating the benefit of horizontal
scaling. This fails silently: the deployment appears healthy but only one pod
handles all load.
For stateless backends, set sessionAffinity: None so the Service
load-balances freely. For stateful backends where true per-session routing is
required, ClientIP affinity is a best-effort mechanism only. Prefer vertical
scaling or a dedicated vMCP instance per team instead.
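For example, a stateless deployment that disables affinity so the Service load-balances requests freely across all replicas:

```yaml
spec:
  replicas: 3
  sessionAffinity: None   # let the Service distribute connections across pods
  sessionStorage:
    provider: redis       # shared session storage is still required (see above)
    address: redis-master.toolhive-system.svc.cluster.local:6379
```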
Next steps
- Explore Kubernetes operator guides for managing MCP servers alongside vMCP
- Curate a server catalog for your team with the Registry Server