The shared infrastructure problem
AI hardware is expensive. Organizations running local inference have a strong incentive to consolidate workloads onto shared GPU infrastructure rather than dedicating hardware to each use case. A single server with multiple high-end GPUs can serve several teams, projects, or classification levels.
The efficiency argument is real. So is the governance problem — because shared infrastructure forces questions that don't arise in single-tenant deployments:
Who authorized this workload to run on this hardware? Which model version was active when this output was produced? Could one tenant's workload have affected another tenant's results? Were the right access controls in place at the time of execution?
In cloud environments, the provider's multi-tenancy architecture answers these — hypervisor isolation, network segmentation, IAM policies. In local deployments, your team is the provider, and the answers have to be designed.
What "isolation" means in practice
Multi-tenant isolation for AI workloads operates at several levels:
At the compute level: are workloads sharing GPU memory? GPU scheduling? CPU cores? Full hardware isolation (dedicated GPUs per tenant) is the simplest model to reason about but the least efficient. Time-slicing or MPS (Multi-Process Service) sharing is more efficient but requires trust in the isolation provided by the GPU driver and scheduler. On GPUs that support it, MIG (Multi-Instance GPU) partitioning sits between the two: hardware-enforced memory and compute isolation within a single physical device.
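For the dedicated-GPUs-per-tenant model, the allocation can be enforced at worker launch. A minimal sketch, assuming one inference worker process per tenant; the tenant names, GPU indices, and function names are illustrative:

```python
import os
import subprocess

# Hypothetical tenant-to-GPU allocation for full hardware isolation:
# each tenant's workers only ever see their dedicated devices.
TENANT_GPUS = {
    "tenant-a": "0,1",
    "tenant-b": "2,3",
}

def worker_env(tenant: str) -> dict:
    """Build the environment for a tenant's inference worker.

    CUDA_VISIBLE_DEVICES limits which devices the CUDA runtime
    enumerates in the child process. Note this is an allocation
    control, not a security boundary on its own; the driver still
    mediates all device access.
    """
    if tenant not in TENANT_GPUS:
        raise PermissionError(f"no GPU allocation defined for {tenant!r}")
    return dict(os.environ, CUDA_VISIBLE_DEVICES=TENANT_GPUS[tenant])

def launch_worker(tenant: str, command: list) -> subprocess.Popen:
    """Start an inference worker pinned to its tenant's GPUs."""
    return subprocess.Popen(command, env=worker_env(tenant))
```

Keeping the allocation in one declarative map also gives you something reviewable: the tenant boundary is a file under version control, not tribal knowledge.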
At the model level: if multiple tenants use different fine-tuned models or adapters on shared base weights, the deployment system needs to track which adapter was active for each inference request. A request processed with the wrong adapter is a correctness failure — and potentially a data-handling violation if the adapters were trained on different data with different access controls.
At the data level: prompts, outputs, and intermediate state (KV caches, activations) from one tenant's workload must not be accessible to another. In GPU memory, this requires either explicit clearing between workloads or architectural separation that prevents cross-tenant memory access.
At the audit level: each tenant's operational records must be attributable. If the logging system captures inference events without clear tenant attribution, the records are usable for aggregate monitoring but not for tenant-specific audit or incident response.
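Concretely, tenant-attributable records can be as small as a few fields per inference event. A minimal sketch; the field names are illustrative assumptions, not a standard schema:

```python
import json
from dataclasses import dataclass, asdict

# Illustrative per-request audit record for a multi-tenant deployment.
@dataclass(frozen=True)
class InferenceAuditRecord:
    tenant_id: str        # which tenant boundary this request ran in
    request_id: str       # unique id for incident-response lookup
    base_model: str       # model name plus a content hash, ideally
    adapter_version: str  # exact adapter active for this request
    gpu_id: str           # hardware identifier
    timestamp: str        # UTC, ISO 8601

def log_line(record: InferenceAuditRecord) -> str:
    """Serialize one attribution record as a JSON log line."""
    return json.dumps(asdict(record), sort_keys=True)
```

One JSON line per request is enough to support both aggregate monitoring and tenant-specific audit, because every record carries its own attribution.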
The adapter management challenge
Fine-tuned adapters (LoRA weights, prompt tuning vectors, specialized configurations) are the practical unit of model customization in many deployments. In a multi-tenant environment, adapter management becomes a governance surface:
Each inference request should be traceable to a specific adapter version approved for use in that tenant's context. "The model" isn't sufficient identification when the base model can be combined with different adapters that produce materially different behavior.
Adapters also change over time as teams iterate on fine-tuning. The system needs to track which adapter version was active at any point, not just which one is current. When a compliance reviewer asks "what model produced this output last quarter?" — and they will — the answer requires historical adapter records.
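Answering that question means keeping deployment history, not just current state. A minimal sketch of a point-in-time lookup over an adapter's activation history, assuming the history is an append-only, time-sorted list; the adapter names and dates are hypothetical:

```python
import bisect
from datetime import datetime

# Hypothetical deployment history for one tenant's adapter slot:
# (activation time, adapter version), sorted by activation time.
HISTORY = [
    (datetime(2024, 1, 10), "summarizer-lora@v3"),
    (datetime(2024, 4, 2),  "summarizer-lora@v4"),
    (datetime(2024, 7, 15), "summarizer-lora@v5"),
]

def adapter_active_at(when: datetime) -> str:
    """Return the adapter version that was live at a past timestamp."""
    times = [t for t, _ in HISTORY]
    # Index of the last deployment at or before `when`.
    i = bisect.bisect_right(times, when) - 1
    if i < 0:
        raise LookupError(f"no adapter deployed before {when}")
    return HISTORY[i][1]
```

With records like these, "what model produced this output last quarter?" becomes a lookup rather than an archaeology project.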
In governed environments, deploying a new adapter is a change that should go through your change-control process. If adapters can be loaded or swapped without review, you've got a gap in your change-control framework.
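One way to close that gap is to make the loading path itself check the change-control record. A minimal sketch, where the approval registry and function names are illustrative, not a real API:

```python
# Approvals recorded by the change-control process, e.g. keyed by
# (tenant, adapter version). In practice this would be backed by a
# ticketing or CMDB system, not an in-memory set.
APPROVED_ADAPTERS = {
    ("tenant-a", "summarizer-lora@v4"),
}

def load_adapter(tenant: str, adapter_version: str) -> str:
    """Refuse to activate an adapter that hasn't cleared review."""
    if (tenant, adapter_version) not in APPROVED_ADAPTERS:
        raise PermissionError(
            f"{adapter_version} is not approved for {tenant}; "
            "file a change request before deployment"
        )
    # ...actual weight loading would happen here...
    return f"loaded {adapter_version} for {tenant}"
```

The point is that approval is enforced at the only place an adapter can become active, so an unreviewed swap fails loudly instead of silently succeeding.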
What shared infrastructure needs
For multi-tenant AI infrastructure in controlled environments, the minimum governance requirements are:
Explicit tenant boundaries. Each workload runs within a defined boundary — hardware allocation, permitted models, permitted adapters, data-handling rules. These boundaries should be documented and enforceable, not informal agreements.
Request-level attribution. Every inference request is logged with tenant identity, model version, adapter version, and hardware identifier. This is the foundation for any downstream audit or incident investigation.
Change control for deployment state. Loading a new model, swapping an adapter, or changing an inference configuration is a deployment change. In CMMC-scoped environments, deployment changes to systems handling CUI must follow documented change-management procedures.
Reviewable separation evidence. When an auditor asks "how do you ensure tenant A's data can't be accessed by tenant B?" — the answer needs to be concrete. Not "we trust the GPU driver" but a documented isolation model with defined boundaries, enforcement mechanisms, and verification procedures.
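The requirements above compose naturally: a tenant boundary that is declared as data can be enforced at request time. A minimal sketch under the assumption that every request is checked before it reaches the model; the field names are illustrative:

```python
from dataclasses import dataclass

# An explicit, machine-enforceable tenant boundary: hardware
# allocation plus permitted models and adapters, declared as data.
@dataclass(frozen=True)
class TenantBoundary:
    tenant_id: str
    gpu_ids: frozenset
    permitted_models: frozenset
    permitted_adapters: frozenset

def authorize(boundary: TenantBoundary, model: str, adapter: str, gpu: str) -> None:
    """Reject any request that would cross the documented boundary."""
    if model not in boundary.permitted_models:
        raise PermissionError(f"model {model!r} not permitted for {boundary.tenant_id}")
    if adapter not in boundary.permitted_adapters:
        raise PermissionError(f"adapter {adapter!r} not permitted for {boundary.tenant_id}")
    if gpu not in boundary.gpu_ids:
        raise PermissionError(f"GPU {gpu!r} outside {boundary.tenant_id}'s allocation")
```

A check like this, run on every request and logged alongside the attribution record, is the kind of concrete, reviewable enforcement an auditor can actually inspect.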
The alternative: dedicated infrastructure
For some programs, the governance overhead of multi-tenant AI infrastructure exceeds the hardware cost savings. Dedicated hardware per classification level, per program, or per tenant eliminates the isolation questions entirely at the cost of lower utilization.
That's a legitimate architectural choice, especially where data-handling requirements are strict enough that any shared-infrastructure risk is unacceptable. The decision should be made explicitly based on the program's threat model and compliance obligations — not defaulted into because multi-tenancy seemed too complex to govern.
Either approach works. Shared infrastructure with single-tenant assumptions doesn't.