On-prem LLM hardware sizing for healthcare is the engineering process of selecting GPU infrastructure, networking, storage, and supporting hardware to run open-source language models at the institution’s required model size, concurrency, and latency targets — under the institution’s data-residency policy. The 2026 production hardware landscape spans NVIDIA H100 (the high-capability default for clinical AI), A100 80GB (the cost-effective alternative for many production deployments), L40S (mid-tier deployments), L4 (efficiency-focused workloads), and consumer-grade RTX 4090/5090 (development environments and small production workloads). The sizing math depends on model parameters, precision and quantization (FP16/BF16, INT8, INT4/AWQ), inference batch size, and concurrent request volume. Hardware costs run from $30,000 for a single-GPU server suitable for small-scale Llama 3 8B deployments to $400,000+ for multi-server H100 clusters sized for enterprise multi-thousand-clinician health systems. The hardware refresh cycle is 3–4 years for GPU infrastructure; total cost of ownership including refresh, power, cooling, and operations is typically 1.5–2.5× the initial hardware investment over a 3-year horizon.
Hardware selection is where most on-prem LLM engagements get into trouble. Under-sizing produces inadequate performance and operational frustration; over-sizing wastes capital. The right answer is institution-specific and depends on model selection, use case mix, concurrency patterns, and operational maturity.
This guide is the hardware sizing reference Taction Software® uses on on-prem LLM engagements with hospitals.
The GPU Landscape in 2026
Six GPU classes cover most production healthcare AI deployments.
NVIDIA H100 (80GB SXM and PCIe)
The high-capability default for clinical AI in 2026. H100 SXM modules in 8-GPU servers (HGX H100) are the production-grade hardware for enterprise-scale deployments; PCIe variants serve smaller deployments.
Capability profile. Among the highest single-GPU memory bandwidth in this lineup (exceeded only by H200). Strong performance on the largest open-source models (Llama 3 70B, Mixtral 8x22B). Best support for cutting-edge inference optimizations (FP8 inference, dynamic shapes).
Pricing context. H100 SXM modules currently $25,000–$30,000+ each at list pricing; PCIe variants somewhat lower. An 8-GPU HGX H100 server lands at $250,000–$350,000 depending on configuration.
When to choose. Enterprise deployments with capacity for the full server. Multi-thousand-clinician health systems. Use cases requiring maximum capability or maximum concurrency.
NVIDIA H200 (where available)
An updated H100 with larger memory (141GB of HBM3e) and higher memory bandwidth. Useful for very-large-model deployments without quantization.
When to choose. Where the budget supports it and the use case benefits from larger memory (long-context inference, very-large mixture-of-experts models served in full precision).
NVIDIA A100 80GB
The cost-effective alternative for many production deployments. Substantially less expensive than H100 with strong real-world performance for most clinical use cases.
Capability profile. Strong performance on Llama 3 70B (with appropriate quantization), Mixtral 8x7B, and the smaller open-source models. Wide support across the open-source inference ecosystem.
Pricing context. A100 80GB at $15,000–$20,000 each. 8-GPU servers $150,000–$220,000.
When to choose. Mid-sized hospitals, single-use-case deployments, and budget-constrained engagements where H100 is over-specified.
NVIDIA L40S
Mid-tier GPU for inference workloads. 48GB memory; strong inference performance per dollar; lower power consumption than A100/H100.
Capability profile. Good for 8B–14B-class models (Llama 3 8B, Phi-3 medium) at full precision; supports Llama 3 70B with aggressive quantization. Suitable for many clinical use cases that don’t require frontier-tier performance.
Pricing context. $10,000–$13,000 per GPU. Multi-GPU servers $80,000–$140,000.
When to choose. Small-to-mid hospitals, narrow use-case deployments, multiple-GPU consolidation patterns where L40S beats fewer-but-larger GPUs on cost-per-throughput.
NVIDIA L4
Lower-tier GPU for efficiency-focused workloads. 24GB memory.
Capability profile. Fits Llama 3 8B at full precision; supports up to 14B with quantization. Suitable for narrower use cases (intent classification, simple structured generation, routing).
Pricing context. $3,000–$5,000 per GPU.
When to choose. Edge deployments, narrow tasks, development environments.
Consumer GPU (RTX 4090, RTX 5090)
Consumer-grade hardware with 24GB (RTX 4090) or 32GB (RTX 5090) of memory. Strong cost-per-performance for development and small-scale production.
When to choose. Development environments. Small clinic deployments where the use case fits in the card’s memory footprint. Note: NVIDIA’s consumer driver license restricts data-center deployment; verify license terms before any production use.
Sizing Math: Memory and Concurrency
The math that drives hardware sizing.
Memory Requirements by Model and Quantization
Model memory footprint approximations (weights only):
| Model | FP16/BF16 | INT8 | INT4/AWQ |
| --- | --- | --- | --- |
| Llama 3 8B | ~16 GB | ~8 GB | ~4 GB |
| Llama 3 70B | ~140 GB | ~70 GB | ~40 GB |
| Mistral 7B | ~14 GB | ~7 GB | ~4 GB |
| Mixtral 8x7B | ~90 GB | ~50 GB | ~25 GB |
| Mixtral 8x22B | ~280 GB | ~150 GB | ~80 GB |
| Phi-3 medium 14B | ~28 GB | ~14 GB | ~8 GB |
Plus the KV cache for active inference (typically 5–20% of model size depending on context length and batch size). Plus the inference engine’s overhead (typically 5–10%).
A practical rule: total GPU memory required is roughly 1.3× the quantized model size for moderate concurrency, 1.5× for high concurrency.
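A minimal sketch of this arithmetic in Python, using the bytes-per-parameter approximations behind the table and the 1.3×/1.5× rule above. The raw parameter math runs slightly under the rounded table figures, and real procurement decisions should validate against the actual inference engine:

```python
# Sizing sketch for the rule of thumb above. Bytes-per-parameter and
# headroom factors mirror this section's approximations; validate against
# the actual inference engine before procurement.

BYTES_PER_PARAM = {
    "fp16": 2.0,   # also BF16
    "int8": 1.0,
    "int4": 0.5,   # also 4-bit AWQ
}

def model_memory_gb(params_billions: float, quant: str) -> float:
    """Approximate weight footprint in GB for a dense model."""
    return params_billions * BYTES_PER_PARAM[quant]

def required_gpu_memory_gb(params_billions: float, quant: str,
                           high_concurrency: bool = False) -> float:
    """Weights plus KV-cache and engine headroom, per the 1.3x/1.5x rule."""
    factor = 1.5 if high_concurrency else 1.3
    return model_memory_gb(params_billions, quant) * factor

# Llama 3 70B at INT8 under high concurrency: ~70 GB x 1.5 = ~105 GB,
# i.e. two 80GB GPUs rather than one.
print(required_gpu_memory_gb(70, "int8", high_concurrency=True))
```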
Concurrency Math
Throughput depends on:
- Model size and quantization — smaller, more quantized models serve more concurrent requests.
- Batch size — vLLM’s continuous batching delivers 5–10× the throughput of naive sequential serving.
- Context length — longer contexts consume more memory per request.
- Inference engine — vLLM is the production default and delivers substantially better throughput than naive serving stacks; a configuration sketch follows this list.
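For illustration, a hedged vLLM sketch showing where these levers appear in configuration. The model ID, quantization choice, and parallelism degree are assumptions (in practice the quantization flag must match an actually-quantized checkpoint, e.g. an AWQ variant), and flag names can shift between vLLM releases:

```python
# Hedged vLLM serving sketch. Model ID, quantization choice, and
# parallelism degree are illustrative assumptions; the quantization flag
# must match an actually-quantized checkpoint in practice.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model ID
    quantization="awq",            # 4-bit; point at an AWQ checkpoint
    tensor_parallel_size=2,        # shard weights across 2 GPUs
    max_model_len=4096,            # context cap bounds per-request KV cache
    gpu_memory_utilization=0.90,   # fraction of VRAM vLLM may claim
)

# Continuous batching happens inside generate(): requests are batched
# dynamically, which is where the 5-10x throughput gain comes from.
params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Summarize the discharge note: ..."], params)
print(outputs[0].outputs[0].text)
```

The same levers are exposed as CLI flags when running vLLM’s OpenAI-compatible server, which is the deployment shape most production stacks use.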
Rough capacity benchmarks (vLLM, INT8 quantization, modest context lengths).
- Single H100 80GB serving Llama 3 70B INT8: ~50–150 concurrent requests, ~10–30 tokens/sec output throughput per request.
- Single A100 80GB serving Llama 3 70B INT8: ~30–100 concurrent requests with similar throughput.
- 4× A100 40GB cluster serving Llama 3 70B FP16: ~150–400 concurrent requests with strong throughput.
- 8× H100 80GB cluster serving Llama 3 70B FP16: ~500–1,500 concurrent requests, sustained.
These numbers vary substantially with use case (long context vs. short, structured output vs. open generation, high temperature vs. greedy). Production-grade sizing always tests against the actual use case workload, not vendor benchmarks.
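A minimal shape for that workload test, assuming the model sits behind vLLM’s OpenAI-compatible endpoint; the URL, model name, payload, and concurrency steps are placeholders to replace with real use-case traffic:

```python
# Minimal concurrency probe against an OpenAI-compatible endpoint
# (vLLM exposes one). URL, model name, and payload are placeholders;
# replace with the deployment's actual use-case traffic.
import asyncio
import time
import aiohttp

URL = "http://localhost:8000/v1/completions"   # assumed vLLM endpoint
PAYLOAD = {"model": "llama-3-70b-int8", "prompt": "...", "max_tokens": 256}

async def one_request(session: aiohttp.ClientSession) -> float:
    start = time.monotonic()
    async with session.post(URL, json=PAYLOAD) as resp:
        await resp.json()
    return time.monotonic() - start

async def probe(concurrency: int) -> None:
    async with aiohttp.ClientSession() as session:
        latencies = await asyncio.gather(
            *(one_request(session) for _ in range(concurrency)))
    latencies = sorted(latencies)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    print(f"{concurrency} concurrent: p95 latency {p95:.1f}s")

# Step the offered load upward until p95 latency breaks the target SLO;
# the last passing step is the box's real capacity for this workload.
for n in (10, 25, 50, 100):
    asyncio.run(probe(n))
```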
Total Cost of Ownership Framework
Hardware acquisition is only one component of TCO. The full picture:
Initial Capital
GPU servers, networking switches, storage, racks, UPS. Initial capital for production deployments typically lands at:
- Small-scale single-server deployment: $30,000–$80,000
- Mid-scale multi-GPU server: $150,000–$300,000
- Enterprise multi-server cluster: $400,000–$1,500,000+
Power and Cooling
GPU servers consume substantial power (8-GPU servers typically 6–10kW under load) and produce substantial heat. Annual power and cooling costs:
- Small deployment (single server, ~3kW): $5,000–$8,000/year
- Mid-scale (single 8-GPU server, ~8kW): $15,000–$25,000/year
- Enterprise cluster (multiple servers, 30–60kW): $60,000–$150,000/year
Operations and Maintenance
GPU infrastructure requires ongoing operational attention — firmware updates, driver management, monitoring, on-call coverage, hardware-failure response. Annual operations cost:
- Small deployment: $20,000–$40,000/year (mostly absorbed in existing IT operations)
- Mid-scale: $50,000–$100,000/year
- Enterprise cluster: $150,000–$400,000/year (typically dedicated MLOps engineer plus shared IT operations)
Refresh Cycle
GPU infrastructure refreshes every 3–4 years for production deployments. Earlier refreshes (2–3 years) for institutions running cutting-edge models. The refresh is part of ongoing operating expense, not a one-time event.
3-Year TCO Examples
For a mid-scale deployment (single 8-GPU H100 server, 24/7 production operations):
- Initial hardware: $300,000
- Power and cooling (3 years): $60,000
- Operations and maintenance (3 years): $200,000
- 3-year TCO: ~$560,000
- Per-month run-rate: ~$15,500
For an enterprise cluster (4× 8-GPU H100 servers):
- Initial hardware: $1,200,000
- Power and cooling (3 years): $300,000
- Operations and maintenance (3 years): $750,000
- 3-year TCO: ~$2,250,000
- Per-month run-rate: ~$62,500
The TCO comparison favors on-prem when use-case volume is high. At low volume, cloud-hosted inference (where it’s permitted by policy) typically beats on-prem TCO.
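The worked examples reduce to simple arithmetic; a sketch, with all cost inputs taken from the illustrative figures above rather than from vendor quotes:

```python
# 3-year TCO roll-up matching the worked examples above. All inputs are
# this section's illustrative figures, not vendor quotes.

def three_year_tco(hardware: float, power_cooling_per_year: float,
                   ops_per_year: float) -> tuple[float, float]:
    """Return (3-year TCO, per-month run-rate)."""
    tco = hardware + 3 * (power_cooling_per_year + ops_per_year)
    return tco, tco / 36

# Mid-scale: single 8-GPU H100 server, 24/7 production.
print(three_year_tco(300_000, 20_000, 66_667))      # ~(560_000, 15_500)

# Enterprise: 4x 8-GPU H100 servers.
print(three_year_tco(1_200_000, 100_000, 250_000))  # ~(2_250_000, 62_500)
```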
Common Hardware-Selection Mistakes
Five patterns that produce bad hardware decisions.
Mistake 1 — Sizing for Peak Without Considering Average
A team sizes for the peak hour of the busiest day. The hardware sits at 20% utilization most of the time. Resolution: size for sustained 70–80% utilization at peak; cloud-burst or queue-shed at extreme peaks.
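One way to make the utilization target concrete, with a hypothetical workload; the per-GPU capacity figure should come from the use-case load test above, not a vendor benchmark:

```python
# Size for sustained utilization at peak, not for the absolute worst hour.
# Per-GPU capacity should come from your own workload test.
import math

def gpus_needed(peak_concurrent_requests: int,
                requests_per_gpu: int,
                target_utilization: float = 0.75) -> int:
    """GPUs required so the fleet runs at ~target utilization at peak."""
    return math.ceil(peak_concurrent_requests /
                     (requests_per_gpu * target_utilization))

# Placeholder workload: 120 concurrent requests at peak, 50 per GPU
# measured. Spikes above this are queued or cloud-burst, not bought for.
print(gpus_needed(120, 50))   # -> 4
```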
Mistake 2 — Choosing FP16 When INT8 Would Suffice
A team buys 4× H100 to run Llama 3 70B in FP16 when INT8 quantization on 2× H100 would have served the use case acceptably. Resolution: validate use-case quality against quantized variants before committing to higher-precision hardware.
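A minimal sketch of that validation step, assuming two OpenAI-compatible endpoints (one serving full precision, one serving the quantized variant); the URLs, model name, and prompts are placeholders, and real validation runs the clinical eval set with clinician review:

```python
# Side-by-side check of a quantized variant against full precision.
# Assumes two OpenAI-compatible endpoints (vLLM exposes one); URLs, model
# name, and prompts are placeholders. Real validation uses the clinical
# eval set, not a handful of samples.
import csv
import requests

ENDPOINTS = {
    "fp16": "http://fp16-host:8000/v1/completions",
    "int8": "http://int8-host:8000/v1/completions",
}
PROMPTS = ["Summarize: ...", "Extract medications: ..."]  # use-case samples

def complete(url: str, prompt: str) -> str:
    resp = requests.post(url, json={
        "model": "llama-3-70b", "prompt": prompt,
        "max_tokens": 256, "temperature": 0.0,   # greedy, so outputs compare
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

with open("quant_comparison.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt", "fp16", "int8"])
    for p in PROMPTS:
        writer.writerow([p] + [complete(u, p) for u in ENDPOINTS.values()])
# Review the CSV against the use case's quality bar before committing
# to higher-precision hardware.
```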
Mistake 3 — Ignoring Power and Cooling Requirements
A team procures GPU servers without confirming the data center has the power capacity, cooling capacity, or rack space to host them. The hardware arrives; the deployment stalls while facilities work catches up. Resolution: make the facilities review part of the discovery phase.
Mistake 4 — Single-Vendor Lock-In
A team builds the architecture tightly around a specific GPU vendor’s tooling. The next refresh is constrained to the same vendor regardless of pricing or capability. Resolution: keep the inference engine and orchestration layer vendor-agnostic where possible.
Mistake 5 — Underbudgeting Operations
A team budgets for hardware acquisition without budgeting for the MLOps team that operates it. The deployed system has nobody monitoring, patching, or responding to incidents. Quality degrades over months. Resolution: operations, power, and cooling together approach half of the 3-year TCO; budget accordingly.
Pricing and Engagement Structure
| Engagement | Duration | Price Range | Scope |
| --- | --- | --- | --- |
| Hardware Sizing Discovery | 2–4 weeks | $30,000–$45,000 | Use case analysis, capacity planning, hardware specification, vendor RFQ support, facilities-readiness assessment |
| Hardware Procurement Support | Variable | Time-based | RFP development, vendor negotiation, configuration validation, deployment planning |
| Deployment and MLOps Setup | 6–10 weeks | $80,000–$130,000 | Hardware racking and provisioning, inference stack deployment, monitoring infrastructure, MLOps tooling, operational runbook |
| Hardware (separate) | — | $30,000–$1,500,000+ | Per the sizing analysis output |
Total on-prem hardware engagement (engineering + hardware) typically runs $200,000–$1,500,000+ depending on scale.
Closing
On-prem LLM hardware in 2026 has multiple right answers depending on use-case scale, data-control posture, and operational maturity. The GPU landscape is mature; the sizing math is well understood; the TCO is calculable in advance. Buyers who scope hardware against actual use-case workloads, not vendor benchmarks, get deployments that operate at the capacity they planned for.
If you are scoping on-prem LLM hardware for your hospital, book a 60-minute scoping call. Taction Software has shipped 785+ healthcare implementations since 2013, with 200+ EHR integrations across Epic, Cerner-Oracle, Athena, and Allscripts, zero HIPAA findings on shipped software, and active BAA paper trails with every major AI provider. Our healthcare engineering team handles hardware sizing, MLOps deployment, and inference stack setup as default scope on on-prem engagements. Our verified case studies cover the production deployments behind these patterns. For the engineering scope behind the engagement, see our healthcare software development practice and our hospital and health-system practice for the operational context. For the data integration patterns this work depends on, see our healthcare data integration practice. For an estimate against your specific use case, see the healthcare engineering cost calculator. For deeper context, see our broader generative AI healthcare applications work.
