Updated April 2026 – Llama 4 · DeepSeek V3.2 · Qwen 3.5 · gpt-oss · RTX 5090

Stop Paying Per Token.
Own Your AI Infrastructure.

Calculate VRAM, compare GPU cloud providers, estimate break-even costs, and deploy with copy-paste commands.

RTX 5090 – 2.6x faster than A100 · RunPod: $1 = ~162,000 tokens (H100 SXM, Mar 2026) · Llama 4 Maverick: 402B params, 17B active · vLLM: 35x throughput vs llama.cpp · DeepSeek V3: $6M training cost vs GPT-4's $100M · Qwen 3.5: 397B MoE, native vision, 1M context
Interactive Guide

Find Your Perfect Setup

Answer 4 quick questions and we'll recommend the ideal model, GPU, quantization, and inference engine for your use case.

What's your primary use case?

This determines which model architecture fits best.

💻
Code Generation
Writing, reviewing, debugging code
💬
Chat / Assistant
Customer support, Q&A, general chat
🧠
Complex Reasoning
Math, analysis, multi-step logic
🌍
Multilingual / STEM
Non-English, scientific, research

How many concurrent users?

This determines your inference engine and GPU tier.

👤
Just Me
1-2 users, dev/testing
👥
Small Team
5-20 concurrent users
🏢
Production
50+ users, public API

What's your monthly budget?

This narrows down provider and hardware options.

💰
Under $100/mo
Consumer GPUs, spot instances
💰💰
$100–$500/mo
Dedicated mid-tier GPUs
💰💰💰
$500+/mo
Enterprise H100, multi-GPU

What matters most?

This fine-tunes your quantization and engine choice.

Output Quality
Best reasoning, minimal compression
Speed / Throughput
Fastest tokens/sec, low latency
🎯
Ease of Setup
Minimal config, get running fast

Your Recommended Setup

Step 1

Choose Your Model

Click any card to flip and see detailed specs. The 2026 landscape is dominated by MoE architectures: massive parameter counts with only a fraction active per token.

Step 2

Understand the Hardware

GPU selection is governed by VRAM capacity and memory bandwidth. The RTX 5090 outperforms the A100 by 2.6x at a fraction of the cost.

| GPU | Class | VRAM | Bandwidth | Cloud $/hr | $/1M Tok (7B) | Notes |
|---|---|---|---|---|---|---|
| RTX 5090 | Consumer | 32GB GDDR7 | 1,792 GB/s | $0.89 | $0.04 | 2.6x faster than A100. Best value 2026. |
| RTX 6000 Ada | Workstation | 48GB GDDR6 | 960 GB/s | $0.77 | – | Sweet spot for 70B 4-bit. |
| RTX PRO 6000 Blackwell Max-Q | Blackwell | 96GB GDDR7 ECC | TBD | ~$1.59 | – | 2x the VRAM of RTX 6000 Ada. Available as Hetzner dedicated server. |
| A100 SXM (80GB) | Enterprise | 80GB HBM2e | 1,935 GB/s | $1.39 | – | Workhorse. Fits 70B FP16. PCIe from $1.19/hr. |
| B200 (192GB) | Blackwell | 192GB HBM3e | 8,000 GB/s | $2.25–$4.95 | – | Next-gen Blackwell. 2x H100 memory. Prices dropping as supply scales. |
| H100 SXM (80GB) | Enterprise | 80GB HBM3 | 3,350 GB/s | $2.69 | $0.13 | Fastest. NVLink SXM for multi-GPU. |
| Apple M4 Max | Unified SoC | 128GB Unified | ~546 GB/s | N/A | N/A | Holds Llama 4 Scout. ~22 tok/s. |
Key insight: For models >70B, you need multi-GPU. Qwen 3 235B MoE requires 4x H100 SXM with NVLink (900 GB/s bidirectional).
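A quick way to sanity-check the bandwidth column above: single-stream decoding is usually memory-bound, because every generated token must stream all active weights from memory once. That gives a simple roofline ceiling of bandwidth divided by active-weight bytes. A minimal sketch (it ignores KV-cache reads, kernel overhead, and batching, so measured throughput lands below this ceiling):

```python
def max_decode_tok_s(bandwidth_gb_s, active_params_b, bytes_per_param):
    """Roofline ceiling for single-stream decode speed: each token
    requires reading every active weight from memory once."""
    model_gb = active_params_b * bytes_per_param  # params in billions -> GB
    return bandwidth_gb_s / model_gb

# RTX 5090 (1,792 GB/s) decoding a 7B dense model at FP16 (2 bytes/param):
print(max_decode_tok_s(1792, 7, 2))  # -> 128.0 tok/s ceiling
```

This is also why MoE wins on speed: Llama 4 Maverick reads only its ~17B active params per token, even though all 402B must be resident in VRAM.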
Step 3

Estimate Your VRAM Needs

Models must fit entirely in VRAM for fast inference. Adjust model size and quantization to see GPU recommendations.

Configure Model Parameters

7B · 120B · 400B · 670B · 1T

For MoE models, use active params for speed estimate; total params for VRAM sizing.

FP16 = best quality. 4-bit = least VRAM. Sub-4-bit risks severe quality loss.

Estimated VRAM Required

168 GB

Includes 20% overhead for context & KV cache.

Recommended GPU Setup

2x NVIDIA H100 (80GB) or 4x A100 (40GB)
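The estimate above can be reproduced with a simple rule of thumb: weights take params × bits-per-weight / 8 bytes, plus roughly 20% overhead for context, KV cache, and runtime buffers. A minimal sketch of that rule (the 70B FP16 case matches the 168 GB figure):

```python
def estimate_vram_gb(params_billion, bits_per_param, overhead=0.20):
    """Weight bytes (1e9 params * bits / 8 ~= GB) plus a flat
    overhead factor for KV cache, activations, and buffers."""
    weights_gb = params_billion * bits_per_param / 8
    return weights_gb * (1 + overhead)

print(round(estimate_vram_gb(70, 16)))   # 70B at FP16 -> 168
print(round(estimate_vram_gb(120, 16)))  # 120B at FP16 -> 288
print(round(estimate_vram_gb(70, 4)))    # 70B at 4-bit -> 42
```

For MoE models, plug in total parameters (every expert must be resident in VRAM) even though only the active params determine decode speed.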

Reference: Real-World Model VRAM

  • gpt-oss-20b (MoE, 3.6B active): ~48GB FP16 → 16GB @ 4-bit. Single RTX 4090 / laptop GPU.
  • gpt-oss-120b (MoE, 5.1B active): ~288GB FP16 → ~86GB @ 4-bit. Single H100/A100 80GB @ 4-bit.
  • Sarvam-30B (Dense, GQA): ~72GB FP16 → ~22GB @ 4-bit. Single A100 80GB or 2x A10G.
  • MiniMax M2.5 (230B, MoE): ~101GB @ 3-bit · ~243GB @ 8-bit. 1x 16GB GPU + 96GB system RAM.
Step 4

Calculate Your Break-Even Point

Adjust daily token usage and infrastructure costs to find where fixed server pricing beats per-token API fees.

Reference API prices (per 1M tokens): GPT-4o-mini ~$0.30 | Llama 70B API ~$0.50 | Claude 3.5 Sonnet ~$15

RTX 5090 RunPod ~$0.89/hr | A100 PCIe ~$1.19/hr | H100 SXM ~$2.69/hr

30-Day API Cost: $750
30-Day Self-Hosted Cost: $864
At this volume, API is cheaper.

Cumulative 30-Day Cost Projection

RunPod benchmark: $1 yields ~162,000 tokens (H100 SXM @ $2.69/hr) vs Azure (67,559), GCP (42,637), or AWS (38,370). Recalculated from original 175,301 figure based on prior $2.49/hr pricing. Throughput varies by model and configuration. Prices verified March 2026.
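The break-even math reduces to two lines: API spend scales linearly with token volume, while a rented GPU running 24/7 is a flat monthly rate. A sketch with assumed example inputs (50M tokens/day at a ~$0.50/1M API rate vs. a ~$1.20/hr GPU), which reproduces the $750 vs. $864 comparison above:

```python
def monthly_api_cost(tokens_per_day, usd_per_million_tokens, days=30):
    """API spend grows linearly with volume."""
    return tokens_per_day / 1e6 * usd_per_million_tokens * days

def monthly_selfhost_cost(usd_per_hour, days=30):
    """A 24/7 on-demand GPU is a flat rate, independent of volume."""
    return usd_per_hour * 24 * days

api = monthly_api_cost(50e6, 0.50)    # 50M tok/day at $0.50/1M -> $750
server = monthly_selfhost_cost(1.20)  # ~$1.20/hr GPU -> ~$864
print("API is cheaper" if api < server else "Self-hosting is cheaper")

# Tokens-per-dollar for a hosted GPU: throughput * 3600 / hourly rate.
# Assuming ~121 tok/s sustained on an H100 SXM at $2.69/hr:
tokens_per_dollar = 121 * 3600 / 2.69  # ~162,000, matching the benchmark above
```

Push `tokens_per_day` up and the API line crosses the flat self-hosted line; at the assumed rates that happens around 58M tokens/day.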
Step 5

Pick Your GPU Cloud Provider

Filter by enterprise reliability vs. budget-friendly options.

Verified April 2026
| Provider | Type | Flagship GPU | Est. Rate | Key Advantage | Get Started |
|---|---|---|---|---|---|
Pricing reflects estimated on-demand pricing, April 2026. Links may contain affiliate referrals.
Step 6

Choose Your Quantization Format

Your format choice dictates which inference engine you can use.

GGUF

MOST POPULAR

Hybrid CPU/GPU inference. Splits model layers between VRAM and system RAM: the only format that lets you run a model larger than your GPU's VRAM without crashing.

  • Engines: llama.cpp, Ollama
  • Hardware: CPU / GPU hybrid
  • Trade-off: CPU offload adds latency
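The VRAM/RAM split can be reasoned about per layer: divide the GGUF file size by the layer count and see how many layers fit on the GPU. A rough sketch with illustrative numbers (a ~40GB Q4 70B model with 80 layers); the result is what llama.cpp takes as its `--n-gpu-layers` / `-ngl` value:

```python
import math

def gpu_layers(model_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Number of model layers that fit in VRAM (remaining layers are
    served from system RAM); reserve_gb is scratch for context/buffers."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, math.floor((vram_gb - reserve_gb) / per_layer_gb))

print(gpu_layers(40, 80, 24))  # 24GB GPU -> 44 layers on GPU, 36 on CPU
```

The more layers land on the GPU, the lower the latency penalty from CPU offload.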

EXL2

BEST QUALITY

Variable bits-per-weight per layer. Allocates more bits to sensitive attention layers, fewer to tolerant feed-forward layers.

  • Engines: ExLlamaV2
  • Hardware: Pure GPU only
  • Trade-off: Incompatible with vLLM

AWQ

ENTERPRISE

Activation-aware weight quantization. The standard for production vLLM deployments.

  • Engines: vLLM, HF TGI
  • Hardware: Pure GPU only
  • Trade-off: Slightly below EXL2 quality

GPTQ

LEGACY

Post-training uniform quantization. Widely supported but largely superseded by AWQ in new deployments.

  • Engines: vLLM, TGI, AutoGPTQ
  • Hardware: Pure GPU only
  • Trade-off: AWQ beats it on quality

FP8

HOPPER / BLACKWELL

Native 8-bit floating point precision on H100, H200, B200, and RTX 5090. Near-FP16 quality with ~50% memory reduction; unlike AWQ/GPTQ, it requires no calibration dataset.

  • Engines: vLLM, TensorRT-LLM, SGLang
  • Hardware: H100 / H200 / B200 / RTX 5090
  • Memory: ~50% vs FP16, near-lossless
  • Calibration: Not required
  • Trade-off: No support on A100, RTX 4090
Warning: Sub-4-bit (IQ1) causes severe degradation. Only extremely robust models survive extreme compression. FP8 requires Hopper (H100/H200) or Blackwell (B200/RTX 5090); it falls back to FP16 on older hardware.
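To compare footprints across formats, multiply parameter count by each format's effective bits per weight. The bpw figures below are typical approximations, not exact (per-group scales push GGUF Q4_K_M to roughly ~4.85 bpw, and AWQ/GPTQ 4-bit slightly above 4):

```python
FORMAT_BPW = {  # approximate effective bits per weight, format-dependent
    "FP16": 16.0,
    "FP8": 8.0,
    "GGUF Q4_K_M": 4.85,
    "EXL2 4.5bpw": 4.5,
    "AWQ/GPTQ 4-bit": 4.25,
}

def weights_gb(params_billion, bits_per_weight):
    """Weight footprint only; KV cache and overhead come on top."""
    return params_billion * bits_per_weight / 8

for name, bpw in FORMAT_BPW.items():  # 70B dense model as the example
    print(f"{name:>15}: {weights_gb(70, bpw):6.1f} GB")
```

This is the gap the format choice is really about: the same 70B model spans roughly 140 GB at FP16 down to under 40 GB at ~4.25 bpw.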
Step 7

Choose Your Inference Engine

vLLM for production, Ollama for simplicity, TGI for HuggingFace, llama.cpp for edge.

vLLM โ€” Production Throughput King

PagedAttention for KV cache. 35x throughput vs llama.cpp. Drop-in OpenAI API.

  • Concurrency: 64+ users
  • Formats: AWQ, GPTQ
  • Setup: Medium (Docker)
  • FP8/NVFP4: Native
  • API: OpenAI compatible
  • Best for: Production APIs
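PagedAttention matters because at high concurrency the KV cache, not the weights, dominates VRAM: a naive server pre-allocates full-context cache per user, while vLLM commits cache pages only as sequences actually grow. A sketch of the worst-case cache size (dimensions below are Llama 3.1 70B's published GQA config; FP16 cache assumed):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Per token, the cache stores a K and a V vector for every layer:
    2 * n_layers * n_kv_heads * head_dim elements."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * seq_len * batch / 1e9

# Llama 3.1 70B (80 layers, 8 KV heads via GQA, head_dim 128),
# 8,192-token context, 64 concurrent sequences:
print(round(kv_cache_gb(80, 8, 128, 8192, 64), 1))  # -> 171.8 (GB)
```

Pre-allocating that worst case for 64 users is impossible even on 2x H100 (160GB total); paged allocation is how vLLM fits 64+ concurrent users in practice.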
Step 8

Deploy: Copy-Paste Commands

Production-ready deployment commands for the three most common setups.

Option A: Ollama (2 minutes)

Single-user, dev/testing

BEGINNER
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Llama 3.1 70B (4-bit, ~40GB VRAM)
ollama run llama3.1:70b

# Or 8B for smaller hardware (~5GB VRAM)
ollama run llama3.1:8b

# OpenAI-compatible API (localhost:11434)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:70b","messages":[{"role":"user","content":"Hello!"}]}'

# Serve on all interfaces (VPS):
OLLAMA_HOST=0.0.0.0 ollama serve

Option B: vLLM + Docker

High concurrency, production APIs

PRODUCTION
# vLLM with Docker (requires NVIDIA Container Toolkit)
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --quantization awq \
  --max-model-len 8192 \
  --tensor-parallel-size 2

# Test endpoint:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-70B-Instruct","messages":[{"role":"user","content":"Hello!"}],"max_tokens":256}'

Option C: RunPod Serverless

Variable traffic, zero idle cost

SERVERLESS
# 1. Create account at runpod.io/?ref=852l1ola
# 2. Serverless > New Endpoint > vLLM Worker
# 3. GPU: RTX 4090 (7B) or A100 (70B)
# 4. Min Workers: 0, Max Workers: 10

# Call your endpoint:
curl -X POST "https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/runsync" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input":{"prompt":"Hello!","max_tokens":512,"temperature":0.7}}'

Quick Reference: Which Setup?

| Scenario | Engine | Quant | GPU |
|---|---|---|---|
| Personal dev | Ollama | GGUF Q4_K_M | RTX 4090 / M4 Max |
| Small team (5-10) | vLLM | AWQ 4-bit | 1x A100 80GB |
| Production (100+) | vLLM | AWQ 4-bit | 2x H100 SXM |
| Variable traffic | RunPod Serverless | AWQ 4-bit | Auto-scaled |
| Max quality | ExLlamaV2 | EXL2 4.5bpw | RTX 6000 Ada |