Deploying Large Language Models in HPC Environments
Motivation
- LLMs demand large-scale compute and memory resources.
- HPC clusters offer GPUs, interconnects, and schedulers for scalability.
- Efficient deployment requires managing model parallelism, batching, and containers.
Preliminaries
Offline vs. Online Inference
- Online: Process requests interactively via a client-server interface.
- Offline batching: Process large sets of prompts together for maximum throughput.
- Offline batching is ideal for HPC workloads with scheduled jobs or dataset-scale inference.
Offline Batching Example (vLLM)
from vllm import LLM, SamplingParams

# Read one prompt per line and generate deterministically (temperature 0)
prompts = open("dataset.txt").read().splitlines()
params = SamplingParams(temperature=0.0)

# Load the model and run batched generation over all prompts at once
llm = LLM(model="meta-llama/Llama-2-7b-hf")
outputs = llm.generate(prompts, sampling_params=params)
for o in outputs:
    print(o.outputs[0].text)
Online: Client Example
import openai

# OpenAI-compatible client pointed at a locally served model
# (the dummy api_key is required by the client but ignored by most local servers)
client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="EMPTY",
)
response = client.chat.completions.create(
    model="deepseek-r1",
    messages=[
        {"role": "system", "content": "Talk like a pirate."},
        {
            "role": "user",
            "content": "How do I check if a Python object is an instance of a class?",
        },
    ],
)
print(response.choices[0].message.content)
About Data Types
Many LLMs ship their weights in more compact data types to reduce memory footprint and computational cost:
- Llama 3 405B – bfloat16
- DeepSeek V3 671B – fp8
- GPT OSS 120B – MXFP4
- GGUF data types – Q8_0, Q2_K
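A rough back-of-the-envelope sketch (illustrative numbers, not from the source) of why this matters: weight memory scales linearly with bytes per parameter, so halving the precision halves the footprint.
# Approximate weight memory for a 405B-parameter model at different precisions
# (weights only; ignores KV cache, activations, and runtime overhead)
params = 405e9
for dtype, bytes_per_param in [("fp32", 4), ("bf16", 2), ("fp8", 1), ("4-bit", 0.5)]:
    print(f"{dtype:>5}: ~{params * bytes_per_param / 1e9:.0f} GB")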
Compute Capability (CC)
CC defines the hardware features and supported instructions of each NVIDIA GPU architecture (see the NVIDIA CUDA GPU List).
- Tesla T4 – CC 7.5 – no native bfloat16 or fp8 support
- NVIDIA A40 (CC 8.6), A100 (CC 8.0), L40S (CC 8.9) – natively support bfloat16
- NVIDIA H100 – CC 9.0 – natively supports fp8
Still, many libraries provide software fallbacks for features the hardware lacks.
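These capabilities can be queried at runtime; a minimal sketch, assuming a CUDA-enabled PyTorch installation (PyTorch is not otherwise required by the tools discussed here):
import torch

# Report the compute capability and native bfloat16 support of GPU 0
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("Native bfloat16:", torch.cuda.is_bf16_supported())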
Offline Inference – Pros & Cons
- Fully utilize GPU memory and compute.
- Run large-scale inference as scheduled batch jobs.
- Reduced API overhead compared to serving mode.
- Limited interactivity, no multi-user serving.
LLM Inference Engines
Engine Comparison
| Engine    | Core Strength     | Parallelism                      | Offline Batching | Model Formats |
|-----------|-------------------|----------------------------------|------------------|---------------|
| vLLM      | Throughput        | Distributed (multi-node via Ray) | Yes              | HF            |
| Ollama    | User-friendliness | Single node                      | Limited          | HF + GGUF     |
| llama.cpp | Portability       | Single node                      | No               | GGUF          |
Ollama Overview
- Simplified LLM serving platform with local model management.
- Uses efficient quantized weights and an HTTP API for access (see the Python client sketch below).
- Good for development or single-node HPC testing.
- Does not scale beyond one node.
Example: Running Ollama
# Install Ollama (Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Start the server (in a separate shell or background; listens on port 11434 by default)
ollama serve
# Pull a model
ollama pull llama2:7b
# Query the model interactively
ollama run llama2:7b
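The HTTP API can also be called programmatically; a minimal Python sketch, assuming the default port 11434 and that llama2:7b has already been pulled:
import requests

# Send a single non-streaming generation request to the local Ollama server
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2:7b", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])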
vLLM Overview
- Developed at UC Berkeley for efficient LLM inference.
- Supports tensor parallelism and OpenAI-compatible APIs.
- Supports both online and offline inference.
- Ability to scale beyond one node.
Running vLLM on HPC Node
# Serve an OpenAI-compatible API on a single node, sharding the model across its GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-13b-chat-hf \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.95
llama.cpp Overview
- Lightweight C/C++ inference backend for LLaMA-family models.
- Supports CPU, ROCm, CUDA, and Vulkan backends.
- Provides support for GGUF quantization and tools to create your own quantized models.
Example: llama.cpp on HPC Node
# Serve a quantized GGUF model over HTTP (llama-server listens on port 8080 by default)
llama-server -m models/7B/ggml-model-q4_0.gguf
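The same GGUF models can also be loaded directly from Python via the llama-cpp-python bindings (a separate package, shown here only as an optional sketch; the model path and prompt are placeholders):
from llama_cpp import Llama

# Load a quantized GGUF model; n_gpu_layers=-1 offloads all layers to the GPU when built with CUDA
llm = Llama(model_path="models/7B/ggml-model-q4_0.gguf", n_gpu_layers=-1)
out = llm("Q: What is MPI? A:", max_tokens=64)
print(out["choices"][0]["text"])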
Ray Integration
- Ray simplifies distributed orchestration of LLM workloads.
- Main idea: Make multiple nodes appear to the inference engine as a single machine.
- Compatible with vLLM and custom batch pipelines.
- Handles fault tolerance and dynamic scaling in HPC clusters.
Ray Cluster Setup (Online Inference)
# Head node
ray start --head
# Worker nodes
ray start --address='head_node:6379'
# Deploy via vLLM with Ray as the distributed executor
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-13b-chat-hf \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --distributed-executor-backend "ray"
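Offline batch pipelines can reuse the same cluster; a minimal Python sketch that fans prompt chunks out as independent Ray tasks (model name, file name, and chunk count are placeholders, not from the source):
import ray

ray.init(address="auto")  # attach to the Ray cluster started above

@ray.remote(num_gpus=1)
def generate(batch):
    # Each task loads its own single-GPU model replica and processes one chunk
    from vllm import LLM, SamplingParams
    llm = LLM(model="meta-llama/Llama-2-7b-hf")
    outputs = llm.generate(batch, sampling_params=SamplingParams(temperature=0.0))
    return [o.outputs[0].text for o in outputs]

prompts = open("dataset.txt").read().splitlines()
chunks = [prompts[i::4] for i in range(4)]  # split the dataset into 4 chunks
results = ray.get([generate.remote(chunk) for chunk in chunks])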
---
Singularity / Apptainer
Containerization in HPC
- Docker alternative that does not require escalated privileges.
- Enables GPU passthrough and MPI compatibility.
- Ideal for encapsulating vLLM, Ollama, and Ray setups.
- Supports single-file container images (.sif)
Example: Singularity
# Build a container image from the public vLLM Docker image
singularity build vllm.sif docker://vllm/vllm-openai
# Open an interactive shell with the model directory bind-mounted (--nv exposes host GPUs)
singularity shell --nv -B /model/directory vllm.sif
# Run a script inside the container
singularity exec --nv -B /model/directory vllm.sif script.sh
---
Tips and Best Practices
- Containerize with Singularity for reproducibility.
- Use Ray to orchestrate multi-node inference or offline batches.
- Deploy vLLM for scalable serving; Ollama/llama.cpp for lightweight nodes.
- Optimize resource usage via Slurm or job arrays.
Performance Considerations
- Parallelism: Split models across GPUs for throughput (see the sketch after this list).
- Quantization: Use llama.cpp or Ollama for 4/8-bit weights.
- Offline batching: Maximize GPU utilization for dataset inference if interactivity is not required.
- Storage: Cache models locally to reduce network overhead.
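A minimal offline sketch of tensor parallelism with vLLM, assuming a node with 4 GPUs (model name and GPU count are placeholders):
from vllm import LLM, SamplingParams

# Shard the model's weights across 4 GPUs on one node (tensor parallelism)
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.95,
)
outputs = llm.generate(["Summarize MPI in one sentence."], sampling_params=SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)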
---
Conclusion
- HPC enables scalable and efficient LLM inference.
- vLLM + Ray: best for distributed online/offline workloads.
- llama.cpp + Ollama: great for lightweight inference and testing.
- Singularity: ensures reproducibility and portability.







