Deploying Large Language Models in HPC Environments
Motivation
- LLMs demand large-scale compute and memory resources.
- HPC clusters offer GPUs, interconnects, and schedulers for scalability.
- Efficient deployment requires managing model parallelism, batching, and containers.
Preliminaries
Offline vs. Online Inference
- Online: Process requests interactively via a client-server interface.
- Offline batching: Process large sets of prompts together for maximum throughput.
- Offline batching is ideal for HPC workloads with scheduled jobs or dataset-scale inference.
Offline Batching Example (vLLM)
from vllm import LLM, SamplingParams

# Read one prompt per line and generate deterministically (temperature 0)
prompts = open("dataset.txt").read().splitlines()
params = SamplingParams(temperature=0.0)

# Load the model and run batched generation over all prompts at once
llm = LLM(model="meta-llama/Llama-2-7b-hf")
outputs = llm.generate(prompts, sampling_params=params)
for o in outputs:
    print(o.outputs[0].text)
Online: Client Example
import openai

# OpenAI-compatible client pointed at a locally served model
# (the dummy api_key is required by the client but ignored by most local servers)
client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="EMPTY",
)
response = client.chat.completions.create(
    model="deepseek-r1",
    messages=[
        {"role": "system", "content": "Talk like a pirate."},
        {
            "role": "user",
            "content": "How do I check if a Python object is an instance of a class?",
        },
    ],
)
print(response.choices[0].message.content)
About Data Types
Many LLMs ship their weights in more compact data types to reduce memory footprint and computational cost:
- Llama 3 405B – bfloat16
- DeepSeek V3 671B – fp8
- GPT OSS 120B – MXFP4
- GGUF data types – Q8_0, Q2_K
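A rough back-of-the-envelope sketch (illustrative numbers, not from the source) of why this matters: weight memory scales linearly with bytes per parameter, so halving the precision halves the footprint.
# Approximate weight memory for a 405B-parameter model at different precisions
# (weights only; ignores KV cache, activations, and runtime overhead)
params = 405e9
for dtype, bytes_per_param in [("fp32", 4), ("bf16", 2), ("fp8", 1), ("4-bit", 0.5)]:
    print(f"{dtype:>5}: ~{params * bytes_per_param / 1e9:.0f} GB")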
Compute Capability (CC)
CC defines the hardware features and supported instructions of each NVIDIA GPU architecture (see the NVIDIA CUDA GPU List).
- Tesla T4 – CC 7.5 – no native bfloat16 or fp8 support
- NVIDIA A40 (CC 8.6), A100 (CC 8.0), L40S (CC 8.9) – natively support bfloat16
- NVIDIA H100 – CC 9.0 – natively supports fp8
Still, many libraries provide software fallbacks for features the hardware lacks.
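These capabilities can be queried at runtime; a minimal sketch, assuming a CUDA-enabled PyTorch installation (PyTorch is not otherwise required by the tools discussed here):
import torch

# Report the compute capability and native bfloat16 support of GPU 0
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("Native bfloat16:", torch.cuda.is_bf16_supported())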
Offline Inference – Pros & Cons
- Fully utilize GPU memory and compute.
- Run large-scale inference as scheduled batch jobs.
- Reduced API overhead compared to serving mode.
- Limited interactivity, no multi-user serving.
LLM Inference Engines
Engine Comparison
| Engine    | Core Strength     | Parallelism                      | Offline Batching | Model Formats |
|-----------|-------------------|----------------------------------|------------------|---------------|
| vLLM      | Throughput        | Distributed (multi-node via Ray) | Yes              | HF            |
| Ollama    | User-friendliness | Single node                      | Limited          | HF + GGUF     |
| llama.cpp | Portability       | Single node                      | No               | GGUF          |
Ollama Overview
- Simplified LLM serving platform with local model management.
- Uses efficient quantized weights and an HTTP API for access (see the Python client sketch below).
- Good for development or single-node HPC testing.
- Does not scale beyond one node.
Example: Running Ollama
# Install Ollama (Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Start the server (in a separate shell or background; listens on port 11434 by default)
ollama serve
# Pull a model
ollama pull llama2:7b
# Query the model interactively
ollama run llama2:7b
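The HTTP API can also be called programmatically; a minimal Python sketch, assuming the default port 11434 and that llama2:7b has already been pulled:
import requests

# Send a single non-streaming generation request to the local Ollama server
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2:7b", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])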
vLLM Overview
- Developed at UC Berkeley for efficient LLM inference.
- Supports tensor parallelism and OpenAI-compatible APIs.
- Supports both online and offline inference.
- Ability to scale beyond one node.
Running vLLM on HPC Node
# Serve an OpenAI-compatible API on a single node, sharding the model across its GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-13b-chat-hf \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.95
llama.cpp Overview
- Lightweight C/C++ inference backend for LLaMA-family models.
- Supports CPU, ROCm, CUDA, and Vulkan backends.
- Provides support for GGUF quantization and tools to create your own quantized models.
Example: llama.cpp on HPC Node
# Serve a quantized GGUF model over HTTP (llama-server listens on port 8080 by default)
llama-server -m models/7B/ggml-model-q4_0.gguf
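The same GGUF models can also be loaded directly from Python via the llama-cpp-python bindings (a separate package, shown here only as an optional sketch; the model path and prompt are placeholders):
from llama_cpp import Llama

# Load a quantized GGUF model; n_gpu_layers=-1 offloads all layers to the GPU when built with CUDA
llm = Llama(model_path="models/7B/ggml-model-q4_0.gguf", n_gpu_layers=-1)
out = llm("Q: What is MPI? A:", max_tokens=64)
print(out["choices"][0]["text"])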
Ray Integration
- Ray simplifies distributed orchestration of LLM workloads.
- Main idea: Make multiple nodes appear to the inference engine as a single machine.
- Compatible with vLLM and custom batch pipelines.
- Handles fault tolerance and dynamic scaling in HPC clusters.
Ray Cluster Setup (Online Inference)
# Head node
ray start --head
# Worker nodes
ray start --address='head_node:6379'
# Deploy via vLLM with Ray as the distributed executor
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-13b-chat-hf \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --distributed-executor-backend "ray"
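Offline batch pipelines can reuse the same cluster; a minimal Python sketch that fans prompt chunks out as independent Ray tasks (model name, file name, and chunk count are placeholders, not from the source):
import ray

ray.init(address="auto")  # attach to the Ray cluster started above

@ray.remote(num_gpus=1)
def generate(batch):
    # Each task loads its own single-GPU model replica and processes one chunk
    from vllm import LLM, SamplingParams
    llm = LLM(model="meta-llama/Llama-2-7b-hf")
    outputs = llm.generate(batch, sampling_params=SamplingParams(temperature=0.0))
    return [o.outputs[0].text for o in outputs]

prompts = open("dataset.txt").read().splitlines()
chunks = [prompts[i::4] for i in range(4)]  # split the dataset into 4 chunks
results = ray.get([generate.remote(chunk) for chunk in chunks])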
---
Singularity / Apptainer
Containerization in HPC
- Docker alternative that does not require escalated privileges.
- Enables GPU passthrough and MPI compatibility.
- Ideal for encapsulating vLLM, Ollama, and Ray setups.
- Supports single-file container images (.sif)
Example: Singularity
# Build a container image from the public vLLM Docker image
singularity build vllm.sif docker://vllm/vllm-openai
# Open an interactive shell with the model directory bind-mounted (--nv exposes host GPUs)
singularity shell --nv -B /model/directory vllm.sif
# Run a script inside the container
singularity exec --nv -B /model/directory vllm.sif script.sh
---
Tips and Best Practices
- Containerize with Singularity for reproducibility.
- Use Ray to orchestrate multi-node inference or offline batches.
- Deploy vLLM for scalable serving; Ollama/llama.cpp for lightweight nodes.
- Optimize resource usage via Slurm or job arrays.
Performance Considerations
- Parallelism: Split models across GPUs for throughput (see the sketch after this list).
- Quantization: Use llama.cpp or Ollama for 4/8-bit weights.
- Offline batching: Maximize GPU utilization for dataset inference if interactivity is not required.
- Storage: Cache models locally to reduce network overhead.
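A minimal offline sketch of tensor parallelism with vLLM, assuming a node with 4 GPUs (model name and GPU count are placeholders):
from vllm import LLM, SamplingParams

# Shard the model's weights across 4 GPUs on one node (tensor parallelism)
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.95,
)
outputs = llm.generate(["Summarize MPI in one sentence."], sampling_params=SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)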
---
Conclusion
- HPC enables scalable and efficient LLM inference.
- vLLM + Ray: best for distributed online/offline workloads.
- llama.cpp + Ollama: great for lightweight inference and testing.
- Singularity: ensures reproducibility and portability.







