
Deploying Large Language Models in HPC Environments

Motivation

  • LLMs demand large-scale compute and memory resources.
  • HPC clusters offer GPUs, interconnects, and schedulers for scalability.
  • Efficient deployment requires managing model parallelism, batching, and containers.

Preliminaries

Offline vs. Online Inference

  • Online: Serve requests interactively via a client-server interface.
  • Offline batching: Process large sets of prompts together for maximum throughput.
  • Offline batching is ideal for HPC workloads with scheduled jobs or whole-dataset inference.

Offline Batching Example (vLLM)

from vllm import LLM, SamplingParams

# One prompt per line in the input file
prompts = open("dataset.txt").read().splitlines()
params = SamplingParams(temperature=0.0)  # greedy decoding
llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Generate completions for the whole batch at once
outputs = llm.generate(prompts, sampling_params=params)

for o in outputs:
    print(o.outputs[0].text)

Online: Client Example

import openai

# The base URL needs the scheme and the /v1 path prefix; OpenAI-compatible
# servers usually ignore the API key, but the client requires one to be set.
client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="EMPTY",
)
response = client.chat.completions.create(
    model="deepseek-r1",
    messages=[
        {"role": "system", "content": "Talk like a pirate."},
        {
            "role": "user",
            "content": "How do I check if a Python object is an instance of a class?",
        },
    ],
)
print(response.choices[0].message.content)

About Data Types

Many LLMs are released in more compact data types to reduce their memory footprint and computational cost; a back-of-the-envelope memory estimate follows the list.

  • Llama 3.1 405B – bfloat16
  • DeepSeek V3 671B – fp8
  • GPT OSS 120B – MXFP4
  • GGUF quantization types – e.g., Q8_0, Q2_K
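
As a quick sanity check, weight memory is roughly parameter count times bytes per parameter. A minimal sketch (byte sizes are exact for bfloat16 and fp8; the MXFP4 figure treats all weights as 4-bit and ignores block-scale overhead):

# Approximate weight-only memory; ignores activations and KV cache.
BYTES_PER_PARAM = {"bfloat16": 2.0, "fp8": 1.0, "mxfp4": 0.5}

def weight_gib(n_params: float, dtype: str) -> float:
    return n_params * BYTES_PER_PARAM[dtype] / 2**30

print(f"405e9 params, bfloat16: {weight_gib(405e9, 'bfloat16'):.0f} GiB")
print(f"671e9 params, fp8:      {weight_gib(671e9, 'fp8'):.0f} GiB")
print(f"120e9 params, mxfp4:    {weight_gib(120e9, 'mxfp4'):.0f} GiB")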

Compute Capability (CC)

CC defines the hardware features and supported instructions for each NVIDIA GPU architecture (see the NVIDIA CUDA GPU List).

  • Tesla T4 – CC 7.5
    • no native bfloat16 (fp16/fp32 only)
  • NVIDIA A40 (CC 8.6), A100 (CC 8.0), L40S (CC 8.9)
    • natively support bfloat16
  • NVIDIA H100 – CC 9.0
    • natively supports fp8

Still, many libraries provide software fallbacks for missing features, e.g., by upcasting to a supported type; the compute capability can also be queried at runtime, as sketched below.
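
A quick way to check what the allocated GPU supports is via PyTorch (assuming it is available in the environment):

import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: CC {major}.{minor}")
    print(f"bfloat16 supported: {torch.cuda.is_bf16_supported()}")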

Offline Inference – Pros & Cons

  • Fully utilize GPU memory and compute.
  • Run large-scale inference as scheduled batch jobs.
  • Reduced API overhead compared to serving mode.
  • Con: limited interactivity and no multi-user serving.

LLM Inference Engines

Engine Comparison

Engine      Core Strength      Parallelism                  Offline Batching  Model Formats
vLLM        Throughput         Distributed (Ray required)   Yes               HF
Ollama      User-friendliness  Single node                  Limited           HF + GGUF
llama.cpp   Portability        Single node                  No                GGUF

Ollama Overview

  • Simplified LLM serving platform with local model management.
  • Uses quantized weights and exposes an HTTP API for access.
  • Good for development or single-node HPC testing.
  • Does not scale beyond one node.

Example: Running Ollama

# Install Ollama (Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Start the server (listens on localhost:11434 by default)
ollama serve

# Pull a model
ollama pull llama2:7b

# Query the model interactively
ollama run llama2:7b
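
Once the server is running, it can also be queried programmatically over the HTTP API. A minimal sketch against the /api/generate endpoint (Ollama listens on localhost:11434 by default):

import json
import urllib.request

payload = {
    "model": "llama2:7b",
    "prompt": "Why is the sky blue?",
    "stream": False,  # return a single JSON object instead of a stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])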

vLLM Overview

  • Developed at UC Berkeley for efficient LLM inference.
  • Supports tensor parallelism and OpenAI-compatible APIs.
  • Supports both online and offline inference.
  • Ability to scale beyond one node.

Running vLLM on HPC Node

# Tensor parallelism splits each layer across GPUs; pipeline parallelism
# chains groups of layers into stages (TP size x PP size GPUs in total).
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-13b-chat-hf \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 4 \
  --gpu-memory-utilization 0.95

llama.cpp Overview

  • Lightweight C/C++ inference backend for LLaMA-family models.
  • Supports CPU, ROCm, CUDA, and Vulkan backends.
  • Provides support for GGUF quantization and tools to create your own quantized models.

Example: llama.cpp on HPC Node

# Serve a GGUF model over HTTP (llama-server exposes OpenAI-compatible endpoints)
llama-server -m models/7B/ggml-model-q4_0.gguf

Ray Integration

  • Ray simplifies distributed orchestration of LLM workloads.
  • Main Idea: Make multiple nodes visible as a single one.
  • Compatible with vLLM and custom batch pipelines.
  • Handles fault tolerance and dynamic scaling in HPC clusters.

Ray Cluster Setup (Online Inference)

# Head node
ray start --head

# Worker nodes (replace head_node with the head's hostname or IP)
ray start --address='head_node:6379'

# Deploy via vLLM (TP 8 x PP 4 = 32 GPUs across the Ray cluster)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-13b-chat-hf \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --distributed-executor-backend "ray"
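
The same Ray cluster can also back offline batching: the LLM class accepts the same parallelism arguments as the server. A minimal sketch, assuming the cluster above is already running (pipeline parallelism is omitted here, since offline support for it varies across vLLM versions):

from vllm import LLM, SamplingParams

# Shards the model across the Ray cluster's GPUs via tensor parallelism
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    tensor_parallel_size=8,
    distributed_executor_backend="ray",
)
outputs = llm.generate(["Hello, HPC!"], SamplingParams(temperature=0.0))
print(outputs[0].outputs[0].text)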

---

Singularity / Apptainer

Containerization in HPC

  • Docker alternative that does not require escalated privileges.
  • Enables GPU passthrough and MPI compatibility.
  • Ideal for encapsulating vLLM, Ollama, and Ray setups.
  • Supports single-file container images (.sif).

Example: Singularity

# Build a .sif image from the Docker Hub vLLM image
singularity build vllm.sif docker://vllm/vllm-openai

# Interactive shell with GPU passthrough (--nv) and the model directory bind-mounted
singularity shell --nv -B /model/directory vllm.sif

# Run a script inside the container
singularity exec --nv -B /model/directory vllm.sif bash script.sh

---

Tips and Best Practices

  • Containerize with Singularity for reproducibility.
  • Use Ray to orchestrate multi-node inference or offline batches.
  • Deploy vLLM for scalable serving; Ollama/llama.cpp for lightweight nodes.
  • Optimize resource usage via Slurm scheduling (e.g., job arrays).

Performance Considerations

  • Parallelism: Split models across GPUs for throughput.
  • Quantization: Use llama.cpp or Ollama for 4/8-bit weights.
  • Offline batching: Maximize GPU utilization for dataset inference if interactivity is not required.
  • Storage: Cache models on fast local storage to reduce network overhead (see the sketch below).
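
Hugging Face based engines such as vLLM respect the HF_HOME environment variable for their download cache. A minimal sketch (the scratch path is a placeholder; use your cluster's fast local storage):

import os

# Placeholder path: redirect the Hugging Face cache to scratch storage.
# Set this before importing libraries that read it (e.g., vllm).
os.environ["HF_HOME"] = f"/scratch/{os.environ['USER']}/hf-cache"

from vllm import LLM  # model downloads now land in the scratch cache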

---

Conclusion

  • HPC enables scalable and efficient LLM inference.
  • vLLM + Ray: best for distributed online/offline workloads.
  • llama.cpp + Ollama: great for lightweight inference and testing.
  • Singularity: ensures reproducibility and portability.