LLM on akeso
An OpenAI-compatible API endpoint runs on akeso.fi.muni.cz at the following URL:
https://nlp.fi.muni.cz/llama/
Currently deployed model names:
- gpt-oss-120b
- eurollm-9b-instruct-q6_k_l
- qwen3-30b-a3b-instruct-2507-q6_k_xl
- glm4.6-iq4_k
The URL above provides the OpenAI-compatible API endpoint. To access a specific model through a Web UI, go to https://nlp.fi.muni.cz/llama/upstream/MODEL_NAME/:
- https://nlp.fi.muni.cz/llama/upstream/gpt-oss-120b/
- https://nlp.fi.muni.cz/llama/upstream/eurollm-9b-instruct-q6_k_l/
- https://nlp.fi.muni.cz/llama/upstream/qwen3-30b-a3b-instruct-2507-q6_k_xl/
- https://nlp.fi.muni.cz/llama/upstream/glm4.6-iq4_k/
See the documentation of llama-server for API details: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#openai-compatible-api-endpoints.
Authorization
Use HTTP Bearer token authorization for access. In the examples below, test
stands in for the token; replace it with your own access token.
To obtain an access token, ask your supervisor.
Web frontend
Provide the token in the Settings dialog of the Web UI.
HTTP
Send the following HTTP header along with your request:
Authorization: Bearer test
Examples
cURL
curl -X POST https://nlp.fi.muni.cz/llama/v1/chat/completions \
    --compressed \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer test" \
    -d '{
          "model": "gpt-oss-120b",
          "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "řekni vtip"}
          ]
        }'
Python
Requests
import requests

payload = {
    'model': 'gpt-oss-120b',
    'messages': [
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'řekni vtip'},
    ],
}
response = requests.post(
    "https://nlp.fi.muni.cz/llama/v1/chat/completions",
    headers={'Authorization': 'Bearer test'},
    json=payload,
)
print(response.json()['choices'][0]['message']['content'])
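The list of currently deployed models can also be retrieved programmatically. The following is a minimal sketch assuming the proxy forwards llama-server's OpenAI-compatible /v1/models endpoint; if that route is not exposed, use the model names listed at the top of this page.

import requests

# Query the OpenAI-compatible model listing endpoint (assumption: the proxy
# forwards /v1/models). Replace "test" with your own access token.
response = requests.get(
    "https://nlp.fi.muni.cz/llama/v1/models",
    headers={'Authorization': 'Bearer test'},
)
for model in response.json()['data']:
    print(model['id'])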
OpenAI library
import openai

client = openai.OpenAI(
    base_url="https://nlp.fi.muni.cz/llama/",
    api_key="test",
)
response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'řekni vtip'},
    ],
)
print(response.choices[0].message.content)
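Streaming also works through the OpenAI client. The sketch below assumes the endpoint honours the standard stream parameter of the chat completions API (llama-server supports it) and prints the reply piece by piece as it arrives.

import openai

client = openai.OpenAI(base_url="https://nlp.fi.muni.cz/llama/", api_key="test")

# Assumption: the endpoint supports streaming chat completions.
stream = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{'role': 'user', 'content': 'řekni vtip'}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental piece of the assistant message.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()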
Notes
- You do not have to provide the system message – a default provided by the model will be used instead.
- The model parameter is currently ignored, as only a single model is configured.
- Your data is not private – other users can see cached requests using the /slots endpoint.
- You can enable prefix caching to speed up processing of long, repetitive prompts by sending "cache_prompt": true in the request (see the sketch after this list).
- Send me (Ondřej Herman) a message at xherman1@fi.muni.cz should you need help.
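As a sketch of the prefix caching note above: "cache_prompt" is a llama-server extension sent alongside the usual OpenAI fields, so a plain requests call only needs one extra key. The model name and token are the same placeholders as in the examples.

import requests

payload = {
    'model': 'gpt-oss-120b',
    # llama-server extension, not part of the OpenAI schema: reuse the cached
    # prompt prefix so repeated long prompts are not re-evaluated from scratch.
    'cache_prompt': True,
    'messages': [
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'řekni vtip'},
    ],
}
response = requests.post(
    "https://nlp.fi.muni.cz/llama/v1/chat/completions",
    headers={'Authorization': 'Bearer test'},
    json=payload,
)
print(response.json()['choices'][0]['message']['content'])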