# Auxiliary Model Configuration

When Hermes' main model is slow or expensive, auxiliary tasks (title generation, compression, vision, session search, approval, etc.) should run on a separate fast, cheap model to avoid timeouts and reduce cost.

## Required Context Windows

| Task | Minimum Context | Notes |
|------|-----------------|-------|
| compression | 64,000 tokens | Hermes enforces this at runtime |
| vision | 8,000 tokens | Image token count dominates |
| title_generation | 4,000 tokens | Usually just last 2-3 exchanges |
| session_search | 8,000 tokens | Summaries of past sessions |
| web_extract | 8,000 tokens | Page content chunk |
| approval | 4,000 tokens | Single command + context |
| skills_hub / mcp / curator | 8,000 tokens | Metadata queries |
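
The table can be turned into a quick pre-check before assigning one model to several tasks. This is a minimal sketch, not Hermes code: `MIN_CONTEXT` copies the values above, and `undersized_tasks` is a hypothetical helper name.

```python
# Minimum context windows per auxiliary task, copied from the table above.
MIN_CONTEXT = {
    "compression": 64_000,
    "vision": 8_000,
    "title_generation": 4_000,
    "session_search": 8_000,
    "web_extract": 8_000,
    "approval": 4_000,
    "skills_hub": 8_000,
    "mcp": 8_000,
    "curator": 8_000,
}


def undersized_tasks(model_context: int) -> list[str]:
    """Return the tasks whose minimum context exceeds model_context."""
    return sorted(task for task, need in MIN_CONTEXT.items() if need > model_context)


# A 32k model covers every task except compression.
print(undersized_tasks(32_768))  # ['compression']
```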

## DISCOVER FIRST — MANDATORY BEFORE ANY CONFIG CHANGE

**Never assume model names work on the user's endpoint.** Before touching config, query what is actually available:

```bash
curl -s https://ENDPOINT/v1/models | python3 -c "import sys,json; [print(m['id']) for m in json.load(sys.stdin).get('data',[])]"
```

Pick from the results. If the command prints nothing or fails (for example, the endpoint returns a 404), no models are reachable; fix the endpoint before changing any config.

## Model Selection — Choose from what EXISTS on the endpoint

| Role | Ideal choice | Fallback (if ideal missing) | Min context |
|------|-------------|----------------------------|-------------|
| All text auxiliaries | `qwen2.5:1.5b` (~1 GB, 32k) | `ministral-3:3b` (~3 GB, 32k) | 4-8k |
| Vision | `gemma3:4b` (~3 GB, 8k) | `qwen3-vl:235b-instruct` (large) or any model with `vl`/`gemma3`/`llava` in ID | 8k |
| Compression | main model (e.g. kimi-k2.6, deepseek-v4-pro) | Any model ≥64k context on the endpoint | **64k enforced** |

**If no small model exists on the endpoint**, fall back to the main model for all auxiliary tasks — it'll work but be slower.
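
The selection rule (ideal choice, then fallback, then the main model) can be sketched as a small function. `pick_model` is a hypothetical helper, and the model lists in the example are illustrative rather than output from a real endpoint.

```python
def pick_model(available: list[str], preferred: list[str], main_model: str) -> str:
    """Return the first preferred model the endpoint hosts, else the main model."""
    hosted = set(available)
    for candidate in preferred:
        if candidate in hosted:
            return candidate
    return main_model


# Illustrative discovery result: the ideal text model is missing, the fallback is present.
available = ["ministral-3:3b", "kimi-k2.6"]
print(pick_model(available, ["qwen2.5:1.5b", "ministral-3:3b"], "kimi-k2.6"))  # ministral-3:3b
print(pick_model(available, ["gemma3:4b"], "kimi-k2.6"))  # kimi-k2.6
```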

### Pitfall: Vision model 404

`llava-phi3` is NOT available on all Ollama endpoints. If you set it and get:

> `404 — model "llava-phi3" not found`

Use the discovery API to list what is actually hosted, then pick the smallest vision-capable model:

```bash
curl -s https://YOUR_OLLAMA_HOST/v1/models | python3 -c "import sys,json; [print(m['id']) for m in json.load(sys.stdin).get('data',[])]"
```

Look for `gemma3`, `llava`, or `qwen3-vl` in the IDs. `gemma3:4b` is the safest default for Ollama-hosted vision because it is small and almost always present.
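
Scanning a long model list by eye is error-prone, so the name filter can be scripted. This is a rough sketch: `vision_candidates` is a hypothetical helper, and a substring match on the ID is only a hint that a model accepts images, not proof.

```python
# Substrings that suggest a vision-capable model on Ollama-style endpoints.
VISION_HINTS = ("gemma3", "llava", "vl")


def vision_candidates(model_ids: list[str]) -> list[str]:
    """Keep the IDs that contain any vision hint (case-insensitive)."""
    return [m for m in model_ids if any(hint in m.lower() for hint in VISION_HINTS)]


ids = ["qwen2.5:1.5b", "gemma3:4b", "qwen3-vl:235b-instruct", "kimi-k2.6"]
print(vision_candidates(ids))  # ['gemma3:4b', 'qwen3-vl:235b-instruct']
```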

## Config Pattern

For each task, set `auxiliary.<task>.provider` to `custom` (or your provider name), set `model` explicitly, and reuse the same `base_url`/`api_key` as the main model. **Do not leave `provider: auto`**: it causes timeouts when no suitable backend is found.

Example YAML snippet:

```yaml
auxiliary:
  title_generation:
    provider: custom
    model: qwen2.5:1.5b
    base_url: https://ollama.com/v1
    api_key: <key>
    timeout: 60
  compression:
    provider: custom
    model: kimi-k2.6      # must be ≥64k context
    base_url: https://ollama.com/v1
    api_key: <key>
    timeout: 120
    context_length: 131072
  vision:
    provider: custom
    model: gemma3:4b
    base_url: https://ollama.com/v1
    api_key: <key>
    timeout: 180
```
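
To catch tasks still left on `provider: auto`, a config that has already been parsed into a dict can be linted with a few lines. This is a sketch under assumptions: `tasks_needing_explicit_provider` is a hypothetical helper, and treating a missing `provider` key as `auto` is a guess about the default, not confirmed Hermes behavior.

```python
def tasks_needing_explicit_provider(auxiliary: dict) -> list[str]:
    """Return tasks whose provider is missing or 'auto', or whose model is unset."""
    flagged = []
    for task, settings in auxiliary.items():
        # Assumption: a missing or null provider behaves like "auto".
        provider = str(settings.get("provider") or "auto").lower()
        if provider == "auto" or not settings.get("model"):
            flagged.append(task)
    return flagged


auxiliary = {
    "title_generation": {"provider": "custom", "model": "qwen2.5:1.5b"},
    "vision": {"provider": "auto", "model": "gemma3:4b"},
    "approval": {"model": "qwen2.5:1.5b"},
}
print(tasks_needing_explicit_provider(auxiliary))  # ['vision', 'approval']
```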

## Runtime Warning

Hermes validates the compression model at startup. If its `context_length` is below 64,000, it logs:

> "Auxiliary compression model <name> has a context window of <N>, which is below the minimum 64,000 required by Hermes Agent."

Fix: either raise `context_length` in config (if the model actually supports it) or switch the compression task to a model that does.

## Discovery: List Available Models on an Ollama Endpoint

```bash
curl -s https://YOUR_OLLAMA_HOST/v1/models | python3 -c "import sys,json; [print(m['id']) for m in json.load(sys.stdin).get('data',[])]"
```

Filter for vision-capable models by looking for `vl`, `llava`, or `gemma3` in the model ID.