mGPUmanager: ollama not evictable + no game-awareness + OLLAMA_KEEP_ALIVE=24h starve the game/FLUX #3

Open
opened 2026-06-07 09:14:08 +00:00 by mAi · 0 comments

Symptom

On mRock (RTX 4070 Ti SUPER, 16 GB) a game (Baldur's Gate 3, ~3 GB) could not run properly and an ImaGen/FLUX (comfyui) request 503'd, because VRAM was hogged by an idle ollama embedding model. mGPUmanager is running but does not actually keep the GPU usable.

Diagnosis (from live state + mgpumanager logs, 2026-06-07)

VRAM at the time: 13.4/16 GB used, 2.4 GB free. Holders: ollama 5.7 GB (qwen3-embedding:8b, pinned), whisper 2.0 GB, mvoice 0.24 GB, comfyui 0.3 GB, BG3 3.1 GB.

mgpumanager scheduler log when a comfyui lease needed 13000 MiB:

evicted consumer victim=mvoice target=comfyui free_mib_after=16 need_mib=13000
evicted consumer victim=whisper-server target=comfyui free_mib_after=16 need_mib=13000
WARN no eviction candidates target=comfyui need_mib=13000 free_mib=2388 strict=true
POST /v1/lease status=503

It evicted mvoice + whisper but could not evict ollama, then gave up.
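The figures above also suggest the 13000 MiB lease could not have fit while the game was running, even if ollama had been evictable. A rough check, assuming the reported GB values can be treated as GiB (1 GB = 1024 MiB); the variable and function names are illustrative only:

```python
# Figures copied from the diagnosis above; treating GB as GiB is an assumption.
MIB_PER_GB = 1024

free_gb = 2.4
holders_gb = {"ollama": 5.7, "whisper": 2.0, "mvoice": 0.24, "comfyui": 0.3, "BG3": 3.1}
need_mib = 13000


def reclaimable_mib(victims):
    """Free VRAM in MiB if every consumer named in `victims` were evicted."""
    return (free_gb + sum(holders_gb[v] for v in victims)) * MIB_PER_GB


# What the scheduler was able to evict during the incident.
actual = reclaimable_mib(["mvoice", "whisper"])
# Best case if ollama were evictable too; the game is left untouched.
best_case = reclaimable_mib(["mvoice", "whisper", "ollama"])

print(f"actual: {actual:.0f} MiB, best case: {best_case:.0f} MiB, need: {need_mib} MiB")
print("fits without touching the game:", best_case >= need_mib)
```

Under these assumptions both totals stay below 13000 MiB, so evicting mvoice and whisper only added churn before the 503.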

Three concrete defects

  1. ollama is not an evictable consumer. In config/consumers.yaml the ollama entry has only a url: no vram_resident_mib, no unload method, no systemd_unit. So the scheduler never picks it as an eviction victim, and its 5.7 GB stays pinned no matter what else needs the GPU. Fix: give ollama a real unload path. Ollama supports unloading a model via the API (POST /api/generate or POST /api/chat with "keep_alive": 0) or via the CLI (ollama stop <model>). mGPUmanager should track ollama's resident models (GET /api/ps gives name + size_vram) and be able to evict them on demand, same as it does mvoice/whisper.

  2. No game-awareness. A running game (GPU process under SteamLibrary, or a configurable process/cgroup list) is invisible to mGPUmanager. It should: (a) detect an active game and treat its VRAM as a high-priority, unevictable reservation; (b) refuse or defer AI leases that would starve the game (e.g. a 13 GB FLUX request while a game holds 3 GB and needs more should be queued/denied gracefully, not trigger a churn-evict that 503s anyway). Optionally a 'gaming mode' lease/flag that parks all AI consumers.

  3. OLLAMA_KEEP_ALIVE=24h (env on mRock) is the root hog: every loaded ollama model pins for a full day. Set a short default (e.g. 30s to 5m) so idle models free VRAM, OR have mGPUmanager own ollama's lifecycle (load with a per-lease keep_alive, unload right after). A 24h global pin defeats the whole point of a GPU manager.
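The unload path from defect 1 could look roughly like the sketch below. It is not mGPUmanager code: the function names are hypothetical, the URL assumes Ollama on its default port, and `size_vram` is taken to be in bytes. The HTTP sender is passed in so the eviction logic can be dry-run without a live Ollama.

```python
import json
from urllib import request

OLLAMA_URL = "http://127.0.0.1:11434"  # assumption: Ollama on its default port


def resident_models(ps_response):
    """Map a parsed GET /api/ps body to (name, size_vram in MiB), largest first."""
    models = ps_response.get("models", [])
    sized = [(m["name"], m.get("size_vram", 0) // (1024 * 1024)) for m in models]
    return sorted(sized, key=lambda item: item[1], reverse=True)


def unload_request(model):
    """Build the POST /api/generate call that asks Ollama to unload `model`."""
    body = json.dumps({"model": model, "keep_alive": 0}).encode()
    return request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evict_ollama(need_mib, ps_response, send):
    """Request unloads until about `need_mib` is covered; return the MiB asked back."""
    freed = 0
    for name, mib in resident_models(ps_response):
        if freed >= need_mib:
            break
        send(unload_request(name))  # `send` is injected, e.g. request.urlopen
        freed += mib
    return freed


def fetch_ps():
    """Read the live list of loaded models from Ollama."""
    with request.urlopen(f"{OLLAMA_URL}/api/ps") as resp:
        return json.load(resp)
```

Against a live instance this would be called as `evict_ollama(5000, fetch_ps(), request.urlopen)`. For defect 3, the same `keep_alive` field also accepts a duration string such as `"5m"` on a normal request, which is one way the manager could own the lifetime per lease instead of relying on the global OLLAMA_KEEP_ALIVE.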

Bonus: unit start is flaky

The user systemd unit crash-looped on startup (Permission denied executing /home/m/dev/mGPUmanager/bin/mgpumanager, restart counter 4) before finally starting. Fix the binary perms / unit so it starts cleanly first try, and make sure systemctl --user is-active reflects reality.
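A preflight check along these lines might catch the missing execute bit before the unit starts. It is only a sketch: `exec_problems` is a hypothetical helper, and it checks the owner execute bit only, so other causes of "Permission denied" (a noexec mount, an unreadable parent directory) would need separate checks.

```python
import os
import stat


def exec_problems(path):
    """Return human-readable reasons why `path` may fail to execute."""
    if not os.path.isfile(path):
        return [f"{path}: not a regular file"]
    mode = os.stat(path).st_mode
    problems = []
    if not mode & stat.S_IXUSR:
        # stat.filemode renders the mode like `ls -l`, e.g. -rw-r--r--
        problems.append(f"{path}: owner execute bit not set ({stat.filemode(mode)})")
    return problems


if __name__ == "__main__":
    for line in exec_problems("/home/m/dev/mGPUmanager/bin/mgpumanager"):
        print(line)
```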

Acceptance

  • With a game running, an ImaGen/FLUX or large LLM lease that wouldn't fit alongside the game is deferred/denied gracefully (no churn-evicting unrelated consumers to then 503), and the game keeps its VRAM.
  • ollama appears as an evictable consumer: when VRAM is needed, mgpumanager can unload ollama models (verify via GET /api/ps going empty) — not just mvoice/whisper.
  • ollama models no longer pin for 24h (keep_alive shortened or manager-owned).
  • mgpumanager user unit starts cleanly (no Permission-denied crashloop), is-active = active.
  • Document the consumer model (incl. how a game is detected/reserved) in the repo README.
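The first acceptance point amounts to planning before evicting: work out whether the lease can be satisfied from evictable consumers alone, and only then touch anything. A minimal sketch, assuming a game is recognised by a marker in its executable path and modelled as a non-evictable consumer. `Consumer`, `is_game` and `plan_lease` are hypothetical names, not the existing scheduler API, and largest-first victim order is just one possible policy.

```python
from dataclasses import dataclass


@dataclass
class Consumer:
    name: str
    resident_mib: int
    evictable: bool = True  # a detected game would be registered with False


def is_game(exe_path, markers=("SteamLibrary",)):
    """Treat a GPU process as a game if its executable path contains a marker."""
    return any(marker in exe_path for marker in markers)


def plan_lease(need_mib, free_mib, consumers):
    """Return ("grant", []), ("evict", victims) or ("deny", []) without evicting."""
    if need_mib <= free_mib:
        return "grant", []
    victims, reclaimed = [], free_mib
    candidates = sorted(
        (c for c in consumers if c.evictable),
        key=lambda c: c.resident_mib,
        reverse=True,  # largest first keeps the victim list short
    )
    for c in candidates:
        victims.append(c.name)
        reclaimed += c.resident_mib
        if reclaimed >= need_mib:
            return "evict", victims
    # Not satisfiable even with every evictable consumer gone: deny, evict nothing.
    return "deny", []
```

With this shape, a lease that cannot fit next to the game is denied up front and mvoice/whisper stay loaded; the caller can turn "deny" into a queued request or a clear error instead of a 503 after churn.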

Context

GPU consumers on mRock: mvoice (TTS), whisper-server (STT), ollama (LLM + embeddings for youpc/flexsiebels/Lexie/Kirsten), comfyui (ImaGen/FLUX), plus the desktop + games. Config: config/consumers.yaml. Immediate workaround applied today: ollama stop qwen3-embedding:8b freed 5.7 GB by hand — this issue is to make that automatic + game-aware.

Reference: m/mGPUmanager#3