mGPUmanager: ollama not evictable + no game-awareness + OLLAMA_KEEP_ALIVE=24h starve the game/FLUX #3
Symptom
On mRock (RTX 4070 Ti SUPER, 16 GB) a game (Baldur's Gate 3, ~3 GB) could not run properly and an ImaGen/FLUX (comfyui) request 503'd, because VRAM was hogged by an idle ollama embedding model. mGPUmanager is running but does not actually keep the GPU usable.
Diagnosis (from live state + mgpumanager logs, 2026-06-07)
VRAM at the time: 13.4/16 GB used, 2.4 GB free. Holders: ollama 5.7 GB (qwen3-embedding:8b, pinned), whisper 2.0 GB, mvoice 0.24 GB, comfyui 0.3 GB, BG3 3.1 GB.
Per the mgpumanager scheduler log, when a comfyui lease needed 13000 MiB it evicted mvoice + whisper but could not evict ollama, then gave up.
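The figures above can be turned into a rough admission check: before evicting anything, add up what is free plus what could be reclaimed, and deny the lease up front if that total still falls short. The sketch below is only an illustration. `Holder` and `plan_lease` are hypothetical names, not mGPUmanager's actual API; the MiB values are the GB figures above converted at an assumed 1024 MiB per GB; comfyui's own 0.3 GB is left out; and ollama is marked evictable to show what would happen if it were.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Holder:
    name: str
    vram_mib: int
    evictable: bool


def plan_lease(need_mib: int, free_mib: int, holders: list[Holder]):
    """Return ("grant", victims) or ("deny", []) without touching the GPU."""
    candidates = [h for h in holders if h.evictable]
    reclaimable = sum(h.vram_mib for h in candidates)
    if free_mib + reclaimable < need_mib:
        # Evicting everything would still not be enough: deny before churning.
        return "deny", []
    victims = []
    available = free_mib
    # Evict the largest holders first so as few consumers as possible are hit.
    for h in sorted(candidates, key=lambda h: h.vram_mib, reverse=True):
        if available >= need_mib:
            break
        victims.append(h.name)
        available += h.vram_mib
    return "grant", victims


# Assumed MiB conversions of the holders listed above; the game is protected.
HOLDERS = [
    Holder("ollama", 5837, evictable=True),
    Holder("whisper", 2048, evictable=True),
    Holder("mvoice", 246, evictable=True),
    Holder("BG3", 3174, evictable=False),
]

if __name__ == "__main__":
    print(plan_lease(13000, 2458, HOLDERS))
```

Under these assumptions the 13000 MiB request is denied even with ollama evictable (2458 free + 8131 reclaimable = 10589 MiB), which suggests the lease should have been queued or refused rather than triggering evictions that could not succeed while the game was running.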
Three concrete defects
1. ollama is not an evictable consumer. In config/consumers.yaml the ollama entry has only a url; there is no vram_resident_mib, no unload method, and no systemd_unit. So the scheduler never picks it as an eviction victim, and its 5.7 GB stays pinned no matter what else needs the GPU. Fix: give ollama a real unload path. Ollama supports unloading a model via the API (POST /api/generate or /api/chat with "keep_alive": 0) or via the CLI (ollama stop <model>). mGPUmanager should track ollama's resident models (GET /api/ps gives name + size_vram) and be able to evict them on demand, same as it does mvoice/whisper.
2. No game-awareness. A running game (GPU process under SteamLibrary, or a configurable process/cgroup list) is invisible to mGPUmanager. It should: (a) detect an active game and treat its VRAM as a high-priority, unevictable reservation; (b) refuse or defer AI leases that would starve the game (e.g. a 13 GB FLUX request while a game holds 3 GB and needs more should be queued/denied gracefully, not trigger a churn-evict that 503s anyway). Optionally a 'gaming mode' lease/flag that parks all AI consumers.
3. OLLAMA_KEEP_ALIVE=24h (env on mRock) is the root hog: every loaded ollama model pins for a full day. Set a short default (e.g. 30s to 5m) so idle models free VRAM, OR have mGPUmanager own ollama's lifecycle (load with a per-lease keep_alive, unload right after). A 24h global pin defeats the whole point of a GPU manager.
Bonus: unit start is flaky
The user systemd unit crash-looped on startup (Permission denied executing /home/m/dev/mGPUmanager/bin/mgpumanager, restart counter 4) before finally starting. Fix the binary perms / unit so it starts cleanly first try, and make sure systemctl --user is-active reflects reality.
Acceptance
[...] GET /api/ps going empty), not just mvoice/whisper.
[...] is-active = active.
Context
GPU consumers on mRock: mvoice (TTS), whisper-server (STT), ollama (LLM + embeddings for youpc/flexsiebels/Lexie/Kirsten), comfyui (ImaGen/FLUX), plus the desktop + games. Config: config/consumers.yaml. Immediate workaround applied today: ollama stop qwen3-embedding:8b freed 5.7 GB by hand. This issue is to make that automatic + game-aware.
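An automated version of that manual workaround might look roughly like the sketch below. It uses the two Ollama endpoints mentioned in this issue: GET /api/ps to list resident models and POST /api/generate with "keep_alive": 0 to unload one. The function names are hypothetical, the base URL assumes Ollama's default listen address, and nothing here reflects how mGPUmanager is actually structured.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # assumption: Ollama's default address


def parse_resident_models(ps_payload: dict) -> list[tuple[str, int]]:
    """Return (model name, VRAM in MiB) for each model in a GET /api/ps reply."""
    return [
        (m["name"], m.get("size_vram", 0) // (1024 * 1024))
        for m in ps_payload.get("models", [])
    ]


def build_unload_request(base_url: str, model: str) -> urllib.request.Request:
    """Build the POST /api/generate request that asks Ollama to unload a model."""
    body = json.dumps({"model": model, "keep_alive": 0}).encode()
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evict_all(base_url: str = OLLAMA_URL, timeout: float = 10.0) -> int:
    """Unload every resident model; return the MiB that Ollama reported as held."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=timeout) as resp:
        resident = parse_resident_models(json.load(resp))
    freed_mib = 0
    for name, mib in resident:
        with urllib.request.urlopen(
            build_unload_request(base_url, name), timeout=timeout
        ) as resp:
            resp.read()  # drain the reply; the unload is requested by keep_alive=0
        freed_mib += mib
    return freed_mib
```

Keeping the parsing and request-building separate from the network calls makes the eviction path testable without a running Ollama instance. A real implementation would likely re-check GET /api/ps afterwards to confirm the models are gone before granting the lease.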