Live deploy on mRock surfaced a Schritt 5 bug: comfyui was always
treated as preloaded at scheduler startup, which made ensureFits()
short-circuit on the very first /v1/image request — exactly the
scenario eviction is supposed to handle. mvoice was never picked as
a victim, so ComfyUI OOM'd loading FLUX on top of the still-resident
mvoice.
Fix: replace the blanket 'every consumer starts loaded' init with a
heuristic — initialLoaded(cons):
- VRAMManaged (ollama): true. We never track/evict it; the consumer
runs its own LRU.
- Load+Unload both present (mvoice): true. Designed to be controllable;
typically preloads in its own lifespan.
- Unload only, no Load (comfyui): false. Lazy — FLUX isn't resident
until the first /prompt, so we shouldn't bill its 13 GiB against the
GPU budget until then.
- SystemdUnit only (whisper-server): true. Always-on, model loaded at
process start.
- Empty: true. Safe fallback.
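
The cases above can be sketched as one switch. This is a minimal
illustration, not the project's code: the Consumer struct and its field
names (LoadPath, UnloadPath, SystemdUnit) are assumed for the example; only
the decision order follows the list.

```go
package main

// Consumer holds only the fields the heuristic looks at. The field names
// are assumptions for illustration, not the project's actual config struct.
type Consumer struct {
	Name        string
	VRAMManaged bool   // consumer runs its own LRU (e.g. ollama)
	LoadPath    string // admin route that loads the model, "" if absent
	UnloadPath  string // admin route that unloads the model, "" if absent
	SystemdUnit string // unit name of an always-on service, "" if absent
}

// initialLoaded reports whether the scheduler should treat the consumer as
// holding VRAM at startup.
func initialLoaded(c Consumer) bool {
	switch {
	case c.VRAMManaged:
		return true // never tracked or evicted
	case c.LoadPath != "" && c.UnloadPath != "":
		return true // controllable, typically preloads itself
	case c.UnloadPath != "":
		return false // unload only: lazy, not resident until first request
	default:
		return true // systemd-only and empty configs: safe fallback
	}
}
```

Only the unload-only case returns false, so a consumer type that matches none
of the explicit cases is billed against the GPU budget from the start.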
Verified live on mRock (2026-05-15):
- Before /v1/image: nvidia-smi 8963 MiB used; mvoice gpu_resident_mib
2345.
- POST /v1/image: HTTP 400 from upstream (empty workflow); the broker
did trigger eviction before forwarding.
- After: nvidia-smi 6547 MiB used; mvoice gpu_resident_mib 9 (~CUDA
context only); scheduler.evictions = 2.
- POST /v1/tts: audio_url returned, tts_ms 670, audio 3.5 s.
- After reload: nvidia-smi 8943 MiB used; mvoice gpu_resident_mib 2917.
Test: TestInitialLoadedHeuristic pins the four cases down so this
doesn't regress when someone adds a fifth consumer type.
Refs: m/mGPUmanager#1 (live deploy).
Convention on mRock is user-units for ML services (whisper-server,
mvoice-launcher, comfyui as of today). Switching mGPUmanager too:
- systemd/mgpumanager.service: rewritten as a user unit (%h-based
WorkingDirectory + ExecStart, WantedBy=default.target). Drops the
ProtectSystem/ProtectHome hardening that came from the system-unit
template — user units don't need it, and ProtectHome=read-only
blocks a user unit's own working dir.
- Makefile deploy target: rsync to ~/.config/systemd/user/ on the
remote and use systemctl --user, no sudo. README documents the
lingering prerequisite (loginctl enable-linger m).
- config/consumers.yaml: bind on 0.0.0.0:8770 instead of localhost so
mRiver / Tailscale peers can actually reach the broker.
Refs: m/mGPUmanager#1 (deploy task).
scheduler.Evicting wraps the Locked scheduler with the design's
LRU-with-coexistence eviction loop. main.go switches to it.
Per-job flow:
1. ensureFits — compare cons.vram_resident_mib + 256 MiB cushion against
the live nvidia-smi free reading. If insufficient, pick the LRU
loaded consumer NOT in cons.can_coexist_with, NOT VRAM-managed
(ollama is excluded from eviction by design — it runs its own LRU),
and NOT the target itself, then call its unload route. Wait 1s for
VRAM to actually free. Repeat up to 5 times.
2. ensureLoaded — if the target was previously unloaded, call its
/api/admin/load (mvoice). Consumers without a load route are
assumed to cold-start implicitly on first request.
3. inner.Run — global GPU lock + job execution.
State:
- scheduler-local 'loaded' map + scheduler-local 'lastUsed' map. The
registry's health-derived Loaded field is the source of truth for
consumers that report it, but we need our own state for the seconds
between an unload call and the next probe.
- Stats.Evictions counts successful unload calls and surfaces through
/v1/status.
LRU pick order:
- Scheduler-local lastUsed (set on successful Run completion) takes
precedence over registry.LastUsed (set on health probes) because the
former reflects real GPU work, not health chatter. Zero-time
consumers (never used) lose first.
Tests:
- Already-resident target: no eviction calls.
- 13 GiB comfyui evicted to fit 2.8 GiB mvoice → 1 unload + 1 load,
Stats.Evictions = 1.
- Coexistent consumer (ollama, in mvoice.can_coexist_with) is never
picked even if it's the LRU candidate; the non-coexistent comfyui
is unloaded instead.
Race detector clean.
Refs: m/mGPUmanager#1 (Schritt 5).
Replaces the MVP Passthrough with scheduler.Locked: a capacity-1 channel
serialises every consumer's GPU work end-to-end. main.go switches to it.
Behavioural contract:
- Jobs that arrive while another job holds the GPU block on the channel
until the holder finishes. Context cancellation aborts the wait
cleanly (no leaked tokens, queue depth decremented).
- Stats track queue_depth, in_flight, total_jobs, last_wait_ms,
last_run_ms, oldest_queued — surfaced through /v1/status.
- One lock for ALL consumers (not per-consumer): the design (§4.3) is
explicit that coarse-grained locking beats GPU-stream-granular locking
on single-GPU, single-user hardware. mvoice, ollama and comfyui no
longer run concurrently, which is the whole point: concurrent
execution is what produced the CUDA OOM under load.
Tests:
- 5 goroutines hammer the scheduler concurrently → max in-flight = 1.
- Cancellation while parked on the lock returns ctx.Err() and frees
the queue slot.
- Stats reflect in-flight + queue-depth transitions correctly.
- Race detector clean.
Schritt 5 will compose this with VRAM-pressure eviction: before
acquiring the lock, check if the target consumer's resident cost fits
under the current GPU headroom; if not, unload the LRU non-coexistent
consumer first.
Refs: m/mGPUmanager#1 (Schritt 4).
Go daemon listening on :8770 that fronts mvoice (8766), whisper-server
(8178), ollama (11434), comfyui (8188) behind a single /v1 façade.
What this MVP does:
- Loads config/consumers.yaml: routing table, per-consumer URL + health +
paths + vram_resident_mib + can_coexist_with + load/unload routes.
- Background health probe (5s) on every consumer; if the last probe
failed, requests are rejected immediately with a structured 503 (no
Felix-Banholzer-style silent fallback).
- POST /v1/{tts,stt,llm,image} proxies the request body + Content-Type
to the routed consumer's path and streams the response back.
- GET /audio/* proxies to audio_proxy consumer (wa.sh fetches its WAV
this way).
- GET /v1/status exposes live GPU sample (nvidia-smi every 2s),
per-consumer health/loaded/gpu_resident_mib/active/total_requests,
scheduler stats.
- GET /healthz, GET / — broker liveness.
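
The fail-fast behaviour on unhealthy consumers can be sketched as a small
middleware. This is an illustration only: the JSON field names, the
"consumer_unhealthy" error code and the gate helper are assumptions, not
the broker's actual response shape.

```go
package main

import (
	"encoding/json"
	"net/http"
)

// unavailable is a hypothetical body for the structured 503.
type unavailable struct {
	Error    string `json:"error"`
	Consumer string `json:"consumer"`
}

// unhealthyResponse builds the status code and JSON body for a consumer
// whose last health probe failed.
func unhealthyResponse(consumer string) (int, []byte) {
	body, _ := json.Marshal(unavailable{Error: "consumer_unhealthy", Consumer: consumer})
	return http.StatusServiceUnavailable, body
}

// gate rejects requests up front while healthy() reports false and
// otherwise hands them to next (the proxying handler).
func gate(consumer string, healthy func() bool, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !healthy() {
			status, body := unhealthyResponse(consumer)
			w.Header().Set("Content-Type", "application/json")
			w.WriteHeader(status)
			_, _ = w.Write(body)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

The healthy callback would read the result of the most recent background
probe, so the request path itself never waits on a health check.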
The Scheduler interface is in place but the implementation is
'Passthrough' — every job runs immediately, no lock, no queue. Schritt 4
replaces it with a serialising mutex; Schritt 5 adds VRAM-pressure
eviction. The interface boundary means server.go stays unchanged.
Out of scope here:
- Schritt 3: wa.sh migration (parallel work in mAi).
- Schritt 4: queue + global GPU lock.
- Schritt 5: nvidia-smi-driven LRU eviction.
Tests: config validation (good/bad), proxy forwards body, audio proxy
streams bytes, unhealthy consumer returns 503, /v1/status JSON shape.
Refs: m/mGPUmanager#1