scheduler.Evicting wraps the Locked scheduler with the design's
LRU-with-coexistence eviction loop. main.go switches to it.
Per-job flow:
1. ensureFits — compare cons.vram_resident_mib + 256 MiB cushion against
the live nvidia-smi free reading. If free VRAM is insufficient, pick the
LRU loaded consumer that is NOT in cons.can_coexist_with, does NOT manage
its own VRAM (ollama is excluded from eviction by design because it runs
its own LRU), and is NOT the target itself, then call its unload route.
Wait 1s for the VRAM to actually free. Repeat up to 5 times.
2. ensureLoaded — if the target was previously unloaded, call its
load route (/api/admin/load on mvoice). Consumers without a load route are
assumed to cold-start implicitly on first request.
3. inner.Run — global GPU lock + job execution.
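
For illustration only, the ensureFits step could be sketched roughly as below. This is not the code in this commit: the consumer struct, pickVictim, and the freeMiB/unload/settle parameters are hypothetical stand-ins (freeMiB for the nvidia-smi reading, unload for the consumer's unload route), and reading "repeat up to 5 times" as at most five eviction rounds is an assumption.

```go
package main

import (
	"errors"
	"time"
)

// Hypothetical constants taken from the flow description above.
const (
	cushionMiB = 256 // safety margin added to the target's footprint
	maxRounds  = 5   // assumed meaning of "repeat up to 5 times"
)

// consumer is an illustrative stand-in, not the real registry type.
type consumer struct {
	name        string
	residentMiB int             // mirrors cons.vram_resident_mib
	coexistWith map[string]bool // mirrors cons.can_coexist_with
	selfManaged bool            // manages its own VRAM (e.g. ollama)
	loaded      bool
	lastUsed    time.Time
}

// pickVictim returns the least recently used loaded consumer that may be
// evicted on behalf of target, or nil if there is none.
func pickVictim(target *consumer, all []*consumer) *consumer {
	var victim *consumer
	for _, c := range all {
		if c == target || !c.loaded || c.selfManaged || target.coexistWith[c.name] {
			continue
		}
		if victim == nil || c.lastUsed.Before(victim.lastUsed) {
			victim = c
		}
	}
	return victim
}

// ensureFits unloads victims until target plus cushion fits into free VRAM.
// It returns the number of successful unload calls.
func ensureFits(target *consumer, all []*consumer, freeMiB func() int,
	unload func(*consumer) error, settle time.Duration) (int, error) {
	need := target.residentMiB + cushionMiB
	evictions := 0
	for round := 0; round < maxRounds; round++ {
		if freeMiB() >= need {
			return evictions, nil
		}
		v := pickVictim(target, all)
		if v == nil {
			return evictions, errors.New("no evictable consumer")
		}
		if err := unload(v); err != nil {
			return evictions, err
		}
		v.loaded = false
		evictions++
		time.Sleep(settle) // give the driver time to release the memory
	}
	if freeMiB() >= need {
		return evictions, nil
	}
	return evictions, errors.New("insufficient VRAM after eviction rounds")
}
```

Passing the free-memory reading and the unload call as functions keeps the loop testable without a GPU; a caller would pass a 1s settle time to match the flow above.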
State:
- scheduler-local 'loaded' map + scheduler-local 'lastUsed' map. The
registry's health-derived Loaded field is the source of truth for
consumers that report it, but we need our own state for the seconds
between an unload call and the next probe.
- Stats.Evictions counts successful unload calls and surfaces through
/v1/status.
LRU pick order:
- Scheduler-local lastUsed (set on successful Run completion) takes
precedence over registry.LastUsed (set on health probes) because the
former reflects real GPU work, not health chatter. Consumers with a
zero timestamp (never used) are evicted first.
Tests:
- Already-resident target: no eviction calls.
- 13 GiB comfyui evicted to fit 2.8 GiB mvoice → 1 unload + 1 load,
Stats.Evictions = 1.
- Coexistent consumer (ollama, in mvoice.can_coexist_with) is never
picked even if it's the LRU candidate; the non-coexistent comfyui
is unloaded instead.
Race detector clean.
Refs: m/mGPUmanager#1 (step 5).