Live deploy on mRock surfaced a Step 5 bug: comfyui was always
treated as preloaded at scheduler startup, which made ensureFits()
short-circuit on the very first /v1/image request, which is exactly
the scenario eviction is supposed to handle. mvoice was never picked
as a victim, and ComfyUI then OOM'd loading FLUX on top of the
still-resident mvoice.
Fix: replace the blanket 'every consumer starts loaded' init with a
heuristic — initialLoaded(cons):
- VRAMManaged (ollama): true. We never track/evict it; the consumer
runs its own LRU.
- Load+Unload both present (mvoice): true. Designed to be controllable;
typically preloads in its own lifespan.
- Unload only, no Load (comfyui): false. Lazy — FLUX isn't resident
until the first /prompt, so we shouldn't bill its 13 GiB against the
GPU budget until then.
- SystemdUnit only (whisper-server): true. Always-on, model loaded at
process start.
- Empty: true. Safe fallback.
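The heuristic above could be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the Consumer and Endpoint types, their field layout, and the unit name whisper-server.service are assumptions made for the example; only the names initialLoaded, VRAMManaged, Load, Unload and SystemdUnit come from the message itself. A consumer that declares Load without Unload is not covered by the message, so the sketch lets it fall through to the safe default.

```go
package main

import "fmt"

// Endpoint is a hypothetical description of an admin call on a consumer.
// Only its presence (nil or non-nil) matters for the heuristic.
type Endpoint struct {
	Method string
	Path   string
}

// Consumer is an assumed, trimmed-down registry entry.
type Consumer struct {
	Name        string
	VRAMManaged bool      // consumer runs its own LRU (e.g. ollama)
	Load        *Endpoint // nil when no load endpoint is declared
	Unload      *Endpoint // nil when no unload endpoint is declared
	SystemdUnit string
}

// initialLoaded guesses whether a consumer's model is already resident in
// VRAM when the scheduler starts.
func initialLoaded(c Consumer) bool {
	switch {
	case c.VRAMManaged:
		return true // never tracked or evicted by the broker
	case c.Load != nil && c.Unload != nil:
		return true // controllable, typically preloads itself
	case c.Unload != nil:
		return false // unload only: lazy, not resident until first use
	default:
		return true // SystemdUnit only, or nothing declared: safe fallback
	}
}

func main() {
	ep := &Endpoint{Method: "POST", Path: "/example"} // placeholder endpoint
	consumers := []Consumer{
		{Name: "ollama", VRAMManaged: true},
		{Name: "mvoice", Load: ep, Unload: ep},
		{Name: "comfyui", Unload: ep},
		{Name: "whisper-server", SystemdUnit: "whisper-server.service"},
		{Name: "empty"},
	}
	for _, c := range consumers {
		fmt.Printf("%-15s initialLoaded=%v\n", c.Name, initialLoaded(c))
	}
}
```

Checking VRAMManaged first keeps ollama out of the other branches even if it later gains an unload endpoint.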
Verified live on mRock (2026-05-15):
Before /v1/image: nvidia-smi 8963 MiB used; mvoice gpu_resident_mib 2345
POST /v1/image: HTTP 400 from upstream (empty workflow), broker did
trigger eviction before forwarding
After: nvidia-smi 6547 MiB used; mvoice gpu_resident_mib 9
(~CUDA context only); scheduler.evictions = 2
POST /v1/tts: audio_url returned, tts_ms 670, audio 3.5 s
After reload: nvidia-smi 8943 MiB used; mvoice gpu_resident_mib 2917
Test: TestInitialLoadedHeuristic pins the four cases down so this
doesn't regress when someone adds a fifth consumer type.
Refs: m/mGPUmanager#1 (live deploy).
mGPUmanager
GPU inference control plane for mRock: a scheduler in front of TTS/STT/LLM/image generation with a global GPU lock, LRU eviction, and a uniform /v1 facade. Consumers: mVoice, whisper-server, Ollama, ComfyUI/FLUX, later Furbotto. Written in Go.
Full design: docs/design.md (inventory of the current setup, survey of 10 alternatives, eviction algorithm, migration path).
What it does
A Go daemon listens on mrock:8770 and:
- exposes /v1/tts, /v1/stt, /v1/llm, /v1/image as a uniform consumer facade,
- routes every request through a global GPU scheduler (serial, queue),
- under VRAM pressure, runs LRU eviction over the coexistence groups declared in config/consumers.yaml,
- shows live GPU usage + consumer health + scheduler statistics in /v1/status,
- never returns silent fallbacks: errors come back as a structured {error,message,consumer,retryable}.
Consumer registry
config/consumers.yaml declares, per consumer:
- url, health.{method,path} for liveness probing
- paths.<kind>.{method,path}: how the broker reaches the consumer's TTS/STT/LLM/image endpoint
- vram_resident_mib: used for the scheduler math (step 5)
- unload.{method,path,body} and optionally load.{method,path}: how the broker evicts the consumer from VRAM / brings it back up
- can_coexist_with: [..]: which consumers may be resident in parallel
- priority (0=low, 4=urgent), max_concurrency
Build + Deploy
make build # ./bin/mgpumanager
make test # go test ./...
make run # local, against ./config/consumers.yaml
make deploy HOST=mrock # rsync + systemd reload + restart
On mRock the daemon runs as a system unit (/etc/systemd/system/mgpumanager.service).
Endpoints
| Verb | Path | Behavior |
|---|---|---|
| POST | /v1/tts | Proxy to the routing.tts consumer (default: mvoice /api/synthesize) |
| POST | /v1/stt | Proxy to the routing.stt consumer (default: mvoice /api/transcribe) |
| POST | /v1/llm | Proxy to the routing.llm consumer (default: ollama /api/generate) |
| POST | /v1/image | Proxy to the routing.image consumer (default: comfyui /prompt) |
| GET | /audio/* | Proxy to the audio_proxy consumer (this is how wa.sh fetches generated audio) |
| GET | /v1/status | Live snapshot: GPU + consumer health + scheduler stats |
| GET | /healthz | Broker liveness (200 OK) |
Error schema
Every error raised by the broker itself has the form:
{
"error": "consumer_unreachable",
"message": "upstream mvoice last probe failed: connection refused",
"consumer": "mvoice",
"retryable": true
}
Codes: consumer_unreachable, no_consumer, scheduler_error, bad_consumer_url, bad_request. Pass-through 4xx/5xx responses from the consumer reach the client unchanged.
Phase 1 status (Issue #1)
- ✅ Step 0: ComfyUI persistent (systemd: comfyui.service)
- ✅ Step 1: mvoice /api/admin/{load,unload} (mai/knuth/admin-load-unload @ mVoice)
- ✅ Step 2: routing facade + /v1/status
- ✅ Step 3: wa.sh switched over to the broker (m/mAi mai/knuth/wa-tts-broker)
- ✅ Step 4: queue + global GPU lock
- ✅ Step 5: coexistence groups + LRU eviction