
mAi 468317e395 fix(scheduler): mark lazy consumers (Unload but no Load) as not-loaded at startup

Live deploy on mRock surfaced a Step 5 bug: comfyui was always
treated as preloaded at scheduler startup, which made ensureFits()
short-circuit on the very first /v1/image request, exactly the
scenario eviction is supposed to handle. As a result mvoice was never
picked as a victim, and ComfyUI then OOM'd loading FLUX on top of the
still-resident mvoice.

Fix: replace the blanket 'every consumer starts loaded' init with a
heuristic, initialLoaded(cons):

  - VRAMManaged (ollama): true. We never track/evict it; the consumer
    runs its own LRU.
  - Load+Unload both present (mvoice): true. Designed to be controllable;
    typically preloads in its own lifespan.
  - Unload only, no Load (comfyui): false. Lazy — FLUX isn't resident
    until the first /prompt, so we shouldn't bill its 13 GiB against the
    GPU budget until then.
  - SystemdUnit only (whisper-server): true. Always-on, model loaded at
    process start.
  - Empty: true. Safe fallback.
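
The rules above can be sketched in Go as follows. This is a minimal illustration, not the code from the commit: the Consumer and Endpoint types, the admin paths, and the whisper-server unit name are assumptions; only the field names VRAMManaged, Load, Unload, and SystemdUnit come from the list above.

```go
package main

import "fmt"

// Endpoint is a hypothetical description of an admin call.
type Endpoint struct {
	Method string
	Path   string
}

// Consumer is an assumed shape; only the field names VRAMManaged, Load,
// Unload and SystemdUnit are taken from the commit message.
type Consumer struct {
	Name        string
	VRAMManaged bool
	Load        *Endpoint
	Unload      *Endpoint
	SystemdUnit string
}

// initialLoaded reports whether the scheduler should treat a consumer as
// already resident in VRAM at startup.
func initialLoaded(c Consumer) bool {
	switch {
	case c.VRAMManaged:
		return true // runs its own LRU; never tracked or evicted
	case c.Unload != nil && c.Load != nil:
		return true // controllable; typically preloads on its own
	case c.Unload != nil:
		return false // lazy: the model is not resident until first use
	default:
		return true // systemd-only or empty: assume always-on
	}
}

func main() {
	// Paths and the unit name below are placeholders for illustration.
	load := &Endpoint{Method: "POST", Path: "/load"}
	unload := &Endpoint{Method: "POST", Path: "/unload"}
	consumers := []Consumer{
		{Name: "ollama", VRAMManaged: true},
		{Name: "mvoice", Load: load, Unload: unload},
		{Name: "comfyui", Unload: unload},
		{Name: "whisper-server", SystemdUnit: "whisper-server.service"},
		{Name: "empty"},
	}
	for _, c := range consumers {
		fmt.Printf("%-15s loaded=%v\n", c.Name, initialLoaded(c))
	}
}
```

Only the Unload-without-Load case yields false, so adding a new consumer shape defaults to the safe 'loaded' assumption unless a case is added for it.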

Verified live on mRock (2026-05-15):

  Before /v1/image:  nvidia-smi 8963 MiB used; mvoice gpu_resident_mib 2345
  POST /v1/image:    HTTP 400 from upstream (empty workflow), broker did
                     trigger eviction before forwarding
  After:             nvidia-smi 6547 MiB used; mvoice gpu_resident_mib 9
                     (~CUDA context only); scheduler.evictions = 2
  POST /v1/tts:      audio_url returned, tts_ms 670, audio 3.5 s
  After reload:      nvidia-smi 8943 MiB used; mvoice gpu_resident_mib 2917

Test: TestInitialLoadedHeuristic pins down the four consumer cases so
this doesn't regress when someone adds a fifth consumer type.

Refs: m/mGPUmanager#1 (live deploy).

2026-05-15 16:54:11 +02:00

cmd/mgpumanager

feat: Schritt 5 — VRAM-pressure eviction + coexistence groups

2026-05-11 13:37:03 +02:00

config

build: deploy as systemd --user unit on mRock

2026-05-15 16:50:04 +02:00

internal

fix(scheduler): mark lazy consumers (Unload but no Load) as not-loaded at startup

2026-05-15 16:54:11 +02:00

systemd

build: deploy as systemd --user unit on mRock

2026-05-15 16:50:04 +02:00

.gitignore

feat: Schritt 2 — mGPUmanager MVP routing + /v1/status

2026-05-11 13:30:17 +02:00

go.mod

feat: Schritt 2 — mGPUmanager MVP routing + /v1/status

2026-05-11 13:30:17 +02:00

go.sum

feat: Schritt 2 — mGPUmanager MVP routing + /v1/status

2026-05-11 13:30:17 +02:00

Makefile

build: deploy as systemd --user unit on mRock

2026-05-15 16:50:04 +02:00

README.md

feat: Schritt 5 — VRAM-pressure eviction + coexistence groups

2026-05-11 13:37:03 +02:00

README.md

mGPUmanager

GPU inference control plane for mRock: a scheduler in front of TTS/STT/LLM/image generation with a global GPU lock, LRU eviction, and a unified /v1 facade. Consumers: mVoice, whisper-server, Ollama, ComfyUI/FLUX, and later Furbotto. Written in Go.

Full design: docs/design.md (inventory, survey of 10 alternatives, eviction algorithm, migration path).

What it does

A Go daemon runs on mrock:8770 and:

- exposes /v1/tts, /v1/stt, /v1/llm, and /v1/image as a unified facade in front of the consumers,
- routes every request through a global GPU scheduler (serial, queued),
- under VRAM pressure, runs LRU eviction across the coexistence groups declared in config/consumers.yaml,
- shows live GPU usage, consumer health, and scheduler statistics in /v1/status,
- never returns silent fallbacks: errors come back as a structured {error,message,consumer,retryable} object.

Consumer registry

config/consumers.yaml declares for each consumer:

- url, health.{method,path} for liveness probing
- paths.<kind>.{method,path}: how the broker reaches the consumer's TTS/STT/LLM/image endpoint
- vram_resident_mib: input to the scheduler's VRAM accounting (Step 5)
- unload.{method,path,body} and optionally load.{method,path}: how the broker evicts the consumer from VRAM and brings it back up
- can_coexist_with: [..]: which consumers may be resident at the same time
- priority (0=low, 4=urgent), max_concurrency

Build + Deploy

make build       # ./bin/mgpumanager
make test        # go test ./...
make run         # run locally against ./config/consumers.yaml
make deploy HOST=mrock  # rsync + systemd reload + restart

On mRock the daemon runs as a system unit (/etc/systemd/system/mgpumanager.service).

Endpoints

Verb	Path	Behavior
POST	`/v1/tts`	Proxy to the `routing.tts` consumer (default: mvoice `/api/synthesize`)
POST	`/v1/stt`	Proxy to the `routing.stt` consumer (default: mvoice `/api/transcribe`)
POST	`/v1/llm`	Proxy to the `routing.llm` consumer (default: ollama `/api/generate`)
POST	`/v1/image`	Proxy to the `routing.image` consumer (default: comfyui `/prompt`)
GET	`/audio/*`	Proxy to the `audio_proxy` consumer (this is how wa.sh fetches generated audio)
GET	`/v1/status`	Live snapshot: GPU + consumer health + scheduler stats
GET	`/healthz`	Broker liveness (200 OK)
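
The default routing in the table can be expressed as a small lookup. This is a sketch for illustration only: the Route type and resolve function are hypothetical, the choice of mvoice as the audio_proxy consumer is an assumption (the table does not name it), and so is forwarding the /audio/ path unchanged.

```go
package main

import (
	"fmt"
	"strings"
)

// Route names the consumer and upstream path a facade request is proxied to.
type Route struct {
	Consumer string
	Path     string
}

// defaultRoutes mirrors the defaults listed in the endpoint table.
var defaultRoutes = map[string]Route{
	"POST /v1/tts":   {Consumer: "mvoice", Path: "/api/synthesize"},
	"POST /v1/stt":   {Consumer: "mvoice", Path: "/api/transcribe"},
	"POST /v1/llm":   {Consumer: "ollama", Path: "/api/generate"},
	"POST /v1/image": {Consumer: "comfyui", Path: "/prompt"},
}

// audioProxy is an assumption; the README only refers to "the audio_proxy
// consumer" without naming it.
const audioProxy = "mvoice"

// resolve maps a facade request to its upstream route.
func resolve(method, path string) (Route, bool) {
	if r, ok := defaultRoutes[method+" "+path]; ok {
		return r, true
	}
	if method == "GET" && strings.HasPrefix(path, "/audio/") {
		// Assumed: the audio path is forwarded unchanged.
		return Route{Consumer: audioProxy, Path: path}, true
	}
	return Route{}, false
}

func main() {
	requests := [][2]string{
		{"POST", "/v1/image"},
		{"GET", "/audio/a.wav"},
		{"GET", "/v1/tts"}, // wrong verb: no route
	}
	for _, req := range requests {
		r, ok := resolve(req[0], req[1])
		fmt.Println(req[0], req[1], "->", r.Consumer, r.Path, ok)
	}
}
```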

Error schema

Every error that originates in the broker itself has this form:

{
  "error": "consumer_unreachable",
  "message": "upstream mvoice last probe failed: connection refused",
  "consumer": "mvoice",
  "retryable": true
}

Codes: consumer_unreachable, no_consumer, scheduler_error, bad_consumer_url, bad_request. Pass-through 4xx/5xx responses from the consumer reach the client unchanged.

Phase 1 status (Issue #1)

✅ Step 0: ComfyUI persistent (systemd: comfyui.service)
✅ Step 1: mvoice /api/admin/{load,unload} (mai/knuth/admin-load-unload @ mVoice)
✅ Step 2: routing facade + /v1/status
✅ Step 3: wa.sh switched over to the broker (m/mAi mai/knuth/wa-tts-broker)
✅ Step 4: queue + global GPU lock
✅ Step 5: coexistence groups + LRU eviction