scheduler.Evicting wraps the Locked scheduler with the design's LRU-with-coexistence eviction loop. main.go switches to it.

Per-job flow:

1. ensureFits: compare cons.vram_resident_mib + 256 MiB cushion against the live nvidia-smi free reading. If insufficient, pick the LRU loaded consumer NOT in cons.can_coexist_with, NOT VRAM-managed (ollama is excluded from eviction by design; it runs its own LRU), and NOT the target itself, then call its unload route. Wait 1s for VRAM to actually free. Repeat up to 5 times.
2. ensureLoaded: if the target was previously unloaded, call its /api/admin/load (mvoice). Consumers without a load route are assumed to cold-start implicitly on first request.
3. inner.Run: global GPU lock + job execution.

State:

- scheduler-local 'loaded' map + scheduler-local 'lastUsed' map. The registry's health-derived Loaded field is the source of truth for consumers that report it, but we need our own state for the seconds between an unload call and the next probe.
- Stats.Evictions counts successful unload calls and surfaces through /v1/status.

LRU pick order:

- Scheduler-local lastUsed (set on successful Run completion) takes precedence over registry.LastUsed (set on health probes) because the former reflects real GPU work, not health chatter. Zero-time consumers (never used) lose first.

Tests:

- Already-resident target: no eviction calls.
- 13 GiB comfyui evicted to fit 2.8 GiB mvoice → 1 unload + 1 load, Stats.Evictions = 1.
- Coexistent consumer (ollama, in mvoice.can_coexist_with) is never picked even if it's the LRU candidate; the non-coexistent comfyui is unloaded instead.

Race detector clean.

Refs: m/mGPUmanager#1 (step 5).
74 lines
3.1 KiB
Markdown
# mGPUmanager

GPU inference control plane for mRock: a scheduler in front of TTS/STT/LLM/image generation with a global GPU lock, LRU eviction, and a unified `/v1` facade. Consumers: mVoice, whisper-server, Ollama, ComfyUI/FLUX, later Furbotto. Written in Go.

Full design: [`docs/design.md`](docs/design.md) covers the inventory of the existing setup, a survey of 10 alternatives, the eviction algorithm, and the migration path.

## What it does

A Go daemon listens on `mrock:8770` and:

- exposes `/v1/tts`, `/v1/stt`, `/v1/llm`, `/v1/image` as a unified consumer facade,
- routes every request through a global GPU scheduler (serial, queued),
- under VRAM pressure, runs LRU eviction across the coexistence groups declared in `config/consumers.yaml`,
- shows live GPU usage, consumer health, and scheduler statistics in `/v1/status`,
- never returns silent fallbacks: errors come back as a structured `{error,message,consumer,retryable}`.

## Consumer registry

`config/consumers.yaml` declares, per consumer:

- `url`, `health.{method,path}` for liveness probing
- `paths.<kind>.{method,path}`: how the broker reaches the consumer's TTS/STT/LLM/image endpoint
- `vram_resident_mib`: used in the scheduler's VRAM arithmetic (step 5)
- `unload.{method,path,body}` and optionally `load.{method,path}`: how the broker evicts the consumer from VRAM and brings it back up
- `can_coexist_with: [..]`: which consumers may be resident in parallel
- `priority` (0=low, 4=urgent), `max_concurrency`

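As a rough picture of how these fields could look once loaded into Go, here is a minimal sketch. The type names `Route` and `ConsumerConfig`, the Go field names, and the rules in `Validate` are assumptions for illustration; the real registry types may differ. The URL, the health path, and the HTTP methods in `main` are placeholders, while the mvoice paths are the ones mentioned elsewhere in this README.

```go
package main

import (
	"errors"
	"fmt"
)

// Route mirrors a {method,path[,body]} entry. Names are illustrative.
type Route struct {
	Method string
	Path   string
	Body   string // only meaningful for unload
}

// ConsumerConfig mirrors one consumer entry of config/consumers.yaml.
type ConsumerConfig struct {
	URL             string
	Health          Route
	Paths           map[string]Route // keyed by kind: "tts", "stt", "llm", "image"
	VRAMResidentMiB int
	Unload          Route
	Load            *Route // optional; nil means implicit cold start
	CanCoexistWith  []string
	Priority        int // 0=low .. 4=urgent
	MaxConcurrency  int
}

// Validate applies a few plausible sanity checks (assumed, not from the source).
func (c ConsumerConfig) Validate() error {
	if c.URL == "" {
		return errors.New("url is required")
	}
	if c.VRAMResidentMiB < 0 {
		return errors.New("vram_resident_mib must not be negative")
	}
	if c.Priority < 0 || c.Priority > 4 {
		return fmt.Errorf("priority %d outside 0..4", c.Priority)
	}
	return nil
}

func main() {
	mvoice := ConsumerConfig{
		URL:             "http://127.0.0.1:9000",               // placeholder address
		Health:          Route{Method: "GET", Path: "/health"}, // placeholder path
		Paths:           map[string]Route{"tts": {Method: "POST", Path: "/api/synthesize"}},
		VRAMResidentMiB: 2867, // roughly 2.8 GiB
		Unload:          Route{Method: "POST", Path: "/api/admin/unload"},
		Load:            &Route{Method: "POST", Path: "/api/admin/load"},
		CanCoexistWith:  []string{"ollama"},
		Priority:        2,
		MaxConcurrency:  1,
	}
	fmt.Println(mvoice.Validate()) // prints "<nil>"
}
```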
## Build + Deploy

```sh
make build    # ./bin/mgpumanager
make test     # go test ./...
make run      # run locally against ./config/consumers.yaml
make deploy HOST=mrock   # rsync + systemd reload + restart
```

On mRock the daemon runs as a system unit (`/etc/systemd/system/mgpumanager.service`).

## Endpoints

| Verb | Path | Behavior |
|---|---|---|
| POST | `/v1/tts` | Proxy to the `routing.tts` consumer (default: mvoice `/api/synthesize`) |
| POST | `/v1/stt` | Proxy to the `routing.stt` consumer (default: mvoice `/api/transcribe`) |
| POST | `/v1/llm` | Proxy to the `routing.llm` consumer (default: ollama `/api/generate`) |
| POST | `/v1/image` | Proxy to the `routing.image` consumer (default: comfyui `/prompt`) |
| GET | `/audio/*` | Proxy to the `audio_proxy` consumer (this is how wa.sh fetches generated audio) |
| GET | `/v1/status` | Live snapshot: GPU + consumer health + scheduler stats |
| GET | `/healthz` | Broker liveness (200 OK) |

## Error schema

Every error that originates in the broker itself has the form:

```json
{
  "error": "consumer_unreachable",
  "message": "upstream mvoice last probe failed: connection refused",
  "consumer": "mvoice",
  "retryable": true
}
```

Codes: `consumer_unreachable`, `no_consumer`, `scheduler_error`, `bad_consumer_url`, `bad_request`. Pass-through 4xx/5xx responses from a consumer reach the client unchanged.

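On the client side, the error body can be decoded into a small struct. This is a minimal sketch: the JSON keys are the ones shown above, while the Go type `BrokerError` and the helper `decodeBrokerError` are hypothetical names chosen for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BrokerError mirrors the broker's structured error body.
type BrokerError struct {
	Error     string `json:"error"`
	Message   string `json:"message"`
	Consumer  string `json:"consumer"`
	Retryable bool   `json:"retryable"`
}

// decodeBrokerError parses a response body into a BrokerError.
func decodeBrokerError(body []byte) (BrokerError, error) {
	var e BrokerError
	err := json.Unmarshal(body, &e)
	return e, err
}

func main() {
	body := []byte(`{
		"error": "consumer_unreachable",
		"message": "upstream mvoice last probe failed: connection refused",
		"consumer": "mvoice",
		"retryable": true
	}`)
	e, err := decodeBrokerError(body)
	if err != nil {
		fmt.Println("not a broker error:", err)
		return
	}
	fmt.Printf("%s (consumer=%s, retryable=%t)\n", e.Error, e.Consumer, e.Retryable)
}
```

Because pass-through responses from a consumer arrive unchanged, a client should not assume that every non-2xx body parses into this shape.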
## Phase 1 status (issue #1)

- ✅ Step 0: ComfyUI persistent (`systemd: comfyui.service`)
- ✅ Step 1: `mvoice /api/admin/{load,unload}` (mai/knuth/admin-load-unload @ mVoice)
- ✅ Step 2: routing facade + `/v1/status`
- ✅ Step 3: wa.sh switched over to the broker (m/mAi `mai/knuth/wa-tts-broker`)
- ✅ Step 4: queue + global GPU lock
- ✅ Step 5: coexistence groups + LRU eviction