3 Commits

Author SHA1 Message Date
mAi
ca9bb1773f feat: Schritt 5 — VRAM-pressure eviction + coexistence groups
scheduler.Evicting wraps the Locked scheduler with the design's
LRU-with-coexistence eviction loop. main.go switches to it.

Per-job flow:
1. ensureFits — compare cons.vram_resident_mib + 256 MiB cushion against
   the live nvidia-smi free reading. If insufficient, pick the LRU
   loaded consumer that is NOT in cons.can_coexist_with, does NOT manage
   its own VRAM (ollama is excluded from eviction by design; it runs
   its own LRU), and is NOT the target itself, then call its unload
   route. Wait 1s for the VRAM to actually free. Repeat up to 5 times.
2. ensureLoaded — if the target was previously unloaded, call its
   /api/admin/load (mvoice). Consumers without a load route are
   assumed to cold-start implicitly on first request.
3. inner.Run — global GPU lock + job execution.
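
The ensureFits step of the flow above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the fitter type and its freeMiB, pickVictim, unload and settle hooks are hypothetical stand-ins for the nvidia-smi reading, the LRU pick and the consumer's unload route. The 256 MiB cushion, the pause after an unload and the limit of 5 rounds come from the description; the sizes in main are only illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	cushionMiB = 256 // headroom on top of the target's resident size
	maxRounds  = 5   // upper bound on eviction attempts
)

// fitter bundles hypothetical hooks; the real scheduler's fields are not
// shown in the commit message.
type fitter struct {
	freeMiB    func() int            // stands in for the nvidia-smi free reading
	pickVictim func() (string, bool) // next LRU candidate; false if none is eligible
	unload     func(name string) error
	settle     time.Duration // pause after an unload (1s in the commit)
	evictions  int           // successful unload calls
}

// ensureFits evicts until needMiB plus the cushion fits, or gives up.
func (f *fitter) ensureFits(needMiB int) error {
	for round := 0; round < maxRounds; round++ {
		if f.freeMiB() >= needMiB+cushionMiB {
			return nil
		}
		victim, ok := f.pickVictim()
		if !ok {
			return errors.New("no evictable consumer left")
		}
		if err := f.unload(victim); err != nil {
			return fmt.Errorf("unload %s: %w", victim, err)
		}
		f.evictions++
		time.Sleep(f.settle)
	}
	if f.freeMiB() >= needMiB+cushionMiB {
		return nil
	}
	return errors.New("not enough free VRAM after eviction rounds")
}

func main() {
	free := 2000 // MiB free before the job, illustrative
	loaded := []string{"comfyui"}
	f := &fitter{
		freeMiB: func() int { return free },
		pickVictim: func() (string, bool) {
			if len(loaded) == 0 {
				return "", false
			}
			v := loaded[0]
			loaded = loaded[1:]
			return v, true
		},
		unload: func(string) error { free += 13312; return nil }, // ~13 GiB released
		settle: 0,
	}
	fmt.Println(f.ensureFits(2867), f.evictions)
}
```

Injecting the free-memory reading and the unload call as functions keeps the loop testable without a GPU; whether the real scheduler is structured this way is not stated in the commit.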

State:
- scheduler-local 'loaded' map + scheduler-local 'lastUsed' map. The
  registry's health-derived Loaded field is the source of truth for
  consumers that report it, but we need our own state for the seconds
  between an unload call and the next probe.
- Stats.Evictions counts successful unload calls and surfaces through
  /v1/status.

LRU pick order:
- Scheduler-local lastUsed (set on successful Run completion) takes
  precedence over registry.LastUsed (set on health probes) because the
  former reflects real GPU work, not health chatter. Consumers with a
  zero timestamp (never used) are evicted first.

Tests:
- Already-resident target: no eviction calls.
- 13 GiB comfyui evicted to fit 2.8 GiB mvoice → 1 unload + 1 load,
  Stats.Evictions = 1.
- Coexistent consumer (ollama, in mvoice.can_coexist_with) is never
  picked even if it's the LRU candidate; the non-coexistent comfyui
  is unloaded instead.

Race detector clean.

Refs: m/mGPUmanager#1 (Schritt 5).
2026-05-11 13:37:03 +02:00
mAi
c81c145163 feat: Schritt 2 — mGPUmanager MVP routing + /v1/status
Go daemon listening on :8770 that fronts mvoice (8766), whisper-server
(8178), ollama (11434), comfyui (8188) behind a single /v1 façade.

What this MVP does:
- Loads config/consumers.yaml: routing table, per-consumer URL + health +
  paths + vram_resident_mib + can_coexist_with + load/unload routes.
- Background health probe (5s) on every consumer; refuses fast with a
  structured 503 if the last probe failed (no Felix-Banholzer-style
  silent fallback).
- POST /v1/{tts,stt,llm,image} proxies the request body + Content-Type
  to the routed consumer's path and streams the response back.
- GET /audio/* proxies to audio_proxy consumer (wa.sh fetches its WAV
  this way).
- GET /v1/status exposes live GPU sample (nvidia-smi every 2s),
  per-consumer health/loaded/gpu_resident_mib/active/total_requests,
  scheduler stats.
- GET /healthz, GET / — broker liveness.

The Scheduler interface is in place but the implementation is
'Passthrough' — every job runs immediately, no lock, no queue. Schritt 4
replaces it with a serialising mutex; Schritt 5 adds VRAM-pressure
eviction. The interface boundary means server.go stays unchanged.

Out of scope here:
- Schritt 3: wa.sh migration (parallel work in mAi).
- Schritt 4: queue + global GPU lock.
- Schritt 5: nvidia-smi-driven LRU eviction.

Tests: config validation (good/bad), proxy forwards body, audio proxy
streams bytes, unhealthy consumer returns 503, /v1/status JSON shape.

Refs: m/mGPUmanager#1
2026-05-11 13:30:17 +02:00
m
b31b6f6580 Initial commit 2026-05-11 11:16:07 +00:00