mGPUmanager

8 Commits 2 Branches 0 Tags

Author	SHA1	Message	Date
mAi	d02c88b42a	Merge deploy-time fixes (systemd --user unit, initialLoaded heuristic)	2026-05-15 16:56:47 +02:00
mAi	468317e395	fix(scheduler): mark lazy consumers (Unload but no Load) as not-loaded at startup Live deploy on mRock surfaced a Step 5 bug: comfyui was always treated as preloaded at scheduler startup, which made ensureFits() short-circuit on the very first /v1/image request, exactly the scenario eviction is supposed to handle. mvoice was never picked as a victim, and ComfyUI then OOM'd loading FLUX on top of the still-resident mvoice. Fix: replace the blanket 'every consumer starts loaded' init with a heuristic, initialLoaded(cons): - VRAMManaged (ollama): true. We never track/evict it; the consumer runs its own LRU. - Load+Unload both present (mvoice): true. Designed to be controllable; typically preloads in its own lifespan. - Unload only, no Load (comfyui): false. Lazy: FLUX isn't resident until the first /prompt, so we shouldn't bill its 13 GiB against the GPU budget until then. - SystemdUnit only (whisper-server): true. Always-on, model loaded at process start. - Empty: true. Safe fallback. Verified live on mRock (2026-05-15): Before /v1/image: nvidia-smi 8963 MiB used; mvoice gpu_resident_mib 2345 POST /v1/image: HTTP 400 from upstream (empty workflow), broker did trigger eviction before forwarding After: nvidia-smi 6547 MiB used; mvoice gpu_resident_mib 9 (~CUDA context only); scheduler.evictions = 2 POST /v1/tts: audio_url returned, tts_ms 670, audio 3.5 s After reload: nvidia-smi 8943 MiB used; mvoice gpu_resident_mib 2917 Test: TestInitialLoadedHeuristic pins the four cases down so this doesn't regress when someone adds a fifth consumer type. Refs: m/mGPUmanager#1 (live deploy).	2026-05-15 16:54:11 +02:00
mAi	167999cecf	build: deploy as systemd --user unit on mRock Convention on mRock is user-units for ML services (whisper-server, mvoice-launcher, comfyui as of today). Switching mGPUmanager too: - systemd/mgpumanager.service: rewritten as a user unit (%h-based WorkingDirectory + ExecStart, WantedBy=default.target). Drops the ProtectSystem/ProtectHome hardening that came from the system-unit template — user units don't need it, and ProtectHome=read-only blocks a user unit's own working dir. - Makefile deploy target: rsync to ~/.config/systemd/user/ on the remote and use systemctl --user, no sudo. README documents the lingering prerequisite (loginctl enable-linger m). - config/consumers.yaml: bind on 0.0.0.0:8770 instead of localhost so mRiver / Tailscale peers can actually reach the broker. Refs: m/mGPUmanager#1 (deploy task).	2026-05-15 16:50:04 +02:00
mAi	cacba89edd	Merge Phase 1: Steps 0–5 (broker MVP, queue, LRU eviction)	2026-05-15 16:44:56 +02:00
mAi	ca9bb1773f	feat: Step 5 - VRAM-pressure eviction + coexistence groups scheduler.Evicting wraps the Locked scheduler with the design's LRU-with-coexistence eviction loop. main.go switches to it. Per-job flow: 1. ensureFits: compare cons.vram_resident_mib + 256 MiB cushion against the live nvidia-smi free reading. If insufficient, pick the LRU loaded consumer NOT in cons.can_coexist_with, NOT VRAM-managed (ollama is excluded from eviction by design, since it runs its own LRU), and NOT the target itself, then call its unload route. Wait 1s for VRAM to actually free. Repeat up to 5 times. 2. ensureLoaded: if the target was previously unloaded, call its /api/admin/load (mvoice). Consumers without a load route are assumed to cold-start implicitly on first request. 3. inner.Run: global GPU lock + job execution. State: - scheduler-local 'loaded' map + scheduler-local 'lastUsed' map. The registry's health-derived Loaded field is the source of truth for consumers that report it, but we need our own state for the seconds between an unload call and the next probe. - Stats.Evictions counts successful unload calls and surfaces through /v1/status. LRU pick order: - Scheduler-local lastUsed (set on successful Run completion) takes precedence over registry.LastUsed (set on health probes) because the former reflects real GPU work, not health chatter. Zero-time consumers (never used) lose first. Tests: - Already-resident target: no eviction calls. - 13 GiB comfyui evicted to fit 2.8 GiB mvoice → 1 unload + 1 load, Stats.Evictions = 1. - Coexistent consumer (ollama, in mvoice.can_coexist_with) is never picked even if it's the LRU candidate; the non-coexistent comfyui is unloaded instead. Race detector clean. Refs: m/mGPUmanager#1 (Step 5).	2026-05-11 13:37:03 +02:00
mAi	3b3d828e9e	feat: Step 4 - Locked scheduler (global GPU lock, queue, stats) Replaces the MVP Passthrough with scheduler.Locked: a capacity-1 channel serialises every consumer's GPU work end-to-end. main.go switches to it. Behavioural contract: - Jobs that arrive while another job holds the GPU block on the channel until the holder finishes. Context cancellation aborts the wait cleanly (no leaked tokens, queue depth decremented). - Stats track queue_depth, in_flight, total_jobs, last_wait_ms, last_run_ms, oldest_queued, all surfaced through /v1/status. - One lock for ALL consumers (not per-consumer): the design (§4.3) is explicit that coarse-grained > GPU-stream-granular on single-GPU single-user hardware. mvoice + ollama + comfyui never run truly concurrently any more, which is the whole point: concurrent runs are what produced the CUDA-OOM under load. Tests: - 5 goroutines hammer the scheduler concurrently → max in-flight = 1. - Cancellation while parked on the lock returns ctx.Err() and frees the queue slot. - Stats reflect in-flight + queue-depth transitions correctly. - Race detector clean. Step 5 will compose this with VRAM-pressure eviction: before acquiring the lock, check if the target consumer's resident cost fits under the current GPU headroom; if not, unload the LRU non-coexistent consumer first. Refs: m/mGPUmanager#1 (Step 4).	2026-05-11 13:33:39 +02:00
mAi	c81c145163	feat: Step 2 - mGPUmanager MVP routing + /v1/status Go daemon listening on :8770 that fronts mvoice (8766), whisper-server (8178), ollama (11434), comfyui (8188) behind a single /v1 façade. What this MVP does: - Loads config/consumers.yaml: routing table, per-consumer URL + health + paths + vram_resident_mib + can_coexist_with + load/unload routes. - Background health probe (5s) on every consumer; refuses fast with a structured 503 if the last probe failed (no Felix-Banholzer-style silent fallback). - POST /v1/{tts,stt,llm,image} proxies the request body + Content-Type to the routed consumer's path and streams the response back. - GET /audio/* proxies to audio_proxy consumer (wa.sh fetches its WAV this way). - GET /v1/status exposes live GPU sample (nvidia-smi every 2s), per-consumer health/loaded/gpu_resident_mib/active/total_requests, scheduler stats. - GET /healthz, GET /: broker liveness. The Scheduler interface is in place but the implementation is 'Passthrough': every job runs immediately, no lock, no queue. Step 4 replaces it with a serialising mutex; Step 5 adds VRAM-pressure eviction. The interface boundary means server.go stays unchanged. Out of scope here: - Step 3: wa.sh migration (parallel work in mAi). - Step 4: queue + global GPU lock. - Step 5: nvidia-smi-driven LRU eviction. Tests: config validation (good/bad), proxy forwards body, audio proxy streams bytes, unhealthy consumer returns 503, /v1/status JSON shape. Refs: m/mGPUmanager#1	2026-05-11 13:30:17 +02:00
m	b31b6f6580	Initial commit	2026-05-11 11:16:07 +00:00
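
The startup heuristic from commit 468317e395 can be sketched as a single function. Only the function name `initialLoaded` and the five cases are taken from the commit message; the `Consumer` struct, its field names, and the route and unit strings used in `main` are assumptions for illustration and may not match the real config type.

```go
package main

import "fmt"

// Consumer holds only the fields the heuristic inspects.
// Field names are assumed, modelled on the wording of the commit message.
type Consumer struct {
	Name        string
	VRAMManaged bool   // consumer runs its own LRU (ollama)
	LoadRoute   string // empty if the consumer has no load route
	UnloadRoute string // empty if the consumer has no unload route
	SystemdUnit string // empty if not managed via a systemd unit
}

// initialLoaded reports whether the scheduler should treat a consumer as
// already resident on the GPU when the scheduler starts.
func initialLoaded(c Consumer) bool {
	switch {
	case c.VRAMManaged:
		return true // never tracked or evicted
	case c.UnloadRoute != "" && c.LoadRoute == "":
		return false // lazy: model is not resident until the first request
	default:
		// Load+Unload, SystemdUnit only, and the empty case all start loaded.
		return true
	}
}

func main() {
	// Route and unit values below are placeholders, not the real config.
	consumers := []Consumer{
		{Name: "ollama", VRAMManaged: true},
		{Name: "mvoice", LoadRoute: "/load", UnloadRoute: "/unload"},
		{Name: "comfyui", UnloadRoute: "/unload"},
		{Name: "whisper-server", SystemdUnit: "whisper-server.service"},
		{Name: "empty"},
	}
	for _, c := range consumers {
		fmt.Printf("%-15s initialLoaded=%v\n", c.Name, initialLoaded(c))
	}
}
```

Only the "Unload but no Load" shape returns false, so a future consumer type with an unrecognised combination of fields falls through to the safe default of true.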