feat: Schritt 4 — Locked scheduler (global GPU lock, queue, stats)

Replaces the MVP Passthrough with scheduler.Locked: a capacity-1 channel serialises every consumer's GPU work end-to-end. main.go switches to it.

Behavioural contract:
- Jobs that arrive while another job holds the GPU block on the channel until the holder finishes. Context cancellation aborts the wait cleanly (no leaked tokens, queue depth decremented).
- Stats track queue_depth, in_flight, total_jobs, last_wait_ms, last_run_ms and oldest_queued, and are surfaced through /v1/status.
- One lock for ALL consumers (not per-consumer): the design (§4.3) is explicit that coarse-grained locking beats GPU-stream-granular locking on single-GPU, single-user hardware. mvoice, ollama and comfyui no longer run truly concurrently, which is the whole point: concurrent execution is what produced the CUDA OOM under load.

Tests:
- 5 goroutines hammer the scheduler concurrently → max in-flight = 1.
- Cancellation while parked on the lock returns ctx.Err() and frees the queue slot.
- Stats reflect in-flight and queue-depth transitions correctly.
- Race detector clean.

Schritt 5 will compose this with VRAM-pressure eviction: before acquiring the lock, check whether the target consumer's resident cost fits under the current GPU headroom; if not, unload the LRU non-coexistent consumer first.

Refs: m/mGPUmanager#1 (Schritt 4).
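
For readers without the full diff, the contract described above (capacity-1 channel as the lock, cancellable wait, queue-depth bookkeeping) can be sketched roughly as follows. This is an illustrative reconstruction, not the code in this commit: the type `Locked`, its fields, `Do`, and the helpers `MaxInFlight` and `CancelWhileParked` are all hypothetical names, and the real scheduler.NewLocked takes a registry and a slot count that are omitted here. Timing stats such as last_wait_ms are left out.

```go
package main

import (
	"context"
	"sync"
	"sync/atomic"
	"time"
)

// Locked is a hypothetical stand-in for scheduler.Locked: one global slot
// shared by all consumers, plus a few counters.
type Locked struct {
	slot       chan struct{} // capacity 1: holding a token means holding the GPU
	queueDepth atomic.Int64  // callers currently waiting for the slot
	inFlight   atomic.Int64  // jobs currently running (0 or 1)
	totalJobs  atomic.Int64  // jobs that acquired the slot
}

func NewLocked() *Locked {
	return &Locked{slot: make(chan struct{}, 1)}
}

// Do waits for the slot, runs job, then releases the slot. If ctx is
// cancelled while waiting, it returns ctx.Err() without ever holding a token.
func (l *Locked) Do(ctx context.Context, job func()) error {
	l.queueDepth.Add(1)
	select {
	case l.slot <- struct{}{}: // acquired the single slot
		l.queueDepth.Add(-1)
	case <-ctx.Done(): // gave up while parked: free the queue slot
		l.queueDepth.Add(-1)
		return ctx.Err()
	}
	defer func() { <-l.slot }() // release the slot when the job ends

	l.totalJobs.Add(1)
	l.inFlight.Add(1)
	defer l.inFlight.Add(-1)
	job()
	return nil
}

// MaxInFlight runs `workers` concurrent jobs and reports the highest
// in-flight count any job observed while it was running.
func MaxInFlight(workers int) int64 {
	l := NewLocked()
	var wg sync.WaitGroup
	var peak atomic.Int64
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = l.Do(context.Background(), func() {
				cur := l.inFlight.Load()
				for { // record the maximum with a compare-and-swap loop
					p := peak.Load()
					if cur <= p || peak.CompareAndSwap(p, cur) {
						break
					}
				}
				time.Sleep(time.Millisecond) // simulate GPU work
			})
		}()
	}
	wg.Wait()
	return peak.Load()
}

// CancelWhileParked parks a second caller behind a job that holds the slot,
// lets its context expire, and returns the queue depth seen afterwards
// together with the error from the cancelled call.
func CancelWhileParked() (int64, error) {
	l := NewLocked()
	started, release, done := make(chan struct{}), make(chan struct{}), make(chan struct{})
	go func() {
		defer close(done)
		_ = l.Do(context.Background(), func() { close(started); <-release })
	}()
	<-started // the holder now owns the slot

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	err := l.Do(ctx, func() {}) // blocks on the slot until ctx expires
	depth := l.queueDepth.Load()

	close(release) // let the holder finish
	<-done
	return depth, err
}
```

In this sketch the buffered channel doubles as both the lock and the wait queue, which is why a cancelled waiter has nothing to clean up beyond its own counter: it never sent a token, so none can leak.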
2026-05-11 13:33:39 +02:00
parent c81c145163
commit 3b3d828e9e
7 changed files with 315 additions and 13 deletions
--- a/cmd/mgpumanager/main.go
+++ b/cmd/mgpumanager/main.go
@@ -61,7 +61,9 @@ func main() {

 	reg := registry.New(cfg, logger.With("component", "registry"))
 	gpuPoller := gpu.NewPoller(cfg.GPU.PollInterval(), logger.With("component", "gpu"))
-	sched := scheduler.NewPassthrough(reg)
+	// Phase 1 always runs a single-slot global GPU lock. Schritt 5's
+	// eviction-aware scheduler wraps this same lock with VRAM pressure logic.
+	sched := scheduler.NewLocked(reg, 1)

 	go reg.Run(ctx)
 	go gpuPoller.Run(ctx)