feat: Schritt 4 — Locked scheduler (global GPU lock, queue, stats)

Replaces the MVP Passthrough with scheduler.Locked: a capacity-1 channel
serialises every consumer's GPU work end-to-end. main.go switches to it.

Behavioural contract:
- Jobs that arrive while another job holds the GPU block on the channel
  until the holder finishes. Context cancellation aborts the wait
  cleanly (no leaked tokens, queue depth decremented).
- Stats track queue_depth, in_flight, total_jobs, last_wait_ms,
  last_run_ms, oldest_queued — surfaced through /v1/status.
- One lock for ALL consumers (not per-consumer): the design (§4.3) is
  explicit that coarse-grained locking beats GPU-stream-granular
  locking on single-GPU, single-user hardware. mvoice, ollama and
  comfyui never run truly concurrently any more, which is the whole
  point: concurrent runs are what produced the CUDA OOM under load.

Tests:
- 5 goroutines hammer the scheduler concurrently → max in-flight = 1.
- Cancellation while parked on the lock returns ctx.Err() and frees
  the queue slot.
- Stats reflect in-flight + queue-depth transitions correctly.
- Race detector clean.
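
The first test's idea can be sketched like this. It is an illustration
only, not the test file from this commit: peakInFlight is a hypothetical
helper that contends on a bare capacity-1 channel instead of the real
scheduler.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peakInFlight starts `workers` goroutines that all contend for one
// capacity-1 channel and returns the highest number of jobs that were
// ever inside the critical section at the same time.
func peakInFlight(workers int) int64 {
	slot := make(chan struct{}, 1)
	var cur, peak atomic.Int64
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			slot <- struct{}{} // acquire
			n := cur.Add(1)
			// Record a new peak if this is the highest count seen so far.
			for {
				p := peak.Load()
				if n <= p || peak.CompareAndSwap(p, n) {
					break
				}
			}
			time.Sleep(time.Millisecond) // simulated GPU work
			cur.Add(-1)
			<-slot // release
		}()
	}
	wg.Wait()
	return peak.Load()
}

func main() {
	fmt.Println("peak in-flight:", peakInFlight(5))
}
```

Running a program like this with `go run -race` exercises the same
property the race-detector bullet refers to.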

Schritt 5 will compose this with VRAM-pressure eviction: before
acquiring the lock, check if the target consumer's resident cost fits
under the current GPU headroom; if not, unload the LRU non-coexistent
consumer first.

Refs: m/mGPUmanager#1 (Schritt 4).
mAi
2026-05-11 13:33:39 +02:00
parent c81c145163
commit 3b3d828e9e
7 changed files with 315 additions and 13 deletions

@@ -61,7 +61,9 @@ func main() {
 	reg := registry.New(cfg, logger.With("component", "registry"))
 	gpuPoller := gpu.NewPoller(cfg.GPU.PollInterval(), logger.With("component", "gpu"))
-	sched := scheduler.NewPassthrough(reg)
+	// Phase 1 always runs a single-slot global GPU lock. Schritt 5's
+	// eviction-aware scheduler wraps this same lock with VRAM pressure logic.
+	sched := scheduler.NewLocked(reg, 1)
 	go reg.Run(ctx)
 	go gpuPoller.Run(ctx)