mGPUmanager/config/consumers.yaml
mAi c81c145163 feat: Schritt 2 — mGPUmanager MVP routing + /v1/status
Go daemon listening on :8770 that fronts mvoice (8766), whisper-server
(8178), ollama (11434), comfyui (8188) behind a single /v1 façade.

What this MVP does:
- Loads config/consumers.yaml: routing table, per-consumer URL + health +
  paths + vram_resident_mib + can_coexist_with + load/unload routes.
- Background health probe (5s) on every consumer; refuses fast with a
  structured 503 if the last probe failed (no Felix-Banholzer-style
  silent fallback).
- POST /v1/{tts,stt,llm,image} proxies the request body + Content-Type
  to the routed consumer's path and streams the response back.
- GET /audio/* proxies to the audio_proxy consumer (wa.sh fetches its
  WAV this way).
- GET /v1/status exposes a live GPU sample (nvidia-smi every 2s),
  per-consumer health/loaded/gpu_resident_mib/active/total_requests,
  and scheduler stats.
- GET /healthz, GET / — broker liveness.

The Scheduler interface is in place but the implementation is
'Passthrough' — every job runs immediately, no lock, no queue. Schritt 4
replaces it with a serialising mutex; Schritt 5 adds VRAM-pressure
eviction. The interface boundary means server.go stays unchanged.

Out of scope here:
- Schritt 3: wa.sh migration (parallel work in mAi).
- Schritt 4: queue + global GPU lock.
- Schritt 5: nvidia-smi-driven LRU eviction.

Tests: config validation (good/bad), proxy forwards body, audio proxy
streams bytes, unhealthy consumer returns 503, /v1/status JSON shape.

Refs: m/mGPUmanager#1
2026-05-11 13:30:17 +02:00


listen: 127.0.0.1:8770

gpu:
  total_mib: 16376 # RTX 4070 Ti SUPER
  reserved_mib: 1024 # headroom for system/desktop
  poll_interval_seconds: 2

routing:
  tts: mvoice
  stt: mvoice # whisper-server is alternative if explicitly requested
  llm: ollama
  image: comfyui

# Audio download proxy: GET /audio/* forwards to this consumer.
audio_proxy: mvoice

consumers:
  mvoice:
    url: http://localhost:8766
    health:
      method: GET
      path: /api/health
    paths:
      tts:
        method: POST
        path: /api/synthesize
      stt:
        method: POST
        path: /api/transcribe
    vram_resident_mib: 2800
    load:
      method: POST
      path: /api/admin/load
    unload:
      method: POST
      path: /api/admin/unload
    can_coexist_with: [whisper-server, ollama]
    priority: 3
    max_concurrency: 1

  whisper-server:
    url: http://localhost:8178
    health:
      method: GET
      path: /
    paths:
      stt:
        method: POST
        path: /inference
    vram_resident_mib: 2050
    # No HTTP unload; mGPUmanager evicts via systemd restart (Schritt 5).
    systemd_unit: whisper-server.service
    can_coexist_with: [mvoice, ollama]
    priority: 2
    max_concurrency: 1

  ollama:
    url: http://localhost:11434
    health:
      method: GET
      path: /api/tags
    paths:
      llm:
        method: POST
        path: /api/generate
    # Ollama runs its own LRU keep_alive; we don't track resident VRAM.
    vram_managed: true
    can_coexist_with: [mvoice, whisper-server]
    priority: 2
    max_concurrency: 1

  comfyui:
    url: http://localhost:8188
    health:
      method: GET
      path: /system_stats
    paths:
      image:
        method: POST
        path: /prompt
    vram_resident_mib: 13000
    unload:
      method: POST
      path: /api/free
      body: '{"unload_models":true,"free_memory":true}'
    can_coexist_with: []
    priority: 1
    max_concurrency: 1
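
A rough sketch of how the vram_resident_mib and can_coexist_with fields above might be combined into an admission check. This is an illustration only: the MVP does not evict or gate on VRAM, and the rules used here (budget = total_mib minus reserved_mib, coexistence must be allowed in both directions, ollama counted as 0 MiB because it is vram_managed) are assumptions. The names consumer, allows, and canLoad are hypothetical.

```go
package main

import "fmt"

// consumer mirrors only the config fields used in this sketch.
type consumer struct {
	vramResidentMiB int
	canCoexistWith  []string
}

// Values copied from the YAML above; ollama is vram_managed, so its
// resident VRAM is not tracked and is counted as 0 here (an assumption).
var consumers = map[string]consumer{
	"mvoice":         {vramResidentMiB: 2800, canCoexistWith: []string{"whisper-server", "ollama"}},
	"whisper-server": {vramResidentMiB: 2050, canCoexistWith: []string{"mvoice", "ollama"}},
	"ollama":         {vramResidentMiB: 0, canCoexistWith: []string{"mvoice", "whisper-server"}},
	"comfyui":        {vramResidentMiB: 13000, canCoexistWith: nil},
}

// budgetMiB is total_mib minus reserved_mib: 16376 - 1024 = 15352.
const budgetMiB = 16376 - 1024

// allows reports whether consumer a lists b in can_coexist_with.
func allows(a, b string) bool {
	for _, n := range consumers[a].canCoexistWith {
		if n == b {
			return true
		}
	}
	return false
}

// canLoad reports whether candidate may be loaded next to the resident set:
// every pair must permit coexistence and the summed VRAM must fit the budget.
func canLoad(candidate string, resident []string) bool {
	need := consumers[candidate].vramResidentMiB
	for _, r := range resident {
		if !allows(candidate, r) || !allows(r, candidate) {
			return false
		}
		need += consumers[r].vramResidentMiB
	}
	return need <= budgetMiB
}

func main() {
	fmt.Println("budget MiB:", budgetMiB)
	fmt.Println("whisper-server next to mvoice:", canLoad("whisper-server", []string{"mvoice"}))
	fmt.Println("comfyui next to mvoice:", canLoad("comfyui", []string{"mvoice"}))
	fmt.Println("comfyui alone:", canLoad("comfyui", nil))
}
```

With these numbers, mvoice plus whisper-server needs 4850 MiB and fits, while comfyui at 13000 MiB fits only when nothing else is resident, which matches its empty can_coexist_with list.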