design(t-paliad-151): Paliadin Tailscale SSH route to mRiver

Inventor design for routing Paliadin from paliad.de's Dokploy container on mLake to mRiver via Tailscale + SSH, preserving m's Claude Code subscription instead of paying Anthropic API tokens.

Three sub-designs covering m's four locked decisions (2026-05-07 22:35):

- network_mode: host on paliad (m overrode the sidecar recommendation; Phase A explicitly tests traefik compatibility under host mode)
- server-side paliadin-shim with one RPC per turn (run-turn / reset / health / bootstrap), authorized_keys command= restriction, from=mlake
- env-var routing trigger (PALIADIN_REMOTE_HOST) + Paliadin interface split: LocalPaliadinService keeps the laptop PoC, RemotePaliadinService shells out to ssh m@mriver paliadin-shim
- ed25519 keypair via Dokploy secret PALIADIN_SSH_PRIVATE_KEY, written to a chmod 600 tmpfile at startup; pinned host key via PALIADIN_KNOWN_HOSTS

Verified live before designing: mRiver tmux+claude present, mLake Tailscale active and sees mRiver, paliad Dockerfile is alpine-minimal, no authorized_keys on mRiver yet. No assumptions left from CLAUDE.md.

Includes: friendly error code mriver_unreachable extending t-paliad-150, single-flight rate limit, security review (defence-in-depth via command=/from= restrictions), three-phase rollout (manual proof → Dockerfile bake → polish), file-level deliverables for the coder shift.

Inventor stops here; no code shipped. Awaiting m's go/no-go.

Refs m/paliad#12
2026-05-07 22:47:30 +02:00
parent 1061685981
commit befa41c00e
1 changed files with 592 additions and 0 deletions
--- a/docs/design-paliadin-tailscale-ssh-2026-05-07.md
+++ b/docs/design-paliadin-tailscale-ssh-2026-05-07.md
@@ -0,0 +1,592 @@
+# Paliadin: route prod via Tailscale SSH to mRiver
+
+**Issue:** m/paliad#12 — t-paliad-151
+**Date:** 2026-05-07
+**Author:** noether (inventor)
+**Supersedes nothing.** Extends `docs/design-paliadin-2026-05-07.md` (the Phase 0 PoC) with a third deployment path between "laptop-only PoC" and "Anthropic API direct".
+**Related:** t-paliad-146 (PoC ship), t-paliad-150 (`friendlyErrorMessage` pattern).
+
+---
+
+## 1. Goal
+
+Make Paliadin reachable from `paliad.de` (Dokploy on mLake) without losing m's Claude Code subscription, by routing each turn over Tailscale + SSH from the paliad container to mRiver, where the existing long-lived `tmux` + `claude` PoC keeps running.
+
+**Non-goals (v1):**
+
+- Multi-host failover.
+- Encryption beyond SSH-over-tailnet (already E2E-encrypted by Tailscale's WireGuard layer).
+- Anthropic API fallback when mRiver is offline — show a friendly error instead.
+- Wake-on-LAN of mRiver.
+- Multi-tenant or multi-firm variants.
+
+---
+
+## 2. Live state — what was verified before designing
+
+A design built on stale facts rots fast. These were probed on 2026-05-07, not assumed from CLAUDE.md or memory:
+
+| Fact | How verified | Result |
+|---|---|---|
+| mRiver = `100.99.98.203`, has tmux + claude | this worker runs on mRiver; `tmux -V` → `tmux 3.6a`; `which claude` → `/home/m/.local/bin/claude` | confirmed |
+| mLake (`100.99.98.201`) has Tailscale running | `ssh m@mlake tailscale status` | confirmed; mRiver visible as `active; direct [2a02:4780:41:3fbc::1]:41641` |
+| paliad container Dockerfile is alpine:3.21 minimal, no SSH, no tailscaled | `Dockerfile` | confirmed (only `ca-certificates`) |
+| paliad compose runs default Docker bridge (no `network_mode`) | `docker-compose.yml` | confirmed |
+| mRiver has no `~/.ssh/authorized_keys` yet | `ls ~/.ssh/` | confirmed — file must be created in Phase A |
+| `/tmp/paliadin/` does not exist on mRiver yet | `ls /tmp/paliadin` | confirmed — created on first turn (paliadin.go:185 `os.MkdirAll`) |
+| `paliad-paliadin` tmux session is not currently running on mRiver | `tmux ls` | not present; the existing PoC creates it on demand |
+
+**Implication for design:** the paliad container needs new infrastructure on three axes — network reachability of the tailnet, an SSH client + identity, and a service-layer code path that talks to a remote tmux instead of a local one. Each axis is its own sub-design below.
+
+---
+
+## 3. Locked decisions (m, 2026-05-07 22:35)
+
+m made four design-shaping calls via the inventor's `AskUserQuestion` pass. They are recorded here verbatim because every downstream choice in §4–§6 follows from them.
+
+| # | Question | m's choice |
+|---|---|---|
+| 1 | Container Tailscale shape | **`network_mode: host` on paliad** |
+| 2 | SSH-to-mRiver protocol granularity | **Server-side `paliadin-shim` (one RPC per turn)** |
+| 3 | Routing trigger | **Env var `PALIADIN_REMOTE_HOST` + interface split** |
+| 4 | SSH private key storage | **Dokploy secret env var `PALIADIN_SSH_PRIVATE_KEY`** |
+
+Decision (1) was *not* the inventor's recommendation — host mode has known interaction risk with traefik (§4.2). m is overriding the recommendation; this design accepts the call and codifies a Phase A test step that gates the rollout on traefik still working under host mode. If Phase A blows up, the fallback is to revisit (1) in a follow-up issue, not to silently swap to a sidecar.
+
+---
+
+## 4. Sub-design A — Container Tailscale shape
+
+### 4.1 Shape: `network_mode: host`
+
+paliad's container shares mLake's network namespace. `tailscale0` (mLake's tailnet interface) is directly visible from inside the container. Outbound `ssh m@100.99.98.203` reaches mRiver over the tailnet without any sidecar, userspace tailscaled, SOCKS proxy, or auth-key flow inside the container.
+
+```yaml
+# docker-compose.yml diff
+services:
+  web:
+    build: .
+    network_mode: host           # NEW
+    # remove: expose: ["8080"]   # host mode means port is on the host directly
+    environment:
+      - PORT=8080
+      ...
+      # NEW Paliadin remote-routing knobs
+      - PALIADIN_REMOTE_HOST=${PALIADIN_REMOTE_HOST}      # 100.99.98.203
+      - PALIADIN_REMOTE_USER=${PALIADIN_REMOTE_USER}      # m
+      - PALIADIN_SSH_PRIVATE_KEY=${PALIADIN_SSH_PRIVATE_KEY}
+      - PALIADIN_KNOWN_HOSTS=${PALIADIN_KNOWN_HOSTS}      # one-line ssh-keyscan output
+    restart: unless-stopped
+```
+
+### 4.2 Trade-off accepted: traefik routing under host mode
+
+paliad.de's TLS is provided by Dokploy's traefik on the `dokploy-network` overlay. With `network_mode: host`, paliad is no longer attached to that overlay. Two failure modes are possible:
+
+- **(M1)** traefik can't discover the service via Docker DNS → 502 at the edge.
+- **(M2)** traefik routes via host loopback (`http://127.0.0.1:8080` or `host.docker.internal`) and works fine.
+
+Recent Dokploy versions are understood to configure traefik with both `loadbalancer.server.url` and Docker labels, which would make (M2) the host-mode path; this has not yet been checked against the Dokploy docs (§11). **Phase A explicitly tests this** (§7) before any code is written; if (M1) materialises, the design rolls back to the sidecar variant of decision 1 in a follow-up issue.
+
+Other host-mode side-effects to flag in operations:
+
+- paliad listens on host port 8080 directly. Any other compose service binding 8080 conflicts.
+- paliad's outbound DNS uses host resolver (no Docker-internal `web` etc.). Currently fine: paliad's only network deps are external (Supabase, SMTP, GitHub raw). No service on `dokploy-network` is referenced by name.
+- The container can reach **every** Tailscale node, not just mRiver. Mitigations live in §5 (key restriction) and §5.2 (`from=` clause on mRiver authorized_keys).
+
+### 4.3 Dockerfile diff
+
+```dockerfile
+# Final stage adds the SSH client only. Tailscale is provided by the host.
+FROM alpine:3.21
+RUN apk add --no-cache ca-certificates openssh-client    # +openssh-client (~1MB)
+WORKDIR /app
+COPY --from=backend /paliad /app/paliad
+COPY --from=frontend /app/frontend/dist /app/dist
+EXPOSE 8080
+CMD ["/app/paliad"]
+```
+
+Image-size delta: alpine `openssh-client` is ~1.1 MB compressed — negligible. No tailscaled, no entrypoint script, no extra processes inside the container.
+
+### 4.4 What does NOT change
+
+- No Tailscale auth-key inside paliad. The container inherits the host's tailnet binding, so there is no per-container Tailscale identity to rotate. mLake's existing Tailscale auth is the only one in scope.
+- No tailscaled process inside the container.
+- No new sidecar container.
+
+---
+
+## 5. Sub-design B — SSH identity, restricted shim, host-key pinning
+
+### 5.1 Identity: dedicated ed25519 keypair `paliad-prod`
+
+One keypair, generated once on mRiver during Phase A, used by every paliad-prod deploy:
+
+```bash
+# On mRiver (Phase A bootstrap):
+ssh-keygen -t ed25519 -N "" -C "paliad-prod $(date +%Y-%m-%d)" -f /tmp/paliad-prod-key
+# Public key → mRiver authorized_keys (see 5.2)
+# Private key → Dokploy secret store as PALIADIN_SSH_PRIVATE_KEY
+shred -u /tmp/paliad-prod-key   # only the encrypted/secret-stored copies survive
+```
+
+Rotation: regenerate, push public key to mRiver authorized_keys, update Dokploy secret, redeploy. No code change needed — paliad's startup re-reads the env var on every boot.
+
+The private key is delivered to the container as a multi-line env var. At process start, paliad writes it to a tmpfile so OpenSSH can use it:
+
+```go
+// cmd/server/main.go (sketch)
+func loadPaliadinSSHKey() (string, error) {
+    blob := os.Getenv("PALIADIN_SSH_PRIVATE_KEY")
+    if blob == "" { return "", nil }    // remote mode disabled
+    f, err := os.CreateTemp("", "paliadin-id_ed25519-")
+    if err != nil { return "", err }
+    if err := os.Chmod(f.Name(), 0o600); err != nil { return "", err }
+    if _, err := f.WriteString(blob); err != nil { return "", err }
+    if err := f.Close(); err != nil { return "", err }
+    return f.Name(), nil    // path passed to RemotePaliadinService
+}
+```
+
+The tmpfile lives at `/tmp/paliadin-id_ed25519-<rand>` for the container's lifetime. On container restart, a fresh tmpfile is written. We never persist the key to a volume.
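
One practical wrinkle with delivering a key as a multi-line env var, stated here as an assumption rather than something probed in §2: secret UIs sometimes hand back CRLF line endings or drop the final newline, and OpenSSH has been reported to reject private key files that lack a trailing newline. A minimal sketch of a normalisation step that could run before the write in `loadPaliadinSSHKey`; `normalisePrivateKey` is a hypothetical helper, not part of the design above.

```go
package main

import (
	"fmt"
	"strings"
)

// normalisePrivateKey is a hypothetical helper (an assumption, not in the design).
// It converts CRLF line endings to LF and guarantees exactly one trailing
// newline, so the tmpfile resembles a key file written by ssh-keygen.
func normalisePrivateKey(blob string) string {
	s := strings.ReplaceAll(blob, "\r\n", "\n")
	s = strings.TrimRight(s, "\r\n \t")
	if s == "" {
		return "" // empty stays empty: remote mode disabled
	}
	return s + "\n"
}

func main() {
	raw := "-----BEGIN OPENSSH PRIVATE KEY-----\r\nAAAA\r\n-----END OPENSSH PRIVATE KEY-----"
	fmt.Printf("%q\n", normalisePrivateKey(raw))
}
```

If Phase A shows that Dokploy delivers the secret byte-for-byte, this step is unnecessary and can be dropped.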
+
+### 5.2 mRiver `authorized_keys` entry
+
+```
+command="/home/m/.local/bin/paliadin-shim",no-pty,no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-user-rc,from="100.99.98.201" ssh-ed25519 AAAA...PUBKEY... paliad-prod
+```
+
+Each restriction matters:
+
+- `command=` — every `ssh m@mriver …` invocation runs the shim regardless of what the client asked for. The client's requested command is exposed as `$SSH_ORIGINAL_COMMAND` for the shim to dispatch on.
+- `no-pty,no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-user-rc` — defence-in-depth: even if someone steals the key and bypasses the shim's argument validation, they can't get an interactive shell, can't tunnel ports, can't pivot via agent forwarding.
+- `from="100.99.98.201"` — only accept connections from mLake's tailnet IP. Defends against the "container has full tailnet visibility" host-mode side-effect from §4.2: if the key leaks off mLake, it can't be replayed from another tailnet host.
+
+### 5.3 Host-key pinning
+
+`StrictHostKeyChecking=accept-new` is too loose for a long-lived production identity (one-time MITM during first connect substitutes a different key forever). Instead:
+
+- During Phase A, run `ssh-keyscan -t ed25519 100.99.98.203` on mLake.
+- Capture the single output line.
+- Store as Dokploy secret `PALIADIN_KNOWN_HOSTS`.
+- At container startup, write to `/tmp/paliadin-known_hosts` chmod 644.
+- Pass to OpenSSH via `-o UserKnownHostsFile=/tmp/paliadin-known_hosts -o StrictHostKeyChecking=yes`.
+
+If mRiver's host key ever rotates (rare; only on disk wipe / fresh OS), SSH refuses to connect with a clear "host key changed" error, which surfaces as `mriver_unreachable` to the user. That is exactly the right blast radius: a loud failure, never a silent connect to a substitute host. Recovery is to rerun the Phase A key scan and update the `PALIADIN_KNOWN_HOSTS` secret.
+
+### 5.4 The shim — `paliadin-shim`
+
+A bash script on mRiver at `/home/m/.local/bin/paliadin-shim`. It is the **only** thing the paliad-prod key is allowed to invoke, and it dispatches on `$SSH_ORIGINAL_COMMAND`. Four RPCs (`health`, `bootstrap`, `run-turn`, `reset`):
+
+```bash
+#!/bin/bash
+# paliadin-shim — server-side RPC for paliad's remote-tmux turns.
+# Invoked via authorized_keys command= with $SSH_ORIGINAL_COMMAND set.
+set -euo pipefail
+umask 077
+
+readonly TMUX_SESSION="${PALIADIN_TMUX_SESSION:-paliad-paliadin}"
+readonly RESPONSE_DIR="${PALIADIN_RESPONSE_DIR:-/tmp/paliadin}"
+readonly TIMEOUT_S=60
+readonly TURN_ID_RE='^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
+
+mkdir -p "$RESPONSE_DIR"
+
+# Parse $SSH_ORIGINAL_COMMAND. Format: "<verb> <arg1> <arg2> …"
+read -r -a argv <<< "${SSH_ORIGINAL_COMMAND:-}"
+verb="${argv[0]:-}"
+
+ensure_pane() {
+  if ! tmux has-session -t "$TMUX_SESSION" 2>/dev/null; then
+    tmux new-session -d -s "$TMUX_SESSION"
+  fi
+  # Find or create the @paliadin-scope=chat window.
+  local target=""
+  while read -r idx; do
+    scope=$(tmux show-window-option -t "$TMUX_SESSION:$idx" -v @paliadin-scope 2>/dev/null || true)
+    if [[ "$scope" == "chat" ]]; then target="$TMUX_SESSION:$idx"; break; fi
+  done < <(tmux list-windows -t "$TMUX_SESSION" -F '#{window_index}')
+  if [[ -z "$target" ]]; then
+    idx=$(tmux new-window -t "$TMUX_SESSION" -n claude-paliadin -P -F '#{window_index}' claude)
+    target="$TMUX_SESSION:$idx"
+    # Wait for claude to settle (60s bound; matches Go waitForPaneReady).
+    for _ in $(seq 1 120); do
+      pane=$(tmux capture-pane -t "$target" -p 2>/dev/null || true)
+      if [[ "$pane" == *"❯"* || "$pane" == *"│"* ]]; then break; fi
+      sleep 0.5
+    done
+    tmux set-window-option -t "$target" @paliadin-scope chat
+    tmux set-window-option -t "$target" @fix-name claude-paliadin
+    # Bootstrap system prompt — reuses the Go service's prompt text.
+    # The Go side sends this via the `bootstrap` RPC on first turn instead
+    # of duplicating the prompt here. See §6.4.
+  fi
+  echo "$target"
+}
+
+case "$verb" in
+  health)
+    # Liveness check — used by paliad to short-circuit when mRiver is offline.
+    # Returns "ok" iff tmux + claude are reachable.
+    tmux has-session -t "$TMUX_SESSION" 2>/dev/null \
+      || tmux new-session -d -s "$TMUX_SESSION"
+    command -v claude >/dev/null && echo ok || { echo no-claude; exit 1; }
+    ;;
+
+  bootstrap)
+    # First-turn-only: ensure pane exists and inject the system prompt.
+    # argv[1] = base64-encoded prompt body (avoids quoting hell).
+    target=$(ensure_pane)
+    prompt=$(printf '%s' "${argv[1]:?missing prompt}" | base64 -d)
+    tmux send-keys -t "$target" -l -- "$prompt"
+    tmux send-keys -t "$target" Enter
+    sleep 2   # give claude a moment to absorb
+    echo ok
+    ;;
+
+  run-turn)
+    # argv[1] = turn_id (UUID); argv[2] = base64-encoded user message.
+    turn_id="${argv[1]:?missing turn_id}"
+    [[ "$turn_id" =~ $TURN_ID_RE ]] || { echo >&2 "bad turn_id"; exit 2; }
+    msg=$(printf '%s' "${argv[2]:?missing message}" | base64 -d)
+    target=$(ensure_pane)
+    out="$RESPONSE_DIR/$turn_id.txt"
+    rm -f "$out"
+    # Envelope matches what paliadin_prompt.go expects.
+    tmux send-keys -t "$target" -l -- "[PALIADIN:$turn_id] $msg"
+    tmux send-keys -t "$target" Enter
+    # Poll for the response file. Same shape as Go pollForResponse.
+    for _ in $(seq 1 $((TIMEOUT_S * 5))); do
+      if [[ -s "$out" ]]; then
+        sleep 0.05    # settle
+        cat "$out"
+        rm -f "$out"
+        exit 0
+      fi
+      sleep 0.2
+    done
+    echo >&2 "paliadin: response timeout after ${TIMEOUT_S}s"
+    exit 124
+    ;;
+
+  reset)
+    # /clear the conversation; next turn starts fresh.
+    target=$(ensure_pane)
+    tmux send-keys -t "$target" -l -- "/clear"
+    tmux send-keys -t "$target" Enter
+    echo ok
+    ;;
+
+  *)
+    echo >&2 "paliadin-shim: unknown verb '$verb'"
+    exit 2
+    ;;
+esac
+```
+
+Why a shim instead of raw tmux-over-SSH:
+
+- One SSH round-trip per turn (~50 ms over tailnet) vs ~10–20 round-trips for the granular pattern.
+- Argument validation lives in one place (UUID regex on turn_id, base64 for messages, fixed verb list) — easier to audit than a regex over `$SSH_ORIGINAL_COMMAND` matching `tmux send-keys …`.
+- mRiver-side concerns (response polling, settle delays, pane-readiness) stay on mRiver, which is where the tmux state lives. The Go service stops caring about local file polling at all.
+
+---
+
+## 6. Sub-design C — Service-layer integration, routing, reliability
+
+### 6.1 Interface split
+
+The current `*PaliadinService` becomes an interface with two implementations: `LocalPaliadinService` (the existing tmux code, renamed) and `RemotePaliadinService` (the new SSH code). Construction picks one at startup based on `PALIADIN_REMOTE_HOST`.
+
+```go
+// internal/services/paliadin.go (after refactor)
+
+type Paliadin interface {
+    RunTurn(ctx context.Context, req TurnRequest) (*TurnResult, error)
+    ResetSession(ctx context.Context) error
+    ListRecentTurns(ctx context.Context, callerID uuid.UUID, limit int) ([]PaliadinTurn, error)
+    Stats(ctx context.Context, callerID uuid.UUID) (*PaliadinStats, error)
+    IsOwner(ctx context.Context, userID uuid.UUID) (bool, error)
+}
+
+// LocalPaliadinService wraps the current tmux PoC (laptop / dev path).
+type LocalPaliadinService struct { /* identical to today's PaliadinService */ }
+
+// RemotePaliadinService talks to a paliadin-shim over SSH on mRiver.
+type RemotePaliadinService struct {
+    db          *sqlx.DB
+    users       *UserService
+    sshHost     string   // 100.99.98.203
+    sshUser     string   // m
+    sshKeyPath  string   // /tmp/paliadin-id_ed25519-<rand>
+    knownHosts  string   // /tmp/paliadin-known_hosts
+    turnMu      sync.Mutex
+    bootstrapped bool    // set after the first successful `bootstrap` RPC (§6.4)
+
+    // Health-check cache.
+    healthMu      sync.Mutex
+    healthOK      bool
+    healthCheckedAt time.Time
+}
+```
+
+DB access (`ListRecentTurns`, `Stats`, `IsOwner`) is identical for both — they only read `paliad.paliadin_turns`. They live in a shared `paliadinDB` helper struct embedded in both implementations.
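
The embedding idea can be illustrated with a stripped-down sketch. Everything here is a simplified stand-in: `turnStore` replaces the real `Paliadin` interface, a slice replaces the `*sqlx.DB` handle, and the method names are invented for the example.

```go
package main

import "fmt"

// turnStore is a simplified stand-in for the Paliadin interface.
type turnStore interface {
	RecentCount() int
	Mode() string
}

// paliadinDB stands in for the shared helper; a slice replaces the DB handle.
type paliadinDB struct{ turns []string }

func (d *paliadinDB) RecentCount() int { return len(d.turns) }

// Both services embed *paliadinDB, so RecentCount is promoted to each of them.
type localService struct{ *paliadinDB }

func (localService) Mode() string { return "local" }

type remoteService struct{ *paliadinDB }

func (remoteService) Mode() string { return "remote" }

func main() {
	db := &paliadinDB{turns: []string{"t1", "t2"}}
	for _, s := range []turnStore{localService{db}, remoteService{db}} {
		fmt.Println(s.Mode(), s.RecentCount())
	}
}
```

Only the transport-specific methods need two implementations; the promoted methods are written once.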
+
+### 6.2 Wiring at startup
+
+```go
+// cmd/server/main.go (excerpt)
+var paliadin services.Paliadin
+remoteHost := os.Getenv("PALIADIN_REMOTE_HOST")
+switch {
+case remoteHost != "":
+    keyPath, err := loadPaliadinSSHKey()
+    if err != nil { log.Fatalf("paliadin: load ssh key: %v", err) }
+    if keyPath == "" { log.Fatalf("paliadin: PALIADIN_REMOTE_HOST set but no PALIADIN_SSH_PRIVATE_KEY") }
+    knownHosts, err := loadPaliadinKnownHosts()
+    if err != nil { log.Fatalf("paliadin: load known_hosts: %v", err) }
+    paliadin = services.NewRemotePaliadinService(db, userSvc, services.RemotePaliadinConfig{
+        SSHHost: remoteHost,
+        SSHUser: cmpOr(os.Getenv("PALIADIN_REMOTE_USER"), "m"),
+        SSHKeyPath: keyPath,
+        KnownHostsPath: knownHosts,
+    })
+    log.Printf("paliadin: remote mode → ssh %s@%s", cmpOr(os.Getenv("PALIADIN_REMOTE_USER"), "m"), remoteHost)
+case localTmuxAvailable():
+    paliadin = services.NewLocalPaliadinService(db, userSvc, "", "")
+    log.Printf("paliadin: local tmux mode")
+default:
+    paliadin = services.NewDisabledPaliadinService(db, userSvc)
+    log.Printf("paliadin: disabled (no remote host, no local tmux)")
+}
+```
+
+`NewDisabledPaliadinService` exists today implicitly via the `ErrTmuxUnavailable` path; making it explicit gives the constructor a clear name and the handler doesn't have to special-case `nil`.
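
The three-way choice in the switch above can be factored into a pure function, which makes the startup decision unit-testable without touching env vars or tmux. This is a sketch under assumptions: `pickMode` and the `mode` type are hypothetical names, and the error case mirrors the `log.Fatalf` for a remote host without a key.

```go
package main

import (
	"errors"
	"fmt"
)

type mode string

const (
	modeRemote   mode = "remote"
	modeLocal    mode = "local"
	modeDisabled mode = "disabled"
)

// pickMode (hypothetical) mirrors the startup switch: remote wins when a host
// is configured, local tmux is the fallback, otherwise Paliadin is disabled.
func pickMode(remoteHost, sshKey string, localTmux bool) (mode, error) {
	switch {
	case remoteHost != "" && sshKey == "":
		return modeDisabled, errors.New("paliadin: PALIADIN_REMOTE_HOST set but no PALIADIN_SSH_PRIVATE_KEY")
	case remoteHost != "":
		return modeRemote, nil
	case localTmux:
		return modeLocal, nil
	default:
		return modeDisabled, nil
	}
}

func main() {
	m, err := pickMode("100.99.98.203", "key-material", true)
	fmt.Println(m, err)
	m, err = pickMode("", "", false)
	fmt.Println(m, err)
}
```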
+
+### 6.3 SSH invocation pattern
+
+`RemotePaliadinService` runs every RPC through the same helper:
+
+```go
+func (s *RemotePaliadinService) callShim(ctx context.Context, args ...string) ([]byte, error) {
+    sshArgs := []string{
+        "-i", s.sshKeyPath,
+        "-o", "UserKnownHostsFile=" + s.knownHosts,
+        "-o", "StrictHostKeyChecking=yes",
+        "-o", "BatchMode=yes",
+        "-o", "ConnectTimeout=3",
+        "-o", "ServerAliveInterval=10",
+        "-o", "ServerAliveCountMax=3",
+        s.sshUser + "@" + s.sshHost,
+        "--",
+    }
+    sshArgs = append(sshArgs, args...)
+    c, cancel := context.WithTimeout(ctx, 70*time.Second)   // shim has its own 60s; +10s for SSH overhead
+    defer cancel()
+    cmd := exec.CommandContext(c, "ssh", sshArgs...)
+    var stdout, stderr bytes.Buffer
+    cmd.Stdout = &stdout; cmd.Stderr = &stderr
+    if err := cmd.Run(); err != nil {
+        return nil, fmt.Errorf("paliadin: ssh shim %v: %w (stderr: %s)", args, err, stderr.String())
+    }
+    return stdout.Bytes(), nil
+}
+```
+
+`RunTurn` becomes:
+
+```go
+func (s *RemotePaliadinService) RunTurn(ctx context.Context, req TurnRequest) (*TurnResult, error) {
+    s.turnMu.Lock()
+    defer s.turnMu.Unlock()
+
+    if err := s.healthGate(ctx); err != nil {
+        return nil, err   // ErrMRiverUnreachable, picked up by handler
+    }
+
+    turnID := uuid.New()
+    started := time.Now().UTC()
+    if err := s.insertTurnRow(ctx, …); err != nil { return nil, err }
+
+    // First-turn-only: bootstrap the system prompt on mRiver. Tracked by the
+    // per-process bootstrapped flag described in §6.4.
+    if err := s.ensureBootstrapped(ctx); err != nil {
+        _ = s.markTurnError(ctx, turnID, "bootstrap_failed")
+        return nil, err
+    }
+
+    msg := sanitiseForTmux(req.UserMessage)
+    msgB64 := base64.StdEncoding.EncodeToString([]byte(msg))
+    body, err := s.callShim(ctx, "run-turn", turnID.String(), msgB64)
+    if err != nil {
+        _ = s.markTurnError(ctx, turnID, classifySSHError(err))
+        return nil, err
+    }
+
+    // Same trailer-parse + audit-row writes as Local, factored into shared helper.
+    return s.completeTurnFromBody(ctx, turnID, started, string(body))
+}
+```
+
+### 6.4 System prompt bootstrap
+
+The local PoC calls `paliadinSystemPrompt(s.responseDir)` once when it creates the pane. The remote path needs the same hook. Two options that don't require duplicating the German prompt body to mRiver:
+
+- **Lazy bootstrap (chosen):** the first `RunTurn` after a paliad-prod restart sends the system prompt via `bootstrap` RPC, then runs the actual turn. Subsequent turns skip the bootstrap. State is per-process: `RemotePaliadinService.bootstrapped` boolean guarded by mutex.
+- Eager bootstrap at startup is rejected — it forces every container start to wait for mRiver to be online, which couples paliad's boot to mRiver's availability.
+
+Lazy bootstrap means the very first turn after a paliad redeploy pays a ~3 s extra cost (claude pane spin-up + system prompt absorb). Acceptable for a single-user PoC.
+
+### 6.5 Health-check gating (`mriver_unreachable`)
+
+Every `RunTurn` first calls `healthGate(ctx)`:
+
+- Cached for 10 s. If last check was <10 s ago and was OK, skip the probe.
+- Otherwise: `s.callShim(ctx, "health")` with a 3 s timeout. On success, set cache OK; on failure, return `ErrMRiverUnreachable`.
+
+Why 10 s: short enough that "I just woke my laptop" propagates inside one user retry; long enough that a busy chat doesn't probe on every turn.
+
+```go
+var ErrMRiverUnreachable = errors.New("paliadin: mriver unreachable")
+
+func (s *RemotePaliadinService) healthGate(ctx context.Context) error {
+    s.healthMu.Lock()
+    defer s.healthMu.Unlock()
+    if s.healthOK && time.Since(s.healthCheckedAt) < 10*time.Second {
+        return nil
+    }
+    c, cancel := context.WithTimeout(ctx, 3*time.Second)
+    defer cancel()
+    out, err := s.callShim(c, "health")
+    s.healthCheckedAt = time.Now()
+    if err != nil || strings.TrimSpace(string(out)) != "ok" {
+        s.healthOK = false
+        return fmt.Errorf("%w: %v", ErrMRiverUnreachable, err)
+    }
+    s.healthOK = true
+    return nil
+}
+```
+
+### 6.6 Friendly error code (extends t-paliad-150)
+
+`friendlyErrorMessage` already maps `tmux_unavailable` to a localised message. We add one new code:
+
+- `mriver_unreachable` → DE: *"mRiver ist offline — Paliadin nicht erreichbar. Mach mRiver an, oder nutze Paliadin lokal mit `./paliad`."* / EN: *"mRiver is offline — Paliadin can't reach it. Wake mRiver, or run Paliadin locally with `./paliad`."*
+
+Implementation: one new `case` in the SSE-error switch in `frontend/src/client/paliadin.ts`'s `friendlyErrorMessage`, plus matching i18n keys (`paliadin.error.mriver_unreachable.de` / `.en`). Server-side: `paliadin` HTTP handler maps `errors.Is(err, services.ErrMRiverUnreachable)` to `event: error\ndata: {"code":"mriver_unreachable","message":"..."}\n\n`.
+
+### 6.7 Rate limit
+
+A runaway loop on the paliad side could DoS the SSH connection. Cheapest cap: enforce one in-flight turn at a time via `turnMu` (already exists in the local PoC). On top of that, a rolling cap of N=20 turns/min in `RemotePaliadinService` rejects with `ErrRateLimited` (mapped to a friendly `paliadin.error.rate_limited`). PoC has one user (m); the cap is a paranoid safety, not a real throttle.
+
+### 6.8 What about ControlMaster?
+
+Decision-2's chosen path (server-side shim with one RPC per turn) makes ControlMaster optional. The shim collapses ~10 raw-tmux ops into a single SSH connect — that's already the latency win ControlMaster would buy.
+
+Adding it on top would save ~30–50 ms per turn but adds:
+
+- A persistent `~/.ssh/cm-*` socket inside the container.
+- Cleanup logic on shutdown.
+- A subtle interaction with the SSH BatchMode + ConnectTimeout settings.
+
+Verdict: skip ControlMaster in v1. If turn latency over Tailscale is measured >300 ms in practice and hot enough to matter, add it in a follow-up; the call site is one helper.
+
+---
+
+## 7. Phasing
+
+### Phase A — manual proof-of-concept (no Dockerfile change yet)
+
+Goal: validate the round-trip end-to-end on a deployed paliad, before touching the image.
+
+Steps:
+
+1. **Generate keypair** on mRiver: `ssh-keygen -t ed25519 -N "" -C "paliad-prod" -f /tmp/paliad-prod-key`.
+2. **Install shim** at `/home/m/.local/bin/paliadin-shim` (script from §5.4), `chmod 755`.
+3. **Write authorized_keys** with the public key + restrictions from §5.2.
+4. **Capture mRiver host key**: `ssh-keyscan -t ed25519 100.99.98.203 > /tmp/paliad-known_hosts` from mLake.
+5. **Confirm host networking trade-off (§4.2):** flip the running paliad-prod compose to `network_mode: host` on a temporary branch; redeploy via Dokploy; verify `https://paliad.de/` still serves correctly via traefik. **Gate**: if traefik 502s, abort Phase A and revisit decision 1 in a follow-up issue.
+6. **Smoke-test SSH from inside the container**:
+   ```
+   docker exec -it paliad-prod sh
+   apk add --no-cache openssh-client       # one-shot, before Dockerfile change
+   ssh -i /tmp/key -o UserKnownHostsFile=/tmp/known_hosts m@100.99.98.203 health
+   # expected: "ok"
+   ssh … run-turn "$(cat /proc/sys/kernel/random/uuid)" "$(printf 'Hallo Paliadin' | base64 -w0)"   # uuidgen may be missing on a minimal alpine image
+   # expected: response body, then ".../uuid.txt" cleaned up
+   ```
+7. **Wire env vars manually** via Dokploy UI for one deploy; confirm `/paliadin` works end-to-end against mRiver.
+
+If Phase A passes: codify into Phase B. If it fails on (5), the design rolls back to a sidecar in a new issue (decision 1 follow-up). If it fails elsewhere, fix the shim or the SSH config; the architecture is fine.
+
+### Phase B — bake into Dockerfile + Dokploy secrets
+
+1. Dockerfile: add `openssh-client` to the final stage (§4.3).
+2. compose: add `network_mode: host` and the four new env vars (§4.1).
+3. Dokploy secrets: register `PALIADIN_REMOTE_HOST=100.99.98.203`, `PALIADIN_REMOTE_USER=m`, `PALIADIN_SSH_PRIVATE_KEY=...`, `PALIADIN_KNOWN_HOSTS=...`.
+4. Code: refactor `PaliadinService` to the interface split (§6.1–§6.2). New file `internal/services/paliadin_remote.go`. Tests: `paliadin_remote_test.go` mocks `callShim` to verify `RunTurn` audit-row writes, error mapping, and `healthGate` caching.
+5. Ship under one PR; tag t-paliad-151 done.
+
+### Phase C — friendly errors + monitoring
+
+1. `paliadin.error.mriver_unreachable` i18n keys + `friendlyErrorMessage` case (§6.6).
+2. `/admin/paliadin` shows last health-probe result + last successful turn timestamp.
+3. Optional: `mai-mesh` integration to surface mRiver-offline events to m on Telegram (out-of-band; not gating).
+
+---
+
+## 8. Security review summary
+
+| Risk | Mitigation |
+|---|---|
+| Stolen private key → arbitrary SSH on mRiver | `command=` shim restriction + `from="100.99.98.201"` + ed25519 key + private key only in Dokploy secret store (encrypted at rest) |
+| Stolen private key → tailnet-wide SSH from non-mLake host | `from="100.99.98.201"` clause |
+| Container compromise → key extraction | Key written to tmpfile chmod 600, only root inside container can read; alpine container has no shell-on-error trampolines |
+| Host-key MITM during connect | Pinned `known_hosts`; `StrictHostKeyChecking=yes` |
+| Shim argument injection (e.g. via `run-turn $(rm -rf /)`) | Shim parses positional args from `$SSH_ORIGINAL_COMMAND` via `read -r -a`; never passes args to a subshell `eval`; turn_id validated by UUID regex; message body always base64-decoded into a single shell variable, never re-evaluated |
+| Runaway loop → SSH flood | Single-flight `turnMu` + 20/min rolling cap |
+| `network_mode: host` widens blast radius | The `command=` + `from=` restrictions on mRiver mean container compromise = "can run shim verbs against mRiver only", not "shell on mRiver" |
+| PaliadinOwnerEmail bypass | Unchanged from PoC: gate is in Go (`/paliadin` 404s for any other user). Even if mRiver SSH key leaks, attacker still needs paliad session as `m@hoganlovells.com`. |
+
+---
+
+## 9. Out-of-scope clarifications (for review)
+
+These were called out in the issue but the design intentionally does not solve them, to keep v1 tight. Each is acknowledged so review knows it wasn't an oversight:
+
+- **Wake-on-LAN of mRiver:** out of scope. v1's UX when mRiver is asleep is the friendly error from §6.6. Future work: integrate with `mai-mesh` capability fallback.
+- **Multi-host failover:** out of scope. Only mRiver is targeted.
+- **Anthropic API fallback when mRiver offline:** out of scope per CLAUDE.md (`ANTHROPIC_API_KEY` reserved for production-v1, unused in PoC).
+- **ControlMaster:** v1 ships without; revisit if turn latency >300 ms in practice (§6.8).
+
+---
+
+## 10. File-level deliverables (for the coder shift)
+
+When this design is approved and the coder shift starts, the work splits roughly into:
+
+- `Dockerfile` — `+openssh-client`.
+- `docker-compose.yml` — `network_mode: host`, four new env entries.
+- `internal/services/paliadin.go` — extract `Paliadin` interface; rename existing to `LocalPaliadinService`; pull DB-only methods (`ListRecentTurns`, `Stats`, `IsOwner`) into a shared embedded `paliadinDB` so both implementations get them for free.
+- `internal/services/paliadin_remote.go` — new file: `RemotePaliadinService`, `RemotePaliadinConfig`, `callShim`, `healthGate`, `ensureBootstrapped`, `classifySSHError`, `ErrMRiverUnreachable`.
+- `internal/services/paliadin_remote_test.go` — unit tests with a mocked `callShim`.
+- `cmd/server/main.go` — env-var-based wiring (§6.2), `loadPaliadinSSHKey`, `loadPaliadinKnownHosts`.
+- `frontend/src/client/paliadin.ts` — one `case` in `friendlyErrorMessage` for `mriver_unreachable`.
+- `frontend/src/i18n.ts` — two new keys (`paliadin.error.mriver_unreachable.de` / `.en`).
+- `scripts/paliadin-shim` — server-side script (§5.4); copied to mRiver during Phase A, not part of any container.
+- `docs/project-status.md` — note Phase 0.5 (PoC) → Phase 0.6 (Tailscale-SSH prod route).
+
+No DB migrations needed — `paliad.paliadin_turns` schema already covers everything (`error_code` field already accepts free-form codes including `mriver_unreachable`).
+
+---
+
+## 11. Open questions for review
+
+- **Q (m):** Phase A test step 5 expects traefik to keep working under host-mode. If a quick search confirms Dokploy's traefik can't route to host-network services without manual `loadbalancer.server.url` config, we should know before Phase A. Worth a 5-minute Dokploy doc check before merging Phase B.
+- **Q (m):** Should the `paliadin-shim` script live in this repo (`scripts/paliadin-shim`) and be version-pinned, or is it a one-off that lives only on mRiver? Repo location lets us audit changes; mRiver-only keeps deploy footprint smaller.
+- **Q (m):** `ANTHROPIC_API_KEY` env var reservation in compose comments — keep the comment line for production-v1, or strip it now that this design supersedes that path for the foreseeable future?
+
+---
+
+**Inventor stops here.** No code shipped this shift. Awaiting m's go/no-go on the design before the coder shift starts.