design(t-paliad-151): Paliadin Tailscale SSH route to mRiver

Inventor design for routing Paliadin from paliad.de's Dokploy container
on mLake to mRiver via Tailscale + SSH, preserving m's Claude Code
subscription instead of paying Anthropic API tokens.

Three sub-designs covering m's four locked decisions (2026-05-07 22:35):
- network_mode: host on paliad (m overrode the sidecar recommendation;
  Phase A explicitly tests traefik compatibility under host mode)
- server-side paliadin-shim with one RPC per turn (run-turn / reset /
  health / bootstrap), authorized_keys command= restriction, from=mlake
- env-var routing trigger (PALIADIN_REMOTE_HOST) + Paliadin interface
  split: LocalPaliadinService keeps the laptop PoC, RemotePaliadinService
  shells out to ssh m@mriver paliadin-shim
- ed25519 keypair via Dokploy secret PALIADIN_SSH_PRIVATE_KEY, written
  to a chmod 600 tmpfile at startup; pinned host key via
  PALIADIN_KNOWN_HOSTS

Verified live before designing: mRiver tmux+claude present, mLake
Tailscale active and sees mRiver, paliad Dockerfile is alpine-minimal,
no authorized_keys on mRiver yet. No assumptions left from CLAUDE.md.

Includes: friendly error code mriver_unreachable extending t-paliad-150,
single-flight rate limit, security review (defence-in-depth via
command=/from= restrictions), three-phase rollout (manual proof →
Dockerfile bake → polish), file-level deliverables for the coder shift.

Inventor stops here — no code shipped. Awaiting m's go/no-go.

Refs m/paliad#12
m
2026-05-07 22:47:30 +02:00
parent 1061685981
commit befa41c00e


@@ -0,0 +1,592 @@
# Paliadin: route prod via Tailscale SSH to mRiver
**Issue:** m/paliad#12 — t-paliad-151
**Date:** 2026-05-07
**Author:** noether (inventor)
**Supersedes nothing.** Extends `docs/design-paliadin-2026-05-07.md` (the Phase 0 PoC) with a third deployment path between "laptop-only PoC" and "Anthropic API direct".
**Related:** t-paliad-146 (PoC ship), t-paliad-150 (`friendlyErrorMessage` pattern).
---
## 1. Goal
Make Paliadin reachable from `paliad.de` (Dokploy on mLake) without losing m's Claude Code subscription, by routing each turn over Tailscale + SSH from the paliad container to mRiver, where the existing long-lived `tmux` + `claude` PoC keeps running.
**Non-goals (v1):**
- Multi-host failover.
- Encryption beyond SSH-over-tailnet (already E2E-encrypted by Tailscale's WireGuard layer).
- Anthropic API fallback when mRiver is offline — show a friendly error instead.
- Wake-on-LAN of mRiver.
- Multi-tenant or multi-firm variants.
---
## 2. Live state — what was verified before designing
A design built on stale facts rots fast. These were probed on 2026-05-07, not assumed from CLAUDE.md or memory:
| Fact | How verified | Result |
|---|---|---|
| mRiver = `100.99.98.203`, has tmux + claude | this worker runs on mRiver; `tmux -V` → `tmux 3.6a`; `which claude` → `/home/m/.local/bin/claude` | confirmed |
| mLake (`100.99.98.201`) has Tailscale running | `ssh m@mlake tailscale status` | confirmed; mRiver visible as `active; direct [2a02:4780:41:3fbc::1]:41641` |
| paliad container Dockerfile is alpine:3.21 minimal, no SSH, no tailscaled | `Dockerfile` | confirmed (only `ca-certificates`) |
| paliad compose runs default Docker bridge (no `network_mode`) | `docker-compose.yml` | confirmed |
| mRiver has no `~/.ssh/authorized_keys` yet | `ls ~/.ssh/` | confirmed — file must be created in Phase A |
| `/tmp/paliadin/` does not exist on mRiver yet | `ls /tmp/paliadin` | confirmed — created on first turn (paliadin.go:185 `os.MkdirAll`) |
| `paliad-paliadin` tmux session is not currently running on mRiver | `tmux ls` | not present; the existing PoC creates it on demand |
**Implication for design:** the paliad container needs new infrastructure on three axes — network reachability of the tailnet, an SSH client + identity, and a service-layer code path that talks to a remote tmux instead of a local one. Each axis is its own sub-design below.
---
## 3. Locked decisions (m, 2026-05-07 22:35)
m made four design-shaping calls via the inventor's `AskUserQuestion` pass. They are recorded here verbatim because every downstream choice in §4–§6 follows from them.
| # | Question | m's choice |
|---|---|---|
| 1 | Container Tailscale shape | **`network_mode: host` on paliad** |
| 2 | SSH-to-mRiver protocol granularity | **Server-side `paliadin-shim` (one RPC per turn)** |
| 3 | Routing trigger | **Env var `PALIADIN_REMOTE_HOST` + interface split** |
| 4 | SSH private key storage | **Dokploy secret env var `PALIADIN_SSH_PRIVATE_KEY`** |
Decision (1) was *not* the inventor's recommendation — host mode has known interaction risk with traefik (§4.2). m is overriding the recommendation; this design accepts the call and codifies a Phase A test step that gates the rollout on traefik still working under host mode. If Phase A blows up, the fallback is to revisit (1) in a follow-up issue, not to silently swap to a sidecar.
---
## 4. Sub-design A — Container Tailscale shape
### 4.1 Shape: `network_mode: host`
paliad's container shares mLake's network namespace. `tailscale0` (mLake's tailnet interface) is directly visible from inside the container. Outbound `ssh m@100.99.98.203` reaches mRiver over the tailnet without any sidecar, userspace tailscaled, SOCKS proxy, or auth-key flow inside the container.
```yaml
# docker-compose.yml diff
services:
  web:
    build: .
    network_mode: host                # NEW
    # remove: expose: ["8080"]        # host mode means port is on the host directly
    environment:
      - PORT=8080
      # ...
      # NEW Paliadin remote-routing knobs
      - PALIADIN_REMOTE_HOST=${PALIADIN_REMOTE_HOST}          # 100.99.98.203
      - PALIADIN_REMOTE_USER=${PALIADIN_REMOTE_USER}          # m
      - PALIADIN_SSH_PRIVATE_KEY=${PALIADIN_SSH_PRIVATE_KEY}
      - PALIADIN_KNOWN_HOSTS=${PALIADIN_KNOWN_HOSTS}          # one-line ssh-keyscan output
    restart: unless-stopped
```
### 4.2 Trade-off accepted: traefik routing under host mode
paliad.de's TLS is provided by Dokploy's traefik on the `dokploy-network` overlay. With `network_mode: host`, paliad is no longer attached to that overlay. Two failure modes are possible:
- **(M1)** traefik can't discover the service via Docker DNS → 502 at the edge.
- **(M2)** traefik routes via host loopback (`http://127.0.0.1:8080` or `host.docker.internal`) and works fine.
Recent Dokploy versions configure traefik with both `loadbalancer.server.url` and Docker labels; (M2) is the documented host-mode path. **Phase A explicitly tests this** (§7) before any code is written; if (M1) materialises, the design rolls back to the sidecar variant of decision 1 in a follow-up issue.
Other host-mode side-effects to flag in operations:
- paliad listens on host port 8080 directly. Any other compose service binding 8080 conflicts.
- paliad's outbound DNS uses host resolver (no Docker-internal `web` etc.). Currently fine: paliad's only network deps are external (Supabase, SMTP, GitHub raw). No service on `dokploy-network` is referenced by name.
- The container can reach **every** Tailscale node, not just mRiver. Mitigations live in §5 (key restriction) and §5.2 (`from=` clause on mRiver authorized_keys).
### 4.3 Dockerfile diff
```dockerfile
# Final stage adds the SSH client only. Tailscale is provided by the host.
FROM alpine:3.21
RUN apk add --no-cache ca-certificates openssh-client # +openssh-client (~1MB)
WORKDIR /app
COPY --from=backend /paliad /app/paliad
COPY --from=frontend /app/frontend/dist /app/dist
EXPOSE 8080
CMD ["/app/paliad"]
```
Image-size delta: alpine `openssh-client` is ~1.1 MB compressed — negligible. No tailscaled, no entrypoint script, no extra processes inside the container.
### 4.4 What does NOT change
- No Tailscale auth-key inside paliad. The container inherits the host's tailnet binding, so there is no per-container Tailscale identity to rotate. mLake's existing Tailscale auth is the only one in scope.
- No tailscaled process inside the container.
- No new sidecar container.
---
## 5. Sub-design B — SSH identity, restricted shim, host-key pinning
### 5.1 Identity: dedicated ed25519 keypair `paliad-prod`
One keypair, generated once on mRiver during Phase A, used by every paliad-prod deploy:
```bash
# On mRiver (Phase A bootstrap):
ssh-keygen -t ed25519 -N "" -C "paliad-prod $(date +%Y-%m-%d)" -f /tmp/paliad-prod-key
# Public key → mRiver authorized_keys (see 5.2)
# Private key → Dokploy secret store as PALIADIN_SSH_PRIVATE_KEY
shred -u /tmp/paliad-prod-key # only the encrypted/secret-stored copies survive
```
Rotation: regenerate, push public key to mRiver authorized_keys, update Dokploy secret, redeploy. No code change needed — paliad's startup re-reads the env var on every boot.
The private key is delivered to the container as a multi-line env var. At process start, paliad writes it to a tmpfile so OpenSSH can use it:
```go
// cmd/server/main.go (sketch)
func loadPaliadinSSHKey() (string, error) {
	blob := os.Getenv("PALIADIN_SSH_PRIVATE_KEY")
	if blob == "" {
		return "", nil // remote mode disabled
	}
	// os.CreateTemp creates the file with mode 0600, so no separate chmod is needed.
	f, err := os.CreateTemp("", "paliadin-id_ed25519-")
	if err != nil {
		return "", err
	}
	_, werr := f.WriteString(blob)
	cerr := f.Close()
	if err := errors.Join(werr, cerr); err != nil {
		os.Remove(f.Name()) // do not leave a partial key file behind
		return "", err
	}
	return f.Name(), nil // path passed to RemotePaliadinService
}
```
The tmpfile lives at `/tmp/paliadin-id_ed25519-<rand>` for the container's lifetime. On container restart, a fresh tmpfile is written. We never persist the key to a volume.
### 5.2 mRiver `authorized_keys` entry
```
command="/home/m/.local/bin/paliadin-shim",no-pty,no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-user-rc,from="100.99.98.201" ssh-ed25519 AAAA...PUBKEY... paliad-prod
```
Each restriction matters:
- `command=` — every `ssh m@mriver …` invocation runs the shim regardless of what the client asked for. The client's requested command is exposed as `$SSH_ORIGINAL_COMMAND` for the shim to dispatch on.
- `no-pty,no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-user-rc` — defence-in-depth: even if someone steals the key and bypasses the shim's argument validation, they can't get an interactive shell, can't tunnel ports, can't pivot via agent forwarding.
- `from="100.99.98.201"` — only accept connections from mLake's tailnet IP. Defends against the "container has full tailnet visibility" host-mode side-effect from §4.2: if the key leaks off mLake, it can't be replayed from another tailnet host.
### 5.3 Host-key pinning
`StrictHostKeyChecking=accept-new` is too loose for a long-lived production identity (one-time MITM during first connect substitutes a different key forever). Instead:
- During Phase A, run `ssh-keyscan -t ed25519 100.99.98.203` on mLake.
- Capture the single output line.
- Store as Dokploy secret `PALIADIN_KNOWN_HOSTS`.
- At container startup, write to `/tmp/paliadin-known_hosts` chmod 644.
- Pass to OpenSSH via `-o UserKnownHostsFile=/tmp/paliadin-known_hosts -o StrictHostKeyChecking=yes`.
If mRiver's host key ever rotates (rare; only on disk wipe / fresh OS), Phase A runs again and the secret is updated. SSH refuses to connect with a clear "host key changed" error, which surfaces as `mriver_unreachable` to the user — exactly the right blast-radius (loud failure, no silent connect to a substitute host).
### 5.4 The shim — `paliadin-shim`
A bash script on mRiver at `/home/m/.local/bin/paliadin-shim`. It is the **only** thing the paliad-prod key is allowed to invoke, and it dispatches on `$SSH_ORIGINAL_COMMAND`. Four RPCs (`health`, `bootstrap`, `run-turn`, `reset`):
```bash
#!/bin/bash
# paliadin-shim — server-side RPC for paliad's remote-tmux turns.
# Invoked via authorized_keys command= with $SSH_ORIGINAL_COMMAND set.
set -euo pipefail
umask 077
readonly TMUX_SESSION="${PALIADIN_TMUX_SESSION:-paliad-paliadin}"
readonly RESPONSE_DIR="${PALIADIN_RESPONSE_DIR:-/tmp/paliadin}"
readonly TIMEOUT_S=60
readonly TURN_ID_RE='^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
mkdir -p "$RESPONSE_DIR"
# Parse $SSH_ORIGINAL_COMMAND. Format: "<verb> <arg1> <arg2> …"
read -r -a argv <<< "${SSH_ORIGINAL_COMMAND:-}"
verb="${argv[0]:-}"
ensure_pane() {
if ! tmux has-session -t "$TMUX_SESSION" 2>/dev/null; then
tmux new-session -d -s "$TMUX_SESSION"
fi
# Find or create the @paliadin-scope=chat window.
local target=""
while read -r idx; do
scope=$(tmux show-window-option -t "$TMUX_SESSION:$idx" -v @paliadin-scope 2>/dev/null || true)
if [[ "$scope" == "chat" ]]; then target="$TMUX_SESSION:$idx"; break; fi
done < <(tmux list-windows -t "$TMUX_SESSION" -F '#{window_index}')
if [[ -z "$target" ]]; then
idx=$(tmux new-window -t "$TMUX_SESSION" -n claude-paliadin -P -F '#{window_index}' claude)
target="$TMUX_SESSION:$idx"
# Wait for claude to settle (60s bound; matches Go waitForPaneReady).
for _ in $(seq 1 120); do
pane=$(tmux capture-pane -t "$target" -p 2>/dev/null || true)
if [[ "$pane" == *""* || "$pane" == *"│"* ]]; then break; fi
sleep 0.5
done
tmux set-window-option -t "$target" @paliadin-scope chat
tmux set-window-option -t "$target" @fix-name claude-paliadin
# Bootstrap system prompt — reuses the Go service's prompt text.
# The Go side sends this via the `bootstrap` RPC on first turn instead
# of duplicating the prompt here. See §6.4.
fi
echo "$target"
}
case "$verb" in
health)
# Liveness check — used by paliad to short-circuit when mRiver is offline.
# Returns "ok" iff tmux + claude are reachable.
tmux has-session -t "$TMUX_SESSION" 2>/dev/null \
|| tmux new-session -d -s "$TMUX_SESSION"
command -v claude >/dev/null && echo ok || { echo no-claude; exit 1; }
;;
bootstrap)
# First-turn-only: ensure pane exists and inject the system prompt.
# argv[1] = base64-encoded prompt body (avoids quoting hell).
target=$(ensure_pane)
prompt=$(printf '%s' "${argv[1]:?missing prompt}" | base64 -d)
tmux send-keys -t "$target" -l -- "$prompt"
tmux send-keys -t "$target" Enter
sleep 2 # give claude a moment to absorb
echo ok
;;
run-turn)
# argv[1] = turn_id (UUID); argv[2] = base64-encoded user message.
turn_id="${argv[1]:?missing turn_id}"
[[ "$turn_id" =~ $TURN_ID_RE ]] || { echo >&2 "bad turn_id"; exit 2; }
msg=$(printf '%s' "${argv[2]:?missing message}" | base64 -d)
target=$(ensure_pane)
out="$RESPONSE_DIR/$turn_id.txt"
rm -f "$out"
# Envelope matches what paliadin_prompt.go expects.
tmux send-keys -t "$target" -l -- "[PALIADIN:$turn_id] $msg"
tmux send-keys -t "$target" Enter
# Poll for the response file. Same shape as Go pollForResponse.
for _ in $(seq 1 $((TIMEOUT_S * 5))); do
if [[ -s "$out" ]]; then
sleep 0.05 # settle
cat "$out"
rm -f "$out"
exit 0
fi
sleep 0.2
done
echo >&2 "paliadin: response timeout after ${TIMEOUT_S}s"
exit 124
;;
reset)
# /clear the conversation; next turn starts fresh.
target=$(ensure_pane)
tmux send-keys -t "$target" -l -- "/clear"
tmux send-keys -t "$target" Enter
echo ok
;;
*)
echo >&2 "paliadin-shim: unknown verb '$verb'"
exit 2
;;
esac
```
Why a shim instead of raw tmux-over-SSH:
- One SSH round-trip per turn (~50 ms over tailnet) vs ~10–20 round-trips for the granular pattern.
- Argument validation lives in one place (UUID regex on turn_id, base64 for messages, fixed verb list) — easier to audit than a regex over `$SSH_ORIGINAL_COMMAND` matching `tmux send-keys …`.
- mRiver-side concerns (response polling, settle delays, pane-readiness) stay on mRiver, which is where the tmux state lives. The Go service stops caring about local file polling at all.
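
The argument contract above can also be enforced on the Go side before the SSH call is made. A minimal sketch, assuming a hypothetical helper `runTurnArgs` that mirrors the shim's `TURN_ID_RE` and base64 convention:

```go
package main

import (
	"encoding/base64"
	"fmt"
	"regexp"
)

// turnIDRE mirrors TURN_ID_RE in the shim.
var turnIDRE = regexp.MustCompile(`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// runTurnArgs builds the argument list for the run-turn verb.
// Hypothetical helper; the design passes these values to callShim directly.
func runTurnArgs(turnID, userMessage string) ([]string, error) {
	if !turnIDRE.MatchString(turnID) {
		return nil, fmt.Errorf("paliadin: bad turn_id %q", turnID)
	}
	// Standard base64 emits only A-Z, a-z, 0-9, '+', '/' and '=', so the
	// message travels as one whitespace-free word for the shim's `read -r -a`.
	msgB64 := base64.StdEncoding.EncodeToString([]byte(userMessage))
	return []string{"run-turn", turnID, msgB64}, nil
}
```

Validating on both ends is redundant by design: the shim stays the security boundary, while the client-side check catches programming errors before they cost an SSH round-trip.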
---
## 6. Sub-design C — Service-layer integration, routing, reliability
### 6.1 Interface split
The current `*PaliadinService` becomes an interface with two implementations: `LocalPaliadinService` (the existing tmux code, renamed) and `RemotePaliadinService` (the new SSH code). Construction picks one at startup based on `PALIADIN_REMOTE_HOST`.
```go
// internal/services/paliadin.go (after refactor)
type Paliadin interface {
RunTurn(ctx context.Context, req TurnRequest) (*TurnResult, error)
ResetSession(ctx context.Context) error
ListRecentTurns(ctx context.Context, callerID uuid.UUID, limit int) ([]PaliadinTurn, error)
Stats(ctx context.Context, callerID uuid.UUID) (*PaliadinStats, error)
IsOwner(ctx context.Context, userID uuid.UUID) (bool, error)
}
// LocalPaliadinService wraps the current tmux PoC (laptop / dev path).
type LocalPaliadinService struct { /* identical to today's PaliadinService */ }
// RemotePaliadinService talks to a paliadin-shim over SSH on mRiver.
type RemotePaliadinService struct {
db *sqlx.DB
users *UserService
sshHost string // 100.99.98.203
sshUser string // m
sshKeyPath string // /tmp/paliadin-id_ed25519-<rand>
knownHosts string // /tmp/paliadin-known_hosts
turnMu sync.Mutex
// Health-check cache.
healthMu sync.Mutex
healthOK bool
healthCheckedAt time.Time
}
```
DB access (`ListRecentTurns`, `Stats`, `IsOwner`) is identical for both — they only read `paliad.paliadin_turns`. They live in a shared `paliadinDB` helper struct embedded in both implementations.
### 6.2 Wiring at startup
```go
// cmd/server/main.go (excerpt)
var paliadin services.Paliadin
remoteHost := os.Getenv("PALIADIN_REMOTE_HOST")
switch {
case remoteHost != "":
keyPath, err := loadPaliadinSSHKey()
if err != nil { log.Fatalf("paliadin: load ssh key: %v", err) }
if keyPath == "" { log.Fatalf("paliadin: PALIADIN_REMOTE_HOST set but no PALIADIN_SSH_PRIVATE_KEY") }
knownHosts, err := loadPaliadinKnownHosts()
if err != nil { log.Fatalf("paliadin: load known_hosts: %v", err) }
paliadin = services.NewRemotePaliadinService(db, userSvc, services.RemotePaliadinConfig{
SSHHost: remoteHost,
SSHUser: cmpOr(os.Getenv("PALIADIN_REMOTE_USER"), "m"),
SSHKeyPath: keyPath,
KnownHostsPath: knownHosts,
})
log.Printf("paliadin: remote mode → ssh %s@%s", cmpOr(os.Getenv("PALIADIN_REMOTE_USER"), "m"), remoteHost)
case localTmuxAvailable():
paliadin = services.NewLocalPaliadinService(db, userSvc, "", "")
log.Printf("paliadin: local tmux mode")
default:
paliadin = services.NewDisabledPaliadinService(db, userSvc)
log.Printf("paliadin: disabled (no remote host, no local tmux)")
}
```
`NewDisabledPaliadinService` exists today implicitly via the `ErrTmuxUnavailable` path; making it explicit gives the constructor a clear name and the handler doesn't have to special-case `nil`.
### 6.3 SSH invocation pattern
`RemotePaliadinService` runs every RPC through the same helper:
```go
func (s *RemotePaliadinService) callShim(ctx context.Context, args ...string) ([]byte, error) {
sshArgs := []string{
"-i", s.sshKeyPath,
"-o", "UserKnownHostsFile=" + s.knownHosts, // field name as declared in §6.1
"-o", "StrictHostKeyChecking=yes",
"-o", "BatchMode=yes",
"-o", "ConnectTimeout=3",
"-o", "ServerAliveInterval=10",
"-o", "ServerAliveCountMax=3",
s.sshUser + "@" + s.sshHost,
"--",
}
sshArgs = append(sshArgs, args...)
c, cancel := context.WithTimeout(ctx, 70*time.Second) // shim has its own 60s; +10s for SSH overhead
defer cancel()
cmd := exec.CommandContext(c, "ssh", sshArgs...)
var stdout, stderr bytes.Buffer
cmd.Stdout = &stdout; cmd.Stderr = &stderr
if err := cmd.Run(); err != nil {
return nil, fmt.Errorf("paliadin: ssh shim %v: %w (stderr: %s)", args, err, stderr.String())
}
return stdout.Bytes(), nil
}
```
`RunTurn` becomes:
```go
func (s *RemotePaliadinService) RunTurn(ctx context.Context, req TurnRequest) (*TurnResult, error) {
s.turnMu.Lock()
defer s.turnMu.Unlock()
if err := s.healthGate(ctx); err != nil {
return nil, err // ErrMRiverUnreachable, picked up by handler
}
turnID := uuid.New()
started := time.Now().UTC()
if err := s.insertTurnRow(ctx, ); err != nil { return nil, err }
// First-turn-only: bootstrap the system prompt on mRiver. Tracked by the
// per-process bootstrapped flag, so it reruns after a paliad restart (§6.4).
if err := s.ensureBootstrapped(ctx); err != nil {
_ = s.markTurnError(ctx, turnID, "bootstrap_failed")
return nil, err
}
msg := sanitiseForTmux(req.UserMessage)
msgB64 := base64.StdEncoding.EncodeToString([]byte(msg))
body, err := s.callShim(ctx, "run-turn", turnID.String(), msgB64)
if err != nil {
_ = s.markTurnError(ctx, turnID, classifySSHError(err))
return nil, err
}
// Same trailer-parse + audit-row writes as Local, factored into shared helper.
return s.completeTurnFromBody(ctx, turnID, started, string(body))
}
```
### 6.4 System prompt bootstrap
The local PoC calls `paliadinSystemPrompt(s.responseDir)` once when it creates the pane. The remote path needs the same hook. Two options that don't require duplicating the German prompt body to mRiver:
- **Lazy bootstrap (chosen):** the first `RunTurn` after a paliad-prod restart sends the system prompt via `bootstrap` RPC, then runs the actual turn. Subsequent turns skip the bootstrap. State is per-process: `RemotePaliadinService.bootstrapped` boolean guarded by mutex.
- Eager bootstrap at startup is rejected — it forces every container start to wait for mRiver to be online, which couples paliad's boot to mRiver's availability.
Lazy bootstrap means the very first turn after a paliad redeploy pays a ~3 s extra cost (claude pane spin-up + system prompt absorb). Acceptable for a single-user PoC.
### 6.5 Health-check gating (`mriver_unreachable`)
Every `RunTurn` first calls `healthGate(ctx)`:
- Cached for 10 s. If last check was <10 s ago and was OK, skip the probe.
- Otherwise: `s.callShim(ctx, "health")` with a 3 s timeout. On success, set cache OK; on failure, return `ErrMRiverUnreachable`.
Why 10 s: short enough that "I just woke my laptop" propagates inside one user retry; long enough that a busy chat doesn't probe on every turn.
```go
var ErrMRiverUnreachable = errors.New("paliadin: mriver unreachable")
func (s *RemotePaliadinService) healthGate(ctx context.Context) error {
s.healthMu.Lock()
defer s.healthMu.Unlock()
if s.healthOK && time.Since(s.healthCheckedAt) < 10*time.Second {
return nil
}
c, cancel := context.WithTimeout(ctx, 3*time.Second)
defer cancel()
out, err := s.callShim(c, "health")
s.healthCheckedAt = time.Now()
if err == nil && strings.TrimSpace(string(out)) != "ok" {
err = fmt.Errorf("unexpected health reply %q", strings.TrimSpace(string(out)))
}
if err != nil {
s.healthOK = false
return fmt.Errorf("%w: %v", ErrMRiverUnreachable, err)
}
s.healthOK = true
return nil
}
```
### 6.6 Friendly error code (extends t-paliad-150)
`friendlyErrorMessage` already maps `tmux_unavailable` to a localised message. We add one new code:
- `mriver_unreachable` DE: *"mRiver ist offline — Paliadin nicht erreichbar. Mach mRiver an, oder nutze Paliadin lokal mit `./paliad`."* / EN: *"mRiver is offline — Paliadin can't reach it. Wake mRiver, or run Paliadin locally with `./paliad`."*
Implementation: one new `case` in the SSE-error switch in `frontend/src/client/paliadin.ts`'s `friendlyErrorMessage`, plus matching i18n keys (`paliadin.error.mriver_unreachable.de` / `.en`). Server-side: `paliadin` HTTP handler maps `errors.Is(err, services.ErrMRiverUnreachable)` to `event: error\ndata: {"code":"mriver_unreachable","message":"..."}\n\n`.
### 6.7 Rate limit
A runaway loop on the paliad side could DoS the SSH connection. Cheapest cap: enforce one in-flight turn at a time via `turnMu` (already exists in the local PoC). On top of that, a rolling cap of N=20 turns/min in `RemotePaliadinService` rejects with `ErrRateLimited` (mapped to a friendly `paliadin.error.rate_limited`). The PoC has one user (m); the cap is a paranoid safety net, not a real throttle.
### 6.8 What about ControlMaster?
Decision 2's chosen path (server-side shim with one RPC per turn) makes ControlMaster optional. The shim collapses ~10 raw-tmux ops into a single SSH connect, which is already the latency win ControlMaster would buy.
Adding it on top would save ~30–50 ms per turn but adds:
- A persistent `~/.ssh/cm-*` socket inside the container.
- Cleanup logic on shutdown.
- A subtle interaction with the SSH BatchMode + ConnectTimeout settings.
Verdict: skip ControlMaster in v1. If turn latency over Tailscale is measured >300 ms in practice and hot enough to matter, add it in a follow-up; the call site is one helper.
---
## 7. Phasing
### Phase A — manual proof-of-concept (no Dockerfile change yet)
Goal: validate the round-trip end-to-end on a deployed paliad, before touching the image.
Steps:
1. **Generate keypair** on mRiver: `ssh-keygen -t ed25519 -N "" -C "paliad-prod" -f /tmp/paliad-prod-key`.
2. **Install shim** at `/home/m/.local/bin/paliadin-shim` (script from §5.4), `chmod 755`.
3. **Write authorized_keys** with the public key + restrictions from §5.2.
4. **Capture mRiver host key**: `ssh-keyscan -t ed25519 100.99.98.203 > /tmp/paliad-known_hosts` from mLake.
5. **Confirm host networking trade-off (§4.2):** flip the running paliad-prod compose to `network_mode: host` on a temporary branch; redeploy via Dokploy; verify `https://paliad.de/` still serves correctly via traefik. **Gate**: if traefik 502s, abort Phase A and revisit decision 1 in a follow-up issue.
6. **Smoke-test SSH from inside the container**:
```
docker exec -it paliad-prod sh
apk add --no-cache openssh-client # one-shot, before Dockerfile change
ssh -i /tmp/key -o UserKnownHostsFile=/tmp/known_hosts m@100.99.98.203 health
# expected: "ok"
ssh … run-turn $(uuidgen) "$(printf 'Hallo Paliadin' | base64 -w0)"
# expected: response body, then ".../uuid.txt" cleaned up
```
7. **Wire env vars manually** via Dokploy UI for one deploy; confirm `/paliadin` works end-to-end against mRiver.
If Phase A passes: codify into Phase B. If it fails on (5), the design rolls back to a sidecar in a new issue (decision 1 follow-up). If it fails elsewhere, fix the shim or the SSH config; the architecture is fine.
### Phase B — bake into Dockerfile + Dokploy secrets
1. Dockerfile: add `openssh-client` to the final stage (§4.3).
2. compose: add `network_mode: host` and the four new env vars (§4.1).
3. Dokploy secrets: register `PALIADIN_REMOTE_HOST=100.99.98.203`, `PALIADIN_REMOTE_USER=m`, `PALIADIN_SSH_PRIVATE_KEY=...`, `PALIADIN_KNOWN_HOSTS=...`.
4. Code: refactor `PaliadinService` to the interface split (§6.1–§6.2). New file `internal/services/paliadin_remote.go`. Tests: `paliadin_remote_test.go` mocks `callShim` to verify `RunTurn` audit-row writes, error mapping, and `healthGate` caching.
5. Ship under one PR; tag t-paliad-151 done.
### Phase C — friendly errors + monitoring
1. `paliadin.error.mriver_unreachable` i18n keys + `friendlyErrorMessage` case (§6.6).
2. `/admin/paliadin` shows last health-probe result + last successful turn timestamp.
3. Optional: `mai-mesh` integration to surface mRiver-offline events to m on Telegram (out-of-band; not gating).
---
## 8. Security review summary
| Risk | Mitigation |
|---|---|
| Stolen private key → arbitrary SSH on mRiver | `command=` shim restriction + `from="100.99.98.201"` + ed25519 key + private key only in Dokploy secret store (encrypted at rest) |
| Stolen private key → tailnet-wide SSH from non-mLake host | `from="100.99.98.201"` clause |
| Container compromise → key extraction | Key written to tmpfile chmod 600, only root inside container can read; alpine container has no shell-on-error trampolines |
| Host-key MITM during connect | Pinned `known_hosts`; `StrictHostKeyChecking=yes` |
| Shim argument injection (e.g. via `run-turn $(rm -rf /)`) | Shim parses positional args from `$SSH_ORIGINAL_COMMAND` via `read -r -a`; never passes args to a subshell `eval`; turn_id validated by UUID regex; message body always base64-decoded into a single shell variable, never re-evaluated |
| Runaway loop → SSH flood | Single-flight `turnMu` + 20/min rolling cap |
| `network_mode: host` widens blast radius | The `command=` + `from=` restrictions on mRiver mean container compromise = "can run shim verbs against mRiver only", not "shell on mRiver" |
| PaliadinOwnerEmail bypass | Unchanged from PoC: gate is in Go (`/paliadin` 404s for any other user). Even if mRiver SSH key leaks, attacker still needs paliad session as `m@hoganlovells.com`. |
---
## 9. Out-of-scope clarifications (for review)
These were called out in the issue but the design intentionally does not solve them, to keep v1 tight. Each is acknowledged so review knows it wasn't an oversight:
- **Wake-on-LAN of mRiver:** out of scope. v1's UX when mRiver is asleep is the friendly error from §6.6. Future work: integrate with `mai-mesh` capability fallback.
- **Multi-host failover:** out of scope. Only mRiver is targeted.
- **Anthropic API fallback when mRiver offline:** out of scope per CLAUDE.md (`ANTHROPIC_API_KEY` reserved for production-v1, unused in PoC).
- **ControlMaster:** v1 ships without; revisit if turn latency >300 ms in practice (§6.8).
---
## 10. File-level deliverables (for the coder shift)
When this design is approved and the coder shift starts, the work splits roughly into:
- `Dockerfile` — `+openssh-client`.
- `docker-compose.yml` — `network_mode: host`, four new env entries.
- `internal/services/paliadin.go` — extract `Paliadin` interface; rename existing to `LocalPaliadinService`; pull DB-only methods (`ListRecentTurns`, `Stats`, `IsOwner`) into a shared embedded `paliadinDB` so both implementations get them for free.
- `internal/services/paliadin_remote.go` — new file: `RemotePaliadinService`, `RemotePaliadinConfig`, `callShim`, `healthGate`, `ensureBootstrapped`, `classifySSHError`, `ErrMRiverUnreachable`.
- `internal/services/paliadin_remote_test.go` — unit tests with a mocked `callShim`.
- `cmd/server/main.go` — env-var-based wiring (§6.2), `loadPaliadinSSHKey`, `loadPaliadinKnownHosts`.
- `frontend/src/client/paliadin.ts` — one `case` in `friendlyErrorMessage` for `mriver_unreachable`.
- `frontend/src/i18n.ts` — two new keys (`paliadin.error.mriver_unreachable.de` / `.en`).
- `scripts/paliadin-shim` — server-side script (§5.4); copied to mRiver during Phase A, not part of any container.
- `docs/project-status.md` — note Phase 0.5 (PoC) → Phase 0.6 (Tailscale-SSH prod route).
No DB migrations needed — `paliad.paliadin_turns` schema already covers everything (`error_code` field already accepts free-form codes including `mriver_unreachable`).
---
## 11. Open questions for review
- **Q (m):** Phase A test step 5 expects traefik to keep working under host mode. If a quick search confirms Dokploy's traefik can't route to host-network services without manual `loadbalancer.server.url` config, we should know before Phase A. Worth a 5-minute Dokploy doc check before starting Phase A step 5.
- **Q (m):** Should the `paliadin-shim` script live in this repo (`scripts/paliadin-shim`) and be version-pinned, or is it a one-off that lives only on mRiver? Repo location lets us audit changes; mRiver-only keeps deploy footprint smaller.
- **Q (m):** `ANTHROPIC_API_KEY` env var reservation in compose comments — keep the comment line for production-v1, or strip it now that this design supersedes that path for the foreseeable future?
---
**Inventor stops here.** No code shipped this shift. Awaiting m's go/no-go on the design before the coder shift starts.