paliad/docs/design-paliadin-tailscale-ssh-2026-05-07.md
m f952fb85c3 design(t-paliad-151) amend: port 22022 bypass + Phase A.0 results
Phase A.0 revealed Tailscale SSH on mRiver intercepts :22 from tailnet
peers and bypasses OpenSSH's authorized_keys entirely (banner
"SSH-2.0-Tailscale", auth method "none", command= restriction never
fires). The fix is port 22022 via a systemd ssh.socket drop-in:
Tailscale SSH only intercepts :22, so :22022 hits real OpenSSH where
the design's command=/from= shim restriction works as specified.

Updated:
- §3 locked decisions: row 5 added (port 22022, m's call 23:35)
- §4.5 new subsection: Tailscale SSH bypass via socket drop-in
  + records the "Address already in use" first-attempt failure as a
  "don't retry without cleaning sshd_config Port directives first"
  lesson
- §5.2/5.3: ssh-keyscan now uses -p 22022; known_hosts is host:port
  keyed for non-22 ports
- §6.1/6.2/6.3: SSHPort field on RemotePaliadinService config, -p
  flag in callShim, PALIADIN_REMOTE_PORT env (default 22022)
- §7 phasing: A.0 completion checked off step-by-step with concrete
  fingerprints; A.5/A.6/A.7 split out as m-driven
- §8 security: Tailscale-SSH-on-:22 risk explicitly tabled with
  port-22022 mitigation
- §10 deliverables: mRiver host-setup artifacts noted
- §12 new Phase A.0 completion summary with the three secrets m
  needs to register in Dokploy

Phase A.0 verified end-to-end:
- ssh -p 22022 paliad-prod-key m@mriver health → ok
- run-turn UUID base64msg → 3.4 s including a real Claude response
- from="100.99.98.201" correctly rejects connections from mRiver
  itself

mRiver host state in place (not in repo): authorized_keys with
restrictions, /home/m/.local/bin/paliadin-shim, ssh.socket drop-in.
Three secrets staged at ~/.paliad-staging/ on mRiver for m to copy
into Dokploy: paliad-prod-key (PALIADIN_SSH_PRIVATE_KEY),
known_hosts (PALIADIN_KNOWN_HOSTS), and the three plain env vars.

Refs m/paliad#12
2026-05-07 23:37:26 +02:00

# Paliadin: route prod via Tailscale SSH to mRiver
**Issue:** m/paliad#12 — t-paliad-151
**Date:** 2026-05-07
**Author:** noether (inventor)
**Supersedes nothing.** Extends `docs/design-paliadin-2026-05-07.md` (the Phase 0 PoC) with a third deployment path between "laptop-only PoC" and "Anthropic API direct".
**Related:** t-paliad-146 (PoC ship), t-paliad-150 (`friendlyErrorMessage` pattern).
---
## 1. Goal
Make Paliadin reachable from `paliad.de` (Dokploy on mLake) without losing m's Claude Code subscription, by routing each turn over Tailscale + SSH from the paliad container to mRiver, where the existing long-lived `tmux` + `claude` PoC keeps running.
**Non-goals (v1):**
- Multi-host failover.
- Encryption beyond SSH-over-tailnet (already E2E-encrypted by Tailscale's WireGuard layer).
- Anthropic API fallback when mRiver is offline — show a friendly error instead.
- Wake-on-LAN of mRiver.
- Multi-tenant or multi-firm variants.
---
## 2. Live state — what was verified before designing
A design built on stale facts rots fast. These were probed on 2026-05-07, not assumed from CLAUDE.md or memory:
| Fact | How verified | Result |
|---|---|---|
| mRiver = `100.99.98.203`, has tmux + claude | this worker runs on mRiver; `tmux -V` → `tmux 3.6a`; `which claude` → `/home/m/.local/bin/claude` | confirmed |
| mLake (`100.99.98.201`) has Tailscale running | `ssh m@mlake tailscale status` | confirmed; mRiver visible as `active; direct [2a02:4780:41:3fbc::1]:41641` |
| paliad container Dockerfile is alpine:3.21 minimal, no SSH, no tailscaled | `Dockerfile` | confirmed (only `ca-certificates`) |
| paliad compose runs default Docker bridge (no `network_mode`) | `docker-compose.yml` | confirmed |
| mRiver has no `~/.ssh/authorized_keys` yet | `ls ~/.ssh/` | confirmed — file must be created in Phase A |
| `/tmp/paliadin/` does not exist on mRiver yet | `ls /tmp/paliadin` | confirmed — created on first turn (paliadin.go:185 `os.MkdirAll`) |
| `paliad-paliadin` tmux session is not currently running on mRiver | `tmux ls` | not present; the existing PoC creates it on demand |
**Implication for design:** the paliad container needs new infrastructure on three axes — network reachability of the tailnet, an SSH client + identity, and a service-layer code path that talks to a remote tmux instead of a local one. Each axis is its own sub-design below.
---
## 3. Locked decisions (m, 2026-05-07 22:35)
m made four design-shaping calls via the inventor's `AskUserQuestion` pass; a fifth (row 5) was added after a Phase A finding. They are recorded here verbatim because every downstream choice in §4–§6 follows from them.
| # | Question | m's choice |
|---|---|---|
| 1 | Container Tailscale shape | **`network_mode: host` on paliad** |
| 2 | SSH-to-mRiver protocol granularity | **Server-side `paliadin-shim` (one RPC per turn)** |
| 3 | Routing trigger | **Env var `PALIADIN_REMOTE_HOST` + interface split** |
| 4 | SSH private key storage | **Dokploy secret env var `PALIADIN_SSH_PRIVATE_KEY`** |
| 5 | SSH port to bypass Tailscale SSH | **Port 22022 via `ssh.socket` drop-in (Phase A finding, 23:30)** |
Decision (1) was *not* the inventor's recommendation — host mode has known interaction risk with traefik (§4.2). m is overriding the recommendation; this design accepts the call and codifies a Phase A test step that gates the rollout on traefik still working under host mode. If Phase A blows up, the fallback is to revisit (1) in a follow-up issue, not to silently swap to a sidecar.
Decision (5) emerged during Phase A: Tailscale SSH on mRiver was found to intercept `:22` from tailnet peers and bypass OpenSSH's `authorized_keys` entirely (banner says "Tailscale", auth method "none"). The `command=` shim restriction therefore never fires on the standard port. Adding port 22022 via a systemd `ssh.socket` drop-in routes paliad's connections to real OpenSSH where the restriction works. m's interactive `tailscale ssh m@mriver` on `:22` stays untouched. See §4.5 for the implementation.
---
## 4. Sub-design A — Container Tailscale shape
### 4.1 Shape: `network_mode: host`
paliad's container shares mLake's network namespace. `tailscale0` (mLake's tailnet interface) is directly visible from inside the container. Outbound `ssh m@100.99.98.203` reaches mRiver over the tailnet without any sidecar, userspace tailscaled, SOCKS proxy, or auth-key flow inside the container.
```yaml
# docker-compose.yml diff
services:
  web:
    build: .
    network_mode: host            # NEW
    # remove: expose: ["8080"]    # host mode means port is on the host directly
    environment:
      - PORT=8080
      ...
      # NEW Paliadin remote-routing knobs
      - PALIADIN_REMOTE_HOST=${PALIADIN_REMOTE_HOST}   # 100.99.98.203
      - PALIADIN_REMOTE_PORT=${PALIADIN_REMOTE_PORT}   # 22022 (bypasses Tailscale SSH, see §4.5)
      - PALIADIN_REMOTE_USER=${PALIADIN_REMOTE_USER}   # m
      - PALIADIN_SSH_PRIVATE_KEY=${PALIADIN_SSH_PRIVATE_KEY}
      - PALIADIN_KNOWN_HOSTS=${PALIADIN_KNOWN_HOSTS}   # one-line ssh-keyscan -p 22022 output
    restart: unless-stopped
```
### 4.2 Trade-off accepted: traefik routing under host mode
paliad.de's TLS is provided by Dokploy's traefik on the `dokploy-network` overlay. With `network_mode: host`, paliad is no longer attached to that overlay. Two failure modes are possible:
- **(M1)** traefik can't discover the service via Docker DNS → 502 at the edge.
- **(M2)** traefik routes via host loopback (`http://127.0.0.1:8080` or `host.docker.internal`) and works fine.
Recent Dokploy versions configure traefik with both `loadbalancer.server.url` and Docker labels; (M2) is the documented host-mode path. **Phase A explicitly tests this** (§7) before any code is written; if (M1) materialises, the design rolls back to the sidecar variant of decision 1 in a follow-up issue.
Other host-mode side-effects to flag in operations:
- paliad listens on host port 8080 directly. Any other compose service binding 8080 conflicts.
- paliad's outbound DNS uses host resolver (no Docker-internal `web` etc.). Currently fine: paliad's only network deps are external (Supabase, SMTP, GitHub raw). No service on `dokploy-network` is referenced by name.
- The container can reach **every** Tailscale node, not just mRiver. Mitigations live in §5 (key restriction) and §5.2 (`from=` clause on mRiver authorized_keys).
### 4.3 Dockerfile diff
```dockerfile
# Final stage adds the SSH client only. Tailscale is provided by the host.
FROM alpine:3.21
RUN apk add --no-cache ca-certificates openssh-client # +openssh-client (~1MB)
WORKDIR /app
COPY --from=backend /paliad /app/paliad
COPY --from=frontend /app/frontend/dist /app/dist
EXPOSE 8080
CMD ["/app/paliad"]
```
Image-size delta: alpine `openssh-client` is ~1.1 MB compressed — negligible. No tailscaled, no entrypoint script, no extra processes inside the container.
### 4.4 What does NOT change
- No Tailscale auth-key inside paliad. The container inherits the host's tailnet binding, so there is no per-container Tailscale identity to rotate. mLake's existing Tailscale auth is the only one in scope.
- No tailscaled process inside the container.
- No new sidecar container.
### 4.5 Bypassing Tailscale SSH via port 22022 (Phase A discovery)
**Phase A revealed** that Tailscale SSH on mRiver intercepts `:22` from tailnet peers before OpenSSH sees the connection. The SSH banner reads `SSH-2.0-Tailscale`, the verbose log shows `Authenticated using "none"`, and the `authorized_keys command=` directive is therefore inert. mRiver's `tailscale status --json` confirms the `https://tailscale.com/cap/ssh` capability is enabled.
The fix: a separate listening port for the paliad route, where Tailscale SSH does not intercept and real OpenSSH handles auth.
mRiver uses systemd socket activation for sshd (`/usr/lib/systemd/system/ssh.socket` binds `:22`). Setting `Port 22022` in `sshd_config` is **ignored** under socket activation — listen ports come from the socket unit, not sshd's own config. The correct change is a drop-in:
```ini
# /etc/systemd/system/ssh.socket.d/paliad.conf
[Socket]
ListenStream=0.0.0.0:22022
ListenStream=[::]:22022
```
Followed by `systemctl daemon-reload && systemctl restart ssh.socket`. Both `:22` (still routed through Tailscale SSH for m's interactive use) and `:22022` (real OpenSSH) end up listening. The same sshd binary handles both — same host key, same `authorized_keys`, same sshd_config. The only difference is *which port* a peer dials.
A failed first attempt (2026-05-07 23:07) added the drop-in while a stale `Port 22022` directive in `sshd_config.d/99-paliad-test.conf` was still bound — the resulting `Address already in use` took `ssh.socket` down for ~30 s until reverted. Lesson: clean any prior `Port` directives out of `sshd_config.d/*.conf` before retrying the socket drop-in.
Phase A end-to-end test (2026-05-07 23:31) succeeded with port 22022:
- `ssh -p 22022 -i paliad-prod-key m@100.99.98.203 health` → `ok`
- `run-turn <uuid> <base64-msg>` → 3.4 s round-trip including a Claude-Code response
- `from="100.99.98.201"` correctly rejected a connection sourced from mRiver itself (`Permission denied (publickey,password)`)
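
The banner observation above can be turned into a repeatable probe. The following is a minimal sketch, assuming the server sends its SSH identification line first with no preamble; `isTailscaleSSH` and `readBanner` are hypothetical helper names, not part of the paliad codebase, and the OpenSSH version string in the demo is made up.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
	"time"
)

// isTailscaleSSH reports whether an SSH identification line was sent by
// Tailscale SSH (banner "SSH-2.0-Tailscale", per the Phase A finding)
// rather than by OpenSSH.
func isTailscaleSSH(banner string) bool {
	return strings.HasPrefix(strings.TrimSpace(banner), "SSH-2.0-Tailscale")
}

// readBanner connects to addr and returns the first line the server sends.
// Assumption: the identification line arrives first, without preamble lines.
func readBanner(addr string, timeout time.Duration) (string, error) {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return "", err
	}
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(line), nil
}

func main() {
	// Offline demo on literal banners; the OpenSSH version string is invented.
	for _, b := range []string{"SSH-2.0-Tailscale", "SSH-2.0-OpenSSH_9.6"} {
		fmt.Printf("%-22s tailscale=%v\n", b, isTailscaleSSH(b))
	}
	// Live probe, to be run from mLake:
	//   banner, err := readBanner("100.99.98.203:22022", 3*time.Second)
}
```

Run against both `:22` and `:22022`, a probe like this would show at a glance which port is still fronted by Tailscale SSH.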
---
## 5. Sub-design B — SSH identity, restricted shim, host-key pinning
### 5.1 Identity: dedicated ed25519 keypair `paliad-prod`
One keypair, generated once on mRiver during Phase A, used by every paliad-prod deploy:
```bash
# On mRiver (Phase A bootstrap):
ssh-keygen -t ed25519 -N "" -C "paliad-prod $(date +%Y-%m-%d)" -f /tmp/paliad-prod-key
# Public key → mRiver authorized_keys (see 5.2)
# Private key → Dokploy secret store as PALIADIN_SSH_PRIVATE_KEY
shred -u /tmp/paliad-prod-key # only the encrypted/secret-stored copies survive
```
Rotation: regenerate, push public key to mRiver authorized_keys, update Dokploy secret, redeploy. No code change needed — paliad's startup re-reads the env var on every boot.
The private key is delivered to the container as a multi-line env var. At process start, paliad writes it to a tmpfile so OpenSSH can use it:
```go
// cmd/server/main.go (sketch)
func loadPaliadinSSHKey() (string, error) {
	blob := os.Getenv("PALIADIN_SSH_PRIVATE_KEY")
	if blob == "" { return "", nil } // remote mode disabled
	// Secret stores often strip the trailing newline; OpenSSH may reject a key file without one.
	if !strings.HasSuffix(blob, "\n") { blob += "\n" }
	f, err := os.CreateTemp("", "paliadin-id_ed25519-")
	if err != nil { return "", err }
	defer f.Close() // harmless double close on the success path; covers the error returns below
	if err := os.Chmod(f.Name(), 0o600); err != nil { return "", err }
	if _, err := f.WriteString(blob); err != nil { return "", err }
	if err := f.Close(); err != nil { return "", err }
	return f.Name(), nil // path passed to RemotePaliadinService
}
```
The tmpfile lives at `/tmp/paliadin-id_ed25519-<rand>` for the container's lifetime. On container restart, a fresh tmpfile is written. We never persist the key to a volume.
### 5.2 mRiver `authorized_keys` entry
```
command="/home/m/.local/bin/paliadin-shim",no-pty,no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-user-rc,from="100.99.98.201" ssh-ed25519 AAAA...PUBKEY... paliad-prod
```
Each restriction matters:
- `command=` — every `ssh m@mriver …` invocation runs the shim regardless of what the client asked for. The client's requested command is exposed as `$SSH_ORIGINAL_COMMAND` for the shim to dispatch on.
- `no-pty,no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-user-rc` — defence-in-depth: even if someone steals the key and bypasses the shim's argument validation, they can't get an interactive shell, can't tunnel ports, can't pivot via agent forwarding.
- `from="100.99.98.201"` — only accept connections from mLake's tailnet IP. Defends against the "container has full tailnet visibility" host-mode side-effect from §4.2: if the key leaks off mLake, it can't be replayed from another tailnet host.
### 5.3 Host-key pinning
`StrictHostKeyChecking=accept-new` is too loose for a long-lived production identity (one-time MITM during first connect substitutes a different key forever). Instead:
- During Phase A, run `ssh-keyscan -p 22022 -t ed25519 100.99.98.203` on mLake.
- Capture the single output line. The host-key portion is identical to the `:22` entry — same sshd, same keys — but the `[100.99.98.203]:22022` prefix matters because OpenSSH's `known_hosts` is `host:port`-keyed for non-22 ports.
- Store as Dokploy secret `PALIADIN_KNOWN_HOSTS`.
- At container startup, write to `/tmp/paliadin-known_hosts` chmod 644.
- Pass to OpenSSH via `-o UserKnownHostsFile=/tmp/paliadin-known_hosts -o StrictHostKeyChecking=yes`.
If mRiver's host key ever rotates (rare; only on disk wipe / fresh OS), Phase A runs again and the secret is updated. SSH refuses to connect with a clear "host key changed" error, which surfaces as `mriver_unreachable` to the user — exactly the right blast-radius (loud failure, no silent connect to a substitute host).
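The startup step that materialises `PALIADIN_KNOWN_HOSTS` could look roughly like the sketch below. It assumes the secret holds plain (unhashed) `ssh-keyscan` output with a single host pattern per line; `knownHostsPattern` and `writeKnownHosts` are hypothetical names, and the key material in `main` is a placeholder rather than a real host key.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// knownHostsPattern returns the host pattern OpenSSH uses in known_hosts:
// the bare host for port 22, "[host]:port" for any other port.
func knownHostsPattern(host string, port int) string {
	if port == 22 {
		return host
	}
	return fmt.Sprintf("[%s]:%d", host, port)
}

// writeKnownHosts checks that blob pins host:port before writing it to path
// with mode 0644, so a secret captured without -p fails at startup rather
// than on the first turn.
func writeKnownHosts(blob, path, host string, port int) error {
	want := knownHostsPattern(host, port)
	pinned := false
	for _, line := range strings.Split(blob, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 3 && fields[0] == want {
			pinned = true
		}
	}
	if !pinned {
		return fmt.Errorf("known_hosts: no entry for %s", want)
	}
	return os.WriteFile(path, []byte(strings.TrimSpace(blob)+"\n"), 0o644)
}

func main() {
	// Placeholder key material, not a real host key.
	blob := "[100.99.98.203]:22022 ssh-ed25519 AAAA...HOSTKEY..."
	path := filepath.Join(os.TempDir(), "paliadin-known_hosts-demo")
	if err := writeKnownHosts(blob, path, "100.99.98.203", 22022); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("wrote", path)
}
```

Validating the pattern up front catches the most likely operator mistake: a keyscan taken against `:22` yields a line keyed by the bare IP, which would never match a connection to `:22022`.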
### 5.4 The shim — `paliadin-shim`
A bash script on mRiver at `/home/m/.local/bin/paliadin-shim`. It is the **only** thing the paliad-prod key is allowed to invoke, and it dispatches on `$SSH_ORIGINAL_COMMAND`. Four verbs (`health`, `bootstrap`, `run-turn`, `reset`):
```bash
#!/bin/bash
# paliadin-shim — server-side RPC for paliad's remote-tmux turns.
# Invoked via authorized_keys command= with $SSH_ORIGINAL_COMMAND set.
set -euo pipefail
umask 077
readonly TMUX_SESSION="${PALIADIN_TMUX_SESSION:-paliad-paliadin}"
readonly RESPONSE_DIR="${PALIADIN_RESPONSE_DIR:-/tmp/paliadin}"
readonly TIMEOUT_S=60
readonly TURN_ID_RE='^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
mkdir -p "$RESPONSE_DIR"
# Parse $SSH_ORIGINAL_COMMAND. Format: "<verb> <arg1> <arg2> …"
read -r -a argv <<< "${SSH_ORIGINAL_COMMAND:-}"
verb="${argv[0]:-}"
ensure_pane() {
  if ! tmux has-session -t "$TMUX_SESSION" 2>/dev/null; then
    tmux new-session -d -s "$TMUX_SESSION"
  fi
  # Find or create the @paliadin-scope=chat window.
  local target=""
  while read -r idx; do
    scope=$(tmux show-window-option -t "$TMUX_SESSION:$idx" -v @paliadin-scope 2>/dev/null || true)
    if [[ "$scope" == "chat" ]]; then target="$TMUX_SESSION:$idx"; break; fi
  done < <(tmux list-windows -t "$TMUX_SESSION" -F '#{window_index}')
  if [[ -z "$target" ]]; then
    idx=$(tmux new-window -t "$TMUX_SESSION" -n claude-paliadin -P -F '#{window_index}' claude)
    target="$TMUX_SESSION:$idx"
    # Wait for claude to settle (60s bound; matches Go waitForPaneReady).
    for _ in $(seq 1 120); do
      pane=$(tmux capture-pane -t "$target" -p 2>/dev/null || true)
      if [[ "$pane" == *""* || "$pane" == *"│"* ]]; then break; fi
      sleep 0.5
    done
    tmux set-window-option -t "$target" @paliadin-scope chat
    tmux set-window-option -t "$target" @fix-name claude-paliadin
    # Bootstrap system prompt — reuses the Go service's prompt text.
    # The Go side sends this via the `bootstrap` RPC on first turn instead
    # of duplicating the prompt here. See §6.4.
  fi
  echo "$target"
}
case "$verb" in
  health)
    # Liveness check: used by paliad to short-circuit when mRiver is offline.
    # Returns "ok" iff tmux + claude are reachable.
    tmux has-session -t "$TMUX_SESSION" 2>/dev/null \
      || tmux new-session -d -s "$TMUX_SESSION"
    command -v claude >/dev/null && echo ok || { echo no-claude; exit 1; }
    ;;
  bootstrap)
    # First-turn-only: ensure pane exists and inject the system prompt.
    # argv[1] = base64-encoded prompt body (avoids quoting hell).
    target=$(ensure_pane)
    prompt=$(printf '%s' "${argv[1]:?missing prompt}" | base64 -d)
    tmux send-keys -t "$target" -l -- "$prompt"
    tmux send-keys -t "$target" Enter
    sleep 2 # give claude a moment to absorb
    echo ok
    ;;
  run-turn)
    # argv[1] = turn_id (UUID); argv[2] = base64-encoded user message.
    turn_id="${argv[1]:?missing turn_id}"
    [[ "$turn_id" =~ $TURN_ID_RE ]] || { echo >&2 "bad turn_id"; exit 2; }
    msg=$(printf '%s' "${argv[2]:?missing message}" | base64 -d)
    target=$(ensure_pane)
    out="$RESPONSE_DIR/$turn_id.txt"
    rm -f "$out"
    # Envelope matches what paliadin_prompt.go expects.
    tmux send-keys -t "$target" -l -- "[PALIADIN:$turn_id] $msg"
    tmux send-keys -t "$target" Enter
    # Poll for the response file. Same shape as Go pollForResponse.
    for _ in $(seq 1 $((TIMEOUT_S * 5))); do
      if [[ -s "$out" ]]; then
        sleep 0.05 # settle
        cat "$out"
        rm -f "$out"
        exit 0
      fi
      sleep 0.2
    done
    echo >&2 "paliadin: response timeout after ${TIMEOUT_S}s"
    exit 124
    ;;
  reset)
    # /clear the conversation; next turn starts fresh.
    target=$(ensure_pane)
    tmux send-keys -t "$target" -l -- "/clear"
    tmux send-keys -t "$target" Enter
    echo ok
    ;;
  *)
    echo >&2 "paliadin-shim: unknown verb '$verb'"
    exit 2
    ;;
esac
```
Why a shim instead of raw tmux-over-SSH:
- One SSH round-trip per turn (~50 ms over tailnet) vs ~10–20 round-trips for the granular pattern.
- Argument validation lives in one place (UUID regex on turn_id, base64 for messages, fixed verb list) — easier to audit than a regex over `$SSH_ORIGINAL_COMMAND` matching `tmux send-keys …`.
- mRiver-side concerns (response polling, settle delays, pane-readiness) stay on mRiver, which is where the tmux state lives. The Go service stops caring about local file polling at all.
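
On the Go side, the same validation can be applied before the SSH call is made, so malformed input fails fast without a round-trip. A minimal sketch, assuming a hypothetical helper named `runTurnArgs`; the regex is copied from the shim's `TURN_ID_RE`:

```go
package main

import (
	"encoding/base64"
	"fmt"
	"regexp"
)

// turnIDRe mirrors TURN_ID_RE in paliadin-shim.
var turnIDRe = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// runTurnArgs builds the argument list for the shim's run-turn verb.
// The message is base64-encoded (standard alphabet, no line wrapping), so it
// contains no whitespace and survives the shim's `read -r -a` split as one token.
func runTurnArgs(turnID, userMessage string) ([]string, error) {
	if !turnIDRe.MatchString(turnID) {
		return nil, fmt.Errorf("paliadin: invalid turn_id %q", turnID)
	}
	encoded := base64.StdEncoding.EncodeToString([]byte(userMessage))
	return []string{"run-turn", turnID, encoded}, nil
}

func main() {
	args, err := runTurnArgs("123e4567-e89b-12d3-a456-426614174000", "hi; rm -rf /")
	fmt.Println(args, err)
}
```

The shim still validates on its own; this check is only a convenience for the caller and does not replace the server-side one.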
---
## 6. Sub-design C — Service-layer integration, routing, reliability
### 6.1 Interface split
The current `*PaliadinService` becomes an interface with two implementations: `LocalPaliadinService` (the existing tmux code, renamed) and `RemotePaliadinService` (the new SSH code). Construction picks one at startup based on `PALIADIN_REMOTE_HOST`.
```go
// internal/services/paliadin.go (after refactor)
type Paliadin interface {
	RunTurn(ctx context.Context, req TurnRequest) (*TurnResult, error)
	ResetSession(ctx context.Context) error
	ListRecentTurns(ctx context.Context, callerID uuid.UUID, limit int) ([]PaliadinTurn, error)
	Stats(ctx context.Context, callerID uuid.UUID) (*PaliadinStats, error)
	IsOwner(ctx context.Context, userID uuid.UUID) (bool, error)
}

// LocalPaliadinService wraps the current tmux PoC (laptop / dev path).
type LocalPaliadinService struct { /* identical to today's PaliadinService */ }

// RemotePaliadinService talks to a paliadin-shim over SSH on mRiver.
type RemotePaliadinService struct {
	db    *sqlx.DB
	users *UserService

	sshHost    string // 100.99.98.203
	sshPort    int    // 22022; bypasses Tailscale SSH on :22 (see §4.5)
	sshUser    string // m
	sshKeyPath string // /tmp/paliadin-id_ed25519-<rand>
	knownHosts string // /tmp/paliadin-known_hosts

	turnMu sync.Mutex

	// Health-check cache.
	healthMu        sync.Mutex
	healthOK        bool
	healthCheckedAt time.Time
}
```
DB access (`ListRecentTurns`, `Stats`, `IsOwner`) is identical for both — they only read `paliad.paliadin_turns`. They live in a shared `paliadinDB` helper struct embedded in both implementations.
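The embedding could be sketched as follows. To keep the example self-contained it replaces `*sqlx.DB` with a hypothetical `turnStore` interface and an in-memory fake, and reduces the shared surface to a single `Stats` method with simplified types; only the struct-embedding shape is the point.

```go
package main

import "fmt"

// turnStore stands in for the queries against paliad.paliadin_turns
// (hypothetical; the real helper would wrap *sqlx.DB).
type turnStore interface {
	CountTurns(callerID string) (int, error)
}

// paliadinDB holds the read-only audit queries shared by both implementations.
type paliadinDB struct{ store turnStore }

func (p paliadinDB) Stats(callerID string) (int, error) {
	return p.store.CountTurns(callerID)
}

// statsReader is the slice of the Paliadin interface served by paliadinDB.
type statsReader interface {
	Stats(callerID string) (int, error)
}

type LocalPaliadinService struct{ paliadinDB }

type RemotePaliadinService struct {
	paliadinDB
	sshHost string
}

// memStore is an in-memory fake used only for this demo.
type memStore map[string]int

func (m memStore) CountTurns(id string) (int, error) { return m[id], nil }

// Compile-time checks: both services get Stats via the embedded helper.
var (
	_ statsReader = LocalPaliadinService{}
	_ statsReader = RemotePaliadinService{}
)

func main() {
	db := paliadinDB{store: memStore{"m": 3}}
	for _, svc := range []statsReader{
		LocalPaliadinService{db},
		RemotePaliadinService{paliadinDB: db, sshHost: "100.99.98.203"},
	} {
		n, err := svc.Stats("m")
		fmt.Printf("%T: turns=%d err=%v\n", svc, n, err)
	}
}
```

Because the methods are promoted from the embedded struct, neither implementation needs forwarding wrappers, and a query fix lands in both paths at once.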
### 6.2 Wiring at startup
```go
// cmd/server/main.go (excerpt)
var paliadin services.Paliadin
remoteHost := os.Getenv("PALIADIN_REMOTE_HOST")
switch {
case remoteHost != "":
	keyPath, err := loadPaliadinSSHKey()
	if err != nil { log.Fatalf("paliadin: load ssh key: %v", err) }
	if keyPath == "" { log.Fatalf("paliadin: PALIADIN_REMOTE_HOST set but no PALIADIN_SSH_PRIVATE_KEY") }
	knownHosts, err := loadPaliadinKnownHosts()
	if err != nil { log.Fatalf("paliadin: load known_hosts: %v", err) }
	// Fail loudly on a malformed port instead of silently dialling port 0.
	port, err := strconv.Atoi(cmpOr(os.Getenv("PALIADIN_REMOTE_PORT"), "22022"))
	if err != nil { log.Fatalf("paliadin: bad PALIADIN_REMOTE_PORT: %v", err) }
	user := cmpOr(os.Getenv("PALIADIN_REMOTE_USER"), "m")
	paliadin = services.NewRemotePaliadinService(db, userSvc, services.RemotePaliadinConfig{
		SSHHost:        remoteHost,
		SSHPort:        port,
		SSHUser:        user,
		SSHKeyPath:     keyPath,
		KnownHostsPath: knownHosts,
	})
	log.Printf("paliadin: remote mode → ssh %s@%s:%d", user, remoteHost, port)
case localTmuxAvailable():
	paliadin = services.NewLocalPaliadinService(db, userSvc, "", "")
	log.Printf("paliadin: local tmux mode")
default:
	paliadin = services.NewDisabledPaliadinService(db, userSvc)
	log.Printf("paliadin: disabled (no remote host, no local tmux)")
}
```
`NewDisabledPaliadinService` exists today implicitly via the `ErrTmuxUnavailable` path; making it explicit gives the constructor a clear name and the handler doesn't have to special-case `nil`.
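The wiring excerpt relies on a `cmpOr` helper that is not defined in this document. A minimal sketch of what it might look like, together with a stricter port parser, is shown below; the semantics of `cmpOr` are inferred from its call sites, and `parseRemotePort` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strconv"
)

// cmpOr returns v unless it is empty, in which case it returns fallback.
// Assumption: this matches how the wiring code uses it for env defaults.
func cmpOr(v, fallback string) string {
	if v == "" {
		return fallback
	}
	return v
}

// parseRemotePort turns the PALIADIN_REMOTE_PORT value into a TCP port,
// defaulting to 22022 and rejecting values ssh could not dial.
func parseRemotePort(raw string) (int, error) {
	port, err := strconv.Atoi(cmpOr(raw, "22022"))
	if err != nil {
		return 0, fmt.Errorf("PALIADIN_REMOTE_PORT %q: %w", raw, err)
	}
	if port < 1 || port > 65535 {
		return 0, fmt.Errorf("PALIADIN_REMOTE_PORT %d out of range", port)
	}
	return port, nil
}

func main() {
	for _, raw := range []string{"", "2222", "abc", "70000"} {
		port, err := parseRemotePort(raw)
		fmt.Printf("%q -> port=%d err=%v\n", raw, port, err)
	}
}
```

Rejecting a bad value at startup keeps the failure next to its cause, rather than surfacing later as a generic `mriver_unreachable`.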
### 6.3 SSH invocation pattern
`RemotePaliadinService` runs every RPC through the same helper:
```go
func (s *RemotePaliadinService) callShim(ctx context.Context, args ...string) ([]byte, error) {
	sshArgs := []string{
		"-F", "/dev/null", // ignore /etc/ssh/ssh_config + ~/.ssh/config
		"-i", s.sshKeyPath,
		"-p", strconv.Itoa(s.sshPort), // 22022; bypasses Tailscale SSH on :22
		"-o", "IdentitiesOnly=yes", // don't fall back to other keys
		"-o", "UserKnownHostsFile=" + s.knownHosts, // field name as declared in §6.1
		"-o", "StrictHostKeyChecking=yes",
		"-o", "BatchMode=yes",
		"-o", "ConnectTimeout=3",
		"-o", "ServerAliveInterval=10",
		"-o", "ServerAliveCountMax=3",
		s.sshUser + "@" + s.sshHost,
		"--",
	}
	sshArgs = append(sshArgs, args...)
	c, cancel := context.WithTimeout(ctx, 70*time.Second) // shim has its own 60s; +10s for SSH overhead
	defer cancel()
	cmd := exec.CommandContext(c, "ssh", sshArgs...)
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("paliadin: ssh shim %v: %w (stderr: %s)", args, err, stderr.String())
	}
	return stdout.Bytes(), nil
}
```
`RunTurn` becomes:
```go
func (s *RemotePaliadinService) RunTurn(ctx context.Context, req TurnRequest) (*TurnResult, error) {
	s.turnMu.Lock()
	defer s.turnMu.Unlock()
	if err := s.healthGate(ctx); err != nil {
		return nil, err // ErrMRiverUnreachable, picked up by handler
	}
	turnID := uuid.New()
	started := time.Now().UTC()
	if err := s.insertTurnRow(ctx, /* … */); err != nil { return nil, err }
	// First-turn-only: bootstrap the system prompt on mRiver. Tracked by a
	// per-process flag, so it runs once after each paliad restart (§6.4).
	if err := s.ensureBootstrapped(ctx); err != nil {
		_ = s.markTurnError(ctx, turnID, "bootstrap_failed")
		return nil, err
	}
	msg := sanitiseForTmux(req.UserMessage)
	msgB64 := base64.StdEncoding.EncodeToString([]byte(msg))
	body, err := s.callShim(ctx, "run-turn", turnID.String(), msgB64)
	if err != nil {
		_ = s.markTurnError(ctx, turnID, classifySSHError(err))
		return nil, err
	}
	// Same trailer-parse + audit-row writes as Local, factored into shared helper.
	return s.completeTurnFromBody(ctx, turnID, started, string(body))
}
```
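`RunTurn` calls `classifySSHError`, which is not spelled out here. One possible shape, sketched under assumptions: the exit statuses come from the shim in §5.4 (124 for a response timeout, 2 for a bad verb or `turn_id`) and from OpenSSH itself, which exits with 255 when the connection or authentication fails. Apart from `mriver_unreachable`, the error-code strings are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// classifyExitCode maps an ssh process exit status to an audit error code.
func classifyExitCode(code int) string {
	switch code {
	case 255:
		return "mriver_unreachable" // ssh's own failure: connect, host key, or auth
	case 124:
		return "response_timeout" // shim: no response file within TIMEOUT_S
	case 2:
		return "bad_request" // shim: unknown verb or invalid turn_id
	default:
		return "shim_failed" // any other non-zero status from the shim
	}
}

// classifySSHError unwraps the error returned by callShim. It depends on
// callShim wrapping the exec error with %w so errors.As can reach it.
func classifySSHError(err error) string {
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return classifyExitCode(exitErr.ExitCode())
	}
	return "ssh_exec_failed" // ssh never ran, e.g. binary missing from the image
}

func main() {
	for _, code := range []int{255, 124, 2, 1} {
		fmt.Printf("exit %d -> %s\n", code, classifyExitCode(code))
	}
	fmt.Println(classifySSHError(errors.New(`exec: "ssh": executable file not found in $PATH`)))
}
```

Keeping the exit-status table in a pure function makes it easy to unit-test without spawning a process.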
### 6.4 System prompt bootstrap
The local PoC calls `paliadinSystemPrompt(s.responseDir)` once when it creates the pane. The remote path needs the same hook. Two options that don't require duplicating the German prompt body to mRiver:
- **Lazy bootstrap (chosen):** the first `RunTurn` after a paliad-prod restart sends the system prompt via `bootstrap` RPC, then runs the actual turn. Subsequent turns skip the bootstrap. State is per-process: `RemotePaliadinService.bootstrapped` boolean guarded by mutex.
- Eager bootstrap at startup is rejected — it forces every container start to wait for mRiver to be online, which couples paliad's boot to mRiver's availability.
Lazy bootstrap means the very first turn after a paliad redeploy pays a ~3 s extra cost (claude pane spin-up + system prompt absorb). Acceptable for a single-user PoC.
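A minimal sketch of the lazy-bootstrap guard follows. It assumes a `shimCaller` function type shaped like `callShim` from §6.3; the `bootstrapper` struct and the `bootstrapCalls` demo helper are hypothetical and exist only to make the behaviour observable.

```go
package main

import (
	"context"
	"encoding/base64"
	"errors"
	"fmt"
	"sync"
)

// shimCaller has the same shape as callShim in §6.3.
type shimCaller func(ctx context.Context, args ...string) ([]byte, error)

type bootstrapper struct {
	mu           sync.Mutex
	bootstrapped bool // per-process: reset on every paliad restart
	call         shimCaller
	prompt       string
}

// ensureBootstrapped sends the system prompt once per process. On failure the
// flag stays false, so the next turn retries the bootstrap.
func (b *bootstrapper) ensureBootstrapped(ctx context.Context) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.bootstrapped {
		return nil
	}
	enc := base64.StdEncoding.EncodeToString([]byte(b.prompt))
	if _, err := b.call(ctx, "bootstrap", enc); err != nil {
		return fmt.Errorf("paliadin: bootstrap: %w", err)
	}
	b.bootstrapped = true
	return nil
}

// bootstrapCalls simulates n turns against a fake shim whose first failFirst
// bootstrap attempts fail, and returns how many bootstrap RPCs were sent.
func bootstrapCalls(n, failFirst int) int {
	calls := 0
	fake := func(ctx context.Context, args ...string) ([]byte, error) {
		calls++
		if calls <= failFirst {
			return nil, errors.New("mriver offline")
		}
		return []byte("ok\n"), nil
	}
	b := &bootstrapper{call: fake, prompt: "system prompt placeholder"}
	for i := 0; i < n; i++ {
		_ = b.ensureBootstrapped(context.Background())
	}
	return calls
}

func main() {
	fmt.Println("5 turns, healthy shim:", bootstrapCalls(5, 0), "bootstrap RPC(s)")
	fmt.Println("5 turns, 2 failures:  ", bootstrapCalls(5, 2), "bootstrap RPC(s)")
}
```

Setting the flag only after a successful RPC is the design choice that matters: a failed bootstrap must not leave later turns running against a pane that never received the prompt.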
### 6.5 Health-check gating (`mriver_unreachable`)
Every `RunTurn` first calls `healthGate(ctx)`:
- Cached for 10 s. If last check was <10 s ago and was OK, skip the probe.
- Otherwise: `s.callShim(ctx, "health")` with a 3 s timeout. On success, set cache OK; on failure, return `ErrMRiverUnreachable`.
Why 10 s: short enough that "I just woke my laptop" propagates inside one user retry; long enough that a busy chat doesn't probe on every turn.
```go
var ErrMRiverUnreachable = errors.New("paliadin: mriver unreachable")
func (s *RemotePaliadinService) healthGate(ctx context.Context) error {
	s.healthMu.Lock()
	defer s.healthMu.Unlock()
	if s.healthOK && time.Since(s.healthCheckedAt) < 10*time.Second {
		return nil
	}
	c, cancel := context.WithTimeout(ctx, 3*time.Second)
	defer cancel()
	out, err := s.callShim(c, "health")
	s.healthCheckedAt = time.Now()
	reply := strings.TrimSpace(string(out))
	if err != nil || reply != "ok" {
		s.healthOK = false
		// err can be nil here when the shim answered with something other than "ok".
		return fmt.Errorf("%w: reply=%q err=%v", ErrMRiverUnreachable, reply, err)
	}
	s.healthOK = true
	return nil
}
```
### 6.6 Friendly error code (extends t-paliad-150)
`friendlyErrorMessage` already maps `tmux_unavailable` to a localised message. We add one new code:
- `mriver_unreachable` DE: *"mRiver ist offline — Paliadin nicht erreichbar. Mach mRiver an, oder nutze Paliadin lokal mit `./paliad`."* / EN: *"mRiver is offline — Paliadin can't reach it. Wake mRiver, or run Paliadin locally with `./paliad`."*
Implementation: one new `case` in the SSE-error switch in `frontend/src/client/paliadin.ts`'s `friendlyErrorMessage`, plus matching i18n keys (`paliadin.error.mriver_unreachable.de` / `.en`). Server-side: `paliadin` HTTP handler maps `errors.Is(err, services.ErrMRiverUnreachable)` to `event: error\ndata: {"code":"mriver_unreachable","message":"..."}\n\n`.
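The server-side mapping might be sketched like this. The SSE wire format and the `mriver_unreachable` code are taken from the paragraph above; the `internal` fallback code, the English message strings, and the helper names are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

var ErrMRiverUnreachable = errors.New("paliadin: mriver unreachable")

// sseError is the JSON payload of an `event: error` frame.
type sseError struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// toSSEError picks the client-facing code. The message is only a fallback;
// the frontend localises by code.
func toSSEError(err error) sseError {
	if errors.Is(err, ErrMRiverUnreachable) {
		return sseError{Code: "mriver_unreachable", Message: "mRiver is offline"}
	}
	return sseError{Code: "internal", Message: "unexpected error"} // hypothetical fallback
}

// writeSSEError emits one server-sent event terminated by a blank line.
func writeSSEError(w io.Writer, err error) error {
	payload, jerr := json.Marshal(toSSEError(err))
	if jerr != nil {
		return jerr
	}
	_, werr := fmt.Fprintf(w, "event: error\ndata: %s\n\n", payload)
	return werr
}

// renderSSEError is a convenience wrapper for tests and demos.
func renderSSEError(err error) string {
	var sb strings.Builder
	_ = writeSSEError(&sb, err)
	return sb.String()
}

func main() {
	// healthGate wraps the sentinel with %w, so errors.Is still matches.
	wrapped := fmt.Errorf("%w: ssh: connect timed out", ErrMRiverUnreachable)
	fmt.Print(renderSSEError(wrapped))
}
```

Sending a fixed message instead of `err.Error()` avoids leaking ssh stderr (host names, key paths) to the browser.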
### 6.7 Rate limit
A runaway loop on the paliad side could DOS the SSH connection. Cheapest cap: enforce one in-flight turn at a time via `turnMu` (already exists in the local PoC). On top of that, a rolling cap of N=20 turns/min in `RemotePaliadinService` rejects with `ErrRateLimited` (mapped to a friendly `paliadin.error.rate_limited`). PoC has one user (m); the cap is a paranoid safety, not a real throttle.
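The rolling cap can be sketched as a small sliding-window limiter. This is one possible implementation, not necessarily the one that ships; `turnLimiter` and `countAllowed` are hypothetical names, and the clock is passed in so the behaviour is deterministic in tests.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var ErrRateLimited = errors.New("paliadin: rate limited")

// turnLimiter allows at most limit turns within any sliding window.
type turnLimiter struct {
	mu     sync.Mutex
	limit  int
	window time.Duration
	recent []time.Time // start times of admitted turns, oldest first
}

func newTurnLimiter(limit int, window time.Duration) *turnLimiter {
	return &turnLimiter{limit: limit, window: window}
}

// allow records a turn starting at now, or returns ErrRateLimited.
func (l *turnLimiter) allow(now time.Time) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	cutoff := now.Add(-l.window)
	i := 0
	for i < len(l.recent) && !l.recent[i].After(cutoff) {
		i++ // drop entries that have aged out of the window
	}
	l.recent = l.recent[i:]
	if len(l.recent) >= l.limit {
		return ErrRateLimited
	}
	l.recent = append(l.recent, now)
	return nil
}

// countAllowed replays turns at the given second offsets and counts admissions.
func countAllowed(limit, windowSec int, offsetsSec ...int) int {
	l := newTurnLimiter(limit, time.Duration(windowSec)*time.Second)
	base := time.Unix(0, 0)
	allowed := 0
	for _, off := range offsetsSec {
		if l.allow(base.Add(time.Duration(off)*time.Second)) == nil {
			allowed++
		}
	}
	return allowed
}

func main() {
	burst := make([]int, 25) // 25 turns in the same second
	fmt.Println("admitted from a burst of 25:", countAllowed(20, 60, burst...))
}
```

With `limit=20` and a 60 s window, a burst of 25 turns admits the first 20 and rejects the rest until older entries age out.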
### 6.8 What about ControlMaster?
Decision-2's chosen path (server-side shim with one RPC per turn) makes ControlMaster optional. The shim collapses ~10 raw-tmux ops into a single SSH connect that's already the latency win ControlMaster would buy.
Adding it on top would save ~30–50 ms per turn but adds:
- A persistent `~/.ssh/cm-*` socket inside the container.
- Cleanup logic on shutdown.
- A subtle interaction with the SSH BatchMode + ConnectTimeout settings.
Verdict: skip ControlMaster in v1. If turn latency over Tailscale is measured >300 ms in practice and hot enough to matter, add it in a follow-up; the call site is one helper.
---
## 7. Phasing
### Phase A — manual proof-of-concept (no Dockerfile change yet)
Goal: validate the round-trip end-to-end on a deployed paliad, before touching the image.
**Phase A.0 (DONE 2026-05-07 23:31):** SSH+shim end-to-end on the tailnet.
1. ✅ **Generate keypair** on mRiver: `ssh-keygen -t ed25519 -N "" -C "paliad-prod" -f ~/.paliad-staging/paliad-prod-key`. Fingerprint `SHA256:5uV8v872F/IhJycjjq0crFue/emAYfw71N9bxTvkl9c`.
2. ✅ **Commit shim** to `scripts/paliadin-shim` and **install** at `/home/m/.local/bin/paliadin-shim`, `chmod 755`.
3. ✅ **Write authorized_keys** with public key + `command=`/`from="100.99.98.201"`/no-pty/no-port-forwarding/no-agent-forwarding/no-X11-forwarding/no-user-rc restrictions (§5.2).
4. ✅ **Add port 22022 socket drop-in** at `/etc/systemd/system/ssh.socket.d/paliad.conf`, `systemctl daemon-reload && systemctl restart ssh.socket`. Both `:22` (Tailscale SSH for m) and `:22022` (real OpenSSH for paliad) listening (§4.5).
5. ✅ **Capture mRiver:22022 host key**: `ssh-keyscan -p 22022 -t ed25519 100.99.98.203 > ~/.paliad-staging/known_hosts` from mLake. Fingerprint `SHA256:HPoUzy60Cb8yLERIBQcB2mHihNST3NaTODx5Ypd1XpA`.
6. ✅ **Smoke-test from mLake** (without paliad container, just raw ssh from mLake's host shell):
```
ssh -F /dev/null -i /tmp/paliad-prod-key -o UserKnownHostsFile=/tmp/paliad-known_hosts \
-o StrictHostKeyChecking=yes -o IdentitiesOnly=yes -o BatchMode=yes \
-p 22022 m@100.99.98.203 health
→ ok
ssh … run-turn $(uuidgen) "$(printf 'Sag …' | base64 -w0)"
→ "test ok" (3.4 s round-trip including a real Claude response)
```
7. ✅ **from= rejection verified**: the same key from mRiver itself (`100.99.98.203`) → `Permission denied (publickey,password)` as expected.
**Phase A.5 (PENDING m's hands):** validate `network_mode: host` + traefik routing on prod paliad.de.
- Branch the live `docker-compose.yml` on a temp branch.
- Add `network_mode: host` to the `web` service; remove `expose: ["8080"]`.
- Push to trigger a Dokploy redeploy.
- `curl --connect-timeout 5 -sSI https://paliad.de/` — expect 200 (or login redirect), NOT 502.
- If 502: revert the temp branch (`git revert HEAD && git push`); revisit decision 1 in a follow-up issue.
- If 200: keep the host-mode change; ready for Phase B.
This is **m's call to execute** — it briefly touches prod paliad.de. Inventor/coder should not flip prod compose without explicit go-ahead. Rollback is one revert + redeploy.
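The A.5 pass/fail rule can also be written down as a small check, which may help if the step is repeated after later compose changes. This is a sketch: `edgeVerdict` and `probeEdge` are hypothetical names, and the "investigate" bucket for statuses the design does not mention is an assumption. Like `curl -sSI`, the probe sends a HEAD request and does not follow redirects.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// edgeVerdict applies the Phase A.5 rule to an HTTP status from the edge.
func edgeVerdict(status int) string {
	switch {
	case status == http.StatusBadGateway:
		return "rollback" // traefik cannot reach paliad under host mode (M1)
	case status >= 200 && status < 400:
		return "proceed" // 200 or a login redirect (M2)
	default:
		return "investigate" // not covered by the design; assumption
	}
}

// probeEdge sends a HEAD request without following redirects.
func probeEdge(url string) (int, error) {
	client := &http.Client{
		Timeout: 5 * time.Second,
		CheckRedirect: func(*http.Request, []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	resp, err := client.Head(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	// Offline demo on sample statuses; a live check would call
	// probeEdge("https://paliad.de/") and feed the result to edgeVerdict.
	for _, s := range []int{200, 302, 502, 404} {
		fmt.Printf("%d -> %s\n", s, edgeVerdict(s))
	}
}
```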
**Phase A.6 (after A.5 passes):** smoke-test SSH from inside the paliad-prod container itself (the real container, not just the mLake host shell):
```
docker exec -it <paliad-container> sh
apk add --no-cache openssh-client # one-shot, before Dockerfile change
ssh -F /dev/null -i /tmp/paliad-prod-key -o UserKnownHostsFile=/tmp/paliad-known_hosts \
-o StrictHostKeyChecking=yes -o IdentitiesOnly=yes -o BatchMode=yes \
-p 22022 m@100.99.98.203 health
# expected: "ok"
```
This proves the container's host-mode networking actually delivers a tailnet connect.
**Phase A.7:** wire env vars manually via Dokploy UI for one deploy; confirm `/paliadin` chat works against mRiver from paliad.de.
If A.5 fails: the design rolls back to a sidecar in a new issue (decision 1 follow-up). The SSH path (A.0) and traefik path (A.5) are independent — A.0 is already proven; only A.5+ is at risk.
### Phase B — bake into Dockerfile + Dokploy secrets
1. Dockerfile: add `openssh-client` to the final stage (§4.3).
2. compose: add `network_mode: host` and the five new env vars (§4.1).
3. Dokploy secrets: register `PALIADIN_REMOTE_HOST=100.99.98.203`, `PALIADIN_REMOTE_PORT=22022`, `PALIADIN_REMOTE_USER=m`, `PALIADIN_SSH_PRIVATE_KEY=...`, `PALIADIN_KNOWN_HOSTS=...`.
4. Code: refactor `PaliadinService` to the interface split (§6.1–§6.2). New file `internal/services/paliadin_remote.go`. Tests: `paliadin_remote_test.go` mocks `callShim` to verify `RunTurn` audit-row writes, error mapping, and `healthGate` caching.
5. Ship under one PR; tag t-paliad-151 done.
### Phase C — friendly errors + monitoring
1. `paliadin.error.mriver_unreachable` i18n keys + `friendlyErrorMessage` case (§6.6).
2. `/admin/paliadin` shows last health-probe result + last successful turn timestamp.
3. Optional: `mai-mesh` integration to surface mRiver-offline events to m on Telegram (out-of-band; not gating).
---
## 8. Security review summary
| Risk | Mitigation |
|---|---|
| Stolen private key → arbitrary SSH on mRiver | `command=` shim restriction + `from="100.99.98.201"` + ed25519 key + private key only in Dokploy secret store (encrypted at rest); paliad route uses port 22022 where real OpenSSH enforces all of the above |
| Stolen private key → tailnet-wide SSH from non-mLake host | `from="100.99.98.201"` clause (verified: rejected from mRiver itself in Phase A.0) |
| Tailscale SSH on `:22` bypasses `authorized_keys` | The paliad-prod key's `command=` restriction is not enforced on `:22`. Mitigation: paliad always dials `:22022`, which is real OpenSSH. m's interactive `tailscale ssh m@mriver` on `:22` continues to be governed by Tailscale ACLs, separate from paliad's identity. |
| Container compromise → key extraction | Key written to tmpfile chmod 600, only root inside container can read; alpine container has no shell-on-error trampolines |
| Host-key MITM during connect | Pinned `known_hosts`; `StrictHostKeyChecking=yes` |
| Shim argument injection (e.g. via `run-turn $(rm -rf /)`) | Shim parses positional args from `$SSH_ORIGINAL_COMMAND` via `read -r -a`; never passes args to a subshell `eval`; turn_id validated by UUID regex; message body always base64-decoded into a single shell variable, never re-evaluated |
| Runaway loop → SSH flood | Single-flight `turnMu` + 20/min rolling cap |
| `network_mode: host` widens blast radius | The `command=` + `from=` restrictions on mRiver mean container compromise = "can run shim verbs against mRiver only", not "shell on mRiver" |
| PaliadinOwnerEmail bypass | Unchanged from PoC: gate is in Go (`/paliadin` 404s for any other user). Even if mRiver SSH key leaks, attacker still needs paliad session as `m@hoganlovells.com`. |
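The "single-flight `turnMu` + 20/min rolling cap" row can be pictured as follows. This is a minimal sketch under assumptions: the `turnLimiter` type, the `errRateLimited` value, and the choice to count a turn when it starts (rather than when it is queued) are illustrative, not taken from the existing PoC code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errRateLimited is a hypothetical error for turns rejected by the cap.
var errRateLimited = errors.New("rate_limited")

// turnLimiter serialises turns and enforces a rolling per-window cap.
type turnLimiter struct {
	turnMu sync.Mutex // single-flight: one SSH turn at a time
	mu     sync.Mutex // guards stamps
	stamps []time.Time
	max    int
	window time.Duration
}

// allow drops timestamps older than the window, then admits the turn only
// if fewer than max turns remain inside the window.
func (l *turnLimiter) allow(now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	cutoff := now.Add(-l.window)
	kept := l.stamps[:0]
	for _, t := range l.stamps {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	l.stamps = kept
	if len(l.stamps) >= l.max {
		return false
	}
	l.stamps = append(l.stamps, now)
	return true
}

// Do runs fn under the single-flight lock if the rolling cap permits it.
func (l *turnLimiter) Do(now time.Time, fn func() error) error {
	l.turnMu.Lock()
	defer l.turnMu.Unlock()
	if !l.allow(now) {
		return errRateLimited
	}
	return fn()
}

// demoAllowed fires 21 turns at one instant, then one more 61 s later.
func demoAllowed() (inBurst int, afterWindow bool) {
	l := &turnLimiter{max: 20, window: time.Minute}
	start := time.Unix(1000, 0)
	noop := func() error { return nil }
	for i := 0; i < 21; i++ {
		if l.Do(start, noop) == nil {
			inBurst++
		}
	}
	afterWindow = l.Do(start.Add(61*time.Second), noop) == nil
	return inBurst, afterWindow
}

func main() {
	n, ok := demoAllowed()
	fmt.Println("admitted in burst:", n, "admitted after window:", ok)
}
```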
---
## 9. Out-of-scope clarifications (for review)
These were called out in the issue but the design intentionally does not solve them, to keep v1 tight. Each is acknowledged so review knows it wasn't an oversight:
- **Wake-on-LAN of mRiver:** out of scope. v1's UX when mRiver is asleep is the friendly error from §6.6. Future work: integrate with `mai-mesh` capability fallback.
- **Multi-host failover:** out of scope. Only mRiver is targeted.
- **Anthropic API fallback when mRiver offline:** out of scope per CLAUDE.md (`ANTHROPIC_API_KEY` reserved for production-v1, unused in PoC).
- **ControlMaster:** v1 ships without; revisit if turn latency >300 ms in practice (§6.8).
---
## 10. File-level deliverables (for the coder shift)
When this design is approved and the coder shift starts, the work splits roughly into:
- `Dockerfile` — `+openssh-client`.
- `docker-compose.yml` — `network_mode: host`, five new env entries (`PALIADIN_REMOTE_HOST`, `PALIADIN_REMOTE_PORT`, `PALIADIN_REMOTE_USER`, `PALIADIN_SSH_PRIVATE_KEY`, `PALIADIN_KNOWN_HOSTS`).
- `internal/services/paliadin.go` — extract `Paliadin` interface; rename existing to `LocalPaliadinService`; pull DB-only methods (`ListRecentTurns`, `Stats`, `IsOwner`) into a shared embedded `paliadinDB` so both implementations get them for free.
- `internal/services/paliadin_remote.go` — new file: `RemotePaliadinService`, `RemotePaliadinConfig` (with `SSHPort`), `callShim`, `healthGate`, `ensureBootstrapped`, `classifySSHError`, `ErrMRiverUnreachable`.
- `internal/services/paliadin_remote_test.go` — unit tests with a mocked `callShim`.
- `cmd/server/main.go` — env-var-based wiring (§6.2), `loadPaliadinSSHKey`, `loadPaliadinKnownHosts`, `PALIADIN_REMOTE_PORT` parse with default `22022`.
- `frontend/src/client/paliadin.ts` — one `case` in `friendlyErrorMessage` for `mriver_unreachable`.
- `frontend/src/i18n.ts` — two new keys (`paliadin.error.mriver_unreachable.de` / `.en`).
- `scripts/paliadin-shim` — server-side script (§5.4); already shipped + installed on mRiver during Phase A.0, not part of any container. Repo location chosen so the security-relevant script is version-controlled.
- `docs/project-status.md` — note Phase 0.5 (PoC) → Phase 0.6 (Tailscale-SSH prod route).
- **mRiver host setup (one-time, already done in Phase A.0):** `/etc/systemd/system/ssh.socket.d/paliad.conf` (port 22022 listen drop-in); `~/.ssh/authorized_keys` (paliad-prod public key with restrictions); `/home/m/.local/bin/paliadin-shim` (executable). These are NOT in the repo because they live on m's laptop; `docs/project-status.md` should reference them.
No DB migrations needed — `paliad.paliadin_turns` schema already covers everything (`error_code` field already accepts free-form codes including `mriver_unreachable`).
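For the `cmd/server/main.go` item above, the `PALIADIN_REMOTE_PORT` parse with its `22022` default might be as small as the sketch below. The function name `parseRemotePort` and the exact error wording are assumptions for this example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// defaultRemotePort is the OpenSSH port that Tailscale SSH does not intercept.
const defaultRemotePort = 22022

// parseRemotePort returns the default for an unset or empty value and rejects
// anything that is not a valid TCP port.
func parseRemotePort(raw string) (int, error) {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return defaultRemotePort, nil
	}
	p, err := strconv.Atoi(raw)
	if err != nil || p < 1 || p > 65535 {
		return 0, fmt.Errorf("PALIADIN_REMOTE_PORT: invalid port %q", raw)
	}
	return p, nil
}

func main() {
	port, err := parseRemotePort(os.Getenv("PALIADIN_REMOTE_PORT"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("paliadin remote port:", port)
}
```

Failing fast on a malformed value at startup avoids silently falling back to `:22`, where the `command=` restriction would not apply.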
---
## 11. Open questions for review
- **Q (m), still open:** Phase A.5 (traefik+host-mode on prod paliad.de) is not yet executed. m drives this; rollback is one revert. Dokploy doc check before flipping is recommended but not blocking.
- **Q (m), resolved 2026-05-07 23:50:** shim location → repo (`scripts/paliadin-shim`, committed in `0248411`). Version-controlled and auditable.
- **Q (m), still open:** `ANTHROPIC_API_KEY` env var reservation in compose comments — keep for production-v1, or strip now? Not blocking either phase; defer.
---
## 12. Phase A.0 completion summary (2026-05-07 23:50)
**Coder shift (noether) executed Phase A.0 in full:**
1. ✅ shim committed at `scripts/paliadin-shim` (commit `0248411`, repo-version-controlled)
2. ✅ shim installed at `/home/m/.local/bin/paliadin-shim` on mRiver
3. ✅ ed25519 keypair `paliad-prod` generated, public-key fingerprint `SHA256:5uV8v872F/IhJycjjq0crFue/emAYfw71N9bxTvkl9c`, private key staged at `~/.paliad-staging/paliad-prod-key` on mRiver (mode 600)
4. ✅ `~/.ssh/authorized_keys` written with `command=`/`from=`/no-pty/no-port-forwarding/no-agent-forwarding/no-X11-forwarding/no-user-rc restrictions
5. ✅ `ssh.socket` drop-in installed at `/etc/systemd/system/ssh.socket.d/paliad.conf`; both `:22` and `:22022` listening
6. ✅ host key for `:22022` captured at `~/.paliad-staging/known_hosts` (fingerprint `SHA256:HPoUzy60Cb8yLERIBQcB2mHihNST3NaTODx5Ypd1XpA`)
7. ✅ end-to-end SSH+shim+Claude run-turn validated from mLake → mRiver:22022 (3.4 s round-trip)
8. ✅ `from="100.99.98.201"` rejection verified
**Two secrets and three plain env vars ready for Dokploy registration** (m to copy from `~/.paliad-staging/` on mRiver):
- `PALIADIN_SSH_PRIVATE_KEY` ← `cat ~/.paliad-staging/paliad-prod-key`
- `PALIADIN_KNOWN_HOSTS` ← `cat ~/.paliad-staging/known_hosts`
- `PALIADIN_REMOTE_HOST=100.99.98.203`, `PALIADIN_REMOTE_PORT=22022`, `PALIADIN_REMOTE_USER=m`
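Once these values arrive as environment variables, the container still needs the key and `known_hosts` as files for `ssh -i` and `UserKnownHostsFile`. The sketch below shows one way `loadPaliadinSSHKey` could materialise a secret into a mode-600 temp file. `writeSecretFile` and `demoSecretFile` are hypothetical helpers. The trailing-newline fix is an assumption: it guards against secret stores that strip the final newline, which OpenSSH may reject as an invalid key format.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// writeSecretFile writes content to a new temp file readable only by its
// owner and returns the file path. An empty dir means the default temp dir.
func writeSecretFile(dir, pattern, content string) (string, error) {
	f, err := os.CreateTemp(dir, pattern)
	if err != nil {
		return "", err
	}
	defer f.Close()
	// CreateTemp already uses 0600; the explicit chmod documents the intent.
	if err := f.Chmod(0o600); err != nil {
		os.Remove(f.Name())
		return "", err
	}
	if !strings.HasSuffix(content, "\n") {
		content += "\n"
	}
	if _, err := f.WriteString(content); err != nil {
		os.Remove(f.Name())
		return "", err
	}
	return f.Name(), nil
}

// demoSecretFile writes a placeholder secret and reports the resulting
// permission bits and file content, then removes the file.
func demoSecretFile() (string, string, error) {
	path, err := writeSecretFile("", "paliad-key-*", "PLACEHOLDER")
	if err != nil {
		return "", "", err
	}
	defer os.Remove(path)
	info, err := os.Stat(path)
	if err != nil {
		return "", "", err
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return "", "", err
	}
	return info.Mode().Perm().String(), string(data), nil
}

func main() {
	mode, _, err := demoSecretFile()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("secret file mode:", mode)
}
```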
**Phase A.5 (traefik+host-mode test) and Phase A.6/A.7 (in-container SSH smoke + paliad/paliadin end-to-end) await m's hands** — they touch prod paliad.de.
**Phase B (Dockerfile + Go interface split + Dokploy secrets) is unblocked from a code perspective** — but should not merge until Phase A.5 confirms the host-mode networking trade-off is acceptable.
---
**Inventor design + coder Phase A.0 complete.** Awaiting m for Phase A.5 traefik validation before the coder writes the Go interface split.