m f952fb85c3 design(t-paliad-151) amend: port 22022 bypass + Phase A.0 results

Phase A.0 revealed Tailscale SSH on mRiver intercepts :22 from tailnet
peers and bypasses OpenSSH's authorized_keys entirely (banner
"SSH-2.0-Tailscale", auth method "none", command= restriction never
fires). The fix is port 22022 via a systemd ssh.socket drop-in:
Tailscale SSH only intercepts :22, so :22022 hits real OpenSSH where
the design's command=/from= shim restriction works as specified.

Updated:
- §3 locked decisions: row 5 added (port 22022, m's call 23:35)
- §4.5 new subsection: Tailscale SSH bypass via socket drop-in
  + records the "Address already in use" first-attempt failure as a
  "don't retry without cleaning sshd_config Port directives first"
  lesson
- §5.2/5.3: ssh-keyscan now uses -p 22022; known_hosts is host:port
  keyed for non-22 ports
- §6.1/6.2/6.3: SSHPort field on RemotePaliadinService config, -p
  flag in callShim, PALIADIN_REMOTE_PORT env (default 22022)
- §7 phasing: A.0 completion checked off step-by-step with concrete
  fingerprints; A.5/A.6/A.7 split out as m-driven
- §8 security: Tailscale-SSH-on-:22 risk explicitly tabled with
  port-22022 mitigation
- §10 deliverables: mRiver host-setup artifacts noted
- §12 new Phase A.0 completion summary with the three secrets m
  needs to register in Dokploy

Phase A.0 verified end-to-end:
- ssh -p 22022 paliad-prod-key m@mriver health → ok
- run-turn UUID base64msg → 3.4 s including a real Claude response
- from="100.99.98.201" correctly rejects connections from mRiver
  itself

mRiver host state in place (not in repo): authorized_keys with
restrictions, /home/m/.local/bin/paliadin-shim, ssh.socket drop-in.
Three secrets staged at ~/.paliad-staging/ on mRiver for m to copy
into Dokploy: paliad-prod-key (PALIADIN_SSH_PRIVATE_KEY),
known_hosts (PALIADIN_KNOWN_HOSTS), and the three plain env vars.

Refs m/paliad#12

2026-05-07 23:37:26 +02:00


Paliadin: route prod via Tailscale SSH to mRiver

Issue: m/paliad#12 (t-paliad-151)
Date: 2026-05-07
Author: noether (inventor)
Supersedes: nothing. Extends docs/design-paliadin-2026-05-07.md (the Phase 0 PoC) with a third deployment path between "laptop-only PoC" and "Anthropic API direct".
Related: t-paliad-146 (PoC ship), t-paliad-150 (friendlyErrorMessage pattern).

1. Goal

Make Paliadin reachable from paliad.de (Dokploy on mLake) without losing m's Claude Code subscription, by routing each turn over Tailscale + SSH from the paliad container to mRiver, where the existing long-lived tmux + claude PoC keeps running.

Non-goals (v1):

Multi-host failover.
Encryption beyond SSH-over-tailnet (already E2E-encrypted by Tailscale's WireGuard layer).
Anthropic API fallback when mRiver is offline — show a friendly error instead.
Wake-on-LAN of mRiver.
Multi-tenant or multi-firm variants.

2. Live state — what was verified before designing

A design built on stale facts rots fast. These were probed on 2026-05-07, not assumed from CLAUDE.md or memory:

Fact	How verified	Result
mRiver = `100.99.98.203`, has tmux + claude	this worker runs on mRiver; `tmux -V` → `tmux 3.6a`; `which claude` → `/home/m/.local/bin/claude`	confirmed
mLake (`100.99.98.201`) has Tailscale running	`ssh m@mlake tailscale status`	confirmed; mRiver visible as `active; direct [2a02:4780:41:3fbc::1]:41641`
paliad container Dockerfile is alpine:3.21 minimal, no SSH, no tailscaled	`Dockerfile`	confirmed (only `ca-certificates`)
paliad compose runs default Docker bridge (no `network_mode`)	`docker-compose.yml`	confirmed
mRiver has no `~/.ssh/authorized_keys` yet	`ls ~/.ssh/`	confirmed — file must be created in Phase A
`/tmp/paliadin/` does not exist on mRiver yet	`ls /tmp/paliadin`	confirmed — created on first turn (paliadin.go:185 `os.MkdirAll`)
`paliad-paliadin` tmux session is not currently running on mRiver	`tmux ls`	not present; the existing PoC creates it on demand

Implication for design: the paliad container needs new infrastructure on three axes — network reachability of the tailnet, an SSH client + identity, and a service-layer code path that talks to a remote tmux instead of a local one. Each axis is its own sub-design below.

3. Locked decisions (m, 2026-05-07 22:35)

m made four design-shaping calls via the inventor's AskUserQuestion pass; a fifth (row 5) was added during Phase A. They are recorded here verbatim because every downstream choice in §4–§6 follows from them.

#	Question	m's choice
1	Container Tailscale shape	`network_mode: host` on paliad
2	SSH-to-mRiver protocol granularity	Server-side `paliadin-shim` (one RPC per turn)
3	Routing trigger	Env var `PALIADIN_REMOTE_HOST` + interface split
4	SSH private key storage	Dokploy secret env var `PALIADIN_SSH_PRIVATE_KEY`
5	SSH port to bypass Tailscale SSH	Port 22022 via `ssh.socket` drop-in (Phase A finding, 23:30)

Decision (1) was not the inventor's recommendation — host mode has known interaction risk with traefik (§4.2). m is overriding the recommendation; this design accepts the call and codifies a Phase A test step that gates the rollout on traefik still working under host mode. If Phase A blows up, the fallback is to revisit (1) in a follow-up issue, not to silently swap to a sidecar.

Decision (5) emerged during Phase A: Tailscale SSH on mRiver was found to intercept :22 from tailnet peers and bypass OpenSSH's authorized_keys entirely (banner says "Tailscale", auth method "none"). The command= shim restriction therefore never fires on the standard port. Adding port 22022 via a systemd ssh.socket drop-in routes paliad's connections to real OpenSSH, where the restriction works. m's interactive tailscale ssh m@mriver on :22 stays untouched. See §4.5 for the implementation.

4. Sub-design A — Container Tailscale shape

4.1 Shape: `network_mode: host`

paliad's container shares mLake's network namespace. tailscale0 (mLake's tailnet interface) is directly visible from inside the container. Outbound ssh m@100.99.98.203 reaches mRiver over the tailnet without any sidecar, userspace tailscaled, SOCKS proxy, or auth-key flow inside the container.

# docker-compose.yml diff
services:
  web:
    build: .
    network_mode: host           # NEW
    # remove: expose: ["8080"]   # host mode means port is on the host directly
    environment:
      - PORT=8080
      ...
      # NEW Paliadin remote-routing knobs
      - PALIADIN_REMOTE_HOST=${PALIADIN_REMOTE_HOST}      # 100.99.98.203
      - PALIADIN_REMOTE_PORT=${PALIADIN_REMOTE_PORT}      # 22022 (bypasses Tailscale SSH, see §4.5)
      - PALIADIN_REMOTE_USER=${PALIADIN_REMOTE_USER}      # m
      - PALIADIN_SSH_PRIVATE_KEY=${PALIADIN_SSH_PRIVATE_KEY}
      - PALIADIN_KNOWN_HOSTS=${PALIADIN_KNOWN_HOSTS}      # one-line ssh-keyscan -p 22022 output
    restart: unless-stopped

4.2 Trade-off accepted: traefik routing under host mode

paliad.de's TLS is provided by Dokploy's traefik on the dokploy-network overlay. With network_mode: host, paliad is no longer attached to that overlay. Two failure modes are possible:

(M1) traefik can't discover the service via Docker DNS → 502 at the edge.
(M2) traefik routes via host loopback (http://127.0.0.1:8080 or host.docker.internal) and works fine.

Recent Dokploy versions configure traefik with both loadbalancer.server.url and Docker labels; (M2) is the documented host-mode path. Phase A explicitly tests this (§7) before any code is written; if (M1) materialises, the design rolls back to the sidecar variant of decision 1 in a follow-up issue.

Other host-mode side-effects to flag in operations:

paliad listens on host port 8080 directly. Any other compose service binding 8080 conflicts.
paliad's outbound DNS uses host resolver (no Docker-internal web etc.). Currently fine: paliad's only network deps are external (Supabase, SMTP, GitHub raw). No service on dokploy-network is referenced by name.
The container can reach every Tailscale node, not just mRiver. Mitigations live in §5 (key restriction) and §5.2 (from= clause on mRiver authorized_keys).

4.3 Dockerfile diff

# Final stage adds the SSH client only. Tailscale is provided by the host.
FROM alpine:3.21
RUN apk add --no-cache ca-certificates openssh-client    # +openssh-client (~1MB)
WORKDIR /app
COPY --from=backend /paliad /app/paliad
COPY --from=frontend /app/frontend/dist /app/dist
EXPOSE 8080
CMD ["/app/paliad"]

Image-size delta: alpine openssh-client is ~1.1 MB compressed — negligible. No tailscaled, no entrypoint script, no extra processes inside the container.

4.4 What does NOT change

No Tailscale auth-key inside paliad. The container inherits the host's tailnet binding, so there is no per-container Tailscale identity to rotate. mLake's existing Tailscale auth is the only one in scope.
No tailscaled process inside the container.
No new sidecar container.

4.5 Bypassing Tailscale SSH via port 22022 (Phase A discovery)

Phase A revealed that Tailscale SSH on mRiver intercepts :22 from tailnet peers before OpenSSH sees the connection. The SSH banner reads SSH-2.0-Tailscale, the verbose log shows Authenticated using "none", and the authorized_keys command= directive is therefore inert. mRiver's tailscale status --json confirms the https://tailscale.com/cap/ssh capability is enabled.

The fix: a separate listening port for the paliad route, where Tailscale SSH does not intercept and real OpenSSH handles auth.

mRiver uses systemd socket activation for sshd (/usr/lib/systemd/system/ssh.socket binds :22). Setting Port 22022 in sshd_config is ignored under socket activation — listen ports come from the socket unit, not sshd's own config. The correct change is a drop-in:

# /etc/systemd/system/ssh.socket.d/paliad.conf
[Socket]
ListenStream=0.0.0.0:22022
ListenStream=[::]:22022

Followed by systemctl daemon-reload && systemctl restart ssh.socket. Both :22 (still routed through Tailscale SSH for m's interactive use) and :22022 (real OpenSSH) end up listening. The same sshd binary handles both — same host key, same authorized_keys, same sshd_config. The only difference is which port a peer dials.

A failed first attempt (2026-05-07 23:07) added the drop-in while a stale Port 22022 directive in sshd_config.d/99-paliad-test.conf was still bound — the resulting Address already in use took ssh.socket down for ~30 s until reverted. Lesson: clean any prior Port directives out of sshd_config.d/*.conf before retrying the socket drop-in.

Phase A end-to-end test (2026-05-07 23:31) succeeded with port 22022:

ssh -p 22022 -i paliad-prod-key m@100.99.98.203 health → ok
run-turn <uuid> <base64-msg> → 3.4 s round-trip including a Claude-Code response
from="100.99.98.201" correctly rejected a connection sourced from mRiver itself (Permission denied (publickey,password))
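The banner check described above can be repeated programmatically. The following is a minimal sketch, assuming the server sends its identification string as the first line (the SSH protocol allows other lines before it, which this sketch does not handle). The names readSSHBanner and isTailscaleSSH are hypothetical and not part of paliad.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"strings"
	"time"
)

// readSSHBanner dials addr and returns the first line the server sends,
// with the trailing line ending removed.
func readSSHBanner(addr string, timeout time.Duration) (string, error) {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return "", err
	}
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return "", err
	}
	return strings.TrimRight(line, "\r\n"), nil
}

// isTailscaleSSH reports whether the banner matches the one observed on :22.
func isTailscaleSSH(banner string) bool {
	return strings.HasPrefix(banner, "SSH-2.0-Tailscale")
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: bannerprobe host:port")
		return
	}
	banner, err := readSSHBanner(os.Args[1], 3*time.Second)
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Printf("%s tailscale=%v\n", banner, isTailscaleSSH(banner))
}
```

Run against 100.99.98.203:22 and 100.99.98.203:22022 from mLake, the two ports should report different banners if the drop-in is working as described.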

5. Sub-design B — SSH identity, restricted shim, host-key pinning

5.1 Identity: dedicated ed25519 keypair `paliad-prod`

One keypair, generated once on mRiver during Phase A, used by every paliad-prod deploy:

# On mRiver (Phase A bootstrap):
ssh-keygen -t ed25519 -N "" -C "paliad-prod $(date +%Y-%m-%d)" -f /tmp/paliad-prod-key
# Public key → mRiver authorized_keys (see 5.2)
# Private key → Dokploy secret store as PALIADIN_SSH_PRIVATE_KEY
shred -u /tmp/paliad-prod-key   # only the encrypted/secret-stored copies survive

Rotation: regenerate, push public key to mRiver authorized_keys, update Dokploy secret, redeploy. No code change needed — paliad's startup re-reads the env var on every boot.

The private key is delivered to the container as a multi-line env var. At process start, paliad writes it to a tmpfile so OpenSSH can use it:

// cmd/server/main.go (sketch)
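func loadPaliadinSSHKey() (string, error) {
    blob := os.Getenv("PALIADIN_SSH_PRIVATE_KEY")
    if blob == "" { return "", nil }    // remote mode disabled
    // OpenSSH can reject a private key file that lacks a trailing newline, and
    // secret stores may strip it, so normalise before writing (needs "strings").
    blob = strings.TrimRight(blob, "\r\n") + "\n"
    f, err := os.CreateTemp("", "paliadin-id_ed25519-")
    if err != nil { return "", err }
    if err := os.Chmod(f.Name(), 0o600); err != nil { return "", err }
    if _, err := f.WriteString(blob); err != nil {
        f.Close(); os.Remove(f.Name())  // don't leave a partial key behind
        return "", err
    }
    if err := f.Close(); err != nil { return "", err }
    return f.Name(), nil    // path passed to RemotePaliadinService
}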

The tmpfile lives at /tmp/paliadin-id_ed25519-<rand> for the container's lifetime. On container restart, a fresh tmpfile is written. We never persist the key to a volume.

5.2 mRiver `authorized_keys` entry

command="/home/m/.local/bin/paliadin-shim",no-pty,no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-user-rc,from="100.99.98.201" ssh-ed25519 AAAA...PUBKEY... paliad-prod

Each restriction matters:

command= — every ssh m@mriver … invocation runs the shim regardless of what the client asked for. The client's requested command is exposed as $SSH_ORIGINAL_COMMAND for the shim to dispatch on.
no-pty,no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-user-rc — defence-in-depth: even if someone steals the key and bypasses the shim's argument validation, they can't get an interactive shell, can't tunnel ports, can't pivot via agent forwarding.
from="100.99.98.201" — only accept connections from mLake's tailnet IP. Defends against the "container has full tailnet visibility" host-mode side-effect from §4.2: if the key leaks off mLake, it can't be replayed from another tailnet host.

5.3 Host-key pinning

StrictHostKeyChecking=accept-new is too loose for a long-lived production identity (one-time MITM during first connect substitutes a different key forever). Instead:

During Phase A, run ssh-keyscan -p 22022 -t ed25519 100.99.98.203 on mLake.
Capture the single output line. The host-key portion is identical to the :22 entry — same sshd, same keys — but the [100.99.98.203]:22022 prefix matters because OpenSSH's known_hosts is host:port-keyed for non-22 ports.
Store as Dokploy secret PALIADIN_KNOWN_HOSTS.
At container startup, write to /tmp/paliadin-known_hosts chmod 644.
Pass to OpenSSH via -o UserKnownHostsFile=/tmp/paliadin-known_hosts -o StrictHostKeyChecking=yes.

If mRiver's host key ever rotates (rare; only on disk wipe / fresh OS), Phase A runs again and the secret is updated. SSH refuses to connect with a clear "host key changed" error, which surfaces as mriver_unreachable to the user — exactly the right blast-radius (loud failure, no silent connect to a substitute host).

5.4 The shim — `paliadin-shim`

A bash script on mRiver at /home/m/.local/bin/paliadin-shim. It is the only thing the paliad-prod key is allowed to invoke, and it dispatches on $SSH_ORIGINAL_COMMAND. Four RPCs (health, bootstrap, run-turn, reset):

#!/bin/bash
# paliadin-shim — server-side RPC for paliad's remote-tmux turns.
# Invoked via authorized_keys command= with $SSH_ORIGINAL_COMMAND set.
set -euo pipefail
umask 077

readonly TMUX_SESSION="${PALIADIN_TMUX_SESSION:-paliad-paliadin}"
readonly RESPONSE_DIR="${PALIADIN_RESPONSE_DIR:-/tmp/paliadin}"
readonly TIMEOUT_S=60
readonly TURN_ID_RE='^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'

mkdir -p "$RESPONSE_DIR"

# Parse $SSH_ORIGINAL_COMMAND. Format: "<verb> <arg1> <arg2> …"
read -r -a argv <<< "${SSH_ORIGINAL_COMMAND:-}"
verb="${argv[0]:-}"

ensure_pane() {
  if ! tmux has-session -t "$TMUX_SESSION" 2>/dev/null; then
    tmux new-session -d -s "$TMUX_SESSION"
  fi
  # Find or create the @paliadin-scope=chat window.
  local target=""
  while read -r idx; do
    scope=$(tmux show-window-option -t "$TMUX_SESSION:$idx" -v @paliadin-scope 2>/dev/null || true)
    if [[ "$scope" == "chat" ]]; then target="$TMUX_SESSION:$idx"; break; fi
  done < <(tmux list-windows -t "$TMUX_SESSION" -F '#{window_index}')
  if [[ -z "$target" ]]; then
    idx=$(tmux new-window -t "$TMUX_SESSION" -n claude-paliadin -P -F '#{window_index}' claude)
    target="$TMUX_SESSION:$idx"
    # Wait for claude to settle (60s bound; matches Go waitForPaneReady).
    for _ in $(seq 1 120); do
      pane=$(tmux capture-pane -t "$target" -p 2>/dev/null || true)
      if [[ "$pane" == *"❯"* || "$pane" == *"│"* ]]; then break; fi
      sleep 0.5
    done
    tmux set-window-option -t "$target" @paliadin-scope chat
    tmux set-window-option -t "$target" @fix-name claude-paliadin
    # Bootstrap system prompt — reuses the Go service's prompt text.
    # The Go side sends this via the `bootstrap` RPC on first turn instead
    # of duplicating the prompt here. See §6.4.
  fi
  echo "$target"
}

case "$verb" in
  health)
    # Liveness check — used by paliad to short-circuit when mRiver is offline.
    # Returns "ok" iff tmux + claude are reachable.
    tmux has-session -t "$TMUX_SESSION" 2>/dev/null \
      || tmux new-session -d -s "$TMUX_SESSION"
    command -v claude >/dev/null && echo ok || { echo no-claude; exit 1; }
    ;;

  bootstrap)
    # First-turn-only: ensure pane exists and inject the system prompt.
    # argv[1] = base64-encoded prompt body (avoids quoting hell).
    target=$(ensure_pane)
    prompt=$(printf '%s' "${argv[1]:?missing prompt}" | base64 -d)
    tmux send-keys -t "$target" -l -- "$prompt"
    tmux send-keys -t "$target" Enter
    sleep 2   # give claude a moment to absorb
    echo ok
    ;;

  run-turn)
    # argv[1] = turn_id (UUID); argv[2] = base64-encoded user message.
    turn_id="${argv[1]:?missing turn_id}"
    [[ "$turn_id" =~ $TURN_ID_RE ]] || { echo >&2 "bad turn_id"; exit 2; }
    msg=$(printf '%s' "${argv[2]:?missing message}" | base64 -d)
    target=$(ensure_pane)
    out="$RESPONSE_DIR/$turn_id.txt"
    rm -f "$out"
    # Envelope matches what paliadin_prompt.go expects.
    tmux send-keys -t "$target" -l -- "[PALIADIN:$turn_id] $msg"
    tmux send-keys -t "$target" Enter
    # Poll for the response file. Same shape as Go pollForResponse.
    for _ in $(seq 1 $((TIMEOUT_S * 5))); do
      if [[ -s "$out" ]]; then
        sleep 0.05    # settle
        cat "$out"
        rm -f "$out"
        exit 0
      fi
      sleep 0.2
    done
    echo >&2 "paliadin: response timeout after ${TIMEOUT_S}s"
    exit 124
    ;;

  reset)
    # /clear the conversation; next turn starts fresh.
    target=$(ensure_pane)
    tmux send-keys -t "$target" -l -- "/clear"
    tmux send-keys -t "$target" Enter
    echo ok
    ;;

  *)
    echo >&2 "paliadin-shim: unknown verb '$verb'"
    exit 2
    ;;
esac

Why a shim instead of raw tmux-over-SSH:

One SSH round-trip per turn (~50 ms over tailnet) vs ~10–20 round-trips for the granular pattern.
Argument validation lives in one place (UUID regex on turn_id, base64 for messages, fixed verb list) — easier to audit than a regex over $SSH_ORIGINAL_COMMAND matching tmux send-keys ….
mRiver-side concerns (response polling, settle delays, pane-readiness) stay on mRiver, which is where the tmux state lives. The Go service stops caring about local file polling at all.
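The same validation can be mirrored on the Go side, so that a malformed request fails before an SSH connection is opened. A minimal sketch: buildRunTurnArgs is a hypothetical helper, and the regex is copied from TURN_ID_RE in the shim.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"regexp"
)

// turnIDRe mirrors TURN_ID_RE in paliadin-shim.
var turnIDRe = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// buildRunTurnArgs returns the argument list for the run-turn verb.
// The message is base64-encoded so it reaches the shim as a single
// whitespace-free token.
func buildRunTurnArgs(turnID, message string) ([]string, error) {
	if !turnIDRe.MatchString(turnID) {
		return nil, fmt.Errorf("paliadin: bad turn_id %q", turnID)
	}
	if message == "" {
		// The shim rejects a missing message argument, so fail early.
		return nil, errors.New("paliadin: empty message")
	}
	msgB64 := base64.StdEncoding.EncodeToString([]byte(message))
	return []string{"run-turn", turnID, msgB64}, nil
}

func main() {
	args, err := buildRunTurnArgs("123e4567-e89b-12d3-a456-426614174000", "hi")
	if err != nil {
		fmt.Println("invalid request:", err)
		return
	}
	fmt.Println(args)
}
```

The shim remains the authoritative check; this only avoids a wasted round-trip for requests it would refuse anyway.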

6. Sub-design C — Service-layer integration, routing, reliability

6.1 Interface split

The current *PaliadinService becomes an interface with two implementations: LocalPaliadinService (the existing tmux code, renamed) and RemotePaliadinService (the new SSH code). Construction picks one at startup based on PALIADIN_REMOTE_HOST.

// internal/services/paliadin.go (after refactor)

type Paliadin interface {
    RunTurn(ctx context.Context, req TurnRequest) (*TurnResult, error)
    ResetSession(ctx context.Context) error
    ListRecentTurns(ctx context.Context, callerID uuid.UUID, limit int) ([]PaliadinTurn, error)
    Stats(ctx context.Context, callerID uuid.UUID) (*PaliadinStats, error)
    IsOwner(ctx context.Context, userID uuid.UUID) (bool, error)
}

// LocalPaliadinService wraps the current tmux PoC (laptop / dev path).
type LocalPaliadinService struct { /* identical to today's PaliadinService */ }

// RemotePaliadinService talks to a paliadin-shim over SSH on mRiver.
type RemotePaliadinService struct {
    db          *sqlx.DB
    users       *UserService
    sshHost     string   // 100.99.98.203
    sshPort     int      // 22022 — bypasses Tailscale SSH on :22 (see §4.5)
    sshUser     string   // m
    sshKeyPath  string   // /tmp/paliadin-id_ed25519-<rand>
    knownHostsPath string // /tmp/paliadin-known_hosts (name as used in callShim, §6.3)
    turnMu      sync.Mutex

    // Health-check cache.
    healthMu      sync.Mutex
    healthOK      bool
    healthCheckedAt time.Time
}

DB access (ListRecentTurns, Stats, IsOwner) is identical for both — they only read paliad.paliadin_turns. They live in a shared paliadinDB helper struct embedded in both implementations.

6.2 Wiring at startup

// cmd/server/main.go (excerpt)
var paliadin services.Paliadin
remoteHost := os.Getenv("PALIADIN_REMOTE_HOST")
switch {
case remoteHost != "":
    keyPath, err := loadPaliadinSSHKey()
    if err != nil { log.Fatalf("paliadin: load ssh key: %v", err) }
    if keyPath == "" { log.Fatalf("paliadin: PALIADIN_REMOTE_HOST set but no PALIADIN_SSH_PRIVATE_KEY") }
    knownHosts, err := loadPaliadinKnownHosts()
    if err != nil { log.Fatalf("paliadin: load known_hosts: %v", err) }
    port, err := strconv.Atoi(cmpOr(os.Getenv("PALIADIN_REMOTE_PORT"), "22022"))
    if err != nil { log.Fatalf("paliadin: invalid PALIADIN_REMOTE_PORT: %v", err) }
    remoteUser := cmpOr(os.Getenv("PALIADIN_REMOTE_USER"), "m")
    paliadin = services.NewRemotePaliadinService(db, userSvc, services.RemotePaliadinConfig{
        SSHHost: remoteHost,
        SSHPort: port,
        SSHUser: remoteUser,
        SSHKeyPath: keyPath,
        KnownHostsPath: knownHosts,
    })
    log.Printf("paliadin: remote mode → ssh %s@%s:%d", remoteUser, remoteHost, port)
case localTmuxAvailable():
    paliadin = services.NewLocalPaliadinService(db, userSvc, "", "")
    log.Printf("paliadin: local tmux mode")
default:
    paliadin = services.NewDisabledPaliadinService(db, userSvc)
    log.Printf("paliadin: disabled (no remote host, no local tmux)")
}

A disabled mode exists today only implicitly, via the ErrTmuxUnavailable path. Making it an explicit NewDisabledPaliadinService gives it a clear name and means the handler never has to special-case nil.
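The wiring excerpt relies on two helpers that are not shown, cmpOr and localTmuxAvailable. A minimal sketch of both, as assumptions about their behaviour rather than the actual implementation; in particular, treating "tmux and claude are both on PATH" as the local-mode test is a guess.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// cmpOr returns val unless it is empty, in which case it returns fallback.
// On Go 1.22+ the standard library's cmp.Or covers the same case.
func cmpOr(val, fallback string) string {
	if val == "" {
		return fallback
	}
	return val
}

// localTmuxAvailable reports whether the binaries the local PoC needs
// can be found on PATH.
func localTmuxAvailable() bool {
	for _, bin := range []string{"tmux", "claude"} {
		if _, err := exec.LookPath(bin); err != nil {
			return false
		}
	}
	return true
}

func main() {
	port := cmpOr(os.Getenv("PALIADIN_REMOTE_PORT"), "22022")
	fmt.Println("port:", port, "local tmux:", localTmuxAvailable())
}
```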

6.3 SSH invocation pattern

RemotePaliadinService runs every RPC through the same helper:

func (s *RemotePaliadinService) callShim(ctx context.Context, args ...string) ([]byte, error) {
    sshArgs := []string{
        "-F", "/dev/null",                          // ignore /etc/ssh/ssh_config + ~/.ssh/config
        "-i", s.sshKeyPath,
        "-p", strconv.Itoa(s.sshPort),              // 22022 — bypasses Tailscale SSH on :22
        "-o", "IdentitiesOnly=yes",                 // don't fall back to other keys
        "-o", "UserKnownHostsFile=" + s.knownHostsPath,
        "-o", "StrictHostKeyChecking=yes",
        "-o", "BatchMode=yes",
        "-o", "ConnectTimeout=3",
        "-o", "ServerAliveInterval=10",
        "-o", "ServerAliveCountMax=3",
        s.sshUser + "@" + s.sshHost,
        "--",
    }
    sshArgs = append(sshArgs, args...)
    c, cancel := context.WithTimeout(ctx, 70*time.Second)   // shim has its own 60s; +10s for SSH overhead
    defer cancel()
    cmd := exec.CommandContext(c, "ssh", sshArgs...)
    var stdout, stderr bytes.Buffer
    cmd.Stdout = &stdout; cmd.Stderr = &stderr
    if err := cmd.Run(); err != nil {
        return nil, fmt.Errorf("paliadin: ssh shim %v: %w (stderr: %s)", args, err, stderr.String())
    }
    return stdout.Bytes(), nil
}

RunTurn becomes:

func (s *RemotePaliadinService) RunTurn(ctx context.Context, req TurnRequest) (*TurnResult, error) {
    s.turnMu.Lock()
    defer s.turnMu.Unlock()

    if err := s.healthGate(ctx); err != nil {
        return nil, err   // ErrMRiverUnreachable, picked up by handler
    }

    turnID := uuid.New()
    started := time.Now().UTC()
    if err := s.insertTurnRow(ctx, …); err != nil { return nil, err }

    // First-turn-only: bootstrap the system prompt on mRiver. Tracked by a
    // per-process bootstrapped flag, so it runs once per paliad start (§6.4).
    if err := s.ensureBootstrapped(ctx); err != nil {
        _ = s.markTurnError(ctx, turnID, "bootstrap_failed")
        return nil, err
    }

    msg := sanitiseForTmux(req.UserMessage)
    msgB64 := base64.StdEncoding.EncodeToString([]byte(msg))
    body, err := s.callShim(ctx, "run-turn", turnID.String(), msgB64)
    if err != nil {
        _ = s.markTurnError(ctx, turnID, classifySSHError(err))
        return nil, err
    }

    // Same trailer-parse + audit-row writes as Local, factored into shared helper.
    return s.completeTurnFromBody(ctx, turnID, started, string(body))
}

6.4 System prompt bootstrap

The local PoC calls paliadinSystemPrompt(s.responseDir) once when it creates the pane. The remote path needs the same hook. Two options that don't require duplicating the German prompt body to mRiver:

Lazy bootstrap (chosen): the first RunTurn after a paliad-prod restart sends the system prompt via bootstrap RPC, then runs the actual turn. Subsequent turns skip the bootstrap. State is per-process: RemotePaliadinService.bootstrapped boolean guarded by mutex.
Eager bootstrap at startup is rejected — it forces every container start to wait for mRiver to be online, which couples paliad's boot to mRiver's availability.

Lazy bootstrap means the very first turn after a paliad redeploy pays a ~3 s extra cost (claude pane spin-up + system prompt absorb). Acceptable for a single-user PoC.

6.5 Health-check gating (`mriver_unreachable`)

Every RunTurn first calls healthGate(ctx):

Cached for 10 s. If last check was <10 s ago and was OK, skip the probe.
Otherwise: s.callShim(ctx, "health") with a 3 s timeout. On success, set cache OK; on failure, return ErrMRiverUnreachable.

Why 10 s: short enough that "I just woke my laptop" propagates inside one user retry; long enough that a busy chat doesn't probe on every turn.

var ErrMRiverUnreachable = errors.New("paliadin: mriver unreachable")

func (s *RemotePaliadinService) healthGate(ctx context.Context) error {
    s.healthMu.Lock()
    defer s.healthMu.Unlock()
    if s.healthOK && time.Since(s.healthCheckedAt) < 10*time.Second {
        return nil
    }
    c, cancel := context.WithTimeout(ctx, 3*time.Second)
    defer cancel()
    out, err := s.callShim(c, "health")
    s.healthCheckedAt = time.Now()
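    if err != nil {
        s.healthOK = false
        return fmt.Errorf("%w: %v", ErrMRiverUnreachable, err)
    }
    if got := strings.TrimSpace(string(out)); got != "ok" {
        s.healthOK = false
        return fmt.Errorf("%w: unexpected health reply %q", ErrMRiverUnreachable, got)
    }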
    s.healthOK = true
    return nil
}

6.6 Friendly error code (extends t-paliad-150)

friendlyErrorMessage already maps tmux_unavailable to a localised message. We add one new code:

mriver_unreachable → DE: "mRiver ist offline — Paliadin nicht erreichbar. Mach mRiver an, oder nutze Paliadin lokal mit ./paliad." / EN: "mRiver is offline — Paliadin can't reach it. Wake mRiver, or run Paliadin locally with ./paliad."

Implementation: one new case in the SSE-error switch in frontend/src/client/paliadin.ts's friendlyErrorMessage, plus matching i18n keys (paliadin.error.mriver_unreachable.de / .en). Server-side: paliadin HTTP handler maps errors.Is(err, services.ErrMRiverUnreachable) to event: error\ndata: {"code":"mriver_unreachable","message":"..."}\n\n.

6.7 Rate limit

A runaway loop on the paliad side could DoS the SSH connection. Cheapest cap: enforce one in-flight turn at a time via turnMu (already exists in the local PoC). On top of that, a rolling cap of N=20 turns/min in RemotePaliadinService rejects with ErrRateLimited (mapped to a friendly paliadin.error.rate_limited). The PoC has one user (m); the cap is a paranoid safety net, not a real throttle.

6.8 What about ControlMaster?

Decision 2's chosen path (server-side shim with one RPC per turn) makes ControlMaster optional. The shim collapses ~10 raw-tmux ops into a single SSH connect, which is already the latency win ControlMaster would buy.

Adding it on top would save ~30–50 ms per turn but adds:

A persistent ~/.ssh/cm-* socket inside the container.
Cleanup logic on shutdown.
A subtle interaction with the SSH BatchMode + ConnectTimeout settings.

Verdict: skip ControlMaster in v1. If turn latency over Tailscale is measured >300 ms in practice and hot enough to matter, add it in a follow-up; the call site is one helper.

7. Phasing

Phase A — manual proof-of-concept (no Dockerfile change yet)

Goal: validate the round-trip end-to-end on a deployed paliad, before touching the image.

Phase A.0 (DONE 2026-05-07 23:31): SSH+shim end-to-end on the tailnet.

✅ Generate keypair on mRiver: ssh-keygen -t ed25519 -N "" -C "paliad-prod" -f ~/.paliad-staging/paliad-prod-key. Fingerprint SHA256:5uV8v872F/IhJycjjq0crFue/emAYfw71N9bxTvkl9c.
✅ Commit shim to scripts/paliadin-shim and install at /home/m/.local/bin/paliadin-shim, chmod 755.
✅ Write authorized_keys with public key + command=/from="100.99.98.201"/no-pty/no-port-forwarding/no-agent-forwarding/no-X11-forwarding/no-user-rc restrictions (§5.2).
✅ Add port 22022 socket drop-in at /etc/systemd/system/ssh.socket.d/paliad.conf, systemctl daemon-reload && systemctl restart ssh.socket. Both :22 (Tailscale SSH for m) and :22022 (real OpenSSH for paliad) listening (§4.5).
✅ Capture mRiver:22022 host key: ssh-keyscan -p 22022 -t ed25519 100.99.98.203 > ~/.paliad-staging/known_hosts from mLake. Fingerprint SHA256:HPoUzy60Cb8yLERIBQcB2mHihNST3NaTODx5Ypd1XpA.

✅ Smoke-test from mLake (without paliad container, just raw ssh from mLake's host shell):

ssh -F /dev/null -i /tmp/paliad-prod-key -o UserKnownHostsFile=/tmp/paliad-known_hosts \
    -o StrictHostKeyChecking=yes -o IdentitiesOnly=yes -o BatchMode=yes \
    -p 22022 m@100.99.98.203 health
→ ok
ssh … run-turn $(uuidgen) "$(printf 'Sag …' | base64 -w0)"
→ "test ok" (3.4 s round-trip including a real Claude response)

✅ from= rejection verified: the same key from mRiver itself (100.99.98.203) → Permission denied (publickey,password) as expected.

Phase A.5 (PENDING m's hands): validate network_mode: host + traefik routing on prod paliad.de.

Branch the live docker-compose.yml on a temp branch.
Add network_mode: host to the web service; remove expose: ["8080"].
Push to trigger a Dokploy redeploy.
curl --connect-timeout 5 -sSI https://paliad.de/ — expect 200 (or login redirect), NOT 502.
If 502: revert the temp branch (git revert HEAD && git push); revisit decision 1 in a follow-up issue.
If 200: keep the host-mode change; ready for Phase B.

This is m's call to execute — it briefly touches prod paliad.de. Inventor/coder should not flip prod compose without explicit go-ahead. Rollback is one revert + redeploy.

Phase A.6 (after A.5 passes): smoke-test SSH from inside the paliad-prod container itself (the real container, not just the mLake host shell):

docker exec -it <paliad-container> sh
apk add --no-cache openssh-client     # one-shot, before Dockerfile change
ssh -F /dev/null -i /tmp/paliad-prod-key -o UserKnownHostsFile=/tmp/paliad-known_hosts \
    -o StrictHostKeyChecking=yes -o IdentitiesOnly=yes -o BatchMode=yes \
    -p 22022 m@100.99.98.203 health
# expected: "ok"

This proves the container's host-mode networking actually delivers a tailnet connect.

Phase A.7: wire env vars manually via Dokploy UI for one deploy; confirm /paliadin chat works against mRiver from paliad.de.

If A.5 fails: the design rolls back to a sidecar in a new issue (decision 1 follow-up). The SSH path (A.0) and traefik path (A.5) are independent — A.0 is already proven; only A.5+ is at risk.

Phase B — bake into Dockerfile + Dokploy secrets

Dockerfile: add openssh-client to the final stage (§4.3).
compose: add network_mode: host and the five new env vars (§4.1).
Dokploy secrets: register PALIADIN_REMOTE_HOST=100.99.98.203, PALIADIN_REMOTE_PORT=22022, PALIADIN_REMOTE_USER=m, PALIADIN_SSH_PRIVATE_KEY=..., PALIADIN_KNOWN_HOSTS=....
Code: refactor PaliadinService to the interface split (§6.1–§6.2). New file internal/services/paliadin_remote.go. Tests: paliadin_remote_test.go mocks callShim to verify RunTurn audit-row writes, error mapping, and healthGate caching.
Ship under one PR; tag t-paliad-151 done.

Phase C — friendly errors + monitoring

paliadin.error.mriver_unreachable i18n keys + friendlyErrorMessage case (§6.6).
/admin/paliadin shows last health-probe result + last successful turn timestamp.
Optional: mai-mesh integration to surface mRiver-offline events to m on Telegram (out-of-band; not gating).

8. Security review summary

Risk	Mitigation
Stolen private key → arbitrary SSH on mRiver	`command=` shim restriction + `from="100.99.98.201"` + ed25519 key + private key only in Dokploy secret store (encrypted at rest); paliad route uses port 22022 where real OpenSSH enforces all of the above
Stolen private key → tailnet-wide SSH from non-mLake host	`from="100.99.98.201"` clause (verified: rejected from mRiver itself in Phase A.0)
Tailscale SSH on `:22` bypasses `authorized_keys`	The paliad-prod key's `command=` restriction is not enforced on `:22`. Mitigation: paliad always dials `:22022`, which is real OpenSSH. m's interactive `tailscale ssh m@mriver` on `:22` continues to be governed by Tailscale ACLs, separate from paliad's identity.
Container compromise → key extraction	Key written to tmpfile chmod 600, only root inside container can read; alpine container has no shell-on-error trampolines
Host-key MITM during connect	Pinned `known_hosts`; `StrictHostKeyChecking=yes`
Shim argument injection (e.g. via `run-turn $(rm -rf /)`)	Shim parses positional args from `$SSH_ORIGINAL_COMMAND` via `read -r -a`; never passes args to a subshell `eval`; turn_id validated by UUID regex; message body always base64-decoded into a single shell variable, never re-evaluated
Runaway loop → SSH flood	Single-flight `turnMu` + 20/min rolling cap
`network_mode: host` widens blast radius	The `command=` + `from=` restrictions on mRiver mean container compromise = "can run shim verbs against mRiver only", not "shell on mRiver"
PaliadinOwnerEmail bypass	Unchanged from PoC: gate is in Go (`/paliadin` 404s for any other user). Even if mRiver SSH key leaks, attacker still needs paliad session as `m@hoganlovells.com`.

9. Out-of-scope clarifications (for review)

These were called out in the issue but the design intentionally does not solve them, to keep v1 tight. Each is acknowledged so review knows it wasn't an oversight:

Wake-on-LAN of mRiver: out of scope. v1's UX when mRiver is asleep is the friendly error from §6.6. Future work: integrate with mai-mesh capability fallback.
Multi-host failover: out of scope. Only mRiver is targeted.
Anthropic API fallback when mRiver offline: out of scope per CLAUDE.md (ANTHROPIC_API_KEY reserved for production-v1, unused in PoC).
ControlMaster: v1 ships without; revisit if turn latency >300 ms in practice (§6.8).

10. File-level deliverables (for the coder shift)

When this design is approved and the coder shift starts, the work splits roughly into:

Dockerfile — +openssh-client.
docker-compose.yml — network_mode: host, five new env entries (PALIADIN_REMOTE_HOST, PALIADIN_REMOTE_PORT, PALIADIN_REMOTE_USER, PALIADIN_SSH_PRIVATE_KEY, PALIADIN_KNOWN_HOSTS).
internal/services/paliadin.go — extract Paliadin interface; rename existing to LocalPaliadinService; pull DB-only methods (ListRecentTurns, Stats, IsOwner) into a shared embedded paliadinDB so both implementations get them for free.
internal/services/paliadin_remote.go — new file: RemotePaliadinService, RemotePaliadinConfig (with SSHPort), callShim, healthGate, ensureBootstrapped, classifySSHError, ErrMRiverUnreachable.
internal/services/paliadin_remote_test.go — unit tests with a mocked callShim.
cmd/server/main.go — env-var-based wiring (§6.2), loadPaliadinSSHKey, loadPaliadinKnownHosts, PALIADIN_REMOTE_PORT parse with default 22022.
frontend/src/client/paliadin.ts — one case in friendlyErrorMessage for mriver_unreachable.
frontend/src/i18n.ts — two new keys (paliadin.error.mriver_unreachable.de / .en).
scripts/paliadin-shim — server-side script (§5.4); already shipped + installed on mRiver during Phase A.0, not part of any container. Repo location chosen so the security-relevant script is version-controlled.
docs/project-status.md — note Phase 0.5 (PoC) → Phase 0.6 (Tailscale-SSH prod route).
mRiver host setup (one-time, already done in Phase A.0): /etc/systemd/system/ssh.socket.d/paliad.conf (port 22022 listen drop-in); ~/.ssh/authorized_keys (paliad-prod public key with restrictions); /home/m/.local/bin/paliadin-shim (installed, executable copy of scripts/paliadin-shim). These installed files are NOT in the repo because they are host state on m's laptop; docs/project-status.md should reference them.
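For the cmd/server/main.go item, the PALIADIN_REMOTE_PORT parse with its 22022 default could look like the sketch below. The helper name parseRemotePort and the range check are assumptions; only the variable name and the default come from this design.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultRemotePort is the OpenSSH port on mRiver that Tailscale SSH
// does not intercept.
const defaultRemotePort = 22022

// parseRemotePort turns the raw env value into a TCP port.
// An empty value selects the default; anything else must be 1-65535.
func parseRemotePort(raw string) (int, error) {
	if raw == "" {
		return defaultRemotePort, nil
	}
	port, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("PALIADIN_REMOTE_PORT %q is not a number: %w", raw, err)
	}
	if port < 1 || port > 65535 {
		return 0, fmt.Errorf("PALIADIN_REMOTE_PORT %d is outside 1-65535", port)
	}
	return port, nil
}

func main() {
	port, err := parseRemotePort(os.Getenv("PALIADIN_REMOTE_PORT"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("dialing mRiver on port", port)
}
```

Failing at startup on a malformed value is a deliberate choice in this sketch: silently falling back to 22 would route the restricted key through Tailscale SSH, which is exactly the path §8 rules out.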

No DB migrations needed — paliad.paliadin_turns schema already covers everything (error_code field already accepts free-form codes including mriver_unreachable).

11. Open questions for review

Q (m), still open: Phase A.5 (traefik+host-mode on prod paliad.de) is not yet executed. m drives this; rollback is one revert. Dokploy doc check before flipping is recommended but not blocking.
Q (m), resolved 2026-05-07 23:50: shim location → repo (scripts/paliadin-shim, committed in 0248411). Version-controlled and auditable.
Q (m), still open: ANTHROPIC_API_KEY env var reservation in compose comments — keep for production-v1, or strip now? Not blocking either phase; defer.

12. Phase A.0 completion summary (2026-05-07 23:50)

Coder shift (noether) executed Phase A.0 in full:

✅ shim committed at scripts/paliadin-shim (commit 0248411, repo-version-controlled)
✅ shim installed at /home/m/.local/bin/paliadin-shim on mRiver
✅ ed25519 keypair paliad-prod generated, public-key fingerprint SHA256:5uV8v872F/IhJycjjq0crFue/emAYfw71N9bxTvkl9c, private key staged at ~/.paliad-staging/paliad-prod-key on mRiver (mode 600)
✅ ~/.ssh/authorized_keys written with command=/from=/no-pty/no-port-forwarding/no-agent-forwarding/no-X11-forwarding/no-user-rc restrictions
✅ ssh.socket drop-in installed at /etc/systemd/system/ssh.socket.d/paliad.conf; both :22 and :22022 listening
✅ host key for :22022 captured at ~/.paliad-staging/known_hosts (fingerprint SHA256:HPoUzy60Cb8yLERIBQcB2mHihNST3NaTODx5Ypd1XpA)
✅ end-to-end SSH+shim+Claude run-turn validated from mLake → mRiver:22022 (3.4 s round-trip)
✅ from="100.99.98.201" rejection verified

Two secrets and three plain env vars are ready for Dokploy registration (m to copy from ~/.paliad-staging/ on mRiver):

PALIADIN_SSH_PRIVATE_KEY ← cat ~/.paliad-staging/paliad-prod-key
PALIADIN_KNOWN_HOSTS ← cat ~/.paliad-staging/known_hosts
PALIADIN_REMOTE_HOST=100.99.98.203, PALIADIN_REMOTE_PORT=22022, PALIADIN_REMOTE_USER=m

Phase A.5 (traefik+host-mode test) and Phase A.6/A.7 (in-container SSH smoke + paliad/paliadin end-to-end) await m's hands — they touch prod paliad.de.

Phase B (Dockerfile + Go interface split + Dokploy secrets) is unblocked from a code perspective — but should not merge until Phase A.5 confirms the host-mode networking trade-off is acceptable.

Inventor design + coder Phase A.0 complete. Awaiting m for Phase A.5 traefik validation before the coder writes the Go interface split.
