Phase A.0 revealed that Tailscale SSH on mRiver intercepts :22 from tailnet peers and bypasses OpenSSH's authorized_keys entirely (banner "SSH-2.0-Tailscale", auth method "none", command= restriction never fires). The fix is port 22022 via a systemd ssh.socket drop-in: Tailscale SSH only intercepts :22, so :22022 hits real OpenSSH where the design's command=/from= shim restriction works as specified.
Updated:
- §3 locked decisions: row 5 added (port 22022, m's call 23:35)
- §4.5 new subsection: Tailscale SSH bypass via socket drop-in + records the "Address already in use" first-attempt failure as a "don't retry without cleaning sshd_config Port directives first" lesson
- §5.2/5.3: ssh-keyscan now uses -p 22022; known_hosts is host:port keyed for non-22 ports
- §6.1/6.2/6.3: SSHPort field on RemotePaliadinService config, -p flag in callShim, PALIADIN_REMOTE_PORT env (default 22022)
- §7 phasing: A.0 completion checked off step-by-step with concrete fingerprints; A.5/A.6/A.7 split out as m-driven
- §8 security: Tailscale-SSH-on-:22 risk explicitly tabled with port-22022 mitigation
- §10 deliverables: mRiver host-setup artifacts noted
- §12 new Phase A.0 completion summary with the three secrets m needs to register in Dokploy
Phase A.0 verified end-to-end:
- ssh -p 22022 -i paliad-prod-key m@mriver health → ok
- run-turn UUID base64msg → 3.4 s including a real Claude response
- from="100.99.98.201" correctly rejects connections from mRiver itself
mRiver host state in place (not in repo): authorized_keys with restrictions, /home/m/.local/bin/paliadin-shim, ssh.socket drop-in.
Three secrets staged at ~/.paliad-staging/ on mRiver for m to copy into Dokploy: paliad-prod-key (PALIADIN_SSH_PRIVATE_KEY), known_hosts (PALIADIN_KNOWN_HOSTS), and the three plain env vars.
Refs m/paliad#12
Paliadin: route prod via Tailscale SSH to mRiver
Issue: m/paliad#12 — t-paliad-151
Date: 2026-05-07
Author: noether (inventor)
Supersedes nothing. Extends docs/design-paliadin-2026-05-07.md (the Phase 0 PoC) with a third deployment path between "laptop-only PoC" and "Anthropic API direct".
Related: t-paliad-146 (PoC ship), t-paliad-150 (friendlyErrorMessage pattern).
1. Goal
Make Paliadin reachable from paliad.de (Dokploy on mLake) without losing m's Claude Code subscription, by routing each turn over Tailscale + SSH from the paliad container to mRiver, where the existing long-lived tmux + claude PoC keeps running.
Non-goals (v1):
- Multi-host failover.
- Encryption beyond SSH-over-tailnet (already E2E-encrypted by Tailscale's WireGuard layer).
- Anthropic API fallback when mRiver is offline — show a friendly error instead.
- Wake-on-LAN of mRiver.
- Multi-tenant or multi-firm variants.
2. Live state — what was verified before designing
A design built on stale facts rots fast. These were probed on 2026-05-07, not assumed from CLAUDE.md or memory:
| Fact | How verified | Result |
|---|---|---|
| mRiver = 100.99.98.203, has tmux + claude | this worker runs on mRiver; tmux -V → tmux 3.6a; which claude → /home/m/.local/bin/claude | confirmed |
| mLake (100.99.98.201) has Tailscale running | ssh m@mlake tailscale status | confirmed; mRiver visible as active; direct [2a02:4780:41:3fbc::1]:41641 |
| paliad container Dockerfile is alpine:3.21 minimal, no SSH, no tailscaled | Dockerfile | confirmed (only ca-certificates) |
| paliad compose runs default Docker bridge (no network_mode) | docker-compose.yml | confirmed |
| mRiver has no ~/.ssh/authorized_keys yet | ls ~/.ssh/ | confirmed; file must be created in Phase A |
| /tmp/paliadin/ does not exist on mRiver yet | ls /tmp/paliadin | confirmed; created on first turn (paliadin.go:185 os.MkdirAll) |
| paliad-paliadin tmux session is not currently running on mRiver | tmux ls | not present; the existing PoC creates it on demand |
Implication for design: the paliad container needs new infrastructure on three axes — network reachability of the tailnet, an SSH client + identity, and a service-layer code path that talks to a remote tmux instead of a local one. Each axis is its own sub-design below.
3. Locked decisions (m, 2026-05-07 22:35)
m made four design-shaping calls via the inventor's AskUserQuestion pass; a fifth (row 5) was added after a Phase A finding. They are recorded here verbatim because every downstream choice in §4–§6 follows from them.
| # | Question | m's choice |
|---|---|---|
| 1 | Container Tailscale shape | network_mode: host on paliad |
| 2 | SSH-to-mRiver protocol granularity | Server-side paliadin-shim (one RPC per turn) |
| 3 | Routing trigger | Env var PALIADIN_REMOTE_HOST + interface split |
| 4 | SSH private key storage | Dokploy secret env var PALIADIN_SSH_PRIVATE_KEY |
| 5 | SSH port to bypass Tailscale SSH | Port 22022 via ssh.socket drop-in (Phase A finding, 23:30) |
Decision (1) was not the inventor's recommendation — host mode has known interaction risk with traefik (§4.2). m is overriding the recommendation; this design accepts the call and codifies a Phase A test step that gates the rollout on traefik still working under host mode. If Phase A blows up, the fallback is to revisit (1) in a follow-up issue, not to silently swap to a sidecar.
Decision (5) emerged during Phase A: Tailscale SSH on mRiver was found to intercept :22 from tailnet peers and bypass OpenSSH's authorized_keys entirely (banner says "Tailscale", auth method "none"). The command= shim restriction therefore never fires on the standard port. Adding port 22022 via a systemd ssh.socket drop-in routes paliad's connections to real OpenSSH where the restriction works. m's interactive tailscale ssh m@mriver on :22 stays untouched. See §4.5 for the implementation.
4. Sub-design A — Container Tailscale shape
4.1 Shape: network_mode: host
paliad's container shares mLake's network namespace. tailscale0 (mLake's tailnet interface) is directly visible from inside the container. Outbound ssh m@100.99.98.203 reaches mRiver over the tailnet without any sidecar, userspace tailscaled, SOCKS proxy, or auth-key flow inside the container.
# docker-compose.yml diff
services:
  web:
    build: .
    network_mode: host                 # NEW
    # remove: expose: ["8080"]         # host mode means port is on the host directly
    environment:
      - PORT=8080
      ...
      # NEW Paliadin remote-routing knobs
      - PALIADIN_REMOTE_HOST=${PALIADIN_REMOTE_HOST}            # 100.99.98.203
      - PALIADIN_REMOTE_PORT=${PALIADIN_REMOTE_PORT}            # 22022 (bypasses Tailscale SSH, see §4.5)
      - PALIADIN_REMOTE_USER=${PALIADIN_REMOTE_USER}            # m
      - PALIADIN_SSH_PRIVATE_KEY=${PALIADIN_SSH_PRIVATE_KEY}
      - PALIADIN_KNOWN_HOSTS=${PALIADIN_KNOWN_HOSTS}            # one-line ssh-keyscan -p 22022 output
    restart: unless-stopped
4.2 Trade-off accepted: traefik routing under host mode
paliad.de's TLS is provided by Dokploy's traefik on the dokploy-network overlay. With network_mode: host, paliad is no longer attached to that overlay. Two failure modes are possible:
- (M1) traefik can't discover the service via Docker DNS → 502 at the edge.
- (M2) traefik routes via host loopback (http://127.0.0.1:8080 or host.docker.internal) and works fine.
Recent Dokploy versions configure traefik with both loadbalancer.server.url and Docker labels; (M2) is the documented host-mode path. Phase A explicitly tests this (§7) before any code is written; if (M1) materialises, the design rolls back to the sidecar variant of decision 1 in a follow-up issue.
Other host-mode side-effects to flag in operations:
- paliad listens on host port 8080 directly. Any other compose service binding 8080 conflicts.
- paliad's outbound DNS uses host resolver (no Docker-internal web etc.). Currently fine: paliad's only network deps are external (Supabase, SMTP, GitHub raw). No service on dokploy-network is referenced by name.
- The container can reach every Tailscale node, not just mRiver. Mitigations live in §5 (key restriction) and §5.2 (from= clause on mRiver authorized_keys).
4.3 Dockerfile diff
# Final stage adds the SSH client only. Tailscale is provided by the host.
FROM alpine:3.21
RUN apk add --no-cache ca-certificates openssh-client # +openssh-client (~1MB)
WORKDIR /app
COPY --from=backend /paliad /app/paliad
COPY --from=frontend /app/frontend/dist /app/dist
EXPOSE 8080
CMD ["/app/paliad"]
Image-size delta: alpine openssh-client is ~1.1 MB compressed — negligible. No tailscaled, no entrypoint script, no extra processes inside the container.
4.4 What does NOT change
- No Tailscale auth-key inside paliad. The container inherits the host's tailnet binding, so there is no per-container Tailscale identity to rotate. mLake's existing Tailscale auth is the only one in scope.
- No tailscaled process inside the container.
- No new sidecar container.
4.5 Bypassing Tailscale SSH via port 22022 (Phase A discovery)
Phase A revealed that Tailscale SSH on mRiver intercepts :22 from tailnet peers before OpenSSH sees the connection. The SSH banner reads SSH-2.0-Tailscale, the verbose log shows Authenticated using "none", and the authorized_keys command= directive is therefore inert. mRiver's tailscale status --json confirms the https://tailscale.com/cap/ssh capability is enabled.
The fix: a separate listening port for the paliad route, where Tailscale SSH does not intercept and real OpenSSH handles auth.
mRiver uses systemd socket activation for sshd (/usr/lib/systemd/system/ssh.socket binds :22). Setting Port 22022 in sshd_config is ignored under socket activation — listen ports come from the socket unit, not sshd's own config. The correct change is a drop-in:
# /etc/systemd/system/ssh.socket.d/paliad.conf
[Socket]
ListenStream=0.0.0.0:22022
ListenStream=[::]:22022
Followed by systemctl daemon-reload && systemctl restart ssh.socket. Both :22 (still routed through Tailscale SSH for m's interactive use) and :22022 (real OpenSSH) end up listening. The same sshd binary handles both — same host key, same authorized_keys, same sshd_config. The only difference is which port a peer dials.
A failed first attempt (2026-05-07 23:07) added the drop-in while a stale Port 22022 directive in sshd_config.d/99-paliad-test.conf was still bound — the resulting Address already in use took ssh.socket down for ~30 s until reverted. Lesson: clean any prior Port directives out of sshd_config.d/*.conf before retrying the socket drop-in.
Phase A end-to-end test (2026-05-07 23:31) succeeded with port 22022:
- ssh -p 22022 -i paliad-prod-key m@100.99.98.203 health → ok
- run-turn <uuid> <base64-msg> → 3.4 s round-trip including a Claude-Code response
- from="100.99.98.201" correctly rejected a connection sourced from mRiver itself (Permission denied (publickey,password))
5. Sub-design B — SSH identity, restricted shim, host-key pinning
5.1 Identity: dedicated ed25519 keypair paliad-prod
One keypair, generated once on mRiver during Phase A, used by every paliad-prod deploy:
# On mRiver (Phase A bootstrap):
ssh-keygen -t ed25519 -N "" -C "paliad-prod $(date +%Y-%m-%d)" -f /tmp/paliad-prod-key
# Public key → mRiver authorized_keys (see 5.2)
# Private key → Dokploy secret store as PALIADIN_SSH_PRIVATE_KEY
shred -u /tmp/paliad-prod-key # only the encrypted/secret-stored copies survive
Rotation: regenerate, push public key to mRiver authorized_keys, update Dokploy secret, redeploy. No code change needed — paliad's startup re-reads the env var on every boot.
The private key is delivered to the container as a multi-line env var. At process start, paliad writes it to a tmpfile so OpenSSH can use it:
// cmd/server/main.go (sketch)
func loadPaliadinSSHKey() (string, error) {
	blob := os.Getenv("PALIADIN_SSH_PRIVATE_KEY")
	if blob == "" { return "", nil } // remote mode disabled
	// OpenSSH can reject a key file that lacks its final newline, and env-var
	// plumbing often strips it, so restore it before writing.
	if blob[len(blob)-1] != '\n' { blob += "\n" }
	f, err := os.CreateTemp("", "paliadin-id_ed25519-")
	if err != nil { return "", err }
	if err := os.Chmod(f.Name(), 0o600); err != nil { return "", err }
	if _, err := f.WriteString(blob); err != nil { return "", err }
	if err := f.Close(); err != nil { return "", err }
	return f.Name(), nil // path passed to RemotePaliadinService
}
The tmpfile lives at /tmp/paliadin-id_ed25519-<rand> for the container's lifetime. On container restart, a fresh tmpfile is written. We never persist the key to a volume.
5.2 mRiver authorized_keys entry
command="/home/m/.local/bin/paliadin-shim",no-pty,no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-user-rc,from="100.99.98.201" ssh-ed25519 AAAA...PUBKEY... paliad-prod
Each restriction matters:
- command=: every ssh m@mriver … invocation runs the shim regardless of what the client asked for. The client's requested command is exposed as $SSH_ORIGINAL_COMMAND for the shim to dispatch on.
- no-pty,no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-user-rc: defence-in-depth. Even if someone steals the key and bypasses the shim's argument validation, they can't get an interactive shell, can't tunnel ports, can't pivot via agent forwarding.
- from="100.99.98.201": only accept connections from mLake's tailnet IP. Defends against the "container has full tailnet visibility" host-mode side-effect from §4.2: if the key leaks off mLake, it can't be replayed from another tailnet host.
5.3 Host-key pinning
StrictHostKeyChecking=accept-new is too loose for a long-lived production identity (one-time MITM during first connect substitutes a different key forever). Instead:
- During Phase A, run ssh-keyscan -p 22022 -t ed25519 100.99.98.203 on mLake.
- Capture the single output line. The host-key portion is identical to the :22 entry (same sshd, same keys), but the [100.99.98.203]:22022 prefix matters because OpenSSH's known_hosts is host:port-keyed for non-22 ports.
- Store as Dokploy secret PALIADIN_KNOWN_HOSTS.
- At container startup, write to /tmp/paliadin-known_hosts, chmod 644.
- Pass to OpenSSH via -o UserKnownHostsFile=/tmp/paliadin-known_hosts -o StrictHostKeyChecking=yes.
If mRiver's host key ever rotates (rare; only on disk wipe / fresh OS), SSH refuses to connect with a clear "host key changed" error, which surfaces as mriver_unreachable to the user. That is exactly the right blast radius: a loud failure, no silent connect to a substitute host. Recovery is to re-run the Phase A keyscan and update the secret.
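The startup step above (write the pinned line to a file, keyed by host:port) could be sketched roughly as follows. This is a minimal illustration, not the design's loadPaliadinKnownHosts: the names knownHostsKey and writePinnedKnownHosts are hypothetical, the key material in main is a placeholder, and the sketch uses a random temp file (mode 0600) instead of the fixed /tmp/paliadin-known_hosts path. It assumes the secret holds exactly one unhashed line.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// knownHostsKey returns the host field OpenSSH uses in known_hosts:
// the bare host for port 22, "[host]:port" for any other port.
func knownHostsKey(host string, port int) string {
	if port == 22 {
		return host
	}
	return fmt.Sprintf("[%s]:%d", host, port)
}

// writePinnedKnownHosts (hypothetical helper) checks that blob pins the
// expected host:port, writes it to a fresh temp file, and returns the path.
func writePinnedKnownHosts(blob, host string, port int) (string, error) {
	line := strings.TrimSpace(blob)
	want := knownHostsKey(host, port) + " "
	if !strings.HasPrefix(line, want) {
		return "", fmt.Errorf("known_hosts entry does not start with %q", want)
	}
	f, err := os.CreateTemp("", "paliadin-known_hosts-")
	if err != nil {
		return "", err
	}
	defer f.Close()
	// Re-add the trailing newline that TrimSpace (or the secret store) removed.
	if _, err := f.WriteString(line + "\n"); err != nil {
		return "", err
	}
	return f.Name(), nil
}

func main() {
	// Placeholder key material; the real value comes from PALIADIN_KNOWN_HOSTS.
	blob := "[100.99.98.203]:22022 ssh-ed25519 AAAA...HOSTKEY..."
	path, err := writePinnedKnownHosts(blob, "100.99.98.203", 22022)
	fmt.Println(path, err)
}
```

Failing at startup on a prefix mismatch would catch the easy mistake of pasting a :22 keyscan line into the secret, which StrictHostKeyChecking=yes would otherwise only reveal on the first turn.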
5.4 The shim — paliadin-shim
A bash script on mRiver at /home/m/.local/bin/paliadin-shim. It is the only thing the paliad-prod key is allowed to invoke, and it dispatches on $SSH_ORIGINAL_COMMAND. Four RPCs (health, bootstrap, run-turn, reset):
#!/bin/bash
# paliadin-shim — server-side RPC for paliad's remote-tmux turns.
# Invoked via authorized_keys command= with $SSH_ORIGINAL_COMMAND set.
set -euo pipefail
umask 077
readonly TMUX_SESSION="${PALIADIN_TMUX_SESSION:-paliad-paliadin}"
readonly RESPONSE_DIR="${PALIADIN_RESPONSE_DIR:-/tmp/paliadin}"
readonly TIMEOUT_S=60
readonly TURN_ID_RE='^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
mkdir -p "$RESPONSE_DIR"
# Parse $SSH_ORIGINAL_COMMAND. Format: "<verb> <arg1> <arg2> …"
read -r -a argv <<< "${SSH_ORIGINAL_COMMAND:-}"
verb="${argv[0]:-}"
ensure_pane() {
  if ! tmux has-session -t "$TMUX_SESSION" 2>/dev/null; then
    tmux new-session -d -s "$TMUX_SESSION"
  fi
  # Find or create the @paliadin-scope=chat window.
  local target=""
  while read -r idx; do
    scope=$(tmux show-window-option -t "$TMUX_SESSION:$idx" -v @paliadin-scope 2>/dev/null || true)
    if [[ "$scope" == "chat" ]]; then target="$TMUX_SESSION:$idx"; break; fi
  done < <(tmux list-windows -t "$TMUX_SESSION" -F '#{window_index}')
  if [[ -z "$target" ]]; then
    idx=$(tmux new-window -t "$TMUX_SESSION" -n claude-paliadin -P -F '#{window_index}' claude)
    target="$TMUX_SESSION:$idx"
    # Wait for claude to settle (60s bound; matches Go waitForPaneReady).
    for _ in $(seq 1 120); do
      pane=$(tmux capture-pane -t "$target" -p 2>/dev/null || true)
      if [[ "$pane" == *"❯"* || "$pane" == *"│"* ]]; then break; fi
      sleep 0.5
    done
    tmux set-window-option -t "$target" @paliadin-scope chat
    tmux set-window-option -t "$target" @fix-name claude-paliadin
    # Bootstrap system prompt: reuses the Go service's prompt text.
    # The Go side sends this via the `bootstrap` RPC on first turn instead
    # of duplicating the prompt here. See §6.4.
  fi
  echo "$target"
}
case "$verb" in
  health)
    # Liveness check: used by paliad to short-circuit when mRiver is offline.
    # Returns "ok" iff tmux + claude are reachable.
    tmux has-session -t "$TMUX_SESSION" 2>/dev/null \
      || tmux new-session -d -s "$TMUX_SESSION"
    command -v claude >/dev/null && echo ok || { echo no-claude; exit 1; }
    ;;
  bootstrap)
    # First-turn-only: ensure pane exists and inject the system prompt.
    # argv[1] = base64-encoded prompt body (avoids quoting hell).
    target=$(ensure_pane)
    prompt=$(printf '%s' "${argv[1]:?missing prompt}" | base64 -d)
    tmux send-keys -t "$target" -l -- "$prompt"
    tmux send-keys -t "$target" Enter
    sleep 2  # give claude a moment to absorb
    echo ok
    ;;
  run-turn)
    # argv[1] = turn_id (UUID); argv[2] = base64-encoded user message.
    turn_id="${argv[1]:?missing turn_id}"
    [[ "$turn_id" =~ $TURN_ID_RE ]] || { echo >&2 "bad turn_id"; exit 2; }
    msg=$(printf '%s' "${argv[2]:?missing message}" | base64 -d)
    target=$(ensure_pane)
    out="$RESPONSE_DIR/$turn_id.txt"
    rm -f "$out"
    # Envelope matches what paliadin_prompt.go expects.
    tmux send-keys -t "$target" -l -- "[PALIADIN:$turn_id] $msg"
    tmux send-keys -t "$target" Enter
    # Poll for the response file. Same shape as Go pollForResponse.
    for _ in $(seq 1 $((TIMEOUT_S * 5))); do
      if [[ -s "$out" ]]; then
        sleep 0.05  # settle
        cat "$out"
        rm -f "$out"
        exit 0
      fi
      sleep 0.2
    done
    echo >&2 "paliadin: response timeout after ${TIMEOUT_S}s"
    exit 124
    ;;
  reset)
    # /clear the conversation; next turn starts fresh.
    target=$(ensure_pane)
    tmux send-keys -t "$target" -l -- "/clear"
    tmux send-keys -t "$target" Enter
    echo ok
    ;;
  *)
    echo >&2 "paliadin-shim: unknown verb '$verb'"
    exit 2
    ;;
esac
Why a shim instead of raw tmux-over-SSH:
- One SSH round-trip per turn (~50 ms over tailnet) vs ~10–20 round-trips for the granular pattern.
- Argument validation lives in one place (UUID regex on turn_id, base64 for messages, fixed verb list), which is easier to audit than a regex over $SSH_ORIGINAL_COMMAND matching tmux send-keys ….
- mRiver-side concerns (response polling, settle delays, pane-readiness) stay on mRiver, which is where the tmux state lives. The Go service stops caring about local file polling at all.
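Because the verb list is fixed, the caller could apply the same checks before spending an SSH round-trip. The following is a minimal sketch of such a pre-flight check, not part of the design: validateShimCall and checkBase64Arg are hypothetical names, and the arity rules are slightly stricter than the shim (which ignores surplus arguments). The whitespace check matters because the shim word-splits $SSH_ORIGINAL_COMMAND with read -r -a.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"regexp"
	"strings"
)

// turnIDRe mirrors TURN_ID_RE in the shim.
var turnIDRe = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// checkBase64Arg rejects empty values, embedded whitespace (which the shim's
// word-splitting would break apart), and anything that is not standard base64.
func checkBase64Arg(s string) error {
	if s == "" || strings.ContainsAny(s, " \t\r\n") {
		return fmt.Errorf("argument is empty or contains whitespace")
	}
	if _, err := base64.StdEncoding.DecodeString(s); err != nil {
		return fmt.Errorf("argument is not valid base64: %w", err)
	}
	return nil
}

// validateShimCall (hypothetical) checks a verb and its arguments against the
// shim's fixed verb list before the call goes out over SSH.
func validateShimCall(verb string, args ...string) error {
	switch verb {
	case "health", "reset":
		if len(args) != 0 {
			return fmt.Errorf("%s takes no arguments", verb)
		}
		return nil
	case "bootstrap":
		if len(args) != 1 {
			return fmt.Errorf("bootstrap takes exactly one argument")
		}
		return checkBase64Arg(args[0])
	case "run-turn":
		if len(args) != 2 {
			return fmt.Errorf("run-turn takes a turn_id and a message")
		}
		if !turnIDRe.MatchString(args[0]) {
			return fmt.Errorf("bad turn_id %q", args[0])
		}
		return checkBase64Arg(args[1])
	default:
		return fmt.Errorf("unknown verb %q", verb)
	}
}

func main() {
	msg := base64.StdEncoding.EncodeToString([]byte("hallo"))
	fmt.Println(validateShimCall("run-turn", "123e4567-e89b-12d3-a456-426614174000", msg))
	fmt.Println(validateShimCall("run-turn", "123e4567-e89b-12d3-a456-426614174000", "$(rm -rf /)"))
	fmt.Println(validateShimCall("tmux", "kill-server"))
}
```

The shim stays the authority; a client-side check like this would only turn a remote exit 2 into an earlier, cheaper local error.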
6. Sub-design C — Service-layer integration, routing, reliability
6.1 Interface split
The current *PaliadinService becomes an interface with two implementations: LocalPaliadinService (the existing tmux code, renamed) and RemotePaliadinService (the new SSH code). Construction picks one at startup based on PALIADIN_REMOTE_HOST.
// internal/services/paliadin.go (after refactor)
type Paliadin interface {
	RunTurn(ctx context.Context, req TurnRequest) (*TurnResult, error)
	ResetSession(ctx context.Context) error
	ListRecentTurns(ctx context.Context, callerID uuid.UUID, limit int) ([]PaliadinTurn, error)
	Stats(ctx context.Context, callerID uuid.UUID) (*PaliadinStats, error)
	IsOwner(ctx context.Context, userID uuid.UUID) (bool, error)
}
// LocalPaliadinService wraps the current tmux PoC (laptop / dev path).
type LocalPaliadinService struct { /* identical to today's PaliadinService */ }
// RemotePaliadinService talks to a paliadin-shim over SSH on mRiver.
type RemotePaliadinService struct {
	db             *sqlx.DB
	users          *UserService
	sshHost        string // 100.99.98.203
	sshPort        int    // 22022: bypasses Tailscale SSH on :22 (see §4.5)
	sshUser        string // m
	sshKeyPath     string // /tmp/paliadin-id_ed25519-<rand>
	knownHostsPath string // /tmp/paliadin-known_hosts (name matches its use in callShim, §6.3)
	turnMu         sync.Mutex
	// Health-check cache.
	healthMu        sync.Mutex
	healthOK        bool
	healthCheckedAt time.Time
}
DB access (ListRecentTurns, Stats, IsOwner) is identical for both — they only read paliad.paliadin_turns. They live in a shared paliadinDB helper struct embedded in both implementations.
6.2 Wiring at startup
// cmd/server/main.go (excerpt)
var paliadin services.Paliadin
remoteHost := os.Getenv("PALIADIN_REMOTE_HOST")
switch {
case remoteHost != "":
	keyPath, err := loadPaliadinSSHKey()
	if err != nil { log.Fatalf("paliadin: load ssh key: %v", err) }
	if keyPath == "" { log.Fatalf("paliadin: PALIADIN_REMOTE_HOST set but no PALIADIN_SSH_PRIVATE_KEY") }
	knownHosts, err := loadPaliadinKnownHosts()
	if err != nil { log.Fatalf("paliadin: load known_hosts: %v", err) }
	// Fail loudly on a malformed port instead of silently dialing port 0.
	port, err := strconv.Atoi(cmpOr(os.Getenv("PALIADIN_REMOTE_PORT"), "22022"))
	if err != nil { log.Fatalf("paliadin: bad PALIADIN_REMOTE_PORT: %v", err) }
	user := cmpOr(os.Getenv("PALIADIN_REMOTE_USER"), "m")
	paliadin = services.NewRemotePaliadinService(db, userSvc, services.RemotePaliadinConfig{
		SSHHost:        remoteHost,
		SSHPort:        port,
		SSHUser:        user,
		SSHKeyPath:     keyPath,
		KnownHostsPath: knownHosts,
	})
	log.Printf("paliadin: remote mode → ssh %s@%s:%d", user, remoteHost, port)
case localTmuxAvailable():
	paliadin = services.NewLocalPaliadinService(db, userSvc, "", "")
	log.Printf("paliadin: local tmux mode")
default:
	paliadin = services.NewDisabledPaliadinService(db, userSvc)
	log.Printf("paliadin: disabled (no remote host, no local tmux)")
}
NewDisabledPaliadinService exists today implicitly via the ErrTmuxUnavailable path; making it explicit gives the constructor a clear name and the handler doesn't have to special-case nil.
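The PALIADIN_REMOTE_PORT handling in the wiring could be pulled into a small helper that also guards against the one value known to be unsafe. This is a sketch under assumptions: parseRemotePort and defaultRemotePort are hypothetical names, and rejecting port 22 outright is an illustrative policy choice (per §4.5, :22 lands on Tailscale SSH, where the command= restriction is not enforced), not something the design mandates.

```go
package main

import (
	"fmt"
	"strconv"
)

// defaultRemotePort matches the default named for PALIADIN_REMOTE_PORT.
const defaultRemotePort = 22022

// parseRemotePort (hypothetical helper) turns the raw env value into a port.
// Empty means "use the default"; anything malformed is an error, not a 0.
func parseRemotePort(raw string) (int, error) {
	if raw == "" {
		return defaultRemotePort, nil
	}
	port, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("PALIADIN_REMOTE_PORT %q is not a number: %w", raw, err)
	}
	if port < 1 || port > 65535 {
		return 0, fmt.Errorf("PALIADIN_REMOTE_PORT %d is out of range", port)
	}
	if port == 22 {
		// Assumed policy: :22 is intercepted by Tailscale SSH on mRiver.
		return 0, fmt.Errorf("PALIADIN_REMOTE_PORT 22 would bypass the shim restriction")
	}
	return port, nil
}

func main() {
	for _, raw := range []string{"", "22022", "22", "abc"} {
		port, err := parseRemotePort(raw)
		fmt.Printf("%q -> %d, %v\n", raw, port, err)
	}
}
```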
6.3 SSH invocation pattern
RemotePaliadinService runs every RPC through the same helper:
func (s *RemotePaliadinService) callShim(ctx context.Context, args ...string) ([]byte, error) {
	sshArgs := []string{
		"-F", "/dev/null", // ignore /etc/ssh/ssh_config + ~/.ssh/config
		"-i", s.sshKeyPath,
		"-p", strconv.Itoa(s.sshPort), // 22022: bypasses Tailscale SSH on :22
		"-o", "IdentitiesOnly=yes", // don't fall back to other keys
		"-o", "UserKnownHostsFile=" + s.knownHostsPath,
		"-o", "StrictHostKeyChecking=yes",
		"-o", "BatchMode=yes",
		"-o", "ConnectTimeout=3",
		"-o", "ServerAliveInterval=10",
		"-o", "ServerAliveCountMax=3",
		s.sshUser + "@" + s.sshHost,
		"--",
	}
	sshArgs = append(sshArgs, args...)
	c, cancel := context.WithTimeout(ctx, 70*time.Second) // shim has its own 60s; +10s for SSH overhead
	defer cancel()
	cmd := exec.CommandContext(c, "ssh", sshArgs...)
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("paliadin: ssh shim %v: %w (stderr: %s)", args, err, stderr.String())
	}
	return stdout.Bytes(), nil
}
RunTurn becomes:
func (s *RemotePaliadinService) RunTurn(ctx context.Context, req TurnRequest) (*TurnResult, error) {
	s.turnMu.Lock()
	defer s.turnMu.Unlock()
	if err := s.healthGate(ctx); err != nil {
		return nil, err // ErrMRiverUnreachable, picked up by handler
	}
	turnID := uuid.New()
	started := time.Now().UTC()
	if err := s.insertTurnRow(ctx, …); err != nil { return nil, err }
	// First-turn-only: bootstrap the system prompt on mRiver. Tracked by a
	// per-process flag, so the first turn after a restart pays it (§6.4).
	if err := s.ensureBootstrapped(ctx); err != nil {
		_ = s.markTurnError(ctx, turnID, "bootstrap_failed")
		return nil, err
	}
	msg := sanitiseForTmux(req.UserMessage)
	msgB64 := base64.StdEncoding.EncodeToString([]byte(msg))
	body, err := s.callShim(ctx, "run-turn", turnID.String(), msgB64)
	if err != nil {
		_ = s.markTurnError(ctx, turnID, classifySSHError(err))
		return nil, err
	}
	// Same trailer-parse + audit-row writes as Local, factored into shared helper.
	return s.completeTurnFromBody(ctx, turnID, started, string(body))
}
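RunTurn hands SSH failures to classifySSHError, which the design names but does not spell out. One plausible shape, sketched here under assumptions, keys off the process exit status: ssh exits with the remote command's status, or 255 when ssh itself fails (connect, auth, host-key mismatch), and the shim exits 124 on timeout and 2 on a rejected verb or turn_id. Every code string except mriver_unreachable is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// classifyExitCode maps an ssh exit status to an audit-row error code.
// Only "mriver_unreachable" comes from the design; the rest are placeholders.
func classifyExitCode(code int) string {
	switch code {
	case 255:
		return "mriver_unreachable" // ssh itself failed before or during the session
	case 124:
		return "response_timeout" // the shim's poll loop gave up
	case 2:
		return "bad_request" // the shim rejected the verb or turn_id
	default:
		return fmt.Sprintf("shim_exit_%d", code)
	}
}

// classifySSHError unwraps the *exec.ExitError that callShim wraps with %w.
func classifySSHError(err error) string {
	if err == nil {
		return ""
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return classifyExitCode(exitErr.ExitCode())
	}
	return "ssh_failed" // e.g. the ssh binary could not be started at all
}

func main() {
	for _, code := range []int{255, 124, 2, 1} {
		fmt.Printf("exit %d -> %s\n", code, classifyExitCode(code))
	}
}
```

Mapping 255 to mriver_unreachable keeps the host-key-changed case from §5.3 on the same user-facing path as a plain offline host.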
6.4 System prompt bootstrap
The local PoC calls paliadinSystemPrompt(s.responseDir) once when it creates the pane. The remote path needs the same hook. Two options that don't require duplicating the German prompt body to mRiver:
- Lazy bootstrap (chosen): the first RunTurn after a paliad-prod restart sends the system prompt via bootstrap RPC, then runs the actual turn. Subsequent turns skip the bootstrap. State is per-process: RemotePaliadinService.bootstrapped boolean guarded by mutex.
- Eager bootstrap at startup is rejected: it forces every container start to wait for mRiver to be online, which couples paliad's boot to mRiver's availability.
Lazy bootstrap means the very first turn after a paliad redeploy pays a ~3 s extra cost (claude pane spin-up + system prompt absorb). Acceptable for a single-user PoC.
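The per-process flag described above might look like the following sketch. It is an illustration under assumptions, not the service code: bootstrapper, shimFunc, and fakeShim are hypothetical names, with shimFunc standing in for callShim and fakeShim existing only to drive the demo. The flag is set only after a successful RPC, so a failed bootstrap is retried on the next turn.

```go
package main

import (
	"context"
	"encoding/base64"
	"errors"
	"fmt"
	"sync"
)

// shimFunc stands in for RemotePaliadinService.callShim.
type shimFunc func(ctx context.Context, args ...string) ([]byte, error)

// bootstrapper holds the per-process "system prompt already sent" state.
type bootstrapper struct {
	mu           sync.Mutex
	bootstrapped bool
	prompt       string
	call         shimFunc
}

// ensureBootstrapped sends the prompt via the bootstrap RPC at most once per
// process. On failure the flag stays false, so the next turn tries again.
func (b *bootstrapper) ensureBootstrapped(ctx context.Context) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.bootstrapped {
		return nil
	}
	arg := base64.StdEncoding.EncodeToString([]byte(b.prompt))
	if _, err := b.call(ctx, "bootstrap", arg); err != nil {
		return fmt.Errorf("paliadin: bootstrap: %w", err)
	}
	b.bootstrapped = true
	return nil
}

// fakeShim is a demo-only stand-in: it fails the first `failures` calls,
// then answers "ok", counting every call in *calls.
func fakeShim(failures int, calls *int) shimFunc {
	return func(ctx context.Context, args ...string) ([]byte, error) {
		*calls++
		if *calls <= failures {
			return nil, errors.New("ssh: connect timed out")
		}
		return []byte("ok\n"), nil
	}
}

func main() {
	calls := 0
	b := &bootstrapper{prompt: "system prompt", call: fakeShim(1, &calls)}
	for turn := 1; turn <= 3; turn++ {
		err := b.ensureBootstrapped(context.Background())
		fmt.Printf("turn %d: err=%v\n", turn, err)
	}
	fmt.Println("bootstrap RPCs sent:", calls)
}
```

One limitation worth noting: a flag like this cannot see the tmux pane being recreated on mRiver while paliad keeps running, so the prompt would not be re-sent in that case.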
6.5 Health-check gating (mriver_unreachable)
Every RunTurn first calls healthGate(ctx):
- Cached for 10 s. If last check was <10 s ago and was OK, skip the probe.
- Otherwise: s.callShim(ctx, "health") with a 3 s timeout. On success, set cache OK; on failure, return ErrMRiverUnreachable.
Why 10 s: short enough that "I just woke my laptop" propagates inside one user retry; long enough that a busy chat doesn't probe on every turn.
var ErrMRiverUnreachable = errors.New("paliadin: mriver unreachable")
func (s *RemotePaliadinService) healthGate(ctx context.Context) error {
	s.healthMu.Lock()
	defer s.healthMu.Unlock()
	if s.healthOK && time.Since(s.healthCheckedAt) < 10*time.Second {
		return nil
	}
	c, cancel := context.WithTimeout(ctx, 3*time.Second)
	defer cancel()
	out, err := s.callShim(c, "health")
	s.healthCheckedAt = time.Now()
	if reply := strings.TrimSpace(string(out)); err != nil || reply != "ok" {
		s.healthOK = false
		// Include the reply so a non-"ok" answer doesn't render as "<nil>".
		return fmt.Errorf("%w: reply=%q err=%v", ErrMRiverUnreachable, reply, err)
	}
	s.healthOK = true
	return nil
}
6.6 Friendly error code (extends t-paliad-150)
friendlyErrorMessage already maps tmux_unavailable to a localised message. We add one new code:
- mriver_unreachable → DE: "mRiver ist offline — Paliadin nicht erreichbar. Mach mRiver an, oder nutze Paliadin lokal mit ./paliad." / EN: "mRiver is offline — Paliadin can't reach it. Wake mRiver, or run Paliadin locally with ./paliad."
Implementation: one new case in the SSE-error switch in frontend/src/client/paliadin.ts's friendlyErrorMessage, plus matching i18n keys (paliadin.error.mriver_unreachable.de / .en). Server-side: paliadin HTTP handler maps errors.Is(err, services.ErrMRiverUnreachable) to event: error\ndata: {"code":"mriver_unreachable","message":"..."}\n\n.
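The server-side mapping could be sketched as below. This is a hedged illustration: sseErrorEvent is a hypothetical helper, the "internal" fallback code and both message strings are placeholders, and the localised text is assumed to come from the client's friendlyErrorMessage, not from this payload.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ErrMRiverUnreachable mirrors the sentinel from §6.5.
var ErrMRiverUnreachable = errors.New("paliadin: mriver unreachable")

// sseErrorEvent (hypothetical) renders err as a Server-Sent Events "error"
// event. Only the code matters to the client; messages are placeholders.
func sseErrorEvent(err error) string {
	code, msg := "internal", "Paliadin hit an unexpected error."
	if errors.Is(err, ErrMRiverUnreachable) {
		code, msg = "mriver_unreachable", "mRiver is offline."
	}
	// json.Marshal sorts map keys, so "code" is emitted before "message".
	payload, _ := json.Marshal(map[string]string{"code": code, "message": msg})
	return fmt.Sprintf("event: error\ndata: %s\n\n", payload)
}

func main() {
	// healthGate wraps the sentinel with %w, so errors.Is still matches.
	wrapped := fmt.Errorf("%w: ssh exit status 255", ErrMRiverUnreachable)
	fmt.Print(sseErrorEvent(wrapped))
}
```

Sending a fixed message instead of err.Error() keeps ssh stderr out of the browser.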
6.7 Rate limit
A runaway loop on the paliad side could DoS mRiver's SSH endpoint. Cheapest cap: enforce one in-flight turn at a time via turnMu (already exists in the local PoC). On top of that, a rolling cap of N=20 turns/min in RemotePaliadinService rejects with ErrRateLimited (mapped to a friendly paliadin.error.rate_limited). PoC has one user (m); the cap is a paranoid safety, not a real throttle.
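A rolling cap of this kind can be kept as a short list of recent turn timestamps. The sketch below is one possible shape, not the service code: turnLimiter, newTurnLimiter, and the injectable now field are hypothetical, and the clock is injected only so the demo can move time forward deterministically.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrRateLimited mirrors the sentinel named in the text above.
var ErrRateLimited = errors.New("paliadin: rate limited")

// turnLimiter allows at most `limit` turns inside any sliding `window`.
type turnLimiter struct {
	mu     sync.Mutex
	limit  int
	window time.Duration
	now    func() time.Time // injectable clock; defaults to time.Now
	stamps []time.Time      // accepted turns still inside the window
}

func newTurnLimiter(limit int, window time.Duration) *turnLimiter {
	return &turnLimiter{limit: limit, window: window, now: time.Now}
}

// allow records a turn and returns nil, or returns ErrRateLimited (wrapped)
// when the window is already full.
func (l *turnLimiter) allow() error {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := l.now()
	cutoff := now.Add(-l.window)
	keep := l.stamps[:0] // filter in place: drop stamps that slid out
	for _, t := range l.stamps {
		if t.After(cutoff) {
			keep = append(keep, t)
		}
	}
	l.stamps = keep
	if len(l.stamps) >= l.limit {
		return fmt.Errorf("%w: %d turns within %s", ErrRateLimited, l.limit, l.window)
	}
	l.stamps = append(l.stamps, now)
	return nil
}

func main() {
	clock := time.Date(2026, 5, 7, 23, 0, 0, 0, time.UTC)
	l := newTurnLimiter(20, time.Minute)
	l.now = func() time.Time { return clock }
	for turn := 1; turn <= 21; turn++ {
		if err := l.allow(); err != nil {
			fmt.Printf("turn %d rejected: %v\n", turn, err)
		}
	}
	clock = clock.Add(61 * time.Second) // let the window slide past all stamps
	fmt.Println("after the window slides:", l.allow())
}
```

Rejected turns are not recorded, so a tight retry loop cannot extend its own lockout.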
6.8 What about ControlMaster?
Decision-2's chosen path (server-side shim with one RPC per turn) makes ControlMaster optional. The shim collapses ~10 raw-tmux ops into a single SSH connect — that's already the latency win ControlMaster would buy.
Adding it on top would save ~30–50 ms per turn but adds:
- A persistent ~/.ssh/cm-* socket inside the container.
- Cleanup logic on shutdown.
- A subtle interaction with the SSH BatchMode + ConnectTimeout settings.
Verdict: skip ControlMaster in v1. If turn latency over Tailscale is measured >300 ms in practice and hot enough to matter, add it in a follow-up; the call site is one helper.
7. Phasing
Phase A — manual proof-of-concept (no Dockerfile change yet)
Goal: validate the round-trip end-to-end on a deployed paliad, before touching the image.
Phase A.0 (DONE 2026-05-07 23:31): SSH+shim end-to-end on the tailnet.
- ✅ Generate keypair on mRiver: ssh-keygen -t ed25519 -N "" -C "paliad-prod" -f ~/.paliad-staging/paliad-prod-key. Fingerprint SHA256:5uV8v872F/IhJycjjq0crFue/emAYfw71N9bxTvkl9c.
- ✅ Commit shim to scripts/paliadin-shim and install at /home/m/.local/bin/paliadin-shim, chmod 755.
- ✅ Write authorized_keys with public key + command=/from="100.99.98.201"/no-pty/no-port-forwarding/no-agent-forwarding/no-X11-forwarding/no-user-rc restrictions (§5.2).
- ✅ Add port 22022 socket drop-in at /etc/systemd/system/ssh.socket.d/paliad.conf, systemctl daemon-reload && systemctl restart ssh.socket. Both :22 (Tailscale SSH for m) and :22022 (real OpenSSH for paliad) listening (§4.5).
- ✅ Capture mRiver:22022 host key: ssh-keyscan -p 22022 -t ed25519 100.99.98.203 > ~/.paliad-staging/known_hosts from mLake. Fingerprint SHA256:HPoUzy60Cb8yLERIBQcB2mHihNST3NaTODx5Ypd1XpA.
- ✅ Smoke-test from mLake (without paliad container, just raw ssh from mLake's host shell):
  ssh -F /dev/null -i /tmp/paliad-prod-key -o UserKnownHostsFile=/tmp/paliad-known_hosts \
      -o StrictHostKeyChecking=yes -o IdentitiesOnly=yes -o BatchMode=yes \
      -p 22022 m@100.99.98.203 health
  → ok
  ssh … run-turn $(uuidgen) "$(printf 'Sag …' | base64 -w0)"
  → "test ok" (3.4 s round-trip including a real Claude response)
- ✅ from= rejection verified: the same key from mRiver itself (100.99.98.203) → Permission denied (publickey,password) as expected.
Phase A.5 (PENDING m's hands): validate network_mode: host + traefik routing on prod paliad.de.
- Branch the live docker-compose.yml on a temp branch.
- Add network_mode: host to the web service; remove expose: ["8080"].
- Push to trigger a Dokploy redeploy.
- curl --connect-timeout 5 -sSI https://paliad.de/ should return 200 (or a login redirect), NOT 502.
- If 502: revert the temp branch (git revert HEAD && git push); revisit decision 1 in a follow-up issue.
- If 200: keep the host-mode change; ready for Phase B.
This is m's call to execute — it briefly touches prod paliad.de. Inventor/coder should not flip prod compose without explicit go-ahead. Rollback is one revert + redeploy.
Phase A.6 (after A.5 passes): smoke-test SSH from inside the paliad-prod container itself (the real container, not just the mLake host shell):
docker exec -it <paliad-container> sh
apk add --no-cache openssh-client # one-shot, before Dockerfile change
ssh -F /dev/null -i /tmp/paliad-prod-key -o UserKnownHostsFile=/tmp/paliad-known_hosts \
-o StrictHostKeyChecking=yes -o IdentitiesOnly=yes -o BatchMode=yes \
-p 22022 m@100.99.98.203 health
# expected: "ok"
This proves the container's host-mode networking actually delivers a tailnet connect.
Phase A.7: wire env vars manually via Dokploy UI for one deploy; confirm /paliadin chat works against mRiver from paliad.de.
If A.5 fails: the design rolls back to a sidecar in a new issue (decision 1 follow-up). The SSH path (A.0) and traefik path (A.5) are independent — A.0 is already proven; only A.5+ is at risk.
Phase B — bake into Dockerfile + Dokploy secrets
- Dockerfile: add openssh-client to the final stage (§4.3).
- compose: add network_mode: host and the five new env vars (§4.1).
- Dokploy secrets: register PALIADIN_REMOTE_HOST=100.99.98.203, PALIADIN_REMOTE_PORT=22022, PALIADIN_REMOTE_USER=m, PALIADIN_SSH_PRIVATE_KEY=..., PALIADIN_KNOWN_HOSTS=....
- Code: refactor PaliadinService to the interface split (§6.1–§6.2). New file internal/services/paliadin_remote.go. Tests: paliadin_remote_test.go mocks callShim to verify RunTurn audit-row writes, error mapping, and healthGate caching.
- Ship under one PR; tag t-paliad-151 done.
Phase C — friendly errors + monitoring
- paliadin.error.mriver_unreachable i18n keys + friendlyErrorMessage case (§6.6).
- /admin/paliadin shows last health-probe result + last successful turn timestamp.
- Optional: mai-mesh integration to surface mRiver-offline events to m on Telegram (out-of-band; not gating).
8. Security review summary
| Risk | Mitigation |
|---|---|
| Stolen private key → arbitrary SSH on mRiver | command= shim restriction + from="100.99.98.201" + ed25519 key + private key only in Dokploy secret store (encrypted at rest); paliad route uses port 22022 where real OpenSSH enforces all of the above |
| Stolen private key → tailnet-wide SSH from non-mLake host | from="100.99.98.201" clause (verified: rejected from mRiver itself in Phase A.0) |
| Tailscale SSH on :22 bypasses authorized_keys | The paliad-prod key's command= restriction is not enforced on :22. Mitigation: paliad always dials :22022, which is real OpenSSH. m's interactive tailscale ssh m@mriver on :22 continues to be governed by Tailscale ACLs, separate from paliad's identity. |
| Container compromise → key extraction | Key written to tmpfile chmod 600, only root inside container can read; alpine container has no shell-on-error trampolines |
| Host-key MITM during connect | Pinned known_hosts; StrictHostKeyChecking=yes |
| Shim argument injection (e.g. via run-turn $(rm -rf /)) | Shim parses positional args from $SSH_ORIGINAL_COMMAND via read -r -a; never passes args to a subshell eval; turn_id validated by UUID regex; message body always base64-decoded into a single shell variable, never re-evaluated |
| Runaway loop → SSH flood | Single-flight turnMu + 20/min rolling cap |
| network_mode: host widens blast radius | The command= + from= restrictions on mRiver mean container compromise = "can run shim verbs against mRiver only", not "shell on mRiver" |
| PaliadinOwnerEmail bypass | Unchanged from PoC: gate is in Go (/paliadin 404s for any other user). Even if mRiver SSH key leaks, attacker still needs paliad session as m@hoganlovells.com. |
9. Out-of-scope clarifications (for review)
These were called out in the issue but the design intentionally does not solve them, to keep v1 tight. Each is acknowledged so review knows it wasn't an oversight:
- Wake-on-LAN of mRiver: out of scope. v1's UX when mRiver is asleep is the friendly error from §6.6. Future work: integrate with mai-mesh capability fallback.
- Multi-host failover: out of scope. Only mRiver is targeted.
- Anthropic API fallback when mRiver offline: out of scope per CLAUDE.md (ANTHROPIC_API_KEY reserved for production-v1, unused in PoC).
- ControlMaster: v1 ships without; revisit if turn latency >300 ms in practice (§6.8).
10. File-level deliverables (for the coder shift)
When this design is approved and the coder shift starts, the work splits roughly into:
- Dockerfile: + openssh-client.
- docker-compose.yml: set network_mode: host and add five new env entries (PALIADIN_REMOTE_HOST, PALIADIN_REMOTE_PORT, PALIADIN_REMOTE_USER, PALIADIN_SSH_PRIVATE_KEY, PALIADIN_KNOWN_HOSTS).
- internal/services/paliadin.go: extract Paliadin interface; rename existing to LocalPaliadinService; pull DB-only methods (ListRecentTurns, Stats, IsOwner) into a shared embedded paliadinDB so both implementations get them for free.
- internal/services/paliadin_remote.go (new file): RemotePaliadinService, RemotePaliadinConfig (with SSHPort), callShim, healthGate, ensureBootstrapped, classifySSHError, ErrMRiverUnreachable.
- internal/services/paliadin_remote_test.go: unit tests with a mocked callShim.
- cmd/server/main.go: env-var-based wiring (§6.2), loadPaliadinSSHKey, loadPaliadinKnownHosts, PALIADIN_REMOTE_PORT parse with default 22022.
- frontend/src/client/paliadin.ts: one case in friendlyErrorMessage for mriver_unreachable.
- frontend/src/i18n.ts: two new keys (paliadin.error.mriver_unreachable.de / .en).
- scripts/paliadin-shim: server-side script (§5.4); already shipped + installed on mRiver during Phase A.0, not part of any container. Repo location chosen so the security-relevant script is version-controlled.
- docs/project-status.md: note Phase 0.5 (PoC) → Phase 0.6 (Tailscale-SSH prod route).
- mRiver host setup (one-time, already done in Phase A.0): /etc/systemd/system/ssh.socket.d/paliad.conf (port 22022 listen drop-in); ~/.ssh/authorized_keys (paliad-prod public key with restrictions); /home/m/.local/bin/paliadin-shim (executable). These are NOT in the repo because they live on m's laptop; docs/project-status.md should reference them.
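For the cmd/server/main.go item, the PALIADIN_REMOTE_PORT parse with its 22022 default might be sketched as follows. The function name parseRemotePort, the range check, and the error wording are assumptions; only the env var name and the default value are taken from this design.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultRemotePort is the OpenSSH port from the locked decisions (§3).
const defaultRemotePort = 22022

// parseRemotePort is a hypothetical helper: an empty value falls back to
// the default, anything else must be a valid TCP port number.
func parseRemotePort(raw string) (int, error) {
	if raw == "" {
		return defaultRemotePort, nil
	}
	port, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("PALIADIN_REMOTE_PORT %q is not a number: %w", raw, err)
	}
	if port < 1 || port > 65535 {
		return 0, fmt.Errorf("PALIADIN_REMOTE_PORT %d is outside 1-65535", port)
	}
	return port, nil
}

func main() {
	port, err := parseRemotePort(os.Getenv("PALIADIN_REMOTE_PORT"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("dialing mRiver on port", port)
}
```

Failing at startup on a malformed value, rather than silently using the default, keeps a typo in Dokploy from quietly routing to the wrong port.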
No DB migrations needed — paliad.paliadin_turns schema already covers everything (error_code field already accepts free-form codes including mriver_unreachable).
11. Open questions for review
- Q (m), still open: Phase A.5 (traefik+host-mode on prod paliad.de) is not yet executed. m drives this; rollback is one revert. Dokploy doc check before flipping is recommended but not blocking.
- Q (m), resolved 2026-05-07 23:50: shim location → repo (scripts/paliadin-shim, committed in 0248411). Version-controlled and auditable.
- Q (m), still open: ANTHROPIC_API_KEY env var reservation in compose comments: keep for production-v1, or strip now? Not blocking either phase; defer.
12. Phase A.0 completion summary (2026-05-07 23:50)
Coder shift (noether) executed Phase A.0 in full:
- ✅ shim committed at scripts/paliadin-shim (commit 0248411, repo-version-controlled)
- ✅ shim installed at /home/m/.local/bin/paliadin-shim on mRiver
- ✅ ed25519 keypair paliad-prod generated, public-key fingerprint SHA256:5uV8v872F/IhJycjjq0crFue/emAYfw71N9bxTvkl9c, private key staged at ~/.paliad-staging/paliad-prod-key on mRiver (mode 600)
- ✅ ~/.ssh/authorized_keys written with command=/from=/no-pty/no-port-forwarding/no-agent-forwarding/no-X11-forwarding/no-user-rc restrictions
- ✅ ssh.socket drop-in installed at /etc/systemd/system/ssh.socket.d/paliad.conf; both :22 and :22022 listening
- ✅ host key for :22022 captured at ~/.paliad-staging/known_hosts (fingerprint SHA256:HPoUzy60Cb8yLERIBQcB2mHihNST3NaTODx5Ypd1XpA)
- ✅ end-to-end SSH+shim+Claude run-turn validated from mLake → mRiver:22022 (3.4 s round-trip)
- ✅ from="100.99.98.201" rejection verified
Three secrets ready for Dokploy registration (m to copy from ~/.paliad-staging/ on mRiver):
- PALIADIN_SSH_PRIVATE_KEY ← cat ~/.paliad-staging/paliad-prod-key
- PALIADIN_KNOWN_HOSTS ← cat ~/.paliad-staging/known_hosts
- PALIADIN_REMOTE_HOST=100.99.98.203, PALIADIN_REMOTE_PORT=22022, PALIADIN_REMOTE_USER=m
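To show how these values come together at call time, here is a rough sketch of assembling the ssh argument vector for a run-turn call. The remoteConfig type, the function names, and the file paths are hypothetical, and standard (not URL-safe) base64 is an assumption about what the shim decodes. The ssh flags used (-p, -i, and the IdentitiesOnly, BatchMode, StrictHostKeyChecking, and UserKnownHostsFile options) are standard OpenSSH client options.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// remoteConfig is a hypothetical stand-in for RemotePaliadinConfig.
type remoteConfig struct {
	Host, User     string
	Port           int
	KeyPath        string // tmpfile holding PALIADIN_SSH_PRIVATE_KEY, mode 600
	KnownHostsPath string // tmpfile holding PALIADIN_KNOWN_HOSTS
}

// uuidRe mirrors the shim-side turn_id check on the client side.
var uuidRe = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// sshArgs builds the argv passed to the ssh binary for one shim verb.
func sshArgs(c remoteConfig, shimArgs ...string) []string {
	args := []string{
		"-p", strconv.Itoa(c.Port),
		"-i", c.KeyPath,
		"-o", "IdentitiesOnly=yes",
		"-o", "BatchMode=yes",
		"-o", "StrictHostKeyChecking=yes",
		"-o", "UserKnownHostsFile=" + c.KnownHostsPath,
		c.User + "@" + c.Host,
	}
	return append(args, shimArgs...)
}

// runTurnArgs validates the turn ID and base64-encodes the message so that
// no user text reaches the remote command line unencoded.
func runTurnArgs(c remoteConfig, turnID, message string) ([]string, error) {
	if !uuidRe.MatchString(turnID) {
		return nil, fmt.Errorf("turn id %q is not a UUID", turnID)
	}
	body := base64.StdEncoding.EncodeToString([]byte(message))
	return sshArgs(c, "run-turn", turnID, body), nil
}

func main() {
	c := remoteConfig{
		Host: "100.99.98.203", User: "m", Port: 22022,
		KeyPath: "/tmp/paliad-key", KnownHostsPath: "/tmp/paliad-known-hosts",
	}
	args, err := runTurnArgs(c, "123e4567-e89b-12d3-a456-426614174000", "hi")
	if err != nil {
		panic(err)
	}
	fmt.Println("ssh", strings.Join(args, " "))
}
```

Because the port is not 22, OpenSSH looks the host up in known_hosts under the bracketed form [100.99.98.203]:22022, so the staged known_hosts file must contain that form for StrictHostKeyChecking=yes to pass.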
Phase A.5 (traefik+host-mode test) and Phase A.6/A.7 (in-container SSH smoke + paliad/paliadin end-to-end) await m's hands — they touch prod paliad.de.
Phase B (Dockerfile + Go interface split + Dokploy secrets) is unblocked from a code perspective — but should not merge until Phase A.5 confirms the host-mode networking trade-off is acceptable.
Inventor design + coder Phase A.0 complete. Awaiting m for Phase A.5 traefik validation before the coder writes the Go interface split.