paliad/docs/cicd-runner-setup-2026-05-25.md
mAi c901293c9c
feat(cicd): Slice A — pre-deploy gate + role-split migration smoke
Adds .gitea/workflows/test.yaml that gates every push on `go build`,
`bun run build`, `go vet`, the migration coordination check, and the
role-split end-to-end migration smoke. On push to main + green, calls
Dokploy's compose.deploy API and polls /health/ready until 200.

t-paliad-282 / m/paliad#114. Design: docs/design-cicd-pre-deploy-gate-2026-05-25.md
(inventor shift on mai/cronus/inventor-ci-cd-pre).

Catches all three of today's outage classes:

  brunel (~13:20) slot collision     -> TestMigrations_NoDuplicateSlot
  hermes (~16:05) dropped-col refs   -> TestBootSmoke
  mig 129 (~14:56) 42501 ownership   -> TestMigrations_EndToEndAsAppRole

Snapshot approach. internal/db/testdata/prod-snapshot.sql is a pg_dump
of youpc-supabase paliad schema + applied_migrations rows. CI restores
this into a fresh `supabase/postgres:15.8.1.060` (same image, same role
topology as prod) and runs ApplyMigrations as the `postgres` role
(which is NOT a superuser on supabase/postgres, matching prod). Existing
migrations are skipped (already in applied_migrations); only NEW migs
from the PR run end-to-end. This sidesteps the fresh-DB idempotence
debt in some historical migrations (mig 037 missing pg_trgm, mig 051
inner COMMIT) — those are tracked separately and don't block the gate.

Sub-changes:

- internal/handlers/handlers.go — new /health/ready endpoint distinct
  from /healthz. /healthz stays liveness (process alive, no DB);
  /health/ready is readiness (DB pool pings within 2 s). Returns 503
  when svc or pool is nil (DB-less deploys are intentionally not-ready).
  svc.Pool added to handlers.Services, wired in cmd/server/main.go.

- internal/db/migrate_test.go — TestMigrations_NoDuplicateSlot (pure
  unit, catches brunel) and TestMigrations_EndToEndAsAppRole (snapshot-
  gated, catches the 42501 class).

- cmd/server/main_smoke_test.go — TestBootSmoke now also asserts
  /health/ready returns 503 with a nil svc. New TestHealthReady_Live
  asserts 200 against a live pool.

- internal/db/migrations/024_rename_department_columns.up.sql and
  027_rename_to_partner_units.up.sql — ALTER INDEX / ALTER POLICY
  exception handlers now catch undefined_object OR undefined_table OR
  duplicate_object. Old handler only caught undefined_object; Postgres
  raises undefined_table when source object never existed, and
  duplicate_object when destination already exists. The expanded
  handlers make these migrations truly idempotent across all plausible
  starting states.

- Makefile — verify-mig-app, test-frontend, refresh-snapshot targets.
  refresh-snapshot pg_dumps youpc-supabase prod (needs PALIAD_PROD_DATABASE_URL),
  strips pg16 \restrict commands for pg15 restore compat, and filters
  applied_migrations rows to this branch's max on-disk version.

- internal/db/testdata/README.md — explains the snapshot's purpose,
  refresh procedure, and how to verify locally.

- docs/cicd-runner-setup-2026-05-25.md — one-time admin steps for
  registering a Gitea Actions runner on mriver and wiring DOKPLOY_TOKEN
  as a repo secret. Documents soft-launch plan per m's Q11.4 (keep
  Dokploy's autoDeploy=true webhook alive for one week, disable after
  the workflow has gated 5 successful deploys).

Build clean. Full go test ./internal/... ./cmd/... green without
TEST_DATABASE_URL. With TEST_DATABASE_URL + TEST_APP_DATABASE_URL set
to a supabase/postgres scratch + snapshot restored:
TestMigrations_NoDuplicateSlot, TestMigrations_EndToEndAsAppRole,
TestBootSmoke, TestHealthReady_Live all pass. Live-DB service tests in
internal/services/* fail under supabase/postgres 15.8 with a 42P08
parameter-binding error (unrelated to Slice A — tracked as a follow-up).
2026-05-25 17:42:06 +02:00


CI/CD runner setup — paliad

Companion to: docs/design-cicd-pre-deploy-gate-2026-05-25.md (Slice A, t-paliad-282 / m/paliad#114)
Date: 2026-05-25
Audience: mlake / mriver admin (m or head)

Slice A's .gitea/workflows/test.yaml requires (a) at least one online Gitea Actions runner and (b) a Dokploy API token wired as a repo secret. Both are one-time setup actions that paliad's source tree cannot perform itself because they live on the infra side. This doc lists them so the workflow can go green on its first run.


0. Pre-flight: what already exists

Verified live (2026-05-25 cronus inventor shift):

  • Gitea 1.24.4 on mgit.msbls.de, has_actions: true on m/paliad.
  • /api/v1/admin/actions/runners reports 2 runners registered. They are likely the shared runners used by m/mGreen and m/mGeo (both have .gitea/workflows/deploy.yml with runs-on: self-hosted).
  • m/paliad/actions/tasks reports total_count=0: paliad has not run a workflow yet.

The existing runners may already be capable of running paliad's workflow without further setup. The verification step (§3) below tells you whether they are.


1. Runner placement decision (m's Q11.1)

m's pick: mriver.

Rationale: mriver hosts the mai worker fleet but workers spend most of their time waiting on Anthropic. mlake's Dokploy + Swarm workload is more contended. A new runner on mriver adds the least pressure to either box.

If mriver is offline or saturated when CI first fires, fall back to the existing mlake-side runners (they're already registered; no provisioning needed).


2. One-time setup (admin steps)

2.1 Register a new Gitea Actions runner on mriver

# On mriver, as m:
# 1. Download the act_runner binary (matching Gitea 1.24.x)
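# /usr/local/bin is root-owned, so download and chmod via sudo.
# -f makes curl fail on an HTTP error instead of saving the error page.
sudo curl -fL -o /usr/local/bin/act_runner \
  https://gitea.com/gitea/act_runner/releases/download/v0.2.13/act_runner-0.2.13-linux-amd64
sudo chmod +x /usr/local/bin/act_runner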

# 2. Get a runner registration token. In the Gitea UI:
#    /admin → Actions → Runners → "Create new Runner"
#    (or repo-scope: /m/paliad/settings/actions/runners)
# Copy the token.

# 3. Register
mkdir -p ~/act_runner && cd ~/act_runner
act_runner register --no-interactive \
  --instance https://mgit.msbls.de \
  --token <REGISTRATION_TOKEN> \
  --name mriver-paliad-1 \
  --labels ubuntu-latest:docker://node:20-bookworm

# 4. Run as a systemd unit (preferred) or as a session daemon
# Systemd unit example: /etc/systemd/system/act_runner.service
#   [Unit]
#   Description=Gitea Actions runner
#   After=network.target
#   [Service]
#   User=m
#   WorkingDirectory=/home/m/act_runner
#   ExecStart=/usr/local/bin/act_runner daemon
#   Restart=on-failure
#   [Install]
#   WantedBy=multi-user.target
sudo systemctl enable --now act_runner
sudo systemctl status act_runner

Why ubuntu-latest:docker://node:20-bookworm for the label? Gitea Actions resolves runs-on: ubuntu-latest through the runner's label map. Mapping the label to a Docker image makes the runner execute each job in a container started from that image, which requires Docker on the host; the Postgres service container in test.yaml needs Docker as well. mriver should already have Docker (for paliadin-shim); if not, install it.

2.2 Register the Dokploy API token as a repo secret

The workflow's deploy job needs secrets.DOKPLOY_TOKEN. Use the existing project-wide Dokploy API key (the one stored in ~/.claude/skills/mai-dokploy/SKILL.md).

In the Gitea UI:

  • Navigate to https://mgit.msbls.de/m/paliad/settings/actions/secrets
  • Click "Add secret"
    • Name: DOKPLOY_TOKEN
    • Value: <DOKPLOY_API_KEY> (the project-wide key referenced above)

Or via API (mAi identity):

curl --netrc-file ~/.netrc-mai -sS -X POST \
  -H "Content-Type: application/json" \
  https://mgit.msbls.de/api/v1/repos/m/paliad/actions/secrets/DOKPLOY_TOKEN \
  -d '{"data":"<DOKPLOY_API_KEY>"}'

(Requires repo-owner permission. If mAi lacks it, m runs it.)


3. Verify the runner sees the workflow

After (2.1) + (2.2):

# Push the Slice A branch (the one this doc lives on)
git push origin mai/cronus/coder-cicd-slice-a

# Confirm the runner picked up the job
curl --netrc-file ~/.netrc-mai -sS \
  "https://mgit.msbls.de/api/v1/repos/m/paliad/actions/tasks?limit=5" | jq '.'

A new task per job should appear (build, test-go). If total_count stays 0, the runner labels don't match the workflow's runs-on. Re-register with --labels ubuntu-latest (no docker:// suffix) and the existing runners on mlake will pick it up via shell mode.
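
The label-matching rule behind that diagnosis can be sketched as below. It assumes only that a runner label has the form name:schema:arg and that a job is picked up when its runs-on value equals a label name; the function names are hypothetical helpers, not part of act_runner.

```shell
# label_name LABEL
# Prints the part of a runner label before the first colon,
# e.g. "ubuntu-latest:docker://node:20-bookworm" -> "ubuntu-latest".
label_name() {
  printf '%s\n' "${1%%:*}"
}

# runner_matches RUNS_ON LABELS
# LABELS is a comma-separated list, as passed to `act_runner register --labels`.
# Returns 0 if any label name equals RUNS_ON, 1 otherwise.
runner_matches() {
  runs_on=$1
  old_ifs=$IFS
  IFS=,
  for label in $2; do
    if [ "$(label_name "$label")" = "$runs_on" ]; then
      IFS=$old_ifs
      return 0
    fi
  done
  IFS=$old_ifs
  return 1
}
```

For example, `runner_matches ubuntu-latest "self-hosted"` returns 1, which is the situation where a runner registered only for runs-on: self-hosted never sees a job that asks for ubuntu-latest.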


4. Soft-launch (m's Q11.4)

m's pick: keep both Dokploy auto-deploy and the workflow's deploy step alive for ~1 week. After ≥5 successful green deploys via the workflow, disable Dokploy's autoDeploy in the Dokploy UI for the paliad compose.

While both are live, every push to main fires:

  1. Dokploy webhook (existing path) → deploys immediately, no gate.
  2. Gitea workflow → on green, ALSO calls compose.deploy.

The second call is idempotent: if Dokploy already deployed the same commit, this is a no-op. The workflow's value during soft-launch is the gate signal. A red workflow on a commit that already reached main means the bad change shipped via the unguarded webhook and may have broken prod, and the workflow is shouting about it.
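
A minimal sketch of what the workflow's compose.deploy call might look like from a shell. The /api/compose.deploy path, the x-api-key header, and the composeId body field are assumptions based on Dokploy's public API conventions and should be checked against the instance's API reference; the base URL and compose ID are placeholders, and the function name is hypothetical.

```shell
# dokploy_deploy BASE_URL COMPOSE_ID
# Triggers a deploy of one compose project. The API key is read from
# $DOKPLOY_TOKEN (the same name as the repo secret) rather than hard-coded.
# ASSUMPTION: endpoint path, header name and body field follow Dokploy's
# documented API shape; verify before relying on this.
dokploy_deploy() {
  base_url=$1
  compose_id=$2
  : "${DOKPLOY_TOKEN:?DOKPLOY_TOKEN must be set}"
  curl -sS --fail -X POST \
    -H "Content-Type: application/json" \
    -H "x-api-key: ${DOKPLOY_TOKEN}" \
    -d "{\"composeId\":\"${compose_id}\"}" \
    "${base_url}/api/compose.deploy"
}
```

With --fail, a 4xx or 5xx response makes curl exit non-zero, so a calling step can stop instead of going on to poll a deploy that never started.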

After confidence builds:

  1. In the Dokploy UI, navigate to the paliad compose → Settings.
  2. Toggle "Auto Deploy" off.
  3. Save.

From this point, the only path to deploy is the workflow's deploy job. Red workflow = no deploy.


5. What Slice A catches today — and what it doesn't

After this branch (mai/cronus/coder-cicd-slice-a) merges to main:

Catches (active in CI)

  • Build breakage: go build, go vet, bun run build. Red gate, no deploy.
  • Slot collisions: TestMigrations_NoDuplicateSlot runs without a DB. A PR adding migration N when version N already exists fails at gate time. This is the brunel-class catch (m/paliad#114 ~13:20 outage).
  • New-migration shape errors (hermes class): TestBootSmoke runs ApplyMigrations against the snapshot-restored DB. New migs from this PR get applied for real; any column/relation/syntax error fails the gate before merge.
  • New-migration ownership errors (mig 129 42501 class): TestMigrations_EndToEndAsAppRole runs ApplyMigrations connected as postgres (NON-superuser on supabase/postgres:15.8.1.060, same role topology as youpc-supabase prod). Any migration that assumes supabase_admin privilege fails with the same 42501 "must be owner" error class that took paliad.de offline on 2026-05-25.
  • Readiness probe regressions: TestHealthReady_Live confirms /health/ready returns 200 against a live pool, 503 against a nil pool.
  • Pure-Go test regressions: go test ./internal/... ./cmd/... runs without TEST_DATABASE_URL (live-DB service tests skip the same way they do on a developer laptop without a scratch DB).
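
The slot-collision idea can also be checked from a shell before pushing. This is a rough local stand-in, not the Go test itself: it assumes migrations are named with a numeric prefix followed by an underscore and end in .up.sql (as 024_rename_department_columns.up.sql does), and the function name is hypothetical.

```shell
# duplicate_slots DIR
# Prints each numeric prefix used by more than one *.up.sql file in DIR,
# one per line. Prints nothing when every slot is unique.
duplicate_slots() {
  dir=$1
  for f in "$dir"/*.up.sql; do
    [ -e "$f" ] || continue          # the glob matched nothing
    name=${f##*/}                    # strip the directory part
    printf '%s\n' "${name%%_*}"      # keep the text before the first underscore
  done | sort | uniq -d
}
```

A pre-push hook could run `[ -z "$(duplicate_slots internal/db/migrations)" ] || exit 1` to refuse a push that reuses a slot.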

Mechanism — the snapshot approach

CI's scratch DB starts from a pg_dump of youpc-supabase paliad schema + paliad.applied_migrations rows, committed to internal/db/testdata/prod-snapshot.sql. After restore, the scratch DB is at "paliad HEAD of snapshot" and ApplyMigrations sees only this PR's new migrations as pending.
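
The "only this PR's migrations are pending" effect is a set difference between the versions on disk and the versions recorded as applied. A minimal sketch, assuming the applied versions have been exported to a text file with one version per line (a hypothetical export; the real ApplyMigrations consults applied_migrations in the database) and that both sides use the same zero-padded format:

```shell
# on_disk_versions DIR
# Prints the numeric prefix of every DIR/*.up.sql, sorted, one per line.
on_disk_versions() {
  for f in "$1"/*.up.sql; do
    [ -e "$f" ] || continue
    name=${f##*/}
    printf '%s\n' "${name%%_*}"
  done | sort -u
}

# pending_versions DIR APPLIED_FILE
# Prints versions that exist on disk but are absent from APPLIED_FILE.
# Versions are compared as strings, so "024" and "24" would not match.
pending_versions() {
  applied_sorted=$(mktemp)
  disk_sorted=$(mktemp)
  sort -u "$2" > "$applied_sorted"
  on_disk_versions "$1" > "$disk_sorted"
  comm -23 "$disk_sorted" "$applied_sorted"   # lines only in the first file
  rm -f "$applied_sorted" "$disk_sorted"
}
```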

This sidesteps the fresh-DB idempotence problem: several historical migrations (notably mig 037's missing CREATE EXTENSION pg_trgm, mig 051's inner COMMIT;) can't be replayed from scratch against supabase/postgres:15.8.1.060. The snapshot pins everything that's already applied in prod and lets CI focus on what's new — which is what we actually care about for outage prevention.

Snapshot refresh: make refresh-snapshot with PALIAD_PROD_DATABASE_URL set (see internal/db/testdata/README.md).
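
The commit description says the refresh target strips \restrict commands so that a dump taken with a newer pg_dump restores cleanly on pg15. The Makefile's actual recipe is not shown here; one way such a filter might look, assuming the meta-commands appear on their own lines as \restrict TOKEN and \unrestrict TOKEN:

```shell
# strip_restrict
# Copies a plain-text pg_dump from stdin to stdout, dropping the
# \restrict / \unrestrict meta-command lines that an older psql rejects.
strip_restrict() {
  sed -e '/^\\restrict /d' -e '/^\\unrestrict /d'
}
```

Used as a filter, for example `strip_restrict < raw-dump.sql > internal/db/testdata/prod-snapshot.sql`, where raw-dump.sql is a placeholder name for the unfiltered dump.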

Known gap — live-DB service tests don't run in CI

internal/services/*_test.go tests with TEST_DATABASE_URL set fail against supabase/postgres:15.8.1.060 with 42P08 "inconsistent types deduced for parameter" errors on some INSERT bind paths. The same tests pass against youpc-supabase prod. The cause is unconfirmed; a likely candidate is a subtle difference in type inference between the dockerized image and the prod cluster's configuration. CI today runs go test ./... without TEST_DATABASE_URL, so these tests skip. This does not block outage prevention and is tracked as a follow-up for the post-Slice-A coder.

Migration cleanup also bundled in this PR

Two surgical migration improvements that surfaced during snapshot debugging — kept here because they're small and harmless:

  • mig 024 + 027: ALTER INDEX / ALTER POLICY exception handlers now catch undefined_object OR undefined_table OR duplicate_object. Old handler caught only undefined_object; Postgres raises undefined_table when the source object never existed and duplicate_object when the destination already exists. The expanded handler makes the migrations truly idempotent across the three plausible states: source-still-German (rename succeeds), already-renamed (catches duplicate_object), and fresh-DB-never-had-German (catches undefined_table).
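
To make the handler shape concrete, the helper below prints a PL/pgSQL DO block that renames a policy and swallows the three conditions named above. It is an illustration of the pattern only, not the text of mig 024 or 027; the function name and every identifier passed to it are hypothetical.

```shell
# idempotent_policy_rename TABLE OLD_NAME NEW_NAME
# Prints a DO block on stdout; pipe it to psql to execute it.
# The \$ escapes keep the shell from expanding $$ inside the heredoc.
idempotent_policy_rename() {
  cat <<EOF
DO \$\$
BEGIN
  ALTER POLICY "$2" ON $1 RENAME TO "$3";
EXCEPTION
  WHEN undefined_object OR undefined_table OR duplicate_object THEN
    NULL;  -- nothing left to rename in this starting state
END
\$\$;
EOF
}
```

Running `idempotent_policy_rename paliad.example_table old_policy new_policy` prints the block for inspection before it is pasted into a migration.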

Other migration history bugs (mig 037 missing pg_trgm, mig 051 inner COMMIT) are tracked as a separate cleanup task — not blocking, because the snapshot bypasses them.

Verification checklist (after Slice A merges)

  1. Workflow green on its first PR run? Check /m/paliad/actions. If not, fix before merging.
  2. Dokploy compose.deploy call succeeds? The workflow's deploy job logs the POST response. A successful response is a Dokploy job ID; a 4xx is an auth or compose-id problem.
  3. /health/ready returns 200 within 5 minutes after a green deploy? The workflow polls this. If it times out, the migration may have failed silently inside the new container — check docker logs --tail 50 compose-transmit-multi-byte-driver-v7jth9-web-1 on mlake.
  4. Reproduce the slot-collision catch locally: rename 131_…up.sql to 129_… (duplicate slot) → workflow MUST fail at Migration coordination check. Revert before pushing.
  5. Reproduce the role-split catch locally: add a no-op migration 132_test_supersedes.up.sql containing REINDEX SYSTEM paliad_scratch; (requires superuser). Workflow MUST fail at Migration end-to-end (deploy role). Revert before pushing.
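
For checklist item 3, the readiness poll can be reproduced by hand. This is a sketch of the polling idea rather than the workflow's actual step: the 300-second default mirrors the "within 5 minutes" above, while the 5-second interval and the function name are assumptions.

```shell
# wait_ready URL [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Returns 0 as soon as URL answers 200, 1 if the timeout elapses first.
wait_ready() {
  url=$1
  timeout=${2:-300}
  interval=${3:-5}
  elapsed=0
  while [ "$elapsed" -le "$timeout" ]; do
    # -w prints only the status code; the body is discarded.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true)
    if [ "$code" = "200" ]; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}
```

Typical use is `wait_ready https://<host>/health/ready || echo "not ready, check container logs"`, with the host left as a placeholder.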

6. Future polish (Slice D, m's Q4 R-pick)

mai-test post-merge shift: once Slice A is stable, wire a Gitea webhook on push-to-main that fires /mai-test as a follow-up shift. It runs the broader smoke + integration suite and posts results as a Gitea commit status. Not blocking; the gate doesn't depend on it.

Implementation belongs in m/mAi (the mai webhook handler), not in paliad. Out of scope for Slice A.