Adds .gitea/workflows/test.yaml that gates every push on `go build`, `bun run build`, `go vet`, the migration coordination check, and the role-split end-to-end migration smoke. On push to main + green, calls Dokploy's compose.deploy API and polls /health/ready until 200.

t-paliad-282 / m/paliad#114. Design: docs/design-cicd-pre-deploy-gate-2026-05-25.md (inventor shift on mai/cronus/inventor-ci-cd-pre).

Catches all three of today's outage classes:

    brunel (~13:20)   slot collision    -> TestMigrations_NoDuplicateSlot
    hermes (~16:05)   dropped-col refs  -> TestBootSmoke
    mig 129 (~14:56)  42501 ownership   -> TestMigrations_EndToEndAsAppRole

Snapshot approach. internal/db/testdata/prod-snapshot.sql is a pg_dump of youpc-supabase paliad schema + applied_migrations rows. CI restores this into a fresh `supabase/postgres:15.8.1.060` (same image, same role topology as prod) and runs ApplyMigrations as the `postgres` role (which is NOT a superuser on supabase/postgres, matching prod). Existing migrations are skipped (already in applied_migrations); only NEW migs from the PR run end-to-end. This sidesteps the fresh-DB idempotence debt in some historical migrations (mig 037 missing pg_trgm, mig 051 inner COMMIT); those are tracked separately and don't block the gate.

Sub-changes:

- internal/handlers/handlers.go: new /health/ready endpoint distinct from /healthz. /healthz stays liveness (process alive, no DB); /ready is readiness (DB pool pings within 2 s). Returns 503 when svc or pool is nil (DB-less deploys are intentionally not-ready). svc.Pool added to handlers.Services, wired in cmd/server/main.go.
- internal/db/migrate_test.go: TestMigrations_NoDuplicateSlot (pure unit, catches brunel) and TestMigrations_EndToEndAsAppRole (snapshot-gated, catches the 42501 class).
- cmd/server/main_smoke_test.go: TestBootSmoke now also asserts /health/ready returns 503 with a nil svc. New TestHealthReady_Live asserts 200 against a live pool.
- internal/db/migrations/024_rename_department_columns.up.sql and 027_rename_to_partner_units.up.sql: ALTER INDEX / ALTER POLICY exception handlers now catch undefined_object OR undefined_table OR duplicate_object. Old handler only caught undefined_object; Postgres raises undefined_table when source object never existed, and duplicate_object when destination already exists. The expanded handlers make these migrations truly idempotent across all plausible starting states.
- Makefile: verify-mig-app, test-frontend, refresh-snapshot targets. refresh-snapshot pg_dumps youpc-supabase prod (needs PALIAD_PROD_DATABASE_URL), strips pg16 \restrict commands for pg15 restore compat, and filters applied_migrations rows to this branch's max on-disk version.
- internal/db/testdata/README.md: explains the snapshot's purpose, refresh procedure, and how to verify locally.
- docs/cicd-runner-setup-2026-05-25.md: one-time admin steps for registering a Gitea Actions runner on mriver and wiring DOKPLOY_TOKEN as a repo secret. Documents soft-launch plan per m's Q11.4 (keep Dokploy's autoDeploy=true webhook alive for one week, disable after the workflow has gated 5 successful deploys).

Build clean. Full go test ./internal/... ./cmd/... green without TEST_DATABASE_URL. With TEST_DATABASE_URL + TEST_APP_DATABASE_URL set to a supabase/postgres scratch + snapshot restored: TestMigrations_NoDuplicateSlot, TestMigrations_EndToEndAsAppRole, TestBootSmoke, TestHealthReady_Live all pass. Live-DB service tests in internal/services/* fail under supabase/postgres 15.8 with a 42P08 parameter-binding error (unrelated to Slice A, tracked as a follow-up).

# CI/CD runner setup: paliad

**Companion to:** `docs/design-cicd-pre-deploy-gate-2026-05-25.md` (Slice A, t-paliad-282 / m/paliad#114)
**Date:** 2026-05-25
**Audience:** mlake / mriver admin (m or head)

Slice A's `.gitea/workflows/test.yaml` requires (a) at least one online Gitea Actions runner and (b) a Dokploy API token wired as a repo secret. Both are one-time setup actions that paliad's source tree cannot perform itself; they live on the infra side. This doc lists them so the workflow can go green on its first run.

---

## 0. Pre-flight: what already exists

Verified live (2026-05-25 cronus inventor shift):

- Gitea 1.24.4 on `mgit.msbls.de`, `has_actions: true` on `m/paliad`.
- `/api/v1/admin/actions/runners` reports **2 runners** registered. They are likely the shared runners used by `m/mGreen` and `m/mGeo` (both have `.gitea/workflows/deploy.yml` with `runs-on: self-hosted`).
- `m/paliad/actions/tasks` reports `total_count=0`, so paliad has not run a workflow yet.

The existing runners may already be capable of running paliad's workflow without further setup. The verification step (§3) below tells you whether they are.

---

## 1. Runner placement decision (m's Q11.1)

m's pick: **mriver**.

Rationale: mriver hosts the mai worker fleet, but the workers spend most of their time waiting on Anthropic. mlake's Dokploy + Swarm workload is more contended. A new runner on mriver adds the least pressure to either box.

If mriver is offline or saturated when CI first fires, fall back to the existing mlake-side runners (they're already registered; no provisioning needed).

---

## 2. One-time setup (admin steps)

### 2.1 Register a new Gitea Actions runner on mriver

```bash
# On mriver, as m:
# 1. Download the act_runner binary (matching Gitea 1.24.x).
#    /usr/local/bin is normally root-owned, hence sudo.
sudo curl -L -o /usr/local/bin/act_runner \
  https://gitea.com/gitea/act_runner/releases/download/v0.2.13/act_runner-0.2.13-linux-amd64
sudo chmod +x /usr/local/bin/act_runner

# 2. Get a runner registration token. In the Gitea UI:
#    /admin → Actions → Runners → "Create new Runner"
#    (or repo scope: /m/paliad/settings/actions/runners)
#    Copy the token.

# 3. Register (replace the quoted placeholder with the copied token)
mkdir -p ~/act_runner && cd ~/act_runner
act_runner register --no-interactive \
  --instance https://mgit.msbls.de \
  --token "<REGISTRATION_TOKEN>" \
  --name mriver-paliad-1 \
  --labels ubuntu-latest:docker://node:20-bookworm

# 4. Run as a systemd unit (preferred) or as a session daemon
# Systemd unit example: /etc/systemd/system/act_runner.service
#   [Unit]
#   Description=Gitea Actions runner
#   After=network.target
#   [Service]
#   User=m
#   WorkingDirectory=/home/m/act_runner
#   ExecStart=/usr/local/bin/act_runner daemon
#   Restart=on-failure
#   [Install]
#   WantedBy=multi-user.target
# After saving the unit file, reload systemd and start the runner:
sudo systemctl daemon-reload
sudo systemctl enable --now act_runner
sudo systemctl status act_runner
```

**Why `ubuntu-latest:docker://node:20-bookworm` for the label?** Gitea Actions' `runs-on: ubuntu-latest` resolves via the runner's label map. Mapping it to a Docker image gives the workflow a sandbox with Docker available, which is required for our Postgres service container in `test.yaml`. mriver should have Docker (for `paliadin-shim`); if not, install it.

### 2.2 Register the Dokploy API token as a repo secret

The workflow's `deploy` job needs `secrets.DOKPLOY_TOKEN`. Use the existing project-wide Dokploy API key (the one stored in `~/.claude/skills/mai-dokploy/SKILL.md`).

In the Gitea UI:
- Navigate to `https://mgit.msbls.de/m/paliad/settings/actions/secrets`
- Click "Add secret"
- Name: `DOKPLOY_TOKEN`
- Value: `mai-ottosSyRHMhmLhhhXaCbKzbqKBuSqzqEtmKDOPelPCeimTaYsbmaVslVyEgJZGCIxVdz`

Or via API (mAi identity):
```bash
# Gitea's create-or-update endpoint for a repo Actions secret uses PUT, not POST.
curl --netrc-file ~/.netrc-mai -sS -X PUT \
  -H "Content-Type: application/json" \
  https://mgit.msbls.de/api/v1/repos/m/paliad/actions/secrets/DOKPLOY_TOKEN \
  -d '{"data":"mai-ottosSyRHMhmLhhhXaCbKzbqKBuSqzqEtmKDOPelPCeimTaYsbmaVslVyEgJZGCIxVdz"}'
```

(Requires repo-owner permission. If mAi lacks it, m runs it.)

---

## 3. Verify the runner sees the workflow

After (2.1) + (2.2):

```bash
# Push the Slice A branch (the one this doc lives on)
git push origin mai/cronus/coder-cicd-slice-a

# Confirm the runner picked up the job
curl --netrc-file ~/.netrc-mai -sS \
  "https://mgit.msbls.de/api/v1/repos/m/paliad/actions/tasks?limit=5" | jq '.'
```

A new task per job should appear (build, test-go). If `total_count` stays 0, the runner labels don't match the workflow's `runs-on`. Re-register with `--labels ubuntu-latest` (no `docker://` suffix) and the existing runners on mlake will pick it up via shell mode.

---

## 4. Soft-launch (m's Q11.4)

m's pick: **keep both Dokploy auto-deploy and the workflow's deploy step alive for ~1 week. After ≥5 successful green deploys via the workflow, disable Dokploy's autoDeploy in the Dokploy UI for the paliad compose.**

While both are live, every push to main fires:
1. Dokploy webhook (existing path) → deploys immediately, no gate.
2. Gitea workflow → on green, ALSO calls `compose.deploy`.

The second call is idempotent: if Dokploy already deployed the same commit, it is a no-op. The workflow's value during soft-launch is the **gate signal**: a red workflow on main means the bad migration has already shipped via the unguarded webhook and broken prod, and the workflow is shouting about it.

After confidence builds:
1. In the Dokploy UI, navigate to the paliad compose → Settings.
2. Toggle "Auto Deploy" off.
3. Save.

From this point, the only path to deploy is the workflow's deploy job. Red workflow = no deploy.

---

## 5. What Slice A catches today, and what it doesn't

After this branch (`mai/cronus/coder-cicd-slice-a`) merges to main:

### Catches (active in CI)

- **Build breakage**: `go build`, `go vet`, `bun run build`. Red gate, no deploy.
- **Slot collisions**: `TestMigrations_NoDuplicateSlot` runs without a DB. A PR adding migration N when version N already exists fails at gate time. This is the brunel-class catch (m/paliad#114 ~13:20 outage).
- **New-migration shape errors (hermes class)**: `TestBootSmoke` runs `ApplyMigrations` against the snapshot-restored DB. New migs from this PR get applied for real; any column/relation/syntax error fails the gate before merge.
- **New-migration ownership errors (mig 129 42501 class)**: `TestMigrations_EndToEndAsAppRole` runs `ApplyMigrations` connected as `postgres` (NON-superuser on `supabase/postgres:15.8.1.060`, same role topology as youpc-supabase prod). Any migration that assumes supabase_admin privilege fails with the same `42501 must be owner` error class that took paliad.de offline on 2026-05-25.
- **Readiness probe regressions**: `TestHealthReady_Live` confirms `/health/ready` returns 200 against a live pool; `TestBootSmoke` asserts 503 with a nil svc.
- **Pure-Go test regressions**: `go test ./internal/... ./cmd/...` runs without `TEST_DATABASE_URL` (live-DB service tests skip the same way they do on a developer laptop without a scratch DB).

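
The slot-collision catch can be pictured with a small standalone sketch. This is not the code of `TestMigrations_NoDuplicateSlot`; it assumes migrations are named `NNN_name.up.sql` (as the file names in this doc suggest), and the helper names and sample files below are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// migrationVersion extracts the numeric slot from a file name such as
// "131_add_widgets.up.sql". The parsing rule is an assumption for this sketch.
func migrationVersion(name string) (int, bool) {
	if !strings.HasSuffix(name, ".up.sql") {
		return 0, false
	}
	prefix, _, found := strings.Cut(name, "_")
	if !found {
		return 0, false
	}
	v, err := strconv.Atoi(prefix)
	if err != nil {
		return 0, false
	}
	return v, true
}

// duplicateSlots returns every version claimed by more than one file,
// mapped to the sorted list of files claiming it.
func duplicateSlots(names []string) map[int][]string {
	byVersion := make(map[int][]string)
	for _, n := range names {
		if v, ok := migrationVersion(n); ok {
			byVersion[v] = append(byVersion[v], n)
		}
	}
	dups := make(map[int][]string)
	for v, files := range byVersion {
		if len(files) > 1 {
			sort.Strings(files)
			dups[v] = files
		}
	}
	return dups
}

func main() {
	// Hypothetical directory listing: two branches both claimed slot 129.
	files := []string{
		"128_add_notes.up.sql",
		"129_fix_owner.up.sql",
		"129_other_branch.up.sql",
		"130_add_index.up.sql",
	}
	for v, claimed := range duplicateSlots(files) {
		fmt.Printf("slot %d claimed by %v\n", v, claimed)
	}
}
```

Because the check only needs file names, it can run without a database, which matches the "runs without a DB" property described above.
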
### Mechanism: the snapshot approach

CI's scratch DB starts from a `pg_dump` of youpc-supabase paliad schema + `paliad.applied_migrations` rows, committed to `internal/db/testdata/prod-snapshot.sql`. After restore, the scratch DB is at "paliad HEAD of snapshot" and `ApplyMigrations` sees only this PR's new migrations as pending.

This sidesteps the fresh-DB idempotence problem: several historical migrations (notably mig 037's missing `CREATE EXTENSION pg_trgm`, mig 051's inner `COMMIT;`) can't be replayed from scratch against `supabase/postgres:15.8.1.060`. The snapshot pins everything that's already applied in prod and lets CI focus on what's new, which is what we actually care about for outage prevention.

Snapshot refresh: `make refresh-snapshot` with `PALIAD_PROD_DATABASE_URL` set (see `internal/db/testdata/README.md`).

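
Why the snapshot leaves only the PR's migrations pending comes down to a set difference between on-disk versions and the restored `applied_migrations` rows. The sketch below illustrates that idea only; `pendingVersions` and the version numbers are hypothetical, and the real `ApplyMigrations` may track more than a version number.

```go
package main

import (
	"fmt"
	"sort"
)

// pendingVersions returns the on-disk versions that are not recorded as
// applied, in ascending order. Versions already in the snapshot (including
// ones that cannot be replayed from scratch) are never executed again.
func pendingVersions(onDisk []int, applied map[int]bool) []int {
	var pending []int
	for _, v := range onDisk {
		if !applied[v] {
			pending = append(pending, v)
		}
	}
	sort.Ints(pending)
	return pending
}

func main() {
	// Hypothetical numbers: the snapshot records versions 1..130 as applied,
	// and the PR under test adds version 131 on disk.
	applied := make(map[int]bool)
	var onDisk []int
	for v := 1; v <= 131; v++ {
		onDisk = append(onDisk, v)
		if v <= 130 {
			applied[v] = true
		}
	}
	fmt.Println(pendingVersions(onDisk, applied)) // prints [131]
}
```

In this picture mig 037 and mig 051 sit inside the applied set, so their fresh-DB problems never surface during the gate.
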
### Known gap: live-DB service tests don't run in CI

`internal/services/*_test.go` tests with `TEST_DATABASE_URL` set fail against `supabase/postgres:15.8.1.060` with `42P08 inconsistent types deduced for parameter` errors on some INSERT bind paths. The same tests pass against youpc-supabase prod. Cause is unconfirmed; it is likely a subtle difference in type inference between the dockerized image and the prod cluster's configuration. CI today runs `go test ./internal/... ./cmd/...` without `TEST_DATABASE_URL`, so these tests skip. Not blocking outage prevention; tracked as a follow-up for the post-Slice-A coder.

### Migration cleanup also bundled in this PR

Two surgical migration improvements that surfaced during snapshot debugging, kept here because they're small and harmless:

- **mig 024 + 027**: `ALTER INDEX` / `ALTER POLICY` exception handlers now catch `undefined_object` OR `undefined_table` OR `duplicate_object`. Old handler caught only `undefined_object`; Postgres raises `undefined_table` when the source object never existed and `duplicate_object` when the destination already exists. The expanded handler makes the migrations truly idempotent across the three plausible states: source-still-German (rename succeeds), already-renamed (catches `duplicate_object`), and fresh-DB-never-had-German (catches `undefined_table`).

Other migration history bugs (mig 037 missing pg_trgm, mig 051 inner COMMIT) are tracked as a separate cleanup task. They are not blocking, because the snapshot bypasses them.

### Verification checklist (after Slice A merges)

1. **Workflow green on its first PR run?** Check `/m/paliad/actions`. If not, fix before merging.
2. **Dokploy `compose.deploy` call succeeds?** The workflow's `deploy` job logs the POST response. A successful response is a Dokploy job ID; a 4xx is an auth or compose-id problem.
3. **`/health/ready` returns 200 within 5 minutes after a green deploy?** The workflow polls this. If it times out, the migration may have failed silently inside the new container; check `docker logs --tail 50 compose-transmit-multi-byte-driver-v7jth9-web-1` on mlake.
4. **Reproduce the slot-collision catch locally:** rename `131_…up.sql` to `129_…` (duplicate slot) → workflow MUST fail at `Migration coordination check`. Revert before pushing.
5. **Reproduce the role-split catch locally:** add a no-op migration `132_test_supersedes.up.sql` containing `REINDEX SYSTEM paliad_scratch;` (requires superuser). Workflow MUST fail at `Migration end-to-end (deploy role)`. Revert before pushing.

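
The readiness wait from checklist item 3 can be sketched as a poll loop with a deadline. This is an illustration under stated assumptions, not the workflow's actual step: `waitReady` and `simulateDeploy` are hypothetical names, the poll interval is invented, and a local stand-in server replaces the real host so the sketch runs offline.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// waitReady polls url until it answers 200 or ctx expires.
// It returns the number of requests made.
func waitReady(ctx context.Context, client *http.Client, url string, interval time.Duration) (int, error) {
	attempts := 0
	for {
		attempts++
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return attempts, err
		}
		if resp, err := client.Do(req); err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return attempts, nil
			}
		}
		select {
		case <-ctx.Done():
			return attempts, errors.New("not ready before deadline")
		case <-time.After(interval):
		}
	}
}

// simulateDeploy starts a stand-in server that answers 503 for the first
// `failures` requests and 200 afterwards, then polls it with a deadline of
// timeoutMs milliseconds. The real gate would use a 5-minute budget.
func simulateDeploy(failures int, timeoutMs int) (int, error) {
	var seen int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&seen, 1) <= int32(failures) {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	return waitReady(ctx, srv.Client(), srv.URL+"/health/ready", 10*time.Millisecond)
}

func main() {
	attempts, err := simulateDeploy(2, 2000)
	fmt.Println("attempts:", attempts, "err:", err)
}
```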
---

## 6. Future polish (Slice D, m's Q4 R-pick)

`mai-test` post-merge shift: once Slice A is stable, wire a Gitea webhook on push-to-main that fires `/mai-test` as a follow-up shift. It runs the broader smoke + integration suite and posts results as a Gitea commit status. Not blocking; the gate doesn't depend on it.

Implementation belongs in `m/mAi` (the mai webhook handler), not in paliad. Out of scope for Slice A.