OUTAGE: paliad.de container down — schema_migrations stuck at 106 while main has 123 #104

mAi · 2026-05-25T13:38:02Z

mAi commented

2026-05-25 13:38:02 +00:00

Diagnosis (head, 2026-05-25 15:37)

m reports paliad.de returns 404 on every path. Confirmed:

$ curl -s -I https://paliad.de/healthz
HTTP/2 404
content-type: text/plain; charset=utf-8
content-length: 19

$ curl -s https://paliad.de/healthz
404 page not found

That's Traefik's default 404 — no healthy upstream backend.

Schema is stuck 17 migrations behind

SELECT version, dirty FROM paliad.paliad_schema_migrations ORDER BY version DESC LIMIT 1;
  → version=106, dirty=false

Repo main is on commit 51fca93 with migrations 107–123 all checked in. Either the container is failing to boot somewhere between 107 and 123, or Dokploy auto-deploys have been failing silently and the production binary is stale (built before migration 107 landed).
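To put a number on the gap, here is a minimal sketch that finds the highest migration number on disk and subtracts the DB version. It assumes migration files follow the `NNN_name.up.sql` pattern seen in this issue; the function names are made up for illustration.

```bash
# Print the highest numeric prefix among NNN_name.up.sql files in a directory.
# Prints 0 when the directory holds no matching files.
latest_migration_on_disk() {
  local dir="$1" latest=0 f base num
  for f in "$dir"/*.up.sql; do
    [ -e "$f" ] || continue              # glob matched nothing
    base="${f##*/}"                      # strip the directory part
    num="${base%%_*}"                    # text before the first underscore
    case "$num" in
      ''|*[!0-9]*) continue ;;           # skip names without a numeric prefix
    esac
    num="${num#"${num%%[!0]*}"}"         # drop leading zeros (007 -> 7)
    [ -n "$num" ] || num=0
    if [ "$num" -gt "$latest" ]; then
      latest="$num"
    fi
  done
  echo "$latest"
}

# Print how many migrations the database is behind the files on disk.
migrations_behind() {
  local db_version="$1" disk_version="$2"
  echo $(( disk_version - db_version ))
}
```

Run from the repo root, `migrations_behind 106 "$(latest_migration_on_disk internal/db/migrations)"` should print the gap, with 106 taken from the query above.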

Likely candidates (per project CLAUDE.md traps)

1. Ownership-by-supabase_admin: if any migration 107–123 created a table or function while the dev session was using Supabase MCP (which runs as supabase_admin), the runtime postgres role can't ALTER. m's CLAUDE.md flags this exact trap from t-paliad-238 mig 119. Same fix: `ALTER TABLE/FUNCTION ... OWNER TO postgres` via Supabase MCP.
2. Dokploy webhook silently down: Gitea push doesn't trigger Dokploy redeploy; old container keeps running until crash + restart cycle eats it.
3. `docker-compose.yml` env-list gap: a recent merge introduced a `${VAR}` reference but didn't declare it under `services.web.environment:`. Same trap as t-paliad-238 Add-User 503.
4. A migration syntax error that boots the container into a panic loop.

What to do — fixer workflow

1. Triage: hit Dokploy API for the paliad compose status + last build log:
   ```bash
   curl --netrc -s https://dokploy.msbls.de/api/<paliad-compose>/status   # or similar; check docs
   ```
   Or read the compose ID `Zx147ycurfYagKRl_Zzyo` on mlake directly. paliad CLAUDE.md: `Hosting: Dokploy compose on mlake (72.62.52.189)`.
2. Logs: pull the container's last 200 lines of stdout/stderr from Dokploy or ssh mlake + docker logs.
3. Identify the failing migration / step. Patterns to look for:
   - `failed to migrate: …` → migration error; find the offending migration file + diagnose
   - `must be owner of …` → ownership trap; fix via Supabase MCP `ALTER ... OWNER TO postgres`
   - `pq: column "X" does not exist` → schema mismatch; data shape vs DDL
   - panic on env-var lookup → missing env var declaration
4. Fix: smallest correct change. Document the root cause in the completion report.
5. Re-deploy: trigger Dokploy redeploy + verify schema_migrations now equals the latest file in `internal/db/migrations/`.
6. Verify: `curl https://paliad.de/healthz` returns 200; `paliad.paliad_schema_migrations` version matches main.
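The pattern list in step 3 can be turned into a quick filter over the container log. This is a rough sketch, not an existing tool: the function names and cause labels are made up, and the `migration failed:` wording is included because that is how the startup error is phrased in the root-cause comment on this issue.

```bash
# Map one container log line to the likely cause, using the patterns above.
classify_log_line() {
  case "$1" in
    *"must be owner of"*)                    echo "ownership" ;;
    *"failed to migrate:"*)                  echo "migration-error" ;;
    *"migration failed:"*)                   echo "migration-error" ;;
    *'pq: column "'*'" does not exist'*)     echo "schema-mismatch" ;;
    *panic*)                                 echo "panic" ;;   # check env vars first
    *)                                       echo "unknown" ;;
  esac
}

# Read log lines on stdin and print "cause<TAB>line" for every recognised line.
triage_log() {
  local line cause
  while IFS= read -r line || [ -n "$line" ]; do
    cause="$(classify_log_line "$line")"
    [ "$cause" = "unknown" ] || printf '%s\t%s\n' "$cause" "$line"
  done
}
```

On mlake this would be fed with something like `docker logs --tail 200 <container> 2>&1 | triage_log`, where `<container>` is whatever the paliad web container is called.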

Hard rules

Don't roll back migrations unless absolutely necessary; favour rolling forward.
Don't bypass auth to fix things; if Dokploy access requires m's auth, ask head to involve him.
Cite the failing migration / root cause in the issue comment + completion report.
Branch: mai/<worker>/outage-paliad-deploy.
go build ./... && go test ./internal/... && cd frontend && bun run build must stay clean (probably already clean since the code-side compiles; bug is at deploy/migration time).

Out of scope

Migrating to a different host.
Rebuilding the deploy pipeline from scratch.
Reverting the recent merges (they're correct; the deploy is broken).

Reporting

mai report completed with: root cause + the migration / config that broke + how it was fixed + verification (curl healthz 200, schema_migrations.version equals latest file). Comment on this issue with the same.


mAi self-assigned this 2026-05-25 13:38:02 +00:00

mAi commented

2026-05-25 13:44:32 +00:00

Root cause + fix (head, 2026-05-25 15:42)

Container was crashlooping with:

migration failed: migration 123: disk name "backups" != DB name "de_inf_lg_replik_duplik_sequencing" (renamed after apply? revert the rename, or UPDATE paliad.applied_migrations SET name="backups" WHERE version=123 if the rename is intentional)

What happened:

1. Cronus's #77 Slice A merged `123_backups.up.sql` to main (~12:00).
2. Brunel's #95 in-process testing wrote a row to `paliad.applied_migrations` claiming `(123, 'de_inf_lg_replik_duplik_sequencing')` at 13:24, even though his file was never committed to disk.
3. Subsequent deploys found cronus's `123_backups.up.sql` on disk, compared its name against the recorded row for version 123, hit the mismatch, and bailed at startup.
4. Result: container in restart loop since ~13:24; paliad.de serves Traefik's 404.

Fix applied: DELETE FROM paliad.applied_migrations WHERE version=123 AND name='de_inf_lg_replik_duplik_sequencing'. Container will retry, apply cronus's actual 123_backups SQL (which is idempotent + guarded), and boot. Monitoring https://paliad.de/healthz for 200.

Brunel's in-process DB changes for de.inf.lg.replik/duplik remain intact (DELETE only removed the applied_migrations marker, not the rule UPDATEs themselves). His eventual 124_de_inf_lg_replik_duplik_sequencing.up.sql will land with idempotent guards and no-op the DB state while recording the applied_migrations row properly.
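For anyone who wants to spot this kind of drift before a deploy trips over it, here is a rough shell approximation of the name check. It is not the real startup check (that lives in the Go binary and is not shown here). The function names and the "version name" row format on stdin are assumptions for illustration, e.g. an export of `paliad.applied_migrations`.

```bash
# Print the name part of the migration file for a version, e.g. 123 -> backups.
# Returns 1 when no NNN_name.up.sql file exists for that version.
disk_name_for_version() {
  local dir="$1" version="$2" f base
  for f in "$dir/${version}"_*.up.sql; do
    [ -e "$f" ] || continue
    base="${f##*/}"            # 123_backups.up.sql
    base="${base#*_}"          # backups.up.sql
    echo "${base%.up.sql}"     # backups
    return 0
  done
  return 1
}

# Read "version name" rows on stdin and report rows whose name differs from
# the file on disk. Returns 1 when at least one mismatch was found.
check_applied_names() {
  local dir="$1" version db_name disk_name status=0
  while read -r version db_name; do
    [ -n "$version" ] || continue
    disk_name="$(disk_name_for_version "$dir" "$version")" || continue
    if [ "$disk_name" != "$db_name" ]; then
      echo "migration $version: disk name \"$disk_name\" != DB name \"$db_name\""
      status=1
    fi
  done
  return "$status"
}
```

Fed the row `123 de_inf_lg_replik_duplik_sequencing` against a directory containing `123_backups.up.sql`, it reports the same disk-vs-DB mismatch that crashlooped the container.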

Lessons banked

1. NEVER write to `paliad.applied_migrations` during in-process testing. Test via Supabase MCP queries OR a temp DB; never poke the production migration tracker.
2. Migration slot allocation across active workers needs coordination. Today cronus took 123 (backups), brunel inadvertently also targeted 123 (during testing). Going forward: head reserves slots when spawning workers (e.g. brunel = 124, hermes = 125, icarus = 126, atlas = 127, communicated up front in the instruction).
3. A container restart loop on migration mismatch produces a clean error message in the container log but no Dokploy build failure. Dokploy's deploy log shows status=done. Detection requires watching the container log directly, not just the deploy status.
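The slot reservation in lesson 2 is just a counter starting one past the highest slot already taken. A minimal sketch (`reserve_slots` is a made-up helper, not part of the mai tooling):

```bash
# Print "worker=slot" pairs, starting one past the highest slot already taken.
# Usage: reserve_slots <highest_taken_slot> <worker>...
reserve_slots() {
  local next=$(( $1 + 1 )) worker
  shift
  for worker in "$@"; do
    echo "$worker=$next"
    next=$(( next + 1 ))
  done
}
```

`reserve_slots 123 brunel hermes icarus atlas` prints the assignment from the example above, one pair per line, ready to paste into the worker instructions.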

Follow-up

The Dokploy auto-deploy webhook + build pipeline is healthy; this was a DB-state issue, not infra. No infrastructure changes needed.

Closing this issue when paliad.de/healthz returns 200.
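The healthz watch can be scripted instead of refreshed by hand. A sketch under assumptions: `wait_for_200` and `probe_healthz` are illustrative names, and the attempt count and delay are whatever suits the restart cadence.

```bash
# Poll a probe command until it prints 200 or the attempts run out.
# Usage: wait_for_200 <attempts> <delay_seconds> <probe command...>
# The probe is any command that prints an HTTP status code on stdout.
wait_for_200() {
  local attempts="$1" delay="$2" i=1 code=""
  shift 2
  while [ "$i" -le "$attempts" ]; do
    code="$("$@" 2>/dev/null)" || code=""
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"                     # wait only between attempts
    fi
    i=$(( i + 1 ))
  done
  echo "still unhealthy after $attempts attempt(s), last status: ${code:-none}"
  return 1
}

# Real probe: print only the HTTP status code for /healthz.
probe_healthz() {
  curl -s -o /dev/null -w '%{http_code}' https://paliad.de/healthz
}
```

For example, `wait_for_200 30 10 probe_healthz` checks every 10 seconds for up to 30 attempts. Keeping the probe as a separate command means the loop can be exercised with a stub before pointing it at production.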



Reference: m/paliad#104