OUTAGE: paliad.de container down — schema_migrations stuck at 106 while main has 123 #104

Open
opened 2026-05-25 13:38:02 +00:00 by mAi · 1 comment
Collaborator

Diagnosis (head, 2026-05-25 15:37)

m reports paliad.de returns 404 on every path. Confirmed:

$ curl -s -I https://paliad.de/healthz
HTTP/2 404
content-type: text/plain; charset=utf-8
content-length: 19

$ curl -s https://paliad.de/healthz
404 page not found

That's Traefik's default 404: no router currently matches the host, i.e. there is no healthy upstream backend.

Schema is stuck 17 migrations behind

SELECT version, dirty FROM paliad.paliad_schema_migrations ORDER BY version DESC LIMIT 1;
-- → version=106, dirty=false

Repo main is on commit 51fca93, with migrations 107–123 all checked in. Either the container is failing to boot somewhere between 107 and 123, or Dokploy auto-deploys have been failing silently and the production binary is months old.
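
The gap between the DB and the repo can be computed rather than counted by hand. This is a minimal sketch, assuming up-migration files are named `<version>_<name>.up.sql` (as in `123_backups.up.sql`); the function names are hypothetical and not part of the paliad codebase.

```python
import re

# Assumed naming convention: "<version>_<name>.up.sql", e.g. 123_backups.up.sql.
MIGRATION_RE = re.compile(r"^(\d+)_(.+)\.up\.sql$")


def parse_migration(filename):
    """Return (version, name) for an up-migration file name, or None if it does not match."""
    match = MIGRATION_RE.match(filename)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def pending_migrations(filenames, db_version):
    """List the (version, name) pairs on disk that are newer than the DB's recorded version."""
    parsed = [p for p in map(parse_migration, filenames) if p is not None]
    return sorted(p for p in parsed if p[0] > db_version)
```

Fed the file listing of `internal/db/migrations/` and the version returned by the query above, the length of the result is the number of migrations the container still has to apply.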

Likely candidates (per project CLAUDE.md traps)

  1. Ownership-by-supabase_admin — if any migration 107–123 created a table or function while the dev session was using Supabase MCP (which runs as supabase_admin), the runtime postgres role can't ALTER. m's CLAUDE.md flags this exact trap from t-paliad-238 mig 119. Same fix: ALTER TABLE/FUNCTION ... OWNER TO postgres via supabase MCP.
  2. Dokploy webhook silently down — Gitea push doesn't trigger Dokploy redeploy; old container keeps running until crash + restart cycle eats it.
  3. docker-compose.yml env-list gap — a recent merge introduced a ${VAR} reference but didn't declare it under services.web.environment:. Same trap as t-paliad-238 Add-User 503.
  4. A migration syntax error that boots the container into a panic loop.
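
Candidate 3 can be checked mechanically before a deploy. The sketch below is an assumption about how such a check might look, not an existing tool: it collects the `${VAR}` names a compose file references and reports those missing from a caller-supplied list of declared names (for example, the keys listed under `services.web.environment:`). The function name and the `declared` parameter are hypothetical.

```python
import re

# Matches ${VAR}, ${VAR:-default} and ${VAR-default}, capturing only the name.
# The lookbehind skips "$${VAR}", which compose treats as a literal "${VAR}".
VAR_REF_RE = re.compile(r"(?<!\$)\$\{([A-Za-z_][A-Za-z0-9_]*)")


def undeclared_vars(compose_text, declared):
    """Return the referenced ${VAR} names that do not appear in `declared`, sorted."""
    referenced = set(VAR_REF_RE.findall(compose_text))
    return sorted(referenced - set(declared))
```

An empty result does not prove the environment is complete (the application may read variables the compose file never mentions), so this only narrows the search.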

What to do — fixer workflow

  1. Triage: hit Dokploy API for the paliad compose status + last build log:

    curl --netrc -s https://dokploy.msbls.de/api/<paliad-compose>/status   # or similar — check docs
    

    Or read the compose ID Zx147ycurfYagKRl_Zzyo on mlake directly. paliad's CLAUDE.md states "Hosting: Dokploy compose on mlake (72.62.52.189)".

  2. Logs: pull the container's last 200 lines of stdout/stderr from Dokploy, or ssh to mlake and run docker logs --tail 200 <container>.

  3. Identify the failing migration / step. Patterns to look for:

    • failed to migrate: … → migration error; find the offending migration file + diagnose
    • must be owner of … → ownership trap; fix via Supabase MCP ALTER ... OWNER TO postgres
    • pq: column "X" does not exist → schema mismatch; data shape vs DDL
    • panic on env-var lookup → missing env var declaration
  4. Fix: smallest correct change. Document the root cause in the completion report.

  5. Re-deploy: trigger a Dokploy redeploy, then verify that the paliad.paliad_schema_migrations version now equals the number of the latest file in internal/db/migrations/.

  6. Verify: curl https://paliad.de/healthz returns 200; paliad.paliad_schema_migrations version matches main.
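
The pattern list in step 3 can be turned into a small log classifier. This is a sketch under assumptions: the three quoted patterns come from the list above, the category labels are illustrative, and the `panic:` prefix is an assumption based on how Go runtime panics are printed, since the list gives no literal string for the env-var case.

```python
import re

# (pattern, label) pairs; the first three patterns are quoted from step 3.
LOG_PATTERNS = [
    (re.compile(r"failed to migrate:"), "migration error"),
    (re.compile(r"must be owner of"), "ownership trap"),
    (re.compile(r'pq: column ".*" does not exist'), "schema mismatch"),
    # Assumption: Go panics start with "panic:"; check for a missing env var.
    (re.compile(r"^panic:"), "panic"),
]


def classify_line(line):
    """Return the label of the first matching pattern, or None."""
    for pattern, label in LOG_PATTERNS:
        if pattern.search(line):
            return label
    return None


def classify_log(text):
    """Return the distinct labels found in a log dump, in first-seen order."""
    seen = []
    for line in text.splitlines():
        label = classify_line(line)
        if label is not None and label not in seen:
            seen.append(label)
    return seen
```

The list only covers known traps, so a log that yields no labels still needs to be read by hand, especially if the same unmatched line repeats on every restart.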

Hard rules

  • Don't roll back migrations unless absolutely necessary; favour rolling forward.
  • Don't bypass auth to fix things; if Dokploy access requires m's auth, ask head to involve him.
  • Cite the failing migration / root cause in the issue comment + completion report.
  • Branch: mai/<worker>/outage-paliad-deploy.
  • go build ./... && go test ./internal/... && cd frontend && bun run build must stay clean (probably already clean since the code-side compiles; bug is at deploy/migration time).

Out of scope

  • Migrating to a different host.
  • Rebuilding the deploy pipeline from scratch.
  • Reverting the recent merges (they're correct; the deploy is broken).

Reporting

mai report completed with: root cause + the migration / config that broke + how it was fixed + verification (curl healthz 200, schema_migrations.version equals latest file). Comment on this issue with the same.
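
The two verification conditions can be expressed as one check. This is a minimal sketch, assuming the `<version>_<name>.up.sql` naming seen in `123_backups.up.sql`; the HTTP status and DB version are passed in as plain values so the check itself needs no network or database access, and both function names are hypothetical.

```python
import re


def latest_migration_version(filenames):
    """Return the highest version among up-migration file names, or None if there are none."""
    versions = []
    for filename in filenames:
        match = re.match(r"^(\d+)_.+\.up\.sql$", filename)
        if match is not None:
            versions.append(int(match.group(1)))
    return max(versions) if versions else None


def deploy_verified(healthz_status, db_version, filenames):
    """True only if healthz answered 200 and the DB version equals the latest file on disk."""
    return healthz_status == 200 and db_version == latest_migration_version(filenames)
```

For the state described in the diagnosis (status 404, version 106, latest file 123) this returns False on both counts.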

mAi self-assigned this 2026-05-25 13:38:02 +00:00
Author
Collaborator

Root cause + fix (head, 2026-05-25 15:42)

Container was crashlooping with:

migration failed: migration 123: disk name "backups" != DB name "de_inf_lg_replik_duplik_sequencing" (renamed after apply? revert the rename, or UPDATE paliad.applied_migrations SET name="backups" WHERE version=123 if the rename is intentional)

What happened:

  1. Cronus's #77 Slice A merged 123_backups.up.sql to main (~12:00).
  2. Brunel's #95 in-process testing wrote a row to paliad.applied_migrations claiming (123, 'de_inf_lg_replik_duplik_sequencing') at 13:24 — even though his file was never committed to disk.
  3. Subsequent deploys found cronus's 123_backups.up.sql on disk, compared its name with the row recorded for version 123, hit the name mismatch, and bailed at startup.
  4. Result: container in restart loop since ~13:24, paliad.de Traefik 404.
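
The sequence above can be reproduced in a few lines. This is a reconstruction inferred from the error message only, not the actual runner code: versions and names are modeled as plain dicts, and `check_applied_names`, `pending`, and `MigrationNameMismatch` are hypothetical names.

```python
class MigrationNameMismatch(Exception):
    """Raised when a recorded migration name differs from the file name on disk."""


def check_applied_names(disk, applied):
    """disk and applied map version -> name; raise on the first version whose names differ."""
    for version in sorted(applied):
        db_name = applied[version]
        disk_name = disk.get(version)
        if disk_name is not None and disk_name != db_name:
            raise MigrationNameMismatch(
                f'migration {version}: disk name "{disk_name}" != DB name "{db_name}"'
            )


def pending(disk, applied):
    """Return the versions still to apply, refusing to continue on any name mismatch."""
    check_applied_names(disk, applied)
    return [version for version in sorted(disk) if version not in applied]
```

With `disk = {123: "backups"}` and `applied = {123: "de_inf_lg_replik_duplik_sequencing"}` the check raises at startup on every boot, which matches the restart loop; once the stray row for 123 is removed from `applied`, `pending` returns `[123]` and the file on disk is applied normally.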

Fix applied: DELETE FROM paliad.applied_migrations WHERE version=123 AND name='de_inf_lg_replik_duplik_sequencing'. Container will retry, apply cronus's actual 123_backups SQL (which is idempotent + guarded), and boot. Monitoring https://paliad.de/healthz for 200.

Brunel's in-process DB changes for de.inf.lg.replik/duplik remain intact (DELETE only removed the applied_migrations marker, not the rule UPDATEs themselves). His eventual 124_de_inf_lg_replik_duplik_sequencing.up.sql will land with idempotent guards and no-op the DB state while recording the applied_migrations row properly.

Lessons banked

  1. NEVER write to paliad.applied_migrations during in-process testing. Test via supabase MCP queries OR a temp DB; never poke the production migration tracker.
  2. Migration slot allocation across active workers needs coordination. Today cronus took 123 (backups), brunel inadvertently also targeted 123 (during testing). Going forward: head reserves slots when spawning workers (e.g. brunel = 124, hermes = 125, icarus = 126, atlas = 127 — communicated up front in the instruction).
  3. Container restart loop on migration mismatch produces a clean error message but no Dokploy build failure. Dokploy's deploy log shows status=done. Detection requires watching the container log directly, not just the deploy status.
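
The slot reservation in lesson 2 amounts to handing out consecutive numbers after the latest migration on main. A minimal sketch, assuming head knows the latest version and the ordered list of workers; `reserve_slots` is a hypothetical helper, not an existing mai command.

```python
def reserve_slots(last_version, workers):
    """Assign consecutive migration numbers after last_version, one per worker, in order."""
    if len(set(workers)) != len(workers):
        raise ValueError("each worker may appear only once")
    return {worker: last_version + offset for offset, worker in enumerate(workers, start=1)}
```

Calling `reserve_slots(123, ["brunel", "hermes", "icarus", "atlas"])` reproduces the allocation given in the example above (124 through 127).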

Follow-up

The Dokploy auto-deploy webhook + build pipeline is healthy; this was a DB-state issue, not infra. No infrastructure changes needed.

Closing this issue when paliad.de/healthz returns 200.

Reference: m/paliad#104