feat(t-paliad-190): mig 090 — one-time fuzzy-match backfill

Phase 3 Slice 10 Step I (design §3.I + m's Q10 ruling). Binds legacy
paliad.deadlines.rule_id to deadline_rules.id via priority-ordered
fuzzy matching; ambiguous + no-match rows log to the orphan staging
table (mig 089).

Matching strategies (highest priority first; first unique hit wins):

  1. rule_code_and_tail — title's leading citation token AND its
     post-separator name fragment match a rule. Handles
     "RoP.023 — Klageerwiderung" where the bare code matches 2 rules
     (DE Klageerwiderung + EN Statement of Defence); the tail picks
     the right one.

  2. rule_code only — bare rule_code from the title prefix. Handles
     "RoP.029.a — Replik" where RoP.029.a maps to a single rule
     regardless of suffix (the title's "Replik" doesn't match the
     rule's actual name but the code is unique).

  3. name_exact — full title equals rule.name or rule.name_en
     (LOWER). Catches "Antrag auf Schadensbemessung" (1 unique
     rule); ambiguous for shared names like Klageerwiderung (8
     candidates).

  4. concept_alias — title appears in deadline_concepts.aliases.
     Thin coverage today; Slice 12 orphan-seed will populate it.

Per-deadline aggregation:
  - Strategy with n_candidates = 1 wins. Priority chain rule_code_and_tail
    > rule_code > name_exact > concept_alias.
  - Ambiguous (≥2 across all strategies) → orphan reason='ambiguous'
    with the full candidate_rule_ids list.
  - 0 candidates → orphan reason='no_match'.

Predicted production outcome (verified via supabase MCP pre-write):
  - 3 of 25 deadlines (12%) get a unique match:
      "RoP.023 — Klageerwiderung"   via rule_code_and_tail
      "RoP.029.a — Replik"          via rule_code
      "Antrag auf Schadensbemessung" via name_exact
  - 15 of 25 deadlines (60%) → orphan reason='ambiguous' (common
    titles like Klageerwiderung × 4, Duplik × 4, Replik × 4 across
    multiple proceedings).
  - 7 of 25 deadlines (28%) → orphan reason='no_match' (free-text
    titles like "Call me", "Schutzschrift", "Validierungsfrist EP→DE",
    "Schriftsatz nach R.262 (Klageerwiderung)").

The 60% target the design § hinted at is unachievable on today's
corpus because all 11 projects have proceeding_type_id IS NULL post-
Slice-5 (the fristenrechner-side rebinding hasn't happened on
production data yet) — proceeding-narrowing would cut the
Klageerwiderung / Duplik / Replik ambiguity, but the column isn't
populated. The orphan-review UI in Slice 11 is the real path to
binding the long tail.

Defensive backup: paliad.deadlines_pre_089 snapshot taken before any
UPDATE. Down-migration restores rule_id from the snapshot + drops
unresolved orphan rows (resolved rows survive a rollback — those are
legal-review work that shouldn't disappear on a code revert).

Idempotency: WHERE rule_id IS NULL on the UPDATE; orphan INSERT
skips rows that already have an unresolved orphan entry. Re-running
on the same corpus produces no new rows.

Hard assertion: every NULL-rule_id deadline (with project) is either
resolved post-mig OR has an unresolved orphan row. RAISE EXCEPTION on
any unaccounted row — fails the migration loudly rather than
silently leaving a deadline un-matched + un-orphaned.

Audit-reason wrapper set; the mig 079 deadline_rules audit trigger
doesn't fire here (UPDATEs touch paliad.deadlines, not deadline_rules),
but the wrapper is the standard pattern.
This commit is contained in:
mAi
2026-05-15 01:37:57 +02:00
parent 5431fcd3cd
commit 09615ec48e
2 changed files with 350 additions and 0 deletions

View File

@@ -0,0 +1,30 @@
-- t-paliad-190 down — reverses 090_backfill_deadline_rule_id.up.sql.
--
-- Restores rule_id values from the pre-mig snapshot (every deadline
-- that mig 090 touched had rule_id IS NULL originally, so restoring
-- means setting rule_id back to NULL on every row that survived the
-- backfill). Drops the orphan rows mig 090 wrote (resolved rows stay
-- — those represent legal-review work that shouldn't disappear on
-- a code rollback) and drops the backup table.
--
-- This is a defensive rollback path; the migration itself is one-time
-- + idempotent, so re-running 090 after a down + up is safe.
SELECT set_config(
'paliad.audit_reason',
'rollback 090: NULL rule_id on deadlines mig 090 touched + drop pre-089 backup',
true);
-- Restore rule_id = NULL on every deadline mig 090 may have written.
-- We use the backup table as the authoritative "before" snapshot.
UPDATE paliad.deadlines d
SET rule_id = b.rule_id
FROM paliad.deadlines_pre_089 b
WHERE d.id = b.id;
-- Drop the unresolved orphan rows mig 090 wrote. Resolved rows stay —
-- a legal-review hand-link is real work that survives a code rollback.
DELETE FROM paliad.deadline_rule_backfill_orphans
WHERE resolved_at IS NULL;
DROP TABLE IF EXISTS paliad.deadlines_pre_089;

View File

@@ -0,0 +1,320 @@
-- t-paliad-190 / Fristen Phase 3 Slice 10 — one-time fuzzy-match
-- backfill of paliad.deadlines.rule_id per design §3.I + m's Q10
-- ruling. Restores SmartTimeline's "anchor real deadlines into
-- projection" affordance on legacy data (1 of 26 deadlines currently
-- has rule_id populated; the SmartTimeline anchor flow needs the FK
-- to thread predicted dates off actuals).
--
-- Matching strategies (in priority order; first unique hit wins):
--
-- 1. rule_code-prefix extraction from title. Titles like
-- "RoP.023 — Klageerwiderung" carry the rule citation in the
-- prefix; we extract the leading citation token and JOIN on
-- deadline_rules.rule_code = extracted. When the rule_code
-- resolves to multiple rules (e.g. RoP.023 → 2 rules — DE
-- Klageerwiderung + EN Statement of Defence), the remaining
-- title fragment narrows by name ILIKE.
--
-- 2. exact title match against rule.name OR rule.name_en (LOWER).
-- Mostly hits common Pipeline-A names ("Antrag auf
-- Schadensbemessung" → 1 unique rule); ambiguous for shared
-- names like "Klageerwiderung" (8 rules across proceedings).
--
-- 3. deadline_concepts.aliases match. Each concept carries a
-- text[] of canonical aliases; if LOWER(d.title) is in the
-- aliases array, we pick the rules with that concept_id. Today
-- the alias coverage is thin (no aliases for "Schutzschrift"
-- etc.), but the strategy is shaped so a future seed lights
-- it up.
--
-- For each deadline, we collect all candidates across the three
-- strategies, dedupe by rule.id, and:
-- - exactly 1 candidate → UPDATE rule_id (matched).
-- - 0 candidates → orphan with reason='no_match'.
-- - ≥2 candidates → orphan with reason='ambiguous', candidate_rule_ids
-- populated so a legal-review pass can hand-pick.
--
-- Per-project narrowing by proceeding_type_id is the design's primary
-- discriminator. In the live corpus today all 11 projects have
-- proceeding_type_id IS NULL (Slice 5 retired litigation codes from
-- project-binding; the fristenrechner-side rebinding hasn't happened),
-- so this slice can't use proceeding-narrowing on production data.
-- The CTE still includes the predicate so the migration self-tunes
-- the moment proceeding_type_id starts getting populated.
--
-- Defensive backup: paliad.deadlines is snapshotted to
-- paliad.deadlines_pre_089 before the UPDATE so an operator can
-- restore individual rule_id values if a hand-link goes wrong post
-- mig. The table is dropped in the down-migration; Slice 11 (rule
-- editor) can drop it once orphan resolution finishes in prod.
--
-- Idempotency: WHERE d.rule_id IS NULL on the UPDATE; the orphan
-- INSERT uses ON CONFLICT DO NOTHING via a NOT EXISTS guard (no
-- unique constraint on deadline_id alone — a deadline may legitimately
-- get re-orphaned after a resolution rollback; but re-running 090 on
-- the same corpus must not duplicate orphan rows for unresolved
-- deadlines).
--
-- Hard assertion at end: SUM(matched) + SUM(orphans for current
-- unresolved deadlines) ≥ COUNT(deadlines processed). Strict equality
-- doesn't hold cleanly on a re-run (the orphan table may already
-- carry prior rows from a partial run), so the assertion is "at
-- least one row exists per unresolved deadline".
SELECT set_config(
'paliad.audit_reason',
'mig 090: one-time fuzzy-match backfill of deadlines.rule_id per design §3.I / Q10',
true);
-- =============================================================================
-- 1. Defensive backup before any UPDATE.
-- =============================================================================
CREATE TABLE IF NOT EXISTS paliad.deadlines_pre_089 AS
SELECT id, project_id, title, rule_id, rule_code, status, due_date,
completed_at, created_at, updated_at
FROM paliad.deadlines
WHERE rule_id IS NULL
AND project_id IS NOT NULL;
COMMENT ON TABLE paliad.deadlines_pre_089 IS
'Snapshot of paliad.deadlines (id, rule_id-relevant columns) taken '
'before mig 090 ran the fuzzy-match backfill. Lets an operator '
'restore individual rule_id values if a hand-link goes wrong. '
'Slice 11 (rule editor) drops this once orphan resolution finishes.';
-- =============================================================================
-- 2. Build the candidate set in a temp table so the per-deadline
-- aggregation + UPDATE + orphan INSERT can share the work without
-- re-evaluating the matchers.
-- =============================================================================
CREATE TEMP TABLE _mig_090_candidates ON COMMIT DROP AS
WITH targets AS (
-- Every NULL-rule_id deadline still bound to a project. project_id
-- is required because we want at least the SmartTimeline anchor
-- flow to work; un-bound deadlines (rare) are out of scope.
SELECT d.id AS deadline_id,
d.title AS title,
d.project_id,
p.proceeding_type_id,
-- Extract a leading citation token like "RoP.023" or
-- "R.49" from the title. Captures the rule_code prefix
-- on titles that carry one ("RoP.023 — Klageerwiderung");
-- NULL on plain titles.
NULLIF(regexp_replace(d.title, '^\s*((?:RoP|R|Art|§)\.?\s*[0-9]+(?:\.[a-z0-9]+)*)\s*(?:[—–-].*)?$', '\1'), d.title) AS code_token,
-- Strip the leading citation + separator to surface the
-- title's name fragment. "RoP.023 — Klageerwiderung" →
-- "Klageerwiderung"; "RoP.029.a" (no suffix) → ""; plain
-- "Klageerwiderung" → "Klageerwiderung" unchanged.
NULLIF(trim(regexp_replace(d.title, '^\s*(?:RoP|R|Art|§)\.?\s*[0-9]+(?:\.[a-z0-9]+)*\s*[—–-]?\s*', '')), '') AS title_tail
FROM paliad.deadlines d
LEFT JOIN paliad.projects p ON p.id = d.project_id
WHERE d.rule_id IS NULL
AND d.project_id IS NOT NULL
),
by_code_and_tail AS (
-- Strategy 1a (narrowest): rule_code AND name (DE or EN) matches
-- the title's tail fragment. Handles "RoP.023 — Klageerwiderung"
-- where the bare code matches 2 rules (DE Klageerwiderung +
-- EN Statement of Defence); the tail picks the DE one.
SELECT t.deadline_id, dr.id AS rule_id, 'rule_code_and_tail' AS strategy
FROM targets t
JOIN paliad.deadline_rules dr
ON dr.rule_code = trim(t.code_token)
AND dr.is_active = true
AND (LOWER(dr.name) = LOWER(t.title_tail)
OR LOWER(dr.name_en) = LOWER(t.title_tail))
WHERE t.code_token IS NOT NULL
AND t.title_tail IS NOT NULL
),
by_code AS (
-- Strategy 1b: rule_code prefix only. Handles bare-code titles
-- ("RoP.029.a" maps to 1 unique rule regardless of suffix) and
-- the fallback when by_code_and_tail returns 0 (suffix doesn't
-- match — e.g. "RoP.029.a — Replik" where the suffix "Replik"
-- doesn't appear in any RoP.029.a rule's name; pick the
-- code-only match anyway).
SELECT t.deadline_id, dr.id AS rule_id, 'rule_code' AS strategy
FROM targets t
JOIN paliad.deadline_rules dr
ON dr.rule_code = trim(t.code_token)
AND dr.is_active = true
WHERE t.code_token IS NOT NULL
),
by_name AS (
-- Strategy 2: exact title match against rule.name or rule.name_en.
-- The widest matcher; for shared names like "Klageerwiderung"
-- (8 rules across proceedings) this is ambiguous, but for
-- unique titles like "Antrag auf Schadensbemessung" (1 rule) it
-- nails the match.
SELECT t.deadline_id, dr.id AS rule_id, 'name_exact' AS strategy
FROM targets t
JOIN paliad.deadline_rules dr
ON (LOWER(dr.name) = LOWER(t.title)
OR LOWER(dr.name_en) = LOWER(t.title))
AND dr.is_active = true
),
by_alias AS (
-- Strategy 3: concept aliases. deadline_concepts.aliases is a
-- text[] of canonical synonyms; if the deadline title appears
-- in that array, every active rule on the concept is a candidate.
-- Today's alias coverage is thin (the seed for Slice 12 is the
-- expected source of new aliases), but the strategy is in place
-- so future seeds light it up without a migration.
SELECT t.deadline_id, dr.id AS rule_id, 'concept_alias' AS strategy
FROM targets t
JOIN paliad.deadline_concepts dc
ON LOWER(t.title) = ANY(SELECT LOWER(a) FROM unnest(dc.aliases) a)
JOIN paliad.deadline_rules dr
ON dr.concept_id = dc.id
AND dr.is_active = true
)
SELECT deadline_id, rule_id, strategy
FROM by_code_and_tail
UNION
SELECT deadline_id, rule_id, strategy
FROM by_code
UNION
SELECT deadline_id, rule_id, strategy
FROM by_name
UNION
SELECT deadline_id, rule_id, strategy
FROM by_alias;
-- =============================================================================
-- 3. Aggregate per-deadline candidate counts by strategy + pick the
-- narrowest-unique-match per deadline. Strategy priority (narrowest
-- first): rule_code_and_tail > rule_code > name_exact > concept_alias.
-- A deadline's "chosen" rule comes from the highest-priority strategy
-- that yields exactly 1 distinct candidate.
-- =============================================================================
CREATE TEMP TABLE _mig_090_strategy_counts ON COMMIT DROP AS
SELECT deadline_id,
strategy,
count(DISTINCT rule_id) AS n,
MIN(rule_id::text) AS first_rule_text
FROM _mig_090_candidates
GROUP BY deadline_id, strategy;
CREATE TEMP TABLE _mig_090_chosen ON COMMIT DROP AS
SELECT DISTINCT ON (deadline_id)
deadline_id,
first_rule_text::uuid AS rule_id,
strategy AS chosen_strategy
FROM _mig_090_strategy_counts
WHERE n = 1
ORDER BY deadline_id,
CASE strategy
WHEN 'rule_code_and_tail' THEN 1
WHEN 'rule_code' THEN 2
WHEN 'name_exact' THEN 3
WHEN 'concept_alias' THEN 4
ELSE 5
END;
-- "Aggregated" carries the widest candidate set for orphan logging
-- (an editor reviewing an orphan wants to see EVERY plausible rule,
-- not just the narrowest-strategy result).
CREATE TEMP TABLE _mig_090_aggregated ON COMMIT DROP AS
SELECT c.deadline_id,
count(DISTINCT c.rule_id) AS n_candidates,
array_agg(DISTINCT c.rule_id) AS all_rule_ids
FROM _mig_090_candidates c
GROUP BY c.deadline_id;
-- =============================================================================
-- 4. UPDATE deadlines.rule_id for the chosen set (narrowest-unique match).
-- =============================================================================
UPDATE paliad.deadlines d
SET rule_id = c.rule_id
FROM _mig_090_chosen c
WHERE d.id = c.deadline_id
AND d.rule_id IS NULL;
-- =============================================================================
-- 5. Log every deadline that didn't get a unique match as an orphan.
-- Skip rows that already have a non-resolved orphan entry (re-run
-- guard) — the existing entry is the source-of-truth until the
-- admin UI flips resolved_at.
-- =============================================================================
INSERT INTO paliad.deadline_rule_backfill_orphans
(deadline_id, title, project_id, proceeding_code, reason,
candidate_count, candidate_rule_ids)
SELECT t.deadline_id,
t.title,
t.project_id,
pt.code AS proceeding_code,
CASE
WHEN a.n_candidates IS NULL OR a.n_candidates = 0 THEN 'no_match'
WHEN a.n_candidates > 1 THEN 'ambiguous'
END AS reason,
COALESCE(a.n_candidates, 0),
COALESCE(a.all_rule_ids, ARRAY[]::uuid[])
FROM (
SELECT d.id AS deadline_id, d.title, d.project_id, p.proceeding_type_id
FROM paliad.deadlines d
LEFT JOIN paliad.projects p ON p.id = d.project_id
WHERE d.rule_id IS NULL
AND d.project_id IS NOT NULL
) t
LEFT JOIN _mig_090_aggregated a ON a.deadline_id = t.deadline_id
LEFT JOIN paliad.proceeding_types pt ON pt.id = t.proceeding_type_id
WHERE NOT EXISTS (
SELECT 1
FROM paliad.deadline_rule_backfill_orphans o
WHERE o.deadline_id = t.deadline_id
AND o.resolved_at IS NULL
);
-- =============================================================================
-- 6. Hard assertion: every NULL-rule_id deadline (with project) is
-- either resolved (rule_id IS NOT NULL post-mig) or carries an
-- unresolved orphan row.
-- =============================================================================
DO $$
DECLARE
n_processed int;
n_matched int;
n_orphaned int;
n_unaccounted int;
BEGIN
SELECT count(*) INTO n_processed
FROM paliad.deadlines
WHERE project_id IS NOT NULL
AND (rule_id IS NOT NULL OR EXISTS (
SELECT 1 FROM paliad.deadline_rule_backfill_orphans o
WHERE o.deadline_id = paliad.deadlines.id
));
SELECT count(*) INTO n_matched
FROM paliad.deadlines d
JOIN paliad.deadlines_pre_089 b ON b.id = d.id
WHERE d.rule_id IS NOT NULL;
SELECT count(DISTINCT deadline_id) INTO n_orphaned
FROM paliad.deadline_rule_backfill_orphans
WHERE resolved_at IS NULL;
SELECT count(*) INTO n_unaccounted
FROM paliad.deadlines d
WHERE d.rule_id IS NULL
AND d.project_id IS NOT NULL
AND NOT EXISTS (
SELECT 1 FROM paliad.deadline_rule_backfill_orphans o
WHERE o.deadline_id = d.id
);
RAISE NOTICE 'mig 090: processed=% matched=% orphaned=% unaccounted=%',
n_processed, n_matched, n_orphaned, n_unaccounted;
IF n_unaccounted > 0 THEN
RAISE EXCEPTION 'mig 090: % deadlines have rule_id IS NULL and no orphan row — '
'matcher missed them. Investigate the candidate query.',
n_unaccounted;
END IF;
END $$;