Merge mai/hermes/issue-4-paperless-ai: intra-scan dedup + short-brand prefix match (#4)

mAi: #4 - paperless-AI prompt: intra-scan dedup + short-brand prefix match
Two prompt-only rules added to address follow-ups from #3:

1. Intra-scan dedup (new rule 4 in Correspondents section): when processing multiple docs from the same sender in one scan batch, reuse the correspondent name created earlier in the same session instead of letting each doc create a fresh alias. Triggered by paperless-AI creating 3 Praxis-Irle aliases in one batch (no native batch-context plumbing; best-effort via prompt).
2. Short-brand prefix match (extension of Fuzzy-Regel): if OCR name is a strict prefix of an existing correspondent (or vice-versa) and the first 2 brand tokens match, use the existing correspondent. Triggered by 'Hogan Lovells' creating a new correspondent despite 'Hogan Lovells International LLP' already existing.

Deployed via push_system_prompt.py --apply, container restarted, both strings verified present in /app/data/.env (backup at .env.bak.20260521T092606). Effectiveness will be observed as multi-doc scans flow through.
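The prefix rule is enforced only through the prompt; nothing in this commit implements it in code. For reasoning about which names it should merge, a deterministic sketch may help. It assumes token-wise comparison on lower-cased words, ignores the address-stripping part of the rule, and uses hypothetical helper names (`brand_tokens`, `prefix_match`) that do not exist in the repo.

```python
from __future__ import annotations

# Word lists copied from the prompt's Fuzzy-Regel; helper names are hypothetical.
LEGAL_SUFFIXES = {"gmbh", "ag", "eg", "e.v.", "llp", "kg", "mbh", "vvag"}
SALUTATIONS = {"herr", "frau", "herrn"}
IGNORED = LEGAL_SUFFIXES | SALUTATIONS


def brand_tokens(name: str) -> list[str]:
    """Lower-cased words, minus legal-form suffixes and salutations."""
    words = name.lower().replace(",", " ").split()
    return [w for w in words if w not in IGNORED]


def prefix_match(ocr_name: str, existing: list[str]) -> str | None:
    """Return the existing correspondent the prefix rule would select, if any."""
    ocr = brand_tokens(ocr_name)
    if len(ocr) < 2:
        return None  # the rule needs two brand tokens to compare
    for candidate in existing:
        cand = brand_tokens(candidate)
        if len(cand) < 2 or ocr[:2] != cand[:2]:
            continue
        # One token list must be a prefix of the other, in either direction.
        shorter, longer = sorted((ocr, cand), key=len)
        if longer[: len(shorter)] == shorter:
            return candidate
    return None


existing = ["eprimo", "Hogan Lovells International LLP"]
print(prefix_match("Hogan Lovells", existing))   # Hogan Lovells International LLP
print(prefix_match("Stadtwerke XYZ", existing))  # None
```

Single-token brands such as "eprimo" fall outside this sketch and stay with the general Fuzzy-Regel, which matches the "first 2 brand tokens" wording above.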
2026-05-21 11:30:35 +02:00 · 2026-05-21 11:26:40 +02:00 · 2026-05-16 18:38:00 +02:00 · 2026-05-16 18:27:19 +02:00 · 2026-05-16 18:03:41 +02:00 · 2026-05-16 17:57:26 +02:00
9 changed files with 680 additions and 2 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+.m/
--- a/infra/mdms-mover/README.md
+++ b/infra/mdms-mover/README.md
@@ -0,0 +1,180 @@
+# mdms-mover — age-gated inbox → toprocess promoter + blank-page stripper
+
+Two jobs in one user-systemd timer:
+
+1. **Stability gate** (otto#438): solves the chunk-write race between the
+   Canon MB5100 (SMB scans land in `/mnt/mdms/inbox/` in pieces) and
+   Paperless (polls `/mnt/mdms/toprocess/` every 60s and consumes
+   anything it sees). A file is only promoted when **both**:
+   - `mtime` is more than 3 minutes old, and
+   - file size is unchanged since the previous run.
+2. **Blank-page strip** (mDMS#2): duplex scans through patch-T separators
+   leave a blank backside (the unprinted reverse of the separator sheet)
+   at the front of every subsequent document. PDF files are passed
+   through `strip_blank_pages.py` before promotion. Pages with no
+   embedded text AND ≥97% near-white pixels are dropped.
+
+## Layout on mDock
+
+```
+/home/m/mdms-mover/mover.sh                  # script, deployed copy
+/home/m/mdms-mover/strip_blank_pages.py      # blank-page detector
+/home/m/.config/systemd/user/mdms-mover.service  # oneshot service
+/home/m/.config/systemd/user/mdms-mover.timer    # OnUnitActiveSec=1min
+/home/m/.local/state/mdms-mover/state.tsv    # last-seen size per file
+/home/m/.local/bin/uv                        # uv runner for the strip script
+```
+
+Runs as user `m` under user-systemd. mDock has `Linger=yes` for user
+`m`, so the timer keeps firing across reboots and logout sessions.
+
+## Why systemd, not cron
+
+The original spec (otto#438) called for `/etc/cron.d/mdms-mover`. mDock
+runs Ubuntu 24.04 server which ships with systemd-timers and no `cron`
+package. Installing cron only to honour the spec wording would add a
+package we don't otherwise need; a user-systemd timer is the canonical
+Ubuntu 24.04 approach and gives better observability
+(`systemctl --user status mdms-mover.timer`, `journalctl --user -u mdms-mover`).
+
+User-mode (not system-mode) keeps the entire install in `m`'s home — no
+sudo at deploy or maintenance time, no `/var/lib/...` directories to
+chown, the service can read/write the NFS mount because `m` owns it.
+
+## Configuration
+
+```
+| var                    | default                                       | meaning                                            |
+|------------------------|-----------------------------------------------|----------------------------------------------------|
+| MDMS_INBOX             | /mnt/mdms/inbox                               | source — scanner SMB target                        |
+| MDMS_TOPROCESS         | /mnt/mdms/toprocess                           | destination — Paperless consume                    |
+| MDMS_STATE             | $HOME/.local/state/mdms-mover/state.tsv       | per-file size memory                               |
+| MDMS_MIN_AGE_MIN       | 3                                             | minimum mtime age in minutes                       |
+| MDMS_STRIP_BLANK       | true                                          | run blank-page strip on PDFs (set to "false" to disable) |
+| MDMS_STRIP_SCRIPT      | <mover dir>/strip_blank_pages.py              | path override for the strip script                 |
+| MDMS_BLANK_THRESHOLD   | 0.97                                          | near-white pixel ratio to call a page blank (read by strip script) |
+| MDMS_BLANK_NEAR_WHITE  | 240                                           | grayscale cutoff (0-255) for "near white" pixels (read by strip script) |
+| MDMS_BLANK_DPI         | 50                                            | thumbnail render DPI (read by strip script)        |
+```
+
+To override at runtime, drop into
+`~/.config/systemd/user/mdms-mover.service.d/override.conf`:
+
+```ini
+[Service]
+Environment=MDMS_MIN_AGE_MIN=5
+Environment=MDMS_BLANK_THRESHOLD=0.99
+```
+
+then `systemctl --user daemon-reload && systemctl --user restart mdms-mover.timer`.
+
+## Blank-page detection — what gets dropped
+
+A page is dropped iff BOTH:
+
+1. embedded text is empty / whitespace-only (image-only scans always
+   pass this — they have no embedded text), AND
+2. the rendered thumbnail is ≥ `MDMS_BLANK_THRESHOLD` near-white pixels
+   (0.97 by default → ≥97% of pixels at grayscale 240 or brighter).
+
+The threshold is conservative on purpose: a false-negative (keeping a
+blank page we should have dropped) is recoverable via Paperless's UI; a
+false-positive (dropping a real page) silently loses data. If real
+pages get dropped in practice, **raise** `MDMS_BLANK_THRESHOLD` toward
+0.99 — that makes the strip step pickier and keeps more pages.
+
+Edge cases handled inside `strip_blank_pages.py`:
+
+- **1-page input:** strip is skipped entirely (single-page docs never
+  have separator-backside artefacts).
+- **All pages would drop:** the script exits with code `2` and writes no
+  output. The mover keeps the file in the inbox and logs
+  `WARNING: <name> appears all-blank, kept in inbox`. m can inspect via
+  `journalctl --user -u mdms-mover`.
+- **strip_blank_pages.py errors out:** mover falls back to a plain `mv`
+  (unstripped) so a transient problem in the detector never blocks a
+  scan from reaching Paperless.
+
+The script is a uv-inline-deps single file (PyMuPDF for rendering and
+text extraction, Pillow for the histogram; no `poppler-utils` apt
+install on mdock). Mirrors `infra/paperless/generate_separator.py`.
+
+## Deploy / sync
+
+The live files on mDock must match this directory byte-for-byte (md5,
+same convention as `infra/samba-canon/`).
+
+```bash
+ssh mdock 'mkdir -p ~/mdms-mover ~/.config/systemd/user ~/.local/state/mdms-mover ~/.local/bin'
+
+# uv binary (single static binary, user-space — no apt, no sudo)
+rsync -av ~/.local/bin/uv mdock:/home/m/.local/bin/uv
+
+# mover + strip script
+scp infra/mdms-mover/mover.sh             mdock:/home/m/mdms-mover/mover.sh
+scp infra/mdms-mover/strip_blank_pages.py mdock:/home/m/mdms-mover/strip_blank_pages.py
+scp infra/mdms-mover/mdms-mover.service   mdock:/home/m/.config/systemd/user/
+scp infra/mdms-mover/mdms-mover.timer     mdock:/home/m/.config/systemd/user/
+
+ssh mdock 'chmod +x ~/mdms-mover/mover.sh ~/mdms-mover/strip_blank_pages.py && \
+           systemctl --user daemon-reload && \
+           systemctl --user enable --now mdms-mover.timer'
+```
+
+The first time the strip script runs, `uv` downloads python + PyMuPDF
+into `~/.cache/uv/` (~30 MB). Subsequent runs reuse the cache.
+
+## Verify
+
+```bash
+ssh mdock 'systemctl --user list-timers mdms-mover.timer'
+ssh mdock 'journalctl --user -u mdms-mover -n 20 --no-pager'
+ssh mdock 'cat ~/.local/state/mdms-mover/state.tsv'
+ssh mdock 'journalctl -t mdms-mover -n 20 --no-pager'
+```
+
+## Emergency disable
+
+Stop the timer entirely:
+
+```bash
+ssh mdock 'systemctl --user stop mdms-mover.timer && \
+           systemctl --user disable mdms-mover.timer'
+```
+
+Or just disable the strip step while keeping the stability gate:
+
+```bash
+mkdir -p ~/.config/systemd/user/mdms-mover.service.d
+cat > ~/.config/systemd/user/mdms-mover.service.d/override.conf <<EOF
+[Service]
+Environment=MDMS_STRIP_BLANK=false
+EOF
+systemctl --user daemon-reload
+```
+
+Re-enable the timer with `systemctl --user enable --now mdms-mover.timer`.
+
+If you need to drain the inbox manually while disabled, files older
+than a few minutes are safe to `mv` into `toprocess/` by hand —
+Paperless will pick them up on its next poll.
+
+## Logs
+
+Service logs land in the user journal under unit `mdms-mover`, and
+moved-file events also go through `logger -t mdms-mover` so they appear
+under that tag in the system journal too:
+
+```bash
+ssh mdock 'journalctl --user -u mdms-mover -f'   # service execution
+ssh mdock 'journalctl -t mdms-mover -f'          # moved-file events
+```
+
+## Refs
+
+- mDMS#2 — blank-page strip (this README)
+- otto#438 — original scheduler / staging-folder design
+- otto#429 — original Paperless pipeline setup
+- otto#431 — samba-canon bridge container (upstream of this mover)
+- `docs/strategy.md` — overall mDMS dataset layout
+- `infra/paperless/generate_separator.py` — sibling uv-inline-deps script
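The two-condition stability gate described in this README can be modelled in a few lines. This is a simulation sketch, not the deployed logic (that is `mover.sh`). `run_once` is a hypothetical name, and file ages are passed in as numbers instead of being read from `mtime`.

```python
def run_once(files, state, min_age_min=3):
    """Simulate one timer tick.

    files: {name: (size_bytes, age_minutes)} as seen in the inbox.
    state: {name: size_bytes} recorded by the previous tick.
    Returns (promoted_names, new_state).
    """
    promoted, new_state = [], {}
    for name, (size, age) in files.items():
        if name.startswith(".") or age <= min_age_min:
            continue  # dotfiles and young files are neither recorded nor moved
        new_state[name] = size
        if state.get(name) == size:
            promoted.append(name)
    return promoted, new_state


state = {}
moved, state = run_once({"scan.pdf": (1000, 3.5)}, state)  # first sighting: recorded only
print(moved)  # []
moved, state = run_once({"scan.pdf": (1000, 4.5)}, state)  # same size again: promoted
print(moved)  # ['scan.pdf']
```

One consequence worth noting: only files past the age threshold are written to the state file, so a scan is promoted at the earliest on the second tick after it crosses `MDMS_MIN_AGE_MIN`, roughly one timer interval later than the age threshold alone suggests.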
--- a/infra/mdms-mover/mdms-mover.service
+++ b/infra/mdms-mover/mdms-mover.service
@@ -0,0 +1,11 @@
+[Unit]
+Description=mDMS mover — promote stable scanner files inbox → toprocess
+After=network-online.target remote-fs.target
+Wants=network-online.target remote-fs.target
+
+[Service]
+Type=oneshot
+ExecStart=%h/mdms-mover/mover.sh
+
+[Install]
+WantedBy=default.target
--- a/infra/mdms-mover/mdms-mover.timer
+++ b/infra/mdms-mover/mdms-mover.timer
@@ -0,0 +1,12 @@
+[Unit]
+Description=Run mDMS mover every minute
+Requires=mdms-mover.service
+
+[Timer]
+OnBootSec=2min
+OnUnitActiveSec=1min
+AccuracySec=10s
+Unit=mdms-mover.service
+
+[Install]
+WantedBy=timers.target default.target
--- a/infra/mdms-mover/mover.sh
+++ b/infra/mdms-mover/mover.sh
@@ -0,0 +1,93 @@
+#!/bin/bash
+# mdms-mover: move stable files from /mnt/mdms/inbox → /mnt/mdms/toprocess.
+#
+# A file is "stable" when it satisfies BOTH conditions:
+#   1. mtime older than MDMS_MIN_AGE_MIN minutes (default 3).
+#   2. size unchanged since the previous run (recorded in STATE).
+#
+# This protects Paperless from ingesting half-written scans dropped by the
+# Canon MB5100 via SMB. See otto#438, mDMS#2.
+#
+# When MDMS_STRIP_BLANK=true (default) and the file is a PDF, blank pages
+# are stripped before promotion (mDMS#2). Empty backsides of patch-T
+# separators from duplex scans land here. See strip_blank_pages.py for the
+# detection heuristic.
+
+set -euo pipefail
+
+INBOX="${MDMS_INBOX:-/mnt/mdms/inbox}"
+TOPROCESS="${MDMS_TOPROCESS:-/mnt/mdms/toprocess}"
+STATE="${MDMS_STATE:-$HOME/.local/state/mdms-mover/state.tsv}"
+MIN_AGE_MIN="${MDMS_MIN_AGE_MIN:-3}"
+STRIP_BLANK="${MDMS_STRIP_BLANK:-true}"
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+STRIP_SCRIPT="${MDMS_STRIP_SCRIPT:-$SCRIPT_DIR/strip_blank_pages.py}"
+
+mkdir -p "$TOPROCESS" "$(dirname "$STATE")"
+touch "$STATE"
+
+NEW_STATE=$(mktemp)
+trap 'rm -f "$NEW_STATE"' EXIT
+
+# Promote a single stable file from inbox into toprocess, blank-stripping
+# PDFs when enabled. Returns silently; logs go through logger(1).
+promote() {
+  local src="$1" name="$2" size="$3"
+  local ext="${name##*.}"
+  local dest="$TOPROCESS/$name"
+
+  if [[ "$STRIP_BLANK" != "true" || "${ext,,}" != "pdf" || ! -x "$STRIP_SCRIPT" ]]; then
+    if mv -n "$src" "$dest" 2>/dev/null; then
+      logger -t mdms-mover "moved $name ($size bytes)"
+    fi
+    return
+  fi
+
+  # Stage stripped output inside toprocess (same filesystem → atomic rename).
+  # Dotfile prefix so Paperless's consumer ignores the partial during write.
+  local tmpout="$TOPROCESS/.mdms-tmp.$$.$name"
+  local rc=0
+  "$STRIP_SCRIPT" "$src" "$tmpout" || rc=$?
+
+  case "$rc" in
+    0)
+      mv -f "$tmpout" "$dest" && rm -f "$src"
+      logger -t mdms-mover "moved $name ($size bytes, strip ok)"
+      ;;
+    2)
+      rm -f "$tmpout"
+      logger -t mdms-mover "WARNING: $name appears all-blank, kept in inbox"
+      ;;
+    *)
+      rm -f "$tmpout"
+      logger -t mdms-mover "strip failed for $name (rc=$rc), passing through unchanged"
+      if mv -n "$src" "$dest" 2>/dev/null; then
+        logger -t mdms-mover "moved $name ($size bytes, unstripped)"
+      fi
+      ;;
+  esac
+}
+
+# Iterate top-level regular files older than MIN_AGE_MIN.
+# Skip dotfiles (probe files, scanner temp markers like ._foo, our .mdms-tmp.*).
+while IFS= read -r f; do
+  name=$(basename "$f")
+  case "$name" in
+    .*) continue ;;
+  esac
+
+  if ! size=$(stat -c %s "$f" 2>/dev/null); then
+    continue
+  fi
+
+  prev=$(awk -F'\t' -v n="$name" '$1==n {print $2; exit}' "$STATE")
+  printf '%s\t%s\n' "$name" "$size" >> "$NEW_STATE"
+
+  if [[ -n "$prev" && "$size" == "$prev" ]]; then
+    promote "$f" "$name" "$size"
+  fi
+done < <(find "$INBOX" -maxdepth 1 -type f -mmin "+$MIN_AGE_MIN")
+
+mv "$NEW_STATE" "$STATE"
+trap - EXIT
--- a/infra/mdms-mover/strip_blank_pages.py
+++ b/infra/mdms-mover/strip_blank_pages.py
@@ -0,0 +1,122 @@
+#!/usr/bin/env -S uv run --script
+# /// script
+# requires-python = ">=3.11"
+# dependencies = [
+#   "pymupdf>=1.24",
+#   "Pillow>=10.0",
+# ]
+# ///
+"""Strip blank pages from a PDF — used by mdms-mover before promoting to toprocess.
+
+Usage:
+    strip_blank_pages.py <input.pdf> <output.pdf>
+
+Exit codes:
+    0   output.pdf written (either stripped or copied unchanged)
+    2   all pages would be dropped — output NOT written, caller should keep
+        the original file in the inbox and log a warning
+    1   error (input unreadable, write failed, etc.)
+
+A page counts as "blank" iff BOTH of:
+  * embedded text is empty / whitespace-only, AND
+  * rendered thumbnail is >= MDMS_BLANK_THRESHOLD near-white pixels.
+
+False-negatives are preferred over false-positives — borderline pages stay.
+
+Env:
+  MDMS_BLANK_THRESHOLD   near-white pixel ratio (0.0-1.0, default 0.97)
+  MDMS_BLANK_NEAR_WHITE  near-white cutoff in 0-255 grayscale (default 240)
+  MDMS_BLANK_DPI         thumbnail render DPI (default 50)
+
+PyMuPDF replaces pdf2image+pikepdf+pypdf: rendering and text extraction
+come from one wheel (Pillow only builds the histogram), so there is no
+poppler-utils apt-install on mdock and no second extraction library.
+"""
+from __future__ import annotations
+
+import io
+import os
+import shutil
+import sys
+from pathlib import Path
+
+import fitz  # PyMuPDF
+from PIL import Image
+
+
+def near_white_ratio(image: Image.Image, near_white: int) -> float:
+    gray = image.convert("L") if image.mode != "L" else image
+    hist = gray.histogram()
+    total = sum(hist)
+    if total == 0:
+        return 1.0
+    return sum(hist[near_white:]) / total
+
+
+def page_is_blank(page: "fitz.Page", threshold: float, near_white: int, dpi: int) -> bool:
+    text = (page.get_text("text") or "").strip()
+    if text:
+        return False
+    pix = page.get_pixmap(dpi=dpi, colorspace=fitz.csGRAY)
+    image = Image.frombytes("L", (pix.width, pix.height), pix.samples)
+    return near_white_ratio(image, near_white) >= threshold
+
+
+def main() -> int:
+    if len(sys.argv) != 3:
+        print(f"usage: {sys.argv[0]} <input.pdf> <output.pdf>", file=sys.stderr)
+        return 1
+
+    src = Path(sys.argv[1])
+    dst = Path(sys.argv[2])
+
+    threshold = float(os.environ.get("MDMS_BLANK_THRESHOLD", "0.97"))
+    near_white = int(os.environ.get("MDMS_BLANK_NEAR_WHITE", "240"))
+    dpi = int(os.environ.get("MDMS_BLANK_DPI", "50"))
+
+    try:
+        doc = fitz.open(src)
+    except Exception as exc:
+        print(f"failed to open {src}: {exc}", file=sys.stderr)
+        return 1
+
+    try:
+        page_count = doc.page_count
+
+        if page_count <= 1:
+            shutil.copyfile(src, dst)
+            return 0
+
+        keep: list[int] = []
+        for i in range(page_count):
+            if not page_is_blank(doc[i], threshold, near_white, dpi):
+                keep.append(i)
+
+        if not keep:
+            print(f"all pages blank in {src.name}", file=sys.stderr)
+            return 2
+
+        if len(keep) == page_count:
+            shutil.copyfile(src, dst)
+            return 0
+
+        out = fitz.open()
+        try:
+            for i in keep:
+                out.insert_pdf(doc, from_page=i, to_page=i)
+            out.save(dst)
+        finally:
+            out.close()
+
+        dropped = page_count - len(keep)
+        print(
+            f"{src.name}: dropped {dropped}/{page_count} blank page(s)",
+            file=sys.stderr,
+        )
+        return 0
+    finally:
+        doc.close()
+
+
+if __name__ == "__main__":
+    sys.exit(main())
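To see where the blank/non-blank boundary sits, the histogram arithmetic can be reproduced without Pillow. The sketch below mirrors `near_white_ratio` on a hand-built 256-bin histogram; the pixel counts are invented for illustration.

```python
def near_white_ratio(hist, near_white=240):
    """Share of pixels whose grayscale value is >= near_white."""
    total = sum(hist)
    return 1.0 if total == 0 else sum(hist[near_white:]) / total


# Invented 10,000-pixel thumbnail: mostly paper white, a few dark specks.
hist = [0] * 256
hist[255] = 9600  # paper white
hist[240] = 100   # light gray; index 240 is inside hist[240:], so it counts
hist[30] = 300    # dark specks

ratio = near_white_ratio(hist)      # (9600 + 100) / 10000
is_blank_default = ratio >= 0.97    # default MDMS_BLANK_THRESHOLD
is_blank_strict = ratio >= 0.99     # the "raise toward 0.99" setting
print(ratio, is_blank_default, is_blank_strict)  # 0.97 True False
```

Both comparisons are inclusive: a pixel at exactly 240 is near-white, and a page at exactly the threshold ratio is treated as blank, provided it also has no embedded text.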
--- a/infra/paperless/README.md
+++ b/infra/paperless/README.md
@@ -21,4 +21,28 @@ in the repo. Hashes:
 | setup.js.patched | ~/paperless/build/setup.js.patched | `04cb5fbfaed13a5f25612af0b79dd90c` |
 | server.js.patched | ~/paperless/build/server.js.patched | `eadcbb86048127f2c80632ae77bbc2a0` |

-See `docs/research/issue-429-paperless-pipeline.md` for the why.
+See `docs/research/issue-429-paperless-pipeline.md` in `m/otto` for the
+original pipeline rebuild (issue otto#429).
+
+## SYSTEM_PROMPT deploy mechanism
+
+`SYSTEM_PROMPT.txt` is the source of truth. It is a template — the
+`{{CORRESPONDENTS_LIST}}` placeholder is rendered at deploy time by
+fetching the live correspondents from Paperless. The live prompt is
+inside `paperless-ai`'s `/app/data/.env` (volume `paperless_aidata`) as
+the backtick-delimited `SYSTEM_PROMPT=\`…\`` block.
+
+Deploy with `push_system_prompt.py`:
+
+```bash
+python3 push_system_prompt.py            # dry run — diff only
+python3 push_system_prompt.py --apply    # write + restart paperless-ai
+```
+
+The script filters recipient-only names (Matthias / Mathias Siebels)
+out of the rendered list — see `RECIPIENT_EXCLUDE` in the script and
+the matching rule at the top of the Correspondents section in
+`SYSTEM_PROMPT.txt`. If you edit either, edit both.
+
+The previous live `.env` is kept on mDock as `.env.bak.<ts>` next to the
+new one for rollback.
--- a/infra/paperless/SYSTEM_PROMPT.txt
+++ b/infra/paperless/SYSTEM_PROMPT.txt
@@ -21,4 +21,52 @@ Bei medizinischen Dokumenten Tag Gesundheit setzen.
 Bei steuerrelevanten Dokumenten Tag Steuer setzen.
 Bei Dokumenten mit Frist Tag Frist setzen.

-Correspondents: Verwende den vollen offiziellen Namen der Organisation oder Person (z.B. "DAK-Gesundheit" nicht "DAK-Gesundheit Postzentrum, 22778 Hamburg"). Keine Adressen im Namen. Pruefe ob der Correspondent schon existiert bevor du einen neuen anlegst.
+Erfinde NIEMALS neue Tags. Erfinde NIEMALS neue Document Types. Bei Unsicherheit: Document Type = Information, keine zusätzlichen Tags.
+
+Correspondents — WICHTIG, in dieser Reihenfolge:
+
+1. EMPFAENGER NIEMALS als Correspondent: Matthias Siebels (alle Schreibweisen — Mathias, Mathhias, Siebels, MS, "Herr Siebels", "Herrn Matthias Siebels", "Empfaengeradresse Windscheidstr. 33") ist der EMPFAENGER nahezu jedes Dokuments in diesem DMS. NIEMALS als Correspondent setzen, auch wenn der Name in der Absenderzeile zu lesen ist (z.B. wenn der OCR die Empfaengeradresse als Absender mis-interpretiert). Gleiches gilt sinngemaess fuer Paul Siebels — Paul ist meistens Empfaenger (Bescheide, Rechnungen, Steuerbescheide an Paul). Verwende Paul Siebels nur dann als Correspondent, wenn Paul nachweislich Autor des Dokuments ist (z.B. eigener Brief, Schadensmeldung von Paul).
+
+2. Der Correspondent ist die Organisation oder Person, die das Dokument GESENDET / GESCHRIEBEN hat. In den seltenen Faellen, in denen m (Matthias) selbst Autor ist (z.B. eigene Briefe an Behoerden, eigene Umsatzsteuer-Voranmeldung), waehle Document Type = Personal Correspondence und Correspondent = die EMPFAENGENDE Organisation (z.B. "Finanzamt Düsseldorf-Mitte").
+
+3. Bevorzuge existierende Correspondents bei klarer semantischer Aehnlichkeit (Fuzzy-Regel unten). Wenn der OCR-Absender genuinely neu ist (z.B. ein neuer Versorger, Vermieter, Arzt, Dienstleister, Anwalt, Mandant, Versicherer), lege einen neuen Correspondent an, statt zwanghaft auf den naechsten existierenden Namen zu mappen.
+
+4. INTRA-SCAN DEDUP: Bevor du einen neuen Correspondent anlegst, pruefe ob du in dieser Sitzung (gleicher Scan-Batch, gleicher Verarbeitungslauf) bereits einen Correspondent mit aehnlichem Namen angelegt hast — verwende dann den existierenden (denselben Namen unveraendert), statt eine weitere Variante anzulegen. Konkret: kommen in einem Scan mehrere Dokumente vom gleichen Sender vor (z.B. zwei Rechnungen derselben Arztpraxis, mehrere Schreiben desselben Versorgers), MUSS der Correspondent-Name bei jedem dieser Dokumente identisch sein. Im Zweifel waehle die laengste / vollstaendigste Form, die du in diesem Scan gesehen hast.
+
+Fuzzy-Regel: Wenn der OCR-Absendername bis auf Kleinschreibung, Akzente, Tippfehler, Anrede ("Herr"/"Frau"/"Herrn"), Adresszusatz, Personenname als Ansprechpartner oder Rechtsform-Suffix (GmbH/AG/eG/e.V./LLP/KG/mbH/AG/VVaG) einem existierenden Correspondent entspricht, verwende den existierenden Namen UNVERAENDERT. Bei substantiell anderen Namen (anderer Stamm, andere Branche, andere Firmierung) lege einen neuen an.
+
+Beim Vergleich gilt: Ist der OCR-Name ein striktes Praefix eines existierenden Correspondents (oder umgekehrt), und stimmen die ersten 2 Brand-Tokens ueberein (Token = Wort, das nicht Rechtsform-Suffix, Adresse oder Anrede ist), verwende den existierenden Correspondent. Das gilt sowohl fuer Kurzformen ohne Rechtsform-Suffix ("Hogan Lovells" -> "Hogan Lovells International LLP") als auch fuer den umgekehrten Fall, wenn die existierende Form kuerzer ist als die OCR-Form.
+
+Beispiele:
+- "Hogan Lovells lnternational LLP" (OCR-Variante) -> "Hogan Lovells International LLP" (existiert)
+- "Hogan Lovells" (Kurzform ohne Rechtsform) -> "Hogan Lovells International LLP" (existiert; OCR-Name ist Praefix, erste 2 Brand-Tokens stimmen)
+- "eprimo CmbH" -> "eprimo" (existiert)
+- "Helios Klinikum Duisburg GmbH" -> "Helios Klinikum Duisburg" (existiert)
+- "Kundenservice von eprimo" -> "eprimo" (existiert)
+- "Ammerländer Versicherung VVaG" -> "Ammerländer Versicherung" (existiert; Rechtsform weglassen)
+- "ING-DiBa AG, Theodor-Heuss-Allee 2, 60486 Frankfurt am Main" -> "ING-DiBa AG" (existiert; Adresse weglassen)
+- "Vattenfall Europe Sales GmbH" -> "Vattenfall" (existiert; konsolidiere Konzernvarianten)
+- Brief von einem NEUEN Versorger "Stadtwerke XYZ" -> neu anlegen als "Stadtwerke XYZ" (NICHT auf "eprimo" oder "Vodafone" mappen, nur weil das der naechste existierende Versorger ist)
+- Drei Dokumente einer neuen Praxis im selben Scan: erstes Dokument legt Correspondent "Praxis Dr. Mustermann" an, zweites und drittes Dokument verwenden GENAU diesen Namen, auch wenn der OCR "Dr. Mustermann" oder "Praxis fuer XYZ" liest (siehe Regel 4).
+
+Beim Anlegen neuer Correspondents: voller offizieller Name der Organisation/Person, KEINE Adresse, KEINE Anrede, KEINE Rechtsform-Suffixe in Reinform (GmbH/AG/etc. nur dann mit aufnehmen, wenn sie Teil der Markenidentitaet sind, z.B. "DKB Grund GmbH").
+
+Aktuelle Correspondents-Liste (aus dieser pruefe als ERSTES, ob einer passt — Eintraege mit Matthias/Mathias Siebels sind absichtlich nicht enthalten, siehe Regel 1):
+{{CORRESPONDENTS_LIST}}
+
+Titel-Generierung (PFLICHT, deutsch, 5-80 Zeichen):
+- Format: "{Absender-Kurzform} - {Worum es geht}"
+- "{Absender-Kurzform}" = Correspondent in kurzer Form (z.B. "DAK", "Finanzamt", "Hogan Lovells", "Vodafone")
+- "{Worum es geht}" = 2-6 Woerter, die den Inhalt konkret beschreiben (z.B. "Beitragsrechnung Q1", "Grundsteuerbescheid 2024", "Gehaltsabrechnung Januar 2025", "GigaTV Vertragsverlaengerung")
+- Bei Rechnungen / Bescheiden: Vorgangs- bzw. Rechnungsnummer in den Titel aufnehmen, wenn vorhanden (z.B. "DAK - Beitragsrechnung 2025-Q1 (Nr. 4711)")
+- Keine generischen Woerter wie "Dokument", "Datei", "Scan", "PDF", "Schreiben" als alleinige Beschreibung
+- Keine Datums-Strings im Titel (das Datum erscheint schon im Storage Path)
+- Keine Anrede ("Sehr geehrter Herr") und keine Floskeln
+- Beispiele guter Titel:
+  - "DAK - Beitragsrechnung Q1"
+  - "Finanzamt - Grundsteuerbescheid 2024"
+  - "Hogan Lovells - Gehaltsabrechnung Januar"
+  - "Vodafone - GigaTV Vertragsverlaengerung"
+  - "AOK - Mitgliedsbescheinigung"
+- Bei unklarem Inhalt Fallback: "Information - {Sender-Name}" (z.B. "Information - Stadtwerke Muenchen")
+- Der Titel wird im JSON-Feld "title" zurueckgegeben.
--- a/infra/paperless/push_system_prompt.py
+++ b/infra/paperless/push_system_prompt.py
@@ -0,0 +1,187 @@
+"""
+Render SYSTEM_PROMPT.txt with the live correspondent list and push it to
+the paperless-ai container's /app/data/.env on mDock.
+
+The repo SYSTEM_PROMPT.txt is the template (with the placeholder
+{{CORRESPONDENTS_LIST}}). This script:
+
+  1. Reads the current correspondents from the Paperless API.
+  2. Filters out names that must never appear as correspondent
+     (recipients of m's mail — see RECIPIENT_EXCLUDE).
+  3. Renders the prompt by substituting the placeholder.
+  4. Reads the live /app/data/.env from the paperless-ai container.
+  5. Replaces the SYSTEM_PROMPT=`…` block.
+  6. Backs up the old .env (.bak.<ts>) and writes the new one.
+  7. Restarts the paperless-ai container.
+
+Dry-run is the default: prints the would-be rendered prompt without
+writing.
+
+Usage:
+    python3 push_system_prompt.py             # dry run
+    python3 push_system_prompt.py --apply     # write + restart
+
+Migrated into m/mDMS from m/otto on 2026-05-16 (mDMS#3).
+"""
+import argparse
+import datetime
+import json
+import os
+import subprocess
+import sys
+
+
+PAPERLESS_HOST = "mdock"
+PAPERLESS_AI_CONTAINER = "paperless-ai"
+PAPERLESS_WEB_CONTAINER = "paperless-webserver-1"
+ENV_PATH = "/app/data/.env"
+HERE = os.path.dirname(os.path.abspath(__file__))
+TEMPLATE_PATH = os.path.join(HERE, "SYSTEM_PROMPT.txt")
+PLACEHOLDER = "{{CORRESPONDENTS_LIST}}"
+
+# Names that are m or his household — recipients, never correspondents.
+# Substring match, case-insensitive. Keep the actual correspondent records
+# in Paperless (data integrity for historical doc assignments), but never
+# show them to the LLM as candidate senders.
+RECIPIENT_EXCLUDE = ("matthias siebels", "mathias siebels")
+
+
+def get_token() -> str:
+    out = subprocess.run(
+        ["ssh", PAPERLESS_HOST,
+         f"docker exec {PAPERLESS_AI_CONTAINER} sh -c "
+         f"'grep ^PAPERLESS_API_TOKEN {ENV_PATH} | cut -d= -f2'"],
+        capture_output=True, text=True, timeout=15,
+    )
+    return out.stdout.strip()
+
+
+def fetch_correspondents(token: str) -> list[str]:
+    cmd = (
+        f"docker exec {PAPERLESS_WEB_CONTAINER} "
+        f"curl -s -H 'Authorization: Token {token}' "
+        f"'http://localhost:8000/api/correspondents/?page_size=500'"
+    )
+    out = subprocess.run(
+        ["ssh", PAPERLESS_HOST, cmd],
+        capture_output=True, text=True, timeout=30,
+    )
+    if out.returncode != 0:
+        raise RuntimeError(f"fetch failed: {out.stderr}")
+    data = json.loads(out.stdout)
+    names = [c["name"] for c in data["results"]]
+    filtered = [n for n in names
+                if not any(x in n.lower() for x in RECIPIENT_EXCLUDE)]
+    dropped = sorted(set(names) - set(filtered))
+    if dropped:
+        print(f"filtered out recipient-names: {dropped}")
+    return sorted(filtered, key=lambda s: s.lower())
+
+
+def render_prompt(template: str, names: list[str]) -> str:
+    listing = "\n".join(f"- {n}" for n in names)
+    return template.replace(PLACEHOLDER, listing)
+
+
+def read_remote_env() -> str:
+    out = subprocess.run(
+        ["ssh", PAPERLESS_HOST,
+         f"docker exec {PAPERLESS_AI_CONTAINER} cat {ENV_PATH}"],
+        capture_output=True, text=True, timeout=15,
+    )
+    if out.returncode != 0:
+        raise RuntimeError(f"cat failed: {out.stderr}")
+    return out.stdout
+
+
+def replace_system_prompt(env: str, new_prompt: str) -> str:
+    """Replace the SYSTEM_PROMPT=`…` block with the new one.
+
+    Paperless-AI's .env uses backtick-delimited values for multi-line
+    settings (JS .env loader convention; bash would not accept this).
+    """
+    lines = env.splitlines(keepends=True)
+    out = []
+    inside = False
+    replaced = False
+    for line in lines:
+        if not inside and line.startswith("SYSTEM_PROMPT="):
+            out.append(f"SYSTEM_PROMPT=`{new_prompt.rstrip()}`\n")
+            replaced = True
+            stripped_value = line[len("SYSTEM_PROMPT="):].rstrip("\n")
+            if stripped_value.startswith("`") and stripped_value.count("`") >= 2:
+                continue
+            inside = True
+            continue
+        if inside:
+            if "`" in line:
+                inside = False
+            continue
+        out.append(line)
+    if not replaced:
+        raise SystemExit("SYSTEM_PROMPT= line not found in .env")
+    return "".join(out)
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--apply", action="store_true",
+                    help="Write new .env and restart paperless-ai")
+    args = ap.parse_args()
+
+    with open(TEMPLATE_PATH) as f:
+        template = f.read()
+    if PLACEHOLDER not in template:
+        sys.exit(f"template missing placeholder {PLACEHOLDER}")
+
+    token = get_token()
+    names = fetch_correspondents(token)
+    print(f"fetched {len(names)} live correspondents (after recipient filter)")
+    rendered = render_prompt(template, names)
+    print(f"rendered prompt: {len(rendered)} chars, {len(rendered.splitlines())} lines")
+
+    env_before = read_remote_env()
+    env_after = replace_system_prompt(env_before, rendered)
+    if env_before == env_after:
+        print("no change — live prompt already matches rendered template")
+        return
+
+    if not args.apply:
+        print("--- new SYSTEM_PROMPT block ---")
+        for line in env_after.splitlines():
+            if line.startswith("SYSTEM_PROMPT="):
+                print(line[:200] + ("…" if len(line) > 200 else ""))
+        print()
+        print("DRY RUN — re-run with --apply to write + restart paperless-ai")
+        return
+
+    ts = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%S")
+    backup = f"{ENV_PATH}.bak.{ts}"
+    subprocess.run(
+        ["ssh", PAPERLESS_HOST,
+         f"docker exec {PAPERLESS_AI_CONTAINER} cp {ENV_PATH} {backup}"],
+        check=True, timeout=15,
+    )
+    print(f"backup: {backup}")
+
+    write_cmd = (
+        f"docker exec -i {PAPERLESS_AI_CONTAINER} "
+        f"sh -c 'cat > {ENV_PATH}'"
+    )
+    proc = subprocess.run(
+        ["ssh", PAPERLESS_HOST, write_cmd],
+        input=env_after, capture_output=True, text=True, timeout=30,
+    )
+    if proc.returncode != 0:
+        sys.exit(f"write failed: {proc.stderr}")
+    print(f"wrote {len(env_after)} bytes to {ENV_PATH}")
+
+    subprocess.run(
+        ["ssh", PAPERLESS_HOST, f"docker restart {PAPERLESS_AI_CONTAINER}"],
+        check=True, timeout=60,
+    )
+    print(f"restarted {PAPERLESS_AI_CONTAINER}")
+
+
+if __name__ == "__main__":
+    main()
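The block replacement can be sanity-checked against an in-memory sample without touching mDock. The sketch below is an alternative regex formulation, not the line-by-line parser in `replace_system_prompt`; `swap_prompt` and the sample keys are hypothetical.

```python
import re

# Opening backtick after SYSTEM_PROMPT=, then everything up to the first
# backtick that ends a line.
BLOCK = re.compile(r"^SYSTEM_PROMPT=`.*?`[ \t]*$", re.DOTALL | re.MULTILINE)


def swap_prompt(env: str, new_prompt: str) -> str:
    """Replace the backtick-delimited SYSTEM_PROMPT block in an .env string."""
    replacement = f"SYSTEM_PROMPT=`{new_prompt.rstrip()}`"
    # A function replacement avoids backslash-escape processing of the prompt text.
    new_env, count = BLOCK.subn(lambda _m: replacement, env, count=1)
    if count != 1:
        raise ValueError("SYSTEM_PROMPT block not found")
    return new_env


sample = "PAPERLESS_API_TOKEN=abc\nSYSTEM_PROMPT=`old 1\nold 2`\nOTHER=1\n"
print(swap_prompt(sample, "new a\nnew b\n"))
```

Both this sketch and the script's parser rely on the prompt text containing no backtick of its own; a backtick inside `SYSTEM_PROMPT.txt` would end the block early on the next deploy.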
Author	SHA1	Message	Date
mAi	237f7f89ba	Merge mai/hermes/issue-4-paperless-ai: intra-scan dedup + short-brand prefix match (#4)	2026-05-21 11:30:35 +02:00
mAi	a2fa76a41a	mAi: #4 - paperless-AI prompt: intra-scan dedup + short-brand prefix match Two prompt-only rules added to address follow-ups from #3: 1. Intra-scan dedup (new rule 4 in Correspondents section): when processing multiple docs from the same sender in one scan batch, reuse the correspondent name created earlier in the same session instead of letting each doc create a fresh alias. Triggered by paperless-AI creating 3 Praxis-Irle aliases in one batch (no native batch-context plumbing; best-effort via prompt). 2. Short-brand prefix match (extension of Fuzzy-Regel): if OCR name is a strict prefix of an existing correspondent (or vice-versa) and the first 2 brand tokens match, use the existing correspondent. Triggered by 'Hogan Lovells' creating a new correspondent despite 'Hogan Lovells International LLP' already existing. Deployed via push_system_prompt.py --apply, container restarted, both strings verified present in /app/data/.env (backup at .env.bak.20260521T092606). Effectiveness will be observed as multi-doc scans flow through.	2026-05-21 11:26:40 +02:00
mAi	70b94cc2e9	Merge mai/hermes/issue-3-paperless-ai: paperless-AI prompt fix + drift reconciliation (#3)	2026-05-16 18:38:00 +02:00
mAi	7ba5bb925c	mAi: #3 - paperless-AI prompt: recipient rule (Empfaenger-Regel) + softened correspondent matching + drift reconciliation Live SYSTEM_PROMPT on mDock had drifted heavily from the repo template (detailed correspondent fuzzy-matching catalogue, full existing-names list, refined title-generation rules). Reconciled by adopting the live prompt as the new baseline in SYSTEM_PROMPT.txt and layering two fixes on top: 1. Recipient rule (Rule 1): Matthias / Mathias Siebels and any address-block variant ("Herr Siebels", "Empfaengeradresse Windscheidstr. 33") must NEVER be set as correspondent, because m is the recipient of nearly every doc. Paul Siebels: also recipient by default, correspondent only when demonstrably the author (his own letter, a damage report from Paul). Triggering misclassification (issue body): doc 280 (Vattenfall electricity supply contract) was tagged correspondent="Matthias Siebels" because the AI picked the recipient address block as sender. 2. Soften "Bevorzuge IMMER existierenden Correspondent" ("ALWAYS prefer an existing correspondent") -> only when semantic similarity is clear. Genuinely new senders (utility, doctor, insurer, landlord, ...) get a new correspondent rather than being force-mapped to the nearest existing name. Fixes the Vattenfall -> Telekom drift on docs 283/284 (also addressed by head adding Vattenfall ID 257 manually). Also migrated push_system_prompt.py from m/otto into this repo so the deploy mechanism (render template -> push to /app/data/.env -> restart paperless-ai) lives next to the template. Added RECIPIENT_EXCLUDE filter so Matthias/Mathias Siebels are stripped from the rendered correspondents list, as defense in depth on top of the prompt rule. Paperless correspondent records (IDs 3, 255) are preserved for the historical doc assignments that still reference them. Applied to live mDock paperless-ai (backup .env.bak.20260516T162255). 39 of 41 Siebels-correspondent doc assignments cleared + their paperless-AI sqlite tracker rows (processed_documents, history_documents, openai_metrics) deleted so they reclassify on the next scan. Two kept (doc 117, power of attorney from Paul; doc 130, damage report filled in by Paul; both genuine Paul-as-author cases per the new rule). Refs: m/mDMS#3	2026-05-16 18:27:19 +02:00
mAi	927a66bd66	Merge mai/hermes/issue-2-mover-strip: mover strips blank pages (#2)	2026-05-16 18:03:41 +02:00
mAi	90142396d8	mAi: #2 - mdms-mover: strip blank pages from duplex scans Two changes: 1. Migrate mover from m/otto (commit 9974937, otto#438) into this repo at infra/mdms-mover/. mover.sh, mdms-mover.service, mdms-mover.timer, README.md. Matches the live deployment on mDock byte-for-byte (modulo the strip step below). 2. Add blank-page stripping before the inbox → toprocess promotion. A page is dropped iff its embedded text is empty AND its rendered thumbnail is >= MDMS_BLANK_THRESHOLD near-white pixels (default 0.97 per issue #2). Detects the empty backside of patch-T separator sheets in duplex scans (mDMS#2). strip_blank_pages.py uses PyMuPDF as the only Python dep — single self-contained wheel, no `poppler-utils` apt-install on mdock. Mirrors the uv-inline-deps single-file pattern of infra/paperless/generate_separator.py. Edge cases: - 1-page input: strip skipped entirely. - All pages would drop: script exits 2, mover keeps file in inbox and logs WARNING (no empty doc reaches Paperless). - Strip script errors: mover falls back to plain mv, no scan blocked. - MDMS_STRIP_BLANK=false: bypass strip entirely (emergency disable). Deploy: rsync uv binary to mdock ~/.local/bin/uv (single static binary, user-space, no apt), scp script + units, systemctl --user daemon-reload. Verified live with synthetic 4-page (2 real + 1 blank + 1 real → 3 pages), 1-page (unchanged), all-blank (kept in inbox + warning) test PDFs. Timer fires every ~70s as before.	2026-05-16 17:57:26 +02:00
mAi	862bc76a2b	Merge mai/hermes/issue-1-scan-stack-multi: Paperless barcode-splitter (#1)	2026-05-16 15:56:30 +02:00