7 Commits

Author SHA1 Message Date
mAi
237f7f89ba Merge mai/hermes/issue-4-paperless-ai: intra-scan dedup + short-brand prefix match (#4) 2026-05-21 11:30:35 +02:00
mAi
a2fa76a41a mAi: #4 - paperless-AI prompt: intra-scan dedup + short-brand prefix match
Two prompt-only rules added to address follow-ups from #3:

1. Intra-scan dedup (new rule 4 in Correspondents section): when
   processing multiple docs from the same sender in one scan batch,
   reuse the correspondent name created earlier in the same session
   instead of letting each doc create a fresh alias. Triggered by
   paperless-AI creating 3 Praxis-Irle aliases in one batch (no native
   batch-context plumbing; best-effort via prompt).

2. Short-brand prefix match (extension of Fuzzy-Regel): if OCR name is
   a strict prefix of an existing correspondent (or vice-versa) and
   the first 2 brand tokens match, use the existing correspondent.
   Triggered by 'Hogan Lovells' creating a new correspondent despite
   'Hogan Lovells International LLP' already existing.

Deployed via push_system_prompt.py --apply, container restarted, both
strings verified present in /app/data/.env (backup at
.env.bak.20260521T092606). Effectiveness will be observed as
multi-doc scans flow through.
2026-05-21 11:26:40 +02:00
mAi
70b94cc2e9 Merge mai/hermes/issue-3-paperless-ai: paperless-AI prompt fix + drift reconciliation (#3) 2026-05-16 18:38:00 +02:00
mAi
7ba5bb925c mAi: #3 - paperless-AI prompt: Empfaenger-Regel + softened correspondent matching + drift reconciliation
Live SYSTEM_PROMPT on mDock had drifted heavily from the repo template
(detailed correspondent fuzzy-matching catalogue, full existing-names
list, refined title-generation rules). Reconciled by adopting the live
prompt as the new baseline in SYSTEM_PROMPT.txt and layering two fixes
on top:

1. Recipient rule (Rule 1): Matthias / Mathias Siebels and any address-
   block variant ("Herr Siebels", "Empfaengeradresse Windscheidstr. 33")
   must NEVER be set as correspondent, because m is the recipient of
   nearly every doc. Paul Siebels: also recipient by default, only
   correspondent when demonstrably the author (his own letter, a damage
   report filed by Paul).

   Triggering misclassification (issue body): doc 280 (Vattenfall
   Stromliefervertrag) was tagged correspondent="Matthias Siebels"
   because the AI picked the recipient address block as sender.

2. Soften "Bevorzuge IMMER existierenden Correspondent" ("ALWAYS prefer
   an existing correspondent") -> only when semantic similarity is
   clear. Genuinely new senders (utility, doctor, insurer, landlord,
   ...) get a new correspondent rather than being force-mapped to the
   nearest existing name. Fixes the Vattenfall -> Telekom drift on docs
   283/284 (also addressed by head adding Vattenfall ID 257 manually).

Also migrated push_system_prompt.py from m/otto into this repo so the
deploy mechanism (render template -> push to /app/data/.env -> restart
paperless-ai) lives next to the template. Added RECIPIENT_EXCLUDE
filter so Matthias/Mathias Siebels are stripped from the rendered
correspondents list — defense in depth on top of the prompt rule.
Paperless correspondent records (IDs 3, 255) are preserved for the
historical doc assignments that still reference them.

Applied to live mDock paperless-ai (backup .env.bak.20260516T162255).
39 of 41 Siebels-correspondent doc assignments cleared + their
paperless-AI sqlite tracker rows (processed_documents,
history_documents, openai_metrics) deleted so they reclassify on the
next scan. Two kept (doc 117 Vollmacht from Paul, doc 130
Schadensmeldung filled by Paul — both genuine Paul-as-author cases per
the new rule).

Refs: m/mDMS#3
2026-05-16 18:27:19 +02:00
mAi
927a66bd66 Merge mai/hermes/issue-2-mover-strip: mover strips blank pages (#2) 2026-05-16 18:03:41 +02:00
mAi
90142396d8 mAi: #2 - mdms-mover: strip blank pages from duplex scans
Two changes:

1. Migrate mover from m/otto (commit 9974937, otto#438) into this repo
   at infra/mdms-mover/. mover.sh, mdms-mover.service, mdms-mover.timer,
   README.md. Matches the live deployment on mDock byte-for-byte (modulo
   the strip step below).

2. Add blank-page stripping before the inbox → toprocess promotion. A
   page is dropped iff its embedded text is empty AND its rendered
   thumbnail is >= MDMS_BLANK_THRESHOLD near-white pixels (default 0.97
   per issue #2). Detects the empty backside of patch-T separator
   sheets in duplex scans (mDMS#2).

strip_blank_pages.py uses PyMuPDF as the only Python dep — single
self-contained wheel, no `poppler-utils` apt-install on mdock. Mirrors
the uv-inline-deps single-file pattern of infra/paperless/generate_separator.py.

Edge cases:
- 1-page input: strip skipped entirely.
- All pages would drop: script exits 2, mover keeps file in inbox and
  logs WARNING (no empty doc reaches Paperless).
- Strip script errors: mover falls back to plain mv, no scan blocked.
- MDMS_STRIP_BLANK=false: bypass strip entirely (emergency disable).

Deploy: rsync uv binary to mdock ~/.local/bin/uv (single static binary,
user-space, no apt), scp script + units, systemctl --user daemon-reload.
Verified live with synthetic 4-page (2 real + 1 blank + 1 real → 3
pages), 1-page (unchanged), all-blank (kept in inbox + warning) test
PDFs. Timer fires every ~70s as before.
2026-05-16 17:57:26 +02:00
mAi
862bc76a2b Merge mai/hermes/issue-1-scan-stack-multi: Paperless barcode-splitter (#1) 2026-05-16 15:56:30 +02:00
9 changed files with 680 additions and 2 deletions

1
.gitignore vendored Normal file

@@ -0,0 +1 @@
.m/

180
infra/mdms-mover/README.md Normal file

@@ -0,0 +1,180 @@
# mdms-mover — age-gated inbox → toprocess promoter + blank-page stripper
Two jobs in one user-systemd timer:
1. **Stability gate** (otto#438): solves the chunk-write race between the
Canon MB5100 (SMB scans land in `/mnt/mdms/inbox/` in pieces) and
Paperless (polls `/mnt/mdms/toprocess/` every 60s and consumes
anything it sees). A file is only promoted when **both**:
- its `mtime` is more than 3 minutes old (`MDMS_MIN_AGE_MIN`), and
- file size is unchanged since the previous run.
2. **Blank-page strip** (mDMS#2): duplex scans through patch-T separators
leave a blank backside (the unprinted reverse of the separator sheet)
at the front of every subsequent document. PDF files are passed
through `strip_blank_pages.py` before promotion. Pages with no
embedded text AND at least 97% near-white pixels are dropped.
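
The stability gate can be sketched in a few lines of Python. This is only an illustration of the two conditions; the deployed logic lives in `mover.sh` (which uses `find -mmin` and a state file). The function name `is_stable` and its parameters are hypothetical and do not exist in this repo.

```python
from typing import Optional


def is_stable(mtime: float, now: float, size: int,
              prev_size: Optional[int], min_age_min: int = 3) -> bool:
    """Hypothetical helper: True when a file may be promoted.

    mtime/now are Unix timestamps in seconds; prev_size is the size
    recorded on the previous run, or None if the file was not seen yet.
    """
    old_enough = (now - mtime) > min_age_min * 60
    unchanged = prev_size is not None and size == prev_size
    return old_enough and unchanged


# A file first seen on this run is never promoted, however old it is.
print(is_stable(mtime=0, now=600, size=1024, prev_size=None))
# Old enough and same size as on the previous run: promote.
print(is_stable(mtime=0, now=600, size=1024, prev_size=1024))
```

Requiring a previous size means every file waits at least one extra timer tick, which is the point: a scan that is still being written in chunks changes size between two runs.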
## Layout on mDock
```
/home/m/mdms-mover/mover.sh # script, deployed copy
/home/m/mdms-mover/strip_blank_pages.py # blank-page detector
/home/m/.config/systemd/user/mdms-mover.service # oneshot service
/home/m/.config/systemd/user/mdms-mover.timer # OnUnitActiveSec=1min
/home/m/.local/state/mdms-mover/state.tsv # last-seen size per file
/home/m/.local/bin/uv # uv runner for the strip script
```
Runs as user `m` under user-systemd. mDock has `Linger=yes` for user
`m`, so the timer keeps firing across reboots and logout sessions.
## Why systemd, not cron
The original spec (otto#438) called for `/etc/cron.d/mdms-mover`. mDock
runs Ubuntu 24.04 server which ships with systemd-timers and no `cron`
package. Installing cron only to honour the spec wording would add a
package we don't otherwise need; a user-systemd timer is the canonical
Ubuntu 24.04 approach and gives better observability
(`systemctl --user status mdms-mover.timer`, `journalctl --user -u mdms-mover`).
User-mode (not system-mode) keeps the entire install in `m`'s home — no
sudo at deploy or maintenance time, no `/var/lib/...` directories to
chown, the service can read/write the NFS mount because `m` owns it.
## Configuration
```
| var | default | meaning |
|------------------------|-----------------------------------------------|----------------------------------------------------|
| MDMS_INBOX | /mnt/mdms/inbox | source — scanner SMB target |
| MDMS_TOPROCESS | /mnt/mdms/toprocess | destination — Paperless consume |
| MDMS_STATE | $HOME/.local/state/mdms-mover/state.tsv | per-file size memory |
| MDMS_MIN_AGE_MIN | 3 | minimum mtime age in minutes |
| MDMS_STRIP_BLANK | true | run blank-page strip on PDFs (set to "false" to disable) |
| MDMS_STRIP_SCRIPT | <mover dir>/strip_blank_pages.py | path override for the strip script |
| MDMS_BLANK_THRESHOLD | 0.97 | near-white pixel ratio to call a page blank (read by strip script) |
| MDMS_BLANK_NEAR_WHITE | 240 | grayscale cutoff (0-255) for "near white" pixels (read by strip script) |
| MDMS_BLANK_DPI | 50 | thumbnail render DPI (read by strip script) |
```
To override at runtime, drop into
`~/.config/systemd/user/mdms-mover.service.d/override.conf`:
```ini
[Service]
Environment=MDMS_MIN_AGE_MIN=5
Environment=MDMS_BLANK_THRESHOLD=0.99
```
then `systemctl --user daemon-reload && systemctl --user restart mdms-mover.timer`.
## Blank-page detection — what gets dropped
A page is dropped iff BOTH:
1. embedded text is empty / whitespace-only (image-only scans always
pass this — they have no embedded text), AND
2. the rendered thumbnail is ≥ `MDMS_BLANK_THRESHOLD` near-white pixels
(0.97 by default → at least 97% of pixels at or above grayscale 240).
The threshold is conservative on purpose: a false-negative (keeping a
blank page we should have dropped) is recoverable via Paperless's UI; a
false-positive (dropping a real page) silently loses data. If real
pages get dropped in practice, **raise** `MDMS_BLANK_THRESHOLD` toward
0.99 — that makes the strip step pickier and keeps more pages.
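
To make the effect of the threshold concrete, here is a minimal sketch of the decision in plain Python. It works on a flat list of grayscale values instead of a rendered thumbnail, so it is an illustration rather than the code in `strip_blank_pages.py`; the helper names `near_white_ratio` and `is_blank` as written here are assumptions for this example.

```python
def near_white_ratio(pixels: list, near_white: int = 240) -> float:
    """Fraction of grayscale pixels (0-255) at or above the cutoff."""
    if not pixels:
        return 1.0
    return sum(1 for p in pixels if p >= near_white) / len(pixels)


def is_blank(text: str, pixels: list,
             threshold: float = 0.97, near_white: int = 240) -> bool:
    """A page is blank only if it has no text AND is almost all near-white."""
    if text.strip():
        return False
    return near_white_ratio(pixels, near_white) >= threshold


# 98 white pixels and 2 black ones: 98% near-white.
page = [255] * 98 + [0] * 2
print(is_blank("", page))                   # dropped at the default 0.97
print(is_blank("", page, threshold=0.99))   # kept once the threshold is raised
print(is_blank("Rechnung", page))           # any embedded text keeps the page
```

Raising the threshold shrinks the set of pages that qualify as blank, which is why it is the safe direction to move in when real pages are being lost.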
Edge cases handled inside `strip_blank_pages.py`:
- **1-page input:** strip is skipped entirely (single-page docs never
have separator-backside artefacts).
- **All pages would drop:** the script exits with code `2` and writes no
output. The mover keeps the file in the inbox and logs
`WARNING: <name> appears all-blank, kept in inbox`. m can inspect via
`journalctl --user -u mdms-mover`.
- **strip_blank_pages.py errors out:** mover falls back to a plain `mv`
(unstripped) so a transient problem in the detector never blocks a
scan from reaching Paperless.
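
The three outcomes above amount to a small dispatch on the strip script's exit code. The sketch below restates that contract in Python for readability; the mover itself implements it as a `case` statement in bash, and the function name and the returned labels are hypothetical.

```python
def action_for_exit_code(rc: int) -> str:
    """Map the strip script's exit code to the mover's reaction.

    The returned labels are illustrative names, not identifiers from mover.sh.
    """
    if rc == 0:
        # Output was written (stripped, or copied unchanged).
        return "promote_output"
    if rc == 2:
        # Every page looked blank: nothing was written, keep the original.
        return "keep_in_inbox"
    # Any other code is treated as a detector failure: pass the scan through.
    return "promote_unstripped"


for rc in (0, 2, 1, 127):
    print(rc, action_for_exit_code(rc))
```

Treating every unknown code as "pass through unstripped" keeps the failure mode biased toward delivering the scan rather than blocking it.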
The script is a uv-inline-deps single file. It declares two dependencies:
PyMuPDF for rendering and text extraction, and Pillow for the grayscale
histogram. Both install as wheels, so no `poppler-utils` apt install is
needed on mdock. Mirrors the pattern from `infra/paperless/generate_separator.py`.
## Deploy / sync
The live files on mDock must match this directory byte-for-byte (md5,
same convention as `infra/samba-canon/`).
```bash
ssh mdock 'mkdir -p ~/mdms-mover ~/.config/systemd/user ~/.local/state/mdms-mover ~/.local/bin'
# uv binary (single static binary, user-space — no apt, no sudo)
rsync -av ~/.local/bin/uv mdock:/home/m/.local/bin/uv
# mover + strip script
scp infra/mdms-mover/mover.sh mdock:/home/m/mdms-mover/mover.sh
scp infra/mdms-mover/strip_blank_pages.py mdock:/home/m/mdms-mover/strip_blank_pages.py
scp infra/mdms-mover/mdms-mover.service mdock:/home/m/.config/systemd/user/
scp infra/mdms-mover/mdms-mover.timer mdock:/home/m/.config/systemd/user/
ssh mdock 'chmod +x ~/mdms-mover/mover.sh ~/mdms-mover/strip_blank_pages.py && \
systemctl --user daemon-reload && \
systemctl --user enable --now mdms-mover.timer'
```
The first time the strip script runs, `uv` downloads Python plus the
PyMuPDF and Pillow wheels into `~/.cache/uv/` (~30 MB). Subsequent runs
reuse the cache.
## Verify
```bash
ssh mdock 'systemctl --user list-timers mdms-mover.timer'
ssh mdock 'journalctl --user -u mdms-mover -n 20 --no-pager'
ssh mdock 'cat ~/.local/state/mdms-mover/state.tsv'
ssh mdock 'journalctl -t mdms-mover -n 20 --no-pager'
```
## Emergency disable
Stop the timer entirely:
```bash
ssh mdock 'systemctl --user stop mdms-mover.timer && \
systemctl --user disable mdms-mover.timer'
```
Or just disable the strip step while keeping the stability gate:
```bash
# Run on mdock as user m. Note: this overwrites an existing override.conf.
mkdir -p ~/.config/systemd/user/mdms-mover.service.d
cat > ~/.config/systemd/user/mdms-mover.service.d/override.conf <<EOF
[Service]
Environment=MDMS_STRIP_BLANK=false
EOF
systemctl --user daemon-reload
```
Re-enable the timer with `systemctl --user enable --now mdms-mover.timer`.
If you need to drain the inbox manually while disabled, files older
than a few minutes are safe to `mv` into `toprocess/` by hand —
Paperless will pick them up on its next poll.
## Logs
Service logs land in the user journal under unit `mdms-mover`, and
moved-file events also go through `logger -t mdms-mover` so they appear
under that tag in the system journal too:
```bash
ssh mdock 'journalctl --user -u mdms-mover -f' # service execution
ssh mdock 'journalctl -t mdms-mover -f' # moved-file events
```
## Refs
- mDMS#2 — blank-page strip (this README)
- otto#438 — original scheduler / staging-folder design
- otto#429 — original Paperless pipeline setup
- otto#431 — samba-canon bridge container (upstream of this mover)
- `docs/strategy.md` — overall mDMS dataset layout
- `infra/paperless/generate_separator.py` — sibling uv-inline-deps script


@@ -0,0 +1,11 @@
[Unit]
Description=mDMS mover — promote stable scanner files inbox → toprocess
After=network-online.target remote-fs.target
Wants=network-online.target remote-fs.target
[Service]
Type=oneshot
ExecStart=%h/mdms-mover/mover.sh
[Install]
WantedBy=default.target


@@ -0,0 +1,12 @@
[Unit]
Description=Run mDMS mover every minute
Requires=mdms-mover.service
[Timer]
OnBootSec=2min
OnUnitActiveSec=1min
AccuracySec=10s
Unit=mdms-mover.service
[Install]
WantedBy=timers.target default.target

93
infra/mdms-mover/mover.sh Executable file

@@ -0,0 +1,93 @@
#!/bin/bash
# mdms-mover: move stable files from /mnt/mdms/inbox → /mnt/mdms/toprocess.
#
# A file is "stable" when it satisfies BOTH conditions:
# 1. mtime older than MDMS_MIN_AGE_MIN minutes (default 3).
# 2. size unchanged since the previous run (recorded in STATE).
#
# This protects Paperless from ingesting half-written scans dropped by the
# Canon MB5100 via SMB. See otto#438, mDMS#2.
#
# When MDMS_STRIP_BLANK=true (default) and the file is a PDF, blank pages
# are stripped before promotion (mDMS#2). Empty backsides of patch-T
# separators from duplex scans land here. See strip_blank_pages.py for the
# detection heuristic.
set -euo pipefail
INBOX="${MDMS_INBOX:-/mnt/mdms/inbox}"
TOPROCESS="${MDMS_TOPROCESS:-/mnt/mdms/toprocess}"
STATE="${MDMS_STATE:-$HOME/.local/state/mdms-mover/state.tsv}"
MIN_AGE_MIN="${MDMS_MIN_AGE_MIN:-3}"
STRIP_BLANK="${MDMS_STRIP_BLANK:-true}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
STRIP_SCRIPT="${MDMS_STRIP_SCRIPT:-$SCRIPT_DIR/strip_blank_pages.py}"
mkdir -p "$TOPROCESS" "$(dirname "$STATE")"
touch "$STATE"
NEW_STATE=$(mktemp)
trap 'rm -f "$NEW_STATE"' EXIT
# Promote a single stable file from inbox into toprocess, blank-stripping
# PDFs when enabled. Returns silently; logs go through logger(1).
promote() {
    local src="$1" name="$2" size="$3"
    local ext="${name##*.}"
    local dest="$TOPROCESS/$name"
    # Plain move when stripping is disabled, the file is not a PDF, or the
    # strip script is missing / not executable.
    if [[ "$STRIP_BLANK" != "true" || "${ext,,}" != "pdf" || ! -x "$STRIP_SCRIPT" ]]; then
        if mv -n "$src" "$dest" 2>/dev/null; then
            logger -t mdms-mover "moved $name ($size bytes)"
        fi
        return
    fi
    # Stage stripped output inside toprocess (same filesystem → atomic rename).
    # Dotfile prefix so Paperless's consumer ignores the partial during write.
    local tmpout="$TOPROCESS/.mdms-tmp.$$.$name"
    local rc=0
    "$STRIP_SCRIPT" "$src" "$tmpout" || rc=$?
    case "$rc" in
        0)
            # Log success only if the rename and the source cleanup both worked.
            if mv -f "$tmpout" "$dest" && rm -f "$src"; then
                logger -t mdms-mover "moved $name ($size bytes, strip ok)"
            else
                rm -f "$tmpout"
                logger -t mdms-mover "WARNING: could not finish promoting $name, source left in inbox"
            fi
            ;;
        2)
            rm -f "$tmpout"
            logger -t mdms-mover "WARNING: $name appears all-blank, kept in inbox"
            ;;
        *)
            rm -f "$tmpout"
            logger -t mdms-mover "strip failed for $name (rc=$rc), passing through unchanged"
            if mv -n "$src" "$dest" 2>/dev/null; then
                logger -t mdms-mover "moved $name ($size bytes, unstripped)"
            fi
            ;;
    esac
}
# Iterate top-level regular files older than MIN_AGE_MIN.
# Skip dotfiles (probe files, scanner temp markers like ._foo, our .mdms-tmp.*).
while IFS= read -r f; do
    name=$(basename "$f")
    case "$name" in
        .*) continue ;;
    esac
    if ! size=$(stat -c %s "$f" 2>/dev/null); then
        continue
    fi
    # STATE is tab-separated; split on tabs so names containing spaces still match.
    prev=$(awk -F'\t' -v n="$name" '$1==n {print $2; exit}' "$STATE")
    printf '%s\t%s\n' "$name" "$size" >> "$NEW_STATE"
    if [[ -n "$prev" && "$size" == "$prev" ]]; then
        promote "$f" "$name" "$size"
    fi
done < <(find "$INBOX" -maxdepth 1 -type f -mmin "+$MIN_AGE_MIN")
mv "$NEW_STATE" "$STATE"
trap - EXIT


@@ -0,0 +1,122 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "pymupdf>=1.24",
# "Pillow>=10.0",
# ]
# ///
"""Strip blank pages from a PDF — used by mdms-mover before promoting to toprocess.
Usage:
strip_blank_pages.py <input.pdf> <output.pdf>
Exit codes:
0 output.pdf written (either stripped or copied unchanged)
2 all pages would be dropped — output NOT written, caller should keep
the original file in the inbox and log a warning
1 error (input unreadable, write failed, etc.)
A page counts as "blank" iff BOTH of:
* embedded text is empty / whitespace-only, AND
* rendered thumbnail is >= MDMS_BLANK_THRESHOLD near-white pixels.
False-negatives are preferred over false-positives — borderline pages stay.
Env:
MDMS_BLANK_THRESHOLD near-white pixel ratio (0.0-1.0, default 0.97)
MDMS_BLANK_NEAR_WHITE near-white cutoff in 0-255 grayscale (default 240)
MDMS_BLANK_DPI thumbnail render DPI (default 50)
PyMuPDF is used instead of pdf2image+pikepdf+pypdf so rendering and text
extraction come from one self-contained wheel (Pillow, used for the
histogram, is the only other dependency): no poppler-utils apt-install on
mdock, no multiple text-extraction libraries to keep in sync.
"""
from __future__ import annotations

import os
import shutil
import sys
from pathlib import Path

import fitz  # PyMuPDF
from PIL import Image


def near_white_ratio(image: Image.Image, near_white: int) -> float:
    """Return the fraction of pixels at or above the near-white cutoff."""
    gray = image.convert("L") if image.mode != "L" else image
    hist = gray.histogram()
    total = sum(hist)
    if total == 0:
        return 1.0
    return sum(hist[near_white:]) / total


def page_is_blank(page: "fitz.Page", threshold: float, near_white: int, dpi: int) -> bool:
    """A page is blank iff it has no embedded text AND is almost all near-white."""
    text = (page.get_text("text") or "").strip()
    if text:
        return False
    pix = page.get_pixmap(dpi=dpi, colorspace=fitz.csGRAY)
    image = Image.frombytes("L", (pix.width, pix.height), pix.samples)
    return near_white_ratio(image, near_white) >= threshold


def main() -> int:
    if len(sys.argv) != 3:
        print(f"usage: {sys.argv[0]} <input.pdf> <output.pdf>", file=sys.stderr)
        return 1
    src = Path(sys.argv[1])
    dst = Path(sys.argv[2])
    threshold = float(os.environ.get("MDMS_BLANK_THRESHOLD", "0.97"))
    near_white = int(os.environ.get("MDMS_BLANK_NEAR_WHITE", "240"))
    dpi = int(os.environ.get("MDMS_BLANK_DPI", "50"))
    try:
        doc = fitz.open(src)
    except Exception as exc:
        print(f"failed to open {src}: {exc}", file=sys.stderr)
        return 1
    try:
        page_count = doc.page_count
        # Single-page documents cannot carry a separator backside: copy as-is.
        if page_count <= 1:
            shutil.copyfile(src, dst)
            return 0
        keep: list[int] = []
        for i in range(page_count):
            if not page_is_blank(doc[i], threshold, near_white, dpi):
                keep.append(i)
        if not keep:
            # Exit code 2 tells the mover to keep the original in the inbox.
            print(f"all pages blank in {src.name}", file=sys.stderr)
            return 2
        if len(keep) == page_count:
            # Nothing to strip: copy the original bytes unchanged.
            shutil.copyfile(src, dst)
            return 0
        out = fitz.open()
        try:
            for i in keep:
                out.insert_pdf(doc, from_page=i, to_page=i)
            out.save(dst)
        finally:
            out.close()
        dropped = page_count - len(keep)
        print(
            f"{src.name}: dropped {dropped}/{page_count} blank page(s)",
            file=sys.stderr,
        )
        return 0
    finally:
        doc.close()


if __name__ == "__main__":
    sys.exit(main())


@@ -21,4 +21,28 @@ in the repo. Hashes:
| setup.js.patched | ~/paperless/build/setup.js.patched | `04cb5fbfaed13a5f25612af0b79dd90c` |
| server.js.patched | ~/paperless/build/server.js.patched | `eadcbb86048127f2c80632ae77bbc2a0` |
See `docs/research/issue-429-paperless-pipeline.md` for the why.
See `docs/research/issue-429-paperless-pipeline.md` in `m/otto` for the
original pipeline rebuild (issue otto#429).
## SYSTEM_PROMPT deploy mechanism
`SYSTEM_PROMPT.txt` is the source of truth. It is a template — the
`{{CORRESPONDENTS_LIST}}` placeholder is rendered at deploy time by
fetching the live correspondents from Paperless. The live prompt is
inside `paperless-ai`'s `/app/data/.env` (volume `paperless_aidata`) as
the backtick-delimited `SYSTEM_PROMPT=\`…\`` block.
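
As a rough illustration of the render step, the sketch below substitutes the placeholder and wraps the result in the backtick-delimited form described above. `render_prompt` mirrors the function of the same name in `push_system_prompt.py`; `as_env_block` is a hypothetical helper added only for this example, and the sample template is made up.

```python
PLACEHOLDER = "{{CORRESPONDENTS_LIST}}"


def render_prompt(template: str, names: list) -> str:
    """Replace the placeholder with one '- name' bullet per correspondent."""
    listing = "\n".join(f"- {n}" for n in names)
    return template.replace(PLACEHOLDER, listing)


def as_env_block(prompt: str) -> str:
    """Hypothetical helper: format the prompt as the .env entry."""
    return f"SYSTEM_PROMPT=`{prompt.rstrip()}`\n"


# Made-up two-line template, only to show the shape of the output.
template = "Aktuelle Correspondents-Liste:\n{{CORRESPONDENTS_LIST}}\n"
rendered = render_prompt(template, ["eprimo", "Vattenfall"])
print(as_env_block(rendered))
```

Because the value is delimited by backticks, a backtick inside the template itself would end the block early, so the template is assumed not to contain any.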
Deploy with `push_system_prompt.py`:
```bash
python3 push_system_prompt.py # dry run — diff only
python3 push_system_prompt.py --apply # write + restart paperless-ai
```
The script filters recipient-only names (Matthias / Mathias Siebels)
out of the rendered list — see `RECIPIENT_EXCLUDE` in the script and
the matching rule at the top of the Correspondents section in
`SYSTEM_PROMPT.txt`. If you edit either, edit both.
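
The filter is a case-insensitive substring match. A minimal sketch follows, using the same `RECIPIENT_EXCLUDE` tuple as the script; the function name `filter_recipients` and the sample names are assumptions made for this example.

```python
RECIPIENT_EXCLUDE = ("matthias siebels", "mathias siebels")


def filter_recipients(names: list) -> list:
    """Drop every name that contains one of the excluded substrings."""
    return [n for n in names
            if not any(x in n.lower() for x in RECIPIENT_EXCLUDE)]


# Sample input is illustrative, not the live correspondent list.
names = ["Vattenfall", "Herr Matthias Siebels", "MATHIAS SIEBELS", "Paul Siebels"]
print(filter_recipients(names))
```

Note that "Paul Siebels" passes the filter: the prompt rule allows Paul as correspondent when he is the author, so only the two Matthias spellings are stripped from the list.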
The previous live `.env` is kept on mDock as `.env.bak.<ts>` next to the
new one for rollback.


@@ -21,4 +21,52 @@ Bei medizinischen Dokumenten Tag Gesundheit setzen.
Bei steuerrelevanten Dokumenten Tag Steuer setzen.
Bei Dokumenten mit Frist Tag Frist setzen.
Correspondents: Verwende den vollen offiziellen Namen der Organisation oder Person (z.B. "DAK-Gesundheit" nicht "DAK-Gesundheit Postzentrum, 22778 Hamburg"). Keine Adressen im Namen. Pruefe ob der Correspondent schon existiert bevor du einen neuen anlegst.
Erfinde NIEMALS neue Tags. Erfinde NIEMALS neue Document Types. Bei Unsicherheit: Document Type = Information, keine zusätzlichen Tags.
Correspondents — WICHTIG, in dieser Reihenfolge:
1. EMPFAENGER NIEMALS als Correspondent: Matthias Siebels (alle Schreibweisen — Mathias, Mathhias, Siebels, MS, "Herr Siebels", "Herrn Matthias Siebels", "Empfaengeradresse Windscheidstr. 33") ist der EMPFAENGER nahezu jedes Dokuments in diesem DMS. NIEMALS als Correspondent setzen, auch wenn der Name in der Absenderzeile zu lesen ist (z.B. wenn der OCR die Empfaengeradresse als Absender mis-interpretiert). Gleiches gilt sinngemaess fuer Paul Siebels — Paul ist meistens Empfaenger (Bescheide, Rechnungen, Steuerbescheide an Paul). Verwende Paul Siebels nur dann als Correspondent, wenn Paul nachweislich Autor des Dokuments ist (z.B. eigener Brief, Schadensmeldung von Paul).
2. Der Correspondent ist die Organisation oder Person, die das Dokument GESENDET / GESCHRIEBEN hat. In den seltenen Faellen, in denen m (Matthias) selbst Autor ist (z.B. eigene Briefe an Behoerden, eigene Umsatzsteuer-Voranmeldung), waehle Document Type = Personal Correspondence und Correspondent = die EMPFAENGENDE Organisation (z.B. "Finanzamt Düsseldorf-Mitte").
3. Bevorzuge existierende Correspondents bei klarer semantischer Aehnlichkeit (Fuzzy-Regel unten). Wenn der OCR-Absender genuinely neu ist (z.B. ein neuer Versorger, Vermieter, Arzt, Dienstleister, Anwalt, Mandant, Versicherer), lege einen neuen Correspondent an, statt zwanghaft auf den naechsten existierenden Namen zu mappen.
4. INTRA-SCAN DEDUP: Bevor du einen neuen Correspondent anlegst, pruefe ob du in dieser Sitzung (gleicher Scan-Batch, gleicher Verarbeitungslauf) bereits einen Correspondent mit aehnlichem Namen angelegt hast — verwende dann den existierenden (denselben Namen unveraendert), statt eine weitere Variante anzulegen. Konkret: kommen in einem Scan mehrere Dokumente vom gleichen Sender vor (z.B. zwei Rechnungen derselben Arztpraxis, mehrere Schreiben desselben Versorgers), MUSS der Correspondent-Name bei jedem dieser Dokumente identisch sein. Im Zweifel waehle die laengste / vollstaendigste Form, die du in diesem Scan gesehen hast.
Fuzzy-Regel: Wenn der OCR-Absendername bis auf Gross-/Kleinschreibung, Akzente, Tippfehler, Anrede ("Herr"/"Frau"/"Herrn"), Adresszusatz, Personenname als Ansprechpartner oder Rechtsform-Suffix (GmbH/AG/eG/e.V./LLP/KG/mbH/VVaG) einem existierenden Correspondent entspricht, verwende den existierenden Namen UNVERAENDERT. Bei substantiell anderen Namen (anderer Stamm, andere Branche, andere Firmierung) lege einen neuen an.
Beim Vergleich gilt: Ist der OCR-Name ein striktes Praefix eines existierenden Correspondents (oder umgekehrt), und stimmen die ersten 2 Brand-Tokens ueberein (Token = Wort, das nicht Rechtsform-Suffix, Adresse oder Anrede ist), verwende den existierenden Correspondent. Das gilt sowohl fuer Kurzformen ohne Rechtsform-Suffix ("Hogan Lovells" -> "Hogan Lovells International LLP") als auch fuer den umgekehrten Fall, wenn die existierende Form kuerzer ist als die OCR-Form.
Beispiele:
- "Hogan Lovells lnternational LLP" (OCR-Variante) -> "Hogan Lovells International LLP" (existiert)
- "Hogan Lovells" (Kurzform ohne Rechtsform) -> "Hogan Lovells International LLP" (existiert; OCR-Name ist Praefix, erste 2 Brand-Tokens stimmen)
- "eprimo CmbH" -> "eprimo" (existiert)
- "Helios Klinikum Duisburg GmbH" -> "Helios Klinikum Duisburg" (existiert)
- "Kundenservice von eprimo" -> "eprimo" (existiert)
- "Ammerländer Versicherung VVaG" -> "Ammerländer Versicherung" (existiert; Rechtsform weglassen)
- "ING-DiBa AG, Theodor-Heuss-Allee 2, 60486 Frankfurt am Main" -> "ING-DiBa AG" (existiert; Adresse weglassen)
- "Vattenfall Europe Sales GmbH" -> "Vattenfall" (existiert; konsolidiere Konzernvarianten)
- Brief von einem NEUEN Versorger "Stadtwerke XYZ" -> neu anlegen als "Stadtwerke XYZ" (NICHT auf "eprimo" oder "Vodafone" mappen, nur weil das der naechste existierende Versorger ist)
- Drei Dokumente einer neuen Praxis im selben Scan: erstes Dokument legt Correspondent "Praxis Dr. Mustermann" an, zweites und drittes Dokument verwenden GENAU diesen Namen, auch wenn der OCR "Dr. Mustermann" oder "Praxis fuer XYZ" liest (siehe Regel 4).
Beim Anlegen neuer Correspondents: voller offizieller Name der Organisation/Person, KEINE Adresse, KEINE Anrede, KEINE Rechtsform-Suffixe in Reinform (GmbH/AG/etc. nur dann mit aufnehmen, wenn sie Teil der Markenidentitaet sind, z.B. "DKB Grund GmbH").
Aktuelle Correspondents-Liste (aus dieser pruefe als ERSTES, ob einer passt — Eintraege mit Matthias/Mathias Siebels sind absichtlich nicht enthalten, siehe Regel 1):
{{CORRESPONDENTS_LIST}}
Titel-Generierung (PFLICHT, deutsch, 5-80 Zeichen):
- Format: "{Absender-Kurzform} - {Worum es geht}"
- "{Absender-Kurzform}" = Correspondent in kurzer Form (z.B. "DAK", "Finanzamt", "Hogan Lovells", "Vodafone")
- "{Worum es geht}" = 2-6 Woerter, die den Inhalt konkret beschreiben (z.B. "Beitragsrechnung Q1", "Grundsteuerbescheid 2024", "Gehaltsabrechnung Januar 2025", "GigaTV Vertragsverlaengerung")
- Bei Rechnungen / Bescheiden: Vorgangs- bzw. Rechnungsnummer in den Titel aufnehmen, wenn vorhanden (z.B. "DAK - Beitragsrechnung 2025-Q1 (Nr. 4711)")
- Keine generischen Woerter wie "Dokument", "Datei", "Scan", "PDF", "Schreiben" als alleinige Beschreibung
- Keine Datums-Strings im Titel (das Datum erscheint schon im Storage Path)
- Keine Anrede ("Sehr geehrter Herr") und keine Floskeln
- Beispiele guter Titel:
- "DAK - Beitragsrechnung Q1"
- "Finanzamt - Grundsteuerbescheid 2024"
- "Hogan Lovells - Gehaltsabrechnung Januar"
- "Vodafone - GigaTV Vertragsverlaengerung"
- "AOK - Mitgliedsbescheinigung"
- Bei unklarem Inhalt Fallback: "Information - {Sender-Name}" (z.B. "Information - Stadtwerke Muenchen")
- Der Titel wird im JSON-Feld "title" zurueckgegeben.


@@ -0,0 +1,187 @@
"""
Render SYSTEM_PROMPT.txt with the live correspondent list and push it to
the paperless-ai container's /app/data/.env on mDock.
The repo SYSTEM_PROMPT.txt is the template (with the placeholder
{{CORRESPONDENTS_LIST}}). This script:
1. Reads the current correspondents from the Paperless API.
2. Filters out names that must never appear as correspondent
(recipients of m's mail — see RECIPIENT_EXCLUDE).
3. Renders the prompt by substituting the placeholder.
4. Reads the live /app/data/.env from the paperless-ai container.
5. Replaces the SYSTEM_PROMPT=`…` block.
6. Backs up the old .env (.bak.<ts>) and writes the new one.
7. Restarts the paperless-ai container.
Dry-run is the default: prints the would-be rendered prompt without
writing.
Usage:
python3 push_system_prompt.py # dry run
python3 push_system_prompt.py --apply # write + restart
Migrated into m/mDMS from m/otto on 2026-05-16 (mDMS#3).
"""
import argparse
import datetime
import json
import os
import subprocess
import sys

PAPERLESS_HOST = "mdock"
PAPERLESS_AI_CONTAINER = "paperless-ai"
PAPERLESS_WEB_CONTAINER = "paperless-webserver-1"
ENV_PATH = "/app/data/.env"
HERE = os.path.dirname(os.path.abspath(__file__))
TEMPLATE_PATH = os.path.join(HERE, "SYSTEM_PROMPT.txt")
PLACEHOLDER = "{{CORRESPONDENTS_LIST}}"

# Names that are m or his household — recipients, never correspondents.
# Substring match, case-insensitive. Keep the actual correspondent records
# in Paperless (data integrity for historical doc assignments), but never
# show them to the LLM as candidate senders.
RECIPIENT_EXCLUDE = ("matthias siebels", "mathias siebels")


def get_token() -> str:
    """Read PAPERLESS_API_TOKEN from the live paperless-ai .env."""
    out = subprocess.run(
        ["ssh", PAPERLESS_HOST,
         f"docker exec {PAPERLESS_AI_CONTAINER} sh -c "
         f"'grep ^PAPERLESS_API_TOKEN {ENV_PATH} | cut -d= -f2'"],
        capture_output=True, text=True, timeout=15,
    )
    token = out.stdout.strip()
    # Fail early instead of sending an empty token to the Paperless API.
    if out.returncode != 0 or not token:
        raise RuntimeError(f"could not read PAPERLESS_API_TOKEN: {out.stderr}")
    return token


def fetch_correspondents(token: str) -> list[str]:
    cmd = (
        f"docker exec {PAPERLESS_WEB_CONTAINER} "
        f"curl -s -H 'Authorization: Token {token}' "
        f"'http://localhost:8000/api/correspondents/?page_size=500'"
    )
    out = subprocess.run(
        ["ssh", PAPERLESS_HOST, cmd],
        capture_output=True, text=True, timeout=30,
    )
    if out.returncode != 0:
        raise RuntimeError(f"fetch failed: {out.stderr}")
    data = json.loads(out.stdout)
    names = [c["name"] for c in data["results"]]
    filtered = [n for n in names
                if not any(x in n.lower() for x in RECIPIENT_EXCLUDE)]
    dropped = sorted(set(names) - set(filtered))
    if dropped:
        print(f"filtered out recipient-names: {dropped}")
    return sorted(filtered, key=lambda s: s.lower())


def render_prompt(template: str, names: list[str]) -> str:
    listing = "\n".join(f"- {n}" for n in names)
    return template.replace(PLACEHOLDER, listing)


def read_remote_env() -> str:
    out = subprocess.run(
        ["ssh", PAPERLESS_HOST,
         f"docker exec {PAPERLESS_AI_CONTAINER} cat {ENV_PATH}"],
        capture_output=True, text=True, timeout=15,
    )
    if out.returncode != 0:
        raise RuntimeError(f"cat failed: {out.stderr}")
    return out.stdout


def replace_system_prompt(env: str, new_prompt: str) -> str:
    """Replace the SYSTEM_PROMPT=`…` block with the new one.

    Paperless-AI's .env uses backtick-delimited values for multi-line
    settings (JS .env loader convention; bash would not accept this).
    """
    lines = env.splitlines(keepends=True)
    out = []
    inside = False
    replaced = False
    for line in lines:
        if not inside and line.startswith("SYSTEM_PROMPT="):
            out.append(f"SYSTEM_PROMPT=`{new_prompt.rstrip()}`\n")
            replaced = True
            stripped_value = line[len("SYSTEM_PROMPT="):].rstrip("\n")
            # Only an opening backtick with no closing one on the same line
            # starts a multi-line block. Unquoted or fully quoted single-line
            # values end right here, so the following lines are preserved.
            if stripped_value.startswith("`") and stripped_value.count("`") == 1:
                inside = True
            continue
        if inside:
            # Skip the old block up to and including its closing backtick.
            if "`" in line:
                inside = False
            continue
        out.append(line)
    if not replaced:
        raise SystemExit("SYSTEM_PROMPT= line not found in .env")
    return "".join(out)


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--apply", action="store_true",
                    help="Write new .env and restart paperless-ai")
    args = ap.parse_args()
    # The template contains umlauts, so read it as UTF-8 explicitly.
    with open(TEMPLATE_PATH, encoding="utf-8") as f:
        template = f.read()
    if PLACEHOLDER not in template:
        sys.exit(f"template missing placeholder {PLACEHOLDER}")
    token = get_token()
    names = fetch_correspondents(token)
    print(f"fetched {len(names)} live correspondents (after recipient filter)")
    rendered = render_prompt(template, names)
    print(f"rendered prompt: {len(rendered)} chars, {len(rendered.splitlines())} lines")
    env_before = read_remote_env()
    env_after = replace_system_prompt(env_before, rendered)
    if env_before == env_after:
        print("no change: live prompt already matches rendered template")
        return
    if not args.apply:
        print("--- new SYSTEM_PROMPT block ---")
        for line in env_after.splitlines():
            if line.startswith("SYSTEM_PROMPT="):
                # Show only the first 200 characters; mark truncation.
                print(line[:200] + ("..." if len(line) > 200 else ""))
        print()
        print("DRY RUN: re-run with --apply to write + restart paperless-ai")
        return
    # UTC timestamp for the backup suffix (utcnow() is deprecated since 3.12).
    ts = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%S")
    backup = f"{ENV_PATH}.bak.{ts}"
    subprocess.run(
        ["ssh", PAPERLESS_HOST,
         f"docker exec {PAPERLESS_AI_CONTAINER} cp {ENV_PATH} {backup}"],
        check=True, timeout=15,
    )
    print(f"backup: {backup}")
    write_cmd = (
        f"docker exec -i {PAPERLESS_AI_CONTAINER} "
        f"sh -c 'cat > {ENV_PATH}'"
    )
    proc = subprocess.run(
        ["ssh", PAPERLESS_HOST, write_cmd],
        input=env_after, capture_output=True, text=True, timeout=30,
    )
    if proc.returncode != 0:
        sys.exit(f"write failed: {proc.stderr}")
    print(f"wrote {len(env_after)} characters to {ENV_PATH}")
    subprocess.run(
        ["ssh", PAPERLESS_HOST, f"docker restart {PAPERLESS_AI_CONTAINER}"],
        check=True, timeout=60,
    )
    print(f"restarted {PAPERLESS_AI_CONTAINER}")


if __name__ == "__main__":
    main()