Commit Graph

7 Commits

Author SHA1 Message Date
mAi
70b94cc2e9 Merge mai/hermes/issue-3-paperless-ai: paperless-AI prompt fix + drift reconciliation (#3) 2026-05-16 18:38:00 +02:00
mAi
7ba5bb925c mAi: #3 - paperless-AI prompt: recipient rule (Empfaenger-Regel) + softened correspondent matching + drift reconciliation
Live SYSTEM_PROMPT on mDock had drifted heavily from the repo template
(detailed correspondent fuzzy-matching catalogue, full existing-names
list, refined title-generation rules). Reconciled by adopting the live
prompt as the new baseline in SYSTEM_PROMPT.txt and layering two fixes
on top:

1. Recipient rule (Rule 1): Matthias / Mathias Siebels and any
   address-block variant ("Herr Siebels", "Empfaengeradresse
   Windscheidstr. 33") must NEVER be set as correspondent, because m is
   the recipient of nearly every doc. Paul Siebels is also a recipient
   by default and is a correspondent only when he is demonstrably the
   author (his own letter, a damage report from Paul).

   Triggering misclassification (issue body): doc 280, a Vattenfall
   electricity supply contract, was tagged
   correspondent="Matthias Siebels" because the AI took the recipient
   address block for the sender.

2. Soften "Bevorzuge IMMER existierenden Correspondent" ("ALWAYS prefer
   an existing correspondent") so that it applies only when semantic
   similarity is clear. Genuinely new senders (utility, doctor, insurer,
   landlord, ...) get a new correspondent rather than being force-mapped
   to the nearest existing name. Fixes the Vattenfall -> Telekom drift
   on docs 283/284 (also addressed by head adding Vattenfall ID 257
   manually).

Also migrated push_system_prompt.py from m/otto into this repo so the
deploy mechanism (render template -> push to /app/data/.env -> restart
paperless-ai) lives next to the template. Added RECIPIENT_EXCLUDE
filter so Matthias/Mathias Siebels are stripped from the rendered
correspondents list — defense in depth on top of the prompt rule.
Paperless correspondent records (IDs 3, 255) are preserved for the
historical doc assignments that still reference them.
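
The RECIPIENT_EXCLUDE filter described above could look roughly like the
sketch below. It is not the code in push_system_prompt.py: the function
name, the set contents, and the whitespace/case normalisation are
assumptions made for illustration.

```python
# Hypothetical sketch; names and normalisation are assumptions, not the
# actual push_system_prompt.py implementation.
RECIPIENT_EXCLUDE = {"matthias siebels", "mathias siebels"}


def filter_correspondents(names):
    """Drop recipient names before the list is rendered into the prompt."""
    kept = []
    for name in names:
        # Compare case-insensitively and collapse stray whitespace.
        key = " ".join(name.split()).casefold()
        if key not in RECIPIENT_EXCLUDE:
            kept.append(name)
    return kept
```

Paul Siebels is deliberately not in the set, since the prompt rule still
allows him as correspondent when he is the author.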

Applied to live mDock paperless-ai (backup .env.bak.20260516T162255).
39 of 41 Siebels-correspondent doc assignments cleared, and their
paperless-AI sqlite tracker rows (processed_documents,
history_documents, openai_metrics) deleted so they reclassify on the
next scan. Two kept: doc 117 (power of attorney from Paul) and doc 130
(damage report filled in by Paul), both genuine Paul-as-author cases
per the new rule.
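
A tracker reset of this kind might be scripted as in the sketch below.
The three table names are taken from this message; the `document_id`
column name and the helper itself are assumptions and would need to be
checked against the real paperless-AI schema before use.

```python
import sqlite3

# Table names come from the commit message; the `document_id` column is an
# assumed name, not verified against the real paperless-AI database.
TRACKER_TABLES = ("processed_documents", "history_documents", "openai_metrics")


def clear_tracker_rows(conn, doc_ids, keep_ids=()):
    """Delete tracker rows for doc_ids except keep_ids; return rows removed."""
    keep = set(keep_ids)
    targets = [d for d in doc_ids if d not in keep]
    removed = 0
    with conn:  # one transaction: commit on success, roll back on error
        for table in TRACKER_TABLES:
            for doc_id in targets:
                cur = conn.execute(
                    f"DELETE FROM {table} WHERE document_id = ?", (doc_id,)
                )
                removed += cur.rowcount
    return removed
```

Passing `keep_ids=(117, 130)` would leave the two Paul-as-author documents
untouched, matching the outcome described above.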

Refs: m/mDMS#3
2026-05-16 18:27:19 +02:00
mAi
927a66bd66 Merge mai/hermes/issue-2-mover-strip: mover strips blank pages (#2) 2026-05-16 18:03:41 +02:00
mAi
90142396d8 mAi: #2 - mdms-mover: strip blank pages from duplex scans
Two changes:

1. Migrate mover from m/otto (commit 9974937, otto#438) into this repo
   at infra/mdms-mover/. mover.sh, mdms-mover.service, mdms-mover.timer,
   README.md. Matches the live deployment on mDock byte-for-byte (modulo
   the strip step below).

2. Add blank-page stripping before the inbox → toprocess promotion. A
   page is dropped iff its embedded text is empty AND the share of
   near-white pixels in its rendered thumbnail is >= MDMS_BLANK_THRESHOLD
   (default 0.97 per issue #2). Detects the empty backside of patch-T
   separator sheets in duplex scans (mDMS#2).
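
The drop rule can be sketched as a pure function over a page's text and
its grayscale thumbnail samples. This is an illustration, not the code in
strip_blank_pages.py: the NEAR_WHITE cutoff of 245 and both function names
are assumptions; only the 0.97 default comes from this message.

```python
BLANK_THRESHOLD = 0.97  # default stated in the commit message
NEAR_WHITE = 245        # assumed 8-bit grayscale cutoff for "near-white"


def near_white_ratio(gray_pixels):
    """Fraction of 8-bit grayscale samples at or above NEAR_WHITE."""
    if not gray_pixels:
        return 0.0
    white = sum(1 for p in gray_pixels if p >= NEAR_WHITE)
    return white / len(gray_pixels)


def is_blank(text, gray_pixels, threshold=BLANK_THRESHOLD):
    """A page is blank iff it has no embedded text AND is almost all white."""
    return text.strip() == "" and near_white_ratio(gray_pixels) >= threshold
```

With PyMuPDF, the two inputs would plausibly come from `page.get_text()`
and the `samples` bytes of `page.get_pixmap(colorspace=fitz.csGRAY)`.
Requiring both conditions keeps image-only pages with real content and
text-only pages with a white background from being dropped.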

strip_blank_pages.py uses PyMuPDF as the only Python dep — single
self-contained wheel, no `poppler-utils` apt-install on mdock. Mirrors
the uv-inline-deps single-file pattern of infra/paperless/generate_separator.py.

Edge cases:
- 1-page input: strip skipped entirely.
- All pages would drop: script exits 2, mover keeps file in inbox and
  logs WARNING (no empty doc reaches Paperless).
- Strip script errors: mover falls back to plain mv, no scan blocked.
- MDMS_STRIP_BLANK=false: bypass strip entirely (emergency disable).
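
The first two edge cases can be expressed as a small planning step, shown
here as a sketch. The function name and return shape are hypothetical;
only exit code 2 for the all-blank case is taken from the list above.

```python
EXIT_OK = 0
EXIT_ALL_BLANK = 2  # exit code named in the commit message


def plan_strip(blank_flags):
    """Return (pages_to_keep, exit_code) for a list of per-page blank flags."""
    all_pages = list(range(len(blank_flags)))
    if len(blank_flags) <= 1:
        # Single-page input: stripping is skipped entirely.
        return all_pages, EXIT_OK
    keep = [i for i, blank in enumerate(blank_flags) if not blank]
    if not keep:
        # Every page would be dropped: change nothing and signal the caller.
        return all_pages, EXIT_ALL_BLANK
    return keep, EXIT_OK
```

For the synthetic 4-page case (2 real + 1 blank + 1 real),
`plan_strip([False, False, True, False])` keeps pages 0, 1 and 3.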

Deploy: rsync uv binary to mdock ~/.local/bin/uv (single static binary,
user-space, no apt), scp script + units, systemctl --user daemon-reload.
Verified live with synthetic 4-page (2 real + 1 blank + 1 real → 3
pages), 1-page (unchanged), all-blank (kept in inbox + warning) test
PDFs. Timer fires every ~70s as before.
2026-05-16 17:57:26 +02:00
mAi
862bc76a2b Merge mai/hermes/issue-1-scan-stack-multi: Paperless barcode-splitter (#1) 2026-05-16 15:56:30 +02:00
mAi
061ea424ad mAi: #1 - Paperless-ngx barcode splitter enabled (Patch-T)
PAPERLESS_CONSUMER_ENABLE_BARCODES=true + DELETE_PAGES=true are live on
mDock and were committed in parallel to the m/paperless
docker-compose.yml (source of truth); see m/paperless commit 8c1ca3f.

New:
- infra/paperless/generate_separator.py: Code-128 PATCHT generator (uv inline deps)
- infra/paperless/separator-patchT.pdf: printable separator page
- docs/strategy.md: new section "Multi-page scan + automatic splitting"

Test 2026-05-16: a stack of 3 fake letters (2 + 1 + 1 pages) with
PATCHT separators in between → 3 separate Paperless documents with
correct page counts, separator pages discarded. Test documents deleted
again afterwards.
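
The expected outcome of that test can be modelled with a few lines of
Python. This sketch only illustrates the splitting behaviour; it is not
how Paperless-ngx implements it, and the page labels and function name
are made up for the example.

```python
SEPARATOR = "PATCHT"  # barcode value named in the commit message


def split_stack(pages):
    """Split scanned pages into documents at separator pages, dropping them."""
    documents, current = [], []
    for page in pages:
        if page == SEPARATOR:
            # A separator closes the current document and is itself discarded.
            if current:
                documents.append(current)
            current = []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents
```

A stack such as `["a1", "a2", "PATCHT", "b1", "PATCHT", "c1"]` yields three
documents with 2, 1 and 1 pages.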

Closes: nothing (m closes issues personally; label "done" via API)
2026-05-16 15:52:53 +02:00
m
2aa532e717 chore: initial commit — spinout from m/otto
Spun out mDMS strategy + tooling from m/otto into its own repo on 2026-05-15.

Migrated:
- docs/strategy.md (was: m/otto:docs/mdms-strategy.md)
- infra/paperless/ (config + audit + migrate scripts)
- infra/samba-canon/ (Canon MB5100 SMB1 bridge container)

History in m/otto: issues #429–#438. Going forward, all mDMS issues
file here. Sibling m/paperless (separate repo) remains the bare
Docker Compose for Paperless-ngx itself.
2026-05-15 17:31:20 +02:00