mAi 7ba5bb925c mAi: #3 - paperless-AI prompt: Empfaenger-Regel + softened correspondent matching + drift reconciliation
Live SYSTEM_PROMPT on mDock had drifted heavily from the repo template
(detailed correspondent fuzzy-matching catalogue, full existing-names
list, refined title-generation rules). Reconciled by adopting the live
prompt as the new baseline in SYSTEM_PROMPT.txt and layering two fixes
on top:

1. Recipient rule (Rule 1): Matthias / Mathias Siebels and any
   address-block variant ("Herr Siebels", "Empfaengeradresse
   Windscheidstr. 33") must NEVER be set as correspondent, because m is
   the recipient of nearly every doc. Paul Siebels is also a recipient
   by default and is a correspondent only when he is demonstrably the
   author (his own letter, a damage report filed by Paul).

   Triggering misclassification (issue body): doc 280 (Vattenfall
   Stromliefervertrag, an electricity supply contract) was tagged
   correspondent="Matthias Siebels" because the AI picked the recipient
   address block as the sender.

2. Soften "Bevorzuge IMMER existierenden Correspondent" ("ALWAYS prefer
   an existing correspondent") -> apply only when semantic similarity is
   clear. Genuinely new senders (utility, doctor, insurer, landlord, ...)
   get a new correspondent rather than being force-mapped to the nearest
   existing name. Fixes the Vattenfall -> Telekom drift on docs 283/284
   (also addressed by head adding Vattenfall ID 257 manually).

Also migrated push_system_prompt.py from m/otto into this repo so the
deploy mechanism (render template -> push to /app/data/.env -> restart
paperless-ai) lives next to the template. Added RECIPIENT_EXCLUDE
filter so Matthias/Mathias Siebels are stripped from the rendered
correspondents list — defense in depth on top of the prompt rule.
Paperless correspondent records (IDs 3, 255) are preserved for the
historical doc assignments that still reference them.
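
A minimal sketch of how such an exclusion filter could work. The regex,
the function names, and the {{CORRESPONDENTS}} placeholder are all
assumptions for illustration; the actual RECIPIENT_EXCLUDE in
push_system_prompt.py may be defined differently.

```python
import re

# Hypothetical pattern: matches "Matthias Siebels" and "Mathias Siebels"
# (case-insensitive). The real RECIPIENT_EXCLUDE may look different.
RECIPIENT_EXCLUDE = re.compile(r"^\s*matt?hias\s+siebels\s*$", re.IGNORECASE)


def filter_correspondents(names):
    """Return the names with recipient entries removed, order preserved."""
    return [name for name in names if not RECIPIENT_EXCLUDE.match(name)]


def render_prompt(template, names):
    """Fill an assumed {{CORRESPONDENTS}} placeholder with the filtered list."""
    listing = "\n".join(f"- {name}" for name in filter_correspondents(names))
    return template.replace("{{CORRESPONDENTS}}", listing)
```

Paul Siebels is deliberately not matched here, since the rule above
still allows him as correspondent when he is the author.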

Applied to live mDock paperless-ai (backup .env.bak.20260516T162255).
39 of 41 Siebels-correspondent doc assignments cleared + their
paperless-AI sqlite tracker rows (processed_documents,
history_documents, openai_metrics) deleted so they reclassify on the
next scan. Two kept: doc 117, a power of attorney (Vollmacht) from
Paul, and doc 130, a damage report (Schadensmeldung) filled in by Paul.
Both are genuine Paul-as-author cases per the new rule.
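
The tracker cleanup could be sketched roughly as below. The three table
names come from this note; the document_id column name and the function
itself are assumptions, not the schema or tooling actually used.

```python
import sqlite3

# Table names as listed in the commit note.
TRACKER_TABLES = ("processed_documents", "history_documents", "openai_metrics")


def clear_tracker_rows(conn, doc_ids, keep=(117, 130), id_column="document_id"):
    """Delete tracker rows for doc_ids, skipping the ids in `keep`.

    `id_column` is an assumed column name. Returns the total number of
    rows deleted across all tracker tables.
    """
    kept = set(keep)
    targets = [doc_id for doc_id in doc_ids if doc_id not in kept]
    if not targets:
        return 0
    placeholders = ",".join("?" for _ in targets)
    deleted = 0
    with conn:  # commit on success, roll back on error
        for table in TRACKER_TABLES:
            cursor = conn.execute(
                f"DELETE FROM {table} WHERE {id_column} IN ({placeholders})",
                targets,
            )
            deleted += cursor.rowcount
    return deleted
```

Running all three deletes in one transaction avoids a half-cleared
tracker if one of the tables is missing or locked.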

Refs: m/mDMS#3
2026-05-16 18:27:19 +02:00

mDMS

m's document management — Paperless-ngx + AI-classification pipeline, Canon scanner SMB bridge, strategy + tooling.

Spun out from m/otto on 2026-05-15; issues #429–#438 in m/otto are the provenance trail. Going forward, all mDMS work lives here.

Layout

mDMS/
├── docs/
│   └── strategy.md          # Taxonomy, layout, conventions (the bible)
├── infra/
│   ├── paperless/           # Paperless-AI config: SYSTEM_PROMPT, audit scripts,
│   │                        # migrate_types.py, deploy docker-compose
│   └── samba-canon/         # SMB1 bridge container for Canon MB5100 scanner
│                            # (host-network + nmbd, SMB1+NTLMv1 for old printer)
└── README.md

Components

Paperless-ngx (deployment)

Compose lives in m/paperless (separate repo). That repo is the deployment artifact: ~/paperless/ on mDock is its checkout. This repo (m/mDMS) tracks the AI classification layer that sits on top of Paperless-ngx (infra/paperless/SYSTEM_PROMPT.txt, the type/tag/correspondent migration scripts, the audit pipeline).

Paperless-AI

Runs on mdock:3077 in front of Paperless-ngx (mdock:8777). Classifies each ingested document into one of the 10 canonical types and ≤2 of the 13 canonical tags. The system prompt + the migration scripts in infra/paperless/ are the source of truth — keep this repo and the live Paperless-AI aidata/.env in sync.
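
The type and tag limits above lend themselves to a small sanity check on each classification result. The following is only a sketch: the function is hypothetical, and the allowed type and tag sets are passed in as parameters because the canonical names live in docs/strategy.md and are not reproduced here.

```python
def validate_classification(doc_type, tags, allowed_types, allowed_tags, max_tags=2):
    """Return a list of problems with one classification result.

    An empty list means the result respects the constraints: a known
    type, at most `max_tags` tags, and only known tags.
    """
    errors = []
    if doc_type not in allowed_types:
        errors.append(f"unknown type: {doc_type!r}")
    if len(tags) > max_tags:
        errors.append(f"too many tags: {len(tags)} > {max_tags}")
    for tag in tags:
        if tag not in allowed_tags:
            errors.append(f"unknown tag: {tag!r}")
    return errors
```

A check like this could run in the audit pipeline to flag documents whose classification drifted outside the canonical taxonomy.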

Canon SMB bridge

infra/samba-canon/ is the host-network Samba 4.10 container on mDock that the Canon MB5100 scans to. Files land in /mnt/mdms/inbox/ (NFS from mTrueNAS) and Paperless polls every 60s. The two-stage inbox (staging dir + age-gated mover) lives separately under ~/mdms-mover/ on mDock — see m/otto issue #438.
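
An age-gated mover of the kind mentioned above can be sketched as follows. This is not the code under ~/mdms-mover/; the function name and the 120-second threshold are assumptions, chosen only to show the idea of moving a scan once it has stopped changing.

```python
import shutil
import time
from pathlib import Path


def move_settled_files(inbox, staging, min_age_seconds=120, now=None):
    """Move files from inbox to staging once they are old enough.

    A file counts as settled when its mtime is at least
    `min_age_seconds` in the past, which suggests the scanner has
    finished writing it. Returns the names of the moved files.
    """
    now = time.time() if now is None else now
    inbox, staging = Path(inbox), Path(staging)
    staging.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(inbox.iterdir()):
        if not path.is_file():
            continue
        if now - path.stat().st_mtime < min_age_seconds:
            continue  # still too fresh, possibly mid-upload
        target = staging / path.name
        if target.exists():
            continue  # never overwrite a file that is already staged
        shutil.move(str(path), str(target))
        moved.append(target.name)
    return moved
```

Gating on modification time rather than on file size keeps the mover stateless: each run only needs the directory listing and the clock.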

Data

NFS-mounted from mTrueNAS: /mnt/mPool/mdms/ → /mnt/mdms/ on all consumers. Layout:

/mnt/mPool/mdms/
├── inbox/         # SMB scanner target (Canon writes here)
├── toprocess/     # Age-gated staging → Paperless consumes here
├── paperless/     # Paperless storage (post-ingest)
├── archive/       # Long-term archive
├── templates/     # Document templates
└── export/        # Manual exports

Reference

  • docs/strategy.md — full strategy, taxonomy decisions, type/tag rationale
  • m/otto issues #429–#438 — original implementation history
  • m/paperless — the bare Paperless-ngx Docker Compose setup