chore: initial commit — spinout from m/otto

Spun out mDMS strategy + tooling from m/otto into its own repo on 2026-05-15.

Migrated:
- docs/strategy.md (was: m/otto:docs/mdms-strategy.md)
- infra/paperless/ (config + audit + migrate scripts)
- infra/samba-canon/ (Canon MB5100 SMB1 bridge container)

History in m/otto: issues #429–#438. Going forward, all mDMS issues
file here. Sibling m/paperless (separate repo) remains the bare
Docker Compose for Paperless-ngx itself.
m
2026-05-15 17:31:20 +02:00
commit 2aa532e717
15 changed files with 3131 additions and 0 deletions

36
CLAUDE.md Normal file

@@ -0,0 +1,36 @@
# mDMS
Document-management strategy + tooling: Paperless-ngx + Paperless-AI + Canon SMB bridge.
**Memory group_id:** `mdms` (new — formerly `otto` for these issues)
**Project type:** infrastructure + AI-classification pipeline. No web frontend, no application server. Deploys live on mDock; data on mTrueNAS.
## Spinout context
Migrated out of `m/otto` on 2026-05-15. Strategy doc + paperless-AI tooling + samba-canon bridge moved here. The original implementation history is in `m/otto` issues #429–#438. Going forward, file all mDMS issues here.
## Layout
- `docs/strategy.md` — the bible. Taxonomy (10 types, 13 tags), filename conventions, OCR-pipeline decisions. Read first.
- `infra/paperless/` — AI-classification layer config: `SYSTEM_PROMPT.txt`, audit log, `migrate_types.py`.
- `infra/samba-canon/` — host-network Samba 4.10 SMB1 bridge for Canon MB5100.
## Sibling repo
`m/paperless` — separate, bare Docker Compose for Paperless-ngx itself. `~/paperless/` on mDock is its checkout. Keep that for deployment; this repo is for *strategy* + *AI/classification* + *Canon bridge*.
## Live deployment touchpoints
- `mdock:8777` — Paperless-ngx (managed via `~/paperless/`, i.e. `m/paperless` repo)
- `mdock:3077` — Paperless-AI (config in this repo: `infra/paperless/`)
- mDock `~/samba-canon/` — Canon SMB bridge (source in this repo: `infra/samba-canon/`)
- mDock `~/mdms-mover/` — Age-gated inbox mover (source still in `m/otto` per issue #438, to be migrated in)
When code in this repo and the live deployment drift, fix in the repo first, then deploy.
## Conventions
- Audit JSON: `infra/paperless/<topic>_<isotimestamp>.json` — keep them in-repo as historical record (migrate_types_audit_*.json etc.)
- Issues filed here, not in `m/otto`.
- Per global CLAUDE.md: Always `--netrc-file ~/.netrc-mai` for Gitea API as mAi.
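
The audit naming convention above could be produced by a helper along these lines. This is a minimal sketch, not code from `migrate_types.py`: the function name `audit_path`, its parameters, and the compact `YYYYMMDDTHHMMSS` stamp are assumptions chosen for illustration. It uses a timezone-aware UTC timestamp rather than the deprecated `datetime.utcnow()`.

```python
import datetime
from pathlib import Path
from typing import Optional


def audit_path(topic: str,
               base: str = "infra/paperless",
               now: Optional[datetime.datetime] = None) -> Path:
    """Build <base>/<topic>_<timestamp>.json (hypothetical helper)."""
    if now is None:
        # Timezone-aware replacement for datetime.utcnow()
        now = datetime.datetime.now(datetime.timezone.utc)
    # Normalize to UTC so stamps from different hosts sort consistently
    stamp = now.astimezone(datetime.timezone.utc).strftime("%Y%m%dT%H%M%S")
    return Path(base) / f"{topic}_{stamp}.json"
```

Passing `now` explicitly keeps the helper deterministic in tests; in normal use it can be omitted.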

67
README.md Normal file

@@ -0,0 +1,67 @@
# mDMS
m's document management — Paperless-ngx + AI-classification pipeline, Canon scanner SMB bridge, strategy + tooling.
Spun out from `m/otto` on 2026-05-15 — issues #429–#438 in `m/otto` are the
provenance trail. Going forward, all mDMS work lives here.
## Layout
```
mDMS/
├── docs/
│ └── strategy.md # Taxonomy, layout, conventions (the bible)
├── infra/
│ ├── paperless/ # Paperless-AI config: SYSTEM_PROMPT, audit scripts,
│ │ # migrate_types.py, deploy docker-compose
│ └── samba-canon/ # SMB1 bridge container for Canon MB5100 scanner
│ # (host-network + nmbd, SMB1+NTLMv1 for old printer)
└── README.md
```
## Components
### Paperless-ngx (deployment)
Compose lives in **`m/paperless`** (separate repo). That repo is the
deployment artifact — `~/paperless/` on mDock is its checkout. This repo
(`m/mDMS`) tracks the *AI classification* layer that sits on top of
Paperless-ngx (`infra/paperless/SYSTEM_PROMPT.txt`, the type/tag/
correspondent migration scripts, the audit pipeline).
### Paperless-AI
Runs on `mdock:3077` in front of Paperless-ngx (`mdock:8777`). Classifies
each ingested document into one of the 10 canonical types and ≤2 of the
13 canonical tags. The system prompt + the migration scripts in
`infra/paperless/` are the source of truth — keep this repo and the
live Paperless-AI `aidata/.env` in sync.
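
As an illustration of the constraint described above (one canonical type, at most two canonical tags), a post-classification guard might look like the following sketch. It is not how Paperless-AI or the patches in this repo enforce the restriction. The function name `sanitize_classification` is hypothetical, the type list is copied from `SYSTEM_PROMPT.txt`, and the tag set is only an illustrative subset.

```python
from typing import List, Tuple

# Types copied from SYSTEM_PROMPT.txt
ALLOWED_TYPES = {
    "Invoice", "Contract", "Information", "Personal Correspondence",
    "Vollmacht", "Urkunde", "Steuerbescheid", "Anleitung",
    "Protokoll", "Formular",
}
# Illustrative subset of the canonical tags (assumption, not the full list)
ALLOWED_TAGS = {"Steuer", "Frist", "Gesundheit", "Versicherung", "offen", "wichtig"}
FALLBACK_TYPE = "Information"  # the prompt says: when in doubt, choose Information
MAX_TAGS = 2


def sanitize_classification(doc_type: str, tags: List[str]) -> Tuple[str, List[str]]:
    """Clamp a model answer to the allowed vocabulary (hypothetical helper)."""
    safe_type = doc_type if doc_type in ALLOWED_TYPES else FALLBACK_TYPE
    safe_tags: List[str] = []
    for tag in tags:
        # Drop unknown tags and duplicates, keep the original order
        if tag in ALLOWED_TAGS and tag not in safe_tags:
            safe_tags.append(tag)
    return safe_type, safe_tags[:MAX_TAGS]
```

For example, `sanitize_classification("Rechnung", ["Steuer", "Shimano", "Frist", "offen"])` falls back to the type `Information` and keeps only the first two allowed tags.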
### Canon SMB bridge
`infra/samba-canon/` is the host-network Samba 4.10 container on mDock
that the Canon MB5100 scans to. Files land in `/mnt/mdms/inbox/` (NFS
from mTrueNAS) and Paperless polls every 60s. The two-stage inbox
(staging dir + age-gated mover) lives separately under `~/mdms-mover/`
on mDock — see `m/otto` issue #438.
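
The idea behind an age-gated mover can be sketched as follows. The real mover lives in `m/otto` and is not shown in this commit, so everything here is an assumption: the function name `move_settled_files`, the use of file modification time as the age signal, and the threshold passed by the caller. The point is only that a file is handed on once it has stopped changing for a while, so Paperless never picks up a half-written scan.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def move_settled_files(src: Path, dst: Path, min_age_seconds: float,
                       now: Optional[float] = None) -> List[Path]:
    """Move regular files from src to dst once their mtime is old enough."""
    now = time.time() if now is None else now
    dst.mkdir(parents=True, exist_ok=True)
    moved: List[Path] = []
    for entry in sorted(src.iterdir()):
        if not entry.is_file():
            continue  # skip subdirectories and other non-files
        age = now - entry.stat().st_mtime
        if age >= min_age_seconds:
            target = dst / entry.name
            shutil.move(str(entry), str(target))
            moved.append(target)
    return moved
```

A call such as `move_settled_files(Path("/mnt/mdms/inbox"), Path("/mnt/mdms/toprocess"), 120)` would move only files untouched for two minutes; the 120-second value is a placeholder, not the deployed setting.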
## Data
NFS-mounted from mTrueNAS: `/mnt/mPool/mdms/` → `/mnt/mdms/` on all
consumers. Layout:
```
/mnt/mPool/mdms/
├── inbox/ # SMB scanner target (Canon writes here)
├── toprocess/ # Age-gated staging → Paperless consumes here
├── paperless/ # Paperless storage (post-ingest)
├── archive/ # Long-term archive
├── templates/ # Document templates
└── export/ # Manual exports
```
## Reference
- `docs/strategy.md` — full strategy, taxonomy decisions, type/tag rationale
- `m/otto` issues #429–#438 — original implementation history
- `m/paperless` — the bare Paperless-ngx Docker Compose setup

288
docs/strategy.md Normal file

@@ -0,0 +1,288 @@
# mDMS: Document Management Strategy
## Current state (after cleanup 2026-04-06)
### Paperless-ngx (mDock)
- **129 documents** (PDFs), storage path active
- **41 correspondents**: cleaned up (OCR duplicates merged, junk removed)
- **13 document types**: Rechnung, Vertrag, Bescheid, Bescheinigung, Brief, Mitteilung, Abrechnung, Protokoll, Urkunde, Vollmacht, Gutachten, Angebot, Medizinisch
- **16 tags**, hierarchical: category (Steuer, Versicherung, Gesundheit, Wohnung, Arbeit, Finanzen, Erbschaft, Gewährleistung, Anleitung) + status (offen, wichtig, Frist) + context (Windscheid33, Paul)
- **1 storage path**: `{created_year}/{document_type}/{created} - {correspondent} - {title}`
- Files are structured, e.g. `2024/Rechnung/2024-03-15 - DAK - Beitragsrechnung.pdf`
- API user: `mAi`
- Docker Compose: `~/paperless/` on mDock, NFS mount `/mnt/paperless` from TrueNAS (`mPool/paperless`)
### What was cleaned up
- 68 WebP preview documents deleted (no originals, only poor preview images)
- 51 → 41 correspondents (OCR duplicates merged: Hogan Lovells, Matthias Siebels, Ammerländer, Schubert, eprimo, Helios, Paul Siebels, Versorgungswerk etc.)
- 39 → 13 document types (merge mapping applied)
- 172 → 16 tags (noise deleted, category mapping carried out before deletion)
- 13 broken SynoResource files deleted from consume
- 126 orphaned flat PDFs deleted from originals
- 43 document titles cleaned up (numbers → descriptive titles)
- 5 birthday date errors corrected (1987-02-22 → correct document dates)
### mDocs (Gitea repo m/mDocs): MIGRATION PENDING
- **72 files**, 60 MB (tax, insurance, Windscheid33)
- To be migrated into the Paperless inbox; delete the repo afterwards
### TrueNAS (mtruenas)
- Dataset `mPool/paperless` already exists
- NFS export to mDock (192.168.178.0/24)
- SMB share `mStash` as the reference for the mdms share
---
## Decisions
### 1. Storage path format
**Format: `{created_year}/{document_type}/{created} - {correspondent} - {title}.pdf`** ✓ Confirmed
Examples:
```
2024/Rechnung/2024-03-15 - DAK - Beitragsrechnung Q1.pdf
2024/Bescheid/2024-01-20 - Finanzamt - Grundsteuerbescheid.pdf
2023/Vertrag/2023-06-01 - Vodafone - GigaTV Vertragsverlängerung.pdf
2025/Abrechnung/2025-01-31 - Hogan Lovells - Gehaltsabrechnung Januar.pdf
```
**Why this format:**
- **Year as top level**: chronological browsing; a whole year can be copied for the tax advisor
- **Type as second level**: "show me all invoices from 2024" = `2024/Rechnung/`
- **Date + correspondent + title in the filename**: sortable, searchable, rich in context
- **At most 2 folder levels**: not too deep, Finder/Explorer-friendly
- **Navigable without Paperless**: a plain file browser is enough
**Rejected alternatives:**
- `{correspondent}/{year}/...`: too many sparse folders, poor for navigating by time
- `{year}-{month}/...`: too granular, monthly folders often holding only 1-2 documents
- Flat: `{created}-{correspondent}-{title}.pdf`: unusable at 500+ documents
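
To make the resulting layout concrete, the following sketch renders a path in the confirmed format from plain metadata. Paperless-ngx performs its own template rendering and filename sanitizing, so this is only an illustration of the output shape. The names `storage_path` and `_clean`, and the choice to replace `/` with `-`, are assumptions.

```python
import datetime


def _clean(part: str) -> str:
    """Keep a component from introducing an extra folder level (assumption)."""
    return part.replace("/", "-").strip()


def storage_path(created: datetime.date, document_type: str,
                 correspondent: str, title: str) -> str:
    """Render {created_year}/{document_type}/{created} - {correspondent} - {title}.pdf."""
    filename = f"{created.isoformat()} - {_clean(correspondent)} - {_clean(title)}.pdf"
    return f"{created.year}/{_clean(document_type)}/{filename}"
```

Because the date is ISO-formatted and comes first in the filename, a plain alphabetical sort inside each `year/type` folder is also a chronological sort.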
### 2. Dataset structure: mPool/mdms
```
/mnt/mPool/mdms/
├── paperless/ # Paperless storage (originals, archive, thumbnails)
│ ├── documents/
│ │ ├── originals/ # Original files
│ │ └── archive/ # OCR versions
│ └── ...
├── inbox/ # Paperless consume: auto-import
│ # mScan app, drag and drop, SFTP
├── templates/ # Contract templates, forms, samples
│ # Not in Paperless: static reference documents
├── archive/ # Documents that do not fit into Paperless:
│ # large files (CAD, plans), collections, binaries
└── export/ # Paperless exports, backups, snapshots
```
### 3. Document routing
| Document | Destination | Rationale |
|----------|------|------------|
| Invoices, official notices, letters | Paperless (inbox/) | OCR + AI classification + search |
| Contracts, deeds | Paperless | Long-term archive with full-text search |
| Tax documents | Paperless + tag "Steuer" | Filterable for tax-advisor export |
| Payslips | Paperless + tag "Arbeit" | Retrievable chronologically |
| Doctors' letters, medical findings | Paperless + tag "Gesundheit" | Searchable, dated |
| Phone scans (mScan) | inbox/ → Paperless auto-import | Scan and done |
| Contract templates, forms | templates/ | No OCR needed, static reference |
| Building plans, CAD, large files | archive/ | Too large or specialized for Paperless |
| Photos of documents | Paperless | OCR also works on photos |
**Not in mDMS:**
- Photos in general → Immich
- Books, eBooks → Calibre (mCalibre)
- Employment-law documents (HL) → mWork vault (Obsidian, not mDMS)
### 4. Paperless taxonomy: cleanup
#### Document types (39 → 15)
Reduced to meaningful, stable categories:
| Keep | Merge from |
|----------|-------------------|
| **Rechnung** | Rechnung, Invoice, Beitragsrechnung |
| **Vertrag** | (new: for contracts and renewals) |
| **Bescheid** | Bescheid, Beitragsbescheid, Versicherungsbescheid |
| **Bescheinigung** | Bescheinigung, Lohnsteuerbescheinigung, Spendenbescheinigung |
| **Brief** | Brief, Anschreiben, Korrespondenz |
| **Mitteilung** | Mitteilung, Benachrichtigung, Information, Erinnerung |
| **Abrechnung** | Abrechnung, Entgeltabrechnung |
| **Protokoll** | Protokoll |
| **Urkunde** | Urkunde |
| **Vollmacht** | Vollmacht |
| **Gutachten** | Gutachten, Befund |
| **Angebot** | Angebot |
| **Energieausweis** | Energieausweis |
| **Schadenmeldung** | Schadenmeldung, Schadenanzeige |
| **Medizinisch** | Arbeitsunfähigkeitsbescheinigung, Aufklärungsbogen |
Remove: Empfehlung, Preisanpassungsschreiben, Kündigungsbestätigung, Eintragungsbekanntmachung, Auftragsbestätigung, Testament (→ tag), Einladung (→ Brief)
#### Tags (172 → ~25 manually curated)
Most auto-tags are noise. Goal: a few stable category tags plus manual upkeep.
**Category tags (mandatory, one per document):**
| Tag | For |
|-----|-----|
| `Steuer` | Anything tax-relevant |
| `Versicherung` | Policies, claims, premiums |
| `Gesundheit` | Doctor, hospital, health insurer |
| `Wohnung` | Rent, ownership, service charges, WEG (homeowners' association) |
| `Arbeit` | Salary, employer, professional chamber |
| `Finanzen` | Bank, credit, retirement provision |
| `Erbschaft` | Will, estate matters |
| `Gewährleistung` | Purchase receipts with warranty, complaints |
| `Anleitung` | Operating instructions, manuals, data sheets |
**Action tags (optional):**
| Tag | Meaning |
|-----|-----------|
| `wichtig` | Subject to retention requirements, key document |
| `Frist` | Has a deadline: check regularly |
| `offen` | Action still required |
**Context tags (sparingly, as needed):**
| Tag | For |
|-----|-----|
| `Windscheid33` | Property at Windscheidstr. 33 |
| `Paul` | Documents concerning Paul Siebels |
**Delete:**
- Year tags ("2022", "2025"): redundant with the created date
- Person tags ("Matthias Siebels"): belong as a correspondent
- Ultra-granular tags ("Finger", "Hand", "Shimano", "Oral-B"): no value
- Duplicates ("Rechtsanwalt" + "Rechtsanwälte" + "Rechtsanwaltschaft")
#### Correspondents (51 → ~30)
Merge OCR duplicates:
- "Hogan Lovells International LLP" + "Hogan Lovells lnternational LLP" → **Hogan Lovells**
- "HELIÜS Klinikurn Duisburg" + "Helios Klinikum Duisburg" → **Helios Klinikum Duisburg**
- "Herr Matthias Siebels" + "Herrn Matthias Siebels" + "Matthias Siebels" + "Herrn Rechtsanwalt Matthias Siebels" → **Matthias Siebels** (own documents)
- "Ammerländer Versicherung VVaG" + "Ammerländer Versicherung WaG" → **Ammerländer Versicherung**
- "SCHUBERT GmbH" + "Schubert GmbH Haus- und Grundbesitzverwaltung" → **Schubert Hausverwaltung**
- "Dr. figegeberH lcankenkas*" → identify or delete (OCR garbage)
- "Dr/Heikö Gemmel" → **Dr. Heiko Gemmel**
- "eprimo CmbH" + "eprimo GmbH" → **eprimo**
- "lndula Shopsystem GmbH" → **Indula Shopsystem**
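
Candidate OCR duplicates like the ones listed above can be surfaced with a simple string-similarity pass. The document does not say how the actual merge list was produced, so this is only a sketch of one possible approach using the standard library's `difflib`. The function names and the 0.85 threshold are assumptions, and any hit still needs a human decision before merging.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_near_duplicates(names: List[str],
                         threshold: float = 0.85) -> List[Tuple[str, str]]:
    """Return pairs of names whose similarity reaches the threshold."""
    pairs: List[Tuple[str, str]] = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if similarity(first, second) >= threshold:
                pairs.append((first, second))
    return pairs
```

A pass like this catches single-character OCR slips such as `lnternational` or `CmbH`, but not pairs where one name is a much longer variant of the other (for example the two Schubert entries), which is why the list above was curated by hand.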
### 5. SMB share
**Yes: mdms as an SMB share like mStash.**
Configuration on TrueNAS:
- Share name: `mdms`
- Dataset: `mPool/mdms`
- User: `m` (as for mStash)
- Mount on mBreeze/mPebble: `~/mDMS` (LaunchAgent, analogous to mStash)
Uses:
- `~/mDMS/inbox/` for drag-and-drop import (Paperless consumes automatically)
- `~/mDMS/templates/` for quick access to templates
- `~/mDMS/paperless/documents/originals/` for file-browser navigation (via storage path)
- `~/mDMS/archive/` for large files
### 6. Confidential documents
**No separate encryption system needed.** ✓ Confirmed
- Everything runs on the home server (mforge/mtruenas), reachable only via Tailscale
- SMB with user auth plus the Paperless login are sufficient
### 7. Obsidian integration
The storage path should be usable as part of an Obsidian vault. That means:
- Mount `mdms/paperless/documents/originals/` (or `archive/`) via SMB as a vault folder
- Obsidian shows the folder structure (`2024/Rechnung/...`) directly in its file browser
- PDFs can be viewed inline and linked in Obsidian (`![[2024-03-15 - DAK - Beitragsrechnung.pdf]]`)
- No special characters in filenames that break Obsidian links (spaces are fine)
**Implementation:**
- Option A: symlink `~/m2/mDMS/` → `~/mDMS/paperless/documents/originals/` in the Obsidian vault
- Option B: separate mini vault just for documents
- Option C: mdms as a subfolder in the main vault (m2)
Recommendation: **Option A (symlink)**. No data overhead, the vault stays lean, and documents are still linkable. It needs only one symlink per machine.
### 8. Email inbox
**docs@msbls.de**: alias for mail@msbls.de (Hostinger).
Paperless polls mail@msbls.de via IMAP and consumes attachments from emails sent to docs@msbls.de:
- IMAP: `imap.hostinger.com:993` (SSL/TLS), user: `mail@msbls.de`
- Mail rule: filter `To: docs@msbls.de`, attachments only, action: mark as read
- Correspondent is taken automatically from the sender
- Title is taken from the filename
**Workflow:** forward a document as PDF to docs@msbls.de → Paperless imports it automatically.
---
## Paperless-AI configuration
Paperless-AI (port 3077) is to take over classification. Configure it with:
- **Auto-assign correspondent** based on OCR text (sender detection)
- **Auto-assign document type** from the 15 reduced types
- **Auto-assign 1-2 category tags** from the shortlist
- **Not**: auto-generated free-text tags (that is what produces the current chaos)
---
## Migration: step by step
### Phase 1: TrueNAS setup ✓ DONE
1. ~~Dataset `mPool/mdms` created~~ (LZ4, 1.24 TiB free)
2. ~~Subfolders created~~ (paperless, inbox, templates, archive, export)
3. ~~NFS export~~ (id:8, 192.168.178.0/24), mounted on mDock as `/mnt/mdms`
4. ~~SMB share `mDMS`~~ (id:3, user `m`)
### Phase 2: Paperless migration ✓ DONE
5. ~~Paperless stopped on mDock~~
6. ~~media, data, ai copied~~; pgdata as a local Docker volume (NFS ownership incompatible with Postgres uid 999)
7. ~~consume → inbox copied~~
8. ~~SynoResource files deleted~~
9. ~~NFS mount `/mnt/mdms` on mDock~~ (fstab via Proxmox agent)
10. ~~Docker Compose updated~~ (`~/paperless/docker-compose.yml`)
11. ~~Storage path configured~~
12. ~~Paperless verified~~: 129 docs, all metadata intact
**Note:** pgdata lives as the Docker volume `paperless_pgdata` on mDock (not on NFS). Plan a DB backup via `pg_dump` into `mdms/export/`.
### Phase 3: Cleanup ✓ DONE
13. ~~Merge Paperless correspondents~~ → 51 → 41
14. ~~Reduce document types~~ → 39 → 13
15. ~~Clean up tags~~ → 172 → 16 (hierarchical: category + status + context)
16. Configure Paperless-AI with the new taxonomy (TODO)
### Phase 4: mDocs migration
17. Clone the mDocs repo, copy all PDFs to mdms/inbox/
18. Paperless consumes and classifies automatically
19. Verify manually: correspondents, types, tags correct?
20. Delete the mDocs repo on Gitea
### Phase 5: Client setup
21. SMB mount on mBreeze: `~/mDMS` (LaunchAgent like mStash)
22. SMB mount on mPebble: `~/mDMS`
23. Configure the mScan app for mdms/inbox/ (if SFTP/SMB upload is possible)
---
## Open items
- [x] Paperless admin credentials: mAi user created on Paperless
- [x] Paperless cleanup: correspondents, types, tags cleaned up
- [x] Storage path configured and files renamed
- [x] WebP previews deleted, SynoResource junk cleaned up
- [x] TrueNAS dataset `mPool/mdms` created (NFS id:8, SMB `mDMS` id:3)
- [x] Paperless media switched to mdms/paperless (Docker Compose updated)
- [x] SMB share `mDMS` set up on TrueNAS
- [x] mDocs migration: 69 PDFs in the Paperless inbox, consumption running
- [x] **docs@msbls.de**: email inbox for Paperless (IMAP polling, alias for mail@msbls.de, rule filters on To: docs@msbls.de)
- [ ] Configure Paperless-AI with the new taxonomy
- [ ] Regular export/backup job (Paperless → mdms/export/)
- [ ] `m doc` CLI subcommand for Paperless access? (search, list, tag)
- [ ] Obsidian vault symlink setup on mBreeze/mPebble


@@ -0,0 +1,14 @@
# Thin overlay on clusterzx/paperless-ai:3.0.9 — same digest as
# the :latest tag pulled on 2026-04-06, but pinned so future image
# refreshes do not silently wipe the type-restriction patches.
#
# Patch 1: routes/setup.js — restrict-existing-document-types on
# the manual processing route (already applied previously
# by docker cp, but volatile across container recreation).
# Patch 2: server.js — same restriction on the scheduled-scan
# loop. Without this, new document types kept appearing
# even with RESTRICT_TO_EXISTING_DOCUMENT_TYPES=yes.
FROM clusterzx/paperless-ai:3.0.9
COPY setup.js.patched /app/routes/setup.js
COPY server.js.patched /app/server.js

24
infra/paperless/README.md Normal file
View File

@@ -0,0 +1,24 @@
# paperless infra (snapshot)
These files are a **traceable copy** of what lives in `~/paperless/` on
mDock. The live source of truth is on mDock — this directory exists so
the configuration is git-readable for the next shift and for audits.
If you change the live config on mDock, sync the change here in the same
commit. If you change the files here, deploy by:
```bash
scp Dockerfile mdock:/home/m/paperless/build/Dockerfile
scp docker-compose.yml mdock:/home/m/paperless/docker-compose.yml
ssh mdock 'cd /home/m/paperless && docker compose up -d --build'
```
The two patched JS files (`setup.js.patched`, `server.js.patched`) live
only on mDock in `~/paperless/build/` — they're large and don't belong
in the repo. Hashes:
| File | mDock path | md5 |
|---|---|---|
| setup.js.patched | ~/paperless/build/setup.js.patched | `04cb5fbfaed13a5f25612af0b79dd90c` |
| server.js.patched | ~/paperless/build/server.js.patched | `eadcbb86048127f2c80632ae77bbc2a0` |
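
Since the patched files are tracked only by hash, a quick check before rebuilding can confirm that the copies on mDock still match the table above. The following is a minimal sketch meant to be run on the host that holds the build directory. The function names `md5_of` and `check_build_dir` are hypothetical; the expected hashes are the ones listed in the table.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Hashes copied from the table above
EXPECTED_MD5 = {
    "setup.js.patched": "04cb5fbfaed13a5f25612af0b79dd90c",
    "server.js.patched": "eadcbb86048127f2c80632ae77bbc2a0",
}


def md5_of(path: Path) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_build_dir(build_dir: Path,
                    expected: Dict[str, str] = EXPECTED_MD5) -> Dict[str, bool]:
    """Return {filename: matches}; a missing file counts as a mismatch."""
    results: Dict[str, bool] = {}
    for name, want in expected.items():
        path = build_dir / name
        results[name] = path.is_file() and md5_of(path) == want
    return results
```

Calling `check_build_dir(Path.home() / "paperless" / "build")` returns a dictionary that is all `True` only when both patched files are present and unchanged.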
See `docs/research/issue-429-paperless-pipeline.md` for the why.


@@ -0,0 +1,24 @@
Du klassifizierst deutsche Dokumente fuer ein persoenliches Dokumentenmanagementsystem.
Erlaubte Document Types (NUR diese verwenden, keine neuen erfinden):
- Invoice — Rechnungen, Abrechnungen, Mahnschreiben, Kontoauszuege, Lohnsteuerbescheinigung, Umsatzsteuer-Voranmeldung, Steuererklaerung, Kostenrechnungen
- Contract — Vertraege, Versicherungsscheine, Kauf-/Kreditvertraege, unterschriebene Angebote, AGB
- Information — Behoerden- und Versicherer-Anschreiben, Bescheinigungen, Mitteilungen, Verwaltungsakte, medizinische Befunde, Berichte, Berechnungen, einseitige Informationen
- Personal Correspondence — Briefe von identifizierbaren Privatpersonen. Stammt der Brief von einer Institution, waehle stattdessen Information.
- Vollmacht — Vollmachten
- Urkunde — notarielle Urkunden
- Steuerbescheid — Steuerbescheide vom Finanzamt
- Anleitung — Bedienungsanleitungen, Datenblaetter, Manuals
- Protokoll — Sitzungs- und WEG-Protokolle
- Formular — Blanko-Formulare und Antraege
Im Zweifel waehle Information. Erfinde NIEMALS neue Document Types.
Erlaubte Tags (NUR diese verwenden, keine neuen erfinden):
Anleitung, Arbeit, Erbschaft, Finanzen, Frist, Gesundheit, Gewaehrleistung, Paul, Steuer, Versicherung, Windscheid33, Wohnung, offen, wichtig
Bei medizinischen Dokumenten Tag Gesundheit setzen.
Bei steuerrelevanten Dokumenten Tag Steuer setzen.
Bei Dokumenten mit Frist Tag Frist setzen.
Correspondents: Verwende den vollen offiziellen Namen der Organisation oder Person (z.B. "DAK-Gesundheit" nicht "DAK-Gesundheit Postzentrum, 22778 Hamburg"). Keine Adressen im Namen. Pruefe ob der Correspondent schon existiert bevor du einen neuen anlegst.


@@ -0,0 +1,52 @@
services:
  broker:
    image: docker.io/library/redis:8
    restart: unless-stopped
    volumes:
      - redisdata:/data
  db:
    image: docker.io/library/postgres:16
    restart: unless-stopped
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless
  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:2.20.6
    restart: unless-stopped
    depends_on:
      - db
      - broker
    ports:
      - 8777:8000
    volumes:
      - /mnt/mdms/paperless/data:/usr/src/paperless/data
      - /mnt/mdms/paperless/media:/usr/src/paperless/media
      - /mnt/mdms/export:/usr/src/paperless/export
      - /mnt/mdms/inbox:/usr/src/paperless/consume
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_TIME_ZONE: Europe/Berlin
      PAPERLESS_OCR_LANGUAGE: deu+eng
      PAPERLESS_CONSUMER_POLLING: 60
      PAPERLESS_CONSUMER_RECURSIVE: "true"
  paperless-ai:
    build: ./build
    image: mdock/paperless-ai:3.0.9-restrict-patch
    container_name: paperless-ai
    restart: unless-stopped
    ports:
      - 3077:3000
    volumes:
      - aidata:/app/data
volumes:
  redisdata:
  pgdata:
  aidata:


@@ -0,0 +1,368 @@
/tmp/migrate_types.py:240: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
audit_path = f"/tmp/migrate_types_audit_{datetime.datetime.utcnow().strftime('%Y%m%dT%H%M%S')}.json"
/tmp/migrate_types.py:242: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
"ts_utc": datetime.datetime.utcnow().isoformat() + "Z",
loaded 73 types, 195 docs
all 10 target types verified
=== PLAN ===
document moves: 171
types to delete (after moves): 63
types NOT mapped + nonzero docs (need manual call): 0
=== MOVES SUMMARY (per target type) ===
-> Contract (+23 docs)
7 from Vertrag
6 from Versicherungsschein
1 from agreement
1 from contract
1 from Finanzierungsangebot
1 from Kreditvertrag
1 from Loan Application and Agreement
1 from Notarial Deed
1 from Notarized agreement with amendments
1 from Rechtsgeschäft
1 from Versicherungsbedingungen
1 from Vertragsdokument
-> Information (+96 docs)
21 from Bescheinigung
21 from Brief
17 from Bescheid
7 from Mitteilung
3 from Wohnflaechenberechnung
2 from Einladung zur Eigentümerversammlung
2 from Leistungsnachweis
2 from Medizinisch
2 from Steuererklärung
1 from Angebot
1 from Antrag
1 from Behandlungsplan und Risikoaufklärung
1 from Beratungsprotokoll
1 from Berechnung
1 from Bericht
1 from Bestätigungsbrief
1 from Energy Performance Certificate
1 from Erklarung
1 from Guidelines
1 from Gutachten
1 from informational document
1 from Informationsschreiben
1 from Medical Consent Form
1 from medical documentation
1 from Rechnungs- und Vertragsinformation
1 from Schreiben des Finanzamts
1 from Verwaltungsakt
1 from Werbung
-> Invoice (+52 docs)
26 from Rechnung
11 from Abrechnung
6 from Umsatzsteuer-Voranmeldung
4 from Lohnsteuerbescheinigung
1 from Kontoauszug
1 from Kontoübersicht
1 from Kostenabrechnung
1 from Kostenvoranmeldung
1 from Mahnschreiben
=== TYPES TO DELETE (after moves) ===
id= 4 count= 11 name='Abrechnung'
id=160 count= 1 name='agreement'
id= 13 count= 1 name='Angebot'
id=134 count= 1 name='Antrag'
id=141 count= 1 name='Behandlungsplan und Risikoaufklärung'
id=129 count= 1 name='Beratungsprotokoll'
id=143 count= 1 name='Berechnung'
id=148 count= 1 name='Bericht'
id= 11 count= 17 name='Bescheid'
id= 15 count= 21 name='Bescheinigung'
id=151 count= 1 name='Bestätigungsbrief'
id= 30 count= 21 name='Brief'
id=127 count= 0 name='Consent Form'
id=144 count= 1 name='contract'
id=120 count= 0 name='Einladung / Vollmacht / Wirtschaftsplan'
id=113 count= 2 name='Einladung zur Eigentümerversammlung'
id=132 count= 0 name='Einspruchsschreiben'
id=158 count= 1 name='Energy Performance Certificate'
id=128 count= 1 name='Erklarung'
id=156 count= 1 name='Finanzierungsangebot'
id=122 count= 0 name='Geldzuwendungsbestätigung'
id=157 count= 1 name='Guidelines'
id= 27 count= 1 name='Gutachten'
id=136 count= 1 name='informational document'
id=139 count= 1 name='Informationsschreiben'
id=137 count= 0 name='Kaufvertrag'
id=118 count= 1 name='Kontoauszug'
id=117 count= 1 name='Kontoübersicht'
id=145 count= 1 name='Kostenabrechnung'
id=121 count= 1 name='Kostenvoranmeldung'
id=142 count= 1 name='Kreditvertrag'
id=114 count= 0 name='Kundeninformation'
id= 83 count= 2 name='Leistungsnachweis'
id=135 count= 1 name='Loan Application and Agreement'
id= 66 count= 4 name='Lohnsteuerbescheinigung'
id=147 count= 1 name='Mahnschreiben'
id=140 count= 1 name='Medical Consent Form'
id=150 count= 1 name='medical documentation'
id= 41 count= 2 name='Medizinisch'
id= 12 count= 7 name='Mitteilung'
id=161 count= 1 name='Notarial Deed'
id=159 count= 1 name='Notarized agreement with amendments'
id=133 count= 0 name='Plan'
id=131 count= 0 name='policy'
id=116 count= 0 name='Questionnaire/Declaration Form'
id= 2 count= 26 name='Rechnung'
id=149 count= 1 name='Rechnungs- und Vertragsinformation'
id=125 count= 0 name='Rechtlicher Vertrag'
id=155 count= 1 name='Rechtsgeschäft'
id=126 count= 0 name='recommendation'
id=152 count= 1 name='Schreiben des Finanzamts'
id=119 count= 0 name='Steuerdokument'
id=115 count= 2 name='Steuererklärung'
id=124 count= 0 name='Tilgungsplan'
id= 88 count= 6 name='Umsatzsteuer-Voranmeldung'
id=130 count= 1 name='Versicherungsbedingungen'
id= 67 count= 6 name='Versicherungsschein'
id= 40 count= 7 name='Vertrag'
id=153 count= 1 name='Vertragsdokument'
id=154 count= 1 name='Verwaltungsakt'
id=146 count= 1 name='Werbung'
id= 73 count= 0 name='Wohnflächenberechnung'
id=123 count= 3 name='Wohnflaechenberechnung'
audit trail written: /tmp/migrate_types_audit_20260513T085119.json
=== APPLY ===
[OK ] doc 104: 'Abrechnung' -> 'Invoice'
[OK ] doc 124: 'Abrechnung' -> 'Invoice'
[OK ] doc 88: 'Abrechnung' -> 'Invoice'
[OK ] doc 134: 'Abrechnung' -> 'Invoice'
[OK ] doc 122: 'Abrechnung' -> 'Invoice'
[OK ] doc 71: 'Abrechnung' -> 'Invoice'
[OK ] doc 220: 'Abrechnung' -> 'Invoice'
[OK ] doc 223: 'Abrechnung' -> 'Invoice'
[OK ] doc 224: 'Abrechnung' -> 'Invoice'
[OK ] doc 255: 'Abrechnung' -> 'Invoice'
[OK ] doc 248: 'Abrechnung' -> 'Invoice'
[OK ] doc 200: 'agreement' -> 'Contract'
[OK ] doc 222: 'Angebot' -> 'Information'
[OK ] doc 98: 'Antrag' -> 'Information'
[OK ] doc 91: 'Behandlungsplan und Risikoaufklärung' -> 'Information'
[OK ] doc 228: 'Beratungsprotokoll' -> 'Information'
[OK ] doc 202: 'Berechnung' -> 'Information'
[OK ] doc 96: 'Bericht' -> 'Information'
[OK ] doc 160: 'Bescheid' -> 'Information'
[OK ] doc 95: 'Bescheid' -> 'Information'
[OK ] doc 86: 'Bescheid' -> 'Information'
[OK ] doc 159: 'Bescheid' -> 'Information'
[OK ] doc 183: 'Bescheid' -> 'Information'
[OK ] doc 101: 'Bescheid' -> 'Information'
[OK ] doc 81: 'Bescheid' -> 'Information'
[OK ] doc 69: 'Bescheid' -> 'Information'
[OK ] doc 70: 'Bescheid' -> 'Information'
[OK ] doc 85: 'Bescheid' -> 'Information'
[OK ] doc 236: 'Bescheid' -> 'Information'
[OK ] doc 253: 'Bescheid' -> 'Information'
[OK ] doc 250: 'Bescheid' -> 'Information'
[OK ] doc 233: 'Bescheid' -> 'Information'
[OK ] doc 234: 'Bescheid' -> 'Information'
[OK ] doc 235: 'Bescheid' -> 'Information'
[OK ] doc 76: 'Bescheid' -> 'Information'
[OK ] doc 260: 'Bescheinigung' -> 'Information'
[OK ] doc 182: 'Bescheinigung' -> 'Information'
[OK ] doc 100: 'Bescheinigung' -> 'Information'
[OK ] doc 178: 'Bescheinigung' -> 'Information'
[OK ] doc 166: 'Bescheinigung' -> 'Information'
[OK ] doc 192: 'Bescheinigung' -> 'Information'
[OK ] doc 75: 'Bescheinigung' -> 'Information'
[OK ] doc 179: 'Bescheinigung' -> 'Information'
[OK ] doc 186: 'Bescheinigung' -> 'Information'
[OK ] doc 168: 'Bescheinigung' -> 'Information'
[OK ] doc 262: 'Bescheinigung' -> 'Information'
[OK ] doc 261: 'Bescheinigung' -> 'Information'
[OK ] doc 259: 'Bescheinigung' -> 'Information'
[OK ] doc 242: 'Bescheinigung' -> 'Information'
[OK ] doc 239: 'Bescheinigung' -> 'Information'
[OK ] doc 245: 'Bescheinigung' -> 'Information'
[OK ] doc 252: 'Bescheinigung' -> 'Information'
[OK ] doc 219: 'Bescheinigung' -> 'Information'
[OK ] doc 205: 'Bescheinigung' -> 'Information'
[OK ] doc 247: 'Bescheinigung' -> 'Information'
[OK ] doc 230: 'Bescheinigung' -> 'Information'
[OK ] doc 152: 'Bestätigungsbrief' -> 'Information'
[OK ] doc 244: 'Brief' -> 'Information'
[OK ] doc 164: 'Brief' -> 'Information'
[OK ] doc 146: 'Brief' -> 'Information'
[OK ] doc 169: 'Brief' -> 'Information'
[OK ] doc 191: 'Brief' -> 'Information'
[OK ] doc 105: 'Brief' -> 'Information'
[OK ] doc 188: 'Brief' -> 'Information'
[OK ] doc 115: 'Brief' -> 'Information'
[OK ] doc 97: 'Brief' -> 'Information'
[OK ] doc 196: 'Brief' -> 'Information'
[OK ] doc 74: 'Brief' -> 'Information'
[OK ] doc 113: 'Brief' -> 'Information'
[OK ] doc 102: 'Brief' -> 'Information'
[OK ] doc 126: 'Brief' -> 'Information'
[OK ] doc 195: 'Brief' -> 'Information'
[OK ] doc 110: 'Brief' -> 'Information'
[OK ] doc 170: 'Brief' -> 'Information'
[OK ] doc 180: 'Brief' -> 'Information'
[OK ] doc 116: 'Brief' -> 'Information'
[OK ] doc 127: 'Brief' -> 'Information'
[OK ] doc 149: 'Brief' -> 'Information'
[OK ] doc 227: 'contract' -> 'Contract'
[OK ] doc 156: 'Einladung zur Eigentümerversammlung' -> 'Information'
[OK ] doc 119: 'Einladung zur Eigentümerversammlung' -> 'Information'
[OK ] doc 163: 'Energy Performance Certificate' -> 'Information'
[OK ] doc 251: 'Erklarung' -> 'Information'
[OK ] doc 217: 'Finanzierungsangebot' -> 'Contract'
[OK ] doc 154: 'Guidelines' -> 'Information'
[OK ] doc 158: 'Gutachten' -> 'Information'
[OK ] doc 218: 'informational document' -> 'Information'
[OK ] doc 185: 'Informationsschreiben' -> 'Information'
[OK ] doc 189: 'Kontoauszug' -> 'Invoice'
[OK ] doc 187: 'Kontoübersicht' -> 'Invoice'
[OK ] doc 121: 'Kostenabrechnung' -> 'Invoice'
[OK ] doc 107: 'Kostenvoranmeldung' -> 'Invoice'
[OK ] doc 212: 'Kreditvertrag' -> 'Contract'
[OK ] doc 256: 'Leistungsnachweis' -> 'Information'
[OK ] doc 241: 'Leistungsnachweis' -> 'Information'
[OK ] doc 214: 'Loan Application and Agreement' -> 'Contract'
[OK ] doc 167: 'Lohnsteuerbescheinigung' -> 'Invoice'
[OK ] doc 254: 'Lohnsteuerbescheinigung' -> 'Invoice'
[OK ] doc 258: 'Lohnsteuerbescheinigung' -> 'Invoice'
[OK ] doc 249: 'Lohnsteuerbescheinigung' -> 'Invoice'
[OK ] doc 80: 'Mahnschreiben' -> 'Invoice'
[OK ] doc 138: 'Medical Consent Form' -> 'Information'
[OK ] doc 136: 'medical documentation' -> 'Information'
[OK ] doc 135: 'Medizinisch' -> 'Information'
[OK ] doc 197: 'Medizinisch' -> 'Information'
[OK ] doc 109: 'Mitteilung' -> 'Information'
[OK ] doc 144: 'Mitteilung' -> 'Information'
[OK ] doc 181: 'Mitteilung' -> 'Information'
[OK ] doc 111: 'Mitteilung' -> 'Information'
[OK ] doc 150: 'Mitteilung' -> 'Information'
[OK ] doc 184: 'Mitteilung' -> 'Information'
[OK ] doc 108: 'Mitteilung' -> 'Information'
[OK ] doc 206: 'Notarial Deed' -> 'Contract'
[OK ] doc 203: 'Notarized agreement with amendments' -> 'Contract'
[OK ] doc 151: 'Rechnung' -> 'Invoice'
[OK ] doc 90: 'Rechnung' -> 'Invoice'
[OK ] doc 93: 'Rechnung' -> 'Invoice'
[OK ] doc 92: 'Rechnung' -> 'Invoice'
[OK ] doc 161: 'Rechnung' -> 'Invoice'
[OK ] doc 140: 'Rechnung' -> 'Invoice'
[OK ] doc 132: 'Rechnung' -> 'Invoice'
[OK ] doc 155: 'Rechnung' -> 'Invoice'
[OK ] doc 73: 'Rechnung' -> 'Invoice'
[OK ] doc 162: 'Rechnung' -> 'Invoice'
[OK ] doc 94: 'Rechnung' -> 'Invoice'
[OK ] doc 78: 'Rechnung' -> 'Invoice'
[OK ] doc 143: 'Rechnung' -> 'Invoice'
[OK ] doc 106: 'Rechnung' -> 'Invoice'
[OK ] doc 72: 'Rechnung' -> 'Invoice'
[OK ] doc 193: 'Rechnung' -> 'Invoice'
[OK ] doc 194: 'Rechnung' -> 'Invoice'
[OK ] doc 139: 'Rechnung' -> 'Invoice'
[OK ] doc 165: 'Rechnung' -> 'Invoice'
[OK ] doc 133: 'Rechnung' -> 'Invoice'
[OK ] doc 173: 'Rechnung' -> 'Invoice'
[OK ] doc 148: 'Rechnung' -> 'Invoice'
[OK ] doc 147: 'Rechnung' -> 'Invoice'
[OK ] doc 141: 'Rechnung' -> 'Invoice'
[OK ] doc 142: 'Rechnung' -> 'Invoice'
[OK ] doc 231: 'Rechnung' -> 'Invoice'
[OK ] doc 175: 'Rechnungs- und Vertragsinformation' -> 'Information'
[OK ] doc 213: 'Rechtsgeschäft' -> 'Contract'
[OK ] doc 79: 'Schreiben des Finanzamts' -> 'Information'
[OK ] doc 246: 'Steuererklärung' -> 'Information'
[OK ] doc 77: 'Steuererklärung' -> 'Information'
[OK ] doc 257: 'Umsatzsteuer-Voranmeldung' -> 'Invoice'
[OK ] doc 237: 'Umsatzsteuer-Voranmeldung' -> 'Invoice'
[OK ] doc 238: 'Umsatzsteuer-Voranmeldung' -> 'Invoice'
[OK ] doc 240: 'Umsatzsteuer-Voranmeldung' -> 'Invoice'
[OK ] doc 243: 'Umsatzsteuer-Voranmeldung' -> 'Invoice'
[OK ] doc 204: 'Umsatzsteuer-Voranmeldung' -> 'Invoice'
[OK ] doc 229: 'Versicherungsbedingungen' -> 'Contract'
[OK ] doc 129: 'Versicherungsschein' -> 'Contract'
[OK ] doc 112: 'Versicherungsschein' -> 'Contract'
[OK ] doc 130: 'Versicherungsschein' -> 'Contract'
[OK ] doc 128: 'Versicherungsschein' -> 'Contract'
[OK ] doc 226: 'Versicherungsschein' -> 'Contract'
[OK ] doc 131: 'Versicherungsschein' -> 'Contract'
[OK ] doc 118: 'Vertrag' -> 'Contract'
[OK ] doc 199: 'Vertrag' -> 'Contract'
[OK ] doc 87: 'Vertrag' -> 'Contract'
[OK ] doc 89: 'Vertrag' -> 'Contract'
[OK ] doc 232: 'Vertrag' -> 'Contract'
[OK ] doc 123: 'Vertrag' -> 'Contract'
[OK ] doc 190: 'Vertrag' -> 'Contract'
[OK ] doc 177: 'Vertragsdokument' -> 'Contract'
[OK ] doc 82: 'Verwaltungsakt' -> 'Information'
[OK ] doc 176: 'Werbung' -> 'Information'
[OK ] doc 216: 'Wohnflaechenberechnung' -> 'Information'
[OK ] doc 201: 'Wohnflaechenberechnung' -> 'Information'
[OK ] doc 207: 'Wohnflaechenberechnung' -> 'Information'
[DEL] type 4 'Abrechnung' resp=''
[DEL] type 160 'agreement' resp=''
[DEL] type 13 'Angebot' resp=''
[DEL] type 134 'Antrag' resp=''
[DEL] type 141 'Behandlungsplan und Risikoaufklärung' resp=''
[DEL] type 129 'Beratungsprotokoll' resp=''
[DEL] type 143 'Berechnung' resp=''
[DEL] type 148 'Bericht' resp=''
[DEL] type 11 'Bescheid' resp=''
[DEL] type 15 'Bescheinigung' resp=''
[DEL] type 151 'Bestätigungsbrief' resp=''
[DEL] type 30 'Brief' resp=''
[DEL] type 127 'Consent Form' resp=''
[DEL] type 144 'contract' resp=''
[DEL] type 120 'Einladung / Vollmacht / Wirtschaftsplan' resp=''
[DEL] type 113 'Einladung zur Eigentümerversammlung' resp=''
[DEL] type 132 'Einspruchsschreiben' resp=''
[DEL] type 158 'Energy Performance Certificate' resp=''
[DEL] type 128 'Erklarung' resp=''
[DEL] type 156 'Finanzierungsangebot' resp=''
[DEL] type 122 'Geldzuwendungsbestätigung' resp=''
[DEL] type 157 'Guidelines' resp=''
[DEL] type 27 'Gutachten' resp=''
[DEL] type 136 'informational document' resp=''
[DEL] type 139 'Informationsschreiben' resp=''
[DEL] type 137 'Kaufvertrag' resp=''
[DEL] type 118 'Kontoauszug' resp=''
[DEL] type 117 'Kontoübersicht' resp=''
[DEL] type 145 'Kostenabrechnung' resp=''
[DEL] type 121 'Kostenvoranmeldung' resp=''
[DEL] type 142 'Kreditvertrag' resp=''
[DEL] type 114 'Kundeninformation' resp=''
[DEL] type 83 'Leistungsnachweis' resp=''
[DEL] type 135 'Loan Application and Agreement' resp=''
[DEL] type 66 'Lohnsteuerbescheinigung' resp=''
[DEL] type 147 'Mahnschreiben' resp=''
[DEL] type 140 'Medical Consent Form' resp=''
[DEL] type 150 'medical documentation' resp=''
[DEL] type 41 'Medizinisch' resp=''
[DEL] type 12 'Mitteilung' resp=''
[DEL] type 161 'Notarial Deed' resp=''
[DEL] type 159 'Notarized agreement with amendments' resp=''
[DEL] type 133 'Plan' resp=''
[DEL] type 131 'policy' resp=''
[DEL] type 116 'Questionnaire/Declaration Form' resp=''
[DEL] type 2 'Rechnung' resp=''
[DEL] type 149 'Rechnungs- und Vertragsinformation' resp=''
[DEL] type 125 'Rechtlicher Vertrag' resp=''
[DEL] type 155 'Rechtsgeschäft' resp=''
[DEL] type 126 'recommendation' resp=''
[DEL] type 152 'Schreiben des Finanzamts' resp=''
[DEL] type 119 'Steuerdokument' resp=''
[DEL] type 115 'Steuererklärung' resp=''
[DEL] type 124 'Tilgungsplan' resp=''
[DEL] type 88 'Umsatzsteuer-Voranmeldung' resp=''
[DEL] type 130 'Versicherungsbedingungen' resp=''
[DEL] type 67 'Versicherungsschein' resp=''
[DEL] type 40 'Vertrag' resp=''
[DEL] type 153 'Vertragsdokument' resp=''
[DEL] type 154 'Verwaltungsakt' resp=''
[DEL] type 146 'Werbung' resp=''
[DEL] type 73 'Wohnflächenberechnung' resp=''
[DEL] type 123 'Wohnflaechenberechnung' resp=''
done.

@@ -0,0 +1,279 @@
"""
Collapse Paperless document types 69 -> 10, per the mapping agreed in
otto#429.
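Talks to the live Paperless API on mDock: every request is sent as
`ssh mdock docker exec paperless-webserver-1 curl ...`, so the machine
running this script needs SSH access to `mdock`. Default mode is a
DRY RUN that prints what would change without writing. Pass --apply to
actually PATCH docs and DELETE old types.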
Usage:
python3 migrate_types.py # dry run
python3 migrate_types.py --apply # live
"""
import os
import sys
import json
import subprocess
import argparse
# The 10 canonical target types (Paperless type IDs after Step 3).
TARGET = {
"Invoice": 162,
"Contract": 163,
"Information": 164,
"Personal Correspondence": 165,
"Vollmacht": 22,
"Urkunde": 37,
"Steuerbescheid": 138,
"Anleitung": 76,
"Protokoll": 32,
"Formular": 80,
}
# Mapping: old type *name* -> target canonical name.
# Built from the audit doc's mapping table. Anything not listed here
# stays at its current type (and gets surfaced as "unmapped" so we
# can decide manually).
MAP = {
# ----- Invoice ------------------------------------------------
"Rechnung": "Invoice",
"Abrechnung": "Invoice",
"Mahnschreiben": "Invoice",
"Kontoauszug": "Invoice",
"Kontoübersicht": "Invoice",
"Kostenabrechnung": "Invoice",
"Kostenvoranmeldung": "Invoice",
"Umsatzsteuer-Voranmeldung": "Invoice",
"Tilgungsplan": "Invoice",
"Lohnsteuerbescheinigung": "Invoice",
# ----- Contract -----------------------------------------------
"Vertrag": "Contract",
"Versicherungsschein": "Contract",
"Kaufvertrag": "Contract",
"Kreditvertrag": "Contract",
"Notarial Deed": "Contract",
"agreement": "Contract",
"contract": "Contract",
"policy": "Contract",
"Vertragsdokument": "Contract",
"Rechtsgeschäft": "Contract",
"Rechtlicher Vertrag": "Contract",
"Versicherungsbedingungen": "Contract",
"Finanzierungsangebot": "Contract",
"Loan Application and Agreement": "Contract",
"Notarized agreement with amendments": "Contract",
# ----- Information --------------------------------------------
"Bescheid": "Information",
"Bescheinigung": "Information",
"Mitteilung": "Information",
"Verwaltungsakt": "Information",
"Schreiben des Finanzamts": "Information",
"Informationsschreiben": "Information",
"informational document": "Information",
"Kundeninformation": "Information",
"Werbung": "Information",
"Bestätigungsbrief": "Information",
"Geldzuwendungsbestätigung": "Information",
"Antrag": "Information",
"Erklarung": "Information",
"Leistungsnachweis": "Information",
"Beratungsprotokoll": "Information",
"Gutachten": "Information",
"Bericht": "Information",
"Berechnung": "Information",
"Wohnflaechenberechnung": "Information",
"Wohnflächenberechnung": "Information",
"Guidelines": "Information",
"Energy Performance Certificate": "Information",
"Einladung zur Eigentümerversammlung": "Information",
"Einladung / Vollmacht / Wirtschaftsplan": "Information",
"Steuerdokument": "Information",
"Steuererklärung": "Information",
"Plan": "Information",
"Einspruchsschreiben": "Information",
"Angebot": "Information",
"recommendation": "Information",
"Behandlungsplan und Risikoaufklärung": "Information",
"Medical Consent Form": "Information",
"Consent Form": "Information",
"Medizinisch": "Information",
"medical documentation": "Information",
"Questionnaire/Declaration Form": "Information",
"Rechnungs- und Vertragsinformation": "Information",
# ----- Personal Correspondence --------------------------------
# Per m's explicit answer: Brief defaults to Information.
# Personal Correspondence is opt-in for letters that are clearly
# from a private person; the AI applies it going forward on a
# case-by-case basis. For the migration of the 21 existing
# Briefe (none of which we can read here to distinguish), they
# land in Information — the safe default m chose.
"Brief": "Information",
}
import shlex


def gitea_curl(token, path, method="GET", body=None):
    """Call the Paperless REST API via curl inside the webserver container.

    Note: despite the name, this calls the Paperless API, not Gitea.
    Returns the raw response body as a string.
    """
    inner_parts = [
        "curl", "-s",
        "-X", method,
        "-H", f"Authorization: Token {token}",
    ]
    if body is not None:
        inner_parts += ["-H", "Content-Type: application/json", "-d", json.dumps(body)]
    inner_parts.append(f"http://localhost:8000/api{path}")
    # Quote every argument: the joined string is re-parsed by the remote shell.
    inner = " ".join(shlex.quote(p) for p in inner_parts)
    full = f"docker exec paperless-webserver-1 {inner}"
    out = subprocess.run(
        ["ssh", "mdock", full], capture_output=True, text=True, timeout=120,
    )
    # `curl -s` exits 0 on HTTP error responses, so a zero return code only
    # means the transport worked; callers still have to inspect the body.
    if out.returncode != 0:
        raise RuntimeError(f"curl failed rc={out.returncode}: {out.stderr}")
    return out.stdout


def get_token():
    """Read PAPERLESS_API_TOKEN from the paperless-ai container's .env."""
    out = subprocess.run(
        ["ssh", "mdock", "docker exec paperless-ai sh -c 'grep ^PAPERLESS_API_TOKEN /app/data/.env | cut -d= -f2'"],
        capture_output=True, text=True, timeout=15,
    )
    return out.stdout.strip()


def fetch_all(token, path):
    """GET path paged; returns flat list of results."""
    results = []
    page = 1
    while True:
        raw = gitea_curl(token, f"{path}?page={page}&page_size=200")
        data = json.loads(raw)
        results.extend(data.get("results", []))
        if not data.get("next"):
            break
        page += 1
    return results


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--apply", action="store_true", help="Actually write changes")
    args = ap.parse_args()

    token = get_token()
    if not token:
        sys.exit("no PAPERLESS_API_TOKEN found")

    types = fetch_all(token, "/document_types/")
    docs = fetch_all(token, "/documents/")
    print(f"loaded {len(types)} types, {len(docs)} docs")
    type_by_id = {t["id"]: t for t in types}

    # Sanity: verify all 10 targets exist
    for name, tid in TARGET.items():
        t = type_by_id.get(tid)
        if not t or t["name"] != name:
            sys.exit(f"target type missing or mismatched: id={tid} expected name={name!r} got={t}")
    print("all 10 target types verified")

    # Build plan
    moves = []  # list of (doc_id, current_type_name, new_type_id, new_type_name)
    unmapped_types = []
    delete_candidates = []
    for t in types:
        if t["id"] in TARGET.values():
            continue  # keep
        target_name = MAP.get(t["name"])
        if target_name is None:
            if t["document_count"] == 0:
                delete_candidates.append(t)
            else:
                unmapped_types.append(t)
            continue
        new_tid = TARGET[target_name]
        # Find docs with this type
        for d in docs:
            if d.get("document_type") == t["id"]:
                moves.append((d["id"], t["name"], new_tid, target_name))
        # Old type becomes deletable after all its docs are moved
        delete_candidates.append(t)

    print()
    print("=== PLAN ===")
    print(f"document moves: {len(moves)}")
    print(f"types to delete (after moves): {len(delete_candidates)}")
    print(f"types NOT mapped + nonzero docs (need manual call): {len(unmapped_types)}")
    if unmapped_types:
        print("  -- unmapped --")
        for t in unmapped_types:
            print(f"  id={t['id']:3d} count={t['document_count']:3d} name={t['name']!r}")

    print()
    print("=== MOVES SUMMARY (per target type) ===")
    counter = {}  # new_type_name -> {old_type_name: number of docs}
    for _, old_name, _, new_name in moves:
        per_source = counter.setdefault(new_name, {})
        per_source[old_name] = per_source.get(old_name, 0) + 1
    for new_name, src in sorted(counter.items()):
        total = sum(src.values())
        print(f"  -> {new_name} (+{total} docs)")
        for old_name, n in sorted(src.items(), key=lambda kv: -kv[1]):
            print(f"     {n:3d} from {old_name}")

    print()
    print("=== TYPES TO DELETE (after moves) ===")
    for t in delete_candidates:
        print(f"  id={t['id']:3d} count={t['document_count']:3d} name={t['name']!r}")

    if not args.apply:
        print()
        print("DRY RUN: re-run with --apply to write changes")
        return

    # Audit trail BEFORE writing
    import datetime
    now = datetime.datetime.now(datetime.timezone.utc)
    audit_path = f"/tmp/migrate_types_audit_{now.strftime('%Y%m%dT%H%M%S')}.json"
    audit = {
        "ts_utc": now.isoformat().replace("+00:00", "Z"),
        "types_snapshot": [
            {"id": t["id"], "name": t["name"], "document_count": t["document_count"]}
            for t in types
        ],
        "moves": [
            {"doc_id": d_id, "old_type_name": old_name, "new_type_id": ntid, "new_type_name": nname}
            for d_id, old_name, ntid, nname in moves
        ],
        "deletes": [
            {"id": t["id"], "name": t["name"], "document_count_before": t["document_count"]}
            for t in delete_candidates
        ],
    }
    with open(audit_path, "w") as f:
        json.dump(audit, f, indent=2, ensure_ascii=False)
    print(f"audit trail written: {audit_path}")

    print()
    print("=== APPLY ===")
    failed_types = set()  # old type names with at least one failed move
    for doc_id, old_name, new_tid, new_name in moves:
        r = gitea_curl(token, f"/documents/{doc_id}/", method="PATCH", body={"document_type": new_tid})
        try:
            d = json.loads(r)
            ok = d.get("id") == doc_id
        except (ValueError, AttributeError):
            ok = False
        if not ok:
            failed_types.add(old_name)
        flag = "OK " if ok else "ERR"
        print(f"  [{flag}] doc {doc_id}: {old_name!r} -> {new_name!r}")

    for t in delete_candidates:
        # Keep a type whose documents could not all be moved, so no document
        # is left pointing at a deleted type.
        if t["name"] in failed_types:
            print(f"  [SKIP] type {t['id']} {t['name']!r}: a document move failed")
            continue
        r = gitea_curl(token, f"/document_types/{t['id']}/", method="DELETE")
        # Paperless DELETE returns empty 204 on success
        print(f"  [DEL] type {t['id']} {t['name']!r} resp={r[:80]!r}")
    print("done.")


if __name__ == "__main__":
    main()

File diff suppressed because it is too large

@@ -0,0 +1,18 @@
FROM alpine:3.13
RUN apk add --no-cache \
samba \
samba-common-tools \
shadow \
&& rm -rf /var/cache/apk/*
RUN rm -rf /etc/samba/* /var/lib/samba/* /var/log/samba/* \
&& mkdir -p /etc/samba /var/lib/samba/private /var/log/samba /var/run/samba /inbox
COPY smb.conf /etc/samba/smb.conf
COPY entrypoint.sh /entrypoint.sh
RUN chmod 0755 /entrypoint.sh
EXPOSE 139 445
ENTRYPOINT ["/entrypoint.sh"]

120
infra/samba-canon/README.md Normal file
@@ -0,0 +1,120 @@
# samba-canon — SMB bridge for the Canon MAXIFY MB5100
Old-Samba container on mDock that gives the Canon MB5100 (2014, SMB1 +
NTLMv1 only) a writable share. Scans land in `/mnt/mdms/inbox/` and are
picked up by Paperless within 60s via the existing consume-folder poll.
## Why this exists
The Canon MAXIFY MB5100 only supports SMB Shared Folder as a scan
destination (no FTP, no WebDAV — see the [official manual][canon-manual]).
It speaks SMB1 with NTLMv1 auth.
Direct scan-to-TrueNAS fails reproducibly even with `enable_smb1=true` +
`ntlmv1_auth=true` flipped on TrueNAS Core: the TrueNAS-Samba (4.19+) ships
extra SMB1 hardening that breaks the printer's handshake. `smb1_process.c:502`
logs `NT_STATUS_CONNECTION_RESET` — the printer closes the TCP socket before
the first SMB packet is processed.
Rather than fight TrueNAS hardening, this container runs a deliberately old
Samba (4.13.17 on Alpine 3.13) on mDock, bound to mDock's LAN interface
only, and writes received files straight to the NFS-mounted Paperless
inbox.
The TrueNAS SMB stack stays untouched — mBreeze and mPebble keep mounting
`mDMS` directly from TrueNAS as before.
[canon-manual]: https://ij.manual.canon/ij/webmanual/Manual/All/MB5100%20series/EN/UG/ug_scanning0700.html
## Layout
| File | Purpose |
| ----------------- | ---------------------------------------------------------- |
| `Dockerfile` | `alpine:3.13` + samba 4.13.17, ~46 MiB image |
| `smb.conf` | NT1 server, NTLMv1 + LANMAN enabled, single `[inbox]` share |
| `entrypoint.sh` | Creates `canon` user at UID 1000, sets pw from env, runs smbd |
| `docker-compose.yml` | Binds 445/139 on the LAN IP only, mounts `/mnt/mdms/inbox` |
These files are a **traceable copy** of what lives in `~/samba-canon/` on
mDock (same convention as `infra/paperless/`). If you change the live config
on mDock, sync the change here in the same commit.
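
One way to notice drift between the repo copy and the live copy is to compare content hashes of the four files. The sketch below is an illustration only: `find_drift`, `sha256_of`, and both directory arguments are hypothetical names, and it assumes the live `~/samba-canon/` directory has first been copied somewhere locally readable (for example with `scp`).

```python
import hashlib
from pathlib import Path

# The four files listed in the Layout table.
TRACKED = ["Dockerfile", "smb.conf", "entrypoint.sh", "docker-compose.yml"]


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(repo_dir: str, live_dir: str) -> list[str]:
    """Return tracked file names that are missing on either side or differ."""
    drifted = []
    for name in TRACKED:
        a, b = Path(repo_dir) / name, Path(live_dir) / name
        if not a.is_file() or not b.is_file() or sha256_of(a) != sha256_of(b):
            drifted.append(name)
    return drifted
```

An empty result means the two directories agree on all four files; `.env` is deliberately not in the list because it is not committed.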
## Deploy
```bash
scp infra/samba-canon/{Dockerfile,smb.conf,entrypoint.sh,docker-compose.yml} \
mdock:~/samba-canon/
ssh mdock 'cd ~/samba-canon && docker compose up -d --build'
```
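The real `CANON_PASSWORD` lives in `~/samba-canon/.env` on mDock (chmod 600,
not committed). To rotate it, edit `.env` and run `docker compose up -d`,
which recreates the container with the new value (a plain
`docker compose restart` keeps the old environment). `entrypoint.sh`
re-applies the password to the Samba TDB on every boot.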
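
Generating the new password and writing `.env` with the right mode can be scripted. This is a minimal sketch, not part of the repo: `write_env` is a hypothetical helper, and the alphanumeric alphabet and default length of 24 are assumptions, since the printer's limits on password length and character set are not documented here.

```python
import os
import secrets
import string


def write_env(path: str, length: int = 24) -> str:
    """Write CANON_PASSWORD=<random> to path with mode 0600; return the password."""
    alphabet = string.ascii_letters + string.digits  # assumed safe for the printer
    password = "".join(secrets.choice(alphabet) for _ in range(length))
    # O_TRUNC replaces an existing file; 0o600 applies only when the file is created.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(f"CANON_PASSWORD={password}\n")
    os.chmod(path, 0o600)  # also tighten a file that already existed
    return password
```

The returned password still has to be entered in the Canon Toolbox by hand, and the container has to be started with the new value before the printer can log in again.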
## Canon Quick Utility Toolbox values
Use these exact values in the printer's "Destination Settings → Folder"
entry (Canon Quick Utility Toolbox → Destination Folder Settings):
| Field | Value |
| ---------------- | ---------------------------------------------- |
| Display name | `mDock Inbox` (any label) |
| SMB server name | `192.168.178.131` (mDock LAN IP — not `mdock`, the printer does no DNS) |
| Shared folder | `inbox` |
| Domain / Workgroup | leave blank, or `WORKGROUP` |
| User | `canon` |
| Password | (from `~/samba-canon/.env` on mDock — `CANON_PASSWORD`) |
| Port | leave default (445) — non-standard ports are not supported by the printer |
The printer's connection-test should report success.
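
If the connection test fails, it can help to first rule out plain network reachability from another LAN machine. The following is a rough sketch under that assumption; `port_open` is a hypothetical helper, and it only checks that a TCP connection can be opened, not that SMB authentication works.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # IP and ports are the ones from the table above.
    for port in (445, 139):
        state = "open" if port_open("192.168.178.131", port) else "closed"
        print(f"192.168.178.131:{port} {state}")
```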
## Verification (replayed during deploy)
1. **`smbutil` listing from a known-good client.** From mBreeze:
```bash
smbutil view -A "//canon:<pw>@192.168.178.131"
# → "Authenticate successfully with //canon:…@192.168.178.131"
```
2. **Mount + write from mBreeze.**
```bash
mkdir -p /tmp/canon-test
mount -t smbfs "//canon:<pw>@192.168.178.131/inbox" /tmp/canon-test
touch /tmp/canon-test/probe.txt
ls -la /mnt/mdms/inbox/probe.txt # on mDock — should show m:m, mode 0664
umount /tmp/canon-test
```
3. **Toolbox connection test** — green tick (m runs this once during setup).
4. **Real scan from the ADF** — PDF lands in `/mnt/mdms/inbox/`, Paperless
polls within 60 s, OCR + AI-typing run, file moves to
`<year>/<type>/...` (existing Paperless pipeline, see `infra/paperless/`).
5. **Survives mDock reboot.** `docker-compose.yml` sets
`restart: unless-stopped`, so Docker starts the container again after a
reboot. Verified via `docker restart samba-canon`: the container comes back
up and the share is reachable within ~5 s.
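
The ownership and mode check from step 2 can also be automated on mDock. This is a sketch only: `check_probe` is a hypothetical helper, and the defaults (UID/GID 1000, mode 0664) are taken from the values described in this README.

```python
import os
import stat


def check_probe(path: str, uid: int = 1000, gid: int = 1000, mode: int = 0o664) -> list[str]:
    """Return a list of mismatches between the file's metadata and the expected values."""
    st = os.stat(path)
    problems = []
    if st.st_uid != uid:
        problems.append(f"uid is {st.st_uid}, expected {uid}")
    if st.st_gid != gid:
        problems.append(f"gid is {st.st_gid}, expected {gid}")
    actual_mode = stat.S_IMODE(st.st_mode)  # permission bits only
    if actual_mode != mode:
        problems.append(f"mode is {actual_mode:04o}, expected {mode:04o}")
    return problems
```

Calling `check_probe("/mnt/mdms/inbox/probe.txt")` after the `touch` in step 2 should return an empty list if the `force user` and mask settings took effect.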
## Security notes
- LAN-only. The compose binds `192.168.178.131:445` and `192.168.178.131:139`,
not `0.0.0.0`. The container is not reachable from Tailscale or the
internet.
- SMB1 + NTLMv1 are insecure by design. Acceptable here because the share is
reachable only from the home LAN and the only intended client is the
printer. **Do not expose this share to anything except the Canon.**
- The `canon` user is a Samba-only account (`/sbin/nologin`, no system
password, no shell). It maps to UID 1000 inside the container so that
files written through SMB land as `m:m` on the host NFS mount.
- If `CANON_PASSWORD` leaks, rotate it: edit `~/samba-canon/.env` on mDock,
run `docker compose up -d samba-canon` so the container is recreated with
the new value, and re-enter the new password in the Canon Toolbox.
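
The LAN-only property depends on every port mapping carrying an explicit host IP. A small check along these lines could guard against an accidental `445:445` mapping; it is a sketch with a hypothetical helper name (`is_lan_only`), and it assumes Compose short-syntax mappings with an IPv4 host address.

```python
def is_lan_only(mapping: str, lan_ip: str = "192.168.178.131") -> bool:
    """Return True if a short-syntax port mapping is pinned to lan_ip."""
    spec = mapping.split("/", 1)[0]  # drop a trailing "/tcp" or "/udp"
    parts = spec.split(":")
    # "HOST_IP:HOST_PORT:CONTAINER_PORT" has three fields. With fewer fields
    # no host IP is given and Docker publishes on all interfaces.
    return len(parts) == 3 and parts[0] == lan_ip


if __name__ == "__main__":
    for m in ["192.168.178.131:445:445/tcp", "192.168.178.131:139:139/tcp"]:
        print(m, "LAN-only" if is_lan_only(m) else "NOT LAN-only")
```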
## Out of scope
- TLS / encrypted SMB — incompatible with the printer; LAN-only mitigates.
- Multi-user — only the printer needs to write here.
- Replacing the TrueNAS SMB stack mBreeze/mPebble already use.
- Replacing the printer — m wants to keep the MB5100 working.

@@ -0,0 +1,36 @@
services:
  samba-canon:
    build:
      context: .
      dockerfile: Dockerfile
    image: samba-canon:alpine3.13
    container_name: samba-canon
    restart: unless-stopped
    # The Canon MAXIFY MB5100 only speaks SMB on the standard ports; non-standard
    # ports are not configurable in the printer. So we bind 445/139 on the LAN
    # interface only (mDock's LAN IP), keeping Tailscale out of scope.
    ports:
      - "192.168.178.131:445:445/tcp"
      - "192.168.178.131:139:139/tcp"
    volumes:
      # /mnt/mdms/inbox is NFS-mounted on mDock from TrueNAS (192.168.178.124).
      # Paperless's consume folder polls /mnt/mdms/inbox every 60s, so scans
      # land here and are picked up by Paperless without further wiring.
      - /mnt/mdms/inbox:/inbox:rw
    environment:
      # canon user inside the container is created with this UID/GID at boot.
      # 1000 = m on mDock, which also owns /mnt/mdms/inbox.
      PUID: "1000"
      PGID: "1000"
      # Real password is in .env (gitignored); see README.md.
      CANON_PASSWORD: "${CANON_PASSWORD:?CANON_PASSWORD must be set in .env}"
    # smbd needs the full default cap set (SETUID/SETGID to honour `force user`,
    # CHOWN/FOWNER/DAC_OVERRIDE for file creation, NET_BIND_SERVICE for <1024).
    # We rely on Docker defaults rather than cap_drop ALL + a hand-picked list.
    # Light healthcheck: smbd answers `smbclient -L` once it's up.
    healthcheck:
      test: ["CMD-SHELL", "smbclient -L //127.0.0.1 -U canon%${CANON_PASSWORD} -m SMB3 >/dev/null 2>&1 || smbclient -L //127.0.0.1 -U canon%${CANON_PASSWORD} -m NT1 >/dev/null 2>&1"]
      interval: 60s
      timeout: 10s
      retries: 3
      start_period: 15s

@@ -0,0 +1,41 @@
#!/bin/sh
set -eu
# Map the in-container "canon" user to the same UID/GID as `m` on the host
# (UID 1000 / GID 1000). force user = canon in smb.conf then guarantees that
# every file written through SMB lands as m:m on the NFS-mounted /mnt/mdms/inbox.
TARGET_UID="${PUID:-1000}"
TARGET_GID="${PGID:-1000}"
if ! getent group canon >/dev/null 2>&1; then
addgroup -g "${TARGET_GID}" canon
fi
if ! getent passwd canon >/dev/null 2>&1; then
adduser -D -H -u "${TARGET_UID}" -G canon -s /sbin/nologin canon
fi
if [ -z "${CANON_PASSWORD:-}" ]; then
echo "FATAL: CANON_PASSWORD env var is required" >&2
exit 1
fi
# (Re)apply the Samba password on every boot, so rotating it only requires
# starting the container with the new CANON_PASSWORD value.
printf '%s\n%s\n' "${CANON_PASSWORD}" "${CANON_PASSWORD}" | smbpasswd -s -a canon >/dev/null
smbpasswd -e canon >/dev/null
# Verify the bind-mounted /inbox exists and is writable from the container.
# smbd will drop privilege per session to the canon user (uid 1000), which
# matches m on the host — files therefore land as m:m on the NFS mount.
if ! test -d /inbox; then
echo "FATAL: /inbox missing — bind mount /mnt/mdms/inbox not set." >&2
exit 1
fi
if ! test -w /inbox; then
echo "FATAL: /inbox not writable. Check NFS mount + permissions on /mnt/mdms/inbox (must be writable by uid ${TARGET_UID})." >&2
exit 1
fi
echo "samba-canon ready: smbd $(smbd --version | head -1), user=canon uid=${TARGET_UID} gid=${TARGET_GID}"
exec smbd --foreground --no-process-group --log-stdout

@@ -0,0 +1,49 @@
[global]
workgroup = WORKGROUP
server string = Canon SMB bridge
netbios name = MDOCK-CANON
security = user
map to guest = Never
log file = /var/log/samba/log.%m
log level = 1
max log size = 1000
# Old-school SMB1 + NTLMv1 — required by Canon MAXIFY MB5100 (2014, SMB1 only).
# LAN-only, no encryption — see infra/samba-canon/README.md.
server min protocol = NT1
server max protocol = SMB3
client min protocol = NT1
client max protocol = SMB3
ntlm auth = ntlmv1-permitted
lanman auth = yes
client lanman auth = yes
client plaintext auth = no
server signing = disabled
smb encrypt = disabled
server multi channel support = no
# Performance / sanity for a single-share LAN bridge
load printers = no
printing = bsd
printcap name = /dev/null
disable spoolss = yes
dns proxy = no
usershare allow guests = no
panic action = /bin/sh -c 'echo "smbd panic at $(date)" >&2'
[inbox]
comment = Canon scan inbox (writes to /mnt/mdms/inbox on TrueNAS via NFS)
path = /inbox
browseable = yes
writable = yes
read only = no
guest ok = no
valid users = canon
force user = canon
force group = canon
create mask = 0664
directory mask = 0775
force create mode = 0664
force directory mode = 0775
# The Canon writes single PDFs; vfs full_audit is overkill.
vfs objects =