mAi 90142396d8 mAi: #2 - mdms-mover: strip blank pages from duplex scans
Two changes:

1. Migrate mover from m/otto (commit 9974937, otto#438) into this repo
   at infra/mdms-mover/. mover.sh, mdms-mover.service, mdms-mover.timer,
   README.md. Matches the live deployment on mDock byte-for-byte (modulo
   the strip step below).

2. Add blank-page stripping before the inbox → toprocess promotion. A
   page is dropped iff its embedded text is empty AND the fraction of
   near-white pixels in its rendered thumbnail is >=
   MDMS_BLANK_THRESHOLD (default 0.97, per issue #2). This catches the
   empty backside of patch-T separator sheets in duplex scans (mDMS#2).
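
A minimal sketch of the drop rule described in item 2, not the actual
strip_blank_pages.py. Only the 0.97 default comes from this commit; the
245 near-white cutoff, the 36 dpi thumbnail, treating whitespace-only
text as empty, and all function names are assumptions for illustration.

```python
def near_white_fraction(gray: bytes, cutoff: int = 245) -> float:
    """Fraction of 8-bit grayscale samples at or above `cutoff` (assumed value)."""
    if not gray:
        return 0.0
    return sum(1 for v in gray if v >= cutoff) / len(gray)


def is_blank(text: str, gray: bytes, threshold: float = 0.97) -> bool:
    """Drop rule: no embedded text AND near-white fraction >= threshold."""
    return text.strip() == "" and near_white_fraction(gray) >= threshold


def page_is_blank(page, threshold: float = 0.97) -> bool:
    """Apply is_blank() to a PyMuPDF page rendered as a small grayscale thumbnail."""
    import fitz  # PyMuPDF; imported lazily so the pure helpers run without it

    # One byte per pixel: grayscale colorspace, no alpha channel.
    pix = page.get_pixmap(dpi=36, colorspace=fitz.csGRAY, alpha=False)
    return is_blank(page.get_text(), pix.samples, threshold)
```

Requiring both conditions means a text-bearing page is never dropped on
pixel statistics alone, and a page with no text layer but visible ink
is kept as well.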

strip_blank_pages.py uses PyMuPDF as the only Python dep — single
self-contained wheel, no `poppler-utils` apt-install on mdock. Mirrors
the uv-inline-deps single-file pattern of infra/paperless/generate_separator.py.

Edge cases:
- 1-page input: strip skipped entirely.
- All pages would drop: script exits 2, mover keeps file in inbox and
  logs WARNING (no empty doc reaches Paperless).
- Strip script errors: mover falls back to plain mv, no scan blocked.
- MDMS_STRIP_BLANK=false: bypass strip entirely (emergency disable).
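
The edge cases above can be read as a small decision table. The real
mover is a shell script (mover.sh); the following is only a Python model
of that table. Exit code 2 and MDMS_STRIP_BLANK come from this commit;
exit code 0 meaning success, strip being enabled by default, the
function name, and the returned labels are assumptions.

```python
from typing import Mapping, Optional


def plan_move(page_count: int, strip_exit: Optional[int], env: Mapping[str, str]) -> str:
    """Return 'move_original', 'move_stripped', or 'keep_in_inbox'."""
    if env.get("MDMS_STRIP_BLANK", "true").lower() == "false":
        return "move_original"  # emergency disable: bypass strip entirely
    if page_count <= 1:
        return "move_original"  # 1-page input: strip skipped
    if strip_exit == 0:
        return "move_stripped"  # assumed success code
    if strip_exit == 2:
        return "keep_in_inbox"  # all pages would drop: log WARNING, send nothing
    return "move_original"      # any other failure: fall back to plain mv
```

Every branch except exit code 2 still promotes a file, which matches the
stated goal that a strip failure never blocks a scan.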

Deploy: rsync uv binary to mdock ~/.local/bin/uv (single static binary,
user-space, no apt), scp script + units, systemctl --user daemon-reload.
Verified live with synthetic 4-page (2 real + 1 blank + 1 real → 3
pages), 1-page (unchanged), all-blank (kept in inbox + warning) test
PDFs. Timer fires every ~70s as before.
2026-05-16 17:57:26 +02:00

mDMS

m's document management — Paperless-ngx + AI-classification pipeline, Canon scanner SMB bridge, strategy + tooling.

Spun out from m/otto on 2026-05-15 — issues #429–#438 in m/otto are the provenance trail. Going forward, all mDMS work lives here.

Layout

mDMS/
├── docs/
│   └── strategy.md          # Taxonomy, layout, conventions (the bible)
├── infra/
│   ├── paperless/           # Paperless-AI config: SYSTEM_PROMPT, audit scripts,
│   │                        # migrate_types.py, deploy docker-compose
│   └── samba-canon/         # SMB1 bridge container for Canon MB5100 scanner
│                            # (host-network + nmbd, SMB1+NTLMv1 for old printer)
└── README.md

Components

Paperless-ngx (deployment)

Compose lives in m/paperless (separate repo). That repo is the deployment artifact — ~/paperless/ on mDock is its checkout. This repo (m/mDMS) tracks the AI classification layer that sits on top of Paperless-ngx (infra/paperless/SYSTEM_PROMPT.txt, the type/tag/correspondent migration scripts, the audit pipeline).

Paperless-AI

Runs on mdock:3077 in front of Paperless-ngx (mdock:8777). Classifies each ingested document into one of the 10 canonical types and ≤2 of the 13 canonical tags. The system prompt + the migration scripts in infra/paperless/ are the source of truth — keep this repo and the live Paperless-AI aidata/.env in sync.
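
The classification constraint (exactly one canonical type, at most two canonical tags) can be checked mechanically. The sketch below is hypothetical and not part of the audit scripts in infra/paperless/; the function name is an assumption, and the canonical type and tag names are passed in by the caller rather than listed here, since docs/strategy.md is their source.

```python
from typing import Collection, List, Sequence


def validate_classification(
    doc_type: str,
    tags: Sequence[str],
    allowed_types: Collection[str],
    allowed_tags: Collection[str],
    max_tags: int = 2,
) -> List[str]:
    """Return a list of problems; an empty list means the result is valid."""
    problems = []
    if doc_type not in allowed_types:
        problems.append(f"unknown type: {doc_type}")
    if len(tags) > max_tags:
        problems.append(f"too many tags: {len(tags)} > {max_tags}")
    unknown = sorted(set(tags) - set(allowed_tags))
    if unknown:
        problems.append("unknown tag(s): " + ", ".join(unknown))
    return problems
```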

Canon SMB bridge

infra/samba-canon/ is the host-network Samba 4.10 container on mDock that the Canon MB5100 scans to. Files land in /mnt/mdms/inbox/ (NFS from mTrueNAS) and Paperless polls every 60s. The two-stage inbox (staging dir + age-gated mover) lives separately under ~/mdms-mover/ on mDock — see m/otto issue #438.
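
To illustrate the age gate: a file is promoted only once its modification time is old enough, presumably so that a scan the Canon is still writing is not picked up half-finished. This is a sketch under assumptions, not the deployed mover (which is a shell script); the function name and the minimum age are hypothetical, since the actual threshold is not stated here.

```python
import time
from pathlib import Path
from typing import List, Optional


def files_ready(inbox: Path, min_age_s: float, now: Optional[float] = None) -> List[Path]:
    """Return regular files in `inbox` whose mtime is at least `min_age_s` old."""
    now = time.time() if now is None else now
    return sorted(
        p
        for p in inbox.iterdir()
        if p.is_file() and now - p.stat().st_mtime >= min_age_s
    )
```

A caller would then move each returned path from the staging directory into the directory Paperless consumes.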

Data

NFS-mounted from mTrueNAS: /mnt/mPool/mdms/ → /mnt/mdms/ on all consumers. Layout:

/mnt/mPool/mdms/
├── inbox/         # SMB scanner target (Canon writes here)
├── toprocess/     # Age-gated staging → Paperless consumes here
├── paperless/     # Paperless storage (post-ingest)
├── archive/       # Long-term archive
├── templates/     # Document templates
└── export/        # Manual exports

Reference

  • docs/strategy.md — full strategy, taxonomy decisions, type/tag rationale
  • m/otto issues #429–#438 — original implementation history
  • m/paperless — the bare Paperless-ngx Docker Compose setup