Compare commits
7 Commits
mai/hermes
...
main
| SHA1 |
|---|
| 237f7f89ba |
| a2fa76a41a |
| 70b94cc2e9 |
| 7ba5bb925c |
| 927a66bd66 |
| 90142396d8 |
| 862bc76a2b |
1
.gitignore
vendored
Normal file
@@ -0,0 +1 @@
.m/
180
infra/mdms-mover/README.md
Normal file
@@ -0,0 +1,180 @@
# mdms-mover — age-gated inbox → toprocess promoter + blank-page stripper

Two jobs in one user-systemd timer:

1. **Stability gate** (otto#438): solves the chunk-write race between the
   Canon MB5100 (SMB scans land in `/mnt/mdms/inbox/` in pieces) and
   Paperless (polls `/mnt/mdms/toprocess/` every 60s and consumes
   anything it sees). A file is only promoted when **both**:
   - its `mtime` is more than 3 minutes old, and
   - its size is unchanged since the previous run.
2. **Blank-page strip** (mDMS#2): duplex scans through patch-T separators
   leave a blank backside (the unprinted reverse of the separator sheet)
   at the front of every subsequent document. PDF files are passed
   through `strip_blank_pages.py` before promotion. Pages with no
   embedded text AND at least 97% near-white pixels are dropped.

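The two promotion conditions of the stability gate can be modelled in a few lines. This is only an illustrative Python sketch; the deployed logic lives in `mover.sh`, and the function name `is_stable` and its parameters are assumptions made for this example.

```python
import time
from typing import Optional


def is_stable(mtime: float, size: int, prev_size: Optional[int],
              min_age_min: int = 3, now: Optional[float] = None) -> bool:
    """Return True only when BOTH stability conditions hold.

    mtime       -- file modification time (seconds since the epoch)
    size        -- current file size in bytes
    prev_size   -- size recorded on the previous run, or None if unseen
    min_age_min -- minimum mtime age in minutes (the README default is 3)
    """
    now = time.time() if now is None else now
    old_enough = (now - mtime) > min_age_min * 60
    size_unchanged = prev_size is not None and prev_size == size
    return old_enough and size_unchanged
```

A file seen for the first time has no recorded size, so it always waits at least one more timer run before it can be promoted.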
## Layout on mDock

```
/home/m/mdms-mover/mover.sh                        # script, deployed copy
/home/m/mdms-mover/strip_blank_pages.py            # blank-page detector
/home/m/.config/systemd/user/mdms-mover.service    # oneshot service
/home/m/.config/systemd/user/mdms-mover.timer      # OnUnitActiveSec=1min
/home/m/.local/state/mdms-mover/state.tsv          # last-seen size per file
/home/m/.local/bin/uv                              # uv runner for the strip script
```

Runs as user `m` under user-systemd. mDock has `Linger=yes` for user
`m`, so the timer keeps firing across reboots and after `m` logs out.

## Why systemd, not cron

The original spec (otto#438) called for `/etc/cron.d/mdms-mover`. mDock
runs Ubuntu 24.04 server, which ships with systemd timers and no `cron`
package. Installing cron only to honour the spec wording would add a
package we don't otherwise need; a user-systemd timer is the canonical
Ubuntu 24.04 approach and gives better observability
(`systemctl --user status mdms-mover.timer`, `journalctl --user -u mdms-mover`).

User-mode (not system-mode) keeps the entire install in `m`'s home: no
sudo at deploy or maintenance time, no `/var/lib/...` directories to
chown, and the service can read/write the NFS mount because `m` owns it.

## Configuration

| var | default | meaning |
|---|---|---|
| `MDMS_INBOX` | `/mnt/mdms/inbox` | source: scanner SMB target |
| `MDMS_TOPROCESS` | `/mnt/mdms/toprocess` | destination: Paperless consume |
| `MDMS_STATE` | `$HOME/.local/state/mdms-mover/state.tsv` | per-file size memory |
| `MDMS_MIN_AGE_MIN` | `3` | minimum mtime age in minutes |
| `MDMS_STRIP_BLANK` | `true` | run blank-page strip on PDFs (set to `false` to disable) |
| `MDMS_STRIP_SCRIPT` | `<mover dir>/strip_blank_pages.py` | path override for the strip script |
| `MDMS_BLANK_THRESHOLD` | `0.97` | near-white pixel ratio to call a page blank (read by strip script) |
| `MDMS_BLANK_NEAR_WHITE` | `240` | grayscale cutoff (0-255) for "near white" pixels (read by strip script) |
| `MDMS_BLANK_DPI` | `50` | thumbnail render DPI (read by strip script) |

To override a value, create a drop-in at
`~/.config/systemd/user/mdms-mover.service.d/override.conf`:

```ini
[Service]
Environment=MDMS_MIN_AGE_MIN=5
Environment=MDMS_BLANK_THRESHOLD=0.99
```

then `systemctl --user daemon-reload && systemctl --user restart mdms-mover.timer`.

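As a rough illustration of how these defaults resolve, the sketch below reads a subset of the variables from a mapping and falls back to the documented defaults. `MoverConfig` and `load_config` are hypothetical names used only here; the real lookups happen in `mover.sh` and `strip_blank_pages.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MoverConfig:
    inbox: str
    toprocess: str
    min_age_min: int
    strip_blank: bool
    blank_threshold: float


def load_config(env: Optional[Mapping[str, str]] = None) -> MoverConfig:
    """Resolve settings from env, using the defaults from the table above."""
    env = os.environ if env is None else env
    return MoverConfig(
        inbox=env.get("MDMS_INBOX", "/mnt/mdms/inbox"),
        toprocess=env.get("MDMS_TOPROCESS", "/mnt/mdms/toprocess"),
        min_age_min=int(env.get("MDMS_MIN_AGE_MIN", "3")),
        # mover.sh only strips when the value is exactly "true".
        strip_blank=env.get("MDMS_STRIP_BLANK", "true") == "true",
        blank_threshold=float(env.get("MDMS_BLANK_THRESHOLD", "0.97")),
    )
```

Passing a plain dict instead of `os.environ` makes it easy to check what a given `override.conf` would change, for example `load_config({"MDMS_MIN_AGE_MIN": "5"})`.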
## Blank-page detection — what gets dropped

A page is dropped iff BOTH:

1. embedded text is empty / whitespace-only (image-only scans always
   pass this, since they have no embedded text), AND
2. the rendered thumbnail is ≥ `MDMS_BLANK_THRESHOLD` near-white pixels
   (0.97 by default → at least 97% of pixels at grayscale 240 or brighter).

The threshold is conservative on purpose: a false-negative (keeping a
blank page we should have dropped) is recoverable via Paperless's UI; a
false-positive (dropping a real page) silently loses data. If real
pages get dropped in practice, **raise** `MDMS_BLANK_THRESHOLD` toward
0.99, which makes the strip step pickier and keeps more pages.

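To make the effect of the threshold concrete, here is a simplified, dependency-free sketch of the decision. It works on a flat list of grayscale values rather than a rendered PDF page, and the names `near_white_fraction` and `is_blank` are assumptions for this example; the real detector renders a thumbnail with PyMuPDF.

```python
from typing import Sequence


def near_white_fraction(pixels: Sequence[int], near_white: int = 240) -> float:
    """Fraction of grayscale pixels (0-255) at or above the near-white cutoff."""
    if not pixels:
        return 1.0
    return sum(1 for p in pixels if p >= near_white) / len(pixels)


def is_blank(pixels: Sequence[int], has_text: bool,
             threshold: float = 0.97, near_white: int = 240) -> bool:
    """A page is blank only if it has no text AND is almost entirely near-white."""
    if has_text:
        return False
    return near_white_fraction(pixels, near_white) >= threshold


# A page that is 98% white is dropped at 0.97 but kept at 0.99.
page = [255] * 98 + [0] * 2
dropped_at_default = is_blank(page, has_text=False)                  # True
dropped_at_strict = is_blank(page, has_text=False, threshold=0.99)   # False
```

Raising the threshold can only turn "blank" verdicts into "keep" verdicts, which is why it is the safe direction to tune.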
Edge cases handled inside `strip_blank_pages.py`:

- **1-page input:** strip is skipped entirely (single-page docs never
  have separator-backside artefacts).
- **All pages would drop:** the script exits with code `2` and writes no
  output. The mover keeps the file in the inbox and logs
  `WARNING: <name> appears all-blank, kept in inbox`. m can inspect via
  `journalctl --user -u mdms-mover`.
- **strip_blank_pages.py errors out:** mover falls back to a plain `mv`
  (unstripped) so a transient problem in the detector never blocks a
  scan from reaching Paperless.

The script is a uv-inline-deps single file (PyMuPDF for both rendering
and text extraction, plus Pillow for the pixel histogram; no
`poppler-utils` apt install on mdock). Mirrors the pattern from
`infra/paperless/generate_separator.py`.

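The exit-code contract between the strip script and the mover, as described in the edge cases above, can be summarised as a small lookup. This is a sketch for illustration only; `action_for_exit_code` and the action labels are hypothetical names, and the actual dispatch is the `case` statement in `mover.sh`.

```python
def action_for_exit_code(rc: int) -> str:
    """Map the strip script's exit code to what the mover does next."""
    if rc == 0:
        # Output was written (stripped or copied unchanged): promote it.
        return "promote-stripped-output"
    if rc == 2:
        # Every page looked blank: keep the original and log a warning.
        return "keep-in-inbox"
    # Any other code is treated as a detector failure: pass the scan through.
    return "promote-original-unstripped"
```

Treating every unknown code as "pass through" keeps a broken detector from ever holding a scan back, at the cost of occasionally promoting a file with blank pages.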
## Deploy / sync

The live files on mDock must match this directory byte-for-byte (checked
by md5, same convention as `infra/samba-canon/`).

```bash
ssh mdock 'mkdir -p ~/mdms-mover ~/.config/systemd/user ~/.local/state/mdms-mover ~/.local/bin'

# uv binary (single static binary, user-space — no apt, no sudo)
rsync -av ~/.local/bin/uv mdock:/home/m/.local/bin/uv

# mover + strip script
scp infra/mdms-mover/mover.sh             mdock:/home/m/mdms-mover/mover.sh
scp infra/mdms-mover/strip_blank_pages.py mdock:/home/m/mdms-mover/strip_blank_pages.py
scp infra/mdms-mover/mdms-mover.service   mdock:/home/m/.config/systemd/user/
scp infra/mdms-mover/mdms-mover.timer     mdock:/home/m/.config/systemd/user/

ssh mdock 'chmod +x ~/mdms-mover/mover.sh ~/mdms-mover/strip_blank_pages.py && \
  systemctl --user daemon-reload && \
  systemctl --user enable --now mdms-mover.timer'
```

The first time the strip script runs, `uv` downloads Python and the
script's dependencies into `~/.cache/uv/` (~30 MB). Subsequent runs reuse
the cache.

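One way to check the byte-for-byte requirement is to compare md5 digests of the repo files against copies fetched from mDock. The sketch below assumes the deployed files have already been copied into a local directory (for example with `scp`); the helper names `md5_of` and `out_of_sync` are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List, Union

PathLike = Union[str, Path]


def md5_of(path: PathLike) -> str:
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def out_of_sync(repo_dir: PathLike, deployed_dir: PathLike,
                names: Iterable[str]) -> List[str]:
    """List the file names whose repo and deployed copies differ."""
    return [
        name for name in names
        if md5_of(Path(repo_dir) / name) != md5_of(Path(deployed_dir) / name)
    ]
```

An empty result from `out_of_sync("infra/mdms-mover", fetched_dir, ["mover.sh", "strip_blank_pages.py"])` would indicate the two copies match, where `fetched_dir` is wherever the deployed files were copied to.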
## Verify

```bash
ssh mdock 'systemctl --user list-timers mdms-mover.timer'
ssh mdock 'journalctl --user -u mdms-mover -n 20 --no-pager'
ssh mdock 'cat ~/.local/state/mdms-mover/state.tsv'
ssh mdock 'journalctl -t mdms-mover -n 20 --no-pager'
```

## Emergency disable

Stop the timer entirely:

```bash
ssh mdock 'systemctl --user stop mdms-mover.timer && \
  systemctl --user disable mdms-mover.timer'
```

Or just disable the strip step while keeping the stability gate:

```bash
mkdir -p ~/.config/systemd/user/mdms-mover.service.d
cat > ~/.config/systemd/user/mdms-mover.service.d/override.conf <<EOF
[Service]
Environment=MDMS_STRIP_BLANK=false
EOF
systemctl --user daemon-reload
```

Re-enable the timer with `systemctl --user enable --now mdms-mover.timer`.

If you need to drain the inbox manually while disabled, files older
than a few minutes are safe to `mv` into `toprocess/` by hand —
Paperless will pick them up on its next poll.

## Logs

Service logs land in the user journal under unit `mdms-mover`, and
moved-file events also go through `logger -t mdms-mover` so they appear
under that tag in the system journal too:

```bash
ssh mdock 'journalctl --user -u mdms-mover -f'   # service execution
ssh mdock 'journalctl -t mdms-mover -f'          # moved-file events
```

## Refs

- mDMS#2 — blank-page strip (this README)
- otto#438 — original scheduler / staging-folder design
- otto#429 — original Paperless pipeline setup
- otto#431 — samba-canon bridge container (upstream of this mover)
- `docs/strategy.md` — overall mDMS dataset layout
- `infra/paperless/generate_separator.py` — sibling uv-inline-deps script
11
infra/mdms-mover/mdms-mover.service
Normal file
@@ -0,0 +1,11 @@
[Unit]
Description=mDMS mover — promote stable scanner files inbox → toprocess
After=network-online.target remote-fs.target
Wants=network-online.target remote-fs.target

[Service]
Type=oneshot
ExecStart=%h/mdms-mover/mover.sh

[Install]
WantedBy=default.target
12
infra/mdms-mover/mdms-mover.timer
Normal file
@@ -0,0 +1,12 @@
[Unit]
Description=Run mDMS mover every minute
Requires=mdms-mover.service

[Timer]
OnBootSec=2min
OnUnitActiveSec=1min
AccuracySec=10s
Unit=mdms-mover.service

[Install]
WantedBy=timers.target default.target
93
infra/mdms-mover/mover.sh
Executable file
@@ -0,0 +1,93 @@
#!/bin/bash
# mdms-mover: move stable files from /mnt/mdms/inbox → /mnt/mdms/toprocess.
#
# A file is "stable" when it satisfies BOTH conditions:
#   1. mtime older than MIN_AGE_MIN minutes (default 3).
#   2. size unchanged since the previous run (recorded in STATE).
#
# This protects Paperless from ingesting half-written scans dropped by the
# Canon MB5100 via SMB. See otto#438, mDMS#2.
#
# When MDMS_STRIP_BLANK=true (default) and the file is a PDF, blank pages
# are stripped before promotion (mDMS#2). Empty backsides of patch-T
# separators from duplex scans land here. See strip_blank_pages.py for the
# detection heuristic.

set -euo pipefail

INBOX="${MDMS_INBOX:-/mnt/mdms/inbox}"
TOPROCESS="${MDMS_TOPROCESS:-/mnt/mdms/toprocess}"
STATE="${MDMS_STATE:-$HOME/.local/state/mdms-mover/state.tsv}"
MIN_AGE_MIN="${MDMS_MIN_AGE_MIN:-3}"
STRIP_BLANK="${MDMS_STRIP_BLANK:-true}"

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
STRIP_SCRIPT="${MDMS_STRIP_SCRIPT:-$SCRIPT_DIR/strip_blank_pages.py}"

mkdir -p "$TOPROCESS" "$(dirname "$STATE")"
touch "$STATE"

NEW_STATE=$(mktemp)
trap 'rm -f "$NEW_STATE"' EXIT

# Promote a single stable file from inbox into toprocess, blank-stripping
# PDFs when enabled. Returns silently; logs go through logger(1).
promote() {
    local src="$1" name="$2" size="$3"
    local ext="${name##*.}"
    local dest="$TOPROCESS/$name"

    if [[ "$STRIP_BLANK" != "true" || "${ext,,}" != "pdf" || ! -x "$STRIP_SCRIPT" ]]; then
        if mv -n "$src" "$dest" 2>/dev/null; then
            logger -t mdms-mover "moved $name ($size bytes)"
        fi
        return
    fi

    # Stage stripped output inside toprocess (same filesystem → atomic rename).
    # Dotfile prefix so Paperless's consumer ignores the partial during write.
    local tmpout="$TOPROCESS/.mdms-tmp.$$.$name"
    local rc=0
    "$STRIP_SCRIPT" "$src" "$tmpout" || rc=$?

    case "$rc" in
        0)
            # Log success only if the stripped file actually reached toprocess.
            if mv -f "$tmpout" "$dest"; then
                rm -f "$src"
                logger -t mdms-mover "moved $name ($size bytes, strip ok)"
            else
                rm -f "$tmpout"
                logger -t mdms-mover "ERROR: could not promote stripped $name, kept in inbox"
            fi
            ;;
        2)
            rm -f "$tmpout"
            logger -t mdms-mover "WARNING: $name appears all-blank, kept in inbox"
            ;;
        *)
            rm -f "$tmpout"
            logger -t mdms-mover "strip failed for $name (rc=$rc), passing through unchanged"
            if mv -n "$src" "$dest" 2>/dev/null; then
                logger -t mdms-mover "moved $name ($size bytes, unstripped)"
            fi
            ;;
    esac
}

# Iterate top-level regular files older than MIN_AGE_MIN.
# Skip dotfiles (probe files, scanner temp markers like ._foo, our .mdms-tmp.*).
while IFS= read -r f; do
    name=$(basename "$f")
    case "$name" in
        .*) continue ;;
    esac

    if ! size=$(stat -c %s "$f" 2>/dev/null); then
        continue
    fi

    # state.tsv is tab-separated; split on tabs only so names with spaces match.
    prev=$(awk -F'\t' -v n="$name" '$1==n {print $2; exit}' "$STATE")
    printf '%s\t%s\n' "$name" "$size" >> "$NEW_STATE"

    if [[ -n "$prev" && "$size" == "$prev" ]]; then
        promote "$f" "$name" "$size"
    fi
done < <(find "$INBOX" -maxdepth 1 -type f -mmin "+$MIN_AGE_MIN")

mv "$NEW_STATE" "$STATE"
trap - EXIT
122
infra/mdms-mover/strip_blank_pages.py
Executable file
@@ -0,0 +1,122 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "pymupdf>=1.24",
#   "Pillow>=10.0",
# ]
# ///
"""Strip blank pages from a PDF — used by mdms-mover before promoting to toprocess.

Usage:
    strip_blank_pages.py <input.pdf> <output.pdf>

Exit codes:
    0  output.pdf written (either stripped or copied unchanged)
    2  all pages would be dropped — output NOT written, caller should keep
       the original file in the inbox and log a warning
    1  error (input unreadable, write failed, etc.)

A page counts as "blank" iff BOTH of:
    * embedded text is empty / whitespace-only, AND
    * rendered thumbnail is >= MDMS_BLANK_THRESHOLD near-white pixels.

False-negatives are preferred over false-positives — borderline pages stay.

Env:
    MDMS_BLANK_THRESHOLD     near-white pixel ratio (0.0-1.0, default 0.97)
    MDMS_BLANK_NEAR_WHITE    near-white cutoff in 0-255 grayscale (default 240)
    MDMS_BLANK_DPI           thumbnail render DPI (default 50)

PyMuPDF is used instead of pdf2image+pikepdf+pypdf so the whole pipeline is
one self-contained wheel — no poppler-utils apt-install on mdock, no
multiple text-extraction libraries to keep in sync.
"""
from __future__ import annotations

import os
import shutil
import sys
from pathlib import Path

import fitz  # PyMuPDF
from PIL import Image


def near_white_ratio(image: Image.Image, near_white: int) -> float:
    gray = image.convert("L") if image.mode != "L" else image
    hist = gray.histogram()
    total = sum(hist)
    if total == 0:
        return 1.0
    return sum(hist[near_white:]) / total


def page_is_blank(page: "fitz.Page", threshold: float, near_white: int, dpi: int) -> bool:
    text = (page.get_text("text") or "").strip()
    if text:
        return False
    pix = page.get_pixmap(dpi=dpi, colorspace=fitz.csGRAY)
    image = Image.frombytes("L", (pix.width, pix.height), pix.samples)
    return near_white_ratio(image, near_white) >= threshold


def main() -> int:
    if len(sys.argv) != 3:
        print(f"usage: {sys.argv[0]} <input.pdf> <output.pdf>", file=sys.stderr)
        return 1

    src = Path(sys.argv[1])
    dst = Path(sys.argv[2])

    threshold = float(os.environ.get("MDMS_BLANK_THRESHOLD", "0.97"))
    near_white = int(os.environ.get("MDMS_BLANK_NEAR_WHITE", "240"))
    dpi = int(os.environ.get("MDMS_BLANK_DPI", "50"))

    try:
        doc = fitz.open(src)
    except Exception as exc:
        print(f"failed to open {src}: {exc}", file=sys.stderr)
        return 1

    try:
        page_count = doc.page_count

        if page_count <= 1:
            shutil.copyfile(src, dst)
            return 0

        keep: list[int] = []
        for i in range(page_count):
            if not page_is_blank(doc[i], threshold, near_white, dpi):
                keep.append(i)

        if not keep:
            print(f"all pages blank in {src.name}", file=sys.stderr)
            return 2

        if len(keep) == page_count:
            shutil.copyfile(src, dst)
            return 0

        out = fitz.open()
        try:
            for i in keep:
                out.insert_pdf(doc, from_page=i, to_page=i)
            out.save(dst)
        finally:
            out.close()

        dropped = page_count - len(keep)
        print(
            f"{src.name}: dropped {dropped}/{page_count} blank page(s)",
            file=sys.stderr,
        )
        return 0
    finally:
        doc.close()


if __name__ == "__main__":
    sys.exit(main())
@@ -21,4 +21,28 @@ in the repo. Hashes:
 | setup.js.patched | ~/paperless/build/setup.js.patched | `04cb5fbfaed13a5f25612af0b79dd90c` |
 | server.js.patched | ~/paperless/build/server.js.patched | `eadcbb86048127f2c80632ae77bbc2a0` |
 
-See `docs/research/issue-429-paperless-pipeline.md` for the why.
+See `docs/research/issue-429-paperless-pipeline.md` in `m/otto` for the
+original pipeline rebuild (issue otto#429).
+
+## SYSTEM_PROMPT deploy mechanism
+
+`SYSTEM_PROMPT.txt` is the source of truth. It is a template — the
+`{{CORRESPONDENTS_LIST}}` placeholder is rendered at deploy time by
+fetching the live correspondents from Paperless. The live prompt is
+inside `paperless-ai`'s `/app/data/.env` (volume `paperless_aidata`) as
+the backtick-delimited `SYSTEM_PROMPT=\`…\`` block.
+
+Deploy with `push_system_prompt.py`:
+
+```bash
+python3 push_system_prompt.py           # dry run: preview only, nothing is written
+python3 push_system_prompt.py --apply   # write + restart paperless-ai
+```
+
+The script filters recipient-only names (Matthias / Mathias Siebels)
+out of the rendered list — see `RECIPIENT_EXCLUDE` in the script and
+the matching rule at the top of the Correspondents section in
+`SYSTEM_PROMPT.txt`. If you edit either, edit both.
+
+The previous live `.env` is kept on mDock as `.env.bak.<ts>` next to the
+new one for rollback.
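The template-rendering step described in this README section can be sketched as follows. This is a simplified, offline illustration that combines the recipient filter and the placeholder substitution in one function; the constant names mirror those in `push_system_prompt.py`, but the combined function `render_with_filter` is a hypothetical helper, and no Paperless API call is made here.

```python
from typing import Iterable

PLACEHOLDER = "{{CORRESPONDENTS_LIST}}"
# Case-insensitive substrings for names that are recipients, never senders.
RECIPIENT_EXCLUDE = ("matthias siebels", "mathias siebels")


def render_with_filter(template: str, names: Iterable[str]) -> str:
    """Drop recipient names, sort the rest, and fill the placeholder."""
    kept = [
        n for n in names
        if not any(x in n.lower() for x in RECIPIENT_EXCLUDE)
    ]
    listing = "\n".join(f"- {n}" for n in sorted(kept, key=str.lower))
    return template.replace(PLACEHOLDER, listing)


preview = render_with_filter(
    "Known correspondents:\n{{CORRESPONDENTS_LIST}}\nEnd",
    ["eprimo", "Herrn Matthias Siebels", "DAK-Gesundheit"],
)
```

Because the filter is a substring match, any variant that contains one of the excluded spellings is removed, while the underlying correspondent records stay untouched in Paperless.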
@@ -21,4 +21,52 @@ Bei medizinischen Dokumenten Tag Gesundheit setzen.
 Bei steuerrelevanten Dokumenten Tag Steuer setzen.
 Bei Dokumenten mit Frist Tag Frist setzen.
 
-Correspondents: Verwende den vollen offiziellen Namen der Organisation oder Person (z.B. "DAK-Gesundheit" nicht "DAK-Gesundheit Postzentrum, 22778 Hamburg"). Keine Adressen im Namen. Pruefe ob der Correspondent schon existiert bevor du einen neuen anlegst.
+Erfinde NIEMALS neue Tags. Erfinde NIEMALS neue Document Types. Bei Unsicherheit: Document Type = Information, keine zusätzlichen Tags.
+
+Correspondents — WICHTIG, in dieser Reihenfolge:
+
+1. EMPFAENGER NIEMALS als Correspondent: Matthias Siebels (alle Schreibweisen — Mathias, Mathhias, Siebels, MS, "Herr Siebels", "Herrn Matthias Siebels", "Empfaengeradresse Windscheidstr. 33") ist der EMPFAENGER nahezu jedes Dokuments in diesem DMS. NIEMALS als Correspondent setzen, auch wenn der Name in der Absenderzeile zu lesen ist (z.B. wenn der OCR die Empfaengeradresse als Absender fehlinterpretiert). Gleiches gilt sinngemaess fuer Paul Siebels — Paul ist meistens Empfaenger (Bescheide, Rechnungen, Steuerbescheide an Paul). Verwende Paul Siebels nur dann als Correspondent, wenn Paul nachweislich Autor des Dokuments ist (z.B. eigener Brief, Schadensmeldung von Paul).
+
+2. Der Correspondent ist die Organisation oder Person, die das Dokument GESENDET / GESCHRIEBEN hat. In den seltenen Faellen, in denen m (Matthias) selbst Autor ist (z.B. eigene Briefe an Behoerden, eigene Umsatzsteuer-Voranmeldung), waehle Document Type = Personal Correspondence und Correspondent = die EMPFAENGENDE Organisation (z.B. "Finanzamt Düsseldorf-Mitte").
+
+3. Bevorzuge existierende Correspondents bei klarer semantischer Aehnlichkeit (Fuzzy-Regel unten). Wenn der OCR-Absender wirklich neu ist (z.B. ein neuer Versorger, Vermieter, Arzt, Dienstleister, Anwalt, Mandant, Versicherer), lege einen neuen Correspondent an, statt zwanghaft auf den naechsten existierenden Namen zu mappen.
+
+4. INTRA-SCAN DEDUP: Bevor du einen neuen Correspondent anlegst, pruefe ob du in dieser Sitzung (gleicher Scan-Batch, gleicher Verarbeitungslauf) bereits einen Correspondent mit aehnlichem Namen angelegt hast — verwende dann den existierenden (denselben Namen unveraendert), statt eine weitere Variante anzulegen. Konkret: kommen in einem Scan mehrere Dokumente vom gleichen Sender vor (z.B. zwei Rechnungen derselben Arztpraxis, mehrere Schreiben desselben Versorgers), MUSS der Correspondent-Name bei jedem dieser Dokumente identisch sein. Im Zweifel waehle die laengste / vollstaendigste Form, die du in diesem Scan gesehen hast.
+
+Fuzzy-Regel: Wenn der OCR-Absendername bis auf Kleinschreibung, Akzente, Tippfehler, Anrede ("Herr"/"Frau"/"Herrn"), Adresszusatz, Personenname als Ansprechpartner oder Rechtsform-Suffix (GmbH/AG/eG/e.V./LLP/KG/mbH/VVaG) einem existierenden Correspondent entspricht, verwende den existierenden Namen UNVERAENDERT. Bei substantiell anderen Namen (anderer Stamm, andere Branche, andere Firmierung) lege einen neuen an.
+
+Beim Vergleich gilt: Ist der OCR-Name ein striktes Praefix eines existierenden Correspondents (oder umgekehrt), und stimmen die ersten 2 Brand-Tokens ueberein (Token = Wort, das nicht Rechtsform-Suffix, Adresse oder Anrede ist), verwende den existierenden Correspondent. Das gilt sowohl fuer Kurzformen ohne Rechtsform-Suffix ("Hogan Lovells" -> "Hogan Lovells International LLP") als auch fuer den umgekehrten Fall, wenn die existierende Form kuerzer ist als die OCR-Form.
+
+Beispiele:
+- "Hogan Lovells lnternational LLP" (OCR-Variante) -> "Hogan Lovells International LLP" (existiert)
+- "Hogan Lovells" (Kurzform ohne Rechtsform) -> "Hogan Lovells International LLP" (existiert; OCR-Name ist Praefix, erste 2 Brand-Tokens stimmen)
+- "eprimo CmbH" -> "eprimo" (existiert)
+- "Helios Klinikum Duisburg GmbH" -> "Helios Klinikum Duisburg" (existiert)
+- "Kundenservice von eprimo" -> "eprimo" (existiert)
+- "Ammerländer Versicherung VVaG" -> "Ammerländer Versicherung" (existiert; Rechtsform weglassen)
+- "ING-DiBa AG, Theodor-Heuss-Allee 2, 60486 Frankfurt am Main" -> "ING-DiBa AG" (existiert; Adresse weglassen)
+- "Vattenfall Europe Sales GmbH" -> "Vattenfall" (existiert; konsolidiere Konzernvarianten)
+- Brief von einem NEUEN Versorger "Stadtwerke XYZ" -> neu anlegen als "Stadtwerke XYZ" (NICHT auf "eprimo" oder "Vodafone" mappen, nur weil das der naechste existierende Versorger ist)
+- Drei Dokumente einer neuen Praxis im selben Scan: erstes Dokument legt Correspondent "Praxis Dr. Mustermann" an, zweites und drittes Dokument verwenden GENAU diesen Namen, auch wenn der OCR "Dr. Mustermann" oder "Praxis fuer XYZ" liest (siehe Regel 4).
+
+Beim Anlegen neuer Correspondents: voller offizieller Name der Organisation/Person, KEINE Adresse, KEINE Anrede, KEINE Rechtsform-Suffixe in Reinform (GmbH/AG/etc. nur dann mit aufnehmen, wenn sie Teil der Markenidentitaet sind, z.B. "DKB Grund GmbH").
+
+Aktuelle Correspondents-Liste (aus dieser pruefe als ERSTES, ob einer passt — Eintraege mit Matthias/Mathias Siebels sind absichtlich nicht enthalten, siehe Regel 1):
+{{CORRESPONDENTS_LIST}}
+
+Titel-Generierung (PFLICHT, deutsch, 5-80 Zeichen):
+- Format: "{Absender-Kurzform} - {Worum es geht}"
+- "{Absender-Kurzform}" = Correspondent in kurzer Form (z.B. "DAK", "Finanzamt", "Hogan Lovells", "Vodafone")
+- "{Worum es geht}" = 2-6 Woerter, die den Inhalt konkret beschreiben (z.B. "Beitragsrechnung Q1", "Grundsteuerbescheid 2024", "Gehaltsabrechnung Januar 2025", "GigaTV Vertragsverlaengerung")
+- Bei Rechnungen / Bescheiden: Vorgangs- bzw. Rechnungsnummer in den Titel aufnehmen, wenn vorhanden (z.B. "DAK - Beitragsrechnung 2025-Q1 (Nr. 4711)")
+- Keine generischen Woerter wie "Dokument", "Datei", "Scan", "PDF", "Schreiben" als alleinige Beschreibung
+- Keine Datums-Strings im Titel (das Datum erscheint schon im Storage Path)
+- Keine Anrede ("Sehr geehrter Herr") und keine Floskeln
+- Beispiele guter Titel:
+  - "DAK - Beitragsrechnung Q1"
+  - "Finanzamt - Grundsteuerbescheid 2024"
+  - "Hogan Lovells - Gehaltsabrechnung Januar"
+  - "Vodafone - GigaTV Vertragsverlaengerung"
+  - "AOK - Mitgliedsbescheinigung"
+- Bei unklarem Inhalt Fallback: "Information - {Sender-Name}" (z.B. "Information - Stadtwerke Muenchen")
+- Der Titel wird im JSON-Feld "title" zurueckgegeben.
187
infra/paperless/push_system_prompt.py
Normal file
@@ -0,0 +1,187 @@
"""
Render SYSTEM_PROMPT.txt with the live correspondent list and push it to
the paperless-ai container's /app/data/.env on mDock.

The repo SYSTEM_PROMPT.txt is the template (with the placeholder
{{CORRESPONDENTS_LIST}}). This script:

  1. Reads the current correspondents from the Paperless API.
  2. Filters out names that must never appear as correspondent
     (recipients of m's mail — see RECIPIENT_EXCLUDE).
  3. Renders the prompt by substituting the placeholder.
  4. Reads the live /app/data/.env from the paperless-ai container.
  5. Replaces the SYSTEM_PROMPT=`…` block.
  6. Backs up the old .env (.bak.<ts>) and writes the new one.
  7. Restarts the paperless-ai container.

Dry-run is the default: prints the would-be rendered prompt without
writing.

Usage:
    python3 push_system_prompt.py            # dry run
    python3 push_system_prompt.py --apply    # write + restart

Migrated into m/mDMS from m/otto on 2026-05-16 (mDMS#3).
"""
import argparse
import datetime
import json
import os
import subprocess
import sys


PAPERLESS_HOST = "mdock"
PAPERLESS_AI_CONTAINER = "paperless-ai"
PAPERLESS_WEB_CONTAINER = "paperless-webserver-1"
ENV_PATH = "/app/data/.env"
HERE = os.path.dirname(os.path.abspath(__file__))
TEMPLATE_PATH = os.path.join(HERE, "SYSTEM_PROMPT.txt")
PLACEHOLDER = "{{CORRESPONDENTS_LIST}}"

# Names that are m or his household — recipients, never correspondents.
# Substring match, case-insensitive. Keep the actual correspondent records
# in Paperless (data integrity for historical doc assignments), but never
# show them to the LLM as candidate senders.
RECIPIENT_EXCLUDE = ("matthias siebels", "mathias siebels")


def get_token() -> str:
    # "-f2-" keeps everything after the first "=", so tokens containing "="
    # are not truncated.
    out = subprocess.run(
        ["ssh", PAPERLESS_HOST,
         f"docker exec {PAPERLESS_AI_CONTAINER} sh -c "
         f"'grep ^PAPERLESS_API_TOKEN {ENV_PATH} | cut -d= -f2-'"],
        capture_output=True, text=True, timeout=15,
    )
    token = out.stdout.strip()
    # Fail early instead of sending an empty token to the API.
    if out.returncode != 0 or not token:
        raise RuntimeError(f"could not read PAPERLESS_API_TOKEN: {out.stderr}")
    return token


def fetch_correspondents(token: str) -> list[str]:
    cmd = (
        f"docker exec {PAPERLESS_WEB_CONTAINER} "
        f"curl -s -H 'Authorization: Token {token}' "
        f"'http://localhost:8000/api/correspondents/?page_size=500'"
    )
    out = subprocess.run(
        ["ssh", PAPERLESS_HOST, cmd],
        capture_output=True, text=True, timeout=30,
    )
    if out.returncode != 0:
        raise RuntimeError(f"fetch failed: {out.stderr}")
    data = json.loads(out.stdout)
    names = [c["name"] for c in data["results"]]
    filtered = [n for n in names
                if not any(x in n.lower() for x in RECIPIENT_EXCLUDE)]
    dropped = sorted(set(names) - set(filtered))
    if dropped:
        print(f"filtered out recipient-names: {dropped}")
    return sorted(filtered, key=lambda s: s.lower())


def render_prompt(template: str, names: list[str]) -> str:
    listing = "\n".join(f"- {n}" for n in names)
    return template.replace(PLACEHOLDER, listing)


def read_remote_env() -> str:
    out = subprocess.run(
        ["ssh", PAPERLESS_HOST,
         f"docker exec {PAPERLESS_AI_CONTAINER} cat {ENV_PATH}"],
        capture_output=True, text=True, timeout=15,
    )
    if out.returncode != 0:
        raise RuntimeError(f"cat failed: {out.stderr}")
    return out.stdout


def replace_system_prompt(env: str, new_prompt: str) -> str:
    """Replace the SYSTEM_PROMPT=`…` block with the new one.

    Paperless-AI's .env uses backtick-delimited values for multi-line
    settings (JS .env loader convention; bash would not accept this).
    """
    lines = env.splitlines(keepends=True)
    out = []
    inside = False
    replaced = False
    for line in lines:
        if not inside and line.startswith("SYSTEM_PROMPT="):
            out.append(f"SYSTEM_PROMPT=`{new_prompt.rstrip()}`\n")
            replaced = True
            stripped_value = line[len("SYSTEM_PROMPT="):].rstrip("\n")
            # Opening and closing backtick on the same line: single-line value.
            if stripped_value.startswith("`") and stripped_value.count("`") >= 2:
                continue
            inside = True
            continue
        if inside:
            # Skip old block lines up to and including the closing backtick.
            if "`" in line:
                inside = False
            continue
        out.append(line)
    if not replaced:
        raise SystemExit("SYSTEM_PROMPT= line not found in .env")
    return "".join(out)


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--apply", action="store_true",
                    help="Write new .env and restart paperless-ai")
    args = ap.parse_args()

    with open(TEMPLATE_PATH, encoding="utf-8") as f:
        template = f.read()
    if PLACEHOLDER not in template:
        sys.exit(f"template missing placeholder {PLACEHOLDER}")

    token = get_token()
    names = fetch_correspondents(token)
    print(f"fetched {len(names)} live correspondents (after recipient filter)")
    rendered = render_prompt(template, names)
    print(f"rendered prompt: {len(rendered)} chars, {len(rendered.splitlines())} lines")

    env_before = read_remote_env()
    env_after = replace_system_prompt(env_before, rendered)
    if env_before == env_after:
        print("no change — live prompt already matches rendered template")
        return

    if not args.apply:
        print("--- new SYSTEM_PROMPT block ---")
        for line in env_after.splitlines():
            if line.startswith("SYSTEM_PROMPT="):
                print(line[:200] + ("…" if len(line) > 200 else ""))
        print()
        print("DRY RUN — re-run with --apply to write + restart paperless-ai")
        return

    # utcnow() is deprecated since Python 3.12; use an aware UTC timestamp.
    ts = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%S")
    backup = f"{ENV_PATH}.bak.{ts}"
    subprocess.run(
        ["ssh", PAPERLESS_HOST,
         f"docker exec {PAPERLESS_AI_CONTAINER} cp {ENV_PATH} {backup}"],
        check=True, timeout=15,
    )
    print(f"backup: {backup}")

    write_cmd = (
        f"docker exec -i {PAPERLESS_AI_CONTAINER} "
        f"sh -c 'cat > {ENV_PATH}'"
    )
    proc = subprocess.run(
        ["ssh", PAPERLESS_HOST, write_cmd],
        input=env_after, capture_output=True, text=True, timeout=30,
    )
    if proc.returncode != 0:
        sys.exit(f"write failed: {proc.stderr}")
    print(f"wrote {len(env_after)} bytes to {ENV_PATH}")

    subprocess.run(
        ["ssh", PAPERLESS_HOST, f"docker restart {PAPERLESS_AI_CONTAINER}"],
        check=True, timeout=60,
    )
    print(f"restarted {PAPERLESS_AI_CONTAINER}")


if __name__ == "__main__":
    main()