Merge mai/hermes/issue-3-paperless-ai: paperless-AI prompt fix + drift reconciliation (#3)
@@ -21,4 +21,28 @@ in the repo. Hashes:
 | setup.js.patched | ~/paperless/build/setup.js.patched | `04cb5fbfaed13a5f25612af0b79dd90c` |
 | server.js.patched | ~/paperless/build/server.js.patched | `eadcbb86048127f2c80632ae77bbc2a0` |

-See `docs/research/issue-429-paperless-pipeline.md` for the why.
+See `docs/research/issue-429-paperless-pipeline.md` in `m/otto` for the
+original pipeline rebuild (issue otto#429).
+
+## SYSTEM_PROMPT deploy mechanism
+
+`SYSTEM_PROMPT.txt` is the source of truth. It is a template: the
+`{{CORRESPONDENTS_LIST}}` placeholder is rendered at deploy time by
+fetching the live correspondents from Paperless. The live prompt is stored
+in `paperless-ai`'s `/app/data/.env` (volume `paperless_aidata`) as
+the backtick-delimited `SYSTEM_PROMPT=\`…\`` block.
+
+Deploy with `push_system_prompt.py`:
+
+```bash
+python3 push_system_prompt.py            # dry run: preview only, nothing is written
+python3 push_system_prompt.py --apply    # write + restart paperless-ai
+```
+
+The script filters recipient-only names (Matthias / Mathias Siebels)
+out of the rendered list; see `RECIPIENT_EXCLUDE` in the script and
+the matching rule at the top of the Correspondents section in
+`SYSTEM_PROMPT.txt`. If you edit either, edit both.
+
+The previous live `.env` is kept on mDock as `.env.bak.<ts>` next to the
+new one for rollback.
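The README names the `.env.bak.<ts>` backups but gives no rollback steps. The following is a minimal sketch of picking the newest backup from a directory listing, assuming the `%Y%m%dT%H%M%S` timestamp format that `push_system_prompt.py` uses. The helper name `latest_backup` and the sample file names are hypothetical; the actual restore would still be a manual copy inside the container followed by a restart.

```python
import re

# Assumed naming: ".env.bak.<ts>" with ts formatted as %Y%m%dT%H%M%S.
# latest_backup is a hypothetical helper, not part of this PR.
BACKUP_RE = re.compile(r"^\.env\.bak\.(\d{8}T\d{6})$")


def latest_backup(filenames):
    """Return the newest .env.bak.<ts> name, or None if there is none."""
    stamped = []
    for name in filenames:
        match = BACKUP_RE.match(name)
        if match:
            stamped.append((match.group(1), name))
    # Fixed-width timestamps sort chronologically as plain strings.
    return max(stamped)[1] if stamped else None


if __name__ == "__main__":
    listing = [".env", ".env.bak.20260516T101500", ".env.bak.20260517T083000"]
    print(latest_backup(listing))
```

Because the timestamp is fixed-width and ordered from year down to second, no date parsing is needed to find the most recent file.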
@@ -21,4 +21,46 @@ For medical documents, set the tag Gesundheit.
 For tax-relevant documents, set the tag Steuer.
 For documents with a deadline, set the tag Frist.

-Correspondents: Use the full official name of the organization or person (e.g. "DAK-Gesundheit", not "DAK-Gesundheit Postzentrum, 22778 Hamburg"). No addresses in the name. Check whether the correspondent already exists before creating a new one.
+NEVER invent new tags. NEVER invent new document types. When unsure: Document Type = Information, no additional tags.
+
+Correspondents (IMPORTANT, in this order):
+
+1. NEVER use the RECIPIENT as correspondent: Matthias Siebels (all spellings: Mathias, Mathhias, Siebels, MS, "Herr Siebels", "Herrn Matthias Siebels", the recipient address "Windscheidstr. 33") is the RECIPIENT of nearly every document in this DMS. NEVER set him as correspondent, even if the name appears in the sender line (e.g. when OCR misreads the recipient address as the sender). The same applies analogously to Paul Siebels: Paul is usually the recipient (notices, invoices, tax assessments addressed to Paul). Use Paul Siebels as correspondent only if Paul is demonstrably the author of the document (e.g. his own letter, a damage report filed by Paul).
+
+2. The correspondent is the organization or person that SENT / WROTE the document. In the rare cases where m (Matthias) is himself the author (e.g. his own letters to authorities, his own advance VAT return), choose Document Type = Personal Correspondence and Correspondent = the RECEIVING organization (e.g. "Finanzamt Düsseldorf-Mitte").
+
+3. Prefer existing correspondents when there is clear semantic similarity (fuzzy rule below). If the OCR sender is genuinely new (e.g. a new utility, landlord, doctor, service provider, lawyer, client, insurer), create a new correspondent instead of forcing a match to the nearest existing name.
+
+Fuzzy rule: If the OCR sender name matches an existing correspondent except for letter case, accents, typos, salutation ("Herr"/"Frau"/"Herrn"), address suffix, a contact person's name, or legal-form suffix (GmbH/AG/eG/e.V./LLP/KG/mbH/VVaG), use the existing name UNCHANGED. For substantially different names (different stem, different industry, different company name), create a new one.
+
+Examples:
+- "Hogan Lovells lnternational LLP" (OCR variant) -> "Hogan Lovells International LLP" (exists)
+- "eprimo CmbH" -> "eprimo" (exists)
+- "Helios Klinikum Duisburg GmbH" -> "Helios Klinikum Duisburg" (exists)
+- "Kundenservice von eprimo" -> "eprimo" (exists)
+- "Ammerländer Versicherung VVaG" -> "Ammerländer Versicherung" (exists; drop the legal form)
+- "ING-DiBa AG, Theodor-Heuss-Allee 2, 60486 Frankfurt am Main" -> "ING-DiBa AG" (exists; drop the address)
+- "Vattenfall Europe Sales GmbH" -> "Vattenfall" (exists; consolidate group variants)
+- Letter from a NEW utility "Stadtwerke XYZ" -> create as "Stadtwerke XYZ" (do NOT map to "eprimo" or "Vodafone" just because that is the nearest existing utility)
+
+When creating new correspondents: full official name of the organization/person, NO address, NO salutation, NO bare legal-form suffixes (include GmbH/AG/etc. only if they are part of the brand identity, e.g. "DKB Grund GmbH").
+
+Current correspondents list (check this FIRST for a match; entries containing Matthias/Mathias Siebels are intentionally omitted, see rule 1):
+{{CORRESPONDENTS_LIST}}
+
+Title generation (MANDATORY, German, 5-80 characters):
+- Format: "{Absender-Kurzform} - {Worum es geht}"
+- "{Absender-Kurzform}" = the correspondent in short form (e.g. "DAK", "Finanzamt", "Hogan Lovells", "Vodafone")
+- "{Worum es geht}" = 2-6 words that describe the content concretely (e.g. "Beitragsrechnung Q1", "Grundsteuerbescheid 2024", "Gehaltsabrechnung Januar 2025", "GigaTV Vertragsverlaengerung")
+- For invoices / notices: include the case or invoice number in the title if present (e.g. "DAK - Beitragsrechnung 2025-Q1 (Nr. 4711)")
+- No generic words such as "Dokument", "Datei", "Scan", "PDF", "Schreiben" as the sole description
+- No date strings in the title (the date already appears in the storage path)
+- No salutation ("Sehr geehrter Herr") and no filler phrases
+- Examples of good titles:
+  - "DAK - Beitragsrechnung Q1"
+  - "Finanzamt - Grundsteuerbescheid 2024"
+  - "Hogan Lovells - Gehaltsabrechnung Januar"
+  - "Vodafone - GigaTV Vertragsverlaengerung"
+  - "AOK - Mitgliedsbescheinigung"
+- Fallback for unclear content: "Information - {Sender-Name}" (e.g. "Information - Stadtwerke Muenchen")
+- The title is returned in the JSON field "title".
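The title rules above are instructions to the model, so nothing in this PR enforces them. As an illustration only, a post-check on returned titles could look like the sketch below. The function `title_problems`, the generic-word set and the date pattern are assumptions made for this example; the word-count rule is deliberately not checked, since one of the listed good titles has a one-word description.

```python
import re

# Hypothetical post-check for titles returned by the model; not part of this PR.
GENERIC_WORDS = {"dokument", "datei", "scan", "pdf", "schreiben"}
# Full dates such as 2025-01-31 or 31.01.2025; a bare year or "2025-Q1" is allowed.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}\.\d{1,2}\.\d{2,4}\b")


def title_problems(title):
    """Return a list of rule violations for a generated title (empty if OK)."""
    problems = []
    if not 5 <= len(title) <= 80:
        problems.append("length must be 5-80 characters")
    sender, sep, subject = title.partition(" - ")
    if not sep or not sender.strip() or not subject.strip():
        problems.append('expected "{sender} - {subject}"')
    elif subject.strip().lower() in GENERIC_WORDS:
        problems.append("subject is only a generic word")
    if DATE_RE.search(title):
        problems.append("contains a date string")
    return problems


if __name__ == "__main__":
    for candidate in ("DAK - Beitragsrechnung Q1", "Vodafone - Scan"):
        print(candidate, "->", title_problems(candidate))
```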
187
infra/paperless/push_system_prompt.py
Normal file
@@ -0,0 +1,187 @@
"""
Render SYSTEM_PROMPT.txt with the live correspondent list and push it to
the paperless-ai container's /app/data/.env on mDock.

The repo SYSTEM_PROMPT.txt is the template (with the placeholder
{{CORRESPONDENTS_LIST}}). This script:

1. Reads the current correspondents from the Paperless API.
2. Filters out names that must never appear as correspondent
   (recipients of m's mail, see RECIPIENT_EXCLUDE).
3. Renders the prompt by substituting the placeholder.
4. Reads the live /app/data/.env from the paperless-ai container.
5. Replaces the SYSTEM_PROMPT=`…` block.
6. Backs up the old .env (.bak.<ts>) and writes the new one.
7. Restarts the paperless-ai container.

Dry-run is the default: it prints a truncated preview of the new
SYSTEM_PROMPT line without writing anything.

Usage:
    python3 push_system_prompt.py            # dry run
    python3 push_system_prompt.py --apply    # write + restart

Migrated into m/mDMS from m/otto on 2026-05-16 (mDMS#3).
"""
import argparse
import datetime
import json
import os
import subprocess
import sys


PAPERLESS_HOST = "mdock"
PAPERLESS_AI_CONTAINER = "paperless-ai"
PAPERLESS_WEB_CONTAINER = "paperless-webserver-1"
ENV_PATH = "/app/data/.env"
HERE = os.path.dirname(os.path.abspath(__file__))
TEMPLATE_PATH = os.path.join(HERE, "SYSTEM_PROMPT.txt")
PLACEHOLDER = "{{CORRESPONDENTS_LIST}}"

# Names that are m or his household: recipients, never correspondents.
# Substring match, case-insensitive. Keep the actual correspondent records
# in Paperless (data integrity for historical doc assignments), but never
# show them to the LLM as candidate senders.
RECIPIENT_EXCLUDE = ("matthias siebels", "mathias siebels")


def get_token() -> str:
    """Read PAPERLESS_API_TOKEN from the live paperless-ai .env."""
    out = subprocess.run(
        ["ssh", PAPERLESS_HOST,
         f"docker exec {PAPERLESS_AI_CONTAINER} sh -c "
         # -f2- keeps the whole value even if the token contains '='.
         f"'grep ^PAPERLESS_API_TOKEN= {ENV_PATH} | cut -d= -f2-'"],
        capture_output=True, text=True, timeout=15,
    )
    token = out.stdout.strip()
    # The pipeline's exit status is cut's, so also check for an empty value.
    if out.returncode != 0 or not token:
        raise RuntimeError(f"could not read API token: {out.stderr}")
    return token


def fetch_correspondents(token: str) -> list[str]:
    """Return the live correspondent names, minus recipient-only names."""
    cmd = (
        f"docker exec {PAPERLESS_WEB_CONTAINER} "
        f"curl -s -H 'Authorization: Token {token}' "
        f"'http://localhost:8000/api/correspondents/?page_size=500'"
    )
    out = subprocess.run(
        ["ssh", PAPERLESS_HOST, cmd],
        capture_output=True, text=True, timeout=30,
    )
    if out.returncode != 0:
        raise RuntimeError(f"fetch failed: {out.stderr}")
    data = json.loads(out.stdout)
    # Only the first page is read; warn rather than silently truncate.
    if data.get("next"):
        print("warning: more than 500 correspondents, list is truncated")
    names = [c["name"] for c in data["results"]]
    filtered = [n for n in names
                if not any(x in n.lower() for x in RECIPIENT_EXCLUDE)]
    dropped = sorted(set(names) - set(filtered))
    if dropped:
        print(f"filtered out recipient-names: {dropped}")
    return sorted(filtered, key=lambda s: s.lower())


def render_prompt(template: str, names: list[str]) -> str:
    """Substitute the placeholder with one '- name' line per correspondent."""
    listing = "\n".join(f"- {n}" for n in names)
    return template.replace(PLACEHOLDER, listing)


def read_remote_env() -> str:
    """Return the current contents of the container's .env."""
    out = subprocess.run(
        ["ssh", PAPERLESS_HOST,
         f"docker exec {PAPERLESS_AI_CONTAINER} cat {ENV_PATH}"],
        capture_output=True, text=True, timeout=15,
    )
    if out.returncode != 0:
        raise RuntimeError(f"cat failed: {out.stderr}")
    return out.stdout


def replace_system_prompt(env: str, new_prompt: str) -> str:
    """Replace the SYSTEM_PROMPT=`…` block with the new one.

    Paperless-AI's .env uses backtick-delimited values for multi-line
    settings (JS .env loader convention; bash would not accept this).
    """
    if "`" in new_prompt:
        # A backtick inside the prompt would end the block early.
        raise SystemExit("rendered prompt must not contain backticks")
    lines = env.splitlines(keepends=True)
    out = []
    inside = False  # True while skipping the old multi-line value
    replaced = False
    for line in lines:
        if not inside and line.startswith("SYSTEM_PROMPT="):
            out.append(f"SYSTEM_PROMPT=`{new_prompt.rstrip()}`\n")
            replaced = True
            value = line[len("SYSTEM_PROMPT="):].rstrip("\n")
            # Keep skipping only if the value opens a backtick block that is
            # not closed on the same line; an unquoted single-line value must
            # not swallow the lines that follow it.
            inside = value.startswith("`") and value.count("`") < 2
            continue
        if inside:
            if "`" in line:
                inside = False
            continue
        out.append(line)
    if not replaced:
        raise SystemExit("SYSTEM_PROMPT= line not found in .env")
    return "".join(out)


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--apply", action="store_true",
                    help="Write new .env and restart paperless-ai")
    args = ap.parse_args()

    # The template contains umlauts, so read it as UTF-8 explicitly.
    with open(TEMPLATE_PATH, encoding="utf-8") as f:
        template = f.read()
    if PLACEHOLDER not in template:
        sys.exit(f"template missing placeholder {PLACEHOLDER}")

    token = get_token()
    names = fetch_correspondents(token)
    print(f"fetched {len(names)} live correspondents (after recipient filter)")
    rendered = render_prompt(template, names)
    print(f"rendered prompt: {len(rendered)} chars, {len(rendered.splitlines())} lines")

    env_before = read_remote_env()
    env_after = replace_system_prompt(env_before, rendered)
    if env_before == env_after:
        print("no change: live prompt already matches rendered template")
        return

    if not args.apply:
        # Preview only the first 200 characters of the new SYSTEM_PROMPT line.
        print("--- new SYSTEM_PROMPT block ---")
        for line in env_after.splitlines():
            if line.startswith("SYSTEM_PROMPT="):
                print(line[:200] + ("…" if len(line) > 200 else ""))
        print()
        print("DRY RUN: re-run with --apply to write + restart paperless-ai")
        return

    # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since 3.12).
    ts = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%S")
    backup = f"{ENV_PATH}.bak.{ts}"
    subprocess.run(
        ["ssh", PAPERLESS_HOST,
         f"docker exec {PAPERLESS_AI_CONTAINER} cp {ENV_PATH} {backup}"],
        check=True, timeout=15,
    )
    print(f"backup: {backup}")

    # Stream the new .env over stdin so no shell quoting of the prompt is needed.
    write_cmd = (
        f"docker exec -i {PAPERLESS_AI_CONTAINER} "
        f"sh -c 'cat > {ENV_PATH}'"
    )
    proc = subprocess.run(
        ["ssh", PAPERLESS_HOST, write_cmd],
        input=env_after, capture_output=True, text=True, timeout=30,
    )
    if proc.returncode != 0:
        sys.exit(f"write failed: {proc.stderr}")
    print(f"wrote {len(env_after)} characters to {ENV_PATH}")

    subprocess.run(
        ["ssh", PAPERLESS_HOST, f"docker restart {PAPERLESS_AI_CONTAINER}"],
        check=True, timeout=60,
    )
    print(f"restarted {PAPERLESS_AI_CONTAINER}")


if __name__ == "__main__":
    main()
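`fetch_correspondents` requests a single page with `page_size=500`. If the instance ever holds more correspondents than that, the `next` link of the paginated response could be followed instead. The sketch below assumes the usual paginated shape (a `results` list plus a `next` URL that is null on the last page). `fetch_page` is a hypothetical callable standing in for the ssh + curl call, and the page keys in the demo are made up.

```python
from typing import Callable, Optional


def collect_names(fetch_page: Callable[[str], dict], first_url: str) -> list[str]:
    """Follow `next` links until the last page and collect every name."""
    names: list[str] = []
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)
        names.extend(item["name"] for item in page["results"])
        # Assumed: `next` is None (JSON null) or absent on the final page.
        url = page.get("next")
    return names


if __name__ == "__main__":
    fake_pages = {
        "page1": {"results": [{"name": "eprimo"}], "next": "page2"},
        "page2": {"results": [{"name": "Vodafone"}], "next": None},
    }
    print(collect_names(fake_pages.__getitem__, "page1"))
```

Passing the fetcher in as a callable keeps the loop testable without ssh or a running Paperless instance.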
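The dry run in `push_system_prompt.py` prints only the first 200 characters of the new `SYSTEM_PROMPT=` line. If a real diff of old versus new prompt is wanted, it could be produced along the lines below. This is a sketch under assumptions: the helper names `extract_system_prompt` and `prompt_diff` are hypothetical, and the extraction assumes the prompt text itself contains no backtick.

```python
import difflib


def extract_system_prompt(env: str) -> str:
    """Return the text between the backticks of SYSTEM_PROMPT=`...` ('' if absent)."""
    marker = "SYSTEM_PROMPT=`"
    start = env.find(marker)
    if start == -1:
        return ""
    start += len(marker)
    end = env.find("`", start)
    # An unterminated block is treated as running to the end of the file.
    return env[start:] if end == -1 else env[start:end]


def prompt_diff(env_before: str, env_after: str) -> str:
    """Unified diff of the old and new prompt text (empty string if equal)."""
    old = extract_system_prompt(env_before).splitlines()
    new = extract_system_prompt(env_after).splitlines()
    return "\n".join(difflib.unified_diff(old, new, "live", "rendered", lineterm=""))


if __name__ == "__main__":
    before = "A=1\nSYSTEM_PROMPT=`line one\n- eprimo`\nB=2\n"
    after = "A=1\nSYSTEM_PROMPT=`line one\n- eprimo\n- Vodafone`\nB=2\n"
    print(prompt_diff(before, after))
```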