refactor(docforge): slice 1 — extract .docx engine to pkg/docforge/docx (t-paliad-349)

Relocate the in-house OOXML machinery out of internal/services into the
first docforge adapter, with zero behaviour change:

  submission_merge.go  -> pkg/docforge/docx/merge.go     (placeholder
                          substitution renderer + preview-HTML emitter)
  submission_md.go     -> pkg/docforge/docx/markdown.go  (Markdown->OOXML
                          walker incl. the b78a984 underscore-fix)
  submission_render.go -> pkg/docforge/docx/dotm.go      (.dotm->.docx)
  + their _test.go files (git-tracked renames, 84-99% identical)

internal/services keeps thin type-alias + forwarder shims
(docforge_shims.go) so every caller in services/handlers/main compiles
and behaves identically: PlaceholderMap, MissingPlaceholderFn,
SubmissionRenderer, HyperlinkAllocator (aliases); NewSubmissionRenderer,
DefaultMissingMarker, RenderMarkdownToOOXML[WithStyles], ConvertDotmToDocx,
SanitiseSubmissionFileName (forwarders). docx.XMLAttrEscape is exported so
submission_compose.go's hyperlink-rels inserts reuse the walker's escaping.

Three mis-filed pretty-printer tests (legalSourcePretty, ourSideDE/EN,
patentNumberUPC) that exercise the vars layer move back to
internal/services/submission_vars_pretty_test.go.

Placeholder grammar + PlaceholderMap stay co-located with the renderer in
docx for now; slice 3 hoists the format-neutral grammar to the docforge
root with the VariableResolver interface.

Verification: go build ./... clean, go vet clean, full module test green
(the byte-exact OOXML golden tests in merge/compose/render pass unchanged
= behaviour preserved). gofmt drift on the moved files is pre-existing
(72/169 services files already drift; no gofmt gate).

m/paliad#157
This commit is contained in:
mAi
2026-05-29 14:51:59 +02:00
parent 091804923a
commit 78a30a7ee0
10 changed files with 212 additions and 79 deletions

24
pkg/docforge/doc.go Normal file
View File

@@ -0,0 +1,24 @@
// Package docforge is paliad's modular document-generator engine — the
// format-neutral core that turns templates + variables into rendered
// documents, with format-specific adapters living in sub-packages.
//
// The package is being extracted from the in-tree submission generator
// (internal/services/submission_*.go) per the PRD in
// docs/plans/prd-docforge-2026-05-29.md (t-paliad-349 / m/paliad#157).
// The extraction follows the same packaging discipline as
// pkg/litigationplanner: docforge owns its types and exposes interfaces
// for the stateful inputs (variable resolution, template storage); the
// consuming application (paliad) implements those interfaces against its
// own database, and a future second consumer reaches the engine over an
// HTTP veneer rather than importing it.
//
// Slice 1 (this commit) relocates the .docx adapter — the Markdown→OOXML
// walker, the placeholder substitution engine, and the .dotm→.docx
// converter — into pkg/docforge/docx with no behaviour change. paliad's
// internal/services package keeps thin type-alias + forwarder shims so
// the submission generator and its HTTP surface compile and behave
// identically. Later slices introduce the neutral document model,
// hoist the format-neutral placeholder grammar up to this root package,
// and add the VariableResolver interface, the TemplateStore, the
// authoring surface, and the pluggable Exporter.
package docforge

28
pkg/docforge/docx/doc.go Normal file
View File

@@ -0,0 +1,28 @@
// Package docx is docforge's .docx (OOXML) adapter — the first
// format adapter in the docforge engine (t-paliad-349 / m/paliad#157).
//
// It owns the in-house OOXML machinery extracted from paliad's submission
// generator in slice 1, with no behaviour change:
//
// - merge.go — the placeholder substitution renderer
// (SubmissionRenderer.Render / RenderHTML). Two-pass {{placeholder}}
// substitution (single-run, then cross-run merge for fragmented
// placeholders), plus the preview-HTML emitter that wraps substituted
// values in clickable <span class="draft-var" data-var="…"> markup.
// - markdown.go — the Markdown→OOXML walker (RenderMarkdownToOOXML*),
// including the b78a984 fix that preserves {{…}} placeholders verbatim
// through inline-span parsing (underscores in keys survive).
// - dotm.go — ConvertDotmToDocx: strips macros from a .dotm/.docm/
// .dotx and rewrites the content-types + rels to a clean .docx,
// passing every other part through bit-for-bit.
//
// Why no third-party docx library: lukasjarosch/go-docx treats sibling
// placeholders in one run ("{{a}} ./. {{b}}") as nested and refuses to
// replace either; patent submissions routinely have several placeholders
// per paragraph, so this in-house renderer is required. See merge.go.
//
// The placeholder grammar — \{\{\s*([A-Za-z][A-Za-z0-9_.]*)\s*\}\} — and
// the PlaceholderMap type currently live here with the renderer; a later
// slice hoists the format-neutral grammar up to the docforge root once
// the neutral document model and the VariableResolver interface land.
package docx

204
pkg/docforge/docx/dotm.go Normal file
View File

@@ -0,0 +1,204 @@
package docx
// Submission .dotm → .docx converter (t-paliad-230, "format-only" scope
// reduction of the original t-paliad-215 submission generator).
//
// Word .dotm (macro-enabled template), .docm (macro-enabled document),
// .dotx (template, no macros), and .docx (document, no macros) are all
// OOXML zip containers. The macro-bearing variants carry an extra set
// of parts:
//
// word/vbaProject.bin — the VBA project binary
// word/_rels/vbaProject.bin.rels — auxiliary relationships
// word/vbaData.xml — VBA support data
// word/customizations.xml — keyMapCustomizations
//
// plus a Content-Types override for each of those, a Default extension
// declaring all .bin files as vbaProject, and a different "main" content
// type for word/document.xml itself.
//
// ConvertDotmToDocx walks the zip, drops the macro parts, rewrites
// [Content_Types].xml and word/_rels/document.xml.rels to remove every
// reference to them, and switches the main document content type to
// the plain .docx form. Every other part — styles, fonts, theme,
// settings, document body, header/footer/numbering, glossary, custom
// XML — passes through bit-for-bit at the original compression method
// and modification time.
//
// No variable substitution. Today's slice hands the lawyer the firm
// style template as a clean .docx so they can edit and save under
// their own filename. The merge-engine slice is deferred.
import (
"archive/zip"
"bytes"
"fmt"
"io"
"regexp"
"strings"
)
// The four OOXML "main" content types we may see on word/document.xml.
// Anything other than docxMainContentType gets rewritten so the output
// reads as a plain document.
const (
dotmMainContentType = "application/vnd.ms-word.template.macroEnabledTemplate.main+xml"
docmMainContentType = "application/vnd.ms-word.document.macroEnabled.main+xml"
dotxMainContentType = "application/vnd.openxmlformats-officedocument.wordprocessingml.template.main+xml"
docxMainContentType = "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
)
// Macro-related parts dropped wholesale from the output zip.
var macroParts = map[string]bool{
"word/vbaProject.bin": true,
"word/_rels/vbaProject.bin.rels": true,
"word/vbaData.xml": true,
"word/customizations.xml": true,
}
const (
contentTypesPath = "[Content_Types].xml"
documentRelsPath = "word/_rels/document.xml.rels"
)
// vbaDefaultExtensionRegex matches the `<Default Extension="bin"
// ContentType=".../vbaProject"/>` row in [Content_Types].xml. After
// vbaProject.bin is dropped, the Default is dead weight (and Word will
// flag the file as macro-bearing if it survives).
var vbaDefaultExtensionRegex = regexp.MustCompile(
`\s*<Default\b[^>]*\bExtension\s*=\s*"bin"[^>]*\bContentType\s*=\s*"application/vnd\.ms-office\.vbaProject"[^>]*/>`,
)
// macroOverridePartRegex matches any <Override PartName="…"/> element
// whose PartName is one of the dropped macro parts. The /word/
// prefix is the OOXML convention for the absolute part path in
// [Content_Types].xml — file paths in the zip itself omit the leading
// slash.
var macroOverridePartRegex = regexp.MustCompile(
`\s*<Override\b[^>]*\bPartName\s*=\s*"/word/(?:vbaProject\.bin|vbaData\.xml|customizations\.xml)"[^>]*/>`,
)
// macroRelTypeRegex matches the two macro-related relationship Types
// in word/_rels/document.xml.rels: vbaProject (binds to vbaProject.bin)
// and keyMapCustomizations (binds to customizations.xml). After both
// targets are dropped, leaving the relationships in would make Word
// flag the file as corrupt.
var macroRelTypeRegex = regexp.MustCompile(
`\s*<Relationship\b[^>]*\bType\s*=\s*"http://schemas\.microsoft\.com/office/2006/relationships/(?:vbaProject|keyMapCustomizations)"[^>]*/>`,
)
// ConvertDotmToDocx rewrites a .dotm (or .docm, or .dotx) zip into a
// clean .docx zip. Idempotent on a zip that is already a plain .docx.
// Returns an error if the input is not a valid zip.
func ConvertDotmToDocx(dotmBytes []byte) ([]byte, error) {
zr, err := zip.NewReader(bytes.NewReader(dotmBytes), int64(len(dotmBytes)))
if err != nil {
return nil, fmt.Errorf("dotm→docx: open zip: %w", err)
}
var out bytes.Buffer
zw := zip.NewWriter(&out)
for _, entry := range zr.File {
if macroParts[entry.Name] {
continue
}
body, err := readZipFile(entry)
if err != nil {
return nil, fmt.Errorf("dotm→docx: read %s: %w", entry.Name, err)
}
switch entry.Name {
case contentTypesPath:
body = rewriteContentTypes(body)
case documentRelsPath:
body = rewriteDocumentRels(body)
}
w, err := zw.CreateHeader(&zip.FileHeader{
Name: entry.Name,
Method: entry.Method,
Modified: entry.Modified,
})
if err != nil {
return nil, fmt.Errorf("dotm→docx: write header %s: %w", entry.Name, err)
}
if _, err := w.Write(body); err != nil {
return nil, fmt.Errorf("dotm→docx: write body %s: %w", entry.Name, err)
}
}
if err := zw.Close(); err != nil {
return nil, fmt.Errorf("dotm→docx: finalise zip: %w", err)
}
return out.Bytes(), nil
}
// rewriteContentTypes demotes any of the three non-docx "main" content
// types to plain docx, drops the bin Default-Extension entry, and
// drops every Override that targeted a dropped macro part.
//
// String-level substitution rather than encoding/xml: round-tripping
// through Go's XML marshaller would re-emit the document with
// canonical namespace declarations on every child, which Word reads
// but which makes the binary diff unnecessarily large. Direct
// substitution preserves the file's original shape.
func rewriteContentTypes(body []byte) []byte {
body = bytes.ReplaceAll(body, []byte(dotmMainContentType), []byte(docxMainContentType))
body = bytes.ReplaceAll(body, []byte(docmMainContentType), []byte(docxMainContentType))
body = bytes.ReplaceAll(body, []byte(dotxMainContentType), []byte(docxMainContentType))
body = vbaDefaultExtensionRegex.ReplaceAll(body, nil)
body = macroOverridePartRegex.ReplaceAll(body, nil)
return body
}
// rewriteDocumentRels drops the two macro-related relationships from
// word/_rels/document.xml.rels (vbaProject + keyMapCustomizations) so
// the manifest no longer points at parts the zip no longer carries.
// Every other relationship — styles, settings, numbering, theme,
// headers/footers, customXml — passes through untouched.
func rewriteDocumentRels(body []byte) []byte {
return macroRelTypeRegex.ReplaceAll(body, nil)
}
// readZipFile slurps a zip entry's bytes.
func readZipFile(f *zip.File) ([]byte, error) {
rc, err := f.Open()
if err != nil {
return nil, err
}
defer rc.Close()
return io.ReadAll(rc)
}
// SanitiseSubmissionFileName cleans a string for use inside a download
// filename — strips path separators and quote characters that would
// break Content-Disposition or confuse browsers across OSes. ASCII-folds
// the small set of German umlaut letters that show up in submission
// names today (Klageerwiderung, Berufungsbegründung, …) so the file
// lands cleanly on legacy SMB shares whose layer is still cp1252.
// Other Unicode is preserved so non-DE/EN names still produce a
// recognisable file.
func SanitiseSubmissionFileName(s string) string {
s = strings.TrimSpace(s)
s = umlautFolder.Replace(s)
s = strings.Map(func(r rune) rune {
switch r {
case '/', '\\':
return '_'
case '"', '\'':
return -1
}
return r
}, s)
return s
}
// umlautFolder turns the four DE umlaut letters (both cases) into ASCII
// digraphs; ß → ss.
var umlautFolder = strings.NewReplacer(
"ä", "ae", "ö", "oe", "ü", "ue",
"Ä", "Ae", "Ö", "Oe", "Ü", "Ue",
"ß", "ss",
)

View File

@@ -0,0 +1,255 @@
package docx
import (
"archive/zip"
"bytes"
"io"
"strings"
"testing"
"time"
)
// minimalDOTM builds a small .dotm zip whose shape mirrors the real
// HL Patents Style template: macro-enabled main content type, Default
// extension declaring .bin as vbaProject, Overrides for vbaData.xml +
// customizations.xml, document.xml.rels with vbaProject +
// keyMapCustomizations relationships, and the four macro parts on
// disk (vbaProject.bin + auxiliary rels + vbaData.xml +
// customizations.xml).
//
// In-memory so the test is self-contained (no checked-in binary).
// Word and LibreOffice would reject this minimal file as incomplete
// (no _rels/.rels root manifest); the tests work at the byte level
// and assert structural properties of the converted output.
func minimalDOTM(t *testing.T) []byte {
t.Helper()
var buf bytes.Buffer
zw := zip.NewWriter(&buf)
add := func(name, body string) {
t.Helper()
w, err := zw.CreateHeader(&zip.FileHeader{
Name: name,
Method: zip.Deflate,
Modified: time.Date(2026, 5, 21, 12, 0, 0, 0, time.UTC),
})
if err != nil {
t.Fatalf("zip header %s: %v", name, err)
}
if _, err := io.WriteString(w, body); err != nil {
t.Fatalf("write %s: %v", name, err)
}
}
add(contentTypesPath, `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>`+
`<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">`+
`<Default Extension="bin" ContentType="application/vnd.ms-office.vbaProject"/>`+
`<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>`+
`<Default Extension="xml" ContentType="application/xml"/>`+
`<Override PartName="/word/document.xml" ContentType="`+dotmMainContentType+`"/>`+
`<Override PartName="/word/customizations.xml" ContentType="application/vnd.ms-word.keyMapCustomizations+xml"/>`+
`<Override PartName="/word/vbaData.xml" ContentType="application/vnd.ms-word.vbaData+xml"/>`+
`<Override PartName="/word/styles.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>`+
`</Types>`)
add("word/document.xml",
`<?xml version="1.0" encoding="UTF-8" standalone="yes"?>`+
`<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">`+
`<w:body><w:p><w:r><w:t>Hello Paliad</w:t></w:r></w:p></w:body></w:document>`)
add(documentRelsPath,
`<?xml version="1.0" encoding="UTF-8" standalone="yes"?>`+
`<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">`+
`<Relationship Id="rId1" Type="http://schemas.microsoft.com/office/2006/relationships/vbaProject" Target="vbaProject.bin"/>`+
`<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>`+
`<Relationship Id="rId3" Type="http://schemas.microsoft.com/office/2006/relationships/keyMapCustomizations" Target="customizations.xml"/>`+
`</Relationships>`)
add("word/styles.xml", `<w:styles xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"/>`)
add("word/vbaProject.bin", "PRETEND-VBA-BINARY-PAYLOAD")
add("word/_rels/vbaProject.bin.rels", `<?xml version="1.0"?><Relationships/>`)
add("word/vbaData.xml", `<?xml version="1.0"?><wne:vbaSuppData xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"/>`)
add("word/customizations.xml", `<?xml version="1.0"?><wne:tcg xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"/>`)
if err := zw.Close(); err != nil {
t.Fatalf("close zip: %v", err)
}
return buf.Bytes()
}
func unzipEntries(t *testing.T, data []byte) map[string]string {
t.Helper()
zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
if err != nil {
t.Fatalf("open output zip: %v", err)
}
out := make(map[string]string, len(zr.File))
for _, f := range zr.File {
rc, err := f.Open()
if err != nil {
t.Fatalf("open %s: %v", f.Name, err)
}
body, err := io.ReadAll(rc)
rc.Close()
if err != nil {
t.Fatalf("read %s: %v", f.Name, err)
}
out[f.Name] = string(body)
}
return out
}
func TestConvertDotmToDocx_StripsMacroParts(t *testing.T) {
dotm := minimalDOTM(t)
out, err := ConvertDotmToDocx(dotm)
if err != nil {
t.Fatalf("ConvertDotmToDocx: %v", err)
}
entries := unzipEntries(t, out)
for _, name := range []string{
"word/vbaProject.bin",
"word/_rels/vbaProject.bin.rels",
"word/vbaData.xml",
"word/customizations.xml",
} {
if _, ok := entries[name]; ok {
t.Errorf("output still contains %s", name)
}
}
if doc, ok := entries["word/document.xml"]; !ok {
t.Error("output is missing word/document.xml")
} else if !strings.Contains(doc, "Hello Paliad") {
t.Errorf("document body lost during conversion: %q", doc)
}
if _, ok := entries["word/styles.xml"]; !ok {
t.Error("output lost unrelated word/styles.xml")
}
ctypes, ok := entries[contentTypesPath]
if !ok {
t.Fatal("output is missing [Content_Types].xml")
}
if strings.Contains(ctypes, "macroEnabled") {
t.Errorf("output [Content_Types].xml still references a macro-enabled type: %q", ctypes)
}
if !strings.Contains(ctypes, docxMainContentType) {
t.Errorf("output is missing plain docx main content type: %q", ctypes)
}
if strings.Contains(ctypes, "vbaProject") {
t.Errorf("output [Content_Types].xml still references vbaProject: %q", ctypes)
}
if strings.Contains(ctypes, "vbaData") {
t.Errorf("output [Content_Types].xml still overrides vbaData: %q", ctypes)
}
if strings.Contains(ctypes, "keyMapCustomizations") {
t.Errorf("output [Content_Types].xml still overrides customizations: %q", ctypes)
}
if !strings.Contains(ctypes, "wordprocessingml.styles") {
t.Errorf("output lost unrelated styles Override: %q", ctypes)
}
rels, ok := entries[documentRelsPath]
if !ok {
t.Fatal("output is missing word/_rels/document.xml.rels")
}
if strings.Contains(rels, "vbaProject") {
t.Errorf("output rels still references vbaProject: %q", rels)
}
if strings.Contains(rels, "keyMapCustomizations") {
t.Errorf("output rels still references keyMapCustomizations: %q", rels)
}
if !strings.Contains(rels, "styles.xml") {
t.Errorf("output rels lost unrelated styles relationship: %q", rels)
}
}
func TestConvertDotmToDocx_IdempotentOnPlainDocx(t *testing.T) {
var buf bytes.Buffer
zw := zip.NewWriter(&buf)
add := func(name, body string) {
w, err := zw.Create(name)
if err != nil {
t.Fatalf("create %s: %v", name, err)
}
if _, err := io.WriteString(w, body); err != nil {
t.Fatalf("write %s: %v", name, err)
}
}
add(contentTypesPath, `<?xml version="1.0"?>`+
`<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">`+
`<Override PartName="/word/document.xml" ContentType="`+docxMainContentType+`"/>`+
`</Types>`)
add("word/document.xml", `<w:document/>`)
if err := zw.Close(); err != nil {
t.Fatalf("close: %v", err)
}
out, err := ConvertDotmToDocx(buf.Bytes())
if err != nil {
t.Fatalf("ConvertDotmToDocx: %v", err)
}
entries := unzipEntries(t, out)
if _, ok := entries["word/vbaProject.bin"]; ok {
t.Error("plain docx grew a vbaProject during conversion")
}
if ctypes := entries[contentTypesPath]; !strings.Contains(ctypes, docxMainContentType) {
t.Errorf("plain docx lost its content type: %q", ctypes)
}
}
func TestConvertDotmToDocx_AcceptsDocmAndDotx(t *testing.T) {
for _, mainType := range []string{docmMainContentType, dotxMainContentType} {
t.Run(mainType, func(t *testing.T) {
var buf bytes.Buffer
zw := zip.NewWriter(&buf)
add := func(name, body string) {
w, _ := zw.Create(name)
_, _ = io.WriteString(w, body)
}
add(contentTypesPath, `<?xml version="1.0"?>`+
`<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">`+
`<Override PartName="/word/document.xml" ContentType="`+mainType+`"/>`+
`</Types>`)
add("word/document.xml", `<w:document/>`)
zw.Close()
out, err := ConvertDotmToDocx(buf.Bytes())
if err != nil {
t.Fatalf("ConvertDotmToDocx: %v", err)
}
ctypes := unzipEntries(t, out)[contentTypesPath]
if strings.Contains(ctypes, mainType) {
t.Errorf("non-docx main type survived conversion: %q", ctypes)
}
if !strings.Contains(ctypes, docxMainContentType) {
t.Errorf("docx main type not present: %q", ctypes)
}
})
}
}
func TestConvertDotmToDocx_RejectsNonZip(t *testing.T) {
_, err := ConvertDotmToDocx([]byte("not a zip file"))
if err == nil {
t.Fatal("expected error for non-zip input, got nil")
}
}
func TestSanitiseSubmissionFileName(t *testing.T) {
cases := map[string]string{
"Klageerwiderung": "Klageerwiderung",
"Berufungsbegründung": "Berufungsbegruendung",
"Schriftsatz/Anlage": "Schriftsatz_Anlage",
`Statement of "Defence"`: "Statement of Defence",
` Klage `: "Klage",
"Größe": "Groesse",
}
for in, want := range cases {
t.Run(in, func(t *testing.T) {
if got := SanitiseSubmissionFileName(in); got != want {
t.Errorf("SanitiseSubmissionFileName(%q) = %q, want %q", in, got, want)
}
})
}
}

View File

@@ -0,0 +1,511 @@
package docx
// Markdown → OOXML walker for Composer section content (t-paliad-313
// Slice B, design doc §9.2).
//
// Scope per the head's Slice B brief: paragraphs + inline bold/italic
// only. Headings, lists, blockquote, links land in Slice D's rich-prose
// pass. This walker is intentionally minimal — every Markdown construct
// it doesn't recognise is rendered as a plain paragraph so the lawyer's
// prose round-trips losslessly even when they hit Markdown the walker
// doesn't yet understand.
//
// The output uses the base's stylemap.paragraph entry for the
// <w:pStyle> on each paragraph so the styling matches the base's
// typography (HLpat-Body-B0 on the HLC base, Normal on the neutral
// base, etc.).
//
// Placeholders ({{path.dot.notation}}) are preserved verbatim — they
// pass through the walker untouched and get substituted by the v1
// SubmissionRenderer's placeholder pass after the composer assembly.
//
// Grammar supported:
//
// - Blank line → paragraph break
// - `**bold**` → <w:r><w:rPr><w:b/></w:rPr><w:t>…</w:t></w:r>
// - `*italic*` or `_italic_` → <w:r><w:rPr><w:i/></w:rPr>…</w:r>
// - Otherwise → plain text run
import (
"fmt"
"strings"
)
// HyperlinkAllocator hands the walker a `rId` for each external URL
// it encounters in `[label](url)` inline links. The composer's
// post-pass uses these allocations to mutate
// `word/_rels/document.xml.rels` so the emitted `<w:hyperlink
// r:id="…">` elements resolve correctly. Pass nil to drop links to
// plain text (the label survives, the URL doesn't render).
//
// t-paliad-316 Slice D.
type HyperlinkAllocator func(url string) string
// RenderMarkdownToOOXML renders the given Markdown source into OOXML
// paragraph elements (`<w:p>…</w:p>`), suitable for splicing into a
// .docx body. Each paragraph carries `<w:pStyle w:val="<paragraphStyle>"/>`
// when paragraphStyle is non-empty.
//
// Slice B shipped paragraphs + bold/italic. Slice D extends to
// headings (h1/h2/h3), bullet/numbered lists, blockquote, and inline
// hyperlinks via the optional HyperlinkAllocator.
//
// stylemap supplies the paragraph-style names for each kind:
// stylemap["paragraph"] — default body
// stylemap["heading_1/2/3"] — heading levels
// stylemap["list_bullet"] — bullet list paragraph style
// stylemap["list_numbered"] — numbered list paragraph style
// stylemap["blockquote"] — blockquote
// Missing entries fall back to the "paragraph" style.
//
// Empty input renders one empty paragraph so the splice site is
// well-formed even when the lawyer hasn't typed anything in this
// section.
func RenderMarkdownToOOXML(md, paragraphStyle string) string {
return RenderMarkdownToOOXMLWithStyles(md, map[string]string{"paragraph": paragraphStyle}, nil)
}
// RenderMarkdownToOOXMLWithStyles is the full Slice-D-aware entry
// point. Slice B's RenderMarkdownToOOXML is a wrapper for back-compat.
func RenderMarkdownToOOXMLWithStyles(md string, stylemap map[string]string, links HyperlinkAllocator) string {
defaultStyle := stylemap["paragraph"]
if md == "" {
return emptyParagraph(defaultStyle)
}
blocks := splitMarkdownBlocks(md)
if len(blocks) == 0 {
return emptyParagraph(defaultStyle)
}
// Numbered-list counter resets on every non-numbered block so
// "1. A\n2. B\n\n1. C" renders as 1./2./1. (the lawyer's input
// determined the ordinal, the walker just renders).
numberedCounter := 0
var b strings.Builder
for _, blk := range blocks {
style := stylemap[blk.styleKey]
if style == "" {
style = defaultStyle
}
if blk.styleKey == "list_numbered" {
numberedCounter++
} else {
numberedCounter = 0
}
b.WriteString(renderBlockParagraph(blk, style, links, numberedCounter))
}
return b.String()
}
// mdBlock is one rendered paragraph: a kind (paragraph / heading_*
// / list_bullet / list_numbered / blockquote) and the inline content
// text. List markers, heading hashes, blockquote `> ` etc. are
// stripped from the text before storage.
type mdBlock struct {
styleKey string // "paragraph" | "heading_1" | "heading_2" | "heading_3" | "list_bullet" | "list_numbered" | "blockquote"
text string
}
// splitMarkdownBlocks parses the source into a sequence of blocks,
// detecting heading / list / blockquote prefixes line-by-line. Blank
// lines split paragraph runs (same semantics as splitMarkdownParagraphs)
// but each line is also tagged with its block kind.
//
// Lines that look like block markers don't merge with their neighbours
// even across blank lines — every list / heading / blockquote line is
// its own block in the output. A run of unmarked lines collapses into
// one "paragraph" block (so soft line breaks inside a paragraph still
// concatenate).
//
// CRLF normalised to LF before parsing.
func splitMarkdownBlocks(md string) []mdBlock {
normalised := strings.ReplaceAll(md, "\r\n", "\n")
lines := strings.Split(normalised, "\n")
var blocks []mdBlock
var pendingPara []string
blankRun := 0
flushPara := func() {
if len(pendingPara) > 0 {
blocks = append(blocks, mdBlock{styleKey: "paragraph", text: strings.Join(pendingPara, "\n")})
pendingPara = nil
}
}
for _, raw := range lines {
line := raw
if strings.TrimSpace(line) == "" {
if len(pendingPara) > 0 {
flushPara()
blankRun = 1
continue
}
blankRun++
continue
}
// Detect heading / list / blockquote markers BEFORE we accumulate
// into the paragraph buffer.
kind, payload, ok := detectBlockMarker(line)
if ok {
flushPara()
// Emit spacing paragraphs equivalent to (blankRun - 1) extra.
for i := 1; i < blankRun; i++ {
blocks = append(blocks, mdBlock{styleKey: "paragraph", text: ""})
}
blankRun = 0
blocks = append(blocks, mdBlock{styleKey: kind, text: payload})
continue
}
// Plain paragraph line.
if len(pendingPara) == 0 {
// Starting a new paragraph after a blank run — emit
// (blankRun-1) extra empty paragraphs for vertical spacing.
for i := 1; i < blankRun; i++ {
blocks = append(blocks, mdBlock{styleKey: "paragraph", text: ""})
}
}
blankRun = 0
pendingPara = append(pendingPara, line)
}
flushPara()
return blocks
}
// detectBlockMarker classifies a single line. Returns (styleKey,
// payload-with-marker-stripped, true) for recognised markers; false
// for plain paragraph lines.
//
// Recognised markers (Slice D):
// # Heading → heading_1
// ## Heading → heading_2
// ### Heading → heading_3
// - item / * item → list_bullet
// 1. item / 2. item ... → list_numbered (any positive integer)
// > quote → blockquote
//
// Leading whitespace inside the line is tolerated up to 3 spaces (per
// CommonMark) so the lawyer's contentEditable indentation doesn't
// hide the marker.
func detectBlockMarker(line string) (string, string, bool) {
trimmed := strings.TrimLeft(line, " ")
// Cap to 3 spaces of leading indent — beyond that, treat as a
// regular paragraph line (matches CommonMark).
if len(line)-len(trimmed) > 3 {
return "", "", false
}
if strings.HasPrefix(trimmed, "### ") {
return "heading_3", strings.TrimSpace(trimmed[4:]), true
}
if strings.HasPrefix(trimmed, "## ") {
return "heading_2", strings.TrimSpace(trimmed[3:]), true
}
if strings.HasPrefix(trimmed, "# ") {
return "heading_1", strings.TrimSpace(trimmed[2:]), true
}
if strings.HasPrefix(trimmed, "> ") {
return "blockquote", strings.TrimSpace(trimmed[2:]), true
}
if strings.HasPrefix(trimmed, "- ") || strings.HasPrefix(trimmed, "* ") {
return "list_bullet", strings.TrimSpace(trimmed[2:]), true
}
// Numbered: "N. " where N is one or more digits.
if i := indexOfNumberedMarker(trimmed); i > 0 {
return "list_numbered", strings.TrimSpace(trimmed[i:]), true
}
return "", "", false
}
// indexOfNumberedMarker checks for "N. " or "N) " at the start of the
// trimmed line; returns the byte index just past the marker, or -1 if
// no marker present.
func indexOfNumberedMarker(s string) int {
i := 0
for i < len(s) && s[i] >= '0' && s[i] <= '9' {
i++
}
if i == 0 {
return -1
}
if i >= len(s) {
return -1
}
if s[i] != '.' && s[i] != ')' {
return -1
}
if i+1 >= len(s) || s[i+1] != ' ' {
return -1
}
return i + 2
}
// renderBlockParagraph emits one `<w:p>` for a block. List blocks
// keep the same paragraph style as a default paragraph (the Slice D
// design's contract — list styles come from the base's stylemap and
// Word's numbering.xml is honoured by adding a leading bullet/number
// prefix in the rendered text). This keeps the composer free of
// numbering.xml mutations.
func renderBlockParagraph(blk mdBlock, paragraphStyle string, links HyperlinkAllocator, numberedOrdinal int) string {
var b strings.Builder
b.WriteString(`<w:p>`)
if paragraphStyle != "" {
b.WriteString(`<w:pPr><w:pStyle w:val="`)
b.WriteString(xmlAttrEscape(paragraphStyle))
b.WriteString(`"/></w:pPr>`)
}
if blk.text == "" {
b.WriteString(`<w:r><w:t xml:space="preserve"></w:t></w:r>`)
b.WriteString(`</w:p>`)
return b.String()
}
text := blk.text
// List blocks emit a visible "• " / "N. " prefix run. The
// stylemap entry handles paragraph indentation if the base
// defines a list paragraph style; otherwise the prefix at least
// surfaces the structure in plain Word. Lawyers who want Word's
// auto-numbering reapply a list style post-export.
switch blk.styleKey {
case "list_bullet":
b.WriteString(`<w:r><w:t xml:space="preserve">• </w:t></w:r>`)
case "list_numbered":
ordinal := numberedOrdinal
if ordinal <= 0 {
ordinal = 1
}
b.WriteString(`<w:r><w:t xml:space="preserve">`)
b.WriteString(fmt.Sprintf("%d. ", ordinal))
b.WriteString(`</w:t></w:r>`)
}
for _, run := range parseInlineRuns(text, links) {
b.WriteString(run)
}
b.WriteString(`</w:p>`)
return b.String()
}
// parseInlineRuns extracts inline spans + hyperlink runs and serialises
// each to OOXML. Hyperlinks become `<w:hyperlink r:id="RID">…runs…</w:hyperlink>`
// where RID comes from the HyperlinkAllocator.
func parseInlineRuns(text string, links HyperlinkAllocator) []string {
// Phase 1: find all hyperlink spans `[label](url)` and split the
// text around them.
type segment struct {
text string
isLink bool
url string
}
var segs []segment
rest := text
for {
idx := strings.Index(rest, "[")
if idx < 0 {
if rest != "" {
segs = append(segs, segment{text: rest})
}
break
}
// Find matching closing bracket, then a "(" right after.
closeBracket := strings.Index(rest[idx:], "](")
if closeBracket < 0 {
segs = append(segs, segment{text: rest})
break
}
closeParen := strings.Index(rest[idx+closeBracket:], ")")
if closeParen < 0 {
segs = append(segs, segment{text: rest})
break
}
// idx = start of "["
// idx+closeBracket = position of "]"
// idx+closeBracket+1 = position of "("
// idx+closeBracket+closeParen = position of ")"
label := rest[idx+1 : idx+closeBracket]
url := rest[idx+closeBracket+2 : idx+closeBracket+closeParen]
if idx > 0 {
segs = append(segs, segment{text: rest[:idx]})
}
segs = append(segs, segment{text: label, isLink: true, url: url})
rest = rest[idx+closeBracket+closeParen+1:]
}
var runs []string
for _, seg := range segs {
if seg.isLink && links != nil {
rid := links(seg.url)
if rid != "" {
var hb strings.Builder
hb.WriteString(`<w:hyperlink r:id="`)
hb.WriteString(xmlAttrEscape(rid))
hb.WriteString(`">`)
for _, span := range parseInlineSpans(seg.text) {
hb.WriteString(renderRunWithLinkStyle(span))
}
hb.WriteString(`</w:hyperlink>`)
runs = append(runs, hb.String())
continue
}
}
for _, span := range parseInlineSpans(seg.text) {
runs = append(runs, renderRun(span))
}
}
return runs
}
// renderRunWithLinkStyle emits a hyperlink child run. Same B/I support
// as renderRun, but additionally tags the run with the "Hyperlink"
// character style (Word's built-in) so the link renders in the
// document's hyperlink colour + underline.
func renderRunWithLinkStyle(span inlineSpan) string {
var b strings.Builder
b.WriteString(`<w:r><w:rPr><w:rStyle w:val="Hyperlink"/>`)
if span.Bold {
b.WriteString(`<w:b/>`)
}
if span.Italic {
b.WriteString(`<w:i/>`)
}
b.WriteString(`</w:rPr><w:t xml:space="preserve">`)
b.WriteString(xmlTextEscape(span.Text))
b.WriteString(`</w:t></w:r>`)
return b.String()
}
// inlineSpan is one piece of inline content: a text payload plus
// formatting flags. Bold and italic are independent — `***both***`
// produces one span with both flags set.
type inlineSpan struct {
Text string
Bold bool
Italic bool
}
// parseInlineSpans tokenises Markdown inline formatting into runs of
// (text, bold, italic). The grammar is intentionally narrow:
//
// - `**…**` → bold
// - `__…__` → bold (Markdown alternate)
// - `*…*` → italic
// - `_…_` → italic (Markdown alternate)
// - Anything else flows through as plain text.
//
// Unbalanced delimiters fall through as literal characters — the
// walker never errors on malformed Markdown. Nested formatting (e.g.
// `**bold *bold-italic* bold**`) toggles flags as it walks.
func parseInlineSpans(text string) []inlineSpan {
var out []inlineSpan
var cur strings.Builder
bold := false
italic := false
flush := func() {
if cur.Len() == 0 {
return
}
out = append(out, inlineSpan{Text: cur.String(), Bold: bold, Italic: italic})
cur.Reset()
}
i := 0
n := len(text)
for i < n {
// Preserve {{...}} placeholders verbatim. Underscores and
// other Markdown-significant chars inside a placeholder key
// (e.g. {{project.case_number}}) must not be interpreted as
// bold/italic delimiters — otherwise the key gets stripped of
// its underscores and the v1 placeholder pass looks up the
// wrong key, surfacing [KEIN WERT: project.casenumber] in the
// preview.
if i+1 < n && text[i] == '{' && text[i+1] == '{' {
rel := strings.Index(text[i+2:], "}}")
if rel >= 0 {
end := i + 2 + rel + 2
cur.WriteString(text[i:end])
i = end
continue
}
// Unmatched {{ — fall through to plain character handling.
}
// Bold delimiters first (longer match wins over italic).
if i+1 < n && (text[i:i+2] == "**" || text[i:i+2] == "__") {
flush()
bold = !bold
i += 2
continue
}
if text[i] == '*' || text[i] == '_' {
flush()
italic = !italic
i++
continue
}
cur.WriteByte(text[i])
i++
}
flush()
if len(out) == 0 {
out = append(out, inlineSpan{Text: ""})
}
return out
}
// renderRun emits one `<w:r>` element for an inline span. Empty text
// spans render as empty runs (Word accepts them; they're harmless).
func renderRun(span inlineSpan) string {
var b strings.Builder
b.WriteString(`<w:r>`)
if span.Bold || span.Italic {
b.WriteString(`<w:rPr>`)
if span.Bold {
b.WriteString(`<w:b/>`)
}
if span.Italic {
b.WriteString(`<w:i/>`)
}
b.WriteString(`</w:rPr>`)
}
b.WriteString(`<w:t xml:space="preserve">`)
b.WriteString(xmlTextEscape(span.Text))
b.WriteString(`</w:t></w:r>`)
return b.String()
}
// emptyParagraph returns one empty `<w:p>` with the given style. Used
// when a section's content_md is empty so the splice site stays
// well-formed.
func emptyParagraph(paragraphStyle string) string {
var b strings.Builder
b.WriteString(`<w:p>`)
if paragraphStyle != "" {
b.WriteString(`<w:pPr><w:pStyle w:val="`)
b.WriteString(xmlAttrEscape(paragraphStyle))
b.WriteString(`"/></w:pPr>`)
}
b.WriteString(`<w:r><w:t xml:space="preserve"></w:t></w:r></w:p>`)
return b.String()
}
// xmlTextEscape escapes the five XML-significant characters for safe
// insertion into <w:t> content. & first to avoid double-encoding.
func xmlTextEscape(s string) string {
s = strings.ReplaceAll(s, "&", "&amp;")
s = strings.ReplaceAll(s, "<", "&lt;")
s = strings.ReplaceAll(s, ">", "&gt;")
// Quotes and apostrophes are legal inside element text content;
// no need to escape them here.
return s
}
// XMLAttrEscape is the exported form of xmlAttrEscape, used by the
// paliad-side composer (submission_compose.go) when it builds hyperlink
// relationship inserts. It exists so the composer can reuse the exact
// attribute-escaping the walker applies without reaching across the
// package boundary for an unexported helper. Slice 2 folds the
// composer's splice into this package, after which the wrapper retires.
func XMLAttrEscape(s string) string { return xmlAttrEscape(s) }
// xmlAttrEscape escapes for safe insertion into an attribute value
// (e.g. `<w:pStyle w:val="…"/>`).
func xmlAttrEscape(s string) string {
s = strings.ReplaceAll(s, "&", "&amp;")
s = strings.ReplaceAll(s, "<", "&lt;")
s = strings.ReplaceAll(s, ">", "&gt;")
s = strings.ReplaceAll(s, `"`, "&quot;")
return s
}

View File

@@ -0,0 +1,383 @@
package docx
// Unit tests for the Composer's Markdown → OOXML walker (t-paliad-313
// Slice B). Pure function; no DB dependency.
import (
"strings"
"testing"
)
func TestRenderMarkdownToOOXML_EmptyInput(t *testing.T) {
out := RenderMarkdownToOOXML("", "Normal")
if !strings.Contains(out, `<w:p>`) {
t.Errorf("empty input must still emit one <w:p>; got %q", out)
}
if !strings.Contains(out, `<w:pStyle w:val="Normal"/>`) {
t.Errorf("empty input must carry the paragraph style; got %q", out)
}
}
func TestRenderMarkdownToOOXML_SingleParagraph(t *testing.T) {
out := RenderMarkdownToOOXML("Hello world", "HLpat-Body-B0")
if !strings.Contains(out, `<w:pStyle w:val="HLpat-Body-B0"/>`) {
t.Errorf("paragraph missing stylemap entry: %q", out)
}
if !strings.Contains(out, "Hello world") {
t.Errorf("paragraph text missing: %q", out)
}
// Exactly one <w:p>.
if got := strings.Count(out, "<w:p>"); got != 1 {
t.Errorf("expected 1 <w:p>; got %d", got)
}
}
func TestRenderMarkdownToOOXML_TwoParagraphs(t *testing.T) {
out := RenderMarkdownToOOXML("first\n\nsecond", "Normal")
if got := strings.Count(out, "<w:p>"); got != 2 {
t.Errorf("expected 2 <w:p>; got %d, out=%q", got, out)
}
if !strings.Contains(out, "first") || !strings.Contains(out, "second") {
t.Errorf("paragraph text missing: %q", out)
}
}
func TestRenderMarkdownToOOXML_BoldInline(t *testing.T) {
out := RenderMarkdownToOOXML("hello **bold** world", "")
if !strings.Contains(out, `<w:rPr><w:b/></w:rPr>`) {
t.Errorf("bold rPr missing: %q", out)
}
if !strings.Contains(out, ">bold<") {
t.Errorf("bold text payload missing: %q", out)
}
// The surrounding "hello " and " world" pieces are separate runs;
// the bold rPr should appear exactly once in this output.
if got := strings.Count(out, "<w:b/>"); got != 1 {
t.Errorf("expected exactly one <w:b/> tag; got %d in %q", got, out)
}
}
func TestRenderMarkdownToOOXML_ItalicInline(t *testing.T) {
out := RenderMarkdownToOOXML("see *italic* here", "")
if !strings.Contains(out, `<w:rPr><w:i/></w:rPr>`) {
t.Errorf("italic rPr missing: %q", out)
}
if !strings.Contains(out, ">italic<") {
t.Errorf("italic text payload missing: %q", out)
}
}
func TestRenderMarkdownToOOXML_BoldItalicCombo(t *testing.T) {
// Nested: ***both*** → entering both flags. The walker toggles each
// delimiter independently, so the resulting run carries both <w:b/>
// and <w:i/>.
out := RenderMarkdownToOOXML("***both***", "")
if !strings.Contains(out, `<w:b/>`) || !strings.Contains(out, `<w:i/>`) {
t.Errorf("expected both <w:b/> and <w:i/>; got %q", out)
}
}
func TestRenderMarkdownToOOXML_PlaceholdersPassThrough(t *testing.T) {
// Placeholders are sacred — the walker must preserve them verbatim
// so the v1 placeholder pass can substitute them later.
out := RenderMarkdownToOOXML("Sehr geehrter {{parties.claimant.0.name}}", "Normal")
if !strings.Contains(out, "{{parties.claimant.0.name}}") {
t.Errorf("placeholder corrupted: %q", out)
}
}
func TestRenderMarkdownToOOXML_PlaceholderUnderscoresPreserved(t *testing.T) {
// Regression: a placeholder key containing underscores (project.case_number,
// user.display_name, project.patent_number_upc) used to get its underscores
// consumed by the italic/bold inline scanner — the OOXML stored
// {{project.casenumber}} and the preview surfaced
// [KEIN WERT: project.casenumber] instead of the real value.
cases := []string{
"{{project.case_number}}",
"{{user.display_name}}",
"{{project.patent_number_upc}}",
"prefix {{project.case_number}} suffix",
"two: {{a.b_c}} and {{d.e_f}}",
"mixed: _italic_ then {{project.case_number}} then __bold__",
}
for _, in := range cases {
out := RenderMarkdownToOOXML(in, "Normal")
// Every placeholder substring in the input must appear verbatim
// in the output (XML escaping is irrelevant for {} and _).
for _, ph := range extractPlaceholders(in) {
if !strings.Contains(out, ph) {
t.Errorf("input %q: placeholder %q lost; got %q", in, ph, out)
}
}
}
}
func TestParseInlineSpans_PlaceholderWithUnderscoresIsLiteral(t *testing.T) {
// Direct guard on the inline scanner. {{project.case_number}} must
// emit as a single non-italic span containing the full placeholder.
spans := parseInlineSpans("{{project.case_number}}")
if len(spans) != 1 {
t.Fatalf("expected 1 span; got %d (%+v)", len(spans), spans)
}
if spans[0].Italic || spans[0].Bold {
t.Errorf("placeholder must not be italic/bold; got %+v", spans[0])
}
if spans[0].Text != "{{project.case_number}}" {
t.Errorf("placeholder text corrupted: got %q", spans[0].Text)
}
}
func TestParseInlineSpans_ItalicAroundPlaceholder(t *testing.T) {
// Italic delimiters outside a placeholder still work; the placeholder
// itself stays literal even when it sits between italics.
spans := parseInlineSpans("_before_ {{x.y_z}} _after_")
var saw struct {
italicBefore bool
placeholder bool
italicAfter bool
}
for _, s := range spans {
if s.Italic && s.Text == "before" {
saw.italicBefore = true
}
if !s.Italic && !s.Bold && strings.Contains(s.Text, "{{x.y_z}}") {
saw.placeholder = true
}
if s.Italic && s.Text == "after" {
saw.italicAfter = true
}
}
if !saw.italicBefore || !saw.placeholder || !saw.italicAfter {
t.Errorf("expected italic/placeholder/italic structure; got %+v", spans)
}
}
// extractPlaceholders pulls every {{...}} occurrence out of a Markdown
// source. Tiny helper, only used by the regression test above.
func extractPlaceholders(s string) []string {
var out []string
for {
start := strings.Index(s, "{{")
if start < 0 {
return out
}
end := strings.Index(s[start+2:], "}}")
if end < 0 {
return out
}
out = append(out, s[start:start+2+end+2])
s = s[start+2+end+2:]
}
}
func TestRenderMarkdownToOOXML_XMLEscape(t *testing.T) {
out := RenderMarkdownToOOXML("a & b < c > d", "")
if strings.Contains(out, " & ") {
t.Errorf("unescaped & survived: %q", out)
}
if !strings.Contains(out, "&amp;") || !strings.Contains(out, "&lt;") || !strings.Contains(out, "&gt;") {
t.Errorf("expected escaped entities; got %q", out)
}
}
func TestRenderMarkdownToOOXML_BlankLinesPreserveSpacing(t *testing.T) {
// Two blank lines between paragraphs → one empty paragraph in
// between, preserving the lawyer's intentional whitespace.
out := RenderMarkdownToOOXML("first\n\n\nsecond", "Normal")
if got := strings.Count(out, "<w:p>"); got != 3 {
t.Errorf("expected 3 <w:p> (first + blank + second); got %d in %q", got, out)
}
}
func TestRenderMarkdownToOOXML_CRLFNormalisation(t *testing.T) {
out := RenderMarkdownToOOXML("first\r\n\r\nsecond", "")
if got := strings.Count(out, "<w:p>"); got != 2 {
t.Errorf("CRLF input should produce 2 paragraphs; got %d in %q", got, out)
}
}
func TestParseInlineSpans_Plain(t *testing.T) {
spans := parseInlineSpans("hello world")
if len(spans) != 1 || spans[0].Bold || spans[0].Italic || spans[0].Text != "hello world" {
t.Errorf("expected single plain span; got %+v", spans)
}
}
func TestParseInlineSpans_UnderscoreItalic(t *testing.T) {
spans := parseInlineSpans("_emph_")
var italicHits int
for _, s := range spans {
if s.Italic && s.Text == "emph" {
italicHits++
}
}
if italicHits != 1 {
t.Errorf("expected one italic 'emph' span; got %+v", spans)
}
}
func TestParseInlineSpans_UnderscoreBold(t *testing.T) {
spans := parseInlineSpans("__strong__")
var boldHits int
for _, s := range spans {
if s.Bold && s.Text == "strong" {
boldHits++
}
}
if boldHits != 1 {
t.Errorf("expected one bold 'strong' span; got %+v", spans)
}
}
// ─────────────────────────────────────────────────────────────────────
// Slice D — rich-prose constructs
// ─────────────────────────────────────────────────────────────────────
func slicedStylemap() map[string]string {
return map[string]string{
"paragraph": "Body",
"heading_1": "H1",
"heading_2": "H2",
"heading_3": "H3",
"list_bullet": "ListBullet",
"list_numbered": "ListNumber",
"blockquote": "Quote",
}
}
func TestRenderMarkdownToOOXML_Heading1(t *testing.T) {
out := RenderMarkdownToOOXMLWithStyles("# A heading", slicedStylemap(), nil)
if !strings.Contains(out, `<w:pStyle w:val="H1"/>`) {
t.Errorf("heading_1 missing H1 style: %q", out)
}
if !strings.Contains(out, "A heading") {
t.Errorf("heading text missing: %q", out)
}
}
func TestRenderMarkdownToOOXML_Heading2And3(t *testing.T) {
out := RenderMarkdownToOOXMLWithStyles("## H2 line\n### H3 line", slicedStylemap(), nil)
if !strings.Contains(out, `<w:pStyle w:val="H2"/>`) || !strings.Contains(out, "H2 line") {
t.Errorf("h2 not rendered: %q", out)
}
if !strings.Contains(out, `<w:pStyle w:val="H3"/>`) || !strings.Contains(out, "H3 line") {
t.Errorf("h3 not rendered: %q", out)
}
}
func TestRenderMarkdownToOOXML_BulletList(t *testing.T) {
out := RenderMarkdownToOOXMLWithStyles("- first\n- second\n* third", slicedStylemap(), nil)
if !strings.Contains(out, `<w:pStyle w:val="ListBullet"/>`) {
t.Errorf("bullet stylemap not applied: %q", out)
}
if strings.Count(out, "• ") != 3 {
t.Errorf("expected 3 bullet prefixes; got %d in %q", strings.Count(out, "• "), out)
}
}
func TestRenderMarkdownToOOXML_NumberedList(t *testing.T) {
out := RenderMarkdownToOOXMLWithStyles("1. first\n2. second\n3. third", slicedStylemap(), nil)
if !strings.Contains(out, `<w:pStyle w:val="ListNumber"/>`) {
t.Errorf("numbered stylemap not applied: %q", out)
}
for _, want := range []string{"1. ", "2. ", "3. "} {
if !strings.Contains(out, want) {
t.Errorf("missing ordinal prefix %q in %q", want, out)
}
}
}
func TestRenderMarkdownToOOXML_NumberedListResetsOnNonList(t *testing.T) {
// "1. A\n2. B\nplain\n1. C" → 1. A, 2. B, plain para, 1. C
out := RenderMarkdownToOOXMLWithStyles("1. A\n2. B\nplain\n1. C", slicedStylemap(), nil)
// The plain "plain" line breaks the list, so the next numbered
// item restarts at 1.
idxA := strings.Index(out, "1. ")
if idxA < 0 {
t.Fatalf("first 1. missing: %q", out)
}
idxB := strings.Index(out, "2. ")
if idxB < 0 || idxB <= idxA {
t.Fatalf("2. not after 1.: idxA=%d idxB=%d", idxA, idxB)
}
rest := out[idxB+1:]
idxC := strings.Index(rest, "1. ")
if idxC < 0 {
t.Errorf("numbered counter didn't reset on non-list block: %q", out)
}
}
func TestRenderMarkdownToOOXML_Blockquote(t *testing.T) {
out := RenderMarkdownToOOXMLWithStyles("> the quoted text", slicedStylemap(), nil)
if !strings.Contains(out, `<w:pStyle w:val="Quote"/>`) {
t.Errorf("blockquote stylemap not applied: %q", out)
}
if !strings.Contains(out, "the quoted text") {
t.Errorf("blockquote text missing: %q", out)
}
}
func TestRenderMarkdownToOOXML_Hyperlink(t *testing.T) {
allocated := map[string]string{}
alloc := func(url string) string {
rid := "rIdComposer" + url
allocated[url] = rid
return rid
}
out := RenderMarkdownToOOXMLWithStyles("See [Bundesgerichtshof](https://bgh.bund.de) for details.", slicedStylemap(), alloc)
if _, ok := allocated["https://bgh.bund.de"]; !ok {
t.Errorf("allocator never called for URL: %q", out)
}
if !strings.Contains(out, `<w:hyperlink r:id="rIdComposerhttps://bgh.bund.de">`) {
t.Errorf("hyperlink tag missing or wrong rid: %q", out)
}
if !strings.Contains(out, "Bundesgerichtshof") {
t.Errorf("link label missing: %q", out)
}
if !strings.Contains(out, `<w:rStyle w:val="Hyperlink"/>`) {
t.Errorf("hyperlink character style missing: %q", out)
}
}
func TestRenderMarkdownToOOXML_HyperlinkNilAllocatorFallsBackToPlain(t *testing.T) {
out := RenderMarkdownToOOXMLWithStyles("See [BGH](https://bgh.bund.de) here.", slicedStylemap(), nil)
// Without an allocator, the label still renders as plain text.
if !strings.Contains(out, "BGH") {
t.Errorf("label dropped: %q", out)
}
if strings.Contains(out, "<w:hyperlink") {
t.Errorf("hyperlink emitted without allocator: %q", out)
}
}
func TestDetectBlockMarker(t *testing.T) {
cases := []struct {
in string
kind string
want string
ok bool
}{
{"# A", "heading_1", "A", true},
{"## B", "heading_2", "B", true},
{"### C", "heading_3", "C", true},
{" # indented", "heading_1", "indented", true}, // up to 3 spaces tolerated
{" # too-deep", "", "", false}, // 4 spaces → not a heading
{"- bullet", "list_bullet", "bullet", true},
{"* star", "list_bullet", "star", true},
{"1. one", "list_numbered", "one", true},
{"42. forty-two", "list_numbered", "forty-two", true},
{"1) paren", "list_numbered", "paren", true},
{"1.no-space", "", "", false}, // ordinal needs trailing space
{"> quote", "blockquote", "quote", true},
{"plain", "", "", false},
{"#nospace", "", "", false}, // heading needs space after hash
}
for _, tc := range cases {
t.Run(tc.in, func(t *testing.T) {
kind, payload, ok := detectBlockMarker(tc.in)
if ok != tc.ok || kind != tc.kind || payload != tc.want {
t.Errorf("detectBlockMarker(%q) = (%q,%q,%v); want (%q,%q,%v)", tc.in, kind, payload, ok, tc.kind, tc.want, tc.ok)
}
})
}
}

531
pkg/docforge/docx/merge.go Normal file
View File

@@ -0,0 +1,531 @@
package docx
// Submission template renderer — in-house engine for the submission
// draft editor (t-paliad-238, design doc
// docs/design-submission-page-2026-05-22.md §3 / §6.2).
//
// Resurrected from commit 8ea3509 (the original t-paliad-215 Slice 1
// "in-house .docx render engine"). Kept in a separate file from the
// format-only converter (submission_render.go) so the t-paliad-230
// /generate one-click path stays unchanged and the merge engine doesn't
// have to share zip-helper names with it.
//
// Why not lukasjarosch/go-docx: the library's "nested placeholder" guard
// treats sibling placeholders inside the same <w:t> run (e.g.
// "{{a}} ./. {{b}}") as nested and refuses to replace either. Patent
// submissions routinely have multiple placeholders per paragraph (party
// blocks especially), so the library is a non-starter. This renderer
// handles single-run placeholders (preserving run-level formatting) AND
// cross-run placeholders (rewriting the paragraph as one run when Word
// has fragmented the placeholder across runs).
//
// Placeholder grammar: {{[A-Za-z][A-Za-z0-9_.]*}} with optional
// whitespace inside braces ({{ project.case_number }} ≡
// {{project.case_number}}).
//
// Missing-value behaviour: when a placeholder has no binding in the
// PlaceholderMap, the renderer emits a marker token so the lawyer sees
// the gap in Word rather than failing the request.
import (
"archive/zip"
"bytes"
"fmt"
"io"
"regexp"
"strings"
)
// PlaceholderMap is the variable bag built by SubmissionVarsService.
// Keys are dotted paths without braces (e.g. "project.case_number").
// Values are the substituted text — already locale-aware, pretty-
// printed, and sanitised by the caller.
type PlaceholderMap map[string]string
// MissingPlaceholderFn translates an unbound placeholder key into the
// in-document marker token. The default in DefaultMissingMarker is
// "[KEIN WERT: <key>]" / "[NO VALUE: <key>]" depending on lang.
type MissingPlaceholderFn func(key string) string
// valueWrapperFn wraps a substituted value with a marker the HTML
// preview emitter can recognise — used by RenderHTML to turn each
// substituted value into a clickable <span class="draft-var" …>
// (t-paliad-261, click-variable-in-preview → jump-to-field). nil means
// no wrapping; the .docx export path uses nil so its output is
// byte-identical to the wrapper-free build. The wrapper is invoked for
// both resolved values and missing-marker text so clicking a missing
// placeholder still jumps to the corresponding sidebar input.
type valueWrapperFn func(key, value string) string
// Private-Use-Area sentinels for the HTML preview wrap. PUA characters
// are valid in XML 1.0 content, never appear in legitimate template
// text, pass unchanged through xmlEncode/xmlDecode/htmlEscape, and are
// stripped by emitTextWithDraftVars when the preview HTML is assembled.
const (
previewVarBegin = ""
previewVarMid = ""
previewVarEnd = ""
)
// htmlPreviewWrapper wraps a substituted value with the PUA sentinels
// emitTextWithDraftVars recognises. Used only by RenderHTML; the .docx
// Render path uses nil so its output is identical to the pre-261 build.
func htmlPreviewWrapper(key, value string) string {
return previewVarBegin + key + previewVarMid + value + previewVarEnd
}
// DefaultMissingMarker returns the standard missing-value marker for
// the given UI language.
func DefaultMissingMarker(lang string) MissingPlaceholderFn {
prefix := "KEIN WERT"
if strings.EqualFold(lang, "en") {
prefix = "NO VALUE"
}
return func(key string) string {
return "[" + prefix + ": " + key + "]"
}
}
// placeholderRegex matches a single placeholder. The capture group
// extracts the key name without braces or surrounding whitespace.
//
// Restricted to [A-Za-z][A-Za-z0-9_.]* so that stray "{{" sequences in
// legal prose don't get mistaken for placeholders. A genuine placeholder
// always starts with an ASCII letter.
var placeholderRegex = regexp.MustCompile(`\{\{\s*([A-Za-z][A-Za-z0-9_.]*)\s*\}\}`)
// SubmissionRenderer renders a .docx template into a .docx output by
// substituting {{placeholder}} tokens with values from a PlaceholderMap.
// Stateless; safe for concurrent use.
type SubmissionRenderer struct{}
// NewSubmissionRenderer constructs the renderer.
func NewSubmissionRenderer() *SubmissionRenderer {
return &SubmissionRenderer{}
}
// Render reads the .docx template at templateBytes, substitutes every
// placeholder from vars (or emits the missing-marker token), and returns
// the merged .docx bytes. Unknown placeholders never fail the render —
// the lawyer sees the marker in Word and fixes it.
//
// Pre-pass: ConvertDotmToDocx is called on the input so a .dotm
// template (macro-bearing) is downgraded to a plain .docx before the
// merge step runs. Idempotent on inputs that are already plain .docx.
func (r *SubmissionRenderer) Render(templateBytes []byte, vars PlaceholderMap, missing MissingPlaceholderFn) ([]byte, error) {
if missing == nil {
missing = DefaultMissingMarker("de")
}
cleanBytes, err := ConvertDotmToDocx(templateBytes)
if err != nil {
return nil, fmt.Errorf("submission render: pre-pass convert: %w", err)
}
zr, err := zip.NewReader(bytes.NewReader(cleanBytes), int64(len(cleanBytes)))
if err != nil {
return nil, fmt.Errorf("submission render: open zip: %w", err)
}
var out bytes.Buffer
zw := zip.NewWriter(&out)
for _, entry := range zr.File {
body, err := readMergeZipEntry(entry)
if err != nil {
return nil, fmt.Errorf("submission render: read %s: %w", entry.Name, err)
}
if isWordXMLEntry(entry.Name) {
body = substituteInDocumentXML(body, vars, missing, nil)
}
w, err := zw.CreateHeader(&zip.FileHeader{
Name: entry.Name,
Method: entry.Method,
Modified: entry.Modified,
})
if err != nil {
return nil, fmt.Errorf("submission render: write header %s: %w", entry.Name, err)
}
if _, err := w.Write(body); err != nil {
return nil, fmt.Errorf("submission render: write %s: %w", entry.Name, err)
}
}
if err := zw.Close(); err != nil {
return nil, fmt.Errorf("submission render: finalise zip: %w", err)
}
return out.Bytes(), nil
}
// RenderHTML produces a read-only HTML rendering of the merged document
// body for the draft-editor preview pane. Walks the SAME placeholder
// substitution as Render, then extracts the body text from word/document.xml
// and emits semantic HTML — one <p> per <w:p>, with <strong>/<em> spans
// for runs that carry <w:b>/<w:i> formatting. Tables, lists, and complex
// formatting collapse to plain paragraphs (the preview is a fidelity
// guide, not a WYSIWYG editor — final formatting comes from Word at
// export).
//
// Returns escaped HTML safe to inject into the page via dangerouslySet
// or innerHTML. The caller is responsible for wrapping in an outer
// container; this method emits only the body fragment.
func (r *SubmissionRenderer) RenderHTML(templateBytes []byte, vars PlaceholderMap, missing MissingPlaceholderFn) (string, error) {
if missing == nil {
missing = DefaultMissingMarker("de")
}
cleanBytes, err := ConvertDotmToDocx(templateBytes)
if err != nil {
return "", fmt.Errorf("submission render html: pre-pass convert: %w", err)
}
zr, err := zip.NewReader(bytes.NewReader(cleanBytes), int64(len(cleanBytes)))
if err != nil {
return "", fmt.Errorf("submission render html: open zip: %w", err)
}
var docXML []byte
for _, entry := range zr.File {
if entry.Name != "word/document.xml" {
continue
}
docXML, err = readMergeZipEntry(entry)
if err != nil {
return "", fmt.Errorf("submission render html: read document.xml: %w", err)
}
break
}
if docXML == nil {
return "", fmt.Errorf("submission render html: word/document.xml missing")
}
merged := substituteInDocumentXML(docXML, vars, missing, htmlPreviewWrapper)
return docXMLToHTML(merged), nil
}
// isWordXMLEntry returns true for the .docx parts that contain
// substitutable text. We touch document.xml plus header*.xml and
// footer*.xml (templates may put firm letterhead in a header) but
// skip styles, theme, settings, comments, footnotes — none of which
// should carry merge placeholders in a well-formed template.
func isWordXMLEntry(name string) bool {
switch {
case name == "word/document.xml":
return true
case strings.HasPrefix(name, "word/header") && strings.HasSuffix(name, ".xml"):
return true
case strings.HasPrefix(name, "word/footer") && strings.HasSuffix(name, ".xml"):
return true
}
return false
}
// readMergeZipEntry slurps a zip entry's bytes. Named distinctly from
// the helper in submission_render.go (readZipFile) to keep this file
// self-contained — the two are functionally identical.
func readMergeZipEntry(f *zip.File) ([]byte, error) {
rc, err := f.Open()
if err != nil {
return nil, err
}
defer rc.Close()
return io.ReadAll(rc)
}
// substituteInDocumentXML walks document XML and replaces every
// {{placeholder}} occurrence inside <w:t> text nodes. Handles both
// single-run placeholders (the common case for freshly authored
// templates) and cross-run placeholders (where Word's autocorrect or
// manual editing has split a placeholder across runs).
//
// Two-pass strategy:
//
// 1. Pass 1: replace placeholders that fit entirely within one
// <w:t>…</w:t>. This is the 99% case and preserves all run-level
// formatting (bold, italic, font runs).
// 2. Pass 2: for paragraphs that still contain orphan "{{" or "}}"
// markers after pass 1, merge the text of every <w:t> inside the
// paragraph, run the replacement on the merged text, and rewrite
// the paragraph's runs as a single <w:r><w:t>…</w:t></w:r> using
// the formatting properties of the first run.
func substituteInDocumentXML(body []byte, vars PlaceholderMap, missing MissingPlaceholderFn, wrap valueWrapperFn) []byte {
replaced := substituteInTextNodes(body, vars, missing, wrap)
if !needsCrossRunMerge(replaced) {
return replaced
}
return substituteAcrossRuns(replaced, vars, missing, wrap)
}
// wTextNodeRegex matches one <w:t …>contents</w:t> element, capturing
// the contents.
var wTextNodeRegex = regexp.MustCompile(`<w:t(\s[^>]*)?>([^<]*)</w:t>`)
// substituteInTextNodes runs the placeholder replacement inside each
// <w:t> text node independently. Format-preserving for single-run
// placeholders.
func substituteInTextNodes(body []byte, vars PlaceholderMap, missing MissingPlaceholderFn, wrap valueWrapperFn) []byte {
return wTextNodeRegex.ReplaceAllFunc(body, func(match []byte) []byte {
sub := wTextNodeRegex.FindSubmatch(match)
attrs := string(sub[1])
contents := xmlDecode(string(sub[2]))
replaced := replacePlaceholders(contents, vars, missing, wrap)
if replaced == contents {
return match
}
if !strings.Contains(attrs, "xml:space") &&
(strings.HasPrefix(replaced, " ") || strings.HasSuffix(replaced, " ")) {
attrs += ` xml:space="preserve"`
}
return []byte(`<w:t` + attrs + `>` + xmlEncode(replaced) + `</w:t>`)
})
}
// needsCrossRunMerge returns true when the body still contains an
// unmatched "{{" or "}}" inside any <w:t> after pass 1.
func needsCrossRunMerge(body []byte) bool {
for _, m := range wTextNodeRegex.FindAllSubmatch(body, -1) {
t := string(m[2])
if strings.Contains(t, "{{") || strings.Contains(t, "}}") {
return true
}
}
return false
}
// wParagraphRegex matches one <w:p>…</w:p> paragraph block. Greedy
// inner-content match is safe — <w:p> elements do not nest.
var wParagraphRegex = regexp.MustCompile(`(?s)<w:p\b[^>]*>.*?</w:p>`)
// wRunPropsRegex pulls the first <w:rPr>…</w:rPr> block from a paragraph.
var wRunPropsRegex = regexp.MustCompile(`(?s)<w:rPr>.*?</w:rPr>`)
// wParagraphPropsRegex pulls the optional <w:pPr>…</w:pPr>.
var wParagraphPropsRegex = regexp.MustCompile(`(?s)<w:pPr>.*?</w:pPr>`)
// substituteAcrossRuns is pass 2: concatenate every text node in a
// fragmented-placeholder paragraph, run replacement, rewrite as one run.
func substituteAcrossRuns(body []byte, vars PlaceholderMap, missing MissingPlaceholderFn, wrap valueWrapperFn) []byte {
return wParagraphRegex.ReplaceAllFunc(body, func(para []byte) []byte {
textNodes := wTextNodeRegex.FindAllSubmatch(para, -1)
if len(textNodes) == 0 {
return para
}
var merged strings.Builder
for _, m := range textNodes {
merged.WriteString(xmlDecode(string(m[2])))
}
original := merged.String()
if !strings.Contains(original, "{{") {
return para
}
replaced := replacePlaceholders(original, vars, missing, wrap)
if replaced == original {
return para
}
pPr := wParagraphPropsRegex.Find(para)
rPr := wRunPropsRegex.Find(para)
var rebuilt bytes.Buffer
rebuilt.WriteString(`<w:p>`)
if pPr != nil {
rebuilt.Write(pPr)
}
rebuilt.WriteString(`<w:r>`)
if rPr != nil {
rebuilt.Write(rPr)
}
rebuilt.WriteString(`<w:t xml:space="preserve">`)
rebuilt.WriteString(xmlEncode(replaced))
rebuilt.WriteString(`</w:t></w:r></w:p>`)
return rebuilt.Bytes()
})
}
// replacePlaceholders performs the actual substitution on a plain
// string. Unbound placeholders render the missing marker. When wrap is
// non-nil, both the resolved value AND the missing-marker text are
// passed through wrap(key, value) — the HTML preview path uses this to
// emit clickable spans around every substituted placeholder, including
// missing ones (clicking a missing marker jumps to the corresponding
// sidebar input).
func replacePlaceholders(s string, vars PlaceholderMap, missing MissingPlaceholderFn, wrap valueWrapperFn) string {
return placeholderRegex.ReplaceAllStringFunc(s, func(match string) string {
sub := placeholderRegex.FindStringSubmatch(match)
if len(sub) < 2 {
return match
}
key := sub[1]
var value string
if v, ok := vars[key]; ok {
value = v
} else {
value = missing(key)
}
if wrap != nil {
return wrap(key, value)
}
return value
})
}
// xmlDecode reverses the five standard XML entities Word emits in
// <w:t> content.
func xmlDecode(s string) string {
s = strings.ReplaceAll(s, "&lt;", "<")
s = strings.ReplaceAll(s, "&gt;", ">")
s = strings.ReplaceAll(s, "&quot;", `"`)
s = strings.ReplaceAll(s, "&apos;", "'")
s = strings.ReplaceAll(s, "&amp;", "&")
return s
}
// xmlEncode escapes for safe insertion back into <w:t> content. & first
// to avoid double-encoding the entity prefixes.
func xmlEncode(s string) string {
s = strings.ReplaceAll(s, "&", "&amp;")
s = strings.ReplaceAll(s, "<", "&lt;")
s = strings.ReplaceAll(s, ">", "&gt;")
s = strings.ReplaceAll(s, `"`, "&quot;")
s = strings.ReplaceAll(s, "'", "&apos;")
return s
}
// docXMLToHTML walks the post-merge document XML and emits HTML for
// the preview pane. One <p> per <w:p>; <strong>/<em> spans for runs
// carrying <w:b>/<w:i>. Tables/lists/images collapse to text. Output
// is HTML-escaped except for the structural <p>/<strong>/<em> tags
// this function emits.
func docXMLToHTML(docXML []byte) string {
paragraphs := wParagraphRegex.FindAll(docXML, -1)
var out bytes.Buffer
for _, para := range paragraphs {
out.WriteString("<p>")
out.WriteString(paragraphToHTML(para))
out.WriteString("</p>\n")
}
if out.Len() == 0 {
return "<p></p>"
}
return out.String()
}
// wRunRegex matches one <w:r>…</w:r> run. Greedy match safe — <w:r>
// elements do not nest.
var wRunRegex = regexp.MustCompile(`(?s)<w:r\b[^>]*>.*?</w:r>`)
// wBoldRegex / wItalicRegex detect the bold/italic flags inside a run's
// <w:rPr>. Word emits <w:b/> or <w:b w:val="true"/>; matching the open
// tag covers both forms.
var (
wBoldRegex = regexp.MustCompile(`<w:b\b[^>]*/?>`)
wItalicRegex = regexp.MustCompile(`<w:i\b[^>]*/?>`)
)
// paragraphToHTML extracts the text from each <w:r> inside a paragraph,
// wraps runs flagged bold/italic with the corresponding HTML tags, and
// HTML-escapes the text content.
func paragraphToHTML(para []byte) string {
runs := wRunRegex.FindAll(para, -1)
if len(runs) == 0 {
// Empty paragraph (line break).
return ""
}
var out bytes.Buffer
for _, run := range runs {
text := extractRunText(run)
if text == "" {
continue
}
// Check for bold/italic on the run's <w:rPr>.
rPr := wRunPropsRegex.Find(run)
bold := rPr != nil && wBoldRegex.Match(rPr) && !isFalseFlag(rPr, wBoldRegex)
italic := rPr != nil && wItalicRegex.Match(rPr) && !isFalseFlag(rPr, wItalicRegex)
if bold {
out.WriteString("<strong>")
}
if italic {
out.WriteString("<em>")
}
out.WriteString(emitTextWithDraftVars(text))
if italic {
out.WriteString("</em>")
}
if bold {
out.WriteString("</strong>")
}
}
return out.String()
}
// emitTextWithDraftVars HTML-escapes run text while converting any
// preview-only sentinels emitted by htmlPreviewWrapper into
// <span class="draft-var" data-var="<key>">…</span>. The key is
// restricted to [A-Za-z][A-Za-z0-9_.]* by placeholderRegex, so no
// attribute-escaping is needed on the key; the value is HTML-escaped
// normally. Sentinel-free text (the Render path output, or template
// text outside placeholders) is passed straight through htmlEscape, so
// callers that never invoked wrap see byte-identical HTML.
//
// t-paliad-261: makes substituted variables clickable in the preview
// pane so the user can jump to the matching input in the sidebar.
func emitTextWithDraftVars(text string) string {
if !strings.Contains(text, previewVarBegin) {
return htmlEscape(text)
}
var out strings.Builder
rest := text
for {
i := strings.Index(rest, previewVarBegin)
if i < 0 {
out.WriteString(htmlEscape(rest))
return out.String()
}
out.WriteString(htmlEscape(rest[:i]))
body := rest[i+len(previewVarBegin):]
mid := strings.Index(body, previewVarMid)
end := strings.Index(body, previewVarEnd)
if mid < 0 || end < 0 || mid > end {
// Malformed sentinel — emit the marker as plain escaped
// text and continue past it so the rest of the run still
// renders.
out.WriteString(htmlEscape(previewVarBegin))
rest = body
continue
}
key := body[:mid]
value := body[mid+len(previewVarMid) : end]
out.WriteString(`<span class="draft-var" data-var="`)
out.WriteString(key)
out.WriteString(`">`)
out.WriteString(htmlEscape(value))
out.WriteString(`</span>`)
rest = body[end+len(previewVarEnd):]
}
}
// extractRunText concatenates every <w:t> inside a run, XML-decoding
// the content as it goes.
func extractRunText(run []byte) string {
var out strings.Builder
for _, m := range wTextNodeRegex.FindAllSubmatch(run, -1) {
out.WriteString(xmlDecode(string(m[2])))
}
return out.String()
}
// isFalseFlag returns true if the matched tag explicitly carries
// w:val="false" or w:val="0" — Word's way of turning off an inherited
// format. The default match (just `<w:b/>` or `<w:b w:val="true"/>`)
// is "on".
func isFalseFlag(rPr []byte, rx *regexp.Regexp) bool {
match := rx.Find(rPr)
if match == nil {
return false
}
s := string(match)
return strings.Contains(s, `w:val="false"`) || strings.Contains(s, `w:val="0"`)
}
// htmlEscape escapes the five HTML-significant characters for safe
// insertion into the preview pane.
func htmlEscape(s string) string {
s = strings.ReplaceAll(s, "&", "&amp;")
s = strings.ReplaceAll(s, "<", "&lt;")
s = strings.ReplaceAll(s, ">", "&gt;")
s = strings.ReplaceAll(s, `"`, "&quot;")
s = strings.ReplaceAll(s, "'", "&#39;")
return s
}

View File

@@ -0,0 +1,314 @@
package docx
// Submission merge-engine tests — resurrected from the original
// t-paliad-215 Slice 1 (commit 8ea3509) + Slice 2 (commit 1765d5e).
// Adapted: helper names suffixed with "Merge" so they don't collide
// with the convert tests in submission_render_test.go (minimalDOTM,
// unzipEntries) that test the format-only ConvertDotmToDocx path.
import (
"archive/zip"
"bytes"
"io"
"strings"
"testing"
)
// minimalMergeDOCX builds a tiny .docx zip with one document.xml that
// contains the given body. Just enough to exercise the merge engine.
func minimalMergeDOCX(t *testing.T, documentBody string) []byte {
t.Helper()
var buf bytes.Buffer
zw := zip.NewWriter(&buf)
w, err := zw.Create("word/document.xml")
if err != nil {
t.Fatalf("create document.xml: %v", err)
}
if _, err := io.WriteString(w, documentBody); err != nil {
t.Fatalf("write document.xml: %v", err)
}
w2, err := zw.Create("[Content_Types].xml")
if err != nil {
t.Fatalf("create content types: %v", err)
}
// Use a docx-compatible content type so the convert pre-pass treats
// the input as already-clean (no .dotm rewrites needed).
body := `<?xml version="1.0"?><Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">` +
`<Override PartName="/word/document.xml" ContentType="` + docxMainContentType + `"/></Types>`
if _, err := io.WriteString(w2, body); err != nil {
t.Fatalf("write content types: %v", err)
}
if err := zw.Close(); err != nil {
t.Fatalf("close zip: %v", err)
}
return buf.Bytes()
}
// readMergeDocumentXML pulls word/document.xml out of a rendered .docx.
func readMergeDocumentXML(t *testing.T, b []byte) string {
t.Helper()
zr, err := zip.NewReader(bytes.NewReader(b), int64(len(b)))
if err != nil {
t.Fatalf("open rendered zip: %v", err)
}
for _, f := range zr.File {
if f.Name != "word/document.xml" {
continue
}
rc, err := f.Open()
if err != nil {
t.Fatalf("open document.xml: %v", err)
}
defer rc.Close()
body, err := io.ReadAll(rc)
if err != nil {
t.Fatalf("read document.xml: %v", err)
}
return string(body)
}
t.Fatal("rendered .docx had no word/document.xml")
return ""
}
func TestRender_SingleRunPlaceholder(t *testing.T) {
doc := `<w:document><w:body><w:p><w:r><w:t>{{firm.name}}</w:t></w:r></w:p></w:body></w:document>`
tmpl := minimalMergeDOCX(t, doc)
r := NewSubmissionRenderer()
out, err := r.Render(tmpl, PlaceholderMap{"firm.name": "HLC"}, nil)
if err != nil {
t.Fatalf("render: %v", err)
}
body := readMergeDocumentXML(t, out)
if !strings.Contains(body, ">HLC<") {
t.Errorf("expected HLC in body, got %q", body)
}
if strings.Contains(body, "{{") {
t.Errorf("unreplaced placeholder marker in body: %q", body)
}
}
func TestRender_MultiplePlaceholdersPerRun(t *testing.T) {
doc := `<w:document><w:body><w:p><w:r><w:t>{{parties.claimant.name}}, vertreten durch {{parties.claimant.representative}}</w:t></w:r></w:p></w:body></w:document>`
tmpl := minimalMergeDOCX(t, doc)
r := NewSubmissionRenderer()
out, err := r.Render(tmpl, PlaceholderMap{
"parties.claimant.name": "Acme Inc.",
"parties.claimant.representative": "Kanzlei Müller",
}, nil)
if err != nil {
t.Fatalf("render: %v", err)
}
body := readMergeDocumentXML(t, out)
if !strings.Contains(body, "Acme Inc.") || !strings.Contains(body, "Kanzlei Müller") {
t.Errorf("expected both party values, got %q", body)
}
if strings.Contains(body, "{{") {
t.Errorf("unreplaced placeholder marker in body: %q", body)
}
}
func TestRender_MissingMarker(t *testing.T) {
doc := `<w:document><w:body><w:p><w:r><w:t>{{project.case_number}}</w:t></w:r></w:p></w:body></w:document>`
tmpl := minimalMergeDOCX(t, doc)
r := NewSubmissionRenderer()
out, err := r.Render(tmpl, PlaceholderMap{}, DefaultMissingMarker("de"))
if err != nil {
t.Fatalf("render: %v", err)
}
body := readMergeDocumentXML(t, out)
if !strings.Contains(body, "[KEIN WERT: project.case_number]") {
t.Errorf("expected KEIN WERT marker, got %q", body)
}
outEN, err := r.Render(tmpl, PlaceholderMap{}, DefaultMissingMarker("en"))
if err != nil {
t.Fatalf("render en: %v", err)
}
bodyEN := readMergeDocumentXML(t, outEN)
if !strings.Contains(bodyEN, "[NO VALUE: project.case_number]") {
t.Errorf("expected NO VALUE marker, got %q", bodyEN)
}
}
func TestRender_CrossRunPlaceholder(t *testing.T) {
doc := `<w:document><w:body><w:p><w:r><w:t>Hello {{</w:t></w:r><w:r><w:t>project</w:t></w:r><w:r><w:t>.case_number}}!</w:t></w:r></w:p></w:body></w:document>`
tmpl := minimalMergeDOCX(t, doc)
r := NewSubmissionRenderer()
out, err := r.Render(tmpl, PlaceholderMap{"project.case_number": "7 O 1234/26"}, nil)
if err != nil {
t.Fatalf("render: %v", err)
}
body := readMergeDocumentXML(t, out)
if !strings.Contains(body, "7 O 1234/26") {
t.Errorf("expected case number after cross-run merge, got %q", body)
}
if strings.Contains(body, "{{") {
t.Errorf("orphan placeholder marker remained: %q", body)
}
}
func TestRender_XMLEscaping(t *testing.T) {
doc := `<w:document><w:body><w:p><w:r><w:t>{{user.display_name}}</w:t></w:r></w:p></w:body></w:document>`
tmpl := minimalMergeDOCX(t, doc)
r := NewSubmissionRenderer()
out, err := r.Render(tmpl, PlaceholderMap{
"user.display_name": `Müller & Söhne <GmbH> "Special"`,
}, nil)
if err != nil {
t.Fatalf("render: %v", err)
}
body := readMergeDocumentXML(t, out)
if !strings.Contains(body, "Müller &amp; Söhne &lt;GmbH&gt; &quot;Special&quot;") {
t.Errorf("expected escaped value, got %q", body)
}
}
func TestPlaceholderRegex_Boundaries(t *testing.T) {
tests := []struct {
in string
matches []string
}{
{"plain text", nil},
{"{{foo}}", []string{"{{foo}}"}},
{"{{ foo }}", []string{"{{ foo }}"}},
{"{{foo.bar}}", []string{"{{foo.bar}}"}},
{"{{ foo.bar_baz }}", []string{"{{ foo.bar_baz }}"}},
{"{{1bad}}", nil},
{"{{ foo }} and {{ bar }}", []string{"{{ foo }}", "{{ bar }}"}},
}
for _, tc := range tests {
t.Run(tc.in, func(t *testing.T) {
got := placeholderRegex.FindAllString(tc.in, -1)
if len(got) != len(tc.matches) {
t.Fatalf("got %d matches, want %d (in=%q)", len(got), len(tc.matches), tc.in)
}
for i := range got {
if got[i] != tc.matches[i] {
t.Errorf("match %d: got %q, want %q", i, got[i], tc.matches[i])
}
}
})
}
}
// TestRenderHTML_ExtractsParagraphsAndFormatting verifies the preview
// HTML emitter walks <w:p> / <w:r> / <w:t> correctly and carries
// bold/italic through to <strong>/<em>. Substituted placeholders are
// wrapped in <span class="draft-var" data-var="…"> so the client can
// make them clickable (t-paliad-261).
func TestRenderHTML_ExtractsParagraphsAndFormatting(t *testing.T) {
doc := `<w:document><w:body>` +
`<w:p><w:r><w:t>Hello {{firm.name}}</w:t></w:r></w:p>` +
`<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Bold line</w:t></w:r></w:p>` +
`<w:p><w:r><w:rPr><w:i/></w:rPr><w:t>Italic line</w:t></w:r></w:p>` +
`</w:body></w:document>`
tmpl := minimalMergeDOCX(t, doc)
r := NewSubmissionRenderer()
html, err := r.RenderHTML(tmpl, PlaceholderMap{"firm.name": "HLC"}, nil)
if err != nil {
t.Fatalf("render html: %v", err)
}
if !strings.Contains(html, `<p>Hello <span class="draft-var" data-var="firm.name">HLC</span></p>`) {
t.Errorf("expected merged paragraph with draft-var span, got %q", html)
}
if !strings.Contains(html, "<strong>Bold line</strong>") {
t.Errorf("expected bold span, got %q", html)
}
if !strings.Contains(html, "<em>Italic line</em>") {
t.Errorf("expected italic span, got %q", html)
}
}
// TestRenderHTML_EscapesContent confirms the preview emitter HTML-escapes
// special characters in placeholder values even inside the draft-var
// span wrapper.
func TestRenderHTML_EscapesContent(t *testing.T) {
doc := `<w:document><w:body><w:p><w:r><w:t>{{user.display_name}}</w:t></w:r></w:p></w:body></w:document>`
tmpl := minimalMergeDOCX(t, doc)
r := NewSubmissionRenderer()
html, err := r.RenderHTML(tmpl, PlaceholderMap{
"user.display_name": `M&S <Inc> "X"`,
}, nil)
if err != nil {
t.Fatalf("render html: %v", err)
}
want := `<span class="draft-var" data-var="user.display_name">M&amp;S &lt;Inc&gt; &quot;X&quot;</span>`
if !strings.Contains(html, want) {
t.Errorf("expected escaped value inside draft-var span, got %q", html)
}
}
// TestRenderHTML_WrapsMissingMarker confirms that an unbound placeholder
// is still rendered as a clickable draft-var span so the user can click
// the [KEIN WERT: …] marker in the preview and jump to the field.
func TestRenderHTML_WrapsMissingMarker(t *testing.T) {
doc := `<w:document><w:body><w:p><w:r><w:t>{{project.case_number}}</w:t></w:r></w:p></w:body></w:document>`
tmpl := minimalMergeDOCX(t, doc)
r := NewSubmissionRenderer()
html, err := r.RenderHTML(tmpl, PlaceholderMap{}, nil)
if err != nil {
t.Fatalf("render html: %v", err)
}
want := `<span class="draft-var" data-var="project.case_number">[KEIN WERT: project.case_number]</span>`
if !strings.Contains(html, want) {
t.Errorf("expected missing marker wrapped in draft-var span, got %q", html)
}
}
// TestRenderHTML_WrapsOverriddenValueSameAsResolved is the t-paliad-274
// regression: m's report on m/paliad#106 was that "When filled, the link
// disappears". The preview HTML must wrap an override value with the
// same <span class="draft-var"> as it would an unfilled placeholder, so
// the click-jump from preview→sidebar persists after the user types a
// value. There is no distinction at the renderer level between a value
// that came from the resolved bag (project / parties / deadline lookups)
// and a value the lawyer typed into the sidebar — both arrive in the
// same PlaceholderMap and both must be wrapped.
func TestRenderHTML_WrapsOverriddenValueSameAsResolved(t *testing.T) {
doc := `<w:document><w:body>` +
`<w:p><w:r><w:t>{{project.case_number}} / {{firm.name}}</w:t></w:r></w:p>` +
`</w:body></w:document>`
tmpl := minimalMergeDOCX(t, doc)
r := NewSubmissionRenderer()
// project.case_number is the typed-by-lawyer override.
// firm.name is the always-resolved value from the firm bag.
html, err := r.RenderHTML(tmpl, PlaceholderMap{
"project.case_number": "UPC_CFI_42/2026",
"firm.name": "HLC",
}, nil)
if err != nil {
t.Fatalf("render html: %v", err)
}
wantOverride := `<span class="draft-var" data-var="project.case_number">UPC_CFI_42/2026</span>`
if !strings.Contains(html, wantOverride) {
t.Errorf("expected overridden value wrapped in draft-var span (click-jump must persist after fill, t-paliad-274), got %q", html)
}
wantResolved := `<span class="draft-var" data-var="firm.name">HLC</span>`
if !strings.Contains(html, wantResolved) {
t.Errorf("expected resolved value still wrapped, got %q", html)
}
}
// TestRender_DocxOutputUnchangedByPreviewWrap asserts the hard rule from
// t-paliad-261: the .docx export path must NOT carry the preview-only
// draft-var sentinels or any draft-var span markup. Renders the same
// template through Render (.docx) and asserts the merged document.xml
// has only the resolved value, not a wrapped one.
func TestRender_DocxOutputUnchangedByPreviewWrap(t *testing.T) {
doc := `<w:document><w:body><w:p><w:r><w:t>{{firm.name}}</w:t></w:r></w:p></w:body></w:document>`
tmpl := minimalMergeDOCX(t, doc)
r := NewSubmissionRenderer()
out, err := r.Render(tmpl, PlaceholderMap{"firm.name": "HLC"}, nil)
if err != nil {
t.Fatalf("render docx: %v", err)
}
body := readMergeDocumentXML(t, out)
if !strings.Contains(body, `<w:t>HLC</w:t>`) {
t.Errorf("expected raw resolved value in .docx, got %q", body)
}
// PUA sentinels and any span markup must NOT appear in the .docx.
for _, forbidden := range []string{"draft-var", "data-var", previewVarBegin, previewVarMid, previewVarEnd} {
if strings.Contains(body, forbidden) {
t.Errorf("docx output unexpectedly contains %q: %q", forbidden, body)
}
}
}