# Phase 4.1 TED package and mail/document adapters
This phase extends the generic DOC ingestion SPI with two richer adapters:
- `TedPackageDocumentIngestionAdapter`
- `MailDocumentIngestionAdapter`
## TED package adapter
- imports the package artifact itself as a public DOC document
- expands the `.tar.gz` package into XML child payloads
- imports each child XML as a generic DOC child document
- links children to the package root via `EXTRACTED_FROM`
- keeps the existing legacy TED package processing path intact
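
The parent/child structure described above can be sketched as follows. This is a minimal illustration, not the adapter's actual code: `DocNode`, `Relation`, `ImportResult` and `TedPackageImportSketch` are hypothetical names, the ids are made up, and the sketch assumes the `.tar.gz` has already been unpacked into name/bytes pairs (the JDK has no built-in tar reader).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical, simplified model; the real DOC entities are not shown in this README.
record DocNode(String id, String sourceName, String visibility) {}
record Relation(String childId, String type, String parentId) {}
record ImportResult(DocNode root, List<DocNode> children, List<Relation> relations) {}

final class TedPackageImportSketch {

    /**
     * Builds one PUBLIC root document for the package artifact and one PUBLIC
     * child document per XML entry, each linked to the root via EXTRACTED_FROM.
     * Assumption: entries holds the already-unpacked archive content.
     */
    static ImportResult importPackage(String packageName, Map<String, byte[]> entries) {
        DocNode root = new DocNode("doc-0", packageName, "PUBLIC");
        List<DocNode> children = new ArrayList<>();
        List<Relation> relations = new ArrayList<>();
        int next = 1;
        for (Map.Entry<String, byte[]> entry : entries.entrySet()) {
            // Only XML payloads become child documents in this sketch.
            if (!entry.getKey().toLowerCase(Locale.ROOT).endsWith(".xml")) {
                continue;
            }
            DocNode child = new DocNode("doc-" + next++, entry.getKey(), "PUBLIC");
            children.add(child);
            relations.add(new Relation(child.id(), "EXTRACTED_FROM", root.id()));
        }
        return new ImportResult(root, children, relations);
    }
}
```

Calling `TedPackageImportSketch.importPackage(name, entries)` with two XML entries and one non-XML entry would yield one root, two children and two `EXTRACTED_FROM` relations.
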
## Mail/document adapter
- imports the MIME message as a DOC document
- extracts subject, from, to and body into the semantic text of the mail root document
- imports attachments as child DOC documents
- links attachments via `ATTACHMENT_OF`
- optionally expands ZIP attachments recursively
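
Recursive ZIP expansion could look roughly like the sketch below, which uses only `java.util.zip` from the JDK. The class name `ZipExpansionSketch`, the `MAX_DEPTH` limit and the `!/` path separator are assumptions made for illustration; the README does not say how the adapter bounds recursion or names nested entries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipExpansionSketch {
    // Hypothetical guard against deeply nested archives.
    static final int MAX_DEPTH = 3;

    /** Collects the paths of all leaf entries, descending into nested ZIPs. */
    static List<String> expand(String parentPath, byte[] zipBytes, int depth) {
        List<String> leaves = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                String path = parentPath + "!/" + entry.getName();
                byte[] content = in.readAllBytes(); // reads only the current entry
                boolean nestedZip = entry.getName().toLowerCase(Locale.ROOT).endsWith(".zip");
                if (nestedZip && depth < MAX_DEPTH) {
                    leaves.addAll(expand(path, content, depth + 1));
                } else {
                    // Beyond MAX_DEPTH a nested ZIP is kept as an opaque leaf.
                    leaves.add(path);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return leaves;
    }

    /** Helper for trying the sketch: builds a single-entry ZIP in memory. */
    static byte[] zipOf(String name, byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(content);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

In a real adapter each leaf would become a child DOC document rather than a path string; the depth limit is one simple way to keep a malicious or accidental archive-in-archive chain from recursing without bound.
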
## Access semantics
- TED packages and TED XML children are imported as `PUBLIC` with no owner tenant
- mail documents use a dedicated default mail access context (`mail-default-owner-tenant-key`, `mail-default-visibility`)
- deduplication is access-scope aware so private content is not merged across different tenants
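
One way to make deduplication access-scope aware is to include the access scope in the lookup key, so identical bytes only collapse into one document when visibility and owner tenant also match. The sketch below assumes a SHA-256 content hash and uses hypothetical names (`AccessScope`, `DedupKey`, `DedupSketch`) and a hypothetical `PRIVATE` visibility value; the actual key layout is not specified here.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical names; ownerTenantKey is null for PUBLIC content without an owner tenant.
record AccessScope(String visibility, String ownerTenantKey) {}
record DedupKey(String contentSha256, AccessScope scope) {}

final class DedupSketch {
    private final Map<DedupKey, String> knownDocuments = new HashMap<>();

    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Returns the id of a document with the same content in the same access
     * scope if one is already known; otherwise registers and returns candidateId.
     */
    String resolve(byte[] content, AccessScope scope, String candidateId) {
        DedupKey key = new DedupKey(sha256Hex(content), scope);
        String existing = knownDocuments.putIfAbsent(key, candidateId);
        return existing != null ? existing : candidateId;
    }
}
```

With this key, the same attachment imported for two different tenants resolves to two separate documents, while a second import within the same tenant (or a second `PUBLIC` import) resolves to the existing one.
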
Additional notes:
- wrapper/container documents (for example TED package roots, or ZIP wrapper documents that are expanded into child documents) can skip persistence of ORIGINAL content via `ted.generic-ingestion.store-original-content-for-wrapper-documents=false`, while still keeping metadata, derived representations and child relations
- adapters can override that default per imported document through `SourceDescriptor.originalContentStoragePolicy` (`STORE` / `SKIP` / `DEFAULT`)
- when original content storage is skipped for a document, `GenericDocumentImportService` also skips extraction, derived-content persistence, representation building, and embedding queueing for that document
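
The interplay between the global property and the per-document policy can be sketched as a small decision function. The enum and class names below are hypothetical, and the sketch assumes that non-wrapper documents store their original content whenever the policy is `DEFAULT`; only the three policy values and the property name come from the notes above.

```java
// Hypothetical enum mirroring the STORE / SKIP / DEFAULT values named above.
enum OriginalContentStoragePolicy { STORE, SKIP, DEFAULT }

final class StoragePolicySketch {

    /**
     * Decides whether ORIGINAL content is persisted for one imported document.
     *
     * @param policy                   per-document policy from the source descriptor
     * @param wrapperDocument          true for package roots and ZIP wrapper documents
     * @param storeOriginalForWrappers value of
     *        ted.generic-ingestion.store-original-content-for-wrapper-documents
     */
    static boolean shouldStoreOriginal(OriginalContentStoragePolicy policy,
                                       boolean wrapperDocument,
                                       boolean storeOriginalForWrappers) {
        return switch (policy) {
            case STORE -> true;   // explicit per-document override
            case SKIP -> false;   // explicit per-document override
            // Assumption: only wrapper documents are affected by the global flag.
            case DEFAULT -> !wrapperDocument || storeOriginalForWrappers;
        };
    }
}
```

A caller would skip the follow-up steps listed above (extraction, derived-content persistence, representation building, embedding queueing) whenever this function returns `false`.
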
Schema note:
- `V8__doc_phase4_1_expand_document_and_source_types.sql` expands the generic `DOC` document/source type domain for `TED_PACKAGE` and `PACKAGE_CHILD`, and also repairs older local/dev schemas that used CHECK constraints instead of PostgreSQL ENUM types.