Phase 4.1 – TED package and mail/document adapters
This phase extends the generic DOC ingestion SPI with two richer adapters:
- TedPackageDocumentIngestionAdapter
- MailDocumentIngestionAdapter
TED package adapter
- imports the package artifact itself as a public DOC document
- expands the .tar.gz package into XML child payloads
- imports each child XML as a generic DOC child document
- links children to the package root via EXTRACTED_FROM
- keeps the existing legacy TED package processing path intact
Mail/document adapter
- imports the MIME message as a DOC document
- extracts subject/from/to/body into the mail root semantic text
- imports attachments as child DOC documents
- links attachments via ATTACHMENT_OF
- optionally expands ZIP attachments recursively
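
The recursive ZIP expansion mentioned above could look roughly like the sketch below. It is not the adapter's actual code: the class name, the `MAX_DEPTH` guard and the `outer.zip!/inner.xml` path convention are all assumptions made for illustration, and the sketch only collects leaf payloads instead of creating DOC documents and ATTACHMENT_OF relations.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper for illustration; not part of the real mail adapter. */
final class ZipAttachmentExpansionSketch {

    /** Assumed guard against deeply nested archives. */
    static final int MAX_DEPTH = 3;

    /** Returns non-archive payloads keyed by a path such as "a.zip!/b.zip!/c.xml". */
    static Map<String, byte[]> expand(String name, byte[] content, int depth) throws IOException {
        Map<String, byte[]> leaves = new LinkedHashMap<>();
        boolean isZip = name.toLowerCase(Locale.ROOT).endsWith(".zip");
        if (!isZip || depth >= MAX_DEPTH) {
            // Not an archive (or too deep): treat the payload as a leaf attachment.
            leaves.put(name, content);
            return leaves;
        }
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(content))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // readAllBytes() stops at the end of the current entry.
                byte[] bytes = zip.readAllBytes();
                leaves.putAll(expand(name + "!/" + entry.getName(), bytes, depth + 1));
            }
        }
        return leaves;
    }

    /** Builds a single-entry ZIP in memory so the sketch can be run without files. */
    static byte[] zipOf(String entryName, byte[] entryContent) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(entryContent);
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] inner = zipOf("notice.xml", "<notice/>".getBytes(StandardCharsets.UTF_8));
        byte[] outer = zipOf("inner.zip", inner);
        // prints: mail.zip!/inner.zip!/notice.xml
        expand("mail.zip", outer, 0).keySet().forEach(System.out::println);
    }
}
```

A depth limit of this kind is one common way to keep recursive expansion bounded; whether the real adapter uses one is not stated here.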
Access semantics
- TED packages and TED XML children are imported as PUBLIC with no owner tenant
- mail documents use a dedicated default mail access context (mail-default-owner-tenant-key, mail-default-visibility)
- deduplication is access-scope aware so private content is not merged across different tenants
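
One way to picture access-scope-aware deduplication is a key that combines the content hash with the access scope. The sketch below is an illustration only: the class and method names, the key layout, and the PRIVATE visibility value are assumptions, not the service's real implementation.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical illustration of an access-scoped deduplication key. */
final class AccessScopedDedupKeySketch {

    /**
     * PUBLIC content shares one scope regardless of tenant; any other
     * visibility is scoped to its owner tenant (assumed rule).
     */
    static String dedupKey(byte[] content, String visibility, String ownerTenantKey) {
        String scope = "PUBLIC".equals(visibility)
                ? "PUBLIC"
                : visibility + ":" + ownerTenantKey;
        return scope + "|" + sha256Hex(content);
    }

    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            // Pad to 64 hex characters so leading zero bytes are not dropped.
            return String.format("%064x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] body = "same bytes".getBytes(StandardCharsets.UTF_8);
        String a = dedupKey(body, "PRIVATE", "tenant-a");
        String b = dedupKey(body, "PRIVATE", "tenant-b");
        // Identical content, different tenants: the keys differ, so nothing is merged.
        System.out.println(a.equals(b)); // prints: false
    }
}
```

With a key shaped like this, identical PUBLIC payloads (for example the same TED XML imported twice) collapse to one key, while identical private mail attachments owned by different tenants stay separate.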
Additional note:
- wrapper/container documents (for example TED package roots or ZIP wrapper documents expanded into child documents) can skip persistence of ORIGINAL content via ted.generic-ingestion.store-original-content-for-wrapper-documents=false. Adapters can now override that default per imported document through SourceDescriptor.originalContentStoragePolicy (STORE / SKIP / DEFAULT), while still keeping metadata, derived representations and child relations.
- when original content storage is skipped for a document, GenericDocumentImportService now also skips extraction, derived-content persistence, representation building, and embedding queueing for that document
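
The interplay between the global wrapper setting and the per-document policy can be sketched as a small resolution function. This is a guess at the decision logic, not the actual GenericDocumentImportService code: the class and method names and the assumption that non-wrapper documents store their original content by default are hypothetical; only the STORE / SKIP / DEFAULT values and the property name come from the notes above.

```java
/** Hypothetical resolution of the original-content storage decision. */
final class OriginalContentStorageSketch {

    /** Mirrors the STORE / SKIP / DEFAULT values named in the notes above. */
    enum OriginalContentStoragePolicy { STORE, SKIP, DEFAULT }

    /**
     * @param policy                    per-document override supplied by the adapter
     * @param wrapperDocument           true for package roots or ZIP wrapper documents
     * @param storeOriginalForWrappers  value of
     *        ted.generic-ingestion.store-original-content-for-wrapper-documents
     */
    static boolean shouldStoreOriginal(OriginalContentStoragePolicy policy,
                                       boolean wrapperDocument,
                                       boolean storeOriginalForWrappers) {
        switch (policy) {
            case STORE:
                return true;   // explicit override wins over the global setting
            case SKIP:
                return false;  // explicit override wins over the global setting
            default:
                // DEFAULT: only wrapper documents consult the global flag
                // (assumption: ordinary documents always keep ORIGINAL content).
                return !wrapperDocument || storeOriginalForWrappers;
        }
    }

    public static void main(String[] args) {
        // Wrapper document, global flag off, no override: original content is skipped.
        System.out.println(shouldStoreOriginal(
                OriginalContentStoragePolicy.DEFAULT, true, false)); // prints: false
        // Same wrapper, but the adapter forces storage for this document.
        System.out.println(shouldStoreOriginal(
                OriginalContentStoragePolicy.STORE, true, false));   // prints: true
    }
}
```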
Schema note:
V8__doc_phase4_1_expand_document_and_source_types.sql expands the generic DOC document/source type domain for TED_PACKAGE and PACKAGE_CHILD, and also repairs older local/dev schemas that used CHECK constraints instead of PostgreSQL ENUM types.