
Phase 4.1 TED package and mail/document adapters

This phase extends the generic DOC ingestion SPI with two richer adapters:

  • TedPackageDocumentIngestionAdapter
  • MailDocumentIngestionAdapter

TED package adapter

  • imports the package artifact itself as a public DOC document
  • expands the .tar.gz package into XML child payloads
  • imports each child XML as a generic DOC child document
  • links children to the package root via EXTRACTED_FROM
  • keeps the existing legacy TED package processing path intact
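
The parent/child shape described above can be sketched as a small in-memory model. This is only an illustration: the records Doc, Relation and ImportResult and the class TedPackageImportSketch are hypothetical, the ids are made up, and using TED_PACKAGE / PACKAGE_CHILD as type labels is an assumption based on the schema note. Unpacking the .tar.gz is left out; the sketch starts from the already extracted XML entry names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for the real DOC entities (not the project's API).
record Doc(String id, String type, String name) {}
record Relation(String childId, String parentId, String type) {}
record ImportResult(Doc root, List<Doc> children, List<Relation> relations) {}

final class TedPackageImportSketch {

    // Builds one root document for the package plus one child per XML entry,
    // and links every child to the root via EXTRACTED_FROM.
    static ImportResult importPackage(String packageName, List<String> xmlEntryNames) {
        Doc root = new Doc("doc-0", "TED_PACKAGE", packageName);
        List<Doc> children = new ArrayList<>();
        List<Relation> relations = new ArrayList<>();
        int next = 1;
        for (String entryName : xmlEntryNames) {
            Doc child = new Doc("doc-" + next++, "PACKAGE_CHILD", entryName);
            children.add(child);
            relations.add(new Relation(child.id(), root.id(), "EXTRACTED_FROM"));
        }
        return new ImportResult(root, children, relations);
    }
}
```

Calling TedPackageImportSketch.importPackage("pkg.tar.gz", List.of("a.xml", "b.xml")) yields one root, two children and two EXTRACTED_FROM relations that both point at the root.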

Mail/document adapter

  • imports the MIME message as a DOC document
  • extracts subject/from/to/body into the mail root semantic text
  • imports attachments as child DOC documents
  • links attachments via ATTACHMENT_OF
  • optionally expands ZIP attachments recursively
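
Recursive ZIP expansion could look roughly like the following sketch, which uses only java.util.zip from the JDK. The Attachment record, the ZipExpansionSketch class, the "!/" path separator and the MAX_DEPTH guard are assumptions for illustration and are not taken from the adapter. The sketch returns only leaf payloads and does not model the ZIP wrapper documents or the ATTACHMENT_OF links.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical leaf payload: a path that records the nesting, plus the raw bytes.
record Attachment(String path, byte[] content) {}

final class ZipExpansionSketch {

    // Hypothetical guard against deeply nested archives.
    static final int MAX_DEPTH = 3;

    // Returns the attachment itself, or its (recursively expanded) ZIP entries
    // when expansion is enabled and the name ends with ".zip".
    static List<Attachment> expand(String name, byte[] content, boolean expandZips, int depth)
            throws IOException {
        List<Attachment> out = new ArrayList<>();
        boolean isZip = name.toLowerCase(Locale.ROOT).endsWith(".zip");
        if (!expandZips || !isZip || depth >= MAX_DEPTH) {
            out.add(new Attachment(name, content));
            return out;
        }
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(content))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // readAllBytes() stops at the end of the current entry.
                byte[] bytes = zin.readAllBytes();
                out.addAll(expand(name + "!/" + entry.getName(), bytes, expandZips, depth + 1));
            }
        }
        return out;
    }
}
```

With expandZips set to false the method returns the ZIP attachment unchanged as a single item, which mirrors the "optionally" in the list above.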

Access semantics

  • TED packages and TED XML children are imported as PUBLIC with no owner tenant
  • mail documents use a dedicated default mail access context (mail-default-owner-tenant-key, mail-default-visibility)
  • deduplication is access-scope aware so private content is not merged across different tenants
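
One way to make deduplication access-scope aware is to mix the access scope into the content hash. The sketch below is an assumption about how such a key could be built, not the project's actual implementation: the AccessVisibility enum, the PRIVATE value, the DedupKeySketch class and the key layout are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Objects;

// Hypothetical visibility values; only PUBLIC is named in this README.
enum AccessVisibility { PUBLIC, PRIVATE }

final class DedupKeySketch {

    // Hashes the access scope together with the content, so identical bytes
    // owned by different tenants produce different deduplication keys.
    static String dedupKey(byte[] content, AccessVisibility visibility, String ownerTenantKey) {
        String scope;
        if (visibility == AccessVisibility.PUBLIC) {
            scope = "PUBLIC"; // public content has no owner tenant
        } else {
            scope = "PRIVATE:" + Objects.requireNonNull(ownerTenantKey, "ownerTenantKey");
        }
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            sha.update(scope.getBytes(StandardCharsets.UTF_8));
            sha.update((byte) 0); // separator between scope and content
            sha.update(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : sha.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 must be available on every Java platform", e);
        }
    }
}
```

Under this scheme two PUBLIC imports of the same bytes collapse to one key, while the same bytes imported privately for two different tenants stay separate.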

Additional notes:

  • wrapper/container documents (for example TED package roots, or ZIP wrapper documents that are expanded into child documents) can skip persistence of ORIGINAL content via ted.generic-ingestion.store-original-content-for-wrapper-documents=false. Adapters can now override that default per imported document through SourceDescriptor.originalContentStoragePolicy (STORE / SKIP / DEFAULT). Metadata, derived representations and child relations are still kept.

  • when original content storage is skipped for a document, GenericDocumentImportService now also skips extraction, derived-content persistence, representation building, and embedding queueing for that document
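
The interplay between the global property and the per-document policy can be sketched as a small decision function. The enum mirrors the STORE / SKIP / DEFAULT values named above, but everything else is an assumption: the class StoragePolicySketch, the step names, the rule that DEFAULT falls back to the property for wrapper documents only, and the rule that non-wrapper documents store their original content by default.

```java
import java.util.List;

// Mirrors the three values named in this README; the real type may differ.
enum OriginalContentStoragePolicy { STORE, SKIP, DEFAULT }

final class StoragePolicySketch {

    // Resolves whether ORIGINAL content is persisted for one document.
    // storeOriginalForWrappers stands for the property
    // ted.generic-ingestion.store-original-content-for-wrapper-documents.
    static boolean storeOriginal(OriginalContentStoragePolicy policy,
                                 boolean wrapperDocument,
                                 boolean storeOriginalForWrappers) {
        return switch (policy) {
            case STORE -> true;
            case SKIP -> false;
            // Assumption: only wrapper documents are affected by the property.
            case DEFAULT -> !wrapperDocument || storeOriginalForWrappers;
        };
    }

    // Hypothetical step names: when original content is skipped, the
    // content-dependent steps are skipped too.
    static List<String> plannedSteps(boolean storeOriginal) {
        if (!storeOriginal) {
            return List.of("persist-metadata", "link-children");
        }
        return List.of("persist-metadata", "store-original", "extract-text",
                "persist-derived-content", "build-representations",
                "queue-embeddings", "link-children");
    }
}
```

For a TED package root with policy DEFAULT and the property set to false, storeOriginal returns false and plannedSteps reduces to metadata persistence and child linking.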

Schema note:

  • V8__doc_phase4_1_expand_document_and_source_types.sql expands the generic DOC document/source type domain for TED_PACKAGE and PACKAGE_CHILD, and also repairs older local/dev schemas that used CHECK constraints instead of PostgreSQL ENUM types.