# Phase 4.1 TED package and mail/document adapters
This phase extends the generic DOC ingestion SPI with two richer adapters:
- `TedPackageDocumentIngestionAdapter`
- `MailDocumentIngestionAdapter`
## TED package adapter
- imports the package artifact itself as a public DOC document
- expands the `.tar.gz` package into XML child payloads
- imports each child XML as a generic DOC child document
- links children to the package root via `EXTRACTED_FROM`
- keeps the existing legacy TED package processing path intact
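
The root/child structure described above can be sketched as follows. This is a simplified illustration, not the adapter's actual code: `TedPackageSketch`, `Doc`, `Relation` and `importPackage` are hypothetical names, the archive is assumed to be already unpacked into a name-to-bytes map, and the use of `TED_PACKAGE` / `PACKAGE_CHILD` as type labels for root and children is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical, simplified model; not the project's actual classes.
final class TedPackageSketch {

    // A generic DOC document with a visibility and an optional owner tenant.
    record Doc(String id, String type, String visibility, String ownerTenant) {}

    // A directed relation from a child document to its parent.
    record Relation(String childId, String parentId, String type) {}

    record ImportResult(List<Doc> documents, List<Relation> relations) {}

    // entries: archive entry name -> payload, assumed to be already read from the .tar.gz.
    static ImportResult importPackage(String packageName, Map<String, byte[]> entries) {
        List<Doc> docs = new ArrayList<>();
        List<Relation> relations = new ArrayList<>();

        // 1. The package artifact itself becomes a PUBLIC root document without an owner tenant.
        Doc root = new Doc(packageName, "TED_PACKAGE", "PUBLIC", null);
        docs.add(root);

        // 2. Each XML entry becomes a child document linked back to the root.
        for (String name : entries.keySet()) {
            if (!name.toLowerCase(Locale.ROOT).endsWith(".xml")) {
                continue; // non-XML entries are ignored in this sketch
            }
            Doc child = new Doc(packageName + "!/" + name, "PACKAGE_CHILD", "PUBLIC", null);
            docs.add(child);
            relations.add(new Relation(child.id(), root.id(), "EXTRACTED_FROM"));
        }
        return new ImportResult(docs, relations);
    }
}
```

In this sketch the relation points from the child to the package root, so the root can be found from any child without scanning the archive again.
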
## Mail/document adapter
- imports the MIME message as a DOC document
- extracts subject/from/to/body into the semantic text of the mail root document
- imports attachments as child DOC documents
- links attachments via `ATTACHMENT_OF`
- optionally expands ZIP attachments recursively
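
Recursive ZIP expansion of attachments could look roughly like the sketch below, which uses only `java.util.zip` from the JDK. `ZipExpansionSketch`, `Attachment`, `expand` and the `MAX_DEPTH` guard are hypothetical and not taken from the adapter; the `!/` path separator is likewise just an illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch; the real adapter may structure this differently.
final class ZipExpansionSketch {

    // One attachment (or nested archive entry) with a path that records its origin.
    record Attachment(String path, byte[] content) {}

    // Assumed safety limit against deeply nested archives.
    static final int MAX_DEPTH = 3;

    // Returns the attachment itself plus, if enabled, everything found inside ZIP files.
    static List<Attachment> expand(String name, byte[] content, boolean expandZip, int depth)
            throws IOException {
        List<Attachment> out = new ArrayList<>();
        out.add(new Attachment(name, content)); // the wrapper itself is kept as a document

        boolean isZip = name.toLowerCase(Locale.ROOT).endsWith(".zip");
        if (!expandZip || !isZip || depth >= MAX_DEPTH) {
            return out;
        }
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(content))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // readAllBytes() stops at the end of the current entry.
                byte[] payload = zip.readAllBytes();
                out.addAll(expand(name + "!/" + entry.getName(), payload, true, depth + 1));
            }
        }
        return out;
    }
}
```

Each returned element would then be imported as a child DOC document and linked via `ATTACHMENT_OF`; the depth limit is a design choice of this sketch to bound work on hostile or accidental archive nesting.
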
## Access semantics
- TED packages and TED XML children are imported as `PUBLIC` with no owner tenant
- mail documents use a dedicated default mail access context (`mail-default-owner-tenant-key`, `mail-default-visibility`)
- deduplication is access-scope aware so private content is not merged across different tenants
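
One way to make deduplication access-scope aware is to include the access scope in the deduplication key, so identical bytes in different tenants never collide. The sketch below assumes a SHA-256 content hash and a hypothetical `PRIVATE` visibility value; `DedupKeySketch`, `AccessScope` and `dedupKey` are illustrative names, not the project's API.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of an access-scope-aware deduplication key.
final class DedupKeySketch {

    // Visibility plus optional owner tenant; PUBLIC documents have no owner tenant.
    record AccessScope(String visibility, String ownerTenantKey) {}

    // Same content in the same scope yields the same key; a different scope yields a different key.
    static String dedupKey(byte[] content, AccessScope scope) {
        String tenant = scope.ownerTenantKey() == null ? "-" : scope.ownerTenantKey();
        return scope.visibility() + "|" + tenant + "|" + sha256Hex(content);
    }

    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every conforming Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

With such a key, two tenants that receive the same mail attachment each keep their own private document, while a TED XML notice imported twice as `PUBLIC` maps to a single key.
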
Additional notes:
- wrapper/container documents (for example TED package roots or ZIP wrapper documents expanded into child documents) can skip persistence of ORIGINAL content via `ted.generic-ingestion.store-original-content-for-wrapper-documents=false`; metadata, derived representations and child relations are still kept
- adapters can override that default per imported document through `SourceDescriptor.originalContentStoragePolicy` (`STORE` / `SKIP` / `DEFAULT`)
- when original content storage is skipped for a document, `GenericDocumentImportService` also skips extraction, derived-content persistence, representation building, and embedding queueing for that document
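
The interaction between the global wrapper setting and the per-document policy can be read as the small decision function below. This is a sketch under assumptions: `StoragePolicySketch`, `storesOriginal` and `pipelineSteps` are hypothetical, the step names are placeholders, and it assumes that non-wrapper documents with `DEFAULT` always store their original content.

```java
import java.util.List;

// Hypothetical sketch of how the storage policy might be resolved.
final class StoragePolicySketch {

    enum OriginalContentStoragePolicy { STORE, SKIP, DEFAULT }

    // storeOriginalForWrappers mirrors
    // ted.generic-ingestion.store-original-content-for-wrapper-documents.
    static boolean storesOriginal(OriginalContentStoragePolicy policy,
                                  boolean wrapperDocument,
                                  boolean storeOriginalForWrappers) {
        switch (policy) {
            case STORE:
                return true;   // explicit per-document override
            case SKIP:
                return false;  // explicit per-document override
            default:
                // DEFAULT: only wrapper documents are affected by the global setting (assumption).
                return !wrapperDocument || storeOriginalForWrappers;
        }
    }

    // Placeholder step names; skipped documents keep only metadata and child relations here.
    static List<String> pipelineSteps(boolean storesOriginal) {
        if (!storesOriginal) {
            return List.of("metadata", "child-relations");
        }
        return List.of("metadata", "original-content", "extraction", "derived-content",
                "representations", "embedding-queue", "child-relations");
    }
}
```

For example, a TED package root imported with `DEFAULT` while the property is `false` resolves to "do not store", and the import then runs only the reduced step list.
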
Schema note:
- `V8__doc_phase4_1_expand_document_and_source_types.sql` expands the generic `DOC` document/source type domain for `TED_PACKAGE` and `PACKAGE_CHILD`, and also repairs older local/dev schemas that used CHECK constraints instead of PostgreSQL ENUM types.