# Phase 4 - Generic Ingestion Pipeline
## Goal
Add the first generic ingestion path so arbitrary documents can be imported into the canonical DOC model,
normalized into text representations, and queued for vectorization without depending on the TED-specific model.
## Scope implemented
### Input channels
- file-system polling route for arbitrary documents
- REST/API upload endpoints
### Detection
- classification based on file extension and media type
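
A minimal Python sketch of how extension and media type based classification could work. The `DocKind` enum, both lookup tables, and the rule that the media type wins over the extension are assumptions made for illustration; this note does not document the pipeline's actual mappings or precedence.

```python
from enum import Enum
from pathlib import PurePath
from typing import Optional


class DocKind(Enum):
    PDF = "pdf"
    HTML = "html"
    TEXT = "text"
    MARKDOWN = "markdown"
    XML = "xml"
    UNSUPPORTED = "unsupported"


# Hypothetical lookup tables, not the pipeline's real configuration.
BY_MEDIA_TYPE = {
    "application/pdf": DocKind.PDF,
    "text/html": DocKind.HTML,
    "text/plain": DocKind.TEXT,
    "text/markdown": DocKind.MARKDOWN,
    "application/xml": DocKind.XML,
    "text/xml": DocKind.XML,
}
BY_EXTENSION = {
    ".pdf": DocKind.PDF,
    ".html": DocKind.HTML,
    ".htm": DocKind.HTML,
    ".txt": DocKind.TEXT,
    ".md": DocKind.MARKDOWN,
    ".xml": DocKind.XML,
}


def classify(filename: Optional[str], media_type: Optional[str]) -> DocKind:
    """Classify by media type first, then fall back to the file extension."""
    if media_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        base = media_type.split(";", 1)[0].strip().lower()
        if base in BY_MEDIA_TYPE:
            return BY_MEDIA_TYPE[base]
    if filename:
        suffix = PurePath(filename).suffix.lower()
        if suffix in BY_EXTENSION:
            return BY_EXTENSION[suffix]
    # Assumed default: anything unrecognized is treated as unsupported.
    return DocKind.UNSUPPORTED
```

Falling back to the extension lets a generic media type such as `application/octet-stream` still resolve to a useful kind when the file name is informative.
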
### Extraction
- PDF -> text via PDFBox
- HTML -> cleaned text via JSoup
- text / markdown / generic XML -> normalized UTF-8 text
- unsupported binary types -> no text extraction, fallback warning only
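
The text normalization and the unsupported-type fallback can be sketched in Python as follows. PDF and HTML extraction are left out on purpose, since they rely on PDFBox and JSoup in the pipeline. The `ExtractionResult` type, the kind strings, the Latin-1 decoding fallback, and the newline normalization are illustrative assumptions, not documented behavior.

```python
import codecs
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical kind labels for the text-like inputs listed above.
TEXT_KINDS = {"text", "markdown", "xml"}


@dataclass
class ExtractionResult:
    text: Optional[str]
    warnings: List[str] = field(default_factory=list)


def normalize_text(raw: bytes) -> str:
    """Decode bytes to a str and normalize line endings."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: Latin-1 maps every byte, so decoding cannot fail.
        text = raw.decode("latin-1")
    return text.replace("\r\n", "\n").replace("\r", "\n").strip()


def extract(kind: str, raw: bytes) -> ExtractionResult:
    """Return normalized text, or a warning when no extractor applies."""
    if kind in TEXT_KINDS:
        return ExtractionResult(text=normalize_text(raw))
    if kind in {"pdf", "html"}:
        # Omitted from this sketch; the pipeline uses PDFBox and JSoup here.
        raise NotImplementedError(f"extractor for '{kind}' not sketched")
    # Unsupported types yield no text, only a warning.
    return ExtractionResult(
        text=None,
        warnings=[f"unsupported type '{kind}': no text extracted"],
    )
```

Returning a result object with warnings, rather than raising, keeps an unsupported upload from failing the whole import.
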
### Representation building
- default generic builder creates:
  - FULLTEXT
  - SEMANTIC_TEXT
  - TITLE_ABSTRACT
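
One way a default builder might derive the three representations from extracted text, sketched in Python. This note only names the representation types; how each is computed here (whitespace collapsing for SEMANTIC_TEXT, first non-empty line as the title, a 500-character abstract cap) is an assumption, and `build_representations` and `ABSTRACT_MAX_CHARS` are hypothetical names.

```python
import re
from typing import Dict

ABSTRACT_MAX_CHARS = 500  # hypothetical limit, not taken from the pipeline


def build_representations(text: str) -> Dict[str, str]:
    """Build the three default text representations from extracted text."""
    fulltext = text.strip()

    # Assumption: SEMANTIC_TEXT is the full text with whitespace collapsed.
    semantic = re.sub(r"\s+", " ", fulltext)

    # Assumption: the title is the first non-empty line and the abstract is
    # the leading part of the remaining text.
    lines = [ln.strip() for ln in fulltext.splitlines() if ln.strip()]
    title = lines[0] if lines else ""
    rest = re.sub(r"\s+", " ", " ".join(lines[1:]))
    abstract = rest[:ABSTRACT_MAX_CHARS].rstrip()
    title_abstract = f"{title}\n{abstract}".strip()

    return {
        "FULLTEXT": fulltext,
        "SEMANTIC_TEXT": semantic,
        "TITLE_ABSTRACT": title_abstract,
    }
```
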
### Persistence
- original content stored in DOC.doc_content
- binary originals can now be stored inline in `binary_content`
- derived text variants persisted as additional DOC.doc_content rows
- text representations persisted in DOC.doc_text_representation
- pending embeddings created in DOC.doc_embedding when enabled
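
The persistence rules above can be read as a row plan per imported document. The Python sketch below only enumerates which rows would be written to which table. The `PlannedRow` type, its `role` and `detail` fields, and the choice of one pending embedding per text representation are assumptions; only the table names and `binary_content` come from this note.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PlannedRow:
    table: str
    role: str    # hypothetical discriminator, not a documented column
    detail: str


def plan_rows(
    is_binary: bool,
    derived_variants: List[str],
    representations: List[str],
    embeddings_enabled: bool,
) -> List[PlannedRow]:
    """List the rows one import would write, in insertion order."""
    rows: List[PlannedRow] = []

    # The original always gets a content row; binaries go inline.
    storage = "binary_content" if is_binary else "text content"
    rows.append(PlannedRow("DOC.doc_content", "ORIGINAL", storage))

    # Derived text variants become additional content rows.
    for variant in derived_variants:
        rows.append(PlannedRow("DOC.doc_content", "DERIVED_TEXT", variant))

    for rep in representations:
        rows.append(PlannedRow("DOC.doc_text_representation", "REPRESENTATION", rep))

    # Assumption: one pending embedding per representation when enabled.
    if embeddings_enabled:
        for rep in representations:
            rows.append(PlannedRow("DOC.doc_embedding", "PENDING", rep))

    return rows
```

For a binary original with one derived text variant and the three default representations, this plan yields eight rows when embeddings are enabled and five when they are not.
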
## Access model
The generic pipeline uses the Phase 0/1 access model:
- optional owner tenant
- mandatory visibility
This supports both:
- public documents (`owner_tenant_id = null`, `visibility = PUBLIC`)
- tenant-owned documents (`owner_tenant_id != null`, `visibility = TENANT/SHARED/...`)
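
A small Python sketch of the two supported combinations. `AccessContext` and `describe_access` are hypothetical names, and the rule that a document without an owner tenant must be `PUBLIC` is an assumption inferred from the two cases listed above, not a documented constraint. The set of non-public visibility values is left open because this note elides it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccessContext:
    visibility: str                        # mandatory
    owner_tenant_id: Optional[str] = None  # optional


def describe_access(ctx: AccessContext) -> str:
    """Validate an access context and report which case it falls under."""
    if not ctx.visibility:
        raise ValueError("visibility is mandatory")
    if ctx.owner_tenant_id is None:
        # Assumption: ownerless documents are only valid as PUBLIC.
        if ctx.visibility != "PUBLIC":
            raise ValueError("documents without an owner tenant must be PUBLIC")
        return "public"
    # Any visibility is accepted for tenant-owned documents in this sketch.
    return "tenant-owned"
```
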
## Deliberately deferred
- DOCX extraction
- ZIP recursive child import in the generic pipeline
- MIME/EML structured parsing
- generic structured projections beyond TED
- chunked long-document representations