# Phase 4 - Generic Ingestion Pipeline

## Goal

Add the first generic ingestion path so arbitrary documents can be imported into the canonical DOC model,
normalized into text representations, and queued for vectorization without depending on the TED-specific model.

## Scope implemented

### Input channels

- file-system polling route for arbitrary documents
- REST/API upload endpoints

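
As a rough illustration of the file-system polling channel, the following Python sketch scans an inbox directory and returns only files it has not handed out before. The function name `poll_once` and the in-memory `seen` set are hypothetical; how the actual route tracks processed files is not described here.

```python
from pathlib import Path
from typing import List, Set


def poll_once(inbox: Path, seen: Set[Path]) -> List[Path]:
    """Return regular files in `inbox` that no earlier poll has returned."""
    # Sub-directories are skipped; only direct children of the inbox are considered.
    fresh = sorted(p for p in inbox.iterdir() if p.is_file() and p not in seen)
    # Remember what was handed out so the next poll does not repeat it.
    seen.update(fresh)
    return fresh
```

Calling `poll_once` repeatedly with the same `seen` set yields each file at most once, which is the minimum a polling channel needs before passing documents on to detection.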
### Detection

- file extension + media type based classification

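
A minimal Python sketch of how extension and media type could be combined into one classification. The `DocumentType` enum, the two lookup tables, and the rule that a recognised media type takes precedence over the extension are all assumptions made for illustration, not the pipeline's documented behaviour.

```python
from enum import Enum
from pathlib import PurePath
from typing import Optional


class DocumentType(Enum):
    PDF = "pdf"
    HTML = "html"
    TEXT = "text"
    MARKDOWN = "markdown"
    XML = "xml"
    UNSUPPORTED = "unsupported"


# Hypothetical lookup tables; the real pipeline may recognise more types.
BY_MEDIA_TYPE = {
    "application/pdf": DocumentType.PDF,
    "text/html": DocumentType.HTML,
    "text/plain": DocumentType.TEXT,
    "text/markdown": DocumentType.MARKDOWN,
    "application/xml": DocumentType.XML,
    "text/xml": DocumentType.XML,
}
BY_EXTENSION = {
    ".pdf": DocumentType.PDF,
    ".html": DocumentType.HTML,
    ".htm": DocumentType.HTML,
    ".txt": DocumentType.TEXT,
    ".md": DocumentType.MARKDOWN,
    ".xml": DocumentType.XML,
}


def classify(filename: Optional[str], media_type: Optional[str]) -> DocumentType:
    """Prefer a recognised media type, then fall back to the file extension."""
    if media_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        base = media_type.split(";", 1)[0].strip().lower()
        if base in BY_MEDIA_TYPE:
            return BY_MEDIA_TYPE[base]
    if filename:
        suffix = PurePath(filename).suffix.lower()
        if suffix in BY_EXTENSION:
            return BY_EXTENSION[suffix]
    return DocumentType.UNSUPPORTED
```

With this ordering, an upload declared as `application/octet-stream` but named `notes.md` is still classified as Markdown, because the unrecognised media type falls through to the extension lookup.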
### Extraction

- PDF -> text via PDFBox
- HTML -> cleaned text via JSoup
- text / markdown / generic XML -> normalized UTF-8 text
- unsupported binary types -> fallback warning only

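
The text-normalization and fallback paths can be sketched in Python as below. This is an illustration under assumptions: the registry name `EXTRACTORS`, the latin-1 decoding fallback, and the NFC normalization step are not taken from the pipeline. The PDF and HTML paths are left out because they rely on PDFBox and JSoup, which are Java libraries.

```python
import logging
import unicodedata
from typing import Callable, Dict, Optional

log = logging.getLogger("ingestion.sketch")


def normalize_text(raw: bytes) -> str:
    """Decode bytes into text that can be stored safely as UTF-8."""
    try:
        text = raw.decode("utf-8-sig")  # also strips a leading BOM
    except UnicodeDecodeError:
        text = raw.decode("latin-1")  # assumed fallback; cannot fail
    # Unify line endings, then compose characters into NFC form.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return unicodedata.normalize("NFC", text).strip()


# Hypothetical registry. PDF and HTML extractors are intentionally not
# registered in this sketch, so those kinds also take the warning path here.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {
    "text": normalize_text,
    "markdown": normalize_text,
    "xml": normalize_text,
}


def extract(kind: str, raw: bytes) -> Optional[str]:
    """Return extracted text, or None plus a warning when no extractor exists."""
    extractor = EXTRACTORS.get(kind)
    if extractor is None:
        log.warning("no extractor for type %r; no text extracted", kind)
        return None
    return extractor(raw)
```

Returning `None` instead of raising keeps an unsupported binary from failing the whole import, which matches the "fallback warning only" behaviour listed above.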
### Representation building

- default generic builder creates:
  - FULLTEXT
  - SEMANTIC_TEXT
  - TITLE_ABSTRACT

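
The following Python sketch shows one way a default builder could emit the three representation types. What each type contains is an assumption made here for illustration: FULLTEXT as the text unchanged, SEMANTIC_TEXT as the whitespace-collapsed text, and TITLE_ABSTRACT as a title plus a short leading excerpt. The names `TextRepresentation`, `build_representations`, and `abstract_chars` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TextRepresentation:
    kind: str  # FULLTEXT, SEMANTIC_TEXT or TITLE_ABSTRACT
    text: str


def build_representations(text: str, title: Optional[str] = None,
                          abstract_chars: int = 500) -> List[TextRepresentation]:
    """Derive the three default representations from extracted text."""
    # Collapse all runs of whitespace (including line breaks) to single spaces.
    collapsed = " ".join(text.split())
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Assumption: without an explicit title, the first non-empty line stands in.
    heading = title or (lines[0] if lines else "")
    abstract = collapsed[:abstract_chars]
    return [
        TextRepresentation("FULLTEXT", text),
        TextRepresentation("SEMANTIC_TEXT", collapsed),
        TextRepresentation("TITLE_ABSTRACT", f"{heading}\n{abstract}".strip()),
    ]
```

Keeping the builder a pure function of the extracted text makes it straightforward to swap in a type-specific builder later without touching extraction or persistence.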
### Persistence

- original content stored in `DOC.doc_content`
- binary originals can now be stored inline in `binary_content`
- derived text variants persisted as additional `DOC.doc_content` rows
- text representations persisted in `DOC.doc_text_representation`
- pending embeddings created in `DOC.doc_embedding` when enabled

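
To make the row layout above concrete, here is a Python sketch that plans which rows one import would write, without touching a database. Only the table names and `binary_content` come from this document; the other field names (`role`, `text_content`, `type`, `status`), the `PENDING` status value, and the rule of one pending embedding per representation are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PersistencePlan:
    doc_content: List[dict] = field(default_factory=list)
    doc_text_representation: List[dict] = field(default_factory=list)
    doc_embedding: List[dict] = field(default_factory=list)


def plan_persistence(original: bytes, is_binary: bool,
                     extracted_text: Optional[str],
                     representations: Dict[str, str],
                     embeddings_enabled: bool) -> PersistencePlan:
    """Collect the rows a single document import would persist."""
    plan = PersistencePlan()
    # The original is always kept; binary originals go inline in binary_content.
    if is_binary:
        plan.doc_content.append({"role": "ORIGINAL", "binary_content": original})
    else:
        plan.doc_content.append(
            {"role": "ORIGINAL", "text_content": original.decode("utf-8")})
    # A derived text variant becomes an additional doc_content row.
    if extracted_text is not None:
        plan.doc_content.append(
            {"role": "DERIVED_TEXT", "text_content": extracted_text})
    for kind, text in representations.items():
        plan.doc_text_representation.append({"type": kind, "text": text})
        # Assumption: one pending embedding per representation, only if enabled.
        if embeddings_enabled:
            plan.doc_embedding.append(
                {"representation_type": kind, "status": "PENDING"})
    return plan
```

Separating planning from writing is a design choice of this sketch: it lets the "when enabled" switch be checked in one place before any rows are created.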
## Access model

The generic pipeline uses the Phase 0/1 access model:

- optional owner tenant
- mandatory visibility

This supports both:

- public documents (`owner_tenant_id = null`, `visibility = PUBLIC`)
- tenant-owned documents (`owner_tenant_id != null`, `visibility = TENANT/SHARED/...`)

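
The two combinations above can be checked with a small Python sketch. The `Visibility` enum lists only the values named in this document, and the function name `access_kind` is hypothetical. Rejecting an unowned document that is not `PUBLIC` is an assumption, as is accepting any visibility for a document that has an owner.

```python
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PUBLIC = "PUBLIC"
    TENANT = "TENANT"
    SHARED = "SHARED"
    # Further values are only hinted at as "..." and are not invented here.


def access_kind(owner_tenant_id: Optional[str],
                visibility: Optional[Visibility]) -> str:
    """Classify a document as 'public' or 'tenant-owned', or reject it."""
    if visibility is None:
        raise ValueError("visibility is mandatory")
    if owner_tenant_id is None:
        if visibility is not Visibility.PUBLIC:
            # Assumption: a document without an owner can only be public.
            raise ValueError("documents without an owner tenant must be PUBLIC")
        return "public"
    # Assumption: any visibility is accepted once an owner tenant is set.
    return "tenant-owned"
```

For example, `access_kind(None, Visibility.PUBLIC)` yields `"public"`, while `access_kind("tenant-a", Visibility.SHARED)` yields `"tenant-owned"`.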
## Deliberately deferred

- DOCX extraction
- ZIP recursive child import in the generic pipeline
- MIME/EML structured parsing
- generic structured projections beyond TED
- chunked long-document representations