Phase 4 - Generic Ingestion Pipeline
Goal
Add the first generic ingestion path so arbitrary documents can be imported into the canonical DOC model, normalized into text representations, and queued for vectorization without depending on the TED-specific model.
Scope implemented
Input channels
- file-system polling route for arbitrary documents
- REST/API upload endpoints
Detection
- file extension + media type based classification
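
The detection step above can be sketched as a lookup that tries the media type first and falls back to the file extension. This is only an illustration: `DocumentKind`, `DocumentClassifier`, the precedence order, and the concrete media type and extension tables are assumptions, not names or rules taken from the codebase.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical document kinds; the names are illustrative, not taken from the codebase. */
enum DocumentKind { PDF, HTML, TEXT, MARKDOWN, XML, UNSUPPORTED_BINARY }

final class DocumentClassifier {
    // Assumed media type table; the real pipeline may recognise more types.
    private static final Map<String, DocumentKind> BY_MEDIA_TYPE = Map.of(
        "application/pdf", DocumentKind.PDF,
        "text/html", DocumentKind.HTML,
        "text/plain", DocumentKind.TEXT,
        "text/markdown", DocumentKind.MARKDOWN,
        "application/xml", DocumentKind.XML,
        "text/xml", DocumentKind.XML);

    // Assumed extension table, used when the media type is missing or too generic.
    private static final Map<String, DocumentKind> BY_EXTENSION = Map.of(
        "pdf", DocumentKind.PDF,
        "html", DocumentKind.HTML,
        "htm", DocumentKind.HTML,
        "txt", DocumentKind.TEXT,
        "md", DocumentKind.MARKDOWN,
        "xml", DocumentKind.XML);

    private DocumentClassifier() {}

    static DocumentKind classify(String fileName, String mediaType) {
        if (mediaType != null) {
            // Drop parameters such as "; charset=utf-8" before the lookup.
            String base = mediaType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            DocumentKind kind = BY_MEDIA_TYPE.get(base);
            if (kind != null) {
                return kind;
            }
        }
        if (fileName != null) {
            int dot = fileName.lastIndexOf('.');
            if (dot >= 0 && dot < fileName.length() - 1) {
                String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
                DocumentKind kind = BY_EXTENSION.get(ext);
                if (kind != null) {
                    return kind;
                }
            }
        }
        // Anything unrecognised is treated as an unsupported binary.
        return DocumentKind.UNSUPPORTED_BINARY;
    }
}
```

For example, `DocumentClassifier.classify("notes.md", "application/octet-stream")` falls through the generic media type and resolves to `MARKDOWN` by extension.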
Extraction
- PDF -> text via PDFBox
- HTML -> cleaned text via JSoup
- text / markdown / generic XML -> normalized UTF-8 text
- unsupported binary types -> fallback warning only
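
For the text, markdown, and XML branch, "normalized UTF-8 text" could look roughly like the sketch below. What normalization means here is an assumption (BOM removal, unified line endings, Unicode NFC); the source does not spell it out, and `TextNormalizer` is a hypothetical name. The PDFBox and JSoup branches are left out so the example needs only the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class TextNormalizer {
    private TextNormalizer() {}

    /** Decodes raw bytes with the declared charset (UTF-8 when unknown) and cleans the result. */
    static String normalize(byte[] raw, Charset declared) {
        Charset charset = declared != null ? declared : StandardCharsets.UTF_8;
        String text = new String(raw, charset);
        // Drop a leading byte order mark if the decoder left one in place.
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        // Unify Windows and old Mac line endings to "\n".
        text = text.replace("\r\n", "\n").replace('\r', '\n');
        // Assumption: compose characters into NFC so equal text compares equal.
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    /** Encodes the normalized text as UTF-8 bytes for storage. */
    static byte[] toUtf8(String normalized) {
        return normalized.getBytes(StandardCharsets.UTF_8);
    }
}
```

A caller would pass the charset reported by the upload or file metadata when one is available, and `null` otherwise.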
Representation building
- default generic builder creates:
  - FULLTEXT
  - SEMANTIC_TEXT
  - TITLE_ABSTRACT
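
As a rough sketch of what a default builder might produce, the example below derives the three representations from one normalized text. Only the three type names come from this document. How each representation is derived is an assumption made for illustration (whitespace collapsing for SEMANTIC_TEXT, title plus a truncated lead for TITLE_ABSTRACT), as are the class name and the 500-character cap.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Representation types named in the section above. */
enum RepresentationType { FULLTEXT, SEMANTIC_TEXT, TITLE_ABSTRACT }

final class GenericRepresentationBuilder {
    // Assumed cap for the abstract portion; the real limit is not stated in the source.
    private static final int ABSTRACT_MAX_CHARS = 500;

    private GenericRepresentationBuilder() {}

    static Map<RepresentationType, String> build(String title, String normalizedText) {
        Map<RepresentationType, String> out = new LinkedHashMap<>();
        out.put(RepresentationType.FULLTEXT, normalizedText);

        // Assumption: SEMANTIC_TEXT is the full text with whitespace runs collapsed.
        String semantic = normalizedText.replaceAll("\\s+", " ").trim();
        out.put(RepresentationType.SEMANTIC_TEXT, semantic);

        // Assumption: TITLE_ABSTRACT is the title plus a truncated lead of the body.
        int end = Math.min(semantic.length(), ABSTRACT_MAX_CHARS);
        // Avoid cutting a surrogate pair in half at the truncation point.
        if (end > 0 && end < semantic.length()
                && Character.isHighSurrogate(semantic.charAt(end - 1))) {
            end--;
        }
        String lead = semantic.substring(0, end);
        String heading = (title == null || title.isBlank()) ? "" : title.trim() + "\n";
        out.put(RepresentationType.TITLE_ABSTRACT, heading + lead);
        return out;
    }
}
```

Returning an ordered map keeps the representations in a stable order, which makes the persisted rows easier to compare across runs.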
Persistence
- original content stored in DOC.doc_content
- binary originals can now be stored inline in binary_content
- derived text variants persisted as additional DOC.doc_content rows
- text representations persisted in DOC.doc_text_representation
- pending embeddings created in DOC.doc_embedding when enabled
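
The last persistence step can be pictured as planning one pending row per stored text representation, and none at all when vectorization is switched off. The sketch below works in memory only. `PendingEmbedding`, `EmbeddingPlanner`, the `modelKey` field, and the literal `"PENDING"` status value are hypothetical and do not reflect the actual DOC.doc_embedding columns.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for a row in DOC.doc_embedding. */
final class PendingEmbedding {
    final long representationId;
    final String modelKey;           // assumed field: which embedding model to use
    final String status = "PENDING"; // assumed status literal

    PendingEmbedding(long representationId, String modelKey) {
        this.representationId = representationId;
        this.modelKey = modelKey;
    }
}

final class EmbeddingPlanner {
    private EmbeddingPlanner() {}

    /** Returns one pending row per representation, or nothing when vectorization is disabled. */
    static List<PendingEmbedding> plan(List<Long> representationIds,
                                       boolean vectorizationEnabled,
                                       String modelKey) {
        List<PendingEmbedding> rows = new ArrayList<>();
        if (!vectorizationEnabled) {
            return rows;
        }
        for (long id : representationIds) {
            rows.add(new PendingEmbedding(id, modelKey));
        }
        return rows;
    }
}
```

Keeping the enabled check in one place means the import path can run unchanged in environments without a vectorization backend.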
Access model
The generic pipeline uses the Phase 0/1 access model:
- optional owner tenant
- mandatory visibility
This supports both:
- public documents (owner_tenant_id = null, visibility = PUBLIC)
- tenant-owned documents (owner_tenant_id != null, visibility = TENANT/SHARED/...)
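
The two cases above can be captured by a small value type with an optional owner and a mandatory visibility. This is a sketch under assumptions: the class and factory names are hypothetical, the tenant id is assumed to be a UUID, and the enum lists only the visibility values named in this document, although the source indicates there are more.

```java
import java.util.Objects;
import java.util.UUID;

/** Only the values named in the section are listed; the source indicates there are more. */
enum Visibility { PUBLIC, TENANT, SHARED }

final class DocumentAccess {
    private final UUID ownerTenantId;    // optional: null means no owning tenant
    private final Visibility visibility; // mandatory

    DocumentAccess(UUID ownerTenantId, Visibility visibility) {
        this.ownerTenantId = ownerTenantId;
        this.visibility = Objects.requireNonNull(visibility, "visibility is mandatory");
    }

    /** Public document: no owner tenant, PUBLIC visibility. */
    static DocumentAccess publicDocument() {
        return new DocumentAccess(null, Visibility.PUBLIC);
    }

    /** Tenant-owned document: owner tenant required, visibility chosen by the caller. */
    static DocumentAccess tenantOwned(UUID tenantId, Visibility visibility) {
        Objects.requireNonNull(tenantId, "tenantId is required for tenant-owned documents");
        return new DocumentAccess(tenantId, visibility);
    }

    boolean isTenantOwned() {
        return ownerTenantId != null;
    }

    UUID ownerTenantId() {
        return ownerTenantId;
    }

    Visibility visibility() {
        return visibility;
    }
}
```

An import request without tenant context would use `DocumentAccess.publicDocument()`, while a tenant upload would call `DocumentAccess.tenantOwned(tenantId, Visibility.TENANT)`.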
Deliberately deferred
- DOCX extraction
- ZIP recursive child import in the generic pipeline
- MIME/EML structured parsing
- generic structured projections beyond TED
- chunked long-document representations