Phase 4 - Generic Ingestion Pipeline

Goal

Add the first generic ingestion path so arbitrary documents can be imported into the canonical DOC model, normalized into text representations, and queued for vectorization without depending on the TED-specific model.

Scope implemented

Input channels

file-system polling route for arbitrary documents
REST/API upload endpoints

Detection

classification based on file extension and media type
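
As a rough illustration of this kind of classification, here is a minimal sketch. `DocumentKind`, `DocumentTypeDetector`, and the lookup tables are hypothetical names, not taken from the codebase. The sketch also assumes that a recognized media type wins over the file extension; the actual precedence may differ.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical categories, chosen to mirror the extraction targets listed in this document. */
enum DocumentKind { PDF, HTML, TEXT, MARKDOWN, XML, UNSUPPORTED_BINARY }

final class DocumentTypeDetector {
    // Assumed lookup tables; the real pipeline may recognize more types.
    private static final Map<String, DocumentKind> BY_MEDIA_TYPE = Map.of(
            "application/pdf", DocumentKind.PDF,
            "text/html", DocumentKind.HTML,
            "text/plain", DocumentKind.TEXT,
            "text/markdown", DocumentKind.MARKDOWN,
            "application/xml", DocumentKind.XML,
            "text/xml", DocumentKind.XML);

    private static final Map<String, DocumentKind> BY_EXTENSION = Map.of(
            "pdf", DocumentKind.PDF,
            "html", DocumentKind.HTML,
            "htm", DocumentKind.HTML,
            "txt", DocumentKind.TEXT,
            "md", DocumentKind.MARKDOWN,
            "xml", DocumentKind.XML);

    private DocumentTypeDetector() {}

    /** Tries the media type first (assumption), then falls back to the file extension. */
    static DocumentKind detect(String fileName, String mediaType) {
        if (mediaType != null) {
            // Drop parameters such as "; charset=utf-8" before the lookup.
            String base = mediaType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            DocumentKind kind = BY_MEDIA_TYPE.get(base);
            if (kind != null) {
                return kind;
            }
        }
        if (fileName != null) {
            int dot = fileName.lastIndexOf('.');
            if (dot >= 0 && dot < fileName.length() - 1) {
                String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
                DocumentKind kind = BY_EXTENSION.get(ext);
                if (kind != null) {
                    return kind;
                }
            }
        }
        // Nothing matched: treat as an unsupported binary type.
        return DocumentKind.UNSUPPORTED_BINARY;
    }
}
```

Falling back to the extension helps when an uploader sends a generic media type such as `application/octet-stream` for a file that is clearly named `notes.md`.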

Extraction

PDF -> text via PDFBox
HTML -> cleaned text via JSoup
text / markdown / generic XML -> normalized UTF-8 text
unsupported binary types -> no text extraction, fallback warning only
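
The exact normalization rules are not spelled out here. The following sketch shows one plausible reading of "normalized UTF-8 text" for the text / markdown / generic XML case, using only the JDK. `TextNormalizer` is a hypothetical name, and every step (BOM removal, line-ending unification, NFC, whitespace trimming) is an assumption rather than the pipeline's confirmed behaviour.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class TextNormalizer {
    private TextNormalizer() {}

    /** Decodes raw bytes as UTF-8 and applies a few assumed clean-up steps. */
    static String normalize(byte[] raw) {
        // Malformed byte sequences decode to U+FFFD instead of throwing.
        String text = new String(raw, StandardCharsets.UTF_8);

        // Remove a leading byte order mark if present.
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);
        }

        // Unify Windows and old Mac line endings to "\n".
        text = text.replace("\r\n", "\n").replace('\r', '\n');

        // Compose characters so that visually identical strings compare equal.
        text = Normalizer.normalize(text, Normalizer.Form.NFC);

        // Drop trailing spaces and tabs per line, and cap runs of blank lines.
        text = text.replaceAll("[ \\t]+(?=\\n)", "")
                   .replaceAll("\\n{3,}", "\n\n");

        return text.strip();
    }
}
```

For PDF and HTML input, the text returned by PDFBox or JSoup could be passed through the same kind of step so that all stored text variants share one convention.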

Representation building

default generic builder creates:
- FULLTEXT
- SEMANTIC_TEXT
- TITLE_ABSTRACT
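
A minimal sketch of such a builder follows. Only the three representation names come from this document. What each one contains is assumed for illustration: FULLTEXT is the trimmed text, SEMANTIC_TEXT is the same text with whitespace collapsed, and TITLE_ABSTRACT is the first line plus a capped excerpt of the rest. `GenericRepresentationBuilder` and `ABSTRACT_MAX_CHARS` are hypothetical names.

```java
import java.util.EnumMap;
import java.util.Map;

enum RepresentationType { FULLTEXT, SEMANTIC_TEXT, TITLE_ABSTRACT }

final class GenericRepresentationBuilder {
    // Assumed cap on the abstract part; not a value taken from the project.
    private static final int ABSTRACT_MAX_CHARS = 500;

    private GenericRepresentationBuilder() {}

    static Map<RepresentationType, String> build(String normalizedText) {
        Map<RepresentationType, String> out = new EnumMap<>(RepresentationType.class);
        String text = normalizedText == null ? "" : normalizedText.trim();

        // FULLTEXT: the extracted text as is (assumption).
        out.put(RepresentationType.FULLTEXT, text);

        // SEMANTIC_TEXT: single-spaced text, easier to feed to an embedding model (assumption).
        out.put(RepresentationType.SEMANTIC_TEXT, text.replaceAll("\\s+", " "));

        // TITLE_ABSTRACT: first line as title, then a capped excerpt of the remainder (assumption).
        int newline = text.indexOf('\n');
        String title = newline < 0 ? text : text.substring(0, newline).trim();
        String rest = newline < 0
                ? ""
                : text.substring(newline + 1).replaceAll("\\s+", " ").trim();
        String excerpt = rest.length() > ABSTRACT_MAX_CHARS
                ? rest.substring(0, ABSTRACT_MAX_CHARS)
                : rest;
        out.put(RepresentationType.TITLE_ABSTRACT,
                excerpt.isEmpty() ? title : title + "\n" + excerpt);

        return out;
    }
}
```

Returning an `EnumMap` keeps the three variants in a fixed order, which makes it straightforward to persist one row per representation.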

Persistence

original content stored in DOC.doc_content
binary originals can now be stored inline in binary_content
derived text variants persisted as additional DOC.doc_content rows
text representations persisted in DOC.doc_text_representation
pending embeddings created in DOC.doc_embedding when enabled
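
The "when enabled" condition on the last item can be pictured with a small in-memory sketch. The column layout of DOC.doc_embedding is not described here, so `PendingEmbedding`, its fields, the `PENDING` status string, and the one-embedding-per-representation rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a row that would later be written to DOC.doc_embedding. */
final class PendingEmbedding {
    final String representationType;
    final String status;

    PendingEmbedding(String representationType) {
        this.representationType = representationType;
        this.status = "PENDING"; // assumed status value
    }
}

final class EmbeddingQueuePlanner {
    private EmbeddingQueuePlanner() {}

    /**
     * Plans one pending embedding per text representation (assumption),
     * or none at all when vectorization is switched off.
     */
    static List<PendingEmbedding> plan(List<String> representationTypes, boolean embeddingsEnabled) {
        List<PendingEmbedding> pending = new ArrayList<>();
        if (!embeddingsEnabled) {
            return pending;
        }
        for (String type : representationTypes) {
            pending.add(new PendingEmbedding(type));
        }
        return pending;
    }
}
```

Keeping the decision in one place means the import itself behaves the same whether or not vectorization is enabled; only the queue stays empty.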

Access model

The generic pipeline uses the Phase 0/1 access model:

optional owner tenant
mandatory visibility

This supports both:

public documents (owner_tenant_id = null, visibility = PUBLIC)
tenant-owned documents (owner_tenant_id != null, visibility = TENANT/SHARED/...)
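
The two rules (optional owner, mandatory visibility) can be sketched as a small value type. `DocumentAccess` and its factory methods are hypothetical names. The `Visibility` enum lists only the values named in this document, and the real set may be larger. The sketch does not check which combinations of owner and visibility are legal, since that is not stated here.

```java
import java.util.Objects;
import java.util.Optional;

/** Only the values mentioned in this document; further values may exist. */
enum Visibility { PUBLIC, TENANT, SHARED }

final class DocumentAccess {
    private final String ownerTenantId; // null means "no owning tenant"
    private final Visibility visibility;

    DocumentAccess(String ownerTenantId, Visibility visibility) {
        // Visibility is mandatory, the owner tenant is optional.
        this.visibility = Objects.requireNonNull(visibility, "visibility is mandatory");
        this.ownerTenantId = ownerTenantId;
    }

    /** Public document: owner_tenant_id = null, visibility = PUBLIC. */
    static DocumentAccess publicDocument() {
        return new DocumentAccess(null, Visibility.PUBLIC);
    }

    /** Tenant-owned document: owner_tenant_id != null. */
    static DocumentAccess tenantOwned(String tenantId, Visibility visibility) {
        Objects.requireNonNull(tenantId, "tenantId is required for tenant-owned documents");
        return new DocumentAccess(tenantId, visibility);
    }

    Optional<String> ownerTenantId() {
        return Optional.ofNullable(ownerTenantId);
    }

    Visibility visibility() {
        return visibility;
    }
}
```

For example, `DocumentAccess.publicDocument()` yields an ownerless PUBLIC document, while `DocumentAccess.tenantOwned("tenant-a", Visibility.TENANT)` models a tenant-owned one (`"tenant-a"` being a made-up identifier).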

Deliberately deferred

DOCX extraction
ZIP recursive child import in the generic pipeline
MIME/EML structured parsing
generic structured projections beyond TED
chunked long-document representations
