Phase 4 - Generic Ingestion Pipeline

Goal

Add the first generic ingestion path so arbitrary documents can be imported into the canonical DOC model, normalized into text representations, and queued for vectorization without depending on the TED-specific model.

Scope implemented

Input channels

  • file-system polling route for arbitrary documents
  • REST/API upload endpoints

Detection

  • classification based on file extension and media type
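
As an illustration of the detection step, the following is a minimal sketch of a classifier that combines both signals. The names `DocumentKind` and `DocumentTypeDetector`, the mapping tables, and the rule that a recognised media type takes precedence over the file extension are all assumptions made for this example, not details taken from the implementation.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical document kinds; the names are illustrative only. */
enum DocumentKind { PDF, HTML, TEXT, MARKDOWN, XML, UNSUPPORTED_BINARY }

/** Sketch of a classifier based on file extension and media type. */
final class DocumentTypeDetector {

    // Assumed mapping tables; the real pipeline may recognise more or different types.
    private static final Map<String, DocumentKind> BY_MEDIA_TYPE = Map.of(
            "application/pdf", DocumentKind.PDF,
            "text/html", DocumentKind.HTML,
            "text/plain", DocumentKind.TEXT,
            "text/markdown", DocumentKind.MARKDOWN,
            "application/xml", DocumentKind.XML,
            "text/xml", DocumentKind.XML);

    private static final Map<String, DocumentKind> BY_EXTENSION = Map.of(
            "pdf", DocumentKind.PDF,
            "html", DocumentKind.HTML,
            "htm", DocumentKind.HTML,
            "txt", DocumentKind.TEXT,
            "md", DocumentKind.MARKDOWN,
            "xml", DocumentKind.XML);

    private DocumentTypeDetector() {
    }

    /** Either argument may be null; unknown inputs fall back to UNSUPPORTED_BINARY. */
    static DocumentKind detect(String fileName, String mediaType) {
        // Assumption: a recognised media type wins over the file extension.
        if (mediaType != null) {
            // Drop parameters such as "; charset=utf-8" before the lookup.
            String base = mediaType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            DocumentKind kind = BY_MEDIA_TYPE.get(base);
            if (kind != null) {
                return kind;
            }
        }
        if (fileName != null) {
            int dot = fileName.lastIndexOf('.');
            if (dot >= 0 && dot < fileName.length() - 1) {
                String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
                DocumentKind kind = BY_EXTENSION.get(ext);
                if (kind != null) {
                    return kind;
                }
            }
        }
        return DocumentKind.UNSUPPORTED_BINARY;
    }
}
```

With this ordering, an upload sent as `application/octet-stream` but named `report.pdf` is still classified as PDF, because the unrecognised media type falls through to the extension lookup.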

Extraction

  • PDF -> text via PDFBox
  • HTML -> cleaned text via JSoup
  • text / markdown / generic XML -> normalized UTF-8 text
  • unsupported binary types -> no text extraction, fallback warning only
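
For the plain-text branch, a sketch of what "normalized UTF-8 text" could mean is given below. The document does not define the normalization rules, so the steps shown here (BOM removal, LF line endings, Unicode NFC, trimming) and the class name `TextNormalizer` are assumptions. The PDF and HTML branches are left out because they depend on PDFBox and JSoup.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

/** Hypothetical helper; the normalization steps are assumed, not taken from the pipeline. */
final class TextNormalizer {

    private TextNormalizer() {
    }

    static String normalize(byte[] raw) {
        // Decode as UTF-8; malformed byte sequences become U+FFFD replacement characters.
        String text = new String(raw, StandardCharsets.UTF_8);

        // Drop a leading byte order mark if one is present.
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);
        }

        // Unify Windows (CRLF) and old Mac (CR) line endings to LF.
        text = text.replace("\r\n", "\n").replace('\r', '\n');

        // Compose characters so that equivalent sequences compare equal (NFC).
        text = Normalizer.normalize(text, Normalizer.Form.NFC);

        // Remove leading and trailing whitespace.
        return text.strip();
    }
}
```

A caller would pass the raw bytes of a text, markdown, or XML file and persist the returned string as the derived text variant.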

Representation building

  • default generic builder creates:
    • FULLTEXT
    • SEMANTIC_TEXT
    • TITLE_ABSTRACT
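
The sketch below shows one way a default builder could produce the three representations from a title and extracted text. Only the representation names come from this document. How SEMANTIC_TEXT and TITLE_ABSTRACT are actually derived is not specified here, so the whitespace collapsing, the 500-character abstract limit, and the names `RepresentationType` and `GenericRepresentationBuilder` are assumptions.

```java
import java.util.EnumMap;
import java.util.Map;

/** The three representation names listed for the default generic builder. */
enum RepresentationType { FULLTEXT, SEMANTIC_TEXT, TITLE_ABSTRACT }

/** Hypothetical builder; the derivation rules are assumed for illustration. */
final class GenericRepresentationBuilder {

    // Assumed limit for the abstract part of TITLE_ABSTRACT.
    private static final int ABSTRACT_MAX_CHARS = 500;

    private GenericRepresentationBuilder() {
    }

    static Map<RepresentationType, String> build(String title, String text) {
        Map<RepresentationType, String> out = new EnumMap<>(RepresentationType.class);

        // FULLTEXT: the extracted text, unchanged.
        out.put(RepresentationType.FULLTEXT, text);

        // SEMANTIC_TEXT (assumed): whitespace collapsed to single spaces.
        String collapsed = text.replaceAll("\\s+", " ").strip();
        out.put(RepresentationType.SEMANTIC_TEXT, collapsed);

        // TITLE_ABSTRACT (assumed): title plus the first characters of the collapsed text.
        // Note: a plain substring may cut through a surrogate pair at the limit.
        String abstractPart = collapsed.length() <= ABSTRACT_MAX_CHARS
                ? collapsed
                : collapsed.substring(0, ABSTRACT_MAX_CHARS);
        boolean hasTitle = title != null && !title.isBlank();
        out.put(RepresentationType.TITLE_ABSTRACT,
                hasTitle ? title.strip() + "\n\n" + abstractPart : abstractPart);

        return out;
    }
}
```

Returning a map keyed by representation type keeps the builder independent of storage; each entry could then be written as one row of text representation.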

Persistence

  • original content stored in DOC.doc_content
  • binary originals can now be stored inline in binary_content
  • derived text variants persisted as additional DOC.doc_content rows
  • text representations persisted in DOC.doc_text_representation
  • pending embeddings created in DOC.doc_embedding when enabled

Access model

The generic pipeline uses the Phase 0/1 access model:

  • optional owner tenant
  • mandatory visibility

This supports both:

  • public documents (owner_tenant_id = null, visibility = PUBLIC)
  • tenant-owned documents (owner_tenant_id != null, visibility = TENANT/SHARED/...)
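
The two cases above can be expressed as a small value type with an optional owner and a mandatory visibility. This is a sketch only: the names `DocumentAccess` and `Visibility` are assumptions, and the enum lists just the three visibility values named in this document, although the full set is larger.

```java
import java.util.Objects;

/** Only the values named in this document; the real enum has more. */
enum Visibility { PUBLIC, TENANT, SHARED }

/** Hypothetical value type: optional owner tenant, mandatory visibility. */
final class DocumentAccess {

    private final String ownerTenantId; // null for documents without an owner tenant
    private final Visibility visibility;

    DocumentAccess(String ownerTenantId, Visibility visibility) {
        // Visibility is mandatory, so a missing value is rejected immediately.
        this.visibility = Objects.requireNonNull(visibility, "visibility is mandatory");
        this.ownerTenantId = ownerTenantId;
    }

    /** Public document: owner_tenant_id = null, visibility = PUBLIC. */
    static DocumentAccess publicDocument() {
        return new DocumentAccess(null, Visibility.PUBLIC);
    }

    /** Tenant-owned document: owner_tenant_id != null, caller chooses the visibility. */
    static DocumentAccess tenantOwned(String tenantId, Visibility visibility) {
        Objects.requireNonNull(tenantId, "tenantId is required for tenant-owned documents");
        return new DocumentAccess(tenantId, visibility);
    }

    String ownerTenantId() {
        return ownerTenantId;
    }

    Visibility visibility() {
        return visibility;
    }
}
```

The sketch deliberately does not reject other combinations, such as an owner tenant together with PUBLIC visibility, because this document does not state whether they are allowed.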

Deliberately deferred

  • DOCX extraction
  • ZIP recursive child import in the generic pipeline
  • MIME/EML structured parsing
  • generic structured projections beyond TED
  • chunked long-document representations