
# Phase 4 - Generic Ingestion Pipeline

## Goal

Add the first generic ingestion path so arbitrary documents can be imported into the canonical DOC model,
normalized into text representations, and queued for vectorization without depending on the TED-specific model.

## Scope implemented

### Input channels

- file-system polling route for arbitrary documents
- REST/API upload endpoints

### Detection

- classification based on file extension and media type
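
As a rough illustration of classification by extension and media type, the sketch below maps both inputs to a document kind. The names `DocumentKind` and `DocumentTypeDetector`, the lookup tables, and the rule that a known media type takes precedence over the extension are all assumptions made for this example, not the pipeline's actual implementation.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical document families; names are illustrative, not taken from the codebase. */
enum DocumentKind { PDF, HTML, TEXT, MARKDOWN, XML, UNSUPPORTED_BINARY }

final class DocumentTypeDetector {

    // Assumed extension table; the real pipeline may recognize other extensions.
    private static final Map<String, DocumentKind> BY_EXTENSION = Map.of(
            "pdf", DocumentKind.PDF,
            "html", DocumentKind.HTML,
            "htm", DocumentKind.HTML,
            "txt", DocumentKind.TEXT,
            "md", DocumentKind.MARKDOWN,
            "xml", DocumentKind.XML);

    // Assumed media type table.
    private static final Map<String, DocumentKind> BY_MEDIA_TYPE = Map.of(
            "application/pdf", DocumentKind.PDF,
            "text/html", DocumentKind.HTML,
            "text/plain", DocumentKind.TEXT,
            "text/markdown", DocumentKind.MARKDOWN,
            "application/xml", DocumentKind.XML,
            "text/xml", DocumentKind.XML);

    private DocumentTypeDetector() {}

    /** Assumed precedence: a known media type wins, otherwise the file extension decides. */
    static DocumentKind detect(String fileName, String mediaType) {
        if (mediaType != null) {
            // Drop parameters such as "; charset=utf-8" before the lookup.
            String base = mediaType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            DocumentKind kind = BY_MEDIA_TYPE.get(base);
            if (kind != null) {
                return kind;
            }
        }
        if (fileName != null) {
            int dot = fileName.lastIndexOf('.');
            if (dot >= 0 && dot < fileName.length() - 1) {
                String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
                DocumentKind kind = BY_EXTENSION.get(ext);
                if (kind != null) {
                    return kind;
                }
            }
        }
        // Nothing matched: treat the input as an unsupported binary.
        return DocumentKind.UNSUPPORTED_BINARY;
    }
}
```

With these tables, `DocumentTypeDetector.detect("report.PDF", null)` resolves through the extension, while an unknown media type combined with an unknown extension ends up as `UNSUPPORTED_BINARY`.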

### Extraction

- PDF -> text via PDFBox
- HTML -> cleaned text via jsoup
- text / markdown / generic XML -> normalized UTF-8 text
- unsupported binary types -> no text extraction, fallback warning only
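
The text normalization step for plain text, markdown, and generic XML could look roughly like the following sketch. It uses only the JDK. The class name `TextNormalizer` and the specific cleanup rules (BOM removal, line-ending unification, trailing-whitespace trimming, collapsing runs of blank lines) are assumptions for illustration; the document only states that the result is normalized UTF-8 text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TextNormalizer {

    private TextNormalizer() {}

    /** Decodes raw bytes with the given charset and applies the assumed cleanup rules. */
    static String normalize(byte[] raw, Charset sourceCharset) {
        String text = new String(raw, sourceCharset);
        // Remove a leading byte order mark if the decoder kept it.
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        // Unify Windows and old Mac line endings to "\n".
        text = text.replace("\r\n", "\n").replace('\r', '\n');
        // Trim trailing spaces/tabs per line, then collapse 3+ newlines to one blank line.
        text = text.replaceAll("[ \\t]+\\n", "\n").replaceAll("\\n{3,}", "\n\n");
        return text.strip();
    }

    /** Encodes the normalized text as UTF-8 bytes, for example for storage. */
    static byte[] toUtf8(String normalized) {
        return normalized.getBytes(StandardCharsets.UTF_8);
    }
}
```

The source charset is passed in explicitly here; how the real pipeline determines it (declared encoding, sniffing, or a fixed default) is not covered by this sketch.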

### Representation building

- default generic builder creates:
  - FULLTEXT
  - SEMANTIC_TEXT
  - TITLE_ABSTRACT
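
A minimal sketch of a builder producing the three representations is shown below. The derivation rules are assumptions, since the document does not define them: here `FULLTEXT` is the normalized text unchanged, `SEMANTIC_TEXT` is a whitespace-collapsed variant, and `TITLE_ABSTRACT` is the title followed by a capped leading excerpt. `GenericRepresentationBuilder`, `RepresentationType`, and the 200-character cap are hypothetical.

```java
import java.util.EnumMap;
import java.util.Map;

enum RepresentationType { FULLTEXT, SEMANTIC_TEXT, TITLE_ABSTRACT }

final class GenericRepresentationBuilder {

    // Assumed cap for the abstract excerpt; not a value taken from the pipeline.
    private static final int ABSTRACT_MAX_CHARS = 200;

    private GenericRepresentationBuilder() {}

    static Map<RepresentationType, String> build(String title, String normalizedText) {
        Map<RepresentationType, String> out = new EnumMap<>(RepresentationType.class);

        // FULLTEXT: the normalized text as-is (assumption).
        out.put(RepresentationType.FULLTEXT, normalizedText);

        // SEMANTIC_TEXT: all whitespace runs collapsed to single spaces (assumption).
        String collapsed = normalizedText.replaceAll("\\s+", " ").strip();
        out.put(RepresentationType.SEMANTIC_TEXT, collapsed);

        // TITLE_ABSTRACT: title plus a leading excerpt (assumption).
        // Note: a plain substring may split a surrogate pair at the cut point.
        String excerpt = collapsed.length() <= ABSTRACT_MAX_CHARS
                ? collapsed
                : collapsed.substring(0, ABSTRACT_MAX_CHARS);
        String safeTitle = title == null ? "" : title.strip();
        out.put(RepresentationType.TITLE_ABSTRACT,
                safeTitle.isEmpty() ? excerpt : safeTitle + "\n" + excerpt);
        return out;
    }
}
```

Returning an `EnumMap` keeps the representation types in declaration order, which makes it straightforward to persist one row per entry.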

### Persistence

- original content stored in `DOC.doc_content`
- binary originals can now be stored inline in `binary_content`
- derived text variants persisted as additional `DOC.doc_content` rows
- text representations persisted in `DOC.doc_text_representation`
- pending embeddings created in `DOC.doc_embedding` when enabled

## Access model

The generic pipeline uses the Phase 0/1 access model:

- optional owner tenant
- mandatory visibility

This supports both:

- public documents (`owner_tenant_id = null`, `visibility = PUBLIC`)
- tenant-owned documents (`owner_tenant_id != null`, `visibility = TENANT/SHARED/...`)
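
The two combinations above can be captured in a small value type, sketched here under stated assumptions. `DocumentAccess` and the `Visibility` enum are hypothetical names; the enum lists only the values named in this document, and the real model may define more. The rule that an ownerless document must be `PUBLIC` is inferred from the two cases listed, not confirmed by the source.

```java
import java.util.Optional;

// Only the values named in this document; the real enum may contain others.
enum Visibility { PUBLIC, TENANT, SHARED }

final class DocumentAccess {

    private final String ownerTenantId; // null for documents without an owner tenant
    private final Visibility visibility;

    DocumentAccess(String ownerTenantId, Visibility visibility) {
        // Visibility is mandatory in the access model.
        if (visibility == null) {
            throw new IllegalArgumentException("visibility is mandatory");
        }
        // Assumed rule: without an owner tenant, only PUBLIC makes sense.
        if (ownerTenantId == null && visibility != Visibility.PUBLIC) {
            throw new IllegalArgumentException(
                    "visibility " + visibility + " requires an owner tenant");
        }
        this.ownerTenantId = ownerTenantId;
        this.visibility = visibility;
    }

    /** The owner tenant is optional. */
    Optional<String> ownerTenantId() {
        return Optional.ofNullable(ownerTenantId);
    }

    Visibility visibility() {
        return visibility;
    }

    /** True for the "public document" case: no owner tenant and PUBLIC visibility. */
    boolean isPublicDocument() {
        return ownerTenantId == null && visibility == Visibility.PUBLIC;
    }
}
```

Validating in the constructor means an invalid combination is rejected before anything is persisted; whether a tenant-owned document may also be `PUBLIC` is left open here and allowed by the sketch.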

## Deliberately deferred

- DOCX extraction
- recursive import of ZIP child entries in the generic pipeline
- MIME/EML structured parsing
- generic structured projections beyond TED
- chunked long-document representations