# Phase 4 - Generic Ingestion Pipeline

Phase 4 introduces the first generalized ingestion flow on top of the DOC backbone.

## What is included

- generic ingestion gateway with adapter selection
- file-system ingestion adapter and Camel route
- REST/API upload controller for arbitrary documents
- document type detection by media type / extension
- first extractors for:
  - plain text / markdown / generic XML
  - HTML
  - PDF
  - binary fallback
- default representation builder for non-TED documents
- binary payload support in `DOC.doc_content.binary_content`
- automatic creation of pending generic embeddings for imported representations
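
The extractor list above, with its binary fallback, could be wired together roughly as follows. This is only a sketch: `ContentExtractor`, `ExtractorRegistry`, and `PlainTextExtractor` are hypothetical names chosen for illustration, and the real interfaces in the project may look different.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

// Hypothetical extractor contract (assumption, not the project's actual interface).
interface ContentExtractor {
    boolean supports(String mediaType);
    String extract(byte[] payload);
}

// Tries each registered extractor in order; the first one that supports the media type wins.
final class ExtractorRegistry {
    private final List<ContentExtractor> extractors;

    ExtractorRegistry(List<ContentExtractor> extractors) {
        this.extractors = List.copyOf(extractors);
    }

    // Returns the extracted text, or Optional.empty() when no extractor matches.
    // An empty result stands for the binary fallback: the caller keeps only the raw bytes.
    Optional<String> extractText(String mediaType, byte[] payload) {
        for (ContentExtractor extractor : extractors) {
            if (extractor.supports(mediaType)) {
                return Optional.of(extractor.extract(payload));
            }
        }
        return Optional.empty();
    }
}

// Example extractor for plain text and markdown (assumes UTF-8 payloads).
final class PlainTextExtractor implements ContentExtractor {
    @Override
    public boolean supports(String mediaType) {
        return mediaType != null
                && (mediaType.startsWith("text/plain") || mediaType.startsWith("text/markdown"));
    }

    @Override
    public String extract(byte[] payload) {
        return new String(payload, StandardCharsets.UTF_8);
    }
}
```

With this shape, adding a later extractor (for example for DOCX) would mean registering one more `ContentExtractor` rather than changing the dispatch logic.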

## Important behavior

- current TED runtime remains intact
- generic ingestion is disabled by default and must be enabled with:
  - `ted.generic-ingestion.enabled=true`
- file-system polling is separately controlled with:
  - `ted.generic-ingestion.file-system-enabled=true`
- REST/API upload endpoints are under:
  - `/api/v1/dip/import/upload`
  - `/api/v1/dip/import/text`
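
As an illustration of calling the text endpoint, the sketch below builds a request with the JDK's `java.net.http` client (Java 11+). Only the path `/api/v1/dip/import/text` comes from this document; the HTTP method (`POST`), the `text/plain` content type, the raw-text body, and the `TextImportRequests` helper name are assumptions and should be checked against the actual controller.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds (but does not send) a request for the text import endpoint.
final class TextImportRequests {
    static final String TEXT_IMPORT_PATH = "/api/v1/dip/import/text";

    // baseUrl is an assumption, e.g. "http://localhost:8080" for a local instance.
    static HttpRequest textImport(String baseUrl, String text) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + TEXT_IMPORT_PATH))
                // Assumed content type and body shape; the controller may expect JSON or form data.
                .header("Content-Type", "text/plain; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(text, StandardCharsets.UTF_8))
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. With generic ingestion left at its disabled default, the import is not expected to succeed.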

## Current supported generic document types

- PDF
- HTML
- TEXT
- MARKDOWN
- XML_GENERIC
- UNKNOWN text-like files

DOCX, ZIP child extraction, and MIME body parsing are intentionally left for later phases.
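
Detection by media type / extension for the types listed above might look like the following minimal sketch. The enum constants mirror the list in this document; the class name `DocumentTypeDetector`, the exact media types and extensions mapped, and the choice to check the media type before the extension are assumptions made for illustration.

```java
import java.util.Locale;

// Constants mirror the supported generic document types listed in this document.
enum GenericDocumentType { PDF, HTML, TEXT, MARKDOWN, XML_GENERIC, UNKNOWN }

// Hypothetical detector: media type first, file extension as a second attempt.
final class DocumentTypeDetector {

    static GenericDocumentType detect(String mediaType, String fileName) {
        GenericDocumentType byMedia = fromMediaType(mediaType);
        return byMedia != GenericDocumentType.UNKNOWN ? byMedia : fromExtension(fileName);
    }

    private static GenericDocumentType fromMediaType(String mediaType) {
        if (mediaType == null) {
            return GenericDocumentType.UNKNOWN;
        }
        // Drop parameters such as "; charset=utf-8" before comparing.
        String base = mediaType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        switch (base) {
            case "application/pdf": return GenericDocumentType.PDF;
            case "text/html":       return GenericDocumentType.HTML;
            case "text/plain":      return GenericDocumentType.TEXT;
            case "text/markdown":   return GenericDocumentType.MARKDOWN;
            case "application/xml":
            case "text/xml":        return GenericDocumentType.XML_GENERIC;
            default:                return GenericDocumentType.UNKNOWN;
        }
    }

    private static GenericDocumentType fromExtension(String fileName) {
        if (fileName == null) {
            return GenericDocumentType.UNKNOWN;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return GenericDocumentType.UNKNOWN; // no extension present
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "pdf":      return GenericDocumentType.PDF;
            case "html":
            case "htm":      return GenericDocumentType.HTML;
            case "txt":      return GenericDocumentType.TEXT;
            case "md":
            case "markdown": return GenericDocumentType.MARKDOWN;
            case "xml":      return GenericDocumentType.XML_GENERIC;
            default:         return GenericDocumentType.UNKNOWN;
        }
    }
}
```

In this sketch, types deferred to later phases (such as DOCX or ZIP) simply resolve to `UNKNOWN`; how the real pipeline decides whether an `UNKNOWN` file is text-like is not covered here.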