# Phase 4 - Generic Ingestion Pipeline

Phase 4 introduces the first generalized ingestion flow on top of the DOC backbone.

## What is included

- generic ingestion gateway with adapter selection
- file-system ingestion adapter and Camel route
- REST/API upload controller for arbitrary documents
- document type detection by media type / extension
- first extractors for:
  - plain text / markdown / generic XML
  - HTML
  - PDF
  - binary fallback
- default representation builder for non-TED documents
- binary payload support in `DOC.doc_content.binary_content`
- automatic creation of pending generic embeddings for imported representations
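
The "document type detection by media type / extension" item can be pictured with a minimal sketch. Everything here is illustrative rather than the project's actual code: the class name `DocumentTypeDetector`, the lookup tables, and the order of checks (media type first, then file extension, then `UNKNOWN`) are assumptions. Only the type names come from the supported-types list in this document.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; not the project's real detector.
class DocumentTypeDetector {

    // Type names mirror the supported generic document types listed in this document.
    enum DocumentType { PDF, HTML, TEXT, MARKDOWN, XML_GENERIC, UNKNOWN }

    // Assumed media-type mappings (illustrative, not exhaustive).
    private static final Map<String, DocumentType> BY_MEDIA_TYPE = Map.of(
            "application/pdf", DocumentType.PDF,
            "text/html", DocumentType.HTML,
            "text/plain", DocumentType.TEXT,
            "text/markdown", DocumentType.MARKDOWN,
            "application/xml", DocumentType.XML_GENERIC,
            "text/xml", DocumentType.XML_GENERIC);

    // Assumed extension mappings, used when the media type is missing or unrecognized.
    private static final Map<String, DocumentType> BY_EXTENSION = Map.of(
            "pdf", DocumentType.PDF,
            "html", DocumentType.HTML,
            "htm", DocumentType.HTML,
            "txt", DocumentType.TEXT,
            "md", DocumentType.MARKDOWN,
            "xml", DocumentType.XML_GENERIC);

    static DocumentType detect(String mediaType, String fileName) {
        if (mediaType != null) {
            // Drop parameters such as "; charset=UTF-8" before the lookup.
            String base = mediaType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            DocumentType byMedia = BY_MEDIA_TYPE.get(base);
            if (byMedia != null) {
                return byMedia;
            }
        }
        if (fileName != null) {
            int dot = fileName.lastIndexOf('.');
            if (dot >= 0 && dot < fileName.length() - 1) {
                String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
                DocumentType byExt = BY_EXTENSION.get(ext);
                if (byExt != null) {
                    return byExt;
                }
            }
        }
        // Nothing matched: leave the decision to a fallback extractor.
        return DocumentType.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(detect("text/html; charset=UTF-8", "page.bin"));
        System.out.println(detect("application/octet-stream", "notes.md"));
        System.out.println(detect(null, "archive.zip"));
    }
}
```

In this sketch a generic media type such as `application/octet-stream` falls through to the extension check, so an upload with an uninformative content type can still be classified by its file name.
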

## Important behavior

- current TED runtime remains intact
- generic ingestion is disabled by default and must be enabled with:
  - `ted.generic-ingestion.enabled=true`
- file-system polling is separately controlled with:
  - `ted.generic-ingestion.file-system-enabled=true`
- REST/API upload endpoints are under:
  - `/api/v1/dip/import/upload`
  - `/api/v1/dip/import/text`
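
As a rough illustration of how the two switches above might interact, the following sketch reads both keys from a `java.util.Properties` object. The property keys are taken from this document; the class `GenericIngestionFlags`, the `false` default for the file-system flag, and the rule that polling only runs when both flags are on are assumptions, and the application may bind and combine these settings differently.

```java
import java.util.Properties;

// Hypothetical holder for the two flags; not the project's real configuration class.
class GenericIngestionFlags {

    // Property keys as named in this document.
    static final String ENABLED = "ted.generic-ingestion.enabled";
    static final String FILE_SYSTEM_ENABLED = "ted.generic-ingestion.file-system-enabled";

    final boolean enabled;
    final boolean fileSystemEnabled;

    GenericIngestionFlags(Properties props) {
        // Generic ingestion is documented as off by default.
        this.enabled = Boolean.parseBoolean(props.getProperty(ENABLED, "false"));
        // Assumption: the file-system flag also defaults to off.
        this.fileSystemEnabled =
                Boolean.parseBoolean(props.getProperty(FILE_SYSTEM_ENABLED, "false"));
    }

    // Assumption: file-system polling requires the master switch as well.
    boolean shouldPollFileSystem() {
        return enabled && fileSystemEnabled;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(ENABLED, "true");
        GenericIngestionFlags flags = new GenericIngestionFlags(props);
        System.out.println("generic ingestion enabled: " + flags.enabled);
        System.out.println("file-system polling active: " + flags.shouldPollFileSystem());
    }
}
```
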

## Currently supported generic document types

- PDF
- HTML
- TEXT
- MARKDOWN
- XML_GENERIC
- UNKNOWN text-like files

DOCX, ZIP child extraction, and MIME body parsing are intentionally left for later phases.