Phase 4 - Generic Ingestion Pipeline

Phase 4 introduces the first generalized ingestion flow on top of the DOC backbone.

What is included

  • generic ingestion gateway with adapter selection
  • file-system ingestion adapter and Camel route
  • REST/API upload controller for arbitrary documents
  • document type detection by media type / extension
  • first extractors for:
    • plain text / markdown / generic XML
    • HTML
    • PDF
    • binary fallback
  • default representation builder for non-TED documents
  • binary payload support in DOC.doc_content.binary_content
  • automatic creation of pending generic embeddings for imported representations
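The extractor list above ends in a binary fallback. One way to read that is "pick the first extractor that supports the detected type, otherwise fall back to storing the payload as binary only". The sketch below illustrates that selection pattern. Every name in it (`ContentExtractor`, `PlainTextExtractor`, `BinaryFallbackExtractor`, `ExtractorRegistry`) is hypothetical and not taken from the codebase; the type names passed as strings mirror the supported types listed in this document.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

// Hypothetical contract: an extractor declares which document types it handles
// and returns extracted text, or empty when no text can be produced.
interface ContentExtractor {
    boolean supports(String documentType);
    Optional<String> extractText(byte[] payload);
}

// Illustrative extractor for plain text, markdown, and generic XML.
// Assumes UTF-8 input; real charset handling is out of scope for this sketch.
final class PlainTextExtractor implements ContentExtractor {
    @Override
    public boolean supports(String documentType) {
        return documentType.equals("TEXT")
                || documentType.equals("MARKDOWN")
                || documentType.equals("XML_GENERIC");
    }

    @Override
    public Optional<String> extractText(byte[] payload) {
        return Optional.of(new String(payload, StandardCharsets.UTF_8));
    }
}

// Fallback: accepts any type and yields no text, so only the binary payload is kept.
final class BinaryFallbackExtractor implements ContentExtractor {
    @Override
    public boolean supports(String documentType) {
        return true;
    }

    @Override
    public Optional<String> extractText(byte[] payload) {
        return Optional.empty();
    }
}

// Selects the first registered extractor that supports the type, else the fallback.
final class ExtractorRegistry {
    private final List<ContentExtractor> extractors;
    private final ContentExtractor fallback = new BinaryFallbackExtractor();

    ExtractorRegistry(List<ContentExtractor> extractors) {
        this.extractors = List.copyOf(extractors);
    }

    ContentExtractor select(String documentType) {
        for (ContentExtractor extractor : extractors) {
            if (extractor.supports(documentType)) {
                return extractor;
            }
        }
        return fallback;
    }
}
```

With only `PlainTextExtractor` registered, `select("MARKDOWN")` returns the text extractor, while `select("PDF")` returns the fallback until a PDF extractor is added to the list.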

Important behavior

  • the existing TED runtime is unchanged
  • generic ingestion is disabled by default and must be enabled with:
    • ted.generic-ingestion.enabled=true
  • file-system polling is separately controlled with:
    • ted.generic-ingestion.file-system-enabled=true
  • REST/API upload endpoints are under:
    • /api/v1/dip/import/upload
    • /api/v1/dip/import/text
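As a configuration sketch, both switches named above can be set together. The two property keys come from this document; placing them in a standard Java properties file such as `application.properties` is an assumption about how the application is configured.

```properties
# Enable the generic ingestion pipeline (disabled by default).
ted.generic-ingestion.enabled=true

# Separately enable file-system polling.
ted.generic-ingestion.file-system-enabled=true
```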

Current supported generic document types

  • PDF
  • HTML
  • TEXT
  • MARKDOWN
  • XML_GENERIC
  • UNKNOWN (text-like files)
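The detection by media type / extension mentioned under "What is included" could map onto the types above roughly as follows. This is a minimal sketch, not the project's implementation. The enum constants mirror the list above, but `GenericDocumentType`, `DocumentTypeDetector`, and both lookup tables are hypothetical names and mappings chosen for illustration. The media type is checked first and the file extension is used only as a fallback; that order is also an assumption.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical enum mirroring the supported types listed in this document.
enum GenericDocumentType { PDF, HTML, TEXT, MARKDOWN, XML_GENERIC, UNKNOWN }

final class DocumentTypeDetector {

    // Illustrative mapping tables; the real detector may recognise more values.
    private static final Map<String, GenericDocumentType> BY_MEDIA_TYPE = Map.of(
            "application/pdf", GenericDocumentType.PDF,
            "text/html", GenericDocumentType.HTML,
            "text/plain", GenericDocumentType.TEXT,
            "text/markdown", GenericDocumentType.MARKDOWN,
            "application/xml", GenericDocumentType.XML_GENERIC,
            "text/xml", GenericDocumentType.XML_GENERIC);

    private static final Map<String, GenericDocumentType> BY_EXTENSION = Map.of(
            "pdf", GenericDocumentType.PDF,
            "html", GenericDocumentType.HTML,
            "htm", GenericDocumentType.HTML,
            "txt", GenericDocumentType.TEXT,
            "md", GenericDocumentType.MARKDOWN,
            "xml", GenericDocumentType.XML_GENERIC);

    private DocumentTypeDetector() {
    }

    // Tries the media type first, then the file extension, then gives up with UNKNOWN.
    static GenericDocumentType detect(String mediaType, String fileName) {
        if (mediaType != null) {
            // Drop parameters such as "; charset=utf-8" before the lookup.
            String base = mediaType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            GenericDocumentType byMediaType = BY_MEDIA_TYPE.get(base);
            if (byMediaType != null) {
                return byMediaType;
            }
        }
        if (fileName != null) {
            int dot = fileName.lastIndexOf('.');
            if (dot >= 0 && dot < fileName.length() - 1) {
                String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
                GenericDocumentType byExtension = BY_EXTENSION.get(extension);
                if (byExtension != null) {
                    return byExtension;
                }
            }
        }
        return GenericDocumentType.UNKNOWN;
    }
}
```

For example, `DocumentTypeDetector.detect("application/octet-stream", "report.pdf")` falls through the media-type table and resolves to `PDF` via the extension, while a `.docx` upload resolves to `UNKNOWN`, consistent with DOCX being deferred to a later phase.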

DOCX, ZIP child extraction, and MIME body parsing are intentionally left for later phases.