Phase 2 - Vectorization decoupling

Phase 2 moves the primary vectorization pipeline from TED.procurement_document to the generic DOC representation and embedding model introduced in Phase 1.

Implemented in this phase:

  • DOC.doc_text_representation is now the primary text source for embeddings
  • DOC.doc_embedding is the primary persistence target for embedding lifecycle and vectors
  • a generic Camel route processes pending/failed embeddings asynchronously
  • TED imports dual-write into the generic model by creating:
    • canonical DOC.doc_document
    • original DOC.doc_content
    • primary DOC.doc_text_representation
    • pending DOC.doc_embedding
  • in compatibility mode, completed TED embeddings are also written back to TED.procurement_document.content_vector, so the legacy semantic search continues to work
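
The dual-write and embedding lifecycle listed above can be sketched in plain Java. This is a minimal in-memory illustration, not the project's code: the class `Phase2Sketch`, its fields and methods, the `PENDING`/`FAILED`/`COMPLETED` status names, and the `embedder` function are all assumptions made for the example. The maps and lists only stand in for the DOC and TED tables, and the processing step runs synchronously here, whereas the actual pipeline uses an asynchronous Camel route.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical in-memory model; all class, field, and method names are assumptions.
final class Phase2Sketch {

    enum EmbeddingStatus { PENDING, FAILED, COMPLETED }

    // Stand-in for one DOC.doc_embedding row (lifecycle status plus vector).
    static final class DocEmbedding {
        final String documentId;
        EmbeddingStatus status = EmbeddingStatus.PENDING;
        float[] vector;

        DocEmbedding(String documentId) {
            this.documentId = documentId;
        }
    }

    // Stand-ins for the generic DOC tables, keyed by document id.
    final List<String> docDocument = new ArrayList<>();
    final Map<String, byte[]> docContent = new HashMap<>();
    final Map<String, String> docTextRepresentation = new HashMap<>();
    final List<DocEmbedding> docEmbedding = new ArrayList<>();

    // Stand-in for TED.procurement_document.content_vector.
    final Map<String, float[]> legacyContentVector = new HashMap<>();

    final boolean compatibilityMode;

    Phase2Sketch(boolean compatibilityMode) {
        this.compatibilityMode = compatibilityMode;
    }

    // TED import: dual-write the four generic records; the embedding starts as PENDING.
    DocEmbedding importTedDocument(String documentId, byte[] original, String text) {
        docDocument.add(documentId);
        docContent.put(documentId, original);
        docTextRepresentation.put(documentId, text);
        DocEmbedding embedding = new DocEmbedding(documentId);
        docEmbedding.add(embedding);
        return embedding;
    }

    // Processes every PENDING or FAILED embedding and returns how many completed.
    int processOutstanding(Function<String, float[]> embedder) {
        int completed = 0;
        for (DocEmbedding embedding : docEmbedding) {
            if (embedding.status == EmbeddingStatus.COMPLETED) {
                continue;
            }
            try {
                // The text representation, not the TED table, is the embedding input.
                String text = docTextRepresentation.get(embedding.documentId);
                embedding.vector = embedder.apply(text);
                embedding.status = EmbeddingStatus.COMPLETED;
                completed++;
                if (compatibilityMode) {
                    // Write-back that keeps the legacy semantic search working.
                    legacyContentVector.put(embedding.documentId, embedding.vector);
                }
            } catch (RuntimeException e) {
                // Leave the row retryable on the next pass.
                embedding.status = EmbeddingStatus.FAILED;
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        Phase2Sketch sketch = new Phase2Sketch(true);
        DocEmbedding embedding =
                sketch.importTedDocument("doc-1", new byte[0], "road works tender");
        // Toy embedder: a one-dimensional vector holding the text length.
        sketch.processOutstanding(text -> new float[] { (float) text.length() });
        System.out.println(embedding.status + " legacy rows: " + sketch.legacyContentVector.size());
    }
}
```

In this sketch a failed embedding stays in the list with status `FAILED` and is picked up again on the next call, which mirrors the pending/failed retry behaviour described above. With `compatibilityMode` set to `false`, the legacy map is never touched.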

This phase is intentionally additive and does not yet migrate TED semantic search reads away from the legacy table.