Phase 2 - Representation-based vectorization and dual-write compatibility

Goal

Decouple vectorization from the TED document entity so arbitrary document types can use a shared representation-to-embedding pipeline.

Primary changes

  1. Primary vectorization source

    • before: TED.procurement_document.text_content
    • now: DOC.doc_text_representation.text_body
  2. Primary vectorization target

    • before: TED.procurement_document.content_vector
    • now: DOC.doc_embedding.embedding_vector
  3. Compatibility during migration

    • completed embeddings can optionally be mirrored back to the legacy TED vector columns; rows are matched through the shared TED document hash (document_hash / dedup_hash)
  4. TED dual-write bridge

    • fresh TED documents are projected into the generic DOC model immediately after persistence
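
The optional mirror step in item 3 can be illustrated with a minimal in-memory sketch. Everything here is hypothetical: the class `LegacyVectorMirror`, its methods, and the boolean switch are stand-ins for illustration, and the map only imitates the legacy vector column keyed by the shared document hash. The real service may be structured quite differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the Phase 2 compatibility mirror. */
final class LegacyVectorMirror {
    // Imitates the legacy TED vector column, keyed by the shared document hash.
    private final Map<String, float[]> legacyVectorsByHash = new HashMap<>();
    // Assumed switch: the source only says mirroring is optional.
    private final boolean mirrorEnabled;

    LegacyVectorMirror(boolean mirrorEnabled) {
        this.mirrorEnabled = mirrorEnabled;
    }

    /** Registers a legacy TED row that has no vector yet. */
    void registerLegacyDocument(String documentHash) {
        legacyVectorsByHash.put(documentHash, null);
    }

    /**
     * Called once a DOC embedding has completed. Copies the vector to the
     * legacy row with the same hash. Returns false when mirroring is off or
     * no legacy row with that hash exists.
     */
    boolean onEmbeddingCompleted(String documentHash, float[] embeddingVector) {
        if (!mirrorEnabled || !legacyVectorsByHash.containsKey(documentHash)) {
            return false;
        }
        legacyVectorsByHash.put(documentHash, embeddingVector.clone());
        return true;
    }

    /** Returns the mirrored vector, or null if none has been written. */
    float[] legacyVector(String documentHash) {
        return legacyVectorsByHash.get(documentHash);
    }
}
```

Matching on the hash rather than on a TED primary key keeps the generic DOC side free of TED-specific identifiers, which seems to be the point of the decoupling.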

Key services introduced

  • TedPhase2GenericDocumentService
    • creates/refreshes generic DOC records for TED notices
  • DocumentEmbeddingProcessingService
    • processes DOC embedding lifecycle records
  • GenericVectorizationRoute
    • scheduler and worker route for asynchronous DOC embedding generation
  • ConfiguredEmbeddingModelStartupRunner
    • ensures the configured embedding model exists in DOC.doc_embedding_model
  • GenericVectorizationStartupRunner
    • queues pending/failed DOC embeddings on startup
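
As a rough illustration of the startup queueing described for GenericVectorizationStartupRunner, the sketch below selects pending and failed records and puts their ids on a work queue. The `Status` enum, the `EmbeddingRecord` class, and the `queueOnStartup` method are assumptions made for this example. The source names only the pending, failed, and completed states, so `PROCESSING` is an invented extra.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical sketch of startup queueing for DOC embeddings. */
final class StartupQueueingSketch {
    // Assumed lifecycle states; PROCESSING is not named in the source.
    enum Status { PENDING, PROCESSING, COMPLETED, FAILED }

    /** Minimal stand-in for a DOC embedding lifecycle record. */
    static final class EmbeddingRecord {
        final long id;
        final Status status;

        EmbeddingRecord(long id, Status status) {
            this.id = id;
            this.status = status;
        }
    }

    /**
     * Adds the id of every PENDING or FAILED record to the work queue and
     * returns the queued ids in record order. Other states are skipped.
     */
    static List<Long> queueOnStartup(List<EmbeddingRecord> records, Queue<Long> workQueue) {
        List<Long> queued = new ArrayList<>();
        for (EmbeddingRecord record : records) {
            if (record.status == Status.PENDING || record.status == Status.FAILED) {
                workQueue.add(record.id);
                queued.add(record.id);
            }
        }
        return queued;
    }
}
```

Re-queueing failed records together with pending ones means a restart doubles as a retry pass. Whether the real runner also applies a retry limit is not stated in the source.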

Behavior when Phase 2 is enabled

  • legacy VectorizationRoute is disabled
  • legacy startup queueing is disabled
  • legacy event-based vectorization queueing is disabled
  • generic scheduler and startup runner handle DOC embeddings instead
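
The switch-over above can be pictured as a single flag that selects which set of components is active. This is only a sketch: the `VectorizationToggle` class, the `Component` enum, and the `phase2Enabled` parameter are hypothetical names. The disabled branch, where the legacy components stay active, is an assumption, since the source describes only the enabled case.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch of the Phase 2 switch between legacy and generic vectorization. */
final class VectorizationToggle {
    enum Component {
        LEGACY_ROUTE,
        LEGACY_STARTUP_QUEUEING,
        LEGACY_EVENT_QUEUEING,
        GENERIC_SCHEDULER,
        GENERIC_STARTUP_RUNNER
    }

    /** Returns the components that should run for the given flag value. */
    static Set<Component> activeComponents(boolean phase2Enabled) {
        if (phase2Enabled) {
            // Matches the list above: all legacy paths off, generic paths on.
            return EnumSet.of(Component.GENERIC_SCHEDULER, Component.GENERIC_STARTUP_RUNNER);
        }
        // Assumption: with Phase 2 off, only the legacy paths run.
        return EnumSet.of(
            Component.LEGACY_ROUTE,
            Component.LEGACY_STARTUP_QUEUEING,
            Component.LEGACY_EVENT_QUEUEING);
    }
}
```

Keeping the two sets disjoint avoids a document being queued by both the legacy and the generic path at the same time.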

Compatibility intent

This phase keeps the existing TED search endpoints working while the new generic indexing layer becomes operational. The next phase can migrate search reads from the TED table to DOC.doc_embedding.