Phase 2 - Representation-based vectorization and dual-write compatibility

Goal

Decouple vectorization from the TED document entity so arbitrary document types can use a shared representation-to-embedding pipeline.

Primary changes

Primary vectorization source
- before: TED.procurement_document.text_content
- now: DOC.doc_text_representation.text_body
Primary vectorization target
- before: TED.procurement_document.content_vector
- now: DOC.doc_embedding.embedding_vector
Compatibility during migration
- completed embeddings are optionally mirrored back to the legacy TED vector columns using the shared TED document hash (document_hash / dedup_hash)
TED dual-write bridge
- fresh TED documents are projected into the generic DOC model immediately after persistence
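The projection and mirror-back steps above can be sketched in a few lines. This is a minimal, in-memory illustration and not the project's actual implementation: the classes `TedDocument` and `DocRecord`, the functions `project_to_generic` and `complete_embedding`, the `mirror_to_legacy` flag, and the status strings are all hypothetical names chosen for the example. Only the table and column names in the comments come from the text above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TedDocument:
    """Stand-in for a row of TED.procurement_document (legacy model)."""
    document_hash: str
    text_content: str
    content_vector: Optional[List[float]] = None


@dataclass
class DocRecord:
    """Stand-in for a DOC.doc_text_representation row and its DOC.doc_embedding row."""
    dedup_hash: str
    text_body: str
    status: str = "PENDING"  # hypothetical lifecycle value
    embedding_vector: Optional[List[float]] = None


def project_to_generic(ted: TedDocument, doc_store: Dict[str, DocRecord]) -> DocRecord:
    """Create or refresh the generic DOC record right after a TED document is saved."""
    record = doc_store.get(ted.document_hash)
    if record is None or record.text_body != ted.text_content:
        # New or changed text: start over with a pending embedding.
        record = DocRecord(dedup_hash=ted.document_hash, text_body=ted.text_content)
        doc_store[ted.document_hash] = record
    return record


def complete_embedding(
    record: DocRecord,
    vector: List[float],
    ted_store: Dict[str, TedDocument],
    mirror_to_legacy: bool,
) -> None:
    """Store the vector in the generic model and optionally mirror it to TED."""
    record.embedding_vector = list(vector)
    record.status = "COMPLETED"
    if mirror_to_legacy:
        # The shared document hash links the generic record to the legacy row.
        legacy = ted_store.get(record.dedup_hash)
        if legacy is not None:
            legacy.content_vector = list(vector)
```

In this sketch the hash is the only link between the two models, so a DOC record with no TED counterpart is simply not mirrored.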

Key services introduced

TedPhase2GenericDocumentService
- creates/refreshes generic DOC records for TED notices
DocumentEmbeddingProcessingService
- processes DOC embedding lifecycle records
GenericVectorizationRoute
- scheduler + worker route for asynchronous DOC embedding generation
ConfiguredEmbeddingModelStartupRunner
- ensures the configured embedding model exists in DOC.doc_embedding_model
GenericVectorizationStartupRunner
- queues pending/failed DOC embeddings on startup
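How startup queueing and the worker might fit together can be illustrated with a small in-memory sketch. It assumes a four-state lifecycle (`PENDING`, `PROCESSING`, `COMPLETED`, `FAILED`). Those state names, the functions `queue_on_startup` and `run_worker`, and the `embed` callback are assumptions made for the example; the real services may differ.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

# Hypothetical lifecycle states for a DOC embedding record.
PENDING, PROCESSING, COMPLETED, FAILED = "PENDING", "PROCESSING", "COMPLETED", "FAILED"


def queue_on_startup(statuses: Dict[str, str]) -> Deque[str]:
    """Collect the ids of embeddings that are pending or failed, in sorted order."""
    return deque(sorted(i for i, s in statuses.items() if s in (PENDING, FAILED)))


def run_worker(
    queue: Deque[str],
    statuses: Dict[str, str],
    texts: Dict[str, str],
    vectors: Dict[str, List[float]],
    embed: Callable[[str], List[float]],
) -> None:
    """Drain the queue, embedding each text and recording the outcome."""
    while queue:
        emb_id = queue.popleft()
        statuses[emb_id] = PROCESSING
        try:
            vectors[emb_id] = embed(texts[emb_id])
            statuses[emb_id] = COMPLETED
        except Exception:
            # A failed record stays FAILED so a later startup run can requeue it.
            statuses[emb_id] = FAILED
```

Records that are already completed are never queued in this sketch, so restarting the application does not recompute existing vectors.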

Behavior when Phase 2 is enabled

- legacy VectorizationRoute is disabled
- legacy startup queueing is disabled
- legacy event-based vectorization queueing is disabled
- generic scheduler and startup runner handle DOC embeddings instead
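The switch described above amounts to a single toggle that selects one of two sets of components. The sketch below is illustrative only. The function `active_components` and the descriptive labels for the legacy queueing paths are hypothetical. It also assumes that the legacy components stay active and the generic ones stay off when Phase 2 is disabled, which the text above does not state explicitly.

```python
from typing import FrozenSet

# Labels for the legacy paths are descriptive placeholders, not class names.
LEGACY: FrozenSet[str] = frozenset({
    "VectorizationRoute",
    "legacy startup queueing",
    "legacy event-based vectorization queueing",
})

GENERIC: FrozenSet[str] = frozenset({
    "GenericVectorizationRoute",
    "GenericVectorizationStartupRunner",
})


def active_components(phase2_enabled: bool) -> FrozenSet[str]:
    """Return the vectorization components that should run for the given toggle."""
    # Assumption: the two sets are mutually exclusive, so each document is
    # embedded by exactly one pipeline.
    return GENERIC if phase2_enabled else LEGACY
```

Keeping the two sets mutually exclusive avoids embedding the same document twice while both code paths still exist.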

Compatibility intent

This phase keeps the existing TED search endpoints working while the new generic indexing layer becomes operational. The next phase can migrate search reads from the TED table to DOC.doc_embedding.
