Phase 2 - Representation-based vectorization and dual-write compatibility
Goal
Decouple vectorization from the TED document entity so arbitrary document types can use a shared representation-to-embedding pipeline.
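As a rough illustration of this goal, the sketch below shows a pipeline that reads only a text representation and never looks at the document type. The class names, the `toy_embed` function, and the `CONTRACT_PDF` type are hypothetical stand-ins, not the project's actual code; only the `text_body` and `embedding_vector` field names come from this document.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TextRepresentation:
    """Hypothetical stand-in for a row in DOC.doc_text_representation."""
    document_id: str
    document_type: str  # e.g. "TED_NOTICE"; other types are assumed for illustration
    text_body: str


@dataclass
class Embedding:
    """Hypothetical stand-in for a row in DOC.doc_embedding."""
    document_id: str
    embedding_vector: List[float]


def toy_embed(text: str) -> List[float]:
    """Placeholder embedder, NOT a real model: character count and word count."""
    return [float(len(text)), float(len(text.split()))]


def vectorize(rep: TextRepresentation,
              embed: Callable[[str], List[float]]) -> Embedding:
    # The pipeline depends only on text_body, so any document type can use it.
    return Embedding(rep.document_id, embed(rep.text_body))


ted = TextRepresentation("doc-1", "TED_NOTICE", "road works tender")
other = TextRepresentation("doc-2", "CONTRACT_PDF", "supply agreement")
embeddings = [vectorize(rep, toy_embed) for rep in (ted, other)]
```

Because `vectorize` takes the embedder as a parameter, a real model client could presumably be swapped in without touching the document-type handling.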
Primary changes
- Primary vectorization source
  - before: TED.procurement_document.text_content
  - now: DOC.doc_text_representation.text_body
- Primary vectorization target
  - before: TED.procurement_document.content_vector
  - now: DOC.doc_embedding.embedding_vector
- Compatibility during migration
  - completed embeddings are optionally mirrored back to the legacy TED vector columns using the shared TED document hash (document_hash/dedup_hash)
- TED dual-write bridge
  - fresh TED documents are projected into the generic DOC model immediately after persistence
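The dual-write bridge and the optional mirror-back can be sketched with two in-memory stores. This is a minimal illustration under stated assumptions: the `DualWriteStore` class, its method names, and the `mirror_to_legacy` flag are hypothetical, and the sketch assumes the legacy `document_hash` and the generic `dedup_hash` hold the same value for a given TED document.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LegacyTedRow:
    """Hypothetical stand-in for a TED.procurement_document row."""
    document_hash: str
    text_content: str
    content_vector: Optional[List[float]] = None


@dataclass
class DocRecord:
    """Hypothetical stand-in for the generic DOC representation plus embedding."""
    dedup_hash: str
    text_body: str
    embedding_vector: Optional[List[float]] = None


class DualWriteStore:
    def __init__(self, mirror_to_legacy: bool) -> None:
        self.mirror_to_legacy = mirror_to_legacy  # assumed config flag
        self.ted: Dict[str, LegacyTedRow] = {}
        self.doc: Dict[str, DocRecord] = {}

    def persist_ted_document(self, document_hash: str, text_content: str) -> None:
        self.ted[document_hash] = LegacyTedRow(document_hash, text_content)
        # Dual-write bridge: project into the generic model right after persistence.
        self.doc[document_hash] = DocRecord(document_hash, text_content)

    def complete_embedding(self, dedup_hash: str, vector: List[float]) -> None:
        self.doc[dedup_hash].embedding_vector = vector
        # Optional mirror-back: locate the legacy row through the shared hash.
        if self.mirror_to_legacy and dedup_hash in self.ted:
            self.ted[dedup_hash].content_vector = list(vector)


mirrored = DualWriteStore(mirror_to_legacy=True)
mirrored.persist_ted_document("h1", "notice text")
mirrored.complete_embedding("h1", [0.1, 0.2])

unmirrored = DualWriteStore(mirror_to_legacy=False)
unmirrored.persist_ted_document("h1", "notice text")
unmirrored.complete_embedding("h1", [0.1, 0.2])
```

With mirroring on, the legacy vector column is filled from the generic embedding; with it off, only the DOC side holds the vector.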
Key services introduced
- TedPhase2GenericDocumentService: creates/refreshes generic DOC records for TED notices
- DocumentEmbeddingProcessingService: processes DOC embedding lifecycle records
- GenericVectorizationRoute: scheduler + worker route for asynchronous DOC embedding generation
- ConfiguredEmbeddingModelStartupRunner: ensures the configured embedding model exists in DOC.doc_embedding_model
- GenericVectorizationStartupRunner: queues pending/failed DOC embeddings on startup
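The two startup runners listed above can be pictured as an idempotent "ensure exists" step followed by a filter over embedding lifecycle records. The sketch below is an assumption-laden illustration: the function names, the status strings (`PENDING`, `PROCESSING`, `COMPLETED`, `FAILED`), the `example-model` key, and the `dimensions` field are all hypothetical and are not taken from the actual services.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EmbeddingJob:
    """Hypothetical lifecycle record for one DOC embedding."""
    document_id: str
    status: str  # assumed values: PENDING, PROCESSING, COMPLETED, FAILED


def ensure_model_registered(models: Dict[str, dict],
                            model_key: str,
                            dimensions: int) -> bool:
    """Register the configured model if it is missing; return True if created."""
    if model_key in models:
        return False  # already present, so repeated startups change nothing
    models[model_key] = {"dimensions": dimensions}
    return True


def queue_on_startup(jobs: List[EmbeddingJob]) -> List[str]:
    """Return the ids of embeddings that still need work, in input order."""
    return [job.document_id for job in jobs if job.status in ("PENDING", "FAILED")]


registry: Dict[str, dict] = {}
first_run = ensure_model_registered(registry, "example-model", 384)
second_run = ensure_model_registered(registry, "example-model", 384)

jobs = [
    EmbeddingJob("a", "PENDING"),
    EmbeddingJob("b", "COMPLETED"),
    EmbeddingJob("c", "FAILED"),
    EmbeddingJob("d", "PROCESSING"),
]
queued = queue_on_startup(jobs)
```

Keeping the registration step idempotent means the runner can execute on every startup without creating duplicate model rows.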
Behavior when Phase 2 is enabled
- legacy VectorizationRoute is disabled
- legacy startup queueing is disabled
- legacy event-based vectorization queueing is disabled
- generic scheduler and startup runner handle DOC embeddings instead
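The switch described in this list can be summarized as a single flag that selects one of two mutually exclusive sets of components. The sketch below uses a hypothetical `active_components` function and hypothetical key names. It also assumes that the legacy components stay active when Phase 2 is disabled, which this document does not state explicitly.

```python
from typing import Dict


def active_components(phase2_enabled: bool) -> Dict[str, bool]:
    """Map each component to whether it runs under the given flag value."""
    return {
        # Legacy TED-bound vectorization (assumed active only when Phase 2 is off)
        "legacy_vectorization_route": not phase2_enabled,
        "legacy_startup_queueing": not phase2_enabled,
        "legacy_event_queueing": not phase2_enabled,
        # Generic DOC embedding pipeline
        "generic_scheduler": phase2_enabled,
        "generic_startup_runner": phase2_enabled,
    }


with_phase2 = active_components(True)
without_phase2 = active_components(False)
```

Deriving every toggle from one flag keeps the legacy and generic pipelines from both queueing work for the same documents.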
Compatibility intent
This phase keeps the existing TED search endpoints working while the new generic indexing layer becomes
operational. The next phase can migrate search reads from the TED table to DOC.doc_embedding.