# Phase 2 - Representation-based vectorization and dual-write compatibility
## Goal
Decouple vectorization from the TED document entity so arbitrary document types can use a shared
representation-to-embedding pipeline.
## Primary changes
1. **Primary vectorization source**
   - before: `TED.procurement_document.text_content`
   - now: `DOC.doc_text_representation.text_body`
2. **Primary vectorization target**
   - before: `TED.procurement_document.content_vector`
   - now: `DOC.doc_embedding.embedding_vector`
3. **Compatibility during migration**
   - completed embeddings are optionally mirrored back to the legacy TED vector columns using the
     shared TED document hash (`document_hash` / `dedup_hash`)
4. **TED dual-write bridge**
   - fresh TED documents are projected into the generic `DOC` model immediately after persistence
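
The data flow described in the list above can be sketched roughly as follows. This is an illustrative Python model, not the project's code: the in-memory store, the class and function names, the toy embedding function, and the choice to key every record by the document hash are all assumptions made for the example. Only the table and column names in the comments come from this document.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class LegacyTedDocument:
    # Stand-in for a TED.procurement_document row (hypothetical shape).
    document_hash: str
    text_content: str
    content_vector: Optional[List[float]] = None


@dataclass
class DocRepresentation:
    # Stand-in for a DOC.doc_text_representation row (hypothetical shape).
    document_hash: str
    text_body: str


@dataclass
class InMemoryStore:
    # Hypothetical in-memory replacement for both schemas, keyed by hash.
    legacy_by_hash: Dict[str, LegacyTedDocument] = field(default_factory=dict)
    representations: Dict[str, DocRepresentation] = field(default_factory=dict)
    embeddings: Dict[str, List[float]] = field(default_factory=dict)


def project_ted_document(store: InMemoryStore, ted: LegacyTedDocument) -> None:
    """Dual-write bridge: persist the TED row, then project it into DOC."""
    store.legacy_by_hash[ted.document_hash] = ted
    store.representations[ted.document_hash] = DocRepresentation(
        ted.document_hash, ted.text_content
    )


def embed_representation(
    store: InMemoryStore,
    document_hash: str,
    embed: Callable[[str], List[float]],
    mirror_to_legacy: bool = True,
) -> List[float]:
    """Embed the generic text body and optionally mirror the vector back."""
    representation = store.representations[document_hash]
    vector = embed(representation.text_body)   # source: text_body
    store.embeddings[document_hash] = vector   # target: embedding_vector
    if mirror_to_legacy:
        legacy = store.legacy_by_hash.get(document_hash)
        if legacy is not None:                 # non-TED documents have no legacy row
            legacy.content_vector = vector
    return vector


def toy_embed(text: str) -> List[float]:
    # Toy embedding used only for this sketch: [length, number of spaces].
    return [float(len(text)), float(text.count(" "))]
```

In this sketch a document that never existed in the TED schema is embedded the same way as a TED notice; the mirror step simply finds no legacy row and does nothing.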
## Key services introduced
- `TedPhase2GenericDocumentService`
  - creates/refreshes generic DOC records for TED notices
- `DocumentEmbeddingProcessingService`
  - processes DOC embedding lifecycle records
- `GenericVectorizationRoute`
  - scheduler + worker route for asynchronous DOC embedding generation
- `ConfiguredEmbeddingModelStartupRunner`
  - ensures the configured embedding model exists in `DOC.doc_embedding_model`
- `GenericVectorizationStartupRunner`
  - queues pending/failed DOC embeddings on startup
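
As a rough illustration of how the startup queueing and the worker could interact, the sketch below models an embedding lifecycle in Python. It is not taken from the services listed above: the status names, the record fields, and the functions `queue_on_startup`, `process_queue`, and `length_embed` are hypothetical, chosen only to show pending and failed records being re-queued and then processed.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Deque, Dict, List, Optional


class EmbeddingStatus(Enum):
    # Assumed lifecycle states; the real status values may differ.
    PENDING = "PENDING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


@dataclass
class EmbeddingRecord:
    # Hypothetical stand-in for a DOC embedding lifecycle record.
    record_id: int
    text_body: str
    status: EmbeddingStatus = EmbeddingStatus.PENDING
    vector: Optional[List[float]] = None


def queue_on_startup(records: Dict[int, EmbeddingRecord]) -> Deque[int]:
    """Collect the ids of pending and failed records, lowest id first."""
    retryable = {EmbeddingStatus.PENDING, EmbeddingStatus.FAILED}
    return deque(
        record_id
        for record_id, record in sorted(records.items())
        if record.status in retryable
    )


def process_queue(
    queue: Deque[int],
    records: Dict[int, EmbeddingRecord],
    embed: Callable[[str], List[float]],
) -> None:
    """Worker loop: embed each queued record and update its status."""
    while queue:
        record = records[queue.popleft()]
        try:
            record.vector = embed(record.text_body)
            record.status = EmbeddingStatus.COMPLETED
        except Exception:
            # Leave the record retryable so the next startup queues it again.
            record.vector = None
            record.status = EmbeddingStatus.FAILED


def length_embed(text: str) -> List[float]:
    # Toy embedding for the sketch; rejects empty text to show the failure path.
    if not text:
        raise ValueError("empty text body")
    return [float(len(text))]
```

Completed records are skipped at startup, so a restart in this model only repeats work that has not yet succeeded.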
## Behavior when Phase 2 is enabled
- legacy `VectorizationRoute` is disabled
- legacy startup queueing is disabled
- legacy event-based vectorization queueing is disabled
- generic scheduler and startup runner handle DOC embeddings instead
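
The switch described above amounts to choosing one of two component sets. A minimal Python sketch of that selection follows, assuming a single boolean flag; the flag name `phase2_enabled` and the function are hypothetical, and the assumption that the legacy components stay active when the flag is off is inferred from the list rather than stated in it.

```python
from typing import List

# Component names as described in this document; the labels for the two
# legacy queueing mechanisms are descriptive, not real class names.
LEGACY_COMPONENTS = [
    "VectorizationRoute",
    "legacy startup queueing",
    "legacy event-based vectorization queueing",
]
GENERIC_COMPONENTS = [
    "GenericVectorizationRoute",
    "GenericVectorizationStartupRunner",
]


def active_vectorization_components(phase2_enabled: bool) -> List[str]:
    """Return the components that would run for the given flag value."""
    return list(GENERIC_COMPONENTS if phase2_enabled else LEGACY_COMPONENTS)
```

Keeping the two sets mutually exclusive in this sketch reflects the intent that a document is never queued for embedding by both pipelines at once.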
## Compatibility intent
This phase keeps the existing TED search endpoints working while the new generic indexing layer becomes
operational. The next phase can migrate search reads from the TED table to `DOC.doc_embedding`.