# Phase 2 - Representation-based vectorization and dual-write compatibility

## Goal

Decouple vectorization from the TED document entity so arbitrary document types can use a shared
representation-to-embedding pipeline.

## Primary changes

1. **Primary vectorization source**
   - before: `TED.procurement_document.text_content`
   - now: `DOC.doc_text_representation.text_body`

2. **Primary vectorization target**
   - before: `TED.procurement_document.content_vector`
   - now: `DOC.doc_embedding.embedding_vector`

3. **Compatibility during migration**
   - completed embeddings are optionally mirrored back to the legacy TED vector columns using the
     shared TED document hash (`document_hash` / `dedup_hash`)

4. **TED dual-write bridge**
   - fresh TED documents are projected into the generic `DOC` model immediately after persistence

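
The data flow described by the four changes above can be sketched as a small in-memory model. This is an illustration only, not the project's code: `TedDocument`, `DocStore`, `project_to_doc`, `toy_embed`, and `embed_and_mirror` are hypothetical names, the dictionaries stand in for database tables, and the "embedding" is a toy placeholder rather than a real model.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TedDocument:
    """Stand-in for a row of TED.procurement_document (legacy model)."""
    document_hash: str
    text_content: str
    content_vector: Optional[list[float]] = None  # legacy vector column


@dataclass
class DocStore:
    """Stand-in for the generic DOC tables, keyed by the shared hash (assumption)."""
    text_body: dict[str, str] = field(default_factory=dict)  # doc_text_representation
    embedding_vector: dict[str, list[float]] = field(default_factory=dict)  # doc_embedding


def project_to_doc(ted: TedDocument, doc: DocStore) -> None:
    """Dual-write bridge: copy a freshly persisted TED document into DOC."""
    doc.text_body[ted.document_hash] = ted.text_content


def toy_embed(text: str) -> list[float]:
    """Placeholder embedding (NOT a real model): character and word counts."""
    return [float(len(text)), float(len(text.split()))]


def embed_and_mirror(
    doc: DocStore,
    legacy: dict[str, TedDocument],
    mirror_to_legacy: bool,
) -> None:
    """Embed every DOC text body; optionally mirror vectors back to TED by hash."""
    for doc_hash, body in doc.text_body.items():
        vector = toy_embed(body)
        doc.embedding_vector[doc_hash] = vector  # new primary target
        if mirror_to_legacy and doc_hash in legacy:
            legacy[doc_hash].content_vector = vector  # compatibility mirror
```

With `mirror_to_legacy=False` the legacy `content_vector` stays untouched, which corresponds to the "optionally mirrored" wording in item 3.
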
## Key services introduced

- `TedPhase2GenericDocumentService`
  - creates/refreshes generic DOC records for TED notices
- `DocumentEmbeddingProcessingService`
  - processes DOC embedding lifecycle records
- `GenericVectorizationRoute`
  - scheduler + worker route for asynchronous DOC embedding generation
- `ConfiguredEmbeddingModelStartupRunner`
  - ensures the configured embedding model exists in `DOC.doc_embedding_model`
- `GenericVectorizationStartupRunner`
  - queues pending/failed DOC embeddings on startup

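
The two startup responsibilities listed above (registering the configured model, re-queueing pending/failed embeddings) might look roughly like the following. This is a minimal sketch under stated assumptions: the status names, the `PROCESSING` state, and both function names are hypothetical, and plain dictionaries replace `DOC.doc_embedding_model` and the embedding lifecycle records.

```python
from __future__ import annotations

from enum import Enum


class EmbeddingStatus(Enum):
    """Assumed lifecycle states; only pending, failed, and completed are named in this document."""
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"  # assumption: an in-flight state exists
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


def ensure_model_registered(models: dict[str, int], model_key: str, dimensions: int) -> bool:
    """Idempotently register the configured model; return True only if it was newly added."""
    if model_key in models:
        return False
    models[model_key] = dimensions
    return True


def select_for_startup_queue(statuses: dict[str, EmbeddingStatus]) -> list[str]:
    """Return the ids of embeddings to re-queue on startup (pending or failed), sorted."""
    retryable = {EmbeddingStatus.PENDING, EmbeddingStatus.FAILED}
    return sorted(doc_id for doc_id, status in statuses.items() if status in retryable)
```

Making registration idempotent means the runner can execute on every startup without creating duplicate model rows.
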
## Behavior when Phase 2 is enabled

- legacy `VectorizationRoute` is disabled
- legacy startup queueing is disabled
- legacy event-based vectorization queueing is disabled
- generic scheduler and startup runner handle DOC embeddings instead

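
The switch described above can be pictured as a single flag selecting one set of components. The function below is a hypothetical illustration, not the project's configuration mechanism. It also assumes that the legacy components stay active when Phase 2 is disabled, which this document does not state explicitly.

```python
from __future__ import annotations


def active_components(phase2_enabled: bool) -> list[str]:
    """Return the vectorization components that run for a given flag value."""
    legacy = [
        "VectorizationRoute",
        "legacy startup queueing",
        "legacy event-based vectorization queueing",
    ]
    generic = [
        "GenericVectorizationRoute",
        "GenericVectorizationStartupRunner",
    ]
    # Assumption: the two sets are mutually exclusive.
    return generic if phase2_enabled else legacy
```
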
## Compatibility intent

This phase keeps the existing TED search endpoints working while the new generic indexing layer becomes
operational. The next phase can migrate search reads from the TED table to `DOC.doc_embedding`.