# Phase 1 Generic Persistence Model
## Goal
Introduce the generalized persistence backbone in an additive, non-breaking way.
## New schema
The project now contains the `DOC` schema with the following tables:
- `doc_tenant`
- `doc_document`
- `doc_source`
- `doc_content`
- `doc_text_representation`
- `doc_embedding_model`
- `doc_embedding`
- `doc_relation`
## Design choices
### Owner tenant is optional
Public TED notices can remain unowned documents with `visibility = PUBLIC`.
### Visibility is mandatory
Every canonical document must carry `DocumentVisibility`.
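The two rules above (optional owner tenant, mandatory visibility) could be modeled roughly as follows. This is a minimal Python sketch, not the project's actual entity: only `DocumentVisibility` and `PUBLIC` come from this document, while `Document`, `document_id`, `owner_tenant_id` and the `TENANT` member are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DocumentVisibility(Enum):
    PUBLIC = "PUBLIC"   # named in this document
    TENANT = "TENANT"   # hypothetical second value, for illustration only


@dataclass(frozen=True)
class Document:
    document_id: str                        # hypothetical field name
    visibility: DocumentVisibility          # mandatory: no default value
    owner_tenant_id: Optional[str] = None   # optional: unowned documents are allowed

    def __post_init__(self) -> None:
        # Reject documents that do not carry a proper visibility value.
        if not isinstance(self.visibility, DocumentVisibility):
            raise ValueError("visibility must be a DocumentVisibility")


# A public TED notice without an owner tenant.
ted_notice = Document(document_id="ted-1", visibility=DocumentVisibility.PUBLIC)
```

Leaving `visibility` without a default forces every caller to choose one explicitly, whereas the owner defaults to "none".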
### Vectorization is already separated
`doc_embedding` tracks the vectorization lifecycle and the model association separately from `doc_document`.
The vector payload column already exists in the schema, but the runtime keeps using the legacy TED
vectorization flow until Phase 2.
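One way to picture this separation is an embedding record that carries its own lifecycle state and model reference, so the document record needs no vector fields at all. The sketch below is illustrative only: the class name, the field names and the `EmbeddingStatus` values are assumptions, since the real columns of `doc_embedding` are not listed here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class EmbeddingStatus(Enum):
    # Hypothetical lifecycle states; the real set is not given in this document.
    PENDING = "PENDING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


@dataclass
class Embedding:
    """Sketch of a doc_embedding row: state per (document, model) pair."""
    document_id: str
    model_id: str
    status: EmbeddingStatus = EmbeddingStatus.PENDING
    vector: Optional[List[float]] = None  # payload; None until a vector is stored

    def complete(self, vector: List[float]) -> None:
        # Store the vector and mark the lifecycle as finished.
        self.vector = vector
        self.status = EmbeddingStatus.COMPLETED


def embeddings_for(document_id: str, rows: List[Embedding]) -> List[Embedding]:
    # A document can have several embeddings, one per model.
    return [row for row in rows if row.document_id == document_id]


rows = [Embedding("doc-1", "model-a"), Embedding("doc-1", "model-b")]
rows[0].complete([0.1, 0.2, 0.3])
```

Because the state lives on the embedding row, a model can be added or re-run without touching the document record.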
### Content and text representation are separate
`doc_content` stores payload variants. `doc_text_representation` stores search-oriented texts.
This boundary is what allows arbitrary document types to be added later.
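The split between payload and search text can be sketched as two independent record types. The names below (`Content`, `TextRepresentation`, `media_type`, `kind`, `searchable_texts`) are hypothetical and only illustrate the idea that search reads text representations and never has to interpret a raw payload.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Content:
    """Sketch of a doc_content row: one stored payload variant."""
    document_id: str
    media_type: str   # hypothetical column
    payload: bytes


@dataclass(frozen=True)
class TextRepresentation:
    """Sketch of a doc_text_representation row: text prepared for search."""
    document_id: str
    kind: str         # hypothetical column, e.g. "FULLTEXT"
    text: str


def searchable_texts(document_id: str, reps: List[TextRepresentation]) -> List[str]:
    # Search only looks at text representations, never at raw payloads.
    return [r.text for r in reps if r.document_id == document_id]


contents = [Content("doc-1", "application/xml", b"<notice>...</notice>")]
reps = [TextRepresentation("doc-1", "FULLTEXT", "road maintenance services")]
```

A new document type then only needs a way to produce text representations; the search side stays unchanged.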
## What is still intentionally missing
- no dual-write from TED import yet
- no generic ingestion routes yet
- no semantic search cutover yet
- no TED projection tables yet
- no historical migration yet
## Result
The generalized platform is now backed by a real schema and service layer, which significantly reduces
the risk of the later migration.