runtime-split patch b-i
parent 44995cebf7
commit 1aa599b587
@ -0,0 +1,31 @@
# Config split: moved new-runtime properties to application-new.yml

This patch keeps shared and legacy defaults in `application.yml` and moves new-runtime properties into `application-new.yml`.

Activate the new runtime with:

```
--spring.profiles.active=new
```

`application-new.yml` also sets:

```yaml
dip.runtime.mode: NEW
```

So profile selection and runtime mode stay aligned.

Moved blocks:
- `dip.embedding.*`
- `ted.search.*` (new generic search tuning, now under `dip.search.*`)
- `ted.projection.*`
- `ted.generic-ingestion.*`
- new/transitional `ted.vectorization.*` keys:
  - `generic-pipeline-enabled`
  - `dual-write-legacy-ted-vectors`
  - `generic-scheduler-period-ms`
  - `primary-representation-builder-key`
  - `embedding-provider`

Shared / legacy defaults remain in `application.yml`.
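The alignment between the active profile and `dip.runtime.mode` can be illustrated with a small plain-Java check. This is only a sketch: the class `RuntimeModeAlignmentCheck`, its methods, and the `LEGACY` fallback are assumptions made for illustration and are not part of this commit.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not the project's RuntimeModeProperties.
class RuntimeModeAlignmentCheck {

    enum RuntimeMode { LEGACY, NEW }

    // Reads dip.runtime.mode from a flat property map.
    // Falling back to LEGACY when the key is missing is an assumption.
    static RuntimeMode resolveMode(Map<String, String> properties) {
        String raw = properties.getOrDefault("dip.runtime.mode", "LEGACY");
        return RuntimeMode.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }

    // True when the "new" profile is active exactly when the mode is NEW.
    static boolean isAligned(Set<String> activeProfiles, RuntimeMode mode) {
        boolean newProfileActive = activeProfiles.contains("new");
        return newProfileActive == (mode == RuntimeMode.NEW);
    }

    public static void main(String[] args) {
        RuntimeMode mode = resolveMode(Map.of("dip.runtime.mode", "NEW"));
        System.out.println("new profile + NEW mode aligned: " + isAligned(Set.of("new"), mode));
        System.out.println("legacy profile + NEW mode aligned: " + isAligned(Set.of("legacy"), mode));
    }
}
```

A check like this could run at startup to fail fast when someone sets the mode by hand without activating the matching profile.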
@ -0,0 +1,36 @@
# Runtime split Patch C

Patch C moves the **new generic search runtime** off `TedProcessorProperties.search`
and into a dedicated `dip.search.*` config tree.

## New config class
- `at.procon.dip.search.config.DipSearchProperties`

## New config root
```yaml
dip:
  search:
    ...
```

## Classes moved off `TedProcessorProperties`
- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchOrchestrator`
- `DefaultSearchResultFusionService`
- `SearchLexicalIndexStartupRunner`
- `ChunkedLongTextRepresentationBuilder`

## What this patch intentionally does not do
- it does not yet remove `TedProcessorProperties` from all NEW-mode classes
- it does not yet move `generic-ingestion` config off `ted.*`
- it does not yet finish the legacy/new config split for import/mail/TED package processing

Those should be handled in the next config-splitting patch.

## Practical result
After this patch, **new search/semantic/chunking tuning** should be configured only via:
- `dip.search.*`

while `ted.search.*` remains legacy-oriented.
@ -0,0 +1,40 @@
# Runtime split Patch D

This patch completes the next configuration split step for the NEW runtime.

## New property classes

- `at.procon.dip.ingestion.config.DipIngestionProperties`
  - prefix: `dip.ingestion`
- `at.procon.dip.domain.ted.config.TedProjectionProperties`
  - prefix: `dip.ted.projection`

## Classes moved off `TedProcessorProperties`

### NEW-mode ingestion
- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`

### NEW-mode projection
- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`

## Additional cleanup in `GenericDocumentImportService`

It now resolves the default document embedding model through the new embedding subsystem:

- `EmbeddingProperties`
- `EmbeddingModelRegistry`
- `EmbeddingModelCatalogService`

and no longer reads vectorization model/provider/dimensions from `TedProcessorProperties`.

## What still remains for later split steps

- legacy routes/services still using `TedProcessorProperties`
- legacy/new runtime bean gating for all remaining shared classes
- moving old TED-only config fully under `legacy.ted.*`
@ -0,0 +1,26 @@
# Runtime split Patch E

This patch continues the runtime/config split by targeting the remaining NEW-mode classes
that still injected `TedProcessorProperties`.

## New config classes
- `DipIngestionProperties` (`dip.ingestion.*`)
- `TedProjectionProperties` (`dip.ted.projection.*`)

## NEW-mode classes moved off `TedProcessorProperties`
- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`
- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`

## Additional behavior change
`GenericDocumentImportService` now hands embedding work off to the new embedding subsystem by:
- resolving the default document model from `EmbeddingModelRegistry`
- ensuring the model is registered via `EmbeddingModelCatalogService`
- enqueueing jobs through `RepresentationEmbeddingOrchestrator`

This removes the new import path's runtime dependence on legacy `TedProcessorProperties.vectorization`.
@ -0,0 +1,24 @@
# Runtime split Patch G

Patch G moves the remaining NEW-mode search/chunking classes off `TedProcessorProperties.search`
and onto `DipSearchProperties` (`dip.search.*`).

## New config class
- `at.procon.dip.search.config.DipSearchProperties`

## Classes switched to `DipSearchProperties`
- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchResultFusionService`
- `DefaultSearchOrchestrator`
- `SearchLexicalIndexStartupRunner`
- `ChunkedLongTextRepresentationBuilder`

## Additional cleanup
These classes are also marked `NEW`-only in this patch.

## Effect
After Patch G, the generic NEW-mode search/chunking path no longer depends on
`TedProcessorProperties.search`. That leaves `TedProcessorProperties` much closer to
legacy-only ownership.
@ -0,0 +1,17 @@
# Runtime split Patch H

Patch H is a final cleanup / verification step after the previous split patches.

## What it does
- makes `TedProcessorProperties` explicitly `LEGACY`-only
- removes the stale `TedProcessorProperties` import/comment from `DocumentIntelligencePlatformApplication`
- adds a regression test that fails if NEW runtime classes reintroduce a dependency on `TedProcessorProperties`
- adds a simple `application-legacy.yml` profile file

## Why this matters
After the NEW search/ingestion/projection classes are moved to:
- `DipSearchProperties`
- `DipIngestionProperties`
- `TedProjectionProperties`

`TedProcessorProperties` should be owned strictly by the legacy runtime graph.
@ -0,0 +1,21 @@
# Runtime split Patch I

Patch I extracts the remaining legacy vectorization cluster off `TedProcessorProperties`
and onto a dedicated legacy-only config class.

## New config class
- `at.procon.ted.config.LegacyVectorizationProperties`
  - prefix: `legacy.ted.vectorization.*`

## Classes switched off `TedProcessorProperties`
- `GenericVectorizationRoute`
- `DocumentEmbeddingProcessingService`
- `ConfiguredEmbeddingModelStartupRunner`
- `GenericVectorizationStartupRunner`

## Additional cleanup
These classes are also marked `LEGACY`-only via `@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)`.

## Effect
The `at.procon.dip.vectorization.*` package now clearly belongs to the old runtime graph and no longer pulls
its settings from the shared monolithic `TedProcessorProperties`.
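The source of `@ConditionalOnRuntimeMode` and `RuntimeModeCondition` is not part of this diff. To illustrate the gating idea, here is a plain-Java sketch with no Spring dependency. The annotation declared here is a stand-in with the same name, and `isEligible`, `LegacyOnlyBean`, `NewOnlyBean`, and `SharedBean` are hypothetical. The real annotation is presumably backed by a Spring `Condition`, which is an assumption.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

class RuntimeGatingSketch {

    enum RuntimeMode { LEGACY, NEW }

    // Stand-in annotation; kept at runtime so it can be read via reflection.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface ConditionalOnRuntimeMode {
        RuntimeMode value();
    }

    @ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
    static class LegacyOnlyBean { }

    @ConditionalOnRuntimeMode(RuntimeMode.NEW)
    static class NewOnlyBean { }

    // No annotation: treated as shared between both runtimes in this sketch.
    static class SharedBean { }

    // A type is eligible when it is unannotated or its declared mode equals the active mode.
    static boolean isEligible(Class<?> type, RuntimeMode active) {
        ConditionalOnRuntimeMode gate = type.getAnnotation(ConditionalOnRuntimeMode.class);
        return gate == null || gate.value() == active;
    }

    public static void main(String[] args) {
        for (Class<?> type : new Class<?>[] {LegacyOnlyBean.class, NewOnlyBean.class, SharedBean.class}) {
            System.out.println(type.getSimpleName()
                    + " eligible in NEW mode: " + isEligible(type, RuntimeMode.NEW));
        }
    }
}
```

Treating unannotated classes as shared mirrors the "shared classes" that the patch notes say still need bean gating. A stricter variant could reject unannotated classes instead.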
@ -0,0 +1,45 @@
# Runtime split Patch J

Patch J is a broader cleanup patch for the **actual current codebase**.

It adds the missing runtime/config split scaffolding and rewires the remaining NEW-mode classes
that still injected `TedProcessorProperties`.

## Added
- `dip.runtime` infrastructure
  - `RuntimeMode`
  - `RuntimeModeProperties`
  - `@ConditionalOnRuntimeMode`
  - `RuntimeModeCondition`
- `DipSearchProperties`
- `DipIngestionProperties`
- `TedProjectionProperties`

## Rewired off `TedProcessorProperties`
### NEW search/chunking
- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchOrchestrator`
- `SearchLexicalIndexStartupRunner`
- `DefaultSearchResultFusionService`
- `ChunkedLongTextRepresentationBuilder`

### NEW ingestion/projection
- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`
- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`

## Additional behavior
- `GenericDocumentImportService` now hands embedding work off to the new embedding subsystem
  via `RepresentationEmbeddingOrchestrator` and resolves the default model through
  `EmbeddingModelRegistry` / `EmbeddingModelCatalogService`.

## Notes
This patch intentionally targets the real current leftovers visible in the actual codebase.
It assumes the new embedding subsystem already exists.
@ -0,0 +1,16 @@
package at.procon.dip.domain.ted.config;

import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.ted.projection")
@Data
public class TedProjectionProperties {
    private boolean enabled = true;
    private boolean startupBackfillEnabled = false;
    @Positive
    private int startupBackfillLimit = 250;
}
@ -0,0 +1,59 @@
package at.procon.dip.ingestion.config;

import at.procon.dip.domain.access.DocumentVisibility;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.ingestion")
@Data
public class DipIngestionProperties {

    private boolean enabled = false;
    private boolean fileSystemEnabled = false;
    private boolean restUploadEnabled = true;
    private String inputDirectory = "/ted.europe/generic-input";
    private String filePattern = ".*\\.(pdf|txt|html|htm|xml|md|markdown|csv|json|yaml|yml)$";
    private String processedDirectory = ".dip-processed";
    private String errorDirectory = ".dip-error";

    @Positive
    private long pollInterval = 15000;

    @Positive
    private int maxMessagesPerPoll = 10;

    private String defaultOwnerTenantKey;
    private DocumentVisibility defaultVisibility = DocumentVisibility.PUBLIC;
    private String defaultLanguageCode;

    private boolean storeOriginalBinaryInDb = true;

    @Positive
    private int maxBinaryBytesInDb = 5242880;

    private boolean deduplicateByContentHash = true;
    private boolean storeOriginalContentForWrapperDocuments = true;
    private boolean vectorizePrimaryRepresentationOnly = true;

    @NotBlank
    private String importBatchId = "phase4-generic";

    private boolean tedPackageAdapterEnabled = true;
    private boolean mailAdapterEnabled = false;

    private String mailDefaultOwnerTenantKey;
    private DocumentVisibility mailDefaultVisibility = DocumentVisibility.TENANT;
    private boolean expandMailZipAttachments = true;

    @NotBlank
    private String tedPackageImportBatchId = "phase41-ted-package";

    private boolean gatewayOnlyForTedPackages = false;

    @NotBlank
    private String mailImportBatchId = "phase41-mail";
}
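The `filePattern` default above is a Java string literal, so `\\.` in the source denotes the regex `\.` (a literal dot). An unquoted or single-quoted YAML scalar does not process backslash escapes, so copying the doubled backslash into a `.yml` override would yield a regex that requires a literal backslash in the file name. The sketch below contrasts the two. `FilePatternSketch` and `accepts` are hypothetical names, and full-string matching is an assumption about how the route applies the pattern.

```java
import java.util.regex.Pattern;

class FilePatternSketch {

    // The regex as the Java default yields it: ".*\.(pdf|...)$"
    static final String JAVA_DEFAULT =
            ".*\\.(pdf|txt|html|htm|xml|md|markdown|csv|json|yaml|yml)$";

    // The regex a plain YAML scalar `.*\\.(pdf|...)$` would yield:
    // two real backslash characters, i.e. "literal backslash, then any character".
    static final String YAML_DOUBLED_BACKSLASH =
            ".*\\\\.(pdf|txt|html|htm|xml|md|markdown|csv|json|yaml|yml)$";

    // Assumes the whole file name must match the pattern.
    static boolean accepts(String regex, String fileName) {
        return Pattern.matches(regex, fileName);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"report.pdf", "notes.md", "archive.zip", "REPORT.PDF"}) {
            System.out.println(name
                    + " | java default: " + accepts(JAVA_DEFAULT, name)
                    + " | doubled backslash: " + accepts(YAML_DOUBLED_BACKSLASH, name));
        }
    }
}
```

The pattern is also case-sensitive, so an upper-case extension such as `.PDF` is rejected unless the regex is extended, for example with the `(?i)` flag.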
@ -1,49 +1,81 @@
|
|||||||
package at.procon.dip.search.config;
|
package at.procon.dip.search.config;
|
||||||
|
|
||||||
|
import jakarta.validation.constraints.Min;
|
||||||
|
import jakarta.validation.constraints.Positive;
|
||||||
import lombok.Data;
|
import lombok.Data;
|
||||||
import org.springframework.boot.context.properties.ConfigurationProperties;
|
import org.springframework.boot.context.properties.ConfigurationProperties;
|
||||||
import org.springframework.context.annotation.Configuration;
|
import org.springframework.context.annotation.Configuration;
|
||||||
|
import org.springframework.validation.annotation.Validated;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* New-runtime generic search configuration.
|
||||||
|
*
|
||||||
|
* <p>This property tree is intentionally separated from the legacy
|
||||||
|
* {@code ted.search.*} settings. NEW-mode search/semantic/lexical code should
|
||||||
|
* depend on {@code dip.search.*} only.</p>
|
||||||
|
*/
|
||||||
@Configuration
|
@Configuration
|
||||||
@ConfigurationProperties(prefix = "dip.search")
|
@ConfigurationProperties(prefix = "dip.search")
|
||||||
@Data
|
@Data
|
||||||
|
@Validated
|
||||||
public class DipSearchProperties {
|
public class DipSearchProperties {
|
||||||
|
|
||||||
private Lexical lexical = new Lexical();
|
/** Default page size for search results. */
|
||||||
private Semantic semantic = new Semantic();
|
@Positive
|
||||||
private Fusion fusion = new Fusion();
|
private int defaultPageSize = 20;
|
||||||
private Chunking chunking = new Chunking();
|
|
||||||
|
|
||||||
@Data
|
/** Maximum allowed page size. */
|
||||||
public static class Lexical {
|
@Positive
|
||||||
private double trigramSimilarityThreshold = 0.12;
|
private int maxPageSize = 100;
|
||||||
|
|
||||||
|
/** Semantic similarity threshold (normalized score). */
|
||||||
|
private double similarityThreshold = 0.7d;
|
||||||
|
|
||||||
|
/** Minimum trigram similarity for fuzzy lexical matches. */
|
||||||
|
private double trigramSimilarityThreshold = 0.12d;
|
||||||
|
|
||||||
|
/** Candidate limits per search engine before fusion/collapse. */
|
||||||
|
@Positive
|
||||||
private int fulltextCandidateLimit = 120;
|
private int fulltextCandidateLimit = 120;
|
||||||
|
|
||||||
|
@Positive
|
||||||
private int trigramCandidateLimit = 120;
|
private int trigramCandidateLimit = 120;
|
||||||
}
|
|
||||||
|
|
||||||
@Data
|
@Positive
|
||||||
public static class Semantic {
|
|
||||||
private double similarityThreshold = 0.7;
|
|
||||||
private int semanticCandidateLimit = 120;
|
private int semanticCandidateLimit = 120;
|
||||||
private String defaultModelKey;
|
|
||||||
}
|
|
||||||
|
|
||||||
@Data
|
/** Hybrid fusion weights. */
|
||||||
public static class Fusion {
|
private double fulltextWeight = 0.35d;
|
||||||
private double fulltextWeight = 0.35;
|
private double trigramWeight = 0.20d;
|
||||||
private double trigramWeight = 0.20;
|
private double semanticWeight = 0.45d;
|
||||||
private double semanticWeight = 0.45;
|
|
||||||
private double recencyBoostWeight = 0.05;
|
|
||||||
private int recencyHalfLifeDays = 30;
|
|
||||||
private int debugTopHitsPerEngine = 10;
|
|
||||||
}
|
|
||||||
|
|
||||||
@Data
|
/** Enable chunk representations for long documents. */
|
||||||
public static class Chunking {
|
private boolean chunkingEnabled = true;
|
||||||
private boolean enabled = true;
|
|
||||||
private int targetChars = 1800;
|
/** Target chunk size in characters for CHUNK representations. */
|
||||||
private int overlapChars = 200;
|
@Positive
|
||||||
|
private int chunkTargetChars = 1800;
|
||||||
|
|
||||||
|
/** Overlap between consecutive chunks in characters. */
|
||||||
|
@Min(0)
|
||||||
|
private int chunkOverlapChars = 200;
|
||||||
|
|
||||||
|
/** Maximum CHUNK representations generated per document. */
|
||||||
|
@Positive
|
||||||
private int maxChunksPerDocument = 12;
|
private int maxChunksPerDocument = 12;
|
||||||
|
|
||||||
|
/** Additional score weight for recency. */
|
||||||
|
private double recencyBoostWeight = 0.05d;
|
||||||
|
|
||||||
|
/** Half-life in days used for recency decay. */
|
||||||
|
@Positive
|
||||||
|
private int recencyHalfLifeDays = 30;
|
||||||
|
|
||||||
|
/** Startup backfill limit for missing DOC lexical vectors. */
|
||||||
|
@Positive
|
||||||
private int startupLexicalBackfillLimit = 500;
|
private int startupLexicalBackfillLimit = 500;
|
||||||
}
|
|
||||||
|
/** Number of hits per engine returned by the debug endpoint. */
|
||||||
|
@Positive
|
||||||
|
private int debugTopHitsPerEngine = 10;
|
||||||
}
|
}
|
||||||
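`DipSearchProperties` exposes three fusion weights, a recency boost weight, and a recency half-life, but this diff does not show how `DefaultSearchResultFusionService` combines them. The following is a minimal sketch under two assumptions: per-engine scores are normalized to the range 0..1, and the final score is a weighted sum plus an exponentially decaying recency term. `FusionSketch`, `recencyFactor`, and `fuse` are hypothetical names. The constants are the defaults from the class above.

```java
class FusionSketch {

    // Defaults copied from DipSearchProperties.
    static final double FULLTEXT_WEIGHT = 0.35;
    static final double TRIGRAM_WEIGHT = 0.20;
    static final double SEMANTIC_WEIGHT = 0.45;
    static final double RECENCY_BOOST_WEIGHT = 0.05;
    static final int RECENCY_HALF_LIFE_DAYS = 30;

    // Assumed decay: the factor halves every halfLifeDays (1.0 for a brand-new document).
    static double recencyFactor(double ageDays, int halfLifeDays) {
        return Math.pow(0.5, ageDays / halfLifeDays);
    }

    // Assumed fusion: weighted sum of normalized engine scores plus a recency boost.
    static double fuse(double fulltext, double trigram, double semantic, double ageDays) {
        return FULLTEXT_WEIGHT * fulltext
                + TRIGRAM_WEIGHT * trigram
                + SEMANTIC_WEIGHT * semantic
                + RECENCY_BOOST_WEIGHT * recencyFactor(ageDays, RECENCY_HALF_LIFE_DAYS);
    }

    public static void main(String[] args) {
        // Same engine scores, different document ages.
        System.out.println("fresh document:  " + fuse(0.8, 0.4, 0.9, 0));
        System.out.println("30 days old:     " + fuse(0.8, 0.4, 0.9, 30));
        System.out.println("one year old:    " + fuse(0.8, 0.4, 0.9, 365));
    }
}
```

Under these assumptions the three engine weights sum to 1.0, so the recency term can add at most 0.05 on top of a perfect match.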
@ -0,0 +1,115 @@
package at.procon.ted.config;

import jakarta.validation.constraints.Min;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;
import org.springframework.validation.annotation.Validated;

/**
 * Legacy vectorization configuration used only by the old runtime path.
 * <p>
 * This extracts the former ted.vectorization.* subtree away from TedProcessorProperties
 * so that legacy vectorization beans no longer depend on the shared monolithic config.
 */
@Configuration
@ConfigurationProperties(prefix = "legacy.ted.vectorization")
@Data
@Validated
public class LegacyVectorizationProperties {

    /**
     * Enable/disable legacy async vectorization.
     */
    private boolean enabled = true;

    /**
     * Use external HTTP API instead of Python subprocess.
     */
    private boolean useHttpApi = false;

    /**
     * Embedding service HTTP API URL.
     */
    private String apiUrl = "http://localhost:8001";

    /**
     * Sentence transformer model name.
     */
    private String modelName = "intfloat/multilingual-e5-large";

    /**
     * Vector dimensions (must match model output).
     */
    @Positive
    private int dimensions = 1024;

    /**
     * Batch size for vectorization processing.
     */
    @Min(1)
    private int batchSize = 16;

    /**
     * Thread pool size for async vectorization.
     */
    @Min(1)
    private int threadPoolSize = 4;

    /**
     * Maximum text length for vectorization (characters).
     */
    @Positive
    private int maxTextLength = 8192;

    /**
     * HTTP connection timeout in milliseconds.
     */
    @Positive
    private int connectTimeout = 10000;

    /**
     * HTTP socket/read timeout in milliseconds.
     */
    @Positive
    private int socketTimeout = 60000;

    /**
     * Maximum retries on connection failure.
     */
    @Min(0)
    private int maxRetries = 5;

    /**
     * Enable the former Phase 2 generic pipeline in the legacy runtime.
     * In the split runtime design this should normally stay false in NEW mode
     * because legacy beans are not instantiated there.
     */
    private boolean genericPipelineEnabled = true;

    /**
     * Keep writing completed TED embeddings back to the legacy ted.procurement_document
     * vector columns so the existing semantic search stays operational during migration.
     */
    private boolean dualWriteLegacyTedVectors = true;

    /**
     * Scheduler interval for generic embedding polling (milliseconds).
     */
    @Positive
    private long genericSchedulerPeriodMs = 6000;

    /**
     * Builder key for the primary TED semantic representation created during transitional dual-write.
     */
    @NotBlank
    private String primaryRepresentationBuilderKey = "ted-phase2-primary-representation";

    /**
     * Provider key used when registering the configured embedding model in DOC.doc_embedding_model.
     */
    @NotBlank
    private String embeddingProvider = "http-embedding-service";
}
@ -1,3 +1,30 @@
spring:
  config:
    activate:
      on-profile: legacy

dip:
  runtime:
    mode: LEGACY

# Legacy runtime uses the existing ted.* property tree.
# Move old route/download/mail/vectorization/search settings here over time.
legacy:
  ted:
    vectorization:
      enabled: true
      use-http-api: false
      api-url: http://localhost:8001
      model-name: intfloat/multilingual-e5-large
      dimensions: 1024
      batch-size: 16
      thread-pool-size: 4
      max-text-length: 8192
      connect-timeout: 10000
      socket-timeout: 60000
      max-retries: 5
      generic-pipeline-enabled: true
      dual-write-legacy-ted-vectors: true
      generic-scheduler-period-ms: 6000
      primary-representation-builder-key: ted-phase2-primary-representation
      embedding-provider: http-embedding-service
@ -1,9 +1,143 @@
|
|||||||
|
# New runtime overrides
|
||||||
|
# Activate with: --spring.profiles.active=new
|
||||||
|
|
||||||
|
# Optional explicit marker; file is profile-specific already
|
||||||
|
spring:
|
||||||
|
config:
|
||||||
|
activate:
|
||||||
|
on-profile: new
|
||||||
|
|
||||||
dip:
|
dip:
|
||||||
runtime:
|
runtime:
|
||||||
mode: NEW
|
mode: NEW
|
||||||
|
|
||||||
|
search:
|
||||||
|
# Default page size for search results
|
||||||
|
default-page-size: 20
|
||||||
|
# Maximum page size
|
||||||
|
max-page-size: 100
|
||||||
|
# Similarity threshold for vector search (0.0 - 1.0)
|
||||||
|
similarity-threshold: 0.7
|
||||||
|
# Minimum trigram similarity for fuzzy lexical matches
|
||||||
|
trigram-similarity-threshold: 0.12
|
||||||
|
# Candidate limits per engine before fusion/collapse
|
||||||
|
fulltext-candidate-limit: 120
|
||||||
|
trigram-candidate-limit: 120
|
||||||
|
semantic-candidate-limit: 120
|
||||||
|
# Hybrid fusion weights
|
||||||
|
fulltext-weight: 0.35
|
||||||
|
trigram-weight: 0.20
|
||||||
|
semantic-weight: 0.45
|
||||||
|
# Additional score weight for recency
|
||||||
|
recency-boost-weight: 0.05
|
||||||
|
# Recency half-life in days
|
||||||
|
recency-half-life-days: 30
|
||||||
|
# Enable chunk representations for long documents
|
||||||
|
chunking-enabled: true
|
||||||
|
# Target chunk size in characters
|
||||||
|
chunk-target-chars: 1800
|
||||||
|
# Overlap between consecutive chunks
|
||||||
|
chunk-overlap-chars: 200
|
||||||
|
# Maximum number of chunks generated per document
|
||||||
|
max-chunks-per-document: 12
|
||||||
|
# Startup backfill limit for missing lexical vectors
|
||||||
|
startup-lexical-backfill-limit: 500
|
||||||
|
# Number of top hits per engine returned by /search/debug
|
||||||
|
debug-top-hits-per-engine: 10
|
||||||
|
|
||||||
embedding:
|
embedding:
|
||||||
enabled: true
|
enabled: true
|
||||||
|
default-document-model: e5-default
|
||||||
|
default-query-model: e5-default
|
||||||
|
providers:
|
||||||
|
mock-default:
|
||||||
|
type: mock
|
||||||
|
dimensions: 16
|
||||||
|
external-e5:
|
||||||
|
type: http-json
|
||||||
|
base-url: http://172.20.240.18:8001
|
||||||
|
connect-timeout: 5s
|
||||||
|
read-timeout: 60s
|
||||||
|
models:
|
||||||
|
mock-search:
|
||||||
|
provider-config-key: mock-default
|
||||||
|
provider-model-key: mock-search
|
||||||
|
dimensions: 16
|
||||||
|
distance-metric: COSINE
|
||||||
|
supports-query-embedding-mode: true
|
||||||
|
active: true
|
||||||
|
e5-default:
|
||||||
|
provider-config-key: external-e5
|
||||||
|
provider-model-key: intfloat/multilingual-e5-large
|
||||||
|
dimensions: 1024
|
||||||
|
distance-metric: COSINE
|
||||||
|
supports-query-embedding-mode: true
|
||||||
|
active: true
|
||||||
jobs:
|
jobs:
|
||||||
enabled: true
|
enabled: true
|
||||||
scheduler-delay-ms: 5000
|
|
||||||
|
# Phase 4 generic ingestion configuration
|
||||||
|
ingestion:
|
||||||
|
# Master switch for arbitrary document ingestion into the DOC model
|
||||||
|
enabled: true
|
||||||
|
# Enable file-system polling for non-TED documents
|
||||||
|
file-system-enabled: false
|
||||||
|
# Allow REST/API upload endpoints for arbitrary documents
|
||||||
|
rest-upload-enabled: true
|
||||||
|
# Input directory for the generic Camel file route
|
||||||
|
input-directory: /ted.europe/generic-input
|
||||||
|
# Regex for files accepted by the generic file route
|
||||||
|
file-pattern: '.*\.(pdf|txt|html|htm|xml|md|markdown|csv|json|yaml|yml)$'
|
||||||
|
# Move successfully processed files here
|
||||||
|
processed-directory: .dip-processed
|
||||||
|
# Move failed files here
|
||||||
|
error-directory: .dip-error
|
||||||
|
# Polling interval for the generic route
|
||||||
|
poll-interval: 15000
|
||||||
|
# Maximum files per poll
|
||||||
|
max-messages-per-poll: 200
|
||||||
|
# Optional default owner tenant; leave empty for PUBLIC docs like TED or public knowledge docs
|
||||||
|
default-owner-tenant-key:
|
||||||
|
# Default visibility when no explicit access context is provided
|
||||||
|
default-visibility: PUBLIC
|
||||||
|
# Optional default language for filesystem imports
|
||||||
|
default-language-code:
|
||||||
|
# Store small binary originals in DOC.doc_content.binary_content
|
||||||
|
store-original-binary-in-db: true
|
||||||
|
# Maximum binary payload size persisted inline in DB
|
||||||
|
max-binary-bytes-in-db: 5242880
|
||||||
|
# Deduplicate by content hash and attach additional sources to the same canonical document
|
||||||
|
deduplicate-by-content-hash: true
|
||||||
|
# Persist ORIGINAL content rows for wrapper/container documents such as TED packages or ZIP wrappers
|
||||||
|
store-original-content-for-wrapper-documents: true
|
||||||
|
# Queue only the primary text representation for vectorization
|
||||||
|
vectorize-primary-representation-only: true
|
||||||
|
# Import batch marker written to DOC.doc_source.import_batch_id
|
||||||
|
import-batch-id: phase4-generic
|
||||||
|
# Enable Phase 4.1 TED package adapter on top of the generic DOC ingestion SPI
|
||||||
|
ted-package-adapter-enabled: true
|
||||||
|
# Enable Phase 4.1 mail/document adapter on top of the generic DOC ingestion SPI
|
||||||
|
mail-adapter-enabled: true
|
||||||
|
# Optional dedicated mail owner tenant, falls back to default-owner-tenant-key
|
||||||
|
mail-default-owner-tenant-key:
|
||||||
|
# Visibility for imported mail messages and attachments
|
||||||
|
mail-default-visibility: TENANT
|
||||||
|
# Expand ZIP attachments recursively through the mail adapter
|
||||||
|
expand-mail-zip-attachments: true
|
||||||
|
# Import batch marker for TED package roots and children
|
||||||
|
ted-package-import-batch-id: phase41-ted-package
|
||||||
|
# When true, TED package documents are stored only through the generic ingestion gateway
|
||||||
|
# and the legacy XML batch processing path is skipped
|
||||||
|
gateway-only-for-ted-packages: true
|
||||||
|
# Import batch marker for mail roots and attachments
|
||||||
|
mail-import-batch-id: phase41-mail
|
||||||
|
|
||||||
|
ted: # Phase 3 TED projection configuration
|
||||||
|
projection:
|
||||||
|
# Enable/disable dual-write into the TED projection model on top of DOC.doc_document
|
||||||
|
enabled: true
|
||||||
|
# Optional startup backfill for legacy TED documents without a projection row yet
|
||||||
|
startup-backfill-enabled: false
|
||||||
|
# Maximum number of legacy TED documents to backfill during startup
|
||||||
|
startup-backfill-limit: 250
|
||||||
|
|
||||||
|
|||||||
@ -0,0 +1,69 @@
package at.procon.dip.architecture;

import at.procon.dip.domain.ted.config.TedProjectionProperties;
import at.procon.dip.ingestion.config.DipIngestionProperties;
import at.procon.dip.search.config.DipSearchProperties;
import at.procon.ted.config.TedProcessorProperties;
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.util.List;
import org.junit.jupiter.api.Test;

import static org.assertj.core.api.Assertions.assertThat;

/**
 * Regression guard for the runtime/config split.
 * NEW runtime classes must not depend on TedProcessorProperties anymore.
 */
class NewRuntimeMustNotDependOnTedProcessorPropertiesTest {

    @Test
    void new_runtime_classes_should_not_depend_on_ted_processor_properties() {
        List<Class<?>> newRuntimeClasses = List.of(
                at.procon.dip.ingestion.service.GenericDocumentImportService.class,
                at.procon.dip.ingestion.camel.GenericFileSystemIngestionRoute.class,
                at.procon.dip.ingestion.controller.GenericDocumentImportController.class,
                at.procon.dip.ingestion.adapter.MailDocumentIngestionAdapter.class,
                at.procon.dip.ingestion.adapter.TedPackageDocumentIngestionAdapter.class,
                at.procon.dip.ingestion.service.TedPackageChildImportProcessor.class,
                at.procon.dip.domain.ted.service.TedNoticeProjectionService.class,
                at.procon.dip.domain.ted.startup.TedProjectionStartupRunner.class,
                at.procon.dip.search.engine.fulltext.PostgresFullTextSearchEngine.class,
                at.procon.dip.search.engine.trigram.PostgresTrigramSearchEngine.class,
                at.procon.dip.search.engine.semantic.PgVectorSemanticSearchEngine.class,
                at.procon.dip.search.rank.DefaultSearchResultFusionService.class,
                at.procon.dip.search.service.DefaultSearchOrchestrator.class,
                at.procon.dip.search.service.SearchLexicalIndexStartupRunner.class,
                at.procon.dip.normalization.impl.ChunkedLongTextRepresentationBuilder.class
        );

        for (Class<?> type : newRuntimeClasses) {
            assertThat(hasDependency(type, TedProcessorProperties.class))
                    .as(type.getName() + " must not depend on TedProcessorProperties")
                    .isFalse();
        }
    }

    @Test
    void new_runtime_config_classes_exist_as_replacements() {
        assertThat(DipSearchProperties.class).isNotNull();
        assertThat(DipIngestionProperties.class).isNotNull();
        assertThat(TedProjectionProperties.class).isNotNull();
    }

    private boolean hasDependency(Class<?> owner, Class<?> dependency) {
        for (Field field : owner.getDeclaredFields()) {
            if (field.getType().equals(dependency)) {
                return true;
            }
        }
        for (Constructor<?> constructor : owner.getDeclaredConstructors()) {
            for (Class<?> param : constructor.getParameterTypes()) {
                if (param.equals(dependency)) {
                    return true;
                }
            }
        }
        return false;
    }
}
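The `hasDependency` check in the regression test inspects declared fields and constructor parameters only. The self-contained sketch below runs the same scan against stand-in classes to show what it catches and what it misses. `LegacyProps`, `NewProps`, `CleanService`, `LeakyService`, and `MethodOnlyService` are hypothetical and exist only for this illustration.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;

class DependencyGuardSketch {

    static class LegacyProps { }
    static class NewProps { }

    // Depends only on the new config type.
    static class CleanService {
        private final NewProps props;
        CleanService(NewProps props) { this.props = props; }
    }

    // Reintroduces the legacy type via a field and a constructor parameter.
    static class LeakyService {
        private final LegacyProps props;
        LeakyService(LegacyProps props) { this.props = props; }
    }

    // Uses the legacy type only in a method signature, which the scan does not inspect.
    static class MethodOnlyService {
        void configure(LegacyProps props) { }
    }

    // Same logic as the test helper: declared fields, then constructor parameter types.
    static boolean hasDependency(Class<?> owner, Class<?> dependency) {
        for (Field field : owner.getDeclaredFields()) {
            if (field.getType().equals(dependency)) {
                return true;
            }
        }
        for (Constructor<?> constructor : owner.getDeclaredConstructors()) {
            for (Class<?> param : constructor.getParameterTypes()) {
                if (param.equals(dependency)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("CleanService:      " + hasDependency(CleanService.class, LegacyProps.class));
        System.out.println("LeakyService:      " + hasDependency(LeakyService.class, LegacyProps.class));
        System.out.println("MethodOnlyService: " + hasDependency(MethodOnlyService.class, LegacyProps.class));
    }
}
```

Because method parameters, inherited fields, and usages inside method bodies are not scanned, the guard covers the usual injection paths (constructor and field injection) but is not a full dependency analysis.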