runtime-split patch b-i
parent
44995cebf7
commit
1aa599b587
@ -0,0 +1,31 @@
# Config split: moved new-runtime properties to application-new.yml

This patch keeps shared and legacy defaults in `application.yml` and moves new-runtime properties into `application-new.yml`.

Activate the new runtime with:

```
--spring.profiles.active=new
```

`application-new.yml` also sets:

```yaml
dip.runtime.mode: NEW
```

So profile selection and runtime mode stay aligned.

Moved blocks:
- `dip.embedding.*`
- `ted.search.*` (new generic search tuning, now under `dip.search.*`)
- `ted.projection.*`
- `ted.generic-ingestion.*`
- new/transitional `ted.vectorization.*` keys:
  - `generic-pipeline-enabled`
  - `dual-write-legacy-ted-vectors`
  - `generic-scheduler-period-ms`
  - `primary-representation-builder-key`
  - `embedding-provider`

Shared / legacy defaults remain in `application.yml`.
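
The moved keys are written in kebab-case (for example `generic-pipeline-enabled`), while the Java property classes in this change set expose camelCase fields (for example `genericPipelineEnabled`). Spring Boot's relaxed binding takes care of that mapping. The following is only a minimal sketch of the naming correspondence; `KebabCaseSketch` and `toFieldName` are hypothetical names and not part of Spring or of this codebase.

```java
final class KebabCaseSketch {

    /** Converts a kebab-case property key segment into a camelCase field name. */
    static String toFieldName(String key) {
        StringBuilder out = new StringBuilder();
        boolean upperNext = false;
        for (char c : key.toCharArray()) {
            if (c == '-') {
                // A dash is dropped and marks the next character for upper-casing.
                upperNext = true;
            } else if (upperNext) {
                out.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For instance, `KebabCaseSketch.toFieldName("dual-write-legacy-ted-vectors")` yields `dualWriteLegacyTedVectors`, which is the field name used in the property class.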
@ -0,0 +1,36 @@
# Runtime split Patch C

Patch C moves the **new generic search runtime** off `TedProcessorProperties.search`
and into a dedicated `dip.search.*` config tree.

## New config class
- `at.procon.dip.search.config.DipSearchProperties`

## New config root
```yaml
dip:
  search:
    ...
```

## Classes moved off `TedProcessorProperties`
- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchOrchestrator`
- `DefaultSearchResultFusionService`
- `SearchLexicalIndexStartupRunner`
- `ChunkedLongTextRepresentationBuilder`

## What this patch intentionally does not do
- it does not yet remove `TedProcessorProperties` from all NEW-mode classes
- it does not yet move `generic-ingestion` config off `ted.*`
- it does not yet finish the legacy/new config split for import/mail/TED package processing

Those should be handled in the next config-splitting patch.

## Practical result
After this patch, **new search/semantic/chunking tuning** should be configured only via:
- `dip.search.*`

while `ted.search.*` remains legacy-oriented.
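
The chunking keys under `dip.search.*` (`chunk-target-chars`, `chunk-overlap-chars`, `max-chunks-per-document`) suggest a sliding character window. The sketch below shows how those three values could interact. It assumes fixed-size character windows; the real `ChunkedLongTextRepresentationBuilder` is not part of these notes and may split differently (for example on sentence boundaries). `ChunkingSketch` and `chunk` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkingSketch {

    /**
     * Splits text into windows of at most targetChars characters, where each
     * window starts (targetChars - overlapChars) characters after the previous one.
     */
    static List<String> chunk(String text, int targetChars, int overlapChars, int maxChunks) {
        if (targetChars <= 0 || overlapChars < 0 || overlapChars >= targetChars) {
            throw new IllegalArgumentException("need 0 <= overlapChars < targetChars");
        }
        List<String> chunks = new ArrayList<>();
        int step = targetChars - overlapChars;
        for (int start = 0; start < text.length() && chunks.size() < maxChunks; start += step) {
            int end = Math.min(text.length(), start + targetChars);
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window already reaches the end of the text
            }
        }
        return chunks;
    }
}
```

With the defaults from this change set, a call would look like `ChunkingSketch.chunk(text, 1800, 200, 12)`; text beyond the twelfth window is simply not chunked in this sketch.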
@ -0,0 +1,40 @@
# Runtime split Patch D

This patch completes the next configuration split step for the NEW runtime.

## New property classes

- `at.procon.dip.ingestion.config.DipIngestionProperties`
  - prefix: `dip.ingestion`
- `at.procon.dip.domain.ted.config.TedProjectionProperties`
  - prefix: `dip.ted.projection`

## Classes moved off `TedProcessorProperties`

### NEW-mode ingestion
- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`

### NEW-mode projection
- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`

## Additional cleanup in `GenericDocumentImportService`

It now resolves the default document embedding model through the new embedding subsystem:

- `EmbeddingProperties`
- `EmbeddingModelRegistry`
- `EmbeddingModelCatalogService`

and no longer reads vectorization model/provider/dimensions from `TedProcessorProperties`.

## What still remains for later split steps

- legacy routes/services still using `TedProcessorProperties`
- legacy/new runtime bean gating for all remaining shared classes
- moving old TED-only config fully under `legacy.ted.*`
@ -0,0 +1,26 @@
# Runtime split Patch E

This patch continues the runtime/config split by targeting the remaining NEW-mode classes
that still injected `TedProcessorProperties`.

## New config classes
- `DipIngestionProperties` (`dip.ingestion.*`)
- `TedProjectionProperties` (`dip.ted.projection.*`)

## NEW-mode classes moved off `TedProcessorProperties`
- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`
- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`

## Additional behavior change
`GenericDocumentImportService` now hands embedding work off to the new embedding subsystem by:
- resolving the default document model from `EmbeddingModelRegistry`
- ensuring the model is registered via `EmbeddingModelCatalogService`
- enqueueing jobs through `RepresentationEmbeddingOrchestrator`

This removes the new import path's runtime dependence on legacy `TedProcessorProperties.vectorization`.
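
The three hand-off steps can be pictured with a minimal sketch. The interfaces below are hypothetical stand-ins for `EmbeddingModelRegistry`, `EmbeddingModelCatalogService` and `RepresentationEmbeddingOrchestrator`; their real method names and signatures are not shown in this change set, so everything here is an assumption made for illustration.

```java
final class EmbeddingHandoffSketch {

    /** Hypothetical stand-in for the model registry. */
    interface ModelRegistry {
        String defaultDocumentModelKey();
    }

    /** Hypothetical stand-in for the model catalog service. */
    interface ModelCatalog {
        void ensureRegistered(String modelKey);
    }

    /** Hypothetical stand-in for the embedding orchestrator. */
    interface EmbeddingOrchestrator {
        void enqueue(String representationId, String modelKey);
    }

    /** Resolve the default model, make sure it is registered, then enqueue the job. */
    static String handOff(String representationId,
                          ModelRegistry registry,
                          ModelCatalog catalog,
                          EmbeddingOrchestrator orchestrator) {
        String modelKey = registry.defaultDocumentModelKey();
        catalog.ensureRegistered(modelKey);
        orchestrator.enqueue(representationId, modelKey);
        return modelKey;
    }
}
```

The point of the ordering is that the import path never needs to know model name, provider or dimensions itself; it only passes a model key along.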
@ -0,0 +1,24 @@
# Runtime split Patch G

Patch G moves the remaining NEW-mode search/chunking classes off `TedProcessorProperties.search`
and onto `DipSearchProperties` (`dip.search.*`).

## New config class
- `at.procon.dip.search.config.DipSearchProperties`

## Classes switched to `DipSearchProperties`
- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchResultFusionService`
- `DefaultSearchOrchestrator`
- `SearchLexicalIndexStartupRunner`
- `ChunkedLongTextRepresentationBuilder`

## Additional cleanup
These classes are also marked `NEW`-only in this patch.

## Effect
After Patch G, the generic NEW-mode search/chunking path no longer depends on
`TedProcessorProperties.search`. That leaves `TedProcessorProperties` much closer to
legacy-only ownership.
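
To make the fusion-related `dip.search.*` settings more concrete, here is a minimal sketch of how the three engine weights and the recency settings could combine into one score. It assumes a plain weighted sum plus an exponentially decaying recency boost; the actual algorithm in `DefaultSearchResultFusionService` is not shown in these notes and may differ. `FusionSketch` and `fuse` are hypothetical names; the constants mirror the defaults in this change set.

```java
final class FusionSketch {

    // Defaults as configured under dip.search.* in this change set.
    static final double FULLTEXT_WEIGHT = 0.35;
    static final double TRIGRAM_WEIGHT = 0.20;
    static final double SEMANTIC_WEIGHT = 0.45;
    static final double RECENCY_BOOST_WEIGHT = 0.05;
    static final double RECENCY_HALF_LIFE_DAYS = 30.0;

    /**
     * Combines normalized per-engine scores (0..1) with a recency boost that
     * halves every RECENCY_HALF_LIFE_DAYS days. Assumed formula, for illustration only.
     */
    static double fuse(double fulltext, double trigram, double semantic, double ageDays) {
        double recency = Math.pow(0.5, ageDays / RECENCY_HALF_LIFE_DAYS);
        return FULLTEXT_WEIGHT * fulltext
                + TRIGRAM_WEIGHT * trigram
                + SEMANTIC_WEIGHT * semantic
                + RECENCY_BOOST_WEIGHT * recency;
    }
}
```

Under this assumption the three engine weights sum to 1.0, so the recency boost can add at most 0.05 on top of a perfect match.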
@ -0,0 +1,17 @@
# Runtime split Patch H

Patch H is a final cleanup / verification step after the previous split patches.

## What it does
- makes `TedProcessorProperties` explicitly `LEGACY`-only
- removes the stale `TedProcessorProperties` import/comment from `DocumentIntelligencePlatformApplication`
- adds a regression test that fails if NEW runtime classes reintroduce a dependency on `TedProcessorProperties`
- adds a simple `application-legacy.yml` profile file

## Why this matters
After the NEW search/ingestion/projection classes are moved to:
- `DipSearchProperties`
- `DipIngestionProperties`
- `TedProjectionProperties`

`TedProcessorProperties` should be owned strictly by the legacy runtime graph.
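
A reflective guard of this kind can be sketched with the standard library alone. The variant below inspects fields, constructor parameters and, as an assumed extension, method parameters (which would also catch setter injection). `DependencyGuardSketch` and the nested sample types are hypothetical and exist only to make the example self-contained.

```java
import java.util.Arrays;

final class DependencyGuardSketch {

    /** Returns true if owner declares a field, constructor or method parameter of the given type. */
    static boolean dependsOn(Class<?> owner, Class<?> dependency) {
        boolean viaField = Arrays.stream(owner.getDeclaredFields())
                .anyMatch(field -> field.getType().equals(dependency));
        boolean viaConstructor = Arrays.stream(owner.getDeclaredConstructors())
                .flatMap(constructor -> Arrays.stream(constructor.getParameterTypes()))
                .anyMatch(dependency::equals);
        boolean viaMethod = Arrays.stream(owner.getDeclaredMethods())
                .flatMap(method -> Arrays.stream(method.getParameterTypes()))
                .anyMatch(dependency::equals);
        return viaField || viaConstructor || viaMethod;
    }

    // Hypothetical sample types used only to exercise the check.
    static final class LegacyProps { }

    static final class CleanService {
        private final String name = "clean";
    }

    static final class LeakyService {
        void configure(LegacyProps props) { }
    }
}
```

A check like this only sees declared signatures; a dependency obtained indirectly (for example through an application context lookup) would need a different kind of test.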
@ -0,0 +1,21 @@
# Runtime split Patch I

Patch I extracts the remaining legacy vectorization cluster off `TedProcessorProperties`
and onto a dedicated legacy-only config class.

## New config class
- `at.procon.ted.config.LegacyVectorizationProperties`
  - prefix: `legacy.ted.vectorization.*`

## Classes switched off `TedProcessorProperties`
- `GenericVectorizationRoute`
- `DocumentEmbeddingProcessingService`
- `ConfiguredEmbeddingModelStartupRunner`
- `GenericVectorizationStartupRunner`

## Additional cleanup
These classes are also marked `LEGACY`-only via `@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)`.

## Effect
The `at.procon.dip.vectorization.*` package now clearly belongs to the old runtime graph and no longer pulls
its settings from the shared monolithic `TedProcessorProperties`.
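
The implementation behind `@ConditionalOnRuntimeMode` is not included in these notes. As a rough sketch, the decision such a condition has to make is a comparison between the configured `dip.runtime.mode` value and the mode a bean requires. The class below models only that comparison with plain Java; `RuntimeModeSketch`, its nested enum and the explicit fallback parameter are assumptions for illustration, not the project's actual `RuntimeMode` or `RuntimeModeCondition`.

```java
import java.util.Locale;

final class RuntimeModeSketch {

    /** Local stand-in for the project's runtime mode enum. */
    enum RuntimeMode { LEGACY, NEW }

    /** Parses the configured value case-insensitively; blank or missing values use the fallback. */
    static RuntimeMode parse(String configured, RuntimeMode fallback) {
        if (configured == null || configured.isBlank()) {
            return fallback;
        }
        return RuntimeMode.valueOf(configured.trim().toUpperCase(Locale.ROOT));
    }

    /** True if a bean that requires the given mode should be created. */
    static boolean matches(String configured, RuntimeMode required, RuntimeMode fallback) {
        return parse(configured, fallback) == required;
    }
}
```

With `dip.runtime.mode: NEW`, a `LEGACY`-only bean would not match in this model, which is the gating effect described above.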
@ -0,0 +1,45 @@
# Runtime split Patch J

Patch J is a broader cleanup patch for the **actual current codebase**.

It adds the missing runtime/config split scaffolding and rewires the remaining NEW-mode classes
that still injected `TedProcessorProperties`.

## Added
- `dip.runtime` infrastructure
  - `RuntimeMode`
  - `RuntimeModeProperties`
  - `@ConditionalOnRuntimeMode`
  - `RuntimeModeCondition`
- `DipSearchProperties`
- `DipIngestionProperties`
- `TedProjectionProperties`

## Rewired off `TedProcessorProperties`
### NEW search/chunking
- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchOrchestrator`
- `SearchLexicalIndexStartupRunner`
- `DefaultSearchResultFusionService`
- `ChunkedLongTextRepresentationBuilder`

### NEW ingestion/projection
- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`
- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`

## Additional behavior
- `GenericDocumentImportService` now hands embedding work off to the new embedding subsystem
  via `RepresentationEmbeddingOrchestrator` and resolves the default model through
  `EmbeddingModelRegistry` / `EmbeddingModelCatalogService`.

## Notes
This patch intentionally targets the leftovers that are still present in the current codebase.
It assumes the new embedding subsystem already exists.
@ -0,0 +1,16 @@
package at.procon.dip.domain.ted.config;

import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;
import org.springframework.validation.annotation.Validated;

@Configuration
@ConfigurationProperties(prefix = "dip.ted.projection")
@Data
@Validated // required so that the @Positive constraint below is actually enforced at binding time
public class TedProjectionProperties {
    private boolean enabled = true;
    private boolean startupBackfillEnabled = false;
    @Positive
    private int startupBackfillLimit = 250;
}
@ -0,0 +1,59 @@
package at.procon.dip.ingestion.config;

import at.procon.dip.domain.access.DocumentVisibility;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;
import org.springframework.validation.annotation.Validated;

@Configuration
@ConfigurationProperties(prefix = "dip.ingestion")
@Data
@Validated // required so that the @Positive / @NotBlank constraints below are actually enforced
public class DipIngestionProperties {

    private boolean enabled = false;
    private boolean fileSystemEnabled = false;
    private boolean restUploadEnabled = true;
    private String inputDirectory = "/ted.europe/generic-input";
    private String filePattern = ".*\\.(pdf|txt|html|htm|xml|md|markdown|csv|json|yaml|yml)$";
    private String processedDirectory = ".dip-processed";
    private String errorDirectory = ".dip-error";

    @Positive
    private long pollInterval = 15000;

    @Positive
    private int maxMessagesPerPoll = 10;

    private String defaultOwnerTenantKey;
    private DocumentVisibility defaultVisibility = DocumentVisibility.PUBLIC;
    private String defaultLanguageCode;

    private boolean storeOriginalBinaryInDb = true;

    @Positive
    private int maxBinaryBytesInDb = 5242880;

    private boolean deduplicateByContentHash = true;
    private boolean storeOriginalContentForWrapperDocuments = true;
    private boolean vectorizePrimaryRepresentationOnly = true;

    @NotBlank
    private String importBatchId = "phase4-generic";

    private boolean tedPackageAdapterEnabled = true;
    private boolean mailAdapterEnabled = false;

    private String mailDefaultOwnerTenantKey;
    private DocumentVisibility mailDefaultVisibility = DocumentVisibility.TENANT;
    private boolean expandMailZipAttachments = true;

    @NotBlank
    private String tedPackageImportBatchId = "phase41-ted-package";

    private boolean gatewayOnlyForTedPackages = false;

    @NotBlank
    private String mailImportBatchId = "phase41-mail";
}
@ -1,49 +1,81 @@
package at.procon.dip.search.config;

import jakarta.validation.constraints.Min;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;
import org.springframework.validation.annotation.Validated;

/**
 * New-runtime generic search configuration.
 *
 * <p>This property tree is intentionally separated from the legacy
 * {@code ted.search.*} settings. NEW-mode search/semantic/lexical code should
 * depend on {@code dip.search.*} only.</p>
 */
@Configuration
@ConfigurationProperties(prefix = "dip.search")
@Data
@Validated
public class DipSearchProperties {

    /** Default page size for search results. */
    @Positive
    private int defaultPageSize = 20;

    /** Maximum allowed page size. */
    @Positive
    private int maxPageSize = 100;

    /** Semantic similarity threshold (normalized score). */
    private double similarityThreshold = 0.7d;

    /** Minimum trigram similarity for fuzzy lexical matches. */
    private double trigramSimilarityThreshold = 0.12d;

    /** Candidate limits per search engine before fusion/collapse. */
    @Positive
    private int fulltextCandidateLimit = 120;

    @Positive
    private int trigramCandidateLimit = 120;

    @Positive
    private int semanticCandidateLimit = 120;

    /** Hybrid fusion weights. */
    private double fulltextWeight = 0.35d;
    private double trigramWeight = 0.20d;
    private double semanticWeight = 0.45d;

    /** Enable chunk representations for long documents. */
    private boolean chunkingEnabled = true;

    /** Target chunk size in characters for CHUNK representations. */
    @Positive
    private int chunkTargetChars = 1800;

    /** Overlap between consecutive chunks in characters. */
    @Min(0)
    private int chunkOverlapChars = 200;

    /** Maximum CHUNK representations generated per document. */
    @Positive
    private int maxChunksPerDocument = 12;

    /** Additional score weight for recency. */
    private double recencyBoostWeight = 0.05d;

    /** Half-life in days used for recency decay. */
    @Positive
    private int recencyHalfLifeDays = 30;

    /** Startup backfill limit for missing DOC lexical vectors. */
    @Positive
    private int startupLexicalBackfillLimit = 500;

    /** Number of hits per engine returned by the debug endpoint. */
    @Positive
    private int debugTopHitsPerEngine = 10;
}
@ -0,0 +1,115 @@
package at.procon.ted.config;

import jakarta.validation.constraints.Min;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;
import org.springframework.validation.annotation.Validated;

/**
 * Legacy vectorization configuration used only by the old runtime path.
 * <p>
 * This extracts the former ted.vectorization.* subtree away from TedProcessorProperties
 * so that legacy vectorization beans no longer depend on the shared monolithic config.
 */
@Configuration
@ConfigurationProperties(prefix = "legacy.ted.vectorization")
@Data
@Validated
public class LegacyVectorizationProperties {

    /**
     * Enable/disable legacy async vectorization.
     */
    private boolean enabled = true;

    /**
     * Use external HTTP API instead of Python subprocess.
     */
    private boolean useHttpApi = false;

    /**
     * Embedding service HTTP API URL.
     */
    private String apiUrl = "http://localhost:8001";

    /**
     * Sentence transformer model name.
     */
    private String modelName = "intfloat/multilingual-e5-large";

    /**
     * Vector dimensions (must match model output).
     */
    @Positive
    private int dimensions = 1024;

    /**
     * Batch size for vectorization processing.
     */
    @Min(1)
    private int batchSize = 16;

    /**
     * Thread pool size for async vectorization.
     */
    @Min(1)
    private int threadPoolSize = 4;

    /**
     * Maximum text length for vectorization (characters).
     */
    @Positive
    private int maxTextLength = 8192;

    /**
     * HTTP connection timeout in milliseconds.
     */
    @Positive
    private int connectTimeout = 10000;

    /**
     * HTTP socket/read timeout in milliseconds.
     */
    @Positive
    private int socketTimeout = 60000;

    /**
     * Maximum retries on connection failure.
     */
    @Min(0)
    private int maxRetries = 5;

    /**
     * Enable the former Phase 2 generic pipeline in the legacy runtime.
     * In the split runtime design this should normally stay false in NEW mode
     * because legacy beans are not instantiated there.
     */
    private boolean genericPipelineEnabled = true;

    /**
     * Keep writing completed TED embeddings back to the legacy ted.procurement_document
     * vector columns so the existing semantic search stays operational during migration.
     */
    private boolean dualWriteLegacyTedVectors = true;

    /**
     * Scheduler interval for generic embedding polling (milliseconds).
     */
    @Positive
    private long genericSchedulerPeriodMs = 6000;

    /**
     * Builder key for the primary TED semantic representation created during transitional dual-write.
     */
    @NotBlank
    private String primaryRepresentationBuilderKey = "ted-phase2-primary-representation";

    /**
     * Provider key used when registering the configured embedding model in DOC.doc_embedding_model.
     */
    @NotBlank
    private String embeddingProvider = "http-embedding-service";
}
@ -1,3 +1,30 @@
# Profile binding comes from the file name (application-legacy.yml);
# spring.config.activate.on-profile is rejected by Spring Boot 2.4+ inside profile-specific files.
dip:
  runtime:
    mode: LEGACY

# Legacy runtime uses the existing ted.* property tree.
# Move old route/download/mail/vectorization/search settings here over time.
legacy:
  ted:
    vectorization:
      enabled: true
      use-http-api: false
      api-url: http://localhost:8001
      model-name: intfloat/multilingual-e5-large
      dimensions: 1024
      batch-size: 16
      thread-pool-size: 4
      max-text-length: 8192
      connect-timeout: 10000
      socket-timeout: 60000
      max-retries: 5
      generic-pipeline-enabled: true
      dual-write-legacy-ted-vectors: true
      generic-scheduler-period-ms: 6000
      primary-representation-builder-key: ted-phase2-primary-representation
      embedding-provider: http-embedding-service
@ -1,9 +1,143 @@
# New runtime overrides
# Activate with: --spring.profiles.active=new

# Profile binding comes from the file name (application-new.yml);
# spring.config.activate.on-profile is rejected by Spring Boot 2.4+ inside profile-specific files.
dip:
  runtime:
    mode: NEW

  search:
    # Default page size for search results
    default-page-size: 20
    # Maximum page size
    max-page-size: 100
    # Similarity threshold for vector search (0.0 - 1.0)
    similarity-threshold: 0.7
    # Minimum trigram similarity for fuzzy lexical matches
    trigram-similarity-threshold: 0.12
    # Candidate limits per engine before fusion/collapse
    fulltext-candidate-limit: 120
    trigram-candidate-limit: 120
    semantic-candidate-limit: 120
    # Hybrid fusion weights
    fulltext-weight: 0.35
    trigram-weight: 0.20
    semantic-weight: 0.45
    # Additional score weight for recency
    recency-boost-weight: 0.05
    # Recency half-life in days
    recency-half-life-days: 30
    # Enable chunk representations for long documents
    chunking-enabled: true
    # Target chunk size in characters
    chunk-target-chars: 1800
    # Overlap between consecutive chunks
    chunk-overlap-chars: 200
    # Maximum number of chunks generated per document
    max-chunks-per-document: 12
    # Startup backfill limit for missing lexical vectors
    startup-lexical-backfill-limit: 500
    # Number of top hits per engine returned by /search/debug
    debug-top-hits-per-engine: 10

  embedding:
    enabled: true
    default-document-model: e5-default
    default-query-model: e5-default
    providers:
      mock-default:
        type: mock
        dimensions: 16
      external-e5:
        type: http-json
        base-url: http://172.20.240.18:8001
        connect-timeout: 5s
        read-timeout: 60s
    models:
      mock-search:
        provider-config-key: mock-default
        provider-model-key: mock-search
        dimensions: 16
        distance-metric: COSINE
        supports-query-embedding-mode: true
        active: true
      e5-default:
        provider-config-key: external-e5
        provider-model-key: intfloat/multilingual-e5-large
        dimensions: 1024
        distance-metric: COSINE
        supports-query-embedding-mode: true
        active: true
    jobs:
      enabled: true
      scheduler-delay-ms: 5000

  # Phase 4 generic ingestion configuration
  ingestion:
    # Master switch for arbitrary document ingestion into the DOC model
    enabled: true
    # Enable file-system polling for non-TED documents
    file-system-enabled: false
    # Allow REST/API upload endpoints for arbitrary documents
    rest-upload-enabled: true
    # Input directory for the generic Camel file route
    input-directory: /ted.europe/generic-input
    # Regex for files accepted by the generic file route
    # (single-quoted so the backslash reaches the regex engine as a single escape, matching the Java default)
    file-pattern: '.*\.(pdf|txt|html|htm|xml|md|markdown|csv|json|yaml|yml)$'
    # Move successfully processed files here
    processed-directory: .dip-processed
    # Move failed files here
    error-directory: .dip-error
    # Polling interval for the generic route
    poll-interval: 15000
    # Maximum files per poll
    max-messages-per-poll: 200
    # Optional default owner tenant; leave empty for PUBLIC docs like TED or public knowledge docs
    default-owner-tenant-key:
    # Default visibility when no explicit access context is provided
    default-visibility: PUBLIC
    # Optional default language for filesystem imports
    default-language-code:
    # Store small binary originals in DOC.doc_content.binary_content
    store-original-binary-in-db: true
    # Maximum binary payload size persisted inline in DB
    max-binary-bytes-in-db: 5242880
    # Deduplicate by content hash and attach additional sources to the same canonical document
    deduplicate-by-content-hash: true
    # Persist ORIGINAL content rows for wrapper/container documents such as TED packages or ZIP wrappers
    store-original-content-for-wrapper-documents: true
    # Queue only the primary text representation for vectorization
    vectorize-primary-representation-only: true
    # Import batch marker written to DOC.doc_source.import_batch_id
    import-batch-id: phase4-generic
    # Enable Phase 4.1 TED package adapter on top of the generic DOC ingestion SPI
    ted-package-adapter-enabled: true
    # Enable Phase 4.1 mail/document adapter on top of the generic DOC ingestion SPI
    mail-adapter-enabled: true
    # Optional dedicated mail owner tenant, falls back to default-owner-tenant-key
    mail-default-owner-tenant-key:
    # Visibility for imported mail messages and attachments
    mail-default-visibility: TENANT
    # Expand ZIP attachments recursively through the mail adapter
    expand-mail-zip-attachments: true
    # Import batch marker for TED package roots and children
    ted-package-import-batch-id: phase41-ted-package
    # When true, TED package documents are stored only through the generic ingestion gateway
    # and the legacy XML batch processing path is skipped
    gateway-only-for-ted-packages: true
    # Import batch marker for mail roots and attachments
    mail-import-batch-id: phase41-mail

  ted: # Phase 3 TED projection configuration
    projection:
      # Enable/disable dual-write into the TED projection model on top of DOC.doc_document
      enabled: true
      # Optional startup backfill for legacy TED documents without a projection row yet
      startup-backfill-enabled: false
      # Maximum number of legacy TED documents to backfill during startup
      startup-backfill-limit: 250
@ -0,0 +1,69 @@
package at.procon.dip.architecture;

import at.procon.dip.domain.ted.config.TedProjectionProperties;
import at.procon.dip.ingestion.config.DipIngestionProperties;
import at.procon.dip.search.config.DipSearchProperties;
import at.procon.ted.config.TedProcessorProperties;
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.util.List;
import org.junit.jupiter.api.Test;

import static org.assertj.core.api.Assertions.assertThat;

/**
 * Regression guard for the runtime/config split.
 * NEW runtime classes must not depend on TedProcessorProperties anymore.
 */
class NewRuntimeMustNotDependOnTedProcessorPropertiesTest {

    @Test
    void new_runtime_classes_should_not_depend_on_ted_processor_properties() {
        List<Class<?>> newRuntimeClasses = List.of(
                at.procon.dip.ingestion.service.GenericDocumentImportService.class,
                at.procon.dip.ingestion.camel.GenericFileSystemIngestionRoute.class,
                at.procon.dip.ingestion.controller.GenericDocumentImportController.class,
                at.procon.dip.ingestion.adapter.MailDocumentIngestionAdapter.class,
                at.procon.dip.ingestion.adapter.TedPackageDocumentIngestionAdapter.class,
                at.procon.dip.ingestion.service.TedPackageChildImportProcessor.class,
                at.procon.dip.domain.ted.service.TedNoticeProjectionService.class,
                at.procon.dip.domain.ted.startup.TedProjectionStartupRunner.class,
                at.procon.dip.search.engine.fulltext.PostgresFullTextSearchEngine.class,
                at.procon.dip.search.engine.trigram.PostgresTrigramSearchEngine.class,
                at.procon.dip.search.engine.semantic.PgVectorSemanticSearchEngine.class,
                at.procon.dip.search.rank.DefaultSearchResultFusionService.class,
                at.procon.dip.search.service.DefaultSearchOrchestrator.class,
                at.procon.dip.search.service.SearchLexicalIndexStartupRunner.class,
                at.procon.dip.normalization.impl.ChunkedLongTextRepresentationBuilder.class
        );

        for (Class<?> type : newRuntimeClasses) {
            assertThat(hasDependency(type, TedProcessorProperties.class))
                    .as(type.getName() + " must not depend on TedProcessorProperties")
                    .isFalse();
        }
    }

    @Test
    void new_runtime_config_classes_exist_as_replacements() {
        assertThat(DipSearchProperties.class).isNotNull();
        assertThat(DipIngestionProperties.class).isNotNull();
        assertThat(TedProjectionProperties.class).isNotNull();
    }

    private boolean hasDependency(Class<?> owner, Class<?> dependency) {
        for (Field field : owner.getDeclaredFields()) {
            if (field.getType().equals(dependency)) {
                return true;
            }
        }
        for (Constructor<?> constructor : owner.getDeclaredConstructors()) {
            for (Class<?> param : constructor.getParameterTypes()) {
                if (param.equals(dependency)) {
                    return true;
                }
            }
        }
        return false;
    }
}