Compare commits

10 commits: 74609e481d ... 152d9739af

| SHA1 | Date |
|---|---|
| 152d9739af | 2 weeks ago |
| 00ad3aad38 | 2 weeks ago |
| 6ae39b4ea5 | 2 weeks ago |
| 678db76415 | 2 weeks ago |
| 52330d751d | 2 weeks ago |
| 3913ed83e8 | 2 weeks ago |
| 177c61803e | 4 weeks ago |
| fbd249e56b | 4 weeks ago |
| 1aa599b587 | 4 weeks ago |
| 44995cebf7 | 4 weeks ago |
@ -0,0 +1,31 @@

# Config split: moved new-runtime properties to application-new.yml

This patch keeps shared and legacy defaults in `application.yml` and moves new-runtime properties into `application-new.yml`.

Activate the new runtime with:

```
--spring.profiles.active=new
```

`application-new.yml` also sets:

```yaml
dip.runtime.mode: NEW
```

So profile selection and runtime mode stay aligned.

Moved blocks:

- `dip.embedding.*`
- `ted.search.*` (new generic search tuning, now under `dip.search.*`)
- `ted.projection.*`
- `ted.generic-ingestion.*`
- new/transitional `ted.vectorization.*` keys:
  - `generic-pipeline-enabled`
  - `dual-write-legacy-ted-vectors`
  - `generic-scheduler-period-ms`
  - `primary-representation-builder-key`
  - `embedding-provider`

Shared / legacy defaults remain in `application.yml`.
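As a rough illustration of the split, `application-new.yml` might look like the sketch below. Only `dip.runtime.mode: NEW` is taken from this patch note; the nesting and the example values under the moved blocks are assumptions, not a copy of the real file.

```yaml
# Hypothetical sketch of application-new.yml.
# The values under the moved blocks are illustrative assumptions.
dip:
  runtime:
    mode: NEW
  embedding:
    enabled: true

ted:
  vectorization:
    generic-pipeline-enabled: true
    dual-write-legacy-ted-vectors: false
```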
@ -0,0 +1,30 @@

# Embedding policy: Patch K1

Patch K1 introduces the configuration and resolver layer for policy-based document embedding selection.

## Added

- `EmbeddingPolicy`
- `EmbeddingProfile`
- `EmbeddingPolicyCondition`
- `EmbeddingPolicyUse`
- `EmbeddingPolicyRule`
- `EmbeddingPolicyProperties`
- `EmbeddingProfileProperties`
- `EmbeddingPolicyResolver`
- `DefaultEmbeddingPolicyResolver`
- `EmbeddingProfileResolver`
- `DefaultEmbeddingProfileResolver`

## Example config

See `application-new-example-embedding-policy.yml`.

## What K1 does not change

- no runtime import/orchestrator wiring yet
- no `SourceDescriptor` schema change yet
- no job persistence/audit changes yet

## Intended follow-up

K2 should wire:

- `GenericDocumentImportService`
- `RepresentationEmbeddingOrchestrator`

to use the resolved policy and profile.
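The core idea of the resolver layer — evaluate rules in order and fall back to a default policy — can be sketched without Spring. The class and field names below are simplified stand-ins for `EmbeddingPolicyRule`/`EmbeddingPolicyUse`, not the real K1 API:

```java
import java.util.List;
import java.util.Map;

// Simplified, hypothetical sketch of first-match-wins policy resolution.
// The real K1 classes (EmbeddingPolicyRule, EmbeddingPolicyUse, ...) differ.
public class PolicyResolverSketch {

    // A rule matches when every condition value equals the document attribute.
    record Rule(Map<String, String> when, String modelKey) {
        boolean matches(Map<String, String> attrs) {
            return when.entrySet().stream()
                    .allMatch(e -> e.getValue().equals(attrs.get(e.getKey())));
        }
    }

    // Returns the model key of the first matching rule, or the default.
    static String resolve(List<Rule> rules, Map<String, String> attrs, String defaultModel) {
        return rules.stream()
                .filter(r -> r.matches(attrs))
                .map(Rule::modelKey)
                .findFirst()
                .orElse(defaultModel);
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
                new Rule(Map.of("documentType", "TED_NOTICE"), "e5-large"),
                new Rule(Map.of("mimeType", "message/rfc822"), "mail-model"));

        System.out.println(resolve(rules, Map.of("documentType", "TED_NOTICE"), "default-model")); // e5-large
        System.out.println(resolve(rules, Map.of("documentType", "OTHER"), "default-model"));      // default-model
    }
}
```

First-match-wins keeps rule files readable: order in the config is the precedence order.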
@ -0,0 +1,26 @@

# Embedding policy: Patch K2

Patch K2 wires the policy/profile layer into the actual NEW import runtime.

## What it changes

- `GenericDocumentImportService`
  - resolves `EmbeddingPolicy` per imported document
  - resolves `EmbeddingProfile`
  - ensures the selected embedding model is registered
  - queues embeddings only for representation drafts allowed by the resolved profile
- `RepresentationEmbeddingOrchestrator`
  - adds a convenience overload for `(documentId, modelKey, profile)`
- `EmbeddingJobService`
  - adds a profile-aware enqueue overload
- `DefaultEmbeddingSelectionPolicy`
  - adds profile-aware representation filtering
- `DefaultEmbeddingPolicyResolver`
  - corrected for the current `SourceDescriptor.attributes()` shape

## Runtime flow after K2

document imported
-> representations built
-> policy resolved
-> profile resolved
-> model ensured
-> matching representations queued for embedding
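The profile-aware filtering step in that flow can be sketched as follows. The method and type names are simplified assumptions; the real profile carries a list of allowed representation types (compare `EmbeddingProfileProperties` later in this change set):

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of profile-aware representation filtering (K2).
public class ProfileFilterSketch {

    // Keep only the representation types the resolved profile allows.
    // Treating an empty allow-list as "allow everything" is an assumption here.
    static List<String> allowedRepresentations(List<String> drafts, Set<String> allowedTypes) {
        if (allowedTypes.isEmpty()) {
            return drafts;
        }
        return drafts.stream().filter(allowedTypes::contains).toList();
    }

    public static void main(String[] args) {
        List<String> drafts = List.of("PLAIN_TEXT", "CHUNKED_LONG_TEXT", "SUMMARY");
        System.out.println(allowedRepresentations(drafts, Set.of("PLAIN_TEXT", "SUMMARY")));
        // [PLAIN_TEXT, SUMMARY]
    }
}
```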
@ -0,0 +1,59 @@

# Runtime split: Patch B

Patch B builds on Patch A and makes the NEW runtime actually process embedding jobs.

## What changes

### 1. New embedding job scheduler

Adds:

- `EmbeddingJobScheduler`
- `EmbeddingJobSchedulingConfiguration`

Behavior:

- enabled only in `NEW` runtime mode
- active only when `dip.embedding.jobs.enabled=true`
- periodically calls `RepresentationEmbeddingOrchestrator.processNextReadyBatch()`

### 2. Generic import hands off to the new embedding job path

`GenericDocumentImportService` is updated so that in `NEW` mode it:

- resolves `dip.embedding.default-document-model`
- ensures the model is registered in `DOC.doc_embedding_model`
- creates embedding jobs through `RepresentationEmbeddingOrchestrator.enqueueRepresentation(...)`

It no longer creates legacy-style pending embeddings as the primary handoff for the NEW runtime path.

## Notes

- This patch assumes Patch A has already introduced:
  - `RuntimeMode`
  - `RuntimeModeProperties`
  - `@ConditionalOnRuntimeMode`
- This patch does not yet remove the legacy vectorization runtime; that remains the job of subsequent cutover steps.

## Expected runtime behavior in NEW mode

- `GenericDocumentImportService` persists new generic representations
- selected representations are queued into `DOC.doc_embedding_job`
- the scheduler processes pending jobs
- vectors are persisted through the new embedding subsystem

## New config

Example:

```yaml
dip:
  runtime:
    mode: NEW

  embedding:
    enabled: true
    jobs:
      enabled: true
      scheduler-delay-ms: 5000
```
@ -0,0 +1,36 @@

# Runtime split: Patch C

Patch C moves the **new generic search runtime** off `TedProcessorProperties.search` and into a dedicated `dip.search.*` config tree.

## New config class

- `at.procon.dip.search.config.DipSearchProperties`

## New config root

```yaml
dip:
  search:
    ...
```

## Classes moved off `TedProcessorProperties`

- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchOrchestrator`
- `DefaultSearchResultFusionService`
- `SearchLexicalIndexStartupRunner`
- `ChunkedLongTextRepresentationBuilder`

## What this patch intentionally does not do

- it does not yet remove `TedProcessorProperties` from all NEW-mode classes
- it does not yet move `generic-ingestion` config off `ted.*`
- it does not yet finish the legacy/new config split for import/mail/TED package processing

Those should be handled in the next config-splitting patch.

## Practical result

After this patch, **new search/semantic/chunking tuning** should be configured only via `dip.search.*`, while `ted.search.*` remains legacy-oriented.
@ -0,0 +1,40 @@

# Runtime split: Patch D

This patch completes the next configuration split step for the NEW runtime.

## New property classes

- `at.procon.dip.ingestion.config.DipIngestionProperties`
  - prefix: `dip.ingestion`
- `at.procon.dip.domain.ted.config.TedProjectionProperties`
  - prefix: `dip.ted.projection`

## Classes moved off `TedProcessorProperties`

### NEW-mode ingestion

- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`

### NEW-mode projection

- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`

## Additional cleanup in `GenericDocumentImportService`

It now resolves the default document embedding model through the new embedding subsystem:

- `EmbeddingProperties`
- `EmbeddingModelRegistry`
- `EmbeddingModelCatalogService`

and no longer reads vectorization model/provider/dimensions from `TedProcessorProperties`.

## What still remains for later split steps

- legacy routes/services still using `TedProcessorProperties`
- legacy/new runtime bean gating for all remaining shared classes
- moving old TED-only config fully under `legacy.ted.*`
@ -0,0 +1,26 @@

# Runtime split: Patch E

This patch continues the runtime/config split by targeting the remaining NEW-mode classes that still injected `TedProcessorProperties`.

## New config classes

- `DipIngestionProperties` (`dip.ingestion.*`)
- `TedProjectionProperties` (`dip.ted.projection.*`)

## NEW-mode classes moved off `TedProcessorProperties`

- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`
- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`

## Additional behavior change

`GenericDocumentImportService` now hands embedding work off to the new embedding subsystem by:

- resolving the default document model from `EmbeddingModelRegistry`
- ensuring the model is registered via `EmbeddingModelCatalogService`
- enqueueing jobs through `RepresentationEmbeddingOrchestrator`

This removes the new import path's runtime dependence on the legacy `TedProcessorProperties.vectorization` settings.
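The resolve/register/enqueue handoff can be reduced to a tiny sketch. The names below are simplified stand-ins for the real beans (`EmbeddingModelCatalogService`, `RepresentationEmbeddingOrchestrator`), and the in-memory set/list merely illustrate the idempotent "ensure registered" step followed by job enqueueing:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the NEW-mode embedding handoff:
// resolve default model -> ensure registered -> enqueue a job.
public class EmbeddingHandoffSketch {

    static final Set<String> registry = new LinkedHashSet<>();  // stands in for the model catalog
    static final List<String> jobQueue = new ArrayList<>();     // stands in for the orchestrator queue

    static void importDocument(String documentId, String defaultModelKey) {
        // idempotent "ensure registered" step
        registry.add(defaultModelKey);
        // enqueue an embedding job instead of writing legacy pending embeddings
        jobQueue.add(documentId + ":" + defaultModelKey);
    }

    public static void main(String[] args) {
        importDocument("doc-1", "e5-default");
        importDocument("doc-2", "e5-default");
        System.out.println(registry);  // [e5-default]
        System.out.println(jobQueue);  // [doc-1:e5-default, doc-2:e5-default]
    }
}
```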
@ -0,0 +1,24 @@

# Runtime split: Patch G

Patch G moves the remaining NEW-mode search/chunking classes off `TedProcessorProperties.search` and onto `DipSearchProperties` (`dip.search.*`).

## New config class

- `at.procon.dip.search.config.DipSearchProperties`

## Classes switched to `DipSearchProperties`

- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchResultFusionService`
- `DefaultSearchOrchestrator`
- `SearchLexicalIndexStartupRunner`
- `ChunkedLongTextRepresentationBuilder`

## Additional cleanup

These classes are also marked `NEW`-only in this patch.

## Effect

After Patch G, the generic NEW-mode search/chunking path no longer depends on `TedProcessorProperties.search`. That leaves `TedProcessorProperties` much closer to legacy-only ownership.
@ -0,0 +1,17 @@

# Runtime split: Patch H

Patch H is a final cleanup / verification step after the previous split patches.

## What it does

- makes `TedProcessorProperties` explicitly `LEGACY`-only
- removes the stale `TedProcessorProperties` import/comment from `DocumentIntelligencePlatformApplication`
- adds a regression test that fails if NEW runtime classes reintroduce a dependency on `TedProcessorProperties`
- adds a simple `application-legacy.yml` profile file

## Why this matters

After the NEW search/ingestion/projection classes are moved to:

- `DipSearchProperties`
- `DipIngestionProperties`
- `TedProjectionProperties`

`TedProcessorProperties` should be owned strictly by the legacy runtime graph.
@ -0,0 +1,21 @@

# Runtime split: Patch I

Patch I extracts the remaining legacy vectorization cluster off `TedProcessorProperties` and onto a dedicated legacy-only config class.

## New config class

- `at.procon.ted.config.LegacyVectorizationProperties`
  - prefix: `legacy.ted.vectorization.*`

## Classes switched off `TedProcessorProperties`

- `GenericVectorizationRoute`
- `DocumentEmbeddingProcessingService`
- `ConfiguredEmbeddingModelStartupRunner`
- `GenericVectorizationStartupRunner`

## Additional cleanup

These classes are also marked `LEGACY`-only via `@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)`.

## Effect

The `at.procon.dip.vectorization.*` package now clearly belongs to the old runtime graph and no longer pulls its settings from the shared monolithic `TedProcessorProperties`.
@ -0,0 +1,45 @@

# Runtime split: Patch J

Patch J is a broader cleanup patch for the **actual current codebase**. It adds the missing runtime/config split scaffolding and rewires the remaining NEW-mode classes that still injected `TedProcessorProperties`.

## Added

- `dip.runtime` infrastructure
  - `RuntimeMode`
  - `RuntimeModeProperties`
  - `@ConditionalOnRuntimeMode`
  - `RuntimeModeCondition`
- `DipSearchProperties`
- `DipIngestionProperties`
- `TedProjectionProperties`

## Rewired off `TedProcessorProperties`

### NEW search/chunking

- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchOrchestrator`
- `SearchLexicalIndexStartupRunner`
- `DefaultSearchResultFusionService`
- `ChunkedLongTextRepresentationBuilder`

### NEW ingestion/projection

- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`
- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`

## Additional behavior

- `GenericDocumentImportService` now hands embedding work off to the new embedding subsystem via `RepresentationEmbeddingOrchestrator` and resolves the default model through `EmbeddingModelRegistry` / `EmbeddingModelCatalogService`.

## Notes

This patch intentionally targets the real leftovers visible in the current codebase. It assumes the new embedding subsystem already exists.
@ -0,0 +1,20 @@

# NEW TED package import route

This patch adds a NEW-runtime TED package download path that:

- reuses the proven package sequencing rules
- stores package tracking in `TedDailyPackage`
- downloads the package tar.gz
- ingests it only through `DocumentIngestionGateway`
- never calls the legacy XML batch processing / vectorization flow

## Added classes

- `TedPackageSequenceService`
- `DefaultTedPackageSequenceService`
- `TedPackageDownloadNewProperties`
- `TedPackageDownloadNewRoute`

## Config

Use the `dip.ingestion.ted-download.*` block in `application-new.yml`.
@ -0,0 +1,67 @@

# Vector-sync HTTP embedding provider

This provider supports two endpoints:

- `POST {baseUrl}/vector-sync` for single-text requests
- `POST {baseUrl}/vectorize-batch` for batch document requests

## Single request

Request body:

```json
{
  "model": "intfloat/multilingual-e5-large",
  "text": "This is a sample text to vectorize"
}
```

## Batch request

Request body:

```json
{
  "model": "intfloat/multilingual-e5-large",
  "truncate_text": false,
  "truncate_length": 512,
  "chunk_size": 20,
  "items": [
    {
      "id": "2f48fd5c-9d39-4d80-9225-ea0c59c77c9a",
      "text": "This is a sample text to vectorize"
    }
  ]
}
```

## Provider configuration

```yaml
batch-request:
  truncate-text: false
  truncate-length: 512
  chunk-size: 20
```

These values are used for `/vectorize-batch` calls and can also be overridden per request via `EmbeddingRequest.providerOptions()`.

## Orchestrator batch processing

To let `RepresentationEmbeddingOrchestrator` send multiple representations in one provider call, enable batch processing for jobs and for the model:

```yaml
dip:
  embedding:
    jobs:
      enabled: true
      process-in-batches: true
      execution-batch-size: 20

    models:
      e5-default:
        supports-batch: true
```

Notes:

- jobs are grouped by `modelKey`
- non-batch-capable models still fall back to single-item execution
- `execution-batch-size` controls how many texts are sent in one `/vectorize-batch` request
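The grouping and slicing behavior in those notes can be sketched in plain Java. The `Job` record and method names are simplified assumptions, not the orchestrator's actual API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: jobs are grouped by modelKey, then each group is
// sliced into chunks of at most execution-batch-size per provider call.
public class BatchGroupingSketch {

    record Job(String modelKey, String text) {}

    static List<List<Job>> toBatches(List<Job> jobs, int batchSize) {
        // group by model, preserving insertion order
        Map<String, List<Job>> byModel = new LinkedHashMap<>();
        for (Job job : jobs) {
            byModel.computeIfAbsent(job.modelKey(), k -> new ArrayList<>()).add(job);
        }
        // slice each group into batches of at most batchSize
        List<List<Job>> batches = new ArrayList<>();
        for (List<Job> group : byModel.values()) {
            for (int i = 0; i < group.size(); i += batchSize) {
                batches.add(group.subList(i, Math.min(i + batchSize, group.size())));
            }
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Job> jobs = List.of(
                new Job("e5-default", "a"), new Job("e5-default", "b"),
                new Job("e5-default", "c"), new Job("other-model", "d"));
        // batchSize 2 -> [[a, b], [c], [d]]
        System.out.println(toBatches(jobs, 2).size()); // 3
    }
}
```

A model whose `supports-batch` flag is off would simply get `batchSize = 1` here, which matches the single-item fallback described above.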
@ -0,0 +1,16 @@

package at.procon.dip.domain.ted.config;

import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.ted.projection")
@Data
public class TedProjectionProperties {

    private boolean enabled = true;
    private boolean startupBackfillEnabled = false;

    @Positive
    private int startupBackfillLimit = 250;
}
@ -0,0 +1,189 @@

package at.procon.dip.domain.ted.service;

import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import at.procon.dip.ingestion.config.TedPackageDownloadProperties;
import at.procon.ted.model.entity.TedDailyPackage;
import at.procon.ted.repository.TedDailyPackageRepository;
import java.time.Duration;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.Year;
import java.time.ZoneOffset;
import java.util.Optional;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

/**
 * NEW-runtime implementation of TED package sequencing.
 * <p>
 * This reuses the same decision rules as the legacy TED package downloader:
 * <ul>
 *   <li>current year forward crawling first</li>
 *   <li>gap filling by walking backward to package 1</li>
 *   <li>NOT_FOUND retry handling with current-year indefinite retry support</li>
 *   <li>previous-year grace period before a tail NOT_FOUND becomes final</li>
 * </ul>
 */
@Service
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@RequiredArgsConstructor
@Slf4j
public class DefaultTedPackageSequenceService implements TedPackageSequenceService {

    private final TedPackageDownloadProperties properties;
    private final TedDailyPackageRepository packageRepository;

    @Override
    public PackageInfo getNextPackageToDownload() {
        int currentYear = Year.now().getValue();

        log.debug("Determining next TED package to download for NEW runtime (current year: {})", currentYear);

        // 1) Current year forward crawling first (newest data first)
        PackageInfo nextInCurrentYear = getNextForwardPackage(currentYear);
        if (nextInCurrentYear != null) {
            log.info("Next TED package: {} (current year {} forward)", nextInCurrentYear.identifier(), currentYear);
            return nextInCurrentYear;
        }

        // 2) Walk all years backward and fill gaps / continue unfinished years
        for (int year = currentYear; year >= properties.getStartYear(); year--) {
            PackageInfo gapFiller = getGapFillerPackage(year);
            if (gapFiller != null) {
                log.info("Next TED package: {} (filling gap in year {})", gapFiller.identifier(), year);
                return gapFiller;
            }

            if (!isYearComplete(year)) {
                PackageInfo forwardPackage = getNextForwardPackage(year);
                if (forwardPackage != null) {
                    log.info("Next TED package: {} (continuing year {})", forwardPackage.identifier(), year);
                    return forwardPackage;
                }
            } else {
                log.debug("TED package year {} is complete", year);
            }
        }

        // 3) Open a new older year if possible
        int oldestYear = getOldestYearWithData();
        if (oldestYear > properties.getStartYear()) {
            int previousYear = oldestYear - 1;
            if (previousYear >= properties.getStartYear()) {
                PackageInfo first = new PackageInfo(previousYear, 1);
                log.info("Next TED package: {} (opening year {})", first.identifier(), previousYear);
                return first;
            }
        }

        log.info("All TED package years from {} to {} appear complete - nothing to download",
                properties.getStartYear(), currentYear);
        return null;
    }

    private PackageInfo getNextForwardPackage(int year) {
        Optional<TedDailyPackage> latest = packageRepository.findLatestByYear(year);

        if (latest.isEmpty()) {
            return new PackageInfo(year, 1);
        }

        TedDailyPackage latestPackage = latest.get();

        if (latestPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.NOT_FOUND) {
            if (shouldRetryNotFoundPackage(latestPackage)) {
                return new PackageInfo(year, latestPackage.getSerialNumber());
            }

            if (isNotFoundRetryableForYear(latestPackage)) {
                log.debug("Year {} still inside NOT_FOUND retry window for package {} until {}",
                        year, latestPackage.getPackageIdentifier(), calculateNextRetryAt(latestPackage));
                return null;
            }

            log.debug("Year {} finalized after grace period at tail package {}", year, latestPackage.getPackageIdentifier());
            return null;
        }

        return new PackageInfo(year, latestPackage.getSerialNumber() + 1);
    }

    private PackageInfo getGapFillerPackage(int year) {
        Optional<TedDailyPackage> first = packageRepository.findFirstByYear(year);

        if (first.isEmpty()) {
            return null;
        }

        int minSerial = first.get().getSerialNumber();
        if (minSerial <= 1) {
            return null;
        }

        return new PackageInfo(year, minSerial - 1);
    }

    private boolean isYearComplete(int year) {
        Optional<TedDailyPackage> first = packageRepository.findFirstByYear(year);
        Optional<TedDailyPackage> latest = packageRepository.findLatestByYear(year);

        if (first.isEmpty() || latest.isEmpty()) {
            return false;
        }

        if (first.get().getSerialNumber() != 1) {
            return false;
        }

        TedDailyPackage latestPackage = latest.get();
        return latestPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.NOT_FOUND
                && !isNotFoundRetryableForYear(latestPackage);
    }

    private boolean shouldRetryNotFoundPackage(TedDailyPackage pkg) {
        if (!isNotFoundRetryableForYear(pkg)) {
            return false;
        }

        OffsetDateTime nextRetryAt = calculateNextRetryAt(pkg);
        return !nextRetryAt.isAfter(OffsetDateTime.now());
    }

    private boolean isNotFoundRetryableForYear(TedDailyPackage pkg) {
        int currentYear = Year.now().getValue();
        int packageYear = pkg.getYear() != null ? pkg.getYear() : currentYear;

        if (packageYear >= currentYear) {
            return properties.isRetryCurrentYearNotFoundIndefinitely();
        }

        return OffsetDateTime.now().isBefore(getYearRetryGraceDeadline(packageYear));
    }

    private OffsetDateTime calculateNextRetryAt(TedDailyPackage pkg) {
        OffsetDateTime lastAttemptAt = pkg.getUpdatedAt() != null
                ? pkg.getUpdatedAt()
                : (pkg.getCreatedAt() != null ? pkg.getCreatedAt() : OffsetDateTime.now());

        return lastAttemptAt.plus(Duration.ofMillis(properties.getNotFoundRetryInterval()));
    }

    private OffsetDateTime getYearRetryGraceDeadline(int year) {
        return LocalDate.of(year + 1, 1, 1)
                .atStartOfDay()
                .atOffset(ZoneOffset.UTC)
                .plusDays(properties.getPreviousYearGracePeriodDays());
    }

    private int getOldestYearWithData() {
        int currentYear = Year.now().getValue();
        for (int year = properties.getStartYear(); year <= currentYear; year++) {
            if (packageRepository.findLatestByYear(year).isPresent()) {
                return year;
            }
        }
        return currentYear;
    }
}
@ -0,0 +1,25 @@

package at.procon.dip.domain.ted.service;

/**
 * Shared package sequencing contract used to determine the next TED daily package to download.
 * <p>
 * This service encapsulates the proven sequencing rules from the legacy download implementation
 * so they can also be used by the NEW runtime without depending on the old route/service graph.
 */
public interface TedPackageSequenceService {

    /**
     * Returns the next package to download according to the current sequencing strategy,
     * or {@code null} if nothing should be downloaded right now.
     */
    PackageInfo getNextPackageToDownload();

    /**
     * Simple year/serial pair with TED package identifier helper.
     */
    record PackageInfo(int year, int serialNumber) {
        public String identifier() {
            return "%04d%05d".formatted(year, serialNumber);
        }
    }
}
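`PackageInfo.identifier()` zero-pads the year to four digits and the serial number to five. A standalone demonstration (the record is re-declared locally so the snippet runs without the interface):

```java
// Standalone re-declaration of the PackageInfo record from
// TedPackageSequenceService, to demonstrate the identifier format.
public class PackageInfoDemo {

    record PackageInfo(int year, int serialNumber) {
        public String identifier() {
            // %04d pads the year to 4 digits, %05d pads the serial to 5 digits
            return "%04d%05d".formatted(year, serialNumber);
        }
    }

    public static void main(String[] args) {
        System.out.println(new PackageInfo(2024, 7).identifier());   // 202400007
        System.out.println(new PackageInfo(2019, 152).identifier()); // 201900152
    }
}
```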
@ -0,0 +1,12 @@

package at.procon.dip.embedding.config;

import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableScheduling;

@Configuration
@EnableScheduling
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
public class EmbeddingJobSchedulingConfiguration {
}
@ -0,0 +1,14 @@

package at.procon.dip.embedding.config;

import lombok.Data;

@Data
public class EmbeddingPolicyCondition {
    private String documentType;
    private String documentFamily;
    private String sourceType;
    private String mimeType;
    private String language;
    private String ownerTenantKey;
    private String embeddingPolicyHint;
}
@ -0,0 +1,16 @@

package at.procon.dip.embedding.config;

import java.util.ArrayList;
import java.util.List;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.embedding.policies")
@Data
public class EmbeddingPolicyProperties {

    private EmbeddingPolicyUse defaultPolicy = new EmbeddingPolicyUse();
    private List<EmbeddingPolicyRule> rules = new ArrayList<>();
}
@ -0,0 +1,10 @@

package at.procon.dip.embedding.config;

import lombok.Data;

@Data
public class EmbeddingPolicyRule {
    private String name;
    private EmbeddingPolicyCondition when = new EmbeddingPolicyCondition();
    private EmbeddingPolicyUse use = new EmbeddingPolicyUse();
}
@ -0,0 +1,12 @@

package at.procon.dip.embedding.config;

import lombok.Data;

@Data
public class EmbeddingPolicyUse {
    private String policyKey;
    private String modelKey;
    private String queryModelKey;
    private String profileKey;
    private boolean enabled = true;
}
@ -0,0 +1,23 @@

package at.procon.dip.embedding.config;

import at.procon.dip.domain.document.RepresentationType;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.embedding.profiles")
@Data
public class EmbeddingProfileProperties {

    private Map<String, ProfileDefinition> definitions = new LinkedHashMap<>();

    @Data
    public static class ProfileDefinition {
        private List<RepresentationType> embedRepresentationTypes = new ArrayList<>();
    }
}
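Given the property classes above (`dip.embedding.policies` with `default-policy` and `rules`, `dip.embedding.profiles` with `definitions`), a bound configuration might look roughly like this. The structure follows the declared fields via Spring's relaxed binding, but the concrete keys, model names, and representation types are illustrative assumptions:

```yaml
# Hypothetical example -- shape follows the property classes above,
# but the model/profile keys and representation types are made up.
dip:
  embedding:
    policies:
      default-policy:
        model-key: e5-default
        enabled: true
      rules:
        - name: ted-notices
          when:
            document-type: TED_NOTICE
          use:
            model-key: e5-default
            profile-key: full-text
    profiles:
      definitions:
        full-text:
          embed-representation-types:
            - PLAIN_TEXT
```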
@ -0,0 +1,32 @@
package at.procon.dip.embedding.job;

import at.procon.dip.embedding.service.RepresentationEmbeddingOrchestrator;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
@RequiredArgsConstructor
@Slf4j
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@ConditionalOnProperty(prefix = "dip.embedding.jobs", name = "enabled", havingValue = "true")
public class EmbeddingJobScheduler {

    private final RepresentationEmbeddingOrchestrator orchestrator;

    @Scheduled(fixedDelayString = "${dip.embedding.jobs.scheduler-delay-ms:5000}")
    public void processNextBatch() {
        try {
            int processed = orchestrator.processNextReadyBatch();
            if (processed > 0) {
                log.debug("NEW runtime embedding job scheduler processed {} job(s)", processed);
            }
        } catch (Exception ex) {
            log.warn("NEW runtime embedding job scheduler failed: {}", ex.getMessage(), ex);
        }
    }
}
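Both configuration keys the scheduler reads are visible in its annotations, so enabling it in `application-new.yml` looks like this (the bean additionally requires `RuntimeMode.NEW` via `@ConditionalOnRuntimeMode`):

```yaml
dip:
  embedding:
    jobs:
      enabled: true              # gates the bean via @ConditionalOnProperty
      scheduler-delay-ms: 5000   # fixed delay between batches; 5000 is the code default
```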
@ -0,0 +1,10 @@
package at.procon.dip.embedding.policy;

public record EmbeddingPolicy(
        String policyKey,
        String modelKey,
        String queryModelKey,
        String profileKey,
        boolean enabled
) {
}
@ -0,0 +1,13 @@
package at.procon.dip.embedding.policy;

import at.procon.dip.domain.document.RepresentationType;
import java.util.List;

public record EmbeddingProfile(
        String profileKey,
        List<RepresentationType> embedRepresentationTypes
) {
    public boolean includes(RepresentationType representationType) {
        return embedRepresentationTypes != null && embedRepresentationTypes.contains(representationType);
    }
}
@ -0,0 +1,68 @@
package at.procon.dip.embedding.provider.http;

import at.procon.dip.embedding.model.ResolvedEmbeddingProviderConfig;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import lombok.RequiredArgsConstructor;

@RequiredArgsConstructor
abstract class AbstractHttpEmbeddingProviderSupport {

    protected final ObjectMapper objectMapper;
    protected final HttpClient httpClient = HttpClient.newBuilder()
            .version(HttpClient.Version.HTTP_1_1)
            .build();

    protected String trimTrailingSlash(String value) {
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException("Embedding provider baseUrl must be configured");
        }
        return value.endsWith("/") ? value.substring(0, value.length() - 1) : value;
    }

    protected HttpResponse<String> postJson(ResolvedEmbeddingProviderConfig providerConfig,
                                            String path,
                                            Object body) throws IOException, InterruptedException {
        HttpRequest.Builder builder = HttpRequest.newBuilder()
                .uri(URI.create(trimTrailingSlash(providerConfig.baseUrl()) + path))
                .timeout(providerConfig.readTimeout() == null ? Duration.ofSeconds(60) : providerConfig.readTimeout())
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        objectMapper.writeValueAsString(body),
                        StandardCharsets.UTF_8
                ));

        if (providerConfig.apiKey() != null && !providerConfig.apiKey().isBlank()) {
            builder.header("Authorization", "Bearer " + providerConfig.apiKey());
        }
        if (providerConfig.headers() != null) {
            providerConfig.headers().forEach(builder::header);
        }

        HttpResponse<String> response = httpClient.send(
                builder.build(),
                HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8)
        );
        if (response.statusCode() / 100 != 2) {
            throw new IllegalStateException(
                    "Embedding provider returned status %d: %s".formatted(response.statusCode(), response.body())
            );
        }
        return response;
    }

    protected float[] toArray(List<Float> embedding) {
        float[] result = new float[embedding.size()];
        for (int i = 0; i < embedding.size(); i++) {
            result[i] = embedding.get(i);
        }
        return result;
    }
}
@ -0,0 +1,374 @@
package at.procon.dip.embedding.provider.http;

import at.procon.dip.embedding.model.EmbeddingModelDescriptor;
import at.procon.dip.embedding.model.EmbeddingProviderResult;
import at.procon.dip.embedding.model.EmbeddingRequest;
import at.procon.dip.embedding.model.ResolvedEmbeddingProviderConfig;
import at.procon.dip.embedding.provider.EmbeddingProvider;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import org.springframework.stereotype.Component;

/**
 * HTTP provider for vector APIs.
 *
 * Supported endpoints:
 *   POST {baseUrl}/vector-sync      - single text
 *   POST {baseUrl}/vectorize-batch  - multiple texts
 */
@Component
public class VectorSyncHttpEmbeddingProvider extends AbstractHttpEmbeddingProviderSupport implements EmbeddingProvider {

    private static final String PROVIDER_TYPE = "http-vector-sync";

    private static final boolean DEFAULT_TRUNCATE_TEXT = false;
    private static final int DEFAULT_TRUNCATE_LENGTH = 512;
    private static final int DEFAULT_CHUNK_SIZE = 20;

    private static final List<String> TRUNCATE_TEXT_KEYS = List.of(
            "vectorize-batch.truncate-text",
            "vectorize-batch.truncate_text",
            "truncate_text",
            "truncate-text",
            "truncateText"
    );

    private static final List<String> TRUNCATE_LENGTH_KEYS = List.of(
            "vectorize-batch.truncate-length",
            "vectorize-batch.truncate_length",
            "truncate_length",
            "truncate-length",
            "truncateLength"
    );

    private static final List<String> CHUNK_SIZE_KEYS = List.of(
            "vectorize-batch.chunk-size",
            "vectorize-batch.chunk_size",
            "chunk_size",
            "chunk-size",
            "chunkSize"
    );

    public VectorSyncHttpEmbeddingProvider(ObjectMapper objectMapper) {
        super(objectMapper);
    }

    @Override
    public String providerType() {
        return PROVIDER_TYPE;
    }

    @Override
    public boolean supports(EmbeddingModelDescriptor model, ResolvedEmbeddingProviderConfig providerConfig) {
        return PROVIDER_TYPE.equalsIgnoreCase(providerConfig.providerType());
    }

    @Override
    public EmbeddingProviderResult embedDocuments(ResolvedEmbeddingProviderConfig providerConfig,
                                                  EmbeddingModelDescriptor model,
                                                  EmbeddingRequest request) {
        return execute(providerConfig, model, request);
    }

    @Override
    public EmbeddingProviderResult embedQuery(ResolvedEmbeddingProviderConfig providerConfig,
                                              EmbeddingModelDescriptor model,
                                              EmbeddingRequest request) {
        return execute(providerConfig, model, request);
    }

    private EmbeddingProviderResult execute(ResolvedEmbeddingProviderConfig providerConfig,
                                            EmbeddingModelDescriptor model,
                                            EmbeddingRequest request) {
        if (request.texts() == null || request.texts().isEmpty()) {
            throw new IllegalArgumentException("Embedding request texts must not be empty");
        }

        try {
            return request.texts().size() == 1
                    ? executeSingle(providerConfig, model, request.texts().getFirst())
                    : executeBatch(providerConfig, model, request);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Embedding provider call interrupted", e);
        } catch (IOException e) {
            throw new IllegalStateException("Failed to call embedding provider", e);
        }
    }

    private EmbeddingProviderResult executeSingle(ResolvedEmbeddingProviderConfig providerConfig,
                                                  EmbeddingModelDescriptor model,
                                                  String text) throws IOException, InterruptedException {
        HttpResponse<String> response = postJson(
                providerConfig,
                "/vector-sync",
                new VectorSyncRequest(model.providerModelKey(), text)
        );

        VectorSyncResponse parsed = objectMapper.readValue(response.body(), VectorSyncResponse.class);
        float[] vector = extractVector(parsed.vector, parsed.combinedVector, model);

        return new EmbeddingProviderResult(
                model,
                List.of(vector),
                List.of(),
                null,
                parsed.tokenCount
        );
    }

    private EmbeddingProviderResult executeBatch(ResolvedEmbeddingProviderConfig providerConfig,
                                                 EmbeddingModelDescriptor model,
                                                 EmbeddingRequest request) throws IOException, InterruptedException {
        BatchRequestSettings settings = resolveBatchRequestSettings(providerConfig, request.providerOptions());

        if (settings.truncateLength() <= 0) {
            throw new IllegalArgumentException("Batch truncate length must be > 0");
        }
        if (settings.chunkSize() <= 0) {
            throw new IllegalArgumentException("Batch chunk size must be > 0");
        }

        List<String> requestOrder = new ArrayList<>(request.texts().size());
        List<VectorizeBatchItemRequest> items = new ArrayList<>(request.texts().size());

        for (String text : request.texts()) {
            String id = UUID.randomUUID().toString();
            requestOrder.add(id);
            items.add(new VectorizeBatchItemRequest(id, text));
        }

        HttpResponse<String> response = postJson(
                providerConfig,
                "/vectorize-batch",
                new VectorizeBatchRequest(
                        model.providerModelKey(),
                        settings.truncateText(),
                        settings.truncateLength(),
                        settings.chunkSize(),
                        items
                )
        );

        VectorizeBatchResponse parsed = objectMapper.readValue(response.body(), VectorizeBatchResponse.class);
        if (parsed.results == null || parsed.results.isEmpty()) {
            throw new IllegalStateException("Vectorize-batch provider returned no results");
        }

        Map<String, VectorizeBatchItemResponse> resultById = new HashMap<>();
        for (VectorizeBatchItemResponse result : parsed.results) {
            resultById.put(result.id, result);
        }

        List<float[]> vectors = new ArrayList<>(request.texts().size());
        int totalTokenCount = 0;
        boolean hasAnyTokenCount = false;

        for (String id : requestOrder) {
            VectorizeBatchItemResponse item = resultById.get(id);
            if (item == null) {
                throw new IllegalStateException("Vectorize-batch provider response is missing item for id " + id);
            }

            vectors.add(extractVector(item.vector, item.combinedVector, model));

            if (item.tokenCount != null) {
                totalTokenCount += item.tokenCount;
                hasAnyTokenCount = true;
            }
        }

        return new EmbeddingProviderResult(
                model,
                vectors,
                List.of(),
                null,
                hasAnyTokenCount ? totalTokenCount : null
        );
    }

    private BatchRequestSettings resolveBatchRequestSettings(ResolvedEmbeddingProviderConfig providerConfig,
                                                             Map<String, Object> providerOptions) {
        boolean truncateText = resolveBooleanOption(
                providerOptions,
                TRUNCATE_TEXT_KEYS,
                providerConfig.batchTruncateText() != null ? providerConfig.batchTruncateText() : DEFAULT_TRUNCATE_TEXT
        );
        int truncateLength = resolveIntOption(
                providerOptions,
                TRUNCATE_LENGTH_KEYS,
                providerConfig.batchTruncateLength() != null ? providerConfig.batchTruncateLength() : DEFAULT_TRUNCATE_LENGTH
        );
        int chunkSize = resolveIntOption(
                providerOptions,
                CHUNK_SIZE_KEYS,
                providerConfig.batchChunkSize() != null ? providerConfig.batchChunkSize() : DEFAULT_CHUNK_SIZE
        );
        return new BatchRequestSettings(truncateText, truncateLength, chunkSize);
    }

    private boolean resolveBooleanOption(Map<String, Object> providerOptions,
                                         List<String> keys,
                                         boolean defaultValue) {
        Object raw = resolveOption(providerOptions, keys);
        if (raw == null) {
            return defaultValue;
        }
        if (raw instanceof Boolean booleanValue) {
            return booleanValue;
        }
        String normalized = String.valueOf(raw).trim();
        if (normalized.isEmpty()) {
            return defaultValue;
        }
        return Boolean.parseBoolean(normalized);
    }

    private int resolveIntOption(Map<String, Object> providerOptions,
                                 List<String> keys,
                                 int defaultValue) {
        Object raw = resolveOption(providerOptions, keys);
        if (raw == null) {
            return defaultValue;
        }
        if (raw instanceof Number number) {
            return number.intValue();
        }
        String normalized = String.valueOf(raw).trim();
        if (normalized.isEmpty()) {
            return defaultValue;
        }
        return Integer.parseInt(normalized);
    }

    private Object resolveOption(Map<String, Object> providerOptions, List<String> keys) {
        if (providerOptions == null || providerOptions.isEmpty()) {
            return null;
        }
        for (String key : keys) {
            if (providerOptions.containsKey(key)) {
                return providerOptions.get(key);
            }
        }
        return null;
    }

    private float[] extractVector(List<Float> vector,
                                  List<Float> combinedVector,
                                  EmbeddingModelDescriptor model) {
        float[] resolved;

        if (combinedVector != null && !combinedVector.isEmpty()) {
            resolved = toArray(combinedVector);
        } else if (vector != null && !vector.isEmpty()) {
            resolved = toArray(vector);
        } else {
            throw new IllegalStateException("Embedding provider returned no vector");
        }

        if (model.dimensions() > 0 && resolved.length != model.dimensions()) {
            throw new IllegalStateException(
                    "Embedding provider returned dimension %d for model %s, expected %d"
                            .formatted(resolved.length, model.modelKey(), model.dimensions())
            );
        }

        return resolved;
    }

    private record BatchRequestSettings(boolean truncateText, int truncateLength, int chunkSize) {
    }

    private record VectorSyncRequest(
            @JsonProperty("model") String model,
            @JsonProperty("text") String text
    ) {
    }

    private record VectorizeBatchRequest(
            @JsonProperty("model") String model,
            @JsonProperty("truncate_text") boolean truncateText,
            @JsonProperty("truncate_length") int truncateLength,
            @JsonProperty("chunk_size") int chunkSize,
            @JsonProperty("items") List<VectorizeBatchItemRequest> items
    ) {
    }

    private record VectorizeBatchItemRequest(
            @JsonProperty("id") String id,
            @JsonProperty("text") String text
    ) {
    }

    static class VectorSyncResponse {
        @JsonProperty("runtime_ms")
        public Double runtimeMs;

        @JsonProperty("vector")
        public List<Float> vector;

        @JsonProperty("incomplete")
        public Boolean incomplete;

        @JsonProperty("combined_vector")
        public List<Float> combinedVector;

        @JsonProperty("token_count")
        public Integer tokenCount;

        @JsonProperty("model")
        public String model;

        @JsonProperty("max_seq_length")
        public Integer maxSeqLength;
    }

    static class VectorizeBatchResponse {
        @JsonProperty("model")
        public String model;

        @JsonProperty("count")
        public Integer count;

        @JsonProperty("results")
        public List<VectorizeBatchItemResponse> results;
    }

    static class VectorizeBatchItemResponse {
        @JsonProperty("id")
        public String id;

        @JsonProperty("vector")
        public List<Float> vector;

        @JsonProperty("token_count")
        public Integer tokenCount;

        @JsonProperty("runtime_ms")
        public Double runtimeMs;

        @JsonProperty("incomplete")
        public Boolean incomplete;

        @JsonProperty("combined_vector")
        public List<Float> combinedVector;

        @JsonProperty("truncated")
        public Boolean truncated;

        @JsonProperty("truncate_length")
        public Integer truncateLength;

        @JsonProperty("model")
        public String model;

        @JsonProperty("max_seq_length")
        public Integer maxSeqLength;
    }
}
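The wire format of a batch call follows from the `@JsonProperty` annotations on `VectorizeBatchRequest` and `VectorizeBatchItemRequest`. A sketch of the request body, with illustrative values (the `512`/`20`/`false` defaults come from the constants in the class; the ids are the random UUIDs the provider generates per text):

```json
{
  "model": "example-provider-model-key",
  "truncate_text": false,
  "truncate_length": 512,
  "chunk_size": 20,
  "items": [
    { "id": "uuid-1", "text": "first text to embed" },
    { "id": "uuid-2", "text": "second text to embed" }
  ]
}
```

Each item in the response's `results` array is matched back to its request text by `id`, so the provider may return items in any order.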
@ -0,0 +1,131 @@
package at.procon.dip.embedding.service;

import at.procon.dip.domain.document.entity.Document;
import at.procon.dip.embedding.config.EmbeddingPolicyCondition;
import at.procon.dip.embedding.config.EmbeddingPolicyProperties;
import at.procon.dip.embedding.config.EmbeddingPolicyRule;
import at.procon.dip.embedding.config.EmbeddingPolicyUse;
import at.procon.dip.embedding.policy.EmbeddingPolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor;
import java.util.Map;
import java.util.Objects;
import java.util.regex.Pattern;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class DefaultEmbeddingPolicyResolver implements EmbeddingPolicyResolver {

    private final EmbeddingPolicyProperties properties;

    @Override
    public EmbeddingPolicy resolve(Document document, SourceDescriptor sourceDescriptor) {
        String overridePolicy = attributeValue(sourceDescriptor, "embeddingPolicyKey");
        if (overridePolicy != null) {
            return policyByKey(overridePolicy);
        }

        String policyHint = policyHint(sourceDescriptor);
        if (policyHint != null) {
            return policyByKey(policyHint);
        }

        for (EmbeddingPolicyRule rule : properties.getRules()) {
            if (matches(rule.getWhen(), document, sourceDescriptor)) {
                return toPolicy(rule.getUse());
            }
        }

        return toPolicy(properties.getDefaultPolicy());
    }

    private EmbeddingPolicy policyByKey(String policyKey) {
        for (EmbeddingPolicyRule rule : properties.getRules()) {
            if (rule.getUse() != null && policyKey.equals(rule.getUse().getPolicyKey())) {
                return toPolicy(rule.getUse());
            }
        }
        EmbeddingPolicyUse def = properties.getDefaultPolicy();
        if (def != null && policyKey.equals(def.getPolicyKey())) {
            return toPolicy(def);
        }
        throw new IllegalArgumentException("Unknown embedding policy key: " + policyKey);
    }

    private EmbeddingPolicy toPolicy(EmbeddingPolicyUse use) {
        if (use == null) {
            throw new IllegalStateException("Embedding policy configuration is missing");
        }
        return new EmbeddingPolicy(
                use.getPolicyKey(),
                use.getModelKey(),
                use.getQueryModelKey(),
                use.getProfileKey(),
                use.isEnabled()
        );
    }

    private boolean matches(EmbeddingPolicyCondition c, Document document, SourceDescriptor sourceDescriptor) {
        if (c == null) {
            return true;
        }

        if (!matchesExact(c.getDocumentType(), enumName(document != null ? document.getDocumentType() : null))) {
            return false;
        }
        if (!matchesExact(c.getDocumentFamily(), enumName(document != null ? document.getDocumentFamily() : null))) {
            return false;
        }
        if (!matchesExact(c.getSourceType(), enumName(sourceDescriptor != null ? sourceDescriptor.sourceType() : null))) {
            return false;
        }
        if (!matchesMime(c.getMimeType(), sourceDescriptor != null ? sourceDescriptor.mediaType() : null)) {
            return false;
        }
        if (!matchesExact(c.getLanguage(), document != null ? document.getLanguageCode() : null)) {
            return false;
        }
        if (!matchesExact(c.getOwnerTenantKey(), document != null && document.getOwnerTenant() != null ? document.getOwnerTenant().getTenantKey() : null)) {
            return false;
        }
        return matchesExact(c.getEmbeddingPolicyHint(), policyHint(sourceDescriptor));
    }

    private boolean matchesExact(String expected, String actual) {
        if (expected == null || expected.isBlank()) {
            return true;
        }
        return Objects.equals(expected, actual);
    }

    private boolean matchesMime(String pattern, String actual) {
        if (pattern == null || pattern.isBlank()) {
            return true;
        }
        if (actual == null || actual.isBlank()) {
            return false;
        }
        return Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(actual).matches();
    }

    private String enumName(Enum<?> value) {
        return value != null ? value.name() : null;
    }

    private String policyHint(SourceDescriptor sourceDescriptor) {
        return attributeValue(sourceDescriptor, "embeddingPolicyHint");
    }

    private String attributeValue(SourceDescriptor sourceDescriptor, String key) {
        if (sourceDescriptor == null) {
            return null;
        }
        Map<String, String> attributes = sourceDescriptor.attributes();
        if (attributes == null) {
            return null;
        }
        String value = attributes.get(key);
        return (value == null || value.isBlank()) ? null : value;
    }
}
@ -0,0 +1,31 @@
package at.procon.dip.embedding.service;

import at.procon.dip.embedding.config.EmbeddingProfileProperties;
import at.procon.dip.embedding.policy.EmbeddingProfile;
import java.util.List;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class DefaultEmbeddingProfileResolver implements EmbeddingProfileResolver {

    private final EmbeddingProfileProperties properties;

    @Override
    public EmbeddingProfile resolve(String profileKey) {
        if (profileKey == null || profileKey.isBlank()) {
            throw new IllegalArgumentException("Embedding profile key must not be blank");
        }

        EmbeddingProfileProperties.ProfileDefinition definition = properties.getDefinitions().get(profileKey);
        if (definition == null) {
            throw new IllegalArgumentException("Unknown embedding profile: " + profileKey);
        }

        return new EmbeddingProfile(
                profileKey,
                List.copyOf(definition.getEmbedRepresentationTypes())
        );
    }
}
@ -0,0 +1,9 @@
package at.procon.dip.embedding.service;

import at.procon.dip.domain.document.entity.Document;
import at.procon.dip.embedding.policy.EmbeddingPolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor;

public interface EmbeddingPolicyResolver {
    EmbeddingPolicy resolve(Document document, SourceDescriptor sourceDescriptor);
}
@ -0,0 +1,7 @@
package at.procon.dip.embedding.service;

import at.procon.dip.embedding.policy.EmbeddingProfile;

public interface EmbeddingProfileResolver {
    EmbeddingProfile resolve(String profileKey);
}
@ -0,0 +1,328 @@
package at.procon.dip.ingestion.camel;

import at.procon.dip.domain.document.SourceType;
import at.procon.dip.ingestion.config.TedPackageDownloadProperties;
import at.procon.dip.ingestion.service.DocumentIngestionGateway;
import at.procon.dip.ingestion.spi.IngestionResult;
import at.procon.dip.ingestion.spi.OriginalContentStoragePolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import at.procon.ted.model.entity.TedDailyPackage;
import at.procon.ted.repository.TedDailyPackageRepository;
import at.procon.dip.domain.ted.service.TedPackageSequenceService;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.time.OffsetDateTime;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.apache.camel.Exchange;
import org.apache.camel.LoggingLevel;
import org.apache.camel.builder.RouteBuilder;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.stereotype.Component;

/**
 * NEW-runtime TED daily package download route.
 * <p>
 * Reuses the proven package sequencing rules through {@link TedPackageSequenceService},
 * but hands off processing only to the NEW ingestion gateway. No legacy XML batch persistence,
 * no legacy vectorization route, no old semantic path.
 */
@Component
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@ConditionalOnProperty(name = "dip.ingestion.ted-download.enabled", havingValue = "true")
@RequiredArgsConstructor
@Slf4j
public class TedPackageDownloadRoute extends RouteBuilder {

    private static final String ROUTE_ID_SCHEDULER = "ted-package-new-scheduler";
    private static final String ROUTE_ID_DOWNLOADER = "ted-package-new-downloader";
    private static final String ROUTE_ID_ERROR = "ted-package-new-error-handler";

    private final TedPackageDownloadProperties properties;
    private final TedDailyPackageRepository packageRepository;
    private final TedPackageSequenceService sequenceService;
    private final DocumentIngestionGateway documentIngestionGateway;

    @Override
    public void configure() {
        errorHandler(deadLetterChannel("direct:ted-package-new-error")
                .maximumRedeliveries(3)
                .redeliveryDelay(10_000)
                .retryAttemptedLogLevel(LoggingLevel.WARN)
                .logStackTrace(true));

        from("direct:ted-package-new-error")
                .routeId(ROUTE_ID_ERROR)
                .process(this::handleError);

        from("timer:ted-package-new-scheduler?period={{dip.ingestion.ted-download.poll-interval:3600000}}&delay=0")
                .routeId(ROUTE_ID_SCHEDULER)
                .process(this::checkRunningPackages)
                .choice()
                    .when(header("tooManyRunning").isEqualTo(true))
                        .log(LoggingLevel.INFO, "Skipping NEW TED package download - already ${header.runningCount} packages in progress")
                    .otherwise()
                        .process(this::determineNextPackage)
                        .choice()
                            .when(header("packageId").isNotNull())
                                .to("direct:download-ted-package-new")
                            .otherwise()
                                .log(LoggingLevel.INFO, "No NEW TED package to download right now")
                        .end()
                .end();

        from("direct:download-ted-package-new")
                .routeId(ROUTE_ID_DOWNLOADER)
                .log(LoggingLevel.INFO, "NEW TED package download started: ${header.packageId}")
                .setHeader("downloadStartTime", constant(System.currentTimeMillis()))
                .process(this::createPackageRecord)
                .delay(simple("{{dip.ingestion.ted-download.delay-between-downloads:5000}}"))
                .setHeader(Exchange.HTTP_METHOD, constant("GET"))
                .setHeader("CamelHttpConnectionClose", constant(true))
                .toD("${header.downloadUrl}?bridgeEndpoint=true&throwExceptionOnFailure=false&socketTimeout={{dip.ingestion.ted-download.download-timeout:300000}}")
                .choice()
                    .when(header(Exchange.HTTP_RESPONSE_CODE).isEqualTo(200))
                        .process(this::calculateHash)
                        .process(this::checkDuplicateByHash)
                        .choice()
                            .when(header("isDuplicate").isEqualTo(true))
                                .process(this::markDuplicate)
                            .otherwise()
                                .process(this::saveDownloadedPackage)
                                .process(this::ingestThroughGateway)
                                .process(this::markCompleted)
                        .endChoice()
                    .when(header(Exchange.HTTP_RESPONSE_CODE).isEqualTo(404))
                        .process(this::markNotFound)
                    .otherwise()
                        .process(this::markFailed)
                .end();
    }

    private void checkRunningPackages(Exchange exchange) {
        long downloadingCount = packageRepository.findByDownloadStatus(TedDailyPackage.DownloadStatus.DOWNLOADING).size();
        long processingCount = packageRepository.findByDownloadStatus(TedDailyPackage.DownloadStatus.PROCESSING).size();
        long runningCount = downloadingCount + processingCount;

        exchange.getIn().setHeader("runningCount", runningCount);
        exchange.getIn().setHeader("tooManyRunning", runningCount >= properties.getMaxRunningPackages());

        if (runningCount > 0) {
            log.info("Currently {} TED packages in progress in NEW runtime ({} downloading, {} processing)",
                    runningCount, downloadingCount, processingCount);
        }
    }

    private void determineNextPackage(Exchange exchange) {
        List<TedDailyPackage> pendingPackages = packageRepository.findByDownloadStatus(TedDailyPackage.DownloadStatus.PENDING);

        if (!pendingPackages.isEmpty()) {
            TedDailyPackage pkg = pendingPackages.get(0);
            log.info("Retrying PENDING TED package in NEW runtime: {}", pkg.getPackageIdentifier());
            setPackageHeaders(exchange, pkg.getYear(), pkg.getSerialNumber());
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
TedPackageSequenceService.PackageInfo packageInfo = sequenceService.getNextPackageToDownload();
|
||||||
|
if (packageInfo == null) {
|
||||||
|
exchange.getIn().setHeader("packageId", null);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
setPackageHeaders(exchange, packageInfo.year(), packageInfo.serialNumber());
|
||||||
|
}
|
||||||
|
|
||||||
|
private void setPackageHeaders(Exchange exchange, int year, int serialNumber) {
|
||||||
|
String packageId = "%04d%05d".formatted(year, serialNumber);
|
||||||
|
String downloadUrl = properties.getBaseUrl() + packageId;
|
||||||
|
|
||||||
|
exchange.getIn().setHeader("packageId", packageId);
|
||||||
|
exchange.getIn().setHeader("year", year);
|
||||||
|
exchange.getIn().setHeader("serialNumber", serialNumber);
|
||||||
|
exchange.getIn().setHeader("downloadUrl", downloadUrl);
|
||||||
|
}
|
||||||
|
|
||||||
|
private void createPackageRecord(Exchange exchange) {
|
||||||
|
String packageId = exchange.getIn().getHeader("packageId", String.class);
|
||||||
|
Integer year = exchange.getIn().getHeader("year", Integer.class);
|
||||||
|
Integer serialNumber = exchange.getIn().getHeader("serialNumber", Integer.class);
|
||||||
|
String downloadUrl = exchange.getIn().getHeader("downloadUrl", String.class);
|
||||||
|
|
||||||
|
if (packageRepository.existsByPackageIdentifier(packageId)) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
TedDailyPackage pkg = TedDailyPackage.builder()
|
||||||
|
.packageIdentifier(packageId)
|
||||||
|
.year(year)
|
||||||
|
.serialNumber(serialNumber)
|
||||||
|
.downloadUrl(downloadUrl)
|
||||||
|
.downloadStatus(TedDailyPackage.DownloadStatus.DOWNLOADING)
|
||||||
|
.build();
|
||||||
|
|
||||||
|
packageRepository.save(pkg);
|
||||||
|
}
|
||||||
|
|
||||||
|
private void calculateHash(Exchange exchange) throws Exception {
|
||||||
|
byte[] body = exchange.getIn().getBody(byte[].class);
|
||||||
|
MessageDigest digest = MessageDigest.getInstance("SHA-256");
|
||||||
|
byte[] hashBytes = digest.digest(body);
|
||||||
|
|
||||||
|
StringBuilder sb = new StringBuilder();
|
||||||
|
for (byte b : hashBytes) {
|
||||||
|
sb.append(String.format("%02x", b));
|
||||||
|
}
|
||||||
|
|
||||||
|
exchange.getIn().setHeader("fileHash", sb.toString());
|
||||||
|
}
|
||||||
|
|
||||||
|
private void checkDuplicateByHash(Exchange exchange) {
|
||||||
|
String hash = exchange.getIn().getHeader("fileHash", String.class);
|
||||||
|
|
||||||
|
Optional<TedDailyPackage> duplicate = packageRepository.findAll().stream()
|
||||||
|
.filter(p -> hash.equals(p.getFileHash()))
|
||||||
|
.findFirst();
|
||||||
|
|
||||||
|
exchange.getIn().setHeader("isDuplicate", duplicate.isPresent());
|
||||||
|
duplicate.ifPresent(pkg -> exchange.getIn().setHeader("duplicateOf", pkg.getPackageIdentifier()));
|
||||||
|
}
|
||||||
|
|
||||||
|
private void saveDownloadedPackage(Exchange exchange) throws Exception {
|
||||||
|
String packageId = exchange.getIn().getHeader("packageId", String.class);
|
||||||
|
String hash = exchange.getIn().getHeader("fileHash", String.class);
|
||||||
|
byte[] body = exchange.getIn().getBody(byte[].class);
|
||||||
|
|
||||||
|
Path downloadDir = Paths.get(properties.getDownloadDirectory());
|
||||||
|
Files.createDirectories(downloadDir);
|
||||||
|
Path downloadPath = downloadDir.resolve(packageId + ".tar.gz");
|
||||||
|
Files.write(downloadPath, body);
|
||||||
|
|
||||||
|
long downloadDuration = System.currentTimeMillis() -
|
||||||
|
exchange.getIn().getHeader("downloadStartTime", Long.class);
|
||||||
|
|
||||||
|
packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
|
||||||
|
pkg.setFileHash(hash);
|
||||||
|
pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.DOWNLOADED);
|
||||||
|
pkg.setDownloadedAt(OffsetDateTime.now());
|
||||||
|
pkg.setDownloadDurationMs(downloadDuration);
|
||||||
|
packageRepository.save(pkg);
|
||||||
|
});
|
||||||
|
|
||||||
|
exchange.getIn().setHeader("downloadPath", downloadPath.toString());
|
||||||
|
}
|
||||||
|
|
||||||
|
private void ingestThroughGateway(Exchange exchange) {
|
||||||
|
String packageId = exchange.getIn().getHeader("packageId", String.class);
|
||||||
|
String downloadPath = exchange.getIn().getHeader("downloadPath", String.class);
|
||||||
|
|
||||||
|
packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
|
||||||
|
pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.PROCESSING);
|
||||||
|
packageRepository.save(pkg);
|
||||||
|
});
|
||||||
|
|
||||||
|
IngestionResult ingestionResult = documentIngestionGateway.ingest(new SourceDescriptor(
|
||||||
|
null,
|
||||||
|
SourceType.TED_PACKAGE,
|
||||||
|
packageId,
|
||||||
|
downloadPath,
|
||||||
|
packageId + ".tar.gz",
|
||||||
|
"application/gzip",
|
||||||
|
null,
|
||||||
|
null,
|
||||||
|
OffsetDateTime.now(),
|
||||||
|
OriginalContentStoragePolicy.DEFAULT,
|
||||||
|
Map.of(
|
||||||
|
"packageId", packageId,
|
||||||
|
"title", packageId + ".tar.gz"
|
||||||
|
)
|
||||||
|
));
|
||||||
|
|
||||||
|
int importedChildCount = Math.max(0, ingestionResult.documents().size() - 1);
|
||||||
|
exchange.getIn().setHeader("gatewayImportedChildCount", importedChildCount);
|
||||||
|
exchange.getIn().setHeader("gatewayImportWarnings", ingestionResult.warnings().size());
|
||||||
|
|
||||||
|
packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
|
||||||
|
pkg.setXmlFileCount(importedChildCount);
|
||||||
|
pkg.setProcessedCount(importedChildCount);
|
||||||
|
pkg.setFailedCount(0);
|
||||||
|
packageRepository.save(pkg);
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
private void markCompleted(Exchange exchange) throws Exception {
|
||||||
|
String packageId = exchange.getIn().getHeader("packageId", String.class);
|
||||||
|
String downloadPath = exchange.getIn().getHeader("downloadPath", String.class);
|
||||||
|
|
||||||
|
packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
|
||||||
|
pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.COMPLETED);
|
||||||
|
pkg.setProcessedAt(OffsetDateTime.now());
|
||||||
|
if (pkg.getDownloadedAt() != null) {
|
||||||
|
long processingDuration = Math.max(0L,
|
||||||
|
java.time.Duration.between(pkg.getDownloadedAt(), OffsetDateTime.now()).toMillis());
|
||||||
|
pkg.setProcessingDurationMs(processingDuration);
|
||||||
|
}
|
||||||
|
packageRepository.save(pkg);
|
||||||
|
});
|
||||||
|
|
||||||
|
if (properties.isDeleteAfterIngestion() && downloadPath != null) {
|
||||||
|
Files.deleteIfExists(Path.of(downloadPath));
|
||||||
|
}
|
||||||
|
|
||||||
|
packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
|
||||||
|
long totalDuration = (pkg.getDownloadDurationMs() != null ? pkg.getDownloadDurationMs() : 0L)
|
||||||
|
+ (pkg.getProcessingDurationMs() != null ? pkg.getProcessingDurationMs() : 0L);
|
||||||
|
log.info("NEW TED package {} completed: xmlCount={}, processed={}, failed={}, totalDuration={}ms",
|
||||||
|
packageId, pkg.getXmlFileCount(), pkg.getProcessedCount(), pkg.getFailedCount(), totalDuration);
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
private void markNotFound(Exchange exchange) {
|
||||||
|
String packageId = exchange.getIn().getHeader("packageId", String.class);
|
||||||
|
packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
|
||||||
|
pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.NOT_FOUND);
|
||||||
|
pkg.setErrorMessage("Package not found (404)");
|
||||||
|
packageRepository.save(pkg);
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
private void markFailed(Exchange exchange) {
|
||||||
|
String packageId = exchange.getIn().getHeader("packageId", String.class);
|
||||||
|
Integer httpCode = exchange.getIn().getHeader(Exchange.HTTP_RESPONSE_CODE, Integer.class);
|
||||||
|
packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
|
||||||
|
pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.FAILED);
|
||||||
|
pkg.setErrorMessage("HTTP " + httpCode);
|
||||||
|
packageRepository.save(pkg);
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
private void markDuplicate(Exchange exchange) {
|
||||||
|
String packageId = exchange.getIn().getHeader("packageId", String.class);
|
||||||
|
String duplicateOf = exchange.getIn().getHeader("duplicateOf", String.class);
|
||||||
|
packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
|
||||||
|
pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.COMPLETED);
|
||||||
|
pkg.setErrorMessage("Duplicate of " + duplicateOf);
|
||||||
|
pkg.setProcessedAt(OffsetDateTime.now());
|
||||||
|
packageRepository.save(pkg);
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
private void handleError(Exchange exchange) {
|
||||||
|
Exception exception = exchange.getProperty(Exchange.EXCEPTION_CAUGHT, Exception.class);
|
||||||
|
String packageId = exchange.getIn().getHeader("packageId", String.class);
|
||||||
|
|
||||||
|
if (packageId != null) {
|
||||||
|
packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
|
||||||
|
pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.FAILED);
|
||||||
|
pkg.setErrorMessage(exception != null ? exception.getMessage() : "Unknown route error");
|
||||||
|
packageRepository.save(pkg);
|
||||||
|
});
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
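The route above leans on two small conventions: `setPackageHeaders` builds a nine-digit package identifier with `"%04d%05d"`, and `calculateHash` hex-encodes a SHA-256 digest of the downloaded bytes for duplicate detection. A minimal standalone sketch of both (the class and method names here are illustrative, not part of the patch):

```java
import java.security.MessageDigest;
import java.util.HexFormat;

// Hypothetical helper mirroring two conventions used by the NEW-runtime route.
public class TedPackageConventions {

    // Mirrors setPackageHeaders: 4-digit year + zero-padded 5-digit serial number.
    static String packageId(int year, int serialNumber) {
        return "%04d%05d".formatted(year, serialNumber);
    }

    // Mirrors calculateHash: lowercase hex SHA-256 of the downloaded bytes.
    static String sha256Hex(byte[] body) throws Exception {
        byte[] hash = MessageDigest.getInstance("SHA-256").digest(body);
        return HexFormat.of().formatHex(hash);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(packageId(2024, 123));   // "202400123"
        System.out.println(sha256Hex(new byte[0])); // 64 lowercase hex chars
    }
}
```

`HexFormat` (Java 17+) yields the same lowercase hex string as the manual `String.format("%02x", b)` loop in the route.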
@@ -0,0 +1,61 @@
package at.procon.dip.ingestion.config;

import at.procon.dip.domain.access.DocumentVisibility;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.ingestion")
@Data
public class DipIngestionProperties {

    private boolean enabled = false;
    private boolean fileSystemEnabled = false;
    private boolean restUploadEnabled = true;
    private String inputDirectory = "/ted.europe/generic-input";
    private String filePattern = ".*\\.(pdf|txt|html|htm|xml|md|markdown|csv|json|yaml|yml)$";
    private String processedDirectory = ".dip-processed";
    private String errorDirectory = ".dip-error";

    @Positive
    private long pollInterval = 15000;

    @Positive
    private int maxMessagesPerPoll = 10;

    private String defaultOwnerTenantKey;
    private DocumentVisibility defaultVisibility = DocumentVisibility.PUBLIC;
    private String defaultLanguageCode;

    private boolean storeOriginalBinaryInDb = true;

    @Positive
    private int maxBinaryBytesInDb = 5242880;

    private boolean deduplicateByContentHash = true;
    private boolean storeOriginalContentForWrapperDocuments = true;
    private boolean vectorizePrimaryRepresentationOnly = true;

    @NotBlank
    private String importBatchId = "phase4-generic";

    private boolean tedPackageAdapterEnabled = true;
    private boolean mailAdapterEnabled = false;

    private String mailDefaultOwnerTenantKey;
    private DocumentVisibility mailDefaultVisibility = DocumentVisibility.TENANT;
    private boolean expandMailZipAttachments = true;

    @NotBlank
    private String tedPackageImportBatchId = "phase41-ted-package";

    private boolean gatewayOnlyForTedPackages = false;

    @NotBlank
    private String mailImportBatchId = "phase41-mail";
}
@@ -0,0 +1,52 @@
package at.procon.dip.ingestion.config;

import jakarta.validation.constraints.Min;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

/**
 * NEW-runtime TED package download configuration.
 * <p>
 * This is intentionally separate from the legacy {@code ted.download.*} tree.
 */
@Configuration
@ConfigurationProperties(prefix = "dip.ingestion.ted-download")
@Data
public class TedPackageDownloadProperties {

    private boolean enabled = false;

    @NotBlank
    private String baseUrl = "https://ted.europa.eu/packages/daily/";

    @NotBlank
    private String downloadDirectory = "/ted.europe/downloads-new";

    @Positive
    private int startYear = 2015;

    @Positive
    private long pollInterval = 3_600_000L;

    @Positive
    private long notFoundRetryInterval = 21_600_000L;

    @Min(0)
    private int previousYearGracePeriodDays = 30;

    private boolean retryCurrentYearNotFoundIndefinitely = true;

    @Positive
    private long downloadTimeout = 300_000L;

    @Positive
    private int maxRunningPackages = 2;

    @Positive
    private long delayBetweenDownloads = 5_000L;

    private boolean deleteAfterIngestion = true;
}
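For reference, the class above binds to keys like the following. This is an illustrative `application-new.yml` fragment, not part of the patch: kebab-case keys follow Spring Boot relaxed binding, and the values shown are simply the class defaults.

```yaml
dip:
  ingestion:
    ted-download:
      enabled: false
      base-url: https://ted.europa.eu/packages/daily/
      download-directory: /ted.europe/downloads-new
      start-year: 2015
      poll-interval: 3600000
      not-found-retry-interval: 21600000
      previous-year-grace-period-days: 30
      retry-current-year-not-found-indefinitely: true
      download-timeout: 300000
      max-running-packages: 2
      delay-between-downloads: 5000
      delete-after-ingestion: true
```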
@@ -0,0 +1,47 @@
package at.procon.dip.migration.audit.config;

import jakarta.validation.constraints.Min;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.migration.legacy-audit")
@Data
public class LegacyTedAuditProperties {

    /**
     * Enables the Wave 1 / Milestone A legacy TED audit subsystem.
     */
    private boolean enabled = true;

    /**
     * Automatically runs the read-only audit on application startup.
     */
    private boolean startupRunEnabled = false;

    /**
     * Maximum number of legacy TED documents to scan during startup.
     * 0 means no limit.
     */
    @Min(0)
    private int startupRunLimit = 500;

    /**
     * Batch size for legacy TED document paging.
     */
    @Min(1)
    private int pageSize = 100;

    /**
     * Hard cap for persisted findings in a single run to avoid runaway audit volume.
     */
    @Min(1)
    private int maxFindingsPerRun = 10000;

    /**
     * Maximum number of duplicate/grouped samples recorded for global aggregate checks.
     */
    @Min(1)
    private int maxDuplicateSamples = 100;
}
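An illustrative YAML fragment for the audit properties above (not part of the patch; keys derived from the field names via Spring Boot relaxed binding, values are the class defaults):

```yaml
dip:
  migration:
    legacy-audit:
      enabled: true
      startup-run-enabled: false
      startup-run-limit: 500
      page-size: 100
      max-findings-per-run: 10000
      max-duplicate-samples: 100
```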
@@ -0,0 +1,87 @@
package at.procon.dip.migration.audit.entity;

import at.procon.dip.architecture.SchemaNames;
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.EnumType;
import jakarta.persistence.Enumerated;
import jakarta.persistence.FetchType;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Index;
import jakarta.persistence.JoinColumn;
import jakarta.persistence.ManyToOne;
import jakarta.persistence.PrePersist;
import jakarta.persistence.Table;
import java.time.OffsetDateTime;
import java.util.UUID;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;

@Entity
@Table(schema = SchemaNames.DOC, name = "doc_legacy_audit_finding", indexes = {
        @Index(name = "idx_doc_legacy_audit_find_run", columnList = "run_id"),
        @Index(name = "idx_doc_legacy_audit_find_type", columnList = "finding_type"),
        @Index(name = "idx_doc_legacy_audit_find_severity", columnList = "severity"),
        @Index(name = "idx_doc_legacy_audit_find_legacy_doc", columnList = "legacy_procurement_document_id"),
        @Index(name = "idx_doc_legacy_audit_find_document", columnList = "document_id")
})
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class LegacyTedAuditFinding {

    @Id
    @GeneratedValue(strategy = GenerationType.UUID)
    private UUID id;

    @ManyToOne(fetch = FetchType.LAZY, optional = false)
    @JoinColumn(name = "run_id", nullable = false)
    private LegacyTedAuditRun run;

    @Enumerated(EnumType.STRING)
    @Column(name = "severity", nullable = false, length = 16)
    private LegacyTedAuditSeverity severity;

    @Enumerated(EnumType.STRING)
    @Column(name = "finding_type", nullable = false, length = 64)
    private LegacyTedAuditFindingType findingType;

    @Column(name = "package_identifier", length = 20)
    private String packageIdentifier;

    @Column(name = "legacy_procurement_document_id")
    private UUID legacyProcurementDocumentId;

    @Column(name = "document_id")
    private UUID documentId;

    @Column(name = "ted_notice_projection_id")
    private UUID tedNoticeProjectionId;

    @Column(name = "reference_key", length = 255)
    private String referenceKey;

    @Column(name = "message", nullable = false, columnDefinition = "TEXT")
    private String message;

    @Column(name = "details_text", columnDefinition = "TEXT")
    private String detailsText;

    @Builder.Default
    @Column(name = "created_at", nullable = false, updatable = false)
    private OffsetDateTime createdAt = OffsetDateTime.now();

    @PrePersist
    protected void onCreate() {
        if (createdAt == null) {
            createdAt = OffsetDateTime.now();
        }
    }
}
@@ -0,0 +1,28 @@
package at.procon.dip.migration.audit.entity;

public enum LegacyTedAuditFindingType {
    PACKAGE_SEQUENCE_GAP,
    PACKAGE_INCOMPLETE,
    PACKAGE_COMPLETED_WITHOUT_PROCESSED_AT,
    PACKAGE_COMPLETED_COUNT_MISMATCH,
    PACKAGE_MISSING_XML_FILE_COUNT,
    PACKAGE_MISSING_FILE_HASH,
    PACKAGE_FAILED_WITHOUT_ERROR_MESSAGE,
    LEGACY_PUBLICATION_ID_DUPLICATE,
    DOC_DEDUP_HASH_DUPLICATE,
    LEGACY_DOCUMENT_MISSING_HASH,
    LEGACY_DOCUMENT_MISSING_XML,
    LEGACY_DOCUMENT_MISSING_TEXT,
    LEGACY_DOCUMENT_MISSING_PUBLICATION_ID,
    DOC_DOCUMENT_MISSING,
    DOC_DOCUMENT_DUPLICATE,
    DOC_SOURCE_MISSING,
    DOC_ORIGINAL_CONTENT_MISSING,
    DOC_ORIGINAL_CONTENT_DUPLICATE,
    DOC_PRIMARY_REPRESENTATION_MISSING,
    DOC_PRIMARY_REPRESENTATION_DUPLICATE,
    TED_PROJECTION_MISSING,
    TED_PROJECTION_MISSING_LEGACY_LINK,
    TED_PROJECTION_DOCUMENT_MISMATCH,
    FINDINGS_TRUNCATED
}
@@ -0,0 +1,110 @@
package at.procon.dip.migration.audit.entity;

import at.procon.dip.architecture.SchemaNames;
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.EnumType;
import jakarta.persistence.Enumerated;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Index;
import jakarta.persistence.PrePersist;
import jakarta.persistence.PreUpdate;
import jakarta.persistence.Table;
import java.time.OffsetDateTime;
import java.util.UUID;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;

@Entity
@Table(schema = SchemaNames.DOC, name = "doc_legacy_audit_run", indexes = {
        @Index(name = "idx_doc_legacy_audit_run_status", columnList = "status"),
        @Index(name = "idx_doc_legacy_audit_run_started", columnList = "started_at")
})
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class LegacyTedAuditRun {

    @Id
    @GeneratedValue(strategy = GenerationType.UUID)
    private UUID id;

    @Enumerated(EnumType.STRING)
    @Column(name = "status", nullable = false, length = 32)
    private LegacyTedAuditRunStatus status;

    @Column(name = "requested_limit")
    private Integer requestedLimit;

    @Column(name = "page_size", nullable = false)
    private Integer pageSize;

    @Column(name = "scanned_packages", nullable = false)
    @Builder.Default
    private Integer scannedPackages = 0;

    @Column(name = "scanned_legacy_documents", nullable = false)
    @Builder.Default
    private Integer scannedLegacyDocuments = 0;

    @Column(name = "finding_count", nullable = false)
    @Builder.Default
    private Integer findingCount = 0;

    @Column(name = "info_count", nullable = false)
    @Builder.Default
    private Integer infoCount = 0;

    @Column(name = "warning_count", nullable = false)
    @Builder.Default
    private Integer warningCount = 0;

    @Column(name = "error_count", nullable = false)
    @Builder.Default
    private Integer errorCount = 0;

    @Column(name = "started_at", nullable = false)
    private OffsetDateTime startedAt;

    @Column(name = "completed_at")
    private OffsetDateTime completedAt;

    @Column(name = "summary_text", columnDefinition = "TEXT")
    private String summaryText;

    @Column(name = "failure_message", columnDefinition = "TEXT")
    private String failureMessage;

    @Builder.Default
    @Column(name = "created_at", nullable = false, updatable = false)
    private OffsetDateTime createdAt = OffsetDateTime.now();

    @Builder.Default
    @Column(name = "updated_at", nullable = false)
    private OffsetDateTime updatedAt = OffsetDateTime.now();

    @PrePersist
    protected void onCreate() {
        if (startedAt == null) {
            startedAt = OffsetDateTime.now();
        }
        if (createdAt == null) {
            createdAt = OffsetDateTime.now();
        }
        if (updatedAt == null) {
            updatedAt = OffsetDateTime.now();
        }
    }

    @PreUpdate
    protected void onUpdate() {
        updatedAt = OffsetDateTime.now();
    }
}
@@ -0,0 +1,7 @@
package at.procon.dip.migration.audit.entity;

public enum LegacyTedAuditRunStatus {
    RUNNING,
    COMPLETED,
    FAILED
}
@@ -0,0 +1,7 @@
package at.procon.dip.migration.audit.entity;

public enum LegacyTedAuditSeverity {
    INFO,
    WARNING,
    ERROR
}
@@ -0,0 +1,8 @@
package at.procon.dip.migration.audit.repository;

import at.procon.dip.migration.audit.entity.LegacyTedAuditFinding;
import java.util.UUID;
import org.springframework.data.jpa.repository.JpaRepository;

public interface LegacyTedAuditFindingRepository extends JpaRepository<LegacyTedAuditFinding, UUID> {
}
@@ -0,0 +1,8 @@
package at.procon.dip.migration.audit.repository;

import at.procon.dip.migration.audit.entity.LegacyTedAuditRun;
import java.util.UUID;
import org.springframework.data.jpa.repository.JpaRepository;

public interface LegacyTedAuditRunRepository extends JpaRepository<LegacyTedAuditRun, UUID> {
}
@@ -0,0 +1,610 @@
package at.procon.dip.migration.audit.service;

import at.procon.dip.migration.audit.config.LegacyTedAuditProperties;
import at.procon.dip.migration.audit.entity.LegacyTedAuditFinding;
import at.procon.dip.migration.audit.entity.LegacyTedAuditFindingType;
import at.procon.dip.migration.audit.entity.LegacyTedAuditRun;
import at.procon.dip.migration.audit.entity.LegacyTedAuditRunStatus;
import at.procon.dip.migration.audit.entity.LegacyTedAuditSeverity;
import at.procon.dip.migration.audit.repository.LegacyTedAuditFindingRepository;
import at.procon.dip.migration.audit.repository.LegacyTedAuditRunRepository;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import at.procon.ted.model.entity.ProcurementDocument;
import at.procon.ted.model.entity.TedDailyPackage;
import at.procon.ted.repository.ProcurementDocumentRepository;
import at.procon.ted.repository.TedDailyPackageRepository;
import java.time.OffsetDateTime;
import java.time.Year;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Sort;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.util.StringUtils;

@Service
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@RequiredArgsConstructor
@Slf4j
public class LegacyTedAuditService {

    private final LegacyTedAuditProperties properties;
    private final TedDailyPackageRepository tedDailyPackageRepository;
    private final ProcurementDocumentRepository procurementDocumentRepository;
    private final LegacyTedAuditRunRepository runRepository;
    private final LegacyTedAuditFindingRepository findingRepository;
    private final JdbcTemplate jdbcTemplate;

    public LegacyTedAuditRun executeAudit() {
        return executeAudit(properties.getStartupRunLimit());
    }

    public LegacyTedAuditRun executeAudit(int requestedLimit) {
        if (!properties.isEnabled()) {
            throw new IllegalStateException("Legacy TED audit is disabled by configuration");
        }

        Integer effectiveLimit = requestedLimit > 0 ? requestedLimit : null;
        int pageSize = properties.getPageSize();
        AuditAccumulator accumulator = new AuditAccumulator();

        LegacyTedAuditRun run = LegacyTedAuditRun.builder()
                .status(LegacyTedAuditRunStatus.RUNNING)
                .requestedLimit(effectiveLimit)
                .pageSize(pageSize)
                .startedAt(OffsetDateTime.now())
                .build();
        run = runRepository.save(run);

        try {
            int scannedPackages = auditPackages(run, accumulator);
            auditGlobalDuplicates(run, accumulator);
            int scannedLegacyDocuments = 0; // auditLegacyDocuments(run, accumulator, effectiveLimit, pageSize);

            run.setStatus(LegacyTedAuditRunStatus.COMPLETED);
            run.setCompletedAt(OffsetDateTime.now());
            run.setScannedPackages(scannedPackages);
            run.setScannedLegacyDocuments(scannedLegacyDocuments);
            run.setFindingCount(accumulator.totalFindings());
            run.setInfoCount(accumulator.infoCount());
            run.setWarningCount(accumulator.warningCount());
            run.setErrorCount(accumulator.errorCount());
            run.setSummaryText(buildSummary(scannedPackages, scannedLegacyDocuments, accumulator));
            run.setFailureMessage(null);
            run = runRepository.save(run);

            log.info("Wave 1 / Milestone A legacy-only audit completed: runId={}, packages={}, documents={}, findings={}, warnings={}, errors={}",
                    run.getId(), scannedPackages, scannedLegacyDocuments, accumulator.totalFindings(),
                    accumulator.warningCount(), accumulator.errorCount());
            return run;
        } catch (RuntimeException ex) {
            run.setStatus(LegacyTedAuditRunStatus.FAILED);
            run.setCompletedAt(OffsetDateTime.now());
            run.setScannedPackages(accumulator.scannedPackages());
            run.setScannedLegacyDocuments(accumulator.scannedLegacyDocuments());
            run.setFindingCount(accumulator.totalFindings());
            run.setInfoCount(accumulator.infoCount());
            run.setWarningCount(accumulator.warningCount());
            run.setErrorCount(accumulator.errorCount());
            run.setFailureMessage(ex.getMessage());
            run.setSummaryText(buildSummary(accumulator.scannedPackages(), accumulator.scannedLegacyDocuments(), accumulator));
            runRepository.save(run);
            log.error("Wave 1 / Milestone A legacy-only audit failed: runId={}", run.getId(), ex);
            throw ex;
        }
    }

    private int auditPackages(LegacyTedAuditRun run, AuditAccumulator accumulator) {
        List<TedDailyPackage> packages = tedDailyPackageRepository.findAll(Sort.by(Sort.Direction.ASC, "year", "serialNumber"));
        if (packages.isEmpty()) {
            return 0;
        }

        Map<Integer, List<TedDailyPackage>> packagesByYear = new TreeMap<>();
        for (TedDailyPackage dailyPackage : packages) {
            packagesByYear.computeIfAbsent(dailyPackage.getYear(), ignored -> new ArrayList<>()).add(dailyPackage);
        }

        int firstYear = packagesByYear.keySet().iterator().next();
        int currentYear = Year.now().getValue();

        for (int year = firstYear; year <= currentYear; year++) {
            List<TedDailyPackage> yearPackages = packagesByYear.get(year);
            if (yearPackages == null || yearPackages.isEmpty()) {
|
||||||
|
recordFinding(run, accumulator,
|
||||||
|
LegacyTedAuditSeverity.WARNING,
|
||||||
|
LegacyTedAuditFindingType.PACKAGE_SEQUENCE_GAP,
|
||||||
|
null,
|
||||||
|
null,
|
||||||
|
null,
|
||||||
|
null,
|
||||||
|
"year:" + year,
|
||||||
|
"No TED package rows exist for this year inside the audited interval",
|
||||||
|
"year=" + year + ", intervalStartYear=" + firstYear + ", intervalEndYear=" + currentYear);
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
auditYearPackageSequence(run, accumulator, year, yearPackages);
|
||||||
|
|
||||||
|
for (TedDailyPackage dailyPackage : yearPackages) {
|
||||||
|
accumulator.incrementScannedPackages();
|
||||||
|
auditSinglePackage(run, accumulator, dailyPackage);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return packages.size();
|
||||||
|
}
|
||||||
|
|
||||||
|
private void auditYearPackageSequence(LegacyTedAuditRun run,
|
||||||
|
AuditAccumulator accumulator,
|
||||||
|
int year,
|
||||||
|
List<TedDailyPackage> yearPackages) {
|
||||||
|
yearPackages.sort((left, right) -> Integer.compare(safeInt(left.getSerialNumber()), safeInt(right.getSerialNumber())));
|
||||||
|
|
||||||
|
int firstSerial = safeInt(yearPackages.getFirst().getSerialNumber());
|
||||||
|
if (firstSerial > 1) {
|
||||||
|
recordMissingPackageRange(run, accumulator, year, 1, firstSerial - 1,
|
||||||
|
"TED package year starts after serial 1");
|
||||||
|
}
|
||||||
|
|
||||||
|
for (int i = 1; i < yearPackages.size(); i++) {
|
||||||
|
int previousSerial = safeInt(yearPackages.get(i - 1).getSerialNumber());
|
||||||
|
int currentSerial = safeInt(yearPackages.get(i).getSerialNumber());
|
||||||
|
if (currentSerial > previousSerial + 1) {
|
||||||
|
recordMissingPackageRange(run, accumulator, year, previousSerial + 1, currentSerial - 1,
|
||||||
|
"TED package sequence gap detected");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
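The sequence check above reduces to: sort the serials, flag a leading gap when the first serial is greater than 1, and flag every hole between consecutive serials. A minimal standalone sketch of that logic (class and method names are illustrative, not from the codebase):

```java
import java.util.ArrayList;
import java.util.List;

public class SerialGapSketch {

    // Returns missing serial ranges as "start-end" strings, mirroring the
    // firstSerial > 1 check and the currentSerial > previousSerial + 1 check.
    static List<String> missingRanges(List<Integer> sortedSerials) {
        List<String> gaps = new ArrayList<>();
        if (sortedSerials.isEmpty()) {
            return gaps;
        }
        if (sortedSerials.get(0) > 1) {
            gaps.add(range(1, sortedSerials.get(0) - 1));
        }
        for (int i = 1; i < sortedSerials.size(); i++) {
            int previous = sortedSerials.get(i - 1);
            int current = sortedSerials.get(i);
            if (current > previous + 1) {
                gaps.add(range(previous + 1, current - 1));
            }
        }
        return gaps;
    }

    // Single missing serials collapse to one number, like the referenceKey above.
    static String range(int start, int end) {
        return start == end ? String.valueOf(start) : start + "-" + end;
    }

    public static void main(String[] args) {
        // Serials 1, 2, 5, 9 -> missing 3-4 and 6-8
        System.out.println(missingRanges(List.of(1, 2, 5, 9)));
    }
}
```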
    private void recordMissingPackageRange(LegacyTedAuditRun run,
                                           AuditAccumulator accumulator,
                                           int year,
                                           int startSerial,
                                           int endSerial,
                                           String message) {
        String startPackageId = formatPackageIdentifier(year, startSerial);
        String endPackageId = formatPackageIdentifier(year, endSerial);
        String referenceKey = startSerial == endSerial ? startPackageId : startPackageId + "-" + endPackageId;

        recordFinding(run, accumulator,
                LegacyTedAuditSeverity.WARNING,
                LegacyTedAuditFindingType.PACKAGE_SEQUENCE_GAP,
                startSerial == endSerial ? startPackageId : null,
                null,
                null,
                null,
                referenceKey,
                message,
                "year=" + year + ", missingStartSerial=" + startSerial + ", missingEndSerial=" + endSerial);
    }

    private void auditSinglePackage(LegacyTedAuditRun run,
                                    AuditAccumulator accumulator,
                                    TedDailyPackage dailyPackage) {
        String packageIdentifier = dailyPackage.getPackageIdentifier();
        int processedCount = safeInt(dailyPackage.getProcessedCount());
        int failedCount = safeInt(dailyPackage.getFailedCount());
        int accountedDocuments = processedCount + failedCount;

        if (dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.COMPLETED
                && dailyPackage.getProcessedAt() == null) {
            recordFinding(run, accumulator,
                    LegacyTedAuditSeverity.WARNING,
                    LegacyTedAuditFindingType.PACKAGE_COMPLETED_WITHOUT_PROCESSED_AT,
                    packageIdentifier,
                    null,
                    null,
                    null,
                    packageIdentifier,
                    "TED package is marked COMPLETED but processedAt is null",
                    null);
        }

        if (dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.COMPLETED
                && dailyPackage.getXmlFileCount() == null) {
            recordFinding(run, accumulator,
                    LegacyTedAuditSeverity.WARNING,
                    LegacyTedAuditFindingType.PACKAGE_MISSING_XML_FILE_COUNT,
                    packageIdentifier,
                    null,
                    null,
                    null,
                    packageIdentifier,
                    "TED package is marked COMPLETED but xmlFileCount is null",
                    null);
        }

        if ((dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.DOWNLOADED
                || dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.PROCESSING
                || dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.COMPLETED)
                && !StringUtils.hasText(dailyPackage.getFileHash())) {
            recordFinding(run, accumulator,
                    LegacyTedAuditSeverity.WARNING,
                    LegacyTedAuditFindingType.PACKAGE_MISSING_FILE_HASH,
                    packageIdentifier,
                    null,
                    null,
                    null,
                    packageIdentifier,
                    "TED package has no file hash recorded",
                    "downloadStatus=" + dailyPackage.getDownloadStatus());
        }

        if (dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.FAILED
                && !StringUtils.hasText(dailyPackage.getErrorMessage())) {
            recordFinding(run, accumulator,
                    LegacyTedAuditSeverity.WARNING,
                    LegacyTedAuditFindingType.PACKAGE_FAILED_WITHOUT_ERROR_MESSAGE,
                    packageIdentifier,
                    null,
                    null,
                    null,
                    packageIdentifier,
                    "TED package is marked FAILED but has no error message",
                    null);
        }

        if (dailyPackage.getXmlFileCount() != null) {
            if (accountedDocuments > dailyPackage.getXmlFileCount()) {
                recordFinding(run, accumulator,
                        LegacyTedAuditSeverity.ERROR,
                        LegacyTedAuditFindingType.PACKAGE_COMPLETED_COUNT_MISMATCH,
                        packageIdentifier,
                        null,
                        null,
                        null,
                        packageIdentifier,
                        "TED package accounting exceeds xmlFileCount",
                        "xmlFileCount=" + dailyPackage.getXmlFileCount()
                                + ", processedCount=" + processedCount
                                + ", failedCount=" + failedCount);
            } else if (dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.COMPLETED
                    && accountedDocuments < dailyPackage.getXmlFileCount()) {
                recordFinding(run, accumulator,
                        LegacyTedAuditSeverity.WARNING,
                        LegacyTedAuditFindingType.PACKAGE_COMPLETED_COUNT_MISMATCH,
                        packageIdentifier,
                        null,
                        null,
                        null,
                        packageIdentifier,
                        "TED package accounting is below xmlFileCount",
                        "xmlFileCount=" + dailyPackage.getXmlFileCount()
                                + ", processedCount=" + processedCount
                                + ", failedCount=" + failedCount);
            }
        }

        if (isPackageIncompleteForReimport(dailyPackage, processedCount, failedCount, accountedDocuments)) {
            recordFinding(run, accumulator,
                    dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.FAILED
                            ? LegacyTedAuditSeverity.ERROR
                            : LegacyTedAuditSeverity.WARNING,
                    LegacyTedAuditFindingType.PACKAGE_INCOMPLETE,
                    packageIdentifier,
                    null,
                    null,
                    null,
                    packageIdentifier,
                    "TED package is not fully imported and should be considered for re-import",
                    buildIncompletePackageDetails(dailyPackage, processedCount, failedCount, accountedDocuments));
        }
    }
    private boolean isPackageIncompleteForReimport(TedDailyPackage dailyPackage,
                                                   int processedCount,
                                                   int failedCount,
                                                   int accountedDocuments) {
        TedDailyPackage.DownloadStatus status = dailyPackage.getDownloadStatus();
        if (status == null) {
            return true;
        }
        if (status == TedDailyPackage.DownloadStatus.NOT_FOUND) {
            return false;
        }
        if (status == TedDailyPackage.DownloadStatus.PENDING
                || status == TedDailyPackage.DownloadStatus.DOWNLOADING
                || status == TedDailyPackage.DownloadStatus.DOWNLOADED
                || status == TedDailyPackage.DownloadStatus.PROCESSING
                || status == TedDailyPackage.DownloadStatus.FAILED) {
            return true;
        }
        if (status != TedDailyPackage.DownloadStatus.COMPLETED) {
            return true;
        }
        if (dailyPackage.getXmlFileCount() == null) {
            return true;
        }
        if (failedCount > 0) {
            return true;
        }
        return processedCount < dailyPackage.getXmlFileCount()
                || accountedDocuments != dailyPackage.getXmlFileCount();
    }

    private String buildIncompletePackageDetails(TedDailyPackage dailyPackage,
                                                 int processedCount,
                                                 int failedCount,
                                                 int accountedDocuments) {
        return "status=" + dailyPackage.getDownloadStatus()
                + ", xmlFileCount=" + dailyPackage.getXmlFileCount()
                + ", processedCount=" + processedCount
                + ", failedCount=" + failedCount
                + ", accountedDocuments=" + accountedDocuments;
    }
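The re-import predicate above treats everything except NOT_FOUND and a fully-accounted COMPLETED package as a candidate. Its decision table can be condensed into a few lines (illustrative names, same logic):

```java
public class ReimportSketch {

    enum DownloadStatus { PENDING, DOWNLOADING, DOWNLOADED, PROCESSING, COMPLETED, FAILED, NOT_FOUND }

    // Mirrors isPackageIncompleteForReimport: only a COMPLETED package whose
    // processed + failed counts fully account for xmlFileCount, with no
    // failures, counts as complete; NOT_FOUND is skipped entirely.
    static boolean incomplete(DownloadStatus status, Integer xmlFileCount, int processed, int failed) {
        if (status == null) return true;
        if (status == DownloadStatus.NOT_FOUND) return false;
        if (status != DownloadStatus.COMPLETED) return true;
        if (xmlFileCount == null) return true;
        if (failed > 0) return true;
        return processed < xmlFileCount || processed + failed != xmlFileCount;
    }

    public static void main(String[] args) {
        System.out.println(incomplete(DownloadStatus.COMPLETED, 100, 100, 0)); // fully accounted
        System.out.println(incomplete(DownloadStatus.COMPLETED, 100, 97, 0));  // short by three
    }
}
```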
    private void auditGlobalDuplicates(LegacyTedAuditRun run, AuditAccumulator accumulator) {
        int limit = properties.getMaxDuplicateSamples();

        jdbcTemplate.query(
                """
                SELECT publication_id, COUNT(*) AS duplicate_count
                FROM ted.procurement_document
                WHERE publication_id IS NOT NULL AND publication_id <> ''
                GROUP BY publication_id
                HAVING COUNT(*) > 1
                ORDER BY duplicate_count DESC, publication_id ASC
                LIMIT ?
                """,
                ps -> ps.setInt(1, limit),
                (rs, rowNum) -> {
                    String publicationId = rs.getString("publication_id");
                    long duplicateCount = rs.getLong("duplicate_count");
                    recordFinding(run, accumulator,
                            LegacyTedAuditSeverity.ERROR,
                            LegacyTedAuditFindingType.LEGACY_PUBLICATION_ID_DUPLICATE,
                            null,
                            null,
                            null,
                            null,
                            publicationId,
                            "Legacy TED publicationId appears multiple times",
                            "publicationId=" + publicationId + ", duplicateCount=" + duplicateCount);
                    return null;
                });
    }
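The `GROUP BY … HAVING COUNT(*) > 1` query above has a simple in-memory equivalent, shown here only to make the duplicate rule concrete (the class name is illustrative; blank ids are filtered out just like the `publication_id <> ''` predicate):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DuplicateSketch {

    // Count non-blank publication ids and keep only those seen more than once.
    static Map<String, Long> duplicates(List<String> publicationIds) {
        return publicationIds.stream()
                .filter(id -> id != null && !id.isBlank())
                .collect(Collectors.groupingBy(id -> id, Collectors.counting()))
                .entrySet().stream()
                .filter(entry -> entry.getValue() > 1)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        // "A" appears three times, the blank entry is ignored
        System.out.println(duplicates(List.of("A", "B", "A", "", "C", "A")));
    }
}
```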
    private int auditLegacyDocuments(LegacyTedAuditRun run,
                                     AuditAccumulator accumulator,
                                     Integer requestedLimit,
                                     int pageSize) {
        int processed = 0;
        int pageNumber = 0;

        while (requestedLimit == null || processed < requestedLimit) {
            Page<ProcurementDocument> page = procurementDocumentRepository.findAll(
                    PageRequest.of(pageNumber, pageSize, Sort.by(Sort.Direction.ASC, "createdAt", "id")));

            if (page.isEmpty()) {
                break;
            }

            for (ProcurementDocument legacyDocument : page.getContent()) {
                auditSingleLegacyDocument(run, accumulator, legacyDocument);
                accumulator.incrementScannedLegacyDocuments();
                processed++;
                if (requestedLimit != null && processed >= requestedLimit) {
                    return processed;
                }
            }

            if (!page.hasNext()) {
                break;
            }
            pageNumber++;
        }

        return processed;
    }
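The paging loop above walks fixed-size pages and stops early once an optional global limit is hit. The same control flow over a plain list, without the repository (illustrative names):

```java
import java.util.List;

public class PagingSketch {

    // Mirrors auditLegacyDocuments: a null limit means unbounded; otherwise
    // processing returns as soon as the limit is reached mid-page.
    static int processPaged(List<String> all, Integer requestedLimit, int pageSize) {
        int processed = 0;
        int pageNumber = 0;
        while (requestedLimit == null || processed < requestedLimit) {
            int from = pageNumber * pageSize;
            if (from >= all.size()) {
                break;
            }
            List<String> page = all.subList(from, Math.min(from + pageSize, all.size()));
            for (String item : page) {
                processed++; // each item would be audited here
                if (requestedLimit != null && processed >= requestedLimit) {
                    return processed;
                }
            }
            if (from + pageSize >= all.size()) {
                break;
            }
            pageNumber++;
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(processPaged(List.of("a", "b", "c", "d", "e"), 4, 2));    // limited
        System.out.println(processPaged(List.of("a", "b", "c"), null, 2));           // unbounded
    }
}
```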
    private void auditSingleLegacyDocument(LegacyTedAuditRun run,
                                           AuditAccumulator accumulator,
                                           ProcurementDocument legacyDocument) {
        UUID legacyDocumentId = legacyDocument.getId();
        String referenceKey = buildReferenceKey(legacyDocument);
        String documentHash = legacyDocument.getDocumentHash();

        if (!StringUtils.hasText(documentHash)) {
            recordFinding(run, accumulator,
                    LegacyTedAuditSeverity.ERROR,
                    LegacyTedAuditFindingType.LEGACY_DOCUMENT_MISSING_HASH,
                    null,
                    legacyDocumentId,
                    null,
                    null,
                    referenceKey,
                    "Legacy TED document has no documentHash",
                    null);
            return;
        }

        if (!StringUtils.hasText(legacyDocument.getXmlDocument())) {
            recordFinding(run, accumulator,
                    LegacyTedAuditSeverity.ERROR,
                    LegacyTedAuditFindingType.LEGACY_DOCUMENT_MISSING_XML,
                    null,
                    legacyDocumentId,
                    null,
                    null,
                    referenceKey,
                    "Legacy TED document has no xmlDocument payload",
                    "documentHash=" + documentHash);
        }

        if (!StringUtils.hasText(legacyDocument.getTextContent())) {
            recordFinding(run, accumulator,
                    LegacyTedAuditSeverity.WARNING,
                    LegacyTedAuditFindingType.LEGACY_DOCUMENT_MISSING_TEXT,
                    null,
                    legacyDocumentId,
                    null,
                    null,
                    referenceKey,
                    "Legacy TED document has no normalized textContent",
                    "documentHash=" + documentHash);
        }

        if (!StringUtils.hasText(legacyDocument.getPublicationId())) {
            recordFinding(run, accumulator,
                    LegacyTedAuditSeverity.WARNING,
                    LegacyTedAuditFindingType.LEGACY_DOCUMENT_MISSING_PUBLICATION_ID,
                    null,
                    legacyDocumentId,
                    null,
                    null,
                    referenceKey,
                    "Legacy TED document has no publicationId",
                    "documentHash=" + documentHash);
        }
    }

    private void recordFinding(LegacyTedAuditRun run,
                               AuditAccumulator accumulator,
                               LegacyTedAuditSeverity severity,
                               LegacyTedAuditFindingType findingType,
                               String packageIdentifier,
                               UUID legacyProcurementDocumentId,
                               UUID genericDocumentId,
                               UUID tedProjectionId,
                               String referenceKey,
                               String message,
                               String detailsText) {
        if (accumulator.totalFindings() >= properties.getMaxFindingsPerRun()) {
            accumulator.markTruncated();
            if (!accumulator.truncationRecorded()) {
                LegacyTedAuditFinding truncatedFinding = LegacyTedAuditFinding.builder()
                        .run(run)
                        .severity(LegacyTedAuditSeverity.INFO)
                        .findingType(LegacyTedAuditFindingType.FINDINGS_TRUNCATED)
                        .referenceKey(referenceKey != null ? referenceKey : "max-findings-per-run")
                        .message("Legacy TED audit finding limit reached; additional findings were suppressed")
                        .detailsText("maxFindingsPerRun=" + properties.getMaxFindingsPerRun())
                        .build();
                findingRepository.save(truncatedFinding);
                accumulator.recordFinding(LegacyTedAuditSeverity.INFO, true);
            }
            return;
        }

        LegacyTedAuditFinding finding = LegacyTedAuditFinding.builder()
                .run(run)
                .severity(severity)
                .findingType(findingType)
                .packageIdentifier(packageIdentifier)
                .legacyProcurementDocumentId(legacyProcurementDocumentId)
                .documentId(genericDocumentId)
                .tedNoticeProjectionId(tedProjectionId)
                .referenceKey(referenceKey)
                .message(message)
                .detailsText(detailsText)
                .build();
        findingRepository.save(finding);
        accumulator.recordFinding(severity, false);
    }
    private String buildReferenceKey(ProcurementDocument legacyDocument) {
        if (StringUtils.hasText(legacyDocument.getPublicationId())) {
            return legacyDocument.getPublicationId();
        }
        if (StringUtils.hasText(legacyDocument.getNoticeId())) {
            return legacyDocument.getNoticeId();
        }
        if (StringUtils.hasText(legacyDocument.getSourceFilename())) {
            return legacyDocument.getSourceFilename();
        }
        return String.valueOf(legacyDocument.getId());
    }

    private int safeInt(Integer value) {
        return value != null ? value : 0;
    }

    private String formatPackageIdentifier(int year, int serialNumber) {
        return "%04d%05d".formatted(year, serialNumber);
    }
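The `"%04d%05d"` format above packs a zero-padded four-digit year and a five-digit serial into one identifier, which is what makes the `startPackageId + "-" + endPackageId` gap keys sortable as plain strings. A tiny demonstration (class name is illustrative):

```java
public class PackageIdSketch {

    // Same format string as formatPackageIdentifier: four-digit year
    // followed by a five-digit, zero-padded serial number.
    static String packageIdentifier(int year, int serialNumber) {
        return "%04d%05d".formatted(year, serialNumber);
    }

    public static void main(String[] args) {
        System.out.println(packageIdentifier(2024, 7));   // 202400007
        System.out.println(packageIdentifier(2024, 123)); // 202400123
    }
}
```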
    private String buildSummary(int scannedPackages,
                                int scannedLegacyDocuments,
                                AuditAccumulator accumulator) {
        return "packages=" + scannedPackages
                + ", legacyDocuments=" + scannedLegacyDocuments
                + ", findings=" + accumulator.totalFindings()
                + ", warnings=" + accumulator.warningCount()
                + ", errors=" + accumulator.errorCount()
                + (accumulator.truncated() ? ", truncated=true" : "");
    }

    private static final class AuditAccumulator {
        private int scannedPackages;
        private int scannedLegacyDocuments;
        private int infoCount;
        private int warningCount;
        private int errorCount;
        private boolean truncated;
        private boolean truncationRecorded;

        void incrementScannedPackages() {
            scannedPackages++;
        }

        void incrementScannedLegacyDocuments() {
            scannedLegacyDocuments++;
        }

        void recordFinding(LegacyTedAuditSeverity severity, boolean truncationFindingRecordedNow) {
            switch (severity) {
                case INFO -> infoCount++;
                case WARNING -> warningCount++;
                case ERROR -> errorCount++;
            }
            if (truncationFindingRecordedNow) {
                truncationRecorded = true;
            }
        }

        void markTruncated() {
            truncated = true;
        }

        int totalFindings() {
            return infoCount + warningCount + errorCount;
        }

        int infoCount() {
            return infoCount;
        }

        int warningCount() {
            return warningCount;
        }

        int errorCount() {
            return errorCount;
        }

        int scannedPackages() {
            return scannedPackages;
        }

        int scannedLegacyDocuments() {
            return scannedLegacyDocuments;
        }

        boolean truncated() {
            return truncated;
        }

        boolean truncationRecorded() {
            return truncationRecorded;
        }
    }
}
@ -0,0 +1,33 @@
package at.procon.dip.migration.audit.startup;

import at.procon.dip.migration.audit.config.LegacyTedAuditProperties;
import at.procon.dip.migration.audit.service.LegacyTedAuditService;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.ApplicationArguments;
import org.springframework.boot.ApplicationRunner;
import org.springframework.stereotype.Component;

@Component
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@RequiredArgsConstructor
@Slf4j
public class LegacyTedAuditStartupRunner implements ApplicationRunner {

    private final LegacyTedAuditProperties properties;
    private final LegacyTedAuditService legacyTedAuditService;

    @Override
    public void run(ApplicationArguments args) {
        if (!properties.isEnabled() || !properties.isStartupRunEnabled()) {
            return;
        }

        int requestedLimit = properties.getStartupRunLimit();
        log.info("Wave 1 / Milestone A startup audit enabled - scanning legacy TED data with limit {}",
                requestedLimit > 0 ? requestedLimit : "unbounded");
        legacyTedAuditService.executeAudit(requestedLimit);
    }
}
@ -1,49 +1,81 @@
 package at.procon.dip.search.config;
 
+import jakarta.validation.constraints.Min;
+import jakarta.validation.constraints.Positive;
 import lombok.Data;
 import org.springframework.boot.context.properties.ConfigurationProperties;
 import org.springframework.context.annotation.Configuration;
+import org.springframework.validation.annotation.Validated;
+
+/**
+ * New-runtime generic search configuration.
+ *
+ * <p>This property tree is intentionally separated from the legacy
+ * {@code ted.search.*} settings. NEW-mode search/semantic/lexical code should
+ * depend on {@code dip.search.*} only.</p>
+ */
 @Configuration
 @ConfigurationProperties(prefix = "dip.search")
 @Data
+@Validated
 public class DipSearchProperties {
 
-    private Lexical lexical = new Lexical();
-    private Semantic semantic = new Semantic();
-    private Fusion fusion = new Fusion();
-    private Chunking chunking = new Chunking();
-
-    @Data
-    public static class Lexical {
-        private double trigramSimilarityThreshold = 0.12;
+    /** Default page size for search results. */
+    @Positive
+    private int defaultPageSize = 20;
+
+    /** Maximum allowed page size. */
+    @Positive
+    private int maxPageSize = 100;
+
+    /** Semantic similarity threshold (normalized score). */
+    private double similarityThreshold = 0.7d;
+
+    /** Minimum trigram similarity for fuzzy lexical matches. */
+    private double trigramSimilarityThreshold = 0.12d;
+
+    /** Candidate limits per search engine before fusion/collapse. */
+    @Positive
     private int fulltextCandidateLimit = 120;
+
+    @Positive
     private int trigramCandidateLimit = 120;
-    }
-
-    @Data
-    public static class Semantic {
-        private double similarityThreshold = 0.7;
+
+    @Positive
     private int semanticCandidateLimit = 120;
-        private String defaultModelKey;
-    }
-
-    @Data
-    public static class Fusion {
-        private double fulltextWeight = 0.35;
-        private double trigramWeight = 0.20;
-        private double semanticWeight = 0.45;
-        private double recencyBoostWeight = 0.05;
-        private int recencyHalfLifeDays = 30;
-        private int debugTopHitsPerEngine = 10;
-    }
-
-    @Data
-    public static class Chunking {
-        private boolean enabled = true;
-        private int targetChars = 1800;
-        private int overlapChars = 200;
+
+    /** Hybrid fusion weights. */
+    private double fulltextWeight = 0.35d;
+    private double trigramWeight = 0.20d;
+    private double semanticWeight = 0.45d;
+
+    /** Enable chunk representations for long documents. */
+    private boolean chunkingEnabled = true;
+
+    /** Target chunk size in characters for CHUNK representations. */
+    @Positive
+    private int chunkTargetChars = 1800;
+
+    /** Overlap between consecutive chunks in characters. */
+    @Min(0)
+    private int chunkOverlapChars = 200;
+
+    /** Maximum CHUNK representations generated per document. */
+    @Positive
     private int maxChunksPerDocument = 12;
+
+    /** Additional score weight for recency. */
+    private double recencyBoostWeight = 0.05d;
+
+    /** Half-life in days used for recency decay. */
+    @Positive
+    private int recencyHalfLifeDays = 30;
+
+    /** Startup backfill limit for missing DOC lexical vectors. */
+    @Positive
     private int startupLexicalBackfillLimit = 500;
-    }
+
+    /** Number of hits per engine returned by the debug endpoint. */
+    @Positive
+    private int debugTopHitsPerEngine = 10;
 }
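The chunking settings (`chunkTargetChars`, `chunkOverlapChars`, `maxChunksPerDocument`) interact in the usual sliding-window way. The actual chunk builder is not part of this diff; the sketch below only illustrates how such values could combine, assuming each chunk starts `targetChars - overlapChars` after the previous one and generation stops at the configured maximum:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkingSketch {

    // Illustrative chunker: each chunk is up to targetChars long, consecutive
    // chunks share overlapChars characters, and at most maxChunks are emitted.
    static List<String> chunk(String text, int targetChars, int overlapChars, int maxChunks) {
        List<String> chunks = new ArrayList<>();
        int stride = targetChars - overlapChars;
        for (int start = 0; start < text.length() && chunks.size() < maxChunks; start += stride) {
            chunks.add(text.substring(start, Math.min(start + targetChars, text.length())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 10-char text, target 6, overlap 2 -> chunks start at offsets 0, 4, 8
        System.out.println(chunk("abcdefghij", 6, 2, 12));
    }
}
```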
@ -1,16 +0,0 @@
package at.procon.ted.config;

import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

/**
 * Patch A scaffold for the legacy runtime configuration tree.
 *
 * The legacy runtime still uses {@link TedProcessorProperties} today. This class is
 * introduced so the old configuration can be moved gradually from `ted.*` to
 * `legacy.ted.*` without blocking the runtime split.
 */
@Configuration
@ConfigurationProperties(prefix = "legacy.ted")
public class LegacyTedProperties extends TedProcessorProperties {
}
@ -0,0 +1,115 @@
package at.procon.ted.config;

import jakarta.validation.constraints.Min;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;
import org.springframework.validation.annotation.Validated;

/**
 * Legacy vectorization configuration used only by the old runtime path.
 * <p>
 * This extracts the former ted.vectorization.* subtree away from TedProcessorProperties
 * so that legacy vectorization beans no longer depend on the shared monolithic config.
 */
@Configuration
@ConfigurationProperties(prefix = "legacy.ted.vectorization")
@Data
@Validated
public class LegacyVectorizationProperties {

    /**
     * Enable/disable legacy async vectorization.
     */
    private boolean enabled = true;

    /**
     * Use external HTTP API instead of Python subprocess.
     */
    private boolean useHttpApi = false;

    /**
     * Embedding service HTTP API URL.
     */
    private String apiUrl = "http://localhost:8001";

    /**
     * Sentence transformer model name.
     */
    private String modelName = "intfloat/multilingual-e5-large";

    /**
     * Vector dimensions (must match model output).
     */
    @Positive
    private int dimensions = 1024;

    /**
     * Batch size for vectorization processing.
     */
    @Min(1)
    private int batchSize = 16;

    /**
     * Thread pool size for async vectorization.
     */
    @Min(1)
    private int threadPoolSize = 4;

    /**
     * Maximum text length for vectorization (characters).
     */
    @Positive
    private int maxTextLength = 8192;

    /**
     * HTTP connection timeout in milliseconds.
     */
    @Positive
    private int connectTimeout = 10000;

    /**
     * HTTP socket/read timeout in milliseconds.
     */
    @Positive
    private int socketTimeout = 60000;

    /**
     * Maximum retries on connection failure.
     */
    @Min(0)
    private int maxRetries = 5;

    /**
     * Enable the former Phase 2 generic pipeline in the legacy runtime.
     * In the split runtime design this should normally stay false in NEW mode
     * because legacy beans are not instantiated there.
     */
    private boolean genericPipelineEnabled = true;

    /**
     * Keep writing completed TED embeddings back to the legacy ted.procurement_document
     * vector columns so the existing semantic search stays operational during migration.
     */
    private boolean dualWriteLegacyTedVectors = true;

    /**
     * Scheduler interval for generic embedding polling (milliseconds).
     */
    @Positive
    private long genericSchedulerPeriodMs = 6000;

    /**
     * Builder key for the primary TED semantic representation created during transitional dual-write.
     */
    @NotBlank
    private String primaryRepresentationBuilderKey = "ted-phase2-primary-representation";

    /**
     * Provider key used when registering the configured embedding model in DOC.doc_embedding_model.
     */
    @NotBlank
    private String embeddingProvider = "http-embedding-service";
}
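The `batchSize` and `maxTextLength` properties above shape how texts reach the embedding service. The actual vectorization service is not shown in this diff; the following sketch only illustrates one plausible interaction of the two settings, assuming texts are truncated to the configured maximum and then grouped into fixed-size batches (names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingSketch {

    // Truncate each text to maxTextLength characters, then group the texts
    // into batches of at most batchSize; the final batch may be smaller.
    static List<List<String>> batches(List<String> texts, int batchSize, int maxTextLength) {
        List<List<String>> result = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String text : texts) {
            current.add(text.length() > maxTextLength ? text.substring(0, maxTextLength) : text);
            if (current.size() == batchSize) {
                result.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        // batchSize 2, maxTextLength 3: long texts are clipped, five texts
        // become three batches of sizes 2, 2 and 1.
        System.out.println(batches(List.of("aaaa", "bb", "cccc", "d", "e"), 2, 3));
    }
}
```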
Some files were not shown because too many files have changed in this diff.