Compare commits
No commits in common. '152d9739af0969090ddb7db63ad6a93054147053' and '74609e481d97ae16a12a06dd2d6ed191f01a33f7' have entirely different histories.
@ -1,31 +0,0 @@
# Config split: moved new-runtime properties to application-new.yml

This patch keeps shared and legacy defaults in `application.yml` and moves new-runtime properties into `application-new.yml`.

Activate the new runtime with:

```
--spring.profiles.active=new
```

`application-new.yml` also sets:

```yaml
dip.runtime.mode: NEW
```

So profile selection and runtime mode stay aligned.

Moved blocks:

- `dip.embedding.*`
- `ted.search.*` (new generic search tuning, now under `dip.search.*`)
- `ted.projection.*`
- `ted.generic-ingestion.*`
- new/transitional `ted.vectorization.*` keys:
  - `generic-pipeline-enabled`
  - `dual-write-legacy-ted-vectors`
  - `generic-scheduler-period-ms`
  - `primary-representation-builder-key`
  - `embedding-provider`

Shared / legacy defaults remain in `application.yml`.
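Taken together, the split implies an `application-new.yml` shaped roughly like the sketch below. The key placement follows the moved prefixes listed above; the concrete values are illustrative assumptions, not the actual file contents:

```yaml
# application-new.yml (sketch; values are assumptions)
dip:
  runtime:
    mode: NEW
  embedding:
    enabled: true
  search:
    # new generic search tuning, formerly ted.search.*
  ingestion:
    # formerly ted.generic-ingestion.*
  ted:
    projection:
      # formerly ted.projection.*
```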
@ -1,30 +0,0 @@
# Embedding policy Patch K1

Patch K1 introduces the configuration and resolver layer for policy-based document embedding selection.

## Added
- `EmbeddingPolicy`
- `EmbeddingProfile`
- `EmbeddingPolicyCondition`
- `EmbeddingPolicyUse`
- `EmbeddingPolicyRule`
- `EmbeddingPolicyProperties`
- `EmbeddingProfileProperties`
- `EmbeddingPolicyResolver`
- `DefaultEmbeddingPolicyResolver`
- `EmbeddingProfileResolver`
- `DefaultEmbeddingProfileResolver`

## Example config
See `application-new-example-embedding-policy.yml`.

## What K1 does not change
- no runtime import/orchestrator wiring yet
- no `SourceDescriptor` schema change yet
- no job persistence/audit changes yet

## Intended follow-up
K2 should wire:
- `GenericDocumentImportService`
- `RepresentationEmbeddingOrchestrator`

to use the resolved policy and profile.
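The example file itself is not reproduced here, but the shape of such a config follows from the `@ConfigurationProperties` classes in this changeset (`dip.embedding.policies` and `dip.embedding.profiles`). A hedged sketch, where the rule name, condition value, profile key, and `FULL_TEXT` representation type are invented for illustration:

```yaml
dip:
  embedding:
    policies:
      default-policy:
        model-key: e5-default
        profile-key: standard          # invented profile key
        enabled: true
      rules:
        - name: mail-documents         # invented rule name
          when:
            source-type: MAIL          # invented condition value
          use:
            model-key: e5-default
            profile-key: standard
    profiles:
      definitions:
        standard:
          embed-representation-types:
            - FULL_TEXT                # assumed RepresentationType value
```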
@ -1,26 +0,0 @@
# Embedding policy Patch K2

Patch K2 wires the policy/profile layer into the actual NEW import runtime.

## What it changes
- `GenericDocumentImportService`
  - resolves `EmbeddingPolicy` per imported document
  - resolves `EmbeddingProfile`
  - ensures the selected embedding model is registered
  - queues embeddings only for representation drafts allowed by the resolved profile
- `RepresentationEmbeddingOrchestrator`
  - adds a convenience overload for `(documentId, modelKey, profile)`
- `EmbeddingJobService`
  - adds a profile-aware enqueue overload
- `DefaultEmbeddingSelectionPolicy`
  - adds profile-aware representation filtering
- `DefaultEmbeddingPolicyResolver`
  - corrected for the current `SourceDescriptor.attributes()` shape

## Runtime flow after K2
document imported
-> representations built
-> policy resolved
-> profile resolved
-> model ensured
-> matching representations queued for embedding
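The final step of that flow, filtering representations by the resolved profile, can be illustrated with a self-contained sketch. The `RepresentationType` values below are assumptions, and the record mirrors the `EmbeddingProfile` shape from this changeset rather than importing it:

```java
import java.util.List;

public class ProfileFilterSketch {

    // Illustrative values; the real enum lives in
    // at.procon.dip.domain.document.RepresentationType.
    enum RepresentationType { FULL_TEXT, SUMMARY, METADATA }

    // Mirrors the EmbeddingProfile record used by the resolver layer.
    record EmbeddingProfile(String profileKey, List<RepresentationType> embedRepresentationTypes) {
        boolean includes(RepresentationType type) {
            return embedRepresentationTypes.contains(type);
        }
    }

    public static void main(String[] args) {
        EmbeddingProfile profile =
                new EmbeddingProfile("full-text-only", List.of(RepresentationType.FULL_TEXT));

        // Only representation drafts allowed by the profile are queued for embedding.
        for (RepresentationType type : RepresentationType.values()) {
            if (profile.includes(type)) {
                System.out.println("queue embedding for " + type);
            }
        }
    }
}
```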
@ -1,59 +0,0 @@
# Runtime split Patch B

Patch B builds on Patch A and makes the NEW runtime actually process embedding jobs.

## What changes

### 1. New embedding job scheduler
Adds:

- `EmbeddingJobScheduler`
- `EmbeddingJobSchedulingConfiguration`

Behavior:
- enabled only in `NEW` runtime mode
- active only when `dip.embedding.jobs.enabled=true`
- periodically calls:
  - `RepresentationEmbeddingOrchestrator.processNextReadyBatch()`

### 2. Generic import hands off to the new embedding job path
`GenericDocumentImportService` is updated so that in `NEW` mode it:

- resolves `dip.embedding.default-document-model`
- ensures the model is registered in `DOC.doc_embedding_model`
- creates embedding jobs through:
  - `RepresentationEmbeddingOrchestrator.enqueueRepresentation(...)`

It no longer creates legacy-style pending embeddings as the primary handoff for the NEW runtime path.

## Notes

- This patch assumes Patch A has already introduced:
  - `RuntimeMode`
  - `RuntimeModeProperties`
  - `@ConditionalOnRuntimeMode`
- This patch does not yet remove the legacy vectorization runtime.
  That remains the job of subsequent cutover steps.

## Expected runtime behavior in NEW mode

- `GenericDocumentImportService` persists new generic representations
- selected representations are queued into `DOC.doc_embedding_job`
- scheduler processes pending jobs
- vectors are persisted through the new embedding subsystem

## New config

Example:

```yaml
dip:
  runtime:
    mode: NEW

  embedding:
    enabled: true
    jobs:
      enabled: true
      scheduler-delay-ms: 5000
```
@ -1,36 +0,0 @@
# Runtime split Patch C

Patch C moves the **new generic search runtime** off `TedProcessorProperties.search`
and into a dedicated `dip.search.*` config tree.

## New config class
- `at.procon.dip.search.config.DipSearchProperties`

## New config root
```yaml
dip:
  search:
    ...
```

## Classes moved off `TedProcessorProperties`
- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchOrchestrator`
- `DefaultSearchResultFusionService`
- `SearchLexicalIndexStartupRunner`
- `ChunkedLongTextRepresentationBuilder`

## What this patch intentionally does not do
- it does not yet remove `TedProcessorProperties` from all NEW-mode classes
- it does not yet move `generic-ingestion` config off `ted.*`
- it does not yet finish the legacy/new config split for import/mail/TED package processing

Those should be handled in the next config-splitting patch.

## Practical result
After this patch, **new search/semantic/chunking tuning** should be configured only via:
- `dip.search.*`

while `ted.search.*` remains legacy-oriented.
@ -1,40 +0,0 @@
# Runtime Split Patch D

This patch completes the next configuration split step for the NEW runtime.

## New property classes

- `at.procon.dip.ingestion.config.DipIngestionProperties`
  - prefix: `dip.ingestion`
- `at.procon.dip.domain.ted.config.TedProjectionProperties`
  - prefix: `dip.ted.projection`

## Classes moved off `TedProcessorProperties`

### NEW-mode ingestion
- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`

### NEW-mode projection
- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`

## Additional cleanup in `GenericDocumentImportService`

It now resolves the default document embedding model through the new embedding subsystem:

- `EmbeddingProperties`
- `EmbeddingModelRegistry`
- `EmbeddingModelCatalogService`

and no longer reads vectorization model/provider/dimensions from `TedProcessorProperties`.

## What still remains for later split steps

- legacy routes/services still using `TedProcessorProperties`
- legacy/new runtime bean gating for all remaining shared classes
- moving old TED-only config fully under `legacy.ted.*`
@ -1,26 +0,0 @@
# Runtime split Patch E

This patch continues the runtime/config split by targeting the remaining NEW-mode classes
that still injected `TedProcessorProperties`.

## New config classes
- `DipIngestionProperties` (`dip.ingestion.*`)
- `TedProjectionProperties` (`dip.ted.projection.*`)

## NEW-mode classes moved off `TedProcessorProperties`
- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`
- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`

## Additional behavior change
`GenericDocumentImportService` now hands embedding work off to the new embedding subsystem by:
- resolving the default document model from `EmbeddingModelRegistry`
- ensuring the model is registered via `EmbeddingModelCatalogService`
- enqueueing jobs through `RepresentationEmbeddingOrchestrator`

This removes the new import path's runtime dependence on legacy `TedProcessorProperties.vectorization`.
@ -1,24 +0,0 @@
# Runtime split Patch G

Patch G moves the remaining NEW-mode search/chunking classes off `TedProcessorProperties.search`
and onto `DipSearchProperties` (`dip.search.*`).

## New config class
- `at.procon.dip.search.config.DipSearchProperties`

## Classes switched to `DipSearchProperties`
- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchResultFusionService`
- `DefaultSearchOrchestrator`
- `SearchLexicalIndexStartupRunner`
- `ChunkedLongTextRepresentationBuilder`

## Additional cleanup
These classes are also marked `NEW`-only in this patch.

## Effect
After Patch G, the generic NEW-mode search/chunking path no longer depends on
`TedProcessorProperties.search`. That leaves `TedProcessorProperties` much closer to
legacy-only ownership.
@ -1,17 +0,0 @@
# Runtime split Patch H

Patch H is a final cleanup / verification step after the previous split patches.

## What it does
- makes `TedProcessorProperties` explicitly `LEGACY`-only
- removes the stale `TedProcessorProperties` import/comment from `DocumentIntelligencePlatformApplication`
- adds a regression test that fails if NEW runtime classes reintroduce a dependency on `TedProcessorProperties`
- adds a simple `application-legacy.yml` profile file

## Why this matters
After the NEW search/ingestion/projection classes are moved to:
- `DipSearchProperties`
- `DipIngestionProperties`
- `TedProjectionProperties`

`TedProcessorProperties` should be owned strictly by the legacy runtime graph.
@ -1,21 +0,0 @@
# Runtime split Patch I

Patch I extracts the remaining legacy vectorization cluster off `TedProcessorProperties`
and onto a dedicated legacy-only config class.

## New config class
- `at.procon.ted.config.LegacyVectorizationProperties`
  - prefix: `legacy.ted.vectorization.*`

## Classes switched off `TedProcessorProperties`
- `GenericVectorizationRoute`
- `DocumentEmbeddingProcessingService`
- `ConfiguredEmbeddingModelStartupRunner`
- `GenericVectorizationStartupRunner`

## Additional cleanup
These classes are also marked `LEGACY`-only via `@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)`.

## Effect
The `at.procon.dip.vectorization.*` package now clearly belongs to the old runtime graph and no longer pulls
its settings from the shared monolithic `TedProcessorProperties`.
@ -1,45 +0,0 @@
# Runtime split Patch J

Patch J is a broader cleanup patch for the **actual current codebase**.

It adds the missing runtime/config split scaffolding and rewires the remaining NEW-mode classes
that still injected `TedProcessorProperties`.

## Added
- `dip.runtime` infrastructure
  - `RuntimeMode`
  - `RuntimeModeProperties`
  - `@ConditionalOnRuntimeMode`
  - `RuntimeModeCondition`
- `DipSearchProperties`
- `DipIngestionProperties`
- `TedProjectionProperties`

## Rewired off `TedProcessorProperties`
### NEW search/chunking
- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchOrchestrator`
- `SearchLexicalIndexStartupRunner`
- `DefaultSearchResultFusionService`
- `ChunkedLongTextRepresentationBuilder`

### NEW ingestion/projection
- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`
- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`

## Additional behavior
- `GenericDocumentImportService` now hands embedding work off to the new embedding subsystem
  via `RepresentationEmbeddingOrchestrator` and resolves the default model through
  `EmbeddingModelRegistry` / `EmbeddingModelCatalogService`.

## Notes
This patch intentionally targets the real current leftovers visible in the actual codebase.
It assumes the new embedding subsystem already exists.
@ -1,20 +0,0 @@
# NEW TED package import route

This patch adds a NEW-runtime TED package download path that:

- reuses the proven package sequencing rules
- stores package tracking in `TedDailyPackage`
- downloads the package tar.gz
- ingests it only through `DocumentIngestionGateway`
- never calls the legacy XML batch processing / vectorization flow

## Added classes

- `TedPackageSequenceService`
- `DefaultTedPackageSequenceService`
- `TedPackageDownloadNewProperties`
- `TedPackageDownloadNewRoute`

## Config

Use the `dip.ingestion.ted-download.*` block in `application-new.yml`.
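A sketch of that block, with key names inferred from the property getters used by `DefaultTedPackageSequenceService` in this changeset (`getStartYear()`, `isRetryCurrentYearNotFoundIndefinitely()`, `getNotFoundRetryInterval()`, `getPreviousYearGracePeriodDays()`); the values are placeholder assumptions:

```yaml
dip:
  ingestion:
    ted-download:
      start-year: 2018                                  # assumed value
      retry-current-year-not-found-indefinitely: true   # assumed value
      not-found-retry-interval: 3600000                 # milliseconds, assumed value
      previous-year-grace-period-days: 14               # assumed value
```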
@ -1,67 +0,0 @@
# Vector-sync HTTP embedding provider

This provider supports two endpoints:

- `POST {baseUrl}/vector-sync` for single-text requests
- `POST {baseUrl}/vectorize-batch` for batch document requests

## Single request

Request body:
```json
{
  "model": "intfloat/multilingual-e5-large",
  "text": "This is a sample text to vectorize"
}
```

## Batch request

Request body:
```json
{
  "model": "intfloat/multilingual-e5-large",
  "truncate_text": false,
  "truncate_length": 512,
  "chunk_size": 20,
  "items": [
    {
      "id": "2f48fd5c-9d39-4d80-9225-ea0c59c77c9a",
      "text": "This is a sample text to vectorize"
    }
  ]
}
```

## Provider configuration

```yaml
batch-request:
  truncate-text: false
  truncate-length: 512
  chunk-size: 20
```

These values are used for `/vectorize-batch` calls and can also be overridden per request via `EmbeddingRequest.providerOptions()`.

## Orchestrator batch processing

To let `RepresentationEmbeddingOrchestrator` send multiple representations in one provider call, enable batch processing for jobs and for the model:

```yaml
dip:
  embedding:
    jobs:
      enabled: true
      process-in-batches: true
      execution-batch-size: 20

    models:
      e5-default:
        supports-batch: true
```

Notes:
- jobs are grouped by `modelKey`
- non-batch-capable models still fall back to single-item execution
- `execution-batch-size` controls how many texts are sent in one `/vectorize-batch` request
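The single-text endpoint can be exercised with the JDK's built-in HTTP client. A minimal request-builder sketch, where the base URL is an assumption and the body mirrors the single-request example above:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class VectorSyncRequestSketch {

    // Builds a POST {baseUrl}/vector-sync request with the single-text body shown above.
    static HttpRequest buildSingleRequest(String baseUrl, String model, String text) {
        String body = """
                {"model": "%s", "text": "%s"}""".formatted(model, text);
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/vector-sync"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildSingleRequest(
                "http://localhost:8080", // assumed base URL
                "intfloat/multilingual-e5-large",
                "This is a sample text to vectorize");
        System.out.println(request.uri());
        // Sending it would be: HttpClient.newHttpClient().send(request, BodyHandlers.ofString())
    }
}
```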
@ -1,16 +0,0 @@
package at.procon.dip.domain.ted.config;

import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.ted.projection")
@Data
public class TedProjectionProperties {
    private boolean enabled = true;
    private boolean startupBackfillEnabled = false;
    @Positive
    private int startupBackfillLimit = 250;
}
@ -1,189 +0,0 @@
package at.procon.dip.domain.ted.service;

import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import at.procon.dip.ingestion.config.TedPackageDownloadProperties;
import at.procon.ted.model.entity.TedDailyPackage;
import at.procon.ted.repository.TedDailyPackageRepository;
import java.time.Duration;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.Year;
import java.time.ZoneOffset;
import java.util.Optional;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

/**
 * NEW-runtime implementation of TED package sequencing.
 * <p>
 * This reuses the same decision rules as the legacy TED package downloader:
 * <ul>
 *   <li>current year forward crawling first</li>
 *   <li>gap filling by walking backward to package 1</li>
 *   <li>NOT_FOUND retry handling with current-year indefinite retry support</li>
 *   <li>previous-year grace period before a tail NOT_FOUND becomes final</li>
 * </ul>
 */
@Service
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@RequiredArgsConstructor
@Slf4j
public class DefaultTedPackageSequenceService implements TedPackageSequenceService {

    private final TedPackageDownloadProperties properties;
    private final TedDailyPackageRepository packageRepository;

    @Override
    public PackageInfo getNextPackageToDownload() {
        int currentYear = Year.now().getValue();

        log.debug("Determining next TED package to download for NEW runtime (current year: {})", currentYear);

        // 1) Current year forward crawling first (newest data first)
        PackageInfo nextInCurrentYear = getNextForwardPackage(currentYear);
        if (nextInCurrentYear != null) {
            log.info("Next TED package: {} (current year {} forward)", nextInCurrentYear.identifier(), currentYear);
            return nextInCurrentYear;
        }

        // 2) Walk all years backward and fill gaps / continue unfinished years
        for (int year = currentYear; year >= properties.getStartYear(); year--) {
            PackageInfo gapFiller = getGapFillerPackage(year);
            if (gapFiller != null) {
                log.info("Next TED package: {} (filling gap in year {})", gapFiller.identifier(), year);
                return gapFiller;
            }

            if (!isYearComplete(year)) {
                PackageInfo forwardPackage = getNextForwardPackage(year);
                if (forwardPackage != null) {
                    log.info("Next TED package: {} (continuing year {})", forwardPackage.identifier(), year);
                    return forwardPackage;
                }
            } else {
                log.debug("TED package year {} is complete", year);
            }
        }

        // 3) Open a new older year if possible
        int oldestYear = getOldestYearWithData();
        if (oldestYear > properties.getStartYear()) {
            int previousYear = oldestYear - 1;
            if (previousYear >= properties.getStartYear()) {
                PackageInfo first = new PackageInfo(previousYear, 1);
                log.info("Next TED package: {} (opening year {})", first.identifier(), previousYear);
                return first;
            }
        }

        log.info("All TED package years from {} to {} appear complete - nothing to download",
                properties.getStartYear(), currentYear);
        return null;
    }

    private PackageInfo getNextForwardPackage(int year) {
        Optional<TedDailyPackage> latest = packageRepository.findLatestByYear(year);

        if (latest.isEmpty()) {
            return new PackageInfo(year, 1);
        }

        TedDailyPackage latestPackage = latest.get();

        if (latestPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.NOT_FOUND) {
            if (shouldRetryNotFoundPackage(latestPackage)) {
                return new PackageInfo(year, latestPackage.getSerialNumber());
            }

            if (isNotFoundRetryableForYear(latestPackage)) {
                log.debug("Year {} still inside NOT_FOUND retry window for package {} until {}",
                        year, latestPackage.getPackageIdentifier(), calculateNextRetryAt(latestPackage));
                return null;
            }

            log.debug("Year {} finalized after grace period at tail package {}", year, latestPackage.getPackageIdentifier());
            return null;
        }

        return new PackageInfo(year, latestPackage.getSerialNumber() + 1);
    }

    private PackageInfo getGapFillerPackage(int year) {
        Optional<TedDailyPackage> first = packageRepository.findFirstByYear(year);

        if (first.isEmpty()) {
            return null;
        }

        int minSerial = first.get().getSerialNumber();
        if (minSerial <= 1) {
            return null;
        }

        return new PackageInfo(year, minSerial - 1);
    }

    private boolean isYearComplete(int year) {
        Optional<TedDailyPackage> first = packageRepository.findFirstByYear(year);
        Optional<TedDailyPackage> latest = packageRepository.findLatestByYear(year);

        if (first.isEmpty() || latest.isEmpty()) {
            return false;
        }

        if (first.get().getSerialNumber() != 1) {
            return false;
        }

        TedDailyPackage latestPackage = latest.get();
        return latestPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.NOT_FOUND
                && !isNotFoundRetryableForYear(latestPackage);
    }

    private boolean shouldRetryNotFoundPackage(TedDailyPackage pkg) {
        if (!isNotFoundRetryableForYear(pkg)) {
            return false;
        }

        OffsetDateTime nextRetryAt = calculateNextRetryAt(pkg);
        return !nextRetryAt.isAfter(OffsetDateTime.now());
    }

    private boolean isNotFoundRetryableForYear(TedDailyPackage pkg) {
        int currentYear = Year.now().getValue();
        int packageYear = pkg.getYear() != null ? pkg.getYear() : currentYear;

        if (packageYear >= currentYear) {
            return properties.isRetryCurrentYearNotFoundIndefinitely();
        }

        return OffsetDateTime.now().isBefore(getYearRetryGraceDeadline(packageYear));
    }

    private OffsetDateTime calculateNextRetryAt(TedDailyPackage pkg) {
        OffsetDateTime lastAttemptAt = pkg.getUpdatedAt() != null
                ? pkg.getUpdatedAt()
                : (pkg.getCreatedAt() != null ? pkg.getCreatedAt() : OffsetDateTime.now());

        return lastAttemptAt.plus(Duration.ofMillis(properties.getNotFoundRetryInterval()));
    }

    private OffsetDateTime getYearRetryGraceDeadline(int year) {
        return LocalDate.of(year + 1, 1, 1)
                .atStartOfDay()
                .atOffset(ZoneOffset.UTC)
                .plusDays(properties.getPreviousYearGracePeriodDays());
    }

    private int getOldestYearWithData() {
        int currentYear = Year.now().getValue();
        for (int year = properties.getStartYear(); year <= currentYear; year++) {
            if (packageRepository.findLatestByYear(year).isPresent()) {
                return year;
            }
        }
        return currentYear;
    }
}
@ -1,25 +0,0 @@
package at.procon.dip.domain.ted.service;

/**
 * Shared package sequencing contract used to determine the next TED daily package to download.
 * <p>
 * This service encapsulates the proven sequencing rules from the legacy download implementation
 * so they can also be used by the NEW runtime without depending on the old route/service graph.
 */
public interface TedPackageSequenceService {

    /**
     * Returns the next package to download according to the current sequencing strategy,
     * or {@code null} if nothing should be downloaded right now.
     */
    PackageInfo getNextPackageToDownload();

    /**
     * Simple year/serial pair with TED package identifier helper.
     */
    record PackageInfo(int year, int serialNumber) {
        public String identifier() {
            return "%04d%05d".formatted(year, serialNumber);
        }
    }
}
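The `identifier()` helper pads year and serial number into a fixed-width TED package identifier. A quick standalone check of that format; the record is re-declared here only so the snippet compiles on its own:

```java
public class PackageIdentifierDemo {

    // Re-declaration of TedPackageSequenceService.PackageInfo for a standalone check.
    record PackageInfo(int year, int serialNumber) {
        String identifier() {
            return "%04d%05d".formatted(year, serialNumber);
        }
    }

    public static void main(String[] args) {
        // 4-digit year followed by a 5-digit zero-padded serial number
        System.out.println(new PackageInfo(2024, 7).identifier()); // 202400007
    }
}
```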
@ -1,12 +0,0 @@
package at.procon.dip.embedding.config;

import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableScheduling;

@Configuration
@EnableScheduling
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
public class EmbeddingJobSchedulingConfiguration {
}
@ -1,14 +0,0 @@
package at.procon.dip.embedding.config;

import lombok.Data;

@Data
public class EmbeddingPolicyCondition {
    private String documentType;
    private String documentFamily;
    private String sourceType;
    private String mimeType;
    private String language;
    private String ownerTenantKey;
    private String embeddingPolicyHint;
}
@ -1,16 +0,0 @@
package at.procon.dip.embedding.config;

import java.util.ArrayList;
import java.util.List;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.embedding.policies")
@Data
public class EmbeddingPolicyProperties {

    private EmbeddingPolicyUse defaultPolicy = new EmbeddingPolicyUse();
    private List<EmbeddingPolicyRule> rules = new ArrayList<>();
}
@ -1,10 +0,0 @@
package at.procon.dip.embedding.config;

import lombok.Data;

@Data
public class EmbeddingPolicyRule {
    private String name;
    private EmbeddingPolicyCondition when = new EmbeddingPolicyCondition();
    private EmbeddingPolicyUse use = new EmbeddingPolicyUse();
}
@ -1,12 +0,0 @@
package at.procon.dip.embedding.config;

import lombok.Data;

@Data
public class EmbeddingPolicyUse {
    private String policyKey;
    private String modelKey;
    private String queryModelKey;
    private String profileKey;
    private boolean enabled = true;
}
@ -1,23 +0,0 @@
package at.procon.dip.embedding.config;

import at.procon.dip.domain.document.RepresentationType;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.embedding.profiles")
@Data
public class EmbeddingProfileProperties {

    private Map<String, ProfileDefinition> definitions = new LinkedHashMap<>();

    @Data
    public static class ProfileDefinition {
        private List<RepresentationType> embedRepresentationTypes = new ArrayList<>();
    }
}
@ -1,32 +0,0 @@
package at.procon.dip.embedding.job;

import at.procon.dip.embedding.service.RepresentationEmbeddingOrchestrator;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
@RequiredArgsConstructor
@Slf4j
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@ConditionalOnProperty(prefix = "dip.embedding.jobs", name = "enabled", havingValue = "true")
public class EmbeddingJobScheduler {

    private final RepresentationEmbeddingOrchestrator orchestrator;

    @Scheduled(fixedDelayString = "${dip.embedding.jobs.scheduler-delay-ms:5000}")
    public void processNextBatch() {
        try {
            int processed = orchestrator.processNextReadyBatch();
            if (processed > 0) {
                log.debug("NEW runtime embedding job scheduler processed {} job(s)", processed);
            }
        } catch (Exception ex) {
            log.warn("NEW runtime embedding job scheduler failed: {}", ex.getMessage(), ex);
        }
    }
}
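The scheduler bean above is only created in the NEW runtime and only when `dip.embedding.jobs.enabled` is `true`; the `@Scheduled` placeholder falls back to a 5000 ms fixed delay. A hedged configuration sketch of the properties it reads:

```yaml
dip:
  embedding:
    jobs:
      enabled: true
      scheduler-delay-ms: 5000   # default from the @Scheduled placeholder
```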
@ -1,10 +0,0 @@
package at.procon.dip.embedding.policy;

public record EmbeddingPolicy(
        String policyKey,
        String modelKey,
        String queryModelKey,
        String profileKey,
        boolean enabled
) {
}
@ -1,13 +0,0 @@
package at.procon.dip.embedding.policy;

import at.procon.dip.domain.document.RepresentationType;
import java.util.List;

public record EmbeddingProfile(
        String profileKey,
        List<RepresentationType> embedRepresentationTypes
) {
    public boolean includes(RepresentationType representationType) {
        return embedRepresentationTypes != null && embedRepresentationTypes.contains(representationType);
    }
}
@ -1,68 +0,0 @@
package at.procon.dip.embedding.provider.http;

import at.procon.dip.embedding.model.ResolvedEmbeddingProviderConfig;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import lombok.RequiredArgsConstructor;

@RequiredArgsConstructor
abstract class AbstractHttpEmbeddingProviderSupport {

    protected final ObjectMapper objectMapper;
    protected final HttpClient httpClient = HttpClient.newBuilder()
            .version(HttpClient.Version.HTTP_1_1)
            .build();

    protected String trimTrailingSlash(String value) {
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException("Embedding provider baseUrl must be configured");
        }
        return value.endsWith("/") ? value.substring(0, value.length() - 1) : value;
    }

    protected HttpResponse<String> postJson(ResolvedEmbeddingProviderConfig providerConfig,
                                            String path,
                                            Object body) throws IOException, InterruptedException {
        HttpRequest.Builder builder = HttpRequest.newBuilder()
                .uri(URI.create(trimTrailingSlash(providerConfig.baseUrl()) + path))
                .timeout(providerConfig.readTimeout() == null ? Duration.ofSeconds(60) : providerConfig.readTimeout())
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        objectMapper.writeValueAsString(body),
                        StandardCharsets.UTF_8
                ));

        if (providerConfig.apiKey() != null && !providerConfig.apiKey().isBlank()) {
            builder.header("Authorization", "Bearer " + providerConfig.apiKey());
        }
        if (providerConfig.headers() != null) {
            providerConfig.headers().forEach(builder::header);
        }

        HttpResponse<String> response = httpClient.send(
                builder.build(),
                HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8)
        );
        if (response.statusCode() / 100 != 2) {
            throw new IllegalStateException(
                    "Embedding provider returned status %d: %s".formatted(response.statusCode(), response.body())
            );
        }
        return response;
    }

    protected float[] toArray(List<Float> embedding) {
        float[] result = new float[embedding.size()];
        for (int i = 0; i < embedding.size(); i++) {
            result[i] = embedding.get(i);
        }
        return result;
    }
}
@ -1,374 +0,0 @@
package at.procon.dip.embedding.provider.http;

import at.procon.dip.embedding.model.EmbeddingModelDescriptor;
import at.procon.dip.embedding.model.EmbeddingProviderResult;
import at.procon.dip.embedding.model.EmbeddingRequest;
import at.procon.dip.embedding.model.ResolvedEmbeddingProviderConfig;
import at.procon.dip.embedding.provider.EmbeddingProvider;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import org.springframework.stereotype.Component;

/**
 * HTTP provider for vector APIs.
 *
 * Supported endpoints:
 *   POST {baseUrl}/vector-sync      - single text
 *   POST {baseUrl}/vectorize-batch  - multiple texts
 */
@Component
public class VectorSyncHttpEmbeddingProvider extends AbstractHttpEmbeddingProviderSupport implements EmbeddingProvider {

    private static final String PROVIDER_TYPE = "http-vector-sync";

    private static final boolean DEFAULT_TRUNCATE_TEXT = false;
    private static final int DEFAULT_TRUNCATE_LENGTH = 512;
    private static final int DEFAULT_CHUNK_SIZE = 20;

    private static final List<String> TRUNCATE_TEXT_KEYS = List.of(
            "vectorize-batch.truncate-text",
            "vectorize-batch.truncate_text",
            "truncate_text",
            "truncate-text",
            "truncateText"
    );

    private static final List<String> TRUNCATE_LENGTH_KEYS = List.of(
            "vectorize-batch.truncate-length",
            "vectorize-batch.truncate_length",
            "truncate_length",
            "truncate-length",
            "truncateLength"
    );

    private static final List<String> CHUNK_SIZE_KEYS = List.of(
            "vectorize-batch.chunk-size",
            "vectorize-batch.chunk_size",
            "chunk_size",
            "chunk-size",
            "chunkSize"
    );

    public VectorSyncHttpEmbeddingProvider(ObjectMapper objectMapper) {
        super(objectMapper);
    }

    @Override
    public String providerType() {
        return PROVIDER_TYPE;
    }

    @Override
    public boolean supports(EmbeddingModelDescriptor model, ResolvedEmbeddingProviderConfig providerConfig) {
        return PROVIDER_TYPE.equalsIgnoreCase(providerConfig.providerType());
    }

    @Override
    public EmbeddingProviderResult embedDocuments(ResolvedEmbeddingProviderConfig providerConfig,
                                                  EmbeddingModelDescriptor model,
                                                  EmbeddingRequest request) {
        return execute(providerConfig, model, request);
    }

    @Override
    public EmbeddingProviderResult embedQuery(ResolvedEmbeddingProviderConfig providerConfig,
                                              EmbeddingModelDescriptor model,
                                              EmbeddingRequest request) {
        return execute(providerConfig, model, request);
    }

    private EmbeddingProviderResult execute(ResolvedEmbeddingProviderConfig providerConfig,
                                            EmbeddingModelDescriptor model,
                                            EmbeddingRequest request) {
        if (request.texts() == null || request.texts().isEmpty()) {
            throw new IllegalArgumentException("Embedding request texts must not be empty");
        }

        try {
            return request.texts().size() == 1
                    ? executeSingle(providerConfig, model, request.texts().getFirst())
                    : executeBatch(providerConfig, model, request);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Embedding provider call interrupted", e);
        } catch (IOException e) {
            throw new IllegalStateException("Failed to call embedding provider", e);
        }
    }

    private EmbeddingProviderResult executeSingle(ResolvedEmbeddingProviderConfig providerConfig,
                                                  EmbeddingModelDescriptor model,
                                                  String text) throws IOException, InterruptedException {
        HttpResponse<String> response = postJson(
                providerConfig,
                "/vector-sync",
                new VectorSyncRequest(model.providerModelKey(), text)
        );

        VectorSyncResponse parsed = objectMapper.readValue(response.body(), VectorSyncResponse.class);
        float[] vector = extractVector(parsed.vector, parsed.combinedVector, model);

        return new EmbeddingProviderResult(
                model,
                List.of(vector),
                List.of(),
                null,
                parsed.tokenCount
        );
    }

    private EmbeddingProviderResult executeBatch(ResolvedEmbeddingProviderConfig providerConfig,
                                                 EmbeddingModelDescriptor model,
                                                 EmbeddingRequest request) throws IOException, InterruptedException {
        BatchRequestSettings settings = resolveBatchRequestSettings(providerConfig, request.providerOptions());

        if (settings.truncateLength() <= 0) {
            throw new IllegalArgumentException("Batch truncate length must be > 0");
        }
        if (settings.chunkSize() <= 0) {
            throw new IllegalArgumentException("Batch chunk size must be > 0");
        }

        List<String> requestOrder = new ArrayList<>(request.texts().size());
        List<VectorizeBatchItemRequest> items = new ArrayList<>(request.texts().size());

        for (String text : request.texts()) {
            String id = UUID.randomUUID().toString();
            requestOrder.add(id);
            items.add(new VectorizeBatchItemRequest(id, text));
        }

        HttpResponse<String> response = postJson(
                providerConfig,
                "/vectorize-batch",
                new VectorizeBatchRequest(
                        model.providerModelKey(),
                        settings.truncateText(),
                        settings.truncateLength(),
                        settings.chunkSize(),
                        items
                )
        );

        VectorizeBatchResponse parsed = objectMapper.readValue(response.body(), VectorizeBatchResponse.class);
        if (parsed.results == null || parsed.results.isEmpty()) {
            throw new IllegalStateException("Vectorize-batch provider returned no results");
        }

        Map<String, VectorizeBatchItemResponse> resultById = new HashMap<>();
        for (VectorizeBatchItemResponse result : parsed.results) {
            resultById.put(result.id, result);
        }

        List<float[]> vectors = new ArrayList<>(request.texts().size());
        int totalTokenCount = 0;
        boolean hasAnyTokenCount = false;

        for (String id : requestOrder) {
            VectorizeBatchItemResponse item = resultById.get(id);
            if (item == null) {
                throw new IllegalStateException("Vectorize-batch provider response is missing item for id " + id);
            }

            vectors.add(extractVector(item.vector, item.combinedVector, model));

            if (item.tokenCount != null) {
                totalTokenCount += item.tokenCount;
                hasAnyTokenCount = true;
            }
        }

        return new EmbeddingProviderResult(
                model,
                vectors,
                List.of(),
                null,
                hasAnyTokenCount ? totalTokenCount : null
        );
    }

    private BatchRequestSettings resolveBatchRequestSettings(ResolvedEmbeddingProviderConfig providerConfig,
                                                             Map<String, Object> providerOptions) {
        boolean truncateText = resolveBooleanOption(
                providerOptions,
                TRUNCATE_TEXT_KEYS,
                providerConfig.batchTruncateText() != null ? providerConfig.batchTruncateText() : DEFAULT_TRUNCATE_TEXT
        );
        int truncateLength = resolveIntOption(
                providerOptions,
                TRUNCATE_LENGTH_KEYS,
                providerConfig.batchTruncateLength() != null ? providerConfig.batchTruncateLength() : DEFAULT_TRUNCATE_LENGTH
        );
        int chunkSize = resolveIntOption(
                providerOptions,
                CHUNK_SIZE_KEYS,
                providerConfig.batchChunkSize() != null ? providerConfig.batchChunkSize() : DEFAULT_CHUNK_SIZE
        );
        return new BatchRequestSettings(truncateText, truncateLength, chunkSize);
    }

    private boolean resolveBooleanOption(Map<String, Object> providerOptions,
                                         List<String> keys,
                                         boolean defaultValue) {
        Object raw = resolveOption(providerOptions, keys);
        if (raw == null) {
            return defaultValue;
        }
        if (raw instanceof Boolean booleanValue) {
            return booleanValue;
        }
        String normalized = String.valueOf(raw).trim();
        if (normalized.isEmpty()) {
            return defaultValue;
        }
        return Boolean.parseBoolean(normalized);
    }

    private int resolveIntOption(Map<String, Object> providerOptions,
                                 List<String> keys,
                                 int defaultValue) {
        Object raw = resolveOption(providerOptions, keys);
        if (raw == null) {
            return defaultValue;
        }
        if (raw instanceof Number number) {
            return number.intValue();
        }
        String normalized = String.valueOf(raw).trim();
        if (normalized.isEmpty()) {
            return defaultValue;
        }
        return Integer.parseInt(normalized);
    }

    private Object resolveOption(Map<String, Object> providerOptions, List<String> keys) {
        if (providerOptions == null || providerOptions.isEmpty()) {
            return null;
        }
        for (String key : keys) {
            if (providerOptions.containsKey(key)) {
                return providerOptions.get(key);
            }
        }
        return null;
    }

    private float[] extractVector(List<Float> vector,
                                  List<Float> combinedVector,
                                  EmbeddingModelDescriptor model) {
        float[] resolved;

        if (combinedVector != null && !combinedVector.isEmpty()) {
            resolved = toArray(combinedVector);
        } else if (vector != null && !vector.isEmpty()) {
            resolved = toArray(vector);
        } else {
            throw new IllegalStateException("Embedding provider returned no vector");
        }

        if (model.dimensions() > 0 && resolved.length != model.dimensions()) {
            throw new IllegalStateException(
                    "Embedding provider returned dimension %d for model %s, expected %d"
                            .formatted(resolved.length, model.modelKey(), model.dimensions())
            );
        }

        return resolved;
    }

    private record BatchRequestSettings(boolean truncateText, int truncateLength, int chunkSize) {
    }

    private record VectorSyncRequest(
            @JsonProperty("model") String model,
            @JsonProperty("text") String text
    ) {
    }

    private record VectorizeBatchRequest(
            @JsonProperty("model") String model,
            @JsonProperty("truncate_text") boolean truncateText,
            @JsonProperty("truncate_length") int truncateLength,
            @JsonProperty("chunk_size") int chunkSize,
            @JsonProperty("items") List<VectorizeBatchItemRequest> items
    ) {
    }

    private record VectorizeBatchItemRequest(
            @JsonProperty("id") String id,
            @JsonProperty("text") String text
    ) {
    }

    static class VectorSyncResponse {
        @JsonProperty("runtime_ms")
        public Double runtimeMs;

        @JsonProperty("vector")
        public List<Float> vector;

        @JsonProperty("incomplete")
        public Boolean incomplete;

        @JsonProperty("combined_vector")
        public List<Float> combinedVector;

        @JsonProperty("token_count")
        public Integer tokenCount;

        @JsonProperty("model")
        public String model;

        @JsonProperty("max_seq_length")
        public Integer maxSeqLength;
    }

    static class VectorizeBatchResponse {
        @JsonProperty("model")
        public String model;

        @JsonProperty("count")
        public Integer count;

        @JsonProperty("results")
        public List<VectorizeBatchItemResponse> results;
    }

    static class VectorizeBatchItemResponse {
        @JsonProperty("id")
        public String id;

        @JsonProperty("vector")
        public List<Float> vector;

        @JsonProperty("token_count")
        public Integer tokenCount;

        @JsonProperty("runtime_ms")
        public Double runtimeMs;

        @JsonProperty("incomplete")
        public Boolean incomplete;

        @JsonProperty("combined_vector")
        public List<Float> combinedVector;

        @JsonProperty("truncated")
        public Boolean truncated;

        @JsonProperty("truncate_length")
        public Integer truncateLength;

        @JsonProperty("model")
        public String model;

        @JsonProperty("max_seq_length")
        public Integer maxSeqLength;
    }
}
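From the request/response records above (field names taken verbatim from the `@JsonProperty` annotations; the concrete values are illustrative only), the `/vectorize-batch` request body looks roughly like:

```json
{
  "model": "example-model",
  "truncate_text": false,
  "truncate_length": 512,
  "chunk_size": 20,
  "items": [
    { "id": "item-1", "text": "first document text" },
    { "id": "item-2", "text": "second document text" }
  ]
}
```

In the actual code the `id` values are random UUIDs; each entry in the response `results` array carries `id`, `vector` (or `combined_vector`, which takes precedence), and optionally `token_count`. Vectors are matched back to inputs by `id`, and a missing id or a dimension mismatch against the model descriptor raises an `IllegalStateException`.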
@ -1,131 +0,0 @@
package at.procon.dip.embedding.service;

import at.procon.dip.domain.document.entity.Document;
import at.procon.dip.embedding.config.EmbeddingPolicyCondition;
import at.procon.dip.embedding.config.EmbeddingPolicyProperties;
import at.procon.dip.embedding.config.EmbeddingPolicyRule;
import at.procon.dip.embedding.config.EmbeddingPolicyUse;
import at.procon.dip.embedding.policy.EmbeddingPolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor;
import java.util.Map;
import java.util.Objects;
import java.util.regex.Pattern;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class DefaultEmbeddingPolicyResolver implements EmbeddingPolicyResolver {

    private final EmbeddingPolicyProperties properties;

    @Override
    public EmbeddingPolicy resolve(Document document, SourceDescriptor sourceDescriptor) {
        String overridePolicy = attributeValue(sourceDescriptor, "embeddingPolicyKey");
        if (overridePolicy != null) {
            return policyByKey(overridePolicy);
        }

        String policyHint = policyHint(sourceDescriptor);
        if (policyHint != null) {
            return policyByKey(policyHint);
        }

        for (EmbeddingPolicyRule rule : properties.getRules()) {
            if (matches(rule.getWhen(), document, sourceDescriptor)) {
                return toPolicy(rule.getUse());
            }
        }

        return toPolicy(properties.getDefaultPolicy());
    }

    private EmbeddingPolicy policyByKey(String policyKey) {
        for (EmbeddingPolicyRule rule : properties.getRules()) {
            if (rule.getUse() != null && policyKey.equals(rule.getUse().getPolicyKey())) {
                return toPolicy(rule.getUse());
            }
        }
        EmbeddingPolicyUse def = properties.getDefaultPolicy();
        if (def != null && policyKey.equals(def.getPolicyKey())) {
            return toPolicy(def);
        }
        throw new IllegalArgumentException("Unknown embedding policy key: " + policyKey);
    }

    private EmbeddingPolicy toPolicy(EmbeddingPolicyUse use) {
        if (use == null) {
            throw new IllegalStateException("Embedding policy configuration is missing");
        }
        return new EmbeddingPolicy(
                use.getPolicyKey(),
                use.getModelKey(),
                use.getQueryModelKey(),
                use.getProfileKey(),
                use.isEnabled()
        );
    }

    private boolean matches(EmbeddingPolicyCondition c, Document document, SourceDescriptor sourceDescriptor) {
        if (c == null) {
            return true;
        }

        if (!matchesExact(c.getDocumentType(), enumName(document != null ? document.getDocumentType() : null))) {
            return false;
        }
        if (!matchesExact(c.getDocumentFamily(), enumName(document != null ? document.getDocumentFamily() : null))) {
            return false;
        }
        if (!matchesExact(c.getSourceType(), enumName(sourceDescriptor != null ? sourceDescriptor.sourceType() : null))) {
            return false;
        }
        if (!matchesMime(c.getMimeType(), sourceDescriptor != null ? sourceDescriptor.mediaType() : null)) {
            return false;
        }
        if (!matchesExact(c.getLanguage(), document != null ? document.getLanguageCode() : null)) {
            return false;
        }
        if (!matchesExact(c.getOwnerTenantKey(), document != null && document.getOwnerTenant() != null ? document.getOwnerTenant().getTenantKey() : null)) {
            return false;
        }
        return matchesExact(c.getEmbeddingPolicyHint(), policyHint(sourceDescriptor));
    }

    private boolean matchesExact(String expected, String actual) {
        if (expected == null || expected.isBlank()) {
            return true;
        }
        return Objects.equals(expected, actual);
    }

    private boolean matchesMime(String pattern, String actual) {
        if (pattern == null || pattern.isBlank()) {
            return true;
        }
        if (actual == null || actual.isBlank()) {
            return false;
        }
        return Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(actual).matches();
    }

    private String enumName(Enum<?> value) {
        return value != null ? value.name() : null;
    }

    private String policyHint(SourceDescriptor sourceDescriptor) {
        return attributeValue(sourceDescriptor, "embeddingPolicyHint");
    }

    private String attributeValue(SourceDescriptor sourceDescriptor, String key) {
        if (sourceDescriptor == null) {
            return null;
        }
        Map<String, String> attributes = sourceDescriptor.attributes();
        if (attributes == null) {
            return null;
        }
        String value = attributes.get(key);
        return (value == null || value.isBlank()) ? null : value;
    }
}
@ -1,31 +0,0 @@
package at.procon.dip.embedding.service;

import at.procon.dip.embedding.config.EmbeddingProfileProperties;
import at.procon.dip.embedding.policy.EmbeddingProfile;
import java.util.List;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class DefaultEmbeddingProfileResolver implements EmbeddingProfileResolver {

    private final EmbeddingProfileProperties properties;

    @Override
    public EmbeddingProfile resolve(String profileKey) {
        if (profileKey == null || profileKey.isBlank()) {
            throw new IllegalArgumentException("Embedding profile key must not be blank");
        }

        EmbeddingProfileProperties.ProfileDefinition definition = properties.getDefinitions().get(profileKey);
        if (definition == null) {
            throw new IllegalArgumentException("Unknown embedding profile: " + profileKey);
        }

        return new EmbeddingProfile(
                profileKey,
                List.copyOf(definition.getEmbedRepresentationTypes())
        );
    }
}
@ -1,9 +0,0 @@
package at.procon.dip.embedding.service;

import at.procon.dip.domain.document.entity.Document;
import at.procon.dip.embedding.policy.EmbeddingPolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor;

public interface EmbeddingPolicyResolver {
    EmbeddingPolicy resolve(Document document, SourceDescriptor sourceDescriptor);
}
@ -1,7 +0,0 @@
package at.procon.dip.embedding.service;

import at.procon.dip.embedding.policy.EmbeddingProfile;

public interface EmbeddingProfileResolver {
    EmbeddingProfile resolve(String profileKey);
}
@ -1,328 +0,0 @@
package at.procon.dip.ingestion.camel;

import at.procon.dip.domain.document.SourceType;
import at.procon.dip.ingestion.config.TedPackageDownloadProperties;
import at.procon.dip.ingestion.service.DocumentIngestionGateway;
import at.procon.dip.ingestion.spi.IngestionResult;
import at.procon.dip.ingestion.spi.OriginalContentStoragePolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import at.procon.ted.model.entity.TedDailyPackage;
import at.procon.ted.repository.TedDailyPackageRepository;
import at.procon.dip.domain.ted.service.TedPackageSequenceService;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.time.OffsetDateTime;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.apache.camel.Exchange;
import org.apache.camel.LoggingLevel;
import org.apache.camel.builder.RouteBuilder;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.stereotype.Component;

/**
 * NEW-runtime TED daily package download route.
 * <p>
 * Reuses the proven package sequencing rules through {@link TedPackageSequenceService},
 * but hands off processing only to the NEW ingestion gateway. No legacy XML batch persistence,
 * no legacy vectorization route, no old semantic path.
 */
@Component
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@ConditionalOnProperty(name = "dip.ingestion.ted-download.enabled", havingValue = "true")
@RequiredArgsConstructor
@Slf4j
public class TedPackageDownloadRoute extends RouteBuilder {

    private static final String ROUTE_ID_SCHEDULER = "ted-package-new-scheduler";
    private static final String ROUTE_ID_DOWNLOADER = "ted-package-new-downloader";
    private static final String ROUTE_ID_ERROR = "ted-package-new-error-handler";

    private final TedPackageDownloadProperties properties;
    private final TedDailyPackageRepository packageRepository;
    private final TedPackageSequenceService sequenceService;
    private final DocumentIngestionGateway documentIngestionGateway;

    @Override
    public void configure() {
        errorHandler(deadLetterChannel("direct:ted-package-new-error")
                .maximumRedeliveries(3)
                .redeliveryDelay(10_000)
                .retryAttemptedLogLevel(LoggingLevel.WARN)
                .logStackTrace(true));

        from("direct:ted-package-new-error")
                .routeId(ROUTE_ID_ERROR)
                .process(this::handleError);

        from("timer:ted-package-new-scheduler?period={{dip.ingestion.ted-download.poll-interval:3600000}}&delay=0")
                .routeId(ROUTE_ID_SCHEDULER)
                .process(this::checkRunningPackages)
                .choice()
                    .when(header("tooManyRunning").isEqualTo(true))
                        .log(LoggingLevel.INFO, "Skipping NEW TED package download - already ${header.runningCount} packages in progress")
                    .otherwise()
                        .process(this::determineNextPackage)
                        .choice()
                            .when(header("packageId").isNotNull())
                                .to("direct:download-ted-package-new")
                            .otherwise()
                                .log(LoggingLevel.INFO, "No NEW TED package to download right now")
                        .end()
                .end();

        from("direct:download-ted-package-new")
                .routeId(ROUTE_ID_DOWNLOADER)
                .log(LoggingLevel.INFO, "NEW TED package download started: ${header.packageId}")
                .setHeader("downloadStartTime", constant(System.currentTimeMillis()))
                .process(this::createPackageRecord)
                .delay(simple("{{dip.ingestion.ted-download.delay-between-downloads:5000}}"))
                .setHeader(Exchange.HTTP_METHOD, constant("GET"))
                .setHeader("CamelHttpConnectionClose", constant(true))
                .toD("${header.downloadUrl}?bridgeEndpoint=true&throwExceptionOnFailure=false&socketTimeout={{dip.ingestion.ted-download.download-timeout:300000}}")
                .choice()
                    .when(header(Exchange.HTTP_RESPONSE_CODE).isEqualTo(200))
                        .process(this::calculateHash)
                        .process(this::checkDuplicateByHash)
                        .choice()
                            .when(header("isDuplicate").isEqualTo(true))
                                .process(this::markDuplicate)
                            .otherwise()
                                .process(this::saveDownloadedPackage)
                                .process(this::ingestThroughGateway)
                                .process(this::markCompleted)
                        .endChoice()
                    .when(header(Exchange.HTTP_RESPONSE_CODE).isEqualTo(404))
                        .process(this::markNotFound)
                    .otherwise()
                        .process(this::markFailed)
                .end();
    }

    private void checkRunningPackages(Exchange exchange) {
        long downloadingCount = packageRepository.findByDownloadStatus(TedDailyPackage.DownloadStatus.DOWNLOADING).size();
        long processingCount = packageRepository.findByDownloadStatus(TedDailyPackage.DownloadStatus.PROCESSING).size();
        long runningCount = downloadingCount + processingCount;

        exchange.getIn().setHeader("runningCount", runningCount);
        exchange.getIn().setHeader("tooManyRunning", runningCount >= properties.getMaxRunningPackages());

        if (runningCount > 0) {
            log.info("Currently {} TED packages in progress in NEW runtime ({} downloading, {} processing)",
                    runningCount, downloadingCount, processingCount);
        }
    }

    private void determineNextPackage(Exchange exchange) {
        List<TedDailyPackage> pendingPackages = packageRepository.findByDownloadStatus(TedDailyPackage.DownloadStatus.PENDING);

        if (!pendingPackages.isEmpty()) {
            TedDailyPackage pkg = pendingPackages.get(0);
            log.info("Retrying PENDING TED package in NEW runtime: {}", pkg.getPackageIdentifier());
            setPackageHeaders(exchange, pkg.getYear(), pkg.getSerialNumber());
            return;
        }

        TedPackageSequenceService.PackageInfo packageInfo = sequenceService.getNextPackageToDownload();
        if (packageInfo == null) {
            exchange.getIn().setHeader("packageId", null);
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
|
|
||||||
setPackageHeaders(exchange, packageInfo.year(), packageInfo.serialNumber());
|
|
||||||
}
|
|
||||||
|
|
||||||
private void setPackageHeaders(Exchange exchange, int year, int serialNumber) {
|
|
||||||
String packageId = "%04d%05d".formatted(year, serialNumber);
|
|
||||||
String downloadUrl = properties.getBaseUrl() + packageId;
|
|
||||||
|
|
||||||
exchange.getIn().setHeader("packageId", packageId);
|
|
||||||
exchange.getIn().setHeader("year", year);
|
|
||||||
exchange.getIn().setHeader("serialNumber", serialNumber);
|
|
||||||
exchange.getIn().setHeader("downloadUrl", downloadUrl);
|
|
||||||
}
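For reference, the `%04d%05d` format string in `setPackageHeaders` zero-pads the year to four digits and the serial number to five, yielding a fixed nine-character package identifier. A minimal standalone sketch (the class name is illustrative):

```java
public class PackageIdFormatSketch {

    // Mirrors setPackageHeaders: 4-digit year + 5-digit zero-padded serial number.
    static String packageId(int year, int serialNumber) {
        return "%04d%05d".formatted(year, serialNumber);
    }

    public static void main(String[] args) {
        // year 2025, serial 42 -> "202500042"
        System.out.println(packageId(2025, 42));
    }
}
```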

private void createPackageRecord(Exchange exchange) {
    String packageId = exchange.getIn().getHeader("packageId", String.class);
    Integer year = exchange.getIn().getHeader("year", Integer.class);
    Integer serialNumber = exchange.getIn().getHeader("serialNumber", Integer.class);
    String downloadUrl = exchange.getIn().getHeader("downloadUrl", String.class);

    if (packageRepository.existsByPackageIdentifier(packageId)) {
        return;
    }

    TedDailyPackage pkg = TedDailyPackage.builder()
            .packageIdentifier(packageId)
            .year(year)
            .serialNumber(serialNumber)
            .downloadUrl(downloadUrl)
            .downloadStatus(TedDailyPackage.DownloadStatus.DOWNLOADING)
            .build();

    packageRepository.save(pkg);
}

private void calculateHash(Exchange exchange) throws Exception {
    byte[] body = exchange.getIn().getBody(byte[].class);
    MessageDigest digest = MessageDigest.getInstance("SHA-256");
    byte[] hashBytes = digest.digest(body);

    StringBuilder sb = new StringBuilder();
    for (byte b : hashBytes) {
        sb.append(String.format("%02x", b));
    }

    exchange.getIn().setHeader("fileHash", sb.toString());
}
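The manual hex loop in `calculateHash` can also be written with `java.util.HexFormat` (Java 17+), which produces the same lowercase hex string. A standalone sketch of the same digest computation (class name is illustrative):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class FileHashSketch {

    // Same SHA-256 digest as calculateHash, hex-encoded via HexFormat
    // instead of a manual StringBuilder loop.
    static String sha256Hex(byte[] body) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(body));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available in every JRE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // SHA-256 of empty input is a well-known constant.
        System.out.println(sha256Hex(new byte[0]));
        // -> e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
    }
}
```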

private void checkDuplicateByHash(Exchange exchange) {
    String hash = exchange.getIn().getHeader("fileHash", String.class);

    Optional<TedDailyPackage> duplicate = packageRepository.findAll().stream()
            .filter(p -> hash.equals(p.getFileHash()))
            .findFirst();

    exchange.getIn().setHeader("isDuplicate", duplicate.isPresent());
    duplicate.ifPresent(pkg -> exchange.getIn().setHeader("duplicateOf", pkg.getPackageIdentifier()));
}

private void saveDownloadedPackage(Exchange exchange) throws Exception {
    String packageId = exchange.getIn().getHeader("packageId", String.class);
    String hash = exchange.getIn().getHeader("fileHash", String.class);
    byte[] body = exchange.getIn().getBody(byte[].class);

    Path downloadDir = Paths.get(properties.getDownloadDirectory());
    Files.createDirectories(downloadDir);
    Path downloadPath = downloadDir.resolve(packageId + ".tar.gz");
    Files.write(downloadPath, body);

    long downloadDuration = System.currentTimeMillis() -
            exchange.getIn().getHeader("downloadStartTime", Long.class);

    packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
        pkg.setFileHash(hash);
        pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.DOWNLOADED);
        pkg.setDownloadedAt(OffsetDateTime.now());
        pkg.setDownloadDurationMs(downloadDuration);
        packageRepository.save(pkg);
    });

    exchange.getIn().setHeader("downloadPath", downloadPath.toString());
}

private void ingestThroughGateway(Exchange exchange) {
    String packageId = exchange.getIn().getHeader("packageId", String.class);
    String downloadPath = exchange.getIn().getHeader("downloadPath", String.class);

    packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
        pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.PROCESSING);
        packageRepository.save(pkg);
    });

    IngestionResult ingestionResult = documentIngestionGateway.ingest(new SourceDescriptor(
            null,
            SourceType.TED_PACKAGE,
            packageId,
            downloadPath,
            packageId + ".tar.gz",
            "application/gzip",
            null,
            null,
            OffsetDateTime.now(),
            OriginalContentStoragePolicy.DEFAULT,
            Map.of(
                    "packageId", packageId,
                    "title", packageId + ".tar.gz"
            )
    ));

    int importedChildCount = Math.max(0, ingestionResult.documents().size() - 1);
    exchange.getIn().setHeader("gatewayImportedChildCount", importedChildCount);
    exchange.getIn().setHeader("gatewayImportWarnings", ingestionResult.warnings().size());

    packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
        pkg.setXmlFileCount(importedChildCount);
        pkg.setProcessedCount(importedChildCount);
        pkg.setFailedCount(0);
        packageRepository.save(pkg);
    });
}

private void markCompleted(Exchange exchange) throws Exception {
    String packageId = exchange.getIn().getHeader("packageId", String.class);
    String downloadPath = exchange.getIn().getHeader("downloadPath", String.class);

    packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
        pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.COMPLETED);
        pkg.setProcessedAt(OffsetDateTime.now());
        if (pkg.getDownloadedAt() != null) {
            long processingDuration = Math.max(0L,
                    java.time.Duration.between(pkg.getDownloadedAt(), OffsetDateTime.now()).toMillis());
            pkg.setProcessingDurationMs(processingDuration);
        }
        packageRepository.save(pkg);
    });

    if (properties.isDeleteAfterIngestion() && downloadPath != null) {
        Files.deleteIfExists(Path.of(downloadPath));
    }

    packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
        long totalDuration = (pkg.getDownloadDurationMs() != null ? pkg.getDownloadDurationMs() : 0L)
                + (pkg.getProcessingDurationMs() != null ? pkg.getProcessingDurationMs() : 0L);
        log.info("NEW TED package {} completed: xmlCount={}, processed={}, failed={}, totalDuration={}ms",
                packageId, pkg.getXmlFileCount(), pkg.getProcessedCount(), pkg.getFailedCount(), totalDuration);
    });
}

private void markNotFound(Exchange exchange) {
    String packageId = exchange.getIn().getHeader("packageId", String.class);
    packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
        pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.NOT_FOUND);
        pkg.setErrorMessage("Package not found (404)");
        packageRepository.save(pkg);
    });
}

private void markFailed(Exchange exchange) {
    String packageId = exchange.getIn().getHeader("packageId", String.class);
    Integer httpCode = exchange.getIn().getHeader(Exchange.HTTP_RESPONSE_CODE, Integer.class);
    packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
        pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.FAILED);
        pkg.setErrorMessage("HTTP " + httpCode);
        packageRepository.save(pkg);
    });
}

private void markDuplicate(Exchange exchange) {
    String packageId = exchange.getIn().getHeader("packageId", String.class);
    String duplicateOf = exchange.getIn().getHeader("duplicateOf", String.class);
    packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
        pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.COMPLETED);
        pkg.setErrorMessage("Duplicate of " + duplicateOf);
        pkg.setProcessedAt(OffsetDateTime.now());
        packageRepository.save(pkg);
    });
}

private void handleError(Exchange exchange) {
    Exception exception = exchange.getProperty(Exchange.EXCEPTION_CAUGHT, Exception.class);
    String packageId = exchange.getIn().getHeader("packageId", String.class);

    if (packageId != null) {
        packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
            pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.FAILED);
            pkg.setErrorMessage(exception != null ? exception.getMessage() : "Unknown route error");
            packageRepository.save(pkg);
        });
    }
}
}
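`checkDuplicateByHash` loads every persisted package via `findAll()` and filters in memory; the decision it makes reduces to a first-match-by-hash lookup. A Spring-free sketch of that logic (the record and names are illustrative stand-ins for `TedDailyPackage`):

```java
import java.util.List;
import java.util.Optional;

public class DuplicateCheckSketch {

    // Illustrative stand-in for TedDailyPackage (identifier + file hash only).
    record Pkg(String packageIdentifier, String fileHash) {}

    // Same first-match semantics as checkDuplicateByHash.
    static Optional<Pkg> findDuplicate(List<Pkg> all, String hash) {
        return all.stream()
                .filter(p -> hash.equals(p.fileHash()))
                .findFirst();
    }

    public static void main(String[] args) {
        List<Pkg> all = List.of(new Pkg("202500001", "aaa"), new Pkg("202500002", "bbb"));
        // prints "202500002"
        System.out.println(findDuplicate(all, "bbb").map(Pkg::packageIdentifier).orElse("none"));
    }
}
```

On a large table, a derived repository query (for example a hypothetical `findFirstByFileHash(String)` following Spring Data's naming convention) would avoid materializing all rows; the sketch only mirrors the in-memory logic actually shown above.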

@ -1,61 +0,0 @@
package at.procon.dip.ingestion.config;

import at.procon.dip.domain.access.DocumentVisibility;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.ingestion")
@Data
public class DipIngestionProperties {

    private boolean enabled = false;
    private boolean fileSystemEnabled = false;
    private boolean restUploadEnabled = true;
    private String inputDirectory = "/ted.europe/generic-input";
    private String filePattern = ".*\\.(pdf|txt|html|htm|xml|md|markdown|csv|json|yaml|yml)$";
    private String processedDirectory = ".dip-processed";
    private String errorDirectory = ".dip-error";

    @Positive
    private long pollInterval = 15000;

    @Positive
    private int maxMessagesPerPoll = 10;

    private String defaultOwnerTenantKey;
    private DocumentVisibility defaultVisibility = DocumentVisibility.PUBLIC;
    private String defaultLanguageCode;

    private boolean storeOriginalBinaryInDb = true;

    @Positive
    private int maxBinaryBytesInDb = 5242880;

    private boolean deduplicateByContentHash = true;
    private boolean storeOriginalContentForWrapperDocuments = true;
    private boolean vectorizePrimaryRepresentationOnly = true;

    @NotBlank
    private String importBatchId = "phase4-generic";

    private boolean tedPackageAdapterEnabled = true;
    private boolean mailAdapterEnabled = false;

    private String mailDefaultOwnerTenantKey;
    private DocumentVisibility mailDefaultVisibility = DocumentVisibility.TENANT;
    private boolean expandMailZipAttachments = true;

    @NotBlank
    private String tedPackageImportBatchId = "phase41-ted-package";

    private boolean gatewayOnlyForTedPackages = false;

    @NotBlank
    private String mailImportBatchId = "phase41-mail";
}

@ -1,52 +0,0 @@
package at.procon.dip.ingestion.config;

import jakarta.validation.constraints.Min;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

/**
 * NEW-runtime TED package download configuration.
 * <p>
 * This is intentionally separate from the legacy {@code ted.download.*} tree.
 */
@Configuration
@ConfigurationProperties(prefix = "dip.ingestion.ted-download")
@Data
public class TedPackageDownloadProperties {

    private boolean enabled = false;

    @NotBlank
    private String baseUrl = "https://ted.europa.eu/packages/daily/";

    @NotBlank
    private String downloadDirectory = "/ted.europe/downloads-new";

    @Positive
    private int startYear = 2015;

    @Positive
    private long pollInterval = 3_600_000L;

    @Positive
    private long notFoundRetryInterval = 21_600_000L;

    @Min(0)
    private int previousYearGracePeriodDays = 30;

    private boolean retryCurrentYearNotFoundIndefinitely = true;

    @Positive
    private long downloadTimeout = 300_000L;

    @Positive
    private int maxRunningPackages = 2;

    @Positive
    private long delayBetweenDownloads = 5_000L;

    private boolean deleteAfterIngestion = true;
}
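With Spring Boot's relaxed binding, these camelCase fields bind to kebab-case keys under `dip.ingestion.ted-download`. A hypothetical override block (the values shown are illustrative, not recommendations from this patch):

```yaml
dip:
  ingestion:
    ted-download:
      enabled: true
      download-directory: /ted.europe/downloads-new
      poll-interval: 1800000        # binds to pollInterval (ms)
      download-timeout: 300000      # binds to downloadTimeout (ms)
      max-running-packages: 2
      delay-between-downloads: 5000
      delete-after-ingestion: true
```

Note the Camel routes above read the same keys directly via property placeholders such as `{{dip.ingestion.ted-download.poll-interval:3600000}}`, so route defaults and these field defaults should be kept in sync.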

@ -1,47 +0,0 @@
package at.procon.dip.migration.audit.config;

import jakarta.validation.constraints.Min;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.migration.legacy-audit")
@Data
public class LegacyTedAuditProperties {

    /**
     * Enables the Wave 1 / Milestone A legacy TED audit subsystem.
     */
    private boolean enabled = true;

    /**
     * Automatically runs the read-only audit on application startup.
     */
    private boolean startupRunEnabled = false;

    /**
     * Maximum number of legacy TED documents to scan during startup.
     * 0 means no limit.
     */
    @Min(0)
    private int startupRunLimit = 500;

    /**
     * Batch size for legacy TED document paging.
     */
    @Min(1)
    private int pageSize = 100;

    /**
     * Hard cap for persisted findings in a single run to avoid runaway audit volume.
     */
    @Min(1)
    private int maxFindingsPerRun = 10000;

    /**
     * Maximum number of duplicate/grouped samples recorded for global aggregate checks.
     */
    @Min(1)
    private int maxDuplicateSamples = 100;
}

@ -1,87 +0,0 @@
package at.procon.dip.migration.audit.entity;

import at.procon.dip.architecture.SchemaNames;
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.EnumType;
import jakarta.persistence.Enumerated;
import jakarta.persistence.FetchType;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Index;
import jakarta.persistence.JoinColumn;
import jakarta.persistence.ManyToOne;
import jakarta.persistence.PrePersist;
import jakarta.persistence.Table;
import java.time.OffsetDateTime;
import java.util.UUID;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;

@Entity
@Table(schema = SchemaNames.DOC, name = "doc_legacy_audit_finding", indexes = {
        @Index(name = "idx_doc_legacy_audit_find_run", columnList = "run_id"),
        @Index(name = "idx_doc_legacy_audit_find_type", columnList = "finding_type"),
        @Index(name = "idx_doc_legacy_audit_find_severity", columnList = "severity"),
        @Index(name = "idx_doc_legacy_audit_find_legacy_doc", columnList = "legacy_procurement_document_id"),
        @Index(name = "idx_doc_legacy_audit_find_document", columnList = "document_id")
})
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class LegacyTedAuditFinding {

    @Id
    @GeneratedValue(strategy = GenerationType.UUID)
    private UUID id;

    @ManyToOne(fetch = FetchType.LAZY, optional = false)
    @JoinColumn(name = "run_id", nullable = false)
    private LegacyTedAuditRun run;

    @Enumerated(EnumType.STRING)
    @Column(name = "severity", nullable = false, length = 16)
    private LegacyTedAuditSeverity severity;

    @Enumerated(EnumType.STRING)
    @Column(name = "finding_type", nullable = false, length = 64)
    private LegacyTedAuditFindingType findingType;

    @Column(name = "package_identifier", length = 20)
    private String packageIdentifier;

    @Column(name = "legacy_procurement_document_id")
    private UUID legacyProcurementDocumentId;

    @Column(name = "document_id")
    private UUID documentId;

    @Column(name = "ted_notice_projection_id")
    private UUID tedNoticeProjectionId;

    @Column(name = "reference_key", length = 255)
    private String referenceKey;

    @Column(name = "message", nullable = false, columnDefinition = "TEXT")
    private String message;

    @Column(name = "details_text", columnDefinition = "TEXT")
    private String detailsText;

    @Builder.Default
    @Column(name = "created_at", nullable = false, updatable = false)
    private OffsetDateTime createdAt = OffsetDateTime.now();

    @PrePersist
    protected void onCreate() {
        if (createdAt == null) {
            createdAt = OffsetDateTime.now();
        }
    }
}

@ -1,28 +0,0 @@
package at.procon.dip.migration.audit.entity;

public enum LegacyTedAuditFindingType {
    PACKAGE_SEQUENCE_GAP,
    PACKAGE_INCOMPLETE,
    PACKAGE_COMPLETED_WITHOUT_PROCESSED_AT,
    PACKAGE_COMPLETED_COUNT_MISMATCH,
    PACKAGE_MISSING_XML_FILE_COUNT,
    PACKAGE_MISSING_FILE_HASH,
    PACKAGE_FAILED_WITHOUT_ERROR_MESSAGE,
    LEGACY_PUBLICATION_ID_DUPLICATE,
    DOC_DEDUP_HASH_DUPLICATE,
    LEGACY_DOCUMENT_MISSING_HASH,
    LEGACY_DOCUMENT_MISSING_XML,
    LEGACY_DOCUMENT_MISSING_TEXT,
    LEGACY_DOCUMENT_MISSING_PUBLICATION_ID,
    DOC_DOCUMENT_MISSING,
    DOC_DOCUMENT_DUPLICATE,
    DOC_SOURCE_MISSING,
    DOC_ORIGINAL_CONTENT_MISSING,
    DOC_ORIGINAL_CONTENT_DUPLICATE,
    DOC_PRIMARY_REPRESENTATION_MISSING,
    DOC_PRIMARY_REPRESENTATION_DUPLICATE,
    TED_PROJECTION_MISSING,
    TED_PROJECTION_MISSING_LEGACY_LINK,
    TED_PROJECTION_DOCUMENT_MISMATCH,
    FINDINGS_TRUNCATED
}

@ -1,110 +0,0 @@
package at.procon.dip.migration.audit.entity;

import at.procon.dip.architecture.SchemaNames;
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.EnumType;
import jakarta.persistence.Enumerated;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Index;
import jakarta.persistence.PrePersist;
import jakarta.persistence.PreUpdate;
import jakarta.persistence.Table;
import java.time.OffsetDateTime;
import java.util.UUID;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;

@Entity
@Table(schema = SchemaNames.DOC, name = "doc_legacy_audit_run", indexes = {
        @Index(name = "idx_doc_legacy_audit_run_status", columnList = "status"),
        @Index(name = "idx_doc_legacy_audit_run_started", columnList = "started_at")
})
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class LegacyTedAuditRun {

    @Id
    @GeneratedValue(strategy = GenerationType.UUID)
    private UUID id;

    @Enumerated(EnumType.STRING)
    @Column(name = "status", nullable = false, length = 32)
    private LegacyTedAuditRunStatus status;

    @Column(name = "requested_limit")
    private Integer requestedLimit;

    @Column(name = "page_size", nullable = false)
    private Integer pageSize;

    @Column(name = "scanned_packages", nullable = false)
    @Builder.Default
    private Integer scannedPackages = 0;

    @Column(name = "scanned_legacy_documents", nullable = false)
    @Builder.Default
    private Integer scannedLegacyDocuments = 0;

    @Column(name = "finding_count", nullable = false)
    @Builder.Default
    private Integer findingCount = 0;

    @Column(name = "info_count", nullable = false)
    @Builder.Default
    private Integer infoCount = 0;

    @Column(name = "warning_count", nullable = false)
    @Builder.Default
    private Integer warningCount = 0;

    @Column(name = "error_count", nullable = false)
    @Builder.Default
    private Integer errorCount = 0;

    @Column(name = "started_at", nullable = false)
    private OffsetDateTime startedAt;

    @Column(name = "completed_at")
    private OffsetDateTime completedAt;

    @Column(name = "summary_text", columnDefinition = "TEXT")
    private String summaryText;

    @Column(name = "failure_message", columnDefinition = "TEXT")
    private String failureMessage;

    @Builder.Default
    @Column(name = "created_at", nullable = false, updatable = false)
    private OffsetDateTime createdAt = OffsetDateTime.now();

    @Builder.Default
    @Column(name = "updated_at", nullable = false)
    private OffsetDateTime updatedAt = OffsetDateTime.now();

    @PrePersist
    protected void onCreate() {
        if (startedAt == null) {
            startedAt = OffsetDateTime.now();
        }
        if (createdAt == null) {
            createdAt = OffsetDateTime.now();
        }
        if (updatedAt == null) {
            updatedAt = OffsetDateTime.now();
        }
    }

    @PreUpdate
    protected void onUpdate() {
        updatedAt = OffsetDateTime.now();
    }
}

@ -1,7 +0,0 @@
package at.procon.dip.migration.audit.entity;

public enum LegacyTedAuditRunStatus {
    RUNNING,
    COMPLETED,
    FAILED
}

@ -1,7 +0,0 @@
package at.procon.dip.migration.audit.entity;

public enum LegacyTedAuditSeverity {
    INFO,
    WARNING,
    ERROR
}

@ -1,8 +0,0 @@
package at.procon.dip.migration.audit.repository;

import at.procon.dip.migration.audit.entity.LegacyTedAuditFinding;
import java.util.UUID;
import org.springframework.data.jpa.repository.JpaRepository;

public interface LegacyTedAuditFindingRepository extends JpaRepository<LegacyTedAuditFinding, UUID> {
}

@ -1,8 +0,0 @@
package at.procon.dip.migration.audit.repository;

import at.procon.dip.migration.audit.entity.LegacyTedAuditRun;
import java.util.UUID;
import org.springframework.data.jpa.repository.JpaRepository;

public interface LegacyTedAuditRunRepository extends JpaRepository<LegacyTedAuditRun, UUID> {
}

@ -1,610 +0,0 @@
package at.procon.dip.migration.audit.service;

import at.procon.dip.migration.audit.config.LegacyTedAuditProperties;
import at.procon.dip.migration.audit.entity.LegacyTedAuditFinding;
import at.procon.dip.migration.audit.entity.LegacyTedAuditFindingType;
import at.procon.dip.migration.audit.entity.LegacyTedAuditRun;
import at.procon.dip.migration.audit.entity.LegacyTedAuditRunStatus;
import at.procon.dip.migration.audit.entity.LegacyTedAuditSeverity;
import at.procon.dip.migration.audit.repository.LegacyTedAuditFindingRepository;
import at.procon.dip.migration.audit.repository.LegacyTedAuditRunRepository;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import at.procon.ted.model.entity.ProcurementDocument;
import at.procon.ted.model.entity.TedDailyPackage;
import at.procon.ted.repository.ProcurementDocumentRepository;
import at.procon.ted.repository.TedDailyPackageRepository;
import java.time.OffsetDateTime;
import java.time.Year;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Sort;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.util.StringUtils;

@Service
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@RequiredArgsConstructor
@Slf4j
public class LegacyTedAuditService {

    private final LegacyTedAuditProperties properties;
    private final TedDailyPackageRepository tedDailyPackageRepository;
    private final ProcurementDocumentRepository procurementDocumentRepository;
    private final LegacyTedAuditRunRepository runRepository;
    private final LegacyTedAuditFindingRepository findingRepository;
    private final JdbcTemplate jdbcTemplate;

    public LegacyTedAuditRun executeAudit() {
        return executeAudit(properties.getStartupRunLimit());
    }

    public LegacyTedAuditRun executeAudit(int requestedLimit) {
        if (!properties.isEnabled()) {
            throw new IllegalStateException("Legacy TED audit is disabled by configuration");
        }

        Integer effectiveLimit = requestedLimit > 0 ? requestedLimit : null;
        int pageSize = properties.getPageSize();
        AuditAccumulator accumulator = new AuditAccumulator();

        LegacyTedAuditRun run = LegacyTedAuditRun.builder()
                .status(LegacyTedAuditRunStatus.RUNNING)
                .requestedLimit(effectiveLimit)
                .pageSize(pageSize)
                .startedAt(OffsetDateTime.now())
                .build();
        run = runRepository.save(run);

        try {
            int scannedPackages = auditPackages(run, accumulator);
            auditGlobalDuplicates(run, accumulator);
            int scannedLegacyDocuments = 0; // auditLegacyDocuments(run, accumulator, effectiveLimit, pageSize);

            run.setStatus(LegacyTedAuditRunStatus.COMPLETED);
            run.setCompletedAt(OffsetDateTime.now());
            run.setScannedPackages(scannedPackages);
            run.setScannedLegacyDocuments(scannedLegacyDocuments);
            run.setFindingCount(accumulator.totalFindings());
            run.setInfoCount(accumulator.infoCount());
            run.setWarningCount(accumulator.warningCount());
            run.setErrorCount(accumulator.errorCount());
            run.setSummaryText(buildSummary(scannedPackages, scannedLegacyDocuments, accumulator));
            run.setFailureMessage(null);
            run = runRepository.save(run);

            log.info("Wave 1 / Milestone A legacy-only audit completed: runId={}, packages={}, documents={}, findings={}, warnings={}, errors={}",
                    run.getId(), scannedPackages, scannedLegacyDocuments, accumulator.totalFindings(),
                    accumulator.warningCount(), accumulator.errorCount());
            return run;
        } catch (RuntimeException ex) {
            run.setStatus(LegacyTedAuditRunStatus.FAILED);
            run.setCompletedAt(OffsetDateTime.now());
            run.setScannedPackages(accumulator.scannedPackages());
            run.setScannedLegacyDocuments(accumulator.scannedLegacyDocuments());
            run.setFindingCount(accumulator.totalFindings());
            run.setInfoCount(accumulator.infoCount());
            run.setWarningCount(accumulator.warningCount());
            run.setErrorCount(accumulator.errorCount());
            run.setFailureMessage(ex.getMessage());
            run.setSummaryText(buildSummary(accumulator.scannedPackages(), accumulator.scannedLegacyDocuments(), accumulator));
            runRepository.save(run);
            log.error("Wave 1 / Milestone A legacy-only audit failed: runId={}", run.getId(), ex);
            throw ex;
        }
    }

    private int auditPackages(LegacyTedAuditRun run, AuditAccumulator accumulator) {
        List<TedDailyPackage> packages = tedDailyPackageRepository.findAll(Sort.by(Sort.Direction.ASC, "year", "serialNumber"));
        if (packages.isEmpty()) {
            return 0;
        }

        Map<Integer, List<TedDailyPackage>> packagesByYear = new TreeMap<>();
        for (TedDailyPackage dailyPackage : packages) {
            packagesByYear.computeIfAbsent(dailyPackage.getYear(), ignored -> new ArrayList<>()).add(dailyPackage);
        }

        int firstYear = packagesByYear.keySet().iterator().next();
        int currentYear = Year.now().getValue();

        for (int year = firstYear; year <= currentYear; year++) {
            List<TedDailyPackage> yearPackages = packagesByYear.get(year);
            if (yearPackages == null || yearPackages.isEmpty()) {
                recordFinding(run, accumulator,
                        LegacyTedAuditSeverity.WARNING,
                        LegacyTedAuditFindingType.PACKAGE_SEQUENCE_GAP,
null,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
"year:" + year,
|
|
||||||
"No TED package rows exist for this year inside the audited interval",
|
|
||||||
"year=" + year + ", intervalStartYear=" + firstYear + ", intervalEndYear=" + currentYear);
|
|
||||||
continue;
|
|
||||||
}
|
|
||||||
|
|
||||||
auditYearPackageSequence(run, accumulator, year, yearPackages);
|
|
||||||
|
|
||||||
for (TedDailyPackage dailyPackage : yearPackages) {
|
|
||||||
accumulator.incrementScannedPackages();
|
|
||||||
auditSinglePackage(run, accumulator, dailyPackage);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
return packages.size();
|
|
||||||
}
|
|
||||||
|
|
||||||
private void auditYearPackageSequence(LegacyTedAuditRun run,
|
|
||||||
AuditAccumulator accumulator,
|
|
||||||
int year,
|
|
||||||
List<TedDailyPackage> yearPackages) {
|
|
||||||
yearPackages.sort((left, right) -> Integer.compare(safeInt(left.getSerialNumber()), safeInt(right.getSerialNumber())));
|
|
||||||
|
|
||||||
int firstSerial = safeInt(yearPackages.getFirst().getSerialNumber());
|
|
||||||
if (firstSerial > 1) {
|
|
||||||
recordMissingPackageRange(run, accumulator, year, 1, firstSerial - 1,
|
|
||||||
"TED package year starts after serial 1");
|
|
||||||
}
|
|
||||||
|
|
||||||
for (int i = 1; i < yearPackages.size(); i++) {
|
|
||||||
int previousSerial = safeInt(yearPackages.get(i - 1).getSerialNumber());
|
|
||||||
int currentSerial = safeInt(yearPackages.get(i).getSerialNumber());
|
|
||||||
if (currentSerial > previousSerial + 1) {
|
|
||||||
recordMissingPackageRange(run, accumulator, year, previousSerial + 1, currentSerial - 1,
|
|
||||||
"TED package sequence gap detected");
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
private void recordMissingPackageRange(LegacyTedAuditRun run,
|
|
||||||
AuditAccumulator accumulator,
|
|
||||||
int year,
|
|
||||||
int startSerial,
|
|
||||||
int endSerial,
|
|
||||||
String message) {
|
|
||||||
String startPackageId = formatPackageIdentifier(year, startSerial);
|
|
||||||
String endPackageId = formatPackageIdentifier(year, endSerial);
|
|
||||||
String referenceKey = startSerial == endSerial ? startPackageId : startPackageId + "-" + endPackageId;
|
|
||||||
|
|
||||||
recordFinding(run, accumulator,
|
|
||||||
LegacyTedAuditSeverity.WARNING,
|
|
||||||
LegacyTedAuditFindingType.PACKAGE_SEQUENCE_GAP,
|
|
||||||
startSerial == endSerial ? startPackageId : null,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
referenceKey,
|
|
||||||
message,
|
|
||||||
"year=" + year + ", missingStartSerial=" + startSerial + ", missingEndSerial=" + endSerial);
|
|
||||||
}
|
|
||||||
|
|
||||||
private void auditSinglePackage(LegacyTedAuditRun run,
|
|
||||||
AuditAccumulator accumulator,
|
|
||||||
TedDailyPackage dailyPackage) {
|
|
||||||
String packageIdentifier = dailyPackage.getPackageIdentifier();
|
|
||||||
int processedCount = safeInt(dailyPackage.getProcessedCount());
|
|
||||||
int failedCount = safeInt(dailyPackage.getFailedCount());
|
|
||||||
int accountedDocuments = processedCount + failedCount;
|
|
||||||
|
|
||||||
if (dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.COMPLETED
|
|
||||||
&& dailyPackage.getProcessedAt() == null) {
|
|
||||||
recordFinding(run, accumulator,
|
|
||||||
LegacyTedAuditSeverity.WARNING,
|
|
||||||
LegacyTedAuditFindingType.PACKAGE_COMPLETED_WITHOUT_PROCESSED_AT,
|
|
||||||
packageIdentifier,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
packageIdentifier,
|
|
||||||
"TED package is marked COMPLETED but processedAt is null",
|
|
||||||
null);
|
|
||||||
}
|
|
||||||
|
|
||||||
if (dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.COMPLETED
|
|
||||||
&& dailyPackage.getXmlFileCount() == null) {
|
|
||||||
recordFinding(run, accumulator,
|
|
||||||
LegacyTedAuditSeverity.WARNING,
|
|
||||||
LegacyTedAuditFindingType.PACKAGE_MISSING_XML_FILE_COUNT,
|
|
||||||
packageIdentifier,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
packageIdentifier,
|
|
||||||
"TED package is marked COMPLETED but xmlFileCount is null",
|
|
||||||
null);
|
|
||||||
}
|
|
||||||
|
|
||||||
if ((dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.DOWNLOADED
|
|
||||||
|| dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.PROCESSING
|
|
||||||
|| dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.COMPLETED)
|
|
||||||
&& !StringUtils.hasText(dailyPackage.getFileHash())) {
|
|
||||||
recordFinding(run, accumulator,
|
|
||||||
LegacyTedAuditSeverity.WARNING,
|
|
||||||
LegacyTedAuditFindingType.PACKAGE_MISSING_FILE_HASH,
|
|
||||||
packageIdentifier,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
packageIdentifier,
|
|
||||||
"TED package has no file hash recorded",
|
|
||||||
"downloadStatus=" + dailyPackage.getDownloadStatus());
|
|
||||||
}
|
|
||||||
|
|
||||||
if (dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.FAILED
|
|
||||||
&& !StringUtils.hasText(dailyPackage.getErrorMessage())) {
|
|
||||||
recordFinding(run, accumulator,
|
|
||||||
LegacyTedAuditSeverity.WARNING,
|
|
||||||
LegacyTedAuditFindingType.PACKAGE_FAILED_WITHOUT_ERROR_MESSAGE,
|
|
||||||
packageIdentifier,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
packageIdentifier,
|
|
||||||
"TED package is marked FAILED but has no error message",
|
|
||||||
null);
|
|
||||||
}
|
|
||||||
|
|
||||||
if (dailyPackage.getXmlFileCount() != null) {
|
|
||||||
if (accountedDocuments > dailyPackage.getXmlFileCount()) {
|
|
||||||
recordFinding(run, accumulator,
|
|
||||||
LegacyTedAuditSeverity.ERROR,
|
|
||||||
LegacyTedAuditFindingType.PACKAGE_COMPLETED_COUNT_MISMATCH,
|
|
||||||
packageIdentifier,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
packageIdentifier,
|
|
||||||
"TED package accounting exceeds xmlFileCount",
|
|
||||||
"xmlFileCount=" + dailyPackage.getXmlFileCount()
|
|
||||||
+ ", processedCount=" + processedCount
|
|
||||||
+ ", failedCount=" + failedCount);
|
|
||||||
} else if (dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.COMPLETED
|
|
||||||
&& accountedDocuments < dailyPackage.getXmlFileCount()) {
|
|
||||||
recordFinding(run, accumulator,
|
|
||||||
LegacyTedAuditSeverity.WARNING,
|
|
||||||
LegacyTedAuditFindingType.PACKAGE_COMPLETED_COUNT_MISMATCH,
|
|
||||||
packageIdentifier,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
packageIdentifier,
|
|
||||||
"TED package accounting is below xmlFileCount",
|
|
||||||
"xmlFileCount=" + dailyPackage.getXmlFileCount()
|
|
||||||
+ ", processedCount=" + processedCount
|
|
||||||
+ ", failedCount=" + failedCount);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
if (isPackageIncompleteForReimport(dailyPackage, processedCount, failedCount, accountedDocuments)) {
|
|
||||||
recordFinding(run, accumulator,
|
|
||||||
dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.FAILED
|
|
||||||
? LegacyTedAuditSeverity.ERROR
|
|
||||||
: LegacyTedAuditSeverity.WARNING,
|
|
||||||
LegacyTedAuditFindingType.PACKAGE_INCOMPLETE,
|
|
||||||
packageIdentifier,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
packageIdentifier,
|
|
||||||
"TED package is not fully imported and should be considered for re-import",
|
|
||||||
buildIncompletePackageDetails(dailyPackage, processedCount, failedCount, accountedDocuments));
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
private boolean isPackageIncompleteForReimport(TedDailyPackage dailyPackage,
|
|
||||||
int processedCount,
|
|
||||||
int failedCount,
|
|
||||||
int accountedDocuments) {
|
|
||||||
TedDailyPackage.DownloadStatus status = dailyPackage.getDownloadStatus();
|
|
||||||
if (status == null) {
|
|
||||||
return true;
|
|
||||||
}
|
|
||||||
if (status == TedDailyPackage.DownloadStatus.NOT_FOUND) {
|
|
||||||
return false;
|
|
||||||
}
|
|
||||||
if (status == TedDailyPackage.DownloadStatus.PENDING
|
|
||||||
|| status == TedDailyPackage.DownloadStatus.DOWNLOADING
|
|
||||||
|| status == TedDailyPackage.DownloadStatus.DOWNLOADED
|
|
||||||
|| status == TedDailyPackage.DownloadStatus.PROCESSING
|
|
||||||
|| status == TedDailyPackage.DownloadStatus.FAILED) {
|
|
||||||
return true;
|
|
||||||
}
|
|
||||||
if (status != TedDailyPackage.DownloadStatus.COMPLETED) {
|
|
||||||
return true;
|
|
||||||
}
|
|
||||||
if (dailyPackage.getXmlFileCount() == null) {
|
|
||||||
return true;
|
|
||||||
}
|
|
||||||
if (failedCount > 0) {
|
|
||||||
return true;
|
|
||||||
}
|
|
||||||
return processedCount < dailyPackage.getXmlFileCount()
|
|
||||||
|| accountedDocuments != dailyPackage.getXmlFileCount();
|
|
||||||
}
|
|
||||||
|
|
||||||
private String buildIncompletePackageDetails(TedDailyPackage dailyPackage,
|
|
||||||
int processedCount,
|
|
||||||
int failedCount,
|
|
||||||
int accountedDocuments) {
|
|
||||||
return "status=" + dailyPackage.getDownloadStatus()
|
|
||||||
+ ", xmlFileCount=" + dailyPackage.getXmlFileCount()
|
|
||||||
+ ", processedCount=" + processedCount
|
|
||||||
+ ", failedCount=" + failedCount
|
|
||||||
+ ", accountedDocuments=" + accountedDocuments;
|
|
||||||
}
|
|
||||||
|
|
||||||
private void auditGlobalDuplicates(LegacyTedAuditRun run, AuditAccumulator accumulator) {
|
|
||||||
int limit = properties.getMaxDuplicateSamples();
|
|
||||||
|
|
||||||
jdbcTemplate.query(
|
|
||||||
"""
|
|
||||||
SELECT publication_id, COUNT(*) AS duplicate_count
|
|
||||||
FROM ted.procurement_document
|
|
||||||
WHERE publication_id IS NOT NULL AND publication_id <> ''
|
|
||||||
GROUP BY publication_id
|
|
||||||
HAVING COUNT(*) > 1
|
|
||||||
ORDER BY duplicate_count DESC, publication_id ASC
|
|
||||||
LIMIT ?
|
|
||||||
""",
|
|
||||||
ps -> ps.setInt(1, limit),
|
|
||||||
(rs, rowNum) -> {
|
|
||||||
String publicationId = rs.getString("publication_id");
|
|
||||||
long duplicateCount = rs.getLong("duplicate_count");
|
|
||||||
recordFinding(run, accumulator,
|
|
||||||
LegacyTedAuditSeverity.ERROR,
|
|
||||||
LegacyTedAuditFindingType.LEGACY_PUBLICATION_ID_DUPLICATE,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
publicationId,
|
|
||||||
"Legacy TED publicationId appears multiple times",
|
|
||||||
"publicationId=" + publicationId + ", duplicateCount=" + duplicateCount);
|
|
||||||
return null;
|
|
||||||
});
|
|
||||||
}
|
|
||||||
|
|
||||||
private int auditLegacyDocuments(LegacyTedAuditRun run,
|
|
||||||
AuditAccumulator accumulator,
|
|
||||||
Integer requestedLimit,
|
|
||||||
int pageSize) {
|
|
||||||
int processed = 0;
|
|
||||||
int pageNumber = 0;
|
|
||||||
|
|
||||||
while (requestedLimit == null || processed < requestedLimit) {
|
|
||||||
Page<ProcurementDocument> page = procurementDocumentRepository.findAll(
|
|
||||||
PageRequest.of(pageNumber, pageSize, Sort.by(Sort.Direction.ASC, "createdAt", "id")));
|
|
||||||
|
|
||||||
if (page.isEmpty()) {
|
|
||||||
break;
|
|
||||||
}
|
|
||||||
|
|
||||||
for (ProcurementDocument legacyDocument : page.getContent()) {
|
|
||||||
auditSingleLegacyDocument(run, accumulator, legacyDocument);
|
|
||||||
accumulator.incrementScannedLegacyDocuments();
|
|
||||||
processed++;
|
|
||||||
if (requestedLimit != null && processed >= requestedLimit) {
|
|
||||||
return processed;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
if (!page.hasNext()) {
|
|
||||||
break;
|
|
||||||
}
|
|
||||||
pageNumber++;
|
|
||||||
}
|
|
||||||
|
|
||||||
return processed;
|
|
||||||
}
|
|
||||||
|
|
||||||
private void auditSingleLegacyDocument(LegacyTedAuditRun run,
|
|
||||||
AuditAccumulator accumulator,
|
|
||||||
ProcurementDocument legacyDocument) {
|
|
||||||
UUID legacyDocumentId = legacyDocument.getId();
|
|
||||||
String referenceKey = buildReferenceKey(legacyDocument);
|
|
||||||
String documentHash = legacyDocument.getDocumentHash();
|
|
||||||
|
|
||||||
if (!StringUtils.hasText(documentHash)) {
|
|
||||||
recordFinding(run, accumulator,
|
|
||||||
LegacyTedAuditSeverity.ERROR,
|
|
||||||
LegacyTedAuditFindingType.LEGACY_DOCUMENT_MISSING_HASH,
|
|
||||||
null,
|
|
||||||
legacyDocumentId,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
referenceKey,
|
|
||||||
"Legacy TED document has no documentHash",
|
|
||||||
null);
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
|
|
||||||
if (!StringUtils.hasText(legacyDocument.getXmlDocument())) {
|
|
||||||
recordFinding(run, accumulator,
|
|
||||||
LegacyTedAuditSeverity.ERROR,
|
|
||||||
LegacyTedAuditFindingType.LEGACY_DOCUMENT_MISSING_XML,
|
|
||||||
null,
|
|
||||||
legacyDocumentId,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
referenceKey,
|
|
||||||
"Legacy TED document has no xmlDocument payload",
|
|
||||||
"documentHash=" + documentHash);
|
|
||||||
}
|
|
||||||
|
|
||||||
if (!StringUtils.hasText(legacyDocument.getTextContent())) {
|
|
||||||
recordFinding(run, accumulator,
|
|
||||||
LegacyTedAuditSeverity.WARNING,
|
|
||||||
LegacyTedAuditFindingType.LEGACY_DOCUMENT_MISSING_TEXT,
|
|
||||||
null,
|
|
||||||
legacyDocumentId,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
referenceKey,
|
|
||||||
"Legacy TED document has no normalized textContent",
|
|
||||||
"documentHash=" + documentHash);
|
|
||||||
}
|
|
||||||
|
|
||||||
if (!StringUtils.hasText(legacyDocument.getPublicationId())) {
|
|
||||||
recordFinding(run, accumulator,
|
|
||||||
LegacyTedAuditSeverity.WARNING,
|
|
||||||
LegacyTedAuditFindingType.LEGACY_DOCUMENT_MISSING_PUBLICATION_ID,
|
|
||||||
null,
|
|
||||||
legacyDocumentId,
|
|
||||||
null,
|
|
||||||
null,
|
|
||||||
referenceKey,
|
|
||||||
"Legacy TED document has no publicationId",
|
|
||||||
"documentHash=" + documentHash);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
private void recordFinding(LegacyTedAuditRun run,
|
|
||||||
AuditAccumulator accumulator,
|
|
||||||
LegacyTedAuditSeverity severity,
|
|
||||||
LegacyTedAuditFindingType findingType,
|
|
||||||
String packageIdentifier,
|
|
||||||
UUID legacyProcurementDocumentId,
|
|
||||||
UUID genericDocumentId,
|
|
||||||
UUID tedProjectionId,
|
|
||||||
String referenceKey,
|
|
||||||
String message,
|
|
||||||
String detailsText) {
|
|
||||||
if (accumulator.totalFindings() >= properties.getMaxFindingsPerRun()) {
|
|
||||||
accumulator.markTruncated();
|
|
||||||
if (!accumulator.truncationRecorded()) {
|
|
||||||
LegacyTedAuditFinding truncatedFinding = LegacyTedAuditFinding.builder()
|
|
||||||
.run(run)
|
|
||||||
.severity(LegacyTedAuditSeverity.INFO)
|
|
||||||
.findingType(LegacyTedAuditFindingType.FINDINGS_TRUNCATED)
|
|
||||||
.referenceKey(referenceKey != null ? referenceKey : "max-findings-per-run")
|
|
||||||
.message("Legacy TED audit finding limit reached; additional findings were suppressed")
|
|
||||||
.detailsText("maxFindingsPerRun=" + properties.getMaxFindingsPerRun())
|
|
||||||
.build();
|
|
||||||
findingRepository.save(truncatedFinding);
|
|
||||||
accumulator.recordFinding(LegacyTedAuditSeverity.INFO, true);
|
|
||||||
}
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
|
|
||||||
LegacyTedAuditFinding finding = LegacyTedAuditFinding.builder()
|
|
||||||
.run(run)
|
|
||||||
.severity(severity)
|
|
||||||
.findingType(findingType)
|
|
||||||
.packageIdentifier(packageIdentifier)
|
|
||||||
.legacyProcurementDocumentId(legacyProcurementDocumentId)
|
|
||||||
.documentId(genericDocumentId)
|
|
||||||
.tedNoticeProjectionId(tedProjectionId)
|
|
||||||
.referenceKey(referenceKey)
|
|
||||||
.message(message)
|
|
||||||
.detailsText(detailsText)
|
|
||||||
.build();
|
|
||||||
findingRepository.save(finding);
|
|
||||||
accumulator.recordFinding(severity, false);
|
|
||||||
}
|
|
||||||
|
|
||||||
private String buildReferenceKey(ProcurementDocument legacyDocument) {
|
|
||||||
if (StringUtils.hasText(legacyDocument.getPublicationId())) {
|
|
||||||
return legacyDocument.getPublicationId();
|
|
||||||
}
|
|
||||||
if (StringUtils.hasText(legacyDocument.getNoticeId())) {
|
|
||||||
return legacyDocument.getNoticeId();
|
|
||||||
}
|
|
||||||
if (StringUtils.hasText(legacyDocument.getSourceFilename())) {
|
|
||||||
return legacyDocument.getSourceFilename();
|
|
||||||
}
|
|
||||||
return String.valueOf(legacyDocument.getId());
|
|
||||||
}
|
|
||||||
|
|
||||||
private int safeInt(Integer value) {
|
|
||||||
return value != null ? value : 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
private String formatPackageIdentifier(int year, int serialNumber) {
|
|
||||||
return "%04d%05d".formatted(year, serialNumber);
|
|
||||||
}
|
|
||||||
|
|
||||||
private String buildSummary(int scannedPackages,
|
|
||||||
int scannedLegacyDocuments,
|
|
||||||
AuditAccumulator accumulator) {
|
|
||||||
return "packages=" + scannedPackages
|
|
||||||
+ ", legacyDocuments=" + scannedLegacyDocuments
|
|
||||||
+ ", findings=" + accumulator.totalFindings()
|
|
||||||
+ ", warnings=" + accumulator.warningCount()
|
|
||||||
+ ", errors=" + accumulator.errorCount()
|
|
||||||
+ (accumulator.truncated() ? ", truncated=true" : "");
|
|
||||||
}
|
|
||||||
|
|
||||||
private static final class AuditAccumulator {
|
|
||||||
private int scannedPackages;
|
|
||||||
private int scannedLegacyDocuments;
|
|
||||||
private int infoCount;
|
|
||||||
private int warningCount;
|
|
||||||
private int errorCount;
|
|
||||||
private boolean truncated;
|
|
||||||
private boolean truncationRecorded;
|
|
||||||
|
|
||||||
void incrementScannedPackages() {
|
|
||||||
scannedPackages++;
|
|
||||||
}
|
|
||||||
|
|
||||||
void incrementScannedLegacyDocuments() {
|
|
||||||
scannedLegacyDocuments++;
|
|
||||||
}
|
|
||||||
|
|
||||||
void recordFinding(LegacyTedAuditSeverity severity, boolean truncationFindingRecordedNow) {
|
|
||||||
switch (severity) {
|
|
||||||
case INFO -> infoCount++;
|
|
||||||
case WARNING -> warningCount++;
|
|
||||||
case ERROR -> errorCount++;
|
|
||||||
}
|
|
||||||
if (truncationFindingRecordedNow) {
|
|
||||||
truncationRecorded = true;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
void markTruncated() {
|
|
||||||
truncated = true;
|
|
||||||
}
|
|
||||||
|
|
||||||
int totalFindings() {
|
|
||||||
return infoCount + warningCount + errorCount;
|
|
||||||
}
|
|
||||||
|
|
||||||
int infoCount() {
|
|
||||||
return infoCount;
|
|
||||||
}
|
|
||||||
|
|
||||||
int warningCount() {
|
|
||||||
return warningCount;
|
|
||||||
}
|
|
||||||
|
|
||||||
int errorCount() {
|
|
||||||
return errorCount;
|
|
||||||
}
|
|
||||||
|
|
||||||
int scannedPackages() {
|
|
||||||
return scannedPackages;
|
|
||||||
}
|
|
||||||
|
|
||||||
int scannedLegacyDocuments() {
|
|
||||||
return scannedLegacyDocuments;
|
|
||||||
}
|
|
||||||
|
|
||||||
boolean truncated() {
|
|
||||||
return truncated;
|
|
||||||
}
|
|
||||||
|
|
||||||
boolean truncationRecorded() {
|
|
||||||
return truncationRecorded;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
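The year-sequence audit above boils down to a scan over sorted serial numbers plus the `%04d%05d` identifier layout. A standalone sketch (the `PackageGapSketch` class, `Gap` record, and `findGaps` helper are illustrative names, not part of the service):

```java
import java.util.ArrayList;
import java.util.List;

public class PackageGapSketch {
    // Hypothetical record standing in for a missing (startSerial, endSerial) range.
    record Gap(int start, int end) {}

    // Mirrors auditYearPackageSequence: report ranges absent from a sorted serial list.
    static List<Gap> findGaps(List<Integer> sortedSerials) {
        List<Gap> gaps = new ArrayList<>();
        if (sortedSerials.isEmpty()) {
            return gaps;
        }
        if (sortedSerials.get(0) > 1) {
            gaps.add(new Gap(1, sortedSerials.get(0) - 1)); // year starts after serial 1
        }
        for (int i = 1; i < sortedSerials.size(); i++) {
            int previous = sortedSerials.get(i - 1);
            int current = sortedSerials.get(i);
            if (current > previous + 1) {
                gaps.add(new Gap(previous + 1, current - 1)); // interior sequence gap
            }
        }
        return gaps;
    }

    // Same layout as formatPackageIdentifier: 4-digit year followed by 5-digit serial.
    static String formatPackageIdentifier(int year, int serialNumber) {
        return "%04d%05d".formatted(year, serialNumber);
    }

    public static void main(String[] args) {
        // Serial 1 is missing and serials 4..6 are missing.
        System.out.println(findGaps(List.of(2, 3, 7)));         // [Gap[start=1, end=1], Gap[start=4, end=6]]
        System.out.println(formatPackageIdentifier(2024, 42));  // 202400042
    }
}
```

Each gap becomes one PACKAGE_SEQUENCE_GAP finding in the service, with the single-serial case collapsing the reference key to one package identifier.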
@ -1,33 +0,0 @@
package at.procon.dip.migration.audit.startup;

import at.procon.dip.migration.audit.config.LegacyTedAuditProperties;
import at.procon.dip.migration.audit.service.LegacyTedAuditService;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.ApplicationArguments;
import org.springframework.boot.ApplicationRunner;
import org.springframework.stereotype.Component;

@Component
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@RequiredArgsConstructor
@Slf4j
public class LegacyTedAuditStartupRunner implements ApplicationRunner {

    private final LegacyTedAuditProperties properties;
    private final LegacyTedAuditService legacyTedAuditService;

    @Override
    public void run(ApplicationArguments args) {
        if (!properties.isEnabled() || !properties.isStartupRunEnabled()) {
            return;
        }

        int requestedLimit = properties.getStartupRunLimit();
        log.info("Wave 1 / Milestone A startup audit enabled - scanning legacy TED data with limit {}",
                requestedLimit > 0 ? requestedLimit : "unbounded");
        legacyTedAuditService.executeAudit(requestedLimit);
    }
}
@ -1,81 +1,49 @@
 package at.procon.dip.search.config;
 
-import jakarta.validation.constraints.Min;
-import jakarta.validation.constraints.Positive;
 import lombok.Data;
 import org.springframework.boot.context.properties.ConfigurationProperties;
 import org.springframework.context.annotation.Configuration;
-import org.springframework.validation.annotation.Validated;
 
-/**
- * New-runtime generic search configuration.
- *
- * <p>This property tree is intentionally separated from the legacy
- * {@code ted.search.*} settings. NEW-mode search/semantic/lexical code should
- * depend on {@code dip.search.*} only.</p>
- */
 @Configuration
 @ConfigurationProperties(prefix = "dip.search")
 @Data
-@Validated
 public class DipSearchProperties {
 
-    /** Default page size for search results. */
-    @Positive
-    private int defaultPageSize = 20;
-
-    /** Maximum allowed page size. */
-    @Positive
-    private int maxPageSize = 100;
-
-    /** Semantic similarity threshold (normalized score). */
-    private double similarityThreshold = 0.7d;
-
-    /** Minimum trigram similarity for fuzzy lexical matches. */
-    private double trigramSimilarityThreshold = 0.12d;
-
-    /** Candidate limits per search engine before fusion/collapse. */
-    @Positive
-    private int fulltextCandidateLimit = 120;
-
-    @Positive
-    private int trigramCandidateLimit = 120;
-
-    @Positive
-    private int semanticCandidateLimit = 120;
-
-    /** Hybrid fusion weights. */
-    private double fulltextWeight = 0.35d;
-    private double trigramWeight = 0.20d;
-    private double semanticWeight = 0.45d;
-
-    /** Enable chunk representations for long documents. */
-    private boolean chunkingEnabled = true;
-
-    /** Target chunk size in characters for CHUNK representations. */
-    @Positive
-    private int chunkTargetChars = 1800;
-
-    /** Overlap between consecutive chunks in characters. */
-    @Min(0)
-    private int chunkOverlapChars = 200;
-
-    /** Maximum CHUNK representations generated per document. */
-    @Positive
-    private int maxChunksPerDocument = 12;
-
-    /** Additional score weight for recency. */
-    private double recencyBoostWeight = 0.05d;
-
-    /** Half-life in days used for recency decay. */
-    @Positive
-    private int recencyHalfLifeDays = 30;
-
-    /** Startup backfill limit for missing DOC lexical vectors. */
-    @Positive
-    private int startupLexicalBackfillLimit = 500;
-
-    /** Number of hits per engine returned by the debug endpoint. */
-    @Positive
-    private int debugTopHitsPerEngine = 10;
+    private Lexical lexical = new Lexical();
+    private Semantic semantic = new Semantic();
+    private Fusion fusion = new Fusion();
+    private Chunking chunking = new Chunking();
+
+    @Data
+    public static class Lexical {
+        private double trigramSimilarityThreshold = 0.12;
+        private int fulltextCandidateLimit = 120;
+        private int trigramCandidateLimit = 120;
+    }
+
+    @Data
+    public static class Semantic {
+        private double similarityThreshold = 0.7;
+        private int semanticCandidateLimit = 120;
+        private String defaultModelKey;
+    }
+
+    @Data
+    public static class Fusion {
+        private double fulltextWeight = 0.35;
+        private double trigramWeight = 0.20;
+        private double semanticWeight = 0.45;
+        private double recencyBoostWeight = 0.05;
+        private int recencyHalfLifeDays = 30;
+        private int debugTopHitsPerEngine = 10;
+    }
+
+    @Data
+    public static class Chunking {
+        private boolean enabled = true;
+        private int targetChars = 1800;
+        private int overlapChars = 200;
+        private int maxChunksPerDocument = 12;
+        private int startupLexicalBackfillLimit = 500;
+    }
 }
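The fusion weights and half-life above suggest a weighted sum with exponential recency decay. How the search service actually combines engine scores is not shown in this diff, so the following is only an illustration under that assumption (the `FusionSketch` class and `fuse` helper are hypothetical), using the default `dip.search` values:

```java
public class FusionSketch {
    // Illustrative only: weighted sum of normalized per-engine scores plus an
    // exponential half-life recency boost. Weights are the dip.search defaults
    // (fulltext 0.35, trigram 0.20, semantic 0.45, recency boost 0.05).
    static double fuse(double fulltext, double trigram, double semantic,
                       double ageDays, double halfLifeDays, double recencyBoostWeight) {
        // 1.0 for a brand-new document, halving every halfLifeDays.
        double recency = Math.pow(0.5, ageDays / halfLifeDays);
        return 0.35 * fulltext + 0.20 * trigram + 0.45 * semantic + recencyBoostWeight * recency;
    }

    public static void main(String[] args) {
        // A fresh document with perfect engine scores receives the full boost (1.05).
        System.out.println(fuse(1.0, 1.0, 1.0, 0, 30, 0.05));
        // After one half-life (30 days) the boost contributes half as much (1.025).
        System.out.println(fuse(1.0, 1.0, 1.0, 30, 30, 0.05));
    }
}
```

Because the three engine weights sum to 1.0, the recency term is a pure additive boost on top of an already normalized fused score.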
@ -0,0 +1,16 @@
package at.procon.ted.config;

import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

/**
 * Patch A scaffold for the legacy runtime configuration tree.
 *
 * The legacy runtime still uses {@link TedProcessorProperties} today. This class is
 * introduced so the old configuration can be moved gradually from {@code ted.*} to
 * {@code legacy.ted.*} without blocking the runtime split.
 */
@Configuration
@ConfigurationProperties(prefix = "legacy.ted")
public class LegacyTedProperties extends TedProcessorProperties {
}
@ -1,115 +0,0 @@
package at.procon.ted.config;

import jakarta.validation.constraints.Min;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;
import org.springframework.validation.annotation.Validated;

/**
 * Legacy vectorization configuration used only by the old runtime path.
 * <p>
 * This extracts the former ted.vectorization.* subtree away from TedProcessorProperties
 * so that legacy vectorization beans no longer depend on the shared monolithic config.
 */
@Configuration
@ConfigurationProperties(prefix = "legacy.ted.vectorization")
@Data
@Validated
public class LegacyVectorizationProperties {

    /**
     * Enable/disable legacy async vectorization.
     */
    private boolean enabled = true;

    /**
     * Use external HTTP API instead of Python subprocess.
     */
    private boolean useHttpApi = false;

    /**
     * Embedding service HTTP API URL.
     */
    private String apiUrl = "http://localhost:8001";

    /**
     * Sentence transformer model name.
     */
    private String modelName = "intfloat/multilingual-e5-large";

    /**
     * Vector dimensions (must match model output).
     */
    @Positive
    private int dimensions = 1024;

    /**
     * Batch size for vectorization processing.
     */
    @Min(1)
    private int batchSize = 16;

    /**
     * Thread pool size for async vectorization.
     */
    @Min(1)
    private int threadPoolSize = 4;

    /**
     * Maximum text length for vectorization (characters).
     */
    @Positive
    private int maxTextLength = 8192;

    /**
     * HTTP connection timeout in milliseconds.
     */
    @Positive
    private int connectTimeout = 10000;

    /**
     * HTTP socket/read timeout in milliseconds.
     */
    @Positive
    private int socketTimeout = 60000;

    /**
     * Maximum retries on connection failure.
     */
    @Min(0)
    private int maxRetries = 5;

    /**
     * Enable the former Phase 2 generic pipeline in the legacy runtime.
     * In the split runtime design this should normally stay false in NEW mode
     * because legacy beans are not instantiated there.
     */
    private boolean genericPipelineEnabled = true;

    /**
     * Keep writing completed TED embeddings back to the legacy ted.procurement_document
     * vector columns so the existing semantic search stays operational during migration.
     */
    private boolean dualWriteLegacyTedVectors = true;

    /**
     * Scheduler interval for generic embedding polling (milliseconds).
     */
    @Positive
    private long genericSchedulerPeriodMs = 6000;

    /**
     * Builder key for the primary TED semantic representation created during transitional dual-write.
     */
    @NotBlank
    private String primaryRepresentationBuilderKey = "ted-phase2-primary-representation";

    /**
     * Provider key used when registering the configured embedding model in DOC.doc_embedding_model.
     */
    @NotBlank
    private String embeddingProvider = "http-embedding-service";
}
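For reference, this deleted class bound a YAML tree like the following. The fragment below is illustrative, built from the defaults in the class; Spring Boot's relaxed binding maps the kebab-case keys to the camelCase fields:

```yaml
legacy:
  ted:
    vectorization:
      enabled: true
      use-http-api: false
      api-url: http://localhost:8001
      model-name: intfloat/multilingual-e5-large
      dimensions: 1024
      batch-size: 16
      thread-pool-size: 4
      max-text-length: 8192
      connect-timeout: 10000
      socket-timeout: 60000
      max-retries: 5
      generic-pipeline-enabled: true
      dual-write-legacy-ted-vectors: true
      generic-scheduler-period-ms: 6000
      primary-representation-builder-key: ted-phase2-primary-representation
      embedding-provider: http-embedding-service
```

The new/transitional keys at the bottom of the list are the ones the config split moved into `application-new.yml`.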
Some files were not shown because too many files have changed in this diff.