Compare commits: 152d9739af ... 74609e481d

No commits in common. '152d9739af0969090ddb7db63ad6a93054147053' and '74609e481d97ae16a12a06dd2d6ed191f01a33f7' have entirely different histories.
@@ -1,31 +0,0 @@
# Config split: moved new-runtime properties to application-new.yml

This patch keeps shared and legacy defaults in `application.yml` and moves new-runtime properties into `application-new.yml`.

Activate the new runtime with:

```
--spring.profiles.active=new
```

`application-new.yml` also sets:

```yaml
dip.runtime.mode: NEW
```

So profile selection and runtime mode stay aligned.

Moved blocks:

- `dip.embedding.*`
- `ted.search.*` (new generic search tuning, now under `dip.search.*`)
- `ted.projection.*`
- `ted.generic-ingestion.*`
- new/transitional `ted.vectorization.*` keys:
  - `generic-pipeline-enabled`
  - `dual-write-legacy-ted-vectors`
  - `generic-scheduler-period-ms`
  - `primary-representation-builder-key`
  - `embedding-provider`

Shared / legacy defaults remain in `application.yml`.
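Putting the moved blocks together, a minimal `application-new.yml` skeleton could look like the following. Only the key paths come from the list above; the values shown are placeholders:

```yaml
# application-new.yml -- active with --spring.profiles.active=new
dip:
  runtime:
    mode: NEW                        # keeps runtime mode aligned with the profile
  embedding:
    enabled: true                    # placeholder value

ted:
  vectorization:
    # transitional keys listed above; values are placeholders
    generic-pipeline-enabled: true
    dual-write-legacy-ted-vectors: false
    generic-scheduler-period-ms: 5000
```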
@@ -1,30 +0,0 @@
# Embedding policy Patch K1

Patch K1 introduces the configuration and resolver layer for policy-based document embedding selection.

## Added
- `EmbeddingPolicy`
- `EmbeddingProfile`
- `EmbeddingPolicyCondition`
- `EmbeddingPolicyUse`
- `EmbeddingPolicyRule`
- `EmbeddingPolicyProperties`
- `EmbeddingProfileProperties`
- `EmbeddingPolicyResolver`
- `DefaultEmbeddingPolicyResolver`
- `EmbeddingProfileResolver`
- `DefaultEmbeddingProfileResolver`

## Example config
See `application-new-example-embedding-policy.yml`.

## What K1 does not change
- no runtime import/orchestrator wiring yet
- no `SourceDescriptor` schema change yet
- no job persistence/audit changes yet

## Intended follow-up
K2 should wire:
- `GenericDocumentImportService`
- `RepresentationEmbeddingOrchestrator`

to use the resolved policy and profile.
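The example file itself is not shown in this diff, but based on the binding shape of `EmbeddingPolicyProperties` (prefix `dip.embedding.policies`, with `defaultPolicy` and `rules`) and `EmbeddingProfileProperties` (prefix `dip.embedding.profiles`, with `definitions`), a configuration of roughly this shape would bind. The rule name, model/profile keys, and representation-type value are illustrative:

```yaml
dip:
  embedding:
    policies:
      default-policy:
        policy-key: default          # illustrative keys and values
        model-key: e5-default
        profile-key: standard
        enabled: true
      rules:
        - name: pdf-documents        # illustrative rule
          when:
            mime-type: application/pdf
          use:
            policy-key: pdf
            model-key: e5-default
            profile-key: standard
    profiles:
      definitions:
        standard:
          embed-representation-types:
            - PLAIN_TEXT             # illustrative RepresentationType value
```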
@@ -1,26 +0,0 @@
# Embedding policy Patch K2

Patch K2 wires the policy/profile layer into the actual NEW import runtime.

## What it changes
- `GenericDocumentImportService`
  - resolves `EmbeddingPolicy` per imported document
  - resolves `EmbeddingProfile`
  - ensures the selected embedding model is registered
  - queues embeddings only for representation drafts allowed by the resolved profile
- `RepresentationEmbeddingOrchestrator`
  - adds a convenience overload for `(documentId, modelKey, profile)`
- `EmbeddingJobService`
  - adds a profile-aware enqueue overload
- `DefaultEmbeddingSelectionPolicy`
  - adds profile-aware representation filtering
- `DefaultEmbeddingPolicyResolver`
  - corrected for the current `SourceDescriptor.attributes()` shape

## Runtime flow after K2
document imported
-> representations built
-> policy resolved
-> profile resolved
-> model ensured
-> matching representations queued for embedding
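The profile-aware filtering step can be illustrated with a self-contained sketch. The representation types and the draft list are invented for the example; only the `includes(...)` shape mirrors the `EmbeddingProfile` record from K1:

```java
import java.util.List;

public class ProfileFilterSketch {

    // Invented representation types for the example
    enum RepresentationType { PLAIN_TEXT, CHUNKED_LONG_TEXT, SUMMARY }

    // Mirrors the shape of the EmbeddingProfile record: a profile allows a set of types
    record EmbeddingProfile(String profileKey, List<RepresentationType> embedRepresentationTypes) {
        boolean includes(RepresentationType type) {
            return embedRepresentationTypes != null && embedRepresentationTypes.contains(type);
        }
    }

    // Queue only the drafts whose type the resolved profile allows
    static List<RepresentationType> selectForEmbedding(EmbeddingProfile profile,
                                                       List<RepresentationType> drafts) {
        return drafts.stream().filter(profile::includes).toList();
    }

    public static void main(String[] args) {
        EmbeddingProfile profile = new EmbeddingProfile("standard",
                List.of(RepresentationType.PLAIN_TEXT, RepresentationType.CHUNKED_LONG_TEXT));
        List<RepresentationType> queued = selectForEmbedding(profile,
                List.of(RepresentationType.PLAIN_TEXT, RepresentationType.SUMMARY));
        System.out.println(queued); // only the types the profile includes
    }
}
```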
@@ -1,59 +0,0 @@
# Runtime split Patch B

Patch B builds on Patch A and makes the NEW runtime actually process embedding jobs.

## What changes

### 1. New embedding job scheduler
Adds:

- `EmbeddingJobScheduler`
- `EmbeddingJobSchedulingConfiguration`

Behavior:
- enabled only in `NEW` runtime mode
- active only when `dip.embedding.jobs.enabled=true`
- periodically calls `RepresentationEmbeddingOrchestrator.processNextReadyBatch()`

### 2. Generic import hands off to the new embedding job path
`GenericDocumentImportService` is updated so that in `NEW` mode it:

- resolves `dip.embedding.default-document-model`
- ensures the model is registered in `DOC.doc_embedding_model`
- creates embedding jobs through `RepresentationEmbeddingOrchestrator.enqueueRepresentation(...)`

It no longer creates legacy-style pending embeddings as the primary handoff for the NEW runtime path.

## Notes

- This patch assumes Patch A has already introduced:
  - `RuntimeMode`
  - `RuntimeModeProperties`
  - `@ConditionalOnRuntimeMode`
- This patch does not yet remove the legacy vectorization runtime; that remains the job of subsequent cutover steps.

## Expected runtime behavior in NEW mode

- `GenericDocumentImportService` persists new generic representations
- selected representations are queued into `DOC.doc_embedding_job`
- the scheduler processes pending jobs
- vectors are persisted through the new embedding subsystem

## New config

Example:

```yaml
dip:
  runtime:
    mode: NEW

  embedding:
    enabled: true
    jobs:
      enabled: true
      scheduler-delay-ms: 5000
```
@@ -1,36 +0,0 @@
# Runtime split Patch C

Patch C moves the **new generic search runtime** off `TedProcessorProperties.search` and into a dedicated `dip.search.*` config tree.

## New config class
- `at.procon.dip.search.config.DipSearchProperties`

## New config root
```yaml
dip:
  search:
    ...
```

## Classes moved off `TedProcessorProperties`
- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchOrchestrator`
- `DefaultSearchResultFusionService`
- `SearchLexicalIndexStartupRunner`
- `ChunkedLongTextRepresentationBuilder`

## What this patch intentionally does not do
- it does not yet remove `TedProcessorProperties` from all NEW-mode classes
- it does not yet move `generic-ingestion` config off `ted.*`
- it does not yet finish the legacy/new config split for import/mail/TED package processing

Those should be handled in the next config-splitting patch.

## Practical result
After this patch, **new search/semantic/chunking tuning** should be configured only via:
- `dip.search.*`

while `ted.search.*` remains legacy-oriented.
@@ -1,40 +0,0 @@
# Runtime Split Patch D

This patch completes the next configuration split step for the NEW runtime.

## New property classes

- `at.procon.dip.ingestion.config.DipIngestionProperties`
  - prefix: `dip.ingestion`
- `at.procon.dip.domain.ted.config.TedProjectionProperties`
  - prefix: `dip.ted.projection`

## Classes moved off `TedProcessorProperties`

### NEW-mode ingestion
- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`

### NEW-mode projection
- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`

## Additional cleanup in `GenericDocumentImportService`

It now resolves the default document embedding model through the new embedding subsystem:

- `EmbeddingProperties`
- `EmbeddingModelRegistry`
- `EmbeddingModelCatalogService`

and no longer reads vectorization model/provider/dimensions from `TedProcessorProperties`.

## What still remains for later split steps

- legacy routes/services still using `TedProcessorProperties`
- legacy/new runtime bean gating for all remaining shared classes
- moving old TED-only config fully under `legacy.ted.*`
@@ -1,26 +0,0 @@
# Runtime split Patch E

This patch continues the runtime/config split by targeting the remaining NEW-mode classes that still injected `TedProcessorProperties`.

## New config classes
- `DipIngestionProperties` (`dip.ingestion.*`)
- `TedProjectionProperties` (`dip.ted.projection.*`)

## NEW-mode classes moved off `TedProcessorProperties`
- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`
- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`

## Additional behavior change
`GenericDocumentImportService` now hands embedding work off to the new embedding subsystem by:
- resolving the default document model from `EmbeddingModelRegistry`
- ensuring the model is registered via `EmbeddingModelCatalogService`
- enqueueing jobs through `RepresentationEmbeddingOrchestrator`

This removes the new import path's runtime dependence on legacy `TedProcessorProperties.vectorization`.
@@ -1,24 +0,0 @@
# Runtime split Patch G

Patch G moves the remaining NEW-mode search/chunking classes off `TedProcessorProperties.search` and onto `DipSearchProperties` (`dip.search.*`).

## New config class
- `at.procon.dip.search.config.DipSearchProperties`

## Classes switched to `DipSearchProperties`
- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchResultFusionService`
- `DefaultSearchOrchestrator`
- `SearchLexicalIndexStartupRunner`
- `ChunkedLongTextRepresentationBuilder`

## Additional cleanup
These classes are also marked `NEW`-only in this patch.

## Effect
After Patch G, the generic NEW-mode search/chunking path no longer depends on `TedProcessorProperties.search`. That leaves `TedProcessorProperties` much closer to legacy-only ownership.
@@ -1,17 +0,0 @@
# Runtime split Patch H

Patch H is a final cleanup / verification step after the previous split patches.

## What it does
- makes `TedProcessorProperties` explicitly `LEGACY`-only
- removes the stale `TedProcessorProperties` import/comment from `DocumentIntelligencePlatformApplication`
- adds a regression test that fails if NEW runtime classes reintroduce a dependency on `TedProcessorProperties`
- adds a simple `application-legacy.yml` profile file

## Why this matters
After the NEW search/ingestion/projection classes are moved to:
- `DipSearchProperties`
- `DipIngestionProperties`
- `TedProjectionProperties`

`TedProcessorProperties` should be owned strictly by the legacy runtime graph.
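The regression test itself is not included in this excerpt. One way such a guard can be written with only the JDK is a reflective scan of constructor parameter types; the classes below are stand-ins, not the real NEW runtime classes, and only the shape of the check is meant:

```java
import java.util.Arrays;

public class ForbiddenDependencySketch {

    // Stand-ins for the real classes; only the shape of the check matters
    static class TedProcessorProperties { }
    static class LegacyService { LegacyService(TedProcessorProperties p) { } }
    static class NewService { NewService(String ok) { } }

    /** Returns true if any constructor of the candidate class takes the forbidden type. */
    static boolean dependsOn(Class<?> candidate, Class<?> forbidden) {
        return Arrays.stream(candidate.getDeclaredConstructors())
                .flatMap(c -> Arrays.stream(c.getParameterTypes()))
                .anyMatch(forbidden::equals);
    }

    public static void main(String[] args) {
        // A real regression test would iterate over the NEW runtime classes
        if (dependsOn(NewService.class, TedProcessorProperties.class)) {
            throw new AssertionError("NEW class depends on TedProcessorProperties");
        }
        System.out.println(dependsOn(LegacyService.class, TedProcessorProperties.class));
    }
}
```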
@@ -1,21 +0,0 @@
# Runtime split Patch I

Patch I extracts the remaining legacy vectorization cluster off `TedProcessorProperties` and onto a dedicated legacy-only config class.

## New config class
- `at.procon.ted.config.LegacyVectorizationProperties`
  - prefix: `legacy.ted.vectorization.*`

## Classes switched off `TedProcessorProperties`
- `GenericVectorizationRoute`
- `DocumentEmbeddingProcessingService`
- `ConfiguredEmbeddingModelStartupRunner`
- `GenericVectorizationStartupRunner`

## Additional cleanup
These classes are also marked `LEGACY`-only via `@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)`.

## Effect
The `at.procon.dip.vectorization.*` package now clearly belongs to the old runtime graph and no longer pulls its settings from the shared monolithic `TedProcessorProperties`.
@@ -1,45 +0,0 @@
# Runtime split Patch J

Patch J is a broader cleanup patch for the **actual current codebase**.

It adds the missing runtime/config split scaffolding and rewires the remaining NEW-mode classes that still injected `TedProcessorProperties`.

## Added
- `dip.runtime` infrastructure
  - `RuntimeMode`
  - `RuntimeModeProperties`
  - `@ConditionalOnRuntimeMode`
  - `RuntimeModeCondition`
- `DipSearchProperties`
- `DipIngestionProperties`
- `TedProjectionProperties`

## Rewired off `TedProcessorProperties`
### NEW search/chunking
- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchOrchestrator`
- `SearchLexicalIndexStartupRunner`
- `DefaultSearchResultFusionService`
- `ChunkedLongTextRepresentationBuilder`

### NEW ingestion/projection
- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`
- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`

## Additional behavior
- `GenericDocumentImportService` now hands embedding work off to the new embedding subsystem via `RepresentationEmbeddingOrchestrator` and resolves the default model through `EmbeddingModelRegistry` / `EmbeddingModelCatalogService`.

## Notes
This patch intentionally targets the real current leftovers visible in the actual codebase. It assumes the new embedding subsystem already exists.
@@ -1,20 +0,0 @@
# NEW TED package import route

This patch adds a NEW-runtime TED package download path that:

- reuses the proven package sequencing rules
- stores package tracking in `TedDailyPackage`
- downloads the package tar.gz
- ingests it only through `DocumentIngestionGateway`
- never calls the legacy XML batch processing / vectorization flow

## Added classes

- `TedPackageSequenceService`
- `DefaultTedPackageSequenceService`
- `TedPackageDownloadNewProperties`
- `TedPackageDownloadNewRoute`

## Config

Use the `dip.ingestion.ted-download.*` block in `application-new.yml`.
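The exact key set of that block is not shown here. Based on the property getters used by `DefaultTedPackageSequenceService` (`getStartYear`, `getNotFoundRetryInterval`, `getPreviousYearGracePeriodDays`, `isRetryCurrentYearNotFoundIndefinitely`), a plausible sketch looks like this; all values are placeholders:

```yaml
dip:
  ingestion:
    ted-download:
      # key names inferred from the property getters used by
      # DefaultTedPackageSequenceService; values are placeholders
      start-year: 2015
      not-found-retry-interval: 3600000          # milliseconds
      previous-year-grace-period-days: 14
      retry-current-year-not-found-indefinitely: true
```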
@@ -1,67 +0,0 @@
# Vector-sync HTTP embedding provider

This provider supports two endpoints:

- `POST {baseUrl}/vector-sync` for single-text requests
- `POST {baseUrl}/vectorize-batch` for batch document requests

## Single request

Request body:
```json
{
  "model": "intfloat/multilingual-e5-large",
  "text": "This is a sample text to vectorize"
}
```

## Batch request

Request body:
```json
{
  "model": "intfloat/multilingual-e5-large",
  "truncate_text": false,
  "truncate_length": 512,
  "chunk_size": 20,
  "items": [
    {
      "id": "2f48fd5c-9d39-4d80-9225-ea0c59c77c9a",
      "text": "This is a sample text to vectorize"
    }
  ]
}
```

## Provider configuration

```yaml
batch-request:
  truncate-text: false
  truncate-length: 512
  chunk-size: 20
```

These values are used for `/vectorize-batch` calls and can also be overridden per request via `EmbeddingRequest.providerOptions()`.

## Orchestrator batch processing

To let `RepresentationEmbeddingOrchestrator` send multiple representations in one provider call, enable batch processing for jobs and for the model:

```yaml
dip:
  embedding:
    jobs:
      enabled: true
      process-in-batches: true
      execution-batch-size: 20

    models:
      e5-default:
        supports-batch: true
```

Notes:
- jobs are grouped by `modelKey`
- non-batch-capable models still fall back to single-item execution
- `execution-batch-size` controls how many texts are sent in one `/vectorize-batch` request
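Constructing the documented single-text call with the JDK HTTP client can be sketched as follows. The base URL is a placeholder, and the hand-rolled JSON matches the request body shown above but does not escape quotes in the inputs:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class VectorSyncRequestSketch {

    /** Builds the single-text request documented above; baseUrl is a placeholder. */
    static HttpRequest singleRequest(String baseUrl, String model, String text) {
        // Minimal hand-rolled JSON matching the documented body;
        // a real client would serialize with a JSON library to handle escaping
        String body = "{\"model\": \"%s\", \"text\": \"%s\"}".formatted(model, text);
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/vector-sync"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = singleRequest("http://localhost:8000",
                "intfloat/multilingual-e5-large", "This is a sample text to vectorize");
        System.out.println(request.uri());    // ends with /vector-sync
        System.out.println(request.method()); // POST
    }
}
```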
@@ -1,16 +0,0 @@
package at.procon.dip.domain.ted.config;

import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.ted.projection")
@Data
public class TedProjectionProperties {
    private boolean enabled = true;
    private boolean startupBackfillEnabled = false;
    @Positive
    private int startupBackfillLimit = 250;
}
@@ -1,189 +0,0 @@
package at.procon.dip.domain.ted.service;

import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import at.procon.dip.ingestion.config.TedPackageDownloadProperties;
import at.procon.ted.model.entity.TedDailyPackage;
import at.procon.ted.repository.TedDailyPackageRepository;
import java.time.Duration;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.Year;
import java.time.ZoneOffset;
import java.util.Optional;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

/**
 * NEW-runtime implementation of TED package sequencing.
 * <p>
 * This reuses the same decision rules as the legacy TED package downloader:
 * <ul>
 *     <li>current year forward crawling first</li>
 *     <li>gap filling by walking backward to package 1</li>
 *     <li>NOT_FOUND retry handling with current-year indefinite retry support</li>
 *     <li>previous-year grace period before a tail NOT_FOUND becomes final</li>
 * </ul>
 */
@Service
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@RequiredArgsConstructor
@Slf4j
public class DefaultTedPackageSequenceService implements TedPackageSequenceService {

    private final TedPackageDownloadProperties properties;
    private final TedDailyPackageRepository packageRepository;

    @Override
    public PackageInfo getNextPackageToDownload() {
        int currentYear = Year.now().getValue();

        log.debug("Determining next TED package to download for NEW runtime (current year: {})", currentYear);

        // 1) Current year forward crawling first (newest data first)
        PackageInfo nextInCurrentYear = getNextForwardPackage(currentYear);
        if (nextInCurrentYear != null) {
            log.info("Next TED package: {} (current year {} forward)", nextInCurrentYear.identifier(), currentYear);
            return nextInCurrentYear;
        }

        // 2) Walk all years backward and fill gaps / continue unfinished years
        for (int year = currentYear; year >= properties.getStartYear(); year--) {
            PackageInfo gapFiller = getGapFillerPackage(year);
            if (gapFiller != null) {
                log.info("Next TED package: {} (filling gap in year {})", gapFiller.identifier(), year);
                return gapFiller;
            }

            if (!isYearComplete(year)) {
                PackageInfo forwardPackage = getNextForwardPackage(year);
                if (forwardPackage != null) {
                    log.info("Next TED package: {} (continuing year {})", forwardPackage.identifier(), year);
                    return forwardPackage;
                }
            } else {
                log.debug("TED package year {} is complete", year);
            }
        }

        // 3) Open a new older year if possible
        int oldestYear = getOldestYearWithData();
        if (oldestYear > properties.getStartYear()) {
            int previousYear = oldestYear - 1;
            if (previousYear >= properties.getStartYear()) {
                PackageInfo first = new PackageInfo(previousYear, 1);
                log.info("Next TED package: {} (opening year {})", first.identifier(), previousYear);
                return first;
            }
        }

        log.info("All TED package years from {} to {} appear complete - nothing to download",
                properties.getStartYear(), currentYear);
        return null;
    }

    private PackageInfo getNextForwardPackage(int year) {
        Optional<TedDailyPackage> latest = packageRepository.findLatestByYear(year);

        if (latest.isEmpty()) {
            return new PackageInfo(year, 1);
        }

        TedDailyPackage latestPackage = latest.get();

        if (latestPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.NOT_FOUND) {
            if (shouldRetryNotFoundPackage(latestPackage)) {
                return new PackageInfo(year, latestPackage.getSerialNumber());
            }

            if (isNotFoundRetryableForYear(latestPackage)) {
                log.debug("Year {} still inside NOT_FOUND retry window for package {} until {}",
                        year, latestPackage.getPackageIdentifier(), calculateNextRetryAt(latestPackage));
                return null;
            }

            log.debug("Year {} finalized after grace period at tail package {}", year, latestPackage.getPackageIdentifier());
            return null;
        }

        return new PackageInfo(year, latestPackage.getSerialNumber() + 1);
    }

    private PackageInfo getGapFillerPackage(int year) {
        Optional<TedDailyPackage> first = packageRepository.findFirstByYear(year);

        if (first.isEmpty()) {
            return null;
        }

        int minSerial = first.get().getSerialNumber();
        if (minSerial <= 1) {
            return null;
        }

        return new PackageInfo(year, minSerial - 1);
    }

    private boolean isYearComplete(int year) {
        Optional<TedDailyPackage> first = packageRepository.findFirstByYear(year);
        Optional<TedDailyPackage> latest = packageRepository.findLatestByYear(year);

        if (first.isEmpty() || latest.isEmpty()) {
            return false;
        }

        if (first.get().getSerialNumber() != 1) {
            return false;
        }

        TedDailyPackage latestPackage = latest.get();
        return latestPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.NOT_FOUND
                && !isNotFoundRetryableForYear(latestPackage);
    }

    private boolean shouldRetryNotFoundPackage(TedDailyPackage pkg) {
        if (!isNotFoundRetryableForYear(pkg)) {
            return false;
        }

        OffsetDateTime nextRetryAt = calculateNextRetryAt(pkg);
        return !nextRetryAt.isAfter(OffsetDateTime.now());
    }

    private boolean isNotFoundRetryableForYear(TedDailyPackage pkg) {
        int currentYear = Year.now().getValue();
        int packageYear = pkg.getYear() != null ? pkg.getYear() : currentYear;

        if (packageYear >= currentYear) {
            return properties.isRetryCurrentYearNotFoundIndefinitely();
        }

        return OffsetDateTime.now().isBefore(getYearRetryGraceDeadline(packageYear));
    }

    private OffsetDateTime calculateNextRetryAt(TedDailyPackage pkg) {
        OffsetDateTime lastAttemptAt = pkg.getUpdatedAt() != null
                ? pkg.getUpdatedAt()
                : (pkg.getCreatedAt() != null ? pkg.getCreatedAt() : OffsetDateTime.now());

        return lastAttemptAt.plus(Duration.ofMillis(properties.getNotFoundRetryInterval()));
    }

    private OffsetDateTime getYearRetryGraceDeadline(int year) {
        return LocalDate.of(year + 1, 1, 1)
                .atStartOfDay()
                .atOffset(ZoneOffset.UTC)
                .plusDays(properties.getPreviousYearGracePeriodDays());
    }

    private int getOldestYearWithData() {
        int currentYear = Year.now().getValue();
        for (int year = properties.getStartYear(); year <= currentYear; year++) {
            if (packageRepository.findLatestByYear(year).isPresent()) {
                return year;
            }
        }
        return currentYear;
    }
}
@@ -1,25 +0,0 @@
package at.procon.dip.domain.ted.service;

/**
 * Shared package sequencing contract used to determine the next TED daily package to download.
 * <p>
 * This service encapsulates the proven sequencing rules from the legacy download implementation
 * so they can also be used by the NEW runtime without depending on the old route/service graph.
 */
public interface TedPackageSequenceService {

    /**
     * Returns the next package to download according to the current sequencing strategy,
     * or {@code null} if nothing should be downloaded right now.
     */
    PackageInfo getNextPackageToDownload();

    /**
     * Simple year/serial pair with TED package identifier helper.
     */
    record PackageInfo(int year, int serialNumber) {
        public String identifier() {
            return "%04d%05d".formatted(year, serialNumber);
        }
    }
}
@@ -1,12 +0,0 @@
package at.procon.dip.embedding.config;

import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableScheduling;

@Configuration
@EnableScheduling
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
public class EmbeddingJobSchedulingConfiguration {
}
@@ -1,14 +0,0 @@
package at.procon.dip.embedding.config;

import lombok.Data;

@Data
public class EmbeddingPolicyCondition {
    private String documentType;
    private String documentFamily;
    private String sourceType;
    private String mimeType;
    private String language;
    private String ownerTenantKey;
    private String embeddingPolicyHint;
}
@@ -1,16 +0,0 @@
package at.procon.dip.embedding.config;

import java.util.ArrayList;
import java.util.List;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.embedding.policies")
@Data
public class EmbeddingPolicyProperties {

    private EmbeddingPolicyUse defaultPolicy = new EmbeddingPolicyUse();
    private List<EmbeddingPolicyRule> rules = new ArrayList<>();
}
@@ -1,10 +0,0 @@
package at.procon.dip.embedding.config;

import lombok.Data;

@Data
public class EmbeddingPolicyRule {
    private String name;
    private EmbeddingPolicyCondition when = new EmbeddingPolicyCondition();
    private EmbeddingPolicyUse use = new EmbeddingPolicyUse();
}
@@ -1,12 +0,0 @@
package at.procon.dip.embedding.config;

import lombok.Data;

@Data
public class EmbeddingPolicyUse {
    private String policyKey;
    private String modelKey;
    private String queryModelKey;
    private String profileKey;
    private boolean enabled = true;
}
@@ -1,23 +0,0 @@
package at.procon.dip.embedding.config;

import at.procon.dip.domain.document.RepresentationType;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.embedding.profiles")
@Data
public class EmbeddingProfileProperties {

    private Map<String, ProfileDefinition> definitions = new LinkedHashMap<>();

    @Data
    public static class ProfileDefinition {
        private List<RepresentationType> embedRepresentationTypes = new ArrayList<>();
    }
}
@@ -1,32 +0,0 @@
package at.procon.dip.embedding.job;

import at.procon.dip.embedding.service.RepresentationEmbeddingOrchestrator;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
@RequiredArgsConstructor
@Slf4j
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@ConditionalOnProperty(prefix = "dip.embedding.jobs", name = "enabled", havingValue = "true")
public class EmbeddingJobScheduler {

    private final RepresentationEmbeddingOrchestrator orchestrator;

    @Scheduled(fixedDelayString = "${dip.embedding.jobs.scheduler-delay-ms:5000}")
    public void processNextBatch() {
        try {
            int processed = orchestrator.processNextReadyBatch();
            if (processed > 0) {
                log.debug("NEW runtime embedding job scheduler processed {} job(s)", processed);
            }
        } catch (Exception ex) {
            log.warn("NEW runtime embedding job scheduler failed: {}", ex.getMessage(), ex);
        }
    }
}
@@ -1,10 +0,0 @@
package at.procon.dip.embedding.policy;

public record EmbeddingPolicy(
        String policyKey,
        String modelKey,
        String queryModelKey,
        String profileKey,
        boolean enabled
) {
}
@@ -1,13 +0,0 @@
package at.procon.dip.embedding.policy;

import at.procon.dip.domain.document.RepresentationType;
import java.util.List;

public record EmbeddingProfile(
        String profileKey,
        List<RepresentationType> embedRepresentationTypes
) {
    public boolean includes(RepresentationType representationType) {
        return embedRepresentationTypes != null && embedRepresentationTypes.contains(representationType);
    }
}
@@ -1,68 +0,0 @@
package at.procon.dip.embedding.provider.http;

import at.procon.dip.embedding.model.ResolvedEmbeddingProviderConfig;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import lombok.RequiredArgsConstructor;

@RequiredArgsConstructor
abstract class AbstractHttpEmbeddingProviderSupport {

    protected final ObjectMapper objectMapper;
    protected final HttpClient httpClient = HttpClient.newBuilder()
            .version(HttpClient.Version.HTTP_1_1)
            .build();

    protected String trimTrailingSlash(String value) {
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException("Embedding provider baseUrl must be configured");
        }
        return value.endsWith("/") ? value.substring(0, value.length() - 1) : value;
    }

    protected HttpResponse<String> postJson(ResolvedEmbeddingProviderConfig providerConfig,
                                            String path,
                                            Object body) throws IOException, InterruptedException {
        HttpRequest.Builder builder = HttpRequest.newBuilder()
                .uri(URI.create(trimTrailingSlash(providerConfig.baseUrl()) + path))
                .timeout(providerConfig.readTimeout() == null ? Duration.ofSeconds(60) : providerConfig.readTimeout())
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        objectMapper.writeValueAsString(body),
                        StandardCharsets.UTF_8
                ));

        if (providerConfig.apiKey() != null && !providerConfig.apiKey().isBlank()) {
            builder.header("Authorization", "Bearer " + providerConfig.apiKey());
        }
        if (providerConfig.headers() != null) {
            providerConfig.headers().forEach(builder::header);
        }

        HttpResponse<String> response = httpClient.send(
                builder.build(),
                HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8)
        );
        if (response.statusCode() / 100 != 2) {
            throw new IllegalStateException(
                    "Embedding provider returned status %d: %s".formatted(response.statusCode(), response.body())
            );
        }
        return response;
    }

    protected float[] toArray(List<Float> embedding) {
        float[] result = new float[embedding.size()];
        for (int i = 0; i < embedding.size(); i++) {
            result[i] = embedding.get(i);
        }
        return result;
    }
}
@@ -1,374 +0,0 @@
package at.procon.dip.embedding.provider.http;

import at.procon.dip.embedding.model.EmbeddingModelDescriptor;
import at.procon.dip.embedding.model.EmbeddingProviderResult;
import at.procon.dip.embedding.model.EmbeddingRequest;
import at.procon.dip.embedding.model.ResolvedEmbeddingProviderConfig;
import at.procon.dip.embedding.provider.EmbeddingProvider;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import org.springframework.stereotype.Component;

/**
 * HTTP provider for vector APIs.
 *
 * Supported endpoints:
 * POST {baseUrl}/vector-sync      - single text
 * POST {baseUrl}/vectorize-batch  - multiple texts
 */
@Component
public class VectorSyncHttpEmbeddingProvider extends AbstractHttpEmbeddingProviderSupport implements EmbeddingProvider {

    private static final String PROVIDER_TYPE = "http-vector-sync";

    private static final boolean DEFAULT_TRUNCATE_TEXT = false;
    private static final int DEFAULT_TRUNCATE_LENGTH = 512;
    private static final int DEFAULT_CHUNK_SIZE = 20;

    private static final List<String> TRUNCATE_TEXT_KEYS = List.of(
            "vectorize-batch.truncate-text",
            "vectorize-batch.truncate_text",
            "truncate_text",
            "truncate-text",
            "truncateText"
    );

    private static final List<String> TRUNCATE_LENGTH_KEYS = List.of(
            "vectorize-batch.truncate-length",
            "vectorize-batch.truncate_length",
            "truncate_length",
            "truncate-length",
            "truncateLength"
    );

    private static final List<String> CHUNK_SIZE_KEYS = List.of(
            "vectorize-batch.chunk-size",
            "vectorize-batch.chunk_size",
            "chunk_size",
            "chunk-size",
            "chunkSize"
    );

    public VectorSyncHttpEmbeddingProvider(ObjectMapper objectMapper) {
        super(objectMapper);
    }

    @Override
    public String providerType() {
        return PROVIDER_TYPE;
    }

    @Override
    public boolean supports(EmbeddingModelDescriptor model, ResolvedEmbeddingProviderConfig providerConfig) {
        return PROVIDER_TYPE.equalsIgnoreCase(providerConfig.providerType());
    }

    @Override
    public EmbeddingProviderResult embedDocuments(ResolvedEmbeddingProviderConfig providerConfig,
                                                  EmbeddingModelDescriptor model,
                                                  EmbeddingRequest request) {
        return execute(providerConfig, model, request);
    }

    @Override
    public EmbeddingProviderResult embedQuery(ResolvedEmbeddingProviderConfig providerConfig,
                                              EmbeddingModelDescriptor model,
                                              EmbeddingRequest request) {
        return execute(providerConfig, model, request);
    }

    private EmbeddingProviderResult execute(ResolvedEmbeddingProviderConfig providerConfig,
                                            EmbeddingModelDescriptor model,
                                            EmbeddingRequest request) {
        if (request.texts() == null || request.texts().isEmpty()) {
            throw new IllegalArgumentException("Embedding request texts must not be empty");
        }

        try {
            return request.texts().size() == 1
                    ? executeSingle(providerConfig, model, request.texts().getFirst())
                    : executeBatch(providerConfig, model, request);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Embedding provider call interrupted", e);
        } catch (IOException e) {
            throw new IllegalStateException("Failed to call embedding provider", e);
        }
    }

    private EmbeddingProviderResult executeSingle(ResolvedEmbeddingProviderConfig providerConfig,
                                                  EmbeddingModelDescriptor model,
                                                  String text) throws IOException, InterruptedException {
        HttpResponse<String> response = postJson(
                providerConfig,
                "/vector-sync",
                new VectorSyncRequest(model.providerModelKey(), text)
        );

        VectorSyncResponse parsed = objectMapper.readValue(response.body(), VectorSyncResponse.class);
        float[] vector = extractVector(parsed.vector, parsed.combinedVector, model);

        return new EmbeddingProviderResult(
                model,
                List.of(vector),
                List.of(),
                null,
                parsed.tokenCount
        );
    }

    private EmbeddingProviderResult executeBatch(ResolvedEmbeddingProviderConfig providerConfig,
                                                 EmbeddingModelDescriptor model,
                                                 EmbeddingRequest request) throws IOException, InterruptedException {
        BatchRequestSettings settings = resolveBatchRequestSettings(providerConfig, request.providerOptions());

        if (settings.truncateLength() <= 0) {
            throw new IllegalArgumentException("Batch truncate length must be > 0");
        }
        if (settings.chunkSize() <= 0) {
            throw new IllegalArgumentException("Batch chunk size must be > 0");
        }

        List<String> requestOrder = new ArrayList<>(request.texts().size());
        List<VectorizeBatchItemRequest> items = new ArrayList<>(request.texts().size());

        for (String text : request.texts()) {
            String id = UUID.randomUUID().toString();
            requestOrder.add(id);
            items.add(new VectorizeBatchItemRequest(id, text));
        }

        HttpResponse<String> response = postJson(
                providerConfig,
                "/vectorize-batch",
                new VectorizeBatchRequest(
                        model.providerModelKey(),
                        settings.truncateText(),
                        settings.truncateLength(),
                        settings.chunkSize(),
                        items
                )
        );

        VectorizeBatchResponse parsed = objectMapper.readValue(response.body(), VectorizeBatchResponse.class);
        if (parsed.results == null || parsed.results.isEmpty()) {
            throw new IllegalStateException("Vectorize-batch provider returned no results");
        }

        Map<String, VectorizeBatchItemResponse> resultById = new HashMap<>();
        for (VectorizeBatchItemResponse result : parsed.results) {
            resultById.put(result.id, result);
        }

        List<float[]> vectors = new ArrayList<>(request.texts().size());
        int totalTokenCount = 0;
        boolean hasAnyTokenCount = false;

        for (String id : requestOrder) {
            VectorizeBatchItemResponse item = resultById.get(id);
            if (item == null) {
                throw new IllegalStateException("Vectorize-batch provider response is missing item for id " + id);
            }

            vectors.add(extractVector(item.vector, item.combinedVector, model));

            if (item.tokenCount != null) {
                totalTokenCount += item.tokenCount;
                hasAnyTokenCount = true;
            }
        }

        return new EmbeddingProviderResult(
                model,
                vectors,
                List.of(),
                null,
                hasAnyTokenCount ? totalTokenCount : null
        );
    }

    private BatchRequestSettings resolveBatchRequestSettings(ResolvedEmbeddingProviderConfig providerConfig,
                                                             Map<String, Object> providerOptions) {
        boolean truncateText = resolveBooleanOption(
                providerOptions,
                TRUNCATE_TEXT_KEYS,
                providerConfig.batchTruncateText() != null ? providerConfig.batchTruncateText() : DEFAULT_TRUNCATE_TEXT
        );
        int truncateLength = resolveIntOption(
                providerOptions,
                TRUNCATE_LENGTH_KEYS,
                providerConfig.batchTruncateLength() != null ? providerConfig.batchTruncateLength() : DEFAULT_TRUNCATE_LENGTH
        );
        int chunkSize = resolveIntOption(
                providerOptions,
                CHUNK_SIZE_KEYS,
                providerConfig.batchChunkSize() != null ? providerConfig.batchChunkSize() : DEFAULT_CHUNK_SIZE
        );
        return new BatchRequestSettings(truncateText, truncateLength, chunkSize);
    }

    private boolean resolveBooleanOption(Map<String, Object> providerOptions,
                                         List<String> keys,
                                         boolean defaultValue) {
        Object raw = resolveOption(providerOptions, keys);
        if (raw == null) {
            return defaultValue;
        }
        if (raw instanceof Boolean booleanValue) {
            return booleanValue;
        }
        String normalized = String.valueOf(raw).trim();
        if (normalized.isEmpty()) {
            return defaultValue;
        }
        return Boolean.parseBoolean(normalized);
    }

    private int resolveIntOption(Map<String, Object> providerOptions,
                                 List<String> keys,
                                 int defaultValue) {
        Object raw = resolveOption(providerOptions, keys);
        if (raw == null) {
            return defaultValue;
        }
        if (raw instanceof Number number) {
            return number.intValue();
        }
        String normalized = String.valueOf(raw).trim();
        if (normalized.isEmpty()) {
            return defaultValue;
        }
        return Integer.parseInt(normalized);
    }

    private Object resolveOption(Map<String, Object> providerOptions, List<String> keys) {
        if (providerOptions == null || providerOptions.isEmpty()) {
            return null;
        }
        for (String key : keys) {
            if (providerOptions.containsKey(key)) {
                return providerOptions.get(key);
            }
        }
        return null;
    }

    private float[] extractVector(List<Float> vector,
                                  List<Float> combinedVector,
                                  EmbeddingModelDescriptor model) {
        float[] resolved;

        if (combinedVector != null && !combinedVector.isEmpty()) {
            resolved = toArray(combinedVector);
        } else if (vector != null && !vector.isEmpty()) {
            resolved = toArray(vector);
        } else {
            throw new IllegalStateException("Embedding provider returned no vector");
        }

        if (model.dimensions() > 0 && resolved.length != model.dimensions()) {
            throw new IllegalStateException(
                    "Embedding provider returned dimension %d for model %s, expected %d"
                            .formatted(resolved.length, model.modelKey(), model.dimensions())
            );
        }

        return resolved;
    }

    private record BatchRequestSettings(boolean truncateText, int truncateLength, int chunkSize) {
    }

    private record VectorSyncRequest(
            @JsonProperty("model") String model,
            @JsonProperty("text") String text
    ) {
    }

    private record VectorizeBatchRequest(
            @JsonProperty("model") String model,
            @JsonProperty("truncate_text") boolean truncateText,
            @JsonProperty("truncate_length") int truncateLength,
            @JsonProperty("chunk_size") int chunkSize,
            @JsonProperty("items") List<VectorizeBatchItemRequest> items
    ) {
    }

    private record VectorizeBatchItemRequest(
            @JsonProperty("id") String id,
            @JsonProperty("text") String text
    ) {
    }

    static class VectorSyncResponse {
        @JsonProperty("runtime_ms")
        public Double runtimeMs;

        @JsonProperty("vector")
        public List<Float> vector;

        @JsonProperty("incomplete")
        public Boolean incomplete;

        @JsonProperty("combined_vector")
        public List<Float> combinedVector;

        @JsonProperty("token_count")
        public Integer tokenCount;

        @JsonProperty("model")
        public String model;

        @JsonProperty("max_seq_length")
        public Integer maxSeqLength;
    }

    static class VectorizeBatchResponse {
        @JsonProperty("model")
        public String model;

        @JsonProperty("count")
        public Integer count;

        @JsonProperty("results")
        public List<VectorizeBatchItemResponse> results;
    }

    static class VectorizeBatchItemResponse {
        @JsonProperty("id")
        public String id;

        @JsonProperty("vector")
        public List<Float> vector;

        @JsonProperty("token_count")
        public Integer tokenCount;

        @JsonProperty("runtime_ms")
        public Double runtimeMs;

        @JsonProperty("incomplete")
        public Boolean incomplete;

        @JsonProperty("combined_vector")
        public List<Float> combinedVector;

        @JsonProperty("truncated")
        public Boolean truncated;

        @JsonProperty("truncate_length")
        public Integer truncateLength;

        @JsonProperty("model")
        public String model;

        @JsonProperty("max_seq_length")
        public Integer maxSeqLength;
    }
}
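For orientation, the wire format implied by the `@JsonProperty` annotations on `VectorizeBatchRequest`/`VectorizeBatchResponse` can be sketched as follows. This is an illustration only: field names come from the annotations above, but all values (model name, ids, vector contents) are made up.

```jsonc
// POST {baseUrl}/vectorize-batch — request body (values illustrative)
{
  "model": "example-model",
  "truncate_text": false,
  "truncate_length": 512,
  "chunk_size": 20,
  "items": [
    { "id": "a1", "text": "first document text" },
    { "id": "a2", "text": "second document text" }
  ]
}

// Matching response; the provider correlates each result back by "id",
// which is why the route generates a UUID per text before sending.
{
  "model": "example-model",
  "count": 2,
  "results": [
    { "id": "a1", "vector": [0.12, -0.03], "token_count": 17, "truncated": false },
    { "id": "a2", "vector": [0.07, 0.21], "token_count": 23, "truncated": false }
  ]
}
```

Note that `extractVector` prefers `combined_vector` over `vector` when both are present, and rejects any vector whose length disagrees with the model's declared dimensions.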
@@ -1,131 +0,0 @@
package at.procon.dip.embedding.service;

import at.procon.dip.domain.document.entity.Document;
import at.procon.dip.embedding.config.EmbeddingPolicyCondition;
import at.procon.dip.embedding.config.EmbeddingPolicyProperties;
import at.procon.dip.embedding.config.EmbeddingPolicyRule;
import at.procon.dip.embedding.config.EmbeddingPolicyUse;
import at.procon.dip.embedding.policy.EmbeddingPolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor;
import java.util.Map;
import java.util.Objects;
import java.util.regex.Pattern;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class DefaultEmbeddingPolicyResolver implements EmbeddingPolicyResolver {

    private final EmbeddingPolicyProperties properties;

    @Override
    public EmbeddingPolicy resolve(Document document, SourceDescriptor sourceDescriptor) {
        String overridePolicy = attributeValue(sourceDescriptor, "embeddingPolicyKey");
        if (overridePolicy != null) {
            return policyByKey(overridePolicy);
        }

        String policyHint = policyHint(sourceDescriptor);
        if (policyHint != null) {
            return policyByKey(policyHint);
        }

        for (EmbeddingPolicyRule rule : properties.getRules()) {
            if (matches(rule.getWhen(), document, sourceDescriptor)) {
                return toPolicy(rule.getUse());
            }
        }

        return toPolicy(properties.getDefaultPolicy());
    }

    private EmbeddingPolicy policyByKey(String policyKey) {
        for (EmbeddingPolicyRule rule : properties.getRules()) {
            if (rule.getUse() != null && policyKey.equals(rule.getUse().getPolicyKey())) {
                return toPolicy(rule.getUse());
            }
        }
        EmbeddingPolicyUse def = properties.getDefaultPolicy();
        if (def != null && policyKey.equals(def.getPolicyKey())) {
            return toPolicy(def);
        }
        throw new IllegalArgumentException("Unknown embedding policy key: " + policyKey);
    }

    private EmbeddingPolicy toPolicy(EmbeddingPolicyUse use) {
        if (use == null) {
            throw new IllegalStateException("Embedding policy configuration is missing");
        }
        return new EmbeddingPolicy(
                use.getPolicyKey(),
                use.getModelKey(),
                use.getQueryModelKey(),
                use.getProfileKey(),
                use.isEnabled()
        );
    }

    private boolean matches(EmbeddingPolicyCondition c, Document document, SourceDescriptor sourceDescriptor) {
        if (c == null) {
            return true;
        }

        if (!matchesExact(c.getDocumentType(), enumName(document != null ? document.getDocumentType() : null))) {
            return false;
        }
        if (!matchesExact(c.getDocumentFamily(), enumName(document != null ? document.getDocumentFamily() : null))) {
            return false;
        }
        if (!matchesExact(c.getSourceType(), enumName(sourceDescriptor != null ? sourceDescriptor.sourceType() : null))) {
            return false;
        }
        if (!matchesMime(c.getMimeType(), sourceDescriptor != null ? sourceDescriptor.mediaType() : null)) {
            return false;
        }
        if (!matchesExact(c.getLanguage(), document != null ? document.getLanguageCode() : null)) {
            return false;
        }
        if (!matchesExact(c.getOwnerTenantKey(), document != null && document.getOwnerTenant() != null ? document.getOwnerTenant().getTenantKey() : null)) {
            return false;
        }
        return matchesExact(c.getEmbeddingPolicyHint(), policyHint(sourceDescriptor));
    }

    private boolean matchesExact(String expected, String actual) {
        if (expected == null || expected.isBlank()) {
            return true;
        }
        return Objects.equals(expected, actual);
    }

    private boolean matchesMime(String pattern, String actual) {
        if (pattern == null || pattern.isBlank()) {
            return true;
        }
        if (actual == null || actual.isBlank()) {
            return false;
        }
        return Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(actual).matches();
    }

    private String enumName(Enum<?> value) {
        return value != null ? value.name() : null;
    }

    private String policyHint(SourceDescriptor sourceDescriptor) {
        return attributeValue(sourceDescriptor, "embeddingPolicyHint");
    }

    private String attributeValue(SourceDescriptor sourceDescriptor, String key) {
        if (sourceDescriptor == null) {
            return null;
        }
        Map<String, String> attributes = sourceDescriptor.attributes();
        if (attributes == null) {
            return null;
        }
        String value = attributes.get(key);
        return (value == null || value.isBlank()) ? null : value;
    }
}
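The resolver above reads its rules from `EmbeddingPolicyProperties` (first-match wins, with `embeddingPolicyKey` / `embeddingPolicyHint` source attributes taking precedence, and `default-policy` as the fallback). A configuration sketch, assuming relaxed kebab-case binding of the `when`/`use` property classes — the property prefix and all keys/values here are illustrative placeholders, not taken from the repository's actual `application-new-example-embedding-policy.yml`:

```yaml
# Hypothetical prefix; the real one is declared on EmbeddingPolicyProperties.
dip:
  embedding:
    policy:
      default-policy:
        policy-key: default
        model-key: example-doc-model
        query-model-key: example-query-model
        profile-key: default-profile
        enabled: true
      rules:
        - when:
            source-type: TED_PACKAGE          # matched against the enum name
            mime-type: "application/.*xml"    # case-insensitive regex (matchesMime)
            language: de                      # exact match (matchesExact)
          use:
            policy-key: ted-xml
            model-key: example-doc-model
            profile-key: ted-profile
            enabled: true
```

Blank or absent condition fields match everything, so a rule with an empty `when` acts as a catch-all ahead of `default-policy`.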
@@ -1,31 +0,0 @@
package at.procon.dip.embedding.service;

import at.procon.dip.embedding.config.EmbeddingProfileProperties;
import at.procon.dip.embedding.policy.EmbeddingProfile;
import java.util.List;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class DefaultEmbeddingProfileResolver implements EmbeddingProfileResolver {

    private final EmbeddingProfileProperties properties;

    @Override
    public EmbeddingProfile resolve(String profileKey) {
        if (profileKey == null || profileKey.isBlank()) {
            throw new IllegalArgumentException("Embedding profile key must not be blank");
        }

        EmbeddingProfileProperties.ProfileDefinition definition = properties.getDefinitions().get(profileKey);
        if (definition == null) {
            throw new IllegalArgumentException("Unknown embedding profile: " + profileKey);
        }

        return new EmbeddingProfile(
                profileKey,
                List.copyOf(definition.getEmbedRepresentationTypes())
        );
    }
}
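Profiles are looked up by key from `EmbeddingProfileProperties.getDefinitions()`, each carrying the list of `RepresentationType`s eligible for embedding. A minimal sketch, again with a hypothetical prefix and placeholder enum values (the real `RepresentationType` constants live in `at.procon.dip.domain.document.RepresentationType`):

```yaml
# Hypothetical prefix; the real one is declared on EmbeddingProfileProperties.
dip:
  embedding:
    profiles:
      definitions:
        default-profile:
          embed-representation-types:   # placeholder RepresentationType names
            - PLAIN_TEXT
        ted-profile:
          embed-representation-types:
            - PLAIN_TEXT
            - SUMMARY
```

`EmbeddingProfile.includes(...)` then answers, per representation, whether it should be embedded under the resolved profile.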
@@ -1,9 +0,0 @@
package at.procon.dip.embedding.service;

import at.procon.dip.domain.document.entity.Document;
import at.procon.dip.embedding.policy.EmbeddingPolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor;

public interface EmbeddingPolicyResolver {
    EmbeddingPolicy resolve(Document document, SourceDescriptor sourceDescriptor);
}
@@ -1,7 +0,0 @@
package at.procon.dip.embedding.service;

import at.procon.dip.embedding.policy.EmbeddingProfile;

public interface EmbeddingProfileResolver {
    EmbeddingProfile resolve(String profileKey);
}
@@ -1,328 +0,0 @@
package at.procon.dip.ingestion.camel;

import at.procon.dip.domain.document.SourceType;
import at.procon.dip.ingestion.config.TedPackageDownloadProperties;
import at.procon.dip.ingestion.service.DocumentIngestionGateway;
import at.procon.dip.ingestion.spi.IngestionResult;
import at.procon.dip.ingestion.spi.OriginalContentStoragePolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import at.procon.ted.model.entity.TedDailyPackage;
import at.procon.ted.repository.TedDailyPackageRepository;
import at.procon.dip.domain.ted.service.TedPackageSequenceService;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.time.OffsetDateTime;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.apache.camel.Exchange;
import org.apache.camel.LoggingLevel;
import org.apache.camel.builder.RouteBuilder;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.stereotype.Component;

/**
 * NEW-runtime TED daily package download route.
 * <p>
 * Reuses the proven package sequencing rules through {@link TedPackageSequenceService},
 * but hands off processing only to the NEW ingestion gateway. No legacy XML batch persistence,
 * no legacy vectorization route, no old semantic path.
 */
@Component
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@ConditionalOnProperty(name = "dip.ingestion.ted-download.enabled", havingValue = "true")
@RequiredArgsConstructor
@Slf4j
public class TedPackageDownloadRoute extends RouteBuilder {

    private static final String ROUTE_ID_SCHEDULER = "ted-package-new-scheduler";
    private static final String ROUTE_ID_DOWNLOADER = "ted-package-new-downloader";
    private static final String ROUTE_ID_ERROR = "ted-package-new-error-handler";

    private final TedPackageDownloadProperties properties;
    private final TedDailyPackageRepository packageRepository;
    private final TedPackageSequenceService sequenceService;
    private final DocumentIngestionGateway documentIngestionGateway;

    @Override
    public void configure() {
        errorHandler(deadLetterChannel("direct:ted-package-new-error")
                .maximumRedeliveries(3)
                .redeliveryDelay(10_000)
                .retryAttemptedLogLevel(LoggingLevel.WARN)
                .logStackTrace(true));

        from("direct:ted-package-new-error")
                .routeId(ROUTE_ID_ERROR)
                .process(this::handleError);

        from("timer:ted-package-new-scheduler?period={{dip.ingestion.ted-download.poll-interval:3600000}}&delay=0")
                .routeId(ROUTE_ID_SCHEDULER)
                .process(this::checkRunningPackages)
                .choice()
                    .when(header("tooManyRunning").isEqualTo(true))
                        .log(LoggingLevel.INFO, "Skipping NEW TED package download - already ${header.runningCount} packages in progress")
                    .otherwise()
                        .process(this::determineNextPackage)
                        .choice()
                            .when(header("packageId").isNotNull())
                                .to("direct:download-ted-package-new")
                            .otherwise()
                                .log(LoggingLevel.INFO, "No NEW TED package to download right now")
                        .end()
                .end();

        from("direct:download-ted-package-new")
                .routeId(ROUTE_ID_DOWNLOADER)
                .log(LoggingLevel.INFO, "NEW TED package download started: ${header.packageId}")
                .setHeader("downloadStartTime", constant(System.currentTimeMillis()))
                .process(this::createPackageRecord)
                .delay(simple("{{dip.ingestion.ted-download.delay-between-downloads:5000}}"))
                .setHeader(Exchange.HTTP_METHOD, constant("GET"))
                .setHeader("CamelHttpConnectionClose", constant(true))
                .toD("${header.downloadUrl}?bridgeEndpoint=true&throwExceptionOnFailure=false&socketTimeout={{dip.ingestion.ted-download.download-timeout:300000}}")
                .choice()
                    .when(header(Exchange.HTTP_RESPONSE_CODE).isEqualTo(200))
                        .process(this::calculateHash)
                        .process(this::checkDuplicateByHash)
                        .choice()
                            .when(header("isDuplicate").isEqualTo(true))
                                .process(this::markDuplicate)
                            .otherwise()
                                .process(this::saveDownloadedPackage)
                                .process(this::ingestThroughGateway)
                                .process(this::markCompleted)
                        .endChoice()
                    .when(header(Exchange.HTTP_RESPONSE_CODE).isEqualTo(404))
                        .process(this::markNotFound)
                    .otherwise()
                        .process(this::markFailed)
                .end();
    }

    private void checkRunningPackages(Exchange exchange) {
        long downloadingCount = packageRepository.findByDownloadStatus(TedDailyPackage.DownloadStatus.DOWNLOADING).size();
        long processingCount = packageRepository.findByDownloadStatus(TedDailyPackage.DownloadStatus.PROCESSING).size();
        long runningCount = downloadingCount + processingCount;

        exchange.getIn().setHeader("runningCount", runningCount);
        exchange.getIn().setHeader("tooManyRunning", runningCount >= properties.getMaxRunningPackages());

        if (runningCount > 0) {
            log.info("Currently {} TED packages in progress in NEW runtime ({} downloading, {} processing)",
                    runningCount, downloadingCount, processingCount);
        }
    }

    private void determineNextPackage(Exchange exchange) {
        List<TedDailyPackage> pendingPackages = packageRepository.findByDownloadStatus(TedDailyPackage.DownloadStatus.PENDING);

        if (!pendingPackages.isEmpty()) {
            TedDailyPackage pkg = pendingPackages.get(0);
            log.info("Retrying PENDING TED package in NEW runtime: {}", pkg.getPackageIdentifier());
            setPackageHeaders(exchange, pkg.getYear(), pkg.getSerialNumber());
            return;
        }

        TedPackageSequenceService.PackageInfo packageInfo = sequenceService.getNextPackageToDownload();
        if (packageInfo == null) {
            exchange.getIn().setHeader("packageId", null);
            return;
        }

        setPackageHeaders(exchange, packageInfo.year(), packageInfo.serialNumber());
    }

    private void setPackageHeaders(Exchange exchange, int year, int serialNumber) {
        String packageId = "%04d%05d".formatted(year, serialNumber);
        String downloadUrl = properties.getBaseUrl() + packageId;

        exchange.getIn().setHeader("packageId", packageId);
        exchange.getIn().setHeader("year", year);
        exchange.getIn().setHeader("serialNumber", serialNumber);
        exchange.getIn().setHeader("downloadUrl", downloadUrl);
    }

    private void createPackageRecord(Exchange exchange) {
        String packageId = exchange.getIn().getHeader("packageId", String.class);
        Integer year = exchange.getIn().getHeader("year", Integer.class);
        Integer serialNumber = exchange.getIn().getHeader("serialNumber", Integer.class);
        String downloadUrl = exchange.getIn().getHeader("downloadUrl", String.class);

        if (packageRepository.existsByPackageIdentifier(packageId)) {
            return;
        }

        TedDailyPackage pkg = TedDailyPackage.builder()
                .packageIdentifier(packageId)
                .year(year)
                .serialNumber(serialNumber)
                .downloadUrl(downloadUrl)
                .downloadStatus(TedDailyPackage.DownloadStatus.DOWNLOADING)
                .build();

        packageRepository.save(pkg);
    }

    private void calculateHash(Exchange exchange) throws Exception {
        byte[] body = exchange.getIn().getBody(byte[].class);
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hashBytes = digest.digest(body);

        StringBuilder sb = new StringBuilder();
        for (byte b : hashBytes) {
            sb.append(String.format("%02x", b));
        }

        exchange.getIn().setHeader("fileHash", sb.toString());
    }

    private void checkDuplicateByHash(Exchange exchange) {
        String hash = exchange.getIn().getHeader("fileHash", String.class);

        Optional<TedDailyPackage> duplicate = packageRepository.findAll().stream()
                .filter(p -> hash.equals(p.getFileHash()))
                .findFirst();

        exchange.getIn().setHeader("isDuplicate", duplicate.isPresent());
        duplicate.ifPresent(pkg -> exchange.getIn().setHeader("duplicateOf", pkg.getPackageIdentifier()));
    }

    private void saveDownloadedPackage(Exchange exchange) throws Exception {
        String packageId = exchange.getIn().getHeader("packageId", String.class);
        String hash = exchange.getIn().getHeader("fileHash", String.class);
        byte[] body = exchange.getIn().getBody(byte[].class);

        Path downloadDir = Paths.get(properties.getDownloadDirectory());
        Files.createDirectories(downloadDir);
        Path downloadPath = downloadDir.resolve(packageId + ".tar.gz");
        Files.write(downloadPath, body);

        long downloadDuration = System.currentTimeMillis() -
                exchange.getIn().getHeader("downloadStartTime", Long.class);

        packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
            pkg.setFileHash(hash);
            pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.DOWNLOADED);
            pkg.setDownloadedAt(OffsetDateTime.now());
            pkg.setDownloadDurationMs(downloadDuration);
            packageRepository.save(pkg);
        });

        exchange.getIn().setHeader("downloadPath", downloadPath.toString());
    }

    private void ingestThroughGateway(Exchange exchange) {
        String packageId = exchange.getIn().getHeader("packageId", String.class);
        String downloadPath = exchange.getIn().getHeader("downloadPath", String.class);

        packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
            pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.PROCESSING);
            packageRepository.save(pkg);
        });

        IngestionResult ingestionResult = documentIngestionGateway.ingest(new SourceDescriptor(
                null,
                SourceType.TED_PACKAGE,
                packageId,
                downloadPath,
                packageId + ".tar.gz",
                "application/gzip",
                null,
                null,
                OffsetDateTime.now(),
                OriginalContentStoragePolicy.DEFAULT,
                Map.of(
                        "packageId", packageId,
                        "title", packageId + ".tar.gz"
                )
        ));

        int importedChildCount = Math.max(0, ingestionResult.documents().size() - 1);
        exchange.getIn().setHeader("gatewayImportedChildCount", importedChildCount);
        exchange.getIn().setHeader("gatewayImportWarnings", ingestionResult.warnings().size());

        packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
            pkg.setXmlFileCount(importedChildCount);
            pkg.setProcessedCount(importedChildCount);
            pkg.setFailedCount(0);
            packageRepository.save(pkg);
        });
    }

    private void markCompleted(Exchange exchange) throws Exception {
        String packageId = exchange.getIn().getHeader("packageId", String.class);
        String downloadPath = exchange.getIn().getHeader("downloadPath", String.class);

        packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
            pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.COMPLETED);
            pkg.setProcessedAt(OffsetDateTime.now());
            if (pkg.getDownloadedAt() != null) {
                long processingDuration = Math.max(0L,
                        java.time.Duration.between(pkg.getDownloadedAt(), OffsetDateTime.now()).toMillis());
                pkg.setProcessingDurationMs(processingDuration);
            }
            packageRepository.save(pkg);
        });

        if (properties.isDeleteAfterIngestion() && downloadPath != null) {
            Files.deleteIfExists(Path.of(downloadPath));
        }

        packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
            long totalDuration = (pkg.getDownloadDurationMs() != null ? pkg.getDownloadDurationMs() : 0L)
                    + (pkg.getProcessingDurationMs() != null ? pkg.getProcessingDurationMs() : 0L);
            log.info("NEW TED package {} completed: xmlCount={}, processed={}, failed={}, totalDuration={}ms",
                    packageId, pkg.getXmlFileCount(), pkg.getProcessedCount(), pkg.getFailedCount(), totalDuration);
|
||||
});
|
||||
}
|
||||
|
||||
private void markNotFound(Exchange exchange) {
|
||||
String packageId = exchange.getIn().getHeader("packageId", String.class);
|
||||
packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
|
||||
pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.NOT_FOUND);
|
||||
pkg.setErrorMessage("Package not found (404)");
|
||||
packageRepository.save(pkg);
|
||||
});
|
||||
}
|
||||
|
||||
private void markFailed(Exchange exchange) {
|
||||
String packageId = exchange.getIn().getHeader("packageId", String.class);
|
||||
Integer httpCode = exchange.getIn().getHeader(Exchange.HTTP_RESPONSE_CODE, Integer.class);
|
||||
packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
|
||||
pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.FAILED);
|
||||
pkg.setErrorMessage("HTTP " + httpCode);
|
||||
packageRepository.save(pkg);
|
||||
});
|
||||
}
|
||||
|
||||
private void markDuplicate(Exchange exchange) {
|
||||
String packageId = exchange.getIn().getHeader("packageId", String.class);
|
||||
String duplicateOf = exchange.getIn().getHeader("duplicateOf", String.class);
|
||||
packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
|
||||
pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.COMPLETED);
|
||||
pkg.setErrorMessage("Duplicate of " + duplicateOf);
|
||||
pkg.setProcessedAt(OffsetDateTime.now());
|
||||
packageRepository.save(pkg);
|
||||
});
|
||||
}
|
||||
|
||||
private void handleError(Exchange exchange) {
|
||||
Exception exception = exchange.getProperty(Exchange.EXCEPTION_CAUGHT, Exception.class);
|
||||
String packageId = exchange.getIn().getHeader("packageId", String.class);
|
||||
|
||||
if (packageId != null) {
|
||||
packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
|
||||
pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.FAILED);
|
||||
pkg.setErrorMessage(exception != null ? exception.getMessage() : "Unknown route error");
|
||||
packageRepository.save(pkg);
|
||||
});
|
||||
}
|
||||
}
|
||||
}
|
||||
@ -1,61 +0,0 @@
package at.procon.dip.ingestion.config;

import at.procon.dip.domain.access.DocumentVisibility;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.ingestion")
@Data
public class DipIngestionProperties {

    private boolean enabled = false;
    private boolean fileSystemEnabled = false;
    private boolean restUploadEnabled = true;
    private String inputDirectory = "/ted.europe/generic-input";
    private String filePattern = ".*\\.(pdf|txt|html|htm|xml|md|markdown|csv|json|yaml|yml)$";
    private String processedDirectory = ".dip-processed";
    private String errorDirectory = ".dip-error";

    @Positive
    private long pollInterval = 15000;

    @Positive
    private int maxMessagesPerPoll = 10;

    private String defaultOwnerTenantKey;
    private DocumentVisibility defaultVisibility = DocumentVisibility.PUBLIC;
    private String defaultLanguageCode;

    private boolean storeOriginalBinaryInDb = true;

    @Positive
    private int maxBinaryBytesInDb = 5242880;

    private boolean deduplicateByContentHash = true;
    private boolean storeOriginalContentForWrapperDocuments = true;
    private boolean vectorizePrimaryRepresentationOnly = true;

    @NotBlank
    private String importBatchId = "phase4-generic";

    private boolean tedPackageAdapterEnabled = true;
    private boolean mailAdapterEnabled = false;

    private String mailDefaultOwnerTenantKey;
    private DocumentVisibility mailDefaultVisibility = DocumentVisibility.TENANT;
    private boolean expandMailZipAttachments = true;

    @NotBlank
    private String tedPackageImportBatchId = "phase41-ted-package";

    private boolean gatewayOnlyForTedPackages = false;

    @NotBlank
    private String mailImportBatchId = "phase41-mail";
}
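As a quick reference, the `DipIngestionProperties` fields above bind to `dip.ingestion.*` keys through Spring Boot relaxed binding (camelCase fields map to kebab-case keys). A minimal sketch for `application-new.yml`; the values shown are the class defaults from above, so this block is illustrative rather than required:

```yaml
dip:
  ingestion:
    enabled: false
    file-system-enabled: false
    rest-upload-enabled: true
    input-directory: /ted.europe/generic-input
    poll-interval: 15000
    max-messages-per-poll: 10
    default-visibility: PUBLIC
    store-original-binary-in-db: true
    max-binary-bytes-in-db: 5242880
    import-batch-id: phase4-generic
```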
@ -1,52 +0,0 @@
package at.procon.dip.ingestion.config;

import jakarta.validation.constraints.Min;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

/**
 * NEW-runtime TED package download configuration.
 * <p>
 * This is intentionally separate from the legacy {@code ted.download.*} tree.
 */
@Configuration
@ConfigurationProperties(prefix = "dip.ingestion.ted-download")
@Data
public class TedPackageDownloadProperties {

    private boolean enabled = false;

    @NotBlank
    private String baseUrl = "https://ted.europa.eu/packages/daily/";

    @NotBlank
    private String downloadDirectory = "/ted.europe/downloads-new";

    @Positive
    private int startYear = 2015;

    @Positive
    private long pollInterval = 3_600_000L;

    @Positive
    private long notFoundRetryInterval = 21_600_000L;

    @Min(0)
    private int previousYearGracePeriodDays = 30;

    private boolean retryCurrentYearNotFoundIndefinitely = true;

    @Positive
    private long downloadTimeout = 300_000L;

    @Positive
    private int maxRunningPackages = 2;

    @Positive
    private long delayBetweenDownloads = 5_000L;

    private boolean deleteAfterIngestion = true;
}
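`TedPackageDownloadProperties` binds the same way under `dip.ingestion.ted-download.*`. A sketch using the class defaults (durations are plain milliseconds once bound; all values illustrative):

```yaml
dip:
  ingestion:
    ted-download:
      enabled: false
      base-url: https://ted.europa.eu/packages/daily/
      download-directory: /ted.europe/downloads-new
      start-year: 2015
      poll-interval: 3600000
      not-found-retry-interval: 21600000
      previous-year-grace-period-days: 30
      download-timeout: 300000
      max-running-packages: 2
      delay-between-downloads: 5000
      delete-after-ingestion: true
```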
@ -1,47 +0,0 @@
package at.procon.dip.migration.audit.config;

import jakarta.validation.constraints.Min;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.migration.legacy-audit")
@Data
public class LegacyTedAuditProperties {

    /**
     * Enables the Wave 1 / Milestone A legacy TED audit subsystem.
     */
    private boolean enabled = true;

    /**
     * Automatically runs the read-only audit on application startup.
     */
    private boolean startupRunEnabled = false;

    /**
     * Maximum number of legacy TED documents to scan during startup.
     * 0 means no limit.
     */
    @Min(0)
    private int startupRunLimit = 500;

    /**
     * Batch size for legacy TED document paging.
     */
    @Min(1)
    private int pageSize = 100;

    /**
     * Hard cap for persisted findings in a single run to avoid runaway audit volume.
     */
    @Min(1)
    private int maxFindingsPerRun = 10000;

    /**
     * Maximum number of duplicate/grouped samples recorded for global aggregate checks.
     */
    @Min(1)
    private int maxDuplicateSamples = 100;
}
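`LegacyTedAuditProperties` exposes the audit knobs under `dip.migration.legacy-audit.*`. A sketch with the class defaults, again purely illustrative:

```yaml
dip:
  migration:
    legacy-audit:
      enabled: true
      startup-run-enabled: false
      startup-run-limit: 500
      page-size: 100
      max-findings-per-run: 10000
      max-duplicate-samples: 100
```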
@ -1,87 +0,0 @@
package at.procon.dip.migration.audit.entity;

import at.procon.dip.architecture.SchemaNames;
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.EnumType;
import jakarta.persistence.Enumerated;
import jakarta.persistence.FetchType;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Index;
import jakarta.persistence.JoinColumn;
import jakarta.persistence.ManyToOne;
import jakarta.persistence.PrePersist;
import jakarta.persistence.Table;
import java.time.OffsetDateTime;
import java.util.UUID;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;

@Entity
@Table(schema = SchemaNames.DOC, name = "doc_legacy_audit_finding", indexes = {
        @Index(name = "idx_doc_legacy_audit_find_run", columnList = "run_id"),
        @Index(name = "idx_doc_legacy_audit_find_type", columnList = "finding_type"),
        @Index(name = "idx_doc_legacy_audit_find_severity", columnList = "severity"),
        @Index(name = "idx_doc_legacy_audit_find_legacy_doc", columnList = "legacy_procurement_document_id"),
        @Index(name = "idx_doc_legacy_audit_find_document", columnList = "document_id")
})
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class LegacyTedAuditFinding {

    @Id
    @GeneratedValue(strategy = GenerationType.UUID)
    private UUID id;

    @ManyToOne(fetch = FetchType.LAZY, optional = false)
    @JoinColumn(name = "run_id", nullable = false)
    private LegacyTedAuditRun run;

    @Enumerated(EnumType.STRING)
    @Column(name = "severity", nullable = false, length = 16)
    private LegacyTedAuditSeverity severity;

    @Enumerated(EnumType.STRING)
    @Column(name = "finding_type", nullable = false, length = 64)
    private LegacyTedAuditFindingType findingType;

    @Column(name = "package_identifier", length = 20)
    private String packageIdentifier;

    @Column(name = "legacy_procurement_document_id")
    private UUID legacyProcurementDocumentId;

    @Column(name = "document_id")
    private UUID documentId;

    @Column(name = "ted_notice_projection_id")
    private UUID tedNoticeProjectionId;

    @Column(name = "reference_key", length = 255)
    private String referenceKey;

    @Column(name = "message", nullable = false, columnDefinition = "TEXT")
    private String message;

    @Column(name = "details_text", columnDefinition = "TEXT")
    private String detailsText;

    @Builder.Default
    @Column(name = "created_at", nullable = false, updatable = false)
    private OffsetDateTime createdAt = OffsetDateTime.now();

    @PrePersist
    protected void onCreate() {
        if (createdAt == null) {
            createdAt = OffsetDateTime.now();
        }
    }
}
@ -1,28 +0,0 @@
package at.procon.dip.migration.audit.entity;

public enum LegacyTedAuditFindingType {
    PACKAGE_SEQUENCE_GAP,
    PACKAGE_INCOMPLETE,
    PACKAGE_COMPLETED_WITHOUT_PROCESSED_AT,
    PACKAGE_COMPLETED_COUNT_MISMATCH,
    PACKAGE_MISSING_XML_FILE_COUNT,
    PACKAGE_MISSING_FILE_HASH,
    PACKAGE_FAILED_WITHOUT_ERROR_MESSAGE,
    LEGACY_PUBLICATION_ID_DUPLICATE,
    DOC_DEDUP_HASH_DUPLICATE,
    LEGACY_DOCUMENT_MISSING_HASH,
    LEGACY_DOCUMENT_MISSING_XML,
    LEGACY_DOCUMENT_MISSING_TEXT,
    LEGACY_DOCUMENT_MISSING_PUBLICATION_ID,
    DOC_DOCUMENT_MISSING,
    DOC_DOCUMENT_DUPLICATE,
    DOC_SOURCE_MISSING,
    DOC_ORIGINAL_CONTENT_MISSING,
    DOC_ORIGINAL_CONTENT_DUPLICATE,
    DOC_PRIMARY_REPRESENTATION_MISSING,
    DOC_PRIMARY_REPRESENTATION_DUPLICATE,
    TED_PROJECTION_MISSING,
    TED_PROJECTION_MISSING_LEGACY_LINK,
    TED_PROJECTION_DOCUMENT_MISMATCH,
    FINDINGS_TRUNCATED
}
@ -1,110 +0,0 @@
package at.procon.dip.migration.audit.entity;

import at.procon.dip.architecture.SchemaNames;
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.EnumType;
import jakarta.persistence.Enumerated;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Index;
import jakarta.persistence.PrePersist;
import jakarta.persistence.PreUpdate;
import jakarta.persistence.Table;
import java.time.OffsetDateTime;
import java.util.UUID;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;

@Entity
@Table(schema = SchemaNames.DOC, name = "doc_legacy_audit_run", indexes = {
        @Index(name = "idx_doc_legacy_audit_run_status", columnList = "status"),
        @Index(name = "idx_doc_legacy_audit_run_started", columnList = "started_at")
})
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class LegacyTedAuditRun {

    @Id
    @GeneratedValue(strategy = GenerationType.UUID)
    private UUID id;

    @Enumerated(EnumType.STRING)
    @Column(name = "status", nullable = false, length = 32)
    private LegacyTedAuditRunStatus status;

    @Column(name = "requested_limit")
    private Integer requestedLimit;

    @Column(name = "page_size", nullable = false)
    private Integer pageSize;

    @Column(name = "scanned_packages", nullable = false)
    @Builder.Default
    private Integer scannedPackages = 0;

    @Column(name = "scanned_legacy_documents", nullable = false)
    @Builder.Default
    private Integer scannedLegacyDocuments = 0;

    @Column(name = "finding_count", nullable = false)
    @Builder.Default
    private Integer findingCount = 0;

    @Column(name = "info_count", nullable = false)
    @Builder.Default
    private Integer infoCount = 0;

    @Column(name = "warning_count", nullable = false)
    @Builder.Default
    private Integer warningCount = 0;

    @Column(name = "error_count", nullable = false)
    @Builder.Default
    private Integer errorCount = 0;

    @Column(name = "started_at", nullable = false)
    private OffsetDateTime startedAt;

    @Column(name = "completed_at")
    private OffsetDateTime completedAt;

    @Column(name = "summary_text", columnDefinition = "TEXT")
    private String summaryText;

    @Column(name = "failure_message", columnDefinition = "TEXT")
    private String failureMessage;

    @Builder.Default
    @Column(name = "created_at", nullable = false, updatable = false)
    private OffsetDateTime createdAt = OffsetDateTime.now();

    @Builder.Default
    @Column(name = "updated_at", nullable = false)
    private OffsetDateTime updatedAt = OffsetDateTime.now();

    @PrePersist
    protected void onCreate() {
        if (startedAt == null) {
            startedAt = OffsetDateTime.now();
        }
        if (createdAt == null) {
            createdAt = OffsetDateTime.now();
        }
        if (updatedAt == null) {
            updatedAt = OffsetDateTime.now();
        }
    }

    @PreUpdate
    protected void onUpdate() {
        updatedAt = OffsetDateTime.now();
    }
}
@ -1,7 +0,0 @@
package at.procon.dip.migration.audit.entity;

public enum LegacyTedAuditRunStatus {
    RUNNING,
    COMPLETED,
    FAILED
}
@ -1,7 +0,0 @@
package at.procon.dip.migration.audit.entity;

public enum LegacyTedAuditSeverity {
    INFO,
    WARNING,
    ERROR
}
@ -1,8 +0,0 @@
package at.procon.dip.migration.audit.repository;

import at.procon.dip.migration.audit.entity.LegacyTedAuditFinding;
import java.util.UUID;
import org.springframework.data.jpa.repository.JpaRepository;

public interface LegacyTedAuditFindingRepository extends JpaRepository<LegacyTedAuditFinding, UUID> {
}
@ -1,8 +0,0 @@
package at.procon.dip.migration.audit.repository;

import at.procon.dip.migration.audit.entity.LegacyTedAuditRun;
import java.util.UUID;
import org.springframework.data.jpa.repository.JpaRepository;

public interface LegacyTedAuditRunRepository extends JpaRepository<LegacyTedAuditRun, UUID> {
}
|
||||
@ -1,610 +0,0 @@
|
||||
package at.procon.dip.migration.audit.service;
|
||||
|
||||
import at.procon.dip.migration.audit.config.LegacyTedAuditProperties;
|
||||
import at.procon.dip.migration.audit.entity.LegacyTedAuditFinding;
|
||||
import at.procon.dip.migration.audit.entity.LegacyTedAuditFindingType;
|
||||
import at.procon.dip.migration.audit.entity.LegacyTedAuditRun;
|
||||
import at.procon.dip.migration.audit.entity.LegacyTedAuditRunStatus;
|
||||
import at.procon.dip.migration.audit.entity.LegacyTedAuditSeverity;
|
||||
import at.procon.dip.migration.audit.repository.LegacyTedAuditFindingRepository;
|
||||
import at.procon.dip.migration.audit.repository.LegacyTedAuditRunRepository;
|
||||
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
|
||||
import at.procon.dip.runtime.config.RuntimeMode;
|
||||
import at.procon.ted.model.entity.ProcurementDocument;
|
||||
import at.procon.ted.model.entity.TedDailyPackage;
|
||||
import at.procon.ted.repository.ProcurementDocumentRepository;
|
||||
import at.procon.ted.repository.TedDailyPackageRepository;
|
||||
import java.time.OffsetDateTime;
|
||||
import java.time.Year;
|
||||
import java.util.ArrayList;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
import java.util.TreeMap;
|
||||
import java.util.UUID;
|
||||
import lombok.RequiredArgsConstructor;
|
||||
import lombok.extern.slf4j.Slf4j;
|
||||
import org.springframework.data.domain.Page;
|
||||
import org.springframework.data.domain.PageRequest;
|
||||
import org.springframework.data.domain.Sort;
|
||||
import org.springframework.jdbc.core.JdbcTemplate;
|
||||
import org.springframework.stereotype.Service;
|
||||
import org.springframework.util.StringUtils;
|
||||
|
||||
@Service
|
||||
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
|
||||
@RequiredArgsConstructor
|
||||
@Slf4j
|
||||
public class LegacyTedAuditService {
|
||||
|
||||
private final LegacyTedAuditProperties properties;
|
||||
private final TedDailyPackageRepository tedDailyPackageRepository;
|
||||
private final ProcurementDocumentRepository procurementDocumentRepository;
|
||||
private final LegacyTedAuditRunRepository runRepository;
|
||||
private final LegacyTedAuditFindingRepository findingRepository;
|
||||
private final JdbcTemplate jdbcTemplate;
|
||||
|
||||
public LegacyTedAuditRun executeAudit() {
|
||||
return executeAudit(properties.getStartupRunLimit());
|
||||
}
|
||||
|
||||
public LegacyTedAuditRun executeAudit(int requestedLimit) {
|
||||
if (!properties.isEnabled()) {
|
||||
throw new IllegalStateException("Legacy TED audit is disabled by configuration");
|
||||
}
|
||||
|
||||
Integer effectiveLimit = requestedLimit > 0 ? requestedLimit : null;
|
||||
int pageSize = properties.getPageSize();
|
||||
AuditAccumulator accumulator = new AuditAccumulator();
|
||||
|
||||
LegacyTedAuditRun run = LegacyTedAuditRun.builder()
|
||||
.status(LegacyTedAuditRunStatus.RUNNING)
|
||||
.requestedLimit(effectiveLimit)
|
||||
.pageSize(pageSize)
|
||||
.startedAt(OffsetDateTime.now())
|
||||
.build();
|
||||
run = runRepository.save(run);
|
||||
|
||||
try {
|
||||
int scannedPackages = auditPackages(run, accumulator);
|
||||
auditGlobalDuplicates(run, accumulator);
|
||||
int scannedLegacyDocuments = 0;//auditLegacyDocuments(run, accumulator, effectiveLimit, pageSize);
|
||||
|
||||
run.setStatus(LegacyTedAuditRunStatus.COMPLETED);
|
||||
run.setCompletedAt(OffsetDateTime.now());
|
||||
run.setScannedPackages(scannedPackages);
|
||||
run.setScannedLegacyDocuments(scannedLegacyDocuments);
|
||||
run.setFindingCount(accumulator.totalFindings());
|
||||
run.setInfoCount(accumulator.infoCount());
|
||||
run.setWarningCount(accumulator.warningCount());
|
||||
run.setErrorCount(accumulator.errorCount());
|
||||
run.setSummaryText(buildSummary(scannedPackages, scannedLegacyDocuments, accumulator));
|
||||
run.setFailureMessage(null);
|
||||
run = runRepository.save(run);
|
||||
|
||||
log.info("Wave 1 / Milestone A legacy-only audit completed: runId={}, packages={}, documents={}, findings={}, warnings={}, errors={}",
|
||||
run.getId(), scannedPackages, scannedLegacyDocuments, accumulator.totalFindings(),
|
||||
accumulator.warningCount(), accumulator.errorCount());
|
||||
return run;
|
||||
} catch (RuntimeException ex) {
|
||||
run.setStatus(LegacyTedAuditRunStatus.FAILED);
|
||||
run.setCompletedAt(OffsetDateTime.now());
|
||||
run.setScannedPackages(accumulator.scannedPackages());
|
||||
run.setScannedLegacyDocuments(accumulator.scannedLegacyDocuments());
|
||||
run.setFindingCount(accumulator.totalFindings());
|
||||
run.setInfoCount(accumulator.infoCount());
|
||||
run.setWarningCount(accumulator.warningCount());
|
||||
run.setErrorCount(accumulator.errorCount());
|
||||
run.setFailureMessage(ex.getMessage());
|
||||
run.setSummaryText(buildSummary(accumulator.scannedPackages(), accumulator.scannedLegacyDocuments(), accumulator));
|
||||
runRepository.save(run);
|
||||
log.error("Wave 1 / Milestone A legacy-only audit failed: runId={}", run.getId(), ex);
|
||||
throw ex;
|
||||
}
|
||||
}
|
||||
|
||||
private int auditPackages(LegacyTedAuditRun run, AuditAccumulator accumulator) {
|
||||
List<TedDailyPackage> packages = tedDailyPackageRepository.findAll(Sort.by(Sort.Direction.ASC, "year", "serialNumber"));
|
||||
if (packages.isEmpty()) {
|
||||
return 0;
|
||||
}
|
||||
|
||||
Map<Integer, List<TedDailyPackage>> packagesByYear = new TreeMap<>();
|
||||
for (TedDailyPackage dailyPackage : packages) {
|
||||
packagesByYear.computeIfAbsent(dailyPackage.getYear(), ignored -> new ArrayList<>()).add(dailyPackage);
|
||||
}
|
||||
|
||||
int firstYear = packagesByYear.keySet().iterator().next();
|
||||
int currentYear = Year.now().getValue();
|
||||
|
||||
for (int year = firstYear; year <= currentYear; year++) {
|
||||
List<TedDailyPackage> yearPackages = packagesByYear.get(year);
|
||||
if (yearPackages == null || yearPackages.isEmpty()) {
|
||||
recordFinding(run, accumulator,
|
||||
LegacyTedAuditSeverity.WARNING,
|
||||
LegacyTedAuditFindingType.PACKAGE_SEQUENCE_GAP,
|
||||
null,
|
||||
null,
|
||||
null,
|
||||
null,
|
||||
"year:" + year,
|
||||
"No TED package rows exist for this year inside the audited interval",
|
||||
"year=" + year + ", intervalStartYear=" + firstYear + ", intervalEndYear=" + currentYear);
|
||||
continue;
|
||||
}
|
||||
|
||||
auditYearPackageSequence(run, accumulator, year, yearPackages);
|
||||
|
||||
for (TedDailyPackage dailyPackage : yearPackages) {
|
||||
accumulator.incrementScannedPackages();
|
||||
auditSinglePackage(run, accumulator, dailyPackage);
|
||||
}
|
||||
}
|
||||
|
||||
return packages.size();
|
||||
}
|
||||
|
||||
private void auditYearPackageSequence(LegacyTedAuditRun run,
|
||||
AuditAccumulator accumulator,
|
||||
int year,
|
||||
List<TedDailyPackage> yearPackages) {
|
||||
yearPackages.sort((left, right) -> Integer.compare(safeInt(left.getSerialNumber()), safeInt(right.getSerialNumber())));
|
||||
|
||||
int firstSerial = safeInt(yearPackages.getFirst().getSerialNumber());
|
||||
if (firstSerial > 1) {
|
||||
recordMissingPackageRange(run, accumulator, year, 1, firstSerial - 1,
|
||||
"TED package year starts after serial 1");
|
||||
}
|
||||
|
||||
for (int i = 1; i < yearPackages.size(); i++) {
|
||||
int previousSerial = safeInt(yearPackages.get(i - 1).getSerialNumber());
|
||||
int currentSerial = safeInt(yearPackages.get(i).getSerialNumber());
|
||||
if (currentSerial > previousSerial + 1) {
|
||||
recordMissingPackageRange(run, accumulator, year, previousSerial + 1, currentSerial - 1,
|
||||
"TED package sequence gap detected");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
private void recordMissingPackageRange(LegacyTedAuditRun run,
|
||||
AuditAccumulator accumulator,
|
||||
int year,
|
||||
int startSerial,
|
||||
int endSerial,
|
||||
String message) {
|
||||
String startPackageId = formatPackageIdentifier(year, startSerial);
|
||||
String endPackageId = formatPackageIdentifier(year, endSerial);
|
||||
String referenceKey = startSerial == endSerial ? startPackageId : startPackageId + "-" + endPackageId;
|
||||
|
||||
recordFinding(run, accumulator,
|
||||
LegacyTedAuditSeverity.WARNING,
|
||||
LegacyTedAuditFindingType.PACKAGE_SEQUENCE_GAP,
|
||||
startSerial == endSerial ? startPackageId : null,
|
||||
null,
|
||||
null,
|
||||
null,
|
||||
referenceKey,
|
||||
message,
|
||||
"year=" + year + ", missingStartSerial=" + startSerial + ", missingEndSerial=" + endSerial);
|
||||
}
|
||||
|
||||
private void auditSinglePackage(LegacyTedAuditRun run,
|
||||
AuditAccumulator accumulator,
|
||||
TedDailyPackage dailyPackage) {
|
||||
String packageIdentifier = dailyPackage.getPackageIdentifier();
|
||||
int processedCount = safeInt(dailyPackage.getProcessedCount());
|
||||
int failedCount = safeInt(dailyPackage.getFailedCount());
|
||||
int accountedDocuments = processedCount + failedCount;
|
||||
|
||||
if (dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.COMPLETED
|
||||
&& dailyPackage.getProcessedAt() == null) {
|
||||
recordFinding(run, accumulator,
|
||||
LegacyTedAuditSeverity.WARNING,
|
||||
LegacyTedAuditFindingType.PACKAGE_COMPLETED_WITHOUT_PROCESSED_AT,
|
||||
packageIdentifier,
|
||||
null,
|
||||
null,
|
||||
null,
|
||||
packageIdentifier,
|
||||
"TED package is marked COMPLETED but processedAt is null",
|
||||
null);
|
||||
}
|
||||
|
||||
if (dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.COMPLETED
|
||||
&& dailyPackage.getXmlFileCount() == null) {
|
||||
recordFinding(run, accumulator,
|
||||
LegacyTedAuditSeverity.WARNING,
|
||||
LegacyTedAuditFindingType.PACKAGE_MISSING_XML_FILE_COUNT,
|
||||
packageIdentifier,
|
||||
null,
|
||||
null,
|
||||
null,
|
||||
packageIdentifier,
|
||||
"TED package is marked COMPLETED but xmlFileCount is null",
|
||||
null);
|
||||
}
|
||||
|
||||
if ((dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.DOWNLOADED
|
||||
|| dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.PROCESSING
|
||||
|| dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.COMPLETED)
|
||||
&& !StringUtils.hasText(dailyPackage.getFileHash())) {
|
||||
recordFinding(run, accumulator,
|
||||
LegacyTedAuditSeverity.WARNING,
|
||||
LegacyTedAuditFindingType.PACKAGE_MISSING_FILE_HASH,
|
||||
packageIdentifier,
|
||||
null,
|
||||
null,
|
||||
null,
|
||||
packageIdentifier,
|
||||
"TED package has no file hash recorded",
|
||||
"downloadStatus=" + dailyPackage.getDownloadStatus());
|
||||
}
|
||||
|
||||
if (dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.FAILED
|
||||
&& !StringUtils.hasText(dailyPackage.getErrorMessage())) {
|
||||
recordFinding(run, accumulator,
|
||||
LegacyTedAuditSeverity.WARNING,
|
||||
LegacyTedAuditFindingType.PACKAGE_FAILED_WITHOUT_ERROR_MESSAGE,
|
||||
packageIdentifier,
|
||||
null,
|
||||
null,
|
||||
null,
|
||||
packageIdentifier,
|
||||
"TED package is marked FAILED but has no error message",
|
||||
null);
|
||||
}
|
||||
|
||||
if (dailyPackage.getXmlFileCount() != null) {
|
||||
if (accountedDocuments > dailyPackage.getXmlFileCount()) {
|
||||
recordFinding(run, accumulator,
|
||||
LegacyTedAuditSeverity.ERROR,
|
||||
LegacyTedAuditFindingType.PACKAGE_COMPLETED_COUNT_MISMATCH,
|
||||
packageIdentifier,
|
||||
null,
|
||||
null,
|
||||
null,
|
||||
packageIdentifier,
|
||||
"TED package accounting exceeds xmlFileCount",
|
||||
"xmlFileCount=" + dailyPackage.getXmlFileCount()
|
||||
+ ", processedCount=" + processedCount
|
||||
+ ", failedCount=" + failedCount);
|
||||
} else if (dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.COMPLETED
|
||||
&& accountedDocuments < dailyPackage.getXmlFileCount()) {
|
||||
recordFinding(run, accumulator,
|
||||
LegacyTedAuditSeverity.WARNING,
|
||||
LegacyTedAuditFindingType.PACKAGE_COMPLETED_COUNT_MISMATCH,
|
||||
packageIdentifier,
|
||||
null,
|
||||
null,
|
||||
null,
|
||||
packageIdentifier,
|
||||
"TED package accounting is below xmlFileCount",
|
||||
"xmlFileCount=" + dailyPackage.getXmlFileCount()
|
||||
+ ", processedCount=" + processedCount
|
||||
+ ", failedCount=" + failedCount);
|
||||
}
|
||||
}
|
||||
|
||||
if (isPackageIncompleteForReimport(dailyPackage, processedCount, failedCount, accountedDocuments)) {
|
||||
recordFinding(run, accumulator,
|
||||
dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.FAILED
|
||||
? LegacyTedAuditSeverity.ERROR
|
||||
: LegacyTedAuditSeverity.WARNING,
|
||||
LegacyTedAuditFindingType.PACKAGE_INCOMPLETE,
|
||||
packageIdentifier,
|
||||
null,
|
||||
null,
|
||||
null,
|
||||
packageIdentifier,
|
||||
"TED package is not fully imported and should be considered for re-import",
|
||||
buildIncompletePackageDetails(dailyPackage, processedCount, failedCount, accountedDocuments));
|
||||
}
|
||||
}
|
||||
|
    private boolean isPackageIncompleteForReimport(TedDailyPackage dailyPackage,
                                                   int processedCount,
                                                   int failedCount,
                                                   int accountedDocuments) {
        TedDailyPackage.DownloadStatus status = dailyPackage.getDownloadStatus();
        if (status == null) {
            return true;
        }
        if (status == TedDailyPackage.DownloadStatus.NOT_FOUND) {
            return false;
        }
        if (status == TedDailyPackage.DownloadStatus.PENDING
                || status == TedDailyPackage.DownloadStatus.DOWNLOADING
                || status == TedDailyPackage.DownloadStatus.DOWNLOADED
                || status == TedDailyPackage.DownloadStatus.PROCESSING
                || status == TedDailyPackage.DownloadStatus.FAILED) {
            return true;
        }
        if (status != TedDailyPackage.DownloadStatus.COMPLETED) {
            return true;
        }
        if (dailyPackage.getXmlFileCount() == null) {
            return true;
        }
        if (failedCount > 0) {
            return true;
        }
        return processedCount < dailyPackage.getXmlFileCount()
                || accountedDocuments != dailyPackage.getXmlFileCount();
    }

    private String buildIncompletePackageDetails(TedDailyPackage dailyPackage,
                                                 int processedCount,
                                                 int failedCount,
                                                 int accountedDocuments) {
        return "status=" + dailyPackage.getDownloadStatus()
                + ", xmlFileCount=" + dailyPackage.getXmlFileCount()
                + ", processedCount=" + processedCount
                + ", failedCount=" + failedCount
                + ", accountedDocuments=" + accountedDocuments;
    }

    private void auditGlobalDuplicates(LegacyTedAuditRun run, AuditAccumulator accumulator) {
        int limit = properties.getMaxDuplicateSamples();

        jdbcTemplate.query(
                """
                SELECT publication_id, COUNT(*) AS duplicate_count
                FROM ted.procurement_document
                WHERE publication_id IS NOT NULL AND publication_id <> ''
                GROUP BY publication_id
                HAVING COUNT(*) > 1
                ORDER BY duplicate_count DESC, publication_id ASC
                LIMIT ?
                """,
                ps -> ps.setInt(1, limit),
                (rs, rowNum) -> {
                    String publicationId = rs.getString("publication_id");
                    long duplicateCount = rs.getLong("duplicate_count");
                    recordFinding(run, accumulator,
                            LegacyTedAuditSeverity.ERROR,
                            LegacyTedAuditFindingType.LEGACY_PUBLICATION_ID_DUPLICATE,
                            null,
                            null,
                            null,
                            null,
                            publicationId,
                            "Legacy TED publicationId appears multiple times",
                            "publicationId=" + publicationId + ", duplicateCount=" + duplicateCount);
                    return null;
                });
    }

    private int auditLegacyDocuments(LegacyTedAuditRun run,
                                     AuditAccumulator accumulator,
                                     Integer requestedLimit,
                                     int pageSize) {
        int processed = 0;
        int pageNumber = 0;

        while (requestedLimit == null || processed < requestedLimit) {
            Page<ProcurementDocument> page = procurementDocumentRepository.findAll(
                    PageRequest.of(pageNumber, pageSize, Sort.by(Sort.Direction.ASC, "createdAt", "id")));

            if (page.isEmpty()) {
                break;
            }

            for (ProcurementDocument legacyDocument : page.getContent()) {
                auditSingleLegacyDocument(run, accumulator, legacyDocument);
                accumulator.incrementScannedLegacyDocuments();
                processed++;
                if (requestedLimit != null && processed >= requestedLimit) {
                    return processed;
                }
            }

            if (!page.hasNext()) {
                break;
            }
            pageNumber++;
        }

        return processed;
    }

    private void auditSingleLegacyDocument(LegacyTedAuditRun run,
                                           AuditAccumulator accumulator,
                                           ProcurementDocument legacyDocument) {
        UUID legacyDocumentId = legacyDocument.getId();
        String referenceKey = buildReferenceKey(legacyDocument);
        String documentHash = legacyDocument.getDocumentHash();

        if (!StringUtils.hasText(documentHash)) {
            recordFinding(run, accumulator,
                    LegacyTedAuditSeverity.ERROR,
                    LegacyTedAuditFindingType.LEGACY_DOCUMENT_MISSING_HASH,
                    null,
                    legacyDocumentId,
                    null,
                    null,
                    referenceKey,
                    "Legacy TED document has no documentHash",
                    null);
            return;
        }

        if (!StringUtils.hasText(legacyDocument.getXmlDocument())) {
            recordFinding(run, accumulator,
                    LegacyTedAuditSeverity.ERROR,
                    LegacyTedAuditFindingType.LEGACY_DOCUMENT_MISSING_XML,
                    null,
                    legacyDocumentId,
                    null,
                    null,
                    referenceKey,
                    "Legacy TED document has no xmlDocument payload",
                    "documentHash=" + documentHash);
        }

        if (!StringUtils.hasText(legacyDocument.getTextContent())) {
            recordFinding(run, accumulator,
                    LegacyTedAuditSeverity.WARNING,
                    LegacyTedAuditFindingType.LEGACY_DOCUMENT_MISSING_TEXT,
                    null,
                    legacyDocumentId,
                    null,
                    null,
                    referenceKey,
                    "Legacy TED document has no normalized textContent",
                    "documentHash=" + documentHash);
        }

        if (!StringUtils.hasText(legacyDocument.getPublicationId())) {
            recordFinding(run, accumulator,
                    LegacyTedAuditSeverity.WARNING,
                    LegacyTedAuditFindingType.LEGACY_DOCUMENT_MISSING_PUBLICATION_ID,
                    null,
                    legacyDocumentId,
                    null,
                    null,
                    referenceKey,
                    "Legacy TED document has no publicationId",
                    "documentHash=" + documentHash);
        }
    }

    private void recordFinding(LegacyTedAuditRun run,
                               AuditAccumulator accumulator,
                               LegacyTedAuditSeverity severity,
                               LegacyTedAuditFindingType findingType,
                               String packageIdentifier,
                               UUID legacyProcurementDocumentId,
                               UUID genericDocumentId,
                               UUID tedProjectionId,
                               String referenceKey,
                               String message,
                               String detailsText) {
        if (accumulator.totalFindings() >= properties.getMaxFindingsPerRun()) {
            accumulator.markTruncated();
            if (!accumulator.truncationRecorded()) {
                LegacyTedAuditFinding truncatedFinding = LegacyTedAuditFinding.builder()
                        .run(run)
                        .severity(LegacyTedAuditSeverity.INFO)
                        .findingType(LegacyTedAuditFindingType.FINDINGS_TRUNCATED)
                        .referenceKey(referenceKey != null ? referenceKey : "max-findings-per-run")
                        .message("Legacy TED audit finding limit reached; additional findings were suppressed")
                        .detailsText("maxFindingsPerRun=" + properties.getMaxFindingsPerRun())
                        .build();
                findingRepository.save(truncatedFinding);
                accumulator.recordFinding(LegacyTedAuditSeverity.INFO, true);
            }
            return;
        }

        LegacyTedAuditFinding finding = LegacyTedAuditFinding.builder()
                .run(run)
                .severity(severity)
                .findingType(findingType)
                .packageIdentifier(packageIdentifier)
                .legacyProcurementDocumentId(legacyProcurementDocumentId)
                .documentId(genericDocumentId)
                .tedNoticeProjectionId(tedProjectionId)
                .referenceKey(referenceKey)
                .message(message)
                .detailsText(detailsText)
                .build();
        findingRepository.save(finding);
        accumulator.recordFinding(severity, false);
    }

    private String buildReferenceKey(ProcurementDocument legacyDocument) {
        if (StringUtils.hasText(legacyDocument.getPublicationId())) {
            return legacyDocument.getPublicationId();
        }
        if (StringUtils.hasText(legacyDocument.getNoticeId())) {
            return legacyDocument.getNoticeId();
        }
        if (StringUtils.hasText(legacyDocument.getSourceFilename())) {
            return legacyDocument.getSourceFilename();
        }
        return String.valueOf(legacyDocument.getId());
    }

    private int safeInt(Integer value) {
        return value != null ? value : 0;
    }

    private String formatPackageIdentifier(int year, int serialNumber) {
        return "%04d%05d".formatted(year, serialNumber);
    }

    private String buildSummary(int scannedPackages,
                                int scannedLegacyDocuments,
                                AuditAccumulator accumulator) {
        return "packages=" + scannedPackages
                + ", legacyDocuments=" + scannedLegacyDocuments
                + ", findings=" + accumulator.totalFindings()
                + ", warnings=" + accumulator.warningCount()
                + ", errors=" + accumulator.errorCount()
                + (accumulator.truncated() ? ", truncated=true" : "");
    }

    private static final class AuditAccumulator {
        private int scannedPackages;
        private int scannedLegacyDocuments;
        private int infoCount;
        private int warningCount;
        private int errorCount;
        private boolean truncated;
        private boolean truncationRecorded;

        void incrementScannedPackages() {
            scannedPackages++;
        }

        void incrementScannedLegacyDocuments() {
            scannedLegacyDocuments++;
        }

        void recordFinding(LegacyTedAuditSeverity severity, boolean truncationFindingRecordedNow) {
            switch (severity) {
                case INFO -> infoCount++;
                case WARNING -> warningCount++;
                case ERROR -> errorCount++;
            }
            if (truncationFindingRecordedNow) {
                truncationRecorded = true;
            }
        }

        void markTruncated() {
            truncated = true;
        }

        int totalFindings() {
            return infoCount + warningCount + errorCount;
        }

        int infoCount() {
            return infoCount;
        }

        int warningCount() {
            return warningCount;
        }

        int errorCount() {
            return errorCount;
        }

        int scannedPackages() {
            return scannedPackages;
        }

        int scannedLegacyDocuments() {
            return scannedLegacyDocuments;
        }

        boolean truncated() {
            return truncated;
        }

        boolean truncationRecorded() {
            return truncationRecorded;
        }
    }
}
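The `formatPackageIdentifier` helper above concatenates a zero-padded four-digit year and five-digit serial number via `"%04d%05d"`. A minimal standalone sketch (the sample values are illustrative, not from the source) shows the resulting shape:

```java
public class PackageIdentifierDemo {

    // Mirrors the "%04d%05d" format used by formatPackageIdentifier:
    // 4-digit zero-padded year followed by a 5-digit zero-padded serial number.
    static String formatPackageIdentifier(int year, int serialNumber) {
        return "%04d%05d".formatted(year, serialNumber);
    }

    public static void main(String[] args) {
        System.out.println(formatPackageIdentifier(2024, 7));     // 202400007
        System.out.println(formatPackageIdentifier(2023, 12345)); // 202312345
    }
}
```

The fixed width keeps identifiers lexicographically sortable by year and serial number.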
@ -1,33 +0,0 @@
package at.procon.dip.migration.audit.startup;

import at.procon.dip.migration.audit.config.LegacyTedAuditProperties;
import at.procon.dip.migration.audit.service.LegacyTedAuditService;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.ApplicationArguments;
import org.springframework.boot.ApplicationRunner;
import org.springframework.stereotype.Component;

@Component
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@RequiredArgsConstructor
@Slf4j
public class LegacyTedAuditStartupRunner implements ApplicationRunner {

    private final LegacyTedAuditProperties properties;
    private final LegacyTedAuditService legacyTedAuditService;

    @Override
    public void run(ApplicationArguments args) {
        if (!properties.isEnabled() || !properties.isStartupRunEnabled()) {
            return;
        }

        int requestedLimit = properties.getStartupRunLimit();
        log.info("Wave 1 / Milestone A startup audit enabled - scanning legacy TED data with limit {}",
                requestedLimit > 0 ? requestedLimit : "unbounded");
        legacyTedAuditService.executeAudit(requestedLimit);
    }
}
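The runner fires only when both the audit feature and its startup run are enabled. Assuming `LegacyTedAuditProperties` binds under a prefix such as `dip.audit.legacy-ted` (the real prefix lives in that properties class, which is not shown in this diff), the relevant keys would look like:

```yaml
# Hypothetical prefix -- check LegacyTedAuditProperties for the actual one.
dip:
  audit:
    legacy-ted:
      enabled: true               # isEnabled()
      startup-run-enabled: true   # isStartupRunEnabled()
      startup-run-limit: 500      # getStartupRunLimit(); <= 0 is logged as "unbounded"
```

Key names follow Spring Boot's relaxed binding from the getters used in `run(...)`.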
@ -1,81 +1,49 @@
package at.procon.dip.search.config;

import jakarta.validation.constraints.Min;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;
import org.springframework.validation.annotation.Validated;

/**
 * New-runtime generic search configuration.
 *
 * <p>This property tree is intentionally separated from the legacy
 * {@code ted.search.*} settings. NEW-mode search/semantic/lexical code should
 * depend on {@code dip.search.*} only.</p>
 */
@Configuration
@ConfigurationProperties(prefix = "dip.search")
@Data
@Validated
public class DipSearchProperties {

    /** Default page size for search results. */
    @Positive
    private int defaultPageSize = 20;

    /** Maximum allowed page size. */
    @Positive
    private int maxPageSize = 100;

    /** Semantic similarity threshold (normalized score). */
    private double similarityThreshold = 0.7d;

    /** Minimum trigram similarity for fuzzy lexical matches. */
    private double trigramSimilarityThreshold = 0.12d;

    /** Candidate limits per search engine before fusion/collapse. */
    @Positive
    private int fulltextCandidateLimit = 120;

    @Positive
    private int trigramCandidateLimit = 120;

    @Positive
    private int semanticCandidateLimit = 120;

    /** Hybrid fusion weights. */
    private double fulltextWeight = 0.35d;
    private double trigramWeight = 0.20d;
    private double semanticWeight = 0.45d;

    /** Enable chunk representations for long documents. */
    private boolean chunkingEnabled = true;

    /** Target chunk size in characters for CHUNK representations. */
    @Positive
    private int chunkTargetChars = 1800;

    /** Overlap between consecutive chunks in characters. */
    @Min(0)
    private int chunkOverlapChars = 200;

    /** Maximum CHUNK representations generated per document. */
    @Positive
    private int maxChunksPerDocument = 12;

    /** Additional score weight for recency. */
    private double recencyBoostWeight = 0.05d;

    /** Half-life in days used for recency decay. */
    @Positive
    private int recencyHalfLifeDays = 30;

    /** Startup backfill limit for missing DOC lexical vectors. */
    @Positive
    private int startupLexicalBackfillLimit = 500;

    /** Number of hits per engine returned by the debug endpoint. */
    @Positive
    private int debugTopHitsPerEngine = 10;

    private Lexical lexical = new Lexical();
    private Semantic semantic = new Semantic();
    private Fusion fusion = new Fusion();
    private Chunking chunking = new Chunking();

    @Data
    public static class Lexical {
        private double trigramSimilarityThreshold = 0.12;
        private int fulltextCandidateLimit = 120;
        private int trigramCandidateLimit = 120;
    }

    @Data
    public static class Semantic {
        private double similarityThreshold = 0.7;
        private int semanticCandidateLimit = 120;
        private String defaultModelKey;
    }

    @Data
    public static class Fusion {
        private double fulltextWeight = 0.35;
        private double trigramWeight = 0.20;
        private double semanticWeight = 0.45;
        private double recencyBoostWeight = 0.05;
        private int recencyHalfLifeDays = 30;
        private int debugTopHitsPerEngine = 10;
    }

    @Data
    public static class Chunking {
        private boolean enabled = true;
        private int targetChars = 1800;
        private int overlapChars = 200;
        private int maxChunksPerDocument = 12;
        private int startupLexicalBackfillLimit = 500;
    }
}
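The diff above shows the flat fields being replaced by nested `Lexical`/`Semantic`/`Fusion`/`Chunking` groups. The nested form this refactor moves toward would be bound from configuration like the following sketch (values are the declared defaults; key names follow Spring Boot's relaxed binding):

```yaml
dip:
  search:
    default-page-size: 20
    max-page-size: 100
    lexical:
      trigram-similarity-threshold: 0.12
      fulltext-candidate-limit: 120
      trigram-candidate-limit: 120
    semantic:
      similarity-threshold: 0.7
      semantic-candidate-limit: 120
      # default-model-key has no default and must be set explicitly if used
    fusion:
      fulltext-weight: 0.35
      trigram-weight: 0.20
      semantic-weight: 0.45
      recency-boost-weight: 0.05
      recency-half-life-days: 30
      debug-top-hits-per-engine: 10
    chunking:
      enabled: true
      target-chars: 1800
      overlap-chars: 200
      max-chunks-per-document: 12
      startup-lexical-backfill-limit: 500
```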
@ -0,0 +1,16 @@
package at.procon.ted.config;

import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

/**
 * Patch A scaffold for the legacy runtime configuration tree.
 *
 * The legacy runtime still uses {@link TedProcessorProperties} today. This class is
 * introduced so the old configuration can be moved gradually from `ted.*` to
 * `legacy.ted.*` without blocking the runtime split.
 */
@Configuration
@ConfigurationProperties(prefix = "legacy.ted")
public class LegacyTedProperties extends TedProcessorProperties {
}
@ -1,115 +0,0 @@
package at.procon.ted.config;

import jakarta.validation.constraints.Min;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;
import org.springframework.validation.annotation.Validated;

/**
 * Legacy vectorization configuration used only by the old runtime path.
 * <p>
 * This extracts the former ted.vectorization.* subtree away from TedProcessorProperties
 * so that legacy vectorization beans no longer depend on the shared monolithic config.
 */
@Configuration
@ConfigurationProperties(prefix = "legacy.ted.vectorization")
@Data
@Validated
public class LegacyVectorizationProperties {

    /**
     * Enable/disable legacy async vectorization.
     */
    private boolean enabled = true;

    /**
     * Use external HTTP API instead of Python subprocess.
     */
    private boolean useHttpApi = false;

    /**
     * Embedding service HTTP API URL.
     */
    private String apiUrl = "http://localhost:8001";

    /**
     * Sentence transformer model name.
     */
    private String modelName = "intfloat/multilingual-e5-large";

    /**
     * Vector dimensions (must match model output).
     */
    @Positive
    private int dimensions = 1024;

    /**
     * Batch size for vectorization processing.
     */
    @Min(1)
    private int batchSize = 16;

    /**
     * Thread pool size for async vectorization.
     */
    @Min(1)
    private int threadPoolSize = 4;

    /**
     * Maximum text length for vectorization (characters).
     */
    @Positive
    private int maxTextLength = 8192;

    /**
     * HTTP connection timeout in milliseconds.
     */
    @Positive
    private int connectTimeout = 10000;

    /**
     * HTTP socket/read timeout in milliseconds.
     */
    @Positive
    private int socketTimeout = 60000;

    /**
     * Maximum retries on connection failure.
     */
    @Min(0)
    private int maxRetries = 5;

    /**
     * Enable the former Phase 2 generic pipeline in the legacy runtime.
     * In the split runtime design this should normally stay false in NEW mode
     * because legacy beans are not instantiated there.
     */
    private boolean genericPipelineEnabled = true;

    /**
     * Keep writing completed TED embeddings back to the legacy ted.procurement_document
     * vector columns so the existing semantic search stays operational during migration.
     */
    private boolean dualWriteLegacyTedVectors = true;

    /**
     * Scheduler interval for generic embedding polling (milliseconds).
     */
    @Positive
    private long genericSchedulerPeriodMs = 6000;

    /**
     * Builder key for the primary TED semantic representation created during transitional dual-write.
     */
    @NotBlank
    private String primaryRepresentationBuilderKey = "ted-phase2-primary-representation";

    /**
     * Provider key used when registering the configured embedding model in DOC.doc_embedding_model.
     */
    @NotBlank
    private String embeddingProvider = "http-embedding-service";
}
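Before its removal, this class bound the `legacy.ted.vectorization.*` subtree. For reference, the full tree with the defaults declared above would look like this in YAML (relaxed binding maps the camelCase fields to kebab-case keys):

```yaml
legacy:
  ted:
    vectorization:
      enabled: true
      use-http-api: false
      api-url: "http://localhost:8001"
      model-name: "intfloat/multilingual-e5-large"
      dimensions: 1024
      batch-size: 16
      thread-pool-size: 4
      max-text-length: 8192
      connect-timeout: 10000
      socket-timeout: 60000
      max-retries: 5
      generic-pipeline-enabled: true
      dual-write-legacy-ted-vectors: true
      generic-scheduler-period-ms: 6000
      primary-representation-builder-key: "ted-phase2-primary-representation"
      embedding-provider: "http-embedding-service"
```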
Some files were not shown because too many files have changed in this diff.