Compare commits


No commits in common. '152d9739af0969090ddb7db63ad6a93054147053' and '74609e481d97ae16a12a06dd2d6ed191f01a33f7' have entirely different histories.

@@ -1,31 +0,0 @@
# Config split: moved new-runtime properties to application-new.yml
This patch keeps shared and legacy defaults in `application.yml` and moves new-runtime properties into `application-new.yml`.
Activate the new runtime with:
```
--spring.profiles.active=new
```
`application-new.yml` also sets:
```yaml
dip.runtime.mode: NEW
```
So profile selection and runtime mode stay aligned.
Moved blocks:
- `dip.embedding.*`
- `ted.search.*` (new generic search tuning, now under `dip.search.*`)
- `ted.projection.*`
- `ted.generic-ingestion.*`
- new/transitional `ted.vectorization.*` keys:
  - `generic-pipeline-enabled`
  - `dual-write-legacy-ted-vectors`
  - `generic-scheduler-period-ms`
  - `primary-representation-builder-key`
  - `embedding-provider`
Shared / legacy defaults remain in `application.yml`.
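To make the split concrete, here is a partial sketch of what `application-new.yml` could contain after this change. The keys follow the moved blocks above; all values are illustrative placeholders, not the project's actual defaults:

```yaml
# Partial, illustrative sketch only - values are placeholders
dip:
  runtime:
    mode: NEW
  embedding:
    enabled: true
ted:
  vectorization:
    generic-pipeline-enabled: true
    dual-write-legacy-ted-vectors: false
    generic-scheduler-period-ms: 5000
```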

@@ -1,30 +0,0 @@
# Embedding policy Patch K1
Patch K1 introduces the configuration and resolver layer for policy-based document embedding selection.
## Added
- `EmbeddingPolicy`
- `EmbeddingProfile`
- `EmbeddingPolicyCondition`
- `EmbeddingPolicyUse`
- `EmbeddingPolicyRule`
- `EmbeddingPolicyProperties`
- `EmbeddingProfileProperties`
- `EmbeddingPolicyResolver`
- `DefaultEmbeddingPolicyResolver`
- `EmbeddingProfileResolver`
- `DefaultEmbeddingProfileResolver`
## Example config
See `application-new-example-embedding-policy.yml`.
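Since the example file is not reproduced here, the following is a minimal sketch of the shape implied by `EmbeddingPolicyProperties` (`dip.embedding.policies`) and `EmbeddingProfileProperties` (`dip.embedding.profiles`); all keys and values are illustrative, relying on Spring Boot's relaxed kebab-case binding:

```yaml
# Illustrative sketch only - not the shipped example file
dip:
  embedding:
    policies:
      default-policy:
        model-key: e5-default
        profile-key: standard
        enabled: true
      rules:
        - name: ted-notices
          when:
            document-family: TED
            language: de
          use:
            model-key: e5-default
            profile-key: chunked
    profiles:
      definitions:
        standard:
          embed-representation-types: [SUMMARY, TITLE_ABSTRACT]
        chunked:
          embed-representation-types: [CHUNK]
```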
## What K1 does not change
- no runtime import/orchestrator wiring yet
- no `SourceDescriptor` schema change yet
- no job persistence/audit changes yet
## Intended follow-up
K2 should wire `GenericDocumentImportService` and `RepresentationEmbeddingOrchestrator` to use the resolved policy and profile.
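To make the intended rule matching concrete, here is a simplified, hypothetical sketch: a rule applies when every non-null condition field equals the corresponding document attribute, otherwise the default policy is used. The real `DefaultEmbeddingPolicyResolver` works against `SourceDescriptor` attributes and supports more condition fields; all names below are illustrative.

```java
import java.util.List;

// Hypothetical, simplified sketch of policy rule matching. The real resolver
// supports more condition fields (sourceType, mimeType, tenant, hint, ...).
public class PolicyResolverSketch {
    record Condition(String documentFamily, String language) {}
    record Rule(String name, Condition when, String profileKey) {}

    static String resolveProfileKey(List<Rule> rules, String defaultProfile,
                                    String documentFamily, String language) {
        for (Rule rule : rules) {
            // A null condition field acts as a wildcard.
            boolean familyOk = rule.when().documentFamily() == null
                    || rule.when().documentFamily().equals(documentFamily);
            boolean languageOk = rule.when().language() == null
                    || rule.when().language().equals(language);
            if (familyOk && languageOk) {
                return rule.profileKey();
            }
        }
        return defaultProfile;
    }
}
```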

@@ -1,26 +0,0 @@
# Embedding policy Patch K2
Patch K2 wires the policy/profile layer into the actual NEW import runtime.
## What it changes
- `GenericDocumentImportService`
  - resolves `EmbeddingPolicy` per imported document
  - resolves `EmbeddingProfile`
  - ensures the selected embedding model is registered
  - queues embeddings only for representation drafts allowed by the resolved profile
- `RepresentationEmbeddingOrchestrator`
  - adds a convenience overload for `(documentId, modelKey, profile)`
- `EmbeddingJobService`
  - adds a profile-aware enqueue overload
- `DefaultEmbeddingSelectionPolicy`
  - adds profile-aware representation filtering
- `DefaultEmbeddingPolicyResolver`
  - corrected for the current `SourceDescriptor.attributes()` shape
## Runtime flow after K2
```
document imported
  -> representations built
  -> policy resolved
  -> profile resolved
  -> model ensured
  -> matching representations queued for embedding
```

@@ -1,59 +0,0 @@
# Runtime split Patch B
Patch B builds on Patch A and makes the NEW runtime actually process embedding jobs.
## What changes
### 1. New embedding job scheduler
Adds:
- `EmbeddingJobScheduler`
- `EmbeddingJobSchedulingConfiguration`
Behavior:
- enabled only in `NEW` runtime mode
- active only when `dip.embedding.jobs.enabled=true`
- periodically calls:
  - `RepresentationEmbeddingOrchestrator.processNextReadyBatch()`
### 2. Generic import hands off to the new embedding job path
`GenericDocumentImportService` is updated so that in `NEW` mode it:
- resolves `dip.embedding.default-document-model`
- ensures the model is registered in `DOC.doc_embedding_model`
- creates embedding jobs through:
  - `RepresentationEmbeddingOrchestrator.enqueueRepresentation(...)`
It no longer creates legacy-style pending embeddings as the primary handoff for the NEW runtime path.
## Notes
- This patch assumes Patch A has already introduced:
  - `RuntimeMode`
  - `RuntimeModeProperties`
  - `@ConditionalOnRuntimeMode`
- This patch does not yet remove the legacy vectorization runtime.
That remains the job of subsequent cutover steps.
## Expected runtime behavior in NEW mode
- `GenericDocumentImportService` persists new generic representations
- selected representations are queued into `DOC.doc_embedding_job`
- scheduler processes pending jobs
- vectors are persisted through the new embedding subsystem
## New config
Example:
```yaml
dip:
  runtime:
    mode: NEW
  embedding:
    enabled: true
    jobs:
      enabled: true
      scheduler-delay-ms: 5000
```

@@ -1,36 +0,0 @@
# Runtime split Patch C
Patch C moves the **new generic search runtime** off `TedProcessorProperties.search`
and into a dedicated `dip.search.*` config tree.
## New config class
- `at.procon.dip.search.config.DipSearchProperties`
## New config root
```yaml
dip:
  search:
    ...
```
## Classes moved off `TedProcessorProperties`
- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchOrchestrator`
- `DefaultSearchResultFusionService`
- `SearchLexicalIndexStartupRunner`
- `ChunkedLongTextRepresentationBuilder`
## What this patch intentionally does not do
- it does not yet remove `TedProcessorProperties` from all NEW-mode classes
- it does not yet move `generic-ingestion` config off `ted.*`
- it does not yet finish the legacy/new config split for import/mail/TED package processing
Those should be handled in the next config-splitting patch.
## Practical result
After this patch, **new search/semantic/chunking tuning** should be configured only via `dip.search.*`, while `ted.search.*` remains legacy-oriented.

@@ -1,40 +0,0 @@
# Runtime Split Patch D
This patch completes the next configuration split step for the NEW runtime.
## New property classes
- `at.procon.dip.ingestion.config.DipIngestionProperties`
  - prefix: `dip.ingestion`
- `at.procon.dip.domain.ted.config.TedProjectionProperties`
  - prefix: `dip.ted.projection`
## Classes moved off `TedProcessorProperties`
### NEW-mode ingestion
- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`
### NEW-mode projection
- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`
## Additional cleanup in `GenericDocumentImportService`
It now resolves the default document embedding model through the new embedding subsystem:
- `EmbeddingProperties`
- `EmbeddingModelRegistry`
- `EmbeddingModelCatalogService`
and no longer reads vectorization model/provider/dimensions from `TedProcessorProperties`.
## What still remains for later split steps
- legacy routes/services still using `TedProcessorProperties`
- legacy/new runtime bean gating for all remaining shared classes
- moving old TED-only config fully under `legacy.ted.*`

@@ -1,26 +0,0 @@
# Runtime split Patch E
This patch continues the runtime/config split by targeting the remaining NEW-mode classes
that still inject `TedProcessorProperties`.
## New config classes
- `DipIngestionProperties` (`dip.ingestion.*`)
- `TedProjectionProperties` (`dip.ted.projection.*`)
## NEW-mode classes moved off `TedProcessorProperties`
- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`
- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`
## Additional behavior change
`GenericDocumentImportService` now hands embedding work off to the new embedding subsystem by:
- resolving the default document model from `EmbeddingModelRegistry`
- ensuring the model is registered via `EmbeddingModelCatalogService`
- enqueueing jobs through `RepresentationEmbeddingOrchestrator`
This removes the new import path's runtime dependence on legacy `TedProcessorProperties.vectorization`.

@@ -1,39 +0,0 @@
# Runtime split Patch F
This patch finishes the first major bean-gating pass for the **legacy runtime**.
## What it does
Marks the remaining old runtime classes as:
- `@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)`
### Legacy routes / runtime
- `MailRoute`
- `SolutionBriefRoute`
- `TedDocumentRoute`
- `TedPackageDownloadCamelRoute`
- `TedPackageDownloadRoute`
- `VectorizationRoute`
### Legacy config/runtime infrastructure
- `AsyncConfig`
- `TedProcessorProperties`
### Legacy controller / listeners / services
- `AdminController`
- `VectorizationEventListener`
- `AttachmentProcessingService`
- `BatchDocumentProcessingService`
- `DocumentProcessingService`
- `SearchService`
- `SimilaritySearchService`
- `TedPackageDownloadService`
- `TedPhase2GenericDocumentService`
- `VectorizationProcessorService`
- `VectorizationService`
- `VectorizationStartupRunner`
## Added profile file
- `application-legacy.yml`
This patch is intended to be applied **after Patch AE**. It does not yet remove the old `ted.*` property tree; it makes the old bean graph activate only in `LEGACY` mode.
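As a simplified illustration of what `@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)` gating means: a gated bean is registered only when the configured runtime mode matches the mode the bean requires. The sketch below models just that comparison; the real implementation goes through Spring's `@Conditional`/`Condition` machinery (`RuntimeModeCondition`, `RuntimeModeProperties`), and the class name here is illustrative.

```java
// Simplified model of runtime-mode bean gating (not the real Spring wiring).
public class RuntimeGatingSketch {
    enum RuntimeMode { LEGACY, NEW }

    // A bean annotated for requiredMode is only registered when the
    // application's configured mode matches it.
    static boolean beanActive(RuntimeMode configuredMode, RuntimeMode requiredMode) {
        return configuredMode == requiredMode;
    }
}
```

In `LEGACY` mode, the legacy beans listed above (e.g. `VectorizationRoute`) stay active while `NEW`-only beans are skipped, and vice versa.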

@@ -1,24 +0,0 @@
# Runtime split Patch G
Patch G moves the remaining NEW-mode search/chunking classes off `TedProcessorProperties.search`
and onto `DipSearchProperties` (`dip.search.*`).
## New config class
- `at.procon.dip.search.config.DipSearchProperties`
## Classes switched to `DipSearchProperties`
- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchResultFusionService`
- `DefaultSearchOrchestrator`
- `SearchLexicalIndexStartupRunner`
- `ChunkedLongTextRepresentationBuilder`
## Additional cleanup
These classes are also marked `NEW`-only in this patch.
## Effect
After Patch G, the generic NEW-mode search/chunking path no longer depends on
`TedProcessorProperties.search`. That leaves `TedProcessorProperties` much closer to
legacy-only ownership.

@@ -1,17 +0,0 @@
# Runtime split Patch H
Patch H is a final cleanup / verification step after the previous split patches.
## What it does
- makes `TedProcessorProperties` explicitly `LEGACY`-only
- removes the stale `TedProcessorProperties` import/comment from `DocumentIntelligencePlatformApplication`
- adds a regression test that fails if NEW runtime classes reintroduce a dependency on `TedProcessorProperties`
- adds a simple `application-legacy.yml` profile file
## Why this matters
After the NEW search/ingestion/projection classes are moved to:
- `DipSearchProperties`
- `DipIngestionProperties`
- `TedProjectionProperties`
`TedProcessorProperties` should be owned strictly by the legacy runtime graph.

@@ -1,21 +0,0 @@
# Runtime split Patch I
Patch I extracts the remaining legacy vectorization cluster off `TedProcessorProperties`
and onto a dedicated legacy-only config class.
## New config class
- `at.procon.ted.config.LegacyVectorizationProperties`
  - prefix: `legacy.ted.vectorization.*`
## Classes switched off `TedProcessorProperties`
- `GenericVectorizationRoute`
- `DocumentEmbeddingProcessingService`
- `ConfiguredEmbeddingModelStartupRunner`
- `GenericVectorizationStartupRunner`
## Additional cleanup
These classes are also marked `LEGACY`-only via `@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)`.
## Effect
The `at.procon.dip.vectorization.*` package now clearly belongs to the old runtime graph and no longer pulls
its settings from the shared monolithic `TedProcessorProperties`.

@@ -1,45 +0,0 @@
# Runtime split Patch J
Patch J is a broader cleanup patch for the **actual current codebase**.
It adds the missing runtime/config split scaffolding and rewires the remaining NEW-mode classes
that still inject `TedProcessorProperties`.
## Added
- `dip.runtime` infrastructure
  - `RuntimeMode`
  - `RuntimeModeProperties`
  - `@ConditionalOnRuntimeMode`
  - `RuntimeModeCondition`
- `DipSearchProperties`
- `DipIngestionProperties`
- `TedProjectionProperties`
## Rewired off `TedProcessorProperties`
### NEW search/chunking
- `PostgresFullTextSearchEngine`
- `PostgresTrigramSearchEngine`
- `PgVectorSemanticSearchEngine`
- `DefaultSearchOrchestrator`
- `SearchLexicalIndexStartupRunner`
- `DefaultSearchResultFusionService`
- `ChunkedLongTextRepresentationBuilder`
### NEW ingestion/projection
- `GenericDocumentImportService`
- `GenericFileSystemIngestionRoute`
- `GenericDocumentImportController`
- `MailDocumentIngestionAdapter`
- `TedPackageDocumentIngestionAdapter`
- `TedPackageChildImportProcessor`
- `TedNoticeProjectionService`
- `TedProjectionStartupRunner`
## Additional behavior
- `GenericDocumentImportService` now hands embedding work off to the new embedding subsystem
via `RepresentationEmbeddingOrchestrator` and resolves the default model through
`EmbeddingModelRegistry` / `EmbeddingModelCatalogService`.
## Notes
This patch intentionally targets the leftovers still visible in the current codebase.
It assumes the new embedding subsystem already exists.

@@ -1,20 +0,0 @@
# NEW TED package import route
This patch adds a NEW-runtime TED package download path that:
- reuses the proven package sequencing rules
- stores package tracking in `TedDailyPackage`
- downloads the package tar.gz
- ingests it only through `DocumentIngestionGateway`
- never calls the legacy XML batch processing / vectorization flow
## Added classes
- `TedPackageSequenceService`
- `DefaultTedPackageSequenceService`
- `TedPackageDownloadNewProperties`
- `TedPackageDownloadNewRoute`
## Config
Use the `dip.ingestion.ted-download.*` block in `application-new.yml`.
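The exact keys of that block are not shown in this patch. Based on the getters that `DefaultTedPackageSequenceService` reads from `TedPackageDownloadProperties` (`getStartYear()`, `isRetryCurrentYearNotFoundIndefinitely()`, `getNotFoundRetryInterval()`, `getPreviousYearGracePeriodDays()`), it plausibly looks like the following; key names are inferred and all values are illustrative:

```yaml
# Inferred sketch - key names derived from the property getters, values illustrative
dip:
  ingestion:
    ted-download:
      start-year: 2018
      retry-current-year-not-found-indefinitely: true
      not-found-retry-interval: 3600000   # milliseconds between NOT_FOUND retries
      previous-year-grace-period-days: 14
```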

@@ -1,67 +0,0 @@
# Vector-sync HTTP embedding provider
This provider supports two endpoints:
- `POST {baseUrl}/vector-sync` for single-text requests
- `POST {baseUrl}/vectorize-batch` for batch document requests
## Single request
Request body:
```json
{
  "model": "intfloat/multilingual-e5-large",
  "text": "This is a sample text to vectorize"
}
```
## Batch request
Request body:
```json
{
  "model": "intfloat/multilingual-e5-large",
  "truncate_text": false,
  "truncate_length": 512,
  "chunk_size": 20,
  "items": [
    {
      "id": "2f48fd5c-9d39-4d80-9225-ea0c59c77c9a",
      "text": "This is a sample text to vectorize"
    }
  ]
}
```
## Provider configuration
```yaml
batch-request:
  truncate-text: false
  truncate-length: 512
  chunk-size: 20
```
These values are used for `/vectorize-batch` calls and can also be overridden per request via `EmbeddingRequest.providerOptions()`.
## Orchestrator batch processing
To let `RepresentationEmbeddingOrchestrator` send multiple representations in one provider call, enable batch processing for jobs and for the model:
```yaml
dip:
  embedding:
    jobs:
      enabled: true
      process-in-batches: true
      execution-batch-size: 20
    models:
      e5-default:
        supports-batch: true
```
Notes:
- jobs are grouped by `modelKey`
- non-batch-capable models still fall back to single-item execution
- `execution-batch-size` controls how many texts are sent in one `/vectorize-batch` request
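The grouping and batching behavior described above can be sketched in plain Java. This is not the real orchestrator API; the `Job` record and method names are illustrative, but the logic mirrors the notes: jobs are first grouped by `modelKey`, then each group is split into batches of at most `execution-batch-size` items.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of job grouping for batch embedding execution.
public class BatchGroupingSketch {
    record Job(String modelKey, String text) {}

    static List<List<Job>> toExecutionBatches(List<Job> jobs, int executionBatchSize) {
        // 1) Group queued jobs by modelKey, preserving encounter order.
        Map<String, List<Job>> byModel = new LinkedHashMap<>();
        for (Job job : jobs) {
            byModel.computeIfAbsent(job.modelKey(), k -> new ArrayList<>()).add(job);
        }
        // 2) Split each model group into batches of at most executionBatchSize.
        List<List<Job>> batches = new ArrayList<>();
        for (List<Job> group : byModel.values()) {
            for (int i = 0; i < group.size(); i += executionBatchSize) {
                batches.add(new ArrayList<>(group.subList(i, Math.min(i + executionBatchSize, group.size()))));
            }
        }
        return batches;
    }
}
```

A model without `supports-batch: true` would simply be handled with batches of size one, matching the single-item fallback described above.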

@@ -1,5 +1,6 @@
 package at.procon.dip;
+import at.procon.ted.config.TedProcessorProperties;
 import org.springframework.boot.SpringApplication;
 import org.springframework.boot.autoconfigure.SpringBootApplication;
 import org.springframework.boot.context.properties.EnableConfigurationProperties;
@@ -16,8 +17,9 @@ import org.springframework.scheduling.annotation.EnableAsync;
  */
 @SpringBootApplication(scanBasePackages = {"at.procon.dip", "at.procon.ted"})
 @EnableAsync
-@EntityScan(basePackages = {"at.procon.ted.model.entity", "at.procon.dip.domain.document.entity", "at.procon.dip.domain.tenant.entity", "at.procon.dip.domain.ted.entity", "at.procon.dip.embedding.job.entity", "at.procon.dip.migration.audit.entity"})
-@EnableJpaRepositories(basePackages = {"at.procon.ted.repository", "at.procon.dip.domain.document.repository", "at.procon.dip.domain.tenant.repository", "at.procon.dip.domain.ted.repository", "at.procon.dip.embedding.job.repository", "at.procon.dip.migration.audit.repository"})
+//@EnableConfigurationProperties(TedProcessorProperties.class)
+@EntityScan(basePackages = {"at.procon.ted.model.entity", "at.procon.dip.domain.document.entity", "at.procon.dip.domain.tenant.entity", "at.procon.dip.domain.ted.entity", "at.procon.dip.embedding.job.entity"})
+@EnableJpaRepositories(basePackages = {"at.procon.ted.repository", "at.procon.dip.domain.document.repository", "at.procon.dip.domain.tenant.repository", "at.procon.dip.domain.ted.repository", "at.procon.dip.embedding.job.repository"})
 public class DocumentIntelligencePlatformApplication {
     public static void main(String[] args) {
@@ -9,6 +9,5 @@ public enum RepresentationType {
     SUMMARY,
     TITLE_ABSTRACT,
     CHUNK,
-    METADATA_ENRICHED,
-    ATTACHMENT_ROLLUP
+    METADATA_ENRICHED
 }

@@ -38,7 +38,7 @@ public interface DocumentEmbeddingRepository extends JpaRepository<DocumentEmbed
            "error_message = NULL, token_count = :tokenCount, embedding_dimensions = :dimensions WHERE id = :id",
            nativeQuery = true)
     int updateEmbeddingVector(@Param("id") UUID id,
-                              @Param("vectorData") float[] vectorData,
+                              @Param("vectorData") String vectorData,
                               @Param("tokenCount") Integer tokenCount,
                               @Param("dimensions") Integer dimensions);

@@ -1,16 +0,0 @@
package at.procon.dip.domain.ted.config;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.ted.projection")
@Data
public class TedProjectionProperties {
    private boolean enabled = true;
    private boolean startupBackfillEnabled = false;
    @Positive
    private int startupBackfillLimit = 250;
}
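Bound from YAML via Spring Boot's relaxed binding, the class above corresponds to a block like the following (showing the field defaults):

```yaml
dip:
  ted:
    projection:
      enabled: true
      startup-backfill-enabled: false
      startup-backfill-limit: 250
```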

@@ -55,10 +55,10 @@ public class TedNoticeOrganization {
     @Column(name = "company_id", length = 1000)
     private String companyId;
-    @Column(name = "country_code", columnDefinition = "TEXT")
+    @Column(name = "country_code", length = 10)
     private String countryCode;
-    @Column(name = "city", columnDefinition = "TEXT")
+    @Column(name = "city", length = 255)
     private String city;
     @Column(name = "postal_code", length = 255)

@@ -108,7 +108,7 @@ public class TedNoticeProjection {
     @Column(name = "buyer_country_code", length = 10)
     private String buyerCountryCode;
-    @Column(name = "buyer_city", columnDefinition = "TEXT")
+    @Column(name = "buyer_city", length = 255)
     private String buyerCity;
     @Column(name = "buyer_postal_code", length = 100)
@@ -129,7 +129,7 @@ public class TedNoticeProjection {
     @Column(name = "project_description", columnDefinition = "TEXT")
     private String projectDescription;
-    @Column(name = "internal_reference", columnDefinition = "TEXT")
+    @Column(name = "internal_reference", length = 500)
     private String internalReference;
     @Enumerated(EnumType.STRING)

@@ -1,189 +0,0 @@
package at.procon.dip.domain.ted.service;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import at.procon.dip.ingestion.config.TedPackageDownloadProperties;
import at.procon.ted.model.entity.TedDailyPackage;
import at.procon.ted.repository.TedDailyPackageRepository;
import java.time.Duration;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.Year;
import java.time.ZoneOffset;
import java.util.Optional;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

/**
 * NEW-runtime implementation of TED package sequencing.
 * <p>
 * This reuses the same decision rules as the legacy TED package downloader:
 * <ul>
 *   <li>current year forward crawling first</li>
 *   <li>gap filling by walking backward to package 1</li>
 *   <li>NOT_FOUND retry handling with current-year indefinite retry support</li>
 *   <li>previous-year grace period before a tail NOT_FOUND becomes final</li>
 * </ul>
 */
@Service
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@RequiredArgsConstructor
@Slf4j
public class DefaultTedPackageSequenceService implements TedPackageSequenceService {
    private final TedPackageDownloadProperties properties;
    private final TedDailyPackageRepository packageRepository;

    @Override
    public PackageInfo getNextPackageToDownload() {
        int currentYear = Year.now().getValue();
        log.debug("Determining next TED package to download for NEW runtime (current year: {})", currentYear);
        // 1) Current year forward crawling first (newest data first)
        PackageInfo nextInCurrentYear = getNextForwardPackage(currentYear);
        if (nextInCurrentYear != null) {
            log.info("Next TED package: {} (current year {} forward)", nextInCurrentYear.identifier(), currentYear);
            return nextInCurrentYear;
        }
        // 2) Walk all years backward and fill gaps / continue unfinished years
        for (int year = currentYear; year >= properties.getStartYear(); year--) {
            PackageInfo gapFiller = getGapFillerPackage(year);
            if (gapFiller != null) {
                log.info("Next TED package: {} (filling gap in year {})", gapFiller.identifier(), year);
                return gapFiller;
            }
            if (!isYearComplete(year)) {
                PackageInfo forwardPackage = getNextForwardPackage(year);
                if (forwardPackage != null) {
                    log.info("Next TED package: {} (continuing year {})", forwardPackage.identifier(), year);
                    return forwardPackage;
                }
            } else {
                log.debug("TED package year {} is complete", year);
            }
        }
        // 3) Open a new older year if possible
        int oldestYear = getOldestYearWithData();
        if (oldestYear > properties.getStartYear()) {
            int previousYear = oldestYear - 1;
            if (previousYear >= properties.getStartYear()) {
                PackageInfo first = new PackageInfo(previousYear, 1);
                log.info("Next TED package: {} (opening year {})", first.identifier(), previousYear);
                return first;
            }
        }
        log.info("All TED package years from {} to {} appear complete - nothing to download",
                properties.getStartYear(), currentYear);
        return null;
    }

    private PackageInfo getNextForwardPackage(int year) {
        Optional<TedDailyPackage> latest = packageRepository.findLatestByYear(year);
        if (latest.isEmpty()) {
            return new PackageInfo(year, 1);
        }
        TedDailyPackage latestPackage = latest.get();
        if (latestPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.NOT_FOUND) {
            if (shouldRetryNotFoundPackage(latestPackage)) {
                return new PackageInfo(year, latestPackage.getSerialNumber());
            }
            if (isNotFoundRetryableForYear(latestPackage)) {
                log.debug("Year {} still inside NOT_FOUND retry window for package {} until {}",
                        year, latestPackage.getPackageIdentifier(), calculateNextRetryAt(latestPackage));
                return null;
            }
            log.debug("Year {} finalized after grace period at tail package {}", year, latestPackage.getPackageIdentifier());
            return null;
        }
        return new PackageInfo(year, latestPackage.getSerialNumber() + 1);
    }

    private PackageInfo getGapFillerPackage(int year) {
        Optional<TedDailyPackage> first = packageRepository.findFirstByYear(year);
        if (first.isEmpty()) {
            return null;
        }
        int minSerial = first.get().getSerialNumber();
        if (minSerial <= 1) {
            return null;
        }
        return new PackageInfo(year, minSerial - 1);
    }

    private boolean isYearComplete(int year) {
        Optional<TedDailyPackage> first = packageRepository.findFirstByYear(year);
        Optional<TedDailyPackage> latest = packageRepository.findLatestByYear(year);
        if (first.isEmpty() || latest.isEmpty()) {
            return false;
        }
        if (first.get().getSerialNumber() != 1) {
            return false;
        }
        TedDailyPackage latestPackage = latest.get();
        return latestPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.NOT_FOUND
                && !isNotFoundRetryableForYear(latestPackage);
    }

    private boolean shouldRetryNotFoundPackage(TedDailyPackage pkg) {
        if (!isNotFoundRetryableForYear(pkg)) {
            return false;
        }
        OffsetDateTime nextRetryAt = calculateNextRetryAt(pkg);
        return !nextRetryAt.isAfter(OffsetDateTime.now());
    }

    private boolean isNotFoundRetryableForYear(TedDailyPackage pkg) {
        int currentYear = Year.now().getValue();
        int packageYear = pkg.getYear() != null ? pkg.getYear() : currentYear;
        if (packageYear >= currentYear) {
            return properties.isRetryCurrentYearNotFoundIndefinitely();
        }
        return OffsetDateTime.now().isBefore(getYearRetryGraceDeadline(packageYear));
    }

    private OffsetDateTime calculateNextRetryAt(TedDailyPackage pkg) {
        OffsetDateTime lastAttemptAt = pkg.getUpdatedAt() != null
                ? pkg.getUpdatedAt()
                : (pkg.getCreatedAt() != null ? pkg.getCreatedAt() : OffsetDateTime.now());
        return lastAttemptAt.plus(Duration.ofMillis(properties.getNotFoundRetryInterval()));
    }

    private OffsetDateTime getYearRetryGraceDeadline(int year) {
        return LocalDate.of(year + 1, 1, 1)
                .atStartOfDay()
                .atOffset(ZoneOffset.UTC)
                .plusDays(properties.getPreviousYearGracePeriodDays());
    }

    private int getOldestYearWithData() {
        int currentYear = Year.now().getValue();
        for (int year = properties.getStartYear(); year <= currentYear; year++) {
            if (packageRepository.findLatestByYear(year).isPresent()) {
                return year;
            }
        }
        return currentYear;
    }
}

@@ -8,9 +8,7 @@ import at.procon.dip.domain.ted.entity.TedNoticeProjection;
 import at.procon.dip.domain.ted.repository.TedNoticeLotRepository;
 import at.procon.dip.domain.ted.repository.TedNoticeOrganizationRepository;
 import at.procon.dip.domain.ted.repository.TedNoticeProjectionRepository;
-import at.procon.dip.domain.ted.config.TedProjectionProperties;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
+import at.procon.ted.config.TedProcessorProperties;
 import at.procon.ted.model.entity.Organization;
 import at.procon.ted.model.entity.ProcurementDocument;
 import at.procon.ted.model.entity.ProcurementLot;
@@ -26,12 +24,11 @@ import org.springframework.transaction.annotation.Transactional;
  * Phase 3 service that materializes TED-specific structured projections on top of the generic DOC document root.
  */
 @Service
-@ConditionalOnRuntimeMode(RuntimeMode.NEW)
 @RequiredArgsConstructor
 @Slf4j
 public class TedNoticeProjectionService {
-    private final TedProjectionProperties properties;
+    private final TedProcessorProperties properties;
     private final TedGenericDocumentRootService tedGenericDocumentRootService;
     private final DocumentRepository documentRepository;
     private final TedNoticeProjectionRepository projectionRepository;
@@ -40,7 +37,7 @@ public class TedNoticeProjectionService {
     @Transactional
     public UUID registerOrRefreshProjection(ProcurementDocument legacyDocument) {
-        if (!properties.isEnabled()) {
+        if (!properties.getProjection().isEnabled()) {
             return null;
         }
@@ -50,7 +47,7 @@ public class TedNoticeProjectionService {
     @Transactional
     public UUID registerOrRefreshProjection(ProcurementDocument legacyDocument, UUID genericDocumentId) {
-        if (!properties.isEnabled()) {
+        if (!properties.getProjection().isEnabled()) {
             return null;
         }

@@ -1,25 +0,0 @@
package at.procon.dip.domain.ted.service;

/**
 * Shared package sequencing contract used to determine the next TED daily package to download.
 * <p>
 * This service encapsulates the proven sequencing rules from the legacy download implementation
 * so they can also be used by the NEW runtime without depending on the old route/service graph.
 */
public interface TedPackageSequenceService {
    /**
     * Returns the next package to download according to the current sequencing strategy,
     * or {@code null} if nothing should be downloaded right now.
     */
    PackageInfo getNextPackageToDownload();

    /**
     * Simple year/serial pair with TED package identifier helper.
     */
    record PackageInfo(int year, int serialNumber) {
        public String identifier() {
            return "%04d%05d".formatted(year, serialNumber);
        }
    }
}
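As a quick sanity check of the identifier format (4-digit year followed by a 5-digit zero-padded serial number), the record can be exercised standalone; the wrapper class below is only scaffolding for running it outside the project:

```java
// Demonstrates the TedPackageSequenceService.PackageInfo identifier format;
// the record body is copied from the interface above.
public class PackageInfoDemo {
    record PackageInfo(int year, int serialNumber) {
        String identifier() {
            // %04d -> 4-digit year, %05d -> zero-padded 5-digit serial
            return "%04d%05d".formatted(year, serialNumber);
        }
    }
}
```

For example, year 2024 with serial number 7 yields the identifier `202400007`.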

@@ -2,9 +2,7 @@ package at.procon.dip.domain.ted.startup;
 import at.procon.dip.domain.ted.repository.TedNoticeProjectionRepository;
 import at.procon.dip.domain.ted.service.TedNoticeProjectionService;
-import at.procon.dip.domain.ted.config.TedProjectionProperties;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
+import at.procon.ted.config.TedProcessorProperties;
 import at.procon.ted.repository.ProcurementDocumentRepository;
 import lombok.RequiredArgsConstructor;
 import lombok.extern.slf4j.Slf4j;
@@ -18,23 +16,22 @@ import org.springframework.stereotype.Component;
  * Optional startup backfill for Phase 3 TED projections.
  */
 @Component
-@ConditionalOnRuntimeMode(RuntimeMode.NEW)
 @RequiredArgsConstructor
 @Slf4j
 public class TedProjectionStartupRunner implements ApplicationRunner {
-    private final TedProjectionProperties properties;
+    private final TedProcessorProperties properties;
     private final ProcurementDocumentRepository procurementDocumentRepository;
     private final TedNoticeProjectionRepository projectionRepository;
     private final TedNoticeProjectionService projectionService;

     @Override
     public void run(ApplicationArguments args) {
-        if (!properties.isEnabled() || !properties.isStartupBackfillEnabled()) {
+        if (!properties.getProjection().isEnabled() || !properties.getProjection().isStartupBackfillEnabled()) {
             return;
         }
-        int limit = properties.getStartupBackfillLimit();
+        int limit = properties.getProjection().getStartupBackfillLimit();
         log.info("Phase 3 startup backfill enabled - ensuring TED projections for up to {} documents", limit);
         var page = procurementDocumentRepository.findAll(

@@ -1,12 +0,0 @@
package at.procon.dip.embedding.config;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableScheduling;
@Configuration
@EnableScheduling
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
public class EmbeddingJobSchedulingConfiguration {
}

@@ -1,14 +0,0 @@
package at.procon.dip.embedding.config;
import lombok.Data;

@Data
public class EmbeddingPolicyCondition {
    private String documentType;
    private String documentFamily;
    private String sourceType;
    private String mimeType;
    private String language;
    private String ownerTenantKey;
    private String embeddingPolicyHint;
}

@ -1,16 +0,0 @@
package at.procon.dip.embedding.config;
import java.util.ArrayList;
import java.util.List;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;
@Configuration
@ConfigurationProperties(prefix = "dip.embedding.policies")
@Data
public class EmbeddingPolicyProperties {
private EmbeddingPolicyUse defaultPolicy = new EmbeddingPolicyUse();
private List<EmbeddingPolicyRule> rules = new ArrayList<>();
}

@ -1,10 +0,0 @@
package at.procon.dip.embedding.config;
import lombok.Data;
@Data
public class EmbeddingPolicyRule {
private String name;
private EmbeddingPolicyCondition when = new EmbeddingPolicyCondition();
private EmbeddingPolicyUse use = new EmbeddingPolicyUse();
}

@ -1,12 +0,0 @@
package at.procon.dip.embedding.config;
import lombok.Data;
@Data
public class EmbeddingPolicyUse {
private String policyKey;
private String modelKey;
private String queryModelKey;
private String profileKey;
private boolean enabled = true;
}

@ -1,23 +0,0 @@
package at.procon.dip.embedding.config;
import at.procon.dip.domain.document.RepresentationType;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;
@Configuration
@ConfigurationProperties(prefix = "dip.embedding.profiles")
@Data
public class EmbeddingProfileProperties {
private Map<String, ProfileDefinition> definitions = new LinkedHashMap<>();
@Data
public static class ProfileDefinition {
private List<RepresentationType> embedRepresentationTypes = new ArrayList<>();
}
}
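
The two `@ConfigurationProperties` classes above bind `dip.embedding.policies` and `dip.embedding.profiles`. A minimal sketch of what such YAML could look like — key names follow Spring's relaxed binding of the fields shown, but all values (rule name, condition values, model and profile keys) are illustrative, not the shipped example:

```yaml
dip:
  embedding:
    policies:
      default-policy:
        policy-key: default
        model-key: default-document-model
        enabled: true
      rules:
        - name: english-ted-notices      # illustrative rule name
          when:                          # EmbeddingPolicyCondition fields
            document-family: TED
            language: en
          use:                           # EmbeddingPolicyUse fields
            policy-key: ted-en
            model-key: ted-document-model
            query-model-key: ted-query-model
            profile-key: ted-profile
            enabled: true
    profiles:
      definitions:
        ted-profile:
          embed-representation-types:    # RepresentationType values
            - SEMANTIC_TEXT
            - TITLE_ABSTRACT
```

`SEMANTIC_TEXT` and `TITLE_ABSTRACT` are the representation types visible in `DefaultEmbeddingSelectionPolicy`'s switch; the repository's actual reference config is `application-new-example-embedding-policy.yml`.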

@ -30,14 +30,6 @@ public class EmbeddingProperties {
private Duration readTimeout = Duration.ofSeconds(60);
private Map<String, String> headers = new LinkedHashMap<>();
private Integer dimensions;
-private BatchRequestProperties batchRequest = new BatchRequestProperties();
-}
-@Data
-public static class BatchRequestProperties {
-private boolean truncateText = false;
-private int truncateLength = 512;
-private int chunkSize = 20;
}
@Data
@ -67,11 +59,8 @@ public class EmbeddingProperties {
public static class JobsProperties {
private boolean enabled = false;
private int batchSize = 16;
-private boolean processInBatches = false;
-private int executionBatchSize = 8;
private int maxRetries = 5;
private Duration initialRetryDelay = Duration.ofSeconds(30);
private Duration maxRetryDelay = Duration.ofHours(6);
-private long schedulerDelayMs = 5000;
}
}

@ -1,32 +0,0 @@
package at.procon.dip.embedding.job;
import at.procon.dip.embedding.service.RepresentationEmbeddingOrchestrator;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
@Component
@RequiredArgsConstructor
@Slf4j
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@ConditionalOnProperty(prefix = "dip.embedding.jobs", name = "enabled", havingValue = "true")
public class EmbeddingJobScheduler {
private final RepresentationEmbeddingOrchestrator orchestrator;
@Scheduled(fixedDelayString = "${dip.embedding.jobs.scheduler-delay-ms:5000}")
public void processNextBatch() {
try {
int processed = orchestrator.processNextReadyBatch();
if (processed > 0) {
log.debug("NEW runtime embedding job scheduler processed {} job(s)", processed);
}
} catch (Exception ex) {
log.warn("NEW runtime embedding job scheduler failed: {}", ex.getMessage(), ex);
}
}
}
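
The scheduler above is gated on both the NEW runtime mode and a property flag. A sketch of the configuration that would activate it, using the defaults visible in `JobsProperties` and the `@Scheduled` placeholder — the exact property-file layout is an assumption:

```yaml
dip:
  embedding:
    jobs:
      enabled: true             # flips the @ConditionalOnProperty gate
      scheduler-delay-ms: 5000  # feeds the @Scheduled fixedDelayString placeholder
```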

@ -1,14 +1,13 @@
package at.procon.dip.embedding.job.service;
+import at.procon.dip.domain.document.entity.DocumentTextRepresentation;
import at.procon.dip.embedding.config.EmbeddingProperties;
import at.procon.dip.embedding.job.entity.EmbeddingJob;
import at.procon.dip.embedding.job.repository.EmbeddingJobRepository;
import at.procon.dip.embedding.model.EmbeddingJobStatus;
import at.procon.dip.embedding.model.EmbeddingJobType;
-import at.procon.dip.embedding.policy.EmbeddingProfile;
import at.procon.dip.embedding.policy.EmbeddingSelectionPolicy;
import at.procon.dip.embedding.registry.EmbeddingModelRegistry;
-import at.procon.dip.domain.document.entity.DocumentTextRepresentation;
import java.time.Duration;
import java.time.OffsetDateTime;
import java.util.List;
@ -47,14 +46,6 @@ public class EmbeddingJobService {
.toList();
}
-public List<EmbeddingJob> enqueueForDocument(UUID documentId, String modelKey, EmbeddingProfile profile) {
-var model = modelRegistry.getRequired(modelKey);
-List<DocumentTextRepresentation> selected = selectionPolicy.selectRepresentations(documentId, model, profile);
-return selected.stream()
-.map(representation -> enqueueForRepresentation(documentId, representation.getId(), modelKey, EmbeddingJobType.DOCUMENT_EMBED))
-.toList();
-}
public EmbeddingJob enqueueForRepresentation(UUID documentId, UUID representationId, String modelKey, EmbeddingJobType jobType) {
return jobRepository.findFirstByRepresentationIdAndModelKeyAndJobTypeAndStatusIn(
representationId,

@ -13,9 +13,6 @@ public record ResolvedEmbeddingProviderConfig(
Duration connectTimeout,
Duration readTimeout,
Map<String, String> headers,
-Integer dimensions,
-Boolean batchTruncateText,
-Integer batchTruncateLength,
-Integer batchChunkSize
+Integer dimensions
) {
}

@ -23,24 +23,18 @@ public class DefaultEmbeddingSelectionPolicy implements EmbeddingSelectionPolicy
@Override
public List<DocumentTextRepresentation> selectRepresentations(UUID documentId, EmbeddingModelDescriptor model) {
-return selectRepresentations(documentId, model, null);
-}
-@Override
-public List<DocumentTextRepresentation> selectRepresentations(UUID documentId, EmbeddingModelDescriptor model, EmbeddingProfile profile) {
List<DocumentTextRepresentation> representations = representationRepository.findByDocument_Id(documentId);
List<DocumentTextRepresentation> selected = new ArrayList<>();
EmbeddingProperties.IndexingProperties indexing = embeddingProperties.getIndexing();
for (DocumentTextRepresentation representation : representations) {
-if (include(representation, indexing, profile)) {
+if (include(representation, indexing)) {
selected.add(representation);
}
}
if (selected.isEmpty()) {
representationRepository.findFirstByDocument_IdAndPrimaryRepresentationTrue(documentId)
-.filter(rep -> include(rep, indexing, profile))
.ifPresent(selected::add);
}
@ -54,12 +48,7 @@ public class DefaultEmbeddingSelectionPolicy implements EmbeddingSelectionPolicy
.toList();
}
-private boolean include(DocumentTextRepresentation representation,
-EmbeddingProperties.IndexingProperties indexing,
-EmbeddingProfile profile) {
-if (profile != null && !profile.includes(representation.getRepresentationType())) {
-return false;
-}
+private boolean include(DocumentTextRepresentation representation, EmbeddingProperties.IndexingProperties indexing) {
return switch (representation.getRepresentationType()) {
case SEMANTIC_TEXT -> indexing.isEmbedSemanticText();
case TITLE_ABSTRACT -> indexing.isEmbedTitleAbstract();

@ -1,10 +0,0 @@
package at.procon.dip.embedding.policy;
public record EmbeddingPolicy(
String policyKey,
String modelKey,
String queryModelKey,
String profileKey,
boolean enabled
) {
}

@ -1,13 +0,0 @@
package at.procon.dip.embedding.policy;
import at.procon.dip.domain.document.RepresentationType;
import java.util.List;
public record EmbeddingProfile(
String profileKey,
List<RepresentationType> embedRepresentationTypes
) {
public boolean includes(RepresentationType representationType) {
return embedRepresentationTypes != null && embedRepresentationTypes.contains(representationType);
}
}

@ -8,6 +8,4 @@ import java.util.UUID;
public interface EmbeddingSelectionPolicy {
List<DocumentTextRepresentation> selectRepresentations(UUID documentId, EmbeddingModelDescriptor model);
-List<DocumentTextRepresentation> selectRepresentations(UUID documentId, EmbeddingModelDescriptor model, EmbeddingProfile profile);
}

@ -1,68 +0,0 @@
package at.procon.dip.embedding.provider.http;
import at.procon.dip.embedding.model.ResolvedEmbeddingProviderConfig;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import lombok.RequiredArgsConstructor;
@RequiredArgsConstructor
abstract class AbstractHttpEmbeddingProviderSupport {
protected final ObjectMapper objectMapper;
protected final HttpClient httpClient = HttpClient.newBuilder()
.version(HttpClient.Version.HTTP_1_1)
.build();
protected String trimTrailingSlash(String value) {
if (value == null || value.isBlank()) {
throw new IllegalArgumentException("Embedding provider baseUrl must be configured");
}
return value.endsWith("/") ? value.substring(0, value.length() - 1) : value;
}
protected HttpResponse<String> postJson(ResolvedEmbeddingProviderConfig providerConfig,
String path,
Object body) throws IOException, InterruptedException {
HttpRequest.Builder builder = HttpRequest.newBuilder()
.uri(URI.create(trimTrailingSlash(providerConfig.baseUrl()) + path))
.timeout(providerConfig.readTimeout() == null ? Duration.ofSeconds(60) : providerConfig.readTimeout())
.header("Content-Type", "application/json")
.POST(HttpRequest.BodyPublishers.ofString(
objectMapper.writeValueAsString(body),
StandardCharsets.UTF_8
));
if (providerConfig.apiKey() != null && !providerConfig.apiKey().isBlank()) {
builder.header("Authorization", "Bearer " + providerConfig.apiKey());
}
if (providerConfig.headers() != null) {
providerConfig.headers().forEach(builder::header);
}
HttpResponse<String> response = httpClient.send(
builder.build(),
HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8)
);
if (response.statusCode() / 100 != 2) {
throw new IllegalStateException(
"Embedding provider returned status %d: %s".formatted(response.statusCode(), response.body())
);
}
return response;
}
protected float[] toArray(List<Float> embedding) {
float[] result = new float[embedding.size()];
for (int i = 0; i < embedding.size(); i++) {
result[i] = embedding.get(i);
}
return result;
}
}

@ -9,22 +9,26 @@ import at.procon.dip.embedding.provider.EmbeddingProvider;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
+import java.net.URI;
+import java.net.http.HttpClient;
+import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
+import java.nio.charset.StandardCharsets;
+import java.time.Duration;
+import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Component;
-/**
- * Existing HTTP/JSON embedding provider using the /embed contract.
- */
@Component
-public class ExternalHttpEmbeddingProvider extends AbstractHttpEmbeddingProviderSupport implements EmbeddingProvider {
+@RequiredArgsConstructor
+public class ExternalHttpEmbeddingProvider implements EmbeddingProvider {
private static final String PROVIDER_TYPE = "http-json";
-public ExternalHttpEmbeddingProvider(ObjectMapper objectMapper, ObjectMapper mapper) {
-super(objectMapper);
-}
+private final ObjectMapper objectMapper;
+private final HttpClient httpClient = HttpClient.newBuilder().version(HttpClient.Version.HTTP_1_1).build();
@Override
public String providerType() {
@ -40,43 +44,61 @@ public class ExternalHttpEmbeddingProvider extends AbstractHttpEmbeddingProvider
public EmbeddingProviderResult embedDocuments(ResolvedEmbeddingProviderConfig providerConfig,
EmbeddingModelDescriptor model,
EmbeddingRequest request) {
-return execute(providerConfig, request, EmbeddingUseCase.DOCUMENT);
+return execute(providerConfig, model, request, EmbeddingUseCase.DOCUMENT);
}
@Override
public EmbeddingProviderResult embedQuery(ResolvedEmbeddingProviderConfig providerConfig,
EmbeddingModelDescriptor model,
EmbeddingRequest request) {
-return execute(providerConfig, request, EmbeddingUseCase.QUERY);
+return execute(providerConfig, model, request, EmbeddingUseCase.QUERY);
}
private EmbeddingProviderResult execute(ResolvedEmbeddingProviderConfig providerConfig,
+EmbeddingModelDescriptor model,
EmbeddingRequest request,
EmbeddingUseCase useCase) {
-if (request.texts() == null || request.texts().isEmpty()) {
-throw new IllegalArgumentException("Embedding request texts must not be empty");
-}
try {
-HttpResponse<String> response = postJson(
-providerConfig,
-"/embed",
-Map.of(
-"text", request.texts().getFirst(),
-"isQuery", useCase == EmbeddingUseCase.QUERY
-)
-);
-EmbedResponse parsed = objectMapper.readValue(response.body(), EmbedResponse.class);
-if (parsed.embedding == null) {
-throw new IllegalStateException("Embedding provider returned no embedding");
-}
+var payload = new ProviderRequest(
+model.providerModelKey(),
+request.texts(),
+useCase == EmbeddingUseCase.QUERY,
+request.providerOptions() == null ? Map.of() : request.providerOptions()
+);
+HttpRequest.Builder builder = HttpRequest.newBuilder()
+.uri(URI.create(trimTrailingSlash(providerConfig.baseUrl()) + "/embed"))
+.timeout(providerConfig.readTimeout() == null ? Duration.ofSeconds(60) : providerConfig.readTimeout())
+.header("Content-Type", "application/json")
+.POST(HttpRequest.BodyPublishers.ofString(objectMapper.writeValueAsString(payload), StandardCharsets.UTF_8));
+if (providerConfig.apiKey() != null && !providerConfig.apiKey().isBlank()) {
+builder.header("Authorization", "Bearer " + providerConfig.apiKey());
+}
+if (providerConfig.headers() != null) {
+providerConfig.headers().forEach(builder::header);
+}
+HttpResponse<String> response = httpClient.send(builder.build(), HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
+if (response.statusCode() / 100 != 2) {
+throw new IllegalStateException("Embedding provider returned status %d: %s".formatted(response.statusCode(), response.body()));
+}
+ProviderResponse parsed = objectMapper.readValue(response.body(), ProviderResponse.class);
+List<float[]> vectors = new ArrayList<>();
+if (parsed.embeddings != null) {
+for (List<Float> embedding : parsed.embeddings) {
+vectors.add(toArray(embedding));
+}
+} else if (parsed.embedding != null) {
+vectors.add(toArray(parsed.embedding));
+}
return new EmbeddingProviderResult(
-null,
-List.of(parsed.embedding),
-List.of(),
-null,
+model,
+vectors,
+parsed.warnings == null ? List.of() : parsed.warnings,
+parsed.requestId,
parsed.tokenCount
);
} catch (InterruptedException e) {
@ -87,12 +109,41 @@ public class ExternalHttpEmbeddingProvider extends AbstractHttpEmbeddingProvider
}
}
-public static class EmbedResponse {
+private float[] toArray(List<Float> embedding) {
+float[] result = new float[embedding.size()];
+for (int i = 0; i < embedding.size(); i++) {
+result[i] = embedding.get(i);
+}
+return result;
+}
+private String trimTrailingSlash(String value) {
+if (value == null || value.isBlank()) {
+throw new IllegalArgumentException("Embedding provider baseUrl must be configured");
+}
+return value.endsWith("/") ? value.substring(0, value.length() - 1) : value;
+}
+private record ProviderRequest(
+@JsonProperty("model") String model,
+@JsonProperty("texts") List<String> texts,
+@JsonProperty("is_query") boolean query,
+@JsonProperty("options") Map<String, Object> options
+) {
+}
+private static class ProviderResponse {
@JsonProperty("embedding")
-public float[] embedding;
-@JsonProperty("dimensions")
-public Integer dimensions;
+public List<Float> embedding;
+@JsonProperty("embeddings")
+public List<List<Float>> embeddings;
+@JsonProperty("warnings")
+public List<String> warnings;
+@JsonProperty("request_id")
+public String requestId;
@JsonProperty("token_count")
public Integer tokenCount;
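
On the right-hand side of this diff, the `/embed` contract is carried by the `ProviderRequest` and `ProviderResponse` records. Going only by their `@JsonProperty` annotations, a round trip could look roughly like this — all values are illustrative:

```json
{
  "model": "example-provider-model",
  "texts": ["first document", "second document"],
  "is_query": false,
  "options": {}
}
```

with a batch response along the lines of:

```json
{
  "embeddings": [[0.12, -0.03], [0.08, 0.41]],
  "warnings": [],
  "request_id": "req-0001",
  "token_count": 9
}
```

The parser also accepts a single `"embedding"` array as a fallback when `"embeddings"` is absent.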

@ -1,374 +0,0 @@
package at.procon.dip.embedding.provider.http;
import at.procon.dip.embedding.model.EmbeddingModelDescriptor;
import at.procon.dip.embedding.model.EmbeddingProviderResult;
import at.procon.dip.embedding.model.EmbeddingRequest;
import at.procon.dip.embedding.model.ResolvedEmbeddingProviderConfig;
import at.procon.dip.embedding.provider.EmbeddingProvider;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import org.springframework.stereotype.Component;
/**
* HTTP provider for vector APIs.
*
* Supported endpoints:
* POST {baseUrl}/vector-sync - single text
* POST {baseUrl}/vectorize-batch - multiple texts
*/
@Component
public class VectorSyncHttpEmbeddingProvider extends AbstractHttpEmbeddingProviderSupport implements EmbeddingProvider {
private static final String PROVIDER_TYPE = "http-vector-sync";
private static final boolean DEFAULT_TRUNCATE_TEXT = false;
private static final int DEFAULT_TRUNCATE_LENGTH = 512;
private static final int DEFAULT_CHUNK_SIZE = 20;
private static final List<String> TRUNCATE_TEXT_KEYS = List.of(
"vectorize-batch.truncate-text",
"vectorize-batch.truncate_text",
"truncate_text",
"truncate-text",
"truncateText"
);
private static final List<String> TRUNCATE_LENGTH_KEYS = List.of(
"vectorize-batch.truncate-length",
"vectorize-batch.truncate_length",
"truncate_length",
"truncate-length",
"truncateLength"
);
private static final List<String> CHUNK_SIZE_KEYS = List.of(
"vectorize-batch.chunk-size",
"vectorize-batch.chunk_size",
"chunk_size",
"chunk-size",
"chunkSize"
);
public VectorSyncHttpEmbeddingProvider(ObjectMapper objectMapper) {
super(objectMapper);
}
@Override
public String providerType() {
return PROVIDER_TYPE;
}
@Override
public boolean supports(EmbeddingModelDescriptor model, ResolvedEmbeddingProviderConfig providerConfig) {
return PROVIDER_TYPE.equalsIgnoreCase(providerConfig.providerType());
}
@Override
public EmbeddingProviderResult embedDocuments(ResolvedEmbeddingProviderConfig providerConfig,
EmbeddingModelDescriptor model,
EmbeddingRequest request) {
return execute(providerConfig, model, request);
}
@Override
public EmbeddingProviderResult embedQuery(ResolvedEmbeddingProviderConfig providerConfig,
EmbeddingModelDescriptor model,
EmbeddingRequest request) {
return execute(providerConfig, model, request);
}
private EmbeddingProviderResult execute(ResolvedEmbeddingProviderConfig providerConfig,
EmbeddingModelDescriptor model,
EmbeddingRequest request) {
if (request.texts() == null || request.texts().isEmpty()) {
throw new IllegalArgumentException("Embedding request texts must not be empty");
}
try {
return request.texts().size() == 1
? executeSingle(providerConfig, model, request.texts().getFirst())
: executeBatch(providerConfig, model, request);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new IllegalStateException("Embedding provider call interrupted", e);
} catch (IOException e) {
throw new IllegalStateException("Failed to call embedding provider", e);
}
}
private EmbeddingProviderResult executeSingle(ResolvedEmbeddingProviderConfig providerConfig,
EmbeddingModelDescriptor model,
String text) throws IOException, InterruptedException {
HttpResponse<String> response = postJson(
providerConfig,
"/vector-sync",
new VectorSyncRequest(model.providerModelKey(), text)
);
VectorSyncResponse parsed = objectMapper.readValue(response.body(), VectorSyncResponse.class);
float[] vector = extractVector(parsed.vector, parsed.combinedVector, model);
return new EmbeddingProviderResult(
model,
List.of(vector),
List.of(),
null,
parsed.tokenCount
);
}
private EmbeddingProviderResult executeBatch(ResolvedEmbeddingProviderConfig providerConfig,
EmbeddingModelDescriptor model,
EmbeddingRequest request) throws IOException, InterruptedException {
BatchRequestSettings settings = resolveBatchRequestSettings(providerConfig, request.providerOptions());
if (settings.truncateLength() <= 0) {
throw new IllegalArgumentException("Batch truncate length must be > 0");
}
if (settings.chunkSize() <= 0) {
throw new IllegalArgumentException("Batch chunk size must be > 0");
}
List<String> requestOrder = new ArrayList<>(request.texts().size());
List<VectorizeBatchItemRequest> items = new ArrayList<>(request.texts().size());
for (String text : request.texts()) {
String id = UUID.randomUUID().toString();
requestOrder.add(id);
items.add(new VectorizeBatchItemRequest(id, text));
}
HttpResponse<String> response = postJson(
providerConfig,
"/vectorize-batch",
new VectorizeBatchRequest(
model.providerModelKey(),
settings.truncateText(),
settings.truncateLength(),
settings.chunkSize(),
items
)
);
VectorizeBatchResponse parsed = objectMapper.readValue(response.body(), VectorizeBatchResponse.class);
if (parsed.results == null || parsed.results.isEmpty()) {
throw new IllegalStateException("Vectorize-batch provider returned no results");
}
Map<String, VectorizeBatchItemResponse> resultById = new HashMap<>();
for (VectorizeBatchItemResponse result : parsed.results) {
resultById.put(result.id, result);
}
List<float[]> vectors = new ArrayList<>(request.texts().size());
int totalTokenCount = 0;
boolean hasAnyTokenCount = false;
for (String id : requestOrder) {
VectorizeBatchItemResponse item = resultById.get(id);
if (item == null) {
throw new IllegalStateException("Vectorize-batch provider response is missing item for id " + id);
}
vectors.add(extractVector(item.vector, item.combinedVector, model));
if (item.tokenCount != null) {
totalTokenCount += item.tokenCount;
hasAnyTokenCount = true;
}
}
return new EmbeddingProviderResult(
model,
vectors,
List.of(),
null,
hasAnyTokenCount ? totalTokenCount : null
);
}
private BatchRequestSettings resolveBatchRequestSettings(ResolvedEmbeddingProviderConfig providerConfig,
Map<String, Object> providerOptions) {
boolean truncateText = resolveBooleanOption(
providerOptions,
TRUNCATE_TEXT_KEYS,
providerConfig.batchTruncateText() != null ? providerConfig.batchTruncateText() : DEFAULT_TRUNCATE_TEXT
);
int truncateLength = resolveIntOption(
providerOptions,
TRUNCATE_LENGTH_KEYS,
providerConfig.batchTruncateLength() != null ? providerConfig.batchTruncateLength() : DEFAULT_TRUNCATE_LENGTH
);
int chunkSize = resolveIntOption(
providerOptions,
CHUNK_SIZE_KEYS,
providerConfig.batchChunkSize() != null ? providerConfig.batchChunkSize() : DEFAULT_CHUNK_SIZE
);
return new BatchRequestSettings(truncateText, truncateLength, chunkSize);
}
private boolean resolveBooleanOption(Map<String, Object> providerOptions,
List<String> keys,
boolean defaultValue) {
Object raw = resolveOption(providerOptions, keys);
if (raw == null) {
return defaultValue;
}
if (raw instanceof Boolean booleanValue) {
return booleanValue;
}
String normalized = String.valueOf(raw).trim();
if (normalized.isEmpty()) {
return defaultValue;
}
return Boolean.parseBoolean(normalized);
}
private int resolveIntOption(Map<String, Object> providerOptions,
List<String> keys,
int defaultValue) {
Object raw = resolveOption(providerOptions, keys);
if (raw == null) {
return defaultValue;
}
if (raw instanceof Number number) {
return number.intValue();
}
String normalized = String.valueOf(raw).trim();
if (normalized.isEmpty()) {
return defaultValue;
}
return Integer.parseInt(normalized);
}
private Object resolveOption(Map<String, Object> providerOptions, List<String> keys) {
if (providerOptions == null || providerOptions.isEmpty()) {
return null;
}
for (String key : keys) {
if (providerOptions.containsKey(key)) {
return providerOptions.get(key);
}
}
return null;
}
private float[] extractVector(List<Float> vector,
List<Float> combinedVector,
EmbeddingModelDescriptor model) {
float[] resolved;
if (combinedVector != null && !combinedVector.isEmpty()) {
resolved = toArray(combinedVector);
} else if (vector != null && !vector.isEmpty()) {
resolved = toArray(vector);
} else {
throw new IllegalStateException("Embedding provider returned no vector");
}
if (model.dimensions() > 0 && resolved.length != model.dimensions()) {
throw new IllegalStateException(
"Embedding provider returned dimension %d for model %s, expected %d"
.formatted(resolved.length, model.modelKey(), model.dimensions())
);
}
return resolved;
}
private record BatchRequestSettings(boolean truncateText, int truncateLength, int chunkSize) {
}
private record VectorSyncRequest(
@JsonProperty("model") String model,
@JsonProperty("text") String text
) {
}
private record VectorizeBatchRequest(
@JsonProperty("model") String model,
@JsonProperty("truncate_text") boolean truncateText,
@JsonProperty("truncate_length") int truncateLength,
@JsonProperty("chunk_size") int chunkSize,
@JsonProperty("items") List<VectorizeBatchItemRequest> items
) {
}
private record VectorizeBatchItemRequest(
@JsonProperty("id") String id,
@JsonProperty("text") String text
) {
}
static class VectorSyncResponse {
@JsonProperty("runtime_ms")
public Double runtimeMs;
@JsonProperty("vector")
public List<Float> vector;
@JsonProperty("incomplete")
public Boolean incomplete;
@JsonProperty("combined_vector")
public List<Float> combinedVector;
@JsonProperty("token_count")
public Integer tokenCount;
@JsonProperty("model")
public String model;
@JsonProperty("max_seq_length")
public Integer maxSeqLength;
}
static class VectorizeBatchResponse {
@JsonProperty("model")
public String model;
@JsonProperty("count")
public Integer count;
@JsonProperty("results")
public List<VectorizeBatchItemResponse> results;
}
static class VectorizeBatchItemResponse {
@JsonProperty("id")
public String id;
@JsonProperty("vector")
public List<Float> vector;
@JsonProperty("token_count")
public Integer tokenCount;
@JsonProperty("runtime_ms")
public Double runtimeMs;
@JsonProperty("incomplete")
public Boolean incomplete;
@JsonProperty("combined_vector")
public List<Float> combinedVector;
@JsonProperty("truncated")
public Boolean truncated;
@JsonProperty("truncate_length")
public Integer truncateLength;
@JsonProperty("model")
public String model;
@JsonProperty("max_seq_length")
public Integer maxSeqLength;
}
}
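
The request/response records at the end of this deleted provider pin down its wire format. Reconstructed from those `@JsonProperty` annotations alone (ids, vectors, and counts are illustrative), a `/vectorize-batch` request body could look like:

```json
{
  "model": "example-provider-model",
  "truncate_text": false,
  "truncate_length": 512,
  "chunk_size": 20,
  "items": [
    {"id": "item-1", "text": "first document"},
    {"id": "item-2", "text": "second document"}
  ]
}
```

with a response along the lines of:

```json
{
  "model": "example-provider-model",
  "count": 2,
  "results": [
    {"id": "item-1", "vector": [0.12, -0.03], "token_count": 5, "incomplete": false},
    {"id": "item-2", "vector": [0.08, 0.41], "token_count": 4, "incomplete": false}
  ]
}
```

The defaults shown (512, 20, false) match `DEFAULT_TRUNCATE_LENGTH`, `DEFAULT_CHUNK_SIZE`, and `DEFAULT_TRUNCATE_TEXT` above; results are matched back to requests by `id`, not by order.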

@ -18,10 +18,6 @@ public class EmbeddingProviderConfigResolver {
throw new IllegalArgumentException("Unknown embedding provider config key: " + providerConfigKey);
}
-EmbeddingProperties.BatchRequestProperties batchRequest = provider.getBatchRequest() == null
-? new EmbeddingProperties.BatchRequestProperties()
-: provider.getBatchRequest();
return ResolvedEmbeddingProviderConfig.builder()
.key(providerConfigKey)
.providerType(provider.getType())
@ -31,9 +27,6 @@ public class EmbeddingProviderConfigResolver {
.readTimeout(provider.getReadTimeout())
.headers(provider.getHeaders() == null ? Map.of() : Map.copyOf(provider.getHeaders()))
.dimensions(provider.getDimensions())
-.batchTruncateText(batchRequest.isTruncateText())
-.batchTruncateLength(batchRequest.getTruncateLength())
-.batchChunkSize(batchRequest.getChunkSize())
.build();
}
}

@ -1,131 +0,0 @@
package at.procon.dip.embedding.service;
import at.procon.dip.domain.document.entity.Document;
import at.procon.dip.embedding.config.EmbeddingPolicyCondition;
import at.procon.dip.embedding.config.EmbeddingPolicyProperties;
import at.procon.dip.embedding.config.EmbeddingPolicyRule;
import at.procon.dip.embedding.config.EmbeddingPolicyUse;
import at.procon.dip.embedding.policy.EmbeddingPolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor;
import java.util.Map;
import java.util.Objects;
import java.util.regex.Pattern;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;
@Service
@RequiredArgsConstructor
public class DefaultEmbeddingPolicyResolver implements EmbeddingPolicyResolver {
private final EmbeddingPolicyProperties properties;
@Override
public EmbeddingPolicy resolve(Document document, SourceDescriptor sourceDescriptor) {
String overridePolicy = attributeValue(sourceDescriptor, "embeddingPolicyKey");
if (overridePolicy != null) {
return policyByKey(overridePolicy);
}
String policyHint = policyHint(sourceDescriptor);
if (policyHint != null) {
return policyByKey(policyHint);
}
for (EmbeddingPolicyRule rule : properties.getRules()) {
if (matches(rule.getWhen(), document, sourceDescriptor)) {
return toPolicy(rule.getUse());
}
}
return toPolicy(properties.getDefaultPolicy());
}
private EmbeddingPolicy policyByKey(String policyKey) {
for (EmbeddingPolicyRule rule : properties.getRules()) {
if (rule.getUse() != null && policyKey.equals(rule.getUse().getPolicyKey())) {
return toPolicy(rule.getUse());
}
}
EmbeddingPolicyUse def = properties.getDefaultPolicy();
if (def != null && policyKey.equals(def.getPolicyKey())) {
return toPolicy(def);
}
throw new IllegalArgumentException("Unknown embedding policy key: " + policyKey);
}
private EmbeddingPolicy toPolicy(EmbeddingPolicyUse use) {
if (use == null) {
throw new IllegalStateException("Embedding policy configuration is missing");
}
return new EmbeddingPolicy(
use.getPolicyKey(),
use.getModelKey(),
use.getQueryModelKey(),
use.getProfileKey(),
use.isEnabled()
);
}
private boolean matches(EmbeddingPolicyCondition c, Document document, SourceDescriptor sourceDescriptor) {
if (c == null) {
return true;
}
if (!matchesExact(c.getDocumentType(), enumName(document != null ? document.getDocumentType() : null))) {
return false;
}
if (!matchesExact(c.getDocumentFamily(), enumName(document != null ? document.getDocumentFamily() : null))) {
return false;
}
if (!matchesExact(c.getSourceType(), enumName(sourceDescriptor != null ? sourceDescriptor.sourceType() : null))) {
return false;
}
if (!matchesMime(c.getMimeType(), sourceDescriptor != null ? sourceDescriptor.mediaType() : null)) {
return false;
}
if (!matchesExact(c.getLanguage(), document != null ? document.getLanguageCode() : null)) {
return false;
}
if (!matchesExact(c.getOwnerTenantKey(), document != null && document.getOwnerTenant() != null ? document.getOwnerTenant().getTenantKey() : null )) {
return false;
}
return matchesExact(c.getEmbeddingPolicyHint(), policyHint(sourceDescriptor));
}
private boolean matchesExact(String expected, String actual) {
if (expected == null || expected.isBlank()) {
return true;
}
return Objects.equals(expected, actual);
}
private boolean matchesMime(String pattern, String actual) {
if (pattern == null || pattern.isBlank()) {
return true;
}
if (actual == null || actual.isBlank()) {
return false;
}
return Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(actual).matches();
}
private String enumName(Enum<?> value) {
return value != null ? value.name() : null;
}
private String policyHint(SourceDescriptor sourceDescriptor) {
return attributeValue(sourceDescriptor, "embeddingPolicyHint");
}
private String attributeValue(SourceDescriptor sourceDescriptor, String key) {
if (sourceDescriptor == null) {
return null;
}
Map<String, String> attributes = sourceDescriptor.attributes();
if (attributes == null) {
return null;
}
String value = attributes.get(key);
return (value == null || value.isBlank()) ? null : value;
}
}
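The deleted resolver above walks `properties.getRules()` in order, honors `embeddingPolicyKey` / `embeddingPolicyHint` source attributes, and falls back to `properties.getDefaultPolicy()`. For orientation, a hedged sketch of the configuration shape those getters imply — the property prefix and exact key spellings are assumptions, not taken from this diff (the real example lived in `application-new-example-embedding-policy.yml`):

```yaml
# Hypothetical shape, inferred from EmbeddingPolicyProperties, EmbeddingPolicyRule,
# EmbeddingPolicyCondition, and EmbeddingPolicyUse; prefix and key names are assumptions.
dip:
  embedding:
    policy:
      default-policy:
        policy-key: default
        model-key: text-default
        profile-key: standard
        enabled: true
      rules:
        - when:
            source-type: MAIL
            mime-type: "text/.*|message/rfc822"   # regex, matched case-insensitively by matchesMime
          use:
            policy-key: mail
            model-key: text-default
            query-model-key: text-default
            profile-key: mail-profile
            enabled: true
```

Note that blank condition fields match everything (`matchesExact` returns true for null/blank expectations), so a rule with an empty `when` acts as a catch-all.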

@@ -1,31 +0,0 @@
package at.procon.dip.embedding.service;
import at.procon.dip.embedding.config.EmbeddingProfileProperties;
import at.procon.dip.embedding.policy.EmbeddingProfile;
import java.util.List;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;
@Service
@RequiredArgsConstructor
public class DefaultEmbeddingProfileResolver implements EmbeddingProfileResolver {
private final EmbeddingProfileProperties properties;
@Override
public EmbeddingProfile resolve(String profileKey) {
if (profileKey == null || profileKey.isBlank()) {
throw new IllegalArgumentException("Embedding profile key must not be blank");
}
EmbeddingProfileProperties.ProfileDefinition definition = properties.getDefinitions().get(profileKey);
if (definition == null) {
throw new IllegalArgumentException("Unknown embedding profile: " + profileKey);
}
return new EmbeddingProfile(
profileKey,
List.copyOf(definition.getEmbedRepresentationTypes())
);
}
}
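The profile resolver above looks up `properties.getDefinitions().get(profileKey)` and copies the definition's `embedRepresentationTypes`. A hedged sketch of the matching configuration — prefix, profile keys, and representation-type names are assumptions for illustration:

```yaml
# Hypothetical shape, inferred from EmbeddingProfileProperties.ProfileDefinition;
# keys and representation-type values are assumptions.
dip:
  embedding:
    profiles:
      definitions:
        standard:
          embed-representation-types:
            - PLAIN_TEXT
        mail-profile:
          embed-representation-types:
            - PLAIN_TEXT
            - SUMMARY
```

An unknown key fails fast with `IllegalArgumentException("Unknown embedding profile: …")`, so every `profileKey` referenced by a policy must have a definition here.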

@@ -8,6 +8,7 @@ import at.procon.dip.domain.document.repository.DocumentEmbeddingRepository;
import at.procon.dip.domain.document.repository.DocumentTextRepresentationRepository;
import at.procon.dip.domain.document.service.DocumentEmbeddingService;
import at.procon.dip.embedding.model.EmbeddingProviderResult;
import at.procon.dip.embedding.support.EmbeddingVectorCodec;
import java.time.OffsetDateTime;
import java.util.UUID;
import lombok.RequiredArgsConstructor;
@@ -43,17 +44,11 @@ public class EmbeddingPersistenceService {
if (result.vectors() == null || result.vectors().isEmpty()) {
throw new IllegalArgumentException("Embedding provider result contains no vectors");
}
saveCompleted(embeddingId, result.vectors().getFirst(), result.tokenCount());
float[] vector = result.vectors().getFirst();
}
public void saveCompleted(UUID embeddingId, float[] vector, Integer tokenCount) {
if (vector == null || vector.length == 0) {
throw new IllegalArgumentException("Embedding vector must not be empty");
}
embeddingRepository.updateEmbeddingVector(
embeddingId,
vector,
EmbeddingVectorCodec.toPgVector(vector),
tokenCount,
result.tokenCount(),
vector.length
);
}
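The new side routes the raw `float[]` through `EmbeddingVectorCodec.toPgVector` before handing it to the repository. The codec itself is not shown in this diff; below is a minimal sketch of the usual pgvector text-literal encoding, under a hypothetical class name — not the project's actual implementation:

```java
public class PgVectorCodecSketch {

    // Renders a float[] as a pgvector text literal such as "[0.25,-1.0,3.5]".
    // Hypothetical stand-in for EmbeddingVectorCodec.toPgVector, which this diff does not show.
    static String toPgVector(float[] vector) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < vector.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(vector[i]); // appends Float.toString(vector[i])
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        System.out.println(toPgVector(new float[] {0.25f, -1.0f, 3.5f}));
    }
}
```

Encoding to the text literal at this boundary avoids depending on a driver-specific array binding for the `UPDATE` issued by `updateEmbeddingVector`.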

@@ -1,9 +0,0 @@
package at.procon.dip.embedding.service;
import at.procon.dip.domain.document.entity.Document;
import at.procon.dip.embedding.policy.EmbeddingPolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor;
public interface EmbeddingPolicyResolver {
EmbeddingPolicy resolve(Document document, SourceDescriptor sourceDescriptor);
}

@@ -1,7 +0,0 @@
package at.procon.dip.embedding.service;
import at.procon.dip.embedding.policy.EmbeddingProfile;
public interface EmbeddingProfileResolver {
EmbeddingProfile resolve(String profileKey);
}

@@ -7,14 +7,9 @@ import at.procon.dip.embedding.config.EmbeddingProperties;
import at.procon.dip.embedding.job.entity.EmbeddingJob;
import at.procon.dip.embedding.job.service.EmbeddingJobService;
import at.procon.dip.embedding.model.EmbeddingJobType;
import at.procon.dip.embedding.model.EmbeddingModelDescriptor;
import at.procon.dip.embedding.model.EmbeddingProviderResult;
import at.procon.dip.embedding.model.EmbeddingUseCase;
import at.procon.dip.embedding.policy.EmbeddingProfile;
import at.procon.dip.embedding.policy.EmbeddingSelectionPolicy;
import at.procon.dip.embedding.registry.EmbeddingModelRegistry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.UUID;
import lombok.RequiredArgsConstructor;
@@ -31,7 +26,6 @@ public class RepresentationEmbeddingOrchestrator {
private final EmbeddingExecutionService executionService;
private final EmbeddingPersistenceService persistenceService;
private final DocumentTextRepresentationRepository representationRepository;
private final EmbeddingSelectionPolicy selectionPolicy;
private final EmbeddingModelRegistry modelRegistry;
private final EmbeddingProperties embeddingProperties;
@@ -45,14 +39,6 @@ public class RepresentationEmbeddingOrchestrator {
return jobService.enqueueForDocument(documentId, modelKey);
}
@Transactional
public List<EmbeddingJob> enqueueDocument(UUID documentId, String modelKey, EmbeddingProfile profile) {
var model = modelRegistry.getRequired(modelKey);
return selectionPolicy.selectRepresentations(documentId, model, profile).stream()
.map(representation -> enqueueRepresentation(documentId, representation.getId(), modelKey))
.toList();
}
@Transactional
public EmbeddingJob enqueueRepresentation(UUID documentId, UUID representationId, String modelKey) {
return jobService.enqueueForRepresentation(documentId, representationId, modelKey, EmbeddingJobType.DOCUMENT_EMBED);
@@ -66,138 +52,25 @@ public class RepresentationEmbeddingOrchestrator {
}
List<EmbeddingJob> jobs = jobService.claimNextReadyJobs(embeddingProperties.getJobs().getBatchSize());
if (jobs.isEmpty()) {
return 0;
}
if (embeddingProperties.getJobs().isProcessInBatches()) {
processClaimedJobsInBatches(jobs);
} else {
jobs.forEach(this::processClaimedJobSafely);
}
for (EmbeddingJob job : jobs) {
processClaimedJob(job);
}
return jobs.size();
}
@Transactional
public void processClaimedJob(EmbeddingJob job) {
EmbeddingModelDescriptor model = modelRegistry.getRequired(job.getModelKey());
PreparedEmbedding prepared = prepareEmbedding(job, model);
if (prepared == null) {
return;
}
try {
EmbeddingProviderResult result = executionService.embedTexts(
job.getModelKey(),
EmbeddingUseCase.DOCUMENT,
List.of(prepared.text())
);
persistenceService.saveCompleted(prepared.embeddingId(), result);
jobService.markCompleted(job.getId(), result.providerRequestId());
} catch (RuntimeException ex) {
persistenceService.markFailed(prepared.embeddingId(), ex.getMessage());
jobService.markFailed(job.getId(), ex.getMessage(), true);
throw ex;
}
}
private void processClaimedJobsInBatches(List<EmbeddingJob> jobs) {
LinkedHashMap<String, List<EmbeddingJob>> jobsByModelKey = new LinkedHashMap<>();
for (EmbeddingJob job : jobs) {
jobsByModelKey.computeIfAbsent(job.getModelKey(), ignored -> new ArrayList<>()).add(job);
}
int executionBatchSize = Math.max(1, embeddingProperties.getJobs().getExecutionBatchSize());
for (var entry : jobsByModelKey.entrySet()) {
EmbeddingModelDescriptor model = modelRegistry.getRequired(entry.getKey());
if (!model.supportsBatch()) {
entry.getValue().forEach(this::processClaimedJobSafely);
continue;
}
List<EmbeddingJob> sameModelJobs = entry.getValue();
for (int start = 0; start < sameModelJobs.size(); start += executionBatchSize) {
List<EmbeddingJob> partition = sameModelJobs.subList(start, Math.min(start + executionBatchSize, sameModelJobs.size()));
if (partition.size() == 1) {
processClaimedJobSafely(partition.getFirst());
} else {
processClaimedBatchSafely(partition, model);
}
}
}
}
private void processClaimedBatchSafely(List<EmbeddingJob> jobs, EmbeddingModelDescriptor model) {
try {
processClaimedBatch(jobs, model);
} catch (RuntimeException ex) {
log.warn("Failed to process embedding batch for model {} ({} jobs): {}",
model.modelKey(), jobs.size(), ex.getMessage(), ex);
}
}
private void processClaimedJobSafely(EmbeddingJob job) {
try {
processClaimedJob(job);
} catch (RuntimeException ex) {
log.warn("Failed to process embedding job {} for representation {}: {}",
job.getId(), job.getRepresentationId(), ex.getMessage(), ex);
}
}
private void processClaimedBatch(List<EmbeddingJob> jobs, EmbeddingModelDescriptor model) {
List<PreparedEmbedding> preparedItems = new ArrayList<>(jobs.size());
for (EmbeddingJob job : jobs) {
PreparedEmbedding prepared = prepareEmbedding(job, model);
if (prepared != null) {
preparedItems.add(prepared);
}
}
if (preparedItems.isEmpty()) {
return;
}
try {
EmbeddingProviderResult result = executionService.embedTexts(
model.modelKey(),
EmbeddingUseCase.DOCUMENT,
preparedItems.stream().map(PreparedEmbedding::text).toList()
);
if (result.vectors() == null || result.vectors().size() != preparedItems.size()) {
throw new IllegalStateException(
"Embedding provider returned %d vectors for %d batch items"
.formatted(result.vectors() == null ? 0 : result.vectors().size(), preparedItems.size())
);
}
for (int i = 0; i < preparedItems.size(); i++) {
PreparedEmbedding prepared = preparedItems.get(i);
persistenceService.saveCompleted(prepared.embeddingId(), result.vectors().get(i), null);
jobService.markCompleted(prepared.job().getId(), result.providerRequestId());
}
} catch (RuntimeException ex) {
for (PreparedEmbedding prepared : preparedItems) {
persistenceService.markFailed(prepared.embeddingId(), ex.getMessage());
jobService.markFailed(prepared.job().getId(), ex.getMessage(), true);
}
throw ex;
}
}
private PreparedEmbedding prepareEmbedding(EmbeddingJob job, EmbeddingModelDescriptor model) {
DocumentTextRepresentation representation = representationRepository.findById(job.getRepresentationId())
.orElseThrow(() -> new IllegalArgumentException("Unknown representation id: " + job.getRepresentationId()));
String text = representation.getTextBody();
if (text == null || text.isBlank()) {
jobService.markFailed(job.getId(), "No text representation available", false);
return null;
return;
}
int maxChars = model.maxInputChars() != null
? model.maxInputChars()
int maxChars = modelRegistry.getRequired(job.getModelKey()).maxInputChars() != null
? modelRegistry.getRequired(job.getModelKey()).maxInputChars()
: embeddingProperties.getIndexing().getFallbackMaxInputChars();
if (text.length() > maxChars) {
text = text.substring(0, maxChars);
@@ -205,9 +78,19 @@ public class RepresentationEmbeddingOrchestrator {
DocumentEmbedding embedding = persistenceService.ensurePending(representation.getId(), job.getModelKey());
persistenceService.markProcessing(embedding.getId());
return new PreparedEmbedding(job, embedding.getId(), text);
}
private record PreparedEmbedding(EmbeddingJob job, UUID embeddingId, String text) {
try {
EmbeddingProviderResult result = executionService.embedTexts(
job.getModelKey(),
EmbeddingUseCase.DOCUMENT,
List.of(text)
);
persistenceService.saveCompleted(embedding.getId(), result);
jobService.markCompleted(job.getId(), result.providerRequestId());
} catch (RuntimeException ex) {
persistenceService.markFailed(embedding.getId(), ex.getMessage());
jobService.markFailed(job.getId(), ex.getMessage(), true);
throw ex;
}
}
}

@@ -6,15 +6,11 @@ import at.procon.dip.ingestion.spi.DocumentIngestionAdapter;
import at.procon.dip.ingestion.spi.IngestionResult;
import at.procon.dip.ingestion.spi.SourceDescriptor;
import java.util.List;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Component;
@Component
@RequiredArgsConstructor
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
public class FileSystemDocumentIngestionAdapter implements DocumentIngestionAdapter {
private final GenericDocumentImportService importService;

@@ -7,15 +7,11 @@ import at.procon.dip.ingestion.spi.DocumentIngestionAdapter;
import at.procon.dip.ingestion.spi.IngestionResult;
import at.procon.dip.ingestion.spi.SourceDescriptor;
import java.util.List;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Component;
@Component
@RequiredArgsConstructor
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
public class InlineContentDocumentIngestionAdapter implements DocumentIngestionAdapter {
private final GenericDocumentImportService importService;

@@ -17,9 +17,7 @@ import at.procon.dip.ingestion.spi.IngestionResult;
import at.procon.dip.ingestion.spi.OriginalContentStoragePolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor;
import at.procon.dip.ingestion.util.DocumentImportSupport;
import at.procon.dip.ingestion.config.DipIngestionProperties;
import at.procon.ted.config.TedProcessorProperties;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import at.procon.ted.service.attachment.AttachmentExtractor;
import at.procon.ted.service.attachment.ZipExtractionService;
import java.time.OffsetDateTime;
@@ -32,12 +30,11 @@ import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;
@Component
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@RequiredArgsConstructor
@Slf4j
public class MailDocumentIngestionAdapter implements DocumentIngestionAdapter {
private final DipIngestionProperties properties;
private final TedProcessorProperties properties;
private final GenericDocumentImportService importService;
private final MailMessageExtractionService mailExtractionService;
private final DocumentRelationService relationService;
@@ -46,8 +43,8 @@ public class MailDocumentIngestionAdapter implements DocumentIngestionAdapter {
@Override
public boolean supports(SourceDescriptor sourceDescriptor) {
return sourceDescriptor.sourceType() == SourceType.MAIL
&& properties.isEnabled()
&& properties.isMailAdapterEnabled();
&& properties.getGenericIngestion().isEnabled()
&& properties.getGenericIngestion().isMailAdapterEnabled();
}
@Override
@@ -65,7 +62,7 @@ public class MailDocumentIngestionAdapter implements DocumentIngestionAdapter {
if (!parsed.recipients().isEmpty()) rootAttributes.put("to", String.join(", ", parsed.recipients()));
rootAttributes.putIfAbsent("title", parsed.subject() != null ? parsed.subject() : sourceDescriptor.fileName());
rootAttributes.put("attachmentCount", Integer.toString(parsed.attachments().size()));
rootAttributes.put("importBatchId", properties.getMailImportBatchId());
rootAttributes.put("importBatchId", properties.getGenericIngestion().getMailImportBatchId());
ImportedDocumentResult rootResult = importService.importDocument(new SourceDescriptor(
accessContext,
@@ -96,13 +93,13 @@ public class MailDocumentIngestionAdapter implements DocumentIngestionAdapter {
private void importAttachment(java.util.UUID parentDocumentId, DocumentAccessContext accessContext, SourceDescriptor parentSource,
MailAttachment attachment, List<at.procon.dip.domain.document.CanonicalDocumentMetadata> documents,
List<String> warnings, int sortOrder, int depth) {
boolean expandableWrapper = properties.isExpandMailZipAttachments()
boolean expandableWrapper = properties.getGenericIngestion().isExpandMailZipAttachments()
&& zipExtractionService.canHandle(attachment.fileName(), attachment.contentType());
Map<String, String> attachmentAttributes = new LinkedHashMap<>();
attachmentAttributes.put("title", attachment.fileName());
attachmentAttributes.put("mailSourceIdentifier", parentSource.sourceIdentifier());
attachmentAttributes.put("importBatchId", properties.getMailImportBatchId());
attachmentAttributes.put("importBatchId", properties.getGenericIngestion().getMailImportBatchId());
if (expandableWrapper) {
attachmentAttributes.put("wrapperDocument", Boolean.TRUE.toString());
}
@@ -147,11 +144,11 @@ public class MailDocumentIngestionAdapter implements DocumentIngestionAdapter {
}
private DocumentAccessContext defaultMailAccessContext() {
String tenantKey = properties.getMailDefaultOwnerTenantKey();
String tenantKey = properties.getGenericIngestion().getMailDefaultOwnerTenantKey();
if (tenantKey == null || tenantKey.isBlank()) {
tenantKey = properties.getDefaultOwnerTenantKey();
tenantKey = properties.getGenericIngestion().getDefaultOwnerTenantKey();
}
DocumentVisibility visibility = properties.getMailDefaultVisibility();
DocumentVisibility visibility = properties.getGenericIngestion().getMailDefaultVisibility();
TenantRef tenant = (tenantKey == null || tenantKey.isBlank()) ? null : new TenantRef(null, tenantKey, tenantKey);
if (tenant == null && visibility == DocumentVisibility.TENANT) {
visibility = DocumentVisibility.RESTRICTED;

@@ -1,8 +1,6 @@
package at.procon.dip.ingestion.adapter;
import at.procon.dip.domain.access.DocumentAccessContext;
import at.procon.dip.domain.document.CanonicalDocumentMetadata;
import at.procon.dip.domain.document.SourceType;
import at.procon.dip.ingestion.dto.ImportedDocumentResult;
import at.procon.dip.ingestion.service.GenericDocumentImportService;
import at.procon.dip.ingestion.service.TedPackageChildImportProcessor;
@@ -11,9 +9,7 @@ import at.procon.dip.ingestion.spi.DocumentIngestionAdapter;
import at.procon.dip.ingestion.spi.IngestionResult;
import at.procon.dip.ingestion.spi.OriginalContentStoragePolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor;
import at.procon.dip.ingestion.config.DipIngestionProperties;
import at.procon.ted.config.TedProcessorProperties;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.OffsetDateTime;
@@ -24,28 +20,26 @@ import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;
import org.springframework.util.StringUtils;
@Component
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@RequiredArgsConstructor
@Slf4j
public class TedPackageDocumentIngestionAdapter implements DocumentIngestionAdapter {
private final DipIngestionProperties properties;
private final TedProcessorProperties properties;
private final GenericDocumentImportService importService;
private final TedPackageExpansionService expansionService;
private final TedPackageChildImportProcessor childImportProcessor;
@Override
public boolean supports(SourceDescriptor sourceDescriptor) {
return sourceDescriptor.sourceType() == SourceType.TED_PACKAGE
&& properties.isEnabled()
&& properties.isTedPackageAdapterEnabled();
return sourceDescriptor.sourceType() == at.procon.dip.domain.document.SourceType.TED_PACKAGE
&& properties.getGenericIngestion().isEnabled()
&& properties.getGenericIngestion().isTedPackageAdapterEnabled();
}
@Override
@@ -57,11 +51,11 @@ public class TedPackageDocumentIngestionAdap
rootAttributes.putIfAbsent("packageId", sourceDescriptor.sourceIdentifier());
rootAttributes.putIfAbsent("title", sourceDescriptor.fileName() != null ? sourceDescriptor.fileName() : sourceDescriptor.sourceIdentifier());
rootAttributes.put("wrapperDocument", Boolean.TRUE.toString());
rootAttributes.put("importBatchId", properties.getTedPackageImportBatchId());
rootAttributes.put("importBatchId", properties.getGenericIngestion().getTedPackageImportBatchId());
ImportedDocumentResult packageDocument = importService.importDocument(new SourceDescriptor(
sourceDescriptor.accessContext() == null ? DocumentAccessContext.publicDocument() : sourceDescriptor.accessContext(),
SourceType.TED_PACKAGE,
at.procon.dip.domain.document.SourceType.TED_PACKAGE,
sourceDescriptor.sourceIdentifier(),
packageRootSource.sourceUri(),
sourceDescriptor.fileName(),
@@ -74,7 +68,7 @@ public class TedPackageDocumentIngestionAdap
));
List<String> warnings = new ArrayList<>(packageDocument.warnings());
List<CanonicalDocumentMetadata> documents = new ArrayList<>();
List<at.procon.dip.domain.document.CanonicalDocumentMetadata> documents = new ArrayList<>();
documents.add(packageDocument.document().toCanonicalMetadata());
AtomicInteger sortOrder = new AtomicInteger();

@@ -7,9 +7,7 @@ import at.procon.dip.domain.tenant.TenantRef;
import at.procon.dip.ingestion.service.DocumentIngestionGateway;
import at.procon.dip.ingestion.spi.OriginalContentStoragePolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor;
import at.procon.dip.ingestion.config.DipIngestionProperties;
import at.procon.ted.config.TedProcessorProperties;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.OffsetDateTime;
@@ -23,22 +21,21 @@ import org.springframework.stereotype.Component;
import org.springframework.util.StringUtils;
@Component
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@RequiredArgsConstructor
@Slf4j
public class GenericFileSystemIngestionRoute extends RouteBuilder {
private final DipIngestionProperties properties;
private final TedProcessorProperties properties;
private final DocumentIngestionGateway ingestionGateway;
@Override
public void configure() {
if (!properties.isEnabled() || !properties.isFileSystemEnabled()) {
if (!properties.getGenericIngestion().isEnabled() || !properties.getGenericIngestion().isFileSystemEnabled()) {
log.info("Phase 4 generic filesystem ingestion route disabled");
return;
}
var config = properties;
var config = properties.getGenericIngestion();
log.info("Configuring Phase 4 generic filesystem ingestion from {}", config.getInputDirectory());
fromF("file:%s?recursive=true&include=%s&delay=%d&maxMessagesPerPoll=%d&move=%s&moveFailed=%s",
@@ -61,7 +58,7 @@ public class GenericFileSystemIngestionRoute extends RouteBuilder {
}
byte[] payload = Files.readAllBytes(path);
Map<String, String> attributes = new LinkedHashMap<>();
String languageCode = properties.getDefaultLanguageCode();
String languageCode = properties.getGenericIngestion().getDefaultLanguageCode();
if (StringUtils.hasText(languageCode)) {
attributes.put("languageCode", languageCode);
}
@@ -83,8 +80,8 @@ public class GenericFileSystemIngestionRoute extends RouteBuilder {
}
private DocumentAccessContext buildDefaultAccessContext() {
String ownerTenantKey = properties.getDefaultOwnerTenantKey();
DocumentVisibility visibility = properties.getDefaultVisibility();
String ownerTenantKey = properties.getGenericIngestion().getDefaultOwnerTenantKey();
DocumentVisibility visibility = properties.getGenericIngestion().getDefaultVisibility();
if (!StringUtils.hasText(ownerTenantKey)) {
return new DocumentAccessContext(null, visibility);
}

@@ -1,328 +0,0 @@
package at.procon.dip.ingestion.camel;

import at.procon.dip.domain.document.SourceType;
import at.procon.dip.ingestion.config.TedPackageDownloadProperties;
import at.procon.dip.ingestion.service.DocumentIngestionGateway;
import at.procon.dip.ingestion.spi.IngestionResult;
import at.procon.dip.ingestion.spi.OriginalContentStoragePolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import at.procon.ted.model.entity.TedDailyPackage;
import at.procon.ted.repository.TedDailyPackageRepository;
import at.procon.dip.domain.ted.service.TedPackageSequenceService;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.time.OffsetDateTime;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.apache.camel.Exchange;
import org.apache.camel.LoggingLevel;
import org.apache.camel.builder.RouteBuilder;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.stereotype.Component;

/**
 * NEW-runtime TED daily package download route.
 * <p>
 * Reuses the proven package sequencing rules through {@link TedPackageSequenceService},
 * but hands off processing only to the NEW ingestion gateway. No legacy XML batch persistence,
 * no legacy vectorization route, no old semantic path.
 */
@Component
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@ConditionalOnProperty(name = "dip.ingestion.ted-download.enabled", havingValue = "true")
@RequiredArgsConstructor
@Slf4j
public class TedPackageDownloadRoute extends RouteBuilder {

    private static final String ROUTE_ID_SCHEDULER = "ted-package-new-scheduler";
    private static final String ROUTE_ID_DOWNLOADER = "ted-package-new-downloader";
    private static final String ROUTE_ID_ERROR = "ted-package-new-error-handler";

    private final TedPackageDownloadProperties properties;
    private final TedDailyPackageRepository packageRepository;
    private final TedPackageSequenceService sequenceService;
    private final DocumentIngestionGateway documentIngestionGateway;

    @Override
    public void configure() {
        errorHandler(deadLetterChannel("direct:ted-package-new-error")
            .maximumRedeliveries(3)
            .redeliveryDelay(10_000)
            .retryAttemptedLogLevel(LoggingLevel.WARN)
            .logStackTrace(true));

        from("direct:ted-package-new-error")
            .routeId(ROUTE_ID_ERROR)
            .process(this::handleError);

        from("timer:ted-package-new-scheduler?period={{dip.ingestion.ted-download.poll-interval:3600000}}&delay=0")
            .routeId(ROUTE_ID_SCHEDULER)
            .process(this::checkRunningPackages)
            .choice()
                .when(header("tooManyRunning").isEqualTo(true))
                    .log(LoggingLevel.INFO, "Skipping NEW TED package download - already ${header.runningCount} packages in progress")
                .otherwise()
                    .process(this::determineNextPackage)
                    .choice()
                        .when(header("packageId").isNotNull())
                            .to("direct:download-ted-package-new")
                        .otherwise()
                            .log(LoggingLevel.INFO, "No NEW TED package to download right now")
                    .end()
            .end();

        from("direct:download-ted-package-new")
            .routeId(ROUTE_ID_DOWNLOADER)
            .log(LoggingLevel.INFO, "NEW TED package download started: ${header.packageId}")
            .setHeader("downloadStartTime", constant(System.currentTimeMillis()))
            .process(this::createPackageRecord)
            .delay(simple("{{dip.ingestion.ted-download.delay-between-downloads:5000}}"))
            .setHeader(Exchange.HTTP_METHOD, constant("GET"))
            .setHeader("CamelHttpConnectionClose", constant(true))
            .toD("${header.downloadUrl}?bridgeEndpoint=true&throwExceptionOnFailure=false&socketTimeout={{dip.ingestion.ted-download.download-timeout:300000}}")
            .choice()
                .when(header(Exchange.HTTP_RESPONSE_CODE).isEqualTo(200))
                    .process(this::calculateHash)
                    .process(this::checkDuplicateByHash)
                    .choice()
                        .when(header("isDuplicate").isEqualTo(true))
                            .process(this::markDuplicate)
                        .otherwise()
                            .process(this::saveDownloadedPackage)
                            .process(this::ingestThroughGateway)
                            .process(this::markCompleted)
                    .endChoice()
                .when(header(Exchange.HTTP_RESPONSE_CODE).isEqualTo(404))
                    .process(this::markNotFound)
                .otherwise()
                    .process(this::markFailed)
            .end();
    }

    private void checkRunningPackages(Exchange exchange) {
        long downloadingCount = packageRepository.findByDownloadStatus(TedDailyPackage.DownloadStatus.DOWNLOADING).size();
        long processingCount = packageRepository.findByDownloadStatus(TedDailyPackage.DownloadStatus.PROCESSING).size();
        long runningCount = downloadingCount + processingCount;
        exchange.getIn().setHeader("runningCount", runningCount);
        exchange.getIn().setHeader("tooManyRunning", runningCount >= properties.getMaxRunningPackages());
        if (runningCount > 0) {
            log.info("Currently {} TED packages in progress in NEW runtime ({} downloading, {} processing)",
                runningCount, downloadingCount, processingCount);
        }
    }

    private void determineNextPackage(Exchange exchange) {
        List<TedDailyPackage> pendingPackages = packageRepository.findByDownloadStatus(TedDailyPackage.DownloadStatus.PENDING);
        if (!pendingPackages.isEmpty()) {
            TedDailyPackage pkg = pendingPackages.get(0);
            log.info("Retrying PENDING TED package in NEW runtime: {}", pkg.getPackageIdentifier());
            setPackageHeaders(exchange, pkg.getYear(), pkg.getSerialNumber());
            return;
        }
        TedPackageSequenceService.PackageInfo packageInfo = sequenceService.getNextPackageToDownload();
        if (packageInfo == null) {
            exchange.getIn().setHeader("packageId", null);
            return;
        }
        setPackageHeaders(exchange, packageInfo.year(), packageInfo.serialNumber());
    }

    private void setPackageHeaders(Exchange exchange, int year, int serialNumber) {
        String packageId = "%04d%05d".formatted(year, serialNumber);
        String downloadUrl = properties.getBaseUrl() + packageId;
        exchange.getIn().setHeader("packageId", packageId);
        exchange.getIn().setHeader("year", year);
        exchange.getIn().setHeader("serialNumber", serialNumber);
        exchange.getIn().setHeader("downloadUrl", downloadUrl);
    }

    private void createPackageRecord(Exchange exchange) {
        String packageId = exchange.getIn().getHeader("packageId", String.class);
        Integer year = exchange.getIn().getHeader("year", Integer.class);
        Integer serialNumber = exchange.getIn().getHeader("serialNumber", Integer.class);
        String downloadUrl = exchange.getIn().getHeader("downloadUrl", String.class);
        if (packageRepository.existsByPackageIdentifier(packageId)) {
            return;
        }
        TedDailyPackage pkg = TedDailyPackage.builder()
            .packageIdentifier(packageId)
            .year(year)
            .serialNumber(serialNumber)
            .downloadUrl(downloadUrl)
            .downloadStatus(TedDailyPackage.DownloadStatus.DOWNLOADING)
            .build();
        packageRepository.save(pkg);
    }

    private void calculateHash(Exchange exchange) throws Exception {
        byte[] body = exchange.getIn().getBody(byte[].class);
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hashBytes = digest.digest(body);
        StringBuilder sb = new StringBuilder();
        for (byte b : hashBytes) {
            sb.append(String.format("%02x", b));
        }
        exchange.getIn().setHeader("fileHash", sb.toString());
    }

    private void checkDuplicateByHash(Exchange exchange) {
        String hash = exchange.getIn().getHeader("fileHash", String.class);
        Optional<TedDailyPackage> duplicate = packageRepository.findAll().stream()
            .filter(p -> hash.equals(p.getFileHash()))
            .findFirst();
        exchange.getIn().setHeader("isDuplicate", duplicate.isPresent());
        duplicate.ifPresent(pkg -> exchange.getIn().setHeader("duplicateOf", pkg.getPackageIdentifier()));
    }

    private void saveDownloadedPackage(Exchange exchange) throws Exception {
        String packageId = exchange.getIn().getHeader("packageId", String.class);
        String hash = exchange.getIn().getHeader("fileHash", String.class);
        byte[] body = exchange.getIn().getBody(byte[].class);
        Path downloadDir = Paths.get(properties.getDownloadDirectory());
        Files.createDirectories(downloadDir);
        Path downloadPath = downloadDir.resolve(packageId + ".tar.gz");
        Files.write(downloadPath, body);
        long downloadDuration = System.currentTimeMillis() -
            exchange.getIn().getHeader("downloadStartTime", Long.class);
        packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
            pkg.setFileHash(hash);
            pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.DOWNLOADED);
            pkg.setDownloadedAt(OffsetDateTime.now());
            pkg.setDownloadDurationMs(downloadDuration);
            packageRepository.save(pkg);
        });
        exchange.getIn().setHeader("downloadPath", downloadPath.toString());
    }

    private void ingestThroughGateway(Exchange exchange) {
        String packageId = exchange.getIn().getHeader("packageId", String.class);
        String downloadPath = exchange.getIn().getHeader("downloadPath", String.class);
        packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
            pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.PROCESSING);
            packageRepository.save(pkg);
        });
        IngestionResult ingestionResult = documentIngestionGateway.ingest(new SourceDescriptor(
            null,
            SourceType.TED_PACKAGE,
            packageId,
            downloadPath,
            packageId + ".tar.gz",
            "application/gzip",
            null,
            null,
            OffsetDateTime.now(),
            OriginalContentStoragePolicy.DEFAULT,
            Map.of(
                "packageId", packageId,
                "title", packageId + ".tar.gz"
            )
        ));
        int importedChildCount = Math.max(0, ingestionResult.documents().size() - 1);
        exchange.getIn().setHeader("gatewayImportedChildCount", importedChildCount);
        exchange.getIn().setHeader("gatewayImportWarnings", ingestionResult.warnings().size());
        packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
            pkg.setXmlFileCount(importedChildCount);
            pkg.setProcessedCount(importedChildCount);
            pkg.setFailedCount(0);
            packageRepository.save(pkg);
        });
    }

    private void markCompleted(Exchange exchange) throws Exception {
        String packageId = exchange.getIn().getHeader("packageId", String.class);
        String downloadPath = exchange.getIn().getHeader("downloadPath", String.class);
        packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
            pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.COMPLETED);
            pkg.setProcessedAt(OffsetDateTime.now());
            if (pkg.getDownloadedAt() != null) {
                long processingDuration = Math.max(0L,
                    java.time.Duration.between(pkg.getDownloadedAt(), OffsetDateTime.now()).toMillis());
                pkg.setProcessingDurationMs(processingDuration);
            }
            packageRepository.save(pkg);
        });
        if (properties.isDeleteAfterIngestion() && downloadPath != null) {
            Files.deleteIfExists(Path.of(downloadPath));
        }
        packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
            long totalDuration = (pkg.getDownloadDurationMs() != null ? pkg.getDownloadDurationMs() : 0L)
                + (pkg.getProcessingDurationMs() != null ? pkg.getProcessingDurationMs() : 0L);
            log.info("NEW TED package {} completed: xmlCount={}, processed={}, failed={}, totalDuration={}ms",
                packageId, pkg.getXmlFileCount(), pkg.getProcessedCount(), pkg.getFailedCount(), totalDuration);
        });
    }

    private void markNotFound(Exchange exchange) {
        String packageId = exchange.getIn().getHeader("packageId", String.class);
        packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
            pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.NOT_FOUND);
            pkg.setErrorMessage("Package not found (404)");
            packageRepository.save(pkg);
        });
    }

    private void markFailed(Exchange exchange) {
        String packageId = exchange.getIn().getHeader("packageId", String.class);
        Integer httpCode = exchange.getIn().getHeader(Exchange.HTTP_RESPONSE_CODE, Integer.class);
        packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
            pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.FAILED);
            pkg.setErrorMessage("HTTP " + httpCode);
            packageRepository.save(pkg);
        });
    }

    private void markDuplicate(Exchange exchange) {
        String packageId = exchange.getIn().getHeader("packageId", String.class);
        String duplicateOf = exchange.getIn().getHeader("duplicateOf", String.class);
        packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
            pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.COMPLETED);
            pkg.setErrorMessage("Duplicate of " + duplicateOf);
            pkg.setProcessedAt(OffsetDateTime.now());
            packageRepository.save(pkg);
        });
    }

    private void handleError(Exchange exchange) {
        Exception exception = exchange.getProperty(Exchange.EXCEPTION_CAUGHT, Exception.class);
        String packageId = exchange.getIn().getHeader("packageId", String.class);
        if (packageId != null) {
            packageRepository.findByPackageIdentifier(packageId).ifPresent(pkg -> {
                pkg.setDownloadStatus(TedDailyPackage.DownloadStatus.FAILED);
                pkg.setErrorMessage(exception != null ? exception.getMessage() : "Unknown route error");
                packageRepository.save(pkg);
            });
        }
    }
}
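Two small invariants this route relies on, the `yyyySSSSS` package identifier built in `setPackageHeaders()` and the lower-case hex SHA-256 computed in `calculateHash()` for duplicate detection, can be sketched in isolation. The class name below is illustrative and not part of the codebase:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Standalone sketch of two rules from TedPackageDownloadRoute.
class PackageIdAndHashSketch {

    // Mirrors setPackageHeaders(): 4-digit year plus zero-padded 5-digit serial.
    static String packageIdentifier(int year, int serialNumber) {
        return "%04d%05d".formatted(year, serialNumber);
    }

    // Mirrors calculateHash(): lower-case hex SHA-256 over the raw payload,
    // compared against TedDailyPackage.fileHash for duplicate detection.
    static String sha256Hex(byte[] payload) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : digest.digest(payload)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(packageIdentifier(2024, 7)); // 202400007
        System.out.println(sha256Hex("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The zero-padded serial keeps identifiers lexicographically sortable, which is what lets the sequence service walk packages in order.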

@@ -1,61 +0,0 @@
package at.procon.dip.ingestion.config;

import at.procon.dip.domain.access.DocumentVisibility;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "dip.ingestion")
@Data
public class DipIngestionProperties {

    private boolean enabled = false;
    private boolean fileSystemEnabled = false;
    private boolean restUploadEnabled = true;
    private String inputDirectory = "/ted.europe/generic-input";
    private String filePattern = ".*\\.(pdf|txt|html|htm|xml|md|markdown|csv|json|yaml|yml)$";
    private String processedDirectory = ".dip-processed";
    private String errorDirectory = ".dip-error";

    @Positive
    private long pollInterval = 15000;

    @Positive
    private int maxMessagesPerPoll = 10;

    private String defaultOwnerTenantKey;
    private DocumentVisibility defaultVisibility = DocumentVisibility.PUBLIC;
    private String defaultLanguageCode;
    private boolean storeOriginalBinaryInDb = true;

    @Positive
    private int maxBinaryBytesInDb = 5242880;

    private boolean deduplicateByContentHash = true;
    private boolean storeOriginalContentForWrapperDocuments = true;
    private boolean vectorizePrimaryRepresentationOnly = true;

    @NotBlank
    private String importBatchId = "phase4-generic";

    private boolean tedPackageAdapterEnabled = true;
    private boolean mailAdapterEnabled = false;
    private String mailDefaultOwnerTenantKey;
    private DocumentVisibility mailDefaultVisibility = DocumentVisibility.TENANT;
    private boolean expandMailZipAttachments = true;

    @NotBlank
    private String tedPackageImportBatchId = "phase41-ted-package";

    private boolean gatewayOnlyForTedPackages = false;

    @NotBlank
    private String mailImportBatchId = "phase41-mail";
}
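Through Spring Boot relaxed binding, the camelCase fields of `DipIngestionProperties` map onto kebab-case keys under the `dip.ingestion` prefix. A minimal illustrative fragment (a subset of keys, with the defaults from the class above; this fragment is not part of the diff):

```yaml
dip:
  ingestion:
    enabled: true
    file-system-enabled: true
    rest-upload-enabled: true
    input-directory: /ted.europe/generic-input
    poll-interval: 15000
    max-messages-per-poll: 10
    default-visibility: PUBLIC
    max-binary-bytes-in-db: 5242880
    deduplicate-by-content-hash: true
    import-batch-id: phase4-generic
```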

@@ -1,52 +0,0 @@
package at.procon.dip.ingestion.config;

import jakarta.validation.constraints.Min;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Positive;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

/**
 * NEW-runtime TED package download configuration.
 * <p>
 * This is intentionally separate from the legacy {@code ted.download.*} tree.
 */
@Configuration
@ConfigurationProperties(prefix = "dip.ingestion.ted-download")
@Data
public class TedPackageDownloadProperties {

    private boolean enabled = false;

    @NotBlank
    private String baseUrl = "https://ted.europa.eu/packages/daily/";

    @NotBlank
    private String downloadDirectory = "/ted.europe/downloads-new";

    @Positive
    private int startYear = 2015;

    @Positive
    private long pollInterval = 3_600_000L;

    @Positive
    private long notFoundRetryInterval = 21_600_000L;

    @Min(0)
    private int previousYearGracePeriodDays = 30;

    private boolean retryCurrentYearNotFoundIndefinitely = true;

    @Positive
    private long downloadTimeout = 300_000L;

    @Positive
    private int maxRunningPackages = 2;

    @Positive
    private long delayBetweenDownloads = 5_000L;

    private boolean deleteAfterIngestion = true;
}
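`TedPackageDownloadProperties` binds the same way under `dip.ingestion.ted-download`, the same prefix read by the `@ConditionalOnProperty` guard and by the `{{dip.ingestion.ted-download.poll-interval:3600000}}` placeholder in `TedPackageDownloadRoute`. An illustrative fragment enabling the route with the class defaults (not part of the diff):

```yaml
dip:
  ingestion:
    ted-download:
      enabled: true
      base-url: https://ted.europa.eu/packages/daily/
      download-directory: /ted.europe/downloads-new
      start-year: 2015
      poll-interval: 3600000
      download-timeout: 300000
      max-running-packages: 2
      delay-between-downloads: 5000
      delete-after-ingestion: true
```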

@@ -11,9 +11,7 @@ import at.procon.dip.ingestion.service.DocumentIngestionGateway;
import at.procon.dip.ingestion.spi.IngestionResult;
import at.procon.dip.ingestion.spi.OriginalContentStoragePolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor;
-import at.procon.dip.ingestion.config.DipIngestionProperties;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
+import at.procon.ted.config.TedProcessorProperties;
import java.time.OffsetDateTime;
import java.util.LinkedHashMap;
import java.util.Map;
@@ -30,11 +28,10 @@ import org.springframework.web.multipart.MultipartFile;

@RestController
@RequestMapping("/v1/dip/import")
-@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@RequiredArgsConstructor
public class GenericDocumentImportController {

-    private final DipIngestionProperties properties;
+    private final TedProcessorProperties properties;
    private final DocumentIngestionGateway ingestionGateway;

    @PostMapping(path = "/upload", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
@@ -102,7 +99,7 @@ public class GenericDocumentImportController {
    }

    private void ensureRestUploadEnabled() {
-        if (!properties.isEnabled() || !properties.isRestUploadEnabled()) {
+        if (!properties.getGenericIngestion().isEnabled() || !properties.getGenericIngestion().isRestUploadEnabled()) {
            throw new IllegalStateException("Generic REST import is disabled");
        }
    }
@@ -110,7 +107,7 @@ public class GenericDocumentImportController {
    private DocumentAccessContext buildAccessContext(String ownerTenantKey, DocumentVisibility visibility) {
        DocumentVisibility effectiveVisibility = visibility != null
            ? visibility
-            : properties.getDefaultVisibility();
+            : properties.getGenericIngestion().getDefaultVisibility();
        if (!StringUtils.hasText(ownerTenantKey)) {
            return new DocumentAccessContext(null, effectiveVisibility);
        }

@ -10,9 +10,13 @@ import at.procon.dip.domain.document.DocumentStatus;
import at.procon.dip.domain.document.StorageType; import at.procon.dip.domain.document.StorageType;
import at.procon.dip.domain.document.entity.Document; import at.procon.dip.domain.document.entity.Document;
import at.procon.dip.domain.document.entity.DocumentContent; import at.procon.dip.domain.document.entity.DocumentContent;
import at.procon.dip.domain.document.entity.DocumentEmbeddingModel;
import at.procon.dip.domain.document.entity.DocumentSource;
import at.procon.dip.domain.document.repository.DocumentEmbeddingRepository;
import at.procon.dip.domain.document.repository.DocumentRepository; import at.procon.dip.domain.document.repository.DocumentRepository;
import at.procon.dip.domain.document.repository.DocumentSourceRepository; import at.procon.dip.domain.document.repository.DocumentSourceRepository;
import at.procon.dip.domain.document.service.DocumentContentService; import at.procon.dip.domain.document.service.DocumentContentService;
import at.procon.dip.domain.document.service.DocumentEmbeddingService;
import at.procon.dip.domain.document.service.DocumentRepresentationService; import at.procon.dip.domain.document.service.DocumentRepresentationService;
import at.procon.dip.domain.document.service.DocumentService; import at.procon.dip.domain.document.service.DocumentService;
import at.procon.dip.domain.document.service.DocumentSourceService; import at.procon.dip.domain.document.service.DocumentSourceService;
@ -20,30 +24,21 @@ import at.procon.dip.domain.document.service.command.AddDocumentContentCommand;
import at.procon.dip.domain.document.service.command.AddDocumentSourceCommand; import at.procon.dip.domain.document.service.command.AddDocumentSourceCommand;
import at.procon.dip.domain.document.service.command.AddDocumentTextRepresentationCommand; import at.procon.dip.domain.document.service.command.AddDocumentTextRepresentationCommand;
import at.procon.dip.domain.document.service.command.CreateDocumentCommand; import at.procon.dip.domain.document.service.command.CreateDocumentCommand;
import at.procon.dip.embedding.config.EmbeddingProperties; import at.procon.dip.domain.document.service.command.RegisterEmbeddingModelCommand;
import at.procon.dip.embedding.policy.EmbeddingPolicy;
import at.procon.dip.embedding.policy.EmbeddingProfile;
import at.procon.dip.embedding.registry.EmbeddingModelRegistry;
import at.procon.dip.embedding.service.EmbeddingModelCatalogService;
import at.procon.dip.embedding.service.EmbeddingPolicyResolver;
import at.procon.dip.embedding.service.EmbeddingProfileResolver;
import at.procon.dip.embedding.service.RepresentationEmbeddingOrchestrator;
import at.procon.dip.extraction.service.DocumentExtractionService; import at.procon.dip.extraction.service.DocumentExtractionService;
import at.procon.dip.extraction.spi.ExtractionRequest; import at.procon.dip.extraction.spi.ExtractionRequest;
import at.procon.dip.extraction.spi.ExtractionResult; import at.procon.dip.extraction.spi.ExtractionResult;
import at.procon.dip.ingestion.config.DipIngestionProperties;
import at.procon.dip.ingestion.dto.ImportedDocumentResult; import at.procon.dip.ingestion.dto.ImportedDocumentResult;
import at.procon.dip.ingestion.spi.OriginalContentStoragePolicy; import at.procon.dip.ingestion.spi.OriginalContentStoragePolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor; import at.procon.dip.ingestion.spi.SourceDescriptor;
import at.procon.dip.ingestion.util.DocumentImportSupport; import at.procon.dip.ingestion.util.DocumentImportSupport;
import at.procon.dip.normalization.service.TextRepresentationBuildService; import at.procon.dip.normalization.service.TextRepresentationBuildService;
import at.procon.dip.processing.service.StructuredDocumentProcessingService;
import at.procon.dip.normalization.spi.RepresentationBuildRequest; import at.procon.dip.normalization.spi.RepresentationBuildRequest;
import at.procon.dip.normalization.spi.TextRepresentationDraft; import at.procon.dip.normalization.spi.TextRepresentationDraft;
import at.procon.dip.processing.service.StructuredDocumentProcessingService;
import at.procon.dip.processing.spi.DocumentProcessingPolicy; import at.procon.dip.processing.spi.DocumentProcessingPolicy;
import at.procon.dip.processing.spi.StructuredProcessingRequest; import at.procon.dip.processing.spi.StructuredProcessingRequest;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode; import at.procon.ted.config.TedProcessorProperties;
import at.procon.dip.runtime.config.RuntimeMode;
import at.procon.ted.util.HashUtils; import at.procon.ted.util.HashUtils;
import java.nio.charset.StandardCharsets; import java.nio.charset.StandardCharsets;
import java.time.OffsetDateTime; import java.time.OffsetDateTime;
@ -52,35 +47,34 @@ import java.util.LinkedHashMap;
import java.util.List; import java.util.List;
import java.util.Map; import java.util.Map;
import java.util.Optional; import java.util.Optional;
import java.util.UUID;
import lombok.RequiredArgsConstructor; import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j; import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service; import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional; import org.springframework.transaction.annotation.Transactional;
import org.springframework.util.StringUtils; import org.springframework.util.StringUtils;
/**
* Phase 4 generic import pipeline that persists arbitrary document types into the DOC model.
*/
@Service @Service
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@RequiredArgsConstructor @RequiredArgsConstructor
@Slf4j @Slf4j
public class GenericDocumentImportService { public class GenericDocumentImportService {
private final DipIngestionProperties properties; private final TedProcessorProperties properties;
private final DocumentRepository documentRepository; private final DocumentRepository documentRepository;
private final DocumentSourceRepository documentSourceRepository; private final DocumentSourceRepository documentSourceRepository;
private final DocumentEmbeddingRepository documentEmbeddingRepository;
private final DocumentService documentService; private final DocumentService documentService;
private final DocumentSourceService documentSourceService; private final DocumentSourceService documentSourceService;
private final DocumentContentService documentContentService; private final DocumentContentService documentContentService;
private final DocumentRepresentationService documentRepresentationService; private final DocumentRepresentationService documentRepresentationService;
private final DocumentEmbeddingService documentEmbeddingService;
private final DocumentClassificationService classificationService; private final DocumentClassificationService classificationService;
private final DocumentExtractionService extractionService; private final DocumentExtractionService extractionService;
private final TextRepresentationBuildService representationBuildService; private final TextRepresentationBuildService representationBuildService;
private final StructuredDocumentProcessingService structuredProcessingService; private final StructuredDocumentProcessingService structuredProcessingService;
private final EmbeddingProperties embeddingProperties;
private final EmbeddingModelRegistry embeddingModelRegistry;
private final EmbeddingModelCatalogService embeddingModelCatalogService;
private final RepresentationEmbeddingOrchestrator representationEmbeddingOrchestrator;
private final EmbeddingPolicyResolver embeddingPolicyResolver;
private final EmbeddingProfileResolver embeddingProfileResolver;
@Transactional @Transactional
public ImportedDocumentResult importDocument(SourceDescriptor sourceDescriptor) { public ImportedDocumentResult importDocument(SourceDescriptor sourceDescriptor) {
@ -92,7 +86,7 @@ public class GenericDocumentImportService {
? defaultAccessContext() ? defaultAccessContext()
: sourceDescriptor.accessContext(); : sourceDescriptor.accessContext();
if (properties.isDeduplicateByContentHash()) { if (properties.getGenericIngestion().isDeduplicateByContentHash()) {
Optional<Document> existing = resolveDeduplicatedDocument(dedupHash, accessContext); Optional<Document> existing = resolveDeduplicatedDocument(dedupHash, accessContext);
if (existing.isPresent()) { if (existing.isPresent()) {
Document document = existing.get(); Document document = existing.get();
@ -164,7 +158,7 @@ public class GenericDocumentImportService {
if (processingPolicy.runRepresentationBuilders()) { if (processingPolicy.runRepresentationBuilders()) {
var drafts = representationBuildService.build(new RepresentationBuildRequest(sourceDescriptor, detection, extractionResult)); var drafts = representationBuildService.build(new RepresentationBuildRequest(sourceDescriptor, detection, extractionResult));
persistRepresentationsAndEmbeddings(document, originalContent, persistedDerivedContent, drafts, sourceDescriptor); persistRepresentationsAndEmbeddings(document, originalContent, persistedDerivedContent, drafts);
} }
if (processingPolicy.applyStructuredTitleIfMissing() && !extractionResult.structuredPayloads().isEmpty()) { if (processingPolicy.applyStructuredTitleIfMissing() && !extractionResult.structuredPayloads().isEmpty()) {
@ -183,7 +177,30 @@ public class GenericDocumentImportService {
return new ImportedDocumentResult(reloaded, detection, warnings, false); return new ImportedDocumentResult(reloaded, detection, warnings, false);
} }
private ExtractionResult emptyExtractionResult() {
return new ExtractionResult(java.util.Collections.emptyMap(), java.util.Collections.emptyList(), java.util.Collections.emptyList());
}
private Optional<Document> resolveDeduplicatedDocument(String dedupHash, DocumentAccessContext accessContext) {
return documentRepository.findAllByDedupHash(dedupHash).stream()
.filter(existing -> sameAccessScope(existing, accessContext))
.findFirst();
}
private boolean sameAccessScope(Document existing, DocumentAccessContext accessContext) {
if (existing.getVisibility() != accessContext.visibility()) {
return false;
}
String existingTenantKey = existing.getOwnerTenant() == null ? null : existing.getOwnerTenant().getTenantKey();
String requestedTenantKey = accessContext.ownerTenant() == null ? null : accessContext.ownerTenant().tenantKey();
return java.util.Objects.equals(existingTenantKey, requestedTenantKey);
}
    private SourceDescriptor withResolvedMediaType(SourceDescriptor sourceDescriptor, ResolvedPayload payload) {
        if (StringUtils.hasText(sourceDescriptor.mediaType())) {
            return sourceDescriptor;
        }
        return new SourceDescriptor(
                sourceDescriptor.accessContext(),
                sourceDescriptor.sourceType(),

@ -231,8 +248,8 @@ public class GenericDocumentImportService {
    }

    private DocumentAccessContext defaultAccessContext() {
-        String tenantKey = properties.getDefaultOwnerTenantKey();
-        DocumentVisibility visibility = properties.getDefaultVisibility();
+        String tenantKey = properties.getGenericIngestion().getDefaultOwnerTenantKey();
+        DocumentVisibility visibility = properties.getGenericIngestion().getDefaultVisibility();
        if (!StringUtils.hasText(tenantKey)) {
            return new DocumentAccessContext(null, visibility);
        }

@ -247,7 +264,7 @@ public class GenericDocumentImportService {
            return sourceDescriptor.fileName();
        }
        if (StringUtils.hasText(payload.textContent())) {
-            for (String line : payload.textContent().split("\n")) {
+            for (String line : payload.textContent().split("\\n")) {
                if (StringUtils.hasText(line)) {
                    return DocumentImportSupport.ellipsize(line.trim(), 240);
                }

@ -277,7 +294,7 @@ public class GenericDocumentImportService {
        String importBatchId = sourceDescriptor.attributes() != null && StringUtils.hasText(sourceDescriptor.attributes().get("importBatchId"))
                ? sourceDescriptor.attributes().get("importBatchId")
-                : properties.getImportBatchId();
+                : properties.getGenericIngestion().getImportBatchId();
        documentSourceService.addSource(new AddDocumentSourceCommand(
                document.getId(),

@ -298,7 +315,7 @@ public class GenericDocumentImportService {
        if (sourceDescriptor.originalContentStoragePolicy() == OriginalContentStoragePolicy.SKIP) {
            return false;
        }
-        if (properties.isStoreOriginalContentForWrapperDocuments()) {
+        if (properties.getGenericIngestion().isStoreOriginalContentForWrapperDocuments()) {
            return true;
        }
        return !isWrapperDocument(sourceDescriptor);

@ -338,9 +355,9 @@ public class GenericDocumentImportService {
    }

    private boolean shouldStoreBinaryInDb(byte[] binaryContent) {
-        return properties.isStoreOriginalBinaryInDb()
+        return properties.getGenericIngestion().isStoreOriginalBinaryInDb()
                && binaryContent != null
-                && binaryContent.length <= properties.getMaxBinaryBytesInDb();
+                && binaryContent.length <= properties.getGenericIngestion().getMaxBinaryBytesInDb();
    }

    private Map<ContentRole, DocumentContent> persistDerivedContent(Document document,

@ -373,33 +390,32 @@ public class GenericDocumentImportService {
    private void persistRepresentationsAndEmbeddings(Document document,
                                                     DocumentContent originalContent,
                                                     Map<ContentRole, DocumentContent> derivedContent,
-                                                    List<TextRepresentationDraft> drafts,
-                                                    SourceDescriptor sourceDescriptor) {
+                                                    List<TextRepresentationDraft> drafts) {
        if (drafts == null || drafts.isEmpty()) {
            return;
        }
-        EmbeddingPolicy embeddingPolicy = null;
-        EmbeddingProfile embeddingProfile = null;
-        if (embeddingProperties.isEnabled()) {
-            embeddingPolicy = embeddingPolicyResolver.resolve(document, sourceDescriptor);
-            if (embeddingPolicy != null && embeddingPolicy.enabled()) {
-                embeddingModelRegistry.getRequired(embeddingPolicy.modelKey());
-                embeddingModelCatalogService.ensureRegistered(embeddingPolicy.modelKey());
-                embeddingProfile = embeddingProfileResolver.resolve(embeddingPolicy.profileKey());
-                log.debug("Resolved embedding policy {} for document {} -> model={}, profile={}",
-                        embeddingPolicy.policyKey(), document.getId(), embeddingPolicy.modelKey(), embeddingPolicy.profileKey());
-            } else if (embeddingPolicy != null) {
-                log.debug("Resolved disabled embedding policy {} for document {}", embeddingPolicy.policyKey(), document.getId());
-            }
+        DocumentEmbeddingModel model = null;
+        if (properties.getVectorization().isEnabled() && properties.getVectorization().isGenericPipelineEnabled()) {
+            model = documentEmbeddingService.registerModel(new RegisterEmbeddingModelCommand(
+                    properties.getVectorization().getModelName(),
+                    properties.getVectorization().getEmbeddingProvider(),
+                    properties.getVectorization().getModelName(),
+                    properties.getVectorization().getDimensions(),
+                    null,
+                    false,
+                    true
+            ));
        }
        for (TextRepresentationDraft draft : drafts) {
            if (!StringUtils.hasText(draft.textBody())) {
                continue;
            }
-            DocumentContent linkedContent = resolveLinkedContent(draft, originalContent, derivedContent);
+            DocumentContent linkedContent = switch (draft.representationType()) {
+                case FULLTEXT, SEMANTIC_TEXT, SUMMARY, TITLE_ABSTRACT, METADATA_ENRICHED, CHUNK ->
+                        derivedContent.getOrDefault(ContentRole.NORMALIZED_TEXT, originalContent);
+            };
            var representation = documentRepresentationService.addRepresentation(new AddDocumentTextRepresentationCommand(
                    document.getId(),

@ -415,12 +431,8 @@ public class GenericDocumentImportService {
                    draft.textBody()
            ));
-            if (shouldQueueEmbedding(draft, embeddingPolicy, embeddingProfile)) {
-                representationEmbeddingOrchestrator.enqueueRepresentation(
-                        document.getId(),
-                        representation.getId(),
-                        embeddingPolicy.modelKey()
-                );
+            if (model != null && shouldQueueEmbedding(draft)) {
+                documentEmbeddingService.ensurePendingEmbedding(document.getId(), representation.getId(), model.getId());
            }
        }
        documentService.updateStatus(document.getId(), DocumentStatus.REPRESENTED);

@ -435,19 +447,11 @@ public class GenericDocumentImportService {
        return derivedContent.getOrDefault(ContentRole.NORMALIZED_TEXT, originalContent);
    }

-    private boolean shouldQueueEmbedding(TextRepresentationDraft draft,
-                                         EmbeddingPolicy embeddingPolicy,
-                                         EmbeddingProfile embeddingProfile) {
-        if (embeddingPolicy == null || !embeddingPolicy.enabled() || embeddingProfile == null) {
-            return false;
-        }
-        if (!embeddingProfile.includes(draft.representationType())) {
-            return false;
-        }
+    private boolean shouldQueueEmbedding(TextRepresentationDraft draft) {
        if (draft.queueForEmbedding() != null) {
            return draft.queueForEmbedding();
        }
-        return properties.isVectorizePrimaryRepresentationOnly() ? draft.primary() : true;
+        return properties.getGenericIngestion().isVectorizePrimaryRepresentationOnly() ? draft.primary() : true;
    }

    private ExtractionResult mergeExtractionResults(ExtractionResult base, ExtractionResult override) {

@ -500,31 +504,6 @@ public class GenericDocumentImportService {
        return java.util.Objects.equals(left, right);
    }

-    private Optional<Document> resolveDeduplicatedDocument(String dedupHash, DocumentAccessContext accessContext) {
-        return documentRepository.findByDedupHash(dedupHash).stream()
-                .filter(document -> matchesAccessContext(document, accessContext))
-                .findFirst();
-    }
-
-    private boolean matchesAccessContext(Document document, DocumentAccessContext accessContext) {
-        String expectedTenantKey = accessContext.ownerTenant() == null ? null : accessContext.ownerTenant().tenantKey();
-        if (!equalsNullable(document.getOwnerTenant() != null ? document.getOwnerTenant().getTenantKey() : null, expectedTenantKey)) {
-            return false;
-        }
-        return document.getVisibility() == accessContext.visibility();
-    }
-
-    private ExtractionResult emptyExtractionResult() {
-        return new ExtractionResult(Map.of(), List.of(), List.of());
-    }
-
-    private CanonicalDocumentMetadata buildCanonicalMetadata(Document document,
-                                                             DetectionResult detection,
-                                                             SourceDescriptor sourceDescriptor,
-                                                             ExtractionResult extractionResult) {
-        return document.toCanonicalMetadata();
-    }
-
    private record ResolvedPayload(byte[] binaryContent, String textContent, String mediaType) {
    }
}
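The `properties.getGenericIngestion()` / `properties.getVectorization()` accessors used in this diff imply nested configuration blocks. A sketch of how those keys could look in `application.yml`, assuming Spring Boot's relaxed binding of the getter names — the key names are inferred from the accessors, and every value below is an illustrative placeholder, not a project default:

```yaml
ted:
  generic-ingestion:
    default-owner-tenant-key: ""          # blank -> public access context (placeholder)
    default-visibility: PUBLIC            # placeholder enum value
    import-batch-id: manual-import        # placeholder
    store-original-content-for-wrapper-documents: false
    store-original-binary-in-db: false
    max-binary-bytes-in-db: 1048576       # placeholder size cap
    vectorize-primary-representation-only: true
  vectorization:
    enabled: true
    generic-pipeline-enabled: true
    model-name: example-embedding-model   # placeholder
    embedding-provider: example-provider  # placeholder
    dimensions: 1536                      # placeholder
```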

@ -10,9 +10,7 @@ import at.procon.dip.ingestion.dto.ImportedDocumentResult;
import at.procon.dip.ingestion.service.TedPackageExpansionService.TedPackageEntry;
import at.procon.dip.ingestion.spi.OriginalContentStoragePolicy;
import at.procon.dip.ingestion.spi.SourceDescriptor;
-import at.procon.dip.ingestion.config.DipIngestionProperties;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
+import at.procon.ted.config.TedProcessorProperties;
import java.time.OffsetDateTime;
import java.util.LinkedHashMap;
import java.util.Map;

@ -23,13 +21,12 @@ import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;

@Service
-@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@RequiredArgsConstructor
public class TedPackageChildImportProcessor {

    private final GenericDocumentImportService importService;
    private final DocumentRelationService relationService;
-    private final DipIngestionProperties properties;
+    private final TedProcessorProperties properties;

    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public ChildImportResult processChild(UUID packageDocumentId,

@ -49,7 +46,7 @@ public class TedPackageChildImportProcessor {
        childAttributes.put("packageId", packageSourceIdentifier);
        childAttributes.put("archivePath", entry.archivePath());
        childAttributes.put("title", entry.fileName());
-        childAttributes.put("importBatchId", properties.getTedPackageImportBatchId());
+        childAttributes.put("importBatchId", properties.getGenericIngestion().getTedPackageImportBatchId());
        ImportedDocumentResult childResult = importService.importDocument(new SourceDescriptor(
                accessContext == null ? DocumentAccessContext.publicDocument() : accessContext,

@ -1,47 +0,0 @@
package at.procon.dip.migration.audit.config;
import jakarta.validation.constraints.Min;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;
@Configuration
@ConfigurationProperties(prefix = "dip.migration.legacy-audit")
@Data
public class LegacyTedAuditProperties {
/**
* Enables the Wave 1 / Milestone A legacy TED audit subsystem.
*/
private boolean enabled = true;
/**
* Automatically runs the read-only audit on application startup.
*/
private boolean startupRunEnabled = false;
/**
* Maximum number of legacy TED documents to scan during startup.
* 0 means no limit.
*/
@Min(0)
private int startupRunLimit = 500;
/**
* Batch size for legacy TED document paging.
*/
@Min(1)
private int pageSize = 100;
/**
* Hard cap for persisted findings in a single run to avoid runaway audit volume.
*/
@Min(1)
private int maxFindingsPerRun = 10000;
/**
* Maximum number of duplicate/grouped samples recorded for global aggregate checks.
*/
@Min(1)
private int maxDuplicateSamples = 100;
}
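For reference, the properties of the class above (removed in this compare) bind under the `dip.migration.legacy-audit` prefix via Spring Boot's relaxed binding. A minimal sketch showing the defaults declared in the class — not required settings:

```yaml
dip:
  migration:
    legacy-audit:
      enabled: true
      startup-run-enabled: false
      startup-run-limit: 500        # 0 means no limit
      page-size: 100
      max-findings-per-run: 10000
      max-duplicate-samples: 100
```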

@ -1,87 +0,0 @@
package at.procon.dip.migration.audit.entity;
import at.procon.dip.architecture.SchemaNames;
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.EnumType;
import jakarta.persistence.Enumerated;
import jakarta.persistence.FetchType;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Index;
import jakarta.persistence.JoinColumn;
import jakarta.persistence.ManyToOne;
import jakarta.persistence.PrePersist;
import jakarta.persistence.Table;
import java.time.OffsetDateTime;
import java.util.UUID;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;
@Entity
@Table(schema = SchemaNames.DOC, name = "doc_legacy_audit_finding", indexes = {
@Index(name = "idx_doc_legacy_audit_find_run", columnList = "run_id"),
@Index(name = "idx_doc_legacy_audit_find_type", columnList = "finding_type"),
@Index(name = "idx_doc_legacy_audit_find_severity", columnList = "severity"),
@Index(name = "idx_doc_legacy_audit_find_legacy_doc", columnList = "legacy_procurement_document_id"),
@Index(name = "idx_doc_legacy_audit_find_document", columnList = "document_id")
})
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class LegacyTedAuditFinding {
@Id
@GeneratedValue(strategy = GenerationType.UUID)
private UUID id;
@ManyToOne(fetch = FetchType.LAZY, optional = false)
@JoinColumn(name = "run_id", nullable = false)
private LegacyTedAuditRun run;
@Enumerated(EnumType.STRING)
@Column(name = "severity", nullable = false, length = 16)
private LegacyTedAuditSeverity severity;
@Enumerated(EnumType.STRING)
@Column(name = "finding_type", nullable = false, length = 64)
private LegacyTedAuditFindingType findingType;
@Column(name = "package_identifier", length = 20)
private String packageIdentifier;
@Column(name = "legacy_procurement_document_id")
private UUID legacyProcurementDocumentId;
@Column(name = "document_id")
private UUID documentId;
@Column(name = "ted_notice_projection_id")
private UUID tedNoticeProjectionId;
@Column(name = "reference_key", length = 255)
private String referenceKey;
@Column(name = "message", nullable = false, columnDefinition = "TEXT")
private String message;
@Column(name = "details_text", columnDefinition = "TEXT")
private String detailsText;
@Builder.Default
@Column(name = "created_at", nullable = false, updatable = false)
private OffsetDateTime createdAt = OffsetDateTime.now();
@PrePersist
protected void onCreate() {
if (createdAt == null) {
createdAt = OffsetDateTime.now();
}
}
}

@ -1,28 +0,0 @@
package at.procon.dip.migration.audit.entity;
public enum LegacyTedAuditFindingType {
PACKAGE_SEQUENCE_GAP,
PACKAGE_INCOMPLETE,
PACKAGE_COMPLETED_WITHOUT_PROCESSED_AT,
PACKAGE_COMPLETED_COUNT_MISMATCH,
PACKAGE_MISSING_XML_FILE_COUNT,
PACKAGE_MISSING_FILE_HASH,
PACKAGE_FAILED_WITHOUT_ERROR_MESSAGE,
LEGACY_PUBLICATION_ID_DUPLICATE,
DOC_DEDUP_HASH_DUPLICATE,
LEGACY_DOCUMENT_MISSING_HASH,
LEGACY_DOCUMENT_MISSING_XML,
LEGACY_DOCUMENT_MISSING_TEXT,
LEGACY_DOCUMENT_MISSING_PUBLICATION_ID,
DOC_DOCUMENT_MISSING,
DOC_DOCUMENT_DUPLICATE,
DOC_SOURCE_MISSING,
DOC_ORIGINAL_CONTENT_MISSING,
DOC_ORIGINAL_CONTENT_DUPLICATE,
DOC_PRIMARY_REPRESENTATION_MISSING,
DOC_PRIMARY_REPRESENTATION_DUPLICATE,
TED_PROJECTION_MISSING,
TED_PROJECTION_MISSING_LEGACY_LINK,
TED_PROJECTION_DOCUMENT_MISMATCH,
FINDINGS_TRUNCATED
}

@ -1,110 +0,0 @@
package at.procon.dip.migration.audit.entity;
import at.procon.dip.architecture.SchemaNames;
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.EnumType;
import jakarta.persistence.Enumerated;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Index;
import jakarta.persistence.PrePersist;
import jakarta.persistence.PreUpdate;
import jakarta.persistence.Table;
import java.time.OffsetDateTime;
import java.util.UUID;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;
@Entity
@Table(schema = SchemaNames.DOC, name = "doc_legacy_audit_run", indexes = {
@Index(name = "idx_doc_legacy_audit_run_status", columnList = "status"),
@Index(name = "idx_doc_legacy_audit_run_started", columnList = "started_at")
})
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class LegacyTedAuditRun {
@Id
@GeneratedValue(strategy = GenerationType.UUID)
private UUID id;
@Enumerated(EnumType.STRING)
@Column(name = "status", nullable = false, length = 32)
private LegacyTedAuditRunStatus status;
@Column(name = "requested_limit")
private Integer requestedLimit;
@Column(name = "page_size", nullable = false)
private Integer pageSize;
@Column(name = "scanned_packages", nullable = false)
@Builder.Default
private Integer scannedPackages = 0;
@Column(name = "scanned_legacy_documents", nullable = false)
@Builder.Default
private Integer scannedLegacyDocuments = 0;
@Column(name = "finding_count", nullable = false)
@Builder.Default
private Integer findingCount = 0;
@Column(name = "info_count", nullable = false)
@Builder.Default
private Integer infoCount = 0;
@Column(name = "warning_count", nullable = false)
@Builder.Default
private Integer warningCount = 0;
@Column(name = "error_count", nullable = false)
@Builder.Default
private Integer errorCount = 0;
@Column(name = "started_at", nullable = false)
private OffsetDateTime startedAt;
@Column(name = "completed_at")
private OffsetDateTime completedAt;
@Column(name = "summary_text", columnDefinition = "TEXT")
private String summaryText;
@Column(name = "failure_message", columnDefinition = "TEXT")
private String failureMessage;
@Builder.Default
@Column(name = "created_at", nullable = false, updatable = false)
private OffsetDateTime createdAt = OffsetDateTime.now();
@Builder.Default
@Column(name = "updated_at", nullable = false)
private OffsetDateTime updatedAt = OffsetDateTime.now();
@PrePersist
protected void onCreate() {
if (startedAt == null) {
startedAt = OffsetDateTime.now();
}
if (createdAt == null) {
createdAt = OffsetDateTime.now();
}
if (updatedAt == null) {
updatedAt = OffsetDateTime.now();
}
}
@PreUpdate
protected void onUpdate() {
updatedAt = OffsetDateTime.now();
}
}

@ -1,7 +0,0 @@
package at.procon.dip.migration.audit.entity;
public enum LegacyTedAuditRunStatus {
RUNNING,
COMPLETED,
FAILED
}

@ -1,7 +0,0 @@
package at.procon.dip.migration.audit.entity;
public enum LegacyTedAuditSeverity {
INFO,
WARNING,
ERROR
}

@ -1,8 +0,0 @@
package at.procon.dip.migration.audit.repository;
import at.procon.dip.migration.audit.entity.LegacyTedAuditFinding;
import java.util.UUID;
import org.springframework.data.jpa.repository.JpaRepository;
public interface LegacyTedAuditFindingRepository extends JpaRepository<LegacyTedAuditFinding, UUID> {
}

@ -1,8 +0,0 @@
package at.procon.dip.migration.audit.repository;
import at.procon.dip.migration.audit.entity.LegacyTedAuditRun;
import java.util.UUID;
import org.springframework.data.jpa.repository.JpaRepository;
public interface LegacyTedAuditRunRepository extends JpaRepository<LegacyTedAuditRun, UUID> {
}

@ -1,610 +0,0 @@
package at.procon.dip.migration.audit.service;
import at.procon.dip.migration.audit.config.LegacyTedAuditProperties;
import at.procon.dip.migration.audit.entity.LegacyTedAuditFinding;
import at.procon.dip.migration.audit.entity.LegacyTedAuditFindingType;
import at.procon.dip.migration.audit.entity.LegacyTedAuditRun;
import at.procon.dip.migration.audit.entity.LegacyTedAuditRunStatus;
import at.procon.dip.migration.audit.entity.LegacyTedAuditSeverity;
import at.procon.dip.migration.audit.repository.LegacyTedAuditFindingRepository;
import at.procon.dip.migration.audit.repository.LegacyTedAuditRunRepository;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import at.procon.ted.model.entity.ProcurementDocument;
import at.procon.ted.model.entity.TedDailyPackage;
import at.procon.ted.repository.ProcurementDocumentRepository;
import at.procon.ted.repository.TedDailyPackageRepository;
import java.time.OffsetDateTime;
import java.time.Year;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Sort;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.util.StringUtils;
@Service
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@RequiredArgsConstructor
@Slf4j
public class LegacyTedAuditService {
private final LegacyTedAuditProperties properties;
private final TedDailyPackageRepository tedDailyPackageRepository;
private final ProcurementDocumentRepository procurementDocumentRepository;
private final LegacyTedAuditRunRepository runRepository;
private final LegacyTedAuditFindingRepository findingRepository;
private final JdbcTemplate jdbcTemplate;
public LegacyTedAuditRun executeAudit() {
return executeAudit(properties.getStartupRunLimit());
}
public LegacyTedAuditRun executeAudit(int requestedLimit) {
if (!properties.isEnabled()) {
throw new IllegalStateException("Legacy TED audit is disabled by configuration");
}
Integer effectiveLimit = requestedLimit > 0 ? requestedLimit : null;
int pageSize = properties.getPageSize();
AuditAccumulator accumulator = new AuditAccumulator();
LegacyTedAuditRun run = LegacyTedAuditRun.builder()
.status(LegacyTedAuditRunStatus.RUNNING)
.requestedLimit(effectiveLimit)
.pageSize(pageSize)
.startedAt(OffsetDateTime.now())
.build();
run = runRepository.save(run);
try {
int scannedPackages = auditPackages(run, accumulator);
auditGlobalDuplicates(run, accumulator);
int scannedLegacyDocuments = 0; // auditLegacyDocuments(run, accumulator, effectiveLimit, pageSize) is currently disabled
run.setStatus(LegacyTedAuditRunStatus.COMPLETED);
run.setCompletedAt(OffsetDateTime.now());
run.setScannedPackages(scannedPackages);
run.setScannedLegacyDocuments(scannedLegacyDocuments);
run.setFindingCount(accumulator.totalFindings());
run.setInfoCount(accumulator.infoCount());
run.setWarningCount(accumulator.warningCount());
run.setErrorCount(accumulator.errorCount());
run.setSummaryText(buildSummary(scannedPackages, scannedLegacyDocuments, accumulator));
run.setFailureMessage(null);
run = runRepository.save(run);
log.info("Wave 1 / Milestone A legacy-only audit completed: runId={}, packages={}, documents={}, findings={}, warnings={}, errors={}",
run.getId(), scannedPackages, scannedLegacyDocuments, accumulator.totalFindings(),
accumulator.warningCount(), accumulator.errorCount());
return run;
} catch (RuntimeException ex) {
run.setStatus(LegacyTedAuditRunStatus.FAILED);
run.setCompletedAt(OffsetDateTime.now());
run.setScannedPackages(accumulator.scannedPackages());
run.setScannedLegacyDocuments(accumulator.scannedLegacyDocuments());
run.setFindingCount(accumulator.totalFindings());
run.setInfoCount(accumulator.infoCount());
run.setWarningCount(accumulator.warningCount());
run.setErrorCount(accumulator.errorCount());
run.setFailureMessage(ex.getMessage());
run.setSummaryText(buildSummary(accumulator.scannedPackages(), accumulator.scannedLegacyDocuments(), accumulator));
runRepository.save(run);
log.error("Wave 1 / Milestone A legacy-only audit failed: runId={}", run.getId(), ex);
throw ex;
}
}
private int auditPackages(LegacyTedAuditRun run, AuditAccumulator accumulator) {
List<TedDailyPackage> packages = tedDailyPackageRepository.findAll(Sort.by(Sort.Direction.ASC, "year", "serialNumber"));
if (packages.isEmpty()) {
return 0;
}
Map<Integer, List<TedDailyPackage>> packagesByYear = new TreeMap<>();
for (TedDailyPackage dailyPackage : packages) {
packagesByYear.computeIfAbsent(dailyPackage.getYear(), ignored -> new ArrayList<>()).add(dailyPackage);
}
int firstYear = packagesByYear.keySet().iterator().next();
int currentYear = Year.now().getValue();
for (int year = firstYear; year <= currentYear; year++) {
List<TedDailyPackage> yearPackages = packagesByYear.get(year);
if (yearPackages == null || yearPackages.isEmpty()) {
recordFinding(run, accumulator,
LegacyTedAuditSeverity.WARNING,
LegacyTedAuditFindingType.PACKAGE_SEQUENCE_GAP,
null,
null,
null,
null,
"year:" + year,
"No TED package rows exist for this year inside the audited interval",
"year=" + year + ", intervalStartYear=" + firstYear + ", intervalEndYear=" + currentYear);
continue;
}
auditYearPackageSequence(run, accumulator, year, yearPackages);
for (TedDailyPackage dailyPackage : yearPackages) {
accumulator.incrementScannedPackages();
auditSinglePackage(run, accumulator, dailyPackage);
}
}
return packages.size();
}
private void auditYearPackageSequence(LegacyTedAuditRun run,
AuditAccumulator accumulator,
int year,
List<TedDailyPackage> yearPackages) {
yearPackages.sort((left, right) -> Integer.compare(safeInt(left.getSerialNumber()), safeInt(right.getSerialNumber())));
int firstSerial = safeInt(yearPackages.getFirst().getSerialNumber());
if (firstSerial > 1) {
recordMissingPackageRange(run, accumulator, year, 1, firstSerial - 1,
"TED package year starts after serial 1");
}
for (int i = 1; i < yearPackages.size(); i++) {
int previousSerial = safeInt(yearPackages.get(i - 1).getSerialNumber());
int currentSerial = safeInt(yearPackages.get(i).getSerialNumber());
if (currentSerial > previousSerial + 1) {
recordMissingPackageRange(run, accumulator, year, previousSerial + 1, currentSerial - 1,
"TED package sequence gap detected");
}
}
}
private void recordMissingPackageRange(LegacyTedAuditRun run,
AuditAccumulator accumulator,
int year,
int startSerial,
int endSerial,
String message) {
String startPackageId = formatPackageIdentifier(year, startSerial);
String endPackageId = formatPackageIdentifier(year, endSerial);
String referenceKey = startSerial == endSerial ? startPackageId : startPackageId + "-" + endPackageId;
recordFinding(run, accumulator,
LegacyTedAuditSeverity.WARNING,
LegacyTedAuditFindingType.PACKAGE_SEQUENCE_GAP,
startSerial == endSerial ? startPackageId : null,
null,
null,
null,
referenceKey,
message,
"year=" + year + ", missingStartSerial=" + startSerial + ", missingEndSerial=" + endSerial);
}
private void auditSinglePackage(LegacyTedAuditRun run,
AuditAccumulator accumulator,
TedDailyPackage dailyPackage) {
String packageIdentifier = dailyPackage.getPackageIdentifier();
int processedCount = safeInt(dailyPackage.getProcessedCount());
int failedCount = safeInt(dailyPackage.getFailedCount());
int accountedDocuments = processedCount + failedCount;
if (dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.COMPLETED
&& dailyPackage.getProcessedAt() == null) {
recordFinding(run, accumulator,
LegacyTedAuditSeverity.WARNING,
LegacyTedAuditFindingType.PACKAGE_COMPLETED_WITHOUT_PROCESSED_AT,
packageIdentifier,
null,
null,
null,
packageIdentifier,
"TED package is marked COMPLETED but processedAt is null",
null);
}
if (dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.COMPLETED
&& dailyPackage.getXmlFileCount() == null) {
recordFinding(run, accumulator,
LegacyTedAuditSeverity.WARNING,
LegacyTedAuditFindingType.PACKAGE_MISSING_XML_FILE_COUNT,
packageIdentifier,
null,
null,
null,
packageIdentifier,
"TED package is marked COMPLETED but xmlFileCount is null",
null);
}
if ((dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.DOWNLOADED
|| dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.PROCESSING
|| dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.COMPLETED)
&& !StringUtils.hasText(dailyPackage.getFileHash())) {
recordFinding(run, accumulator,
LegacyTedAuditSeverity.WARNING,
LegacyTedAuditFindingType.PACKAGE_MISSING_FILE_HASH,
packageIdentifier,
null,
null,
null,
packageIdentifier,
"TED package has no file hash recorded",
"downloadStatus=" + dailyPackage.getDownloadStatus());
}
if (dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.FAILED
&& !StringUtils.hasText(dailyPackage.getErrorMessage())) {
recordFinding(run, accumulator,
LegacyTedAuditSeverity.WARNING,
LegacyTedAuditFindingType.PACKAGE_FAILED_WITHOUT_ERROR_MESSAGE,
packageIdentifier,
null,
null,
null,
packageIdentifier,
"TED package is marked FAILED but has no error message",
null);
}
if (dailyPackage.getXmlFileCount() != null) {
if (accountedDocuments > dailyPackage.getXmlFileCount()) {
recordFinding(run, accumulator,
LegacyTedAuditSeverity.ERROR,
LegacyTedAuditFindingType.PACKAGE_COMPLETED_COUNT_MISMATCH,
packageIdentifier,
null,
null,
null,
packageIdentifier,
"TED package accounting exceeds xmlFileCount",
"xmlFileCount=" + dailyPackage.getXmlFileCount()
+ ", processedCount=" + processedCount
+ ", failedCount=" + failedCount);
} else if (dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.COMPLETED
&& accountedDocuments < dailyPackage.getXmlFileCount()) {
recordFinding(run, accumulator,
LegacyTedAuditSeverity.WARNING,
LegacyTedAuditFindingType.PACKAGE_COMPLETED_COUNT_MISMATCH,
packageIdentifier,
null,
null,
null,
packageIdentifier,
"TED package accounting is below xmlFileCount",
"xmlFileCount=" + dailyPackage.getXmlFileCount()
+ ", processedCount=" + processedCount
+ ", failedCount=" + failedCount);
}
}
if (isPackageIncompleteForReimport(dailyPackage, processedCount, failedCount, accountedDocuments)) {
recordFinding(run, accumulator,
dailyPackage.getDownloadStatus() == TedDailyPackage.DownloadStatus.FAILED
? LegacyTedAuditSeverity.ERROR
: LegacyTedAuditSeverity.WARNING,
LegacyTedAuditFindingType.PACKAGE_INCOMPLETE,
packageIdentifier,
null,
null,
null,
packageIdentifier,
"TED package is not fully imported and should be considered for re-import",
buildIncompletePackageDetails(dailyPackage, processedCount, failedCount, accountedDocuments));
}
}
private boolean isPackageIncompleteForReimport(TedDailyPackage dailyPackage,
int processedCount,
int failedCount,
int accountedDocuments) {
TedDailyPackage.DownloadStatus status = dailyPackage.getDownloadStatus();
if (status == null) {
return true;
}
if (status == TedDailyPackage.DownloadStatus.NOT_FOUND) {
return false;
}
if (status == TedDailyPackage.DownloadStatus.PENDING
|| status == TedDailyPackage.DownloadStatus.DOWNLOADING
|| status == TedDailyPackage.DownloadStatus.DOWNLOADED
|| status == TedDailyPackage.DownloadStatus.PROCESSING
|| status == TedDailyPackage.DownloadStatus.FAILED) {
return true;
}
if (status != TedDailyPackage.DownloadStatus.COMPLETED) {
return true;
}
if (dailyPackage.getXmlFileCount() == null) {
return true;
}
if (failedCount > 0) {
return true;
}
return processedCount < dailyPackage.getXmlFileCount()
|| accountedDocuments != dailyPackage.getXmlFileCount();
}
private String buildIncompletePackageDetails(TedDailyPackage dailyPackage,
int processedCount,
int failedCount,
int accountedDocuments) {
return "status=" + dailyPackage.getDownloadStatus()
+ ", xmlFileCount=" + dailyPackage.getXmlFileCount()
+ ", processedCount=" + processedCount
+ ", failedCount=" + failedCount
+ ", accountedDocuments=" + accountedDocuments;
}
private void auditGlobalDuplicates(LegacyTedAuditRun run, AuditAccumulator accumulator) {
int limit = properties.getMaxDuplicateSamples();
jdbcTemplate.query(
"""
SELECT publication_id, COUNT(*) AS duplicate_count
FROM ted.procurement_document
WHERE publication_id IS NOT NULL AND publication_id <> ''
GROUP BY publication_id
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC, publication_id ASC
LIMIT ?
""",
ps -> ps.setInt(1, limit),
(rs, rowNum) -> {
String publicationId = rs.getString("publication_id");
long duplicateCount = rs.getLong("duplicate_count");
recordFinding(run, accumulator,
LegacyTedAuditSeverity.ERROR,
LegacyTedAuditFindingType.LEGACY_PUBLICATION_ID_DUPLICATE,
null,
null,
null,
null,
publicationId,
"Legacy TED publicationId appears multiple times",
"publicationId=" + publicationId + ", duplicateCount=" + duplicateCount);
return null;
});
}
private int auditLegacyDocuments(LegacyTedAuditRun run,
AuditAccumulator accumulator,
Integer requestedLimit,
int pageSize) {
int processed = 0;
int pageNumber = 0;
while (requestedLimit == null || processed < requestedLimit) {
Page<ProcurementDocument> page = procurementDocumentRepository.findAll(
PageRequest.of(pageNumber, pageSize, Sort.by(Sort.Direction.ASC, "createdAt", "id")));
if (page.isEmpty()) {
break;
}
for (ProcurementDocument legacyDocument : page.getContent()) {
auditSingleLegacyDocument(run, accumulator, legacyDocument);
accumulator.incrementScannedLegacyDocuments();
processed++;
if (requestedLimit != null && processed >= requestedLimit) {
return processed;
}
}
if (!page.hasNext()) {
break;
}
pageNumber++;
}
return processed;
}
private void auditSingleLegacyDocument(LegacyTedAuditRun run,
AuditAccumulator accumulator,
ProcurementDocument legacyDocument) {
UUID legacyDocumentId = legacyDocument.getId();
String referenceKey = buildReferenceKey(legacyDocument);
String documentHash = legacyDocument.getDocumentHash();
if (!StringUtils.hasText(documentHash)) {
recordFinding(run, accumulator,
LegacyTedAuditSeverity.ERROR,
LegacyTedAuditFindingType.LEGACY_DOCUMENT_MISSING_HASH,
null,
legacyDocumentId,
null,
null,
referenceKey,
"Legacy TED document has no documentHash",
null);
return;
}
if (!StringUtils.hasText(legacyDocument.getXmlDocument())) {
recordFinding(run, accumulator,
LegacyTedAuditSeverity.ERROR,
LegacyTedAuditFindingType.LEGACY_DOCUMENT_MISSING_XML,
null,
legacyDocumentId,
null,
null,
referenceKey,
"Legacy TED document has no xmlDocument payload",
"documentHash=" + documentHash);
}
if (!StringUtils.hasText(legacyDocument.getTextContent())) {
recordFinding(run, accumulator,
LegacyTedAuditSeverity.WARNING,
LegacyTedAuditFindingType.LEGACY_DOCUMENT_MISSING_TEXT,
null,
legacyDocumentId,
null,
null,
referenceKey,
"Legacy TED document has no normalized textContent",
"documentHash=" + documentHash);
}
if (!StringUtils.hasText(legacyDocument.getPublicationId())) {
recordFinding(run, accumulator,
LegacyTedAuditSeverity.WARNING,
LegacyTedAuditFindingType.LEGACY_DOCUMENT_MISSING_PUBLICATION_ID,
null,
legacyDocumentId,
null,
null,
referenceKey,
"Legacy TED document has no publicationId",
"documentHash=" + documentHash);
}
}
private void recordFinding(LegacyTedAuditRun run,
AuditAccumulator accumulator,
LegacyTedAuditSeverity severity,
LegacyTedAuditFindingType findingType,
String packageIdentifier,
UUID legacyProcurementDocumentId,
UUID genericDocumentId,
UUID tedProjectionId,
String referenceKey,
String message,
String detailsText) {
if (accumulator.totalFindings() >= properties.getMaxFindingsPerRun()) {
accumulator.markTruncated();
if (!accumulator.truncationRecorded()) {
LegacyTedAuditFinding truncatedFinding = LegacyTedAuditFinding.builder()
.run(run)
.severity(LegacyTedAuditSeverity.INFO)
.findingType(LegacyTedAuditFindingType.FINDINGS_TRUNCATED)
.referenceKey(referenceKey != null ? referenceKey : "max-findings-per-run")
.message("Legacy TED audit finding limit reached; additional findings were suppressed")
.detailsText("maxFindingsPerRun=" + properties.getMaxFindingsPerRun())
.build();
findingRepository.save(truncatedFinding);
accumulator.recordFinding(LegacyTedAuditSeverity.INFO, true);
}
return;
}
LegacyTedAuditFinding finding = LegacyTedAuditFinding.builder()
.run(run)
.severity(severity)
.findingType(findingType)
.packageIdentifier(packageIdentifier)
.legacyProcurementDocumentId(legacyProcurementDocumentId)
.documentId(genericDocumentId)
.tedNoticeProjectionId(tedProjectionId)
.referenceKey(referenceKey)
.message(message)
.detailsText(detailsText)
.build();
findingRepository.save(finding);
accumulator.recordFinding(severity, false);
}
private String buildReferenceKey(ProcurementDocument legacyDocument) {
if (StringUtils.hasText(legacyDocument.getPublicationId())) {
return legacyDocument.getPublicationId();
}
if (StringUtils.hasText(legacyDocument.getNoticeId())) {
return legacyDocument.getNoticeId();
}
if (StringUtils.hasText(legacyDocument.getSourceFilename())) {
return legacyDocument.getSourceFilename();
}
return String.valueOf(legacyDocument.getId());
}
private int safeInt(Integer value) {
return value != null ? value : 0;
}
private String formatPackageIdentifier(int year, int serialNumber) {
return "%04d%05d".formatted(year, serialNumber);
}
private String buildSummary(int scannedPackages,
int scannedLegacyDocuments,
AuditAccumulator accumulator) {
return "packages=" + scannedPackages
+ ", legacyDocuments=" + scannedLegacyDocuments
+ ", findings=" + accumulator.totalFindings()
+ ", warnings=" + accumulator.warningCount()
+ ", errors=" + accumulator.errorCount()
+ (accumulator.truncated() ? ", truncated=true" : "");
}
private static final class AuditAccumulator {
private int scannedPackages;
private int scannedLegacyDocuments;
private int infoCount;
private int warningCount;
private int errorCount;
private boolean truncated;
private boolean truncationRecorded;
void incrementScannedPackages() {
scannedPackages++;
}
void incrementScannedLegacyDocuments() {
scannedLegacyDocuments++;
}
void recordFinding(LegacyTedAuditSeverity severity, boolean truncationFindingRecordedNow) {
switch (severity) {
case INFO -> infoCount++;
case WARNING -> warningCount++;
case ERROR -> errorCount++;
}
if (truncationFindingRecordedNow) {
truncationRecorded = true;
}
}
void markTruncated() {
truncated = true;
}
int totalFindings() {
return infoCount + warningCount + errorCount;
}
int infoCount() {
return infoCount;
}
int warningCount() {
return warningCount;
}
int errorCount() {
return errorCount;
}
int scannedPackages() {
return scannedPackages;
}
int scannedLegacyDocuments() {
return scannedLegacyDocuments;
}
boolean truncated() {
return truncated;
}
boolean truncationRecorded() {
return truncationRecorded;
}
}
}
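The two formatting helpers near the end of this class are small enough to check in isolation. A standalone sketch (the class name is hypothetical; the logic is copied from the methods above):

```java
// Standalone sketch of the helpers above; the package identifier is the
// 4-digit year concatenated with a zero-padded 5-digit serial number.
public class PackageIdentifierSketch {
    static String formatPackageIdentifier(int year, int serialNumber) {
        return "%04d%05d".formatted(year, serialNumber);
    }

    static int safeInt(Integer value) {
        return value != null ? value : 0;
    }

    public static void main(String[] args) {
        System.out.println(formatPackageIdentifier(2024, 37)); // 202400037
        System.out.println(safeInt(null));                     // 0
    }
}
```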

@ -1,33 +0,0 @@
package at.procon.dip.migration.audit.startup;
import at.procon.dip.migration.audit.config.LegacyTedAuditProperties;
import at.procon.dip.migration.audit.service.LegacyTedAuditService;
import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
import at.procon.dip.runtime.config.RuntimeMode;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.ApplicationArguments;
import org.springframework.boot.ApplicationRunner;
import org.springframework.stereotype.Component;
@Component
@ConditionalOnRuntimeMode(RuntimeMode.NEW)
@RequiredArgsConstructor
@Slf4j
public class LegacyTedAuditStartupRunner implements ApplicationRunner {
private final LegacyTedAuditProperties properties;
private final LegacyTedAuditService legacyTedAuditService;
@Override
public void run(ApplicationArguments args) {
if (!properties.isEnabled() || !properties.isStartupRunEnabled()) {
return;
}
int requestedLimit = properties.getStartupRunLimit();
log.info("Wave 1 / Milestone A startup audit enabled - scanning legacy TED data with limit {}",
requestedLimit > 0 ? requestedLimit : "unbounded");
legacyTedAuditService.executeAudit(requestedLimit);
}
}

@ -6,9 +6,7 @@ import at.procon.dip.domain.document.RepresentationType;
 import at.procon.dip.normalization.spi.RepresentationBuildRequest;
 import at.procon.dip.normalization.spi.TextRepresentationBuilder;
 import at.procon.dip.normalization.spi.TextRepresentationDraft;
-import at.procon.dip.search.config.DipSearchProperties;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
+import at.procon.ted.config.TedProcessorProperties;
 import java.util.ArrayList;
 import java.util.List;
 import lombok.RequiredArgsConstructor;
@ -23,7 +21,7 @@ public class ChunkedLongTextRepresentationBuilder implements TextRepresentationB
 public static final String BUILDER_KEY = "long-text-chunker";
-private final DipSearchProperties properties;
+private final TedProcessorProperties properties;
 @Override
 public boolean supports(DocumentType documentType) {
@ -32,7 +30,7 @@ public class ChunkedLongTextRepresentationBuilder implements TextRepresentationB
 @Override
 public List<TextRepresentationDraft> build(RepresentationBuildRequest request) {
-if (!properties.isChunkingEnabled()) {
+if (!properties.getSearch().isChunkingEnabled()) {
 return List.of();
 }
@ -44,8 +42,8 @@ public class ChunkedLongTextRepresentationBuilder implements TextRepresentationB
 return List.of();
 }
-int target = Math.max(400, properties.getChunkTargetChars());
-int overlap = Math.max(0, Math.min(target / 3, properties.getChunkOverlapChars()));
+int target = Math.max(400, properties.getSearch().getChunkTargetChars());
+int overlap = Math.max(0, Math.min(target / 3, properties.getSearch().getChunkOverlapChars()));
 if (baseText.length() <= target + overlap) {
 return List.of();
 }
@ -53,7 +51,7 @@ public class ChunkedLongTextRepresentationBuilder implements TextRepresentationB
 List<TextRepresentationDraft> drafts = new ArrayList<>();
 int start = 0;
 int chunkIndex = 0;
-while (start < baseText.length() && chunkIndex < properties.getMaxChunksPerDocument()) {
+while (start < baseText.length() && chunkIndex < properties.getSearch().getMaxChunksPerDocument()) {
 int end = Math.min(baseText.length(), start + target);
 if (end < baseText.length()) {
 int boundary = findBoundary(baseText, end, Math.min(baseText.length(), end + 160));
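For reference, the chunking arithmetic these hunks touch can be sketched standalone. Class and method names here are hypothetical, the clamping mirrors the diff (target at least 400 chars, overlap at most target/3), and the sentence-boundary search (`findBoundary`) is omitted for brevity:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone version of the chunking loop above: short
// documents produce no CHUNK representations, and consecutive chunks
// overlap by a fixed number of characters.
public class ChunkingSketch {
    static List<String> chunk(String text, int targetChars, int overlapChars, int maxChunks) {
        int target = Math.max(400, targetChars);
        int overlap = Math.max(0, Math.min(target / 3, overlapChars));
        if (text.length() <= target + overlap) {
            return List.of(); // the DOC representation alone covers short texts
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length() && chunks.size() < maxChunks) {
            int end = Math.min(text.length(), start + target);
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break;
            }
            start = end - overlap; // neighbours share `overlap` characters
        }
        return chunks;
    }
}
```

With the defaults in the diff (1800/200/12), a 5000-character text yields three 1800-character chunks whose neighbours share 200 characters.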

@ -13,8 +13,6 @@ import at.procon.dip.ingestion.spi.SourceDescriptor;
 import at.procon.dip.processing.spi.DocumentProcessingPolicy;
 import at.procon.dip.processing.spi.StructuredDocumentProcessor;
 import at.procon.dip.processing.spi.StructuredProcessingRequest;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
 import at.procon.ted.model.entity.ProcurementDocument;
 import at.procon.ted.service.XmlParserService;
 import java.nio.charset.StandardCharsets;
@ -27,7 +25,6 @@ import org.springframework.stereotype.Component;
 import org.springframework.util.StringUtils;
 @Component
-@ConditionalOnRuntimeMode(RuntimeMode.NEW)
 @RequiredArgsConstructor
 @Slf4j
 public class TedStructuredDocumentProcessor implements StructuredDocumentProcessor {

@ -1,18 +1,12 @@
 package at.procon.dip.runtime.condition;
 import at.procon.dip.runtime.config.RuntimeMode;
-import org.springframework.boot.context.properties.bind.Binder;
-import org.springframework.context.EnvironmentAware;
+import java.util.Map;
 import org.springframework.context.annotation.Condition;
 import org.springframework.context.annotation.ConditionContext;
-import org.springframework.core.env.Environment;
 import org.springframework.core.type.AnnotatedTypeMetadata;
-import java.util.Map;
-public class RuntimeModeCondition implements Condition, EnvironmentAware {
-private Environment environment;
+public class RuntimeModeCondition implements Condition {
 @Override
 public boolean matches(ConditionContext context, AnnotatedTypeMetadata metadata) {
@ -31,9 +25,4 @@
 }
 return actual == expected;
 }
-@Override
-public void setEnvironment(Environment environment) {
-this.environment = environment;
-}
 }

@ -1,81 +1,49 @@
 package at.procon.dip.search.config;
-import jakarta.validation.constraints.Min;
-import jakarta.validation.constraints.Positive;
 import lombok.Data;
 import org.springframework.boot.context.properties.ConfigurationProperties;
 import org.springframework.context.annotation.Configuration;
-import org.springframework.validation.annotation.Validated;
-/**
- * New-runtime generic search configuration.
- *
- * <p>This property tree is intentionally separated from the legacy
- * {@code ted.search.*} settings. NEW-mode search/semantic/lexical code should
- * depend on {@code dip.search.*} only.</p>
- */
 @Configuration
 @ConfigurationProperties(prefix = "dip.search")
 @Data
-@Validated
 public class DipSearchProperties {
-/** Default page size for search results. */
-@Positive
-private int defaultPageSize = 20;
-/** Maximum allowed page size. */
-@Positive
-private int maxPageSize = 100;
-/** Semantic similarity threshold (normalized score). */
-private double similarityThreshold = 0.7d;
-/** Minimum trigram similarity for fuzzy lexical matches. */
-private double trigramSimilarityThreshold = 0.12d;
-/** Candidate limits per search engine before fusion/collapse. */
-@Positive
+private Lexical lexical = new Lexical();
+private Semantic semantic = new Semantic();
+private Fusion fusion = new Fusion();
+private Chunking chunking = new Chunking();
+@Data
+public static class Lexical {
+private double trigramSimilarityThreshold = 0.12;
 private int fulltextCandidateLimit = 120;
-@Positive
 private int trigramCandidateLimit = 120;
-@Positive
+}
+@Data
+public static class Semantic {
+private double similarityThreshold = 0.7;
 private int semanticCandidateLimit = 120;
-/** Hybrid fusion weights. */
-private double fulltextWeight = 0.35d;
-private double trigramWeight = 0.20d;
-private double semanticWeight = 0.45d;
-/** Enable chunk representations for long documents. */
-private boolean chunkingEnabled = true;
-/** Target chunk size in characters for CHUNK representations. */
-@Positive
-private int chunkTargetChars = 1800;
-/** Overlap between consecutive chunks in characters. */
-@Min(0)
-private int chunkOverlapChars = 200;
-/** Maximum CHUNK representations generated per document. */
-@Positive
-private int maxChunksPerDocument = 12;
-/** Additional score weight for recency. */
-private double recencyBoostWeight = 0.05d;
-/** Half-life in days used for recency decay. */
-@Positive
+private String defaultModelKey;
+}
+@Data
+public static class Fusion {
+private double fulltextWeight = 0.35;
+private double trigramWeight = 0.20;
+private double semanticWeight = 0.45;
+private double recencyBoostWeight = 0.05;
 private int recencyHalfLifeDays = 30;
-/** Startup backfill limit for missing DOC lexical vectors. */
-@Positive
+private int debugTopHitsPerEngine = 10;
+}
+@Data
+public static class Chunking {
+private boolean enabled = true;
+private int targetChars = 1800;
+private int overlapChars = 200;
+private int maxChunksPerDocument = 12;
 private int startupLexicalBackfillLimit = 500;
-/** Number of hits per engine returned by the debug endpoint. */
-@Positive
-private int debugTopHitsPerEngine = 10;
+}
 }
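The restructuring above renames flat `dip.search.*` keys to nested ones. Assuming Spring's relaxed kebab-case binding, the implied mapping can be captured as follows; this is illustrative only, derived from the field renames in the diff rather than from project documentation:

```java
import java.util.Map;

// Illustrative old-flat-key -> new-nested-key mapping for dip.search.*,
// assuming Spring Boot relaxed (kebab-case) binding of the fields above.
public class SearchKeyMappingSketch {
    static final Map<String, String> RENAMED_KEYS = Map.of(
            "dip.search.trigram-similarity-threshold", "dip.search.lexical.trigram-similarity-threshold",
            "dip.search.similarity-threshold", "dip.search.semantic.similarity-threshold",
            "dip.search.fulltext-weight", "dip.search.fusion.fulltext-weight",
            "dip.search.chunking-enabled", "dip.search.chunking.enabled",
            "dip.search.chunk-target-chars", "dip.search.chunking.target-chars");
}
```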

@ -5,20 +5,17 @@ import at.procon.dip.search.dto.SearchEngineType;
 import at.procon.dip.search.dto.SearchHit;
 import at.procon.dip.search.engine.SearchEngine;
 import at.procon.dip.search.repository.DocumentFullTextSearchRepository;
-import at.procon.dip.search.config.DipSearchProperties;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
+import at.procon.ted.config.TedProcessorProperties;
 import java.util.List;
 import lombok.RequiredArgsConstructor;
 import org.springframework.stereotype.Component;
 @Component
-@ConditionalOnRuntimeMode(RuntimeMode.NEW)
 @RequiredArgsConstructor
 public class PostgresFullTextSearchEngine implements SearchEngine {
 private final DocumentFullTextSearchRepository repository;
-private final DipSearchProperties properties;
+private final TedProcessorProperties properties;
 @Override
 public SearchEngineType type() {
@ -32,6 +29,6 @@ public class PostgresFullTextSearchEngine implements SearchEngine {
 @Override
 public List<SearchHit> execute(SearchExecutionContext context) {
-return repository.search(context, properties.getFulltextCandidateLimit());
+return repository.search(context, properties.getSearch().getFulltextCandidateLimit());
 }
 }

@ -9,23 +9,23 @@ import at.procon.dip.search.dto.SearchHit;
 import at.procon.dip.search.engine.SearchEngine;
 import at.procon.dip.search.repository.DocumentSemanticSearchRepository;
 import at.procon.dip.search.service.SemanticQueryEmbeddingService;
-import at.procon.dip.search.config.DipSearchProperties;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
+import at.procon.ted.config.TedProcessorProperties;
 import java.util.List;
 import lombok.RequiredArgsConstructor;
 import lombok.extern.slf4j.Slf4j;
 import org.springframework.stereotype.Component;
+import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
+import at.procon.dip.runtime.config.RuntimeMode;
 @Component
-@ConditionalOnRuntimeMode(RuntimeMode.NEW)
 @RequiredArgsConstructor
 @Slf4j
+@ConditionalOnRuntimeMode(RuntimeMode.NEW)
 public class PgVectorSemanticSearchEngine implements SearchEngine {
 private final EmbeddingProperties embeddingProperties;
 private final EmbeddingModelRegistry embeddingModelRegistry;
-private final DipSearchProperties properties;
+private final TedProcessorProperties properties;
 private final SemanticQueryEmbeddingService queryEmbeddingService;
 private final DocumentSemanticSearchRepository repository;
@ -56,8 +56,8 @@ public class PgVectorSemanticSearchEngine implements SearchEngine {
 model.dimensions(),
 model.distanceMetric(),
 query.vectorString(),
-properties.getSemanticCandidateLimit(),
-properties.getSimilarityThreshold()))
+properties.getSearch().getSemanticCandidateLimit(),
+properties.getSearch().getSimilarityThreshold()))
 .orElseGet(() -> {
 log.debug("Semantic search skipped because query embedding could not be generated for model {}", model.modelKey());
 return List.of();

@ -5,20 +5,17 @@ import at.procon.dip.search.dto.SearchEngineType;
 import at.procon.dip.search.dto.SearchHit;
 import at.procon.dip.search.engine.SearchEngine;
 import at.procon.dip.search.repository.DocumentTrigramSearchRepository;
-import at.procon.dip.search.config.DipSearchProperties;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
+import at.procon.ted.config.TedProcessorProperties;
 import java.util.List;
 import lombok.RequiredArgsConstructor;
 import org.springframework.stereotype.Component;
 @Component
-@ConditionalOnRuntimeMode(RuntimeMode.NEW)
 @RequiredArgsConstructor
 public class PostgresTrigramSearchEngine implements SearchEngine {
 private final DocumentTrigramSearchRepository repository;
-private final DipSearchProperties properties;
+private final TedProcessorProperties properties;
 @Override
 public SearchEngineType type() {
@ -34,7 +31,7 @@ public class PostgresTrigramSearchEngine implements SearchEngine {
 public List<SearchHit> execute(SearchExecutionContext context) {
 return repository.search(
 context,
-properties.getTrigramCandidateLimit(),
-properties.getTrigramSimilarityThreshold());
+properties.getSearch().getTrigramCandidateLimit(),
+properties.getSearch().getTrigramSimilarityThreshold());
 }
 }

@ -7,9 +7,7 @@ import at.procon.dip.search.dto.SearchEngineType;
 import at.procon.dip.search.dto.SearchHit;
 import at.procon.dip.search.dto.SearchResponse;
 import at.procon.dip.search.dto.SearchSortMode;
-import at.procon.dip.search.config.DipSearchProperties;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
+import at.procon.ted.config.TedProcessorProperties;
 import java.util.ArrayList;
 import java.util.Comparator;
 import java.util.EnumMap;
@ -22,12 +20,11 @@ import lombok.RequiredArgsConstructor;
 import org.springframework.stereotype.Component;
 @Component
-@ConditionalOnRuntimeMode(RuntimeMode.NEW)
 @RequiredArgsConstructor
 public class DefaultSearchResultFusionService implements SearchResultFusionService {
 private final SearchScoreNormalizer normalizer;
-private final DipSearchProperties properties;
+private final TedProcessorProperties properties;
 @Override
 public SearchResponse fuse(SearchExecutionContext context,
@ -100,7 +97,7 @@ public class DefaultSearchResultFusionService implements SearchResultFusionServi
 if (hit == null) {
 return 0.0d;
 }
-DipSearchProperties search = properties;
+TedProcessorProperties.SearchProperties search = properties.getSearch();
 return switch (engineType) {
 case POSTGRES_FULLTEXT -> hit.getNormalizedScore() * search.getFulltextWeight();
 case POSTGRES_TRIGRAM -> hit.getNormalizedScore() * search.getTrigramWeight();
@ -113,9 +110,9 @@
 normalized.forEach((engine, hits) -> {
 for (SearchHit hit : hits) {
 double finalScore = switch (engine) {
-case POSTGRES_FULLTEXT -> hit.getNormalizedScore() * properties.getFulltextWeight();
-case POSTGRES_TRIGRAM -> hit.getNormalizedScore() * properties.getTrigramWeight();
-case PGVECTOR_SEMANTIC -> hit.getNormalizedScore() * properties.getSemanticWeight();
+case POSTGRES_FULLTEXT -> hit.getNormalizedScore() * properties.getSearch().getFulltextWeight();
+case POSTGRES_TRIGRAM -> hit.getNormalizedScore() * properties.getSearch().getTrigramWeight();
+case PGVECTOR_SEMANTIC -> hit.getNormalizedScore() * properties.getSearch().getSemanticWeight();
 };
 merged.add(hit.toBuilder()
 .finalScore(finalScore + recencyBoost(hit))
@ -141,13 +138,13 @@
 }
 private double recencyBoost(SearchHit hit) {
-if (properties.getRecencyBoostWeight() <= 0.0d || hit.getCreatedAt() == null) {
+if (properties.getSearch().getRecencyBoostWeight() <= 0.0d || hit.getCreatedAt() == null) {
 return 0.0d;
 }
-double halfLifeDays = Math.max(1.0d, properties.getRecencyHalfLifeDays());
+double halfLifeDays = Math.max(1.0d, properties.getSearch().getRecencyHalfLifeDays());
 double ageDays = Math.max(0.0d, java.time.Duration.between(hit.getCreatedAt(), java.time.OffsetDateTime.now()).toSeconds() / 86400.0d);
 double normalized = Math.exp(-Math.log(2.0d) * (ageDays / halfLifeDays));
-return normalized * properties.getRecencyBoostWeight();
+return normalized * properties.getSearch().getRecencyBoostWeight();
 }
 private int representationPriority(SearchHit hit) {
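The `recencyBoost()` change above keeps the same exponential half-life decay; only the property lookup moves. As a standalone sketch with hypothetical names:

```java
// Sketch of the decay in recencyBoost(): the boost contribution halves
// every halfLifeDays of document age, scaled by the configured weight.
public class RecencyBoostSketch {
    static double recencyBoost(double ageDays, double halfLifeDays, double weight) {
        double halfLife = Math.max(1.0, halfLifeDays);
        double normalized = Math.exp(-Math.log(2.0) * (Math.max(0.0, ageDays) / halfLife));
        return normalized * weight;
    }
}
```

With the defaults in the diff (weight 0.05, half-life 30 days), a brand-new document gains +0.05 and a 30-day-old one +0.025.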

@ -33,7 +33,7 @@ public class DocumentSemanticSearchRepository {
 throw new IllegalArgumentException("Semantic search requires a distance metric");
 }
-String vectorType = "vector(" + modelDimensions + ")";
+String vectorType = "public.vector(" + modelDimensions + ")";
 String similarityExpr = buildSimilarityExpression(distanceMetric, vectorType);
 StringBuilder sql = new StringBuilder("""

@ -12,9 +12,7 @@ import at.procon.dip.search.engine.SearchEngine;
 import at.procon.dip.search.plan.SearchPlanner;
 import at.procon.dip.search.rank.SearchResultFusionService;
 import at.procon.dip.search.spi.SearchDocumentScope;
-import at.procon.dip.search.config.DipSearchProperties;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
+import at.procon.ted.config.TedProcessorProperties;
 import java.util.ArrayList;
 import java.util.LinkedHashMap;
 import java.util.List;
@ -23,11 +21,10 @@ import lombok.RequiredArgsConstructor;
 import org.springframework.stereotype.Service;
 @Service
-@ConditionalOnRuntimeMode(RuntimeMode.NEW)
 @RequiredArgsConstructor
 public class DefaultSearchOrchestrator implements SearchOrchestrator {
-private final DipSearchProperties properties;
+private final TedProcessorProperties properties;
 private final SearchPlanner planner;
 private final List<SearchEngine> engines;
 private final SearchResultFusionService fusionService;
@ -48,7 +45,7 @@ public class DefaultSearchOrchestrator implements SearchOrchestrator {
 metricsService.recordSearch(execution.engineResults(), fused.getHits().size(), true);
 List<SearchEngineDebugResult> debugResults = new ArrayList<>();
-int topLimit = properties.getDebugTopHitsPerEngine();
+int topLimit = properties.getSearch().getDebugTopHitsPerEngine();
 execution.engineResults().forEach((engine, hits) -> debugResults.add(SearchEngineDebugResult.builder()
 .engineType(engine)
 .hitCount(hits.size())
@ -71,9 +68,9 @@ public class DefaultSearchOrchestrator implements SearchOrchestrator {
 private SearchExecution executeInternal(SearchRequest request, SearchDocumentScope scope) {
 int page = request.getPage() == null || request.getPage() < 0 ? 0 : request.getPage();
 int requestedSize = request.getSize() == null || request.getSize() <= 0
-? properties.getDefaultPageSize()
+? properties.getSearch().getDefaultPageSize()
 : request.getSize();
-int size = Math.min(requestedSize, properties.getMaxPageSize());
+int size = Math.min(requestedSize, properties.getSearch().getMaxPageSize());
 SearchExecutionContext context = SearchExecutionContext.builder()
 .request(request)

@ -1,8 +1,6 @@
 package at.procon.dip.search.service;
-import at.procon.dip.search.config.DipSearchProperties;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
+import at.procon.ted.config.TedProcessorProperties;
 import lombok.RequiredArgsConstructor;
 import lombok.extern.slf4j.Slf4j;
 import org.springframework.boot.ApplicationArguments;
@ -10,17 +8,16 @@ import org.springframework.boot.ApplicationRunner;
 import org.springframework.stereotype.Component;
 @Component
-@ConditionalOnRuntimeMode(RuntimeMode.NEW)
 @RequiredArgsConstructor
 @Slf4j
 public class SearchLexicalIndexStartupRunner implements ApplicationRunner {
-private final DipSearchProperties properties;
+private final TedProcessorProperties properties;
 private final DocumentLexicalIndexService lexicalIndexService;
 @Override
 public void run(ApplicationArguments args) {
-int updated = lexicalIndexService.backfillMissingVectors(properties.getStartupLexicalBackfillLimit());
+int updated = lexicalIndexService.backfillMissingVectors(properties.getSearch().getStartupLexicalBackfillLimit());
 if (updated > 0) {
 log.info("Search lexical index startup backfill updated {} representations", updated);
 }

@@ -2,13 +2,8 @@ package at.procon.dip.vectorization.camel;
 import at.procon.dip.domain.document.EmbeddingStatus;
 import at.procon.dip.domain.document.repository.DocumentEmbeddingRepository;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
 import at.procon.dip.vectorization.service.DocumentEmbeddingProcessingService;
-import at.procon.ted.config.LegacyVectorizationProperties;
 import com.fasterxml.jackson.annotation.JsonProperty;
-import java.util.List;
-import java.util.UUID;
 import lombok.RequiredArgsConstructor;
 import lombok.extern.slf4j.Slf4j;
 import org.apache.camel.Exchange;
@@ -18,22 +13,27 @@ import org.apache.camel.model.dataformat.JsonLibrary;
 import org.springframework.data.domain.PageRequest;
 import org.springframework.stereotype.Component;
+import at.procon.ted.config.TedProcessorProperties;
+import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
+import at.procon.dip.runtime.config.RuntimeMode;
+import java.util.List;
+import java.util.UUID;
 /**
- * Legacy generic vectorization route.
- * Uses DOC.doc_text_representation as the source text and DOC.doc_embedding as the write target
- * but belongs to the old runtime graph and is therefore activated only in LEGACY mode.
+ * Phase 2 generic vectorization route.
+ * Uses DOC.doc_text_representation as the source text and DOC.doc_embedding as the write target.
  */
 @Component
-@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 @RequiredArgsConstructor
 @Slf4j
+@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 public class GenericVectorizationRoute extends RouteBuilder {
     private static final String ROUTE_ID_TRIGGER = "generic-vectorization-trigger";
     private static final String ROUTE_ID_PROCESSOR = "generic-vectorization-processor";
     private static final String ROUTE_ID_SCHEDULER = "generic-vectorization-scheduler";
-    private final LegacyVectorizationProperties properties;
+    private final TedProcessorProperties properties;
     private final DocumentEmbeddingRepository embeddingRepository;
     private final DocumentEmbeddingProcessingService processingService;
@@ -52,95 +52,163 @@ public class GenericVectorizationRoute extends RouteBuilder {
     @Override
     public void configure() {
-        if (!properties.isEnabled() || !properties.isGenericPipelineEnabled()) {
+        if (!properties.getVectorization().isEnabled() || !properties.getVectorization().isGenericPipelineEnabled()) {
             log.info("Phase 2 generic vectorization route disabled");
             return;
         }
-        log.info("Configuring generic vectorization routes (legacy mode, apiUrl={}, scheduler={}ms)",
-                properties.getApiUrl(),
-                properties.getGenericSchedulerPeriodMs());
+        log.info("Configuring generic vectorization routes (phase2=true, apiUrl={}, scheduler={}ms)",
+                properties.getVectorization().getApiUrl(),
+                properties.getVectorization().getGenericSchedulerPeriodMs());
         onException(Exception.class)
                 .handled(true)
                 .process(exchange -> {
                     UUID embeddingId = exchange.getIn().getHeader("embeddingId", UUID.class);
                     Exception exception = exchange.getProperty(Exchange.EXCEPTION_CAUGHT, Exception.class);
-                    log.error("Generic vectorization failed for embedding {}: {}", embeddingId,
-                            exception != null ? exception.getMessage() : "unknown error", exception);
+                    String error = exception != null ? exception.getMessage() : "Unknown vectorization error";
                     if (embeddingId != null) {
-                        processingService.markAsFailed(embeddingId,
-                                exception != null ? exception.getMessage() : "Unknown vectorization error");
+                        try {
+                            processingService.markAsFailed(embeddingId, error);
+                        } catch (Exception nested) {
+                            log.warn("Failed to mark embedding {} as failed: {}", embeddingId, nested.getMessage());
+                        }
                     }
                 })
-                .log(LoggingLevel.WARN, "Generic vectorization exception handled for ${header.embeddingId}");
+                .to("log:generic-vectorization-error?level=WARN");
         from("direct:vectorize-embedding")
                 .routeId(ROUTE_ID_TRIGGER)
-                .setHeader("embeddingId", header("embeddingId"))
-                .to("seda:vectorize-embedding-async");
-        from("seda:vectorize-embedding-async?concurrentConsumers=1&blockWhenFull=true&size=1000")
+                .doTry()
+                    .to("seda:vectorize-embedding-async?waitForTaskToComplete=Never&size=1000&blockWhenFull=true&timeout=5000")
+                .doCatch(Exception.class)
+                    .log(LoggingLevel.WARN, "Failed to queue embedding ${header.embeddingId}: ${exception.message}")
+                .end();
+        from("seda:vectorize-embedding-async?size=1000")
                 .routeId(ROUTE_ID_PROCESSOR)
                 .threads().executorService(executorService())
                 .process(exchange -> {
                     UUID embeddingId = exchange.getIn().getHeader("embeddingId", UUID.class);
+                    if (embeddingId == null) {
+                        exchange.setProperty(Exchange.ROUTE_STOP, Boolean.TRUE);
+                        return;
+                    }
                     DocumentEmbeddingProcessingService.EmbeddingPayload payload =
                             processingService.prepareEmbeddingForVectorization(embeddingId);
                     if (payload == null) {
-                        exchange.setProperty(Exchange.ROUTE_STOP, Boolean.TRUE);
+                        exchange.setProperty("skipVectorization", true);
                         return;
                     }
-                    exchange.getIn().setBody(payload);
-                })
-                .filter(exchangeProperty(Exchange.ROUTE_STOP).isNull())
-                .process(exchange -> {
-                    DocumentEmbeddingProcessingService.EmbeddingPayload payload =
-                            exchange.getIn().getBody(DocumentEmbeddingProcessingService.EmbeddingPayload.class);
-                    VectorizationRequest request = new VectorizationRequest(payload.textContent(), false);
-                    exchange.getIn().setHeader(Exchange.HTTP_METHOD, "POST");
-                    exchange.getIn().setHeader(Exchange.CONTENT_TYPE, "application/json");
+                    EmbedRequest request = new EmbedRequest();
+                    request.text = payload.textContent();
+                    request.isQuery = false;
+                    exchange.getIn().setHeader("embeddingId", payload.embeddingId());
+                    exchange.getIn().setHeader("documentId", payload.documentId());
                     exchange.getIn().setBody(request);
                 })
+                .choice()
+                .when(exchangeProperty("skipVectorization").isEqualTo(true))
+                    .log(LoggingLevel.DEBUG, "Skipping generic vectorization for ${header.embeddingId}")
+                .otherwise()
                 .marshal().json(JsonLibrary.Jackson)
-                .removeHeaders("CamelHttp*")
-                .setHeader(Exchange.HTTP_METHOD, constant("POST"))
-                .setHeader(Exchange.CONTENT_TYPE, constant("application/json"))
-                .toD(properties.getApiUrl() + "/embed?bridgeEndpoint=true&throwExceptionOnFailure=true")
-                .unmarshal().json(JsonLibrary.Jackson, VectorizationResponse.class)
-                .process(exchange -> {
-                    UUID embeddingId = exchange.getIn().getHeader("embeddingId", UUID.class);
-                    VectorizationResponse response = exchange.getIn().getBody(VectorizationResponse.class);
-                    if (response == null || response.embedding() == null) {
-                        throw new IllegalStateException("Embedding service returned empty response");
-                    }
-                    processingService.saveEmbedding(embeddingId, response.embedding(), response.tokenCount());
-                });
+                .setProperty("retryCount", constant(0))
+                .setProperty("maxRetries", constant(properties.getVectorization().getMaxRetries()))
+                .setProperty("vectorizationSuccess", constant(false))
+                .loopDoWhile(simple("${exchangeProperty.vectorizationSuccess} == false && ${exchangeProperty.retryCount} < ${exchangeProperty.maxRetries}"))
+                    .process(exchange -> {
+                        Integer retryCount = exchange.getProperty("retryCount", Integer.class);
+                        exchange.setProperty("retryCount", retryCount + 1);
+                        if (retryCount > 0) {
+                            long backoffMs = (long) Math.pow(2, retryCount) * 1000L;
+                            Thread.sleep(backoffMs);
+                        }
+                    })
+                    .doTry()
+                        .toD(properties.getVectorization().getApiUrl() + "/embed?bridgeEndpoint=true&throwExceptionOnFailure=false&connectTimeout=" +
+                                properties.getVectorization().getConnectTimeout() + "&socketTimeout=" +
+                                properties.getVectorization().getSocketTimeout())
+                        .process(exchange -> {
+                            Integer statusCode = exchange.getIn().getHeader(Exchange.HTTP_RESPONSE_CODE, Integer.class);
+                            if (statusCode == null || statusCode != 200) {
+                                String body = exchange.getIn().getBody(String.class);
+                                throw new RuntimeException("Embedding service returned HTTP " + statusCode + ": " + body);
+                            }
+                        })
+                        .unmarshal().json(JsonLibrary.Jackson, EmbedResponse.class)
+                        .process(exchange -> {
+                            UUID embeddingId = exchange.getIn().getHeader("embeddingId", UUID.class);
+                            EmbedResponse response = exchange.getIn().getBody(EmbedResponse.class);
+                            if (response == null || response.embedding == null) {
+                                throw new RuntimeException("Embedding service returned null embedding response");
+                            }
+                            processingService.saveEmbedding(embeddingId, response.embedding, response.tokenCount);
+                            exchange.setProperty("vectorizationSuccess", true);
+                        })
+                    .doCatch(Exception.class)
+                        .process(exchange -> {
+                            UUID embeddingId = exchange.getIn().getHeader("embeddingId", UUID.class);
+                            Integer retryCount = exchange.getProperty("retryCount", Integer.class);
+                            Integer maxRetries = exchange.getProperty("maxRetries", Integer.class);
+                            Exception exception = exchange.getProperty(Exchange.EXCEPTION_CAUGHT, Exception.class);
+                            String errorMsg = exception != null ? exception.getMessage() : "Unknown error";
+                            if (errorMsg != null && errorMsg.contains("Connection pool shut down")) {
+                                log.warn("Generic vectorization aborted for {} because the application is shutting down", embeddingId);
+                                exchange.setProperty("vectorizationSuccess", true);
+                                return;
+                            }
+                            if (retryCount >= maxRetries) {
+                                processingService.markAsFailed(embeddingId, errorMsg);
+                            } else {
+                                log.warn("Generic vectorization attempt #{} failed for {}: {}", retryCount, embeddingId, errorMsg);
+                            }
+                        })
+                    .end()
+                .end()
+                .end();
-        from("timer:generic-vectorization-poller?period=" + properties.getGenericSchedulerPeriodMs())
+        from("timer:generic-vectorization-scheduler?period=" + properties.getVectorization().getGenericSchedulerPeriodMs() + "&delay=500")
                 .routeId(ROUTE_ID_SCHEDULER)
                 .process(exchange -> {
-                    List<UUID> ids = embeddingRepository.findIdsByEmbeddingStatus(
-                            EmbeddingStatus.PENDING, PageRequest.of(0, properties.getBatchSize()));
-                    exchange.getIn().setBody(ids);
+                    int batchSize = properties.getVectorization().getBatchSize();
+                    List<UUID> pending = embeddingRepository.findIdsByEmbeddingStatus(EmbeddingStatus.PENDING, PageRequest.of(0, batchSize));
+                    List<UUID> failed = List.of();
+                    if (pending.isEmpty()) {
+                        failed = embeddingRepository.findIdsByEmbeddingStatus(EmbeddingStatus.FAILED, PageRequest.of(0, batchSize));
+                    }
+                    List<UUID> toProcess = !pending.isEmpty() ? pending : failed;
+                    if (toProcess.isEmpty()) {
+                        exchange.setProperty("noPendingEmbeddings", true);
+                    } else {
+                        exchange.getIn().setBody(toProcess);
+                    }
                 })
+                .choice()
+                .when(exchangeProperty("noPendingEmbeddings").isEqualTo(true))
+                    .log(LoggingLevel.DEBUG, "Generic vectorization scheduler: nothing pending")
+                .otherwise()
                 .split(body())
-                .setHeader("embeddingId", body())
-                .to("direct:vectorize-embedding");
+                .process(exchange -> {
+                    UUID embeddingId = exchange.getIn().getBody(UUID.class);
+                    exchange.getIn().setHeader("embeddingId", embeddingId);
+                })
+                .to("direct:vectorize-embedding")
+                .end()
+                .end();
     }
-    public record VectorizationRequest(
-            @JsonProperty("text") String text,
-            @JsonProperty("isQuery") boolean isQuery
-    ) {
+    public static class EmbedRequest {
+        @JsonProperty("text")
+        public String text;
+        @JsonProperty("is_query")
+        public boolean isQuery;
     }
-    public record VectorizationResponse(
-            @JsonProperty("embedding") float[] embedding,
-            @JsonProperty("token_count") Integer tokenCount
-    ) {
+    public static class EmbedResponse {
+        public float[] embedding;
+        public int dimensions;
+        @JsonProperty("token_count")
+        public int tokenCount;
     }
 }

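The retry loop in the route above sleeps before each retry attempt, doubling the delay every time: attempt n waits 2^n seconds. A standalone sketch of that schedule (the class and method names here are illustrative; the route computes the delay inline):

```java
// Standalone sketch of the exponential backoff used by the retry loop above:
// before retry attempt n (n >= 1) the route sleeps 2^n seconds.
public final class BackoffSketch {

    private BackoffSketch() {
    }

    // Mirrors the route's inline computation: (long) Math.pow(2, retryCount) * 1000L
    public static long backoffMillis(int retryCount) {
        return (long) Math.pow(2, retryCount) * 1000L;
    }

    public static void main(String[] args) {
        // attempts 1..3 wait 2000, 4000, and 8000 ms respectively
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("retry " + attempt + " waits " + backoffMillis(attempt) + " ms");
        }
    }
}
```

Note that with `maxRetries` attempts the total worst-case wait grows geometrically, which is why the scheduler period and the SEDA queue depth matter together.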
@@ -1,20 +1,16 @@
 package at.procon.dip.vectorization.service;
 import at.procon.dip.domain.document.DocumentStatus;
 import at.procon.dip.domain.document.EmbeddingStatus;
 import at.procon.dip.domain.document.entity.DocumentEmbedding;
 import at.procon.dip.domain.document.repository.DocumentEmbeddingRepository;
 import at.procon.dip.domain.document.service.DocumentService;
+import at.procon.ted.config.TedProcessorProperties;
+import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
+import at.procon.dip.runtime.config.RuntimeMode;
-import at.procon.ted.config.LegacyVectorizationProperties;
 import at.procon.ted.model.entity.VectorizationStatus;
 import at.procon.ted.repository.ProcurementDocumentRepository;
 import at.procon.ted.service.VectorizationService;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
 import java.time.OffsetDateTime;
 import java.util.UUID;
 import lombok.RequiredArgsConstructor;
@@ -24,21 +20,21 @@ import org.springframework.transaction.annotation.Propagation;
 import org.springframework.transaction.annotation.Transactional;
 /**
- * Legacy generic vectorization processor that works on DOC text representations and DOC embeddings.
+ * Phase 2 generic vectorization processor that works on DOC text representations and DOC embeddings.
  * <p>
  * The service keeps the existing TED semantic search operational by optionally dual-writing completed
  * embeddings back into the legacy TED procurement_document vector columns, resolved by document hash.
  */
 @Service
-@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 @RequiredArgsConstructor
 @Slf4j
+@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 public class DocumentEmbeddingProcessingService {
     private final DocumentEmbeddingRepository embeddingRepository;
     private final DocumentService documentService;
     private final VectorizationService vectorizationService;
-    private final LegacyVectorizationProperties properties;
+    private final TedProcessorProperties properties;
     private final ProcurementDocumentRepository procurementDocumentRepository;
     @Transactional(propagation = Propagation.REQUIRES_NEW)
@@ -65,7 +61,7 @@ public class DocumentEmbeddingProcessingService {
             return null;
         }
-        int maxLength = properties.getMaxTextLength();
+        int maxLength = properties.getVectorization().getMaxTextLength();
         if (textBody.length() > maxLength) {
             log.debug("Truncating representation {} for embedding {} from {} to {} chars",
                     embedding.getRepresentation().getId(), embeddingId, textBody.length(), maxLength);
@@ -95,10 +91,10 @@ public class DocumentEmbeddingProcessingService {
         }
         String vectorString = vectorizationService.floatArrayToVectorString(embedding);
-        embeddingRepository.updateEmbeddingVector(embeddingId, embedding, tokenCount, embedding.length);
+        embeddingRepository.updateEmbeddingVector(embeddingId, vectorString, tokenCount, embedding.length);
         documentService.updateStatus(loaded.getDocument().getId(), DocumentStatus.INDEXED);
-        if (properties.isDualWriteLegacyTedVectors()) {
+        if (properties.getVectorization().isDualWriteLegacyTedVectors()) {
             dualWriteLegacyTedVector(loaded, vectorString, tokenCount);
         }
     }
@@ -111,7 +107,8 @@ public class DocumentEmbeddingProcessingService {
         embeddingRepository.updateEmbeddingStatus(embeddingId, EmbeddingStatus.FAILED, errorMessage, null);
         documentService.updateStatus(loaded.getDocument().getId(), DocumentStatus.FAILED);
-        if (properties.isDualWriteLegacyTedVectors()) {
+        if (properties.getVectorization().isDualWriteLegacyTedVectors()) {
+            loaded.getDocument().getDedupHash();
             procurementDocumentRepository.findByDocumentHash(loaded.getDocument().getDedupHash())
                     .ifPresent(doc -> procurementDocumentRepository.updateVectorizationStatus(
                             doc.getId(), VectorizationStatus.FAILED, errorMessage, null));

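The diff above fixes `updateEmbeddingVector` to receive the string produced by `floatArrayToVectorString` rather than the raw `float[]`. That helper lives in `VectorizationService` and is not part of this diff; a plausible sketch, assuming the target column expects a pgvector-style text literal such as `[1.0,2.5]`:

```java
import java.util.StringJoiner;

// Hypothetical sketch of floatArrayToVectorString (the real implementation is in
// VectorizationService and is not shown in this diff), assuming a pgvector-style
// text literal like "[1.0,2.5]" is the target format.
public final class VectorStringSketch {

    private VectorStringSketch() {
    }

    public static String floatArrayToVectorString(float[] vector) {
        // Join components with commas and wrap in square brackets.
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (float value : vector) {
            joiner.add(Float.toString(value));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(floatArrayToVectorString(new float[] {1.0f, 2.5f}));
    }
}
```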
@@ -2,9 +2,9 @@ package at.procon.dip.vectorization.startup;
 import at.procon.dip.domain.document.service.DocumentEmbeddingService;
 import at.procon.dip.domain.document.service.command.RegisterEmbeddingModelCommand;
+import at.procon.ted.config.TedProcessorProperties;
 import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
 import at.procon.dip.runtime.config.RuntimeMode;
-import at.procon.ted.config.LegacyVectorizationProperties;
 import lombok.RequiredArgsConstructor;
 import lombok.extern.slf4j.Slf4j;
 import org.springframework.boot.ApplicationArguments;
@@ -12,33 +12,33 @@ import org.springframework.boot.ApplicationRunner;
 import org.springframework.stereotype.Component;
 /**
- * Ensures the configured embedding model exists in DOC.doc_embedding_model for the legacy runtime path.
+ * Ensures the configured embedding model exists in DOC.doc_embedding_model.
  */
 @Component
-@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 @RequiredArgsConstructor
 @Slf4j
+@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 public class ConfiguredEmbeddingModelStartupRunner implements ApplicationRunner {
-    private final LegacyVectorizationProperties properties;
+    private final TedProcessorProperties properties;
     private final DocumentEmbeddingService embeddingService;
     @Override
     public void run(ApplicationArguments args) {
-        if (!properties.isEnabled() || !properties.isGenericPipelineEnabled()) {
+        if (!properties.getVectorization().isEnabled() || !properties.getVectorization().isGenericPipelineEnabled()) {
             return;
         }
         embeddingService.registerModel(new RegisterEmbeddingModelCommand(
-                properties.getModelName(),
-                properties.getEmbeddingProvider(),
-                properties.getModelName(),
-                properties.getDimensions(),
+                properties.getVectorization().getModelName(),
+                properties.getVectorization().getEmbeddingProvider(),
+                properties.getVectorization().getModelName(),
+                properties.getVectorization().getDimensions(),
                 null,
                 false,
                 true
         ));
-        log.info("Legacy embedding model ensured: {}", properties.getModelName());
+        log.info("Phase 2 embedding model ensured: {}", properties.getVectorization().getModelName());
     }
 }

@@ -2,9 +2,9 @@ package at.procon.dip.vectorization.startup;
 import at.procon.dip.domain.document.EmbeddingStatus;
 import at.procon.dip.domain.document.repository.DocumentEmbeddingRepository;
+import at.procon.ted.config.TedProcessorProperties;
 import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
 import at.procon.dip.runtime.config.RuntimeMode;
-import at.procon.ted.config.LegacyVectorizationProperties;
 import java.util.List;
 import java.util.UUID;
 import lombok.RequiredArgsConstructor;
@@ -16,30 +16,30 @@ import org.springframework.data.domain.PageRequest;
 import org.springframework.stereotype.Component;
 /**
- * Queues pending and failed DOC embeddings immediately on startup for the legacy runtime graph.
+ * Queues pending and failed DOC embeddings immediately on startup.
  */
 @Component
-@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 @RequiredArgsConstructor
 @Slf4j
+@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 public class GenericVectorizationStartupRunner implements ApplicationRunner {
     private static final int BATCH_SIZE = 1000;
-    private final LegacyVectorizationProperties properties;
+    private final TedProcessorProperties properties;
     private final DocumentEmbeddingRepository embeddingRepository;
     private final ProducerTemplate producerTemplate;
     @Override
     public void run(ApplicationArguments args) {
-        if (!properties.isEnabled() || !properties.isGenericPipelineEnabled()) {
+        if (!properties.getVectorization().isEnabled() || !properties.getVectorization().isGenericPipelineEnabled()) {
             return;
         }
         int queued = 0;
         queued += queueByStatus(EmbeddingStatus.PENDING, "PENDING");
         queued += queueByStatus(EmbeddingStatus.FAILED, "FAILED");
-        log.info("Legacy generic vectorization startup runner queued {} embedding jobs", queued);
+        log.info("Generic vectorization startup runner queued {} embedding jobs", queued);
     }
     private int queueByStatus(EmbeddingStatus status, String label) {

@@ -13,8 +13,6 @@ import jakarta.mail.Multipart;
 import jakarta.mail.Part;
 import jakarta.mail.Session;
 import jakarta.mail.internet.MimeMessage;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
 import lombok.RequiredArgsConstructor;
 import lombok.extern.slf4j.Slf4j;
 import org.apache.camel.Exchange;
@@ -47,7 +45,6 @@ import java.util.*;
  * @author Martin.Schweitzer@procon.co.at and claude.ai
  */
 @Component
-@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 @RequiredArgsConstructor
 @Slf4j
 public class MailRoute extends RouteBuilder {

@@ -4,8 +4,6 @@ import at.procon.ted.config.TedProcessorProperties;
 import at.procon.ted.service.ExcelExportService;
 import at.procon.ted.service.SimilaritySearchService;
 import at.procon.ted.service.SimilaritySearchService.SimilaritySearchResponse;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
 import lombok.RequiredArgsConstructor;
 import lombok.extern.slf4j.Slf4j;
 import org.apache.camel.Exchange;
@@ -29,7 +27,6 @@ import java.nio.file.Paths;
  * @author Martin.Schweitzer@procon.co.at and claude.ai
  */
 @Component
-@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 @RequiredArgsConstructor
 @Slf4j
 public class SolutionBriefRoute extends RouteBuilder {

@@ -2,8 +2,6 @@ package at.procon.ted.camel;
 import at.procon.ted.config.TedProcessorProperties;
 import at.procon.ted.service.DocumentProcessingService;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
 import lombok.RequiredArgsConstructor;
 import lombok.extern.slf4j.Slf4j;
 import org.apache.camel.Exchange;
@@ -26,7 +24,6 @@ import java.nio.file.Path;
  * @author Martin.Schweitzer@procon.co.at and claude.ai
  */
 @Component
-@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 @RequiredArgsConstructor
 @Slf4j
 public class TedDocumentRoute extends RouteBuilder {

@@ -9,8 +9,6 @@ import at.procon.ted.model.entity.TedDailyPackage;
 import at.procon.ted.repository.TedDailyPackageRepository;
 import at.procon.ted.service.BatchDocumentProcessingService;
 import at.procon.ted.service.TedPackageDownloadService;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
 import lombok.RequiredArgsConstructor;
 import lombok.extern.slf4j.Slf4j;
 import org.apache.camel.Exchange;
@@ -49,7 +47,6 @@ import java.util.Optional;
 @ConditionalOnProperty(name = "ted.download.enabled", havingValue = "true")
 @RequiredArgsConstructor
 @Slf4j
-@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 public class TedPackageDownloadCamelRoute extends RouteBuilder {
     private static final String ROUTE_ID_SCHEDULER = "ted-package-scheduler";

@@ -2,8 +2,6 @@ package at.procon.ted.camel;
 import at.procon.ted.config.TedProcessorProperties;
 import at.procon.ted.service.TedPackageDownloadService;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
 import lombok.RequiredArgsConstructor;
 import lombok.extern.slf4j.Slf4j;
 import org.apache.camel.Exchange;
@@ -32,7 +30,6 @@ import java.util.List;
 @ConditionalOnProperty(name = "ted.download.use-service-based", havingValue = "true")
 @RequiredArgsConstructor
 @Slf4j
-@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 public class TedPackageDownloadRoute extends RouteBuilder {
     private static final String ROUTE_ID_SCHEDULER = "ted-package-download-scheduler";

@@ -7,8 +7,6 @@ import at.procon.ted.repository.ProcurementDocumentRepository;
 import at.procon.ted.service.VectorizationProcessorService;
 import com.fasterxml.jackson.annotation.JsonProperty;
 import com.fasterxml.jackson.databind.ObjectMapper;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
 import lombok.RequiredArgsConstructor;
 import lombok.extern.slf4j.Slf4j;
 import org.apache.camel.Exchange;
@@ -33,7 +31,6 @@ import java.util.UUID;
  * @author Martin.Schweitzer@procon.co.at and claude.ai
  */
 @Component
-@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 @RequiredArgsConstructor
 @Slf4j
 public class VectorizationRoute extends RouteBuilder {
@@ -71,6 +68,10 @@ public class VectorizationRoute extends RouteBuilder {
             log.info("Vectorization is disabled, skipping route configuration");
             return;
         }
+        if (properties.getVectorization().isGenericPipelineEnabled()) {
+            log.info("Legacy vectorization route disabled because Phase 2 generic pipeline is enabled");
+            return;
+        }
         log.info("Configuring vectorization routes (enabled=true, apiUrl={}, connectTimeout={}ms, socketTimeout={}ms, maxRetries={}, scheduler every 6s)",
                 properties.getVectorization().getApiUrl(),

@@ -1,7 +1,5 @@
 package at.procon.ted.config;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
 import lombok.RequiredArgsConstructor;
 import lombok.extern.slf4j.Slf4j;
 import org.springframework.aop.interceptor.AsyncUncaughtExceptionHandler;
@@ -21,7 +19,6 @@ import java.util.concurrent.Executor;
  * @author Martin.Schweitzer@procon.co.at and claude.ai
  */
 @Configuration
-@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 @EnableAsync
 @RequiredArgsConstructor
 @Slf4j

@@ -0,0 +1,16 @@
+package at.procon.ted.config;
+
+import org.springframework.boot.context.properties.ConfigurationProperties;
+import org.springframework.context.annotation.Configuration;
+
+/**
+ * Patch A scaffold for the legacy runtime configuration tree.
+ *
+ * The legacy runtime still uses {@link TedProcessorProperties} today. This class is
+ * introduced so the old configuration can be moved gradually from `ted.*` to
+ * `legacy.ted.*` without blocking the runtime split.
+ */
+@Configuration
+@ConfigurationProperties(prefix = "legacy.ted")
+public class LegacyTedProperties extends TedProcessorProperties {
+}

@@ -1,115 +0,0 @@
-package at.procon.ted.config;
-
-import jakarta.validation.constraints.Min;
-import jakarta.validation.constraints.NotBlank;
-import jakarta.validation.constraints.Positive;
-import lombok.Data;
-import org.springframework.boot.context.properties.ConfigurationProperties;
-import org.springframework.context.annotation.Configuration;
-import org.springframework.validation.annotation.Validated;
-
-/**
- * Legacy vectorization configuration used only by the old runtime path.
- * <p>
- * This extracts the former ted.vectorization.* subtree away from TedProcessorProperties
- * so that legacy vectorization beans no longer depend on the shared monolithic config.
- */
-@Configuration
-@ConfigurationProperties(prefix = "legacy.ted.vectorization")
-@Data
-@Validated
-public class LegacyVectorizationProperties {
-
-    /**
-     * Enable/disable legacy async vectorization.
-     */
-    private boolean enabled = true;
-
-    /**
-     * Use external HTTP API instead of Python subprocess.
-     */
-    private boolean useHttpApi = false;
-
-    /**
-     * Embedding service HTTP API URL.
-     */
-    private String apiUrl = "http://localhost:8001";
-
-    /**
-     * Sentence transformer model name.
-     */
-    private String modelName = "intfloat/multilingual-e5-large";
-
-    /**
-     * Vector dimensions (must match model output).
-     */
-    @Positive
-    private int dimensions = 1024;
-
-    /**
-     * Batch size for vectorization processing.
-     */
-    @Min(1)
-    private int batchSize = 16;
-
-    /**
-     * Thread pool size for async vectorization.
-     */
-    @Min(1)
-    private int threadPoolSize = 4;
-
-    /**
-     * Maximum text length for vectorization (characters).
-     */
-    @Positive
-    private int maxTextLength = 8192;
-
-    /**
-     * HTTP connection timeout in milliseconds.
-     */
-    @Positive
-    private int connectTimeout = 10000;
-
-    /**
-     * HTTP socket/read timeout in milliseconds.
-     */
-    @Positive
-    private int socketTimeout = 60000;
-
-    /**
-     * Maximum retries on connection failure.
-     */
-    @Min(0)
-    private int maxRetries = 5;
-
-    /**
-     * Enable the former Phase 2 generic pipeline in the legacy runtime.
-     * In the split runtime design this should normally stay false in NEW mode
-     * because legacy beans are not instantiated there.
-     */
-    private boolean genericPipelineEnabled = true;
-
-    /**
-     * Keep writing completed TED embeddings back to the legacy ted.procurement_document
-     * vector columns so the existing semantic search stays operational during migration.
-     */
-    private boolean dualWriteLegacyTedVectors = true;
-
-    /**
-     * Scheduler interval for generic embedding polling (milliseconds).
-     */
-    @Positive
-    private long genericSchedulerPeriodMs = 6000;
-
-    /**
-     * Builder key for the primary TED semantic representation created during transitional dual-write.
-     */
-    @NotBlank
-    private String primaryRepresentationBuilderKey = "ted-phase2-primary-representation";
-
-    /**
-     * Provider key used when registering the configured embedding model in DOC.doc_embedding_model.
-     */
-    @NotBlank
-    private String embeddingProvider = "http-embedding-service";
-}

@@ -1,11 +1,8 @@
 package at.procon.ted.config;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
 import lombok.Data;
 import org.springframework.boot.context.properties.ConfigurationProperties;
 import org.springframework.context.annotation.Configuration;
-import org.springframework.context.annotation.Primary;
 import org.springframework.validation.annotation.Validated;
 import jakarta.validation.constraints.Min;
@@ -18,11 +15,9 @@ import jakarta.validation.constraints.Positive;
  * @author Martin.Schweitzer@procon.co.at and claude.ai
  */
 @Configuration
-@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 @ConfigurationProperties(prefix = "ted")
 @Data
 @Validated
-@Primary
 public class TedProcessorProperties {
     private InputProperties input = new InputProperties();
@@ -34,7 +29,6 @@ public class TedProcessorProperties {
     private SolutionBriefProperties solutionBrief = new SolutionBriefProperties();
     private ProjectionProperties projection = new ProjectionProperties();
     private GenericIngestionProperties genericIngestion = new GenericIngestionProperties();
-    private RepairProperties repair = new RepairProperties();
     /**
      * Input directory configuration for Apache Camel file consumer.
@@ -160,9 +154,37 @@ public class TedProcessorProperties {
          */
         @Min(0)
         private int maxRetries = 5;
+
+        /**
+         * Enable the Phase 2 generic vectorization pipeline based on DOC text representations
+         * and DOC embeddings instead of the legacy TED document vector columns as the primary
+         * write target.
+         */
+        private boolean genericPipelineEnabled = true;
+
+        /**
+         * Keep writing completed TED embeddings back to the legacy ted.procurement_document
+         * vector columns so the existing semantic search stays operational during migration.
+         */
+        private boolean dualWriteLegacyTedVectors = true;
+
+        /**
+         * Scheduler interval for generic embedding polling (milliseconds).
+         */
         @Positive
-        private long genericSchedulerPeriodMs = 30000;
-        private String primaryRepresentationBuilderKey = "default-generic";
+        private long genericSchedulerPeriodMs = 6000;
+
+        /**
+         * Builder key for the primary TED semantic representation created during Phase 2 dual-write.
+         */
+        @NotBlank
+        private String primaryRepresentationBuilderKey = "ted-phase2-primary-representation";
+
+        /**
+         * Provider key used when registering the configured embedding model in DOC.doc_embedding_model.
+         */
+        @NotBlank
+        private String embeddingProvider = "http-embedding-service";
     }
     /**
@@ -357,64 +379,6 @@ public class TedProcessorProperties {
         private boolean prioritizeCurrentYear = true;
     }
-    /**
-     * Legacy TED package repair / re-import configuration.
-     */
-    @Data
-    public static class RepairProperties {
-        /**
-         * Enable startup repair of incomplete or missing TED packages.
-         */
-        private boolean enabled = false;
-        /**
-         * If true, only logs the selected package candidates without modifying data.
-         */
-        private boolean dryRun = false;
-        /**
-         * Maximum number of packages to process in one startup run.
-         */
-        @Positive
-        private int maxPackages = 100;
-        /**
-         * Optional explicit package identifiers (YYYYSSSSS) to repair.
-         */
-        private java.util.List<String> packageIdentifiers = new java.util.ArrayList<>();
-        /**
-         * Optional lower bound package identifier (inclusive).
-         */
-        private String fromPackageIdentifier;
-        /**
-         * Optional upper bound package identifier (inclusive).
-         */
-        private String toPackageIdentifier;
-        /**
-         * Include missing package sequence numbers inside the selected range.
-         */
-        private boolean includeMissingSequenceGaps = true;
-        /**
-         * Re-download the package archive when it is missing locally.
-         */
-        private boolean redownloadMissingArchives = true;
-        /**
-         * Always re-download the package archive even when a local archive already exists.
-         */
-        private boolean forceRedownload = false;
-        /**
-         * Refuse startup repair while the automatic legacy package download scheduler is enabled.
-         */
-        private boolean allowWhileDownloadEnabled = false;
-    }
     /**
      * IMAP Mail configuration for email processing.
      */
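The Phase 2 fields added to `TedProcessorProperties` above bind under `ted.vectorization.*`, matching the transitional keys listed in the config-split notes. A sketch of the corresponding YAML, using the defaults from the diff (key names assume Spring Boot's relaxed kebab-case binding; illustrative only):

```yaml
ted:
  vectorization:
    generic-pipeline-enabled: true
    dual-write-legacy-ted-vectors: true
    generic-scheduler-period-ms: 6000
    primary-representation-builder-key: ted-phase2-primary-representation
    embedding-provider: http-embedding-service
```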

@@ -10,8 +10,6 @@ import at.procon.ted.service.DocumentProcessingService;
 import at.procon.ted.service.VectorizationService;
 import io.swagger.v3.oas.annotations.Operation;
 import io.swagger.v3.oas.annotations.tags.Tag;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
 import lombok.RequiredArgsConstructor;
 import lombok.extern.slf4j.Slf4j;
 import org.apache.camel.ProducerTemplate;
@@ -37,7 +35,6 @@ import java.util.UUID;
  * @author Martin.Schweitzer@procon.co.at and claude.ai
  */
 @RestController
-@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 @RequestMapping("/v1/admin")
 @RequiredArgsConstructor
 @Slf4j
@@ -78,11 +75,17 @@ public class AdminController {
         Map<String, Object> status = new HashMap<>();
         Map<String, Long> statusCounts = new HashMap<>();
+        if (properties.getVectorization().isGenericPipelineEnabled()) {
+            List<Object[]> counts = documentEmbeddingRepository.countByEmbeddingStatus();
+            for (Object[] row : counts) {
+                statusCounts.put(((EmbeddingStatus) row[0]).name(), (Long) row[1]);
+            }
+        } else {
             List<Object[]> counts = documentRepository.countByVectorizationStatus();
             for (Object[] row : counts) {
                 statusCounts.put(((VectorizationStatus) row[0]).name(), (Long) row[1]);
             }
+        }
         status.put("counts", statusCounts);
         status.put("serviceAvailable", vectorizationService.isAvailable());
@@ -112,7 +115,14 @@ public class AdminController {
             return ResponseEntity.badRequest().body(result);
         }
+        if (properties.getVectorization().isGenericPipelineEnabled()) {
+            var document = documentRepository.findById(documentId).orElseThrow();
+            UUID embeddingId = tedPhase2GenericDocumentService.registerOrRefreshTedDocument(document);
+            producerTemplate.sendBodyAndHeader("direct:vectorize-embedding", null, "embeddingId", embeddingId);
+            result.put("embeddingId", embeddingId);
+        } else {
             producerTemplate.sendBodyAndHeader("direct:vectorize", null, "documentId", documentId);
+        }
         result.put("success", true);
         result.put("message", "Vectorization triggered for document " + documentId);
@@ -137,6 +147,15 @@ public class AdminController {
         }
         int count = 0;
+        if (properties.getVectorization().isGenericPipelineEnabled()) {
+            var pending = documentEmbeddingRepository.findIdsByEmbeddingStatus(
+                    EmbeddingStatus.PENDING,
+                    PageRequest.of(0, Math.min(batchSize, 500)));
+            for (UUID embeddingId : pending) {
+                producerTemplate.sendBodyAndHeader("direct:vectorize-embedding", null, "embeddingId", embeddingId);
+                count++;
+            }
+        } else {
             var pending = documentRepository.findByVectorizationStatus(
                     VectorizationStatus.PENDING,
                     PageRequest.of(0, Math.min(batchSize, 500)));
@@ -145,6 +164,7 @@ public class AdminController {
             producerTemplate.sendBodyAndHeader("direct:vectorize", null, "documentId", doc.getId());
             count++;
         }
+        }
         result.put("success", true);
         result.put("message", "Triggered vectorization for " + count + " documents");

@@ -1,7 +1,5 @@
 package at.procon.ted.controller;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
 import at.procon.ted.model.dto.DocumentDtos.*;
 import at.procon.ted.model.entity.ContractNature;
 import at.procon.ted.model.entity.NoticeType;
@@ -40,7 +38,6 @@ import java.util.UUID;
 @RequestMapping("/v1/documents")
 @RequiredArgsConstructor
 @Slf4j
-@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 @Tag(name = "Documents", description = "TED Procurement Document Search API")
 public class DocumentController {

@@ -1,7 +1,5 @@
 package at.procon.ted.controller;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
 import at.procon.ted.service.SimilaritySearchService;
 import at.procon.ted.service.SimilaritySearchService.SimilaritySearchResponse;
 import io.swagger.v3.oas.annotations.Operation;
@@ -30,7 +28,6 @@ import java.io.IOException;
 @RequestMapping("/similarity")
 @RequiredArgsConstructor
 @Slf4j
-@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 @Tag(name = "Similarity Search", description = "Vector-based semantic similarity search on TED procurement documents")
 public class SimilaritySearchController {

@@ -1,8 +1,6 @@
 package at.procon.ted.event;
 import at.procon.ted.config.TedProcessorProperties;
-import at.procon.dip.runtime.condition.ConditionalOnRuntimeMode;
-import at.procon.dip.runtime.config.RuntimeMode;
 import lombok.RequiredArgsConstructor;
 import lombok.extern.slf4j.Slf4j;
 import org.apache.camel.ProducerTemplate;
@@ -17,7 +15,6 @@ import org.springframework.transaction.event.TransactionalEventListener;
  * @author Martin.Schweitzer@procon.co.at and claude.ai
  */
 @Component
-@ConditionalOnRuntimeMode(RuntimeMode.LEGACY)
 @RequiredArgsConstructor
 @Slf4j
 public class VectorizationEventListener {
@@ -31,7 +28,7 @@ public class VectorizationEventListener {
      */
     @TransactionalEventListener(phase = TransactionPhase.AFTER_COMMIT)
     public void onDocumentSaved(DocumentSavedEvent event) {
-        if (!properties.getVectorization().isEnabled()) {
+        if (!properties.getVectorization().isEnabled() || properties.getVectorization().isGenericPipelineEnabled()) {
             return;
         }

@@ -58,7 +58,7 @@ public class Organization {
     @Column(name = "country_code", length = 10)
     private String countryCode;
-    @Column(name = "city", columnDefinition = "TEXT")
+    @Column(name = "city", length = 255)
     private String city;
     @Column(name = "postal_code", length = 255)

@@ -102,7 +102,7 @@ public class ProcurementDocument {
     @Column(name = "buyer_country_code", length = 10)
     private String buyerCountryCode;
-    @Column(name = "buyer_city", columnDefinition = "TEXT")
+    @Column(name = "buyer_city", length = 255)
     private String buyerCity;
     @Column(name = "buyer_postal_code", length = 100)
@@ -124,7 +124,7 @@ public class ProcurementDocument {
     @Column(name = "project_description", columnDefinition = "TEXT")
     private String projectDescription;
-    @Column(name = "internal_reference", columnDefinition = "TEXT")
+    @Column(name = "internal_reference", length = 500)
     private String internalReference;
     @Enumerated(EnumType.STRING)

Some files were not shown because too many files have changed in this diff.
