# Wave 2 — Extended TED structured search in NEW runtime
## What was added
This extension completes the missing parts from the earlier Wave 2 proposal:
1. **Projection-aware TED structured search in NEW runtime**
   - endpoint: `GET /v1/documents/search`
   - endpoint: `POST /v1/documents/search`
   - active only in `dip.runtime.mode=NEW`
2. **Repository-level joins across NEW projection model**
   - `DOC.doc_document`
   - `TED.ted_notice_projection`
   - `TED.ted_notice_lot`
   - `TED.ted_notice_organization`
3. **Extended TED structured filters**
   - `countryCode`, `countryCodes`
   - `noticeType`
   - `contractNature`
   - `procedureType`
   - `cpvPrefix`, `cpvCodes`
   - `nutsCode`, `nutsCodes`
   - `publicationDateFrom`, `publicationDateTo`
   - `submissionDeadlineAfter`
   - `euFunded`
   - `buyerNameContains`
   - `projectTitleContains`
4. **Hybrid ranking path**
   - structured filters first narrow the candidate `document_id` set
   - generic NEW lexical/trigram/semantic search ranks only inside that candidate set
   - request parameter `q` is used as the hybrid query text
   - `similarityThreshold` is forwarded as a per-request semantic threshold override
5. **Facets**
   - countries
   - notice types
   - procedure types
   - buyers
   - publication months (`YYYY-MM`)
   - CPV families (first 2 digits)
6. **Parity coverage**
   - NEW structured-only parity test against legacy `SearchService` for shared filters
   - NEW endpoint integration test for structured results + facets
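
As a rough illustration of how a client might call the `GET` endpoint with the filters listed above, the following sketch builds a query URL. The host, the filter values, the ISO date format, and the encoding of multi-value filters as repeated query parameters are assumptions, not confirmed behavior of the NEW runtime.

```python
from urllib.parse import urlencode

# Hypothetical base URL; only the path /v1/documents/search comes from this document.
BASE_URL = "http://localhost:8080"


def build_search_url(filters: dict) -> str:
    """Build a GET URL for the structured search endpoint.

    None values are dropped. List values are sent as repeated query
    parameters, which is an assumption about how the server binds
    multi-value filters such as countryCodes.
    """
    params = []
    for name, value in filters.items():
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            params.extend((name, str(v)) for v in value)
        elif isinstance(value, bool):
            params.append((name, "true" if value else "false"))
        else:
            params.append((name, str(value)))
    return f"{BASE_URL}/v1/documents/search?{urlencode(params)}"


# Illustrative values only.
url = build_search_url({
    "q": "railway signalling",
    "countryCodes": ["DE", "AT"],
    "cpvPrefix": "4523",
    "publicationDateFrom": "2024-01-01",
    "euFunded": True,
    "noticeType": None,  # unset filters are omitted from the URL
})
print(url)
```

Omitting `q` from the dictionary would produce a structured-only request with the same filters.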
## Main classes
- `TedStructuredSearchRepository`
- `TedStructuredSearchService`
- `TedStructuredSearchController`
- `TedStructuredSearchFilter`
- `TedStructuredSearchFacets`
## How hybrid search works
For requests with `q`:
1. apply TED structured filters on projection tables
2. collect matching `document_id`s
3. pass those ids into NEW generic search scope as `candidateDocumentIds`
4. let NEW search engines rank those TED documents
5. map ranked hits back to TED summaries

The result combines structured filtering with lexical, trigram, and semantic relevance ranking.
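
The five steps can be sketched in a few lines of Python. This is a simplified model, not the actual `TedStructuredSearchService`: the `Notice` class, the `toy_score` function (a term-overlap stand-in for the NEW lexical/trigram/semantic engines), the `documentId` field name, and the way the candidate limit truncates the id list are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Notice:
    """Hypothetical stand-in for a row of the TED projection tables."""
    document_id: int
    country_code: str
    publication_date: date
    title: str


def structured_candidates(notices, country_code=None, published_from=None, limit=5000):
    """Steps 1-2: apply structured filters and collect matching document ids."""
    ids = []
    for n in notices:
        if country_code and n.country_code != country_code:
            continue
        if published_from and n.publication_date < published_from:
            continue
        ids.append(n.document_id)
        if len(ids) >= limit:  # assumed effect of the hybrid candidate limit
            break
    return ids


def toy_score(query, text):
    """Stand-in for the NEW search engines: share of query terms found in the text."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def hybrid_search(notices, q, **filters):
    by_id = {n.document_id: n for n in notices}
    candidate_ids = structured_candidates(notices, **filters)  # steps 1-2
    # Steps 3-4: rank only inside the candidate set.
    scored = [(doc_id, toy_score(q, by_id[doc_id].title)) for doc_id in candidate_ids]
    ranked = sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)
    # Step 5: map ranked hits back to summaries.
    return [{"documentId": d, "title": by_id[d].title, "similarity": s} for d, s in ranked]


notices = [
    Notice(1, "DE", date(2024, 3, 1), "Railway signalling upgrade"),
    Notice(2, "FR", date(2024, 3, 5), "Railway signalling maintenance"),
    Notice(3, "DE", date(2024, 4, 2), "School catering services"),
    Notice(4, "DE", date(2024, 2, 20), "Railway bridge repair"),
    Notice(5, "DE", date(2023, 11, 20), "Railway signalling study"),
]
hits = hybrid_search(notices, "railway signalling",
                     country_code="DE", published_from=date(2024, 1, 1))
print([h["documentId"] for h in hits])
```

Notices 2 and 5 match the query text but never reach the ranking step, because the structured filters exclude them first.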
## New configuration
```yaml
dip:
  ted:
    projection:
      structured-search-hybrid-candidate-limit: 5000
      structured-search-facet-bucket-limit: 12
```
## Current behavior notes
- Structured-only requests work without `q`
- Hybrid requests use `q` and NEW generic ranking
- When `q` is present, returned `similarity` contains the fused NEW search score
- Facets are computed from the structured candidate set before pagination
- `includeFacets=false` disables facet calculation
- `facetBucketLimit` overrides the default bucket size per request
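
The facet notes above can be modeled roughly as follows. The response shape (`results`, `totalCount`, the facet key names) and the ordering of buckets by descending count are assumptions; only the "facets before pagination", `includeFacets`, and `facetBucketLimit` behavior come from this document.

```python
from collections import Counter
from datetime import date


def facet_buckets(values, bucket_limit=12):
    """Count values and keep the most frequent buckets, up to bucket_limit."""
    counts = Counter(v for v in values if v)
    return [{"value": v, "count": c} for v, c in counts.most_common(bucket_limit)]


def search_page(candidates, page=0, size=2, include_facets=True, facet_bucket_limit=12):
    """Paginate results, but compute facets over the full candidate set."""
    response = {
        "results": candidates[page * size:(page + 1) * size],
        "totalCount": len(candidates),
    }
    if include_facets:
        response["facets"] = {
            "countries": facet_buckets(
                (c["countryCode"] for c in candidates), facet_bucket_limit),
            # Publication months are bucketed as YYYY-MM.
            "publicationMonths": facet_buckets(
                (c["publicationDate"].strftime("%Y-%m") for c in candidates),
                facet_bucket_limit),
            # CPV families are the first 2 digits of the CPV code.
            "cpvFamilies": facet_buckets(
                (c["cpvCode"][:2] for c in candidates), facet_bucket_limit),
        }
    return response


# Illustrative candidate rows with hypothetical field names.
candidates = [
    {"countryCode": "DE", "publicationDate": date(2024, 3, 1), "cpvCode": "45234100"},
    {"countryCode": "DE", "publicationDate": date(2024, 3, 15), "cpvCode": "45221000"},
    {"countryCode": "AT", "publicationDate": date(2024, 4, 2), "cpvCode": "55520000"},
]
page = search_page(candidates, size=2)
print(page["facets"]["countries"])
```

Although the page holds only two results, the country facet still counts all three candidates.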
## Compatibility notes
- The NEW endpoint reuses the legacy `DocumentDtos.SearchRequest` and `SearchResponse`
- The response was extended with optional `facets`
- Existing legacy clients remain compatible as long as they ignore unknown JSON fields, because `facets` is an optional, additive field
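
To illustrate why an additive field is safe for tolerant clients, here is a minimal sketch of a legacy-style parser that keeps only the fields it knows. The `LegacySearchResponse` class and the `results`/`totalCount` field names are hypothetical; the real `SearchResponse` fields are not listed in this document.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class LegacySearchResponse:
    """Hypothetical subset of the fields a legacy client reads."""
    results: list
    totalCount: int


def parse_legacy(payload: str) -> LegacySearchResponse:
    """Parse a response the way a tolerant client would: keep known fields, ignore the rest."""
    data = json.loads(payload)
    known = {f.name for f in fields(LegacySearchResponse)}
    return LegacySearchResponse(**{k: v for k, v in data.items() if k in known})


legacy_payload = json.dumps({"results": [{"documentId": 1}], "totalCount": 1})
new_payload = json.dumps({
    "results": [{"documentId": 1}],
    "totalCount": 1,
    "facets": {"countries": [{"value": "DE", "count": 1}]},  # additive field
})

# Both payloads parse to the same legacy view.
print(parse_legacy(new_payload) == parse_legacy(legacy_payload))
```

A client that rejects unknown properties during deserialization would not get this guarantee and would need to be checked separately.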
## Parity scope
Parity is implemented for **shared structured filters** between legacy and NEW runtime.
Good parity candidates:
- country
- notice type
- contract nature
- procedure type
- publication date range
- submission deadline after
- EU funded
- buyer name contains
- project title contains

Legacy structured parity is **not exact** for filters that legacy `SearchService` does not implement in structured mode, especially:
- lot/organization-expanded `cpvPrefix`
- `cpvCodes`
- `nutsCode`
- `nutsCodes`
- lot-level EU funded semantics

Those are NEW-runtime improvements on top of legacy behavior.
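
One way to keep a parity test honest is to split a request into the filters both runtimes share and the filters only the NEW runtime implements. The helper below is a hypothetical sketch: the mapping from the prose names above to request parameter names is an assumption, and it is not taken from the actual parity test.

```python
# Filters assumed to be comparable between legacy and NEW (see the list above).
PARITY_FILTERS = frozenset({
    "countryCode", "countryCodes", "noticeType", "contractNature", "procedureType",
    "publicationDateFrom", "publicationDateTo", "submissionDeadlineAfter",
    "euFunded",  # shared, but lot-level semantics may still differ in NEW
    "buyerNameContains", "projectTitleContains",
})
# Filters where legacy structured mode has no exact counterpart.
NEW_ONLY_FILTERS = frozenset({"cpvPrefix", "cpvCodes", "nutsCode", "nutsCodes"})


def split_for_parity(filters: dict):
    """Return (comparable filters, sorted names of NEW-only filters) for a request."""
    active = {k: v for k, v in filters.items() if v is not None}
    unknown = set(active) - PARITY_FILTERS - NEW_ONLY_FILTERS
    if unknown:
        raise ValueError(f"unknown filters: {sorted(unknown)}")
    comparable = {k: v for k, v in active.items() if k in PARITY_FILTERS}
    new_only = sorted(k for k in active if k in NEW_ONLY_FILTERS)
    return comparable, new_only


def same_documents(legacy_ids, new_ids) -> bool:
    """Structured-only parity check: both runtimes return the same set of ids."""
    return set(legacy_ids) == set(new_ids)


comparable, new_only = split_for_parity({
    "countryCode": "DE",
    "cpvCodes": ["45234100"],
    "euFunded": True,
    "noticeType": None,
})
print(comparable, new_only)
```

A parity test built this way would compare result sets only for `comparable` and treat any `new_only` filter as out of scope.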