Document Intelligence Platform

Spring Boot application for processing EU eForms public procurement notices. Features:

  • Apache Camel directory watching and processing
  • PostgreSQL storage with XML and vector columns
  • Async vectorization using sentence-transformers
  • REST API for structured and semantic search

README.md

Document Intelligence Platform

Generic document ingestion, normalization and semantic search platform with TED support

A production-ready Spring Boot application that ingests EU eForms public procurement notices from TED (Tenders Electronic Daily) and makes them searchable through AI-based semantic search.

Author: Martin.Schweitzer@procon.co.at and claude.ai

Phase 0 foundation is in place: the codebase now exposes the broader platform namespace at.procon.dip while the existing TED runtime under at.procon.ted remains operational during migration.


🎯 Demonstrator Highlights

This application demonstrates the integration of cutting-edge technologies for intelligent document processing:

  • Natural Language Queries: Search 100,000+ procurement documents using plain language
    • Example: "medical equipment for hospitals in Germany"
    • Example: "IT infrastructure projects in Austria"
  • Multilingual Support: 100+ languages supported via intfloat/multilingual-e5-large model
  • 1024-Dimensional Embeddings: High-precision vector representations for accurate similarity matching
  • Hybrid Search: Combine semantic search with traditional filters (country, CPV codes, dates)
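
As a rough illustration of the hybrid search idea, the sketch below builds one parameterized SQL statement that applies structured filters and ranks by vector similarity. It assumes the table and column names from the schema section (procurement_document, content_vector, buyer_country_code, cpv_codes) and pgvector's cosine distance operator `<=>`. The function name, the `:name` parameter style, and the exact WHERE clauses are hypothetical and may differ from what the application actually executes.

```python
from typing import Any, Optional


def build_hybrid_query(
    country_code: Optional[str] = None,
    cpv_prefix: Optional[str] = None,
    similarity_threshold: float = 0.7,
    limit: int = 20,
) -> tuple[str, dict[str, Any]]:
    """Build a parameterized SQL query combining filters with vector ranking.

    The caller supplies the query embedding as the :query_vector parameter.
    """
    # pgvector's <=> operator returns cosine distance, so similarity = 1 - distance.
    where = [
        "content_vector IS NOT NULL",
        "1 - (content_vector <=> :query_vector) >= :threshold",
    ]
    params: dict[str, Any] = {"threshold": similarity_threshold, "limit": limit}

    if country_code is not None:
        where.append("buyer_country_code = :country_code")
        params["country_code"] = country_code
    if cpv_prefix is not None:
        # cpv_codes is an array column; match any element starting with the prefix.
        where.append(
            "EXISTS (SELECT 1 FROM unnest(cpv_codes) AS c WHERE c LIKE :cpv_like)"
        )
        params["cpv_like"] = cpv_prefix + "%"

    sql = (
        "SELECT id, publication_id, "
        "1 - (content_vector <=> :query_vector) AS similarity "
        "FROM procurement_document "
        "WHERE " + " AND ".join(where) + " "
        "ORDER BY content_vector <=> :query_vector "
        "LIMIT :limit"
    )
    return sql, params
```

Calling `build_hybrid_query(country_code="DEU", cpv_prefix="72")` yields a statement that filters on country and CPV prefix before ordering by cosine distance. The query embedding itself still has to be bound by the caller.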

🗄️ PostgreSQL Native XML

  • Native XML Data Type: Store complete eForms XML documents without serialization overhead
  • XPath Queries: Direct XML querying within PostgreSQL for complex data extraction
  • Dual Storage Strategy:
    • Original XML preserved for audit trail and reprocessing
    • Extracted metadata in structured columns for fast filtering
    • Best of both worlds: flexibility + performance

🚀 Production-Grade Features

  • Fully Automated Pipeline: Downloads and processes 30,000+ documents daily from ted.europa.eu
  • Apache Camel Integration: Enterprise Integration Patterns (Timer, Splitter, SEDA, Dead Letter Channel)
  • Idempotent Processing: SHA-256 hashing prevents duplicate imports
  • Async Vectorization: Non-blocking background processing with 4 concurrent workers
  • pgvector Extension: IVFFlat indexing for fast cosine similarity search at scale
  • eForms SDK 1.13: Full schema validation for EU standard compliance
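
The idempotency check listed above can be pictured with a small sketch. It assumes the hash is computed over the raw XML bytes. The `DuplicateFilter` class is a hypothetical in-memory stand-in for a unique constraint on the document_hash column, not the application's actual code.

```python
import hashlib


def document_hash(xml_bytes: bytes) -> str:
    """Return the SHA-256 hex digest (64 characters) of the raw XML bytes."""
    return hashlib.sha256(xml_bytes).hexdigest()


class DuplicateFilter:
    """In-memory stand-in for a unique constraint on document_hash."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_new(self, xml_bytes: bytes) -> bool:
        """Return True the first time a document is seen, False afterwards."""
        digest = document_hash(xml_bytes)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

The hex digest is always 64 characters long, which matches the VARCHAR(64) type of document_hash. Under this assumption, two files that differ only in whitespace hash differently and are both imported.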

Key Technologies

| Technology | Purpose | Benefit |
|---|---|---|
| PostgreSQL 16+ | Database with native XML | Query XML with XPath while maintaining structure |
| pgvector | Vector similarity search | Million-scale semantic search with cosine similarity |
| Apache Camel | Integration framework | Enterprise patterns for robust data pipelines |
| Spring Boot 3.x | Application framework | Modern Java with dependency injection |
| intfloat/multilingual-e5-large | Embedding model | State-of-the-art multilingual semantic understanding |
| eForms SDK | EU standard | Compliance with official procurement schemas |

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                     TED Procurement Processor                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐    ┌─────────────────┐    ┌───────────────────┐  │
│  │  File System │───▶│  Apache Camel   │───▶│  Document         │  │
│  │  (*.xml)     │    │  Route          │    │  Processing       │  │
│  └──────────────┘    └─────────────────┘    │  Service          │  │
│                                              └─────────┬─────────┘  │
│                                                        │            │
│                                                        ▼            │
│  ┌──────────────┐    ┌─────────────────┐    ┌───────────────────┐  │
│  │  REST API    │◀───│  Search         │◀───│  PostgreSQL       │  │
│  │  Controller  │    │  Service        │    │  + pgvector       │  │
│  └──────────────┘    └─────────────────┘    └───────────────────┘  │
│                                                        ▲            │
│                                                        │            │
│  ┌──────────────────────────────────────────────────────┐          │
│  │              Vectorization Service (Async)            │          │
│  │        intfloat/multilingual-e5-large (1024d)         │          │
│  └──────────────────────────────────────────────────────┘          │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Prerequisites

  • Java 21+
  • Maven 3.9+
  • PostgreSQL 16+ with pgvector extension
  • Python 3.11+ (for embedding service)
  • Docker & Docker Compose (optional, for easy setup)

🚀 Automated Pipeline

See TED_AUTOMATED_PIPELINE.md for complete documentation on the automated download, processing, and vectorization pipeline.

The application automatically:

  1. Downloads TED Daily Packages every hour from ted.europa.eu
  2. Extracts and processes XML files
  3. Stores in PostgreSQL with native XML support
  4. Generates 1024-dimensional embeddings for semantic search
  5. Enables REST API queries with natural language

Quick Start

1. Start PostgreSQL with pgvector

Using Docker:

docker-compose up -d postgres

Or install PostgreSQL with the pgvector extension manually.

2. Configure Application

Edit src/main/resources/application.yml:

ted:
  input:
    directory: D:/ted.europe/2025-11.tar/2025-11/11  # Your TED XML directory
    pattern: "**/*.xml"
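
To check which files a pattern such as `**/*.xml` would pick up before starting the application, a short Python sketch can list the matches. It assumes Python's pathlib globbing is close enough to the Ant-style matching used by the application (`**` matching zero or more directory levels). `find_xml_files` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def find_xml_files(input_dir: Path) -> list[Path]:
    """Return all *.xml files under input_dir, at any depth, in sorted order."""
    return sorted(p for p in input_dir.glob("**/*.xml") if p.is_file())
```

For example, `find_xml_files(Path("D:/ted.europe/2025-11.tar/2025-11/11"))` would list the XML files in that directory and all of its subdirectories.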

3. Build and Run

# Build
mvn clean package -DskipTests

# Run
java -jar target/ted-procurement-processor-1.0.0-SNAPSHOT.jar

4. Start Embedding Service (Optional)

For semantic search capabilities:

# Using Docker
docker-compose --profile with-embedding up -d embedding-service

# Or manually
pip install -r requirements-embedding.txt
python embedding_service.py

Database Schema

Main Tables

| Table | Description |
|---|---|
| procurement_document | Main table with extracted metadata and original XML |
| procurement_lot | Individual lots within procurement notices |
| organization | Organizations mentioned in notices (buyers, review bodies) |
| processing_log | Audit trail for document processing events |

Key Columns in procurement_document

| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| document_hash | VARCHAR(64) | SHA-256 hash for idempotency |
| publication_id | VARCHAR(50) | TED publication ID (e.g., "00786665-2025") |
| notice_url | VARCHAR(255) | TED website URL (e.g., "https://ted.europa.eu/en/notice/-/detail/786665-2025") |
| xml_document | XML | Original document |
| text_content | TEXT | Extracted text for vectorization |
| content_vector | vector(1024) | Embedding for semantic search |
| buyer_country_code | VARCHAR(10) | ISO 3166-1 alpha-3 country code |
| cpv_codes | VARCHAR(100)[] | CPV classification codes |
| nuts_codes | VARCHAR(20)[] | NUTS region codes |
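
The two examples in the table suggest that notice_url is the publication ID with the leading zeros of the number part removed. The sketch below encodes that inferred rule. It is an assumption drawn only from the example pair "00786665-2025" and ".../detail/786665-2025", and the function name and `language` parameter are hypothetical.

```python
def notice_url(publication_id: str, language: str = "en") -> str:
    """Derive a TED notice URL from a publication ID such as '00786665-2025'."""
    number, year = publication_id.split("-", 1)
    # Drop leading zeros from the number part, as in the example above.
    stripped = number.lstrip("0") or "0"
    return f"https://ted.europa.eu/{language}/notice/-/detail/{stripped}-{year}"
```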

REST API

Search Endpoints

Search with structured filters:

# Search by country
curl "http://localhost:8080/api/v1/documents/search?countryCode=POL"

# Search by CPV code prefix (medical supplies)
curl "http://localhost:8080/api/v1/documents/search?cpvPrefix=33"

# Search by date range
curl "http://localhost:8080/api/v1/documents/search?publicationDateFrom=2025-01-01&publicationDateTo=2025-12-31"

# Combined filters
curl "http://localhost:8080/api/v1/documents/search?countryCode=DEU&contractNature=SERVICES&noticeType=CONTRACT_NOTICE"

Natural language semantic search:

# Search for medical equipment tenders
curl "http://localhost:8080/api/v1/documents/semantic-search?query=medical+equipment+hospital+supplies"

# Search with similarity threshold
curl "http://localhost:8080/api/v1/documents/semantic-search?query=construction+works+road+infrastructure&threshold=0.75"
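
For scripted access, the same semantic search URLs can be assembled in Python with the standard library. This is a minimal sketch. It assumes the base URL and the query and threshold parameters shown in the curl examples, and `semantic_search_url` is a hypothetical helper name.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080/api/v1/documents"


def semantic_search_url(query: str, threshold: Optional[float] = None) -> str:
    """Build a semantic-search URL; spaces in the query are encoded as '+'."""
    params: dict[str, object] = {"query": query}
    if threshold is not None:
        params["threshold"] = threshold
    return f"{BASE_URL}/semantic-search?{urlencode(params)}"
```

The resulting string can be passed to `urllib.request.urlopen` once the application is running.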

POST /api/v1/documents/search

Complex search with JSON body:

curl -X POST "http://localhost:8080/api/v1/documents/search" \
  -H "Content-Type: application/json" \
  -d '{
    "countryCodes": ["DEU", "AUT", "CHE"],
    "contractNature": "SERVICES",
    "cpvPrefix": "72",
    "semanticQuery": "software development IT services",
    "similarityThreshold": 0.7,
    "page": 0,
    "size": 20
  }'
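
The same POST request can be prepared from Python without third-party libraries. The sketch below only builds the request object. The endpoint and the field names are taken from the curl example above, while `build_search_request` is a hypothetical helper.

```python
import json
from urllib.request import Request

SEARCH_URL = "http://localhost:8080/api/v1/documents/search"


def build_search_request(body: dict) -> Request:
    """Wrap a search body in a JSON POST request for the search endpoint."""
    return Request(
        SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request({
    "countryCodes": ["DEU", "AUT", "CHE"],
    "contractNature": "SERVICES",
    "cpvPrefix": "72",
    "semanticQuery": "software development IT services",
    "similarityThreshold": 0.7,
    "page": 0,
    "size": 20,
})
```

Sending it is then a matter of calling `urllib.request.urlopen(request)` against a running instance.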

Document Retrieval

# Get by UUID
curl "http://localhost:8080/api/v1/documents/{uuid}"

# Get by publication ID
curl "http://localhost:8080/api/v1/documents/publication/00786665-2025"

Metadata Endpoints

# List countries
curl "http://localhost:8080/api/v1/documents/metadata/countries"

# Get statistics
curl "http://localhost:8080/api/v1/documents/statistics"

# Upcoming deadlines
curl "http://localhost:8080/api/v1/documents/upcoming-deadlines?limit=50"

Admin Endpoints

# Health check
curl "http://localhost:8080/api/v1/admin/health"

# Vectorization status
curl "http://localhost:8080/api/v1/admin/vectorization/status"

# Trigger vectorization for pending documents
curl -X POST "http://localhost:8080/api/v1/admin/vectorization/process-pending?batchSize=100"

Configuration

Application Properties

| Property | Default | Description |
|---|---|---|
| ted.input.directory | - | Input directory for XML files |
| ted.input.pattern | **/*.xml | File pattern (Ant-style) |
| ted.input.poll-interval | 5000 | Polling interval in ms |
| ted.schema.enabled | true | Enable XSD validation |
| ted.vectorization.enabled | true | Enable async vectorization |
| ted.vectorization.model-name | intfloat/multilingual-e5-large | Embedding model |
| ted.vectorization.dimensions | 1024 | Vector dimensions |
| ted.search.default-page-size | 20 | Default results per page |
| ted.search.similarity-threshold | 0.7 | Default similarity threshold |

Environment Variables

| Variable | Description |
|---|---|
| DB_USERNAME | PostgreSQL username |
| DB_PASSWORD | PostgreSQL password |
| TED_INPUT_DIR | Override input directory |

Data Model

Notice Types

  • CONTRACT_NOTICE - Standard contract notices
  • PRIOR_INFORMATION_NOTICE - Prior information notices
  • CONTRACT_AWARD_NOTICE - Contract award notices
  • MODIFICATION_NOTICE - Contract modifications
  • OTHER - Other notice types

Contract Nature

  • SUPPLIES - Goods procurement
  • SERVICES - Service procurement
  • WORKS - Construction works
  • MIXED - Mixed contracts
  • UNKNOWN - Not specified

Procedure Types

  • OPEN - Open procedure
  • RESTRICTED - Restricted procedure
  • COMPETITIVE_DIALOGUE - Competitive dialogue
  • INNOVATION_PARTNERSHIP - Innovation partnership
  • NEGOTIATED_WITHOUT_PUBLICATION - Negotiated without prior publication
  • NEGOTIATED_WITH_PUBLICATION - Negotiated with prior publication
  • OTHER - Other procedures

See VECTORIZATION.md for detailed documentation on the vectorization pipeline.

The application uses the intfloat/multilingual-e5-large model for generating document embeddings:

  • Dimensions: 1024
  • Languages: Supports 100+ languages
  • Normalization: Embeddings are L2 normalized for cosine similarity
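
The reason L2 normalization matters: for unit-length vectors, cosine similarity reduces to a plain dot product, so ranking by either gives the same order. The following sketch demonstrates this with two small vectors in pure Python. The helper names are illustrative only.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# After normalization the dot product alone equals the cosine similarity.
assert math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine_similarity(a, b))
```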

Query Prefixes

For optimal results with e5 models:

  • Documents use the "passage: " prefix
  • Queries use the "query: " prefix

This is handled automatically by the vectorization service.
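
For anyone calling the model directly, for example when experimenting outside the application, the prefixing step might look like the sketch below. The helper names are hypothetical and the vectorization service's actual implementation may differ.

```python
def as_passage(text: str) -> str:
    """Prefix a document text the way e5 models expect for indexed content."""
    return f"passage: {text}"


def as_query(text: str) -> str:
    """Prefix a search string the way e5 models expect for queries."""
    return f"query: {text}"


documents = [as_passage("Supply of MRI scanners for a regional hospital")]
search = as_query("medical equipment for hospitals in Germany")
```

With the sentence-transformers library, the prefixed strings would then typically be passed to `SentenceTransformer("intfloat/multilingual-e5-large").encode(..., normalize_embeddings=True)`.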

Development

Running Tests

mvn test

Building Docker Image

docker build -t ted-procurement-processor .

OpenAPI Documentation

Access Swagger UI at: http://localhost:8080/api/swagger-ui.html

Performance Considerations

Indexes

The schema includes optimized indexes for:

  • Hash lookup (idempotent processing)
  • Publication/notice ID lookups
  • Date range queries
  • Geographic searches (country, NUTS codes)
  • CPV code classification
  • Vector similarity search (IVFFlat)
  • Full-text trigram search

Batch Processing

  • Set ted.input.max-messages-per-poll to limit how many files are picked up per poll
  • Vectorization processes documents in batches of 16 by default
  • Use the admin API to trigger bulk vectorization
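
Batching as described above amounts to slicing the pending documents into fixed-size chunks. A minimal sketch follows, using the default batch size of 16 stated in this section. `batched` is a hypothetical helper rather than the application's code.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int = 16) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Forty pending documents would be sent to the embedding model in three calls of 16, 16, and 8 texts.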

Troubleshooting

Common Issues

Files not being processed:

  • Check directory path in configuration
  • Verify file permissions
  • Check Camel route status in logs

Duplicate detection not working:

  • Ensure the document_hash column has a unique constraint
  • Check whether the XML content is exactly identical; any difference in the hashed content produces a different SHA-256 hash

Vectorization failing:

  • Verify embedding service is running
  • Check Python dependencies
  • Ensure sufficient memory for model

Slow searches:

  • Ensure pgvector IVFFlat index is created
  • Check if content_vector column is populated
  • Consider adjusting lists parameter in index
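
When tuning the lists parameter, the pgvector documentation suggests starting with rows / 1000 for up to one million rows and sqrt(rows) above that, with probes starting at around sqrt(lists). The sketch below turns that guidance into numbers and an index statement. The index name is hypothetical, and the table and column names are taken from the schema section.

```python
import math


def suggested_lists(row_count: int) -> int:
    """Starting value for IVFFlat lists, following pgvector's guidance."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return math.isqrt(row_count)


def suggested_probes(lists: int) -> int:
    """Starting value for ivfflat.probes at query time."""
    return max(1, math.isqrt(lists))


def create_index_sql(lists: int) -> str:
    """Render a CREATE INDEX statement for cosine-distance search."""
    return (
        "CREATE INDEX idx_procurement_document_content_vector "  # hypothetical name
        "ON procurement_document USING ivfflat (content_vector vector_cosine_ops) "
        f"WITH (lists = {lists});"
    )
```

For a table of 100,000 documents this gives lists = 100 and probes = 10. Higher probes values trade speed for recall.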

License

Licensed under the European Union Public Licence (EUPL) v1.2

Copyright (c) 2025 PROCON DATA Gesellschaft m.b.H.

You may use, copy, modify and distribute this work under the terms of the EUPL. See the LICENSE file for details or visit: https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12

Acknowledgments