Document Intelligence Platform
Generic document ingestion, normalization and semantic search platform with TED support
A production-ready Spring Boot application showcasing advanced AI semantic search capabilities for processing and searching EU eForms public procurement notices from TED (Tenders Electronic Daily).
Author: Martin.Schweitzer@procon.co.at and claude.ai
Phase 0 foundation is in place: the codebase now exposes the broader platform namespace `at.procon.dip`, while the existing TED runtime under `at.procon.ted` remains operational during migration.
🎯 Demonstrator Highlights
This application demonstrates the integration of cutting-edge technologies for intelligent document processing:
🧠 AI Semantic Search
- Natural Language Queries: Search 100,000+ procurement documents using plain language
- Example: "medical equipment for hospitals in Germany"
- Example: "IT infrastructure projects in Austria"
- Multilingual Support: 100+ languages supported via the `intfloat/multilingual-e5-large` model
- 1024-Dimensional Embeddings: High-precision vector representations for accurate similarity matching
- Hybrid Search: Combine semantic search with traditional filters (country, CPV codes, dates)
🗄️ PostgreSQL Native XML
- Native XML Data Type: Store complete eForms XML documents without serialization overhead
- XPath Queries: Direct XML querying within PostgreSQL for complex data extraction
- Dual Storage Strategy:
  - Original XML preserved for audit trail and reprocessing
  - Extracted metadata in structured columns for fast filtering
  - Best of both worlds: flexibility + performance
🚀 Production-Grade Features
- Fully Automated Pipeline: Downloads and processes 30,000+ documents daily from ted.europa.eu
- Apache Camel Integration: Enterprise Integration Patterns (Timer, Splitter, SEDA, Dead Letter Channel)
- Idempotent Processing: SHA-256 hashing prevents duplicate imports
- Async Vectorization: Non-blocking background processing with 4 concurrent workers
- pgvector Extension: IVFFlat indexing for fast cosine similarity search at scale
- eForms SDK 1.13: Full schema validation for EU standard compliance
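The idempotent-processing bullet above can be illustrated with a short sketch. This is a minimal Python illustration, not the project's Java code: the function names and the in-memory `seen_hashes` set are hypothetical stand-ins for a lookup against a unique hash column in the database. A SHA-256 hex digest is always 64 characters long.

```python
import hashlib


def document_hash(xml_bytes: bytes) -> str:
    """Return the SHA-256 hex digest (64 characters) of the raw document bytes."""
    return hashlib.sha256(xml_bytes).hexdigest()


def import_once(xml_bytes: bytes, seen_hashes: set) -> bool:
    """Import a document only if its hash is new; return True when it was imported.

    `seen_hashes` is a hypothetical stand-in for a database lookup.
    """
    digest = document_hash(xml_bytes)
    if digest in seen_hashes:
        return False  # same bytes were already imported, so skip
    seen_hashes.add(digest)
    return True
```

Because the hash is computed over raw bytes, two files that differ only in whitespace or encoding count as different documents under this sketch.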
Key Technologies
| Technology | Purpose | Benefit |
|---|---|---|
| PostgreSQL 16+ | Database with native XML | Query XML with XPath while maintaining structure |
| pgvector | Vector similarity search | Million-scale semantic search with cosine similarity |
| Apache Camel | Integration framework | Enterprise patterns for robust data pipelines |
| Spring Boot 3.x | Application framework | Modern Java with dependency injection |
| intfloat/multilingual-e5-large | Embedding model | State-of-the-art multilingual semantic understanding |
| eForms SDK | EU standard | Compliance with official procurement schemas |
Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ TED Procurement Processor │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌─────────────────┐ ┌───────────────────┐ │
│ │ File System │───▶│ Apache Camel │───▶│ Document │ │
│ │ (*.xml) │ │ Route │ │ Processing │ │
│ └──────────────┘ └─────────────────┘ │ Service │ │
│ └─────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ ┌─────────────────┐ ┌───────────────────┐ │
│ │ REST API │◀───│ Search │◀───│ PostgreSQL │ │
│ │ Controller │ │ Service │ │ + pgvector │ │
│ └──────────────┘ └─────────────────┘ └───────────────────┘ │
│ ▲ │
│ │ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Vectorization Service (Async) │ │
│ │ intfloat/multilingual-e5-large (1024d) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Prerequisites
- Java 21+
- Maven 3.9+
- PostgreSQL 16+ with pgvector extension
- Python 3.11+ (for embedding service)
- Docker & Docker Compose (optional, for easy setup)
🚀 Automated Pipeline
See TED_AUTOMATED_PIPELINE.md for complete documentation on the automated download, processing, and vectorization pipeline.
The application automatically:
- Downloads TED Daily Packages every hour from ted.europa.eu
- Extracts and processes XML files
- Stores in PostgreSQL with native XML support
- Generates 1024-dimensional embeddings for semantic search
- Enables REST API queries with natural language
Quick Start
1. Start PostgreSQL with pgvector
Using Docker:
docker-compose up -d postgres
Or manually install PostgreSQL with pgvector extension.
2. Configure Application
Edit src/main/resources/application.yml:
ted:
  input:
    directory: D:/ted.europe/2025-11.tar/2025-11/11  # Your TED XML directory
    pattern: "**/*.xml"
3. Build and Run
# Build
mvn clean package -DskipTests
# Run
java -jar target/ted-procurement-processor-1.0.0-SNAPSHOT.jar
4. Start Embedding Service (Optional)
For semantic search capabilities:
# Using Docker
docker-compose --profile with-embedding up -d embedding-service
# Or manually
pip install -r requirements-embedding.txt
python embedding_service.py
Database Schema
Main Tables
| Table | Description |
|---|---|
| `procurement_document` | Main table with extracted metadata and original XML |
| `procurement_lot` | Individual lots within procurement notices |
| `organization` | Organizations mentioned in notices (buyers, review bodies) |
| `processing_log` | Audit trail for document processing events |
Key Columns in procurement_document
| Column | Type | Description |
|---|---|---|
| `id` | UUID | Primary key |
| `document_hash` | VARCHAR(64) | SHA-256 hash for idempotency |
| `publication_id` | VARCHAR(50) | TED publication ID (e.g., "00786665-2025") |
| `notice_url` | VARCHAR(255) | TED website URL (e.g., "https://ted.europa.eu/en/notice/-/detail/786665-2025") |
| `xml_document` | XML | Original document |
| `text_content` | TEXT | Extracted text for vectorization |
| `content_vector` | vector(1024) | Embedding for semantic search |
| `buyer_country_code` | VARCHAR(10) | ISO 3166-1 alpha-3 country code |
| `cpv_codes` | VARCHAR(100)[] | CPV classification codes |
| `nuts_codes` | VARCHAR(20)[] | NUTS region codes |
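Judging only from the two example values in the table, the notice URL looks like a fixed base followed by the publication ID with its leading zeros stripped. The sketch below encodes that reading; it is an assumption drawn from the examples, and `notice_url` and `TED_NOTICE_BASE` are hypothetical names, not the project's API.

```python
TED_NOTICE_BASE = "https://ted.europa.eu/en/notice/-/detail/"


def notice_url(publication_id: str) -> str:
    """Build a notice URL from a publication ID such as "00786665-2025".

    Assumption: the URL uses the numeric part without leading zeros.
    """
    number, year = publication_id.split("-", 1)
    return f"{TED_NOTICE_BASE}{int(number)}-{year}"
```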
REST API
Search Endpoints
GET /api/v1/documents/search
Search with structured filters:
# Search by country
curl "http://localhost:8080/api/v1/documents/search?countryCode=POL"
# Search by CPV code prefix (medical supplies)
curl "http://localhost:8080/api/v1/documents/search?cpvPrefix=33"
# Search by date range
curl "http://localhost:8080/api/v1/documents/search?publicationDateFrom=2025-01-01&publicationDateTo=2025-12-31"
# Combined filters
curl "http://localhost:8080/api/v1/documents/search?countryCode=DEU&contractNature=SERVICES&noticeType=CONTRACT_NOTICE"
GET /api/v1/documents/semantic-search
Natural language semantic search:
# Search for medical equipment tenders
curl "http://localhost:8080/api/v1/documents/semantic-search?query=medical+equipment+hospital+supplies"
# Search with similarity threshold
curl "http://localhost:8080/api/v1/documents/semantic-search?query=construction+works+road+infrastructure&threshold=0.75"
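When calling the semantic search endpoint from a script, the query string has to be URL-encoded. The sketch below builds the same URLs as the curl examples above using only the Python standard library; `semantic_search_url` is a hypothetical helper, and the host and port are the local defaults used throughout this README.

```python
from urllib.parse import urlencode


def semantic_search_url(query, threshold=None, base="http://localhost:8080"):
    """Return the GET URL for a natural language search.

    `query` and `threshold` are the parameter names shown in the curl examples.
    """
    params = {"query": query}
    if threshold is not None:
        params["threshold"] = threshold
    # urlencode escapes reserved characters and turns spaces into "+"
    return f"{base}/api/v1/documents/semantic-search?{urlencode(params)}"
```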
POST /api/v1/documents/search
Complex search with JSON body:
curl -X POST "http://localhost:8080/api/v1/documents/search" \
  -H "Content-Type: application/json" \
  -d '{
    "countryCodes": ["DEU", "AUT", "CHE"],
    "contractNature": "SERVICES",
    "cpvPrefix": "72",
    "semanticQuery": "software development IT services",
    "similarityThreshold": 0.7,
    "page": 0,
    "size": 20
  }'
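The same POST request can be prepared from Python. This is a sketch using only the standard library: the field names come from the JSON body above, while `build_search_request` and `send` are hypothetical helpers. The shape of the response is not documented here, so the sketch returns the decoded JSON as-is.

```python
import json
import urllib.request

SEARCH_URL = "http://localhost:8080/api/v1/documents/search"


def build_search_request(semantic_query, country_codes, cpv_prefix=None,
                         threshold=0.7, page=0, size=20):
    """Build (but do not send) a POST request with a JSON search body."""
    body = {
        "countryCodes": country_codes,
        "semanticQuery": semantic_query,
        "similarityThreshold": threshold,
        "page": page,
        "size": size,
    }
    if cpv_prefix is not None:
        body["cpvPrefix"] = cpv_prefix
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Separating request construction from sending keeps the body easy to inspect or log before anything goes over the network.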
Document Retrieval
# Get by UUID
curl "http://localhost:8080/api/v1/documents/{uuid}"
# Get by publication ID
curl "http://localhost:8080/api/v1/documents/publication/00786665-2025"
Metadata Endpoints
# List countries
curl "http://localhost:8080/api/v1/documents/metadata/countries"
# Get statistics
curl "http://localhost:8080/api/v1/documents/statistics"
# Upcoming deadlines
curl "http://localhost:8080/api/v1/documents/upcoming-deadlines?limit=50"
Admin Endpoints
# Health check
curl "http://localhost:8080/api/v1/admin/health"
# Vectorization status
curl "http://localhost:8080/api/v1/admin/vectorization/status"
# Trigger vectorization for pending documents
curl -X POST "http://localhost:8080/api/v1/admin/vectorization/process-pending?batchSize=100"
Configuration
Application Properties
| Property | Default | Description |
|---|---|---|
| `ted.input.directory` | - | Input directory for XML files |
| `ted.input.pattern` | `**/*.xml` | File pattern (Ant-style) |
| `ted.input.poll-interval` | 5000 | Polling interval in ms |
| `ted.schema.enabled` | true | Enable XSD validation |
| `ted.vectorization.enabled` | true | Enable async vectorization |
| `ted.vectorization.model-name` | `intfloat/multilingual-e5-large` | Embedding model |
| `ted.vectorization.dimensions` | 1024 | Vector dimensions |
| `ted.search.default-page-size` | 20 | Default results per page |
| `ted.search.similarity-threshold` | 0.7 | Default similarity threshold |
Environment Variables
| Variable | Description |
|---|---|
| `DB_USERNAME` | PostgreSQL username |
| `DB_PASSWORD` | PostgreSQL password |
| `TED_INPUT_DIR` | Override input directory |
Data Model
Notice Types
- `CONTRACT_NOTICE` - Standard contract notices
- `PRIOR_INFORMATION_NOTICE` - Prior information notices
- `CONTRACT_AWARD_NOTICE` - Contract award notices
- `MODIFICATION_NOTICE` - Contract modifications
- `OTHER` - Other notice types
Contract Nature
- `SUPPLIES` - Goods procurement
- `SERVICES` - Service procurement
- `WORKS` - Construction works
- `MIXED` - Mixed contracts
- `UNKNOWN` - Not specified
Procedure Types
- `OPEN` - Open procedure
- `RESTRICTED` - Restricted procedure
- `COMPETITIVE_DIALOGUE` - Competitive dialogue
- `INNOVATION_PARTNERSHIP` - Innovation partnership
- `NEGOTIATED_WITHOUT_PUBLICATION` - Negotiated without prior publication
- `NEGOTIATED_WITH_PUBLICATION` - Negotiated with prior publication
- `OTHER` - Other procedures
Semantic Search
See VECTORIZATION.md for detailed documentation on the vectorization pipeline.
The application uses the intfloat/multilingual-e5-large model for generating document embeddings:
- Dimensions: 1024
- Languages: Supports 100+ languages
- Normalization: Embeddings are L2 normalized for cosine similarity
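L2 normalization matters because, for unit-length vectors, the dot product equals the cosine similarity, so ranking by either gives the same order. The following standard-library sketch shows that identity on toy 2-dimensional vectors; the real embeddings have 1024 dimensions, and the function names are illustrative only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# For normalized vectors the plain dot product is already the cosine similarity.
assert math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine_similarity(a, b))
```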
Query Prefixes
For optimal results with e5 models:
- Documents use the `passage:` prefix
- Queries use the `query:` prefix
This is handled automatically by the vectorization service.
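For readers calling the model directly, the prefixing amounts to prepending a short marker before embedding. The helpers below are a hypothetical sketch of that step, not the code in `embedding_service.py`; the E5 model card specifies the prefixes as `query: ` and `passage: `, each followed by a space.

```python
def as_passage(text: str) -> str:
    """Prefix a document text before it is embedded for indexing."""
    return "passage: " + text.strip()


def as_query(text: str) -> str:
    """Prefix a user query before it is embedded for searching."""
    return "query: " + text.strip()
```

Mixing the two up (for example, embedding queries with the passage prefix) usually still returns results, but the model was trained with the asymmetric prefixes, so retrieval quality may drop.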
Development
Running Tests
mvn test
Building Docker Image
docker build -t ted-procurement-processor .
OpenAPI Documentation
Access Swagger UI at: http://localhost:8080/api/swagger-ui.html
Performance Considerations
Indexes
The schema includes optimized indexes for:
- Hash lookup (idempotent processing)
- Publication/notice ID lookups
- Date range queries
- Geographic searches (country, NUTS codes)
- CPV code classification
- Vector similarity search (IVFFlat)
- Full-text trigram search
Batch Processing
- Configure `ted.input.max-messages-per-poll` for batch sizes
- Vectorization processes documents in batches of 16 by default
- Use the admin API to trigger bulk vectorization
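Batching simply means grouping pending documents into fixed-size chunks before handing them to the embedding model. A minimal sketch, assuming the default batch size of 16 mentioned above; `in_batches` is a hypothetical helper, not part of the project.

```python
from itertools import islice


def in_batches(items, batch_size=16):
    """Yield consecutive lists of at most `batch_size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

With 40 pending documents this yields two full batches of 16 and a final batch of 8.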
Troubleshooting
Common Issues
Files not being processed:
- Check directory path in configuration
- Verify file permissions
- Check Camel route status in logs
Duplicate detection not working:
- Ensure the `document_hash` column has a unique constraint
- Check whether the XML content is exactly the same
Vectorization failing:
- Verify embedding service is running
- Check Python dependencies
- Ensure sufficient memory for model
Slow searches:
- Ensure the pgvector IVFFlat index is created
- Check whether the `content_vector` column is populated
- Consider adjusting the `lists` parameter of the index
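As a starting point for tuning `lists`, the pgvector documentation suggests rows / 1000 for tables up to one million rows and the square root of the row count above that, with `probes` starting near the square root of `lists`. The sketch below encodes that rule of thumb and formats a matching index statement. The table and column names come from the schema in this README; the index name and the helper functions are hypothetical, and the values are starting points to benchmark, not guarantees.

```python
import math


def suggested_lists(row_count: int) -> int:
    """Rule-of-thumb number of IVFFlat lists for a table of `row_count` rows."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return math.isqrt(row_count)


def suggested_probes(lists: int) -> int:
    """Rule-of-thumb number of lists to probe per query."""
    return max(1, math.isqrt(lists))


def create_index_sql(lists: int) -> str:
    """Format a CREATE INDEX statement; the index name is a placeholder."""
    return (
        "CREATE INDEX IF NOT EXISTS idx_content_vector_ivfflat "
        "ON procurement_document "
        "USING ivfflat (content_vector vector_cosine_ops) "
        f"WITH (lists = {lists});"
    )
```

An IVFFlat index is best built after the table already holds representative data, since the list centroids are computed from the rows present at build time.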
License
Licensed under the European Union Public Licence (EUPL) v1.2
Copyright (c) 2025 PROCON DATA Gesellschaft m.b.H.
You may use, copy, modify and distribute this work under the terms of the EUPL. See the LICENSE file for details or visit: https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12
Acknowledgments
- eForms SDK - EU Publications Office
- pgvector - Vector similarity search for PostgreSQL
- sentence-transformers - Text embeddings
- Apache Camel - Integration framework