# Document Intelligence Platform

**Generic document ingestion, normalization and semantic search platform with TED support**

A production-ready Spring Boot application showcasing advanced AI semantic search capabilities for processing and searching EU eForms public procurement notices from TED (Tenders Electronic Daily).

**Author:** Martin.Schweitzer@procon.co.at and claude.ai

> Phase 0 foundation is in place: the codebase now exposes the broader platform namespace `at.procon.dip` while the existing TED runtime under `at.procon.ted` remains operational during migration.

---

## 🎯 Demonstrator Highlights

This application demonstrates the integration of cutting-edge technologies for intelligent document processing:

### 🧠 **AI Semantic Search**
- **Natural Language Queries**: Search 100,000+ procurement documents using plain language
  - Example: *"medical equipment for hospitals in Germany"*
  - Example: *"IT infrastructure projects in Austria"*
- **Multilingual Support**: 100+ languages supported via the `intfloat/multilingual-e5-large` model
- **1024-Dimensional Embeddings**: High-precision vector representations for accurate similarity matching
- **Hybrid Search**: Combine semantic search with traditional filters (country, CPV codes, dates)

### 🗄️ **PostgreSQL Native XML**
- **Native XML Data Type**: Store complete eForms XML documents without serialization overhead
- **XPath Queries**: Direct XML querying within PostgreSQL for complex data extraction
- **Dual Storage Strategy**:
  - Original XML preserved for audit trail and reprocessing
  - Extracted metadata in structured columns for fast filtering
  - Best of both worlds: flexibility + performance

### 🚀 **Production-Grade Features**
- **Fully Automated Pipeline**: Downloads and processes 30,000+ documents daily from ted.europa.eu
- **Apache Camel Integration**: Enterprise Integration Patterns (Timer, Splitter, SEDA, Dead Letter Channel)
- **Idempotent Processing**: SHA-256 hashing prevents duplicate imports
- **Async Vectorization**: Non-blocking background processing with 4 concurrent workers
- **pgvector Extension**: IVFFlat indexing for fast cosine similarity search at scale
- **eForms SDK 1.13**: Full schema validation for EU standard compliance

---

## Key Technologies

| Technology | Purpose | Benefit |
|------------|---------|---------|
| **PostgreSQL 16+** | Database with native XML | Query XML with XPath while maintaining structure |
| **pgvector** | Vector similarity search | Million-scale semantic search with cosine similarity |
| **Apache Camel** | Integration framework | Enterprise patterns for robust data pipelines |
| **Spring Boot 3.x** | Application framework | Modern Java with dependency injection |
| **intfloat/multilingual-e5-large** | Embedding model | State-of-the-art multilingual semantic understanding |
| **eForms SDK** | EU standard | Compliance with official procurement schemas |

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                      TED Procurement Processor                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐    ┌─────────────────┐    ┌───────────────────┐   │
│  │ File System  │───▶│  Apache Camel   │───▶│ Document          │   │
│  │ (*.xml)      │    │  Route          │    │ Processing        │   │
│  └──────────────┘    └─────────────────┘    │ Service           │   │
│                                             └─────────┬─────────┘   │
│                                                       │             │
│                                                       ▼             │
│  ┌──────────────┐    ┌─────────────────┐    ┌───────────────────┐   │
│  │ REST API     │◀───│  Search         │◀───│ PostgreSQL        │   │
│  │ Controller   │    │  Service        │    │ + pgvector        │   │
│  └──────────────┘    └─────────────────┘    └───────────────────┘   │
│                                                       ▲             │
│                                                       │             │
│      ┌────────────────────────────────────────────────────────┐     │
│      │             Vectorization Service (Async)              │     │
│      │         intfloat/multilingual-e5-large (1024d)         │     │
│      └────────────────────────────────────────────────────────┘     │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

## Prerequisites

- Java 21+
- Maven 3.9+
- PostgreSQL 16+ with pgvector extension
- Python 3.11+ (for embedding service)
- Docker & Docker Compose (optional, for easy setup)

## 🚀 Automated Pipeline

**See [TED_AUTOMATED_PIPELINE.md](TED_AUTOMATED_PIPELINE.md) for complete documentation on the automated download, processing, and vectorization pipeline.**

The application automatically:
1. Downloads TED Daily Packages every hour from ted.europa.eu
2. Extracts and processes XML files
3. Stores them in PostgreSQL with native XML support
4. Generates 1024-dimensional embeddings for semantic search
5. Serves natural language queries through the REST API

## Quick Start

### 1. Start PostgreSQL with pgvector

Using Docker:
```bash
docker-compose up -d postgres
```

Alternatively, install PostgreSQL with the pgvector extension manually.

### 2. Configure Application

Edit `src/main/resources/application.yml`:

```yaml
ted:
  input:
    directory: D:/ted.europe/2025-11.tar/2025-11/11  # Your TED XML directory
    pattern: "**/*.xml"
```

### 3. Build and Run

```bash
# Build
mvn clean package -DskipTests

# Run
java -jar target/ted-procurement-processor-1.0.0-SNAPSHOT.jar
```

### 4. Start Embedding Service (Optional)

For semantic search capabilities:

```bash
# Using Docker
docker-compose --profile with-embedding up -d embedding-service

# Or manually
pip install -r requirements-embedding.txt
python embedding_service.py
```

## Database Schema

### Main Tables

| Table | Description |
|-------|-------------|
| `procurement_document` | Main table with extracted metadata and original XML |
| `procurement_lot` | Individual lots within procurement notices |
| `organization` | Organizations mentioned in notices (buyers, review bodies) |
| `processing_log` | Audit trail for document processing events |

### Key Columns in `procurement_document`

| Column | Type | Description |
|--------|------|-------------|
| `id` | UUID | Primary key |
| `document_hash` | VARCHAR(64) | SHA-256 hash for idempotency |
| `publication_id` | VARCHAR(50) | TED publication ID (e.g., "00786665-2025") |
| `notice_url` | VARCHAR(255) | TED website URL (e.g., "https://ted.europa.eu/en/notice/-/detail/786665-2025") |
| `xml_document` | XML | Original document |
| `text_content` | TEXT | Extracted text for vectorization |
| `content_vector` | vector(1024) | Embedding for semantic search |
| `buyer_country_code` | VARCHAR(10) | ISO 3166-1 alpha-3 country code |
| `cpv_codes` | VARCHAR(100)[] | CPV classification codes |
| `nuts_codes` | VARCHAR(20)[] | NUTS region codes |

## REST API

### Search Endpoints

#### GET /api/v1/documents/search

Search with structured filters:

```bash
# Search by country
curl "http://localhost:8080/api/v1/documents/search?countryCode=POL"

# Search by CPV code prefix (medical supplies)
curl "http://localhost:8080/api/v1/documents/search?cpvPrefix=33"

# Search by date range
curl "http://localhost:8080/api/v1/documents/search?publicationDateFrom=2025-01-01&publicationDateTo=2025-12-31"

# Combined filters
curl "http://localhost:8080/api/v1/documents/search?countryCode=DEU&contractNature=SERVICES&noticeType=CONTRACT_NOTICE"
```

#### GET /api/v1/documents/semantic-search

Natural language semantic search:

```bash
# Search for medical equipment tenders
curl "http://localhost:8080/api/v1/documents/semantic-search?query=medical+equipment+hospital+supplies"

# Search with similarity threshold
curl "http://localhost:8080/api/v1/documents/semantic-search?query=construction+works+road+infrastructure&threshold=0.75"
```

#### POST /api/v1/documents/search

Complex search with JSON body:

```bash
curl -X POST "http://localhost:8080/api/v1/documents/search" \
  -H "Content-Type: application/json" \
  -d '{
    "countryCodes": ["DEU", "AUT", "CHE"],
    "contractNature": "SERVICES",
    "cpvPrefix": "72",
    "semanticQuery": "software development IT services",
    "similarityThreshold": 0.7,
    "page": 0,
    "size": 20
  }'
```

### Document Retrieval

```bash
# Get by UUID
curl "http://localhost:8080/api/v1/documents/{uuid}"

# Get by publication ID
curl "http://localhost:8080/api/v1/documents/publication/00786665-2025"
```

### Metadata Endpoints

```bash
# List countries
curl "http://localhost:8080/api/v1/documents/metadata/countries"

# Get statistics
curl "http://localhost:8080/api/v1/documents/statistics"

# Upcoming deadlines
curl "http://localhost:8080/api/v1/documents/upcoming-deadlines?limit=50"
```

### Admin Endpoints

```bash
# Health check
curl "http://localhost:8080/api/v1/admin/health"

# Vectorization status
curl "http://localhost:8080/api/v1/admin/vectorization/status"

# Trigger vectorization for pending documents
curl -X POST "http://localhost:8080/api/v1/admin/vectorization/process-pending?batchSize=100"
```

## Configuration

### Application Properties

| Property | Default | Description |
|----------|---------|-------------|
| `ted.input.directory` | - | Input directory for XML files |
| `ted.input.pattern` | `**/*.xml` | File pattern (Ant-style) |
| `ted.input.poll-interval` | 5000 | Polling interval in ms |
| `ted.schema.enabled` | true | Enable XSD validation |
| `ted.vectorization.enabled` | true | Enable async vectorization |
| `ted.vectorization.model-name` | `intfloat/multilingual-e5-large` | Embedding model |
| `ted.vectorization.dimensions` | 1024 | Vector dimensions |
| `ted.search.default-page-size` | 20 | Default results per page |
| `ted.search.similarity-threshold` | 0.7 | Default similarity threshold |

### Environment Variables

| Variable | Description |
|----------|-------------|
| `DB_USERNAME` | PostgreSQL username |
| `DB_PASSWORD` | PostgreSQL password |
| `TED_INPUT_DIR` | Override input directory |

## Data Model

### Notice Types
- `CONTRACT_NOTICE` - Standard contract notices
- `PRIOR_INFORMATION_NOTICE` - Prior information notices
- `CONTRACT_AWARD_NOTICE` - Contract award notices
- `MODIFICATION_NOTICE` - Contract modifications
- `OTHER` - Other notice types

### Contract Nature
- `SUPPLIES` - Goods procurement
- `SERVICES` - Service procurement
- `WORKS` - Construction works
- `MIXED` - Mixed contracts
- `UNKNOWN` - Not specified

### Procedure Types
- `OPEN` - Open procedure
- `RESTRICTED` - Restricted procedure
- `COMPETITIVE_DIALOGUE` - Competitive dialogue
- `INNOVATION_PARTNERSHIP` - Innovation partnership
- `NEGOTIATED_WITHOUT_PUBLICATION` - Negotiated without prior publication
- `NEGOTIATED_WITH_PUBLICATION` - Negotiated with prior publication
- `OTHER` - Other procedures

## Semantic Search

**See [VECTORIZATION.md](VECTORIZATION.md) for detailed documentation on the vectorization pipeline.**

The application uses the `intfloat/multilingual-e5-large` model for generating document embeddings:

- **Dimensions**: 1024
- **Languages**: Supports 100+ languages
- **Normalization**: Embeddings are L2 normalized for cosine similarity

### Query Prefixes

For optimal results with e5 models:
- Documents use the `passage: ` prefix
- Queries use the `query: ` prefix

This is handled automatically by the vectorization service.

## Development

### Running Tests

```bash
mvn test
```

### Building Docker Image

```bash
docker build -t ted-procurement-processor .
```

### OpenAPI Documentation

Access Swagger UI at: `http://localhost:8080/api/swagger-ui.html`

## Performance Considerations

### Indexes

The schema includes optimized indexes for:
- Hash lookup (idempotent processing)
- Publication/notice ID lookups
- Date range queries
- Geographic searches (country, NUTS codes)
- CPV code classification
- Vector similarity search (IVFFlat)
- Full-text trigram search

### Batch Processing

- Configure `ted.input.max-messages-per-poll` for batch sizes
- Vectorization processes documents in batches of 16 by default
- Use the admin API to trigger bulk vectorization

## Troubleshooting

### Common Issues

**Files not being processed:**
- Check the directory path in the configuration
- Verify file permissions
- Check the Camel route status in the logs

**Duplicate detection not working:**
- Ensure the `document_hash` column has a unique constraint
- Check whether the XML content is exactly the same

**Vectorization failing:**
- Verify that the embedding service is running
- Check the Python dependencies
- Ensure sufficient memory for the model

**Slow searches:**
- Ensure the pgvector IVFFlat index is created
- Check whether the `content_vector` column is populated
- Consider adjusting the `lists` parameter of the index

## License

Licensed under the European Union Public Licence (EUPL) v1.2

Copyright (c) 2025 PROCON DATA Gesellschaft m.b.H.

You may use, copy, modify and distribute this work under the terms of the EUPL.
See the [LICENSE](LICENSE) file for details or visit: https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12

## Acknowledgments

- [eForms SDK](https://github.com/OP-TED/eForms-SDK) - EU Publications Office
- [pgvector](https://github.com/pgvector/pgvector) - Vector similarity search for PostgreSQL
- [sentence-transformers](https://www.sbert.net/) - Text embeddings
- [Apache Camel](https://camel.apache.org/) - Integration framework
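
## Example: Idempotent Import Check

The idempotent processing described under Production-Grade Features relies on the SHA-256 value stored in `document_hash` (`VARCHAR(64)`). The sketch below illustrates the idea only; it is not the project's implementation, which is Java/Spring Boot. The names `document_hash()` and `ImportRegistry` are hypothetical, and an in-memory set stands in for the unique constraint on the database column.

```python
import hashlib


def document_hash(xml_bytes: bytes) -> str:
    """Return the SHA-256 hex digest (64 characters) of the raw XML bytes."""
    return hashlib.sha256(xml_bytes).hexdigest()


class ImportRegistry:
    """Hypothetical stand-in for a unique constraint on document_hash."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def register(self, xml_bytes: bytes) -> bool:
        """Return True if the document is new, False if it is a duplicate."""
        digest = document_hash(xml_bytes)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


if __name__ == "__main__":
    registry = ImportRegistry()
    # Made-up notice content, used only for illustration.
    notice = b"<ContractNotice><ID>00786665-2025</ID></ContractNotice>"
    print(registry.register(notice))          # first import
    print(registry.register(notice))          # byte-identical re-import
    print(registry.register(notice + b" "))   # one extra byte
```

Because the hash covers the exact bytes, a whitespace-only difference yields a different digest. This is why the troubleshooting section asks whether the XML content is exactly the same.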
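
## Example: E5 Prefixes and Cosine Similarity

The `query: ` / `passage: ` prefixes and the L2 normalization described under Semantic Search can be illustrated without loading the model. This is a sketch under stated assumptions: the helper names are hypothetical, the vectors are toy 3-dimensional values rather than real 1024-dimensional embeddings, and in the application the prefixing is done by the vectorization service.

```python
import math


def as_passage(text: str) -> str:
    """Prefix a document text the way e5 models expect for indexed content."""
    return "passage: " + text


def as_query(text: str) -> str:
    """Prefix a search string the way e5 models expect for queries."""
    return "query: " + text


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0.0 else [x / norm for x in vec]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Normalize both vectors, then take their dot product."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


if __name__ == "__main__":
    print(as_query("medical equipment for hospitals in Germany"))
    # Toy vectors standing in for model output (assumption, not real data).
    query_vec = [3.0, 4.0, 0.0]
    candidates = {"same direction": [6.0, 8.0, 0.0], "orthogonal": [0.0, 0.0, 1.0]}
    threshold = 0.7  # documented default of ted.search.similarity-threshold
    for name, vec in candidates.items():
        score = cosine_similarity(query_vec, vec)
        print(name, round(score, 3), score >= threshold)
```

Once embeddings are stored L2-normalized, cosine similarity reduces to a dot product, so the threshold comparison is cheap at query time.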