# Document Intelligence Platform

**Generic document ingestion, normalization and semantic search platform with TED support**

A production-ready Spring Boot application showcasing advanced AI semantic search capabilities for processing and searching EU eForms public procurement notices from TED (Tenders Electronic Daily).

**Author:** Martin.Schweitzer@procon.co.at and claude.ai

> Phase 0 foundation is in place: the codebase now exposes the broader platform namespace `at.procon.dip` while the existing TED runtime under `at.procon.ted` remains operational during migration.

---

## 🎯 Demonstrator Highlights

This application demonstrates the integration of cutting-edge technologies for intelligent document processing:

### 🧠 **AI Semantic Search**
- **Natural Language Queries**: Search 100,000+ procurement documents using plain language
  - Example: *"medical equipment for hospitals in Germany"*
  - Example: *"IT infrastructure projects in Austria"*
- **Multilingual Support**: 100+ languages supported via `intfloat/multilingual-e5-large` model
- **1024-Dimensional Embeddings**: High-precision vector representations for accurate similarity matching
- **Hybrid Search**: Combine semantic search with traditional filters (country, CPV codes, dates)

### 🗄️ **PostgreSQL Native XML**
- **Native XML Data Type**: Store complete eForms XML documents without serialization overhead
- **XPath Queries**: Direct XML querying within PostgreSQL for complex data extraction
- **Dual Storage Strategy**:
  - Original XML preserved for audit trail and reprocessing
  - Extracted metadata in structured columns for fast filtering
  - Best of both worlds: flexibility + performance

### 🚀 **Production-Grade Features**
- **Fully Automated Pipeline**: Downloads and processes 30,000+ documents daily from ted.europa.eu
- **Apache Camel Integration**: Enterprise Integration Patterns (Timer, Splitter, SEDA, Dead Letter Channel)
- **Idempotent Processing**: SHA-256 hashing prevents duplicate imports
- **Async Vectorization**: Non-blocking background processing with 4 concurrent workers
- **pgvector Extension**: IVFFlat indexing for fast cosine similarity search at scale
- **eForms SDK 1.13**: Full schema validation for EU standard compliance

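
The idempotent-processing idea above can be illustrated with a few lines of Python. This is only a sketch of the technique, not the application's Java code: `document_hash` and `InMemoryDeduplicator` are hypothetical names, and the in-memory set stands in for a unique constraint on the `document_hash` column.

```python
import hashlib


def document_hash(xml_bytes: bytes) -> str:
    """Return the SHA-256 hex digest of the raw XML bytes (always 64 characters)."""
    return hashlib.sha256(xml_bytes).hexdigest()


class InMemoryDeduplicator:
    """Illustrative stand-in for a unique constraint on a hash column."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def should_import(self, xml_bytes: bytes) -> bool:
        """Return True the first time a byte-identical document is seen."""
        digest = document_hash(xml_bytes)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


if __name__ == "__main__":
    dedup = InMemoryDeduplicator()
    print(dedup.should_import(b"<notice/>"))   # first sighting
    print(dedup.should_import(b"<notice/>"))   # byte-identical repeat
    print(dedup.should_import(b"<notice />"))  # differs by one byte
```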

---

## Key Technologies

| Technology | Purpose | Benefit |
|------------|---------|---------|
| **PostgreSQL 16+** | Database with native XML | Query XML with XPath while maintaining structure |
| **pgvector** | Vector similarity search | Million-scale semantic search with cosine similarity |
| **Apache Camel** | Integration framework | Enterprise patterns for robust data pipelines |
| **Spring Boot 3.x** | Application framework | Modern Java with dependency injection |
| **intfloat/multilingual-e5-large** | Embedding model | State-of-the-art multilingual semantic understanding |
| **eForms SDK** | EU standard | Compliance with official procurement schemas |

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                      TED Procurement Processor                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐    ┌─────────────────┐    ┌───────────────────┐   │
│  │ File System  │───▶│  Apache Camel   │───▶│     Document      │   │
│  │   (*.xml)    │    │     Route       │    │    Processing     │   │
│  └──────────────┘    └─────────────────┘    │      Service      │   │
│                                             └─────────┬─────────┘   │
│                                                       │             │
│                                                       ▼             │
│  ┌──────────────┐    ┌─────────────────┐    ┌───────────────────┐   │
│  │   REST API   │◀───│     Search      │◀───│    PostgreSQL     │   │
│  │  Controller  │    │     Service     │    │    + pgvector     │   │
│  └──────────────┘    └─────────────────┘    └───────────────────┘   │
│                                                       ▲             │
│                                                       │             │
│          ┌──────────────────────────────────────────────────────┐   │
│          │            Vectorization Service (Async)             │   │
│          │        intfloat/multilingual-e5-large (1024d)        │   │
│          └──────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

## Prerequisites

- Java 21+
- Maven 3.9+
- PostgreSQL 16+ with pgvector extension
- Python 3.11+ (for embedding service)
- Docker & Docker Compose (optional, for easy setup)

## 🚀 Automated Pipeline

**See [TED_AUTOMATED_PIPELINE.md](TED_AUTOMATED_PIPELINE.md) for complete documentation on the automated download, processing, and vectorization pipeline.**

The application automatically:
1. Downloads TED Daily Packages every hour from ted.europa.eu
2. Extracts and processes XML files
3. Stores in PostgreSQL with native XML support
4. Generates 1024-dimensional embeddings for semantic search
5. Enables REST API queries with natural language

## Quick Start

### 1. Start PostgreSQL with pgvector

Using Docker:
```bash
docker-compose up -d postgres
```

Or manually install PostgreSQL with pgvector extension.

### 2. Configure Application

Edit `src/main/resources/application.yml`:

```yaml
ted:
  input:
    directory: D:/ted.europe/2025-11.tar/2025-11/11  # Your TED XML directory
    pattern: "**/*.xml"
```

### 3. Build and Run

```bash
# Build
mvn clean package -DskipTests

# Run
java -jar target/ted-procurement-processor-1.0.0-SNAPSHOT.jar
```

### 4. Start Embedding Service (Optional)

For semantic search capabilities:

```bash
# Using Docker
docker-compose --profile with-embedding up -d embedding-service

# Or manually
pip install -r requirements-embedding.txt
python embedding_service.py
```

## Database Schema

### Main Tables

| Table | Description |
|-------|-------------|
| `procurement_document` | Main table with extracted metadata and original XML |
| `procurement_lot` | Individual lots within procurement notices |
| `organization` | Organizations mentioned in notices (buyers, review bodies) |
| `processing_log` | Audit trail for document processing events |

### Key Columns in `procurement_document`

| Column | Type | Description |
|--------|------|-------------|
| `id` | UUID | Primary key |
| `document_hash` | VARCHAR(64) | SHA-256 hash for idempotency |
| `publication_id` | VARCHAR(50) | TED publication ID (e.g., "00786665-2025") |
| `notice_url` | VARCHAR(255) | TED website URL (e.g., "https://ted.europa.eu/en/notice/-/detail/786665-2025") |
| `xml_document` | XML | Original document |
| `text_content` | TEXT | Extracted text for vectorization |
| `content_vector` | vector(1024) | Embedding for semantic search |
| `buyer_country_code` | VARCHAR(10) | ISO 3166-1 alpha-3 country code |
| `cpv_codes` | VARCHAR(100)[] | CPV classification codes |
| `nuts_codes` | VARCHAR(20)[] | NUTS region codes |

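
To make the columns above more concrete, here is a hedged sketch of how a hybrid query (structured filter plus vector similarity) could be phrased against them. It only builds SQL text and does not connect to a database. The helper names are hypothetical, the `%(name)s` placeholders assume a psycopg-style driver, and the application's actual queries may look different. `<=>` is pgvector's cosine distance operator, so `1 - distance` gives cosine similarity.

```python
from typing import Iterable


def to_vector_literal(values: Iterable[float]) -> str:
    """Format floats in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def build_hybrid_query() -> str:
    """Return a parametrized SQL string combining a country filter with similarity ranking.

    Table and column names come from the schema tables above; the query shape
    itself is an assumption made for illustration.
    """
    return (
        "SELECT id, publication_id, "
        "1 - (content_vector <=> %(query_vector)s::vector) AS similarity "
        "FROM procurement_document "
        "WHERE buyer_country_code = %(country)s "
        "AND content_vector IS NOT NULL "
        "AND 1 - (content_vector <=> %(query_vector)s::vector) >= %(threshold)s "
        "ORDER BY content_vector <=> %(query_vector)s::vector "
        "LIMIT %(limit)s"
    )


if __name__ == "__main__":
    params = {
        "query_vector": to_vector_literal([0.5, 1, -2]),  # a real vector would have 1024 entries
        "country": "DEU",
        "threshold": 0.7,
        "limit": 20,
    }
    print(build_hybrid_query())
    print(params)
```

Ordering by the raw distance expression, rather than by the computed `similarity` alias, is the form pgvector indexes are designed to accelerate.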
## REST API

### Search Endpoints

#### GET /api/v1/documents/search

Search with structured filters:

```bash
# Search by country
curl "http://localhost:8080/api/v1/documents/search?countryCode=POL"

# Search by CPV code prefix (medical supplies)
curl "http://localhost:8080/api/v1/documents/search?cpvPrefix=33"

# Search by date range
curl "http://localhost:8080/api/v1/documents/search?publicationDateFrom=2025-01-01&publicationDateTo=2025-12-31"

# Combined filters
curl "http://localhost:8080/api/v1/documents/search?countryCode=DEU&contractNature=SERVICES&noticeType=CONTRACT_NOTICE"
```

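
When calling the endpoint from code rather than from the shell, building the query string with a URL encoder avoids hand-escaping mistakes. The following is a minimal sketch using only the Python standard library; `build_search_url` is a hypothetical helper, and the parameter names are the ones used in the curl examples above.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080/api/v1/documents/search"


def build_search_url(**filters: str) -> str:
    """Build a GET search URL, skipping filters that are None."""
    params = {key: value for key, value in filters.items() if value is not None}
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url(
        countryCode="DEU",
        contractNature="SERVICES",
        noticeType="CONTRACT_NOTICE",
    ))
```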
#### GET /api/v1/documents/semantic-search

Natural language semantic search:

```bash
# Search for medical equipment tenders
curl "http://localhost:8080/api/v1/documents/semantic-search?query=medical+equipment+hospital+supplies"

# Search with similarity threshold
curl "http://localhost:8080/api/v1/documents/semantic-search?query=construction+works+road+infrastructure&threshold=0.75"
```

#### POST /api/v1/documents/search

Complex search with JSON body:

```bash
curl -X POST "http://localhost:8080/api/v1/documents/search" \
  -H "Content-Type: application/json" \
  -d '{
    "countryCodes": ["DEU", "AUT", "CHE"],
    "contractNature": "SERVICES",
    "cpvPrefix": "72",
    "semanticQuery": "software development IT services",
    "similarityThreshold": 0.7,
    "page": 0,
    "size": 20
  }'
```

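
The same POST request can be prepared from Python with the standard library. This sketch only constructs the request object and does not send it; `build_search_request` is a hypothetical helper, and the body fields are copied from the curl example above.

```python
import json
from urllib.request import Request


def build_search_request(body: dict, base_url: str = "http://localhost:8080") -> Request:
    """Prepare (but do not send) a JSON POST request for the search endpoint."""
    return Request(
        f"{base_url}/api/v1/documents/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_search_request({
        "countryCodes": ["DEU", "AUT", "CHE"],
        "contractNature": "SERVICES",
        "cpvPrefix": "72",
        "semanticQuery": "software development IT services",
        "similarityThreshold": 0.7,
        "page": 0,
        "size": 20,
    })
    print(request.get_method(), request.full_url)
```

Passing the prepared object to `urllib.request.urlopen` would send it, assuming the application is running locally.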
### Document Retrieval

```bash
# Get by UUID
curl "http://localhost:8080/api/v1/documents/{uuid}"

# Get by publication ID
curl "http://localhost:8080/api/v1/documents/publication/00786665-2025"
```

### Metadata Endpoints

```bash
# List countries
curl "http://localhost:8080/api/v1/documents/metadata/countries"

# Get statistics
curl "http://localhost:8080/api/v1/documents/statistics"

# Upcoming deadlines
curl "http://localhost:8080/api/v1/documents/upcoming-deadlines?limit=50"
```

### Admin Endpoints

```bash
# Health check
curl "http://localhost:8080/api/v1/admin/health"

# Vectorization status
curl "http://localhost:8080/api/v1/admin/vectorization/status"

# Trigger vectorization for pending documents
curl -X POST "http://localhost:8080/api/v1/admin/vectorization/process-pending?batchSize=100"
```

## Configuration

### Application Properties

| Property | Default | Description |
|----------|---------|-------------|
| `ted.input.directory` | - | Input directory for XML files |
| `ted.input.pattern` | `**/*.xml` | File pattern (Ant-style) |
| `ted.input.poll-interval` | 5000 | Polling interval in ms |
| `ted.schema.enabled` | true | Enable XSD validation |
| `ted.vectorization.enabled` | true | Enable async vectorization |
| `ted.vectorization.model-name` | `intfloat/multilingual-e5-large` | Embedding model |
| `ted.vectorization.dimensions` | 1024 | Vector dimensions |
| `ted.search.default-page-size` | 20 | Default results per page |
| `ted.search.similarity-threshold` | 0.7 | Default similarity threshold |

### Environment Variables

| Variable | Description |
|----------|-------------|
| `DB_USERNAME` | PostgreSQL username |
| `DB_PASSWORD` | PostgreSQL password |
| `TED_INPUT_DIR` | Override input directory |

## Data Model

### Notice Types

- `CONTRACT_NOTICE` - Standard contract notices
- `PRIOR_INFORMATION_NOTICE` - Prior information notices
- `CONTRACT_AWARD_NOTICE` - Contract award notices
- `MODIFICATION_NOTICE` - Contract modifications
- `OTHER` - Other notice types

### Contract Nature

- `SUPPLIES` - Goods procurement
- `SERVICES` - Service procurement
- `WORKS` - Construction works
- `MIXED` - Mixed contracts
- `UNKNOWN` - Not specified

### Procedure Types

- `OPEN` - Open procedure
- `RESTRICTED` - Restricted procedure
- `COMPETITIVE_DIALOGUE` - Competitive dialogue
- `INNOVATION_PARTNERSHIP` - Innovation partnership
- `NEGOTIATED_WITHOUT_PUBLICATION` - Negotiated without prior publication
- `NEGOTIATED_WITH_PUBLICATION` - Negotiated with prior publication
- `OTHER` - Other procedures

## Semantic Search

**See [VECTORIZATION.md](VECTORIZATION.md) for detailed documentation on the vectorization pipeline.**

The application uses the `intfloat/multilingual-e5-large` model for generating document embeddings:

- **Dimensions**: 1024
- **Languages**: Supports 100+ languages
- **Normalization**: Embeddings are L2 normalized for cosine similarity

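
Why normalization matters: once vectors have unit length, their dot product equals their cosine similarity, so the two measures rank results identically. The short, dependency-free sketch below demonstrates this on toy 2-dimensional vectors; the function names are illustrative and not part of the application.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: Sequence[float]) -> list[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


if __name__ == "__main__":
    a, b = [3.0, 4.0], [4.0, 3.0]
    print(cosine_similarity(a, b))
    print(dot(l2_normalize(a), l2_normalize(b)))  # same value up to rounding
```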
### Query Prefixes

For optimal results with e5 models:
- Documents use `passage: ` prefix
- Queries use `query: ` prefix

This is handled automatically by the vectorization service.

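
For readers who call an e5 model directly, outside the vectorization service, the prefixing step amounts to simple string concatenation. The helpers below are a hypothetical sketch of that step and are not taken from `embedding_service.py`.

```python
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def as_query(text: str) -> str:
    """Prefix a search query the way e5 models expect."""
    return QUERY_PREFIX + text.strip()


def as_passage(text: str) -> str:
    """Prefix a document text (a passage) the way e5 models expect."""
    return PASSAGE_PREFIX + text.strip()


if __name__ == "__main__":
    print(as_query("medical equipment for hospitals in Germany"))
    print(as_passage("Supply of MRI scanners to the university clinic."))
```

With the sentence-transformers library, the prefixed strings would typically be passed to `SentenceTransformer.encode(..., normalize_embeddings=True)`; whether the embedding service does exactly this is an assumption.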
## Development

### Running Tests

```bash
mvn test
```

### Building Docker Image

```bash
docker build -t ted-procurement-processor .
```

### OpenAPI Documentation

Access Swagger UI at: `http://localhost:8080/api/swagger-ui.html`

## Performance Considerations

### Indexes

The schema includes optimized indexes for:
- Hash lookup (idempotent processing)
- Publication/notice ID lookups
- Date range queries
- Geographic searches (country, NUTS codes)
- CPV code classification
- Vector similarity search (IVFFlat)
- Full-text trigram search

### Batch Processing

- Configure `ted.input.max-messages-per-poll` for batch sizes
- Vectorization processes documents in batches of 16 by default
- Use the admin API to trigger bulk vectorization

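
Batching itself is a simple chunking step. The sketch below shows the idea with the default of 16 mentioned above; `batched` is a hypothetical helper and not the application's implementation.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int = 16) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    pending = [f"doc-{i}" for i in range(40)]
    for batch in batched(pending):
        print(len(batch))  # each batch would go to the embedding service in one call
```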
## Troubleshooting

### Common Issues

**Files not being processed:**
- Check directory path in configuration
- Verify file permissions
- Check Camel route status in logs

**Duplicate detection not working:**
- Ensure `document_hash` column has unique constraint
- Check if XML content is exactly the same

**Vectorization failing:**
- Verify embedding service is running
- Check Python dependencies
- Ensure sufficient memory for model

**Slow searches:**
- Ensure pgvector IVFFlat index is created
- Check if `content_vector` column is populated
- Consider adjusting `lists` parameter in index

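
As a rough aid for the last point, the sketch below turns a row count into candidate IVFFlat settings. The rules of thumb (about `rows / 1000` lists up to one million rows, about `sqrt(rows)` beyond that, and roughly `sqrt(lists)` probes at query time) follow the starting points suggested in the pgvector documentation as I understand them, so treat them as assumptions to verify and benchmark. The index name is hypothetical.

```python
import math


def suggested_lists(row_count: int) -> int:
    """Starting point for the IVFFlat `lists` parameter."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return math.isqrt(row_count)


def suggested_probes(lists: int) -> int:
    """Starting point for `ivfflat.probes` at query time."""
    return max(1, math.isqrt(lists))


def create_index_sql(lists: int, index_name: str = "idx_procurement_document_vector") -> str:
    """Build a CREATE INDEX statement for cosine distance on content_vector."""
    return (
        f"CREATE INDEX {index_name} ON procurement_document "
        f"USING ivfflat (content_vector vector_cosine_ops) WITH (lists = {lists});"
    )


if __name__ == "__main__":
    lists = suggested_lists(100_000)
    print(create_index_sql(lists))
    print(f"SET ivfflat.probes = {suggested_probes(lists)};")
```

More probes generally improve recall at the cost of query speed, so the values are worth tuning against real queries.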
## License

Licensed under the European Union Public Licence (EUPL) v1.2

Copyright (c) 2025 PROCON DATA Gesellschaft m.b.H.

You may use, copy, modify and distribute this work under the terms of the EUPL.
See the [LICENSE](LICENSE) file for details or visit: https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12

## Acknowledgments

- [eForms SDK](https://github.com/OP-TED/eForms-SDK) - EU Publications Office
- [pgvector](https://github.com/pgvector/pgvector) - Vector similarity search for PostgreSQL
- [sentence-transformers](https://www.sbert.net/) - Text embeddings
- [Apache Camel](https://camel.apache.org/) - Integration framework