# Document Intelligence Platform

**Generic document ingestion, normalization and semantic search platform with TED support**

A production-ready Spring Boot application that demonstrates AI semantic search over EU eForms public procurement notices from TED (Tenders Electronic Daily), covering both document processing and search.

**Author:** Martin.Schweitzer@procon.co.at and claude.ai

> Phase 0 foundation is in place: the codebase now exposes the broader platform namespace `at.procon.dip` while the existing TED runtime under `at.procon.ted` remains operational during migration.

---

## 🎯 Demonstrator Highlights

This application demonstrates how the following technologies combine for intelligent document processing:

### 🧠 **AI Semantic Search**
- **Natural Language Queries**: Search 100,000+ procurement documents using plain language
  - Example: *"medical equipment for hospitals in Germany"*
  - Example: *"IT infrastructure projects in Austria"*
- **Multilingual Support**: 100+ languages supported via `intfloat/multilingual-e5-large` model
- **1024-Dimensional Embeddings**: High-precision vector representations for accurate similarity matching
- **Hybrid Search**: Combine semantic search with traditional filters (country, CPV codes, dates)

### 🗄️ **PostgreSQL Native XML**
- **Native XML Data Type**: Store complete eForms XML documents without serialization overhead
- **XPath Queries**: Direct XML querying within PostgreSQL for complex data extraction
- **Dual Storage Strategy**:
  - Original XML preserved for audit trail and reprocessing
  - Extracted metadata in structured columns for fast filtering
  - Best of both worlds: flexibility + performance

### 🚀 **Production-Grade Features**
- **Fully Automated Pipeline**: Downloads and processes 30,000+ documents daily from ted.europa.eu
- **Apache Camel Integration**: Enterprise Integration Patterns (Timer, Splitter, SEDA, Dead Letter Channel)
- **Idempotent Processing**: SHA-256 hashing prevents duplicate imports
- **Async Vectorization**: Non-blocking background processing with 4 concurrent workers
- **pgvector Extension**: IVFFlat indexing for fast cosine similarity search at scale
- **eForms SDK 1.13**: Full schema validation for EU standard compliance
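
The idempotent-processing bullet above can be illustrated with a minimal sketch. It is written in Python (the language of the embedding service) and is not taken from the application's Java code. The `HashRegistry` class is hypothetical and stands in for a unique constraint on the `document_hash` column; the sketch also assumes the hash is computed over the raw XML bytes.

```python
import hashlib


def document_hash(xml_bytes: bytes) -> str:
    """Return the SHA-256 digest of the raw XML as 64 hex characters."""
    return hashlib.sha256(xml_bytes).hexdigest()


class HashRegistry:
    """Hypothetical in-memory stand-in for a unique constraint on document_hash."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def register(self, xml_bytes: bytes) -> bool:
        """Return True if the document is new, False if it is a duplicate."""
        digest = document_hash(xml_bytes)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


registry = HashRegistry()
notice = b"<ContractNotice><ID>00786665-2025</ID></ContractNotice>"  # toy payload

first = registry.register(notice)          # new document, imported
second = registry.register(notice)         # identical bytes, skipped
third = registry.register(notice + b"\n")  # one extra byte, treated as new
```

A hex-encoded SHA-256 digest is always 64 characters long, which matches the `VARCHAR(64)` type of `document_hash`.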

---

## Key Technologies

| Technology | Purpose | Benefit |
|------------|---------|---------|
| **PostgreSQL 16+** | Database with native XML | Query XML with XPath while maintaining structure |
| **pgvector** | Vector similarity search | Million-scale semantic search with cosine similarity |
| **Apache Camel** | Integration framework | Enterprise patterns for robust data pipelines |
| **Spring Boot 3.x** | Application framework | Modern Java with dependency injection |
| **intfloat/multilingual-e5-large** | Embedding model | State-of-the-art multilingual semantic understanding |
| **eForms SDK** | EU standard | Compliance with official procurement schemas |

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                     TED Procurement Processor                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐    ┌─────────────────┐    ┌───────────────────┐  │
│  │  File System │───▶│  Apache Camel   │───▶│  Document         │  │
│  │  (*.xml)     │    │  Route          │    │  Processing       │  │
│  └──────────────┘    └─────────────────┘    │  Service          │  │
│                                              └─────────┬─────────┘  │
│                                                        │            │
│                                                        ▼            │
│  ┌──────────────┐    ┌─────────────────┐    ┌───────────────────┐  │
│  │  REST API    │◀───│  Search         │◀───│  PostgreSQL       │  │
│  │  Controller  │    │  Service        │    │  + pgvector       │  │
│  └──────────────┘    └─────────────────┘    └───────────────────┘  │
│                                                        ▲            │
│                                                        │            │
│  ┌──────────────────────────────────────────────────────┐          │
│  │              Vectorization Service (Async)            │          │
│  │        intfloat/multilingual-e5-large (1024d)         │          │
│  └──────────────────────────────────────────────────────┘          │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

## Prerequisites

- Java 21+
- Maven 3.9+
- PostgreSQL 16+ with pgvector extension
- Python 3.11+ (for embedding service)
- Docker & Docker Compose (optional, for easy setup)

## 🚀 Automated Pipeline

**See [TED_AUTOMATED_PIPELINE.md](TED_AUTOMATED_PIPELINE.md) for complete documentation on the automated download, processing, and vectorization pipeline.**

The application automatically:
1. Downloads TED Daily Packages every hour from ted.europa.eu
2. Extracts and processes XML files
3. Stores the documents in PostgreSQL using the native XML type
4. Generates 1024-dimensional embeddings for semantic search
5. Enables REST API queries with natural language

## Quick Start

### 1. Start PostgreSQL with pgvector

Using Docker:
```bash
docker-compose up -d postgres
```

Alternatively, install PostgreSQL manually and enable the pgvector extension in the target database (`CREATE EXTENSION vector;`).

### 2. Configure Application

Edit `src/main/resources/application.yml`:

```yaml
ted:
  input:
    directory: D:/ted.europe/2025-11.tar/2025-11/11  # Your TED XML directory
    pattern: "**/*.xml"
```

### 3. Build and Run

```bash
# Build
mvn clean package -DskipTests

# Run
java -jar target/ted-procurement-processor-1.0.0-SNAPSHOT.jar
```

### 4. Start Embedding Service (Optional)

For semantic search capabilities:

```bash
# Using Docker
docker-compose --profile with-embedding up -d embedding-service

# Or manually
pip install -r requirements-embedding.txt
python embedding_service.py
```

## Database Schema

### Main Tables

| Table | Description |
|-------|-------------|
| `procurement_document` | Main table with extracted metadata and original XML |
| `procurement_lot` | Individual lots within procurement notices |
| `organization` | Organizations mentioned in notices (buyers, review bodies) |
| `processing_log` | Audit trail for document processing events |

### Key Columns in `procurement_document`

| Column | Type | Description |
|--------|------|-------------|
| `id` | UUID | Primary key |
| `document_hash` | VARCHAR(64) | SHA-256 hash for idempotency |
| `publication_id` | VARCHAR(50) | TED publication ID (e.g., "00786665-2025") |
| `notice_url` | VARCHAR(255) | TED website URL (e.g., "https://ted.europa.eu/en/notice/-/detail/786665-2025") |
| `xml_document` | XML | Original document |
| `text_content` | TEXT | Extracted text for vectorization |
| `content_vector` | vector(1024) | Embedding for semantic search |
| `buyer_country_code` | VARCHAR(10) | ISO 3166-1 alpha-3 country code |
| `cpv_codes` | VARCHAR(100)[] | CPV classification codes |
| `nuts_codes` | VARCHAR(20)[] | NUTS region codes |
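
The two examples in the table suggest that `notice_url` is derived from `publication_id` by dropping the leading zeros of the notice number. The following Python sketch encodes that reading. The function name and the derivation rule are assumptions inferred from the examples, not confirmed application logic.

```python
TED_NOTICE_BASE = "https://ted.europa.eu/en/notice/-/detail"


def notice_url(publication_id: str) -> str:
    """Build a TED notice URL from an ID such as '00786665-2025'.

    Assumes the '<number>-<year>' format shown in the table. A malformed ID
    raises ValueError, from the unpacking or from int().
    """
    number, year = publication_id.split("-")
    # int() drops the leading zeros: '00786665' -> 786665
    return f"{TED_NOTICE_BASE}/{int(number)}-{year}"


url = notice_url("00786665-2025")
```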

## REST API

### Search Endpoints

#### GET /api/v1/documents/search

Search with structured filters:

```bash
# Search by country
curl "http://localhost:8080/api/v1/documents/search?countryCode=POL"

# Search by CPV code prefix (medical supplies)
curl "http://localhost:8080/api/v1/documents/search?cpvPrefix=33"

# Search by date range
curl "http://localhost:8080/api/v1/documents/search?publicationDateFrom=2025-01-01&publicationDateTo=2025-12-31"

# Combined filters
curl "http://localhost:8080/api/v1/documents/search?countryCode=DEU&contractNature=SERVICES&noticeType=CONTRACT_NOTICE"
```

#### GET /api/v1/documents/semantic-search

Natural language semantic search:

```bash
# Search for medical equipment tenders
curl "http://localhost:8080/api/v1/documents/semantic-search?query=medical+equipment+hospital+supplies"

# Search with similarity threshold
curl "http://localhost:8080/api/v1/documents/semantic-search?query=construction+works+road+infrastructure&threshold=0.75"
```
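
When the query comes from user input, it has to be URL-encoded before it is placed in the query string. A minimal Python sketch using only the standard library is shown below. The helper name `semantic_search_url` is hypothetical; the endpoint and the parameter names `query` and `threshold` are taken from the curl examples above.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080/api/v1/documents"


def semantic_search_url(query: str, threshold: Optional[float] = None) -> str:
    """Return the semantic-search URL for a free-text query."""
    params: dict[str, object] = {"query": query}
    if threshold is not None:
        params["threshold"] = threshold
    # urlencode turns spaces into '+' and percent-encodes non-ASCII text as UTF-8
    return f"{BASE_URL}/semantic-search?{urlencode(params)}"


english = semantic_search_url("construction works road infrastructure", 0.75)
german = semantic_search_url("Straßenbau Wien")
```

The first call reproduces the second curl URL above; the second shows how a non-English query is encoded.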

#### POST /api/v1/documents/search

Complex search with JSON body:

```bash
curl -X POST "http://localhost:8080/api/v1/documents/search" \
  -H "Content-Type: application/json" \
  -d '{
    "countryCodes": ["DEU", "AUT", "CHE"],
    "contractNature": "SERVICES",
    "cpvPrefix": "72",
    "semanticQuery": "software development IT services",
    "similarityThreshold": 0.7,
    "page": 0,
    "size": 20
  }'
```
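
The same request can be prepared from a script. The sketch below builds it with Python's standard library and does not send it. The helper `build_search_request` is a hypothetical name, and the field names are copied from the JSON body above.

```python
import json
from urllib.request import Request


def build_search_request(base_url: str, criteria: dict) -> Request:
    """Prepare (but do not send) a POST request for the complex search endpoint."""
    body = json.dumps(criteria).encode("utf-8")
    return Request(
        f"{base_url}/api/v1/documents/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request(
    "http://localhost:8080",
    {
        "countryCodes": ["DEU", "AUT", "CHE"],
        "contractNature": "SERVICES",
        "cpvPrefix": "72",
        "semanticQuery": "software development IT services",
        "similarityThreshold": 0.7,
        "page": 0,
        "size": 20,
    },
)
```

Passing `request` to `urllib.request.urlopen` sends it to a running instance.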

### Document Retrieval

```bash
# Get by UUID
curl "http://localhost:8080/api/v1/documents/{uuid}"

# Get by publication ID
curl "http://localhost:8080/api/v1/documents/publication/00786665-2025"
```

### Metadata Endpoints

```bash
# List countries
curl "http://localhost:8080/api/v1/documents/metadata/countries"

# Get statistics
curl "http://localhost:8080/api/v1/documents/statistics"

# Upcoming deadlines
curl "http://localhost:8080/api/v1/documents/upcoming-deadlines?limit=50"
```

### Admin Endpoints

```bash
# Health check
curl "http://localhost:8080/api/v1/admin/health"

# Vectorization status
curl "http://localhost:8080/api/v1/admin/vectorization/status"

# Trigger vectorization for pending documents
curl -X POST "http://localhost:8080/api/v1/admin/vectorization/process-pending?batchSize=100"
```

## Configuration

### Application Properties

| Property | Default | Description |
|----------|---------|-------------|
| `ted.input.directory` | - | Input directory for XML files |
| `ted.input.pattern` | `**/*.xml` | File pattern (Ant-style) |
| `ted.input.poll-interval` | 5000 | Polling interval in ms |
| `ted.schema.enabled` | true | Enable XSD validation |
| `ted.vectorization.enabled` | true | Enable async vectorization |
| `ted.vectorization.model-name` | `intfloat/multilingual-e5-large` | Embedding model |
| `ted.vectorization.dimensions` | 1024 | Vector dimensions |
| `ted.search.default-page-size` | 20 | Default results per page |
| `ted.search.similarity-threshold` | 0.7 | Default similarity threshold |

### Environment Variables

| Variable | Description |
|----------|-------------|
| `DB_USERNAME` | PostgreSQL username |
| `DB_PASSWORD` | PostgreSQL password |
| `TED_INPUT_DIR` | Override input directory |

## Data Model

### Notice Types

- `CONTRACT_NOTICE` - Standard contract notices
- `PRIOR_INFORMATION_NOTICE` - Prior information notices
- `CONTRACT_AWARD_NOTICE` - Contract award notices
- `MODIFICATION_NOTICE` - Contract modifications
- `OTHER` - Other notice types

### Contract Nature

- `SUPPLIES` - Goods procurement
- `SERVICES` - Service procurement
- `WORKS` - Construction works
- `MIXED` - Mixed contracts
- `UNKNOWN` - Not specified

### Procedure Types

- `OPEN` - Open procedure
- `RESTRICTED` - Restricted procedure
- `COMPETITIVE_DIALOGUE` - Competitive dialogue
- `INNOVATION_PARTNERSHIP` - Innovation partnership
- `NEGOTIATED_WITHOUT_PUBLICATION` - Negotiated without prior publication
- `NEGOTIATED_WITH_PUBLICATION` - Negotiated with prior publication
- `OTHER` - Other procedures

## Semantic Search

**See [VECTORIZATION.md](VECTORIZATION.md) for detailed documentation on the vectorization pipeline.**

The application uses the `intfloat/multilingual-e5-large` model for generating document embeddings:

- **Dimensions**: 1024
- **Languages**: Supports 100+ languages
- **Normalization**: Embeddings are L2-normalized, so the dot product of two embeddings equals their cosine similarity
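
The effect of L2 normalization can be checked numerically: once two vectors have unit length, their plain dot product equals their cosine similarity. The sketch below uses two-dimensional toy vectors instead of real 1024-dimensional embeddings, and all function names are illustrative.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

similarity = cosine_similarity(a, b)                   # 24 / 25
unit_dot = dot(l2_normalize(a), l2_normalize(b))       # same value via unit vectors
```

pgvector's `<=>` operator returns cosine distance, which is `1 - cosine similarity`. If the search is built on that operator, a similarity threshold of 0.7 corresponds to a distance of at most 0.3.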

### Query Prefixes

For optimal results with e5 models:
- Documents use `passage: ` prefix
- Queries use `query: ` prefix

This is handled automatically by the vectorization service.
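
A minimal sketch of the prefixing step is shown below. It is an illustration, not the vectorization service's actual code, and the helper names `as_passage` and `as_query` are hypothetical.

```python
PASSAGE_PREFIX = "passage: "
QUERY_PREFIX = "query: "


def as_passage(text: str) -> str:
    """Prefix document text before it is embedded for storage."""
    return PASSAGE_PREFIX + text.strip()


def as_query(text: str) -> str:
    """Prefix a user's search text before it is embedded for lookup."""
    return QUERY_PREFIX + text.strip()


stored = as_passage("Supply of medical equipment for regional hospitals")
lookup = as_query("  medical equipment for hospitals in Germany ")
```

With sentence-transformers, the prefixed strings would typically be passed to `SentenceTransformer.encode(..., normalize_embeddings=True)` to obtain L2-normalized vectors.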

## Development

### Running Tests

```bash
mvn test
```

### Building Docker Image

```bash
docker build -t ted-procurement-processor .
```

### OpenAPI Documentation

Access Swagger UI at: `http://localhost:8080/api/swagger-ui.html`

## Performance Considerations

### Indexes

The schema includes optimized indexes for:
- Hash lookup (idempotent processing)
- Publication/notice ID lookups
- Date range queries
- Geographic searches (country, NUTS codes)
- CPV code classification
- Vector similarity search (IVFFlat)
- Full-text trigram search

### Batch Processing

- Configure `ted.input.max-messages-per-poll` for batch sizes
- Vectorization processes documents in batches of 16 by default
- Use the admin API to trigger bulk vectorization
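
Batching in groups of 16 can be sketched as follows. The helper `batches` and the document identifiers are illustrative; only the default batch size comes from the text above.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], size: int = 16) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


pending = [f"doc-{i}" for i in range(40)]       # 40 documents awaiting vectorization
sizes = [len(batch) for batch in batches(pending)]
```

For 40 pending documents this yields two full batches and one partial batch of 8.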

## Troubleshooting

### Common Issues

**Files not being processed:**
- Check directory path in configuration
- Verify file permissions
- Check Camel route status in logs

**Duplicate detection not working:**
- Ensure the `document_hash` column has a unique constraint
- Check whether the XML content is exactly identical; any difference in the hashed content yields a different SHA-256 hash

**Vectorization failing:**
- Verify embedding service is running
- Check Python dependencies
- Ensure sufficient memory for model

**Slow searches:**
- Ensure pgvector IVFFlat index is created
- Check if `content_vector` column is populated
- Consider adjusting `lists` parameter in index
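
As a starting point for tuning, the pgvector documentation suggests `lists = rows / 1000` for up to 1M rows and `sqrt(rows)` above that, with `probes = sqrt(lists)` at query time. The sketch below encodes that rule of thumb. The function names are illustrative, and the results are starting values to benchmark, not settings used by this project.

```python
import math


def suggested_lists(rows: int) -> int:
    """Starting value for the IVFFlat `lists` parameter."""
    if rows <= 1_000_000:
        return max(1, rows // 1000)
    return math.isqrt(rows)


def suggested_probes(lists: int) -> int:
    """Starting value for `ivfflat.probes` at query time."""
    return max(1, math.isqrt(lists))


lists_100k = suggested_lists(100_000)       # 100,000 documents
probes_100k = suggested_probes(lists_100k)
lists_4m = suggested_lists(4_000_000)       # 4,000,000 documents
```

More probes improve recall and cost query speed, so both values are worth checking against real queries.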

## License

Licensed under the European Union Public Licence (EUPL) v1.2

Copyright (c) 2025 PROCON DATA Gesellschaft m.b.H.

You may use, copy, modify and distribute this work under the terms of the EUPL.
See the [LICENSE](LICENSE) file for details or visit: https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12

## Acknowledgments

- [eForms SDK](https://github.com/OP-TED/eForms-SDK) - EU Publications Office
- [pgvector](https://github.com/pgvector/pgvector) - Vector similarity search for PostgreSQL
- [sentence-transformers](https://www.sbert.net/) - Text embeddings
- [Apache Camel](https://camel.apache.org/) - Integration framework