You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
468 lines
18 KiB
Markdown
468 lines
18 KiB
Markdown
# TED Automatisierte Download & Verarbeitungs-Pipeline
|
|
|
|
## Übersicht
|
|
|
|
Die komplette automatisierte Pipeline für TED (Tenders Electronic Daily) Ausschreibungen:
|
|
|
|
```
|
|
┌────────────────────────────────────────────────────────────────────────┐
|
|
│ TED Automatisierte Pipeline │
|
|
├────────────────────────────────────────────────────────────────────────┤
|
|
│ │
|
|
│ ┌─────────────────┐ │
|
|
│ │ Timer (1h) │ Alle 1 Stunde neue Packages prüfen │
|
|
│ └────────┬────────┘ │
|
|
│ │ │
|
|
│ ▼ │
|
|
│ ┌─────────────────┐ │
|
|
│ │ HTTP Download │ https://ted.europa.eu/packages/daily/ │
|
|
│ │ Package │ Format: YYYY-MM-DD_XXXX.tar.gz │
|
|
│ └────────┬────────┘ │
|
|
│ │ │
|
|
│ ▼ │
|
|
│ ┌─────────────────┐ │
|
|
│ │ Extract │ tar.gz → Tausende von XML Files │
|
|
│ │ tar.gz │ Extract to: D:/ted.europe/extracted │
|
|
│ └────────┬────────┘ │
|
|
│ │ │
|
|
│ ▼ │
|
|
│ ┌─────────────────┐ │
|
|
│ │ XML Splitter │ Parallel Processing (Streaming) │
|
|
│ │ (Parallel) │ Each XML → direct:process-document │
|
|
│ └────────┬────────┘ │
|
|
│ │ │
|
|
│ ▼ │
|
|
│ ┌─────────────────┐ │
|
|
│ │ XML Parser │ XPath Parsing + Metadata Extraction │
|
|
│ │ & Validator │ Schema Validation (eForms SDK 1.13) │
|
|
│ └────────┬────────┘ │
|
|
│ │ │
|
|
│ ▼ │
|
|
│ ┌─────────────────┐ │
|
|
│ │ SHA-256 Hash │ Idempotent Processing │
|
|
│ │ Check │ Skip if already imported │
|
|
│ └────────┬────────┘ │
|
|
│ │ │
|
|
│ ▼ │
|
|
│ ┌─────────────────┐ │
|
|
│ │ Save to DB │ PostgreSQL (ted.procurement_document) │
|
|
│ │ (PostgreSQL) │ + Native XML + Metadata │
|
|
│ └────────┬────────┘ │
|
|
│ │ │
|
|
│ ▼ │
|
|
│ ┌─────────────────┐ │
|
|
│ │ wireTap │ Non-blocking Trigger │
|
|
│ │ Vectorization │ direct:vectorize (async) │
|
|
│ └────────┬────────┘ │
|
|
│ │ │
|
|
│ ▼ │
|
|
│ ┌─────────────────┐ │
|
|
│ │ SEDA Queue │ 4 Concurrent Workers │
|
|
│ │ (Async) │ vectorize-async queue │
|
|
│ └────────┬────────┘ │
|
|
│ │ │
|
|
│ ▼ │
|
|
│ ┌─────────────────┐ │
|
|
│ │ Extract Text │ Title + Description + Lots │
|
|
│ │ Content │ Buyer Info + CPV Codes │
|
|
│ └────────┬────────┘ │
|
|
│ │ │
|
|
│ ▼ │
|
|
│ ┌─────────────────┐ │
|
|
│ │ POST to │ http://localhost:8001/embed │
|
|
│ │ Embedding API │ {"text": "...", "is_query": false} │
|
|
│ └────────┬────────┘ │
|
|
│ │ │
|
|
│ ▼ │
|
|
│ ┌─────────────────┐ │
|
|
│ │ Python Service │ intfloat/multilingual-e5-large │
|
|
│ │ (FastAPI) │ Returns: 1024-dimensional vector │
|
|
│ └────────┬────────┘ │
|
|
│ │ │
|
|
│ ▼ │
|
|
│ ┌─────────────────┐ │
|
|
│ │ Save Vector │ content_vector column (pgvector) │
|
|
│ │ to Database │ Status: COMPLETED │
|
|
│ └─────────────────┘ │
|
|
│ │
|
|
└────────────────────────────────────────────────────────────────────────┘
|
|
```
|
|
|
|
## Konfiguration
|
|
|
|
**application.yml:**
|
|
|
|
```yaml
|
|
ted:
|
|
# Input directory (points to extract directory)
|
|
input:
|
|
directory: D:/ted.europe/extracted
|
|
pattern: "**/*.xml" # Recursive scanning
|
|
poll-interval: 5000 # Check every 5 seconds
|
|
max-messages-per-poll: 100 # Process up to 100 XMLs per poll
|
|
|
|
# Automatic download from ted.europa.eu
|
|
download:
|
|
enabled: true # ✅ ENABLED
|
|
base-url: https://ted.europa.eu/packages/daily/
|
|
download-directory: D:/ted.europe/downloads
|
|
extract-directory: D:/ted.europe/extracted
|
|
start-year: 2024 # Start downloading from 2024
|
|
poll-interval: 3600000 # Check every 1 hour
|
|
max-consecutive-404: 4 # Stop after 4 consecutive 404s
|
|
delete-after-extraction: true # Clean up tar.gz files
|
|
|
|
# Vectorization (automatic after save)
|
|
vectorization:
|
|
enabled: true # ✅ ENABLED
|
|
api-url: http://localhost:8001
|
|
model-name: intfloat/multilingual-e5-large
|
|
dimensions: 1024
|
|
batch-size: 16
|
|
max-text-length: 8192
|
|
```
|
|
|
|
## Camel Routes
|
|
|
|
### 1. **TedPackageDownloadCamelRoute** (Download & Extract)
|
|
|
|
**Route ID:** `ted-package-scheduler`
|
|
|
|
**Trigger:** Timer alle 1 Stunde
|
|
|
|
**Ablauf:**
|
|
1. Bestimmt nächstes Package (Jahr + Serial Number)
|
|
2. Prüft ob bereits vorhanden (Idempotent Consumer)
|
|
3. HTTP GET von `https://ted.europa.eu/packages/daily/YYYY-MM-DD_XXXX.tar.gz`
|
|
4. Speichert in `download-directory`
|
|
5. Extrahiert nach `extract-directory`
|
|
6. Löscht tar.gz (optional)
|
|
7. Splittiert XML Files → `direct:process-document`
|
|
|
|
**Enterprise Integration Patterns:**
|
|
- ✅ Timer Pattern
|
|
- ✅ Idempotent Consumer
|
|
- ✅ Content-Based Router
|
|
- ✅ Splitter Pattern (Parallel + Streaming)
|
|
- ✅ Dead Letter Channel
|
|
|
|
### 2. **TedDocumentRoute** (XML Processing)
|
|
|
|
**Route ID:** `ted-document-processor`
|
|
|
|
**Trigger:**
|
|
- File Watcher auf `D:/ted.europe/extracted`
|
|
- Direct Call von Download Route
|
|
|
|
**Ablauf:**
|
|
1. Liest XML File
|
|
2. Parst mit XPath (eForms UBL Schema)
|
|
3. Extrahiert Metadata
|
|
4. Berechnet SHA-256 Hash
|
|
5. Prüft Duplikat in DB
|
|
6. Speichert in `ted.procurement_document`
|
|
7. **wireTap** → `direct:vectorize` (non-blocking!)
|
|
|
|
### 3. **VectorizationRoute** (Async Embedding)
|
|
|
|
**Route ID:** `vectorization-processor`
|
|
|
|
**Trigger:**
|
|
- wireTap von TedDocumentRoute
|
|
- Timer Scheduler (alle 60s für PENDING)
|
|
|
|
**Ablauf:**
|
|
1. Load document from DB
|
|
2. Extract text_content (Document + Lots)
|
|
3. POST to Python Embedding Service
|
|
4. Parse 1024-dimensional vector
|
|
5. Save to `content_vector` column
|
|
6. Update status → `COMPLETED`
|
|
|
|
**Queue:** SEDA with 4 concurrent workers
|
|
|
|
## Verzeichnisstruktur
|
|
|
|
```
|
|
D:/ted.europe/
|
|
├── downloads/ # Temporäre tar.gz Downloads
|
|
│ └── 2025-11-30_0001.tar.gz
|
|
│ └── 2025-11-30_0002.tar.gz
|
|
│
|
|
├── extracted/ # Extrahierte XML Files
|
|
│ ├── 2025-11-30/
|
|
│ │ ├── 001/
|
|
│ │ │ ├── 00123456_2025.xml
|
|
│ │ │ └── 00123457_2025.xml
|
|
│ │ └── 002/
|
|
│ │ └── ...
|
|
│ └── .processed/ # Erfolgreich verarbeitete XMLs
|
|
│ └── .error/ # Fehlgeschlagene XMLs
|
|
```
|
|
|
|
## Datenbank-Tracking
|
|
|
|
### ted_daily_package (Download-Tracking)
|
|
|
|
| Spalte | Typ | Beschreibung |
|
|
|--------|-----|--------------|
|
|
| `id` | UUID | Primary Key |
|
|
| `year` | INT | Package Jahr (2024, 2025) |
|
|
| `serial_number` | INT | Package Nummer (1, 2, 3...) |
|
|
| `package_id` | VARCHAR | Format: `2025-11-30_0001` |
|
|
| `download_url` | VARCHAR | Full URL |
|
|
| `download_status` | VARCHAR | PENDING, DOWNLOADING, COMPLETED, NOT_FOUND, FAILED |
|
|
| `downloaded_at` | TIMESTAMP | Download-Zeitpunkt |
|
|
| `file_size_bytes` | BIGINT | Größe der tar.gz |
|
|
| `xml_file_count` | INT | Anzahl extrahierter XMLs |
|
|
| `processed_count` | INT | Anzahl verarbeiteter XMLs |
|
|
|
|
### procurement_document (XML-Daten)
|
|
|
|
| Spalte | Typ | Beschreibung |
|
|
|--------|-----|--------------|
|
|
| `id` | UUID | Primary Key |
|
|
| `document_hash` | VARCHAR(64) | SHA-256 für Idempotenz |
|
|
| `publication_id` | VARCHAR(50) | TED ID (00123456-2025) |
|
|
| `notice_url` | VARCHAR(255) | Auto-generated TED URL |
|
|
| `xml_document` | XML | Native PostgreSQL XML |
|
|
| `text_content` | TEXT | Für Vektorisierung |
|
|
| `content_vector` | vector(1024) | pgvector Embedding |
|
|
| `vectorization_status` | VARCHAR | PENDING, PROCESSING, COMPLETED, FAILED |
|
|
|
|
## Monitoring
|
|
|
|
### Camel Routes Status
|
|
|
|
```bash
|
|
curl http://localhost:8888/api/actuator/camel/routes
|
|
```
|
|
|
|
**Wichtige Routes:**
|
|
- `ted-package-scheduler` - Download Timer
|
|
- `ted-document-processor` - XML Processing
|
|
- `vectorization-processor` - Embedding Generation
|
|
- `vectorization-scheduler` - PENDING Documents
|
|
|
|
### Download Status
|
|
|
|
```sql
|
|
SELECT
|
|
year,
|
|
COUNT(*) FILTER (WHERE download_status = 'COMPLETED') as completed,
|
|
COUNT(*) FILTER (WHERE download_status = 'NOT_FOUND') as not_found,
|
|
COUNT(*) FILTER (WHERE download_status = 'FAILED') as failed,
|
|
SUM(xml_file_count) as total_xmls,
|
|
SUM(processed_count) as processed_xmls
|
|
FROM ted.ted_daily_package
|
|
GROUP BY year
|
|
ORDER BY year DESC;
|
|
```
|
|
|
|
### Vectorization Status
|
|
|
|
```sql
|
|
SELECT
|
|
COUNT(*) FILTER (WHERE vectorization_status = 'COMPLETED') as completed,
|
|
COUNT(*) FILTER (WHERE vectorization_status = 'PENDING') as pending,
|
|
COUNT(*) FILTER (WHERE vectorization_status = 'FAILED') as failed,
|
|
COUNT(*) FILTER (WHERE content_vector IS NOT NULL) as has_vector
|
|
FROM ted.procurement_document;
|
|
```
|
|
|
|
### Heute verarbeitete Dokumente
|
|
|
|
```sql
|
|
SELECT
|
|
COUNT(*) as today_count,
|
|
MIN(created_at) as first,
|
|
MAX(created_at) as last
|
|
FROM ted.procurement_document
|
|
WHERE created_at::date = CURRENT_DATE;
|
|
```
|
|
|
|
## Python Embedding Service
|
|
|
|
**Start:**
|
|
```bash
|
|
python embedding_service.py
|
|
```
|
|
|
|
**Health Check:**
|
|
```bash
|
|
curl http://localhost:8001/health
|
|
```
|
|
|
|
**Expected Response:**
|
|
```json
|
|
{
|
|
"status": "healthy",
|
|
"model_name": "intfloat/multilingual-e5-large",
|
|
"dimensions": 1024,
|
|
"max_length": 512
|
|
}
|
|
```
|
|
|
|
## Start der Pipeline
|
|
|
|
1. **Python Embedding Service starten:**
|
|
```bash
|
|
python embedding_service.py
|
|
```
|
|
|
|
2. **Spring Boot Anwendung starten:**
|
|
```bash
|
|
mvn spring-boot:run
|
|
```
|
|
|
|
3. **Logs beobachten:**
|
|
```
|
|
INFO: Checking for new TED packages...
|
|
INFO: Next package to download: 2025-11-30_0001
|
|
INFO: Downloading from https://ted.europa.eu/packages/daily/...
|
|
INFO: Extracting package 2025-11-30_0001...
|
|
INFO: Processing 1247 XML files from package 2025-11-30_0001
|
|
INFO: Document processed successfully: 00123456_2025.xml
|
|
DEBUG: Queueing document for vectorization: xxx
|
|
INFO: Successfully vectorized document: xxx
|
|
```
|
|
|
|
## Durchsatz
|
|
|
|
**Geschätzte Performance:**
|
|
|
|
| Phase | Geschwindigkeit | Bemerkung |
|
|
|-------|----------------|-----------|
|
|
| **Download** | 1 Package/Stunde | Timer-basiert |
|
|
| **Extract** | ~10 Sekunden | tar.gz → XMLs |
|
|
| **XML Processing** | ~100-200 XMLs/min | Abhängig von CPU |
|
|
| **Vectorization** | ~60-90 Docs/min | 4 Workers, Python Service |
|
|
|
|
**Täglich:**
|
|
- ~24 Packages heruntergeladen
|
|
- ~30.000-50.000 Dokumente verarbeitet (je nach Package-Größe)
|
|
- ~30.000-50.000 Vektoren generiert
|
|
|
|
## Fehlerbehandlung
|
|
|
|
### Download Fehler
|
|
|
|
**404 Not Found:** Package existiert (noch) nicht
|
|
- Max 4 consecutive 404s → Switch zu Vorjahr
|
|
- Automatische Wiederholung nach 1 Stunde
|
|
|
|
**Network Error:** Temporäre Verbindungsprobleme
|
|
- 3 Retries mit 10s Delay
|
|
- Dead Letter Channel
|
|
|
|
### Processing Fehler
|
|
|
|
**Duplikate:** SHA-256 Hash bereits vorhanden
|
|
- Wird übersprungen (Idempotent Processing)
|
|
- Log: "Duplicate document skipped"
|
|
|
|
**XML Parsing Error:** Ungültiges XML
|
|
- 3 Retries
|
|
- Move to `.error` directory
|
|
- Status: FAILED in DB
|
|
|
|
### Vectorization Fehler
|
|
|
|
**Embedding Service nicht erreichbar:**
|
|
- 2 Retries mit 2s Delay
|
|
- Status: FAILED
|
|
- Scheduler versucht erneut nach 60s
|
|
|
|
**Invalid Embedding Dimension:**
|
|
- Status: FAILED mit Error-Message
|
|
- Manuelles Eingreifen erforderlich
|
|
|
|
## Troubleshooting
|
|
|
|
### Pipeline läuft nicht
|
|
|
|
```bash
|
|
# Prüfe Camel Routes
|
|
curl http://localhost:8888/api/actuator/camel/routes | jq '.routes[] | {id: .id, status: .status}'
|
|
|
|
# Prüfe Download Route
|
|
tail -f logs/ted-procurement-processor.log | grep "ted-package"
|
|
|
|
# Prüfe Vectorization Route
|
|
tail -f logs/ted-procurement-processor.log | grep "vectoriz"
|
|
```
|
|
|
|
### Keine Downloads
|
|
|
|
1. Prüfe `ted.download.enabled = true`
|
|
2. Prüfe Internet-Verbindung
|
|
3. Prüfe ted.europa.eu erreichbar
|
|
4. Prüfe Logs für 404/403 Errors
|
|
|
|
### Keine Vektorisierung
|
|
|
|
1. Prüfe Embedding Service: `curl http://localhost:8001/health`
|
|
2. Prüfe `ted.vectorization.enabled = true`
|
|
3. Prüfe PENDING Dokumente in DB
|
|
4. Prüfe Logs für HTTP 400/500 Errors
|
|
|
|
## Semantic Search
|
|
|
|
Nach erfolgreicher Vektorisierung sind Dokumente durchsuchbar:
|
|
|
|
```bash
|
|
# Semantic Search
|
|
curl "http://localhost:8888/api/v1/documents/semantic-search?query=medical+equipment"
|
|
|
|
# Combined Search (Semantic + Filters)
|
|
curl -X POST "http://localhost:8888/api/v1/documents/search" \
|
|
-H "Content-Type: application/json" \
|
|
-d '{
|
|
"countryCodes": ["DEU", "AUT"],
|
|
"semanticQuery": "software development",
|
|
"similarityThreshold": 0.7
|
|
}'
|
|
```
|
|
|
|
## Performance-Optimierung
|
|
|
|
### Vectorization beschleunigen
|
|
|
|
```yaml
|
|
ted:
|
|
vectorization:
|
|
thread-pool-size: 8 # Mehr Workers (Standard: 4)
|
|
```
|
|
|
|
**Achtung:** Mehr Workers = mehr Last auf Python Service!
|
|
|
|
### XML Processing beschleunigen
|
|
|
|
```yaml
|
|
ted:
|
|
input:
|
|
max-messages-per-poll: 200 # Mehr Files pro Poll (Standard: 100)
|
|
```
|
|
|
|
### Download parallelisieren
|
|
|
|
```yaml
|
|
ted:
|
|
download:
|
|
max-concurrent-downloads: 4 # Mehr parallele Downloads (Standard: 2)
|
|
```
|
|
|
|
**Achtung:** ted.europa.eu Rate Limiting beachten!
|
|
|
|
## Zusammenfassung
|
|
|
|
✅ **Komplett automatisierte Pipeline** von Download bis Semantic Search
|
|
✅ **Idempotent Processing** - Keine Duplikate
|
|
✅ **Asynchrone Vektorisierung** - Non-blocking
|
|
✅ **Enterprise Integration Patterns** - Production-ready
|
|
✅ **Fehlerbehandlung** - Retries & Dead Letter Channel
|
|
✅ **Monitoring** - Actuator + SQL Queries
|
|
✅ **Skalierbar** - Concurrent Workers & Parallel Processing
|
|
|
|
Die Pipeline läuft vollautomatisch 24/7 und verarbeitet alle neuen TED-Ausschreibungen! 🚀
|