# TED Automatisierte Download & Verarbeitungs-Pipeline ## Übersicht Die komplette automatisierte Pipeline für TED (Tenders Electronic Daily) Ausschreibungen: ``` ┌────────────────────────────────────────────────────────────────────────┐ │ TED Automatisierte Pipeline │ ├────────────────────────────────────────────────────────────────────────┤ │ │ │ ┌─────────────────┐ │ │ │ Timer (1h) │ Alle 1 Stunde neue Packages prüfen │ │ └────────┬────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────┐ │ │ │ HTTP Download │ https://ted.europa.eu/packages/daily/ │ │ │ Package │ Format: YYYY-MM-DD_XXXX.tar.gz │ │ └────────┬────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────┐ │ │ │ Extract │ tar.gz → Tausende von XML Files │ │ │ tar.gz │ Extract to: D:/ted.europe/extracted │ │ └────────┬────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────┐ │ │ │ XML Splitter │ Parallel Processing (Streaming) │ │ │ (Parallel) │ Each XML → direct:process-document │ │ └────────┬────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────┐ │ │ │ XML Parser │ XPath Parsing + Metadata Extraction │ │ │ & Validator │ Schema Validation (eForms SDK 1.13) │ │ └────────┬────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────┐ │ │ │ SHA-256 Hash │ Idempotent Processing │ │ │ Check │ Skip if already imported │ │ └────────┬────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────┐ │ │ │ Save to DB │ PostgreSQL (ted.procurement_document) │ │ │ (PostgreSQL) │ + Native XML + Metadata │ │ └────────┬────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────┐ │ │ │ wireTap │ Non-blocking Trigger │ │ │ Vectorization │ direct:vectorize (async) │ │ └────────┬────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────┐ │ │ │ SEDA Queue │ 4 Concurrent Workers │ │ │ (Async) │ vectorize-async queue │ │ └────────┬────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────┐ │ │ │ Extract Text │ Title + Description + Lots │ │ │ Content │ Buyer Info + CPV Codes │ │ └────────┬────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────┐ │ │ │ POST to │ http://localhost:8001/embed │ │ │ Embedding API │ {"text": "...", "is_query": false} │ │ └────────┬────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────┐ │ │ │ Python Service │ intfloat/multilingual-e5-large │ │ │ (FastAPI) │ Returns: 1024-dimensional vector │ │ └────────┬────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────┐ │ │ │ Save Vector │ content_vector column (pgvector) │ │ │ to Database │ Status: COMPLETED │ │ └─────────────────┘ │ │ │ └────────────────────────────────────────────────────────────────────────┘ ``` ## Konfiguration **application.yml:** ```yaml ted: # Input directory (points to extract directory) input: directory: D:/ted.europe/extracted pattern: "**/*.xml" # Recursive scanning poll-interval: 5000 # Check every 5 seconds max-messages-per-poll: 100 # Process up to 100 XMLs per poll # Automatic download from ted.europa.eu download: enabled: true # ✅ ENABLED base-url: https://ted.europa.eu/packages/daily/ download-directory: D:/ted.europe/downloads extract-directory: D:/ted.europe/extracted start-year: 2024 # Start downloading from 2024 poll-interval: 3600000 # Check every 1 hour max-consecutive-404: 4 # Stop after 4 consecutive 404s delete-after-extraction: true # Clean up tar.gz files # Vectorization (automatic after save) vectorization: enabled: true # ✅ ENABLED api-url: http://localhost:8001 model-name: intfloat/multilingual-e5-large dimensions: 1024 batch-size: 16 max-text-length: 8192 ``` ## Camel Routes ### 1. **TedPackageDownloadCamelRoute** (Download & Extract) **Route ID:** `ted-package-scheduler` **Trigger:** Timer alle 1 Stunde **Ablauf:** 1. Bestimmt nächstes Package (Jahr + Serial Number) 2. Prüft ob bereits vorhanden (Idempotent Consumer) 3. HTTP GET von `https://ted.europa.eu/packages/daily/YYYY-MM-DD_XXXX.tar.gz` 4. Speichert in `download-directory` 5. Extrahiert nach `extract-directory` 6. Löscht tar.gz (optional) 7. Splittiert XML Files → `direct:process-document` **Enterprise Integration Patterns:** - ✅ Timer Pattern - ✅ Idempotent Consumer - ✅ Content-Based Router - ✅ Splitter Pattern (Parallel + Streaming) - ✅ Dead Letter Channel ### 2. **TedDocumentRoute** (XML Processing) **Route ID:** `ted-document-processor` **Trigger:** - File Watcher auf `D:/ted.europe/extracted` - Direct Call von Download Route **Ablauf:** 1. Liest XML File 2. Parst mit XPath (eForms UBL Schema) 3. Extrahiert Metadata 4. Berechnet SHA-256 Hash 5. Prüft Duplikat in DB 6. Speichert in `ted.procurement_document` 7. **wireTap** → `direct:vectorize` (non-blocking!) ### 3. **VectorizationRoute** (Async Embedding) **Route ID:** `vectorization-processor` **Trigger:** - wireTap von TedDocumentRoute - Timer Scheduler (alle 60s für PENDING) **Ablauf:** 1. Load document from DB 2. Extract text_content (Document + Lots) 3. POST to Python Embedding Service 4. Parse 1024-dimensional vector 5. Save to `content_vector` column 6. Update status → `COMPLETED` **Queue:** SEDA with 4 concurrent workers ## Verzeichnisstruktur ``` D:/ted.europe/ ├── downloads/ # Temporäre tar.gz Downloads │ └── 2025-11-30_0001.tar.gz │ └── 2025-11-30_0002.tar.gz │ ├── extracted/ # Extrahierte XML Files │ ├── 2025-11-30/ │ │ ├── 001/ │ │ │ ├── 00123456_2025.xml │ │ │ └── 00123457_2025.xml │ │ └── 002/ │ │ └── ... │ └── .processed/ # Erfolgreich verarbeitete XMLs │ └── .error/ # Fehlgeschlagene XMLs ``` ## Datenbank-Tracking ### ted_daily_package (Download-Tracking) | Spalte | Typ | Beschreibung | |--------|-----|--------------| | `id` | UUID | Primary Key | | `year` | INT | Package Jahr (2024, 2025) | | `serial_number` | INT | Package Nummer (1, 2, 3...) | | `package_id` | VARCHAR | Format: `2025-11-30_0001` | | `download_url` | VARCHAR | Full URL | | `download_status` | VARCHAR | PENDING, DOWNLOADING, COMPLETED, NOT_FOUND, FAILED | | `downloaded_at` | TIMESTAMP | Download-Zeitpunkt | | `file_size_bytes` | BIGINT | Größe der tar.gz | | `xml_file_count` | INT | Anzahl extrahierter XMLs | | `processed_count` | INT | Anzahl verarbeiteter XMLs | ### procurement_document (XML-Daten) | Spalte | Typ | Beschreibung | |--------|-----|--------------| | `id` | UUID | Primary Key | | `document_hash` | VARCHAR(64) | SHA-256 für Idempotenz | | `publication_id` | VARCHAR(50) | TED ID (00123456-2025) | | `notice_url` | VARCHAR(255) | Auto-generated TED URL | | `xml_document` | XML | Native PostgreSQL XML | | `text_content` | TEXT | Für Vektorisierung | | `content_vector` | vector(1024) | pgvector Embedding | | `vectorization_status` | VARCHAR | PENDING, PROCESSING, COMPLETED, FAILED | ## Monitoring ### Camel Routes Status ```bash curl http://localhost:8888/api/actuator/camel/routes ``` **Wichtige Routes:** - `ted-package-scheduler` - Download Timer - `ted-document-processor` - XML Processing - `vectorization-processor` - Embedding Generation - `vectorization-scheduler` - PENDING Documents ### Download Status ```sql SELECT year, COUNT(*) FILTER (WHERE download_status = 'COMPLETED') as completed, COUNT(*) FILTER (WHERE download_status = 'NOT_FOUND') as not_found, COUNT(*) FILTER (WHERE download_status = 'FAILED') as failed, SUM(xml_file_count) as total_xmls, SUM(processed_count) as processed_xmls FROM ted.ted_daily_package GROUP BY year ORDER BY year DESC; ``` ### Vectorization Status ```sql SELECT COUNT(*) FILTER (WHERE vectorization_status = 'COMPLETED') as completed, COUNT(*) FILTER (WHERE vectorization_status = 'PENDING') as pending, COUNT(*) FILTER (WHERE vectorization_status = 'FAILED') as failed, COUNT(*) FILTER (WHERE content_vector IS NOT NULL) as has_vector FROM ted.procurement_document; ``` ### Heute verarbeitete Dokumente ```sql SELECT COUNT(*) as today_count, MIN(created_at) as first, MAX(created_at) as last FROM ted.procurement_document WHERE created_at::date = CURRENT_DATE; ``` ## Python Embedding Service **Start:** ```bash python embedding_service.py ``` **Health Check:** ```bash curl http://localhost:8001/health ``` **Expected Response:** ```json { "status": "healthy", "model_name": "intfloat/multilingual-e5-large", "dimensions": 1024, "max_length": 512 } ``` ## Start der Pipeline 1. **Python Embedding Service starten:** ```bash python embedding_service.py ``` 2. **Spring Boot Anwendung starten:** ```bash mvn spring-boot:run ``` 3. **Logs beobachten:** ``` INFO: Checking for new TED packages... INFO: Next package to download: 2025-11-30_0001 INFO: Downloading from https://ted.europa.eu/packages/daily/... INFO: Extracting package 2025-11-30_0001... INFO: Processing 1247 XML files from package 2025-11-30_0001 INFO: Document processed successfully: 00123456_2025.xml DEBUG: Queueing document for vectorization: xxx INFO: Successfully vectorized document: xxx ``` ## Durchsatz **Geschätzte Performance:** | Phase | Geschwindigkeit | Bemerkung | |-------|----------------|-----------| | **Download** | 1 Package/Stunde | Timer-basiert | | **Extract** | ~10 Sekunden | tar.gz → XMLs | | **XML Processing** | ~100-200 XMLs/min | Abhängig von CPU | | **Vectorization** | ~60-90 Docs/min | 4 Workers, Python Service | **Täglich:** - ~24 Packages heruntergeladen - ~30.000-50.000 Dokumente verarbeitet (je nach Package-Größe) - ~30.000-50.000 Vektoren generiert ## Fehlerbehandlung ### Download Fehler **404 Not Found:** Package existiert (noch) nicht - Max 4 consecutive 404s → Switch zu Vorjahr - Automatische Wiederholung nach 1 Stunde **Network Error:** Temporäre Verbindungsprobleme - 3 Retries mit 10s Delay - Dead Letter Channel ### Processing Fehler **Duplikate:** SHA-256 Hash bereits vorhanden - Wird übersprungen (Idempotent Processing) - Log: "Duplicate document skipped" **XML Parsing Error:** Ungültiges XML - 3 Retries - Move to `.error` directory - Status: FAILED in DB ### Vectorization Fehler **Embedding Service nicht erreichbar:** - 2 Retries mit 2s Delay - Status: FAILED - Scheduler versucht erneut nach 60s **Invalid Embedding Dimension:** - Status: FAILED mit Error-Message - Manuelles Eingreifen erforderlich ## Troubleshooting ### Pipeline läuft nicht ```bash # Prüfe Camel Routes curl http://localhost:8888/api/actuator/camel/routes | jq '.routes[] | {id: .id, status: .status}' # Prüfe Download Route tail -f logs/ted-procurement-processor.log | grep "ted-package" # Prüfe Vectorization Route tail -f logs/ted-procurement-processor.log | grep "vectoriz" ``` ### Keine Downloads 1. Prüfe `ted.download.enabled = true` 2. Prüfe Internet-Verbindung 3. Prüfe ted.europa.eu erreichbar 4. Prüfe Logs für 404/403 Errors ### Keine Vektorisierung 1. Prüfe Embedding Service: `curl http://localhost:8001/health` 2. Prüfe `ted.vectorization.enabled = true` 3. Prüfe PENDING Dokumente in DB 4. Prüfe Logs für HTTP 400/500 Errors ## Semantic Search Nach erfolgreicher Vektorisierung sind Dokumente durchsuchbar: ```bash # Semantic Search curl "http://localhost:8888/api/v1/documents/semantic-search?query=medical+equipment" # Combined Search (Semantic + Filters) curl -X POST "http://localhost:8888/api/v1/documents/search" \ -H "Content-Type: application/json" \ -d '{ "countryCodes": ["DEU", "AUT"], "semanticQuery": "software development", "similarityThreshold": 0.7 }' ``` ## Performance-Optimierung ### Vectorization beschleunigen ```yaml ted: vectorization: thread-pool-size: 8 # Mehr Workers (Standard: 4) ``` **Achtung:** Mehr Workers = mehr Last auf Python Service! ### XML Processing beschleunigen ```yaml ted: input: max-messages-per-poll: 200 # Mehr Files pro Poll (Standard: 100) ``` ### Download parallelisieren ```yaml ted: download: max-concurrent-downloads: 4 # Mehr parallele Downloads (Standard: 2) ``` **Achtung:** ted.europa.eu Rate Limiting beachten! ## Zusammenfassung ✅ **Komplett automatisierte Pipeline** von Download bis Semantic Search ✅ **Idempotent Processing** - Keine Duplikate ✅ **Asynchrone Vektorisierung** - Non-blocking ✅ **Enterprise Integration Patterns** - Production-ready ✅ **Fehlerbehandlung** - Retries & Dead Letter Channel ✅ **Monitoring** - Actuator + SQL Queries ✅ **Skalierbar** - Concurrent Workers & Parallel Processing Die Pipeline läuft vollautomatisch 24/7 und verarbeitet alle neuen TED-Ausschreibungen! 🚀