# TED Package Download - Camel-Native Implementation

## Overview

A fully Camel-based implementation of the automatic TED Daily Package download, built on Apache Camel Enterprise Integration Patterns (EIP).

## Architecture

### Enterprise Integration Patterns Used

1. **Timer Pattern** - Periodic trigger for downloads
2. **Content-Based Router** - Branching based on HTTP status
3. **Splitter Pattern** - Parallel processing of XML files
4. **Dead Letter Channel** - Error handling with retry logic
5. **Message Filter** - Filtering based on package status
6. **Pipes and Filters** - Sequential processing

### Route Components

#### 1. **Timer-Scheduler** (`ted-package-scheduler`)

```
timer → determineNextPackage → choice → download-package
```

- Runs every X milliseconds (configurable, default: 1 hour)
- Determines the next package (current year is prioritized)
- Stops automatically after 4 consecutive 404 errors
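
The stop condition in the last bullet can be sketched in plain Java. This is a minimal illustration, not the route's actual bean: the class name `Consecutive404Tracker` and its methods are hypothetical, and the sketch assumes that any non-404 response resets the streak, which the description above does not state explicitly.

```java
// Hypothetical helper (not part of the documented routes): counts consecutive
// 404 responses and reports when the scheduler should stop probing.
final class Consecutive404Tracker {
    private final int maxConsecutive404;
    private int consecutive404 = 0;

    Consecutive404Tracker(int maxConsecutive404) {
        if (maxConsecutive404 < 1) {
            throw new IllegalArgumentException("maxConsecutive404 must be >= 1");
        }
        this.maxConsecutive404 = maxConsecutive404;
    }

    /** Records one HTTP status. Assumption: any non-404 response resets the streak. */
    void record(int httpStatus) {
        if (httpStatus == 404) {
            consecutive404++;
        } else {
            consecutive404 = 0;
        }
    }

    /** True once the configured number of 404 responses occurred in a row. */
    boolean shouldStop() {
        return consecutive404 >= maxConsecutive404;
    }

    public static void main(String[] args) {
        // 4 mirrors the documented default of max-consecutive-404.
        Consecutive404Tracker tracker = new Consecutive404Tracker(4);
        int[] statuses = {200, 404, 404, 200, 404, 404, 404, 404};
        for (int status : statuses) {
            tracker.record(status);
            System.out.println(status + " -> stop=" + tracker.shouldStop());
        }
    }
}
```

In a route, such a counter would probably live in the bean behind `determineNextPackage`, with the limit taken from `ted.download.max-consecutive-404`.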

#### 2. **HTTP-Downloader** (`ted-package-http-downloader`)

```
direct:download-package → createPackageRecord → delay → HTTP GET → choice
    ├─ 200 OK → process-downloaded-package
    ├─ 404 → markPackageNotFound
    └─ other → markPackageFailed
```

- Native HTTP component for downloads
- Rate limiting via `delay()`
- Content-based routing by HTTP status

#### 3. **Package-Processor** (`ted-package-processor`)

```
process-downloaded-package → calculateHash → checkDuplicate → choice
    ├─ duplicate → markPackageDuplicate
    └─ new → saveDownloadedPackage → extract-tar-gz
```

- SHA-256 hash calculation
- Duplicate detection via hash
- Storage on the filesystem
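
The hashing and duplicate check can be sketched with the JDK alone. This is a simplified illustration under stated assumptions: `PackageHashing` and `InMemoryDuplicateIndex` are hypothetical names, and the in-memory map stands in for the database lookup that the real `checkDuplicate` step presumably performs.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: computes the lowercase hex SHA-256 digest of a package.
final class PackageHashing {
    private PackageHashing() {}

    static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(content);
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}

// Hypothetical stand-in for the database: remembers which package first used a hash.
final class InMemoryDuplicateIndex {
    private final Map<String, String> packageIdByHash = new HashMap<>();

    /**
     * Returns the ID of the earlier package with the same hash, if any.
     * If the hash is new, the package is registered and the result is empty.
     */
    Optional<String> registerOrFindDuplicate(String packageId, String fileHash) {
        return Optional.ofNullable(packageIdByHash.putIfAbsent(fileHash, packageId));
    }
}
```

A non-empty result corresponds to the `duplicate` branch; its value is what the `duplicateOf` header would carry.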

#### 4. **TAR.GZ-Extractor** (`ted-package-extractor`)

```
extract-tar-gz → extractTarGz → deleteTarGz (optional) → split-xml-files
```

- Apache Commons Compress for TAR.GZ
- Extraction of all XML files
- Optional cleanup

#### 5. **XML-Splitter** (`ted-package-xml-splitter`)

```
split-xml-files → split(xmlFiles) → prepareXmlForProcessing → direct:process-document
```

- Parallel processing (`.parallelProcessing()`)
- Streaming (`.streaming()`)
- Integration with the existing XML route

## Camel Components

### Camel Components Used

- **timer** - Periodic trigger
- **http** - HTTP GET requests
- **direct** - Synchronous route connections
- **bean** - Processor invocations
- **file** - Filesystem operations (indirectly via processor)

### Dependencies (pom.xml)

```xml
<dependency>
    <groupId>org.apache.camel</groupId>
    <artifactId>camel-http</artifactId>
    <version>${camel.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.camel</groupId>
    <artifactId>camel-bean</artifactId>
    <version>${camel.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.camel</groupId>
    <artifactId>camel-jackson</artifactId>
    <version>${camel.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-compress</artifactId>
    <version>1.27.1</version>
</dependency>
```

## Workflow Diagram

```
┌─────────────────────┐
│  Timer (1h)         │
│  Scheduler          │
└──────┬──────────────┘
       │
       ▼
┌─────────────────────┐
│  Determine Next     │
│  Package (Bean)     │
└──────┬──────────────┘
       │
       ▼
┌─────────────────────┐
│  HTTP GET           │
│  https://ted...     │
└──────┬──────────────┘
       │
       ▼
    ┌──┴───┐
    │Choice│
    └──┬───┘
       │
    ┌──┴─────┬─────────┬─────────┐
    │        │         │         │
   200      404      Other     Error
    │        │         │         │
    ▼        ▼         ▼         ▼
 Process  NotFound  Failed  Dead Letter
    │
    ▼
┌─────────────────────┐
│  Calculate Hash     │
│  (SHA-256)          │
└──────┬──────────────┘
       │
       ▼
┌─────────────────────┐
│  Check Duplicate    │
│  (DB Query)         │
└──────┬──────────────┘
       │
    ┌──┴───┐
    │Choice│
    └──┬───┘
       │
    ┌──┴─────┬─────────┐
    │        │         │
   New   Duplicate     │
    │        │         │
    ▼        ▼         │
 Extract  Complete     │
    │                  │
    ▼                  │
┌─────────────────────┤
│  Extract TAR.GZ     │
│  (Apache Commons)   │
└──────┬──────────────┘
       │
       ▼
┌─────────────────────┐
│  Split XML Files    │
│  (Parallel)         │
└──────┬──────────────┘
       │
       ▼
┌─────────────────────┐
│  Process Document   │
│  (existing route)   │
└─────────────────────┘
```

## Message Headers

### Download Route Headers

| Header | Type | Description |
|--------|------|-------------|
| `packageId` | String | `YYYYSSSSS` format |
| `year` | Integer | Year of the package |
| `serialNumber` | Integer | Serial number |
| `downloadUrl` | String | Full download URL |
| `downloadStartTime` | Long | Start timestamp |
| `CamelHttpResponseCode` | Integer | HTTP status |
| `fileHash` | String | SHA-256 hash |
| `isDuplicate` | Boolean | Duplicate flag |
| `duplicateOf` | String | Original package ID |
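
The `YYYYSSSSS` format of `packageId` combines the four-digit `year` with a zero-padded five-digit `serialNumber`. The sketch below shows one way to build and parse such an ID. The class `PackageId` and its method names are hypothetical and only illustrate the format from the table.

```java
import java.util.Locale;

// Hypothetical helper illustrating the YYYYSSSSS package ID format.
final class PackageId {
    private PackageId() {}

    /** Builds a 9-digit ID: 4-digit year followed by a zero-padded 5-digit serial number. */
    static String format(int year, int serialNumber) {
        if (year < 1000 || year > 9999) {
            throw new IllegalArgumentException("year must have four digits: " + year);
        }
        if (serialNumber < 0 || serialNumber > 99999) {
            throw new IllegalArgumentException("serialNumber must fit into five digits: " + serialNumber);
        }
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "%04d%05d", year, serialNumber);
    }

    static int yearOf(String packageId) {
        requireValid(packageId);
        return Integer.parseInt(packageId.substring(0, 4));
    }

    static int serialNumberOf(String packageId) {
        requireValid(packageId);
        return Integer.parseInt(packageId.substring(4));
    }

    private static void requireValid(String packageId) {
        if (packageId == null || !packageId.matches("\\d{9}")) {
            throw new IllegalArgumentException("expected 9 digits (YYYYSSSSS): " + packageId);
        }
    }

    public static void main(String[] args) {
        String id = format(2024, 1);
        // prints "202400001 -> year=2024, serial=1"
        System.out.println(id + " -> year=" + yearOf(id) + ", serial=" + serialNumberOf(id));
    }
}
```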

### Extraction Headers

| Header | Type | Description |
|--------|------|-------------|
| `downloadPath` | String | Path to the tar.gz file |
| `xmlFiles` | List<Path> | List of XML files |
| `xmlFileCount` | Integer | Number of XML files |
| `deleteAfterExtraction` | Boolean | Cleanup flag |

## Configuration

### application.yml

```yaml
ted:
  download:
    enabled: true                   # Enables the Camel-native route
    base-url: https://ted.europa.eu/packages/daily/
    download-directory: D:/ted.europe/downloads
    extract-directory: D:/ted.europe/extracted
    start-year: 2024
    max-consecutive-404: 4
    poll-interval: 3600000          # 1 hour
    download-timeout: 300000        # 5 minutes
    delay-between-downloads: 5000   # 5 seconds
    delete-after-extraction: true
    prioritize-current-year: true

    # Optional: service-based route (old implementation)
    use-service-based: false        # Disabled
```

## Error Handling

### Dead Letter Channel

```java
// Inside RouteBuilder.configure(): failed exchanges are redelivered,
// then handed to the error route once all redeliveries are exhausted.
errorHandler(deadLetterChannel("direct:package-download-error")
    .maximumRedeliveries(3)
    .redeliveryDelay(10000)
    .retryAttemptedLogLevel(LoggingLevel.WARN));
```

**Retry strategy:**

- Maximum redeliveries: 3
- Delay: 10 seconds
- On final failure: Dead Letter Channel
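
To make the retry semantics concrete, here is a plain-Java sketch of the same behaviour. It does not use the Camel API: `RedeliverySketch` and `callWithRedelivery` are hypothetical names, and the `deadLetter` callback stands in for the `direct:package-download-error` route. As in Camel, the redelivery count comes on top of the first attempt, so 3 redeliveries mean up to 4 attempts in total.

```java
import java.util.concurrent.Callable;
import java.util.function.Consumer;

// Hypothetical illustration of "retry N times with a fixed delay, then dead-letter".
final class RedeliverySketch {
    private RedeliverySketch() {}

    /**
     * Runs the task once and, on failure, up to maximumRedeliveries more times.
     * Returns the task result, or null after handing the last error to deadLetter.
     */
    static <T> T callWithRedelivery(Callable<T> task,
                                    int maximumRedeliveries,
                                    long redeliveryDelayMillis,
                                    Consumer<Exception> deadLetter) {
        if (maximumRedeliveries < 0) {
            throw new IllegalArgumentException("maximumRedeliveries must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= maximumRedeliveries; attempt++) {
            try {
                if (attempt > 0) {
                    Thread.sleep(redeliveryDelayMillis); // fixed delay before each redelivery
                }
                return task.call();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and give up
                last = e;
                break;
            } catch (Exception e) {
                last = e; // remember the failure and try again if attempts remain
            }
        }
        deadLetter.accept(last);
        return null;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = callWithRedelivery(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated failure");
            }
            return "ok";
        }, 3, 10L, e -> System.err.println("dead letter: " + e));
        // prints "ok after 3 attempts"
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```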

### Error Cases

1. **HTTP errors**:
   - 404 → status: NOT_FOUND (no retry)
   - 5xx → retry 3x
   - Other → status: FAILED

2. **Processing errors**:
   - Hash calculation failed → retry
   - Extraction failed → retry
   - XML processing failed → package status remains PROCESSING
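
The HTTP rules above amount to a small decision function. The sketch below restates them in plain Java. `DownloadOutcome` and `HttpStatusClassifier` are hypothetical names, and treating exactly 200 as success follows the `200 OK` branch of the downloader route.

```java
// Hypothetical outcome categories derived from the HTTP error rules.
enum DownloadOutcome { PROCESS, NOT_FOUND, RETRY, FAILED }

// Hypothetical classifier: maps an HTTP status code to the documented handling.
final class HttpStatusClassifier {
    private HttpStatusClassifier() {}

    static DownloadOutcome classify(int httpStatus) {
        if (httpStatus == 200) {
            return DownloadOutcome.PROCESS;   // continue with process-downloaded-package
        }
        if (httpStatus == 404) {
            return DownloadOutcome.NOT_FOUND; // no retry
        }
        if (httpStatus >= 500 && httpStatus <= 599) {
            return DownloadOutcome.RETRY;     // redelivered up to 3 times
        }
        return DownloadOutcome.FAILED;        // everything else
    }

    public static void main(String[] args) {
        for (int status : new int[] {200, 404, 503, 403}) {
            System.out.println(status + " -> " + classify(status));
        }
    }
}
```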

## Monitoring & Logging

### Log Levels

```yaml
logging:
  level:
    at.procon.ted.camel: DEBUG
    org.apache.camel: INFO
```

### Log Messages

- `INFO`: Package start, completion, status changes
- `DEBUG`: HTTP responses, hash calculations, extractions
- `WARN`: Duplicates, HTTP errors, retries
- `ERROR`: Dead Letter Channel, critical errors

## Performance Optimization

### Parallel Processing

```java
.split(header("xmlFiles"))
    .parallelProcessing()  // process the split messages in parallel
    .streaming()           // stream large lists instead of loading them up front
```

### Rate Limiting

```java
.delay(simple("{{ted.download.delay-between-downloads:5000}}"))
```

A configurable delay between downloads prevents overloading the server.

## Database Integration

### Package Tracking

All status changes are stored in `TED.ted_daily_package`:

```sql
SELECT
    package_identifier,
    download_status,
    xml_file_count,
    processed_count,
    downloaded_at
FROM TED.ted_daily_package
ORDER BY year DESC, serial_number DESC;
```

### Status Workflow

```
PENDING → DOWNLOADING → DOWNLOADED → PROCESSING → COMPLETED
               ↓                          ↓
           NOT_FOUND                    FAILED
```

## Testing

### Manual Test

```bash
# 1. Create directories
mkdir -p D:/ted.europe/downloads
mkdir -p D:/ted.europe/extracted

# 2. Database migration
psql -h 94.130.218.54 -p 5432 -U postgres -d Sales \
  -f src/main/resources/db/migration/V2__add_ted_daily_package_table.sql

# 3. Start the application
mvn spring-boot:run

# 4. Watch the logs
tail -f logs/spring.log | grep "ted-package"
```

### Successful Download (Logs)

```
INFO - Checking for new TED packages...
INFO - Next package to download: 202400001
INFO - Downloaded package 202400001
INFO - Extracting package 202400001...
INFO - Extracted 1234 XML files from package 202400001
INFO - Processing 1234 XML files from package 202400001
INFO - Completed processing package 202400001
```

## Advantages of the Camel-Native Implementation

1. ✅ **Enterprise Integration Patterns** - Proven patterns
2. ✅ **Declarative Configuration** - Route definition in Java
3. ✅ **Native HTTP Component** - Optimized and tested
4. ✅ **Monitoring** - Camel JMX management
5. ✅ **Error Handling** - Dead Letter Channel, retry
6. ✅ **Parallel Processing** - Split/Aggregate pattern
7. ✅ **Message Transformation** - Header/body manipulation
8. ✅ **Content-Based Routing** - Dynamic branching

## Differences from the Service-Based Route

| Feature | Camel-Native | Service-based |
|---------|-------------|---------------|
| HTTP download | Camel HTTP Component | Java HttpURLConnection |
| Retry | Camel Error Handler | Manual |
| Routing | Content-Based Router | if/else |
| Parallelization | Camel Splitter | Java Executor |
| Monitoring | Camel JMX | Custom |
| Configuration | `ted.download.enabled` | `ted.download.use-service-based` |

## Next Steps

1. ✅ Run the database migration
2. ✅ Create the directories
3. ✅ Set `ted.download.enabled=true`
4. ✅ Start the application
5. ⏳ Watch the logs
6. ⏳ Check the DB status

The system is ready for production! 🚀