# TED Daily Package Download - Implementation
## Overview
The system automatically downloads TED Daily Packages and processes them.
## Components
### 1. Entity: TedDailyPackage ✅
- Download tracking
- Status management
- Idempotency via hash
### 2. Repository: TedDailyPackageRepository ✅
- Package management
- Status queries
- Lookup of the latest package
### 3. Configuration: DownloadProperties ✅
- Download settings
- URL configuration
- Rate limiting
### 4. Service: TedPackageDownloadService (in progress)
- Package download
- tar.gz extraction
- Progress tracking
### 5. Camel Route: TedPackageDownloadRoute (pending)
- Scheduled downloads
- Error handling
- Integration with the existing XML processing
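
The "idempotency via hash" idea listed for the entity can be sketched roughly as follows. This is a minimal sketch and not the project's actual code: `PackageHasher`, `sha256Hex`, and `isAlreadyProcessed` are hypothetical names, and the in-memory `Set` only stands in for a lookup through `TedDailyPackageRepository`. It assumes Java 17 or newer for `java.util.HexFormat`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Set;

// Hypothetical helper, not one of the components listed above.
final class PackageHasher {

    private PackageHasher() {}

    /** Streams a file through SHA-256 and returns 64 lowercase hex characters. */
    static String sha256Hex(Path file) throws IOException {
        MessageDigest digest = newSha256();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        return HexFormat.of().formatHex(digest.digest());
    }

    /** Same digest for content that is already in memory. */
    static String sha256Hex(byte[] data) {
        return HexFormat.of().formatHex(newSha256().digest(data));
    }

    /** Assumed idempotency rule: a package whose hash is already known is skipped. */
    static boolean isAlreadyProcessed(String hash, Set<String> knownHashes) {
        return knownHashes.contains(hash);
    }

    private static MessageDigest newSha256() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to ship SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

A SHA-256 digest rendered as hex is always 64 characters long, which matches the `file_hash VARCHAR(64)` column in the migration.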
## Workflow
1. **Initialization**
   - Determine the latest package from the DB
   - Compute the starting point (current year, or latest package + 1)
2. **Download Loop**
   - Current year: start at the latest package + 1 and continue until 404 responses occur (at most 4 in a row, see `max-consecutive-404`)
   - Previous years: download backwards at a slow pace
3. **Package Processing**
   - Download the tar.gz
   - Compute the hash (SHA-256)
   - Check against the DB (idempotency)
   - Extract the XML files
   - Forward them to the XML processing route
4. **Status Tracking**
   - PENDING → DOWNLOADING → DOWNLOADED → PROCESSING → COMPLETED
   - Error handling: FAILED, NOT_FOUND
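
The status chain above can be read as a small state machine. The following is only a sketch under assumptions: the source names the states but not the transitions into `FAILED` and `NOT_FOUND`, so the choices here (any active step may fail, only the download step may end in `NOT_FOUND`, and retries are not modelled) are illustrative. The enum name `DownloadStatus` and its methods are hypothetical.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical enum; state names are taken from the workflow, transitions are assumed.
enum DownloadStatus {
    PENDING, DOWNLOADING, DOWNLOADED, PROCESSING, COMPLETED, FAILED, NOT_FOUND;

    /** States that may directly follow this one (assumption, see lead-in). */
    Set<DownloadStatus> allowedNext() {
        return switch (this) {
            case PENDING -> EnumSet.of(DOWNLOADING);
            case DOWNLOADING -> EnumSet.of(DOWNLOADED, FAILED, NOT_FOUND);
            case DOWNLOADED -> EnumSet.of(PROCESSING, FAILED);
            case PROCESSING -> EnumSet.of(COMPLETED, FAILED);
            // Terminal states in this sketch.
            case COMPLETED, FAILED, NOT_FOUND -> EnumSet.noneOf(DownloadStatus.class);
        };
    }

    boolean canTransitionTo(DownloadStatus next) {
        return allowedNext().contains(next);
    }

    /** Returns the next state, or throws if the step is not allowed. */
    DownloadStatus transitionTo(DownloadStatus next) {
        if (!canTransitionTo(next)) {
            throw new IllegalStateException("Illegal transition " + this + " -> " + next);
        }
        return next;
    }
}
```

Guarding every status update with a check like `transitionTo` would keep a package from jumping, for example, from `PENDING` straight to `COMPLETED`.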
## Configuration (application.yml)
```yaml
ted:
  download:
    enabled: true
    base-url: https://ted.europa.eu/packages/daily/
    download-directory: D:/ted.europe/downloads
    extract-directory: D:/ted.europe/extracted
    start-year: 2024
    max-consecutive-404: 4
    poll-interval: 3600000 # 1 hour
    download-timeout: 300000 # 5 minutes
    max-concurrent-downloads: 2
    delay-between-downloads: 5000 # 5 seconds
    delete-after-extraction: true
    prioritize-current-year: true
```
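
To illustrate how `max-consecutive-404` might drive the current-year loop, here is a minimal sketch. It is an interpretation, not the actual `TedPackageDownloadService`: `CurrentYearScanner` and `scan` are hypothetical names, and `statusForSerial` is a stand-in for the HTTP request against `base-url`, so the logic can run without network access. Only the status codes 200 and 404 are handled, and `delay-between-downloads` is not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of the "continue until 404 (max 4x)" rule.
final class CurrentYearScanner {

    private CurrentYearScanner() {}

    /**
     * Probes serial numbers upwards from firstSerial and stops once
     * maxConsecutive404 misses have occurred in a row.
     *
     * @param statusForSerial stand-in for the HTTP call; maps a serial number to a status code
     * @return the serial numbers that answered with 200, in ascending order
     */
    static List<Integer> scan(int firstSerial, int maxConsecutive404,
                              IntUnaryOperator statusForSerial) {
        if (maxConsecutive404 < 1) {
            throw new IllegalArgumentException("maxConsecutive404 must be at least 1");
        }
        List<Integer> found = new ArrayList<>();
        int misses = 0;
        for (int serial = firstSerial; misses < maxConsecutive404; serial++) {
            int status = statusForSerial.applyAsInt(serial);
            if (status == 200) {
                found.add(serial);
                misses = 0; // a hit resets the miss counter
            } else if (status == 404) {
                misses++;
            } else {
                throw new IllegalStateException(
                        "Unexpected status " + status + " for serial " + serial);
            }
        }
        return found;
    }
}
```

With a stub that reports serials 5, 6, and 8 as available, `CurrentYearScanner.scan(5, 4, ...)` returns those three serials: the single gap at 7 is tolerated, and the scan ends after serials 9 to 12 all answer with 404.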
## Database Migration
```sql
CREATE TABLE TED.ted_daily_package (
    id UUID PRIMARY KEY,
    package_identifier VARCHAR(20) NOT NULL UNIQUE,
    year INTEGER NOT NULL,
    serial_number INTEGER NOT NULL,
    download_url VARCHAR(500) NOT NULL,
    file_hash VARCHAR(64),
    xml_file_count INTEGER,
    processed_count INTEGER DEFAULT 0,
    failed_count INTEGER DEFAULT 0,
    download_status VARCHAR(30) NOT NULL DEFAULT 'PENDING',
    error_message TEXT,
    downloaded_at TIMESTAMP WITH TIME ZONE,
    processed_at TIMESTAMP WITH TIME ZONE,
    download_duration_ms BIGINT,
    processing_duration_ms BIGINT,
    created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(year, serial_number)
);

CREATE INDEX idx_package_identifier ON TED.ted_daily_package(package_identifier);
CREATE INDEX idx_package_year_serial ON TED.ted_daily_package(year, serial_number);
CREATE INDEX idx_package_status ON TED.ted_daily_package(download_status);
CREATE INDEX idx_package_downloaded_at ON TED.ted_daily_package(downloaded_at);
```
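
The table stores `package_identifier` alongside `year` and `serial_number`, so the three columns need to stay consistent. The sketch below shows one way to derive each from the other. It rests on an assumption that the source does not state: that the identifier is the four-digit year followed by a five-digit, zero-padded serial number (for example `202400001`). `PackageKey` and its methods are hypothetical names, and records require Java 16 or newer.

```java
import java.util.Locale;

// Hypothetical value type; the identifier layout (yyyy + 5-digit serial) is an assumption.
record PackageKey(int year, int serialNumber) {

    PackageKey {
        if (year < 1000 || year > 9999) {
            throw new IllegalArgumentException("year must have four digits: " + year);
        }
        if (serialNumber < 1 || serialNumber > 99_999) {
            throw new IllegalArgumentException("serial number out of range: " + serialNumber);
        }
    }

    /** Builds the value for the package_identifier column. */
    String identifier() {
        return String.format(Locale.ROOT, "%04d%05d", year, serialNumber);
    }

    /** Splits an identifier back into the year and serial_number columns. */
    static PackageKey parse(String identifier) {
        if (identifier == null || !identifier.matches("\\d{9}")) {
            throw new IllegalArgumentException("expected nine digits: " + identifier);
        }
        return new PackageKey(
                Integer.parseInt(identifier.substring(0, 4)),
                Integer.parseInt(identifier.substring(4)));
    }

    /** The "latest package + 1" step within the same year. */
    PackageKey next() {
        return new PackageKey(year, serialNumber + 1);
    }
}
```

Deriving the identifier from the pair in one place keeps `UNIQUE(year, serial_number)` and the `UNIQUE` constraint on `package_identifier` from drifting apart.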
## Next Steps
1. Finish the package download service
2. Create the Camel route
3. Run the database migration
4. Testing & integration