TED Daily Package Download - Implementation
Overview
The system automatically downloads TED Daily Packages and processes them.
Components
1. Entity: TedDailyPackage ✅
- Download tracking
- Status management
- Idempotence via hash
2. Repository: TedDailyPackageRepository ✅
- Package management
- Status queries
- Lookup of the latest package
3. Configuration: DownloadProperties ✅
- Download settings
- URL configuration
- Rate limiting
4. Service: TedPackageDownloadService (in progress)
- Package download
- tar.gz extraction
- Progress tracking
5. Camel Route: TedPackageDownloadRoute (pending)
- Scheduled downloads
- Error handling
- Integration with the existing XML processing
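The hash-based idempotence listed for TedDailyPackage could look roughly like the following. This is a minimal sketch, not the project's code: the class name `PackageHashSketch`, the method names, and the in-memory set (standing in for a lookup on the stored hash) are assumptions made for illustration.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

class PackageHashSketch {
    // Hypothetical stand-in for a database lookup on the stored file hash.
    private final Set<String> knownHashes = new HashSet<>();

    /** Returns the SHA-256 digest of the content as 64 lowercase hex characters. */
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder(64);
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Returns true only the first time a given content is seen. */
    boolean registerIfNew(byte[] content) {
        return knownHashes.add(sha256Hex(content));
    }
}
```

A SHA-256 digest rendered as hex is always 64 characters long, which matches a `VARCHAR(64)` hash column. For large archives, a streaming variant that feeds the digest in chunks would avoid loading the whole file into memory.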
Workflow
1. Initialization
- Determine the latest package from the DB
- Compute the starting point (current year, or latest package + 1)
2. Download loop
- Current year: start at the latest package + 1 and continue until 404 responses occur (stop after at most 4 in a row, see max-consecutive-404)
- Previous years: download backwards, at a slow pace
3. Package processing
- Download the tar.gz
- Compute the hash (SHA-256)
- Check against the DB (idempotence)
- Extract the XML files
- Forward to the XML processing route
4. Status tracking
- PENDING → DOWNLOADING → DOWNLOADED → PROCESSING → COMPLETED
- Error handling: FAILED, NOT_FOUND
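The status flow above can be sketched as a small state machine. The status names come from this document; the transition table is an assumption, in particular which states may lead to FAILED or NOT_FOUND and that both are terminal.

```java
import java.util.EnumSet;
import java.util.Set;

enum PackageStatus {
    PENDING, DOWNLOADING, DOWNLOADED, PROCESSING, COMPLETED, FAILED, NOT_FOUND;

    /** Statuses reachable from this one (assumed transition table). */
    Set<PackageStatus> next() {
        switch (this) {
            case PENDING:     return EnumSet.of(DOWNLOADING);
            // Assumption: a 404 is detected while downloading.
            case DOWNLOADING: return EnumSet.of(DOWNLOADED, FAILED, NOT_FOUND);
            case DOWNLOADED:  return EnumSet.of(PROCESSING, FAILED);
            case PROCESSING:  return EnumSet.of(COMPLETED, FAILED);
            // COMPLETED, FAILED and NOT_FOUND are treated as terminal here.
            default:          return EnumSet.noneOf(PackageStatus.class);
        }
    }

    /** Returns true if moving from this status to the target is allowed. */
    boolean canTransitionTo(PackageStatus target) {
        return next().contains(target);
    }
}
```

Checking `canTransitionTo` before each status update would reject jumps such as PENDING to COMPLETED. If failed packages are meant to be retried, FAILED would additionally need a transition back to PENDING.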
Configuration (application.yml)
ted:
  download:
    enabled: true
    base-url: https://ted.europa.eu/packages/daily/
    download-directory: D:/ted.europe/downloads
    extract-directory: D:/ted.europe/extracted
    start-year: 2024
    max-consecutive-404: 4
    poll-interval: 3600000 # 1 hour
    download-timeout: 300000 # 5 minutes
    max-concurrent-downloads: 2
    delay-between-downloads: 5000 # 5 seconds
    delete-after-extraction: true
    prioritize-current-year: true
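As an illustration of how `max-consecutive-404` might drive the current-year scan, here is a minimal sketch. The class `DownloadLoopSketch`, the method `scanCurrentYear`, and the `packageExists` predicate (standing in for the real HTTP request) are hypothetical names, not part of the described service.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class DownloadLoopSketch {
    /**
     * Probes serial numbers upwards from firstSerial and stops once
     * maxConsecutive404 misses have occurred in a row.
     * Returns the serial numbers for which a package was found.
     */
    static List<Integer> scanCurrentYear(int firstSerial,
                                         int maxConsecutive404,
                                         IntPredicate packageExists) {
        List<Integer> found = new ArrayList<>();
        int misses = 0;
        for (int serial = firstSerial; misses < maxConsecutive404; serial++) {
            if (packageExists.test(serial)) {
                found.add(serial);
                misses = 0; // a hit resets the streak, so isolated gaps are tolerated
            } else {
                misses++;
            }
            // A real loop would also wait delay-between-downloads here.
        }
        return found;
    }
}
```

With a start of 1, a limit of 4, and packages existing only for serials 1, 2 and 4, the scan returns 1, 2 and 4 and stops after probing serial 8. Counting consecutive misses rather than stopping at the first 404 keeps a single missing serial from ending the scan early.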
Database Migration
CREATE TABLE TED.ted_daily_package (
    id UUID PRIMARY KEY,
    package_identifier VARCHAR(20) NOT NULL UNIQUE,
    year INTEGER NOT NULL,
    serial_number INTEGER NOT NULL,
    download_url VARCHAR(500) NOT NULL,
    file_hash VARCHAR(64),
    xml_file_count INTEGER,
    processed_count INTEGER DEFAULT 0,
    failed_count INTEGER DEFAULT 0,
    download_status VARCHAR(30) NOT NULL DEFAULT 'PENDING',
    error_message TEXT,
    downloaded_at TIMESTAMP WITH TIME ZONE,
    processed_at TIMESTAMP WITH TIME ZONE,
    download_duration_ms BIGINT,
    processing_duration_ms BIGINT,
    created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(year, serial_number)
);

CREATE INDEX idx_package_identifier ON TED.ted_daily_package(package_identifier);
CREATE INDEX idx_package_year_serial ON TED.ted_daily_package(year, serial_number);
CREATE INDEX idx_package_status ON TED.ted_daily_package(download_status);
CREATE INDEX idx_package_downloaded_at ON TED.ted_daily_package(downloaded_at);
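The columns package_identifier, year, serial_number and download_url suggest that the identifier and URL are derived from year and serial number. The sketch below shows one way this might be done. It assumes the identifier is the four-digit year followed by a five-digit zero-padded serial number; that format is not stated in this document, and the class and method names are hypothetical.

```java
import java.util.Locale;

class PackageIdentifierSketch {
    /** Builds an identifier such as 202400001 (assumed format: yyyy + 5-digit serial). */
    static String identifier(int year, int serialNumber) {
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "%04d%05d", year, serialNumber);
    }

    /** Appends the identifier to the base URL, adding a slash if it is missing. */
    static String downloadUrl(String baseUrl, int year, int serialNumber) {
        String base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        return base + identifier(year, serialNumber);
    }
}
```

Under this assumption an identifier is 9 characters long, which fits comfortably into the `VARCHAR(20)` column.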
Next steps
- Finish the package download service
- Create the Camel route
- Run the database migration
- Testing & integration