TED Daily Package Download - Implementation

Overview

The system automatically downloads TED Daily Packages and processes them.

Components

1. Entity: TedDailyPackage

  • Tracking of downloads
  • Status management
  • Idempotency via hash
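
The status management listed for the entity could be modeled as an enum that knows its legal successors. This is a minimal Java sketch (Java 17+), not the project's actual entity code. The status names come from the Workflow section of this document. The enum name `DownloadStatus`, the method names, and the exact transition rules (for example, which states may move to FAILED, and that FAILED is terminal) are assumptions made for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical status enum; names follow the document, transitions are assumed. */
enum DownloadStatus {
    PENDING, DOWNLOADING, DOWNLOADED, PROCESSING, COMPLETED, FAILED, NOT_FOUND;

    /** Statuses that may directly follow this one (assumed rules). */
    Set<DownloadStatus> allowedNext() {
        return switch (this) {
            case PENDING -> EnumSet.of(DOWNLOADING);
            // A download can succeed, fail, or hit a 404.
            case DOWNLOADING -> EnumSet.of(DOWNLOADED, FAILED, NOT_FOUND);
            case DOWNLOADED -> EnumSet.of(PROCESSING, FAILED);
            case PROCESSING -> EnumSet.of(COMPLETED, FAILED);
            // Terminal states in this sketch; a retry edge would be added here.
            case COMPLETED, FAILED, NOT_FOUND -> EnumSet.noneOf(DownloadStatus.class);
        };
    }

    boolean canTransitionTo(DownloadStatus next) {
        return allowedNext().contains(next);
    }

    public static void main(String[] args) {
        System.out.println(PENDING.canTransitionTo(DOWNLOADING)); // true
        System.out.println(PENDING.canTransitionTo(COMPLETED));   // false
    }
}
```

Rejecting illegal transitions in one place keeps a package from, say, jumping from PENDING straight to COMPLETED when two workers touch the same row.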

2. Repository: TedDailyPackageRepository

  • Package management
  • Status queries
  • Lookup of the latest package

3. Configuration: DownloadProperties

  • Download settings
  • URL configuration
  • Rate limiting

4. Service: TedPackageDownloadService (in progress)

  • Package download
  • tar.gz extraction
  • Progress tracking

5. Camel Route: TedPackageDownloadRoute (pending)

  • Scheduled downloads
  • Error handling
  • Integration with the existing XML processing

Workflow

  1. Initialization

    • Determine the latest package from the DB
    • Compute the starting point (current year, or latest package + 1)
  2. Download Loop

    • Current year: start at the latest package + 1 and continue until 404 responses occur (at most 4 in a row, see max-consecutive-404)
    • Previous years: download backwards at a slow rate
  3. Package Processing

    • Download the tar.gz
    • Compute the hash (SHA-256)
    • Check against the DB (idempotency)
    • Extract the XML files
    • Forward to the XML processing route
  4. Status Tracking

    • PENDING → DOWNLOADING → DOWNLOADED → PROCESSING → COMPLETED
    • Error handling: FAILED, NOT_FOUND
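
The current-year part of the download loop can be sketched as a forward scan that gives up after a run of misses. This Java sketch rests on one reading of the workflow: a successful download resets the 404 counter, so gaps shorter than the limit are tolerated. The class name, the method name, and the `exists` callback (a stand-in for the real HTTP request) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

final class DailyPackageScanSketch {

    /**
     * Probes serial numbers upward from firstSerial and stops once
     * maxConsecutive404 misses occur in a row. Returns the serials found.
     * `exists` stands in for the HTTP call (true = downloaded, false = 404).
     */
    static List<Integer> scan(int firstSerial, int maxConsecutive404, IntPredicate exists) {
        List<Integer> found = new ArrayList<>();
        int misses = 0;
        for (int serial = firstSerial; misses < maxConsecutive404; serial++) {
            if (exists.test(serial)) {
                found.add(serial);
                misses = 0; // a hit resets the run of 404s
            } else {
                misses++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up data: serials 1-3 and 5 exist, everything else is missing.
        List<Integer> found = scan(1, 4, s -> s <= 3 || s == 5);
        System.out.println(found); // [1, 2, 3, 5]
    }
}
```

In the example the single gap at serial 4 does not end the scan. Only the four misses after serial 5 do, which corresponds to `max-consecutive-404: 4` in the configuration.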

Configuration (application.yml)

ted:
  download:
    enabled: true
    base-url: https://ted.europa.eu/packages/daily/
    download-directory: D:/ted.europe/downloads
    extract-directory: D:/ted.europe/extracted
    start-year: 2024
    max-consecutive-404: 4
    poll-interval: 3600000  # 1 hour
    download-timeout: 300000  # 5 minutes
    max-concurrent-downloads: 2
    delay-between-downloads: 5000  # 5 seconds
    delete-after-extraction: true
    prioritize-current-year: true
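
To show how these keys might look on the Java side, here is a plain record (Java 17+) that mirrors the `ted.download` block. It is a sketch, not the project's `DownloadProperties` class, whose source is not included here. The record name `DownloadSettings`, the camelCase field names, the validation rules, and the `documentedValues()` helper are assumptions. In a Spring Boot application such a type would typically be bound with `@ConfigurationProperties`, which is omitted so the example has no dependencies.

```java
import java.time.Duration;

/** Hypothetical mirror of the ted.download keys; all durations are in milliseconds. */
record DownloadSettings(
        boolean enabled,
        String baseUrl,
        String downloadDirectory,
        String extractDirectory,
        int startYear,
        int maxConsecutive404,
        long pollIntervalMs,
        long downloadTimeoutMs,
        int maxConcurrentDownloads,
        long delayBetweenDownloadsMs,
        boolean deleteAfterExtraction,
        boolean prioritizeCurrentYear) {

    // Assumed sanity checks; the real class may validate differently.
    DownloadSettings {
        if (maxConsecutive404 < 1 || maxConcurrentDownloads < 1) {
            throw new IllegalArgumentException("limits must be at least 1");
        }
        if (pollIntervalMs <= 0 || downloadTimeoutMs <= 0 || delayBetweenDownloadsMs < 0) {
            throw new IllegalArgumentException("invalid duration");
        }
    }

    /** The values shown in the application.yml excerpt. */
    static DownloadSettings documentedValues() {
        return new DownloadSettings(true, "https://ted.europa.eu/packages/daily/",
                "D:/ted.europe/downloads", "D:/ted.europe/extracted",
                2024, 4, 3_600_000L, 300_000L, 2, 5_000L, true, true);
    }

    Duration pollInterval() { return Duration.ofMillis(pollIntervalMs); }
    Duration downloadTimeout() { return Duration.ofMillis(downloadTimeoutMs); }
    Duration delayBetweenDownloads() { return Duration.ofMillis(delayBetweenDownloadsMs); }

    public static void main(String[] args) {
        DownloadSettings s = documentedValues();
        System.out.println(s.pollInterval());          // PT1H
        System.out.println(s.delayBetweenDownloads()); // PT5S
    }
}
```

Exposing the millisecond values as `Duration` makes the units explicit at the call site, which the raw numbers in the YAML only convey through comments.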

Database Migration

CREATE TABLE TED.ted_daily_package (
    id UUID PRIMARY KEY,
    package_identifier VARCHAR(20) NOT NULL UNIQUE,
    year INTEGER NOT NULL,
    serial_number INTEGER NOT NULL,
    download_url VARCHAR(500) NOT NULL,
    file_hash VARCHAR(64),
    xml_file_count INTEGER,
    processed_count INTEGER DEFAULT 0,
    failed_count INTEGER DEFAULT 0,
    download_status VARCHAR(30) NOT NULL DEFAULT 'PENDING',
    error_message TEXT,
    downloaded_at TIMESTAMP WITH TIME ZONE,
    processed_at TIMESTAMP WITH TIME ZONE,
    download_duration_ms BIGINT,
    processing_duration_ms BIGINT,
    created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(year, serial_number)
);

CREATE INDEX idx_package_identifier ON TED.ted_daily_package(package_identifier);
CREATE INDEX idx_package_year_serial ON TED.ted_daily_package(year, serial_number);
CREATE INDEX idx_package_status ON TED.ted_daily_package(download_status);
CREATE INDEX idx_package_downloaded_at ON TED.ted_daily_package(downloaded_at);
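
The `file_hash VARCHAR(64)` column has exactly the width of a hex-encoded SHA-256 digest (32 bytes, 64 hex characters). The following Java sketch (Java 17+, standard library only) shows one way the hash from the Package Processing step could be computed while streaming the downloaded file. The class and method names are hypothetical, and the actual service may compute the hash differently.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class PackageHashSketch {

    /** Streams the input through SHA-256 and returns 64 lowercase hex characters. */
    static String sha256Hex(InputStream in) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read); // hash chunk by chunk, no full load into memory
        }
        return HexFormat.of().formatHex(digest.digest());
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "abc".getBytes(StandardCharsets.US_ASCII);
        String hash = sha256Hex(new ByteArrayInputStream(sample));
        System.out.println(hash.length()); // 64
    }
}
```

For a real package the stream would come from something like `Files.newInputStream(path)`. The resulting string can then be compared with stored `file_hash` values to skip packages that were already processed.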

Next Steps

  1. Complete the Package Download Service
  2. Create the Camel Route
  3. Run the Database Migration
  4. Testing & Integration