Import Performance and Error Fixes

Scope

This document summarizes the event-ingest, master-data, schema, locking, retry, and correctness fixes made in the current optimization round for the EventHub import pipeline.

It covers:

  • event schema and migration fixes
  • event import throughput fixes
  • master-data import throughput fixes
  • deadlock and transaction-visibility fixes
  • connection-reset and retry handling
  • async import cursor correctness fixes
  • logging and operational visibility improvements

Main committed changes:

  • 2e6e1aa - Optimize ingestion pipeline and reduce import contention
  • bd3620b - Improve vehicle reference caching during ingest

Additional related fixes are currently present in the workspace but may not yet be committed.

1. Schema and Migration Fixes

1.1 event_detail / hypertable ordering

Problem:

  • executing eventhub_schema_create.sql on an empty database failed with relation "eventhub.event_detail" does not exist

Fix:

  • create eventhub.event_detail before create_hypertable(...)
  • add its foreign key after the hypertable conversion

Files:

  • src/main/resources/db/eventhub_schema_create.sql

1.2 Explicit migrations for event hypertable and source-record support

Problem:

  • the runtime schema evolution needed explicit migrations for hypertable conversion and event_source_record

Fix:

  • add migration for source_package_id on event
  • add migration for event hypertable conversion and FK recreation
  • add migration to ensure event_source_record exists and is backfilled

Files:

  • src/main/resources/db/migration/V9__add_event_source_package_id.sql
  • src/main/resources/db/migration/V10__make_event_hypertable.sql
  • src/main/resources/db/migration/V11__ensure_event_source_record.sql

2. Event Import Throughput Fixes

2.1 Replace per-event inserts with staged set-based writes

Problem:

  • EventRepository.batchInsert(...) originally processed events one by one despite the batch API
  • this caused one insert/query cycle per event and poor throughput

Fix:

  • stage a whole ingest batch into eventhub_event_import_stage
  • reserve event_source_record rows set-wise
  • insert eventhub.event rows set-wise
  • upsert eventhub.event_detail rows in batch

Files:

  • src/main/java/at/procon/eventhub/persistence/EventRepository.java

2.2 Fix missing event rows when source records were reserved

Problem:

  • after the set-based refactor, some runs created event_source_record rows without creating event rows

Cause:

  • the insert statement reserved source records and then tried to re-read them through the base table in the same data-modifying CTE chain

Fix:

  • use the RETURNING rows from the source-record reservation CTE directly
  • also support already-existing source records that still miss the event row

Files:

  • src/main/java/at/procon/eventhub/persistence/EventRepository.java

2.3 Stream extraction instead of materializing full result sets

Problem:

  • extraction loaded full SQL chunks into memory before handing them to Camel

Fix:

  • stream rows directly from JDBC to direct:eventhub-normalized-input
  • keep only counters and watermark information

Files:

  • src/main/java/at/procon/eventhub/importing/extraction/AbstractJdbcExtractionBatchExecutor.java
  • src/main/java/at/procon/eventhub/tachograph/service/JdbcTachographExtractionBatchExecutor.java

2.4 Increase batch size and enable parallel queue draining

Problem:

  • the async ingest route drained too slowly with 1000-event batches and a single consumer

Fix:

  • raise Camel completion size from 1000 to 5000
  • enable 4 concurrent SEDA consumers

Files:

  • src/main/java/at/procon/eventhub/config/EventHubProperties.java
  • src/main/java/at/procon/eventhub/camel/EventHubCommonIngestionRoute.java
  • src/main/resources/application.yml

2.5 Give each Camel flush its own package key

Problem:

  • multiple flushes of the same extraction package reused the same data_package identity
  • logs were misleading and event_count on the package row was overwritten by later flushes

Fix:

  • derive a unique packageKey per completed Camel batch using the aggregate package key plus the Camel exchange id
  • preserve both the aggregate key and the child key in metadata

Files:

  • src/main/java/at/procon/eventhub/camel/EventHubBatchBuildProcessor.java
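
The key derivation described above could look roughly like the following. This is a minimal sketch, not the actual EventHubBatchBuildProcessor code: the class name, the ":" separator, and the metadata field names are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; names and key format are illustrative only. */
final class ChildPackageKeys {
    private ChildPackageKeys() {}

    /** Combines the aggregate package key with the Camel exchange id of the completed batch. */
    static String childKey(String aggregateKey, String exchangeId) {
        return aggregateKey + ":" + exchangeId;
    }

    /** Keeps both keys so a child batch can be traced back to its extraction package. */
    static Map<String, String> metadata(String aggregateKey, String exchangeId) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("aggregatePackageKey", aggregateKey); // assumed field name
        meta.put("packageKey", childKey(aggregateKey, exchangeId)); // assumed field name
        return meta;
    }
}
```

Because each completed batch has its own exchange id, two flushes of the same extraction package no longer write to the same data_package row.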

2.6 Improve batch-local entity and vehicle caching

Problem:

  • after the bulk insert refactor, reference resolution remained the main hot path

Fixes:

  • cache entity ids in the batch by entityType + sourceEntityId
  • cache vehicle resolutions inside a batch
  • later extend vehicle caching to be range-aware for registration-based assignment lookups:
    • direct vehicle identifiers cache without time sensitivity
    • registration-based resolutions cache over assignment validity intervals

Files:

  • src/main/java/at/procon/eventhub/persistence/EventRepository.java
  • src/main/java/at/procon/eventhub/persistence/VehicleIdentityRepository.java
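
The two vehicle caching modes can be illustrated with a small batch-local cache. This is a sketch under stated assumptions, not the code in VehicleIdentityRepository: the class and method names are hypothetical, and treating validity as a half-open interval with a null end meaning "still valid" is an assumption.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

/** Hypothetical batch-local cache; all names are illustrative. */
final class VehicleBatchCache {

    /** Registration-based resolution valid for [validFrom, validTo); a null validTo means open-ended. */
    private static final class Assignment {
        final Instant validFrom;
        final Instant validTo;
        final long vehicleId;

        Assignment(Instant validFrom, Instant validTo, long vehicleId) {
            this.validFrom = validFrom;
            this.validTo = validTo;
            this.vehicleId = vehicleId;
        }

        boolean covers(Instant eventTime) {
            return !eventTime.isBefore(validFrom)
                    && (validTo == null || eventTime.isBefore(validTo));
        }
    }

    private final Map<String, Long> byDirectIdentifier = new HashMap<>();
    private final Map<String, List<Assignment>> byRegistration = new HashMap<>();

    /** Direct identifiers are cached without time sensitivity. */
    void putDirect(String identifier, long vehicleId) {
        byDirectIdentifier.put(identifier, vehicleId);
    }

    OptionalLong getDirect(String identifier) {
        Long id = byDirectIdentifier.get(identifier);
        return id == null ? OptionalLong.empty() : OptionalLong.of(id);
    }

    /** Registration-based resolutions are cached per assignment validity interval. */
    void putRegistration(String registrationKey, Instant validFrom, Instant validTo, long vehicleId) {
        byRegistration.computeIfAbsent(registrationKey, k -> new ArrayList<>())
                .add(new Assignment(validFrom, validTo, vehicleId));
    }

    /** Returns a hit only if a cached interval covers the event time. */
    OptionalLong getRegistration(String registrationKey, Instant eventTime) {
        List<Assignment> assignments = byRegistration.get(registrationKey);
        if (assignments == null) {
            return OptionalLong.empty();
        }
        for (Assignment a : assignments) {
            if (a.covers(eventTime)) {
                return OptionalLong.of(a.vehicleId);
            }
        }
        return OptionalLong.empty();
    }
}
```

A lookup that falls outside every cached interval is a miss and would go to the database, so a registration that moved between vehicles is not resolved to the wrong vehicle.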

3. Master-Data Import Throughput Fixes

3.1 Set-based master entity and relation upserts

Problem:

  • source master data was previously written row by row

Fix:

  • stage master entities and relations into temporary tables
  • run set-based insert ... select ... on conflict do update

Files:

  • src/main/java/at/procon/eventhub/persistence/SourceMasterDataRepository.java

3.2 Stream and chunk master-data refresh

Problem:

  • the refresh path loaded large source master-data result sets into memory

Fix:

  • stream source rows
  • flush in chunks of 5000

Files:

  • src/main/java/at/procon/eventhub/tachograph/service/TachographMasterDataRefreshService.java
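
The stream-and-flush pattern can be sketched in plain Java as follows. The helper name and signature are hypothetical, and the real service reads from JDBC rather than from a generic iterator; the point is only that at most one chunk is held in memory at a time.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper illustrating chunked flushing of a streamed source. */
final class ChunkedFlusher {
    private ChunkedFlusher() {}

    /**
     * Reads rows one at a time and hands them to {@code flush} in chunks of
     * at most {@code chunkSize}. Returns the total number of rows seen.
     */
    static <T> long flushInChunks(Iterator<T> rows, int chunkSize, Consumer<List<T>> flush) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<T> buffer = new ArrayList<>(chunkSize);
        long total = 0;
        while (rows.hasNext()) {
            buffer.add(rows.next());
            total++;
            if (buffer.size() == chunkSize) {
                flush.accept(new ArrayList<>(buffer)); // pass a copy, then reuse the buffer
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            flush.accept(new ArrayList<>(buffer)); // final partial chunk
        }
        return total;
    }
}
```

With a chunk size of 5000, as in the fix above, the flush callback would be the place for the set-based upsert of one chunk.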

3.3 Bulk vehicle reconciliation from master data

Problem:

  • reconciling vehicles and registrations from master data was done row by row

Fix:

  • replace the loop with set-based SQL for vehicles, registrations, and projected assignments

Files:

  • src/main/java/at/procon/eventhub/persistence/VehicleIdentityRepository.java

4. Deadlock and Contention Fixes

4.1 Remove unnecessary hot-row updates on vehicle and registration rows

Problem:

  • event import updated vehicle.updated_at and vehicle_registration.updated_at even when no new information was being added
  • this created deadlocks under parallel ingest

Fix:

  • only update vehicle when missing source_vehicle_entity_id or vin can actually be filled
  • only update vehicle_registration when missing source id, nation, or registration number can actually be filled
  • stop using event import as a generic "touch row" path

Files:

  • src/main/java/at/procon/eventhub/persistence/VehicleIdentityRepository.java

4.2 Make event-time source master entity resolution "find or create", not "update on conflict"

Problem:

  • concurrent event batches could deadlock on eventhub.source_master_entity through INSERT ... ON CONFLICT DO UPDATE

Fix:

  • first SELECT id
  • if missing, INSERT ... ON CONFLICT DO NOTHING RETURNING id
  • if another transaction won the race, select again
  • do not update existing master entity rows during event ingest

Files:

  • src/main/java/at/procon/eventhub/persistence/SourceMasterDataRepository.java

4.3 Fix race handling when RETURNING returns no row

Problem:

  • if a concurrent transaction inserted the entity first, INSERT ... ON CONFLICT DO NOTHING RETURNING id returned no row and the resolver failed instead of reading the existing id

Fix:

  • allow the RETURNING path to yield null
  • retry with a follow-up SELECT

Files:

  • src/main/java/at/procon/eventhub/persistence/SourceMasterDataRepository.java
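
The control flow of sections 4.2 and 4.3 can be sketched independently of the database access code. In this illustration the two suppliers stand in for the SELECT id query and the INSERT ... ON CONFLICT DO NOTHING RETURNING id statement; the class name and signature are assumptions, not the actual SourceMasterDataRepository API.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical find-or-create helper; the suppliers wrap the real SQL calls. */
final class FindOrCreate {
    private FindOrCreate() {}

    static long resolve(Supplier<Optional<Long>> selectId,
                        Supplier<Optional<Long>> insertReturningId) {
        // 1. Most entities already exist, so try the plain read first.
        Optional<Long> existing = selectId.get();
        if (existing.isPresent()) {
            return existing.get();
        }
        // 2. Insert without updating on conflict. An empty result means
        //    another transaction inserted the row first.
        Optional<Long> inserted = insertReturningId.get();
        if (inserted.isPresent()) {
            return inserted.get();
        }
        // 3. Lost the race: read the row the other transaction created.
        return selectId.get().orElseThrow(
                () -> new IllegalStateException("entity neither inserted nor found"));
    }
}
```

Existing rows are never written in this path, which is what removes the row-lock contention that INSERT ... ON CONFLICT DO UPDATE caused between concurrent batches.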

5. Transaction Visibility and Correctness Fixes

5.1 Remove outer transaction around full tachograph execution

Problem:

  • master-data refresh logs showed completion, but master-data rows were not visible yet because the outer import method still held the transaction open

Fix:

  • remove the outer transaction from startAndExecuteImport(...)
  • keep chunk-level and package-level transactions independent

Files:

  • src/main/java/at/procon/eventhub/tachograph/service/TachographImportExecutionService.java

5.2 Preserve the original ingest exception if package failure marking also fails

Problem:

  • when ingest failed and markFailed(...) also failed because of a broken connection, the secondary bookkeeping error hid the real root cause

Fix:

  • wrap dataPackageRepository.markFailed(...) in its own try/catch
  • log the bookkeeping failure
  • keep the original ingest exception and attach the bookkeeping failure as suppressed

Files:

  • src/main/java/at/procon/eventhub/service/EventHubIngestionService.java
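
The exception handling described above follows a standard Java pattern built on Throwable.addSuppressed. The sketch below shows that pattern in isolation; FailureMarker is a hypothetical stand-in for dataPackageRepository.markFailed(...), and the method shape is an assumption rather than the actual EventHubIngestionService code.

```java
/** Hypothetical illustration of keeping the root cause when bookkeeping also fails. */
final class FailureBookkeeping {
    private FailureBookkeeping() {}

    /** Stand-in for the repository call that marks a package as failed. */
    interface FailureMarker {
        void markFailed(String packageKey, Exception cause) throws Exception;
    }

    static void ingest(String packageKey, Runnable ingestStep, FailureMarker marker) {
        try {
            ingestStep.run();
        } catch (RuntimeException ingestFailure) {
            try {
                marker.markFailed(packageKey, ingestFailure);
            } catch (Exception bookkeepingFailure) {
                // Log the secondary failure, but do not let it replace the root cause.
                System.err.println("Could not mark package " + packageKey
                        + " as failed: " + bookkeepingFailure);
                ingestFailure.addSuppressed(bookkeepingFailure);
            }
            throw ingestFailure; // the original ingest exception always propagates
        }
    }
}
```

A caller that catches the exception sees the ingest failure as the primary error and can still inspect the bookkeeping failure through getSuppressed().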

5.3 Do not advance import cursors before async ingest really finishes

Problem:

  • extraction previously marked packages imported and advanced import_cursor before the async CAMEL_BATCH ingest was durably finished
  • this could skip source data on the next run if async ingest later failed

Fix:

  • add grouped child-batch status lookup on data_package
  • make extraction package completion wait for all derived CAMEL_BATCH rows to reach terminal success
  • fail the planned extraction package if child batches fail or time out
  • only advance the cursor after the async ingest succeeded
  • make "Completed import run" mean durable ingest completion instead of extraction completion

Files:

  • src/main/java/at/procon/eventhub/importing/AbstractImportExecutionService.java
  • src/main/java/at/procon/eventhub/persistence/DataPackageRepository.java
  • src/main/java/at/procon/eventhub/tachograph/service/TachographImportExecutionService.java
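
The completion rule can be reduced to a small decision function over the child-batch counts. This is a sketch under assumptions: the class, enum, and parameter names are hypothetical, and treating a package with zero expected child batches as immediately complete is an assumption the source does not state.

```java
/** Hypothetical decision logic for an extraction package waiting on CAMEL_BATCH children. */
final class ChildBatchGate {
    private ChildBatchGate() {}

    enum Decision { KEEP_WAITING, ADVANCE_CURSOR, FAIL_PACKAGE }

    static Decision decide(int expected, int succeeded, int failed, boolean timedOut) {
        if (failed > 0) {
            return Decision.FAIL_PACKAGE;      // any failed child fails the package
        }
        if (succeeded >= expected) {
            return Decision.ADVANCE_CURSOR;    // all children durably ingested
        }
        // Some children are still importing or not yet observed.
        return timedOut ? Decision.FAIL_PACKAGE : Decision.KEEP_WAITING;
    }
}
```

The cursor moves only on ADVANCE_CURSOR, so a failed or timed-out async ingest leaves the source range to be extracted again on the next run.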

6. Connection Reset and Retry Hardening

6.1 Retry transient DB failures in the Camel ingest route

Problem:

  • long-running imports hit transient failures such as deadlocks and connection resets

Fix:

  • add Camel redelivery with exponential backoff for:
    • CannotAcquireLockException
    • PessimisticLockingFailureException
    • DataAccessResourceFailureException
    • TransientDataAccessException

Files:

  • src/main/java/at/procon/eventhub/camel/EventHubCommonIngestionRoute.java
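
To make the effect of exponential backoff concrete, the following plain-Java sketch computes the delay before each redelivery attempt. It does not use Camel's redelivery API, and the initial delay, multiplier, and cap in the usage note are illustrative values, not the route's actual configuration.

```java
/** Hypothetical backoff schedule; parameter values are chosen by the caller. */
final class Backoff {
    private Backoff() {}

    /**
     * Delay before the given redelivery attempt (1-based): the initial delay
     * multiplied by the multiplier once per previous attempt, capped at maxDelayMs.
     */
    static long delayMillis(int attempt, long initialDelayMs, double multiplier, long maxDelayMs) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        double delay = initialDelayMs * Math.pow(multiplier, attempt - 1);
        return (long) Math.min(delay, (double) maxDelayMs);
    }
}
```

With an assumed initial delay of 500 ms, a multiplier of 2.0, and a cap of 10000 ms, successive attempts wait 500, 1000, 2000, 4000, 8000, and then 10000 ms. The growing gaps give a competing transaction or a recovering connection time to settle before the batch is retried.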

6.2 Tune Hikari for shorter-lived and healthier pooled connections

Problem:

  • SQLSTATE 08006 / Connection reset events left broken pool entries behind

Fix:

  • configure Hikari with explicit pool sizing and connection lifetime / keepalive settings (time values in milliseconds):
    • maximum-pool-size: 16
    • minimum-idle: 4
    • connection-timeout: 30000
    • validation-timeout: 5000
    • idle-timeout: 300000
    • keepalive-time: 120000
    • max-lifetime: 540000

Files:

  • src/main/resources/application.yml

7. Observability Improvements

7.1 Master-data progress logging

Added logs for:

  • refresh start
  • per-section progress
  • per-chunk counts
  • byType breakdowns
  • section completion
  • reconciliation start and result

Files:

  • src/main/java/at/procon/eventhub/tachograph/service/TachographMasterDataRefreshService.java

7.2 Event extraction progress logging

Added logs for:

  • extraction start
  • progress every 5000 mapped events
  • final mapped totals with byType

Files:

  • src/main/java/at/procon/eventhub/importing/extraction/AbstractJdbcExtractionBatchExecutor.java

7.3 Event ingest throughput logging

Added logs for:

  • receivedCount
  • insertedCount
  • elapsedMs
  • receivedPerSecond
  • byType

Files:

  • src/main/java/at/procon/eventhub/service/EventHubIngestionService.java

7.4 Async-ingest wait progress logging

Added logs for:

  • number of expected child batches
  • observed child batches
  • successful / failed / importing child-batch counts while the import executor waits for durable completion

Files:

  • src/main/java/at/procon/eventhub/importing/AbstractImportExecutionService.java

8. Operational Notes

Throughput effect seen during the optimization round

Observed progression during the work:

  • roughly 30 events/sec before the later cache and blocking fixes
  • roughly 300 rows/sec after the main contention and stuck-session cleanup work

This is roughly a tenfold improvement, assuming one row per event, but large historical backfills are still expensive.

What remains expensive

The main remaining bottleneck is reference resolution in the ingest hot path, especially:

  • driver entity resolution
  • source-package entity resolution
  • vehicle / registration lookup and creation

The next major optimization step would be set-based pre-resolution of references per ingest batch instead of resolving them one event at a time.

Safe rerun behavior

  • event ingest remains idempotent through event_source_record.source_record_key_hash
  • already imported events should generally be kept
  • if a cursor was corrupted by an earlier run, repair import_cursor rather than deleting imported events wholesale

9. Main Files Touched

  • src/main/java/at/procon/eventhub/persistence/EventRepository.java
  • src/main/java/at/procon/eventhub/persistence/VehicleIdentityRepository.java
  • src/main/java/at/procon/eventhub/persistence/SourceMasterDataRepository.java
  • src/main/java/at/procon/eventhub/persistence/DataPackageRepository.java
  • src/main/java/at/procon/eventhub/service/EventHubIngestionService.java
  • src/main/java/at/procon/eventhub/camel/EventHubCommonIngestionRoute.java
  • src/main/java/at/procon/eventhub/camel/EventHubBatchBuildProcessor.java
  • src/main/java/at/procon/eventhub/importing/extraction/AbstractJdbcExtractionBatchExecutor.java
  • src/main/java/at/procon/eventhub/importing/AbstractImportExecutionService.java
  • src/main/java/at/procon/eventhub/tachograph/service/JdbcTachographExtractionBatchExecutor.java
  • src/main/java/at/procon/eventhub/tachograph/service/TachographMasterDataRefreshService.java
  • src/main/java/at/procon/eventhub/tachograph/service/TachographImportExecutionService.java
  • src/main/java/at/procon/eventhub/config/EventHubProperties.java
  • src/main/resources/application.yml
  • src/main/resources/db/eventhub_schema_create.sql
  • src/main/resources/db/migration/V9__add_event_source_package_id.sql
  • src/main/resources/db/migration/V10__make_event_hypertable.sql
  • src/main/resources/db/migration/V11__ensure_event_source_record.sql