14 KiB

Raw Blame History

Import Performance and Error Fixes

Scope

This document summarizes the event-ingest, master-data, schema, locking, retry, and correctness fixes made in the current optimization round for the EventHub import pipeline.

It covers:

event schema and migration fixes
event import throughput fixes
master-data import throughput fixes
deadlock and transaction-visibility fixes
connection-reset and retry handling
async import cursor correctness fixes
logging and operational visibility improvements

Main committed changes:

2e6e1aa - Optimize ingestion pipeline and reduce import contention
bd3620b - Improve vehicle reference caching during ingest

Additional related fixes are currently present in the workspace but may not yet be committed.

1. Schema and Migration Fixes

1.1 `event_detail` / hypertable ordering

Problem:

executing eventhub_schema_create.sql on an empty database failed with relation "eventhub.event_detail" does not exist

Fix:

create eventhub.event_detail before create_hypertable(...)
add its foreign key after the hypertable conversion

Files:

src/main/resources/db/eventhub_schema_create.sql

1.2 Explicit migrations for event hypertable and source-record support

Problem:

the runtime schema evolution needed explicit migrations for hypertable conversion and event_source_record

Fix:

add migration for source_package_id on event
add migration for event hypertable conversion and FK recreation
add migration to ensure event_source_record exists and is backfilled

Files:

src/main/resources/db/migration/V9__add_event_source_package_id.sql
src/main/resources/db/migration/V10__make_event_hypertable.sql
src/main/resources/db/migration/V11__ensure_event_source_record.sql

2. Event Import Throughput Fixes

2.1 Replace per-event inserts with staged set-based writes

Problem:

EventRepository.batchInsert(...) originally processed events one by one despite the batch API
this caused one insert/query cycle per event and poor throughput

Fix:

stage a whole ingest batch into eventhub_event_import_stage
reserve event_source_record rows set-wise
insert eventhub.event rows set-wise
upsert eventhub.event_detail rows in batch

Files:

src/main/java/at/procon/eventhub/persistence/EventRepository.java

2.2 Fix missing event rows when source records were reserved

Problem:

after the set-based refactor, some runs created event_source_record rows without creating event rows

Cause:

the insert statement reserved source records and then tried to re-read them through the base table in the same data-modifying CTE chain

Fix:

use the RETURNING rows from the source-record reservation CTE directly
also support already-existing source records that still miss the event row

Files:

src/main/java/at/procon/eventhub/persistence/EventRepository.java

2.3 Stream extraction instead of materializing full result sets

Problem:

extraction loaded full SQL chunks into memory before handing them to Camel

Fix:

stream rows directly from JDBC to direct:eventhub-normalized-input
keep only counters and watermark information

Files:

src/main/java/at/procon/eventhub/importing/extraction/AbstractJdbcExtractionBatchExecutor.java
src/main/java/at/procon/eventhub/tachograph/service/JdbcTachographExtractionBatchExecutor.java

2.4 Increase batch size and enable parallel queue draining

Problem:

the async ingest route drained too slowly with 1000-event batches and a single consumer

Fix:

raise Camel completion size from 1000 to 5000
enable 4 concurrent SEDA consumers

Files:

src/main/java/at/procon/eventhub/config/EventHubProperties.java
src/main/java/at/procon/eventhub/camel/EventHubCommonIngestionRoute.java
src/main/resources/application.yml

2.5 Give each Camel flush its own package key

Problem:

multiple flushes of the same extraction package reused the same data_package identity
logs were misleading and event_count on the package row was overwritten by later flushes

Fix:

derive a unique packageKey per completed Camel batch using the aggregate package key plus the Camel exchange id
preserve both the aggregate key and the child key in metadata

Files:

src/main/java/at/procon/eventhub/camel/EventHubBatchBuildProcessor.java

2.6 Improve batch-local entity and vehicle caching

Problem:

after the bulk insert refactor, the main remaining hot path was still reference resolution

Fixes:

cache entity ids in the batch by entityType + sourceEntityId
cache vehicle resolutions inside a batch
later extend vehicle caching to be range-aware for registration-based assignment lookups:
- direct vehicle identifiers cache without time sensitivity
- registration-based resolutions cache over assignment validity intervals

Files:

src/main/java/at/procon/eventhub/persistence/EventRepository.java
src/main/java/at/procon/eventhub/persistence/VehicleIdentityRepository.java

3. Master-Data Import Throughput Fixes

3.1 Set-based master entity and relation upserts

Problem:

source master data was previously written row by row

Fix:

stage master entities and relations into temporary tables
run set-based insert ... select ... on conflict do update

Files:

src/main/java/at/procon/eventhub/persistence/SourceMasterDataRepository.java

3.2 Stream and chunk master-data refresh

Problem:

the refresh path loaded large source master-data result sets into memory

Fix:

stream source rows
flush in chunks of 5000

Files:

src/main/java/at/procon/eventhub/tachograph/service/TachographMasterDataRefreshService.java

3.3 Bulk vehicle reconciliation from master data

Problem:

reconciling vehicles and registrations from master data was done row by row

Fix:

replace the loop with set-based SQL for vehicles, registrations, and projected assignments

Files:

src/main/java/at/procon/eventhub/persistence/VehicleIdentityRepository.java

4. Deadlock and Contention Fixes

4.1 Remove unnecessary hot-row updates on vehicle and registration rows

Problem:

event import updated vehicle.updated_at and vehicle_registration.updated_at even when no new information was being added
this created deadlocks under parallel ingest

Fix:

only update vehicle when missing source_vehicle_entity_id or vin can actually be filled
only update vehicle_registration when missing source id, nation, or registration number can actually be filled
stop using event import as a generic "touch row" path

Files:

src/main/java/at/procon/eventhub/persistence/VehicleIdentityRepository.java

4.2 Make event-time source master entity resolution "find or create", not "update on conflict"

Problem:

concurrent event batches could deadlock on eventhub.source_master_entity through INSERT ... ON CONFLICT DO UPDATE

Fix:

first SELECT id
if missing, INSERT ... ON CONFLICT DO NOTHING RETURNING id
if another transaction won the race, select again
do not update existing master entity rows during event ingest

Files:

src/main/java/at/procon/eventhub/persistence/SourceMasterDataRepository.java

4.3 Fix race handling when `RETURNING` returns no row

Problem:

if a concurrent transaction inserted the entity first, the resolver could still fail unexpectedly

Fix:

allow the RETURNING path to yield null
retry with a follow-up SELECT

Files:

src/main/java/at/procon/eventhub/persistence/SourceMasterDataRepository.java

5. Transaction Visibility and Correctness Fixes

5.1 Remove outer transaction around full tachograph execution

Problem:

master-data refresh logs showed completion, but master-data rows were not visible yet because the outer import method still held the transaction open

Fix:

remove the outer transaction from startAndExecuteImport(...)
keep chunk-level and package-level transactions independent

Files:

src/main/java/at/procon/eventhub/tachograph/service/TachographImportExecutionService.java

5.2 Preserve the original ingest exception if package failure marking also fails

Problem:

when ingest failed and markFailed(...) also failed because of a broken connection, the secondary bookkeeping error hid the real root cause

Fix:

wrap dataPackageRepository.markFailed(...) in its own try/catch
log the bookkeeping failure
keep the original ingest exception and attach the bookkeeping failure as suppressed

Files:

src/main/java/at/procon/eventhub/service/EventHubIngestionService.java

5.3 Do not advance import cursors before async ingest really finishes

Problem:

extraction previously marked packages imported and advanced import_cursor before the async CAMEL_BATCH ingest was durably finished
this could skip source data on the next run if async ingest later failed

Fix:

add grouped child-batch status lookup on data_package
make extraction package completion wait for all derived CAMEL_BATCH rows to reach terminal success
fail the planned extraction package if child batches fail or time out
only advance the cursor after the async ingest succeeded
make "Completed import run" mean durable ingest completion instead of extraction completion

Files:

src/main/java/at/procon/eventhub/importing/AbstractImportExecutionService.java
src/main/java/at/procon/eventhub/persistence/DataPackageRepository.java
src/main/java/at/procon/eventhub/tachograph/service/TachographImportExecutionService.java

6. Connection Reset and Retry Hardening

6.1 Retry transient DB failures in the Camel ingest route

Problem:

long-running imports hit transient failures such as deadlocks and connection resets

Fix:

add Camel redelivery with exponential backoff for:
- CannotAcquireLockException
- PessimisticLockingFailureException
- DataAccessResourceFailureException
- TransientDataAccessException

Files:

src/main/java/at/procon/eventhub/camel/EventHubCommonIngestionRoute.java

6.2 Tune Hikari for shorter-lived and healthier pooled connections

Problem:

SQLSTATE 08006 / Connection reset events left broken pool entries behind

Fix:

configure Hikari with explicit pool sizing and connection lifetime / keepalive settings:
- maximum-pool-size: 16
- minimum-idle: 4
- connection-timeout: 30000
- validation-timeout: 5000
- idle-timeout: 300000
- keepalive-time: 120000
- max-lifetime: 540000

Files:

src/main/resources/application.yml

7. Observability Improvements

7.1 Master-data progress logging

Added logs for:

refresh start
per-section progress
per-chunk counts
byType breakdowns
section completion
reconciliation start and result

Files:

src/main/java/at/procon/eventhub/tachograph/service/TachographMasterDataRefreshService.java

7.2 Event extraction progress logging

Added logs for:

extraction start
progress every 5000 mapped events
final mapped totals with byType

Files:

src/main/java/at/procon/eventhub/importing/extraction/AbstractJdbcExtractionBatchExecutor.java

7.3 Event ingest throughput logging

Added logs for:

receivedCount
insertedCount
elapsedMs
receivedPerSecond
byType

Files:

src/main/java/at/procon/eventhub/service/EventHubIngestionService.java

7.4 Async-ingest wait progress logging

Added logs for:

number of expected child batches
observed child batches
successful / failed / importing child-batch counts while the import executor waits for durable completion

Files:

src/main/java/at/procon/eventhub/importing/AbstractImportExecutionService.java

8. Operational Notes

Throughput effect seen during the optimization round

Observed progression during the work:

roughly 30 events/sec before the later cache and blocking fixes
roughly 300 rows/sec after the main contention and stuck-session cleanup work

This is a major improvement, but large historical backfills are still expensive.

What remains expensive

The main remaining bottleneck is still reference resolution in the ingest hot path, especially:

driver entity resolution
source-package entity resolution
vehicle / registration lookup and creation

The next major optimization step would be set-based pre-resolution of references per ingest batch instead of resolving them one event at a time.

Safe rerun behavior

event ingest remains idempotent through event_source_record.source_record_key_hash
already imported events should generally be kept
when historical cursor corruption existed, repair should target import_cursor, not wholesale deletion of imported events

9. Main Files Touched

src/main/java/at/procon/eventhub/persistence/EventRepository.java
src/main/java/at/procon/eventhub/persistence/VehicleIdentityRepository.java
src/main/java/at/procon/eventhub/persistence/SourceMasterDataRepository.java
src/main/java/at/procon/eventhub/persistence/DataPackageRepository.java
src/main/java/at/procon/eventhub/service/EventHubIngestionService.java
src/main/java/at/procon/eventhub/camel/EventHubCommonIngestionRoute.java
src/main/java/at/procon/eventhub/camel/EventHubBatchBuildProcessor.java
src/main/java/at/procon/eventhub/importing/extraction/AbstractJdbcExtractionBatchExecutor.java
src/main/java/at/procon/eventhub/importing/AbstractImportExecutionService.java
src/main/java/at/procon/eventhub/tachograph/service/JdbcTachographExtractionBatchExecutor.java
src/main/java/at/procon/eventhub/tachograph/service/TachographMasterDataRefreshService.java
src/main/java/at/procon/eventhub/tachograph/service/TachographImportExecutionService.java
src/main/java/at/procon/eventhub/config/EventHubProperties.java
src/main/resources/application.yml
src/main/resources/db/eventhub_schema_create.sql
src/main/resources/db/migration/V9__add_event_source_package_id.sql
src/main/resources/db/migration/V10__make_event_hypertable.sql
src/main/resources/db/migration/V11__ensure_event_source_record.sql

14 KiB Raw Blame History

Import Performance and Error Fixes

Scope

1. Schema and Migration Fixes

1.1 event_detail / hypertable ordering

1.2 Explicit migrations for event hypertable and source-record support

2. Event Import Throughput Fixes

2.1 Replace per-event inserts with staged set-based writes

2.2 Fix missing event rows when source records were reserved

2.3 Stream extraction instead of materializing full result sets

2.4 Increase batch size and enable parallel queue draining

2.5 Give each Camel flush its own package key

2.6 Improve batch-local entity and vehicle caching

3. Master-Data Import Throughput Fixes

3.1 Set-based master entity and relation upserts

3.2 Stream and chunk master-data refresh

3.3 Bulk vehicle reconciliation from master data

4. Deadlock and Contention Fixes

4.1 Remove unnecessary hot-row updates on vehicle and registration rows

4.2 Make event-time source master entity resolution "find or create", not "update on conflict"

4.3 Fix race handling when RETURNING returns no row

5. Transaction Visibility and Correctness Fixes

5.1 Remove outer transaction around full tachograph execution

5.2 Preserve the original ingest exception if package failure marking also fails

5.3 Do not advance import cursors before async ingest really finishes

6. Connection Reset and Retry Hardening

6.1 Retry transient DB failures in the Camel ingest route

6.2 Tune Hikari for shorter-lived and healthier pooled connections

7. Observability Improvements

7.1 Master-data progress logging

7.2 Event extraction progress logging

7.3 Event ingest throughput logging

7.4 Async-ingest wait progress logging

8. Operational Notes

Throughput effect seen during the optimization round

What remains expensive

Safe rerun behavior

9. Main Files Touched

14 KiB

Raw Blame History

1.1 `event_detail` / hypertable ordering

4.3 Fix race handling when `RETURNING` returns no row