eventhub/docs/import-performance-and-erro...

460 lines
14 KiB
Markdown

# Import Performance and Error Fixes
## Scope
This document summarizes the event-ingest, master-data, schema, locking, retry, and correctness fixes made in the current optimization round for the EventHub import pipeline.
It covers:
- event schema and migration fixes
- event import throughput fixes
- master-data import throughput fixes
- deadlock and transaction-visibility fixes
- connection-reset and retry handling
- async import cursor correctness fixes
- logging and operational visibility improvements
Main committed changes:
- `2e6e1aa` - `Optimize ingestion pipeline and reduce import contention`
- `bd3620b` - `Improve vehicle reference caching during ingest`
Additional related fixes are currently present in the workspace but may not yet be committed.
## 1. Schema and Migration Fixes
### 1.1 `event_detail` / hypertable ordering
Problem:
- executing `eventhub_schema_create.sql` on an empty database failed with `relation "eventhub.event_detail" does not exist`
Fix:
- create `eventhub.event_detail` before `create_hypertable(...)`
- add its foreign key after the hypertable conversion
Files:
- `src/main/resources/db/eventhub_schema_create.sql`
### 1.2 Explicit migrations for event hypertable and source-record support
Problem:
- the runtime schema evolution needed explicit migrations for hypertable conversion and `event_source_record`
Fix:
- add migration for `source_package_id` on `event`
- add migration for `event` hypertable conversion and FK recreation
- add migration to ensure `event_source_record` exists and is backfilled
Files:
- `src/main/resources/db/migration/V9__add_event_source_package_id.sql`
- `src/main/resources/db/migration/V10__make_event_hypertable.sql`
- `src/main/resources/db/migration/V11__ensure_event_source_record.sql`
## 2. Event Import Throughput Fixes
### 2.1 Replace per-event inserts with staged set-based writes
Problem:
- `EventRepository.batchInsert(...)` originally processed events one by one despite the batch API
- this caused one insert/query cycle per event and poor throughput
Fix:
- stage a whole ingest batch into `eventhub_event_import_stage`
- reserve `event_source_record` rows set-wise
- insert `eventhub.event` rows set-wise
- upsert `eventhub.event_detail` rows in batch
Files:
- `src/main/java/at/procon/eventhub/persistence/EventRepository.java`
### 2.2 Fix missing event rows when source records were reserved
Problem:
- after the set-based refactor, some runs created `event_source_record` rows without creating `event` rows
Cause:
- the insert statement reserved source records and then tried to re-read them through the base table in the same data-modifying CTE chain
Fix:
- use the `RETURNING` rows from the source-record reservation CTE directly
- also support already-existing source records that still miss the `event` row
Files:
- `src/main/java/at/procon/eventhub/persistence/EventRepository.java`
### 2.3 Stream extraction instead of materializing full result sets
Problem:
- extraction loaded full SQL chunks into memory before handing them to Camel
Fix:
- stream rows directly from JDBC to `direct:eventhub-normalized-input`
- keep only counters and watermark information
Files:
- `src/main/java/at/procon/eventhub/importing/extraction/AbstractJdbcExtractionBatchExecutor.java`
- `src/main/java/at/procon/eventhub/tachograph/service/JdbcTachographExtractionBatchExecutor.java`
### 2.4 Increase batch size and enable parallel queue draining
Problem:
- the async ingest route drained too slowly with `1000`-event batches and a single consumer
Fix:
- raise Camel completion size from `1000` to `5000`
- enable `4` concurrent SEDA consumers
Files:
- `src/main/java/at/procon/eventhub/config/EventHubProperties.java`
- `src/main/java/at/procon/eventhub/camel/EventHubCommonIngestionRoute.java`
- `src/main/resources/application.yml`
### 2.5 Give each Camel flush its own package key
Problem:
- multiple flushes of the same extraction package reused the same `data_package` identity
- logs were misleading and `event_count` on the package row was overwritten by later flushes
Fix:
- derive a unique `packageKey` per completed Camel batch using the aggregate package key plus the Camel exchange id
- preserve both the aggregate key and the child key in metadata
Files:
- `src/main/java/at/procon/eventhub/camel/EventHubBatchBuildProcessor.java`
### 2.6 Improve batch-local entity and vehicle caching
Problem:
- after the bulk insert refactor, the main remaining hot path was still reference resolution
Fixes:
- cache entity ids in the batch by `entityType + sourceEntityId`
- cache vehicle resolutions inside a batch
- later extend vehicle caching to be range-aware for registration-based assignment lookups:
- direct vehicle identifiers cache without time sensitivity
- registration-based resolutions cache over assignment validity intervals
Files:
- `src/main/java/at/procon/eventhub/persistence/EventRepository.java`
- `src/main/java/at/procon/eventhub/persistence/VehicleIdentityRepository.java`
## 3. Master-Data Import Throughput Fixes
### 3.1 Set-based master entity and relation upserts
Problem:
- source master data was previously written row by row
Fix:
- stage master entities and relations into temporary tables
- run set-based `insert ... select ... on conflict do update`
Files:
- `src/main/java/at/procon/eventhub/persistence/SourceMasterDataRepository.java`
### 3.2 Stream and chunk master-data refresh
Problem:
- the refresh path loaded large source master-data result sets into memory
Fix:
- stream source rows
- flush in chunks of `5000`
Files:
- `src/main/java/at/procon/eventhub/tachograph/service/TachographMasterDataRefreshService.java`
### 3.3 Bulk vehicle reconciliation from master data
Problem:
- reconciling vehicles and registrations from master data was done row by row
Fix:
- replace the loop with set-based SQL for vehicles, registrations, and projected assignments
Files:
- `src/main/java/at/procon/eventhub/persistence/VehicleIdentityRepository.java`
## 4. Deadlock and Contention Fixes
### 4.1 Remove unnecessary hot-row updates on vehicle and registration rows
Problem:
- event import updated `vehicle.updated_at` and `vehicle_registration.updated_at` even when no new information was being added
- this created deadlocks under parallel ingest
Fix:
- only update `vehicle` when missing `source_vehicle_entity_id` or `vin` can actually be filled
- only update `vehicle_registration` when missing source id, nation, or registration number can actually be filled
- stop using event import as a generic "touch row" path
Files:
- `src/main/java/at/procon/eventhub/persistence/VehicleIdentityRepository.java`
### 4.2 Make event-time source master entity resolution "find or create", not "update on conflict"
Problem:
- concurrent event batches could deadlock on `eventhub.source_master_entity` through `INSERT ... ON CONFLICT DO UPDATE`
Fix:
- first `SELECT id`
- if missing, `INSERT ... ON CONFLICT DO NOTHING RETURNING id`
- if another transaction won the race, select again
- do not update existing master entity rows during event ingest
Files:
- `src/main/java/at/procon/eventhub/persistence/SourceMasterDataRepository.java`
### 4.3 Fix race handling when `RETURNING` returns no row
Problem:
- if a concurrent transaction inserted the entity first, the resolver could still fail unexpectedly
Fix:
- allow the `RETURNING` path to yield `null`
- retry with a follow-up `SELECT`
Files:
- `src/main/java/at/procon/eventhub/persistence/SourceMasterDataRepository.java`
## 5. Transaction Visibility and Correctness Fixes
### 5.1 Remove outer transaction around full tachograph execution
Problem:
- master-data refresh logs showed completion, but master-data rows were not visible yet because the outer import method still held the transaction open
Fix:
- remove the outer transaction from `startAndExecuteImport(...)`
- keep chunk-level and package-level transactions independent
Files:
- `src/main/java/at/procon/eventhub/tachograph/service/TachographImportExecutionService.java`
### 5.2 Preserve the original ingest exception if package failure marking also fails
Problem:
- when ingest failed and `markFailed(...)` also failed because of a broken connection, the secondary bookkeeping error hid the real root cause
Fix:
- wrap `dataPackageRepository.markFailed(...)` in its own `try/catch`
- log the bookkeeping failure
- keep the original ingest exception and attach the bookkeeping failure as suppressed
Files:
- `src/main/java/at/procon/eventhub/service/EventHubIngestionService.java`
### 5.3 Do not advance import cursors before async ingest really finishes
Problem:
- extraction previously marked packages imported and advanced `import_cursor` before the async `CAMEL_BATCH` ingest was durably finished
- this could skip source data on the next run if async ingest later failed
Fix:
- add grouped child-batch status lookup on `data_package`
- make extraction package completion wait for all derived `CAMEL_BATCH` rows to reach terminal success
- fail the planned extraction package if child batches fail or time out
- only advance the cursor after the async ingest succeeded
- make "Completed import run" mean durable ingest completion instead of extraction completion
Files:
- `src/main/java/at/procon/eventhub/importing/AbstractImportExecutionService.java`
- `src/main/java/at/procon/eventhub/persistence/DataPackageRepository.java`
- `src/main/java/at/procon/eventhub/tachograph/service/TachographImportExecutionService.java`
## 6. Connection Reset and Retry Hardening
### 6.1 Retry transient DB failures in the Camel ingest route
Problem:
- long-running imports hit transient failures such as deadlocks and connection resets
Fix:
- add Camel redelivery with exponential backoff for:
- `CannotAcquireLockException`
- `PessimisticLockingFailureException`
- `DataAccessResourceFailureException`
- `TransientDataAccessException`
Files:
- `src/main/java/at/procon/eventhub/camel/EventHubCommonIngestionRoute.java`
### 6.2 Tune Hikari for shorter-lived and healthier pooled connections
Problem:
- `SQLSTATE 08006` / `Connection reset` events left broken pool entries behind
Fix:
- configure Hikari with explicit pool sizing and connection lifetime / keepalive settings:
- `maximum-pool-size: 16`
- `minimum-idle: 4`
- `connection-timeout: 30000`
- `validation-timeout: 5000`
- `idle-timeout: 300000`
- `keepalive-time: 120000`
- `max-lifetime: 540000`
Files:
- `src/main/resources/application.yml`
## 7. Observability Improvements
### 7.1 Master-data progress logging
Added logs for:
- refresh start
- per-section progress
- per-chunk counts
- `byType` breakdowns
- section completion
- reconciliation start and result
Files:
- `src/main/java/at/procon/eventhub/tachograph/service/TachographMasterDataRefreshService.java`
### 7.2 Event extraction progress logging
Added logs for:
- extraction start
- progress every `5000` mapped events
- final mapped totals with `byType`
Files:
- `src/main/java/at/procon/eventhub/importing/extraction/AbstractJdbcExtractionBatchExecutor.java`
### 7.3 Event ingest throughput logging
Added logs for:
- `receivedCount`
- `insertedCount`
- `elapsedMs`
- `receivedPerSecond`
- `byType`
Files:
- `src/main/java/at/procon/eventhub/service/EventHubIngestionService.java`
### 7.4 Async-ingest wait progress logging
Added logs for:
- number of expected child batches
- observed child batches
- successful / failed / importing child-batch counts while the import executor waits for durable completion
Files:
- `src/main/java/at/procon/eventhub/importing/AbstractImportExecutionService.java`
## 8. Operational Notes
### Throughput effect seen during the optimization round
Observed progression during the work:
- roughly `30` events/sec before the later cache and blocking fixes
- roughly `300` rows/sec after the main contention and stuck-session cleanup work
This is a major improvement, but large historical backfills are still expensive.
### What remains expensive
The main remaining bottleneck is still reference resolution in the ingest hot path, especially:
- driver entity resolution
- source-package entity resolution
- vehicle / registration lookup and creation
The next major optimization step would be set-based pre-resolution of references per ingest batch instead of resolving them one event at a time.
### Safe rerun behavior
- event ingest remains idempotent through `event_source_record.source_record_key_hash`
- already imported events should generally be kept
- when historical cursor corruption existed, repair should target `import_cursor`, not wholesale deletion of imported events
## 9. Main Files Touched
- `src/main/java/at/procon/eventhub/persistence/EventRepository.java`
- `src/main/java/at/procon/eventhub/persistence/VehicleIdentityRepository.java`
- `src/main/java/at/procon/eventhub/persistence/SourceMasterDataRepository.java`
- `src/main/java/at/procon/eventhub/persistence/DataPackageRepository.java`
- `src/main/java/at/procon/eventhub/service/EventHubIngestionService.java`
- `src/main/java/at/procon/eventhub/camel/EventHubCommonIngestionRoute.java`
- `src/main/java/at/procon/eventhub/camel/EventHubBatchBuildProcessor.java`
- `src/main/java/at/procon/eventhub/importing/extraction/AbstractJdbcExtractionBatchExecutor.java`
- `src/main/java/at/procon/eventhub/importing/AbstractImportExecutionService.java`
- `src/main/java/at/procon/eventhub/tachograph/service/JdbcTachographExtractionBatchExecutor.java`
- `src/main/java/at/procon/eventhub/tachograph/service/TachographMasterDataRefreshService.java`
- `src/main/java/at/procon/eventhub/tachograph/service/TachographImportExecutionService.java`
- `src/main/java/at/procon/eventhub/config/EventHubProperties.java`
- `src/main/resources/application.yml`
- `src/main/resources/db/eventhub_schema_create.sql`
- `src/main/resources/db/migration/V9__add_event_source_package_id.sql`
- `src/main/resources/db/migration/V10__make_event_hypertable.sql`
- `src/main/resources/db/migration/V11__ensure_event_source_record.sql`