# Import Performance and Error Fixes

## Scope

This document summarizes the event-ingest, master-data, schema, locking, retry, and correctness fixes made in the current optimization round for the EventHub import pipeline.

It covers:

- event schema and migration fixes
- event import throughput fixes
- master-data import throughput fixes
- deadlock and transaction-visibility fixes
- connection-reset and retry handling
- async import cursor correctness fixes
- logging and operational visibility improvements

Main committed changes:

- `2e6e1aa` - `Optimize ingestion pipeline and reduce import contention`
- `bd3620b` - `Improve vehicle reference caching during ingest`

Additional related fixes are currently present in the workspace but may not yet be committed.

## 1. Schema and Migration Fixes

### 1.1 `event_detail` / hypertable ordering

Problem:

- executing `eventhub_schema_create.sql` on an empty database failed with `relation "eventhub.event_detail" does not exist`

Fix:

- create `eventhub.event_detail` before `create_hypertable(...)`
- add its foreign key after the hypertable conversion

Files:

- `src/main/resources/db/eventhub_schema_create.sql`

### 1.2 Explicit migrations for event hypertable and source-record support

Problem:

- the runtime schema evolution needed explicit migrations for hypertable conversion and `event_source_record`

Fix:

- add migration for `source_package_id` on `event`
- add migration for `event` hypertable conversion and FK recreation
- add migration to ensure `event_source_record` exists and is backfilled

Files:

- `src/main/resources/db/migration/V9__add_event_source_package_id.sql`
- `src/main/resources/db/migration/V10__make_event_hypertable.sql`
- `src/main/resources/db/migration/V11__ensure_event_source_record.sql`

## 2. Event Import Throughput Fixes

### 2.1 Replace per-event inserts with staged set-based writes

Problem:

- `EventRepository.batchInsert(...)` originally processed events one by one despite the batch API
- this caused one insert/query cycle per event and poor throughput

Fix:

- stage a whole ingest batch into `eventhub_event_import_stage`
- reserve `event_source_record` rows set-wise
- insert `eventhub.event` rows set-wise
- upsert `eventhub.event_detail` rows in batch

Files:

- `src/main/java/at/procon/eventhub/persistence/EventRepository.java`

### 2.2 Fix missing event rows when source records were reserved

Problem:

- after the set-based refactor, some runs created `event_source_record` rows without creating `event` rows

Cause:

- the insert statement reserved source records and then tried to re-read them through the base table in the same data-modifying CTE chain
- rows written by a data-modifying CTE are not visible through the base table within the same statement, so the re-read found nothing

Fix:

- use the `RETURNING` rows from the source-record reservation CTE directly
- also support already-existing source records that still miss the `event` row

Files:

- `src/main/java/at/procon/eventhub/persistence/EventRepository.java`

### 2.3 Stream extraction instead of materializing full result sets

Problem:

- extraction loaded full SQL chunks into memory before handing them to Camel

Fix:

- stream rows directly from JDBC to `direct:eventhub-normalized-input`
- keep only counters and watermark information

Files:

- `src/main/java/at/procon/eventhub/importing/extraction/AbstractJdbcExtractionBatchExecutor.java`
- `src/main/java/at/procon/eventhub/tachograph/service/JdbcTachographExtractionBatchExecutor.java`

### 2.4 Increase batch size and enable parallel queue draining

Problem:

- the async ingest route drained too slowly with `1000`-event batches and a single consumer

Fix:

- raise Camel completion size from `1000` to `5000`
- enable `4` concurrent SEDA consumers

Files:

- `src/main/java/at/procon/eventhub/config/EventHubProperties.java`
- `src/main/java/at/procon/eventhub/camel/EventHubCommonIngestionRoute.java`
- `src/main/resources/application.yml`

### 2.5 Give each Camel flush its own package key

Problem:

- multiple flushes of the same extraction package reused the same `data_package` identity
- logs were misleading and `event_count` on the package row was overwritten by later flushes

Fix:

- derive a unique `packageKey` per completed Camel batch using the aggregate package key plus the Camel exchange id
- preserve both the aggregate key and the child key in metadata

Files:

- `src/main/java/at/procon/eventhub/camel/EventHubBatchBuildProcessor.java`

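
The key derivation in 2.5 can be sketched roughly as follows. This is only an illustration: the class name `PackageKeys`, the `:` separator, and the metadata field names are assumptions and are not taken from `EventHubBatchBuildProcessor`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper; names and the ':' separator are assumptions for illustration.
final class PackageKeys {
    private PackageKeys() {}

    // Combine the aggregate package key with the Camel exchange id so that
    // every completed batch gets its own data_package identity.
    static String childKey(String aggregateKey, String exchangeId) {
        Objects.requireNonNull(aggregateKey, "aggregateKey");
        Objects.requireNonNull(exchangeId, "exchangeId");
        return aggregateKey + ":" + exchangeId;
    }

    // Keep both keys in the metadata so child batches can be grouped
    // under their aggregate package later.
    static Map<String, String> metadata(String aggregateKey, String exchangeId) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("aggregatePackageKey", aggregateKey);
        meta.put("packageKey", childKey(aggregateKey, exchangeId));
        return meta;
    }
}
```

Because the exchange id differs for every completed aggregation, two flushes of the same extraction package no longer overwrite each other's `event_count`.
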
### 2.6 Improve batch-local entity and vehicle caching

Problem:

- after the bulk insert refactor, the main remaining hot path was still reference resolution

Fixes:

- cache entity ids in the batch by `entityType + sourceEntityId`
- cache vehicle resolutions inside a batch
- later extend vehicle caching to be range-aware for registration-based assignment lookups:
  - direct vehicle identifiers are cached without time sensitivity
  - registration-based resolutions are cached per assignment validity interval

Files:

- `src/main/java/at/procon/eventhub/persistence/EventRepository.java`
- `src/main/java/at/procon/eventhub/persistence/VehicleIdentityRepository.java`

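
A range-aware cache of the kind described in 2.6 might look like the sketch below. All names (`VehicleResolutionCache`, `Assignment`, the method names) are hypothetical, and the half-open interval `[validFrom, validTo)` is an assumption about how assignment validity is modelled.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical batch-local cache; all names here are assumptions for illustration.
final class VehicleResolutionCache {
    // One cached registration-based resolution, valid for [validFrom, validTo).
    record Assignment(long vehicleId, Instant validFrom, Instant validTo) {
        boolean covers(Instant at) {
            return !at.isBefore(validFrom) && at.isBefore(validTo);
        }
    }

    private final Map<String, Long> byDirectId = new HashMap<>();
    private final Map<String, List<Assignment>> byRegistration = new HashMap<>();

    void putDirect(String vehicleIdentifier, long vehicleId) {
        byDirectId.put(vehicleIdentifier, vehicleId);
    }

    void putAssignment(String registration, Assignment assignment) {
        byRegistration.computeIfAbsent(registration, k -> new ArrayList<>()).add(assignment);
    }

    // Direct identifiers do not depend on the event time.
    Optional<Long> findDirect(String vehicleIdentifier) {
        return Optional.ofNullable(byDirectId.get(vehicleIdentifier));
    }

    // Registration lookups hit only if the event time falls inside a cached interval.
    Optional<Long> findByRegistration(String registration, Instant eventTime) {
        return byRegistration.getOrDefault(registration, List.of()).stream()
                .filter(a -> a.covers(eventTime))
                .map(Assignment::vehicleId)
                .findFirst();
    }
}
```

A miss on `findByRegistration` means the event time lies outside every cached interval, so the caller falls back to the database and caches the returned assignment.
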
## 3. Master-Data Import Throughput Fixes

### 3.1 Set-based master entity and relation upserts

Problem:

- source master data was previously written row by row

Fix:

- stage master entities and relations into temporary tables
- run set-based `insert ... select ... on conflict do update`

Files:

- `src/main/java/at/procon/eventhub/persistence/SourceMasterDataRepository.java`

### 3.2 Stream and chunk master-data refresh

Problem:

- the refresh path loaded large source master-data result sets into memory

Fix:

- stream source rows
- flush in chunks of `5000`

Files:

- `src/main/java/at/procon/eventhub/tachograph/service/TachographMasterDataRefreshService.java`

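
The stream-and-chunk pattern from 3.2 can be illustrated with a small buffer that hands full chunks to a writer. `ChunkedWriter` and its methods are hypothetical names; the real service may structure this differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical chunk buffer; class and method names are assumptions.
final class ChunkedWriter<T> {
    private final int chunkSize;
    private final Consumer<List<T>> flusher;
    private final List<T> buffer;

    ChunkedWriter(int chunkSize, Consumer<List<T>> flusher) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        this.flusher = flusher;
        this.buffer = new ArrayList<>(chunkSize);
    }

    // Called once per streamed source row; memory stays bounded by chunkSize.
    void accept(T row) {
        buffer.add(row);
        if (buffer.size() >= chunkSize) {
            flush();
        }
    }

    // Must also be called after the last row to write the final partial chunk.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flusher.accept(new ArrayList<>(buffer)); // hand over a copy, then reuse the buffer
        buffer.clear();
    }
}
```

With a chunk size of `5000`, streaming 12,000 rows produces three flushes (5000, 5000, 2000), provided `flush()` is called once more after the stream ends.
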
### 3.3 Bulk vehicle reconciliation from master data

Problem:

- reconciling vehicles and registrations from master data was done row by row

Fix:

- replace the loop with set-based SQL for vehicles, registrations, and projected assignments

Files:

- `src/main/java/at/procon/eventhub/persistence/VehicleIdentityRepository.java`

## 4. Deadlock and Contention Fixes

### 4.1 Remove unnecessary hot-row updates on vehicle and registration rows

Problem:

- event import updated `vehicle.updated_at` and `vehicle_registration.updated_at` even when no new information was being added
- this created deadlocks under parallel ingest

Fix:

- only update `vehicle` when missing `source_vehicle_entity_id` or `vin` can actually be filled
- only update `vehicle_registration` when missing source id, nation, or registration number can actually be filled
- stop using event import as a generic "touch row" path

Files:

- `src/main/java/at/procon/eventhub/persistence/VehicleIdentityRepository.java`

### 4.2 Make event-time source master entity resolution "find or create", not "update on conflict"

Problem:

- concurrent event batches could deadlock on `eventhub.source_master_entity` through `INSERT ... ON CONFLICT DO UPDATE`

Fix:

- first `SELECT id`
- if missing, `INSERT ... ON CONFLICT DO NOTHING RETURNING id`
- if another transaction won the race, select again
- do not update existing master entity rows during event ingest

Files:

- `src/main/java/at/procon/eventhub/persistence/SourceMasterDataRepository.java`

### 4.3 Fix race handling when `RETURNING` returns no row

Problem:

- if a concurrent transaction inserted the entity first, the resolver could still fail unexpectedly

Fix:

- allow the `RETURNING` path to yield `null`
- retry with a follow-up `SELECT`

Files:

- `src/main/java/at/procon/eventhub/persistence/SourceMasterDataRepository.java`

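
The control flow of 4.2 and 4.3 together can be sketched as below. The `EntityStore` interface is a hypothetical stand-in for the two SQL statements, and `Optional` is used here in place of the nullable result mentioned above; neither reflects the actual `SourceMasterDataRepository` API.

```java
import java.util.Optional;

// Hypothetical resolver; EntityStore stands in for the SQL statements.
final class FindOrCreate {
    interface EntityStore {
        // SELECT id ... WHERE <natural key> = ?
        Optional<Long> selectId(String key);

        // INSERT ... ON CONFLICT DO NOTHING RETURNING id
        // Empty when the conflict clause suppressed the insert.
        Optional<Long> insertIfAbsent(String key);
    }

    private FindOrCreate() {}

    static long resolve(EntityStore store, String key) {
        Optional<Long> existing = store.selectId(key);
        if (existing.isPresent()) {
            return existing.get();
        }
        Optional<Long> inserted = store.insertIfAbsent(key);
        if (inserted.isPresent()) {
            return inserted.get();
        }
        // RETURNING produced no row: another transaction inserted the key first.
        return store.selectId(key).orElseThrow(
                () -> new IllegalStateException("entity not visible after conflict: " + key));
    }
}
```

The follow-up `SELECT` relies on the competing row being visible to the next statement, which holds under PostgreSQL's default `READ COMMITTED` isolation; whether the ingest transactions use that level is an assumption here. Existing rows are never updated on this path, so concurrent batches do not take row locks on shared master entities.
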
## 5. Transaction Visibility and Correctness Fixes

### 5.1 Remove outer transaction around full tachograph execution

Problem:

- master-data refresh logs showed completion, but master-data rows were not visible yet because the outer import method still held the transaction open

Fix:

- remove the outer transaction from `startAndExecuteImport(...)`
- keep chunk-level and package-level transactions independent

Files:

- `src/main/java/at/procon/eventhub/tachograph/service/TachographImportExecutionService.java`

### 5.2 Preserve the original ingest exception if package failure marking also fails

Problem:

- when ingest failed and `markFailed(...)` also failed because of a broken connection, the secondary bookkeeping error hid the real root cause

Fix:

- wrap `dataPackageRepository.markFailed(...)` in its own `try/catch`
- log the bookkeeping failure
- keep the original ingest exception and attach the bookkeeping failure as suppressed

Files:

- `src/main/java/at/procon/eventhub/service/EventHubIngestionService.java`

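
The exception handling from 5.2 follows a common Java pattern built on `Throwable.addSuppressed`. In the sketch below, `FailureMarker` is a hypothetical stand-in for `dataPackageRepository.markFailed(...)`, whose real signature is not shown in this document, and `System.err` stands in for the logger.

```java
// Hypothetical shape of the failure handling; FailureMarker stands in for
// dataPackageRepository.markFailed(...), whose real signature is an assumption.
final class FailurePreservation {
    interface FailureMarker {
        void markFailed(String packageKey, Throwable cause);
    }

    private FailurePreservation() {}

    // Runs the ingest step; on failure, tries to record the failure but never
    // lets a bookkeeping error replace the original exception.
    static void ingest(String packageKey, Runnable ingestStep, FailureMarker marker) {
        try {
            ingestStep.run();
        } catch (RuntimeException ingestFailure) {
            try {
                marker.markFailed(packageKey, ingestFailure);
            } catch (RuntimeException bookkeepingFailure) {
                // Log the secondary error, then attach it to the root cause.
                System.err.println("Could not mark package " + packageKey
                        + " as failed: " + bookkeepingFailure);
                ingestFailure.addSuppressed(bookkeepingFailure);
            }
            throw ingestFailure;
        }
    }
}
```

The caller, and any retry logic keyed on exception type, still sees the original ingest exception, while the bookkeeping failure remains available through `getSuppressed()` and in stack traces.
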
### 5.3 Do not advance import cursors before async ingest really finishes

Problem:

- extraction previously marked packages imported and advanced `import_cursor` before the async `CAMEL_BATCH` ingest was durably finished
- this could skip source data on the next run if async ingest later failed

Fix:

- add grouped child-batch status lookup on `data_package`
- make extraction package completion wait for all derived `CAMEL_BATCH` rows to reach terminal success
- fail the planned extraction package if child batches fail or time out
- only advance the cursor after the async ingest succeeded
- make "Completed import run" mean durable ingest completion instead of extraction completion

Files:

- `src/main/java/at/procon/eventhub/importing/AbstractImportExecutionService.java`
- `src/main/java/at/procon/eventhub/persistence/DataPackageRepository.java`
- `src/main/java/at/procon/eventhub/tachograph/service/TachographImportExecutionService.java`

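
The completion rule in 5.3 reduces to a small decision over grouped child-batch counts. The sketch below is an assumption-laden illustration: the status names `IMPORTED` and `FAILED`, the `Decision` enum, and the class name are all hypothetical and may not match the values stored in `data_package`.

```java
import java.util.Map;

// Hypothetical completion check; status names and the Decision enum are assumptions.
final class ChildBatchGate {
    enum Decision { WAIT, ADVANCE_CURSOR, FAIL_PACKAGE }

    private ChildBatchGate() {}

    // counts maps a child-batch status to the number of CAMEL_BATCH rows in that status.
    static Decision decide(int expectedChildren, Map<String, Integer> counts, boolean timedOut) {
        int succeeded = counts.getOrDefault("IMPORTED", 0);
        int failed = counts.getOrDefault("FAILED", 0);
        if (failed > 0) {
            return Decision.FAIL_PACKAGE;      // any failed child fails the extraction package
        }
        if (succeeded >= expectedChildren) {
            return Decision.ADVANCE_CURSOR;    // every child reached terminal success
        }
        // Children are still importing or not yet observed.
        return timedOut ? Decision.FAIL_PACKAGE : Decision.WAIT;
    }
}
```

Only `ADVANCE_CURSOR` allows `import_cursor` to move, so a failed or timed-out child batch leaves the source range eligible for the next run.
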
## 6. Connection Reset and Retry Hardening

### 6.1 Retry transient DB failures in the Camel ingest route

Problem:

- long-running imports hit transient failures such as deadlocks and connection resets

Fix:

- add Camel redelivery with exponential backoff for:
  - `CannotAcquireLockException`
  - `PessimisticLockingFailureException`
  - `DataAccessResourceFailureException`
  - `TransientDataAccessException`

Files:

- `src/main/java/at/procon/eventhub/camel/EventHubCommonIngestionRoute.java`

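
To show what exponential backoff means for the redelivery delays in 6.1, the sketch below computes a capped delay schedule in plain Java. The route itself uses Camel's redelivery support, not this class; the base delay, multiplier, cap, and number of redeliveries are assumed values, since the configured ones are not stated in this document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical backoff schedule; the real route relies on Camel's redelivery policy,
// and the numeric values used with this class are assumptions.
final class Backoff {
    private Backoff() {}

    // Delay before redelivery number `attempt` (1-based): base * multiplier^(attempt-1), capped.
    static long delayMillis(int attempt, long baseMillis, double multiplier, long maxMillis) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt starts at 1");
        }
        double delay = baseMillis * Math.pow(multiplier, attempt - 1);
        return (long) Math.min(delay, (double) maxMillis);
    }

    static List<Long> schedule(int maxRedeliveries, long baseMillis, double multiplier, long maxMillis) {
        List<Long> delays = new ArrayList<>();
        for (int attempt = 1; attempt <= maxRedeliveries; attempt++) {
            delays.add(delayMillis(attempt, baseMillis, multiplier, maxMillis));
        }
        return delays;
    }
}
```

For example, `Backoff.schedule(4, 500, 2.0, 3000)` yields delays of 500, 1000, 2000, and 3000 ms. Growing delays give a deadlock victim or a reset connection time to clear before the batch is retried.
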
### 6.2 Tune Hikari for shorter-lived and healthier pooled connections

Problem:

- `SQLSTATE 08006` / `Connection reset` events left broken pool entries behind

Fix:

- configure Hikari with explicit pool sizing and connection lifetime / keepalive settings:
  - `maximum-pool-size: 16`
  - `minimum-idle: 4`
  - `connection-timeout: 30000`
  - `validation-timeout: 5000`
  - `idle-timeout: 300000`
  - `keepalive-time: 120000`
  - `max-lifetime: 540000`

Files:

- `src/main/resources/application.yml`

## 7. Observability Improvements

### 7.1 Master-data progress logging

Added logs for:

- refresh start
- per-section progress
- per-chunk counts
- `byType` breakdowns
- section completion
- reconciliation start and result

Files:

- `src/main/java/at/procon/eventhub/tachograph/service/TachographMasterDataRefreshService.java`

### 7.2 Event extraction progress logging

Added logs for:

- extraction start
- progress every `5000` mapped events
- final mapped totals with `byType`

Files:

- `src/main/java/at/procon/eventhub/importing/extraction/AbstractJdbcExtractionBatchExecutor.java`

### 7.3 Event ingest throughput logging

Added logs for:

- `receivedCount`
- `insertedCount`
- `elapsedMs`
- `receivedPerSecond`
- `byType`

Files:

- `src/main/java/at/procon/eventhub/service/EventHubIngestionService.java`

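
How `receivedPerSecond` relates to `receivedCount` and `elapsedMs` is presumably a simple rate calculation. The sketch below shows one way to compute it, with a guard for very fast batches; the class name and the zero-elapsed behavior are assumptions, not the service's actual code.

```java
// Hypothetical rate helper; the real log fields may be computed differently.
final class Throughput {
    private Throughput() {}

    // Events per second from a count and an elapsed time in milliseconds.
    static double perSecond(long count, long elapsedMs) {
        if (elapsedMs <= 0) {
            return 0.0; // avoid division by zero when a batch completes within the same millisecond
        }
        return count * 1000.0 / elapsedMs;
    }
}
```

A batch of `5000` events ingested in `2500` ms therefore logs a rate of `2000.0` per second.
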
### 7.4 Async-ingest wait progress logging

Added logs for:

- number of expected child batches
- observed child batches
- successful / failed / importing child-batch counts while the import executor waits for durable completion

Files:

- `src/main/java/at/procon/eventhub/importing/AbstractImportExecutionService.java`

## 8. Operational Notes

### Throughput effect seen during the optimization round

Observed progression during the work:

- roughly `30` events/sec before the later cache and blocking fixes
- roughly `300` rows/sec after the main contention and stuck-session cleanup work

This is a major improvement, but large historical backfills are still expensive.

### What remains expensive

The main remaining bottleneck is still reference resolution in the ingest hot path, especially:

- driver entity resolution
- source-package entity resolution
- vehicle / registration lookup and creation

The next major optimization step would be set-based pre-resolution of references per ingest batch instead of resolving them one event at a time.

### Safe rerun behavior

- event ingest remains idempotent through `event_source_record.source_record_key_hash`
- already imported events should generally be kept
- where historical cursor corruption exists, repair should target `import_cursor` rather than wholesale deletion of imported events

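
The idempotency guarantee can be pictured as "reserve a hash once, ignore repeats". The sketch below is an in-memory illustration only: the use of SHA-256, the hex encoding, the key format, and the `HashSet` standing in for a unique constraint on `source_record_key_hash` are all assumptions, since this document does not state how the hash is computed.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; SHA-256 and the in-memory set are assumptions.
final class IdempotentIngest {
    // Stands in for a unique constraint on event_source_record.source_record_key_hash.
    private final Set<String> reservedHashes = new HashSet<>();

    static String keyHash(String sourceRecordKey) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(sourceRecordKey.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Returns true only the first time a key is seen; a rerun of the same record is a no-op.
    boolean reserve(String sourceRecordKey) {
        return reservedHashes.add(keyHash(sourceRecordKey));
    }
}
```

Under this model, rewinding `import_cursor` and re-importing a range is safe: records that were already ingested are skipped, and only the missing ones produce new events.
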
## 9. Main Files Touched

- `src/main/java/at/procon/eventhub/persistence/EventRepository.java`
- `src/main/java/at/procon/eventhub/persistence/VehicleIdentityRepository.java`
- `src/main/java/at/procon/eventhub/persistence/SourceMasterDataRepository.java`
- `src/main/java/at/procon/eventhub/persistence/DataPackageRepository.java`
- `src/main/java/at/procon/eventhub/service/EventHubIngestionService.java`
- `src/main/java/at/procon/eventhub/camel/EventHubCommonIngestionRoute.java`
- `src/main/java/at/procon/eventhub/camel/EventHubBatchBuildProcessor.java`
- `src/main/java/at/procon/eventhub/importing/extraction/AbstractJdbcExtractionBatchExecutor.java`
- `src/main/java/at/procon/eventhub/importing/AbstractImportExecutionService.java`
- `src/main/java/at/procon/eventhub/tachograph/service/JdbcTachographExtractionBatchExecutor.java`
- `src/main/java/at/procon/eventhub/tachograph/service/TachographMasterDataRefreshService.java`
- `src/main/java/at/procon/eventhub/tachograph/service/TachographImportExecutionService.java`
- `src/main/java/at/procon/eventhub/config/EventHubProperties.java`
- `src/main/resources/application.yml`
- `src/main/resources/db/eventhub_schema_create.sql`
- `src/main/resources/db/migration/V9__add_event_source_package_id.sql`
- `src/main/resources/db/migration/V10__make_event_hypertable.sql`
- `src/main/resources/db/migration/V11__ensure_event_source_record.sql`