# NDI HOME / NOT_HOME classification and country trip segmentation
This patch implements the HOME / NOT_HOME classification and the country-trip segmentation described in `docs/ndi_home_classification_en.md`. It reuses the existing driver-working-time pipeline and adds configurable Nominatim reverse geocoding only where source country evidence is missing.
## Public processing plan
Use:
```text
driver-home-classification-v1
```
The dedicated plan delegates to the shared `driver-working-time-v1` pipeline and explicitly inserts:
```text
support-evidence-normalization
-> ndi-home-classification
-> country-trip-segmentation
-> driving-derived-projections
```
The normal `driver-working-time-v1` plan keeps both modules optional. To run them there, request them explicitly as `ndi-home-classification` and `country-trip-segmentation`.
## Reused projection structures
`DriverWorkingTimeReusableProjectionBuilder.buildAllNonDrivingIntervalCoverage(...)` runs the existing Esper interruption/card-absence/GNSS enrichment pipeline with a zero rest-candidate threshold. It creates enriched evidence for every positive non-driving interruption without changing the legacy daily/weekly-rest threshold or outputs.
The implementation reuses `DriverWorkingTimeRestCoverageInterval` as the enriched NDI evidence model. It provides:
- previous and next driving/vehicle identities;
- NDI start, end, and duration;
- card-absence duration and percentage;
- begin/end boundary GNSS evidence;
- boundary odometer and movement evidence.
## HOME / NOT_HOME classification
The rules are evaluated in the order given in `docs/ndi_home_classification_en.md`, and the first matching rule decides the result:
1. previous and next vehicles differ -> `HOME`;
2. card absent for more than 80% -> `HOME`;
3. NDI longer than 24 hours -> `HOME`;
4. no position: NDI longer than 7.5 hours -> `HOME`, otherwise `NOT_HOME`;
5. positioned NDI longer than 7.5 hours in a company or driver home cluster -> `HOME`;
6. positioned NDI longer than 7.5 hours outside those clusters -> `NOT_HOME`;
7. any remaining NDI (7.5 hours or shorter) -> `NOT_HOME`.
Every classification contains a `DriverNdiHomeClassificationReason`, so the first matching rule remains visible in the API response.
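The rule order above can be sketched as a first-match cascade. This is a minimal illustration, not the patch's code: the `Ndi` and `Classification` records, the `Reason` constants, and the 0..1 card-absence share are assumptions made for the example, and the real `DriverNdiHomeClassificationReason` values may differ.

```java
import java.time.Duration;

/** Hypothetical reason codes; the real DriverNdiHomeClassificationReason may differ. */
enum Reason {
    VEHICLE_CHANGE, CARD_ABSENT, LONGER_THAN_24H, UNPOSITIONED_LONG,
    UNPOSITIONED_SHORT, IN_HOME_CLUSTER, OUTSIDE_HOME_CLUSTER, SHORT_NDI
}

/** Hypothetical, simplified view of one enriched non-driving interval. */
record Ndi(String previousVehicle, String nextVehicle, Duration duration,
           double cardAbsentShare, boolean positioned, boolean inHomeCluster) {}

record Classification(boolean home, Reason reason) {}

final class NdiRuleCascade {
    static final Duration LONG_NDI = Duration.ofMinutes(450); // 7.5 hours
    static final Duration ONE_DAY = Duration.ofHours(24);

    /** Evaluates the rules top to bottom and returns on the first match. */
    static Classification classify(Ndi ndi) {
        boolean isLong = ndi.duration().compareTo(LONG_NDI) > 0;
        if (!ndi.previousVehicle().equals(ndi.nextVehicle())) {
            return new Classification(true, Reason.VEHICLE_CHANGE);
        }
        if (ndi.cardAbsentShare() > 0.80) {
            return new Classification(true, Reason.CARD_ABSENT);
        }
        if (ndi.duration().compareTo(ONE_DAY) > 0) {
            return new Classification(true, Reason.LONGER_THAN_24H);
        }
        if (!ndi.positioned()) {
            return isLong
                    ? new Classification(true, Reason.UNPOSITIONED_LONG)
                    : new Classification(false, Reason.UNPOSITIONED_SHORT);
        }
        if (isLong) {
            return ndi.inHomeCluster()
                    ? new Classification(true, Reason.IN_HOME_CLUSTER)
                    : new Classification(false, Reason.OUTSIDE_HOME_CLUSTER);
        }
        return new Classification(false, Reason.SHORT_NDI);
    }
}
```

Because each rule returns immediately, a vehicle change wins over every later rule, which is why the reason code always names the first rule that matched.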
## Location learning and clustering
Only NDIs longer than 7.5 hours with a position are added to the corpus. Position selection uses the existing resolved begin-boundary evidence and falls back to resolved end-boundary evidence.
The in-memory cache:
- accumulates observations across one or more file-session executions;
- deduplicates the same NDI across repeated/overlapping sessions;
- retains source-session provenance;
- stores the driver key on every observation;
- calculates actual-driver and other-driver views per request.
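The cache behaviour listed above might look roughly like the following sketch. The class, the `NdiKey` identity (driver key plus interval bounds), and the method names are hypothetical; the actual cache may identify an NDI differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class NdiObservationCache {
    /** Hypothetical identity of one NDI: driver key plus interval bounds in epoch seconds. */
    record NdiKey(String driverKey, long startEpochSecond, long endEpochSecond) {}

    // One entry per distinct NDI; the value keeps every session that reported it.
    private final Map<NdiKey, Set<String>> sessionsByNdi = new LinkedHashMap<>();

    /** Records an observation; returns true only the first time this NDI is seen. */
    synchronized boolean observe(NdiKey key, String sessionId) {
        Set<String> sessions = sessionsByNdi.get(key);
        boolean firstSeen = sessions == null;
        if (firstSeen) {
            sessions = new TreeSet<>();
            sessionsByNdi.put(key, sessions);
        }
        sessions.add(sessionId); // provenance is retained even for duplicates
        return firstSeen;
    }

    /** Source sessions that contributed the given NDI. */
    synchronized Set<String> provenance(NdiKey key) {
        return Set.copyOf(sessionsByNdi.getOrDefault(key, Set.of()));
    }

    /** Counts distinct NDIs of the given driver (sameDriver) or of all other drivers. */
    synchronized long count(String driverKey, boolean sameDriver) {
        return sessionsByNdi.keySet().stream()
                .filter(k -> k.driverKey().equals(driverKey) == sameDriver)
                .count();
    }
}
```

Keeping the driver key inside the identity lets the actual-driver and other-driver views be derived per request without storing two separate corpora.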
Clustering uses a Java DBSCAN implementation with Haversine distance. The defaults are a neighbourhood radius of 150 metres and a minimum of three points per cluster. Noise observations remain in the denominator for visit-share calculations but never form home clusters.
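For orientation, a compact DBSCAN over latitude/longitude pairs with a Haversine metric could be written as below. This is a generic textbook sketch with a quadratic neighbour search, not the patch's implementation; the class name, the label constants, and the mean Earth radius of 6,371 km are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HaversineDbscan {
    static final double EARTH_RADIUS_M = 6_371_000.0; // assumed mean Earth radius
    static final int NOISE = -1;
    private static final int UNVISITED = -2;

    /** Great-circle distance in metres between two points given in degrees. */
    static double haversineMetres(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.pow(Math.sin(dLon / 2), 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }

    /** Returns one label per point ({lat, lon}): a cluster id >= 0, or NOISE. */
    static int[] cluster(double[][] points, double epsMetres, int minPoints) {
        int[] labels = new int[points.length];
        Arrays.fill(labels, UNVISITED);
        int clusterId = 0;
        for (int i = 0; i < points.length; i++) {
            if (labels[i] != UNVISITED) continue;
            List<Integer> seeds = neighbours(points, i, epsMetres);
            if (seeds.size() < minPoints) { // not a core point (for now)
                labels[i] = NOISE;
                continue;
            }
            labels[i] = clusterId;
            for (int k = 0; k < seeds.size(); k++) {
                int j = seeds.get(k);
                if (labels[j] == NOISE) labels[j] = clusterId; // border point joins the cluster
                if (labels[j] != UNVISITED) continue;
                labels[j] = clusterId;
                List<Integer> next = neighbours(points, j, epsMetres);
                if (next.size() >= minPoints) seeds.addAll(next); // j is a core point: expand
            }
            clusterId++;
        }
        return labels;
    }

    /** Indices of all points within eps of points[i], including i itself. */
    private static List<Integer> neighbours(double[][] points, int i, double epsMetres) {
        List<Integer> result = new ArrayList<>();
        for (int j = 0; j < points.length; j++) {
            double d = haversineMetres(points[i][0], points[i][1], points[j][0], points[j][1]);
            if (d <= epsMetres) result.add(j);
        }
        return result;
    }
}
```

With the defaults, a call would be `HaversineDbscan.cluster(points, 150.0, 3)`. A visit share for a cluster would then divide its member count by `points.length`, so points labelled `NOISE` still count in the denominator.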
## Country trip segmentation
`DriverCountryTripSegmentationService` builds country segments over driving intervals.
Evidence precedence is:
1. explicit tachograph border-crossing event (`countryFrom` / `countryTo`);
2. country code already present on a positioned support event;
3. Nominatim reverse lookup for a positioned event without a usable country code.
Country values are normalized to ISO 3166-1 alpha-2 where a mapping is known. Segment boundaries retain their evidence source:
```text
EXPLICIT_BORDER_CROSSING
GNSS_SOURCE_COUNTRY_CHANGE
NOMINATIM_COUNTRY_CHANGE
VEHICLE_CHANGE
FINAL
```
The result includes segment counts, explicit-border counts, remote lookup counts, cache-hit counts, unresolved-coordinate counts, warnings, and OpenStreetMap attribution.
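The evidence precedence can be illustrated with a small resolver that only falls through to the remote lookup when the two stronger sources are absent. All types here are hypothetical stand-ins, the `reverseLookup` function represents the Nominatim client, and `normalize` is a placeholder that merely trims and upper-cases rather than applying a real ISO 3166-1 mapping.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.function.BiFunction;

/** Hypothetical evidence labels for a resolved country (not the boundary-source enum). */
enum CountrySource { BORDER_EVENT, SOURCE_COUNTRY, REVERSE_LOOKUP, UNRESOLVED }

/** Hypothetical support event; any field may be null when the evidence is absent. */
record SupportEvent(String borderCountryTo, String sourceCountry, Double lat, Double lon) {}

record ResolvedCountry(String alpha2, CountrySource source) {}

final class CountryEvidence {
    /** Applies the precedence: border event, then source country, then reverse lookup. */
    static ResolvedCountry resolve(SupportEvent event,
                                   BiFunction<Double, Double, Optional<String>> reverseLookup) {
        if (usable(event.borderCountryTo())) {
            return new ResolvedCountry(normalize(event.borderCountryTo()), CountrySource.BORDER_EVENT);
        }
        boolean positioned = event.lat() != null && event.lon() != null;
        if (positioned && usable(event.sourceCountry())) {
            return new ResolvedCountry(normalize(event.sourceCountry()), CountrySource.SOURCE_COUNTRY);
        }
        if (positioned) {
            // Remote lookup is the last resort, so it is only reached here.
            Optional<String> looked = reverseLookup.apply(event.lat(), event.lon());
            if (looked.isPresent() && usable(looked.get())) {
                return new ResolvedCountry(normalize(looked.get()), CountrySource.REVERSE_LOOKUP);
            }
        }
        return new ResolvedCountry(null, CountrySource.UNRESOLVED);
    }

    private static boolean usable(String code) {
        return code != null && !code.isBlank();
    }

    /** Placeholder normalization; a real mapping table would go here. */
    private static String normalize(String code) {
        return code.trim().toUpperCase(Locale.ROOT);
    }
}
```

Passing the lookup as a function keeps the precedence logic testable without network access and makes it easy to confirm that stronger evidence never triggers a remote call.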
## Nominatim integration
The client uses the reverse endpoint with:
```text
format=jsonv2
zoom=3
addressdetails=1
layer=address
```
Only `address.country_code` is required by the classification/segmentation logic. Failures do not fail the whole processing plan; the coordinate remains unresolved and a diagnostic warning is returned.
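As a rough illustration of the request shape, the helper below assembles a reverse-lookup URI from a base URL and a coordinate using the parameters listed above. The class name and the six-decimal coordinate formatting are assumptions; the actual client may build the request differently and also sets headers that are not shown here.

```java
import java.net.URI;
import java.util.Locale;

final class NominatimReverseUri {
    /** Builds {baseUrl}/reverse with the query parameters used for country resolution. */
    static URI build(String baseUrl, double lat, double lon) {
        // Tolerate a trailing slash on the configured base URL.
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        // Locale.ROOT guarantees a '.' decimal separator regardless of the JVM default locale.
        String query = String.format(Locale.ROOT,
                "format=jsonv2&lat=%.6f&lon=%.6f&zoom=3&addressdetails=1&layer=address",
                lat, lon);
        return URI.create(base + "/reverse?" + query);
    }
}
```

For example, `NominatimReverseUri.build("https://nominatim.example.org", 52.5, 13.4)` yields a URI whose JSON response would be read for `address.country_code` only; the host in this example is a placeholder.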
Safeguards:
- configurable, identifying `User-Agent`;
- optional identifying email;
- shared coordinate cache with TTL and maximum size;
- coordinate quantization for cache reuse;
- one execution-level remote lookup budget;
- fully serialized remote calls;
- configurable minimum interval;
- enforced minimum one-second interval for `nominatim.openstreetmap.org`;
- public OSM endpoint disabled unless deliberately opted in;
- configurable endpoint so a self-hosted or contracted Nominatim service can be substituted without code changes.
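Three of these safeguards (coordinate quantization, the minimum interval, and the execution-level budget) can be sketched as follows. The class and method names are hypothetical, and the gate only computes how long a caller should wait; the real client may enforce spacing and serialization differently.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class LookupSafeguards {
    /** Cache key from coordinates rounded to a fixed number of decimal places. */
    static String cacheKey(double lat, double lon, int decimalPlaces) {
        return quantize(lat, decimalPlaces) + "," + quantize(lon, decimalPlaces);
    }

    private static String quantize(double value, int decimalPlaces) {
        return BigDecimal.valueOf(value)
                .setScale(decimalPlaces, RoundingMode.HALF_UP)
                .toPlainString();
    }

    /** The public OSM host never gets less than one second between requests. */
    static long effectiveIntervalMillis(String host, long configuredMillis) {
        return "nominatim.openstreetmap.org".equalsIgnoreCase(host)
                ? Math.max(configuredMillis, 1_000L)
                : configuredMillis;
    }
}

/** Serializes reservations and spaces them by a minimum interval within a fixed budget. */
final class RemoteLookupGate {
    private final long minIntervalMillis;
    private int remainingBudget;
    private long nextAllowedAtMillis = Long.MIN_VALUE;

    RemoteLookupGate(long minIntervalMillis, int budget) {
        this.minIntervalMillis = minIntervalMillis;
        this.remainingBudget = budget;
    }

    /** Returns the milliseconds to wait before calling, or -1 if the budget is spent. */
    synchronized long reserve(long nowMillis) {
        if (remainingBudget <= 0) return -1;
        remainingBudget--;
        long startAt = Math.max(nowMillis, nextAllowedAtMillis);
        nextAllowedAtMillis = startAt + minIntervalMillis;
        return startAt - nowMillis;
    }
}
```

At four decimal places, nearby fixes such as `52.520008, 13.404954` and `52.51999, 13.40496` share the key `52.5200,13.4050`, so only the first of them would consume remote budget.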
### Configuration
```yaml
eventhub:
  reverse-geocoding:
    enabled: true
    provider: NOMINATIM
    nominatim:
      base-url: https://nominatim.openstreetmap.org
      public-service-enabled: false
      user-agent: eventhub-tachograph/0.1 (Nominatim reverse geocoding)
      email: ""
      accept-language: en
      connect-timeout: 10s
      read-timeout: 20s
      minimum-request-interval: 1s
      cache-ttl: 30d
      cache-max-entries: 100000
      coordinate-decimal-places: 4
      max-remote-lookups-per-execution: 25
```
Environment variables use the `NOMINATIM_*` names shown in `application.yml`.
For a self-hosted endpoint, set `NOMINATIM_BASE_URL`; `public-service-enabled` is not needed. For deliberately selected, policy-compliant, low-volume use of the donated public endpoint, additionally set:
```text
NOMINATIM_PUBLIC_SERVICE_ENABLED=true
NOMINATIM_USER_AGENT=<application/version and contact identifier>
NOMINATIM_EMAIL=<contact email when appropriate>
```
Production or recurring tachograph batch processing should use a self-hosted instance or a provider whose terms cover the expected workload. Coordinates may reveal vehicle or driver movements; do not send confidential or personal-location data to a public endpoint without an appropriate legal and privacy basis.
## File-session learning scope
The dedicated plan defaults `ndiLearnAllFileSessionDrivers` to `true`. For a request with explicit canonical driver keys, it internally loads all drivers from the selected file sessions for location learning, then filters the response back to the originally requested drivers.
The scope is not broadened when the source is mixed/database-only, the option is disabled, or the result cannot safely be filtered by canonical driver key.
## Response extensions
Each driver partition can contain:
```text
ndiHomeClassification
countryTripSegmentation
```
The fields are omitted when their optional modules were not executed, preserving the existing JSON shape for normal `driver-working-time-v1` calls.