DIP/docs/clustering/PYTHON_CLUSTERING_SERVICE.md

# Python clustering backend for DBSCAN and advanced algorithms

This patch adds a dedicated Python service for clustering algorithms that are better supported in the Python scientific stack than in Java.

## Why Python for this step

The Spring module remains the orchestrator for:
- embedding selection
- run metadata
- result persistence
- cluster browsing APIs

The Python backend executes the actual clustering for algorithms such as:
- `DBSCAN`
- `HDBSCAN`
- `MINI_BATCH_KMEANS`
- `AGGLOMERATIVE`
- `KMEANS` with optional dimensionality reduction

## Spring-side contract changes in this patch

The Spring request model now supports generic algorithm parameters through `parameters` instead of only `k`.

Examples:
- KMeans: `{ "k": 25 }`
- DBSCAN: `{ "eps": 0.25, "minSamples": 5 }`
- HDBSCAN: `{ "minClusterSize": 15, "minSamples": 5 }`
- Agglomerative: `{ "k": 20, "linkage": "average", "metric": "euclidean" }`
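As a rough illustration of how the Python side might consume these generic `parameters`, the sketch below translates the camelCase keys from the examples above into keyword arguments. It assumes the backend is built on scikit-learn, which this document does not state; the `PARAMETER_NAMES` table and the `to_estimator_kwargs` helper are hypothetical names, not part of the patch.

```python
# Hypothetical mapping from the camelCase request keys shown above
# to scikit-learn constructor keyword names (assumed backend library).
PARAMETER_NAMES = {
    "KMEANS": {"k": "n_clusters"},
    "MINI_BATCH_KMEANS": {"k": "n_clusters"},
    "DBSCAN": {"eps": "eps", "minSamples": "min_samples"},
    "HDBSCAN": {"minClusterSize": "min_cluster_size", "minSamples": "min_samples"},
    "AGGLOMERATIVE": {"k": "n_clusters", "linkage": "linkage", "metric": "metric"},
}


def to_estimator_kwargs(algorithm, parameters):
    """Translate request parameters into keyword arguments, rejecting unknown keys."""
    try:
        names = PARAMETER_NAMES[algorithm]
    except KeyError:
        raise ValueError(f"unsupported algorithm: {algorithm}") from None
    unknown = sorted(set(parameters) - set(names))
    if unknown:
        raise ValueError(f"unknown parameters for {algorithm}: {unknown}")
    return {names[key]: value for key, value in parameters.items()}


print(to_estimator_kwargs("DBSCAN", {"eps": 0.25, "minSamples": 5}))
```

Under that assumption, the result could be passed straight to an estimator, for example `DBSCAN(**to_estimator_kwargs("DBSCAN", params))`. Rejecting unknown keys early keeps a typo such as `minSample` from silently falling back to a library default.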

The mapping of the Python response now covers:
- `noise`
- `membershipScore`
- `distanceToCentroid`
- noise cluster rows
- `noiseCount`
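To make these fields concrete, here is a minimal sketch of how they could be derived from a label array. It assumes noise points carry the label `-1` (the scikit-learn convention for DBSCAN and HDBSCAN) and that `distanceToCentroid` is the Euclidean distance to the mean of the cluster's members. The `summarize` function, the `clusterLabel` key, and the overall response shape are illustrative assumptions, not the service's actual payload.

```python
import math
from collections import defaultdict

NOISE_LABEL = -1  # assumed convention: scikit-learn labels noise points as -1


def summarize(labels, vectors, scores=None):
    """Build per-point assignment rows plus a run-level noise count."""
    # Group the vectors of every non-noise cluster.
    members = defaultdict(list)
    for label, vec in zip(labels, vectors):
        if label != NOISE_LABEL:
            members[label].append(vec)

    # Centroid = component-wise mean of the cluster's member vectors.
    centroids = {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in members.items()
    }

    assignments = []
    for i, (label, vec) in enumerate(zip(labels, vectors)):
        is_noise = label == NOISE_LABEL
        assignments.append({
            "clusterLabel": label,  # hypothetical field name
            "noise": is_noise,
            "membershipScore": None if scores is None else scores[i],
            # Noise points belong to no cluster, so they get no centroid distance.
            "distanceToCentroid": None if is_noise else math.dist(vec, centroids[label]),
        })

    noise_count = sum(1 for row in assignments if row["noise"])
    return {"assignments": assignments, "noiseCount": noise_count}


result = summarize([0, 0, -1], [[0.0, 0.0], [2.0, 0.0], [9.0, 9.0]])
print(result["noiseCount"])
```

In this sketch `membershipScore` is passed through from the caller, since only some algorithms (HDBSCAN, for instance) produce a per-point membership strength.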

Those values are persisted back into:
- `doc.doc_embedding_cluster`
- `doc.doc_embedding_cluster_assignment`
- `doc.doc_embedding_cluster_run`

## Recommended defaults for embeddings

For high-dimensional text embeddings, use:
- `normalizeVectors=true`
- `reductionMethod=PCA`
- `reductionDimensions=50..150`
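The preprocessing these settings describe can be sketched as follows, assuming NumPy and scikit-learn are available. The `prepare_vectors` function and its argument names are hypothetical; they only mirror the request options above and are not the service's actual code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize


def prepare_vectors(vectors, normalize_vectors=True, reduction_dimensions=50, seed=0):
    """L2-normalize the embeddings, then project them with PCA."""
    x = np.asarray(vectors, dtype=np.float64)
    if normalize_vectors:
        # Scale every row to unit length.
        x = normalize(x, norm="l2")
    # PCA cannot produce more components than min(n_samples, n_features).
    n_components = min(reduction_dimensions, x.shape[0], x.shape[1])
    return PCA(n_components=n_components, random_state=seed).fit_transform(x)


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))  # synthetic stand-in for real embeddings
reduced = prepare_vectors(embeddings, reduction_dimensions=50)
print(reduced.shape)
```

Note that the rows are no longer unit length after the PCA projection, so any distance threshold such as `eps` applies to the reduced space rather than to the original normalized vectors.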

Typical starting points:
- DBSCAN: `eps=0.20..0.35`, `minSamples=5`
- HDBSCAN: `minClusterSize=10..30`, `minSamples=3..10`

The right values still depend on:
- embedding model
- whether vectors are normalized
- whether full documents or chunks are clustered
- the semantic density of the selected dataset
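Because the right values depend on the data, one practical approach is to sweep the suggested `eps` range and compare how many clusters and how much noise each setting yields. The sketch below assumes scikit-learn's `DBSCAN`; `sweep_eps` is a hypothetical helper, and the synthetic blobs only stand in for real embeddings.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def sweep_eps(x, eps_values, min_samples=5):
    """Return (eps, cluster_count, noise_fraction) for each candidate eps."""
    rows = []
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(x)
        # Label -1 marks noise, so it does not count as a cluster.
        cluster_count = len(set(labels) - {-1})
        noise_fraction = float(np.mean(labels == -1))
        rows.append((eps, cluster_count, noise_fraction))
    return rows


rng = np.random.default_rng(0)
# Two tight synthetic blobs, far apart from each other.
points = np.vstack([
    rng.normal(0.0, 0.01, size=(30, 2)),
    rng.normal(5.0, 0.01, size=(30, 2)),
])
for eps, clusters, noise in sweep_eps(points, [0.20, 0.25, 0.30, 0.35]):
    print(f"eps={eps:.2f} clusters={clusters} noise={noise:.2%}")
```

A setting that labels most points as noise, or that merges everything into a single cluster, is usually a sign that `eps` sits outside the useful range for that dataset.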