# Python clustering backend for DBSCAN and advanced algorithms
This patch adds a dedicated Python service for clustering algorithms that are better supported in the Python scientific stack than in Java.
## Why Python for this step
The Spring module remains the orchestrator for:
- embedding selection
- run metadata
- result persistence
- cluster browsing APIs

The Python backend executes the actual clustering for algorithms such as:
- `DBSCAN`
- `HDBSCAN`
- `MINI_BATCH_KMEANS`
- `AGGLOMERATIVE`
- `KMEANS` with optional dimensionality reduction
## Spring-side contract changes in this patch
The Spring request model now accepts generic algorithm parameters through a `parameters` object instead of a single `k` value.
Examples:
- KMeans: `{ "k": 25 }`
- DBSCAN: `{ "eps": 0.25, "minSamples": 5 }`
- HDBSCAN: `{ "minClusterSize": 15, "minSamples": 5 }`
- Agglomerative: `{ "k": 20, "linkage": "average", "metric": "euclidean" }`
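
The example payloads above suggest a per-algorithm set of keys. The following is a minimal sketch of how the Python side might check an incoming `parameters` object before running an algorithm. The names `REQUIRED_PARAMS` and `validate_parameters` are hypothetical, and treating the example keys as required is an assumption, not something this patch guarantees. `MINI_BATCH_KMEANS` is left out because no example payload is given for it.

```python
from typing import Any, Dict

# Hypothetical spec derived from the example payloads in this document.
# Treating every listed key as required is an assumption.
REQUIRED_PARAMS: Dict[str, Dict[str, type]] = {
    "KMEANS": {"k": int},
    "DBSCAN": {"eps": float, "minSamples": int},
    "HDBSCAN": {"minClusterSize": int, "minSamples": int},
    "AGGLOMERATIVE": {"k": int, "linkage": str, "metric": str},
}


def validate_parameters(algorithm: str, parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Return a type-checked copy of `parameters`, or raise ValueError."""
    spec = REQUIRED_PARAMS.get(algorithm)
    if spec is None:
        raise ValueError(f"unsupported algorithm: {algorithm}")

    missing = sorted(key for key in spec if key not in parameters)
    if missing:
        raise ValueError(f"{algorithm} is missing parameters: {missing}")

    checked = dict(parameters)  # keys outside the spec pass through unchanged
    for key, expected in spec.items():
        value = parameters[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool):
            raise ValueError(f"{key} must be {expected.__name__}, got bool")
        # JSON numbers such as 1 may arrive as int where a float is expected.
        if expected is float and isinstance(value, int):
            value = float(value)
        if not isinstance(value, expected):
            raise ValueError(
                f"{key} must be {expected.__name__}, got {type(value).__name__}"
            )
        checked[key] = value
    return checked


if __name__ == "__main__":
    print(validate_parameters("DBSCAN", {"eps": 0.25, "minSamples": 5}))
```

Failing early with a descriptive error keeps malformed requests from reaching the clustering code, where the resulting error would be harder to trace back to the Spring request.
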

The mapping of the Python response now covers:
- `noise`
- `membershipScore`
- `distanceToCentroid`
- noise cluster rows
- `noiseCount`

Those values are persisted back into:
- `doc.doc_embedding_cluster`
- `doc.doc_embedding_cluster_assignment`
- `doc.doc_embedding_cluster_run`
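
To make the response fields listed above concrete, here is a minimal sketch of how the Python side might assemble them from raw cluster labels. It assumes noise points carry the label `-1`, which is the convention scikit-learn's `DBSCAN` uses. The function name `build_response` and the keys `assignments`, `docId`, and `clusterLabel` are hypothetical. Only `noise`, `membershipScore`, `distanceToCentroid`, and `noiseCount` come from this document.

```python
from typing import Any, Dict, List, Optional, Sequence

NOISE_LABEL = -1  # assumption: noise is labelled -1, as scikit-learn's DBSCAN does


def build_response(
    doc_ids: Sequence[str],
    labels: Sequence[int],
    membership_scores: Optional[Sequence[float]] = None,
    centroid_distances: Optional[Sequence[float]] = None,
) -> Dict[str, Any]:
    """Assemble per-document assignments plus a run-level noise count."""
    if len(doc_ids) != len(labels):
        raise ValueError("doc_ids and labels must have the same length")

    assignments: List[Dict[str, Any]] = []
    noise_count = 0
    for index, (doc_id, label) in enumerate(zip(doc_ids, labels)):
        is_noise = label == NOISE_LABEL
        if is_noise:
            noise_count += 1
        assignments.append(
            {
                "docId": doc_id,  # hypothetical key
                "clusterLabel": label,  # hypothetical key
                "noise": is_noise,
                # Optional per-algorithm values; None when not provided.
                "membershipScore": (
                    None if membership_scores is None else membership_scores[index]
                ),
                "distanceToCentroid": (
                    None if centroid_distances is None else centroid_distances[index]
                ),
            }
        )
    return {"assignments": assignments, "noiseCount": noise_count}


if __name__ == "__main__":
    response = build_response(["a", "b", "c"], [0, -1, 0])
    print(response["noiseCount"])
```

Leaving `membershipScore` and `distanceToCentroid` as `None` when an algorithm does not produce them is one possible choice; the actual service may fill them differently.
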
## Recommended defaults for embeddings
For high-dimensional text embeddings, use:
- `normalizeVectors=true`
- `reductionMethod=PCA`
- `reductionDimensions=50..150`

Typical starting points:
- DBSCAN: `eps=0.20..0.35`, `minSamples=5`
- HDBSCAN: `minClusterSize=10..30`, `minSamples=3..10`

The right values still depend on:
- embedding model
- whether vectors are normalized
- whether full documents or chunks are clustered
- the semantic density of the selected dataset
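
The dependence on normalization noted above can be illustrated with a small sketch. For unit-length vectors, the Euclidean distance `d` and the cosine similarity `s` satisfy `d² = 2 − 2s`, so distances fall in `[0, 2]` and an `eps` value has a comparable meaning across documents. Without normalization, distances also scale with vector magnitude. The sketch assumes a Euclidean metric, which this document does not confirm for the service, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    a, b = [3.0, 4.0], [4.0, 3.0]
    raw = euclidean(a, b)
    unit = euclidean(l2_normalize(a), l2_normalize(b))
    from_cosine = math.sqrt(2.0 - 2.0 * cosine_similarity(a, b))
    # `unit` and `from_cosine` agree; `raw` depends on vector magnitude.
    print(raw, unit, from_cosine)
```

Under these assumptions, an `eps` in the `0.20..0.35` range corresponds to a cosine similarity of roughly `0.94..0.98` between neighbouring points.
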