# Python clustering backend for DBSCAN and advanced algorithms

This patch adds a dedicated Python service for clustering algorithms that are better supported in the Python scientific stack than in Java.

## Why Python for this step

The Spring module remains the orchestrator for:
- embedding selection
- run metadata
- result persistence
- cluster browsing APIs

The Python backend executes the actual clustering for algorithms such as:
- `DBSCAN`
- `HDBSCAN`
- `MINI_BATCH_KMEANS`
- `AGGLOMERATIVE`
- `KMEANS` with optional dimensionality reduction

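As a rough illustration of the dispatch step, the sketch below maps the algorithm identifiers listed above to estimator classes. It assumes the backend uses scikit-learn 1.3 or later (which ships `sklearn.cluster.HDBSCAN`); the patch does not confirm the library, and `ESTIMATOR_PATHS`, `estimator_path`, and `load_estimator_class` are hypothetical names.

```python
import importlib

# Hypothetical registry: algorithm identifiers from this README mapped to
# scikit-learn estimator classes. Assumes scikit-learn >= 1.3; the actual
# backend may rely on different libraries.
ESTIMATOR_PATHS = {
    "DBSCAN": ("sklearn.cluster", "DBSCAN"),
    "HDBSCAN": ("sklearn.cluster", "HDBSCAN"),
    "MINI_BATCH_KMEANS": ("sklearn.cluster", "MiniBatchKMeans"),
    "AGGLOMERATIVE": ("sklearn.cluster", "AgglomerativeClustering"),
    "KMEANS": ("sklearn.cluster", "KMeans"),
}


def estimator_path(algorithm: str) -> tuple[str, str]:
    """Return (module name, class name) for an algorithm identifier."""
    try:
        return ESTIMATOR_PATHS[algorithm.upper()]
    except KeyError:
        raise ValueError(f"Unsupported algorithm: {algorithm}") from None


def load_estimator_class(algorithm: str):
    """Import the estimator class lazily, only when it is actually requested."""
    module_name, class_name = estimator_path(algorithm)
    return getattr(importlib.import_module(module_name), class_name)
```

Keeping the registry as plain strings means an unsupported identifier can be rejected before any heavy import happens.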
## Spring-side contract changes in this patch

The Spring request model now accepts generic algorithm parameters through a `parameters` field instead of only `k`.

Examples:
- KMeans: `{ "k": 25 }`
- DBSCAN: `{ "eps": 0.25, "minSamples": 5 }`
- HDBSCAN: `{ "minClusterSize": 15, "minSamples": 5 }`
- Agglomerative: `{ "k": 20, "linkage": "average", "metric": "euclidean" }`

The Python response mapping now covers:
- `noise`
- `membershipScore`
- `distanceToCentroid`
- noise cluster rows
- `noiseCount`

Those values are persisted back into:
- `doc.doc_embedding_cluster`
- `doc.doc_embedding_cluster_assignment`
- `doc.doc_embedding_cluster_run`

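The contract in this section can be sketched from the Python side as follows. This is a minimal sketch under stated assumptions, not the patch's implementation: it assumes the camelCase request keys translate to scikit-learn keyword names, and that noise points carry the label `-1` (the convention scikit-learn's DBSCAN and HDBSCAN use). The helper names and the `assignments` and `clusterLabel` keys are hypothetical; only `noise` and `noiseCount` come from this README.

```python
# Hypothetical mapping from the camelCase request keys in the examples
# to scikit-learn keyword arguments (assumed, not confirmed by the patch).
KEY_TO_KWARG = {
    "k": "n_clusters",
    "eps": "eps",
    "minSamples": "min_samples",
    "minClusterSize": "min_cluster_size",
    "linkage": "linkage",
    "metric": "metric",
}

# Label that DBSCAN-style estimators in scikit-learn assign to noise points.
NOISE_LABEL = -1


def to_estimator_kwargs(parameters: dict) -> dict:
    """Translate request `parameters` into estimator keyword arguments."""
    unknown = sorted(set(parameters) - set(KEY_TO_KWARG))
    if unknown:
        raise ValueError(f"Unknown parameters: {unknown}")
    return {KEY_TO_KWARG[key]: value for key, value in parameters.items()}


def summarize_labels(labels: list[int]) -> dict:
    """Build per-point assignments with a noise flag plus a run-level noise count."""
    assignments = [
        {"clusterLabel": label, "noise": label == NOISE_LABEL}
        for label in labels
    ]
    noise_count = sum(1 for label in labels if label == NOISE_LABEL)
    return {"assignments": assignments, "noiseCount": noise_count}
```

For example, `to_estimator_kwargs({"eps": 0.25, "minSamples": 5})` yields `{"eps": 0.25, "min_samples": 5}`. Rejecting unknown keys early keeps a typo in the request from silently falling back to an estimator default.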
## Recommended defaults for embeddings

For high-dimensional text embeddings, use:
- `normalizeVectors=true`
- `reductionMethod=PCA`
- `reductionDimensions=50..150`

Typical starting points:
- DBSCAN: `eps=0.20..0.35`, `minSamples=5`
- HDBSCAN: `minClusterSize=10..30`, `minSamples=3..10`

The right values still depend on:
- embedding model
- whether vectors are normalized
- whether full documents or chunks are clustered
- the semantic density of the selected dataset

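One way to see why normalization changes the meaning of `eps`: for unit-length vectors, squared Euclidean distance equals `2 - 2 * cos`, so an `eps` threshold corresponds to a minimum cosine similarity. The sketch below assumes a Euclidean metric applied to L2-normalized vectors; this README does not state which metric the backend uses, and the function names are illustrative.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("Cannot normalize a zero vector")
    return [x / norm for x in vector]


def euclidean(a: list[float], b: list[float]) -> float:
    """Plain Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def eps_to_min_cosine(eps: float) -> float:
    """For unit vectors, ||a - b||^2 = 2 - 2*cos(a, b), so cos = 1 - eps^2 / 2."""
    return 1.0 - (eps * eps) / 2.0


if __name__ == "__main__":
    # Show the cosine-similarity floor implied by each end of the DBSCAN range.
    for eps in (0.20, 0.25, 0.35):
        print(f"eps={eps:.2f} -> min cosine similarity {eps_to_min_cosine(eps):.3f}")
```

Under these assumptions, `eps=0.25` means two neighbors must have a cosine similarity of roughly 0.969 or higher. Without normalization, the same `eps` depends on vector magnitudes and no such reading applies.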