Python clustering backend for DBSCAN and advanced algorithms
This patch adds a dedicated Python service for clustering algorithms that are better supported in the Python scientific stack than in Java.
Why Python for this step
The Spring module remains the orchestrator for:
- embedding selection
- run metadata
- result persistence
- cluster browsing APIs
The Python backend executes the actual clustering for algorithms such as:
- DBSCAN
- HDBSCAN
- MINI_BATCH_KMEANS
- AGGLOMERATIVE
- KMEANS with optional reduction
Spring-side contract changes in this patch
The Spring request model now accepts generic algorithm parameters through the parameters field, instead of only k.
Examples:
- KMeans: { "k": 25 }
- DBSCAN: { "eps": 0.25, "minSamples": 5 }
- HDBSCAN: { "minClusterSize": 15, "minSamples": 5 }
- Agglomerative: { "k": 20, "linkage": "average", "metric": "euclidean" }
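Because the parameter map is generic, the receiving side may want to check it before starting a run. The following is a minimal sketch, not the validation this patch actually performs: `REQUIRED_PARAMETERS` and `validate_parameters` are hypothetical names, and the required keys are only inferred from the examples above (algorithms without an example are left out).

```python
from typing import Any, Dict, Set

# Hypothetical table: required keys per algorithm, inferred from the examples above.
# Which keys are truly mandatory is an assumption, not something the patch states.
REQUIRED_PARAMETERS: Dict[str, Set[str]] = {
    "KMEANS": {"k"},
    "DBSCAN": {"eps", "minSamples"},
    "HDBSCAN": {"minClusterSize"},
    "AGGLOMERATIVE": {"k"},
}


def validate_parameters(algorithm: str, parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the parameters, or raise ValueError if required keys are absent."""
    name = algorithm.upper()
    if name not in REQUIRED_PARAMETERS:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    missing = REQUIRED_PARAMETERS[name] - set(parameters)
    if missing:
        raise ValueError(f"{name} is missing parameters: {sorted(missing)}")
    return dict(parameters)
```

For example, `validate_parameters("DBSCAN", {"eps": 0.25, "minSamples": 5})` returns the map unchanged, while omitting `minSamples` raises a `ValueError`.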
The Python response is now mapped with:
- noise
- membershipScore
- distanceToCentroid
- noise cluster rows
- noiseCount
Those values are persisted back into:
- doc.doc_embedding_cluster
- doc.doc_embedding_cluster_assignment
- doc.doc_embedding_cluster_run
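To illustrate how the noise fields relate to each other, here is a small sketch that derives per-document noise flags and a run-level noiseCount from raw cluster labels. It assumes noise points carry the label -1 (the convention scikit-learn's DBSCAN uses); `summarize_assignments`, `clusterLabel`, `clusterSizes`, and `assignments` are hypothetical names, not the actual response schema.

```python
from collections import Counter
from typing import Any, Dict, List

# Assumed convention: noise points are labelled -1, as scikit-learn's DBSCAN does.
NOISE_LABEL = -1


def summarize_assignments(labels: List[int], scores: List[float]) -> Dict[str, Any]:
    """Build per-document rows and run-level counts from raw cluster labels."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    rows = [
        {"clusterLabel": label, "noise": label == NOISE_LABEL, "membershipScore": score}
        for label, score in zip(labels, scores)
    ]
    # Cluster sizes exclude noise; noise is reported separately as noiseCount.
    sizes = Counter(label for label in labels if label != NOISE_LABEL)
    return {
        "assignments": rows,
        "clusterSizes": dict(sizes),
        "noiseCount": sum(1 for row in rows if row["noise"]),
    }
```

Keeping noise out of the regular cluster sizes means a run summary can report both the number of real clusters and how many documents were left unassigned.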
Recommended defaults for embeddings
For high-dimensional text embeddings, use:
- normalizeVectors=true
- reductionMethod=PCA
- reductionDimensions=50..150
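Normalization matters because it ties Euclidean distance to cosine similarity: for unit-length vectors, the squared Euclidean distance equals 2 - 2 * cosine similarity. The sketch below demonstrates this in plain Python. It assumes normalizeVectors=true means L2 normalization, which the patch does not spell out; the helper names are illustrative, and the PCA step is not shown.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
# On unit vectors: squared Euclidean distance == 2 - 2 * cosine similarity.
squared_distance = euclidean(a, b) ** 2
expected = 2.0 - 2.0 * cosine_similarity(a, b)
```

If eps is interpreted as a Euclidean radius on normalized vectors (an assumption), eps=0.25 corresponds to a cosine similarity of roughly 0.97 between neighbours.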
Typical starting points:
- DBSCAN: eps=0.20..0.35, minSamples=5
- HDBSCAN: minClusterSize=10..30, minSamples=3..10
The right values still depend on:
- embedding model
- whether vectors are normalized
- whether full documents or chunks are clustered
- the semantic density of the selected dataset
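Since these factors vary per dataset, one common heuristic for picking eps is to look at each point's distance to its k-th nearest neighbour (with k set to the intended minSamples) and choose a value near the bend of the sorted curve. The sketch below is a brute-force illustration for a small sample only; `k_distances` is a hypothetical helper and not part of this patch.

```python
import math
from typing import List, Sequence


def k_distances(points: Sequence[Sequence[float]], k: int) -> List[float]:
    """Return each point's distance to its k-th nearest neighbour, sorted ascending."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    result = []
    for i, p in enumerate(points):
        # Brute force: O(n^2) distance computations, fine for a sample of vectors.
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result)
```

Points in dense regions produce small values, while outliers produce a sharp jump at the end of the sorted list; an eps just below that jump is a reasonable first guess.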