Python clustering backend for DBSCAN and advanced algorithms

This patch adds a dedicated Python service for clustering algorithms that are better supported in the Python scientific stack than in Java.

Why Python for this step

The Spring module remains the orchestrator for:

embedding selection
run metadata
result persistence
cluster browsing APIs

The Python backend executes the actual clustering for algorithms such as:

DBSCAN
HDBSCAN
MINI_BATCH_KMEANS
AGGLOMERATIVE
KMEANS with optional reduction

Spring-side contract changes in this patch

The Spring request model now accepts algorithm-specific settings through a generic parameters map, instead of a single k value.

Examples:

KMeans: { "k": 25 }
DBSCAN: { "eps": 0.25, "minSamples": 5 }
HDBSCAN: { "minClusterSize": 15, "minSamples": 5 }
Agglomerative: { "k": 20, "linkage": "average", "metric": "euclidean" }
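On the Python side, these camelCase keys have to be turned into whatever keyword names the clustering library expects. The following is a minimal sketch, assuming the service uses scikit-learn estimators; the names PARAM_ALIASES and to_estimator_kwargs are hypothetical and not part of this patch.

```python
from typing import Any

# Hypothetical mapping from the request's camelCase keys to the keyword
# names used by scikit-learn estimators. The real service may differ.
PARAM_ALIASES: dict[str, dict[str, str]] = {
    "KMEANS": {"k": "n_clusters"},
    "MINI_BATCH_KMEANS": {"k": "n_clusters"},
    "DBSCAN": {"eps": "eps", "minSamples": "min_samples"},
    "HDBSCAN": {"minClusterSize": "min_cluster_size", "minSamples": "min_samples"},
    "AGGLOMERATIVE": {"k": "n_clusters", "linkage": "linkage", "metric": "metric"},
}


def to_estimator_kwargs(algorithm: str, parameters: dict[str, Any]) -> dict[str, Any]:
    """Translate request parameters into estimator keyword arguments.

    Raises ValueError for an unknown algorithm or an unsupported key, so a
    misspelled key fails loudly instead of being silently ignored.
    """
    aliases = PARAM_ALIASES.get(algorithm.upper())
    if aliases is None:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    unknown = sorted(set(parameters) - set(aliases))
    if unknown:
        raise ValueError(f"unsupported parameters for {algorithm}: {unknown}")
    return {aliases[key]: value for key, value in parameters.items()}


if __name__ == "__main__":
    print(to_estimator_kwargs("DBSCAN", {"eps": 0.25, "minSamples": 5}))
```

Rejecting unknown keys is a design choice in this sketch: a generic map makes typos easy, and a silently ignored minSamples would change the clustering without any visible error.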

The mapping of the Python response now covers:

noise
membershipScore
distanceToCentroid
noise cluster rows
noiseCount
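A rough sketch of how such a response could be assembled from raw cluster labels is shown here. It assumes the scikit-learn convention that DBSCAN and HDBSCAN mark noise points with the label -1. The function build_response and the keys docId, clusterLabel, clusters and size are hypothetical; only noise, membershipScore, distanceToCentroid and noiseCount come from this patch.

```python
from collections import Counter
from typing import Any, Optional, Sequence

NOISE_LABEL = -1  # scikit-learn's DBSCAN and HDBSCAN label noise points as -1


def build_response(
    doc_ids: Sequence[str],
    labels: Sequence[int],
    scores: Optional[Sequence[float]] = None,
    distances: Optional[Sequence[float]] = None,
) -> dict[str, Any]:
    """Build assignment rows, cluster rows and a noise count from labels."""
    if len(doc_ids) != len(labels):
        raise ValueError("doc_ids and labels must have the same length")

    assignments = []
    for i, (doc_id, label) in enumerate(zip(doc_ids, labels)):
        is_noise = label == NOISE_LABEL
        assignments.append({
            "docId": doc_id,
            "clusterLabel": label,
            "noise": is_noise,
            # Not every algorithm produces a membership score.
            "membershipScore": None if scores is None else scores[i],
            # A noise point has no centroid to measure against.
            "distanceToCentroid": None if distances is None or is_noise else distances[i],
        })

    sizes = Counter(labels)
    # One row per cluster, including a flagged row for the noise bucket.
    clusters = [
        {"clusterLabel": label, "size": size, "noise": label == NOISE_LABEL}
        for label, size in sorted(sizes.items())
    ]
    return {
        "assignments": assignments,
        "clusters": clusters,
        "noiseCount": sizes.get(NOISE_LABEL, 0),
    }


if __name__ == "__main__":
    print(build_response(["a", "b", "c"], [0, -1, 0])["noiseCount"])
```

Keeping noise as its own flagged cluster row, rather than dropping those documents, lets the browsing APIs show what the algorithm left unassigned.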

Those values are persisted back into:

doc.doc_embedding_cluster
doc.doc_embedding_cluster_assignment
doc.doc_embedding_cluster_run

Recommended defaults for embeddings

For high-dimensional text embeddings, use:

normalizeVectors=true
reductionMethod=PCA
reductionDimensions=50..150
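Normalization matters because distance thresholds such as eps only make sense at a fixed vector scale. The sketch below assumes normalizeVectors means L2 normalization to unit length, which this patch does not state explicitly. It shows that for unit vectors the squared Euclidean distance equals 2 - 2 * cosine similarity, so a Euclidean threshold then behaves like a cosine threshold. After PCA the relation holds only approximately, since the reduction discards some variance.

```python
import math
from typing import Sequence


def l2_normalize(vector: Sequence[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


if __name__ == "__main__":
    a, b = [3.0, 4.0], [4.0, 3.0]
    ua, ub = l2_normalize(a), l2_normalize(b)
    # For unit vectors: distance squared == 2 - 2 * cosine similarity.
    print(euclidean(ua, ub) ** 2, 2 - 2 * cosine_similarity(a, b))
```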

Typical starting points:

DBSCAN: eps=0.20..0.35, minSamples=5
HDBSCAN: minClusterSize=10..30, minSamples=3..10

The right values still depend on:

embedding model
whether vectors are normalized
whether full documents or chunks are clustered
the semantic density of the selected dataset