# Python clustering backend for DBSCAN and advanced algorithms

This patch adds a dedicated Python service for clustering algorithms that are better supported in the Python scientific stack than in Java.

## Why Python for this step

The Spring module remains the orchestrator for:

- embedding selection
- run metadata
- result persistence
- cluster browsing APIs

The Python backend executes the actual clustering for algorithms such as:

- `DBSCAN`
- `HDBSCAN`
- `MINI_BATCH_KMEANS`
- `AGGLOMERATIVE`
- `KMEANS` with optional reduction

## Spring-side contract changes in this patch

The Spring request model now supports generic algorithm parameters through `parameters` instead of only `k`.

Examples:

- KMeans: `{ "k": 25 }`
- DBSCAN: `{ "eps": 0.25, "minSamples": 5 }`
- HDBSCAN: `{ "minClusterSize": 15, "minSamples": 5 }`
- Agglomerative: `{ "k": 20, "linkage": "average", "metric": "euclidean" }`

The Python response is now mapped with:

- `noise`
- `membershipScore`
- `distanceToCentroid`
- noise cluster rows
- `noiseCount`

Those values are persisted back into:

- `doc.doc_embedding_cluster`
- `doc.doc_embedding_cluster_assignment`
- `doc.doc_embedding_cluster_run`

## Recommended defaults for embeddings

For high-dimensional text embeddings, use:

- `normalizeVectors=true`
- `reductionMethod=PCA`
- `reductionDimensions=50..150`

Typical starting points:

- DBSCAN: `eps=0.20..0.35`, `minSamples=5`
- HDBSCAN: `minClusterSize=10..30`, `minSamples=3..10`

The right values still depend on:

- embedding model
- whether vectors are normalized
- whether full documents or chunks are clustered
- the semantic density of the selected dataset
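
As an illustration of the `parameters` examples in the contract section, the sketch below shows one way the Python side could translate the camelCase keys into keyword arguments. It assumes the service builds on scikit-learn, which this document does not state; `PARAMETER_NAMES` and `to_estimator_kwargs` are hypothetical names, and the `k` entry for `MINI_BATCH_KMEANS` is a guess by analogy with `KMEANS`.

```python
# Hypothetical lookup: Spring-side parameter key -> scikit-learn keyword argument.
# The algorithm names come from this README; the mapping itself is an assumption.
PARAMETER_NAMES = {
    "KMEANS": {"k": "n_clusters"},
    "MINI_BATCH_KMEANS": {"k": "n_clusters"},  # assumed, not listed in the README
    "DBSCAN": {"eps": "eps", "minSamples": "min_samples"},
    "HDBSCAN": {"minClusterSize": "min_cluster_size", "minSamples": "min_samples"},
    "AGGLOMERATIVE": {"k": "n_clusters", "linkage": "linkage", "metric": "metric"},
}


def to_estimator_kwargs(algorithm, parameters):
    """Translate a request's `parameters` map into estimator keyword arguments."""
    allowed = PARAMETER_NAMES.get(algorithm)
    if allowed is None:
        raise ValueError(f"Unsupported algorithm: {algorithm}")
    # Reject unknown keys early instead of silently ignoring a typo.
    unknown = sorted(set(parameters) - set(allowed))
    if unknown:
        raise ValueError(f"Unknown parameters for {algorithm}: {unknown}")
    return {allowed[name]: value for name, value in parameters.items()}


print(to_estimator_kwargs("DBSCAN", {"eps": 0.25, "minSamples": 5}))
```

Rejecting unknown keys keeps a misspelled `minSamples` from falling back to a library default without anyone noticing.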
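
The recommended defaults for embeddings can be sketched as a small pipeline: L2-normalize, reduce with PCA, then run DBSCAN. This is a minimal illustration that assumes NumPy and scikit-learn, not the service's actual code; `cluster_embeddings` and its argument names are hypothetical. It also assumes Euclidean distance on the reduced vectors, so a suitable `eps` may differ if the real service uses another metric.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize


def cluster_embeddings(vectors, reduction_dimensions=50, eps=0.25, min_samples=5):
    """Normalize, reduce with PCA, then cluster with DBSCAN (hypothetical helper)."""
    # normalizeVectors=true: scale every row to unit L2 norm.
    x = normalize(np.asarray(vectors, dtype=np.float64))
    # PCA cannot keep more components than min(n_samples, n_features).
    n_components = min(reduction_dimensions, x.shape[0], x.shape[1])
    reduced = PCA(n_components=n_components, random_state=0).fit_transform(x)
    # DBSCAN labels noise points with -1.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(reduced)
    noise_count = int(np.sum(labels == -1))
    return labels, noise_count


# Synthetic stand-in for real embeddings: two tight groups in 200 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(size=(2, 200))
demo = np.repeat(centers, 100, axis=0) + 0.01 * rng.normal(size=(200, 200))
labels, noise_count = cluster_embeddings(demo)
print("clusters:", len(set(labels.tolist()) - {-1}), "noiseCount:", noise_count)
```

The count of `-1` labels is the kind of value a `noiseCount` field could carry. On real embeddings, `eps` and `min_samples` usually need tuning per dataset, for the reasons listed above.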