Python clustering backend for DBSCAN and advanced algorithms

This patch adds a dedicated Python service for clustering algorithms that are better supported in the Python scientific stack than in Java.

Why Python for this step

The Spring module remains the orchestrator for:

  • embedding selection
  • run metadata
  • result persistence
  • cluster browsing APIs

The Python backend executes the actual clustering for algorithms such as:

  • DBSCAN
  • HDBSCAN
  • MINI_BATCH_KMEANS
  • AGGLOMERATIVE
  • KMEANS with optional dimensionality reduction

Spring-side contract changes in this patch

The Spring request model now accepts generic, algorithm-specific parameters through a parameters field, instead of supporting only k.

Examples:

  • KMeans: { "k": 25 }
  • DBSCAN: { "eps": 0.25, "minSamples": 5 }
  • HDBSCAN: { "minClusterSize": 15, "minSamples": 5 }
  • Agglomerative: { "k": 20, "linkage": "average", "metric": "euclidean" }
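
As an illustration of how the Python side might consume these parameter maps, here is a minimal sketch that renames the camelCase keys from the examples above into the snake_case keyword arguments used by scikit-learn's clustering estimators. The table PARAM_NAMES and the function to_estimator_kwargs are hypothetical names, the use of k for MINI_BATCH_KMEANS is an assumption, and the actual service may use a different mapping or library.

```python
# Hypothetical translation layer: Spring-style camelCase parameters
# -> keyword arguments in scikit-learn's naming convention.
# This table is an assumption, not the service's actual mapping.
PARAM_NAMES = {
    "KMEANS": {"k": "n_clusters"},
    "MINI_BATCH_KMEANS": {"k": "n_clusters"},  # assumed to take k like KMEANS
    "DBSCAN": {"eps": "eps", "minSamples": "min_samples"},
    "HDBSCAN": {"minClusterSize": "min_cluster_size", "minSamples": "min_samples"},
    "AGGLOMERATIVE": {"k": "n_clusters", "linkage": "linkage", "metric": "metric"},
}


def to_estimator_kwargs(algorithm: str, parameters: dict) -> dict:
    """Rename request parameters; reject unknown algorithms and keys."""
    try:
        names = PARAM_NAMES[algorithm.upper()]
    except KeyError:
        raise ValueError(f"unsupported algorithm: {algorithm}") from None
    unknown = sorted(set(parameters) - set(names))
    if unknown:
        raise ValueError(f"unknown parameters for {algorithm}: {unknown}")
    return {names[key]: value for key, value in parameters.items()}


print(to_estimator_kwargs("DBSCAN", {"eps": 0.25, "minSamples": 5}))
# {'eps': 0.25, 'min_samples': 5}
```

Rejecting unknown keys early keeps a typo such as minSample from silently falling back to a library default.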

The mapping of the Python response now includes:

  • noise
  • membershipScore
  • distanceToCentroid
  • noise cluster rows
  • noiseCount

Those values are persisted back into:

  • doc.doc_embedding_cluster
  • doc.doc_embedding_cluster_assignment
  • doc.doc_embedding_cluster_run
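
The per-document fields listed above (noise, membershipScore, distanceToCentroid) and the run-level noiseCount could be derived from raw cluster labels roughly as follows. This is a sketch under stated assumptions: it treats label -1 as noise (the convention used by scikit-learn's DBSCAN and HDBSCAN), defines the centroid as the mean of a cluster's member vectors, and uses hypothetical names (build_assignments, index, cluster) that are not taken from the service.

```python
import math
from collections import defaultdict

NOISE_LABEL = -1  # assumption: scikit-learn's convention for noise points


def build_assignments(vectors, labels, scores=None):
    """Return (assignment rows, noiseCount) for one clustering run."""
    members = defaultdict(list)
    for vector, label in zip(vectors, labels):
        if label != NOISE_LABEL:
            members[label].append(vector)
    # Centroid = component-wise mean of the cluster's member vectors.
    centroids = {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in members.items()
    }
    assignments = []
    for index, (vector, label) in enumerate(zip(vectors, labels)):
        noise = label == NOISE_LABEL
        assignments.append({
            "index": index,
            "cluster": label,
            "noise": noise,
            # Only some algorithms provide a score; None otherwise.
            "membershipScore": None if scores is None else scores[index],
            # Noise points have no centroid to measure against.
            "distanceToCentroid": None if noise else math.dist(vector, centroids[label]),
        })
    noise_count = sum(1 for row in assignments if row["noise"])
    return assignments, noise_count


vectors = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
rows, noise_count = build_assignments(vectors, labels=[0, 0, -1])
print(noise_count)                    # 1
print(rows[0]["distanceToCentroid"])  # 1.0
```

Leaving distanceToCentroid empty for noise points is one possible choice; the real service may store a different value there.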

For high-dimensional text embeddings, use:

  • normalizeVectors=true
  • reductionMethod=PCA
  • reductionDimensions=50..150
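
One reason normalization helps can be shown with a small sketch. Assuming normalizeVectors=true means L2 normalization (an assumption, not confirmed here), the Euclidean distance between two unit vectors is a fixed function of their cosine similarity, so a distance threshold such as eps behaves like a cosine threshold. The helper names below are hypothetical, and the PCA step is not shown.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0.0 else [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = l2_normalize(a), l2_normalize(b)

# For unit vectors: euclidean_distance^2 == 2 * (1 - cosine_similarity)
lhs = math.dist(ua, ub) ** 2
rhs = 2.0 * (1.0 - cosine_similarity(a, b))
print(math.isclose(lhs, rhs))  # True
```

Under this identity, a Euclidean eps of 0.25 on unit vectors corresponds to a cosine distance of 0.25² / 2 ≈ 0.031. Whether the service measures eps in Euclidean or cosine terms is not stated in this document.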

Typical starting points:

  • DBSCAN: eps=0.20..0.35, minSamples=5
  • HDBSCAN: minClusterSize=10..30, minSamples=3..10

The right values still depend on:

  • embedding model
  • whether vectors are normalized
  • whether full documents or chunks are clustered
  • the semantic density of the selected dataset
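
Because the right eps depends on these factors, it can help to inspect the data before choosing a value. A common heuristic, not something this patch implements, is the k-distance curve: compute each point's distance to its k-th nearest neighbour, sort the values, and look for the point where the curve bends sharply upward. The sketch below uses hypothetical names and a brute-force O(n²) search, so it is only suitable for a small sample of embeddings.

```python
import math


def k_distances(vectors, k):
    """Distance from each point to its k-th nearest neighbour, ascending."""
    if not 1 <= k < len(vectors):
        raise ValueError("k must satisfy 1 <= k < number of vectors")
    result = []
    for i, p in enumerate(vectors):
        # Distances from p to every other point, nearest first.
        others = sorted(math.dist(p, q) for j, q in enumerate(vectors) if j != i)
        result.append(others[k - 1])
    return sorted(result)


# Toy data: two tight groups plus one outlier.
points = [[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
          [1.0, 1.0], [1.0, 1.1], [1.1, 1.0],
          [5.0, 5.0]]
curve = k_distances(points, k=2)
print([round(d, 3) for d in curve])
```

In this toy data the outlier's value is far larger than all the others, and a candidate eps would sit just below that jump. Choosing k close to the intended minSamples is a frequent convention, but treat it as a starting point rather than a rule.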