# DIP Clustering Service

Remote Python clustering backend for the DIP Spring clustering module.

## Main execution mode

The preferred execution mode is now:

- Spring keeps run metadata, selection snapshot, and lifecycle.
- Spring sends only a compact request containing `runId`.
- Python loads the run metadata and selected embeddings directly from Postgres.
- Python returns compact assignments keyed by `embeddingId`.

This avoids sending the full embedding matrix through HTTP.

## Implemented algorithms

- `KMEANS`
- `MINI_BATCH_KMEANS`
- `DBSCAN`
- `HDBSCAN`
- `AGGLOMERATIVE`

## Implemented reductions

- `NONE`
- `PCA`
- `UMAP`

## Start

```powershell
py -3.11 -m venv .venv
.\.venv\Scripts\python.exe -m pip install --upgrade pip
.\.venv\Scripts\python.exe -m pip install -r requirements.txt
```

For `/cluster-run`, also configure database access as described under "Required database configuration".

## API

This service supports two remote execution modes at the same time:

- `POST /cluster`
  - Spring uploads embeddings in the request body.
  - This keeps the original implementation intact.
- `POST /cluster-run`
  - Spring sends only `runId`.
  - Python loads run metadata and embeddings directly from Postgres.

### `GET /health`

Returns service status and supported algorithms/reduction methods.

### `POST /cluster-run`

Preferred endpoint. Accepts only the cluster run id.

Example request body:

```json
{
  "runId": "6c3bc3a3-24b0-47a5-9e35-92dd4b7275f8"
}
```

### `POST /cluster`

Accepts the Spring `PythonClusteringRequest` payload and returns `PythonClusteringResponse`.
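Both POST endpoints take a JSON body. As a minimal sketch of how a Python client might build such a request with the standard library: `build_json_post` is a hypothetical helper, not part of this service, and the base URL is the one used in this README's configuration examples. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_json_post(base_url: str, path: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the clustering service."""
    url = base_url.rstrip("/") + path
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Compact request for the preferred endpoint: only the run id is sent.
request = build_json_post(
    "http://localhost:8001",  # assumed local address of the service
    "/cluster-run",
    {"runId": "6c3bc3a3-24b0-47a5-9e35-92dd4b7275f8"},
)

# To actually call the service (it must be running); the timeout is an assumption:
# with urllib.request.urlopen(request, timeout=1800) as response:
#     result = json.load(response)
```

The same helper would work for `/cluster` by passing that path and a full inline-vector payload as `body`.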
Example request body:

```json
{
  "algorithm": "DBSCAN",
  "parameters": {
    "eps": 0.25,
    "minSamples": 5,
    "metric": "euclidean",
    "normalizeVectors": true
  },
  "reductionMethod": "PCA",
  "reductionDimensions": 100,
  "items": [
    {
      "embeddingId": "11111111-1111-1111-1111-111111111111",
      "documentId": "22222222-2222-2222-2222-222222222222",
      "representationId": "33333333-3333-3333-3333-333333333333",
      "vector": [0.1, 0.2, 0.3]
    }
  ]
}
```

## Parameters by algorithm

### KMEANS

- `k` required
- `randomState` optional, default `42`
- `nInit` optional, default `10`
- `maxIter` optional, default `300`

### MINI_BATCH_KMEANS

- `k` required
- `batchSize` optional
- `randomState` optional, default `42`
- `nInit` optional, default `10`
- `maxIter` optional, default `300`

### DBSCAN

- `eps` required
- `minSamples` optional, default `5`
- `metric` optional, default `euclidean`
- `algorithm` optional, default `auto`
- `nJobs` optional, default `-1`

### HDBSCAN

- `minClusterSize` optional, default `10`
- `minSamples` optional
- `metric` optional, default `euclidean`
- `clusterSelectionMethod` optional, default `eom`

### AGGLOMERATIVE

- `k` required
- `linkage` optional, default `average`
- `metric` optional, default `euclidean`
- `computeDistances` optional, default `false`

## Shared parameters

- `normalizeVectors` optional, default `true`
- `randomState` optional, used by `KMEANS`, `MINI_BATCH_KMEANS`, `PCA`, `UMAP`

## UMAP reduction parameters

- `reductionMetric` optional, default `cosine`
- `umapNeighbors` optional, default `15`
- `umapMinDist` optional, default `0.0`

## Required database configuration

Set one of the following:

- `CLUSTERING_DB_DSN`
- `DATABASE_URL`
- `CLUSTERING_DB_HOST`, `CLUSTERING_DB_PORT`, `CLUSTERING_DB_NAME`, `CLUSTERING_DB_USER`, and `CLUSTERING_DB_PASSWORD` together

Example:

```bash
export CLUSTERING_DB_DSN=postgresql://postgres:postgres@localhost:5432/dip
```

## Local run on Windows

```powershell
$env:CLUSTERING_DB_DSN="postgresql://postgres:postgres@localhost:5432/dip"
.\.venv\Scripts\python.exe -m uvicorn app.main:app --host 0.0.0.0 --port 8001 --reload
```

## Docker run

```bash
docker build -t dip-clustering-service .
docker run --rm -p 8001:8001 dip-clustering-service
```

## Spring configuration

Use the original request-upload mode:

```yaml
dip:
  clustering:
    python:
      enabled: true
      base-url: http://localhost:8001
      cluster-path: /cluster
      cluster-run-path: /cluster-run
      request-mode: INLINE_VECTORS
      connect-timeout: 30s
      read-timeout: 30m
```

Use compact `runId` mode:

```yaml
dip:
  clustering:
    python:
      enabled: true
      base-url: http://localhost:8001
      cluster-path: /cluster
      cluster-run-path: /cluster-run
      request-mode: RUN_ID
      connect-timeout: 30s
      read-timeout: 30m
```

`INLINE_VECTORS` is the default if `request-mode` is omitted.
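
The selection rule above can be illustrated in Python. This is only a sketch of the documented behaviour, not the Spring module's actual code: `RequestMode` and `resolve_endpoint` are hypothetical names, and the settings dictionary simply mirrors the keys under `dip.clustering.python`.

```python
from enum import Enum


class RequestMode(Enum):
    INLINE_VECTORS = "INLINE_VECTORS"
    RUN_ID = "RUN_ID"


def resolve_endpoint(settings: dict) -> str:
    """Return the URL a caller would post to for the configured request mode."""
    # INLINE_VECTORS is the documented default when request-mode is omitted.
    mode = RequestMode(settings.get("request-mode", "INLINE_VECTORS"))
    path_key = "cluster-run-path" if mode is RequestMode.RUN_ID else "cluster-path"
    return settings["base-url"].rstrip("/") + settings[path_key]


settings = {
    "base-url": "http://localhost:8001",
    "cluster-path": "/cluster",
    "cluster-run-path": "/cluster-run",
}

inline_url = resolve_endpoint(settings)  # request-mode omitted
run_id_url = resolve_endpoint({**settings, "request-mode": "RUN_ID"})
print(inline_url)
print(run_id_url)
```

An unrecognised `request-mode` value raises `ValueError` in this sketch, because the `Enum` lookup fails; how the Spring side treats invalid values is not stated here.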