DIP Clustering Service

Remote Python clustering backend for the DIP Spring clustering module.

Main execution mode

The preferred execution mode is now:

Spring keeps run metadata, selection snapshot, and lifecycle.
Spring sends only a compact request containing runId.
Python loads the run metadata and selected embeddings directly from Postgres.
Python returns compact assignments keyed by embeddingId.

This avoids sending the full embedding matrix through HTTP.

Implemented algorithms

KMEANS
MINI_BATCH_KMEANS
DBSCAN
HDBSCAN
AGGLOMERATIVE

Implemented reductions

NONE
PCA
UMAP

API

`GET /health`

Returns service status and supported algorithms/reduction methods.

`POST /cluster-run`

Preferred endpoint. Accepts only the cluster run id.

Example request body:

{
  "runId": "6c3bc3a3-24b0-47a5-9e35-92dd4b7275f8"
}
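
The request above can be sent with a few lines of standard-library Python. This is a minimal client sketch, not part of the service: the helper names build_cluster_run_request and post_cluster_run are hypothetical, the UUID check is an assumption based on the example id, and the default timeout simply mirrors the 30m read-timeout from the Spring configuration. The response is returned as parsed JSON without assuming any particular fields.

```python
import json
import urllib.request
import uuid


def build_cluster_run_request(base_url: str, run_id: str) -> urllib.request.Request:
    """Build the POST /cluster-run request without sending it."""
    # Assumption: run ids are UUIDs, as in the README example. Fail early otherwise.
    uuid.UUID(run_id)
    body = json.dumps({"runId": run_id}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/cluster-run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_cluster_run(base_url: str, run_id: str, timeout_s: float = 1800.0):
    """Send the request and return the decoded JSON response body."""
    request = build_cluster_run_request(base_url, run_id)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)
```

Calling post_cluster_run("http://localhost:8001", run_id) requires a running service with database access configured.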

This service supports two remote execution modes at the same time:

POST /cluster
- Spring uploads embeddings in the request body.
- This keeps the original implementation intact.

POST /cluster-run
- Spring sends only runId.
- Python loads run metadata and embeddings directly from Postgres.
Start

py -3.11 -m venv .venv
.\.venv\Scripts\python.exe -m pip install --upgrade pip
.\.venv\Scripts\python.exe -m pip install -r requirements.txt

Configure DB access for /cluster-run as described under "Required database configuration".

`POST /cluster`

Accepts the Spring PythonClusteringRequest payload and returns PythonClusteringResponse.

Example request body:

{
  "algorithm": "DBSCAN",
  "parameters": {
    "eps": 0.25,
    "minSamples": 5,
    "metric": "euclidean",
    "normalizeVectors": true
  },
  "reductionMethod": "PCA",
  "reductionDimensions": 100,
  "items": [
    {
      "embeddingId": "11111111-1111-1111-1111-111111111111",
      "documentId": "22222222-2222-2222-2222-222222222222",
      "representationId": "33333333-3333-3333-3333-333333333333",
      "vector": [0.1, 0.2, 0.3]
    }
  ]
}
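
For quick manual testing, a body of this shape can be assembled from raw vectors. The sketch below is illustrative only: build_cluster_request is a hypothetical helper, the ids are random placeholders (real ids come from the Spring side), and omitting reductionDimensions when no value is given is an assumption about what the service accepts.

```python
import uuid


def build_cluster_request(algorithm, parameters, vectors,
                          reduction_method="NONE", reduction_dimensions=None):
    """Assemble a /cluster request body from a list of equal-length vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    items = []
    for vector in vectors:
        if len(vector) != dim:
            raise ValueError("all vectors must have the same dimension")
        items.append({
            # Placeholder ids for illustration; not real embedding/document ids.
            "embeddingId": str(uuid.uuid4()),
            "documentId": str(uuid.uuid4()),
            "representationId": str(uuid.uuid4()),
            "vector": [float(x) for x in vector],
        })
    body = {
        "algorithm": algorithm,
        "parameters": dict(parameters),
        "reductionMethod": reduction_method,
        "items": items,
    }
    # Assumption: the field may be left out when no reduction size is requested.
    if reduction_dimensions is not None:
        body["reductionDimensions"] = reduction_dimensions
    return body


payload = build_cluster_request(
    "DBSCAN",
    {"eps": 0.25, "minSamples": 5, "metric": "euclidean"},
    [[0.1, 0.2, 0.3], [0.2, 0.1, 0.4]],
)
```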

Parameters by algorithm

KMEANS

k required
randomState optional, default 42
nInit optional, default 10
maxIter optional, default 300

MINI_BATCH_KMEANS

k required
batchSize optional
randomState optional, default 42
nInit optional, default 10
maxIter optional, default 300

DBSCAN

eps required
minSamples optional, default 5
metric optional, default euclidean
algorithm optional, default auto
nJobs optional, default -1

HDBSCAN

minClusterSize optional, default 10
minSamples optional
metric optional, default euclidean
clusterSelectionMethod optional, default eom

AGGLOMERATIVE

k required
linkage optional, default average
metric optional, default euclidean
computeDistances optional, default false

Shared parameters

normalizeVectors optional, default true
randomState optional, used by KMEANS, MINI_BATCH_KMEANS, PCA, UMAP
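
The README does not say what normalizeVectors does internally. Assuming it means scaling each vector to unit L2 length, the sketch below shows the effect in plain Python: for unit vectors, squared Euclidean distance equals 2 - 2 * cosine similarity, so the euclidean metric then orders pairs the same way cosine distance would. The function names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = l2_normalize([3.0, 4.0])
b = l2_normalize([1.0, 0.0])

# Identity for unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
assert math.isclose(euclidean(a, b) ** 2, 2 - 2 * cosine_similarity(a, b))
```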

UMAP reduction parameters

reductionMetric optional, default cosine
umapNeighbors optional, default 15
umapMinDist optional, default 0.0
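
The required and default values listed above can be captured in a small lookup table, for example to validate a parameters object before sending it. This is a sketch derived from this README only, not the service's own validation code; resolve_parameters and the table names are hypothetical, and parameters listed without a default (batchSize, HDBSCAN minSamples) are simply left unset.

```python
REQUIRED = {
    "KMEANS": {"k"},
    "MINI_BATCH_KMEANS": {"k"},
    "DBSCAN": {"eps"},
    "HDBSCAN": set(),
    "AGGLOMERATIVE": {"k"},
}

DEFAULTS = {
    "KMEANS": {"randomState": 42, "nInit": 10, "maxIter": 300},
    "MINI_BATCH_KMEANS": {"randomState": 42, "nInit": 10, "maxIter": 300},
    "DBSCAN": {"minSamples": 5, "metric": "euclidean", "algorithm": "auto", "nJobs": -1},
    "HDBSCAN": {"minClusterSize": 10, "metric": "euclidean", "clusterSelectionMethod": "eom"},
    "AGGLOMERATIVE": {"linkage": "average", "metric": "euclidean", "computeDistances": False},
}

SHARED_DEFAULTS = {"normalizeVectors": True}


def resolve_parameters(algorithm, parameters):
    """Check required keys and fill in the documented defaults."""
    if algorithm not in REQUIRED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    missing = REQUIRED[algorithm] - set(parameters)
    if missing:
        raise ValueError(f"{algorithm} requires: {', '.join(sorted(missing))}")
    # Caller-supplied values override the documented defaults.
    return {**SHARED_DEFAULTS, **DEFAULTS[algorithm], **parameters}


resolved = resolve_parameters("DBSCAN", {"eps": 0.25})
```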

Local run

Required database configuration

Set one of the following:

CLUSTERING_DB_DSN
DATABASE_URL
the individual variables CLUSTERING_DB_HOST, CLUSTERING_DB_PORT, CLUSTERING_DB_NAME, CLUSTERING_DB_USER, CLUSTERING_DB_PASSWORD

Example (Unix shell):

export CLUSTERING_DB_DSN=postgresql://postgres:postgres@localhost:5432/dip
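
One way such a lookup could work is sketched below. The precedence (CLUSTERING_DB_DSN first, then DATABASE_URL, then the individual variables) and the rule that all five individual variables must be present are assumptions for illustration; the README does not state how the service resolves conflicts. resolve_dsn is a hypothetical name.

```python
import os
from urllib.parse import quote


def resolve_dsn(env=None):
    """Return a Postgres DSN from the environment, or raise if none is configured."""
    env = os.environ if env is None else env

    # Assumed precedence: a full DSN wins over the individual variables.
    for name in ("CLUSTERING_DB_DSN", "DATABASE_URL"):
        value = env.get(name)
        if value:
            return value

    parts = (
        "CLUSTERING_DB_HOST",
        "CLUSTERING_DB_PORT",
        "CLUSTERING_DB_NAME",
        "CLUSTERING_DB_USER",
        "CLUSTERING_DB_PASSWORD",
    )
    missing = [name for name in parts if not env.get(name)]
    if missing:
        raise RuntimeError("missing database configuration: " + ", ".join(missing))

    # Percent-encode credentials so characters like '@' or ':' stay unambiguous.
    user = quote(env["CLUSTERING_DB_USER"], safe="")
    password = quote(env["CLUSTERING_DB_PASSWORD"], safe="")
    host = env["CLUSTERING_DB_HOST"]
    port = env["CLUSTERING_DB_PORT"]
    database = env["CLUSTERING_DB_NAME"]
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

Passing a plain dict instead of os.environ keeps the function easy to test.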

Local run on Windows

$env:CLUSTERING_DB_DSN="postgresql://postgres:postgres@localhost:5432/dip"
.\.venv\Scripts\python.exe -m uvicorn app.main:app --host 0.0.0.0 --port 8001 --reload

Docker run

docker build -t dip-clustering-service .
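docker run --rm -p 8001:8001 dip-clustering-service

For /cluster-run, the container also needs the database configuration described above, for example by adding -e CLUSTERING_DB_DSN=<dsn> to docker run, with a DSN that is reachable from inside the container.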

Spring configuration

Use the original request-upload mode:

dip:
  clustering:
    python:
      enabled: true
      base-url: http://localhost:8001
      cluster-path: /cluster
      cluster-run-path: /cluster-run
      request-mode: INLINE_VECTORS
      connect-timeout: 30s
      read-timeout: 30m

Use compact runId mode:

dip:
  clustering:
    python:
      enabled: true
      base-url: http://localhost:8001
      cluster-path: /cluster
      cluster-run-path: /cluster-run
      request-mode: RUN_ID
      connect-timeout: 30s
      read-timeout: 30m

INLINE_VECTORS is the default if request-mode is omitted.
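
The effect of request-mode can be pictured as a small dispatch step on the caller's side. The sketch below is a Python illustration of that behavior as described in this README, not the Spring module's code; resolve_target is a hypothetical name, and the fallback paths used when cluster-path or cluster-run-path are missing are assumptions.

```python
def resolve_target(config, run_id, inline_payload=None):
    """Return (url, body) for one clustering call, based on request-mode."""
    # INLINE_VECTORS is the documented default when request-mode is omitted.
    mode = config.get("request-mode", "INLINE_VECTORS")
    base = config["base-url"].rstrip("/")

    if mode == "RUN_ID":
        # Compact mode: only the run id travels over HTTP.
        path = config.get("cluster-run-path", "/cluster-run")
        return base + path, {"runId": run_id}

    if mode == "INLINE_VECTORS":
        # Original mode: the full payload, including vectors, is uploaded.
        if inline_payload is None:
            raise ValueError("INLINE_VECTORS mode needs a full request payload")
        path = config.get("cluster-path", "/cluster")
        return base + path, inline_payload

    raise ValueError(f"unknown request-mode: {mode}")


config = {
    "base-url": "http://localhost:8001",
    "cluster-path": "/cluster",
    "cluster-run-path": "/cluster-run",
    "request-mode": "RUN_ID",
}
url, body = resolve_target(config, "6c3bc3a3-24b0-47a5-9e35-92dd4b7275f8")
```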