DIP/python/dip-clustering-service
trifonovt 979f1ba18e clustering 2026-04-24 16:34:38 +02:00

DIP Clustering Service

Remote Python clustering backend for the DIP Spring clustering module.

Main execution mode

The preferred execution mode is the compact runId mode:

  • Spring owns the run metadata, the selection snapshot, and the run lifecycle.
  • Spring sends only a compact request containing runId.
  • Python loads the run metadata and selected embeddings directly from Postgres.
  • Python returns compact assignments keyed by embeddingId.

This avoids sending the full embedding matrix through HTTP.

Implemented algorithms

  • KMEANS
  • MINI_BATCH_KMEANS
  • DBSCAN
  • HDBSCAN
  • AGGLOMERATIVE

Implemented reductions

  • NONE
  • PCA
  • UMAP

API

GET /health

Returns the service status and the supported algorithms and reduction methods.

POST /cluster-run

Preferred endpoint. The request body contains only the cluster run ID (runId).

Example request body:

{
  "runId": "6c3bc3a3-24b0-47a5-9e35-92dd4b7275f8"
}
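As a minimal client-side sketch of the request above, the following builds the POST /cluster-run call with the Python standard library. The helper name build_cluster_run_request and the base URL argument are illustrative assumptions, not part of the service; the response schema is not shown in this README, so the sketch stops at building the request.

```python
import json
import urllib.request


def build_cluster_run_request(base_url: str, run_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST /cluster-run request for one run.

    Hypothetical helper for illustration; only the endpoint path and the
    "runId" field are taken from the README.
    """
    body = json.dumps({"runId": run_id}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/cluster-run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends it to a running service. Clustering can take a long time, so a generous client timeout is advisable.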

This service supports two remote execution modes at the same time:

  • POST /cluster
    • Spring uploads embeddings in the request body.
    • This keeps the original implementation intact.
  • POST /cluster-run
    • Spring sends only runId.
    • Python loads run metadata and embeddings directly from Postgres.

Start

py -3.11 -m venv .venv
.\.venv\Scripts\python.exe -m pip install --upgrade pip
.\.venv\Scripts\python.exe -m pip install -r requirements.txt

Configure database access for /cluster-run as described under "Required database configuration" below.

POST /cluster

Accepts the Spring PythonClusteringRequest payload and returns PythonClusteringResponse.

Example request body:

{
  "algorithm": "DBSCAN",
  "parameters": {
    "eps": 0.25,
    "minSamples": 5,
    "metric": "euclidean",
    "normalizeVectors": true
  },
  "reductionMethod": "PCA",
  "reductionDimensions": 100,
  "items": [
    {
      "embeddingId": "11111111-1111-1111-1111-111111111111",
      "documentId": "22222222-2222-2222-2222-222222222222",
      "representationId": "33333333-3333-3333-3333-333333333333",
      "vector": [0.1, 0.2, 0.3]
    }
  ]
}
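The payload above can be assembled programmatically. The sketch below is one possible way to do it in Python; the helper names build_item and build_cluster_payload are hypothetical. The consistency check on vector length and the omission of reductionDimensions when it is not given are assumptions about what the service accepts, not documented behavior.

```python
from typing import Any, Dict, List, Optional, Sequence


def build_item(embedding_id: str, document_id: str, representation_id: str,
               vector: Sequence[float]) -> Dict[str, Any]:
    """Build one entry of the "items" array (field names as in the example)."""
    return {
        "embeddingId": embedding_id,
        "documentId": document_id,
        "representationId": representation_id,
        "vector": [float(x) for x in vector],
    }


def build_cluster_payload(algorithm: str,
                          parameters: Dict[str, Any],
                          items: List[Dict[str, Any]],
                          reduction_method: str = "NONE",
                          reduction_dimensions: Optional[int] = None) -> Dict[str, Any]:
    """Assemble a POST /cluster request body as a plain dict."""
    if not items:
        raise ValueError("at least one item is required")
    # Assumption: all vectors must share one dimensionality.
    dims = {len(item["vector"]) for item in items}
    if len(dims) != 1:
        raise ValueError(f"inconsistent vector dimensions: {sorted(dims)}")
    payload: Dict[str, Any] = {
        "algorithm": algorithm,
        "parameters": dict(parameters),
        "reductionMethod": reduction_method,
        "items": list(items),
    }
    # Assumption: reductionDimensions may be left out when not needed.
    if reduction_dimensions is not None:
        payload["reductionDimensions"] = reduction_dimensions
    return payload
```

Serializing the result with json.dumps yields a body shaped like the example request.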

Parameters by algorithm

KMEANS

  • k required
  • randomState optional, default 42
  • nInit optional, default 10
  • maxIter optional, default 300

MINI_BATCH_KMEANS

  • k required
  • batchSize optional
  • randomState optional, default 42
  • nInit optional, default 10
  • maxIter optional, default 300

DBSCAN

  • eps required
  • minSamples optional, default 5
  • metric optional, default euclidean
  • algorithm optional, default auto
  • nJobs optional, default -1

HDBSCAN

  • minClusterSize optional, default 10
  • minSamples optional
  • metric optional, default euclidean
  • clusterSelectionMethod optional, default eom

AGGLOMERATIVE

  • k required
  • linkage optional, default average
  • metric optional, default euclidean
  • computeDistances optional, default false

Shared parameters

  • normalizeVectors optional, default true
  • randomState optional, used by KMEANS, MINI_BATCH_KMEANS, PCA, UMAP

UMAP reduction parameters

  • reductionMetric optional, default cosine
  • umapNeighbors optional, default 15
  • umapMinDist optional, default 0.0
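The parameter tables above can be summarized as data. The following sketch shows how a caller might validate required parameters and fill in the documented defaults before sending a request. It illustrates the tables only and is not the service's actual validation code; the function name resolve_parameters is hypothetical, and placing the UMAP settings in the same parameters object is an assumption.

```python
from typing import Any, Dict

# Required parameters per algorithm, as listed in this README.
REQUIRED = {
    "KMEANS": ("k",),
    "MINI_BATCH_KMEANS": ("k",),
    "DBSCAN": ("eps",),
    "HDBSCAN": (),
    "AGGLOMERATIVE": ("k",),
}

# Documented defaults; parameters without a documented default are omitted.
DEFAULTS = {
    "KMEANS": {"randomState": 42, "nInit": 10, "maxIter": 300},
    "MINI_BATCH_KMEANS": {"randomState": 42, "nInit": 10, "maxIter": 300},
    "DBSCAN": {"minSamples": 5, "metric": "euclidean",
               "algorithm": "auto", "nJobs": -1},
    "HDBSCAN": {"minClusterSize": 10, "metric": "euclidean",
                "clusterSelectionMethod": "eom"},
    "AGGLOMERATIVE": {"linkage": "average", "metric": "euclidean",
                      "computeDistances": False},
}

SHARED_DEFAULTS = {"normalizeVectors": True}
UMAP_DEFAULTS = {"reductionMetric": "cosine", "umapNeighbors": 15,
                 "umapMinDist": 0.0}


def resolve_parameters(algorithm: str, parameters: Dict[str, Any],
                       reduction_method: str = "NONE") -> Dict[str, Any]:
    """Check required parameters and merge documented defaults under them."""
    if algorithm not in REQUIRED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    missing = [name for name in REQUIRED[algorithm] if name not in parameters]
    if missing:
        raise ValueError(f"{algorithm} requires: {', '.join(missing)}")
    resolved: Dict[str, Any] = {**SHARED_DEFAULTS, **DEFAULTS[algorithm]}
    if reduction_method == "UMAP":
        resolved.update(UMAP_DEFAULTS)
    # Caller-supplied values always win over defaults.
    resolved.update(parameters)
    return resolved
```

For example, resolve_parameters("DBSCAN", {"eps": 0.25}) keeps eps as given and adds minSamples, metric, algorithm, nJobs, and normalizeVectors with their documented defaults.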

Local run

Required database configuration

Set one of the following:

  • CLUSTERING_DB_DSN
  • DATABASE_URL
  • CLUSTERING_DB_HOST, CLUSTERING_DB_PORT, CLUSTERING_DB_NAME, CLUSTERING_DB_USER, and CLUSTERING_DB_PASSWORD together

Example (Linux/macOS shell):

export CLUSTERING_DB_DSN=postgresql://postgres:postgres@localhost:5432/dip
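To make the three configuration options concrete, here is a sketch of how a connection string could be resolved from these variables. The precedence (CLUSTERING_DB_DSN, then DATABASE_URL, then the individual parts) simply follows the listing order and is an assumption, as is the function name resolve_dsn; the service's real lookup may differ.

```python
from typing import Mapping, Optional
from urllib.parse import quote

PART_KEYS = (
    "CLUSTERING_DB_HOST",
    "CLUSTERING_DB_PORT",
    "CLUSTERING_DB_NAME",
    "CLUSTERING_DB_USER",
    "CLUSTERING_DB_PASSWORD",
)


def resolve_dsn(env: Mapping[str, str]) -> Optional[str]:
    """Return a Postgres DSN from the environment, or None if unconfigured.

    Assumed precedence: full DSN variables first, then the split variables.
    """
    for key in ("CLUSTERING_DB_DSN", "DATABASE_URL"):
        value = env.get(key)
        if value:
            return value
    if all(env.get(key) for key in PART_KEYS):
        # Percent-encode credentials so characters like "@" stay unambiguous.
        user = quote(env["CLUSTERING_DB_USER"], safe="")
        password = quote(env["CLUSTERING_DB_PASSWORD"], safe="")
        host = env["CLUSTERING_DB_HOST"]
        port = env["CLUSTERING_DB_PORT"]
        name = env["CLUSTERING_DB_NAME"]
        return f"postgresql://{user}:{password}@{host}:{port}/{name}"
    return None
```

Calling resolve_dsn(os.environ) applies this to the current process environment.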

Local run on Windows

$env:CLUSTERING_DB_DSN="postgresql://postgres:postgres@localhost:5432/dip"
.\.venv\Scripts\python.exe -m uvicorn app.main:app --host 0.0.0.0 --port 8001 --reload

Docker run

docker build -t dip-clustering-service .
docker run --rm -p 8001:8001 dip-clustering-service

Spring configuration

Use the original request-upload mode:

dip:
  clustering:
    python:
      enabled: true
      base-url: http://localhost:8001
      cluster-path: /cluster
      cluster-run-path: /cluster-run
      request-mode: INLINE_VECTORS
      connect-timeout: 30s
      read-timeout: 30m

Use compact runId mode:

dip:
  clustering:
    python:
      enabled: true
      base-url: http://localhost:8001
      cluster-path: /cluster
      cluster-run-path: /cluster-run
      request-mode: RUN_ID
      connect-timeout: 30s
      read-timeout: 30m

INLINE_VECTORS is the default if request-mode is omitted.
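The effect of request-mode can be sketched as a small dispatch function. The real selection happens inside the Spring module, which is not shown in this README; the Python below is only an illustration of the behavior described above, and the function name select_request is hypothetical.

```python
import json
from typing import Any, Dict, Optional, Tuple


def select_request(request_mode: Optional[str],
                   run_id: str,
                   inline_payload: Dict[str, Any],
                   cluster_path: str = "/cluster",
                   cluster_run_path: str = "/cluster-run") -> Tuple[str, bytes]:
    """Pick the endpoint path and JSON body for the configured request mode."""
    # INLINE_VECTORS is the documented default when request-mode is omitted.
    mode = request_mode or "INLINE_VECTORS"
    if mode == "RUN_ID":
        # Compact mode: only the run id travels over HTTP.
        return cluster_run_path, json.dumps({"runId": run_id}).encode("utf-8")
    if mode == "INLINE_VECTORS":
        # Original mode: the full payload, including vectors, is uploaded.
        return cluster_path, json.dumps(inline_payload).encode("utf-8")
    raise ValueError(f"unknown request-mode: {mode}")
```

With request_mode set to None, the function falls back to the /cluster path and the full inline payload, matching the default noted above.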