# DIP Clustering Service

Remote Python clustering backend for the DIP Spring clustering module.

## Main execution mode

In the preferred execution mode:

- Spring keeps run metadata, selection snapshot, and lifecycle.
- Spring sends only a compact request containing `runId`.
- Python loads the run metadata and selected embeddings directly from Postgres.
- Python returns compact assignments keyed by `embeddingId`.

This avoids sending the full embedding matrix through HTTP.
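As a rough illustration of what "compact assignments keyed by `embeddingId`" could look like, the sketch below pairs each embedding ID with its cluster label. The function name `build_assignments` and the flat mapping shape are assumptions for illustration; the actual response format is defined by the service and is not shown in this README.

```python
from typing import Dict, Sequence

# scikit-learn's DBSCAN and HDBSCAN label noise points with -1.
NOISE_LABEL = -1


def build_assignments(embedding_ids: Sequence[str], labels: Sequence[int]) -> Dict[str, int]:
    """Pair each embedding ID with its cluster label (hypothetical response shape)."""
    if len(embedding_ids) != len(labels):
        raise ValueError("embedding_ids and labels must have the same length")
    return {eid: int(label) for eid, label in zip(embedding_ids, labels)}
```

A mapping like this carries one small integer per embedding, which is far smaller than the embedding vectors themselves.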

## Implemented algorithms

- `KMEANS`
- `MINI_BATCH_KMEANS`
- `DBSCAN`
- `HDBSCAN`
- `AGGLOMERATIVE`

## Implemented reductions

- `NONE`
- `PCA`
- `UMAP`

## API

### `GET /health`

Returns service status and supported algorithms/reduction methods.
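A minimal sketch of a health check from Python, using only the standard library. It assumes the service listens on `http://localhost:8001` (the port used in the run examples) and that the response body is JSON; the helper names are hypothetical and no response field names are assumed.

```python
import json
import urllib.request

# Assumption: same host and port as in the run examples of this README.
BASE_URL = "http://localhost:8001"


def health_url(base_url: str = BASE_URL) -> str:
    """Build the URL of the health endpoint."""
    return base_url.rstrip("/") + "/health"


def fetch_health(base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """Send GET /health and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(health_url(base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `fetch_health()` returns the decoded status payload.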

### `POST /cluster-run`

Preferred endpoint. Accepts only the cluster run ID (`runId`).

Example request body:

```json
{
  "runId": "6c3bc3a3-24b0-47a5-9e35-92dd4b7275f8"
}
```
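A minimal client sketch for this endpoint, using only the standard library. The helper name `build_cluster_run_request` is hypothetical; the function only builds the request, so it can be inspected without a running service.

```python
import json
import urllib.request


def build_cluster_run_request(base_url: str, run_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST /cluster-run request for the given run."""
    body = json.dumps({"runId": run_id}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/cluster-run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the request to `urllib.request.urlopen(request, timeout=...)`. Clustering can take a long time (the Spring examples in this README use a 30 minute read timeout), so a generous timeout is probably appropriate.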

Both remote execution modes are available at the same time:

- `POST /cluster`
  - Spring uploads embeddings in the request body.
  - This keeps the original implementation intact.
- `POST /cluster-run`
  - Spring sends only `runId`.
  - Python loads run metadata and embeddings directly from Postgres.

### `POST /cluster`

Accepts the Spring `PythonClusteringRequest` payload and returns `PythonClusteringResponse`.

Example request body:

```json
{
  "algorithm": "DBSCAN",
  "parameters": {
    "eps": 0.25,
    "minSamples": 5,
    "metric": "euclidean",
    "normalizeVectors": true
  },
  "reductionMethod": "PCA",
  "reductionDimensions": 100,
  "items": [
    {
      "embeddingId": "11111111-1111-1111-1111-111111111111",
      "documentId": "22222222-2222-2222-2222-222222222222",
      "representationId": "33333333-3333-3333-3333-333333333333",
      "vector": [0.1, 0.2, 0.3]
    }
  ]
}
```

## Start

```powershell
py -3.11 -m venv .venv
.\.venv\Scripts\python.exe -m pip install --upgrade pip
.\.venv\Scripts\python.exe -m pip install -r requirements.txt
```

Configure DB access for `/cluster-run` as described under "Required database configuration".

## Parameters by algorithm

### KMEANS
- `k` required
- `randomState` optional, default `42`
- `nInit` optional, default `10`
- `maxIter` optional, default `300`

### MINI_BATCH_KMEANS
- `k` required
- `batchSize` optional
- `randomState` optional, default `42`
- `nInit` optional, default `10`
- `maxIter` optional, default `300`

### DBSCAN
- `eps` required
- `minSamples` optional, default `5`
- `metric` optional, default `euclidean`
- `algorithm` optional, default `auto`
- `nJobs` optional, default `-1`

### HDBSCAN
- `minClusterSize` optional, default `10`
- `minSamples` optional
- `metric` optional, default `euclidean`
- `clusterSelectionMethod` optional, default `eom`

### AGGLOMERATIVE
- `k` required
- `linkage` optional, default `average`
- `metric` optional, default `euclidean`
- `computeDistances` optional, default `false`
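The required parameters and defaults listed above can be summarized as a small lookup table. The sketch below is illustrative only: the names `REQUIRED`, `DEFAULTS`, and `resolve_parameters` are hypothetical, and the service's own validation may behave differently.

```python
from typing import Any, Dict

# Required parameters per algorithm, transcribed from the lists above.
REQUIRED = {
    "KMEANS": ["k"],
    "MINI_BATCH_KMEANS": ["k"],
    "DBSCAN": ["eps"],
    "HDBSCAN": [],
    "AGGLOMERATIVE": ["k"],
}

# Documented defaults; optional parameters without a stated default are omitted.
DEFAULTS = {
    "KMEANS": {"randomState": 42, "nInit": 10, "maxIter": 300},
    "MINI_BATCH_KMEANS": {"randomState": 42, "nInit": 10, "maxIter": 300},
    "DBSCAN": {"minSamples": 5, "metric": "euclidean", "algorithm": "auto", "nJobs": -1},
    "HDBSCAN": {"minClusterSize": 10, "metric": "euclidean", "clusterSelectionMethod": "eom"},
    "AGGLOMERATIVE": {"linkage": "average", "metric": "euclidean", "computeDistances": False},
}


def resolve_parameters(algorithm: str, parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Check required parameters and fill in the documented defaults."""
    if algorithm not in REQUIRED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    missing = [name for name in REQUIRED[algorithm] if name not in parameters]
    if missing:
        raise ValueError(f"{algorithm} requires: {', '.join(missing)}")
    # Caller-supplied values take precedence over defaults.
    return {**DEFAULTS[algorithm], **parameters}
```

For example, `resolve_parameters("DBSCAN", {"eps": 0.25})` fills in `minSamples`, `metric`, `algorithm`, and `nJobs`, while `resolve_parameters("KMEANS", {})` raises because `k` is missing.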

## Shared parameters

- `normalizeVectors` optional, default `true`
- `randomState` optional, used by `KMEANS`, `MINI_BATCH_KMEANS`, `PCA`, `UMAP`
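This README does not define what `normalizeVectors` does. The sketch below assumes it means L2 (unit-length) normalization, which is a common choice for embeddings; treat that as an assumption, not a description of the service. Under that assumption, Euclidean distance between normalized vectors is a monotonic function of cosine similarity, which helps when choosing `eps` for `DBSCAN`.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def euclidean_to_cosine_similarity(distance: float) -> float:
    """For unit vectors a and b, ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - (distance ** 2) / 2.0
```

With unit-length vectors, an `eps` of `0.25` under the `euclidean` metric corresponds to a cosine similarity of about `0.969`.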

## UMAP reduction parameters

- `reductionMetric` optional, default `cosine`
- `umapNeighbors` optional, default `15`
- `umapMinDist` optional, default `0.0`
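If the reduction is backed by the `umap-learn` package, these parameters plausibly map onto the `umap.UMAP` constructor arguments `n_neighbors`, `min_dist`, `metric`, `n_components`, and `random_state`. The mapping below is an assumption for illustration, as is the idea that the UMAP settings arrive in the same `parameters` object as the algorithm settings; the helper name `umap_kwargs` is hypothetical.

```python
from typing import Any, Dict


def umap_kwargs(reduction_dimensions: int, parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Translate README parameter names into umap.UMAP keyword arguments (assumed mapping)."""
    return {
        "n_components": reduction_dimensions,
        "n_neighbors": parameters.get("umapNeighbors", 15),
        "min_dist": parameters.get("umapMinDist", 0.0),
        "metric": parameters.get("reductionMetric", "cosine"),
        # No default is documented for UMAP's random state, so None is passed through.
        "random_state": parameters.get("randomState"),
    }
```

The result could then be passed as `umap.UMAP(**umap_kwargs(50, params))` in an environment where `umap-learn` is installed.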

## Required database configuration

`/cluster-run` needs database access. Set one of the following:

- `CLUSTERING_DB_DSN`
- `DATABASE_URL`
- `CLUSTERING_DB_HOST`, `CLUSTERING_DB_PORT`, `CLUSTERING_DB_NAME`, `CLUSTERING_DB_USER`, and `CLUSTERING_DB_PASSWORD`
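One way such a lookup could work is sketched below. The precedence (DSN first, then `DATABASE_URL`, then the individual fields) simply follows the order of the list above and is an assumption, as are the function name `resolve_dsn` and the fallback to Postgres's default port `5432`; the service's actual resolution logic may differ.

```python
import os
from typing import Mapping, Optional
from urllib.parse import quote


def resolve_dsn(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a Postgres DSN from the environment, or None if nothing is configured."""
    # A full DSN wins if present (assumed precedence).
    for variable in ("CLUSTERING_DB_DSN", "DATABASE_URL"):
        if env.get(variable):
            return env[variable]

    host = env.get("CLUSTERING_DB_HOST")
    database = env.get("CLUSTERING_DB_NAME")
    user = env.get("CLUSTERING_DB_USER")
    if not (host and database and user):
        return None

    port = env.get("CLUSTERING_DB_PORT", "5432")  # Postgres default port
    password = env.get("CLUSTERING_DB_PASSWORD", "")
    # Percent-encode credentials so special characters do not break the URL.
    credentials = quote(user, safe="")
    if password:
        credentials += ":" + quote(password, safe="")
    return f"postgresql://{credentials}@{host}:{port}/{database}"
```

Passing the environment as a mapping keeps the function easy to test with plain dictionaries.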

Example:

```bash
export CLUSTERING_DB_DSN=postgresql://postgres:postgres@localhost:5432/dip
```

## Local run on Windows

```powershell
$env:CLUSTERING_DB_DSN="postgresql://postgres:postgres@localhost:5432/dip"
.\.venv\Scripts\python.exe -m uvicorn app.main:app --host 0.0.0.0 --port 8001 --reload
```


## Docker run

```bash
docker build -t dip-clustering-service .
docker run --rm -p 8001:8001 dip-clustering-service
```

For `/cluster-run`, the container also needs the database configuration, for example passed with `docker run -e CLUSTERING_DB_DSN=...`.

## Spring configuration

To use the original request-upload mode (`POST /cluster`):

```yaml
dip:
  clustering:
    python:
      enabled: true
      base-url: http://localhost:8001
      cluster-path: /cluster
      cluster-run-path: /cluster-run
      request-mode: INLINE_VECTORS
      connect-timeout: 30s
      read-timeout: 30m
```

To use the compact `runId` mode (`POST /cluster-run`):

```yaml
dip:
  clustering:
    python:
      enabled: true
      base-url: http://localhost:8001
      cluster-path: /cluster
      cluster-run-path: /cluster-run
      request-mode: RUN_ID
      connect-timeout: 30s
      read-timeout: 30m
```

`INLINE_VECTORS` is the default if `request-mode` is omitted.