# DIP Clustering Service
Remote Python clustering backend for the DIP Spring clustering module.
## Main execution mode
The preferred execution mode is now:
- Spring keeps run metadata, selection snapshot, and lifecycle.
- Spring sends only a compact request containing `runId`.
- Python loads the run metadata and selected embeddings directly from Postgres.
- Python returns compact assignments keyed by `embeddingId`.
This avoids sending the full embedding matrix through HTTP.
## Implemented algorithms
- `KMEANS`
- `MINI_BATCH_KMEANS`
- `DBSCAN`
- `HDBSCAN`
- `AGGLOMERATIVE`
## Implemented reductions
- `NONE`
- `PCA`
- `UMAP`
## API
### `GET /health`
Returns service status and supported algorithms/reduction methods.
### `POST /cluster-run`
Preferred endpoint. Accepts only the cluster run id.
Example request body:
```json
{
  "runId": "6c3bc3a3-24b0-47a5-9e35-92dd4b7275f8"
}
```
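A client call for this endpoint might look like the following. This is a minimal sketch using only the Python standard library; the helper name `build_cluster_run_request` is hypothetical, the base URL `http://localhost:8001` is taken from the run examples in this README, and the response shape is not shown because it is not specified here.

```python
import json
import urllib.request


def build_cluster_run_request(base_url: str, run_id: str) -> urllib.request.Request:
    """Build a POST /cluster-run request whose body carries only the run id.

    Hypothetical helper for illustration; not part of the service itself.
    """
    body = json.dumps({"runId": run_id}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/cluster-run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (needs a running service, so it is left commented out):
# request = build_cluster_run_request(
#     "http://localhost:8001", "6c3bc3a3-24b0-47a5-9e35-92dd4b7275f8"
# )
# with urllib.request.urlopen(request, timeout=1800) as response:
#     print(json.loads(response.read()))
```

The long timeout in the usage comment mirrors the `read-timeout: 30m` used in the Spring configuration, since a clustering run can take a while.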
This service supports two remote execution modes at the same time:
- `POST /cluster`
  - Spring uploads embeddings in the request body.
  - This keeps the original implementation intact.
- `POST /cluster-run`
  - Spring sends only `runId`.
  - Python loads run metadata and embeddings directly from Postgres.
## Start
```powershell
py -3.11 -m venv .venv
.\.venv\Scripts\python.exe -m pip install --upgrade pip
.\.venv\Scripts\python.exe -m pip install -r requirements.txt
```
Configure DB access for `/cluster-run` as described under "Required database configuration".
### `POST /cluster`
Accepts the Spring `PythonClusteringRequest` payload and returns `PythonClusteringResponse`.
Example request body:
```json
{
  "algorithm": "DBSCAN",
  "parameters": {
    "eps": 0.25,
    "minSamples": 5,
    "metric": "euclidean",
    "normalizeVectors": true
  },
  "reductionMethod": "PCA",
  "reductionDimensions": 100,
  "items": [
    {
      "embeddingId": "11111111-1111-1111-1111-111111111111",
      "documentId": "22222222-2222-2222-2222-222222222222",
      "representationId": "33333333-3333-3333-3333-333333333333",
      "vector": [0.1, 0.2, 0.3]
    }
  ]
}
```
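The payload above can be assembled programmatically. The sketch below is illustrative only: `build_cluster_request` is a hypothetical helper, the check that all vectors share one dimension is an assumed sanity check rather than documented service behavior, and omitting `reductionDimensions` when it is not set is likewise an assumption.

```python
from typing import Any, Dict, Optional, Sequence

ITEM_KEYS = ("embeddingId", "documentId", "representationId", "vector")


def build_cluster_request(
    algorithm: str,
    parameters: Dict[str, Any],
    items: Sequence[Dict[str, Any]],
    reduction_method: str = "NONE",
    reduction_dimensions: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble a dict shaped like the example /cluster request body."""
    if not items:
        raise ValueError("at least one item is required")
    dimension = len(items[0]["vector"])
    for item in items:
        missing = [key for key in ITEM_KEYS if key not in item]
        if missing:
            raise ValueError(f"item is missing keys: {missing}")
        # Assumed sanity check: every embedding has the same dimensionality.
        if len(item["vector"]) != dimension:
            raise ValueError("all vectors must have the same length")
    payload: Dict[str, Any] = {
        "algorithm": algorithm,
        "parameters": dict(parameters),
        "reductionMethod": reduction_method,
        "items": [dict(item) for item in items],
    }
    # Assumption: the field is left out entirely when no reduction size is given.
    if reduction_dimensions is not None:
        payload["reductionDimensions"] = reduction_dimensions
    return payload
```

Serializing the result with `json.dumps` yields a body in the same shape as the example request.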
## Parameters by algorithm
### KMEANS
- `k` required
- `randomState` optional, default `42`
- `nInit` optional, default `10`
- `maxIter` optional, default `300`
### MINI_BATCH_KMEANS
- `k` required
- `batchSize` optional
- `randomState` optional, default `42`
- `nInit` optional, default `10`
- `maxIter` optional, default `300`
### DBSCAN
- `eps` required
- `minSamples` optional, default `5`
- `metric` optional, default `euclidean`
- `algorithm` optional, default `auto`
- `nJobs` optional, default `-1`
### HDBSCAN
- `minClusterSize` optional, default `10`
- `minSamples` optional
- `metric` optional, default `euclidean`
- `clusterSelectionMethod` optional, default `eom`
### AGGLOMERATIVE
- `k` required
- `linkage` optional, default `average`
- `metric` optional, default `euclidean`
- `computeDistances` optional, default `false`
## Shared parameters
- `normalizeVectors` optional, default `true`
- `randomState` optional, used by `KMEANS`, `MINI_BATCH_KMEANS`, `PCA`, `UMAP`
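The required parameters and defaults listed above can be summarized as a lookup table. The following is a sketch of how a caller might validate and complete a `parameters` object before sending it; `resolve_parameters` and the table names are hypothetical, and the service's own validation may differ.

```python
from typing import Any, Dict

# Required parameter names per algorithm, as listed in this README.
REQUIRED = {
    "KMEANS": ("k",),
    "MINI_BATCH_KMEANS": ("k",),
    "DBSCAN": ("eps",),
    "HDBSCAN": (),
    "AGGLOMERATIVE": ("k",),
}

# Documented defaults; parameters without a documented default are left out.
DEFAULTS = {
    "KMEANS": {"randomState": 42, "nInit": 10, "maxIter": 300},
    "MINI_BATCH_KMEANS": {"randomState": 42, "nInit": 10, "maxIter": 300},
    "DBSCAN": {"minSamples": 5, "metric": "euclidean", "algorithm": "auto", "nJobs": -1},
    "HDBSCAN": {"minClusterSize": 10, "metric": "euclidean", "clusterSelectionMethod": "eom"},
    "AGGLOMERATIVE": {"linkage": "average", "metric": "euclidean", "computeDistances": False},
}

SHARED_DEFAULTS = {"normalizeVectors": True}


def resolve_parameters(algorithm: str, supplied: Dict[str, Any]) -> Dict[str, Any]:
    """Check required parameters and fill in the documented defaults."""
    if algorithm not in REQUIRED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    missing = [name for name in REQUIRED[algorithm] if name not in supplied]
    if missing:
        raise ValueError(f"{algorithm} requires: {', '.join(missing)}")
    # Supplied values take precedence over defaults.
    return {**SHARED_DEFAULTS, **DEFAULTS[algorithm], **supplied}
```

For example, `resolve_parameters("DBSCAN", {"eps": 0.25})` keeps `eps` and adds `minSamples`, `metric`, `algorithm`, `nJobs`, and `normalizeVectors` with their defaults.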
## UMAP reduction parameters
- `reductionMetric` optional, default `cosine`
- `umapNeighbors` optional, default `15`
- `umapMinDist` optional, default `0.0`
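One plausible way these options map onto the umap-learn `UMAP` constructor is sketched below. The mapping (`umapNeighbors` to `n_neighbors`, `umapMinDist` to `min_dist`, `reductionMetric` to `metric`, `reductionDimensions` to `n_components`) is an assumption based on the parameter names, as is reading them from the `parameters` object; `umap_kwargs` is a hypothetical helper. It only builds keyword arguments, so it runs without umap-learn installed.

```python
from typing import Any, Dict


def umap_kwargs(parameters: Dict[str, Any], reduction_dimensions: int) -> Dict[str, Any]:
    """Translate the request's UMAP options into umap-learn keyword arguments."""
    kwargs: Dict[str, Any] = {
        "n_components": reduction_dimensions,
        "metric": parameters.get("reductionMetric", "cosine"),
        "n_neighbors": parameters.get("umapNeighbors", 15),
        "min_dist": parameters.get("umapMinDist", 0.0),
    }
    # randomState is a shared parameter that UMAP also uses, per this README.
    if "randomState" in parameters:
        kwargs["random_state"] = parameters["randomState"]
    return kwargs


# Usage, assuming umap-learn is installed:
# import umap
# reducer = umap.UMAP(**umap_kwargs({"umapNeighbors": 30}, reduction_dimensions=10))
```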
## Required database configuration
Set one of the following:
- `CLUSTERING_DB_DSN`
- `DATABASE_URL`
- `CLUSTERING_DB_HOST`, `CLUSTERING_DB_PORT`, `CLUSTERING_DB_NAME`, `CLUSTERING_DB_USER`, `CLUSTERING_DB_PASSWORD`
Example:
```bash
export CLUSTERING_DB_DSN=postgresql://postgres:postgres@localhost:5432/dip
```
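Resolving these settings into a single connection string could be sketched as follows. The precedence (DSN first, then `DATABASE_URL`, then the discrete variables) follows the order of the list above but is an assumption, as is requiring all five discrete variables; `resolve_dsn` is a hypothetical helper and not necessarily how the service reads its configuration.

```python
import os
from typing import Mapping, Optional
from urllib.parse import quote


def resolve_dsn(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a Postgres DSN from the environment, or None if nothing is configured."""
    # A full connection string wins if present (assumed precedence).
    for name in ("CLUSTERING_DB_DSN", "DATABASE_URL"):
        value = env.get(name)
        if value:
            return value
    # Otherwise build the DSN from the discrete variables (all five assumed required).
    keys = ("HOST", "PORT", "NAME", "USER", "PASSWORD")
    parts = {key: env.get("CLUSTERING_DB_" + key) for key in keys}
    if not all(parts.values()):
        return None
    return "postgresql://{user}:{password}@{host}:{port}/{name}".format(
        # Percent-encode credentials so characters like "@" do not break the URL.
        user=quote(parts["USER"], safe=""),
        password=quote(parts["PASSWORD"], safe=""),
        host=parts["HOST"],
        port=parts["PORT"],
        name=quote(parts["NAME"], safe=""),
    )
```

Passing a plain dict instead of `os.environ` makes the lookup easy to test without touching the real environment.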
## Local run on Windows
```powershell
$env:CLUSTERING_DB_DSN="postgresql://postgres:postgres@localhost:5432/dip"
.\.venv\Scripts\python.exe -m uvicorn app.main:app --host 0.0.0.0 --port 8001 --reload
```
## Docker run
```bash
docker build -t dip-clustering-service .
docker run --rm -p 8001:8001 dip-clustering-service
```
## Spring configuration
Use the original request-upload mode:
```yaml
dip:
  clustering:
    python:
      enabled: true
      base-url: http://localhost:8001
      cluster-path: /cluster
      cluster-run-path: /cluster-run
      request-mode: INLINE_VECTORS
      connect-timeout: 30s
      read-timeout: 30m
```
Use compact `runId` mode:
```yaml
dip:
  clustering:
    python:
      enabled: true
      base-url: http://localhost:8001
      cluster-path: /cluster
      cluster-run-path: /cluster-run
      request-mode: RUN_ID
      connect-timeout: 30s
      read-timeout: 30m
```
`INLINE_VECTORS` is the default if `request-mode` is omitted.