# DIP Clustering Service

Remote Python clustering backend for the DIP Spring clustering module.

## Main execution mode

The preferred execution mode is now:

- Spring keeps run metadata, selection snapshot, and lifecycle.
- Spring sends only a compact request containing `runId`.
- Python loads the run metadata and selected embeddings directly from Postgres.
- Python returns compact assignments keyed by `embeddingId`.

This avoids sending the full embedding matrix through HTTP.

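As a rough illustration of the last step, the sketch below pairs each `embeddingId` with its cluster label. The helper name `build_assignments` and the output shape (a plain mapping from id to integer label) are assumptions made for illustration; the actual response schema is not described in this README.

```python
from typing import Dict, Sequence


def build_assignments(embedding_ids: Sequence[str], labels: Sequence[int]) -> Dict[str, int]:
    """Pair each embedding id with its cluster label (hypothetical helper)."""
    if len(embedding_ids) != len(labels):
        raise ValueError("embedding_ids and labels must have the same length")
    # Cast to int so that NumPy integer labels serialize cleanly as JSON.
    return {embedding_id: int(label) for embedding_id, label in zip(embedding_ids, labels)}
```

A mapping like this stays small even for large runs, because it carries one integer per embedding instead of a full vector.
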
## Implemented algorithms

- `KMEANS`
- `MINI_BATCH_KMEANS`
- `DBSCAN`
- `HDBSCAN`
- `AGGLOMERATIVE`

## Implemented reductions

- `NONE`
- `PCA`
- `UMAP`

## API

### `GET /health`

Returns service status and supported algorithms/reduction methods.

### `POST /cluster-run`

Preferred endpoint. Accepts only the cluster run id.

Example request body:

```json
{
  "runId": "6c3bc3a3-24b0-47a5-9e35-92dd4b7275f8"
}
```

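For quick manual testing, the request body above can be built from Python with the standard library alone. This is a minimal sketch: the base URL assumes the service listens on `localhost:8001` as in the run commands of this README, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: the service is reachable on localhost:8001.
BASE_URL = "http://localhost:8001"


def build_cluster_run_request(run_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /cluster-run request carrying only runId."""
    body = json.dumps({"runId": run_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/cluster-run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_cluster_run(run_id: str, timeout_seconds: float = 1800.0) -> dict:
    """Send the request; needs a running service with database access."""
    request = build_cluster_run_request(run_id)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return json.load(response)
```

Calling `send_cluster_run("6c3bc3a3-24b0-47a5-9e35-92dd4b7275f8")` would return the parsed JSON response, provided the run id exists in the database the service is connected to.
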
This service supports two remote execution modes at the same time:

- `POST /cluster`
  - Spring uploads embeddings in the request body.
  - This keeps the original implementation intact.
- `POST /cluster-run`
  - Spring sends only `runId`.
  - Python loads run metadata and embeddings directly from Postgres.

## Start

```powershell
py -3.11 -m venv .venv
.\.venv\Scripts\python.exe -m pip install --upgrade pip
.\.venv\Scripts\python.exe -m pip install -r requirements.txt
```

Configure DB access for `/cluster-run` with the environment variables listed under "Required database configuration".

### `POST /cluster`

Accepts the Spring `PythonClusteringRequest` payload and returns `PythonClusteringResponse`.

Example request body:

```json
{
  "algorithm": "DBSCAN",
  "parameters": {
    "eps": 0.25,
    "minSamples": 5,
    "metric": "euclidean",
    "normalizeVectors": true
  },
  "reductionMethod": "PCA",
  "reductionDimensions": 100,
  "items": [
    {
      "embeddingId": "11111111-1111-1111-1111-111111111111",
      "documentId": "22222222-2222-2222-2222-222222222222",
      "representationId": "33333333-3333-3333-3333-333333333333",
      "vector": [0.1, 0.2, 0.3]
    }
  ]
}
```

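Before uploading a large payload it can help to sanity-check it on the client side. The sketch below is a hypothetical check, not the service's own validation. It assumes PCA behaves as in scikit-learn, where the number of components cannot exceed `min(n_samples, n_features)`. Under that assumption, the single three-element vector in the example above should be read as abbreviated.

```python
from typing import Any, Dict, List


def check_cluster_payload(payload: Dict[str, Any]) -> int:
    """Return the shared vector dimension, or raise ValueError (hypothetical check)."""
    items: List[Dict[str, Any]] = payload.get("items") or []
    if not items:
        raise ValueError("items must contain at least one embedding")
    dims = {len(item["vector"]) for item in items}
    if len(dims) != 1:
        raise ValueError(f"vectors have inconsistent dimensions: {sorted(dims)}")
    dim = dims.pop()
    method = payload.get("reductionMethod", "NONE")
    target = payload.get("reductionDimensions")
    # Assumed PCA constraint: no more components than min(n_samples, n_features).
    if method == "PCA" and target is not None and target > min(len(items), dim):
        raise ValueError("reductionDimensions exceeds min(item count, vector dimension)")
    return dim
```
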
## Parameters by algorithm

### KMEANS
- `k` required
- `randomState` optional, default `42`
- `nInit` optional, default `10`
- `maxIter` optional, default `300`

### MINI_BATCH_KMEANS
- `k` required
- `batchSize` optional
- `randomState` optional, default `42`
- `nInit` optional, default `10`
- `maxIter` optional, default `300`

### DBSCAN
- `eps` required
- `minSamples` optional, default `5`
- `metric` optional, default `euclidean`
- `algorithm` optional, default `auto`
- `nJobs` optional, default `-1`

### HDBSCAN
- `minClusterSize` optional, default `10`
- `minSamples` optional
- `metric` optional, default `euclidean`
- `clusterSelectionMethod` optional, default `eom`

### AGGLOMERATIVE
- `k` required
- `linkage` optional, default `average`
- `metric` optional, default `euclidean`
- `computeDistances` optional, default `false`

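The required/optional rules in the lists above can be summarized as a small lookup. This is a sketch only: the names `DEFAULTS`, `REQUIRED`, and `resolve_parameters` are hypothetical, and the service may resolve defaults differently.

```python
from typing import Any, Dict, List

# Defaults copied from the parameter lists; parameters documented without a
# default (batchSize, HDBSCAN minSamples) are left out when the caller omits them.
DEFAULTS: Dict[str, Dict[str, Any]] = {
    "KMEANS": {"randomState": 42, "nInit": 10, "maxIter": 300},
    "MINI_BATCH_KMEANS": {"randomState": 42, "nInit": 10, "maxIter": 300},
    "DBSCAN": {"minSamples": 5, "metric": "euclidean", "algorithm": "auto", "nJobs": -1},
    "HDBSCAN": {"minClusterSize": 10, "metric": "euclidean", "clusterSelectionMethod": "eom"},
    "AGGLOMERATIVE": {"linkage": "average", "metric": "euclidean", "computeDistances": False},
}
REQUIRED: Dict[str, List[str]] = {
    "KMEANS": ["k"],
    "MINI_BATCH_KMEANS": ["k"],
    "DBSCAN": ["eps"],
    "HDBSCAN": [],
    "AGGLOMERATIVE": ["k"],
}


def resolve_parameters(algorithm: str, parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Check required keys and fill in documented defaults (hypothetical helper)."""
    if algorithm not in DEFAULTS:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    missing = [name for name in REQUIRED[algorithm] if name not in parameters]
    if missing:
        raise ValueError(f"{algorithm} is missing required parameters: {missing}")
    # Caller-supplied values override the defaults.
    return {**DEFAULTS[algorithm], **parameters}
```

For example, `resolve_parameters("DBSCAN", {"eps": 0.25})` yields the `eps` value plus the four documented DBSCAN defaults.
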
## Shared parameters

- `normalizeVectors` optional, default `true`
- `randomState` optional, used by `KMEANS`, `MINI_BATCH_KMEANS`, `PCA`, `UMAP`

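This README does not say what `normalizeVectors` does. A common reading, assumed here, is L2 normalization of each embedding to unit length before clustering. The sketch below shows that operation in plain Python; the function name is hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each vector to unit Euclidean length (assumed meaning of normalizeVectors)."""
    normalized = []
    for vector in vectors:
        norm = math.sqrt(sum(x * x for x in vector))
        # Leave all-zero vectors untouched to avoid dividing by zero.
        normalized.append([x / norm for x in vector] if norm > 0 else list(vector))
    return normalized
```

For unit-length vectors, squared Euclidean distance equals `2 - 2 * cosine_similarity`, so the `euclidean` metric then ranks neighbors the same way cosine distance would.
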
## UMAP reduction parameters

- `reductionMetric` optional, default `cosine`
- `umapNeighbors` optional, default `15`
- `umapMinDist` optional, default `0.0`

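If the reduction is backed by the `umap-learn` package, these request fields would map naturally onto the `umap.UMAP` constructor arguments `metric`, `n_neighbors`, `min_dist`, `n_components`, and `random_state`. The sketch below only builds the keyword arguments. It assumes the UMAP fields travel inside `parameters`, which this README does not confirm, and the helper name is hypothetical.

```python
from typing import Any, Dict


def umap_kwargs(request: Dict[str, Any]) -> Dict[str, Any]:
    """Translate request fields into umap.UMAP keyword arguments (hypothetical mapping)."""
    # Assumption: UMAP settings are sent inside "parameters".
    params = request.get("parameters", {})
    kwargs: Dict[str, Any] = {
        "n_components": request["reductionDimensions"],
        "metric": params.get("reductionMetric", "cosine"),
        "n_neighbors": params.get("umapNeighbors", 15),
        "min_dist": params.get("umapMinDist", 0.0),
    }
    # Only pass a seed when the caller supplied one.
    if "randomState" in params:
        kwargs["random_state"] = params["randomState"]
    return kwargs
```

The result could then be passed as `umap.UMAP(**umap_kwargs(request))` in an environment where `umap-learn` is installed.
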
## Required database configuration

Set either:

- `CLUSTERING_DB_DSN`
- or `DATABASE_URL`
- or `CLUSTERING_DB_HOST`, `CLUSTERING_DB_PORT`, `CLUSTERING_DB_NAME`, `CLUSTERING_DB_USER`, `CLUSTERING_DB_PASSWORD`

Example:

```bash
export CLUSTERING_DB_DSN=postgresql://postgres:postgres@localhost:5432/dip
```

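One way the three options could be resolved into a single connection string is sketched below. The precedence (DSN first, then `DATABASE_URL`, then the discrete variables) is an assumption based on the order of the list, and `resolve_dsn` is a hypothetical helper rather than the service's actual code.

```python
import os
from typing import Mapping, Optional
from urllib.parse import quote


def resolve_dsn(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a Postgres connection URL from the environment, or None if unset."""
    # Assumed precedence: a full DSN wins over the discrete variables.
    for key in ("CLUSTERING_DB_DSN", "DATABASE_URL"):
        if env.get(key):
            return env[key]
    names = ("USER", "PASSWORD", "HOST", "PORT", "NAME")
    parts = [env.get(f"CLUSTERING_DB_{name}") for name in names]
    if all(parts):
        user, password, host, port, name = parts
        # Percent-encode credentials so characters like "@" or ":" stay valid in a URL.
        return f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}/{name}"
    return None
```

Passing an explicit mapping instead of `os.environ` keeps the helper easy to test without touching the real environment.
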
## Local run on Windows

```powershell
$env:CLUSTERING_DB_DSN="postgresql://postgres:postgres@localhost:5432/dip"
.\.venv\Scripts\python.exe -m uvicorn app.main:app --host 0.0.0.0 --port 8001 --reload
```

## Docker run

```bash
docker build -t dip-clustering-service .
docker run --rm -p 8001:8001 dip-clustering-service
```

## Spring configuration

Use the original request-upload mode:

```yaml
dip:
  clustering:
    python:
      enabled: true
      base-url: http://localhost:8001
      cluster-path: /cluster
      cluster-run-path: /cluster-run
      request-mode: INLINE_VECTORS
      connect-timeout: 30s
      read-timeout: 30m
```

Use compact `runId` mode:

```yaml
dip:
  clustering:
    python:
      enabled: true
      base-url: http://localhost:8001
      cluster-path: /cluster
      cluster-run-path: /cluster-run
      request-mode: RUN_ID
      connect-timeout: 30s
      read-timeout: 30m
```

`INLINE_VECTORS` is the default if `request-mode` is omitted.