DIP Clustering Service
Remote Python clustering backend for the DIP Spring clustering module.
Main execution mode
The preferred execution mode is now:
- Spring keeps run metadata, selection snapshot, and lifecycle.
- Spring sends only a compact request containing runId.
- Python loads the run metadata and selected embeddings directly from Postgres.
- Python returns compact assignments keyed by embeddingId.
This avoids sending the full embedding matrix through HTTP.
Implemented algorithms
- KMEANS
- MINI_BATCH_KMEANS
- DBSCAN
- HDBSCAN
- AGGLOMERATIVE
Implemented reductions
- NONE
- PCA
- UMAP
API
GET /health
Returns service status and supported algorithms/reduction methods.
POST /cluster-run
Preferred endpoint. Accepts only the cluster run id.
Example request body:
{
  "runId": "6c3bc3a3-24b0-47a5-9e35-92dd4b7275f8"
}
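A minimal Python sketch of a client call to this endpoint, using only the standard library. The helper names `build_cluster_run_request` and `send` are hypothetical, the 30-minute timeout simply mirrors the `read-timeout` in the Spring configuration, and the response is assumed to be a JSON object; none of this is confirmed by the service itself.

```python
import json
import urllib.request


def build_cluster_run_request(base_url: str, run_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST /cluster-run request for the given run id."""
    body = json.dumps({"runId": run_id}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/cluster-run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout_seconds: float = 1800.0) -> dict:
    """Send the request and decode the body, assuming the service replies with JSON."""
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the body easy to inspect or log before any network call is made.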
This service supports two remote execution modes at the same time:
- POST /cluster
  - Spring uploads embeddings in the request body.
  - This keeps the original implementation intact.
- POST /cluster-run
  - Spring sends only runId.
  - Python loads run metadata and embeddings directly from Postgres.
Start
py -3.11 -m venv .venv
.\.venv\Scripts\python.exe -m pip install --upgrade pip
.\.venv\Scripts\python.exe -m pip install -r requirements.txt
Configure DB access for /cluster-run as described under Required database configuration.
POST /cluster
Accepts the Spring PythonClusteringRequest payload and returns PythonClusteringResponse.
Example request body:
{
  "algorithm": "DBSCAN",
  "parameters": {
    "eps": 0.25,
    "minSamples": 5,
    "metric": "euclidean",
    "normalizeVectors": true
  },
  "reductionMethod": "PCA",
  "reductionDimensions": 100,
  "items": [
    {
      "embeddingId": "11111111-1111-1111-1111-111111111111",
      "documentId": "22222222-2222-2222-2222-222222222222",
      "representationId": "33333333-3333-3333-3333-333333333333",
      "vector": [0.1, 0.2, 0.3]
    }
  ]
}
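As a rough illustration, a caller could assemble this body in Python as sketched below. The field names come from the example above; the helper name `build_cluster_payload`, the `NONE` fallback for the reduction method, and the client-side checks (non-empty items, equal vector lengths) are assumptions and do not describe the service's own validation.

```python
from typing import Any, Dict, Optional, Sequence


def build_cluster_payload(
    algorithm: str,
    parameters: Dict[str, Any],
    items: Sequence[Dict[str, Any]],
    reduction_method: str = "NONE",
    reduction_dimensions: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble a POST /cluster body using the field names from the example."""
    if not items:
        raise ValueError("at least one item is required")
    dimension = len(items[0]["vector"])
    for item in items:
        # All vectors must share one length so they can form a matrix.
        if len(item["vector"]) != dimension:
            raise ValueError(f"vector length mismatch for {item['embeddingId']}")
    payload: Dict[str, Any] = {
        "algorithm": algorithm,
        "parameters": dict(parameters),
        "reductionMethod": reduction_method,
        "items": list(items),
    }
    # Only include the target dimensionality when the caller provides one.
    if reduction_dimensions is not None:
        payload["reductionDimensions"] = reduction_dimensions
    return payload
```

The returned dictionary can be serialized with `json.dumps` and posted to /cluster.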
Parameters by algorithm
KMEANS
- k: required
- randomState: optional, default 42
- nInit: optional, default 10
- maxIter: optional, default 300
MINI_BATCH_KMEANS
- k: required
- batchSize: optional
- randomState: optional, default 42
- nInit: optional, default 10
- maxIter: optional, default 300
DBSCAN
- eps: required
- minSamples: optional, default 5
- metric: optional, default euclidean
- algorithm: optional, default auto
- nJobs: optional, default -1
HDBSCAN
- minClusterSize: optional, default 10
- minSamples: optional
- metric: optional, default euclidean
- clusterSelectionMethod: optional, default eom
AGGLOMERATIVE
- k: required
- linkage: optional, default average
- metric: optional, default euclidean
- computeDistances: optional, default false
Shared parameters
- normalizeVectors: optional, default true
- randomState: optional, used by KMEANS, MINI_BATCH_KMEANS, PCA, UMAP
UMAP reduction parameters
- reductionMetric: optional, default cosine
- umapNeighbors: optional, default 15
- umapMinDist: optional, default 0.0
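The per-algorithm requirements and defaults listed above can be captured in a small lookup table. The sketch below is one possible way to pre-fill a parameters object on the caller side; the names `REQUIRED`, `DEFAULTS`, and `resolve_parameters` are hypothetical, and parameters without a documented default (batchSize, HDBSCAN minSamples) are deliberately left unset. It is not the service's own validation logic.

```python
from typing import Any, Dict

# Required keys per algorithm, as documented in this README.
REQUIRED = {
    "KMEANS": {"k"},
    "MINI_BATCH_KMEANS": {"k"},
    "DBSCAN": {"eps"},
    "HDBSCAN": set(),
    "AGGLOMERATIVE": {"k"},
}

# Documented defaults per algorithm.
DEFAULTS = {
    "KMEANS": {"randomState": 42, "nInit": 10, "maxIter": 300},
    "MINI_BATCH_KMEANS": {"randomState": 42, "nInit": 10, "maxIter": 300},
    "DBSCAN": {"minSamples": 5, "metric": "euclidean", "algorithm": "auto", "nJobs": -1},
    "HDBSCAN": {"minClusterSize": 10, "metric": "euclidean", "clusterSelectionMethod": "eom"},
    "AGGLOMERATIVE": {"linkage": "average", "metric": "euclidean", "computeDistances": False},
}

# Shared default that applies to every algorithm.
SHARED_DEFAULTS = {"normalizeVectors": True}


def resolve_parameters(algorithm: str, supplied: Dict[str, Any]) -> Dict[str, Any]:
    """Check required keys, then layer supplied values over the documented defaults."""
    if algorithm not in REQUIRED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    missing = REQUIRED[algorithm] - set(supplied)
    if missing:
        raise ValueError(f"missing required parameters: {sorted(missing)}")
    # Later dictionaries win, so caller-supplied values override defaults.
    return {**SHARED_DEFAULTS, **DEFAULTS[algorithm], **supplied}
```

For example, `resolve_parameters("DBSCAN", {"eps": 0.25})` fills in minSamples, metric, algorithm, nJobs, and normalizeVectors from the tables while keeping the supplied eps.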
Local run
Required database configuration
Set either:
- CLUSTERING_DB_DSN
- or DATABASE_URL
- or CLUSTERING_DB_HOST, CLUSTERING_DB_PORT, CLUSTERING_DB_NAME, CLUSTERING_DB_USER, CLUSTERING_DB_PASSWORD
Example:
export CLUSTERING_DB_DSN=postgresql://postgres:postgres@localhost:5432/dip
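One way such a lookup could work is sketched below. The precedence (CLUSTERING_DB_DSN, then DATABASE_URL, then the individual variables) follows the order of the list above and is an assumption, as is treating all five individual variables as mandatory; the function name `resolve_dsn` is hypothetical and the service's actual resolution logic may differ.

```python
import os
from typing import Mapping
from urllib.parse import quote


def resolve_dsn(env: Mapping[str, str] = os.environ) -> str:
    """Return a Postgres DSN from the environment, trying each documented option."""
    # Assumed precedence: a full DSN wins over individual settings.
    for name in ("CLUSTERING_DB_DSN", "DATABASE_URL"):
        value = env.get(name)
        if value:
            return value
    keys = ("HOST", "PORT", "NAME", "USER", "PASSWORD")
    parts = {key: env.get("CLUSTERING_DB_" + key) for key in keys}
    missing = sorted("CLUSTERING_DB_" + key for key, value in parts.items() if not value)
    if missing:
        raise RuntimeError("missing database settings: " + ", ".join(missing))
    # Percent-encode credentials so characters like '@' or ':' stay unambiguous.
    return "postgresql://{user}:{password}@{host}:{port}/{name}".format(
        user=quote(parts["USER"], safe=""),
        password=quote(parts["PASSWORD"], safe=""),
        host=parts["HOST"],
        port=parts["PORT"],
        name=parts["NAME"],
    )
```

Passing a plain dictionary instead of `os.environ` makes the lookup easy to exercise in tests.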
Local run on Windows
$env:CLUSTERING_DB_DSN="postgresql://postgres:postgres@localhost:5432/dip"
.\.venv\Scripts\python.exe -m uvicorn app.main:app --host 0.0.0.0 --port 8001 --reload
Docker run
docker build -t dip-clustering-service .
docker run --rm -p 8001:8001 dip-clustering-service
Spring configuration
Use the original request-upload mode:
dip:
  clustering:
    python:
      enabled: true
      base-url: http://localhost:8001
      cluster-path: /cluster
      cluster-run-path: /cluster-run
      request-mode: INLINE_VECTORS
      connect-timeout: 30s
      read-timeout: 30m
Use compact runId mode:
dip:
  clustering:
    python:
      enabled: true
      base-url: http://localhost:8001
      cluster-path: /cluster
      cluster-run-path: /cluster-run
      request-mode: RUN_ID
      connect-timeout: 30s
      read-timeout: 30m
INLINE_VECTORS is the default if request-mode is omitted.