# Memory Optimization Changes
## Problem
Persistent OutOfMemoryError crashes after ~30 minutes of operation.
## Root Causes Identified
1. **Parallel Processing** - Too many concurrent threads processing XML files
2. **Vectorization** - Heavy memory consumption from embedding service calls
3. **Connection Leaks** - HikariCP pool too large (20 connections), with no leak detection configured
4. **Duplicate File Processing** - File Consumer route was disabled but still causing issues
## Changes Made (2026-01-07)
### 1. Vectorization DISABLED
**File**: `application.yml`
```yaml
vectorization:
  enabled: false # Was: true
```
**Reason**: Embedding service calls consume a lot of memory (root cause 2). Vectorization can be re-enabled once stability is proven.
### 2. Reduced Database Connection Pool
**File**: `application.yml`
```yaml
hikari:
  maximum-pool-size: 5 # Was: 20
  minimum-idle: 2 # Was: 5
  idle-timeout: 300000 # Was: 600000
  max-lifetime: 900000 # Was: 1800000
  leak-detection-threshold: 60000 # NEW
```
### 3. Sequential Processing (No Parallelism)
**File**: `TedPackageDownloadCamelRoute.java`
- **Parallel Processing DISABLED** in XML file splitter
- Thread pool reduced to 1 thread (was: 3)
- Only 1 package processed at a time (was: 3)
```java
.split(header("xmlFiles"))
    // .parallelProcessing() // DISABLED
    .stopOnException(false)
```
### 4. File Consumer Already Disabled
**File**: `TedDocumentRoute.java`
- File consumer route commented out to prevent duplicate processing
- Only Package Download Route processes files
### 5. Start Script with 8GB Heap
**File**: `start.bat`
```batch
java -Xms4g -Xmx8g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -jar target\ted-procurement-processor-1.0.0-SNAPSHOT.jar
```
## Performance Impact
### Before
- 3 packages in parallel
- 3 XML files in parallel per package
- Vectorization running
- ~150 concurrent operations
- **Crashes after 30 minutes**
### After
- 1 package at a time
- Sequential XML file processing
- No vectorization
- ~10-20 concurrent operations
- **Expected to run stably** (to be confirmed by a 24-48 hour test run)
## How to Start
1. **Reset stuck packages** (if any):
   ```bash
   psql -h 94.130.218.54 -p 32333 -U postgres -d RELM -f reset-stuck-packages.sql
   ```
2. **Start application**:
   ```batch
   start.bat
   ```
3. **Monitor memory**:
   - Check logs for `OutOfMemoryError`
   - Monitor with `jconsole` or `jvisualvm`
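As a complement to the external tools, heap usage can also be read from inside the JVM. The following is a minimal sketch using the standard `java.lang.management` API. The class name `HeapUsageLogger` and its methods are hypothetical and not part of the project; how and where to call it (for example from a scheduled task) is left open.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of the project: reports heap usage from inside the JVM.
final class HeapUsageLogger {

    /** Returns used heap as a percentage of the maximum heap (-Xmx), or -1 if no maximum is defined. */
    static double usedHeapPercent() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the JVM reports no limit
        if (max <= 0) {
            return -1.0;
        }
        return 100.0 * heap.getUsed() / max;
    }

    /** Formats a one-line summary suitable for a log statement. */
    static String summary() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long mib = 1024L * 1024L;
        return String.format("heap used=%d MiB, committed=%d MiB, max=%d MiB",
                heap.getUsed() / mib, heap.getCommitted() / mib, heap.getMax() / mib);
    }

    public static void main(String[] args) {
        System.out.println(summary());
        System.out.printf("used: %.1f%% of max%n", usedHeapPercent());
    }
}
```

A steadily rising percentage that does not drop after garbage collection is the pattern to watch for before an `OutOfMemoryError`.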
## Re-enabling Features Later
### Step 1: Test with current settings
Run for 24-48 hours to confirm stability
### Step 2: Gradually increase parallelism
```java
// In TedPackageDownloadCamelRoute.java
.split(header("xmlFiles"))
    .parallelProcessing()
    .executorService(executorService()) // Set to 2-3 threads
```
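The route snippet above refers to an `executorService()` method that is not shown in this document. One way such a method could cap the thread count is a fixed-size pool, sketched below with the standard `java.util.concurrent` API. The class name `BoundedXmlExecutor` and the `ted-xml-` thread name prefix are assumptions made for illustration, not the project's actual code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a fixed-size pool for the XML splitter.
final class BoundedXmlExecutor {

    // Assumed prefix, chosen so that a thread-name filter on "ted-" still matches.
    private static final String THREAD_PREFIX = "ted-xml-";

    /** Creates a factory that names threads ted-xml-1, ted-xml-2, ... */
    static ThreadFactory namedFactory() {
        AtomicInteger counter = new AtomicInteger();
        return task -> new Thread(task, THREAD_PREFIX + counter.incrementAndGet());
    }

    /** Creates a pool that never runs more than the given number of worker threads. */
    static ExecutorService create(int threads) {
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be >= 1");
        }
        return Executors.newFixedThreadPool(threads, namedFactory());
    }
}
```

Starting with `BoundedXmlExecutor.create(2)` and raising the value one step at a time keeps each change easy to attribute. Note that a fixed pool caps the number of worker threads only; its task queue is unbounded, so it does not by itself limit how much work can be queued.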
### Step 3: Re-enable vectorization
```yaml
# In application.yml
vectorization:
  enabled: true
```
### Step 4: Increase connection pool (if needed)
```yaml
hikari:
  maximum-pool-size: 10 # Increase gradually
```
## Monitoring Commands
### Check running packages
```sql
SELECT package_identifier, download_status, updated_at
FROM ted.ted_daily_package
WHERE download_status IN ('DOWNLOADING', 'PROCESSING')
ORDER BY updated_at DESC;
```
### Check memory usage
```bash
jcmd <PID> GC.heap_info
```
### Check thread count
```bash
jcmd <PID> Thread.print | grep "ted-" | wc -l
```
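The same count can be taken from inside the running JVM, which may be convenient where `grep` and `wc` are not available. This is a minimal sketch using only the standard library; the class name `TedThreadCounter` is hypothetical, and the `"ted-"` fragment is simply the one used in the `jcmd` command above.

```java
import java.util.Set;

// Hypothetical helper, not part of the project: counts live threads by name.
final class TedThreadCounter {

    /** Counts live threads whose name contains the given fragment. */
    static long countThreadsNamedLike(String fragment) {
        Set<Thread> liveThreads = Thread.getAllStackTraces().keySet();
        return liveThreads.stream()
                .filter(thread -> thread.getName().contains(fragment))
                .count();
    }

    public static void main(String[] args) {
        System.out.println("ted- threads: " + countThreadsNamedLike("ted-"));
    }
}
```

Unlike the `grep` pipeline, which counts every output line containing the fragment, this counts each thread once.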
## Notes
- **Processing is slower** but stable
- Approx. 50-100 documents/minute (sequential)
- Can process ~100,000 documents/day
- Vectorization can be run as separate batch job later
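
The daily figure follows from the per-minute rate. A short worked calculation, assuming the application runs continuously for 24 hours at the stated rate (the class name `ThroughputEstimate` is hypothetical):

```java
// Hypothetical sketch: converts a documents/minute rate into documents/day.
final class ThroughputEstimate {

    static final int MINUTES_PER_DAY = 24 * 60; // 1440

    /** Documents processed in 24 hours of uninterrupted operation at the given rate. */
    static long documentsPerDay(int documentsPerMinute) {
        return (long) documentsPerMinute * MINUTES_PER_DAY;
    }

    public static void main(String[] args) {
        System.out.println(documentsPerDay(50));  // prints 72000
        System.out.println(documentsPerDay(100)); // prints 144000
    }
}
```

The range of 72,000 to 144,000 documents/day brackets the ~100,000 estimate; any downtime or download wait lowers the real figure.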