# Memory Optimization Changes

## Problem

Persistent OutOfMemoryError crashes after ~30 minutes of operation.

## Root Causes Identified

1. **Parallel Processing** - Too many concurrent threads processing XML files
2. **Vectorization** - Heavy memory consumption from embedding service calls
3. **Connection Leaks** - HikariCP pool too large (20 connections)
4. **Duplicate File Processing** - File Consumer route was disabled but still causing issues

## Changes Made (2026-01-07)

### 1. Vectorization DISABLED

**File**: `application.yml`

```yaml
vectorization:
  enabled: false  # Was: true
```

**Reason**: Embedding service calls consume a lot of memory. Vectorization can be re-enabled later, after stability is proven.

### 2. Reduced Database Connection Pool

**File**: `application.yml`

```yaml
hikari:
  maximum-pool-size: 5  # Was: 20
  minimum-idle: 2  # Was: 5
  idle-timeout: 300000  # Was: 600000
  max-lifetime: 900000  # Was: 1800000
  leak-detection-threshold: 60000  # NEW
```

### 3. Sequential Processing (No Parallelism)

**File**: `TedPackageDownloadCamelRoute.java`

- **Parallel Processing DISABLED** in XML file splitter
- Thread pool reduced to 1 thread (was: 3)
- Only 1 package processed at a time (was: 3)

```java
.split(header("xmlFiles"))
    // .parallelProcessing()  // DISABLED
    .stopOnException(false)
```

### 4. File Consumer Already Disabled

**File**: `TedDocumentRoute.java`

- File consumer route commented out to prevent duplicate processing
- Only the Package Download Route processes files

### 5. Start Script with 8GB Heap

**File**: `start.bat`

```batch
java -Xms4g -Xmx8g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -jar target\ted-procurement-processor-1.0.0-SNAPSHOT.jar
```

## Performance Impact

### Before

- 3 packages in parallel
- 3 XML files in parallel per package
- Vectorization running
- ~150 concurrent operations
- **Crashes after 30 minutes**

### After

- 1 package at a time
- Sequential XML file processing
- No vectorization
- ~10-20 concurrent operations
- **Should run stably indefinitely**

## How to Start

1. **Reset stuck packages** (if any):
   ```bash
   psql -h 94.130.218.54 -p 32333 -U postgres -d RELM -f reset-stuck-packages.sql
   ```

2. **Start application**:
   ```bash
   start.bat
   ```

3. **Monitor memory**:
   - Check logs for OutOfMemoryError
   - Monitor with: `jconsole` or `jvisualvm`

## Re-enabling Features Later

### Step 1: Test with current settings

Run for 24-48 hours to confirm stability.

### Step 2: Gradually increase parallelism

```java
// In TedPackageDownloadCamelRoute.java
.split(header("xmlFiles"))
    .parallelProcessing()
    .executorService(executorService())  // Set to 2-3 threads
```

### Step 3: Re-enable vectorization

```yaml
# In application.yml
vectorization:
  enabled: true
```

### Step 4: Increase connection pool (if needed)

```yaml
hikari:
  maximum-pool-size: 10  # Increase gradually
```

## Monitoring Commands

### Check running packages

```sql
SELECT package_identifier, download_status, updated_at
FROM ted.ted_daily_package
WHERE download_status IN ('DOWNLOADING', 'PROCESSING')
ORDER BY updated_at DESC;
```

### Check memory usage

```bash
# <pid> is the JVM process ID; list running JVMs with: jcmd -l
jcmd <pid> GC.heap_info
```

### Check thread count

```bash
jcmd <pid> Thread.print | grep "ted-" | wc -l
```

## Notes

- **Processing is slower** but stable
- Approx. 50-100 documents/minute (sequential)
- Can process ~100,000 documents/day
- Vectorization can be run as a separate batch job later
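
The daily throughput figure in the notes can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the class and method names are hypothetical and not part of the application, and it assumes the per-minute rate holds steady for a full 24 hours with no downtime or idle periods between packages.

```java
// Rough throughput estimate for the sequential configuration.
// Assumption: the documents/minute rate is sustained for a full 24-hour day,
// with no downtime, retries, or idle periods between packages.
// ThroughputEstimate is a hypothetical helper, not part of the application.
final class ThroughputEstimate {

    private static final long MINUTES_PER_DAY = 24L * 60L;

    /** Converts a sustained documents-per-minute rate into documents per day. */
    static long docsPerDay(long docsPerMinute) {
        return docsPerMinute * MINUTES_PER_DAY;
    }

    public static void main(String[] args) {
        long low = docsPerDay(50);   // lower bound from the notes
        long high = docsPerDay(100); // upper bound from the notes
        System.out.printf("Estimated range: %d to %d documents/day%n", low, high);
    }
}
```

At 50 to 100 documents/minute this gives 72,000 to 144,000 documents/day, so the ~100,000 documents/day figure sits inside the range.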