docs: add comprehensive container status monitoring system documentation

## Added Documentation

Created detailed documentation in `.ai/core/application-architecture.md` explaining the container status monitoring system to prevent future bugs.

## Key Sections

### 1. Container Status Monitoring System Overview

- Explains that status is updated through multiple independent paths
- Emphasizes that ALL paths must be updated when changing status logic

### 2. Critical Implementation Locations

Documents all four status calculation locations:

- **SSH-Based Updates**: `GetContainersStatus.php` (scheduled, every ~1min)
- **Sentinel-Based Updates**: `PushServerUpdateJob.php` (real-time, every ~30sec)
- **Multi-Server Aggregation**: `ComplexStatusCheck.php` (on-demand)
- **Service-Level Aggregation**: `Service.php` (service status)

### 3. Status Flow Diagram

Visual representation of how status flows from different sources to UI

### 4. Status Priority System

Documents the required priority: unhealthy > unknown > healthy

### 5. Excluded Containers

Explains `:excluded` suffix handling and behavior

### 6. Developer Guidelines

- Checklist of all locations to update
- Testing requirements
- Edge cases to handle

### 7. Related Tests

Links to all relevant test files

### 8. Common Bugs to Avoid

Real examples from bugs we've fixed, with solutions

## Why This Documentation Matters

The recent bug (unknown → healthy) happened because:

1. `GetContainersStatus.php` was updated to handle "unknown" status
2. `PushServerUpdateJob.php` was NOT updated
3. This caused periodic status flipping

This documentation ensures future developers (and AI assistants like Claude) will know to update ALL four locations when modifying status logic.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-19 13:42:45 +01:00 · 2025-11-19 13:42:45 +01:00 · 128c0b00ec
commit 128c0b00ec
parent 6b62847a11
1 changed file with 208 additions and 0 deletions
--- a/.ai/core/application-architecture.md
+++ b/.ai/core/application-architecture.md
@@ -361,3 +361,211 @@ ### **Background Processing**
 - **Queue Workers**: Horizon-managed job processing
 - **Job Batching**: Related job grouping
 - **Failed Job Handling**: Automatic retry logic
+
+## Container Status Monitoring System
+
+### **Overview**
+
+Container health status is monitored and updated through **multiple independent paths**. When modifying status logic, **ALL paths must be updated** to ensure consistency.
+
+### **Critical Implementation Locations**
+
+#### **1. SSH-Based Status Updates (Scheduled)**
+**File**: [app/Actions/Docker/GetContainersStatus.php](mdc:app/Actions/Docker/GetContainersStatus.php)
+**Method**: `aggregateApplicationStatus()` (lines 487-540)
+**Trigger**: Scheduled job or manual refresh
+**Frequency**: Every minute (via `ServerCheckJob`)
+
+**Status Aggregation Logic**:
+```php
+// Tracks multiple status flags
+$hasRunning = false;
+$hasRestarting = false;
+$hasUnhealthy = false;
+$hasUnknown = false;  // ⚠️ CRITICAL: Must track unknown
+$hasExited = false;
+// ... more states
+
+// Priority: restarting > degraded > running (unhealthy > unknown > healthy)
+if ($hasRunning) {
+    if ($hasUnhealthy) return 'running (unhealthy)';
+    elseif ($hasUnknown) return 'running (unknown)';
+    else return 'running (healthy)';
+}
+```
+
+#### **2. Sentinel-Based Status Updates (Real-time)**
+**File**: [app/Jobs/PushServerUpdateJob.php](mdc:app/Jobs/PushServerUpdateJob.php)
+**Method**: `aggregateMultiContainerStatuses()` (lines 269-298)
+**Trigger**: Sentinel push updates from remote servers
+**Frequency**: Every ~30 seconds (real-time)
+
+**Status Aggregation Logic**:
+```php
+// ⚠️ MUST match GetContainersStatus logic
+$hasRunning = false;
+$hasUnhealthy = false;
+$hasUnknown = false;  // ⚠️ CRITICAL: Added to fix bug
+
+foreach ($relevantStatuses as $status) {
+    if (str($status)->contains('running')) {
+        $hasRunning = true;
+        if (str($status)->contains('unhealthy')) $hasUnhealthy = true;
+        if (str($status)->contains('unknown')) $hasUnknown = true;  // ⚠️ CRITICAL
+    }
+}
+
+// Priority: unhealthy > unknown > healthy
+if ($hasRunning) {
+    if ($hasUnhealthy) $aggregatedStatus = 'running (unhealthy)';
+    elseif ($hasUnknown) $aggregatedStatus = 'running (unknown)';
+    else $aggregatedStatus = 'running (healthy)';
+}
+```
+
+#### **3. Multi-Server Status Aggregation**
+**File**: [app/Actions/Shared/ComplexStatusCheck.php](mdc:app/Actions/Shared/ComplexStatusCheck.php)
+**Method**: `resource()` (lines 48-210)
+**Purpose**: Aggregates status across multiple servers for applications
+**Used by**: Applications with multiple destinations
+
+**Key Features**:
+- Aggregates statuses from main + additional servers
+- Handles excluded containers (`:excluded` suffix)
+- Calculates overall application health from all containers
+
+**Status Format with Excluded Containers**:
+```php
+// When all containers excluded from health checks:
+return 'running:unhealthy:excluded';  // Container running but unhealthy, monitoring disabled
+return 'running:unknown:excluded';     // Container running, health unknown, monitoring disabled
+return 'running:healthy:excluded';     // Container running and healthy, monitoring disabled
+return 'degraded:excluded';            // Some containers down, monitoring disabled
+return 'exited:excluded';              // All containers stopped, monitoring disabled
+```
+
+#### **4. Service-Level Status Aggregation**
+**File**: [app/Models/Service.php](mdc:app/Models/Service.php)
+**Method**: `complexStatus()` (lines 176-288)
+**Purpose**: Aggregates status for multi-container services
+**Used by**: Docker Compose services
+
+**Status Calculation**:
+```php
+// Aggregates status from all service applications and databases
+// Handles excluded containers separately
+// Returns status with :excluded suffix when all containers excluded
+if (!$hasNonExcluded && $complexStatus === null && $complexHealth === null) {
+    // All services excluded - calculate from excluded containers
+    return "{$excludedStatus}:excluded";
+}
+```
+
+### **Status Flow Diagram**
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                    Container Status Sources                  │
+└─────────────────────────────────────────────────────────────┘
+                             │
+        ┌────────────────────┼────────────────────┐
+        │                    │                    │
+        ▼                    ▼                    ▼
+┌───────────────┐   ┌─────────────────┐   ┌──────────────┐
+│ SSH-Based     │   │ Sentinel-Based  │   │ Multi-Server │
+│ (Scheduled)   │   │ (Real-time)     │   │ Aggregation  │
+├───────────────┤   ├─────────────────┤   ├──────────────┤
+│ ServerCheck   │   │ PushServerUp-   │   │ ComplexStatus│
+│ Job           │   │ dateJob         │   │ Check        │
+│               │   │                 │   │              │
+│ Every ~1min   │   │ Every ~30sec    │   │ On demand    │
+└───────┬───────┘   └────────┬────────┘   └──────┬───────┘
+        │                    │                    │
+        └────────────────────┼────────────────────┘
+                             │
+                             ▼
+                 ┌───────────────────────┐
+                 │ Application/Service   │
+                 │ Status Property       │
+                 └───────────────────────┘
+                             │
+                             ▼
+                 ┌───────────────────────┐
+                 │ UI Display (Livewire) │
+                 └───────────────────────┘
+```
+
+### **Status Priority System**
+
+All status aggregation locations **MUST** follow the same priority:
+
+**For Running Containers**:
+1. **unhealthy** - Container has failing health checks
+2. **unknown** - Container health status cannot be determined
+3. **healthy** - Container is healthy
+
+**For Non-Running or Mixed States**:
+1. **restarting** → `degraded (unhealthy)`
+2. **running + exited** → `degraded (unhealthy)`
+3. **dead/removing** → `degraded (unhealthy)`
+4. **paused** → `paused`
+5. **created/starting** → `starting`
+6. **exited** → `exited (unhealthy)`
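+
+The priority rules above can be sketched as one pure function. This is an illustrative sketch only, not the code in any of the four locations: the function name `aggregateContainerStatus()` and its input (a flat list of status strings such as `running (healthy)`) are assumptions, and it uses plain PHP 8 string functions instead of Laravel's `str()` helper so that it runs standalone. Checking `dead/removing` before the running states, and falling back to `exited (unhealthy)` for an empty list, are also assumptions, since the lists above do not state them.
+
+```php
+<?php
+
+/**
+ * Hypothetical helper: collapse per-container statuses into one status.
+ * Input strings look like "running (healthy)", "exited" or "restarting".
+ */
+function aggregateContainerStatus(array $statuses): string
+{
+    $hasRunning = $hasRestarting = $hasExited = $hasDead = false;
+    $hasPaused = $hasStarting = $hasUnhealthy = $hasUnknown = false;
+
+    foreach ($statuses as $status) {
+        if (str_starts_with($status, 'running')) {
+            $hasRunning = true;
+            // Health is only inspected for running containers.
+            $hasUnhealthy = $hasUnhealthy || str_contains($status, 'unhealthy');
+            $hasUnknown = $hasUnknown || str_contains($status, 'unknown');
+        } elseif (str_starts_with($status, 'restarting')) {
+            $hasRestarting = true;
+        } elseif (str_starts_with($status, 'exited')) {
+            $hasExited = true;
+        } elseif (str_starts_with($status, 'dead') || str_starts_with($status, 'removing')) {
+            $hasDead = true;
+        } elseif (str_starts_with($status, 'paused')) {
+            $hasPaused = true;
+        } elseif (str_starts_with($status, 'created') || str_starts_with($status, 'starting')) {
+            $hasStarting = true;
+        }
+    }
+
+    // Degraded cases win over everything else (ordering is an assumption).
+    if ($hasRestarting || ($hasRunning && $hasExited) || $hasDead) {
+        return 'degraded (unhealthy)';
+    }
+
+    // Three-way priority for running containers: unhealthy > unknown > healthy.
+    if ($hasRunning) {
+        if ($hasUnhealthy) {
+            return 'running (unhealthy)';
+        } elseif ($hasUnknown) {
+            return 'running (unknown)';
+        }
+
+        return 'running (healthy)';
+    }
+
+    if ($hasPaused) {
+        return 'paused';
+    }
+    if ($hasStarting) {
+        return 'starting';
+    }
+
+    return 'exited (unhealthy)';
+}
+```
+
+For example, `aggregateContainerStatus(['running (healthy)', 'running (unknown)'])` yields `running (unknown)`, because a single container with unknown health outranks any number of healthy ones.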
+
+### **Excluded Containers**
+
+When a container has the `exclude_from_hc: true` flag:
+
+**Behavior**:
+- Status is still calculated from container state
+- `:excluded` suffix is appended to indicate monitoring disabled
+- UI shows "(Monitoring Disabled)" badge
+- Action buttons respect the actual container state
+
+**Format**: `{actual-status}:excluded`
+**Examples**: `running:unknown:excluded`, `degraded:excluded`, `exited:excluded`
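+
+A consumer of these strings, such as UI logic deciding on badges and buttons, has to separate the real state from the `:excluded` marker. The following is a minimal sketch of one way to do that. `parseStatus()` and the shape of its return value are hypothetical and do not describe how the existing UI code is written; the sketch assumes PHP 8 and accepts both the `state (health)` and the `state:health:excluded` formats shown in this document.
+
+```php
+<?php
+
+/**
+ * Hypothetical helper: split a stored status string into its parts.
+ * Accepts "running (healthy)" as well as "running:healthy:excluded".
+ */
+function parseStatus(string $status): array
+{
+    $suffix = ':excluded';
+    $excluded = str_ends_with($status, $suffix);
+    if ($excluded) {
+        // Drop the marker so only the actual container status remains.
+        $status = substr($status, 0, -strlen($suffix));
+    }
+
+    // Normalize "running (healthy)" to "running:healthy" before splitting.
+    $normalized = str_replace([' (', ')'], [':', ''], $status);
+    $parts = explode(':', $normalized, 2);
+
+    return [
+        'state' => $parts[0],
+        'health' => $parts[1] ?? null, // null when no health part is present
+        'excluded' => $excluded,
+    ];
+}
+
+$parsed = parseStatus('running:unknown:excluded');
+// $parsed['state'] is 'running', so action buttons can follow the real state,
+// while $parsed['excluded'] is true and can drive the "(Monitoring Disabled)" badge.
+```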
+
+### **Important Notes for Developers**
+
+⚠️ **CRITICAL**: When modifying container status logic:
+
+1. **Update ALL four locations**:
+   - `GetContainersStatus.php` (SSH-based)
+   - `PushServerUpdateJob.php` (Sentinel-based)
+   - `ComplexStatusCheck.php` (multi-server)
+   - `Service.php` (service-level)
+
+2. **Maintain consistent priority**:
+   - unhealthy > unknown > healthy
+   - Apply same logic across all paths
+
+3. **Test both update paths**:
+   - Run unit tests: `./vendor/bin/pest tests/Unit/`
+   - Test SSH updates (manual refresh)
+   - Test Sentinel updates (wait 30 seconds)
+
+4. **Handle edge cases**:
+   - All containers excluded (`exclude_from_hc: true`)
+   - Mixed excluded/non-excluded containers
+   - Unknown health states
+   - Container crash loops (restart count)
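+
+One way to catch a path that was left behind is to feed the same inputs to every aggregation path and report any disagreement. The sketch below illustrates the idea with two stand-in functions; `pathWithUnknown()`, `pathWithoutUnknown()` and `findDisagreements()` are hypothetical names, not functions from the code base, and for brevity the stand-ins assume every container is running. The second stand-in uses a two-way ternary, so it has no branch for unknown.
+
+```php
+<?php
+
+// Hypothetical stand-in for a path that implements the intended priority.
+function pathWithUnknown(array $statuses): string
+{
+    $hasUnhealthy = $hasUnknown = false;
+    foreach ($statuses as $status) {
+        $hasUnhealthy = $hasUnhealthy || str_contains($status, 'unhealthy');
+        $hasUnknown = $hasUnknown || str_contains($status, 'unknown');
+    }
+
+    if ($hasUnhealthy) {
+        return 'running (unhealthy)';
+    } elseif ($hasUnknown) {
+        return 'running (unknown)';
+    }
+
+    return 'running (healthy)';
+}
+
+// Hypothetical stand-in for a path that was not updated: unknown is never
+// tracked, so it falls through to healthy.
+function pathWithoutUnknown(array $statuses): string
+{
+    $hasUnhealthy = false;
+    foreach ($statuses as $status) {
+        $hasUnhealthy = $hasUnhealthy || str_contains($status, 'unhealthy');
+    }
+
+    return $hasUnhealthy ? 'running (unhealthy)' : 'running (healthy)';
+}
+
+/** Return every input on which the given paths produce different results. */
+function findDisagreements(array $paths, array $cases): array
+{
+    $disagreements = [];
+    foreach ($cases as $case) {
+        // array_map keeps the path names as keys when given a single array.
+        $results = array_map(fn (callable $path) => $path($case), $paths);
+        if (count(array_unique($results)) > 1) {
+            $disagreements[] = ['input' => $case, 'results' => $results];
+        }
+    }
+
+    return $disagreements;
+}
+
+$cases = [
+    ['running (healthy)'],
+    ['running (healthy)', 'running (unknown)'],
+    ['running (unhealthy)', 'running (unknown)'],
+];
+$paths = ['ssh' => 'pathWithUnknown', 'sentinel' => 'pathWithoutUnknown'];
+
+$found = findDisagreements($paths, $cases);
+```
+
+Only the second case is reported: one stand-in returns `running (unknown)` and the other `running (healthy)`, which is the kind of mismatch that shows up as a status flipping between updates.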
+
+### **Related Tests**
+
+- **[tests/Unit/ContainerHealthStatusTest.php](mdc:tests/Unit/ContainerHealthStatusTest.php)**: Health status aggregation
+- **[tests/Unit/PushServerUpdateJobStatusAggregationTest.php](mdc:tests/Unit/PushServerUpdateJobStatusAggregationTest.php)**: Sentinel update logic
+- **[tests/Unit/ExcludeFromHealthCheckTest.php](mdc:tests/Unit/ExcludeFromHealthCheckTest.php)**: Excluded container handling
+
+### **Common Bugs to Avoid**
+
+❌ **Bug**: Forgetting to track `$hasUnknown` flag
+✅ **Fix**: Initialize and check for "unknown" in all status aggregation
+
+❌ **Bug**: Using a two-way ternary operator instead of if-elseif-else (unknown falls through to healthy)
+✅ **Fix**: Use explicit if-elseif-else to handle 3-way priority
+
+❌ **Bug**: Updating only one path (SSH or Sentinel)
+✅ **Fix**: Always update all four status calculation locations
+
+❌ **Bug**: Not handling excluded containers with `:excluded` suffix
+✅ **Fix**: Check for `:excluded` suffix in UI logic and button visibility