From 128c0b00ecda426f184409abd53279739852358c Mon Sep 17 00:00:00 2001
From: Andras Bacsai <5845193+andrasbacsai@users.noreply.github.com>
Date: Wed, 19 Nov 2025 13:42:45 +0100
Subject: [PATCH] docs: add comprehensive container status monitoring system documentation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Added Documentation

Created detailed documentation in `.ai/core/application-architecture.md` explaining the container status monitoring system to prevent future bugs.

## Key Sections

### 1. Container Status Monitoring System Overview
- Explains that status is updated through multiple independent paths
- Emphasizes that ALL paths must be updated when changing status logic

### 2. Critical Implementation Locations
Documents all four status calculation locations:
- **SSH-Based Updates**: `GetContainersStatus.php` (scheduled, every ~1min)
- **Sentinel-Based Updates**: `PushServerUpdateJob.php` (real-time, every ~30sec)
- **Multi-Server Aggregation**: `ComplexStatusCheck.php` (on-demand)
- **Service-Level Aggregation**: `Service.php` (service status)

### 3. Status Flow Diagram
Visual representation of how status flows from different sources to UI

### 4. Status Priority System
Documents the required priority: unhealthy > unknown > healthy

### 5. Excluded Containers
Explains `:excluded` suffix handling and behavior

### 6. Developer Guidelines
- Checklist of all locations to update
- Testing requirements
- Edge cases to handle

### 7. Related Tests
Links to all relevant test files

### 8. Common Bugs to Avoid
Real examples from bugs we've fixed, with solutions

## Why This Documentation Matters

The recent bug (unknown → healthy) happened because:
1. `GetContainersStatus.php` was updated to handle "unknown" status
2. `PushServerUpdateJob.php` was NOT updated
3. This caused periodic status flipping

This documentation ensures future developers (and AI assistants like Claude) will know to update ALL four locations when modifying status logic.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude
---
 .ai/core/application-architecture.md | 208 +++++++++++++++++++++++++++
 1 file changed, 208 insertions(+)

diff --git a/.ai/core/application-architecture.md b/.ai/core/application-architecture.md
index daaac0eaa..434f1afa9 100644
--- a/.ai/core/application-architecture.md
+++ b/.ai/core/application-architecture.md
@@ -361,3 +361,211 @@ ### **Background Processing**
 - **Queue Workers**: Horizon-managed job processing
 - **Job Batching**: Related job grouping
 - **Failed Job Handling**: Automatic retry logic
+
+## Container Status Monitoring System
+
+### **Overview**
+
+Container health status is monitored and updated through **multiple independent paths**. When modifying status logic, **ALL paths must be updated** to ensure consistency.
+
+### **Critical Implementation Locations**
+
+#### **1. SSH-Based Status Updates (Scheduled)**
+**File**: [app/Actions/Docker/GetContainersStatus.php](mdc:app/Actions/Docker/GetContainersStatus.php)
+**Method**: `aggregateApplicationStatus()` (lines 487-540)
+**Trigger**: Scheduled job or manual refresh
+**Frequency**: Every minute (via `ServerCheckJob`)
+
+**Status Aggregation Logic**:
+```php
+// Tracks multiple status flags
+$hasRunning = false;
+$hasRestarting = false;
+$hasUnhealthy = false;
+$hasUnknown = false;  // ⚠️ CRITICAL: Must track unknown
+$hasExited = false;
+// ... more states
+
+// Priority: restarting > degraded > running (unhealthy > unknown > healthy)
+if ($hasRunning) {
+    if ($hasUnhealthy) return 'running (unhealthy)';
+    elseif ($hasUnknown) return 'running (unknown)';
+    else return 'running (healthy)';
+}
+```
+
+#### **2. Sentinel-Based Status Updates (Real-time)**
+**File**: [app/Jobs/PushServerUpdateJob.php](mdc:app/Jobs/PushServerUpdateJob.php)
+**Method**: `aggregateMultiContainerStatuses()` (lines 269-298)
+**Trigger**: Sentinel push updates from remote servers
+**Frequency**: Every ~30 seconds (real-time)
+
+**Status Aggregation Logic**:
+```php
+// ⚠️ MUST match GetContainersStatus logic
+$hasRunning = false;
+$hasUnhealthy = false;
+$hasUnknown = false;  // ⚠️ CRITICAL: Added to fix bug
+
+foreach ($relevantStatuses as $status) {
+    if (str($status)->contains('running')) {
+        $hasRunning = true;
+        if (str($status)->contains('unhealthy')) $hasUnhealthy = true;
+        if (str($status)->contains('unknown')) $hasUnknown = true;  // ⚠️ CRITICAL
+    }
+}
+
+// Priority: unhealthy > unknown > healthy
+if ($hasRunning) {
+    if ($hasUnhealthy) $aggregatedStatus = 'running (unhealthy)';
+    elseif ($hasUnknown) $aggregatedStatus = 'running (unknown)';
+    else $aggregatedStatus = 'running (healthy)';
+}
+```
+
+#### **3. Multi-Server Status Aggregation**
+**File**: [app/Actions/Shared/ComplexStatusCheck.php](mdc:app/Actions/Shared/ComplexStatusCheck.php)
+**Method**: `resource()` (lines 48-210)
+**Purpose**: Aggregates status across multiple servers for applications
+**Used by**: Applications with multiple destinations
+
+**Key Features**:
+- Aggregates statuses from main + additional servers
+- Handles excluded containers (`:excluded` suffix)
+- Calculates overall application health from all containers
+
+**Status Format with Excluded Containers**:
+```php
+// When all containers excluded from health checks:
+return 'running:unhealthy:excluded';  // Container running but unhealthy, monitoring disabled
+return 'running:unknown:excluded';    // Container running, health unknown, monitoring disabled
+return 'running:healthy:excluded';    // Container running and healthy, monitoring disabled
+return 'degraded:excluded';           // Some containers down, monitoring disabled
+return 'exited:excluded';             // All containers stopped, monitoring disabled
+```
+
+#### **4. Service-Level Status Aggregation**
+**File**: [app/Models/Service.php](mdc:app/Models/Service.php)
+**Method**: `complexStatus()` (lines 176-288)
+**Purpose**: Aggregates status for multi-container services
+**Used by**: Docker Compose services
+
+**Status Calculation**:
+```php
+// Aggregates status from all service applications and databases
+// Handles excluded containers separately
+// Returns status with :excluded suffix when all containers excluded
+if (!$hasNonExcluded && $complexStatus === null && $complexHealth === null) {
+    // All services excluded - calculate from excluded containers
+    return "{$excludedStatus}:excluded";
+}
+```
+
+### **Status Flow Diagram**
+
+```
+┌───────────────────────────────────────────────────────────┐
+│                 Container Status Sources                  │
+└───────────────────────────────────────────────────────────┘
+                              │
+        ┌─────────────────────┼───────────────────┐
+        │                     │                   │
+        ▼                     ▼                   ▼
+┌───────────────┐    ┌─────────────────┐   ┌──────────────┐
+│ SSH-Based     │    │ Sentinel-Based  │   │ Multi-Server │
+│ (Scheduled)   │    │ (Real-time)     │   │ Aggregation  │
+├───────────────┤    ├─────────────────┤   ├──────────────┤
+│ ServerCheck   │    │ PushServerUp-   │   │ ComplexStatus│
+│ Job           │    │ dateJob         │   │ Check        │
+│               │    │                 │   │              │
+│ Every ~1min   │    │ Every ~30sec    │   │ On demand    │
+└───────┬───────┘    └────────┬────────┘   └──────┬───────┘
+        │                     │                   │
+        └─────────────────────┼───────────────────┘
+                              │
+                              ▼
+                  ┌───────────────────────┐
+                  │  Application/Service  │
+                  │    Status Property    │
+                  └───────────────────────┘
+                              │
+                              ▼
+                  ┌───────────────────────┐
+                  │ UI Display (Livewire) │
+                  └───────────────────────┘
+```
+
+### **Status Priority System**
+
+All status aggregation locations **MUST** follow the same priority:
+
+**For Running Containers**:
+1. **unhealthy** - Container has failing health checks
+2. **unknown** - Container health status cannot be determined
+3. **healthy** - Container is healthy
+
+**For Non-Running States**:
+1. **restarting** → `degraded (unhealthy)`
+2. **running + exited** → `degraded (unhealthy)`
+3. **dead/removing** → `degraded (unhealthy)`
+4. **paused** → `paused`
+5. **created/starting** → `starting`
+6. **exited** → `exited (unhealthy)`
+
+### **Excluded Containers**
+
+When containers have `exclude_from_hc: true` flag:
+
+**Behavior**:
+- Status is still calculated from container state
+- `:excluded` suffix is appended to indicate monitoring disabled
+- UI shows "(Monitoring Disabled)" badge
+- Action buttons respect the actual container state
+
+**Format**: `{actual-status}:excluded`
+**Examples**: `running:unknown:excluded`, `degraded:excluded`, `exited:excluded`
+
+### **Important Notes for Developers**
+
+⚠️ **CRITICAL**: When modifying container status logic:
+
+1. **Update ALL four locations**:
+   - `GetContainersStatus.php` (SSH-based)
+   - `PushServerUpdateJob.php` (Sentinel-based)
+   - `ComplexStatusCheck.php` (multi-server)
+   - `Service.php` (service-level)
+
+2. **Maintain consistent priority**:
+   - unhealthy > unknown > healthy
+   - Apply same logic across all paths
+
+3. **Test both update paths**:
+   - Run unit tests: `./vendor/bin/pest tests/Unit/`
+   - Test SSH updates (manual refresh)
+   - Test Sentinel updates (wait 30 seconds)
+
+4. **Handle edge cases**:
+   - All containers excluded (`exclude_from_hc: true`)
+   - Mixed excluded/non-excluded containers
+   - Unknown health states
+   - Container crash loops (restart count)
+
+### **Related Tests**
+
+- **[tests/Unit/ContainerHealthStatusTest.php](mdc:tests/Unit/ContainerHealthStatusTest.php)**: Health status aggregation
+- **[tests/Unit/PushServerUpdateJobStatusAggregationTest.php](mdc:tests/Unit/PushServerUpdateJobStatusAggregationTest.php)**: Sentinel update logic
+- **[tests/Unit/ExcludeFromHealthCheckTest.php](mdc:tests/Unit/ExcludeFromHealthCheckTest.php)**: Excluded container handling
+
+### **Common Bugs to Avoid**
+
+❌ **Bug**: Forgetting to track `$hasUnknown` flag
+✅ **Fix**: Initialize and check for "unknown" in all status aggregation
+
+❌ **Bug**: Using ternary operator instead of if-elseif-else
+✅ **Fix**: Use explicit if-elseif-else to handle 3-way priority
+
+❌ **Bug**: Updating only one path (SSH or Sentinel)
+✅ **Fix**: Always update all four status calculation locations
+
+❌ **Bug**: Not handling excluded containers with `:excluded` suffix
+✅ **Fix**: Check for `:excluded` suffix in UI logic and button visibility