docs: add comprehensive container status monitoring system documentation

## Added Documentation

Created detailed documentation in `.ai/core/application-architecture.md`
explaining the container status monitoring system to prevent future bugs.

## Key Sections

### 1. Container Status Monitoring System Overview
- Explains that status is updated through multiple independent paths
- Emphasizes that ALL paths must be updated when changing status logic

### 2. Critical Implementation Locations
Documents all four status calculation locations:
- **SSH-Based Updates**: `GetContainersStatus.php` (scheduled, every ~1min)
- **Sentinel-Based Updates**: `PushServerUpdateJob.php` (real-time, every ~30sec)
- **Multi-Server Aggregation**: `ComplexStatusCheck.php` (on-demand)
- **Service-Level Aggregation**: `Service.php` (service status)

### 3. Status Flow Diagram
Visual representation of how status flows from the different sources to the UI

### 4. Status Priority System
Documents the required priority: unhealthy > unknown > healthy

### 5. Excluded Containers
Explains `:excluded` suffix handling and behavior

### 6. Developer Guidelines
- Checklist of all locations to update
- Testing requirements
- Edge cases to handle

### 7. Related Tests
Links to all relevant test files

### 8. Common Bugs to Avoid
Real examples from bugs we've fixed, with solutions

## Why This Documentation Matters

The recent bug (`running (unknown)` reported as `running (healthy)`) happened because:
1. `GetContainersStatus.php` was updated to handle the "unknown" status
2. `PushServerUpdateJob.php` was NOT updated
3. The two paths then disagreed, so the displayed status flipped periodically as each path ran

This documentation ensures future developers (and AI assistants like Claude)
will know to update ALL four locations when modifying status logic.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Andras Bacsai 2025-11-19 13:42:45 +01:00
parent 6b62847a11
commit 128c0b00ec


@@ -361,3 +361,211 @@ ### **Background Processing**
- **Queue Workers**: Horizon-managed job processing
- **Job Batching**: Related job grouping
- **Failed Job Handling**: Automatic retry logic
## Container Status Monitoring System
### **Overview**
Container health status is monitored and updated through **multiple independent paths**. When modifying status logic, **ALL paths must be updated** to ensure consistency.
### **Critical Implementation Locations**
#### **1. SSH-Based Status Updates (Scheduled)**
**File**: [app/Actions/Docker/GetContainersStatus.php](mdc:app/Actions/Docker/GetContainersStatus.php)
**Method**: `aggregateApplicationStatus()` (lines 487-540)
**Trigger**: Scheduled job or manual refresh
**Frequency**: Every minute (via `ServerCheckJob`)
**Status Aggregation Logic**:
```php
// Tracks multiple status flags
$hasRunning = false;
$hasRestarting = false;
$hasUnhealthy = false;
$hasUnknown = false; // ⚠️ CRITICAL: Must track unknown
$hasExited = false;
// ... more states

// Priority: restarting > degraded > running (unhealthy > unknown > healthy)
if ($hasRunning) {
    if ($hasUnhealthy) {
        return 'running (unhealthy)';
    } elseif ($hasUnknown) {
        return 'running (unknown)';
    } else {
        return 'running (healthy)';
    }
}
```
#### **2. Sentinel-Based Status Updates (Real-time)**
**File**: [app/Jobs/PushServerUpdateJob.php](mdc:app/Jobs/PushServerUpdateJob.php)
**Method**: `aggregateMultiContainerStatuses()` (lines 269-298)
**Trigger**: Sentinel push updates from remote servers
**Frequency**: Every ~30 seconds (real-time)
**Status Aggregation Logic**:
```php
// ⚠️ MUST match GetContainersStatus logic
$hasRunning = false;
$hasUnhealthy = false;
$hasUnknown = false; // ⚠️ CRITICAL: Added to fix bug

foreach ($relevantStatuses as $status) {
    if (str($status)->contains('running')) {
        $hasRunning = true;
        if (str($status)->contains('unhealthy')) {
            $hasUnhealthy = true;
        }
        if (str($status)->contains('unknown')) {
            $hasUnknown = true; // ⚠️ CRITICAL
        }
    }
}

// Priority: unhealthy > unknown > healthy
if ($hasRunning) {
    if ($hasUnhealthy) {
        $aggregatedStatus = 'running (unhealthy)';
    } elseif ($hasUnknown) {
        $aggregatedStatus = 'running (unknown)';
    } else {
        $aggregatedStatus = 'running (healthy)';
    }
}
```
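
Because the SSH and Sentinel paths have to apply the same priority, one way to keep them from drifting apart is to route both through a single pure function. The following is a minimal sketch of that idea, not code from the repository: the function name `aggregateRunningHealth()` and the `null` return for "nothing running" are assumptions, and it uses plain `str_contains()` (PHP 8.0+) instead of the `str()` helper so that it runs standalone.

```php
<?php
// Hypothetical helper, not part of the codebase: a single implementation of
// the running-health priority (unhealthy > unknown > healthy).

/**
 * @param string[] $statuses Per-container statuses, e.g. 'running (healthy)'.
 * @return string|null Aggregated status, or null when no container is running.
 */
function aggregateRunningHealth(array $statuses): ?string
{
    $hasRunning = false;
    $hasUnhealthy = false;
    $hasUnknown = false;

    foreach ($statuses as $status) {
        if (! str_contains($status, 'running')) {
            continue;
        }
        $hasRunning = true;
        if (str_contains($status, 'unhealthy')) {
            $hasUnhealthy = true;
        } elseif (str_contains($status, 'unknown')) {
            $hasUnknown = true;
        }
    }

    if (! $hasRunning) {
        return null; // Non-running states are decided elsewhere.
    }
    if ($hasUnhealthy) {
        return 'running (unhealthy)';
    }
    if ($hasUnknown) {
        return 'running (unknown)';
    }

    return 'running (healthy)';
}
```

With this shape, `aggregateRunningHealth(['running (healthy)', 'running (unknown)'])` yields `'running (unknown)'` regardless of which update path calls it.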
#### **3. Multi-Server Status Aggregation**
**File**: [app/Actions/Shared/ComplexStatusCheck.php](mdc:app/Actions/Shared/ComplexStatusCheck.php)
**Method**: `resource()` (lines 48-210)
**Purpose**: Aggregates status across multiple servers for applications
**Used by**: Applications with multiple destinations
**Key Features**:
- Aggregates statuses from main + additional servers
- Handles excluded containers (`:excluded` suffix)
- Calculates overall application health from all containers
**Status Format with Excluded Containers**:
```php
// When all containers excluded from health checks:
return 'running:unhealthy:excluded'; // Container running but unhealthy, monitoring disabled
return 'running:unknown:excluded'; // Container running, health unknown, monitoring disabled
return 'running:healthy:excluded'; // Container running and healthy, monitoring disabled
return 'degraded:excluded'; // Some containers down, monitoring disabled
return 'exited:excluded'; // All containers stopped, monitoring disabled
```
#### **4. Service-Level Status Aggregation**
**File**: [app/Models/Service.php](mdc:app/Models/Service.php)
**Method**: `complexStatus()` (lines 176-288)
**Purpose**: Aggregates status for multi-container services
**Used by**: Docker Compose services
**Status Calculation**:
```php
// Aggregates status from all service applications and databases
// Handles excluded containers separately
// Returns status with :excluded suffix when all containers excluded
if (!$hasNonExcluded && $complexStatus === null && $complexHealth === null) {
    // All services excluded - calculate from excluded containers
    return "{$excludedStatus}:excluded";
}
```
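
The split between monitored and excluded containers can be illustrated in isolation. This is a rough sketch under stated assumptions rather than the logic in `Service.php`: the function name `aggregateWithExclusions()`, the input shape, and the rule that monitored containers alone decide the status when any exist are all hypothetical. The aggregation routine itself is passed in as a callable, so the sketch makes no claim about how statuses are combined.

```php
<?php
// Hypothetical sketch; the name, the input shape, and the handling of mixed
// excluded/non-excluded containers are assumptions, not Service.php code.

/**
 * @param array<int, array{status: string, excluded: bool}> $containers
 * @param callable(string[]): string $aggregate Combines statuses into one.
 */
function aggregateWithExclusions(array $containers, callable $aggregate): string
{
    $included = [];
    $excluded = [];
    foreach ($containers as $container) {
        if ($container['excluded']) {
            $excluded[] = $container['status'];
        } else {
            $included[] = $container['status'];
        }
    }

    if ($included !== []) {
        // Assumption: monitored containers alone decide the status.
        return $aggregate($included);
    }

    // Every container is excluded: still compute a status, then flag it.
    return $aggregate($excluded) . ':excluded';
}
```

Keeping the suffix logic in one place like this means the `:excluded` marker is only ever appended after the real status has been calculated.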
### **Status Flow Diagram**
```
┌─────────────────────────────────────────────────────────────┐
│                  Container Status Sources                   │
└─────────────────────────────────────────────────────────────┘
                               │
           ┌───────────────────┼──────────────────┐
           │                   │                  │
           ▼                   ▼                  ▼
   ┌───────────────┐  ┌─────────────────┐  ┌──────────────┐
   │ SSH-Based     │  │ Sentinel-Based  │  │ Multi-Server │
   │ (Scheduled)   │  │ (Real-time)     │  │ Aggregation  │
   ├───────────────┤  ├─────────────────┤  ├──────────────┤
   │ ServerCheck   │  │ PushServerUp-   │  │ ComplexStatus│
   │ Job           │  │ dateJob         │  │ Check        │
   │               │  │                 │  │              │
   │ Every ~1min   │  │ Every ~30sec    │  │ On demand    │
   └───────┬───────┘  └────────┬────────┘  └──────┬───────┘
           │                   │                  │
           └───────────────────┼──────────────────┘
                               │
                               ▼
                   ┌───────────────────────┐
                   │  Application/Service  │
                   │  Status Property      │
                   └───────────────────────┘
                               │
                               ▼
                   ┌───────────────────────┐
                   │ UI Display (Livewire) │
                   └───────────────────────┘
```
### **Status Priority System**
All status aggregation locations **MUST** follow the same priority:
**For Running Containers**:
1. **unhealthy** - Container has failing health checks
2. **unknown** - Container health status cannot be determined
3. **healthy** - Container is healthy
**For Non-Running States**:
1. **restarting** → `degraded (unhealthy)`
2. **running + exited** → `degraded (unhealthy)`
3. **dead/removing** → `degraded (unhealthy)`
4. **paused** → `paused`
5. **created/starting** → `starting`
6. **exited** → `exited (unhealthy)`
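
The non-running cases above can be read as an ordered chain of checks. The sketch below is illustrative only: the function name `aggregateNonRunningStatus()`, the assumption that the list order is also the evaluation order, and the `null` return when no case matches are hypothetical and not taken from the codebase.

```php
<?php
// Hypothetical sketch: maps a set of container states to the aggregate
// status, checking the cases in the order the list gives them.

/**
 * @param string[] $states Container states such as 'running' or 'exited'.
 * @return string|null Aggregate status, or null when no case matches.
 */
function aggregateNonRunningStatus(array $states): ?string
{
    // True when at least one of the wanted states is present.
    $has = static fn (string ...$wanted): bool => count(array_intersect($wanted, $states)) > 0;

    if ($has('restarting')) {
        return 'degraded (unhealthy)';
    }
    if ($has('running') && $has('exited')) {
        return 'degraded (unhealthy)';
    }
    if ($has('dead', 'removing')) {
        return 'degraded (unhealthy)';
    }
    if ($has('paused')) {
        return 'paused';
    }
    if ($has('created', 'starting')) {
        return 'starting';
    }
    if ($has('exited')) {
        return 'exited (unhealthy)';
    }

    // No non-running case matched (for example, every container is running).
    return null;
}
```

A `null` result is the point at which the running-container priority (unhealthy > unknown > healthy) would take over.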
### **Excluded Containers**
When a container has the `exclude_from_hc: true` flag:
**Behavior**:
- Status is still calculated from container state
- `:excluded` suffix is appended to indicate monitoring disabled
- UI shows "(Monitoring Disabled)" badge
- Action buttons respect the actual container state
**Format**: `{actual-status}:excluded`
**Examples**: `running:unknown:excluded`, `degraded:excluded`, `exited:excluded`
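
Code that consumes these strings (UI badges, action buttons) needs to separate the actual status from the suffix. A minimal sketch of such a parser follows; the helper name `parseExcludedStatus()` and its return shape are hypothetical, and `str_ends_with()` requires PHP 8.0 or later.

```php
<?php
// Hypothetical helper: splits a status string into the actual status and an
// "excluded" flag, following the `{actual-status}:excluded` format.

/**
 * @return array{status: string, excluded: bool}
 */
function parseExcludedStatus(string $raw): array
{
    $suffix = ':excluded';

    if (str_ends_with($raw, $suffix)) {
        return [
            // Drop the suffix so callers see the real container state.
            'status' => substr($raw, 0, -strlen($suffix)),
            'excluded' => true,
        ];
    }

    return ['status' => $raw, 'excluded' => false];
}
```

For `running:unknown:excluded` this returns the status `running:unknown` with `excluded` set to `true`, so button logic can act on the real state while the badge reflects that monitoring is disabled.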
### **Important Notes for Developers**
⚠️ **CRITICAL**: When modifying container status logic:
1. **Update ALL four locations**:
   - `GetContainersStatus.php` (SSH-based)
   - `PushServerUpdateJob.php` (Sentinel-based)
   - `ComplexStatusCheck.php` (multi-server)
   - `Service.php` (service-level)
2. **Maintain consistent priority**:
   - unhealthy > unknown > healthy
   - Apply same logic across all paths
3. **Test both update paths**:
   - Run unit tests: `./vendor/bin/pest tests/Unit/`
   - Test SSH updates (manual refresh)
   - Test Sentinel updates (wait 30 seconds)
4. **Handle edge cases**:
   - All containers excluded (`exclude_from_hc: true`)
   - Mixed excluded/non-excluded containers
   - Unknown health states
   - Container crash loops (restart count)
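
One lightweight way to check that two update paths agree is a parity check: feed both the same inputs and report any case where the results differ. The sketch below assumes nothing about the real classes. `findStatusMismatches()` is a hypothetical helper, and the two closures are simplified stand-ins (one tracks "unknown", one does not) rather than the actual SSH or Sentinel code.

```php
<?php
// Hypothetical parity check. In the real codebase the two callables would
// wrap the actual aggregation methods; here they are simplified stand-ins.

/**
 * @param callable(string[]): string $pathA
 * @param callable(string[]): string $pathB
 * @param array<int, string[]> $cases Lists of per-container statuses.
 * @return array<int, string[]> The cases on which the two paths disagree.
 */
function findStatusMismatches(callable $pathA, callable $pathB, array $cases): array
{
    return array_values(array_filter(
        $cases,
        static fn (array $statuses): bool => $pathA($statuses) !== $pathB($statuses)
    ));
}

// Stand-in that tracks "unknown".
$withUnknown = static function (array $statuses): string {
    $joined = implode(' ', $statuses);
    if (str_contains($joined, 'unhealthy')) {
        return 'running (unhealthy)';
    }

    return str_contains($joined, 'unknown') ? 'running (unknown)' : 'running (healthy)';
};

// Stand-in that ignores "unknown" and falls through to healthy.
$withoutUnknown = static function (array $statuses): string {
    return str_contains(implode(' ', $statuses), 'unhealthy')
        ? 'running (unhealthy)'
        : 'running (healthy)';
};

$cases = [
    ['running (healthy)'],
    ['running (healthy)', 'running (unknown)'],
    ['running (unhealthy)', 'running (unknown)'],
];

$mismatches = findStatusMismatches($withUnknown, $withoutUnknown, $cases);
```

Here only the second case ends up in `$mismatches`, which is exactly the kind of divergence that shows up as a status flipping between two values in the UI.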
### **Related Tests**
- **[tests/Unit/ContainerHealthStatusTest.php](mdc:tests/Unit/ContainerHealthStatusTest.php)**: Health status aggregation
- **[tests/Unit/PushServerUpdateJobStatusAggregationTest.php](mdc:tests/Unit/PushServerUpdateJobStatusAggregationTest.php)**: Sentinel update logic
- **[tests/Unit/ExcludeFromHealthCheckTest.php](mdc:tests/Unit/ExcludeFromHealthCheckTest.php)**: Excluded container handling
### **Common Bugs to Avoid**
**Bug**: Forgetting to track `$hasUnknown` flag
**Fix**: Initialize and check for "unknown" in all status aggregation
**Bug**: Using ternary operator instead of if-elseif-else
**Fix**: Use explicit if-elseif-else to handle 3-way priority
**Bug**: Updating only one path (SSH or Sentinel)
**Fix**: Always update all four status calculation locations
**Bug**: Not handling excluded containers with `:excluded` suffix
**Fix**: Check for `:excluded` suffix in UI logic and button visibility