docs: add comprehensive container status monitoring system documentation
## Added Documentation Created detailed documentation in `.ai/core/application-architecture.md` explaining the container status monitoring system to prevent future bugs. ## Key Sections ### 1. Container Status Monitoring System Overview - Explains that status is updated through multiple independent paths - Emphasizes that ALL paths must be updated when changing status logic ### 2. Critical Implementation Locations Documents all four status calculation locations: - **SSH-Based Updates**: `GetContainersStatus.php` (scheduled, every ~1min) - **Sentinel-Based Updates**: `PushServerUpdateJob.php` (real-time, every ~30sec) - **Multi-Server Aggregation**: `ComplexStatusCheck.php` (on-demand) - **Service-Level Aggregation**: `Service.php` (service status) ### 3. Status Flow Diagram Visual representation of how status flows from different sources to UI ### 4. Status Priority System Documents the required priority: unhealthy > unknown > healthy ### 5. Excluded Containers Explains `:excluded` suffix handling and behavior ### 6. Developer Guidelines - Checklist of all locations to update - Testing requirements - Edge cases to handle ### 7. Related Tests Links to all relevant test files ### 8. Common Bugs to Avoid Real examples from bugs we've fixed, with solutions ## Why This Documentation Matters The recent bug (unknown → healthy) happened because: 1. `GetContainersStatus.php` was updated to handle "unknown" status 2. `PushServerUpdateJob.php` was NOT updated 3. This caused periodic status flipping This documentation ensures future developers (and AI assistants like Claude) will know to update ALL four locations when modifying status logic. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
parent
6b62847a11
commit
128c0b00ec
1 changed files with 208 additions and 0 deletions
|
|
@ -361,3 +361,211 @@ ### **Background Processing**
|
|||
- **Queue Workers**: Horizon-managed job processing
|
||||
- **Job Batching**: Related job grouping
|
||||
- **Failed Job Handling**: Automatic retry logic
|
||||
|
||||
## Container Status Monitoring System
|
||||
|
||||
### **Overview**
|
||||
|
||||
Container health status is monitored and updated through **multiple independent paths**. When modifying status logic, **ALL paths must be updated** to ensure consistency.
|
||||
|
||||
### **Critical Implementation Locations**
|
||||
|
||||
#### **1. SSH-Based Status Updates (Scheduled)**
|
||||
**File**: [app/Actions/Docker/GetContainersStatus.php](mdc:app/Actions/Docker/GetContainersStatus.php)
|
||||
**Method**: `aggregateApplicationStatus()` (lines 487-540)
|
||||
**Trigger**: Scheduled job or manual refresh
|
||||
**Frequency**: Every minute (via `ServerCheckJob`)
|
||||
|
||||
**Status Aggregation Logic**:
|
||||
```php
|
||||
// Tracks multiple status flags
|
||||
$hasRunning = false;
|
||||
$hasRestarting = false;
|
||||
$hasUnhealthy = false;
|
||||
$hasUnknown = false; // ⚠️ CRITICAL: Must track unknown
|
||||
$hasExited = false;
|
||||
// ... more states
|
||||
|
||||
// Priority: restarting > degraded > running (unhealthy > unknown > healthy)
|
||||
if ($hasRunning) {
|
||||
if ($hasUnhealthy) return 'running (unhealthy)';
|
||||
elseif ($hasUnknown) return 'running (unknown)';
|
||||
else return 'running (healthy)';
|
||||
}
|
||||
```
|
||||
|
||||
#### **2. Sentinel-Based Status Updates (Real-time)**
|
||||
**File**: [app/Jobs/PushServerUpdateJob.php](mdc:app/Jobs/PushServerUpdateJob.php)
|
||||
**Method**: `aggregateMultiContainerStatuses()` (lines 269-298)
|
||||
**Trigger**: Sentinel push updates from remote servers
|
||||
**Frequency**: Every ~30 seconds (real-time)
|
||||
|
||||
**Status Aggregation Logic**:
|
||||
```php
|
||||
// ⚠️ MUST match GetContainersStatus logic
|
||||
$hasRunning = false;
|
||||
$hasUnhealthy = false;
|
||||
$hasUnknown = false; // ⚠️ CRITICAL: Added to fix bug
|
||||
|
||||
foreach ($relevantStatuses as $status) {
|
||||
if (str($status)->contains('running')) {
|
||||
$hasRunning = true;
|
||||
if (str($status)->contains('unhealthy')) $hasUnhealthy = true;
|
||||
if (str($status)->contains('unknown')) $hasUnknown = true; // ⚠️ CRITICAL
|
||||
}
|
||||
}
|
||||
|
||||
// Priority: unhealthy > unknown > healthy
|
||||
if ($hasRunning) {
|
||||
if ($hasUnhealthy) $aggregatedStatus = 'running (unhealthy)';
|
||||
elseif ($hasUnknown) $aggregatedStatus = 'running (unknown)';
|
||||
else $aggregatedStatus = 'running (healthy)';
|
||||
}
|
||||
```
|
||||
|
||||
#### **3. Multi-Server Status Aggregation**
|
||||
**File**: [app/Actions/Shared/ComplexStatusCheck.php](mdc:app/Actions/Shared/ComplexStatusCheck.php)
|
||||
**Method**: `resource()` (lines 48-210)
|
||||
**Purpose**: Aggregates status across multiple servers for applications
|
||||
**Used by**: Applications with multiple destinations
|
||||
|
||||
**Key Features**:
|
||||
- Aggregates statuses from main + additional servers
|
||||
- Handles excluded containers (`:excluded` suffix)
|
||||
- Calculates overall application health from all containers
|
||||
|
||||
**Status Format with Excluded Containers**:
|
||||
```php
|
||||
// When all containers excluded from health checks:
|
||||
return 'running:unhealthy:excluded'; // Container running but unhealthy, monitoring disabled
|
||||
return 'running:unknown:excluded'; // Container running, health unknown, monitoring disabled
|
||||
return 'running:healthy:excluded'; // Container running and healthy, monitoring disabled
|
||||
return 'degraded:excluded'; // Some containers down, monitoring disabled
|
||||
return 'exited:excluded'; // All containers stopped, monitoring disabled
|
||||
```
|
||||
|
||||
#### **4. Service-Level Status Aggregation**
|
||||
**File**: [app/Models/Service.php](mdc:app/Models/Service.php)
|
||||
**Method**: `complexStatus()` (lines 176-288)
|
||||
**Purpose**: Aggregates status for multi-container services
|
||||
**Used by**: Docker Compose services
|
||||
|
||||
**Status Calculation**:
|
||||
```php
|
||||
// Aggregates status from all service applications and databases
|
||||
// Handles excluded containers separately
|
||||
// Returns status with :excluded suffix when all containers excluded
|
||||
if (!$hasNonExcluded && $complexStatus === null && $complexHealth === null) {
|
||||
// All services excluded - calculate from excluded containers
|
||||
return "{$excludedStatus}:excluded";
|
||||
}
|
||||
```
|
||||
|
||||
### **Status Flow Diagram**
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ Container Status Sources │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
┌────────────────────┼────────────────────┐
|
||||
│ │ │
|
||||
▼ ▼ ▼
|
||||
┌───────────────┐ ┌─────────────────┐ ┌──────────────┐
|
||||
│ SSH-Based │ │ Sentinel-Based │ │ Multi-Server │
|
||||
│ (Scheduled) │ │ (Real-time) │ │ Aggregation │
|
||||
├───────────────┤ ├─────────────────┤ ├──────────────┤
|
||||
│ ServerCheck │ │ PushServerUp- │ │ ComplexStatus│
|
||||
│ Job │ │ dateJob │ │ Check │
|
||||
│ │ │ │ │ │
|
||||
│ Every ~1min │ │ Every ~30sec │ │ On demand │
|
||||
└───────┬───────┘ └────────┬────────┘ └──────┬───────┘
|
||||
│ │ │
|
||||
└────────────────────┼────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌───────────────────────┐
|
||||
│ Application/Service │
|
||||
│ Status Property │
|
||||
└───────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌───────────────────────┐
|
||||
│ UI Display (Livewire) │
|
||||
└───────────────────────┘
|
||||
```
|
||||
|
||||
### **Status Priority System**
|
||||
|
||||
All status aggregation locations **MUST** follow the same priority:
|
||||
|
||||
**For Running Containers**:
|
||||
1. **unhealthy** - Container has failing health checks
|
||||
2. **unknown** - Container health status cannot be determined
|
||||
3. **healthy** - Container is healthy
|
||||
|
||||
**For Non-Running States**:
|
||||
1. **restarting** → `degraded (unhealthy)`
|
||||
2. **running + exited** → `degraded (unhealthy)`
|
||||
3. **dead/removing** → `degraded (unhealthy)`
|
||||
4. **paused** → `paused`
|
||||
5. **created/starting** → `starting`
|
||||
6. **exited** → `exited (unhealthy)`
|
||||
|
||||
### **Excluded Containers**
|
||||
|
||||
When containers have `exclude_from_hc: true` flag:
|
||||
|
||||
**Behavior**:
|
||||
- Status is still calculated from container state
|
||||
- `:excluded` suffix is appended to indicate monitoring disabled
|
||||
- UI shows "(Monitoring Disabled)" badge
|
||||
- Action buttons respect the actual container state
|
||||
|
||||
**Format**: `{actual-status}:excluded`
|
||||
**Examples**: `running:unknown:excluded`, `degraded:excluded`, `exited:excluded`
|
||||
|
||||
### **Important Notes for Developers**
|
||||
|
||||
⚠️ **CRITICAL**: When modifying container status logic:
|
||||
|
||||
1. **Update ALL four locations**:
|
||||
- `GetContainersStatus.php` (SSH-based)
|
||||
- `PushServerUpdateJob.php` (Sentinel-based)
|
||||
- `ComplexStatusCheck.php` (multi-server)
|
||||
- `Service.php` (service-level)
|
||||
|
||||
2. **Maintain consistent priority**:
|
||||
- unhealthy > unknown > healthy
|
||||
- Apply same logic across all paths
|
||||
|
||||
3. **Test both update paths**:
|
||||
- Run unit tests: `./vendor/bin/pest tests/Unit/`
|
||||
- Test SSH updates (manual refresh)
|
||||
- Test Sentinel updates (wait 30 seconds)
|
||||
|
||||
4. **Handle edge cases**:
|
||||
- All containers excluded (`exclude_from_hc: true`)
|
||||
- Mixed excluded/non-excluded containers
|
||||
- Unknown health states
|
||||
- Container crash loops (restart count)
|
||||
|
||||
### **Related Tests**
|
||||
|
||||
- **[tests/Unit/ContainerHealthStatusTest.php](mdc:tests/Unit/ContainerHealthStatusTest.php)**: Health status aggregation
|
||||
- **[tests/Unit/PushServerUpdateJobStatusAggregationTest.php](mdc:tests/Unit/PushServerUpdateJobStatusAggregationTest.php)**: Sentinel update logic
|
||||
- **[tests/Unit/ExcludeFromHealthCheckTest.php](mdc:tests/Unit/ExcludeFromHealthCheckTest.php)**: Excluded container handling
|
||||
|
||||
### **Common Bugs to Avoid**
|
||||
|
||||
❌ **Bug**: Forgetting to track `$hasUnknown` flag
|
||||
✅ **Fix**: Initialize and check for "unknown" in all status aggregation
|
||||
|
||||
❌ **Bug**: Using ternary operator instead of if-elseif-else
|
||||
✅ **Fix**: Use explicit if-elseif-else to handle 3-way priority
|
||||
|
||||
❌ **Bug**: Updating only one path (SSH or Sentinel)
|
||||
✅ **Fix**: Always update all four status calculation locations
|
||||
|
||||
❌ **Bug**: Not handling excluded containers with `:excluded` suffix
|
||||
✅ **Fix**: Check for `:excluded` suffix in UI logic and button visibility
|
||||
|
|
|
|||
Loading…
Reference in a new issue