Commit graph

47 commits

Author SHA1 Message Date
Andras Bacsai
e41dbde46b chore: prepare for PR 2026-03-10 18:34:37 +01:00
Andras Bacsai
fc229c4889 chore: prepare for PR 2026-02-03 15:32:03 +01:00
Andras Bacsai
cc53e9476e fix(docker): add fallback for Docker Swarm container labels 2026-01-07 14:57:13 +01:00
Andras Bacsai
d04d0ab499 Refactor restart tracking and add missing model casts
- Consolidate duplicate restart tracking logic in GetContainersStatus
- Add last_restart_type string cast to all 8 standalone database models

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-17 17:45:58 +01:00
Andras Bacsai
6d47d24169 Fix standalone database "restarting" status flickering and add restart tracking
- Fix status flickering: Track databases in active/transient states (restarting, starting, created, paused) not just running
- Add isActiveOrTransient() helper to distinguish between active states and terminal states (exited, dead)
- Add safeguard: Protect updateNotFoundDatabaseStatus() from marking as exited when containers collection is empty
- Add restart_count tracking: New migration adds restart_count, last_restart_at, last_restart_type to all standalone database tables
- Update 8 database models with $casts for new restart tracking fields
- Update GetContainersStatus to extract RestartCount from Docker and update database models
- Reset restart tracking when database exits completely

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-17 16:25:41 +01:00
Andras Bacsai
66e81d6d96 Fix container status display: preserve "Restarting" for applications and sub-resources
Add preserveRestarting parameter to ContainerStatusAggregator to allow applications
and service sub-resources to display "Restarting" status instead of being marked as
"Degraded". This gives better visibility into container restart behavior.

- Update ContainerStatusAggregator to accept preserveRestarting parameter (defaults to false)
- Update GetContainersStatus to use preserveRestarting: true for applications and service sub-resources
- Update PushServerUpdateJob to use preserveRestarting: true for applications and service sub-resources
- Add comprehensive documentation explaining the parameter behavior and when to use it

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 08:23:35 +01:00
Andras Bacsai
ac9eca3c05 fix: don't show health status for exited containers
Exited containers don't run health checks, so showing "(unhealthy)" is
misleading. This fix ensures exited status displays without health
suffixes across all monitoring systems (SSH, Sentinel, services, etc.)
and at the UI layer for backward compatibility with existing data.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-24 09:09:37 +01:00
Andras Bacsai
ae6eef3cdb feat(tests): add comprehensive tests for ContainerStatusAggregator and serverStatus accessor
- Introduced tests for ContainerStatusAggregator to validate status aggregation logic across various container states.
- Implemented tests to ensure serverStatus accessor correctly checks server infrastructure health without being affected by container status.
- Updated ExcludeFromHealthCheckTest to verify excluded status handling in various components.
- Removed obsolete PushServerUpdateJobStatusAggregationTest as its functionality is covered elsewhere.
- Updated version number for sentinel to 0.0.17 in versions.json.
2025-11-20 17:31:07 +01:00
Andras Bacsai
14bba8ba86 fix: correct Sentinel default health status and remove debug logging
This commit addresses container status reporting issues and removes debug logging:

**Primary Fix:**
- Changed PushServerUpdateJob to default to 'unknown' instead of 'unhealthy' when health_status field is missing from Sentinel data
- This ensures containers WITHOUT healthcheck defined are correctly reported as "unknown" not "unhealthy"
- Matches SSH path behavior (GetContainersStatus) which already defaulted to 'unknown'

**Service Multi-Container Aggregation:**
- Implemented service container status aggregation (same pattern as applications)
- Added serviceContainerStatuses collection to both Sentinel and SSH paths
- Services now aggregate status using priority: unhealthy > unknown > healthy
- Prevents race conditions where last-processed container would win

**Debug Logging Cleanup:**
- Removed all [STATUS-DEBUG] logging statements (25 total)
- Removed all ray() debugging calls (3 total)
- Removed proof_unknown_preserved and health_status_was_null debug fields
- Code is now production-ready

**Test Coverage:**
- Added 2 new tests for Sentinel default health status behavior
- Added 5 new tests for service aggregation in SSH path
- All 16 tests pass (66 assertions)

**Note:** The root cause was identified as Sentinel (Go binary) also defaulting to "unhealthy". That will need a separate fix in the Sentinel codebase.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 11:10:34 +01:00
Andras Bacsai
d2d9c1b2bc debug: add comprehensive status change logging
Added detailed debug logging to all status update paths to help
diagnose why "unhealthy" status appears in the UI.

## Logging Added

### 1. PushServerUpdateJob (Sentinel updates)
**Location**: Lines 303-315
**Logs**: Status changes from Sentinel push updates
**Data tracked**:
- Old vs new status
- Container statuses that led to aggregation
- Status flags (hasRunning, hasUnhealthy, hasUnknown)

### 2. GetContainersStatus (SSH updates)
**Location**: Lines 441-449, 346-354, 358-365
**Logs**: Status changes from SSH-based checks
**Scenarios**:
- Normal status aggregation
- Recently restarted containers (kept as degraded)
- Applications not running (set to exited)
**Data tracked**:
- Old vs new status
- Container statuses
- Restart count and timing
- Whether containers exist

### 3. Application Model Status Accessor
**Location**: Lines 706-712, 726-732
**Logs**: When status is set without explicit health information
**Issue**: Highlights cases where health defaults to "unhealthy"
**Data tracked**:
- Raw value passed to setter
- Final result after default applied

## How to Use

### Enable Debug Logging
Edit `.env` or `config/logging.php` to set log level to debug:
```
LOG_LEVEL=debug
```

### Monitor Logs
```bash
tail -f storage/logs/laravel.log | grep STATUS-DEBUG
```

### Log Format
All logs use `[STATUS-DEBUG]` prefix for easy filtering:
```
[2025-11-19 13:00:00] local.DEBUG: [STATUS-DEBUG] Sentinel status change
{
  "source": "PushServerUpdateJob",
  "app_id": 123,
  "app_name": "my-app",
  "old_status": "running:unknown",
  "new_status": "running:healthy",
  "container_statuses": [...],
  "flags": {...}
}
```

## What to Look For

1. **Default to unhealthy**: Check Application model accessor logs
2. **Status flipping**: Compare timestamps between Sentinel and SSH updates
3. **Incorrect aggregation**: Check flags and container_statuses
4. **Stale database values**: Check if old_status persists across multiple logs

## Next Steps

After gathering logs, we can:
1. Identify the exact source of "unhealthy" status
2. Determine if it's a default issue, aggregation bug, or timing problem
3. Apply targeted fix based on evidence

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-19 13:52:08 +01:00
Andras Bacsai
e3746a4b88 fix: preserve unknown health state and handle edge case container states
This commit fixes container health status aggregation to correctly handle
unknown health states and edge case container states across all resource types.

Changes:

1. **Preserve Unknown Health State**
   - Add three-way priority: unhealthy > unknown > healthy
   - Detect containers without healthchecks (null health) as unknown
   - Apply across GetContainersStatus, ComplexStatusCheck, and Service models

2. **Handle Edge Case Container States**
   - Add support for: created, starting, paused, dead, removing
   - Map to appropriate statuses: starting (unknown), paused (unknown), degraded (unhealthy)
   - Prevent containers in transitional states from showing incorrect status

3. **Add :excluded Suffix for Excluded Containers**
   - Parse exclude_from_hc flag from docker-compose YAML
   - Append :excluded suffix to individual container statuses
   - Skip :excluded containers in non-excluded aggregation sections
   - Strip :excluded suffix in excluded aggregation sections
   - Makes it clear in UI which containers are excluded from monitoring

Files Modified:
- app/Actions/Docker/GetContainersStatus.php
- app/Actions/Shared/ComplexStatusCheck.php
- app/Models/Service.php
- tests/Unit/ContainerHealthStatusTest.php

Tests: 18 passed (82 assertions)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-19 13:19:25 +01:00
Andras Bacsai
498b189286 fix: correct status for services with all containers excluded from health checks
When all containers are excluded from health checks, display their actual status
with :excluded suffix instead of misleading hardcoded statuses. This prevents
broken UI state with incorrect action buttons and provides clarity that monitoring
is disabled.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-19 10:54:51 +01:00
Andras Bacsai
23c165d4d1 fix: wrap database updates in a transaction for consistency in GetContainersStatus 2025-11-10 15:07:44 +01:00
Andras Bacsai
68a9f2ca77 feat: add container restart tracking and crash loop detection
Track container restart counts from Docker and detect crash loops to provide better visibility into application health issues.

- Add restart_count, last_restart_at, and last_restart_type columns to applications table
- Detect restart count increases from Docker inspect data and send notifications
- Show restart count badge in UI with warning icon on Logs navigation
- Distinguish between crash restarts and manual restarts
- Implement 30-second grace period to prevent false "exited" status during crash loops
- Reset restart count on manual stop, restart, and redeploy actions
- Add unit tests for restart count tracking logic

This helps users quickly identify when containers are in crash loops and need attention, even when the container status flickers between states during Docker's restart backoff period.
2025-11-10 13:04:31 +01:00
Andras Bacsai
f515870f36 fix(docker): enhance container status aggregation to include restarting and exited states 2025-09-18 18:12:52 +02:00
Andras Bacsai
08d257535a fix(docker): enhance container status aggregation for multi-container applications, including exclusion handling based on docker-compose configuration 2025-09-13 20:32:15 +02:00
Andras Bacsai
684bd823c6 fix(docker): add protection against empty container queries in GetContainersStatus to prevent unnecessary updates 2025-06-04 10:03:07 +02:00
Andras Bacsai
786bfa960f improvement(core): simplify events for app/db/service status changes 2025-05-19 21:50:32 +02:00
Andras Bacsai
27e4882d57 feat(core): You can validate compose files with docker compose config
fix(core): labels are now accepted with both compose styles
refactor: remove lots of ray's
2025-02-27 11:29:04 +01:00
Andras Bacsai
1fe4dd722b Revert "rector: arrrrr"
This reverts commit 16c0cd10d8.
2025-01-07 15:31:43 +01:00
Andras Bacsai
16c0cd10d8 rector: arrrrr 2025-01-07 14:52:08 +01:00
Andras Bacsai
319c3023dc fix 2024-12-02 22:50:03 +01:00
Andras Bacsai
58988d3686 fix: a few inputs 2024-12-02 22:50:03 +01:00
Andras Bacsai
7dc65dfd79 fix: make sure important jobs/actions are running on high prio queue 2024-11-22 11:16:01 +01:00
Andras Bacsai
a2b6a61c4a fix: update last online with old function 2024-11-08 09:43:46 +01:00
Andras Bacsai
e69b0ca1a9 disable tcp proxy notification 2024-11-08 09:18:43 +01:00
Andras Bacsai
ca7c214775 fix: new way to update container statuses 2024-11-03 09:02:14 +01:00
Lucas Michot
0c133b113c Delete some useless imports 2024-10-31 16:33:49 +01:00
Andras Bacsai
ac768e5313 feat: limit storage check emails
feat: sentinel should send storage usage
2024-10-22 14:01:36 +02:00
Andras Bacsai
5c93780304 remove unnecessary code 2024-10-22 12:01:46 +02:00
peaklabs-dev
8635f92ed4
Remove duplicated proxy check 2024-10-14 21:35:38 +02:00
Andras Bacsai
dedf2cf87b fix: proxy fixes 2024-09-27 15:36:51 +02:00
Andras Bacsai
d6b4e33db3 fix: exited services statuses 2024-09-24 20:40:41 +02:00
Andras Bacsai
e4b92bb660 feat: new server checking job
feat: show if the server  has problems on ui
2024-08-05 15:48:15 +02:00
Thijmen
d86274cc37 Fix styling 2024-06-10 20:43:34 +00:00
Andras Bacsai
086138fbd9 fix: disable containerStopped job for now 2024-05-23 11:31:52 +02:00
Andras Bacsai
6c3b4070ba chore: Refactor container name logic in GetContainersStatus.php and ForcePasswordReset.php 2024-05-22 14:44:11 +02:00
Andras Bacsai
a3d73634e7 feat: scheduled task failed notification 2024-05-21 15:36:26 +02:00
Andras Bacsai
98b6aec203 feat: admin view for deleting users 2024-05-21 14:29:06 +02:00
Andras Bacsai
1fb7e97700 Fix error handling in GetContainersStatus.php and increase length of stripe_comment field in migrations 2024-05-10 12:10:47 +02:00
Andras Bacsai
e91a64b1cc Refactor GetContainersStatus.php for improved readability and maintainability 2024-05-09 12:10:06 +02:00
Andras Bacsai
ba40f93386 do not use sentinel for container details for now 2024-05-08 20:59:58 +02:00
Andras Bacsai
829e17ef2b Refactor GetContainersStatus.php for improved readability and maintainability 2024-05-08 15:19:12 +02:00
Andras Bacsai
bc5d3bea14 Refactor GetContainersStatus.php for improved readability and maintainability 2024-05-08 15:04:13 +02:00
Andras Bacsai
c618e58a11 feat: start Sentinel on servers. 2024-05-08 14:22:35 +02:00
Andras Bacsai
3eb4aed867 chore: Refactor GetContainersStatus.php for improved readability and maintainability 2024-05-08 09:23:32 +02:00
Andras Bacsai
f6f959a897 feat: experimental sentinel 2024-05-07 15:41:50 +02:00