diff --git a/STALE_LOCK_FIX.md b/STALE_LOCK_FIX.md deleted file mode 100644 index c80eca76a..000000000 --- a/STALE_LOCK_FIX.md +++ /dev/null @@ -1,154 +0,0 @@ -# Fix for Stale Lock Issue in ScheduledJobManager - -## Issue -GitHub Issue: #4539 - Scheduled tasks not executing on schedule - -### Symptoms -- Scheduled tasks stop executing after working for weeks/months -- Backups don't run -- Auto-updates don't work -- Error in Horizon: `Illuminate\Queue\MaxAttemptsExceededException: App\Jobs\ScheduledJobManager has been attempted too many times` -- Running `horizon:clear`, `cleanup:redis`, `schedule:clear-cache` doesn't fix the problem - -## Root Cause - -The `ScheduledJobManager` was using `WithoutOverlapping` middleware with only `releaseAfter(60)`: - -```php -(new WithoutOverlapping('scheduled-job-manager')) - ->releaseAfter(60) -``` - -**Problems with this approach:** - -1. **No automatic lock expiration**: Without `expireAfter()`, locks persist indefinitely if: - - Process hangs or becomes unresponsive - - Job takes longer than expected - - Unexpected termination occurs - -2. **Race condition with releaseAfter()**: - - Job acquires lock - - Job gets stuck/hangs - - After 60s, job is released back to queue - - New attempt can't acquire lock (still held by hung process) - - Repeats until MaxAttemptsExceededException - -3. **Against Laravel best practices**: Laravel docs explicitly recommend using `expireAfter()` to prevent stale locks - -## Solution - -This fix has two parts: - -### Part 1: Prevention (Fix Future Locks) - -Changed the middleware to match the pattern used by other Coolify jobs: - -```php -// File: app/Jobs/ScheduledJobManager.php -(new WithoutOverlapping('scheduled-job-manager')) - ->expireAfter(60) // Lock expires after 1 minute (matches job frequency) - ->dontRelease() // Don't re-queue on lock conflict -``` - -### Part 2: Recovery (Clear Existing Stale Locks) - -Enhanced `cleanup:redis` command to clear existing stale locks: - -```php -// File: app/Console/Commands/CleanupRedis.php -// Added --clear-locks flag -php artisan cleanup:redis --clear-locks -``` - -**What it does:** -- Scans Redis for `laravel-queue-overlap` keys (WithoutOverlapping locks) -- Checks TTL of each lock -- Deletes locks with TTL = -1 (no expiration = stale!) -- Skips active locks that have proper expiration -- Called automatically during `app:init` (on Coolify startup/update) - -### Why This Works - -✅ **Auto-expiring locks**: Lock automatically expires after 60 seconds, even if: - - Process crashes - - Job hangs - - Network issues occur - -✅ **No retry storms**: `dontRelease()` prevents failed jobs from being re-queued repeatedly - -✅ **Consistent pattern**: Matches other Coolify jobs like: - - `DockerCleanupJob`: `expireAfter(600)->dontRelease()` - - `ServerCheckJob`: `expireAfter(60)->dontRelease()` - - `RestartProxyJob`: `expireAfter(60)->dontRelease()` - -✅ **Laravel recommended**: Follows official Laravel documentation for preventing stale locks - -### Why 60 Seconds? - -- Job runs **every minute** (`everyMinute()` schedule) -- Matches the job frequency (1:1 ratio) -- Matches `CleanupInstanceStuffsJob` pattern (also runs frequently with 60s expiry) -- Allows next cycle to run if current job hangs -- Still reasonable timeout to prevent long-held locks - -## Testing - -### Manual Lock Key Inspection - -To check for locks in Redis: - -```bash -docker exec -it coolify-redis redis-cli -SELECT 0 -KEYS *laravel-queue-overlap*ScheduledJobManager* -``` - -Full key format: -``` -coolify_development_database_coolify_development_cache_laravel-queue-overlap:App\Jobs\ScheduledJobManager:scheduled-job-manager -``` - -Check TTL: -```bash -TTL "" -``` - -- `-1` = No expiration (STALE LOCK - the bug!) -- `-2` = Key doesn't exist -- Positive number = Seconds until expiration (GOOD!) - -### Testing the Fix - -Created test jobs to demonstrate the fix: -- `TestStaleLockJob.php` - Uses broken pattern (`releaseAfter` only) -- `TestFixedLockJob.php` - Uses fixed pattern (`expireAfter` + `dontRelease`) - -## Impact - -This fix will: -- ✅ **Immediate recovery**: Existing stale locks cleared on upgrade/restart -- ✅ **Future prevention**: New locks auto-expire, preventing issue recurrence -- ✅ **Self-recovery**: System can recover from transient issues automatically -- ✅ **Zero manual intervention**: No need for users to manually clear locks -- ✅ **Reliable operations**: Backups, tasks, and auto-updates run consistently - -## Files Modified - -1. **app/Jobs/ScheduledJobManager.php** - - Changed middleware to use `expireAfter(120)->dontRelease()` - -2. **app/Console/Commands/CleanupRedis.php** - - Added `--clear-locks` flag - - Added `cleanupCacheLocks()` method - -3. **app/Console/Commands/Init.php** - - Updated to call `cleanup:redis --clear-locks` on startup - -4. **tests/Unit/ScheduledJobManagerLockTest.php** - - New unit test to prevent regression - -## References - -- Laravel Docs: https://laravel.com/docs/12.x/queues#preventing-job-overlaps -- GitHub Issue: https://github.com/coollabsio/coolify/issues/4539 -- Related Pattern: All other Coolify jobs use `expireAfter()->dontRelease()`