# Fix for Stale Lock Issue in ScheduledJobManager

## Issue
GitHub Issue: #4539 - Scheduled tasks not executing on schedule
### Symptoms

- Scheduled tasks stop executing after working for weeks/months
- Backups don't run
- Auto-updates don't work
- Error in Horizon:

  ```
  Illuminate\Queue\MaxAttemptsExceededException: App\Jobs\ScheduledJobManager has been attempted too many times
  ```

- Running `horizon:clear`, `cleanup:redis`, or `schedule:clear-cache` doesn't fix the problem
## Root Cause

The `ScheduledJobManager` was using the `WithoutOverlapping` middleware with only `releaseAfter(60)`:

```php
(new WithoutOverlapping('scheduled-job-manager'))
    ->releaseAfter(60)
```
Problems with this approach:

1. **No automatic lock expiration**: Without `expireAfter()`, locks persist indefinitely if:
   - The process hangs or becomes unresponsive
   - The job takes longer than expected
   - Unexpected termination occurs

2. **Race condition with `releaseAfter()`**:
   - Job acquires the lock
   - Job gets stuck/hangs
   - After 60s, the job is released back to the queue
   - The new attempt can't acquire the lock (still held by the hung process)
   - This repeats until `MaxAttemptsExceededException`

3. **Against Laravel best practices**: The Laravel docs explicitly recommend using `expireAfter()` to prevent stale locks
## Solution

This fix has two parts:

### Part 1: Prevention (Fix Future Locks)

Changed the middleware to match the pattern used by other Coolify jobs:
```php
// File: app/Jobs/ScheduledJobManager.php
(new WithoutOverlapping('scheduled-job-manager'))
    ->expireAfter(60)  // Lock expires after 1 minute (matches job frequency)
    ->dontRelease()    // Don't re-queue on lock conflict
```
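For context, this snippet is returned from the job's `middleware()` method. The following is a minimal sketch of the surrounding class, assuming the standard Laravel queued-job structure (the actual `ScheduledJobManager` has more to it):

```php
<?php

namespace App\Jobs;

use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\Middleware\WithoutOverlapping;

// Sketch only: shows where the middleware configuration lives,
// not the full job implementation.
class ScheduledJobManager implements ShouldQueue
{
    public function middleware(): array
    {
        return [
            (new WithoutOverlapping('scheduled-job-manager'))
                ->expireAfter(60)  // Redis lock gets a 60s TTL
                ->dontRelease(),   // conflicting attempts are dropped, not retried
        ];
    }

    public function handle(): void
    {
        // dispatch due scheduled tasks, backups, etc.
    }
}
```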
### Part 2: Recovery (Clear Existing Stale Locks)

Enhanced the `cleanup:redis` command (in `app/Console/Commands/CleanupRedis.php`) with a new `--clear-locks` flag:

```shell
php artisan cleanup:redis --clear-locks
```
What it does:

- Scans Redis for `laravel-queue-overlap` keys (`WithoutOverlapping` locks)
- Checks the TTL of each lock
- Deletes locks with TTL = -1 (no expiration = stale!)
- Skips active locks that have a proper expiration
- Called automatically during `app:init` (on Coolify startup/update)
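The scan-and-delete step could look roughly like this. This is an illustrative sketch, not the actual `cleanupCacheLocks()` implementation, and it assumes the phpredis client behind Laravel's `Redis` facade:

```php
use Illuminate\Support\Facades\Redis;

// Illustrative sketch: iterate lock keys with SCAN (non-blocking,
// unlike KEYS) and delete any key that has no expiration set.
private function cleanupCacheLocks(): void
{
    $cursor = null;

    do {
        $result = Redis::scan($cursor ?? 0, [
            'match' => '*laravel-queue-overlap*',
            'count' => 100,
        ]);

        if ($result === false) {
            break; // no matching keys in this iteration
        }

        [$cursor, $keys] = $result;

        foreach ($keys as $key) {
            // TTL of -1 means the key exists but never expires: a stale lock
            if (Redis::ttl($key) === -1) {
                Redis::del($key);
                $this->info("Deleted stale lock: {$key}");
            }
        }
    } while ((int) $cursor !== 0);
}
```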
## Why This Works

✅ **Auto-expiring locks**: The lock automatically expires after 60 seconds, even if:

- The process crashes
- The job hangs
- Network issues occur

✅ **No retry storms**: `dontRelease()` prevents failed jobs from being re-queued repeatedly

✅ **Consistent pattern**: Matches other Coolify jobs such as:

- `DockerCleanupJob`: `expireAfter(600)->dontRelease()`
- `ServerCheckJob`: `expireAfter(60)->dontRelease()`
- `RestartProxyJob`: `expireAfter(60)->dontRelease()`

✅ **Laravel recommended**: Follows the official Laravel documentation for preventing stale locks
## Why 60 Seconds?

- The job runs every minute (`everyMinute()` schedule)
- Matches the job frequency (1:1 ratio)
- Matches the `CleanupInstanceStuffsJob` pattern (also runs frequently with a 60s expiry)
- Allows the next cycle to run if the current job hangs
- Still a reasonable timeout to prevent long-held locks
## Testing

### Manual Lock Key Inspection

To check for locks in Redis:

```shell
docker exec -it coolify-redis redis-cli
SELECT 0
KEYS *laravel-queue-overlap*ScheduledJobManager*
```

Full key format:

```
coolify_development_database_coolify_development_cache_laravel-queue-overlap:App\Jobs\ScheduledJobManager:scheduled-job-manager
```

Check the TTL:

```shell
TTL "<full-key-from-above>"
```

- `-1` = No expiration (STALE LOCK - the bug!)
- `-2` = Key doesn't exist
- Positive number = seconds until expiration (good)
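For immediate manual recovery, without waiting for `app:init`, stale locks could also be cleared directly from the shell. This is a sketch, assuming the `coolify-redis` container name used above:

```shell
# Delete every overlap-lock key that has no expiration (TTL == -1).
docker exec coolify-redis redis-cli --scan --pattern '*laravel-queue-overlap*' |
while read -r key; do
    if [ "$(docker exec coolify-redis redis-cli TTL "$key")" = "-1" ]; then
        docker exec coolify-redis redis-cli DEL "$key"
        echo "Deleted stale lock: $key"
    fi
done
```

`redis-cli --scan` iterates keys incrementally, so this is safe to run against a live instance, unlike a blocking `KEYS` call.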
### Testing the Fix

Created test jobs to demonstrate the fix:

- `TestStaleLockJob.php` - uses the broken pattern (`releaseAfter` only)
- `TestFixedLockJob.php` - uses the fixed pattern (`expireAfter` + `dontRelease`)
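A regression test along these lines could lock in the middleware configuration. This is a Pest-style sketch; the actual `ScheduledJobManagerLockTest` may assert differently, and it assumes the job can be constructed without arguments:

```php
use App\Jobs\ScheduledJobManager;
use Illuminate\Queue\Middleware\WithoutOverlapping;

// Sketch: assert the job declares WithoutOverlapping middleware,
// so a future refactor can't silently drop the overlap protection.
it('protects ScheduledJobManager with WithoutOverlapping', function () {
    $middleware = (new ScheduledJobManager)->middleware();

    expect($middleware)->toHaveCount(1)
        ->and($middleware[0])->toBeInstanceOf(WithoutOverlapping::class);
});
```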
## Impact
This fix will:
- ✅ Immediate recovery: Existing stale locks cleared on upgrade/restart
- ✅ Future prevention: New locks auto-expire, preventing issue recurrence
- ✅ Self-recovery: System can recover from transient issues automatically
- ✅ Zero manual intervention: No need for users to manually clear locks
- ✅ Reliable operations: Backups, tasks, and auto-updates run consistently
## Files Modified

- `app/Jobs/ScheduledJobManager.php`
  - Changed middleware to use `expireAfter(60)->dontRelease()`
- `app/Console/Commands/CleanupRedis.php`
  - Added `--clear-locks` flag
  - Added `cleanupCacheLocks()` method
- `app/Console/Commands/Init.php`
  - Updated to call `cleanup:redis --clear-locks` on startup
- `tests/Unit/ScheduledJobManagerLockTest.php`
  - New unit test to prevent regression
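The startup wiring can be as simple as an `Artisan::call` inside the `app:init` handler. This is illustrative; the actual `Init.php` may invoke it differently:

```php
use Illuminate\Support\Facades\Artisan;

// Inside the app:init command's handle() method (illustrative):
// clear any stale WithoutOverlapping locks left over from a crash
// before the scheduler starts dispatching jobs again.
Artisan::call('cleanup:redis', ['--clear-locks' => true]);
```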
## References

- Laravel Docs: https://laravel.com/docs/12.x/queues#preventing-job-overlaps
- GitHub Issue: https://github.com/coollabsio/coolify/issues/4539
- Related pattern: other Coolify jobs with overlap protection use `expireAfter()->dontRelease()` (see "Why This Works" above)