coolify/STALE_LOCK_FIX.md
Andras Bacsai c6a2d1fe0a Fix stale lock issue causing scheduled tasks to stop (#4539)
## Problem
Scheduled tasks, backups, and auto-updates stopped working after 1-2 months
with error: MaxAttemptsExceededException: App\Jobs\ScheduledJobManager has
been attempted too many times.

Root cause: ScheduledJobManager used WithoutOverlapping with only
releaseAfter(60), causing locks without expiration (TTL=-1) that persisted
indefinitely when jobs hung or processes crashed.

## Solution

### Part 1: Prevention (Future Locks)
- Added expireAfter(60) to ScheduledJobManager middleware
- Lock now auto-expires after 60 seconds (matches everyMinute schedule)
- Changed from releaseAfter(60) to expireAfter(60)->dontRelease()
- Follows Laravel best practices and matches other Coolify jobs

### Part 2: Recovery (Existing Locks)
- Enhanced cleanup:redis command with --clear-locks flag
- Scans Redis for stale locks (TTL=-1) and removes them
- Called automatically during app:init on startup/upgrade
- Provides immediate recovery for affected instances

## Changes
- app/Jobs/ScheduledJobManager.php: Added expireAfter(60)->dontRelease()
- app/Console/Commands/CleanupRedis.php: Added cleanupCacheLocks() method
- app/Console/Commands/Init.php: Auto-clear locks on startup
- tests/Unit/ScheduledJobManagerLockTest.php: Test to prevent regression
- STALE_LOCK_FIX.md: Complete documentation

## Testing
- Unit tests pass (2 tests, 8 assertions)
- Code formatted with Pint
- Matches pattern used by CleanupInstanceStuffsJob

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-23 10:07:33 +02:00


Fix for Stale Lock Issue in ScheduledJobManager

Issue

GitHub Issue: #4539 - Scheduled tasks not executing on schedule

Symptoms

  • Scheduled tasks stop executing after working for weeks/months
  • Backups don't run
  • Auto-updates don't work
  • Error in Horizon: Illuminate\Queue\MaxAttemptsExceededException: App\Jobs\ScheduledJobManager has been attempted too many times
  • Running horizon:clear, cleanup:redis, or schedule:clear-cache does not fix the problem

Root Cause

The ScheduledJobManager was using WithoutOverlapping middleware with only releaseAfter(60):

(new WithoutOverlapping('scheduled-job-manager'))
    ->releaseAfter(60)

Problems with this approach:

  1. No automatic lock expiration: Without expireAfter(), locks persist indefinitely if:

    • Process hangs or becomes unresponsive
    • Job takes longer than expected
    • Unexpected termination occurs
  2. Retry loop with releaseAfter():

    • Job acquires lock
    • Job gets stuck/hangs, or its process dies without releasing the lock
    • The next run can't acquire the lock (still held by the hung process)
    • That run is released back to the queue and retried after 60s
    • Repeats until MaxAttemptsExceededException
  3. Against Laravel best practices: Laravel docs explicitly recommend using expireAfter() to prevent stale locks
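
The retry loop in item 2 can be sketched with a small plain-PHP model. This is not Laravel code: `simulateBlockedJob()`, its parameters, and the retry limit of 3 are hypothetical, and the sketch assumes that every run released back to the queue counts as one attempt.

```php
<?php
declare(strict_types=1);

/**
 * Toy model of a queue worker retrying a job whose overlap lock is held.
 * Hypothetical helper for illustration only.
 *
 * @param int|null $lockFreeAt second at which the lock frees up, null = never
 * @return array{attempts: int, seconds: int, outcome: string}
 */
function simulateBlockedJob(?int $lockFreeAt, int $maxTries, int $releaseAfterSeconds): array
{
    $attempts = 0;
    $clock = 0;

    while ($attempts < $maxTries) {
        $attempts++;

        // A lock without a TTL never frees up on its own.
        $lockAcquired = $lockFreeAt !== null && $clock >= $lockFreeAt;
        if ($lockAcquired) {
            return ['attempts' => $attempts, 'seconds' => $clock, 'outcome' => 'ran'];
        }

        // releaseAfter(): the job goes back on the queue and waits.
        $clock += $releaseAfterSeconds;
    }

    return ['attempts' => $attempts, 'seconds' => $clock, 'outcome' => 'max attempts exceeded'];
}

// Stale lock (never freed), assumed retry limit of 3, releaseAfter(60).
$result = simulateBlockedJob(null, 3, 60);
echo $result['outcome'], ' after ', $result['attempts'], " attempts\n";
// prints "max attempts exceeded after 3 attempts"
```

However high the retry limit is set, the outcome is the same while the lock never frees up; only the time until failure changes.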

Solution

This fix has two parts:

Part 1: Prevention (Fix Future Locks)

Changed the middleware to match the pattern used by other Coolify jobs:

// File: app/Jobs/ScheduledJobManager.php
(new WithoutOverlapping('scheduled-job-manager'))
    ->expireAfter(60)   // Lock expires after 1 minute (matches job frequency)
    ->dontRelease()     // Don't re-queue on lock conflict

Part 2: Recovery (Clear Existing Stale Locks)

Enhanced cleanup:redis command to clear existing stale locks:

# File: app/Console/Commands/CleanupRedis.php
# Added --clear-locks flag
php artisan cleanup:redis --clear-locks

What it does:

  • Scans Redis for laravel-queue-overlap keys (WithoutOverlapping locks)
  • Checks TTL of each lock
  • Deletes locks with TTL = -1 (no expiration = stale!)
  • Skips active locks that have proper expiration
  • Called automatically during app:init (on Coolify startup/update)
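
The selection rule described above can be sketched as a pure function over key/TTL pairs. This is an illustration, not the actual `cleanupCacheLocks()` implementation: `selectStaleLocks()` is a hypothetical helper, and the sample keys and TTL values are made up.

```php
<?php
declare(strict_types=1);

/**
 * Picks the lock keys that should be deleted, given each key's Redis TTL.
 * Hypothetical helper for illustration only.
 *
 * @param array<string, int> $ttlByKey Redis key => result of the TTL command
 * @return list<string>
 */
function selectStaleLocks(array $ttlByKey): array
{
    $stale = [];

    foreach ($ttlByKey as $key => $ttl) {
        // Only WithoutOverlapping locks are candidates.
        if (!str_contains((string) $key, 'laravel-queue-overlap')) {
            continue;
        }

        // TTL -1: the key exists but has no expiration, so nothing frees it.
        if ($ttl === -1) {
            $stale[] = (string) $key;
        }
        // TTL > 0 is an active, self-expiring lock; TTL -2 means the key is gone.
    }

    return $stale;
}

// Made-up sample data: one stale lock, one healthy lock, one unrelated key.
$ttls = [
    'cache_laravel-queue-overlap:App\Jobs\ScheduledJobManager:scheduled-job-manager' => -1,
    'cache_laravel-queue-overlap:App\Jobs\ServerCheckJob:server-1' => 42,
    'cache_some-other-key' => -1,
];

print_r(selectStaleLocks($ttls));
```

Only the first key is selected: the second lock still has a positive TTL, and the third key is not an overlap lock even though it has no expiration.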

Why This Works

Auto-expiring locks: Lock automatically expires after 60 seconds, even if:

  • Process crashes
  • Job hangs
  • Network issues occur
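
A toy in-memory lock store makes the difference concrete. This is a sketch under stated assumptions, not Laravel's cache lock: `ToyLockStore` and its `acquire()` signature are hypothetical, time is passed in explicitly, and a TTL of 0 stands for "no expiration".

```php
<?php
declare(strict_types=1);

// Toy lock store that only imitates "set if absent, optionally with a TTL".
final class ToyLockStore
{
    /** @var array<string, int|null> lock key => expiry second, null = never expires */
    private array $locks = [];

    public function acquire(string $key, int $now, int $ttlSeconds): bool
    {
        // Drop the lock if it had an expiry and that moment has passed.
        if (array_key_exists($key, $this->locks)
            && $this->locks[$key] !== null
            && $this->locks[$key] <= $now) {
            unset($this->locks[$key]);
        }

        if (array_key_exists($key, $this->locks)) {
            return false; // still held
        }

        // ttlSeconds = 0 means "no expiration", like a Redis TTL of -1.
        $this->locks[$key] = $ttlSeconds > 0 ? $now + $ttlSeconds : null;

        return true;
    }
}

$store = new ToyLockStore();

// A job takes each lock at t=0 and then "crashes" without releasing it.
$store->acquire('no-ttl', 0, 0);
$store->acquire('ttl-60', 0, 60);

// The next scheduler tick arrives at t=60.
$noTtlAtNextRun = $store->acquire('no-ttl', 60, 0);
$ttlAtNextRun = $store->acquire('ttl-60', 60, 60);

var_dump($noTtlAtNextRun); // bool(false): the stale lock blocks the run
var_dump($ttlAtNextRun);   // bool(true): the lock expired, the run proceeds
```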

No retry storms: dontRelease() drops a job that cannot acquire the lock instead of re-queuing it repeatedly

Consistent pattern: Matches other Coolify jobs like:

  • DockerCleanupJob: expireAfter(600)->dontRelease()
  • ServerCheckJob: expireAfter(60)->dontRelease()
  • RestartProxyJob: expireAfter(60)->dontRelease()

Laravel recommended: Follows official Laravel documentation for preventing stale locks

Why 60 Seconds?

  • Job runs every minute (everyMinute() schedule), so the expiry matches the job frequency (1:1 ratio)
  • Matches CleanupInstanceStuffsJob pattern (also runs frequently with 60s expiry)
  • Allows the next cycle to run if the current job hangs
  • Short enough to prevent long-held locks

Testing

Manual Lock Key Inspection

To check for locks in Redis:

docker exec -it coolify-redis redis-cli
SELECT 0
KEYS *laravel-queue-overlap*ScheduledJobManager*

Full key format:

coolify_development_database_coolify_development_cache_laravel-queue-overlap:App\Jobs\ScheduledJobManager:scheduled-job-manager

Check TTL:

TTL "<full-key-from-above>"
  • -1 = No expiration (STALE LOCK - the bug!)
  • -2 = Key doesn't exist
  • Positive number = Seconds until expiration (GOOD!)

Testing the Fix

Two test jobs were created to demonstrate the fix:

  • TestStaleLockJob.php - Uses broken pattern (releaseAfter only)
  • TestFixedLockJob.php - Uses fixed pattern (expireAfter + dontRelease)

Impact

This fix provides:

  • Immediate recovery: Existing stale locks cleared on upgrade/restart
  • Future prevention: New locks auto-expire, preventing issue recurrence
  • Self-recovery: System can recover from transient issues automatically
  • Zero manual intervention: No need for users to manually clear locks
  • Reliable operations: Backups, tasks, and auto-updates run consistently

Files Modified

  1. app/Jobs/ScheduledJobManager.php

    • Changed middleware to use expireAfter(60)->dontRelease()
  2. app/Console/Commands/CleanupRedis.php

    • Added --clear-locks flag
    • Added cleanupCacheLocks() method
  3. app/Console/Commands/Init.php

    • Updated to call cleanup:redis --clear-locks on startup
  4. tests/Unit/ScheduledJobManagerLockTest.php

    • New unit test to prevent regression

References