POTENTIAL CRON MANAGER DEAD LOCK DETECTED cleared itself


#1

We found the following message today:

POTENTIAL CRON MANAGER DEAD LOCK DETECTED

There are a couple of posts with the same topic but they were from 2007, 2008 and 2009 so I thought it might be out-dated.

The message on admin interface suggests that a root user needs to login and clear the lock. But before we got a chance to raise this we found that the lock cleared itself approximately 21 hours since the last successful run.

Is it expectable? Cheers.


(John gill) #2

A quick perusal of core/assets/system/cron/cron_manager/cron_manager.inc indicates that there’s nothing that will un-deadlock cron manager like that

> [root@matrixdemo matrix]# grep -n run_check core/assets/system/cron/cron_manager/cron_manager.inc
> 535:                    if ((int) $this->attr('run_check') >= (int) $this->attr('warn_after_num_run_checks')) {
> 544:                                                            'num_attempts'          => $this->attr('warn_after_num_run_checks'),
> 567:                    }//end if run_check
> 572:                            $this->setAttrValue('run_check', (int) $this->attr('run_check') + 1);
> 607:                    // now get the lock, set the running flag and reset the run_check
> 608:                    if ($GLOBALS['SQ_SYSTEM']->am->acquireLock($this->id, 'attributes', 0, TRUE) && $this->setAttrValue('running', TRUE) && $this->setAttrValue('run_check', 0) && $this->saveAttributes()) {
> 741:                                            $this->setAttrValue('run_check', 0);

Lines 741 and 608 set run_check back to 0, but they only happen if it isn’t already deadlocked.

I suppose that on a long enough timeline, the increment on line 572 might eventually overflow run_check. How long is PHP_INT_MAX * 15 minutes?


#3

Thanks @JohnGill. I don’t know what happens between line 608 and 741, but why set the run_check to 0 if it already is.


(Marcus Fong) #4

[quote=“qlddev, post:3, topic:10626, full:true”]
Thanks @JohnGill. I don’t know what happens between line 608 and 741, but why set the run_check to 0 if it already is.[/quote]

The process works like this:

  • Matrix cron run #1 starts. It executes line 608 at the beginning of the run, setting run_check to 0 and running to 1. It then starts processing cron jobs one by one.
  • If cron run #1 has taken more than 15 minutes, cron run #2 will start. It will check running to see whether there is a cron run already in progress. Since there is, it will increment run_check to 1, and then exit.
  • If cron run #1 has taken more than 30 minutes, cron run #3 will start. It will check running to see whether there is a cron run already in progress, discover that there is, and increment run_check to 2 before exiting.
  • If cron run #1 has taken more than 45 minutes, cron run #4 will start. It will check running to see whether there is a cron run already in progress, discover that there is, and increment run_check to 3 before exiting.

At this point, if the Matrix cron manager is using the default settings, run_check will have reached the Warn After Blocked Runs threshold of 3 (described in https://matrix.squiz.net/manuals/system-management/chapters/scheduled-jobs-manager#Options-Screen). This causes a cron deadlock warning e-mail to be sent, alerting administrator users to the fact that there may be a problem with the Matrix crons.

  • If cron run #1 then ends up completing successfully (i.e. it didn’t die or hang, it was just taking a very long time), it will execute line 741 and reset run_check and running to 0. At this point the cron deadlock e-mails will stop and the deadlock will appear to have resolved itself.

That said, it would normally be unusual for a cron run to take 21 hours. You can try to identify the long-running job by going through your Matrix logs and looking for lines that say “the value of attribute current_job for asset Scheduled Jobs Manager has been changed from X to Y”, matching the times when the cron deadlock began and ended.

Alternatively, if you have a service agreement with Squiz you can send in a ticket and ask our support team to investigate.


#5

Thanks @mfong Really useful insight on what happens behind the scene. I will check the log.