A better way to clean up cache files


(Peter Sheppard) #1

When you have a large number of valid cache entries and you run the cache clean-up script, the time it takes grows roughly quadratically with the number of entries: the script checks every cache file on disk against its list of valid entries using PHP's in_array(), which scans the whole list on every call.
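
In other words, every file on disk triggers a full scan of the valid-entries list. The rewrite below avoids this by flipping the list into array keys and testing membership with isset(), which is a hash lookup. A minimal sketch of the difference (the paths are made up for illustration):

    // slow: in_array() walks the whole list for every file checked
    $valid = array('0421/0421abc123', '0422/0422def456');   // imagine ~50,000 of these
    $is_valid = in_array('0421/0421abc123', $valid);

    // fast: build a lookup table once, then each check is effectively constant time
    $lookup = array();
    foreach ($valid as $path) {
        $lookup[$path] = true;
    }
    $is_valid = isset($lookup['0421/0421abc123']);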


We had some 50,000 valid cache entries but hadn't run the script for a few months, so we had somewhere in the region of 10 GB of cache files. When we tried to run it, it took three weeks! Over that time it also managed to delete numerous valid entries, because its list of "valid" entries had been generated long before all the new entries were created.



I've rewritten the script (see below). The first run took 2 hours to clear up the whole lot, and subsequent runs take about 10 minutes, which is fast enough to run it from a daily cron job.
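
For the cron job, something along these lines works; the script name and paths are placeholders for wherever you save it, and the system root is passed as the first argument (as the code below expects):

    # example only: run the cache clean-up at 2:30am every day
    30 2 * * * php /path/to/scripts/clean_cache.php /path/to/system_root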



The script relies on a little bit of Postgres trickery, so you will need to create a special index for it to work.



Index creation statements:

    CREATE INDEX sq_cache_path
      ON sq_cache
      USING btree
      (path);
    
    CREATE INDEX sq_cache_dir
      ON sq_cache
      USING btree
      (substring(path, 1, 4));


For any readers who've not seen the latter syntax before, it tells Postgres to index the result of evaluating that expression for each row. Any WHERE clause that filters on the same expression (as the script below does for each bucket) can then use the index, which makes those lookups practically instant.
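
If you want to confirm the index is actually being used before letting the script loose, an EXPLAIN on a query that filters on the same expression (the bucket value below is just an example) should, on a reasonably large sq_cache table, show an index or bitmap scan on sq_cache_dir rather than a sequential scan:

    EXPLAIN SELECT path FROM sq_cache WHERE substring(path, 1, 4) = '0421';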

The code...

    <?php
    error_reporting(E_ALL);
    if (php_sapi_name() != 'cli') {
        trigger_error("You can only run this script from the command line\n", E_USER_ERROR);
    }

    $SYSTEM_ROOT = (isset($_SERVER['argv'][1])) ? $_SERVER['argv'][1] : '';
    if (empty($SYSTEM_ROOT) || !is_dir($SYSTEM_ROOT)) {
        echo "ERROR: You need to supply the path to the System Root as the first argument\n";
        exit();
    }

    require_once $SYSTEM_ROOT.'/core/include/init.inc';

    echo "\nWarning: Please make sure you have the correct permissions to remove cache files.\n";
    echo 'SQ_CACHE_PATH is \''.SQ_CACHE_PATH."'\n\n";

    // ask for the root password for the system
    echo 'Enter the root password for "'.SQ_CONF_SYSTEM_NAME.'": ';
    $root_password = rtrim(fgets(STDIN, 4094));

    // check that the correct root password was entered
    $root_user = & $GLOBALS['SQ_SYSTEM']->am->getSystemAsset('root_user');
    if (!$root_user->comparePassword($root_password)) {
        echo "ERROR: The root password entered was incorrect\n";
        exit();
    }

    // log in as root
    if (!$GLOBALS['SQ_SYSTEM']->setCurrentUser($root_user)) {
        trigger_error("Failed logging in as root user\n", E_USER_ERROR);
    }

    // length of the cache path prefix (plus the trailing slash) to strip from found files
    $cache_path_len = strlen(SQ_CACHE_PATH) + 1;

    // firstly, clear all expired entries from the sq_cache table
    $GLOBALS['SQ_SYSTEM']->changeDatabaseConnection('dbcache');
    $GLOBALS['SQ_SYSTEM']->doTransaction('BEGIN');
    $db =& $GLOBALS['SQ_SYSTEM']->db;

    $str = 'Clearing expired entries from sq_cache table';
    printf('%s%'.(60 - strlen($str)).'s', $str, '');

    $sql = 'DELETE FROM sq_cache WHERE expires < NOW()';
    $result = $db->query($sql);
    assert_valid_db_result($result);
    $GLOBALS['SQ_SYSTEM']->doTransaction('COMMIT');
    $GLOBALS['SQ_SYSTEM']->restoreDatabaseConnection();

    printStatus('OK');

    $str = 'Finding cache buckets';
    printf('%s%'.(60 - strlen($str)).'s', $str, '');

    // get all the top-level cache bucket directories
    exec('find '.SQ_CACHE_PATH." -maxdepth 1 -type d -name '[0-9]*' | sort", $current_dirs);
    printStatus('OK');

    $count = 0;
    $total = 0;

    // loop through one bucket directory at a time, to keep memory usage down
    foreach ($current_dirs as $dir) {
        $bucket = substr($dir, -4);

        $str = "Getting valid entries from sq_cache table for bucket $bucket";
        printf('%s%'.(60 - strlen($str)).'s', $str, '');

        // get the valid (unexpired) entries for this bucket from the database
        $GLOBALS['SQ_SYSTEM']->changeDatabaseConnection('dbcache');
        $db =& $GLOBALS['SQ_SYSTEM']->db;

        $sql = 'SELECT substring(path, 6) AS filename
                  FROM sq_cache
                 WHERE expires > NOW()
                   AND substring(path, 1, 4) = '.$db->quote($bucket);

        $result = $db->getCol($sql);
        assert_valid_db_result($result);
        $GLOBALS['SQ_SYSTEM']->restoreDatabaseConnection();

        // convert the result into an associative array so each lookup is a
        // constant-time isset() instead of an in_array() scan
        $valid_files = Array();
        foreach ($result as $valid_file) {
            $valid_files[$bucket.'/'.$valid_file] = true;
        }

        printStatus('OK');

        $current_files = Array();

        $str = "\tFinding files in bucket $bucket";
        printf('%s%'.(50 - strlen($str)).'s', $str, '');

        // list every cache file currently sitting in this bucket
        exec("find $dir -type f -name '[a-z0-9]*' | sort", $current_files);
        printStatus('OK');

        // remove each file that has no corresponding entry in the sq_cache table
        foreach ($current_files as $file) {
            $file_name = substr($file, $cache_path_len);
            if (!isset($valid_files[$file_name])) {
                $total++;
                printFileName($file_name);
                $status = @unlink(SQ_CACHE_PATH.'/'.$file_name);
                $ok = ($status) ? 'OK' : 'FAILED';
                printStatus($ok);
                if ($status) $count++;
            }
        }

    }

    echo "\nSummary: $count/$total cache file(s) removed.\n";
    if ($count != $total) {
        $problematic = $total - $count;
        trigger_error("$problematic file(s) could not be removed, please check file permissions.", E_USER_WARNING);
    }


    /**
    * Prints the path of the cache file about to be removed
    *
    * @param string $file_name  the name of the cache file
    *
    * @return void
    * @access public
    */
    function printFileName($file_name)
    {
        $str = "\tRemoving ".$file_name;
        printf('%s%'.(50 - strlen($str)).'s', $str, '');

    }//end printFileName()


    /**
    * Prints the status of the current step
    *
    * @param string $status the status to print
    *
    * @return void
    * @access public
    */
    function printStatus($status)
    {
        echo "[ $status ]\n";

    }//end printStatus()

(Rhulse) #2

Hi Peter,



Thanks for the really interesting post. We had similar issues in the past, and our cache grows very quickly due to the amount of new content posted each day and the rate of content churn.



Our method is slightly simpler.



Disclaimer for casual readers: DO NOT TRY THIS AT HOME. This approach has been fine-tuned for our system and our loading profile; discuss it with your sysadmin first to determine whether it is suitable for yours.



We have two shell scripts.



The first clears entries from sq_cache where expires < now(), the same as you do.



The second uses a recursive find limited to one level at a time. The -d switch is used in the find loop and anything older than 1 day is removed.
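
(Richard's actual scripts aren't posted, so the following is only a rough sketch of that shape using GNU find, looping over the top-level bucket directories and deleting files more than a day old; the cache path is a placeholder.)

    # sketch only, not the actual script: clear files older than 24 hours,
    # one bucket directory at a time (-mtime +0 matches files > 24 hours old)
    for dir in /path/to/cache/[0-9]*; do
        find "$dir" -maxdepth 1 -type f -mtime +0 -delete
    done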



Both run overnight.



Our design criteria were speed and server load. Any stale files that get missed on one run because they are not yet a day old are zapped on the next run.



cheers,

Richard