Cron Manager deadlocks

Twice in the last 24 hours our site has been hit by a Cron Manager deadlock. We are running Matrix 3.18.5 GPL on a patched CentOS box with PostgreSQL 8.1.11 and PHP 5.1.6. On both occasions I was emailed the following error message:
[codebox]Fatal error: Uncaught exception 'Exception' with message 'Unable to update attribute values for
asset "Cron Manager" (#16) due to database error: SQLSTATE[57P01]: Admin shutdown: 7 FATAL:
terminating connection due to administrator command
CONTEXT: PL/pgSQL function "sq_set_rollback_timestamp" line 14 at if
SQL statement "SELECT sq_set_rollback_timestamp(now()::TIMESTAMP)"
PL/pgSQL function "sq_set_rollback_timestamp" line 3 at perform
SQL statement "SELECT sq_set_rollback_timestamp()"
PL/pgSQL function "sq_get_rollback_timestamp" line 5 at perform
SQL statement "UPDATE sq_rb_ast_attr_val SET custom_val = $1 WHERE sq_eff_from =
sq_get_rollback_timestamp() AND assetid = $2 AND attrid = $3 "
PL/pgSQL function "sq_ast_attr_val_rb_trg_fn" line 1 at SQL statement
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.' in /var/www/vhosts/xxx/mysource_matrix/core/include/asset.inc:2917
Stack trace:
#0 /var/www/vhosts/xxx in /var/www/vhosts/xxx/mysource_matrix/core/include/asset.inc on line 2917[/codebox]

In error.log, apparently just after that:

[codebox][16-Dec-2008 02:54:13] PHP Fatal error: Uncaught exception 'Exception' with message 'Could not create database connection: DBError!:SQLSTATE[08006] [7] could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/tmp/.s.PGSQL.5432"?' in /var/www/vhosts/xxx/mysource_matrix/core/include/mysource.inc:2288
Stack trace:
#0 /var/www/vhosts/xxx/mysource_matrix/core/include/mysource.inc(211): MySource->changeDatabaseConnection('db')
#1 /var/www/vhosts/xxx/mysource_matrix/core/include/init.inc(231): MySource->init()
#2 /var/www/vhosts/xxx/mysource_matrix/core/web/index.php(28): require_once('/var/www/vhosts…')
#3 {main}
thrown in /var/www/vhosts/xxx/mysource_matrix/core/include/mysource.inc on line 2288
[/codebox]

The site continues to run fine, and it is not under heavy (or unusual) load. Clearing the deadlock via the Cron Manager's Running Status page resolves it.
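
For what it's worth, when the second error shows up I can quickly confirm whether the postmaster is actually down - a rough check, with the socket path taken from the error message above:
[codebox]# is the Unix socket PHP is trying to reach still there?
ls -l /tmp/.s.PGSQL.5432
# is the postmaster process running at all?
ps aux | grep postmaster | grep -v grep[/codebox]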

I'd very much appreciate any thoughts on what to look for, please.



thanks, martin

Someone has kindly pointed me at the PostgreSQL logs, which contain a worrying number of events like these:
[codebox]ERROR: duplicate key violates unique constraint "cache_pk"
ERROR: duplicate key violates unique constraint "cache_pk"
ERROR: current transaction is aborted, commands ignored until end of transaction block
ERROR: current transaction is aborted, commands ignored until end of transaction block
ERROR: prepared statement "pdo_pgsql_stmt_b989abb4" already exists
ERROR: duplicate key violates unique constraint "cache_pk"
ERROR: current transaction is aborted, commands ignored until end of transaction block
ERROR: duplicate key violates unique constraint "cache_pk"
ERROR: current transaction is aborted, commands ignored until end of transaction block
ERROR: prepared statement "pdo_pgsql_stmt_b9616d4c" already exists
[/codebox]

Indeed, the db server restarted around the time of the Cron Manager episode.



Any thoughts?



thanks, martin

Strange indeed.


Do you only have the one system / are you doing any kind of database replication? In the PostgreSQL logs, do the "prepared statement already exists" errors occur only around "cache_pk" unique key violation errors? Do the errors in the PostgreSQL logs occur constantly?

Hi Dan,


No replication, and it's the only site on the server. We hover around 500 pages a day, so nothing heavy.



Based on the last week of PostgreSQL logs, we're getting 5-10 duplicate key violations per day, and about half that number of prepared statement errors.
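
(I counted those roughly like this - a sketch; the directory is a guess at the default CentOS data directory, and the file glob assumes daily-rotated log files:)
[codebox]cd /var/lib/pgsql/data/pg_log
# per-file (i.e. per-day) tallies of the two kinds of error
grep -c 'duplicate key violates unique constraint "cache_pk"' postgresql-*.log
grep -c 'prepared statement .* already exists' postgresql-*.log[/codebox]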



Unfortunately the logs were not timestamped. I've just reconfigured that and will reboot shortly.



thanks, martin


Reboot? :blink:

You should just be able to reload PostgreSQL to re-read basic configuration file changes.

You can also use

[codebox]log_min_error_statement = error[/codebox]

in your postgresql.conf to capture the statement that is causing the errors.



We've got quite a lot of cache_pk duplication errors - they're all to do with nested content in the templates of search pages.



If you're on Red Hat, you can update the PostgreSQL config without losing any connections with "service postgresql reload" - not sure what the equivalent command is on non-Red Hat systems though.
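
Roughly something like this, if it helps - the conf path is assumed for a stock CentOS/RHEL install, and the log_line_prefix value is just one that gets you timestamps:
[codebox]# in /var/lib/pgsql/data/postgresql.conf (path assumed):
#   log_min_error_statement = error    # log the statement that raised each error
#   log_line_prefix = '%m - %c: '      # %m = timestamp (with ms), %c = session id
# then pick the changes up without dropping connections:
service postgresql reload
# (non-Red Hat equivalent would be something like: pg_ctl reload -D /var/lib/pgsql/data, run as the postgres user)[/codebox]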

Dan: needed to reboot anyway :slight_smile:


Peter: service also works on CentOS (Red Hat in disguise), thanks. I've changed log_min_error_statement as suggested.



I do use some search pages. We'll see what transpires in the logs.



cheers, martin

Having enabled timestamps etc., I've caught the following log messages:
[codebox]2008-12-16 16:35:20.749 GMT - : LOG: received fast shutdown request
2008-12-16 16:35:20.749 GMT - : LOG: aborting any active transactions
2008-12-16 16:35:20.749 GMT - 4947d7f9.15d9: FATAL: terminating connection due to administrator command
2008-12-16 16:35:20.749 GMT - 4947d78e.66dd: FATAL: terminating connection due to administrator command
2008-12-16 16:35:20.749 GMT - 4947d797.6d3c: FATAL: terminating connection due to administrator command
2008-12-16 16:35:20.749 GMT - 4947d797.6d3c: CONTEXT: SQL statement "SELECT * FROM pg_tables WHERE tablename = 'sq_rollback_timestamp' AND schemaname = (current_schemas(true))[1]"
PL/pgSQL function "sq_set_rollback_timestamp" line 9 at perform
SQL statement "SELECT sq_set_rollback_timestamp(now()::TIMESTAMP)"
PL/pgSQL function "sq_set_rollback_timestamp" line 3 at perform
SQL statement "SELECT sq_set_rollback_timestamp()"
PL/pgSQL function "sq_get_rollback_timestamp" line 5 at perform
SQL statement "UPDATE sq_rb_ast_attr_val SET custom_val = $1 WHERE sq_eff_from = sq_get_rollback_timestamp() AND assetid = $2 AND attrid = $3 "
PL/pgSQL function "sq_ast_attr_val_rb_trg_fn" line 1 at SQL statement
2008-12-16 16:35:20.749 GMT - 4947d797.6d3c: STATEMENT: UPDATE sq_ast_attr_val SET custom_val = $1 WHERE attrid = $2 AND assetid = $3
2008-12-16 16:35:20.751 GMT - 4947d789.65fd: FATAL: terminating connection due to administrator command
2008-12-16 16:35:20.971 GMT - : LOG: shutting down
2008-12-16 16:35:24.286 GMT - : LOG: database system is shut down
2008-12-16 16:35:24.473 GMT - : LOG: logger shutting down
2008-12-16 16:37:03.999 GMT - : LOG: database system was shut down at 2008-12-16 16:35:24 GMT
2008-12-16 16:37:03.999 GMT - : LOG: checkpoint record is at 6/F55B83EC
2008-12-16 16:37:03.999 GMT - : LOG: redo record is at 6/F55B83EC; undo record is at 0/0; shutdown TRUE
2008-12-16 16:37:03.999 GMT - : LOG: next transaction ID: 79731193; next OID: 126832
2008-12-16 16:37:03.999 GMT - : LOG: next MultiXactId: 1; next MultiXactOffset: 0
2008-12-16 16:37:04.169 GMT - : LOG: database system is ready
2008-12-16 16:37:04.211 GMT - : LOG: transaction ID wrap limit is 1075065437, limited by database "dolmatrixdev1"
[/codebox]

Help? Next step?



thanks, martin

Postgres really doesn't like something. I'm not sure what though!


I take it there's nothing in /var/log/messages or any other system log files?



Hopefully a postgres expert will pop along with an answer!!

What version of PostgreSQL are you running?


According to the source code, PostgreSQL only does a “fast shutdown” when it receives a SIGINT.
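
If something on the box really is sending a SIGINT, it might be worth ruling out the usual schedulers first - a rough sketch, with paths assumed for a stock CentOS install:
[codebox]# anything scheduled that stops, restarts or reloads PostgreSQL?
grep -ri postgres /etc/cron* /var/spool/cron /etc/logrotate.d 2>/dev/null[/codebox]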

[codebox]# psql --version
psql (PostgreSQL) 8.1.11

[/codebox]

I haven't the faintest idea what could be causing a SIGINT; suggestions on what to look at are very welcome.



postgresql.conf is off the shelf apart from the logging changes done today.



Nothing suspicious in system log files.



Anything extra I could log?



Thanks, martin

If it helps, here are all the active lines in postgresql.conf:
[codebox]max_connections = 100
shared_buffers = 1000                   # min 16 or max_connections*2, 8KB each
effective_cache_size = 2000             # typically 8KB each
redirect_stderr = on                    # Enable capturing of stderr into log
log_directory = 'pg_log'                # Directory where log files are written
log_filename = 'postgresql-%Y%m%d.log'  # Log file name pattern.
log_truncate_on_rotation = on           # If on, any existing log file of the same
log_rotation_age = 1440                 # Automatic rotation of logfiles will
log_rotation_size = 0                   # Automatic rotation of logfiles will
log_min_error_statement = error         # Values in order of increasing severity:
log_line_prefix = '%m - %c: '           # Special values:
lc_messages = 'C'                       # locale for system error message
lc_monetary = 'C'                       # locale for monetary formatting
lc_numeric = 'C'                        # locale for number formatting
lc_time = 'C'                           # locale for time formatting
[/codebox]



cheers, martin

Is there anything in the logs before the part you posted? Are you doing anything in particular inside Matrix that seems to trigger the error? Something strange is causing that SIGINT - has the cron deadlock gone (and is cron running OK)?

There was nothing else in the logs before the part I posted. I haven't had another occurrence since I increased the logging; I'll post if I do. There is no obvious trigger or pattern.


What I did get last night was:

[codebox]2008-12-17 07:25:09.654 GMT - 4948a94f.764e: ERROR: duplicate key violates unique constraint "cache_pk"
2008-12-17 07:25:09.654 GMT - 4948a94f.764e: STATEMENT: INSERT INTO sq_cache(cache_key, perm_key, url, assetid, expires, path)
    VALUES (($1), ($2), ($3), ($4), (TO_TIMESTAMP($5, 'YYYY-MM-DD HH24:MI:SS')), ($6))
2008-12-17 07:25:09.775 GMT - 4948a94f.764e: ERROR: current transaction is aborted, commands ignored until end of transaction block
2008-12-17 07:25:09.775 GMT - 4948a94f.764e: STATEMENT: DEALLOCATE pdo_pgsql_stmt_ba3175a4
2008-12-17 07:25:10.009 GMT - 4948a94f.764e: ERROR: duplicate key violates unique constraint "cache_pk"
2008-12-17 07:25:10.009 GMT - 4948a94f.764e: STATEMENT: INSERT INTO sq_cache(cache_key, perm_key, url, assetid, expires, path)
    VALUES (($1), ($2), ($3), ($4), (TO_TIMESTAMP($5, 'YYYY-MM-DD HH24:MI:SS')), ($6))
2008-12-17 07:25:10.009 GMT - 4948a94f.764e: ERROR: current transaction is aborted, commands ignored until end of transaction block
2008-12-17 07:25:10.009 GMT - 4948a94f.764e: STATEMENT: DEALLOCATE pdo_pgsql_stmt_ba22b97c
2008-12-17 07:25:10.286 GMT - 4948a94f.764e: ERROR: prepared statement "pdo_pgsql_stmt_ba22b97c" already exists
2008-12-17 07:25:10.286 GMT - 4948a94f.764e: STATEMENT: SELECT
    sq_cache.expires, sq_cache.path
    FROM
    sq_cache
    WHERE
    (
    sq_cache.cache_key = ($1)
    AND sq_cache.perm_key = ($2)
    AND sq_cache.url = ($3)
    )[/codebox]



cheers, martin

If anyone has some insights they could share, it would be much appreciated.


thanks, martin

Latest from Cron Daemon:
[codebox]Fatal error: Uncaught exception 'Exception' with message 'Unable to update asset "Cron Manager"
(#16) due to database error: SQLSTATE[57P01]: Admin shutdown: 7 FATAL: terminating connection due
to administrator command
CONTEXT: PL/pgSQL function "sq_set_rollback_timestamp" line 14 at if
SQL statement "SELECT sq_set_rollback_timestamp(now()::TIMESTAMP)"
PL/pgSQL function "sq_set_rollback_timestamp" line 3 at perform
SQL statement "SELECT sq_set_rollback_timestamp()"
PL/pgSQL function "sq_get_rollback_timestamp" line 5 at perform
SQL statement "UPDATE sq_rb_ast SET type_code = $1 ,version = $2 ,name = $3 ,short_name = $4
,status = $5 ,languages = $6 ,charset = $7 ,force_secure = $8 ,created = $9 ,created_userid =
$10 ,updated = $11 ,updated_userid = $12 ,published = $13 ,published_userid = $14
,status_changed = $15 ,status_changed_userid = $16 WHERE sq_eff_from =
sq_get_rollback_timestamp() AND assetid = $17 "
PL/pgSQL function "sq_ast_rb_trg_fn" line 1 at SQL statement
server closed the connection une in /var/www/vhosts/deddington.org.uk/mysource_matrix/core/include/asset.inc on line 1232[/codebox]



And the following in the PG log:

[codebox]2008-12-22 17:02:58.372 GMT - : LOG: received fast shutdown request
2008-12-22 17:02:58.372 GMT - : LOG: aborting any active transactions
2008-12-22 17:02:58.373 GMT - 494fc791.2ebc: FATAL: terminating connection due to administrator command
2008-12-22 17:02:58.373 GMT - 494fc792.2f23: FATAL: terminating connection due to administrator command
2008-12-22 17:02:58.373 GMT - 494fc792.2f23: CONTEXT: PL/pgSQL function "sq_set_rollback_timestamp" line 14 at if
SQL statement "SELECT sq_set_rollback_timestamp(now()::TIMESTAMP)"
PL/pgSQL function "sq_set_rollback_timestamp" line 3 at perform
SQL statement "SELECT sq_set_rollback_timestamp()"
PL/pgSQL function "sq_get_rollback_timestamp" line 5 at perform
SQL statement "UPDATE sq_rb_ast SET type_code = $1 ,version = $2 ,name = $3 ,short_name = $4 ,status = $5 ,languages = $6 ,charset = $7 ,force_secure = $8 ,created = $9 ,created_userid = $10 ,updated = $11 ,updated_userid = $12 ,published = $13 ,published_userid = $14 ,status_changed = $15 ,status_changed_userid = $16 WHERE sq_eff_from = sq_get_rollback_timestamp() AND assetid = $17 "
PL/pgSQL function "sq_ast_rb_trg_fn" line 1 at SQL statement
2008-12-22 17:02:58.373 GMT - 494fc792.2f23: STATEMENT: UPDATE
    sq_ast
    SET name = $1,short_name = $2,
    languages = $3,
    charset = $4,
    force_secure = $5,
    updated = $6,
    updated_userid = $7
    WHERE
    assetid = $8
2008-12-22 17:02:58.374 GMT - 494fc792.2f11: FATAL: terminating connection due to administrator command
2008-12-22 17:02:58.374 GMT - 494fc7b8.3f31: FATAL: terminating connection due to administrator command
2008-12-22 17:02:58.405 GMT - : LOG: shutting down
2008-12-22 17:02:58.630 GMT - : LOG: database system is shut down
2008-12-22 17:02:58.648 GMT - : LOG: logger shutting down
2008-12-22 17:03:09.158 GMT - : LOG: database system was shut down at 2008-12-22 17:02:58 GMT
[/codebox]

Any pointers or suggestions would be most welcome; I have no idea what to look for at this point.



thanks, martin

Can you attach/paste the results of:

[codebox]cat /proc/meminfo
free
ps auxf[/codebox]



Thanks

As requested by Dan:
[codebox][root@vps1 pg_log]# cat /proc/meminfo
MemTotal:       163840 kB
MemFree:         77900 kB
Buffers:             0 kB
Cached:              0 kB
SwapCached:          0 kB
Active:              0 kB
Inactive:            0 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       163840 kB
LowFree:         77900 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:               0 kB
Writeback:           0 kB
AnonPages:           0 kB
Mapped:          85940 kB
Slab:                0 kB
PageTables:          0 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:         0 kB
Committed_AS:   236912 kB
VmallocTotal:        0 kB
VmallocUsed:         0 kB
VmallocChunk:        0 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
[root@vps1 pg_log]# free
             total       used       free     shared    buffers     cached
Mem:        163840      81100      82740          0          0          0
-/+ buffers/cache:      81100      82740
Swap:            0          0          0
[root@vps1 pg_log]# ps auxf
USER       PID %CPU %MEM    VSZ   RSS TTY   STAT START TIME COMMAND
root         1  0.0  0.3   2060   648 ?     Ss   Dec19 0:00 init [3]
root     23809  0.0  0.3   2148   524 ?     S<s  Dec19 0:00 /sbin/udevd -d
root      9552  0.0  0.3   1716   568 ?     Ss   Dec19 0:00 syslogd -m 0
root     13428  0.0  0.2   1672   404 ?     Ss   Dec19 0:00 klogd -x
root     13465  0.0  0.6   6988  1052 ?     Ss   Dec19 0:00 /usr/sbin/sshd
root     19591  0.0  1.6   9832  2784 ?     Ss   17:19 0:00  \_ sshd: root@
root     20148  0.0  0.8   2580  1392 pts/0 Ss   17:20 0:00      \_ -bash
root     17719  0.0  0.4   2160   776 pts/0 R+   17:52 0:00          \_ ps
root     13532  0.0  0.5   2716   864 ?     Ss   Dec19 0:00 xinetd -stayali
root     13553  0.0  0.3   5076   512 ?     Ss   Dec19 0:00 /usr/sbin/vsftp
root     13659  0.0  0.6   2448  1136 ?     S    Dec19 0:00 /bin/sh /usr/bi
mysql    13804  0.0 13.0 140836 21360 ?     Sl   Dec19 0:34  \_ /usr/libexe
root     23781  0.0  1.1   8956  1904 ?     Ss   Dec19 0:01 sendmail: accep
smmsp    24412  0.0  0.9   8040  1476 ?     Ss   Dec19 0:00 sendmail: Queue
root     30382  0.0  6.6  45164 10864 ?     Ss   Dec19 0:01 /usr/sbin/httpd
apache   30153  0.0 16.1  52020 26492 ?     S    12:46 0:05  \_ /usr/sbin/h
apache   20259  0.0 16.0  52032 26248 ?     S    14:45 0:02  \_ /usr/sbin/h
apache   30231  0.0 15.6  51752 25648 ?     S    15:16 0:02  \_ /usr/sbin/h
apache    9667  0.0 16.8  53776 27604 ?     S    15:20 0:01  \_ /usr/sbin/h
apache   15530  0.0 16.2  52288 26652 ?     S    15:22 0:02  \_ /usr/sbin/h
root     32550  0.0  0.6   3212  1104 ?     Ss   Dec19 0:00 crond
root     32749  0.0  0.4   5480   692 ?     Ss   Dec19 0:00 /usr/sbin/sasla
root      3127  0.0  0.2   5480   436 ?     S    Dec19 0:00  \_ /usr/sbin/s
postgres 28439  0.0  1.9  18876  3244 ?     S    17:03 0:00 /usr/bin/postma
postgres 29797  0.0  0.4   8656   788 ?     S    17:03 0:00  \_ postgres: l
postgres 29801  0.0  1.8  19008  3056 ?     S    17:03 0:00  \_ postgres: w
postgres 29802  0.0  0.4   9656   796 ?     S    17:03 0:00  \_ postgres: s
postgres 29803  0.0  0.6   8884  1016 ?     S    17:03 0:00  \_ postgre[/codebox]



thanks, martin

This server doesn't have much memory, and it has no swap at all - Committed_AS (236912 kB) is already well above MemTotal (163840 kB). I suspect something (maybe the kernel OOM killer) is killing PostgreSQL when the system runs out of memory.
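
A couple of quick checks to confirm that theory - a sketch; the beancounters file will only exist if this is an OpenVZ/Virtuozzo container, which the zeroed /proc/meminfo fields and the "vps1" hostname suggest it is:
[codebox]# any kernel out-of-memory activity?
dmesg | grep -i -E 'oom|out of memory'
grep -i -E 'oom|out of memory' /var/log/messages* 2>/dev/null
# on an OpenVZ/Virtuozzo VPS, non-zero failcnt values here mean a resource limit has been hit
cat /proc/user_beancounters[/codebox]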

Dan,


Thanks for that, much appreciated. I'll get on to the hosting people right away.



cheers, martin