Twice in the last 24 hours our site has been hit by a Cron Manager deadlock. We are running 3.18.5 GPL on a patched CentOS box, with PostgreSQL 8.1.11 and PHP 5.1.6. On both occasions I was emailed the following error message:
[codebox]Fatal error: Uncaught exception 'Exception' with message 'Unable to update attribute values for
asset "Cron Manager" (#16) due to database error: SQLSTATE[57P01]: Admin shutdown: 7 FATAL:
terminating connection due to administrator command
CONTEXT: PL/pgSQL function "sq_set_rollback_timestamp" line 14 at if
SQL statement "SELECT sq_set_rollback_timestamp(now()::TIMESTAMP)"
PL/pgSQL function "sq_set_rollback_timestamp" line 3 at perform
SQL statement "SELECT sq_set_rollback_timestamp()"
PL/pgSQL function "sq_get_rollback_timestamp" line 5 at perform
SQL statement "UPDATE sq_rb_ast_attr_val SET custom_val = $1 WHERE sq_eff_from =
sq_get_rollback_timestamp() AND assetid = $2 AND attrid = $3 "
PL/pgSQL function "sq_ast_attr_val_rb_trg_fn" line 1 at SQL statement
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.' in /var/www/vhosts/xxx/mysource_matrix/core/include/asset.inc:2917
Stack trace:
#0 /var/www/vhosts/xxx in /var/www/vhosts/xxx/mysource_matrix/core/include/asset.inc on line 2917[/codebox]
In error.log, apparently just after that:
[codebox]
[16-Dec-2008 02:54:13] PHP Fatal error: Uncaught exception 'Exception' with message 'Could not create database connection: DBError!:SQLSTATE[08006] [7] could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/tmp/.s.PGSQL.5432"?' in /var/www/vhosts/xxx/mysource_matrix/core/include/mysource.inc:2288
Stack trace:
#0 /var/www/vhosts/xxx/mysource_matrix/core/include/mysource.inc(211): MySource->changeDatabaseConnection('db')
#1 /var/www/vhosts/xxx/mysource_matrix/core/include/init.inc(231): MySource->init()
#2 /var/www/vhosts/xxx/mysource_matrix/core/web/index.php(28): require_once('/var/www/vhosts…')
#3 {main}
thrown in /var/www/vhosts/xxx/mysource_matrix/core/include/mysource.inc on line 2288
[/codebox]
The site continues to run fine. It is not under heavy (or unusual) load. Clearing the deadlock via Cron Manager's Running Status page fixes it.
I'd very much appreciate any thoughts on what to look for, please.
thanks, martin
Someone has kindly pointed me at the PostgreSQL logs, which contain a worrying number of events like the following:
[codebox]ERROR: duplicate key violates unique constraint "cache_pk"
ERROR: duplicate key violates unique constraint "cache_pk"
ERROR: current transaction is aborted, commands ignored until end of transaction block
ERROR: current transaction is aborted, commands ignored until end of transaction block
ERROR: prepared statement "pdo_pgsql_stmt_b989abb4" already exists
ERROR: duplicate key violates unique constraint "cache_pk"
ERROR: current transaction is aborted, commands ignored until end of transaction block
ERROR: duplicate key violates unique constraint "cache_pk"
ERROR: current transaction is aborted, commands ignored until end of transaction block
ERROR: prepared statement "pdo_pgsql_stmt_b9616d4c" already exists
[/codebox]
Indeed, the db server restarted around the time of the Cron Manager episode.
Any thoughts?
thanks, martin
Strange indeed.
Do you only have the one system / are you doing any kind of database replication? In the PostgreSQL logs, do the "prepared statement already exists" errors occur only around "cache_pk" unique key violation errors? Do the errors in the PostgreSQL logs occur constantly?
Hi Dan,
No replication, and it's the only site on the server. We hover around 500 pages a day, so nothing heavy.
Based on the last week of PostgreSQL logs, we're getting 5-10 duplicate key violations per day, and about half that number of prepared statement errors.
Unfortunately the logs have not been timestamped. I've just reconfigured for that and will reboot shortly.
thanks, martin
Reboot? :blink:
You should just be able to reload PostgreSQL to re-read basic configuration file changes.
You can also use
log_min_error_statement = error
in your postgresql.conf to capture the statement that is causing the errors.
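For example (just a sketch, values to taste), the logging lines in postgresql.conf might end up looking something like this - %m in log_line_prefix adds a millisecond timestamp so entries are easier to line up with the Matrix errors, and both settings take effect on a reload:
[codebox]# postgresql.conf - logging tweaks (illustrative values)
log_min_error_statement = error   # log the statement that raised any ERROR or worse
log_line_prefix = '%m - %c: '     # %m = timestamp with milliseconds, %c = session id
[/codebox]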
We've got quite a lot of cache_pk duplication errors - they're all to do with nested content in the templates of search pages.
If you're on Red Hat, you can reload the PostgreSQL config without losing any connections with "service postgresql reload" - not sure what the equivalent is on non-Red Hat systems though.
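Something along these lines should do the same job elsewhere, though I've only tried the Red Hat route myself - a rough sketch, and the data directory path will vary by install:
[codebox]# Red Hat / CentOS init script:
service postgresql reload

# Generic equivalents (run as the postgres user) - a reload just re-reads
# postgresql.conf without dropping any connections:
pg_ctl reload -D /var/lib/pgsql/data
# or signal the postmaster directly (first line of postmaster.pid is its PID):
kill -HUP $(head -1 /var/lib/pgsql/data/postmaster.pid)
[/codebox]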
Dan: needed to reboot anyway
Peter: service also works on CentOS (Red Hat in disguise), thanks. I've changed log_min_error_statement as suggested.
I do use some search pages. We'll see what transpires in the logs.
cheers, martin
Having enabled timestamps etc., I caught the following log messages:
[codebox]2008-12-16 16:35:20.749 GMT - : LOG: received fast shutdown request
2008-12-16 16:35:20.749 GMT - : LOG: aborting any active transactions
2008-12-16 16:35:20.749 GMT - 4947d7f9.15d9: FATAL: terminating connection due to administrator command
2008-12-16 16:35:20.749 GMT - 4947d78e.66dd: FATAL: terminating connection due to administrator command
2008-12-16 16:35:20.749 GMT - 4947d797.6d3c: FATAL: terminating connection due to administrator command
2008-12-16 16:35:20.749 GMT - 4947d797.6d3c: CONTEXT: SQL statement "SELECT * FROM pg_tables WHERE tablename = 'sq_rollback_timestamp' AND schemaname = (current_schemas(true))[1]"
PL/pgSQL function "sq_set_rollback_timestamp" line 9 at perform
SQL statement "SELECT sq_set_rollback_timestamp(now()::TIMESTAMP)"
PL/pgSQL function "sq_set_rollback_timestamp" line 3 at perform
SQL statement "SELECT sq_set_rollback_timestamp()"
PL/pgSQL function "sq_get_rollback_timestamp" line 5 at perform
SQL statement "UPDATE sq_rb_ast_attr_val SET custom_val = $1 WHERE sq_eff_from = sq_get_rollback_timestamp() AND assetid = $2 AND attrid = $3 "
PL/pgSQL function "sq_ast_attr_val_rb_trg_fn" line 1 at SQL statement
2008-12-16 16:35:20.749 GMT - 4947d797.6d3c: STATEMENT: UPDATE sq_ast_attr_val SET custom_val = $1 WHERE attrid = $2 AND assetid = $3
2008-12-16 16:35:20.751 GMT - 4947d789.65fd: FATAL: terminating connection due to administrator command
2008-12-16 16:35:20.971 GMT - : LOG: shutting down
2008-12-16 16:35:24.286 GMT - : LOG: database system is shut down
2008-12-16 16:35:24.473 GMT - : LOG: logger shutting down
2008-12-16 16:37:03.999 GMT - : LOG: database system was shut down at 2008-12-16 16:35:24 GMT
2008-12-16 16:37:03.999 GMT - : LOG: checkpoint record is at 6/F55B83EC
2008-12-16 16:37:03.999 GMT - : LOG: redo record is at 6/F55B83EC; undo record is at 0/0; shutdown TRUE
2008-12-16 16:37:03.999 GMT - : LOG: next transaction ID: 79731193; next OID: 126832
2008-12-16 16:37:03.999 GMT - : LOG: next MultiXactId: 1; next MultiXactOffset: 0
2008-12-16 16:37:04.169 GMT - : LOG: database system is ready
2008-12-16 16:37:04.211 GMT - : LOG: transaction ID wrap limit is 1075065437, limited by database "dolmatrixdev1"
[/codebox]
Help? Next step?
thanks, martin
Postgres really doesn't like something. I'm not sure what though!
I take it there's nothing in /var/log/messages or any other system log files?
Hopefully a postgres expert will pop along with an answer!!
What version of PostgreSQL are you running?
According to the source code, PostgreSQL only does a “fast shutdown” when it receives a SIGINT.
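To put that in context, the pg_ctl shutdown modes map onto signals roughly like this (a sketch, assuming $PGDATA points at your data directory):
[codebox]# pg_ctl shutdown modes and the signals they send to the postmaster:
#   smart     -> SIGTERM  (wait for clients to disconnect)
#   fast      -> SIGINT   (abort transactions, disconnect clients)  <-- what your log shows
#   immediate -> SIGQUIT  (no clean shutdown)
pg_ctl stop -m fast -D "$PGDATA"

# The same "received fast shutdown request" log line appears if anything
# sends SIGINT to the postmaster directly:
kill -INT $(head -1 "$PGDATA/postmaster.pid")
[/codebox]
In other words, something on that box is deliberately signalling the postmaster; PostgreSQL isn't just crashing on its own.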
[codebox]# psql --version
psql (PostgreSQL) 8.1.11
[/codebox]
I haven't the faintest idea what could be causing a SIGINT; suggestions on what to look at are very welcome.
postgresql.conf is off the shelf apart from the logging changes done today.
Nothing suspicious in system log files.
Anything extra I could log?
Thanks, martin
If it helps, here are all the active lines in postgresql.conf:
[codebox]max_connections = 100
shared_buffers = 1000 # min 16 or max_connections*2, 8KB each
effective_cache_size = 2000 # typically 8KB each
redirect_stderr = on # Enable capturing of stderr into log
log_directory = 'pg_log' # Directory where log files are written
log_filename = 'postgresql-%Y%m%d.log' # Log file name pattern.
log_truncate_on_rotation = on # If on, any existing log file of the same
log_rotation_age = 1440 # Automatic rotation of logfiles will
log_rotation_size = 0 # Automatic rotation of logfiles will
log_min_error_statement = error # Values in order of increasing severity:
log_line_prefix = '%m - %c: ' # Special values:
lc_messages = 'C' # locale for system error message
lc_monetary = 'C' # locale for monetary formatting
lc_numeric = 'C' # locale for number formatting
lc_time = 'C' # locale for time formatting
[/codebox]
cheers, martin
Is there anything in the logs before the part you posted? Are you doing anything in particular inside Matrix that seems to trigger the error? Something strange is causing that SIGINT - is the cron deadlock gone (and is cron running OK)?
There was nothing else in the logs before the part I posted. I haven't had another occurrence since I increased the logging; I'll post if I do. There is no obvious trigger or pattern.
What I did get last night was:
[codebox]
2008-12-17 07:25:09.654 GMT - 4948a94f.764e: ERROR: duplicate key violates unique constraint "cache_pk"
2008-12-17 07:25:09.654 GMT - 4948a94f.764e: STATEMENT: INSERT INTO sq_cache(cache_key, perm_key, url, assetid, expires, path)
VALUES (($1), ($2), ($3), ($4), (TO_TIMESTAMP($5, 'YYYY-MM-DD HH24:MI:SS')), ($6))
2008-12-17 07:25:09.775 GMT - 4948a94f.764e: ERROR: current transaction is aborted, commands ignored until end of transaction block
2008-12-17 07:25:09.775 GMT - 4948a94f.764e: STATEMENT: DEALLOCATE pdo_pgsql_stmt_ba3175a4
2008-12-17 07:25:10.009 GMT - 4948a94f.764e: ERROR: duplicate key violates unique constraint "cache_pk"
2008-12-17 07:25:10.009 GMT - 4948a94f.764e: STATEMENT: INSERT INTO sq_cache(cache_key, perm_key, url, assetid, expires, path)
VALUES (($1), ($2), ($3), ($4), (TO_TIMESTAMP($5, 'YYYY-MM-DD HH24:MI:SS')), ($6))
2008-12-17 07:25:10.009 GMT - 4948a94f.764e: ERROR: current transaction is aborted, commands ignored until end of transaction block
2008-12-17 07:25:10.009 GMT - 4948a94f.764e: STATEMENT: DEALLOCATE pdo_pgsql_stmt_ba22b97c
2008-12-17 07:25:10.286 GMT - 4948a94f.764e: ERROR: prepared statement "pdo_pgsql_stmt_ba22b97c" already exists
2008-12-17 07:25:10.286 GMT - 4948a94f.764e: STATEMENT: SELECT
sq_cache.expires, sq_cache.path
FROM
sq_cache
WHERE
(
sq_cache.cache_key = ($1)
AND sq_cache.perm_key = ($2)
AND sq_cache.url = ($3)
)[/codebox]
cheers, martin
If anyone has any insights they could share, it would be much appreciated.
thanks, martin
Latest from Cron Daemon:
[codebox]Fatal error: Uncaught exception 'Exception' with message 'Unable to update asset "Cron Manager"
(#16) due to database error: SQLSTATE[57P01]: Admin shutdown: 7 FATAL: terminating connection due
to administrator command
CONTEXT: PL/pgSQL function "sq_set_rollback_timestamp" line 14 at if
SQL statement "SELECT sq_set_rollback_timestamp(now()::TIMESTAMP)"
PL/pgSQL function "sq_set_rollback_timestamp" line 3 at perform
SQL statement "SELECT sq_set_rollback_timestamp()"
PL/pgSQL function "sq_get_rollback_timestamp" line 5 at perform
SQL statement "UPDATE sq_rb_ast SET type_code = $1 ,version = $2 ,name = $3 ,short_name = $4
,status = $5 ,languages = $6 ,charset = $7 ,force_secure = $8 ,created = $9 ,created_userid =
$10 ,updated = $11 ,updated_userid = $12 ,published = $13 ,published_userid = $14
,status_changed = $15 ,status_changed_userid = $16 WHERE sq_eff_from =
sq_get_rollback_timestamp() AND assetid = $17 "
PL/pgSQL function "sq_ast_rb_trg_fn" line 1 at SQL statement
server closed the connection une in /var/www/vhosts/deddington.org.uk/mysource_matrix/core/include/asset.inc on line 1232[/codebox]
And following in PG log:
[codebox]2008-12-22 17:02:58.372 GMT - : LOG: received fast shutdown request
2008-12-22 17:02:58.372 GMT - : LOG: aborting any active transactions
2008-12-22 17:02:58.373 GMT - 494fc791.2ebc: FATAL: terminating connection due to administrator command
2008-12-22 17:02:58.373 GMT - 494fc792.2f23: FATAL: terminating connection due to administrator command
2008-12-22 17:02:58.373 GMT - 494fc792.2f23: CONTEXT: PL/pgSQL function "sq_set_rollback_timestamp" line 14 at if
SQL statement "SELECT sq_set_rollback_timestamp(now()::TIMESTAMP)"
PL/pgSQL function "sq_set_rollback_timestamp" line 3 at perform
SQL statement "SELECT sq_set_rollback_timestamp()"
PL/pgSQL function "sq_get_rollback_timestamp" line 5 at perform
SQL statement "UPDATE sq_rb_ast SET type_code = $1 ,version = $2 ,name
= $3 ,short_name = $4 ,status = $5 ,languages = $6 ,charset = $7 ,force_se
cure = $8 ,created = $9 ,created_userid = $10 ,updated = $11 ,updated_userid
= $12 ,published = $13 ,published_userid = $14 ,status_changed = $15 ,statu
s_changed_userid = $16 WHERE sq_eff_from = sq_get_rollback_timestamp() AND ass
etid = $17 "
PL/pgSQL function "sq_ast_rb_trg_fn" line 1 at SQL statement
2008-12-22 17:02:58.373 GMT - 494fc792.2f23: STATEMENT: UPDATE
sq_ast
SET name = $1,short_name = $2,
languages
= $3,
charset
= $4,
force_secure = $5,
updated
= $6,
updated_userid = $7
WHERE
assetid = $8
2008-12-22 17:02:58.374 GMT - 494fc792.2f11: FATAL: terminating connection due to administrator command
2008-12-22 17:02:58.374 GMT - 494fc7b8.3f31: FATAL: terminating connection due to administrator command
2008-12-22 17:02:58.405 GMT - : LOG: shutting down
2008-12-22 17:02:58.630 GMT - : LOG: database system is shut down
2008-12-22 17:02:58.648 GMT - : LOG: logger shutting down
2008-12-22 17:03:09.158 GMT - : LOG: database system was shut down at 2008-12-22 17:02:58 GMT
[/codebox]
Any pointers or suggestions would be most welcome; I have no idea what to look for at this point.
thanks, martin
Can you attach/paste the results of:
cat /proc/meminfo
free
ps auxf
Thanks
As requested by Dan:
[codebox][root@vps1 pg_log]# cat /proc/meminfo
MemTotal: 163840 kB
MemFree: 77900 kB
Buffers: 0 kB
Cached: 0 kB
SwapCached: 0 kB
Active: 0 kB
Inactive: 0 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 163840 kB
LowFree: 77900 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 0 kB
Mapped: 85940 kB
Slab: 0 kB
PageTables: 0 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 0 kB
Committed_AS: 236912 kB
VmallocTotal: 0 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
[root@vps1 pg_log]# free
total used free shared buffers cached
Mem: 163840 81100 82740 0 0 0
-/+ buffers/cache: 81100 82740
Swap: 0 0 0
[root@vps1 pg_log]# ps auxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.3 2060 648 ? Ss Dec19 0:00 init [3]
root 23809 0.0 0.3 2148 524 ? S<s Dec19 0:00 /sbin/udevd -d
root 9552 0.0 0.3 1716 568 ? Ss Dec19 0:00 syslogd -m 0
root 13428 0.0 0.2 1672 404 ? Ss Dec19 0:00 klogd -x
root 13465 0.0 0.6 6988 1052 ? Ss Dec19 0:00 /usr/sbin/sshd
root 19591 0.0 1.6 9832 2784 ? Ss 17:19 0:00 \_ sshd: root@
root 20148 0.0 0.8 2580 1392 pts/0 Ss 17:20 0:00 \_ -bash
root 17719 0.0 0.4 2160 776 pts/0 R+ 17:52 0:00 \_ ps
root 13532 0.0 0.5 2716 864 ? Ss Dec19 0:00 xinetd -stayali
root 13553 0.0 0.3 5076 512 ? Ss Dec19 0:00 /usr/sbin/vsftp
root 13659 0.0 0.6 2448 1136 ? S Dec19 0:00 /bin/sh /usr/bi
mysql 13804 0.0 13.0 140836 21360 ? Sl Dec19 0:34 \_ /usr/libexe
root 23781 0.0 1.1 8956 1904 ? Ss Dec19 0:01 sendmail: accep
smmsp 24412 0.0 0.9 8040 1476 ? Ss Dec19 0:00 sendmail: Queue
root 30382 0.0 6.6 45164 10864 ? Ss Dec19 0:01 /usr/sbin/httpd
apache 30153 0.0 16.1 52020 26492 ? S 12:46 0:05 \_ /usr/sbin/h
apache 20259 0.0 16.0 52032 26248 ? S 14:45 0:02 \_ /usr/sbin/h
apache 30231 0.0 15.6 51752 25648 ? S 15:16 0:02 \_ /usr/sbin/h
apache 9667 0.0 16.8 53776 27604 ? S 15:20 0:01 \_ /usr/sbin/h
apache 15530 0.0 16.2 52288 26652 ? S 15:22 0:02 \_ /usr/sbin/h
root 32550 0.0 0.6 3212 1104 ? Ss Dec19 0:00 crond
root 32749 0.0 0.4 5480 692 ? Ss Dec19 0:00 /usr/sbin/sasla
root 3127 0.0 0.2 5480 436 ? S Dec19 0:00 \_ /usr/sbin/s
postgres 28439 0.0 1.9 18876 3244 ? S 17:03 0:00 /usr/bin/postma
postgres 29797 0.0 0.4 8656 788 ? S 17:03 0:00 \_ postgres: l
postgres 29801 0.0 1.8 19008 3056 ? S 17:03 0:00 \_ postgres: w
postgres 29802 0.0 0.4 9656 796 ? S 17:03 0:00 \_ postgres: s
postgres 29803 0.0 0.6 8884 1016 ? S 17:03 0:00 \_ postgre[/codebox]
thanks, martin
This server doesn't have much memory, and it doesn't have any swap! I suspect something (maybe kernel OOM Killer) is killing PostgreSQL when the system is running out of memory.
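A few quick checks that might confirm it (a rough sketch - the beancounters file only exists on OpenVZ/Virtuozzo-style VPSes, which the 160MB-no-swap layout suggests this is):
[codebox]# Any trace of the kernel OOM killer?
dmesg | grep -i -E 'oom|out of memory|killed process'
grep -i -E 'oom|out of memory' /var/log/messages

# On an OpenVZ/Virtuozzo container, non-zero failcnt values mean the
# container has been hitting its memory limits:
cat /proc/user_beancounters
[/codebox]
If those counters are climbing, asking the host for more memory (or some swap) would be the place to start.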
Dan,
Thanks for that, much appreciated. I'll get on to the hosting people right away.
cheers, martin