Finally, after getting way behind current MSM releases on our old prod server, we now have the latest and greatest. We're running 3.28.6, and all our content from our old 3.16 install was pulled into it. The box is a beast - HP DL385, dual 12-core AMD, 48GB RAM, 3 x 146GB array plus 300GB SAS 10k 2.5" drives.
And yet… performance on the backend is nothing short of dreadful! To give an example, a HIPO task updating URLs on a site with around 200 assets takes 20-30 minutes. I tried it on one of our larger sites… and killed the HIPO after 2 hours at just 15%. We've had it "performance tuned", and yet, almost unbelievably, this system seems to run slower than the 4-year-old box (with identical content) it replaced. In case you're wondering whether we have millions of hits a day and 300 backend users… sorry… this box is not even in production, and more often than not the only backend user is me!
At this stage I am at a complete loss to explain this woeful performance, but what's worse is trying to explain how, $22,000 and hours of work later, our publishers will actually find it slower than before.
Here's the CPU utilisation captured while one of the HIPO tasks I described was running this morning. To say the box is barely raising a sweat is quite an understatement.
08:53:31 CPU %user %nice %system %iowait %steal %idle
08:53:36 all 0.00 0.00 0.00 0.00 0.00 100.00
08:53:41 all 0.00 0.00 0.00 0.00 0.00 100.00
08:53:46 all 0.00 0.00 0.01 0.00 0.00 99.99
08:53:51 all 0.36 0.00 0.03 0.00 0.00 99.62
08:53:56 all 0.00 0.00 0.00 0.00 0.00 100.00
08:54:01 all 0.01 0.00 0.00 0.00 0.00 99.99
08:54:06 all 0.00 0.00 0.00 0.00 0.00 100.00
08:54:11 all 0.00 0.00 0.00 0.00 0.00 100.00
08:54:16 all 0.01 0.00 0.02 0.00 0.00 99.98
08:54:21 all 0.00 0.00 0.00 0.00 0.00 100.00
08:54:26 all 0.28 0.00 0.03 0.00 0.00 99.69
08:54:31 all 0.01 0.00 0.00 0.00 0.00 99.99
Would love to hear any ideas on why this might be, because we're pretty much running out of excuses.
This isn't representative of a normal installation of Matrix. There could be any number of issues causing a bottleneck, so rather than us describing each one we have encountered and solved on various platforms, you may need to provide information about your installation (operating system, PHP version, database version, etc.), or contact your Squiz representative.
[quote]
Finally, after getting way behind current MSM releases on our old prod server, we now have the latest and greatest. We're running 3.28.6, and all our content from our old 3.16 install was pulled into it. The box is a beast - HP DL385, dual 12-core AMD, 48GB RAM, 3 x 146GB array plus 300GB SAS 10k 2.5" drives.
And yet… performance on the backend is nothing short of dreadful! To give an example, a HIPO task updating URLs on a site with around 200 assets takes 20-30 minutes. I tried it on one of our larger sites… and killed the HIPO after 2 hours at just 15%. We've had it "performance tuned", and yet, almost unbelievably, this system seems to run slower than the 4-year-old box (with identical content) it replaced.
[/quote]
What sort of tuning has been done? What sort of RAID setup are the drives in? I'm also assuming you're using Postgres for the database - what sort of tuning has been done to it? See http://community.squiz.net/advanced/postgresql-tuning-guide for some general guidelines.
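For what it's worth, on a box with that much RAM the handful of postgresql.conf settings that usually matter most look something like the following - purely illustrative starting points, not official numbers; the guide above covers the reasoning:
shared_buffers = 8GB                  # often ~25% of RAM, kept conservative on 8.x
effective_cache_size = 32GB           # roughly what the OS can be expected to cache
work_mem = 32MB                       # per sort/hash, so keep modest with 100 connections
checkpoint_segments = 32              # the default of 3 forces very frequent checkpoint I/O
max_connections = 100                 # should line up with what Apache can throw at it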
RHEL 5
rpm -qa php*
php-mbstring-5.1.6-27.el5_5.3
php-pdo-5.1.6-27.el5_5.3
php-common-5.1.6-27.el5_5.3
php-cli-5.1.6-27.el5_5.3
php-bcmath-5.1.6-27.el5_5.3
php-pgsql-5.1.6-27.el5_5.3
php-pear-1.4.9-6.el5
php-5.1.6-27.el5_5.3
php-soap-5.1.6-27.el5_5.3
php-devel-5.1.6-27.el5_5.3
php-gd-5.1.6-27.el5_5.3
php-xml-5.1.6-27.el5_5.3
rpm -qa postgres*
postgresql84-libs-8.4.5-1.el5_5.1
postgresql84-8.4.5-1.el5_5.1
postgresql84-server-8.4.5-1.el5_5.1
postgresql-libs-8.1.22-1.el5_5.1
Postgres and CMS partitions are all on the RAID 5 array on the fast disks.
Postgres Dump - 1.3GB of data dumped in just over 20 seconds… without any tuning.
Tuning as per that link has been performed.
It's a supported install.
[quote]
Postgres and CMS partitions are all on the RAID 5 array on the fast disks.
Postgres Dump - 1.3GB of data dumped in just over 20 seconds… without any tuning.
Tuning as per that link has been performed.
It's a supported install.
[/quote]
RAID5 is not very good for writes - every small write costs extra reads plus a parity write. http://en.wikipedia.org/wiki/RAID_5#RAID_5_performance - the first paragraph says it all.
That is abnormal behaviour. A few thoughts off the top of my head.
I would get someone to look at the db, and work your way up the stack.
Check that there are no missing indexes for a start, and that there is enough RAM actually allocated. The max number of connections is also important.
Also check that the kernel can handle the RAM - it should be a 64-bit kernel or have the bigmem patch. If your apps are not seeing the RAM they expect, there will be a lot of disk I/O and this will slow things down - although probably not that much on a largely idle machine…
I would also check that rollback is not on, and that roles are off (this used to be a performance issue, not sure about now though).
Have you enough RAM allocated for Matrix?
Cheers
Richard
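Most of those checks boil down to a few quick commands - a sketch only, where "matrix" is a stand-in for whatever the Matrix database is actually called:
uname -m                              # should report x86_64 for a 64-bit kernel
free -m                               # confirm the OS actually sees all 48GB
# tables being sequentially scanned a lot are candidates for missing indexes
psql matrix -c "SELECT relname, seq_scan, idx_scan FROM pg_stat_user_tables ORDER BY seq_scan DESC LIMIT 20;"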
[quote]
RAID5 is not very good for writes - every small write costs extra reads plus a parity write. http://en.wikipedia.org/wiki/RAID_5#RAID_5_performance - the first paragraph says it all.
[/quote]
Yeah, but not that bad! 
[quote]
That is abnormal behaviour. A few thoughts off the top of my head.
I would get someone to look at the db, and work your way up the stack.
Check that there are no missing indexes for a start, and that there is enough RAM actually allocated. The max number of connections is also important.
Also check that the kernel can handle the RAM - it should be a 64-bit kernel or have the bigmem patch. If your apps are not seeing the RAM they expect, there will be a lot of disk I/O and this will slow things down - although probably not that much on a largely idle machine…
I would also check that rollback is not on, and that roles are off (this used to be a performance issue, not sure about now though).
Have you enough RAM allocated for Matrix?
Cheers
Richard
[/quote]
kernel.shmmax=25189984216
kernel.shmall=3148747899
System has 48GB. Apache is locked down to 100 connections, as is Postgres (as I understand it… so they match). Database was tuned by Squiz support.
Roles and Rollback are on… but they were on on the old box as well.
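For anyone checking the same thing, the two knobs are MaxClients in the prefork section of httpd.conf and max_connections in postgresql.conf - something like this confirms they line up:
grep -i maxclients /etc/httpd/conf/httpd.conf
psql -c "SHOW max_connections;"       # run as the postgres user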
[quote]
kernel.shmmax=25189984216
kernel.shmall=3148747899
System has 48GB. Apache is locked down to 100 connections, as is Postgres (as I understand it… so they match). Database was tuned by Squiz support.
Roles and Rollback are on… but they were on on the old box as well.
[/quote]
Have you tested write performance using dd or something like that?
What are the Postgres settings - shared_buffers, effective_cache_size, work_mem and checkpoint* in particular?
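Those can be read straight off the running cluster rather than dug out of the conf file - a quick sketch, run as the postgres user:
psql -c "SELECT name, setting, unit FROM pg_settings WHERE name IN ('shared_buffers','effective_cache_size','work_mem','checkpoint_segments','checkpoint_completion_target');"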
[quote]
Have you tested write performance using dd or something like that?
What are the Postgres settings - shared_buffers, effective_cache_size, work_mem and checkpoint* in particular?
[/quote]
Hi Chris.
I think dd will need something like 100GB of space (2 x RAM) to do a thorough test… so that would be a stretch. Not sure if the infrastructure guys can run something else similar to measure RAID throughput (suggestions, anyone?).
Postgres settings would be as per Squiz recommendations, i.e. whatever the tuning script sets them to in postgresql.conf.
Just for a test I tried disabling rollback - no improvement. Did it on a (very) small site… it took 3 minutes and 30 seconds to update one URL across 13 pages :o . I have not tried it without roles… but I would think Rollback would be the bigger drag chute because it's essentially "shadowing" every asset that changes.
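One way around the 2 x RAM problem is to stop the page cache from flattering the result, so a much smaller file still gives a meaningful number - a sketch, with the path just a placeholder for somewhere on the RAID5 array (assumes a dd new enough for conv=fdatasync; oflag=direct does a similar job otherwise):
dd if=/dev/zero of=/path/on/raid5/ddtest bs=1M count=4096 conv=fdatasync   # ~4GB sequential write, flushed to disk
rm /path/on/raid5/ddtest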
[quote]
Hi Chris.
I think dd will need something like 100GB of space (2 x RAM) to do a thorough test… so that would be a stretch. Not sure if the infrastructure guys can run something else similar to measure RAID throughput (suggestions, anyone?).
Postgres settings would be as per Squiz recommendations, i.e. whatever the tuning script sets them to in postgresql.conf.
[/quote]
I don't know what the generated settings are - can you send them through (PM them if you don't want them public)?
As for checking RAID, I don't know if the HP utils give that sort of information (the hpacucli utilities come in handy anyway).
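For the record, something along these lines dumps the controller and array state - worth a look at the write-back cache and battery status given the RAID5 write question (assumes hpacucli is installed):
hpacucli ctrl all show detail         # controller, cache and battery status
hpacucli ctrl all show config         # arrays, logical drives and their RAID levels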
I have some questions - apologies if they appear negative.
You don't have internal staff that can debug your server and database performance? Seems a bit strange for an organisation that can afford such expensive hardware to resort to a public forum for help. Where is your expert support?
Have you been led up a blind alley? I have seen this happen quite often.
Hi,
Don't have any answers I'm afraid, but we have a similar situation.
We're also on an SLA - upgraded to 3.28.4 a couple of months ago (same box, etc.) and have had terribly slow backend performance ever since.
We have support tickets in with Squiz support, but haven't managed to find a solution yet.
The slowness is very evident with HIPO jobs when moving, cloning, etc.
Everything just seems to take much longer to process.
So for what it's worth, we haven't changed our server setup and have similar issues to you.
Does anyone have 3.28 running with fast (normal) HIPO processing?
Just a quick check before looking at anything else… what locale was the database cluster initialised with?
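(For anyone else wondering how to check, run as the postgres user:)
psql -c "SHOW lc_collate;"
psql -c "SHOW lc_ctype;"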
OK. So there's been a lot of talk about tuning the hardware and various software and whatnot. What about Matrix itself? Is it only HIPO jobs that take a long time to complete? Do you have LDAP authentication? Is frontend page performance slow too? Looks like your CPU is waiting for the disk to complete a task quite a bit. Does this happen on specific tasks only?
The database is in the C locale as per Squiz's recommendation, and Postgres was tuned by Squiz. Just to clarify… this is an enterprise system under an SLA… not a couple of script kiddies playing out in the rumpus room with dad's old Solaris box ;).
Managed to pull some numbers from dd overnight: writes were around 270MB/sec on the RAID5 SCSI array. Could this be the bottleneck? :blink: The SATA drive that only keeps backups (none of Matrix uses it, but it's in the same box) came in at a lazy 67MB/sec.
I am starting to wonder if we're pushing the scale ceiling vis-a-vis the number of sites and assets. The performance issues we're having are NOT new… the old system was sluggish as well. We had it looked at several times; various tweaks, database vacuums etc. were performed, and they certainly improved things… but not by the order of magnitude we'd need in this instance. "Hardware" was mentioned more than once as an avenue for improvement.
Seriously considering doing my own test build on my comparatively lame 64-bit dual quad-core desktop with 8GB, creating 200,000 assets with 20 system URLs, and then trying some common backend tasks to see if it all comes to a grinding halt. If it doesn't, it could mean there is some "evil" lurking in our database. That would have implications for anyone else attempting to migrate larger Matrix installations to upgraded hardware, as Karl seems to have experienced.
I think you need some diagnostics to try to find out where the issues are occurring. A good start would be to set up some Postgres logging with log_min_duration_statement set to something like 500 or 1000 to capture any really bad DB queries.
If you can afford the disk hit (and I assume with a non-live system you can), you can log all DB queries with this, and feed the log file to pgfouine to get some statistics.
Do you need Global Roles? If not, turn that setting off; we found it made quite a difference.
Do you have lots of cron jobs? One of the nastiest queries we have a problem with is that when a cron job runs, it needs to run sort_order = sort_order - 1 on all children of the cron manager. If you do have 200 cron jobs running and then deleting themselves, you can see what's going to happen!
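The logging setup above comes down to a couple of postgresql.conf lines, after which pgfouine can chew through the result - the values and the log path are illustrative only:
log_min_duration_statement = 500      # in ms; 0 logs every statement
log_line_prefix = '%t [%p]: [%l-1] '  # the stderr prefix pgfouine expects
# after collecting a day or so of logs:
pgfouine.php -file /var/lib/pgsql/data/pg_log/postgresql-Fri.log > report.html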
[quote]
I think you need some diagnostics to try to find out where the issues are occurring. A good start would be to set up some Postgres logging with log_min_duration_statement set to something like 500 or 1000 to capture any really bad DB queries.
If you can afford the disk hit (and I assume with a non-live system you can), you can log all DB queries with this, and feed the log file to pgfouine to get some statistics.
Do you need Global Roles? If not, turn that setting off; we found it made quite a difference.
Do you have lots of cron jobs? One of the nastiest queries we have a problem with is that when a cron job runs, it needs to run sort_order = sort_order - 1 on all children of the cron manager. If you do have 200 cron jobs running and then deleting themselves, you can see what's going to happen!
[/quote]
Hi Peter. We've disabled the Crons for now, and Roles are disabled. It's run-of-the-mill HIPO jobs that are glacial - setting permissions, applying public read, or applying a URL to a site. We started logging queries on Friday and will collect data over the next few days.
We actually went through this process on the old box about 2 years ago, when it also suffered back AND front end performance issues. I recall some of the queries we found were frightening, and the flat table structure used to store the assets meant there were a multitude of complex joins.
Given the data is identical to the box it was pulled from, and the spec of the new box exceeds the old by an order of magnitude, I'm dying to see how bad it is this time.
A few numbers are in from query logging - these are just initial ones that caught our attention from the small sample we've captured so far.
2010-12-07 15:38:46 CST matrix_prod_3285 matrix LOG: execute pdo_pgsql_stmt_375e9618: SELECT ct.treeid as our_treeid, cl.minorid, pt.treeid as parent_treeid, a.assetid, a.name FROM sq_ast_lnk cl
INNER JOIN sq_ast_lnk_tree ct ON cl.linkid = ct.linkid, sq_ast_lnk pl INNER JOIN sq_ast_lnk_tree pt ON pl.linkid = pt.linkid INNER JOIN sq_ast a ON a.assetid = pl.minorid WHERE cl.minorid IN (SELECT l.majorid FROM sq_ast_lnk l WHERE l.minorid = $1) AND ct.treeid LIKE pt.treeid || '%' AND pt.treeid <= ct.treeid AND pt.treeid IN ('0005','0005000O','0005000O0005','0005000O00050004','0005000O00050003') ORDER BY cl.linkid, ct.treeid, pt.treeid
2010-12-07 15:38:46 CST matrix_prod_3285 matrix DETAIL: parameters: $1 = '31397'
2010-12-07 15:38:46 CST matrix_prod_3285 matrix LOG: duration: 172.360 ms
2010-12-07 15:38:45 CST matrix_prod_3285 matrix LOG: execute pdo_pgsql_stmt_375da438: SELECT sq_get_lineage_treeids AS treeid FROM sq_get_lineage_treeids($1, $2)
2010-12-07 15:38:45 CST matrix_prod_3285 matrix DETAIL: parameters: $1 = '31397', $2 = '4'
2010-12-07 15:38:46 CST matrix_prod_3285 matrix LOG: duration: 309.099 ms
These occurred during a HIPO task to update a site URL on a 12 page site, which took around 3 minutes.
Any thoughts?
Can you run EXPLAIN ANALYZE on the first query, substituting the asset ID from the log for $1, and post the results?
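Something like this, with the parameter value taken from the DETAIL line in the log (database name as it appears in the log prefix - adjust if that's not the actual database):
psql matrix_prod_3285 <<'SQL'
EXPLAIN ANALYZE
SELECT ct.treeid AS our_treeid, cl.minorid, pt.treeid AS parent_treeid, a.assetid, a.name
FROM sq_ast_lnk cl
     INNER JOIN sq_ast_lnk_tree ct ON cl.linkid = ct.linkid,
     sq_ast_lnk pl
     INNER JOIN sq_ast_lnk_tree pt ON pl.linkid = pt.linkid
     INNER JOIN sq_ast a ON a.assetid = pl.minorid
WHERE cl.minorid IN (SELECT l.majorid FROM sq_ast_lnk l WHERE l.minorid = '31397')
  AND ct.treeid LIKE pt.treeid || '%'
  AND pt.treeid <= ct.treeid
  AND pt.treeid IN ('0005','0005000O','0005000O0005','0005000O00050004','0005000O00050003')
ORDER BY cl.linkid, ct.treeid, pt.treeid;
SQL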