Strange Amount of Connections

Since Monday, our site has been very slow. At first I thought our network was the cause, but today, after checking netstat, I realized that our established connections are WAY higher than normal. For example:


Established connections:

  • Current 178.34
  • Average 52.46



    Current connections are triple what they are normally.

    [attachment=361:localhos…tat_week.png]





    Number of Connections:
  • Current 161
  • Average 112



    [attachment=362:localhos…ses_week.png]





    Does anyone know what would cause this? Our whole site seems sluggish, and all 90 of our Apache worker processes appear to be running.
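For anyone who wants the same kind of numbers straight from the command line rather than from the graphs, here is a minimal sketch in Python that tallies TCP connections by state. It assumes Linux-style "netstat -ant" output; the function name count_tcp_states and the sample rows are made up for illustration and are not taken from the server in this thread.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections by state from `netstat -ant`-style text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed row layout (Linux net-tools):
        # Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Made-up sample output, for illustration only.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:80             192.0.2.10:51234        ESTABLISHED
tcp        0      0 10.0.0.5:80             192.0.2.11:51300        ESTABLISHED
tcp        0      0 10.0.0.5:80             192.0.2.12:40022        TIME_WAIT
"""

for state, count in count_tcp_states(SAMPLE).most_common():
    print(state, count)
```

On a live box the same function could be fed the captured output of "netstat -ant". A large ESTABLISHED count that never drops back would be consistent with requests that are stuck rather than with a burst of short-lived traffic.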



    Any ideas would be appreciated.

I'm going to ask the obvious: was anything changed on the server?

…and a slightly more ominous question:



Do you know if anything was changed?



It may be a good idea to see if there were any SSH attempts to get into the box, and if the apache config has been changed.



I have heard of boxes being hacked and set up to serve phishing content pages.



The other thing is to check the server logs to see if someone is hijacking any content. Is there an image or images, or perhaps a graphic or JavaScript library, that someone has linked to?



Another possibility is that someone has connected to your RSS feeds (if you have any) or is scraping content off your pages for every request on their own site. I am aware that some people connect their home page code directly to other people's RSS, generating a request on the remote site for every request they get. Very bad form!



cheers,



Richard

Hi Nic,


Does the problem recur if you restart the Apache/PostgreSQL services on your web server?



Also, have a look at the server access logs. At the time of connections going up, what requests were being sent to your site? Were there any web crawlers spidering it, or perhaps lots of asynchronous AJAX requests being sent?
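As a rough illustration of that log check, the sketch below counts requested paths and user agents inside a chosen time window. It assumes the access log uses Apache's combined log format; the helper name summarize, the regular expression, and the sample lines are all made up for the example.

```python
import re
from collections import Counter

# Assumed format: Apache "combined" log lines.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def summarize(lines, time_prefix=""):
    """Return (paths, agents) counters for lines whose timestamp starts with time_prefix."""
    paths, agents = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or not match.group("time").startswith(time_prefix):
            continue
        paths[match.group("path")] += 1
        agents[match.group("agent") or "-"] += 1
    return paths, agents


# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [14/Mar/2011:09:01:02 -0700] "GET /news/rss HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [14/Mar/2011:09:01:03 -0700] "GET /news/rss HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [14/Mar/2011:09:02:10 -0700] "GET /index.html HTTP/1.1" 200 2326 "http://example.org/" "Mozilla/5.0"',
    '198.51.100.7 - - [14/Mar/2011:11:30:00 -0700] "GET /about HTTP/1.1" 200 1200 "-" "Mozilla/5.0"',
]

paths, agents = summarize(SAMPLE_LOG, "14/Mar/2011:09")
print(paths.most_common(3))
print(agents.most_common(3))
```

Narrowing time_prefix to the hour when the connection graph starts climbing should make a crawler, a feed poller, or a burst of AJAX calls stand out near the top of either list.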



We've seen similar problems lately, and have a hunch it could potentially be related to PHP session locking (but we need more information to confirm this).



Cheers,

Dan.

Well, I have some good news. Our site is now speedy again, and the connection numbers are back to normal. :slight_smile:


This morning I realized that I had only been trying to restart Apache. So, I gave it a hard shut-down, and that did the trick. So, that is a great thing.



[attachment=365:localhos…stat_day.png]



But, back to the cause of this, because that is still kind of a mystery to me. No, nothing on the server had changed: no new hardware/software, and no unauthorized SSH access, as we have that very locked down. Squiz is the only IP that can access it from off-campus.



This is my idea as to why this problem was caused, and I am wondering if Squiz could look into it, as I think it might be a possible bug:



This same scenario happened once before, when we were still developing the PUC website. Both times this has happened, I realized what might have caused it. This time around, I had created an XML User Bridge because I didn't know what it was. Then, once it was created, I clicked on the "Test URL" or "Test Link" button, whatever it is called. I had not entered anything, and it just sat there and ran. I didn't want to wait for an error, so I deleted the asset and went on my way. That was Monday.



The previous time I experienced something like this was when I was in Global Preferences and looking at the Comment Preferences. There is the exact same kind of "Auto-Test" button there that is used in the XML User Bridge (I think), and at that time, I clicked this button also. At the time, I didn't know what it did, but it was similar to Monday's problem.



Does anyone know exactly what file gets run when those buttons are clicked? My hunch is that when you click them without a URL entered, the request just sits there and runs forever, never timing out, and so brings the server to a crawl.



These are just my suspicions about the problem, and suggestions are welcome. If a bug report needs to be filed, I can do that as well.



Thanks everyone.

Hey Nic,


When you execute an HTTP request, Matrix accesses session information, which causes the session to be locked until the HTTP request is complete. While this is happening, no other HTTP requests from the same session can run (they are blocked, waiting for access to the session file). This explains why, when you click a button that does something which takes a long time to run, you can't do anything else in your browser for that session until the initial operation has returned.
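The blocking behaviour described above can be mimicked with a small Python sketch. It is only an analogy: in-process locks stand in for the session file lock, threads stand in for HTTP requests, and the session names, request names, and the run_demo helper are all invented for the example. It is not Matrix code.

```python
import threading
import time


def run_demo():
    """Simulate three requests; return their names in the order they finish."""
    # One lock per session, standing in for the lock on the session file.
    locks = {"alice": threading.Lock(), "bob": threading.Lock()}
    finished = []
    holding = threading.Event()

    def request(name, session_id, seconds, signal=None):
        with locks[session_id]:       # blocks while the session is locked
            if signal is not None:
                signal.set()          # tell the main thread the lock is held
            time.sleep(seconds)       # pretend to do the work
            finished.append(name)

    slow = threading.Thread(target=request, args=("alice-slow", "alice", 0.3, holding))
    slow.start()
    holding.wait()                    # make sure the slow request holds the lock

    same_session = threading.Thread(target=request, args=("alice-quick", "alice", 0.0))
    other_session = threading.Thread(target=request, args=("bob-quick", "bob", 0.0))
    same_session.start()
    other_session.start()

    for thread in (slow, same_session, other_session):
        thread.join()
    return finished


print(run_demo())
```

The quick request in the other session finishes first, while the quick request in the same session has to wait for the slow one. If the slow request never returned, every later request in that session would queue up behind it, each one occupying a worker.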



As you've also surmised, my guess is that something in the operations you were performing (i.e. with the XML User Bridge) is broken and caused the request to never finish properly. This request could have stayed open on the server and caused a database connection to stay open, or caused other requests from the same session to stay open.



At the moment we've only got a fairly vague hypothesis to go on, as the condition seems to be rare and difficult to reproduce. However, we are currently doing some testing on database connections and session locking to try and uncover problems with deadlocks that we are seeing on some other systems.



I presume that the problem doesn't occur for you all the time, so your best bet is to probably just keep an eye on it for now. If it persists regularly, definitely let us know - maybe you can help provide some information to assist us with debugging and/or reproducing the issue.



Cheers :slight_smile:

Dan.


Dan,

I really think that it is caused by clicking the "Auto-Test" button which is in the XML User Bridge, and also in the Comments preferences (Global Preferences). Both times this has happened, I backtracked my activity, and it seemed like I clicked the Auto-Test button, without a URL, and that is when the system started to crawl.

I think that /__lib/web/connectivity.php just sits there and tries to connect, and keeps the connection open until Apache is stopped.

I think that in debugging this problem that is the first place to test, IMHO. :)

Funny you should mention the connectivity.php script.


We have previously questioned the potential dangers of this with Squiz, and have been met with indifference.



This script can pose a high risk if the right URL is sent to it. In our tests it has pretty much taken out one of our nodes. Just saying, be wary.

You asked if it was a security risk. I said we did not believe it to be. That is not indifference.



If you have a specific problem with it, please provide some information that we can use to replicate your issue. Talking generally doesn't help us identify any problems.



Nic, we will look into the URL tester with no URL set and make sure it is not sitting there with open connections.


Would you like me to submit a bug report?

In my testing on 3.16, the auto-test feature does attempt to connect to that blank URL, which it really shouldn't do. It times out for me after 60 seconds though, with a message of "Unknown state". The connection does not remain open and returns a 200 OK header, which I believe it always does regardless of the URL given.


Nic, when you try and test it, does it sit there forever or does it time out after 60 seconds as well?


Greg,

Yes, if I click Auto-Test, and wait, it will time out correctly.

The problem occurs when you click Auto-Test and then navigate away, as it seems that the connection DOESN'T time out. Now, I know this would not happen to users very often, but I still think that it is a big problem.

Here is the graph of what happened when I clicked Auto-Test, and then navigated away (you can see the huge spike happening on the right):
[attachment=366:localhos...stat_day.png]

Greg, you're right. We found that this script caused our server to grind to a halt, so we asked whether it could be a security issue or be exploited for spam or as a hack tool. As you said then, “The file has been checked and I don’t see any potential hack.”


The full thread is here http://forums.matrix.squiz.net/index.php?showtopic=5626



Thanks for your help

Greg, let me know how this issue goes, and if you would like me to file a bug report.


Thanks for the help.

I'm working through it with our sysadmins here. I will get back to you once we sort it out. We may need to just add something to stop the blank URLs from being submitted as I'm not sure there are good timeout options in the PEAR packages we use.
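The idea of stopping blank URLs before any connection is attempted can be sketched as follows. This is a Python illustration only, not the Matrix or PEAR code, and it says nothing about what those packages can or cannot do. The names is_testable_url and check_url and the five-second default are assumptions made for the example.

```python
import http.client
import urllib.request
from typing import Optional
from urllib.parse import urlsplit


def is_testable_url(raw: Optional[str]) -> bool:
    """Return True only for a non-blank http(s) URL that names a host."""
    url = (raw or "").strip()
    if not url:
        return False
    try:
        parts = urlsplit(url)
        return parts.scheme in ("http", "https") and bool(parts.hostname)
    except ValueError:  # e.g. malformed IPv6 brackets
        return False


def check_url(raw: Optional[str], timeout_seconds: float = 5.0) -> str:
    """Reject bad input up front; otherwise try the URL with an explicit timeout."""
    if not is_testable_url(raw):
        return "rejected: blank or malformed URL"
    try:
        with urllib.request.urlopen(raw.strip(), timeout=timeout_seconds) as response:
            return "reachable: HTTP %d" % response.status
    except (OSError, http.client.HTTPException) as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return "unreachable: %s" % exc


print(check_url(""))     # no network access is attempted for blank input
print(check_url("   "))
```

Doing the validation first means a blank submission never reaches the network layer at all, so whatever timeout behaviour the underlying HTTP library has no longer matters for that case.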

This is so strange, I stumbled on this post now.


I don't know if this is connected in any way, but this morning I was trying to export submission logs from a form and save them as XML, which I don't usually do. Anyway, it didn't work, and the XML file had errors.



15 minutes later the website went down. Restarting Apache and the Squiz server solved the problem. I didn't give it a second thought until I read this post. I still don't know if this is related.



We are on 3.18.5.


Hmm, this doesn't sound related to me. The problem we had was with the Auto-Test button being clicked without a URL, then navigating away. It has to do with the connectivity.php script or another file it calls…


Just curious if this had been sorted out.

It hasn't been yet, although we have a couple of people working on it. It looks like we'll probably have to just ensure the interface doesn't let you submit blank URLs as the PEAR packages we use don't allow us to control timeouts.


Thanks, sorry for my curiosity. :)