Mystery system overload message from DirectAdmin

cnm · Nov 1, 2010

Got this message on October 19:

Warning: The system load average is 11.53 Today at 04:08

This is an automated message notifying you that the 1 minute load average on your system is 11.53.
This has exceeded the 10 threshold.

One Minute - 11.53
Five Minutes - 3.29
Fifteen Minutes - 1.17

Got this message on October 31:

Subject: Warning: The system load average is 15.63 Today at 9:01

This is an automated message notifying you that the 1 minute load average on your system is 15.63.
This has exceeded the 10 threshold.

One Minute - 15.63
Five Minutes - 4.3
Fifteen Minutes - 1.57

Both said

================================
Automated Message Generated by DirectAdmin

I know of nothing that could have caused these - was asleep at the time they happened and the first I knew of high load was when I found the messages in my email. My logwatch shows nothing unusual. No one reported any difficulties with my site.

Is there a way to get DirectAdmin to include more detail in these rather useless messages?

scsi · Nov 1, 2010

I just disabled the feature and don't think it's useful at all.

zEitEr · Nov 1, 2010

cnm said:
Is there a way to get DirectAdmin to include more detail in these rather useless messages?

Is your DirectAdmin copy outdated? The current version includes the output of top.

cnm · Nov 1, 2010

Version info:

Server Version 1.36.0
Current Available Version 1.362000
Last Updated Mon Jul 5 15:44:04 2010

I told DA to update itself now.

SCSI, Disabling the feature sounds good to me, but where do you do that? I don't find any setting. I'll look again after the update.

Edit: Maybe I don't want to disable it (but I'd still like to know how to do it). Now that I've updated, and it says it will show 30 lines of top - maybe there's a slight possibility of understanding what it's talking about. I doubt it, though. My server is dual-core so the max 'real' load is 2. The higher load numbers are derived from the number of processes waiting, as I understand it. Processes could be waiting for any number of reasons - database locked, httpd stopped, interrupts disabled, whatever. I don't think top can elucidate any of that. I believe it would be more helpful of DA to show Server Status.
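
To put the reported figures in perspective, the load average can be divided by the number of cores. A minimal sketch, assuming a Linux system: `load_per_core` is a hypothetical helper name (not part of DirectAdmin), and the commented example assumes `/proc/loadavg` and `nproc` are available.

```shell
#!/bin/bash
# Divide a load average by the core count, to two decimal places.
# load_per_core is a hypothetical helper, not part of DirectAdmin.
load_per_core() {
    local load="$1" cores="$2"
    awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}

# Example on a live Linux system (assumes /proc/loadavg and nproc exist):
# read -r one five fifteen rest < /proc/loadavg
# load_per_core "$one" "$(nproc)"
```

A result well above 1.00 means more tasks are runnable or blocked than the cores can service. On Linux the figure also counts tasks in uninterruptible sleep, typically waiting on disk I/O, so a high value does not necessarily mean the CPUs are busy.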

LawsHosting · Nov 1, 2010

cnm said:
SCSI, Disabling the feature sounds good to me, but where do you do that? I don't find any setting. I'll look again after the update.

in the actual directadmin.conf :

http://www.directadmin.com/features.php?id=1088

cnm · Nov 1, 2010

There is no 'check_load' item in my DA conf files

/usr/local/directadmin/conf/directadmin.conf
/usr/local/directadmin/data/templates/directadmin.conf

Since it seems there is no 'check_load' configured, why is it sending me those messages??

LawsHosting · Nov 1, 2010

It's on by default unless you add it manually with a 0.

Setting check_load to 0 will disable the check.
There will be no interface option for this for now, it's a simple directadmin.conf setting, enabled by default.
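
Adding the setting by hand could look roughly like this. It is only a sketch: `set_conf_key` is a hypothetical helper, it assumes directadmin.conf holds plain `key=value` lines and that GNU sed (`-i`) is available, and whether DirectAdmin needs a restart to pick up the change is not stated in this thread.

```shell
#!/bin/bash
# Set key=value in a simple key=value config file, replacing an existing
# entry or appending a new one. set_conf_key is a hypothetical helper name.
set_conf_key() {
    local file="$1" key="$2" value="$3"
    if grep -q "^${key}=" "$file"; then
        # Key already present: rewrite that line in place (GNU sed -i).
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        # Key absent: append it as a new line.
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example (run as root; path as given in the post above):
# set_conf_key /usr/local/directadmin/conf/directadmin.conf check_load 0
```

Backing up the file before editing it is probably wise.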

cnm · Nov 1, 2010

Thanks - well, I guess I'll leave it alone for the time being.

Really appreciate all the help with this.

cnm · Dec 1, 2010

The trouble with it is that it doesn't spring into action until after the load spike is gone, if the spike only lasts a minute or two.

I wrote my own cron script to check for load > 3 and capture server-status. That enabled me to spot the bad bot doing thousands of GETs.

So I have disabled DA's check_load. Thanks very much for telling me how.

R1Lover · Dec 1, 2010

scsi said:
I just disabled the feature and don't think it's useful at all.

I disagree.... you don't want to know when one of your servers is acting up?

I also have seen the loads on a few servers going up here and there the last few days... From what I can see, it's an update to clamav that is running a backup of some sort.

cnm · Dec 1, 2010

It's easy to write your own load_check.sh to run as a cron job.

R1Lover · Dec 1, 2010

But there is no need to when it's already there and working as designed....

cnm · Dec 1, 2010

The trouble with the DA one is that by the time you get to a high one-minute load average, it's all over and the ps dump shows nothing useful.

With a faster trigger (my script is set to detect load > 3) that also fetches server-status, it's possible to see details of the httpd activity for at least the tail end of the high load spike.

(Server-status is a web page normally read via remote browser, but when the script does a WGET the html is saved on the server.)

LawsHosting · Dec 1, 2010

cnm said:
(Server-status is a web page normally read via remote browser, but when the script does a WGET the html is saved on the server.)

That's all very well if only Apache is being hammered/exploited.

cnm · Dec 1, 2010

True. But it turned out to be what I needed. Perhaps DA might consider WGET server-status (which has to be configured in httpd.conf) in addition to the 'top' dump.

The 'top' is good if the condition is persisting for more than a few minutes; the WGET shows remnants of an excitement which may be all finished.

My script does do a 'ps -aux' dump in addition to the WGET. Also my dump files have timestamps in their names so the dumping can continue as long as the one minute average remains elevated.
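
The timestamped dump files could be produced along these lines. This is a sketch, not the script itself: `snapshot_name` and `capture_snapshot` are hypothetical helpers, the URL and directory in the example are placeholders, and it passes an explicit file name to `wget -O` rather than relying on wget's automatic numbering.

```shell
#!/bin/bash
# Build a timestamped file name such as status_101201-1220.html.
# snapshot_name is a hypothetical helper.
snapshot_name() {
    local prefix="$1" ext="$2"
    printf '%s_%s.%s\n' "$prefix" "$(date +%y%m%d-%H%M)" "$ext"
}

# Save server-status and a process list under timestamped names.
# The URL and directory are placeholders; mod_status must be enabled
# in httpd.conf for the server-status page to exist.
capture_snapshot() {
    local url="$1" dir="$2"
    wget -q -O "$dir/$(snapshot_name status html)" "$url"
    ps aux > "$dir/$(snapshot_name ps txt)"
}

# Example:
# capture_snapshot 'http://localhost/server-status' /home/mike
```

Because the name carries the minute, a job run once a minute writes a new pair of files on each run for as long as the load stays elevated.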

tomtom901 · Dec 2, 2010

Care to share that script, CNM?

cnm · Dec 2, 2010

Sure - maybe you can improve it. Derived from http://landofthefreeish.com/linux/how-to-create-a-shell-script-to-monitor-load-average/
It is run as a cron job once a minute. Note that the WGET automatically saves the html as [url_arg], [url_arg].1, etc, so it does not get overwritten. It is saved in the current directory.

The mysqladmin processlist is actually useless in my experience - just says processlist is queried. But I threw it in.

Code:

#!/bin/bash

# Define variables
CUR_TIME=$(date +"%A %b %e %r")
HOSTNAME=$(hostname)
# Retrieve the load average of the past 1 minute
LOAD_AVG=$(uptime | cut -d'l' -f2 | awk '{print $3}' | cut -d',' -f1)
# Truncate load average to integer
LOAD_AVG=$(echo "$LOAD_AVG" | cut -f 1 -d.)
echo "$LOAD_AVG"
# Define threshold. The truncated load average is compared with this value. Set the value as you wish.
LIMIT=3

# Compare the current load average with the threshold and email the server administrator if the load is greater.
# Run from cron once a minute, this takes successive readings until the load eases off.

if [ "$LOAD_AVG" -gt "$LIMIT" ]; then
        cd /home/mike || exit 1
        # Fetch server-status; wget saves it in the current directory
        wget 'http://www25.yourdnshost.com/[url_arg edited out - cnm]'
        ssh [root login edited out - cnm]
        # Who is logged in, mailed to the admin, then the process list appended
        FILE="wps_$(date +%y%m%d-%H%M).txt"
        w > "/home/mike/$FILE"
        /usr/sbin/sendmail [email protected] < "$FILE"
        ps aux >> "/home/mike/$FILE"
        mysqladmin processlist --verbose >> /home/mike/processlist.txt
        # Snapshot of the domain's access log
        FILE="SWI-httpd_$(date +%y%m%d-%H%M).log"
        cp -p /var/log/httpd/domains/spywareinfoforum.com.log "/home/mike/$FILE"
fi
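
Because the script truncates the load to an integer before comparing, LIMIT=3 only fires once the load reaches 4. A possible variant that compares the fractional value directly is sketched below; `load_exceeds` is a hypothetical helper, and the commented example assumes Linux, where `/proc/loadavg` holds the one-minute average in its first field.

```shell
#!/bin/bash
# Return success (0) if the load is strictly greater than the limit.
# awk does the comparison because bash's [ -gt ] only handles integers.
load_exceeds() {
    local load="$1" limit="$2"
    awk -v l="$load" -v t="$limit" 'BEGIN { exit !(l > t) }'
}

# Example on Linux:
# read -r one rest < /proc/loadavg
# if load_exceeds "$one" 3; then echo "load is high: $one"; fi
```

Reading `/proc/loadavg` also avoids parsing the output of `uptime`, whose layout changes with how long the machine has been up.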

Edit: The thing got triggered yesterday, enabled me to identify excessive WGETs from two addresses. The server-status gave useful info, showed lots of activity, mostly in condition "C" Closing connection. Looked up IPs; whois had this: ec2-72-44-57-248.compute-1.amazonaws.com. Went to amazonaws.com, navigated to http://aws-portal.amazon.com/gp/aws/html-forms-controller/contactus/AWSAbuse, submitted abuse report via their form, Abuse type 'Excessive web crawling', with excerpt from the /var/log/httpd/domains/spywareinfoforum.com.log snapshot. Received:

Dear Abuse Reporter,

Thank you for submitting your abuse report.

We've determined that an Amazon EC2 instance was running at the IP address you provided in your abuse report. We've forwarded the details of your complaint to our Amazon EC2 customer. We'll investigate the complaint to determine what additional actions, if any, need to be taken in this case.

If you wish to provide additional information to EC2 or our customer regarding this case, please reply to [email protected] with the original subject line.

Thanks again for alerting us to this issue.

Case number: 52761353794

Your original report:

* Source IPs: 72.44.57.248, 50.16.8.115
* Abuse Time: Wed Dec 01 12:20:00 UTC 2010

Later they sent me a followup with copy of the letter they sent to their customer.
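
Spotting the heaviest clients in a log snapshot like the one above can be done with a short pipeline. A sketch only: `top_clients` is a hypothetical helper, and it assumes the client address is the first whitespace-separated field on each line, as in Apache's common and combined log formats.

```shell
#!/bin/bash
# List the N client addresses with the most requests in an access log.
# Assumes the client address is the first field of every line.
top_clients() {
    local logfile="$1" count="${2:-10}"
    awk '{ print $1 }' "$logfile" | sort | uniq -c | sort -rn | head -n "$count"
}

# Example (log path taken from the script above):
# top_clients /var/log/httpd/domains/spywareinfoforum.com.log 5
```

Each output line is a request count followed by the address, highest count first.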
