Hi everyone,
Over the weekend, a client's Perl script went haywire and caused their Apache error log (/var/log/https/domains) to grow to 12GB within an hour - WHAT FUN!
Of course, this brought the server to its knees as /var filled to 100%. With no way to SSH in (CPU at 100% plus extreme latency), we had to make a trip to the data center, reboot into single-user mode, and delete the log file to bring everything back to normal.
We have never had this happen before (bear in mind we host just a few sites and are not as experienced as most in this forum). We use Munin for monitoring and for various system alerts; however, given the one-hour window, we were not able to stop the event from happening. Our disk alerts fire at 80% and 92% capacity, and within a short while we were already at 100%.
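To illustrate why fixed percentage thresholds were too coarse for us: 12GB in an hour is roughly 3.3MB/s, so on a modest /var partition the gap between the 80% alert and a full disk can close in minutes. A rate-based check that projects time-to-full would fire much earlier. Below is a minimal Python sketch of that idea, assuming a simple polling loop; the `/var` path, 60-second interval, and 30-minute alert window are hypothetical choices, not something Munin provides out of the box.

```python
import shutil
import time
from typing import Optional

# Hypothetical setting: alert if the disk is projected to fill within 30 minutes.
ALERT_WINDOW_SECONDS = 30 * 60


def seconds_until_full(used_before: int, used_after: int,
                       interval: float, total: int) -> Optional[float]:
    """Linearly extrapolate disk growth; return None if usage is not growing."""
    growth = used_after - used_before
    if growth <= 0 or interval <= 0:
        return None
    rate = growth / interval  # bytes per second over the last interval
    return (total - used_after) / rate


def watch(path: str = "/var", interval: float = 60.0) -> None:
    """Poll disk usage forever and print an alert when time-to-full is short."""
    before = shutil.disk_usage(path).used
    while True:
        time.sleep(interval)
        usage = shutil.disk_usage(path)
        eta = seconds_until_full(before, usage.used, interval, usage.total)
        if eta is not None and eta < ALERT_WINDOW_SECONDS:
            # In practice this would send mail or page someone instead of printing.
            print(f"ALERT: {path} projected full in {eta / 60:.1f} minutes")
        before = usage.used
```

Calling `watch("/var")` from a long-running process would catch a runaway log well before a percentage threshold does; the linear extrapolation is deliberately crude.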
Munin (and, I believe, MRTG) is great for graphing raw data and spotting patterns over time; however, it offers little detail about specific users.
Are there standard practices the pros in this forum use for monitoring "user" events such as (1) user script access times, (2) CPU usage by user/user script(s), and (3) active user scripts? Rather than running "ps -aux" while sitting at the helm, I assume something better is available?
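For (2) and (3), the closest thing we have come up with is scripting around `ps` ourselves. Here is a rough Python sketch, assuming a Linux box with procps `ps`; the function names are hypothetical. Note that procps reports `pcpu` as CPU time divided by the process's elapsed run time, so it is a lifetime average rather than an instantaneous reading.

```python
import subprocess
from collections import defaultdict
from typing import Dict, List, Tuple


def snapshot() -> str:
    """Return one line per process: user, %CPU, full command line (no header)."""
    return subprocess.run(["ps", "-eo", "user=,pcpu=,args="],
                          capture_output=True, text=True, check=True).stdout


def cpu_by_user(ps_output: str) -> Dict[str, float]:
    """Sum the %CPU column per user."""
    totals: Dict[str, float] = defaultdict(float)
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 2:
            continue
        try:
            totals[fields[0]] += float(fields[1])
        except ValueError:
            continue  # skip malformed lines
    return dict(totals)


def processes_for_user(ps_output: str, user: str) -> List[Tuple[float, str]]:
    """List (%CPU, command line) for one user's processes, busiest first."""
    procs: List[Tuple[float, str]] = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3 or fields[0] != user:
            continue
        try:
            procs.append((float(fields[1]), fields[2]))
        except ValueError:
            continue
    return sorted(procs, reverse=True)
```

Running `cpu_by_user(snapshot())` from cron and logging the result would at least give a per-user history, but it still feels like reinventing something that surely already exists.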
Thanks for any input/suggestions that you can provide!