Server hung up, OOM takes control, reboot only option

Giancarlo · Sep 29, 2015

Hi, I've got a server running on CentOS 6.6

Other specs:
Processor Name Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
Vendor ID GenuineIntel
Processor Speed (MHz) 2660.000
Total Memory 1922372 kB
Free Memory 307240 kB
Total Swap Memory 4194300 kB
Free Swap Memory 4124872 kB
System Uptime 0 Days, 13 Hours and 27 Minutes
Apache 2.2.31 Running
DirectAdmin 1.48.3 Running
Exim 4.76 Running
MySQL 5.5.27 Running
Named 9.8.2rc1 Running
ProFTPd 1.3.4b Running
sshd Running
dovecot 2.2.18 Running
Php 5.3.29 Installed

Once in a while the server hangs, and the only thing I can do is reboot it.
I tried to figure out what could be causing this behaviour, but couldn't solve it myself.
There should be enough disk space and memory.
Now: load average: 0.11, 0.03, 0.01
But when the problem occurs, things look very different. Out of nowhere I get these values:

This is an automated message notifying you that the 15 minute load average on your system is 554.93.
This has exceeded the 4 threshold.

One Minute - 550.36
Five Minutes - 550.02
Fifteen Minutes - 554.93

top - 06:29:51 up 6 days, 10:16, 0 users, load average: 564.72, 558.17, 551.36
Tasks: 717 total, 142 running, 571 sleeping, 0 stopped, 4 zombie
Cpu(s): 5.1%us, 1.1%sy, 0.5%ni, 90.7%id, 2.3%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 1922372k total, 1868360k used, 54012k free, 772k buffers
Swap: 4194300k total, 4188736k used, 5564k free, 14344k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6 root RT 0 0 0 0 S 6.3 0.0 29:20.62 [watchdog/0]
16 root 20 0 0 0 0 R 5.0 0.0 53:54.44 [kblockd/0]
1204 root 10 -10 7600 3520 2296 S 4.1 0.2 24:10.44 iscsid
13 root 20 0 0 0 0 R 1.4 0.0 6:20.32 [sync_supers]
1688 root 20 0 243m 4472 820 S 1.3 0.2 4:01.68 /sbin/rsyslogd -i /var/run/syslogd.pid -c 5
7 root 20 0 0 0 0 R 1.2 0.0 10:00.65 [events/0]
1203 root 20 0 4932 408 376 R 1.1 0.0 6:12.94 iscsid
167 root 20 0 0 0 0 R 1.0 0.0 4:58.87 [mpt_poll_0]
2327 root 20 0 19044 464 416 R 0.9 0.0 3:27.98 /usr/local/directadmin/da-popb4smtp
638 root 20 0 0 0 0 R 0.8 0.0 3:38.50 [vmmemctl]
14 root 20 0 0 0 0 R 0.6 0.0 5:28.24 [bdi-default]
1912 root 20 0 8364 324 288 S 0.6 0.0 2:05.66 /usr/sbin/fcoemon --syslog
4 root 20 0 0 0 0 R 0.6 0.0 3:38.06 [ksoftirqd/0]
6129 apache 20 0 231m 5052 1120 D 0.6 0.3 0:57.83 /usr/sbin/httpd -k start -DSSL
8229 root 20 0 0 0 0 S 0.5 0.0 0:10.44 [flush-253:0]
1887 root 20 0 13584 472 396 D 0.4 0.0 2:16.00 lldpad -d
1631 root 16 -4 98728 2260 540 S 0.3 0.1 1:46.42 auditd
6248 apache 20 0 204m 4716 768 D 0.3 0.2 0:43.69 /usr/sbin/httpd -k start -DSSL
6574 apache 20 0 183m 4868 1448 R 0.3 0.3 0:31.14 /usr/sbin/httpd -k start -DSSL
8159 root 20 0 0 0 0 R 0.3 0.0 0:07.15 [flush-253:2]
6424 apache 20 0 183m 4400 776 D 0.3 0.2 0:47.45 /usr/sbin/httpd -k start -DSSL
2030 ntp 20 0 30732 928 816 S 0.3 0.0 0:53.10 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
6148 apache 20 0 215m 5676 596 R 0.2 0.3 0:56.69 /usr/sbin/httpd -k start -DSSL

I've got no idea where to start debugging this problem and find out why this is happening.

I hope someone can point me to a step-by-step method for finding the problem.

zEitEr · Sep 29, 2015

Hello,

Start and check:

1. server-status of apache
2. mytop or mysqladmin to see running queries
3. apache logs for repeating POST requests
4. user directories for malware
5. exim logs for possible spam originating from your server
6. a full list of running processes
etc.

Please search the forums for details on how to perform all or any of these steps.
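
As a rough illustration of step 3, here is a minimal Python sketch that counts POST requests per client IP and URL. It assumes Python 3 and an access log in the standard Apache common/combined format; the function names and the regular expression are illustrative only, and the log location depends on your setup.

```python
import re
from collections import Counter

# Start of an Apache common/combined log line:
# client IP, identity, user, [timestamp], "METHOD path protocol"
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*"')


def count_posts(lines):
    """Count POST requests per (client IP, requested path)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        # Lines that do not fit the assumed format are skipped silently.
        if match and match.group(2) == "POST":
            counts[(match.group(1), match.group(3))] += 1
    return counts


def report(log_path, top=20):
    """Print the most frequent POST sources found in one access log."""
    # errors="replace" keeps odd bytes in request lines from aborting the scan.
    with open(log_path, errors="replace") as handle:
        for (ip, path), hits in count_posts(handle).most_common(top):
            print(hits, ip, path)
```

Call `report()` with the path to one of your domain access logs. A single IP hammering a login or xmlrpc script with thousands of POSTs is the kind of pattern to look for.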

Giancarlo · Oct 1, 2015

Hi Alex,

Thank you for your reply.
Although some of these points can be checked afterwards, others can't, because the server is unresponsive during an incident.
And since the server mostly hangs at night, it has to be rebooted as soon as possible so that users have their sites up and running again.
So I was hoping for suggestions for better logging than the CentOS defaults, to point me in the right direction.

zEitEr · Oct 1, 2015

Steps 1, 2, 6: can be done with monitoring software or scripts.
Steps 3, 4, 5: can be done at any time, both before and after an incident.

The main goal here is to find out which process uses the most RAM.
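
A script for this could look roughly like the sketch below. It assumes Python 3 and a Linux /proc filesystem; the function names, the log path you pass in and the 60-second interval are made up for illustration. It reads VmRSS and VmSwap from /proc/<pid>/status (if your kernel does not report VmSwap, it is counted as 0) and appends the largest processes to a file, so the last entries before a hang survive the reboot.

```python
import os
import time


def parse_status(text):
    """Return (name, rss_kb, swap_kb) from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.strip()

    def kb(key):
        # Values look like "4472 kB"; kernel threads have no Vm* lines.
        parts = fields.get(key, "0 kB").split()
        return int(parts[0]) if parts else 0

    return fields.get("Name", "?"), kb("VmRSS"), kb("VmSwap")


def snapshot(proc_root="/proc", top=15):
    """List the processes with the largest RSS + swap, biggest first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                name, rss, swap = parse_status(handle.read())
        except OSError:
            continue  # the process exited while we were scanning
        rows.append((rss + swap, int(entry), name, rss, swap))
    rows.sort(reverse=True)
    return rows[:top]


def log_forever(log_path, interval=60):
    """Append a timestamped snapshot to log_path every `interval` seconds."""
    while True:
        # Reopen on every pass so each snapshot is written out on close.
        with open(log_path, "a") as log:
            log.write(time.strftime("%Y-%m-%d %H:%M:%S") + "\n")
            for total, pid, name, rss, swap in snapshot():
                log.write("%9d kB pid=%d %s rss=%d swap=%d\n"
                          % (total, pid, name, rss, swap))
        time.sleep(interval)
```

Start `log_forever()` in the background with a log path of your choice and read the last snapshots after the next reboot. Once the load is already in the hundreds the script itself may stall, so the entries written in the minutes before that are the interesting ones.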
