How to troubleshoot an overloaded server

TheoRichel

My VPS was down for about twenty hours, and apart from my sites my admin panel was inaccessible as well. After the host rebooted it, it's running again, and I am now trying to find out what caused this. I have been checking the many logs and they are not very informative. I have checked them all and copied the last two lines before the evident break. They follow below. Do they tell you anything? Where else should I look?

Exim Mainlog:

2011-01-24 16:41:07 H=([173.210.226.79]) [173.210.226.79] F=<[email protected]> rejected RCPT <[email protected]>:
2011-01-24 16:41:08 H=([173.210.226.79]) [173.210.226.79] incomplete transaction (QUIT) from <[email protected]>


Exim rejectlog:
2011-01-24 16:37:09 H=mhy71-1-82-242-199-229.fbx.proxad.net (proxad.net) [82.242.199.229] F=<[email protected]> rejected RCPT <[email protected]>:
2011-01-24 16:41:07 H=([173.210.226.79]) [173.210.226.79] F=<[email protected]> rejected RCPT <[email protected]>:

Access log is empty

Apache error log:
[Mon Jan 24 16:34:38 2011] [error] [client 82.73.138.210] File does not exist: /var/www/html/favicon.ico
[Mon Jan 24 16:34:38 2011] [error] [client 82.73.138.210] File does not exist: /var/www/html/404.shtml

Apache Suexec : Unable to read log

Mail log:
Jan 24 16:46:01 vps dovecot[1018]: pop3-login: Disconnected (auth failed, 1 attempts): user=<[email protected]>, method=PLAIN, rip=82.176.206.246, lip=69.94.32.233
Jan 24 16:46:03 vps dovecot[1018]: pop3-login: Disconnected (auth failed, 1 attempts): user=<[email protected]>, method=PLAIN, rip=82.176.206.246, lip=69.94.32.233
 
Hello,

Check the logs from before the reboot. Also check your VPS control panel's logs; some panels let you see system resource utilization.
 
Where is that reboot log?

Currently I have looked at the logs on Admin level and on the level of that user, but I cannot find any 'reboot'-log.

I showed the logs that had something to say in my first post.
 
Find the time when the reboot was done. Use

Code:
uptime

and/or

Code:
last | grep reboot

Then look at the last 10-20 lines in each log before that time.
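As a rough sketch of that last step: syslog normally writes a "syslogd ... restart." line when the machine boots, so grep's -B option can print the lines leading up to it. The function name lines_before_restart is made up for illustration, and /var/log/messages is only an assumed default (the usual location on Red Hat/CentOS style systems, normally readable by root only).

```shell
# Print up to N lines that precede each "syslogd ... restart." marker,
# i.e. the last things logged before the machine came back up.
# The marker line itself is printed too.
# Usage: lines_before_restart [logfile] [N]
lines_before_restart() {
    logfile="${1:-/var/log/messages}"   # assumed default path
    count="${2:-20}"                    # how many lines of context to show
    grep -B "$count" 'syslogd.*restart' "$logfile"
}

# Example (run as root):
#   lines_before_restart /var/log/messages 20
```

If the log contains several boots, grep separates the groups with a "--" line, so the last group belongs to the most recent reboot.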
 
Results of uptime and grep reboot

Below the results of the two commands you suggested:
Last login: Mon Dec 27 15:08:18 2010 from cable-206-246.zeelandnet.nl
[admin@vps ~]$ uptime
12:44:47 up 8:19, 1 user, load average: 0.28, 0.31, 0.23
[admin@vps ~]$ last | grep reboot
reboot system boot 2.6.18-194.26.1. Wed Jan 26 04:25 (08:19)
reboot system boot 2.6.18-194.26.1. Tue Jan 4 14:36 (21+22:08)
reboot system boot 2.6.18-194.26.1. Tue Jan 4 14:25 (00:10)
reboot system boot 2.6.18-194.26.1. Tue Jan 4 08:10 (06:13)
reboot system boot 2.6.18-194.26.1. Mon Jan 3 04:22 (1+03:47)
reboot system boot 2.6.18-194.26.1. Wed Dec 22 10:51 (11+17:30)
reboot system boot 2.6.18-128.2.1.e Wed Dec 22 10:14 (00:36)
[admin@vps ~]$
[admin@vps ~]$
 
OK, now find the records in all logs prior to Wed Jan 26 04:25. That way you can learn what was going on.

Did you check the system resource utilization graphs in your VPS control panel (CP)?
 
Do not understand

I have already shown you the contents of my logs.

And I do not know where to find the system resource utilization graphs.
 
Sorry, that won't do. The logs you posted show me nothing related to your problem. If somebody can say anything more with the information you've posted, s/he is welcome.

If you tell me the name of your hosting company, I can try to find out which VPS control panel they offer.

Anyway, your problem has very little to do with DirectAdmin. It's a general system administration matter, and a hosting panel like DirectAdmin is not the deciding factor. You'd better consult your hosting provider's support team in this case, or look for help on a forum for Linux administrators.
 
I am at OLM.net

But I still do not understand the vagueness of all this. Are my logs wrongly configured? Is it something I should do myself? What kind of system events are not logged? Where should I normally be able to check system resources and their history? Is that an option that DA normally offers but that is disabled by OLM, or what?
 
Ok, let me try again:

According to your last | grep reboot output, your server was rebooted on Wed Jan 26 at 04:25. Was this caused by overload?

If so, why did you post logs from around 2011-01-24 16:37:09? To understand what was going on with your server, you need to look at least at the period from Wed Jan 26 00:00 until the reboot (Wed Jan 26 04:25).
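One possible way to cut such a period out of a syslog-style file is a small awk filter on the month, day and time fields. This is only a sketch: the function name log_window is invented here, and it assumes the traditional "Mon DD HH:MM:SS host ..." line format with both times on the same day.

```shell
# Print the lines of a syslog-style file that fall inside a time window
# on one given day. Times are compared as HH:MM:SS strings.
# Usage: log_window logfile month day from to
log_window() {
    awk -v mon="$2" -v day="$3" -v from="$4" -v to="$5" \
        '$1 == mon && $2 == day && $3 >= from && $3 <= to' "$1"
}

# Example (run as root; the path is an assumption):
#   log_window /var/log/messages Jan 26 00:00:00 04:25:59
```

If the window comes back empty, nothing was logged in that period, which is itself a hint that the machine was already down or hung by then.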

Is it clear?

I'm sorry to say this, but if you cannot do it yourself, I won't teach you here how to administer your Linux box. You'd better contact your hosting support.
 
The reboot was not caused by overload

The server supposedly went down because of overload. The reboot was done by OLM since I couldn't access the system anymore.
My logs show a hole between the 24th and the 26th. That is when my server was down. I thought that what happened before that hole could be informative.

Looking again at the logs, that may indeed be the case, since the system messages log does display some lines that seem informative:

Jan 24 16:55:17 vps kernel: Free swap = 0kB
Jan 24 16:55:19 vps kernel: Total swap = 0kB
Jan 24 16:55:19 vps kernel: Free swap: 0kB
Jan 24 16:55:20 vps kernel: 133120 pages of RAM
Jan 24 16:55:25 vps kernel: 0 pages of HIGHMEM
Jan 24 16:56:03 vps kernel: 2606 reserved pages
Jan 24 16:56:06 vps kernel: 55128 pages shared
Jan 24 16:56:15 vps kernel: 0 pages swap cached
Jan 24 16:56:22 vps kernel: 0 pages dirty
Jan 24 16:56:48 vps kernel: 0 pages writeback
Jan 24 16:56:49 vps kernel: 2 pages mapped
Jan 24 16:56:49 vps kernel: 4909 pages slab
Jan 24 16:56:50 vps kernel: 1605 pages pagetables
Jan 24 16:57:12 vps kernel: Out of memory: Killed process 28501, UID 101, (httpd).
Jan 24 16:57:12 vps kernel: httpd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Jan 24 16:57:47 vps kernel: [<c0454bb7>] out_of_memory+0x72/0x1a3
Jan 24 16:57:48 vps kernel: [<c04561d3>] __alloc_pages+0x24e/0x2cf
Jan 24 16:58:28 vps kernel: [<c04574cb>] __do_page_cache_readahead+0xc8/0x18b
Jan 24 16:58:41 vps kernel: [<c045412f>] filemap_nopage+0x157/0x34c
Jan 24 16:58:41 vps kernel: [<c045fb81>] __handle_mm_fault+0x329/0x14ae
Jan 24 16:58:42 vps kernel: [<c04915e2>] __mark_inode_dirty+0x13d/0x14f
Jan 24 16:58:56 vps kernel: [<c0489d74>] touch_atime+0x60/0x90
Jan 24 16:58:57 vps kernel: [<c042530c>] local_bh_enable+0x5/0x81
Jan 24 16:58:57 vps kernel: [<c05ba3cb>] lock_sock+0x8e/0x96
Jan 24 16:58:57 vps kernel: [<c061f010>] _spin_lock_bh+0x8/0x18
Jan 26 04:25:50 vps syslogd 1.4.1: restart.
Jan 26 04:25:50 vps kernel: klogd 1.4.1, log source = /proc/kmsg started.
Jan 26 04:25:50 vps kernel: Linux version 2.6.18-194.26.1.el5xen ([email protected]) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) #1 SMP Tue Nov 9 14:13:46 EST 2010
Jan 26 04:25:50 vps kernel: BIOS-provided physical RAM map:
Jan 26 04:25:50 vps kernel: Xen: 0000000000000000 - 0000000020800000 (usable)
Jan 26 04:25:50 vps kernel: 0MB HIGHMEM available.
Jan 26 04:25:50 vps kernel: 520MB LOWMEM available.

As for the learning process: I am a journalist with a somewhat successful blog. Due to that success I became too big for shared hosting and was forced to move to a VPS. I never asked to be a system admin, but I am now forced to become one. I can see the fun of it, but it remains a schizophrenic situation, since all that admin work comes at the expense of my money-making work. I am definitely not unwilling to learn, but in these unexpected crises, references to the large body of work that explains the intricacies of Linux are not very helpful.
 
OK, that's much better. I suspect your problem was caused by the server running out of memory (RAM).

See:

Jan 24 16:57:12 vps kernel: Out of memory: Killed process 28501, UID 101, (httpd).
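To see whether this was a one-off or a pattern, you could search the whole messages log for lines like that one. A minimal sketch, assuming the message wording shown above (other kernel versions phrase it differently) and the usual /var/log/messages path; the function names oom_kills and oom_summary are made up for illustration.

```shell
# List OOM-killer events from a syslog file, one line per killed process.
# Usage: oom_kills [logfile]
oom_kills() {
    logfile="${1:-/var/log/messages}"   # assumed default path
    grep 'Out of memory: Killed process' "$logfile"
}

# Count how often each program was killed, most frequent first.
# The program name is taken from the trailing "(name)." of each line.
# Usage: oom_summary [logfile]
oom_summary() {
    oom_kills "$1" | sed 's/.*(\(.*\))\.$/\1/' | sort | uniq -c | sort -rn
}

# Example (run as root):
#   oom_summary /var/log/messages
```

A long list dominated by httpd would support the idea that Apache is the service using up the memory.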

Apache and other services should be tuned; if you're not ready to do that, either upgrade your hosting package or order additional RAM.
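For a sense of scale, the "133120 pages of RAM" line in your log can be converted to megabytes, and a common rule of thumb then gives a rough ceiling for the number of Apache child processes. This is only a sketch: the 4096-byte page size is an assumption (check with getconf PAGESIZE), both function names are invented here, and the per-child and reserved figures in the example are hypothetical numbers that you would have to measure on your own server.

```shell
# Convert a page count from the kernel log into megabytes.
# Usage: pages_to_mb pages [page_size_in_bytes]
pages_to_mb() {
    pages="$1"
    page_bytes="${2:-4096}"   # assumption: 4 KiB pages
    echo $(( pages * page_bytes / 1024 / 1024 ))
}

# Rule-of-thumb ceiling for Apache children:
#   (total RAM - RAM reserved for everything else) / RAM per httpd child
# Usage: max_children_estimate total_mb reserved_mb per_child_mb
max_children_estimate() {
    echo $(( ($1 - $2) / $3 ))
}

# pages_to_mb 133120                 -> 520
# max_children_estimate 520 200 25   -> 12   (200 and 25 are hypothetical)
```

The 520 agrees with the "520MB LOWMEM available" line from the boot messages. With no swap at all ("Total swap = 0kB"), an Apache limit (MaxClients, called MaxRequestWorkers since Apache 2.4) set well above such an estimate could let httpd use up all memory under load.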
 