Troubleshooting system crash

Strator
Hi,

My DA server was down for 8 hours last night, so after bringing things up again, I am eager to find out what exactly happened - but somehow I can't find any meaningful log files ("messages" at least has nothing to contribute). Can somebody please point me in the right direction? Thanks. :)
 
Hackers attack log files too. They erase their traces, so it's no wonder you cannot find any logs. I think you've been hacked. Always run the latest versions of Joomla, WordPress, etc. - that is usually where hackers get in.

You need to secure your server better. Check the forums and Google to find what works best for your setup. In addition, I suggest installing an antivirus and a firewall, and activating DA's BFM (Brute Force Monitor), set so that it bans an IP address after 10 or 20 invalid attempts.
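
For example, something like this (the log path is the CentOS/RHEL one; Debian-style systems use /var/log/auth.log) shows which IPs are hammering SSH with failed logins:

grep 'Failed password' /var/log/secure | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn | head

If the same IPs show up with thousands of attempts, BFM or a firewall rule will deal with them.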

Tell me what you already have in place and what you don't, so I can guide you.
 
Jumping to that conclusion with zero information is kinda silly, isn't it?

Can someone possibly answer my question?
 
Was the entire server offline, or just certain services? Was it really down, or just unreachable on the 'net?

Check email log files for large gaps in times around the period which the server appeared down.

Check messages and other logs for the same, and also for signs of daemons starting up or coming down.
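
A quick way to make gaps stand out is to count log lines per hour, along these lines (adjust the filename to whatever your distribution uses):

awk '{print $1, $2, substr($3,1,2)}' /var/log/maillog | uniq -c

An hour with far fewer lines than usual - or missing entirely - tells you roughly when things stopped.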

Jeff
 
Good question, thank you. To the outside, the server was totally non-responsive (no Ping, no httpd, no IMAP, no sshd), but according to the logs, it wasn't quite dead.

- During the entire downtime, the cron log has one line per minute, which seems to be normal behavior:

12:58:01 server CROND[12451]: (root) CMD (/usr/local/directadmin/dataskq)

Interestingly, a couple of days before the downtime, the hourly entries changed. Originally, it was along the lines of:

Sep 21 08:01:01 server CROND[9323]: (root) CMD (run-parts /etc/cron.hourly)
Sep 21 08:01:01 server run-parts(/etc/cron.hourly)[9323]: starting 0anacron
Sep 21 08:01:01 server run-parts(/etc/cron.hourly)[9332]: finished 0anacron

...but a couple of days before the incident, the entries changed to:

Sep 28 05:01:01 server CROND[10554]: (root) CMD (run-parts /etc/cron.hourly)
Sep 28 05:01:01 server run-parts(/etc/cron.hourly)[10554]: starting 0anacron
Sep 28 05:01:01 server anacron[10563]: Anacron started on 2013-09-28
Sep 28 05:01:01 server run-parts(/etc/cron.hourly)[10565]: finished 0anacron
Sep 28 05:01:01 server anacron[10563]: Job `cron.daily' locked by another anacron - skipping
Sep 28 05:01:01 server anacron[10563]: Normal exit (0 jobs run)

...and stayed like that until the server was finally rebooted after the "crash". I googled this and came across this old thread:

https://bugzilla.redhat.com/show_bug.cgi?id=517321

From what I make of this, it seems like this issue might actually prevent proper logging on my server (so the cause of the actual downtime may in fact be quite trivial, but impossible to troubleshoot because of the poor logs). This would also be in line with a logging issue I came across a couple of months ago and was never able to fix:

http://forum.directadmin.com/showthread.php?t=47008

- "messages" has has nothing for the time span except for two lines, which I had initially overlooked:

05:20:22 server init: serial (hvc0) main process ended, respawning
05:24:23 server init: serial (hvc0) main process ended, respawning

(These two lines come almost 7 hours after the server went down - after 8 hours I hit the hard reset. Not sure whether this is interesting information or not.)

- The mail log has nothing for that timespan.
- Apache logs seem to be configured to be deleted (!) on a daily basis (I would have expected to see the older log files archived) - so there's nothing to check there.
- dmesg has no timestamps (not sure if that is normal), but without them it seems of limited use to me.
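
(One thing I might still try: newer versions of util-linux apparently have a -T switch for dmesg that prints human-readable timestamps, e.g. "dmesg -T | tail -50" - but on an older CentOS that option may simply not exist.)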
 
Occam's Razor would seem to tell us that likely the Internet connection was interrupted. The only way to know with certainty is to get a truthful reply from the datacenter, and frankly, unless they monitor outages in real time they may not know.

But you should probably install a cron job that runs top every minute and saves the output to a timestamped file somewhere, and delete the old files every week or so to keep them from taking up too much room.

That way if it happens again you'll have a minute-by-minute snapshot of the server.
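
As a rough sketch (untested, and /root/topsnaps is just an example directory you'd have to create first), two entries in root's crontab would do it - note that the % signs have to be escaped inside a crontab:

* * * * * /usr/bin/top -b -n 1 > /root/topsnaps/top-$(date +\%Y\%m\%d-\%H\%M).log 2>&1
0 4 * * 0 find /root/topsnaps -name 'top-*.log' -mtime +7 -delete

The first line grabs a batch-mode snapshot of top every minute; the second deletes anything older than a week every Sunday morning.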

Others may reply with other ideas.

Jeff
 
Thanks for the suggestion - I might implement that, eventually.

At this point, however, I believe that the issue I need to solve first is with cron and logging. I have traced the failure of cron.daily back to the nightly standard cron jobs. Here's what they *should* look like on my server:

Sep 25 03:31:01 server anacron[9221]: Job `cron.daily' started
Sep 25 03:31:02 server run-parts(/etc/cron.daily)[10120]: starting custombuild
Sep 25 03:31:22 server run-parts(/etc/cron.daily)[12575]: finished custombuild
Sep 25 03:31:22 server run-parts(/etc/cron.daily)[10120]: starting freshclam
Sep 25 03:31:33 server run-parts(/etc/cron.daily)[12585]: finished freshclam
Sep 25 03:31:33 server run-parts(/etc/cron.daily)[10120]: starting logrotate
Sep 25 03:31:33 server run-parts(/etc/cron.daily)[12593]: finished logrotate
Sep 25 03:31:33 server run-parts(/etc/cron.daily)[10120]: starting mlocate.cron
Sep 25 03:32:02 server CROND[12627]: (root) CMD (/usr/local/directadmin/dataskq)
Sep 25 03:32:05 server run-parts(/etc/cron.daily)[12629]: finished mlocate.cron
Sep 25 03:32:05 server anacron[9221]: Job `cron.daily' terminated
Sep 25 03:32:05 server anacron[9221]: Normal exit (1 job run)

And here's what happened when it broke:

Sep 26 03:38:02 server anacron[30514]: Job `cron.daily' started
Sep 26 03:38:02 server run-parts(/etc/cron.daily)[31455]: starting custombuild

That was all - custombuild never finished, and it locked cron.daily until the server crashed (whether it contributed to the crash itself is another question, albeit an interesting one).
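
(If it gets stuck again, I suppose something like this would at least show whether the job is still sitting there and for how long, assuming a stock RHEL/CentOS layout:

ps -eo pid,etime,args | grep -E 'anacron|run-parts|custombuild' | grep -v grep
ls -l /var/spool/anacron/

...but that doesn't explain this particular instance.)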

Any ideas?
 
You probably just have bad hardware. There will not be anything in the logs.
 
custombuild never finished, and locked cron.daily until the server crashed (whether or not it may have contributed to the crash is another question - albeit an interesting one).
So it's likely the problem occurred somewhere in custombuild. There's a custombuild log at /usr/local/directadmin/custombuild/custombuild.log. Did you look at it to see where custombuild may have failed?
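
It might also be worth looking at what the daily script actually runs - going by your log it's picked up by run-parts from /etc/cron.daily, so "cat /etc/cron.daily/custombuild" should show the exact command that hangs.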

Jeff
 
Great idea, but it seems like it isn't leading anywhere. All that file contains is iterations of:

update
update_data
versions_nobold

or

update
update_data
versions_nobold
versions_nobold
update_versions

For the time that anacron was stuck, they are simply missing.

Same goes for the freshclam log (which makes a lot of sense, as the daily cronjob simply wasn't running).
 