Server crashing, maybe quotacheck related?

cbooth7575

Hi everybody,

Twice now my machine has frozen/crashed/become unavailable in similar circumstances.

Both times it has happened shortly after 5am on a Sunday (not consecutive Sundays though). After a remote reboot (via a power management interface) I've looked around in /var/log and don't see anything obvious as to why it went down, but I'm able to pinpoint the time to sometime after 5:03am.

Trying to find the culprit, I've found this in DirectAdmin's crontab, and I'm guessing it may be related:

5 5 * * 0 root /sbin/quotaoff -a; /sbin/quotacheck -augm; /sbin/quotaon -a;

That looks like it's possibly the problem as it runs at 5:05am on Sundays.
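To double-check that reading of the schedule, here is a minimal sketch that splits an /etc/cron.d style entry into its fields. The function names `parse_cron_d_line` and `describe_schedule` are made up for illustration, and the sketch only understands plain numbers and `*` (no ranges, lists or steps).

```python
# Sketch: split a system crontab (/etc/cron.d style) entry into its fields.
# Only plain numbers and "*" are handled; ranges, lists and steps are not.

DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]


def parse_cron_d_line(line):
    """Return a dict with the schedule fields, the user and the command."""
    parts = line.split(None, 6)  # 5 time fields, user, then the command
    if len(parts) < 7:
        raise ValueError("expected 5 time fields, a user and a command")
    minute, hour, dom, month, dow, user, command = parts
    return {"minute": minute, "hour": hour, "day_of_month": dom,
            "month": month, "day_of_week": dow,
            "user": user, "command": command}


def describe_schedule(entry):
    """Describe simple schedules such as '5 5 * * 0' in words."""
    when = "%02d:%02d" % (int(entry["hour"]), int(entry["minute"]))
    if entry["day_of_week"] == "*":
        return when + " every day"
    # cron accepts both 0 and 7 for Sunday
    return when + " on " + DAY_NAMES[int(entry["day_of_week"]) % 7] + "s"


entry = parse_cron_d_line(
    "5 5 * * 0 root /sbin/quotaoff -a; /sbin/quotacheck -augm; /sbin/quotaon -a;")
print(describe_schedule(entry))  # 05:05 on Sundays
print(entry["command"])
```

The command field shows three steps chained with `;`, so quotas are switched off, checked, and switched back on in one go.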

Can anybody suggest how I can test if this is the problem? Does it log errors somewhere?? Can I try running it myself as root? (btw, I'm running FC2 if it helps)

Thanks!

Cameron
 
We had the exact same crashes occur on an FC2 server, also on Sundays around 5 am (though it only crashed on a few Sundays). Luckily (sic) our RAID controller wrecked our hard drives and I got to reinstall everything, which seems to have had the small advantage of fixing this problem.

So, sorry, I can't help you, but be sure to post here if you manage to track this problem down, because I'm afraid it'll pop up on our end again as well.
 
same problem

Mine is happening right now, and it also happened last night between 12:00 and 12:30. It is messing up the RAID as well.
Need help please, DA.

Server is not responding on any service.
Red Hat 9, new version of DA.
 
That sounds like it's around the time the tally runs by default. If possible, can you disable it and let the server go a few days without tallying?
If it doesn't crash again then you've probably found the culprit. If you have already executed the commands directly on the command line without causing any problems, though, it might just be conflicting with another high-usage task in cron, in which case try spacing them out a little. For example, I change the tally to run around 1 in the morning instead of so close to midnight, when other scripts are more likely to run.
 
I took it out of cron.d and had no crash for 2 days.
Ran echo 'action=tally&value=all' >> /usr/local/directadmin/data/task.queue from the command line, the load got to 2.84, and now I can't get in with SSH, mail, or web.
The server still responds to ping right now.

Off to the colo again.

Any help would be great.

No other scripts are being run around 12:00.

Thanks.
 
I strongly suspect that it's a new bug after the most recent patch.

My Apache has frozen quite a few times over the past two weeks, and recently, when my admin took out the tally cron, everything started working fine again. (He suspected the cron without knowing about this thread.)

I am on FreeBSD 5 btw. I think someone from DA should really have a look at this and do an immediate fix.

I am actually surprised that they don't have a bug reporting forum?
 
My situation is more than just Apache going down: when I run the tally from cron or the command line, all services go down, I end up with a kernel panic, and my RAID array goes critical. I have ordered a new hard drive to replace the one that is failing, to make sure it is not a hardware issue. DA said to try upgrading the kernel, which will be my next step. This might be a pain because of the driver support for the Promise ATA RAID TX 2200.

I think it is kind of weird, though, that everything else still works fine.
The daily backup of 3 gigs and cron.daily (logwatch, rkhunter, etc.)
run with no crash.

I cannot figure out what started the problem. I upgraded DA about 1 week before the crashes started. But the tally is what brings the server down, so there are no stats for the last week.

Will post results when new hard drive is installed.
 
Update:

I replaced the hard drive in the RAID array, and the tally ran tonight with no crash. I guess the drive just went bad and the tally was working the server too hard with a critical array.

Problem fixed; it was just hardware related.
Thanks DA and forum for good support.
 
rmday said:
Update:

I replaced the hard drive in the RAID array, and the tally ran tonight with no crash. I guess the drive just went bad and the tally was working the server too hard with a critical array.

Problem fixed; it was just hardware related.
Thanks DA and forum for good support.
A crashed RAID array will be detected by the kernel (?) and the damaged drive will be deactivated. At least, that's supposed to happen.
Works perfectly on our old RH8.0 server (drive crashed some time ago, but still have to replace it, can't afford the downtime this month :().

Can't the DA tally be made a bit nicer to the processor by automatically controlling its load?
E.g. detect the load, run slower if the load is high, etc.
Same question for the sysbk backup system. Every night around 3 am my servers are almost unreachable because a full system backup is generating too much load...
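As a rough illustration of the "detect the load, run slower" idea, here is a generic sketch of a load-aware wrapper. It is not how the DA tally or sysbk actually work. The names `wait_for_low_load` and `run_throttled` and the default threshold are assumptions made up for this example; `os.getloadavg()` is the standard Python call for reading the load average on Unix.

```python
import os
import time


def wait_for_low_load(max_load, poll_seconds=30, max_wait_seconds=3600,
                      get_load=None, sleep=time.sleep):
    """Block until the 1-minute load average is <= max_load.

    Returns True if the load dropped in time, False if we gave up.
    """
    if get_load is None:
        get_load = lambda: os.getloadavg()[0]  # Unix only
    waited = 0
    while get_load() > max_load:
        if waited >= max_wait_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
    return True


def run_throttled(work_items, do_work, max_load=2.0, **wait_options):
    """Process items one at a time, pausing whenever the box is busy."""
    done = 0
    for item in work_items:
        if not wait_for_low_load(max_load, **wait_options):
            break  # still overloaded after max_wait_seconds; stop early
        do_work(item)
        done += 1
    return done
```

A job split into per-user or per-directory chunks could call `run_throttled(chunks, process_chunk)`, so a busy server only delays the job instead of being pushed further into overload.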
 
I believe you can manually adjust the level at which sysbk will run or even refuse to run.

Jeff
 
quotacheck causing crashes on FC2 boxes

I'm with cbooth7575 on this one;

I have four boxes with DA on them, an FC1, two FC2s and an FC3.
The FC2 boxes "die" every other weekend, Nagios shows them going between 5:12am and 5:18am Sunday morning - the nearest cronjob is the 5:05 quotacheck.

I have not managed to witness this happening (i.e. I haven't been motivated enough to move the cronjob and watch it) but it appears some kind of out-of-memory failure is going on, as the kernel continues to run and return pings, and socket connections are accepted but never handed off to any process (i.e. I hit port 25, get the connect, but exim never grabs the socket).
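The symptom described above (the connection is accepted but no process ever answers) can be told apart from a box that is simply down with a small probe. This is a sketch only: the function name `probe_banner` and the result labels are invented for illustration, and it assumes a service such as SMTP that sends a greeting as soon as you connect.

```python
import socket


def probe_banner(host, port, timeout=5.0):
    """Classify a TCP service that is expected to greet on connect.

    Returns one of:
      "refused_or_unreachable"  - the TCP connection could not be made
      "connected_no_banner"     - handshake succeeded but no data arrived
      "closed_without_banner"   - peer closed the connection straight away
      "banner_received"         - the service sent a greeting
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "refused_or_unreachable"
    try:
        sock.settimeout(timeout)
        data = sock.recv(256)
    except socket.timeout:
        # The kernel completed the handshake, but nothing is serving it.
        return "connected_no_banner"
    except OSError:
        return "closed_without_banner"
    finally:
        sock.close()
    return "banner_received" if data else "closed_without_banner"
```

Run from a monitoring host against port 25, a result of "connected_no_banner" would match the hang described here: the kernel is alive and completing handshakes, but the mail daemon never picks the connection up.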

Both boxes are dual processor IBMs with 36GB SCSI disks and 1GB physical memory, 1GB swap memory. Fairly beefy and certainly with plenty of memory for incidental large memory using processes.

Obviously DA needs quotacheck to update the system quotas. It certainly appears that quotacheck is bombing the boxes. What to do?
I can't WAIT until DA installs and works properly on Debian boxes :p

Cheers; JP
 
rmday,

Your RAID story once again gives me deja vu. We had a badass crash in which our RAID controller crapped out, destroying, well, everything.
(We were able to restore large parts of /home, /backup and /etc from one of the two drives though, and we had off-site backups, thank god... as well as paperwork detailing most user accounts and settings, since /var was gone.)
All this followed a few weeks of Sunday morning crashes every other weekend. So the pieces definitely fit with the symptoms we were having prior to our crash.

Luckily, we have new drives and a new RAID controller now. My advice to anyone running into this Sunday morning crash problem is to check your RAID and make regular off-site backups.

jphindin, our server was also FC2. Since the rest of the config looked pretty different from yours (we didn't have SCSI drives) it could be FC2 related and not RAID related, though.

At any rate, I'll investigate more if this problem ever rears its ugly head again on our end.
 
My server is also crashing. I never thought of a pattern, but now I've looked into the cron job log and the last thing mentioned is the quota check.

OS: Fedora Core 2

Is there a fix for this crashing problem?
 
Same problem here.
Got Fedora Core 2
Software RAID (mirroring)

Server becomes totally unresponsive, though ping still works.

Timing is around 05:00 on Sunday mornings.
 
Hello,

If you are affected, remove the entire quotacheck line from your /etc/cron.d/directadmin_cron and restart crond. We have the line commented out for new installs; however, not all OSes are affected. It's good to run it now and then to make sure everything is in sync, but it's not absolutely required. I believe it's a bug with the quotas in earlier Fedora kernels.

John
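For anyone who would rather script that change than edit the file by hand, here is a minimal sketch that comments out the quotacheck entry instead of deleting it, so it is easy to restore later. The function name `comment_out_quotacheck` is hypothetical. The sketch works on text only and does not touch /etc/cron.d/directadmin_cron itself; writing the result back and restarting crond are still up to you.

```python
def comment_out_quotacheck(cron_text):
    """Return cron_text with every active line mentioning quotacheck commented out."""
    out = []
    for line in cron_text.splitlines(keepends=True):
        stripped = line.lstrip()
        # Leave blank lines, existing comments and unrelated jobs untouched.
        if "quotacheck" in stripped and not stripped.startswith("#"):
            line = "#" + line
        out.append(line)
    return "".join(out)


sample = (
    "# illustrative fragment, not a complete directadmin_cron\n"
    "5 5 * * 0 root /sbin/quotaoff -a; /sbin/quotacheck -augm; /sbin/quotaon -a;\n"
)
print(comment_out_quotacheck(sample), end="")
```

Running the function a second time on its own output changes nothing, so it is safe to apply repeatedly.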
 
Well, I have no crons left to run at 5AM (that I can find) and I am still getting crashes/reboots between 4:56 AM and 5:02 AM every day. Sometimes it will reboot 2-3 times over the course of a few hours, but most of the time it's the one reboot around 5:00 AM.

My quotacheck cron has been set to run at 12:05 AM forever. It's never been around 5AM. And, I can put a heck of a load on the machine for a long period without ever crashing it. So I don't believe it's hardware.

I got me a new FreeBSD world yesterday hoping that would help, but no. I was running FreeBSD 4.10 and am now running 4.11.

It would be nice if others could check their logs. This is not a problem that is easily noticed. I logged in to do some upgrades, noticed the machine had only been up a few hours, and started nosing around, only to find it was happening every day.
 
LyricTung,

My server is also crashing, even after I removed the line.

Today, it crashed at 5:00 AM.

It is really starting to irritate both me and my clients.


Regards,
Ben
 
Schaap- Thanks. Good to know. I'm going nuts as well.

Rebooted this AM twice: 4:55 and 5:01. I can find nothing. If it wasn't for others here having the same problem, I'd start thinking about power problems in the NOC at that time causing brownouts, etc.

I normally don't run a cron log because of the dataskq running each minute, but I set one up last night. I definitely have nothing but dataskq left running at that time:

Nov 27 04:50:00 luca /usr/sbin/cron[9493]: (root) CMD (/usr/libexec/atrun)
Nov 27 04:51:00 luca /usr/sbin/cron[9497]: (root) CMD (/usr/local/directadmin/dataskq)
Nov 27 04:52:00 luca /usr/sbin/cron[9500]: (root) CMD (/usr/local/directadmin/dataskq)
Nov 27 04:53:00 luca /usr/sbin/cron[9503]: (root) CMD (/usr/local/directadmin/dataskq)
____________________________
Nov 27 04:56:00 luca /usr/sbin/cron[292]: (root) CMD (/usr/local/directadmin/dataskq)
Nov 27 04:57:00 luca /usr/sbin/cron[295]: (root) CMD (/usr/local/directadmin/dataskq)
Nov 27 04:58:00 luca /usr/sbin/cron[310]: (root) CMD (/usr/local/directadmin/dataskq)
____________________________
Nov 27 05:01:00 luca /usr/sbin/cron[292]: (root) CMD (/usr/local/directadmin/dataskq)
Nov 27 05:01:00 luca /usr/sbin/cron[293]: (root) CMD (adjkerntz -a)
Nov 27 05:02:00 luca /usr/sbin/cron[297]: (root) CMD (/usr/local/directadmin/dataskq)
Nov 27 05:03:00 luca /usr/sbin/cron[301]: (root) CMD (/usr/local/directadmin/dataskq)
Nov 27 05:04:00 luca /usr/sbin/cron[305]: (root) CMD (/usr/local/directadmin/dataskq)
Nov 27 05:05:00 luca /usr/sbin/cron[309]: (root) CMD (/usr/local/directadmin/dataskq)
Nov 27 05:05:00 luca /usr/sbin/cron[310]: (root) CMD (/usr/libexec/atrun)
Nov 27 05:06:00 luca /usr/sbin/cron[314]: (root) CMD (/usr/local/directadmin/dataskq)
Nov 27 05:07:00 luca /usr/sbin/cron[317]: (root) CMD (/usr/local/directadmin/dataskq)
Nov 27 05:08:00 luca /usr/sbin/cron[321]: (root) CMD (/usr/local/directadmin/dataskq)
Nov 27 05:09:00 luca /usr/sbin/cron[324]: (root) CMD (/usr/local/directadmin/dataskq)
Nov 27 05:10:00 luca /usr/sbin/cron[328]: (root) CMD (/usr/local/directadmin/dataskq)
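Since dataskq is logged every minute, any hole in the cron log marks a window in which the machine was down or rebooting. The following sketch flags such holes automatically. The function name `find_gaps` is invented for this example, and because syslog timestamps carry no year, the `year` parameter is an arbitrary placeholder.

```python
from datetime import datetime, timedelta


def find_gaps(log_lines, max_gap=timedelta(minutes=1), year=2000):
    """Yield (previous, current) timestamp pairs more than max_gap apart.

    `year` is only a placeholder so the year-less syslog stamps can be parsed.
    """
    previous = None
    for line in log_lines:
        try:
            stamp = datetime.strptime("%d %s" % (year, line[:15]),
                                      "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # separator rows or anything without a timestamp
        if previous is not None and stamp - previous > max_gap:
            yield previous, stamp
        previous = stamp


sample = [
    "Nov 27 04:52:00 luca /usr/sbin/cron[9500]: (root) CMD (/usr/local/directadmin/dataskq)",
    "Nov 27 04:53:00 luca /usr/sbin/cron[9503]: (root) CMD (/usr/local/directadmin/dataskq)",
    "____________________________",
    "Nov 27 04:56:00 luca /usr/sbin/cron[292]: (root) CMD (/usr/local/directadmin/dataskq)",
    "Nov 27 04:57:00 luca /usr/sbin/cron[295]: (root) CMD (/usr/local/directadmin/dataskq)",
    "Nov 27 04:58:00 luca /usr/sbin/cron[310]: (root) CMD (/usr/local/directadmin/dataskq)",
    "____________________________",
    "Nov 27 05:01:00 luca /usr/sbin/cron[292]: (root) CMD (/usr/local/directadmin/dataskq)",
    "Nov 27 05:02:00 luca /usr/sbin/cron[297]: (root) CMD (/usr/local/directadmin/dataskq)",
]
for before, after in find_gaps(sample):
    print(before.strftime("%H:%M:%S"), "->", after.strftime("%H:%M:%S"))
```

On the excerpt above this reports the two holes, 04:53 to 04:56 and 04:58 to 05:01, which line up with the reboots and with the cron PIDs restarting at low numbers.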
 
Check /var/log/directadmin/system.log and /var/log/directadmin/errortaskq.log to see if the dataskq was doing anything at that hour. It just checks for dead processes and for data in the task.queue files every minute. The task.queue files are filled via crons, so anything "extra" should show up in the cron log (e.g., the tallies). If there are any dead processes, they will show up in those logs as restart attempts.

John
 