Duboux
Verified User
- Joined
- Apr 20, 2007
- Messages
- 244
Hi,
I'm having some odd problems with my server.
It shows a high load average, but everything else I see in "top" suggests that nothing is actually busy.
It started with this:
Code:
top - 10:44:04 up 113 days, 7:41, 1 user, load average: 8.75, 9.27, 8.99
Tasks: 1059 total, 1 running, 1028 sleeping, 0 stopped, 30 zombie
Cpu(s): 1.2%us, 0.6%sy, 0.1%ni, 82.1%id, 15.6%wa, 0.1%hi, 0.3%si, 0.0%st
Mem: 2049080k total, 1973000k used, 76080k free, 133816k buffers
Swap: 4095992k total, 12k used, 4095980k free, 998116k cached
and after a reboot:
Code:
top - 09:51:10 up 21:51, 1 user, load average: 8.12, 7.92, 8.69
Tasks: 1397 total, 1 running, 1374 sleeping, 0 stopped, 22 zombie
Cpu(s): 0.3%us, 0.5%sy, 0.0%ni, 0.0%id, 98.9%wa, 0.2%hi, 0.2%si, 0.0%st
Mem: 2049072k total, 2036928k used, 12144k free, 62944k buffers
Swap: 4095992k total, 104k used, 4095888k free, 746988k cached
And there don't seem to be any heavy processes:
Code:
top - 22:43:44 up 2 days, 13:05, 1 user, load average: 3.87, 4.07, 4.23
Tasks: 191 total, 1 running, 190 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.3%us, 0.3%sy, 0.0%ni, 35.1%id, 64.2%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id,100.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2049044k total, 2036900k used, 12144k free, 26148k buffers
Swap: 4095992k total, 100k used, 4095892k free, 1277908k cached
PID PPID UID USER RUSER TTY PR NI VIRT SWAP RES nDRT TIME+ %CPU %MEM S COMMAND
8242 3207 0 root root pts/0 15 0 12756 11m 1208 0 0:03.99 0.7 0.1 R top
1 0 0 root root ? 15 0 10364 9716 648 0 0:02.88 0.0 0.0 S init [3]
2 1 0 root root ? RT -5 0 0 0 0 0:00.19 0.0 0.0 S [migration/0]
3 1 0 root root ? 34 19 0 0 0 0 0:00.36 0.0 0.0 S [ksoftirqd/0]
4 1 0 root root ? RT -5 0 0 0 0 0:00.00 0.0 0.0 S [watchdog/0]
I noticed the extreme %wa values, meaning the CPUs are sitting idle while they wait for I/O to complete.
So there must be something wrong, like interrupt (IRQ) problems, I/O errors, bad disk sectors, a memory problem, etc.
I checked dmesg and /var/log/messages for anything that could relate to such failures, but nothing was logged about it.
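To make the %wa reading concrete, here is a minimal Python sketch that pulls the fields out of one of the "Cpu" summary lines shown above. The function name `parse_top_cpu_line` is made up for this illustration, and the sample line is the post-reboot one with the colour markup removed.

```python
import re


def parse_top_cpu_line(line):
    """Return e.g. {'us': 0.3, 'sy': 0.5, ..., 'wa': 98.9} from a top 'Cpu' line."""
    # Each field looks like '98.9%wa'; fields may or may not be separated by a space.
    return {name: float(value) for value, name in re.findall(r"([\d.]+)%(\w+)", line)}


# Sample copied from the post-reboot top output.
sample = "Cpu(s): 0.3%us, 0.5%sy, 0.0%ni, 0.0%id, 98.9%wa, 0.2%hi, 0.2%si, 0.0%st"
fields = parse_top_cpu_line(sample)

# Time spent actually executing code (user + system + nice).
busy = fields["us"] + fields["sy"] + fields["ni"]
print(f"busy={busy:.1f}%  iowait={fields['wa']:.1f}%  idle={fields['id']:.1f}%")
```

Under this reading the CPUs spend under 1% of their time running code and almost all of it waiting on I/O, which would fit a high load average without any heavy process showing up in the list.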
I did some tests:
Code:
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.00 49.75 0.00 50.25
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
hda 0.00 198.00 0.00 0.00 0.00 0.00 0.00 68.08 0.00 0.00 100.10
hda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
hda2 0.00 198.00 0.00 0.00 0.00 0.00 0.00 68.08 0.00 0.00 100.10
dm-0 0.00 0.00 0.00 286.00 0.00 2288.00 8.00 166.79 0.00 3.50 100.10
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.50 91.04 0.00 8.46
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
hda 0.00 4.95 0.00 10.89 0.00 102.97 9.45 122.03 4958.18 91.09 99.21
hda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
hda2 0.00 4.95 0.00 10.89 0.00 102.97 9.45 122.03 4958.18 91.09 99.21
dm-0 0.00 0.00 0.00 27.72 0.00 221.78 8.00 327.89 2301.79 35.79 99.21
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
(Which devices are dm-0 and dm-1? -->
[root@server ~]# dmsetup ls
VolGroup00-LogVol01 (253, 1)
VolGroup00-LogVol00 (253, 0)
)
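The two outputs above can be tied together with a small script. This is only a sketch under assumptions: it takes dm-N to be the device-mapper device with minor number N (which is how the dmsetup listing lines up with the iostat names here), it expects exactly the column layout shown above, and the helper names `parse_dmsetup_ls` and `saturated_devices` and the 90% threshold are invented for the example.

```python
import re

# Sample text copied from the dmsetup and second iostat outputs above.
DMSETUP_SAMPLE = """\
VolGroup00-LogVol01 (253, 1)
VolGroup00-LogVol00 (253, 0)
"""

IOSTAT_SAMPLE = """\
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
hda 0.00 4.95 0.00 10.89 0.00 102.97 9.45 122.03 4958.18 91.09 99.21
hda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
hda2 0.00 4.95 0.00 10.89 0.00 102.97 9.45 122.03 4958.18 91.09 99.21
dm-0 0.00 0.00 0.00 27.72 0.00 221.78 8.00 327.89 2301.79 35.79 99.21
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
"""

COLUMNS = ["rrqm/s", "wrqm/s", "r/s", "w/s", "rsec/s", "wsec/s",
           "avgrq-sz", "avgqu-sz", "await", "svctm", "%util"]


def parse_dmsetup_ls(text):
    """Map 'dm-N' to the device-mapper name, taking N from the minor number."""
    names = {}
    for match in re.finditer(r"^(\S+)\s+\((\d+),\s*(\d+)\)", text, re.MULTILINE):
        name, _major, minor = match.groups()
        names[f"dm-{minor}"] = name
    return names


def saturated_devices(text, util_threshold=90.0):
    """Return (device, stats) pairs whose %util is at or above the threshold."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip the header row and anything that is not a full device row.
        if len(parts) != len(COLUMNS) + 1 or parts[0] == "Device:":
            continue
        stats = dict(zip(COLUMNS, map(float, parts[1:])))
        if stats["%util"] >= util_threshold:
            rows.append((parts[0], stats))
    return rows


names = parse_dmsetup_ls(DMSETUP_SAMPLE)
for device, stats in saturated_devices(IOSTAT_SAMPLE):
    label = names.get(device, device)
    print(f"{label}: util={stats['%util']}% await={stats['await']} ms "
          f"queue={stats['avgqu-sz']}")
```

On the sample data this flags hda, hda2 and dm-0 (VolGroup00-LogVol00), while hda1 and dm-1 stay idle, so all the waiting appears to sit on the one logical volume and the partition underneath it.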
I checked whether the disk was full:
Code:
[root@server ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
899G 23G 830G 3% /
/dev/hda1 99M 47M 48M 50% /boot
tmpfs 1001M 0 1001M 0% /dev/shm
Code:
[root@server ~]# dmesg |grep hda
hda: WDC WD10EARS-003BB1, ATA DISK drive
hda: max request size: 512KiB
hda: 1953525168 sectors (1000204 MB), CHS=65535/255/63
hda: cache flushes supported
hda: hda1 hda2
EXT3 FS on hda1, internal journal
I checked the speed of the disk 3 times:
Code:
[root@server rsync]# hdparm -tT /dev/hda
/dev/hda:
Timing cached reads: 4128 MB in 2.00 seconds = 2064.56 MB/sec
Timing buffered disk reads: 8 MB in 3.36 seconds = 2.38 MB/sec
Timing cached reads: 4156 MB in 2.00 seconds = 2078.69 MB/sec
Timing buffered disk reads: 6 MB in 3.34 seconds = 1.80 MB/sec
Timing cached reads: 4132 MB in 2.00 seconds = 2067.08 MB/sec
Timing buffered disk reads: 2 MB in 4.51 seconds = 453.69 kB/sec
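Because the third run is reported in kB/sec rather than MB/sec, the three results are easier to compare once they are in the same unit. The sketch below does that, assuming 1 MB = 1024 kB (which is roughly consistent with the "2 MB in 4.51 seconds" line); the function name `buffered_read_rates` is invented for this example.

```python
import re

# Buffered-read lines copied from the three hdparm runs above.
HDPARM_SAMPLE = """\
Timing buffered disk reads: 8 MB in 3.36 seconds = 2.38 MB/sec
Timing buffered disk reads: 6 MB in 3.34 seconds = 1.80 MB/sec
Timing buffered disk reads: 2 MB in 4.51 seconds = 453.69 kB/sec
"""

# Assumption: hdparm's kB is 1/1024 of its MB.
UNIT_TO_MB = {"MB/sec": 1.0, "kB/sec": 1.0 / 1024}


def buffered_read_rates(text):
    """Return the buffered disk read rates in MB/sec, one per matching line."""
    pattern = r"buffered disk reads:.*=\s*([\d.]+)\s*(MB/sec|kB/sec)"
    return [float(value) * UNIT_TO_MB[unit] for value, unit in re.findall(pattern, text)]


rates = buffered_read_rates(HDPARM_SAMPLE)
for run, rate in enumerate(rates, start=1):
    print(f"run {run}: {rate:.2f} MB/sec")
print(f"average: {sum(rates) / len(rates):.2f} MB/sec")
```

All three runs come out below 2.5 MB/sec, and the last one is under 0.5 MB/sec, so the throughput is not only low but also getting worse from run to run.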
Getting some info from the disk:
Code:
[root@server rsync]# hdparm -i /dev/hda
/dev/hda:
Model=WDC WD10EARS-003BB1, FwRev=80.00A80, SerialNo=WD-WCAV5L679802
Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=50
BuffType=unknown, BuffSize=0kB, MaxMultSect=16, MultSect=16
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=268435455
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2
AdvancedPM=no WriteCache=enabled
Drive conforms to: Unspecified: ATA/ATAPI-1 ATA/ATAPI-2 ATA/ATAPI-3 ATA/ATAPI-4 ATA/ATAPI-5 ATA/ATAPI-6 ATA/ATAPI-7
* signifies the current active mode
and here's what caught my eye:
Code:
[root@server rsync]# hdparm -I /dev/hda
/dev/hda:
ATA device, with non-removable media
Model Number: WDC WD10EARS-003BB1
Serial Number: WD-WCAV5L679802
Firmware Revision: 80.00A80
Transport: Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5
Standards:
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 65535
heads 16 1
sectors/track 63 63
--
CHS current addressable sectors: 4128705
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 1953525168
device size with M = 1024*1024: 953869 MBytes
device size with M = 1000*1000: 1000204 MBytes (1000 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, with device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
Recommended acoustic management value: 128, current value: 254
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* NOP cmd
* DOWNLOAD_MICROCODE
Power-Up In Standby feature set
* SET_FEATURES required to spinup after power up
* SET_MAX security extension
Automatic Acoustic Management feature set
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* 64-bit World wide name
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* SATA-I signaling speed (1.5Gb/s)
* SATA-II signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Host-initiated interface power management
* Phy event counters
* unknown 76[12]
DMA Setup Auto-Activate optimization
* Software settings preservation
Security:
Master password revision code = 65534
supported
not enabled
not locked
frozen
not expired: security count
supported: enhanced erase
228min for SECURITY ERASE UNIT. 228min for ENHANCED SECURITY ERASE UNIT.
Checksum: correct
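As a sanity check on the geometry figures, the arithmetic behind them can be reproduced in a few lines. This is only a sketch: the 512-byte sector size is an assumption (it is the value that makes the reported device size come out right), and `chs_sectors` is a name made up for the example.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def chs_sectors(cylinders, heads, sectors_per_track):
    """Number of sectors addressable through a cylinder/head/sector geometry."""
    return cylinders * heads * sectors_per_track


current = chs_sectors(65535, 1, 63)    # "current" column of hdparm -I
logical = chs_sectors(16383, 16, 63)   # "Logical max" column
lba48 = 1953525168                     # LBA48 user addressable sectors

print(current)                         # 4128705, matches "CHS current addressable sectors"
print(logical)                         # 16514064, matches CurSects in hdparm -i
print(lba48 * SECTOR_BYTES // 10**6)   # 1000204, matches "M = 1000*1000"
print(lba48 * SECTOR_BYTES // 2**20)   # 953869, matches "M = 1024*1024"
print(f"CHS current covers {current / lba48:.2%} of the LBA48 range")
```

Both CHS geometries cover well under 1% of the sectors the drive reports through LBA48, so it is possible that the "heads" figure is only a legacy addressing value and not a count of the heads in use. That interpretation is an assumption and not something the output itself confirms.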
My conclusion:
It seems like my disk is using only 1 head to read/write data instead of the 16 it is supposed to use, and is therefore very slow, causing a system-wide slow response and making processes pile up.
Am I correct, and what can I do?
Can I configure this disk to use 16 heads?
Can I just remove it, if I have another disk in RAID-1?
The clients on this server have complained that they haven't received email for 2 days now.
Is this related to this disk problem?
Thanks
