Captain's Universe Home


Linux Harddisk Monitoring with SmartMonTools (smartctl)

If the kernel log file already shows harddisk errors (for IDE harddisks, e.g. DriveReady SeekComplete Error status=0x51 DriveStatusError error=0x04), it is probably too late to install SmartMonTools, and you had better replace the harddisk.

SmartMonTools monitors the harddisks in your server/workstation and will most likely alert you before a drive dies.


Install SmartMonTools from source or install a pre-compiled package (e.g. deb or rpm) for your system. I will not cover the compilation and installation from source, since every good distribution contains a SmartMonTools package.

After installing, you can check your harddisk with the command "smartctl".
First check if your drive supports SMART at all:
# smartctl -i /dev/hda
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
[...]
SMART support is: Available - device has SMART capability.
SMART support is: Disabled
or this if SMART is already enabled:
SMART support is: Enabled
If SMART is supported, switch it on with:
# smartctl -s on /dev/hda
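The check-then-enable step can also be scripted. The following is only a minimal sketch and not part of SmartMonTools: the function name smart_state is made up for illustration, and it looks only for the two "SMART support is:" lines shown in the output above, so anything else is reported as "unknown".

```shell
#!/bin/sh
# Classify `smartctl -i` output read from stdin.
# Prints "enabled", "disabled", or "unknown" (neither line was found).
# smart_state is a hypothetical helper name, not a SmartMonTools command.
smart_state() {
    awk '
        /^SMART support is: +Enabled/  { state = "enabled" }
        /^SMART support is: +Disabled/ { state = "disabled" }
        END { print (state == "" ? "unknown" : state) }
    '
}
```

As root, it could be used like this: state=$(smartctl -i /dev/hda | smart_state), followed by [ "$state" = disabled ] && smartctl -s on /dev/hda.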
Then check which tests are supported by your harddisk:
# smartctl -c /dev/hda
The output will look like:
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                 (2460) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  35) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
(Don't be confused by the "Self-test routine in progress..." status: a test was running at the time I copied this output. Normally it will say "Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.")

At the end you see that 3 tests are supported:
  • Short self-test routine = short test
  • Extended self-test routine = long test
  • Conveyance self-test routine = conveyance test
The number in parentheses is the recommended polling time in minutes, i.e. the approximate duration of the test. The short test is a quick check; only the long test examines the disk in depth and is therefore more thorough.
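To pull those durations out of the "smartctl -c" output in a script, something like the following could work. It is a rough sketch that assumes the two-line layout shown above (a "... self-test routine" line followed by a "recommended polling time:" line); selftest_minutes is a made-up helper name.

```shell
#!/bin/sh
# Read `smartctl -c` output on stdin and print "<test> <minutes>" per line.
# Assumes the layout shown in the article; selftest_minutes is hypothetical.
# Note: the "extended" test is the one started with `smartctl -t long`.
selftest_minutes() {
    awk '
        # e.g. "Short self-test routine" -> remember "short"
        /^[A-Za-z]+ self-test routine/ { name = tolower($1); next }
        /^recommended polling time:/ && name != "" {
            gsub(/[^0-9]/, "", $0)   # "(   2) minutes." -> "2"
            print name, $0
            name = ""
        }
    '
}
```

Run as root with: smartctl -c /dev/hda | selftest_minutes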
You can invoke those tests with (one at a time!):
# smartctl -t short /dev/hda
# smartctl -t long /dev/hda
# smartctl -t conveyance /dev/hda
"smartctl" will print how long the test takes and the date/time at which it is expected to finish.


After the test time has elapsed, check the results with:
# smartctl -l selftest /dev/hda
The output will be something like this:
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%      4697         -
# 2  Extended offline    Completed without error       00%      4697         -
# 3  Short offline       Completed without error       00%      4696         -
This shows that all tests completed without errors. If errors show up, it's probably best to back up your data immediately and replace the disk.
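If you want a script to look at the self-test log for you, a sketch along these lines may help. It assumes the log format shown above, and it deliberately flags every entry that is not "Completed without error", which also includes tests that were aborted or are still in progress. failed_selftests is a made-up name.

```shell
#!/bin/sh
# Read `smartctl -l selftest` output on stdin.
# Prints every log entry whose status is not "Completed without error"
# and returns exit status 1 if there was at least one such entry.
# failed_selftests is a hypothetical helper, not part of SmartMonTools.
failed_selftests() {
    awk '
        /^# *[0-9]+ / && !/Completed without error/ { print; bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}
```

Example (as root): smartctl -l selftest /dev/hda | failed_selftests || echo "check /dev/hda"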

The full output, with all sorts of SMART data, is shown with:
# smartctl -a /dev/hda
Another interesting output:
# smartctl -A /dev/hda
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   158   157   021    Pre-fail  Always       -       3066
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       524
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4695
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       523
194 Temperature_Celsius     0x0022   111   094   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
This shows some interesting statistics for your harddrive.
The columns to look at are the attribute name (ATTRIBUTE_NAME) and RAW_VALUE. Most attributes are just nice to know, e.g.
Spin_Up_Time, Start_Stop_Count, Power_On_Hours, Power_Cycle_Count
Others need regular attention/monitoring:
Spin_Retry_Count, Calibration_Retry_Count, Temperature_Celsius, Current_Pending_Sector
And the last group is very important:
Reallocated_Sector_Ct, Seek_Error_Rate, Reallocated_Event_Count, Offline_Uncorrectable,
UDMA_CRC_Error_Count, Multi_Zone_Error_Rate, Hardware_ECC_Recovered (the last one does not appear in the output above)
If you see increasing values in this last category, you should definitely replace the harddrive to be on the safe side.
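As a first-pass check of the very important attributes, the attribute table can be filtered with a small script. This is a sketch under assumptions: it expects the ten-column layout shown above, it only reports raw values that are not zero (it does not track whether they increase over time), and on some drives certain raw values are vendor-specific and may be non-zero on a healthy disk. nonzero_critical_attrs and WATCH are names invented for this example.

```shell
#!/bin/sh
# Attributes the article lists as "very important".
WATCH="Reallocated_Sector_Ct Seek_Error_Rate Reallocated_Event_Count \
Offline_Uncorrectable UDMA_CRC_Error_Count Multi_Zone_Error_Rate \
Hardware_ECC_Recovered"

# Read `smartctl -A` output on stdin and print "NAME RAW_VALUE" for each
# watched attribute whose raw value (column 10) is not zero.
# nonzero_critical_attrs is a hypothetical helper name.
nonzero_critical_attrs() {
    awk -v list="$WATCH" '
        BEGIN {
            n = split(list, names, " ")
            for (i = 1; i <= n; i++) watch[names[i]] = 1
        }
        ($2 in watch) && ($10 + 0 != 0) { print $2, $10 }
    '
}
```

Run as root with: smartctl -A /dev/hda | nonzero_critical_attrs. Saving the output regularly and comparing it with the previous run would show whether a value is actually increasing.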


Last but not least, here is the configuration of the smartd daemon in /etc/smartd.conf

See the man pages of smartd and smartd.conf for details, but this one seems to be a nice configuration:
/dev/hda -a -o on -S on -s (S/../.././19|L/../../3/21|C/../.././20) -m root
-a = monitor all SMART properties of the drive
-o on = enable automatic offline data collection
-S on = enable attribute autosave
S/../.././19 = short test every day at 19:00
C/../.././20 = conveyance test every day at 20:00
L/../../3/21 = long test every Wednesday (3) at 21:00
-m root = root will be emailed if anything strange occurs
Save the smartd.conf file and restart smartd.
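The -s schedule is a regular expression that is compared against a string of the form T/MM/DD/d/HH (test type, month, day of month, day of week with 1 = Monday, hour). To get a feeling for when an expression would trigger, it can be tried out with grep. This is only an illustration: schedule_matches and now_string are made-up helpers, and anchoring the expression to the whole string (grep -x) is an assumption about how smartd matches.

```shell
#!/bin/sh
# Succeeds if the schedule regex $1 matches the date string $2
# (format T/MM/DD/d/HH). Whole-string matching is an assumption.
schedule_matches() {
    printf '%s\n' "$2" | grep -E -x -q -- "$1"
}

# Build the string for the current time and test type $1 (S, L or C).
# %u is the day of week, 1 (Monday) to 7 (Sunday).
now_string() {
    printf '%s/%s\n' "$1" "$(date +%m/%d/%u/%H)"
}

SCHEDULE='(S/../.././19|L/../../3/21|C/../.././20)'

# A Wednesday (3) at 21:00 matches the long-test branch:
schedule_matches "$SCHEDULE" 'L/03/29/3/21' && echo "long test would start"
# A Thursday (4) at 21:00 matches nothing:
schedule_matches "$SCHEDULE" 'L/03/30/4/21' || echo "no long test"
```

To check the current hour, use: schedule_matches "$SCHEDULE" "$(now_string S)".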

Last-Modified: Fri, 31 Mar 2006 19:37:37 GMT
