Linux Harddisk Monitoring with SmartMonTools (smartctl)
If you already see harddisk errors in the kernel log file (e.g. IDE harddisk errors such as
"DriveReady SeekComplete Error status=0x51 DriveStatusError error=0x04"),
it's probably too late to install SmartMonTools and you'd better replace the harddisk.
SmartMonTools monitors the harddisks in your server/workstation and will most likely
alert you before a drive dies.
Install SmartMonTools from source or install a pre-compiled package (e.g. deb or rpm) for
your system. I will not cover the compilation and installation from source, since every good
distribution contains a SmartMonTools package.
After installing, you can check your harddisk with the command "smartctl".
First check if your drive supports SMART at all:
# smartctl -i /dev/hda
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
[...]
SMART support is: Available - device has SMART capability.
SMART support is: Disabled
or this if SMART is already enabled:
SMART support is: Enabled
If SMART is supported, switch it on with:
# smartctl -s on /dev/hda
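If you want to script this check, the state can be read from the "SMART support is:" lines of the "smartctl -i" output. The following is only a minimal sketch: the function name smart_state is made up for illustration, and it assumes the output looks like the lines shown above.

```shell
# Hypothetical helper: classify the SMART state from "smartctl -i"
# output read on stdin. Prints one of: enabled, disabled, unsupported.
smart_state() {
    awk '
        /^SMART support is:[ \t]*Enabled/  { state = "enabled" }
        /^SMART support is:[ \t]*Disabled/ { state = "disabled" }
        # Neither line seen: treat the drive as not supporting SMART.
        END { print (state == "" ? "unsupported" : state) }
    '
}

# Intended use (as root, on a real system):
#   smartctl -i /dev/hda | smart_state
```

A wrapper script could then run "smartctl -s on" only when the result is "disabled".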
Then check which tests are supported by your harddisk:
# smartctl -c /dev/hda
The output will look like:
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status: (0x85) Offline data collection activity
was aborted by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 249) Self-test routine in progress...
90% of test remaining.
Total time to complete Offline
data collection: (2460) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 35) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
(Don't be confused by "Self-test routine in progress...": a test was running at the time I copied
this output. Normally it will say "Self-test execution status: ( 0) The previous self-test routine
completed without error or no self-test has ever been run".)
At the end you see that 3 tests are supported:
- Short self-test routine = short test
- Extended self-test routine = long test
- Conveyance self-test routine = conveyance test
The number in brackets is the recommended polling time in minutes, i.e. the approximate duration
of the test. The short test is a quick check, but only the long test is an in-depth test and
therefore more accurate.
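The polling times can also be pulled out of the "smartctl -c" output by a script, for example to decide how long to wait before reading the results. This is a sketch only: polling_times is a made-up name, and the parsing assumes the two-line layout shown in the output above.

```shell
# Hypothetical helper: read "smartctl -c" output on stdin and print
# one "<test> <minutes>" pair per supported self-test.
polling_times() {
    awk '
        # Remember the test name from a line such as "Short self-test routine".
        /self-test routine[ \t]*$/ { name = $1; next }
        # The following line carries the time, e.g. "( 35) minutes."
        /recommended polling time:/ && name != "" {
            minutes = $0
            gsub(/[^0-9]/, "", minutes)   # keep only the digits
            print tolower(name), minutes
            name = ""
        }
    '
}

# Intended use (as root, on a real system):
#   smartctl -c /dev/hda | polling_times
```

For the output above this would print "short 2", "extended 35" and "conveyance 5", one per line.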
You can invoke those tests with (one at a time!):
# smartctl -t short /dev/hda
# smartctl -t long /dev/hda
# smartctl -t conveyance /dev/hda
"smartctl" will print how long the test will take and when it is expected to finish (date/time).
Once the test time has elapsed, check the results with:
# smartctl -l selftest /dev/hda
The output will be something like this:
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Conveyance offline Completed without error 00% 4697 -
# 2 Extended offline Completed without error 00% 4697 -
# 3 Short offline Completed without error 00% 4696 -
This shows that the tests were completed without errors. If errors show up, it's probably best
to back up your data immediately and replace the disk.
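Checking the log can be automated. The sketch below lists every log entry whose status is anything other than "Completed without error"; note that this also catches aborted or still running tests, so it is a deliberately cautious filter. The name selftest_problems is an assumption made for this example.

```shell
# Hypothetical helper: read "smartctl -l selftest" output on stdin,
# print every log entry that is not "Completed without error" and
# return a non-zero exit status if there was at least one.
selftest_problems() {
    awk '
        # Log entries start with "# <number>"; the header line does not.
        /^# *[0-9]+ / && !/Completed without error/ { print; bad = 1 }
        END { exit (bad + 0) }
    '
}

# Intended use (as root, on a real system):
#   smartctl -l selftest /dev/hda | selftest_problems || echo "check /dev/hda"
```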
The full output with all sorts of SMART data is shown with:
# smartctl -a /dev/hda
Another interesting output:
# smartctl -A /dev/hda
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0003 158 157 021 Pre-fail Always - 3066
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 524
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 200 200 051 Pre-fail Always - 0
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 4695
10 Spin_Retry_Count 0x0013 100 100 051 Pre-fail Always - 0
11 Calibration_Retry_Count 0x0012 100 100 051 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 523
194 Temperature_Celsius 0x0022 111 094 000 Old_age Always - 32
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0009 200 200 051 Pre-fail Offline - 0
This shows some interesting statistics for your harddrive.
Each row is one attribute (ATTRIBUTE_NAME); the other interesting column is RAW_VALUE.
Most attributes are just nice to know, e.g.
Spin_Up_Time, Start_Stop_Count, Power_On_Hours, Power_Cycle_Count
Others need regular attention/monitoring:
Spin_Retry_Count, Calibration_Retry_Count, Temperature_Celsius, Current_Pending_Sector
And the last group is very important:
Reallocated_Sector_Ct, Seek_Error_Rate, Reallocated_Event_Count, Offline_Uncorrectable,
UDMA_CRC_Error_Count, Multi_Zone_Error_Rate, Hardware_ECC_Recovered (the last one is not in the output above)
If you see increasing values in this last group, you should definitely replace the
harddrive to be safe.
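A script can watch the important group for you. The following sketch prints every attribute from that group whose RAW_VALUE is above zero. It is a simplification: it looks at one snapshot instead of tracking increases over time, it assumes the ten-column layout shown above, and on some drives RAW_VALUE is encoded in a vendor-specific way, so treat a hit as a reason to look closer. The name critical_nonzero is made up for this example.

```shell
# Hypothetical helper: read "smartctl -A" output on stdin and print
# "<attribute> <raw value>" for watched attributes with a raw value > 0.
critical_nonzero() {
    awk '
        BEGIN {
            # The "very important" attributes named in the text.
            list = "Reallocated_Sector_Ct Seek_Error_Rate Reallocated_Event_Count"
            list = list " Offline_Uncorrectable UDMA_CRC_Error_Count"
            list = list " Multi_Zone_Error_Rate Hardware_ECC_Recovered"
            n = split(list, names, " ")
            for (i = 1; i <= n; i++) watch[names[i]] = 1
        }
        # Attribute rows start with a numeric ID; column 10 is RAW_VALUE.
        $1 ~ /^[0-9]+$/ && ($2 in watch) && $10 + 0 > 0 { print $2, $10 }
    '
}

# Intended use (as root, on a real system):
#   smartctl -A /dev/hda | critical_nonzero
```

For the healthy drive shown above, this would print nothing.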
Last but not least, here is the configuration of the smartd daemon in /etc/smartd.conf.
See the man page of smartd.conf for details, but this one seems to be a nice configuration:
/dev/hda -a -o on -S on -s (S/../.././19|L/../../3/21|C/../.././20) -m root
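-a = monitor all SMART properties
-o on = enable automatic offline data collection
-S on = enable attribute autosave
S/../.././19 = short test every day at 19:00
L/../../3/21 = long test every Wednesday (3) at 21:00
C/../.././20 = conveyance test every day at 20:00
-m root = root will be emailed if anything strange occurs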
Save the smartd.conf file and restart smartd.
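To see what the -s schedule means in practice, you can try the regular expression against strings of the form T/MM/DD/d/HH (test type, month, day, weekday with 1 = Monday, hour). The sketch below does this with grep; the function names are made up for illustration, and requiring a whole-line match (grep -x) is this example's reading of the patterns, not a statement about smartd's internals.

```shell
# The -s regular expression from the smartd.conf line above.
schedule='(S/../.././19|L/../../3/21|C/../.././20)'

# Hypothetical helper: succeed if a string of the form T/MM/DD/d/HH
# matches the schedule as a whole.
schedule_matches() {
    printf '%s\n' "$1" | grep -Eqx "$schedule"
}

# Hypothetical helper: print the test types (S, L, C) that the
# schedule selects for a given MM/DD/d/HH stamp.
tests_due() {
    for t in S L C; do
        if schedule_matches "$t/$1"; then printf '%s\n' "$t"; fi
    done
}

# Intended use: which tests would be selected in the current hour?
#   tests_due "$(date +%m/%d/%u/%H)"
```

For example, tests_due "03/29/3/21" (a Wednesday at 21:00) prints L, and tests_due "03/31/5/19" (a Friday at 19:00) prints S.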
Last-Modified: Fri, 31 Mar 2006 19:37:37 GMT