Playing with a "failing" harddisk - badblocks, hdparm, smartctl

I am playing with an empty HDD that smartd considers to be failing.

smartctl -A /dev/sdX
smartctl 7.2 2021-01-17 r5170 [x86_64-linux-5.10.10-200.fc33.x86_64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   192   192   051    Pre-fail  Always       -       56286
  3 Spin_Up_Time            0x0027   194   165   021    Pre-fail  Always       -       5291
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3452
  5 Reallocated_Sector_Ct   0x0033   133   133   140    Pre-fail  Always   FAILING_NOW 1265
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       12286
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   097   097   000    Old_age   Always       -       3226
192 Power-Off_Retract_Count 0x0032   196   196   000    Old_age   Always       -       3179
193 Load_Cycle_Count        0x0032   135   135   000    Old_age   Always       -       196190
194 Temperature_Celsius     0x0022   118   102   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       943
197 Current_Pending_Sector  0x0032   001   001   000    Old_age   Always       -       64793
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       38
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   197   197   000    Old_age   Offline      -       1032

Game Target: use all the good sectors of the failing HDD to form a “perfect” virtual block device.

My original game plan:
A/. Identify all bad blocks of the device, using badblocks
$ sudo badblocks -wsv -t 0xff -o bb-sdX.txt /dev/sdX

B/. Use bb-sdX.txt to produce a table file that dmsetup can use, as per https://unix.stackexchange.com/a/362257
$ sudo dmsetup create nobbsdX bb-sdX-table

C/. Make a filesystem, and use f3 (GitHub - AltraMayor/f3: F3 - Fight Flash Fraud)
to test write/read correctness:
$ sudo f3write /mnt/.nobbsdX
$ sudo f3read /mnt/.nobbsdX
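Step B above can be sketched as a small script that turns a bad-sector list into a dmsetup table, following the idea in the linked answer: good runs become "linear" segments and each bad sector becomes an "error" target, so logical and physical offsets stay 1:1. This is a sketch with toy values; DEV, SIZE and the sector numbers are stand-ins (a real run would use `SIZE=$(sudo blockdev --getsz /dev/sdX)` and the badblocks output converted to 512-byte sectors, e.g. from `badblocks -b 512`).

```shell
# Toy stand-ins, not real disk values.
DEV=/dev/sdX
SIZE=1000                                    # device size in 512-byte sectors
printf '100\n101\n500\n' > bad-sectors.txt   # sorted bad-sector LBAs

# Emit "linear" segments for good runs and an "error" target for each
# bad sector, keeping logical and physical offsets identical.
awk -v dev="$DEV" -v size="$SIZE" '
  {
    if ($1 > pos)
      printf "%d %d linear %s %d\n", pos, $1 - pos, dev, pos
    printf "%d 1 error\n", $1
    pos = $1 + 1
  }
  END {
    if (pos < size)
      printf "%d %d linear %s %d\n", pos, size - pos, dev, pos
  }' bad-sectors.txt > bb-sdX-table

cat bb-sdX-table
```

Loading it would then be `sudo dmsetup create nobbsdX bb-sdX-table`; reads of the error segments fail immediately instead of making the drive retry its weak sectors.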

Problems:

  • badblocks only completed 80% of the disk because the computer was “rebooted”, but ZERO bad blocks were reported
  • I proceeded to use f3 for the read/write test, which produced lots of errors, and certainly some of those errors are inside the 80% of the HDD that badblocks already checked
  • because I am using the btrfs filesystem, I also ran btrfs scrub on /dev/sdX, which reported lots of errors as expected; again, many of them are inside the 80% area checked by badblocks

Some of the errors reported during btrfs scrub:

Jan 29 14:23:49 amdf.lan kernel: BTRFS warning (device sdc1): i/o error at logical 181908836352 on dev /dev/sdc1, physical 181908836352, root 5, inode 425, offset 316645376, length 4096, links 1 (path: 169.h2w)
Jan 29 14:23:49 amdf.lan kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 66, rd 368, flush 0, corrupt 0, gen 0
Jan 29 14:23:49 amdf.lan kernel: BTRFS error (device sdc1): unable to fixup (regular) error at logical 181908836352 on dev /dev/sdc1
Jan 29 14:23:52 amdf.lan kernel: ata3.00: exception Emask 0x0 SAct 0x4000 SErr 0x0 action 0x0
Jan 29 14:23:52 amdf.lan kernel: ata3.00: irq_stat 0x40000008
Jan 29 14:23:52 amdf.lan kernel: ata3.00: failed command: READ FPDMA QUEUED
Jan 29 14:23:52 amdf.lan kernel: ata3.00: cmd 60/08:70:50:56:2d/00:00:15:00:00/40 tag 14 ncq dma 4096 in
res 41/40:00:50:56:2d/00:00:15:00:00/40 Emask 0x409 (media error)
Jan 29 14:23:52 amdf.lan kernel: ata3.00: status: { DRDY ERR }
Jan 29 14:23:52 amdf.lan kernel: ata3.00: error: { UNC }
Jan 29 14:23:52 amdf.lan kernel: ata3.00: configured for UDMA/133
Jan 29 14:23:52 amdf.lan kernel: sd 2:0:0:0: [sdc] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
Jan 29 14:23:52 amdf.lan kernel: sd 2:0:0:0: [sdc] tag#14 Sense Key : Medium Error [current]
Jan 29 14:23:52 amdf.lan kernel: sd 2:0:0:0: [sdc] tag#14 Add. Sense: Unrecovered read error - auto reallocate failed
Jan 29 14:23:52 amdf.lan kernel: sd 2:0:0:0: [sdc] tag#14 CDB: Read(10) 28 00 15 2d 56 50 00 00 08 00
Jan 29 14:23:52 amdf.lan kernel: blk_update_request: I/O error, dev sdc, sector 355292752 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0
Jan 29 14:23:52 amdf.lan kernel: ata3: EH complete
Jan 29 14:23:52 amdf.lan kernel: BTRFS warning (device sdc1): i/o error at logical 181908840448 on dev /dev/sdc1, physical 181908840448, root 5, inode 425, offset 316649472, length 4096, links 1 (path: 169.h2w)
Jan 29 14:23:52 amdf.lan kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 66, rd 369, flush 0, corrupt 0, gen 0
Jan 29 14:23:52 amdf.lan kernel: BTRFS error (device sdc1): unable to fixup (regular) error at logical 181908840448 on dev /dev/sdc1
Jan 29 14:25:35 amdf.lan kernel: ata3.00: exception Emask 0x0 SAct 0xc0001fff SErr 0x0 action 0x0
Jan 29 14:25:35 amdf.lan kernel: ata3.00: irq_stat 0x40000008
Jan 29 14:25:35 amdf.lan kernel: ata3.00: failed command: READ FPDMA QUEUED
Jan 29 14:25:35 amdf.lan kernel: ata3.00: cmd 60/00:f0:00:51:1f/05:00:16:00:00/40 tag 30 ncq dma 655360 in
res 41/40:00:60:54:1f/00:00:16:00:00/40 Emask 0x409 (media error)
Jan 29 14:25:35 amdf.lan kernel: ata3.00: status: { DRDY ERR }
Jan 29 14:25:35 amdf.lan kernel: ata3.00: error: { UNC }
Jan 29 14:25:35 amdf.lan kernel: ata3.00: configured for UDMA/133
Jan 29 14:25:35 amdf.lan kernel: sd 2:0:0:0: [sdc] tag#30 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
Jan 29 14:25:35 amdf.lan kernel: sd 2:0:0:0: [sdc] tag#30 Sense Key : Medium Error [current]
Jan 29 14:25:35 amdf.lan kernel: sd 2:0:0:0: [sdc] tag#30 Add. Sense: Unrecovered read error - auto reallocate failed
Jan 29 14:25:35 amdf.lan kernel: sd 2:0:0:0: [sdc] tag#30 CDB: Read(10) 28 00 16 1f 51 00 00 05 00 00
Jan 29 14:25:35 amdf.lan kernel: blk_update_request: I/O error, dev sdc, sector 371151968 op 0x0:(READ) flags 0x0 phys_seg 52 prio class 0
Jan 29 14:25:35 amdf.lan kernel: ata3: EH complete

Question 1:
Why can't badblocks identify any bad blocks? What other tools can be used to identify bad blocks?

Question 2:
BTRFS warning (device sdc1): i/o error at logical 499311915008 on dev /dev/sdc1, physical 499311915008, root 5, inode 2121, offset 1000136704, length 4096, links 1 (path: 1865.h2w)

I want to try testing with, as per Identify damaged files - ArchWiki
sudo hdparm --read-sector 4621327 /dev/sdX
sudo hdparm --repair-sector 4621327 --yes-i-know-what-i-am-doing /dev/sdX

output of --read-sector 499311915008

$sudo hdparm --read-sector 499311915008 /dev/sdb

/dev/sdb:
reading sector 499311915008: SG_IO: bad/missing sense data, sb: 70 00 05 00 00 00 00 0a 10 51 e0 01 21 00 00 00 a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
succeeded
0000 0000 0000 0000 0000 0000 0000 0000
[… 31 more identical lines of 0000 …]

Given that BTRFS warning, is 499311915008 labeled “physical” the correct value for --read-sector?
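Probably not as-is: a sanity check against the two numbers the kernel itself printed above (btrfs physical 181908840448 vs. blk_update_request sector 355292752) suggests that “physical” is a byte offset into /dev/sdc1, while --read-sector expects a 512-byte LBA relative to the whole disk. The conversion would be physical/512 plus the partition's start sector. The partition start of 2048 below is an assumption; verify it with `cat /sys/block/sdc/sdc1/start` or `fdisk -l /dev/sdc`.

```shell
# Assumption: sdc1 starts at LBA 2048 on the disk.
PHYSICAL=181908840448            # btrfs "physical" byte offset on /dev/sdc1
PART_START=2048                  # first LBA of the partition (assumed)
LBA=$(( PHYSICAL / 512 + PART_START ))
echo "$LBA"                      # 355292752, the sector in blk_update_request
```

That the result lands exactly on the sector the kernel reported as failing supports the byte-offset reading.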


You are a bit too optimistic about the HDD state.

An HDD is a mechanical, analog device, so the issues may not be limited to bad blocks, and some mechanics-specific issues may not be detected by SMART due to the difficulty of converting comprehensive analog information into a narrow digital representation.

Moreover, SMART relies on the built-in controller's firmware, whose code may contain bugs that affect the test results, and vendors are known to be lazy about fixing bugs, or to simply hide some issues for the sake of profit.

And even if we ignore those problems, you will likely face severe performance regressions or errors at the storage driver level, leading to filesystem driver errors, because the controller tries very hard to read or write sectors which aren't marked bad yet, but aren't entirely good either.


Thanks for the inputs.

The aim is not to use the created virtual device for the long term.

It just needs to get past one round of the write/read test.

It is just meant as a learning exercise.


Did you reboot the computer or was it spontaneous?

What is f3?

I am not sure why it rebooted; I was not in front of the machine at the time.

f3 (GitHub - AltraMayor/f3: F3 - Fight Flash Fraud) is a tool that creates test files and reads them back to detect errors.

Ah, I missed the f3 link.
Badblocks is the tool to test a hard drive. When using btrfs to do tests, you are testing a filesystem. I would run badblocks again and let it finish.


Hi @SampsonF,

Based on which SMART attributes?

Your plan seems doable, but on the condition that you are only dealing with some minor damage to some of the sectors on your platters. If the controller of the disk is also on the fritz, you won’t get very far; you might be faced with a frozen system or unexpected reboots and you won’t be able to just map out the bad blocks.

Have you read “Bad block HOWTO for smartmontools”? It’s a bit dated, but most of the things discussed there still apply.


Thank you for the pointer.

It is based on the Reallocated Sector Count (Reallocated_Sector_Ct).

Yes, I discovered that HOWTO after posting here.

I have finished running ddrescue and got the disk.map file. ddrescue reported about 2000 bad areas on my HDD.

I am running another round of badblocks to see whether it still reports no bad blocks.

Does your Reallocated_Sector_Ct increase by a lot as you are using the disk (not throughout its entire life, just these past few days)?
What about the Current_Pending_Sector, Reported_Uncorrect and Offline_Uncorrectable attributes? If it’s not too much trouble, could you upload/paste the output of smartctl -a /dev/sdc (or whatever letter your disk has been assigned) someplace?

Last capture of smartctl -A is in the opening post.

Below is the current output:

latest smartctl -A /dev/sdX

$ sudo smartctl -A /dev/sdb
smartctl 7.2 2021-01-17 r5171 [x86_64-linux-5.10.10-200.fc33.x86_64] (local build)
Copyright © 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   188   188   051    Pre-fail  Always       -       81338
  3 Spin_Up_Time            0x0027   194   165   021    Pre-fail  Always       -       5291
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3452
  5 Reallocated_Sector_Ct   0x0033   133   133   140    Pre-fail  Always   FAILING_NOW 1265
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       12331
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   097   097   000    Old_age   Always       -       3226
192 Power-Off_Retract_Count 0x0032   196   196   000    Old_age   Always       -       3179
193 Load_Cycle_Count        0x0032   135   135   000    Old_age   Always       -       196207
194 Temperature_Celsius     0x0022   114   102   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       1240
197 Current_Pending_Sector  0x0032   001   001   000    Old_age   Always       -       64793
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       38
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   197   197   000    Old_age   Offline      -       1032

Hello @SampsonF,

That’s interesting. One might say that the situation is critical, but stable…
After 45 hours of operation, the Reallocated_Sector_Ct, Current_Pending_Sector and Offline_Uncorrectable values have remained constant, but the disk has tried 297 more times to remap the damaged areas, unsuccessfully (based on the Reallocated_Event_Count going from 943 to 1240), probably because it had 1265 sectors to spare and they're all used up.
I’d say your best bet is to use the map you got from ddrescue and try to create a partition using the largest available number of contiguous sectors between bad ones. As long as the disk is not attempting to read/write the problem areas, you might be able to avoid lockups and actually use the disk. If you are lucky, you won’t have any new bad sectors, though I’d be mindful of the sectors adjacent to the already damaged ones.
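Picking that largest run can be sketched with a small scan over a sorted bad-sector list; the input file, sector numbers and device size below are toy values, not from a real disk (a real list would come from the ddrescue map converted to sectors).

```shell
# Toy input: three bad sectors on a 200000-sector device.
printf '1000\n1001\n90000\n' > bad.txt
SIZE=200000

# Track the biggest gap between consecutive bad sectors (and the tail
# after the last one); print its start sector and length.
awk -v size="$SIZE" '
  { if ($1 - pos > best) { best = $1 - pos; start = pos }
    pos = $1 + 1 }
  END {
    if (size - pos > best) { best = size - pos; start = pos }
    printf "largest good run: start=%d length=%d\n", start, best
  }' bad.txt > gap.txt
cat gap.txt
```

The reported start/length pair could then be handed to the partitioning tool of choice.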

Do post your results, best of luck!


I think this is the simplest advice. Use partitioning to segment out the range of bad blocks. The bad range can either be left unpartitioned, or put in a partition of its own. If using GPT, you can give that bad-range partition a label such as “DO NOT USE”.

You can have Btrfs (or LVM) use multiple partitions, so you can still use all the known good space. In this case I would probably use DUP metadata regardless of whether it’s an SSD or HDD. More bad sectors may develop, and it’ll help improve the chances the file system survives.

If this is SSD though, it’s going to get worse sooner than later. Once they start to produce errors, they don’t ever stabilize.


Finished the 120-hour, one-pass badblocks R/W test.

badblocks reported 79 errors, while ddrescue reported tons.

The most interesting part: during the whole 120 hours of testing, there were no induced ATA interface errors.


Are you sure you don’t have a power problem? One time I thought that I had a failing HD. On a hunch, I shut down and unplugged my computer and plugged the HD into a different cable from the power supply. After that, I never had any problems anymore. Before going to extraordinary lengths I suggest that you try that.

Not likely a power issue, as I can connect 4 drives concurrently without problems.

I want to make sure that you understand what I am saying: I’m not suggesting that your power supply isn’t powerful enough; what I’m suggesting is that the particular power cable going from the power supply to the HD (or the connector on the end of that cable) might have something wrong with it. I suggest shutting down, turning off the power, and trying to connect that HD to a different power cable. In my case the problem was immediately solved.


Thanks for the inputs.

Yes, power issues, including cable issues, are not likely.

I have used the same connection for both good and bad drives, and the issue is local to the drive.


I’m not quickly finding an example of write failure, but this is what uncorrectable read failures look like:

Once the drive is out of reserve sectors to remap, the drive itself produces uncorrectable write errors.

If you see similar errors in dmesg, in particular UNC (uncorrectable) along with WRITE, then the drive is just toast and you should get rid of it. There’s not much point in even experimenting with it.

There are two workarounds: the previously mentioned partitioning around the bad sectors to avoid them, or feeding the badblocks result into mkfs.ext4, which will then avoid them. But any new bad blocks that develop will cause filesystem problems again, no matter which method you use.
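One pitfall with the mkfs.ext4 route: badblocks numbers blocks in its own block size (1024 bytes by default), while `mkfs.ext4 -l` expects numbers in filesystem blocks (typically 4096 bytes), so the list has to be rescaled first. A sketch with toy file names and values:

```shell
# Toy badblocks output, numbered in the default 1024-byte blocks.
printf '8\n9\n100\n' > bb-1k.txt

# Rescale to 4096-byte filesystem blocks (4 x 1024) and deduplicate.
awk '{ print int($1 / 4) }' bb-1k.txt | sort -nu > bb-4k.txt
cat bb-4k.txt
# then: mkfs.ext4 -b 4096 -l bb-4k.txt /dev/sdX1
```

Running `badblocks -b 4096` in the first place, or letting `mkfs.ext4 -c` do its own scan, sidesteps the conversion entirely.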


I am certain that the drive is not worth using for data storage.

My only aim for this experiment is just to learn those related techniques.

Now I have the badblocks list and the disk.map from ddrescue for the same drive. badblocks reported just 79 errors, while ddrescue reported over 2000 bad areas.

I am studying how to feed the badblocks list to dmsetup, using linear mappings to avoid the known bad blocks.

Then I will try feeding the ddrescue disk.map into dmsetup one more time.

And compare the findings.

It will take time for me to learn the dmsetup linear syntax and how to use the 79 badblocks entries.

Only after that, I will try to use the ddrescue disk.map data.
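For reference, each dm-linear table line is `<logical_start> <length> linear <device> <physical_offset>`, all in 512-byte sectors. A toy two-line table that skips a hypothetical bad range (sectors 1000-1007 of a made-up /dev/sdX) by splicing the good ranges together:

```shell
# Toy table: logical sectors 0-999 map to physical 0-999, and logical
# 1000-1991 map to physical 1008-1999, so the bad range disappears
# from the mapped device's view.
cat > nobb-table <<'EOF'
0 1000 linear /dev/sdX 0
1000 992 linear /dev/sdX 1008
EOF
cat nobb-table
# then: sudo dmsetup create nobbsdX nobb-table
```

Note that this compacted style shrinks the device and shifts offsets after each hole, so the table must be regenerated if new bad sectors appear; an alternative is to keep offsets 1:1 and map each bad sector to the "error" target instead.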

My target filesystem is btrfs, as it will report bit rot via scrub.
Question:

  • mkfs.ext4 can avoid bad blocks
  • an ext4 filesystem can be converted to btrfs
    Will btrfs also avoid the blocks marked bad in ext4 as a result of the conversion?

That should work. I’m not sure how to do it.

will btrfs avoid those blocks marked bad in ext4 also as a result of the conversion?

Not that I’m aware of; and in fact, since btrfs has no support for tracking bad blocks, I’m curious whether the conversion tool detects a badblocks list in ext4 and then refuses to convert with an appropriate message. If not, there's a good chance it's a bug.
