RAID 5 inactive with disk failure

I’m not sure where to start to explain what exactly happened to end up in this bad situation and I do not want to make it worse.

At first the system froze (kerneloops), but I blamed that on the graphics card because journalctl showed GPU lockup messages.
In the worst case, I assumed, the RAID would start re-syncing after a reboot and all would be fine.

After a while the system would no longer boot normally and ended up in emergency mode. With the help of journalctl -b I found out that one of the disks (sdd) belonging to the array was causing errors. At first I suspected the SATA cable. Fiddling with the cables made the whole disk disappear, so I replaced the cable with another one.
During boot the missing disk was visible again and I noticed that the status of the RAID array changed from Degraded to Rebuild.

Unfortunately, after GRUB, Fedora still ends up in emergency mode after a while.

sda, sdc and sdd are the disks of the RAID array (md127). However, lsblk shows that sdd is not part of the array (anymore).

NAME                   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                      8:0    0 931.5G  0 disk 
└─md127                  9:127  0     0B  0 md   
sdb                      8:16   0   3.6T  0 disk /data/bulk
sdc                      8:32   0 931.5G  0 disk 
└─md127                  9:127  0     0B  0 md   
sdd                      8:48   0 931.5G  0 disk 
sde                      8:64   0 119.2G  0 disk 
├─sde1                   8:65   0     1G  0 part /boot
└─sde2                   8:66   0 118.2G  0 part 
  ├─fedora_fedaic-root 253:0    0 112.3G  0 lvm  /
  └─fedora_fedaic-swap 253:1    0   5.9G  0 lvm  [SWAP]
sdf                      8:80   1  14.5G  0 disk 
├─sdf1                   8:81   1  14.4G  0 part 
└─sdf2                   8:82   1    32M  0 part 
sr0                     11:0    1  1024M  0 rom 

cat /proc/mdstat shows that it is inactive and sdd is not part of the RAID array

Personalities : 
md127 : inactive sda[1](S) sdc[0](S)
      5552 blocks super external:imsm
       
unused devices: <none>
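
A read-only way to spot this state from a script is to filter the mdstat text for inactive entries. The following is only a minimal sketch and not part of the original post: the function name inactive_arrays is made up for illustration, and it reads mdstat-formatted text on stdin so it can be tried on a saved copy first.

```shell
# List md devices that mdstat reports as inactive, followed by their members.
# Reads mdstat-formatted text on stdin and changes nothing on the system.
inactive_arrays() {
    awk '$2 == ":" && $3 == "inactive" {
        printf "%s:", $1
        for (i = 4; i <= NF; i++) printf " %s", $i
        printf "\n"
    }'
}

# Usage on a live system (read-only):
#   inactive_arrays < /proc/mdstat
```

Fed the output above, it prints a single line for md127 listing sda and sdc, which makes the missing sdd easy to see.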

And with mdadm --examine /dev/sd[acd] it is possible to see that sdd was a member of the RAID array.

/dev/sda:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.2.02
    Orig Family : d6ba148c
         Family : d6ba148c
     Generation : 002db736
  Creation Time : Unknown
     Attributes : All supported
           UUID : fcaaa905:3813afd6:892f86ab:83424a41
       Checksum : 89e55567 correct
    MPB Sectors : 2
          Disks : 3
   RAID Devices : 1

  Disk00 Serial : WD-WCC3F3525846
          State : active
             Id : 00000000
    Usable Size : 1953519616 (931.51 GiB 1000.20 GB)

[RAID5_2TB]:
       Subarray : 0
           UUID : 5aca63f5:865b38a5:fec0f38e:a41e95cb
     RAID Level : 5 <-- 5
        Members : 3 <-- 3
          Slots : [UUU] <-- [_UU]
    Failed disk : 0
      This Slot : 0 (out-of-sync)
    Sector Size : 512
     Array Size : 3907039232 (1863.02 GiB 2000.40 GB)
   Per Dev Size : 1953519880 (931.51 GiB 1000.20 GB)
  Sector Offset : 0
    Num Stripes : 15261872
     Chunk Size : 64 KiB <-- 64 KiB
       Reserved : 0
  Migrate State : rebuild
      Map State : normal <-- degraded
     Checkpoint : 0 (128)
    Dirty State : clean
     RWH Policy : off
      Volume ID : 0

  Disk01 Serial : WD-WMC1U6533085
          State : active
             Id : 00020000
    Usable Size : 1953519616 (931.51 GiB 1000.20 GB)

  Disk02 Serial : WD-WMC1U5023513
          State : active
             Id : 00030000
    Usable Size : 1953519616 (931.51 GiB 1000.20 GB)
/dev/sdc:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.2.02
    Orig Family : d6ba148c
         Family : d6ba148c
     Generation : 002db736
  Creation Time : Unknown
     Attributes : All supported
           UUID : fcaaa905:3813afd6:892f86ab:83424a41
       Checksum : 89e55567 correct
    MPB Sectors : 2
          Disks : 3
   RAID Devices : 1

  Disk01 Serial : WD-WMC1U6533085
          State : active
             Id : 00020000
    Usable Size : 1953519616 (931.51 GiB 1000.20 GB)

[RAID5_2TB]:
       Subarray : 0
           UUID : 5aca63f5:865b38a5:fec0f38e:a41e95cb
     RAID Level : 5 <-- 5
        Members : 3 <-- 3
          Slots : [UUU] <-- [_UU]
    Failed disk : 0
      This Slot : 1
    Sector Size : 512
     Array Size : 3907039232 (1863.02 GiB 2000.40 GB)
   Per Dev Size : 1953519880 (931.51 GiB 1000.20 GB)
  Sector Offset : 0
    Num Stripes : 15261872
     Chunk Size : 64 KiB <-- 64 KiB
       Reserved : 0
  Migrate State : rebuild
      Map State : normal <-- degraded
     Checkpoint : 0 (128)
    Dirty State : clean
     RWH Policy : off
      Volume ID : 0

  Disk00 Serial : WD-WCC3F3525846
          State : active
             Id : 00000000
    Usable Size : 1953519616 (931.51 GiB 1000.20 GB)

  Disk02 Serial : WD-WMC1U5023513
          State : active
             Id : 00030000
    Usable Size : 1953519616 (931.51 GiB 1000.20 GB)
/dev/sdd:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.2.02
    Orig Family : d6ba148c
         Family : d6ba148c
     Generation : 002db736
  Creation Time : Unknown
     Attributes : All supported
           UUID : fcaaa905:3813afd6:892f86ab:83424a41
       Checksum : 89e55567 correct
    MPB Sectors : 2
          Disks : 3
   RAID Devices : 1

  Disk02 Serial : WD-WMC1U5023513
          State : active
             Id : 00030000
    Usable Size : 1953519616 (931.51 GiB 1000.20 GB)

[RAID5_2TB]:
       Subarray : 0
           UUID : 5aca63f5:865b38a5:fec0f38e:a41e95cb
     RAID Level : 5 <-- 5
        Members : 3 <-- 3
          Slots : [UUU] <-- [_UU]
    Failed disk : 0
      This Slot : 2
    Sector Size : 512
     Array Size : 3907039232 (1863.02 GiB 2000.40 GB)
   Per Dev Size : 1953519880 (931.51 GiB 1000.20 GB)
  Sector Offset : 0
    Num Stripes : 15261872
     Chunk Size : 64 KiB <-- 64 KiB
       Reserved : 0
  Migrate State : rebuild
      Map State : normal <-- degraded
     Checkpoint : 0 (128)
    Dirty State : clean
     RWH Policy : off
      Volume ID : 0

  Disk00 Serial : WD-WCC3F3525846
          State : active
             Id : 00000000
    Usable Size : 1953519616 (931.51 GiB 1000.20 GB)

  Disk01 Serial : WD-WMC1U6533085
          State : active
             Id : 00020000
    Usable Size : 1953519616 (931.51 GiB 1000.20 GB)
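
The three --examine dumps above are long, while only a few fields are needed to compare the members. As a rough sketch (the helper name examine_summary is hypothetical, and it assumes the field labels appear exactly as in the output above), the slot and state lines can be condensed to one line per disk:

```shell
# Condense `mdadm --examine` text (read on stdin) to one line per device,
# keeping only the "This Slot", "Migrate State" and "Map State" fields.
examine_summary() {
    awk '
        # Return everything after the first "label : " on the line.
        function val(s) { sub(/^[^:]*: */, "", s); return s }
        /^\/dev\//           { dev = $1; sub(/:$/, "", dev) }
        /^ *This Slot :/     { slot = val($0) }
        /^ *Migrate State :/ { migrate = val($0) }
        /^ *Map State :/     { printf "%s | slot %s | migrate %s | map %s\n", dev, slot, migrate, val($0) }
    '
}

# Usage (read-only, needs root):
#   mdadm --examine /dev/sd[acd] | examine_summary
```

The sketch only reformats what mdadm already printed; it does not interpret the values.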

What I don’t understand: when I examine messages from earlier boots with e.g. journalctl -b -12 | egrep md[0-9], I find lines mentioning md126. What’s the difference between md126 and md127?

kernel: md/raid:md126: not clean -- starting background reconstruction
kernel: md/raid:md126: device sda operational as raid disk 0
kernel: md/raid:md126: device sdc operational as raid disk 1
kernel: md/raid:md126: device sdd operational as raid disk 2
[...]
systemd[1]: Started mdmon@md127.service - MD Metadata Monitor on /dev/md127
[...]
mdadm[1141]: RebuildFinished event detected on md device /dev/md126

If I run journalctl -b -11 | egrep md[0-9], it shows nothing. If I examine that boot without grep, I find for example this.

I’ve found this Linux Raid Wiki, but after reading it I do not understand the risks if I were to try something. As I said at the beginning, I don’t want to go from a bad situation to a worse one.

What is the best course of action to fix the RAID array without losing (too much) data?
I have an identical empty/spare disk if that helps and if more info is needed please let me know.

On a RAID 5 array, the loss of one disk should not cause data loss.
There are several things you need to check first, including the status of sdd.
sudo fdisk -l to verify the drive is seen and identified
sudo smartctl -a /dev/sdX where the X is the identifier for the drive of concern (sdd maybe?).
ls /dev/sd* to see that the devices are all shown
cat /proc/mdstat to see the array status

If smartctl shows a failure then the drive must be replaced. If not, it may be possible to recover the drive.

If you provide the results of the above then how to move forward can be determined.

Thank you for your reply.

fdisk -l

Disk /dev/sda: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: WDC WD10EZEX-08M
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdb: 3.64 TiB, 4000787030016 bytes, 7814037168 sectors
Disk model: ST4000DX001-1CE1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdc: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: WDC WD10EZRX-00A
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdd: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: WDC WD10EZRX-00A
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sde: 119.24 GiB, 128035676160 bytes, 250069680 sectors
Disk model: OCZ-VERTEX4     
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x08ff1faa

Device     Boot   Start       End   Sectors   Size Id Type
/dev/sde1  *       2048   2099199   2097152     1G 83 Linux
/dev/sde2       2099200 250068991 247969792 118.2G 8e Linux LVM


Disk /dev/mapper/fedora_fedaic-root: 112.31 GiB, 120590434304 bytes, 235528192 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/mapper/fedora_fedaic-swap: 5.93 GiB, 6366953472 bytes, 12435456 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/sdf: 14.46 GiB, 15525216256 bytes, 30322688 sectors
Disk model: USB DISK 2.0    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x8562f0e2

Device     Boot    Start      End  Sectors  Size Id Type
/dev/sdf1  *        2048 30257151 30255104 14.4G  7 HPFS/NTFS/exFAT
/dev/sdf2       30257152 30322687    65536   32M ef EFI (FAT-12/16/32)

ls /dev/sd*

/dev/sda
/dev/sdb
/dev/sdc
/dev/sdd
/dev/sde
/dev/sde1
/dev/sde2
/dev/sdf
/dev/sdf1
/dev/sdf2

cat /proc/mdstat

Personalities : 
md127 : inactive sda[1](S) sdc[0](S)
      5552 blocks super external:imsm
       
unused devices: <none>

I couldn’t run smartctl. Command was not found. Which package provides smartctl?

EDIT

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.17.5-300.fc36.x86_64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Green
Device Model:     WDC WD10EZRX-00A8LB0
Serial Number:    WD-WMC1U5023513
LU WWN Device Id: 5 0014ee 657e5cca9
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database 7.3/5319
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Dec  3 18:58:47 2022 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(13140) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 150) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x30b5)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       648
  3 Spin_Up_Time            0x0027   136   132   021    Pre-fail  Always       -       4200
  4 Start_Stop_Count        0x0032   083   083   000    Old_age   Always       -       17275
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       2
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   020   020   000    Old_age   Always       -       58506
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       970
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       81
193 Load_Cycle_Count        0x0032   122   122   000    Old_age   Always       -       236401
194 Temperature_Celsius     0x0022   120   103   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   198   198   000    Old_age   Always       -       2
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       3

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

That smartctl output shows a drive that is more than 7 years old with over 6 1/2 years powered on. It may or may not be recoverable, but it does not show what I would consider serious attributes, and only 2 sectors have been reallocated.
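
The powered-on figure can be checked against attribute 9, Power_On_Hours (58506 in the output above). Here is a small sketch of the conversion, assuming 8760 hours (365 days) per year; the helper name hours_to_years is made up for illustration:

```shell
# Convert a Power_On_Hours raw value to years, assuming 365-day years.
# LC_ALL=C keeps the decimal separator a dot regardless of locale.
hours_to_years() {
    LC_ALL=C awk -v h="$1" 'BEGIN { printf "%.1f\n", h / 8760 }'
}

# Usage:
#   hours_to_years 58506
```

For 58506 hours this gives roughly 6.7 years, consistent with "over 6 1/2 years".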

Please read the man page for mdadm and learn as you progress with this.

First, to get the array functional again in a degraded state:
sudo mdadm -A md127 -v -R /dev/sda /dev/sdc

Once the array is functional then steps must be followed to remove sdd, clean up any raid identifiers on the drive, then add it back to the array. I will try to assist one step at a time in that process.

As far as I understand it now, you want me to assemble the RAID array manually and run it with only 2 disks, which puts the array in a degraded state (instead of the current rebuild state).

Is there a way to record the output of the command I’m going to execute as I think you’ve purposely added the -v option in the command?
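
One common way to do this is to pipe the command through tee, which shows the output on screen and also appends it to a file. This is a minimal sketch, assuming a POSIX shell and a writable location for the log; run_logged and the log file name in the usage line are hypothetical:

```shell
# Run a command, show its output, and append the same output to a log file.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    log=$1
    shift
    {
        printf '$ %s\n' "$*"   # record the command line itself
        "$@" 2>&1              # capture stderr as well as stdout
    } | tee -a "$log"
}

# Usage with the command suggested above (log name is an example):
#   run_logged mdadm-assemble.log mdadm -A md127 -v -R /dev/sda /dev/sdc
```

Because of the pipe, the function's exit status is that of tee rather than of the command, so the log itself should be read to see whether the command succeeded.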

The emergency mode has affected my network connectivity. I could not install smartmontools on my Fedora system. I’ve used smartmontools from live USBs to get the information needed.

I did notice that disk sda has two logged errors, but it still passed the overall health check.

I have included also the SMART info of the other two disks (sda & sdc).

[root@sysrescue ~]# smartctl -a /dev/sda

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.15.74-1-lts] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD10EZEX-08M2NA0
Serial Number:    WD-WCC3F3525846
LU WWN Device Id: 5 0014ee 25f55c27d
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Dec  4 11:17:39 2022 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(11220) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 116) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       237
  3 Spin_Up_Time            0x0027   172   170   021    Pre-fail  Always       -       2358
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1854
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       23936
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1728
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       65
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       8334
194 Temperature_Celsius     0x0022   119   091   000    Old_age   Always       -       24
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   196   000    Old_age   Always       -       36
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 2
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 10713 hours (446 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 00 4f c2 00  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d8 00 00 4f c2 00 00      00:00:26.320  SMART ENABLE OPERATIONS
  f5 00 00 00 00 00 00 00      00:00:26.186  SECURITY FREEZE LOCK
  ec 00 00 00 00 00 00 00      00:00:26.186  IDENTIFY DEVICE
  c6 00 10 00 00 00 00 00      00:00:26.186  SET MULTIPLE MODE
  ef 03 45 00 00 00 00 00      00:00:26.186  SET FEATURES [Set transfer mode]

Error 1 occurred at disk power-on lifetime: 1791 hours (74 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 00 00 00 00  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 02 00 00 00 00 00 00      00:02:32.491  SET FEATURES [Enable write cache]

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@sysrescue ~]# smartctl -a /dev/sdc

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.15.74-1-lts] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Green
Device Model:     WDC WD10EZRX-00A8LB0
Serial Number:    WD-WMC1U6533085
LU WWN Device Id: 5 0014ee 658189529
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database 7.3/5319
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Dec  4 11:17:43 2022 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(13020) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 149) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x30b5)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   139   135   021    Pre-fail  Always       -       4050
  4 Start_Stop_Count        0x0032   083   083   000    Old_age   Always       -       17318
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   020   020   000    Old_age   Always       -       58726
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       980
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       85
193 Load_Cycle_Count        0x0032   122   122   000    Old_age   Always       -       235879
194 Temperature_Celsius     0x0022   120   104   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

EDIT
The output of mdadm -A md127 -v -R /dev/sda /dev/sdc

mdadm: looking for devices for md127
mdadm: /dev/sda is busy - skipping
mdadm: /dev/sdc is busy - skipping
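
The "is busy - skipping" lines mean that mdadm declined to use those devices. A plausible explanation, not confirmed in this thread, is that they are still claimed by the inactive md127 shown in /proc/mdstat. As a small sketch (the helper name busy_devices is hypothetical), the skipped devices can be pulled out of a saved or piped copy of the verbose output:

```shell
# Print the device paths that mdadm reported as "busy - skipping".
# Reads mdadm's verbose output on stdin.
busy_devices() {
    awk '/ is busy - skipping$/ { print $2 }'
}

# Usage (2>&1 in case the messages are written to stderr):
#   mdadm -A md127 -v -R /dev/sda /dev/sdc 2>&1 | busy_devices
```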

Perhaps md126?

This shows an inactive array that is only able to see 2 of the assigned 3 drives. The mdadm command given was to assemble those drives already existing and bring it active in a degraded state with only the 2 shown drives.

The data should still be intact, but the array could not be activated.

There is another possibility.
Each drive has its own data cable, and it is possible that A) drive sdd has a data cable that is failing or the connections are not clean, or B) one drive is causing more than one partial failure due to defective connections or cable that is also interfering with controller function, in which case the activation won’t work.

At this point I would suggest that you power the machine off completely (unplug it from the wall) then disconnect and reconnect both ends of each data cable to each drive. Possibly even replace one or more cables (drive sdd especially) if you have extras available. Then try powering back up.
It should not even matter if you move the cables to different controller ports, since the drives are usually identified independently of port location.

If you use a different cable and different port for sdd and the problem follows the physical drive then it would seem to be specifically the drive. If it follows the cable that originally was on sdd then it would seem to be the cable. If it stays with whatever device is attached to the port where sdd was originally connected then it would seem to be the controller.

If the problem disappears after reseating the cables and doing nothing else then it would seem to be an oxidized/dirty connection.

I did replace one cable, but will fiddle with the other cables too to rule things out.

Does it make sense to first run mdadm --stop /dev/md127 and then mdadm -A md127 -v -R /dev/sda /dev/sdc?

Source: Assemble Run - Linux Raid Wiki

Sorry for my late response. I was on holiday.
When I swap cables between sdd (WD-WMC1U5023513) and sdb (Z3013KGQ), then sdb (WD-WMC1U5023513) becomes the drive with errors. It doesn’t matter which cable or which port I connect the drive with serial WD-WMC1U5023513 to.
Therefore I assume that the cables are OK.
I didn’t understand what you meant by the controller.

When I start the system with a Live-USB of Fedora and open Disks then it shows at Assessment: Disk is OK, 3 bad sectors (25° C / 77° F) on disk with serial WD-WMC1U5023513.

Note that earlier I said

This tells me that the drive now seen as sdb serial WD-WMC1U5023513 is bad enough that mdadm does not want to use it.

You can fail it out of the array and remove it with mdadm md127 --fail /dev/sdb --remove /dev/sdb

Failing the device and removing it from the array should allow the array to be started in degraded mode. (if not go no further at this time)

Once the failing device has been removed from the array it should be possible to (temporarily) fix the device using the badblocks command. The array metadata can be removed from the device with dd if=/dev/zero of=/dev/sdb bs=1M count=1

Once the device has had the metadata cleared you could add it back into the array as a new device and the array should rebuild it.

Note that a device with 3 bad blocks should always be replaced as soon as possible. Historically, once a device starts exhibiting bad sectors the damage usually grows over time. Normally the device itself swaps the data from a failing sector into one of the spare sectors a device automatically retains for that exact reason. Usually the system never sees the bad sector since the device manages that. If it is showing 3 bad sectors then smartctl -a /dev/sdb should show the failures.
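
In the smartctl -a output, the sector-related counters are in the attribute table, in the RAW_VALUE column of attributes 5, 196, 197 and 198 (199 counts interface CRC errors, which usually point at cabling). The sketch below filters a saved or piped copy of that output down to those rows; the helper name smart_sector_attrs is hypothetical, and it assumes the ten-column table layout shown earlier in this thread.

```shell
# Print ID, name and raw value of the sector and CRC related SMART attributes.
# Reads `smartctl -a` output on stdin; rows are recognised by their ID and
# by the hexadecimal FLAG column, so other tables in the output are ignored.
smart_sector_attrs() {
    awk '$1 ~ /^(5|196|197|198|199)$/ && $3 ~ /^0x/ && NF >= 10 {
        printf "%s %s %s\n", $1, $2, $10
    }'
}

# Usage (read-only, needs root):
#   smartctl -a /dev/sdb | smart_sector_attrs
```

For the drive with serial WD-WMC1U5023513 the earlier output shows 2 reallocated sectors and 1 pending sector. One plausible reading, not confirmed here, is that a tool reporting "3 bad sectors" is adding those two numbers.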

There are always 3 physical parts to using a drive.

  1. The controller where the cable is attached to the motherboard
  2. The attaching cable
  3. The drive itself.

Those comments were meant to allow you to track which physical device had the problem and the steps you followed narrowed it down to the drive itself.

Thank you for your fast reply. The arrangement of the disks/cables is as follows now:

  • sda: WD-WCC3F3525846
  • sdb: WD-WMC1U6533085
  • sdc: WD-WMC1U5023513 (faulty disk, not part of array anymore)
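
Because the sdX names moved around when the cables were swapped, the serial number is the safer way to pick a device before running anything destructive. The following is a minimal sketch, assuming lsblk prints the serial in the same form as above; the helper name device_for_serial is hypothetical:

```shell
# Print the /dev path of the disk whose serial number matches the argument.
# Expects "NAME SERIAL" pairs on stdin, as printed by: lsblk -dn -o NAME,SERIAL
device_for_serial() {
    awk -v want="$1" '$2 == want { print "/dev/" $1 }'
}

# Usage (read-only):
#   lsblk -dn -o NAME,SERIAL | device_for_serial WD-WMC1U5023513
```

With the arrangement listed above this would print /dev/sdc; after another cable change the same call would follow the drive to its new name.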

I will start the system and, given my current state, run mdadm md127 --fail /dev/sdc --remove /dev/sdc
How would I know that the system has changed the array to the degraded state?

After I have confirmed that the array is in degraded state, I should run dd if=/dev/zero of=/dev/sdc bs=1M count=1, which writes a single 1 MiB block of zeros from /dev/zero to the start of /dev/sdc. Correct?
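
On the dd question, here is a rough sketch of the arithmetic behind bs=1M count=1. The helper name dd_bytes is hypothetical, and the suffix table assumes GNU dd, where M means 1024*1024 bytes and MB means 1000*1000 bytes. It only computes a number and does not touch any device.

```python
# Hypothetical helper, not part of dd: compute how many bytes
# `dd bs=<bs> count=<count>` would copy, assuming GNU dd size suffixes.
SUFFIXES = {
    "": 1,
    "K": 1024, "M": 1024**2, "G": 1024**3,   # binary units
    "MB": 1000**2, "GB": 1000**3,            # decimal units
}

def dd_bytes(bs: str, count: int) -> int:
    digits = bs.rstrip("KMGB")      # leading number, e.g. "1" from "1M"
    suffix = bs[len(digits):]       # remaining unit, e.g. "M"
    return int(digits) * SUFFIXES[suffix] * count

print(dd_bytes("1M", 1))  # 1048576 bytes, i.e. one 1 MiB block
```

So the command overwrites the first 1 MiB of the target with zeros, once. As far as I know, Intel (imsm) metadata is kept near the end of the disk rather than at the start, so if that is the metadata format in use, zeroing the first 1 MiB may not clear it; mdadm --zero-superblock on the device is the mdadm option intended for that job.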

At the moment when I start the system the RAID BIOS shows that the array is in Rebuild state (see also photo from my first post).

I couldn’t tell where in the smartctl output you can see that the faulty drive has 3 bad sectors.

smartctl -a /dev/sdc (with Fedora Live-USB)
[liveuser@localhost-live ~]$ sudo smartctl -a /dev/sdc
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.17.5-300.fc36.x86_64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Green
Device Model:     WDC WD10EZRX-00A8LB0
Serial Number:    WD-WMC1U5023513
LU WWN Device Id: 5 0014ee 657e5cca9
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database 7.3/5319
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Jan  3 13:42:10 2023 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(13140) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 150) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x30b5)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       996
  3 Spin_Up_Time            0x0027   136   131   021    Pre-fail  Always       -       4175
  4 Start_Stop_Count        0x0032   083   083   000    Old_age   Always       -       17287
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       2
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   020   020   000    Old_age   Always       -       58755
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       982
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       83
193 Load_Cycle_Count        0x0032   122   122   000    Old_age   Always       -       236450
194 Temperature_Celsius     0x0022   119   103   000    Old_age   Always       -       24
196 Reallocated_Event_Count 0x0032   198   198   000    Old_age   Always       -       2
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       7

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
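
One possible reading of where the "3 bad sectors" comes from, offered as an assumption rather than something confirmed in this thread: Disks appears to count bad sectors as reallocated plus currently pending sectors. In the table above that would be Reallocated_Sector_Ct (2) plus Current_Pending_Sector (1). A minimal Python sketch of that sum, with hypothetical function names and three rows copied from the output above:

```python
# Sketch: pull raw values out of the attribute table printed by `smartctl -a`.
# The three sample rows are copied from the output above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       2
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

def raw_values(table: str) -> dict:
    """Map ATTRIBUTE_NAME -> RAW_VALUE for each attribute row."""
    values = {}
    for line in table.splitlines():
        fields = line.split()
        # An attribute row has at least 10 columns and starts with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values

def bad_sectors(table: str) -> int:
    """Assumed definition: reallocated plus currently pending sectors."""
    raw = raw_values(table)
    return raw.get("Reallocated_Sector_Ct", 0) + raw.get("Current_Pending_Sector", 0)

print(bad_sectors(SAMPLE))  # 3
```

The overall "PASSED" result does not contradict this: the normalized VALUE for attribute 5 is still 200, well above its threshold of 140.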

I adapted the command to the current state of my system:
mdadm md127 --fail /dev/sdc --remove /dev/sdc

I got mdadm: error opening md127: No such file or directory.

EDIT:
I also tried mdadm /dev/md127 --fail /dev/sdc --remove /dev/sdc, but then I got the message mdadm: set device faulty failed for /dev/sdc: No such device

EDIT2:
I accidentally started the system with the faulty drive disconnected, and this is the output of cat /proc/mdstat:

Personalities : [raid6] [raid5] [raid4]
md126 : inactive sda[1] sdb[0]
      1953519616 blocks super external:/md127/0

md127 : inactive sdb[1](S) sda[0](S)
      5552 blocks super external:imsm
       
unused devices: <none>

Here md126, which I mentioned in an earlier post, is back.

EDIT3:
What if I physically disconnect the faulty disk and run sudo mdadm -A md126 -v -R /dev/sda /dev/sdb ?
Would that work, @computersavvy ?

@computersavvy I’ve been searching and I’ve found information that might help to assemble my RAID 5 array.
The output of mdadm --examine --scan was:

ARRAY metadata=imsm UUID=fcaaa905:3813afd6:892f86ab:83424a41
ARRAY /dev/md/RAID5_2TB container=fcaaa905:3813afd6:892f86ab:83424a41 member=0 UUID=5aca63f5:865b38a5:fec0f38e:a41e95cb

Following that information, I ran mdadm --assemble --scan /dev/md/RAID5_2TB
Some messages followed, see below the relevant ones:

mdadm: /dev/sdc has wrong uuid.
mdadm: /dev/sdb has wrong uuid.
mdadm: /dev/sda has wrong uuid.
mdadm: looking in container /dev/md/imsm0
mdadm: found match on member /md127/0 in /dev/md/imsm0
[ 1806.344395] md/raid:md126: not enough operational devices (2/3 failed)
mdadm: /dev/md/RAID5_2TB has been assembled with 2 devices but cannot be started.

This is more than I had gotten so far. What should I do next? Just run it and hope it starts in degraded state? The faulty disk is still connected and visible as /dev/sdc but is not part of any array. See also the output of cat /proc/mdstat:

Personalities : [raid6] [raid5] [raid4]
md126 : inactive sda[1] sdb[0]
      1953519616 blocks super external:/md127/0

md127 : inactive sdb[1](S) sda[0](S)
      5552 blocks super external:imsm
       
unused devices: <none>

I believe I’m slowly getting there.
The result of mdadm -A md126 -v -R /dev/sda /dev/sdb /dev/sdc is that a device is added to md127 (the container) but not to md126.
Output of cat /proc/mdstat:

Personalities : [raid6] [raid5] [raid4]
md126 : inactive sda[1] sdb[0]
      1953519616 blocks super external:/md127/0

md127 : inactive sdc[2](S) sda[1](S) sdb[0](S)
      8328 blocks super external:imsm
       
unused devices: <none>

EDIT:
The question now is how to get the real array RAID5_2TB, which I think is md126, running/active again.

Hi @computersavvy, I got my RAID array running with just mdadm --assemble --scan, with the help of this source. It assembled and started right away with the following messages:

mdadm: Merging with already-assembled /dev/md/imsm0
mdadm: Container /dev/md/imsm0 has been assembled with 3 drives
mdadm: /dev/md/RAID_2TB_0 has been assembled with 3 drives and started.

According to /proc/mdstat it is now recovering/re-syncing. All files seem to be there. I checked some files and they seemed fine.

Personalities : [raid6] [raid5] [raid4] 
md126 : active raid5 sda[2] sdb[1] sdc[0]
      1953519616 blocks super external:/md127/0 level 5, 64k chunk, algorithm 0 [3/2] [_UU]
      [====>................]  recovery = 23.4% (229001920/976759808) finish=94.1min speed=132338K/sec
      
md127 : inactive sdc[2](S) sda[1](S) sdb[0](S)
      8328 blocks super external:imsm
       
unused devices: <none>

Once the RAID array has finished recovering, I will copy the files to another location and replace the faulty drive, or even all drives.

Simply running mdadm --assemble --scan may have been the solution after all.
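
On the earlier question of how to tell that an array is degraded: in /proc/mdstat the [3/2] field means 3 member devices are expected and 2 are working, and the underscore in [_UU] marks the missing slot. Below is a minimal Python sketch of reading those fields and re-deriving the finish estimate from the recovery line. The function name array_health is hypothetical, and the sample is copied from the output above.

```python
import re

# Sample copied from the /proc/mdstat output above.
SAMPLE = """\
md126 : active raid5 sda[2] sdb[1] sdc[0]
      1953519616 blocks super external:/md127/0 level 5, 64k chunk, algorithm 0 [3/2] [_UU]
      [====>................]  recovery = 23.4% (229001920/976759808) finish=94.1min speed=132338K/sec
"""

def array_health(mdstat: str) -> dict:
    """Summarise member counts and rebuild progress for one array stanza."""
    info = {"degraded": None, "minutes_left": None}
    # "[3/2]" means 3 devices expected, 2 currently working.
    counts = re.search(r"\[(\d+)/(\d+)\]", mdstat)
    if counts:
        expected, working = map(int, counts.groups())
        info["degraded"] = working < expected
    # "(done/total) ... speed=NK/sec": blocks and speed are both in KiB.
    prog = re.search(r"\((\d+)/(\d+)\).*speed=(\d+)K/sec", mdstat)
    if prog:
        done, total, speed = map(int, prog.groups())
        info["minutes_left"] = (total - done) / speed / 60
    return info

print(array_health(SAMPLE))
```

For the sample this reports a degraded array with roughly 94 minutes left, in line with the finish=94.1min that the kernel prints. Once the field reads [3/3] [UUU] and the recovery line is gone, the rebuild is complete.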

This part seems strange. It almost seems that it believes it imported that array (md127) from another system. The fact it is now running as md126 is at least a big step forward.

All 3 drives are showing as spares on the md127 device, which is inactive, so you could probably fail that array out once the other is fully rebuilt. Given the array identity and the device name, it could be failed out (disassembled) one device at a time so the array does not linger and cause potential problems.

The drives seem to have info on both arrays, with md126 as the current array and md127 as the external inactive array. That may or may not present an issue in the future if you fail to clean it up now.

This is somewhat speculative, but it is what I think happened, based on reading the man pages for mdadm and mdadm.conf, reading about mdadm in general, and of course your help in understanding the matter. This gave me enough confidence that my actions wouldn’t be destructive, as I had read somewhere that mdadm does not do ‘destructive’ things unless you force it to.

Before installing Fedora for the very first time, years and years ago, I had already set up the RAID array in the BIOS (Intel Matrix Storage Manager). This probably created the imsm container array (md127), which holds information about the available (spare) disks and the type of RAID array. It was so long ago that I no longer remember the details, but in Fedora the actual RAID array (md126) must have been set up with the disks from the container array (md127).

At the time I probably believed that a RAID array created with Intel Matrix Storage Manager (imsm) was a hardware array. Now I believe that in Linux it doesn’t make much sense to use Intel BIOS RAID unless you have a real RAID controller installed in your system. The former is still controlled by the operating system, which makes it a software array.
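
The container/volume relationship described here can be read straight from /proc/mdstat. As I understand mdadm's external-metadata handling, "super external:imsm" marks the container itself (its members are listed with (S) and it stays inactive), while "super external:/md127/0" marks volume 0 inside container md127. A minimal Python sketch, with a hypothetical function name classify and a sample copied from the output earlier in this thread:

```python
import re

# Sample copied from the /proc/mdstat output earlier in this thread.
SAMPLE = """\
md126 : active raid5 sda[2] sdb[1] sdc[0]
      1953519616 blocks super external:/md127/0 level 5, 64k chunk, algorithm 0 [3/2] [_UU]
      [====>................]  recovery = 23.4% (229001920/976759808) finish=94.1min speed=132338K/sec

md127 : inactive sdc[2](S) sda[1](S) sdb[0](S)
      8328 blocks super external:imsm
"""

def classify(mdstat: str) -> dict:
    """Label each md device as a container or as a volume inside one."""
    roles = {}
    name = None
    for line in mdstat.splitlines():
        header = re.match(r"(md\d+) :", line)
        if header:
            name = header.group(1)
            continue
        ext = re.search(r"super external:(\S+)", line)
        if name and ext:
            target = ext.group(1)
            if target.startswith("/"):
                # "/md127/0" -> volume 0 inside container md127
                container, index = target.strip("/").split("/")
                roles[name] = f"volume {index} in container {container}"
            else:
                # e.g. "imsm": this device is the metadata container itself
                roles[name] = f"{target} container"
    return roles

print(classify(SAMPLE))
```

Read this way, md127 is not a leftover second array: it is the container that md126 lives in, which matches the container=... member=0 line from mdadm --examine --scan.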

:smile: With all the information I have gathered over the last two months, I believe your suggestion makes sense.

Quite possibly.

There have been some problems with Fedora and BIOS-controlled RAID, in that the Anaconda installer does not always see the drives on the system if the SATA devices are configured as RAID. I have not encountered that myself (no new hardware), but numerous others have been forced to switch the SATA config from RAID to AHCI before they could install Fedora on their systems. Yours may have been one of the early forays into BIOS-controlled RAID that still worked with Fedora, and if your BIOS is (or was) set to RAID, that may be a contributing factor in this issue.

Guess we should wait and see. :thinking:

So far so good.

Personalities : [raid6] [raid5] [raid4] 
md126 : active raid5 sda[2] sdb[1] sdc[0]
      1953519616 blocks super external:/md127/0 level 5, 64k chunk, algorithm 0 [3/3] [UUU]
      
md127 : inactive sdc[2](S) sda[1](S) sdb[0](S)
      8328 blocks super external:imsm
       
unused devices: <none>

In the meantime I am copying the data off the array before it fails again. :smile:

I guess that’s true! md126 starts auto-read-only.

Personalities : [raid6] [raid5] [raid4] 
md126 : active (auto-read-only) raid5 sda[2] sdb[1] sdc[0]
      1953519616 blocks super external:/md127/0 level 5, 64k chunk, algorithm 0 [3/3] [UUU]
      
md127 : inactive sdc[2](S) sda[1](S) sdb[0](S)
      8328 blocks super external:imsm
       
unused devices: <none>