Btrfs scrub found one error then aborted - cannot resume

When I saw some btrfs error stats for my 8T data disk, I started checking with scrub:

sudo btrfs scrub start /var/vols/8T-5

After running for about 3 hours 18 minutes, it aborted with one csum error found, as follows:

btrfs scrub status /var/vols/8T-5

btrfs scrub resume .
scrub resumed on ., fsid 6a32b7e3-0ad3-4316-942d-ec568e1e86f8 (pid=33754)
[root@amdf 8T-5]# btrfs scrub status /var/vols/8T-5
UUID: 6a32b7e3-0ad3-4316-942d-ec568e1e86f8
Scrub resumed: Tue May 18 09:26:51 2021
Status: aborted
Duration: 3:18:53
Total to scrub: 6.05TiB
Rate: 206.82MiB/s
Error summary: csum=1
Corrected: 0
Uncorrectable: 1
Unverified: 0

Journal output when trying to resume the scrub:

May 18 09:26:51 amdf kernel: BTRFS info (device sdb2): scrub: started on devid 1
May 18 09:26:53 amdf kernel: BTRFS error (device sdb2): parent transid verify failed on 8477840605184 wanted 255798 found 255532
May 18 09:26:53 amdf kernel: BTRFS error (device sdb2): parent transid verify failed on 8477840605184 wanted 255798 found 255532
May 18 09:26:53 amdf kernel: BTRFS info (device sdb2): scrub: not finished on devid 1 with status: -5

Versions:
rpm -qa | egrep 'kernel|btrfs'
kernel-headers-5.11.19-300.fc34.x86_64
kernel-core-5.11.19-300.fc34.x86_64
kernel-modules-5.11.19-300.fc34.x86_64
kernel-5.11.19-300.fc34.x86_64
kernel-devel-5.11.19-300.fc34.x86_64
kernel-modules-extra-5.11.19-300.fc34.x86_64
btrfs-progs-5.11.1-1.fc34.x86_64

Question 1:
How can I identify which file is affected by this error?

Update: I found the affected file by checking the journal with:

sudo journalctl -f --no-pager | grep "checksum error"
May 18 07:14:02 amdf kernel: BTRFS warning (device sdb2): checksum error at logical 2572643012608 on dev /dev/sdb2, physical 2578020110336, root 2629, inode 116800, offset 62066688, length 4096, links 1 (path: 1/VMs/dell250G.qcow2)
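For anyone who wants to pull just the path out of such a line, here is a small sketch using the journal line above as sample input. Note the path is reported relative to the root of the subvolume named in `root 2629`, not the mountpoint.

```shell
# Sketch: extract the affected path from a btrfs "checksum error" journal
# line. The sample line is copied verbatim from the journal output above.
line='May 18 07:14:02 amdf kernel: BTRFS warning (device sdb2): checksum error at logical 2572643012608 on dev /dev/sdb2, physical 2578020110336, root 2629, inode 116800, offset 62066688, length 4096, links 1 (path: 1/VMs/dell250G.qcow2)'

# The path is the last "(path: ...)" group; capture everything inside it.
path=$(printf '%s\n' "$line" | sed -n 's/.*(path: \(.*\))$/\1/p')
echo "$path"   # prints: 1/VMs/dell250G.qcow2
```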

Question 2:
How can I continue the scrub to check the remaining blocks? The original ETA was about 11 hours, so there must be many blocks left to check.

Parent transid errors are metadata errors (a problem with the file system itself), while the checksum error relates to the data itself.
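That distinction shows up directly in the journal text; a small illustrative sketch (the sample fragments are taken from the logs in this thread):

```shell
# Sketch: tell the two error classes apart from their journal text.
classify() {
  case "$1" in
    *'parent transid verify failed'*) echo metadata ;;  # filesystem tree problem
    *'checksum error'*)               echo data ;;      # file content problem
    *)                                echo unknown ;;
  esac
}

classify 'BTRFS error (device sdb2): parent transid verify failed on 8477840605184 wanted 255798 found 255532'
classify 'BTRFS warning (device sdb2): checksum error at logical 2572643012608 on dev /dev/sdb2'
```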

smartctl -i /dev/sdb | grep 'Device Model\|Firmware'
btrfs device stats /dev/sdb2
lsattr 1/VMs/dell250G.qcow2
virsh dumpxml $VMNAME | grep cache
btrfs insp dump-t -b 8477840605184 --hide-names /dev/sdb2
btrfs check --readonly /dev/sdb2
journalctl -k | grep -i btrfs > btrfs-kmesg.txt

--hide-names is optional, but worth using since that block might contain file names you don't want to share.

The output of the last command might be long, hence redirecting it to a file, which you can either paste into a pastebin (one week retention is long enough) or share via a file-sharing service.

Thank you very much for looking into my issue!

$ smartctl -i /dev/sdb | grep 'Device Model\|Firmware'
Device Model:     TOSHIBA MG05ACA800E
Firmware Version: GX4K

$ btrfs device stats /dev/sdb2
[/dev/sdb2].write_io_errs    0
[/dev/sdb2].read_io_errs     0
[/dev/sdb2].flush_io_errs    0
[/dev/sdb2].corruption_errs  3012
[/dev/sdb2].generation_errs  0

$ lsattr /var/vols/8T-5/Data/1/VMs/dell250G.qcow2 
-------------------- /var/vols/8T-5/Data/1/VMs/dell250G.qcow2

virsh: I have already reinstalled my Silverblue host, and this qcow2 is not defined in the current deployment.

btrfs insp dump-t -b 8477840605184 /dev/sdb2
btrfs-progs v5.11.1 
node 8477840605184 level 1 items 245 free space 248 generation 255532 owner CSUM_TREE
node 8477840605184 flags 0x1(WRITTEN) backref revision 1
fs uuid 6a32b7e3-0ad3-4316-942d-ec568e1e86f8
chunk uuid 8d8bd209-fa88-4302-ac28-a57c1b325f42
	key (EXTENT_CSUM EXTENT_CSUM 4910987608064) block 2095462367232 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4910988488704) block 2095462416384 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911005106176) block 2095462481920 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911021723648) block 2095462547456 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911037898752) block 2095462563840 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911048200192) block 2095462678528 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911064821760) block 2095462727680 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911064985600) block 2095462760448 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911081603072) block 2095462875136 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911098220544) block 2095462924288 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911114838016) block 2095463022592 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911131455488) block 2095463038976 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911143026688) block 2095463268352 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911159644160) block 2095463350272 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911176261632) block 2095463399424 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911192883200) block 2095463563264 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911199666176) block 2095463579648 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911214714880) block 2095463743488 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911227543552) block 2095463759872 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911244161024) block 2095463776256 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911260778496) block 2095463825408 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911277395968) block 2095464005632 gen 188363
	key (EXTENT_CSUM EXTENT_CSUM 4911292563456) block 8478135468032 gen 215033
	key (EXTENT_CSUM EXTENT_CSUM 4911309180928) block 8478117560320 gen 254923
	key (EXTENT_CSUM EXTENT_CSUM 4911328657408) block 6450678431744 gen 255196
	key (EXTENT_CSUM EXTENT_CSUM 4911333212160) block 8478454317056 gen 255493
	key (EXTENT_CSUM EXTENT_CSUM 4911340621824) block 8477848289280 gen 255339
	key (EXTENT_CSUM EXTENT_CSUM 4911344996352) block 8477773086720 gen 255493
	key (EXTENT_CSUM EXTENT_CSUM 4911351791616) block 8477841784832 gen 255532
	key (EXTENT_CSUM EXTENT_CSUM 4911354847232) block 8477859430400 gen 255532
	key (EXTENT_CSUM EXTENT_CSUM 4911359250432) block 8477901832192 gen 255532
	key (EXTENT_CSUM EXTENT_CSUM 4911364321280) block 8477901996032 gen 255532
	key (EXTENT_CSUM EXTENT_CSUM 4911366905856) block 8477561896960 gen 255498
	key (EXTENT_CSUM EXTENT_CSUM 4911374159872) block 8478435409920 gen 254966
	key (EXTENT_CSUM EXTENT_CSUM 4911379673088) block 6450688770048 gen 255457
	key (EXTENT_CSUM EXTENT_CSUM 4911384678400) block 8478435950592 gen 254966
	key (EXTENT_CSUM EXTENT_CSUM 4911391444992) block 8477718446080 gen 255335
	key (EXTENT_CSUM EXTENT_CSUM 4911397834752) block 6452110753792 gen 255472
	key (EXTENT_CSUM EXTENT_CSUM 4911405080576) block 2095485042688 gen 255431
	key (EXTENT_CSUM EXTENT_CSUM 4911410008064) block 8477701062656 gen 255506
	key (EXTENT_CSUM EXTENT_CSUM 4911416238080) block 8477803364352 gen 255524
	key (EXTENT_CSUM EXTENT_CSUM 4911424786432) block 8477701111808 gen 255506
	key (EXTENT_CSUM EXTENT_CSUM 4911432511488) block 8477879271424 gen 255532
	key (EXTENT_CSUM EXTENT_CSUM 4911438635008) block 8477763993600 gen 255339
	key (EXTENT_CSUM EXTENT_CSUM 4911443587072) block 8477801889792 gen 255515
	key (EXTENT_CSUM EXTENT_CSUM 4911450861568) block 8477618438144 gen 255296
	key (EXTENT_CSUM EXTENT_CSUM 4911459819520) block 8477484105728 gen 255493
	key (EXTENT_CSUM EXTENT_CSUM 4911468179456) block 8477699997696 gen 255515
	key (EXTENT_CSUM EXTENT_CSUM 4911474753536) block 8477896949760 gen 255532
	key (EXTENT_CSUM EXTENT_CSUM 4911481630720) block 8478070276096 gen 255516
	key (EXTENT_CSUM EXTENT_CSUM 4911493132288) block 8477562339328 gen 255498
	key (EXTENT_CSUM EXTENT_CSUM 4911502270464) block 2095487516672 gen 255251
	key (EXTENT_CSUM EXTENT_CSUM 4911510507520) block 8477902143488 gen 255532
	key (EXTENT_CSUM EXTENT_CSUM 4911518830592) block 2095481126912 gen 255251
	key (EXTENT_CSUM EXTENT_CSUM 4911530209280) block 2095487631360 gen 255251
	key (EXTENT_CSUM EXTENT_CSUM 4911543361536) block 8477619011584 gen 255296
	key (EXTENT_CSUM EXTENT_CSUM 4911553613824) block 6450689130496 gen 255256
	key (EXTENT_CSUM EXTENT_CSUM 4911562694656) block 8477786603520 gen 255338
	key (EXTENT_CSUM EXTENT_CSUM 4911575584768) block 8477607313408 gen 255498
	key (EXTENT_CSUM EXTENT_CSUM 4911587049472) block 2095723347968 gen 255433
	key (EXTENT_CSUM EXTENT_CSUM 4911602769920) block 8478107811840 gen 237988
	key (EXTENT_CSUM EXTENT_CSUM 4911617314816) block 2095312977920 gen 214274
	key (EXTENT_CSUM EXTENT_CSUM 4911633747968) block 8477859282944 gen 253191
	key (EXTENT_CSUM EXTENT_CSUM 4911648309248) block 8477953359872 gen 213569
	key (EXTENT_CSUM EXTENT_CSUM 4911664930816) block 8478088052736 gen 211124
	key (EXTENT_CSUM EXTENT_CSUM 4911681552384) block 8478088069120 gen 211124
	key (EXTENT_CSUM EXTENT_CSUM 4911698173952) block 8478088085504 gen 211124
	key (EXTENT_CSUM EXTENT_CSUM 4911714795520) block 6450689163264 gen 255256
	key (EXTENT_CSUM EXTENT_CSUM 4911728156672) block 8477522378752 gen 255493
	key (EXTENT_CSUM EXTENT_CSUM 4911736844288) block 8477701341184 gen 255506
	key (EXTENT_CSUM EXTENT_CSUM 4911748939776) block 8477701373952 gen 255506
	key (EXTENT_CSUM EXTENT_CSUM 4911760543744) block 2896207118336 gen 255194
	key (EXTENT_CSUM EXTENT_CSUM 4911775547392) block 2095250456576 gen 254827
	key (EXTENT_CSUM EXTENT_CSUM 4911791718400) block 8478089609216 gen 255493
	key (EXTENT_CSUM EXTENT_CSUM 4911806226432) block 8477701390336 gen 255506
	key (EXTENT_CSUM EXTENT_CSUM 4911810736128) block 2095572369408 gen 255251
	key (EXTENT_CSUM EXTENT_CSUM 4911818137600) block 6450563284992 gen 255257
	key (EXTENT_CSUM EXTENT_CSUM 4911831973888) block 6450713198592 gen 255278
	key (EXTENT_CSUM EXTENT_CSUM 4911846227968) block 8477701832704 gen 228558
	key (EXTENT_CSUM EXTENT_CSUM 4911862849536) block 6450678562816 gen 251344
	key (EXTENT_CSUM EXTENT_CSUM 4911878438912) block 8477939040256 gen 255361
	key (EXTENT_CSUM EXTENT_CSUM 4911885381632) block 6450696372224 gen 255472
	key (EXTENT_CSUM EXTENT_CSUM 4911892893696) block 8477773053952 gen 255493
	key (EXTENT_CSUM EXTENT_CSUM 4911897636864) block 8477619027968 gen 255505
	key (EXTENT_CSUM EXTENT_CSUM 4911907803136) block 8478331551744 gen 255381
	key (EXTENT_CSUM EXTENT_CSUM 4911915851776) block 2095487746048 gen 255251
	key (EXTENT_CSUM EXTENT_CSUM 4911928381440) block 2095485534208 gen 255251
	key (EXTENT_CSUM EXTENT_CSUM 4911939751936) block 6450691227648 gen 255256
	key (EXTENT_CSUM EXTENT_CSUM 4911949889536) block 8477550346240 gen 255497
	key (EXTENT_CSUM EXTENT_CSUM 4911960281088) block 2095244673024 gen 255250
	key (EXTENT_CSUM EXTENT_CSUM 4911968550912) block 8478374330368 gen 255169
	key (EXTENT_CSUM EXTENT_CSUM 4911976820736) block 6450563022848 gen 255258
	key (EXTENT_CSUM EXTENT_CSUM 4911986008064) block 2095667888128 gen 255250
	key (EXTENT_CSUM EXTENT_CSUM 4911994003456) block 6450721079296 gen 255281
	key (EXTENT_CSUM EXTENT_CSUM 4911999787008) block 8478046109696 gen 255365
	key (EXTENT_CSUM EXTENT_CSUM 4912009973760) block 8477873487872 gen 255339
	key (EXTENT_CSUM EXTENT_CSUM 4912016211968) block 8477532733440 gen 255218
	key (EXTENT_CSUM EXTENT_CSUM 4912029597696) block 8478335778816 gen 212412
	key (EXTENT_CSUM EXTENT_CSUM 4912046219264) block 7714681274368 gen 211366
	key (EXTENT_CSUM EXTENT_CSUM 4912059023360) block 7714680864768 gen 211366
	key (EXTENT_CSUM EXTENT_CSUM 4912075644928) block 8477652254720 gen 211367
	key (EXTENT_CSUM EXTENT_CSUM 4912092266496) block 8477652271104 gen 211367
	key (EXTENT_CSUM EXTENT_CSUM 4912108888064) block 8477652303872 gen 211367
	key (EXTENT_CSUM EXTENT_CSUM 4912125509632) block 8477652336640 gen 211367
	key (EXTENT_CSUM EXTENT_CSUM 4912142131200) block 8477849419776 gen 211374
	key (EXTENT_CSUM EXTENT_CSUM 4912158752768) block 8477849436160 gen 211374
	key (EXTENT_CSUM EXTENT_CSUM 4912175349760) block 8477523558400 gen 255493
	key (EXTENT_CSUM EXTENT_CSUM 4912187535360) block 8477497917440 gen 255493
	key (EXTENT_CSUM EXTENT_CSUM 4912194662400) block 6450695290880 gen 255470
	key (EXTENT_CSUM EXTENT_CSUM 4912201371648) block 8478114512896 gen 255374
	key (EXTENT_CSUM EXTENT_CSUM 4912217075712) block 8478446649344 gen 254966
	key (EXTENT_CSUM EXTENT_CSUM 4912229380096) block 8478070603776 gen 254922
	key (EXTENT_CSUM EXTENT_CSUM 4912245252096) block 8478299865088 gen 255168
	key (EXTENT_CSUM EXTENT_CSUM 4912259923968) block 2896123740160 gen 246335
	key (EXTENT_CSUM EXTENT_CSUM 4912268677120) block 8478299013120 gen 255243
	key (EXTENT_CSUM EXTENT_CSUM 4912281976832) block 8477963911168 gen 255363
	key (EXTENT_CSUM EXTENT_CSUM 4913475031040) block 8477498097664 gen 255493
	key (EXTENT_CSUM EXTENT_CSUM 4913481289728) block 6450575785984 gen 255457
	key (EXTENT_CSUM EXTENT_CSUM 4913487241216) block 8477939056640 gen 255361
	key (EXTENT_CSUM EXTENT_CSUM 4913498177536) block 8477786865664 gen 255493
	key (EXTENT_CSUM EXTENT_CSUM 4913511317504) block 8477786750976 gen 255338
	key (EXTENT_CSUM EXTENT_CSUM 4913522954240) block 8477605265408 gen 255494
	key (EXTENT_CSUM EXTENT_CSUM 4913528889344) block 2095627419648 gen 255251
	key (EXTENT_CSUM EXTENT_CSUM 4913541951488) block 6450697551872 gen 255272
	key (EXTENT_CSUM EXTENT_CSUM 4913553911808) block 8477768794112 gen 255327
	key (EXTENT_CSUM EXTENT_CSUM 4913568161792) block 6450691522560 gen 255256
	key (EXTENT_CSUM EXTENT_CSUM 4913581428736) block 8477575086080 gen 255291
	key (EXTENT_CSUM EXTENT_CSUM 4913589563392) block 8478185816064 gen 255149
	key (EXTENT_CSUM EXTENT_CSUM 4913601384448) block 6450716065792 gen 246502
	key (EXTENT_CSUM EXTENT_CSUM 4913605787648) block 8478096195584 gen 255366
	key (EXTENT_CSUM EXTENT_CSUM 4913618006016) block 8477898932224 gen 211885
	key (EXTENT_CSUM EXTENT_CSUM 4913622409216) block 6450704809984 gen 228533
	key (EXTENT_CSUM EXTENT_CSUM 4913639030784) block 8477952966656 gen 255362
	key (EXTENT_CSUM EXTENT_CSUM 4913644511232) block 8477966811136 gen 255362
	key (EXTENT_CSUM EXTENT_CSUM 4913655652352) block 8477927342080 gen 232501
	key (EXTENT_CSUM EXTENT_CSUM 4913672273920) block 8477956505600 gen 211461
	key (EXTENT_CSUM EXTENT_CSUM 4913688895488) block 2095244050432 gen 255250
	key (EXTENT_CSUM EXTENT_CSUM 4913701113856) block 6450721669120 gen 254346
	key (EXTENT_CSUM EXTENT_CSUM 4913705517056) block 2095483715584 gen 255251
	key (EXTENT_CSUM EXTENT_CSUM 4913720102912) block 6452041842688 gen 255281
	key (EXTENT_CSUM EXTENT_CSUM 4913729728512) block 8477619224576 gen 255505
	key (EXTENT_CSUM EXTENT_CSUM 4913743921152) block 8477841932288 gen 255532
	key (EXTENT_CSUM EXTENT_CSUM 4913749307392) block 8477499588608 gen 255493
	key (EXTENT_CSUM EXTENT_CSUM 4913759141888) block 8477929799680 gen 255493
	key (EXTENT_CSUM EXTENT_CSUM 4913768996864) block 8477751410688 gen 254381
	key (EXTENT_CSUM EXTENT_CSUM 4913784815616) block 8478404837376 gen 254826
	key (EXTENT_CSUM EXTENT_CSUM 4913799761920) block 8477927243776 gen 222541
	key (EXTENT_CSUM EXTENT_CSUM 4913801412608) block 2095249260544 gen 255250
	key (EXTENT_CSUM EXTENT_CSUM 4913816383488) block 6450721619968 gen 254346
	key (EXTENT_CSUM EXTENT_CSUM 4913823748096) block 8477929881600 gen 255493
	key (EXTENT_CSUM EXTENT_CSUM 4913834254336) block 2095250554880 gen 254827
	key (EXTENT_CSUM EXTENT_CSUM 4913848864768) block 8477581983744 gen 255218
	key (EXTENT_CSUM EXTENT_CSUM 4913863110656) block 8478237048832 gen 224784
	key (EXTENT_CSUM EXTENT_CSUM 4913865134080) block 8477829742592 gen 212218
	key (EXTENT_CSUM EXTENT_CSUM 4913881702400) block 8478104633344 gen 255374
	key (EXTENT_CSUM EXTENT_CSUM 4913895874560) block 8477759193088 gen 255504
	key (EXTENT_CSUM EXTENT_CSUM 4913908670464) block 2095243853824 gen 255250
	key (EXTENT_CSUM EXTENT_CSUM 4913923538944) block 8477883006976 gen 255532
	key (EXTENT_CSUM EXTENT_CSUM 4913934700544) block 8477902307328 gen 254896
	key (EXTENT_CSUM EXTENT_CSUM 4913948164096) block 8478039752704 gen 230061
	key (EXTENT_CSUM EXTENT_CSUM 4913960562688) block 8477893263360 gen 255532
	key (EXTENT_CSUM EXTENT_CSUM 4913974534144) block 8477701406720 gen 255506
	key (EXTENT_CSUM EXTENT_CSUM 4913982152704) block 8478447419392 gen 254966
	key (EXTENT_CSUM EXTENT_CSUM 4913990545408) block 8478032609280 gen 255241
	key (EXTENT_CSUM EXTENT_CSUM 4913998557184) block 6450649448448 gen 255457
	key (EXTENT_CSUM EXTENT_CSUM 4914013429760) block 6450685919232 gen 255066
	key (EXTENT_CSUM EXTENT_CSUM 4914026553344) block 8478117609472 gen 211588
	key (EXTENT_CSUM EXTENT_CSUM 4914043150336) block 2095244165120 gen 255250
	key (EXTENT_CSUM EXTENT_CSUM 4914052521984) block 8477876486144 gen 255339
	key (EXTENT_CSUM EXTENT_CSUM 4914063429632) block 6451884916736 gen 255281
	key (EXTENT_CSUM EXTENT_CSUM 4914075656192) block 6450696847360 gen 255472
	key (EXTENT_CSUM EXTENT_CSUM 4914089598976) block 2095702278144 gen 255250
	key (EXTENT_CSUM EXTENT_CSUM 4914102054912) block 2896060088320 gen 246582
	key (EXTENT_CSUM EXTENT_CSUM 4914115891200) block 8478298603520 gen 254675
	key (EXTENT_CSUM EXTENT_CSUM 4914130063360) block 8478386814976 gen 251905
	key (EXTENT_CSUM EXTENT_CSUM 4914143866880) block 8477500719104 gen 255493
	key (EXTENT_CSUM EXTENT_CSUM 4914152570880) block 8477929897984 gen 255493
	key (EXTENT_CSUM EXTENT_CSUM 4914164236288) block 6450691801088 gen 253085
	key (EXTENT_CSUM EXTENT_CSUM 4914176966656) block 8477937795072 gen 255359
	key (EXTENT_CSUM EXTENT_CSUM 4914191990784) block 8477986832384 gen 242644
	key (EXTENT_CSUM EXTENT_CSUM 4914205011968) block 8478431232000 gen 229967
	key (EXTENT_CSUM EXTENT_CSUM 4914219974656) block 6450685837312 gen 243164
	key (EXTENT_CSUM EXTENT_CSUM 4914229489664) block 8478437212160 gen 212155
	key (EXTENT_CSUM EXTENT_CSUM 4914244780032) block 8478429773824 gen 254676
	key (EXTENT_CSUM EXTENT_CSUM 4914261147648) block 6450567839744 gen 255197
	key (EXTENT_CSUM EXTENT_CSUM 4914274107392) block 6450708267008 gen 254588
	key (EXTENT_CSUM EXTENT_CSUM 4914286198784) block 8477761110016 gen 254891
	key (EXTENT_CSUM EXTENT_CSUM 4914290323456) block 8477524525056 gen 255493
	key (EXTENT_CSUM EXTENT_CSUM 4914301665280) block 8477779968000 gen 255339
	key (EXTENT_CSUM EXTENT_CSUM 4914312564736) block 8478408212480 gen 255403
	key (EXTENT_CSUM EXTENT_CSUM 4914323267584) block 6450649481216 gen 255457
	key (EXTENT_CSUM EXTENT_CSUM 4914335182848) block 6450685296640 gen 255458
	key (EXTENT_CSUM EXTENT_CSUM 4914344677376) block 8477768417280 gen 255287
	key (EXTENT_CSUM EXTENT_CSUM 4914353102848) block 2095481683968 gen 255251
	key (EXTENT_CSUM EXTENT_CSUM 4914361483264) block 6450682101760 gen 255256
	key (EXTENT_CSUM EXTENT_CSUM 4914367983616) block 6450693767168 gen 255256
	key (EXTENT_CSUM EXTENT_CSUM 4914383478784) block 6450694324224 gen 233628
	key (EXTENT_CSUM EXTENT_CSUM 4914399047680) block 8477987045376 gen 255361
	key (EXTENT_CSUM EXTENT_CSUM 4914412744704) block 8478307549184 gen 255381
	key (EXTENT_CSUM EXTENT_CSUM 4914428186624) block 8478244864000 gen 255381
	key (EXTENT_CSUM EXTENT_CSUM 4914433777664) block 8477774594048 gen 246653
	key (EXTENT_CSUM EXTENT_CSUM 4914546737152) block 8478216699904 gen 228246
	key (EXTENT_CSUM EXTENT_CSUM 4914563305472) block 8478245945344 gen 255381
	key (EXTENT_CSUM EXTENT_CSUM 4914576044032) block 6450721439744 gen 255281
	key (EXTENT_CSUM EXTENT_CSUM 4914581688320) block 8478293983232 gen 254825
	key (EXTENT_CSUM EXTENT_CSUM 4914590142464) block 2095650177024 gen 255251
	key (EXTENT_CSUM EXTENT_CSUM 4914597494784) block 6450568921088 gen 255197
	key (EXTENT_CSUM EXTENT_CSUM 4914610982912) block 8478371135488 gen 251905
	key (EXTENT_CSUM EXTENT_CSUM 4914626797568) block 8477869326336 gen 212084
	key (EXTENT_CSUM EXTENT_CSUM 4914630680576) block 8477845454848 gen 211870
	key (EXTENT_CSUM EXTENT_CSUM 4914647302144) block 8478078418944 gen 254922
	key (EXTENT_CSUM EXTENT_CSUM 4914662670336) block 8478430019584 gen 254676
	key (EXTENT_CSUM EXTENT_CSUM 4914677780480) block 2095483994112 gen 255251
	key (EXTENT_CSUM EXTENT_CSUM 4914693001216) block 2095484616704 gen 255251
	key (EXTENT_CSUM EXTENT_CSUM 4914706898944) block 8477840654336 gen 255339
	key (EXTENT_CSUM EXTENT_CSUM 4914714718208) block 2095484207104 gen 255251
	key (EXTENT_CSUM EXTENT_CSUM 4914727444480) block 8477848092672 gen 255339
	key (EXTENT_CSUM EXTENT_CSUM 4914740789248) block 8477971283968 gen 211893
	key (EXTENT_CSUM EXTENT_CSUM 4914757410816) block 8478087774208 gen 229924
	key (EXTENT_CSUM EXTENT_CSUM 4914770120704) block 8478289854464 gen 211979
	key (EXTENT_CSUM EXTENT_CSUM 4914774007808) block 8477987684352 gen 211894
	key (EXTENT_CSUM EXTENT_CSUM 4914790629376) block 6450681446400 gen 255256
	key (EXTENT_CSUM EXTENT_CSUM 4914805555200) block 8477938810880 gen 223692
	key (EXTENT_CSUM EXTENT_CSUM 4914807250944) block 8477829660672 gen 255339
	key (EXTENT_CSUM EXTENT_CSUM 4914817228800) block 6450574770176 gen 255457
	key (EXTENT_CSUM EXTENT_CSUM 4914826629120) block 8477688201216 gen 255514
	key (EXTENT_CSUM EXTENT_CSUM 4914839814144) block 8478079074304 gen 254922
	key (EXTENT_CSUM EXTENT_CSUM 4914854117376) block 6450690572288 gen 255007
	key (EXTENT_CSUM EXTENT_CSUM 4914866855936) block 8478053515264 gen 255365
	key (EXTENT_CSUM EXTENT_CSUM 4914881347584) block 8478107172864 gen 222728
	key (EXTENT_CSUM EXTENT_CSUM 4914887307264) block 8478301421568 gen 211978
	key (EXTENT_CSUM EXTENT_CSUM 4914903928832) block 8478307958784 gen 211978
	key (EXTENT_CSUM EXTENT_CSUM 4914920525824) block 8478308401152 gen 211978
	key (EXTENT_CSUM EXTENT_CSUM 4914937147392) block 8478308794368 gen 211978
	key (EXTENT_CSUM EXTENT_CSUM 4914953691136) block 8477544480768 gen 254881
	key (EXTENT_CSUM EXTENT_CSUM 4914968854528) block 6450716000256 gen 255472
	key (EXTENT_CSUM EXTENT_CSUM 4914979389440) block 8477903044608 gen 254896
	key (EXTENT_CSUM EXTENT_CSUM 4914994855936) block 8477900636160 gen 255341
	key (EXTENT_CSUM EXTENT_CSUM 4915010801664) block 8478087577600 gen 228567
	key (EXTENT_CSUM EXTENT_CSUM 4915015127040) block 8478116757504 gen 255374
	key (EXTENT_CSUM EXTENT_CSUM 4915028074496) block 8477920133120 gen 212219
	key (EXTENT_CSUM EXTENT_CSUM 4915031957504) block 7714281635840 gen 254868
	key (EXTENT_CSUM EXTENT_CSUM 4915039379456) block 6450697109504 gen 255472
	key (EXTENT_CSUM EXTENT_CSUM 4915048333312) block 6450697142272 gen 255472
	key (EXTENT_CSUM EXTENT_CSUM 4915059974144) block 8477896966144 gen 255532
umount /dev/sdb2 && btrfs check --readonly /dev/sdb2
Opening filesystem to check...
ERROR: /dev/sdb2 is currently mounted, use --force if you really intend to check the filesystem
[root@amdf fcc]# umount /var/vols/8T-5 
[root@amdf fcc]# btrfs check --readonly /dev/sdb2
Opening filesystem to check...
Checking filesystem on /dev/sdb2
UUID: 6a32b7e3-0ad3-4316-942d-ec568e1e86f8
[1/7] checking root items
[2/7] checking extents
parent transid verify failed on 8477840605184 wanted 255798 found 255532
parent transid verify failed on 8477840605184 wanted 255798 found 255532
parent transid verify failed on 8477840605184 wanted 255798 found 255532
Ignoring transid failure
bad block 8477840605184
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
[4/7] checking fs roots
parent transid verify failed on 8477840605184 wanted 255798 found 255532
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=8477703667712 item=130 parent level=1 child bytenr=8477840605184 child level=1
parent transid verify failed on 8477840605184 wanted 255798 found 255532
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=8477703667712 item=130 parent level=1 child bytenr=8477840605184 child level=1
parent transid verify failed on 8477840605184 wanted 255798 found 255532
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=8477703667712 item=130 parent level=1 child bytenr=8477840605184 child level=1
parent transid verify failed on 8477840605184 wanted 255798 found 255532
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=8477703667712 item=130 parent level=1 child bytenr=8477840605184 child level=1
parent transid verify failed on 8477840605184 wanted 255798 found 255532
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=8477703667712 item=130 parent level=1 child bytenr=8477840605184 child level=1
parent transid verify failed on 8477840605184 wanted 255798 found 255532
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=8477703667712 item=130 parent level=1 child bytenr=8477840605184 child level=1
parent transid verify failed on 8477840605184 wanted 255798 found 255532
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=8477703667712 item=130 parent level=1 child bytenr=8477840605184 child level=1
parent transid verify failed on 8477840605184 wanted 255798 found 255532
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=8477703667712 item=130 parent level=1 child bytenr=8477840605184 child level=1
parent transid verify failed on 8477840605184 wanted 255798 found 255532
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=8477703667712 item=130 parent level=1 child bytenr=8477840605184 child level=1
[5/7] checking only csums items (without verifying data)
parent transid verify failed on 8477840605184 wanted 255798 found 255532
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=8477703667712 item=130 parent level=1 child bytenr=8477840605184 child level=1
Error going to next leaf -5
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
ERROR: transid errors in file system
found 6639532507136 bytes used, error(s) found
total csum bytes: 6040169540
total tree bytes: 7466680320
total fs tree bytes: 3670016
total extent tree bytes: 439713792
btree space waste bytes: 960877186
file data blocks allocated: 74513506304
 referenced 69412663296
journalctl  -k | grep -i btrfs
May 18 11:54:04 fedora kernel: Btrfs loaded, crc32c=crc32c-generic, zoned=yes
May 18 11:54:04 fedora kernel: BTRFS: device label fedora_fedora devid 1 transid 95521 /dev/sda4 scanned by systemd-udevd (424)
May 18 11:54:04 fedora kernel: BTRFS: device label btrfs devid 1 transid 262656 /dev/sdb2 scanned by systemd-udevd (395)
May 18 11:54:07 fedora kernel: BTRFS info (device sda4): disk space caching is enabled
May 18 11:54:07 fedora kernel: BTRFS info (device sda4): has skinny extents
May 18 11:54:07 fedora kernel: BTRFS info (device sda4): enabling ssd optimizations
May 18 11:54:10 amdf kernel: BTRFS info (device sda4): use zstd compression, level 1
May 18 11:54:10 amdf kernel: BTRFS info (device sda4): disk space caching is enabled
May 18 11:54:12 amdf kernel: BTRFS info (device sdb2): use zstd compression, level 5
May 18 11:54:12 amdf kernel: BTRFS info (device sdb2): disk space caching is enabled
May 18 11:54:12 amdf kernel: BTRFS info (device sdb2): has skinny extents
May 18 11:54:12 amdf kernel: BTRFS info (device sda4): disk space caching is enabled
May 18 11:54:12 amdf kernel: BTRFS info (device sda4): disk space caching is enabled
May 18 11:54:12 amdf kernel: BTRFS info (device sdb2): bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 3012, gen 0

P.S. I have already removed the OS instance from when those 3K corruptions happened, so I cannot get the logs from when they occurred.

If I recall correctly, when the corruptions occurred the host was running Silverblue Rawhide with an upgraded kernel (on May 16 or 17).

OK this might be related. When a libvirt storage pool is created, libvirt detects whether it's on Btrfs and sets chattr +C on the enclosing directory defined as the pool. Any files subsequently created or copied into that directory get the C file attribute as well. But this file doesn't have it, so I'm guessing the qcow2 was copied into the directory before libvirt considered it a storage pool.
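You can check for the C (No_COW) attribute in the first field of lsattr output; a sketch, using the flag field from the lsattr output earlier in this thread (which shows no C flag):

```shell
# Sketch: detect the No_COW (C) attribute in an lsattr flag field.
# This flag string mirrors the earlier output for dell250G.qcow2.
flags='--------------------'   # first field of: lsattr /var/vols/8T-5/Data/1/VMs/dell250G.qcow2

case "$flags" in
  *C*) echo 'NOCOW set' ;;      # libvirt-managed pool files would show this
  *)   echo 'NOCOW not set' ;;  # this file's case: copied in before pool creation
esac
```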

There is a known issue when combining datacow and O_DIRECT on Btrfs, in which checksums aren’t always updated. More details here:

https://lore.kernel.org/linux-btrfs/20130708132038.GG2260@localhost.localdomain/
https://lore.kernel.org/linux-btrfs/15e49989-8fe2-6da3-83fa-89dc13b465d8@suse.com/

Unfortunately this qcow2 is a bit stuck. Its data blocks are OK; it's the checksums that are wrong, because they weren't updated in O_DIRECT mode when in-flight data changed. The workaround is to mount this file system with mount option ro,rescue=all and then copy the qcow2 file to some other file system. Because rescue=all includes ignoredatacsums, the file can be copied out successfully. Then you can mount the first file system normally (read-write), delete the qcow2 file, chattr +C the enclosing directory, and copy the qcow2 file back. If you have any snapshots that contain the qcow2 file, they need to be deleted too.
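The recovery sequence can be sketched as follows, using the device and paths from this thread; /other/fs is a placeholder destination, and with DRY_RUN=1 the commands are only printed, not executed:

```shell
# Sketch of the copy-out/copy-back workaround; adjust device, mountpoint,
# and destination for your system. DRY_RUN=1 prints instead of executing.
DRY_RUN=1
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run mount -o ro,rescue=all /dev/sdb2 /var/vols/8T-5   # rescue=all includes ignoredatacsums
run cp /var/vols/8T-5/Data/1/VMs/dell250G.qcow2 /other/fs/dell250G.qcow2
run umount /var/vols/8T-5
run mount /dev/sdb2 /var/vols/8T-5                    # normal read-write mount
run rm /var/vols/8T-5/Data/1/VMs/dell250G.qcow2       # also delete snapshots containing it
run chattr +C /var/vols/8T-5/Data/1/VMs               # No_COW for the pool directory
run cp /other/fs/dell250G.qcow2 /var/vols/8T-5/Data/1/VMs/
```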

I think the file system is otherwise OK, but I’ll double check on the parent transid errors that stop the scrub. That might be a bug, but I’m not sure.

Thanks a lot!

I copied the qcow2 file out and removed it from the 8T disk. qemu-img reports no errors when checking it:

 $ qemu-img check dell250G.qcow2 
No errors were found on the image.
1967425/3815603 = 51.56% allocated, 1.50% fragmented, 0.00% compressed clusters
Image end offset: 128970653696

Afterwards, btrfs scrub resume did not abort immediately.

Is it safe for me to keep using my 8T disk normally? Or should I mount it read-only until the transid error is resolved?

Update:
The scrub was able to finish:

sudo btrfs scrub status /var/vols/8T-5/
UUID:             6a32b7e3-0ad3-4316-942d-ec568e1e86f8
Scrub resumed:    Wed May 19 00:28:21 2021
Status:           finished
Duration:         9:08:49
Total to scrub:   5.76TiB
Rate:             184.14MiB/s
Error summary:    verify=2 csum=1
  Corrected:      0
  Uncorrectable:  3
  Unverified:     0

There are two additional checksum errors at the same logical address 8477840605184, but at different physical addresses:

May 19 05:09:56 amdf kernel: BTRFS warning (device sdb2): checksum/header error at logical 8477840605184 on dev /dev/sdb2, physical 6205643522048: metadata leaf (level 0) in tree 7
May 19 05:09:56 amdf kernel: BTRFS warning (device sdb2): checksum/header error at logical 8477840605184 on dev /dev/sdb2, physical 6205643522048: metadata leaf (level 0) in tree 7
May 19 05:10:03 amdf kernel: BTRFS warning (device sdb2): checksum/header error at logical 8477840605184 on dev /dev/sdb2, physical 6206717263872: metadata leaf (level 0) in tree 7
May 19 05:10:03 amdf kernel: BTRFS warning (device sdb2): checksum/header error at logical 8477840605184 on dev /dev/sdb2, physical 6206717263

device stats shows new errors: [/dev/sdb2].generation_errs 2

The output of btrfs insp dump-t -b 8477840605184 seems the same as before.

There are far fewer errors from btrfs check:

sudo btrfs check --readonly /dev/sdb2
Opening filesystem to check...
Checking filesystem on /dev/sdb2
UUID: 6a32b7e3-0ad3-4316-942d-ec568e1e86f8
[1/7] checking root items
[2/7] checking extents
parent transid verify failed on 8477840605184 wanted 255798 found 255532
parent transid verify failed on 8477840605184 wanted 255798 found 255532
parent transid verify failed on 8477840605184 wanted 255798 found 255532
Ignoring transid failure
bad block 8477840605184
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
parent transid verify failed on 8477840605184 wanted 255798 found 255532
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=8477897146368 item=218 parent level=1 child bytenr=8477840605184 child level=1
Error going to next leaf -5
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
ERROR: transid errors in file system
found 6317718130688 bytes used, error(s) found
total csum bytes: 5833265948
total tree bytes: 7172931584
total fs tree bytes: 3670016
total extent tree bytes: 393412608
btree space waste bytes: 919949199
file data blocks allocated: 74452951040
 referenced 69352108032

Notice the identical error appears twice each time; that's due to DUP metadata, i.e. two copies of the same metadata leaf, and both are bad. Btrfs has a write-time tree checker to make sure it's not doing something stupid, and what we've found at logical 8477840605184 is 266 commits older (found generation 255532 vs. the wanted 255798). That suggests the drive simply dropped those writes; they never made it to stable media. The earlier report of 3000+ corruptions could have been the same event, also affecting your qcow2 file.
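For reference, the size of the generation gap can be read straight off the transid error line (numbers taken from the logs above):

```shell
# "parent transid verify failed on 8477840605184 wanted 255798 found 255532"
# means the surviving on-disk copy is this many transaction commits stale:
wanted=255798
found=255532
echo $((wanted - found))   # prints: 266
```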

Ultimately this is buggy firmware, and it might be possible to work around it by disabling the drive's write cache using hdparm -W (check the man page: there's -w and -W; one is dangerous, the other manipulates the drive's write cache). Now, you have an enterprise drive, and I suspect it's under warranty. I personally think you have a valid warranty claim, and I would aggressively make the case that the drive should be replaced under warranty, because a drive should not do this. Even a consumer drive shouldn't, but absolutely an enterprise drive shouldn't. But if they do replace it, the replacement has a good chance of having the same firmware revision and defect. Now what?
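For reference, a sketch of the write-cache commands; /dev/sdb as the whole-disk device behind the sdb2 partition is an assumption, and the existence guard is only there so the snippet is harmless on a machine without that disk:

```shell
DEV=/dev/sdb   # assumption: the whole disk behind the sdb2 partition

if [ -b "$DEV" ]; then
    hdparm -W "$DEV"      # -W with no argument: report current write-cache state
    hdparm -W0 "$DEV"     # -W0: disable the volatile write cache (-W1 re-enables it)
else
    echo "no block device at $DEV; nothing to do"
fi
```

Note the setting generally has to be reapplied after a reboot, e.g. via a udev rule; the exact mechanism is distribution-dependent.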

You’ll need to make some choices about what the next steps are.

  • Back up the data, reformat the drive, restore the data. This will obviously take a while, but it will definitely get you a healthy file system again; or
  • Try btrfs check --repair. It might be able to fix the problem, but it can take a while if the file system is large. There is also a risk it will make things worse, so anything important on the drive needs to be backed up first anyway before trying this option.
    • After --repair you need to rerun btrfs check --readonly to see whether it fixed the problem. This should take a fraction of the time of a full scrub because it only works on metadata. If the repair didn't work…
    • You can try btrfs check --init-csum-tree. This will take a while because it has to read every data block to recreate the csum tree from scratch. Scrub takes 11 hours on this file system, so I expect it will take 11-12 hours to recreate the csum tree. The reason for this particular option is that the reported errors point to the csum tree, e.g. this line:

node 8477840605184 level 1 items 245 free space 248 generation 255532 owner CSUM_TREE

Sorry about the bad news, but it’s sorta why we use Btrfs, to make sure our data is where it’s supposed to be, and is what it’s purported to be.

The first option will take roughly 22 hours. If you have a spare 8T drive to use for the backup and can retask this 8T drive for some other purpose afterwards, then it's “only” 11 hours.

The second option might take less time, depending on whether you are OK doing a partial backup and accepting some risk that the repair won't fix the problem, in which case you'll end up trying things that take almost as long as, or possibly longer than, the first option.


Thank you very much for all your help!

I will source enough space for a full backup (I guess I cannot use btrfs send/receive this time) and will likely run btrfs check --repair to see what happens.

Yeah, it's a good question what will happen during the backup.

The scrub shows the data is OK; there are just a couple of uncorrectable errors related to metadata, mainly this one checksum tree node. A btrfs send/receive backup would probably work, which has the advantage of being a lot faster if you already have an initial send/receive done; all you then need is an incremental one with the -p option.

Whether you use rsync or send/receive, if Btrfs runs into confusion, it will stop. The way it stops is by issuing EIO (input/output error) to the application. What you see depends on the application: send will spit out a single line of error message (instead of nothing), which you can make more verbose by passing -v to the send. Rsync will also just stop, with 1-3 lines of error. dmesg is where the kernel reports errors, so you might want to run journalctl -f so you can see whether there are any btrfs messages while doing the backup.
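The monitoring idea can be sketched like this. In practice you'd run `journalctl -kf | grep -E 'BTRFS (warning|error)'` in a second terminal while the backup runs; here it is demonstrated against two sample kernel lines quoted earlier in this thread:

```shell
# Two kernel log lines from earlier in this thread: one informational, one error.
MSGS='May 18 09:26:51 amdf kernel: BTRFS info (device sdb2): scrub: started on devid 1
May 18 09:26:53 amdf kernel: BTRFS error (device sdb2): parent transid verify failed on 8477840605184 wanted 255798 found 255532'

# Keep only warning/error lines, the ones worth reacting to during a backup.
ERRS=$(printf '%s\n' "$MSGS" | grep -E 'BTRFS (warning|error)')
echo "$ERRS"
```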

Note that by default, with a normal mount, btrfs won't replicate corruption no matter the backup method; it just issues EIO. But some applications will substitute zeros anytime there's EIO, so exactly what happens is really up to the application's error handling.

But once you run into a problem, then we’ll know what to do.

If there's a problem, it is possible to mount with ro,rescue=all and extract any files that can't be copied with a normal mount. They might have corruption, but you can isolate those files and check them carefully later or replace them.


Also, it might be worth summarizing your posts, and post to the linux-btrfs@ mailing list. And then cross-post the URLs for both, so they refer to each other. You might get a second opinion about the problem, but mostly it’ll just let developers and users know about this particular make/model/firmware revision.

https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list


https://lore.kernel.org/linux-btrfs/CACEy+ER0_B0rOkuzwGEvJXO7jkxZ55D0JcTX+=ApHa+RstmNXQ@mail.gmail.com/T/#u

I am doing rsync from the 8T disk to other disks at the moment.
So far no notable errors.

I finished export all data out from my 8T disk.

I can test any recovery method, even lost data from the disk.

My aim is to have btrfs check reports no error.

What I should try?

btrfs-image -c9 -t4 -ss /dev/sdXY /path/to/btrfs.image

Then if the btrfs check breaks the file system, there’s an image for the developers to work on and see what went wrong.

btrfs check --repair /dev/sdXY

The actual problem is with the checksum/header of a csum node. So I don’t know that --repair can fix it. Chances are it won’t. But it’s possible that…

btrfs check --init-csum-tree /dev/sdXY

will fix it, since it recomputes checksums for all data blocks. But it will take a long time, at least as long as a scrub, so it's up to you whether it's worth finding out. Glad your data is safe though!

Update, it already finished

 2021-05-24 10:06:38
# ls -lsh
total 6.7G
6.7G -rw-r--r--. 1 root root 6.7G May 24 10:04 btrfs-image_p2.image
[root@amdf 8T_transid_error] 2021-05-24 10:06:44

===
Is this normal? If yes, I will let it run to finish.

btrfs-image giving lots of WARNINGs…

WARNING: cannot find a hash collision for 'GEO', generating garbage, it won't match indexes
WARNING: cannot find a hash collision for 'GUM', generating garbage, it won't match indexes
WARNING: cannot find a hash collision for 'KGZ', generating garbage, it won't match indexes
WARNING: cannot find a hash collision for 'KHM', generating garbage, it won't match indexes
WARNING: cannot find a hash collision for 'KIR', generating garbage, it won't match indexes
WARNING: cannot find a hash collision for 'KWT', generating garbage, it won't match indexes
WARNING: cannot find a hash collision for 'LAO', generating garbage, it won't match indexes
WARNING: cannot find a hash collision for 'LBN', generating garbage, it won't match indexes
WARNING: cannot find a hash collision for 'LKA', generating garbage, it won't match indexes
WARNING: cannot find a hash collision for 'MDV', generating garbage, it won't match indexes
WARNING: cannot find a hash collision for 'MHL', generating garbage, it won't match indexes
WARNING: cannot find a hash collision for 'MMR', generating garbage, it won't match indexes
WARNING: cannot find a hash collision for 'MNG', generating garbage, it won't match indexes
WARNING: cannot find a hash collision for 'MNP', generating garbage, it won't match indexes

It seems to me btrfs check --repair did not make it worse.

This is check repair

] 2021-05-24 10:06:44
# sudo btrfs check --repair /dev/sdc2
enabling repair mode
WARNING:

	Do not use --repair unless you are advised to do so by a developer
	or an experienced user, and then only after having accepted that no
	fsck can successfully repair all types of filesystem corruption. Eg.
	some software or hardware bugs can fatally damage a volume.
	The operation will start in 10 seconds.
	Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting repair.
Opening filesystem to check...
Checking filesystem on /dev/sdc2
UUID: 6a32b7e3-0ad3-4316-942d-ec568e1e86f8
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
parent transid verify failed on 8477840605184 wanted 255798 found 255532
parent transid verify failed on 8477840605184 wanted 255798 found 255532
parent transid verify failed on 8477840605184 wanted 255798 found 255532
Ignoring transid failure
bad block 8477840605184
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
parent transid verify failed on 8477840605184 wanted 255798 found 255532
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=2095243116544 item=433 parent level=1 child bytenr=8477840605184 child level=1
Error going to next leaf -5
[6/7] checking root refs
Recowing metadata block 8477840605184
ERROR: fails to fix transid errors
[7/7] checking quota groups skipped (not enabled on this FS)
found 6311744303104 bytes used, error(s) found
total csum bytes: 5827985368
total tree bytes: 7167934464
total fs tree bytes: 3670016
total extent tree bytes: 393412608
btree space waste bytes: 920648260
file data blocks allocated: 74452951040
 referenced 69352108032
extent buffer leak: start 8477840605184 len 16384
[root@amdf 8T_transid_error] 2021-05-24 10:18:11
# 

This is after:

2021-05-24 10:18:11
# sudo btrfs check --readonly /dev/sdc2
Opening filesystem to check...
Checking filesystem on /dev/sdc2
UUID: 6a32b7e3-0ad3-4316-942d-ec568e1e86f8
[1/7] checking root items
[2/7] checking extents
parent transid verify failed on 8477840605184 wanted 255798 found 255532
parent transid verify failed on 8477840605184 wanted 255798 found 255532
parent transid verify failed on 8477840605184 wanted 255798 found 255532
Ignoring transid failure
bad block 8477840605184
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
parent transid verify failed on 8477840605184 wanted 255798 found 255532
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=2095243116544 item=433 parent level=1 child bytenr=8477840605184 child level=1
Error going to next leaf -5
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
ERROR: transid errors in file system
found 6311744303104 bytes used, error(s) found
total csum bytes: 5827985368
total tree bytes: 7167934464
total fs tree bytes: 3670016
total extent tree bytes: 393412608
btree space waste bytes: 920647264
file data blocks allocated: 74452951040
 referenced 69352108032
[root@amdf 8T_transid_error] 2021-05-24 10:28:33

It's normal. It's something to do with short filenames and the -ss option, which is optional.


btrfs check --init-csum-tree /dev/sdXY did not correct the bad block error.

OK, great testing. I’d update the linux-btrfs@ thread to point out that neither --repair nor --init-csum-tree fixed the problem. And offer up the ~7G btrfs-image file upon request. That’s about all you can do. And it’s more than most.


I want to try forcing a bad-sector reallocation, similar to the above.

Would you please help me translate 8477840605184 from btrfs check into an address I can use in

sudo hdparm --yes-i-know-what-i-am-doing --write-sector #### /dev/sda

Thank you very much!

btrfs-map-logical -l 8477840605184 /dev/… should do it. It will return two physical offsets, because the metadata uses the DUP profile.
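To turn a physical byte offset from btrfs-map-logical into a sector number for hdparm --write-sector, divide by 512, and if hdparm is pointed at the whole disk rather than the partition, add the partition's start sector. A sketch with hypothetical numbers: the byte offset is the one from the checksum-error journal line earlier in the thread, and the partition start of 2048 is an assumption (read the real one from /sys/class/block/sdb2/start):

```shell
PHYS_BYTES=2578020110336   # physical byte offset on the partition (example value from this thread)
PART_START=2048            # assumed start sector of sdb2 on the whole disk

# btrfs-map-logical reports bytes relative to the device it was given;
# hdparm --write-sector wants a 512-byte LBA on the device it is given.
SECTOR=$(( PHYS_BYTES / 512 + PART_START ))
echo "$SECTOR"
```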

I don't think it's a bad-sector problem, though. The chances of two bad sectors hitting both copies of this same block are astronomical. But also, the dump tree for this block shows it's stale, so the more likely cause is a dropped write.


Yes, I agree it is not a physical bad sector, as the SMART report looks good.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       15585
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       355
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       4891
 10 Spin_Retry_Count        0x0033   107   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       238
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       136
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       365
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       40 (Min/Max 16/52)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   089   089   000    Old_age   Always       -       4566
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       656
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0
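For convenience, the handful of attributes that most directly indicate media trouble can be filtered out of the SMART table with awk. A sketch, with four rows from the table above inlined as sample input (normally you would pipe `smartctl -A /dev/sdb` into it):

```shell
# Select the attributes whose raw values most directly indicate media problems.
OUT=$(awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ { print $2, $NF }' <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
EOF
)
echo "$OUT"
```

All raw values are 0 here, consistent with there being no physical bad sectors.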

I will now wipe this drive and reuse it.

Thank you very much for all your help during this journey

The evidence available suggests the drive just dropped a bunch of writes, but it's not conclusive. We know that Btrfs is being hammered by xfstests, which includes piles of file system tests, including dm-log-writes based tests that verify correct write ordering by all Linux file systems for crash/power-fail safety. These tests are run by many upstreams and downstreams; they happen continuously for linux-next, for all the rc's, and for all stable releases of the kernel. This doesn't mean there isn't an as-yet-undiscovered btrfs or block layer bug, but it strongly points to the hardware's firmware, which is also really difficult to understand, because this absolutely should never happen in any consumer drive, let alone an enterprise one.

You do have a tedious way to maybe prove the theory, though, which is to keep using Btrfs in the same configuration and workload and see if it happens again. In theory, it should. The drive is quite young, so it shouldn't take too long to reproduce. If it does, then it means another rebuild, but this time disable the write cache consistently and see whether it happens a third time or not. If not, that's fairly conclusive evidence of a firmware + write-caching bug.

Note that smartctl -x reports quite a lot more detailed information, including internally logged read/write errors that the drive thinks it has self-corrected, which don't result in error reporting to the kernel. There's a remote possibility there are clues in that output.

In my opinion, you already have enough information to file a support ticket with Toshiba and make them aware that you suspect a firmware bug resulting in rare and transient dropped writes. Whether they will want to swap out the drive preemptively, I have no idea. But I think it's worth getting a support ticket on record for this particular drive. I'd say this even for a consumer drive under warranty, but certainly for an enterprise drive, because you pay a premium primarily for better support handling.
