Investigating kernel crashes due to NVMe disk

On Fedora 32 Silverblue, the Linux kernel 5.6.15-300.fc32.x86_64 crashes, and I suspect my NVMe disk (a WDC WDS100T2B0C-00PXH0, aka WD Blue SN550 1TB) is the cause.


What happens

Randomly (I assume when it accesses the file system/the NVMe SSD heavily), the system just freezes and shows me a fullscreen error. It’s always some kind of ext4 error, but it’s a new installation, so the file system should be intact.

Here are some errors:

[ 4948.250597] EXT4-fs error (device dm-2): __ext4_find_entry:1536: inode 83829000: comm gdm-session-wor: reading directory lblock 0


IMG_20200604_230820.jpg


[ 213.350921] EXT4-fs error (device dm-2): __ext4_find_entry:1536: inode 83029000: comm gdm-session-wor: reading directory lblock 0


IMG_20200605_000220.jpg


[ 206.681358] EXT4-fs error (device dm-4): ext4_read_inode_bitmap:200: comm dconf worker: Cannot read inode bitmap - block_group = 1056, inode_bitmap = 34603024
[ 206.681465] EXT4-fs error (device dm-4) in ext4_free_inode:355: IO failure
[ 206.775200] EXT4-fs error (device dm-4): ext4_wait_block_bitmap:520: comm cheese:cs0: Cannot read block bitmap - block_group = 38, block_bitmap = 1048582
[ 206.775410] EXT4-fs error (device dm-4): ext4_discard_preallocations:4090: comm cheese:cs0: Error -5 reading block bitmap for 38
[ 213.584473] EXT4-fs error (device dm-4): ext4_journal_check_start:84: Detected aborted journal
[ 213.584557] EXT4-fs (dm-4): Remounting filesystem read-only


IMG_20200605_232825.jpg
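
To get an overview of which devices and kernel functions show up in errors like the ones above, the lines can be grouped with a small script. This is only a rough sketch: the regular expression is my assumption, based on the two line shapes seen in the photos ("(device X): function:line:" and "(device X) in function:line:"), and the function name summarize_ext4_errors is made up for illustration.

```python
import re
from collections import Counter

# Assumed line shapes (taken from the messages above):
#   EXT4-fs error (device dm-4): ext4_read_inode_bitmap:200: comm ...
#   EXT4-fs error (device dm-4) in ext4_free_inode:355: IO failure
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\)"
    r"(?::| in) (?P<function>\w+):(?P<line>\d+):"
)


def summarize_ext4_errors(log_text):
    """Count EXT4-fs error lines per (device, kernel function)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EXT4_ERROR.search(line)
        if match:
            counts[(match["device"], match["function"])] += 1
    return counts
```

Feeding it the saved output of dmesg (or journalctl -k) gives a Counter whose most_common() lists the most frequent failure sites first. Lines such as "EXT4-fs (dm-4): Remounting filesystem read-only" are intentionally not counted, since they are not "EXT4-fs error" lines.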

What also happened

I assume the same problem also caused another error: the TPM seems to have been corrupted, and I had to regenerate it.

What I actually saw is this: at some boot, the BIOS/UEFI showed me a message claiming I had switched the CPU (of course I did not; it’s the built-in AMD Ryzen CPU) and that it needed to regenerate the fTPM values or so.
As I do not have anything that relies on the TPM, I could just choose Y (yes) to regenerate it.
(Note: IIRC, this happened after all of the photos were taken.)

Research

It seems the cause of this issue may be a Linux kernel bug:

https://www.linuxquestions.org/questions/linux-hardware-18/system-crash-hang-nvme-ext4-error-troubleshooting-tips-4175642613/
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1746340?comments=all
https://bugzilla.kernel.org/show_bug.cgi?id=197875

What they all have in common is that they refer to quite old Linux kernels. I had hoped that the up-to-date kernel in Fedora 32 would not have these issues, but well…

System

Here are all logs with system information (nvme-cli, smartctl, lshw etc.):

Side-note: I had to learn that not all WDC drives actually support the custom WDC commands that nvme-cli provides.

A log catching the problem

I’ve also managed to catch the dmesg output when this occurred. This time, nothing was noticeable graphically, and I could actually still use the system. However, in the background, it seems to have remounted the whole file system read-only (and did not tell me, lol). Do have a look at the end of that kernel log:

Funny how the system is still able to run when it throws all these kinds of errors…
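
Since the remount to read-only happens silently, one way to notice it is to look at the mount options in /proc/mounts. Below is a minimal sketch, assuming the usual /proc/mounts layout (device, mount point, file system type, comma-separated options); the function name read_only_mounts and the default of only checking ext4 are my own choices for illustration.

```python
def read_only_mounts(mounts_text, fs_types=("ext4",)):
    """Return (device, mount_point) pairs that are mounted read-only.

    mounts_text is expected to be the content of /proc/mounts, where each
    line looks like: device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type, options = fields[:4]
        # "ro" must be a whole option, not a substring of e.g. "errors=..."
        if fs_type in fs_types and "ro" in options.split(","):
            result.append((device, mount_point))
    return result
```

On a running system this could be called as read_only_mounts(open("/proc/mounts").read()). Note that on Silverblue some mounts (such as /usr) are, as far as I know, read-only by design, so only unexpected entries would point to this problem.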

Questions

I guess debugging this is hard if you have no file system that coredumps or the like could be written to. (I guess this is why the systemd coredumps fail, too.)

So anything I can still do here? Anything else I can provide for debugging?

And if I can, where do I report Linux kernel bugs? Just at https://bugzilla.kernel.org/ or is there some place to report kernel bugs for Fedora?


overview: other issues of this device

Reporting Bugs (Fedora) # Is there anything Silverblue-specific around? This uses dnf.
Reporting Bugs (Kernel).

Hmm ok, reported as a Fedora kernel bug for now.


As I got no reply in the Fedora bug tracker, I have now also reported it in the Linux kernel bug tracker.

Hi! I don’t have too much to add, but I have run into this same issue myself and spotted a couple of other reports of what seems to be the same issue (one in Amazon comments, another on Reddit): the combination of a Ryzen processor and the SN550 drive leads to hard freezes that require holding the power button to reset. From the kernel bug, it looks like you already discovered that the kernel flag resolves it, at least as a workaround.
Let me know if I can add anything to your bug reports that might help out!

Here’s the report on Reddit: https://www.reddit.com/r/pop_os/comments/g8y3ae/experiencing_random_freezes_after_installing_new/


Yeah, thanks. I’ve cross-posted the relevant links there and also made the kernel bug tracker people aware of that.

So far, the experience in the Fedora kernel bug tracker really is not satisfying. I did not receive even a single reply…

Summer vacations, probably.