Fedora 31 becoming completely unresponsive (freezes) on Thinkpad X270

Hello everybody,

My Fedora 31 system freezes completely at least once a day, to the point where I have to use the power button (SSH and the magic SysRq keys don’t always work, or maybe I am doing something wrong there). :flushed:
During the last freeze, the sound of the video that was playing was stuck in a loop of about two seconds.
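
On the SysRq keys: whether they react depends on the kernel.sysrq bitmask, and a mask that enables only some functions can look like the keys are not working. Below is a minimal Python sketch that decodes the mask, assuming the bit meanings listed in the kernel's SysRq documentation. The helper name decode_sysrq is made up for illustration.

```python
from pathlib import Path

# Bit values as listed in the kernel's admin-guide/sysrq documentation.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def decode_sysrq(mask):
    """Return the SysRq function groups enabled by a kernel.sysrq value."""
    if mask == 0:  # 0 disables SysRq completely
        return []
    if mask == 1:  # 1 enables every function
        return list(SYSRQ_BITS.values())
    return [name for bit, name in SYSRQ_BITS.items() if mask & bit]


if __name__ == "__main__":
    path = Path("/proc/sys/kernel/sysrq")
    if path.exists():
        value = int(path.read_text().strip())
        print(f"kernel.sysrq = {value}")
        for name in decode_sysrq(value):
            print("  enabled:", name)
```

For example, a value of 16 allows only the sync command, so the reboot key combination would be ignored until the mask is widened.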

Here is the output of journalctl -b -1 -p 3 for the last boot where it froze:

-- Logs begin at Sun 2019-12-22 21:08:35 CET, end at Wed 2020-01-08 22:32:15 CET. --
Jan 08 22:24:29 localhost.localdomain systemd-udevd[646]: /usr/lib/udev/rules.d/65-md-incremental.rules:28 Invalid value "/sbin/mdadm -I $env{DEV>
Jan 08 22:24:29 localhost.localdomain systemd-udevd[646]: /usr/lib/udev/rules.d/99-vmware-scsi-udev.rules:5 Invalid value "/bin/sh -c 'echo 180 >>
Jan 08 22:24:29 localhost.localdomain systemd-udevd[646]: /usr/lib/udev/rules.d/99-vmware-scsi-udev.rules:6 Invalid value "/bin/sh -c 'echo 180 >>
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: Not valid error log pointer 0x00000000 for Init uCode
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: Fseq Registers:
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0xC0BF2650 | FSEQ_ERROR_CODE
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0xA04E95F3 | FSEQ_TOP_INIT_VERSION
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0xABBD9C69 | FSEQ_CNVIO_INIT_VERSION
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0x0000A056 | FSEQ_OTP_VERSION
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0xDDFFF6A7 | FSEQ_TOP_CONTENT_VERSION
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0x664E56AB | FSEQ_ALIVE_TOKEN
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0xB1097904 | FSEQ_CNVI_ID
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0xC22312E4 | FSEQ_CNVR_ID
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0x03000000 | CNVI_AUX_MISC_CHIP
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0x0BADCAFE | CNVR_AUX_MISC_CHIP
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0x0BADCAFE | CNVR_SCU_SD_REGS_SD_REG_DIG_DCDC_VTRIM
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0x0BADCAFE | CNVR_SCU_SD_REGS_SD_REG_ACTIVE_VDIG_MIRROR
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: SecBoot CPU1 Status: 0x3040001, CPU2 Status: 0x0
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: Failed to start INIT ucode: -110
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: Firmware not running - cannot dump error
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: Failed to run INIT ucode: -110
Jan 08 22:24:32 localhost.localdomain kernel: Bluetooth: hci0: command 0xfc09 tx timeout
Jan 08 22:24:40 localhost.localdomain kernel: Bluetooth: hci0: Failed to send firmware signature (-110)
Jan 08 22:24:55 localhost.localdomain gdm-password][1444]: gkr-pam: unable to locate daemon control file
Jan 08 22:27:25 localhost.localdomain systemd[1455]: Failed to start Mark boot as successful.

What else could be useful to narrow down the source of the problem?
Thank you in advance for the help!
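
On the command used above: -b selects the boot and -p the maximum priority to show, so -p 3 prints "err" and everything more severe. Below is a minimal Python sketch that builds such a command line. The helper name journalctl_cmd is hypothetical, and only the journalctl options -b, -p, -k and --no-pager are assumed.

```python
import shlex

# Priority names in the order journalctl numbers them (0 = emerg ... 7 = debug).
PRIORITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def journalctl_cmd(boot=0, priority="err", kernel_only=False):
    """Build a journalctl command line for one boot and a maximum priority.

    boot=0 is the current boot, -1 the one before it, and so on.
    """
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority!r}")
    level = PRIORITIES.index(priority)
    cmd = ["journalctl", "--no-pager", "-b", str(boot), "-p", str(level)]
    if kernel_only:
        cmd.append("-k")  # restrict the output to kernel messages
    return cmd


if __name__ == "__main__":
    # Errors and worse from the previous boot.
    print(shlex.join(journalctl_cmd(boot=-1, priority="err")))
    # journalctl --no-pager -b -1 -p 3
```

After a freeze and a hard reset, the boot that froze is the previous one, so boot=-1 is usually the interesting value.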

man journalctl:
"
-b:
while -0 is the last boot, -1 the boot before last, and so on.

Just in case, -p:

  • "emerg" (0),
  • "alert" (1),
  • "crit" (2),
  • "err" (3),
  • "warning" (4),
  • "notice" (5),
  • "info" (6),
  • "debug" (7)
"
Possibly related ("The issue is: it freezes randomly."):

You can also see:
https://ask.fedoraproject.org/search?q=freezes


PS: Do you have enough RAM?
PPS: What hardware does your computer have?

Thank you a lot for your suggestions! I am using a Thinkpad X270 with the following specs:

  • Skylake i5-6300U
  • 8GB RAM
  • Intel Dual Band Wireless-AC 8260
  • ADATA SX6000PNP 1TB NVMe disk

Software side:

  • Dual boot with Windows 10
  • Linux 5.4.7-200.fc31.x86_64
  • Fedora 31 (Workstation Edition)

And if I run the right command now for journalctl, I get:

Jan 08 22:28:26 localhost.localdomain systemd-udevd[635]: /usr/lib/udev/rules.d/65-md-incremental.rules:28 Invalid value "/sbin/mdadm -I $env{DEV>
Jan 08 22:28:26 localhost.localdomain systemd-udevd[635]: /usr/lib/udev/rules.d/99-vmware-scsi-udev.rules:5 Invalid value "/bin/sh -c 'echo 180 >>
Jan 08 22:28:26 localhost.localdomain systemd-udevd[635]: /usr/lib/udev/rules.d/99-vmware-scsi-udev.rules:6 Invalid value "/bin/sh -c 'echo 180 >>
Jan 08 22:28:41 localhost.localdomain gdm-password][1496]: gkr-pam: unable to locate daemon control file
Jan 08 22:28:42 localhost.localdomain systemd[1508]: Failed to start Application launched by gnome-session-binary.
Jan 08 22:28:43 localhost.localdomain systemd[1508]: Failed to start Application launched by gnome-session-binary.
Jan 08 22:30:54 localhost.localdomain systemd[1508]: Failed to start Mark boot as successful.

This does not look any more helpful to me. Also, I don’t think the RAM is the problem, because sometimes Fedora freezes just a few minutes into the session. My guess is that it has something to do with either Firefox (I am using it most of the times the freezes happen) or the NVMe SSD, as I installed that one myself.

Before I opened this thread, I searched for other people's solutions. But without knowing the cause, I only found many different suggestions, and none of them helped me further.

Have you tried xtym’s suggestions from above?

Also:

dmesg --level=emerg,alert,crit,err,warn -H

You can press / to search within the output of the command above. Press Ctrl + C to cancel (in dmesg, it is usually Esc), and press q to quit.

(man dmesg) The supported log levels are:

  • emerg - system is unusable
  • alert - action must be taken immediately
  • crit - critical conditions
  • err - error conditions
  • warn - warning conditions
  • notice - normal but significant condition
  • info - informational
  • debug - debug-level messages

PS: I remember that years ago I had random freezes with some version of the proprietary video driver…

Kernel 5.4.10 is in updates-testing and should go stable soon (maybe Monday). If you want to try it sooner, you can upgrade just the kernel while enabling the updates-testing repo.

Thank you all so far!
So I checked the last freeze today with the dmesg --level=emerg,alert,crit,err,warn -H command:

I cannot say at what point exactly the system froze, but both the “nvme timeout” and the “USB power management unreliable” messages look suspicious to me. Does anybody know more about these logs and could enlighten me?

-H enables human-readable output, which includes paging.

dmesg --level=emerg,alert,crit,err,warn >> log.txt

xdg-open log.txt
Switch off line wrapping!

Press Ctrl + a, then Ctrl + c

Paste between ``` and ```
[    0.272627] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[    1.603683] battery: [Firmware Bug]: battery: (dis)charge rate invalid.
[    4.662191] [Firmware Bug]: ACPI(PEGP) defines _DOD but not _DOS
[    4.760812] ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190816/nsarguments-59)
[    4.760861] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190816/nsarguments-59)
[   26.626778] printk: systemd: 24 output lines suppressed due to ratelimiting
[   32.862258] kauditd_printk_skb: 20 callbacks suppressed
[   36.619141] ACPI Warning: SystemIO range 0x0000000000000428-0x000000000000042F conflicts with OpRegion 0x0000000000000400-0x000000000000047F (\PMIO) (20190816/utaddress-204)
[   36.619152] ACPI Warning: SystemIO range 0x0000000000000540-0x000000000000054F conflicts with OpRegion 0x0000000000000500-0x000000000000055F (\_SB.PCI0.PEG0.PEGP.GPIO) (20190816/utaddress-204)
[   36.619155] ACPI Warning: SystemIO range 0x0000000000000540-0x000000000000054F conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20190816/utaddress-204)
[   36.619158] ACPI Warning: SystemIO range 0x0000000000000530-0x000000000000053F conflicts with OpRegion 0x0000000000000500-0x000000000000055F (\_SB.PCI0.PEG0.PEGP.GPIO) (20190816/utaddress-204)
[   36.619161] ACPI Warning: SystemIO range 0x0000000000000530-0x000000000000053F conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20190816/utaddress-204)
[   36.619164] ACPI Warning: SystemIO range 0x0000000000000500-0x000000000000052F conflicts with OpRegion 0x0000000000000500-0x000000000000055F (\_SB.PCI0.PEG0.PEGP.GPIO) (20190816/utaddress-204)
[   36.619167] ACPI Warning: SystemIO range 0x0000000000000500-0x000000000000052F conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20190816/utaddress-204)
[   36.619170] lpc_ich: Resource conflict(s) found affecting gpio_ich
[   38.844172] kauditd_printk_skb: 10 callbacks suppressed
[   39.481932] uvcvideo 1-1.3:1.0: Entity type for entity Processing 2 was not initialized!
[   39.481937] uvcvideo 1-1.3:1.0: Entity type for entity Extension 6 was not initialized!
[   39.481941] uvcvideo 1-1.3:1.0: Entity type for entity Camera 1 was not initialized!
[   83.298717] nouveau 0000:01:00.0: bus: MMIO write of ffffff1f FAULT at 6013d4 [ IBUS ]
[14600.301387] nouveau 0000:01:00.0: bus: MMIO write of ffffff1f FAULT at 6013d4 [ IBUS ]
[16815.398458] nouveau 0000:01:00.0: bus: MMIO write of 0000001f FAULT at 6013d4 [ IBUS ]
[16815.398506] nouveau 0000:01:00.0: bus: MMIO write of badf1001 FAULT at 50405c [ IBUS ]
[16816.164646] nouveau 0000:01:00.0: bus: MMIO write of 0000001f FAULT at 6013d4 [ IBUS ]
[16816.164748] nouveau 0000:01:00.0: bus: MMIO write of badf1001 FAULT at 50405c [ IBUS ]
[16817.130014] done.
[16870.737737] nouveau 0000:01:00.0: bus: MMIO write of ffffff1f FAULT at 6013d4 [ IBUS ]
[16888.331219] nouveau 0000:01:00.0: bus: MMIO write of ffff9e1f FAULT at 6013d4 [ IBUS ]
[22353.382655] nouveau 0000:01:00.0: bus: MMIO write of ffff9e1f FAULT at 6013d4 [ IBUS ]
[22353.382674] nouveau 0000:01:00.0: bus: MMIO write of badf1001 FAULT at 50405c [ IBUS ]
[22383.573745] nouveau 0000:01:00.0: bus: MMIO write of ffffff1f FAULT at 6013d4 [ IBUS ]
[22535.454558] nouveau 0000:01:00.0: bus: MMIO write of ffffff1f FAULT at 6013d4 [ IBUS ]
[22548.399916] nouveau 0000:01:00.0: bus: MMIO write of ffff9e1f FAULT at 6013d4 [ IBUS ]

Have you tried the disk's SMART test?

PS: How about sending this machine to a service center?

@vits95 I followed your instructions, hope this is better now:

[    0.469905] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[    0.469905] TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.
[    0.470244]  #3
[    0.475823] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[    2.178713] usb: port power management may be unreliable
[    2.630185] usb 1-7: config 1 interface 1 altsetting 0 endpoint 0x3 has wMaxPacketSize 0, skipping
[    2.630190] usb 1-7: config 1 interface 1 altsetting 0 endpoint 0x83 has wMaxPacketSize 0, skipping
[    3.889116] acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
[    3.889553] acpi PNP0C14:03: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
[   11.314725] printk: systemd: 23 output lines suppressed due to ratelimiting
[   42.396646] nvme nvme0: I/O 832 QID 2 timeout, aborting
[   42.528797] nvme nvme0: Abort status: 0x0
[   42.542773] kauditd_printk_skb: 19 callbacks suppressed
[   42.915296] systemd-journald[621]: File /run/log/journal/24a9fb3a012f46d894763e6ec4bfa78c/system.journal corrupted or uncleanly shut down, renaming and replacing.
[   43.594218] resource sanity check: requesting [mem 0xfed10000-0xfed15fff], which spans more than pnp 00:08 [mem 0xfed10000-0xfed13fff]
[   43.594227] caller snb_uncore_imc_init_box+0x6c/0xb0 [intel_uncore] mapping multiple BARs
[   43.675869] uvcvideo 1-8:1.0: Entity type for entity Extension 4 was not initialized!
[   43.675872] uvcvideo 1-8:1.0: Entity type for entity Extension 3 was not initialized!
[   43.675874] uvcvideo 1-8:1.0: Entity type for entity Processing 2 was not initialized!
[   43.675876] uvcvideo 1-8:1.0: Entity type for entity Camera 1 was not initialized!
[   44.128425] thermal thermal_zone3: failed to read out thermal zone (-61)
[   45.738681] iwlwifi 0000:03:00.0: FW already configured (0) - re-configuring
[   46.007657] iwlwifi 0000:03:00.0: FW already configured (0) - re-configuring
[   48.298823] L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
[   66.936633] systemd-journald[621]: File /var/log/journal/24a9fb3a012f46d894763e6ec4bfa78c/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.

But I also compared these log entries to those from a boot without a freeze, and there is no difference. So I guess the cause cannot be found here…
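
A comparison like this can also be done mechanically, which avoids overlooking a line. Below is a rough Python sketch that strips the dmesg timestamps and lists the messages that appear in one log but not in the other. The function names are made up for illustration, and the second log in the example is hypothetical sample data, not output from this machine.

```python
import re

# Matches the "[   42.396646] " prefix that dmesg puts in front of each line.
_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def strip_timestamp(line):
    """Remove the leading dmesg timestamp so lines from two boots compare equal."""
    return _TIMESTAMP.sub("", line.strip())


def only_in_first(first, second):
    """Messages present in `first` but missing from `second`, in original order."""
    seen = {strip_timestamp(line) for line in second}
    unique = []
    for line in first:
        message = strip_timestamp(line)
        if message and message not in seen and message not in unique:
            unique.append(message)
    return unique


if __name__ == "__main__":
    freeze_boot = [
        "[    2.178713] usb: port power management may be unreliable",
        "[   42.396646] nvme nvme0: I/O 832 QID 2 timeout, aborting",
    ]
    # Hypothetical log from a boot without a freeze.
    good_boot = [
        "[    2.201100] usb: port power management may be unreliable",
    ]
    for message in only_in_first(freeze_boot, good_boot):
        print(message)
```

Lines that contain process IDs or addresses will still show up as differences, so the result needs a second look by hand.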

Also, my Windows 10, which is installed in parallel, does not show this problem, so sending the laptop to a service center would probably not help much. I am going to run the SMART test next, but first I need a Fedora stick, it seems.


Reporting Bugs (www.kernel.org).
I think SMART can be run “on the go”.

Yes, but the disk utility does not work for NVMe drives. I found this here, and this is the result when I run sudo nvme smart-log /dev/nvme0:

Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 32 C
available_spare                     : 100%
available_spare_threshold           : 32%
percentage_used                     : 0%
data_units_read                     : 2’986’511
data_units_written                  : 3’375’857
host_read_commands                  : 42’351’997
host_write_commands                 : 34’833’159
controller_busy_time                : 0
power_cycles                        : 469
power_on_hours                      : 379
unsafe_shutdowns                    : 72
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0

So I guess critical_warning: 0 is a good sign? :grin:
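
To look at more than critical_warning, the output can be checked field by field. Below is a minimal Python sketch, assuming the "key : value" layout shown above. The function names and the choice of which fields to flag are illustrative, not an official health check.

```python
import re


def parse_smart_log(text):
    """Turn the "key : value" lines of `nvme smart-log` output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip().lower()] = value.strip()
    return fields


def to_int(value):
    """Read the leading number, ignoring thousands separators and units."""
    cleaned = re.sub(r"[’',]", "", value).strip()
    if cleaned.lower().startswith("0x"):
        return int(cleaned.split()[0], 16)
    match = re.match(r"\d+", cleaned)
    if match is None:
        raise ValueError(f"no number in {value!r}")
    return int(match.group())


def findings(fields):
    """List values that deserve a closer look; an empty list means none."""
    notes = []
    if to_int(fields.get("critical_warning", "0")) != 0:
        notes.append("critical_warning is set")
    if to_int(fields.get("media_errors", "0")) != 0:
        notes.append("media errors were logged")
    spare = to_int(fields.get("available_spare", "100"))
    threshold = to_int(fields.get("available_spare_threshold", "0"))
    if spare <= threshold:
        notes.append("available spare is at or below its threshold")
    unsafe = to_int(fields.get("unsafe_shutdowns", "0"))
    if unsafe:
        cycles = to_int(fields.get("power_cycles", "0"))
        notes.append(f"{unsafe} unsafe shutdowns in {cycles} power cycles")
    return notes


if __name__ == "__main__":
    sample = """critical_warning                    : 0
available_spare                     : 100%
available_spare_threshold           : 32%
power_cycles                        : 469
unsafe_shutdowns                    : 72
media_errors                        : 0
"""
    print(findings(parse_smart_log(sample)))
```

With the values from this log, the only finding is the unsafe shutdown count, which is what repeated power-button resets after a freeze would be expected to produce.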

Edit: Reading up a bit more, I found out that sudo smartctl -x /dev/nvme0n1 provides more information:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.8-200.fc31.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       ADATA SX6000PNP
Serial Number:                      2J2920161848
Firmware Version:                   V9001b31
PCI Vendor/Subsystem ID:            0x10ec
IEEE OUI Identifier:                0x00e04c
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1’024’209’543’168 [1.02 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            525433 334e323035
Local Time is:                      Tue Jan 14 18:19:39 2020 CET
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0006):   Format Frmw_DL
Optional NVM Commands (0x0014):     DS_Mngmt Sav/Sel_Feat
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     118 Celsius
Critical Comp. Temp. Threshold:     150 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    50.00W       -        -    0  0  0  0        0       0

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        32 Celsius
Available Spare:                    100%
Available Spare Threshold:          32%
Percentage Used:                    0%
Data Units Read:                    2’987’650 [1.52 TB]
Data Units Written:                 3’376’117 [1.72 TB]
Host Read Commands:                 42’370’117
Host Write Commands:                34’837’809
Controller Busy Time:               0
Power Cycles:                       469
Power On Hours:                     379
Unsafe Shutdowns:                   72
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, max 8 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          1     0  0x0000  0x0000  0x000            0     0     -
  6 1219368206019409729     0  0x0000  0x0000  0x000            0     0     -

:chicken: :egg: If SMART isn’t buggy itself!

Wiki

NVMe unsafe shutdowns
Does anyone else get non-zero "Unsafe Shutdowns" on NVMe (Model Number: LENSE30512GMSP34MEAT3TA)? smartctl on mine laptop constantly reports those, also I get frequent FS corruption errors after reboot :/

You mentioned “first i need a fedora stick it seems”. I have no idea how it is handled by default, but perhaps you need to check the file system?

man fsck

Different machine (Dell XPS 13) and probably different setup (x86_64 Linux 5.4.8-200.fc31.x86_64), but I am experiencing similar freezes.

I, personally, no.

  • 5.4.8-200.fc31.x86_64,
  • Intel(R) Pentium(R) CPU B960 @ 2.20GHz
  • 4GB RAM
  • Fedora-Workstation 31

What hardware do you have?

Did you both check the temperatures (Celsius/Fahrenheit) of your machines? Maybe this is a thermal issue?

@thoroc @almghandi
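
One way to check the temperatures without extra tools is to read the kernel's thermal zones directly. Below is a small Python sketch, assuming the usual sysfs layout under /sys/class/thermal, where each zone's temp file holds millidegrees Celsius. The function name read_thermal_zones is made up for illustration.

```python
from pathlib import Path


def read_thermal_zones(root="/sys/class/thermal"):
    """Return {"zone name (type)": temperature in degrees Celsius}."""
    temps = {}
    base = Path(root)
    if not base.is_dir():
        return temps
    for zone in sorted(base.glob("thermal_zone*")):
        try:
            zone_type = (zone / "type").read_text().strip()
            millidegrees = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # some zones cannot be read; skip them
        temps[f"{zone.name} ({zone_type})"] = millidegrees / 1000.0
    return temps


if __name__ == "__main__":
    for name, celsius in read_thermal_zones().items():
        print(f"{name}: {celsius:.1f} C")
```

Running this in a loop while the machine is under load gives a rough picture of how hot it gets before a freeze.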

Thermals were fine after hard reboot (~50C) but the machine was whining under load just before that.

See, a long time ago I had more than just some freezes from the proprietary blob.

I also experienced shutdowns because of overheating. Cleaning and the free drivers helped.

PS: That year I had even used the proprietary drivers without issues for a few weeks before they started glitching again (all on the same laptop).

After reading the post I will leave a suggestion or an idea that may be related (or it may not be).

Many of these processors come with advanced technologies like:

“Intel Turbo Boost Technology”
“Idle States”
“Enhanced Intel SpeedStep Technology”

These make it possible to adjust the TDP and the processor frequency according to demand. On some CPU types, TDP jumps appear to be poorly managed by some kernel versions and may cause instability. That being said, you can test the following:

If you have other (older) kernel versions installed, try them and see if the problem can be reproduced.

If you know your way around your BIOS, you may want to temporarily deactivate options such as Turbo Boost or idle states, to verify that the instability is not caused by how a certain Linux kernel manages these options.
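
Before and after changing BIOS options, it can help to record what the kernel currently reports about frequency scaling. Below is a rough Python sketch that reads a few sysfs files. It assumes an Intel system where the cpufreq, intel_pstate and cpuidle entries exist; files that are missing are reported as None. The function name cpu_power_state is hypothetical.

```python
from pathlib import Path

# Files relative to /sys/devices/system/cpu; which ones exist depends on the drivers in use.
FILES = {
    "scaling_driver": "cpu0/cpufreq/scaling_driver",
    "scaling_governor": "cpu0/cpufreq/scaling_governor",
    "scaling_cur_freq_khz": "cpu0/cpufreq/scaling_cur_freq",
    "no_turbo": "intel_pstate/no_turbo",
    "cpuidle_driver": "cpuidle/current_driver",
}


def cpu_power_state(root="/sys/devices/system/cpu"):
    """Read the current frequency scaling settings; missing files map to None."""
    state = {}
    for name, relative in FILES.items():
        try:
            state[name] = (Path(root) / relative).read_text().strip()
        except OSError:
            state[name] = None
    return state


if __name__ == "__main__":
    for name, value in cpu_power_state().items():
        print(f"{name}: {value}")
```

Comparing the output from a setup that freezes with one that does not shows whether the BIOS change actually reached the kernel.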

Take it as an opinion/suggestion.

Regards.


@xtym I turned off the Intel SpeedStep Technology in the BIOS and observe better responsiveness of the system and no freezes. Your suggestion seems to solve the problem so far! :tada:
I will mark your answer as solution if no problems show up in the next three days.

While reading up, I found older reports of problems with SpeedStep as well:

Apparently the issues described there are fixed by now, but SpeedStep Technology has potential for problems.
One explanation I can think of: degradation of the silicon chip or the power electronics may increase the risk of complications, which would also explain why my laptop was originally working fine.
