High CPU usage (kernel threads) in idle F31 / Poor UX responsiveness

Hello,

I would like to ask for help to further debug the following issue. I’m stuck and don’ t know which commands / tools I could use to track its cause.

Background: over time the responsiveness of my F31 degraded severely (mouse response, visual window switch, etc). This happened a few times and I thought it was just an personal mess-up, broken component install, custom driver, etc.

This week it reappeared (after a system update) and I decided to perform a fresh reinstall, but even in a fresh / default F31 Workstation install it is present (not extra component, backup, config, etc changed – apart from a few system tools added like htop, nmon, etc). Could it be a regression bug?

It appears to be some sort of IO issue that degrade user input / system response. Even typing delay is noticeable in almost zero system load.

Notebook Acer Aspire E5-574-592S
Kernel 5.5.5-200.fc31.x86_64
F31 Workstation Installed with default options
rpmfusion free/nonfree added
minimal extra components added than default (htop, atop, nmon, glances sysstat...)
full dnf system update / reboot 
Memory: 8 GB
CPU: Intel® Core™ i5-6200U CPU @ 2.30GHz × 4
GPU: Intel® HD Graphics 520 (Skylake GT2)
GNOME Version 3.34.4
SSD 500GB 860 Evo Samsung

Dmesg output: https://hastebin.com/ipigozabuv.coffeescript
Installed packages: https://hastebin.com/segarepaxu.pl

Since the issue appeared an disappeared a few times (over updates, reinstalls, etc) and the hardware works fine and fast in other OSses, I don’t think that’s any hardware issue.

  • How can I find out which process exactly is causing the kernel thread high usage showed in htop (red bar)?. System was fully rebooted and just terminal/htop opened – nothing beyond the default is running.

  • The system has a SSD. Given the fresh reinstall, could this be related to background ssd trim done by the kernel? Even so, could it impact in user input delay (mouse/keyboard)

  • Kernel threads: https://hastebin.com/guxerijulo.sql
    I found ways to list kernel threads but don’t how to monitor CPU usage / io_wait of each one (to find out which is causing the high usage)

  • Glances shows a low (0.1% IO wait), there is no user load… so I can’t figure out that high red bar in htop or the system slow responsiveness.

Thank you,

Mario

PS update:

I confirmed that the issue is not present in kernel 5.3.7-301.fc31.x86_64, it started after update to kernel 5.5.5-200.fc31.

Htop shows no abnormal cpu / kernel usage and system responsiveness is normal in 5.3.7.

Also, the following error was introduced with 5.5.5:
(when system is shutting down)

I’d a similar a years ago with Ubuntu.  Try to create a new user after an update, to test if it’s a some ~/'s .config issue?

I did and the issue persists (new user add; first user login/profile creation with kernel 5.5.5, same issue).

In the first login it already presented the following error:

I noticed that during boot the system halted for ~3 seconds in the message:

[   14.425862] irq 16: nobody cared (try booting with the "irqpoll" option)
[   14.425868] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.7-301.fc31.x86_64 #1
[   14.425870] Hardware name: Acer Aspire E5-574/Zoro_SL , BIOS V1.18 05/25/2017
[   14.425871] Call Trace:
[   14.425875]  <IRQ>
[   14.425884]  dump_stack+0x5c/0x80
[   14.425891]  __report_bad_irq+0x35/0xa7
[   14.425896]  note_interrupt.cold+0xb/0x63
[   14.425901]  handle_irq_event_percpu+0x6f/0x80
[   14.425905]  handle_irq_event+0x28/0x48
[   14.425909]  handle_fasteoi_irq+0x86/0x130
[   14.425915]  handle_irq+0x1c/0x30
[   14.425919]  do_IRQ+0x4b/0xd0
[   14.425926]  common_interrupt+0xf/0xf
[   14.425928]  </IRQ>
[   14.425934] RIP: 0010:cpuidle_enter_state+0xc4/0x420
[   14.425938] Code: e8 a1 ce 92 ff 80 7c 24 0f 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 39 03 00 00 31 ff e8 03 bd 98 ff fb 66 0f 1f 44 00 00 <45> 85 e4 0f 89 cf 01 00 00 c7 45 10 00 00 00 00 48 83 c4 18 44 89
[   14.425941] RSP: 0018:ffffffff8b403e58 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdc
[   14.425944] RAX: ffff8d2aaa228880 RBX: ffffffff8b4d6520 RCX: 000000000000001f
[   14.425946] RDX: 0000000000000000 RSI: 0000000035554b23 RDI: 0000000000000000
[   14.425948] RBP: ffff8d2aaa232d00 R08: 000000035bd87f18 R09: 000000007fffffff
[   14.425949] R10: ffff8d2aaa2276c4 R11: ffff8d2aaa2276a4 R12: 0000000000000008
[   14.425951] R13: 000000035bd87f18 R14: 0000000000000008 R15: ffffffff8b413780
[   14.425958]  ? cpuidle_enter_state+0x9f/0x420
[   14.425962]  cpuidle_enter+0x29/0x40
[   14.425968]  do_idle+0x1cf/0x250
[   14.425974]  cpu_startup_entry+0x19/0x20
[   14.425980]  start_kernel+0x54f/0x56c
[   14.425987]  secondary_startup_64+0xa4/0xb0
[   14.425990] handlers:
[   14.425997] [<0000000011998faa>] idma64_irq [idma64]
[   14.426003] [<000000000f2c55cf>] i2c_dw_isr
[   14.426008] [<000000005b99de97>] i801_isr [i2c_i801]
[   14.426011] Disabling IRQ #16

Strangely, the error appears even when booting kernel 5.3.7 (which does not present the high cpu / system lag) – not sure if related.

Also, enabling more detailed info in htop:

Screenshot from 2020-02-28 11-07-26

Seems that the high usage is related to Hard IRQ (Ref: https://unix.stackexchange.com/questions/50390/meaning-of-colours-in-htop)

Indicating that the kernel issue related to IRQ 16 degrades the overall system usage in kernel 5.5.5 (?)

[mario@localhost Pictures]$ cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3       
   0:          9          0          0          0  IR-IO-APIC    2-edge      timer
   1:          0          0          0         13  IR-IO-APIC    1-edge      i8042
   8:          0          1          0          0  IR-IO-APIC    8-edge      rtc0
   9:          0          5          0          0  IR-IO-APIC    9-fasteoi   acpi
  16:     112426          0          0          0  IR-IO-APIC   16-fasteoi   idma64.0, i2c_designware.0, i801_smbus
  17:      20636          0          0          0  IR-IO-APIC   17-fasteoi   ath9k
  82:          0          3          0          0  IR-IO-APIC   82-edge    
 120:          0          0          0          0  DMAR-MSI    0-edge      dmar0
(...)

lspci -v: https://hastebin.com/mekalupeje.cs

00:15.0 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0 (rev 21)
	Subsystem: Acer Incorporated [ALI] Device 100c
	Flags: bus master, fast devsel, latency 0, IRQ 16
	Memory at a122b000 (64-bit, non-prefetchable) [size=4K]
	Capabilities: <access denied>
	Kernel driver in use: intel-lpss
	Kernel modules: intel_lpss_pci

Problem confirmed also in kernel 5.6.0-0.rc3.git2.1.vanilla.knurd.1.fc31.x86_64

dmesg: https://hastebin.com/iwerahaziv.coffeescript

Looks like some sort of conflict with devices/drivers.

When the issue happens I see a lot of interrupts in IRQ 16:

[mario@localhost ~]$ cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3       
(...)
  16:          0    7745062          0          0  IR-IO-APIC   16-fasteoi   i2c_designware.0, idma64.0, i801_smbus

This interrupt was enumerated as the realtek gigabit controller:

  pci device: name = 0000:01:00.0
    path = /devices/pci0000:00/0000:00:1c.0/0000:01:00.0
    modalias = "pci:v000010ECd00008168sv00001025sd0000100Cbc02sc00i00"
    class = 0x20000
    vendor = 0x10ec
    device = 0x8168
    subvendor = 0x1025
    subdevice = 0x100c
    irq = 16
    res[0] = 0x3000 0x30ff 0x40101
    res[2] = 0xa1104000 0xa1104fff 0x140204
    res[4] = 0xa1100000 0xa1103fff 0x140204
    config[64]


PCI	10ec	Realtek Semiconductor Co., Ltd.	8168	RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller

I noticed that the trackpad was not working.

and then disabled the module of that device (blacklisted at book: added rd.driver.blacklist=r8169 to the kernel grub line.

Rebooted and the issue persists, but now with another device in IRQ 16, also generating the interrupts load:

  pci device: name = 0000:00:15.0
    path = /devices/pci0000:00/0000:00:15.0
    modalias = "pci:v00008086d00009D60sv00001025sd0000100Cbc11sc80i00"
    class = 0x118000
    vendor = 0x8086
    device = 0x9d60
    subvendor = 0x1025
    subdevice = 0x100c
    irq = 16
    res[0] = 0xa122b000 0xa122bfff 0x140204
    config[64]

Intel Corporation	8086	Sunrise Point-LP Serial IO I2C Controller #0	9d60

but trackpad is working!

If I disable the following modules during boot:
rd.driver.blacklist=r8169,i801_smbus,intel_lpss,i2c_i801
(criteria: the ones I found related to IRQ 16).

the issue is gone! No high CPU usage / kernel thread wait / events and system performance is normal.
(although the devices are disabled: trackpad, ethernet, etc)

not sure how to proper fix it.

1 Like

I was able to fix the issue adding the following kernel args:
initcall_blacklist=dw_i2c_init_driver module_blacklist=idma64

(touchpad and everything seems to work. Not sure if there is any other side effect or performance degradation. Maybe battery life? idma64: Enable DMA support for Intel Low Power Subsystem such as found on Intel Skylake PCH. ).

I’m experiencing a much better system response (regular processing; mouse/keyboard IO, etc). CPU usage is regular (<2% in all cores when idle). Before it was ~90% in one core all of the time. Since it’s kernel threads/time it does not appear by default in some resource monitors.

The issue was introduced somewhere after kernel 4.1.x. I tested with Fedora 22 (kernel 4.0.4) and it works just fine. Other distros like ubuntu (even with recent kernels 5.x+) does not present the issue.

Although it seems a specific edge case and hardware issue unrelated with Fedora, I still consider it to be something pending fix in the Fedora distro, because:

  • Other distros using the same kernel don’t have this issue. ie. Ubuntu.
  • It used to work in earlier Fedora (ie. F22) and was introduced somewhere between F22-F31.
  • The ‘i2c/irq/touchpad’ issue may be present in a large number of devices (many other elan/synaptics)
  • It’s not catastrophic in a sense that every user observes it: many users may be suffering performance degradation and just considering it to be the regular Fedora performance (unless you specifically search for kernel threads time – by default it does not appear in the regular system-monitor CPU usage).

Further debug info. Since it works fine in F22, I found some divergent details that may indicate the root cause:

Updated issue overview: some i2c hardware driver (usually touchpad related) seems to conflict with other devices and constantly generate a high amount of interrupts, causing system performance and responsiveness degradation. Performance is impacted, but hardware components do work (ie. touchpad works fine). The high CPU usage is kernel threads and not observed by default in all monitors (ie. red bar in htop or hi section of top).

Fedora 22 kernel 4.0.4 (works):

00:1f.4 SMBus: Intel Corporation Device 9d23 (rev 21)
	Interrupt: pin A routed to IRQ 16
	Kernel modules: i2c_i801

01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
	Interrupt: pin A routed to IRQ 126	
	Kernel driver in use: r8169

Fedora 31 kernel 5.5.7:

00:15.0 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0 (rev 21)
	Interrupt: pin A routed to IRQ 16
	Kernel driver in use: intel-lpss

00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
	Interrupt: pin A routed to IRQ 16
	Kernel driver in use: i801_smbus
	Kernel modules: i2c_i801

01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
	Interrupt: pin A routed to IRQ 16
	Kernel driver in use: r8169
	Kernel modules: r8169	

Interrupt 16 is not present in F22. In F31:

cat /proc/interrupts:
  16:          0    2833163          0          0  IR-IO-APIC   16-fasteoi   i2c_designware.0, idma64.0, i801_smbus

tree:
/proc/irq/16
├── affinity_hint
├── effective_affinity
├── effective_affinity_list
├── i2c_designware.0
├── i801_smbus
├── idma64.0
├── node
├── smp_affinity
├── smp_affinity_list
└── spurious

The presence of multiple modules/devices routed to IRQ 16 could be the issue?

tree /proc/irq/16 5.5.7-200.fc31.x86_64

/proc/irq/16
├── affinity_hint
├── effective_affinity
├── effective_affinity_list
├── i2c_designware.0
├── i801_smbus
├── idma64.0
├── node
├── smp_affinity
├── smp_affinity_list
└── spurious

/proc/irq/16 not present in F22 (works ok).

Full logs:
dmesg_4.0.4-301.fc22.x86_64_loglevel-max.txt - https://paste.ubuntu.com/p/zPzK55jMDF/
dmesg_5.5.7-200.fc31.x86_64_loglevel-max.txt - https://paste.ubuntu.com/p/tvn7KwrkXH/
lscpi-vvv_4.0.4-301.fc22.x86_64.txt - https://paste.ubuntu.com/p/sg6JYqpGGX/
lspci-vvv_5.5.7-200.fc31.x86_64.txt - https://paste.ubuntu.com/p/FxD256gmF9/

PS: is there a more appropriate way of attaching logs here in Ask Fedora?

Ref: https://unix.stackexchange.com/questions/423797/how-do-i-disable-i2c-designware-support-when-its-not-built-%20as-a-module%20/%20446913%20

1 Like