Adaptec 71605H HBA Card (pm8001 driver) causes system hang when drive attached with kernel >= 5.13.16-200.fc34.x86_64

Hello,

I have two systems which I have been upgrading from Fedora 33. Both have Adaptec 71605H HBA Cards and I get the same behaviour in each system.

They both work fine with the initial Fedora 34 kernel, 5.11.12-300.fc34.x86_64, but both have problems when I update to a more recent 5.13 (F34) or 5.14 (F34/F35) kernel.

If I remove all the HDDs connected to the HBA card, the systems boot and work fine. When I connect an HDD to the HBA card, the system hangs after about 5 seconds and requires a hard reboot. If a drive is connected to the HBA at boot time, the system hangs during boot. Connecting the same disk directly to the motherboard works fine.

If I use journalctl --follow when I connect a disk to the HBA I see the following output:

Oct 11 18:37:56 sulphur kernel: sas: phy-7:3 added to port-7:0, phy_mask:0x8 (50000d11074d0603)
Oct 11 18:37:56 sulphur kernel: sas: DOING DISCOVERY on port 0, pid:144
Oct 11 18:37:56 sulphur kernel: sas: Enter sas_scsi_recover_host busy: 0 failed: 0
Oct 11 18:37:56 sulphur kernel: sas: ata7: end_device-7:0: dev error handler
Oct 11 18:38:56 sulphur kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Oct 11 18:38:56 sulphur kernel: rcu:         0-...0: (5 ticks this GP) idle=2b6/1/0x4000000000000000 softirq=48347/48348 fqs=14999 
Oct 11 18:38:56 sulphur kernel:         (detected by 3, t=60002 jiffies, g=92413, q=329)
Oct 11 18:38:56 sulphur kernel: Sending NMI from CPU 3 to CPUs 0:
Oct 11 18:38:56 sulphur kernel: NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
Oct 11 18:38:56 sulphur kernel: Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink qrtr ns sunrpc vfat fat mlx4_ib ib_uverbs ib_core ipmi_ssif intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iTCO_wdt at24 intel_pmc_bxt iTCO_vendor_support irqbypass rapl intel_cstate intel_uncore i2c_i801 mlx4_core intel_pch_thermal i2c_smbus lpc_ich pm80xx acpi_ipmi joydev ipmi_si libsas ie31200_edac ipmi_devintf ipmi_msghandler fuse zram ip_tables xfs ast drm_vram_helper drm_kms_helper cec drm_ttm_helper ttm drm crct10dif_pclmul mpt3sas crc32_pclmul crc32c_intel igb ghash_clmulni_intel dca i2c_algo_bit raid_class scsi_transport_sas video
Oct 11 18:38:56 sulphur kernel: CPU: 0 PID: 1175 Comm: kworker/u8:0 Tainted: G        W         5.14.10-300.fc35.x86_64 #1
Oct 11 18:38:56 sulphur kernel: Hardware name: Supermicro X10SL7-F/X10SL7-F, BIOS 2.00 04/24/2014
Oct 11 18:38:56 sulphur kernel: Workqueue: events_unbound async_run_entry_fn
Oct 11 18:38:56 sulphur kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x5e/0x1d0
Oct 11 18:38:56 sulphur kernel: Code: 2a 08 0f 92 c1 8b 02 0f b6 c9 c1 e1 08 30 e4 09 c8 a9 00 01 ff ff 0f 85 11 01 00 00 85 c0 74 0e 8b 02 84 c0 74 08 f3 90 8b 02 <84> c0 75 f8 b8 01 00 00 00 66 89 02 c3 8b 37 b9 00 02 00 00 81 fe
Oct 11 18:38:56 sulphur kernel: RSP: 0018:ffffa79f415f3a68 EFLAGS: 00000002
Oct 11 18:38:56 sulphur kernel: RAX: 0000000000000101 RBX: ffff8b3ac2b50000 RCX: 0000000000000000
Oct 11 18:38:56 sulphur kernel: RDX: ffff8b3ac2b50038 RSI: 0000000000000000 RDI: ffff8b3ac2b50038
Oct 11 18:38:56 sulphur kernel: RBP: ffff8b3ae32c3f00 R08: 0000000000000001 R09: ffff8b3ae32c3f00
Oct 11 18:38:56 sulphur kernel: R10: 0000000074706db0 R11: 0000000000000001 R12: 0000000000000046
Oct 11 18:38:56 sulphur kernel: R13: 0000000000000000 R14: ffff8b3ac2b50038 R15: ffff8b3ad6400000
Oct 11 18:38:56 sulphur kernel: FS:  0000000000000000(0000) GS:ffff8b3ddfc00000(0000) knlGS:0000000000000000
Oct 11 18:38:56 sulphur kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 11 18:38:56 sulphur kernel: CR2: 00007f12d865fbb0 CR3: 0000000410e10004 CR4: 00000000001706f0
Oct 11 18:38:56 sulphur kernel: Call Trace:
Oct 11 18:38:56 sulphur kernel:  _raw_spin_lock_irqsave+0x32/0x40
Oct 11 18:38:56 sulphur kernel:  pm8001_task_exec.constprop.0+0x66/0x3f0 [pm80xx]
Oct 11 18:38:56 sulphur kernel:  ? kmem_cache_alloc+0x165/0x290
Oct 11 18:38:56 sulphur kernel:  sas_ata_qc_issue+0x17d/0x220 [libsas]
Oct 11 18:38:56 sulphur kernel:  ata_qc_issue+0xfe/0x1f0
Oct 11 18:38:56 sulphur kernel:  ata_exec_internal_sg+0x2b8/0x560
Oct 11 18:38:56 sulphur kernel:  ata_hpa_resize+0x15b/0x440
Oct 11 18:38:56 sulphur kernel:  ? ata_dev_blacklisted+0x68/0xc0
Oct 11 18:38:56 sulphur kernel:  ata_dev_configure+0x188/0xed0
Oct 11 18:38:56 sulphur kernel:  ? ata_dev_read_id+0x3ca/0x470
Oct 11 18:38:56 sulphur kernel:  ata_eh_recover+0x973/0x1340
Oct 11 18:38:56 sulphur kernel:  ? __irq_work_queue_local+0x48/0x50
Oct 11 18:38:56 sulphur kernel:  ? enqueue_entity+0x16a/0x780
Oct 11 18:38:56 sulphur kernel:  ? sas_ata_sched_eh+0x60/0x60 [libsas]
Oct 11 18:38:56 sulphur kernel:  ? sas_ata_prereset+0x50/0x50 [libsas]
Oct 11 18:38:56 sulphur kernel:  ? sas_ata_sched_eh+0x60/0x60 [libsas]
Oct 11 18:38:56 sulphur kernel:  ? sas_ata_prereset+0x50/0x50 [libsas]
Oct 11 18:38:56 sulphur kernel:  ata_do_eh+0x71/0xf0
Oct 11 18:38:56 sulphur kernel:  ata_scsi_port_error_handler+0x3cf/0x8a0
Oct 11 18:38:56 sulphur kernel:  async_sas_ata_eh+0x44/0x7b [libsas]
Oct 11 18:38:56 sulphur kernel:  async_run_entry_fn+0x30/0x130
Oct 11 18:38:56 sulphur kernel:  process_one_work+0x1ec/0x390
Oct 11 18:38:56 sulphur kernel:  worker_thread+0x53/0x3e0
Oct 11 18:38:56 sulphur kernel:  ? process_one_work+0x390/0x390
Oct 11 18:38:56 sulphur kernel:  kthread+0x127/0x150
Oct 11 18:38:56 sulphur kernel:  ? set_kthread_struct+0x40/0x40
Oct 11 18:38:56 sulphur kernel:  ret_from_fork+0x22/0x30
Oct 11 18:38:56 sulphur kernel: NMI backtrace for cpu 0
Oct 11 18:38:56 sulphur kernel: CPU: 0 PID: 1175 Comm: kworker/u8:0 Tainted: G        W         5.14.10-300.fc35.x86_64 #1
Oct 11 18:38:56 sulphur kernel: Hardware name: Supermicro X10SL7-F/X10SL7-F, BIOS 2.00 04/24/2014
Oct 11 18:38:56 sulphur kernel: Workqueue: events_unbound async_run_entry_fn
Oct 11 18:38:56 sulphur kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x5e/0x1d0
Oct 11 18:38:56 sulphur kernel: Code: 2a 08 0f 92 c1 8b 02 0f b6 c9 c1 e1 08 30 e4 09 c8 a9 00 01 ff ff 0f 85 11 01 00 00 85 c0 74 0e 8b 02 84 c0 74 08 f3 90 8b 02 <84> c0 75 f8 b8 01 00 00 00 66 89 02 c3 8b 37 b9 00 02 00 00 81 fe
Oct 11 18:38:56 sulphur kernel: RSP: 0018:ffffa79f415f3a68 EFLAGS: 00000002
Oct 11 18:38:56 sulphur kernel: RAX: 0000000000000101 RBX: ffff8b3ac2b50000 RCX: 0000000000000000
Oct 11 18:38:56 sulphur kernel: RDX: ffff8b3ac2b50038 RSI: 0000000000000000 RDI: ffff8b3ac2b50038
Oct 11 18:38:56 sulphur kernel: RBP: ffff8b3ae32c3f00 R08: 0000000000000001 R09: ffff8b3ae32c3f00
Oct 11 18:38:56 sulphur kernel: R10: 0000000074706db0 R11: 0000000000000001 R12: 0000000000000046
Oct 11 18:38:56 sulphur kernel: R13: 0000000000000000 R14: ffff8b3ac2b50038 R15: ffff8b3ad6400000
Oct 11 18:38:56 sulphur kernel: FS:  0000000000000000(0000) GS:ffff8b3ddfc00000(0000) knlGS:0000000000000000
Oct 11 18:38:56 sulphur kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 11 18:38:56 sulphur kernel: CR2: 00007f12d865fbb0 CR3: 0000000410e10004 CR4: 00000000001706f0
Oct 11 18:38:56 sulphur kernel: Call Trace:
Oct 11 18:38:56 sulphur kernel:  _raw_spin_lock_irqsave+0x32/
Oct 11 18:38:56 sulphur kernel: Lost 26 message(s)!
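
To make it easier to compare this trace with one from another kernel, the stack frames can be pulled out of the journal output with a short script. This is only a minimal sketch, assuming the journalctl line format shown above; the names FRAME_RE and extract_frames are made up for illustration and are not part of any kernel tooling. Frames prefixed with "?" are skipped because the kernel marks those as unreliable.

```python
import re

# Matches a journal line carrying one kernel stack frame, for example:
#   "Oct 11 18:38:56 sulphur kernel:  sas_ata_qc_issue+0x17d/0x220 [libsas]"
# Assumption: lines follow the "... kernel:  symbol+0xOFF/0xLEN [module]" layout above.
FRAME_RE = re.compile(
    r"kernel:\s+(?P<unreliable>\? )?(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?\s*$"
)


def extract_frames(lines, module=None):
    """Return (symbol, module) pairs for the reliable frames in the given log lines.

    If module is given, only frames belonging to that kernel module are kept.
    Frames from the core kernel have module None.
    """
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match is None or match.group("unreliable"):
            continue  # not a stack frame, or a "?" (unreliable) frame
        if module is not None and match.group("module") != module:
            continue
        frames.append((match.group("symbol"), match.group("module")))
    return frames


if __name__ == "__main__":
    import sys

    # Usage: journalctl -k | python3 frames.py
    for symbol, mod in extract_frames(sys.stdin):
        print(f"{symbol} [{mod}]" if mod else symbol)
```

Calling extract_frames(lines, module="pm80xx") on the log above keeps only the frames that belong to the HBA driver module.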

My guess is that a change to the pm8001 driver between the 5.11 and 5.13 kernels has caused this issue. However, I may well be wrong.

Are there any other tests or diagnostics I should run? Or should I file a kernel bug report somewhere? I haven’t done that before…

Thanks for any help or advice you can give me.

Best wishes,

Dan

I have reported this issue on Red Hat Bugzilla.