[ofa-general] Race condition in core/sysfs.c (kernel panic) when unloading the driver

Jack Morgenstein jackm at dev.mellanox.co.il
Sun Feb 22 03:37:00 PST 2009


On Sunday 22 February 2009 09:15, Roland Dreier wrote:
>  > I ran on RHEL5.2 ...
> 
> I suspect that at some point in the 2+ years since 2.6.18 more locking
> was added to sysfs so that this race no longer exists.  You could try
> and see if my test (add a sleep to the show method and make sure you
> remove the low-level driver during that window) results in an instant
> crash with the RHEL 5.2 kernel.
> 
>  - R.
> 
There is still a problem, which we do not see with ConnectX because of the separation between
mlx4_ib and mlx4_core: we unload only mlx4_ib, leaving all of the mlx4_core infrastructure intact.

I tried your test with a Sinai card (mthca) and got the following kernel oops (on kernel 2.6.27.4).
(Note that ib_mthca is still listed as loaded, but with "(-)" following it.)
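
For readers following the thread, the interleaving that the test forces can be replayed in a small
user-space C model. This is only a sketch under stated assumptions: struct model_dev, its
resources_valid flag, and all function names are hypothetical stand-ins, not the ib_core or ib_mthca
code, and the "sleep" is modeled as a point where the removal step may be interleaved. The two
messages printed mirror the debug lines at the top of the oops.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the low-level device (not struct mthca_dev). */
struct model_dev {
    bool resources_valid; /* true while the low-level driver's resources exist */
};

enum show_result { SHOW_OK = 0, SHOW_WOULD_FAULT = 1 };

/* First half of the modeled show method: everything before the added sleep. */
static void show_enter(const struct model_dev *dev)
{
    (void)dev;
    puts("enter show_port_pkey");
}

/* Models removal of the low-level driver: its resources go away. */
static void driver_remove(struct model_dev *dev)
{
    dev->resources_valid = false;
}

/* Second half of the modeled show method: the query issued after the sleep. */
static enum show_result show_resume(const struct model_dev *dev)
{
    puts("call ib_query_pkey");
    /* The real code would touch the hardware here; the model only reports
     * whether that access would hit resources that no longer exist. */
    return dev->resources_valid ? SHOW_OK : SHOW_WOULD_FAULT;
}

/* Replays one interleaving; remove_during_sleep selects the racy ordering. */
enum show_result replay_interleaving(bool remove_during_sleep)
{
    struct model_dev dev = { .resources_valid = true };

    show_enter(&dev);
    if (remove_during_sleep)
        driver_remove(&dev); /* runs inside the sleep window */
    return show_resume(&dev);
}
```

Calling replay_interleaving(true) returns SHOW_WOULD_FAULT and replay_interleaving(false) returns
SHOW_OK. The model says nothing about which locking would close the window; it only makes the
ordering explicit.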

- Jack
======================

enter show_port_pkey
call ib_query_pkey
BUG: unable to handle kernel paging request at ffffc20000648698
IP: [<ffffffffa0217278>] mthca_cmd_post+0x168/0x24c [ib_mthca]
PGD 7fc59067 PUD 11fc30067 PMD 11ff34067 PTE 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: rdma_ucm rds ib_ucm ib_sdp rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ib_uverbs
ib_umad iw_nes mlx4_en inet_lro mlx4_ib mlx4_core ib_mthca(-) ib_mad ib_core memtrack mst_pciconf
mst_pci nfsd auth_rpcgss exportfs autofs4 hidp nfs lockd nfs_acl rfcomm l2cap bluetooth sunrpc ipv6
dm_mirror dm_log dm_multipath dm_mod sbs sbshc battery acpi_memhotplug ac parport_pc lp parport rtc_cmos
ide_cd_mod floppy sg button rtc_core cdrom serio_raw i2c_nforce2 rtc_lib k8temp shpchp forcedeth i2c_core
hwmon pcspkr sata_nv libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: inet_lro]
Pid: 23114, comm: cat Not tainted 2.6.27.4 #1
RIP: 0010:[<ffffffffa0217278>]  [<ffffffffa0217278>] mthca_cmd_post+0x168/0x24c [ib_mthca]
RSP: 0018:ffff88011695bcd8  EFLAGS: 00010246
RAX: ffffc20000648680 RBX: ffff8800734b6000 RCX: 0000000000000001
RDX: ffffffff8021072e RSI: 000000005102d000 RDI: ffff8800734b66c8
RBP: 0000000000000001 R08: 0000000000000003 R09: 0000000000000024
R10: ffff88011695be5f R11: 000000000000ea60 R12: 000000000000ffff
R13: 000000005102d000 R14: 000000007e57b000 R15: 000000005102d003
FS:  00007fb5ad5c86f0(0000) GS:ffffffff806fca80(0000) knlGS:00000000f735fb90
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffc20000648698 CR3: 000000006f897000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process cat (pid: 23114, threadinfo ffff88011695a000, task ffff88011f4ec240)
Stack:  000000005fe51850 002488006b0c5e80 ffff88005fe5ffff ffff88006b0c5e80
 ffff88006b0c5e80 002488005fe51840 0000000000000296 ffff8800734b6000
 0000000000000024 000000005102d003 ffff88011695bdb8 0000000000000001
Call Trace:
 [<ffffffffa021758c>] ? mthca_cmd_poll+0x61/0x118 [ib_mthca]
 [<ffffffffa021775f>] ? mthca_cmd_box+0x5d/0x62 [ib_mthca]
 [<ffffffffa0219c21>] ? mthca_MAD_IFC+0x171/0x1bc [ib_mthca]
 [<ffffffffa0225254>] ? mthca_query_pkey+0x103/0x18a [ib_mthca]
 [<ffffffff8023da38>] ? process_timeout+0x0/0x5
 [<ffffffffa01e9e7b>] ? show_port_pkey+0x4f/0x74 [ib_core]
 [<ffffffff802d7a1a>] ? sysfs_read_file+0xa8/0x12f
 [<ffffffff80291560>] ? vfs_read+0xaa/0x133
 [<ffffffff80291847>] ? sys_read+0x45/0x6e
 [<ffffffff8020be0b>] ? system_call_fastpath+0x16/0x1b


Code: c0 48 87 02 e8 73 89 26 e0 48 8b 83 98 06 00 00 8b 40 18 66 85 c0 79
0c 48 8b 05 14 66 53 e0 4c 39 e0 78 d2 48 8b 83 98 06 00 00 <8b> 40 18 66 85
c0 41 bc f5 ff ff ff 0f 88 b4 00 00 00 4c 89 e8
RIP  [<ffffffffa0217278>] mthca_cmd_post+0x168/0x24c [ib_mthca]
 RSP <ffff88011695bcd8>
CR2: ffffc20000648698
---[ end trace 7cb234a047e4a788 ]---
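
As a reading aid for the register dump above: the marked bytes in the Code: line, <8b> 40 18, decode
as a 32-bit load from RAX plus an 8-bit displacement of 0x18 (mov 0x18(%rax),%eax), so the faulting
address in CR2 should equal RAX + 0x18. The short C sketch below checks that arithmetic with the
values copied from the oops. The function names are hypothetical, and the sketch does not try to
establish why that address was no longer mapped.

```c
#include <stdint.h>

/* Effective address of a base-register-plus-disp8 memory operand. */
uint64_t effective_address(uint64_t base, int8_t disp8)
{
    return base + (uint64_t)(int64_t)disp8;
}

/* Returns 1 if CR2 from the oops matches RAX + 0x18, 0 otherwise. */
int fault_matches_operand(void)
{
    const uint64_t rax = 0xffffc20000648680ULL; /* RAX from the register dump */
    const uint64_t cr2 = 0xffffc20000648698ULL; /* CR2 from the register dump */
    const int8_t disp8 = 0x18;                  /* third byte of <8b> 40 18 */

    return effective_address(rax, disp8) == cr2;
}
```

fault_matches_operand() returns 1 for these values, which is consistent with the fault being taken
on that load inside mthca_cmd_post.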


