[ofa-general] Re: Kernel panic in IPoIB stability testing
Yossi Etigin
yosefe at Voltaire.COM
Tue Feb 3 09:56:40 PST 2009
I think it comes from unicast_arp_send().
Consider this scenario:
- paths are flushed (opensm up/down).
- unicast_arp_send() is called with a path in priv->path_list. path->valid is 0.
- path_rec_start() fails with -EAGAIN (-11) because alloc_mad() fails: there is no SM AH (yet)
(see the prints just before the panic).
- unicast_arp_send() calls path_free() while the path is still on priv->path_list.
- the freed path memory is reused and overwritten.
- __ipoib_ib_dev_flush() is called again.
- ipoib_mark_paths_invalid() iterates over priv->path_list and hits the kernel panic
because path->list has become invalid.
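To make the ordering problem concrete, here is a minimal userspace sketch in C. It is not the driver code and not necessarily the fix that should go in: struct path, path_new() and path_drop() are hypothetical stand-ins, the list helpers only imitate the kernel's struct list_head, and locking and the path rb-tree are ignored. The point it illustrates is that a path has to be unlinked from the list before its memory is freed, otherwise a later walk follows pointers into freed memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace imitation of the kernel's circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

static void list_del(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = NULL;
}

/* Hypothetical cut-down path: only the fields the scenario needs. */
struct path {
    int valid;
    struct list_head list;
};

/* Same idea as container_of(): member address minus member offset. */
#define path_of(p) ((struct path *)((char *)(p) - offsetof(struct path, list)))

/* Allocate a path and link it at the tail of the list (hypothetical helper). */
static struct path *path_new(struct list_head *head)
{
    struct path *p = calloc(1, sizeof(*p));
    assert(p != NULL);
    p->valid = 1;
    list_add_tail(&p->list, head);
    return p;
}

/* Error path done safely: unlink first, then free. */
static void path_drop(struct path *p)
{
    list_del(&p->list);
    free(p);
}

/* Walk the list as the flush does; returns the number of paths visited. */
static int mark_paths_invalid(struct list_head *head)
{
    int n = 0;
    for (struct list_head *pos = head->next; pos != head; pos = pos->next) {
        path_of(pos)->valid = 0;
        n++;
    }
    return n;
}
```

If path_drop() skipped the list_del() call, the neighbours would keep pointing at the freed entry, which is the situation described in the scenario above.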
--Yossi
Jack Morgenstein wrote:
> We saw the following kernel panic when testing ipoib stability intensively
> by simultaneously (i.e., in separate processes, with random wait intervals) doing:
> - ifconfig up/down
> - opensm up/down
> - ipoib ping
> - arp delete
> - driver up/down
>
> ib0: ib_sa_path_rec_get failed: -11
> ib0: ib_sa_path_rec_get failed: -11
> Unable to handle kernel NULL pointer dereference at 0000000000000000
> RIP: [<ffffffff883ac404>] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec
> PGD 224ea0067 PUD 225ae9067 PMD 0
> Oops: 0000 [1] SMP
> last sysfs file: /class/infiniband/mlx4_0/ports/2/pkeys/0
> CPU 2
> Modules linked in: netconsole nfsd exportfs autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth
> sunrpc rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U)
> ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_mod video sbs i2c_ec
> i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport mlx4_core(U) ide_cd sg k8_edac
> cdrom edac_mc bnx2 shpchp serio_raw pcspkr sata_svw libata megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
> Pid: 2051, comm: ipoib Not tainted 2.6.18-8.el5 #1
> RIP: 0010:[<ffffffff883ac404>] [<ffffffff883ac404>] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec
> RSP: 0018:ffff810121ee7de0 EFLAGS: 00010046
> RAX: ffff810121ee8538 RBX: ffffffffffffff30 RCX: 0000000000000002
> RDX: ffff8102237a1f90 RSI: ffff8102261e90c0 RDI: ffff810121ee8500
> RBP: ffff810121ee8500 R08: ffff810121ee6000 R09: 0000000000000000
> R10: ffff810005116400 R11: 0000000000000002 R12: ffffffffffffff30
> R13: 0000000000000000 R14: ffff810121ee8688 R15: ffffffff883ae8b3
> FS: 00002aaaaaace2a0(0000) GS:ffff810127c4f3c0(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 0000000224eef000 CR4: 00000000000006e0
> Process ipoib (pid: 2051, threadinfo ffff810121ee6000, task ffff810227ebb860)
> Stack: ffff810121ee8500 ffff810121ee84f0 ffff810121ee8000 ffffffff883ae850 ffffffffffffffff 7fffffffffffffff
> ffffffffffffffff ffff810121ee8688 ffff810121ee8690 ffff810125d932c0 0000000000000282 ffffffff8004b2b4
> Call Trace: [<ffffffff883ae850>] :ib_ipoib:__ipoib_ib_dev_flush+0x175/0x1b6
> [<ffffffff8004b2b4>] run_workqueue+0x94/0xe5
> [<ffffffff80047c13>] worker_thread+0x0/0x122
> [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
> [<ffffffff80047d03>] worker_thread+0xf0/0x122
> [<ffffffff80086c5f>] default_wake_function+0x0/0xe
> [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
> [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
> [<ffffffff8003216e>] kthread+0xfe/0x132
> [<ffffffff8005bfe5>] child_rip+0xa/0x11
> [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
> [<ffffffff80032070>] kthread+0x0/0x132
> [<ffffffff8005bfdb>] child_rip+0x0/0x11
>
> Code: 4d 8b a4 24 d0 00 00 00 48 8d 93 d0 00 00 00 48 8d 45 38 49
> RIP [<ffffffff883ac404>] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec
> RSP <ffff810121ee7de0>
> CR2: 0000000000000000
> <0>Kernel panic - not syncing: Fatal exception
>
> In objdump -ld, we get:
> ipoib_mark_paths_invalid():
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/ulp/ipoib/ipoib_main.c:365
> 13f7: c7 83 e0 00 00 00 00 movl $0x0,0xe0(%rbx)
> 13fe: 00 00 00
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/ulp/ipoib/ipoib_main.c:361
> 1401: 4c 89 e3 mov %r12,%rbx
> ==> 1404: 4d 8b a4 24 d0 00 00 mov 0xd0(%r12),%r12
> 140b: 00
> 140c: 48 8d 93 d0 00 00 00 lea 0xd0(%rbx),%rdx
> 1413: 48 8d 45 38 lea 0x38(%rbp),%rax
> 1417: 49 81 ec d0 00 00 00 sub $0xd0,%r12
> 141e: 48 39 c2 cmp %rax,%rdx
> 1421: 0f 85 4b ff ff ff jne 1372 <ipoib_mark_paths_invalid+0x2a>
> --------------------------------
> and in the source code, we get:
>
> void ipoib_mark_paths_invalid(struct net_device *dev)
> {
> struct ipoib_dev_priv *priv = netdev_priv(dev);
> struct ipoib_path *path, *tp;
>
> spin_lock_irq(&priv->lock);
>
> ==> list_for_each_entry_safe(path, tp, &priv->path_list, list) {
> ipoib_dbg(priv, "mark path LID 0x%04x GID " IPOIB_GID_FMT " invalid\n",
> be16_to_cpu(path->pathrec.dlid),
> IPOIB_GID_ARG(path->pathrec.dgid));
> path->valid = 0;
> }
>
> spin_unlock_irq(&priv->lock);
> }
> --------------------------------------------
> Any ideas?
>
> - Jack
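
The registers in the oops look consistent with a clobbered list entry. Going by the disassembly, the list member sits at offset 0xd0 in struct ipoib_path (the 0xd0(%r12) load followed by sub $0xd0,%r12 is the container_of() step). R12 is ffffffffffffff30, which is what container_of() gives for a NULL next pointer, and the faulting load then reads address 0, matching CR2. A small sketch of that arithmetic, with hypothetical helper names and the offset taken from the disassembly rather than from the struct definition:

```c
#include <stdint.h>

/* Offset of 'list' inside struct ipoib_path, as read from the disassembly. */
#define LIST_OFFSET UINT64_C(0xd0)

/* container_of() as plain arithmetic: entry = member address - offset. */
static uint64_t entry_from_member(uint64_t member_addr)
{
    return member_addr - LIST_OFFSET;
}

/* Address read by "mov 0xd0(%r12),%r12" when R12 holds 'entry'. */
static uint64_t next_load_addr(uint64_t entry)
{
    return entry + LIST_OFFSET;
}
```

entry_from_member(0) wraps around to 0xffffffffffffff30 (the value in R12 and RBX), and next_load_addr() of that value is 0, so a next pointer that has been overwritten with NULL would produce exactly this fault.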