[ofa-general] Kernel panic in IPoIB stability testing

Jack Morgenstein jackm at dev.mellanox.co.il
Tue Feb 3 08:16:41 PST 2009


We hit the kernel panic below during intensive IPoIB stability testing.
The test runs the following operations concurrently (in separate processes,
with random wait intervals between them):
- ifconfig up/down
- opensm up/down
- ipoib ping
- arp delete
- driver up/down

ib0: ib_sa_path_rec_get failed: -11
ib0: ib_sa_path_rec_get failed: -11
Unable to handle kernel NULL pointer dereference at 0000000000000000
RIP:  [<ffffffff883ac404>] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec
PGD 224ea0067 PUD 225ae9067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /class/infiniband/mlx4_0/ports/2/pkeys/0
CPU 2
Modules linked in: netconsole nfsd exportfs autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth
sunrpc rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U)
ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_mod video sbs i2c_ec
i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport mlx4_core(U) ide_cd sg k8_edac
cdrom edac_mc bnx2 shpchp serio_raw pcspkr sata_svw libata megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 2051, comm: ipoib Not tainted 2.6.18-8.el5 #1
RIP: 0010:[<ffffffff883ac404>]  [<ffffffff883ac404>] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec
RSP: 0018:ffff810121ee7de0  EFLAGS: 00010046
RAX: ffff810121ee8538 RBX: ffffffffffffff30 RCX: 0000000000000002
RDX: ffff8102237a1f90 RSI: ffff8102261e90c0 RDI: ffff810121ee8500
RBP: ffff810121ee8500 R08: ffff810121ee6000 R09: 0000000000000000
R10: ffff810005116400 R11: 0000000000000002 R12: ffffffffffffff30
R13: 0000000000000000 R14: ffff810121ee8688 R15: ffffffff883ae8b3
FS:  00002aaaaaace2a0(0000) GS:ffff810127c4f3c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000224eef000 CR4: 00000000000006e0
Process ipoib (pid: 2051, threadinfo ffff810121ee6000, task ffff810227ebb860)
Stack:  ffff810121ee8500 ffff810121ee84f0 ffff810121ee8000 ffffffff883ae850  ffffffffffffffff 7fffffffffffffff
ffffffffffffffff ffff810121ee8688  ffff810121ee8690 ffff810125d932c0 0000000000000282 ffffffff8004b2b4
Call Trace:  [<ffffffff883ae850>] :ib_ipoib:__ipoib_ib_dev_flush+0x175/0x1b6
             [<ffffffff8004b2b4>] run_workqueue+0x94/0xe5
             [<ffffffff80047c13>] worker_thread+0x0/0x122
             [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
             [<ffffffff80047d03>] worker_thread+0xf0/0x122
             [<ffffffff80086c5f>] default_wake_function+0x0/0xe
             [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
             [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
             [<ffffffff8003216e>] kthread+0xfe/0x132
             [<ffffffff8005bfe5>] child_rip+0xa/0x11
             [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
             [<ffffffff80032070>] kthread+0x0/0x132
             [<ffffffff8005bfdb>] child_rip+0x0/0x11

Code: 4d 8b a4 24 d0 00 00 00 48 8d 93 d0 00 00 00 48 8d 45 38 49
RIP  [<ffffffff883ac404>] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec
RSP <ffff810121ee7de0>
CR2: 0000000000000000
 <0>Kernel panic - not syncing: Fatal exception

objdump -ld on the ib_ipoib module gives the following (==> marks the faulting instruction):
ipoib_mark_paths_invalid():
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/ulp/ipoib/ipoib_main.c:365
    13f7:       c7 83 e0 00 00 00 00    movl   $0x0,0xe0(%rbx)
    13fe:       00 00 00
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/ulp/ipoib/ipoib_main.c:361
    1401:       4c 89 e3                mov    %r12,%rbx
==>    1404:       4d 8b a4 24 d0 00 00    mov    0xd0(%r12),%r12
    140b:       00
    140c:       48 8d 93 d0 00 00 00    lea    0xd0(%rbx),%rdx
    1413:       48 8d 45 38             lea    0x38(%rbp),%rax
    1417:       49 81 ec d0 00 00 00    sub    $0xd0,%r12
    141e:       48 39 c2                cmp    %rax,%rdx
    1421:       0f 85 4b ff ff ff       jne    1372 <ipoib_mark_paths_invalid+0x2a>
--------------------------------
The corresponding source code is (==> marks line 361, which objdump reports for the faulting instruction):

void ipoib_mark_paths_invalid(struct net_device *dev)
{
        struct ipoib_dev_priv *priv = netdev_priv(dev);
        struct ipoib_path *path, *tp;

        spin_lock_irq(&priv->lock);

==>        list_for_each_entry_safe(path, tp, &priv->path_list, list) {
                ipoib_dbg(priv, "mark path LID 0x%04x GID " IPOIB_GID_FMT " invalid\n",
                        be16_to_cpu(path->pathrec.dlid),
                        IPOIB_GID_ARG(path->pathrec.dgid));
                path->valid =  0;
        }

        spin_unlock_irq(&priv->lock);
}
--------------------------------------------
Any ideas?

- Jack


