[ofa-general] Re: Kernel panic in IPoIB stability testing

Yossi Etigin yosefe at Voltaire.COM
Tue Feb 3 09:56:40 PST 2009


I think it comes from unicast_arp_send.

Consider this scenario:
- paths are flushed (opensm up/down).
- unicast_arp_send() is called with a path in priv->path_list. path->valid is 0.
- path_rec_start() fails with -EAGAIN (-11) because alloc_mad() fails: there is no SM AH (yet)
  (see the "ib_sa_path_rec_get failed: -11" prints just before the panic).
- unicast_arp_send() calls path_free().
- path memory is overwritten.
- __ipoib_ib_dev_flush() is called again.
- ipoib_mark_paths_invalid() tries to iterate over priv->path_list and hits the kernel panic
  because path->list is no longer valid.
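
To illustrate the ordering problem in the scenario above, here is a minimal userspace sketch (not the ipoib code, and not necessarily the fix that should go into the driver). It assumes a hand-rolled circular list modeled loosely on struct list_head. All demo_* and list_* names here are made up for the example. The point is only that an entry must be unlinked from the list before its memory is freed, so that a later walk never follows pointers into freed memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal circular doubly linked list, loosely modeled on struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail_node(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

static void list_unlink(struct list_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->next = node;
	node->prev = node;
}

/* Hypothetical stand-in for a path entry; only the fields the example needs. */
struct demo_path {
	int valid;
	struct list_node list;
};

/* Allocate an entry and link it at the tail of the list. */
static struct demo_path *demo_path_add(struct list_node *head)
{
	struct demo_path *path = calloc(1, sizeof(*path));

	if (!path)
		return NULL;
	path->valid = 1;
	list_add_tail_node(&path->list, head);
	return path;
}

/* Unlink first, then free: the list never points at freed memory. */
static void demo_path_release(struct demo_path *path)
{
	assert(path != NULL);
	list_unlink(&path->list);
	free(path);
}

/* Walk the list, mark every entry invalid, return how many were visited. */
static int demo_mark_paths_invalid(struct list_node *head)
{
	int count = 0;
	struct list_node *pos;

	for (pos = head->next; pos != head; pos = pos->next) {
		struct demo_path *path = (struct demo_path *)
			((char *)pos - offsetof(struct demo_path, list));

		path->valid = 0;
		count++;
	}
	return count;
}
```

If demo_path_release() skipped the list_unlink() call, the neighbours of the freed entry would still point at it, and the next demo_mark_paths_invalid() walk would read whatever had since been written into that memory.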

--Yossi

Jack Morgenstein wrote:
> We saw the following kernel panic when testing ipoib stability intensively
> by simultaneously (i.e., in separate processes, with random wait intervals) doing:
> - ifconfig up/down
> - opensm up/down
> - ipoib ping
> - arp delete
> - driver up/down
> 
> ib0: ib_sa_path_rec_get failed: -11
> ib0: ib_sa_path_rec_get failed: -11
> Unable to handle kernel NULL pointer dereference at 0000000000000000
> RIP:  [<ffffffff883ac404>] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec
> PGD 224ea0067 PUD 225ae9067 PMD 0
> Oops: 0000 [1] SMP
> last sysfs file: /class/infiniband/mlx4_0/ports/2/pkeys/0
> CPU 2
> Modules linked in: netconsole nfsd exportfs autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth
> sunrpc rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U)
> ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_mod video sbs i2c_ec
> i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport mlx4_core(U) ide_cd sg k8_edac
> cdrom edac_mc bnx2 shpchp serio_raw pcspkr sata_svw libata megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
> Pid: 2051, comm: ipoib Not tainted 2.6.18-8.el5 #1
> RIP: 0010:[<ffffffff883ac404>]  [<ffffffff883ac404>] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec
> RSP: 0018:ffff810121ee7de0  EFLAGS: 00010046
> RAX: ffff810121ee8538 RBX: ffffffffffffff30 RCX: 0000000000000002
> RDX: ffff8102237a1f90 RSI: ffff8102261e90c0 RDI: ffff810121ee8500
> RBP: ffff810121ee8500 R08: ffff810121ee6000 R09: 0000000000000000
> R10: ffff810005116400 R11: 0000000000000002 R12: ffffffffffffff30
> R13: 0000000000000000 R14: ffff810121ee8688 R15: ffffffff883ae8b3
> FS:  00002aaaaaace2a0(0000) GS:ffff810127c4f3c0(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 0000000224eef000 CR4: 00000000000006e0
> Process ipoib (pid: 2051, threadinfo ffff810121ee6000, task ffff810227ebb860)
> Stack:  ffff810121ee8500 ffff810121ee84f0 ffff810121ee8000 ffffffff883ae850  ffffffffffffffff 7fffffffffffffff
> ffffffffffffffff ffff810121ee8688  ffff810121ee8690 ffff810125d932c0 0000000000000282 ffffffff8004b2b4
> Call Trace:  [<ffffffff883ae850>] :ib_ipoib:__ipoib_ib_dev_flush+0x175/0x1b6
>              [<ffffffff8004b2b4>] run_workqueue+0x94/0xe5
>              [<ffffffff80047c13>] worker_thread+0x0/0x122
>              [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
>              [<ffffffff80047d03>] worker_thread+0xf0/0x122
>              [<ffffffff80086c5f>] default_wake_function+0x0/0xe
>              [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
>              [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
>              [<ffffffff8003216e>] kthread+0xfe/0x132
>              [<ffffffff8005bfe5>] child_rip+0xa/0x11
>              [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
>              [<ffffffff80032070>] kthread+0x0/0x132
>              [<ffffffff8005bfdb>] child_rip+0x0/0x11
> 
> Code: 4d 8b a4 24 d0 00 00 00 48 8d 93 d0 00 00 00 48 8d 45 38 49
> RIP  [<ffffffff883ac404>] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec
> RSP <ffff810121ee7de0>
> CR2: 0000000000000000
>  <0>Kernel panic - not syncing: Fatal exception
> 
> In objdump -ld, we get:
> ipoib_mark_paths_invalid():
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/ulp/ipoib/ipoib_main.c:365
>     13f7:       c7 83 e0 00 00 00 00    movl   $0x0,0xe0(%rbx)
>     13fe:       00 00 00
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/ulp/ipoib/ipoib_main.c:361
>     1401:       4c 89 e3                mov    %r12,%rbx
> ==>    1404:       4d 8b a4 24 d0 00 00    mov    0xd0(%r12),%r12
>     140b:       00
>     140c:       48 8d 93 d0 00 00 00    lea    0xd0(%rbx),%rdx
>     1413:       48 8d 45 38             lea    0x38(%rbp),%rax
>     1417:       49 81 ec d0 00 00 00    sub    $0xd0,%r12
>     141e:       48 39 c2                cmp    %rax,%rdx
>     1421:       0f 85 4b ff ff ff       jne    1372 <ipoib_mark_paths_invalid+0x2a>
> --------------------------------
> and in the source code, we get:
> 
> void ipoib_mark_paths_invalid(struct net_device *dev)
> {
>         struct ipoib_dev_priv *priv = netdev_priv(dev);
>         struct ipoib_path *path, *tp;
> 
>         spin_lock_irq(&priv->lock);
> 
> ==>        list_for_each_entry_safe(path, tp, &priv->path_list, list) {
>                 ipoib_dbg(priv, "mark path LID 0x%04x GID " IPOIB_GID_FMT " invalid\n",
>                         be16_to_cpu(path->pathrec.dlid),
>                         IPOIB_GID_ARG(path->pathrec.dgid));
>                 path->valid =  0;
>         }
> 
>         spin_unlock_irq(&priv->lock);
> }
> --------------------------------------------
> Any ideas?
> 
> - Jack
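
The register dump also looks consistent with a list link that was overwritten with NULL. The faulting instruction loads from 0xd0(%r12), R12 is ffffffffffffff30 (which is -0xd0), and CR2 is 0. Assuming 0xd0 is the offset of the list member inside struct ipoib_path (inferred from the disassembly, not checked against the structure definition), that is what the entry lookup would produce from a NULL next pointer. A small sketch of the arithmetic, with hypothetical helper names:

```c
#include <stdint.h>

/* Offset of the list member, as it appears in the disassembly (assumption). */
#define DEMO_LIST_OFFSET 0xd0u

/* What the entry lookup computes: entry address = link address - offset. */
static uint64_t demo_entry_from_link(uint64_t link_addr)
{
	return link_addr - DEMO_LIST_OFFSET;
}

/* Address read by "mov 0xd0(%r12),%r12": entry address + offset. */
static uint64_t demo_next_load_addr(uint64_t entry_addr)
{
	return entry_addr + DEMO_LIST_OFFSET;
}
```

With a link address of 0, demo_entry_from_link() wraps around to 0xffffffffffffff30, which matches R12 and RBX, and demo_next_load_addr() of that value is 0, which matches CR2.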


