[openib-general] opensm fails to bring up subnet..

Hal Rosenstock halr at voltaire.com
Fri Jun 3 05:53:32 PDT 2005


On Thu, 2005-06-02 at 19:46, Troy Benjegerdes wrote: 
> Some more info.. I rebooted the switches, and tried to re-run it.
> 
> I found that ibnetdiscover showed everything with a LID of 0 except 1
> HCA card.. when I found that machine and did 'rmmod ib_mthca', opensm
> seemed to get unstuck and mapped all the other lids.

Yes, that would do it ("unstick" things): with the driver unloaded, the
link to that HCA port would not become LinkUp, so opensm would just
ignore it and everything would be fine. (That's what OpenSM really needs
to do when it encounters a non-responsive port with LinkUp; this has been
discussed on this list before.)

> And just now, as a sanity check, I was going to reload all the IB
> modules, but got the following panic:
> 
> gozer.scl.ameslab.gov login: ib_mad: Invalid directed route

That message means that, for some reason, an attempt was made to send an
SMP with an invalid directed route, and the SMP was discarded. The SMP
either had an "unreasonable" hop count, or its hop count and hop pointer
were those for a switch but the node was not a switch (this is occurring
on an HCA port). [It likely would have been one received from the SM.]
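
To make the two rejection cases concrete, here is a minimal sketch of that
kind of check. It is not the ib_mad code: the helper name dr_smp_send_ok,
the DR_MAX_PATH_HOPS limit, and the reduced argument list are assumptions
for illustration, and it only covers an outgoing SMP (it ignores the
returning direction and the DR DLID).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed limit: the directed-route path fields hold 64 entries. */
#define DR_MAX_PATH_HOPS 64

/*
 * Hypothetical helper, not the ib_mad implementation: decide whether an
 * outgoing directed-route SMP may be sent from this node or must be
 * discarded as an invalid directed route.
 */
static bool dr_smp_send_ok(uint8_t hop_ptr, uint8_t hop_cnt, bool is_switch)
{
	/* An "unreasonable" hop count cannot index the path array. */
	if (hop_cnt >= DR_MAX_PATH_HOPS)
		return false;

	/* First hop out of the originating node: any node type may send. */
	if (hop_cnt && hop_ptr == 0)
		return true;

	/* Mid-path: only a switch can forward to the next hop. */
	if (hop_ptr && hop_ptr < hop_cnt)
		return is_switch;

	/* End of the path, or hand-off to the local agent. */
	if (hop_ptr == hop_cnt || hop_ptr == hop_cnt + 1)
		return true;

	/* Hop pointer beyond the path: discard. */
	return false;
}
```

With this sketch, dr_smp_send_ok(1, 2, false) is the HCA case described
above: the hop count and pointer say "forward to the next hop", but the
node is not a switch, so the SMP is discarded.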

> ib_mthca 0000:03:00.0: dma_pool_destroy mthca_av, f71c2000 busy
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
>  printing eip:
> c012d348
> *pde = 00000000
> Oops: 0002 [#1]
> SMP
> Modules linked in: ib_umad nfsd lockd sunrpc ipv6 evdev floppy pcspkr ib_mad ib_core shpchp pci_hotplug ohci_hcd usbcore serverworks i2c_piix4 i2c_core sworks_agp agpgart aic7xxx tg3 xfs exportfs capability commoncap ide_cd ide_core cdrom genrtc isofs ext2 ext3 jbd mbcache sd_mod aacraid scsi_mod unix fbcon font bitblit vesafb cfbcopyarea cfbimgblt cfbfillrect
> CPU:    0
> EIP:    0060:[<c012d348>]    Tainted: GF     VLI
> EFLAGS: 00010812   (2.6.11-1-686-smp)
> EIP is at __queue_work+0x38/0x70
> eax: c2157414   ebx: c2157400   ecx: 00000000   edx: f77ea2d4
> esi: f77ea2d0   edi: 00000286   ebp: c012d3f0   esp: f7277f44
> ds: 007b   es: 007b   ss: 0068
> 
> Process default.hotplug (pid: 3102, threadinfo=f7276000 task=f70d7a60)
> Stack: f7841580 f77ea200 f77ea2d0 c200c9a0 c012d432 c2157400 f77ea2d0 f77ea2d0
>        00000100 c01262e6 f77ea2d0 f7277fa0 c0118093 00000000 f7276000 f7277f80
>        f7277f80 000001f0 00000011 c035ff68 c0397aa0 00000000 c0121cca c035ff68
> 
> Call Trace:
>  [<c012d432>] delayed_work_timer_fn+0x42/0x50
>  [<c01262e6>] run_timer_softirq+0xd6/0x1c0
>  [<c0118093>] scheduler_tick+0x63/0x320
>  [<c0121cca>] __do_softirq+0xba/0xd0
>  [<c0121d0d>] do_softirq+0x2d/0x30
>  [<c0103b98>] apic_timer_interrupt+0x1c/0x24
> 
> Code: 24 08 8b 74 24 18 89 d8 89 7c 24 0c e8 22 fb 17 00 89 5e 14 89 c7 8d 56 04 8d 43 0c 8b 48 04 89 46 04 89 50 04 8d 43 14 89 4a 04 <89> 11 ba 03 00 00 00 b9 01 00 00 00 ff 43 08 c7 04 24 00 00 00
>  <0>Kernel panic - not syncing: Fatal exception in interrupt

It looks like ib_mad somehow dereferences a NULL pointer after this
error while processing delayed work (timeout_sends?), but I don't see
it.
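
For reading the "Oops: 0002" line in the quoted log, here is a minimal
sketch that decodes the x86 page-fault error code, assuming the standard
low-bit layout (bit 0: protection fault rather than not-present page,
bit 1: write access, bit 2: user mode). The struct and function names are
made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical container for the three low bits of the error code. */
struct pf_info {
	bool protection;	/* bit 0: set = protection fault, clear = page not present */
	bool write;		/* bit 1: set = write access, clear = read */
	bool user;		/* bit 2: set = user mode, clear = kernel mode */
};

/* Decode the low bits of an x86 page-fault error code. */
static struct pf_info decode_pf_error(unsigned long code)
{
	struct pf_info info = {
		.protection = (code & 0x1) != 0,
		.write = (code & 0x2) != 0,
		.user = (code & 0x4) != 0,
	};
	return info;
}
```

Decoding 0x0002 this way gives a kernel-mode write to a not-present page,
which fits a store through a NULL pointer (ecx is 00000000 in the register
dump).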

-- Hal






