[openib-general] opensm fails to bring up subnet..

Troy Benjegerdes hozer at hozed.org
Thu Jun 2 16:46:20 PDT 2005


On Thu, Jun 02, 2005 at 06:23:31PM -0500, Troy Benjegerdes wrote:
> I'm having intermittent problems with opensm.. It seems after a while
> IPoIB stops working and if I restart opensm, it starts spitting out
> errors. Do I have a misbehaving switch somewhere?
> 
> ibnetdiscover seems to work fine.
> 
> 
> (this is from running 'opensm -v -o -r')
> 

Some more info: I rebooted the switches and tried to re-run opensm.

ibnetdiscover showed everything with a LID of 0 except for one HCA.
When I found that machine and ran 'rmmod ib_mthca', opensm seemed to
get unstuck and assigned LIDs to everything else.

And just now, as a sanity check, I was going to reload all the IB
modules, but got the following panic:

gozer.scl.ameslab.gov login: ib_mad: Invalid directed route
ib_mthca 0000:03:00.0: dma_pool_destroy mthca_av, f71c2000 busy
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c012d348
*pde = 00000000
Oops: 0002 [#1]
SMP
Modules linked in: ib_umad nfsd lockd sunrpc ipv6 evdev floppy pcspkr ib_mad ib_core shpchp pci_hotplug ohci_hcd usbcore serverworks i2c_piix4 i2c_core sworks_agp agpgart aic7xxx tg3 xfs exportfs capability commoncap ide_cd ide_core cdrom genrtc isofs ext2 ext3 jbd mbcache sd_mod aacraid scsi_mod unix fbcon font bitblit vesafb cfbcopyarea cfbimgblt cfbfillrect
CPU:    0
EIP:    0060:[<c012d348>]    Tainted: GF     VLI
EFLAGS: 00010812   (2.6.11-1-686-smp)
EIP is at __queue_work+0x38/0x70
eax: c2157414   ebx: c2157400   ecx: 00000000   edx: f77ea2d4
esi: f77ea2d0   edi: 00000286   ebp: c012d3f0   esp: f7277f44
ds: 007b   es: 007b   ss: 0068

Process default.hotplug (pid: 3102, threadinfo=f7276000 task=f70d7a60)
Stack: f7841580 f77ea200 f77ea2d0 c200c9a0 c012d432 c2157400 f77ea2d0 f77ea2d0
       00000100 c01262e6 f77ea2d0 f7277fa0 c0118093 00000000 f7276000 f7277f80
       f7277f80 000001f0 00000011 c035ff68 c0397aa0 00000000 c0121cca c035ff68

Call Trace:
 [<c012d432>] delayed_work_timer_fn+0x42/0x50
 [<c01262e6>] run_timer_softirq+0xd6/0x1c0
 [<c0118093>] scheduler_tick+0x63/0x320
 [<c0121cca>] __do_softirq+0xba/0xd0
 [<c0121d0d>] do_softirq+0x2d/0x30
 [<c0103b98>] apic_timer_interrupt+0x1c/0x24

Code: 24 08 8b 74 24 18 89 d8 89 7c 24 0c e8 22 fb 17 00 89 5e 14 89 c7 8d 56 04 8d 43 0c 8b 48 04 89 46 04 89 50 04 8d 43 14 89 4a 04 <89> 11 ba 03 00 00 00 b9 01 00 00 00 ff 43 08 c7 04 24 00 00 00
 <0>Kernel panic - not syncing: Fatal exception in interrupt
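
For anyone reading the trace, here is a minimal Python sketch (not part of the original report) that pulls two facts out of the oops text: the page-fault error code bits from the "Oops:" line, and the opcode bytes at the faulting EIP, which the kernel marks with <..> in the "Code:" line. The function names decode_error_code and faulting_bytes are made up for illustration, and the Code line is shortened to the bytes around the marker. On i386 the bytes 89 11 should disassemble to mov %edx,(%ecx); with ecx = 00000000 in the register dump, that is consistent with a kernel-mode write to address 0.

```python
import re

# Two lines copied from the oops above; the Code line is shortened
# to the bytes surrounding the <..> marker.
OOPS_LINE = "Oops: 0002 [#1]"
CODE_LINE = "Code: 8d 43 14 89 4a 04 <89> 11 ba 03 00 00 00"


def decode_error_code(line):
    """Decode the low three bits of an i386 page-fault error code."""
    code = int(re.search(r"Oops:\s+([0-9a-fA-F]{4})", line).group(1), 16)
    return {
        "page_present": bool(code & 1),  # bit 0 clear: page not present
        "write": bool(code & 2),         # bit 1 set: write access
        "user_mode": bool(code & 4),     # bit 2 clear: kernel mode
    }


def faulting_bytes(line, count=2):
    """Return `count` opcode bytes starting at the <..>-marked byte."""
    tokens = line.split()[1:]  # drop the leading "Code:" label
    idx = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return [int(t.strip("<>"), 16) for t in tokens[idx:idx + count]]


print(decode_error_code(OOPS_LINE))
print([hex(b) for b in faulting_bytes(CODE_LINE)])
```

This only reads the text of the oops; it says nothing about which module queued the work item that triggered the fault.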



