[openib-general] opensm fails to bring up subnet..
Troy Benjegerdes
hozer at hozed.org
Thu Jun 2 16:46:20 PDT 2005
On Thu, Jun 02, 2005 at 06:23:31PM -0500, Troy Benjegerdes wrote:
> I'm having intermittent problems with opensm.. It seems after a while
> IPoIB stops working and if I restart opensm, it starts spitting out
> errors. Do I have a misbehaving switch somewhere?
>
> ibnetdiscover seems to work fine.
>
>
> (this is from running 'opensm -v -o -r')
>
Some more info: I rebooted the switches and re-ran opensm.
I found that ibnetdiscover showed everything with a LID of 0 except one
HCA. When I tracked down that machine and did 'rmmod ib_mthca', opensm
seemed to get unstuck and mapped all the other LIDs.
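
For anyone wanting to spot the same pattern in saved discovery output, here
is a rough sketch. It assumes each port or node line carries a token of the
form "lid N"; that is an assumption about the output format, not something
shown in this message, and the sample lines are made-up placeholders rather
than real ibnetdiscover output.

```python
import re

# Assumption: each line of interest contains a token such as "lid 0" or "lid 7".
LID_RE = re.compile(r"\blid\s+(\d+)\b")

def split_by_lid(lines):
    """Return (assigned, unassigned) lists of lines, split on whether the LID is nonzero."""
    assigned, unassigned = [], []
    for line in lines:
        match = LID_RE.search(line)
        if match is None:
            continue  # line carries no LID information
        if int(match.group(1)) != 0:
            assigned.append(line.strip())
        else:
            unassigned.append(line.strip())
    return assigned, unassigned

# Hypothetical placeholder lines, NOT real ibnetdiscover output.
SAMPLE = [
    "node-a port 1 lid 0",
    "node-b port 1 lid 0",
    "node-c port 1 lid 3",
    "a comment line without that token",
]

assigned, unassigned = split_by_lid(SAMPLE)
print("nonzero LID:", assigned)
print("LID 0:", unassigned)
```

If only one entry ends up in the nonzero list while everything else sits at
LID 0, that entry is the first machine worth looking at.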
And just now, as a sanity check, I was going to reload all the IB
modules, but got the following panic:
gozer.scl.ameslab.gov login: ib_mad: Invalid directed route
ib_mthca 0000:03:00.0: dma_pool_destroy mthca_av, f71c2000 busy
Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c012d348
*pde = 00000000
Oops: 0002 [#1]
SMP
Modules linked in: ib_umad nfsd lockd sunrpc ipv6 evdev floppy pcspkr ib_mad ib_core shpchp pci_hotplug ohci_hcd usbcore serverworks i2c_piix4 i2c_core sworks_agp agpgart aic7xxx tg3 xfs exportfs capability commoncap ide_cd ide_core cdrom genrtc isofs ext2 ext3 jbd mbcache sd_mod aacraid scsi_mod unix fbcon font bitblit vesafb cfbcopyarea cfbimgblt cfbfillrect
CPU: 0
EIP: 0060:[<c012d348>] Tainted: GF VLI
EFLAGS: 00010812 (2.6.11-1-686-smp)
EIP is at __queue_work+0x38/0x70
eax: c2157414 ebx: c2157400 ecx: 00000000 edx: f77ea2d4
esi: f77ea2d0 edi: 00000286 ebp: c012d3f0 esp: f7277f44
ds: 007b es: 007b ss: 0068
Process default.hotplug (pid: 3102, threadinfo=f7276000 task=f70d7a60)
Stack: f7841580 f77ea200 f77ea2d0 c200c9a0 c012d432 c2157400 f77ea2d0 f77ea2d0
00000100 c01262e6 f77ea2d0 f7277fa0 c0118093 00000000 f7276000 f7277f80
f7277f80 000001f0 00000011 c035ff68 c0397aa0 00000000 c0121cca c035ff68
Call Trace:
[<c012d432>] delayed_work_timer_fn+0x42/0x50
[<c01262e6>] run_timer_softirq+0xd6/0x1c0
[<c0118093>] scheduler_tick+0x63/0x320
[<c0121cca>] __do_softirq+0xba/0xd0
[<c0121d0d>] do_softirq+0x2d/0x30
[<c0103b98>] apic_timer_interrupt+0x1c/0x24
Code: 24 08 8b 74 24 18 89 d8 89 7c 24 0c e8 22 fb 17 00 89 5e 14 89 c7 8d 56 04 8d 43 0c 8b 48 04 89 46 04 89 50 04 8d 43 14 89 4a 04 <89> 11 ba 03 00 00 00 b9 01 00 00 00 ff 43 08 c7 04 24 00 00 00
<0>Kernel panic - not syncing: Fatal exception in interrupt
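
The call trace lines above have the form "[<address>] symbol+0xoffset/0xsize".
A minimal sketch for pulling those fields apart is below, using the trace
lines from this oops as input. The helper name parse_frames is made up for
illustration; subtracting the offset from the address gives the start of
each function.

```python
import re

# Matches oops call-trace lines such as:
#   [<c012d432>] delayed_work_timer_fn+0x42/0x50
FRAME_RE = re.compile(
    r"\[<([0-9a-fA-F]+)>\]\s+(\w+)\+0x([0-9a-fA-F]+)/0x([0-9a-fA-F]+)"
)

def parse_frames(text):
    """Return one dict per call-trace frame found in text (hypothetical helper)."""
    frames = []
    for addr, symbol, offset, size in FRAME_RE.findall(text):
        frames.append({
            "addr": int(addr, 16),      # address printed in the trace
            "symbol": symbol,           # function name
            "offset": int(offset, 16),  # offset of addr within the function
            "size": int(size, 16),      # total size of the function
        })
    return frames

# Call trace copied from the oops in this message.
TRACE = """
[<c012d432>] delayed_work_timer_fn+0x42/0x50
[<c01262e6>] run_timer_softirq+0xd6/0x1c0
[<c0118093>] scheduler_tick+0x63/0x320
[<c0121cca>] __do_softirq+0xba/0xd0
[<c0121d0d>] do_softirq+0x2d/0x30
[<c0103b98>] apic_timer_interrupt+0x1c/0x24
"""

for frame in parse_frames(TRACE):
    start = frame["addr"] - frame["offset"]  # start address of the function
    print(f"{frame['symbol']:<24} starts at {start:08x}, size 0x{frame['size']:x}")
```

The first frame works out to a function start of c012d3f0 for
delayed_work_timer_fn, which is the same value as ebp in the register dump.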