[openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5
Hal Rosenstock
halr at voltaire.com
Fri May 26 11:47:36 PDT 2006
Don,
On Fri, 2006-05-26 at 14:35, Don.Albert at Bull.com wrote:
> Hal,
>
> > Yes, that is very useful. I had been working on trying to come up with
> > what the problem was but this narrows it down to something I was
> > thinking might be going on.
> >
> > It looks like you are running back to back HCAs, right ?
>
> Yes, the HCAs are 4X DDR, connected back to back.
>
> >
> > It also looks to me like your remote (in terms of OpenSM) CA node is not
> > responding to SMA requests like SubnGet NodeInfo yet the link is active.
> > Can you describe what state that node is in (what modules are loaded,
> > etc.) ? Can you do an ibstat/ibstatus on that node ?
>
> Both systems are booted and the link appears active. Here is the
> information you asked for:
>
> >>>>>>>>>>>>>>>>>>>
>
> Local System (where OpenSM is attempting to run)
>
> [koa] (ib) ib> ibstat
> CA 'mthca0'
> CA type: MT25204
> Number of ports: 1
> Firmware version: 1.0.800
> Hardware version: a0
> Node GUID: 0x0002c90200216dc4
> System image GUID: 0x0002c90200216dc7
> Port 1:
> State: Initializing
> Physical state: LinkUp
> Rate: 20
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x02510a68
> Port GUID: 0x0002c90200216dc5
> [koa] (ib) ib> ibstatus
> Infiniband device 'mthca0' port 1 status:
> default gid: fe80:0000:0000:0000:0002:c902:0021:6dc5
> base lid: 0x0
> sm lid: 0x0
> state: 2: INIT
> phys state: 5: LinkUp
> rate: 20 Gb/sec (4X DDR)
>
> [koa] (ib) ib> /sbin/lsmod
> Module Size Used by
> parport_pc 28008 0
> lp 12872 0
> parport 37260 2 parport_pc,lp
> ib_ipath 58392 0
> ipath_core 154596 1 ib_ipath
> pcmcia 34864 0
> yenta_socket 25484 0
> rsrc_nonstatic 12160 1 yenta_socket
> pcmcia_core 38068 3 pcmcia,yenta_socket,rsrc_nonstatic
> button 7328 0
> battery 10120 0
> ac 5512 0
> uhci_hcd 31776 0
> hw_random 6824 0
> i2c_i801 10260 0
> i2c_core 20992 1 i2c_i801
> ib_mthca 109744 0
> ib_ipoib 48792 0
> ib_uverbs 34128 0
> ib_umad 14000 0
> ib_ucm 16520 0
> ib_sa 13884 1 ib_ipoib
> ib_cm 30144 1 ib_ucm
> ib_mad 35896 4 ib_mthca,ib_umad,ib_sa,ib_cm
> ib_core 45952 9 ib_ipath,ib_mthca,ib_ipoib,ib_uverbs,ib_umad,ib_ucm,ib_sa,ib_cm,ib_mad
> floppy 67400 0
>
> >>>>>>>>>>>>>>>>>>>
>
> Remote system (no OpenSM instance)
>
> [jatoba] (ib) ib> ibstat
> CA 'mthca0'
> CA type: MT25204
> Number of ports: 1
> Firmware version: 1.0.800
> Hardware version: a0
> Node GUID: 0x0002c90200216e40
> System image GUID: 0x0002c90200216e43
> Port 1:
> State: Initializing
> Physical state: LinkUp
> Rate: 20
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x02510a68
> Port GUID: 0x0002c90200216e41
> [jatoba] (ib) ib> ibstatus
> Infiniband device 'mthca0' port 1 status:
> default gid: fe80:0000:0000:0000:0002:c902:0021:6e41
> base lid: 0x0
> sm lid: 0x0
> state: 2: INIT
> phys state: 5: LinkUp
> rate: 20 Gb/sec (4X DDR)
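Both ports above report State: Initializing with base LID 0 and SM LID 0, which is what a port looks like when the physical link is up but no SM has configured it yet. As a rough sketch (not part of the OpenIB tools; the function names parse_ibstat and waiting_for_sm are made up for illustration), that condition could be checked from ibstat-style text like this:

```python
def parse_ibstat(text):
    """Collect 'Key: value' pairs from ibstat-style text into a dict.

    Leading '> ' quote markers are stripped so that output pasted from
    an email can be fed in directly. Lines without a value, such as
    'Port 1:', are skipped.
    """
    fields = {}
    for raw in text.splitlines():
        line = raw.lstrip("> ").strip()
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def waiting_for_sm(fields):
    """True if the port is still in Initializing with no LID and no SM LID."""
    return (
        fields.get("State") == "Initializing"
        and int(fields.get("Base lid", "0"), 0) == 0
        and int(fields.get("SM lid", "0"), 0) == 0
    )


# Sample taken from the remote node's ibstat output quoted in this thread.
SAMPLE = """\
CA 'mthca0'
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 20
Base lid: 0
LMC: 0
SM lid: 0
Port GUID: 0x0002c90200216e41
"""

if __name__ == "__main__":
    print(waiting_for_sm(parse_ibstat(SAMPLE)))
```

This assumes a single-port CA, as on both systems here; with several ports the later port's fields would overwrite the earlier ones.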
One more thing on the remote side, try:
smpquery nodeinfo -D 0
> [jatoba] (ib) ib> /sbin/lsmod
> Module Size Used by
> parport_pc 28008 0
> lp 12872 0
> parport 37260 2 parport_pc,lp
> ib_ipath 58392 0
> ipath_core 154596 1 ib_ipath
> pcmcia 34864 0
> yenta_socket 25484 0
> rsrc_nonstatic 12160 1 yenta_socket
> pcmcia_core 38068 3 pcmcia,yenta_socket,rsrc_nonstatic
> button 7328 0
> battery 10120 0
> ac 5512 0
> uhci_hcd 31776 0
> hw_random 6824 0
> i2c_i801 10260 0
> i2c_core 20992 1 i2c_i801
> ib_mthca 109744 0
> ib_ipoib 48792 0
> ib_uverbs 34128 0
> ib_umad 14000 2
> ib_ucm 16520 0
> ib_sa 13884 1 ib_ipoib
> ib_cm 30144 1 ib_ucm
> ib_mad 35896 4 ib_mthca,ib_umad,ib_sa,ib_cm
> ib_core 45952 9 ib_ipath,ib_mthca,ib_ipoib,ib_uverbs,ib_umad,ib_ucm,ib_sa,ib_cm,ib_mad
> floppy 67400 0
Do you also have an iPath adapter ? If not, there is no need to load the
ib_ipath and ipath_core modules.
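If it helps, here is a small sketch for spotting the iPath modules in lsmod output. It is only an illustration: the helper name loaded_modules is hypothetical, and the module names are simply the two that appear in the listings in this thread.

```python
# Module names taken from the lsmod listings quoted in this thread.
IPATH_MODULES = {"ib_ipath", "ipath_core"}


def loaded_modules(lsmod_text):
    """Return the set of module names found in lsmod-style text.

    A module line is expected to look like 'name size use_count [users]'.
    The header line and wrapped continuation lines do not match that
    shape and are skipped.
    """
    names = set()
    for raw in lsmod_text.splitlines():
        parts = raw.lstrip("> ").split()
        if len(parts) < 3 or not parts[1].isdigit():
            continue
        names.add(parts[0])
    return names


# A few lines copied from the lsmod output above.
SAMPLE = """\
Module Size Used by
ib_ipath 58392 0
ipath_core 154596 1 ib_ipath
ib_mthca 109744 0
ib_mad 35896 4 ib_mthca,ib_umad,ib_sa,ib_cm
"""

if __name__ == "__main__":
    print(sorted(loaded_modules(SAMPLE) & IPATH_MODULES))
```

Since the listing shows ipath_core in use by ib_ipath, ib_ipath would have to be unloaded first if the modules are removed by hand.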
> >>>>>>>>>>>>>>>>>>>
>
> >
> > Can you try this patch to see if it gets you further and let me know ?
> > Note that this is just a potential workaround right now.
> >
>
> I will try rebuilding with the patch and let you know the results.
Thanks for your help in resolving this.
-- Hal
> Thanks,
> -Don Albert-