[openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

Hal Rosenstock halr at voltaire.com
Fri May 26 11:47:36 PDT 2006


Don,

On Fri, 2006-05-26 at 14:35, Don.Albert at Bull.com wrote:
> Hal,
>  
> > Yes, that is very useful. I had been working on trying to come up with
> > what the problem was but this narrows it down to something I was
> > thinking might be going on.
> > 
> > It looks like you are running back to back HCAs, right ?
> 
> Yes, the HCAs are 4X DDR, connected back to back.
> 
> > 
> > It also looks to me like your remote (in terms of OpenSM) CA node is not
> > responding to SMA requests like SubnGet NodeInfo yet the link is active.
> > Can you describe what state that node is in (what modules are loaded,
> > etc.) ? Can you do an ibstat/ibstatus on that node ?
> 
> Both systems are booted and the link appears active.  Here is the
> information you asked for:
> 
> >>>>>>>>>>>>>>>>>>>
> 
> Local System (where OpenSM is attempting to run)
> 
> [koa] (ib) ib> ibstat
> CA 'mthca0'
>         CA type: MT25204
>         Number of ports: 1
>         Firmware version: 1.0.800
>         Hardware version: a0
>         Node GUID: 0x0002c90200216dc4
>         System image GUID: 0x0002c90200216dc7
>         Port 1:
>                 State: Initializing
>                 Physical state: LinkUp
>                 Rate: 20
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x02510a68
>                 Port GUID: 0x0002c90200216dc5
> [koa] (ib) ib> ibstatus
> Infiniband device 'mthca0' port 1 status:
>         default gid:     fe80:0000:0000:0000:0002:c902:0021:6dc5
>         base lid:        0x0
>         sm lid:          0x0
>         state:           2: INIT
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
> 
> [koa] (ib) ib> /sbin/lsmod
> Module                  Size  Used by
> parport_pc             28008  0
> lp                     12872  0
> parport                37260  2 parport_pc,lp
> ib_ipath               58392  0
> ipath_core            154596  1 ib_ipath
> pcmcia                 34864  0
> yenta_socket           25484  0
> rsrc_nonstatic         12160  1 yenta_socket
> pcmcia_core            38068  3 pcmcia,yenta_socket,rsrc_nonstatic
> button                  7328  0
> battery                10120  0
> ac                      5512  0
> uhci_hcd               31776  0
> hw_random               6824  0
> i2c_i801               10260  0
> i2c_core               20992  1 i2c_i801
> ib_mthca              109744  0
> ib_ipoib               48792  0
> ib_uverbs              34128  0
> ib_umad                14000  0
> ib_ucm                 16520  0
> ib_sa                  13884  1 ib_ipoib
> ib_cm                  30144  1 ib_ucm
> ib_mad                 35896  4 ib_mthca,ib_umad,ib_sa,ib_cm
> ib_core                45952  9 ib_ipath,ib_mthca,ib_ipoib,ib_uverbs,ib_umad,ib_ucm,ib_sa,ib_cm,ib_mad
> floppy                 67400  0
> 
> >>>>>>>>>>>>>>>>>>>
> 
> Remote system (no OpenSM instance)
> 
> [jatoba] (ib) ib> ibstat
> CA 'mthca0'
>         CA type: MT25204
>         Number of ports: 1
>         Firmware version: 1.0.800
>         Hardware version: a0
>         Node GUID: 0x0002c90200216e40
>         System image GUID: 0x0002c90200216e43
>         Port 1:
>                 State: Initializing
>                 Physical state: LinkUp
>                 Rate: 20
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x02510a68
>                 Port GUID: 0x0002c90200216e41
> [jatoba] (ib) ib> ibstatus
> Infiniband device 'mthca0' port 1 status:
>         default gid:     fe80:0000:0000:0000:0002:c902:0021:6e41
>         base lid:        0x0
>         sm lid:          0x0
>         state:           2: INIT
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
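
Both ports report State: Initializing with Physical state: LinkUp and
base/SM LID 0, which is consistent with no SM having configured them yet.
If you want to check that from a script, here is a minimal sketch. It
assumes the ibstat text layout quoted above; parse_ibstat_ports and
waiting_for_sm are illustrative names, not part of any OpenIB tool.

```python
import re

def parse_ibstat_ports(text):
    """Return one dict per 'Port N:' block found in ibstat-style text."""
    ports = []
    current = None
    for raw in text.splitlines():
        line = raw.lstrip("> ").strip()  # tolerate mail quote prefixes
        m = re.match(r"Port (\d+):$", line)
        if m:
            current = {"port": int(m.group(1))}
            ports.append(current)
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return ports

def waiting_for_sm(port):
    """True if the link is up but the port is still Initializing with LID 0."""
    return (port.get("Physical state") == "LinkUp"
            and port.get("State") == "Initializing"
            and int(port.get("Base lid", "0"), 0) == 0)

# Sample trimmed from the ibstat output quoted in this thread.
SAMPLE = """\
CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Port 1:
                State: Initializing
                Physical state: LinkUp
                Rate: 20
                Base lid: 0
                LMC: 0
                SM lid: 0
"""

ports = parse_ibstat_ports(SAMPLE)
for p in ports:
    print(p["port"], "waiting for SM" if waiting_for_sm(p) else "configured")
```

Feeding it the real output (for example, the text captured from running
ibstat) should work the same way as long as the field names match.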

One more thing on the remote side, try:

smpquery nodeinfo -D 0

> [jatoba] (ib) ib> /sbin/lsmod
> Module                  Size  Used by
> parport_pc             28008  0
> lp                     12872  0
> parport                37260  2 parport_pc,lp
> ib_ipath               58392  0
> ipath_core            154596  1 ib_ipath
> pcmcia                 34864  0
> yenta_socket           25484  0
> rsrc_nonstatic         12160  1 yenta_socket
> pcmcia_core            38068  3 pcmcia,yenta_socket,rsrc_nonstatic
> button                  7328  0
> battery                10120  0
> ac                      5512  0
> uhci_hcd               31776  0
> hw_random               6824  0
> i2c_i801               10260  0
> i2c_core               20992  1 i2c_i801
> ib_mthca              109744  0
> ib_ipoib               48792  0
> ib_uverbs              34128  0
> ib_umad                14000  2
> ib_ucm                 16520  0
> ib_sa                  13884  1 ib_ipoib
> ib_cm                  30144  1 ib_ucm
> ib_mad                 35896  4 ib_mthca,ib_umad,ib_sa,ib_cm
> ib_core                45952  9 ib_ipath,ib_mthca,ib_ipoib,ib_uverbs,ib_umad,ib_ucm,ib_sa,ib_cm,ib_mad
> floppy                 67400  0

Do you also have an iPath adapter ? If not, there is no need to load the
ipath modules (ib_ipath and ipath_core).
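
To check quickly whether those modules are loaded on a given node, here is
a minimal sketch. It assumes lsmod output in the column layout quoted
above; parse_lsmod and loaded_ipath_modules are illustrative names only.

```python
def parse_lsmod(text):
    """Map module name -> (size, use_count, users) from lsmod-style text."""
    modules = {}
    for raw in text.splitlines():
        fields = raw.lstrip("> ").split()
        # Skip the header row and any wrapped continuation lines.
        if len(fields) < 3 or not fields[1].isdigit() or not fields[2].isdigit():
            continue
        users = fields[3].split(",") if len(fields) > 3 else []
        modules[fields[0]] = (int(fields[1]), int(fields[2]), users)
    return modules

def loaded_ipath_modules(modules):
    """Return the ipath-related module names present, sorted."""
    return sorted(name for name in modules if name in ("ib_ipath", "ipath_core"))

# Sample trimmed from the lsmod output quoted in this thread.
LSMOD_SAMPLE = """\
Module                  Size  Used by
ib_ipath               58392  0
ipath_core            154596  1 ib_ipath
ib_mthca              109744  0
"""

mods = parse_lsmod(LSMOD_SAMPLE)
print(loaded_ipath_modules(mods))
```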

> >>>>>>>>>>>>>>>>>>>
> 
> > 
> > Can you try this patch to see if it gets you further and let me know ?
> > Note that this is just a potential workaround right now.
> > 
> 
> I will try rebuilding with the patch and let you know the results.

Thanks for your help in resolving this.

-- Hal

> Thanks,
>         -Don Albert-



