[openib-general] question on opensm error
Hal Rosenstock
halr at voltaire.com
Tue Feb 15 06:31:46 PST 2005
On Tue, 2005-02-15 at 08:53, Ronald G. Minnich wrote:
> On Tue, 15 Feb 2005, Hal Rosenstock wrote:
>
> > ibstatus/ibstat can show the local port logical and physical port state.
>
> bluesteel:~ # ibstat
> CA 'mthca0':
> CA type: MT23108
> Number of ports: 2
> Firmware version: 3.3.2
> Hardware version: a1
> Node GUID: 0x0002c90108a03e60
> System image GUID: 0x0002c9000100d050
> Port 1:
> State: Initializing
Hmmm. Guess it's not a local problem. With a port state of Initialize,
the SM MAD (NodeInfo get) should get out of the HCA port on VL15.
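Once you have smpquery (more on that below), you can confirm that
directly. Something like this (a sketch; I haven't double checked the
option spelling in the current tree) reads the local port's PortInfo
over a zero hop directed route:

smpquery -D portinfo 0

The link state it reports should agree with what ibstat shows.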
> Rate: 10
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x00500a68
> Port GUID: 0x0002c90108a03e61
> Port 2:
> State: Down
> Rate: 2
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x00500a68
> Port GUID: 0x0002c90108a03e62
>
>
> > It might be helpful to try running ibnetdiscover -e (to show the
> > errors). smpquery can also be used to query the bad link/host.
>
> no -e switch on my copy. svn update time?
Not sure but it is certainly in the latest. I just tried it.
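From a synced tree it should just be:

ibnetdiscover -e

which reports the errors it hits during the sweep as it goes.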
> This was kind of interesting, it did find a lot of switches ...
> [0][1][3][8][7][3][3][2][8][5][8] -> known remote switch
> {0002c90108d19748} portnum 0 lid 0xe4-0xe4 "MT43132 Mellanox Technologies"
> [0][1][3][8][7][3][3][2][8][2] -> processing switch {0002c90108d19200}
> portnum 0 lid 0x0-0x0 "MT43132 Mellanox Technologies"
>
> (more like this -- much more)
Just out of curiosity, what is the deepest number of hops?
> and some hcas
> [0][1][3][8][7][3][3][2][8][2][2] -> new remote hca {0002c901081e6700}
> portnum 1 lid 0x0-0x0 "MT23108 InfiniHost Mellanox Technologies"
> [1] {0002c901081e6700}
>
> but osm.log is about 59MB of these:
> [1108475425:000915547][411FF970] -> umad_receiver: send completed with
> error(method=1 attr=11) -- dropping.
>
> smpquery? Have not seen that. Remember I'm trying to get this done with
> openib ONLY. Probably a bad idea :-)
This was added to diags/net last Thursday, so you likely haven't sync'd
up since then.
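Once you sync, a couple of examples (the LID and path here are just
lifted from your ibnetdiscover output above; treat the exact usage as a
sketch, it may differ in the tree):

smpquery nodeinfo 0xe4        (NodeInfo from the switch at LID 0xe4)
smpquery -D nodeinfo 0,1,3    (directed route, a prefix of one of your paths)

The directed route form is the one that matters for you, since your end
nodes have no LIDs assigned yet.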
> here's plain ibnetdiscover
>
> bluesteel:~ # ibnetdiscover
> warn: [4710] _do_madrpc: retry 2 (timeout 2000 ms)
> warn: [4710] _do_madrpc: send failed; Invalid argument
> warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][5][3][2][8][2][4]
> port 4 failed, skipping port
> warn: [4710] _do_madrpc: retry 2 (timeout 2000 ms)
> warn: [4710] _do_madrpc: send failed; Invalid argument
> warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][2][3][4][1][1][2]
> port 2 failed, skipping port
> warn: [4710] _do_madrpc: retry 2 (timeout 2000 ms)
> warn: [4710] _do_madrpc: send failed; Invalid argument
> warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][2][3][1][8][4][2]
> port 2 failed, skipping port
It is some sort of remote problem. Not sure what is causing the failure,
but it appears to affect multiple nodes concurrently.
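If smpquery in your checkout handles directed routes, one way to
localize it is to walk one of the failing paths from your ibnetdiscover
output by hand, e.g. for the one that failed at
[0][1][3][8][7][5][3][2][8][2][4]:

smpquery -D nodeinfo 0,1
smpquery -D nodeinfo 0,1,3
smpquery -D nodeinfo 0,1,3,8
...

(again, exact syntax is a sketch). Extend the path one hop at a time;
the last node that answers and the first that doesn't bracket the bad
link.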
-- Hal
>
> ron