[openib-general] question on opensm error
Hal Rosenstock
halr at voltaire.com
Tue Feb 15 06:31:46 PST 2005
On Tue, 2005-02-15 at 08:53, Ronald G. Minnich wrote:
> On Tue, 15 Feb 2005, Hal Rosenstock wrote:
>
> > ibstatus/ibstat can show the local port logical and physical port state.
>
> bluesteel:~ # ibstat
> CA 'mthca0':
> CA type: MT23108
> Number of ports: 2
> Firmware version: 3.3.2
> Hardware version: a1
> Node GUID: 0x0002c90108a03e60
> System image GUID: 0x0002c9000100d050
> Port 1:
> State: Initializing
Hmmm. Guess it's not a local problem. With a port state of Initialize,
the SM MAD (NodeInfo get) should get out of the HCA port on VL15.
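Once you have smpquery (more on that below), you can confirm that
directly. Something like this (a sketch; I haven't double checked the
option spelling in the current tree) reads the local port's PortInfo
over a zero hop directed route:

smpquery -D portinfo 0

The link state it reports should agree with what ibstat shows.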
> Rate: 10
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x00500a68
> Port GUID: 0x0002c90108a03e61
> Port 2:
> State: Down
> Rate: 2
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x00500a68
> Port GUID: 0x0002c90108a03e62
>
>
> > It might be helpful to try running ibnetdiscover -e (to show the
> > errors). smpquery can also be used to query the bad link/host.
>
> no -e switch on my copy. svn update time?
Not sure but it is certainly in the latest. I just tried it.
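From a synced tree it should just be:

ibnetdiscover -e

which reports the errors it hits during the sweep as it goes.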
> This was kind of interesting, it did find a lot of switches ...
> [0][1][3][8][7][3][3][2][8][5][8] -> known remote switch
> {0002c90108d19748} portnum 0 lid 0xe4-0xe4 "MT43132 Mellanox Technologies"
> [0][1][3][8][7][3][3][2][8][2] -> processing switch {0002c90108d19200}
> portnum 0 lid 0x0-0x0 "MT43132 Mellanox Technologies"
>
> (more like this -- much more)
Just out of curiosity, what is the deepest number of hops?
> and some hcas
> [0][1][3][8][7][3][3][2][8][2][2] -> new remote hca {0002c901081e6700}
> portnum 1 lid 0x0-0x0 "MT23108 InfiniHost Mellanox Technologies"
> [1] {0002c901081e6700}
>
> but osm.log is about 59MB of these:
> [1108475425:000915547][411FF970] -> umad_receiver: send completed with
> error(method=1 attr=11) -- dropping.
>
> smpquery? Have not seen that. Remember I'm trying to get this done with
> openib ONLY. Probably a bad idea :-)
This was added to diags/net last Thursday, so you likely haven't sync'd
up since then.
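Once you sync, a couple of examples (the LID and path here are just
lifted from your ibnetdiscover output above; treat the exact usage as a
sketch, it may differ in the tree):

smpquery nodeinfo 0xe4        (NodeInfo from the switch at LID 0xe4)
smpquery -D nodeinfo 0,1,3    (directed route, a prefix of one of your paths)

The directed route form is the one that matters for you, since your end
nodes have no LIDs assigned yet.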
> here's plain ibnetdiscover
>
> bluesteel:~ # ibnetdiscover
> warn: [4710] _do_madrpc: retry 2 (timeout 2000 ms)
> warn: [4710] _do_madrpc: send failed; Invalid argument
> warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][5][3][2][8][2][4]
> port 4 failed, skipping port
> warn: [4710] _do_madrpc: retry 2 (timeout 2000 ms)
> warn: [4710] _do_madrpc: send failed; Invalid argument
> warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][2][3][4][1][1][2]
> port 2 failed, skipping port
> warn: [4710] _do_madrpc: retry 2 (timeout 2000 ms)
> warn: [4710] _do_madrpc: send failed; Invalid argument
> warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][2][3][1][8][4][2]
> port 2 failed, skipping port
It is some sort of remote problem. Not sure what is causing the failure,
but it appears to affect multiple nodes concurrently.
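If smpquery in your checkout handles directed routes, one way to
localize it is to walk one of the failing paths from your ibnetdiscover
output by hand, e.g. for the one that failed at
[0][1][3][8][7][5][3][2][8][2][4]:

smpquery -D nodeinfo 0,1
smpquery -D nodeinfo 0,1,3
smpquery -D nodeinfo 0,1,3,8
...

(again, exact syntax is a sketch). Extend the path one hop at a time;
the last node that answers and the first that doesn't bracket the bad
link.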
-- Hal
>
> ron