[Users] OpenSM error message rosetta stone?

Ira Weiny weiny2 at llnl.gov
Tue Feb 19 11:03:15 PST 2013


On Tue, 19 Feb 2013 11:30:38 -0600
Narayan Desai <narayan.desai at gmail.com> wrote:

> We're running 3.3.15-0.2 from my PPA. I haven't had time to build the
> latest release yet.
> 
> With a little more context, we're seeing:
> Feb 19 11:27:55 899340 [22C64700] 0x01 -> state_mgr_light_sweep_start:
> ERR 3315: Unknown remote side for node 0x0002c902004158b0
> (MF0;mlx1-b:IS5600/L13/U1) port 2. Adding to light sweep sampling list
> Feb 19 11:27:55 899360 [22C64700] 0x01 -> Directed Path Dump of 3 hop
> path: Path = 0,1,19,13Feb 19 11:27:55 899394 [22C64700] 0x01 ->
> state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node
> 0x0002c902004158d8 (MF0;mlx1-b:IS5600/L07/U1) port 6. Adding to light
> sweep sampling list
> Feb 19 11:27:55 899400 [22C64700] 0x01 -> Directed Path Dump of 3 hop
> path: Path = 0,1,19,7Feb 19 11:27:56 706975 [21C62700] 0x01 ->
> log_send_error: ERR 5411: DR SMP Send completed with error
> (IB_TIMEOUT) -- dropping
>                         Method 0x1, Attr 0x15, TID 0x1bc3b6
> Feb 19 11:27:56 706999 [21C62700] 0x01 -> Received SMP on a 4 hop
> path: Initial path = 0,1,19,13,2, Return path  = 0,0,0,0,0Feb 19
> 11:27:56 707008 [21C62700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113:
> MAD completed in error (IB_TIMEOUT): SubnGet(PortInfo), attr_mod 0x0,
> TID 0x1bc3b6
> Feb 19 11:27:56 707012 [21C62700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR
> 3120 Timeout while getting attribute 0x15 (PortInfo); Possible mis-set
> mkey?
> Feb 19 11:27:56 707029 [21C62700] 0x01 -> log_send_error: ERR 5411: DR
> SMP Send completed with error (IB_TIMEOUT) -- dropping
>                         Method 0x1, Attr 0x15, TID 0x1bc3b7
> Feb 19 11:27:56 707035 [21C62700] 0x01 -> Received SMP on a 4 hop
> path: Initial path = 0,1,19,7,6, Return path  = 0,0,0,0,0Feb 19
> 11:27:56 707039 [21C62700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113:
> MAD completed in error (IB_TIMEOUT): SubnGet(PortInfo), attr_mod 0x0,
> TID 0x1bc3b7
> Feb 19 11:27:56 707042 [21C62700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR
> 3120 Timeout while getting attribute 0x15 (PortInfo); Possible mis-set
> mkey?
> 
> It looks like some lines are being mixed; is this just a lack of a
> newline, or are the messages interspersed?

Yes, there is a bug here: a newline is missing, so consecutive entries run together.  I submitted a patch, but it was rejected because the newline had already been added as part of another patch.  So I believe this is fixed in 3.3.16.
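
Until you can upgrade, the run-together entries can be separated after the fact.  What follows is only a minimal sketch in Python, not anything shipped with OpenSM: the regular expression is inferred from the prefix seen in the excerpt above (date, time, sub-second field, thread id, log level, "->"), and the names PREFIX and split_entries are made up for illustration.

```python
import re

# Entry prefix as it appears in the excerpt above (an inferred pattern,
# not taken from the OpenSM sources):
# month, day, HH:MM:SS, sub-second field, [thread id], log level, "->".
PREFIX = re.compile(
    r"(?P<stamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<subsec>\d{6}) \[(?P<thread>[0-9A-Fa-f]+)\] "
    r"(?P<level>0x[0-9A-Fa-f]{2}) -> "
)

def split_entries(text):
    """Split a log blob into entries, even where a newline is missing."""
    # Collapse line wraps first so a prefix broken across lines still matches.
    text = " ".join(text.split())
    starts = [m.start() for m in PREFIX.finditer(text)]
    entries = []
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        m = PREFIX.match(text, begin)
        entries.append({
            "stamp": m.group("stamp"),
            "subsec": m.group("subsec"),
            "thread": m.group("thread"),
            "level": int(m.group("level"), 16),
            "message": text[m.end():end].strip(),
        })
    return entries

# Two entries from the excerpt above, fused after "0,1,19,7".
sample = (
    "Feb 19 11:27:55 899400 [22C64700] 0x01 -> Directed Path Dump of 3 hop\n"
    "path: Path = 0,1,19,7Feb 19 11:27:56 706975 [21C62700] 0x01 ->\n"
    "log_send_error: ERR 5411: DR SMP Send completed with error\n"
    "(IB_TIMEOUT) -- dropping\n"
)

for entry in split_entries(sample):
    print(entry["stamp"], entry["subsec"], entry["thread"], entry["message"])
```

Splitting on the prefix rather than on newlines is what makes this tolerant of the bug; a message that happens to contain text looking exactly like a prefix would be split incorrectly.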

> 
> Does the initial path information identify the remote node having
> troubles? How can I turn that into usable coordinates?

The DR path in this case leads to the node which the SM _can_ talk to (0,1,19,13, guid 0x0002c902004158b0).  The remote node which is not responding is on port 2 of that node; the last hop of the failing path (0,1,19,13,2) is the exit port.  The same applies to the second error: whatever is connected to port 6 of the node at 0,1,19,7 (guid 0x0002c902004158d8) is not responding either.

The easiest way to trace this using the diags would be:

iblinkinfo -D 0,1,19,13
or
iblinkinfo -G 0x0002c902004158b0

It too will fail to query the unresponsive port, but it should give you a better idea of where in the fabric you are by looking at the other nodes connected to the other ports.
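
If there are many of these, the lookup can be scripted.  This is only a rough sketch in Python, under the assumption that the "Initial path" printed in the "Received SMP on a N hop path" line is the full directed route of the request that timed out, so that dropping its last hop gives the last node that answers and the last hop itself is the port on that node.  The names INITIAL_PATH and locate_failure are made up for illustration; the script only prints the iblinkinfo command, it does not run it.

```python
import re

# Pulls the hop list out of "Initial path = 0,1,19,7,6, Return path = ...".
INITIAL_PATH = re.compile(r"Initial path = ([\d,]+)")

def locate_failure(log_line):
    """Return (DR path of the last responding node, its exit port), or None."""
    m = INITIAL_PATH.search(log_line)
    if m is None:
        return None
    # The match may include the comma that precedes "Return path".
    hops = m.group(1).strip(",").split(",")
    if len(hops) < 2:
        return None  # nothing beyond the local port, no upstream node to query
    return ",".join(hops[:-1]), int(hops[-1])

# Line taken from the log excerpt quoted above.
line = ("Received SMP on a 4 hop path: Initial path = 0,1,19,7,6, "
        "Return path  = 0,0,0,0,0")

node_path, port = locate_failure(line)
print(f"iblinkinfo -D {node_path}   # then check what hangs off port {port}")
```

For the sample line this yields the path 0,1,19,7 and port 6, which matches the ERR 3315 entry for that node in the log.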

Ira

>  -nld
> 
> On Tue, Feb 19, 2013 at 11:10 AM, Ira Weiny <weiny2 at llnl.gov> wrote:
> > On Tue, 19 Feb 2013 10:52:16 -0600
> > Narayan Desai <narayan.desai at gmail.com> wrote:
> >
> >> Is there a good guide to decoding opensm error logs?
> >>
> >> I'm specifically seeing this:
> >> Feb 19 10:50:26 667041 [21C62700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR
> >> 3120 Timeout while getting attribute 0x15 (PortInfo); Possible mis-set
> >> mkey?
> >> Feb 19 10:50:26 667057 [21C62700] 0x01 -> log_send_error: ERR 5411: DR
> >> SMP Send completed with error (IB_TIMEOUT) -- dropping
> >>                         Method 0x1, Attr 0x15, TID 0x1b684f
> >>
> >> a lot.
> >
> > What version of OpenSM is this?  Jim Foraker here at LLNL worked on the mkey support and we just went through fixing an issue similar to the above but I can't remember the details off the top of my head.
> >
> >>
> >> Also, the timestamp is clear enough, but what do the next 3 fields
> >> (667*, [21C6*, and 0x01) mean?
> >
> > 667* -- microsecond part of the time stamp
> > 21C* -- thread id
> > 0x01 -- log level
> >
> > Ira
> >
> >> thanks.
> >>  -nld
> >> _______________________________________________
> >> Users mailing list
> >> Users at lists.openfabrics.org
> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
> >
> >
> > --
> > Ira Weiny
> > Member of Technical Staff
> > Lawrence Livermore National Lab
> > 925-423-8008
> > weiny2 at llnl.gov


-- 
Ira Weiny
Member of Technical Staff
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov


