[ewg] OpenSM problem on today's OFED-1.5.1 daily build

Hal Rosenstock hal.rosenstock at gmail.com
Fri Feb 19 16:09:32 PST 2010


On Fri, Feb 19, 2010 at 7:08 PM, Hal Rosenstock
<hal.rosenstock at gmail.com> wrote:
> On Fri, Feb 19, 2010 at 6:47 PM, Woodruff, Robert J
> <robert.j.woodruff at intel.com> wrote:
>> Hal wrote,
>>
>>>Has there been any change between those two in the management space ?
>>
>> I am not sure on that, but there must be some changes because it
>> works with RC1 but fails with today's daily build.
>
> Could it be changes to the mlx driver ?
>
>>>What state is the peer port in ? Any interesting OpenSM log messages ?
>>
>> The peer port on the other node is in the Initializing state.
>
> And that's an SDR switch port ?
>
>> Here is the tail of the opensm log file.
>>
>>
>> Feb 19 15:44:23 734840 [1C05CA90] 0x80 -> Entering DISCOVERING state
>> Feb 19 15:44:23 746070 [1C05CA90] 0x02 -> osm_vendor_bind: Binding to port 0x2c90300044fa9
>> Feb 19 15:44:23 773455 [1C05CA90] 0x02 -> osm_vendor_bind: Binding to port 0x2c90300044fa9
>> Feb 19 15:44:23 773501 [1C05CA90] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x0002c90300044fa9
>> Feb 19 15:44:24 574767 [41A72940] 0x01 -> umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping
>>                        Method 0x1, Attr 0x11, TID 0x140000123b, Hop Ptr: 0x0
>> Feb 19 15:44:24 574798 [41A72940] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,0, Return path  = 0,0
>> Feb 19 15:44:24 574811 [41A72940] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x123b
>> Using default GUID 0x2c90300044fa9
>> Entering MASTER state
>>
>> Feb 19 15:44:24 574879 [595F1940] 0x80 -> Entering MASTER state
>> SUBNET UP
>>
>> Feb 19 15:44:24 576233 [595F1940] 0x80 -> SUBNET UP
>> Feb 19 15:44:34 538093 [41A72940] 0x01 -> umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping
>>                        Method 0x1, Attr 0x11, TID 0x1400001240, Hop Ptr: 0x0
>> Feb 19 15:44:34 538114 [41A72940] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,0, Return path  = 0,0
>> Feb 19 15:44:34 538123 [41A72940] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1240
>> Feb 19 15:44:34 538853 [595F1940] 0x02 -> SUBNET UP
>> Feb 19 15:44:44 541415 [41A72940] 0x01 -> umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping
>>                        Method 0x1, Attr 0x11, TID 0x1400001244, Hop Ptr: 0x0
>> Feb 19 15:44:44 541434 [41A72940] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,0, Return path  = 0,0
>> Feb 19 15:44:44 541442 [41A72940] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1244
>
> Looks like the switch SMA is not responding ? Can you try some smpquery requests to it ?

Also, try rebooting that switch.

>
> Is this reproducible in this environment ?
>
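
To see at a glance which errors repeat in a log like the one quoted above, a small script can tally the ERR codes per reporting function. This is only a rough sketch: the line layout is inferred from the excerpt in this thread rather than from any OpenSM format specification, and summarize_errors, ERR_LINE and SAMPLE are names made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the log excerpt in this thread (an assumption, not a
# documented format): "<Mon DD HH:MM:SS> <usec> [<thread>] 0x<level> -> <func>: ERR <code>:"
ERR_LINE = re.compile(
    r"^(?P<stamp>\w{3} +\d+ \d\d:\d\d:\d\d) \d+ \[[0-9A-Fa-f]+\] "
    r"0x[0-9A-Fa-f]+ -> (?P<func>\w+): ERR (?P<code>[0-9A-Fa-f]+):"
)

def summarize_errors(lines):
    """Count ERR entries per (function, code) and note when each first appeared."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        # Drop any leading email quote markers ("> ", ">> ") before matching.
        match = ERR_LINE.match(line.lstrip("> ").rstrip())
        if match is None:
            continue  # continuation lines and non-error lines are ignored
        key = (match["func"], match["code"])
        counts[key] += 1
        first_seen.setdefault(key, match["stamp"])
    return counts, first_seen

# Three lines copied from the log quoted in this thread.
SAMPLE = [
    ">> Feb 19 15:44:24 574767 [41A72940] 0x01 -> umad_receiver: ERR 5411: "
    "DR SMP Send completed with error -- dropping",
    ">> Feb 19 15:44:24 574811 [41A72940] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: "
    "MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x123b",
    ">> Feb 19 15:44:34 538093 [41A72940] 0x01 -> umad_receiver: ERR 5411: "
    "DR SMP Send completed with error -- dropping",
]

if __name__ == "__main__":
    counts, first_seen = summarize_errors(SAMPLE)
    for (func, code), n in counts.most_common():
        print(f"{func} ERR {code}: {n} time(s), first at {first_seen[(func, code)]}")
```

Run against the full log file, this would show whether the 5411/3113 pair recurs on every sweep, which is what the excerpt suggests.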
