[ofa-general] Re: OpenSM "stuck" - user level MAD library seems to be timing out

Sasha Khapyorsky sashak at voltaire.com
Sat Aug 4 12:35:13 PDT 2007


Hi Lan,

On 12:33 Thu 02 Aug     , lbt wrote:
> Hi Sasha,
> 
> I am hitting a problem where the user level MAD library seems to be timing
> out, causing the ports to be stuck in "INIT" state because the subnet has no
> "Master" SM available. The system is still in this state, so if there are
> any suggestions on what other type of debug info I could collect or clues to
> what the problem might be, it would be much appreciated :)
> 
> I have 3 machines (OFED 1.1. stack, OpenSM v2.0.5), where 2 of them are
> running OpenSM, connected by an IB switch. Several tests were being done
> pulling IB cables, but not touching at all the IB connections between the
> Master SM and the IB switch, or rebooting the IB switch (i.e. no SM
> migration should be occurring). Everything was working fine until, at one
> point, I pulled the IB cable of the lower priority (standby) SM at the IB
> switch. For some reason, this started causing problems on the higher priority
> Master SM. The higher priority SM now thinks it is in Standby state, and the
> lower priority SM's MAD packets are timing out. This is odd because I would
> not expect any effect on the higher priority SM (as its IB connections are
> not being affected). I am also not sure why MAD packets are timing out on the
> lower priority SM. Rebooting the lower priority SM and replugging IB cables
> into different ports on the IB switch didn't help.

Is the problem reproducible, or did it happen randomly?

> Lower priority SM: (packets timeout)
> [root@vortex3l-83 ~]# sminfo -d -e -P 1
> ibwarn: [26764] smp_query: attr 21 mod 0 route DR path [0]
> ibwarn: [26764] mad_rpc: data offs 64 sz 64
> mad data
> 0000 0000 0000 0000 fe80 0000 0000 0000
> 0003 0001 0251 0a6a 0000 0000 0103 0302
> 1252 0011 4040 0008 0804 ff40 0000 0000
> 0000 2012 1088 0000 0000 0000 0000 0000
> ibwarn: [26764] smp_query: attr 32 mod 0 route Lid 1

It is possible that the Master SM dropped the routing to the LID 1 node
(which was disconnected some time before the Master became Standby). I
suppose sminfo using a direct path should still work.

Sasha

> ibwarn: [26764] _do_madrpc: retry 1 (timeout 1000 ms)
> ibwarn: [26764] _do_madrpc: retry 2 (timeout 1000 ms)
> ibwarn: [26764] _do_madrpc: timeout after 3 retries, 3000 ms
> sminfo: iberror: [pid 26764] main: failed: query
> 
> Higher priority SM: (thinks it's Standby now)
> [root@vortex3l-84 log]# sminfo -d -e -P 1
> ibwarn: [2487] smp_query: attr 21 mod 0 route DR path [0]
> ibwarn: [2487] mad_rpc: data offs 64 sz 64
> mad data
> 0000 0000 0000 0000 fe80 0000 0000 0000
> 0002 0003 0251 0a6a 0000 0000 0103 0302
> 1252 0011 4040 0008 0804 ff40 0000 0000
> 0000 2012 1088 0000 0000 0000 0000 0000
> ibwarn: [2487] smp_query: attr 32 mod 0 route Lid 3
> ibwarn: [2487] mad_rpc: data offs 64 sz 64
> mad data
> 0050 4501 4a3a 0001 0000 0000 0000 0000
> 0000 020e 0200 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> sminfo: sm lid 3 sm guid 0x5045014a3a0001, activity count 526 priority 0
> state 2 SMINFO_STANDBY
> 
> Just another data point: each machine has 2 HCA ports, port 1 and port 2.
> Port 1 is connected to a different subnet than port 2. During all these
> steps, the port 2 subnet is still fine and working OK. The problem
> described above was being seen on the port 1 subnet only.
> 
> Thanks!
> Lan
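
As a cross-check of the SMInfo reply (attr 32) in the vortex3l-84 trace, the
64-byte payload can be decoded by hand. Below is a minimal Python sketch,
assuming the standard SMInfo layout (GUID, SM_Key, ActCount, then one byte
with the priority in the high nibble and the SM state in the low nibble).
parse_sminfo and the STATES table are illustrative names, not libibmad API;
only state 2 (standby) is confirmed by the trace itself.

```python
import struct

# SMInfo payload copied from the "sminfo -d" output on vortex3l-84.
DUMP = """
0050 4501 4a3a 0001 0000 0000 0000 0000
0000 020e 0200 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
"""

# Assumed SM state encoding; the trace only shows 2 == SMINFO_STANDBY.
STATES = {0: "not active", 1: "discovering", 2: "standby", 3: "master"}


def parse_sminfo(dump):
    """Decode an SMInfo attribute from an ibwarn-style hex dump (hypothetical helper)."""
    raw = bytes.fromhex("".join(dump.split()))
    # GUID (64 bits), SM_Key (64), ActCount (32), priority/state byte (8),
    # in network byte order.
    guid, sm_key, act_count, prio_state = struct.unpack_from(">QQIB", raw, 0)
    return {
        "guid": guid,
        "sm_key": sm_key,
        "act_count": act_count,
        "priority": prio_state >> 4,   # high nibble
        "state": prio_state & 0x0F,    # low nibble
    }


sm = parse_sminfo(DUMP)
print("guid 0x%x, activity count %d, priority %d, state %d (%s)" % (
    sm["guid"], sm["act_count"], sm["priority"], sm["state"],
    STATES.get(sm["state"], "unknown")))
```

The decoded values (guid 0x5045014a3a0001, activity count 526, priority 0,
state 2) agree with the summary line that sminfo printed.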


