[ofa-general] Is response time of GMP is more than SMP

Wed Jun 25 23:37:58 PDT 2008

Hi Hal,

Hal Rosenstock wrote:
>
>>>>
>>>>I am sending only request for
>>>>
>>>>	rpc.mgtclass = IB_PERFORMANCE_CLASS;
>>>>	rpc.method = IB_MAD_METHOD_GET;
>>>>
>>>>at every one second.
> 
> 
> Does perfquery work reliably with the same node(s) you are having
> trouble with ?
> 
> Does your app follow what perfquery does ?

Yes, perfquery works fine, and my implementation follows a similar approach. Here 
is the output. I think the difference is in the load: I am sending 4 GS requests 
per second; some succeed and some either time out (status 110) or fail on receive.

# perfquery
# Port counters: Lid 393 port 1
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrors:....................0
LinkRecovers:....................0
LinkDowned:......................0
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................0
XmtDiscards:.....................0
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtData:.........................65899728
RcvData:.........................65899656
XmtPkts:.........................915274
RcvPkts:.........................915273

> 
>>>>>In general, there are a few possibilities (which can cause this). SM
>>>>>traffic is VL15 whereas GS traffic is on a data VL (usually VL0 in most
>>>>>subnets).
>>>>>
>>>>>Some possibilities are:
>>>>>1. Timeout/retry being hit for some GS traffic (GS request or response
>>>>>lost/corrupted)
>>>>
>>>>Yes, this is also happening, Sometimes I am getting corrupt data back,
>>>
>>>
>>>Is there an error indicated ?
>>
>>For such packets I am getting umad_status as 110.
> 
> 
> That's ETIMEDOUT. You need to handle the errors (and not treat the
> receive as a valid packet). Are you doing that ?

Yes, I am catching this error.
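
For reference, a minimal sketch of the kind of check I mean. It does not call
libibumad itself. The names classify_mad, record_outcome, failure_rate and
struct mad_stats are hypothetical, made up for illustration. I assume recv_rc is
the return value of the receive call (negative when the receive itself fails)
and status is the value reported by umad_status for that MAD. Only MAD_OK
results should be parsed as a valid packet.

```c
#include <errno.h>

/* Outcome of one GS request as seen by the requester (hypothetical names). */
enum mad_outcome { MAD_OK, MAD_TIMEOUT, MAD_RECV_FAILED, MAD_OTHER_ERROR };

/* Per-outcome counters, used to measure how often requests fail. */
struct mad_stats {
    unsigned long sent, ok, timeout, recv_failed, other;
};

/* Classify one completed request.
 * recv_rc: assumed return value of the receive call (< 0 means it failed).
 * status:  assumed MAD status value; ETIMEDOUT is 110 on Linux. */
static enum mad_outcome classify_mad(int recv_rc, int status)
{
    if (recv_rc < 0)
        return MAD_RECV_FAILED;   /* nothing was received at all */
    if (status == ETIMEDOUT)
        return MAD_TIMEOUT;       /* buffer is NOT a valid response */
    if (status != 0)
        return MAD_OTHER_ERROR;   /* any other error: do not parse */
    return MAD_OK;                /* safe to decode the payload */
}

/* Update the counters for one classified request. */
static void record_outcome(struct mad_stats *s, enum mad_outcome o)
{
    s->sent++;
    switch (o) {
    case MAD_OK:          s->ok++;          break;
    case MAD_TIMEOUT:     s->timeout++;     break;
    case MAD_RECV_FAILED: s->recv_failed++; break;
    default:              s->other++;       break;
    }
}

/* Fraction of requests that did not yield a valid response. */
static double failure_rate(const struct mad_stats *s)
{
    return s->sent ? (double)(s->sent - s->ok) / (double)s->sent : 0.0;
}
```

Keeping the timeout and receive-failure counts separate should make it easier to
tell whether the 110 status or the failed receive dominates under load.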

> 
> The underlying question is why are you getting the timeout relatively
> frequently so I recommend checking all the error counters along the
> path.

# Checking Ca: nodeguid 0x00144fa5e9ce001c
Node check lid 392:  OK
Error check on lid 392 (HCA-1) port all:  OK

> 
> Are you sure the request gets to the responder ? Does the responder
> respond and it doesn't make it back ?

Yes. As I said, it is not a 100% failure; about 30% to 40% of the requests fail. But why?

> 
> -- Hal
> 
>