[ofa-general] Is response time of GMP more than SMP
Sumit Gaur - Sun Microsystem
Sumit.Gaur at Sun.COM
Wed Jun 25 23:37:58 PDT 2008
Hi Hal,
Hal Rosenstock wrote:
>
>>>>
>>>>I am sending only request for
>>>>
>>>> rpc.mgtclass = IB_PERFORMANCE_CLASS;
>>>> rpc.method = IB_MAD_METHOD_GET;
>>>>
>>>>at every one second.
>
>
> Does perfquery work reliably with the same node(s) you are having
> trouble with ?
>
> Does your app follow what perfquery does ?
Yes, perfquery works fine, and my implementation follows a similar approach. Here
is the output. I think the difference is the load: I am sending 4 GS requests
per second; some succeed and some fail with a timeout (110) or a failed receive.
# perfquery
# Port counters: Lid 393 port 1
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrors:....................0
LinkRecovers:....................0
LinkDowned:......................0
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................0
XmtDiscards:.....................0
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtData:.........................65899728
RcvData:.........................65899656
XmtPkts:.........................915274
RcvPkts:.........................915273
>
>>>>>In general, there are a few possibilities (which can cause this). SM
>>>>>traffic is VL15 whereas GS traffic is on a data VL (usually VL0 in most
>>>>>subnets).
>>>>>
>>>>>Some possibilities are:
>>>>>1. Timeout/retry being hit for some GS traffic (GS request or response
>>>>>lost/corrupted)
>>>>
>>>>Yes, this is also happening, Sometimes I am getting corrupt data back,
>>>
>>>
>>>Is there an error indicated ?
>>
>>For such packets I am getting umad_status as 110.
>
>
> That's ETIMEDOUT. You need to handle the errors (and not treat the
> receive as a valid packet). Are you doing that ?
Yes, I am catching this error.
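For reference, the error handling being discussed can be sketched roughly as below. This is only a minimal illustration, not my actual code: gs_query_fn, gs_query_with_retry and flaky_query are hypothetical names, and the callback stands in for whatever send/receive path the application uses. The only assumption taken from this thread is that a timed-out request reports status 110 (ETIMEDOUT) and that the reply buffer must not be used in that case.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical transport callback: sends one GS request and waits for the
 * reply. Returns 0 on success or a positive errno-style status such as
 * ETIMEDOUT (110 on Linux), like the umad_status value seen here. */
typedef int (*gs_query_fn)(void *ctx, void *reply, size_t reply_len);

/* Retry a query up to max_tries times on timeout. The reply buffer is
 * only meaningful when this function returns 0. */
static int gs_query_with_retry(gs_query_fn query, void *ctx,
                               void *reply, size_t reply_len,
                               int max_tries, int *tries_used)
{
    int status = ETIMEDOUT;
    int attempt = 0;

    while (attempt < max_tries) {
        attempt++;
        status = query(ctx, reply, reply_len);
        if (status != ETIMEDOUT)
            break;              /* success, or an error we do not retry */
    }
    if (tries_used)
        *tries_used = attempt;
    return status;
}

/* Stand-in for a lossy path (not a real transport): times out a fixed
 * number of times, then succeeds and writes one byte of reply data. */
struct flaky_ctx { int failures_left; };

static int flaky_query(void *ctx, void *reply, size_t reply_len)
{
    struct flaky_ctx *f = ctx;

    if (f->failures_left > 0) {
        f->failures_left--;
        return ETIMEDOUT;
    }
    if (reply_len > 0)
        ((unsigned char *)reply)[0] = 0x2a;
    return 0;
}
```

The caller only looks at the reply when the return value is 0, and the number of tries used gives a rough measure of how lossy the path is.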
>
> The underlying question is why are you getting the timeout relatively
> frequently so I recommend checking all the error counters along the
> path.
# Checking Ca: nodeguid 0x00144fa5e9ce001c
Node check lid 392: OK
Error check on lid 392 (HCA-1) port all: OK
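If the same check has to be done programmatically for each port along the path, a rough sketch could look like the following. It assumes the counters have already been read into a plain struct; the struct and function names are hypothetical, and the fields simply mirror the error counters in the perfquery output earlier in this mail.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the error counters printed by perfquery. */
struct port_error_counters {
    uint32_t symbol_errors;
    uint32_t link_recovers;
    uint32_t link_downed;
    uint32_t rcv_errors;
    uint32_t rcv_remote_phys_errors;
    uint32_t rcv_sw_relay_errors;
    uint32_t xmt_discards;
    uint32_t xmt_constraint_errors;
    uint32_t rcv_constraint_errors;
    uint32_t link_integrity_errors;
    uint32_t exc_buf_overrun_errors;
    uint32_t vl15_dropped;
};

/* Return how many error counters are nonzero; 0 means the port looks clean. */
static int count_nonzero_errors(const struct port_error_counters *c)
{
    const uint32_t values[] = {
        c->symbol_errors, c->link_recovers, c->link_downed,
        c->rcv_errors, c->rcv_remote_phys_errors, c->rcv_sw_relay_errors,
        c->xmt_discards, c->xmt_constraint_errors, c->rcv_constraint_errors,
        c->link_integrity_errors, c->exc_buf_overrun_errors, c->vl15_dropped,
    };
    int nonzero = 0;

    for (size_t i = 0; i < sizeof values / sizeof values[0]; i++) {
        if (values[i] != 0)
            nonzero++;
    }
    return nonzero;
}
```

Running such a check before and after a batch of requests would show whether any counter moves while the timeouts occur.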
>
> Are you sure the request gets to the responder ? Does the responder
> respond and it doesn't make it back ?
Yes. As I said, it is not a 100% failure; about 30% to 40% of requests fail. But why?
>
> -- Hal
>
>