[ofa-general] Is response time of GMP is more than SMP
Hal Rosenstock
hrosenstock at xsigo.com
Thu Jun 26 06:09:54 PDT 2008
Hi Sumit,
On Thu, 2008-06-26 at 12:07 +0530, Sumit Gaur - Sun Microsystem wrote:
> Hi Hal,
>
> Hal Rosenstock wrote:
> >
> >>>>
> >>>>I am only sending requests for
> >>>>
> >>>> rpc.mgtclass = IB_PERFORMANCE_CLASS;
> >>>> rpc.method = IB_MAD_METHOD_GET;
> >>>>
> >>>>once every second.
> >
> >
> > Does perfquery work reliably with the same node(s) you are having
> > trouble with ?
> >
> > Does your app follow what perfquery does ?
>
> Yes, perfquery works fine and my implementation follows a similar approach. Here
> is the output. I think the difference is in the load: I am sending 4 GS requests
> per second, and some succeed while others time out (110) or fail on receive.
Can you elaborate on the multiple sends ? Are they outstanding
concurrently ? Are they to the same destination or different ones ? Are
they from a single or multiple threads ?
> # perfquery
> # Port counters: Lid 393 port 1
> PortSelect:......................1
> CounterSelect:...................0x0000
> SymbolErrors:....................0
> LinkRecovers:....................0
> LinkDowned:......................0
> RcvErrors:.......................0
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................0
> XmtDiscards:.....................0
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................0
> XmtData:.........................65899728
> RcvData:.........................65899656
> XmtPkts:.........................915274
> RcvPkts:.........................915273
>
> >
> >>>>>In general, there are a few possibilities (which can cause this). SM
> >>>>>traffic is VL15 whereas GS traffic is on a data VL (usually VL0 in most
> >>>>>subnets).
> >>>>>
> >>>>>Some possibilities are:
> >>>>>1. Timeout/retry being hit for some GS traffic (GS request or response
> >>>>>lost/corrupted)
> >>>>
> >>>>Yes, this is also happening. Sometimes I am getting corrupt data back,
> >>>
> >>>
> >>>Is there an error indicated ?
> >>
> >>For such packets I am getting umad_status as 110.
> >
> >
> > That's ETIMEDOUT. You need to handle the errors (and not treat the
> > receive as a valid packet). Are you doing that ?
>
> Yes, I am catching this error.
OK but you had said the received packet was corrupted. Maybe a nit, but
with timeout and other errors, the receive packet is invalid rather than
corrupted (an app shouldn't be looking at the response in the error
cases).
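
A minimal sketch of that rule in C, in case it helps. The names
check_response and resp_result are made up for illustration and are not
part of libibumad; the status argument is assumed to be whatever your app
gets back for the completed request (for example, from umad_status()).
The point is only that the receive buffer is handed on for parsing when
the status is 0 and never otherwise.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch; not part of libibumad. */
enum resp_result { RESP_OK = 0, RESP_TIMEOUT, RESP_ERROR };

/*
 * Decide whether a response buffer may be parsed, given the status
 * reported for the completed request. The buffer is passed back
 * through *out only when the status is 0; on any error *out is NULL,
 * so the caller cannot mistake an invalid receive for real data.
 */
static enum resp_result check_response(int status, const uint8_t *buf,
                                       const uint8_t **out)
{
    *out = NULL;                 /* never expose the buffer on error */
    if (status == 0) {
        *out = buf;              /* valid response, safe to parse */
        return RESP_OK;
    }
    if (status == ETIMEDOUT)     /* 110 on Linux */
        return RESP_TIMEOUT;
    return RESP_ERROR;           /* any other non-zero status */
}
```

With something like this, a timeout can be counted or retried by the
caller, and the counters are decoded only on RESP_OK.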
> > The underlying question is why are you getting the timeout relatively
> > frequently so I recommend checking all the error counters along the
> > path.
>
> # Checking Ca: nodeguid 0x00144fa5e9ce001c
> Node check lid 392: OK
> Error check on lid 392 (HCA-1) port all: OK
Is that the requester or responder ? It's not the entire path. Maybe the
simplest thing is: what does ibchecknet or ibcheckerrors say ?
In any case, based on your comments above about perfquery working
reliably, I'm skeptical whether this is the issue but it's best to rule
it out.
> > Are you sure the request gets to the responder ? Does the responder
> > respond and it doesn't make it back ?
>
> Yes. As I said, it is not a 100% failure; it is a 30% to 40% failure. But why?
I don't know enough about what is different about your app yet to say
more right now.
-- Hal
> > -- Hal
> >
> >