[ofa-general] Is response time of GMP is more than SMP

Thu Jun 26 06:09:54 PDT 2008

Hi Sumit,

On Thu, 2008-06-26 at 12:07 +0530, Sumit Gaur - Sun Microsystem wrote:
> Hi Hal,
> 
> Hal Rosenstock wrote:
> >
> >>>>
> >>>>I am sending only requests for
> >>>>
> >>>>	rpc.mgtclass = IB_PERFORMANCE_CLASS;
> >>>>	rpc.method = IB_MAD_METHOD_GET;
> >>>>
> >>>>once every second.
> > 
> > 
> > Does perfquery work reliably with the same node(s) you are having
> > trouble with ?
> > 
> > Does your app follow what perfquery does ?
> 
> Yes, perfquery works fine and my implementation follows a similar approach.
> Here is the output. I think the difference is the load: I am sending 4 GS
> requests per second; some succeed and some time out (110) or fail on recv.

Can you elaborate on the multiple sends ? Are they outstanding
concurrently ? Are they to the same destination or different ones ? Are
they from a single or multiple threads ?

> # perfquery
> # Port counters: Lid 393 port 1
> PortSelect:......................1
> CounterSelect:...................0x0000
> SymbolErrors:....................0
> LinkRecovers:....................0
> LinkDowned:......................0
> RcvErrors:.......................0
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................0
> XmtDiscards:.....................0
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................0
> XmtData:.........................65899728
> RcvData:.........................65899656
> XmtPkts:.........................915274
> RcvPkts:.........................915273
> 
> > 
> >>>>>In general, there are a few possibilities (which can cause this). SM
> >>>>>traffic is VL15 whereas GS traffic is on a data VL (usually VL0 in most
> >>>>>subnets).
> >>>>>
> >>>>>Some possibilities are:
> >>>>>1. Timeout/retry being hit for some GS traffic (GS request or response
> >>>>>lost/corrupted)
> >>>>
> >>>>Yes, this is also happening. Sometimes I am getting corrupt data back,
> >>>
> >>>
> >>>Is there an error indicated ?
> >>
> >>For such packets I am getting umad_status as 110.
> > 
> > 
> > That's ETIMEDOUT. You need to handle the errors (and not treat the
> > receive as a valid packet). Are you doing that ?
> 
> yes, I am catching this error.

OK but you had said the received packet was corrupted. Maybe a nit, but
with timeout and other errors, the receive packet is invalid rather than
corrupted (an app shouldn't be looking at the response in the error
cases).
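
A minimal sketch of that check, in case it helps. Everything here is
hypothetical naming for illustration: `status` stands in for whatever value
your app obtains from umad_status for the received MAD, and
classify_mad_status/take_response are not part of any library. The only
point is that the payload is copied out solely when the status is 0.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical outcome categories for one GS request. */
enum mad_outcome { MAD_OK, MAD_TIMEOUT, MAD_ERROR };

/*
 * Classify the status reported for a received MAD.
 * Assumption: 0 means success and a timeout is reported as ETIMEDOUT
 * (110 on Linux), as seen in this thread.
 */
enum mad_outcome classify_mad_status(int status)
{
    if (status == 0)
        return MAD_OK;
    if (status == ETIMEDOUT)
        return MAD_TIMEOUT;
    return MAD_ERROR;
}

/*
 * Copy the response payload to `out` only when the status is success.
 * Returns 1 if `out` now holds a valid response, 0 otherwise. On any
 * error status the receive buffer is never read.
 */
int take_response(int status, const unsigned char *buf, size_t len,
                  unsigned char *out, size_t out_len)
{
    assert(buf != NULL && out != NULL);

    if (classify_mad_status(status) != MAD_OK)
        return 0;           /* timeout or other error: payload is invalid */
    if (len > out_len)
        return 0;           /* would not fit: treat as unusable */

    memcpy(out, buf, len);
    return 1;
}
```

The caller would count a 0 return as a failed query (and perhaps retry)
rather than decoding counters from the buffer.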

> > The underlying question is why are you getting the timeout relatively
> > frequently so I recommend checking all the error counters along the
> > path.
> 
> # Checking Ca: nodeguid 0x00144fa5e9ce001c
> Node check lid 392:  OK
> Error check on lid 392 (HCA-1) port all:  OK

Is that the requester or responder ? It's not the entire path. Maybe the
simplest thing is: what does ibchecknet or ibcheckerrors say ?

In any case, based on your comments above about perfquery working
reliably, I'm skeptical whether this is the issue but it's best to rule
it out.

> > Are you sure the request gets to the responder ? Does the responder
> > respond and it doesn't make it back ?
> 
> Yes. As I said, it is not a 100% failure; it is a 30% to 40% failure. But why ?

I don't know enough about what is different about your app yet to say
more right now.

-- Hal

> > -- Hal
> > 
> >