[ofa-general] Is response time of GMP more than SMP

Hal Rosenstock hrosenstock at xsigo.com
Fri Jun 27 06:15:01 PDT 2008


On Fri, 2008-06-27 at 17:01 +0530, Sumit Gaur - Sun Microsystem wrote:
> Find my answers below:-
> 
> Hal Rosenstock wrote:
> > Hi Sumit,
> > 
> > On Thu, 2008-06-26 at 12:07 +0530, Sumit Gaur - Sun Microsystem wrote:
> > 
> >>Hi Hal,
> >>
> >>Hal Rosenstock wrote:
> >>
> >>>>>>I am sending only request for
> >>>>>>
> >>>>>>	rpc.mgtclass = IB_PERFORMANCE_CLASS;
> >>>>>>	rpc.method = IB_MAD_METHOD_GET;
> >>>>>>
> >>>>>>at every one second.
> >>>
> >>>
> >>>Does perfquery work reliably with the same node(s) you are having
> >>>trouble with ?
> >>>
> >>>Does your app follow what perfquery does ?
> >>
> >>Yes, perfquery works fine and my implementation follows a similar approach. 
> >>Here is the output. I think the difference is in load: I am sending 4 GS 
> >>requests per second, and some pass while others time out (110) or fail in recv.
 
> > Can you elaborate on the multiple sends ? Are they outstanding
> > concurrently ? Are they to the same destination or different ones ? Are
> > they from a single or multiple threads ?
> No, they are sent sequentially (mutex-protected), with no concurrency, but the 
> timeout for umad_recv is 100 ms.

Can you try increasing that to see if there is some threshold where it
works more reliably ? Does it work better at, say, 200 msec (as you said
your rate was 4/sec) ? The default timeout used in the diags is 1 sec.

BTW, this could explain the timeouts but I'm not sure about the other
errors you mentioned.

>  Yes, they are all for the same destination, and all from a single 
> thread. I will point out again that when I configure the same thing for SMP, 
> there are no failures.
> > 
> > 
> >># perfquery
> >># Port counters: Lid 393 port 1
> >>PortSelect:......................1
> >>CounterSelect:...................0x0000
> >>SymbolErrors:....................0
> >>LinkRecovers:....................0
> >>LinkDowned:......................0
> >>RcvErrors:.......................0
> >>RcvRemotePhysErrors:.............0
> >>RcvSwRelayErrors:................0
> >>XmtDiscards:.....................0
> >>XmtConstraintErrors:.............0
> >>RcvConstraintErrors:.............0
> >>LinkIntegrityErrors:.............0
> >>ExcBufOverrunErrors:.............0
> >>VL15Dropped:.....................0
> >>XmtData:.........................65899728
> >>RcvData:.........................65899656
> >>XmtPkts:.........................915274
> >>RcvPkts:.........................915273
> >>
> >>
> 
> 
> > 
> > 
> > OK but you had said the received packet was corrupted. Maybe a nit, but
> > with timeout and other errors, the receive packet is invalid rather than
> > corrupted (an app shouldn't be looking at the response in the error
> > cases).
> > 
> > 
> >>>The underlying question is why are you getting the timeout relatively
> >>>frequently so I recommend checking all the error counters along the
> >>>path.
> >>
> >># Checking Ca: nodeguid 0x00144fa5e9ce001c
> >>Node check lid 392:  OK
> >>Error check on lid 392 (HCA-1) port all:  OK
> >
> > 
> > Is that the requester or responder ? It's not the entire path. Maybe the
> > simplest thing is: what does ibchecknet or ibcheckerrors say ?
> > 
> 
> I am using lid 106
> 
> [root at o4test65 tmp]# ibchecknet
> #warn: counter SymbolErrors = 65535     (threshold 10)
> Error check on lid 106 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
>   FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> #warn: counter LinkRecovers = 18        (threshold 10)
> Error check on lid 4 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
> FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> #warn: counter LinkRecovers = 19        (threshold 10)
> Error check on lid 9 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
> FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> #warn: counter LinkRecovers = 26        (threshold 10)
> #warn: counter LinkDowned = 13  (threshold 10)
> #warn: counter RcvErrors = 27   (threshold 10)
> Error check on lid 10 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
>   FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> #warn: counter LinkRecovers = 255       (threshold 10)
> Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
> FAILED
> #warn: counter SymbolErrors = 1968      (threshold 10)
> Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
> FAILED
> #warn: counter SymbolErrors = 1967      (threshold 10)
> Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port 15: 
> FAILED
> # Checked Switch: nodeguid 0x00144f0000a61390 with failure
> #warn: counter SymbolErrors = 65535     (threshold 10)
> Error check on lid 7 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
> FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> #warn: counter LinkDowned = 12  (threshold 10)
> Error check on lid 3 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
> FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> #warn: counter LinkRecovers = 15        (threshold 10)
> #warn: counter LinkDowned = 12  (threshold 10)
> Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
> FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> #warn: counter LinkRecovers = 15        (threshold 10)
> #warn: counter LinkDowned = 12  (threshold 10)
> Error check on lid 11 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
>   FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> Error check on lid 6 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
> FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> #warn: counter LinkRecovers = 255       (threshold 10)
> #warn: counter RcvErrors = 445  (threshold 10)
> Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
>   FAILED
> #warn: Logical link state is Initialize
> Port check lid 12 port 15:  FAILED
> # Checked Switch: nodeguid 0x00144f0000a61397 with failure
> #warn: Logical link state is Initialize
> Port check lid 12 port 14:  FAILED
> #warn: Logical link state is Initialize
> Port check lid 12 port 13:  FAILED
> #warn: counter LinkRecovers = 11        (threshold 10)
> Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port 13: 
> FAILED
> 
> # Checking Ca: nodeguid 0x00144fa5e9ce001c
> 
> # Checking Ca: nodeguid 0x00144fa5e9ce000c
> 
> # Checking Ca: nodeguid 0x00144fa5e9ce0014
> 
> # Checking Ca: nodeguid 0x00144fa5e9ce0004
> 
> ## Summary: 29 nodes checked, 0 bad nodes found
> ##          359 ports checked, 3 bad ports found
> ##          2 ports have errors beyond threshold
> 
> > In any case, based on your comments above about perfquery working
> > reliably, I'm skeptical whether this is the issue but it's best to rule
> > it out.
> [root at o4test65 tmp]# ibcheckerrors
> #warn: counter SymbolErrors = 65535     (threshold 10)
> Error check on lid 106 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
>   FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> #warn: counter LinkRecovers = 18        (threshold 10)
> Error check on lid 4 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
> FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> #warn: counter LinkRecovers = 19        (threshold 10)
> Error check on lid 9 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
> FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> #warn: counter LinkRecovers = 26        (threshold 10)
> #warn: counter LinkDowned = 13  (threshold 10)
> #warn: counter RcvErrors = 27   (threshold 10)
> Error check on lid 10 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
>   FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> #warn: counter LinkRecovers = 255       (threshold 10)
> Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
> FAILED
> #warn: counter SymbolErrors = 2081      (threshold 10)
> Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
> FAILED
> #warn: counter SymbolErrors = 2080      (threshold 10)
> Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port 15: 
> FAILED
> # Checked Switch: nodeguid 0x00144f0000a61390 with failure
> #warn: counter SymbolErrors = 65535     (threshold 10)
> Error check on lid 7 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
> FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> #warn: counter LinkDowned = 12  (threshold 10)
> Error check on lid 3 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
> FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> #warn: counter LinkRecovers = 15        (threshold 10)
> #warn: counter LinkDowned = 12  (threshold 10)
> Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
> FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> #warn: counter LinkRecovers = 15        (threshold 10)
> #warn: counter LinkDowned = 12  (threshold 10)
> Error check on lid 11 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
>   FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> Error check on lid 6 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
> FAILED
> #warn: counter SymbolErrors = 65535     (threshold 10)
> #warn: counter LinkRecovers = 255       (threshold 10)
> #warn: counter RcvErrors = 445  (threshold 10)
> Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
>   FAILED
> #warn: counter LinkRecovers = 11        (threshold 10)
> Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port 13: 
> FAILED
> # Checked Switch: nodeguid 0x00144f0000a61397 with failure
> 
> ## Summary: 29 nodes checked, 0 bad nodes found
> ##          359 ports checked, 2 ports have errors beyond threshold

Looks like there are some issues here to debug in your subnet. It might
help to clear the counters and see what is actively going on to isolate
these issues. This could factor into those other errors you are seeing.

-- Hal

> >>>Are you sure the request gets to the responder ? Does the responder
> >>>respond and it doesn't make it back ?
> >>
> >>Yes. As I said, it is not 100% failure; it is 30% to 40% failure. But why ?
> > 
> > 
> > I don't know enough about what is different about your app yet to say
> > more right now.
> > 
> > -- Hal
> > 
> > 
> >>>-- Hal
> >>>
> >>>
> > 
> > 
