[ofa-general] Is response time of GMP more than SMP?

Sumit Gaur - Sun Microsystem Sumit.Gaur at Sun.COM
Fri Jun 27 04:31:52 PDT 2008


Please find my answers below:

Hal Rosenstock wrote:
> Hi Sumit,
> 
> On Thu, 2008-06-26 at 12:07 +0530, Sumit Gaur - Sun Microsystem wrote:
> 
>>Hi Hal,
>>
>>Hal Rosenstock wrote:
>>
>>>>>>I am sending only request for
>>>>>>
>>>>>>	rpc.mgtclass = IB_PERFORMANCE_CLASS;
>>>>>>	rpc.method = IB_MAD_METHOD_GET;
>>>>>>
>>>>>>at every one second.
>>>
>>>
>>>Does perfquery work reliably with the same node(s) you are having
>>>trouble with ?
>>>
>>>Does your app follow what perfquery does ?
>>
>>Yes, perfquery works fine, and my implementation follows a similar approach. Here
>>is the output. I think the difference is in the load: I am sending 4 GS requests
>>per second, and some succeed while others time out (110) or fail on receive.
> 
> 
> Can you elaborate on the multiple sends ? Are they outstanding
> concurrently ? Are they to the same destination or different ones ? Are
> they from a single or multiple threads ?
No, they are sent sequentially (serialized with a mutex), so there is no concurrency; 
the timeout for umad_recv is 100 ms. Yes, they all go to the same destination, and they 
are all sent from a single thread. I would point out again that with the same 
configuration using SMPs there are no failures.
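
For clarity, here is a minimal sketch of the loop described above. It is illustrative 
only (the helper name query_counters, the destination LID 393, and the use of 
libibmad's pma_query_via in place of the application's actual umad_send/umad_recv 
calls are assumptions, not the real code): one PerfMgt (GS) GET at a time from a 
single thread, serialized with a mutex, 100 ms timeout, roughly 4 requests per second.

    /* Illustrative sketch only, not the actual application code. */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <infiniband/mad.h>

    static pthread_mutex_t query_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Issue one PortCounters GET to dlid/port; return -1 on timeout or
     * receive failure, 0 on success. */
    static int query_counters(struct ibmad_port *srcport, int dlid, int port)
    {
            uint8_t pc[1024];
            ib_portid_t portid = { 0 };
            int rc = 0;

            portid.lid = dlid;

            pthread_mutex_lock(&query_lock);
            if (!pma_query_via(pc, &portid, port, 100 /* ms timeout */,
                               IB_GSI_PORT_COUNTERS, srcport))
                    rc = -1;
            pthread_mutex_unlock(&query_lock);
            return rc;
    }

    int main(void)
    {
            int classes[] = { IB_PERFORMANCE_CLASS };
            struct ibmad_port *srcport = mad_rpc_open_port(NULL, 0, classes, 1);
            int i;

            if (!srcport)
                    return 1;

            for (;;) {
                    for (i = 0; i < 4; i++)          /* 4 GS requests ... */
                            if (query_counters(srcport, 393, 1) < 0)
                                    fprintf(stderr, "query timed out or failed\n");
                    sleep(1);                        /* ... per second */
            }
            /* not reached */
            mad_rpc_close_port(srcport);
            return 0;
    }
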
> 
> 
>># perfquery
>># Port counters: Lid 393 port 1
>>PortSelect:......................1
>>CounterSelect:...................0x0000
>>SymbolErrors:....................0
>>LinkRecovers:....................0
>>LinkDowned:......................0
>>RcvErrors:.......................0
>>RcvRemotePhysErrors:.............0
>>RcvSwRelayErrors:................0
>>XmtDiscards:.....................0
>>XmtConstraintErrors:.............0
>>RcvConstraintErrors:.............0
>>LinkIntegrityErrors:.............0
>>ExcBufOverrunErrors:.............0
>>VL15Dropped:.....................0
>>XmtData:.........................65899728
>>RcvData:.........................65899656
>>XmtPkts:.........................915274
>>RcvPkts:.........................915273
>>
>>
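
For reference, the counters above can be read programmatically much as perfquery does, 
by decoding the PortCounters response buffer. A minimal sketch, assuming the pc buffer 
returned by pma_query_via() as in the loop sketched earlier (the helper name 
dump_port_counters is illustrative):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <infiniband/mad.h>

    /* Illustrative: decode a few of the PortCounters fields shown above
     * from the buffer filled in by pma_query_via(). */
    static void dump_port_counters(uint8_t *pc)
    {
            uint32_t val;

            mad_decode_field(pc, IB_PC_ERR_SYM_F, &val);
            printf("SymbolErrors: %" PRIu32 "\n", val);
            mad_decode_field(pc, IB_PC_XMT_BYTES_F, &val);
            printf("XmtData:      %" PRIu32 "\n", val);
            mad_decode_field(pc, IB_PC_RCV_BYTES_F, &val);
            printf("RcvData:      %" PRIu32 "\n", val);
            mad_decode_field(pc, IB_PC_XMT_PKTS_F, &val);
            printf("XmtPkts:      %" PRIu32 "\n", val);
            mad_decode_field(pc, IB_PC_RCV_PKTS_F, &val);
            printf("RcvPkts:      %" PRIu32 "\n", val);
    }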


> 
> 
> OK but you had said the received packet was corrupted. Maybe a nit, but
> with timeout and other errors, the receive packet is invalid rather than
> corrupted (an app shouldn't be looking at the response in the error
> cases).
> 
> 
>>>The underlying question is why are you getting the timeout relatively
>>>frequently so I recommend checking all the error counters along the
>>>path.
>>
>># Checking Ca: nodeguid 0x00144fa5e9ce001c
>>Node check lid 392:  OK
>>Error check on lid 392 (HCA-1) port all:  OK
>
> 
> Is that the requester or responder ? It's not the entire path. Maybe the
> simplest thing is: what does ibchecknet or ibcheckerrors say ?
> 

I am using LID 106.

[root at o4test65 tmp]# ibchecknet
#warn: counter SymbolErrors = 65535     (threshold 10)
Error check on lid 106 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
  FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 18        (threshold 10)
Error check on lid 4 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 19        (threshold 10)
Error check on lid 9 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 26        (threshold 10)
#warn: counter LinkDowned = 13  (threshold 10)
#warn: counter RcvErrors = 27   (threshold 10)
Error check on lid 10 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
  FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 255       (threshold 10)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
FAILED
#warn: counter SymbolErrors = 1968      (threshold 10)
Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
FAILED
#warn: counter SymbolErrors = 1967      (threshold 10)
Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port 15: 
FAILED
# Checked Switch: nodeguid 0x00144f0000a61390 with failure
#warn: counter SymbolErrors = 65535     (threshold 10)
Error check on lid 7 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkDowned = 12  (threshold 10)
Error check on lid 3 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 15        (threshold 10)
#warn: counter LinkDowned = 12  (threshold 10)
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 15        (threshold 10)
#warn: counter LinkDowned = 12  (threshold 10)
Error check on lid 11 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
  FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
Error check on lid 6 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 255       (threshold 10)
#warn: counter RcvErrors = 445  (threshold 10)
Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
  FAILED
#warn: Logical link state is Initialize
Port check lid 12 port 15:  FAILED
# Checked Switch: nodeguid 0x00144f0000a61397 with failure
#warn: Logical link state is Initialize
Port check lid 12 port 14:  FAILED
#warn: Logical link state is Initialize
Port check lid 12 port 13:  FAILED
#warn: counter LinkRecovers = 11        (threshold 10)
Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port 13: 
FAILED

# Checking Ca: nodeguid 0x00144fa5e9ce001c

# Checking Ca: nodeguid 0x00144fa5e9ce000c

# Checking Ca: nodeguid 0x00144fa5e9ce0014

# Checking Ca: nodeguid 0x00144fa5e9ce0004

## Summary: 29 nodes checked, 0 bad nodes found
##          359 ports checked, 3 bad ports found
##          2 ports have errors beyond threshold

> In any case, based on your comments above about perfquery working
> reliably, I'm skeptical whether this is the issue but it's best to rule
> it out.
[root at o4test65 tmp]# ibcheckerrors
#warn: counter SymbolErrors = 65535     (threshold 10)
Error check on lid 106 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
  FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 18        (threshold 10)
Error check on lid 4 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 19        (threshold 10)
Error check on lid 9 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 26        (threshold 10)
#warn: counter LinkDowned = 13  (threshold 10)
#warn: counter RcvErrors = 27   (threshold 10)
Error check on lid 10 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
  FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 255       (threshold 10)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
FAILED
#warn: counter SymbolErrors = 2081      (threshold 10)
Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
FAILED
#warn: counter SymbolErrors = 2080      (threshold 10)
Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port 15: 
FAILED
# Checked Switch: nodeguid 0x00144f0000a61390 with failure
#warn: counter SymbolErrors = 65535     (threshold 10)
Error check on lid 7 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkDowned = 12  (threshold 10)
Error check on lid 3 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 15        (threshold 10)
#warn: counter LinkDowned = 12  (threshold 10)
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 15        (threshold 10)
#warn: counter LinkDowned = 12  (threshold 10)
Error check on lid 11 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
  FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
Error check on lid 6 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 255       (threshold 10)
#warn: counter RcvErrors = 445  (threshold 10)
Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port all: 
  FAILED
#warn: counter LinkRecovers = 11        (threshold 10)
Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port 13: 
FAILED
# Checked Switch: nodeguid 0x00144f0000a61397 with failure

## Summary: 29 nodes checked, 0 bad nodes found
##          359 ports checked, 2 ports have errors beyond threshold


> 
> 
>>>Are you sure the request gets to the responder ? Does the responder
>>>respond and it doesn't make it back ?
>>
>>Yes. As I said, it is not a 100% failure; it is a 30% to 40% failure rate. But why?
> 
> 
> I don't know enough about what is different about your app yet to say
> more right now.
> 
> -- Hal
> 
> 
>>>-- Hal
>>>
>>>
> 
> 


