[ofa-general] Is response time of GMP is more than SMP
Sumit Gaur - Sun Microsystem
Sumit.Gaur at Sun.COM
Fri Jun 27 04:31:52 PDT 2008
Please find my answers below.
Hal Rosenstock wrote:
> Hi Sumit,
>
> On Thu, 2008-06-26 at 12:07 +0530, Sumit Gaur - Sun Microsystem wrote:
>
>>Hi Hal,
>>
>>Hal Rosenstock wrote:
>>
>>>>>>I am sending only requests with
>>>>>>
>>>>>> rpc.mgtclass = IB_PERFORMANCE_CLASS;
>>>>>> rpc.method = IB_MAD_METHOD_GET;
>>>>>>
>>>>>>once every second.
>>>
>>>
>>>Does perfquery work reliably with the same node(s) you are having
>>>trouble with ?
>>>
>>>Does your app follow what perfquery does ?
>>
>>Yes, perfquery works fine, and my implementation follows a similar approach. Here
>>is the output. I think the difference is in the load: I am sending 4 GS requests
>>per second; some succeed and some fail with a timeout (110) or a failed receive.
>
>
> Can you elaborate on the multiple sends ? Are they outstanding
> concurrently ? Are they to the same destination or different ones ? Are
> they from a single or multiple threads ?
No, they are sent sequentially (guarded by a mutex), so there is no concurrency,
but the timeout for umad_recv is 100 ms. Yes, they all go to the same destination,
and they all come from a single thread. I would still point out that the same
configuration with SMPs shows no failures.
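
For illustration, the sequential send/receive pattern described above can be sketched as below. This is a minimal simulation, not the application's actual code: `send` and `recv` are hypothetical callables standing in for the real umad send and receive calls, and the retry count of 3 is an assumption. The 0.1 s default mirrors the 100 ms receive timeout mentioned above. The point it shows is that on timeout the function returns None, so the caller never inspects a receive buffer that was not filled.

```python
from typing import Callable, Optional


def query_with_retries(
    send: Callable[[], None],
    recv: Callable[[float], Optional[bytes]],
    timeout_s: float = 0.1,   # mirrors the 100 ms receive timeout
    retries: int = 3,         # assumed value, not from the thread
) -> Optional[bytes]:
    """Send one request, wait up to timeout_s for the reply, retry on timeout.

    `send` and `recv` are hypothetical stand-ins for the real transport.
    `recv` must return the reply bytes, or None if it timed out.
    Returns the reply, or None if every attempt timed out.
    """
    for _ in range(1 + retries):
        send()
        reply = recv(timeout_s)
        if reply is not None:
            return reply
    # Every attempt timed out: there is no valid response to look at.
    return None
```

With a pattern like this, a single lost request or response costs one extra round trip instead of surfacing as an application-level failure.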
>
>
>># perfquery
>># Port counters: Lid 393 port 1
>>PortSelect:......................1
>>CounterSelect:...................0x0000
>>SymbolErrors:....................0
>>LinkRecovers:....................0
>>LinkDowned:......................0
>>RcvErrors:.......................0
>>RcvRemotePhysErrors:.............0
>>RcvSwRelayErrors:................0
>>XmtDiscards:.....................0
>>XmtConstraintErrors:.............0
>>RcvConstraintErrors:.............0
>>LinkIntegrityErrors:.............0
>>ExcBufOverrunErrors:.............0
>>VL15Dropped:.....................0
>>XmtData:.........................65899728
>>RcvData:.........................65899656
>>XmtPkts:.........................915274
>>RcvPkts:.........................915273
>>
>>
>
>
> OK but you had said the received packet was corrupted. Maybe a nit, but
> with timeout and other errors, the receive packet is invalid rather than
> corrupted (an app shouldn't be looking at the response in the error
> cases).
>
>
>>>The underlying question is why are you getting the timeout relatively
>>>frequently so I recommend checking all the error counters along the
>>>path.
>>
>># Checking Ca: nodeguid 0x00144fa5e9ce001c
>>Node check lid 392: OK
>>Error check on lid 392 (HCA-1) port all: OK
>
>
> Is that the requester or responder ? It's not the entire path. Maybe the
> simplest thing is: what does ibchecknet or ibcheckerrors say ?
>
I am using lid 106
[root@o4test65 tmp]# ibchecknet
#warn: counter SymbolErrors = 65535 (threshold 10)
Error check on lid 106 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkRecovers = 18 (threshold 10)
Error check on lid 4 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkRecovers = 19 (threshold 10)
Error check on lid 9 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkRecovers = 26 (threshold 10)
#warn: counter LinkDowned = 13 (threshold 10)
#warn: counter RcvErrors = 27 (threshold 10)
Error check on lid 10 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkRecovers = 255 (threshold 10)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 1968 (threshold 10)
Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 1967 (threshold 10)
Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port 15:
FAILED
# Checked Switch: nodeguid 0x00144f0000a61390 with failure
#warn: counter SymbolErrors = 65535 (threshold 10)
Error check on lid 7 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkDowned = 12 (threshold 10)
Error check on lid 3 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkRecovers = 15 (threshold 10)
#warn: counter LinkDowned = 12 (threshold 10)
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkRecovers = 15 (threshold 10)
#warn: counter LinkDowned = 12 (threshold 10)
Error check on lid 11 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
Error check on lid 6 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkRecovers = 255 (threshold 10)
#warn: counter RcvErrors = 445 (threshold 10)
Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: Logical link state is Initialize
Port check lid 12 port 15: FAILED
# Checked Switch: nodeguid 0x00144f0000a61397 with failure
#warn: Logical link state is Initialize
Port check lid 12 port 14: FAILED
#warn: Logical link state is Initialize
Port check lid 12 port 13: FAILED
#warn: counter LinkRecovers = 11 (threshold 10)
Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port 13:
FAILED
# Checking Ca: nodeguid 0x00144fa5e9ce001c
# Checking Ca: nodeguid 0x00144fa5e9ce000c
# Checking Ca: nodeguid 0x00144fa5e9ce0014
# Checking Ca: nodeguid 0x00144fa5e9ce0004
## Summary: 29 nodes checked, 0 bad nodes found
## 359 ports checked, 3 bad ports found
## 2 ports have errors beyond threshold
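
Many of the values above sit exactly at a counter maximum (SymbolErrors = 65535, LinkRecovers = 255). A pegged counter no longer increments, so it says nothing about the current error rate until it is cleared. The sketch below flags such entries in output of this shape. It assumes the PortCounters widths are 16 bits for SymbolErrors and RcvErrors and 8 bits for LinkRecovers and LinkDowned, and it assumes the `#warn:` lines always precede the `Error check on lid ...` line they belong to, as they do in the output above. The function name is illustrative only.

```python
import re

# Assumed counter widths: 16-bit SymbolErrors/RcvErrors,
# 8-bit LinkRecovers/LinkDowned.
COUNTER_MAX = {
    "SymbolErrors": 0xFFFF,
    "RcvErrors": 0xFFFF,
    "LinkRecovers": 0xFF,
    "LinkDowned": 0xFF,
}

WARN_RE = re.compile(r"#warn: counter (\w+) = (\d+) \(threshold \d+\)")
CHECK_RE = re.compile(r"Error check on lid (\d+) .* port (\w+):")


def saturated_counters(output: str):
    """Return sorted (lid, port, counter) tuples whose value is pegged at max."""
    found = set()
    pending = []  # warnings seen since the last "Error check" line
    for line in output.splitlines():
        warn = WARN_RE.search(line)
        if warn:
            pending.append((warn.group(1), int(warn.group(2))))
            continue
        check = CHECK_RE.search(line)
        if check:
            lid, port = int(check.group(1)), check.group(2)
            for name, value in pending:
                if value == COUNTER_MAX.get(name):
                    found.add((lid, port, name))
            pending = []
    return sorted(found)
```

Counters flagged this way would need to be reset and re-read after a known interval before they can show whether errors are still occurring.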
> In any case, based on your comments above about perfquery working
> reliably, I'm skeptical whether this is the issue but it's best to rule
> it out.
[root@o4test65 tmp]# ibcheckerrors
#warn: counter SymbolErrors = 65535 (threshold 10)
Error check on lid 106 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkRecovers = 18 (threshold 10)
Error check on lid 4 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkRecovers = 19 (threshold 10)
Error check on lid 9 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkRecovers = 26 (threshold 10)
#warn: counter LinkDowned = 13 (threshold 10)
#warn: counter RcvErrors = 27 (threshold 10)
Error check on lid 10 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkRecovers = 255 (threshold 10)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 2081 (threshold 10)
Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 2080 (threshold 10)
Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port 15:
FAILED
# Checked Switch: nodeguid 0x00144f0000a61390 with failure
#warn: counter SymbolErrors = 65535 (threshold 10)
Error check on lid 7 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkDowned = 12 (threshold 10)
Error check on lid 3 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkRecovers = 15 (threshold 10)
#warn: counter LinkDowned = 12 (threshold 10)
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkRecovers = 15 (threshold 10)
#warn: counter LinkDowned = 12 (threshold 10)
Error check on lid 11 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
Error check on lid 6 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkRecovers = 255 (threshold 10)
#warn: counter RcvErrors = 445 (threshold 10)
Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port all:
FAILED
#warn: counter LinkRecovers = 11 (threshold 10)
Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port 13:
FAILED
# Checked Switch: nodeguid 0x00144f0000a61397 with failure
## Summary: 29 nodes checked, 0 bad nodes found
## 359 ports checked, 2 ports have errors beyond threshold
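
Comparing the two runs, almost every value is unchanged, but SymbolErrors on lid 5 rose from 1968 to 2081 (port 15: 1967 to 2080) between the ibchecknet and ibcheckerrors runs, which suggests that link is still taking errors. A rough sketch for extracting such differences from two captured outputs follows. The function names are illustrative, and the parsing assumes the same line layout as the output above.

```python
import re

WARN_RE = re.compile(r"#warn: counter (\w+) = (\d+) \(threshold \d+\)")
CHECK_RE = re.compile(r"Error check on lid (\d+) .* port (\w+):")


def parse_counters(output: str) -> dict:
    """Map (lid, port, counter) to the value reported in one tool run."""
    values = {}
    pending = []  # warnings seen since the last "Error check" line
    for line in output.splitlines():
        warn = WARN_RE.search(line)
        if warn:
            pending.append((warn.group(1), int(warn.group(2))))
            continue
        check = CHECK_RE.search(line)
        if check:
            prefix = (int(check.group(1)), check.group(2))
            for name, value in pending:
                values[prefix + (name,)] = value
            pending = []
    return values


def counter_deltas(before: str, after: str) -> dict:
    """Return the change for counters present in both runs that moved."""
    old, new = parse_counters(before), parse_counters(after)
    return {k: new[k] - old[k] for k in old.keys() & new.keys() if new[k] != old[k]}
```

Counters already at their maximum will show no change here even if errors continue, so an empty delta for those entries is not evidence of a healthy link.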
>
>
>>>Are you sure the request gets to the responder ? Does the responder
>>>respond and it doesn't make it back ?
>>
>>Yes. As I said, it is not a 100% failure; it is a 30% to 40% failure. But why?
>
>
> I don't know enough about what is different about your app yet to say
> more right now.
>
> -- Hal
>
>
>>>-- Hal
>>>
>>>
>
>