[ofa-general] Bug with SDP on IA64
Amir Vadai
amirv at mellanox.co.il
Sun Oct 26 23:55:40 PDT 2008
I asked our IB expert Jack for hints and he told me this:
From Section 11.6.2 (COMPLETION RETURN STATUS) of the IB Spec volume 1, revision 1.2.1:
* Local Length Error - ... Generated for a
Work Request posted to the local Receive Queue when the sum of
the Data Segment lengths is too small to receive a valid incoming
message or the length of the incoming message is greater than the
maximum message size supported by the HCA port that received the
message.
There seem to be 2 possibilities:
1. The receiver did not post enough/large-enough scatter gather entries in
the receive queue.
or
2. The sender sent a 0-length packet, but did so incorrectly.
(If any of the s/g entries (i.e., data segment entries) has a zero
byte count, this results in 2 GigaBytes of data being sent over the wire.)
I note that SDP does not check for this (see sdp_post_send() in file sdp_bcopy.c:
the sge->length field is not checked for zero length).
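To make the two possibilities concrete, here is a minimal sketch in plain user-space C of the checks involved. This is not the SDP code: struct sge_sketch and both function names are hypothetical and exist only for illustration. The only thing taken from the discussion above is that each scatter/gather entry carries a length field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a scatter/gather entry. Only the length
 * field matters for these checks. */
struct sge_sketch {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

/* Possibility 2: find a zero-length entry before it is posted.
 * Returns the index of the first entry whose length is 0, or -1 if
 * every entry has a non-zero length. */
static int first_zero_length_sge(const struct sge_sketch *sgl, size_t num_sge)
{
    size_t i;

    assert(sgl != NULL || num_sge == 0);
    for (i = 0; i < num_sge; i++) {
        if (sgl[i].length == 0)
            return (int)i;
    }
    return -1;
}

/* Possibility 1: check that the posted receive entries are large enough.
 * Sums the lengths exactly as written (zero-length entries should be
 * caught separately with the function above) and returns 1 if the total
 * can hold msg_len bytes, 0 otherwise. */
static int recv_sges_can_hold(const struct sge_sketch *sgl, size_t num_sge,
                              uint64_t msg_len)
{
    uint64_t total = 0;
    size_t i;

    assert(sgl != NULL || num_sge == 0);
    for (i = 0; i < num_sge; i++)
        total += sgl[i].length;
    return total >= msg_len;
}
```

A check like first_zero_length_sge() could be called on a work request's entries just before posting, and anything other than -1 logged, to see whether zero-length entries show up under stress.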
Regarding how to debug this, you need to talk with an sdp expert to see if sdp may try
to send 0-length packets under stress ([Amir]: I can help you with this).
This is NOT an endianness problem -- it occurs also when he tries to send between
ia64 hosts:
"> When doing SDP transfers from an IA64 to any other host (IA64, x86,
> x86_64) through ttcp, I got this message:"
- Amir
Amir Vadai wrote:
> Hi,
>
>
> Please open a bug in https://bugs.openfabrics.org/ (make sure it is not
> a duplicate)
>
> I guess you have some endianness problem since ia64 is big endian and x86 is little endian.
>
> Try running the test on a stock Redhat/SLES kernel.
>
> - Amir
>
>
> Nicolas Morey Chaisemartin wrote:
>
>
>> Hi,
>>
>> I am stuck with a bug from ofa-kernel 1.3.1 on an IA64 running a Bull
>> 2.6.18 kernel.
>> When doing SDP transfers from an IA64 to any other host (IA64, x86,
>> x86_64) through ttcp, I got this message:
>>
>> [root@h2 ~]# LD_PRELOAD=/usr/lib/libsdp.so.1 ~/ttcp/ttcp -t -s 192.168.0.10
>> ttcp-t: buflen=8192, nbuf=2048, align=16384/0, port=5001 tcp -> 192.168.0.10
>> ttcp-t: socket
>> ttcp-t: tcp_maxseg
>> ttcp-t: connect
>> ttcp-t: IO: Connection reset by peer
>> errno=104
>> [root@h2 ~]#
>>
>> And the same error on the other side.
>> I activated debug mode for the sdp module and found out that on the
>> receiver side a completion error 1 shows up:
>> Oct 16 12:40:43 s_kernel@yack0 kernel: sdp_sock(5001:36814): Recv completion with error. Status 1
>> Oct 16 12:40:43 s_kernel@yack0 kernel: sdp_sock(5001:36814): sdp_reset state=1
>> Oct 16 12:40:44 s_kernel@yack0 kernel: sdp_sock(5001:36814): sdp_cma_handler event 10 id 0000010425120600
>> Oct 16 12:40:44 s_kernel@yack0 kernel: sdp_sock(5001:36814): RDMA_CM_EVENT_DISCONNECTED
>>
>> The error triggers a socket reset which terminates the connection.
>> According to the docs I could find, Status 1 is a local length error,
>> meaning the size written in the packet doesn't match the payload.
>>
>> I've noticed that with few packets (<= 100) or when ttcp is slowed
>> down (started through strace) transfers seem to work.
>>
>> I've tried to update to the latest ofa-kernel (1.4.1 from 10/16/2008)
>> and the bug is still there.
>>
>> Has anyone seen this problem before? What can I do to locate where
>> things go wrong?
>>
>> Regards
>>
>> Nicolas