[ofa-general] Bug with SDP on IA64

Nicolas Morey Chaisemartin nicolas.morey-chaisemartin at ext.bull.net
Mon Oct 27 02:09:33 PDT 2008


Amir Vadai wrote:
> I asked our IB expert Jack for hints and he told me this:
>
>
> From Section 11.6.2 (COMPLETION RETURN STATUS) of the IB Spec volume 1, revision 1.2.1:
> * Local Length Error - ... Generated for a
>   Work Request posted to the local Receive Queue when the sum of
>   the Data Segment lengths is too small to receive a valid incoming
>   message or the length of the incoming message is greater than the
>   maximum message size supported by the HCA port that received the
>   message.
>
>
> There seem to be 2 possibilities:
> 1. The receiver did not post enough/large-enough scatter gather entries in
>    the receive queue.
>
>
> or 
> 2. The sender sent a 0-length packet, but did so incorrectly.
>    (if any of the s/g entries (i.e., data segment entries) have a zero
>    byte count, this results in 2 GigaBytes of data being sent over the wire).
>
>
>    I note that SDP does not check for this (see sdp_post_send() in file sdp_bcopy.c:
>    the sge->length field is not checked for zero length).
>
>
> Regarding how to debug this, you need to talk with an sdp expert to see if sdp may try
> to send 0-length packets under stress ([Amir]: I can help you with this).
>
>   
I've just run a few more tests.
I added a test in sdp_post_send to check the sge->length field:
if(sge->length == 0){printk(KERN_ERR "SDP sending 0bytes packet\n");}
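
For reference, the same check can be tried outside the kernel. This is only a minimal userspace sketch: struct sge_sketch is a stand-in I made up that mirrors the addr/length/lkey fields of the kernel's struct ib_sge, and first_zero_length_sge() is a hypothetical helper, not something that exists in sdp_bcopy.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct ib_sge (assumption:
 * only the addr/length/lkey fields matter for this check). */
struct sge_sketch {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

/* Hypothetical helper: return the index of the first scatter/gather
 * entry whose length is zero, or -1 if every entry is non-empty. */
static int first_zero_length_sge(const struct sge_sketch *sg_list,
                                 size_t num_sge)
{
    for (size_t i = 0; i < num_sge; i++) {
        if (sg_list[i].length == 0)
            return (int)i;
    }
    return -1;
}
```

Unlike the one-line printk above, this walks the whole list and reports which entry is empty, which may help if more than one sge is posted per send.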

In the case of an IA64 -> IA64 transfer (in fact on the same server),
the message shows up in the syslog just before the connection crashes.
However, on an IA64 -> x86_64 transfer it doesn't show up, so I doubt
this is the cause.

I also doubt it comes from the buffers on the receiving end: SDP
transfers fail from IA64 to x86 but succeed from x86 to x86.
With RDMA transfers (using the perftest tools), x86 to x86 transfers
have shown higher performance, due to a better PCI bus.


I tried to follow the packet/frag size in sdp_post_send
(sdp_bcopy.c), and it appears that packets over 4k are going through:
Oct 27 09:05:03 s_kernel at h2 kernel: SDP sending 30720 bytes packet on frag 0
Isn't the packet size supposed to be <= the MTU at this point?

I added the same line on x86_64 and all fragments there have a size
<= 4096, so my guess is that the problem is here on IA64.
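
To make the comparison between the two machines less dependent on reading the syslog by eye, the fragment lengths could be checked against a limit. This is only a sketch under my own assumptions: count_oversized_frags() is a hypothetical helper, and the limit is whatever value you expect (4096 here only because that is what I observed on x86_64, not a value taken from the SDP code).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count how many fragment lengths exceed the
 * given limit (for example the 4096 bytes seen on x86_64). */
static size_t count_oversized_frags(const uint32_t *frag_len,
                                    size_t nfrags, uint32_t limit)
{
    size_t oversized = 0;

    for (size_t i = 0; i < nfrags; i++) {
        if (frag_len[i] > limit)
            oversized++;
    }
    return oversized;
}
```

With the lengths logged above, a 30720-byte frag 0 would be counted against a 4096 limit, while the x86_64 fragments would not.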

Nicolas Morey-Chaisemartin




