[ofa-general] using SDP for block device traffic: several problems

Amir Vadai amirv at mellanox.co.il
Wed Jul 1 09:53:16 PDT 2009


I understand that you're using ofed-1.5.
We should check whether these bugs also exist in 1.4.1, since SDP changed a lot in ofed-1.5.

Please open bugs in bugzilla here: https://bugs.openfabrics.org

Thanks,
Amir

On 07/01/2009 04:36 PM, Lars Ellenberg wrote:
> On Wed, Jul 01, 2009 at 04:02:17PM +0300, Amir Vadai wrote:
>  Subject: Re: [patch] fix SDP page leak in sdp_bz_cleanup
>  In-Reply-To: <4A4B5E59.2030001 at mellanox.co.il>
>> Hi Lars,
>>
>> This is the right place for posting patches.
>>
>> I will commit it ASAP into both branches.
> 
> Thanks for that one.
> 
> now, let me summarize some other findings.
> 
> == off-by-one error, data corruption ==
> 
>   I think that "sometimes" you lose the last byte of a fragment.
> 
>   situation: multi core, mlx4_ib driver,
>   IPoIB configured, SDP configured, more details on request ;)
> 
>   do large message traffic on several streaming sockets
>   at the same time, using as much bandwidth as possible,
>   some on IPoIB, some on SDP.
> 
>   "sometimes" (typically within a couple of minutes), when receiving the
>   stream, the last byte of some fragment is missing, or replaced by the
>   first byte of the next fragment (if any).
> 
>   This has been noticed when using SDP from kernel space (for DRBD),
>   and reproduced in userland.
> 
>   I will provide two simple perl scripts (server and client) today or
>   tomorrow, so you should be able to reproduce this yourself in userland.
> 
>   It does not occur (within my patience time span) if there is not much
>   load, or if I only use one stream, or even if I only use SDP (and not
>   simultaneously also IPoIB streams). It only happens on SDP streams.
> 
>   I'm not sure if this off-by-one happens during send or recv.
> 
>   I'm open to suggestions to aid in tracking it down.
> 
> 
> == module count imbalance ==
> 
>   after modprobe, module usage count of ib_sdp is 0, as it should be.
>   starting to use it with some streaming sockets, module count goes up.
> 
>   once the streams start disconnecting, being interrupted from the other
>   side, reconnect and similar stuff, module count quickly drops below
>   zero, manifesting in lsmod showing a module count of 4.2 million ;)
> 
>   I'm still trying to track this down, I'm not yet sure if it is a double
>   module_put, or a missing (try_)module_get ...
> 
> 
> more when I find more.
> 
> Cheers,
> 
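
For the off-by-one report quoted above, a userland check could look roughly like the sketch below. It is not the Perl pair mentioned in the report. It is a minimal illustration in Python, and every name in it (pattern, first_mismatch, send_stream, verify_stream, PATTERN_MOD) is hypothetical. The idea is that each byte on the wire depends only on its absolute offset in the stream, so a byte that is dropped, or replaced by the first byte of the next fragment, shows up as a mismatch at a known offset. The sketch works on any connected stream socket and assumes SDP is enabled for that socket by whatever means the system provides.

```python
import socket
import threading

# Hypothetical constant: a prime modulus, so the pattern does not line up
# with power-of-two buffer or fragment sizes.
PATTERN_MOD = 251


def pattern(offset, length):
    """Return `length` bytes of the test pattern starting at stream `offset`."""
    return bytes((offset + i) % PATTERN_MOD for i in range(length))


def first_mismatch(offset, data):
    """Return the stream offset of the first byte in `data` that deviates
    from the pattern, or None if `data` matches."""
    expected = pattern(offset, len(data))
    if data == expected:
        return None
    for i, (got, want) in enumerate(zip(data, expected)):
        if got != want:
            return offset + i


def send_stream(sock, total, chunk=65536):
    """Send `total` bytes of the pattern over a connected stream socket."""
    sent = 0
    while sent < total:
        n = min(chunk, total - sent)
        sock.sendall(pattern(sent, n))
        sent += n


def verify_stream(sock, bufsize=65536):
    """Receive until EOF or the first bad byte.
    Returns (bytes_received, first_bad_offset_or_None)."""
    received = 0
    while True:
        data = sock.recv(bufsize)
        if not data:
            return received, None
        bad = first_mismatch(received, data)
        if bad is not None:
            return received + len(data), bad
        received += len(data)


if __name__ == "__main__":
    # Local self-test over a socketpair; a real run would use connected
    # sockets between two hosts, several streams in parallel.
    a, b = socket.socketpair()
    total = 1 << 20
    sender = threading.Thread(target=lambda: (send_stream(a, total), a.close()))
    sender.start()
    got, bad = verify_stream(b)
    sender.join()
    b.close()
    print("received", got, "of", total, "bytes; first bad offset:", bad)
```

A byte lost at the very end of the stream produces no mismatch, so the caller should also compare the received byte count with the number of bytes sent.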

-- 
Amir Vadai
Software Eng.
Mellanox Technologies
mailto: amirv at mellanox.co.il
Tel +972-3-6259539


