[ofa-general] using SDP for block device traffic: several problems
Amir Vadai
amirv at mellanox.co.il
Wed Jul 1 09:53:16 PDT 2009
I understand that you're using ofed-1.5.
We should check if those bugs exist in 1.4.1 too - since there were many changes in SDP in ofed-1.5.
Please open bugs in bugzilla here: https://bugs.openfabrics.org
Thanks,
Amir
On 07/01/2009 04:36 PM, Lars Ellenberg wrote:
> On Wed, Jul 01, 2009 at 04:02:17PM +0300, Amir Vadai wrote:
> Subject: Re: [patch] fix SDP page leak in sdp_bz_cleanup
> In-Reply-To: <4A4B5E59.2030001 at mellanox.co.il>
>> Hi Lars,
>>
>> This is the right place for posting patches.
>>
>> I will commit it ASAP into both branches.
>
> Thanks for that one.
>
> now, let me summarize some other findings.
>
> == off-by-one error, data corruption ==
>
> I think that "sometimes" you lose the last byte of a fragment.
>
> situation: multi core, mlx4_ib driver,
> IPoIB configured, SDP configured, more details on request ;)
>
> do large message traffic on several streaming sockets
> at the same time, using as much bandwidth as possible,
> some on IPoIB, some on SDP.
>
> "sometimes" (typically within a couple of minutes), when receiving the
> stream, the last byte of some fragment is missing, or replaced by the
> first byte of the next fragment (if any).
>
> This has been noticed when using SDP from kernel space (for DRBD),
> and reproduced in userland.
>
> I will provide two simple perl scripts (server and client) today or
> tomorrow, so you should be able to reproduce this yourself in userland.
>
> It does not occur (within my patience time span) if there is not much
> load, or if I only use one stream, or even if I only use SDP (and not
> simultaneously also IPoIB streams). It only happens on SDP streams.
>
> I'm not sure if this off-by-one happens during send or recv.
>
> I'm open to suggestions to aid in tracking it down.
>
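One way to narrow down a lost or replaced byte like the one described above is to send a stream whose every byte is a function of its absolute offset, so the receiver can name the exact position of the first bad byte. The following is only a minimal sketch of that idea, not the Perl scripts mentioned above. The pattern (offset modulo 251) and the names pattern_byte, make_chunk and first_mismatch are assumptions made for illustration.

```python
# Hypothetical offset-based test pattern; nothing here is taken from the
# actual reproduction scripts.

def pattern_byte(offset: int) -> int:
    """Expected byte value at an absolute stream offset."""
    # 251 is prime, so the pattern does not repeat in step with
    # power-of-two fragment sizes.
    return offset % 251


def make_chunk(start: int, length: int) -> bytes:
    """Build the bytes a sender would write for offsets start..start+length-1."""
    return bytes(pattern_byte(start + i) for i in range(length))


def first_mismatch(start: int, data: bytes):
    """Return the absolute offset of the first unexpected byte, or None."""
    for i, value in enumerate(data):
        if value != pattern_byte(start + i):
            return start + i
    return None


if __name__ == "__main__":
    good = make_chunk(0, 1000)
    # Simulate a byte at offset 499 being dropped from the stream.
    dropped = good[:499] + good[500:]
    print(first_mismatch(0, good))
    print(first_mismatch(0, dropped))
```

Because neighbouring offsets always carry different values, both a dropped byte and a byte overwritten by its successor show up at the offset where it happened. Logging that offset together with the send and recv sizes on each side might help tell whether the loss lines up with a fragment boundary.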
>
> == module count imbalance ==
>
> after modprobe, module usage count of ib_sdp is 0, as it should be.
> starting to use it with some streaming sockets, module count goes up.
>
> once the streams start disconnecting, being interrupted from the other
> side, reconnect and similar stuff, module count quickly drops below
> zero, manifesting in lsmod showing a module count of 4.2 million ;)
>
> I'm still trying to track this down, I'm not yet sure if it is a double
> module_put, or a missing (try_)module_get ...
>
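A use count that "drops below zero" typically shows up as a huge positive number rather than a negative one. The sketch below only illustrates that arithmetic, under the assumption that the count is held in a 32-bit counter and printed as an unsigned value. This has not been checked against the ib_sdp or kernel module code, and as_unsigned32 is a hypothetical helper.

```python
def as_unsigned32(count: int) -> int:
    """Reinterpret a (possibly negative) count as an unsigned 32-bit value."""
    # Assumption: 32-bit counter printed as unsigned.
    return count & 0xFFFFFFFF


if __name__ == "__main__":
    # One extra put on a count of zero, two extra puts, and a healthy count.
    for count in (-1, -2, 3):
        print(count, "->", as_unsigned32(count))
```

Under that assumption, each unbalanced put lowers the displayed number by one from just under 2**32. The distance of the lsmod value from 2**32 would then hint at how many puts were unmatched.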
>
> more when I find more.
>
> Cheers,
>
--
Amir Vadai
Software Eng.
Mellanox Technologies
mailto: amirv at mellanox.co.il
Tel +972-3-6259539