[ofa-general] using SDP for block device traffic: several problems

Lars Ellenberg lars.ellenberg at linbit.com
Wed Jul 1 06:36:52 PDT 2009


On Wed, Jul 01, 2009 at 04:02:17PM +0300, Amir Vadai wrote:
 Subject: Re: [patch] fix SDP page leak in sdp_bz_cleanup
 In-Reply-To: <4A4B5E59.2030001 at mellanox.co.il>
> Hi Lars,
> 
> This is the right place for posting patches.
> 
> I will commit it ASAP into both branches.

Thanks for that one.

now, let me summarize some other findings.

== off-by-one error, data corruption ==

  I think that "sometimes" you lose the last byte of a fragment.

  situation: multi core, mlx4_ib driver,
  IPoIB configured, SDP configured, more details on request ;)

  do large message traffic on several streaming sockets
  at the same time, using as much bandwidth as possible,
  some on IPoIB, some on SDP.

  "sometimes" (typically within a couple of minutes), when receiving the
  stream, the last byte of some fragment is missing, or replaced by the
  first byte of the next fragment (if any).

  This has been noticed when using SDP from kernel space (for DRBD),
  and reproduced in userland.

  I will provide two simple perl scripts (server and client) today or
  tomorrow, so you should be able to reproduce this yourself in userland.

  It does not occur (for as long as I had the patience to wait) if there
  is not much load, or if I only use one stream, or even if I only use SDP
  (and not simultaneously also IPoIB streams). It only happens on SDP streams.

  I'm not sure if this off-by-one happens during send or recv.

  I'm open to suggestions to aid in tracking it down.
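
  A minimal sketch of how such a corruption could be pinned to an exact
  fragment and offset. This is illustrative Python, not the perl scripts
  mentioned above. The names make_fragment, make_stream and first_mismatch,
  the fragment size, and the byte pattern are all assumptions made for the
  example. The idea is that every byte depends on its fragment number and
  offset, so a lost or replaced byte shows up at a known position.

```python
def make_fragment(seq: int, size: int) -> bytes:
    # Hypothetical pattern: each byte is derived from the fragment number
    # and the offset inside the fragment. Avoid sizes where
    # (size - 1) % 256 == 31, because then the last byte of a fragment
    # equals the first byte of the next one and a shift would go unnoticed.
    return bytes((seq * 31 + off) & 0xFF for off in range(size))


def make_stream(fragments: int, size: int) -> bytes:
    # The sender would write these fragments back to back on the socket.
    return b"".join(make_fragment(seq, size) for seq in range(fragments))


def first_mismatch(received: bytes, fragments: int, size: int):
    """Return (fragment, offset) of the first deviating byte, or None."""
    expected = make_stream(fragments, size)
    common = min(len(received), len(expected))
    for pos in range(common):
        if received[pos] != expected[pos]:
            return divmod(pos, size)
    if len(received) != len(expected):
        # Identical up to the shorter length, so data is missing or extra.
        return divmod(common, size)
    return None


if __name__ == "__main__":
    stream = make_stream(4, 1000)
    # Simulate losing the last byte of fragment 1.
    damaged = stream[:1999] + stream[2000:]
    print(first_mismatch(damaged, 4, 1000))
```

  In a real run the received bytes would come from reading the SDP socket
  on the receiving side. The check alone does not tell whether the byte
  was lost during send or recv.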


== module count imbalance ==

  after modprobe, module usage count of ib_sdp is 0, as it should be.
  starting to use it with some streaming sockets, module count goes up.

  once the streams start disconnecting, being interrupted from the other
  side, reconnect and similar stuff, module count quickly drops below
  zero, manifesting in lsmod showing a module count of 4.2 million ;)

  I'm still trying to track this down, I'm not yet sure if it is a double
  module_put, or a missing (try_)module_get ...
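
  A toy model of the two suspected causes, as a sketch only. This is
  plain Python, not kernel code and not the ib_sdp source. The class
  ModuleRefModel, the function run and the connect/disconnect sequence are
  hypothetical. It shows that either a double put or a missing get drives
  the count below zero, and what a negative count looks like when it is
  reinterpreted as an unsigned 32-bit number (assuming that is how the
  value ends up being displayed).

```python
class ModuleRefModel:
    """Toy stand-in for a module use count (illustration only)."""

    def __init__(self) -> None:
        self.count = 0

    def get(self) -> None:
        self.count += 1

    def put(self) -> None:
        self.count -= 1


def as_u32(value: int) -> int:
    # Reinterpret a possibly negative count as an unsigned 32-bit value.
    return value & 0xFFFFFFFF


def run(connections: int, double_put: bool = False,
        missing_get: bool = False) -> int:
    ref = ModuleRefModel()
    for _ in range(connections):
        if not missing_get:
            ref.get()      # taken when a socket starts using the module
        ref.put()          # dropped on normal teardown
        if double_put:
            ref.put()      # hypothetical second put on an error path
    return ref.count


if __name__ == "__main__":
    print(run(3))                           # balanced
    print(run(1, double_put=True))          # one put too many
    print(run(1, missing_get=True))         # put without a matching get
    print(as_u32(run(1, double_put=True)))  # negative count shown unsigned
```

  Both faults give the same final count in this model, so the count alone
  cannot tell them apart. Logging each get and put together with the
  socket state would be one way to separate the two cases.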


more when I find more.

Cheers,

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


