[ofa-general] Re: re [NET]: Fix neighbour destructor handling

Or Gerlitz or.gerlitz at gmail.com
Wed Apr 11 15:43:38 PDT 2007


On 4/11/07, Michael S. Tsirkin <mst at dev.mellanox.co.il> wrote:
> > I did follow most of the discussions between you and MoniS re the
> > ipoib/bonding integration in OFED 1.2 and elsewhere, however: I don't
> > see why "bonding is basically broken for ipoib", if you don't mind,
> > please tell me the bottom line from your perspective.
>
> Here's a short summary of issues I saw last time, I'm not sure
> I haven't forgotten something but here goes:

Michael,

Thanks for taking the time to summarize this. It does make sense to
try to address these concerns before reposting the patches, provided
that the audience knows what we are talking about; in other words, I
might repost the patches just for the sake of discussion. Anyway,
please see if you can address the follow-up questions and comments
below.

> 1.Calling to_ipoib_neigh without device lock taken might be racy
>   I think you need to find another way to find the device.

Just to be sure, are you referring to the call added in MoniS's patch
to the ipoib neigh destructor?

> 2.Ah kept in the ipoib_neigh might belong to a device which is different
>   from the one start_xmit is called at.

How come? Before a bonding fail-over takes place, some failure has
happened to the active slave, and from the ipoib code I understand
that all failure scenarios, specifically those that turn the device
carrier (RUNNING) bit off, flush the ipoib neighs and their associated
address handles. So the ipoib_neigh buddy of the neighbour is cleaned
up, and once start_xmit is called over the new active slave a new
ipoib_neigh/ah would be created.
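
To make the sequence I have in mind concrete, here is a toy model in
Python. It is only a sketch of my reading of the flow, not the actual
ipoib code: the class names, the neigh_list field and the carrier_off()
helper are made up for illustration, and the "ah" is just a tuple
standing in for an address handle.

```python
# Toy model only: names are hypothetical and do not mirror the real driver API.

class Neighbour:
    """Stack-level neighbour entry; with bonding it is matched to the master."""
    def __init__(self, addr):
        self.addr = addr
        self.ipoib_neigh = None  # driver-private companion, created lazily


class IpoibNeigh:
    """Driver-private state tied to the slave device that created it."""
    def __init__(self, dev, neighbour):
        self.dev = dev
        self.ah = (dev.name, neighbour.addr)  # stand-in for an address handle


class SlaveDev:
    def __init__(self, name):
        self.name = name
        self.carrier = True
        self.neigh_list = []  # neighbours whose ipoib_neigh this device created

    def carrier_off(self):
        """Model a failure: drop carrier and flush every ipoib_neigh/ah we own."""
        self.carrier = False
        for neighbour in self.neigh_list:
            neighbour.ipoib_neigh = None
        self.neigh_list.clear()

    def start_xmit(self, neighbour):
        """Transmit via this device, creating the ipoib_neigh/ah on first use."""
        if neighbour.ipoib_neigh is None:
            neighbour.ipoib_neigh = IpoibNeigh(self, neighbour)
            self.neigh_list.append(neighbour)
        return neighbour.ipoib_neigh.ah


n = Neighbour("peer")
ib0, ib1 = SlaveDev("ib0"), SlaveDev("ib1")

ah_before = ib0.start_xmit(n)  # ah created on the active slave
ib0.carrier_off()              # failure flushes the ipoib_neigh/ah
ah_after = ib1.start_xmit(n)   # fail-over: a fresh ah on the new slave
print(ah_before, ah_after)
```

In this model, if the fail-over happened without carrier_off() having
run on the old slave first, the second start_xmit would hand back the
handle created on ib0, which is the situation item 2 describes.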

> 3.When the slave device goes down, master does not, and since
>   neighbours are matched to the master there's no guarantee they will be
>   cleaned up.

Just to be sure, by "goes down" do you mean it is not UP any more? I
understand it's common Linux behaviour not to clean neighbours when
the associated device is not UP, correct? What is the problematic
implication you see here?

thanks again for raising the concerns,

Or.


