[ofa-general] ***SPAM*** Re: IB Bonding errors with recent kernel
Dennis Portello
dennis.portello at gmail.com
Mon Apr 20 04:21:15 PDT 2009
Hello,
I seem to be experiencing the exact issue discussed below (back in
December). I'm using the 2.6.27 kernel and the bonding drivers available in
that kernel. Was there ever a solution or patch to solve this? I have been
using the ib-bond scripts, but other approaches, such as the standard OS
tools or adding the bond through sysfs, all give the same result.
Regular TCP/IP unicast works, though dmesg is full of warnings about
multicast joins failing. Multicast does not work at all.
Any hints or suggestions would be greatly appreciated.
Best regards,
Dennis Portello
> Or Gerlitz wrote:
>>> If I am not mistaken the issue you mention is a little different from
>>> the one I pointed out.
>>> Without bonding I see the following:
>>> kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
>>> However, with bonding what I see is:
>>> ib0: multicast join failed for 0001:0000:0000:0000:0000:0000:0000:0000, status -22
>>
>> Please note that -11 is EAGAIN (try again) and -22 is EINVAL (invalid
>> argument). So you can get EAGAIN when the underlying core SA agent is
>> not ready to send SA queries, while you get EINVAL when attempting to
>> join on a junk MGID. I am confident that we have seen joins on junk
>> MGIDs for a long time, and it has been reported on this list (google...)
>> in the past, with no resolution yet.
>
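For anyone reading the status codes in dmesg, here is a minimal sketch that turns the negative status into a symbolic name. The table is hard-coded with the two Linux errno values mentioned above; the names LINUX_ERRNO and decode_status are only illustrative and are not part of any kernel or OFED tool.

```python
# Linux errno values quoted in the thread, hard-coded so the sketch
# behaves the same on any platform (errno numbers differ between OSes).
LINUX_ERRNO = {11: "EAGAIN", 22: "EINVAL"}


def decode_status(status: int) -> str:
    """Map a kernel-style negative status (e.g. -11) to its errno name."""
    code = -status if status < 0 else status
    return LINUX_ERRNO.get(code, f"unknown errno {code}")


if __name__ == "__main__":
    for status in (-11, -22):
        print(status, decode_status(status))
```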
> Or,
>
> I looked through the mailing list going back more than a year. The closest
> I can find to this issue (-EINVAL) was when you reported problems with a
> junk MGID on a child interface (and that works properly now).
>
> I agree that the -EAGAIN problem has been known for some time now.
> However, this issue with IPoIB bonding is new. My recollection is that it
> all worked properly around the end of October. I had not tested since then,
> so this is something that must have cropped up in the interim.
>
>>
>> Under bonding there might be a window in time where, from the kernel
>> network stack's perspective, the bonding device ether-type is Ethernet
>> and not InfiniBand, and hence the wrong function (ip_eth_mc_map instead
>> of ip_ib_mc_map) would be called to do the mapping from the IP
>> multicast address to the HW multicast address.
>>
>>
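To make the difference between the two mappings concrete, here is a rough sketch in Python. The functions eth_mc_map and ib_mc_map are my own stand-ins modelled on the kernel helpers named above, not the kernel code itself. The 20-byte broadcast address is an assumption built from the broadcast MGID in the -11 log line, and 224.0.0.1 is only an assumed example group. Under those assumptions, a 6-byte Ethernet-style address read as a 20-byte IPoIB address gives the same MGID as the one in the -22 log line, which is consistent with the explanation above but does not prove it.

```python
import ipaddress

# Assumed IPoIB broadcast address: 4-byte QPN field + the broadcast MGID
# ff12:401b:ffff:0000:0000:0000:ffff:ffff seen in the log.
BROADCAST = bytes.fromhex("00ffffff" "ff12401bffff0000" "00000000ffffffff")


def eth_mc_map(group: str) -> bytes:
    """Ethernet-style mapping: 01:00:5e followed by the low 23 bits."""
    addr = int(ipaddress.IPv4Address(group))
    return bytes([0x01, 0x00, 0x5E,
                  (addr >> 16) & 0x7F, (addr >> 8) & 0xFF, addr & 0xFF])


def ib_mc_map(group: str, broadcast: bytes = BROADCAST) -> bytes:
    """IPoIB-style mapping to a 20-byte address (QPN field + 16-byte MGID)."""
    addr = int(ipaddress.IPv4Address(group))
    hw = bytearray(20)
    hw[0:4] = b"\x00\xff\xff\xff"          # reserved byte + multicast QPN
    hw[4] = 0xFF
    hw[5] = 0x10 | (broadcast[5] & 0x0F)   # scope taken from broadcast address
    hw[6:8] = b"\x40\x1b"                  # IPv4 signature
    hw[8:10] = broadcast[8:10]             # P_Key
    hw[16:20] = (addr & 0x0FFFFFFF).to_bytes(4, "big")  # low 28 bits of group
    return bytes(hw)


def mgid_text(hw: bytes) -> str:
    """Format bytes 4..19 of a (zero-padded) 20-byte address as an MGID."""
    mgid = hw.ljust(20, b"\x00")[4:20]
    return ":".join(mgid[i:i + 2].hex() for i in range(0, 16, 2))


if __name__ == "__main__":
    print("ib  mapping:", mgid_text(ib_mc_map("224.0.0.1")))
    print("eth mapping:", mgid_text(eth_mc_map("224.0.0.1")))
```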
>>> Subsequently an ib-bond status does not reveal any slave as active, as
>>> shown below:
>>> ib-bond --status
>>> bond0: 80:00:04:04:fe:80:00:00:00:00:00:00:00:05:ad:00:00:03:05:b9
>>> slave0: ib0
>>> slave1: ib1
>>
>> As this script is not standard and deprecated, I would recommend not
>> to use it but rather the classic /proc/net/bonding/bond0 entry, along
>> with ip addr show on bond0, ib0, ib1
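As a small aid for checking the /proc entry, here is a sketch of a parser that pulls out the active slave and the per-slave MII status. The SAMPLE text is an illustrative excerpt I made up in the usual layout of that file, not output from the system discussed here, and parse_bonding is a hypothetical helper name.

```python
# Illustrative excerpt only; real output has more fields per section.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: ib0
MII Status: up

Slave Interface: ib0
MII Status: up

Slave Interface: ib1
MII Status: up
"""


def parse_bonding(text: str) -> dict:
    """Extract the active slave and each slave's MII status."""
    info = {"active_slave": None, "slaves": {}}
    current = None  # slave section currently being read, if any
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Currently Active Slave":
            info["active_slave"] = None if value == "None" else value
        elif key == "Slave Interface":
            current = value
            info["slaves"][current] = {}
        elif key == "MII Status" and current is not None:
            info["slaves"][current]["mii_status"] = value
    return info


if __name__ == "__main__":
    print(parse_bonding(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/net/bonding/bond0 instead of SAMPLE.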
> Thanks for alerting me to the fact that the ib-bond script was deprecated.
> Again, this all seemed to work about 6 weeks ago. Is that (ib-bond is
> deprecated) documented somewhere?
>
> Pradeep
>