Re: [ofa-general] Intermittent: ib0: multicast join failed

Hal Rosenstock hal.rosenstock at gmail.com
Wed Oct 1 12:57:45 PDT 2008


On Wed, Oct 1, 2008 at 3:05 PM, Nathan Dauchy <Nathan.Dauchy at noaa.gov> wrote:
> Hal Rosenstock wrote:
>> On Mon, Sep 22, 2008 at 2:43 PM, Roger Spellman <roger at terascala.com> wrote:
>>> Thanks, Hal.
>>>
>>> Below is the output to ibstat and ibstatus.  It shows that the rate is
>>> 2.5 Gb/sec, rather than 10 Gb/sec.
>>>
>>> Is there a way to get it to renegotiate the rate, short of rebooting?
>>
>> Try ibportstate reset on the switch peer port. You could also replug
>> the cable on that link.
>
> Hal,
>
> Is there an easy way to determine the switch peer port from the node itself?

Are you connected to a chassis switch or a simple switch such as a
24-port one? You may be able to tell which port it is from the face plate.

A command-based way from that host for your configuration appears to be:
smpquery portinfo -D 0,1 | grep LocalPort
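
If you would rather script this than eyeball the grep output, here is a
minimal Python sketch that pulls one field out of text in the dotted
"Name:.....value" layout the diags print. The function name portinfo_field
is mine, and the LocalPort value 7 in the sample is made up; only the
LinkSpeedEnabled line is taken from this thread.

```python
import re


def portinfo_field(text, name):
    """Return the value of a 'Name:.....value' line, or None if absent."""
    pattern = re.compile(
        r"^\s*" + re.escape(name) + r":\.*(.*)$", re.MULTILINE
    )
    match = pattern.search(text)
    return match.group(1).strip() if match else None


# Hypothetical sample: the LocalPort line and its value are invented for
# illustration; the LinkSpeedEnabled line is copied from this thread.
sample = (
    "LocalPort:.......................7\n"
    "LinkSpeedEnabled:................2.5 Gbps\n"
)

print(portinfo_field(sample, "LocalPort"))
print(portinfo_field(sample, "LinkSpeedEnabled"))
```

With the peer port number in hand, the reset would then be addressed to
the switch rather than the HCA, along the lines of
"ibportstate -D 0,1 <port> reset". Treat that invocation as a sketch and
check it against your ibportstate version.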

> [root@h118 ~]# ibportstate -D 0 1 speed 2
> Initial PortInfo:
> # Port info: DR path 0 port 1
> LinkSpeedEnabled:................2.5 Gbps
>
> After PortInfo set:
> # Port info: DR path 0 port 1
> LinkSpeedEnabled:................5.0 Gbps
>
> [root@h118 ~]# ibportstate -D 0 1 reset
> ibportstate: iberror: failed: smp query nodeinfo: Node type not switch
>
> I guess I am looking for more detailed documentation on how to craft the
> "direct route path".

The InfiniBand Architecture Specification (IBA) 1.2.1, section 14.2.2, is
the definitive source on directed route SMPs.
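
As a rough illustration of how the diags read the path: as I understand
the convention, it is a comma-separated list of exit ports that starts
with 0 for the local node. So "0" is the HCA itself, which is why the
reset above complained "Node type not switch", and "0,1" is whatever is
reached by leaving local port 1. The helper names below are mine, not
part of any tool.

```python
def parse_dr_path(path):
    """Return the exit ports of a directed-route path like '0,1' as a list."""
    hops = [int(part) for part in path.split(",")]
    if hops[0] != 0:
        # Assumption: the diags always write the local node as a leading 0.
        raise ValueError("path must start with 0, the local node")
    return hops[1:]


def describe_dr_path(path):
    """Say in words which node a directed-route path addresses."""
    exits = parse_dr_path(path)
    if not exits:
        return "the local node itself"
    return "the node reached by exiting port " + ", then port ".join(
        str(port) for port in exits
    )


print(describe_dr_path("0"))
print(describe_dr_path("0,1"))
```

In "ibportstate -D 0 1 reset" the path is just "0" and the trailing 1 is
the port number on that node, so the command addresses port 1 of the
local HCA, not the switch.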

>>>> It's likely a rate issue where the negotiated port rate is not the
>>>> broadcast group rate.
>>
>> Yes, it's a rate problem (the link is coming up at 1X SDR, which is 2.5
>> Gbps, whereas I suspect that the group is 10 Gbps, so it can't join).
>>
>
> I think we are seeing something similar on our mixed SDR/DDR network.
> All switches are DDR, but ~390 hosts are SDR, ~260 are SDR.  Messages
> like the following show up in "osm.log":
>
> Oct 01 03:31:05 514600 [42803940] 0x01 -> __osm_mcmr_rcv_join_mgrp: ERR
> 1B11: method = SubnAdmSet, scope_state = 0x1, component mask =
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
> 0xff12601bffff0000 : 0x0000000000000016 from port 0x0002c90200224d91
> (MT25218 InfiniHostEx Mellanox Technologies)

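That's an IPv6 group (as indicated by the 0x601b in the MGID). Are you
using IPv6? If not, you can ignore this. It's not a rate issue; it's a
creation issue: the group does not exist yet, and the join request's
component mask (0x10083) lacks fields the SM expects (0x130c7) before it
will create the group.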
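
For reference, here is a small Python sketch of how the MGID in that log
line breaks down under the IPoIB layout in RFC 4391 (0xff, flags, scope,
16-bit signature, 16-bit P_Key, 80-bit group ID). The function name is
mine; the two 64-bit halves are the ones from the osm.log line above.

```python
# IPoIB signatures defined by RFC 4391.
IPOIB_SIGNATURES = {0x401B: "IPv4", 0x601B: "IPv6"}


def decode_ipoib_mgid(prefix, suffix):
    """Split an MGID, given as two 64-bit halves, into its IPoIB fields."""
    mgid = (prefix << 64) | suffix
    signature = (mgid >> 96) & 0xFFFF
    return {
        "is_multicast": (mgid >> 120) == 0xFF,
        "flags": (mgid >> 116) & 0xF,
        "scope": (mgid >> 112) & 0xF,
        "signature": signature,
        "protocol": IPOIB_SIGNATURES.get(signature, "unknown"),
        "pkey": (mgid >> 80) & 0xFFFF,
        "group_id": mgid & ((1 << 80) - 1),
    }


info = decode_ipoib_mgid(0xFF12601BFFFF0000, 0x0000000000000016)
print(info["protocol"])       # IPv6
print(hex(info["pkey"]))      # 0xffff
print(hex(info["group_id"]))  # 0x16
```

The group ID 0x16 corresponds to ff02::16, so this looks like the host's
IPv6 stack trying to join a link-local group on its own.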

> Is there a way to configure the hosts, switches, or subnet manager to
> avoid this error?

If you are not using IPv6, turn it off.

> Olga Shern's posting implies it is not a real problem and that
> subsequent multicast joins succeed.  Perhaps an update could be made to
> only log a "warning" for the first failure and "error" if it doesn't
> join successfully within some number of tries or some number of seconds?

This is not easy IMO, as those messages carry useful and differing
information (group, port, etc.) that means different things to the
network admin. It ends up being a tradeoff between too much in the log
and too little. Some people just want one message and others want to see
all the failures. It also isn't easy to track whether a certain message
has already been logged.

-- Hal

>  Just a thought.
>
> Thanks,
> Nathan