[ofa-general] Intermittent: ib0: multicast join failed

Nathan Dauchy Nathan.Dauchy at noaa.gov
Wed Oct 1 12:05:55 PDT 2008


Hal Rosenstock wrote:
> On Mon, Sep 22, 2008 at 2:43 PM, Roger Spellman <roger at terascala.com> wrote:
>> Thanks, Hal.
>>
>> Below is the output of ibstat and ibstatus.  It shows that the rate is
>> 2.5 Gb/sec, rather than 10 Gb/sec.
>>
>> Is there a way to get it to renegotiate the rate, short of rebooting?
> 
> Try ibportstate reset on the switch peer port. You could also replug
> the cable on that link.

Hal,

Is there an easy way to determine the switch peer port from the node itself?

[root@h118 ~]# ibportstate -D 0 1 speed 2
Initial PortInfo:
# Port info: DR path 0 port 1
LinkSpeedEnabled:................2.5 Gbps

After PortInfo set:
# Port info: DR path 0 port 1
LinkSpeedEnabled:................5.0 Gbps

[root@h118 ~]# ibportstate -D 0 1 reset
ibportstate: iberror: failed: smp query nodeinfo: Node type not switch

I guess I am looking for more detailed documentation on how to craft the
"direct route path".


>>> It's likely a rate issue where the negotiated port rate is not the
>>> broadcast group rate.
> 
> Yes, it's a rate problem (the link is coming up as 1X SDR, which is 2.5
> Gbps, whereas I suspect that the group is 10 Gbps, so it can't join).
> 
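
As a back-of-the-envelope illustration of the rate mismatch Hal describes:
a link's rate is its width times the per-lane speed, so 1X SDR is 2.5 Gbps
and 4X SDR is 10 Gbps.  The helper names below are made up for
illustration, the 10 Gbps group rate is Hal's guess rather than something
read from this fabric, and the "port rate must be at least the group rate"
check is a simplification of what the SM actually evaluates.

```python
# Per-lane signalling rate in Gbps for each InfiniBand link speed.
LANE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}


def link_rate_gbps(width, speed):
    """Return the link rate for a given width (1, 4, 8, 12) and speed."""
    return width * LANE_GBPS[speed]


def can_join(port_rate_gbps, group_rate_gbps):
    """Simplified check (assumption): the port must be at least as fast
    as the multicast group's rate to be admitted."""
    return port_rate_gbps >= group_rate_gbps


degraded = link_rate_gbps(1, "SDR")   # link that trained at 1X SDR
healthy = link_rate_gbps(4, "SDR")    # the same HCA at its normal 4X SDR
group_rate = 10.0                     # assumed group rate (Hal's guess)

print(degraded, can_join(degraded, group_rate))
print(healthy, can_join(healthy, group_rate))
```

With these numbers the 1X link fails the check and the 4X link passes,
which would be consistent with the join working again once the link
retrains at full width.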

I think we are seeing something similar on our mixed SDR/DDR network.
All switches are DDR, but ~390 hosts are SDR and ~260 are DDR.  Messages
like the following show up in "osm.log":

Oct 01 03:31:05 514600 [42803940] 0x01 -> __osm_mcmr_rcv_join_mgrp: ERR
1B11: method = SubnAdmSet, scope_state = 0x1, component mask =
0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
0xff12601bffff0000 : 0x0000000000000016 from port 0x0002c90200224d91
(MT25218 InfiniHostEx Mellanox Technologies)
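
To see what the two component masks in that message differ by, here is a
small Python sketch that decodes them.  The bit-to-field table is my
recollection of the MCMemberRecord component mask layout (bit 0 = MGID up
through bit 17 = ProxyJoin); treat it as an assumption and check it
against the IBA spec or the OpenSM headers before relying on it.

```python
# MCMemberRecord component mask fields, index = bit position.
# Assumption: this ordering is recalled from the IBA spec, not verified here.
MCR_FIELDS = [
    "MGID", "PortGID", "Q_Key", "MLID", "MTUSelector", "MTU",
    "TClass", "P_Key", "RateSelector", "Rate",
    "PacketLifeTimeSelector", "PacketLifeTime", "SL", "FlowLabel",
    "HopLimit", "Scope", "JoinState", "ProxyJoin",
]


def decode_comp_mask(mask):
    """Return the names of the fields whose bits are set in mask."""
    return [name for bit, name in enumerate(MCR_FIELDS) if mask & (1 << bit)]


sent = 0x0000000000010083       # "component mask" from the log line
expected = 0x00000000000130C7   # "expected comp mask" from the log line
missing = expected & ~sent      # bits the SM wanted but did not get

print(decode_comp_mask(sent))     # ['MGID', 'PortGID', 'P_Key', 'JoinState']
print(decode_comp_mask(missing))  # ['Q_Key', 'TClass', 'SL', 'FlowLabel']
```

If that table is right, the request carried only MGID, PortGID, P_Key and
JoinState, while the SM also expected Q_Key, TClass, SL and FlowLabel.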

Is there a way to configure the hosts, switches, or subnet manager to
avoid this error?

Olga Shern's posting implies it is not a real problem and that
subsequent multicast joins succeed.  Perhaps the logging could be changed
to emit only a "warning" for the first failure, and an "error" only if the
join doesn't succeed within some number of tries or some number of
seconds?  Just a thought.

Thanks,
Nathan



