[ofa-general] Re: multicast join failed for...

Hal Rosenstock halr at voltaire.com
Wed Apr 11 07:00:02 PDT 2007


On Wed, 2007-04-11 at 05:49, Michael S. Tsirkin wrote:
> > Quoting Hal Rosenstock <halr at voltaire.com>:
> > Subject: Re: multicast join failed for...
> > 
> > On Mon, 2007-04-09 at 18:47, Egor Tur wrote:
> > > Hi folk.
> > > 
> > > > > ib1: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 
> > > > > ib0: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0001, status -22 
> > > > >  
> > > > > And in osm.log: 
> > > > > Apr 09 21:33:50 658439 [42003960] -> __osm_mcmr_rcv_join_mgrp: ERR 1B12: __validate_more_comp_fields, 
> > > > > __validate_port_caps, or JoinState = 0 failed from port 0x001708ffffd15099 (HP Lion Cub DDR 128MB), 
> > > > > sending IB_SA_MAD_STATUS_REQ_INVALID 
> > > > >  
> > > > OpenSM ERR 1B12 means that the rate or MTU of the port was incompatible
> > > > with the MC group. You could turn on -V with OpenSM and see more log
> > > > messages as to what is going on wrong from the SM's perspective.
> > > 
> > > Ok. This from osm.log with -V :
> > >  
> > > Apr 10 00:56:06 390007 [44007960] -> __osm_sa_mad_ctrl_process: [                                                              
> > > Apr 10 00:56:06 390016 [44007960] -> __osm_sa_mad_ctrl_process: Posting Dispatcher message OSM_MSG_MAD_MCMEMBER_RECORD         
> > > Apr 10 00:56:06 390027 [44007960] -> __osm_sa_mad_ctrl_process: ]                                                              
> > > Apr 10 00:56:06 390033 [44007960] -> __osm_sa_mad_ctrl_rcv_callback: ]                                                         
> > > Apr 10 00:56:06 390046 [41001960] -> osm_mcmr_rcv_process: [                                                                   
> > > Apr 10 00:56:06 390054 [41001960] -> __osm_mcmr_rcv_join_mgrp: [                                                               
> > > Apr 10 00:56:06 390060 [41001960] -> __osm_mcmr_rcv_join_mgrp: Dump of incoming record                                         
> > > Apr 10 00:56:06 390065 [41001960] -> MCMember Record dump:                                                                     
> > >                                 MGID....................0xff12601bffff0000 : 0x0000000000000001                                
> > >                                 PortGid.................0xfe80000000000000 : 0x001708ffffd1509a                                
> > >                                 qkey....................0xB1B                                                                  
> > >                                 mlid....................0x0                                                                    
> > >                                 mtu.....................0x84                                                                   
> > >                                 TClass..................0x0                                                                    
> > >                                 pkey....................0xFFFF                                                                 
> > >                                 rate....................0x83                                                                   
> > >                                 pkt_life................0x0                                                                    
> > >                                 SLFlowLabelHopLimit.....0x0                                                                    
> > >                                 ScopeState..............0x1                                                                    
> > >                                 ProxyJoin...............0x0                                                                    
> > > Apr 10 00:56:06 390084 [41001960] -> __validate_more_comp_fields: Requested RATE 6 is not equal to 3
> > 
> > Rate 6 is 20 Gb/sec whereas 3 is 10 Gb/sec. So the port is 4x DDR (rate
> > 6) and the group is 4x SDR. The request requires the rate to match
> > exactly, so it fails.
> 
> 
> BTW, the only reason I know for IPoIB to request a specific rate
> is if the broadcast multicast group has that rate. Roland, is that right?

The IPoIB RFC says that non-broadcast multicast groups must use the same
parameters as those used in the broadcast group. It also looks to me
like that is what is implemented in the code.

> So, how come the broadcast multicast group has rate DDR, but a specific
> group has lower rate?

I think the error is not that the broadcast and some non-broadcast groups
have different rates, but that the SM is rejecting the join of a DDR
port to an SDR group. I wonder whether the broadcast group was formed
properly (I asked about that but haven't heard back yet).
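
For anyone reading the record dump above: as far as I understand the
MCMemberRecord layout, the mtu and rate bytes each pack a 2-bit selector
in the top bits and a 6-bit value in the low bits. Below is a minimal
sketch of decoding them. The function names are made up for
illustration, and the lookup tables are my reading of the IBA encodings,
so treat them as assumptions rather than as what OpenSM does internally.

```python
# Sketch only: decode the packed selector/value bytes shown in an
# MCMemberRecord dump. Tables are assumed from the IBA encodings.
SELECTORS = {0: "greater than", 1: "less than", 2: "exactly", 3: "largest available"}
RATES_GBPS = {2: 2.5, 3: 10, 4: 30, 5: 5, 6: 20, 7: 40, 8: 60, 9: 80, 10: 120}
MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}


def split_field(raw):
    """Return (selector, value): top 2 bits and low 6 bits of the byte."""
    return raw >> 6, raw & 0x3F


def describe_rate(raw):
    """Human-readable form of a packed rate byte (hypothetical helper)."""
    selector, value = split_field(raw)
    return f"{SELECTORS[selector]} {RATES_GBPS.get(value, '?')} Gb/sec"


def describe_mtu(raw):
    """Human-readable form of a packed MTU byte (hypothetical helper)."""
    selector, value = split_field(raw)
    return f"{SELECTORS[selector]} {MTU_BYTES.get(value, '?')} bytes"


if __name__ == "__main__":
    print("rate 0x83:", describe_rate(0x83))
    print("mtu  0x84:", describe_mtu(0x84))
```

With the values from the dump, rate 0x83 decodes to selector 2
("exactly") with rate 3 (10 Gb/sec), and mtu 0x84 to selector 2 with
MTU 4 (2048 bytes), which is consistent with the "is not equal to"
message in the log.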

-- Hal



