[ofa-general] ib0: multicast join failed

Pawel Dziekonski dzieko at wcss.pl
Fri Aug 7 05:12:51 PDT 2009


On Fri, 07 Aug 2009 at 03:04:25PM +0300, Yossi Etigin wrote:
> On 07/08/09 14:25, Pawel Dziekonski wrote:
> > Hi,
> > 
> > today I got the following:
> > 
> > ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> > ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> > 
> > and the connection to Lustre was lost.
> > 
> > I can ping the IPoIB address of the local iface, but I can't ping any
> > remote IPoIB address.
> > There is plenty of free memory, so this is not the oom-killer case.
> > There are no other noticeable problems with this host.
> > 
> > Is it a hardware problem with the IB iface?
> 
> Is your SM alive?

Well, this is a good question.

My SM runs on a Voltaire ISR2012 switch. Today I lost contact with its web
interface, and I don't know why. The CLI works fine, and the network itself
works too, so I assume that the SM is working.

L:ISR2012-0004(utilities)# sminfo -m -e
[1249654331:530476][32061] => _do_madrpc: timeout after 3 retries, 600 ms
sm_lid:..........................1
sm_guid:.........................0x8f10500000007
sm_key:..........................0x0
sm_activity:.....................574044927
sm_priority:.....................14
sm_state:........................SMINFO_MASTER
nodeip:..........................
nodename:........................
node_guid:.......................0x8f10500000007
devid:...........................0x5a37
vendor:..........................0x8f1
node_desc:.......................ISR2012 Voltaire sFB-2012
node_type:.......................Switch
localport:.......................0
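
For anyone who wants to check this kind of output from a script, here is a minimal sketch that parses the dotted "key:....value" lines shown above and tests whether sm_state is SMINFO_MASTER. The function names (parse_dotted_fields, sm_reports_master) are made up for illustration, and the pattern is derived only from the lines in this post, so other firmware versions may print something different. Note that a MASTER state only says an SM is elected; it does not by itself show that it is answering multicast join requests.

```python
import re

# Sample copied from the sminfo output above (the switch CLI version).
SMINFO_OUTPUT = """\
sm_lid:..........................1
sm_guid:.........................0x8f10500000007
sm_key:..........................0x0
sm_activity:.....................574044927
sm_priority:.....................14
sm_state:........................SMINFO_MASTER
nodeip:..........................
"""

# Assumed line shape: name, colon, a run of dots, then an optional value.
FIELD_RE = re.compile(r"^(\w+):\.+(.*)$")


def parse_dotted_fields(text):
    """Return a dict of field name -> value (empty string if no value)."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def sm_reports_master(fields):
    """True if the parsed output claims the SM is in the master state."""
    return fields.get("sm_state") == "SMINFO_MASTER"


fields = parse_dotted_fields(SMINFO_OUTPUT)
print(fields["sm_lid"], fields["sm_state"], sm_reports_master(fields))
```

Lines that do not match the pattern, such as the _do_madrpc timeout message, are simply skipped.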

L:ISR2012-0004(utilities)# port-verify -b
[1249653810:282614][26657] => _do_madrpc: timeout after 3 retries, 600 ms
[1249653810:283115][26657] => madrpc: failed class 129 method 1 attr 17 DR Path: 0,18,24,13
[1249653810:283585][26657] => discover: Nodeinfo on 0,18,24,13 port 13 failed, skipping port
#
# Topology file: generated on Fri Aug  7 14:03:34 2009
#
Printing Chassis 1 (chassis guid 0x0008f10500000004)

devid=0x5a38
switchguids=0x8f104003f680a Chassis ISR2012 1 Line  9  Chip 1
Switch  24 "S-0008f104003f680a"         # "ISR2012/ISR2004 Voltaire sLB-2024" smalid 192
[13][ext 13] "S-0008f10400413b08"[11] width 4X speed 5.0 Gbs
errs.remphysrcv:.................6 <- Alert !!!

devid=0x5a30
switchguids=0x8f104004136c0
Switch  24 "S-0008f104004136c0"         # "ISR9024D Voltaire" smalid 209
[22] "S-0008f104003f680a"[22] width 4X speed 5.0 Gbs
errs.remphysrcv:.................6 <- Alert !!!

devid=0x5a30
switchguids=0x8f104004136b0
Switch  24 "S-0008f104004136b0"         # "ISR9024D Voltaire" smalid 204
[13] "S-000b8cffff002cc7"[12] width 4X speed 5.0 Gbs
errs.sym:........................752 <- Alert !!!
[24] "S-0008f104003f680a"[19] width 4X speed 5.0 Gbs
errs.sym:........................1 <- Alert !!!
errs.rcv:........................1 <- Alert !!!

devid=0x5a30
switchguids=0x8f10400413b08
Switch  24 "S-0008f10400413b08"         # "ISR9024D Voltaire" smalid 224
[13]     Alert -> Could not access this port Remote Peer.

devid=0xb924
switchguids=0xb8cffff002cc7
Switch  24 "S-000b8cffff002cc7"         # "MT47396 Infiniscale-III Mellanox Technologies" smalid 246
[12] "S-0008f104004136b0"[13] width 4X speed 5.0 Gbs
errs.sym:........................4 <- Alert !!!

devid=0x6732
hcaguids=0x2c90300031878
Hca     2 "H-0002c90300031878"          # "oss1 HCA-1"
[1] "S-0008f104004136c0"[9]     # lid 169 lmc 0 width 4X speed 5.0 Gbs
errs.remphysrcv:.................6 <- Alert !!!

SUMMARY: ALARM [found - 5 bad_nodes and 6 bad_ports].

(VL15Dropped errors masked out)
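
To see quickly which link is the worst, a report like the one above can be summarised per node and port. The following is only a sketch: the regular expressions are guessed from the lines in this post, the names collect_errors and worst_links are hypothetical, and the sample is a shortened copy of the output above.

```python
import re
from collections import defaultdict

# Shortened copy of the port-verify report above.
REPORT = """\
Switch  24 "S-0008f104004136b0"         # "ISR9024D Voltaire" smalid 204
[13] "S-000b8cffff002cc7"[12] width 4X speed 5.0 Gbs
errs.sym:........................752 <- Alert !!!
[24] "S-0008f104003f680a"[19] width 4X speed 5.0 Gbs
errs.sym:........................1 <- Alert !!!
errs.rcv:........................1 <- Alert !!!
Switch  24 "S-000b8cffff002cc7"         # "MT47396 Infiniscale-III Mellanox Technologies" smalid 246
[12] "S-0008f104004136b0"[13] width 4X speed 5.0 Gbs
errs.sym:........................4 <- Alert !!!
"""

NODE_RE = re.compile(r'^(?:Switch|Hca)\s+\d+\s+"([^"]+)"')  # node header line
PORT_RE = re.compile(r"^\[(\d+)\]")                          # local port line
ERR_RE = re.compile(r"^errs\.(\w+):\.+(\d+)")                # error counter line


def collect_errors(text):
    """Return {(node, port): {counter_name: value}} for counters in the report."""
    errors = defaultdict(dict)
    node = port = None
    for raw in text.splitlines():
        line = raw.strip()
        node_match = NODE_RE.match(line)
        port_match = PORT_RE.match(line)
        err_match = ERR_RE.match(line)
        if node_match:
            node, port = node_match.group(1), None
        elif port_match:
            port = int(port_match.group(1))
        elif err_match and node is not None and port is not None:
            errors[(node, port)][err_match.group(1)] = int(err_match.group(2))
    return dict(errors)


def worst_links(errors):
    """Sort (node, port) entries by the sum of their counters, highest first."""
    return sorted(errors.items(), key=lambda item: sum(item[1].values()), reverse=True)


errors = collect_errors(REPORT)
for (node, port), counters in worst_links(errors):
    print(node, port, counters)
```

On this sample the first entry is port 13 of S-0008f104004136b0, the link with 752 symbol errors. The "Could not access this port" alert carries no counter, so this sketch does not pick it up.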

-- 
Pawel Dziekonski <pawel.dziekonski at wcss.pl>
Wroclaw Centre for Networking & Supercomputing, HPC Department
Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND
phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl


