[ofa-general] ib0: multicast join failed
Pawel Dziekonski
dzieko at wcss.pl
Fri Aug 7 05:12:51 PDT 2009
On Fri, 07 Aug 2009 at 03:04:25PM +0300, Yossi Etigin wrote:
> On 07/08/09 14:25, Pawel Dziekonski wrote:
> > Hi,
> >
> > today I got the following:
> >
> > ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> > ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> >
> > and connection to Lustre was lost.
> >
> > I can ping the IPoIB address of the local interface, but I can't ping any
> > remote IPoIB address.
> > There is plenty of free memory, so this is not an OOM-killer case.
> > There are no other noticeable problems with this host.
> >
> > Is it a hardware problem with the IB interface?
>
> Is your SM alive?
Well, this is a good question.
My SM runs on a Voltaire ISR2012 switch. Today I lost contact with its web
interface, and I don't know why. The CLI works fine and the network itself
works too, so I assume that the SM is working.
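
For context on why the SM matters for the quoted error: IPoIB multicast joins are requests to the subnet administrator, which normally runs alongside the SM. The sketch below decodes the MGID from the log line using the RFC 4391 layout for IPoIB groups. The function name `decode_ipoib_mgid` and the small errno table are illustrative, not part of any tool. Reading "status -11" as a negated Linux errno is an assumption.

```python
import ipaddress

# Assumption: the kernel reports "status" as a negated Linux errno value.
LINUX_ERRNO = {11: "EAGAIN", 110: "ETIMEDOUT"}


def decode_ipoib_mgid(mgid: str) -> dict:
    """Split an IPoIB multicast GID into the fields described in RFC 4391."""
    groups = [int(g, 16) for g in mgid.split(":")]
    if len(groups) != 8:
        raise ValueError("expected 8 colon-separated 16-bit groups")
    prefix = groups[0]
    return {
        "is_multicast": (prefix >> 8) == 0xFF,   # top byte 0xff marks multicast
        "flags": (prefix >> 4) & 0xF,
        "scope": prefix & 0xF,                   # 2 = link-local
        "is_ipv4": groups[1] == 0x401B,          # 0x401b = IPv4 signature
        "pkey": hex(groups[2]),
        # Low 32 bits shown as dotted quad; for ordinary multicast groups
        # only the low 28 bits carry the IP group address.
        "group_addr": str(ipaddress.IPv4Address((groups[6] << 16) | groups[7])),
    }


if __name__ == "__main__":
    info = decode_ipoib_mgid("ff12:401b:ffff:0000:0000:0000:ffff:ffff")
    print(info)
    print("status -11 ->", LINUX_ERRNO.get(11, "unknown"))
```

For the MGID in the log this yields scope 2, P_Key 0xffff and group address 255.255.255.255, which is the IPoIB IPv4 broadcast group on the default partition. If that join fails, no ARP traffic can pass, which would match local ping working while remote ping fails.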
L:ISR2012-0004(utilities)# sminfo -m -e
[1249654331:530476][32061] => _do_madrpc: timeout after 3 retries, 600 ms
sm_lid:..........................1
sm_guid:.........................0x8f10500000007
sm_key:..........................0x0
sm_activity:.....................574044927
sm_priority:.....................14
sm_state:........................SMINFO_MASTER
nodeip:..........................
nodename:........................
node_guid:.......................0x8f10500000007
devid:...........................0x5a37
vendor:..........................0x8f1
node_desc:.......................ISR2012 Voltaire sFB-2012
node_type:.......................Switch
localport:.......................0
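
A minimal sketch for checking this kind of output from a script, assuming the `key:....value` layout shown above. The helper name `parse_sminfo` is hypothetical. It also collects any MAD timeout lines, since the output above reports SMINFO_MASTER but still logs one timeout.

```python
import re

# "key:" followed by a run of dots, then an optional value.
FIELD = re.compile(r"^(\w+):\.+(.*)$")


def parse_sminfo(text: str):
    """Return (fields, warnings) from dotted key/value output."""
    fields, warnings = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
        elif "timeout" in line:
            # Keep timeout messages so they are not silently ignored.
            warnings.append(line)
    return fields, warnings


if __name__ == "__main__":
    sample = """\
[1249654331:530476][32061] => _do_madrpc: timeout after 3 retries, 600 ms
sm_lid:..........................1
sm_state:........................SMINFO_MASTER
nodeip:..........................
"""
    fields, warnings = parse_sminfo(sample)
    print(fields.get("sm_state"), "| timeouts seen:", len(warnings))
```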
L:ISR2012-0004(utilities)# port-verify -b
[1249653810:282614][26657] => _do_madrpc: timeout after 3 retries, 600 ms
[1249653810:283115][26657] => madrpc: failed class 129 method 1 attr 17 DR Path: 0,18,24,13
[1249653810:283585][26657] => discover: Nodeinfo on 0,18,24,13 port 13 failed, skipping port
#
# Topology file: generated on Fri Aug 7 14:03:34 2009
#
Printing Chassis 1 (chassis guid 0x0008f10500000004)
devid=0x5a38
switchguids=0x8f104003f680a Chassis ISR2012 1 Line 9 Chip 1
Switch 24 "S-0008f104003f680a" # "ISR2012/ISR2004 Voltaire sLB-2024" smalid 192
[13][ext 13] "S-0008f10400413b08"[11] width 4X speed 5.0 Gbs
errs.remphysrcv:.................6 <- Alert !!!
devid=0x5a30
switchguids=0x8f104004136c0
Switch 24 "S-0008f104004136c0" # "ISR9024D Voltaire" smalid 209
[22] "S-0008f104003f680a"[22] width 4X speed 5.0 Gbs
errs.remphysrcv:.................6 <- Alert !!!
devid=0x5a30
switchguids=0x8f104004136b0
Switch 24 "S-0008f104004136b0" # "ISR9024D Voltaire" smalid 204
[13] "S-000b8cffff002cc7"[12] width 4X speed 5.0 Gbs
errs.sym:........................752 <- Alert !!!
[24] "S-0008f104003f680a"[19] width 4X speed 5.0 Gbs
errs.sym:........................1 <- Alert !!!
errs.rcv:........................1 <- Alert !!!
devid=0x5a30
switchguids=0x8f10400413b08
Switch 24 "S-0008f10400413b08" # "ISR9024D Voltaire" smalid 224
[13] Alert -> Could not access this port Remote Peer.
devid=0xb924
switchguids=0xb8cffff002cc7
Switch 24 "S-000b8cffff002cc7" # "MT47396 Infiniscale-III Mellanox Technologies" smalid 246
[12] "S-0008f104004136b0"[13] width 4X speed 5.0 Gbs
errs.sym:........................4 <- Alert !!!
devid=0x6732
hcaguids=0x2c90300031878
Hca 2 "H-0002c90300031878" # "oss1 HCA-1"
[1] "S-0008f104004136c0"[9] # lid 169 lmc 0 width 4X speed 5.0 Gbs
errs.remphysrcv:.................6 <- Alert !!!
SUMMARY: ALARM [found - 5 bad_nodes and 6 bad_ports].
(VL15Dropped errors masked out)
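
To see which links stand out, the report above can be reduced to a list of error counters sorted by value. This is a rough sketch that assumes the line layout shown in this output (a `Switch`/`Hca` line, then `[port]` lines, then `errs.<name>` lines). The function name `rank_port_errors` is made up for illustration.

```python
import re

NODE = re.compile(r'^(?:Switch|Hca)\s+\d+\s+"([^"]+)"')
PORT = re.compile(r"^\[(\d+)\]")
ERR = re.compile(r"^errs\.(\w+):\.+(\d+)")


def rank_port_errors(text: str):
    """Return (node, port, counter, value) tuples, largest value first."""
    node, port, found = None, None, []
    for raw in text.splitlines():
        line = raw.strip()
        if (m := NODE.match(line)):
            node, port = m.group(1), None      # new node resets the port
        elif (m := PORT.match(line)):
            port = int(m.group(1))
        elif (m := ERR.match(line)):
            found.append((node, port, m.group(1), int(m.group(2))))
    return sorted(found, key=lambda item: item[3], reverse=True)


if __name__ == "__main__":
    sample = '''\
Switch 24 "S-0008f104004136b0" # "ISR9024D Voltaire" smalid 204
[13] "S-000b8cffff002cc7"[12] width 4X speed 5.0 Gbs
errs.sym:........................752 <- Alert !!!
[24] "S-0008f104003f680a"[19] width 4X speed 5.0 Gbs
errs.sym:........................1 <- Alert !!!
'''
    for entry in rank_port_errors(sample):
        print(entry)
```

Applied to the full report, the largest counter is the 752 symbol errors on port 13 of S-0008f104004136b0, the link towards the Mellanox switch.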
--
Pawel Dziekonski <pawel.dziekonski at wcss.pl>
Wroclaw Centre for Networking & Supercomputing, HPC Department
Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND
phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl