[ofa-general] GPFS node loses IB-connection
SEGERS Koen
Koen.SEGERS at VRT.BE
Mon May 21 06:04:08 PDT 2007
Hi,
We are running GPFS over SDP, using OFED 1.2-rc1. The machines are IBM x3755s and x3655s, the IB switch is an SFS-7000P, and the HCAs are all "Mellanox Technologies MT25208 InfiniHost III Ex (rev a0)".
Under heavy load, we sometimes lose a node from our GPFS cluster.
The machine that lost its connection (10.224.158.104, i.e. gpfswhbe1s1) logged this error:
May 18 13:02:51 gpfswhbe1s1 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=4997901: Reason code 668 Failure Reason Lost membership in cluster enterprise.universe. Unmounting file systems.
May 18 13:02:51 gpfswhbe1s1 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=4997901:
After this, we got the following messages on some of the nodes in the cluster (including the failing node):
GPFS Deadman Switch timer [0] has expired; IOs in progress: 0
Badness in do_exit at kernel/exit.c:807
Call Trace: <ffffffff80133370>{do_exit+80} <ffffffff80133c17>{sys_exit_group+0}
<ffffffff8010a7be>{system_call+126}
Badness in do_exit at kernel/exit.c:807
Call Trace: <ffffffff80133370>{do_exit+80} <ffffffff80133c17>{sys_exit_group+0}
<ffffffff8010a7be>{system_call+126}
idr_remove called for id=0 which is not allocated.
Call Trace: <ffffffff801e5ac0>{idr_remove+228} <ffffffff80180904>{kill_anon_super+41}
<ffffffff8018099a>{deactivate_super+111} <ffffffff8019418b>{sys_umount+624}
<ffffffff8018303c>{sys_newstat+25} <ffffffff8017b7c8>{__fput+348}
<ffffffff80193b3d>{mntput_no_expire+25} <ffffffff80178eb3>{filp_close+89}
<ffffffff8010a7be>{system_call+126}
Not all of the nodes print the call trace at the end.
GPFS then gives the following errors:
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.1 cic-gpfswhbe1n1
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.1 cic-gpfswhbe1n1
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.15 cic-gpfswhbe1s2
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.15 cic-gpfswhbe1s2
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.16 cic-gpfswhbe1s3
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.16 cic-gpfswhbe1s3
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.17 cic-gpfswhbe1s4
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.17 cic-gpfswhbe1s4
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.2 cic-gpfswhbe1n2
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.2 cic-gpfswhbe1n2
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.3 cic-gpfswhbe1n3
10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.3 cic-gpfswhbe1n3
10.224.158.104 Fri May 18 13:02:51 2007: Lost membership in cluster enterprise.universe. Unmounting file systems.
10.224.158.104 Fri May 18 13:02:51 2007: Lost membership in cluster enterprise.universe. Unmounting file systems.
10.224.158.106 Fri May 18 13:02:53 2007: Close connection to 192.168.1.14 cic-gpfswhbe1s1
10.224.158.106 Fri May 18 13:02:53 2007: Close connection to 192.168.1.14 cic-gpfswhbe1s1
etc.
We found this in the logs of the switch:
May 18 11:02:51 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM OUT_OF_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:98:d1
May 18 11:02:51 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:98:d1
May 18 11:02:51 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by discovering removed ports
May 18 11:02:51 topspin-120sc ib_sm.x[628]: %IB-6-INFO: Program switch port state to down, node=00:05:ad:00:00:0b:a2:cc, port= 1, due to non-responding CA
May 18 11:02:51 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port down - port=1/1, type=ib4xTXP
May 18 11:02:51 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: in portTblFindEntry() - IfIndex=65(1/1)
May 18 11:02:51 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: cannot find entry - IfIndex=65(1/1)
May 18 11:02:52 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by discovering new ports
May 18 11:02:52 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
May 18 11:02:52 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:98:d1
May 18 11:02:53 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port up - port=1/1, type=ib4xTXP
May 18 11:02:54 topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:98:d1
May 18 11:02:54 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
We are sure this is gpfswhbe1s1, because the GUID in those traps is the node_guid of its HCA plus one, i.e. the GUID of port 1 (a quick check is sketched after the output below):
gpfswhbe1s1:~ # ibv_devinfo
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
hca_id: mthca0
        fw_ver:             5.1.0
        node_guid:          0005:ad00:0008:98d0
        sys_image_guid:     0005:ad00:0008:98d3
        vendor_id:          0x05ad
        vendor_part_id:     25218
        hw_ver:             0xA0
        board_id:           HCA.LionMini.A0
        phys_port_cnt:      2
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         2
                        port_lid:       6
                        port_lmc:       0x00
                port:   2
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         2
                        port_lid:       4
                        port_lmc:       0x00
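For reference, this is the arithmetic we used to match the trap GID to this HCA. It is just a small Python sketch over the numbers above; the only assumption is the usual Mellanox convention that the port GUID of port N is the node GUID plus N (which also fits the sys_image_guid ending in ...98d3 above):

    # Compare the low 64 bits of the GID from the SM OUT_OF_SERVICE trap
    # with our node_guid. Assumption: port GUID = node GUID + port number.
    node_guid = 0x0005ad00000898d0   # from ibv_devinfo on gpfswhbe1s1
    trap_gid = "fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:98:d1"  # from the switch log

    port_guid = int("".join(trap_gid.split(":")[8:]), 16)  # interface-ID half of the GID
    for port in (1, 2):
        if node_guid + port == port_guid:
            print("trap GID matches port %d of gpfswhbe1s1" % port)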
Does anyone have a clue what happened?
The error does not come up very often, so we can't reproduce it easily.
We believe the HCA in gpfswhbe1s1 caused the problem, but we can't really see it.
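One thing we may try next is to keep an eye on the error counters of that port and log any change, so a link flap can be correlated with the next GPFS expel. Just a rough sketch, assuming perfquery from the OFED infiniband-diags package is installed; the 30-second interval is arbitrary and LID 6 / port 1 are taken from the ibv_devinfo output above:

    # Poll the IB port counters of the suspect port and report any change,
    # e.g. a LinkDowned or SymbolError increment around the time of an expel.
    import subprocess, time

    LID, PORT = "6", "1"   # port 1 of gpfswhbe1s1

    def counters():
        out = subprocess.check_output(["perfquery", LID, PORT]).decode()
        return dict(line.split(":", 1) for line in out.splitlines() if ":" in line)

    prev = counters()
    while True:
        time.sleep(30)
        cur = counters()
        for name, value in cur.items():
            if prev.get(name) != value:
                print(time.ctime(), name, "->", value.strip("."))
        prev = cur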
All help is appreciated!
Regards,
Koen