<HTML dir=ltr><HEAD>

<META http-equiv=Content-Type content="text/html; charset=unicode">

<META content="MSHTML 6.00.2900.3086" name=GENERATOR></HEAD>

<BODY>

<DIV><FONT face=Arial color=#000000 size=2>Hi,</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>We are running GPFS over SDP, using OFED 1.2-rc1. The machines are IBM x3755s and x3655s. The IB switch is an SFS-7000P. The HCAs are all "Mellanox Technologies MT25208 InfiniHost III Ex (rev a0)".</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>Under heavy load, we sometimes lose a node from our GPFS cluster.</FONT><FONT face=Arial size=2></DIV>

<DIV>

<DIV><FONT face=Arial size=2>The machine that lost its connection (10.224.158.104, gpfswhbe1s1) logged this error:</FONT></DIV>

<DIV>May 18 13:02:51 gpfswhbe1s1 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=4997901:   Reason code 668 Failure Reason Lost membership in cluster enterprise.universe. Unmounting file systems.<BR>May 18 13:02:51 gpfswhbe1s1 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=4997901:   </DIV></FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>After this, we got the following message on some of the nodes that are part of the cluster (including the failing node):</FONT></DIV>

<DIV><FONT face=Arial size=2>GPFS Deadman Switch timer [0] has expired; IOs in progress: 0<BR>Badness in do_exit at kernel/exit.c:807</FONT></DIV>

<DIV><FONT face=Arial size=2>Call Trace: &lt;ffffffff80133370&gt;{do_exit+80} &lt;ffffffff80133c17&gt;{sys_exit_group+0}<BR>       &lt;ffffffff8010a7be&gt;{system_call+126}<BR>Badness in do_exit at kernel/exit.c:807</FONT></DIV>

<DIV><FONT face=Arial size=2>Call Trace: &lt;ffffffff80133370&gt;{do_exit+80} &lt;ffffffff80133c17&gt;{sys_exit_group+0}<BR>       &lt;ffffffff8010a7be&gt;{system_call+126}<BR>idr_remove called for id=0 which is not allocated.</FONT></DIV>

<DIV><FONT face=Arial size=2>Call Trace: &lt;ffffffff801e5ac0&gt;{idr_remove+228} &lt;ffffffff80180904&gt;{kill_anon_super+41}<BR>       &lt;ffffffff8018099a&gt;{deactivate_super+111} &lt;ffffffff8019418b&gt;{sys_umount+624}<BR>       &lt;ffffffff8018303c&gt;{sys_newstat+25} &lt;ffffffff8017b7c8&gt;{__fput+348}<BR>       &lt;ffffffff80193b3d&gt;{mntput_no_expire+25} &lt;ffffffff80178eb3&gt;{filp_close+89}<BR>       &lt;ffffffff8010a7be&gt;{system_call+126}<BR></FONT></DIV>

<DIV><FONT face=Arial size=2>Not all of the nodes print the call trace at the end.</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>GPFS then gives the following errors:</FONT></DIV>

<DIV><FONT face=Arial size=2>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.1 cic-gpfswhbe1n1<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.1 cic-gpfswhbe1n1<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.15 cic-gpfswhbe1s2<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.15 cic-gpfswhbe1s2<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.16 cic-gpfswhbe1s3<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.16 cic-gpfswhbe1s3<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.17 cic-gpfswhbe1s4<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.17 cic-gpfswhbe1s4<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.2 cic-gpfswhbe1n2<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.2 cic-gpfswhbe1n2<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.3 cic-gpfswhbe1n3<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.3 cic-gpfswhbe1n3<BR>10.224.158.104 Fri May 18 13:02:51 2007: Lost membership in cluster enterprise.universe. Unmounting file systems.<BR>10.224.158.104 Fri May 18 13:02:51 2007: Lost membership in cluster enterprise.universe. Unmounting file systems.<BR>10.224.158.106 Fri May 18 13:02:53 2007: Close connection to 192.168.1.14 cic-gpfswhbe1s1<BR>10.224.158.106 Fri May 18 13:02:53 2007: Close connection to 192.168.1.14 cic-gpfswhbe1s1</FONT></DIV>

<DIV><FONT face=Arial size=2>etc.</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

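<DIV><FONT face=Arial size=2>We found this in the logs of the switch (its clock appears to be two hours behind the nodes, so 11:02:51 here should correspond to 13:02:51 above):</FONT></DIV>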

<DIV><FONT face=Arial size=2>May 18 11:02:51 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM OUT_OF_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:98:d1<BR>May 18 11:02:51 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:98:d1<BR>May 18 11:02:51 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by discovering removed ports<BR>May 18 11:02:51 topspin-120sc ib_sm.x[628]: %IB-6-INFO: Program switch port state to down, node=00:05:ad:00:00:0b:a2:cc, port= 1, due to non-responding CA<BR>May 18 11:02:51 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port down - port=1/1, type=ib4xTXP<BR>May 18 11:02:51 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: in portTblFindEntry() - IfIndex=65(1/1)<BR>May 18 11:02:51 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: cannot find entry - IfIndex=65(1/1)<BR>May 18 11:02:52 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by discovering new ports<BR>May 18 11:02:52 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change<BR>May 18 11:02:52 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:98:d1<BR>May 18 11:02:53 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port up - port=1/1, type=ib4xTXP<BR>May 18 11:02:54 topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:98:d1<BR>May 18 11:02:54 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change<BR></FONT></DIV>

<DIV><FONT face=Arial size=2>We are sure this is gpfswhbe1s1: the port GUID in the traps above (00:05:ad:00:00:08:98:d1, the low 64 bits of the GID) is the node_guid of its HCA plus 1:</FONT></DIV>

<DIV><FONT face=Arial size=2>gpfswhbe1s1:~ # ibv_devinfo<BR>libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.<BR>    This will severely limit memory registrations.<BR>hca_id:       mthca0<BR>        fw_ver:                         5.1.0<BR>        node_guid:                      0005:ad00:0008:98d0<BR>        sys_image_guid:                 0005:ad00:0008:98d3<BR>        vendor_id:                      0x05ad<BR>        vendor_part_id:                 25218<BR>        hw_ver:                         0xA0<BR>        board_id:                       HCA.LionMini.A0<BR>        phys_port_cnt:                  2<BR>                port:   1<BR>                        state:                  PORT_ACTIVE (4)<BR>                        max_mtu:                2048 (4)<BR>                        active_mtu:             2048 (4)<BR>                        sm_lid:                 2<BR>                        port_lid:               6<BR>                        port_lmc:               0x00</FONT></DIV>

<DIV><FONT face=Arial size=2>                port:   2<BR>                        state:                  PORT_ACTIVE (4)<BR>                        max_mtu:                2048 (4)<BR>                        active_mtu:             2048 (4)<BR>                        sm_lid:                 2<BR>                        port_lid:               4<BR>                        port_lmc:               0x00<BR></DIV></FONT>
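<DIV><FONT face=Arial size=2>For anyone who wants to repeat the GUID comparison on other nodes, here is a minimal Python sketch of the check. It takes the GID from the switch log and the node_guid from ibv_devinfo, both copied from the output above. It assumes that port N of an HCA has the GUID node_guid + N, which is what we see on our hardware but is a vendor convention, not something guaranteed. The function names are made up for this example.</FONT></DIV>

<PRE>
```python
# Values copied from the switch log and from the ibv_devinfo output.
SWITCH_GID = "fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:98:d1"
NODE_GUID = "0005:ad00:0008:98d0"


def port_guid_from_gid(gid):
    """Return the low 64 bits of a colon-separated 128-bit GID as an int."""
    octets = gid.split(":")
    if len(octets) != 16:
        raise ValueError("expected 16 octets in GID, got %d" % len(octets))
    return int("".join(octets[8:]), 16)


def guid_to_int(guid):
    """Parse a GUID in ibv_devinfo notation (xxxx:xxxx:xxxx:xxxx) as an int."""
    return int(guid.replace(":", ""), 16)


def gid_matches_hca(gid, node_guid, phys_port_cnt=2):
    """True if the GID's port GUID is node_guid + N for a valid port N.

    Assumption (not guaranteed by any spec): port N has GUID node_guid + N.
    """
    offset = port_guid_from_gid(gid) - guid_to_int(node_guid)
    return offset in range(1, phys_port_cnt + 1)


print(gid_matches_hca(SWITCH_GID, NODE_GUID))
```
</PRE>

<DIV><FONT face=Arial size=2>With the values above the offset is 1, so the GID in the switch log would belong to port 1 of the HCA in gpfswhbe1s1.</FONT></DIV>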

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>Does anyone have a clue what happened?</FONT></DIV>

<DIV><FONT face=Arial size=2>The error does not occur very often, so we can't easily reproduce it.</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>We believe the HCA on gpfswhbe1s1 caused the problem, but we can't find direct evidence of it.</FONT></DIV>

<DIV><FONT face=Arial size=2>All help is appreciated!</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>Regards,</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>Koen</FONT></DIV></BODY></HTML>