<HTML dir=ltr><HEAD>
<META http-equiv=Content-Type content="text/html; charset=unicode">
<META content="MSHTML 6.00.2900.3086" name=GENERATOR></HEAD>
<BODY>
<DIV><FONT face=Arial color=#000000 size=2>Hi,</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>We are running GPFS over SDP, using OFED 1.2-rc1. The machines are IBM x3755s and x3655s. The IB switch is an SFS-7000P. The HCAs are all "Mellanox Technologies MT25208 InfiniHost III Ex (rev a0)".</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>Under heavy load, we sometimes lose a node from our GPFS cluster.</FONT></DIV>
<DIV><FONT face=Arial size=2>The machine that lost its connection (10.224.158.104, gpfswhbe1s1) logged this error:</FONT></DIV>
<DIV><FONT face=Arial size=2>May 18 13:02:51 gpfswhbe1s1 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=4997901: Reason code 668 Failure Reason Lost membership in cluster enterprise.universe. Unmounting file systems.<BR>May 18 13:02:51 gpfswhbe1s1 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=4997901: </FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>After this, we got the following message on some of the nodes that are part of the cluster (including the failing node):</FONT></DIV>
<DIV><FONT face=Arial size=2>GPFS Deadman Switch timer [0] has expired; IOs in progress: 0<BR>Badness in do_exit at kernel/exit.c:807</FONT></DIV>
<DIV><FONT face=Arial size=2>Call Trace: &lt;ffffffff80133370&gt;{do_exit+80} &lt;ffffffff80133c17&gt;{sys_exit_group+0}<BR> &lt;ffffffff8010a7be&gt;{system_call+126}<BR>Badness in do_exit at kernel/exit.c:807</FONT></DIV>
<DIV><FONT face=Arial size=2>Call Trace: &lt;ffffffff80133370&gt;{do_exit+80} &lt;ffffffff80133c17&gt;{sys_exit_group+0}<BR> &lt;ffffffff8010a7be&gt;{system_call+126}<BR>idr_remove called for id=0 which is not allocated.</FONT></DIV>
<DIV><FONT face=Arial size=2>Call Trace: &lt;ffffffff801e5ac0&gt;{idr_remove+228} &lt;ffffffff80180904&gt;{kill_anon_super+41}<BR> &lt;ffffffff8018099a&gt;{deactivate_super+111} &lt;ffffffff8019418b&gt;{sys_umount+624}<BR> &lt;ffffffff8018303c&gt;{sys_newstat+25} &lt;ffffffff8017b7c8&gt;{__fput+348}<BR> &lt;ffffffff80193b3d&gt;{mntput_no_expire+25} &lt;ffffffff80178eb3&gt;{filp_close+89}<BR> &lt;ffffffff8010a7be&gt;{system_call+126}<BR></FONT></DIV>
<DIV><FONT face=Arial size=2>Not all of the nodes print the call trace at the end.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>GPFS then gives the following errors:</FONT></DIV>
<DIV><FONT face=Arial size=2>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.1 cic-gpfswhbe1n1<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.1 cic-gpfswhbe1n1<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.15 cic-gpfswhbe1s2<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.15 cic-gpfswhbe1s2<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.16 cic-gpfswhbe1s3<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.16 cic-gpfswhbe1s3<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.17 cic-gpfswhbe1s4<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.17 cic-gpfswhbe1s4<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.2 cic-gpfswhbe1n2<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.2 cic-gpfswhbe1n2<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.3 cic-gpfswhbe1n3<BR>10.224.158.104 Fri May 18 13:02:51 2007: Close connection to 192.168.1.3 cic-gpfswhbe1n3<BR>10.224.158.104 Fri May 18 13:02:51 2007: Lost membership in cluster enterprise.universe. Unmounting file systems.<BR>10.224.158.104 Fri May 18 13:02:51 2007: Lost membership in cluster enterprise.universe. Unmounting file systems.<BR>10.224.158.106 Fri May 18 13:02:53 2007: Close connection to 192.168.1.14 cic-gpfswhbe1s1<BR>10.224.158.106 Fri May 18 13:02:53 2007: Close connection to 192.168.1.14 cic-gpfswhbe1s1</FONT></DIV>
<DIV><FONT face=Arial size=2>etc.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>We found this in the logs of the switch:</FONT></DIV>
<DIV><FONT face=Arial size=2>May 18 11:02:51 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM OUT_OF_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:98:d1<BR>May 18 11:02:51 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:98:d1<BR>May 18 11:02:51 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by discovering removed ports<BR>May 18 11:02:51 topspin-120sc ib_sm.x[628]: %IB-6-INFO: Program switch port state to down, node=00:05:ad:00:00:0b:a2:cc, port= 1, due to non-responding CA<BR>May 18 11:02:51 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port down - port=1/1, type=ib4xTXP<BR>May 18 11:02:51 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: in portTblFindEntry() - IfIndex=65(1/1)<BR>May 18 11:02:51 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: cannot find entry - IfIndex=65(1/1)<BR>May 18 11:02:52 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by discovering new ports<BR>May 18 11:02:52 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change<BR>May 18 11:02:52 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:98:d1<BR>May 18 11:02:53 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port up - port=1/1, type=ib4xTXP<BR>May 18 11:02:54 topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:98:d1<BR>May 18 11:02:54 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change<BR></FONT></DIV>
<DIV><FONT face=Arial size=2>We are sure the port in question belongs to gpfswhbe1s1: the GUID in the switch log (00:05:ad:00:00:08:98:d1) is that machine's node_guid plus 1:</FONT></DIV>
<DIV><FONT face=Arial size=2>gpfswhbe1s1:~ # ibv_devinfo<BR>libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.<BR> This will severely limit memory registrations.<BR>hca_id: mthca0<BR> fw_ver: 5.1.0<BR> node_guid: 0005:ad00:0008:98d0<BR> sys_image_guid: 0005:ad00:0008:98d3<BR> vendor_id: 0x05ad<BR> vendor_part_id: 25218<BR> hw_ver: 0xA0<BR> board_id: HCA.LionMini.A0<BR> phys_port_cnt: 2<BR> port: 1<BR> state: PORT_ACTIVE (4)<BR> max_mtu: 2048 (4)<BR> active_mtu: 2048 (4)<BR> sm_lid: 2<BR> port_lid: 6<BR> port_lmc: 0x00</FONT></DIV>
<DIV><FONT face=Arial size=2> port: 2<BR> state: PORT_ACTIVE (4)<BR> max_mtu: 2048 (4)<BR> active_mtu: 2048 (4)<BR> sm_lid: 2<BR> port_lid: 4<BR> port_lmc: 0x00<BR></FONT></DIV>
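<DIV><FONT face=Arial size=2>For anyone who wants to repeat the GUID comparison, here is a minimal Python sketch of the check. It assumes that the low 64 bits of the GID in the switch trap are the port GUID, and that port 1's GUID on this HCA is node_guid + 1 (our reading of the output above, not something we have confirmed with the vendor). The helper names parse_guid and port_guid_from_gid are made up for this illustration.</FONT></DIV>

```python
def parse_guid(text):
    """Parse a colon-separated hex GUID/GID (any grouping) into an integer."""
    return int(text.replace(":", ""), 16)


def port_guid_from_gid(gid):
    """Return the low 64 bits of a 128-bit GID, i.e. the port GUID part."""
    return parse_guid(gid) & 0xFFFFFFFFFFFFFFFF


# Values copied from the ibv_devinfo output and the switch log in this mail.
node_guid = parse_guid("0005:ad00:0008:98d0")
trap_gid = "fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:98:d1"

port_guid = port_guid_from_gid(trap_gid)

# Assumption: port 1 of this HCA uses node_guid + 1 as its port GUID.
matches_port_1 = port_guid == node_guid + 1

print(f"{port_guid:016x}", matches_port_1)
```

<DIV><FONT face=Arial size=2>With the values above, the comparison holds, which is why we attribute the OUT_OF_SERVICE trap to port 1 of gpfswhbe1s1.</FONT></DIV>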
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>Does anyone have a clue what happened?</FONT></DIV>
<DIV><FONT face=Arial size=2>The error does not occur very often, so we can't easily reproduce it.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>We believe the HCA on gpfswhbe1s1 caused the problem, but we have no direct evidence of that.</FONT></DIV>
<DIV><FONT face=Arial size=2>All help is appreciated!</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>Regards,</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>Koen</FONT></DIV></BODY></HTML>