[ofa-general] RHEL 5.3 (2.6.18-128.1.1.el5 kernel) and connected mode

Robert Cummins robertacummins at gmail.com
Fri Aug 14 10:16:14 PDT 2009


Hello,

I have a customer who is experiencing a problem with IB.  Specifically,
when the InfiniHost III card is placed in connected mode with 'echo
connected > /sys/class/net/ib0/mode', some nodes stop responding.  By
'stop responding' I mean:

  - ping <ib ip address> doesn't work (no packets returned; 100% packet loss)
  - ib_rdma_bw -b node never runs
  - ibping does work
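For reference, here is the kind of per-node check I've been doing by
hand, as a small script.  This is just a sketch: the `check_ipoib`
helper name and the optional second argument (an alternate sysfs root,
handy for testing) are mine; the paths are the standard IPoIB sysfs
locations and `ib0` is assumed to be the interface name.

```shell
#!/bin/sh
# Sketch: quick IPoIB sanity check for one node.  The interface name
# (ib0) and the sysfs paths are the standard IPoIB locations; the
# second argument (sysfs root) is only an illustration hook.
check_ipoib() {
    iface=${1:-ib0}
    base=${2:-/sys/class/net}/$iface
    if [ ! -r "$base/mode" ]; then
        echo "no IPoIB mode file for $iface"
        return 1
    fi
    # Connected mode also raises the MTU (up to 65520, versus 2044 in
    # datagram mode), so the MTU is a useful cross-check that the mode
    # switch actually took effect.
    echo "mode: $(cat "$base/mode")  mtu: $(cat "$base/mtu")"
}

check_ipoib ib0 || true
```

Running this on every node quickly shows which ones are really in
connected mode (and picked up the large MTU) versus which ones are
still in datagram mode.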

Since the customer mounts their NFS server over IB, NFS service stops
working when in connected mode.  What is interesting is that if I leave
the NFS server in datagram mode, the affected nodes can still reach it,
i.e., NFS service continues to work, but they cannot communicate over
IB with other nodes that are also in connected mode.

At first I thought this was only a problem with IPoIB.  I noticed the
following difference between nodes that fail in connected mode and
nodes that work.  The first output is from a node that stops working,
the second from a node that continues to work.

[root@ws3 ~]# modinfo ib_ipoib
filename:       /lib/modules/2.6.18-128.el5/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko
license:        Dual BSD/GPL
description:    IP-over-InfiniBand net driver
author:         Roland Dreier
srcversion:     E3C28A100A995101E2AB934
depends:        ib_cm,ipv6,ib_core,ib_sa
vermagic:       2.6.18-128.el5 SMP mod_unload gcc-4.1
parm:           max_nonsrq_conn_qp:Max number of connected-mode QPs per interface (applied only if shared receive queue is not available) (int)
parm:           set_nonsrq:set to dictate working in none SRQ mode, otherwise act according to device capabilities (int)
parm:           mcast_debug_level:Enable multicast debug tracing if > 0 (int)
parm:           send_queue_size:Number of descriptors in send queue (int)
parm:           recv_queue_size:Number of descriptors in receive queue (int)
parm:           debug_level:Enable debug tracing if > 0 (int)
module_sig:     883f35049492f615cdc734e64d24fa112659309d1b9619270a5e84a97a46cbc6e4ac0908b21f20a0a75b803bc72eba1ce62d2a8eec53fd9c2d7288c
[root@ws3 ~]#


[root@scyld ~]# modinfo ib_ipoib
filename:       /lib/modules/2.6.18-128.1.1.el5.530g0000/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko
license:        Dual BSD/GPL
description:    IP-over-InfiniBand net driver
author:         Roland Dreier
srcversion:     8E47481E21B330BFE32B7CE
depends:        ib_cm,ipv6,ib_core,ib_sa
vermagic:       2.6.18-128.1.1.el5.530g0000 SMP mod_unload gcc-4.1
parm:           max_nonsrq_conn_qp:Max number of connected-mode QPs per interface (applied only if shared receive queue is not available) (int)
parm:           set_nonsrq:set to dictate working in none SRQ mode, otherwise act according to device capabilities (int)
parm:           mcast_debug_level:Enable multicast debug tracing if > 0 (int)
parm:           send_queue_size:Number of descriptors in send queue (int)
parm:           recv_queue_size:Number of descriptors in receive queue (int)
parm:           debug_level:Enable debug tracing if > 0 (int)
module_sig:     883f35049c0555e56ccec1c0ba19c3112535c09b5f5dbc8607465f947d60f2be7fa26132d43309f5dc241bebfe2f2f88fc7c93fbe5ea12cd721a59
[root@scyld ~]#

However, after retesting with ib_rdma_bw I can see that even the verbs
layer is not working.

I have not tried loading the ib_ipoib.ko from the 'working'
configuration on a non-working system, since I assumed it would fail to
load due to the slight kernel version difference.

It should be noted that I have four nodes that fail and nearly 20 that
'work'.  The failing nodes are running the same kernel
(2.6.18-128.el5), while the working nodes are running the
2.6.18-128.1.1.el5 kernel.  I am at a loss as to how to proceed with
debugging this, short of getting the latest OFED distro and building it.
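One thing I do plan to try before rebuilding OFED is the debug_level
parameter that both modinfo outputs above list.  A rough sketch of what
I have in mind; the `set_ipoib_debug` wrapper and its optional path
argument are just for illustration, while the parameter path itself is
the standard module sysfs location:

```shell
#!/bin/sh
# Sketch: enable the IPoIB debug tracing exposed by the debug_level
# module parameter (listed in the modinfo output above).  Must be run
# as root on a failing node; the wrapper's second argument is only an
# illustration hook for pointing at a different parameter file.
set_ipoib_debug() {
    level=${1:-1}
    param=${2:-/sys/module/ib_ipoib/parameters/debug_level}
    if [ -w "$param" ]; then
        echo "$level" > "$param"
        echo "debug_level set to $(cat "$param")"
    else
        echo "cannot write $param"
        return 1
    fi
}

set_ipoib_debug 1 || true
# Then retry the mode switch while watching the kernel log:
#   echo connected > /sys/class/net/ib0/mode
#   tail -f /var/log/messages    # look for ib0 / CM errors
```

If the driver logs anything useful about the connected-mode QP setup on
the failing nodes, that should at least narrow down where things stall.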

Has anyone else run into this problem and if so, how did you get around
it?  

TIA


R.



