[ofa-general] Need help with Infiniband problem

jeffrey Lang jrlang at uwyo.edu
Tue Mar 17 08:33:56 PDT 2009


As Joe requested, here's the state before reloading the OpenIB stack:

[root at h2o01 ~]# lsmod | grep ib_
ib_srp                 38281  0
ib_sdp                 54785  0
rdma_cm                39381  2 ib_sdp,rdma_ucm
ib_addr                11081  1 rdma_cm
ib_mthca              158357  0
ib_ipoib               96673  0
ib_umad                20969  0
ib_ucm                 20937  0
ib_uverbs              43377  2 rdma_ucm,ib_ucm
ib_cm                  42217  4 ib_srp,rdma_cm,ib_ipoib,ib_ucm
ib_sa                  48841  4 ib_srp,rdma_cm,ib_ipoib,ib_cm
ib_mad                 43497  4 ib_mthca,ib_umad,ib_cm,ib_sa
ib_core                69825  13 ib_srp,ib_sdp,rdma_ucm,rdma_cm,iw_cm,ib_mthca,ib_ipoib,ib_umad,ib_ucm,ib_uverbs,ib_cm,ib_sa,ib_mad
ipv6                  285729  29 ib_ipoib
scsi_mod              145425  3 ib_srp,libata,sd_mod

[root at h2o01 ~]# /etc/init.d/openibd restart
Unloading OpenIB kernel modules:                           [  OK  ]
Loading OpenIB kernel modules:                             [  OK  ]
[root at h2o01 ~]#
[root at h2o01 ~]# lsmod | grep ib_
ib_srp                 38281  0
ib_sdp                 54785  0
ib_ipoib               96673  0
rdma_cm                39381  2 ib_sdp,rdma_ucm
ib_addr                11081  1 rdma_cm
ib_mthca              158357  0
ib_umad                20969  0
ib_ucm                 20937  0
ib_uverbs              43377  2 rdma_ucm,ib_ucm
ib_cm                  42217  4 ib_srp,ib_ipoib,rdma_cm,ib_ucm
ib_sa                  48841  4 ib_srp,ib_ipoib,rdma_cm,ib_cm
ib_mad                 43497  4 ib_mthca,ib_umad,ib_cm,ib_sa
ib_core                69825  13 ib_srp,ib_sdp,ib_ipoib,rdma_ucm,rdma_cm,iw_cm,ib_mthca,ib_umad,ib_ucm,ib_uverbs,ib_cm,ib_sa,ib_mad
ipv6                  285729  29 ib_ipoib
scsi_mod              145425  3 ib_srp,libata,sd_mod
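Since the modules all reload cleanly and ib_ipoib is present, one extra sanity check, assuming the mthca HCA registered with ib_core, is to look at sysfs directly:

	# should list the HCA device(s), e.g. mthca0
	ls /sys/class/infiniband/
	# per-port link state, printed as e.g. "1: DOWN" or "4: ACTIVE"
	cat /sys/class/infiniband/*/ports/*/state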

[root at h2o01 ~]# ifconfig ib0 up
[root at h2o01 ~]# ifconfig ib0
ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:192.168.2.1  Bcast:192.168.2.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:2044  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
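One way to confirm the link state below the IP layer, assuming the OFED diagnostic tools are installed, would be:

	# a healthy port reports State: Active and Physical state: LinkUp
	ibstat
	# or, via libibverbs: look for PORT_ACTIVE / LINK_UP
	ibv_devinfo | grep state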


As you can see, the ib0 interface does come up (although the flags line shows UP BROADCAST MULTICAST with no RUNNING, which usually means the interface has no carrier), and routing seems to be set up properly:

Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.2.0     *               255.255.255.0   U         0 0          0 ib0
10.84.4.0       *               255.255.255.0   U         0 0          0 eth0
192.168.1.0     *               255.255.255.0   U         0 0          0 eth1
169.254.0.0     *               255.255.0.0     U         0 0          0 ib0
224.0.0.0       *               240.0.0.0       U         0 0          0 eth1
default         10.84.4.1       0.0.0.0         UG        0 0          0 eth0
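To double-check that the kernel really selects ib0 for a host on that subnet (using 192.168.2.5, one of the compute nodes, purely as an example target):

	# should answer with a route via dev ib0
	ip route get 192.168.2.5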

But if I ping another node, I get nothing:

[root at h2o01 ~]# ping h2oi05.cluster
PING h2oi05.cluster (192.168.2.5) 56(84) bytes of data.
From h2oi01.cluster (192.168.2.1) icmp_seq=0 Destination Host Unreachable
From h2oi01.cluster (192.168.2.1) icmp_seq=1 Destination Host Unreachable
From h2oi01.cluster (192.168.2.1) icmp_seq=2 Destination Host Unreachable
From h2oi01.cluster (192.168.2.1) icmp_seq=4 Destination Host Unreachable
From h2oi01.cluster (192.168.2.1) icmp_seq=5 Destination Host Unreachable
From h2oi01.cluster (192.168.2.1) icmp_seq=6 Destination Host Unreachable

--- h2oi05.cluster ping statistics ---
8 packets transmitted, 0 received, +6 errors, 100% packet loss, time 7000ms, pipe 4
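The "Destination Host Unreachable" replies come from our own address, which means neighbor resolution for the peer is failing; on IPoIB, ARP rides on a multicast group joined through the subnet manager, so this points below the IP layer. Two quick checks, again with 192.168.2.5 only as an example:

	# a failed resolution shows up as an incomplete entry
	arp -an | grep 192.168.2
	# force the echo requests out ib0 explicitly
	ping -c 3 -I ib0 192.168.2.5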

If I ping myself, I get:

[root at h2o01 ~]# ping h2oi01.cluster
PING h2oi01.cluster (192.168.2.1) 56(84) bytes of data.
64 bytes from h2oi01.cluster (192.168.2.1): icmp_seq=0 ttl=64 time=0.018 ms
64 bytes from h2oi01.cluster (192.168.2.1): icmp_seq=1 ttl=64 time=0.010 ms
64 bytes from h2oi01.cluster (192.168.2.1): icmp_seq=2 ttl=64 time=0.011 ms
64 bytes from h2oi01.cluster (192.168.2.1): icmp_seq=3 ttl=64 time=0.011 ms
64 bytes from h2oi01.cluster (192.168.2.1): icmp_seq=4 ttl=64 time=0.015 ms
64 bytes from h2oi01.cluster (192.168.2.1): icmp_seq=5 ttl=64 time=0.008 ms

--- h2oi01.cluster ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 4999ms
rtt min/avg/max/mdev = 0.008/0.012/0.018/0.004 ms, pipe 2


It appears that the IP stack over IB is up and configured; the self-ping proves little, though, since packets to our own address are answered locally and never touch the fabric. The traffic just isn't getting onto the wire or through the switch.
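
If the port itself reports Active, the next thing to question is whether the subnet manager on the switch actually sees this host. Assuming the infiniband-diags tools are available, a quick probe would be:

	# ask the subnet manager to identify itself; a timeout here
	# suggests no reachable SM on the fabric
	sminfo
	# list the channel adapters that fabric discovery can see
	ibhosts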

jeff


Joe Landman wrote:
> jeffrey Lang wrote:
>   
>> First let me say, I hope this is the right list for this email; if not, 
>> please forgive me.
>>
>> I have a small 16-node compute cluster. The university where I work 
>> recently opened a new datacenter, and my cluster was moved there from 
>> the old one. Before the move the InfiniBand was working properly; 
>> after the move, IPoIB has stopped working.
>>     
>
> [...]
>
>   
>> I've reset the SM (subnet manager) on the switch, but nothing seems to work.
>>
>> Any ideas on where to look for what's causing the problem?
>>     
>
> Could you do an
>
> 	lsmod | grep ib_
>
> I assume you did an
>
> 	/etc/init.d/openibd restart
>
> If not, now is a good time ... then rerun the lsmod above.
>
> If you don't see ib_ipoib, then you might try this
>
> 	ifconfig ib0 up
> 	
> then send the output of
>
> 	lsmod | grep ib_
> 	ifconfig ib0
> 	
> If these still don't work, try
>
> 	modprobe ib_ipoib
> 	ifconfig ib0 up
> 	ifconfig ib0
> 	
>
>
>
>   