[openib-general] [openfabrics-ewg] Problems with OFED IPoIB HA on SLES10
Scott Weitzenkamp (sweitzen)
sweitzen at cisco.com
Tue Oct 3 22:39:54 PDT 2006
If I fail back and forth between ib0 and ib1 every 30 seconds or so for
several hours while IPoIB traffic is running, the IPoIB host gets an Oops
and IPoIB stops working:
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
general protection fault: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 7
Modules linked in: af_packet ib_sdp rdma_ucm rdma_cm ib_addr ib_cm ib_ipoib ib_sa ib_uverbs ib_umad ib_mthca ib_mad ib_core nls_utf8 st ipv6 nfs lockd nfs_acl sunrpc button battery ac apparmor aamatch_pcre loop usbhid dm_mod hw_random ide_cd ehci_hcd uhci_hcd cdrom i8xx_tco ide_floppy usbcore shpchp e1000 pci_hotplug floppy reiserfs edd fan thermal processor siimage sg mptspi mptscsih mptbase scsi_transport_spi piix sd_mod scsi_mod ide_disk ide_core
Pid: 23541, comm: ib_mad1 Tainted: G U 2.6.16.21-0.8-smp #1
RIP: 0010:[<ffffffff802cffea>] <ffffffff802cffea>{_spin_lock_irqsave+3}
RSP: 0018:ffff810132a4fc20 EFLAGS: 00010086
RAX: 0000000000000286 RBX: 0000000000000000 RCX: ffffffff883324ee
RDX: ffff810128d5e380 RSI: 0000000000000000 RDI: 0000ffff1b6017ff
RBP: 00000000fffffffc R08: ffffffff803d3260 R09: ffff810140333800
R10: ffff81000107d400 R11: 0000000000000292 R12: ffff810128d5e380
R13: ffff810132a4fc78 R14: 0000ffff1b6017ff R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff810142d19740(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b0b5e6ae180 CR3: 0000000128cbc000 CR4: 00000000000006e0
Process ib_mad1 (pid: 23541, threadinfo ffff810132a4e000, task ffff810142b56100)
Stack: ffffffff8833c5f5 ffff8101302b3000 0000ffff1b6012ff 0000000000000002
0000000000000296 ffff8101302b3500 ffffffff8027753e ffff810128d5e3a0
ffff81012bce1680 ffff810128d5e380
Call Trace: <ffffffff8833c5f5>{:ib_ipoib:path_rec_completion+862}
<ffffffff8027753e>{dev_queue_xmit+545}
<ffffffff8833c5b2>{:ib_ipoib:path_rec_completion+795}
<ffffffff8833252e>{:ib_sa:ib_sa_path_rec_callback+64}
<ffffffff80138f17>{lock_timer_base+27}
<ffffffff80138f89>{try_to_del_timer_sync+81}
<ffffffff883322b3>{:ib_sa:send_handler+72}
<ffffffff8826762f>{:ib_mad:ib_mad_complete_send_wr+421}
<ffffffff88267f00>{:ib_mad:ib_mad_completion_handler+947}
<ffffffff88267b4d>{:ib_mad:ib_mad_completion_handler+0}
<ffffffff80140177>{run_workqueue+153}
<ffffffff8014081e>{worker_thread+0}
<ffffffff801437e5>{keventd_create_kthread+0}
<ffffffff80140927>{worker_thread+265}
<ffffffff8012787f>{__wake_up_common+62}
<ffffffff8012905a>{default_wake_function+0}
<ffffffff801437e5>{keventd_create_kthread+0}
<ffffffff80143aca>{kthread+236}
<ffffffff8010b60a>{child_rip+8}
<ffffffff801437e5>{keventd_create_kthread+0}
<ffffffff801439de>{kthread+0}
<ffffffff8010b602>{child_rip+0}
Code: f0 ff 0f 0f 88 29 01 00 00 c3 fa f0 ff 0f 0f 88 2a 01 00 00
RIP <ffffffff802cffea>{_spin_lock_irqsave+3} RSP <ffff810132a4fc20>
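To see how the requeue failures split across the two interfaces before the fault, the messages above can be tallied with a short script. This is only a sketch: the function name `count_requeue_failures` is made up for illustration, and it assumes the log lines look like the ones pasted above (for example, saved from dmesg into a text file).

```python
import re
from collections import Counter

# Matches the IPoIB message seen in the log above, e.g.
# "ib1: dev_queue_xmit failed to requeue packet".
REQUEUE_RE = re.compile(r"\b(ib\d+): dev_queue_xmit failed to requeue packet")


def count_requeue_failures(log_text):
    """Return a Counter mapping interface name to number of requeue failures.

    Hypothetical helper for illustration; it only counts lines that
    contain the exact message text shown in this report.
    """
    counts = Counter()
    for line in log_text.splitlines():
        match = REQUEUE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the saved log text returns per-interface totals, which makes it easier to compare runs with different failover intervals.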
Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
________________________________
From: openib-general-bounces at openib.org
[mailto:openib-general-bounces at openib.org] On Behalf Of Scott
Weitzenkamp (sweitzen)
Sent: Tuesday, October 03, 2006 2:53 PM
To: Vladimir Sokolovsky
Cc: EWG; openib-General
Subject: Re: [openib-general] [openfabrics-ewg] Problems with
OFED IPoIB HA on SLES10
Vlad, thanks for the fast response. I have some follow-up
questions about configuring IPoIB HA, see below.
3) I got IPoIB HA working on SLES 10, but the
documentation is a little lacking. Looks like I have to put the same
IP address in ifcfg-ib0 and ifcfg-ib1, is this correct?
Yes, IP address should be the same. Actually the
configuration of the secondary interface does not matter.
The High Availability daemon reads the configuration of
the primary interface and migrates it between the interfaces in case of
failure.
If I don't have an ifcfg-ib1 file, then ipoib_ha.pl won't start. I
would prefer not to configure ifcfg-ib1 since I don't plan to use it.
# ipoib_ha.pl --with-arping --with-multicast -v
Can't open conf /etc/sysconfig/network/ifcfg-ib1: No such file or directory
Can't open conf /etc/sysconfig/network/ifcfg-ib1: No such file or directory
Can't open conf /etc/sysconfig/network/ifcfg-ib1: No such file or directory
Can't open conf /etc/sysconfig/network/ifcfg-ib1: No such file or directory
Can't open conf /etc/sysconfig/network/ifcfg-ib1: No such file or directory
...
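Given the error above, a pre-flight check before launching ipoib_ha.pl could look like the following sketch. It assumes, based only on the behavior seen here, that the script wants an ifcfg file for each interface under /etc/sysconfig/network; `missing_ifcfg_files` is a hypothetical helper and not part of OFED.

```python
import os


def missing_ifcfg_files(interfaces, conf_dir="/etc/sysconfig/network"):
    """Return the ifcfg-<iface> paths that do not exist in conf_dir.

    Hypothetical helper: the directory default is the SLES 10 path shown
    in the error message above; the interface list is supplied by the caller.
    """
    missing = []
    for iface in interfaces:
        path = os.path.join(conf_dir, "ifcfg-" + iface)
        if not os.path.isfile(path):
            missing.append(path)
    return missing
```

Calling `missing_ifcfg_files(["ib0", "ib1"])` on the host in question would be expected to list the ifcfg-ib1 path when that file has not been created.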
If I put different IP addresses in ifcfg-ib0 and ifcfg-ib1, then
the ifcfg-ib1 IP address is used for both ib0 and ib1!
# pwd
/etc/sysconfig/network
# cat ifcfg-ib0
DEVICE=ib0
BOOTPROTO=static
IPADDR=192.168.2.46
NETMASK=255.255.255.0
ONBOOT=yes
# cat ifcfg-ib1
DEVICE=ib1
BOOTPROTO=static
IPADDR=192.168.6.46
NETMASK=255.255.255.0
ONBOOT=yes
# /etc/init.d/openibd start
Loading HCA driver and Access Layer: [ OK ]
Setting up InfiniBand network interfaces:
ib0 device: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20)
ib0 configuration: ib1
Bringing up interface ib0: [ OK ]
ib1 device: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20)
Bringing up interface ib1: [ OK ]
Setting up service network . . . [ done ]
# ifconfig ib0
ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:192.168.6.46 Bcast:192.168.6.255 Mask:255.255.255.0
inet6 addr: fe80::202:c902:21:700d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:3 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:0 (0.0 b) TX bytes:224 (224.0 b)
# ifconfig ib1
ib1 Link encap:UNSPEC HWaddr 00-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:192.168.6.46 Bcast:192.168.6.255 Mask:255.255.255.0
inet6 addr: fe80::202:c902:21:700e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:0 (0.0 b) TX bytes:304 (304.0 b)
Notice how both ib0 and ib1 have the IP address from ifcfg-ib1.
This contradicts this info from ipoib_release_notes.txt:
b. The ib1 interface uses the configuration script of ib0.
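The mismatch can also be checked mechanically by comparing the IPADDR in an ifcfg file with the "inet addr" reported by ifconfig for the same interface. The sketch below is an illustration only: the helper names are made up, and it assumes the simple KEY=VALUE ifcfg layout and the ifconfig output format shown in this message.

```python
import re


def configured_ip(ifcfg_text):
    """Return the IPADDR value from ifcfg-style KEY=VALUE text, or None."""
    for line in ifcfg_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.strip() == "IPADDR":
            # Tolerate optional quotes around the value.
            return value.strip().strip("'\"")
    return None


def live_ip(ifconfig_text):
    """Return the first IPv4 'inet addr' found in ifconfig output, or None."""
    match = re.search(r"inet addr:(\d+\.\d+\.\d+\.\d+)", ifconfig_text)
    return match.group(1) if match else None


def address_matches(ifcfg_text, ifconfig_text):
    """True only if both addresses are present and identical."""
    cfg = configured_ip(ifcfg_text)
    return cfg is not None and cfg == live_ip(ifconfig_text)
```

With the files and output pasted above, the check would fail for ib0 (configured 192.168.2.46, running 192.168.6.46) and pass for ib1.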
Scott