[openib-general] Fwd: Re: Problems with OFED IPoIB HA on SLES10
Michael S. Tsirkin
mst at mellanox.co.il
Wed Oct 4 05:46:57 PDT 2006
Another point: this seems to be crashing while we
are requeueing the packet through dev_queue_xmit upon
path record completion.
It looks like this could try to requeue even though the
interface is going down - could this trigger some problems?
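To make the suspected sequence concrete, here is a toy user-space model in Python. It is not kernel code and not the ib_ipoib implementation: ToyDevice, ToyPath and simulate are hypothetical names, and the only behaviour modelled is that packets held while a path lookup is outstanding get handed back to the device on completion, and that a device which is already down refuses them. It reproduces the warning text seen in the log, not the fault itself.

```python
import errno
from collections import deque


class ToyDevice:
    """Stand-in for a network device; only tracks whether it is up."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.sent = []

    def queue_xmit(self, packet):
        # Assumption: a transmit request on a downed interface is refused
        # with a negative errno, and the packet is not sent.
        if not self.up:
            return -errno.ENETDOWN
        self.sent.append(packet)
        return 0


class ToyPath:
    """Holds packets that arrived before the path lookup finished."""

    def __init__(self, dev):
        self.dev = dev
        self.pending = deque()

    def completion(self, log):
        # Hypothetical analogue of the completion callback: drain the
        # pending queue and hand each packet back to the device.
        while self.pending:
            packet = self.pending.popleft()
            if self.dev.queue_xmit(packet) != 0:
                log.append(
                    f"{self.dev.name}: dev_queue_xmit failed to requeue packet"
                )


def simulate(name, n_packets, down_before_completion):
    """Queue n_packets, optionally take the device down, then complete."""
    dev = ToyDevice(name)
    path = ToyPath(dev)
    path.pending.extend(range(n_packets))
    if down_before_completion:
        dev.up = False
    log = []
    path.completion(log)
    return log, dev.sent
```

Calling simulate("ib1", 3, True) yields three warning lines and no transmitted packets, while simulate("ib0", 3, False) transmits all three. Whether the real completion path can also touch freed or torn-down state in this window is exactly the open question; the model says nothing about that.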
Quoting r. Michael S. Tsirkin <mst at mellanox.co.il>:
Subject: Fwd: Re: Problems with OFED IPoIB HA on SLES10
BTW, any idea?
The ipoib_ha is just a script that ups/downs and configures interfaces,
so it seems this crash could also happen on systems without it.
--
MST
Date: Tue, 3 Oct 2006 22:39:54 -0700
From: "Scott Weitzenkamp (sweitzen)" <sweitzen at cisco.com>
Subject: Re: [openib-general] Problems with OFED IPoIB HA on SLES10
If I fail back and forth between ib0 and ib1 every 30 seconds or so for several hours while IPoIB traffic is running, the IPoIB host gets an Oops and IPoIB stops working:
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
general protection fault: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 7
Modules linked in: af_packet ib_sdp rdma_ucm rdma_cm ib_addr ib_cm ib_ipoib ib_sa
ib_uverbs ib_umad ib_mthca ib_mad ib_core nls_utf8 st ipv6 nfs lockd nfs_acl sunrpc
button battery ac apparmor aamatch_pcre loop usbhid dm_mod hw_random ide_cd
ehci_hcd uhci_hcd cdrom i8xx_tco ide_floppy usbcore shpchp e1000 pci_hotplug floppy
reiserfs edd fan thermal processor siimage sg mptspi mptscsih mptbase scsi_transport_spi
piix sd_mod scsi_mod ide_disk ide_core
Pid: 23541, comm: ib_mad1 Tainted: G U 2.6.16.21-0.8-smp #1
RIP: 0010:[<ffffffff802cffea>] <ffffffff802cffea>{_spin_lock_irqsave+3}
RSP: 0018:ffff810132a4fc20 EFLAGS: 00010086
RAX: 0000000000000286 RBX: 0000000000000000 RCX: ffffffff883324ee
RDX: ffff810128d5e380 RSI: 0000000000000000 RDI: 0000ffff1b6017ff
RBP: 00000000fffffffc R08: ffffffff803d3260 R09: ffff810140333800
R10: ffff81000107d400 R11: 0000000000000292 R12: ffff810128d5e380
R13: ffff810132a4fc78 R14: 0000ffff1b6017ff R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff810142d19740(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b0b5e6ae180 CR3: 0000000128cbc000 CR4: 00000000000006e0
Process ib_mad1 (pid: 23541, threadinfo ffff810132a4e000, task ffff810142b56100)
Stack: ffffffff8833c5f5 ffff8101302b3000 0000ffff1b6012ff 0000000000000002
0000000000000296 ffff8101302b3500 ffffffff8027753e ffff810128d5e3a0
ffff81012bce1680 ffff810128d5e380
Call Trace: <ffffffff8833c5f5>{:ib_ipoib:path_rec_completion+862}
<ffffffff8027753e>{dev_queue_xmit+545} <ffffffff8833c5b2>{:ib_ipoib:path_rec_completion+795}
<ffffffff8833252e>{:ib_sa:ib_sa_path_rec_callback+64}
<ffffffff80138f17>{lock_timer_base+27} <ffffffff80138f89>{try_to_del_timer_sync+81}
<ffffffff883322b3>{:ib_sa:send_handler+72} <ffffffff8826762f>{:ib_mad:ib_mad_complete_send_wr+421}
<ffffffff88267f00>{:ib_mad:ib_mad_completion_handler+947}
<ffffffff88267b4d>{:ib_mad:ib_mad_completion_handler+0}
<ffffffff80140177>{run_workqueue+153} <ffffffff8014081e>{worker_thread+0}
<ffffffff801437e5>{keventd_create_kthread+0} <ffffffff80140927>{worker_thread+265}
<ffffffff8012787f>{__wake_up_common+62} <ffffffff8012905a>{default_wake_function+0}
<ffffffff801437e5>{keventd_create_kthread+0} <ffffffff80143aca>{kthread+236}
<ffffffff8010b60a>{child_rip+8} <ffffffff801437e5>{keventd_create_kthread+0}
<ffffffff801439de>{kthread+0} <ffffffff8010b602>{child_rip+0}
Code: f0 ff 0f 0f 88 29 01 00 00 c3 fa f0 ff 0f 0f 88 2a 01 00 00
RIP <ffffffff802cffea>{_spin_lock_irqsave+3} RSP <ffff810132a4fc20>
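For a console capture this long, a small script can summarize it. The following Python sketch is only an illustration: the function name summarize and the two regular expressions are assumptions, written against the line formats that appear in the output above. It counts the requeue warnings per interface and pulls the faulting symbol and offset out of the "RIP:" line.

```python
import re
from collections import Counter

# Matches the warning lines exactly as they appear in the capture above.
REQUEUE_RE = re.compile(
    r"^(ib\d+): dev_queue_xmit failed to requeue packet\s*$"
)
# Matches the "RIP: ... {symbol+offset}" line of the oops.
RIP_RE = re.compile(r"^RIP:.*\{(?P<symbol>[^}+]+)\+(?P<offset>\d+)\}")


def summarize(console_text):
    """Return (warning count per interface, (symbol, offset) or None)."""
    counts = Counter()
    faulting = None
    for raw in console_text.splitlines():
        line = raw.strip()
        m = REQUEUE_RE.match(line)
        if m:
            counts[m.group(1)] += 1
            continue
        m = RIP_RE.match(line)
        if m and faulting is None:
            faulting = (m.group("symbol"), int(m.group("offset")))
    return counts, faulting


# A shortened excerpt in the same format as the capture above.
SAMPLE = """\
ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
general protection fault: 0000 [1] SMP
RIP: 0010:[<ffffffff802cffea>] <ffffffff802cffea>{_spin_lock_irqsave+3}
"""

if __name__ == "__main__":
    counts, faulting = summarize(SAMPLE)
    for iface in sorted(counts):
        print(iface, counts[iface])
    print("faulting symbol:", faulting)
```

To run it on a real capture, pass the saved console text to summarize() in place of SAMPLE.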
Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
------------------------------------------------------------------------
From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Scott Weitzenkamp (sweitzen)
Sent: Tuesday, October 03, 2006 2:53 PM
To: Vladimir Sokolovsky
Cc: EWG; openib-General
Subject: Re: [openib-general] [openfabrics-ewg] Problems with OFED IPoIB HA on SLES10
Vlad, thanks for the fast response. I have some follow-up questions about configuring IPoIB HA, see below.
3) I got IPoIB HA working on SLES 10, but the documentation is a little lacking. Looks like I have to put the same IP address in ifcfg-ib0 and ifcfg-ib1, is this correct?
Yes, IP address should be the same. Actually the configuration of the secondary interface does not matter.
The High Availability daemon reads the configuration of the primary interface and migrates it between the interfaces in case of failure.
If I don't have an ifcfg-ib1 file, then ipoib_ha.pl won't start. I would prefer not to configure ifcfg-ib1 since I don't plan to use it.
# ipoib_ha.pl --with-arping --with-multicast -v
Can't open conf /etc/sysconfig/network/ifcfg-ib1: No such file or directory
Can't open conf /etc/sysconfig/network/ifcfg-ib1: No such file or directory
Can't open conf /etc/sysconfig/network/ifcfg-ib1: No such file or directory
Can't open conf /etc/sysconfig/network/ifcfg-ib1: No such file or directory
Can't open conf /etc/sysconfig/network/ifcfg-ib1: No such file or directory
...
If I put different IP addresses in ifcfg-ib0 and ifcfg-ib1, then the ifcfg-ib1 IP address is used for both ib0 and ib1!
# pwd
/etc/sysconfig/network
# cat ifcfg-ib0
DEVICE=ib0
BOOTPROTO=static
IPADDR=192.168.2.46
NETMASK=255.255.255.0
ONBOOT=yes
# cat ifcfg-ib1
DEVICE=ib1
BOOTPROTO=static
IPADDR=192.168.6.46
NETMASK=255.255.255.0
ONBOOT=yes
# /etc/init.d/openibd start
Loading HCA driver and Access Layer: [ OK ]
Setting up InfiniBand network interfaces:
ib0 device: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20)
ib0 configuration: ib1
Bringing up interface ib0: [ OK ]
ib1 device: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20)
Bringing up interface ib1: [ OK ]
Setting up service network . . . [ done ]
# ifconfig ib0
ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:192.168.6.46 Bcast:192.168.6.255 Mask:255.255.255.0
inet6 addr: fe80::202:c902:21:700d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:3 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:0 (0.0 b) TX bytes:224 (224.0 b)
# ifconfig ib1
ib1 Link encap:UNSPEC HWaddr 00-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:192.168.6.46 Bcast:192.168.6.255 Mask:255.255.255.0
inet6 addr: fe80::202:c902:21:700e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:0 (0.0 b) TX bytes:304 (304.0 b)
Notice how both ib0 and ib1 have the IP address from ifcfg-ib1. This contradicts this info from ipoib_release_notes.txt:
b. The ib1 interface uses the configuration script of ib0.
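A quick way to see exactly where two interface configuration files disagree is to parse them and compare the settings. The Python sketch below is illustrative only: parse_ifcfg and differing_keys are hypothetical helpers, and the parser assumes plain KEY=value lines like the ones shown above; it is not how openibd or ipoib_ha.pl read these files.

```python
def parse_ifcfg(text):
    """Parse simple KEY=value lines, skipping blanks and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


def differing_keys(primary, secondary, ignore=("DEVICE",)):
    """Map each key whose value differs to (primary value, secondary value)."""
    keys = (set(primary) | set(secondary)) - set(ignore)
    return {
        key: (primary.get(key), secondary.get(key))
        for key in sorted(keys)
        if primary.get(key) != secondary.get(key)
    }


# Contents copied from the two files shown above.
IB0 = """\
DEVICE=ib0
BOOTPROTO=static
IPADDR=192.168.2.46
NETMASK=255.255.255.0
ONBOOT=yes
"""

IB1 = """\
DEVICE=ib1
BOOTPROTO=static
IPADDR=192.168.6.46
NETMASK=255.255.255.0
ONBOOT=yes
"""

if __name__ == "__main__":
    diff = differing_keys(parse_ifcfg(IB0), parse_ifcfg(IB1))
    for key, (first, second) in diff.items():
        print(f"{key}: ifcfg-ib0={first} ifcfg-ib1={second}")
```

For the two files above, the only difference apart from DEVICE is IPADDR, which makes it easy to confirm from the ifconfig output that the address in use is the one from ifcfg-ib1.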
Scott
--
MST