[ofa-general] [Bug 447] New: ib_ipoib kernel 2.6.9-34 panic when routing to 10G ethernet
bugzilla-daemon at lists.openfabrics.org
Mon Mar 12 10:24:55 PDT 2007
https://bugs.openfabrics.org/show_bug.cgi?id=447
Summary: ib_ipoib kernel 2.6.9-34 panic when routing to 10G ethernet
Product: OpenFabrics Linux
Version: 1.1
Platform: X86-64
OS/Version: RHEL 4
Status: NEW
Severity: blocker
Priority: P1
Component: IPoIB
AssignedTo: bugzilla at openib.org
ReportedBy: DarylGrunau at gmail.com
We're experiencing kernel panics like the one below on our I/O nodes, which
act as routers between our IB fabric and 10GE network (providing service to a
Panasas filesystem). Simply mounting the Panasas filesystem via the I/O node
is enough to trigger the panic: some time later (as little as 1 minute,
sometimes only after a night or a weekend) the node panics. Actively using the
filesystem mounted on the compute nodes makes the panic occur sooner.
Kernel BUG at dev:1121
invalid operand: 0000 [1] SMP
CPU 7
Modules linked in: myri10ge(U) ib_ipoib ib_mthca ib_uverbs ib_umad ib_ucm ib_sa
ib_cm ib_mad ib_core bluesmoke_k8 bluesmoke_mc perfctr ipmi_devintf ipmi_si
ipmi_msghandler bnx2 ext3 jbd nfs lockd nfs_acl sunrpc
Pid: 0, comm: swapper Not tainted 2.6.9-34.ELsmp.lanl
RIP: 0010:[<ffffffff802aafc2>] <ffffffff802aafc2>{__skb_linearize+62}
RSP: 0018:00000102270efcf8 EFLAGS: 00010203
RAX: 0000000000000001 RBX: 000000000000001c RCX: 000001061fef7680
RDX: 00000000ffffffdc RSI: 0000000000000220 RDI: 000001061fef7600
RBP: 000001021fedabc0 R08: 0000000000000000 R09: 000000000000003c
R10: 0000000000000000 R11: 0000000000000000 R12: 000001021f459a80
R13: 0000000000000000 R14: 000001081d741000 R15: 0000000000000000
FS: 0000002a95ac76e0(0000) GS:ffffffff804d8600(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000429ff0 CR3: 00000000dfcae000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo 000001081ff38000, task 0000010220035030)
Stack: 000001081d741000 000001081d741000 00000000fffffff4 000001021f459a80
0000000000000000 ffffffff802ab133 000001021fedabc0 000001021f459ac0
000001021fedabc0 ffffffff802b01c8
Call Trace:<IRQ> <ffffffff802ab133>{dev_queue_xmit+93}
<ffffffff802b01c8>{neigh_resolve_output+578}
<ffffffff802afdc4>{neigh_update+626}
<ffffffff802e62a7>{arp_process+1257}
<ffffffffa01766a6>{:ib_ipoib:ipoib_ib_completion+936}
<ffffffff802ab87d>{netif_receive_skb+590}
<ffffffff802ab939>{process_backlog+136}
<ffffffff802aba43>{net_rx_action+129}
<ffffffff8013bf38>{__do_softirq+88}
<ffffffff8013bfe1>{do_softirq+49} <ffffffff801131a7>{do_IRQ+328}
<ffffffff801107bf>{ret_from_intr+0} <EOI>
<ffffffff8010e749>{default_idle+0}
<ffffffff8010e769>{default_idle+32} <ffffffff8010e7dc>{cpu_idle+26}
Code: 0f 0b 41 ee 31 80 ff ff ff ff 61 04 85 d2 b8 00 00 00 00 0f
RIP <ffffffff802aafc2>{__skb_linearize+62} RSP <00000102270efcf8>
<0>Kernel panic - not syncing: Oops
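The "Code:" bytes in the oops above can be cross-checked against the "Kernel BUG at dev:1121" line. The following is a minimal sketch, assuming that the dump starts at RIP and that the BUG() trap on this x86-64 kernel is laid out as a ud2 instruction (0f 0b) followed by an 8-byte pointer to the file name and a 2-byte line number, all little-endian. That layout is an assumption about this kernel generation, not something stated in the report, and the function name is purely illustrative.

```python
import struct

# "Code:" line copied from the oops above (20 bytes, assumed to start at RIP).
CODE_LINE = "0f 0b 41 ee 31 80 ff ff ff ff 61 04 85 d2 b8 00 00 00 00 0f"


def decode_bug_frame(code_hex):
    """Decode an assumed BUG() frame: ud2, 8-byte pointer, 2-byte line."""
    raw = bytes.fromhex(code_hex)
    if raw[:2] != b"\x0f\x0b":
        raise ValueError("code does not start with ud2 (0f 0b)")
    # "<" selects little-endian with no padding;
    # Q is the 8-byte file-name pointer, H is the 2-byte line number.
    filename_ptr, line = struct.unpack_from("<QH", raw, 2)
    return filename_ptr, line


if __name__ == "__main__":
    ptr, line = decode_bug_frame(CODE_LINE)
    print(f"file-name pointer: {ptr:#018x}")
    print(f"line number:       {line}")
```

Under that assumption the two bytes 61 04 decode to line 1121, which agrees with the "dev:1121" reported at the top of the oops, and the pointer decodes to 0xffffffff8031ee41, presumably the address of the string "dev".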
----------------
Our HCA hardware/firmware is:
-bash-3.00# ./ibv_devinfo
hca_id: mthca0
fw_ver: 5.1.937
node_guid: 0002:c902:0023:85cc
sys_image_guid: 0002:c902:0023:85cf
vendor_id: 0x02c9
vendor_part_id: 25218
hw_ver: 0xA0
board_id: MT_0370110001
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 69
port_lmc: 0x00
port: 2
state: PORT_DOWN (1)
max_mtu: 2048 (4)
active_mtu: 512 (2)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
-bash-3.00# lspci -vvv
[[ snip ]]
41:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (rev a0)
Subsystem: Mellanox Technologies MT25208 InfiniHost III Ex
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size 10
Interrupt: pin A routed to IRQ 209
Region 0: Memory at e1700000 (64-bit, non-prefetchable) [size=1M]
Region 2: Memory at e1800000 (64-bit, prefetchable) [size=8M]
Capabilities: [40] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Vital Product Data
Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
Address: 0000000000000000 Data: 0000
Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
Vector table: BAR=0 offset=00082000
PBA: BAR=0 offset=00082200
Capabilities: [60] Express Endpoint IRQ 0
Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag+
Device: Latency L0s <64ns, L1 unlimited
Device: AtnBtn- AtnInd- PwrInd-
Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
Device: MaxPayload 128 bytes, MaxReadReq 4096 bytes
Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 8
Link: Latency L0s unlimited, L1 unlimited
Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
Link: Speed 2.5Gb/s, Width x8
[[ snip ]]
--------------
We are running a patched kernel (perfctr & bluesmoke/EDAC only) from RHEL4
update 3 with IB drivers from Voltaire's GridStack-4.1.5_4:
dapl-1.2.0-0.x86_64.rpm
dapl-devel-1.2.0-0.x86_64.rpm
ibutils-1.0-0.x86_64.rpm
kernel-ib-1.1-2.6.9_34.ELsmp.lanl.x86_64.rpm
kernel-ib-devel-1.1-2.6.9_34.ELsmp.lanl.x86_64.rpm
libibcm-0.9.0-0.x86_64.rpm
libibcm-devel-0.9.0-0.x86_64.rpm
libibcommon-1.0-0.x86_64.rpm
libibcommon-devel-1.0-0.x86_64.rpm
libibmad-1.0-0.x86_64.rpm
libibmad-devel-1.0-0.x86_64.rpm
libibumad-1.0-0.x86_64.rpm
libibumad-devel-1.0-0.x86_64.rpm
libibverbs-1.0.4-0.x86_64.rpm
libibverbs-devel-1.0.4-0.x86_64.rpm
libibverbs-utils-1.0.4-0.x86_64.rpm
libipathverbs-1.0-0.x86_64.rpm
libipathverbs-devel-1.0-0.x86_64.rpm
libmthca-1.0.3-0.x86_64.rpm
libmthca-devel-1.0.3-0.x86_64.rpm
libopensm-2.0.0-0.x86_64.rpm
libopensm-devel-2.0.0-0.x86_64.rpm
libosmcomp-2.0.0-0.x86_64.rpm
libosmcomp-devel-2.0.0-0.x86_64.rpm
libosmvendor-2.0.0-0.x86_64.rpm
libosmvendor-devel-2.0.0-0.x86_64.rpm
librdmacm-0.9.0-0.x86_64.rpm
librdmacm-devel-0.9.0-0.x86_64.rpm
librdmacm-utils-0.9.0-0.x86_64.rpm
mpitests_openmpi_gcc-2.0-0.x86_64.rpm
mstflint-1.0-0.x86_64.rpm
ofed-docs-1.1-0.noarch.rpm
ofed-scripts-1.1-0.noarch.rpm
openib-diags-1.1.0-0.x86_64.rpm
openmpi_gcc-1.1.1-1.x86_64.rpm
perftest-1.0-0.x86_64.rpm
tvflash-0.9.0-0.x86_64.rpm
Our kernel module info and interface configs look like this:
-bash-3.00# cat /etc/modprobe.conf
alias eth2 myri10ge
options myri10ge myri10ge_lro=0
-bash-3.00# lsmod
Module Size Used by
myri10ge 53656 0
ib_ipoib 45641 0
ib_mthca 129505 1
ib_uverbs 41841 3
ib_umad 18929 0
ib_ucm 21193 0
ib_sa 16213 1 ib_ipoib
ib_cm 39217 1 ib_ucm
ib_mad 40169 4 ib_mthca,ib_umad,ib_sa,ib_cm
ib_core 53313 8 ib_ipoib,ib_mthca,ib_uverbs,ib_umad,ib_ucm,ib_sa,ib_cm,ib_mad
bluesmoke_k8 18405 0
bluesmoke_mc 26377 5 bluesmoke_k8
perfctr 46305 0
ipmi_devintf 11985 0
ipmi_si 40033 0
ipmi_msghandler 33093 2 ipmi_devintf,ipmi_si
bnx2 147217 0
ext3 137809 0
jbd 68977 1 ext3
nfs 243825 1
lockd 77809 2 nfs
nfs_acl 5185 1 nfs
sunrpc 174905 5 nfs,lockd,nfs_acl
-bash-3.00# ifconfig ib0
ib0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:10.160.137.141 Bcast:10.160.159.255 Mask:255.255.224.0
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:6119 errors:0 dropped:0 overruns:0 frame:0
TX packets:1856 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:520392 (508.1 KiB) TX bytes:157844 (154.1 KiB)
-bash-3.00# ifconfig eth2
eth2 Link encap:Ethernet HWaddr 00:60:DD:47:B3:10
inet addr:10.160.168.74 Bcast:10.160.168.79 Mask:255.255.255.248
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1762 errors:0 dropped:0 overruns:0 frame:0
TX packets:4416 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:172374 (168.3 KiB) TX bytes:486056 (474.6 KiB)
Interrupt:74
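The two interface listings above report different MTUs (2044 on ib0, 1500 on eth2). As a rough sketch, the MTUs can be pulled out of saved ifconfig output and compared as shown here. The sample text is abbreviated from the listings above, and the function name is illustrative only. Whether the MTU difference has anything to do with the panic is not established in this report.

```python
import re

# Abbreviated from the ifconfig output above; only the lines the parser needs.
IFCONFIG_TEXT = """\
ib0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
eth2 Link encap:Ethernet HWaddr 00:60:DD:47:B3:10
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
"""


def parse_mtus(text):
    """Return {interface: mtu} from old-style (net-tools) ifconfig output."""
    mtus = {}
    current = None
    for line in text.splitlines():
        # A new interface stanza starts with "<name> Link encap:".
        start = re.match(r"(\S+)\s+Link encap:", line)
        if start:
            current = start.group(1)
        mtu = re.search(r"MTU:(\d+)", line)
        if mtu and current is not None:
            mtus[current] = int(mtu.group(1))
    return mtus


if __name__ == "__main__":
    mtus = parse_mtus(IFCONFIG_TEXT)
    for name, mtu in mtus.items():
        print(f"{name}: MTU {mtu}")
    if len(set(mtus.values())) > 1:
        print("interfaces use different MTUs")
```

On a router node, a packet arriving on the interface with the larger MTU may be too large to forward unchanged on the other one, so recording both values alongside each panic makes it easier to compare runs before and after the jumbo-frame change.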
Note that the Myri10G driver sets eth2 to MTU=9000 by default; here we have
set it to MTU=1500 (as a baseline). We are currently enabling jumbo frames to
see whether the problem changes. Any help or information you can provide
regarding this problem would be greatly appreciated!
Daryl Grunau