[openib-general] [Bug 286] New: "ifconfig ib# down" hangs telnet connection-- NETDEV WATCHDOG: ib0: transmit timed out
bugzilla-daemon at openib.org
Tue Oct 24 13:47:56 PDT 2006
http://openib.org/bugzilla/show_bug.cgi?id=286
Summary: "ifconfig ib# down" hangs telnet connection-- NETDEV
WATCHDOG: ib0: transmit timed out
Product: OpenFabrics Linux
Version: 1.0rc6
Platform: X86
OS/Version: SLES 9
Status: NEW
Severity: major
Priority: P2
Component: IPoIB
AssignedTo: bugzilla at openib.org
ReportedBy: amir.vetry at sun.com
CC: David.Brean at Sun.COM, amir.vetry at sun.com
The "ifconfig ib# down" command hangs the Ethernet telnet session it is run
from. Any telnet session opened to the system afterwards also hangs as soon as
"ifconfig -a" is executed on the Galaxy 4F system. Moreover, Ctrl-C does not
recover the telnet session, and the system does not allow any of the ifconfig
processes to be killed.
Note: The telnet connections to this system go through an Ethernet port (not IB).
The issue is easiest to reproduce when several IPoIB traffic streams are run
across the PCI-E IB HCA ports (from Mellanox) in a Galaxy 4F (Sun x4600). The
IB link drops and the following error message is seen in /var/log/messages:
"NETDEV WATCHDOG: ib0: transmit timed out"
Once the IB link has dropped, no traffic can pass through the IB ports: all
IPoIB traffic stops and ping also fails.
After the above has been observed, type "ifconfig ib# down". This hangs the
Ethernet telnet connection, and it hangs the console connection as well.
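To confirm which interfaces have hit the watchdog before trying "ifconfig ib# down", the log can be scanned for the message quoted above. The following is a minimal sketch, not part of the original report; the function name count_watchdog_timeouts is hypothetical, and it assumes the kernel message appears in the log exactly in the form shown above.

```shell
#!/bin/sh
# Count "NETDEV WATCHDOG: <interface>: transmit timed out" lines per interface.
# Reads log text on stdin and prints "<interface> <count>", sorted by interface.
# count_watchdog_timeouts is a hypothetical helper name used for illustration.
count_watchdog_timeouts() {
    # Keep only the interface name from each matching line, then tally.
    sed -n 's/.*NETDEV WATCHDOG: \([^:]*\): transmit timed out.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Example (log path as given in the report):
#   count_watchdog_timeouts < /var/log/messages
```

Reading from stdin keeps the helper usable on a copy of the log taken from another machine, which avoids running anything network-related on the affected system.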
System information
=============
- Galaxy 4 F (sun x4600)
- IB-HCA PCI-E (Mellanox)
- OFED-1.0.1 (or 1.0) driver
- OS: Suse 9- U3 (or Redhat4-u3)
- Linux 2.6.5-7.244-smp #1 SMP x86_64 x86_64 x86_64 GNU/Linux
IB-HCA PCI-E information
========================
- fw_ver: 4.6.2
- vendor_id: 0x02c9
- vendor_part_id: 25208
- hw_ver: 0xA0
IB switch (e.g. Sun Sleipner switch or Topspin 360 switch)
==========================================================
- 9 Port IB
- Bootable Image: 2.1.2 (Apr 28 06)
OR
- Topspin 360 switch FW: 2.8 (52)
- 12 IB ports
Topology
========
G4F(ib#)-----(ib#)IBswitch(ib#)----TrafficGenerator
Steps to reproduce
==================
1. Telnet to a Galaxy 4F system (via Ethernet port)
2. Run two (or more) IPoIB traffic streams simultaneously (e.g. sync_netperf,
NFS-corrupt)
3. While traffic is running, monitor the messages log
4. Look for error messages like:
NETDEV WATCHDOG: ib0: transmit timed out
5. Once this message is observed, the IB link should go down
6. Run "ifconfig -a" to see which ib ports are available
7. Run "ifconfig ib# down". This should hang the connection!
# ifconfig -a
ib4 Link encap:UNSPEC HWaddr 00-00-04-04-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:192.9.11.176 Bcast:192.9.11.255 Mask:255.255.255.0
inet6 addr: fe80::202:c902:20:2f2d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:3229 errors:0 dropped:0 overruns:0 frame:0
TX packets:29 errors:0 dropped:3060 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:184632 (180.3 Kb) TX bytes:3000 (2.9 Kb)
# ifconfig ib4 down up
System will hang here !!!
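In the ifconfig output above, ib4 shows far more dropped TX packets (3060) than transmitted ones (29). As a rough aid for spotting such an interface, the sketch below pulls the TX packet and TX dropped counters out of ifconfig output. It is an illustration only: tx_drop_summary is a hypothetical name, it assumes the older net-tools output format shown above, and treating a high drop count as a sign of the stalled link is an assumption rather than something the report states.

```shell
#!/bin/sh
# Print "<interface> <tx_packets> <tx_dropped>" for each interface found in
# net-tools style ifconfig output read from stdin.
# tx_drop_summary is a hypothetical helper name used for illustration.
tx_drop_summary() {
    awk '
        # The header line of each interface starts with its name.
        /Link encap:/ { iface = $1 }
        /TX packets:/ {
            pkts = ""; drop = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^packets:/) pkts = substr($i, 9)   # text after "packets:"
                if ($i ~ /^dropped:/) drop = substr($i, 9)   # text after "dropped:"
            }
            print iface, pkts, drop
        }
    '
}

# Example, using output saved BEFORE the hang occurs, since "ifconfig -a"
# itself is reported to hang afterwards:
#   tx_drop_summary < ifconfig-a.txt
```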
===============================================
#cat /etc/*release
SUSE LINUX Enterprise Server 9 (x86_64)
VERSION = 9
PATCHLEVEL = 3
LSB_VERSION="core-2.0-noarch:core-3.0-noarch:core-2.0-x86_64:core-3.0-x86_64"
nspgqa176b:~ #
hca_id: mthca2
fw_ver: 4.6.2
node_guid: 0002:c902:0020:2f2c
sys_image_guid: 0002:c902:0020:2f2f
vendor_id: 0x02c9
vendor_part_id: 25208
hw_ver: 0xA0
board_id: MT_00B0000001
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 3
port_lmc: 0x00
port: 2
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 4
port_lmc: 0x00
nspgqa176b:~ #
nspgqa176b:~ #
nspgqa176b:~ # ps -ef | grep ifconfig
root 27495 1 0 10:57 ? 00:00:00 ifconfig ib4 down up
root 27497 1 0 10:57 ? 00:00:00 /sbin/ifconfig -a
root 27634 1 0 11:26 ? 00:00:00 ifconfig -a
root 27711 1 0 11:40 ? 00:00:00 ifconfig -a
root 27754 1 0 11:44 pts/4 00:00:00 ifconfig -a
root 27781 1 0 11:56 pts/3 00:00:00 ifconfig -a
root 27787 27638 0 11:57 pts/7 00:00:00 grep ifconfig
nspgqa176b:~ #
nspgqa176b:~ # kill 27495 27497 27634 27711 27754 27781 27787 27638
-bash: kill: (27787) - No such process
nspgqa176b:~ #
nspgqa176b:~ # ps -ef | grep ifconfig
root 27495 1 0 10:57 ? 00:00:00 ifconfig ib4 down up
root 27497 1 0 10:57 ? 00:00:00 /sbin/ifconfig -a
root 27634 1 0 11:26 ? 00:00:00 ifconfig -a
root 27711 1 0 11:40 ? 00:00:00 ifconfig -a
root 27754 1 0 11:44 pts/4 00:00:00 ifconfig -a
root 27781 1 0 11:56 pts/3 00:00:00 ifconfig -a
root 27789 27638 0 11:58 pts/7 00:00:00 grep ifconfig
nspgqa176b:~ #
nspgqa176b:~ # pkill -9 27495
nspgqa176b:~ # ps -ef | grep ifconfig
root 27495 1 0 10:57 ? 00:00:00 ifconfig ib4 down up
root 27497 1 0 10:57 ? 00:00:00 /sbin/ifconfig -a
root 27634 1 0 11:26 ? 00:00:00 ifconfig -a
root 27711 1 0 11:40 ? 00:00:00 ifconfig -a
root 27754 1 0 11:44 pts/4 00:00:00 ifconfig -a
root 27781 1 0 11:56 pts/3 00:00:00 ifconfig -a
root 27792 27638 0 11:58 pts/7 00:00:00 grep ifconfig
nspgqa176b:~ #
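The transcript above shows that the ifconfig processes survive kill. The report does not show their process states; one plausible explanation, stated here only as an assumption, is that they are blocked in uninterruptible sleep (STAT "D"), where signals are not delivered until the kernel call returns. A minimal sketch for checking this follows; uninterruptible_only is a hypothetical helper name.

```shell
#!/bin/sh
# Filter "ps -eo pid,stat,wchan,args" output read from stdin, keeping the
# header line and every process whose STAT column starts with "D"
# (uninterruptible sleep).
# uninterruptible_only is a hypothetical helper name used for illustration.
uninterruptible_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Example:
#   ps -eo pid,stat,wchan,args | uninterruptible_only
```

If the hung ifconfig processes show up here, the WCHAN column names the kernel function they are sleeping in, which is useful information to attach to the bug.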
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.