[openib-general] [Bug 286] New: "ifconfig ib# down" hangs telnet connection-- NETDEV WATCHDOG: ib0: transmit timed out

bugzilla-daemon at openib.org bugzilla-daemon at openib.org
Tue Oct 24 13:47:56 PDT 2006


http://openib.org/bugzilla/show_bug.cgi?id=286

           Summary: "ifconfig ib# down" hangs telnet connection-- NETDEV
                    WATCHDOG: ib0: transmit timed out
           Product: OpenFabrics Linux
           Version: 1.0rc6
          Platform: X86
        OS/Version: SLES 9
            Status: NEW
          Severity: major
          Priority: P2
         Component: IPoIB
        AssignedTo: bugzilla at openib.org
        ReportedBy: amir.vetry at sun.com
                CC: David.Brean at Sun.COM, amir.vetry at sun.com


The "ifconfig ib# down" command hangs the Ethernet telnet session it is run
from. Any subsequent telnet session opened to the Galaxy 4F system also hangs
as soon as "ifconfig -a" is executed in it. Moreover, Control-C does not
recover the telnet session, and the system does not allow any of the ifconfig
processes to be killed.

  Note: The telnet connection to this system is via an Ethernet port (not IB).


The issue is easiest to reproduce by running various IPoIB traffic across the
IB-HCA PCI-E ports (from Mellanox) in a Galaxy 4F (Sun x4600). The IB link
drops and the following error message is seen in /var/log/messages:

      "NETDEV WATCHDOG: ib0: transmit timed out"

When the IB link drops, no traffic can pass through the IB ports: all IPoIB
traffic stops and ping also fails.

After the above has been observed, type "ifconfig ib# down". This hangs the
Ethernet telnet connection, and it even hangs a console connection.


System information
==================
- Galaxy 4F (Sun x4600)
- IB-HCA PCI-E (Mellanox)
- OFED-1.0.1 (or 1.0) driver
- OS: SUSE 9 U3 (or Red Hat 4 U3)
   - Linux 2.6.5-7.244-smp #1 SMP x86_64 x86_64 x86_64 GNU/Linux


IB-HCA PCI-E information
========================
- fw_ver: 4.6.2
- vendor_id: 0x02c9
- vendor_part_id: 25208
- hw_ver: 0xA0


IB switch (e.g. Sun Sleipner switch or Topspin 360 switch)
==========================================================
- 9 IB ports
- Bootable Image: 2.1.2 (Apr 28 06)

OR
- Topspin 360 switch FW: 2.8 (52)
- 12 IB ports

Topology
========
G4F(ib#)-----(ib#)IBswitch(ib#)----TrafficGenerator

Steps to reproduce
==================
1. Telnet to a Galaxy 4F system (via an Ethernet port).
2. Run two (or more) streams of IPoIB traffic simultaneously (e.g.
   sync_netperf, NFS-corrupt).
3. While traffic is running, monitor the messages log.
4. Look for error messages like:
     NETDEV WATCHDOG: ib0: transmit timed out

5. Once this message is observed, the IB link should go down.
6. Run "ifconfig -a" to see which ib port is available.
7. Run "ifconfig ib# down". This should hang the connection!
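
Steps 3 and 4 can be scripted. The following is a minimal sketch, not part of
the original test setup: the function name watchdog_ifaces is hypothetical,
the default log path is the /var/log/messages file named above, and the
pattern is generalized from the "ib0" message to any ib# interface.

```shell
#!/bin/sh
# Print the names of ib# interfaces that have logged a transmit timeout.
# $1 is the log file to scan; it defaults to /var/log/messages.
watchdog_ifaces() {
    log="${1:-/var/log/messages}"
    # Match lines such as "NETDEV WATCHDOG: ib0: transmit timed out",
    # reduce each one to the interface name, and drop duplicates.
    sed -n 's/.*NETDEV WATCHDOG: \(ib[0-9][0-9]*\): transmit timed out.*/\1/p' "$log" |
        sort -u
}
```

Calling "watchdog_ifaces" with no argument scans /var/log/messages; each
interface it prints is a candidate for step 7. To watch the log live instead,
"tail -f /var/log/messages | grep 'NETDEV WATCHDOG'" should work as well.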


# ifconfig -a

ib4       Link encap:UNSPEC  HWaddr 00-00-04-04-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:192.9.11.176  Bcast:192.9.11.255  Mask:255.255.255.0
          inet6 addr: fe80::202:c902:20:2f2d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:3229 errors:0 dropped:0 overruns:0 frame:0
          TX packets:29 errors:0 dropped:3060 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:184632 (180.3 Kb)  TX bytes:3000 (2.9 Kb)

# ifconfig ib4 down up

     The system hangs here!


===============================================
#cat /etc/*release
SUSE LINUX Enterprise Server 9 (x86_64)
VERSION = 9
PATCHLEVEL = 3
LSB_VERSION="core-2.0-noarch:core-3.0-noarch:core-2.0-x86_64:core-3.0-x86_64"
nspgqa176b:~ #

hca_id: mthca2
        fw_ver:                         4.6.2
        node_guid:                      0002:c902:0020:2f2c
        sys_image_guid:                 0002:c902:0020:2f2f
        vendor_id:                      0x02c9
        vendor_part_id:                 25208
        hw_ver:                         0xA0
        board_id:                       MT_00B0000001
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               3
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               4
                        port_lmc:               0x00

nspgqa176b:~ #


nspgqa176b:~ #
nspgqa176b:~ # ps -ef | grep ifconfig
root     27495     1  0 10:57 ?        00:00:00 ifconfig ib4 down up
root     27497     1  0 10:57 ?        00:00:00 /sbin/ifconfig -a
root     27634     1  0 11:26 ?        00:00:00 ifconfig -a
root     27711     1  0 11:40 ?        00:00:00 ifconfig -a
root     27754     1  0 11:44 pts/4    00:00:00 ifconfig -a
root     27781     1  0 11:56 pts/3    00:00:00 ifconfig -a
root     27787 27638  0 11:57 pts/7    00:00:00 grep ifconfig
nspgqa176b:~ #
nspgqa176b:~ # kill 27495 27497  27634 27711 27754 27781 27787 27638
-bash: kill: (27787) - No such process
nspgqa176b:~ #
nspgqa176b:~ # ps -ef | grep ifconfig
root     27495     1  0 10:57 ?        00:00:00 ifconfig ib4 down up
root     27497     1  0 10:57 ?        00:00:00 /sbin/ifconfig -a
root     27634     1  0 11:26 ?        00:00:00 ifconfig -a
root     27711     1  0 11:40 ?        00:00:00 ifconfig -a
root     27754     1  0 11:44 pts/4    00:00:00 ifconfig -a
root     27781     1  0 11:56 pts/3    00:00:00 ifconfig -a
root     27789 27638  0 11:58 pts/7    00:00:00 grep ifconfig
nspgqa176b:~ #
nspgqa176b:~ # pkill -9 27495
nspgqa176b:~ # ps -ef | grep ifconfig
root     27495     1  0 10:57 ?        00:00:00 ifconfig ib4 down up
root     27497     1  0 10:57 ?        00:00:00 /sbin/ifconfig -a
root     27634     1  0 11:26 ?        00:00:00 ifconfig -a
root     27711     1  0 11:40 ?        00:00:00 ifconfig -a
root     27754     1  0 11:44 pts/4    00:00:00 ifconfig -a
root     27781     1  0 11:56 pts/3    00:00:00 ifconfig -a
root     27792 27638  0 11:58 pts/7    00:00:00 grep ifconfig
nspgqa176b:~ #
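
A possible reading of the transcript above, which this report does not
confirm, is that the ifconfig processes are blocked inside the kernel in
uninterruptible sleep (state "D" in ps), where signals, including SIGKILL,
are not acted on until the blocking call returns. Note also that pkill
matches its argument against process names rather than PIDs, so
"pkill -9 27495" would not have signaled PID 27495; "kill -9 27495" is the
PID form. The sketch below filters ps output down to D-state processes. The
function name d_state_only is hypothetical, and the column layout (PID first,
state second) is an assumption about how ps is invoked.

```shell
#!/bin/sh
# Read "PID STAT ..." lines on stdin and keep only those whose state
# column starts with D (uninterruptible sleep). The ps header line is
# dropped as a side effect, because "STAT" does not start with D.
d_state_only() {
    awk '$2 ~ /^D/'
}
```

A typical invocation, assuming a procps-style ps, would be
"ps -eo pid,stat,wchan,args | d_state_only". If the hung ifconfig processes
show up there, the wchan column names the kernel function they are waiting in.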



