[ofa-general] Next set of mthca issues

Scott Weitzenkamp (sweitzen) sweitzen at cisco.com
Mon Apr 16 18:38:18 PDT 2007


Looks like https://bugs.openfabrics.org/show_bug.cgi?id=431 to me, which
is fixed in OFED-1.2-20070411-0938 or newer.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org 
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of 
> Pradeep Satyanarayana
> Sent: Monday, April 16, 2007 5:19 PM
> To: general at lists.openfabrics.org; Michael S. Tsirkin; Roland 
> Dreier (rdreier)
> Subject: [ofa-general] Next set of mthca issues
> 
> Here is the stack trace that I see after upgrading to the latest 
> firmware version (3.5). The FW version is no longer displayed in 
> /var/log/messages. Is that because the FW version is now at the 
> "expected level"? 
> However, /sys/class/infiniband/mthca0/fw_ver does indicate it is 3.5.
> 
> ping seems to work fine, but I run into problems with netperf, 
> especially when it is the receiver (i.e. running netserver).
> I am running these tests on a ppc64 machine.
> 
> Pradeep
> pradeep at us.ibm.com
> 
> 
> Apr 16 19:37:49 elm3b37 kernel: ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)
> Apr 16 19:37:49 elm3b37 kernel: ib_mthca: Initializing 0002:d9:00.0
> Apr 16 19:37:53 elm3b37 kernel: ADDRCONF(NETDEV_UP): ib1: link is not ready
> Apr 16 19:38:02 elm3b37 kernel: ib0: enabling connected mode will cause multicast packet drops
> Apr 16 19:38:05 elm3b37 kernel: ib0: mtu > 2044 will cause multicast packet drops.
> Apr 16 19:46:25 elm3b37 kernel: Call Trace:
> Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3BB0] [C00000000000F884] .show_stack+0x54/0x1f0 (unreliable)
> Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3C60] [C0000000000415EC] .eeh_dn_check_failure+0x2bc/0x320
> Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3D10] [C0000000000416E4] .eeh_check_failure+0x94/0x170
> Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3D90] [D00000000025ACEC] .mthca_tavor_interrupt+0x1cc/0x1e0 [ib_mthca]
> Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3E50] [C00000000008C180] .handle_IRQ_event+0x70/0x100
> Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3EF0] [C00000000008EAB0] .handle_fasteoi_irq+0xd0/0x200
> Apr 16 19:46:25 elm3b37 kernel: [C00000000FFF3F90] [C000000000028638] .call_handle_irq+0x1c/0x2c
> Apr 16 19:46:25 elm3b37 kernel: [C0000000EB57FA50] [C00000000000CCA0] .do_IRQ+0xc0/0x1e0
> Apr 16 19:46:25 elm3b37 kernel: [C0000000EB57FAE0] [C000000000004270] hardware_interrupt_entry+0x18/0x28
> Apr 16 19:46:25 elm3b37 kernel: --- Exception: 501 at .pseries_dedicated_idle_sleep+0xd4/0x1a0
> Apr 16 19:46:25 elm3b37 kernel:     LR = .pseries_dedicated_idle_sleep+0xd0/0x1a0
> Apr 16 19:46:25 elm3b37 kernel: [C0000000EB57FDD0] [0000000000000000] .__start+0x4000000000000000/0x8 (unreliable)
> Apr 16 19:46:25 elm3b37 kernel: [C0000000EB57FE70] [C00000000001200C] .cpu_idle+0x13c/0x250
> Apr 16 19:46:25 elm3b37 kernel: [C0000000EB57FF00] [C00000000002B16C] .start_secondary+0x14c/0x190
> Apr 16 19:46:25 elm3b37 kernel: [C0000000EB57FF90] [C000000000008364] .start_secondary_prolog+0xc/0x10
> Apr 16 19:46:25 elm3b37 kernel: EEH: Detected PCI bus error on device 0002:d9:00.0
> Apr 16 19:46:25 elm3b37 kernel: EEH: This PCI device has failed 1 times since last reboot: location=U7879.001.DQD1EKZ-P1-C2 driver=ib_mthca pci addr=0002:d9:00.0
> Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0: Catastrophic error detected: unknown error
> Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0:   buf[00]: ffffffff
> Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0:   buf[01]: ffffffff
> Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0:   buf[02]: ffffffff
> Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0:   buf[03]: ffffffff
> Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0:   buf[04]: ffffffff
> Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0:   buf[05]: ffffffff
> Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0:   buf[06]: ffffffff
> Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0:   buf[07]: ffffffff
> Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0:   buf[08]: ffffffff
> Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0:   buf[09]: ffffffff
> Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0:   buf[0a]: ffffffff
> Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0:   buf[0b]: ffffffff
> Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0:   buf[0c]: ffffffff
> Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0:   buf[0d]: ffffffff
> Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0:   buf[0e]: ffffffff
> Apr 16 19:46:30 elm3b37 kernel: ib_mthca 0002:d9:00.0:   buf[0f]: ffffffff
> Apr 16 19:46:35 elm3b37 kernel: ib_mthca 0002:d9:00.0: HW2SW_MPT failed (-11)
> Apr 16 19:47:05 elm3b37 last message repeated 3 times
> Apr 16 19:47:05 elm3b37 last message repeated 3 times
> Apr 16 19:47:15 elm3b37 kernel: ib0: ib_detach_mcast failed (result = -11)
> Apr 16 19:47:15 elm3b37 kernel: ib0: ipoib_mcast_detach failed (result = -11)
> Apr 16 19:47:25 elm3b37 kernel: ib0: ib_detach_mcast failed (result = -11)
> Apr 16 19:47:25 elm3b37 kernel: ib0: ipoib_mcast_detach failed (result = -11)
> Apr 16 19:47:35 elm3b37 kernel: ib0: ib_detach_mcast failed (result = -11)
> Apr 16 19:47:35 elm3b37 kernel: ib0: ipoib_mcast_detach failed (result = -11)
> Apr 16 19:47:45 elm3b37 kernel: ib0: ib_detach_mcast failed (result = -11)
> Apr 16 19:47:45 elm3b37 kernel: ib0: ipoib_mcast_detach failed (result = -11)
> 
> 
> 
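
For anyone triaging a similar report, the quoted log above can be scanned mechanically. What follows is a minimal Python sketch and not part of the original thread. The names unwrap and summarize, the pattern table, and the category names are all hypothetical. It assumes that each syslog record begins with a "Mon DD HH:MM:SS" timestamp and that quoted lines carry a leading ">". Any record that a mail client split across several lines is rejoined before matching.

```python
import re
from collections import Counter

# Assumption: every syslog record starts with a "Mon DD HH:MM:SS " timestamp.
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")

# Hypothetical category names mapped to message fragments seen in the log.
PATTERNS = {
    "eeh_bus_error": re.compile(r"EEH: Detected PCI bus error"),
    "catastrophic_error": re.compile(r"Catastrophic error detected"),
    "failed_minus_11": re.compile(r"failed \((?:result = )?-11\)"),
}


def unwrap(quoted_lines):
    """Strip the '>' quote prefix and rejoin records split across lines."""
    records = []
    for raw in quoted_lines:
        text = raw[1:].strip() if raw.startswith(">") else raw.strip()
        if not text:
            continue  # skip blank quoted lines
        if TIMESTAMP.match(text) or not records:
            records.append(text)  # a new syslog record begins here
        else:
            records[-1] += " " + text  # continuation of the previous record
    return records


def summarize(records):
    """Count how many records match each pattern in PATTERNS."""
    counts = Counter()
    for record in records:
        for name, pattern in PATTERNS.items():
            if pattern.search(record):
                counts[name] += 1
    return counts
```

Calling summarize(unwrap(lines)) on the quoted lines gives a per-category count, which makes it easy to see whether the EEH bus error comes before the driver's catastrophic error and the subsequent -11 failures.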