[openib-general] [ANNOUCEv2] OpenIB OpenSM 1.1.0: trunk now supports 1.8.0 features

Hal Rosenstock halr at voltaire.com
Wed Sep 14 08:14:30 PDT 2005


Hi Brett,

On Wed, 2005-09-14 at 10:01, Brett Bode wrote:
>     Let's see how many of these I can tackle... 

Thanks for the configuration information; it puts things in better
perspective. More specifics may be needed, but let's see how far this
gets us.

> There are two switches 
> in the setup and both are brand new 24 port DDR2 switches from Mellanox 
> (sorry i don't know the switch part off the top of my head). Most of 
> the NICs are rev a1 based NICs that have a fairly recent firmware on 
> them. Though the opensm is running on a MT25208 InfiniHost III Ex in a 
> dual opteron. The node that failed is an 8 way IBM pSeries p655 with:
> [  176.575945] ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 
> 2005)
> [  176.674674] ib_mthca: Initializing Mellanox Technologies MT23108 
> InfiniHost (0001:62:00.0)
> [  176.800432] PCI: Enabling device: (0001:62:00.0), cmd 142
> 
> lspci on a matching node gives:
> 0001:62:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1)
>          Subsystem: Mellanox Technology MT23108 InfiniHost
>          Flags: bus master, 66MHz, medium devsel, latency 144, IRQ 201
>          Memory at d8800000 (64-bit, non-prefetchable) [size=1M]
>          Memory at d8000000 (64-bit, prefetchable) [size=8M]
>          Capabilities: <available only to root>
> 
> The pSeries nodes have firmware to hide the onboard memory due to some 
> issues with openfirmware... At this point the node that had the issue 
> is in a weird state where I can still login and perform some commands, 
> but some fail (lspci hangs) and its routing is a bit screwed up as it 
> can't see hosts over the ethernet properly either. I think that is 
> because it crashed in the ipoib code. Here is the kernel oops:
> [2507694.118336] Oops: Kernel access of bad area, sig: 11 [#1]
> [2507694.131400] SMP NR_CPUS=8 PSERIES
> [2507694.139788] Modules linked in: pvfs2 ib_ipoib ib_sa ib_mthca 
> ib_mad ib_core
> [2507694.156537] NIP: D0000000006216A0 XER: 20000000 LR: 
> D000000000621680 CTR: C0000000001C5CD8
> [2507694.176313] REGS: c0000007fe73f8a0 TRAP: 0300   Not tainted  
> (2.6.12.3-power4)
> [2507694.193613] MSR: 9000000000001032 EE: 0 PR: 0 FP: 0 ME: 1 IR/DR: 
> 11 CR: 24000084
> [2507694.211341] DAR: 0000000000000008 DSISR: 0000000042000000
> [2507694.224396] TASK: c0000000081e1000[23] 'events/5' THREAD: 
> c0000007fe73c000 CPU: 5
> [2507694.241849] GPR00: D000000000621680 C0000007FE73FB20 
> D000000000636A98 C00000079D14A2C0
> [2507694.261054] GPR04: D00000000062E648 C00000003FA4B080 
> 0000000000000000 0000000000000000
> [2507694.280223] GPR08: 0000000000000000 0000000000000000 
> C00000003FA557D0 C00000077CA81E80
> [2507694.299362] GPR12: D0000000006289D0 C00000000052BC00 
> 0000000000000000 0000000000000000
> [2507694.318568] GPR16: 0000000000000000 0000000000000000 
> 0000000003A10000 000000000291FE84
> [2507694.337756] GPR20: 0000000000000038 0000000003EB8E28 
> 0000000800000000 C0000000041B1AC0
> [2507694.356925] GPR24: C00000002CE34380 9000000000009032 
> C00000003FA557E8 C00000003FA55780
> [2507694.376130] GPR28: 0000000000000000 C0000007918252C0 
> D000000000635F80 C0000007918252C0
> [2507694.395650] NIP [d0000000006216a0] .path_free+0x1a8/0x26c 
> [ib_ipoib]
> [2507694.410972] LR [d000000000621680] .path_free+0x188/0x26c [ib_ipoib]
> [2507694.426015] Call Trace:
> [2507694.432189] [c0000007fe73fb20] [d000000000621680] 
> .path_free+0x188/0x26c [ib_ipoib] (unreliable)
> [2507694.453198] [c0000007fe73fbd0] [d000000000621864] 
> .ipoib_flush_paths+0x100/0x148 [ib_ipoib]
> [2507694.473184] [c0000007fe73fc80] [d0000000006249c0] 
> .ipoib_ib_dev_down+0x13c/0x194 [ib_ipoib]
> [2507694.493149] [c0000007fe73fd20] [d000000000625004] 
> .ipoib_ib_dev_flush+0x44/0xac [ib_ipoib]
> [2507694.512946] [c0000007fe73fdb0] [c00000000005ca0c] 
> .worker_thread+0x244/0x318
> [2507694.529880] [c0000007fe73fee0] [c0000000000630e4] 
> .kthread+0x154/0x1a4
> [2507694.545601] [c0000007fe73ff90] [c000000000013508] 
> .kernel_thread+0x4c/0x6c
> [2507694.562225] Instruction dump:
> [2507694.569557] 38630020 419effb8 e89e8008 48007355 e8410028 e93d0020 
> 7fa3eb78 fb890058
> [2507694.588139] 60000000 e97d0020 7ffdfb78 e92b00d8 <fb890008> 
> 480072fd e8410028 381f0028
> [2507694.607156]

Which OpenIB svn revision are you running?

> What is the procedure for determining if the multicast setup on the 
> switch is trashed?

When the failure occurs, please run ibnetdiscover and send the output.
Also run ibchecknet and see what it reports.

ibroute - display unicast and multicast forwarding tables of switches

ibroute is run against a switch LID, so first determine the LIDs of the
switches (ibswitches can help with this).

So it's something like:
ibnetdiscover top1
ibswitches top1
Switch  : 0x005442ba00003080 ports 24 "MT47396 Infiniscale-III Mellanox Technologies" port 0 lid 2
Switch  : 0x0008f10400410015 ports 8 "SW-6IB4 Voltaire" port 0 lid 5

ibroute -M 2
Multicast mlids [0xc000-0xc3ff] of switch Lid 0x2 guid
0x005442ba00003080 (MT47396 Infiniscale-III Mellanox Technologies):
            0                   1                   2         
     Ports: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 
 MLid
0xc000                              x                         
0xc001                              x                         
0xc002                              x                         
0xc003                              x                         
4 valid mlids dumped 

ibroute -M 5
Multicast mlids [0xc000-0xc3ff] of switch Lid 0x5 guid
0x0008f10400410015 (SW-6IB4 Voltaire):
     Ports: 0 1 2 3 4 5 6 7 8 
 MLid
0xc000        x   x     x     
0xc001        x   x     x     
0xc003        x   x     x     
0xc004            x     x     
0xc005            x           
0xc006                  x     
6 valid mlids dumped 
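
To compare the multicast tables of several switches without reading them by eye, the dumps can be parsed. What follows is only a rough sketch, assuming the `ibroute -M` output looks exactly like the samples above (a "Ports:" header line followed by one row per MLID with an "x" under each member port). The names `parse_mcast_table` and `diff_mlids` are made up for illustration and are not part of the diagnostics.

```python
import re

# An MLID row starts with exactly four hex digits followed by whitespace,
# which keeps the 64-bit GUID line in the header from matching.
MLID_ROW = re.compile(r"^(0x[0-9a-fA-F]{4})(?:\s|$)")


def parse_mcast_table(text):
    """Map each MLID in one `ibroute -M` dump to the set of ports marked 'x'."""
    col_to_port = {}
    table = {}
    for line in text.splitlines():
        if "Ports:" in line:
            # Ports are listed in order, so the n-th digit column is port n.
            start = line.index("Ports:") + len("Ports:")
            digit_cols = [i for i in range(start, len(line)) if line[i].isdigit()]
            col_to_port = {col: port for port, col in enumerate(digit_cols)}
            continue
        m = MLID_ROW.match(line)
        if m:
            # Skip the MLID itself so the 'x' in "0x" is never counted.
            ports = {col_to_port[i] for i in range(m.end(1), len(line))
                     if line[i] == "x" and i in col_to_port}
            table[int(m.group(1), 16)] = ports
    return table


def diff_mlids(a, b):
    """Return (MLIDs only in a, MLIDs only in b) for two parsed tables."""
    return sorted(set(a) - set(b)), sorted(set(b) - set(a))


if __name__ == "__main__":
    # The LID 5 dump from above, written as quoted strings to keep the columns.
    sample = "\n".join([
        "Multicast mlids [0xc000-0xc3ff] of switch Lid 0x5 guid",
        "0x0008f10400410015 (SW-6IB4 Voltaire):",
        "     Ports: 0 1 2 3 4 5 6 7 8 ",
        " MLid",
        "0xc000        x   x     x     ",
        "0xc004            x     x     ",
    ])
    for mlid, ports in sorted(parse_mcast_table(sample).items()):
        print(hex(mlid), sorted(ports))
```

Run against the two dumps above, `diff_mlids` would report 0xc002 as present only on LID 2 and 0xc004 through 0xc006 as present only on LID 5. A difference is not necessarily an error, since a switch only needs entries for the groups routed through it, but it shows where to look first.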

The LIDs to use depend on your configuration, that is, on what OpenSM hands out.
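
Since the LIDs differ per fabric, they can be pulled out of the ibswitches output rather than typed by hand. This is a minimal sketch, assuming the one-line-per-switch format shown in the sample above; other versions of the tool may print something different. `switch_lids` is a hypothetical helper name, not an existing command.

```python
import re

# Matches lines such as:
# Switch  : 0x0008f10400410015 ports 8 "SW-6IB4 Voltaire" port 0 lid 5
SWITCH_LINE = re.compile(
    r'^Switch\s*:\s*(0x[0-9a-fA-F]+)\s+ports\s+(\d+)\s+"([^"]*)"'
    r'\s+port\s+(\d+)\s+lid\s+(\d+)'
)


def switch_lids(text):
    """Return a list of (lid, guid, description) for each switch line found."""
    found = []
    for line in text.splitlines():
        m = SWITCH_LINE.match(line.strip())
        if m:
            guid, _nports, desc, _port, lid = m.groups()
            found.append((int(lid), guid, desc))
    return found


if __name__ == "__main__":
    sample = "\n".join([
        'Switch  : 0x005442ba00003080 ports 24 '
        '"MT47396 Infiniscale-III Mellanox Technologies" port 0 lid 2',
        'Switch  : 0x0008f10400410015 ports 8 "SW-6IB4 Voltaire" port 0 lid 5',
    ])
    # Print the ibroute command to run for each switch that was found.
    for lid, guid, desc in switch_lids(sample):
        print(f"ibroute -M {lid}    # {desc} ({guid})")
```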

There is also ibtracert:
ibtracert - display unicast or multicast route from source to destination

> I suspect that if it is, the crashed node is causing it as I had power 
> cycled the switch yesterday which seemed to get things working up until 
> I plugged the crashed node in again.

But without power cycling the switch things don't work, right? Just
unplugging this node doesn't fix it? It sounds like the switch has
some issue. Can you tell whether it forwards any packets?

-- Hal



