[openib-general] Problem with OFED on XT3 (update)
Makia Minich
minich at ornl.gov
Mon Jul 24 05:58:29 PDT 2006
Just wanted to update people on this (still looking for some insight, but
not really expecting any).
I was able to bring up the entire stack (all loadable modules) without
ifconfig'ing the ib0 interface. At this point, I'm able to participate in
the network (I can see the subnet manager, ibping works between the XT3 and
a separate node on the switch, ibstat sees the card information, and
ibnetdiscover can see the entire network). So, it would seem that there's
something going on in ipoib specifically.
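For anyone wanting to repeat the same sanity checks, here is a minimal
sketch in Python. It assumes the ibstat and ibnetdiscover tools are on the
PATH. The run_checks helper and its status strings are made up for
illustration, and ibping is left out because it needs a responder on the
far node and a target LID.

```python
import shutil
import subprocess

# Assumption: the diagnostic tools named in this message are installed.
# ibping is omitted; it needs a server on the remote node and a target LID.
CHECKS = [["ibstat"], ["ibnetdiscover"]]


def run_checks(checks=CHECKS, timeout=30):
    """Run each command and return {name: 'ok'|'failed'|'missing'|'timeout'}."""
    results = {}
    for argv in checks:
        name = argv[0]
        # Skip tools that are not installed instead of raising.
        if shutil.which(name) is None:
            results[name] = "missing"
            continue
        try:
            proc = subprocess.run(
                argv,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            results[name] = "timeout"
            continue
        # A zero exit status only means the tool ran cleanly.
        results[name] = "ok" if proc.returncode == 0 else "failed"
    return results


if __name__ == "__main__":
    for name, status in run_checks().items():
        print(f"{name}: {status}")
```

A clean run only says the verbs/MAD side of the stack is answering. It does
not exercise ipoib at all, which matches what I'm seeing.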
When I have a chance to try some other things, I'll update accordingly, but
again if anyone happens to see something that seems interesting from the
kernel panic below, let me know (not really expecting much, just hoping that
someone has run across a similar problem).
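To make the trace below easier to skim, here is a rough Python sketch that
pulls the <address>{symbol+offset} entries out of an oops. The names
FRAME_RE, parse_frames and SAMPLE are hypothetical. Treating the offset as
decimal is an assumption based on how this kernel appears to print it.

```python
import re

# Matches entries such as "<ffffffff8029ef78>{dev_queue_xmit+200}".
# Assumption: the offset after '+' is decimal, as it appears in this oops.
FRAME_RE = re.compile(r"<([0-9a-fA-F]{16})>\{([A-Za-z0-9_.]+)\+(\d+)\}")


def parse_frames(text):
    """Return (address, symbol, offset) tuples in the order they appear."""
    return [
        (int(addr, 16), symbol, int(offset))
        for addr, symbol, offset in FRAME_RE.findall(text)
    ]


# A few entries copied from the quoted trace; the "> " quote markers are
# harmless because the pattern is searched for anywhere in the text.
SAMPLE = """\
> <ffffffff802ab825>{noop_enqueue+37}
> <ffffffff8029ef78>{dev_queue_xmit+200}
> <ffffffff80310014>{igmp6_send+724}
> <ffffffff8029fe0f>{dev_change_flags+95}
"""

if __name__ == "__main__":
    for addr, symbol, offset in parse_frames(SAMPLE):
        print(f"{addr:#018x}  {symbol}+{offset}")
```

Run over the full trace, the symbol column makes the IPv6 addrconf path
leading into dev_queue_xmit easy to pick out.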
On 7/18/06 10:58 AM, "Makia Minich" <minich at ornl.gov> wrote:
> First, a little bit about what I'm trying to do (hoping that someone becomes
> interested enough to keep reading), and then the problem. I'm currently
> tasked with getting some form of infiniband up and running on a service node
> of the Cray XT3. Because the XT3 is currently shipping with SuSE9 (with the
> 2.6.5-based kernel), I decided to go with the OFED 1.0.1 release to see
> what happens out of the box. Because of the system layout, I'm
> unable to change out the kernel, so there were some minor OFED source tweaks
> that I needed to perform (attached) to satisfy some missing symbols.
>
> On loading modules, I was seemingly successful loading everything up to and
> including ib_ipoib. Ifconfig showed the ib0 and ib1 devices available, and
> /sys/class/infiniband showed that the link to the subnet manager was in
> place. Attempting to assign an IP address to the interface proved to be too
> much, as the node kernel panicked with the following:
>
> general protection fault: 0000 [1]
> CPU 0
> Pid: 11258, comm: ifconfig Tainted: P U (2.6.5-7.252-ss )
> RIP: 0010:[<ffffffff8029a85d>] <ffffffff8029a85d>{__kfree_skb+173}
> RSP: 0018:00000100c3cf3af8 EFLAGS: 00010286
> RAX: 1b6012ffffffff00 RBX: 0000000000000000 RCX: ffffffffffffffe8
> RDX: 0000000000000000 RSI: ffffffff80421ba0 RDI: 0000010005cfd340
> RBP: 00000100e0c97480 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffff4
> R13: ffffffff8029eeb0 R14: 0000000000000000 R15: 0000000000000003
> FS: 0000002a9588e0a0(0000) GS:ffffffff80514b40(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000002a9576650c CR3: 0000000000101000 CR4: 00000000000006e0
> Process ifconfig (pid: 11258, threadinfo 00000100c3cf2000, task 00000100c3eba580)
> Stack: 0000000000000003 00000100c281f000 00000100e0c97480 ffffffff802ab825
>        00000100c281f000 ffffffff8029ef78 0000000000000000 00000100c281f000
>        0000000000000003 ffffffff802a86e3
> Call Trace:
> <ffffffff802ab825>{noop_enqueue+37}
> <ffffffff8029ef78>{dev_queue_xmit+200}
> <ffffffff802a86e3>{nf_hook_slow+227}
> <ffffffff8029eeb0>{dev_queue_xmit+0}
> <ffffffff80310014>{igmp6_send+724}
> <ffffffff80304270>{fib6_walk_continue+192}
> <ffffffff803043a0>{fib6_clean_node+0}
> <ffffffff803107e3>{igmp6_join_group+51}
> <ffffffff8030e18f>{igmp6_group_added+191}
> <ffffffff802fbae1>{addrconf_prefix_route+225}
> <ffffffff8030e415>{mld_del_delrec+117}
> <ffffffff8030e726>{ipv6_dev_mc_inc+486}
> <ffffffff802fb86b>{addrconf_join_solict+59}
> <ffffffff802fd0fc>{addrconf_dad_start+28}
> <ffffffff802fc93b>{addrconf_add_linklocal+43}
> <ffffffff802fca35>{addrconf_dev_config+229}
> <ffffffff802fcc9b>{addrconf_notify+123}
> <ffffffff801411ff>{notifier_call_chain+31}
> <ffffffff8029ea25>{dev_open+261}
> <ffffffff8029fe0f>{dev_change_flags+95}
> <ffffffff802d91c4>{devinet_ioctl+756}
> <ffffffff802db4c7>{inet_ioctl+87}
> <ffffffff80297641>{sock_ioctl+577}
> <ffffffff80186ef4>{sys_ioctl+532}
> <ffffffff80112055>{error_exit+0}
> <ffffffff80111750>{system_call+124}
> Code: ff 08 0f 94 c2 84 d2 74 09 48 8b 01 48 89 c7 ff 50 08 48 89
> RIP <ffffffff8029a85d>{__kfree_skb+173} RSP <00000100c3cf3af8>
>
> <0>Kernel panic: Aiee, killing interrupt handler!
> In interrupt handler - not syncing
>
> Due to a lack of system dumps, I'm hoping that someone might have seen a
> similar panic and might offer some things to try to resolve this issue.
>
> Thanks...
--
Makia Minich <minich at ornl.gov>
National Center for Computation Science
Oak Ridge National Laboratory
Phone: 865.574.7460