[ofa-general] ipath oops
Bernd Schubert
bs at q-leap.de
Fri Mar 30 04:42:18 PDT 2007
There has been no answer so far and I need a little help to debug this. I
changed the subject; maybe it will attract more interest that way.
Hi,
with 2.6.20.4 and lustre-1.4.9 we get an oops, see below.
In principle it could also be a Lustre problem, but with Mellanox cards it
just works fine.
[ 195.339317] Lustre: Added LNI 192.168.41.106 at o2ib [8/64]
[ 195.352336] Lustre: Added LNI 192.168.42.106 at tcp [8/256]
[ 195.357796] Lustre: Accept secure, port 988
[ 195.412988] Lustre: Lustre Lite Client File System; info at clusterfs.com
[ 195.449596] Unable to handle kernel paging request at 000000007740b000 RIP:
[ 195.454249] [<ffffffff803513d2>] __iowrite32_copy+0x2/0x8
[ 195.462306] PGD 11ac87067 PUD 0
[ 195.465648] Oops: 0000 [1] SMP
Entering kdb (current=0xffff81007755c100, pid 3191) on processor 3 Oops:
<NULL>
due to oops @ 0xffffffff803513d2
r15 = 0x0000000000000005 r14 = 0x0000000000000168
r13 = 0x000000007740b000 r12 = 0xffffc200001d601c
rbp = 0xffff81007c083a60 rbx = 0x0000000000000059
r11 = 0x0000000000000000 r10 = 0xffff810076bc4000
r9 = 0xffff810076bc4000 r8 = 0xffff81007ccf2ec8
rax = 0x0000000000000000 rcx = 0x0000000000000059
rdx = 0x0000000000000059 rsi = 0x000000007740b000
rdi = 0xffffc200001d601c orig_rax = 0xffffffffffffffff
rip = 0xffffffff803513d2 cs = 0x0000000000000010
eflags = 0x0000000000010206 rsp = 0xffff81007c0839f0
ss = 0x0000000000000000 &regs = 0xffff81007c083958
[3]kdb> bt
Stack traceback for pid 3191
0xffff81007755c100 3191 19 1 3 R 0xffff81007755c3c0 *ib_cm/3
rsp rip Function (args)
0xffff81007c0839d8 0xffffffff803513d2 __iowrite32_copy+0x2
0xffff81007c083a08 0xffffffff88066161 [ib_ipath]ipath_verbs_send+0x10b
0xffff81007c083a68 0xffffffff88061205 [ib_ipath]ipath_do_ruc_send+0x707
0xffff81007c083af8 0xffffffff88061619 [ib_ipath]ipath_post_ruc_send+0x1fd
0xffff81007c083b58 0xffffffff88065c39 [ib_ipath]ipath_post_send+0x70
0xffff81007c083b88 0xffffffff88284685 [ko2iblnd]kiblnd_check_sends+0x5c0
0xffff81007c083b98 0xffffffff8046e3af _spin_unlock+0x9
0xffff81007c083bf8 0xffffffff882873af [ko2iblnd]kiblnd_connreq_done+0x3d2
0xffff81007c083c28 0xffffffff8826b96d [ib_cm]ib_send_cm_rtu+0xec
0xffff81007c083c78 0xffffffff882886e9 [ko2iblnd]kiblnd_check_connreply+0x318
0xffff81007c083cd8 0xffffffff88289537 [ko2iblnd]kiblnd_cm_callback+0xb02
0xffff81007c083d38 0xffffffff88274c01 [rdma_cm]cma_ib_handler+0x18a
0xffff81007c083da8 0xffffffff8826c7da [ib_cm]cm_process_work+0x5c
0xffff81007c083dd8 0xffffffff8826de19 [ib_cm]cm_work_handler+0xad7
0xffff81007c083e28 0xffffffff8826d342 [ib_cm]cm_work_handler
0xffff81007c083e38 0xffffffff80238bc9 run_workqueue+0xb1
0xffff81007c083e58 0xffffffff80238c71 worker_thread
0xffff81007c083e68 0xffffffff8023bed0 keventd_create_kthread
0xffff81007c083e78 0xffffffff80238d97 worker_thread+0x126
In ipath_verbs_send() (ipath_verbs.c) the problem is the address stored in
ss->sge.vaddr.
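One thing that can be read from the register dump above: the faulting address 0x7740b000 is the same value as rsi and r13, i.e. the source pointer handed to __iowrite32_copy(), and unlike the destination in rdi it does not have the upper bits of a kernel virtual address. The following is only a minimal sketch of that check. It assumes the x86_64 direct mapping of this kernel starts at 0xffff810000000000, and the helper names are made up for illustration; they are not kernel functions.

```c
#include <stdint.h>

/* Assumption: start of the x86_64 direct mapping (PAGE_OFFSET) on 2.6.20. */
#define ASSUMED_PAGE_OFFSET 0xffff810000000000ULL

/* Hypothetical helper: 1 if addr lies in the kernel half of the address space. */
static int looks_like_kernel_vaddr(uint64_t addr)
{
    return addr >= 0xffff800000000000ULL;
}

/* Hypothetical helper: direct-map virtual address for a physical address. */
static uint64_t assumed_direct_map_vaddr(uint64_t phys)
{
    return phys + ASSUMED_PAGE_OFFSET;
}

/* Hypothetical helper: physical address behind a direct-map virtual address. */
static uint64_t assumed_direct_map_phys(uint64_t vaddr)
{
    return vaddr - ASSUMED_PAGE_OFFSET;
}
```

Under that assumption, rsi = 0x7740b000 fails the check while rdi = 0xffffc200001d601c passes it, and current = 0xffff81007755c100 corresponds to physical 0x7755c100, which is close to the faulting value. That would be consistent with a physical or bus address ending up where a kernel virtual address is expected, but it does not prove it.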
The problem seems to be in the goto loop of ipath_do_ruc_send() (ipath_ruc.c).
The first time through, qp->s_hdrwords is zero, so it doesn't enter
    if (qp->s_hdrwords != 0) {
        ...
        ipath_verbs_send()
        ...
    }
After that, neither of the two if conditions is true either:
    if (qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE &&
        (bth0 = ipath_make_rc_ack(qp, ohdr, pmtu)) != 0) {
        printk ("Sending.\n");
        bth2 = qp->s_ack_psn++ & IPATH_PSN_MASK;
    }
    else if (!((qp->ibqp.qp_type == IB_QPT_RC) ?
               ipath_make_rc_req(qp, ohdr, pmtu, &bth0, &bth2) :
               ipath_make_uc_req(qp, ohdr, pmtu, &bth0, &bth2))) {
        ...
    }
So qp->s_hdrwords gets increased, and after the "goto again"
ipath_verbs_send() is called and it crashes.
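The two-pass behaviour described above can be modelled in a few lines. This is a heavily simplified sketch, not the driver code: struct toy_qp, toy_make_req(), toy_send() and toy_do_send() are hypothetical stand-ins, the header word count is an arbitrary nonzero value, and a NULL check stands in for the bad pointer.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for the QP state. */
struct toy_qp {
    unsigned s_hdrwords;  /* 0 means no packet has been built yet */
    const void *vaddr;    /* stand-in for ss->sge.vaddr */
};

/* Stand-in for ipath_make_rc_req(): builds a packet, returns nonzero on success. */
static int toy_make_req(struct toy_qp *qp, const void *payload)
{
    qp->s_hdrwords = 7;   /* arbitrary nonzero header length */
    qp->vaddr = payload;  /* the address is only recorded here, not used */
    return 1;
}

/* Stand-in for ipath_verbs_send(): the first place the address is used. */
static int toy_send(const struct toy_qp *qp)
{
    return qp->vaddr != NULL ? 0 : -1;
}

/* Model of the goto loop: pass 1 builds the packet, pass 2 sends it. */
static int toy_do_send(struct toy_qp *qp, const void *payload, int *passes)
{
    *passes = 0;
again:
    (*passes)++;
    if (qp->s_hdrwords != 0) {
        int ret = toy_send(qp);  /* reached only on the second pass */
        qp->s_hdrwords = 0;
        return ret;
    }
    if (!toy_make_req(qp, payload))
        return 0;                /* nothing to send */
    goto again;
}
```

In this model a bad payload address is accepted silently on the first pass and only shows up on the second pass inside toy_send(), which mirrors why the oops appears in __iowrite32_copy() and not where the address was set.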
In ipath_make_rc_req():
qp->s_cur is zero, so wqe = qp->s_wq.
Also, qp->s_cur_sge = &qp->s_sge and qp->s_sge.sge = wqe->sg_list[0];
If I see it right, either wqe->sg_list[0] or wqe->sg_list[0].vaddr is wrong,
but so far I haven't tracked down where it is set.
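To illustrate the assignments just described, here is a small sketch with simplified, hypothetical structures (toy_sge, toy_swqe and toy_sge_state are not the real ipath definitions, and the fixed array size is an assumption for the example). It only shows that the state used for sending holds a plain copy of the first scatter/gather entry of the work queue entry.

```c
#include <stdint.h>

/* Hypothetical, simplified scatter/gather element. */
struct toy_sge {
    void *vaddr;      /* address the send path later reads from */
    uint32_t length;
};

/* Hypothetical, simplified send work queue entry. */
struct toy_swqe {
    int num_sge;
    struct toy_sge sg_list[4];
};

/* Hypothetical, simplified SGE state as consumed by the send path. */
struct toy_sge_state {
    struct toy_sge sge;       /* copy of the current element */
    struct toy_sge *sg_list;  /* remaining elements */
    int num_sge;
};

/* Mirrors "qp->s_sge.sge = wqe->sg_list[0]": a plain struct copy. */
static void toy_init_sge_state(struct toy_sge_state *ss, struct toy_swqe *wqe)
{
    ss->sge = wqe->sg_list[0];
    ss->sg_list = wqe->sg_list + 1;
    ss->num_sge = wqe->num_sge;
}
```

Since the copy does not change vaddr, a wrong value seen at send time would already have been present when the work queue entry was filled in, so a debug print where the send request is posted may help narrow down where it comes from.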
Any help to solve the problem is appreciated.
Thanks in advance,
Bernd
--
Bernd Schubert
Q-Leap Networks GmbH