[ofa-general] lustre problem

Bernd Schubert bs at q-leap.de
Wed Mar 28 06:46:13 PDT 2007


Hi,

With kernel 2.6.20.4 and lustre-1.4.9 we get an oops in the ib_ipath driver, see below.

In principle it could also be a Lustre problem, but with Mellanox cards it 
works fine.


[  195.339317] Lustre: Added LNI 192.168.41.106 at o2ib [8/64]
[  195.352336] Lustre: Added LNI 192.168.42.106 at tcp [8/256]
[  195.357796] Lustre: Accept secure, port 988
[  195.412988] Lustre: Lustre Lite Client File System; info at clusterfs.com
[  195.449596] Unable to handle kernel paging request at 000000007740b000 RIP:
[  195.454249]  [<ffffffff803513d2>] __iowrite32_copy+0x2/0x8
[  195.462306] PGD 11ac87067 PUD 0
[  195.465648] Oops: 0000 [1] SMP

Entering kdb (current=0xffff81007755c100, pid 3191) on processor 3 Oops: 
<NULL>
due to oops @ 0xffffffff803513d2
     r15 = 0x0000000000000005      r14 = 0x0000000000000168
     r13 = 0x000000007740b000      r12 = 0xffffc200001d601c
     rbp = 0xffff81007c083a60      rbx = 0x0000000000000059
     r11 = 0x0000000000000000      r10 = 0xffff810076bc4000
      r9 = 0xffff810076bc4000       r8 = 0xffff81007ccf2ec8
     rax = 0x0000000000000000      rcx = 0x0000000000000059
     rdx = 0x0000000000000059      rsi = 0x000000007740b000
     rdi = 0xffffc200001d601c orig_rax = 0xffffffffffffffff
     rip = 0xffffffff803513d2       cs = 0x0000000000000010
  eflags = 0x0000000000010206      rsp = 0xffff81007c0839f0
      ss = 0x0000000000000000 &regs = 0xffff81007c083958
[3]kdb> bt
Stack traceback for pid 3191
0xffff81007755c100     3191       19  1    3   R  0xffff81007755c3c0 *ib_cm/3
rsp                rip                Function (args)
0xffff81007c0839d8 0xffffffff803513d2 __iowrite32_copy+0x2
0xffff81007c083a08 0xffffffff88066161 [ib_ipath]ipath_verbs_send+0x10b
0xffff81007c083a68 0xffffffff88061205 [ib_ipath]ipath_do_ruc_send+0x707
0xffff81007c083af8 0xffffffff88061619 [ib_ipath]ipath_post_ruc_send+0x1fd
0xffff81007c083b58 0xffffffff88065c39 [ib_ipath]ipath_post_send+0x70
0xffff81007c083b88 0xffffffff88284685 [ko2iblnd]kiblnd_check_sends+0x5c0
0xffff81007c083b98 0xffffffff8046e3af _spin_unlock+0x9
0xffff81007c083bf8 0xffffffff882873af [ko2iblnd]kiblnd_connreq_done+0x3d2
0xffff81007c083c28 0xffffffff8826b96d [ib_cm]ib_send_cm_rtu+0xec
0xffff81007c083c78 0xffffffff882886e9 [ko2iblnd]kiblnd_check_connreply+0x318
0xffff81007c083cd8 0xffffffff88289537 [ko2iblnd]kiblnd_cm_callback+0xb02
0xffff81007c083d38 0xffffffff88274c01 [rdma_cm]cma_ib_handler+0x18a
0xffff81007c083da8 0xffffffff8826c7da [ib_cm]cm_process_work+0x5c
0xffff81007c083dd8 0xffffffff8826de19 [ib_cm]cm_work_handler+0xad7
0xffff81007c083e28 0xffffffff8826d342 [ib_cm]cm_work_handler
0xffff81007c083e38 0xffffffff80238bc9 run_workqueue+0xb1
0xffff81007c083e58 0xffffffff80238c71 worker_thread
0xffff81007c083e68 0xffffffff8023bed0 keventd_create_kthread
0xffff81007c083e78 0xffffffff80238d97 worker_thread+0x126


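In ipath_verbs_send() (ipath_verbs.c) the problem is the address in 
ss->sge.vaddr: the faulting address 0x000000007740b000 is also in rsi, the 
source argument of __iowrite32_copy().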
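
To make the register dump easier to read, here is a minimal sketch (not driver code). The helper name is_kernel_half() is made up for illustration, and it assumes x86_64 with 48-bit virtual addresses, where kernel addresses lie at or above 0xffff800000000000. It only classifies an address as belonging to the kernel half or not.

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * Assumption: x86_64 with 48-bit virtual addresses, so the kernel
 * (upper canonical) half starts at 0xffff800000000000.
 */
static int is_kernel_half(uint64_t addr)
{
	return addr >= 0xffff800000000000ULL;
}
```

With the values from the dump above, is_kernel_half(0xffffc200001d601c) (rdi, the copy destination) is true, while is_kernel_half(0x000000007740b000) (rsi, the copy source) is false, so the source does not look like a kernel virtual address.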

The problem seems to be in the goto loop of ipath_ruc.c: ipath_do_ruc_send().

The first time through, qp->s_hdrwords is zero, so it doesn't call 

if (qp->s_hdrwords != 0) {
	...
	ipath_verbs_send()
	...
}


Then neither of the following two if conditions is true either:

	if (qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE &&
	    (bth0 = ipath_make_rc_ack(qp, ohdr, pmtu)) != 0) {
		printk ("Sending.\n");
		bth2 = qp->s_ack_psn++ & IPATH_PSN_MASK;
		
	}
	else if (!((qp->ibqp.qp_type == IB_QPT_RC) ?
		   ipath_make_rc_req(qp, ohdr, pmtu, &bth0, &bth2) :
		   ipath_make_uc_req(qp, ohdr, pmtu, &bth0, &bth2))) {
		...
	}

So qp->s_hdrwords gets increased, and after the "goto again" 
ipath_verbs_send() is called and crashes.
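
The two-pass flow described above can be sketched as a toy model. This is not the ipath code: struct toy_qp, struct toy_trace, toy_make_req() and toy_do_send() are all hypothetical names, the header length is arbitrary, and the sketch stops after the first send instead of looping on. It only shows that the send happens on the second pass, using whatever source address the request builder stored on the first pass.

```c
#include <stddef.h>

/* Hypothetical stand-in for the QP state; not the driver's struct. */
struct toy_qp {
	unsigned s_hdrwords;    /* 0 means no packet has been built yet */
	const void *sge_vaddr;  /* address the send would copy from */
};

/* Records what the loop did, so it can be inspected afterwards. */
struct toy_trace {
	int passes;             /* how often the "again" label was reached */
	int sends;              /* how often the send branch was taken */
	const void *sent_from;  /* source address used by the send */
};

/* Toy request builder: fills in a header once, then reports "nothing to do". */
static int toy_make_req(struct toy_qp *qp, const void *payload)
{
	if (qp->s_hdrwords != 0)
		return 0;
	qp->s_hdrwords = 7;       /* arbitrary non-zero header length */
	qp->sge_vaddr = payload;  /* a bad address stored here is used later */
	return 1;
}

static struct toy_trace toy_do_send(struct toy_qp *qp, const void *payload)
{
	struct toy_trace t = { 0, 0, NULL };

again:
	t.passes++;
	if (qp->s_hdrwords != 0) {
		/* Second pass: this is where the real send would copy data. */
		t.sends++;
		t.sent_from = qp->sge_vaddr;
		return t;           /* simplification: stop after one send */
	}
	if (!toy_make_req(qp, payload))
		return t;           /* nothing was built, nothing to send */
	goto again;
}
```

Starting with s_hdrwords == 0, toy_do_send() skips the send on the first pass, builds the request, jumps back, and sends on the second pass from the address the builder stored.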


Any help in solving the problem is appreciated.


Thanks in advance,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH


