[openib-general] Re: [PATCHv3] kDAPL: remove use of HANDLE's (vs. r2564)

Wed Jun 8 14:30:01 PDT 2005

I'm seeing a new problem, but I don't think this patch caused it, so 
I'll go ahead and commit your patch with my modifications.

The oops I see is below. By my calculations, the crash is on line 450 
of mthca_cq.c. That line is:

entry->wr_id = (*cur_qp)->wrid[wqe_index];

and the resulting instruction that fails is

b98:       8b 54 d8 04             mov    0x4(%eax,%ebx,8),%edx

Does anyone know which part of the C statement that instruction 
corresponds to?

Unable to handle kernel paging request at virtual address 00002014
  printing eip:
e0a65008
*pde = 1814d067
Oops: 0000 [#1]
Modules linked in: kdapltest ib_dat_provider dat ib_cm ib_at ib_ipoib ib_sa md5 ipv6 parport_pc lp parport autofs4 nfs lockd sunrpc i2c_piix4 i2c_core ib_mthca ib_mad ib_core e100 mii floppy sg aic7xxx sd_mod scsi_mod
CPU:    0
EIP:    0060:[<e0a65008>]    Not tainted VLI
EFLAGS: 00010046   (2.6.11-openib)
EIP is at mthca_poll_cq+0x368/0x760 [ib_mthca]
eax: 00000000   ebx: 00000402   ecx: 00000000   edx: dd6ea000
esi: cdbf7000   edi: cdbf7058   ebp: c0469cc8   esp: c0469c5c
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, threadinfo=c0469000 task=c03bfc20)
Stack: 00000000 c046cf1f 00000030 00000000 00000000 00000000 00000000 00000000
       cdbf5040 00000000 00000000 00000000 c0469d04 00000000 00000086 dd8af000
       c0469d04 00000001 d1a8d9c0 00000000 00000046 00000001 00000000 cdbf7000
Call Trace:
  [<c01034ba>] show_stack+0x7a/0x90
  [<c0103639>] show_registers+0x149/0x1c0
  [<c0103886>] die+0x126/0x2a0
  [<c0110b7e>] do_page_fault+0x45e/0x644
  [<c0103003>] error_code+0x2b/0x30
  [<e1ac530b>] dapl_ib_completion_poll+0x3c/0xd1 [ib_dat_provider]
  [<e1aceed5>] dapl_evd_cq_poll_to_event+0x17/0x3b [ib_dat_provider]
  [<e1ad03ea>] dapl_evd_dequeue+0x277/0x34b [ib_dat_provider]
  [<e1ace0a4>] dapl_evd_upcall_trigger+0x34/0x66 [ib_dat_provider]
  [<e1acfbf6>] dapl_evd_dto_callback+0xd4/0xea [ib_dat_provider]
  [<e0a644b3>] mthca_cq_event+0x33/0x80 [ib_mthca]
  [<e0a629f4>] mthca_eq_int+0x3a4/0x580 [ib_mthca]
  [<e0a62c51>] mthca_tavor_interrupt+0x81/0x350 [ib_mthca]
  [<c01426d5>] handle_IRQ_event+0x35/0x70
  [<c0142818>] __do_IRQ+0x108/0x340
  [<c0104b76>] do_IRQ+0x96/0xa0
  [<c0102fca>] common_interrupt+0x1a/0x20
  [<c01426d5>] handle_IRQ_event+0x35/0x70
  [<c0142818>] __do_IRQ+0x108/0x340
  [<c0104b3a>] do_IRQ+0x5a/0xa0
  =======================
  [<c0102fca>] common_interrupt+0x1a/0x20
  [<c0100627>] cpu_idle+0x57/0x60
  [<c0100249>] rest_init+0x19/0x20
  [<c043b8ca>] start_kernel+0x17a/0x1f0
  [<c010019f>] 0xc010019f
Code: 00 00 00 8b 55 b4 0f b6 52 1d 81 e2 80 00 00 00 e9 d3 fd ff ff 8b 45 b4 8d 7e 58 8b 4f 34 8b 58 18 8b 86 e0 00 00 00 0f cb d3 eb <8b> 54 d8 04 8b 04 d8 e9 36 fe ff ff 8b 55 b4 8b 4d c4 8b 42 14
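
The faulting address can be checked against the operand of the failing 
mov. What follows is a minimal sketch, assuming 32-bit x86 and the AT&T 
operand form disp(base,index,scale), so that 0x4(%eax,%ebx,8) reads 
from eax + ebx*8 + 4. The function names are hypothetical and exist 
only for this illustration; the register values are copied from the 
oops above.

```c
#include <stdint.h>

/* Effective address of an x86 memory operand written in AT&T syntax as
 * disp(base,index,scale): base + index * scale + disp. */
uint32_t effective_address(uint32_t base, uint32_t index,
                           uint32_t scale, uint32_t disp)
{
    return base + index * scale + disp;
}

/* Plug in the register dump from the oops: eax = 00000000 (base),
 * ebx = 00000402 (index), with scale 8 and displacement 4 taken from
 * the instruction "mov 0x4(%eax,%ebx,8),%edx". */
uint32_t faulting_address(void)
{
    return effective_address(0x00000000u, 0x00000402u, 8u, 0x4u);
}
```

With those values the result is 0x2014, which matches the reported 
virtual address 00002014. That suggests the base register held a NULL 
pointer and the load was reading the upper 4 bytes of 8-byte element 
0x402. Reading eax as the wrid pointer and ebx as wqe_index is an 
inference from the instruction shape, not something confirmed in this 
thread.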

On Wed, 8 Jun 2005, Tom Duffy wrote:

> On Wed, 2005-06-08 at 13:44 -0700, Tom Duffy wrote:
>> On Wed, 2005-06-08 at 16:36 -0400, James Lentini wrote:
>>> Do you see any additional stability problems after applying this? I'm
>>> updating my OpenIB tree to see if that is my problem.
>>
>> This just in!
>
> I think your patch is fine, at least it doesn't introduce any new bugs,
> because after a reboot of both machines, restarting the SM, the quit
> test and the transaction test work fine.
>
> -tduffy
>