[openib-general] Re: ipoib oops

Michael S. Tsirkin mst at mellanox.co.il
Mon Nov 7 08:27:35 PST 2005


Quoting r. Michael S. Tsirkin <mst at mellanox.co.il>:
> Subject: Re: ipoib oops
> 
> > Quoting Michael S. Tsirkin <mst at mellanox.co.il>:
> > Subject: ipoib oops
> > 
> > Hi!
> > I saw this in /var/log/messages recently.
> > Unfortunately I can't say exactly what I did to trigger this problem.
> 
> Oops, I left out part of the log.
> Here it is in full.
> Actually, I had opensm running on the same node, and it currently
> appears stuck in defunct state - I wonder whether
> we have the umad module crashing and corrupting the ipoib data
> structures, or the reverse.
> 
> ------------------------------
>  <1>Unable to handle kernel NULL pointer dereference at 0000000000000000
> RIP:
> <ffffffff88032669>{:ib_umad:send_handler+41}
> PGD 0
> Oops: 0000 [2] SMP
> CPU 1
> Modules linked in: ib_sdp ib_cm ib_ipoib ib_sa ib_umad ib_mthca ib_mad ib_core
> Pid: 12203, comm: opensm Not tainted 2.6.14 #4
> RIP: 0010:[<ffffffff88032669>] <ffffffff88032669>{:ib_umad:send_handler+41}
> RSP: 0018:ffff81017bae7bb8  EFLAGS: 00010296
> RAX: ffff810178970d98 RBX: ffff81017bae7c78 RCX: ffff810178970d68
> RDX: ffff81017bae7c68 RSI: ffff81017bae7c78 RDI: ffff81017e825c10
> RBP: 0000000000000000 R08: ffff81017bae6000 R09: 0000000000000100
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff81017e825c10
> R13: ffff810176c01000 R14: ffff81017bae7c68 R15: ffff81017bae7ef8
> FS:  0000000000000000(0000) GS:ffffffff805ff880(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
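
For reference, CR2 holds the address the CPU faulted on, which is the
base pointer plus the offset of the member being read. A CR2 of exactly 0
therefore suggests a NULL base pointer whose member at offset 0 was
accessed. A minimal user-space sketch of that arithmetic, assuming a
made-up struct demo_obj and helper fault_addr() (neither is a kernel
name):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct for illustration only; not a kernel structure. */
struct demo_obj {
        void *first;    /* offset 0 */
        void *second;   /* offset sizeof(void *) */
};

/* Address that would be touched when reading a member at 'offset'
 * through the pointer 'base'. */
static uintptr_t fault_addr(const void *base, size_t offset)
{
        return (uintptr_t)base + offset;
}
```

With a NULL base, fault_addr(NULL, offsetof(struct demo_obj, first)) is 0,
while the second member would fault at sizeof(void *) instead.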

This one seems to be at
user_mad.c line 179:

---------------------
static void send_handler(struct ib_mad_agent *agent,
                         struct ib_mad_send_wc *send_wc)
{
        struct ib_umad_file *file = agent->context;
        struct ib_umad_packet *timeout;
        struct ib_umad_packet *packet = send_wc->send_buf->context[0];

        ib_destroy_ah(packet->msg->ah);    /* <-- oops here */
        ib_free_send_mad(packet->msg);
---------------------

Looks like send_wc is NULL.
And since the send handler always seems to be called with the
wc on the stack, so that send_wc should never be NULL, it now
appears that it was actually ipoib that triggered some data
corruption for umad.

Right?
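
To narrow this down, one could check each pointer that send_handler
dereferences before the ib_destroy_ah() call. Since the RIP is inside
send_handler itself, the candidates are probably send_wc,
send_wc->send_buf, packet and packet->msg. Below is a rough user-space
sketch of such a check. The demo_* structs are simplified stand-ins that
keep only the fields used in the excerpt above, not the real kernel
definitions, and first_null_link() is a hypothetical helper:

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types (assumption: only the
 * fields touched by the excerpt are modelled). */
struct demo_ah { int unused; };
struct demo_send_buf { void *context[2]; struct demo_ah *ah; };
struct demo_packet { struct demo_send_buf *msg; };
struct demo_send_wc { struct demo_send_buf *send_buf; };

/*
 * Walk the same chain as send_handler and return the name of the
 * first NULL link, or NULL if every pointer up to packet->msg is set.
 */
static const char *first_null_link(const struct demo_send_wc *send_wc)
{
        const struct demo_packet *packet;

        if (!send_wc)
                return "send_wc";
        if (!send_wc->send_buf)
                return "send_wc->send_buf";

        packet = send_wc->send_buf->context[0];
        if (!packet)
                return "packet";
        if (!packet->msg)
                return "packet->msg";

        return NULL;
}
```

In the real handler the equivalent would be a few printk() or BUG_ON()
checks on those pointers before the ib_destroy_ah() call.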

-- 
MST


