[openib-general] IPoIB oops on path record completion

Hal Rosenstock halr at voltaire.com
Wed Dec 15 18:21:56 PST 2004


On Wed, 2004-12-15 at 20:38, Roland Dreier wrote: 
>     Hal> No but it definitely oops in that callback. I didn't trace it
>     Hal> in path_rec_completion; only glanced at the code. Don't the
>     Hal> debug statements dereference through NULL regardless of status?
> 
> Good point, I missed those uses.  I've just pushed a patch that should
> fix this problem.

Still oopses :-(

>     Hal> Have you gotten a negative status on the callback (and NULL
>     Hal> pathrec)?  I've yet to see this response on the analyzer as
>     Hal> there are too many to go through right now. I do see it with
>     Hal> extra debug I put in to narrow this down.
> 
> No, I guess that's why I never saw this problem -- I've never had a
> path record callback fail.

I get the callback error but the packets on the wire look OK (see
below).

>     Hal> Also, what I do see when I do a broadcast ping is that the
>     Hal> path record is obtained over and over rather than being
>     Hal> requested once and cached.  Is that what is supposed to be
>     Hal> happening now?
> 
> I can't duplicate this one.  If I do something like
> 
>     ifconfig ib0 10.0.0.1
>     ping -b 10.255.255.255
> 
> I only see one path record lookup for each remote system.

I'm using /24 rather than /8 (class C rather than class A).
So my config is:

    ifconfig ib0 192.168.0.1
    ping -b 192.168.0.255

I have 3 nodes (this one is x86 and the other 2 are x86_64).
I doubt this makes a difference for these issues.

> You seem to be seeing a path record lookup fail, which of course will
> cause the lookup to be retried the next time we want to send something
> to that destination.  Do you know why the path record lookup is failing?

I just looked at the trace: every SA GetResp(PathRecord) had status 0, but
the callback still reported

    ib_sa_path_rec_callback: sa_query 0xc373fe48 status 0xffffff92 mad 0x00000000

This status is -110 (0xffffff92 read as a signed 32-bit value), i.e. -ETIMEDOUT:

    #define ETIMEDOUT       110     /* Connection timed out */

Can you shorten your timeout down to 1 msec and see what happens?

I do see one SA Get which was not responded to. There are a variety of
reasons this could occur.

-- Hal




