[ewg] [PATCH] Patch for libibmad
Mike Heinz
michael.heinz at qlogic.com
Wed Apr 21 06:29:50 PDT 2010
I agree that the stack dump is... weird, but it was reproducible: it happened every time they ran perfquery on their fabric. This patch (along with the other one) appeared to fix the problem.
> But such entries should never be used, at least not by perfquery.
The problem, I think, is the massive enumeration that's being used. Instead of assigning explicit values to all those constants, the code relies on the enums being listed in the correct order. I think that raises a risk: if the header is mismatched with the version of the library at compile time (possibly because the user is recompiling), this problem could arise.
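To illustrate the kind of mismatch I mean, here is a minimal sketch. It is not the libibmad source: the struct, the enum names and the table contents are all hypothetical, and it only assumes what is described in this thread (a table indexed by implicitly numbered enum values, with empty entries separating groups).

```c
#include <stddef.h>

/* Hypothetical field descriptor; a NULL name marks a separator entry. */
struct demo_field {
    const char *name;
    int bitlen;
};

/* Enum as the (hypothetical) library was built. Values are implicit,
 * so each one depends only on its position in the list. */
enum lib_field {
    LIB_F_FIRST,            /* 0: separator */
    LIB_F_PORT_SELECT,      /* 1 */
    LIB_F_COUNTER_SELECT,   /* 2 */
    LIB_F_GROUP_END,        /* 3: separator present only in this version */
    LIB_F_XMT_DATA,         /* 4 */
    LIB_F_LAST
};

/* Enum as seen by an application compiled against a different
 * (hypothetical) header that lacks the separator: XMT_DATA is 3 here. */
enum app_field {
    APP_F_FIRST,
    APP_F_PORT_SELECT,
    APP_F_COUNTER_SELECT,
    APP_F_XMT_DATA,         /* 3 */
    APP_F_LAST
};

/* The library's table, indexed by the library's own enum values. */
static const struct demo_field lib_table[LIB_F_LAST] = {
    [LIB_F_FIRST]          = { NULL, 0 },
    [LIB_F_PORT_SELECT]    = { "PortSelect", 8 },
    [LIB_F_COUNTER_SELECT] = { "CounterSelect", 16 },
    [LIB_F_GROUP_END]      = { NULL, 0 },
    [LIB_F_XMT_DATA]       = { "XmtData", 32 },
};

/* Range-checked lookup; returns NULL for out-of-range or separator slots. */
const char *lookup_name(int index)
{
    if (index < 0 || index >= LIB_F_LAST)
        return NULL;
    return lib_table[index].name;
}
```

With these definitions, lookup_name(LIB_F_XMT_DATA) finds the real entry, while lookup_name(APP_F_XMT_DATA) passes the range check and lands on the separator. Giving each constant an explicit value would keep existing values stable when entries are inserted.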
Anyway - I agree that we have a very poor understanding of the problem; if you want to hold off on this patch, that's fine. The other one is probably more useful.
-----Original Message-----
From: Sasha Khapyorsky [mailto:sashakvolt at gmail.com] On Behalf Of Sasha Khapyorsky
Sent: Wednesday, April 21, 2010 6:09 AM
To: Mike Heinz
Cc: ewg at openfabrics.org
Subject: Re: [PATCH] Patch for libibmad
Hi Mike,
On 12:16 Mon 19 Apr , Mike Heinz wrote:
> We had a customer report that perfquery was crashing on their nodes when trying to query ports on a switch. When I examined the core dump, it was clear that libibmad was dereferencing a null pointer from one of the mad_set_ functions:
>
> #0 0x0000000000000000 in ?? ()
> #1 0x00002ae4e13e7536 in mad_set_field () from /usr/lib64/libibmad.so.5
> #2 0x00002ae4e13e7656 in mad_field_name () from /usr/lib64/libibmad.so.5
> #3 0x0000000000401662 in mad_dump_perfcounters_rcv_sl ()
> #4 0x00000000004024c9 in mad_dump_perfcounters_rcv_sl ()
> #5 0x00002ae4e18168b4 in __libc_start_main () from /lib64/libc.so.6
> #6 0x0000000000401189 in mad_dump_perfcounters_rcv_sl ()
> #7 0x00007fffe5570ce8 in ?? ()
> #8 0x0000000000000000 in ?? ()
I cannot find a path by which mad_dump_perfcounters_rcv_sl() would end up
calling mad_set_field() (or even mad_field_name()). Can you?
> It appears that mad_set_field() was hitting a NULL pointer in the table of MAD attributes (ib_mad_f). Such entries are being used to separate different groups of mad attributes in the table.
>
> Reviewing the code, I noted that the mad_set_* and mad_get_* functions already have some error checking to avoid going completely off the end of the table, but they do not detect the case where the selected field is unset.
But such entries should never be used, at least not by perfquery. So it
is unclear to me how you are hitting such an error.
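For reference, the kind of check being discussed can be sketched as follows. This is not the patch itself and not the libibmad code: the table layout, the names and the byte-aligned big-endian store are assumptions made only to keep the example self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: bit offset and bit length within a buffer.
 * An entry with bitlen == 0 is a group separator and must not be used. */
struct sketch_field {
    int bitoffs;
    int bitlen;
    const char *name;
};

enum { F_FIRST, F_A, F_B, F_SEP, F_C, F_LAST };

static const struct sketch_field sketch_table[F_LAST] = {
    [F_FIRST] = { 0,  0,  NULL },
    [F_A]     = { 0,  8,  "A" },
    [F_B]     = { 8,  16, "B" },
    [F_SEP]   = { 0,  0,  NULL },   /* separator between groups */
    [F_C]     = { 32, 32, "C" },
};

/* Stores val big-endian into the field. Returns 0 on success and -1 if
 * the index is out of range or selects a separator (unset) entry.
 * Only byte-aligned fields are handled in this sketch. */
int set_field_checked(uint8_t *buf, int field, uint32_t val)
{
    const struct sketch_field *f;
    uint8_t *p;
    int i;

    /* Existing style of check: stay inside the table. */
    if (field <= F_FIRST || field >= F_LAST)
        return -1;

    f = &sketch_table[field];

    /* Additional check: reject separator entries inside the table. */
    if (f->bitlen == 0)
        return -1;

    p = buf + f->bitoffs / 8;
    for (i = f->bitlen / 8 - 1; i >= 0; i--) {
        p[i] = (uint8_t)(val & 0xff);
        val >>= 8;
    }
    return 0;
}
```

A check like this turns a bad index into an error return, but it does not explain how the caller got that index in the first place.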
> This patch corrects the problem.
I would like to understand the problem better before fixing something.
Sasha