[openib-general] OpenSM realloc error

Hal Rosenstock halr at voltaire.com
Thu Feb 16 15:18:05 PST 2006


Hi Owen,

On Thu, 2006-02-16 at 16:27, Owen Stampflee wrote:
> So, here is the back trace with no code modifications...
> 
> 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6
> #1  0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6
> #2  0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6
> #3  0x00000080b97580bc in ._int_realloc () from /lib64/tls/libc.so.6
> #4  0x00000080b9759528 in .__realloc () from /lib64/tls/libc.so.6
> #5  0x00000080b975942c in .__realloc () from /lib64/tls/libc.so.6
> #6  0x00000080b974cd30 in ._IO_mem_finish () from /lib64/tls/libc.so.6
> #7  0x00000080b97426b8 in ._IO_new_fclose () from /lib64/tls/libc.so.6
> #8  0x00000080b97b795c in .__GI_vsyslog () from /lib64/tls/libc.so.6
> #9  0x00000080b97b7ddc in .__GI_syslog () from /lib64/tls/libc.so.6
> #10 0x00000080a362be90 in .cl_log_event ()
> from /usr/lib64/libosmcomp.so.1
> #11 0x00000080a35f5700 in .osm_log () from /usr/lib64/libopensm.so.1
> #12 0x000000001001316c in ?? ()
> #13 0x00000000100059b4 in ?? ()
> #14 0x00000080b970411c in .generic_start_main ()
> from /lib64/tls/libc.so.6
> #15 0x00000080b97042a4 in .__libc_start_main ()
> from /lib64/tls/libc.so.6
> #16 0x0000000000000000 in ?? ()
> (gdb)
> 
> Commenting out the cl_log_event in osm_log results in this backtrace:
> 
> (gdb) bt
> #0  0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6
> #1  0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6
> #2  0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6
> #3  0x00000080b9756db0 in ._int_malloc () from /lib64/tls/libc.so.6
> #4  0x00000080b9758b50 in .__GI___libc_malloc ()
> from /lib64/tls/libc.so.6
> #5  0x00000400000607bc in __cl_malloc_priv (size=0) at
> cl_memory_osd.c:62
> #6  0x00000400000604d4 in __cl_zalloc_ntrk (size=0) at cl_memory.c:416
> #7  0x00000400000629f4 in cl_ptr_vector_set_capacity
> (p_vector=0x100788d0,
>     new_capacity=6349) at cl_ptr_vector.c:216
> #8  0x0000040000062acc in cl_ptr_vector_set_size (p_vector=0x0, size=16)
>     at cl_ptr_vector.c:270
> #9  0x0000040000062c08 in cl_ptr_vector_init (p_vector=0x100788d0,
> min_size=6349,
>     grow_size=16) at cl_ptr_vector.c:93
> #10 0x000004000005bb00 in cl_disp_init (p_disp=0x100788a0,
> thread_count=0,
>     name=0x100464c0 "opensm") at cl_dispatcher.c:214
> #11 0x00000000100133f8 in ?? ()
> #12 0x00000000100059b4 in ?? ()
> #13 0x00000080b970411c in .generic_start_main ()
> from /lib64/tls/libc.so.6
> #14 0x00000080b97042a4 in .__libc_start_main ()
> from /lib64/tls/libc.so.6
> #15 0x0000000000000000 in ?? ()

__cl_malloc_priv is just a wrapper for malloc:

from cl_memory_osd.c:
void*
__cl_malloc_priv(
        IN      const size_t    size )
{
        return malloc( size );
}

If I believe gdb, this is a malloc of 0 bytes. But new_capacity was
6349, and the allocation size would be that multiplied by
sizeof(void *), so I'm not sure whether to trust the size=0 gdb shows.
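To make the arithmetic concrete, here is a minimal sketch of the size
calculation I would expect for a pointer vector. array_bytes is a
hypothetical helper written for this mail, not a complib function. With
8-byte pointers, 6349 entries come to 50792 bytes (25396 with 4-byte
pointers), so a request of 0 bytes does not fit the capacity shown in
frame #7.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of complib): number of bytes needed
 * for `count` elements of `elem_size` bytes each.
 * Returns 0 if the multiplication would overflow size_t.
 */
size_t array_bytes(size_t count, size_t elem_size)
{
    assert(elem_size > 0);            /* guard the division below */

    if (count > SIZE_MAX / elem_size) /* product would not fit in size_t */
        return 0;

    return count * elem_size;
}
```

Calling array_bytes(6349, sizeof(void *)) gives a nonzero size on any
build, which is why the size=0 in the backtrace looks suspect.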

Can you send me the compile line from the OpenSM build? Are the include
paths correct for the 64-bit headers?

> So now I've compiled it in 32-bit mode (had to fix my chroot) and
> everything runs, but I get the following message...
> 
> Feb 16 13:59:28 006732 [0000] -> OpenSM Rev:openib-1.1.0
>  
> Feb 16 13:59:28 008210 [F7E8D020] -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0x0000
> GID:0xfe80000000000000,0x0000000000000000
> Feb 16 13:59:28 008292 [F7E8D020] -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0x0000
> GID:0xfe80000000000000,0x0000000000000000
> Feb 16 13:59:28 015894 [F7E8D020] -> osm_vendor_get_all_port_attr:
> assign CA mthca0 port 1 guid (0x2c90109764831) as the default port
> Feb 16 13:59:28 015977 [F7E8D020] -> osm_vendor_bind: Binding to port
> 0x2c90109764831.
> Feb 16 13:59:28 021293 [F7E8D020] -> osm_vendor_bind: Binding to port
> 0x2c90109764831.
> Feb 16 13:59:28 021692 [F568C4E0] -> umad_receiver: ERR 5413: Failed to
> obtain request madw for received MAD(method=0x81 attr=0x11) -- dropping

For some reason, the response received is not matching an entry in the
transaction table. I thought this was fixed a while ago for PowerPC.
Can you run opensm with -V and see if there is any more output that
might be helpful?
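For reference, method=0x81 in that log line is Get (0x01) with the
response bit (0x80) set, and attr=0x11 is NodeInfo. The sketch below
shows the general idea of pairing such a response with its outstanding
request by transaction ID. The struct and function names are
hypothetical and this is not the OpenSM implementation; it only
illustrates why a response whose TID is not found in the table gets
dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAD_METHOD_RESP_BIT 0x80u /* high bit of the method marks a response */

/* Hypothetical table entry for one outstanding request. */
struct pending_req {
    uint64_t tid;    /* transaction ID the request was sent with */
    int      in_use; /* nonzero while a response is still expected */
};

/* Nonzero if the method byte denotes a response (e.g. 0x81). */
int mad_is_response(uint8_t method)
{
    return (method & MAD_METHOD_RESP_BIT) != 0;
}

/* Method with the response bit cleared (0x81 -> 0x01, Get). */
uint8_t mad_base_method(uint8_t method)
{
    return (uint8_t)(method & 0x7fu);
}

/*
 * Look up the outstanding request for a response TID.
 * Returns NULL when nothing matches; the caller would then
 * drop the response.
 */
const struct pending_req *find_request(const struct pending_req *table,
                                       size_t n, uint64_t tid)
{
    assert(table != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (table[i].in_use && table[i].tid == tid)
            return &table[i];
    }
    return NULL;
}
```

If the TID stored at send time and the TID read back from the response
differ for any reason, find_request returns NULL, and that is the
situation the ERR 5413 message reports.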

> Other info:
> [root@m2 ~]# ibstat
> CA 'mthca0'
>         CA type: MT23108
>         Number of ports: 2
>         Firmware version: 3.3.2
>         Hardware version: a1
>         Node GUID: 0x0002c90109764830
>         System image GUID: 0x0002c90109764833
>         Port 1:
>                 State: Initializing
>                 Physical state: LinkUp
>                 Rate: 10
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00510a68
>                 Port GUID: 0x0002c90109764831
>         Port 2:
>                 State: Down
>                 Physical state: Polling
>                 Rate: 2
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00510a68
>                 Port GUID: 0x0002c90109764832
> 
> 
> [root@m2 ~]# ibstatus
> Infiniband device 'mthca0' port 1 status:
>         default gid:     fe80:0000:0000:0000:0002:c901:0976:4831
>         base lid:        0x0
>         sm lid:          0x0
>         state:           2: INIT
>         phys state:      5: LinkUp
>         rate:            10 Gb/sec (4X)

This is good: it means the physical link has been established on this
port.

> Infiniband device 'mthca0' port 2 status:
>         default gid:     fe80:0000:0000:0000:0002:c901:0976:4832
>         base lid:        0x0
>         sm lid:          0x0
>         state:           1: DOWN
>         phys state:      2: Polling
>         rate:            2.5 Gb/sec (1X)
> 
> 
> My archives suggest a firmware upgrade, but 3.3.3 isnt available from
> SBS as far as I can tell and my contact no longer works there so I'm
> going to have to find the new person to talk about getting newer
> firmware, unless of course another vendors firmware will work on this
> card.

I think 3.3.2 should be OK. In any case, I doubt it's the source of the
problem above.

-- Hal

> Cheers,
> Owen
> 



