[openib-general] OpenSM realloc error

Owen Stampflee ostampflee at terrasoftsolutions.com
Thu Feb 16 21:01:59 PST 2006


Of course, I need to get things working first; then we can deal with the
64-bit issues (gotta please the boss, and if shipping 32-bit binaries and
both 32/64-bit libraries provides a working udapl, ipoib, and 32+64-bit
mpi, I can meet my deadline (Monday)). I suspect some glibc issues
on our end, but I've never seen these before, and since we're using a
RHEL-based toolchain, this _should_ just work.

Any thoughts on ipoib? My research hasn't turned up that problem before. I'll
see if I can get mvapich built tomorrow and see if that at least works.

Thanks for all the assistance,
Owen

On Thu, 2006-02-16 at 23:45 -0500, Hal Rosenstock wrote:
> On Thu, 2006-02-16 at 20:43, Owen Stampflee wrote:
> > A 32-bit build of 5411 gets the link to become active
> 
> Glad to hear this. That is what I would expect. I would like to confirm
> that the tid patch is missing from the FC5 package, as well as get to the
> bottom of the 64-bit issues, if you have some time to help on this.
> 
> -- Hal
> 
> > and ibv_rc_pingpong works, but I can't bring up ipoib...
> > 
> > dmesg says this (tried both ib0 and ib1 to ensure ports weren't swapped)
> > ADDRCONF(NETDEV_UP): ib0: link is not ready
> > ADDRCONF(NETDEV_UP): ib1: link is not ready
> > 
> > At least we're making progress.
> > 
> > Thanks,
> > Owen
> > 
> > On Thu, 2006-02-16 at 18:18 -0500, Hal Rosenstock wrote:
> > > Hi Owen,
> > > 
> > > On Thu, 2006-02-16 at 16:27, Owen Stampflee wrote:
> > > > So, here is the back trace with no code modifications...
> > > > 
> > > > 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6
> > > > (gdb) bt
> > > > #0  0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6
> > > > #1  0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6
> > > > #2  0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6
> > > > #3  0x00000080b97580bc in ._int_realloc () from /lib64/tls/libc.so.6
> > > > #4  0x00000080b9759528 in .__realloc () from /lib64/tls/libc.so.6
> > > > #5  0x00000080b975942c in .__realloc () from /lib64/tls/libc.so.6
> > > > #6  0x00000080b974cd30 in ._IO_mem_finish () from /lib64/tls/libc.so.6
> > > > #7  0x00000080b97426b8 in ._IO_new_fclose () from /lib64/tls/libc.so.6
> > > > #8  0x00000080b97b795c in .__GI_vsyslog () from /lib64/tls/libc.so.6
> > > > #9  0x00000080b97b7ddc in .__GI_syslog () from /lib64/tls/libc.so.6
> > > > #10 0x00000080a362be90 in .cl_log_event ()
> > > > from /usr/lib64/libosmcomp.so.1
> > > > #11 0x00000080a35f5700 in .osm_log () from /usr/lib64/libopensm.so.1
> > > > #12 0x000000001001316c in ?? ()
> > > > #13 0x00000000100059b4 in ?? ()
> > > > #14 0x00000080b970411c in .generic_start_main ()
> > > > from /lib64/tls/libc.so.6
> > > > #15 0x00000080b97042a4 in .__libc_start_main ()
> > > > from /lib64/tls/libc.so.6
> > > > #16 0x0000000000000000 in ?? ()
> > > > (gdb)
> > > > 
> > > > Commenting out the cl_log_event in osm_log results in this backtrace:
> > > > 
> > > > (gdb) bt
> > > > #0  0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6
> > > > #1  0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6
> > > > #2  0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6
> > > > #3  0x00000080b9756db0 in ._int_malloc () from /lib64/tls/libc.so.6
> > > > #4  0x00000080b9758b50 in .__GI___libc_malloc ()
> > > > from /lib64/tls/libc.so.6
> > > > #5  0x00000400000607bc in __cl_malloc_priv (size=0) at
> > > > cl_memory_osd.c:62
> > > > #6  0x00000400000604d4 in __cl_zalloc_ntrk (size=0) at cl_memory.c:416
> > > > #7  0x00000400000629f4 in cl_ptr_vector_set_capacity
> > > > (p_vector=0x100788d0,
> > > >     new_capacity=6349) at cl_ptr_vector.c:216
> > > > #8  0x0000040000062acc in cl_ptr_vector_set_size (p_vector=0x0, size=16)
> > > >     at cl_ptr_vector.c:270
> > > > #9  0x0000040000062c08 in cl_ptr_vector_init (p_vector=0x100788d0,
> > > > min_size=6349,
> > > >     grow_size=16) at cl_ptr_vector.c:93
> > > > #10 0x000004000005bb00 in cl_disp_init (p_disp=0x100788a0,
> > > > thread_count=0,
> > > >     name=0x100464c0 "opensm") at cl_dispatcher.c:214
> > > > #11 0x00000000100133f8 in ?? ()
> > > > #12 0x00000000100059b4 in ?? ()
> > > > #13 0x00000080b970411c in .generic_start_main ()
> > > > from /lib64/tls/libc.so.6
> > > > #14 0x00000080b97042a4 in .__libc_start_main ()
> > > > from /lib64/tls/libc.so.6
> > > > #15 0x0000000000000000 in ?? ()
> > > 
> > > __cl_malloc_priv is just a wrapper for malloc:
> > > 
> > > from cl_memory_osd.c:
> > > void*
> > > __cl_malloc_priv(
> > >         IN      const size_t    size )
> > > {
> > >         return malloc( size );
> > > }
> > > 
> > > If I believe gdb, this appears to be a malloc of 0 bytes, but since the
> > > new_capacity was 6349 (and this would be multiplied by sizeof(void *)),
> > > I'm not sure whether to trust this.
> > > 
> > > Can you send me the compile line from the OpenSM build? Are the include
> > > paths correct for 64-bit headers?
> > > 
> > > > So now I've compiled it in 32-bit mode (had to fix my chroot) and
> > > > everything runs, but I get the following message...
> > > > 
> > > > Feb 16 13:59:28 006732 [0000] -> OpenSM Rev:openib-1.1.0
> > > >  
> > > > Feb 16 13:59:28 008210 [F7E8D020] -> osm_report_notice: Reporting
> > > > Generic Notice type:3 num:66 from LID:0x0000
> > > > GID:0xfe80000000000000,0x0000000000000000
> > > > Feb 16 13:59:28 008292 [F7E8D020] -> osm_report_notice: Reporting
> > > > Generic Notice type:3 num:66 from LID:0x0000
> > > > GID:0xfe80000000000000,0x0000000000000000
> > > > Feb 16 13:59:28 015894 [F7E8D020] -> osm_vendor_get_all_port_attr:
> > > > assign CA mthca0 port 1 guid (0x2c90109764831) as the default port
> > > > Feb 16 13:59:28 015977 [F7E8D020] -> osm_vendor_bind: Binding to port
> > > > 0x2c90109764831.
> > > > Feb 16 13:59:28 021293 [F7E8D020] -> osm_vendor_bind: Binding to port
> > > > 0x2c90109764831.
> > > > Feb 16 13:59:28 021692 [F568C4E0] -> umad_receiver: ERR 5413: Failed to
> > > > obtain request madw for received MAD(method=0x81 attr=0x11) -- dropping
> > > 
> > > For some reason, when the response is received, it is not finding the
> > > match in the transaction table. I thought this was fixed a while ago for
> > > PowerPC. Can you run opensm with -V and see if there is any more output
> > > that might be helpful?
> > > 
> > > > Other info:
> > > > [root@m2 ~]# ibstat
> > > > CA 'mthca0'
> > > >         CA type: MT23108
> > > >         Number of ports: 2
> > > >         Firmware version: 3.3.2
> > > >         Hardware version: a1
> > > >         Node GUID: 0x0002c90109764830
> > > >         System image GUID: 0x0002c90109764833
> > > >         Port 1:
> > > >                 State: Initializing
> > > >                 Physical state: LinkUp
> > > >                 Rate: 10
> > > >                 Base lid: 0
> > > >                 LMC: 0
> > > >                 SM lid: 0
> > > >                 Capability mask: 0x00510a68
> > > >                 Port GUID: 0x0002c90109764831
> > > >         Port 2:
> > > >                 State: Down
> > > >                 Physical state: Polling
> > > >                 Rate: 2
> > > >                 Base lid: 0
> > > >                 LMC: 0
> > > >                 SM lid: 0
> > > >                 Capability mask: 0x00510a68
> > > >                 Port GUID: 0x0002c90109764832
> > > > 
> > > > 
> > > > [root@m2 ~]# ibstatus
> > > > Infiniband device 'mthca0' port 1 status:
> > > >         default gid:     fe80:0000:0000:0000:0002:c901:0976:4831
> > > >         base lid:        0x0
> > > >         sm lid:          0x0
> > > >         state:           2: INIT
> > > >         phys state:      5: LinkUp
> > > >         rate:            10 Gb/sec (4X)
> > > 
> > > This is goodness and means the physical link has been established on
> > > this port.
> > > 
> > > > Infiniband device 'mthca0' port 2 status:
> > > >         default gid:     fe80:0000:0000:0000:0002:c901:0976:4832
> > > >         base lid:        0x0
> > > >         sm lid:          0x0
> > > >         state:           1: DOWN
> > > >         phys state:      2: Polling
> > > >         rate:            2.5 Gb/sec (1X)
> > > > 
> > > > 
> > > > My archives suggest a firmware upgrade, but 3.3.3 isn't available from
> > > > SBS as far as I can tell, and my contact no longer works there, so I'm
> > > > going to have to find the new person to talk to about getting newer
> > > > firmware, unless of course another vendor's firmware will work on this
> > > > card.
> > > 
> > > I think 3.3.2 should be OK. In any case, I doubt it's the source of the
> > > problem above.
> > > 
> > > -- Hal
> > > 
> > > > Cheers,
> > > > Owen
> > > > 
> > > 
> > > 
> > > 
> > 
> 
> 