[openib-general] OpenSM realloc error

Hal Rosenstock halr at voltaire.com
Fri Feb 17 04:01:40 PST 2006


Hi Owen,

On Fri, 2006-02-17 at 00:01, Owen Stampflee wrote:
> Of course, I need to get things working first, then we can deal with the
> 64-bit issues (gotta please the boss, and if shipping 32-bit binaries and
> both 32/64 bit libraries provides a working udapl, ipoib, and 32+64-bit
> mpi, I can meet my deadline (Monday)). I'm suspecting some glibc issues
> on our end, but I've never seen these before and since we're using a
> RHEL-based toolchain, this _should_ just work.
> 
> Any thoughts on ipoib?

Are you referring to the "ib<n>: link is not ready" messages ?

How are the IPoIB interfaces being configured ? Are they configured by the
network scripts ? Do the scripts use arping (to look for duplicate addresses) ?
Is DHCP enabled or is a static address assigned ?

Can you try statically configuring an IPoIB subnet first ?

[You might also want to start another thread on this issue as some who
could help may not read all the way down to this after they see the
subject line.]

-- Hal

> My research hasn't shown that problem before. I'll
> see if I can get mvapich built tomorrow and see if that at least works.
> 
> Thanks for all the assistance,
> Owen
> 
> On Thu, 2006-02-16 at 23:45 -0500, Hal Rosenstock wrote:
> > On Thu, 2006-02-16 at 20:43, Owen Stampflee wrote:
> > > A 32-bit build of 5411 gets the link to become active
> > 
> > Glad to hear this. That is what I would expect. I would like to confirm
> > that the tid patch is missing from the FC5 package, and to get to the
> > bottom of the 64-bit issues, if you have some time to help with this.
> > 
> > -- Hal
> > 
> > > and ibv_rc_pingpong works, but I can't bring up ipoib...
> > > 
> > > dmesg says this (tried both ib0 and ib1 to ensure ports weren't swapped)
> > > ADDRCONF(NETDEV_UP): ib0: link is not ready
> > > ADDRCONF(NETDEV_UP): ib1: link is not ready
> > > 
> > > At least we're making progress.
> > > 
> > > Thanks,
> > > Owen
> > > 
> > > On Thu, 2006-02-16 at 18:18 -0500, Hal Rosenstock wrote:
> > > > Hi Owen,
> > > > 
> > > > On Thu, 2006-02-16 at 16:27, Owen Stampflee wrote:
> > > > > So, here is the back trace with no code modifications...
> > > > > 
> > > > > 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6
> > > > > (gdb) bt
> > > > > #0  0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6
> > > > > #1  0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6
> > > > > #2  0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6
> > > > > #3  0x00000080b97580bc in ._int_realloc () from /lib64/tls/libc.so.6
> > > > > #4  0x00000080b9759528 in .__realloc () from /lib64/tls/libc.so.6
> > > > > #5  0x00000080b975942c in .__realloc () from /lib64/tls/libc.so.6
> > > > > #6  0x00000080b974cd30 in ._IO_mem_finish () from /lib64/tls/libc.so.6
> > > > > #7  0x00000080b97426b8 in ._IO_new_fclose () from /lib64/tls/libc.so.6
> > > > > #8  0x00000080b97b795c in .__GI_vsyslog () from /lib64/tls/libc.so.6
> > > > > #9  0x00000080b97b7ddc in .__GI_syslog () from /lib64/tls/libc.so.6
> > > > > > #10 0x00000080a362be90 in .cl_log_event () from /usr/lib64/libosmcomp.so.1
> > > > > #11 0x00000080a35f5700 in .osm_log () from /usr/lib64/libopensm.so.1
> > > > > #12 0x000000001001316c in ?? ()
> > > > > #13 0x00000000100059b4 in ?? ()
> > > > > > #14 0x00000080b970411c in .generic_start_main () from /lib64/tls/libc.so.6
> > > > > > #15 0x00000080b97042a4 in .__libc_start_main () from /lib64/tls/libc.so.6
> > > > > #16 0x0000000000000000 in ?? ()
> > > > > (gdb)
> > > > > 
> > > > > Commenting out the cl_log_event in osm_log results in this backtrace:
> > > > > 
> > > > > (gdb) bt
> > > > > #0  0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6
> > > > > #1  0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6
> > > > > #2  0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6
> > > > > #3  0x00000080b9756db0 in ._int_malloc () from /lib64/tls/libc.so.6
> > > > > > #4  0x00000080b9758b50 in .__GI___libc_malloc () from /lib64/tls/libc.so.6
> > > > > > #5  0x00000400000607bc in __cl_malloc_priv (size=0) at cl_memory_osd.c:62
> > > > > > #6  0x00000400000604d4 in __cl_zalloc_ntrk (size=0) at cl_memory.c:416
> > > > > > #7  0x00000400000629f4 in cl_ptr_vector_set_capacity (p_vector=0x100788d0,
> > > > > >     new_capacity=6349) at cl_ptr_vector.c:216
> > > > > > #8  0x0000040000062acc in cl_ptr_vector_set_size (p_vector=0x0, size=16)
> > > > > >     at cl_ptr_vector.c:270
> > > > > > #9  0x0000040000062c08 in cl_ptr_vector_init (p_vector=0x100788d0, min_size=6349,
> > > > > >     grow_size=16) at cl_ptr_vector.c:93
> > > > > > #10 0x000004000005bb00 in cl_disp_init (p_disp=0x100788a0, thread_count=0,
> > > > > >     name=0x100464c0 "opensm") at cl_dispatcher.c:214
> > > > > #11 0x00000000100133f8 in ?? ()
> > > > > #12 0x00000000100059b4 in ?? ()
> > > > > > #13 0x00000080b970411c in .generic_start_main () from /lib64/tls/libc.so.6
> > > > > > #14 0x00000080b97042a4 in .__libc_start_main () from /lib64/tls/libc.so.6
> > > > > #15 0x0000000000000000 in ?? ()
> > > > 
> > > > __cl_malloc_priv is just a wrapper for malloc:
> > > > 
> > > > from cl_memory_osd.c:
> > > > void*
> > > > __cl_malloc_priv(
> > > >         IN      const size_t    size )
> > > > {
> > > >         return malloc( size );
> > > > }
> > > > 
> > > > If I believe gdb this appears to be a malloc of 0 bytes but since the
> > > > new_capacity was 6349 (and this would be multiplied by sizeof(void *)),
> > > > I'm not sure whether to trust this.
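
The size arithmetic mentioned above can be checked in isolation. What follows is a minimal sketch, not OpenSM or complib code: `ptr_array_bytes` is a hypothetical helper, and it assumes the allocation size is simply `capacity * sizeof(void *)` as described above. For a capacity of 6349 the product is nonzero on both 32-bit and 64-bit builds, which is why the `size=0` shown by gdb looks suspect.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of complib): number of bytes needed for an
 * array of `capacity` pointers, or 0 if the multiplication would overflow
 * size_t. */
static size_t ptr_array_bytes(size_t capacity)
{
    if (capacity > SIZE_MAX / sizeof(void *))
        return 0;
    return capacity * sizeof(void *);
}
```

Calling `ptr_array_bytes(6349)` gives 6349 times the pointer size of the build, so the value that reaches malloc should never be 0 for this capacity unless the argument is being corrupted or misreported somewhere along the way.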
> > > > 
> > > > Can you send me the compile line from the OpenSM build ? Are the include
> > > > paths correct for 64 bit headers ?
> > > > 
> > > > > So now I've compiled it in 32-bit mode (had to fix my chroot) and
> > > > > everything runs, but I get the following message...
> > > > > 
> > > > > Feb 16 13:59:28 006732 [0000] -> OpenSM Rev:openib-1.1.0
> > > > >  
> > > > > Feb 16 13:59:28 008210 [F7E8D020] -> osm_report_notice: Reporting
> > > > > Generic Notice type:3 num:66 from LID:0x0000
> > > > > GID:0xfe80000000000000,0x0000000000000000
> > > > > Feb 16 13:59:28 008292 [F7E8D020] -> osm_report_notice: Reporting
> > > > > Generic Notice type:3 num:66 from LID:0x0000
> > > > > GID:0xfe80000000000000,0x0000000000000000
> > > > > Feb 16 13:59:28 015894 [F7E8D020] -> osm_vendor_get_all_port_attr:
> > > > > assign CA mthca0 port 1 guid (0x2c90109764831) as the default port
> > > > > Feb 16 13:59:28 015977 [F7E8D020] -> osm_vendor_bind: Binding to port
> > > > > 0x2c90109764831.
> > > > > Feb 16 13:59:28 021293 [F7E8D020] -> osm_vendor_bind: Binding to port
> > > > > 0x2c90109764831.
> > > > > Feb 16 13:59:28 021692 [F568C4E0] -> umad_receiver: ERR 5413: Failed to
> > > > > obtain request madw for received MAD(method=0x81 attr=0x11) -- dropping
> > > > 
> > > > For some reason, on the response received, it is not finding the match
> > > > in the transaction table. I thought this was fixed a while ago for
> > > > PowerPC. Can you run opensm with -V and see if there is any more output
> > > > that might be helpful ?
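
To make the failure mode concrete: a received response has to be paired with its outstanding request through the transaction table, and a response that finds no entry is dropped, which is what ERR 5413 reports. The sketch below is illustrative only. `pending_table`, `pending_add`, and `pending_match` are hypothetical names, keying the table on a 64-bit transaction ID is an assumption, and none of it reflects the actual OpenSM vendor-layer data structures.

```c
#include <stdint.h>

#define MAX_PENDING 8

/* Hypothetical table of outstanding requests, keyed by transaction ID. */
struct pending_table {
    uint64_t tid[MAX_PENDING];
    int in_use[MAX_PENDING];
};

/* Record an outstanding request. Returns its slot, or -1 if the table is full. */
static int pending_add(struct pending_table *t, uint64_t tid)
{
    for (int i = 0; i < MAX_PENDING; i++) {
        if (!t->in_use[i]) {
            t->tid[i] = tid;
            t->in_use[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Match a response to its request and release the slot.
 * Returns the slot index, or -1 when nothing matches (the response is dropped). */
static int pending_match(struct pending_table *t, uint64_t tid)
{
    for (int i = 0; i < MAX_PENDING; i++) {
        if (t->in_use[i] && t->tid[i] == tid) {
            t->in_use[i] = 0;
            return i;
        }
    }
    return -1;
}
```

In this model any difference between the stored ID and the ID carried by the response makes `pending_match` return -1, so the lookup is only as reliable as the ID round trip.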
> > > > 
> > > > > Other info:
> > > > > > [root@m2 ~]# ibstat
> > > > > CA 'mthca0'
> > > > >         CA type: MT23108
> > > > >         Number of ports: 2
> > > > >         Firmware version: 3.3.2
> > > > >         Hardware version: a1
> > > > >         Node GUID: 0x0002c90109764830
> > > > >         System image GUID: 0x0002c90109764833
> > > > >         Port 1:
> > > > >                 State: Initializing
> > > > >                 Physical state: LinkUp
> > > > >                 Rate: 10
> > > > >                 Base lid: 0
> > > > >                 LMC: 0
> > > > >                 SM lid: 0
> > > > >                 Capability mask: 0x00510a68
> > > > >                 Port GUID: 0x0002c90109764831
> > > > >         Port 2:
> > > > >                 State: Down
> > > > >                 Physical state: Polling
> > > > >                 Rate: 2
> > > > >                 Base lid: 0
> > > > >                 LMC: 0
> > > > >                 SM lid: 0
> > > > >                 Capability mask: 0x00510a68
> > > > >                 Port GUID: 0x0002c90109764832
> > > > > 
> > > > > 
> > > > > > [root@m2 ~]# ibstatus
> > > > > Infiniband device 'mthca0' port 1 status:
> > > > >         default gid:     fe80:0000:0000:0000:0002:c901:0976:4831
> > > > >         base lid:        0x0
> > > > >         sm lid:          0x0
> > > > >         state:           2: INIT
> > > > >         phys state:      5: LinkUp
> > > > >         rate:            10 Gb/sec (4X)
> > > > 
> > > > This is goodness and means the physical link has been established on
> > > > this port.
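
For reference, the numeric codes printed by ibstatus follow the InfiniBand port state encodings. The sketch below decodes them and shows the distinction being made here: physical state 5 (LinkUp) means the link has trained, while logical state 2 (INIT) means the port has not yet been brought to ACTIVE by a subnet manager. The function names are hypothetical and are not taken from any InfiniBand library.

```c
#include <string.h>

/* Logical port states as printed by ibstatus ("state: 2: INIT"). */
static const char *port_state_name(int state)
{
    switch (state) {
    case 1: return "DOWN";
    case 2: return "INIT";
    case 3: return "ARMED";
    case 4: return "ACTIVE";
    default: return "UNKNOWN";
    }
}

/* Physical port states as printed by ibstatus ("phys state: 5: LinkUp"). */
static const char *phys_state_name(int phys)
{
    switch (phys) {
    case 1: return "Sleep";
    case 2: return "Polling";
    case 3: return "Disabled";
    case 4: return "PortConfigurationTraining";
    case 5: return "LinkUp";
    case 6: return "LinkErrorRecovery";
    default: return "UNKNOWN";
    }
}

/* Hypothetical helper: the link is physically up but the port has not
 * reached ACTIVE yet. */
static int waiting_for_sm(int state, int phys)
{
    return phys == 5 && state != 4;
}
```

With the values from the output above, port 1 (state 2, phys state 5) is waiting on the subnet manager, while port 2 (state 1, phys state 2) has no physical link at all.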
> > > > 
> > > > > Infiniband device 'mthca0' port 2 status:
> > > > >         default gid:     fe80:0000:0000:0000:0002:c901:0976:4832
> > > > >         base lid:        0x0
> > > > >         sm lid:          0x0
> > > > >         state:           1: DOWN
> > > > >         phys state:      2: Polling
> > > > >         rate:            2.5 Gb/sec (1X)
> > > > > 
> > > > > 
> > > > > > My archives suggest a firmware upgrade, but 3.3.3 isn't available from
> > > > > > SBS as far as I can tell and my contact no longer works there so I'm
> > > > > > going to have to find the new person to talk about getting newer
> > > > > > firmware, unless of course another vendor's firmware will work on this
> > > > > > card.
> > > > 
> > > > I think 3.3.2 should be OK. In any case, I doubt it's the source of the
> > > > problem above.
> > > > 
> > > > -- Hal
> > > > 
> > > > > Cheers,
> > > > > Owen
> > > > > 