[ewg] ib_acme fails for requests with IPv4 addresses (ofed 3.5)
Jens Domke
jens.domke at tu-dresden.de
Sun Mar 24 21:45:22 PDT 2013
Hello Sean,
thank you very much. I have found the reason now.
Multicast routing/support was disabled in the OpenSM configuration file:
disable_multicast TRUE
If I had read the ibacm man page more carefully, I would have seen that ibacm relies on multicast. My bad.
It's still a bit weird that ib_acme worked for the rest of the nodes, even though multicast was disabled ;-)
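
For anyone who runs into the same symptom, the check for this setting can be sketched in Python roughly as follows. The function name multicast_disabled is hypothetical, and the parsing rules (whitespace-separated key and value, '#' starting a comment, first match wins) are my assumption about the file format rather than something taken from the OpenSM documentation.

```python
def multicast_disabled(conf_text):
    """Return True if the text contains an active 'disable_multicast TRUE' line.

    Assumption: options are 'key value' pairs separated by whitespace and
    '#' starts a comment. The first matching line decides the result.
    """
    for raw in conf_text.splitlines():
        fields = raw.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "disable_multicast":
            return fields[1].upper() == "TRUE"
    return False


# Example with the line quoted above.
sample = "disable_multicast TRUE\n"
print(multicast_disabled(sample))
```

To use it on a real system, read the OpenSM configuration file into a string and pass it in; the file location differs between installations, so it is not hard-coded here.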
We will try to use ibacm on a large installation over the weekend
and I feel confident now that this will work and we can perform the benchmarks.
Thanks again,
Jens
On Mar 23, 2013, at 4:59 AM, Hefty, Sean wrote:
>> Now I have another problem with 3 out of 18 nodes. All 3 get the correct
>> information for the other 15 nodes if I run ib_acme, and also the other 15 can
>> obtain the right information for the 3, but if I run ib_acme among those 3
>> nodes then I get a "Connection timed out".
>> On all three nodes the command for 'localhost' does work, too.
>>
>> Here is the output:
>> ====================================================================================
>> rc001 ~ $ pdsh -w rc0[00-17] 'for x in `seq 100 117`; do ib_acme -f i -d 10.1.4.${x} -v; done' | grep failed -B 1
>> rc002: Destination: 10.1.4.106
>> rc002: ib_acm_resolve_ip failed: Connection timed out
>> rc002: SA verification: failed Cannot assign requested address
>> --
>> rc011: Destination: 10.1.4.102
>> rc011: ib_acm_resolve_ip failed: Connection timed out
>> rc011: SA verification: failed Cannot assign requested address
>> --
>> rc006: Destination: 10.1.4.102
>> rc006: ib_acm_resolve_ip failed: Connection timed out
>> rc006: SA verification: failed Cannot assign requested address
>> --
>> rc002: Destination: 10.1.4.111
>> rc002: ib_acm_resolve_ip failed: Connection timed out
>> rc002: SA verification: failed Cannot assign requested address
>> --
>> rc011: Destination: 10.1.4.106
>> rc011: ib_acm_resolve_ip failed: Connection timed out
>> rc011: SA verification: failed Cannot assign requested address
>> --
>> rc006: Destination: 10.1.4.111
>> rc006: ib_acm_resolve_ip failed: Connection timed out
>> rc006: SA verification: failed Cannot assign requested address
>> ====================================================================================
>>
>> Have you seen this type of problem before? In this case it should not be
>> related to the ibacm_addr.cfg, right?
>> Maybe it's a problem with the switch or links, I will try some other ports of
>> the switch tomorrow.
>
> I have not seen this problem before. The log file that you provided looks okay to me.
>
> The following snippet from the rc011 log file indicates that the address resolution message sent from rc011 is correctly being routed back to rc011. (rc011 simply discards the message.)
>
> 1363971114.607: acm_process_recv: base endpoint name rc011
> 1363971114.607: acm_process_acm_recv:
> 1363971114.607: acm_process_acm_recv: src 10.1.4.111
> 1363971114.607: acm_process_acm_recv: dest 10.1.4.106
> 1363971114.607: acm_process_acm_recv: unsolicited request
> 1363971114.607: acm_process_addr_req:
> 1363971114.607: acm_acquire_dest: 10.1.4.111
> 1363971114.607: acm_get_dest: 10.1.4.111
> 1363971114.607: acm_process_addr_req: dest state 4
> 1363971114.607: acm_complete_queued_req: status 0
> 1363971114.607: acm_put_dest: 10.1.4.111
>
> What would be interesting to know is if the log file on rc006 shows that it received the message from rc011. That is, do we see something like this:
>
> : acm_process_recv: base endpoint name rc006
> : acm_process_acm_recv:
> : acm_process_acm_recv: src 10.1.4.111
> : acm_process_acm_recv: dest 10.1.4.106
> : acm_process_acm_recv: unsolicited request
>
> It's curious that only a select group of nodes can't communicate with each other. I'm inclined to agree with your assessment that it may be an issue with the switch, or possibly how the multicast group was configured.
>
> - Sean
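
The log check suggested above (did a given node receive the unsolicited request?) can be sketched in Python as below. This is only an illustration: the function name saw_request is hypothetical, and the matching relies solely on the message texts visible in the quoted log snippet, so other ibacm versions may log differently.

```python
def saw_request(log_text, endpoint, src, dest):
    """Return True if log_text shows an unsolicited request from src to dest
    received at the given endpoint.

    Assumption: each log line is '<timestamp>: <message>' and a received
    message starts with an 'acm_process_recv: base endpoint name ...' line,
    as in the snippet quoted in this thread.
    """
    want = [
        "acm_process_recv: base endpoint name " + endpoint,
        "acm_process_acm_recv: src " + src,
        "acm_process_acm_recv: dest " + dest,
        "acm_process_acm_recv: unsolicited request",
    ]
    step = 0
    for line in log_text.splitlines():
        # Drop the leading timestamp, keep the message part.
        msg = line.split(": ", 1)[-1].strip()
        if msg.startswith("acm_process_recv:"):
            # A new received message begins; restart the match.
            step = 1 if msg == want[0] else 0
        elif step and msg == want[step]:
            step += 1
            if step == len(want):
                return True
    return False


# Lines taken from the rc011 log snippet quoted above.
log = """1363971114.607: acm_process_recv: base endpoint name rc011
1363971114.607: acm_process_acm_recv:
1363971114.607: acm_process_acm_recv: src 10.1.4.111
1363971114.607: acm_process_acm_recv: dest 10.1.4.106
1363971114.607: acm_process_acm_recv: unsolicited request
"""
print(saw_request(log, "rc011", "10.1.4.111", "10.1.4.106"))
```

Running the same function over the rc006 log with endpoint "rc006" would answer the question of whether the request from rc011 arrived there.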
--------------------------------
Dipl.-Math. Jens Domke
Research Assistant
Technische Universitaet Dresden
Center for Information Services and High Performance Computing (ZIH)
Interdisciplinary Application Development and Coordination
01062 Dresden
Tel.: +49 (351) 463-39114
Fax: +49 (351) 463-37773
E-Mail: jens.domke at tu-dresden.de
--------------------------------