[ewg] ib_acme fails for requests with IPv4 addresses (ofed 3.5)

Hefty, Sean sean.hefty at intel.com
Fri Mar 22 12:59:43 PDT 2013


> Now I have another problem with 3 out of 18 nodes. All 3 get the correct
> information for the other 15 nodes if I run ib_acme, and also the other 15 can
> obtain the right information for the 3, but if I run ib_acme among those 3
> nodes then I get a "Connection timed out".
> On all three nodes the command for 'localhost' does work, too.
> 
> Here is the output:
> ==============================================================================
> rc001 ~ $ pdsh -w rc0[00-17] 'for x in `seq 100 117`; do ib_acme -f i -d 10.1.4.${x} -v; done' | grep failed -B 1
> rc002: Destination: 10.1.4.106
> rc002: ib_acm_resolve_ip failed: Connection timed out
> rc002: SA verification: failed Cannot assign requested address
> --
> rc011: Destination: 10.1.4.102
> rc011: ib_acm_resolve_ip failed: Connection timed out
> rc011: SA verification: failed Cannot assign requested address
> --
> rc006: Destination: 10.1.4.102
> rc006: ib_acm_resolve_ip failed: Connection timed out
> rc006: SA verification: failed Cannot assign requested address
> --
> rc002: Destination: 10.1.4.111
> rc002: ib_acm_resolve_ip failed: Connection timed out
> rc002: SA verification: failed Cannot assign requested address
> --
> rc011: Destination: 10.1.4.106
> rc011: ib_acm_resolve_ip failed: Connection timed out
> rc011: SA verification: failed Cannot assign requested address
> --
> rc006: Destination: 10.1.4.111
> rc006: ib_acm_resolve_ip failed: Connection timed out
> rc006: SA verification: failed Cannot assign requested address
> ==============================================================================
> 
> Have you seen this type of problem before? In this case it should not be
> related to the ibacm_addr.cfg, right?
> Maybe it's a problem with the switch or links. I will try some other ports of
> the switch tomorrow.

I have not seen this problem before.  The log file that you provided looks okay to me.
 
The following snippet from the rc011 log file indicates that the address resolution message sent from rc011 is correctly being routed back to rc011.  (rc011 simply discards the message.)

1363971114.607: acm_process_recv: base endpoint name rc011
1363971114.607: acm_process_acm_recv: 
1363971114.607: acm_process_acm_recv: src  10.1.4.111
1363971114.607: acm_process_acm_recv: dest 10.1.4.106
1363971114.607: acm_process_acm_recv: unsolicited request
1363971114.607: acm_process_addr_req: 
1363971114.607: acm_acquire_dest: 10.1.4.111
1363971114.607: acm_get_dest: 10.1.4.111
1363971114.607: acm_process_addr_req: dest state 4
1363971114.607: acm_complete_queued_req: status 0
1363971114.607: acm_put_dest: 10.1.4.111

It would be interesting to know whether the log file on rc006 shows that it received the message from rc011.  That is, do we see something like this:

: acm_process_recv: base endpoint name rc006
: acm_process_acm_recv: 
: acm_process_acm_recv: src  10.1.4.111
: acm_process_acm_recv: dest 10.1.4.106
: acm_process_acm_recv: unsolicited request
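
One way to check this without reading the whole log by eye is a small script that pulls the src/dest pairs out of the acm_process_acm_recv records. This is only a sketch: it assumes the log lines have the same form as the excerpts above, and the function name received_requests is made up for illustration (it is not part of ibacm).

```python
import re

# Matches lines such as "1363971114.607: acm_process_acm_recv: src  10.1.4.111".
# The timestamp is optional so that lines without one are accepted as well.
LINE_RE = re.compile(r"^\s*(?:[\d.]+)?:\s*(\w+):\s*(.*)$")


def received_requests(log_lines):
    """Return a list of (src, dest) pairs, one per acm_process_acm_recv record."""
    pairs = []
    src = None
    for line in log_lines:
        m = LINE_RE.match(line)
        if not m or m.group(1) != "acm_process_acm_recv":
            continue
        fields = m.group(2).split()
        if len(fields) == 2 and fields[0] == "src":
            # Remember the source until the matching dest line shows up.
            src = fields[1]
        elif len(fields) == 2 and fields[0] == "dest" and src is not None:
            pairs.append((src, fields[1]))
            src = None
    return pairs
```

Calling received_requests() on the lines of the rc006 log and looking for the pair ("10.1.4.111", "10.1.4.106") in the result would show whether the request from rc011 arrived there.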

It's curious that only a select group of nodes can't communicate with each other.  I'm inclined to agree with your assessment that it may be an issue with the switch, or possibly how the multicast group was configured.
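
To confirm that the failures really are confined to one closed group of nodes, the pdsh output quoted above can be summarized with a short script. This is a rough sketch under two assumptions: that the output has exactly the "node: Destination: ..." / "node: ib_acm_resolve_ip failed: ..." form shown, and that node rcNNN owns address 10.1.4.(100 + NNN), which is inferred from the seq 100..117 loop and the rc011 log excerpt. The function names are hypothetical.

```python
from collections import defaultdict


def failing_pairs(output_lines):
    """Map each source node to the set of destination IPs it failed to resolve."""
    failures = defaultdict(set)
    current = {}  # most recent "Destination:" value seen for each node
    for raw in output_lines:
        # Drop any leading mail-quote markers and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        node, sep, rest = line.partition(": ")
        if not sep:
            continue  # separator lines such as "--"
        if rest.startswith("Destination:"):
            current[node] = rest.split()[1]
        elif rest.startswith("ib_acm_resolve_ip failed") and node in current:
            failures[node].add(current[node])
    return dict(failures)


def node_ip(node):
    # Assumption: rcNNN maps to 10.1.4.(100 + NNN).
    return "10.1.4.%d" % (100 + int(node[2:]))


def is_closed_group(failures):
    """True if the failing destinations are exactly the failing sources."""
    sources = {node_ip(n) for n in failures}
    dests = set().union(*failures.values()) if failures else set()
    return sources == dests
```

If is_closed_group() returns True for the captured output, every failed lookup is between members of the same small set of nodes, which fits a problem local to those nodes' ports or multicast membership rather than a general configuration error.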

- Sean


