[ewg] MPI Jobs failed due to IB issues

Jagga Soorma jagga13 at gmail.com
Tue Apr 9 16:39:03 PDT 2013


Hi Guys,

A user reported that his jobs failed overnight, and I found the
following error message in his logs:

--
--------------------------------------------------------------------------
The OpenFabrics stack has reported a network error event.  Open MPI
will try to continue, but your job may end up failing.

  Local host:        amber04
  MPI process PID:   23493
  Error number:      10 (IBV_EVENT_PORT_ERR)

This error may indicate connectivity problems within the fabric;
please contact your system administrator.
--------------------------------------------------------------------------
[amber04:23491] 5 more processes have sent help message
help-mpi-btl-openib.txt / of error event
[amber04:23491] Set MCA parameter "orte_base_help_aggregate" to 0 to see
all help / error messages
[[54699,1],1][btl_openib_component.c:3224:handle_wc] from amber03 to:
amber04 error polling LP CQ with status RETRY EXCEEDED ERROR status
number 12 for wr_id 16045312 opcode 0 vendor error 129 qp_idx 0
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10).  The actual timeout value used is calculated as:

     4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.

Below is some information about the host that raised the error and the
peer to which it was connected:

  Local host:   amber03
  Local device: mlx4_0
  Peer host:    amber04
--
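
If I am reading the timeout formula right, the default of 10 works out
to roughly 4.2 ms per retry.  A quick check of the arithmetic (this is
just the formula from the help text above, not a tuning
recommendation):

  # effective ACK timeout = 4.096 us * 2^btl_openib_ib_timeout
  awk 'BEGIN { for (t = 10; t <= 20; t += 5)
    printf "btl_openib_ib_timeout=%d -> %.1f ms\n", t, 4.096 * 2^t / 1000 }'

I could raise both parameters on the mpirun command line as a stopgap
(values here are illustrative):

  mpirun -mca btl_openib_ib_timeout 15 -mca btl_openib_ib_retry_count 7 ...

but that would only mask whatever is actually wrong in the fabric.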

When I checked the logs on our Mellanox IB switch, I saw the following
errors from around the same time:

--
Apr 8 22:26:53 ib01-oo74 hwd[2244]: TID 1208100608: [hwd.WARNING]:
refresh_i2c(), hwd_main.c:7345, build 1: can't refresh device error
MLXI2C_CR_ERROR 8. closing and reseting device
Apr 8 22:26:53 ib01-oo74 hwd[2244]: TID 1208100608: [hwd.WARNING]:
iterate_temp_sensors(), hwd_main.c:11257, build 1: Failed to refresh i2c
device MLXI2C_CR_ERROR 8
Apr 8 22:26:53 ib01-oo74 hwd[2244]: TID 1208100608: [hwd.ERR]:
iterate_temp_sensors(), hwd_main.c:11257, build 1: Error code 8 returned
Apr 8 22:26:53 ib01-oo74 hwd[2244]: TID 1208100608: [hwd.ERR]:
hwd_mon_handle_iterate(), hwd_main.c:10618, build 1: Error code 8 returned
Apr 8 22:26:53 ib01-oo74 hwd[2244]: TID 1208100608: [hwd.ERR]:
mdc_mon_iterate_node_internal(), mdc_misc.c:586, build 1: Error code 8
returned
Apr 8 22:28:12 ib01-rwc-oo74 temp_control[2323]: [tc.ERR]:
get_bindings_by_name(), tc.c:36, build 1: Received empty data
system/chassis/temperature/state
Apr 8 22:28:12 ib01-oo74 temp_control[2323]: [tc.WARNING]: Failed to get
binding: system/chassis/temperature/state err:0
Apr 8 22:28:12 ib01-oo74 temp_control[2323]: [tc.ERR]:
lew_universal_event_handler(), libevent_wrapper.c:303, build 1: Error code
6 returned
Apr 8 22:28:32 ib01-oo74 hwd[2244]: TID 1208100608: [hwd.WARNING]:
iterate_is4modules_initialized(), hwd_main.c:11368, build 1: Failed to
refresh i2c device MLXI2C_ERROR 1
Apr 8 22:28:32 ib01-oo74 hwd[2244]: TID 1208100608: [hwd.ERR]:
iterate_is4modules_initialized(), hwd_main.c:11368, build 1: Error code 1
returned
Apr 8 22:28:32 ib01-oo74 hwd[2244]: TID 1208100608: [hwd.ERR]:
hwd_mon_handle_iterate(), hwd_main.c:10659, build 1: Error code 1 returned
Apr 8 22:28:32 ib01-oo74 hwd[2244]: TID 1208100608: [hwd.ERR]:
mdc_mon_iterate_node_internal(), mdc_misc.c:586, build 1: Error code 1
returned
Apr 8 22:29:45 ib01-oo74 smm[2275]: [smm.ERR]: smm_forward_sa_db(),
smm_main.c:7167, build 1: Error code 14004 (item not found) returned
Apr 8 22:29:45 ib01-oo74 smm[2275]: [smm.ERR]: smm_get_sa_db(),
smm_main.c:6748, build 1: Error code 14004 (item not found) returned
Apr 8 22:29:45 ib01-oo74 smm[2275]: [smm.ERR]: smm_handle_sa_db_updates(),
smm_main.c:7253, build 1: Error code 14004 (item not found) returned
Apr 8 22:29:45 ib01-oo74 smm[2275]: [smm.ERR]: smm_handle_event_request(),
smm_main.c:7393, build 1: Error code 14004 (item not found) returned
--

I'm not sure what these error codes mean and could not find anything on
the web.  Any ideas what these messages indicate and whether they could
cause a disconnect on the IB network?  I have run perfquery and
ibdiagnet (roughly as shown below), and the only thing I can see is
some nodes exceeding the PortXmitDiscards threshold.  I have reset
those counters to see if it is still happening, but I don't really
think that is the cause of the problem here.
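
For reference, this is roughly what I ran to check the fabric (<lid>
and <port> are placeholders for the suspect node's values):

  # full fabric sweep
  ibdiagnet
  # report ports with nonzero error counters
  ibqueryerrors
  # read the counters on a suspect port, resetting them after the read
  perfquery -R <lid> <port>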

Any help would be greatly appreciated.

Thanks.