[Users] Communication (Send/Recv) error according to message size(65535)
Peter Kjellström
cap at nsc.liu.se
Mon Jan 4 03:52:26 PST 2021
On Mon, 4 Jan 2021 06:08:39 +0000
Kihang Youn <kyoun at lenovo.com> wrote:
> Hello,
>
> I am testing the newly upgraded OFED (5.1-0.6.6) and corresponding
> OpenMPI (4.0.2, 4.0.4).
...
> When communicating between compute nodes (inter-node), if the size of
> send/recv messages exceeds 65535, the following error occurs. This
> does not happen when using one compute node.
...
The error message below mentions mlx5_2 and RoCE, but the ibstat output
shows an InfiniBand port on mlx5_0. Is this an Open MPI naming confusion,
or do you actually also have an Ethernet port (mlx5_2)?
/Peter
> Part of the error message:
>
> [pduru18:351568:0:351568] ib_mlx5_log.c:143 Transport retry count
> exceeded on mlx5_2:1/RoCE (synd 0x15 vend 0x81 hw_synd 0/0)
...
> > ibstat
> CA 'mlx5_0'
>     CA type: MT4123
>     Number of ports: 1
>     Firmware version: 20.28.1002
>     Hardware version: 0
>     Node GUID: 0xb8599f0300b84da6
>     System image GUID: 0xb8599f0300b84da6
>     Port 1:
>         State: Active
>         Physical state: LinkUp
>         Rate: 100
>         Base lid: 4
>         LMC: 0
>         SM lid: 4
>         Capability mask: 0x2651e84a
>         Port GUID: 0xb8599f0300b84da6
>         Link layer: InfiniBand
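
The mismatch between the error line and the ibstat output can be checked
mechanically. The following is only a rough sketch, not an Open MPI or UCX
tool: it takes the two pieces of text quoted above (with the wrapped error
line joined into one string), pulls the device, port, and link type out of
the error line, and looks that device up among the CAs that ibstat listed.
The function names parse_error_device and parse_ibstat and both regular
expressions are my own assumptions, written against this one sample only.

```python
import re

# Error line copied from the quoted message (the two wrapped lines joined).
ERROR_LINE = (
    "[pduru18:351568:0:351568] ib_mlx5_log.c:143 Transport retry count "
    "exceeded on mlx5_2:1/RoCE (synd 0x15 vend 0x81 hw_synd 0/0)"
)

# ibstat output copied from the quoted message.
IBSTAT_OUTPUT = """\
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.28.1002
    Hardware version: 0
    Node GUID: 0xb8599f0300b84da6
    System image GUID: 0xb8599f0300b84da6
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 4
        LMC: 0
        SM lid: 4
        Capability mask: 0x2651e84a
        Port GUID: 0xb8599f0300b84da6
        Link layer: InfiniBand
"""


def parse_error_device(line):
    """Return (device, port, link_type) from an '... exceeded on dev:port/type' line.

    Hypothetical helper: the pattern is an assumption based on the one
    error line quoted in this thread. Returns None if it does not match.
    """
    m = re.search(r"exceeded on (\w+):(\d+)/(\w+)", line)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), m.group(3)


def parse_ibstat(text):
    """Map each CA name to {port number: link layer} from ibstat-style text."""
    cas = {}
    ca = None
    port = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '([^']+)'", line)
        if m:
            ca = m.group(1)
            cas[ca] = {}
            port = None
            continue
        # Matches "Port 1:" but not "Port GUID: ..." (no digit after "Port ").
        m = re.match(r"Port (\d+):", line)
        if m and ca is not None:
            port = int(m.group(1))
            continue
        m = re.match(r"Link layer:\s*(\S+)", line)
        if m and ca is not None and port is not None:
            cas[ca][port] = m.group(1)
    return cas


device, port, link_type = parse_error_device(ERROR_LINE)
ports_by_ca = parse_ibstat(IBSTAT_OUTPUT)

if device in ports_by_ca:
    layer = ports_by_ca[device].get(port)
    print(f"{device}:{port} is listed by ibstat with link layer {layer}")
else:
    print(f"{device}:{port} ({link_type}) is not in the pasted ibstat output, "
          f"which lists only: {sorted(ports_by_ca)}")
```

On the text quoted in this thread the lookup fails: the error names mlx5_2
with RoCE, while the pasted ibstat output lists only mlx5_0 with an
InfiniBand link layer. That says nothing about whether mlx5_2 exists on the
node, only that it is absent from what was pasted.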