[Users] Communication (Send/Recv) error according to message size(65535)

Peter Kjellström cap at nsc.liu.se
Mon Jan 4 03:52:26 PST 2021


On Mon, 4 Jan 2021 06:08:39 +0000
Kihang Youn <kyoun at lenovo.com> wrote:

> Hello,
> 
> I am testing the newly upgraded OFED (5.1-0.6.6) and corresponding
> OpenMPI (4.0.2, 4.0.4).
...
> When communicating between compute nodes (inter-node), if the size of
> send/recv messages exceeds 65535, the following error occurs. This
> does not happen when using a single compute node.
...

The error message below mentions mlx5_2 and RoCE, but the ibstat output
shows an InfiniBand port on mlx5_0. Is this an Open MPI naming confusion,
or do you actually also have an Ethernet port (mlx5_2)?

/Peter

> Part of the error message:
> 
> [pduru18:351568:0:351568] ib_mlx5_log.c:143  Transport retry count
> exceeded on mlx5_2:1/RoCE (synd 0x15 vend 0x81 hw_synd 0/0)
... 
> > ibstat  
> CA 'mlx5_0'
>         CA type: MT4123
>         Number of ports: 1
>         Firmware version: 20.28.1002
>         Hardware version: 0
>         Node GUID: 0xb8599f0300b84da6
>         System image GUID: 0xb8599f0300b84da6
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 100
>                 Base lid: 4
>                 LMC: 0
>                 SM lid: 4
>                 Capability mask: 0x2651e84a
>                 Port GUID: 0xb8599f0300b84da6
>                 Link layer: InfiniBand
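
One way to check which link layer each adapter uses is to parse the full
ibstat output. The following is only a minimal sketch, assuming the output
has the same layout as the excerpt quoted above; the function name
link_layers and the SAMPLE string are illustrative and not part of any
Open MPI or OFED tool.

```python
import re

# Excerpt of the ibstat output quoted in this thread (one CA only).
SAMPLE = """\
CA 'mlx5_0'
        CA type: MT4123
        Number of ports: 1
        Port 1:
                State: Active
                Rate: 100
                Port GUID: 0xb8599f0300b84da6
                Link layer: InfiniBand
"""


def link_layers(ibstat_text):
    """Map each CA name to {port number: link layer} from ibstat-style text."""
    result = {}
    ca = None
    port = None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '([^']+)'", line)
        if m:
            # A new adapter section starts; reset the current port.
            ca = m.group(1)
            result[ca] = {}
            port = None
            continue
        m = re.match(r"Port (\d+):", line)  # does not match "Port GUID:"
        if m and ca is not None:
            port = int(m.group(1))
            continue
        m = re.match(r"Link layer:\s*(\S+)", line)
        if m and ca is not None and port is not None:
            result[ca][port] = m.group(1)
    return result


print(link_layers(SAMPLE))  # {'mlx5_0': {1: 'InfiniBand'}}
```

Feeding it the complete ibstat output from a compute node (not just the
mlx5_0 excerpt) would show whether an mlx5_2 entry exists and whether its
link layer is reported as Ethernet.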

