Re: [ofa-general] mlx4: errors and failures on OOM

Ira Weiny weiny2 at llnl.gov
Tue Apr 14 09:12:23 PDT 2009


On Mon, 13 Apr 2009 07:40:33 -0400
Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

> On Sat, Apr 11, 2009 at 4:33 PM, Bernd Schubert
> <bs_lists at aakef.fastmail.fm> wrote:
> > Hello,
> >
> > last week I had issues with Lustre failures, which turned out to be
> > failures of many clients that ran into out-of-memory conditions due to bad
> > user-space jobs (and no protection against that by the queuing system).
> >
> > Anyway, I don't think IB is supposed to fail when the oom killer activates.
> >
> > Errors for 0x001b0d0000008ede "Cisco Switch"
> >   5: [XmtDiscards == 270]
> >         Link info:     38    5[  ]  ==( 4X 5.0 Gbps)==>  0x00188b9097fe2a81    1[  ] "eul0605 HCA-1"
> >   16: [XmtDiscards == 132]
> >         Link info:     38   16[  ]  ==( 4X 5.0 Gbps)==>  0x00188b9097fe2a01    1[  ] "eul0616 HCA-1"
> >
> > I used a script to monitor the fabric for failures every 5 minutes, and the
> > messages above came up just when the oom killer activated on the clients.
> 
> XmtDiscards are the total number of outbound packets discarded by the port
> because the port is down or congested. Reasons for this include:
> • Output port is not in the active state
> • Packet length exceeded NeighborMTU
> • Switch Lifetime Limit exceeded
> • Switch HOQ Lifetime Limit exceeded
> This may also include packets discarded while in VLStalled State.


For what you are describing, this is "normal": the HCA is no longer accepting
inbound packets, so the switch discards them.
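
For reference, a minimal sketch of the kind of monitoring Bernd describes:
poll the XmtDiscards counter with perfquery and report increases.  This is
not from the thread; it assumes perfquery (infiniband-diags) is in PATH and
prints an "XmtDiscards:" line, and it reuses the LID 38 / port 5 values from
the switch report above.  The exact output format is an assumption.

#!/usr/bin/env python3
# Poll XmtDiscards on one switch port and report when it increases.
import re
import subprocess
import time

POLL_INTERVAL = 300          # roughly the 5-minute interval mentioned above
XMT_DISCARDS = re.compile(r"^XmtDiscards:\.*(\d+)", re.MULTILINE)

def read_xmt_discards(lid, port):
    # "perfquery <lid> <port>" prints the IBA port counters for that port
    out = subprocess.run(["perfquery", str(lid), str(port)],
                         capture_output=True, text=True, check=True).stdout
    match = XMT_DISCARDS.search(out)
    if match is None:
        raise RuntimeError("no XmtDiscards line in perfquery output")
    return int(match.group(1))

def watch(lid, port):
    last = read_xmt_discards(lid, port)
    while True:
        time.sleep(POLL_INTERVAL)
        current = read_xmt_discards(lid, port)
        if current > last:
            print("LID %d port %d: XmtDiscards +%d" % (lid, port, current - last))
        last = current

if __name__ == "__main__":
    watch(38, 5)             # hypothetical: the switch port toward eul0605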

> 
> > Below are syslogs from one of these clients
> >
> > Apr  4 08:50:38 eul0605 kernel: Lustre: Request x50173 sent from MGC172.17.31.247@o2ib to NID 172.17.31.247@o2ib 51s ago has timed out (limit
> > 300s).
> > Apr  4 08:50:38 eul0605 kernel: Lustre: Skipped 30 previous similar messages
> > Apr  4 08:50:38 eul0605 kernel: LustreError: 166-1: MGC172.17.31.247@o2ib: Connection to service MGS via nid 172.17.31.247@o2ib was lost; in
> > progress operations using this service will fail.
> > Apr  4 08:50:38 eul0605 kernel: Lustre: home1-MDT0000-mdc-0000010430fa0800: Connection to service home1-MDT0000 via nid 172.17.31.247@o2ib was
> > lost; in progress operations using this service will wait for recovery to complete.
> > Apr  4 08:50:38 eul0605 kernel: Lustre: Skipped 7 previous similar messages
> > Apr  4 08:50:38 eul0605 kernel: Lustre: tmp-OST0003-osc-0000010423750000: Connection to service tmp-OST0003 via nid 172.17.31.231@o2ib was lost; in
> > progress operations using this service will wait for recovery to complete.
> > Apr  4 08:50:38 eul0605 kernel: Lustre: Skipped 29 previous similar messages
> > Apr  4 08:50:38 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req@000001041bcbb800 x50205/t0
> > o250->MGS@172.17.31.247@o2ib:26/25 lens 304/456 e 0 to 1 dl 1238828031 ref 2 fl Rpc:N/0/0 rc 0/0
> > Apr  4 08:50:38 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) Skipped 31 previous similar messages
> > Apr  4 08:50:38 eul0605 kernel: Lustre: Request x50205 sent from MGC172.17.31.247@o2ib to NID 172.17.31.247@o2ib 51s ago has timed out (limit
> > 300s).
> >
> > ===> So somehow Lustre lost the network connection. On the server side the
> > logs simply show this node didn't answer to pings anymore.
> >
> >
> > Apr  4 08:52:58 eul0605 kernel: Lustre: Skipped 31 previous similar messages
> > Apr  4 08:52:59 eul0605 kernel: Lustre: Changing connection for MGC172.17.31.247@o2ib to MGC172.17.31.247@o2ib_1/172.17.31.246@o2ib
> > Apr  4 08:52:59 eul0605 kernel: Lustre: Skipped 61 previous similar messages
> > Apr  4 08:53:00 eul0605 kernel: oom-killer: gfp_mask=0xd2
> >
> > [...]
> >
> > Apr  4 08:53:05 eul0605 kernel: Out of Memory: Killed process 10612 (gamos).
> > Apr  4 08:53:10 eul0605 kernel: 3212 pages swap cached
> > Apr  4 08:53:10 eul0605 kernel: Out of Memory: Killed process 10292 (tcsh).
> >
> > ===> And here we see that gamos consumed all memory again.
> >
> > Apr  4 08:53:10 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req@0000010430f8f800 x50237/t0
> > o250->MGS@MGC172.17.31.247@o2ib_1:26/25 lens 304/456 e 0 to 1 dl 1238828107 ref 2 fl Rpc:N/0/0 rc 0/0
> > Apr  4 08:53:10 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) Skipped 31 previous similar messages
> > Apr  4 08:53:10 eul0605 kernel: Lustre: Request x50237 sent from MGC172.17.31.247@o2ib to NID 172.17.31.246@o2ib 50s ago has timed out (limit
> > 300s).
> > Apr  4 08:53:10 eul0605 kernel: Lustre: Skipped 31 previous similar messages
> > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> > Apr  4 08:53:11 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> > Apr  4 08:53:11 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> 
> That multicast group looks like the IPv4 broadcast group; status -11 is
> EAGAIN.  I'm not sure what's causing IPoIB to report this, but I wonder if it
> is a second-level failure following the previous (Lustre) errors.
> 
> -- Hal
> 
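As a footnote to Hal's point: the group ff12:401b:ffff:...:ffff:ffff matches
the IPoIB IPv4 multicast GID layout from RFC 4391 (0x401b signature, then the
P_Key, then the mapped IPv4 address, here 255.255.255.255, i.e. broadcast),
and status -11 is EAGAIN.  A small illustrative decode, assuming that layout
(helper name is made up):

import errno
import socket

def decode_ipoib_ipv4_mgid(mgid):
    # RFC 4391 layout: ff1x:401b:<pkey>:0:0:0:<IPv4 high 16>:<IPv4 low 16>
    words = [int(w, 16) for w in mgid.split(":")]
    if words[1] != 0x401b:
        raise ValueError("not an IPoIB IPv4 multicast GID")
    pkey = words[2]
    ipv4 = socket.inet_ntoa(bytes([words[6] >> 8, words[6] & 0xff,
                                   words[7] >> 8, words[7] & 0xff]))
    return hex(pkey), ipv4

print(decode_ipoib_ipv4_mgid("ff12:401b:ffff:0000:0000:0000:ffff:ffff"))
# -> ('0xffff', '255.255.255.255'), the broadcast group on the default P_Key
print(errno.errorcode[11])   # -> 'EAGAIN', the status in the ib0 messages above
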
> > ===> So we see the reason why Lustre lost the network connection: InfiniBand is down.
> >
> >
> > In most cases IB recovers from that situation, but not always. If it then fails
> > entirely, ibnetdiscover or ibclearerrors will report that they can't resolve the
> > route to these nodes.
> >
> >
> > This is with drivers from ofed-1.3.1. Any ideas why OOM causes issues with IB?

Are you getting any errors on the console from the kernel on these nodes,
specifically from the HCA (I think it was mlx4) driver?  If the nodes
recover, I assume that means the ib0 errors go away and Lustre reconnects?
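
A quick, purely illustrative way to look for that on a client: count the
mlx4/ib0/OOM-related lines in the kernel log around the incident.  The log
path and the message patterns below are assumptions; adjust for the distro
and for whatever the console actually shows.

import collections
import re

LOG = "/var/log/messages"            # assumption; may be /var/log/kern.log instead
PATTERNS = {
    "oom":    re.compile(r"oom-killer|Out of Memory"),
    "ipoib":  re.compile(r"ib0: multicast join failed"),
    "mlx4":   re.compile(r"mlx4_core|mlx4_ib"),
    "lustre": re.compile(r"LustreError"),
}

counts = collections.Counter()
with open(LOG, errors="replace") as log:
    for line in log:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break

for name, count in counts.most_common():
    print("%-8s %d" % (name, count))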

Ira

> >
> >
> > Thanks,
> > Bernd


