[openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer

Helen Chen hycsw at ca.sandia.gov
Thu Oct 13 15:07:18 PDT 2005


Roland,

It doesn't seem like shrinking the TCP window has helped.  I captured the
dmesg logs from the Lustre server and from the associated client that
reported the IOZONE error.  BTW, this problem is a moving target, so it is
hard to believe that it is hardware related(?)  I am using the Mellanox
DDR switch and HCA.

Thanks,
Helen

------- Dmesg from Lustre server ------
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 1638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 2638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 3638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 4638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 5638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 6638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 7638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 8638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 9638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 10638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 11638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 12638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 13638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 14638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 15638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 16638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 17638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 18638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 19638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 20638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 21638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 22638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 23638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 24638
LustreError: 12471:0:(ost_handler.c:735:ost_brw_write()) @@@ timeout on bulk GET req at f5d8e000 x20249/t0 o4-><?>@<?>:-1 lens 328/288 ref 0 fl
Interpret:/0/0 rc 0/0
LustreError: 12485:0:(ost_handler.c:822:ost_brw_write()) on3-ost2: bulk IO comm error evicting 5f9e7_lov2_e307d728c2 at NET_0xc0a80249_UUID id
192.168.2.73-12345
LustreError: 12468:0:(ost_handler.c:735:ost_brw_write()) @@@ timeout on bulk GET req at d51dfa00 x20359/t0 o4-><?>@<?>:-1 lens 328/288 ref 0 fl
Interpret:/0/0 rc 0/0
LustreError: 12468:0:(ost_handler.c:735:ost_brw_write()) previously skipped 1 similar messages
LustreError: 12477:0:(ost_handler.c:822:ost_brw_write()) on3-ost2: bulk IO comm error evicting 30326_lov2_7ce4b0bf00 at NET_0xc0a8024e_UUID id
192.168.2.78-12345
LustreError: 12477:0:(filter.c:1728:filter_grant_sanity_check()) filter_disconnect: tot_granted 48570368 != fo_tot_granted 49618944
LustreError: 12477:0:(filter.c:1731:filter_grant_sanity_check()) filter_disconnect: tot_pending 7340032 != fo_tot_pending 8388608
Lustre: A connection with 192.168.2.80 timed out; the network or that node may be down.
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a80250 ip 192.168.2.80:1022
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 25638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 26638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 27638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 28638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 29638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 30638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 31638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 32638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 33638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 34638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 35638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 36638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 37638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 38638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 39638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 40638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 41638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 42638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 43638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 44638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 45638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 46638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 47638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 48638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 49638
LustreError: A timeout occurred receiving data from 192.168.2.73; the network or that node may be down.
LustreError: 12189:0:(socknal_cb.c:2214:ksocknal_find_timed_out_conn()) Timed out RX from 0xc0a80249 f2630000 192.168.2.73
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a80249 ip 192.168.2.73:1021
LustreError: 12189:0:(socknal.c:1329:ksocknal_destroy_conn()) Completing partial receive from 0xc0a8024e, ip 192.168.2.78:1021, with error
LustreError: 12189:0:(events.c:320:server_bulk_callback()) event type 5, status 19, desc eb0c8000
LustreError: 12189:0:(events.c:320:server_bulk_callback()) event type 5, status 19, desc f2603000
LustreError: 12468:0:(ost_handler.c:822:ost_brw_write()) on3-ost2: bulk IO comm error evicting 30326_lov2_7ce4b0bf00 at NET_0xc0a8024e_UUID id
192.168.2.78-12345
LustreError: 12468:0:(ost_handler.c:822:ost_brw_write()) previously skipped 6 similar messages
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 50638
Lustre: A connection with 192.168.2.79 timed out; the network or that node may be down.
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8024f ip 192.168.2.79:1021
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) previously skipped 1 similar messages
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 51638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 52638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 53638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 54638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 55638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 56638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 57638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 58638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 59638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 60638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 61638
Lustre: A connection with 192.168.2.72 timed out; the network or that node may be down.
Lustre: previously skipped 3 similar messages
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a80248 ip 192.168.2.72:1021
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) previously skipped 3 similar messages
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 62638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 63638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 64638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 65638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 66638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 67638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 68638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 69638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 70638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 71638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 72638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 73638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 74638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 75638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 76638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 77638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 78638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 79638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 80638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 81638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 82638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 83638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 84638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 85638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 86638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 87638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 88638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 89638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 90638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 91638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 92638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 93638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 94638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 95638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 96638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 97638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 98638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 99638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 100638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 101638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 102638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 103638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 104638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 105638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 106638
LustreError: 12458:0:(ldlm_lib.c:506:target_handle_reconnect()) 709aa0a3-a6a1-4134-b2b4-805212eb9430 reconnecting
Lustre: 12470:0:(filter.c:2645:filter_set_info()) on3-ost1: received MDS connection (0xbc2765ac563141df)
Lustre: 12486:0:(filter.c:2082:filter_destroy_precreated()) on3-ost2: deleting orphan objects from 6 to 67
Lustre: 12583:0:(llog_cat.c:352:llog_cat_process_cb()) processing log 0x149423e:3575f5db at index 2 of catalog 0x149423a
Lustre: 12583:0:(filter_log.c:235:filter_recov_log_mds_ost_cb()) fetch generation log, send cookie
Lustre: 12583:0:(llog.c:287:llog_process()) recovery from log: 0x149423e:3575f5db stopped
LustreError: 12456:0:(ldlm_lib.c:506:target_handle_reconnect()) 8ebea_lov2_7a4510c13a reconnecting
LustreError: 12488:0:(ldlm_lib.c:506:target_handle_reconnect()) e24e8_lov1_13fb4ed690 reconnecting
LustreError: 12488:0:(ldlm_lib.c:506:target_handle_reconnect()) previously skipped 1 similar messages
LustreError: 12456:0:(ldlm_lib.c:506:target_handle_reconnect()) previously skipped 1 similar messages
LustreError: 12461:0:(ldlm_lib.c:506:target_handle_reconnect()) 97cda_lov2_81558eef0b reconnecting
LustreError: 12462:0:(ldlm_lib.c:506:target_handle_reconnect()) 03c5b_lov2_084e2d0661 reconnecting
LustreError: 12462:0:(ldlm_lib.c:506:target_handle_reconnect()) previously skipped 1 similar messages
LustreError: 12467:0:(ldlm_lib.c:506:target_handle_reconnect()) 8da95_lov1_79a1a2e0bd reconnecting
LustreError: 12467:0:(ldlm_lib.c:506:target_handle_reconnect()) previously skipped 4 similar messages
LustreError: 12454:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129239844, 100s ago) req at ea8d0800 x5/t0
o401->@NET_0xc0a80253_UUID:15 lens 104/64 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 12454:0:(recov_thread.c:410:log_commit_thread()) commit f538e000:f7679e80 drop 1 cookies: rc -110
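
In the server log above, the watchdog latency value climbs from 1638 to
106638 in steps of 1000, which may point to one continuous transmit stall
rather than many separate ones. A minimal sketch for checking this on a
saved dmesg capture, assuming Python 3; the function name and the sample
lines are illustrative only and are not part of any Lustre or OpenIB tool:

```python
import re

# Matches the IPoIB watchdog line format seen in the log above.
LATENCY_RE = re.compile(r"ib0: transmit timeout: latency (\d+)")


def summarize_latencies(lines):
    """Return (count, first, last, step_sizes) for watchdog latency lines."""
    values = []
    for line in lines:
        match = LATENCY_RE.search(line)
        if match:
            values.append(int(match.group(1)))
    if not values:
        return 0, None, None, set()
    # Differences between consecutive latency values.
    steps = {b - a for a, b in zip(values, values[1:])}
    return len(values), values[0], values[-1], steps


# A few lines copied from the server log; other lines are ignored.
sample = [
    "NETDEV WATCHDOG: ib0: transmit timed out",
    "ib0: transmit timeout: latency 1638",
    "NETDEV WATCHDOG: ib0: transmit timed out",
    "ib0: transmit timeout: latency 2638",
    "Lustre: A connection with 192.168.2.80 timed out; the network or that node may be down.",
    "ib0: transmit timeout: latency 3638",
]

count, first, last, steps = summarize_latencies(sample)
print(count, first, last, steps)
```

A single step size across the whole capture would be consistent with the
counter never being reset, i.e. no send completion arriving in between.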


--------- Dmesg from Lustre client -----------------------
Lustre: A connection with 192.168.2.74 timed out; the network or that node may be down.
LustreError: 11145:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8024a ip 192.168.2.74:988
LustreError: 11143:0:(socknal_lib-linux.c:813:ksocknal_lib_connect_sock()) Error -113 connecting 192.168.2.73/1022 -> 192.168.2.74/988
LustreError: Host 192.168.2.74 was unreachable; the network or that node may be down, or Lustre may be misconfigured.
LustreError: 11143:0:(socknal_cb.c:2103:ksocknal_autoconnect()) Deleting packet type 1 len 64 (0xc0a80249 192.168.2.73->0xc0a8024a 192.168.2.73)
LustreError: 11143:0:(events.c:61:request_out_callback()) @@@ type 8, status 19 req at f615f600 x20271/t0 o400->on3-ost2_UUID at NID_on3-ib_UUID:6 lens
64/64 ref 2 fl Rpc:N/0/0 rc 0/0
LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129239884, 3s ago) req at f615f600 x20271/t0
o400->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: Connection to service on3-ost2 via nid 192.168.2.74 was lost; in progress operations using this service will wait for recovery to
complete.
Lustre: 11269:0:(import.c:142:ptlrpc_set_import_discon()) OSC_on2_on3-ost2_MNT_on2-ib_2: connection lost to on3-ost2_UUID at NID_on3-ib_UUID
LustreError: 11270:0:(lib-move.c:1510:lib_api_put()) Error sending PUT to 0xc0a8024a: 19
LustreError: 11141:0:(socknal_lib-linux.c:813:ksocknal_lib_connect_sock()) Error -113 connecting 192.168.2.73/1022 -> 192.168.2.74/988
LustreError: Host 192.168.2.74 was unreachable; the network or that node may be down, or Lustre may be misconfigured.
LustreError: 11141:0:(socknal_cb.c:2103:ksocknal_autoconnect()) Deleting packet type 1 len 240 (0xc0a80249 192.168.2.73->0xc0a8024a 192.168.2.73)
LustreError: 11141:0:(socknal_cb.c:2103:ksocknal_autoconnect()) previously skipped 1 similar messages
LustreError: 11141:0:(events.c:61:request_out_callback()) @@@ type 8, status 19 req at f66a9600 x20283/t0 o8->on3-ost2_UUID at NID_on3-ib_UUID:6 lens
240/144 ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11141:0:(events.c:61:request_out_callback()) previously skipped 3 similar messages
LustreError: 11270:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129239912, 3s ago) req at f66a9600 x20283/t0
o8->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 240/144 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 11270:0:(client.c:945:ptlrpc_expire_one_request()) previously skipped 3 similar messages
LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129239819, 100s ago) req at f528ca00 x20242/t0
o4->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 328/288 ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) previously skipped 1 similar messages
LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129239834, 100s ago) req at f66a9a00 x20256/t0
o400->on3-ost1_UUID at NID_on3-ib_UUID:6 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) previously skipped 8 similar messages
Lustre: Connection restored to service on3-ost1 using nid 192.168.2.74.
Lustre: 11270:0:(import.c:714:ptlrpc_import_recovery_state_machine()) OSC_on2_on3-ost1_MNT_on2-ib: connection restored to
on3-ost1_UUID at NID_on3-ib_UUID
LustreError: This client was evicted by on3-ost2; in progress operations using this service will fail.
LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at f528c600 x20302/t0 o4->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 328/288
ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at f528c200 x20303/t0 o4->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 328/288
ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at d7c41c00 x20305/t0 o4->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 328/288
ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at d7c41800 x20306/t0 o4->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 328/288
ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at f607d600 x20307/t0 o4->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 328/288
ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c1925dc0 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c1779840 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 275 similar messages
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c177d820 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 485 similar messages
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c1792560 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 815 similar messages
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c18dd440 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 1399 similar messages
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c18e3600 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 2637 similar messages
Lustre: Connection restored to service on3-ost2 using nid 192.168.2.74.
Lustre: 11530:0:(import.c:714:ptlrpc_import_recovery_state_machine()) OSC_on2_on3-ost2_MNT_on2-ib_2: connection restored to
on3-ost2_UUID at NID_on3-ib_UUID
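
The hexadecimal peer IDs in both logs (for example 0xc0a80249 next to
192.168.2.73) appear to be the peer's IPv4 address written as a 32-bit
number. A minimal sketch for decoding them when matching server and client
messages, assuming Python 3 and its standard ipaddress module; the helper
name nid_to_ip is hypothetical:

```python
import ipaddress


def nid_to_ip(nid_hex):
    """Convert a hex peer ID such as '0xc0a80249' to dotted-quad form."""
    return str(ipaddress.IPv4Address(int(nid_hex, 16)))


# Peer IDs taken from the logs above.
for nid in ("0xc0a80249", "0xc0a8024a", "0xc0a8024e", "0xc0a80250"):
    print(nid, "->", nid_to_ip(nid))
```

This makes it easier to see that NET_0xc0a80249_UUID in the server log and
192.168.2.73 in the client log refer to the same node.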




>From hycsw Thu Oct 13 14:21:18 2005
From: hycsw (Helen Chen)
Message-Id: <200510132121.OAA29376 at ca.sandia.gov>
To: hycsw at ca, rolandd at cisco.com
Subject: Re: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer
Cc: hycsw at sandia.gov, openib-general at openib.org
Status: R

Roland,

>From rolandd at cisco.com Thu Oct 13 13:53:05 2005
>
>    Helen> Roland, Thank you for your response.  That fixed my initial
>    Helen> buffer allocation failure.  After we tuned the Lustre and
>    Helen> reran same IOZONE tests again, we got the following
>    Helen> problem.  Was there an actual network interrupt? If so, the
>    Helen> problem is not obvious now; the two nodes are pinging over
>    Helen> IPoIB.  Please advise.
>
>That's very odd.  This message:
>
>    Helen> NETDEV WATCHDOG: ib0: transmit timed out
>    Helen> ib0: transmit timeout: latency 1846
>
>says that we are not seeing send completions from the HCA.  However,
>are you saying that even when you are seeing this message, ping over
>IPoIB is working?
>

No, I didn't know there was any problem until IOZONE reported a read
error from the Lustre client.

BTW, the backend storage is iSCSI over 10 GbE using jumbo frames.  This
problem only appeared after our tuning effort: we increased the iSCSI
payload to 1 MB, and increased the TCP window from 256 KB to 512 KB.  I
will shrink my TCP window and see if the problem goes away.

Thanks,
Helen



