[Openib-windows] Win IBhost stop receive broadcast packets
    Anatoly Lisenko 
    anatolyl at voltaire.com
       
    Tue Jan  2 03:56:50 PST 2007
    
    
  
Hi,
 
I am seeing a problem with the Windows IB host stack: rebooting the
InfiniBand switch can cause ping loss, even after the switch is back up.
I started investigating this anomaly and found the following:
1. The IB stack does not receive broadcast ARP packets.
2. All other packets (unicast and multicast) are received.
3. The HCA port's rx packet counter increases each time a broadcast
packet arrives.
4. It seems that the firmware drops these packets (I don't see any
completions).
 
I examined the logs and saw that we somehow end up in a state where:
1. The HCA port is joined to the broadcast group.
2. The IPoIB QP is detached from the broadcast group.
 
This is the stack backtrace of the mlnx_detach_mcast function:
f7125d10 f68ae6ab mthca!mlnx_detach_mcast+0x13
[n:\win-ibhost\trunk\hw\mthca\kernel\hca_mcast.c @ 142]
f7125d38 f68c9c30 ibbus!__cleanup_mcast+0x24b
[n:\win-ibhost\trunk\core\al\al_mcast.c @ 304]
f7125d70 f6820212 ibbus!async_destroy_cb+0x420
[n:\win-ibhost\trunk\core\al\al_common.c @ 665]
f7125d8c f6825dc2 ibbus!__cl_async_proc_worker+0x92
[n:\win-ibhost\trunk\core\complib\cl_async_proc.c @ 153]
f7125da0 f6827c3a ibbus!__cl_thread_pool_routine+0x52
[n:\win-ibhost\trunk\core\complib\cl_threadpool.c @ 67]
f7125dac 80948bb2 ibbus!__thread_callback+0x2a
[n:\win-ibhost\trunk\core\complib\kernel\cl_thread.c @ 49]
f7125ddc 8088d4d2 nt!PspSystemThreadStartup+0x2e
00000000 00000000 nt!KiThreadStartup+0x16
 
 
Mthca wpp log:
00000662          kernel   1236     600       2          312
01\02\2007-13:28:02:781            mlnx_query_ca()===>
00000663          kernel   1236     600       2          321
01\02\2007-13:28:02:781            mlnx_query_ca() :port 0 gid0:
00000664          kernel   1236     600       2          322
01\02\2007-13:28:02:781            mlnx_query_ca() :
0xfe80000000-0x08f14398095
00000665          kernel   1236     600       2          323
01\02\2007-13:28:02:781            mlnx_query_ca() :port 1 gid0:
00000666          kernel   1236     600       2          324
01\02\2007-13:28:02:781            mlnx_query_ca() :
0xfe80000000-0x08f14398096
00000667          kernel   1236     600       2          325
01\02\2007-13:28:02:781            mlnx_query_ca() :Space required 1898
used 1898
00000668          kernel   1236     600       2          326
01\02\2007-13:28:02:781            mlnx_conv_hca_cap() :Port 1 port_guid
0x8f10403980095
00000669          kernel   1236     600       2          327
01\02\2007-13:28:02:781            mlnx_conv_hca_cap() :Port 2 port_guid
0x8f10403980096
00000670          kernel   1236     600       2          328
01\02\2007-13:28:02:781            mlnx_query_ca()<===
00000671          kernel   4          276       2          339
01\02\2007-13:28:02:859            mlnx_attach_mcast()===>
00000672          kernel   4          276       2          340
01\02\2007-13:28:02:859            mlnx_attach_mcast() :mcasth 89930EA8,
qp_p 88A56E78, mlid c0, mgid ffff1b4012ff`ffffffff00000000
00000678          kernel   4          276       2          346
01\02\2007-13:28:02:859            completes with ERROR status
IB_SUCCESS
00000681          kernel   4          276       2          349
01\02\2007-13:28:02:859            mlnx_enable_cq_notify()===>
00000682          kernel   4          276       2          350
01\02\2007-13:28:02:859            completes with ERROR status
IB_SUCCESS
00000683          kernel   4          276       2          357
01\02\2007-13:28:02:859            mlnx_attach_mcast()===>
00000684          kernel   4          276       2          358
01\02\2007-13:28:02:859            mlnx_attach_mcast() :mcasth 898F1D68,
qp_p 88A56E78, mlid 1c0, mgid ffff1b4012ff`100000000000000
00000685          kernel   4          276       2          359
01\02\2007-13:28:02:859            completes with ERROR status
IB_SUCCESS
00000686          kernel   4          276       2          362
01\02\2007-13:28:02:859            mlnx_attach_mcast()===>
00000687          kernel   4          276       2          363
01\02\2007-13:28:02:859            mlnx_attach_mcast() :mcasth 884352D8,
qp_p 88A56E78, mlid 2c0, mgid ffff051412ff`30000c280010000
00000688          kernel   4          276       2          364
01\02\2007-13:28:02:859            completes with ERROR status
IB_SUCCESS
00000689          kernel   0          0          3          129
01\02\2007-13:28:02:750            mlnx_enable_cq_notify()===>
00000690          kernel   0          0          3          130
01\02\2007-13:28:02:750            completes with ERROR status
IB_SUCCESS
...
00000776          kernel   0          0          3          373
01\02\2007-13:28:03:109            mlnx_enable_cq_notify()===>
00000777          kernel   0          0          3          374
01\02\2007-13:28:03:109            completes with ERROR status
IB_SUCCESS
00000778          kernel   4          272       3          375
01\02\2007-13:28:03:296            mlnx_detach_mcast() :mcasth 89918F40,
qp_p 88A56E78, mlid 2c0, mgid ffff051412ff`30000c280010000
00000779          kernel   4          272       3          376
01\02\2007-13:28:03:296            completes with ERROR status
IB_SUCCESS
00000780          kernel   4          272       3          377
01\02\2007-13:28:03:296            mlnx_detach_mcast() :mcasth 88DB3F00,
qp_p 88A56E78, mlid 1c0, mgid ffff1b4012ff`100000000000000
00000781          kernel   4          272       3          378
01\02\2007-13:28:03:296            completes with ERROR status
IB_SUCCESS
00000782          kernel   4          272       3          379
01\02\2007-13:28:03:296            mlnx_detach_mcast() :mcasth 88A48DA0,
qp_p 88A56E78, mlid 6c0, mgid ffff051412ff`ffffa8ff00ff0000
00000783          kernel   4          272       3          380
01\02\2007-13:28:03:296            completes with ERROR status
IB_SUCCESS
00000784          kernel   4          272       3          381
01\02\2007-13:28:03:296            mlnx_detach_mcast() :mcasth 88A94DD0,
qp_p 88A56E78, mlid c0, mgid ffff1b4012ff`ffffffff00000000
00000785          kernel   4          272       3          382
01\02\2007-13:28:03:296            completes with ERROR status
IB_SUCCESS
00000786          kernel   4          280       1          383
01\02\2007-13:28:22:781            mlnx_query_ca()===>
00000787          kernel   4          280       1          384
01\02\2007-13:28:22:781            mlnx_query_ca() :port 0 gid0:
00000788          kernel   4          280       1          385
01\02\2007-13:28:22:781            mlnx_query_ca() :
0xfe80000000-0x08f14398095
00000789          kernel   4          280       1          386
01\02\2007-13:28:22:781            mlnx_query_ca() :port 1 gid0:
00000790          kernel   4          280       1          387
01\02\2007-13:28:22:781            mlnx_query_ca() :
0xfe80000000-0x08f14398096
00000791          kernel   4          280       1          388
01\02\2007-13:28:22:781            mlnx_query_ca() :Space required 1898
used 1898
00000792          kernel   4          280       1          389
01\02\2007-13:28:22:781            mlnx_conv_hca_cap() :Port 1 port_guid
0x8f10403980095
00000793          kernel   4          280       1          390
01\02\2007-13:28:22:781            mlnx_conv_hca_cap() :Port 2 port_guid
0x8f10403980096
00000794          kernel   4          280       1          391
01\02\2007-13:28:22:781            mlnx_query_ca()<===
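
One way to check the attach/detach balance in the trace above is to replay
the mlnx_attach_mcast and mlnx_detach_mcast records per (qp_p, mgid) pair.
The Python sketch below is not part of the original report. It assumes that
the sixth column of each record header is a global sequence number giving
the order of events across CPUs, and it only sees the records quoted in this
mail (anything elided by "..." is missing). The names TRACE, EVENT and
replay are made up for illustration.

```python
import re

# Attach/detach records copied from the mthca trace above. Line wrapping is
# kept as in the mail; records elided there ("...") are absent here too.
TRACE = r"""
00000672          kernel   4          276       2          340
01\02\2007-13:28:02:859            mlnx_attach_mcast() :mcasth 89930EA8,
qp_p 88A56E78, mlid c0, mgid ffff1b4012ff`ffffffff00000000
00000684          kernel   4          276       2          358
01\02\2007-13:28:02:859            mlnx_attach_mcast() :mcasth 898F1D68,
qp_p 88A56E78, mlid 1c0, mgid ffff1b4012ff`100000000000000
00000687          kernel   4          276       2          363
01\02\2007-13:28:02:859            mlnx_attach_mcast() :mcasth 884352D8,
qp_p 88A56E78, mlid 2c0, mgid ffff051412ff`30000c280010000
00000778          kernel   4          272       3          375
01\02\2007-13:28:03:296            mlnx_detach_mcast() :mcasth 89918F40,
qp_p 88A56E78, mlid 2c0, mgid ffff051412ff`30000c280010000
00000780          kernel   4          272       3          377
01\02\2007-13:28:03:296            mlnx_detach_mcast() :mcasth 88DB3F00,
qp_p 88A56E78, mlid 1c0, mgid ffff1b4012ff`100000000000000
00000782          kernel   4          272       3          379
01\02\2007-13:28:03:296            mlnx_detach_mcast() :mcasth 88A48DA0,
qp_p 88A56E78, mlid 6c0, mgid ffff051412ff`ffffa8ff00ff0000
00000784          kernel   4          272       3          381
01\02\2007-13:28:03:296            mlnx_detach_mcast() :mcasth 88A94DD0,
qp_p 88A56E78, mlid c0, mgid ffff1b4012ff`ffffffff00000000
"""

# Matches one record after all whitespace runs are collapsed to one space.
# Assumption: the sixth header column ("seq") is a global sequence number.
EVENT = re.compile(
    r"\d{8} kernel \d+ \d+ \d+ (?P<seq>\d+) \S+ "
    r"mlnx_(?P<op>attach|detach)_mcast\(\) :mcasth (?P<mcasth>\w+), "
    r"qp_p (?P<qp>\w+), mlid (?P<mlid>\w+), mgid (?P<mgid>\S+)"
)


def replay(trace):
    """Return {(qp, mgid): (last_op, mcasth)} after replaying in seq order."""
    flat = " ".join(trace.split())  # undo the mail's line wrapping
    events = sorted(
        (int(m["seq"]), m["op"], m["mcasth"], m["qp"], m["mgid"])
        for m in EVENT.finditer(flat)
    )
    state = {}
    for _seq, op, mcasth, qp, mgid in events:
        state[(qp, mgid)] = (op, mcasth)  # later events overwrite earlier ones
    return state


if __name__ == "__main__":
    for (qp, mgid), (op, mcasth) in replay(TRACE).items():
        print(f"qp {qp} mgid {mgid}: last op {op} (mcasth {mcasth})")
```

On this excerpt the last operation recorded for the broadcast mgid is a
detach, and it carries a different mcasth than the attach logged for the
same mgid. Whether that handle belongs to an earlier join cannot be told
from the excerpt alone.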
 
 
Ipoib wpp log:
 
00000130          kernel   0          0          0          130
01\02\2007-13:28:01:468            [IPoIB] :ipoib_check_for_hang():]
00000131          kernel   4          280       0          133
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_reset_all():[
00000132          kernel   4          280       0          134
01\02\2007-13:28:02:781            [IPoIB] :__endpt_destroying():[
00000133          kernel   4          280       0          135
01\02\2007-13:28:02:781            [IPoIB] :__endpt_destroying():]
...
00000140          kernel   4          280       0          150
01\02\2007-13:28:02:781            [IPoIB] :__endpt_destroying():[
00000141          kernel   4          280       0          151
01\02\2007-13:28:02:781            [IPoIB] :__endpt_destroying():]
00000142          kernel   4          280       0          152
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_reset_all():]
00000143          kernel   4          280       0          160
01\02\2007-13:28:02:781            [IPoIB] :ipoib_port_down():]
00000144          kernel   4          280       1          131
01\02\2007-13:28:02:781            [IPoIB] :__ipoib_pnp_cb() :Link DOWN!
00000145          kernel   4          280       1          132
01\02\2007-13:28:02:781            [IPoIB] :ipoib_port_down():[
00000146          kernel   4          312       1          153
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():[
00000147          kernel   4          312       1          154
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup() :Leaving
MCast group
00000148          kernel   4          312       1          164
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():]
00000149          kernel   4          312       1          165
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():[
00000150          kernel   4          312       1          166
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():]
00000151          kernel   4          280       1          170
01\02\2007-13:28:02:781            [IPoIB] :ipoib_port_up():[
00000152          kernel   4          280       1          171
01\02\2007-13:28:02:781            [IPoIB] :ipoib_port_up():]
00000153          kernel   4          308       2          145
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():[
00000154          kernel   4          308       2          147
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup() :Leaving
MCast group
00000155          kernel   4          308       2          161
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():]
00000156          kernel   4          308       2          162
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():[
00000157          kernel   4          308       2          163
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():]
00000158          kernel   4          276       2          191
01\02\2007-13:28:02:859            [IPoIB] :__bcast_cb():[
00000159          kernel   4          276       2          192
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_add_bcast():[
00000160          kernel   4          276       2          193
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():[
00000161          kernel   4          276       2          194
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():]
00000162          kernel   4          276       2          195
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():[
00000163          kernel   4          276       2          196
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast()
:Create av for MAC: 00-00-00-00-00-00
00000164          kernel   4          276       2          197
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():[
00000165          kernel   4          276       2          198
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():]
00000166          kernel   4          276       2          199
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():]
00000167          kernel   4          276       2          200
01\02\2007-13:28:02:859            [IPoIB]
:__endpt_mgr_insert_locked():[
00000168          kernel   4          276       2          201
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert_locked()
:insert  :  MAC: FF-FF-FF-FF-FF-FF
00000169          kernel   4          276       2          202
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():[
00000170          kernel   4          276       2          203
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():]
00000171          kernel   4          276       2          204
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_add_bcast():]
00000172          kernel   4          276       2          205
01\02\2007-13:28:02:859            [IPoIB] :__ib_mgr_activate():[
00000173          kernel   4          276       2          206
01\02\2007-13:28:02:859            [IPoIB] :__ib_mgr_activate():]
00000174          kernel   4          276       2          207
01\02\2007-13:28:02:859            [IPoIB] :ipoib_set_active():[
00000175          kernel   4          276       2          208
01\02\2007-13:28:02:859            [IPoIB] :ipoib_port_join_mcast():[
00000176          kernel   4          276       2          209
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref():[
00000177          kernel   4          276       2          210
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref() :Look for
:     MAC: 01-00-5E-00-00-01
00000178          kernel   4          276       2          211
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref() :Failed
endpoint lookup.[IpoIB] :__endpt_mgr_ref():]
00000179          kernel   4          276       2          212
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():[
00000180          kernel   4          276       2          213
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():]
00000181          kernel   4          276       2          214
01\02\2007-13:28:02:859            [IPoIB]
:__endpt_mgr_insert_locked():[
00000182          kernel   4          276       2          215
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert_locked()
:insert  :  MAC: 01-00-5E-00-00-01
00000183          kernel   4          276       2          216
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():[
00000184          kernel   4          276       2          217
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():]
00000185          kernel   4          276       2          218
01\02\2007-13:28:02:859            [IPoIB] :ipoib_port_join_mcast():]
00000186          kernel   4          276       2          219
01\02\2007-13:28:02:859            [IPoIB] :ipoib_port_join_mcast():[
00000187          kernel   4          276       2          220
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref():[
00000188          kernel   4          276       2          221
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref() :Look for
:     MAC: 01-80-C2-00-00-03
00000189          kernel   4          276       2          222
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref() :Failed
endpoint lookup.[IpoIB] :__endpt_mgr_ref():]
00000190          kernel   4          276       2          223
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():[
00000191          kernel   4          276       2          224
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():]
00000192          kernel   4          276       2          225
01\02\2007-13:28:02:859            [IPoIB]
:__endpt_mgr_insert_locked():[
00000193          kernel   4          276       2          226
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert_locked()
:insert  :  MAC: 01-80-C2-00-00-03
00000194          kernel   4          276       2          227
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():[
00000195          kernel   4          276       2          228
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():]
00000196          kernel   4          276       2          229
01\02\2007-13:28:02:859            [IPoIB] :ipoib_port_join_mcast():]
00000197          kernel   4          276       2          230
01\02\2007-13:28:02:859            [IPoIB] :ipoib_resume_oids():[
00000198          kernel   4          276       2          231
01\02\2007-13:28:02:859            [IPoIB] :ipoib_resume_oids():]
00000199          kernel   4          276       2          232
01\02\2007-13:28:02:859            [IPoIB] :ipoib_set_active() :Link UP!
00000200          kernel   4          276       2          233
01\02\2007-13:28:02:859            [IPoIB] :ipoib_set_active():]
00000201          kernel   4          276       2          234
01\02\2007-13:28:02:859            [IPoIB] :__bcast_cb():]
00000202          kernel   4          276       2          235
01\02\2007-13:28:02:859            [IPoIB] :__mcast_cb():[
00000203          kernel   4          276       2          236
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():[
00000204          kernel   4          276       2          237
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast()
:Create av for MAC: 01-00-5E-00-00-01
00000205          kernel   4          276       2          238
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():[
00000206          kernel   4          276       2          239
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():]
00000207          kernel   4          276       2          240
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():]
00000208          kernel   4          276       2          241
01\02\2007-13:28:02:859            [IPoIB] :__mcast_cb():]
00000209          kernel   4          276       2          242
01\02\2007-13:28:02:859            [IPoIB] :__mcast_cb():[
00000210          kernel   4          276       2          243
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():[
00000211          kernel   4          276       2          244
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast()
:Create av for MAC: 01-80-C2-00-00-03
00000212          kernel   4          276       2          245
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():[
00000213          kernel   4          276       2          246
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():]
00000214          kernel   4          276       2          247
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():]
00000215          kernel   4          276       2          248
01\02\2007-13:28:02:859            [IPoIB] :__mcast_cb():]
00000216          kernel   4          320       3          136
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():[
00000217          kernel   4          320       3          139
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():]
00000218          kernel   4          320       3          141
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():[
00000219          kernel   4          320       3          143
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():]
00000220          kernel   4          320       3          144
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():[
00000221          kernel   4          320       3          146
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup() :Leaving
MCast group
00000222          kernel   4          320       3          155
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():]
00000223          kernel   4          320       3          156
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():[
00000224          kernel   4          320       3          157
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():]
00000225          kernel   4          320       3          158
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():[
00000226          kernel   4          320       3          159
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup() :Leaving
MCast group
00000227          kernel   4          320       3          167
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():]
00000228          kernel   4          320       3          168
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():[
00000229          kernel   4          320       3          169
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():]
00000230          kernel   0          0          3          172
01\02\2007-13:28:02:781            [IPoIB] :__port_info_cb():[
00000231          kernel   0          0          3          173
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_add_local():[
00000232          kernel   0          0          3          174
01\02\2007-13:28:02:781            [IPoIB] :ipoib_endpt_create():[
00000233          kernel   0          0          3          175
01\02\2007-13:28:02:781            [IPoIB] :ipoib_endpt_create():]
00000234          kernel   0          0          3          176
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_insert():[
00000235          kernel   0          0          3          177
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_insert():]
00000236          kernel   0          0          3          178
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_add_local():]
00000237          kernel   0          0          3          179
01\02\2007-13:28:02:781            [IPoIB] :__port_info_cb() :Received
port info: link width = 2.
00000238          kernel   0          0          3          180
01\02\2007-13:28:02:781            [IPoIB] :ipoib_set_rate():[
00000239          kernel   0          0          3          181
01\02\2007-13:28:02:781            [IPoIB] :ipoib_set_rate() :Link speed
is 2.5Gs
00000240          kernel   0          0          3          182
01\02\2007-13:28:02:781            [IPoIB] :ipoib_set_rate() :Link width
is 4X
00000241          kernel   0          0          3          183
01\02\2007-13:28:02:781            [IPoIB] :ipoib_set_rate():]
00000242          kernel   0          0          3          184
01\02\2007-13:28:02:781            [IPoIB] :__port_get_bcast():[
00000243          kernel   0          0          3          185
01\02\2007-13:28:02:781            [IPoIB] :__port_get_bcast():]
00000244          kernel   0          0          3          186
01\02\2007-13:28:02:781            [IPoIB] :__port_info_cb():]
00000245          kernel   2624     2732     3          187
01\02\2007-13:28:02:781            [IPoIB] :__bcast_get_cb():[
00000246          kernel   2624     2732     3          188
01\02\2007-13:28:02:781            [IPoIB] :__port_join_bcast():[
00000247          kernel   2624     2732     3          189
01\02\2007-13:28:02:781            [IPoIB] :__port_join_bcast():]
00000248          kernel   2624     2732     3          190
01\02\2007-13:28:02:781            [IPoIB] :__bcast_get_cb():]
00000249          kernel   0          0          3          249
01\02\2007-13:28:03:109            [IPoIB] :__endpt_mgr_get_by_gid():[
00000250          kernel   0          0          3          250
01\02\2007-13:28:03:109            [IPoIB] :__endpt_mgr_get_by_gid():]
...
00000339          kernel   0          0          3          339
01\02\2007-13:28:03:109            [IPoIB] :__endpt_mgr_get_by_gid():[
00000340          kernel   0          0          3          340
01\02\2007-13:28:03:109            [IPoIB] :__endpt_mgr_get_by_gid():]
00000341          kernel   0          0          0          341
01\02\2007-13:28:03:468            [IPoIB] :ipoib_check_for_hang():[
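
The records in both logs appear to be grouped by CPU rather than by time:
for example, record 00000144 on CPU 1 carries a lower value in the sixth
column than record 00000131 on CPU 0. The Python sketch below re-sorts
records so the events can be read in order. It is only an illustration,
assuming the fifth header column is the CPU number and the sixth is a global
sequence number; HEADER, EXCERPT, parse_records and in_sequence_order are
hypothetical names.

```python
import re

# Header line of a record: index, "kernel", two ids, CPU (assumed), seq (assumed).
HEADER = re.compile(r"^(\d{8})\s+kernel\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")

# Four records copied from the ipoib log above, in the order they appear there.
EXCERPT = r"""
00000131          kernel   4          280       0          133
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_reset_all():[
00000143          kernel   4          280       0          160
01\02\2007-13:28:02:781            [IPoIB] :ipoib_port_down():]
00000144          kernel   4          280       1          131
01\02\2007-13:28:02:781            [IPoIB] :__ipoib_pnp_cb() :Link DOWN!
00000145          kernel   4          280       1          132
01\02\2007-13:28:02:781            [IPoIB] :ipoib_port_down():[
"""


def parse_records(text):
    """Group each header line with the (possibly wrapped) message after it."""
    records = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped == "...":
            continue  # skip blank lines and elision markers
        m = HEADER.match(line)
        if m:
            records.append({
                "index": int(m.group(1)),
                "cpu": int(m.group(4)),
                "seq": int(m.group(5)),
                "text": "",
            })
        elif records:
            rec = records[-1]
            rec["text"] = (rec["text"] + " " + " ".join(stripped.split())).strip()
    return records


def in_sequence_order(text):
    """Sort records by the assumed global sequence column."""
    return sorted(parse_records(text), key=lambda r: r["seq"])


if __name__ == "__main__":
    for rec in in_sequence_order(EXCERPT):
        print(rec["seq"], "cpu", rec["cpu"], rec["text"])
```

Sorted this way, the excerpt reads as Link DOWN, then ipoib_port_down
entry, then __endpt_mgr_reset_all entry, and finally ipoib_port_down exit,
which is consistent with the sixth column being a sequence number.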
 
 
Thanks,
Anatoly
 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mthca_wpp_flags_0x400.log
Type: application/octet-stream
Size: 29023 bytes
Desc: mthca_wpp_flags_0x400.log
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20070102/9e5263c3/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ipoib_wpp_flag_0x122.log
Type: application/octet-stream
Size: 23777 bytes
Desc: ipoib_wpp_flag_0x122.log
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20070102/9e5263c3/attachment-0001.obj>
    
    