[openib-general] IBM eHCA testing..

Heiko J Schick SCHICKHJ at de.ibm.com
Mon Oct 10 03:53:23 PDT 2005


Hello Troy,

below is our preliminary analysis of the problem you reported on Oct 10
on the OpenIB mailing list [1]:

[1]:  http://openib.org/pipermail/openib-general/2005-October/012353.html

[  381.453731] eHCA Infiniband Device Driver (Rel.: EHCA2_0025)
[  381.458602] xics_enable_irq: irq=36868: ibm_int_on returned fffffffd
[  393.378143] eHCA Infiniband Device Driver (Rel.: EHCA2_0025)
[  452.658083] PU0002 000b0075:ehca_define_sqp HCAD_ERROR  Port 1 is not active.
[  452.658106] PU0002 000b0383:ehca_create_qp HCAD_ERROR ehca_define_sqp() failed rc=ffffffffffffffff
[  452.821917] PU0002 000b03aa:ehca_create_qp <<< failed ret=ffffffea
[  452.821939] ib_mad: Couldn't create ib_mad QP1
[  453.313412] ib_mad: Couldn't open ehca0 port 1
[  475.132318] PU0002 00060100:ehca_parse_ec  EHCA port 1 is available.
[  518.249381] PU0007 000b00b9:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_IN r3=168 r4=1000000003000004 r5=2000000000000008 r6=8a40000000000000 r7=1e4e49000 r8=0 r9=0 r10=0
[  518.249411] PU0007 000b00c0:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_OUT r3=ffffffffffffffd3 r4=0 r5=0 r6=0 r7=4 r8=0 r9=800000000005aa18 r10=0
[  518.249438] PU0007 000b0560:internal_modify_qp HCAD_ERROR hipz_h_modify_qp() failed rc=ffffffffffffffd3 ehca_qp=c00000000f2cd080 qp_num=8
[  518.249460] ib0: failed to modify QP to init, ret = -22
[  518.418976] ib0: ipoib_qp_create returned -22
[  528.813491] Oops: Kernel access of bad area, sig: 11 [#1]
[  528.813505] SMP NR_CPUS=8 NUMA PSERIES LPAR
[  528.813517] Modules linked in: ib_ipoib ib_sa ib_mad hcad_mod ib_core ebus
[  528.813540] NIP: D000000000049C6C XER: 20000020 LR: D0000000000760A0 CTR: D000000000049C60
[  528.813554] REGS: c00000000f1eb1d0 TRAP: 0300   Not tainted (2.6.13.3-power5)
[  528.813568] MSR: 8000000000009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 22028422
[  528.813580] DAR: 0000000000000000 DSISR: 0000000040000000
[  528.813592] TASK: c00000000209a9a0[2021] 'ifconfig' THREAD: c00000000f1e8000 CPU: 0
[  528.813605] GPR00: D0000000000760A0 C00000000F1EB450 D00000000005FFF0 0000000000000000
[  528.813625] GPR04: C00000000F1EB548 0000000000000071 C00000000F1EB540 0000000000000001
[  528.813645] GPR08: 000000000000000B 0000000000000001 0000000000000004 D000000000049C60
[  528.813664] GPR12: D0000000000774C0 C0000000004B4000 00000000100C0000 00000000100A0000
[  528.813685] GPR16: 0000000000000000 0000000000000000 0000000010020000 0000000010020000
[  528.813704] GPR20: 000000001001E71C C0000001E466C000 FFFFFFFFFFFF8914 C0000001E46D4810
[  528.813725] GPR24: C0000001E46D4800 C00000000F43B900 C00000000F1EBD10 0000000000000002
[  528.813745] GPR28: 0000000000000000 C0000001E466C380 D000000000084640 C00000000F1EB548
[  528.813768] NIP [d000000000049c6c] .ib_modify_qp+0xc/0x40 [ib_core]
[  528.813797] LR [d0000000000760a0] .ipoib_qp_create+0xe0/0x1c0 [ib_ipoib]
[  528.813822] Call Trace:
[  528.813829] [c00000000f1eb450] [00000000434849c5] 0x434849c5 (unreliable)
[  528.813846] [c00000000f1eb4d0] [d0000000000760a0] .ipoib_qp_create+0xe0/0x1c0 [ib_ipoib]
[  528.813873] [c00000000f1eb5f0] [d00000000007261c] .ipoib_ib_dev_open+0x2c/0x120 [ib_ipoib]
[  528.813899] [c00000000f1eb680] [d00000000006f38c] .ipoib_open+0x7c/0x190 [ib_ipoib]
[  528.813923] [c00000000f1eb720] [c00000000032a650] .dev_open+0xc0/0x120
[  528.813942] [c00000000f1eb7c0] [c000000000328c70] .dev_change_flags+0x180/0x1c0
[  528.813961] [c00000000f1eb860] [c00000000037a02c] .devinet_ioctl+0x81c/0x850
[  528.813980] [c00000000f1eb970] [c00000000037a65c] .inet_ioctl+0x27c/0x2d0
[  528.813998] [c00000000f1eba00] [c00000000031bc4c] .sock_ioctl+0x8c/0x440
[  528.814016] [c00000000f1ebaa0] [c0000000000c22f0] .do_ioctl+0x60/0x120
[  528.814033] [c00000000f1ebb40] [c0000000000c244c] .vfs_ioctl+0x9c/0x4d0
[  528.814050] [c00000000f1ebbf0] [c0000000000c28cc] .sys_ioctl+0x4c/0xa0
[  528.814066] [c00000000f1ebca0] [c00000000001bb24] .dev_ifsioc+0x84/0x390
[  528.814084] [c00000000f1ebd70] [c0000000000e4d88] .compat_sys_ioctl+0x158/0x500
[  528.814103] [c00000000f1ebe30] [c00000000000d300] syscall_exit+0x0/0x18
[  528.814119] Instruction dump:
[  528.814126] 7c601b78 38210080 7c030378 e8010010 7c0803a6 4e800020 60000000 60000000
[  528.814150] 60000000 7c0802a6 f8010010 f821ff81 <e9230000> e9490170 e80a0000 f8410028
[  528.814174]  <7>RTAS: event: 3, Type: Platform Error, Severity: 2
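
A note for reading the log above: the hexadecimal return codes are
negative numbers in two's complement. For example, ret=ffffffea is -22
(-EINVAL on Linux), the same value IPoIB later prints as "ret = -22",
and rc=ffffffffffffffd3 is -45. A minimal sketch in C of that
conversion; the helper names are hypothetical and not part of any
driver:

```c
#include <stdint.h>

/* Hypothetical helper: interpret a 32-bit hex value from the log
 * as a signed return code, without relying on
 * implementation-defined unsigned-to-signed conversion. */
int32_t rc_from_hex32(uint32_t raw)
{
    if (raw > (uint32_t)INT32_MAX)
        return -(int32_t)(UINT32_MAX - raw) - 1;
    return (int32_t)raw;
}

/* Same idea for the 64-bit values printed by the hcall wrappers. */
int64_t rc_from_hex64(uint64_t raw)
{
    if (raw > (uint64_t)INT64_MAX)
        return -(int64_t)(UINT64_MAX - raw) - 1;
    return (int64_t)raw;
}
```

Calling rc_from_hex32(0xffffffeau) gives -22, and
rc_from_hex64(0xffffffffffffffd3ull) gives -45.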

It looks as if IPoIB uses resources that have already been freed. When
the ib_mad stack tries to create AQP1, our eHCA InfiniBand device driver
waits at most 20 seconds for a "port active" event for port 1, and that
event does not arrive in time. It seems that with OpenSM we receive the
"port active" event only after about 45 seconds.
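
To illustrate the timing mismatch, here is a small simulation in C. It
is not driver code: the function name and the one-second polling step
are assumptions, and only the 20 and 45 second figures come from the
observation above.

```c
#include <errno.h>

/* Hypothetical model of a bounded wait: poll once per simulated
 * second until the event has arrived or the timeout has expired.
 * event_at_s: second at which the "port active" event arrives.
 * timeout_s:  maximum number of seconds the caller waits. */
int wait_for_port_active(unsigned event_at_s, unsigned timeout_s)
{
    for (unsigned now = 0; now <= timeout_s; now++) {
        if (now >= event_at_s)
            return 0;           /* event seen in time */
    }
    return -ETIMEDOUT;          /* gave up before the event arrived */
}
```

With the observed values, wait_for_port_active(45, 20) returns
-ETIMEDOUT, while a wait of 45 seconds or more would succeed.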

For the MAD and IPoIB stack this means the following:

MAD:
====
1. No AQP1 will exist for port 1, because the "port active" event is
   missing.

2. All resources are freed by the error handling routines in ib_mad:
   create_mad_qp reports an error to ib_mad_port_open, which destroys
   all allocated resources (workqueue, AQPs, MR, PD, CQ, etc.).

3. Multicast join requests to the SM won't work!
   IPoIB doesn't work on ifconfig ib0 xxx.xxx.xxx.xxx!

IPoIB:
======
For IPoIB, a "port active" event that arrives too late is a problem.

1. The function ipoib_add_one calls ipoib_add_port, which creates all IB
   resources (QPs, CQ, etc.; see ipoib_dev_init -> ipoib_ib_dev_init, ...).

2. The function ipoib_ib_dev_init (executed at startup / module load)
   calls ipoib_ib_dev_open, which tries to move the IPoIB QP through
   INIT -> RTR -> RTS via ipoib_qp_create.

3. The first ib_modify_qp call (Reset2Init) in ipoib_qp_create fails,
   because the port is not active at that moment.
   See:
   [  518.249438] PU0007 000b0560:internal_modify_qp HCAD_ERROR hipz_h_modify_qp() failed rc=ffffffffffffffd3 ...
   [  518.249460] ib0: failed to modify QP to init, ret = -22
   [  518.418976] ib0: ipoib_qp_create returned -22

4. If that happens, the function ipoib_qp_create in ipoib_verbs.c
   destroys the IPoIB QP.

5. A user enters ifconfig ib0 xxx.xxx.xxx.xxx, which executes ipoib_open.
   This function also executes ipoib_ib_dev_open, which again tries to
   move the IPoIB QP through INIT -> RTR -> RTS via ipoib_qp_create.

6. ib_modify_qp then causes the kernel Oops shown in the log (because
   priv->qp is NULL; see function ipoib_qp_create).
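
Steps 3 to 6 can be modelled in a few lines of user-space C. This is
only a sketch under assumptions: all types and function names below are
hypothetical stand-ins, not the real ib_core or ib_ipoib code, and the
NULL check is one possible defensive measure, not necessarily the right
fix.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the QP and the IPoIB private data. */
enum model_state { MODEL_RESET, MODEL_INIT };

struct model_qp {
    enum model_state state;
};

struct model_priv {
    struct model_qp *qp;    /* NULL once the QP has been destroyed */
    int port_active;
};

/* Models the Reset2Init transition. The first access dereferences
 * qp, so a NULL pointer would fault right at the start of the call,
 * as in the trace. */
int model_modify_qp(struct model_qp *qp, int port_active)
{
    if (qp->state != MODEL_RESET)
        return -EINVAL;
    if (!port_active)
        return -EINVAL;     /* step 3: port not active yet */
    qp->state = MODEL_INIT;
    return 0;
}

/* Models the create path as described above: on failure the QP is
 * dropped and the pointer is left NULL (step 4). */
int model_qp_create(struct model_priv *priv)
{
    int ret;

    if (priv->qp == NULL)   /* guard missing in the failing path */
        return -ENODEV;

    ret = model_modify_qp(priv->qp, priv->port_active);
    if (ret) {
        priv->qp = NULL;    /* stands in for destroying the QP */
        return ret;
    }
    return 0;
}
```

In this model the first call with an inactive port returns -EINVAL and
clears priv->qp. A second call (the ifconfig case) then returns -ENODEV
instead of dereferencing NULL. Whether to guard like this or to
recreate the QP on open is a design question the sketch does not
settle.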

Kind regards
Heiko Joerg Schick

IBM Deutschland Entwicklung GmbH
I/Ox Microcode Development
Linux Infiniband Device Drivers

Schoenaicher Str. 220
71032 Boeblingen
E-Mail: schickhj at de.ibm.com
External: 49-7031-16-0 x4219,   t/l: 120-4219