[openib-general] IBM eHCA testing..
Heiko J Schick
SCHICKHJ at de.ibm.com
Mon Oct 10 03:53:23 PDT 2005
Hello Troy,
below you will find our preliminary analysis of the problem you
reported on Oct 10 via the OpenIB mailing list [1]:
[1]: http://openib.org/pipermail/openib-general/2005-October/012353.html
[ 381.453731] eHCA Infiniband Device Driver (Rel.: EHCA2_0025)
[ 381.458602] xics_enable_irq: irq=36868: ibm_int_on returned fffffffd
[ 393.378143] eHCA Infiniband Device Driver (Rel.: EHCA2_0025)
[ 452.658083] PU0002 000b0075:ehca_define_sqp HCAD_ERROR Port 1 is not
active.
[ 452.658106] PU0002 000b0383:ehca_create_qp HCAD_ERROR ehca_define_sqp()
failed rc=ffffffffffffffff
[ 452.821917] PU0002 000b03aa:ehca_create_qp <<< failed ret=ffffffea
[ 452.821939] ib_mad: Couldn't create ib_mad QP1
[ 453.313412] ib_mad: Couldn't open ehca0 port 1
[ 475.132318] PU0002 00060100:ehca_parse_ec EHCA port 1 is available.
[ 518.249381] PU0007 000b00b9:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_IN
r3=168 r4=1000000003000004 r5=2000000000000008 r6=8a40000000000000
r7=1e4e49000 r8=0 r9=0 r10=0
[ 518.249411] PU0007 000b00c0:plpar_hcall_7arg_7ret HCAD_ERROR
HCALL77_OUT r3=ffffffffffffffd3 r4=0 r5=0 r6=0 r7=4 r8=0
r9=800000000005aa18 r10=0
[ 518.249438] PU0007 000b0560:internal_modify_qp HCAD_ERROR
hipz_h_modify_qp() failed rc=ffffffffffffffd3 ehca_qp=c00000000f2cd080
qp_num=8
[ 518.249460] ib0: failed to modify QP to init, ret = -22
[ 518.418976] ib0: ipoib_qp_create returned -22
[ 528.813491] Oops: Kernel access of bad area, sig: 11 [#1]
[ 528.813505] SMP NR_CPUS=8 NUMA PSERIES LPAR
[ 528.813517] Modules linked in: ib_ipoib ib_sa ib_mad hcad_mod ib_core
ebus
[ 528.813540] NIP: D000000000049C6C XER: 20000020 LR: D0000000000760A0
CTR: D000000000049C60
[ 528.813554] REGS: c00000000f1eb1d0 TRAP: 0300 Not tainted
(2.6.13.3-power5)
[ 528.813568] MSR: 8000000000009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR:
22028422
[ 528.813580] DAR: 0000000000000000 DSISR: 0000000040000000
[ 528.813592] TASK: c00000000209a9a0[2021] 'ifconfig' THREAD:
c00000000f1e8000 CPU: 0
[ 528.813605] GPR00: D0000000000760A0 C00000000F1EB450 D00000000005FFF0
0000000000000000
[ 528.813625] GPR04: C00000000F1EB548 0000000000000071 C00000000F1EB540
0000000000000001
[ 528.813645] GPR08: 000000000000000B 0000000000000001 0000000000000004
D000000000049C60
[ 528.813664] GPR12: D0000000000774C0 C0000000004B4000 00000000100C0000
00000000100A0000
[ 528.813685] GPR16: 0000000000000000 0000000000000000 0000000010020000
0000000010020000
[ 528.813704] GPR20: 000000001001E71C C0000001E466C000 FFFFFFFFFFFF8914
C0000001E46D4810
[ 528.813725] GPR24: C0000001E46D4800 C00000000F43B900 C00000000F1EBD10
0000000000000002
[ 528.813745] GPR28: 0000000000000000 C0000001E466C380 D000000000084640
C00000000F1EB548
[ 528.813768] NIP [d000000000049c6c] .ib_modify_qp+0xc/0x40 [ib_core]
[ 528.813797] LR [d0000000000760a0] .ipoib_qp_create+0xe0/0x1c0
[ib_ipoib]
[ 528.813822] Call Trace:
[ 528.813829] [c00000000f1eb450] [00000000434849c5] 0x434849c5
(unreliable)
[ 528.813846] [c00000000f1eb4d0] [d0000000000760a0]
.ipoib_qp_create+0xe0/0x1c0 [ib_ipoib]
[ 528.813873] [c00000000f1eb5f0] [d00000000007261c]
.ipoib_ib_dev_open+0x2c/0x120 [ib_ipoib]
[ 528.813899] [c00000000f1eb680] [d00000000006f38c]
.ipoib_open+0x7c/0x190 [ib_ipoib]
[ 528.813923] [c00000000f1eb720] [c00000000032a650] .dev_open+0xc0/0x120
[ 528.813942] [c00000000f1eb7c0] [c000000000328c70]
.dev_change_flags+0x180/0x1c0
[ 528.813961] [c00000000f1eb860] [c00000000037a02c]
.devinet_ioctl+0x81c/0x850
[ 528.813980] [c00000000f1eb970] [c00000000037a65c]
.inet_ioctl+0x27c/0x2d0
[ 528.813998] [c00000000f1eba00] [c00000000031bc4c]
.sock_ioctl+0x8c/0x440
[ 528.814016] [c00000000f1ebaa0] [c0000000000c22f0] .do_ioctl+0x60/0x120
[ 528.814033] [c00000000f1ebb40] [c0000000000c244c] .vfs_ioctl+0x9c/0x4d0
[ 528.814050] [c00000000f1ebbf0] [c0000000000c28cc] .sys_ioctl+0x4c/0xa0
[ 528.814066] [c00000000f1ebca0] [c00000000001bb24]
.dev_ifsioc+0x84/0x390
[ 528.814084] [c00000000f1ebd70] [c0000000000e4d88]
.compat_sys_ioctl+0x158/0x500
[ 528.814103] [c00000000f1ebe30] [c00000000000d300] syscall_exit+0x0/0x18
[ 528.814119] Instruction dump:
[ 528.814126] 7c601b78 38210080 7c030378 e8010010 7c0803a6 4e800020
60000000 60000000
[ 528.814150] 60000000 7c0802a6 f8010010 f821ff81 <e9230000> e9490170
e80a0000 f8410028
[ 528.814174] <7>RTAS: event: 3, Type: Platform Error, Severity: 2
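As a reading aid for the log above: the hexadecimal return codes are two's complement negative numbers. The following is only a small user-space sketch for decoding them (the helper names decode_rc32 and decode_rc64 are made up for illustration and are not part of the driver):

```c
#include <stdint.h>

/* Hypothetical helper: interpret a 32-bit value from the log as a
 * signed return code without relying on implementation-defined casts. */
int32_t decode_rc32(uint32_t raw)
{
    if (raw <= (uint32_t)INT32_MAX)
        return (int32_t)raw;
    return -(int32_t)(UINT32_MAX - raw) - 1;
}

/* Hypothetical helper: same idea for the 64-bit hcall return values. */
int64_t decode_rc64(uint64_t raw)
{
    if (raw <= (uint64_t)INT64_MAX)
        return (int64_t)raw;
    return -(int64_t)(UINT64_MAX - raw) - 1;
}
```

With these, ret=ffffffea decodes to -22 (-EINVAL on Linux), which matches the "ret = -22" printed by ib0, and rc=ffffffffffffffd3 decodes to -45.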
It looks as if IPoIB uses resources which have already been freed. We
don't receive a "port active" event for port 1 in time (within 20
seconds). The ib_mad stack tries to create an AQP1, and at that point our
eHCA InfiniBand Device Driver waits for a maximum of 20 seconds for a
"port active" event. It seems that with OpenSM we receive the "port
active" event only after about 45 seconds.
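To make the timing problem concrete, here is a minimal simulation of a bounded wait. It is not the eHCA driver code; the function name wait_for_port_active and the one-second polling step are assumptions chosen only for illustration:

```c
/* Hypothetical model of a bounded wait for the "port active" event.
 * event_at_s: simulated time (seconds) at which the event arrives.
 * max_wait_s: how long the caller is willing to wait.
 * Returns 0 if the event is seen in time, -1 on timeout. */
int wait_for_port_active(unsigned int event_at_s, unsigned int max_wait_s)
{
    unsigned int now;

    for (now = 0; now <= max_wait_s; now++) {
        if (now >= event_at_s)
            return 0;   /* event arrived within the window */
    }
    return -1;          /* gave up before the event arrived */
}
```

With the numbers from this report, wait_for_port_active(45, 20) times out, while a longer window such as wait_for_port_active(45, 60) would succeed.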
For the MAD and IPoIB stack this means the following:
MAD:
====
1. No AQP1 QP will exist for port 1, because of the missing "port active"
   event.
2. All resources are freed by the error handling routines in ib_mad:
   create_mad_qp reports an error to ib_mad_port_open, which destroys all
   allocated resources (workqueue, AQPs, MR, PD, CQ, etc.).
3. Multicast join requests to the SM won't work!
   As a result, IPoIB doesn't work when ifconfig ib0 xxx.xxx.xxx.xxx is
   run!
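The unwinding described in point 2 of the MAD list can be sketched in user space as follows. This is only a model of the pattern, not the ib_mad source: the struct, the function names ending in _model, and the use of malloc as a stand-in for IB resources are all assumptions made for illustration.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the resources a MAD port holds. */
struct mad_port_model {
    void *cq, *pd, *mr, *qp1;
};

/* Models the QP1 creation step: fails while the port is not active. */
static int create_qp1_model(int port_active, void **qp)
{
    if (!port_active)
        return -1;
    *qp = malloc(1);
    return *qp ? 0 : -1;
}

/* Models the open path: on failure, release everything acquired so far. */
int mad_port_open_model(struct mad_port_model *p, int port_active)
{
    p->cq = p->pd = p->mr = p->qp1 = NULL;

    if (!(p->cq = malloc(1)))
        goto err;
    if (!(p->pd = malloc(1)))
        goto err_cq;
    if (!(p->mr = malloc(1)))
        goto err_pd;
    if (create_qp1_model(port_active, &p->qp1))
        goto err_mr;
    return 0;

err_mr:
    free(p->mr);
    p->mr = NULL;
err_pd:
    free(p->pd);
    p->pd = NULL;
err_cq:
    free(p->cq);
    p->cq = NULL;
err:
    return -1;
}

/* Releases a successfully opened port model. */
void mad_port_close_model(struct mad_port_model *p)
{
    free(p->qp1);
    free(p->mr);
    free(p->pd);
    free(p->cq);
    p->cq = p->pd = p->mr = p->qp1 = NULL;
}
```

In this model, opening with port_active == 0 returns -1 and leaves every pointer NULL, which is the state in which later users of the port find nothing to work with.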
IPoIB:
======
For IPoIB a "port active" event that arrives too late is going to be a
problem.
1. The function ipoib_add_one calls ipoib_add_port, which creates all IB
   resources (QPs, CQ, etc.; function ipoib_dev_init ->
   ipoib_ib_dev_init, ...).
2. Function ipoib_ib_dev_init (executed at startup / module load) calls
   ipoib_ib_dev_open, which wants to modify the IPoIB QP from
   INIT -> RTR -> RTS via ipoib_qp_create.
3. The first ib_modify_qp call (Reset2Init) in ipoib_qp_create fails,
   because the port is not active at that moment.
   See:
   [ 518.249438] PU0007 000b0560:internal_modify_qp HCAD_ERROR
   hipz_h_modify_qp() failed rc=ffffffffffffffd3 ...
   [ 518.249460] ib0: failed to modify QP to init, ret = -22
   [ 518.418976] ib0: ipoib_qp_create returned -22
4. If that happens, the function ipoib_qp_create in ipoib_verbs.c will
   destroy the IPoIB QP.
5. A user enters ifconfig ib0 xxx.xxx.xxx.xxx, which executes ipoib_open.
   This function also executes ipoib_ib_dev_open, which wants to modify
   the IPoIB QP from INIT -> RTR -> RTS via ipoib_qp_create.
6. ib_modify_qp will then cause the kernel Oops shown above (because
   priv->qp is NULL, see function ipoib_qp_create).
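The failure in steps 4 to 6 comes down to calling a modify operation through a pointer that an earlier error path has already cleared. A minimal user-space sketch of a defensive check follows. It is not a proposed patch for IPoIB; the types qp_model and priv_model and the function modify_qp_checked are hypothetical and only mirror the priv->qp relationship mentioned above:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the QP and the per-device private data. */
struct qp_model {
    int state;
};

struct priv_model {
    struct qp_model *qp;   /* NULL after a failed create has destroyed it */
};

/* Refuses to touch the QP if it no longer exists, instead of
 * dereferencing a NULL pointer. */
int modify_qp_checked(struct priv_model *priv, int new_state)
{
    if (priv == NULL || priv->qp == NULL)
        return -EINVAL;

    priv->qp->state = new_state;
    return 0;
}
```

In this model the second open attempt would get -EINVAL back rather than faulting, so the caller can report the error to ifconfig.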
Kind regards
Heiko Joerg Schick
IBM Deutschland Entwicklung GmbH
I/Ox Microcode Development
Linux Infiniband Device Drivers
Schoenaicher Str. 220
71032 Boeblingen
E-Mail: schickhj at de.ibm.com
External: 49-7031-16-0 x4219, t/l: 120-4219