[ofa-general] synchronize commands issued to MTHCA

Yicheng Jia YJia at tmriusa.com
Wed Jan 2 09:59:30 PST 2008


Hi Jack,

Thanks for your reply. The HCA I'm using is memory-free: the chip is an 
MT25204 and the HCA type is Arbel, so it doesn't go through the "if 
(ah->type == MTHCA_AH_ON_HCA)" part of the code. Checking the debug output 
gave me more details about the problem:

The SW2HW_MPT command is issued while a UDAV is being created. While the 
driver is waiting for the command to complete, it does several other 
things: it builds a send MAD packet, posts a send MAD request to the SQ, 
and posts another receive MAD request to the RQ. None of these actions 
reports an error. Afterwards, however, the HCA reports a bad-parameter 
error for SW2HW_MPT.

Below is a snippet of the debug trace from around the time the error 
occurs; hopefully it will help spot the cause.

139903841835 HCR CMD: op_code:          LE: d
139903861104 TRACE: mad.c:639/ib_mad_recv_done_handler
139903890876 HCR CMD: in_param_h:       LE: 0
139903942869 TRACE: mad.c:644/ib_mad_recv_done_handler
139903993296 HCR CMD: in_param_l:       LE: cf616000
139904038413 TRACE: verbs.c:182/ib_create_ah_from_wc
139904094753 HCR CMD: input_modifier:   LE: 1e
139904139150 TRACE: mthca_provider.c:447/mthca_ah_create
MTHCA DBG: <mthca_av.c:229> Created UDAV at 8075220/00000000:
139904197065 HCR CMD: out_pram_h:       LE: 0
139904333343   [ 0] 01000005
139904384499 HCR CMD: out_pram_l:       LE: 0
139904428086   [ 4] 0000ffff
139904478675 HCR CMD: token:            LE: ffff0000
139904520156   [ 8] 00003000
139904572059 HCR CMD: op_code_modifier: LE: 0
139904612802   [ c] 00000000
139904667693 HCR CMD: event:            LE: 0
139904708526   [10] 00000000
139904758422 HCR CMD 0x18h:             LE=80000d, BE=d008000
139904799210   [14] 00000000
139904904204   [18] 00000000
139904946792MTHCA DBG: <mthca_cmd.c:195> HCR_STATUS 40100698= d008000 ? 8000
   [1c] 00000002
139905076860 TRACE: mthca_av.c:235/mthca_create_ah
139905112329 TRACE: mthca_av.c:243/mthca_create_ah
139905147672 TRACE: mthca_provider.c:460/mthca_ah_create
636959 DEBUG: <mthca_qp.c:1908> Start mthca_arbel_post_send. qp 0 wr 8d984b8
139905324432 TRACE: mthca_qp.c:1911/mthca_arbel_post_send
139905359505 TRACE: mthca_qp.c:1939/mthca_arbel_post_send
139905418932 TRACE: mthca_qp.c:1949/mthca_arbel_post_send
636959 DEBUG: <mthca_qp.c:1953> qp is not direct access and wqe: 0x8d84400 

139905541467 TRACE: mthca_qp.c:1954/mthca_arbel_post_send
139905577647 TRACE: mthca_qp.c:1964/mthca_arbel_post_send
139905614565 TRACE: mthca_qp.c:2057/mthca_arbel_post_send
139905669411 TRACE: mthca_qp.c:2076/mthca_arbel_post_send
139905705726 TRACE: mthca_qp.c:2078/mthca_arbel_post_send
636959 DEBUG: <mthca_qp.c:2087> wr sg length 0x18, lkey 0x80001900, local addr 0xce2393b8
139905831060 TRACE: mthca_qp.c:2078/mthca_arbel_post_send
636959 DEBUG: <mthca_qp.c:2087> wr sg length 0xe8, lkey 0x80001900, local addr 0xce2393d0
139905956322 TRACE: mthca_qp.c:2092/mthca_arbel_post_send
636959 DEBUG: <mthca_qp.c:2101> wr id 148473016
139906069875 TRACE: mthca_qp.c:2120/mthca_arbel_post_send
139906106379 TRACE: mthca_qp.c:2128/mthca_arbel_post_send
139906142892 TRACE: mthca_qp.c:2131/mthca_arbel_post_send
139906178640 TRACE: mthca_qp.c:2135/mthca_arbel_post_send
139906214703 TRACE: mthca_qp.c:2158/mthca_arbel_post_send
139906250568 TRACE: mthca_qp.c:2160/mthca_arbel_post_send
636959 DEBUG: <mthca_qp.c:2162> End mthca_arbel_post_send. err 0
 139906369953 TRACE: mad.c:650/ib_mad_recv_done_handler
139906406295 TRACE: mad.c:669/ib_mad_recv_done_handler
139906441539 TRACE: mad.c:672/ib_mad_recv_done_handler
636959 QNX   DBG: <mad.c:530> mad_priv->header.mad_list.mad_queue->list.prev  88b0a2c
139906578384 TRACE: mthca_qp.c:2177/mthca_arbel_post_receive
139906614168 TRACE: mthca_qp.c:2194/mthca_arbel_post_receive
139906649295 TRACE: mthca_qp.c:2196/mthca_arbel_post_receive
139906689129 TRACE: mad.c:674/ib_mad_recv_done_handler
139906723068 TRACE: mad.c:676/ib_mad_recv_done_handler
636959 QNX   DBG: <linux_cache.c:151> kmem_cache 5 free object=88b0724
139906793007 HCR CMD: Status Return:              : 3
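
For reading the trace above: the word written at HCR offset 0x18 
(LE=80000d) packs the opcode, the opcode modifier, the event bit and the 
go bit, and the final "Status Return" value is the command status code. 
The following is a minimal C sketch of that decoding. The bit positions 
and status names are as recalled from the Linux mthca_cmd.c and should be 
treated as assumptions, and hcr_decode(), hcr_bswap32(), 
hcr_status_name() and hcr_selfcheck() are hypothetical helpers, not 
driver functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit layout of the last HCR dword (not verified against a PRM). */
enum {
    HCR_OPMOD_SHIFT = 12,  /* opcode modifier starts at bit 12 */
    HCR_E_BIT       = 22,  /* set when completion is reported by event */
    HCR_GO_BIT      = 23   /* set by software to start the command */
};

struct hcr_word {
    unsigned opcode;  /* bits 0..11 */
    unsigned opmod;   /* bits 12..15 (field width is an assumption) */
    unsigned event;   /* bit 22 */
    unsigned go;      /* bit 23 */
};

/* Swap byte order, as the trace does between its LE= and BE= values. */
static uint32_t hcr_bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Split a CPU-order command word into its fields. */
static struct hcr_word hcr_decode(uint32_t w)
{
    struct hcr_word r;

    r.opcode = w & 0xfffu;
    r.opmod  = (w >> HCR_OPMOD_SHIFT) & 0xfu;
    r.event  = (w >> HCR_E_BIT) & 1u;
    r.go     = (w >> HCR_GO_BIT) & 1u;
    return r;
}

/* Names for the first few status codes only (assumed values). */
static const char *hcr_status_name(unsigned status)
{
    switch (status) {
    case 0x00: return "OK";
    case 0x01: return "INTERNAL_ERR";
    case 0x02: return "BAD_OP";
    case 0x03: return "BAD_PARAM";
    default:   return "OTHER";
    }
}

/* Check the decoder against the values printed in the trace. */
static void hcr_selfcheck(void)
{
    struct hcr_word w = hcr_decode(0x0080000du);

    assert(w.opcode == 0x0d);  /* op_code: d */
    assert(w.opmod == 0);      /* op_code_modifier: 0 */
    assert(w.event == 0);      /* event: 0 */
    assert(w.go == 1);
    assert(hcr_bswap32(0x0080000du) == 0x0d008000u);  /* BE=d008000 */
    assert(hcr_status_name(3)[0] == 'B');
}
```

Under these assumptions, 0x0080000d decodes to opcode 0x0d with the go bit 
set and the event bit clear, i.e. a polled command, and status 3 maps to a 
bad-parameter error, which matches the symptom described above.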

Again, thanks for your help!

Best,
Yicheng




Jack Morgenstein <jackm at dev.mellanox.co.il> 
01/01/2008 01:03 AM

To
general at lists.openfabrics.org
cc
Yicheng Jia <YJia at tmriusa.com>, Roland Dreier <rdreier at cisco.com>
Subject
Re: [ofa-general] synchronize commands issued to MTHCA






On Tuesday 01 January 2008 03:02, Yicheng Jia wrote:

Does your HCA use on-board memory?
(Run "lspci" and look at the "Mellanox" lines. You have on-board memory
 if you see either:
   PCI bridge: Mellanox Technologies MT23108 InfiniHost HCA bridge (rev a1)
   InfiniBand: Mellanox Technologies MT23108 InfiniHost HCA (rev a1)
 OR:
   InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode)
)

In that case, when you create an AH in kernel space
(file mthca_av.c, procedure mthca_create_ah()), you will enter the 
following flow:
        if (ah->type == MTHCA_AH_ON_HCA) {
                memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE,
                            av, MTHCA_AV_SIZE);
                kfree(av);
        }

Roland, do you think that the memcpy_toio() call might mess things up?

Maybe we need "wmb()" or "mmiowb()" here as well?

- Jack

> Hi Roland,
> 
> Thanks for your reply!
> 
> Actually I'm working on porting the IB driver to the QNX platform. I 
> resumed the work started by a former colleague, and I just found that 
> the synchronization code (dev->cmd.poll_sem and dev->cmd.hcr_mutex) had 
> been deleted for an unknown reason. After adding it back, the driver 
> runs much more smoothly. 
> 
> However, I still get a command execution error that I believe is related 
> to command synchronization. When "Created UDAV" is logged while the 
> SW2HW_MPT command is being executed, the SW2HW_MPT command returns a 
> bad-parameter error. Here is my debug trace output:
> 
> 139903841835 HCR CMD: op_code:          LE: d
> 139903861104 TRACE: mad.c:639/ib_mad_recv_done_handler
> 139903890876 HCR CMD: in_param_h:       LE: 0
> 139903942869 TRACE: mad.c:644/ib_mad_recv_done_handler
> 139903993296 HCR CMD: in_param_l:       LE: cf616000
> 139904038413 TRACE: verbs.c:182/ib_create_ah_from_wc
> 139904094753 HCR CMD: input_modifier:   LE: 1e
> 139904139150 TRACE: mthca_provider.c:447/mthca_ah_create
> MTHCA DBG: <mthca_av.c:229> Created UDAV at 8075220/00000000:
> 139904197065 HCR CMD: out_pram_h:       LE: 0
> 139904333343   [ 0] 01000005
> 139904384499 HCR CMD: out_pram_l:       LE: 0
> 139904428086   [ 4] 0000ffff
> 139904478675 HCR CMD: token:            LE: ffff0000
> 139904520156   [ 8] 00003000
> 139904572059 HCR CMD: op_code_modifier: LE: 0
> 139904612802   [ c] 00000000
> 139904667693 HCR CMD: event:            LE: 0
> 139904708526   [10] 00000000
> 139904758422 HCR CMD 0x18h:             LE=80000d, BE=d008000
> 139904799210   [14] 00000000
> 139904904204   [18] 00000000
> 139904946792MTHCA DBG: <mthca_cmd.c:195> HCR_STATUS 40100698= d008000 ? 8000
>    [1c] 00000002
> 139905076860 TRACE: mthca_av.c:235/mthca_create_ah
> 139905112329 TRACE: mthca_av.c:243/mthca_create_ah
> 139905147672 TRACE: mthca_provider.c:460/mthca_ah_create
> ....
> 139906793007 HCR CMD: Status Return:              : 3
> 
> Do you have any idea?
> 
> Thanks and have a good new year!
> Yicheng
> 
> 
> 
> 
> Roland Dreier <rdreier at cisco.com> 
> 12/28/2007 11:39 PM
> 
> To
> Yicheng Jia <YJia at tmriusa.com>
> cc
> general at lists.openfabrics.org
> Subject
> Re: [ofa-general] synchronize commands issued to MTHCA
> 
> 
> 
> 
> 
> 
>  > I'm using OFED-1.0 and the problem I believe is related to command 
>  > synchronization of HCA. The host issues a MAD_INF command at first and 
>  > then a SW2HW_MTP command without waiting for the completion of the first 
>  > command. Both of commands return with bad parameters error.
> 
> I guess you mean the MAD_IFC and SW2HW_MPT commands?  I've never heard
> of a problem like that -- more details about your hardware/software
> config and the exact symptoms you see would be helpful in debugging.
> 
> Anyway OFED 1.0 is ancient by now -- you are much better off just
> using drivers from the standard kernel.  If you must use OFED, then
> OFED 1.2 or even a 1.3 prerelease would be better.
> 
>  > My question is why there's no synchronization mechanism for the command 
>  > execution on HCA, can I use "spin_lock" or "sem_wait" to synchronize 
>  > between every command?
> 
> The HCA firmware allows multiple commands to be queued.  The
> dev->cmd.event_sem semaphore is used to limit the number of
> outstanding commands to the HCA's capabilities, and the
> dev->cmd.hcr_mutex mutex is used to serialize the actual writing of
> commands to the HCA.
> 
> There was a mmiowb() added to mthca_cmd_post() fairly recently that
> might fix your problems if you are running on a large SGI Altix system.
> 
>  - R.
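
The scheme Roland describes (a counting semaphore that bounds the number 
of outstanding commands, plus a mutex that serializes the actual register 
writes) can be sketched in user space as follows. This is only an 
illustration under stated assumptions: POSIX semaphores and pthread 
mutexes stand in for the kernel primitives, and struct cmd_ctx, 
cmd_ctx_init(), cmd_post() and cmd_complete() are hypothetical names, not 
mthca functions.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

struct cmd_ctx {
    sem_t           event_sem;  /* counts free command slots */
    pthread_mutex_t hcr_mutex;  /* serializes writes to the command register */
    int             posted;     /* commands written so far (stand-in for HW) */
};

/* max_cmds plays the role of the HCA's limit on outstanding commands. */
static int cmd_ctx_init(struct cmd_ctx *c, unsigned max_cmds)
{
    c->posted = 0;
    if (sem_init(&c->event_sem, 0, max_cmds) != 0)
        return -1;
    if (pthread_mutex_init(&c->hcr_mutex, NULL) != 0)
        return -1;
    return 0;
}

/* Take a slot, then write the command while holding the mutex. */
static void cmd_post(struct cmd_ctx *c)
{
    int rc;

    /* Block until a command slot is free; retry if interrupted. */
    while (sem_wait(&c->event_sem) == -1 && errno == EINTR)
        ;

    rc = pthread_mutex_lock(&c->hcr_mutex);
    assert(rc == 0);
    /* A real driver would write the command registers here. */
    c->posted++;
    rc = pthread_mutex_unlock(&c->hcr_mutex);
    assert(rc == 0);
}

/* Called when a command finishes: give the slot back. */
static void cmd_complete(struct cmd_ctx *c)
{
    sem_post(&c->event_sem);
}
```

The mutex is held only for the duration of the write, so several commands 
can be outstanding at once; the slot is returned only on completion. If 
either primitive is removed, as happened in the port described above, 
nothing bounds the number of commands in flight or stops two writers from 
interleaving their register writes.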
> 
> 
