[ofa-general] synchronize commands issued to MTHCA
Jack Morgenstein
jackm at dev.mellanox.co.il
Mon Dec 31 23:03:38 PST 2007
On Tuesday 01 January 2008 03:02, Yicheng Jia wrote:
Does your HCA use on-board memory?
(Run: "lspci" and look at "Mellanox" lines. You have on-board memory
if you see either:
PCI bridge: Mellanox Technologies MT23108 InfiniHost HCA bridge (rev a1)
InfiniBand: Mellanox Technologies MT23108 InfiniHost HCA (rev a1)
OR:
InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode)
)
In that case, when you create an AH in kernel space
(file mthca_av.c, procedure mthca_create_ah()), you will enter the following flow:

	if (ah->type == MTHCA_AH_ON_HCA) {
		memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE,
			    av, MTHCA_AV_SIZE);
		kfree(av);
	}

Roland, do you think that the memcpy_toio() call might mess things up?
Maybe we need "wmb()" or "mmiowb()" here as well?
- Jack
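
To make the ordering concern concrete, here is a minimal userspace sketch, not driver code. It assumes plain memory can stand in for the on-board AV table, and that a C11 release fence can stand in for the barrier (wmb() or mmiowb()) that would follow memcpy_toio() in the kernel. The names fake_av_map, av_published, publish_av and read_av, and the AV_SIZE value, are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define AV_SIZE    32   /* arbitrary stand-in for MTHCA_AV_SIZE (assumption) */
#define AV_ENTRIES 4    /* arbitrary table size for the sketch */

/* Plain memory standing in for dev->av_table.av_map (assumption). */
static uint8_t fake_av_map[AV_ENTRIES * AV_SIZE];

/* Hypothetical flag meaning "a consumer may now use this entry". */
static atomic_int av_published[AV_ENTRIES];

/*
 * Copy an address vector into the table, then publish it.
 * The release fence plays the role a write barrier would play after
 * memcpy_toio(): the copied bytes must be visible before anything
 * that tells a consumer to go and read them.
 */
static void publish_av(int index, const uint8_t *av)
{
    assert(index >= 0 && index < AV_ENTRIES);
    memcpy(fake_av_map + index * AV_SIZE, av, AV_SIZE);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&av_published[index], 1, memory_order_relaxed);
}

/*
 * Consumer side: returns 1 and copies the entry out if it has been
 * published, 0 otherwise. The acquire fence pairs with the release
 * fence in publish_av().
 */
static int read_av(int index, uint8_t *out)
{
    assert(index >= 0 && index < AV_ENTRIES);
    if (!atomic_load_explicit(&av_published[index], memory_order_relaxed))
        return 0;
    atomic_thread_fence(memory_order_acquire);
    memcpy(out, fake_av_map + index * AV_SIZE, AV_SIZE);
    return 1;
}
```

Without the fence, a weakly ordered machine could make the "published" store visible before the copied bytes, which is the kind of reordering the question above is about. Whether the real HCA path needs such a barrier is exactly what is being asked, so this sketch does not answer it.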
> Hi Roland,
>
> Thanks for your reply!
>
> Actually I'm working on porting the IB driver to the QNX platform. I resumed
> the work started by my former colleague, and I just found that the sync code
> (dev->cmd.poll_sem and dev->cmd.hcr_mutex) had been deleted for an unknown
> reason. After adding this sync code back, the driver runs much more
> smoothly.
>
> However, I still get a command execution error which I believe is related
> to command synchronization. The problem is that when "Created UDAV" is
> logged while the SW2HW_MPT command is being executed, the SW2HW_MPT command
> returns with a bad parameter error. Here is my debug trace output:
>
> 139903841835 HCR CMD: op_code: LE: d
> 139903861104 TRACE: mad.c:639/ib_mad_recv_done_handler
> 139903890876 HCR CMD: in_param_h: LE: 0
> 139903942869 TRACE: mad.c:644/ib_mad_recv_done_handler
> 139903993296 HCR CMD: in_param_l: LE: cf616000
> 139904038413 TRACE: verbs.c:182/ib_create_ah_from_wc
> 139904094753 HCR CMD: input_modifier: LE: 1e
> 139904139150 TRACE: mthca_provider.c:447/mthca_ah_create
> MTHCA DBG: <mthca_av.c:229> Created UDAV at 8075220/00000000:
> 139904197065 HCR CMD: out_pram_h: LE: 0
> 139904333343 [ 0] 01000005
> 139904384499 HCR CMD: out_pram_l: LE: 0
> 139904428086 [ 4] 0000ffff
> 139904478675 HCR CMD: token: LE: ffff0000
> 139904520156 [ 8] 00003000
> 139904572059 HCR CMD: op_code_modifier: LE: 0
> 139904612802 [ c] 00000000
> 139904667693 HCR CMD: event: LE: 0
> 139904708526 [10] 00000000
> 139904758422 HCR CMD 0x18h: LE=80000d, BE=d008000
> 139904799210 [14] 00000000
> 139904904204 [18] 00000000
> 139904946792MTHCA DBG: <mthca_cmd.c:195> HCR_STATUS 40100698= d008000 ?
> 8000
> [1c] 00000002
> 139905076860 TRACE: mthca_av.c:235/mthca_create_ah
> 139905112329 TRACE: mthca_av.c:243/mthca_create_ah
> 139905147672 TRACE: mthca_provider.c:460/mthca_ah_create
> ....
> 139906793007 HCR CMD: Status Return: : 3
>
> Do you have any idea?
>
> Thanks and have a good new year!
> Yicheng
>
> Roland Dreier <rdreier at cisco.com>
> 12/28/2007 11:39 PM
> To: Yicheng Jia <YJia at tmriusa.com>
> cc: general at lists.openfabrics.org
> Subject: Re: [ofa-general] synchronize commands issued to MTHCA
>
> > I'm using OFED-1.0 and the problem I believe is related to command
> > synchronization of HCA. The host issues a MAD_INF command at first and
> > then a SW2HW_MTP command without waiting for the completion of the
> > first command. Both of commands return with bad parameters error.
>
> I guess you mean the MAD_IFC and SW2HW_MPT commands? I've never heard
> of a problem like that -- more details about your hardware/software
> config and the exact symptoms you see would be helpful in debugging.
>
> Anyway OFED 1.0 is ancient by now -- you are much better off just
> using drivers from the standard kernel. If you must use OFED, then
> OFED 1.2 or even a 1.3 prerelease would be better.
>
> > My question is why there's no synchronization mechanism for the command
> > execution on HCA, can I use "spin_lock" or "sem_wait" to synchronize
> > between every command?
>
> The HCA firmware allows multiple commands to be queued. The
> dev->cmd.event_sem semaphore is used to limit the number of
> outstanding commands to the HCA's capabilities, and the
> dev->cmd.hcr_mutex mutex is used to serialize the actual writing of
> commands to the HCA.
>
> There was a mmiowb() added to mthca_cmd_post() fairly recently that
> might fix your problems if you are running on a large SGI Altix system.
>
> - R.
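
The two-level scheme described above (a counting semaphore that caps outstanding commands, plus a mutex around the actual register write) can be sketched in userspace with POSIX primitives. This is only an illustration under stated assumptions: struct cmd_ctx, cmd_exec, run_workers, MAX_OUTSTANDING and the counters are hypothetical names, the limit of 2 is arbitrary, and each command "completes" immediately instead of through a later completion event as it would on real hardware.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

#define MAX_OUTSTANDING  2    /* arbitrary; the real limit comes from the HCA */
#define CMDS_PER_THREAD  100
#define MAX_THREADS      16

struct cmd_ctx {
    sem_t           event_sem;  /* counts free command slots */
    pthread_mutex_t hcr_mutex;  /* serializes writes to the command register */
    pthread_mutex_t stat_lock;  /* protects the bookkeeping counters below */
    int outstanding;            /* commands currently holding a slot */
    int max_seen;               /* highest value 'outstanding' ever reached */
    int posted;                 /* total commands "written" so far */
};

static void cmd_ctx_init(struct cmd_ctx *c)
{
    /* Unnamed POSIX semaphore; available on Linux. */
    sem_init(&c->event_sem, 0, MAX_OUTSTANDING);
    pthread_mutex_init(&c->hcr_mutex, NULL);
    pthread_mutex_init(&c->stat_lock, NULL);
    c->outstanding = c->max_seen = c->posted = 0;
}

static void cmd_exec(struct cmd_ctx *c)
{
    /* Take a slot; retry if interrupted by a signal. */
    while (sem_wait(&c->event_sem) == -1 && errno == EINTR)
        ;

    pthread_mutex_lock(&c->stat_lock);
    if (++c->outstanding > c->max_seen)
        c->max_seen = c->outstanding;
    pthread_mutex_unlock(&c->stat_lock);

    /* Only one thread at a time may write the command register. */
    pthread_mutex_lock(&c->hcr_mutex);
    c->posted++;                /* stand-in for the register write */
    pthread_mutex_unlock(&c->hcr_mutex);

    /* Simplification: the command completes right away. */
    pthread_mutex_lock(&c->stat_lock);
    c->outstanding--;
    pthread_mutex_unlock(&c->stat_lock);

    sem_post(&c->event_sem);    /* give the slot back */
}

static void *worker(void *arg)
{
    for (int i = 0; i < CMDS_PER_THREAD; i++)
        cmd_exec(arg);
    return NULL;
}

/* Start nthreads workers that all issue commands, then wait for them. */
static void run_workers(struct cmd_ctx *c, int nthreads)
{
    pthread_t tid[MAX_THREADS];

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, c);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
}
```

Calling cmd_ctx_init() and then run_workers() with, say, 8 threads should leave max_seen no higher than MAX_OUTSTANDING. If either primitive is removed, as reportedly happened in the QNX port, nothing bounds the number of commands in flight or keeps two register writes from interleaving.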