[Openib-windows] IPOIB virtualization What was already done, what still has to be done to finish the job.

Fabian Tillier ftillier at silverstorm.com
Fri Apr 21 10:31:36 PDT 2006


Hi Hal,

On 21 Apr 2006 09:34:27 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
> Hi Tzachi,
>
> On Fri, 2006-04-21 at 09:23, Tzachi Dar wrote:
> > Hi Fab,
> >
> > The following mail summarizes the state of the work that I did on
> > IPOIB virtualization, that is running IPOIB on Microsoft virtual
> > server R2.
> >
> > Please note that the current status is that ping works in all
> > directions, still there is a lot of work needed in order to bring it
> > to product quality. The biggest issue that still has to be done is
> > allow for packets that are bigger than 1500 bytes, and smaller than
> > 2048 to pass to the guest OS. Currently, I have implemented a hack
> > that tells windows that we only support MTU of 1500 bytes (like
> > Ethernet). My change assumes that all machines are windows machines,
> > and all have my changes, but this is not always true. One example that
> > breaks this assumption is Linux.
>
> Is this issue with Windows or Linux in terms of this interoperation ?
> Can you elaborate on this ?

This is a Windows VM issue.

A Windows VM can only receive ~1500 byte packets from the Windows host
machine.  This means that if a packet is sent using the full IPoIB MTU
to a guest VM, that packet will never make it; it gets dropped
somewhere between being handed off to the host network stack and the
guest OS.  I would expect this to be a bug in the MS virtual server
network emulation layer, but we haven't confirmed this yet.

So a Linux system sending a full IPoIB MTU packet to a Windows VM
would not work, through no fault of the Linux machine (the same
applies if a Windows machine sends a full IPoIB MTU packet).

To prevent Windows from sending a full IPoIB MTU, the IPoIB driver
must report its MTU to Windows as 1500 bytes, rather than 2044, but
then any full IPoIB MTU packet (from a Linux host for example) would
overrun the RQ WQE.
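
To make the numbers above concrete, here is a rough Python sketch of the
size check involved. The constants follow the figures in this thread
(2048 bytes on the IB side, 2044 for the IPoIB MTU, 1500 toward the
guest); the 4-byte difference is the IPoIB encapsulation header. The
function name and its return strings are made up for illustration and
are not taken from the driver.

```python
IB_LINK_MTU = 2048        # IB payload size discussed in this thread
IPOIB_ENCAP_HDR = 4       # IPoIB encapsulation header (type + reserved)
IPOIB_MTU = IB_LINK_MTU - IPOIB_ENCAP_HDR   # 2044 bytes of IP packet
GUEST_MTU = 1500          # what the VM path appears to accept


def classify_for_guest(ip_packet_len):
    """Hypothetical helper: say what happens to an IP packet of this size."""
    if ip_packet_len > IPOIB_MTU:
        return "invalid"             # cannot arrive over IPoIB at all
    if ip_packet_len > GUEST_MTU:
        return "too big for guest"   # legal on the wire, lost before the VM
    return "deliverable"
```

Packets in the 1501 to 2044 byte range are the problem case: they are
valid IPoIB traffic from any sender, yet the guest never sees them.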

> >  It seems that a better solution to this problem is either talk to MS
> > and see if they have a solution to this problem or accept bigger
> > packets and break them by demand.
> >
> > Due to the time that is needed to complete the work (see also problems
> > below) we have decided not to support virtualization for this
> > release.
> >
> > Attached to this mail is the latest version that I created. It should
> > fit more or less to the version of IPOIB.
> >
> > The changes that I have made are in the following areas. I'll describe
> > shortly what the problem was, what I did and what still has to be
> > done. Some of the problems described are not really related to
> > virtualization.
> >
> > 1) checking where to pass the packets. I have implemented the code
> > that sniffs ARPs and creates a table of (IP, MAC) pairs. Packets are later
> > changed based on that table. Code is almost complete, however there is
> > a need to take the correct lock when writing the table (shouldn't be
> > that complicated).
> >
> > 2) DHCP support. A few general comments: 1) The current code
> > introduced changes to DHCP packets both in the receiver side as well
> > as in the sender side. This works well if we write the software in
> > both sides. Assuming that the other side is Linux, this is not true.
> > I'm not sure that there is a spec that solves these problems.
>
> Please elaborate on this.

Windows IPoIB masquerades itself as an 802.3 device.  This means that
the IPoIB encapsulation for ARP and DHCP packets is implemented
internally to the IPoIB driver.  So when NDIS sends a DHCP packet,
IPoIB converts it to follow the IETF draft so that on the wire it
looks like IPoIB, not Ethernet.  The receiver then converts back to
Ethernet.
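
As a rough illustration of the header swap described above, the Python
sketch below converts between an Ethernet II frame and an IPoIB payload.
It assumes the 4-byte IPoIB encapsulation header (16-bit type followed
by 16 reserved bits) and a 14-byte Ethernet header. The function names
are hypothetical, the MAC addresses for the receive direction are simply
passed in, and the rewriting of ARP and DHCP payloads is left out.

```python
import struct

ETH_HDR_LEN = 14     # dst MAC (6) + src MAC (6) + EtherType (2)
IPOIB_HDR_LEN = 4    # type (2) + reserved (2)


def eth_to_ipoib(frame: bytes) -> bytes:
    """Send path: replace the Ethernet header with an IPoIB header."""
    if len(frame) < ETH_HDR_LEN:
        raise ValueError("frame shorter than an Ethernet header")
    (ethertype,) = struct.unpack("!H", frame[12:14])
    # The IPoIB type field carries the EtherType; reserved bits are zero.
    return struct.pack("!HH", ethertype, 0) + frame[ETH_HDR_LEN:]


def ipoib_to_eth(packet: bytes, dst_mac: bytes, src_mac: bytes) -> bytes:
    """Receive path: rebuild an Ethernet header in front of the payload."""
    if len(packet) < IPOIB_HDR_LEN:
        raise ValueError("packet shorter than an IPoIB header")
    if len(dst_mac) != 6 or len(src_mac) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    ethertype, _reserved = struct.unpack("!HH", packet[:IPOIB_HDR_LEN])
    return dst_mac + src_mac + struct.pack("!H", ethertype) + packet[IPOIB_HDR_LEN:]
```

The receive direction needs Ethernet MACs that do not exist on the
wire, which is why the driver has to derive them from somewhere.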

There's a bug in the driver (that I'm currently fixing) where it
doesn't recalculate the IP or UDP checksums or update the packet
lengths, even though it changed the packet payload.  This means that
DHCP packets on the wire will have incorrect lengths and checksums,
and DHCP packets received after conversion to Ethernet format could
have the wrong checksum.
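
For reference, the recalculation that has to happen after the payload
changes can be sketched in Python as below. It uses the standard
Internet checksum (one's complement sum of 16-bit words) over an IPv4
header. The helper name fix_ipv4_header is made up for illustration, the
sketch assumes a plain IPv4 header passed in as bytes, and the UDP
checksum (which also covers a pseudo-header) is not shown.

```python
def internet_checksum(data: bytes) -> int:
    """One's complement of the one's complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                  # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                   # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def fix_ipv4_header(header: bytes, new_total_len: int) -> bytes:
    """Hypothetical helper: update total length, then redo the checksum."""
    hdr = bytearray(header)
    hdr[2:4] = new_total_len.to_bytes(2, "big")    # total length field
    hdr[10:12] = b"\x00\x00"                       # zero checksum first
    hdr[10:12] = internet_checksum(bytes(hdr)).to_bytes(2, "big")
    return bytes(hdr)
```

A header fixed up this way checksums to zero when the receiver runs the
same sum over it, which is a quick sanity check.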

It is really useful in Windows (and I can't stress this enough) to
have IB port GUIDs that can be converted to valid, unique Ethernet
MACs.  The driver currently only has code to handle SilverStorm and
Mellanox GUIDs, because no other vendor has put forth their algorithm
(if they have one) to do this.  If Voltaire has such an algorithm, it
would be great to add a handler for it.

- Fab


