# network-binding-plugin: add plugin for vhostuser interfaces

# Overview
`vhostuser` interfaces are supported by QEMU but are not implemented in KubeVirt. The Network Binding Plugin framework is a good fit for adding `vhostuser` interface support to KubeVirt.

## Motivation
`vhostuser` interfaces are required to attach VMs to a userspace dataplane such as OVS-DPDK or VPP and to achieve a fast datapath from the VM to the physical NIC.
This is a mandatory feature for networking VMs such as vRouters, IPsec gateways, firewalls or SD-WAN VNFs, which usually bind their network interfaces using DPDK. The performance expected with DPDK can only be met if the whole datapath stays in userspace and does not go through kernel interfaces, as it does with the usual bridge interfaces.

## Goals
Be able to add `vhostuser` secondary interfaces to the VM definition in KubeVirt.

## Non Goals
Configuring the `vhostuser` secondary interfaces in the dataplane is the responsibility of Multus and the CNI, such as the `userspace CNI`.

## Definition of Users
- **VM User** is the persona that configures `VirtualMachine` or `VirtualMachineInstance`
- **Cluster Admin** is the persona that configures `KubeVirt` resources
- **Network Binding Plugin Developer** is the persona that implements the `network-vhostuser-binding` plugin
- **CNI Developer** is the persona that implements the CNI that configures the dataplane with vhostuser sockets
- **Dataplane Developer** is the persona that implements the userspace dataplane

## User Stories
- As a VM User, I want to create a VM with one or several `vhostuser` interfaces attached to a userspace dataplane
- As a VM User, I want the `vhostuser` interface to be configured with a specific MAC address
- As a VM User, I want to enable multi-queue on the `vhostuser` interface
- As a VM User, I want to be able to configure the `vhostuser` interface as transitional
- As a Cluster Admin, I want to be able to enable `network-vhostuser-binding`
- As a Network Binding Plugin Developer, I want the shared socket path to be accessible to the `virt-launcher` pod
- As a Dataplane Developer, I want to access all `vhostuser` sockets of VM pods
- As a CNI Developer, I want to know where the `vhostuser` sockets are located

## Repos
The KubeVirt repo, and more specifically [cmd/sidecars](https://github.com/kubevirt/kubevirt/tree/main/cmd/sidecars).

## Design
This proposal leverages the KubeVirt Network Binding Plugin sidecar framework to implement a new `network-vhostuser-binding` plugin for secondary networks.

The role of `network-vhostuser-binding` is to modify the domain XML according to the VMI definition passed to its gRPC service by the `virt-launcher` pod on the `OnDefineDomain` event triggered by `virt-handler`.

`vhostuser` interfaces are defined in the VMI under `spec/domain/devices/interfaces` using the binding name `vhostuser`:

```yaml
spec:
  domain:
    devices:
      networkInterfaceMultiqueue: true
      interfaces:
      - name: default
        masquerade: {}
      - name: net1
        binding:
          name: vhostuser
        macAddress: ca:fe:ca:fe:42:42
```

`network-vhostuser-binding` translates the VMI definition into libvirt domain XML modifications on `OnDefineDomain`:
1. Create a new interface with `type='vhostuser'`
2. Set the MAC address if specified in the VMI spec
3. Define the model type according to the `useVirtioTransitional` VMI spec
4. If `networkInterfaceMultiqueue` is set to `true`, add the number of queues, calculated from the number of cores of the VMI
5. Add `memAccess='shared'` to all NUMA cell elements
6. Define the device name according to the KubeVirt naming schema
7. Define the `vhostuser` socket path, which must stay immutable across live migration

As the `OnDefineDomain` hook can be called multiple times by KubeVirt, the `network-vhostuser-binding` modifications must be idempotent.

Below is an example of modified domain XML:

```xml
<cpu mode="host-model">
  <topology sockets="2" cores="8" threads="1"></topology>
  <numa>
    <cell id="0" cpus="0-7" memory="2097152" unit="KiB" memAccess="shared"/>
    <cell id="1" cpus="8-15" memory="2097152" unit="KiB" memAccess="shared"/>
  </numa>
</cpu>
<interface type='vhostuser'>
  <source type='unix' path='/var/run/kubevirt/vhostuser/net1/poda08a0fcbdea' mode='server'/>
  <target dev='poda08a0fcbdea'/>
  <model type='virtio-non-transitional'/>
  <mac address='ca:fe:ca:fe:42:42'/>
  <driver name='vhost' queues='8' rx_queue_size='1024' tx_queue_size='1024'/>
  <alias name='ua-net1'/>
</interface>
```
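
For illustration, below is a minimal Go sketch of the per-interface translation described in steps 1-4, 6 and 7 above (step 5 touches the NUMA cells instead). The struct and function names are illustrative only, not the actual plugin code, and the structs mirror only the libvirt `<interface>` attributes this plugin touches:

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// macAddress models the <mac address='...'/> element.
type macAddress struct {
	Address string `xml:"address,attr"`
}

// vhostuserInterface models only the parts of the libvirt <interface> element
// that this plugin sets; it is not a KubeVirt or libvirt Go type.
type vhostuserInterface struct {
	XMLName xml.Name `xml:"interface"`
	Type    string   `xml:"type,attr"`
	Source  struct {
		Type string `xml:"type,attr"`
		Path string `xml:"path,attr"`
		Mode string `xml:"mode,attr"`
	} `xml:"source"`
	Target struct {
		Dev string `xml:"dev,attr"`
	} `xml:"target"`
	Model struct {
		Type string `xml:"type,attr"`
	} `xml:"model"`
	MAC    *macAddress `xml:"mac"`
	Driver struct {
		Name   string `xml:"name,attr"`
		Queues uint   `xml:"queues,attr,omitempty"`
	} `xml:"driver"`
	Alias struct {
		Name string `xml:"name,attr"`
	} `xml:"alias"`
}

// buildVhostuserInterface applies the translation steps: the socket path is the
// immutable path under the shared directory, the model depends on
// useVirtioTransitional, and the queue count follows the VMI vCPU count when
// networkInterfaceMultiqueue is enabled.
func buildVhostuserInterface(name, podIfaceName, mac, sharedDir string, transitional, multiQueue bool, vcpus uint) vhostuserInterface {
	iface := vhostuserInterface{Type: "vhostuser"}
	iface.Source.Type = "unix"
	iface.Source.Path = fmt.Sprintf("%s/%s/%s", sharedDir, name, podIfaceName)
	iface.Source.Mode = "server"
	iface.Target.Dev = podIfaceName
	iface.Model.Type = "virtio-non-transitional"
	if transitional {
		iface.Model.Type = "virtio-transitional"
	}
	if mac != "" {
		iface.MAC = &macAddress{Address: mac}
	}
	iface.Driver.Name = "vhost"
	if multiQueue {
		iface.Driver.Queues = vcpus
	}
	iface.Alias.Name = "ua-" + name // user-alias naming, as in the example above
	return iface
}

func main() {
	iface := buildVhostuserInterface("net1", "poda08a0fcbdea", "ca:fe:ca:fe:42:42",
		"/var/run/kubevirt/vhostuser", false, true, 8)
	out, _ := xml.MarshalIndent(iface, "", "  ")
	fmt.Println(string(out))
}
```

Because `OnDefineDomain` may be called several times, the actual plugin would additionally check whether an interface with the same alias already exists in the domain before appending the generated element.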

### Implementation details

The socket path has to be available to both the `virt-launcher` pod (more precisely its `compute` container) and the dataplane pod.
To avoid `hostPath` volumes, which require the pods to be privileged, we propose to implement a **vhostuser device plugin** that is able to inject mounts of the sockets directory, as well as annotations, into unprivileged pods.

#### Device Plugin for **vhostuser sockets** resources

Device plugins can instruct the kubelet to add mounts into containers when the resources they manage are requested.

This design proposal relies on a device plugin that manages two kinds of resources on the userspace dataplane, which we can think of as a virtual switch:
- **dataplane**: `1`
  This resource gives access to all subdirectories of `/var/run/vhostuser`, and to the sockets inside.
  It is requested by the dataplane itself.
  The kubelet injects the `/var/run/vhostuser` mount into the container.
- **vhostuser sockets**: `n`
  This resource can be thought of as a virtual switch port, and can have a limit related to the dataplane's own limitations (performance, CPU, etc.).
  It can help schedule workloads on nodes where the dataplane has available resources.
  It is requested through the VM or VMI definition, in the resources request spec. In turn, the `compute` container of the `virt-launcher` pod requests the same resources.
  This makes the device plugin allocate a subdirectory `/var/run/vhostuser/<socketXX>` and mount it into the `virt-launcher` pod (see the `Allocate` sketch after this list).
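
As a rough sketch, and assuming the standard kubelet device plugin API (`k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1`), the `Allocate` handler for the **vhostuser sockets** resource could look as follows. Directory naming, permissions and SELinux handling are illustrative, not the actual device plugin code:

```go
package vhostuserplugin

import (
	"context"
	"os"
	"path/filepath"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// Host directory owned by the device plugin; the "dataplane" resource would mount
// this whole directory, while each "vhostuser sockets" resource gets a subdirectory.
const vhostuserBaseDir = "/var/run/vhostuser"

type socketsPlugin struct{}

// Allocate creates one subdirectory per allocated socket resource and asks the
// kubelet to bind-mount it into the requesting container (the virt-launcher
// compute container). Ownership and SELinux labelling are omitted here.
func (p *socketsPlugin) Allocate(ctx context.Context, req *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for _, containerReq := range req.ContainerRequests {
		containerResp := &pluginapi.ContainerAllocateResponse{}
		for _, id := range containerReq.DevicesIDs {
			hostDir := filepath.Join(vhostuserBaseDir, id) // e.g. /var/run/vhostuser/socket03
			if err := os.MkdirAll(hostDir, 0o770); err != nil {
				return nil, err
			}
			containerResp.Mounts = append(containerResp.Mounts, &pluginapi.Mount{
				HostPath:      hostDir,
				ContainerPath: hostDir, // same path inside the container for simplicity
				ReadOnly:      false,
			})
		}
		resp.ContainerResponses = append(resp.ContainerResponses, containerResp)
	}
	return resp, nil
}
```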

The device plugin has to comply with the [`device-info-spec`](https://github.com/k8snetworkplumbingwg/device-info-spec/blob/main/SPEC.md#device-information-specification). This allows information sharing between the device plugin and the CNI. Since Multus is compliant with this spec, the CNI can retrieve the device information (socket path and type) and use it to configure the dataplane accordingly. Multus annotates the `virt-launcher` pod with this information, and KubeVirt extracts a subset of it into the `kubevirt.io/network-info` annotation.
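
For illustration, the vhost-user device information published by the device plugin could look like the following sketch, which mirrors the JSON shape defined by the device-info-spec for the `vhost-user` type; the version and paths are example values, and the exact schema should be taken from the spec:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VhostUserInfo and DeviceInfo are illustrative structs mirroring the vhost-user
// device information shape of the device-info-spec; they are not a published API.
type VhostUserInfo struct {
	Mode string `json:"mode"` // "client" or "server"
	Path string `json:"path"` // socket path allocated by the device plugin
}

type DeviceInfo struct {
	Type      string         `json:"type"`                 // "vhost-user" for this plugin
	Version   string         `json:"version"`              // device-info-spec version
	VhostUser *VhostUserInfo `json:"vhost-user,omitempty"` // set when Type is "vhost-user"
}

func main() {
	info := DeviceInfo{
		Type:    "vhost-user",
		Version: "1.1.0", // example value
		VhostUser: &VhostUserInfo{
			Mode: "server",
			Path: "/var/run/vhostuser/socket03/poda08a0fcbdea",
		},
	}
	out, _ := json.MarshalIndent(info, "", "  ")
	fmt.Println(string(out))
}
```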

The device plugin also has to take care of directory permissions and SELinux labels so that the sockets are accessible from the requesting pods.

#### Network Binding Plugin and Kubevirt requirements

The Network Binding Plugin can then leverage the `downwardAPI` feature, available since KubeVirt v1.3.0, to retrieve the `kubevirt.io/network-info` annotation values and extract the socket path used to configure the interface in the domain XML.

However, it cannot use this path directly, as that would break live migration of VMs:
the socket directories `/var/run/vhostuser/<socketXX>` are not predictable, and new ones are allocated when the destination pod is created,
while the domain XML is the one from the source pod (the migration domain) and references the socket paths allocated to the source pod.

Hence, the Network Binding Plugin needs to use immutable socket paths. This can be achieved by using the interface name (or a hashed version of it) in symbolic links to the real socket path: `/var/run/kubevirt/vhostuser/net1` -> `/var/run/vhostuser/<socketXX>`, as sketched below.
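
A minimal sketch of creating such an immutable path, assuming the allocated directory is already known (for example from the `kubevirt.io/network-info` annotation); the function name and error handling are illustrative, not the actual plugin code:

```go
package main

import (
	"errors"
	"os"
	"path/filepath"
)

// linkImmutableSocketDir exposes the allocated, non-predictable socket directory
// under an immutable per-interface path so the domain XML stays valid across
// live migration, e.g. /var/run/kubevirt/vhostuser/net1 -> /var/run/vhostuser/socket03.
func linkImmutableSocketDir(ifaceName, allocatedDir string) error {
	sharedDir := "/var/run/kubevirt/vhostuser"
	if err := os.MkdirAll(sharedDir, 0o770); err != nil {
		return err
	}
	link := filepath.Join(sharedDir, ifaceName)
	// Idempotent: a link that already exists (e.g. on a repeated call) is kept.
	if err := os.Symlink(allocatedDir, link); err != nil && !errors.Is(err, os.ErrExist) {
		return err
	}
	return nil
}

func main() {
	_ = linkImmutableSocketDir("net1", "/var/run/vhostuser/socket03")
}
```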

This requires an enhancement in KubeVirt and in the Network Binding Plugin spec of the KubeVirt CRD, so that the `virt-launcher` pod gets a shared `emptyDir` volume mounted in both the `compute` and `vhostuser-network-binding-plugin` containers.

#### Implementation diagram

![kubevirt-vhostuser-shared-sockets](kubevirt-vhostuser-binding-plugin-device-plugin.drawio.png)

## API Examples

### KubeVirt CRD

A new parameter for the shared directory must be defined in the Network Binding Plugin spec of the KubeVirt CR:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
spec:
  configuration:
    network:
      binding:
        vhostuser:
          sidecarImage: network-vhostuser-binding:main
          sharedDir: /var/run/kubevirt/vhostuser
```

### No modification to VM

Example of a `VirtualMachine` definition using the `network-vhostuser-binding` plugin and the device plugin resource requests:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vhostuser-vm
  namespace: tests
spec:
  running: true
  template:
    metadata:
      labels:
        kubevirt.io/domain: vhostuser-vm
    spec:
      architecture: amd64
      domain:
        cpu:
          cores: 4
        devices:
          disks:
          - disk:
              bus: virtio
            name: containerdisk
          interfaces:
          - masquerade: {}
            name: default
          - binding:
              name: vhostuser
            macAddress: ca:fe:ca:fe:42:42
            name: net1
          networkInterfaceMultiqueue: true
        machine:
          type: q35
        memory:
          hugepages:
            pageSize: 1Gi
        resources:
          limits:
            vhostuser/sockets: 1
          requests:
            memory: 2Gi
            vhostuser/sockets: 1
      networks:
      - name: default
        pod: {}
      - multus:
          networkName: vhostuser-network
        name: net1
      nodeSelector:
        node-class: dpdk
      volumes:
      - containerDisk:
          image: os-container-disk-40g
        name: containerdisk
```

## Scalability
(overview of how the design scales)

## Update/Rollback Compatibility
The KubeVirt Network Binding Plugin relies on the `hooks/v1alpha3` API for a clean termination of the `network-vhostuser-binding` container in the `virt-launcher` pod.

## Functional Testing Approach
Create a VM with several `vhostuser` interfaces, then:
- check the generated domain XML contains all interfaces with the appropriate configuration
- check the `vhostuser` sockets are created in the expected directory of the `virt-launcher` pod
- check the `vhostuser` sockets are available to the dataplane pod
- check the VM is running
- check VM network connectivity
- live migrate the VM
- check the VM is migrated and is running
- check VM network connectivity

# Implementation Phases
- [ ] Implement network binding plugin sharedDir spec in KubeVirt
- [x] First implementation of the `network-vhostuser-binding`
- [x] Implement the vhostuser device plugin, based on [generic-device-plugin](https://github.com/squat/generic-device-plugin)
- [ ] Upstream `network-vhostuser-binding`