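The kvm API is a set of ioctls that are issued to control various aspects of a virtual machine. The ioctls fall into four classes:

1. System ioctls, issued on the /dev/kvm handle.
2. VM ioctls, issued on a VM file descriptor.
3. vcpu ioctls, issued on a vcpu file descriptor.
4. Device ioctls, issued on a device file descriptor.

The kvm API is centered around file descriptors. An initial open("/dev/kvm") obtains a handle to the kvm subsystem; this handle can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this handle creates a VM file descriptor, which can be used to issue VM ioctls (the second class). A KVM_CREATE_VCPU or KVM_CREATE_DEVICE ioctl on a VM fd creates a virtual cpu or a virtual device and returns the corresponding file descriptor. The vcpu and device fds are then used to issue the matching ioctls (the third and fourth classes).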
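Note that while VM ioctls may only be issued from the process that created the VM, a VM's lifecycle is associated with its file descriptor, not with its creator (process). In other words, the VM and its resources, including the associated address space, are not freed until the last reference to the VM's file descriptor has been released. For example, if fork() is issued after ioctl(KVM_CREATE_VM), the VM is not freed until both the parent (original) process and its child have released their references to the VM's file descriptor.
Because a VM's resources are not freed until every reference to its file descriptor is released, creating additional references to a VM via fork(), dup(), etc. without careful consideration is strongly discouraged. It may have unwanted side effects: for example, memory allocated by the VM's process may not be freed or unaccounted when the VM is shut down.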
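The descriptor hierarchy described above can be sketched as follows. This is a minimal illustration, not a complete VMM: the helper name `kvm_open_vm_with_vcpu` is hypothetical, machine type 0 and vcpu id 0 are assumed, and no guest memory is set up.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: opens /dev/kvm, then creates one VM and one vcpu.
 * Returns 0 and fills the three descriptors on success; returns -1 after
 * closing whatever was already opened on failure. */
int kvm_open_vm_with_vcpu(int *kvm_fd, int *vm_fd, int *vcpu_fd)
{
    assert(kvm_fd && vm_fd && vcpu_fd);

    *kvm_fd = open("/dev/kvm", O_RDWR);              /* system ioctls go here */
    if (*kvm_fd < 0)
        return -1;

    *vm_fd = ioctl(*kvm_fd, KVM_CREATE_VM, 0UL);     /* VM ioctls go here */
    if (*vm_fd < 0) {
        close(*kvm_fd);
        return -1;
    }

    *vcpu_fd = ioctl(*vm_fd, KVM_CREATE_VCPU, 0UL);  /* vcpu ioctls go here */
    if (*vcpu_fd < 0) {
        close(*vm_fd);
        close(*kvm_fd);
        return -1;
    }
    return 0;
}
```

Closing all three descriptors (and any duplicates inherited through fork() or dup()) is what finally releases the VM.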
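As of Linux 2.6.22, the KVM ABI has been stable: no backward-incompatible changes are allowed. There is, however, an extension facility that allows backward-compatible extensions to the API to be queried and used.
The extension mechanism is not based on the Linux version number. Instead, kvm defines extension identifiers and a facility to query whether a particular extension identifier is available. If it is, the corresponding ioctls are available.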
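This section describes the ioctls used to control kvm guests. For each ioctl, the following information is provided along with a description:

- Capability: which KVM extension provides the ioctl. "basic" means it is provided by any kernel that supports API version 12; a KVM_CAP_xyz constant means availability must be checked with KVM_CHECK_EXTENSION. An ioctl that the kernel does not support fails with ENOTTY.
- Architectures: which instruction set architectures provide the ioctl.
- Type: system, vm, or vcpu.
- Parameters: the parameters accepted by the ioctl.
- Returns: the return value. General error numbers (EBADF, ENOMEM, EINVAL) are not detailed, but errors with specific meanings are.

Guest and host: the host is the machine that hosts and manages virtual machines, and a guest is a virtual machine running on it. The two terms are defined relative to each other.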
4.1 KVM_GET_API_VERSION
Returns the version of the kvm API.
Capability: basic
Architectures: all
Type: system ioctl
Parameters: none
Returns: the constant KVM_API_VERSION (=12)
The expected return value is 12. When the ioctl returns 12, every ioctl whose Capability is "basic" is available.
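A minimal sketch of the version check, assuming `kvm_fd` came from open("/dev/kvm"). The function name `kvm_api_version_ok` is hypothetical.

```c
#include <assert.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Returns 1 when the kernel reports the stable API version, 0 otherwise
 * (including when the ioctl itself fails). */
int kvm_api_version_ok(int kvm_fd)
{
    /* The header constant should match the value quoted in the text. */
    static_assert(KVM_API_VERSION == 12, "stable kvm API version is 12");

    int version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0UL);
    return version == KVM_API_VERSION;
}
```

An application would typically refuse to continue when this returns 0.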
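4.2 KVM_CREATE_VM
Creates a virtual machine.
Capability: basic
Architectures: all
Type: system ioctl
Parameters: machine type identifier (KVM_VM_*)
Returns: a VM fd that can be used to control the new virtual machine.
The new VM has no virtual CPUs and no memory. In most cases, pass 0 as the machine type.
To create a user-controlled virtual machine on s390, first check KVM_CAP_S390_UCONTROL and then pass the flag `KVM_VM_S390_UCONTROL` as a privileged user.
KVM_CREATE_VM itself is a basic capability, but this s390-specific machine type is optional, which is why KVM_CAP_S390_UCONTROL must be checked before using it.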
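On arm64, the physical address size of a VM (the IPA size limit) is 40 bits (1 TB) by default. The limit can be configured if the host supports the extension KVM_CAP_ARM_VM_IPA_SIZE. When it is supported, use KVM_VM_TYPE_ARM_IPA_SIZE(IPA_Bits) to set the size in the machine type identifier.
For example, on an arm board known to support KVM_CAP_ARM_VM_IPA_SIZE, a 48-bit guest physical address space can be requested like this:
    dev_fd = open("/dev/kvm", O_RDWR);
    vm_fd = ioctl(dev_fd, KVM_CREATE_VM, KVM_VM_TYPE_ARM_IPA_SIZE(48));
If the requested IPA size is not supported, creating the VM fails.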
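Rather than hard-coding 48, the machine type can be chosen from the host limit. The sketch below assumes that KVM_CHECK_EXTENSION(KVM_CAP_ARM_VM_IPA_SIZE) returns the largest IPA width the host accepts (0 when the extension is absent), which is my understanding of the kernel documentation. The function name `arm64_vm_type_for_ipa` is hypothetical.

```c
#include <assert.h>
#include <linux/kvm.h>

/* host_limit: result of KVM_CHECK_EXTENSION(KVM_CAP_ARM_VM_IPA_SIZE).
 * wanted_bits: IPA width the VMM would like to use.
 * Returns the machine type for KVM_CREATE_VM, or -1 if the request
 * cannot be satisfied on this host. */
long arm64_vm_type_for_ipa(int host_limit, int wanted_bits)
{
    assert(wanted_bits > 0);

    if (host_limit <= 0)                 /* extension absent: 40-bit default only */
        return wanted_bits == 40 ? 0 : -1;
    if (wanted_bits > host_limit)        /* KVM_CREATE_VM would fail */
        return -1;
    return (long)KVM_VM_TYPE_ARM_IPA_SIZE(wanted_bits);
}
```

A caller would pass the result straight to ioctl(dev_fd, KVM_CREATE_VM, type) when it is not -1.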
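4.3 KVM_GET_MSR_INDEX_LIST, KVM_GET_MSR_FEATURE_INDEX_LIST
Capability: basic, KVM_CAP_GET_MSR_FEATURES for KVM_GET_MSR_FEATURE_INDEX_LIST
Architectures: x86
Type: system ioctl
Parameters: struct kvm_msr_list (in/out)
Returns: 0 on success; -1 on error
Errors:

| Error | Meaning |
|---|---|
| EFAULT | the msr index list cannot be read from or written to |
| E2BIG | the msr index list is too big to fit in the array specified by the user |

struct kvm_msr_list {
    __u32 nmsrs; /* number of msrs in entries */
    __u32 indices[0];
};
The user fills in the size of the indices array in nmsrs. In return, kvm adjusts nmsrs to reflect the actual number of msrs and fills in the indices array with their numbers.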
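Because kvm writes the real count back into nmsrs, a common pattern is to call the ioctl twice: once with an empty array to learn the count, then again with a buffer of the right size. The sketch below is x86 only and assumes the first call fails with E2BIG while still updating nmsrs. The helper name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <linux/kvm.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Returns a malloc()ed list of supported MSR indices, or NULL on error.
 * The caller frees the result. */
struct kvm_msr_list *kvm_get_msr_index_list(int kvm_fd)
{
    struct kvm_msr_list probe = { .nmsrs = 0 };
    struct kvm_msr_list *list;

    /* First call: a zero-length array is too small, so E2BIG is expected
     * and probe.nmsrs now holds the required number of entries. */
    if (ioctl(kvm_fd, KVM_GET_MSR_INDEX_LIST, &probe) == 0 || errno != E2BIG)
        return NULL;

    list = calloc(1, sizeof(*list) + probe.nmsrs * sizeof(list->indices[0]));
    if (!list)
        return NULL;
    list->nmsrs = probe.nmsrs;

    /* Second call: the array is large enough, indices get filled in. */
    if (ioctl(kvm_fd, KVM_GET_MSR_INDEX_LIST, list) < 0) {
        free(list);
        return NULL;
    }
    assert(list->nmsrs <= probe.nmsrs);
    return list;
}
```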
4.4 KVM_CHECK_EXTENSION
Queries whether the running kernel supports a given extension of the kvm API.
Capability: basic, KVM_CAP_CHECK_EXTENSION_VM for vm ioctl
Architectures: all
Type: system ioctl, vm ioctl
Parameters: extension identifier (KVM_CAP_*)
Returns: 0 if unsupported; 1 (or some other positive integer) if supported
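A small wrapper makes the "0 means unsupported" convention easy to use. This is a sketch: the name `kvm_check_cap` is hypothetical, and mapping an ioctl failure to 0 is a design choice of this example rather than something the API requires.

```c
#include <assert.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Returns 0 if the extension is unsupported (or the query failed),
 * otherwise the positive value reported by the kernel. */
int kvm_check_cap(int fd, long cap)
{
    assert(cap >= 0);

    int ret = ioctl(fd, KVM_CHECK_EXTENSION, cap);
    return ret < 0 ? 0 : ret;
}
```

For example, `kvm_check_cap(kvm_fd, KVM_CAP_USER_MEMORY)` can be tested before relying on KVM_SET_USER_MEMORY_REGION. Some capabilities return a meaningful number rather than 1, so the raw value is passed through.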
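4.5 KVM_GET_VCPU_MMAP_SIZE
Returns the size of the vcpu control region.
The KVM_RUN ioctl communicates with userspace through a shared memory region. This ioctl returns the size of that region.
Capability: basic
Architectures: all
Type: system ioctl
Parameters: none
Returns: size of vcpu mmap area, in bytes
This region is the per-vcpu memory through which userspace controls a vcpu. Its size is a property of the kvm subsystem rather than of an individual vcpu, so this is a system ioctl: it must be issued on dev_fd (the /dev/kvm handle), not on vcpu_fd.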
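The two descriptors are easy to mix up, so here is a minimal sketch: the size is queried on the /dev/kvm descriptor, while the mapping itself is made on the vcpu descriptor at offset 0. The helper name `kvm_map_run` is hypothetical.

```c
#include <assert.h>
#include <linux/kvm.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Maps the shared struct kvm_run of a vcpu. Returns NULL on failure;
 * on success *size_out holds the length to pass to munmap() later. */
struct kvm_run *kvm_map_run(int kvm_fd, int vcpu_fd, size_t *size_out)
{
    assert(size_out);

    /* System ioctl: issued on the /dev/kvm fd, not on the vcpu fd. */
    int size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0UL);
    if (size <= 0)
        return NULL;

    /* The mapping is created on the vcpu fd. */
    void *p = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, vcpu_fd, 0);
    if (p == MAP_FAILED)
        return NULL;

    *size_out = (size_t)size;
    return p;
}
```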
Besides the KVM_RUN communication region, other areas of the vcpu file descriptor can be mmap()ed:

- If KVM_CAP_COALESCED_MMIO is available, the page at KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE.
- If KVM_CAP_DIRTY_LOG_RING is available, a number of pages at KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. Section 8.3 describes this in more detail.

4.6 KVM_SET_MEMORY_REGION
This ioctl is obsolete and has been removed.
Capability: basic
Architectures: all
Type: vm ioctl
Parameters: struct kvm_memory_region (in)
Returns: 0 on success, -1 on error
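4.7 KVM_CREATE_VCPU
Creates a vcpu.
Capability: basic
Architectures: all
Type: vm ioctl
Parameters: vcpu id (apic id on x86)
Returns: vcpu fd on success, -1 on error
Adds a vcpu to a virtual machine. No more than max_vcpus may be added. The vcpu id is an integer in the range [0, max_vcpu_id).
The recommended max_vcpus value can be retrieved at run time by passing KVM_CAP_NR_VCPUS to the KVM_CHECK_EXTENSION ioctl. The maximum possible max_vcpus value can be retrieved by passing KVM_CAP_MAX_VCPUS.
If KVM_CAP_NR_VCPUS does not exist, assume that max_vcpus is 4. If KVM_CAP_MAX_VCPUS does not exist, assume that max_vcpus is the same as the value returned for KVM_CAP_NR_VCPUS.
The maximum possible value for max_vcpu_id can be retrieved at run time by passing KVM_CAP_MAX_VCPU_ID to the KVM_CHECK_EXTENSION ioctl.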
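The fallback rules above can be written down as a small pure function. The struct and function names are hypothetical; the inputs are the raw KVM_CHECK_EXTENSION results, where 0 means the capability does not exist.

```c
#include <assert.h>

struct vcpu_limits {
    int recommended;   /* from KVM_CAP_NR_VCPUS, or the fallback of 4 */
    int maximum;       /* from KVM_CAP_MAX_VCPUS, or the recommended value */
};

/* nr_vcpus:  KVM_CHECK_EXTENSION(KVM_CAP_NR_VCPUS)
 * max_vcpus: KVM_CHECK_EXTENSION(KVM_CAP_MAX_VCPUS) */
struct vcpu_limits kvm_vcpu_limits(int nr_vcpus, int max_vcpus)
{
    struct vcpu_limits l;

    assert(nr_vcpus >= 0 && max_vcpus >= 0);
    l.recommended = nr_vcpus > 0 ? nr_vcpus : 4;
    l.maximum = max_vcpus > 0 ? max_vcpus : l.recommended;
    return l;
}
```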
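4.8 KVM_GET_DIRTY_LOG
Gets the bitmap of dirty pages.
Capability: basic
Architectures: all
Type: vm ioctl
Parameters: struct kvm_dirty_log (in/out)
Returns: 0 on success, -1 on error
Given a memory slot, returns a bitmap containing every page dirtied since the last call to this ioctl. Bit 0 is the first page in the memory slot.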
(Make sure the entire kvm_dirty_log structure is cleared to avoid padding issues.)
/* for KVM_GET_DIRTY_LOG */
struct kvm_dirty_log {
    __u32 slot;
    __u32 padding;
    union {
        void __user *dirty_bitmap; /* one bit per page */
        __u64 padding;
    };
};
If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 of the slot field specify the address space for which the dirty bitmap is returned. See KVM_SET_USER_MEMORY_REGION for details on how the slot field is used.
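Two small helpers illustrate the slot encoding and the bitmap layout. Both names are hypothetical. The bitmap test indexes by byte and assumes bit n of the bitmap lives in byte n/8 at bit position n%8 (little-endian bit numbering), which I believe is what KVM produces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Builds the slot field: bits 0-15 hold the slot id, bits 16-31 hold the
 * address space (only meaningful with KVM_CAP_MULTI_ADDRESS_SPACE). */
uint32_t kvm_slot_field(uint16_t as_id, uint16_t slot_id)
{
    return ((uint32_t)as_id << 16) | slot_id;
}

/* Returns 1 if the given page of the slot is marked dirty in the bitmap
 * returned by KVM_GET_DIRTY_LOG. Page 0 is the first page of the slot. */
int page_is_dirty(const uint8_t *bitmap, size_t page)
{
    assert(bitmap);
    return (bitmap[page / 8] >> (page % 8)) & 1;
}
```

The bitmap buffer needs one bit per page of the slot, so its size follows from the slot's memory_size and the page size.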
4.9 KVM_SET_MEMORY_ALIAS
This ioctl is obsolete and has been removed.
4.10 KVM_RUN
Runs a guest vcpu.
Capability: basic
Architectures: all
Type: vcpu ioctl
Parameters: none
Returns: 0 on success, -1 on error
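Although this ioctl takes no explicit parameters, it relies on the struct kvm_run block shared with the kernel, which must be set up beforehand. The block is mapped from the vcpu fd, and its size is returned by KVM_GET_VCPU_MMAP_SIZE.
4.11 KVM_GET_REGS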
Capability: basic
Architectures: all except arm64
Type: vcpu ioctl
Parameters: struct kvm_regs (out)
Returns: 0 on success, -1 on error
Reads the general purpose registers from the vcpu.
/* x86 */
struct kvm_regs {
    /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
    __u64 rax, rbx, rcx, rdx;
    __u64 rsi, rdi, rsp, rbp;
    __u64 r8, r9, r10, r11;
    __u64 r12, r13, r14, r15;
    __u64 rip, rflags;
};

/* mips */
struct kvm_regs {
    /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
    __u64 gpr[32];
    __u64 hi;
    __u64 lo;
    __u64 pc;
};
4.12 KVM_SET_REGS
Capability: basic
Architectures: all except arm64
Type: vcpu ioctl
Parameters: struct kvm_regs (in)
Returns: 0 on success, -1 on error
Writes the general purpose registers into the vcpu.
See KVM_GET_REGS for the data structure.
4.13 KVM_GET_SREGS
Capability: basic
Architectures: x86, ppc
Type: vcpu ioctl
Parameters: struct kvm_sregs (out)
Returns: 0 on success, -1 on error
Reads special registers from the vcpu.
/* x86 */
struct kvm_sregs {
    struct kvm_segment cs, ds, es, fs, gs, ss;
    struct kvm_segment tr, ldt;
    struct kvm_dtable gdt, idt;
    __u64 cr0, cr2, cr3, cr4, cr8;
    __u64 efer;
    __u64 apic_base;
    __u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64];
};

/* ppc -- see arch/powerpc/include/uapi/asm/kvm.h */
interrupt_bitmap is a bitmap of pending external interrupts. At most one bit may be set. This interrupt has been acknowledged by the APIC but not yet injected into the cpu core.
4.14 KVM_SET_SREGS
Capability: basic
Architectures: x86, ppc
Type: vcpu ioctl
Parameters: struct kvm_sregs (in)
Returns: 0 on success, -1 on error
Writes special registers into the vcpu. See KVM_GET_SREGS for the data structures.
4.15 KVM_TRANSLATE
Capability: basic
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_translation (in/out)
Returns: 0 on success, -1 on error
Translates a virtual address according to the vcpu’s current address translation mode.
struct kvm_translation {
    /* in */
    __u64 linear_address;
    /* out */
    __u64 physical_address;
    __u8 valid;
    __u8 writeable;
    __u8 usermode;
    __u8 pad[5];
};
4.16 KVM_INTERRUPT
Capability: basic
Architectures: x86, ppc, mips, riscv
Type: vcpu ioctl
Parameters: struct kvm_interrupt (in)
Returns: 0 on success, negative on failure.
Queues a hardware interrupt vector to be injected.
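/* for KVM_INTERRUPT */
struct kvm_interrupt {
    /* in */
    __u32 irq;
};

For x86:

Returns:

| Value | Meaning |
|---|---|
| 0 | on success |
| -EEXIST | if an interrupt is already enqueued |
| -EINVAL | the irq number is invalid |
| -ENXIO | if the PIC is in the kernel |
| -EFAULT | if the pointer is invalid |

Note ‘irq’ is an interrupt vector, not an interrupt pin or line. This ioctl is useful if the in-kernel PIC is not used.

For ppc:

Queues an external interrupt to be injected. This ioctl is overloaded with 3 different irq values:

- KVM_INTERRUPT_SET: injects an edge type external interrupt into the guest once it is ready to receive interrupts.
- KVM_INTERRUPT_UNSET: unsets any pending interrupt.
- KVM_INTERRUPT_SET_LEVEL: injects a level type external interrupt, which stays pending until KVM_INTERRUPT_UNSET is issued.

Note that any value for ‘irq’ other than the ones stated above is invalid and incurs unexpected behavior.
This is an asynchronous vcpu ioctl and can be invoked from any thread.

For mips:

Queues an external interrupt to be injected into the virtual CPU. A negative interrupt number dequeues the interrupt.
This is an asynchronous vcpu ioctl and can be invoked from any thread.

For riscv:

Queues an external interrupt to be injected into the virtual CPU. This ioctl is overloaded with 2 different irq values:

- KVM_INTERRUPT_SET: sets an external interrupt for the virtual CPU.
- KVM_INTERRUPT_UNSET: clears a pending external interrupt for the virtual CPU.

This is an asynchronous vcpu ioctl and can be invoked from any thread.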
4.17 KVM_DEBUG_GUEST
Capability: basic
Architectures: none
Type: vcpu ioctl
Parameters: none
Returns: -1 on error
Support for this has been removed. Use KVM_SET_GUEST_DEBUG instead.
4.18 KVM_GET_MSRS
Capability: basic (vcpu), KVM_CAP_GET_MSR_FEATURES (system)
Architectures: x86
Type: system ioctl, vcpu ioctl
Parameters: struct kvm_msrs (in/out)
Returns: number of msrs successfully returned; -1 on error
When used as a system ioctl: Reads the values of MSR-based features that are available for the VM. This is similar to KVM_GET_SUPPORTED_CPUID, but it returns MSR indices and values. The list of msr-based features can be obtained using KVM_GET_MSR_FEATURE_INDEX_LIST in a system ioctl.
When used as a vcpu ioctl: Reads model-specific registers from the vcpu. Supported msr indices can be obtained using KVM_GET_MSR_INDEX_LIST in a system ioctl.
struct kvm_msrs {
    __u32 nmsrs; /* number of msrs in entries */
    __u32 pad;
    struct kvm_msr_entry entries[0];
};

struct kvm_msr_entry {
    __u32 index;
    __u32 reserved;
    __u64 data;
};
Application code should set the ‘nmsrs’ member (which indicates the size of the entries array) and the ‘index’ member of each array entry. kvm will fill in the ‘data’ member.
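As a sketch of the calling convention just described (x86 only), the helper below allocates a struct kvm_msrs with room for n entries, fills in the indices, and copies the data members back out. The function name `kvm_read_msrs` is hypothetical.

```c
#include <assert.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Reads n MSRs from a vcpu (or MSR-based features from the system fd).
 * Returns the number of MSRs actually read, or -1 on error. Only the
 * first `return value` elements of values[] are written. */
int kvm_read_msrs(int fd, const uint32_t *indices, uint32_t n, uint64_t *values)
{
    struct kvm_msrs *msrs;
    int got;

    assert(indices && values);

    /* Header plus n trailing entries (flexible array). */
    msrs = calloc(1, sizeof(*msrs) + n * sizeof(struct kvm_msr_entry));
    if (!msrs)
        return -1;

    msrs->nmsrs = n;
    for (uint32_t i = 0; i < n; i++)
        msrs->entries[i].index = indices[i];

    got = ioctl(fd, KVM_GET_MSRS, msrs);
    for (int i = 0; i < got; i++)
        values[i] = msrs->entries[i].data;

    free(msrs);
    return got;
}
```

A return value smaller than n means the kernel stopped early, so the caller should compare the two.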
4.19 KVM_SET_MSRS
Unlike KVM_GET_MSRS, this ioctl is available only as a vcpu ioctl.
Capability: basic
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_msrs (in)
Returns: number of msrs successfully set (see below), -1 on error
Writes model-specific registers to the vcpu. See KVM_GET_MSRS for the data structures.
Application code should set the ‘nmsrs’ member (which indicates the size of the entries array), and the ‘index’ and ‘data’ members of each array entry.
It tries to set the MSRs in array entries[] one by one. If setting an MSR fails, e.g., due to setting reserved bits, the MSR isn’t supported/emulated by KVM, etc…, it stops processing the MSR list and returns the number of MSRs that have been set successfully.
4.20 KVM_SET_CPUID
Capability: basic
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_cpuid (in)
Returns: 0 on success, -1 on error
Defines the vcpu responses to the cpuid instruction. Applications should use the KVM_SET_CPUID2 ioctl if available.
Caveat emptor:
struct kvm_cpuid_entry {
    __u32 function;
    __u32 eax;
    __u32 ebx;
    __u32 ecx;
    __u32 edx;
    __u32 padding;
};

/* for KVM_SET_CPUID */
struct kvm_cpuid {
    __u32 nent;
    __u32 padding;
    struct kvm_cpuid_entry entries[0];
};
4.21 KVM_SET_SIGNAL_MASK
Capability: basic
Architectures: all
Type: vcpu ioctl
Parameters: struct kvm_signal_mask (in)
Returns: 0 on success, -1 on error
Defines which signals are blocked during execution of KVM_RUN. This signal mask temporarily overrides the thread's signal mask. Any unblocked signal received (except SIGKILL and SIGSTOP, which retain their traditional behaviour) will cause KVM_RUN to return with -EINTR.
Note the signal will only be delivered if not blocked by the original signal mask.
/* for KVM_SET_SIGNAL_MASK */
struct kvm_signal_mask {
    __u32 len;
    __u8 sigset[0];
};
4.22 KVM_GET_FPU
Capability: basic
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_fpu (out)
Returns: 0 on success, -1 on error
Reads the floating point state from the vcpu.
/* for KVM_GET_FPU and KVM_SET_FPU */
struct kvm_fpu {
    __u8 fpr[8][16];
    __u16 fcw;
    __u16 fsw;
    __u8 ftwx; /* in fxsave format */
    __u8 pad1;
    __u16 last_opcode;
    __u64 last_ip;
    __u64 last_dp;
    __u8 xmm[16][16];
    __u32 mxcsr;
    __u32 pad2;
};
4.23 KVM_SET_FPU
Capability: basic
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_fpu (in)
Returns: 0 on success, -1 on error
Writes the floating point state to the vcpu.
See KVM_GET_FPU for the data structure.
4.24 KVM_CREATE_IRQCHIP
Capability: KVM_CAP_IRQCHIP, KVM_CAP_S390_IRQCHIP (s390)
Architectures: x86, arm64, s390
Type: vm ioctl
Parameters: none
Returns: 0 on success, -1 on error
Creates an interrupt controller model in the kernel. On x86, creates a virtual ioapic, a virtual PIC (two PICs, nested), and sets up future vcpus to have a local APIC. IRQ routing for GSIs 0-15 is set to both PIC and IOAPIC; GSI 16-23 only go to the IOAPIC. On arm64, a GICv2 is created. Any other GIC versions require the usage of KVM_CREATE_DEVICE, which also supports creating a GICv2. Using KVM_CREATE_DEVICE is preferred over KVM_CREATE_IRQCHIP for GICv2. On s390, a dummy irq routing table is created.
Note that on s390 the KVM_CAP_S390_IRQCHIP vm capability needs to be enabled before KVM_CREATE_IRQCHIP can be used.
4.25 KVM_IRQ_LINE
Capability: KVM_CAP_IRQCHIP
Architectures: x86, arm64
Type: vm ioctl
Parameters: struct kvm_irq_level
Returns: 0 on success, -1 on error
Sets the level of a GSI input to the interrupt controller model in the kernel. On some architectures it is required that an interrupt controller model has been previously created with KVM_CREATE_IRQCHIP. Note that edge-triggered interrupts require the level to be set to 1 and then back to 0.
On real hardware, interrupt pins can be active-low or active-high. This does not matter for the level field of struct kvm_irq_level: 1 always means active (asserted), 0 means inactive (deasserted).
x86 allows the operating system to program the interrupt polarity (active-low/active-high) for level-triggered interrupts, and KVM used to consider the polarity. However, due to bitrot in the handling of active-low interrupts, the above convention is now valid on x86 too. This is signaled by KVM_CAP_X86_IOAPIC_POLARITY_IGNORED. Userspace should not present interrupts to the guest as active-low unless this capability is present (or unless it is not using the in-kernel irqchip, of course).
arm64 can signal an interrupt either at the CPU level, or at the in-kernel irqchip (GIC), and for in-kernel irqchip can tell the GIC to use PPIs designated for specific cpus. The irq field is interpreted like this:
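| bits | 31 ... 28 | 27 ... 24 | 23 ... 16 | 15 ... 0 |
|---|---|---|---|---|
| field | vcpu2_index | irq_type | vcpu_index | irq_id |

The irq_type field has the following values:

- KVM_ARM_IRQ_TYPE_CPU: out-of-kernel GIC: irq_id 0 is IRQ, irq_id 1 is FIQ
- KVM_ARM_IRQ_TYPE_SPI: in-kernel GIC: SPI, irq_id between 32 and 1019 (incl.) (the vcpu_index field is ignored)
- KVM_ARM_IRQ_TYPE_PPI: in-kernel GIC: PPI, irq_id between 16 and 31 (incl.)

(The irq_id field thus corresponds nicely to the IRQ ID in the ARM GIC specs)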
In both cases, level is used to assert/deassert the line.
When KVM_CAP_ARM_IRQ_LINE_LAYOUT_2 is supported, the target vcpu is identified as (256 * vcpu2_index + vcpu_index). Otherwise, vcpu2_index must be zero.
Note that on arm64, the KVM_CAP_IRQCHIP capability only conditions injection of interrupts for the in-kernel irqchip. KVM_IRQ_LINE can always be used for a userspace interrupt controller.
struct kvm_irq_level {
    union {
        __u32 irq; /* GSI */
        __s32 status; /* not used for KVM_IRQ_LEVEL */
    };
    __u32 level; /* 0 or 1 */
};
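Two pieces of this section lend themselves to a short sketch: packing the arm64 irq field from the bit layout, and pulsing a line (1 then 0) for an edge-triggered interrupt. Both function names are hypothetical. The irq_type value is passed in by the caller rather than hard-coded, and the 256 * vcpu2_index + vcpu_index split assumes KVM_CAP_ARM_IRQ_LINE_LAYOUT_2.

```c
#include <assert.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Packs the arm64 irq field:
 * [31:28] vcpu2_index, [27:24] irq_type, [23:16] vcpu_index, [15:0] irq_id */
uint32_t arm_irq_field(unsigned type, unsigned vcpu, unsigned irq_id)
{
    assert(type <= 0xf && vcpu < 16 * 256 && irq_id <= 0xffff);

    return ((uint32_t)(vcpu / 256) << 28) |
           ((uint32_t)type << 24) |
           ((uint32_t)(vcpu % 256) << 16) |
           (uint32_t)irq_id;
}

/* Raises and then lowers a line, as required for edge-triggered
 * interrupts. Returns 0 on success, -1 if either ioctl fails. */
int kvm_pulse_irq(int vm_fd, uint32_t irq)
{
    struct kvm_irq_level line = { .irq = irq, .level = 1 };

    if (ioctl(vm_fd, KVM_IRQ_LINE, &line) < 0)
        return -1;
    line.level = 0;
    return ioctl(vm_fd, KVM_IRQ_LINE, &line) < 0 ? -1 : 0;
}
```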
4.26 KVM_GET_IRQCHIP
Capability: KVM_CAP_IRQCHIP
Architectures: x86
Type: vm ioctl
Parameters: struct kvm_irqchip (in/out)
Returns: 0 on success, -1 on error
Reads the state of a kernel interrupt controller created with KVM_CREATE_IRQCHIP into a buffer provided by the caller.
struct kvm_irqchip {
    __u32 chip_id; /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */
    __u32 pad;
    union {
        char dummy[512]; /* reserving space */
        struct kvm_pic_state pic;
        struct kvm_ioapic_state ioapic;
    } chip;
};
4.27 KVM_SET_IRQCHIP
Capability: KVM_CAP_IRQCHIP
Architectures: x86
Type: vm ioctl
Parameters: struct kvm_irqchip (in)
Returns: 0 on success, -1 on error
Sets the state of a kernel interrupt controller created with KVM_CREATE_IRQCHIP from a buffer provided by the caller.
See KVM_GET_IRQCHIP for the data structure.
4.28 KVM_XEN_HVM_CONFIG
Capability: KVM_CAP_XEN_HVM
Architectures: x86
Type: vm ioctl
Parameters: struct kvm_xen_hvm_config (in)
Returns: 0 on success, -1 on error
Sets the MSR that the Xen HVM guest uses to initialize its hypercall page, and provides the starting address and size of the hypercall blobs in userspace. When the guest writes the MSR, kvm copies one page of a blob (32- or 64-bit, depending on the vcpu mode) to guest memory.
struct kvm_xen_hvm_config {
    __u32 flags;
    __u32 msr;
    __u64 blob_addr_32;
    __u64 blob_addr_64;
    __u8 blob_size_32;
    __u8 blob_size_64;
    __u8 pad2[30];
};
If certain flags are returned from the KVM_CAP_XEN_HVM check, they may be set in the flags field of this ioctl:
The KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL flag requests KVM to generate the contents of the hypercall page automatically; hypercalls will be intercepted and passed to userspace through KVM_EXIT_XEN. In this case, all of the blob size and address fields must be zero.
The KVM_XEN_HVM_CONFIG_EVTCHN_SEND flag indicates to KVM that userspace will always use the KVM_XEN_HVM_EVTCHN_SEND ioctl to deliver event channel interrupts rather than manipulating the guest’s shared_info structures directly. This, in turn, may allow KVM to enable features such as intercepting the SCHEDOP_poll hypercall to accelerate PV spinlock operation for the guest. Userspace may still use the ioctl to deliver events if it was advertised, even if userspace does not send this indication that it will always do so.
No other flags are currently valid in the struct kvm_xen_hvm_config.
4.29 KVM_GET_CLOCK
Capability: KVM_CAP_ADJUST_CLOCK
Architectures: x86
Type: vm ioctl
Parameters: struct kvm_clock_data (out)
Returns: 0 on success, -1 on error
Gets the current timestamp of kvmclock as seen by the current guest. In conjunction with KVM_SET_CLOCK, it is used to ensure monotonicity on scenarios such as migration.
When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the set of bits that KVM can return in struct kvm_clock_data’s flag member.
The following flags are defined:
KVM_CLOCK_TSC_STABLE
If set, the returned value is the exact kvmclock value seen by all VCPUs at the instant when KVM_GET_CLOCK was called. If clear, the returned value is simply CLOCK_MONOTONIC plus a constant offset; the offset can be modified with KVM_SET_CLOCK. KVM will try to make all VCPUs follow this clock, but the exact value read by each VCPU could differ, because the host TSC is not stable.
KVM_CLOCK_REALTIME
If set, the realtime field in the kvm_clock_data structure is populated with the value of the host’s real time clocksource at the instant when KVM_GET_CLOCK was called. If clear, the realtime field does not contain a value.
KVM_CLOCK_HOST_TSC
If set, the host_tsc field in the kvm_clock_data structure is populated with the value of the host’s timestamp counter (TSC) at the instant when KVM_GET_CLOCK was called. If clear, the host_tsc field does not contain a value.
struct kvm_clock_data {
    __u64 clock; /* kvmclock current value */
    __u32 flags;
    __u32 pad0;
    __u64 realtime;
    __u64 host_tsc;
    __u32 pad[4];
};
4.30 KVM_SET_CLOCK
Capability: KVM_CAP_ADJUST_CLOCK
Architectures: x86
Type: vm ioctl
Parameters: struct kvm_clock_data (in)
Returns: 0 on success, -1 on error
Sets the current timestamp of kvmclock to the value specified in its parameter. In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios such as migration.
The following flags can be passed:
KVM_CLOCK_REALTIME
If set, KVM will compare the value of the realtime field with the value of the host’s real time clocksource at the instant when KVM_SET_CLOCK was called. The difference in elapsed time is added to the final kvmclock value that will be provided to guests.
Other flags returned by KVM_GET_CLOCK are accepted but ignored.
See KVM_GET_CLOCK for the data structure.
4.31 KVM_GET_VCPU_EVENTS
Capability: KVM_CAP_VCPU_EVENTS
Extended by: KVM_CAP_INTR_SHADOW
Architectures: x86, arm64
Type: vcpu ioctl
Parameters: struct kvm_vcpu_events (out)
Returns: 0 on success, -1 on error
Gets currently pending exceptions, interrupts, and NMIs as well as related states of the vcpu.
struct kvm_vcpu_events {
    struct {
        __u8 injected;
        __u8 nr;
        __u8 has_error_code;
        __u8 pending;
        __u32 error_code;
    } exception;
    struct {
        __u8 injected;
        __u8 nr;
        __u8 soft;
        __u8 shadow;
    } interrupt;
    struct {
        __u8 injected;
        __u8 pending;
        __u8 masked;
        __u8 pad;
    } nmi;
    __u32 sipi_vector;
    __u32 flags;
    struct {
        __u8 smm;
        __u8 pending;
        __u8 smm_inside_nmi;
        __u8 latched_init;
    } smi;
    __u8 reserved[27];
    __u8 exception_has_payload;
    __u64 exception_payload;
};
The following bits are defined in the flags field:
For arm64:
If the guest accesses a device that is being emulated by the host kernel in such a way that a real device would generate a physical SError, KVM may make a virtual SError pending for that VCPU. This system error interrupt remains pending until the guest takes the exception by unmasking PSTATE.A.
Running the VCPU may cause it to take a pending SError, or make an access that causes an SError to become pending. The event’s description is only valid while the VCPU is not running.
This API provides a way to read and write the pending ‘event’ state that is not visible to the guest. To save, restore or migrate a VCPU the struct representing the state can be read then written using this GET/SET API, along with the other guest-visible registers. It is not possible to ‘cancel’ an SError that has been made pending.
A device being emulated in user-space may also wish to generate an SError. To do this the events structure can be populated by user-space. The current state should be read first, to ensure no existing SError is pending. If an existing SError is pending, the architecture’s ‘Multiple SError interrupts’ rules should be followed. (2.5.3 of DDI0587.a “ARM Reliability, Availability, and Serviceability (RAS) Specification”).
SError exceptions always have an ESR value. Some CPUs have the ability to specify what the virtual SError’s ESR value should be. These systems will advertise KVM_CAP_ARM_INJECT_SERROR_ESR. In this case exception.has_esr will always have a non-zero value when read, and the agent making an SError pending should specify the ISS field in the lower 24 bits of exception.serror_esr. If the system supports KVM_CAP_ARM_INJECT_SERROR_ESR, but user-space sets the events with exception.has_esr as zero, KVM will choose an ESR.
Specifying exception.has_esr on a system that does not support it will return -EINVAL. Setting anything other than the lower 24bits of exception.serror_esr will return -EINVAL.
It is not possible to read back a pending external abort (injected via KVM_SET_VCPU_EVENTS or otherwise) because such an exception is always delivered directly to the virtual CPU.
struct kvm_vcpu_events {
    struct {
        __u8 serror_pending;
        __u8 serror_has_esr;
        __u8 ext_dabt_pending;
        /* Align it to 8 bytes */
        __u8 pad[5];
        __u64 serror_esr;
    } exception;
    __u32 reserved[12];
};
4.32 KVM_SET_VCPU_EVENTS
Capability: KVM_CAP_VCPU_EVENTS
Extended by: KVM_CAP_INTR_SHADOW
Architectures: x86, arm64
Type: vcpu ioctl
Parameters: struct kvm_vcpu_events (in)
Returns: 0 on success, -1 on error
Set pending exceptions, interrupts, and NMIs as well as related states of the vcpu.
See KVM_GET_VCPU_EVENTS for the data structure.
Fields that may be modified asynchronously by running VCPUs can be excluded from the update. These fields are nmi.pending, sipi_vector, smi.smm, smi.pending. Keep the corresponding bits in the flags field cleared to suppress overwriting the current in-kernel state. The bits are:

| Flag | Effect |
|---|---|
| KVM_VCPUEVENT_VALID_NMI_PENDING | transfer nmi.pending to the kernel |
| KVM_VCPUEVENT_VALID_SIPI_VECTOR | transfer sipi_vector |
| KVM_VCPUEVENT_VALID_SMM | transfer the smi sub-struct |

If KVM_CAP_INTR_SHADOW is available, KVM_VCPUEVENT_VALID_SHADOW can be set in the flags field to signal that interrupt.shadow contains a valid state and shall be written into the VCPU.
KVM_VCPUEVENT_VALID_SMM can only be set if KVM_CAP_X86_SMM is available.
If KVM_CAP_EXCEPTION_PAYLOAD is enabled, KVM_VCPUEVENT_VALID_PAYLOAD can be set in the flags field to signal that the exception_has_payload, exception_payload, and exception.pending fields contain a valid state and shall be written into the VCPU.
If KVM_CAP_X86_TRIPLE_FAULT_EVENT is enabled, KVM_VCPUEVENT_VALID_TRIPLE_FAULT can be set in flags field to signal that the triple_fault field contains a valid state and shall be written into the VCPU.
For arm64:
User space may need to inject several types of events to the guest.
Set the pending SError exception state for this VCPU. It is not possible to ‘cancel’ an SError that has been made pending.
If the guest performed an access to I/O memory which could not be handled by userspace, for example because of missing instruction syndrome decode information or because there is no device mapped at the accessed IPA, then userspace can ask the kernel to inject an external abort using the address from the exiting fault on the VCPU. It is a programming error to set ext_dabt_pending after an exit which was not either KVM_EXIT_MMIO or KVM_EXIT_ARM_NISV. This feature is only available if the system supports KVM_CAP_ARM_INJECT_EXT_DABT. This is a helper which provides commonality in how userspace reports accesses for the above cases to guests, across different userspace implementations. Nevertheless, userspace can still emulate all Arm exceptions by manipulating individual registers using the KVM_SET_ONE_REG API.
See KVM_GET_VCPU_EVENTS for the data structure.
4.33 KVM_GET_DEBUGREGS
Capability: KVM_CAP_DEBUGREGS
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_debugregs (out)
Returns: 0 on success, -1 on error
Reads debug registers from the vcpu.
struct kvm_debugregs {
    __u64 db[4];
    __u64 dr6;
    __u64 dr7;
    __u64 flags;
    __u64 reserved[9];
};
4.34 KVM_SET_DEBUGREGS
Capability: KVM_CAP_DEBUGREGS
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_debugregs (in)
Returns: 0 on success, -1 on error
Writes debug registers into the vcpu.
See KVM_GET_DEBUGREGS for the data structure. The flags field is not used yet and must be cleared on entry.
4.35 KVM_SET_USER_MEMORY_REGION
Capability: KVM_CAP_USER_MEMORY
Architectures: all
Type: vm ioctl
Parameters: struct kvm_userspace_memory_region (in)
Returns: 0 on success, -1 on error
struct kvm_userspace_memory_region {
    __u32 slot;
    __u32 flags;
    __u64 guest_phys_addr;
    __u64 memory_size; /* bytes */
    __u64 userspace_addr; /* start of the userspace allocated memory */
};

/* for kvm_memory_region::flags */
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
This ioctl allows the user to create, modify or delete a guest physical memory slot. Bits 0-15 of “slot” specify the slot id and this value should be less than the maximum number of user memory slots supported per VM. The maximum allowed slots can be queried using KVM_CAP_NR_MEMSLOTS. Slots may not overlap in guest physical address space.
If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 of “slot” specify the address space that is being modified. They must be less than the value that KVM_CHECK_EXTENSION returns for the KVM_CAP_MULTI_ADDRESS_SPACE capability. Slots in separate address spaces are unrelated; the restriction on overlapping slots only applies within each address space.
Deleting a slot is done by passing zero for memory_size. When changing an existing slot, it may be moved in the guest physical memory space, or its flags may be modified, but it may not be resized.
Memory for the region is taken starting at the address denoted by the field userspace_addr, which must point at user addressable memory for the entire memory slot size. Any object may back this memory, including anonymous memory, ordinary files, and hugetlbfs.
On architectures that support a form of address tagging, userspace_addr must be an untagged address.
It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr be identical. This allows large pages in the guest to be backed by large pages in the host.
The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it, to make a new slot read-only. In this case, writes to this memory will be posted to userspace as KVM_EXIT_MMIO exits.
When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of the memory region are automatically reflected into the guest. For example, an mmap() that affects the region will be made visible immediately. Another example is madvise(MADV_DROP).
It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl. The KVM_SET_MEMORY_REGION does not allow fine grained control over memory allocation and is deprecated.
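A minimal sketch of creating a slot backed by anonymous memory follows. The helper name `kvm_add_ram` is hypothetical, flags are left at 0 (no dirty logging, not read-only), and nothing here enforces the 2 MiB alignment recommendation; the caller chooses gpa and size.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Allocates `size` bytes of anonymous host memory and registers it as
 * guest RAM at guest physical address `gpa` in the given slot.
 * Returns the host address, or NULL on failure. */
void *kvm_add_ram(int vm_fd, uint32_t slot, uint64_t gpa, size_t size)
{
    struct kvm_userspace_memory_region region;
    void *host;

    assert(size > 0);   /* memory_size == 0 would delete the slot instead */

    host = mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (host == MAP_FAILED)
        return NULL;

    memset(&region, 0, sizeof(region));
    region.slot = slot;
    region.guest_phys_addr = gpa;
    region.memory_size = size;
    region.userspace_addr = (uint64_t)(uintptr_t)host;

    if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0) {
        munmap(host, size);
        return NULL;
    }
    return host;
}
```

Deleting the slot later uses the same ioctl with the same slot number and memory_size set to 0, followed by munmap() of the host memory.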
4.36 KVM_SET_TSS_ADDR
Capability: KVM_CAP_SET_TSS_ADDR
Architectures: x86
Type: vm ioctl
Parameters: unsigned long tss_address (in)
Returns: 0 on success, -1 on error
This ioctl defines the physical address of a three-page region in the guest physical address space. The region must be within the first 4GB of the guest physical address space and must not conflict with any memory slot or any mmio address. The guest may malfunction if it accesses this memory region.
This ioctl is required on Intel-based hosts. This is needed on Intel hardware because of a quirk in the virtualization implementation (see the internals documentation when it pops into existence).
4.37 KVM_ENABLE_CAP
Capability: KVM_CAP_ENABLE_CAP
Architectures: mips, ppc, s390, x86
Type: vcpu ioctl
Parameters: struct kvm_enable_cap (in)
Returns: 0 on success; -1 on error
Capability: KVM_CAP_ENABLE_CAP_VM
Architectures: all
Type: vm ioctl
Parameters: struct kvm_enable_cap (in)
Returns: 0 on success; -1 on error
Note: Not all extensions are enabled by default.
Using this ioctl the application can enable an extension, making it available to the guest.
On systems that do not support this ioctl, it always fails. On systems that do support it, it only works for extensions that are supported for enablement.
To check if a capability can be enabled, the KVM_CHECK_EXTENSION ioctl should be used.
struct kvm_enable_cap {
    /* in */
    __u32 cap;      /* the capability that is supposed to get enabled */
    __u32 flags;    /* a bitfield indicating future enhancements; has to be 0 for now */
    __u64 args[4];  /* arguments for enabling a feature; if a feature needs
                       initial values to function properly, this is the place
                       to put them */
    __u8 pad[64];
};
The vcpu ioctl should be used for vcpu-specific capabilities, the vm ioctl for vm-wide capabilities.
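Since flags must be 0 and the padding should not carry garbage, zeroing the whole structure before filling it in is the simplest approach. The wrapper below is a sketch with a hypothetical name; it forwards a single argument in args[0], and `fd` is a vcpu fd or a VM fd depending on the capability.

```c
#include <assert.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Enables one capability with a single argument. Returns 0 or -1. */
int kvm_enable_cap(int fd, uint32_t cap, uint64_t arg0)
{
    struct kvm_enable_cap req;

    static_assert(sizeof(req.args) == 4 * sizeof(uint64_t),
                  "struct kvm_enable_cap carries four arguments");

    memset(&req, 0, sizeof(req));   /* flags = 0, unused args and pad = 0 */
    req.cap = cap;
    req.args[0] = arg0;

    return ioctl(fd, KVM_ENABLE_CAP, &req) < 0 ? -1 : 0;
}
```

Checking the capability with KVM_CHECK_EXTENSION first, as recommended above, distinguishes "not supported" from other failures.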
4.38 KVM_GET_MP_STATE
Capability: KVM_CAP_MP_STATE
Architectures: x86, s390, arm64, riscv
Type: vcpu ioctl
Parameters: struct kvm_mp_state (out)
Returns: 0 on success; -1 on error
struct kvm_mp_state {
    __u32 mp_state;
};
Returns the vcpu’s current “multiprocessing state” (though also valid on uniprocessor guests).
Possible values are:

| State | Meaning |
|---|---|
| KVM_MP_STATE_RUNNABLE | the vcpu is currently running [x86,arm64,riscv] |
| KVM_MP_STATE_UNINITIALIZED | the vcpu is an application processor (AP) which has not yet received an INIT signal [x86] |
| KVM_MP_STATE_INIT_RECEIVED | the vcpu has received an INIT signal, and is now ready for a SIPI [x86] |
| KVM_MP_STATE_HALTED | the vcpu has executed a HLT instruction and is waiting for an interrupt [x86] |
| KVM_MP_STATE_SIPI_RECEIVED | the vcpu has just received a SIPI (vector accessible via KVM_GET_VCPU_EVENTS) [x86] |
| KVM_MP_STATE_STOPPED | the vcpu is stopped [s390,arm64,riscv] |
| KVM_MP_STATE_CHECK_STOP | the vcpu is in a special error state [s390] |
| KVM_MP_STATE_OPERATING | the vcpu is operating (running or halted) [s390] |
| KVM_MP_STATE_LOAD | the vcpu is in a special load/startup state [s390] |
| KVM_MP_STATE_SUSPENDED | the vcpu is in a suspend state and is waiting for a wakeup event [arm64] |

On x86, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel irqchip, the multiprocessing state must be maintained by userspace on these architectures.
For arm64:
If a vCPU is in the KVM_MP_STATE_SUSPENDED state, KVM will emulate the architectural execution of a WFI instruction.
If a wakeup event is recognized, KVM will exit to userspace with a KVM_SYSTEM_EVENT exit, where the event type is KVM_SYSTEM_EVENT_WAKEUP. If userspace wants to honor the wakeup, it must set the vCPU’s MP state to KVM_MP_STATE_RUNNABLE. If it does not, KVM will continue to await a wakeup event in subsequent calls to KVM_RUN.
Warning:
If userspace intends to keep the vCPU in a SUSPENDED state, it is strongly recommended that userspace take action to suppress the wakeup event (such as masking an interrupt). Otherwise, subsequent calls to KVM_RUN will immediately exit with a KVM_SYSTEM_EVENT_WAKEUP event and inadvertently waste CPU cycles.
Additionally, if userspace takes action to suppress a wakeup event, it is strongly recommended that it also restores the vCPU to its original state when the vCPU is made RUNNABLE again. For example, if userspace masked a pending interrupt to suppress the wakeup, the interrupt should be unmasked before returning control to the guest.
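The wakeup handling described above could be sketched roughly as follows. This assumes `run` points at the vCPU's mmap'ed `struct kvm_run` and `vcpu_fd` is the vCPU file descriptor; the helper names `is_wakeup_exit` and `honor_wakeup` are hypothetical, and the fallback define is only there for older kernel headers.

```c
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Fallback for kernel headers that predate the wakeup event. */
#ifndef KVM_SYSTEM_EVENT_WAKEUP
#define KVM_SYSTEM_EVENT_WAKEUP 4
#endif

/* Returns 1 if the last KVM_RUN exit reported a wakeup event, else 0. */
int is_wakeup_exit(const struct kvm_run *run)
{
    return run->exit_reason == KVM_EXIT_SYSTEM_EVENT &&
           run->system_event.type == KVM_SYSTEM_EVENT_WAKEUP;
}

/* Honor the wakeup: mark the vCPU runnable before the next KVM_RUN.
 * Returns 0 on success, -1 on error with errno set. */
int honor_wakeup(int vcpu_fd)
{
    struct kvm_mp_state mp = { .mp_state = KVM_MP_STATE_RUNNABLE };

    return ioctl(vcpu_fd, KVM_SET_MP_STATE, &mp);
}
```

A run loop would call `is_wakeup_exit(run)` after each `KVM_RUN` and then either call `honor_wakeup(vcpu_fd)` or suppress the wakeup source and leave the vCPU suspended.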
For riscv:
The only states that are valid are KVM_MP_STATE_STOPPED and KVM_MP_STATE_RUNNABLE which reflect if the vcpu is paused or not.
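For illustration, reading the state and turning it into a printable name might look like the sketch below. The helper names `mp_state_name` and `read_mp_state` are hypothetical, `vcpu_fd` is assumed to come from KVM_CREATE_VCPU, and the fallback define only covers older kernel headers.

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Fallback for kernel headers that predate the suspended state. */
#ifndef KVM_MP_STATE_SUSPENDED
#define KVM_MP_STATE_SUSPENDED 10
#endif

/* Map an mp_state value from the table above to a short name. */
const char *mp_state_name(uint32_t state)
{
    switch (state) {
    case KVM_MP_STATE_RUNNABLE:      return "RUNNABLE";
    case KVM_MP_STATE_UNINITIALIZED: return "UNINITIALIZED";
    case KVM_MP_STATE_INIT_RECEIVED: return "INIT_RECEIVED";
    case KVM_MP_STATE_HALTED:        return "HALTED";
    case KVM_MP_STATE_SIPI_RECEIVED: return "SIPI_RECEIVED";
    case KVM_MP_STATE_STOPPED:       return "STOPPED";
    case KVM_MP_STATE_CHECK_STOP:    return "CHECK_STOP";
    case KVM_MP_STATE_OPERATING:     return "OPERATING";
    case KVM_MP_STATE_LOAD:          return "LOAD";
    case KVM_MP_STATE_SUSPENDED:     return "SUSPENDED";
    default:                         return "UNKNOWN";
    }
}

/* Read the vcpu's MP state into *state. Returns 0 on success, -1 on error. */
int read_mp_state(int vcpu_fd, uint32_t *state)
{
    struct kvm_mp_state mp;

    if (ioctl(vcpu_fd, KVM_GET_MP_STATE, &mp) < 0)
        return -1;
    *state = mp.mp_state;
    return 0;
}
```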
KVM_SET_MP_STATE
Capability: KVM_CAP_MP_STATE
Architectures: x86, s390, arm64, riscv
Type: vcpu ioctl
Parameters: struct kvm_mp_state (in)
Returns: 0 on success; -1 on error
Sets the vcpu’s current “multiprocessing state”; see KVM_GET_MP_STATE for arguments.
On x86, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel irqchip, the multiprocessing state must be maintained by userspace on this architecture.
For arm64/riscv:
The only states that are valid are KVM_MP_STATE_STOPPED and KVM_MP_STATE_RUNNABLE, which reflect whether the vcpu should be paused or not.
KVM_SET_IDENTITY_MAP_ADDR
Capability: KVM_CAP_SET_IDENTITY_MAP_ADDR
Architectures: x86
Type: vm ioctl
Parameters: unsigned long identity (in)
Returns: 0 on success, -1 on error
This ioctl defines the physical address of a one-page region in the guest physical address space. The region must be within the first 4GB of the guest physical address space and must not conflict with any memory slot or any mmio address. The guest may malfunction if it accesses this memory region.
Setting the address to 0 will result in resetting the address to its default (0xfffbc000).
This ioctl is required on Intel-based hosts. This is needed on Intel hardware because of a quirk in the virtualization implementation (see the internals documentation when it pops into existence).
Fails if any VCPU has already been created.
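The range constraint can be checked in userspace before issuing the ioctl. The sketch below assumes a 4 KiB page and a `vm_fd` on which no vcpu has been created yet; the helper names are hypothetical, and the check does not cover conflicts with memory slots or MMIO ranges, which the caller still has to rule out.

```c
#include <errno.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

#define IDENTITY_MAP_PAGE_SIZE 0x1000ULL    /* assumed 4 KiB page */
#define FOUR_GIB               (1ULL << 32)

/* Returns 1 if the one-page region starting at addr ends within the
 * first 4 GiB of guest physical address space, else 0. */
int identity_map_in_range(uint64_t addr)
{
    return addr <= FOUR_GIB - IDENTITY_MAP_PAGE_SIZE;
}

/* Set the identity map address; addr == 0 selects the default.
 * The address is passed through a pointer to a 64-bit value.
 * Returns 0 on success, -1 on error with errno set. */
int set_identity_map_addr(int vm_fd, uint64_t addr)
{
    if (!identity_map_in_range(addr)) {
        errno = EINVAL;
        return -1;
    }
    return ioctl(vm_fd, KVM_SET_IDENTITY_MAP_ADDR, &addr);
}
```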
KVM_SET_BOOT_CPU_ID
Capability: KVM_CAP_SET_BOOT_CPU_ID
Architectures: x86
Type: vm ioctl
Parameters: unsigned long vcpu_id
Returns: 0 on success, -1 on error
Defines which vcpu is the Bootstrap Processor (BSP). Values are the same as the vcpu id in KVM_CREATE_VCPU. If this ioctl is not called, the default is vcpu 0. This ioctl must be called before any vcpu is created; otherwise it fails with EBUSY.
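The ordering requirement can be made explicit in code: create the VM, choose the BSP, and only then create vcpus. The following is a sketch for x86 hosts, assuming `kvm_fd` is an open handle on /dev/kvm; the function name `create_vm_with_bsp` and its parameters are hypothetical.

```c
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Create a VM, select bsp_id as the BSP, then create nr_vcpus vcpus with
 * ids 0..nr_vcpus-1, storing their fds in vcpu_fds.
 * Returns the VM fd on success, -1 on error (everything opened is closed). */
int create_vm_with_bsp(int kvm_fd, unsigned long bsp_id,
                       unsigned long nr_vcpus, int *vcpu_fds)
{
    int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0UL);
    if (vm_fd < 0)
        return -1;

    /* Must precede every KVM_CREATE_VCPU; the id is passed by value. */
    if (ioctl(vm_fd, KVM_SET_BOOT_CPU_ID, bsp_id) < 0) {
        close(vm_fd);
        return -1;
    }

    for (unsigned long id = 0; id < nr_vcpus; id++) {
        vcpu_fds[id] = ioctl(vm_fd, KVM_CREATE_VCPU, id);
        if (vcpu_fds[id] < 0) {
            while (id-- > 0)          /* close the vcpus created so far */
                close(vcpu_fds[id]);
            close(vm_fd);
            return -1;
        }
    }
    return vm_fd;
}
```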
KVM_GET_XSAVE
Capability: KVM_CAP_XSAVE
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_xsave (out)
Returns: 0 on success, -1 on error
struct kvm_xsave {
    __u32 region[1024];
    __u32 extra[0];
};
This ioctl copies the current vcpu’s xsave struct to userspace.
KVM_SET_XSAVE
Capability: KVM_CAP_XSAVE and KVM_CAP_XSAVE2
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_xsave (in)
Returns: 0 on success, -1 on error
struct kvm_xsave {
    __u32 region[1024];
    __u32 extra[0];
};
This ioctl copies userspace’s xsave struct to the kernel. It copies as many bytes as are returned by KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2), when invoked on the vm file descriptor. The size value returned by KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) will always be at least 4096. Currently, it is only greater than 4096 if a dynamic feature has been enabled with arch_prctl(), but this may change in the future.
The offsets of the state save areas in struct kvm_xsave follow the contents of CPUID leaf 0xD on the host.
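Because the number of bytes copied depends on KVM_CAP_XSAVE2, userspace may want to size its buffer from the capability query instead of relying on sizeof(struct kvm_xsave). A sketch of that sizing logic follows. The helper names are hypothetical, the fallback define is only for older kernel headers, and a non-positive query result is treated as "XSAVE2 not available", in which case the 4096-byte `region` array is the whole structure.

```c
#include <linux/kvm.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Fallback for kernel headers that predate KVM_CAP_XSAVE2. */
#ifndef KVM_CAP_XSAVE2
#define KVM_CAP_XSAVE2 208
#endif

#define XSAVE_MIN_SIZE 4096   /* sizeof(__u32) * 1024, the region[] array */

/* Turn the KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) result into a buffer size. */
size_t xsave_buf_size(int cap_ret)
{
    return cap_ret > XSAVE_MIN_SIZE ? (size_t)cap_ret : XSAVE_MIN_SIZE;
}

/* Allocate a zeroed buffer large enough for the xsave data of this VM.
 * Stores the size in *size; returns NULL if allocation fails. */
void *alloc_xsave_buf(int vm_fd, size_t *size)
{
    int ret = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_XSAVE2);

    *size = xsave_buf_size(ret);
    return calloc(1, *size);
}
```

On an x86 host the returned buffer would then be cast to `struct kvm_xsave *` and passed to KVM_SET_XSAVE on the vcpu file descriptor.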
KVM_GET_XCRS
Capability: KVM_CAP_XCRS
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_xcrs (out)
Returns: 0 on success, -1 on error
struct kvm_xcr {
    __u32 xcr;
    __u32 reserved;
    __u64 value;
};
struct kvm_xcrs {
    __u32 nr_xcrs;
    __u32 flags;
    struct kvm_xcr xcrs[KVM_MAX_XCRS];
    __u64 padding[16];
};
This ioctl copies the current vcpu’s xcrs to userspace.
KVM_SET_XCRS
Capability: KVM_CAP_XCRS
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_xcrs (in)
Returns: 0 on success, -1 on error
struct kvm_xcr {
    __u32 xcr;
    __u32 reserved;
    __u64 value;
};
struct kvm_xcrs {
    __u32 nr_xcrs;
    __u32 flags;
    struct kvm_xcr xcrs[KVM_MAX_XCRS];
    __u64 padding[16];
};
This ioctl sets the vcpu’s xcrs to the values userspace specified.
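As an illustration of how the structure might be populated, the sketch below describes a single register, XCR0, and writes it to a vcpu. It compiles only against x86 kernel headers, where `struct kvm_xcrs` is defined. The helper names are hypothetical, and KVM may reject values that are not a valid XCR0 for the vcpu.

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Fill xcrs so that it describes exactly one register, XCR0 = value. */
void xcrs_set_xcr0(struct kvm_xcrs *xcrs, uint64_t value)
{
    memset(xcrs, 0, sizeof(*xcrs));  /* zero flags, reserved and padding */
    xcrs->nr_xcrs = 1;
    xcrs->xcrs[0].xcr = 0;           /* register index 0 is XCR0 */
    xcrs->xcrs[0].value = value;
}

/* Write XCR0 of the vcpu behind vcpu_fd.
 * Returns 0 on success, -1 on error with errno set. */
int write_xcr0(int vcpu_fd, uint64_t value)
{
    struct kvm_xcrs xcrs;

    xcrs_set_xcr0(&xcrs, value);
    return ioctl(vcpu_fd, KVM_SET_XCRS, &xcrs);
}
```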
Original link: https://zhuanlan.zhihu.com/p/562713595