Welcome to mirror list, hosted at ThFree Co, Russian Federation.

github.com/torvalds/linux.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-11-11KVM: x86/mmu: Block all page faults during kvm_zap_gfn_range()Sean Christopherson
When zapping a GFN range, pass 0 => ALL_ONES for the to-be-invalidated range to effectively block all page faults while the zap is in-progress. The invalidation helpers take a host virtual address, whereas zapping a GFN obviously provides a guest physical address and with the wrong unit of measurement (frame vs. byte). Alternatively, KVM could walk all memslots to get the associated HVAs, but thanks to SMM, that would require multiple lookups. And practically speaking, kvm_zap_gfn_range() usage is quite rare and not a hot path, e.g. MTRR and CR0.CD are almost guaranteed to be done only on vCPU0 during boot, and APICv inhibits are similarly infrequent operations. Fixes: edb298c663fc ("KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range") Reported-by: Chao Peng <chao.p.peng@linux.intel.com> Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221111001841.2412598-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/pmu: Limit the maximum number of supported AMD GP countersLike Xu
The AMD PerfMonV2 specification allows for a maximum of 16 GP counters, but currently only 6 pairs of MSRs are accepted by KVM. While AMD64_NUM_COUNTERS_CORE is already equal to 6, increasing without adjusting msrs_to_save_all[] could result in out-of-bounds accesses. Therefore introduce a macro (named KVM_AMD_PMC_MAX_GENERIC) to refer to the number of counters supported by KVM. Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20220919091008.60695-3-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/pmu: Limit the maximum number of supported Intel GP countersLike Xu
The Intel Architectural IA32_PMCx MSRs addresses range allows for a maximum of 8 GP counters, and KVM cannot address any more. Introduce a local macro (named KVM_INTEL_PMC_MAX_GENERIC) and use it consistently to refer to the number of counters supported by KVM, thus avoiding possible out-of-bound accesses. Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20220919091008.60695-2-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/pmu: Do not speculatively query Intel GP PMCs that don't exist yetLike Xu
The SDM lists an architectural MSR IA32_CORE_CAPABILITIES (0xCF) that limits the theoretical maximum value of the Intel GP PMC MSRs allocated at 0xC1 to 14; likewise the Intel April 2022 SDM adds IA32_OVERCLOCKING_STATUS at 0x195 which limits the number of event selection MSRs to 15 (0x186-0x194). Limiting the maximum number of counters to 14 or 18 based on the currently allocated MSRs is clearly fragile, and it seems likely that Intel will even place PMCs 8-15 at a completely different range of MSR indices. So stop at the maximum number of GP PMCs supported today on Intel processors. There are some machines, like Intel P4 with non Architectural PMU, that may indeed have 18 counters, but those counters are in a completely different MSR address range and are not supported by KVM. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: stable@vger.kernel.org Fixes: cf05a67b68b8 ("KVM: x86: omit "impossible" pmu MSRs from MSR list") Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20220919091008.60695-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: SVM: Only dump VMSA to klog at KERN_DEBUG levelPeter Gonda
Explicitly print the VMSA dump at KERN_DEBUG log level, KERN_CONT uses KERNEL_DEFAULT if the previous log line has a newline, i.e. if there's nothing to continuing, and as a result the VMSA gets dumped when it shouldn't. The KERN_CONT documentation says it defaults back to KERNL_DEFAULT if the previous log line has a newline. So switch from KERN_CONT to print_hex_dump_debug(). Jarkko pointed this out in reference to the original patch. See: https://lore.kernel.org/all/YuPMeWX4uuR1Tz3M@kernel.org/ print_hex_dump(KERN_DEBUG, ...) was pointed out there, but print_hex_dump_debug() should similar. Fixes: 6fac42f127b8 ("KVM: SVM: Dump Virtual Machine Save Area (VMSA) to klog") Signed-off-by: Peter Gonda <pgonda@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Harald Hoyer <harald@profian.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Message-Id: <20221104142220.469452-1-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09x86, KVM: remove unnecessary argument to x86_virt_spec_ctrl and callersPaolo Bonzini
x86_virt_spec_ctrl only deals with the paravirtualized MSR_IA32_VIRT_SPEC_CTRL now and does not handle MSR_IA32_SPEC_CTRL anymore; remove the corresponding, unused argument. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: SVM: move MSR_IA32_SPEC_CTRL save/restore to assemblyPaolo Bonzini
Restoration of the host IA32_SPEC_CTRL value is probably too late with respect to the return thunk training sequence. With respect to the user/kernel boundary, AMD says, "If software chooses to toggle STIBP (e.g., set STIBP on kernel entry, and clear it on kernel exit), software should set STIBP to 1 before executing the return thunk training sequence." I assume the same requirements apply to the guest/host boundary. The return thunk training sequence is in vmenter.S, quite close to the VM-exit. On hosts without V_SPEC_CTRL, however, the host's IA32_SPEC_CTRL value is not restored until much later. To avoid this, move the restoration of host SPEC_CTRL to assembly and, for consistency, move the restoration of the guest SPEC_CTRL as well. This is not particularly difficult, apart from some care to cover both 32- and 64-bit, and to share code between SEV-ES and normal vmentry. Cc: stable@vger.kernel.org Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk") Suggested-by: Jim Mattson <jmattson@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: SVM: restore host save area from assemblyPaolo Bonzini
Allow access to the percpu area via the GS segment base, which is needed in order to access the saved host spec_ctrl value. In linux-next FILL_RETURN_BUFFER also needs to access percpu data. For simplicity, the physical address of the save area is added to struct svm_cpu_data. Cc: stable@vger.kernel.org Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk") Reported-by: Nathan Chancellor <nathan@kernel.org> Analyzed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: SVM: move guest vmsave/vmload back to assemblyPaolo Bonzini
It is error-prone that code after vmexit cannot access percpu data because GSBASE has not been restored yet. It forces MSR_IA32_SPEC_CTRL save/restore to happen very late, after the predictor untraining sequence, and it gets in the way of return stack depth tracking (a retbleed mitigation that is in linux-next as of 2022-11-09). As a first step towards fixing that, move the VMCB VMSAVE/VMLOAD to assembly, essentially undoing commit fb0c4a4fee5a ("KVM: SVM: move VMLOAD/VMSAVE to C code", 2021-03-15). The reason for that commit was that it made it simpler to use a different VMCB for VMLOAD/VMSAVE versus VMRUN; but that is not a big hassle anymore thanks to the kvm-asm-offsets machinery and other related cleanups. The idea on how to number the exception tables is stolen from a prototype patch by Peter Zijlstra. Cc: stable@vger.kernel.org Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk") Link: <https://lore.kernel.org/all/f571e404-e625-bae1-10e9-449b2eb4cbd8@citrix.com/> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: SVM: do not allocate struct svm_cpu_data dynamicallyPaolo Bonzini
The svm_data percpu variable is a pointer, but it is allocated via svm_hardware_setup() when KVM is loaded. Unlike hardware_enable() this means that it is never NULL for the whole lifetime of KVM, and static allocation does not waste any memory compared to the status quo. It is also more efficient and more easily handled from assembly code, so do it and don't look back. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: SVM: remove dead field from struct svm_cpu_dataPaolo Bonzini
The "cpu" field of struct svm_cpu_data has been write-only since commit 4b656b120249 ("KVM: SVM: force new asid on vcpu migration", 2009-08-05). Remove it. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: SVM: remove unused field from struct vcpu_svmPaolo Bonzini
The pointer to svm_cpu_data in struct vcpu_svm looks interesting from the point of view of accessing it after vmexit, when the GSBASE is still containing the guest value. However, despite existing since the very first commit of drivers/kvm/svm.c (commit 6aa8b732ca01, "[PATCH] kvm: userspace interface", 2006-12-10), it was never set to anything. Ignore the opportunity to fix a 16 year old "bug" and delete it; doing things the "harder" way makes it possible to remove more old cruft. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: SVM: retrieve VMCB from assemblyPaolo Bonzini
Continue moving accesses to struct vcpu_svm to vmenter.S. Reducing the number of arguments limits the chance of mistakes due to different registers used for argument passing in 32- and 64-bit ABIs; pushing the VMCB argument and almost immediately popping it into a different register looks pretty weird. 32-bit ABI is not a concern for __svm_sev_es_vcpu_run() which is 64-bit only; however, it will soon need @svm to save/restore SPEC_CTRL so stay consistent with __svm_vcpu_run() and let them share the same prototype. No functional change intended. Cc: stable@vger.kernel.org Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk") Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: SVM: adjust register allocation for __svm_vcpu_run()Paolo Bonzini
32-bit ABI uses RAX/RCX/RDX as its argument registers, so they are in the way of instructions that hardcode their operands such as RDMSR/WRMSR or VMLOAD/VMRUN/VMSAVE. In preparation for moving vmload/vmsave to __svm_vcpu_run(), keep the pointer to the struct vcpu_svm in %rdi. In particular, it is now possible to load svm->vmcb01.pa in %rax without clobbering the struct vcpu_svm pointer. No functional change intended. Cc: stable@vger.kernel.org Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk") Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: SVM: replace regs argument of __svm_vcpu_run() with vcpu_svmPaolo Bonzini
Since registers are reachable through vcpu_svm, and we will need to access more fields of that struct, pass it instead of the regs[] array. No functional change intended. Cc: stable@vger.kernel.org Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk") Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86: use a separate asm-offsets.c filePaolo Bonzini
This already removes an ugly #include "" from asm-offsets.c, but especially it avoids a future error when trying to define asm-offsets for KVM's svm/svm.h header. This would not work for kernel/asm-offsets.c, because svm/svm.h includes kvm_cache_regs.h which is not in the include path when compiling asm-offsets.c. The problem is not there if the .c file is in arch/x86/kvm. Suggested-by: Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk") Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-03KVM: x86: Fix a typo about the usage of kvcalloc()Liao Chang
Swap the 1st and 2nd arguments to be consistent with the usage of kvcalloc(). Fixes: c9b8fecddb5b ("KVM: use kvcalloc for array allocations") Signed-off-by: Liao Chang <liaochang1@huawei.com> Message-Id: <20221103011749.139262-1-liaochang1@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-03KVM: x86: Use SRCU to protect zap in __kvm_set_or_clear_apicv_inhibit()Ben Gardon
kvm_zap_gfn_range() must be called in an SRCU read-critical section, but there is no SRCU annotation in __kvm_set_or_clear_apicv_inhibit(). This can lead to the following warning via kvm_arch_vcpu_ioctl_set_guest_debug() if a Shadow MMU is in use (TDP MMU disabled or nesting): [ 1416.659809] ============================= [ 1416.659810] WARNING: suspicious RCU usage [ 1416.659839] 6.1.0-dbg-DEV #1 Tainted: G S I [ 1416.659853] ----------------------------- [ 1416.659854] include/linux/kvm_host.h:954 suspicious rcu_dereference_check() usage! [ 1416.659856] ... [ 1416.659904] dump_stack_lvl+0x84/0xaa [ 1416.659910] dump_stack+0x10/0x15 [ 1416.659913] lockdep_rcu_suspicious+0x11e/0x130 [ 1416.659919] kvm_zap_gfn_range+0x226/0x5e0 [ 1416.659926] ? kvm_make_all_cpus_request_except+0x18b/0x1e0 [ 1416.659935] __kvm_set_or_clear_apicv_inhibit+0xcc/0x100 [ 1416.659940] kvm_arch_vcpu_ioctl_set_guest_debug+0x350/0x390 [ 1416.659946] kvm_vcpu_ioctl+0x2fc/0x620 [ 1416.659955] __se_sys_ioctl+0x77/0xc0 [ 1416.659962] __x64_sys_ioctl+0x1d/0x20 [ 1416.659965] do_syscall_64+0x3d/0x80 [ 1416.659969] entry_SYSCALL_64_after_hwframe+0x63/0xcd Always take the KVM SRCU read lock in __kvm_set_or_clear_apicv_inhibit() to protect the GFN to memslot translation. The SRCU read lock is not technically required when no Shadow MMUs are in use, since the TDP MMU walks the paging structures from the roots and does not need to look up GFN translations in the memslots, but make the SRCU locking unconditional for simplicty. In most cases, the SRCU locking is taken care of in the vCPU run loop, but when called through other ioctls (such as KVM_SET_GUEST_DEBUG) there is no srcu_read_lock. Tested: ran tools/testing/selftests/kvm/x86_64/debug_regs on a DBG build. This patch causes the suspicious RCU warning to disappear. Note that the warning is hit in __kvm_zap_rmaps(), so kvm_memslots_have_rmaps() must return true in order for this to repro (i.e. the TDP MMU must be off or nesting in use.) Reported-by: Greg Thelen <gthelen@google.com> Fixes: 36222b117e36 ("KVM: x86: don't disable APICv memslot when inhibited") Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20221102205359.1260980-1-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-02KVM: VMX: Ignore guest CPUID for host userspace writes to DEBUGCTLSean Christopherson
Ignore guest CPUID for host userspace writes to the DEBUGCTL MSR, KVM's ABI is that setting CPUID vs. state can be done in any order, i.e. KVM allows userspace to stuff MSRs prior to setting the guest's CPUID that makes the new MSR "legal". Keep the vmx_get_perf_capabilities() check for guest writes, even though it's technically unnecessary since the vCPU's PERF_CAPABILITIES is consulted when refreshing LBR support. A future patch will clean up vmx_get_perf_capabilities() to avoid the RDMSR on every call, at which point the paranoia will incur no meaningful overhead. Note, prior to vmx_get_perf_capabilities() checking that the host fully supports LBRs via x86_perf_get_lbr(), KVM effectively relied on intel_pmu_lbr_is_enabled() to guard against host userspace enabling LBRs on platforms without full support. Fixes: c646236344e9 ("KVM: vmx/pmu: Add PMU_CAP_LBR_FMT check when guest LBR is enabled") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221006000314.73240-5-seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-02KVM: VMX: Fold vmx_supported_debugctl() into vcpu_supported_debugctl()Sean Christopherson
Fold vmx_supported_debugctl() into vcpu_supported_debugctl(), its only caller. Setting bits only to clear them a few instructions later is rather silly, and splitting the logic makes things seem more complicated than they actually are. Opportunistically drop DEBUGCTLMSR_LBR_MASK now that there's a single reference to the pair of bits. The extra layer of indirection provides no meaningful value and makes it unnecessarily tedious to understand what KVM is doing. No functional change. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221006000314.73240-4-seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-02KVM: VMX: Advertise PMU LBRs if and only if perf supports LBRsSean Christopherson
Advertise LBR support to userspace via MSR_IA32_PERF_CAPABILITIES if and only if perf fully supports LBRs. Perf may disable LBRs (by zeroing the number of LBRs) even on platforms the allegedly support LBRs, e.g. if probing any LBR MSRs during setup fails. Fixes: be635e34c284 ("KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIES") Reported-by: Like Xu <like.xu.linux@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221006000314.73240-3-seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-28KVM: x86/xen: Fix eventfd error handling in kvm_xen_eventfd_assign()Eiichi Tsukata
Should not call eventfd_ctx_put() in case of error. Fixes: 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from guests") Reported-by: syzbot+6f0c896c5a9449a10ded@syzkaller.appspotmail.com Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com> Message-Id: <20221028092631.117438-1-eiichi.tsukata@nutanix.com> [Introduce new goto target instead. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-28KVM: x86: smm: number of GPRs in the SMRAM image depends on the image formatMaxim Levitsky
On 64 bit host, if the guest doesn't have X86_FEATURE_LM, KVM will access 16 gprs to 32-bit smram image, causing out-ouf-bound ram access. On 32 bit host, the rsm_load_state_64/enter_smm_save_state_64 is compiled out, thus access overflow can't happen. Fixes: b443183a25ab61 ("KVM: x86: Reduce the number of emulator GPRs to '8' for 32-bit KVM") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221025124741.228045-15-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-28KVM: x86: emulator: update the emulation mode after CR0 writeMaxim Levitsky
Update the emulation mode when handling writes to CR0, because toggling CR0.PE switches between Real and Protected Mode, and toggling CR0.PG when EFER.LME=1 switches between Long and Protected Mode. This is likely a benign bug because there is no writeback of state, other than the RIP increment, and when toggling CR0.PE, the CPU has to execute code from a very low memory address. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-14-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-28KVM: x86: emulator: update the emulation mode after rsmMaxim Levitsky
Update the emulation mode after RSM so that RIP will be correctly written back, because the RSM instruction can switch the CPU mode from 32 bit (or less) to 64 bit. This fixes a guest crash in case the #SMI is received while the guest runs a code from an address > 32 bit. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-13-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-28KVM: x86: emulator: introduce emulator_recalc_and_set_modeMaxim Levitsky
Some instructions update the cpu execution mode, which needs to update the emulation mode. Extract this code, and make assign_eip_far use it. assign_eip_far now reads CS, instead of getting it via a parameter, which is ok, because callers always assign CS to the same value before calling this function. No functional change is intended. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-12-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-28KVM: x86: emulator: em_sysexit should update ctxt->modeMaxim Levitsky
SYSEXIT is one of the instructions that can change the processor mode, thus ctxt->mode should be updated after it. Note that this is likely a benign bug, because the only problematic mode change is from 32 bit to 64 bit which can lead to truncation of RIP, and it is not possible to do with sysexit, since sysexit running in 32 bit mode will be limited to 32 bit version. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221025124741.228045-11-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-27KVM: Initialize gfn_to_pfn_cache locks in dedicated helperMichal Luczaj
Move the gfn_to_pfn_cache lock initialization to another helper and call the new helper during VM/vCPU creation. There are race conditions possible due to kvm_gfn_to_pfn_cache_init()'s ability to re-initialize the cache's locks. For example: a race between ioctl(KVM_XEN_HVM_EVTCHN_SEND) and kvm_gfn_to_pfn_cache_init() leads to a corrupted shinfo gpc lock. (thread 1) | (thread 2) | kvm_xen_set_evtchn_fast | read_lock_irqsave(&gpc->lock, ...) | | kvm_gfn_to_pfn_cache_init | rwlock_init(&gpc->lock) read_unlock_irqrestore(&gpc->lock, ...) | Rename "cache_init" and "cache_destroy" to activate+deactivate to avoid implying that the cache really is destroyed/freed. Note, there more races in the newly named kvm_gpc_activate() that will be addressed separately. Fixes: 982ed0de4753 ("KVM: Reinstate gfn_to_pfn_cache with invalidation support") Cc: stable@vger.kernel.org Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> [sean: call out that this is a bug fix] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221013211234.1318131-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-27KVM: VMX: fully disable SGX if SECONDARY_EXEC_ENCLS_EXITING unavailableEmanuele Giuseppe Esposito
Clear enable_sgx if ENCLS-exiting is not supported, i.e. if SGX cannot be virtualized. When KVM is loaded, adjust_vmx_controls checks that the bit is available before enabling the feature; however, other parts of the code check enable_sgx and not clearing the variable caused two different bugs, mostly affecting nested virtualization scenarios. First, because enable_sgx remained true, SECONDARY_EXEC_ENCLS_EXITING would be marked available in the capability MSR that are accessed by a nested hypervisor. KVM would then propagate the control from vmcs12 to vmcs02 even if it isn't supported by the processor, thus causing an unexpected VM-Fail (exit code 0x7) in L1. Second, vmx_set_cpu_caps() would not clear the SGX bits when hardware support is unavailable. This is a much less problematic bug as it only happens if SGX is soft-disabled (available in the processor but hidden in CPUID) or if SGX is supported for bare metal but not in the VMCS (will never happen when running on bare metal, but can theoertically happen when running in a VM). Last but not least, this ensures that module params in sysfs reflect KVM's actual configuration. RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=2127128 Fixes: 72add915fbd5 ("KVM: VMX: Enable SGX virtualization for SGX1, SGX2 and LC") Cc: stable@vger.kernel.org Suggested-by: Sean Christopherson <seanjc@google.com> Suggested-by: Bandan Das <bsd@redhat.com> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Message-Id: <20221025123749.2201649-1-eesposit@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-27KVM: x86: Exempt pending triple fault from event injection sanity checkSean Christopherson
Exempt pending triple faults, a.k.a. KVM_REQ_TRIPLE_FAULT, when asserting that KVM didn't attempt to queue a new exception during event injection. KVM needs to emulate the injection itself when emulating Real Mode due to lack of unrestricted guest support (VMX) and will queue a triple fault if that emulation fails. Ideally the assertion would more precisely filter out the emulated Real Mode triple fault case, but rmode.vm86_active is buried in vcpu_vmx and can't be queried without a new kvm_x86_ops. And unlike "regular" exceptions, triple fault cannot put the vCPU into an infinite loop; the triple fault will force either an exit to userspace or a nested VM-Exit, and triple fault after nested VM-Exit will force an exit to userspace. I.e. there is no functional issue, so just suppress the warning for triple faults. Opportunistically convert the warning to a one-time thing, when it fires, it fires _a lot_, and is usually user triggerable, i.e. can be used to spam the kernel log. Fixes: 7055fb113116 ("KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions") Reported-by: kernel test robot <yujie.liu@intel.com> Link: https://lore.kernel.org/r/202209301338.aca913c3-yujie.liu@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220930230008.1636044-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-27KVM: x86: Reduce refcount if single_open() fails in kvm_mmu_rmaps_stat_open()Hou Wenlong
Refcount is increased before calling single_open() in kvm_mmu_rmaps_stat_open(), If single_open() fails, refcount should be restored, otherwise the vm couldn't be destroyed. Fixes: 3bcd0662d66fd ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file") Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Message-Id: <a75900413bb8b1e556be690e9588a0f92e946a30.1665733883.git.houwenlong.hwl@antgroup.com> [Preserved return value of single_open. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-27KVM: x86: Mask off reserved bits in CPUID.8000001FHJim Mattson
KVM_GET_SUPPORTED_CPUID should only enumerate features that KVM actually supports. CPUID.8000001FH:EBX[31:16] are reserved bits and should be masked off. Fixes: 8765d75329a3 ("KVM: X86: Extend CPUID range to include new leaf") Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20220929225203.2234702-6-jmattson@google.com> Cc: stable@vger.kernel.org [Clear NumVMPL too. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-22KVM: x86: Mask off reserved bits in CPUID.8000001AHJim Mattson
KVM_GET_SUPPORTED_CPUID should only enumerate features that KVM actually supports. In the case of CPUID.8000001AH, only three bits are currently defined. The 125 reserved bits should be masked off. Fixes: 24c82e576b78 ("KVM: Sanitize cpuid") Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20220929225203.2234702-4-jmattson@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-22KVM: x86: Mask off reserved bits in CPUID.80000008HJim Mattson
KVM_GET_SUPPORTED_CPUID should only enumerate features that KVM actually supports. The following ranges of CPUID.80000008H are reserved and should be masked off: ECX[31:18] ECX[11:8] In addition, the PerfTscSize field at ECX[17:16] should also be zero because KVM does not set the PERFTSC bit at CPUID.80000001H.ECX[27]. Fixes: 24c82e576b78 ("KVM: Sanitize cpuid") Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20220929225203.2234702-3-jmattson@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-22KVM: x86: Mask off reserved bits in CPUID.80000006HJim Mattson
KVM_GET_SUPPORTED_CPUID should only enumerate features that KVM actually supports. CPUID.80000006H:EDX[17:16] are reserved bits and should be masked off. Fixes: 43d05de2bee7 ("KVM: pass through CPUID(0x80000006)") Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20220929225203.2234702-2-jmattson@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-22KVM: x86: Mask off reserved bits in CPUID.80000001HJim Mattson
KVM_GET_SUPPORTED_CPUID should only enumerate features that KVM actually supports. CPUID.80000001:EBX[27:16] are reserved bits and should be masked off. Fixes: 0771671749b5 ("KVM: Enhance guest cpuid management") Signed-off-by: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-22KVM: x86: Add compat handler for KVM_X86_SET_MSR_FILTERAlexander Graf
The KVM_X86_SET_MSR_FILTER ioctls contains a pointer in the passed in struct which means it has a different struct size depending on whether it gets called from 32bit or 64bit code. This patch introduces compat code that converts from the 32bit struct to its 64bit counterpart which then gets used going forward internally. With this applied, 32bit QEMU can successfully set MSR bitmaps when running on 64bit kernels. Reported-by: Andrew Randrianasulu <randrianasulu@gmail.com> Fixes: 1a155254ff937 ("KVM: x86: Introduce MSR filtering") Signed-off-by: Alexander Graf <graf@amazon.com> Message-Id: <20221017184541.2658-4-graf@amazon.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-22KVM: x86: Copy filter arg outside kvm_vm_ioctl_set_msr_filter()Alexander Graf
In the next patch we want to introduce a second caller to set_msr_filter() which constructs its own filter list on the stack. Refactor the original function so it takes it as argument instead of reading it through copy_from_user(). Signed-off-by: Alexander Graf <graf@amazon.com> Message-Id: <20221017184541.2658-3-graf@amazon.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-12Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull more kvm updates from Paolo Bonzini: "The main batch of ARM + RISC-V changes, and a few fixes and cleanups for x86 (PMU virtualization and selftests). ARM: - Fixes for single-stepping in the presence of an async exception as well as the preservation of PSTATE.SS - Better handling of AArch32 ID registers on AArch64-only systems - Fixes for the dirty-ring API, allowing it to work on architectures with relaxed memory ordering - Advertise the new kvmarm mailing list - Various minor cleanups and spelling fixes RISC-V: - Improved instruction encoding infrastructure for instructions not yet supported by binutils - Svinval support for both KVM Host and KVM Guest - Zihintpause support for KVM Guest - Zicbom support for KVM Guest - Record number of signal exits as a VCPU stat - Use generic guest entry infrastructure x86: - Misc PMU fixes and cleanups. - selftests: fixes for Hyper-V hypercall - selftests: fix nx_huge_pages_test on TDP-disabled hosts - selftests: cleanups for fix_hypercall_test" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (57 commits) riscv: select HAVE_POSIX_CPU_TIMERS_TASK_WORK RISC-V: KVM: Use generic guest entry infrastructure RISC-V: KVM: Record number of signal exits as a vCPU stat RISC-V: KVM: add __init annotation to riscv_kvm_init() RISC-V: KVM: Expose Zicbom to the guest RISC-V: KVM: Provide UAPI for Zicbom block size RISC-V: KVM: Make ISA ext mappings explicit RISC-V: KVM: Allow Guest use Zihintpause extension RISC-V: KVM: Allow Guest use Svinval extension RISC-V: KVM: Use Svinval for local TLB maintenance when available RISC-V: Probe Svinval extension form ISA string RISC-V: KVM: Change the SBI specification version to v1.0 riscv: KVM: Apply insn-def to hlv encodings riscv: KVM: Apply insn-def to hfence encodings riscv: Introduce support for defining instructions riscv: Add X register names to gpr-nums KVM: arm64: Advertise new kvmarm mailing list kvm: vmx: keep constant definition format consistent kvm: mmu: fix typos in struct kvm_arch KVM: selftests: Fix nx_huge_pages_test on TDP-disabled hosts ...
2022-10-09Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "The first batch of KVM patches, mostly covering x86. ARM: - Account stage2 page table allocations in memory stats x86: - Account EPT/NPT arm64 page table allocations in memory stats - Tracepoint cleanups/fixes for nested VM-Enter and emulated MSR accesses - Drop eVMCS controls filtering for KVM on Hyper-V, all known versions of Hyper-V now support eVMCS fields associated with features that are enumerated to the guest - Use KVM's sanitized VMCS config as the basis for the values of nested VMX capabilities MSRs - A myriad event/exception fixes and cleanups. Most notably, pending exceptions morph into VM-Exits earlier, as soon as the exception is queued, instead of waiting until the next vmentry. This fixed a longstanding issue where the exceptions would incorrecly become double-faults instead of triggering a vmexit; the common case of page-fault vmexits had a special workaround, but now it's fixed for good - A handful of fixes for memory leaks in error paths - Cleanups for VMREAD trampoline and VMX's VM-Exit assembly flow - Never write to memory from non-sleepable kvm_vcpu_check_block() - Selftests refinements and cleanups - Misc typo cleanups Generic: - remove KVM_REQ_UNHALT" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits) KVM: remove KVM_REQ_UNHALT KVM: mips, x86: do not rely on KVM_REQ_UNHALT KVM: x86: never write to memory from kvm_vcpu_check_block() KVM: x86: Don't snapshot pending INIT/SIPI prior to checking nested events KVM: nVMX: Make event request on VMXOFF iff INIT/SIPI is pending KVM: nVMX: Make an event request if INIT or SIPI is pending on VM-Enter KVM: SVM: Make an event request if INIT or SIPI is pending when GIF is set KVM: x86: lapic does not have to process INIT if it is blocked KVM: x86: Rename kvm_apic_has_events() to make it INIT/SIPI specific KVM: x86: Rename and expose helper to detect if INIT/SIPI are allowed KVM: nVMX: Make an event request when pending an MTF nested VM-Exit KVM: x86: make vendor code check for all nested events mailmap: Update Oliver's email address KVM: x86: Allow force_emulation_prefix to be written without a reload KVM: selftests: Add an x86-only test to verify nested exception queueing KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes KVM: x86: Rename inject_pending_events() to kvm_check_and_inject_events() KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle behavior KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions KVM: x86: Morph pending exceptions to pending VM-Exits at queue time ...
2022-10-04Merge tag 'x86_cleanups_for_v6.1_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Borislav Petkov: - The usual round of smaller fixes and cleanups all over the tree * tag 'x86_cleanups_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Include the header of init_ia32_feat_ctl()'s prototype x86/uaccess: Improve __try_cmpxchg64_user_asm() for x86_32 x86: Fix various duplicate-word comment typos x86/boot: Remove superfluous type casting from arch/x86/boot/bitops.h
2022-10-03Merge tag 'kvmarm-6.1' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for v6.1 - Fixes for single-stepping in the presence of an async exception as well as the preservation of PSTATE.SS - Better handling of AArch32 ID registers on AArch64-only systems - Fixes for the dirty-ring API, allowing it to work on architectures with relaxed memory ordering - Advertise the new kvmarm mailing list - Various minor cleanups and spelling fixes
2022-09-30Merge tag 'kvm-x86-6.1-2' of https://github.com/sean-jc/linux into HEADPaolo Bonzini
KVM x86 updates for 6.1, batch #2: - Misc PMU fixes and cleanups. - Fixes for Hyper-V hypercall selftest
2022-09-30KVM: x86: Hide IA32_PLATFORM_DCA_CAP[31:0] from the guestJim Mattson
The only thing reported by CPUID.9 is the value of IA32_PLATFORM_DCA_CAP[31:0] in EAX. This MSR doesn't even exist in the guest, since CPUID.1:ECX.DCA[bit 18] is clear in the guest. Clear CPUID.9 in KVM_GET_SUPPORTED_CPUID. Fixes: 24c82e576b78 ("KVM: Sanitize cpuid") Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20220922231854.249383-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-29KVM: x86: Select CONFIG_HAVE_KVM_DIRTY_RING_ACQ_RELMarc Zyngier
Since x86 is TSO (give or take), allow it to advertise the new ACQ_REL version of the dirty ring capability. No other change is required for it. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20220926145120.27974-4-maz@kernel.org
2022-09-29KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config optionMarc Zyngier
In order to differenciate between architectures that require no extra synchronisation when accessing the dirty ring and those who do, add a new capability (KVM_CAP_DIRTY_LOG_RING_ACQ_REL) that identify the latter sort. TSO architectures can obviously advertise both, while relaxed architectures must only advertise the ACQ_REL version. This requires some configuration symbol rejigging, with HAVE_KVM_DIRTY_RING being only indirectly selected by two top-level config symbols: - HAVE_KVM_DIRTY_RING_TSO for strongly ordered architectures (x86) - HAVE_KVM_DIRTY_RING_ACQ_REL for weakly ordered architectures (arm64) Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20220926145120.27974-3-maz@kernel.org
2022-09-28KVM: x86/svm/pmu: Rewrite get_gp_pmc_amd() for more counters scalabilityLike Xu
If the number of AMD gp counters continues to grow, the code will be very clumsy and the switch-case design of inline get_gp_pmc_amd() will also bloat the kernel text size. The target code is taught to manage two groups of MSRs, each representing a different version of the AMD PMU counter MSRs. The MSR addresses of each group are contiguous, with no holes, and there is no intersection between two sets of addresses, but they are discrete in functionality by design like this: [Group A : All counter MSRs are tightly bound to all event select MSRs ] MSR_K7_EVNTSEL0 0xc0010000 MSR_K7_EVNTSELi 0xc0010000 + i ... MSR_K7_EVNTSEL3 0xc0010003 MSR_K7_PERFCTR0 0xc0010004 MSR_K7_PERFCTRi 0xc0010004 + i ... MSR_K7_PERFCTR3 0xc0010007 [Group B : The counter MSRs are interleaved with the event select MSRs ] MSR_F15H_PERF_CTL0 0xc0010200 MSR_F15H_PERF_CTR0 (0xc0010200 + 1) ... MSR_F15H_PERF_CTLi (0xc0010200 + 2 * i) MSR_F15H_PERF_CTRi (0xc0010200 + 2 * i + 1) ... MSR_F15H_PERF_CTL5 (0xc0010200 + 2 * 5) MSR_F15H_PERF_CTR5 (0xc0010200 + 2 * 5 + 1) Rewrite get_gp_pmc_amd() in this way: first determine which group of registers is accessed, then determine if it matches its requested type, applying different scaling ratios respectively, and finally get pmc_idx to pass into amd_pmc_idx_to_pmc(). Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20220831085328.45489-8-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-09-28KVM: x86/svm/pmu: Direct access pmu->gp_counter[] to implement amd_*_to_pmc()Like Xu
Access PMU counters on AMD by directly indexing the array of general purpose counters instead of translating the PMC index to an MSR index. AMD only supports gp counters, there's no need to translate a PMC index to an MSR index and back to a PMC index. Opportunistically apply array_index_nospec() to reduce the attack surface for speculative execution and remove the dead code. Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20220831085328.45489-7-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-09-28KVM: x86/pmu: Avoid using PEBS perf_events for normal countersLike Xu
The check logic in the pmc_resume_counter() to determine whether a perf_event is reusable is partial and flawed, especially when it comes to a pseudocode sequence (contrived, but valid) like: - enabling a counter and its PEBS bit - enable global_ctrl - run workload - disable only the PEBS bit, leaving the global_ctrl bit enabled In this corner case, a perf_event created for PEBS can be reused by a normal counter before it has been released and recreated, and when this normal counter overflows, it triggers a PEBS interrupt (precise_ip != 0). To address this issue, reprogram all affected counters when PEBS_ENABLE change and reuse a counter if and only if PEBS exactly matches precise. Fixes: 79f3e3b58386 ("KVM: x86/pmu: Reprogram PEBS event to emulate guest PEBS counter") Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20220831085328.45489-4-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-09-28KVM: x86/pmu: Refactor PERF_GLOBAL_CTRL update helper for reuse by PEBSLike Xu
Extract the "global ctrl" specific bits out of global_ctrl_changed() so that the helper only deals with reprogramming general purpose counters, and rename the helper accordingly. PEBS needs the same logic, i.e needs to reprogram counters associated when PEBS_ENABLE bits are toggled, and will use the helper in a future fix. No functional change intended. Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20220831085328.45489-4-likexu@tencent.com [sean: split to separate patch, write changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>