[pull] master from torvalds:master #33
Commits on May 15, 2020
KVM: No need to retry for hva_to_pfn_remapped()
hva_to_pfn_remapped() calls fixup_user_fault(), which has already handled the retry gracefully. Even if "unlocked" is set to true, it means we got a VM_FAULT_RETRY inside fixup_user_fault(); however, the page fault has already been retried and we should have the pfn set correctly. No need to do that again. Signed-off-by: Peter Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit: 5b494ae
KVM: X86: Sanity check on gfn before removal
The index returned by kvm_async_pf_gfn_slot() will be removed when an async pf gfn is going to be removed. However kvm_async_pf_gfn_slot() is not reliable in that it can return the last key it loops over even if the gfn is not found in the async gfn array. It should never happen, but it's still better to sanity check against that to make sure no unexpected gfn will be removed. Signed-off-by: Peter Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
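A minimal userspace sketch of the lookup-then-verify pattern being described (an illustrative model, not the kernel code; the array size, hashing, and probing are invented for the demo):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative model of an open-addressed gfn array in the style of
 * vcpu->arch.apf.gfns[]. */
#define NSLOTS 64
static uint64_t gfns[NSLOTS];

/* Linear probe: if the gfn is absent, this can fall through and return
 * the last key it looped over, which is the unreliability described. */
static unsigned int slot_for(uint64_t gfn)
{
	unsigned int key = (unsigned int)(gfn % NSLOTS);

	for (int i = 0; i < NSLOTS && gfns[key] != gfn && gfns[key] != 0; i++)
		key = (key + 1) % NSLOTS;
	return key;
}

static void remove_gfn(uint64_t gfn)
{
	unsigned int key = slot_for(gfn);

	/* The added sanity check: never clear a slot that doesn't actually
	 * hold the gfn we were asked to remove (the kernel would WARN). */
	if (gfns[key] != gfn)
		return;
	gfns[key] = 0;
}

int main(void)
{
	gfns[5] = 5;
	remove_gfn(7);	/* absent gfn: nothing gets clobbered */
	printf("slot 5 still holds %llu\n", (unsigned long long)gfns[5]);
	return 0;
}
```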
Commit: 0fd4604
KVM: Documentation: Fix up cpuid page
0x4b564d00 and 0x4b564d01 belong to KVM_FEATURE_CLOCKSOURCE2. Signed-off-by: Peter Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit: 62315b6
KVM: VMX: Improve handle_external_interrupt_irqoff inline assembly
Improve handle_external_interrupt_irqoff inline assembly in several ways:
- remove unneeded %c operand modifiers and "$" prefixes
- use %rsp instead of _ASM_SP, since we are in CONFIG_X86_64 part
- use $-16 immediate to align %rsp
- remove unneeded use of __ASM_SIZE macro
- define "ss" named operand only for X86_64
The patch introduces no functional changes. Cc: Paolo Bonzini <[email protected]> Cc: Sean Christopherson <[email protected]> Signed-off-by: Uros Bizjak <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit: 551896e
KVM: nVMX: Truncate writes to vmcs.SYSENTER_EIP/ESP for 32-bit vCPU
Explicitly truncate the data written to vmcs.SYSENTER_EIP/ESP on WRMSR if the virtual CPU doesn't support 64-bit mode. The SYSENTER address fields in the VMCS are natural width, i.e. bits 63:32 are dropped if the CPU doesn't support Intel 64 architectures. This behavior is visible to the guest after a VM-Exit/VM-Entry roundtrip, e.g. if the guest sets bits 63:32 in the actual MSR. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
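A tiny sketch of the truncation rule being described (illustrative; guest_has_long_mode is a hypothetical stand-in for KVM's actual long-mode check):

```c
#include <stdint.h>
#include <stdio.h>

/* Natural-width VMCS fields drop bits 63:32 when the CPU lacks Intel 64,
 * so KVM truncates the WRMSR data up front for a 32-bit-only vCPU. */
static uint64_t sysenter_value(uint64_t data, int guest_has_long_mode)
{
	return guest_has_long_mode ? data : (uint32_t)data;
}

int main(void)
{
	uint64_t data = 0xdeadbeef00001234ULL;

	printf("64-bit vCPU: %#llx\n", (unsigned long long)sysenter_value(data, 1));
	printf("32-bit vCPU: %#llx\n", (unsigned long long)sysenter_value(data, 0));
	return 0;
}
```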
Commit: 2408500
KVM: nVMX: Drop superfluous VMREAD of vmcs02.GUEST_SYSENTER_*
Don't propagate GUEST_SYSENTER_* from vmcs02 to vmcs12 on nested VM-Exit as the vmcs12 fields are updated in vmx_set_msr(), and writes to the corresponding MSRs are always intercepted by KVM when running L2. Dropping the propagation was intended to be done in the same commit that added vmcs12 writes in vmx_set_msr()[1], but for reasons unknown was only shuffled around[2][3]. [1] https://patchwork.kernel.org/patch/10933215 [2] https://patchwork.kernel.org/patch/10933215/#22682289 [3] https://lore.kernel.org/patchwork/patch/1088643 Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit: 9e826fe
KVM: VMX: Introduce generic fastpath handler
Introduce a generic fastpath handler to handle the MSR fastpath, the VMX-preemption timer fastpath, etc.; move it after vmx_complete_interrupts() in order to catch events delivered to the guest, and to allow later patches to abort the fast path. While at it, move the kvm_exit tracepoint so that it is printed for fastpath vmexits as well. There is no observed performance effect for the IPI fastpath after this patch. Tested-by: Haiwei Li <[email protected]> Cc: Haiwei Li <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Suggested-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit: dcf068d
KVM: x86: Print symbolic names of VMX VM-Exit flags in traces
Use __print_flags() to display the names of VMX flags in VM-Exit traces and strip the flags when printing the basic exit reason, e.g. so that a failed VM-Entry due to invalid guest state gets recorded as "INVALID_STATE FAILED_VMENTRY" instead of "0x80000021". Opportunistically fix misaligned variables in the kvm_exit and kvm_nested_vmexit_inject tracepoints. Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit: 2c4c413
KVM: X86: Introduce kvm_vcpu_exit_request() helper
Introduce the kvm_vcpu_exit_request() helper: we need to check some conditions before re-entering the guest immediately. If the fastpath completes but something prevents immediate re-entry, we skip invoking the exit handler and go through the full run loop instead. Tested-by: Haiwei Li <[email protected]> Cc: Haiwei Li <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit: 5a9f544
KVM: X86: Introduce more exit_fastpath_completion enum values
Add a fastpath_t typedef, since the enum lines are a bit long, and replace EXIT_FASTPATH_SKIP_EMUL_INS with two new exit_fastpath_completion enum values:
- EXIT_FASTPATH_EXIT_HANDLED: KVM will still go through its full run loop, but skip invoking the exit handler.
- EXIT_FASTPATH_REENTER_GUEST: the fastpath is complete; the guest can be re-entered without invoking the exit handler or going back to vcpu_run.
Tested-by: Haiwei Li <[email protected]> Cc: Haiwei Li <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
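A sketch of how the two new completion values might drive the run loop (the handler name, exit reasons, and loop shape are illustrative, not KVM's actual code):

```c
#include <stdio.h>

typedef enum {
	EXIT_FASTPATH_NONE,          /* no fastpath: run the full exit handler */
	EXIT_FASTPATH_EXIT_HANDLED,  /* stay in the full run loop, skip the handler */
	EXIT_FASTPATH_REENTER_GUEST, /* re-enter the guest immediately */
} fastpath_t;

static fastpath_t handle_exit_irqoff(int exit_reason)
{
	/* pretend exit reason 1 is a timer exit that the fastpath handles */
	return exit_reason == 1 ? EXIT_FASTPATH_REENTER_GUEST : EXIT_FASTPATH_NONE;
}

int main(void)
{
	int pending_exits[] = { 1, 1, 0 };	/* two fastpath exits, then a slow one */

	for (int i = 0; i < 3; i++) {
		fastpath_t fp = handle_exit_irqoff(pending_exits[i]);

		if (fp == EXIT_FASTPATH_REENTER_GUEST) {
			printf("exit %d: re-enter guest, no exit handler\n", i);
			continue;	/* inner loop: back into the guest */
		}
		printf("exit %d: full run loop, invoking exit handler\n", i);
	}
	return 0;
}
```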
Commit: 404d5d7
KVM: VMX: Optimize posted-interrupt delivery for timer fastpath
While optimizing posted-interrupt delivery, especially for the timer fastpath scenario, I measured kvm_x86_ops.deliver_posted_interrupt() to introduce substantial latency because the processor has to perform all vmentry tasks, ack the posted interrupt notification vector, read the posted-interrupt descriptor, etc. This is not only slow, it is also unnecessary when delivering an interrupt to the current CPU (as is the case for the LAPIC timer) because PIR->IRR and IRR->RVI synchronization is already performed on vmentry. Therefore, skip kvm_vcpu_trigger_posted_interrupt in this case, and instead do vmx_sync_pir_to_irr() on the EXIT_FASTPATH_REENTER_GUEST fastpath as well. Tested-by: Haiwei Li <[email protected]> Cc: Haiwei Li <[email protected]> Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit: 379a3c8
KVM: x86: introduce kvm_can_use_hv_timer
Replace the ad hoc test in vmx_set_hv_timer with a test in the caller, start_hv_timer. This test is not Intel-specific and would be duplicated when introducing the fast path for the TSC deadline MSR. Signed-off-by: Paolo Bonzini <[email protected]>
Commit: 199a8b8
KVM: X86: TSCDEADLINE MSR emulation fastpath
This patch implements a fast path for emulation of writes to the TSCDEADLINE MSR. Besides shortcutting various housekeeping tasks in the vCPU loop, the fast path can also deliver the timer interrupt directly without going through KVM_REQ_PENDING_TIMER because it runs in vCPU context. Tested-by: Haiwei Li <[email protected]> Cc: Haiwei Li <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit: ae95f56
KVM: VMX: Handle preemption timer fastpath
This patch implements a fastpath for the preemption timer vmexit. The vmexit can be handled quickly, so it can be performed with interrupts off, going back directly to the guest. Testing on an SKX server:

cyclictest in guest (w/o mwait exposed, adaptive advance lapic timer is default -1):
5540.5ns -> 4602ns (17%)

kvm-unit-test/vmexit.flat:
w/o advanced timer:
tscdeadline_immed: 3028.5 -> 2494.75 (17.6%)
tscdeadline: 5765.7 -> 5285 (8.3%)
w/ adaptive advance timer default -1:
tscdeadline_immed: 3123.75 -> 2583 (17.3%)
tscdeadline: 4663.75 -> 4537 (2.7%)

Tested-by: Haiwei Li <[email protected]> Cc: Haiwei Li <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit: 26efe2f
KVM: SVM: Merge svm_enable_vintr into svm_set_vintr
Code clean up and remove unnecessary intercept check for INTERCEPT_VINTR. Signed-off-by: Suravee Suthikulpanit <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit: e14b778
KVM: SVM: Remove unnecessary V_IRQ unsetting
This has already been handled in the prior call to svm_clear_vintr(). Signed-off-by: Suravee Suthikulpanit <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit: de18248
KVM: nVMX: Remove unused 'ops' param from nested_vmx_hardware_setup()
Remove a 'struct kvm_x86_ops' param that got left behind when the nested ops were moved to their own struct. Fixes: 33b2217 ("KVM: x86: move nested-related kvm_x86_ops to a separate struct") Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit: 6c1c6e5
KVM: nVMX: Really make emulated nested preemption timer pinned
The PINNED bit is ignored by hrtimer_init. It is only considered when starting the timer. When the hrtimer isn't pinned to the same logical processor as the vCPU thread to be interrupted, the emulated VMX-preemption timer often fails to adhere to the architectural specification. Fixes: f15a75e ("KVM: nVMX: make emulated nested preemption timer pinned") Signed-off-by: Jim Mattson <[email protected]> Reviewed-by: Peter Shier <[email protected]> Reviewed-by: Oliver Upton <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit: 1739f3d
KVM: nVMX: Change emulated VMX-preemption timer hrtimer to absolute
Prepare for migration of this hrtimer, by changing it from relative to absolute. (I couldn't get migration to work with a relative timer.) Signed-off-by: Jim Mattson <[email protected]> Reviewed-by: Peter Shier <[email protected]> Reviewed-by: Oliver Upton <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit: ada0098
KVM: nVMX: Migrate the VMX-preemption timer
The hrtimer used to emulate the VMX-preemption timer must be pinned to the same logical processor as the vCPU thread to be interrupted if we want to have any hope of adhering to the architectural specification of the VMX-preemption timer. Even with this change, the emulated VMX-preemption timer VM-exit occasionally arrives too late. Signed-off-by: Jim Mattson <[email protected]> Reviewed-by: Peter Shier <[email protected]> Reviewed-by: Oliver Upton <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit: 93dff2f
kvm: add halt-polling cpu usage stats
Two new stats for exposing halt-polling cpu usage:
halt_poll_success_ns
halt_poll_fail_ns
The sum of these two stats is the total cpu time spent polling. "success" means the VCPU polled until a virtual interrupt was delivered. "fail" means the VCPU had to schedule out (either because the maximum poll time was reached or it needed to yield the CPU). To avoid touching every arch's kvm_vcpu_stat struct, only update and export halt-polling cpu usage stats if we're on x86. Exporting cpu usage as a u64 in nanoseconds means we will overflow at ~500 years, which seems reasonably large. Signed-off-by: David Matlack <[email protected]> Signed-off-by: Jon Cargille <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
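An illustrative userspace model of the accounting: each poll interval is charged entirely to one of the two counters, depending on how polling ended (the function names and the 100us budget are made up):

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t halt_poll_success_ns;
static uint64_t halt_poll_fail_ns;

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

static void vcpu_block_model(int (*interrupt_pending)(void), uint64_t max_poll_ns)
{
	uint64_t start = now_ns();
	int success = 0;

	while (now_ns() - start < max_poll_ns) {
		if (interrupt_pending()) {
			success = 1;	/* polled until a virtual interrupt */
			break;
		}
	}
	if (success)
		halt_poll_success_ns += now_ns() - start;
	else
		halt_poll_fail_ns += now_ns() - start;	/* vCPU schedules out */
}

static int no_irq(void) { return 0; }

int main(void)
{
	vcpu_block_model(no_irq, 100000);	/* 100us of failed polling */
	printf("success=%llu fail=%llu total=%llu ns\n",
	       (unsigned long long)halt_poll_success_ns,
	       (unsigned long long)halt_poll_fail_ns,
	       (unsigned long long)(halt_poll_success_ns + halt_poll_fail_ns));
	return 0;
}
```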
Commit: cb95312
Commits on May 16, 2020
KVM: arm64: Move virt/kvm/arm to arch/arm64
Now that the 32bit KVM/arm host is a distant memory, let's move the whole of the KVM/arm64 code into the arm64 tree. As they said in the song: Welcome Home (Sanitarium). Signed-off-by: Marc Zyngier <[email protected]> Acked-by: Will Deacon <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Commit: 9ed24f4 (Marc Zyngier committed on May 16, 2020)
KVM: arm64: Kill off CONFIG_KVM_ARM_HOST
CONFIG_KVM_ARM_HOST is just a proxy for CONFIG_KVM, so remove it in favour of the latter. Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Fuad Tabba <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Commit: d82755b
KVM: arm64: Update help text
arm64 KVM supports 16k pages since 02e0b76 ("arm64: kvm: Add support for 16K pages"), so update the Kconfig help text accordingly. Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Fuad Tabba <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Commit: bf7bc1d
KVM: arm64: Change CONFIG_KVM to a menuconfig entry
Changing CONFIG_KVM to be a 'menuconfig' entry in Kconfig means that we can straightforwardly enumerate optional features, such as the virtual PMU device, as dependent options. Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Fuad Tabba <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Commit: f261336
KVM: arm64: Clean up kvm makefiles
Consolidate references to the CONFIG_KVM configuration item to encompass entire folders rather than per line. Signed-off-by: Fuad Tabba <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Acked-by: Will Deacon <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Commit: 25357de (Fuad Tabba authored; Marc Zyngier committed on May 16, 2020)
KVM: arm64: Simplify __kvm_timer_set_cntvoff implementation
Now that this function isn't constrained by the 32bit PCS, let's simplify it by taking a single 64bit offset instead of two 32bit parameters. Signed-off-by: Marc Zyngier <[email protected]>
Commit: c6fe89f (Marc Zyngier committed on May 16, 2020)
KVM: arm64: Use cpus_have_final_cap for has_vhe()
By the time we start using the has_vhe() helper, we have long discovered whether we are running VHE or not. It thus makes sense to use cpus_have_final_cap() instead of cpus_have_const_cap(), which leads to a small text size reduction. Signed-off-by: Marc Zyngier <[email protected]> Acked-by: David Brazdil <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Commit: ce6f8f0 (Marc Zyngier committed on May 16, 2020)
KVM: Fix spelling in code comments
Fix spelling and typos (e.g., repeated words) in comments. Signed-off-by: Fuad Tabba <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Commit: 656012c (Fuad Tabba authored; Marc Zyngier committed on May 16, 2020)
KVM: arm64: Sidestep stage2_unmap_vm() on vcpu reset when S2FWB is supported
stage2_unmap_vm() was introduced to unmap the user RAM region in the stage2 page table to make the caches coherent. E.g., a guest reboot with the stage1 MMU disabled will access memory using non-cacheable attributes. If the RAM and caches are not coherent at this stage, some evicted dirty cache lines may go and corrupt guest data in RAM. Since ARMv8.4, the S2FWB feature is mandatory and KVM will take advantage of it to configure the stage2 page table and the attributes of memory access. This ensures that guests always access memory using cacheable attributes, and thus the caches are always coherent. So on CPUs that support S2FWB, we can safely reset the vcpu without a heavy stage2 unmapping. Signed-off-by: Zenghui Yu <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Commit: 892713e (Zenghui Yu authored; Marc Zyngier committed on May 16, 2020)
KVM: arm/arm64: Release kvm->mmu_lock in loop to prevent starvation
Do cond_resched_lock() in stage2_flush_memslot() like what is done in unmap_stage2_range() and other places holding mmu_lock while processing a possibly large range of memory. Signed-off-by: Jiang Yi <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Reviewed-by: Suzuki K Poulose <[email protected]> Link: https://lore.kernel.org/r/[email protected]
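A userspace model of the cond_resched_lock() pattern this commit applies, using a pthread mutex as a stand-in for mmu_lock (names and the rescheduling interval are illustrative):

```c
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* While walking a large range under a lock, periodically drop and
 * reacquire it so other waiters are not starved. */
static pthread_mutex_t mmu_lock = PTHREAD_MUTEX_INITIALIZER;

static void flush_memslot_model(long npages)
{
	pthread_mutex_lock(&mmu_lock);
	for (long i = 0; i < npages; i++) {
		/* ... flush one page ... */

		if ((i & 0x3ff) == 0x3ff) {		/* every so often... */
			pthread_mutex_unlock(&mmu_lock);	/* let waiters in */
			sched_yield();
			pthread_mutex_lock(&mmu_lock);
		}
	}
	pthread_mutex_unlock(&mmu_lock);
}

int main(void)
{
	flush_memslot_model(1L << 20);
	puts("done");
	return 0;
}
```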
Commit: 48c963e (Jiang Yi authored; Marc Zyngier committed on May 16, 2020)
KVM: arm64: Clean up the checking for huge mapping
If we are checking whether the stage2 can map PAGE_SIZE, we don't have to do the boundary checks, as both the host VMA and the guest memslots are page aligned. Bail out early in that case. While we're at it, fix up a typo in the comment below. Signed-off-by: Suzuki K Poulose <[email protected]> Signed-off-by: Zenghui Yu <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Commit: 9f28361 (Suzuki K Poulose authored; Marc Zyngier committed on May 16, 2020)
KVM: arm64: Unify handling THP backed host memory
We support mapping host memory backed by PMD transparent hugepages at stage2 as huge pages. However the checks are now spread across two different places. Let us unify the handling of the THPs to keep the code cleaner (and future proof for PUD THP support). This patch moves transparent_hugepage_adjust() closer to the caller to avoid a forward declaration for fault_supports_stage2_huge_mappings(). Also, since we already handle the case where the host VA and the guest PA may not be aligned, the explicit VM_BUG_ON() is not required. Signed-off-by: Suzuki K Poulose <[email protected]> Signed-off-by: Zenghui Yu <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Commit: 0529c90 (Suzuki K Poulose authored; Marc Zyngier committed on May 16, 2020)
KVM: arm64: Support enabling dirty log gradually in small chunks
There is already support for enabling dirty log gradually in small chunks for x86 in commit 3c9bd40 ("KVM: x86: enable dirty log gradually in small chunks"). This adds support for arm64. x86 still write-protects all huge pages when DIRTY_LOG_INITIALLY_ALL_SET is enabled. However, for arm64, both huge pages and normal pages can be write protected gradually by userspace. Under the Huawei Kunpeng 920 2.6GHz platform, I did some tests on 128G Linux VMs with different page sizes. The memory pressure is 127G in each case. The time taken by memory_global_dirty_log_start in QEMU is listed below:

Page Size   Before   After Optimization
4K          650ms    1.8ms
2M          4ms      1.8ms
1G          2ms      1.8ms

Besides the time reduction, the biggest improvement is that we minimize the performance side effect (caused by dissolving huge pages and marking memslots dirty) on the guest after enabling the dirty log. Signed-off-by: Keqian Zhu <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Commit: c862626 (Keqian Zhu authored; Marc Zyngier committed on May 16, 2020)
KVM: arm64: Make KVM_CAP_MAX_VCPUS compatible with the selected GIC version
KVM_CAP_MAX_VCPUS always returns the maximum possible number of VCPUs, irrespective of the selected interrupt controller. This is pretty misleading for userspace that selects a GICv2 on a GICv3 system that supports v2 compat: it always gets a maximum of 512 VCPUs, even if the effective limit is 8. The 9th VCPU will fail to be created, which is unexpected as far as userspace is concerned. Fortunately, we already have the right information stashed in the kvm structure, and we can return it as requested. Reported-by: Ard Biesheuvel <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Tested-by: Alexandru Elisei <[email protected]> Reviewed-by: Alexandru Elisei <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Commit: 5107000 (Marc Zyngier committed on May 16, 2020)
Commits on May 17, 2020
MIPS: Loongson: Build ATI Radeon GPU driver as module
When the ATI Radeon GPU driver is compiled directly into the kernel instead of as a module, we should make sure the firmware for the model (check the available ones in /lib/firmware/radeon) is built into the kernel as well; otherwise the following fatal error occurs during GPU init. Change CONFIG_DRM_RADEON=y to CONFIG_DRM_RADEON=m to fix it.

[ 1.900997] [drm] Loading RS780 Microcode
[ 1.905077] radeon 0000:01:05.0: Direct firmware load for radeon/RS780_pfp.bin failed with error -2
[ 1.914140] r600_cp: Failed to load firmware "radeon/RS780_pfp.bin"
[ 1.920405] [drm:r600_init] *ERROR* Failed to load firmware!
[ 1.926069] radeon 0000:01:05.0: Fatal error during GPU init
[ 1.931729] [drm] radeon: finishing device.

Fixes: 024e6a8 ("MIPS: Loongson: Add a Loongson-3 default config file") Signed-off-by: Tiezhu Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit: a44de74
MIPS: Remove not used 8250-platform.c
When CONFIG_HAVE_STD_PC_SERIAL_PORT is set, there are build errors in 8250-platform.c because linux/module.h is not included. CONFIG_HAVE_STD_PC_SERIAL_PORT has not been used in arch/mips for many years and 8250-platform.c is neither built nor used, so instead of fixing the build errors, just remove the unused file 8250-platform.c and the related code in Kconfig and Makefile. Signed-off-by: Tiezhu Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit: d9a51fd
MIPS: Loongson64: fix typos in loongson_regs.h
Fix some symbol names to align with Loongson's User Manual wording. Also correct the comment in csr_readq() suggesting the wrong instruction in use. Fixes: 6a6f9b7 ("MIPS: Loongson: Add CFUCFG&CSR support") Signed-off-by: WANG Xuerui <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Jiaxun Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit: de541d6
MIPS: Loongson64: define offsets and known revisions for some CPUCFG features
Add the constants for easier and more maintainable composition of CPUCFG values. Signed-off-by: WANG Xuerui <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Jiaxun Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit: fdec207
MIPS: define more Loongson CP0.Config6 and CP0.Diag feature bits
These are exposed to userland alternatively via the new CPUCFG instruction on Loongson-3A R4 and above. Add definitions for readback on older cores. Signed-off-by: WANG Xuerui <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Jiaxun Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit: ac44d67
mips/mm: Add page soft dirty tracking
The user-space checkpoint and restore tool (CRIU) needs page changes to be soft-tracked. This allows doing a pre-checkpoint and then dumping only the touched pages. Signed-off-by: Guoyun Sun <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit: 2971317
Commits on May 18, 2020
MIPS: Loongson: Enable devicetree based probing for 8250 ports in defconfig
After commit 87fcfa7 ("MIPS: Loongson64: Add generic dts"), the node and property of the Loongson CPU UART0 already exist in loongson3-package.dtsi:

cpu_uart0: serial@1fe001e0 {
	compatible = "ns16550a";
	reg = <0 0x1fe001e0 0x8>;
	clock-frequency = <33000000>;
	interrupt-parent = <&liointc>;
	interrupts = <10 IRQ_TYPE_LEVEL_HIGH>;
	no-loopback-test;
};

In order to support the serial console on the Loongson platform, add CONFIG_SERIAL_OF_PLATFORM=y to loongson3_defconfig. With this patch, we can see the following boot messages:

[ 1.877745] printk: console [ttyS0] disabled
[ 1.881979] 1fe001e0.serial: ttyS0 at MMIO 0x1fe001e0 (irq = 16, base_baud = 2062500) is a 16550A
[ 1.890838] printk: console [ttyS0] enabled

And we can also log in normally from the serial console. Signed-off-by: Tiezhu Yang <[email protected]> Reviewed-by: Jiaxun Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit: 143463f
MIPS: SGI-IP30: Remove R5432_CP0_INTERRUPT_WAR from war.h
Remove an old macro that no longer exists anywhere else in the tree and that snuck in when IP30 support was added. Signed-off-by: Joshua Kinard <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit: 8be26ba
kgdb: Disable WARN_CONSOLE_UNLOCKED for all kgdb
In commit 81eaadc ("kgdboc: disable the console lock when in kgdb") we avoided the WARN_CONSOLE_UNLOCKED() yell when we were in kgdboc. That still works fine, but it turns out that we get a similar yell when using other I/O drivers. One example is the "I/O driver" for the kgdb test suite (kgdbts). When I enabled that, I again got the same yells. Even though "kgdbts" doesn't actually interact with the user over the console, using it still causes kgdb to print to the consoles. That trips the same warning:

con_is_visible+0x60/0x68
con_scroll+0x110/0x1b8
lf+0x4c/0xc8
vt_console_print+0x1b8/0x348
vkdb_printf+0x320/0x89c
kdb_printf+0x68/0x90
kdb_main_loop+0x190/0x860
kdb_stub+0x2cc/0x3ec
kgdb_cpu_enter+0x268/0x744
kgdb_handle_exception+0x1a4/0x200
kgdb_compiled_brk_fn+0x34/0x44
brk_handler+0x7c/0xb8
do_debug_exception+0x1b4/0x228

Let's increment/decrement the "ignore_console_lock_warning" variable every time we enter the debugger. This will allow us to later revert commit 81eaadc ("kgdboc: disable the console lock when in kgdb"). Signed-off-by: Douglas Anderson <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Daniel Thompson <[email protected]> Link: https://lore.kernel.org/r/20200507130644.v4.1.Ied2b058357152ebcc8bf68edd6f20a11d98d7d4e@changeid Signed-off-by: Daniel Thompson <[email protected]>
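A small model of the approach, assuming only what the message states: debugger entry/exit brackets an increment/decrement of the suppression counter (everything except the variable's name is illustrative):

```c
#include <stdatomic.h>
#include <stdio.h>

static atomic_int ignore_console_lock_warning;

static void warn_console_unlocked_model(void)
{
	if (atomic_load(&ignore_console_lock_warning))
		return;			/* suppressed while in the debugger */
	fprintf(stderr, "WARN: console unlocked\n");
}

static void kgdb_enter_model(void)
{
	atomic_fetch_add(&ignore_console_lock_warning, 1);
	warn_console_unlocked_model();	/* printing from kgdb: no warning */
	atomic_fetch_sub(&ignore_console_lock_warning, 1);
}

int main(void)
{
	kgdb_enter_model();
	warn_console_unlocked_model();	/* outside the debugger: warns */
	return 0;
}
```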
Commit: 202164f
Revert "kgdboc: disable the console lock when in kgdb"
This reverts commit 81eaadc. Commit 81eaadc ("kgdboc: disable the console lock when in kgdb") is no longer needed now that we have the patch ("kgdb: Disable WARN_CONSOLE_UNLOCKED for all kgdb"). Revert it. Signed-off-by: Douglas Anderson <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Daniel Thompson <[email protected]> Link: https://lore.kernel.org/r/20200507130644.v4.2.I02258eee1497e55bcbe8dc477de90369c7c7c2c5@changeid Signed-off-by: Daniel Thompson <[email protected]>
Commit: 333564a
kgdboc: Use a platform device to handle tty drivers showing up late
If you build CONFIG_KGDB_SERIAL_CONSOLE into the kernel then you should be able to have KGDB init itself at bootup by specifying the "kgdboc=..." kernel command line parameter. This has worked OK for me for many years, but on a new device I switched to it stopped working. The problem is that on this new device the serial driver gets its probe deferred. Now when kgdb initializes it can't find the tty driver and when it gives up it never tries again. We could try to find ways to move up the initialization of the serial driver and such a thing might be worthwhile, but it's nice to be robust against serial drivers that load late. We could move kgdb to init itself later but that penalizes our ability to debug early boot code on systems where the driver inits early. We could roll our own system of detecting when new tty drivers get loaded and then use that to figure out when kgdb can init, but that's ugly. Instead, let's jump on the -EPROBE_DEFER bandwagon. We'll create a singleton instance of a "kgdboc" platform device. If we can't find our tty device when the singleton "kgdboc" probes we'll return -EPROBE_DEFER which means that the system will call us back later to try again when the tty device might be there. We won't fully transition all of the kgdboc to a platform device because early kgdb initialization (via the "ekgdboc" kernel command line parameter) still runs before the platform device has been created. The kgdb platform device is merely used as a convenient way to hook into the system's normal probe deferral mechanisms. As part of this, we'll ever-so-slightly change how the "kgdboc=..." kernel command line parameter works. Previously if you booted up and kgdb couldn't find the tty driver then later reading '/sys/module/kgdboc/parameters/kgdboc' would return a blank string. Now kgdb will keep track of the string that came as part of the command line and give it back to you. It's expected that this should be an OK change. Signed-off-by: Douglas Anderson <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Daniel Thompson <[email protected]> Link: https://lore.kernel.org/r/20200507130644.v4.3.I4a493cfb0f9f740ce8fd2ab58e62dc92d18fed30@changeid [[email protected]: Make config_mutex static] Signed-off-by: Daniel Thompson <[email protected]>
Commit: 68e55f6
kgdb: Delay "kgdbwait" to dbg_late_init() by default
Using kgdb requires at least some level of architecture-level initialization. If nothing else, it relies on the architecture to pass breakpoints / crashes onto kgdb. On some architectures this all works super early, specifically it starts working at some point in time before Linux parses early_params. On other architectures it doesn't. A survey of a few platforms:
a) x86: Presumably it all works early since "ekgdboc" is documented to work here.
b) arm64: Catching crashes works; with a simple patch breakpoints can also be made to work.
c) arm: Nothing in kgdb works until paging_init() -> devicemaps_init() -> early_trap_init()
Let's be conservative and, by default, process "kgdbwait" (which tells the kernel to drop into the debugger ASAP at boot) a bit later at dbg_late_init() time. If an architecture has tested it and wants to re-enable super early debugging, they can select the ARCH_HAS_EARLY_DEBUG KConfig option. We'll do this for x86 to start. It should be noted that dbg_late_init() is still called quite early in the system. Note that this patch doesn't affect when kgdb runs its init. If kgdb is set to initialize early it will still initialize when parsing early_params. This patch _only_ inhibits the initial breakpoint from "kgdbwait". This means:
* Without any extra patches arm64 platforms will at least catch crashes after kgdb inits.
* arm platforms will catch crashes (and could handle a hardcoded kgdb_breakpoint()) any time after early_trap_init() runs, even before dbg_late_init().
Signed-off-by: Douglas Anderson <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Link: https://lore.kernel.org/r/20200507130644.v4.4.I3113aea1b08d8ce36dc3720209392ae8b815201b@changeid Signed-off-by: Daniel Thompson <[email protected]>
Commit: b1a57bb
kgdb: Prevent infinite recursive entries to the debugger
If we detect that we recursively entered the debugger we should hack our I/O ops to NULL so that the panic() in the next line won't actually cause another recursion into the debugger. The first line of kgdb_panic() will check this and return. Signed-off-by: Douglas Anderson <[email protected]> Reviewed-by: Daniel Thompson <[email protected]> Link: https://lore.kernel.org/r/20200507130644.v4.6.I89de39f68736c9de610e6f241e68d8dbc44bc266@changeid Signed-off-by: Daniel Thompson <[email protected]>
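An illustrative model of the recursion guard described above (struct and function names are invented; the kernel details differ):

```c
#include <stdio.h>
#include <stdlib.h>

/* On re-entry, NULL out the I/O ops so the subsequent panic path sees no
 * debugger attached and returns instead of recursing. */
struct dbg_io_ops_model { void (*write_char)(char c); };

static struct dbg_io_ops_model real_ops;
static struct dbg_io_ops_model *dbg_io_ops = &real_ops;
static int exception_level;

static void kgdb_panic_model(const char *msg)
{
	if (!dbg_io_ops) {	/* first check: no debugger, just bail */
		fprintf(stderr, "panic: %s\n", msg);
		exit(1);
	}
	/* ...otherwise this would re-enter the debugger... */
}

static void debugger_entry_model(void)
{
	if (++exception_level > 1) {
		dbg_io_ops = NULL;	/* break the cycle before panicking */
		kgdb_panic_model("recursive entry to debugger");
	}
	/* ...normal debugger loop runs here... */
	exception_level--;
}

int main(void)
{
	debugger_entry_model();	/* a normal, non-recursive entry */
	puts("returned from debugger");
	return 0;
}
```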
Commit: 3ca676e
kgdboc: Remove useless #ifdef CONFIG_KGDB_SERIAL_CONSOLE in kgdboc
This file is only ever compiled if that config is on, since the Makefile says:
obj-$(CONFIG_KGDB_SERIAL_CONSOLE) += kgdboc.o
Let's get rid of the useless #ifdef. Reported-by: Daniel Thompson <[email protected]> Signed-off-by: Douglas Anderson <[email protected]> Link: https://lore.kernel.org/r/20200507130644.v4.7.Icb528f03d0026d957e60f537aa711ada6fd219dc@changeid Signed-off-by: Daniel Thompson <[email protected]>
Commit: eae3e19
kgdboc: Add kgdboc_earlycon to support early kgdb using boot consoles
We want to enable kgdb to debug the early parts of the kernel. Unfortunately, kgdb is normally a client of the tty API in the kernel, and serial drivers don't register with the tty layer until fairly late in the boot process. Serial drivers do, however, commonly register a boot console. Let's enable the kgdboc driver to work with boot consoles to provide early debugging. This change co-opts the existing read() function pointer that's part of "struct console". It's assumed that if a boot console (with the flag CON_BOOT) has implemented read(), then both the read() and write() functions are polling functions. That means they work without interrupts, and read() will return immediately (with 0 bytes read) if there's nothing to read. This should be a safe assumption, since it appears that no current boot consoles implement read() right now and there seems to be no reason to do so unless they wanted to support "kgdboc_earlycon". The normal/expected way to make all this work is to use "kgdboc_earlycon" and "kgdboc" together. You should point them both to the same physical serial connection. At boot time, as the system transitions from the boot console to the normal console (and registers a tty), kgdb will switch over. One awkward part of all this, though, is that there can be a window where the boot console goes away and we can't quite transition over to the main kgdboc that uses the tty layer. There are two main problems:
1. The act of registering the tty doesn't cause any call into kgdboc, so there is a window of time when the tty is there but kgdboc's init code hasn't been called, so we can't transition to it.
2. On some serial drivers the normal console inits (and replaces the boot console) quite early in the system. Presumably these drivers were coded up before earlycon worked as well as it does today, and probably they don't need to do this anymore, but it causes us problems nonetheless.
Problem #1 is not too big of a deal, somewhat due to the luck of probe ordering. kgdboc is last in the tty/serial/Makefile, so its probe comes right after all other tty devices. It's not fun to rely on this, but it does work for the most part. Problem #2 is a big deal, but only for some serial drivers. Other serial drivers end up registering the console (which gets rid of the boot console) and tty at nearly the same time. The way we'll deal with the window between when the system has stopped using the boot console and the time when we're set up using the tty is to keep using the boot console. This may sound surprising, but it has been found to work well in practice. If it doesn't work, it shouldn't be too hard for a given serial driver to make it keep working. Specifically, it's expected that the read()/write() functions provided in the boot console should be the same (or nearly the same) as the normal kgdb polling functions. That means continuing to use them should work just fine. To make things even more likely to work, we'll also trap the recently added exit() function in the boot console we're using and delay any calls to it until we're all done with the boot console. NOTE: there could be ways to use all this in weird / unexpected ways. If you do something like this, it's a bit of a buyer-beware situation. Specifically:
- If you specify only "kgdboc_earlycon" but not "kgdboc" then (depending on your serial driver) things will probably work OK, but you'll get a warning printed the first time you use kgdb after the boot console is gone. You'd only be able to do this, of course, if the serial driver you're running atop provided an early boot console.
- If your "kgdboc_earlycon" and "kgdboc" devices are not the same device, things should work OK, but it'll be your job to switch over which device you're monitoring (including figuring out how to switch over gdb in-flight, if you're using it).
When trying to enable "kgdboc_earlycon" it should be noted that the names that are registered through the boot console layer and the tty layer are not the same for the same port. For example, when debugging on one board I'd need to pass "kgdboc_earlycon=qcom_geni kgdboc=ttyMSM0" to enable things properly. Since digging up the boot console name is a pain and there will rarely be more than one boot console enabled, you can provide the "kgdboc_earlycon" parameter without specifying the name of the boot console. In this case we'll just pick the first boot console that implements read() that we find. This new "kgdboc_earlycon" parameter should be contrasted with the existing "ekgdboc" parameter. While both provide a way to debug very early, the usage and mechanisms are quite different. Specifically, "kgdboc_earlycon" is meant to be used in tandem with "kgdboc", and there is a transition from one to the other. The "ekgdboc" parameter, on the other hand, replaces the "kgdboc" parameter. It runs the same logic as the "kgdboc" parameter but just relies on your TTY driver being present super early. The only known usage of the old "ekgdboc" parameter is documented as "ekgdboc=kbd earlyprintk=vga". It should be noted that "kbd" has special treatment allowing it to init early as a tty device. Signed-off-by: Douglas Anderson <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Tested-by: Sumit Garg <[email protected]> Link: https://lore.kernel.org/r/20200507130644.v4.8.I8fba5961bf452ab92350654aa61957f23ecf0100@changeid Signed-off-by: Daniel Thompson <[email protected]>
Commit: 2209956
Commits on May 19, 2020
KVM: x86: only do L1TF workaround on affected processors
KVM stores the gfn in MMIO SPTEs as a caching optimization. These are split in two parts, as in "[high 11111 low]", to thwart any attempt to use these bits in an L1TF attack. This works as long as there are 5 free bits between MAXPHYADDR and bit 50 (inclusive), leaving bit 51 free so that the MMIO access triggers a reserved-bit-set page fault. The bit positions, however, were computed wrongly for AMD processors that have encryption support. In this case, x86_phys_bits is reduced (for example from 48 to 43, to account for the C bit at position 47 and four bits used internally to store the SEV ASID and other stuff) while x86_cache_bits would remain set to 48, and _all_ bits between the reduced MAXPHYADDR and bit 51 are set. Then low_phys_bits would also cover some of the bits that are set in the shadow_mmio_value, terribly confusing the gfn caching mechanism. To fix this, avoid splitting gfns as long as the processor does not have the L1TF bug (which includes all AMD processors). When there is no splitting, low_phys_bits can be set to the reduced MAXPHYADDR, removing the overlap. This fixes "npt=0" operation on EPYC processors. Thanks to Maxim Levitsky for bisecting this bug. Cc: [email protected] Fixes: 52918ed ("KVM: SVM: Override default MMIO mask if memory encryption is enabled") Signed-off-by: Paolo Bonzini <[email protected]>
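A sketch of the "[high 11111 low]" encoding described above, with invented bit positions, to make the layout concrete (not KVM's actual masks):

```c
#include <stdint.h>
#include <stdio.h>

/* LOW_BITS plays the role of the (possibly reduced) MAXPHYADDR and the
 * all-ones field sits directly above it. */
#define LOW_BITS  43
#define ONES_BITS 5

static uint64_t encode_gpa(uint64_t gpa)
{
	uint64_t low  = gpa & ((1ULL << LOW_BITS) - 1);
	uint64_t high = gpa >> LOW_BITS;
	uint64_t ones = ((1ULL << ONES_BITS) - 1) << LOW_BITS;

	return (high << (LOW_BITS + ONES_BITS)) | ones | low;
}

static uint64_t decode_gpa(uint64_t spte)
{
	uint64_t low  = spte & ((1ULL << LOW_BITS) - 1);
	uint64_t high = spte >> (LOW_BITS + ONES_BITS);

	return (high << LOW_BITS) | low;
}

int main(void)
{
	uint64_t gpa = (1ULL << 45) | 0x1234;	/* exercises both halves */

	/* If the low mask overlapped the all-ones field, as in the bug
	 * described above, this round trip would corrupt the address. */
	printf("round-trip ok: %d\n", decode_gpa(encode_gpa(gpa)) == gpa);
	return 0;
}
```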
Commit: d43e267
tracing: Provide lockdep less trace_hardirqs_on/off() variants
trace_hardirqs_on/off() is only partially safe vs. RCU idle. The tracer core itself is safe, but the resulting tracepoints can be utilized by e.g. BPF, which is unsafe. Provide variants which do not contain the lockdep invocation so the lockdep and tracer invocations can be split at the call site and placed properly. This is required because lockdep needs to be aware of the state before switching away from RCU idle and after switching to RCU idle, because these transitions can take locks. As these code paths are going to be non-instrumentable, the tracer can be invoked after RCU is turned on and before the switch to RCU idle. So for these new variants there is no need to invoke the rcuidle aware tracer functions. Name them so they match the lockdep counterparts. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
Commit: 0995a5d
lockdep: Prepare for noinstr sections
Force inlining and prevent instrumentation of all sorts by marking the functions which are invoked from low level entry code with 'noinstr'. Split the irqflags tracking into two parts. One which does the heavy lifting while RCU is watching and the final one which can be invoked after RCU is turned off. Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
Commit: c86e9b9
context_tracking: Make guest_enter/exit() .noinstr ready
Force inlining of the helpers and mark the instrumentable parts accordingly. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
Commit: af1e56b
x86/kvm: Handle async page faults directly through do_page_fault()
KVM overloads #PF to indicate two types of not-actually-page-fault events. Right now, the KVM guest code intercepts them by modifying the IDT and hooking the #PF vector. This makes the already fragile fault code even harder to understand, and it also pollutes call traces with async_page_fault and do_async_page_fault for normal page faults. Clean it up by moving the logic into do_page_fault() using a static branch. This gets rid of the platform trap_init override mechanism completely. [ tglx: Fixed up 32bit, removed error code from the async functions and massaged coding style ] Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
Commit: ef68017
x86/kvm: Sanitize kvm_async_pf_task_wait()
While working on the entry consolidation I stumbled over the KVM async page fault handler and kvm_async_pf_task_wait() in particular. It took me a while to realize that the randomly sprinkled rcu_irq_enter()/exit() invocations are just cargo cult programming. Several patches "fixed" RCU splats by curing the symptoms without noticing that the code is flawed from a design perspective. The main problem is that this async injection is not based on a proper handshake mechanism and only respects the minimal requirement, i.e. the guest is not in a state where it has interrupts disabled. Aside from that, the actual code is a convoluted, one-size-fits-all Swiss army knife. It is invoked from different places with different RCU constraints:
1) Host side:
vcpu_enter_guest() -> kvm_x86_ops->handle_exit() -> kvm_handle_page_fault() -> kvm_async_pf_task_wait()
The invocation happens from fully preemptible context.
2) Guest side: The async page fault interrupted:
a) user space
b) preemptible kernel code which is not in a RCU read side critical section
c) non-preemptible kernel code or a RCU read side critical section or kernel code with CONFIG_PREEMPTION=n, which allows not to differentiate between #2b and #2c.
RCU is watching for:
#1 The vCPU exited and current is definitely not the idle task.
#2a The #PF entry code on the guest went through enter_from_user_mode(), which reactivates RCU.
#2b There is no preemptible, interrupts enabled code in the kernel which can run with RCU looking away. (The idle task is always non-preemptible.)
I.e. all schedulable states (#1, #2a, #2b) do not need any of this RCU voodoo at all. In #2c RCU is eventually not watching, but as that state cannot schedule anyway there is no point to worry about it, so it has to invoke rcu_irq_enter() before running that code. This can be optimized, but this will be done as an extra step in the course of the entry code consolidation work. So the proper solution for this is to:
- Split kvm_async_pf_task_wait() into schedule- and halt-based waiting interfaces which share the enqueueing code.
- Add comments (a condensed form of this changelog) to spare others the time waste and pain of reverse engineering all of this with the help of incomprehensible changelogs and code history.
- Invoke kvm_async_pf_task_wait_schedule() from kvm_handle_page_fault(), user mode and schedulable kernel side async page faults (#1, #2a, #2b).
- Invoke kvm_async_pf_task_wait_halt() for the non-schedulable kernel case (#2c). For this case also remove the rcu_irq_exit()/enter() pair around the halt as it is just a pointless exercise: vCPUs can VMEXIT at any random point and can be scheduled out for an arbitrary amount of time by the host, and this is not any different except that it voluntarily triggers the exit via halt. The interrupted context could have RCU watching already, so the rcu_irq_exit() before the halt is not gaining anything aside from confusing the reader. Claiming that this might prevent RCU stalls is just an illusion.
Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
Commit: 6bca69a
x86/kvm: Restrict ASYNC_PF to user space
The async page fault injection into kernel space creates more problems than it solves. The host has absolutely no knowledge about the state of the guest if the fault happens in CPL0. The only restriction for the host is the interrupt disabled state. If interrupts are enabled in the guest then the exception can hit arbitrary code. The HALT based wait in non-preemptible code is a hacky replacement for a proper hypercall. For the ongoing work to restrict instrumentation and make the RCU idle interaction well defined, the required extra work for supporting async page faults in CPL0 is just not justified and creates complexity for a dubious benefit. The CPL3 injection is well defined and does not cause any issues as it is more or less the same as a regular page fault from CPL3. Suggested-by: Andy Lutomirski <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
Commit: 3a7c8fa
MIPS: SGI-IP27: Remove duplicated include in ip27-timer.c
After commit 9d0aaf9 ("MIPS: SGI-IP27: Move all shared IP27 declarations to ip27-common.h"), ip27-common.h is included more than once in ip27-timer.c, so remove the duplicate include. Signed-off-by: Tiezhu Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit: 860f02f
MIPS: Remove useless parameter of bootcmdline_init()
The parameter "cmdline_p" is useless in bootcmdline_init(),remove it. Signed-off-by: Zhi Li <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit: bd6e389
mips: MAAR: Add XPA mode support
When XPA mode is enabled, the normally 32-bit MAAR pair registers are extended to be 64 bits wide, as in the pure 64-bit MIPS architecture. In this case the MAAR registers can enable speculative loads/stores for addresses up to 39 bits wide. But the process of MAAR initialization then changes a bit: the upper 32 bits of the registers are supposed to be accessed by means of the dedicated instructions mfhc0/mthc0, and there is a CP0.MAAR.VH bit which should be set together with CP0.MAAR.VL as an indication of the boundary's validity. All of these peculiarities are taken into account in this commit, so speculative loads/stores work when XPA mode is enabled. Co-developed-by: Alexey Malahov <[email protected]> Signed-off-by: Alexey Malahov <[email protected]> Signed-off-by: Serge Semin <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Paul Burton <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Rob Herring <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit: 9ee195f
Commit: 9013196 (Peter Zijlstra committed on May 19, 2020)
sched/fair: Optimize enqueue_task_fair()
enqueue_task_fair() jumps to the enqueue_throttle label when cfs_rq_of(se) is throttled, which means that se can't be NULL in such a case, so we can move the label after the if (!se) statement. Furthermore, the latter can be removed because se is always NULL when reaching this point. Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Phil Auld <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
Commit 7d148be
sched/cpuacct: Use __this_cpu_add() instead of this_cpu_ptr()
cpuacct_charge() and cpuacct_account_field() are called with rq->lock held, which means preemption (and IRQs) are indeed disabled, so it is safe to use __this_cpu_*() to allow for better code generation. Signed-off-by: Muchun Song <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
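A minimal sketch of the substitution; the per-CPU variable is hypothetical, while __this_cpu_add() and this_cpu_ptr() are the real kernel primitives:

```c
#include <linux/percpu.h>

static DEFINE_PER_CPU(u64, cpuusage);	/* hypothetical counter */

static void charge_sketch(u64 cputime)
{
	/* Before: explicit pointer access to the per-CPU slot. */
	*this_cpu_ptr(&cpuusage) += cputime;

	/* After: legal because rq->lock keeps preemption and IRQs
	 * disabled here; lets the compiler emit better code. */
	__this_cpu_add(cpuusage, cputime);
}
```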
Muchun Song authored and Peter Zijlstra committed May 19, 2020 (commit 12aa258)
sched/pelt: Sync util/runnable_sum with PELT window when propagating
update_tg_cfs_*() propagate the impact of the attach/detach of an entity down into the cfs_rq hierarchy and must keep the sync with the current pelt window. Even if we can't sync child cfs_rq and its group se, we can sync the group se and its parent cfs_rq with current position in the PELT window. In fact, we must keep them sync in order to stay also synced with others entities and group entities that are already attached to the cfs_rq. Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
Commit 95d6859
sched/fair: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kinds of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] Applying sizeof to a flexible-array member triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is erroneously applied to zero-length arrays, where the result is silently zero; such instances may be hiding bugs. So this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] KSPP#21 [3] commit 7649773 ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/20200507192141.GA16183@embeddedor
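A self-contained illustration of the conversion and of the sizeof quirk quoted above:

```c
#include <stdio.h>

struct old_way {
	int stuff;
	int array[0];	/* GNU zero-length array: sizeof(array) is 0 */
};

struct new_way {
	int stuff;
	int array[];	/* C99 flexible array member */
};

int main(void)
{
	/* Both structs have the same size; the trailing member adds
	 * nothing, so dynamic allocations are unaffected. */
	printf("%zu %zu\n", sizeof(struct old_way), sizeof(struct new_way));

	/* sizeof on the member itself differs: the zero-length array
	 * silently yields 0 (possibly hiding a bug), while the
	 * flexible array member fails to compile:
	 *   sizeof(((struct new_way *)0)->array)  // error */
	return 0;
}
```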
Commit 04f5c36
sched/cpuacct: Fix charge cpuacct.usage_sys
user_mode(task_pt_regs(tsk)) always returns true for a user thread and false for a kernel thread. This means that cpuacct.usage_sys currently accounts the time that kernel threads use, not the time that threads spend in kernel mode. We can try get_irq_regs() first; if it is NULL, we can fall back to task_pt_regs(). Signed-off-by: Muchun Song <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
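A sketch of the described fallback, modeled on the commit text; the helper name and simplified index constants are illustrative:

```c
#include <linux/sched.h>
#include <linux/ptrace.h>	/* user_mode(), task_pt_regs() */
#include <asm/irq_regs.h>	/* get_irq_regs() */

/* Prefer the interrupted context's registers; for a kernel thread
 * task_pt_regs() only reflects the kernel entry frame and therefore
 * never reports user mode. */
static int cpuacct_index_sketch(struct task_struct *tsk)
{
	struct pt_regs *regs = get_irq_regs() ?: task_pt_regs(tsk);

	return user_mode(regs) ? 0 /* user */ : 1 /* sys */;
}
```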
Muchun Song authored and Peter Zijlstra committed May 19, 2020 (commit dbe9337)
sched: Defend cfs and rt bandwidth quota against overflow
When users write a huge number into cpu.cfs_quota_us or cpu.rt_runtime_us, overflow might happen during the to_ratio() shifts of the schedulability checks. to_ratio() could be altered to avoid unnecessary internal overflow, but min_cfs_quota_period is less than 1 << BW_SHIFT, so a cutoff would still be needed. Set a cap MAX_BW on cfs_quota_us and rt_runtime_us to prevent overflow. Signed-off-by: Huaixin Chang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Ben Segall <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
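A sketch of the cap; MAX_BW and BW_SHIFT follow the commit text, while the validation helper is hypothetical:

```c
#include <linux/errno.h>

#define BW_SHIFT	20			/* fixed-point shift used by to_ratio() */
#define MAX_BW_BITS	(64 - BW_SHIFT)
#define MAX_BW		((1ULL << MAX_BW_BITS) - 1)

/* Reject bandwidth values (in ns) whose << BW_SHIFT would overflow u64. */
static int check_bw_sketch(u64 bandwidth_ns)
{
	return bandwidth_ns <= MAX_BW ? 0 : -EINVAL;
}
```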
Commit d505b8a
Commits on May 20, 2020
MIPS: SGI-IP27: Remove not used includes and comment in ip27-timer.c
After commit 0ce5ebd ("mfd: ioc3: Add driver for SGI IOC3 chip"), the related includes and comment about ioc3 are not used any more in ip27-timer.c, remove them. Signed-off-by: Tiezhu Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit 866c70f
MIPS: ingenic: Add missing include
Add the missing include which provides the prototype for plat_time_init(). Fixes: f932449 ("MIPS: ingenic: Drop obsolete code, merge the rest in setup.c") Signed-off-by: Paul Cercueil <[email protected]> Reported-by: kbuild test robot <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit c9c2e9c
rcuwait: avoid lockdep splats from rcuwait_active()
rcuwait_active() only returns whether w->task is not NULL. This is exactly one of the use cases mentioned in the documentation for rcu_access_pointer() where it is correct to bypass lockdep checks. This avoids a splat from kvm_vcpu_on_spin(). Reported-by: Wanpeng Li <[email protected]> Tested-by: Wanpeng Li <[email protected]> Acked-by: Davidlohr Bueso <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
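The change is essentially one accessor swap; a sketch close to the description above:

```c
#include <linux/rcupdate.h>
#include <linux/rcuwait.h>

/* The return value is only a NULL/non-NULL hint and the pointer is
 * never dereferenced, so rcu_access_pointer() (which skips the
 * lockdep/RCU-held checks) is the documented accessor to use. */
static inline int rcuwait_active_sketch(struct rcuwait *w)
{
	return !!rcu_access_pointer(w->task);
}
```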
Commit febd668
Merge tag 'noinstr-x86-kvm-2020-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD
Commit 9d5272f
scsi: storvsc: Re-init stor_chns when a channel interrupt is re-assigned
For each storvsc_device, storvsc keeps track of the channel target CPUs associated with the device (alloced_cpus) and uses this information to fill a "cache" (stor_chns) mapping CPU->channel according to a certain heuristic. Update the alloced_cpus mask and the stor_chns array when a channel of the storvsc device is re-assigned to a different CPU. Signed-off-by: Andrea Parri (Microsoft) <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: "Martin K. Petersen" <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Long Li <[email protected]> Reviewed-by: Michael Kelley <[email protected]> [ wei: fix a small issue reported by kbuild test robot <[email protected]> ] Signed-off-by: Wei Liu <[email protected]>
Commit 7769e18
drivers: hv: remove redundant assignment to pointer primary_channel
The pointer primary_channel is being assigned with a value that is never used. The assignment is redundant and can be removed. Move the definition of primary_channel to a narrower scope. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <[email protected]> Link: https://lore.kernel.org/r/[email protected] [ wei: move primary_channel and update commit message ] Signed-off-by: Wei Liu <[email protected]>
Commit 677b0ce
KVM: x86: hyperv: Remove duplicate definitions of Reference TSC Page
The Hyper-V Reference TSC Page structure is defined twice. struct ms_hyperv_tsc_page has padding out to a full 4 Kbyte page size. But the padding is not needed because the declaration includes a union with HV_HYP_PAGE_SIZE. KVM uses the second definition, which is struct _HV_REFERENCE_TSC_PAGE, because it does not have the padding. Fix the duplication by removing the padding from ms_hyperv_tsc_page. Fix up the KVM code to use it. Remove the no longer used struct _HV_REFERENCE_TSC_PAGE. There is no functional change. Signed-off-by: Michael Kelley <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Wei Liu <[email protected]>
Commit 7357b1d
x86/hyperv: Remove HV_PROCESSOR_POWER_STATE #defines
The HV_PROCESSOR_POWER_STATE_C<n> #defines date back to year 2010, but they are not in the TLFS v6.0 document and are not used anywhere in Linux. Remove them. Signed-off-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Wei Liu <[email protected]>
Commit a8a42d0
x86/hyperv: Split hyperv-tlfs.h into arch dependent and independent files
In preparation for adding ARM64 support, split hyperv-tlfs.h into architecture dependent and architecture independent files, similar to what has been done with mshyperv.h. Move architecture independent definitions into include/asm-generic/hyperv-tlfs.h. The split will avoid duplicating significant lines of code in the ARM64 version of hyperv-tlfs.h. The split has no functional impact. Some of the common definitions have "X64" in the symbol name. Change these to remove the "X64" in the architecture independent version of hyperv-tlfs.h, but add aliases with the "X64" in the x86 version so that x86 code will continue to compile. A later patch set will change all the references and allow removal of the aliases. Signed-off-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Wei Liu <[email protected]>
Commit c55a844
asm-generic/hyperv: Add definitions for Get/SetVpRegister hypercalls
Add definitions for GetVpRegister and SetVpRegister hypercalls, which are implemented for both x86 and ARM64. Signed-off-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Wei Liu <[email protected]>
Commit 88b42da
hyper-v: Use UUID API for exporting the GUID (part 2)
This is a follow-up to commit 1d3c9c0 ("hyper-v: Use UUID API for exporting the GUID"), which started the conversion. There is an export_guid() function which exports a guid_t to a u8 array. Use it instead of the open-coded variant. This allows hiding the uuid_t internals. Signed-off-by: Andy Shevchenko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Wei Liu <[email protected]>
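A short sketch of the substitution; the destination buffer and helper name are hypothetical, while export_guid() is the real helper from linux/uuid.h:

```c
#include <linux/uuid.h>
#include <linux/string.h>

static void fill_id_sketch(u8 dst[16], const guid_t *g)
{
	/* Open-coded variant: relies on guid_t's internal layout. */
	memcpy(dst, g->b, sizeof(g->b));

	/* Preferred: the helper hides the representation. */
	export_guid(dst, g);
}
```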
Commit 69f5705
hyper-v: Supply GUID pointer to printf() like functions
Drop the dereference when printing the GUID with printf()-like functions. This allows hiding the uuid_t internals. Signed-off-by: Andy Shevchenko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Wei Liu <[email protected]>
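A sketch of the printing change; the kernel's %pU format family takes a pointer to the GUID, so no dereference (and no layout knowledge) is needed:

```c
#include <linux/printk.h>
#include <linux/uuid.h>

static void print_guid_sketch(const guid_t *g)
{
	pr_info("device type: %pUl\n", g);	/* pass the pointer itself */
}
```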
Commit 458c447
hyper-v: Replace open-coded variant of %*phN specifier
printf() like functions in the kernel have extensions, such as %*phN to dump small pieces of memory as hex values. Replace print_alias_name() with the direct use of %*phN. Signed-off-by: Andy Shevchenko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Wei Liu <[email protected]>
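A sketch of the specifier in use; the buffer and length are illustrative, while %*phN is the real vsprintf extension (meant for small buffers):

```c
#include <linux/printk.h>

static void dump_hex_sketch(const u8 *bytes)
{
	/* The field width gives the byte count; output is bare hex
	 * digits with no separators, e.g. "0a1b2c...". */
	pr_info("alias: vmbus:%*phN\n", 16, bytes);
}
```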
Commit 0027e3f
hyper-v: Switch to use UUID types directly
uuid_le is an alias for guid_t and is going to be removed in the future. Replace it with the original type. Signed-off-by: Andy Shevchenko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Wei Liu <[email protected]>
Commit b7d18c5
Driver: hv: vmbus: drop a no longer applicable comment
None of the things mentioned in the comment is initialized in hv_init. They've been moved elsewhere. Signed-off-by: Wei Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Michael Kelley <[email protected]>
Commit 723c425
vmbus: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kinds of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] Applying sizeof to a flexible-array member triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is erroneously applied to zero-length arrays, where the result is silently zero; such instances may be hiding bugs. So this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] KSPP#21 [3] commit 7649773 ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <[email protected]> Link: https://lore.kernel.org/r/20200507185323.GA14416@embeddedor Signed-off-by: Wei Liu <[email protected]>
Commit db5871e
Commits on May 21, 2020
MIPS: SGI-IP27: Remove not used definition TICK_SIZE in ip27-timer.c
After commit f5ff0a2 ("[MIPS] Use generic NTP code for all MIPS platforms"), TICK_SIZE is not used in ip27-timer.c for many years, remove it. Signed-off-by: Tiezhu Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit 37e2bc4
mips: MAAR: Use more precise address mask
Indeed, according to the MIPS32 Privileged Resource Architecture, the MAAR pair register address field takes bits [12:31] on non-XPA systems and bits [12:55] otherwise. In any case the current address mask is just wrong for 64-bit and 32-bit XPA chips, so let's extend it to 59 bits of physical address value. This covers the 64-bit architecture and systems with XPA enabled, and won't cause any problem for non-XPA 32-bit systems, since address values exceeding the architecture-specific MAAR mask will just be truncated, with zeros set in the unsupported upper bits. Co-developed-by: Alexey Malahov <[email protected]> Signed-off-by: Alexey Malahov <[email protected]> Signed-off-by: Serge Semin <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Paul Burton <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Rob Herring <[email protected]> Cc: [email protected] Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit bbb5946
Commits on May 22, 2020
mips: Add MIPS Release 5 support
There are five MIPS32/64 architecture releases currently available: from 1 to 6, except the fourth one, which was intentionally skipped. Three of them can be called major: the 1st, 2nd and 6th, which not only have some system-level alterations but also introduced significant core/ISA-level updates. The rest of the MIPS architecture releases are minor. Even though they don't have as many ISA/system/core-level changes as the major ones with respect to the previous releases, they still provide a set of updates (they seem to have been intended as intermediate releases before a major one) that might be useful for the kernel and user-level code when activated by the kernel or compiler. In particular, the following features were introduced or ended up being available at/after the MIPS32/64 Release 5 architecture:
+ the last release of the misaligned memory access instructions,
+ virtualisation (the VZ ASE) is an optional component of the arch,
+ SIMD (the MSA ASE) is an optional component of the arch,
+ the DSP ASE is an optional component of the arch,
+ CP0.Status.FR=1 for CP1.FIR.F64=1 (pure 64-bit FPU general registers) must be available if an FPU is implemented,
+ CP1.FIR.Has2008 support is required, so the CP1.FCSR.{ABS2008,NAN2008} bits are available,
+ UFR/UNFR aliases to access CP0.Status.FR from user space by means of the ctc1/cfc1 instructions (enabled by CP0.Config5.UFR),
+ CP0.Config5.LLB=1 and the eretnc instruction are implemented so that the LL bit isn't accidentally cleared when returning from an interrupt, exception, or error trap,
+ the XPA feature together with extended versions of the CPx registers is introduced, which requires the mfhc0/mthc0 instructions to be available.
Due to these changes GNU GCC provides extended instruction set support for MIPS32/64 Release 5 by default, including eretnc/mfhc0/mthc0. Even though the architecture alteration isn't that big, it is still worth taking into account in the kernel software. Finally, some optimizations/limitations might be found and implemented on some level in the kernel or compiler in the future, in which case having support for even the intermediate MIPS architecture releases would be more than useful. Most of the changes provided by this commit can be split into compile-time and runtime config related ones. The compile-time related changes are caused by adding the new CONFIG_CPU_MIPS32_R5/CONFIG_CPU_MIPSR5 configs and concern the code activating already implemented MIPSR2 or MIPSR6 features (like eretnc/LLbit, mthc0/mfhc0). In addition, CPU_HAS_MSA can now be freely enabled for MIPS32/64 Release 5 based platforms, as is done for CPU_MIPS32_R6 CPUs. The runtime changes concern the features which are handled with respect to the MIPS ISA revision detected at run-time by means of the CP0.Config.{AT,AR} bits. Alas, these fields can only be used to detect the r1, r2 or r6 releases. But since we know which CPUs in fact support the R5 arch, we can manually set the MIPS_CPU_ISA_M32R5/MIPS_CPU_ISA_M64R5 bit of c->isa_level and then use cpu_has_mips32r5/cpu_has_mips64r5 where appropriate. Since XPA/EVA introduce rather complex alterations, and to keep them usable with MIPS32 Release 2 based kernels (for compatibility with current platform configs), they are left to be set up as separate kernel configs.
Co-developed-by: Alexey Malahov <[email protected]> Signed-off-by: Alexey Malahov <[email protected]> Signed-off-by: Serge Semin <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Paul Burton <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Rob Herring <[email protected]> Cc: [email protected] Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit ab7c01f
mips: Add MIPS Warrior P5600 support
This is a MIPS32 Release 5 based IP core with XPA, EVA, dual/quad issue exec pipes, an MMU with a two-level TLB, UCA, MSA, MDU core-level features and system-level features like up to six P5600 calculation cores, CM2 with L2 cache, IOCU/IOMMU (though these might be unused depending on the system-specific IP core configuration), GIC, CPC, virtualisation module, eJTAG and PDtrace. Being a MIPS32 Release 5 based core, it provides all the features available with the CPU_MIPS32_R5 config, while adding a few more, like UCA attribute support, availability of CPU-freq (by means of the L2/CM clock ratio setting), and EI/VI GIC mode detection at runtime. In addition, if the P5600 architecture is enabled, modern GNU GCC provides specific tuning for P5600 processors with respect to the classic MIPS32 Release 5. First of all, branch-likely avoidance is activated only when the code is compiled with speed optimization (avoidance is always enabled for the pure MIPS32 Release 5 architecture). Secondly, madd/msub avoidance is enabled, since madd/msub utilization isn't profitable due to the overhead of getting the result out of the HI/LO registers. Multiply-accumulate instructions are activated and utilized together with the necessary code reordering when multiply-add/multiply-subtract statements are met. Finally, load/store bonding is activated by default. All of these optimizations may make the code relatively faster than if just the MIPS32 Release 5 architecture was requested. Co-developed-by: Alexey Malahov <[email protected]> Signed-off-by: Alexey Malahov <[email protected]> Signed-off-by: Serge Semin <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Paul Burton <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Rob Herring <[email protected]> Cc: [email protected] Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit 281e3ae
mips: Fix cpu_has_mips64r1/2 activation for MIPS32 CPUs
Commit 1aeba34 ("MIPS: Hardcode cpu_has_mips* where target ISA allows") updated the cpu_has_mips* macros to be replaced with constant expressions where possible, but by mistake this wasn't done correctly for the cpu_has_mips64r1/cpu_has_mips64r2 macros. They are defined to be replaced with the conditional expression __isa_range_or_flag(), which means either the ISA revision is within the range, or the corresponding CPU options flag was set at the probe stage, or both are true at the same time. But the ISA level value doesn't indicate whether the ISA is MIPS32 or MIPS64. Due to this, if we select one of the MIPS32r1 - MIPS32r5 architectures, the __isa_range() macro will activate the cpu_has_mips64rX flags, which is incorrect. In order to fix the problem, we make sure 64-bit CPU support is enabled by checking the cpu_has_64bits flag alongside the proper ISA range and the specific revision flag. Fixes: 1aeba34 ("MIPS: Hardcode cpu_has_mips* where target ISA allows") Signed-off-by: Serge Semin <[email protected]> Cc: Alexey Malahov <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Paul Burton <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Rob Herring <[email protected]> Cc: [email protected] Signed-off-by: Thomas Bogendoerfer <[email protected]>
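For reference, the shape of the corrected macro as the message describes it; treat this as a sketch, since the real definition lives in asm/cpu-features.h:

```c
/* A matching ISA range alone cannot tell MIPS32 from MIPS64, so the
 * 64-bit variants must additionally require cpu_has_64bits. */
#define cpu_has_mips64r2 \
	(cpu_has_64bits && __isa_range_or_flag(2, 6, MIPS_CPU_ISA_M64R2))
```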
Commit a2ac81c
mips: Add CP0 Write Merge config support
The CP0 config register may indicate whether write-through merging is allowed. Currently there are two types of merging available: SysAD Valid and Full modes. Whether each of them is supported by the core is implementation dependent. Moreover, the ability to change the mode also depends on the chip family instance. Taking all of this into account, we created a dedicated mm_config() method to detect and enable merging if it's supported. It is called for MIPS-type processors at the CPU-probe stage and attempts to detect whether write merging is available. If it's known to be supported and switchable, it switches on the full mode; otherwise it just performs the CP0.Config.MM field analysis. In addition, there are platforms like InterAptiv/ProAptiv which do have the MM bit field set by default, but since write-through caching is unsupported there, write merging is unsupported as well. In this case we just ignore the MM field value. Co-developed-by: Alexey Malahov <[email protected]> Signed-off-by: Alexey Malahov <[email protected]> Signed-off-by: Serge Semin <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Paul Burton <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Rob Herring <[email protected]> Cc: [email protected] Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit 742318a
mips: Add CONFIG/CONFIG6/Cause reg fields macro
There are bit fields which exist in the MIPS CONFIG and CONFIG6 registers but haven't been described in the generic mipsregs.h header so far. In particular, the generic CONFIG bit fields are: BE - endian mode, BM - burst mode, SB - SimpleBE OCP interface mode indicator, UDI - user-defined "CorExtend" instructions, DSP - data scratch pad RAM present, ISP - instruction scratch pad RAM present, etc. The core-specific CONFIG6 bit fields are: JRCD - jump register cache prediction disable, R6 - MIPSr6 extensions enable, IFUPerfCtl - IFU performance control, SPCD - sleep state performance counter, DLSB - disable load/store bonding. There is also a new exception code reported in the ExcCode field of the Cause register: 30 - parity/ECC error exception on fetch, load or cache refill. Let's add them to the mipsregs.h header so they can be used in future platform code which utilizes them. Signed-off-by: Serge Semin <[email protected]> Cc: Alexey Malahov <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Paul Burton <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Rob Herring <[email protected]> Cc: [email protected] Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit 999079c
mips: Add CPS_NS16550_WIDTH config
On some platforms IO memory might require the use of properly sized load/store instructions (Baikal-T1 IO memory is one example). To fix the cps-vec UART debug printout, let's add the CONFIG_CPS_NS16550_WIDTH config to determine which instructions (lb/sb, lh/sh or lw/sw) are required for MMIO operations. Signed-off-by: Serge Semin <[email protected]> Cc: Alexey Malahov <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Paul Burton <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Rob Herring <[email protected]> Cc: [email protected] Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit ad42e0a
mips: Add udelay lpj numbers adjustment
Loops-per-jiffy is a special number which represents the number of noop-loop cycles per CPU-scheduler quantum (jiffy). Aside from the CPU-specific implementation, it depends on the CPU frequency. So when a platform has the CPU frequency fixed, we have no problem and the current udelay interface works just fine. But as soon as a CPU-freq driver is enabled and the cores' frequency changes, we'll end up with distorted udelays. In order to fix this we have to adjust the per-CPU udelay_val (the same as the global loops_per_jiffy) number accordingly. This can be done in the CPU-freq transition event handler. We subscribe to that event in the MIPS arch time-initialization method. Co-developed-by: Alexey Malahov <[email protected]> Signed-off-by: Alexey Malahov <[email protected]> Signed-off-by: Serge Semin <[email protected]> Reviewed-by: Jiaxun Yang <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Paul Burton <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Rob Herring <[email protected]> Cc: [email protected] Signed-off-by: Thomas Bogendoerfer <[email protected]>
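A hedged sketch of such a transition handler; cpufreq_scale() and the notifier API are real kernel interfaces, while the per-CPU bookkeeping is simplified to a single stand-in variable:

```c
#include <linux/cpufreq.h>
#include <linux/notifier.h>

extern unsigned long udelay_val_sketch;	/* stand-in for cpu_data[cpu].udelay_val */

static int lpj_notifier_sketch(struct notifier_block *nb,
			       unsigned long val, void *data)
{
	struct cpufreq_freqs *freq = data;

	/* Rescale the calibration once the new frequency is in effect:
	 * cpufreq_scale(old, div, mult) computes old * mult / div. */
	if (val == CPUFREQ_POSTCHANGE)
		udelay_val_sketch = cpufreq_scale(udelay_val_sketch,
						  freq->old, freq->new);
	return NOTIFY_OK;
}
```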
Commit ed26aac
mips: csrc-r4k: Mark R4K timer as unstable if CPU freq changes
Commit 07d6957 ("MIPS: Don't register r4k sched clock when CPUFREQ enabled") disabled r4k-clock usage for scheduler tick counting because the scheduler doesn't tolerate unstable clock sources. For the same reason the clock should be used with care in the system clocksource framework. As soon as the CPU frequency changes, the clocksource framework should be notified by marking the R4K timer unstable (which it really is, since the tick rate has changed synchronously with the CPU frequency). Signed-off-by: Serge Semin <[email protected]> Cc: Alexey Malahov <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Paul Burton <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Rob Herring <[email protected]> Cc: [email protected] Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit 3858642
mips: cevt-r4k: Update the r4k-clockevent frequency in sync with CPU
Since it is embedded in the CPU core, the MIPS count/compare timer's frequency changes together with the CPU clock. If the frequency really changes, the kernel clockevent framework must be notified, otherwise the kernel timers won't work correctly. Fix this by calling clockevents_update_freq() for each r4k clockevent handler registered per available CPU. Traditionally the MIPS r4k clock runs at the CPU frequency divided by 2, but this isn't true for some platforms. Due to this we have to save the base CPU frequency, then use it to scale the initial timer frequency (mips_hpt_frequency) and pass the updated value on to the clockevent framework. Signed-off-by: Serge Semin <[email protected]> Cc: Alexey Malahov <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Paul Burton <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Rob Herring <[email protected]> Cc: [email protected] Signed-off-by: Thomas Bogendoerfer <[email protected]>
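The notification itself is a single call; a sketch using the traditional divide-by-2 ratio mentioned above (helper name is hypothetical, clockevents_update_freq() is the real API):

```c
#include <linux/clockchips.h>

static void r4k_update_freq_sketch(struct clock_event_device *cd,
				   unsigned long new_cpu_freq)
{
	/* Tell the clockevent core the timer's new rate so future
	 * timer programming uses the right mult/shift factors. */
	clockevents_update_freq(cd, new_cpu_freq / 2);
}
```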
Commit 21e1a03
xtensa: add missing __user annotations to __{get,put}_user_check
__get_user_check and __put_user_check use a temporary pointer but don't mark it as __user, resulting in sparse warnings: sparse: warning: incorrect type in initializer (different address spaces) sparse: expected long *__pu_addr sparse: got long [noderef] <asn:1> *ret sparse: warning: incorrect type in argument 1 (different address spaces) sparse: expected void [noderef] <asn:1> *to sparse: got long *__pu_addr Add the __user annotation to the temporary pointer in __get_user_check and __put_user_check. Reported-by: kbuild test robot <[email protected]> Reported-by: Arnd Bergmann <[email protected]> Signed-off-by: Max Filippov <[email protected]>
Commit 3ac4a61
xtensa: fix type conversion in __get_user_size
The 8-byte access in __get_user_size converts the pointer to the temporary variable to the type of the original user pointer and then dereferences it, resulting in the following sparse warning: sparse: warning: dereference of noderef expression Instead, dereference the original user pointer inside the __typeof__ and add the indirection outside. Signed-off-by: Max Filippov <[email protected]>
Commit c22f907
xtensa: fix error paths in __get_user_{check,size}
Error paths in __get_user_check and __get_user_size directly assign 0 to the result, which causes the following sparse warning: sparse: warning: Using plain integer as NULL pointer Convert 0 to the type pointed to by the user pointer before assigning it. Signed-off-by: Max Filippov <[email protected]>
Commit 9afcc71
xtensa: add missing __user annotations to asm/uaccess.h
clear_user, strncpy_user, strnlen_user and their helpers operate on user pointers, but don't have their arguments marked as __user. Add __user annotation to userspace pointers of those functions. Fix open-coded access check in the strnlen_user while at it. Signed-off-by: Max Filippov <[email protected]>
Commit 2adf535
Commits on May 23, 2020
Drivers: hv: vmbus: Resolve race between init_vp_index() and CPU hotplug
vmbus_process_offer() does two things (among others): 1) first, it sets the channel's target CPU with cpu_hotplug_lock; 2) it then adds the channel to the channel list(s) with channel_mutex. Since cpu_hotplug_lock is released before (2), the channel's target CPU (as designated in (1)) can be deemed "free" by hv_synic_cleanup() and go offline before the channel is added to the list. Fix the race condition by "extending" the cpu_hotplug_lock critical section to include (2) (and (1)), nesting the channel_mutex critical section within the cpu_hotplug_lock critical section as done elsewhere (hv_synic_cleanup(), target_cpu_store()) in the hyperv drivers code. Go even further by extending the channel_mutex critical section to include (1) (and (2)): this change allows removing the (now redundant) bind_channel_to_cpu_lock, and generally simplifies the handling of the target CPUs (which are now always modified with channel_mutex held). Fixes: d570aec ("Drivers: hv: vmbus: Synchronize init_vp_index() vs. CPU hotplug") Signed-off-by: Andrea Parri (Microsoft) <[email protected]> Reviewed-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Wei Liu <[email protected]>
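A sketch of the resulting lock nesting; the two steps are stubbed, and channel_mutex here stands in for vmbus_connection.channel_mutex:

```c
#include <linux/cpu.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(channel_mutex);	/* stand-in */

static void init_vp_index_sketch(void) { }	/* (1) pick target CPU */
static void add_to_lists_sketch(void) { }	/* (2) publish channel */

static void offer_sketch(void)
{
	cpus_read_lock();	/* cpu_hotplug_lock, read side */
	mutex_lock(&channel_mutex);

	init_vp_index_sketch();
	add_to_lists_sketch();	/* the CPU can no longer go offline in between */

	mutex_unlock(&channel_mutex);
	cpus_read_unlock();
}
```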
Commit a949e86
Drivers: hv: vmbus: Resolve more races involving init_vp_index()
init_vp_index() uses the (per-node) hv_numa_map[] masks to record the CPUs allocated for channel interrupts at a given time, and to distribute the performance-critical channels across the available CPUs: in particular, the mask of "candidate" target CPUs in a given NUMA node for a newly offered channel is determined by XOR-ing the node's CPU mask and the node's hv_numa_map. This mechanism assumes that no offline CPUs are set in the hv_numa_map mask, an assumption that does not hold since the mask is currently not updated when a channel is removed or assigned to a different CPU. To address the issues described above, this adds hooks in the channel removal path (hv_process_channel_removal()) and in target_cpu_store() in order to clear, respectively update, the hv_numa_map[] masks as needed. This also adds a (missed) update of the masks in init_vp_index() (cf., e.g., the memory-allocation failure path in this function). Like in the case of init_vp_index(), such hooks require determining whether the given channel is performance critical. init_vp_index() does this by parsing the channel's offer; it cannot rely on the device data structure (device_obj) to retrieve this information, because the device data structure has not been allocated/linked with the channel by the time init_vp_index() executes. A similar situation may hold in hv_is_alloced_cpu() (defined below); the adopted approach is to "cache" the device type of the channel, as computed by parsing the channel's offer, in the channel structure itself. Fixes: 7527810 ("Drivers: hv: vmbus: Introduce the CHANNELMSG_MODIFYCHANNEL message type") Signed-off-by: Andrea Parri (Microsoft) <[email protected]> Reviewed-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Wei Liu <[email protected]>
Commit afaa33d
Commits on May 24, 2020
MIPS: Fix exception handler memcpy()
The exception handler subroutines are declared as a single char, but when copied to the required addresses the copy length is 0x80. When range checks are enabled for memcpy() this results in a build failure, with error messages such as: In file included from arch/mips/mti-malta/malta-init.c:15: In function 'memcpy', inlined from 'mips_nmi_setup' at arch/mips/mti-malta/malta-init.c:98:2: include/linux/string.h:376:4: error: call to '__read_overflow2' declared with attribute error: detected read beyond size of object passed as 2nd parameter 376 | __read_overflow2(); | ^~~~~~~~~~~~~~~~~~ Change the declarations to use type char[]. Signed-off-by: Ben Hutchings <[email protected]> Signed-off-by: YunQiang Su <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
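The fix boils down to the declaration's type; a sketch (except_vec_nmi is one of the real handler symbols, and the copy length follows the message):

```c
#include <linux/string.h>

extern char except_vec_nmi[];	/* was: extern char except_vec_nmi; */

static void install_handler_sketch(void *base)
{
	/* With a char[] type the compiler no longer believes the
	 * source object is a single byte, so a fortified memcpy()
	 * of 0x80 bytes passes its range check. */
	memcpy(base, except_vec_nmi, 0x80);
}
```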
Commit f39293f
MIPS: DTS: Only build subdir of current platform
Add a config check in the Makefile to only build the subdir of the current platform. E.g. without this patch:

  AR      arch/mips/built-in.a
  AR      arch/mips/boot/dts/brcm/built-in.a
  AR      arch/mips/boot/dts/cavium-octeon/built-in.a
  AR      arch/mips/boot/dts/img/built-in.a
  AR      arch/mips/boot/dts/ingenic/built-in.a
  AR      arch/mips/boot/dts/lantiq/built-in.a
  DTC     arch/mips/boot/dts/loongson/loongson3_4core_rs780e.dtb
  DTB     arch/mips/boot/dts/loongson/loongson3_4core_rs780e.dtb.S
  AS      arch/mips/boot/dts/loongson/loongson3_4core_rs780e.dtb.o
  DTC     arch/mips/boot/dts/loongson/loongson3_8core_rs780e.dtb
  DTB     arch/mips/boot/dts/loongson/loongson3_8core_rs780e.dtb.S
  AS      arch/mips/boot/dts/loongson/loongson3_8core_rs780e.dtb.o
  AR      arch/mips/boot/dts/loongson/built-in.a
  AR      arch/mips/boot/dts/mscc/built-in.a
  AR      arch/mips/boot/dts/mti/built-in.a
  AR      arch/mips/boot/dts/netlogic/built-in.a
  AR      arch/mips/boot/dts/ni/built-in.a
  AR      arch/mips/boot/dts/pic32/built-in.a
  AR      arch/mips/boot/dts/qca/built-in.a
  AR      arch/mips/boot/dts/ralink/built-in.a
  AR      arch/mips/boot/dts/xilfpga/built-in.a
  AR      arch/mips/boot/dts/built-in.a

With this patch:

  AR      arch/mips/built-in.a
  DTC     arch/mips/boot/dts/loongson/loongson3_4core_rs780e.dtb
  DTB     arch/mips/boot/dts/loongson/loongson3_4core_rs780e.dtb.S
  AS      arch/mips/boot/dts/loongson/loongson3_4core_rs780e.dtb.o
  DTC     arch/mips/boot/dts/loongson/loongson3_8core_rs780e.dtb
  DTB     arch/mips/boot/dts/loongson/loongson3_8core_rs780e.dtb.S
  AS      arch/mips/boot/dts/loongson/loongson3_8core_rs780e.dtb.o
  AR      arch/mips/boot/dts/loongson/built-in.a
  AR      arch/mips/boot/dts/built-in.a

Signed-off-by: Tiezhu Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit 41528ba
MIPS: Tidy up CP0.Config6 bits definition
CP0.Config6 is a vendor-defined register whose bit definitions differ from one vendor to another. Recently, Xuerui's Loongson-3 patch and Serge's P5600 patch made the definitions inconsistent and unclear. To make life easier, this patch tidies the definitions up: 1, add an _MTI_ infix for proAptiv/P5600 feature bits; 2, add a _LOONGSON_ infix for Loongson-3 feature bits; 3, add bit6/bit7 definitions for Loongson-3 which will be used later. All existing users of these macros are updated. Cc: WANG Xuerui <[email protected]> Cc: Serge Semin <[email protected]> Signed-off-by: Huacai Chen <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit 8267e78
MIPS: emulate CPUCFG instruction on older Loongson64 cores
CPUCFG is the instruction for querying processor characteristics on newer Loongson processors, much like x86's CPUID. Since the instruction is supposedly designed to provide a unified way to do feature detection (without having to, for example, parse /proc/cpuinfo, which is too heavyweight), it is important to provide compatibility for older cores without native support. Fortunately, most of the fields can be synthesized without changes to semantics. Performance is not really a big concern, because feature detection logic is not expected to be invoked very often in typical userland applications. The instruction can't be emulated on LOONGSON_2EF cores, according to FlyGoat's experiments: because the LWC2 opcode is assigned to other valid instructions on 2E and 2F, no RI exception is raised for us to intercept. So compatibility is only extended back as far as Loongson-3A1000. Loongson-2K is covered too, as it is basically a remix of various blocks from the 3A/3B models from a kernel perspective. This is lightly based on Loongson's work on their Linux 3.10 fork, it being the authority on the right feature flags to fill in where things aren't otherwise discoverable. Signed-off-by: WANG Xuerui <[email protected]> Reviewed-by: Jiaxun Yang <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Jiaxun Yang <[email protected]> Cc: Tiezhu Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit ec7a931
MIPS: SGI-IP30: Reorder the macros in war.h
Fix the ordering of the macros in arch/mips/mach-ip30/war.h to match those in arch/mips/mach-ip27/war.h. Signed-off-by: Joshua Kinard <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit b34a1a7
MIPS: tools: Fix resource leak in elf-entry.c
There is a file descriptor resource leak in elf-entry.c; fix this by adding fclose() before the return and die calls. Signed-off-by: Kaige Li <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit f33a0b9
Commits on May 25, 2020
sched/core: Optimize ttwu() spinning on p->on_cpu
Both Rik and Mel reported seeing ttwu() spend significant time on: smp_cond_load_acquire(&p->on_cpu, !VAL); Attempt to avoid this by queueing the wakeup on the CPU that owns the p->on_cpu value. This will then allow the ttwu() to complete without further waiting. Since we run schedule() with interrupts disabled, the IPI is guaranteed to happen after p->on_cpu is cleared, this is what makes it safe to queue early. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Jirka Hladky <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: [email protected] Cc: Hillf Danton <[email protected]> Cc: Rik van Riel <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Peter Zijlstra authored and Ingo Molnar committed May 25, 2020 (commit c6e7bd7)
sched/core: Offload wakee task activation if the wakee is descheduling
The previous commit: c6e7bd7: ("sched/core: Optimize ttwu() spinning on p->on_cpu") avoids spinning on p->on_rq when the task is descheduling, but only if the wakee is on a CPU that does not share cache with the waker. This patch offloads the activation of the wakee to the CPU that is about to go idle if the task is the only one on the runqueue. This potentially allows the waker task to continue making progress when the wakeup is not strictly synchronous. This is very obvious with netperf UDP_STREAM running on localhost. The waker is sending packets as quickly as possible without waiting for any reply. It frequently wakes the server for the processing of packets and when netserver is using local memory, it quickly completes the processing and goes back to idle. The waker often observes that netserver is on_rq and spins excessively leading to a drop in throughput. This is a comparison of 5.7-rc6 against "sched: Optimize ttwu() spinning on p->on_cpu" and against this patch labeled vanilla, optttwu-v1r1 and localwakelist-v1r2 respectively.

                        5.7.0-rc6             5.7.0-rc6             5.7.0-rc6
                          vanilla          optttwu-v1r1    localwakelist-v1r2
  Hmean send-64       251.49 (  0.00%)    258.05 *  2.61%*    305.59 * 21.51%*
  Hmean send-128      497.86 (  0.00%)    519.89 *  4.43%*    600.25 * 20.57%*
  Hmean send-256      944.90 (  0.00%)    997.45 *  5.56%*   1140.19 * 20.67%*
  Hmean send-1024    3779.03 (  0.00%)   3859.18 *  2.12%*   4518.19 * 19.56%*
  Hmean send-2048    7030.81 (  0.00%)   7315.99 *  4.06%*   8683.01 * 23.50%*
  Hmean send-3312   10847.44 (  0.00%)  11149.43 *  2.78%*  12896.71 * 18.89%*
  Hmean send-4096   13436.19 (  0.00%)  13614.09 (  1.32%)  15041.09 * 11.94%*
  Hmean send-8192   22624.49 (  0.00%)  23265.32 *  2.83%*  24534.96 *  8.44%*
  Hmean send-16384  34441.87 (  0.00%)  36457.15 *  5.85%*  35986.21 *  4.48%*

Note that this benefit is not universal to all wakeups, it only applies to the case where the waker often spins on p->on_rq. The impact can be seen from a "perf sched latency" report generated from a single iteration of one packet size:

  ------------------------------------------------------------------------------------------------------------
   Task               |  Runtime ms   | Switches | Average delay ms | Maximum delay ms | Maximum delay at
  ------------------------------------------------------------------------------------------------------------
  vanilla
   netperf:4337       |  21709.193 ms |     2932 | avg: 0.002 ms    | max:    0.041 ms | max at: 112.154512 s
   netserver:4338     |  14629.459 ms |  5146990 | avg: 0.001 ms    | max: 1615.864 ms | max at: 140.134496 s
  localwakelist-v1r2
   netperf:4339       |  29789.717 ms |     2460 | avg: 0.002 ms    | max:    0.059 ms | max at: 138.205389 s
   netserver:4340     |  18858.767 ms |  7279005 | avg: 0.001 ms    | max:    0.362 ms | max at: 135.709683 s
  ------------------------------------------------------------------------------------------------------------

Note that the average wakeup delay is quite small on both the vanilla kernel and with the two patches applied. However, there are significant outliers with the vanilla kernel, with the maximum one measured at 1615 milliseconds, but never worse than 0.362 ms with both patches applied, along with a much higher rate of context switching. Similarly, a separate profile of cycles showed that 2.83% of all cycles were spent in try_to_wake_up(), with almost half of those cycles spent spinning on p->on_rq. With the two patches, the percentage of cycles spent in try_to_wake_up() drops to 1.13%. Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Jirka Hladky <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: [email protected] Cc: Hillf Danton <[email protected]> Cc: Rik van Riel <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Commit 2ebb177
MIPS: BCM63XX: fix BCM6358 GPIO count
The BCM6358 SoC has only 38 available GPIOs. Fix it. Signed-off-by: Daniel González Cabanelas <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit 6af2aa7
KVM: arm64: Clean up cpu_init_hyp_mode()
Pull bits of code to the only place where it is used. Remove empty function __cpu_init_stage2(). Remove redundant has_vhe() check since this function is nVHE-only. No functional changes intended. Signed-off-by: David Brazdil <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
David Brazdil authored and Marc Zyngier committed May 25, 2020 (commit 71b3ec5)
KVM: arm64: Fix incorrect comment on kvm_get_hyp_vector()
The comment used to say that kvm_get_hyp_vector is only called on VHE systems. In fact, it is also called from the nVHE init function cpu_init_hyp_mode(). Fix the comment to stop confusing devs. Signed-off-by: David Brazdil <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
David Brazdil authored and Marc Zyngier committed May 25, 2020 (commit 438f711)
KVM: arm64: Remove obsolete kvm_virt_to_phys abstraction
This abstraction was introduced to hide the difference between arm and arm64 but, with the former no longer supported, this abstraction can be removed and the canonical kernel API used directly instead. Signed-off-by: Andrew Scull <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> CC: Marc Zyngier <[email protected]> CC: James Morse <[email protected]> CC: Suzuki K Poulose <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Commit 0a78791
xtensa: Fix spelling/grammar in comment
Change 'excpetion' to 'exception', 'handeled' to 'handled' and 'the the' to 'the'. Signed-off-by: Chris Packham <[email protected]> Message-Id: <[email protected]> Signed-off-by: Max Filippov <[email protected]>
Commit 3ead2f9
Commits on May 27, 2020
MIPS: BCM63xx: fix 6328 boot selection bit
MISC_STRAP_BUS_BOOT_SEL_SHIFT is 18 according to Broadcom's GPL source code. Signed-off-by: Álvaro Fernández Rojas <[email protected]> Acked-by: Florian Fainelli <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit 2038e04
PCI: Don't disable decoding when mmio_always_on is set
Don't disable MEM/IO decoding when a device has both non_compliant_bars and mmio_always_on set. This allows us to quirk devices that have junk in their BARs but whose decoding can't be disabled. Signed-off-by: Jiaxun Yang <[email protected]> Acked-by: Bjorn Helgaas <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
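A sketch of the resulting condition, modeled on pci_setup_device(); both flags are real struct pci_dev bitfields, while the surrounding logic is simplified:

```c
#include <linux/pci.h>

static void bar_fixup_sketch(struct pci_dev *dev)
{
	u16 cmd;

	/* Only disable decoding for junk BARs when the device does
	 * not insist on keeping decoding enabled. */
	if (dev->non_compliant_bars && !dev->mmio_always_on) {
		pci_read_config_word(dev, PCI_COMMAND, &cmd);
		cmd &= ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY);
		pci_write_config_word(dev, PCI_COMMAND, cmd);
	}
}
```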
Commit b6caa1d
PCI: Add Loongson PCI Controller support
This controller can be found on the Loongson-2K SoC and on Loongson-3 systems with the RS780E/LS7A PCH. The RS780E part of the code was previously located at arch/mips/pci/ops-loongson3.c, and it can now use the generic PCI driver implementation. Signed-off-by: Jiaxun Yang <[email protected]> Reviewed-by: Rob Herring <[email protected]> Acked-by: Lorenzo Pieralisi <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit 1f58cca
dt-bindings: Document Loongson PCI Host Controller
PCI host controller found on Loongson PCHs and SoCs. Signed-off-by: Jiaxun Yang <[email protected]> Reviewed-by: Rob Herring <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit 83e757e
MIPS: DTS: Loongson64: Add PCI Controller Node
Add a PCI host controller node to the Loongson64 with RS780E PCH dts. Note that PCI interrupts are probed the legacy way; as different machines have different interrupt arrangements, we can't cover all of them in dt. Signed-off-by: Jiaxun Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit d8242e6
MIPS: Loongson64: Switch to generic PCI driver
We can now enable generic PCI driver in Kconfig, and remove legacy PCI driver code. Radeon vbios quirk is moved to the platform folder to fit the new structure. Signed-off-by: Jiaxun Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit 6423e59
MIPS: ingenic: DTS: Add memory info of GCW Zero
Add memory info of the GCW Zero in its devicetree. The bootloader generally provides this information, but since it is fixed to 512 MiB, it doesn't hurt to have it in devicetree. It allows the kernel to boot without any parameter passed as argument. Signed-off-by: Paul Cercueil <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit 963287e
MIPS: ingenic: Add support for GCW Zero prototype
Add support for the GCW Zero prototype. The only (?) difference is that it only has 256 MiB of RAM, compared to the 512 MiB of RAM of the retail device. Signed-off-by: Paul Cercueil <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit d653d1f
MIPS: ingenic: Default to a generic board
Having a generic board option makes it possible to create a kernel that will run on various Ingenic SoCs, as long as the right devicetree is provided. Signed-off-by: Paul Cercueil <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Commit 6224920
MIPS: Do not flush tlb page when updating PTE entry
It is not necessary to flush a TLB page on all CPUs if a suitable PTE entry already exists during page fault handling; just updating the local TLB is fine. Here, redefine flush_tlb_fix_spurious_fault as empty on MIPS systems. Signed-off-by: Bibo Mao <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
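On MIPS the override is a one-line no-op, as the message says; a sketch of the definition:

```c
/* A spurious fault after a concurrent PTE update only needs the
 * local TLB refreshed, so the cross-CPU flush hook does nothing. */
#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
```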
Configuration menu - View commit details
-
Copy full SHA for 4dd7683 - Browse repository at this point
Copy the full SHA 4dd7683View commit details -
mm/memory.c: Update local TLB if PTE entry exists
If two threads concurrently fault at the same page, the thread that won the race updates the PTE and its local TLB. For now, the other thread gives up, simply does nothing, and continues. It could happen that this second thread triggers another fault, whereby it only updates its local TLB while handling the fault. Instead of triggering another fault, let's directly update the local TLB of the second thread. Function update_mmu_tlb is used here to update local TLB on the second thread, and it is defined as empty on other arches. Signed-off-by: Bibo Mao <[email protected]> Acked-by: Andrew Morton <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
(commit 7df6769)
mm/memory.c: Add memory read privilege on page fault handling
Add a pte_sw_mkyoung function to make pages readable on the MIPS platform during page fault handling. This patch improves page fault latency by about 10% on my MIPS machine with the lmbench lat_pagefault case. It is a no-op on other architectures, so there is no negative impact on them. Signed-off-by: Bibo Mao <[email protected]> Acked-by: Andrew Morton <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
(commit 44bf431)
MIPS: mm: add page valid judgement in function pte_modify
If the original PTE has the _PAGE_ACCESSED bit set and the new PTE does not have the _PAGE_NO_READ bit set, we can add the _PAGE_SILENT_READ bit to enable the page valid bit. Signed-off-by: Bibo Mao <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
(commit 273b5fa)
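To make the bit rule concrete, here is a toy model of the logic just described; the bit positions below are invented for illustration (the real MIPS pgtable bits live in arch/mips/include/asm/pgtable-bits.h):

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative bit positions only, not the real MIPS layout. */
#define _PAGE_ACCESSED    (1u << 0)
#define _PAGE_NO_READ     (1u << 1)
#define _PAGE_SILENT_READ (1u << 2)

static uint32_t toy_pte_modify(uint32_t old_pte, uint32_t newprot)
{
    uint32_t pte = newprot;

    /* Old mapping was accessed and the new one is readable: pre-set
     * the valid/"silent read" bit so the first read doesn't take a
     * spurious fault. */
    if ((old_pte & _PAGE_ACCESSED) && !(pte & _PAGE_NO_READ))
        pte |= _PAGE_SILENT_READ;
    return pte;
}

int main(void)
{
    printf("%#x\n", toy_pte_modify(_PAGE_ACCESSED, 0));             /* 0x4 */
    printf("%#x\n", toy_pte_modify(_PAGE_ACCESSED, _PAGE_NO_READ)); /* 0x2 */
    return 0;
}
```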
MIPS: Fix IRQ tracing when call handle_fpe() and handle_msa_fpe()
Register "a1" is unsaved in this function, when CONFIG_TRACE_IRQFLAGS is enabled, the TRACE_IRQS_OFF macro will call trace_hardirqs_off(), and this may change register "a1". The changed register "a1" as argument will be send to do_fpe() and do_msa_fpe(). Signed-off-by: YuanJunQing <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
Configuration menu - View commit details
-
Copy full SHA for 31e1b3e - Browse repository at this point
Copy the full SHA 31e1b3eView commit details -
MIPS: Loongson64: select NO_EXCEPT_FILL
Loongson64 loads the kernel at 0x82000000 and allocates exception vectors via ebase, so we don't need to reserve space for exception vectors at the head of the kernel. Signed-off-by: Jiaxun Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
(commit 7d6d283)
KVM: x86/mmu: Set mmio_value to '0' if reserved #PF can't be generated
Set the mmio_value to '0' instead of simply clearing the present bit to squash a benign warning in kvm_mmu_set_mmio_spte_mask() that complains about the mmio_value overlapping the lower GFN mask on systems with 52 bits of PA space. Opportunistically clean up the code and comments. Cc: [email protected] Fixes: d43e267 ("KVM: x86: only do L1TF workaround on affected processors") Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
(commit 6129ed8)
KVM: VMX: enable X86_FEATURE_WAITPKG in KVM capabilities
Even though we might not allow the guest to use WAITPKG's new instructions, we should tell KVM that the feature is supported by the host CPU. Note that vmx_waitpkg_supported checks that WAITPKG _can_ be set in secondary execution controls as specified by the VMX capability MSR, rather than that we actually enable it for a guest. Cc: [email protected] Fixes: e69e72f ("KVM: x86: Add support for user wait instructions") Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
(commit 0abcc8f)
KVM: x86: don't expose MSR_IA32_UMWAIT_CONTROL unconditionally
This MSR is only available when the host supports the WAITPKG feature. This breaks a nested guest if the L1 hypervisor is set to ignore unknown MSRs, because the only other safety check that the kernel does is that it attempts to read the MSR and rejects it if it gets an exception. Cc: [email protected] Fixes: 6e3ba4a ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL") Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
(commit f4cfcd2)
KVM: x86: simplify is_mmio_spte
We can simply look at bits 52-53 to identify MMIO entries in KVM's page tables. Therefore, there is no need to pass a mask to kvm_mmu_set_mmio_spte_mask. Signed-off-by: Paolo Bonzini <[email protected]>
(commit e7581ca)
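A minimal sketch of the resulting check, assuming an MMIO marker held in bits 52-53 of the shadow PTE as the commit states; the mask/value constants are illustrative rather than the exact KVM definitions:

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative constants; the real ones live in KVM's MMU code. */
#define SPTE_MMIO_MASK  (3ull << 52)
#define SPTE_MMIO_VALUE (3ull << 52)

static int is_mmio_spte(uint64_t spte)
{
    return (spte & SPTE_MMIO_MASK) == SPTE_MMIO_VALUE;
}

int main(void)
{
    printf("%d\n", is_mmio_spte(3ull << 52)); /* 1: both marker bits set */
    printf("%d\n", is_mmio_spte(1ull << 52)); /* 0: only one bit set */
    return 0;
}
```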
Merge tag 'kvm-s390-next-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: Cleanups for 5.8 - vsie (nesting) cleanups - remove unneeded semicolon
(commit 4c7ccc3)
Merge branch 'kvm-master' into HEAD
Merge AMD fixes before doing more development work.
(commit 7529e76)
KVM: x86: allow KVM_STATE_NESTED_MTF_PENDING in kvm_state flags
The migration functionality was left incomplete in commit 5ef8acb ("KVM: nVMX: Emulate MTF when performing instruction emulation", 2020-02-23), fix it. Fixes: 5ef8acb ("KVM: nVMX: Emulate MTF when performing instruction emulation") Cc: [email protected] Reviewed-by: Oliver Upton <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
(commit df2a69a)
KVM: x86: Remove superfluous brackets from case statement
Remove unnecessary brackets from a case statement that unintentionally encapsulates unrelated case statements in the same switch statement. While technically legal and functionally correct syntax, the brackets are visually confusing and potentially dangerous, e.g. the last of the encapsulated case statements has an undocumented fall-through that isn't flagged by compilers due to the encapsulation. Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
(commit 7cb85fc)
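The hazard is easy to reproduce in a standalone program: a brace that opens at one case label and closes several labels later visually groups unrelated cases and buries the fall-through at its end. Everything below is made up for illustration:

```c
#include <stdio.h>

static const char *classify(int msr)
{
    switch (msr) {
    case 1: {                /* the brace opens here...                   */
        return "one";
    case 2:                  /* ...but these labels sit inside the block  */
        return "two";
    case 3:
        ;                    /* no break/return: execution falls past '}' */
    }                        /* ...straight into case 4 below             */
    case 4:
        return "three-or-four";
    default:
        return "other";
    }
}

int main(void)
{
    for (int i = 1; i <= 5; i++)
        printf("%d -> %s\n", i, classify(i));
    return 0;
}
```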
KVM: x86: Take an unsigned 32-bit int for has_emulated_msr()'s index
Take a u32 for the index in has_emulated_msr() to match hardware, which treats MSR indices as unsigned 32-bit values. Functionally, taking a signed int doesn't cause problems with the current code base, but could theoretically cause problems with 32-bit KVM, e.g. if the index were checked via a less-than statement, which would evaluate incorrectly for MSR indices with bit 31 set. Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
(commit cb97c2d)
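The signedness hazard is simple to demonstrate in isolation: MSR indices such as 0xC0000080 (EFER) have bit 31 set, so a signed int holds them as a negative value and a less-than range check gives the wrong answer:

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* An MSR index with bit 31 set (0xC0000080 is EFER). */
    int32_t  signed_index   = (int32_t)0xC0000080u;
    uint32_t unsigned_index = 0xC0000080u;

    /* The signed variant is negative, so a less-than check misfires. */
    printf("signed:   index < 0x1000 -> %d\n", signed_index < 0x1000);    /* 1 (wrong) */
    printf("unsigned: index < 0x1000 -> %d\n", unsigned_index < 0x1000u); /* 0 (right) */
    return 0;
}
```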
KVM: VMX: replace "fall through" with "return" to indicate different …
…case The second "/* fall through */" in rmode_exception() makes code harder to read. Replace it with "return" to indicate they are different cases, only the #DB and #BP check vcpu->guest_debug, while others don't care. And this also improves the readability. Suggested-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Miaohe Lin <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Configuration menu - View commit details
-
Copy full SHA for a8cfbae - Browse repository at this point
Copy the full SHA a8cfbaeView commit details -
KVM: Fix the indentation to match coding style
There is a bad indentation in the next&queue branch. The patch looks like it fixes nothing, though it fixes the indentation; the before and after snippets differ only in whitespace, which this flattened quote cannot show: if (!handle_fastpath_set_x2apic_icr_irqoff(vcpu, data)) { kvm_skip_emulated_instruction(vcpu); ret = EXIT_FASTPATH_EXIT_HANDLED; } break; case MSR_IA32_TSCDEADLINE: Signed-off-by: Haiwei Li <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
(commit 80bc97f)
kvm/x86: Remove redundant function implementations
pic_in_kernel(), ioapic_in_kernel() and irqchip_kernel() have the same implementation. Signed-off-by: Peng Hao <[email protected]> Message-Id: <HKAPR02MB4291D5926EA10B8BFE9EA0D3E0B70@HKAPR02MB4291.apcprd02.prod.outlook.com> Signed-off-by: Paolo Bonzini <[email protected]>
(commit 88197e6)
KVM: nSVM: fix condition for filtering async PF
Async page faults have to be trapped in the host (L1 in this case), since the APF reason was passed from L0 to L1 and stored in the L1 APF data page. This was completely reversed: the page faults were passed to the guest, an L2 hypervisor. Cc: [email protected] Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
(commit a3535be)
KVM: nSVM: leave ASID aside in copy_vmcb_control_area
Restoring the ASID from the hsave area on VMEXIT is wrong, because its value depends on the handling of TLB flushes. Just skipping the field in copy_vmcb_control_area will do. Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
(commit 6c0238c)
KVM: x86: Initialize tdp_level during vCPU creation
Initialize vcpu->arch.tdp_level during vCPU creation to avoid consuming garbage if userspace calls KVM_RUN without first calling KVM_SET_CPUID. Fixes: e93fd3b ("KVM: x86/mmu: Capture TDP level when updating CPUID") Reported-by: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
(commit 7d2e874)
KVM: nSVM: Preserve registers modifications done before nested_svm_vmexit()
L2 guest hang is observed after 'exit_required' was dropped and nSVM switched to check_nested_events() completely. The hang is a busy loop when e.g. KVM is emulating an instruction (e.g. L2 is accessing MMIO space and we drop to userspace). After nested_svm_vmexit(), when L1 is doing VMRUN, the nested guest's RIP is not advanced, so KVM goes into emulating the same instruction which caused nested_svm_vmexit() and the loop continues. nested_svm_vmexit() is not new; however, with check_nested_events() we're now calling it later than before. If KVM has modified register state by that time, we may pick stale values from the VMCB when trying to save nested guest state to the nested VMCB. nVMX code handles this case correctly: sync_vmcs02_to_vmcs12(), called from nested_vmx_vmexit(), does e.g. 'vmcs12->guest_rip = kvm_rip_read(vcpu)', and this ensures KVM-made modifications are preserved. Do the same for nSVM. Generally, nested_vmx_vmexit()/nested_svm_vmexit() need to pick up all nested guest state modifications done by KVM after vmexit. It would be great to find a way to express this without manually tracking these changes, e.g. nested_{vmcb,vmcs}_get_field(). Co-debugged-with: Paolo Bonzini <[email protected]> Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
(commit b6162e8)
KVM: x86: track manually whether an event has been injected
Instead of calling kvm_event_needs_reinjection, track its future return value in a variable. This will be useful in the next patch. Signed-off-by: Paolo Bonzini <[email protected]>
(commit c6b22f5)
Commits on May 28, 2020
MIPS: DTS: Fix build errors used with various configs
If CONFIG_MIPS_MALTA is not set but CONFIG_LEGACY_BOARD_SEAD3 is set, the subdir arch/mips/boot/dts/mti will not be built, so sead3.dts, which depends on CONFIG_LEGACY_BOARD_SEAD3 in this subdir, is also not built, leading to the following build error; fix it. LD .tmp_vmlinux.kallsyms1 arch/mips/generic/board-sead3.o:(.mips.machines.init+0x4): undefined reference to `__dtb_sead3_begin' Makefile:1106: recipe for target 'vmlinux' failed make: *** [vmlinux] Error 1 Additionally, add a CONFIG_FIT_IMAGE_FDT_BOSTON check for the img subdir to fix the following build error when CONFIG_MACH_PISTACHIO is not set but CONFIG_FIT_IMAGE_FDT_BOSTON is set: FATAL ERROR: Couldn't open "boot/dts/img/boston.dtb": No such file or directory Reported-by: kbuild test robot <[email protected]> Reported-by: Guenter Roeck <[email protected]> Fixes: 41528ba ("MIPS: DTS: Only build subdir of current platform") Signed-off-by: Tiezhu Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
(commit 9c43783)
MIPS: CPU_LOONGSON2EF need software to maintain cache consistency
CPU_LOONGSON2EF needs software to maintain cache consistency, so modify the 'cpu_needs_post_dma_flush' function to return true when the CPU type is CPU_LOONGSON2EF. Cc: [email protected] Signed-off-by: Lichao Liu <[email protected]> Reviewed-by: Jiaxun Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
(commit a202bf7)
MIPS: Loongson64: Define PCI_IOBASE
PCI_IOBASE is used to create VM maps for PCI I/O ports; it is required by generic PCI drivers to make the memory-mapped I/O range work. To deal with legacy drivers that have a fixed I/O port range, we reserve 0x10000 in PCI_IOBASE, which should be enough for i8259/i8042 stuff. Signed-off-by: Jiaxun Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
(commit 482cd90)
Merge branch 'sched/urgent' into sched/core, to pick up fix
Signed-off-by: Ingo Molnar <[email protected]>
Ingo Molnar committed on May 28, 2020 (commit 498bdcd)
Merge branch 'core/rcu' into sched/core, to pick up dependency
We are going to rely on the loosening of RCU callback semantics, introduced by this commit: 806f04e: ("rcu: Allow for smp_call_function() running callbacks from idle") Signed-off-by: Ingo Molnar <[email protected]>
Ingo Molnar committed on May 28, 2020 (commit 58ef57b)
sched: Fix smp_call_function_single_async() usage for ILB
The recent commit: 90b5363 ("sched: Clean up scheduler_ipi()") got smp_call_function_single_async() subtly wrong. Even though it will return -EBUSY when trying to re-use a csd, that condition is not atomic and still requires external serialization. The change in kick_ilb() got this wrong. While on first reading kick_ilb() has an atomic test-and-set that appears to serialize the use, the matching 'release' is not in the right place to actually guarantee this serialization. Rework the nohz_idle_balance() trigger so that the release is in the IPI callback and thus guarantees the required serialization for the CSD. Fixes: 90b5363 ("sched: Clean up scheduler_ipi()") Reported-by: Qian Cai <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
Peter Zijlstra authored and Ingo Molnar committed on May 28, 2020 (commit 19a1f5e)
smp: Optimize flush_smp_call_function_queue()
The call_single_queue can contain (two) different callbacks, synchronous and asynchronous. The current interrupt handler runs them in-order, which means that remote CPUs that are waiting for their synchronous call can be delayed by running asynchronous callbacks. Rework the interrupt handler to first run the synchronous callbacks. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Peter Zijlstra authored and Ingo Molnar committed on May 28, 2020 (commit 52103be)
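A toy userspace model of the reordering, assuming a queue holding a mix of the two callback kinds; this is purely illustrative and not the kernel's csd machinery:

```c
#include <stdio.h>

/* Synchronous entries have a sender spin-waiting on completion, so the
 * handler now makes a first pass over those before running the rest. */
struct cb { const char *name; int sync; };

int main(void)
{
    struct cb queue[] = {
        { "async-A", 0 }, { "sync-B", 1 }, { "async-C", 0 }, { "sync-D", 1 },
    };
    int n = sizeof(queue) / sizeof(queue[0]);

    for (int i = 0; i < n; i++)      /* pass 1: unblock the waiters */
        if (queue[i].sync)
            printf("run %s\n", queue[i].name);
    for (int i = 0; i < n; i++)      /* pass 2: the asynchronous rest */
        if (!queue[i].sync)
            printf("run %s\n", queue[i].name);
    return 0;
}
```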
smp: Move irq_work_run() out of flush_smp_call_function_queue()
This ensures flush_smp_call_function_queue() is strictly about call_single_queue. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Peter Zijlstra authored and Ingo Molnar committed on May 28, 2020 (commit afaa653)
smp: Optimize send_call_function_single_ipi()
Just like the ttwu_queue_remote() IPI, make use of _TIF_POLLING_NRFLAG to avoid sending IPIs to idle CPUs. [ mingo: Fix UP build bug. ] Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Peter Zijlstra authored and Ingo Molnar committed on May 28, 2020 (commit b2a02fc)
irq_work, smp: Allow irq_work on call_single_queue
Currently irq_work_queue_on() will issue an unconditional arch_send_call_function_single_ipi() and has the handler do irq_work_run(). This is unfortunate in that it makes the IPI handler look at a second cacheline and it misses the opportunity to avoid the IPI. Instead note that struct irq_work and struct __call_single_data are very similar in layout, so use a few bits in the flags word to encode a type and stick the irq_work on the call_single_queue list. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Peter Zijlstra authored and Ingo Molnar committed on May 28, 2020 (commit 4b44a21)
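A contrived userspace sketch of the core idea: a type tag encoded in otherwise-unused flag bits lets one queue carry two kinds of entries, and the dequeue path dispatches on the tag. All names here are illustrative, not kernel APIs:

```c
#include <stdio.h>

enum { ENTRY_CSD = 0, ENTRY_IRQ_WORK = 1 };
#define TYPE_MASK 0x3u

struct entry {
    unsigned int flags;   /* low bits carry the type tag */
};

/* One handler drains the shared queue and dispatches by tag, so it
 * never needs to touch a second structure to learn the entry type. */
static void dispatch(const struct entry *e)
{
    switch (e->flags & TYPE_MASK) {
    case ENTRY_CSD:      printf("run smp-call callback\n"); break;
    case ENTRY_IRQ_WORK: printf("run irq_work callback\n"); break;
    }
}

int main(void)
{
    struct entry a = { .flags = ENTRY_CSD };
    struct entry b = { .flags = ENTRY_IRQ_WORK };
    dispatch(&a);
    dispatch(&b);
    return 0;
}
```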
In preparation of removing rq->wake_list, replace the !list_empty(rq->wake_list) with rq->ttwu_pending. This is not fully equivalent as this new variable is racy. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Peter Zijlstra authored and Ingo Molnar committed on May 28, 2020 (commit 126c209)
The recent commit: 90b5363 ("sched: Clean up scheduler_ipi()") got smp_call_function_single_async() subtly wrong. Even though it will return -EBUSY when trying to re-use a csd, that condition is not atomic and still requires external serialization. The change in ttwu_queue_remote() got this wrong. While on first reading ttwu_queue_remote() has an atomic test-and-set that appears to serialize the use, the matching 'release' is not in the right place to actually guarantee this serialization. The actual race is vs the sched_ttwu_pending() call in the idle loop; that can run the wakeup-list without consuming the CSD. Instead of trying to chain the lists, merge them. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Peter Zijlstra authored and Ingo Molnar committed on May 28, 2020 (commit a148866)
sched/headers: Split out open-coded prototypes into kernel/sched/smp.h
Move the prototypes for sched_ttwu_pending() and send_call_function_single_ipi() into the newly created kernel/sched/smp.h header, to make sure they are all the same, and to make architectures that use -Wmissing-prototypes happy. Signed-off-by: Ingo Molnar <[email protected]>
Ingo Molnar committed on May 28, 2020 (commit 1f8db41)
KVM: arm64: vgic-v3: Take cpu_if pointer directly instead of vcpu
If we move the used_lrs field to the version-specific cpu interface structure, the following functions only operate on the struct vgic_v3_cpu_if and not the full vcpu: __vgic_v3_save_state __vgic_v3_restore_state __vgic_v3_activate_traps __vgic_v3_deactivate_traps __vgic_v3_save_aprs __vgic_v3_restore_aprs This is going to be very useful for nested virt, so move the used_lrs field and change the prototypes and implementations of these functions to take the cpu_if parameter directly. No functional change. Reviewed-by: James Morse <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
(commit fc5d1f1)
KVM: arm64: Refactor vcpu_{read,write}_sys_reg
Extract the direct HW accessors for later reuse. Reviewed-by: James Morse <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
Marc Zyngier committed on May 28, 2020 (commit 7ea90bd)
KVM: arm64: Add missing reset handlers for PMU emulation
As we're about to become a bit more harsh when it comes to the lack of reset callbacks, let's add the missing PMU reset handlers. Note that these only cover *CLR registers that were always covered by their *SET counterpart, so there is no semantic change here. Reviewed-by: James Morse <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
Marc Zyngier committed on May 28, 2020 (commit 7ccadf2)
KVM: arm64: Move sysreg reset check to boot time
Our sysreg reset check has become a bit silly, as it only checks whether a reset callback actually exists for a given sysreg entry, and applies the method if available. Doing the check at each vcpu reset is pretty dumb, as the tables never change. It is thus perfectly possible to do the same checks at boot time. This also allows us to introduce a sparse sys_regs[] array, something that will be required with ARMv8.4-NV. Signed-off-by: Marc Zyngier <[email protected]>
Marc Zyngier committed on May 28, 2020 (commit bb44a8d)
KVM: arm64: Don't use empty structures as CPU reset state
Keeping an empty structure as the vcpu state initializer is slightly wasteful: we only want to set pstate, and zero everything else. Just do that. Signed-off-by: Marc Zyngier <[email protected]>
Marc Zyngier committed on May 28, 2020 (commit 349c330)
MIPS: Loongson64: Remove not used pci.c
After commit 6423e59 ("MIPS: Loongson64: Switch to generic PCI driver"), arch/mips/loongson64/pci.c is not used any more, remove it. Signed-off-by: Tiezhu Yang <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
(commit c3b9c00)
KVM: arm64: Parametrize exception entry with a target EL
We currently assume that an exception is delivered to EL1, always. Once we emulate EL2, this no longer will be the case. To prepare for this, add a target_mode parameter. While we're at it, merge the computing of the target PC and PSTATE in a single function that updates both PC and CPSR after saving their previous values in the corresponding ELR/SPSR. This ensures that they are updated in the correct order (a pretty common source of bugs...). Reviewed-by: Mark Rutland <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
Marc Zyngier committed on May 28, 2020 (commit d9d7d84)
KVM: arm64: Drop obsolete comment about sys_reg ordering
The general comment about keeping the enum order in sync with the save/restore code has been obsolete for many years now. Just drop it. Note that there are other ordering requirements in the enum, such as the PtrAuth and PMU registers, which are still valid. Reported-by: James Morse <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
Marc Zyngier committed on May 28, 2020 (commit 8f7f4fe)
KVM: x86: enable event window in inject_pending_event
In case an interrupt arrives after nested.check_events but before the call to kvm_cpu_has_injectable_intr, we could end up enabling the interrupt window even if the interrupt is actually going to be a vmexit. This is useless rather than harmful, but it really complicates reasoning about SVM's handling of the VINTR intercept. We'd like to never bother with the VINTR intercept if V_INTR_MASKING=1 && INTERCEPT_INTR=1, because in that case there is no interrupt window and we can just exit the nested guest whenever we want. This patch moves the opening of the interrupt window inside inject_pending_event. This consolidates the check for pending interrupt/NMI/SMI in one place, and makes KVM's usage of immediate exits more consistent, extending it beyond just nested virtualization. There are two functional changes here. They only affect corner cases, but overall they simplify inject_pending_event. - re-injection of still-pending events will also use req_immediate_exit instead of using interrupt-window intercepts. This should have no impact on performance on Intel since it simply replaces an interrupt-window or NMI-window exit for a preemption-timer exit. On AMD, which has no equivalent of the preemption timer, it may incur some overhead but an actual effect on performance should only be visible in pathological cases. - kvm_arch_interrupt_allowed and kvm_vcpu_has_events will return true if an interrupt, NMI or SMI is blocked by nested_run_pending. This makes sense because entering the VM will allow it to make progress and deliver the event. Signed-off-by: Paolo Bonzini <[email protected]>
(commit c9d4091)
KVM: nSVM: inject exceptions via svm_check_nested_events
This allows exceptions injected by the emulator to be properly delivered as vmexits. The code also becomes simpler, because we can just let all L0-intercepted exceptions go through the usual path. In particular, our emulation of the VMX #DB exit qualification is very much simplified, because the vmexit injection path can use kvm_deliver_exception_payload to update DR6. Signed-off-by: Paolo Bonzini <[email protected]>
(commit 7c86663)
KVM: nSVM: remove exit_required
All events now inject vmexits before vmentry rather than after vmexit. Therefore, exit_required is not set anymore and we can remove it. Signed-off-by: Paolo Bonzini <[email protected]>
(commit bd27962)
KVM: nSVM: correctly inject INIT vmexits
The usual drill at this point, except there is no code to remove because this case was not handled at all. Signed-off-by: Paolo Bonzini <[email protected]>
(commit 5b67240)
KVM: SVM: always update CR3 in VMCB
svm_load_mmu_pgd is delaying the write of GUEST_CR3 to prepare_vmcs02 as an optimization, but this is only correct before the nested vmentry. If userspace is modifying CR3 with KVM_SET_SREGS after the VM has already been put in guest mode, the value of CR3 will not be updated. Remove the optimization, which almost never triggers anyway. This was added in commit 689f3bf ("KVM: x86: unify callbacks to load paging root", 2020-03-16) just to keep the two vendor-specific modules closer, but we'll fix VMX too. Fixes: 689f3bf ("KVM: x86: unify callbacks to load paging root") Signed-off-by: Paolo Bonzini <[email protected]>
(commit 978ce58)
KVM: nVMX: always update CR3 in VMCS
vmx_load_mmu_pgd is delaying the write of GUEST_CR3 to prepare_vmcs02 as an optimization, but this is only correct before the nested vmentry. If userspace is modifying CR3 with KVM_SET_SREGS after the VM has already been put in guest mode, the value of CR3 will not be updated. Remove the optimization, which almost never triggers anyway. Fixes: 04f11ef ("KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter") Signed-off-by: Paolo Bonzini <[email protected]>
(commit df7e068)
Commits on May 30, 2020
MIPS: Fix build warning about "PTR_STR" redefinition
PTR_STR is redefined when CONFIG_TEST_PRINTF is set. This causes the following build warning: CC lib/test_printf.o lib/test_printf.c:214:0: warning: "PTR_STR" redefined #define PTR_STR "ffff0123456789ab" ^ In file included from ./arch/mips/include/asm/dsemul.h:11:0, from ./arch/mips/include/asm/processor.h:22, from ./arch/mips/include/asm/thread_info.h:16, from ./include/linux/thread_info.h:38, from ./include/asm-generic/preempt.h:5, from ./arch/mips/include/generated/asm/preempt.h:1, from ./include/linux/preempt.h:78, from ./include/linux/spinlock.h:51, from ./include/linux/seqlock.h:36, from ./include/linux/time.h:6, from ./include/linux/stat.h:19, from ./include/linux/module.h:13, from lib/test_printf.c:10: ./arch/mips/include/asm/inst.h:20:0: note: this is the location of the previous definition #define PTR_STR ".dword" ^ Instead of renaming PTR_STR we move the unaligned macros to a new file, which is only included inside MIPS code. This way we can safely include asm.h and can use STR(PTR) again. Fixes: e701656 ("MIPS: inst.h: Stop including asm.h to avoid various build failures") Cc: "Maciej W. Rozycki" <[email protected]> Reported-by: Tiezhu Yang <[email protected]> Co-developed-by: Huacai Chen <[email protected]> Signed-off-by: Huacai Chen <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
(commit b3878a6)
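The collision reproduces in miniature by defining the same macro twice with different values, as the two headers effectively did (the file names in the comments are stand-ins):

```c
#include <stdio.h>

/* Stands in for lib/test_printf.c's definition. */
#define PTR_STR "ffff0123456789ab"

/* Stands in for asm/inst.h's definition, pulled in via an include
 * chain; without an #undef, the compiler emits:
 *   warning: "PTR_STR" redefined */
#define PTR_STR ".dword"

int main(void)
{
    printf("%s\n", PTR_STR);   /* the later definition wins */
    return 0;
}
```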
Commits on May 31, 2020
MIPS: Loongson64: Guard against future cores without CPUCFG
Previously it was thought that all future Loongson cores would come with native CPUCFG. From new information shared by Huacai this is definitely not true (maybe some future 2K cores, for example), so collisions at PRID_REV level are inevitable. The CPU model matching needs to take PRID_IMP into consideration. The emulation logic needs to be disabled for those future cores as well, as we cannot possibly encode their non-discoverable features right now. Reported-by: Huacai Chen <[email protected]> Cc: Jiaxun Yang <[email protected]> Signed-off-by: WANG Xuerui <[email protected]> Reviewed-by: Huacai Chen <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
(commit 70768eb)
MIPS: Expose Loongson CPUCFG availability via HWCAP
The point is to allow userspace to probe for CPUCFG without possibly triggering invalid instructions. In addition to that, future Loongson feature bits could all be stuffed into CPUCFG bit fields (or "leaves" in x86-speak) if Loongson does not make mistakes, so ELF HWCAP bits are conserved. Userspace can determine native CPUCFG availability by checking the LCSRP (Loongson CSR Present) bit in CPUCFG output after seeing CPUCFG bit in HWCAP. Native CPUCFG always sets the LCSRP bit, as CPUCFG is part of the Loongson CSR ASE, while the emulation intentionally leaves this bit clear. The other existing Loongson-specific HWCAP bits are, to my best knowledge, unused, as (1) they are fairly recent additions, (2) Loongson never back-ported the patch into their kernel fork, and (3) Loongson's existing installed base rarely upgrade, if ever; However, they are still considered userspace ABI, hence unfortunately unremovable. But hopefully at least we could stop adding new Loongson HWCAP bits in the future. Cc: Paul Burton <[email protected]> Cc: Jiaxun Yang <[email protected]> Cc: Huacai Chen <[email protected]> Signed-off-by: WANG Xuerui <[email protected]> Reviewed-by: Huacai Chen <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
(commit f06da27)
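A sketch of the first half of that userspace probing order, reading AT_HWCAP via getauxval(); the HWCAP constant below is an assumed placeholder value, and the authoritative one lives in the MIPS uapi headers:

```c
#include <stdio.h>
#include <sys/auxv.h>

/* Assumed placeholder for the Loongson CPUCFG HWCAP bit; check the
 * kernel's uapi hwcap header for the real definition. */
#define HWCAP_LOONGSON_CPUCFG (1UL << 14)

int main(void)
{
    unsigned long hwcap = getauxval(AT_HWCAP);

    if (hwcap & HWCAP_LOONGSON_CPUCFG)
        printf("CPUCFG available (check LCSRP in its output for native vs. emulated)\n");
    else
        printf("no CPUCFG\n");
    return 0;
}
```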
MIPS: Loongson64: Reorder CPUCFG model match arms
Originally the match arms are ordered by model release date, however the LOONGSON_64R cores are even more reduced capability-wise. So put them at top of the switch block. Suggested-by: Huacai Chen <[email protected]> Signed-off-by: WANG Xuerui <[email protected]> Reviewed-by: Huacai Chen <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
(commit dd25ed7)
MIPS: ralink: bootrom: mark a function as __init to save some memory
'bootrom_setup()' is only called via 'postcore_initcall'. It can be marked as __init to save a few bytes of memory. Signed-off-by: Christophe JAILLET <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
(commit 3895006)
MIPS: ralink: drop ralink_clk_init for mt7621
ralink_clk_init is only called in arch/mips/ralink/clk.c which isn't compiled for mt7621. And it doesn't export a proper cpu clock. Drop this unused function. Signed-off-by: Chuanhong Guo <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
(commit 9bd0bd2)
Commits on Jun 1, 2020
KVM: check userspace_addr for all memslots
The userspace_addr alignment and range checks are not performed for private memory slots that are prepared by KVM itself. This is unnecessary and makes it questionable to use __*_user functions to access memory later on. We also rely on the userspace address being aligned since we have an entire family of functions to map gfn to pfn. Fortunately skipping the check is completely unnecessary. Only x86 uses private memslots and their userspace_addr is obtained from vm_mmap, therefore it must be below PAGE_OFFSET. In fact, any attempt to pass an address above PAGE_OFFSET would have failed because such an address would return true for kvm_is_error_hva. Reported-by: Linus Torvalds <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
(commit 45f08f4)
KVM: nSVM: move map argument out of enter_svm_guest_mode
Unmapping the nested VMCB in enter_svm_guest_mode is a bit of a wart, since the map argument is not used elsewhere in the function. There are just two callers, and those are also the place where kvm_vcpu_map is called, so it is cleaner to unmap there. Signed-off-by: Paolo Bonzini <[email protected]>
(commit 69c9dfa)
KVM: nSVM: extract load_nested_vmcb_control
When restoring SVM nested state, the control state cache in svm->nested will have to be filled, but the save state will not have to be moved into svm->vmcb. Therefore, pull the code that handles the control area into a separate function. Signed-off-by: Paolo Bonzini <[email protected]>
(commit 3e06f01)
KVM: nSVM: extract preparation of VMCB for nested run
Split out filling svm->vmcb.save and svm->vmcb.control before VMRUN. Only the latter will be useful when restoring nested SVM state. This patch introduces no semantic change, so the MMU setup is still done in nested_prepare_vmcb_save. The next patch will clean up things. Signed-off-by: Paolo Bonzini <[email protected]>
(commit f241d71)
KVM: nSVM: move MMU setup to nested_prepare_vmcb_control
Everything that is needed during nested state restore is now part of nested_prepare_vmcb_control. Signed-off-by: Paolo Bonzini <[email protected]>
(commit 69cb877)
KVM: nSVM: clean up tsc_offset update
Use l1_tsc_offset to compute svm->vcpu.arch.tsc_offset and svm->vmcb->control.tsc_offset, instead of relying on hsave. Signed-off-by: Paolo Bonzini <[email protected]>
(commit 18fc6c5)
KVM: nSVM: pass vmcb_control_area to copy_vmcb_control_area
This will come in handy when we put a struct vmcb_control_area in svm->nested. Signed-off-by: Paolo Bonzini <[email protected]>
(commit 2f67591)
KVM: nSVM: remove trailing padding for struct vmcb_control_area
Allow placing the VMCB structs on the stack or in other structs without wasting too much space. Add BUILD_BUG_ON as a quick safeguard against typos. Signed-off-by: Paolo Bonzini <[email protected]>
(commit 7923ef4)
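A minimal sketch of pinning a padding-free layout with a compile-time size assertion, in the spirit of the safeguard the commit adds; the struct, its field sizes, and the expected total below are invented, and C11 _Static_assert stands in for the kernel's BUILD_BUG_ON:

```c
#include <stdint.h>

/* Invented layout: every byte accounted for explicitly, no implicit
 * trailing padding, so the struct can be embedded or stack-allocated
 * without wasting space. */
struct ctl_area {
    uint32_t intercepts[5];
    uint8_t  reserved[236];
};

/* Fails the build (like BUILD_BUG_ON) if a typo changes the size. */
_Static_assert(sizeof(struct ctl_area) == 256,
               "ctl_area layout changed unexpectedly");

int main(void) { return 0; }
```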
KVM: nSVM: save all control fields in svm->nested
In preparation for nested SVM save/restore, store all data that matters from the VMCB control area into svm->nested. It will then become part of the nested SVM state that is saved by KVM_SET_NESTED_STATE and restored by KVM_GET_NESTED_STATE, just like the cached vmcs12 for nVMX. Signed-off-by: Paolo Bonzini <[email protected]>
(commit e670bf6)
KVM: nSVM: restore clobbered INT_CTL fields after clearing VINTR
Restore the INT_CTL value from the guest's VMCB once we've stopped using it, so that virtual interrupts can be injected as requested by L1. V_TPR is up-to-date however, and it can change if the guest writes to CR8, so keep it. Signed-off-by: Paolo Bonzini <[email protected]>
(commit d8e4e58)
KVM: nSVM: synchronize VMCB controls updated by the processor on every vmexit
The control state changes on every L2->L0 vmexit, and we will have to serialize it in the nested state. So keep it up to date in svm->nested.ctl and just copy it back to the nested VMCB in nested_svm_vmexit. Signed-off-by: Paolo Bonzini <[email protected]>
(commit 2d8a42b)
KVM: nSVM: remove unnecessary if
kvm_vcpu_apicv_active must be false when nested virtualization is enabled, so there is no need to check it in clgi_interception. Signed-off-by: Paolo Bonzini <[email protected]>
(commit 3103109)
KVM: nSVM: extract svm_set_gif
Extract the code that is needed to implement CLGI and STGI, so that we can run it from VMRUN and vmexit (and in the future, KVM_SET_NESTED_STATE). Skip the request for KVM_REQ_EVENT unless needed, subsuming the evaluate_pending_interrupts optimization that is found in enter_svm_guest_mode. Signed-off-by: Paolo Bonzini <[email protected]>
(commit ffdf7f9)
KVM: SVM: preserve VGIF across VMCB switch
There is only one GIF flag for the whole processor, so make sure it is not clobbered when switching to L2 (in which case we also have to include the V_GIF_ENABLE_MASK, lest we confuse enable_gif/disable_gif/gif_set). When going back, L1 could in theory have entered L2 without issuing a CLGI so make sure the svm_set_gif is done last, after svm->vmcb->control.int_ctl has been copied back from hsave. Signed-off-by: Paolo Bonzini <[email protected]>
(commit 91b7130)
KVM: nSVM: synthesize correct EXITINTINFO on vmexit
This bit was added to nested VMX right when nested_run_pending was introduced, but it is not yet there in nSVM. Since we can have pending events that L0 injected directly into L2 on vmentry, we have to transfer them into L1's queue. For this to work, one important change is required: svm_complete_interrupts (which clears the "injected" fields from the previous VMRUN, and updates them from svm->vmcb's EXITINTINFO) must be placed before we inject the vmexit. This is not too scary though; VMX even does it in vmx_vcpu_run. While at it, the nested_vmexit_inject tracepoint is moved towards the end of nested_svm_vmexit. This ensures that the synthesized EXITINTINFO is visible in the trace. Signed-off-by: Paolo Bonzini <[email protected]>
(commit 36e2e98)
KVM: nSVM: remove HF_VINTR_MASK
Now that the int_ctl field is stored in svm->nested.ctl.int_ctl, we can use it instead of vcpu->arch.hflags to check whether L2 is running in V_INTR_MASKING mode. Signed-off-by: Paolo Bonzini <[email protected]>
(commit e9fd761)
The L1 flags can be found in the save area of svm->nested.hsave, fish it from there so that there is one fewer thing to migrate. Signed-off-by: Paolo Bonzini <[email protected]>
(commit 08245e6)
KVM: nSVM: split nested_vmcb_check_controls
The authoritative state does not come from the VMCB once in guest mode, but KVM_SET_NESTED_STATE can still perform checks on L1's provided SVM controls because we get them from userspace. Therefore, split out a function to do them. Signed-off-by: Paolo Bonzini <[email protected]>
(commit ca46d73)
KVM: nSVM: leave guest mode when clearing EFER.SVME
According to the AMD manual, the effect of turning off EFER.SVME while a guest is running is undefined. We make it leave guest mode immediately, similar to the effect of clearing the VMX bit in MSR_IA32_FEAT_CTL. Signed-off-by: Paolo Bonzini <[email protected]>
(commit c513f48)
KVM: MMU: pass arbitrary CR0/CR4/EFER to kvm_init_shadow_mmu
This allows fetching the registers from the hsave area when setting up the NPT shadow MMU, and is needed for KVM_SET_NESTED_STATE (which runs long after the CR0, CR4 and EFER values in vcpu have been switched to hold L2 guest state). Signed-off-by: Paolo Bonzini <[email protected]>
(commit 929d1cf)
selftests: kvm: introduce cpu_has_svm() check
Many tests will want to check if the CPU is Intel or AMD in guest code; add cpu_has_svm() and put it as a static inline in svm_util.h. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
(commit ed88129)
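A minimal userspace sketch of such a check, assuming only the architectural CPUID interface (AMD leaf 0x80000001, ECX bit 2 = SVM); this is not the selftest code itself:

```c
#include <stdio.h>
#include <cpuid.h>

/* Returns nonzero if CPUID reports SVM support.
 * __get_cpuid() returns 0 if the requested leaf is out of range. */
static int cpu_has_svm(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 2) & 1;
}

int main(void)
{
    printf("SVM %s\n", cpu_has_svm() ? "supported" : "not supported");
    return 0;
}
```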
selftests: kvm: add a SVM version of state-test
The test is similar to the existing one for VMX, but simpler because we don't have to test shadow VMCS or vmptrld/vmptrst/vmclear. Signed-off-by: Paolo Bonzini <[email protected]>
(commit 10b910c)
selftests: kvm: fix smm test on SVM
KVM_CAP_NESTED_STATE is now supported for AMD too but smm test acts like it is still Intel only. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
(commit 8ec107c)
KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE
Similar to VMX, the state that is captured through the currently available IOCTLs is a mix of L1 and L2 state, dependent on whether the L2 guest was running at the moment when the process was interrupted to save its state. In particular, the SVM-specific state for nested virtualization includes the L1 saved state (including the interrupt flag), the cached L2 controls, and the GIF. Signed-off-by: Paolo Bonzini <[email protected]>
(commit cc440cd)
Revert "KVM: No need to retry for hva_to_pfn_remapped()"
This reverts commit 5b494ae. If unlocked==true then the vma pointer could be invalidated, so the 2nd follow_pfn() is potentially racy: we do need to get out and redo find_vma_intersection(). Signed-off-by: Peter Xu <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
(commit a8387d0)
KVM: VMX: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] KSPP#21 [3] commit 7649773 ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <[email protected]> Message-Id: <20200507185618.GA14831@embeddedor> Signed-off-by: Paolo Bonzini <[email protected]>
(commit f4a9fdd)
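The commit message's own struct foo pattern can be exercised directly; note how the flexible member contributes nothing to sizeof and is sized explicitly at allocation time:

```c
#include <stdio.h>
#include <stdlib.h>

struct foo {
    int stuff;
    int array[];   /* flexible array member: must be last; sizeof() of it is not allowed */
};

int main(void)
{
    /* sizeof(struct foo) covers only 'stuff' (plus any padding); the
     * trailing array is sized at allocation time. */
    struct foo *f = malloc(sizeof(*f) + 4 * sizeof(int));
    if (!f)
        return 1;
    f->stuff = 42;
    for (int i = 0; i < 4; i++)
        f->array[i] = i;
    printf("sizeof(struct foo) = %zu\n", sizeof(struct foo));
    free(f);
    return 0;
}
```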
Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "…
…Page Ready" exceptions simultaneously" Commit 9a6e7c3 (""KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously") added a protection against 'page ready' notification coming before 'page not present' is delivered. This situation seems to be impossible since commit 2a266f2 ("KVM MMU: check pending exception before injecting APF) which added 'vcpu->arch.exception.pending' check to kvm_can_do_async_pf. On x86, kvm_arch_async_page_present() has only one call site: kvm_check_async_pf_completion() loop and we only enter the loop when kvm_arch_can_inject_async_page_present(vcpu) which when async pf msr is enabled, translates into kvm_can_do_async_pf(). There is also one problem with the cancellation mechanism. We don't seem to check that the 'page not present' notification we're canceling matches the 'page ready' notification so in theory, we may erroneously drop two valid events. Revert the commit. Reviewed-by: Gavin Shan <[email protected]> Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Configuration menu - View commit details
-
Copy full SHA for 84b09f3 - Browse repository at this point
Copy the full SHA 84b09f3View commit details -
KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
Currently, the APF mechanism relies on the #PF abuse where the token is passed through CR2. If we switch to using interrupts to deliver page-ready notifications we need a different way to pass the data. Extend the existing 'struct kvm_vcpu_pv_apf_data' with token information for page-ready notifications. While at it, rename 'reason' to 'flags'. This doesn't change the semantics, as we only have reasons '1' and '2' and these can be treated as bit flags, but KVM_PV_REASON_PAGE_READY is going away with interrupt-based delivery, making the 'reason' name misleading. The newly introduced apf_put_user_ready() temporarily puts both flags and token information; this will be changed to put the token only when we switch to interrupt-based notifications. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
(commit 68fd66f)
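A sketch of the extended shared structure as described above; the field order and padding below are the commit description read literally, so treat the exact layout as an assumption (the authoritative definition is in the x86 uapi kvm_para.h):

```c
#include <stdint.h>

/* Assumed layout based on the commit description, not copied from the
 * uapi header: 'reason' becomes 'flags', and a 'token' field is added
 * for page-ready notifications. */
struct kvm_vcpu_pv_apf_data {
    uint32_t flags;    /* was 'reason'; event-type bit flags */
    uint32_t token;    /* identifies a page-ready notification */
    uint8_t  pad[56];
    uint32_t enabled;
};

_Static_assert(sizeof(struct kvm_vcpu_pv_apf_data) == 68,
               "unexpected shared-page layout");

int main(void) { return 0; }
```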
KVM: rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present()
An innocent reader of the following x86 KVM code: bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu) { if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED)) return true; ... may get very confused: if the APF mechanism is not enabled, why do we report that we 'can inject async page present'? In reality, upon injection kvm_arch_async_page_present() will check the same condition again and, in case APF is disabled, will just drop the item. This is fine, as a guest which deliberately disabled APF doesn't expect to get any APF notifications. Rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present() to make it clear what we are checking: whether the item can be dequeued (meaning either injected or just dropped). On s390, kvm_arch_can_inject_async_page_present() always returns 'true', so the rename doesn't matter much. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
(commit 7c0ade6)
KVM: introduce kvm_read_guest_offset_cached()
We already have kvm_write_guest_offset_cached(), introduce read analogue. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
(commit 0958f0c)
KVM: x86: interrupt based APF 'page ready' event delivery
Concerns were expressed around APF delivery via synthetic #PF exception as in some cases such delivery may collide with real page fault. For 'page ready' notifications we can easily switch to using an interrupt instead. Introduce new MSR_KVM_ASYNC_PF_INT mechanism and deprecate the legacy one. One notable difference between the two mechanisms is that interrupt may not get handled immediately so whenever we would like to deliver next event (regardless of its type) we must be sure the guest had read and cleared previous event in the slot. While on it, get rid on 'type 1/type 2' names for APF events in the documentation as they are causing confusion. Use 'page not present' and 'page ready' everywhere instead. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit 2635b5c
-
KVM: x86: acknowledgment mechanism for async pf page ready notifications
If two 'page ready' notifications happen back to back, the second one is not delivered, and the only mechanism we currently have is the kvm_check_async_pf_completion() check in the vcpu_run() loop. The check will only be performed with the next vmexit, whenever it happens, and in some cases that may take a while. With interrupt-based 'page ready' notification delivery the situation is even worse: unlike exceptions, interrupts are not handled immediately, so we must check if the slot is empty. This is slow and unnecessary. Introduce a dedicated MSR_KVM_ASYNC_PF_ACK MSR to communicate the fact that the slot is free and the host should check its notification queue. Mandate using it for interrupt-based 'page ready' APF event delivery. As kvm_check_async_pf_completion() is going away from vcpu_run(), we need a way to communicate the fact that the vcpu->async_pf.done queue has transitioned from the empty to the non-empty state. Introduce kvm_arch_async_page_present_queued() and KVM_REQ_APF_READY to do the job. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
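A sketch of the guest side of the protocol described here (handler and helper names are hypothetical): consume the token, free the slot, then poke the ACK MSR so the host re-checks its queue:

    static void kvm_async_pf_ready_intr(void)      /* hypothetical handler */
    {
            u32 token = READ_ONCE(apf_data.token);

            apf_data.token = 0;                    /* slot is free again */
            apf_task_wake(token);                  /* hypothetical helper */
            wrmsrl(MSR_KVM_ASYNC_PF_ACK, 1);       /* host: check your queue */
    }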
Commit 557a961
-
KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT
Introduce a new capability to indicate that KVM supports interrupt-based delivery of 'page ready' APF events. This includes support for both MSR_KVM_ASYNC_PF_INT and MSR_KVM_ASYNC_PF_ACK. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit 72de5fa
-
KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in
Change kvm_pmu_get_msr() to take the 'struct msr_data' argument, as the host_initiated field from the struct can be used by get_msr. This also makes the API consistent with kvm_pmu_set_msr. No functional changes. Signed-off-by: Wei Wang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
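Roughly, the callback signature changes like this (a sketch of the pmu ops member; the surrounding struct is elided):

    /* before: host_initiated is not visible to the implementation */
    int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr, u64 *data);

    /* after: mirrors set_msr, msr_info->host_initiated can be consulted */
    int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);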
Commit cbd7175
-
KVM: x86/pmu: Support full width counting
Intel CPUs have a new alternative MSR range (starting from MSR_IA32_PMC0) for GP counters that allows writing the full counter width. Enable this range from a new capability bit (IA32_PERF_CAPABILITIES.FW_WRITE[bit 13]). The guest queries CPUID to get the counter width and sign-extends the counter values as needed. The traditional MSRs are always limited to 32 bits, even though the counter internally is larger (48 or 57 bits). When the new capability is set, use the alternative range, which does not have these restrictions. This lowers the overhead of perf stat slightly, because it has to do fewer interrupts to accumulate the counter value. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
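A sketch of how a guest might use the new range; the capability bit number comes from the message, the MSR indices are the standard architectural ones, and the function is hypothetical:

    #define MSR_IA32_PERF_CAPABILITIES 0x345
    #define PERF_CAP_FW_WRITE          (1ULL << 13)
    #define MSR_IA32_PERFCTR0          0x0c1   /* legacy, 32-bit writes */
    #define MSR_IA32_PMC0              0x4c1   /* full-width alias range */

    static void pmu_write_counter(unsigned int idx, u64 value)
    {
            u64 caps;

            rdmsrl(MSR_IA32_PERF_CAPABILITIES, caps);
            if (caps & PERF_CAP_FW_WRITE)
                    wrmsrl(MSR_IA32_PMC0 + idx, value);          /* full width */
            else
                    wrmsrl(MSR_IA32_PERFCTR0 + idx, (u32)value); /* bits 31:0 */
    }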
Commit 27461da
-
x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit
The problem the patch is trying to address is the fact that 'struct kvm_hyperv_exit' has a different layout when compiled in 32-bit and 64-bit modes. In 64-bit mode the default alignment boundary is 64 bits, thus forcing extra gaps after 'type' and 'msr', but in 32-bit mode the boundary is at 32 bits, thus no extra gaps. This is an issue because even when the kernel is 64-bit, the userspace using the interface can be either 32-bit or 64-bit, and the same 32-bit userspace also has to work with a 32-bit kernel. The issue is fixed by forcing the 64-bit layout. This leads to an ABI change for 32-bit builds, and while we are obviously breaking the '32-bit userspace with 32-bit kernel' case, we're fixing the '32-bit userspace with 64-bit kernel' one. As the interface has no (known) users and 32-bit KVM is rather baroque nowadays, this seems like a reasonable decision. Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Jon Doron <[email protected]> Message-Id: <[email protected]> Reviewed-by: Roman Kagan <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
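A sketch of the explicit-padding fix being described; field names beyond 'type' and 'msr' are abbreviated and partly assumed:

    struct kvm_hyperv_exit {
            __u32 type;
            __u32 pad1;             /* force the same layout on 32/64 bit */
            union {
                    struct {
                            __u32 msr;
                            __u32 pad2;     /* ditto, before the __u64s */
                            __u64 control;
                            /* ... */
                    } synic;
                    /* ... */
            } u;
    };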
Commit f7d31e6
-
KVM: nVMX: Fix VMX preemption timer migration
Add a new field to hold the preemption timer expiration deadline, appended to struct kvm_vmx_nested_state_hdr. This prevents the first VM-Enter after migration from incorrectly restarting the timer with the full timer value instead of the partially decayed timer value. KVM_SET_NESTED_STATE restarts the timer using the migrated state regardless of whether L1 sets VM_EXIT_SAVE_VMX_PREEMPTION_TIMER. Fixes: cf8b84f ("kvm: nVMX: Prepare for checkpointing L2 state") Signed-off-by: Peter Shier <[email protected]> Signed-off-by: Makarand Sonare <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
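A sketch of the appended state; the existing members and the flag name are assumptions beyond what the message says:

    struct kvm_vmx_nested_state_hdr {
            __u64 vmxon_pa;
            __u64 vmcs12_pa;
            struct {
                    __u16 flags;
            } smm;

            /* appended for migration: */
            __u32 flags;                     /* e.g. ..._DEADLINE_VALID */
            __u64 preemption_timer_deadline; /* partially decayed value */
    };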
Commit 850448f
-
KVM: selftests: VMX preemption timer migration test
When a nested VM with a VMX-preemption timer is migrated, verify that the nested VM and its parent VM observe the VMX-preemption timer exit close to the original expiration deadline. Signed-off-by: Makarand Sonare <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit 8d7fbf0
-
x86/hyper-v: Add synthetic debugger definitions
The Hyper-V synthetic debugger has two modes, one that uses MSRs and another that uses hypercalls. Add all the required definitions for both types of synthetic debugger interface. Some of the required new CPUIDs and MSRs are not documented in the TLFS, so they are in hyperv.h instead. The reason they are not documented is that they are subject to removal in future versions of Windows. Reviewed-by: Michael Kelley <[email protected]> Signed-off-by: Jon Doron <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit 22ad002
-
x86/kvm/hyper-v: Add support for synthetic debugger interface
Add support for the Hyper-V synthetic debugger (syndbg) interface. The syndbg interface uses MSRs to emulate a way to send/recv packet data. The debug transport dll (kdvm/kdnet) identifies whether Hyper-V is enabled and, if it supports the synthetic debugger interface, attempts to use it instead of trying to initialize a network adapter. Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Jon Doron <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit f97f5a5
-
x86/kvm/hyper-v: enable hypercalls regardless of hypercall page
Microsoft's kdvm.dll dbgtransport module does not respect the hypercall page; it simply identifies the CPU being used (AMD/Intel) and accordingly makes hypercalls with the relevant instruction (vmmcall/vmcall respectively). The relevant function in kdvm is KdHvConnectHypervisor, which first checks if the hypercall page has been enabled via HV_X64_MSR_HYPERCALL_ENABLE, and in case it was not, it simply sets HV_X64_MSR_GUEST_OS_ID to 0x1000101010001, which means:

    build_number    = 0x0001
    service_version = 0x01
    minor_version   = 0x01
    major_version   = 0x01
    os_id           = 0x00 (Undefined)
    vendor_id       = 1 (Microsoft)
    os_type         = 0 (a value of 0 indicates a proprietary, closed source OS)

and starts issuing hypercalls without setting the hypercall page. To resolve this issue, simply enable hypercalls also when guest_os_id is not 0. Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Jon Doron <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
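The field values above follow from the guest OS id bit layout; a sketch of the decode, with bit positions assumed from the usual TLFS convention:

    u64 id = 0x1000101010001ULL;

    u16 build_number    = id & 0xffff;         /* 0x0001 */
    u8  service_version = (id >> 16) & 0xff;   /* 0x01 */
    u8  minor_version   = (id >> 24) & 0xff;   /* 0x01 */
    u8  major_version   = (id >> 32) & 0xff;   /* 0x01 */
    u8  os_id           = (id >> 40) & 0xff;   /* 0x00 */
    u16 vendor_id       = (id >> 48) & 0x7fff; /* 1 = Microsoft */
    u8  os_type         = id >> 63;            /* 0 = closed source OS */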
Commit 45c3897
-
x86/kvm/hyper-v: Add support for synthetic debugger via hypercalls
There is another mode for the synthetic debugger which uses hypercalls to send/recv network data instead of the MSR interface. This interface is much slower and less recommended, since you might get a lot of VM-Exits while KDVM polls for new packets to receive, rather than simply checking the pending page to see if there is data available and only then requesting it. Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Jon Doron <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit b187038
-
KVM: selftests: update hyperv_cpuid with SynDBG tests
Update tests to reflect new CPUID capabilities with SYNDBG. Check that we get the right number of entries and that 0x40000000.EAX always returns the correct max leaf. Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Jon Doron <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Commit fb0cb6a
-
KVM: check userspace_addr for all memslots
The userspace_addr alignment and range checks are not performed for private memory slots that are prepared by KVM itself. Skipping them is unnecessary and makes it questionable to use __*_user functions to access the memory later on. We also rely on the userspace address being aligned, since we have an entire family of functions to map gfn to pfn. Fortunately, skipping the check is completely unnecessary: only x86 uses private memslots, and their userspace_addr is obtained from vm_mmap, therefore it must be below PAGE_OFFSET. In fact, any attempt to pass an address above PAGE_OFFSET would have failed, because such an address would return true for kvm_is_error_hva. Reported-by: Linus Torvalds <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
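The check in question has roughly this shape (a sketch, not the exact kernel code):

    if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
        !access_ok((void __user *)(unsigned long)mem->userspace_addr,
                   mem->memory_size))
            return -EINVAL;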
Commit 09d952c
-
Merge tag 'kvmarm-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for Linux 5.8:
- Move the arch-specific code into arch/arm64/kvm
- Start the post-32bit cleanup
- Cherry-pick a few non-invasive pre-NV patches
Commit 3806094
-
Merge branch 'kvm-master' into HEAD
This merge brings in a few fixes that I would have sent this week, had there been a 5.7-rc8 release. Signed-off-by: Paolo Bonzini <[email protected]>
Commit f0a5ec1
-
KVM: selftests: fix rdtsc() for vmx_tsc_adjust_test
vmx_tsc_adjust_test fails with:

    IA32_TSC_ADJUST is -4294969448 (-1 * TSC_ADJUST_VALUE + -2152).
    IA32_TSC_ADJUST is -4294969448 (-1 * TSC_ADJUST_VALUE + -2152).
    IA32_TSC_ADJUST is 281470681738540 (65534 * TSC_ADJUST_VALUE + 4294962476).
    ==== Test Assertion Failure ====
    x86_64/vmx_tsc_adjust_test.c:153: false
    pid=19738 tid=19738 - Interrupted system call
    1 0x0000000000401192: main at vmx_tsc_adjust_test.c:153
    2 0x00007fe1ef8583d4: ?? ??:0
    3 0x0000000000401201: _start at ??:?
    Failed guest assert: (adjust <= max)

The problem is that 'tsc_val' should be u64, not u32, or the read value gets truncated. Fixes: 8d7fbf0 ("KVM: selftests: VMX preemption timer migration test") Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
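A sketch of the corrected helper; the whole fix is widening the accumulator:

    static inline uint64_t rdtsc(void)
    {
            uint32_t eax, edx;
            uint64_t tsc_val;       /* was uint32_t: high half truncated */

            __asm__ __volatile__("rdtsc" : "=a"(eax), "=d"(edx));
            tsc_val = ((uint64_t)edx << 32) | eax;
            return tsc_val;
    }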
Commit 13ffbd8
-
parisc: Kconfig: Update references to parisc website
The PA-RISC Linux project web page is now hosted at https://parisc.wiki.kernel.org Signed-off-by: Helge Deller <[email protected]>
Commit 24289f5
-
parisc: firmware: Update references to parisc website
The PA-RISC Linux project web page is now hosted at https://parisc.wiki.kernel.org Signed-off-by: Helge Deller <[email protected]>
Commit 861e93c
-
parisc: hardware: Update references to parisc website
The PA-RISC Linux project web page is now hosted at https://parisc.wiki.kernel.org Signed-off-by: Helge Deller <[email protected]>
Commit 186cbb1
-
parisc: module: Update references to parisc website
The PA-RISC Linux project web page is now hosted at https://parisc.wiki.kernel.org Signed-off-by: Helge Deller <[email protected]>
Commit 486a77c
-
parisc: MAINTAINERS: Update references to parisc website
The PA-RISC Linux project web page is now hosted at https://parisc.wiki.kernel.org Signed-off-by: Helge Deller <[email protected]>
Commit 775024c
Commits on Jun 2, 2020
-
irq_work: Define irq_work_single() on !CONFIG_IRQ_WORK too
Some SMP platforms don't have CONFIG_IRQ_WORK defined, resulting in a link error at build time. Define a stub and clean up the prototype definitions. Reported-by: kbuild test robot <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
Commit 25de110 (Ingo Molnar, Jun 2, 2020)
-
kgdboc: Disable all the early code when kgdboc is a module
When kgdboc is compiled as a module, all of the "ekgdboc" and "kgdb_earlycon" code isn't useful and, in fact, breaks compilation. This is because early_param() isn't defined for modules, and that's how this code gets configured. It turns out that this was broken by commit eae3e19 ("kgdboc: Remove useless #ifdef CONFIG_KGDB_SERIAL_CONSOLE in kgdboc") and then made worse by commit 2209956 ("kgdboc: Add kgdboc_earlycon to support early kgdb using boot consoles"). I guess the #ifdef wasn't so useless, even if it wasn't obvious why it was useful. When kgdboc was compiled as a module, only "CONFIG_KGDB_SERIAL_CONSOLE_MODULE" was defined, not "CONFIG_KGDB_SERIAL_CONSOLE". That meant that the old #ifdef compiled the early code out of module builds. Let's basically do the same thing that the old code (pre-removal of the #ifdef) did, but use "IS_BUILTIN(CONFIG_KGDB_SERIAL_CONSOLE)" to make it more obvious what the point of the check is. We'll fix kgdboc_earlycon in a similar way. Fixes: 2209956 ("kgdboc: Add kgdboc_earlycon to support early kgdb using boot consoles") Fixes: eae3e19 ("kgdboc: Remove useless #ifdef CONFIG_KGDB_SERIAL_CONSOLE in kgdboc") Reported-by: Stephen Rothwell <[email protected]> Signed-off-by: Douglas Anderson <[email protected]> Link: https://lore.kernel.org/r/20200519084345.1.I91670accc8a5ddabab227eb63bb4ad3e2e9d2b58@changeid Signed-off-by: Daniel Thompson <[email protected]>
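One plausible shape of the guard (a sketch; the handler names are assumed, not taken from the patch):

    /* early_param() only exists for built-in code; when kgdboc is a
     * module, compile the early hooks out just like the old #ifdef did */
    #if IS_BUILTIN(CONFIG_KGDB_SERIAL_CONSOLE)
    early_param("ekgdboc", kgdboc_early_init);           /* assumed name */
    early_param("kgdboc_earlycon", kgdboc_earlycon_init); /* assumed name */
    #endif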
Commit 1feb48b
-
kgdb: Don't call the deinit under spinlock
When I combined kgdboc_earlycon with an inflight patch titled ("soc: qcom-geni-se: Add interconnect support to fix earlycon crash") [1], things went boom. Specifically I got a crash during the transition between kgdboc_earlycon and the main kgdboc that looked like this:

    Call trace:
    __schedule_bug+0x68/0x6c
    __schedule+0x75c/0x924
    schedule+0x8c/0xbc
    schedule_timeout+0x9c/0xfc
    do_wait_for_common+0xd0/0x160
    wait_for_completion_timeout+0x54/0x74
    rpmh_write_batch+0x1fc/0x23c
    qcom_icc_bcm_voter_commit+0x1b4/0x388
    qcom_icc_set+0x2c/0x3c
    apply_constraints+0x5c/0x98
    icc_set_bw+0x204/0x3bc
    icc_put+0x30/0xf8
    geni_remove_earlycon_icc_vote+0x6c/0x9c
    qcom_geni_serial_earlycon_exit+0x10/0x1c
    kgdboc_earlycon_deinit+0x38/0x58
    kgdb_register_io_module+0x11c/0x194
    configure_kgdboc+0x108/0x174
    kgdboc_probe+0x38/0x60
    platform_drv_probe+0x90/0xb0
    really_probe+0x130/0x2fc
    ...

The problem was that we were holding the "kgdb_registration_lock" while calling into code that didn't expect to be called in spinlock context. Let's slightly defer when we call the deinit code so that it's not done under spinlock. NOTE: this does mean that the "deinit" call of the old kgdb IO module is now made _after_ the init of the new IO module, but presumably that's OK. [1] https://lkml.kernel.org/r/[email protected] Fixes: 2209956 ("kgdboc: Add kgdboc_earlycon to support early kgdb using boot consoles") Signed-off-by: Douglas Anderson <[email protected]> Link: https://lore.kernel.org/r/20200526142001.1.I523dc33f96589cb9956f5679976d402c8cda36fa@changeid [[email protected]: Resolved merge issues by hand] Signed-off-by: Daniel Thompson <[email protected]>
Commit b135013
-
Documentation: kgdboc: Document new kgdboc_earlycon parameter
The recent patch ("kgdboc: Add kgdboc_earlycon to support early kgdb using boot consoles") adds a new kernel command line parameter. Document it. Note that the patch adding the feature does some comparing/contrasting of "kgdboc_earlycon" vs. the existing "ekgdboc". See that patch for more details, but briefly "ekgdboc" can be used _instead_ of "kgdboc" and just makes "kgdboc" do its normal initialization early (only works if your tty driver is already ready). The new "kgdboc_earlycon" works in combination with "kgdboc" and is backed by boot consoles. Signed-off-by: Douglas Anderson <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Daniel Thompson <[email protected]> Link: https://lore.kernel.org/r/20200507130644.v4.9.I7d5eb42c6180c831d47aef1af44d0b8be3fac559@changeid Signed-off-by: Daniel Thompson <[email protected]>
Commit f71fc3b
-
serial: kgdboc: Allow earlycon initialization to be deferred
Currently there is no guarantee that an earlycon will be initialized before kgdboc tries to adopt it. Almost the opposite: on systems with ACPI, if earlycon has no arguments, it is guaranteed that earlycon will not be initialized. This patch mitigates the problem by giving kgdboc_earlycon a second chance during console_init(). This isn't quite as good as stopping during early parameter parsing, but it is still early in the kernel boot. Signed-off-by: Daniel Thompson <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Douglas Anderson <[email protected]>
Commit a491230
-
serial: qcom_geni_serial: Support kgdboc_earlycon
Implement the read() function in the early console driver. With recent kgdb patches this allows you to use kgdb to debug fairly early into the system boot. We only bother implementing this if polling is enabled since kgdb can't be enabled without that. Signed-off-by: Douglas Anderson <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Link: https://lore.kernel.org/r/20200507130644.v4.10.If2deff9679a62c1ce1b8f2558a8635dc837adf8c@changeid Signed-off-by: Daniel Thompson <[email protected]>
Commit 205b5bd
-
serial: 8250_early: Support kgdboc_earlycon
Implement the read() function in the early console driver. With recent kgdb patches this allows you to use kgdb to debug fairly early into the system boot. We only bother implementing this if polling is enabled since kgdb can't be enabled without that. Signed-off-by: Douglas Anderson <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Daniel Thompson <[email protected]> Link: https://lore.kernel.org/r/20200507130644.v4.11.I8f668556c244776523320a95b09373a86eda11b7@changeid Signed-off-by: Daniel Thompson <[email protected]>
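For the 8250, such a read() is roughly the following (a sketch; the early register accessor name is an assumption): poll the line status register and pull bytes until none are waiting.

    static int early_serial8250_read(struct console *console,
                                     char *s, unsigned int count)
    {
            struct earlycon_device *device = console->data;
            struct uart_port *port = &device->port;
            unsigned int status, num_read = 0;

            while (num_read < count) {
                    status = serial8250_early_in(port, UART_LSR);
                    if (!(status & UART_LSR_DR))    /* no byte waiting */
                            break;
                    s[num_read++] = serial8250_early_in(port, UART_RX);
            }
            return num_read;
    }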
Commit c5e7467
-
serial: amba-pl011: Support kgdboc_earlycon
Implement the read() function in the early console driver. With the recently added kgdboc_earlycon feature, this allows you to use kgdb to debug fairly early into the system boot. We only bother implementing this if polling is enabled, since kgdb can't be enabled without that. Signed-off-by: Sumit Garg <[email protected]> Reviewed-by: Douglas Anderson <[email protected]> Reviewed-by: Daniel Thompson <[email protected]> Signed-off-by: Douglas Anderson <[email protected]> Link: https://lore.kernel.org/r/20200507130644.v4.12.I8ee0811f0e0816dd8bfe7f2f5540b3dba074fae8@changeid Signed-off-by: Daniel Thompson <[email protected]>
Commit 195867f
-
kdb: Cleanup math with KDB_CMD_HISTORY_COUNT
From code inspection, the math in handle_ctrl_cmd() looks super sketchy because it subtracts 1 from cmdptr and then does a "% KDB_CMD_HISTORY_COUNT". It turns out that this code works because "cmdptr" is unsigned and KDB_CMD_HISTORY_COUNT is a nice power of 2. Let's make this a little less sketchy. This patch should be a no-op. Signed-off-by: Douglas Anderson <[email protected]> Link: https://lore.kernel.org/r/20200507161125.1.I2cce9ac66e141230c3644b8174b6c15d4e769232@changeid Reviewed-by: Sumit Garg <[email protected]> Signed-off-by: Daniel Thompson <[email protected]>
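The cleanup amounts to making the wrap-around explicit instead of leaning on unsigned underflow plus a power-of-2 modulo (a sketch of the two forms):

    /* before: only correct because cmdptr is unsigned and
     * KDB_CMD_HISTORY_COUNT is a power of 2 */
    cmdptr = (cmdptr - 1) % KDB_CMD_HISTORY_COUNT;

    /* after: correct for any KDB_CMD_HISTORY_COUNT */
    cmdptr = (cmdptr + KDB_CMD_HISTORY_COUNT - 1) % KDB_CMD_HISTORY_COUNT;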
Commit 1b31003
-
kdb: Remove the misfeature 'KDBFLAGS'
Currently, 'KDBFLAGS' is an internal variable of kdb; it is combined from 'KDBDEBUG' and the state flags. It is shown only when 'KDBDEBUG' is set, and the user can define an environment variable named 'KDBFLAGS' too. These are puzzling indeed. After communication with Daniel, it seems that 'KDBFLAGS' is a misfeature. So let's replace 'KDBFLAGS' with 'KDBDEBUG', to just show the value we wrote. After this modification, we can use `md4c1 kdb_flags` instead to observe the state flags. Suggested-by: Daniel Thompson <[email protected]> Signed-off-by: Wei Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] [[email protected]: Make kdb_flags unsigned to avoid arithmetic right shift] Signed-off-by: Daniel Thompson <[email protected]>
Commit c893de1
Commits on Jun 3, 2020
-
Merge tag 'sched-core-2020-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The changes in this cycle are:
- Optimize the task wakeup CPU selection logic, to improve scalability and reduce wakeup latency spikes
- PELT enhancements
- CFS bandwidth handling fixes
- Optimize the wakeup path by removing rq->wake_list and replacing it with ->ttwu_pending
- Optimize IPI cross-calls by making flush_smp_call_function_queue() process sync callbacks first
- Misc fixes and enhancements"
* tag 'sched-core-2020-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  irq_work: Define irq_work_single() on !CONFIG_IRQ_WORK too
  sched/headers: Split out open-coded prototypes into kernel/sched/smp.h
  sched: Replace rq::wake_list
  sched: Add rq::ttwu_pending
  irq_work, smp: Allow irq_work on call_single_queue
  smp: Optimize send_call_function_single_ipi()
  smp: Move irq_work_run() out of flush_smp_call_function_queue()
  smp: Optimize flush_smp_call_function_queue()
  sched: Fix smp_call_function_single_async() usage for ILB
  sched/core: Offload wakee task activation if it the wakee is descheduling
  sched/core: Optimize ttwu() spinning on p->on_cpu
  sched: Defend cfs and rt bandwidth quota against overflow
  sched/cpuacct: Fix charge cpuacct.usage_sys
  sched/fair: Replace zero-length array with flexible-array
  sched/pelt: Sync util/runnable_sum with PELT window when propagating
  sched/cpuacct: Use __this_cpu_add() instead of this_cpu_ptr()
  sched/fair: Optimize enqueue_task_fair()
  sched: Make scheduler_ipi inline
  sched: Clean up scheduler_ipi()
  sched/core: Simplify sched_init()
  ...
Commit d479c5a
-
Merge tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull thread updates from Christian Brauner:
"We have been discussing using pidfds to attach to namespaces for quite a while and the patches have in one form or another already existed for about a year. But I wanted to wait to see how the general api would be received and adopted.
This contains the changes to make it possible to use pidfds to attach to the namespaces of a process, i.e. they can be passed as the first argument to the setns() syscall.
When only a single namespace type is specified, the semantics are equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However, when a pidfd is passed, multiple namespace flags can be specified in the second setns() argument and setns() will attach the caller to all the specified namespaces all at once or to none of them. Specifying 0 is not valid together with a pidfd. Here are just two obvious examples (see also the sketch after this log):
  setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
  setns(pidfd, CLONE_NEWUSER);
Allowing attachment to subsets of namespaces supports various use-cases where callers setns to a subset of namespaces to retain privilege, perform an action, and then re-attach another subset of namespaces.
Apart from significantly reducing the number of syscalls needed to attach to all currently supported namespaces (eight "open+setns" sequences vs just a single "setns()"), this also allows atomic setns to a set of namespaces, i.e. either attaching to all namespaces succeeds or we fail without having changed anything.
This is centered around a new internal struct nsset, which holds all information necessary for a task to switch to a new set of namespaces atomically. Fwiw, with this change a pidfd becomes the only token needed to interact with a container. I'm expecting this to be picked up by util-linux for nsenter rather soon. Associated with this change is a shiny new test-suite dedicated to setns() (for pidfds and nsfds alike)"
* tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  selftests/pidfd: add pidfd setns tests
  nsproxy: attach to namespaces via pidfds
  nsproxy: add struct nsset
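A minimal userspace sketch of the call pattern described above (error handling trimmed; requires a kernel with pidfd_open and pidfd-aware setns):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int attach_to(pid_t pid)
    {
            int pidfd = syscall(SYS_pidfd_open, pid, 0);

            if (pidfd < 0)
                    return -1;
            /* attach to all three namespaces atomically, or to none */
            if (setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET)) {
                    close(pidfd);
                    return -1;
            }
            close(pidfd);
            return 0;
    }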
Commit e7c93cb
-
fix a braino in ia64 uaccess csum changes
Fixes: cc03f19 (ia64: csum_partial_copy_nocheck(): don't abuse csum_partial_copy_from_user()) Reported-by: Guenter Roeck <[email protected]> Tested-by: Guenter Roeck <[email protected]> Signed-off-by: Al Viro <[email protected]>
Commit 174e1ea (Al Viro, Jun 3, 2020)
-
Merge branch 'uaccess.csum' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ia64 build regression fix from Al Viro:
"Fix a braino in ia64 uaccess csum changes"
* 'uaccess.csum' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix a braino in ia64 uaccess csum changes
Commit e8f4abf
-
Merge tag 'mips_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS updates from Thomas Bogendoerfer:
- added support for MIPSr5 and P5600 cores
- converted the Loongson PCI driver into a PCI host driver using the generic PCI framework
- added emulation of the CPUCFG command for Loongson64 cpus
- removal of LASAT, PMC MSP71xx and NEC MARKEINS/EMMA
- ioremap cleanup
- fix for a race between two threads faulting the same page
- various cleanups and fixes
* tag 'mips_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (143 commits)
  MIPS: ralink: drop ralink_clk_init for mt7621
  MIPS: ralink: bootrom: mark a function as __init to save some memory
  MIPS: Loongson64: Reorder CPUCFG model match arms
  MIPS: Expose Loongson CPUCFG availability via HWCAP
  MIPS: Loongson64: Guard against future cores without CPUCFG
  MIPS: Fix build warning about "PTR_STR" redefinition
  MIPS: Loongson64: Remove not used pci.c
  MIPS: Loongson64: Define PCI_IOBASE
  MIPS: CPU_LOONGSON2EF need software to maintain cache consistency
  MIPS: DTS: Fix build errors used with various configs
  MIPS: Loongson64: select NO_EXCEPT_FILL
  MIPS: Fix IRQ tracing when call handle_fpe() and handle_msa_fpe()
  MIPS: mm: add page valid judgement in function pte_modify
  mm/memory.c: Add memory read privilege on page fault handling
  mm/memory.c: Update local TLB if PTE entry exists
  MIPS: Do not flush tlb page when updating PTE entry
  MIPS: ingenic: Default to a generic board
  MIPS: ingenic: Add support for GCW Zero prototype
  MIPS: ingenic: DTS: Add memory info of GCW Zero
  MIPS: Loongson64: Switch to generic PCI driver
  ...
Commit 8226f11
-
Merge branch 'parisc-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:
"Enable the sysctl file interface for panic_on_stackoverflow for parisc, a warning fix and a bunch of documentation updates since the parisc website is now at https://parisc.wiki.kernel.org"
* 'parisc-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: MAINTAINERS: Update references to parisc website
  parisc: module: Update references to parisc website
  parisc: hardware: Update references to parisc website
  parisc: firmware: Update references to parisc website
  parisc: Kconfig: Update references to parisc website
  parisc: add sysctl file interface panic_on_stackoverflow
  parisc: use -fno-strict-aliasing for decompressor
  parisc: suppress error messages for 'make clean'
Commit 44e40e9
-
Merge tag 'xtensa-20200603' of git://github.com/jcmvbkbc/linux-xtensa
Pull Xtensa updates from Max Filippov:
- fix __user annotations in asm/uaccess.h
- fix comments in entry.S
* tag 'xtensa-20200603' of git://github.com/jcmvbkbc/linux-xtensa:
  xtensa: Fix spelling/grammar in comment
  xtensa: add missing __user annotations to asm/uaccess.h
  xtensa: fix error paths in __get_user_{check,size}
  xtensa: fix type conversion in __get_user_size
  xtensa: add missing __user annotations to __{get,put}_user_check
Commit 38696e3
-
Merge tag 'kgdb-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb updates from Daniel Thompson:
"By far the biggest change in this cycle are the changes that allow much earlier debug of systems that are hooked up via UART, by taking advantage of the earlycon framework to implement the kgdb I/O hooks before handing over to the regular polling I/O drivers once they are available. When discussing Doug's work we also found and fixed a broken raw_smp_processor_id() sequence in in_dbg_master(). Also included are a collection of much smaller fixes and tweaks: a couple of tweaks to get rid of doc gen or coccicheck warnings, future proofing some internal calculations that made implicit power-of-2 assumptions, and eliminating some rather weird handling of magic environment variables in kdb"
* tag 'kgdb-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kdb: Remove the misfeature 'KDBFLAGS'
  kdb: Cleanup math with KDB_CMD_HISTORY_COUNT
  serial: amba-pl011: Support kgdboc_earlycon
  serial: 8250_early: Support kgdboc_earlycon
  serial: qcom_geni_serial: Support kgdboc_earlycon
  serial: kgdboc: Allow earlycon initialization to be deferred
  Documentation: kgdboc: Document new kgdboc_earlycon parameter
  kgdb: Don't call the deinit under spinlock
  kgdboc: Disable all the early code when kgdboc is a module
  kgdboc: Add kgdboc_earlycon to support early kgdb using boot consoles
  kgdboc: Remove useless #ifdef CONFIG_KGDB_SERIAL_CONSOLE in kgdboc
  kgdb: Prevent infinite recursive entries to the debugger
  kgdb: Delay "kgdbwait" to dbg_late_init() by default
  kgdboc: Use a platform device to handle tty drivers showing up late
  Revert "kgdboc: disable the console lock when in kgdb"
  kgdb: Disable WARN_CONSOLE_UNLOCKED for all kgdb
  kgdb: Return true in kgdb_nmi_poll_knock()
  kgdb: Drop malformed kernel doc comment
  kgdb: Fix spurious true from in_dbg_master()
Commit f1e4553
-
Merge tag 'hyperv-next-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyper-v updates from Wei Liu:
- a series from Andrea to support channel reassignment
- a series from Vitaly to clean up Vmbus message handling
- a series from Michael to clean up and augment hyperv-tlfs.h
- patches from Andy to clean up GUID usage in Hyper-V code
- a few other misc patches
* tag 'hyperv-next-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (29 commits)
  Drivers: hv: vmbus: Resolve more races involving init_vp_index()
  Drivers: hv: vmbus: Resolve race between init_vp_index() and CPU hotplug
  vmbus: Replace zero-length array with flexible-array
  Driver: hv: vmbus: drop a no long applicable comment
  hyper-v: Switch to use UUID types directly
  hyper-v: Replace open-coded variant of %*phN specifier
  hyper-v: Supply GUID pointer to printf() like functions
  hyper-v: Use UUID API for exporting the GUID (part 2)
  asm-generic/hyperv: Add definitions for Get/SetVpRegister hypercalls
  x86/hyperv: Split hyperv-tlfs.h into arch dependent and independent files
  x86/hyperv: Remove HV_PROCESSOR_POWER_STATE #defines
  KVM: x86: hyperv: Remove duplicate definitions of Reference TSC Page
  drivers: hv: remove redundant assignment to pointer primary_channel
  scsi: storvsc: Re-init stor_chns when a channel interrupt is re-assigned
  Drivers: hv: vmbus: Introduce the CHANNELMSG_MODIFYCHANNEL message type
  Drivers: hv: vmbus: Synchronize init_vp_index() vs. CPU hotplug
  Drivers: hv: vmbus: Remove the unused HV_LOCALIZED channel affinity logic
  PCI: hv: Prepare hv_compose_msi_msg() for the VMBus-channel-interrupt-to-vCPU reassignment functionality
  Drivers: hv: vmbus: Use a spin lock for synchronizing channel scheduling vs. channel removal
  hv_utils: Always execute the fcopy and vss callbacks in a tasklet
  ...
Commit 6b2591c
-
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Move the arch-specific code into arch/arm64/kvm
- Start the post-32bit cleanup
- Cherry-pick a few non-invasive pre-NV patches
x86:
- Rework of TLB flushing
- Rework of event injection, especially with respect to nested virtualization
- Nested AMD event injection facelift, building on the rework of generic code and fixing a lot of corner cases
- Nested AMD live migration support
- Optimization for TSC deadline MSR writes and IPIs
- Various cleanups
- Asynchronous page fault cleanups (from tglx, common topic branch with tip tree)
- Interrupt-based delivery of asynchronous "page ready" events (host side)
- Hyper-V MSRs and hypercalls for guest debugging
- VMX preemption timer fixes
s390:
- Cleanups
Generic:
- switch vCPU thread wakeup from swait to rcuwait
The other architectures, and the guest side of the asynchronous page fault work, will come next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (256 commits)
  KVM: selftests: fix rdtsc() for vmx_tsc_adjust_test
  KVM: check userspace_addr for all memslots
  KVM: selftests: update hyperv_cpuid with SynDBG tests
  x86/kvm/hyper-v: Add support for synthetic debugger via hypercalls
  x86/kvm/hyper-v: enable hypercalls regardless of hypercall page
  x86/kvm/hyper-v: Add support for synthetic debugger interface
  x86/hyper-v: Add synthetic debugger definitions
  KVM: selftests: VMX preemption timer migration test
  KVM: nVMX: Fix VMX preemption timer migration
  x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit
  KVM: x86/pmu: Support full width counting
  KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in
  KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT
  KVM: x86: acknowledgment mechanism for async pf page ready notifications
  KVM: x86: interrupt based APF 'page ready' event delivery
  KVM: introduce kvm_read_guest_offset_cached()
  KVM: rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present()
  KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously"
  KVM: VMX: Replace zero-length array with flexible-array
  ...
Commit 039aeb9