Welcome to mirror list, hosted at ThFree Co, Russian Federation.

github.com/torvalds/linux.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-05-27Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "S390: - ultravisor communication device driver - fix TEID on terminating storage key ops RISC-V: - Added Sv57x4 support for G-stage page table - Added range based local HFENCE functions - Added remote HFENCE functions based on VCPU requests - Added ISA extension registers in ONE_REG interface - Updated KVM RISC-V maintainers entry to cover selftests support ARM: - Add support for the ARMv8.6 WFxT extension - Guard pages for the EL2 stacks - Trap and emulate AArch32 ID registers to hide unsupported features - Ability to select and save/restore the set of hypercalls exposed to the guest - Support for PSCI-initiated suspend in collaboration with userspace - GICv3 register-based LPI invalidation support - Move host PMU event merging into the vcpu data structure - GICv3 ITS save/restore fixes - The usual set of small-scale cleanups and fixes x86: - New ioctls to get/set TSC frequency for a whole VM - Allow userspace to opt out of hypercall patching - Only do MSR filtering for MSRs accessed by rdmsr/wrmsr AMD SEV improvements: - Add KVM_EXIT_SHUTDOWN metadata for SEV-ES - V_TSC_AUX support Nested virtualization improvements for AMD: - Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE, nested vGIF) - Allow AVIC to co-exist with a nested guest running - Fixes for LBR virtualizations when a nested guest is running, and nested LBR virtualization support - PAUSE filtering for nested hypervisors Guest support: - Decoupling of vcpu_is_preempted from PV spinlocks" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (199 commits) KVM: x86: Fix the intel_pt PMI handling wrongly considered from guest KVM: selftests: x86: Sync the new name of the test case to .gitignore Documentation: kvm: reorder ARM-specific section about KVM_SYSTEM_EVENT_SUSPEND x86, kvm: use correct GFP flags for preemption disabled KVM: LAPIC: Drop pending LAPIC timer injection when canceling the timer x86/kvm: Alloc dummy async #PF token outside of raw spinlock KVM: x86: avoid calling x86 emulator without a decoded instruction KVM: SVM: Use kzalloc for sev ioctl interfaces to prevent kernel data leak x86/fpu: KVM: Set the base guest FPU uABI size to sizeof(struct kvm_xsave) s390/uv_uapi: depend on CONFIG_S390 KVM: selftests: x86: Fix test failure on arch lbr capable platforms KVM: LAPIC: Trace LAPIC timer expiration on every vmentry KVM: s390: selftest: Test suppression indication on key prot exception KVM: s390: Don't indicate suppression on dirtying, failing memop selftests: drivers/s390x: Add uvdevice tests drivers/s390/char: Add Ultravisor io device MAINTAINERS: Update KVM RISC-V entry to cover selftests support RISC-V: KVM: Introduce ISA extension register RISC-V: KVM: Cleanup stale TLB entries when host CPU changes RISC-V: KVM: Add remote HFENCE functions based on VCPU requests ...
2022-05-26Merge tag 'mm-stable-2022-05-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Almost all of MM here. A few things are still getting finished off, reviewed, etc. - Yang Shi has improved the behaviour of khugepaged collapsing of readonly file-backed transparent hugepages. - Johannes Weiner has arranged for zswap memory use to be tracked and managed on a per-cgroup basis. - Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for runtime enablement of the recent huge page vmemmap optimization feature. - Baolin Wang contributes a series to fix some issues around hugetlb pagetable invalidation. - Zhenwei Pi has fixed some interactions between hwpoisoned pages and virtualization. - Tong Tiangen has enabled the use of the presently x86-only page_table_check debugging feature on arm64 and riscv. - David Vernet has done some fixup work on the memcg selftests. - Peter Xu has taught userfaultfd to handle write protection faults against shmem- and hugetlbfs-backed files. - More DAMON development from SeongJae Park - adding online tuning of the feature and support for monitoring of fixed virtual address ranges. Also easier discovery of which monitoring operations are available. - Nadav Amit has done some optimization of TLB flushing during mprotect(). - Neil Brown continues to labor away at improving our swap-over-NFS support. - David Hildenbrand has some fixes to anon page COWing versus get_user_pages(). - Peng Liu fixed some errors in the core hugetlb code. - Joao Martins has reduced the amount of memory consumed by device-dax's compound devmaps. - Some cleanups of the arch-specific pagemap code from Anshuman Khandual. - Muchun Song has found and fixed some errors in the TLB flushing of transparent hugepages. - Roman Gushchin has done more work on the memcg selftests. ... and, of course, many smaller fixes and cleanups. Notably, the customary million cleanup serieses from Miaohe Lin" * tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits) mm: kfence: use PAGE_ALIGNED helper selftests: vm: add the "settings" file with timeout variable selftests: vm: add "test_hmm.sh" to TEST_FILES selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests selftests: vm: add migration to the .gitignore selftests/vm/pkeys: fix typo in comment ksm: fix typo in comment selftests: vm: add process_mrelease tests Revert "mm/vmscan: never demote for memcg reclaim" mm/kfence: print disabling or re-enabling message include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace" include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion" mm: fix a potential infinite loop in start_isolate_page_range() MAINTAINERS: add Muchun as co-maintainer for HugeTLB zram: fix Kconfig dependency warning mm/shmem: fix shmem folio swapoff hang cgroup: fix an error handling path in alloc_pagecache_max_30M() mm: damon: use HPAGE_PMD_SIZE tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate nodemask.h: fix compilation error with GCC12 ...
2022-05-26Merge tag 'kbuild-v5.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Add HOSTPKG_CONFIG env variable to allow users to override pkg-config - Support W=e as a shorthand for KCFLAGS=-Werror - Fix CONFIG_IKHEADERS build to support toybox cpio - Add scripts/dummy-tools/pahole to ease distro packagers' life - Suppress false-positive warnings from checksyscalls.sh for W=2 build - Factor out the common code of arch/*/boot/install.sh into scripts/install.sh - Support 'kernel-install' tool in scripts/prune-kernel - Refactor module-versioning to link the symbol versions at the final link of vmlinux and modules - Remove CONFIG_MODULE_REL_CRCS because module-versioning now works in an arch-agnostic way - Refactor modpost, Makefiles * tag 'kbuild-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (56 commits) genksyms: adjust the output format to modpost kbuild: stop merging *.symversions kbuild: link symbol CRCs at final link, removing CONFIG_MODULE_REL_CRCS modpost: extract symbol versions from *.cmd files modpost: add sym_find_with_module() helper modpost: change the license of EXPORT_SYMBOL to bool type modpost: remove left-over cross_compile declaration kbuild: record symbol versions in *.cmd files kbuild: generate a list of objects in vmlinux modpost: move *.mod.c generation to write_mod_c_files() modpost: merge add_{intree_flag,retpoline,staging_flag} to add_header scripts/prune-kernel: Use kernel-install if available kbuild: factor out the common installation code into scripts/install.sh modpost: split new_symbol() to symbol allocation and hash table addition modpost: make sym_add_exported() always allocate a new symbol modpost: make multiple export error modpost: dump Module.symvers in the same order of modules.order modpost: traverse the namespace_list in order modpost: use doubly linked list for dump_lists modpost: traverse unresolved symbols in order ...
2022-05-26Merge tag 'asm-generic-5.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pull asm-generic updates from Arnd Bergmann: "The asm-generic tree contains three separate changes for linux-5.19: - The h8300 architecture is retired after it has been effectively unmaintained for a number of years. This is the last architecture we supported that has no MMU implementation, but there are still a few architectures (arm, m68k, riscv, sh and xtensa) that support CPUs with and without an MMU. - A series to add a generic ticket spinlock that can be shared by most architectures with a working cmpxchg or ll/sc type atomic, including the conversion of riscv, csky and openrisc. This series is also a prerequisite for the loongarch64 architecture port that will come as a separate pull request. - A cleanup of some exported uapi header files to ensure they can be included from user space without relying on other kernel headers" * tag 'asm-generic-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: h8300: remove stale bindings and symlink sparc: add asm/stat.h to UAPI compile-test coverage powerpc: add asm/stat.h to UAPI compile-test coverage mips: add asm/stat.h to UAPI compile-test coverage riscv: add linux/bpf_perf_event.h to UAPI compile-test coverage kbuild: prevent exported headers from including <stdlib.h>, <stdbool.h> agpgart.h: do not include <stdlib.h> from exported header csky: Move to generic ticket-spinlock RISC-V: Move to queued RW locks RISC-V: Move to generic spinlocks openrisc: Move to ticket-spinlock asm-generic: qrwlock: Document the spinlock fairness requirements asm-generic: qspinlock: Indicate the use of mixed-size atomics asm-generic: ticket-lock: New generic ticket-based spinlock remove the h8300 architecture
2022-05-24kbuild: link symbol CRCs at final link, removing CONFIG_MODULE_REL_CRCSMasahiro Yamada
include/{linux,asm-generic}/export.h defines a weak symbol, __crc_* as a placeholder. Genksyms writes the version CRCs into the linker script, which will be used for filling the __crc_* symbols. The linker script format depends on CONFIG_MODULE_REL_CRCS. If it is enabled, __crc_* holds the offset to the reference of CRC. It is time to get rid of this complexity. Now that modpost parses text files (.*.cmd) to collect all the CRCs, it can generate C code that will be linked to the vmlinux or modules. Generate a new C file, .vmlinux.export.c, which contains the CRCs of symbols exported by vmlinux. It is compiled and linked to vmlinux in scripts/link-vmlinux.sh. Put the CRCs of symbols exported by modules into the existing *.mod.c files. No additional build step is needed for modules. As before, *.mod.c are compiled and linked to *.ko in scripts/Makefile.modfinal. No linker magic is used here. The new C implementation works in the same way, whether CONFIG_RELOCATABLE is enabled or not. CONFIG_MODULE_REL_CRCS is no longer needed. Previously, Kbuild invoked additional $(LD) to update the CRCs in objects, but this step is unneeded too. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nicolas Schier <nicolas@fjasle.eu> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # LLVM-14 (x86-64)
2022-05-20mm: Kconfig: move swap and slab config options to the MM sectionJohannes Weiner
These are currently under General Setup. MM seems like a better fit. Link: https://lkml.kernel.org/r/20220510152847.230957-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm/uffd: move USERFAULTFD configs into mm/Peter Xu
We used to have USERFAULTFD configs stored in init/. It makes sense as a start because that's the default place for storing syscall related configs. However userfaultfd evolved a bit in the past few years and some more config options were added. They're no longer related to syscalls and start to be not suitable to be kept in the init/ directory anymore, because they're pure mm concepts. But it's not ideal either to keep the userfaultfd configs separate from each other. Hence this patch moves the userfaultfd configs under init/ to be under mm/ so that we'll start to group all userfaultfd configs together. We do have quite a few examples of syscall related configs that are not put under init/Kconfig: FTRACE_SYSCALLS, SWAP, FILE_LOCKING, MEMFD_CREATE.. They all reside in the dir where they're more suitable for the concept. So it seems there's no restriction to keep the role of having syscall related CONFIG_* under init/ only. Link: https://lkml.kernel.org/r/20220420144823.35277-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-13Kconfig: Add option for asm goto w/ tied outputs to workaround clang-13 bugSean Christopherson
Add a config option to guard (future) usage of asm_volatile_goto() that includes "tied outputs", i.e. "+" constraints that specify both an input and output parameter. clang-13 has a bug[1] that causes compilation of such inline asm to fail, and KVM wants to use a "+m" constraint to implement a uaccess form of CMPXCHG[2]. E.g. the test code fails with <stdin>:1:29: error: invalid operand in inline asm: '.long (${1:l}) - .' int foo(int *x) { asm goto (".long (%l[bar]) - .\n": "+m"(*x) ::: bar); return *x; bar: return 0; } ^ <stdin>:1:29: error: unknown token in expression <inline asm>:1:9: note: instantiated into assembly here .long () - . ^ 2 errors generated. on clang-13, but passes on gcc (with appropriate asm goto support). The bug is fixed in clang-14, but won't be backported to clang-13 as the changes are too invasive/risky. gcc also had a similar bug[3], fixed in gcc-11, where gcc failed to account for its behavior of assigning two numbers to tied outputs (one for input, one for output) when evaluating symbolic references. [1] https://github.com/ClangBuiltLinux/linux/issues/1512 [2] https://lore.kernel.org/all/YfMruK8%2F1izZ2VHS@google.com [3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98096 Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220202004945.2540433-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-06mm/slub: use stackdepot to save stack trace in objectsOliver Glitta
Many stack traces are similar so there are many similar arrays. Stackdepot saves each unique stack only once. Replace field addrs in struct track with depot_stack_handle_t handle. Use stackdepot to save stack trace. The benefits are smaller memory overhead and possibility to aggregate per-cache statistics in the following patch using the stackdepot handle instead of matching stacks manually. [ vbabka@suse.cz: rebase to 5.17-rc1 and adjust accordingly ] This was initially merged as commit 788691464c29 and reverted by commit ae14c63a9f20 due to several issues, that should now be fixed. The problem of unconditional memory overhead by stackdepot has been addressed by commit 2dba5eb1c73b ("lib/stackdepot: allow optional init and stack_table allocation by kvmalloc()"), so the dependency on stackdepot will result in extra memory usage only when a slab cache tracking is actually enabled, and not for all CONFIG_SLUB_DEBUG builds. The build failures on some architectures were also addressed, and the reported issue with xfs/433 test did not reproduce on 5.17-rc1 with this patch. Signed-off-by: Oliver Glitta <glittao@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
2022-04-04Merge branch 'remove-h8300' of git://git.infradead.org/users/hch/misc into ↵Arnd Bergmann
asm-generic * 'remove-h8300' of git://git.infradead.org/users/hch/misc: remove the h8300 architecture This is clearly the least actively maintained architecture we have at the moment, and probably the least useful. It is now the only one that does not support MMUs at all, and most of the boards only support 4MB of RAM, out of which the defconfig kernel needs more than half just for .text/.data. Guenter Roeck did the original patch to remove the architecture in 2013 after it had already been obsolete for a while, and Yoshinori Sato brought it back in a much more modern form in 2015. Looking at the git history since the reinstantiation, it's clear that almost all commits in the tree are build fixes or cross-architecture cleanups: $ git log --no-merges --format=%an v4.5.. arch/h8300/ | sort | uniq -c | sort -rn | head -n 12 25 Masahiro Yamada 18 Christoph Hellwig 14 Mike Rapoport 9 Arnd Bergmann 8 Mark Rutland 7 Peter Zijlstra 6 Kees Cook 6 Ingo Molnar 6 Al Viro 5 Randy Dunlap 4 Yury Norov Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-03-31Merge tag 'kbuild-v5.18-v2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Add new environment variables, USERCFLAGS and USERLDFLAGS to allow additional flags to be passed to user-space programs. - Fix missing fflush() bugs in Kconfig and fixdep - Fix a minor bug in the comment format of the .config file - Make kallsyms ignore llvm's local labels, .L* - Fix UAPI compile-test for cross-compiling with Clang - Extend the LLVM= syntax to support LLVM=<suffix> form for using a particular version of LLVm, and LLVM=<prefix> form for using custom LLVM in a particular directory path. - Clean up Makefiles * tag 'kbuild-v5.18-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kbuild: Make $(LLVM) more flexible kbuild: add --target to correctly cross-compile UAPI headers with Clang fixdep: use fflush() and ferror() to ensure successful write to files arch: syscalls: simplify uapi/kapi directory creation usr/include: replace extra-y with always-y certs: simplify empty certs creation in certs/Makefile certs: include certs/signing_key.x509 unconditionally kallsyms: ignore all local labels prefixed by '.L' kconfig: fix missing '# end of' for empty menu kconfig: add fflush() before ferror() check kbuild: replace $(if A,A,B) with $(or A,B) kbuild: Add environment variables for userprogs flags kbuild: unify cmd_copy and cmd_shipped
2022-03-24Merge tag 'net-next-5.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "The sprinkling of SPI drivers is because we added a new one and Mark sent us a SPI driver interface conversion pull request. Core ---- - Introduce XDP multi-buffer support, allowing the use of XDP with jumbo frame MTUs and combination with Rx coalescing offloads (LRO). - Speed up netns dismantling (5x) and lower the memory cost a little. Remove unnecessary per-netns sockets. Scope some lists to a netns. Cut down RCU syncing. Use batch methods. Allow netdev registration to complete out of order. - Support distinguishing timestamp types (ingress vs egress) and maintaining them across packet scrubbing points (e.g. redirect). - Continue the work of annotating packet drop reasons throughout the stack. - Switch netdev error counters from an atomic to dynamically allocated per-CPU counters. - Rework a few preempt_disable(), local_irq_save() and busy waiting sections problematic on PREEMPT_RT. - Extend the ref_tracker to allow catching use-after-free bugs. BPF --- - Introduce "packing allocator" for BPF JIT images. JITed code is marked read only, and used to be allocated at page granularity. Custom allocator allows for more efficient memory use, lower iTLB pressure and prevents identity mapping huge pages from getting split. - Make use of BTF type annotations (e.g. __user, __percpu) to enforce the correct probe read access method, add appropriate helpers. - Convert the BPF preload to use light skeleton and drop the user-mode-driver dependency. - Allow XDP BPF_PROG_RUN test infra to send real packets, enabling its use as a packet generator. - Allow local storage memory to be allocated with GFP_KERNEL if called from a hook allowed to sleep. - Introduce fprobe (multi kprobe) to speed up mass attachment (arch bits to come later). - Add unstable conntrack lookup helpers for BPF by using the BPF kfunc infra. - Allow cgroup BPF progs to return custom errors to user space. - Add support for AF_UNIX iterator batching. - Allow iterator programs to use sleepable helpers. - Support JIT of add, and, or, xor and xchg atomic ops on arm64. - Add BTFGen support to bpftool which allows to use CO-RE in kernels without BTF info. - Large number of libbpf API improvements, cleanups and deprecations. Protocols --------- - Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev. - Adjust TSO packet sizes based on min_rtt, allowing very low latency links (data centers) to always send full-sized TSO super-frames. - Make IPv6 flow label changes (AKA hash rethink) more configurable, via sysctl and setsockopt. Distinguish between server and client behavior. - VxLAN support to "collect metadata" devices to terminate only configured VNIs. This is similar to VLAN filtering in the bridge. - Support inserting IPv6 IOAM information to a fraction of frames. - Add protocol attribute to IP addresses to allow identifying where given address comes from (kernel-generated, DHCP etc.) - Support setting socket and IPv6 options via cmsg on ping6 sockets. - Reject mis-use of ECN bits in IP headers as part of DSCP/TOS. Define dscp_t and stop taking ECN bits into account in fib-rules. - Add support for locked bridge ports (for 802.1X). - tun: support NAPI for packets received from batched XDP buffs, doubling the performance in some scenarios. - IPv6 extension header handling in Open vSwitch. - Support IPv6 control message load balancing in bonding, prevent neighbor solicitation and advertisement from using the wrong port. Support NS/NA monitor selection similar to existing ARP monitor. - SMC - improve performance with TCP_CORK and sendfile() - support auto-corking - support TCP_NODELAY - MCTP (Management Component Transport Protocol) - add user space tag control interface - I2C binding driver (as specified by DMTF DSP0237) - Multi-BSSID beacon handling in AP mode for WiFi. - Bluetooth: - handle MSFT Monitor Device Event - add MGMT Adv Monitor Device Found/Lost events - Multi-Path TCP: - add support for the SO_SNDTIMEO socket option - lots of selftest cleanups and improvements - Increase the max PDU size in CAN ISOTP to 64 kB. Driver API ---------- - Add HW counters for SW netdevs, a mechanism for devices which offload packet forwarding to report packet statistics back to software interfaces such as tunnels. - Select the default NIC queue count as a fraction of number of physical CPU cores, instead of hard-coding to 8. - Expose devlink instance locks to drivers. Allow device layer of drivers to use that lock directly instead of creating their own which always runs into ordering issues in devlink callbacks. - Add header/data split indication to guide user space enabling of TCP zero-copy Rx. - Allow configuring completion queue event size. - Refactor page_pool to enable fragmenting after allocation. - Add allocation and page reuse statistics to page_pool. - Improve Multiple Spanning Trees support in the bridge to allow reuse of topologies across VLANs, saving HW resources in switches. - DSA (Distributed Switch Architecture): - replay and offload of host VLAN entries - offload of static and local FDB entries on LAG interfaces - FDB isolation and unicast filtering New hardware / drivers ---------------------- - Ethernet: - LAN937x T1 PHYs - Davicom DM9051 SPI NIC driver - Realtek RTL8367S, RTL8367RB-VB switch and MDIO - Microchip ksz8563 switches - Netronome NFP3800 SmartNICs - Fungible SmartNICs - MediaTek MT8195 switches - WiFi: - mt76: MediaTek mt7916 - mt76: MediaTek mt7921u USB adapters - brcmfmac: Broadcom BCM43454/6 - Mobile: - iosm: Intel M.2 7360 WWAN card Drivers ------- - Convert many drivers to the new phylink API built for split PCS designs but also simplifying other cases. - Intel Ethernet NICs: - add TTY for GNSS module for E810T device - improve AF_XDP performance - GTP-C and GTP-U filter offload - QinQ VLAN support - Mellanox Ethernet NICs (mlx5): - support xdp->data_meta - multi-buffer XDP - offload tc push_eth and pop_eth actions - Netronome Ethernet NICs (nfp): - flow-independent tc action hardware offload (police / meter) - AF_XDP - Other Ethernet NICs: - at803x: fiber and SFP support - xgmac: mdio: preamble suppression and custom MDC frequencies - r8169: enable ASPM L1.2 if system vendor flags it as safe - macb/gem: ZynqMP SGMII - hns3: add TX push mode - dpaa2-eth: software TSO - lan743x: multi-queue, mdio, SGMII, PTP - axienet: NAPI and GRO support - Mellanox Ethernet switches (mlxsw): - source and dest IP address rewrites - RJ45 ports - Marvell Ethernet switches (prestera): - basic routing offload - multi-chain TC ACL offload - NXP embedded Ethernet switches (ocelot & felix): - PTP over UDP with the ocelot-8021q DSA tagging protocol - basic QoS classification on Felix DSA switch using dcbnl - port mirroring for ocelot switches - Microchip high-speed industrial Ethernet (sparx5): - offloading of bridge port flooding flags - PTP Hardware Clock - Other embedded switches: - lan966x: PTP Hardward Clock - qca8k: mdio read/write operations via crafted Ethernet packets - Qualcomm 802.11ax WiFi (ath11k): - add LDPC FEC type and 802.11ax High Efficiency data in radiotap - enable RX PPDU stats in monitor co-exist mode - Intel WiFi (iwlwifi): - UHB TAS enablement via BIOS - band disablement via BIOS - channel switch offload - 32 Rx AMPDU sessions in newer devices - MediaTek WiFi (mt76): - background radar detection - thermal management improvements on mt7915 - SAR support for more mt76 platforms - MBSSID and 6 GHz band on mt7915 - RealTek WiFi: - rtw89: AP mode - rtw89: 160 MHz channels and 6 GHz band - rtw89: hardware scan - Bluetooth: - mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS) - Microchip CAN (mcp251xfd): - multiple RX-FIFOs and runtime configurable RX/TX rings - internal PLL, runtime PM handling simplification - improve chip detection and error handling after wakeup" * tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2521 commits) llc: fix netdevice reference leaks in llc_ui_bind() drivers: ethernet: cpsw: fix panic when interrupt coaleceing is set via ethtool ice: don't allow to run ice_send_event_to_aux() in atomic ctx ice: fix 'scheduling while atomic' on aux critical err interrupt net/sched: fix incorrect vlan_push_eth dest field net: bridge: mst: Restrict info size queries to bridge ports net: marvell: prestera: add missing destroy_workqueue() in prestera_module_init() drivers: net: xgene: Fix regression in CRC stripping net: geneve: add missing netlink policy and size for IFLA_GENEVE_INNER_PROTO_INHERIT net: dsa: fix missing host-filtered multicast addresses net/mlx5e: Fix build warning, detected write beyond size of field iwlwifi: mvm: Don't fail if PPAG isn't supported selftests/bpf: Fix kprobe_multi test. Revert "rethook: x86: Add rethook x86 implementation" Revert "arm64: rethook: Add arm64 rethook implementation" Revert "powerpc: Add rethook support" Revert "ARM: rethook: Add rethook arm implementation" netdevice: add missing dm_private kdoc net: bridge: mst: prevent NULL deref in br_mst_info_size() selftests: forwarding: Use same VRF for port and VLAN upper ...
2022-02-23remove the h8300 architectureChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-14kbuild: Add environment variables for userprogs flagsElliot Berman
Allow additional arguments be passed to userprogs compilation. Reproducible clang builds need to provide a sysroot and gcc path to ensure the same toolchain is used across hosts. KCFLAGS is not currently used for any user programs compilation, so add new USERCFLAGS and USERLDFLAGS which serves similar purpose as HOSTCFLAGS/HOSTLDFLAGS. Clang might detect GCC installation on hosts which have it installed to a default location in /. With addition of these environment variables, you can specify flags such as: $ make USERCFLAGS=--sysroot=/path/to/sysroot This can also be used to specify different sysroots such as musl or bionic which may be installed on the host in paths that the compiler may not search by default. Signed-off-by: Elliot Berman <quic_eberman@quicinc.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Fangrui Song <maskray@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2022-02-11locking: Enable RT_MUTEXES by default on PREEMPT_RT.Sebastian Andrzej Siewior
The CONFIG_RT_MUTEXES option is enabled by CONFIG_FUTEX and CONFIG_I2C. If both are disabled then a CONFIG_PREEMPT_RT build fails to compile. It is not possible to have a PREEMPT_RT kernel without RT_MUTEX support because RT_MUTEX based locking is always used. Enable CONFIG_RT_MUTEXES by default on PREEMPT_RT builds. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YgKmhjkcuqWXdUjQ@linutronix.de
2022-02-02kbuild: Add CONFIG_PAHOLE_VERSIONNathan Chancellor
There are a few different places where pahole's version is turned into a three digit form with the exact same command. Move this command into scripts/pahole-version.sh to reduce the amount of duplication across the tree. Create CONFIG_PAHOLE_VERSION so the version code can be used in Kconfig to enable and disable configuration options based on the pahole version, which is already done in a couple of places. Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220201205624.652313-3-nathan@kernel.org
2022-01-19Merge tag 'kbuild-v5.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Add new kconfig target 'make mod2noconfig', which will be useful to speed up the build and test iteration. - Raise the minimum supported version of LLVM to 11.0.0 - Refactor certs/Makefile - Change the format of include/config/auto.conf to stop double-quoting string type CONFIG options. - Fix ARCH=sh builds in dash - Separate compression macros for general purposes (cmd_bzip2 etc.) and the ones for decompressors (cmd_bzip2_with_size etc.) - Misc Makefile cleanups * tag 'kbuild-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (34 commits) kbuild: add cmd_file_size arch: decompressor: remove useless vmlinux.bin.all-y kbuild: rename cmd_{bzip2,lzma,lzo,lz4,xzkern,zstd22} kbuild: drop $(size_append) from cmd_zstd sh: rename suffix-y to suffix_y doc: kbuild: fix default in `imply` table microblaze: use built-in function to get CPU_{MAJOR,MINOR,REV} certs: move scripts/extract-cert to certs/ kbuild: do not quote string values in include/config/auto.conf kbuild: do not include include/config/auto.conf from shell scripts certs: simplify $(srctree)/ handling and remove config_filename macro kbuild: stop using config_filename in scripts/Makefile.modsign certs: remove misleading comments about GCC PR certs: refactor file cleaning certs: remove unneeded -I$(srctree) option for system_certificates.o certs: unify duplicated cmd_extract_certs and improve the log certs: use $< and $@ to simplify the key generation rule kbuild: remove headers_check stub kbuild: move headers_check.pl to usr/include/ certs: use if_changed to re-generate the key when the key type is changed ...
2022-01-17Merge branch 'modules-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull module updates from Luis Chamberlain: "The biggest change here is in-kernel support for module decompression. This change is being made to help support LSMs like LoadPin as otherwise it loses link between the source of kernel module on the disk and binary blob that is being loaded into the kernel. kmod decompression is still done by userspace even with this is done, both because there are no measurable gains in not doing so and as it adds a secondary extra check for validating the module before loading it into the kernel. The rest of the changes are minor, the only other change worth mentionin there is Jessica Yu is now bowing out of maintenance of modules as she's taking a break from work. While there were other changes posted for modules, those have not yet received much review of testing so I'm not yet comfortable in merging any of those changes yet." * 'modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: module: fix signature check failures when using in-kernel decompression kernel: Fix spelling mistake "compresser" -> "compressor" MAINTAINERS: add mailing lists for kmod and modules module.h: allow #define strings to work with MODULE_IMPORT_NS module: add in-kernel support for decompressing MAINTAINERS: Remove myself as modules maintainer module: Remove outdated comment
2022-01-13Merge tag 'perf_core_for_v5.17_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Borislav Petkov: "Cleanup of the perf/kvm interaction." * tag 'perf_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Drop guest callback (un)register stubs KVM: arm64: Drop perf.c and fold its tiny bits of code into arm.c KVM: arm64: Hide kvm_arm_pmu_available behind CONFIG_HW_PERF_EVENTS=y KVM: arm64: Convert to the generic perf callbacks KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c KVM: Move x86's perf guest info callbacks to generic KVM KVM: x86: More precisely identify NMI from guest when handling PMI KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu variable perf/core: Use static_call to optimize perf_guest_info_callbacks perf: Force architectures to opt-in to guest callbacks perf: Add wrappers for invoking guest callbacks perf/core: Rework guest callbacks to prepare for static_call support perf: Drop dead and useless guest "support" from arm, csky, nds32 and riscv perf: Stop pretending that perf can handle multiple guest callbacks KVM: x86: Register Processor Trace interrupt hook iff PT enabled in guest KVM: x86: Register perf callbacks after calling vendor's hardware_setup() perf: Protect perf_guest_cbs with RCU
2022-01-12module: add in-kernel support for decompressingDmitry Torokhov
Current scheme of having userspace decompress kernel modules before loading them into the kernel runs afoul of LoadPin security policy, as it loses link between the source of kernel module on the disk and binary blob that is being loaded into the kernel. To solve this issue let's implement decompression in kernel, so that we can pass a file descriptor of compressed module file into finit_module() which will keep LoadPin happy. To let userspace know what compression/decompression scheme kernel supports it will create /sys/module/compression attribute. kmod can read this attribute and decide if it can pass compressed file to finit_module(). New MODULE_INIT_COMPRESSED_DATA flag indicates that the kernel should attempt to decompress the data read from file descriptor prior to trying load the module. To simplify things kernel will only implement single decompression method matching compression method selected when generating modules. This patch implements gzip and xz; more can be added later, Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-01-12Merge tag 'locking_core_for_v5.17_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Borislav Petkov: "Lots of cleanups and preparation. Highlights: - futex: Cleanup and remove runtime futex_cmpxchg detection - rtmutex: Some fixes for the PREEMPT_RT locking infrastructure - kcsan: Share owner_on_cpu() between mutex,rtmutex and rwsem and annotate the racy owner->on_cpu access *once*. - atomic64: Dead-Code-Elemination" [ Description above by Peter Zijlstra ] * tag 'locking_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/atomic: atomic64: Remove unusable atomic ops futex: Fix additional regressions locking: Allow to include asm/spinlock_types.h from linux/spinlock_types_raw.h x86/mm: Include spinlock_t definition in pgtable. locking: Mark racy reads of owner->on_cpu locking: Make owner_on_cpu() into <linux/sched.h> lockdep/selftests: Adapt ww-tests for PREEMPT_RT lockdep/selftests: Skip the softirq related tests on PREEMPT_RT lockdep/selftests: Unbalanced migrate_disable() & rcu_read_lock(). lockdep/selftests: Avoid using local_lock_{acquire|release}(). lockdep: Remove softirq accounting on PREEMPT_RT. locking/rtmutex: Add rt_mutex_lock_nest_lock() and rt_mutex_lock_killable(). locking/rtmutex: Squash self-deadlock check for ww_rt_mutex. locking: Remove rt_rwlock_is_contended(). sched: Trigger warning if ->migration_disabled counter underflows. futex: Fix sparc32/m68k/nds32 build regression futex: Remove futex_cmpxchg detection futex: Ensure futex_atomic_cmpxchg_inatomic() is present kernel/locking: Use a pointer in ww_mutex_trylock().
2022-01-11Merge tag 'pm-5.17-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "The most signigicant change here is the addition of a new cpufreq 'P-state' driver for AMD processors as a better replacement for the venerable acpi-cpufreq driver. There are also other cpufreq updates (in the core, intel_pstate, ARM drivers), PM core updates (mostly related to adding new macros for declaring PM operations which should make the lives of driver developers somewhat easier), and a bunch of assorted fixes and cleanups. Summary: - Add new P-state driver for AMD processors (Huang Rui). - Fix initialization of min and max frequency QoS requests in the cpufreq core (Rafael Wysocki). - Fix EPP handling on Alder Lake in intel_pstate (Srinivas Pandruvada). - Make intel_pstate update cpuinfo.max_freq when notified of HWP capabilities changes and drop a redundant function call from that driver (Rafael Wysocki). - Improve IRQ support in the Qcom cpufreq driver (Ard Biesheuvel, Stephen Boyd, Vladimir Zapolskiy). - Fix double devm_remap() in the Mediatek cpufreq driver (Hector Yuan). - Introduce thermal pressure helpers for cpufreq CPU cooling (Lukasz Luba). - Make cpufreq use default_groups in kobj_type (Greg Kroah-Hartman). - Make cpuidle use default_groups in kobj_type (Greg Kroah-Hartman). - Fix two comments in cpuidle code (Jason Wang, Yang Li). - Allow model-specific normal EPB value to be used in the intel_epb sysfs attribute handling code (Srinivas Pandruvada). - Simplify locking in pm_runtime_put_suppliers() (Rafael Wysocki). - Add safety net to supplier device release in the runtime PM core code (Rafael Wysocki). - Capture device status before disabling runtime PM for it (Rafael Wysocki). - Add new macros for declaring PM operations to allow drivers to avoid guarding them with CONFIG_PM #ifdefs or __maybe_unused and update some drivers to use these macros (Paul Cercueil). - Allow ACPI hardware signature to be honoured during restore from hibernation (David Woodhouse). - Update outdated operating performance points (OPP) documentation (Tang Yizhou). - Reduce log severity for informative message regarding frequency transition failures in devfreq (Tzung-Bi Shih). - Add DRAM frequency controller devfreq driver for Allwinner sunXi SoCs (Samuel Holland). - Add missing COMMON_CLK dependency to sun8i devfreq driver (Arnd Bergmann). - Add support for new layout of Psys PowerLimit Register on SPR to the Intel RAPL power capping driver (Zhang Rui). - Fix typo in a comment in idle_inject.c (Jason Wang). - Remove unused function definition from the DTPM (Dynamit Thermal Power Management) power capping framework (Daniel Lezcano). - Reduce DTPM trace verbosity (Daniel Lezcano)" * tag 'pm-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (53 commits) x86, sched: Fix undefined reference to init_freq_invariance_cppc() build error cpufreq: amd-pstate: Fix Kconfig dependencies for AMD P-State cpufreq: amd-pstate: Fix struct amd_cpudata kernel-doc comment cpuidle: use default_groups in kobj_type x86: intel_epb: Allow model specific normal EPB value MAINTAINERS: Add AMD P-State driver maintainer entry Documentation: amd-pstate: Add AMD P-State driver introduction cpufreq: amd-pstate: Add AMD P-State performance attributes cpufreq: amd-pstate: Add AMD P-State frequencies attributes cpufreq: amd-pstate: Add boost mode support for AMD P-State cpufreq: amd-pstate: Add trace for AMD P-State module cpufreq: amd-pstate: Introduce the support for the processors with shared memory solution cpufreq: amd-pstate: Add fast switch function for AMD P-State cpufreq: amd-pstate: Introduce a new AMD P-State driver to support future processors ACPI: CPPC: Add CPPC enable register function ACPI: CPPC: Check present CPUs for determining _CPC is valid ACPI: CPPC: Implement support for SystemIO registers x86/msr: Add AMD CPPC MSR definitions x86/cpufeatures: Add AMD Collaborative Processor Performance Control feature flag cpufreq: use default_groups in kobj_type ...
2022-01-05mm: Make SLAB_MERGE_DEFAULT depend on SL[AU]BHyeonggon Yoo
SLOB always manage objects of different caches in same page regardless of SLAB_MERGE_DEFAULT. Because it has no effect on SLOB, make it depend on SLAB || SLUB. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20211225060921.13584-1-42.hyeyoo@gmail.com
2021-12-30Merge branch 'cpufreq/arm/linux-next' of ↵Rafael J. Wysocki
git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm Pull ARM cpufreq updates for 5.17-rc1 from Viresh Kumar: "- Qcom cpufreq driver updates improve irq support (Ard Biesheuvel, Stephen Boyd, and Vladimir Zapolskiy). - Fixes double devm_remap for mediatek driver (Hector Yuan). - Introduces thermal pressure helpers (Lukasz Luba)." * 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm: cpufreq: mediatek-hw: Fix double devm_remap in hotplug case cpufreq: qcom-hw: Use optional irq API cpufreq: qcom-hw: Set CPU affinity of dcvsh interrupts cpufreq: qcom-hw: Fix probable nested interrupt handling cpufreq: qcom-cpufreq-hw: Avoid stack buffer for IRQ name arch_topology: Remove unused topology_set_thermal_pressure() and related cpufreq: qcom-cpufreq-hw: Use new thermal pressure update function cpufreq: qcom-cpufreq-hw: Update offline CPUs per-cpu thermal pressure thermal: cpufreq_cooling: Use new thermal pressure update function arch_topology: Introduce thermal pressure update function
2021-12-13Merge tag 'v5.16-rc5' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-12-02init/Kconfig: Drop linker version check for LD_ORPHAN_WARNNathan Chancellor
The minimum supported version of LLVM has been raised to 11.0.0, meaning this check is always true, so it can be dropped. Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2021-11-25futex: Remove futex_cmpxchg detectionArnd Bergmann
Now that all architectures have a working futex implementation in any configuration, remove the runtime detection code. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Vineet Gupta <vgupta@kernel.org> Acked-by: Max Filippov <jcmvbkbc@gmail.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Link: https://lore.kernel.org/r/20211026100432.1730393-2-arnd@kernel.org
2021-11-25futex: Ensure futex_atomic_cmpxchg_inatomic() is presentArnd Bergmann
The boot-time detection of futex_atomic_cmpxchg_inatomic() has a bug on some 32-bit arm builds, and Thomas Gleixner suggested that setting CONFIG_HAVE_FUTEX_CMPXCHG would avoid the problem, as it is always present anyway. Looking into which other architectures could do the same showed that almost all architectures have it, the exceptions being: - some old 32-bit MIPS uniprocessor cores without ll/sc - one xtensa variant with no SMP - 32-bit SPARC when built for SMP Fix MIPS And Xtensa by rearranging the generic code to let it be used as a fallback. For SPARC, the SMP definition just ends up turning off futex anyway, so this can be done at Kconfig time instead. Note that sparc32 glibc requires the CASA instruction for its mutexes anyway, which is only available when running on SPARCv9 or LEON CPUs, but needs to be implemented in the sparc32 kernel for those. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Max Filippov <jcmvbkbc@gmail.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Rich Felker <dalias@libc.org> Link: https://lore.kernel.org/r/20211026100432.1730393-1-arnd@kernel.org
2021-11-23arch_topology: Remove unused topology_set_thermal_pressure() and relatedLukasz Luba
There is no need of this function (and related) since code has been converted to use the new arch_update_thermal_pressure() API. The old code can be removed. Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2021-11-17perf: Force architectures to opt-in to guest callbacksSean Christopherson
Introduce GUEST_PERF_EVENTS and require architectures to select it to allow registering and using guest callbacks in perf. This will hopefully make it more difficult for new architectures to add useless "support" for guest callbacks, e.g. via copy+paste. Stubbing out the helpers has the happy bonus of avoiding a load of perf_guest_cbs when GUEST_PERF_EVENTS=n on arm64/x86. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20211111020738.2512932-9-seanjc@google.com
2021-11-15kbuild: Fix -Wimplicit-fallthrough=5 error for GCC 5.x and 6.xGustavo A. R. Silva
-Wimplicit-fallthrough=5 was under cc-option because it was only available in GCC 7.x and newer so the build is now broken for GCC 5.x and 6.x: gcc: error: unrecognized command line option '-Wimplicit-fallthrough=5'; did you mean '-Wno-fallthrough'? Fix this by moving -Wimplicit-fallthrough=5 under cc-option. Fixes: dee2b702bcf0 ("kconfig: Add support for -Wimplicit-fallthrough") Reported-by: Nathan Chancellor <nathan@kernel.org> Co-developed-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-15kconfig: Add support for -Wimplicit-fallthroughGustavo A. R. Silva
Add Kconfig support for -Wimplicit-fallthrough for both GCC and Clang. The compiler option is under configuration CC_IMPLICIT_FALLTHROUGH, which is enabled by default. Special thanks to Nathan Chancellor who fixed the Clang bug[1][2]. This bugfix only appears in Clang 14.0.0, so older versions still contain the bug and -Wimplicit-fallthrough won't be enabled for them, for now. This concludes a long journey and now we are finally getting rid of the unintentional fallthrough bug-class in the kernel, entirely. :) Link: https://github.com/llvm/llvm-project/commit/9ed4a94d6451046a51ef393cd62f00710820a7e8 [1] Link: https://bugs.llvm.org/show_bug.cgi?id=51094 [2] Link: https://github.com/KSPP/linux/issues/115 Link: https://github.com/ClangBuiltLinux/linux/issues/236 Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-11mm: allow only SLUB on PREEMPT_RTIngo Molnar
Memory allocators may disable interrupts or preemption as part of the allocation and freeing process. For PREEMPT_RT it is important that these sections remain deterministic and short and therefore don't depend on the size of the memory to allocate/ free or the inner state of the algorithm. Until v3.12-RT the SLAB allocator was an option but involved several changes to meet all the requirements. The SLUB design fits better with PREEMPT_RT model and so the SLAB patches were dropped in the 3.12-RT patchset. Comparing the two allocator, SLUB outperformed SLAB in both throughput (time needed to allocate and free memory) and the maximal latency of the system measured with cyclictest during hackbench. SLOB was never evaluated since it was unlikely that it preforms better than SLAB. During a quick test, the kernel crashed with SLOB enabled during boot. Disable SLAB and SLOB on PREEMPT_RT. [bigeasy@linutronix.de: commit description] Link: https://lkml.kernel.org/r/20211015210336.gen3tib33ig5q2md@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm: disable NUMA_BALANCING_DEFAULT_ENABLED and TRANSPARENT_HUGEPAGE on ↵Sebastian Andrzej Siewior
PREEMPT_RT TRANSPARENT_HUGEPAGE: There are potential non-deterministic delays to an RT thread if a critical memory region is not THP-aligned and a non-RT buffer is located in the same hugepage-aligned region. It's also possible for an unrelated thread to migrate pages belonging to an RT task incurring unexpected page faults due to memory defragmentation even if khugepaged is disabled. Regular HUGEPAGEs are not affected by this can be used. NUMA_BALANCING: There is a non-deterministic delay to mark PTEs PROT_NONE to gather NUMA fault samples, increased page faults of regions even if mlocked and non-deterministic delays when migrating pages. [Mel Gorman worded 99% of the commit description]. Link: https://lore.kernel.org/all/20200304091159.GN3818@techsingularity.net/ Link: https://lore.kernel.org/all/20211026165100.ahz5bkx44lrrw5pt@linutronix.de/ Link: https://lkml.kernel.org/r/20211028143327.hfbxjze7palrpfgp@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08kbuild: Only default to -Werror if COMPILE_TESTMarco Elver
The cross-product of the kernel's supported toolchains, architectures, and configuration options is large. So large, that it's generally accepted to be infeasible to enumerate and build+test them all (many compile-testers rely on randomly generated configs). Without the possibility to enumerate all possible combinations of toolchains, architectures, and configuration options, it is inevitable that compiler warnings in this space exist. With -Werror, this means that an innumerable set of kernels are now broken, yet had been perfectly usable before (confused compilers, code with warnings unused, or luck). Distributors will necessarily pick a point in the toolchain X arch X config space, and if unlucky, will have a broken build. Granted, those will likely disable CONFIG_WERROR and move on. The kernel's default configuration is unlikely to be suitable for all users, but it's inappropriate to force many users to set CONFIG_WERROR=n. This also holds for CI systems which are focused on runtime testing, where the odd warning in some subsystem will disrupt testing of the rest of the kernel. Many of those runtime-focused CI systems run tests or fuzz the kernel using runtime debugging tools. Runtime testing of different subsystems can proceed in parallel, and potentially uncover serious bugs; halting runtime testing of the entire kernel because of the odd warning (now error) in a subsystem or driver is simply inappropriate. Therefore, runtime-focused CI systems will likely choose CONFIG_WERROR=n as well. The appropriate usecase for -Werror is therefore compile-test focused builds (often done by developers or CI systems). Reflect this in the Kconfig option by making the default value of WERROR match COMPILE_TEST. Signed-off-by: Marco Elver <elver@google.com> Acked-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Randy Dunlap <rdunlap@infradead.org> Reviwed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-05Enable '-Werror' by default for all kernel buildsLinus Torvalds
... but make it a config option so that broken environments can disable it when required. We really should always have a clean build, and will disable specific over-eager warnings as required, if we can't fix them. But while I fairly religiously enforce that in my own tree, it doesn't get enforced by various build robots that don't necessarily report warnings. So this just makes '-Werror' a default compiler flag, but allows people to disable it for their configuration if they have some particular issues. Occasionally, new compiler versions end up enabling new warnings, and it can take a while before we have them fixed (or the warnings disabled if that is what it takes), so the config option allows for that situation. Hopefully this will mean that I get fewer pull requests that have new warnings that were not noticed by various automation we have in place. Knock wood. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-02Merge tag 'printk-for-5.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Optionally, provide an index of possible printk messages via <debugfs>/printk/index/. It can be used when monitoring important kernel messages on a farm of various hosts. The monitor has to be updated when some messages has changed or are not longer available by a newly deployed kernel. - Add printk.console_no_auto_verbose boot parameter. It allows to generate crash dump even with slow consoles in a reasonable time frame. - Remove printk_safe buffers. The messages are always stored directly to the main logbuffer, even in NMI or recursive context. Also it allows to serialize syslog operations by a mutex instead of a spin lock. - Misc clean up and build fixes. * tag 'printk-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk/index: Fix -Wunused-function warning lib/nmi_backtrace: Serialize even messages about idle CPUs printk: Add printk.console_no_auto_verbose boot parameter printk: Remove console_silent() lib/test_scanf: Handle n_bits == 0 in random tests printk: syslog: close window between wait and read printk: convert @syslog_lock to mutex printk: remove NMI tracking printk: remove safe buffers printk: track/limit recursion lib/nmi_backtrace: explicitly serialize banner and regs printk: Move the printk() kerneldoc comment to its new home printk/index: Fix warning about missing prototypes MIPS/asm/printk: Fix build failure caused by printk printk: index: Add indexing support to dev_printk printk: Userspace format indexing support printk: Rework parse_prefix into printk_parse_prefix printk: Straighten out log_flags into printk_info_flags string_helpers: Escape double quotes in escape_special printk/console: Check consistent sequence number when handling race in console_unlock()
2021-08-30Merge branch 'rework/printk_safe-removal' into for-linusPetr Mladek
2021-07-26printk: remove NMI trackingJohn Ogness
All NMI contexts are handled the same as the safe context: store the message and defer printing. There is no need to have special NMI context tracking for this. Using in_nmi() is enough. There are several parts of the kernel that are manually calling into the printk NMI context tracking in order to cause general printk deferred printing: arch/arm/kernel/smp.c arch/powerpc/kexec/crash.c kernel/trace/trace.c For arm/kernel/smp.c and powerpc/kexec/crash.c, provide a new function pair printk_deferred_enter/exit that explicitly achieves the same objective. For ftrace, remove the printk context manipulation completely. It was added in commit 03fc7f9c99c1 ("printk/nmi: Prevent deadlock when accessing the main log buffer in NMI"). The purpose was to enforce storing messages directly into the ring buffer even in NMI context. It really should have only modified the behavior in NMI context. There is no need for a special behavior any longer. All messages are always stored directly now. The console deferring is handled transparently in vprintk(). Signed-off-by: John Ogness <john.ogness@linutronix.de> [pmladek@suse.com: Remove special handling in ftrace.c completely. Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210715193359.25946-5-john.ogness@linutronix.de
2021-07-19printk: Userspace format indexing supportChris Down
We have a number of systems industry-wide that have a subset of their functionality that works as follows: 1. Receive a message from local kmsg, serial console, or netconsole; 2. Apply a set of rules to classify the message; 3. Do something based on this classification (like scheduling a remediation for the machine), rinse, and repeat. As a couple of examples of places we have this implemented just inside Facebook, although this isn't a Facebook-specific problem, we have this inside our netconsole processing (for alarm classification), and as part of our machine health checking. We use these messages to determine fairly important metrics around production health, and it's important that we get them right. While for some kinds of issues we have counters, tracepoints, or metrics with a stable interface which can reliably indicate the issue, in order to react to production issues quickly we need to work with the interface which most kernel developers naturally use when developing: printk. Most production issues come from unexpected phenomena, and as such usually the code in question doesn't have easily usable tracepoints or other counters available for the specific problem being mitigated. We have a number of lines of monitoring defence against problems in production (host metrics, process metrics, service metrics, etc), and where it's not feasible to reliably monitor at another level, this kind of pragmatic netconsole monitoring is essential. As one would expect, monitoring using printk is rather brittle for a number of reasons -- most notably that the message might disappear entirely in a new version of the kernel, or that the message may change in some way that the regex or other classification methods start to silently fail. One factor that makes this even harder is that, under normal operation, many of these messages are never expected to be hit. For example, there may be a rare hardware bug which one wants to detect if it was to ever happen again, but its recurrence is not likely or anticipated. This precludes using something like checking whether the printk in question was printed somewhere fleetwide recently to determine whether the message in question is still present or not, since we don't anticipate that it should be printed anywhere, but still need to monitor for its future presence in the long-term. This class of issue has happened on a number of occasions, causing unhealthy machines with hardware issues to remain in production for longer than ideal. As a recent example, some monitoring around blk_update_request fell out of date and caused semi-broken machines to remain in production for longer than would be desirable. Searching through the codebase to find the message is also extremely fragile, because many of the messages are further constructed beyond their callsite (eg. btrfs_printk and other module-specific wrappers, each with their own functionality). Even if they aren't, guessing the format and formulation of the underlying message based on the aesthetics of the message emitted is not a recipe for success at scale, and our previous issues with fleetwide machine health checking demonstrate as much. This provides a solution to the issue of silently changed or deleted printks: we record pointers to all printk format strings known at compile time into a new .printk_index section, both in vmlinux and modules. At runtime, this can then be iterated by looking at <debugfs>/printk/index/<module>, which emits the following format, both readable by humans and able to be parsed by machines: $ head -1 vmlinux; shuf -n 5 vmlinux # <level[,flags]> filename:line function "format" <5> block/blk-settings.c:661 disk_stack_limits "%s: Warning: Device %s is misaligned\n" <4> kernel/trace/trace.c:8296 trace_create_file "Could not create tracefs '%s' entry\n" <6> arch/x86/kernel/hpet.c:144 _hpet_print_config "hpet: %s(%d):\n" <6> init/do_mounts.c:605 prepare_namespace "Waiting for root device %s...\n" <6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n" This mitigates the majority of cases where we have a highly-specific printk which we want to match on, as we can now enumerate and check whether the format changed or the printk callsite disappeared entirely in userspace. This allows us to catch changes to printks we monitor earlier and decide what to do about it before it becomes problematic. There is no additional runtime cost for printk callers or printk itself, and the assembly generated is exactly the same. Signed-off-by: Chris Down <chris@chrisdown.name> Cc: Petr Mladek <pmladek@suse.com> Cc: Jessica Yu <jeyu@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kees Cook <keescook@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Reported-by: kernel test robot <lkp@intel.com> Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com> Acked-by: Jessica Yu <jeyu@kernel.org> # for module.{c,h} Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/e42070983637ac5e384f17fbdbe86d19c7b212a5.1623775748.git.chris@chrisdown.name
2021-07-17Revert "mm/slub: use stackdepot to save stack trace in objects"Linus Torvalds
This reverts commit 788691464c29455346dc613a3b43c2fb9e5757a4. It's not clear why, but it causes unexplained problems in entirely unrelated xfs code. The most likely explanation is some slab corruption, possibly triggered due to CONFIG_SLUB_DEBUG_ON. See [1]. It ends up having a few other problems too, like build errors on arch/arc, and Geert reporting it using much more memory on m68k [3] (it probably does so elsewhere too, but it is probably just more noticeable on m68k). The architecture issues (both build and memory use) are likely just because this change effectively force-enabled STACKDEPOT (along with a very bad default value for the stackdepot hash size). But together with the xfs issue, this all smells like "this commit was not ready" to me. Link: https://lore.kernel.org/linux-xfs/YPE3l82acwgI2OiV@infradead.org/ [1] Link: https://lore.kernel.org/lkml/202107150600.LkGNb4Vb-lkp@intel.com/ [2] Link: https://lore.kernel.org/lkml/CAMuHMdW=eoVzM1Re5FVoEN87nKfiLmM2+Ah7eNu2KXEhCvbZyA@mail.gmail.com/ [3] Reported-by: Christoph Hellwig <hch@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08mm/slub: use stackdepot to save stack trace in objectsOliver Glitta
Many stack traces are similar so there are many similar arrays. Stackdepot saves each unique stack only once. Replace field addrs in struct track with depot_stack_handle_t handle. Use stackdepot to save stack trace. The benefits are smaller memory overhead and possibility to aggregate per-cache statistics in the future using the stackdepot handle instead of matching stacks manually. [rdunlap@infradead.org: rename save_stack_trace()] Link: https://lkml.kernel.org/r/20210513051920.29320-1-rdunlap@infradead.org [vbabka@suse.cz: fix lockdep splat] Link: https://lkml.kernel.org/r/20210516195150.26740-1-vbabka@suse.czLink: https://lkml.kernel.org/r/20210414163434.4376-1-glittao@gmail.com Signed-off-by: Oliver Glitta <glittao@gmail.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01Merge tag 'clang-features-v5.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull clang feature updates from Kees Cook: - Add CC_HAS_NO_PROFILE_FN_ATTR in preparation for PGO support in the face of the noinstr attribute, paving the way for PGO and fixing GCOV. (Nick Desaulniers) - x86_64 LTO coverage is expanded to 32-bit x86. (Nathan Chancellor) - Small fixes to CFI. (Mark Rutland, Nathan Chancellor) * tag 'clang-features-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: qemu_fw_cfg: Make fw_cfg_rev_attr a proper kobj_attribute Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR compiler_attributes.h: cleanups for GCC 4.9+ compiler_attributes.h: define __no_profile, add to noinstr x86, lto: Enable Clang LTO for 32-bit as well CFI: Move function_nocfi() into compiler.h MAINTAINERS: Add Clang CFI section
2021-06-22Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTRNick Desaulniers
We don't want compiler instrumentation to touch noinstr functions, which are annotated with the no_profile_instrument_function function attribute. Add a Kconfig test for this and make GCOV depend on it, and in the future, PGO. If an architecture is using noinstr, it should denote that via this Kconfig value. That makes Kconfigs that depend on noinstr able to express dependencies in an architecturally agnostic way. Cc: Masahiro Yamada <masahiroy@kernel.org> Link: https://lore.kernel.org/lkml/YMTn9yjuemKFLbws@hirez.programming.kicks-ass.net/ Link: https://lore.kernel.org/lkml/YMcssV%2Fn5IBGv4f0@hirez.programming.kicks-ass.net/ Suggested-by: Nathan Chancellor <nathan@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210621231822.2848305-4-ndesaulniers@google.com
2021-05-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2021-05-11 The following pull-request contains BPF updates for your *net* tree. We've added 13 non-merge commits during the last 8 day(s) which contain a total of 21 files changed, 817 insertions(+), 382 deletions(-). The main changes are: 1) Fix multiple ringbuf bugs in particular to prevent writable mmap of read-only pages, from Andrii Nakryiko & Thadeu Lima de Souza Cascardo. 2) Fix verifier alu32 known-const subregister bound tracking for bitwise operations and/or/xor, from Daniel Borkmann. 3) Reject trampoline attachment for functions with variable arguments, and also add a deny list of other forbidden functions, from Jiri Olsa. 4) Fix nested bpf_bprintf_prepare() calls used by various helpers by switching to per-CPU buffers, from Florent Revest. 5) Fix kernel compilation with BTF debug info on ppc64 due to pahole missing TCP-CC functions like cubictcp_init, from Martin KaFai Lau. 6) Add a kconfig entry to provide an option to disallow unprivileged BPF by default, from Daniel Borkmann. 7) Fix libbpf compilation for older libelf when GELF_ST_VISIBILITY() macro is not available, from Arnaldo Carvalho de Melo. 8) Migrate test_tc_redirect to test_progs framework as prep work for upcoming skb_change_head() fix & selftest, from Jussi Maki. 9) Fix a libbpf segfault in add_dummy_ksym_var() if BTF is not present, from Ian Rogers. 10) Fix tx_only micro-benchmark in xdpsock BPF sample with proper frame size, from Magnus Karlsson. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-11bpf, kconfig: Add consolidated menu entry for bpf with core optionsDaniel Borkmann
Right now, all core BPF related options are scattered in different Kconfig locations mainly due to historic reasons. Moving forward, lets add a proper subsystem entry under ... General setup ---> BPF subsystem ---> ... in order to have all knobs in a single location and thus ease BPF related configuration. Networking related bits such as sockmap are out of scope for the general setup and therefore better suited to remain in net/Kconfig. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/f23f58765a4d59244ebd8037da7b6a6b2fb58446.1620765074.git.daniel@iogearbox.net
2021-05-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge yet more updates from Andrew Morton: "This is everything else from -mm for this merge window. 90 patches. Subsystems affected by this patch series: mm (cleanups and slub), alpha, procfs, sysctl, misc, core-kernel, bitmap, lib, compat, checkpatch, epoll, isofs, nilfs2, hpfs, exit, fork, kexec, gcov, panic, delayacct, gdb, resource, selftests, async, initramfs, ipc, drivers/char, and spelling" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (90 commits) mm: fix typos in comments mm: fix typos in comments treewide: remove editor modelines and cruft ipc/sem.c: spelling fix fs: fat: fix spelling typo of values kernel/sys.c: fix typo kernel/up.c: fix typo kernel/user_namespace.c: fix typos kernel/umh.c: fix some spelling mistakes include/linux/pgtable.h: few spelling fixes mm/slab.c: fix spelling mistake "disired" -> "desired" scripts/spelling.txt: add "overflw" scripts/spelling.txt: Add "diabled" typo scripts/spelling.txt: add "overlfow" arm: print alloc free paths for address in registers mm/vmalloc: remove vwrite() mm: remove xlate_dev_kmem_ptr() drivers/char: remove /dev/kmem for good mm: fix some typos and code style problems ipc/sem.c: mundane typo fixes ...
2021-05-07modules: add CONFIG_MODPROBE_PATHRasmus Villemoes
Allow the developer to specifiy the initial value of the modprobe_path[] string. This can be used to set it to the empty string initially, thus effectively disabling request_module() during early boot until userspace writes a new value via the /proc/sys/kernel/modprobe interface. [1] When building a custom kernel (often for an embedded target), it's normal to build everything into the kernel that is needed for booting, and indeed the initramfs often contains no modules at all, so every such request_module() done before userspace init has mounted the real rootfs is a waste of time. This is particularly useful when combined with the previous patch, which made the initramfs unpacking asynchronous - for that to work, it had to make any usermodehelper call wait for the unpacking to finish before attempting to invoke the userspace helper. By eliminating all such (known-to-be-futile) calls of usermodehelper, the initramfs unpacking and the {device,late}_initcalls can proceed in parallel for much longer. For a relatively slow ppc board I'm working on, the two patches combined lead to 0.2s faster boot - but more importantly, the fact that the initramfs unpacking proceeds completely in the background while devices get probed means I get to handle the gpio watchdog in time without getting reset. [1] __request_module() already has an early -ENOENT return when modprobe_path is the empty string. Link: https://lkml.kernel.org/r/20210313212528.2956377-3-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Jessica Yu <jeyu@kernel.org> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: "The remainder of the main mm/ queue. 143 patches. Subsystems affected by this patch series (all mm): pagecache, hugetlb, userfaultfd, vmscan, compaction, migration, cma, ksm, vmstat, mmap, kconfig, util, memory-hotplug, zswap, zsmalloc, highmem, cleanups, and kfence" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (143 commits) kfence: use power-efficient work queue to run delayed work kfence: maximize allocation wait timeout duration kfence: await for allocation using wait_event kfence: zero guard page after out-of-bounds access mm/process_vm_access.c: remove duplicate include mm/mempool: minor coding style tweaks mm/highmem.c: fix coding style issue btrfs: use memzero_page() instead of open coded kmap pattern iov_iter: lift memzero_page() to highmem.h mm/zsmalloc: use BUG_ON instead of if condition followed by BUG. mm/zswap.c: switch from strlcpy to strscpy arm64/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE x86/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE mm,memory_hotplug: add kernel boot option to enable memmap_on_memory acpi,memhotplug: enable MHP_MEMMAP_ON_MEMORY when supported mm,memory_hotplug: allocate memmap from the added memory range mm,memory_hotplug: factor out adjusting present pages into adjust_present_page_count() mm,memory_hotplug: relax fully spanned sections check drivers/base/memory: introduce memory_block_{online,offline} mm/memory_hotplug: remove broken locking of zone PCP structures during hot remove ...
2021-05-05userfaultfd: add minor fault registration modeAxel Rasmussen
Patch series "userfaultfd: add minor fault handling", v9. Overview ======== This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS. When enabled (via the UFFDIO_API ioctl), this feature means that any hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also* get events for "minor" faults. By "minor" fault, I mean the following situation: Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory). One of the mappings is registered with userfaultfd (in minor mode), and the other is not. Via the non-UFFD mapping, the underlying pages have already been allocated & filled with some contents. The UFFD mapping has not yet been faulted in; when it is touched for the first time, this results in what I'm calling a "minor" fault. As a concrete example, when working with hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing page. We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is, userspace resolves the fault by either a) doing nothing if the contents are already correct, or b) updating the underlying contents using the second, non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA, or etc...). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are correct, carry on setting up the mapping". Use Case ======== Consider the use case of VM live migration (e.g. under QEMU/KVM): 1. While a VM is still running, we copy the contents of its memory to a target machine. The pages are populated on the target by writing to the non-UFFD mapping, using the setup described above. The VM is still running (and therefore its memory is likely changing), so this may be repeated several times, until we decide the target is "up to date enough". 2. We pause the VM on the source, and start executing on the target machine. During this gap, the VM's user(s) will *see* a pause, so it is desirable to minimize this window. 3. Between the last time any page was copied from the source to the target, and when the VM was paused, the contents of that page may have changed - and therefore the copy we have on the target machine is out of date. Although we can keep track of which pages are out of date, for VMs with large amounts of memory, it is "slow" to transfer this information to the target machine. We want to resume execution before such a transfer would complete. 4. So, the guest begins executing on the target machine. The first time it touches its memory (via the UFFD-registered mapping), userspace wants to intercept this fault. Userspace checks whether or not the page is up to date, and if not, copies the updated page from the source machine, via the non-UFFD mapping. Finally, whether a copy was performed or not, userspace issues a UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents are correct, carry on setting up the mapping". We don't have to do all of the final updates on-demand. The userfaultfd manager can, in the background, also copy over updated pages once it receives the map of which pages are up-to-date or not. Interaction with Existing APIs ============================== Because this is a feature, a registered VMA could potentially receive both missing and minor faults. I spent some time thinking through how the existing API interacts with the new feature: UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault: - For non-shared memory or shmem, -EINVAL is returned. - For hugetlb, -EFAULT is returned. UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults. Without modifications, the existing codepath assumes a new page needs to be allocated. This is okay, since userspace must have a second non-UFFD-registered mapping anyway, thus there isn't much reason to want to use these in any case (just memcpy or memset or similar). - If UFFDIO_COPY is used on a minor fault, -EEXIST is returned. - If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case). - UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns -ENOENT in that case (regardless of the kind of fault). Future Work =========== This series only supports hugetlbfs. I have a second series in flight to support shmem as well, extending the functionality. This series is more mature than the shmem support at this point, and the functionality works fully on hugetlbfs, so this series can be merged first and then shmem support will follow. This patch (of 6): This feature allows userspace to intercept "minor" faults. By "minor" faults, I mean the following situation: Let there exist two mappings (i.e., VMAs) to the same page(s). One of the mappings is registered with userfaultfd (in minor mode), and the other is not. Via the non-UFFD mapping, the underlying pages have already been allocated & filled with some contents. The UFFD mapping has not yet been faulted in; when it is touched for the first time, this results in what I'm calling a "minor" fault. As a concrete example, when working with hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing page. This commit adds the new registration mode, and sets the relevant flag on the VMAs being registered. In the hugetlb fault path, if we find that we have huge_pte_none(), but find_lock_page() does indeed find an existing page, then we have a "minor" fault, and if the VMA has the userfaultfd registration flag, we call into userfaultfd to handle it. This is implemented as a new registration mode, instead of an API feature. This is because the alternative implementation has significant drawbacks [1]. However, doing it this was requires we allocate a VM_* flag for the new registration mode. On 32-bit systems, there are no unused bits, so this feature is only supported on architectures with CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in MINOR mode on 32-bit architectures, we return -EINVAL. [1] https://lore.kernel.org/patchwork/patch/1380226/ [peterx@redhat.com: fix minor fault page leak] Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Michel Lespinasse <walken@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Shawn Anastasio <shawn@anastas.io> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Steven Price <steven.price@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Adam Ruprecht <ruprecht@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Cannon Matthews <cannonmatthews@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>