commit e641e014278f6144faed0bf14fe79be0b220c2ef
Author: Alexandre Frade
Date: Wed Jan 12 23:54:08 2022 +0000

Linux 5.16.0-xanmod1

Signed-off-by: Alexandre Frade

commit f83f4f4e20d2492746317ff64eb061107494fd3c
Author: graysky
Date: Tue Sep 14 15:35:34 2021 -0400

x86/kconfig: more uarches for kernel 5.15+

FEATURES

This patch adds additional CPU options to the Linux kernel, accessible under:
Processor type and features ---> Processor family --->

With the release of gcc 11.1 and clang 12.0, several generic 64-bit levels are offered which are good for supported Intel or AMD CPUs:
• x86-64-v2
• x86-64-v3
• x86-64-v4

Users of glibc 2.33 and above can see which level is supported by the current hardware by running:
/lib/ld-linux-x86-64.so.2 --help | grep supported
Alternatively, compare the flags from /proc/cpuinfo to this list.[1]

CPU-specific microarchitectures include:
• AMD Improved K8-family
• AMD K10-family
• AMD Family 10h (Barcelona)
• AMD Family 14h (Bobcat)
• AMD Family 16h (Jaguar)
• AMD Family 15h (Bulldozer)
• AMD Family 15h (Piledriver)
• AMD Family 15h (Steamroller)
• AMD Family 15h (Excavator)
• AMD Family 17h (Zen)
• AMD Family 17h (Zen 2)
• AMD Family 19h (Zen 3)†
• Intel Silvermont low-power processors
• Intel Goldmont low-power processors (Apollo Lake and Denverton)
• Intel Goldmont Plus low-power processors (Gemini Lake)
• Intel 1st Gen Core i3/i5/i7 (Nehalem)
• Intel 1.5 Gen Core i3/i5/i7 (Westmere)
• Intel 2nd Gen Core i3/i5/i7 (Sandybridge)
• Intel 3rd Gen Core i3/i5/i7 (Ivybridge)
• Intel 4th Gen Core i3/i5/i7 (Haswell)
• Intel 5th Gen Core i3/i5/i7 (Broadwell)
• Intel 6th Gen Core i3/i5/i7 (Skylake)
• Intel 6th Gen Core i7/i9 (Skylake X)
• Intel 8th Gen Core i3/i5/i7 (Cannon Lake)
• Intel 10th Gen Core i7/i9 (Ice Lake)
• Intel Xeon (Cascade Lake)
• Intel Xeon (Cooper Lake)*
• Intel 3rd Gen 10nm++ i3/i5/i7/i9-family (Tiger Lake)*
• Intel 3rd Gen 10nm++ Xeon (Sapphire Rapids)‡
• Intel 11th Gen i3/i5/i7/i9-family (Rocket Lake)‡
• Intel 12th Gen i3/i5/i7/i9-family (Alder Lake)‡

Notes: If not otherwise noted, gcc >=9.1 is required for support.
*Requires gcc >=10.1 or clang >=10.0
†Requires gcc >=10.3 or clang >=12.0
‡Requires gcc >=11.1 or clang >=12.0

It also offers to compile passing the 'native' option which, "selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine. Using -march=native enables all instruction subsets supported by the local machine and will produce code optimized for the local machine under the constraints of the selected instruction set."[2]

Users of Intel CPUs should select the 'Intel-Native' option and users of AMD CPUs should select the 'AMD-Native' option.

MINOR NOTES RELATING TO INTEL ATOM PROCESSORS

This patch also changes -march=atom to -march=bonnell in accordance with the gcc v4.9 changes. Upstream is using the deprecated -march=atom flag when I believe it should use the newer -march=bonnell flag for Atom processors.[3]

It is not recommended to compile on Atom CPUs with the 'native' option.[4] The recommendation is to use the 'atom' option instead.

BENEFITS

Small but real speed increases are measurable using a make endpoint comparing a generic kernel to one built with one of the respective microarchs. See the following experimental evidence supporting this statement: https://github.com/graysky2/kernel_gcc_patch

REQUIREMENTS

linux version >=5.15
gcc version >=9.0 or clang version >=9.0

ACKNOWLEDGMENTS

This patch builds on the seminal work by Jeroen.[5]

REFERENCES

1. https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/77566eb03bc6a326811cb7e9
2. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-x86-Options
3. https://bugzilla.kernel.org/show_bug.cgi?id=77461
4. https://github.com/graysky2/kernel_gcc_patch/issues/15
5.
http://www.linuxforge.net/docs/linux/linux-gcc.php

Signed-off-by: graysky

commit dfa0270cbaa1678e908061412b44a01bdc5bd28c
Author: Zebediah Figura
Date: Thu Nov 4 17:15:38 2021 +0000

winesync: Introduce the winesync driver and character device

Rebased-by: Tk-Glitch
Rebased-by: Alexandre Frade
Signed-off-by: Alexandre Frade

commit 79f9f08c2b2b5ea30d617167ca0d40b593622855
Author: Serge Hallyn
Date: Fri May 31 19:12:12 2013 +0100

sysctl: add sysctl to disallow unprivileged CLONE_NEWUSER by default

This is a short-term patch. Unprivileged use of CLONE_NEWUSER is certainly an intended feature of user namespaces. However, for at least saucy we want to make sure that, if any security issues are found, we have a fail-safe.

Signed-off-by: Serge Hallyn
[bwh: Remove unneeded binary sysctl bits]
[bwh: Keep this sysctl, but change the default to enabled]

commit 8a357c256342bd41dda20108cba17d54d9a62461
Author: Christian Brauner
Date: Wed Jan 23 21:54:23 2019 +0100

SAUCE: binder: give binder_alloc its own debug mask file

Currently binder.c and binder_alloc.c both register the /sys/module/binder_linux/parameters/debug_mask file, which leads to conflicts in sysfs. This commit gives binder_alloc.c its own /sys/module/binder_linux/parameters/alloc_debug_mask file.

Signed-off-by: Christian Brauner
Signed-off-by: Seth Forshee

commit e44ebc9b1b0b33b9fe2248a488479e3ace470ea3
Author: Christian Brauner
Date: Wed Jan 16 23:13:25 2019 +0100

SAUCE: binder: turn into module

The Android binder driver needs to become a module for the sake of shipping Anbox.
To do this we need to export the following functions since binder is currently still using them:
- security_binder_set_context_mgr()
- security_binder_transaction()
- security_binder_transfer_binder()
- security_binder_transfer_file()
- can_nice()
- __wake_up_pollfree()
- __close_fd_get_file()
- mmput_async()
- task_work_add()
- map_kernel_range_noflush()
- get_vm_area()
- zap_page_range()
- put_ipc_ns()
- get_ipc_ns_exported()
- show_init_ipc_ns()

Rebased-by: Alexandre Frade
Signed-off-by: Christian Brauner
[ saf: fix additional reference to init_ipc_ns from 5.0-rc6 ]
Signed-off-by: Seth Forshee
Signed-off-by: Alexandre Frade

commit 8ddaada2d676852853db4555b27c9ef7f9ad2ba5
Author: Christian Brauner
Date: Wed Jun 20 19:21:37 2018 +0200

SAUCE: ashmem: turn into module

The Android ashmem driver needs to become a module for the sake of Anbox. To do this we need to export shmem_zero_setup() since ashmem is currently using it. Note, the abomination that is the Android ashmem driver will go away in the not so distant future in favour of memfds.

Signed-off-by: Christian Brauner
Signed-off-by: Seth Forshee

commit 039f212ae30995afc7974b1d1ed9905264c7fa57
Author: Arjan van de Ven
Date: Wed May 17 01:52:11 2017 +0000

init: wait for partition and retry scan

As Clear Linux boots fast, the device is not ready when the mounting code is reached, so a retry device scan will be performed every 0.5 sec for at least 40 sec and synchronize the async task.

Signed-off-by: Miguel Bernal Marin

commit 36c62c812e748ad3b1b3333d3928a578e6c14a2d
Author: Arjan van de Ven
Date: Thu Jun 2 23:36:32 2016 -0500

drivers: initialize ata before graphics

ATA init is the long pole in the boot process, and it's asynchronous.
Move the graphics init after it so that ATA and graphics initialize in parallel.

commit 0047359f551468633c95a851ad99754cd35955cc
Author: Arjan van de Ven
Date: Sun Feb 18 23:35:41 2018 +0000

locking: rwsem: spin faster

Tweak rwsem owner spinning a bit.

Signed-off-by: Alexandre Frade

commit 9ce5b74ea37c43e78cb099f02e01d22430445330
Author: William Douglas
Date: Wed Jun 20 17:23:21 2018 +0000

firmware: Enable stateless firmware loading

Prefer the order of specific version before generic, and /etc before /lib, to enable the user to give specific overrides for generic firmware and distribution firmware.

commit de4f4a54ac9b5eab572c1341fb834628363a8f40
Author: Arjan van de Ven
Date: Sun Sep 22 11:12:35 2019 -0300

intel_rapl: Silence rapl trace debug

commit 968d9525eb6f1cf6d6abc0afb61e41ac357d5955
Author: Mark Weiman
Date: Sun Aug 12 11:36:21 2018 -0400

pci: Enable overrides for missing ACS capabilities

This is an updated version of Alex Williamson's patch from: https://lkml.org/lkml/2013/5/30/513

Original commit message follows:

PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that allows us to control whether transactions are allowed to be redirected in various subnodes of a PCIe topology. For instance, if two endpoints are below a root port or downstream switch port, the downstream port may optionally redirect transactions between the devices, bypassing upstream devices. The same can happen internally on multifunction devices. The transaction may never be visible to the upstream devices.

One upstream device that we particularly care about is the IOMMU. If a redirection occurs in the topology below the IOMMU, then the IOMMU cannot provide isolation between devices. This is why the PCIe spec encourages topologies to include ACS support. Without it, we have to assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.

Unfortunately, far too many topologies do not support ACS to make this a steadfast requirement.
Even the latest chipsets from Intel are only sporadically supporting ACS. We have trouble getting interconnect vendors to include the PCIe spec required PCIe capability, let alone suggested features. Therefore, we need to add some flexibility. The pcie_acs_override= boot option lets users opt-in specific devices or sets of devices to assume ACS support. The "downstream" option assumes full ACS support on root ports and downstream switch ports. The "multifunction" option assumes the subset of ACS features available on multifunction endpoints and upstream switch ports are supported. The "id:nnnn:nnnn" option enables ACS support on devices matching the provided vendor and device IDs, allowing more strategic ACS overrides. These options may be combined in any order. A maximum of 16 id specific overrides are available. It's suggested to use the most limited set of options necessary to avoid completely disabling ACS across the topology. Note to hardware vendors, we have facilities to permanently quirk specific devices which enforce isolation but not provide an ACS capability. Please contact me to have your devices added and save your customers the hassle of this boot option. 
Rebased-by: Alexandre Frade
Signed-off-by: Mark Weiman
Signed-off-by: Alexandre Frade

commit 989ec85f9ed607a294a80c6823d647d153515db2
Author: Alexandre Frade
Date: Thu Oct 7 14:09:55 2021 +0000

i2c: busses: Add SMBus capability to work with OpenRGB driver control

Signed-off-by: Alexandre Frade

commit 37c65fd2b0504b517b4460be6e1cabefe58e02e2
Author: Deren Wu
Date: Sun Nov 14 10:46:57 2021 +0800

mt76: mt7921: add support for PCIe ID 0x0608/0x0616

New mt7921 series chip support.

Signed-off-by: Deren Wu

commit 08e0b9854c04c080bbc6d0a7496003454a8bd655
Author: Alexandre Frade
Date: Wed Dec 8 11:55:28 2021 +0000

netfilter: Add full cone NAT support

Link: https://github.com/llccd/netfilter-full-cone-nat

Signed-off-by: Alexandre Frade

commit 9463259810f512aa60a8b64a67e44f9eb940ea4d
Author: Huang Rui
Date: Thu Jan 6 15:43:06 2022 +0800

x86, sched: Fix the undefined reference building error of init_freq_invariance_cppc

The init_freq_invariance_cppc function is implemented in smpboot and depends on CONFIG_SMP.

  MODPOST vmlinux.symvers
  MODINFO modules.builtin.modinfo
  GEN     modules.builtin
  LD      .tmp_vmlinux.kallsyms1
ld: drivers/acpi/cppc_acpi.o: in function `acpi_cppc_processor_probe':
/home/ray/brahma3/linux/drivers/acpi/cppc_acpi.c:819: undefined reference to `init_freq_invariance_cppc'
make: *** [Makefile:1161: vmlinux] Error 1

See https://lore.kernel.org/lkml/484af487-7511-647e-5c5b-33d4429acdec@infradead.org/.

Fixes: 41ea667227ba ("x86, sched: Calculate frequency invariance for AMD systems")
Reported-by: kernel test robot
Reported-by: Randy Dunlap
Reported-by: Stephen Rothwell
Signed-off-by: Huang Rui
Cc: Rafael J. Wysocki
Cc: Borislav Petkov
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: x86@kernel.org
Cc: stable@vger.kernel.org

commit 55cb4113ae2bb4cc2b7a201291c4511ea35fc447
Author: Huang Rui
Date: Thu Jan 6 15:43:05 2022 +0800

cpufreq: amd-pstate: Fix the dependence issue of AMD P-State

The AMD P-State driver is based on the ACPI CPPC function, so ACPI should be a dependency of this driver in the kernel config.

In file included from ../drivers/cpufreq/amd-pstate.c:40:0:
../include/acpi/processor.h:226:2: error: unknown type name ‘phys_cpuid_t’
  phys_cpuid_t phys_id; /* CPU hardware ID such as APIC ID for x86 */
../include/acpi/processor.h:355:1: error: unknown type name ‘phys_cpuid_t’; did you mean ‘phys_addr_t’?
  phys_cpuid_t acpi_get_phys_id(acpi_handle, int type, u32 acpi_id);
../include/acpi/processor.h:356:1: error: unknown type name ‘phys_cpuid_t’; did you mean ‘phys_addr_t’?
  phys_cpuid_t acpi_map_madt_entry(u32 acpi_id);
../include/acpi/processor.h:357:20: error: unknown type name ‘phys_cpuid_t’; did you mean ‘phys_addr_t’?
  int acpi_map_cpuid(phys_cpuid_t phys_id, u32 acpi_id);

See https://lore.kernel.org/lkml/20e286d4-25d7-fb6e-31a1-4349c805aae3@infradead.org/.

Reported-by: Randy Dunlap
Reported-by: Stephen Rothwell
Signed-off-by: Huang Rui

commit 632f0bcb2fac314d5a8c35757d6efb51a8c32f60
Author: Huang Rui
Date: Fri Dec 24 09:05:08 2021 +0800

MAINTAINERS: Add AMD P-State driver maintainer entry

I will continue to add new feature and processor support, optimize the performance, and handle the issues for the AMD P-State driver.

Signed-off-by: Huang Rui

commit 89af400445a9b0c636976ca813c7f9e39896b9e3
Author: Huang Rui
Date: Fri Dec 24 09:05:07 2021 +0800

Documentation: amd-pstate: Add AMD P-State driver introduction

Introduce the AMD P-State driver design and implementation.
Signed-off-by: Huang Rui

commit f250f45f86134f7ac1bd2893194eac063e6a1772
Author: Huang Rui
Date: Fri Dec 24 09:05:06 2021 +0800

cpufreq: amd-pstate: Add AMD P-State performance attributes

Introduce sysfs attributes to read the different AMD P-State performance levels.

Signed-off-by: Huang Rui

commit 76026702049aa265e904c41bd70d13cea5450cc2
Author: Huang Rui
Date: Fri Dec 24 09:05:05 2021 +0800

cpufreq: amd-pstate: Add AMD P-State frequencies attributes

Introduce sysfs attributes to read the different processor frequency levels.

Signed-off-by: Huang Rui

commit 36ec7e5e5ff8c6fd2f694cf149b7b03261a5d187
Author: Huang Rui
Date: Fri Dec 24 09:05:04 2021 +0800

cpufreq: amd-pstate: Add boost mode support for AMD P-State

If the SBIOS supports the boost mode of AMD P-State, switch to boost enabled by default.

Signed-off-by: Huang Rui

commit 949edd580676ef0412b28dccf040a8877360452e
Author: Huang Rui
Date: Fri Dec 24 09:05:03 2021 +0800

cpufreq: amd-pstate: Add trace for AMD P-State module

Add a trace event to monitor the performance value changes controlled by the CPU governors.

Signed-off-by: Huang Rui

commit 3701edd92882f976f261d3d5ed1ae2e86b63a0ce
Author: Huang Rui
Date: Fri Dec 24 09:05:02 2021 +0800

cpufreq: amd-pstate: Introduce the support for the processors with shared memory solution

Some Zen2 and Zen3 based processors use shared memory exposed by the ACPI SBIOS. These processors have no MSR support, so we add the ACPI CPPC function as the backend for them. A module parameter (shared_mem) enables the related processors manually. We will enable this by default once we address the performance issue with this solution.
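The MSR-vs-shared-memory split described above can be modeled as a backend selection. The following is a hedged userspace C sketch under stated assumptions: pstate_backend, select_backend, and both fake register stores are illustrative names, not the driver's actual structures; only the selection logic (full MSR when the CPPC feature flag is present, opt-in shared memory otherwise) follows the commit message.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical backend vtable: the real driver chooses between a direct
 * MSR path and an ACPI CPPC shared-memory path; names are illustrative. */
struct pstate_backend {
    uint64_t (*read_perf)(void);
    void (*write_perf)(uint64_t val);
};

static uint64_t fake_msr;    /* stand-in for an MSR register */
static uint64_t msr_read_perf(void) { return fake_msr; }
static void msr_write_perf(uint64_t v) { fake_msr = v; }

static uint64_t fake_shmem;  /* stand-in for the SBIOS shared-memory region */
static uint64_t shmem_read_perf(void) { return fake_shmem; }
static void shmem_write_perf(uint64_t v) { fake_shmem = v; }

static const struct pstate_backend msr_backend = { msr_read_perf, msr_write_perf };
static const struct pstate_backend shmem_backend = { shmem_read_perf, shmem_write_perf };

/* Pick a backend from a CPPC-in-MSR capability flag, mirroring the way
 * the driver keys off X86_FEATURE_CPPC and the shared_mem module param. */
static const struct pstate_backend *select_backend(int has_cppc_msr, int shared_mem)
{
    if (has_cppc_msr)
        return &msr_backend;             /* full MSR support */
    return shared_mem ? &shmem_backend : NULL;  /* shared memory is opt-in */
}
```

The function-pointer table keeps the fast path free of repeated capability checks, which is the usual kernel idiom for this kind of split.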
Signed-off-by: Jinzhou Su
Signed-off-by: Huang Rui

commit 8bca5bff7205475c000d310be17b6b4ae8efd37e
Author: Huang Rui
Date: Fri Dec 24 09:05:01 2021 +0800

cpufreq: amd-pstate: Add fast switch function for AMD P-State

Introduce the fast switch function for AMD P-State on AMD processors that support full MSR register control. It decreases latency in interrupt context.

Signed-off-by: Huang Rui

commit 6a3addfce915a50f177693dcc487d1423ab03de7
Author: Huang Rui
Date: Fri Dec 24 09:05:00 2021 +0800

cpufreq: amd-pstate: Introduce a new AMD P-State driver to support future processors

AMD P-State is the AMD CPU performance scaling driver that introduces a new CPU frequency control mechanism for AMD Zen based CPU series in the Linux kernel. The new mechanism is based on Collaborative Processor Performance Control (CPPC), which provides finer-grained frequency management than legacy ACPI hardware P-States. Current AMD CPU platforms use the ACPI P-states driver to manage CPU frequency and clocks, switching among only three P-states. AMD P-State replaces the ACPI P-states control and provides a flexible, low-latency interface for the Linux kernel to directly communicate performance hints to the hardware.

AMD P-State leverages the Linux kernel governors such as *schedutil*, *ondemand*, etc. to manage the performance hints which are provided by the CPPC hardware functionality. The first version of AMD P-State supports one of the Zen3 processors, and we will support more in the future after we verify the hardware and SBIOS functionalities.

There are two types of hardware implementations for AMD P-State: one is full MSR support and the other is shared memory support. The X86_FEATURE_CPPC feature flag can be used to distinguish between the two types.

Using the new AMD P-State method plus kernel governors (*schedutil*, *ondemand*, ...)
to manage the frequency update is the most appropriate bridge between AMD Zen based processors and the Linux kernel; the processor can adjust to the most efficient frequency according to the kernel scheduler load.

Please check the detailed CPU feature and MSR register descriptions in the Processor Programming Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors: https://www.amd.com/system/files/TechDocs/56569-A1-PUB.zip

Signed-off-by: Huang Rui

commit 4da27c1163145643bdb002dd9b821d4803232700
Author: Jinzhou Su
Date: Fri Dec 24 09:04:59 2021 +0800

ACPI: CPPC: Add CPPC enable register function

Add a new function to enable the CPPC feature. This function writes the Continuous Performance Control package EnableRegister field on the processor.

The CPPC EnableRegister is described in section 8.4.7.1 of ACPI 6.4: This element is optional. If supported, it contains a resource descriptor with a single Register() descriptor that describes a register to which OSPM writes a One to enable CPPC on this processor. Before this register is set, the processor will be controlled by legacy mechanisms (ACPI P-states, firmware, etc.).

This register will be used by AMD processors to enable the AMD P-State function instead of legacy ACPI P-States.

Signed-off-by: Jinzhou Su
Signed-off-by: Huang Rui

commit 4821365415ebb57161757ac5d0c196b0344b8d5f
Author: Mario Limonciello
Date: Fri Dec 24 09:04:58 2021 +0800

ACPI: CPPC: Check present CPUs for determining _CPC is valid

As this is a static check, it should be based upon what is currently present on the system. This makes probing more deterministic.

When the local APIC flags field (lapic_flags) of a CPU core in the MADT table is 0, that core won't be enabled. In that case _CPC won't be found for the core, and walking through possible CPUs (including disabled ones) reports _CPC as invalid. This is not expected, so switch to checking present CPUs instead.
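The present-vs-possible distinction above can be captured in a small model. This is a hedged userspace C sketch, not kernel code: struct cpu and both helpers are illustrative, standing in for for_each_possible_cpu/for_each_present_cpu iteration over ACPI data.

```c
/* Simplified model: a CPU may be "possible" (listed in the MADT) without
 * being "present" (its lapic_flags enable bit is 0). A _CPC validity scan
 * over possible CPUs wrongly includes disabled cores that have no _CPC;
 * a scan over present CPUs does not. */
struct cpu {
    int possible;  /* listed in MADT */
    int present;   /* lapic_flags says enabled */
    int has_cpc;   /* a _CPC object exists for this core */
};

/* The old check: every possible CPU must expose _CPC. */
static int all_possible_have_cpc(const struct cpu *cpus, int n)
{
    for (int i = 0; i < n; i++)
        if (cpus[i].possible && !cpus[i].has_cpc)
            return 0;
    return 1;
}

/* The fixed check: only present CPUs need to expose _CPC. */
static int all_present_have_cpc(const struct cpu *cpus, int n)
{
    for (int i = 0; i < n; i++)
        if (cpus[i].present && !cpus[i].has_cpc)
            return 0;
    return 1;
}
```

With one disabled-but-possible core lacking _CPC, the old check declares _CPC invalid while the fixed check correctly accepts the system.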
Reported-by: Jinzhou Su Signed-off-by: Mario Limonciello Signed-off-by: Huang Rui commit ea3b8f094eaf46a8862d845a626c23c0ed3a80ec Author: Steven Noonan Date: Fri Dec 24 09:04:57 2021 +0800 ACPI: CPPC: Implement support for SystemIO registers According to the ACPI v6.2 (and later) specification, SystemIO can be used for _CPC registers. This teaches cppc_acpi how to handle such registers. This patch was tested using the amd_pstate driver on my Zephyrus G15 (model GA503QS) using the current version 410 BIOS, which uses a SystemIO register for the HighestPerformance element in _CPC. Signed-off-by: Steven Noonan Signed-off-by: Huang Rui commit d7ec32945cf92feb0d70798880c0cc85199329aa Author: Huang Rui Date: Fri Dec 24 09:04:56 2021 +0800 x86/msr: Add AMD CPPC MSR definitions AMD CPPC (Collaborative Processor Performance Control) function uses MSR registers to manage the performance hints. So add the MSR register macro here. Signed-off-by: Huang Rui Acked-by: Borislav Petkov commit 8eb61010f2fddbf5888bae765ad5d2a447c2f479 Author: Huang Rui Date: Fri Dec 24 09:04:55 2021 +0800 x86/cpufeatures: Add AMD Collaborative Processor Performance Control feature flag Add Collaborative Processor Performance Control feature flag for AMD processors. This feature flag will be used on the following AMD P-State driver. The AMD P-State driver has two approaches to implement the frequency control behavior. That depends on the CPU hardware implementation. One is "Full MSR Support" and another is "Shared Memory Support". The feature flag indicates the current processors with "Full MSR Support". Acked-by: Borislav Petkov Signed-off-by: Huang Rui commit 85f1e1442b2c72a4b1b1bd467e93fce11b5b65d8 Author: Eric Dumazet Date: Thu Nov 18 09:52:39 2021 -0800 x86/csum: Fix compilation error for UM load_unaligned_zeropad() is not yet universal. ARCH=um SUBARCH=x86_64 builds do not have it. When CONFIG_DCACHE_WORD_ACCESS is not set, simply continue the bisection with 4, 2 and 1 byte steps. 
Fixes: df4554cebdaa ("x86/csum: Rewrite/optimize csum_partial()")
Reported-by: kernel test robot
Signed-off-by: Eric Dumazet
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20211118175239.1525650-1-eric.dumazet@gmail.com

commit b0d8e91932a1bec7cda2fcac70c1c2bc37e5cf66
Author: Eric Dumazet
Date: Fri Nov 12 08:19:50 2021 -0800

x86/csum: Rewrite/optimize csum_partial()

With more NICs supporting CHECKSUM_COMPLETE and IPv6 being widely used, csum_partial() is heavily called with small numbers of bytes and consumes many cycles. The IPv6 header size, for instance, is 40 bytes.

Another thing to consider is that NET_IP_ALIGN is 0 on x86, meaning that network headers are not word-aligned unless the driver forces this.

This means that csum_partial() fetches one u16 to 'align the buffer', then performs three u64 additions with carry in a loop, then a remaining u32, then a remaining u16.

With this new version, we perform a loop only for the 64 byte blocks, then the remaining is bisected.
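The bisection idea can be sketched in portable C. This is a hedged userspace model, not the kernel's asm-optimized implementation: csum_partial_sketch and csum_naive are illustrative names, 32-bit adds stand in for the kernel's add-with-carry u64 chains, and a little-endian host is assumed. A straightforward 16-bit reference sum is included so the sketch can be checked against it.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Fold an accumulator down to a 16-bit ones'-complement sum. */
static uint16_t csum_fold(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Bisected partial checksum: one loop for 64-byte blocks, then the
 * remaining length is halved (32, 16, 8, 4) instead of byte-stepping,
 * finishing with at most one u16 and one byte. */
static uint16_t csum_partial_sketch(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;
    uint32_t w;

    while (len >= 64) {                    /* main loop: 16 x u32 per block */
        for (int i = 0; i < 16; i++) {
            memcpy(&w, buf + 4 * i, 4);    /* memcpy tolerates misalignment */
            sum += w;
        }
        buf += 64;
        len -= 64;
    }
    for (size_t step = 32; step >= 4; step /= 2) {  /* bisect the tail */
        if (len >= step) {
            for (size_t off = 0; off < step; off += 4) {
                memcpy(&w, buf + off, 4);
                sum += w;
            }
            buf += step;
            len -= step;
        }
    }
    if (len >= 2) {                        /* final u16, if any */
        uint16_t h;
        memcpy(&h, buf, 2);
        sum += h;
        buf += 2;
        len -= 2;
    }
    if (len)                               /* final odd byte (little-endian) */
        sum += *buf;
    return csum_fold(sum);
}

/* Reference: plain 16-bit word sum, for checking the sketch. */
static uint16_t csum_naive(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;

    while (len >= 2) {
        uint16_t h;
        memcpy(&h, buf, 2);
        sum += h;
        buf += 2;
        len -= 2;
    }
    if (len)
        sum += *buf;
    return csum_fold(sum);
}
```

Because ones'-complement addition is associative across word splits, summing u32 words and folding yields the same 16-bit result as summing u16 words, which is what makes the wide main loop legal.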
Tested on various CPUs; all of them show a big reduction in csum_partial() cost (by 50 to 80%).

Before: 4.16% [kernel] [k] csum_partial
After: 0.83% [kernel] [k] csum_partial

If run in a loop 1,000,000 times:

Before:
26,922,913 cycles
80,302,961 instructions # 2.98 insn per cycle
21,059,816 branches
2,896 branch-misses # 0.01% of all branches

After:
17,960,709 cycles
41,292,805 instructions # 2.30 insn per cycle
11,058,119 branches
2,997 branch-misses # 0.03% of all branches

Signed-off-by: Eric Dumazet
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Alexander Duyck
Link: https://lore.kernel.org/r/20211112161950.528886-1-eric.dumazet@gmail.com

commit 5cbf462e52947c696e25053ba9b799fb4f3ba9ca
Author: Eric Dumazet
Date: Mon Nov 15 11:02:49 2021 -0800

net: move early demux fields close to sk_refcnt

sk_rx_dst/sk_rx_dst_ifindex/sk_rx_dst_cookie are read in early demux and currently span two cache lines. Moving them close to sk_refcnt makes more sense, as only one cache line is needed.

New layout for this hot cache line is:

struct sock {
	struct sock_common __sk_common;   /*     0  0x88 */
	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
	struct dst_entry * sk_rx_dst;     /*  0x88   0x8 */
	int sk_rx_dst_ifindex;            /*  0x90   0x4 */
	u32 sk_rx_dst_cookie;             /*  0x94   0x4 */
	socket_lock_t sk_lock;            /*  0x98  0x20 */
	atomic_t sk_drops;                /*  0xb8   0x4 */
	int sk_rcvlowat;                  /*  0xbc   0x4 */
	/* --- cacheline 3 boundary (192 bytes) --- */

Signed-off-by: Eric Dumazet

commit 2c0abf22886b1a84a8f8ec1d880f22351eed9b0d
Author: Eric Dumazet
Date: Mon Nov 15 11:02:48 2021 -0800

tcp: do not call tcp_cleanup_rbuf() if we have a backlog

Under pressure, tcp recvmsg() has logic to process the socket backlog, but calls tcp_cleanup_rbuf() right before. Avoiding sending an ACK right before processing new segments makes a lot of sense, as this decreases the number of ACK packets, with no impact on effective ACK clocking.
Signed-off-by: Eric Dumazet

commit 6c55b2052590896513f92958478ec2ff390733ad
Author: Eric Dumazet
Date: Mon Nov 15 11:02:47 2021 -0800

tcp: check local var (timeo) before socket fields in one test

Testing timeo before sk_err/sk_state/sk_shutdown makes more sense. Modern applications use non-blocking IO, while a socket is terminated only once during its lifetime.

Signed-off-by: Eric Dumazet

commit 8a307a7acb16b2dae9ebe692ff6c13817fe9ba70
Author: Eric Dumazet
Date: Mon Nov 15 11:02:46 2021 -0800

tcp: defer skb freeing after socket lock is released

tcp recvmsg() (or rx zerocopy) spends a fair amount of time freeing skbs after their payload has been consumed. A typical ~64KB GRO packet has to release ~45 page references, eventually going to the page allocator for each of them.

Currently this freeing is performed while the socket lock is held, meaning that there is a high chance the BH handler has to queue incoming packets to the tcp socket backlog. This can cause additional latencies, because the user thread has to process the backlog at release_sock() time, and while doing so, additional frames can be added by the BH handler.

This patch adds logic to defer these frees after the socket lock is released, or directly from the BH handler if possible. Being able to free these skbs from the BH handler helps a lot, because it avoids the usual alloc/free asymmetry when the BH handler and user thread do not run on the same cpu or NUMA node.
One cpu can now be fully utilized for the kernel->user copy, and another cpu handles BH processing and skb/page allocs/frees (assuming RFS is not forcing use of a single CPU).

Tested: 100Gbit NIC, max throughput for one TCP_STREAM flow, over 10 runs.

MTU 1500: Before 55 Gbit, After 66 Gbit
MTU 4096+(headers): Before 82 Gbit, After 95 Gbit

Signed-off-by: Eric Dumazet

commit bb1c0c8c5e9a71b53002401e6d8ff303b18e915a
Author: Eric Dumazet
Date: Mon Nov 15 11:02:45 2021 -0800

tcp: avoid indirect calls to sock_rfree

TCP uses sk_eat_skb() when skbs can be removed from the receive queue. However, the call to skb_orphan() from __kfree_skb() incurs an indirect call to sock_rfree(), which is more expensive than a direct call, especially for CONFIG_RETPOLINE=y.

Add a tcp_eat_recv_skb() function to make the call before __kfree_skb().

Signed-off-by: Eric Dumazet

commit 3b47a3af97b99eae5facd29f3070f33703dc3ec3
Author: Eric Dumazet
Date: Mon Nov 15 11:02:44 2021 -0800

tcp: tp->urg_data is unlikely to be set

Use some unlikely() hints in the fast path.

Signed-off-by: Eric Dumazet

commit 00d8db65fe8165059cf023a285474a5ea27c5287
Author: Eric Dumazet
Date: Mon Nov 15 11:02:43 2021 -0800

tcp: annotate races around tp->urg_data

tcp_poll() and tcp_ioctl() are reading tp->urg_data without the socket lock owned. Also, it is faster to first check tp->urg_data in tcp_poll(), then tp->urg_seq == tp->copied_seq, because tp->urg_seq is located in a different/cold cache line.

Signed-off-by: Eric Dumazet

commit 3c4bc0be566c44077cdd15ad4e39df3b16d7d9d4
Author: Eric Dumazet
Date: Mon Nov 15 11:02:42 2021 -0800

tcp: annotate data-races on tp->segs_in and tp->data_segs_in

tcp_segs_in() can be called from BH while the socket spinlock is held but the socket is owned by user, with tcp_get_info() eventually reading these fields. Found by code inspection; no need to backport this patch to older kernels.
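The annotation pattern behind these race fixes can be shown in userspace C. This is a simplified sketch: the READ_ONCE/WRITE_ONCE macros below are stand-ins for the kernel's (which have more machinery), and tcp_sock_sketch is an illustrative subset, not the real struct tcp_sock.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel's annotation macros: a single
 * volatile access keeps the compiler from tearing, duplicating, or
 * re-reading the value in a lockless race. */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

struct tcp_sock_sketch {   /* illustrative subset of the annotated fields */
    uint32_t urg_data;
    uint32_t segs_in;
};

/* Writer side (segment ingest, possibly from BH): annotated update. */
static void tcp_segs_in_sketch(struct tcp_sock_sketch *tp, uint32_t segs)
{
    WRITE_ONCE(tp->segs_in, tp->segs_in + segs);
}

/* Reader side (e.g. tcp_poll()/tcp_get_info() without the socket lock):
 * annotated lockless read. */
static uint32_t tcp_read_urg_sketch(struct tcp_sock_sketch *tp)
{
    return READ_ONCE(tp->urg_data);
}
```

The annotations do not add ordering or atomicity beyond single-access; they document the race and constrain the compiler, which is exactly what these data-race commits need.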
Signed-off-by: Eric Dumazet commit 14e47c2467182abfe785e1f842d342fac7758abd Author: Eric Dumazet Date: Mon Nov 15 11:02:41 2021 -0800 tcp: add RETPOLINE mitigation to sk_backlog_rcv Use INDIRECT_CALL_INET() to avoid an indirect call when/if CONFIG_RETPOLINE=y Signed-off-by: Eric Dumazet commit 06dcd20283c886976257bb99ce38ba89044d4ebc Author: Eric Dumazet Date: Mon Nov 15 11:02:40 2021 -0800 tcp: small optimization in tcp recvmsg() When reading large chunks of data, incoming packets might be added to the backlog from BH. tcp recvmsg() detects the backlog queue is not empty, and uses a release_sock()/lock_sock() pair to process this backlog. We now have __sk_flush_backlog() to perform this a bit faster. Signed-off-by: Eric Dumazet commit 05091893496e16c9b38b647bd02ef882ac053b09 Author: Eric Dumazet Date: Mon Nov 15 11:02:39 2021 -0800 net: cache align tcp_memory_allocated, tcp_sockets_allocated tcp_memory_allocated and tcp_sockets_allocated often share a common cache line, source of false sharing. Also take care of udp_memory_allocated and mptcp_sockets_allocated. Signed-off-by: Eric Dumazet commit 44c254410781627e7db7f02be8a35e32644f8016 Author: Eric Dumazet Date: Mon Nov 15 11:02:38 2021 -0800 net: forward_alloc_get depends on CONFIG_MPTCP (struct proto)->sk_forward_alloc is currently only used by MPTCP. Signed-off-by: Eric Dumazet commit db7962b994e245716af6e2e6ab43aac937609254 Author: Eric Dumazet Date: Mon Nov 15 11:02:37 2021 -0800 net: shrink struct sock by 8 bytes Move sk_bind_phc next to sk_peer_lock to fill a hole. Signed-off-by: Eric Dumazet commit 47f7af169d29dc708243414366753968bec71283 Author: Eric Dumazet Date: Mon Nov 15 11:02:36 2021 -0800 ipv6: shrink struct ipcm6_cookie gso_size can be moved after tclass, to use an existing hole. 
(8 bytes saved on 64bit arches)

Signed-off-by: Eric Dumazet

commit c6fdfb7ab54b12ed489362df18af470ca4aa91ca
Author: Eric Dumazet
Date: Mon Nov 15 11:02:32 2021 -0800

tcp: small optimization in tcp_v6_send_check()

For TCP flows, inet6_sk(sk)->saddr has the same value as sk->sk_v6_rcv_saddr. Using sk->sk_v6_rcv_saddr increases data locality.

Signed-off-by: Eric Dumazet

commit 1b9d62d108711fac71d5714fd88fa9dfb7700b35
Author: Eric Dumazet
Date: Mon Nov 15 11:02:31 2021 -0800

tcp: remove dead code in __tcp_v6_send_check()

For some reason, I forgot to change __tcp_v6_send_check() at the same time I removed the (ip_summed == CHECKSUM_PARTIAL) check in __tcp_v4_send_check().

Fixes: 98be9b12096f ("tcp: remove dead code after CHECKSUM_PARTIAL adoption")
Signed-off-by: Eric Dumazet

commit 2aef2076bced854f82b4ab7af33c64e5bfcb1673
Author: Eric Dumazet
Date: Mon Nov 15 11:02:30 2021 -0800

tcp: minor optimization in tcp_add_backlog()

If a packet is going to be coalesced, the sk_sndbuf/sk_rcvbuf values are not used. Defer their access to the point where we need them.

Signed-off-by: Eric Dumazet

commit 6564991ef2b36b8d5c19f0ff6ccc1831625b0875
Author: Adithya Abraham Philip
Date: Fri Jun 11 21:56:10 2021 +0000

net-tcp_bbr: v2: Fix missing ECT markings on retransmits for BBRv2

Adds a new flag TCP_ECN_ECT_PERMANENT that is used by CCAs to indicate that retransmitted packets and pure ACKs must have the ECT bit set. This is a necessary fix for BBRv2, which when using ECN expects ECT to be set even on retransmitted packets and ACKs. Currently, CCAs like BBRv2 which can use ECN but don't "need" it have no way to indicate that ECT should be set on retransmissions/ACKs.
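The decision the new flag enables can be sketched as a small predicate. This is a hedged userspace C model: the flag values and tcp_set_ect_sketch are illustrative (only the TCP_ECN_ECT_PERMANENT name comes from the commit), not the kernel's actual ECN state machine.

```c
#include <stdint.h>

/* Illustrative ECN state flags; values are made up for this sketch. */
#define TCP_ECN_OK            0x1u  /* ECN negotiated on this connection */
#define TCP_ECN_ECT_PERMANENT 0x8u  /* CCA wants ECT even on rtx/pure ACKs */

/* Should this transmission carry the ECT codepoint? Conventionally,
 * retransmits and pure ACKs go out Not-ECT; a CCA that sets
 * TCP_ECN_ECT_PERMANENT (BBRv2 with ECN) wants ECT on those too. */
static int tcp_set_ect_sketch(uint32_t ecn_flags, int is_retransmit_or_pure_ack)
{
    if (!(ecn_flags & TCP_ECN_OK))
        return 0;                    /* no ECN on this connection at all */
    if (is_retransmit_or_pure_ack)
        return !!(ecn_flags & TCP_ECN_ECT_PERMANENT);
    return 1;                        /* ordinary data segment: ECT as usual */
}
```

This keeps the default conservative behavior for CCAs that never set the flag, while letting DCTCP-style users of ECN mark every packet.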
Signed-off-by: Adithya Abraham Philip Signed-off-by: Neal Cardwell commit 6de1c71a63ed6b9d5c0b571d198d222dbd23e1a3 Author: Neal Cardwell Date: Mon Dec 28 19:23:09 2020 -0500 net-tcp_bbr: v2: don't assume prior_cwnd was set entering CA_Loss Fix WARN_ON_ONCE() warnings that were firing and pointing to a bbr->prior_cwnd of 0 when exiting CA_Loss and transitioning to CA_Open. The issue was that tcp_simple_retransmit() calls: tcp_set_ca_state(sk, TCP_CA_Loss); without first calling icsk_ca_ops->ssthresh(sk) (because tcp_simple_retransmit() is dealing with losses due to MTU issues and not congestion). The lack of this callback means that BBR did not get a chance to set bbr->prior_cwnd, and thus upon exiting CA_Loss in such cases the WARN_ON_ONCE() would fire due to a zero bbr->prior_cwnd. This commit removes that warning, since a bbr->prior_cwnd of 0 is a valid situation in this state transition. For setting inflight_lo upon entering CA_Loss, to avoid setting an inflight_lo of 0 in this case, this commit switches to taking the max of cwnd and prior_cwnd. We plan to remove that line of code when we switch to cautious (PRR-style) recovery, so that awkwardness will go away. Change-Id: I575dce871c2f20e91e3e9449e1706f42a07b8118 commit 72a0d85b73045cde78d6288fe8ddb8516d1a6ff3 Author: Neal Cardwell Date: Mon Aug 17 19:10:21 2020 -0400 net-tcp_bbr: v2: remove cycle_rand parameter that is unused in BBRv2 Change-Id: Iee1df7e41e42de199068d7c89131ed3d228327c0 commit cc2d74bb89a9debd8a360e16c694469d4c1e6eda Author: Neal Cardwell Date: Mon Aug 17 19:08:41 2020 -0400 net-tcp_bbr: v2: remove field bw_rtts that is unused in BBRv2 Change-Id: I58e3346c707748a6f316f3ed060d2da84c32a79b commit b3ac5ca7c674fe79527f3cfbd553f8fbc57a96bf Author: Neal Cardwell Date: Thu Nov 21 15:28:01 2019 -0500 net-tcp_bbr: v2: remove unnecessary rs.delivered_ce logic upon loss There is no reason to compute rs.delivered_ce upon loss. In fact, we specifically do not want to compute rs.delivered_ce upon loss. 
Two issues: (1) This would be the wrong thing to do, in behavior terms. With RACK's dynamic reordering window, losses can be marked long after the sequence hole appears in the ACK/SACK stream. We want to catch the ECN mark rate rising too high as quickly as possible, which means we want to check for high ECN mark rates at ACK time (as BBRv2 currently does) and not at loss marking time. (2) This is dead code. The ECN mark rate cannot be detected as too high because the check needs rs->delivered to be > 0 as well: if (rs->delivered_ce > 0 && rs->delivered > 0 && Since we are not setting rs->delivered upon loss, this check cannot succeed, so setting delivered_ce is pointless. This dead and wrong line was discovered by Randall Stewart at Netflix as he was reading the BBRv2 code. Change-Id: I37f83f418a259ec31d8f82de986db071b364b76a commit ad956a6f603a75ee10e7fabe7bf25074c9be4668 Author: Neal Cardwell Date: Mon Jul 22 23:18:56 2019 -0400 net-tcp_bbr: v2: add a README.md for TCP BBR v2 alpha release Change-Id: I35a8c984e299d2af6e78c3d4b3aade5627678306 commit 8b19d651c9e9d1cded4984f6c691c720026eeaf9 Author: Neal Cardwell Date: Tue Jun 11 12:54:22 2019 -0400 net-tcp_bbr: v2: BBRv2 ("bbr2") congestion control for Linux TCP BBR v2 is an enhancement to the BBR v1 algorithm. It's designed to aim for lower queues, lower loss, and better Reno/CUBIC coexistence than BBR v1. BBR v2 maintains the core of BBR v1: an explicit model of the network path that is two-dimensional, adapting to estimate the (a) maximum available bandwidth and (b) maximum safe volume of data a flow can keep in-flight in the network. It maintains the estimated BDP as a core guide for estimating an appropriate level of in-flight data. BBR v2 makes several key enhancements: o Its bandwidth-probing time scale is adapted, within bounds, to allow improved coexistence with Reno and CUBIC. 
The bandwidth-probing time scale is (a) extended dynamically based on estimated BDP to improve coexistence with Reno/CUBIC; (b) bounded by an interactive wall-clock time-scale to be more scalable and responsive than Reno and CUBIC. o Rather than being largely agnostic to loss and ECN marks, it explicitly uses loss and (DCTCP-style) ECN signals to maintain its model. o It aims for lower losses than v1 by adjusting its model to attempt to stay within loss rate and ECN mark rate bounds (loss_thresh and ecn_thresh, respectively). o It adapts to loss/ECN signals even when the application is running out of data ("application-limited"), in case the "application-limited" flow is also "network-limited" (the bw and/or inflight available to this flow is lower than previously estimated when the flow ran out of data). o It has a three-part model: the model explicitly tracks three operating points, where an operating point is a tuple (bandwidth, inflight). The three operating points are: o latest: the latest measurement from the current round trip o upper bound: robust, optimistic, long-term upper bound o lower bound: robust, conservative, short-term lower bound These are stored in the following state variables: o latest: bw_latest, inflight_latest o lo: bw_lo, inflight_lo o hi: bw_hi[2], inflight_hi To gain intuition about the meaning of the three operating points, it may help to consider the analogs in CUBIC, which has a somewhat analogous three-part model used by its probing state machine:

    BBR param     CUBIC param
    -----------   -------------
    latest      ~ cwnd
    lo          ~ ssthresh
    hi          ~ last_max_cwnd

The analogy is only a loose one, though, since the BBR operating points are calculated differently, and are 2-dimensional (bw,inflight) rather than CUBIC's one-dimensional notion of operating point (inflight). 
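The three operating points above can be sketched as follows. This is an illustrative Python model with deliberately simplified update rules (the real kernel code windows, decays, and interpolates these bounds); only the field names follow the commit text:

```python
# Illustrative sketch of the three-part (bw, inflight) model described above.
# Update rules are simplified stand-ins, not the kernel's actual logic.
class BbrModel:
    def __init__(self):
        self.bw_latest = 0                  # latest: current round trip
        self.inflight_latest = 0
        self.bw_lo = float("inf")           # lo: conservative short-term bound
        self.inflight_lo = float("inf")
        self.bw_hi = 0                      # hi: optimistic long-term bound
        self.inflight_hi = 0

    def on_round_sample(self, bw, inflight, saw_congestion):
        # latest: the measurement from the current round trip
        self.bw_latest = bw
        self.inflight_latest = inflight
        # hi: raise the long-term upper bound toward the best level seen
        self.bw_hi = max(self.bw_hi, bw)
        self.inflight_hi = max(self.inflight_hi, inflight)
        if saw_congestion:
            # lo: pull the short-term lower bound down on loss/ECN signals
            self.bw_lo = min(self.bw_lo, bw)
            self.inflight_lo = min(self.inflight_lo, inflight)
```

The point of the sketch is the asymmetry: `hi` only ratchets up on good samples, while `lo` only ratchets down when congestion signals arrive.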
o It uses the three-part model to adapt the magnitude of its bandwidth probing to match the estimated space available in the buffer, rather than (as in BBR v1) assuming that it was always acceptable to place 0.25*BDP in the bottleneck buffer when probing (commodity datacenter switches commonly do not have that much buffer for WAN flows). When BBR v2 estimates it hit a buffer limit during probing, its bandwidth probing then starts gently in case little space is still available in the buffer, and then accelerates, slowly at first and then rapidly if it can grow inflight without seeing congestion signals. In such cases, probing is bounded by inflight_hi + inflight_probe, where inflight_probe grows as: [0, 1, 2, 4, 8, 16,...]. This allows BBR to keep losses low and bounded if a bottleneck remains congested, while rapidly/scalably utilizing free bandwidth when it becomes available. o It has a slightly revised state machine, to achieve the goals above.

    BBR_BW_PROBE_UP:     pushes up inflight to probe for bw/vol
    BBR_BW_PROBE_DOWN:   drain excess inflight from the queue
    BBR_BW_PROBE_CRUISE: use pipe, w/ headroom in queue/pipe
    BBR_BW_PROBE_REFILL: try to refill the pipe again to 100%, leaving queue empty

o The estimated BDP: BBR v2 continues to maintain an estimate of the path's two-way propagation delay, by tracking a windowed min_rtt, and coordinating (on an as-needed basis) to try to expose the two-way propagation delay by draining the bottleneck queue. BBR v2 continues to use its min_rtt and (currently-applicable) bandwidth estimate to estimate the current bandwidth-delay product. The estimated BDP still provides one important guideline for bounding inflight data. However, because any min-filtered RTT and max-filtered bw inherently tend to both overestimate, the estimated BDP is often too high; in this case loss or ECN marks can ensue, in which case BBR v2 adjusts inflight_hi and inflight_lo to adapt its sending rate and inflight down to match the available capacity of the path. 
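The bounded probing growth described above can be sketched directly. `probe_bound` and its round-counting interface are illustrative assumptions, not the kernel's function; only the growth schedule [0, 1, 2, 4, 8, 16, ...] comes from the commit text:

```python
# Sketch of the bounded probing described above: after hitting a buffer
# limit, the probe amount beyond inflight_hi grows as 0, 1, 2, 4, 8, 16, ...
def probe_bound(inflight_hi, rounds):
    """Upper bound on inflight after `rounds` of probing past inflight_hi."""
    inflight_probe = 0 if rounds == 0 else 1 << (rounds - 1)
    return inflight_hi + inflight_probe
```

Starting at zero and doubling gives the "gently at first, then rapidly" behavior: early probes risk at most a packet or two of extra queue, while sustained congestion-free growth quickly scales up.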
o Space: Note that ICSK_CA_PRIV_SIZE increased. This is because BBR v2 requires more space. Note that much of the space is due to support for per-socket parameterization and debugging in this research release. With that state removed, the full "struct bbr" is 140 bytes, or 144 with padding. This is an increase of 40 bytes over the existing ca_priv space. o Code: BBR v2 reuses many pieces from BBR v1. But it omits the following significant pieces: o "packet conservation" (bbr_set_cwnd_to_recover_or_restore(), bbr_can_grow_inflight()) o long-term bandwidth estimator ("policer mode") The code layout tries to keep BBR v2 code near the bottom of the file, so that v1-applicable code in the top does not accidentally refer to v2 code. o Docs: See the following docs for more details and diagrams describing the BBR v2 algorithm: https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00 https://datatracker.ietf.org/meeting/102/materials/slides-102-iccrg-an-update-on-bbr-work-at-google-00 o Internal notes: For this upstream rebase, Neal started from: git show fed518041ac6:net/ipv4/tcp_bbr.c > net/ipv4/tcp_bbr.c then removed dev instrumentation (dynamic get/set for parameters) and code that was only used by BBRv1 Effort: net-tcp_bbr Origin-9xx-SHA1: 2c84098e60bed6d67dde23cd7538c51dee273102 Change-Id: I125cf26ba2a7a686f2fa5e87f4c2afceb65f7a05 commit 8c91b8b17e402084165719e427e4113d245e57d1 Author: Neal Cardwell Date: Sat Nov 16 13:16:25 2019 -0500 net-tcp: add fast_ack_mode=1: skip rwin check in tcp_fast_ack_mode__tcp_ack_snd_check() Add logic for an experimental TCP connection behavior, enabled with tp->fast_ack_mode = 1, which disables checking the receive window before sending an ack in __tcp_ack_snd_check(). If this behavior is enabled, the data receiver sends an ACK if the amount of data is > RCV.MSS. 
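The fast_ack_mode=1 behavior just described amounts to dropping one condition from the ack decision. A simplified sketch (hypothetical Python; the real __tcp_ack_snd_check() weighs several more conditions, collapsed here into a single `rwin_would_update` flag):

```python
# Sketch of the ack-decision change described above. With fast_ack_mode = 1
# the receive-window check is skipped; only the data threshold matters.
def should_send_ack(fast_ack_mode, unacked_bytes, rcv_mss, rwin_would_update):
    if unacked_bytes > rcv_mss:
        if fast_ack_mode == 1:
            return True           # skip the receive-window check entirely
        return rwin_would_update  # default: also require a window update
    return False
```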
Change-Id: Iaa0a0fd7108221f883137a79d5bfa724f1b096d4 commit df2dd44c0324db5b2f8e44e82a317015b4c933eb Author: Neal Cardwell Date: Fri Sep 27 17:10:26 2019 -0400 net-tcp: re-generalize TSO sizing in TCP CC module API Reorganize the API for CC modules so that the CC module once again gets complete control of the TSO sizing decision. This is how the API was set up around 2016 and the initial BBRv1 upstreaming. Later Eric Dumazet simplified it. But with wider testing it now seems that to avoid CPU regressions BBR needs to have a different TSO sizing function. This is necessary to handle cases where there are many flows bottlenecked on the sender host's NIC, in which case BBR's pacing rate is much lower than CUBIC/Reno/DCTCP's. Why does this happen? Because BBR's pacing rate adapts to the low bandwidth share each flow sees. By contrast, CUBIC/Reno/DCTCP see no loss or ECN, so they grow a very large cwnd, and thus a large pacing rate and large TSO burst size. Change-Id: Ic8ccfdbe4010ee8d4bf6a6334c48a2fceb2171ea commit 958aaeff5afb297f82031818ecba3661da134808 Author: Yousuk Seung Date: Wed May 23 17:55:54 2018 -0700 net-tcp: add new ca opts flag TCP_CONG_WANTS_CE_EVENTS Add a new ca opts flag TCP_CONG_WANTS_CE_EVENTS that allows a congestion control module to receive CE events. Currently congestion control modules have to set the TCP_CONG_NEEDS_ECN bit in opts flag to receive CE events, but this may incur changes in ECN behavior elsewhere. This patch adds a new bit TCP_CONG_WANTS_CE_EVENTS that allows congestion control modules to receive CE events independently of TCP_CONG_NEEDS_ECN. 
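The intended flag semantics can be illustrated with plain bit flags. This is a Python sketch of the described behavior (the kernel checks such bits on a CC module's flags field; the helper names here are illustrative):

```python
# Sketch of the flag split described above: a module can opt into CE events
# without also requesting full ECN negotiation behavior.
TCP_CONG_NEEDS_ECN = 1 << 0
TCP_CONG_WANTS_CE_EVENTS = 1 << 1

def ca_wants_ce_events(ca_flags):
    # CE events are delivered if either flag is set.
    return bool(ca_flags & (TCP_CONG_NEEDS_ECN | TCP_CONG_WANTS_CE_EVENTS))

def ca_needs_ecn(ca_flags):
    # Full ECN behavior is only triggered by NEEDS_ECN.
    return bool(ca_flags & TCP_CONG_NEEDS_ECN)
```

A module like BBRv2 would set only WANTS_CE_EVENTS: it then receives CE events while leaving the NEEDS_ECN-driven behavior elsewhere in the stack unchanged.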
Effort: net-tcp Origin-9xx-SHA1: 9f7e14716cde760bc6c67ef8ef7e1ee48501d95b Change-Id: I2255506985242f376d910c6fd37daabaf4744f24 commit 08ed311d295db7c463c476524d95f52030462fb5 Author: Neal Cardwell Date: Tue May 7 22:37:19 2019 -0400 net-tcp_bbr: v2: set tx.in_flight for skbs in repair write queue Syzkaller was able to use TCP_REPAIR to reproduce the new warning added in tcp_fragment(): WARNING: CPU: 0 PID: 118174 at net/ipv4/tcp_output.c:1487 tcp_fragment+0xdcc/0x10a0 net/ipv4/tcp_output.c:1487() inconsistent: tx.in_flight: 0 old_factor: 53 The warning happens because skbs inserted into the tcp_rtx_queue during the repair process go through a sort of "fake send" process, and that process was setting pcount but not tx.in_flight, hence the warnings (where old_factor is the old pcount). The fix of setting tx.in_flight in the TCP_REPAIR code path seems simple enough, and indeed makes the repro code from syzkaller stop producing warnings. Running through kokonut tests, and will send out for review when all tests pass. Effort: net-tcp_bbr Origin-9xx-SHA1: 330f825a08a6fe92cef74d799cc468864c479f63 Change-Id: I0bc4a790f040fd4239620e1eedd5dc64666c6f05 commit 38d074ba1c129b8738eb7eead9698d320c0eb240 Author: Neal Cardwell Date: Wed May 1 20:16:25 2019 -0400 net-tcp_bbr: v2: adjust skb tx.in_flight upon split in tcp_fragment() When we fragment an skb that has already been sent, we need to update the tx.in_flight for the first skb in the resulting pair ("buff"). Because we were not updating the tx.in_flight, the tx.in_flight value was inconsistent with the pcount of the "buff" skb (tx.in_flight would be too high). That meant that if the "buff" skb was lost, then bbr2_inflight_hi_from_lost_skb() would calculate an inflight_hi value that is too high. This could result in longer queues and higher packet loss. 
Packetdrill testing verified that without this commit, when the second half of an skb is SACKed and then later the first half of that skb is marked lost, the calculated inflight_hi was incorrect. Effort: net-tcp_bbr Origin-9xx-SHA1: 385f1ddc610798fab2837f9f372857438b25f874 Change-Id: I617f8cab4e9be7a0b8e8d30b047bf8645393354d commit 010819adf5096d03f5d3ba32176514345de98c89 Author: Neal Cardwell Date: Wed May 1 20:16:33 2019 -0400 net-tcp_bbr: v2: adjust skb tx.in_flight upon merge in tcp_shifted_skb() When tcp_shifted_skb() updates state as adjacent SACKed skbs are coalesced, previously the tx.in_flight was not adjusted, so we could get contradictory state where the skb's recorded pcount was bigger than the tx.in_flight (the number of segments that were in_flight after sending the skb). Normally, having a SACKed skb with a contradictory pcount/tx.in_flight would not matter. However, with SACK reneging, the SACKed bit is removed, and an skb once again becomes eligible for retransmitting, fragmenting, SACKing, etc. Packetdrill testing verified the following sequence is possible in a kernel that does not have this commit:
- skb N is SACKed
- skb N+1 is SACKed and combined with skb N using tcp_shifted_skb()
- tcp_shifted_skb() will increase the pcount of prev, but leave tx.in_flight as-is
- so prev skb can have pcount > tx.in_flight
- RTO, tcp_timeout_mark_lost(), detect reneg, remove "SACKed" bit, mark skb N as lost
- find pcount of skb N is greater than its tx.in_flight
I suspect this issue is what caused the bbr2_inflight_hi_from_lost_skb(): WARN_ON_ONCE(inflight_prev < 0) to fire in production machines using bbr2. Tested: See last commit in series for sponge link. 
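The fix described above preserves a simple invariant: every skb's recorded tx.in_flight must be at least its pcount. A minimal Python sketch, modeling skbs as dicts (the actual kernel adjustment operates on TCP_SKB_CB state and may differ in detail):

```python
# Sketch of the merge fix described above: when `pcount` segments shift from
# skb into prev, prev's recorded tx.in_flight must grow too, preserving the
# invariant tx_in_flight >= pcount for every skb.
def shift_skb(prev, skb, pcount):
    prev["pcount"] += pcount
    prev["tx_in_flight"] += pcount   # the fix: keep in_flight consistent
    skb["pcount"] -= pcount
```

Without the `tx_in_flight` line, prev can end up with pcount > tx_in_flight, which is exactly the contradictory state that later trips the WARN_ON_ONCE after SACK reneging.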
Effort: net-tcp_bbr Origin-9xx-SHA1: 1a3e997e613d2dcf32b947992882854ebe873715 Change-Id: I1b0b75c27519953430c7db51c6f358f104c7af55 commit 6f61d68d1e53865dfa49e20b44e0db0dfa4bc47b Author: Neal Cardwell Date: Tue May 7 22:36:36 2019 -0400 net-tcp_bbr: v2: factor out tx.in_flight setting into tcp_set_tx_in_flight() Factor out the code to set an skb's tx.in_flight field into its own function, so that this code can be used for the TCP_REPAIR "fake send" code path that inserts skbs into the rtx queue without sending them. This is in preparation for the following patch, which fixes an issue with TCP_REPAIR and tx.in_flight. Tested: See last patch in series for sponge link. Effort: net-tcp_bbr Origin-9xx-SHA1: e880fc907d06ea7354333f60f712748ebce9497b Change-Id: I4fbd4a6e18a51ab06d50ab1c9ad820ce5bea89af commit fed267c2eb13256f89fcc907e814c73a919311ca Author: Neal Cardwell Date: Tue Aug 7 21:52:06 2018 -0400 net-tcp_bbr: v2: introduce ca_ops->skb_marked_lost() CC module callback API For connections experiencing reordering, RACK can mark packets lost long after we receive the SACKs/ACKs hinting that the packets were actually lost. This means that CC modules cannot easily learn the volume of inflight data at which packet loss happens by looking at the current inflight or even the packets in flight when the most recently SACKed packet was sent. To learn this, CC modules need to know how many packets were in flight at the time lost packets were sent. This new callback, combined with TCP_SKB_CB(skb)->tx.in_flight, allows them to learn this. This also provides a consistent callback that is invoked whether packets are marked lost upon ACK processing, using the RACK reordering timer, or at RTO time. 
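The purpose of the skb_marked_lost() callback can be sketched as follows. This is illustrative Python; `Flow`, `transmit`, and the seq-keyed snapshot table are hypothetical stand-ins for per-skb state (the kernel stores the snapshot in TCP_SKB_CB(skb)->tx.in_flight):

```python
# Sketch of why the callback described above is needed: loss may be marked
# long after the hinting SACKs arrive, so the CCA reads the in-flight level
# recorded when the lost packet was *sent*, not the current level.
class Flow:
    def __init__(self):
        self.inflight = 0
        self.sent = {}           # seq -> in-flight snapshot at transmit time
        self.lost_inflight = []

    def transmit(self, seq, pcount):
        self.inflight += pcount
        self.sent[seq] = self.inflight   # snapshot taken at transmit time

    def skb_marked_lost(self, seq):
        # CCA callback: learn inflight at the time the lost skb was sent,
        # whether loss is detected at ACK time, RACK timer, or RTO.
        self.lost_inflight.append(self.sent[seq])
```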
Effort: net-tcp_bbr Origin-9xx-SHA1: afcbebe3374e4632ac6714d39e4dc8a8455956f4 Change-Id: I54826ab53df636be537e5d3c618a46145d12d51a commit b324fd8092194ee352cf47e4af633a3b90a83f6c Author: Neal Cardwell Date: Mon Nov 19 13:48:36 2018 -0500 net-tcp_bbr: v2: export FLAG_ECE in rate_sample.is_ece For understanding the relationship between inflight and ECN signals, to try to find the highest inflight value that has acceptable levels of ECN marking. Effort: net-tcp_bbr Origin-9xx-SHA1: 3eba998f2898541406c2666781182200934965a8 Change-Id: I3a964e04cee83e11649a54507043d2dfe769a3b3 commit b2f308e94a5564555f63349019314523c258103a Author: Neal Cardwell Date: Thu Oct 12 23:44:27 2017 -0400 net-tcp_bbr: v2: count packets lost over TCP rate sampling interval For understanding the relationship between inflight and packet loss signals, to try to find the highest inflight value that has acceptable levels of packet losses. Effort: net-tcp_bbr Origin-9xx-SHA1: 4527e26b2bd7756a88b5b9ef1ada3da33dd609ab Change-Id: I594c2500868d9c530770e7ddd68ffc87c57f4fd5 commit 2854a76b0897c45fce59754e9d2c689361c826d5 Author: Neal Cardwell Date: Sat Aug 5 11:49:50 2017 -0400 net-tcp_bbr: v2: snapshot packets in flight at transmit time and pass in rate_sample For understanding the relationship between inflight and losses or ECN signals, to try to find the highest inflight value that has acceptable levels of loss/ECN marking. Effort: net-tcp_bbr Origin-9xx-SHA1: b3eb4f2d20efab4ca001f32c9294739036c493ea Change-Id: I7314047d0ff14dd261a04b1969a46dc658c8836a commit d0775a1e4f43188c66e972ea8447eb5fa0e3bdae Author: Neal Cardwell Date: Sun Jun 24 21:55:59 2018 -0400 net-tcp_bbr: v2: shrink delivered_mstamp, first_tx_mstamp to u32 to free up 8 bytes Free up some space for tracking inflight and losses for each bw sample, in upcoming commits. These timestamps are in microseconds, and are now stored in 32 bits. So they can only hold time intervals up to roughly 2^12 = 4096 seconds. 
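A quick arithmetic check of the interval bound just stated:

```python
# A 32-bit microsecond counter wraps after 2^32 us, i.e. roughly
# 2^12 = 4096 seconds (about 71.6 minutes), as the commit says.
wrap_seconds = (2**32) / 1_000_000   # ~4294.97 s
approx = 2**12                       # 4096 s, within ~5% of the exact value
```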
But Linux TCP RTT and RTO tracking has the same 32-bit microsecond implementation approach and resulting deployment limitations. So this is not introducing a new limit. And these should not be a limitation for the foreseeable future. Effort: net-tcp_bbr Origin-9xx-SHA1: 238a7e6b5d51625fef1ce7769826a7b21b02ae55 Change-Id: I3b779603797263b52a61ad57c565eb91fe42680c commit d6c71581cba6a4d1ff7d4cf899ab0ec052123b65 Author: Neal Cardwell Date: Tue Jun 11 12:26:55 2019 -0400 net-tcp_bbr: broaden app-limited rate sample detection This commit is a bug fix for the Linux TCP app-limited (application-limited) logic that is used for collecting rate (bandwidth) samples. Previously the app-limited logic only looked for "bubbles" of silence in between application writes, by checking at the start of each sendmsg. But "bubbles" of silence can also happen before retransmits: e.g. bubbles can happen between an application write and a retransmit, or between two retransmits. Retransmits are triggered by ACKs or timers. So this commit checks for bubbles of app-limited silence upon ACKs or timers. Why does this commit check for app-limited state at the start of ACKs and timer handling? Because at that point we know whether inflight was fully using the cwnd. During processing the ACK or timer event we often change the cwnd; after changing the cwnd we can't know whether inflight was fully using the old cwnd. Origin-9xx-SHA1: 3fe9b53291e018407780fb8c356adb5666722cbc Change-Id: I37221506f5166877c2b110753d39bb0757985e68 commit ed307ccbcfff0b21a526f53cfa28df9b2052db53 Author: Stephan Mueller Date: Tue Sep 28 17:41:57 2021 +0200 char/lrng: add power-on and runtime self-tests Parts of the LRNG are already covered by self-tests, including: * Self-test of SP800-90A DRBG provided by the Linux kernel crypto API. * Self-test of the PRNG provided by the Linux kernel crypto API. 
* Raw noise source data testing including SP800-90B compliant tests when enabling CONFIG_LRNG_HEALTH_TESTS This patch adds the self-tests for the remaining critical functions of the LRNG that are essential to maintain entropy and provide cryptographically strong random numbers. The following self-tests are implemented: * Self-test of the time array maintenance. This test verifies that the time stamp array management, which stores multiple values in one integer, implements a concatenation of the data. * Self-test of the software hash implementation ensures that this function operates in compliance with the FIPS 180-4 specification. The self-test performs a hash operation of a zeroized per-CPU data array. * Self-test of the ChaCha20 DRNG is based on the self-tests that are already present and implemented with the stand-alone user space ChaCha20 DRNG implementation available at [1]. The self-tests cover different use cases of the DRNG seeded with known seed data. The status of the LRNG self-tests is provided with the selftest_status SysFS file. If the file contains a zero, the self-tests passed. The value 0xffffffff means that the self-tests were not executed. Any other value indicates a self-test failure. The self-test may be compiled to panic the system if the self-test fails. All self-tests operate on private state data structures. This implies that none of the self-tests have any impact on the regular LRNG operations. This allows the self-tests to be repeated at runtime by writing anything into the selftest_status SysFS file. [1] https://www.chronox.de/chacha20.html CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. 
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange CC: Marcelo Henrique Cerri CC: Neil Horman Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 990036a0a3ff50142abe71c31872da8c24f8d0c9 Author: Stephan Mueller Date: Mon Oct 18 20:55:51 2021 +0200 char/lrng: add interface for gathering of raw entropy The test interface allows a privileged process to capture the raw unconditioned noise that is collected by the LRNG for statistical analysis. Such testing allows analysis of how much entropy the interrupt noise source provides on a given platform. Extracted noise data is not used to seed the LRNG. This is a test interface and not appropriate for production systems. Yet, the interface is considered to be sufficiently secured for production systems. Access to the data is given through the lrng_raw debugfs file. The data buffer should be multiples of sizeof(u32) to fill the entire buffer. Using the option lrng_testing.boot_test=1 the raw noise of the first 1000 entropy events since boot can be sampled. This test interface allows generating the data required for analyzing whether the LRNG is in compliance with SP800-90B sections 3.1.3 and 3.1.4. In addition, the test interface allows gathering of the concatenated raw entropy data to verify that the concatenation works appropriately. This includes sampling of the following raw data:
* high-resolution time stamp
* Jiffies
* IRQ number
* IRQ flags
* return instruction pointer
* interrupt register state
* array logic batching the high-resolution time stamp
* enabling the runtime configuration of entropy source entropy rates
Also, a testing interface to support ACVT of the hash implementation is provided. 
The reason why only hash testing is supported (as opposed to also provide testing for the DRNG) is the fact that the LRNG software hash implementation contains glue code that may warrant testing in addition to the testing of the software ciphers via the kernel crypto API. Also, for testing the CTR-DRBG, the underlying AES implementation would need to be tested. However, such AES test interface cannot be provided by the LRNG as it has no means to access the AES operation. Finally, the execution duration for processing a time stamp can be obtained with the LRNG raw entropy interface. If a test interface is not compiled, its code is a noop which has no impact on the performance. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit c808bb6fb444fde94a8dec94d3936cc44e7f8a44 Author: Stephan Mueller Date: Tue Sep 28 18:10:36 2021 +0200 char/lrng: add SP800-90B compliant health tests Implement health tests for LRNG's slow noise sources as mandated by SP-800-90B The file contains the following health tests: - stuck test: The stuck test calculates the first, second and third discrete derivative of the time stamp to be processed by the hash for the per-CPU entropy pool. Only if all three values are non-zero, the received time delta is considered to be non-stuck. - SP800-90B Repetition Count Test (RCT): The LRNG uses an enhanced version of the RCT specified in SP800-90B section 4.4.1. 
Instead of counting identical back-to-back values, the input to the RCT is the counting of the stuck values during the processing of received interrupt events. The RCT is applied with alpha=2^-30, compliant with the recommendation of FIPS 140-2 IG 9.8. During the counting operation, the LRNG always calculates the RCT cut-off value C. If that value exceeds the allowed cut-off value, the LRNG will trigger the health test failure discussed below. An error is logged to the kernel log when such an RCT failure occurs. This test is only applied and enforced in FIPS mode, i.e. when a kernel compiled with CONFIG_FIPS is started with fips=1. - SP800-90B Adaptive Proportion Test (APT): The LRNG implements the APT as defined in SP800-90B section 4.4.2. The applied significance level again is alpha=2^-30, compliant with the recommendation of FIPS 140-2 IG 9.8. The aforementioned health tests are applied to the first 1,024 time stamps obtained from interrupt events. If an error is identified by either the RCT or the APT, the collected entropy is invalidated and the SP800-90B startup health test is restarted. As long as the SP800-90B startup health test is not completed, all LRNG random number output interfaces that may block will block and not generate any data. This implies that only those potentially blocking interfaces are defined to provide random numbers that are seeded with the interrupt noise source being SP800-90B compliant. All other output interfaces will not be affected by the SP800-90B startup test and thus are not considered SP800-90B compliant. At runtime, the SP800-90B APT and RCT are applied to each time stamp generated for a received interrupt. 
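The two health tests can be sketched as follows. This is illustrative Python using the textbook SP800-90B cutoff formula; as noted above, the LRNG's actual RCT input is stuck-value counts rather than raw samples, and it derives its own cutoffs internally, so the numbers here are assumptions for illustration only:

```python
# Illustrative sketches of the SP800-90B health tests described above.
import math

def rct_cutoff(h_min_bits, alpha_exp=30):
    # SP800-90B 4.4.1: C = 1 + ceil(-log2(alpha) / H), here alpha = 2^-30.
    return 1 + math.ceil(alpha_exp / h_min_bits)

def rct_fails(samples, cutoff):
    # Repetition Count Test: fail on a run of identical values >= cutoff.
    run = 1
    for prev, cur in zip(samples, samples[1:]):
        run = run + 1 if cur == prev else 1
        if run >= cutoff:
            return True
    return False

def apt_fails(window, cutoff):
    # Adaptive Proportion Test (4.4.2): within a window, count how often the
    # first sample repeats; too many repeats indicates a degraded source.
    return window.count(window[0]) >= cutoff
```

For example, a source claiming 1 bit of min-entropy per sample gets an RCT cutoff of 31: a run of 31 identical samples has probability about 2^-30 and so triggers a failure.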
When either the APT or RCT indicates a noise source failure, the LRNG is reset to the state it has immediately after boot:
- all entropy counters are set to zero
- the SP800-90B startup tests are re-performed, which implies that getrandom(2) would block again until new entropy was collected
To summarize, the following rules apply:
• SP800-90B compliant output interfaces
  - /dev/random
  - getrandom(2) system call
  - get_random_bytes kernel-internal interface when being triggered by the callback registered with add_random_ready_callback
• SP800-90B non-compliant output interfaces
  - /dev/urandom
  - get_random_bytes kernel-internal interface called directly
  - randomize_page kernel-internal interface
  - get_random_u32 and get_random_u64 kernel-internal interfaces
  - get_random_u32_wait, get_random_u64_wait, get_random_int_wait, and get_random_long_wait kernel-internal interfaces
If either the RCT or the APT health test fails, irrespective of whether during initialization or runtime, the following actions occur: 1. The entropy of the entire entropy pool is invalidated. 2. All DRNGs are reset, which implies that they are treated as not seeded and require a reseed during the next invocation. 3. The SP800-90B startup health tests are initiated, with all implications of the startup tests. That implies that from that point on, new events must be observed and their entropy must be inserted into the entropy pool before random numbers are calculated from it. Further details on the SP800-90B compliance and the availability of all test tools required to perform all tests mandated by SP800-90B are provided at [1]. The entire health testing code is compile-time configurable. The patch provides a CONFIG_BROKEN configuration of the APT / RCT cutoff values which has a high likelihood of triggering a health test failure. 
The BROKEN APT cutoff is set to the exact mean of the expected value if the time stamps are equally distributed (512 time stamps divided by 16 possible values due to using the 4 LSB of the time stamp). The BROKEN RCT cutoff value is set to 1 which is likely to be triggered during regular operation. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 7db70085449199372687d8cf508c5679f3bb165c Author: Stephan Mueller Date: Wed Oct 13 22:53:51 2021 +0200 char/lrng: add Jitter RNG fast noise source The Jitter RNG fast noise source implemented as part of the kernel crypto API is queried for 256 bits of entropy at the time the seed buffer managed by the LRNG is about to be filled. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. 
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Marcelo Henrique Cerri Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit e0ba90281b8805c8b7928979a46d033b73b1a7ce Author: Stephan Mueller Date: Wed Sep 16 09:50:27 2020 +0200 crypto: move Jitter RNG header include dir To support the LRNG operation which uses the Jitter RNG separately from the kernel crypto API, the header file must be accessible to the LRNG code. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Roman Drahtmueller Tested-by: Roman Drahtmüller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit f07f4a5f40dc634bb20420ff319cdcc8eeb5c992 Author: Stephan Mueller Date: Fri Jun 18 08:10:53 2021 +0200 char/lrng: add kernel crypto API PRNG extension Add runtime-pluggable support for all PRNGs that are accessible via the kernel crypto API, including hardware PRNGs. The PRNG is selected with the module parameter drng_name where the name must be one that the kernel crypto API can resolve into an RNG. This allows using of the kernel crypto API PRNG implementations that provide an interface to hardware PRNGs. Using this extension, the LRNG uses the hardware PRNGs to generate random numbers. 
An example is the S390 CPACF support providing such a PRNG. The hash is provided by a kernel crypto API SHASH whose digest size complies with the seedsize of the PRNG. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit b86cde95ce97e7d0b1bf48179f35ea20cc7a69a1 Author: Stephan Mueller Date: Fri Jun 18 08:09:59 2021 +0200 char/lrng: add SP800-90A DRBG extension Using the LRNG switchable DRNG support, the SP800-90A DRBG extension is implemented. The DRBG uses the kernel crypto API DRBG implementation. In addition, it uses the kernel crypto API SHASH support to provide the hashing operation. The DRBG supports the choice of either a CTR DRBG using AES-256, HMAC DRBG with SHA-512 core or Hash DRBG with SHA-512 core. The used core can be selected with the module parameter lrng_drbg_type. The default is the CTR DRBG. When compiling the DRBG extension statically, the DRBG is loaded at late_initcall stage which implies that with the start of user space, the user space interfaces of getrandom(2), /dev/random and /dev/urandom provide random data produced by an SP800-90A DRBG. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. 
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 2eed3ac32f7184096ff7f18591fbedd1de66d97d Author: Stephan Mueller Date: Tue Sep 15 22:17:43 2020 +0200 crypto: drbg: externalize DRBG functions for LRNG This patch allows several DRBG functions to be called by the LRNG kernel code paths outside the drbg.c file. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Roman Drahtmueller Tested-by: Roman Drahtmüller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 3ef6e0b3e183a5f5d97bcd3e688387c91c66a03e Author: Stephan Mueller Date: Fri Jun 18 08:08:20 2021 +0200 char/lrng: add common generic hash support The LRNG switchable DRNG support also allows the replacement of the hash implementation used as conditioning component. The common generic hash support code provides the required callbacks using the synchronous hash implementations of the kernel crypto API. All synchronous hash implementations supported by the kernel crypto API can be used as part of the LRNG with this generic support. The generic support is intended to be configured by separate switchable DRNG backends. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. 
Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange CC: "Peter, Matthias" CC: Marcelo Henrique Cerri CC: Neil Horman Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 1ac746ce017ec114f354b6f89493c8c059184fd7 Author: Stephan Mueller Date: Fri Oct 1 20:47:30 2021 +0200 char/lrng: add switchable DRNG support The DRNG switch support allows replacing the DRNG mechanism of the LRNG. The switching support rests on the interface definition of include/linux/lrng.h. A new DRNG is implemented by filling in the interface defined in this header file. In addition to the DRNG, the extension also has to provide a hash implementation that is used to hash the entropy pool for random number extraction. Note: It is permissible to implement a DRNG whose operations may sleep. However, the hash function must not sleep. The switchable DRNG support allows replacing the DRNG at runtime. However, only one DRNG extension is allowed to be loaded at any given time. Before replacing it with another DRNG implementation, the possibly existing DRNG extension must be unloaded. The switchable DRNG extension activates the new DRNG during load time. It is expected, however, that such a DRNG switch would be done only once by an administrator to load the intended DRNG implementation. It is permissible to compile DRNG extensions either as kernel modules or statically. The initialization of the DRNG extension should be performed with a late_initcall to ensure the extension is available when user space starts but after all other initialization completed. The initialization is performed by registering the function call data structure with the lrng_set_drng_cb function. 
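The registration flow just described, one extension loaded at any given time via a callback data structure, can be sketched in ordinary C. The structure below is a hypothetical stand-in for the interface defined in include/linux/lrng.h, reduced to two operations for brevity; field names are illustrative, not the actual kernel definitions.

```c
#include <stddef.h>

/* Hypothetical callback table; the real interface in include/linux/lrng.h
 * contains more operations (hash callbacks, name strings, etc.). */
struct lrng_drng_cb {
	int (*seed)(void *drng, const unsigned char *inbuf, size_t inbuflen);
	int (*generate)(void *drng, unsigned char *outbuf, size_t outbuflen);
};

static const struct lrng_drng_cb *active_cb;

/* Register a DRNG extension; invoking with NULL unloads the current one.
 * Only one extension may be loaded at a time, so registering a different
 * table is rejected until the existing extension is unloaded. */
int lrng_set_drng_cb(const struct lrng_drng_cb *cb)
{
	if (cb && active_cb && active_cb != cb)
		return -1;
	active_cb = cb;
	return 0;
}
```

A second registration fails until the first extension deregisters itself by passing NULL, mirroring the unload rule described in the text.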
In order to unload the DRNG extension, lrng_set_drng_cb must be invoked with the NULL parameter. The DRNG extension should always provide a security strength that is at least as strong as LRNG_DRNG_SECURITY_STRENGTH_BITS. The hash extension must not sleep and must not maintain a separate state. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 98be5134b53ff31b0169b1d71d88d98d22410ac8 Author: Stephan Mueller Date: Mon Oct 18 20:52:42 2021 +0200 char/lrng: CPU entropy source Certain CPUs provide instructions giving access to an entropy source (e.g. RDSEED on Intel/AMD, DARN on POWER, etc.). The LRNG can utilize the entropy source to seed its DRNG from. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. 
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Mathematical aspects Reviewed-by: "Peter, Matthias" Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 78be60b1bab4754b5b1f7b9153ed67186d4a5d24 Author: Stephan Mueller Date: Tue Sep 28 17:05:50 2021 +0200 char/lrng: allocate one DRNG instance per NUMA node In order to improve NUMA-locality when serving getrandom(2) requests, allocate one DRNG instance per node. The DRNG instance that is present right from the start of the kernel is reused as the first per-NUMA-node DRNG. For all remaining online NUMA nodes a new DRNG instance is allocated. During boot time, the multiple DRNG instances are seeded sequentially. With this, the first DRNG instance (referenced as the initial DRNG in the code) is completely seeded with 256 bits of entropy before the next DRNG instance is completely seeded. When random numbers are requested, the NUMA-node-local DRNG is checked whether it has been already fully seeded. If this is not the case, the initial DRNG is used to serve the request. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. 
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange CC: Eric Biggers Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 07f25c72f6b3699b4851f2ded96fcee5f34fb709 Author: Stephan Mueller Date: Wed Oct 13 22:50:53 2021 +0200 char/lrng: sysctls and /proc interface The LRNG sysctl interface provides the same controls as the existing /dev/random implementation. These sysctls behave identically and are implemented identically. The goal is to allow a possible merge of the existing /dev/random implementation with this implementation, which implies that this patch tries to maintain a very close similarity. Yet, all sysctls are documented at [1]. In addition, it provides the file lrng_type which provides details about the LRNG: - the name of the DRNG that produces the random numbers for /dev/random, /dev/urandom, getrandom(2) - the hash used to produce random numbers from the entropy pool - the number of secondary DRNG instances - indicator whether the LRNG operates SP800-90B compliant - indicator whether a high-resolution timer is identified - only with a high-resolution timer the interrupt noise source will deliver sufficient entropy - indicator whether the LRNG has been minimally seeded (i.e. is the secondary DRNG seeded with at least 128 bits of entropy) - indicator whether the LRNG has been fully seeded (i.e. is the secondary DRNG seeded with at least 256 bits of entropy) [1] https://www.chronox.de/lrng.html CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. 
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 53123f524900466761165f1d795e906ce1fddc5c Author: Stephan Mueller Date: Wed Oct 13 22:49:22 2021 +0200 char/lrng: IRQ entropy source The interrupt entropy source hooks into the interrupt handler via the add_interrupt_randomness function callback. Every interrupt received by the kernel is also sent to the LRNG for processing. The IRQ entropy source performs the following processing: 1. Record a time stamp. 2. Divide the time stamp by its greatest common divisor to eliminate fixed least significant bits. 3. Insert the 8 LSB of the result from step 2 into the collection pool. 4. When the collection pool is full, it is hashed into the per-CPU entropy pool (if continuous compression is enabled) or the latest time stamps overwrite the oldest entries in the collection pool. If entropy is requested from the IRQ entropy pool, a message digest over all per-CPU entropy pool digests is calculated. The GCD calculation is performed for the first 100 interrupt time stamps. Until the GCD value is calculated, the full 32 bit time stamp is inserted into the collection pool. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. 
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Mathematical aspects Reviewed-by: "Peter, Matthias" Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 0d82aaa422d09de415cafb571131055c3ee53902 Author: Stephan Mueller Date: Sun Oct 17 20:23:04 2021 +0200 drivers/char: Introduce the Linux Random Number Generator In an effort to provide a flexible implementation of a random number generator that also delivers entropy during early boot time, allows replacement of the deterministic random number generation mechanism, implements the various components in separate code for easier maintenance, and provides compliance with SP800-90[A|B|C], introduce the Linux Random Number Generator (LRNG) framework. The LRNG framework provides a flexible random number generator which allows developers and system integrators to achieve different goals by ensuring that each solution establishes a secure state. The general design is as follows. Additional implementation details are given in [1]. The LRNG consists of the following components: 1. The LRNG implements a DRNG. The DRNG always generates the requested amount of output. When using the SP800-90A terminology it operates without prediction resistance. The DRNG maintains a counter of how many bytes were generated since last re-seed and a timer of the elapsed time since last re-seed. If either the counter or the timer reaches a threshold, the DRNG is seeded from the entropy pool. In case the Linux kernel detects a NUMA system, one DRNG instance per NUMA node is maintained. 2. 
The DRNG is seeded by concatenating the data from the following sources which deliver data and are credited with entropy if enabled: (a) the output of the IRQ per-CPU entropy pools, (b) the auxiliary entropy pool, (c) the Jitter RNG if available and enabled, and (d) the CPU-based noise source such as Intel RDSEED. The entropy estimates of the data of all noise sources are added to form the entropy estimate of the data used to seed the DRNG. The LRNG ensures, however, that the entropy credited to the DRNG after seeding is at most the security strength of the DRNG. The LRNG is designed such that none of these noise sources can dominate the other noise sources in providing seed data to the DRNG due to the following: (a) During boot time, the amounts of received entropy at the different entropy sources are the trigger points to (re)seed the DRNG. (b) At runtime, the available entropy from the slow noise source is concatenated with a pre-defined amount of data from the fast noise sources. In addition, each DRNG reseed operation triggers external noise source providers to deliver one block of data. 3. The IRQ entropy pool collects noise data from interrupt timing. Any data received by the LRNG from the interrupt noise sources is inserted into a per-CPU entropy pool using a hash operation that can be changed during runtime. By default, SHA-256 is used. (a) When an interrupt occurs, the 8 least significant bits of the high-resolution time stamp divided by the greatest common divisor (GCD) are mixed into the per-CPU entropy pool. This time stamp is credited with heuristically implied entropy. (b) HID event data like the key stroke or the mouse coordinates are mixed into the per-CPU entropy pool. This data is not credited with entropy by the LRNG. 5. Any data provided from user space by either writing to /dev/random, /dev/urandom or the IOCTL of RNDADDENTROPY on both device files is always injected into the auxiliary pool. 
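The per-interrupt processing in (a) above (divide the time stamp by the observed GCD, then keep the 8 least significant bits) is plain integer arithmetic. A standalone sketch, not the kernel code:

```c
#include <stdint.h>

/* Euclid's algorithm: the greatest common divisor of the observed time
 * stamps identifies the fixed least significant bits to strip. */
static uint32_t gcd32(uint32_t a, uint32_t b)
{
	while (b) {
		uint32_t t = a % b;
		a = b;
		b = t;
	}
	return a;
}

/* Divide a time stamp by the measured GCD, then keep the 8 least
 * significant bits that get mixed into the per-CPU entropy pool. */
static uint8_t ts_to_pool_byte(uint32_t ts, uint32_t gcd)
{
	return (uint8_t)((ts / gcd) & 0xff);
}
```

In the LRNG the GCD is measured over the first 100 interrupt time stamps; here it is simply passed in as a parameter.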
Also, device drivers may provide data that is mixed into an auxiliary pool using the same hash that is used to process the per-CPU entropy pool. This data is not credited with entropy by the LRNG. In addition, when a hardware random number generator covered by the Linux kernel HW generator framework wants to deliver random numbers, it is injected into the auxiliary pool as well. The HW generator noise source is handled separately from the other noise sources because the HW generator framework may decide by itself when to deliver data, whereas the other noise sources are always queried for data as driven by the LRNG operation. Similarly, any user space provided data is inserted into the entropy pool. When seed data for the DRNG is to be generated, all per-CPU entropy pools are hashed. The message digest forms the data used for seeding the DRNG. To speed up the interrupt handling code of the LRNG, the time stamp collected for an interrupt event is divided by the greatest common divisor to eliminate fixed low bits and then truncated to the 8 least significant bits. 1024 truncated time stamps are concatenated and then jointly inserted into the per-CPU entropy pool. During boot time, until the fully seeded stage is reached, the 32 least significant bits of each time stamp are concatenated. When 1024/32 = 32 such events are received, they are injected into the per-CPU entropy pool. The LRNG allows the DRNG mechanism to be changed at runtime. By default, a ChaCha20-based DRNG is used. The ChaCha20-DRNG implemented for the LRNG is also provided as a stand-alone user space deterministic random number generator. The LRNG also offers an SP800-90A DRBG based on the Linux kernel crypto API DRBG implementation. The processing of entropic data from the noise source before injecting them into the DRNG is performed with the following mathematical operations: 1. 
Truncation: The received time stamps divided by the GCD are truncated to 8 least significant bits (or 32 least significant bits during boot time) 2. Concatenation: The received and truncated time stamps as well as auxiliary 32 bit words are concatenated to fill the per-CPU data array that is capable of holding 64 8-bit words. 3. Hashing: A set of concatenated time stamp data received from the interrupts is hashed together with the current existing per-CPU entropy pool state. The resulting message digest is the new per-CPU entropy pool state. 4. Hashing: When new data is added to the auxiliary pool, the data is hashed together with the auxiliary pool to form a new auxiliary pool state. 5. Hashing: A message digest of all per-CPU entropy pools and the auxiliary pool is calculated, which forms the new auxiliary pool state. 6. Truncation: The most-significant bits (MSB) defined by the requested number of bits (commonly equal to the security strength of the DRBG) or the entropy available transported with the buffer (which is the minimum of the message digest size and the available entropy in all entropy pools and the auxiliary pool), whichever is smaller, are obtained from the slow noise source output buffer. 7. Concatenation: The temporary seed buffer used to seed the DRNG is a concatenation of the slow noise source buffer, the Jitter RNG output, the CPU noise source output, and the current time. The DRNG always tries to seed itself with 256 bits of entropy, except during boot. In any case, if the noise sources cannot deliver that amount, the available entropy is used and the DRNG keeps track of how much entropy it was seeded with. The entropy estimate implied by the LRNG for the data available in the entropy pool may be too conservative. To ensure that during boot time all available entropy from the entropy pool is transferred to the DRNG, the hash_df function always generates 256 data bits during boot to seed the DRNG. During boot, the DRNG is seeded as follows: 1. 
The DRNG is reseeded from the entropy sources if the entropy sources collectively have at least 32 bits of entropy. The goal of this step is to ensure that the DRNG receives some initial entropy as early as possible. 2. The DRNG is reseeded from the entropy sources if all entropy sources collectively can provide at least 128 bits of entropy. 3. The DRNG is reseeded from the entropy sources if all entropy sources collectively can provide at least 256 bits. At the time of the reseeding steps, the DRNG requests as much entropy as is available in order to skip certain steps and reach the seeding level of 256 bits. This may imply that one or more of the aforementioned steps are skipped. Before the DRNG is seeded with 256 bits of entropy in step 3, requests of random data from /dev/random and the getrandom system call are not processed. The reseeding of the DRNG always ensures that all entropy sources collectively can deliver at least 128 entropy bits during runtime once the DRNG is fully seeded. The DRNG operates as deterministic random number generator with the following properties: * The maximum number of random bytes that can be generated with one DRNG generate operation is limited to 4096 bytes. When longer random numbers are requested, multiple DRNG generate operations are performed. The ChaCha20 DRNG as well as the SP800-90A DRBGs implement an update of their state after completing a generate request for backtracking resistance. 
* The DRNG is reseeded with whatever entropy is available - in the worst case where no additional entropy can be provided by the entropy sources, the DRNG is not re-seeded and continues its operation, trying to reseed again after the expiry of one of these thresholds: - If the last reseeding of the DRNG is more than 600 seconds ago, or - 2^20 DRNG generate operations are performed, whichever comes first, or - the DRNG is forced to reseed before the next generation of random numbers if data has been injected into the LRNG by writing data into /dev/random or /dev/urandom. - If the DRNG was not successfully reseeded after 2^30 generate requests, the DRNG reverts back to an unseeded stage, implying that the blocking interfaces of /dev/random and getrandom will block again. The chosen values prevent high-volume requests from user space from causing frequent reseeding operations which would drag down the performance of the DRNG. With the automatic reseeding after 600 seconds, the LRNG is triggered to reseed itself before the first request after a suspend that put the hardware to sleep for longer than 600 seconds. To support smaller devices including IoT environments, this patch allows reducing the runtime memory footprint of the LRNG at compile time by selecting smaller collection data sizes. When compiling a kernel for a small environment, the allocation of a buffer of up to 4096 bytes to serve user space requests is prevented. In this case, the stack variable of 64 bytes is used to serve all user space requests. The LRNG has the following properties: * internal noise source: interrupt timing with fast boot time seeding * high performance of interrupt handling code: The LRNG impact on the interrupt handling has been reduced to a minimum. 
On one example system, the LRNG interrupt handling code in its fastest configuration executes within an average 55 cycles whereas the existing /dev/random on the same device takes about 97 cycles when measuring the execution time of add_interrupt_randomness(). * use of almost never contended lock for hashing operation to collect raw entropy supporting concurrency-free use of massive parallel systems - worst case rate of contention is the number of DRNG reseeds, usually: number of NUMA nodes contentions per 5 minutes. * use of standalone ChaCha20 based RNG with the option to use a different DRNG selectable at compile time * instantiate one DRNG per NUMA node * support for runtime switchable output DRNGs * use of runtime-switchable hash for conditioning implementation following widely accepted approach * compile-time selectable collection size * support of small systems by allowing the reduction of the runtime memory needs Further details including the rationale for the design choices and properties of the LRNG together with testing is provided at [1]. In addition, the documentation explains the conducted regression tests to verify that the LRNG is API and ABI compatible with the existing /dev/random implementation. Note, this patch covers the entropy sources manager, the API implementation, the built-in ChaCha20 DRNG and the auxiliary entropy pool. [1] https://www.chronox.de/lrng.html CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. 
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Mathematical aspects Reviewed-by: "Peter, Matthias" Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 7aa708cfb7fb0e3562fa9a15216b6675f4762618 Author: Alexey Avramov Date: Sat Nov 13 10:42:27 2021 +0900 mm/vmscan: add sysctl knobs for protecting the working set The kernel does not provide a way to protect the working set under memory pressure. A certain amount of anonymous and clean file pages is required by the userspace for normal operation. First of all, the userspace needs a cache of shared libraries and executable binaries. If the amount of the clean file pages falls below a certain level, then thrashing and even livelock can take place. The patch provides sysctl knobs for protecting the working set (anonymous and clean file pages) under memory pressure. The vm.anon_min_kbytes sysctl knob provides *hard* protection of anonymous pages. The anonymous pages on the current node won't be reclaimed under any conditions when their amount is below vm.anon_min_kbytes. This knob may be used to prevent excessive swap thrashing when anonymous memory is low (for example, when memory is going to be overfilled by compressed data of zram module). The default value is defined by CONFIG_ANON_MIN_KBYTES (suggested 0 in Kconfig). The vm.clean_low_kbytes sysctl knob provides *best-effort* protection of clean file pages. The file pages on the current node won't be reclaimed under memory pressure when the amount of clean file pages is below vm.clean_low_kbytes *unless* we threaten to OOM. 
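Together with the vm.clean_min_kbytes hard limit described below, the knobs reduce to a simple predicate at reclaim time. A userspace sketch under the assumption that near-OOM detection is available as a flag; names are illustrative, not the patch's internals:

```c
#include <stdbool.h>

/* Two protection levels for clean file pages: below clean_min_kbytes
 * reclaim is always skipped (hard protection); below clean_low_kbytes
 * it is skipped unless we threaten to OOM (best-effort protection). */
static bool may_reclaim_clean_file(unsigned long clean_kbytes,
				   unsigned long clean_low_kbytes,
				   unsigned long clean_min_kbytes,
				   bool near_oom)
{
	if (clean_kbytes < clean_min_kbytes)
		return false;	/* hard protection */
	if (clean_kbytes < clean_low_kbytes && !near_oom)
		return false;	/* best-effort protection */
	return true;
}
```

The vm.anon_min_kbytes knob corresponds to the hard branch alone, applied to anonymous pages.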
Protection of clean file pages using this knob may be used when swapping is still possible to - prevent disk I/O thrashing under memory pressure; - improve performance in disk cache-bound tasks under memory pressure. The default value is defined by CONFIG_CLEAN_LOW_KBYTES (suggested 0 in Kconfig). The vm.clean_min_kbytes sysctl knob provides *hard* protection of clean file pages. The file pages on the current node won't be reclaimed under memory pressure when the amount of clean file pages is below vm.clean_min_kbytes. Hard protection of clean file pages using this knob may be used to - prevent disk I/O thrashing under memory pressure even with no free swap space; - improve performance in disk cache-bound tasks under memory pressure; - avoid high latency and prevent livelock in near-OOM conditions. The default value is defined by CONFIG_CLEAN_MIN_KBYTES (suggested 0 in Kconfig). Signed-off-by: Alexey Avramov commit 1a60a347574e7dcff7d49302fb93af4eec0ad4a6 Author: Yu Zhao Date: Tue Jan 4 13:22:28 2022 -0700 mm: multigenerational lru: Kconfig Add configuration options for the multigenerational lru. Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov commit 1bec6b47268d90de7299ff36f386ef344c7154d0 Author: Yu Zhao Date: Tue Jan 4 13:22:27 2022 -0700 mm: multigenerational lru: user interface Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch. Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention. Compared with the size-based approach, e.g., [1], this time-based approach has the following advantages: 1) It's easier to configure because it's agnostic to applications and memory sizes. 2) It's more reliable because it's directly wired to the OOM killer. Add /sys/kernel/debug/lru_gen for working set estimation and proactive reclaim. 
Compared with the page table-based approach and the PFN-based approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has the following advantages: 1) It offers better choices because it's aware of memcgs, NUMA nodes, shared mappings and unmapped page cache. 2) It's more scalable because it's O(nr_hot_evictable_pages), whereas the PFN-based approach is O(nr_total_pages). Add /sys/kernel/debug/lru_gen_full for debugging. [1] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/ Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov commit 811d1a9d6a1421021dc083f7aa8a7122167a76df Author: Yu Zhao Date: Tue Jan 4 13:22:26 2022 -0700 mm: multigenerational lru: eviction The eviction consumes old generations. Given an lruvec, it scans pages on lrugen->lists[] indexed by min_seq%MAX_NR_GENS. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both are available from the same generation. Each generation is divided into multiple tiers. Tiers represent different ranges of numbers of accesses thru file descriptors. A page accessed N times thru file descriptors is in tier order_base_2(N). The feedback loop also monitors refaults over all tiers and decides when to promote pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The eviction sorts a page according to the gen counter if the aging has found this page accessed thru page tables, which completes the promotion of this page. The eviction also promotes a page to the next generation (min_seq+1 rather than max_seq) if this page was accessed multiple times thru file descriptors and the feedback loop has detected higher refaults from the tier this page is in. 
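The tier index order_base_2(N) grows logarithmically with the access count, so N = 0 and N = 1 share the first tier. A userspace sketch of the computation (the kernel provides order_base_2() as a macro in include/linux/log2.h):

```c
/* log2 rounded up: a page accessed n times thru file descriptors sits
 * in tier order_base_2(n); n = 0 and n = 1 both map to tier 0. */
static unsigned int order_base_2(unsigned long n)
{
	unsigned int order = 0;
	unsigned long pow = 1;

	while (pow < n) {
		pow <<= 1;
		order++;
	}
	return order;
}
```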
This approach has the following advantages: 1) It removes the cost of activation (recall the terms) in the buffered access path by inferring whether pages accessed multiple times thru file descriptors are statistically hot and thus worth promoting in the eviction path. 2) It takes pages accessed thru page tables into account and avoids overprotecting pages accessed multiple times thru file descriptors. 3) More tiers, which require additional bits in folio->flags, provide better protection for pages accessed more than twice thru file descriptors, when under heavy buffered I/O workloads. The eviction increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS is empty. Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov commit 22bfc0019ac38a0967faf35bf9939387d9ba47d3 Author: Yu Zhao Date: Tue Jan 4 13:22:25 2022 -0700 mm: multigenerational lru: aging To avoid confusion, the term "scan" will be applied to PTEs in a page table and pages on an lru list. It emphasizes consecutive elements in a set rather than the data structure holding this set together. The aging produces young generations. Given an lruvec, it iterates lruvec_memcg()->mm_list and calls walk_page_range() with each mm_struct on this list to scan PTEs for accessed pages. On finding a young PTE, it clears the accessed bit and updates the gen counter of the page mapped by this PTE to (max_seq%MAX_NR_GENS)+1. After each iteration of this list, it increments max_seq. The aging is needed before the eviction can continue when max_seq-min_seq+1 reaches MIN_NR_GENS. To avoid confusion, the terms "promotion" and "demotion" will be applied to the multigenerational lru, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive lru, as usual. IOW, the aging promotes a page to the youngest generation when it finds this page accessed thru page tables; demotion happens consequently when it creates a new generation. 
Note that promotion doesn't require any lru list operations in the aging path, only the update of the gen counter and the lru sizes; demotion, unless as the result of the creation of a new generation, requires lru list operations, e.g., lru_deactivate_fn(). The aging uses the following optimizations when walking page tables: 1) It uses the accessed bit in non-leaf PMD entries, the hint from the CPU scheduler and the Bloom filters to reduce its search space. 2) It doesn't zigzag between a PGD table and the same PMD or PTE table spanning multiple VMAs. In other words, it finishes all the VMAs within the range of the same PMD or PTE table before it returns to a PGD table. This improves the cache performance for workloads that have large numbers of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5. The aging is only interested in accessed pages and therefore has the complexity of O(nr_hot_evictable_pages). The worst case scenario is that the aging fails to exploit any spatial locality and the eviction has to promote all accessed pages when walking the rmap, which is similar to the active/inactive lru. However, generations can still provide better temporal locality. Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov commit 4dccd52500664d2a404eb3ba3bfe158b2a2a92b5 Author: Yu Zhao Date: Tue Jan 4 13:22:24 2022 -0700 mm: multigenerational lru: mm_struct list To exploit spatial locality, the aging prefers to walk page tables to search for young PTEs. And this patch paves the way for that. An mm_struct list is maintained for each memcg, and an mm_struct follows its owner task to the new memcg when this task is migrated. To avoid confusion, the term "iteration" specifically means the traversal of an entire mm_struct list; the term "walk" will be applied to page tables and the rmap, as usual. A page table walker, i.e., a thread in the aging path, iterates an mm_struct list and calls walk_page_range() with each mm_struct on this list. 
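The Bloom filters used to prune the aging's search space (optimization 1 above) record populated branches so that empty ones can be skipped; a set-membership test may report a false positive but never a false negative. A minimal two-hash sketch; the hash functions are arbitrary placeholders, not the kernel's:

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOOM_BITS 1024u

struct bloom {
	uint8_t bits[BLOOM_BITS / 8];
};

/* Two cheap hashes of a branch identifier (e.g. a table address). */
static void bloom_hash(uint64_t key, uint32_t *h1, uint32_t *h2)
{
	*h1 = (uint32_t)(key % BLOOM_BITS);
	*h2 = (uint32_t)((key * 2654435761u) % BLOOM_BITS);
}

/* Record a populated branch. */
static void bloom_set(struct bloom *b, uint64_t key)
{
	uint32_t h1, h2;

	bloom_hash(key, &h1, &h2);
	b->bits[h1 / 8] |= 1u << (h1 % 8);
	b->bits[h2 / 8] |= 1u << (h2 % 8);
}

/* False positives possible, false negatives not: an unset key is
 * guaranteed empty, so the walker can safely skip it. */
static bool bloom_test(const struct bloom *b, uint64_t key)
{
	uint32_t h1, h2;

	bloom_hash(key, &h1, &h2);
	return (b->bits[h1 / 8] & (1u << (h1 % 8))) &&
	       (b->bits[h2 / 8] & (1u << (h2 % 8)));
}
```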
The iteration finishes when it reaches the end of this list. When multiple page table walkers iterate the same list, each of them gets a unique mm_struct; therefore the aging can run concurrently.

This infra also provides the following optimizations:

1) It tracks the usage of mm_struct's between context switches so that page table walkers may skip processes that have been sleeping since the last iteration.
2) It provides generational Bloom filters to record populated branches so that page table walkers may reduce their search space based on the query results.

Signed-off-by: Yu Zhao
Tested-by: Konstantin Kharlamov

commit e45699af41cd0356c46135107969c2ba308e4cb2
Author: Yu Zhao
Date: Tue Jan 4 13:22:23 2022 -0700

mm: multigenerational lru: groundwork

Evictable pages are divided into multiple generations for each lruvec. The youngest generation number is stored in lrugen->max_seq for both anon and file types as they're aged on an equal footing. The oldest generation numbers are stored in lrugen->min_seq[] separately for anon and file types as clean file pages may be evicted regardless of swap constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into the gen counter in folio->flags. Each truncated generation number is an index to lrugen->lists[]. The sliding window technique is used to track at least MIN_NR_GENS and at most MAX_NR_GENS generations.

There are two conceptually independent processes (as in manufacturing process): "the aging", which produces young generations, and "the eviction", which consumes old generations. They form a closed-loop system, i.e., "the page reclaim". Both processes can be triggered separately from userspace for the purposes of working set estimation and proactive reclaim. These features are required to optimize job scheduling in data centers. The variable size of the sliding window is designed for such use cases.
To avoid confusion, the terms "hot" and "cold" will be applied to the multigenerational lru, as a new convention; the terms "active" and "inactive" will be applied to the active/inactive lru, as usual.

The protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels: one thru page tables and the other thru file descriptors. The protection of the former channel is by design stronger because:

1) The uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit.
2) The cost of evicting the former channel is higher due to the TLB flushes required and the likelihood of encountering the dirty bit.
3) The penalty of underprotecting the former channel is higher because applications usually don't prepare themselves for major faults like they do for blocked I/O. For example, GUI applications commonly use dedicated I/O threads to avoid blocking the rendering threads.

There are also two access patterns: one with temporal locality and the other without. For the reasons listed above, the former channel is assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is present, and the latter channel is assumed to follow the latter pattern unless outlying refaults have been observed. The "outlying refaults" will be addressed in [PATCH 07/10]. A few macros, i.e., LRU_REFS_*, used in that patch are added in this one to make the patchset less diffy.

A page is added to the youngest generation on faulting. The aging needs to check the accessed bit at least twice before handing this page over to the eviction. The first check takes care of the accessed bit set on the initial fault; the second check makes sure this page hasn't been used since then. This process, AKA second chance, requires a minimum of two generations, hence MIN_NR_GENS.
Signed-off-by: Yu Zhao
Tested-by: Konstantin Kharlamov

commit 22380a132a26c6e67f5a9a7a615fc9c339e9c18e
Author: Yu Zhao
Date: Tue Jan 4 13:22:22 2022 -0700

mm/vmscan.c: refactor shrink_node()

This patch refactors shrink_node() to improve readability for the upcoming changes to mm/vmscan.c.

Signed-off-by: Yu Zhao
Tested-by: Konstantin Kharlamov

commit 1d3d37788232f0cad7e5f46d1eae6cce28be4dd2
Author: Yu Zhao
Date: Tue Jan 4 13:22:21 2022 -0700

mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG

Some architectures support the accessed bit in non-leaf PMD entries, e.g., x86_64 sets the accessed bit in a non-leaf PMD entry when using it as part of linear address translation [1]. Page table walkers that clear the accessed bit may use this feature to reduce their search space.

Although an inline function is preferable, this capability is added as a configuration option for consistency with the existing macros.

[1]: Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3 (June 2021), section 4.8

Signed-off-by: Yu Zhao
Tested-by: Konstantin Kharlamov

commit f6abb14a93eef78e871fca379613c56702933f3f
Author: Yu Zhao
Date: Tue Jan 4 13:22:20 2022 -0700

mm: x86, arm64: add arch_has_hw_pte_young()

Some architectures automatically set the accessed bit in PTEs, e.g., x86 and arm64 v8.2. On architectures that don't have this capability, clearing the accessed bit in a PTE usually triggers a page fault following the TLB miss of this PTE.

Being aware of this capability can help make better decisions, e.g., whether to spread the work out over a period of time to avoid bursty page faults when trying to clear the accessed bit in a large number of PTEs.
Signed-off-by: Yu Zhao
Tested-by: Konstantin Kharlamov

commit 49f3adf9ef6a791bafcfaa6f01d2c6d9a27a9c3f
Author: André Almeida
Date: Mon Oct 25 09:49:42 2021 -0300

futex: Add entry point for FUTEX_WAIT_MULTIPLE (opcode 31)

Add an option to wait on multiple futexes using the old interface, which uses opcode 31 through the futex() syscall. Do that by simply translating the old interface to use the new code. This allows old and stable versions of Proton to still use fsync in new kernel releases.

Signed-off-by: André Almeida

commit 932fce1ed0596142382e78b221f4b965b705fc90
Author: Alexandre Frade
Date: Fri Jun 18 19:10:55 2021 +0000

XANMOD: Makefile: Turn off loop vectorization for GCC -O3 optimization level

Signed-off-by: Alexandre Frade

commit eb4a18e0cd72d65a9fe0afe3edcfd12be5fc7a0e
Author: Alexandre Frade
Date: Thu Sep 3 20:36:13 2020 +0000

XANMOD: init/Kconfig: Enable -O3 KBUILD_CFLAGS optimization for all architectures

Signed-off-by: Alexandre Frade

commit a2614c7e2a54e076363db8827c12b80db92c1f9d
Author: Alexandre Frade
Date: Thu Jun 25 16:40:43 2020 -0300

XANMOD: lib/kconfig.debug: disable default CONFIG_SYMBOLIC_ERRNAME and CONFIG_DEBUG_BUGVERBOSE

Signed-off-by: Alexandre Frade

commit 64901729773bf5581cc06878fb5abf8b2759ffd1
Author: Alexandre Frade
Date: Mon Jan 29 17:41:29 2018 +0000

XANMOD: scripts: disable the localversion "+" tag of a git repo

Signed-off-by: Alexandre Frade

commit 4752faf9f1e2437533722dcb60529d2670678646
Author: Alexandre Frade
Date: Tue Mar 31 13:32:08 2020 -0300

XANMOD: cpufreq: tunes ondemand and conservative governor for performance

Signed-off-by: Alexandre Frade

commit 2bd5fa59c26bfaa201a2b250cc70b65e30179b20
Author: Alexandre Frade
Date: Mon Jan 29 17:31:25 2018 +0000

XANMOD: mm/vmscan: vm_swappiness = 30 decreases the amount of swapping

Signed-off-by: Alexandre Frade

commit bb174dfdfa0fd21b14e8c4e7662f692c5b5e19c1
Author: Alexandre Frade
Date: Thu Aug 13 14:57:06 2020 +0000

XANMOD: sched/autogroup: Add kernel parameter and config option to enable/disable autogroup feature by default

Signed-off-by: Alexandre Frade

commit ee7dbc58b4fbb1a393c7f375744c3b91a875a3d6
Author: Alexandre Frade
Date: Mon Jan 29 16:59:22 2018 +0000

XANMOD: dcache: cache_pressure = 50 decreases the rate at which VFS caches are reclaimed

Signed-off-by: Alexandre Frade

commit f0b6b5339416631b845e3495c5625c83e9ed81ec
Author: Alexandre Frade
Date: Mon Jan 29 17:26:15 2018 +0000

XANMOD: kconfig: add 500Hz timer interrupt kernel config option

Signed-off-by: Alexandre Frade

commit a7f3f16374100d19b8e89768b97e421b74098f1d
Author: Alexandre Frade
Date: Mon Dec 14 16:24:26 2020 +0000

XANMOD: block: set rq_affinity to force full multithreading I/O requests

Signed-off-by: Alexandre Frade

commit 641a5091ac6fd1c08c097d94b3fcadb73477f3d2
Author: Alexandre Frade
Date: Thu Jan 6 16:59:01 2022 +0000

XANMOD: block/mq-deadline: Disable front_merges by default

Signed-off-by: Alexandre Frade

commit 05bdfcdb4122ff03c18c3f3e9ba5c59684484ef8
Author: Alexandre Frade
Date: Sun Aug 29 23:58:33 2021 +0000

XANMOD: fair: Remove all energy efficiency functions

Signed-off-by: Alexandre Frade