commit 05199477b9b1992177f147db22cd142e03a89090
Author: Alexandre Frade
Date:   Thu Nov 4 17:37:07 2021 +0000

    Linux 5.15.0-xanmod1

    Signed-off-by: Alexandre Frade

commit 4c1a01cd2937d77d07b88fdacc379aafde3adcdc
Author: Denis Pauk
Date:   Sun Oct 3 00:08:54 2021 +0300

    hwmon: (nct6775) Add additional ASUS motherboards.

    Add support for:
    * PRIME B360-PLUS
    * PRIME X570-PRO
    * ROG CROSSHAIR VIII FORMULA
    * ROG STRIX B550-I GAMING
    * ROG STRIX X570-F GAMING
    * ROG STRIX Z390-E GAMING
    * TUF GAMING B550-PRO
    * TUF GAMING Z490-PLUS
    * TUF GAMING Z490-PLUS (WI-FI)

    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=204807
    Signed-off-by: Denis Pauk
    Tested-by: matt-testalltheway
    Tested-by: Kamil Dudka
    Tested-by: Robert Swiecki
    Tested-by: Kamil Pietrzak
    Tested-by: Igor
    Tested-by: Tor Vic
    Tested-by: Poezevara
    Cc: Andy Shevchenko
    Cc: Guenter Roeck
    Reviewed-by: Andy Shevchenko
    Link: https://lore.kernel.org/r/20211002210857.709956-2-pauk.denis@gmail.com
    Signed-off-by: Guenter Roeck

commit 7d8d7c1c6cd6a376be9423164dea7d53ebe64aac
Author: Denis Pauk
Date:   Sat Sep 18 01:02:40 2021 +0300

    hwmon: (nct6775) Support access via Asus WMI

    Support accessing the NCT677x via Asus WMI functions.

    On mainboards that support this way of accessing the chip, the driver
    will usually not work without this option, since on these mainboards
    ACPI marks the I/O port as used.

    The code uses the ACPI firmware interface to communicate with the
    sensors on these ASUS motherboards:
    * PRIME B460-PLUS
    * ROG CROSSHAIR VIII IMPACT
    * ROG STRIX B550-E GAMING
    * ROG STRIX B550-F GAMING
    * ROG STRIX B550-F GAMING (WI-FI)
    * ROG STRIX Z490-I GAMING
    * TUF GAMING B550M-PLUS
    * TUF GAMING B550M-PLUS (WI-FI)
    * TUF GAMING B550-PLUS
    * TUF GAMING X570-PLUS
    * TUF GAMING X570-PRO (WI-FI)

    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=204807
    Signed-off-by: Denis Pauk
    Co-developed-by: Bernhard Seibold
    Signed-off-by: Bernhard Seibold
    Tested-by: Pär Ekholm
    Tested-by:
    Tested-by: Artem S. Tashkinov
    Tested-by: Vittorio Roberto Alfieri
    Tested-by: Sahan Fernando
    Cc: Andy Shevchenko
    Cc: Guenter Roeck
    Link: https://lore.kernel.org/r/20210917220240.56553-4-pauk.denis@gmail.com
    Signed-off-by: Guenter Roeck

commit 620ee9237f6dc2bf0cca2f5e70701e7275c48448
Author: Denis Pauk
Date:   Sat Sep 18 01:02:39 2021 +0300

    hwmon: (nct6775) Use nct6775_*() function pointers in nct6775_data.

    Prepare for platform-specific callbacks:
    * Use nct6775 function pointers in struct nct6775_data instead of
      direct calls.

    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=204807
    Signed-off-by: Denis Pauk
    Co-developed-by: Bernhard Seibold
    Signed-off-by: Bernhard Seibold
    Cc: Andy Shevchenko
    Cc: Guenter Roeck
    Reviewed-by: Guenter Roeck
    Link: https://lore.kernel.org/r/20210917220240.56553-3-pauk.denis@gmail.com
    Signed-off-by: Guenter Roeck

commit d213ad7733abc973bcda7d1916e60e44d2737d72
Author: Denis Pauk
Date:   Sat Sep 18 01:02:38 2021 +0300

    hwmon: (nct6775) Use superio_*() function pointers in sio_data.

    Prepare for platform-specific callbacks:
    * Rearrange code to use struct nct6775_sio_data directly in the
      superio_*() functions.
    * Use superio function pointers in struct nct6775_sio_data instead of
      direct calls.

    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=204807
    Signed-off-by: Denis Pauk
    Co-developed-by: Bernhard Seibold
    Signed-off-by: Bernhard Seibold
    Cc: Andy Shevchenko
    Cc: Guenter Roeck
    Reviewed-by: Guenter Roeck
    Link: https://lore.kernel.org/r/20210917220240.56553-2-pauk.denis@gmail.com
    Signed-off-by: Guenter Roeck

commit 52294dfe6c0c81bdf5f23c37a20b1ee3e1273145
Author: Alexandre Frade
Date:   Thu Oct 7 14:09:55 2021 +0000

    i2c: busses: Add SMBus capability to work with OpenRGB driver control

    Signed-off-by: Alexandre Frade

commit 7e5cd1955d9ccb8d23d90715f50ca34aa1d89206
Author: graysky
Date:   Tue Sep 14 15:35:34 2021 -0400

    x86/kconfig: more uarches for kernel 5.15+

    FEATURES
    This patch adds additional CPU options to the Linux kernel accessible under:
     Processor type and features  --->
      Processor family  --->

    With the release of gcc 11.1 and clang 12.0, several generic 64-bit levels
    are offered which are good for supported Intel or AMD CPUs:
    • x86-64-v2
    • x86-64-v3
    • x86-64-v4

    Users of glibc 2.33 and above can see which level is supported by the
    current hardware by running:
      /lib/ld-linux-x86-64.so.2 --help | grep supported

    Alternatively, compare the flags from /proc/cpuinfo to this list.[1]

    CPU-specific microarchitectures include:
    • AMD Improved K8-family
    • AMD K10-family
    • AMD Family 10h (Barcelona)
    • AMD Family 14h (Bobcat)
    • AMD Family 16h (Jaguar)
    • AMD Family 15h (Bulldozer)
    • AMD Family 15h (Piledriver)
    • AMD Family 15h (Steamroller)
    • AMD Family 15h (Excavator)
    • AMD Family 17h (Zen)
    • AMD Family 17h (Zen 2)
    • AMD Family 19h (Zen 3)†
    • Intel Silvermont low-power processors
    • Intel Goldmont low-power processors (Apollo Lake and Denverton)
    • Intel Goldmont Plus low-power processors (Gemini Lake)
    • Intel 1st Gen Core i3/i5/i7 (Nehalem)
    • Intel 1.5 Gen Core i3/i5/i7 (Westmere)
    • Intel 2nd Gen Core i3/i5/i7 (Sandybridge)
    • Intel 3rd Gen Core i3/i5/i7 (Ivybridge)
    • Intel 4th Gen Core i3/i5/i7 (Haswell)
    • Intel 5th Gen Core i3/i5/i7 (Broadwell)
    • Intel 6th Gen Core i3/i5/i7 (Skylake)
    • Intel 6th Gen Core i7/i9 (Skylake X)
    • Intel 8th Gen Core i3/i5/i7 (Cannon Lake)
    • Intel 10th Gen Core i7/i9 (Ice Lake)
    • Intel Xeon (Cascade Lake)
    • Intel Xeon (Cooper Lake)*
    • Intel 3rd Gen 10nm++ i3/i5/i7/i9-family (Tiger Lake)*
    • Intel 3rd Gen 10nm++ Xeon (Sapphire Rapids)‡
    • Intel 11th Gen i3/i5/i7/i9-family (Rocket Lake)‡
    • Intel 12th Gen i3/i5/i7/i9-family (Alder Lake)‡

    Notes: If not otherwise noted, gcc >=9.1 is required for support.
           *Requires gcc >=10.1 or clang >=10.0
           †Requires gcc >=10.3 or clang >=12.0
           ‡Requires gcc >=11.1 or clang >=12.0

    It also offers to compile passing the 'native' option, which "selects the
    CPU to generate code for at compilation time by determining the processor
    type of the compiling machine. Using -march=native enables all instruction
    subsets supported by the local machine and will produce code optimized for
    the local machine under the constraints of the selected instruction
    set."[2]

    Users of Intel CPUs should select the 'Intel-Native' option and users of
    AMD CPUs should select the 'AMD-Native' option.

    MINOR NOTES RELATING TO INTEL ATOM PROCESSORS
    This patch also changes -march=atom to -march=bonnell in accordance with
    the gcc v4.9 changes. Upstream is using the deprecated -march=atom flags
    when I believe it should use the newer -march=bonnell flag for atom
    processors.[3]

    It is not recommended to compile on Atom-CPUs with the 'native'
    option.[4] The recommendation is to use the 'atom' option instead.

    BENEFITS
    Small but real speed increases are measurable using a make endpoint
    comparing a generic kernel to one built with one of the respective
    microarchs.

    See the following experimental evidence supporting this statement:
    https://github.com/graysky2/kernel_gcc_patch

    REQUIREMENTS
    linux version >=5.15
    gcc version >=9.0 or clang version >=9.0

    ACKNOWLEDGMENTS
    This patch builds on the seminal work by Jeroen.[5]

    REFERENCES
    1.  https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/77566eb03bc6a326811cb7e9
    2.  https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-x86-Options
    3.  https://bugzilla.kernel.org/show_bug.cgi?id=77461
    4.  https://github.com/graysky2/kernel_gcc_patch/issues/15
    5.  http://www.linuxforge.net/docs/linux/linux-gcc.php

    Signed-off-by: graysky

commit 6356ef5e280adb1f72126158064abfb54b4beea1
Author: Mark Weiman
Date:   Sun Aug 12 11:36:21 2018 -0400

    pci: Enable overrides for missing ACS capabilities

    This is an updated version of Alex Williamson's patch from:
    https://lkml.org/lkml/2013/5/30/513

    Original commit message follows:

    PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
    allows us to control whether transactions are allowed to be redirected
    in various subnodes of a PCIe topology. For instance, if two endpoints
    are below a root port or downstream switch port, the downstream port
    may optionally redirect transactions between the devices, bypassing
    upstream devices. The same can happen internally on multifunction
    devices. The transaction may never be visible to the upstream devices.

    One upstream device that we particularly care about is the IOMMU. If a
    redirection occurs in the topology below the IOMMU, then the IOMMU
    cannot provide isolation between devices. This is why the PCIe spec
    encourages topologies to include ACS support. Without it, we have to
    assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.

    Unfortunately, far too many topologies do not support ACS to make this
    a steadfast requirement. Even the latest chipsets from Intel are only
    sporadically supporting ACS. We have trouble getting interconnect
    vendors to include the PCIe spec required PCIe capability, let alone
    suggested features.

    Therefore, we need to add some flexibility. The pcie_acs_override=
    boot option lets users opt-in specific devices or sets of devices to
    assume ACS support. The "downstream" option assumes full ACS support
    on root ports and downstream switch ports. The "multifunction" option
    assumes the subset of ACS features available on multifunction
    endpoints and upstream switch ports are supported. The "id:nnnn:nnnn"
    option enables ACS support on devices matching the provided vendor
    and device IDs, allowing more strategic ACS overrides. These options
    may be combined in any order. A maximum of 16 id-specific overrides
    are available. It's suggested to use the most limited set of options
    necessary to avoid completely disabling ACS across the topology.

    Note to hardware vendors: we have facilities to permanently quirk
    specific devices which enforce isolation but do not provide an ACS
    capability. Please contact me to have your devices added and save your
    customers the hassle of this boot option.

    Signed-off-by: Mark Weiman

commit b3ac51ec8a68eb1ca0da770a7dc5c94605beeaad
Author: Arjan van de Ven
Date:   Wed May 17 01:52:11 2017 +0000

    init: wait for partition and retry scan

    As Clear Linux boots fast, the device is not ready when the mounting
    code is reached, so the device scan is retried every 0.5 sec for at
    least 40 sec, and the async tasks are synchronized.

    Signed-off-by: Miguel Bernal Marin

commit e5d5e6bfe341893f2851c31a38f35bcf5a1d3317
Author: Arjan van de Ven
Date:   Thu Jun 2 23:36:32 2016 -0500

    drivers: initialize ata before graphics

    ATA init is the long pole in the boot process, and it's asynchronous.
    Move the graphics init after it so that ata and graphics initialize in
    parallel.

commit a1b07b9ee4a95b5be9a1abd450ed0663ba99cbe7
Author: Arjan van de Ven
Date:   Sun Feb 18 23:35:41 2018 +0000

    locking: rwsem: spin faster

    Tweak rwsem owner spinning a bit.

    Signed-off-by: Alexandre Frade

commit 3c329979ad17e838a9857fb2ddf6e883f16e12c7
Author: William Douglas
Date:   Wed Jun 20 17:23:21 2018 +0000

    firmware: Enable stateless firmware loading

    Search specific-version directories before generic ones, and /etc
    before /lib, so that the user can give specific overrides for both
    generic firmware and distribution firmware.

commit 20dd71d8d7f6da35eba65f1f67a293ff6aed446c
Author: Arjan van de Ven
Date:   Sun Sep 22 11:12:35 2019 -0300

    intel_rapl: Silence rapl trace debug

commit a2b5b68f1ea4795bb5eda6f621ec3a347455f5da
Author: Christian Brauner
Date:   Wed Jan 23 21:54:23 2019 +0100

    SAUCE: binder: give binder_alloc its own debug mask file

    Currently, both binder.c and binder_alloc.c register the
    /sys/module/binder_linux/parameters/debug_mask file, which leads to
    conflicts in sysfs. This commit gives binder_alloc.c its own
    /sys/module/binder_linux/parameters/alloc_debug_mask file.

    Signed-off-by: Christian Brauner
    Signed-off-by: Seth Forshee

commit 533c4a27ae7bb5d8d508ddee68b008e215dfc84d
Author: Christian Brauner
Date:   Wed Jan 16 23:13:25 2019 +0100

    SAUCE: binder: turn into module

    The Android binder driver needs to become a module for the sake of
    shipping Anbox. To do this we need to export the following functions,
    since binder is currently still using them:

    - security_binder_set_context_mgr()
    - security_binder_transaction()
    - security_binder_transfer_binder()
    - security_binder_transfer_file()
    - can_nice()
    - __close_fd_get_file()
    - mmput_async()
    - task_work_add()
    - map_kernel_range_noflush()
    - get_vm_area()
    - zap_page_range()
    - put_ipc_ns()
    - get_ipc_ns_exported()
    - show_init_ipc_ns()

    Signed-off-by: Christian Brauner
    [ saf: fix additional reference to init_ipc_ns from 5.0-rc6 ]
    Signed-off-by: Seth Forshee

commit ca8e24c237b5cac4100e7c0a8ddc4f0bdf3d4b30
Author: Christian Brauner
Date:   Wed Jun 20 19:21:37 2018 +0200

    SAUCE: ashmem: turn into module

    The Android ashmem driver needs to become a module for the sake of
    Anbox. To do this we need to export shmem_zero_setup(), since ashmem
    is currently using it.

    Note, the abomination that is the Android ashmem driver will go away
    in the not so distant future in favour of memfds.

    Signed-off-by: Christian Brauner
    Signed-off-by: Seth Forshee

commit 35f3f041b875c5acb7dc10d13fc5288eed3e475a
Author: Serge Hallyn
Date:   Fri May 31 19:12:12 2013 +0100

    sysctl: add sysctl to disallow unprivileged CLONE_NEWUSER by default

    This is a short-term patch. Unprivileged use of CLONE_NEWUSER is
    certainly an intended feature of user namespaces. However, for at
    least saucy we want to make sure that, if any security issues are
    found, we have a fail-safe.

    Signed-off-by: Serge Hallyn
    [bwh: Remove unneeded binary sysctl bits]
    [bwh: Keep this sysctl, but change the default to enabled]

commit b2851fc71697197d10adf30fbe4332b2e8cacfa6
Author: Nathan Chancellor
Date:   Thu Oct 21 13:23:53 2021 -0700

    lib: zstd: Add cast to silence clang's -Wbitwise-instead-of-logical

    A new warning in clang warns that there is an instance where boolean
    expressions are being used with bitwise operators instead of logical
    ones:

    lib/zstd/decompress/huf_decompress.c:890:25: warning: use of bitwise '&' with boolean operands [-Wbitwise-instead-of-logical]
                           (BIT_reloadDStreamFast(&bitD1) == BIT_DStream_unfinished)
                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    zstd does this frequently to help with performance, as logical
    operators have branches whereas bitwise ones do not.

    To fix this warning in other cases, the expressions were placed on
    separate lines with the '&=' operator; however, this particular
    instance was moved away from that so that it could be surrounded by
    LIKELY, which is a macro for __builtin_expect(), to help with a
    performance regression, according to upstream zstd pull #1973.

    Aside from switching to logical operators, which is likely undesirable
    in this instance, or disabling the warning outright, the solution is
    casting one of the expressions to an integer type to make it clear to
    clang that the author knows what they are doing. Add a cast to U32 to
    silence the warning.

    The first U32 cast is to silence an instance of -Wshorten-64-to-32
    because __builtin_expect() returns long so it cannot be moved.

    Link: https://github.com/ClangBuiltLinux/linux/issues/1486
    Link: https://github.com/facebook/zstd/pull/1973
    Reported-by: Nick Desaulniers
    Signed-off-by: Nathan Chancellor

commit fdf1bbfdb160651940e102a2b0e7c0ad7cefbb15
Author: Paweł Jasiak
Date:   Fri Oct 8 13:37:59 2021 +0200

    kbuild: Add make tarzst-pkg build option

    Add tarzst-pkg and perf-tarzst-src-pkg targets to build zstd
    compressed tarballs.

    Signed-off-by: Paweł Jasiak
    Signed-off-by: Masahiro Yamada

commit 0c4ef9d82067de65e4e3effd1a5732678619b0b4
Author: Nick Terrell
Date:   Mon Apr 26 16:34:03 2021 -0700

    MAINTAINERS: Add maintainer entry for zstd

    Adds a maintainer entry for zstd listing myself as the maintainer for
    all zstd code, pointing to the upstream issues tracker for bugs, and
    listing my linux repo as the tree.

    Signed-off-by: Nick Terrell

commit 9eabe27c672c05df961f10fe754b75778261f3e3
Author: Nick Terrell
Date:   Fri Sep 11 16:37:08 2020 -0700

    lib: zstd: Upgrade to latest upstream zstd version 1.4.10

    Upgrade to the latest upstream zstd version 1.4.10.

    This patch is 100% generated from upstream zstd commit 20821a46f412 [0].

    This patch is very large because it is transitioning from the custom
    kernel zstd to using upstream directly. The new zstd follows upstream's
    file structure, which is different. Future update patches will be much
    smaller because they will only contain the changes from one upstream
    zstd release.

    As an aid for review I've created a commit [1] that shows the diff
    between upstream zstd as-is (which doesn't compile), and the zstd code
    imported in this patch. The version of zstd in this patch is generated
    from upstream with changes applied by automation to replace upstream's
    libc dependencies, remove unnecessary portability macros, replace `/**`
    comments with `/*` comments, and use the kernel's xxhash instead of
    bundling it.

    The benefits of this patch are as follows:
    1. Using upstream directly with an automated script to generate kernel
       code. This allows us to update the kernel every upstream release, so
       the kernel gets the latest bug fixes and performance improvements,
       and doesn't get 3 years out of date again. The automation and the
       translated code are tested every upstream commit to ensure it
       continues to work.
    2. Upgrades from a custom zstd based on 1.3.1 to 1.4.10, getting 3
       years of performance improvements and bug fixes. On x86_64 I've
       measured 15% faster BtrFS and SquashFS decompression+read speeds,
       35% faster kernel decompression, and 30% faster ZRAM
       decompression+read speeds.
    3. Zstd-1.4.10 supports negative compression levels, which allow zstd
       to match or subsume lzo's performance.
    4. Maintains the same kernel-specific wrapper API, so no callers have
       to be modified with zstd version updates.

    One concern that was brought up was stack usage. Upstream zstd had
    already removed most of its heavy stack usage functions, but I just
    removed the last functions that allocate arrays on the stack. I've
    measured the high water mark for both compression and decompression
    before and after this patch. Decompression is approximately neutral,
    using about 1.2KB of stack space. Compression levels up to 3 regressed
    from 1.4KB -> 1.6KB, and higher compression levels regressed from
    1.5KB -> 2KB. We've added unit tests upstream to prevent further
    regression. I believe that this is a reasonable increase, and if it
    does end up causing problems, this commit can be cleanly reverted,
    because it only touches zstd.

    I chose the bulk update instead of replaying upstream commits because
    there have been ~3500 upstream commits since the 1.3.1 release, zstd
    wasn't ready to be used in the kernel as-is before a month ago, and
    not all upstream zstd commits build. The bulk update preserves
    bisectability because bugs can be bisected to the zstd version update.
    At that point the update can be reverted, and we can work with
    upstream to find and fix the bug.

    Note that upstream zstd release 1.4.10 doesn't exist yet. I have cut a
    staging branch at 20821a46f412 [0] and will apply any changes requested
    to the staging branch. Once we're ready to merge this update I will cut
    a zstd release at the commit we merge, so we have a known zstd release
    in the kernel.

    The implementation of the kernel API is contained in
    zstd_compress_module.c and zstd_decompress_module.c.

    [0] https://github.com/facebook/zstd/commit/20821a46f4122f9abd7c7b245d28162dde8129c9
    [1] https://github.com/terrelln/linux/commit/e0fa481d0e3df26918da0a13749740a1f6777574

    Signed-off-by: Nick Terrell

commit 5d44201d9edc28c2faf51f9788c0495c5947fdb2
Author: Nick Terrell
Date:   Mon Sep 14 12:54:12 2020 -0700

    lib: zstd: Add decompress_sources.h for decompress_unzstd

    Adds decompress_sources.h, which includes every .c file necessary for
    zstd decompression. This is used in decompress_unzstd.c so the
    internal structure of the library isn't exposed.

    This allows us to upgrade the zstd library version without modifying
    any callers. Instead we just need to update decompress_sources.h.

    Signed-off-by: Nick Terrell

commit a92bf084d33676d0d46a39ebeaf67574aceba9a7
Author: Nick Terrell
Date:   Fri Sep 11 16:49:00 2020 -0700

    lib: zstd: Add kernel-specific API

    This patch:
    - Moves `include/linux/zstd.h` -> `include/linux/zstd_lib.h`
    - Updates modified zstd headers to yearless copyright
    - Adds a new API in `include/linux/zstd.h` that is functionally
      equivalent to the in-use subset of the current API. Functions are
      renamed to avoid symbol collisions with zstd, to make it clear it is
      not the upstream zstd API, and to follow the kernel style guide.
    - Updates all callers to use the new API.

    There are no functional changes in this patch. Since there are no
    functional changes, I felt it was okay to update all the callers in a
    single patch. Once the API is approved, the callers are mechanically
    changed.

    This patch is preparing for the 3rd patch in this series, which
    updates zstd to version 1.4.10.
    Since the upstream zstd API is no longer exposed to callers, the
    update can happen transparently.

    Signed-off-by: Nick Terrell

commit d0fe816e6098462c7985381a4096913cb179cbe1
Author: Huang Rui
Date:   Fri Oct 29 21:02:41 2021 +0800

    Documentation: amd-pstate: add amd-pstate driver introduction

    Introduce the amd-pstate driver design and implementation.

    Signed-off-by: Huang Rui

commit 6d3e84c31f5c8873bb02ff2a10bd1fd073500651
Author: Huang Rui
Date:   Fri Oct 29 21:02:40 2021 +0800

    cpupower: print amd-pstate information on cpupower

    The amd-pstate kernel module uses fine-grained frequency control
    instead of ACPI hardware P-states, so its performance and frequency
    values should be printed by frequency-info.

    Signed-off-by: Huang Rui

commit 0f68a352aab7ec550ef100f5a6b7f4e2fb52d457
Author: Huang Rui
Date:   Fri Oct 29 21:02:39 2021 +0800

    cpupower: move print_speed function into misc helper

    print_speed() can serve as a common function, so expose it in the misc
    helper header. Then it can be used in other helper files as well.

    Signed-off-by: Huang Rui

commit 2e3c1f40a15f5497b5e99f3296afc6f112e59899
Author: Huang Rui
Date:   Fri Oct 29 21:02:38 2021 +0800

    cpupower: enable boost state support for amd-pstate module

    The legacy ACPI hardware P-States function has 3 P-States in the ACPI
    table, and the CPU frequency can only be switched between those 3
    P-States. If the processor supports the boost state, it has an
    additional boost state whose frequency can be higher than the P0
    state; this state can be decoded by decode_pstates() and read by
    amd_pci_get_num_boost_states().

    However, the new AMD P-States function differs from the legacy ACPI
    hardware P-States on AMD processors: it has a finer-grained frequency
    range between the highest and lowest frequency. The boost frequency is
    the frequency mapped to the highest performance ratio, and the
    counterpart of the former P0 frequency is mapped to the nominal
    performance ratio.

    If the highest performance on the processor is higher than the nominal
    performance, the current processor is considered to support the boost
    state, and amd_pstate_boost_init() is used to initialize boost for the
    AMD P-States function.

    Signed-off-by: Huang Rui

commit e843b0dcbe9d0ee85e08ce9a7f11dd37a41246f6
Author: Huang Rui
Date:   Fri Oct 29 21:02:37 2021 +0800

    cpupower: add amd-pstate sysfs definition and access helper

    Introduce the macro definitions and access helper function for the
    amd-pstate sysfs interfaces, such as the performance goals and
    frequency levels, in the amd helper file. They will be used by the
    cpupower utilities to read the sysfs attributes of the amd-pstate
    cpufreq driver.

    Signed-off-by: Huang Rui

commit d391c3172281e8ae86b298a8d5e6361dc53a6712
Author: Huang Rui
Date:   Fri Oct 29 21:02:36 2021 +0800

    cpupower: add the function to get the sysfs value from specific table

    Expose the helper in the cpufreq header, so that a cpufreq driver can
    use this function to get the sysfs value if it has any specific sysfs
    interfaces.

    Signed-off-by: Huang Rui

commit d5bc8e3e262cadd9cba755a5e83416695756d5e5
Author: Huang Rui
Date:   Fri Oct 29 21:02:35 2021 +0800

    cpupower: initial AMD P-state capability

    If the kernel has started the amd-pstate module, cpupower initializes
    the capability flag CPUPOWER_CAP_AMD_PSTATE. Once the amd-pstate
    capability is set, the legacy ACPI-related capabilities no longer need
    to be set.

    Signed-off-by: Huang Rui

commit 3223a43bae34c0db8c83798134170aabdfce656b
Author: Huang Rui
Date:   Fri Oct 29 21:02:34 2021 +0800

    cpupower: add the function to check amd-pstate enabled

    Processors with the amd-pstate function also support the legacy ACPI
    hardware P-States feature. Once the driver enables amd-pstate, the
    processor responds to the finer-grained amd-pstate controls instead
    of the legacy ACPI P-States. So introduce cpupower_amd_pstate_enabled()
    to check whether the current kernel uses the amd-pstate or the
    acpi-cpufreq module.

    Signed-off-by: Huang Rui

commit 2be733db76f7b0d713f09c13e6c21dd168d4101d
Author: Huang Rui
Date:   Fri Oct 29 21:02:33 2021 +0800

    cpupower: add AMD P-state capability flag

    Add an AMD P-state capability flag in cpupower to indicate support for
    the new AMD P-state kernel module on Ryzen processors.

    Signed-off-by: Huang Rui

commit b89d41fa5e10df868a29444749b3e9811474697a
Author: Huang Rui
Date:   Fri Oct 29 21:02:32 2021 +0800

    cpufreq: amd: add amd-pstate performance attributes

    Introduce sysfs attributes to get the amd-pstate performance values at
    the different levels.

    Signed-off-by: Huang Rui

commit dd80c527df00ada7308e0bb3d3919622281ae180
Author: Huang Rui
Date:   Fri Oct 29 21:02:31 2021 +0800

    cpufreq: amd: add amd-pstate frequencies attributes

    Introduce sysfs attributes to get the processor frequencies at the
    different levels.

    Signed-off-by: Huang Rui

commit 9f1e5af9f6c16364167edae13cdac877742027c6
Author: Huang Rui
Date:   Fri Oct 29 21:02:30 2021 +0800

    cpufreq: amd: add boost mode support for amd-pstate

    If the SBIOS supports the boost mode of amd-pstate, enable boost by
    default.

    Signed-off-by: Huang Rui

commit de13cc00ce1dc1affd4029461320d7d252b209b6
Author: Huang Rui
Date:   Fri Oct 29 21:02:29 2021 +0800

    cpufreq: amd: add trace for amd-pstate module

    Add a trace event to monitor the performance value changes made by the
    CPU governors.

    Signed-off-by: Huang Rui

commit a97ebea71a980752c4852641dee71c1fc60a790a
Author: Huang Rui
Date:   Fri Oct 29 21:02:28 2021 +0800

    cpufreq: amd: add acpi cppc function as the backend for legacy processors

    Some older Zen-based processors use the shared memory exposed by the
    ACPI SBIOS.

    Signed-off-by: Jinzhou Su
    Signed-off-by: Huang Rui

commit 57448d13ce9afbd5672ea397439d2c60d6ff4415
Author: Huang Rui
Date:   Fri Oct 29 21:02:27 2021 +0800

    cpufreq: amd: add fast switch function for amd-pstate

    Introduce the fast switch function for amd-pstate on AMD processors
    that support full MSR register control.
    This reduces latency in interrupt context.

    Signed-off-by: Huang Rui

commit 7d9f5592ddfd7bf536b3ece185f6bc8ae3f19dd6
Author: Huang Rui
Date:   Fri Oct 29 21:02:26 2021 +0800

    cpufreq: amd: introduce a new amd pstate driver to support future processors

    amd-pstate is the AMD CPU performance scaling driver that introduces a
    new CPU frequency control mechanism on AMD Zen-based CPU series in the
    Linux kernel. The new mechanism is based on Collaborative Processor
    Performance Control (CPPC), which provides finer-grained frequency
    management than legacy ACPI hardware P-States. Current AMD CPU
    platforms use the ACPI P-states driver to manage CPU frequency and
    clocks, switching between only 3 P-states. AMD P-States replaces the
    ACPI P-states controls and provides a flexible, low-latency interface
    for the Linux kernel to communicate performance hints directly to the
    hardware.

    "amd-pstate" leverages the Linux kernel governors such as *schedutil*,
    *ondemand*, etc. to manage the performance hints which are provided by
    the CPPC hardware functionality. The first version of amd-pstate
    supports one of the Zen3 processors, and we will support more in the
    future after we verify the hardware and SBIOS functionalities.

    There are two types of hardware implementations for amd-pstate: one is
    full MSR support and the other is shared memory support. The
    X86_FEATURE_AMD_CPPC_EXT feature flag can be used to distinguish the
    two types.

    Using the new AMD P-States method together with the kernel governors
    (*schedutil*, *ondemand*, ...) to manage the frequency update is the
    most appropriate bridge between AMD Zen-based processors and the Linux
    kernel: the processor is able to adjust to the most efficient
    frequency according to the kernel scheduler load.
Performance Per Watt (PPW) Caculation: The PPW caculation is referred by below paper: https://software.intel.com/content/dam/develop/external/us/en/documents/performance-per-what-paper.pdf Below formula is referred from below spec to measure the PPW: (F / t) / P = F * t / (t * E) = F / E, "F" is the number of frames per second. "P" is power measurd in watts. "E" is energy measured in joules. We use the RAPL interface with "perf" tool to get the energy data of the package power. The data comparsions between amd-pstate and acpi-freq module are tested on AMD Cezanne processor: 1) TBench CPU benchmark: +---------------------------------------------------------------------+ | | | TBench (Performance Per Watt) | | Higher is better | +-------------------+------------------------+------------------------+ | | Performance Per Watt | Performance Per Watt | | Kernel Module | (Schedutil) | (Ondemand) | | | Unit: MB / (s * J) | Unit: MB / (s * J) | +-------------------+------------------------+------------------------+ | | | | | acpi-cpufreq | 3.022 | 2.969 | | | | | +-------------------+------------------------+------------------------+ | | | | | amd-pstate | 3.131 | 3.284 | | | | | +-------------------+------------------------+------------------------+ 2) Gitsource CPU benchmark: +---------------------------------------------------------------------+ | | | Gitsource (Performance Per Watt) | | Higher is better | +-------------------+------------------------+------------------------+ | | Performance Per Watt | Performance Per Watt | | Kernel Module | (Schedutil) | (Ondemand) | | | Unit: 1 / (s * J) | Unit: 1 / (s * J) | +-------------------+------------------------+------------------------+ | | | | | acpi-cpufreq | 3.42172E-07 | 2.74508E-07 | | | | | +-------------------+------------------------+------------------------+ | | | | | amd-pstate | 4.09141E-07 | 3.47610E-07 | | | | | +-------------------+------------------------+------------------------+ 3) Speedometer 2.0 CPU 
benchmark: +---------------------------------------------------------------------+ | | | Speedometer 2.0 (Performance Per Watt) | | Higher is better | +-------------------+------------------------+------------------------+ | | Performance Per Watt | Performance Per Watt | | Kernel Module | (Schedutil) | (Ondemand) | | | Unit: 1 / (s * J) | Unit: 1 / (s * J) | +-------------------+------------------------+------------------------+ | | | | | acpi-cpufreq | 0.116111767 | 0.110321664 | | | | | +-------------------+------------------------+------------------------+ | | | | | amd-pstate | 0.115825281 | 0.122024299 | | | | | +-------------------+------------------------+------------------------+ According to above average data, we can see this solution has shown better performance per watt scaling on mobile CPU benchmarks in most of cases. Signed-off-by: Huang Rui commit 507e3fa01595c1a1ba29c6100b086d5f393b0279 Author: Jinzhou Su Date: Fri Oct 29 21:02:25 2021 +0800 ACPI: CPPC: add cppc enable register function Add a new function to enable CPPC feature. This function will write Continuous Performance Control package EnableRegister field on the processor. CPPC EnableRegister register described in section 8.4.7.1 of ACPI 6.4: This element is optional. If supported, contains a resource descriptor with a single Register() descriptor that describes a register to which OSPM writes a One to enable CPPC on this processor. Before this register is set, the processor will be controlled by legacy mechanisms (ACPI Pstates, firmware, etc.). This register will be used for AMD processors to enable amd-pstate function instead of legacy ACPI P-States. Signed-off-by: Jinzhou Su Signed-off-by: Huang Rui commit 281d2f12565255e4b6ec89860e3991ed50baed63 Author: Mario Limonciello Date: Fri Oct 29 21:02:24 2021 +0800 ACPI: CPPC: Check present CPUs for determining _CPC is valid As this is a static check, it should be based upon what is currently present on the system. 
This makes probing more deterministic.

When the local APIC flags field (lapic_flags) of a CPU core in the MADT table is 0, the CPU core won't be enabled. In this case, _CPC won't be found for this core, and walking through the possible CPUs (which include disabled CPUs) reports _CPC as invalid. This is not expected, so switch to checking the present CPUs instead.

Reported-by: Jinzhou Su
Signed-off-by: Mario Limonciello
Signed-off-by: Huang Rui

commit 006bf956420e1a115671ffc65b8e4f8d3d9ca401
Author: Steven Noonan
Date: Fri Oct 29 21:02:23 2021 +0800

ACPI: CPPC: implement support for SystemIO registers

According to the ACPI v6.2 (and later) specification, SystemIO can be used for _CPC registers. This teaches cppc_acpi how to handle such registers.

This patch was tested using the amd_pstate driver on my Zephyrus G15 (model GA503QS) with the current version 410 BIOS, which uses a SystemIO register for the HighestPerformance element in _CPC.

Signed-off-by: Steven Noonan
Signed-off-by: Huang Rui

commit 5abba2f85d7e070decf56ef35ce6a414a0a8669b
Author: Huang Rui
Date: Fri Oct 29 21:02:22 2021 +0800

x86/msr: add AMD CPPC MSR definitions

The AMD CPPC (Collaborative Processor Performance Control) function uses MSR registers to manage the performance hints, so add the MSR register macros here.

Signed-off-by: Huang Rui

commit a00d065f4b06d01582052b774050f3804f696a75
Author: Huang Rui
Date: Fri Oct 29 21:02:21 2021 +0800

x86/cpufeatures: add AMD Collaborative Processor Performance Control feature flag

Add the Collaborative Processor Performance Control feature flag for AMD processors. This feature flag will be used by the upcoming amd-pstate driver.

The amd-pstate driver has two approaches to implementing the frequency control behavior, depending on the CPU hardware implementation: one is "Full MSR Support" and the other is "Shared Memory Support". The feature flag indicates processors with "Full MSR Support".
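For illustration only: the following user-space C sketch checks the CPUID bit that, to our understanding, backs the kernel's X86_FEATURE_CPPC flag (CPUID Fn8000_0008, EBX bit 27). The bit position is an assumption not stated in this patch, the helper names ebx_has_cppc() and cpu_has_cppc() are hypothetical, and <cpuid.h> is the GCC/Clang header that exists only on x86.

```c
#include <stdbool.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Assumed location of the AMD CPPC bit: CPUID Fn8000_0008, EBX bit 27. */
#define AMD_CPPC_LEAF    0x80000008u
#define AMD_CPPC_EBX_BIT 27

/* Hypothetical helper: extract the CPPC bit from a raw EBX value. */
static bool ebx_has_cppc(uint32_t ebx)
{
	return (ebx >> AMD_CPPC_EBX_BIT) & 1u;
}

/* Hypothetical helper: ask the running CPU ("Full MSR Support" case). */
static bool cpu_has_cppc(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid() returns 0 if the extended leaf is not supported. */
	if (!__get_cpuid(AMD_CPPC_LEAF, &eax, &ebx, &ecx, &edx))
		return false;
	return ebx_has_cppc(ebx);
#else
	return false;	/* not an x86 CPU, so no CPUID */
#endif
}
```

A false result only says that the "Full MSR Support" path is unavailable; as described above, other processors may rely on the "Shared Memory Support" approach instead.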
Signed-off-by: Huang Rui
Acked-by: Borislav Petkov

commit 1243cc77acdaaa677a201d6bd5611fba374edfee
Author: Stephan Mueller
Date: Mon Aug 2 22:01:09 2021 +0200

char/lrng: add power-on and runtime self-tests

Parts of the LRNG are already covered by self-tests, including:

* Self-test of the SP800-90A DRBG provided by the Linux kernel crypto API.
* Self-test of the PRNG provided by the Linux kernel crypto API.
* Raw noise source data testing, including SP800-90B compliant tests when CONFIG_LRNG_HEALTH_TESTS is enabled.

This patch adds self-tests for the remaining critical functions of the LRNG that are essential to maintain entropy and provide cryptographically strong random numbers. The following self-tests are implemented:

* Self-test of the time array maintenance. This test verifies that the time stamp array management, which stores multiple values in one integer, implements a concatenation of the data.
* Self-test of the software hash implementation, which ensures that this function complies with the FIPS 180-4 specification. The self-test performs a hash operation of a zeroized per-CPU data array.
* Self-test of the ChaCha20 DRNG, based on the self-tests that are already present and implemented with the stand-alone user space ChaCha20 DRNG implementation available at [1]. The self-tests cover different use cases of the DRNG seeded with known seed data.

The status of the LRNG self-tests is provided with the selftest_status SysFS file. If the file contains a zero, the self-tests passed. The value 0xffffffff means that the self-tests were not executed. Any other value indicates a self-test failure.

The self-test may be compiled to panic the system if the self-test fails.

All self-tests operate on private state data structures. This implies that none of the self-tests have any impact on the regular LRNG operations. This allows the self-tests to be repeated at runtime by writing anything into the selftest_status SysFS file.
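The time array self-test listed above can be sketched as follows. This is not the LRNG code: it assumes, following the design description elsewhere in this series, that each time stamp is truncated to its 8 least significant bits and that four such slots are packed into each 32-bit word of a 64-slot array; the names time_array_add() and time_array_selftest() are hypothetical. The test passes only if the packed words equal the plain concatenation of the truncated values.

```c
#include <stdbool.h>
#include <stdint.h>

#define SLOT_BITS      8			/* bits kept per time stamp */
#define SLOTS_PER_WORD (32 / SLOT_BITS)	/* four slots per 32-bit word */
#define ARRAY_SLOTS    64
#define ARRAY_WORDS    (ARRAY_SLOTS / SLOTS_PER_WORD)

/* Hypothetical: store the 8 LSBs of time stamp ts in slot idx of arr. */
static void time_array_add(uint32_t *arr, unsigned int idx, uint32_t ts)
{
	unsigned int word = idx / SLOTS_PER_WORD;
	unsigned int shift = (idx % SLOTS_PER_WORD) * SLOT_BITS;

	arr[word] &= ~(0xffu << shift);		/* clear the slot */
	arr[word] |= (ts & 0xffu) << shift;	/* truncate and insert */
}

/*
 * Hypothetical self-test: add time stamps whose low byte is the slot
 * index (upper bits must be discarded), then check that every word holds
 * four consecutive indexes, i.e. the byte-wise concatenation.
 */
static bool time_array_selftest(void)
{
	uint32_t arr[ARRAY_WORDS] = { 0 };
	unsigned int i;

	for (i = 0; i < ARRAY_SLOTS; i++)
		time_array_add(arr, i, 0xabcd00u | i);

	for (i = 0; i < ARRAY_WORDS; i++) {
		uint32_t base = i * SLOTS_PER_WORD;
		uint32_t expected = base | (base + 1) << 8 |
				    (base + 2) << 16 | (base + 3) << 24;

		if (arr[i] != expected)
			return false;
	}
	return true;
}
```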
[1] https://www.chronox.de/chacha20.html

CC: Torsten Duwe
CC: "Eric W. Biederman"
CC: "Alexander E. Patrakov"
CC: "Ahmed S. Darwish"
CC: "Theodore Y. Ts'o"
CC: Willy Tarreau
CC: Matthew Garrett
CC: Vito Caputo
CC: Andreas Dilger
CC: Jan Kara
CC: Ray Strode
CC: William Jon McCann
CC: zhangjs
CC: Andy Lutomirski
CC: Florian Weimer
CC: Lennart Poettering
CC: Nicolai Stange
CC: Marcelo Henrique Cerri
CC: Neil Horman
Reviewed-by: Alexander Lobakin
Tested-by: Alexander Lobakin
Signed-off-by: Stephan Mueller

commit 64564cc8c19abf7da0bdb36fffb06b3433a44dbf
Author: Stephan Mueller
Date: Mon Aug 2 22:08:30 2021 +0200

char/lrng: add interface for gathering of raw entropy

The test interface allows a privileged process to capture the raw unconditioned noise that is collected by the LRNG for statistical analysis. Such testing allows analyzing how much entropy the interrupt noise source provides on a given platform. Extracted noise data is not used to seed the LRNG. This is a test interface and not appropriate for production systems. Yet, the interface is considered to be sufficiently secured for production systems.

Access to the data is given through the lrng_raw debugfs file. The data buffer size should be a multiple of sizeof(u32) to fill the entire buffer. With the option lrng_testing.boot_test=1, the raw noise of the first 1000 entropy events since boot can be sampled.

This test interface allows generating the data required to analyze whether the LRNG complies with SP800-90B sections 3.1.3 and 3.1.4.

In addition, the test interface allows gathering of the concatenated raw entropy data to verify that the concatenation works appropriately.
This includes sampling of the following raw data:

* high-resolution time stamp
* Jiffies
* IRQ number
* IRQ flags
* return instruction pointer
* interrupt register state
* array logic batching the high-resolution time stamp
* enabling the runtime configuration of entropy source entropy rates

Also, a testing interface to support ACVT of the hash implementation is provided. Only hash testing is supported (as opposed to also providing testing for the DRNG) because the LRNG software hash implementation contains glue code that may warrant testing in addition to the testing of the software ciphers via the kernel crypto API. Also, testing the CTR-DRBG would require testing the underlying AES implementation. However, such an AES test interface cannot be provided by the LRNG as it has no means to access the AES operation.

Finally, the execution duration for processing a time stamp can be obtained with the LRNG raw entropy interface.

If a test interface is not compiled, its code is a no-op which has no impact on the performance.

CC: Torsten Duwe
CC: "Eric W. Biederman"
CC: "Alexander E. Patrakov"
CC: "Ahmed S. Darwish"
CC: "Theodore Y. Ts'o"
CC: Willy Tarreau
CC: Matthew Garrett
CC: Vito Caputo
CC: Andreas Dilger
CC: Jan Kara
CC: Ray Strode
CC: William Jon McCann
CC: zhangjs
CC: Andy Lutomirski
CC: Florian Weimer
CC: Lennart Poettering
CC: Nicolai Stange
Reviewed-by: Alexander Lobakin
Tested-by: Alexander Lobakin
Reviewed-by: Roman Drahtmueller
Tested-by: Marcelo Henrique Cerri
Tested-by: Neil Horman
Signed-off-by: Stephan Mueller

commit 1fba41ef7445f500a44e11e805e4214173a744ff
Author: Stephan Mueller
Date: Mon Aug 2 21:59:23 2021 +0200

char/lrng: add SP800-90B compliant health tests

Implement health tests for the LRNG's slow noise sources as mandated by SP800-90B. The file contains the following health tests:

- stuck test: The stuck test calculates the first, second and third discrete derivative of the time stamp to be processed by the hash for the per-CPU entropy pool. Only if all three values are non-zero is the received time delta considered to be non-stuck.

- SP800-90B Repetition Count Test (RCT): The LRNG uses an enhanced version of the RCT specified in SP800-90B section 4.4.1. Instead of counting identical back-to-back values, the input to the RCT is the count of stuck values during the processing of received interrupt events. The RCT is applied with alpha=2^-30, compliant with the recommendation of FIPS 140-2 IG 9.8. During the counting operation, the LRNG always calculates the RCT cut-off value of C. If that value exceeds the allowed cut-off value, the LRNG will trigger the health test failure discussed below. An error is logged to the kernel log when such an RCT failure occurs. This test is only applied and enforced in FIPS mode, i.e. when a kernel compiled with CONFIG_CRYPTO_FIPS is started with fips=1.

- SP800-90B Adaptive Proportion Test (APT): The LRNG implements the APT as defined in SP800-90B section 4.4.2. The applied significance level again is alpha=2^-30, compliant with the recommendation of FIPS 140-2 IG 9.8.
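A minimal C sketch of the stuck test described above, assuming 32-bit time stamps and a hypothetical per-CPU state structure (struct stuck_state, lrng_sketch_is_stuck()) that keeps the previous time stamp and the previous first and second derivatives. Unsigned arithmetic keeps wrap-around well defined; a time stamp counts as stuck if any of the three derivatives is zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU state for the stuck test. */
struct stuck_state {
	uint32_t last_time;	/* previous time stamp */
	uint32_t last_delta;	/* previous first derivative */
	uint32_t last_delta2;	/* previous second derivative */
};

/*
 * Returns true if the time stamp is stuck, i.e. if the first, second or
 * third discrete derivative is zero. The state is updated in every case.
 */
static bool lrng_sketch_is_stuck(struct stuck_state *s, uint32_t now)
{
	uint32_t delta = now - s->last_time;
	uint32_t delta2 = delta - s->last_delta;
	uint32_t delta3 = delta2 - s->last_delta2;

	s->last_time = now;
	s->last_delta = delta;
	s->last_delta2 = delta2;

	return delta == 0 || delta2 == 0 || delta3 == 0;
}
```

In the scheme described above, each stuck result would feed the RCT counter; the sketch does not model the RCT or APT cut-off values.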
The aforementioned health tests are applied to the first 1,024 time stamps obtained from interrupt events. If a single error is identified by either the RCT or the APT, the collected entropy is invalidated and the SP800-90B startup health test is restarted.

As long as the SP800-90B startup health test is not completed, all LRNG random number output interfaces that may block will block and not generate any data. This implies that only those potentially blocking interfaces are defined to provide random numbers that are seeded with the interrupt noise source being SP800-90B compliant. All other output interfaces will not be affected by the SP800-90B startup test and thus are not considered SP800-90B compliant.

At runtime, the SP800-90B APT and RCT are applied to each time stamp generated for a received interrupt. When either the APT or the RCT indicates a noise source failure, the LRNG is reset to the state it has immediately after boot:

- all entropy counters are set to zero
- the SP800-90B startup tests are re-performed, which implies that getrandom(2) would block again until new entropy was collected

To summarize, the following rules apply:

• SP800-90B compliant output interfaces
  - /dev/random
  - getrandom(2) system call
  - get_random_bytes kernel-internal interface when being triggered by the callback registered with add_random_ready_callback

• SP800-90B non-compliant output interfaces
  - /dev/urandom
  - get_random_bytes kernel-internal interface called directly
  - randomize_page kernel-internal interface
  - get_random_u32 and get_random_u64 kernel-internal interfaces
  - get_random_u32_wait, get_random_u64_wait, get_random_int_wait, and get_random_long_wait kernel-internal interfaces

If either the RCT or the APT health test fails, irrespective of whether during initialization or runtime, the following actions occur:

1. The entropy of the entire entropy pool is invalidated.

2. All DRNGs are reset, which implies that they are treated as not seeded and require a reseed during their next invocation.

3. The SP800-90B startup health tests are initiated with all implications of the startup tests. That implies that from that point on, new events must be observed and their entropy must be inserted into the entropy pool before random numbers are calculated from the entropy pool.

Further details on the SP800-90B compliance and the availability of all test tools required to perform all tests mandated by SP800-90B are provided at [1].

The entire health testing code is compile-time configurable.

The patch provides a CONFIG_BROKEN configuration of the APT / RCT cutoff values which have a high likelihood of triggering the health test failure. The BROKEN APT cutoff is set to the exact mean of the expected value if the time stamps are equally distributed (512 time stamps divided by 16 possible values due to using the 4 LSB of the time stamp). The BROKEN RCT cutoff value is set to 1, which is likely to be triggered during regular operation.

CC: Torsten Duwe
CC: "Eric W. Biederman"
CC: "Alexander E. Patrakov"
CC: "Ahmed S. Darwish"
CC: "Theodore Y. Ts'o"
CC: Willy Tarreau
CC: Matthew Garrett
CC: Vito Caputo
CC: Andreas Dilger
CC: Jan Kara
CC: Ray Strode
CC: William Jon McCann
CC: zhangjs
CC: Andy Lutomirski
CC: Florian Weimer
CC: Lennart Poettering
CC: Nicolai Stange
Reviewed-by: Alexander Lobakin
Tested-by: Alexander Lobakin
Reviewed-by: Roman Drahtmueller
Tested-by: Marcelo Henrique Cerri
Tested-by: Neil Horman
Signed-off-by: Stephan Mueller

commit 161114f6a2c6810b574a9aad8c2113dc100148ea
Author: Stephan Mueller
Date: Mon Aug 30 12:32:04 2021 +0200

char/lrng: add Jitter RNG fast noise source

The Jitter RNG fast noise source, implemented as part of the kernel crypto API, is queried for 256 bits of entropy at the time the seed buffer managed by the LRNG is about to be filled.

CC: Torsten Duwe
CC: "Eric W. Biederman"
CC: "Alexander E. Patrakov"
CC: "Ahmed S. Darwish"
CC: "Theodore Y. Ts'o"
CC: Willy Tarreau
CC: Matthew Garrett
CC: Vito Caputo
CC: Andreas Dilger
CC: Jan Kara
CC: Ray Strode
CC: William Jon McCann
CC: zhangjs
CC: Andy Lutomirski
CC: Florian Weimer
CC: Lennart Poettering
CC: Nicolai Stange
Reviewed-by: Alexander Lobakin
Tested-by: Alexander Lobakin
Reviewed-by: Marcelo Henrique Cerri
Tested-by: Marcelo Henrique Cerri
Tested-by: Neil Horman
Signed-off-by: Stephan Mueller

commit 3c7c2e767ee6a8d53fa600291054e62af54fb341
Author: Stephan Mueller
Date: Wed Sep 16 09:50:27 2020 +0200

crypto: move Jitter RNG header include dir

To support the LRNG operation, which uses the Jitter RNG separately from the kernel crypto API, the header file must be accessible to the LRNG code.

CC: Torsten Duwe
CC: "Eric W. Biederman"
CC: "Alexander E. Patrakov"
CC: "Ahmed S. Darwish"
CC: "Theodore Y. Ts'o"
CC: Willy Tarreau
CC: Matthew Garrett
CC: Vito Caputo
CC: Andreas Dilger
CC: Jan Kara
CC: Ray Strode
CC: William Jon McCann
CC: zhangjs
CC: Andy Lutomirski
CC: Florian Weimer
CC: Lennart Poettering
CC: Nicolai Stange
Reviewed-by: Alexander Lobakin
Tested-by: Alexander Lobakin
Reviewed-by: Roman Drahtmueller
Tested-by: Roman Drahtmüller
Tested-by: Marcelo Henrique Cerri
Tested-by: Neil Horman
Signed-off-by: Stephan Mueller

commit 2797709be3eb607cd09d490dec76f1d0bb72ea12
Author: Stephan Mueller
Date: Fri Jun 18 08:10:53 2021 +0200

char/lrng: add kernel crypto API PRNG extension

Add runtime-pluggable support for all PRNGs that are accessible via the kernel crypto API, including hardware PRNGs. The PRNG is selected with the module parameter drng_name, where the name must be one that the kernel crypto API can resolve into an RNG.

This allows using the kernel crypto API PRNG implementations that provide an interface to hardware PRNGs. Using this extension, the LRNG uses the hardware PRNGs to generate random numbers. An example is the S390 CPACF support providing such a PRNG.
The hash is provided by a kernel crypto API SHASH whose digest size complies with the seed size of the PRNG.

CC: Torsten Duwe
CC: "Eric W. Biederman"
CC: "Alexander E. Patrakov"
CC: "Ahmed S. Darwish"
CC: "Theodore Y. Ts'o"
CC: Willy Tarreau
CC: Matthew Garrett
CC: Vito Caputo
CC: Andreas Dilger
CC: Jan Kara
CC: Ray Strode
CC: William Jon McCann
CC: zhangjs
CC: Andy Lutomirski
CC: Florian Weimer
CC: Lennart Poettering
CC: Nicolai Stange
Reviewed-by: Alexander Lobakin
Tested-by: Alexander Lobakin
Reviewed-by: Marcelo Henrique Cerri
Reviewed-by: Roman Drahtmueller
Tested-by: Marcelo Henrique Cerri
Tested-by: Neil Horman
Signed-off-by: Stephan Mueller

commit e9384fc48ad9098c656d9699684290ab3fa2b57b
Author: Stephan Mueller
Date: Fri Jun 18 08:09:59 2021 +0200

char/lrng: add SP800-90A DRBG extension

Using the LRNG switchable DRNG support, the SP800-90A DRBG extension is implemented. The DRBG uses the kernel crypto API DRBG implementation. In addition, it uses the kernel crypto API SHASH support to provide the hashing operation.

The DRBG supports the choice of either a CTR DRBG using AES-256, an HMAC DRBG with a SHA-512 core, or a Hash DRBG with a SHA-512 core. The core can be selected with the module parameter lrng_drbg_type. The default is the CTR DRBG.

When the DRBG extension is compiled statically, the DRBG is loaded at the late_initcall stage, which implies that once user space starts, the user space interfaces getrandom(2), /dev/random and /dev/urandom provide random data produced by an SP800-90A DRBG.

CC: Torsten Duwe
CC: "Eric W. Biederman"
CC: "Alexander E. Patrakov"
CC: "Ahmed S. Darwish"
CC: "Theodore Y. Ts'o"
CC: Willy Tarreau
CC: Matthew Garrett
CC: Vito Caputo
CC: Andreas Dilger
CC: Jan Kara
CC: Ray Strode
CC: William Jon McCann
CC: zhangjs
CC: Andy Lutomirski
CC: Florian Weimer
CC: Lennart Poettering
CC: Nicolai Stange
Reviewed-by: Alexander Lobakin
Tested-by: Alexander Lobakin
Reviewed-by: Roman Drahtmueller
Tested-by: Marcelo Henrique Cerri
Tested-by: Neil Horman
Signed-off-by: Stephan Mueller

commit 8f2ac07c07255345aab3cc73f77cd229f551f078
Author: Stephan Mueller
Date: Tue Sep 15 22:17:43 2020 +0200

crypto: DRBG - externalize DRBG functions for LRNG

This patch allows several DRBG functions to be called by the LRNG kernel code paths outside the drbg.c file.

CC: Torsten Duwe
CC: "Eric W. Biederman"
CC: "Alexander E. Patrakov"
CC: "Ahmed S. Darwish"
CC: "Theodore Y. Ts'o"
CC: Willy Tarreau
CC: Matthew Garrett
CC: Vito Caputo
CC: Andreas Dilger
CC: Jan Kara
CC: Ray Strode
CC: William Jon McCann
CC: zhangjs
CC: Andy Lutomirski
CC: Florian Weimer
CC: Lennart Poettering
CC: Nicolai Stange
Reviewed-by: Alexander Lobakin
Tested-by: Alexander Lobakin
Reviewed-by: Roman Drahtmueller
Tested-by: Roman Drahtmüller
Tested-by: Marcelo Henrique Cerri
Tested-by: Neil Horman
Signed-off-by: Stephan Mueller

commit 7ce9102c28686667dc76301771d04f8092629a30
Author: Stephan Mueller
Date: Fri Jun 18 08:08:20 2021 +0200

char/lrng: add common generic hash support

The LRNG switchable DRNG support also allows the replacement of the hash implementation used as the conditioning component. The common generic hash support code provides the required callbacks using the synchronous hash implementations of the kernel crypto API.

All synchronous hash implementations supported by the kernel crypto API can be used as part of the LRNG with this generic support.

The generic support is intended to be configured by separate switchable DRNG backends.

CC: Torsten Duwe
CC: "Eric W. Biederman"
CC: "Alexander E. Patrakov"
CC: "Ahmed S. Darwish"
CC: "Theodore Y. Ts'o"
CC: Willy Tarreau
CC: Matthew Garrett
CC: Vito Caputo
CC: Andreas Dilger
CC: Jan Kara
CC: Ray Strode
CC: William Jon McCann
CC: zhangjs
CC: Andy Lutomirski
CC: Florian Weimer
CC: Lennart Poettering
CC: Nicolai Stange
CC: "Peter, Matthias"
CC: Marcelo Henrique Cerri
CC: Neil Horman
Reviewed-by: Alexander Lobakin
Tested-by: Alexander Lobakin
Signed-off-by: Stephan Mueller

commit 6b2cceb51b0e6df13c16104b92cff7afeade8be0
Author: Stephan Mueller
Date: Mon Aug 2 21:54:12 2021 +0200

char/lrng: add switchable DRNG support

The DRNG switch support allows replacing the DRNG mechanism of the LRNG. The switching support rests on the interface definition of include/linux/lrng.h. A new DRNG is implemented by filling in the interface defined in this header file.

In addition to the DRNG, the extension also has to provide a hash implementation that is used to hash the entropy pool for random number extraction.

Note: It is permissible to implement a DRNG whose operations may sleep. However, the hash function must not sleep.

The switchable DRNG support allows replacing the DRNG at runtime. However, only one DRNG extension is allowed to be loaded at any given time. Before replacing it with another DRNG implementation, any currently loaded DRNG extension must be unloaded.

The switchable DRNG extension activates the new DRNG during load time. It is expected, however, that such a DRNG switch would be done only once by an administrator to load the intended DRNG implementation.

It is permissible to compile DRNG extensions either as kernel modules or statically. The initialization of the DRNG extension should be performed with a late_initcall to ensure the extension is available when user space starts, but after all other initialization has completed. The initialization is performed by registering the function call data structure with the lrng_set_drng_cb function. In order to unload the DRNG extension, lrng_set_drng_cb must be invoked with the NULL parameter.
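The registration rules above (one extension at a time, NULL to unload) can be modeled in a few lines of user-space C. This is a sketch under stated assumptions: struct drng_cb, set_drng_cb() and the -EBUSY return value are hypothetical stand-ins rather than the actual lrng_set_drng_cb() interface from include/linux/lrng.h, the table carries only names instead of real callbacks, and unloading is assumed to fall back to the built-in ChaCha20 DRNG with SHA-256, which the LRNG description names as the defaults.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily reduced callback table (names only). */
struct drng_cb {
	const char *drng_name;
	const char *hash_name;
};

/* Built-in default: ChaCha20 DRNG with SHA-256 conditioning. */
static const struct drng_cb default_cb = { "chacha20", "sha256" };
static const struct drng_cb *active_cb = &default_cb;

/*
 * Register a DRNG extension, or pass NULL to unload the current one and
 * return to the built-in default. Only one extension may be active.
 */
static int set_drng_cb(const struct drng_cb *cb)
{
	if (!cb) {
		active_cb = &default_cb;
		return 0;
	}
	if (active_cb != &default_cb)
		return -EBUSY;	/* unload the existing extension first */
	active_cb = cb;
	return 0;
}
```

In a real extension, the registering call would sit in the late_initcall (or module init) function and the NULL call in the module exit path.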
The DRNG extension should always provide a security strength that is at least as strong as LRNG_DRNG_SECURITY_STRENGTH_BITS.

The hash extension must not sleep and must not maintain a separate state.

CC: Torsten Duwe
CC: "Eric W. Biederman"
CC: "Alexander E. Patrakov"
CC: "Ahmed S. Darwish"
CC: "Theodore Y. Ts'o"
CC: Willy Tarreau
CC: Matthew Garrett
CC: Vito Caputo
CC: Andreas Dilger
CC: Jan Kara
CC: Ray Strode
CC: William Jon McCann
CC: zhangjs
CC: Andy Lutomirski
CC: Florian Weimer
CC: Lennart Poettering
CC: Nicolai Stange
Reviewed-by: Alexander Lobakin
Tested-by: Alexander Lobakin
Reviewed-by: Marcelo Henrique Cerri
Reviewed-by: Roman Drahtmueller
Tested-by: Marcelo Henrique Cerri
Tested-by: Neil Horman
Signed-off-by: Stephan Mueller

commit e45b8648e6375417406029f1398afea76008cd88
Author: Stephan Mueller
Date: Mon Aug 2 21:52:37 2021 +0200

char/lrng: sysctls and /proc interface

The LRNG sysctl interface provides the same controls as the existing /dev/random implementation. These sysctls behave identically and are implemented identically. The goal is to allow a possible merge of the existing /dev/random implementation with this implementation, which is why this patch keeps the two as similar as possible. Yet, all sysctls are documented at [1].

In addition, it provides the file lrng_type which provides details about the LRNG:

- the name of the DRNG that produces the random numbers for /dev/random, /dev/urandom, getrandom(2)
- the hash used to produce random numbers from the entropy pool
- the number of secondary DRNG instances
- indicator whether the LRNG operates SP800-90B compliant
- indicator whether a high-resolution timer is identified; only with a high-resolution timer will the interrupt noise source deliver sufficient entropy
- indicator whether the LRNG has been minimally seeded (i.e. whether the secondary DRNG is seeded with at least 128 bits of entropy)
- indicator whether the LRNG has been fully seeded (i.e. whether the secondary DRNG is seeded with at least 256 bits of entropy)

[1] https://www.chronox.de/lrng.html

CC: Torsten Duwe
CC: "Eric W. Biederman"
CC: "Alexander E. Patrakov"
CC: "Ahmed S. Darwish"
CC: "Theodore Y. Ts'o"
CC: Willy Tarreau
CC: Matthew Garrett
CC: Vito Caputo
CC: Andreas Dilger
CC: Jan Kara
CC: Ray Strode
CC: William Jon McCann
CC: zhangjs
CC: Andy Lutomirski
CC: Florian Weimer
CC: Lennart Poettering
CC: Nicolai Stange
Reviewed-by: Alexander Lobakin
Tested-by: Alexander Lobakin
Reviewed-by: Marcelo Henrique Cerri
Reviewed-by: Roman Drahtmueller
Tested-by: Marcelo Henrique Cerri
Tested-by: Neil Horman
Signed-off-by: Stephan Mueller

commit f7f12c5e2ad256e51b8f00afb81c6cc7fabc2c7e
Author: Stephan Mueller
Date: Mon Aug 2 21:49:21 2021 +0200

char/lrng: allocate one DRNG instance per NUMA node

In order to improve NUMA locality when serving getrandom(2) requests, allocate one DRNG instance per node.

The DRNG instance that is present right from the start of the kernel is reused as the first per-NUMA-node DRNG. For all remaining online NUMA nodes, a new DRNG instance is allocated.

During boot time, the multiple DRNG instances are seeded sequentially. With this, the first DRNG instance (referenced as the initial DRNG in the code) is completely seeded with 256 bits of entropy before the next DRNG instance is completely seeded.

When random numbers are requested, the NUMA-node-local DRNG is checked to see whether it has already been fully seeded. If this is not the case, the initial DRNG is used to serve the request.

CC: Torsten Duwe
CC: "Eric W. Biederman"
CC: "Alexander E. Patrakov"
CC: "Ahmed S. Darwish"
CC: "Theodore Y. Ts'o"
CC: Willy Tarreau
CC: Matthew Garrett
CC: Vito Caputo
CC: Andreas Dilger
CC: Jan Kara
CC: Ray Strode
CC: William Jon McCann
CC: zhangjs
CC: Andy Lutomirski
CC: Florian Weimer
CC: Lennart Poettering
CC: Nicolai Stange
CC: Eric Biggers
Reviewed-by: Alexander Lobakin
Tested-by: Alexander Lobakin
Reviewed-by: Marcelo Henrique Cerri
Reviewed-by: Roman Drahtmueller
Tested-by: Marcelo Henrique Cerri
Tested-by: Neil Horman
Signed-off-by: Stephan Mueller

commit ab993f0e8e8f37017489057c90bcad60ef90c39a
Author: Stephan Mueller
Date: Wed Aug 4 21:27:54 2021 +0200

drivers: Introduce the Linux Random Number Generator

In an effort to provide a flexible implementation for a random number generator that also delivers entropy during early boot time, allows replacement of the deterministic random number generation mechanism, implements the various components in separate code for easier maintenance, and provides compliance with SP800-90[A|B|C], introduce the Linux Random Number Generator (LRNG) framework.

The general design is as follows. Additional implementation details are given in [1]. The LRNG consists of the following components:

1. The LRNG implements a DRNG. The DRNG always generates the requested amount of output. In SP800-90A terminology, it operates without prediction resistance. The secondary DRNG maintains a counter of how many bytes were generated since the last reseed and a timer of the elapsed time since the last reseed. If either the counter or the timer reaches a threshold, the secondary DRNG is seeded from the entropy pool.

In case the Linux kernel detects a NUMA system, one secondary DRNG instance per NUMA node is maintained.

2. The DRNG is seeded by concatenating the data from the following sources:

(a) the output of the entropy pool,

(b) the Jitter RNG if available and enabled, and

(c) the CPU-based noise source such as Intel RDRAND if available and enabled.
The entropy estimates of the data from all noise sources are added to form the entropy estimate of the data used to seed the DRNG. The LRNG ensures, however, that the entropy credited to the DRNG after seeding is at most the security strength of the DRNG.

The LRNG is designed such that none of these noise sources can dominate the other noise sources in providing seed data to the DRNG, due to the following:

(a) During boot time, the number of received interrupts determines the trigger points to (re)seed the DRNG.

(b) At runtime, the available entropy from the slow noise source is concatenated with a pre-defined amount of data from the fast noise sources. In addition, each DRNG reseed operation triggers external noise source providers to deliver one block of data.

3. The entropy pool accumulates entropy obtained from certain events, which will henceforth be collectively called "slow noise sources". The entropy pool collects noise data from slow noise sources. Any data received by the LRNG from the slow noise sources is inserted into a per-CPU entropy pool using a hash operation that can be changed during runtime. By default, SHA-256 is used.

(a) When an interrupt occurs, the high-resolution time stamp is mixed into the per-CPU entropy pool. This time stamp is credited with heuristically implied entropy.

(b) HID event data like the key stroke or the mouse coordinates are mixed into the per-CPU entropy pool. This data is not credited with entropy by the LRNG.

(c) Device drivers may provide data that is mixed into an auxiliary pool using the same hash that is used to process the per-CPU entropy pool. This data is not credited with entropy by the LRNG.

Any data provided from user space, either by writing to /dev/random or /dev/urandom or via the RNDADDENTROPY IOCTL on either device file, is always injected into the auxiliary pool.
In addition, when a hardware random number generator covered by the Linux kernel HW generator framework wants to deliver random numbers, the data is injected into the auxiliary pool as well. The HW generator noise source is handled separately from the other noise sources because the HW generator framework may decide by itself when to deliver data, whereas the other noise sources are always queried for data as driven by the LRNG operation. Similarly, any user space provided data is inserted into the entropy pool.

When seed data for the DRNG is to be generated, all per-CPU entropy pools and the auxiliary pool are hashed. The message digest forms the new auxiliary pool state. At the same time, this data is used for seeding the DRNG.

To speed up the interrupt handling code of the LRNG, the time stamp collected for an interrupt event is truncated to the 8 least significant bits. 64 truncated time stamps are concatenated and then jointly inserted into the per-CPU entropy pool. During boot time, until the fully seeded stage is reached, the 32 least significant bits of each time stamp are concatenated instead. When 16 such events are received, they are injected into the per-CPU entropy pool.

The LRNG allows the DRNG mechanism to be changed at runtime. By default, a ChaCha20-based DRNG is used. The ChaCha20-DRNG implemented for the LRNG is also provided as a stand-alone user space deterministic random number generator. The LRNG also offers an SP800-90A DRBG based on the Linux kernel crypto API DRBG implementation.

The processing of entropic data from the noise source before injecting it into the DRNG is performed with the following mathematical operations:

1. Truncation: The received time stamps are truncated to the 8 least significant bits (or the 32 least significant bits during boot time).

2. Concatenation: The received and truncated time stamps as well as auxiliary 32-bit words are concatenated to fill the per-CPU data array that is capable of holding 64 8-bit words.

3. Hashing: A set of concatenated time stamp data received from the interrupts is hashed together with the current per-CPU entropy pool state. The resulting message digest is the new per-CPU entropy pool state.

4. Hashing: When new data is added to the auxiliary pool, the data is hashed together with the auxiliary pool to form a new auxiliary pool state.

5. Hashing: A message digest of all per-CPU entropy pools and the auxiliary pool is calculated, which forms the new auxiliary pool state. At the same time, this message digest is used to fill the slow noise source output buffer discussed in the following.

6. Truncation: The most significant bits (MSB), defined by either the requested number of bits (commonly equal to the security strength of the DRBG) or the entropy available transported with the buffer (which is the minimum of the message digest size and the available entropy in all entropy pools and the auxiliary pool), whichever is smaller, are obtained from the slow noise source output buffer.

7. Concatenation: The temporary seed buffer used to seed the DRNG is a concatenation of the slow noise source buffer, the Jitter RNG output, the CPU noise source output, and the current time.

The DRNG always tries to seed itself with 256 bits of entropy, except during boot. In any case, if the noise sources cannot deliver that amount, the available entropy is used and the DRNG keeps track of how much entropy it was seeded with.

The entropy implied by the LRNG available in the entropy pool may be too conservative. To ensure that during boot time all available entropy from the entropy pool is transferred to the DRNG, the hash_df function always generates 256 data bits during boot to seed the DRNG. During boot, the DRNG is seeded as follows:

1. The DRNG is reseeded from the entropy pool and potentially the fast noise sources if the entropy pool has collected at least 32 bits of entropy from the interrupt noise source.
The goal of this step is to ensure that the DRNG receives some initial entropy as early as possible. In addition, it receives the entropy available from the fast noise sources.

2. The DRNG is reseeded from the entropy pool and potentially the fast noise sources if all noise sources collectively can provide at least 128 bits of entropy.

3. The DRNG is reseeded from the entropy pool and potentially the fast noise sources if all noise sources collectively can provide at least 256 bits.

At the time of the reseeding steps, the DRNG requests as much entropy as is available in order to skip certain steps and reach the seeding level of 256 bits. This may imply that one or more of the aforementioned steps are skipped.

In all listed steps, the DRNG is (re)seeded with a number of random bytes from the entropy pool that is at most the amount of entropy present in the entropy pool. This means that when the entropy pool contains 128 or 256 bits of entropy, the DRNG is seeded with that amount of entropy as well.

Before the DRNG is seeded with 256 bits of entropy in step 3, requests for random data from /dev/random and the getrandom system call are not processed.

The hash operation providing random data from the entropy pools will always require that all entropy sources collectively can deliver at least 128 entropy bits.

The DRNG operates as a deterministic random number generator with the following properties:

* The maximum number of random bytes that can be generated with one DRNG generate operation is limited to 4096 bytes. When longer random numbers are requested, multiple DRNG generate operations are performed. The ChaCha20 DRNG as well as the SP800-90A DRBGs update their state after completing a generate request for backtracking resistance.
* The secondary DRNG is reseeded with whatever entropy is available; in the worst case, where no additional entropy can be provided by the noise sources, the DRNG is not reseeded and continues its operation, trying to reseed again after the expiry of one of these thresholds: - the last reseeding of the secondary DRNG was more than 600 seconds ago, or - 2^20 DRNG generate operations have been performed, whichever comes first, or - the secondary DRNG is forced to reseed before the next generation of random numbers if data has been injected into the LRNG by writing data into /dev/random or /dev/urandom. The chosen values prevent high-volume requests from user space from causing frequent reseeding operations, which would drag down the performance of the DRNG. With the automatic reseeding after 600 seconds, the LRNG is triggered to reseed itself before the first request after a suspend that put the hardware to sleep for longer than 600 seconds. To support smaller devices including IoT environments, this patch allows reducing the runtime memory footprint of the LRNG at compile time by selecting smaller collection data sizes. When the kernel is compiled for a small environment, the allocation of a buffer of up to 4096 bytes to serve user space requests is avoided; instead, a 64-byte stack variable is used to serve all user space requests. The LRNG has the following properties: * internal noise source: interrupt timing with fast boot-time seeding * high performance of interrupt handling code: The LRNG impact on the interrupt handling has been reduced to a minimum. On one example system, the LRNG interrupt handling code in its fastest configuration executes within an average of 55 cycles, whereas the existing /dev/random on the same device takes about 97 cycles when measuring the execution time of add_interrupt_randomness().
* use of an almost never contended lock for the hashing operation that collects raw entropy, supporting concurrency-free use of massively parallel systems - the worst-case rate of contention is the number of DRNG reseeds, usually: as many contentions per 5 minutes as there are NUMA nodes. * use of a standalone ChaCha20-based RNG with the option to use a different DRNG selectable at compile time * instantiate one DRNG per NUMA node * support for runtime-switchable output DRNGs * use of a runtime-switchable hash for the conditioning implementation, following a widely accepted approach * compile-time selectable collection size * support of small systems by allowing the reduction of the runtime memory needs Further details, including the rationale for the design choices and properties of the LRNG together with testing, are provided at [1]. In addition, the documentation explains the conducted regression tests to verify that the LRNG is API and ABI compatible with the existing /dev/random implementation. [1] https://www.chronox.de/lrng.html CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Mathematical aspects Reviewed-by: "Peter, Matthias" Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Signed-off-by: Stephan Mueller commit 837b0aeffa60a73796b4734e6b4de74ff43558f9 Author: Adithya Abraham Philip Date: Fri Jun 11 21:56:10 2021 +0000 net-tcp_bbr: v2: Fix missing ECT markings on retransmits for BBRv2 Adds a new flag TCP_ECN_ECT_PERMANENT that is used by CCAs to indicate that retransmitted packets and pure ACKs must have the ECT bit set.
This is a necessary fix for BBRv2, which when using ECN expects ECT to be set even on retransmitted packets and ACKs. Currently CCAs like BBRv2 which can use ECN but don't "need" it do not have a way to indicate that ECT should be set on retransmissions/ACKs. Signed-off-by: Adithya Abraham Philip Signed-off-by: Neal Cardwell commit b8fa56fb9f35ffefe080098ad3008cac8f2c1ee2 Author: Neal Cardwell Date: Mon Dec 28 19:23:09 2020 -0500 net-tcp_bbr: v2: don't assume prior_cwnd was set entering CA_Loss Fix WARN_ON_ONCE() warnings that were firing and pointing to a bbr->prior_cwnd of 0 when exiting CA_Loss and transitioning to CA_Open. The issue was that tcp_simple_retransmit() calls: tcp_set_ca_state(sk, TCP_CA_Loss); without first calling icsk_ca_ops->ssthresh(sk) (because tcp_simple_retransmit() is dealing with losses due to MTU issues and not congestion). The lack of this callback means that BBR did not get a chance to set bbr->prior_cwnd, and thus upon exiting CA_Loss in such cases the WARN_ON_ONCE() would fire due to a zero bbr->prior_cwnd. This commit removes that warning, since a bbr->prior_cwnd of 0 is a valid situation in this state transition. For setting inflight_lo upon entering CA_Loss, to avoid setting an inflight_lo of 0 in this case, this commit switches to taking the max of cwnd and prior_cwnd. We plan to remove that line of code when we switch to cautious (PRR-style) recovery, so that awkwardness will go away. 
Change-Id: I575dce871c2f20e91e3e9449e1706f42a07b8118 commit 699194ffc5ec330c14d03d295bad98968879f4e8 Author: Neal Cardwell Date: Mon Aug 17 19:10:21 2020 -0400 net-tcp_bbr: v2: remove cycle_rand parameter that is unused in BBRv2 Change-Id: Iee1df7e41e42de199068d7c89131ed3d228327c0 commit 19bb512064ef922f913f18d6ac78f6756b85dec9 Author: Neal Cardwell Date: Mon Aug 17 19:08:41 2020 -0400 net-tcp_bbr: v2: remove field bw_rtts that is unused in BBRv2 Change-Id: I58e3346c707748a6f316f3ed060d2da84c32a79b commit b02ffd2d6287dba89dc15a257d13f516246b06ea Author: Neal Cardwell Date: Thu Nov 21 15:28:01 2019 -0500 net-tcp_bbr: v2: remove unnecessary rs.delivered_ce logic upon loss There is no reason to compute rs.delivered_ce upon loss. In fact, we specifically do not want to compute rs.delivered_ce upon loss. Two issues: (1) This would be the wrong thing to do, in behavior terms. With RACK's dynamic reordering window, losses can be marked long after the sequence hole appears in the ACK/SACK stream. We want to catch the ECN mark rate rising too high as quickly as possible, which means we want to check for high ECN mark rates at ACK time (as BBRv2 currently does) and not at loss-marking time. (2) This is dead code. The ECN mark rate cannot be detected as too high because the check needs rs->delivered to be > 0 as well: if (rs->delivered_ce > 0 && rs->delivered > 0 && Since we are not setting rs->delivered upon loss, this check cannot succeed, so setting delivered_ce is pointless. This dead and wrong line was discovered by Randall Stewart at Netflix as he was reading the BBRv2 code. Change-Id: I37f83f418a259ec31d8f82de986db071b364b76a commit 9b7be0b3c71db0194e02cf8ff2cd908f441c269a Author: Neal Cardwell Date: Tue Jun 11 12:54:22 2019 -0400 net-tcp_bbr: v2: BBRv2 ("bbr2") congestion control for Linux TCP BBR v2 is an enhancement to the BBR v1 algorithm. It's designed to aim for lower queues, lower loss, and better Reno/CUBIC coexistence than BBR v1.
BBR v2 maintains the core of BBR v1: an explicit model of the network path that is two-dimensional, adapting to estimate the (a) maximum available bandwidth and (b) maximum safe volume of data a flow can keep in-flight in the network. It maintains the estimated BDP as a core guide for estimating an appropriate level of in-flight data. BBR v2 makes several key enhancements: o Its bandwidth-probing time scale is adapted, within bounds, to allow improved coexistence with Reno and CUBIC. The bandwidth-probing time scale is (a) extended dynamically based on estimated BDP to improve coexistence with Reno/CUBIC; (b) bounded by an interactive wall-clock time-scale to be more scalable and responsive than Reno and CUBIC. o Rather than being largely agnostic to loss and ECN marks, it explicitly uses loss and (DCTCP-style) ECN signals to maintain its model. o It aims for lower losses than v1 by adjusting its model to attempt to stay within loss rate and ECN mark rate bounds (loss_thresh and ecn_thresh, respectively). o It adapts to loss/ECN signals even when the application is running out of data ("application-limited"), in case the "application-limited" flow is also "network-limited" (the bw and/or inflight available to this flow is lower than previously estimated when the flow ran out of data). o It has a three-part model: the model explicitly tracks three operating points, where an operating point is a tuple: (bandwidth, inflight).
The three operating points are: o latest: the latest measurement from the current round trip o upper bound: robust, optimistic, long-term upper bound o lower bound: robust, conservative, short-term lower bound These are stored in the following state variables: o latest: bw_latest, inflight_latest o lo: bw_lo, inflight_lo o hi: bw_hi[2], inflight_hi To gain intuition about the meaning of the three operating points, it may help to consider the analogs in CUBIC, which has a somewhat analogous three-part model used by its probing state machine:

  BBR param    CUBIC param
  ---------    -----------
  latest    ~  cwnd
  lo        ~  ssthresh
  hi        ~  last_max_cwnd

The analogy is only a loose one, though, since the BBR operating points are calculated differently, and are 2-dimensional (bw,inflight) rather than CUBIC's one-dimensional notion of operating point (inflight). o It uses the three-part model to adapt the magnitude of its bandwidth to match the estimated space available in the buffer, rather than (as in BBR v1) assuming that it was always acceptable to place 0.25*BDP in the bottleneck buffer when probing (commodity datacenter switches commonly do not have that much buffer for WAN flows). When BBR v2 estimates it hit a buffer limit during probing, its bandwidth probing then starts gently in case little space is still available in the buffer, and then accelerates, slowly at first and then rapidly if it can grow inflight without seeing congestion signals. In such cases, probing is bounded by inflight_hi + inflight_probe, where inflight_probe grows as: [0, 1, 2, 4, 8, 16,...]. This allows BBR to keep losses low and bounded if a bottleneck remains congested, while rapidly/scalably utilizing free bandwidth when it becomes available. o It has a slightly revised state machine, to achieve the goals above.
BBR_BW_PROBE_UP: pushes up inflight to probe for bw/vol BBR_BW_PROBE_DOWN: drain excess inflight from the queue BBR_BW_PROBE_CRUISE: use pipe, w/ headroom in queue/pipe BBR_BW_PROBE_REFILL: try to refill the pipe again to 100%, leaving queue empty o The estimated BDP: BBR v2 continues to maintain an estimate of the path's two-way propagation delay, by tracking a windowed min_rtt, and coordinating (on an as-needed basis) to try to expose the two-way propagation delay by draining the bottleneck queue. BBR v2 continues to use its min_rtt and (currently-applicable) bandwidth estimate to estimate the current bandwidth-delay product. The estimated BDP still provides one important guideline for bounding inflight data. However, because any min-filtered RTT and max-filtered bw inherently tend to both overestimate, the estimated BDP is often too high; in this case loss or ECN marks can ensue, in which case BBR v2 adjusts inflight_hi and inflight_lo to adapt its sending rate and inflight down to match the available capacity of the path. o Space: Note that ICSK_CA_PRIV_SIZE increased. This is because BBR v2 requires more space. Note that much of the space is due to support for per-socket parameterization and debugging in this release for research and debugging. With that state removed, the full "struct bbr" is 140 bytes, or 144 with padding. This is an increase of 40 bytes over the existing ca_priv space. o Code: BBR v2 reuses many pieces from BBR v1. But it omits the following significant pieces: o "packet conservation" (bbr_set_cwnd_to_recover_or_restore(), bbr_can_grow_inflight()) o long-term bandwidth estimator ("policer mode") The code layout tries to keep BBR v2 code near the bottom of the file, so that v1-applicable code at the top does not accidentally refer to v2 code.
o Docs: See the following docs for more details and diagrams describing the BBR v2 algorithm: https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00 https://datatracker.ietf.org/meeting/102/materials/slides-102-iccrg-an-update-on-bbr-work-at-google-00 o Internal notes: For this upstream rebase, Neal started from: git show fed518041ac6:net/ipv4/tcp_bbr.c > net/ipv4/tcp_bbr.c then removed dev instrumentation (dynamic get/set for parameters) and code that was only used by BBRv1 Effort: net-tcp_bbr Origin-9xx-SHA1: 2c84098e60bed6d67dde23cd7538c51dee273102 Change-Id: I125cf26ba2a7a686f2fa5e87f4c2afceb65f7a05 commit 65e4da0782e44b9e679e969bd523d17de53f4d3a Author: Neal Cardwell Date: Sat Nov 16 13:16:25 2019 -0500 net-tcp: add fast_ack_mode=1: skip rwin check in tcp_fast_ack_mode__tcp_ack_snd_check() Add logic for an experimental TCP connection behavior, enabled with tp->fast_ack_mode = 1, which disables checking the receive window before sending an ack in __tcp_ack_snd_check(). If this behavior is enabled, the data receiver sends an ACK if the amount of data is > RCV.MSS. Change-Id: Iaa0a0fd7108221f883137a79d5bfa724f1b096d4 commit 39d36331cfe3ff8c999d5cae3af7cc60cbc3ef99 Author: Neal Cardwell Date: Fri Sep 27 17:10:26 2019 -0400 net-tcp: re-generalize TSO sizing in TCP CC module API Reorganize the API for CC modules so that the CC module once again gets complete control of the TSO sizing decision. This is how the API was set up around 2016 and the initial BBRv1 upstreaming. Later Eric Dumazet simplified it. But with wider testing it now seems that to avoid CPU regressions BBR needs to have a different TSO sizing function. This is necessary to handle cases where there are many flows bottlenecked on the sender host's NIC, in which case BBR's pacing rate is much lower than CUBIC/Reno/DCTCP's. Why does this happen? Because BBR's pacing rate adapts to the low bandwidth share each flow sees.
By contrast, CUBIC/Reno/DCTCP see no loss or ECN, so they grow a very large cwnd, and thus a large pacing rate and a large TSO burst size. Change-Id: Ic8ccfdbe4010ee8d4bf6a6334c48a2fceb2171ea commit 3110f23f540dee520648a1570aaa2a4b446d01cb Author: Yousuk Seung Date: Wed May 23 17:55:54 2018 -0700 net-tcp: add new ca opts flag TCP_CONG_WANTS_CE_EVENTS Add a new ca opts flag TCP_CONG_WANTS_CE_EVENTS that allows a congestion control module to receive CE events. Currently congestion control modules have to set the TCP_CONG_NEEDS_ECN bit in opts flag to receive CE events but this may incur changes in ECN behavior elsewhere. This patch adds a new bit TCP_CONG_WANTS_CE_EVENTS that allows congestion control modules to receive CE events independently of TCP_CONG_NEEDS_ECN. Effort: net-tcp Origin-9xx-SHA1: 9f7e14716cde760bc6c67ef8ef7e1ee48501d95b Change-Id: I2255506985242f376d910c6fd37daabaf4744f24 commit 30e316fb1b40906d3ba8d0c1f3e51741fb64c6de Author: Neal Cardwell Date: Tue May 7 22:37:19 2019 -0400 net-tcp_bbr: v2: set tx.in_flight for skbs in repair write queue Syzkaller was able to use TCP_REPAIR to reproduce the new warning added in tcp_fragment(): WARNING: CPU: 0 PID: 118174 at net/ipv4/tcp_output.c:1487 tcp_fragment+0xdcc/0x10a0 net/ipv4/tcp_output.c:1487() inconsistent: tx.in_flight: 0 old_factor: 53 The warning happens because skbs inserted into the tcp_rtx_queue during the repair process go through a sort of "fake send" process, and that process was setting pcount but not tx.in_flight, hence the warnings (where old_factor is the old pcount). The fix of setting tx.in_flight in the TCP_REPAIR code path seems simple enough, and indeed makes the repro code from syzkaller stop producing warnings. Running through kokonut tests, and will send out for review when all tests pass.
Effort: net-tcp_bbr Origin-9xx-SHA1: 330f825a08a6fe92cef74d799cc468864c479f63 Change-Id: I0bc4a790f040fd4239620e1eedd5dc64666c6f05 commit d50e05bdceabb0e23c41d62fd3aff2ae46ce6e30 Author: Neal Cardwell Date: Wed May 1 20:16:25 2019 -0400 net-tcp_bbr: v2: adjust skb tx.in_flight upon split in tcp_fragment() When we fragment an skb that has already been sent, we need to update the tx.in_flight for the first skb in the resulting pair ("buff"). Because we were not updating the tx.in_flight, the tx.in_flight value was inconsistent with the pcount of the "buff" skb (tx.in_flight would be too high). That meant that if the "buff" skb was lost, then bbr2_inflight_hi_from_lost_skb() would calculate an inflight_hi value that is too high. This could result in longer queues and higher packet loss. Packetdrill testing verified that without this commit, when the second half of an skb is SACKed and then later the first half of that skb is marked lost, the calculated inflight_hi was incorrect. Effort: net-tcp_bbr Origin-9xx-SHA1: 385f1ddc610798fab2837f9f372857438b25f874 Change-Id: I617f8cab4e9be7a0b8e8d30b047bf8645393354d commit 293e82ac9754024fb24916df5f1f97990be57921 Author: Neal Cardwell Date: Wed May 1 20:16:33 2019 -0400 net-tcp_bbr: v2: adjust skb tx.in_flight upon merge in tcp_shifted_skb() When tcp_shifted_skb() updates state as adjacent SACKed skbs are coalesced, previously the tx.in_flight was not adjusted, so we could get contradictory state where the skb's recorded pcount was bigger than the tx.in_flight (the number of segments that were in flight after sending the skb). Normally, having a SACKed skb with contradictory pcount/tx.in_flight would not matter. However, with SACK reneging, the SACKed bit is removed, and an skb once again becomes eligible for retransmitting, fragmenting, SACKing, etc.
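To make the pcount/tx.in_flight inconsistency concrete, here is a small user-space model. It is a hypothetical sketch, not kernel code: `struct seg`, `inflight_prev()`, `merge_naive()` and `merge_adjusted()` are illustrative names, and `merge_adjusted()` shows only one plausible way to keep the two fields consistent (taking the later skb's snapshot, assuming none of the earlier segments were ACKed in between), not necessarily what this commit does. `inflight_prev()` mirrors the `inflight_prev` quantity guarded by the `WARN_ON_ONCE(inflight_prev < 0)` check in `bbr2_inflight_hi_from_lost_skb()` discussed in this commit.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-skb fields discussed above:
 * pcount       - number of segments in the skb
 * tx_in_flight - segments in flight right after the skb was sent,
 *                so it already includes the skb's own pcount. */
struct seg {
	unsigned int pcount;
	unsigned int tx_in_flight;
};

/* Segments that were in flight just before this skb was sent.
 * A negative result means pcount and tx_in_flight contradict each other. */
int inflight_prev(const struct seg *s)
{
	return (int)s->tx_in_flight - (int)s->pcount;
}

/* Coalescing that grows pcount but leaves tx_in_flight as-is,
 * i.e. the pre-fix behavior described in the commit message. */
void merge_naive(struct seg *prev, const struct seg *next)
{
	prev->pcount += next->pcount;
}

/* Assumed consistent variant: the merged range finished sending when
 * 'next' was sent, and next's snapshot already counts prev's segments
 * (assuming none of them were ACKed in between). */
void merge_adjusted(struct seg *prev, const struct seg *next)
{
	prev->pcount += next->pcount;
	prev->tx_in_flight = next->tx_in_flight;
	assert(prev->pcount <= prev->tx_in_flight);
}
```

For example, with skb N = {pcount 2, tx_in_flight 2} and skb N+1 = {pcount 3, tx_in_flight 5}, `merge_naive()` leaves {5, 2}, so `inflight_prev()` returns -3, which is exactly the "pcount > tx.in_flight" state; `merge_adjusted()` yields {5, 5} and `inflight_prev()` returns 0.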
Packetdrill testing verified the following sequence is possible in a kernel that does not have this commit: - skb N is SACKed - skb N+1 is SACKed and combined with skb N using tcp_shifted_skb() - tcp_shifted_skb() will increase the pcount of prev, but leave tx.in_flight as-is - so prev skb can have pcount > tx.in_flight - RTO, tcp_timeout_mark_lost(), detect reneg, remove "SACKed" bit, mark skb N as lost - find pcount of skb N is greater than its tx.in_flight I suspect this issue is what caused the bbr2_inflight_hi_from_lost_skb(): WARN_ON_ONCE(inflight_prev < 0) to fire on production machines using bbr2. Tested: See last commit in series for sponge link. Effort: net-tcp_bbr Origin-9xx-SHA1: 1a3e997e613d2dcf32b947992882854ebe873715 Change-Id: I1b0b75c27519953430c7db51c6f358f104c7af55 commit 400bc7e73e85a4730073ee8c92b6a5fe9ffd0d76 Author: Neal Cardwell Date: Tue May 7 22:36:36 2019 -0400 net-tcp_bbr: v2: factor out tx.in_flight setting into tcp_set_tx_in_flight() Factor out the code to set an skb's tx.in_flight field into its own function, so that this code can be used for the TCP_REPAIR "fake send" code path that inserts skbs into the rtx queue without sending them. This is in preparation for the following patch, which fixes an issue with TCP_REPAIR and tx.in_flight. Tested: See last patch in series for sponge link. Effort: net-tcp_bbr Origin-9xx-SHA1: e880fc907d06ea7354333f60f712748ebce9497b Change-Id: I4fbd4a6e18a51ab06d50ab1c9ad820ce5bea89af commit 61a0a7585d9a09dd2ccd6eb1961f6ad9ad1e5e0d Author: Neal Cardwell Date: Tue Aug 7 21:52:06 2018 -0400 net-tcp_bbr: v2: introduce ca_ops->skb_marked_lost() CC module callback API For connections experiencing reordering, RACK can mark packets lost long after we receive the SACKs/ACKs hinting that the packets were actually lost.
This means that CC modules cannot easily learn the volume of inflight data at which packet loss happens by looking at the current inflight or even the packets in flight when the most recently SACKed packet was sent. To learn this, CC modules need to know how many packets were in flight at the time lost packets were sent. This new callback, combined with TCP_SKB_CB(skb)->tx.in_flight, allows them to learn this. This also provides a consistent callback that is invoked whether packets are marked lost upon ACK processing, using the RACK reordering timer, or at RTO time. Effort: net-tcp_bbr Origin-9xx-SHA1: afcbebe3374e4632ac6714d39e4dc8a8455956f4 Change-Id: I54826ab53df636be537e5d3c618a46145d12d51a commit 6aca161876b5dc73538ce1a0ae195a45adbd8514 Author: Neal Cardwell Date: Mon Nov 19 13:48:36 2018 -0500 net-tcp_bbr: v2: export FLAG_ECE in rate_sample.is_ece For understanding the relationship between inflight and ECN signals, to try to find the highest inflight value that has acceptable levels of ECN marking. Effort: net-tcp_bbr Origin-9xx-SHA1: 3eba998f2898541406c2666781182200934965a8 Change-Id: I3a964e04cee83e11649a54507043d2dfe769a3b3 commit b47bb01ca950dc1f528770a023390353374ccdc4 Author: Neal Cardwell Date: Thu Oct 12 23:44:27 2017 -0400 net-tcp_bbr: v2: count packets lost over TCP rate sampling interval For understanding the relationship between inflight and packet loss signals, to try to find the highest inflight value that has acceptable levels of packet losses. Effort: net-tcp_bbr Origin-9xx-SHA1: 4527e26b2bd7756a88b5b9ef1ada3da33dd609ab Change-Id: I594c2500868d9c530770e7ddd68ffc87c57f4fd5 commit 8166fcfbbd67b530699fdf8e7056df87b009c956 Author: Neal Cardwell Date: Sat Aug 5 11:49:50 2017 -0400 net-tcp_bbr: v2: snapshot packets in flight at transmit time and pass in rate_sample For understanding the relationship between inflight and losses or ECN signals, to try to find the highest inflight value that has acceptable levels of loss/ECN marking.
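The commits above give a CC module three per-sample signals: the packets in flight when the sampled skb was sent, the packets lost over the sampling interval, and CE-marked deliveries. A minimal sketch of how a module might turn them into "too much loss" / "too many ECN marks" decisions follows. It is a hypothetical illustration: `struct sample`, `loss_too_high()`, `ecn_too_high()` and the `SKETCH_*` constants are made up here and are not the kernel's `struct rate_sample` or BBR v2's code; the threshold values (about 2% loss, 50% CE marks) are assumptions chosen for illustration, expressed in 1/256 fixed point.

```c
#include <stdbool.h>

/* Hypothetical, simplified rate sample; field names echo the commit
 * messages (tx_in_flight, lost, delivered, delivered_ce), but this is
 * not the kernel's struct rate_sample. */
struct sample {
	unsigned int tx_in_flight; /* packets in flight when the sampled skb was sent */
	unsigned int lost;         /* packets marked lost over the sampling interval */
	unsigned int delivered;    /* packets delivered over the sampling interval */
	unsigned int delivered_ce; /* delivered packets that carried a CE mark */
};

/* Fixed-point thresholds in units of 1/256; values are assumptions. */
#define SKETCH_UNIT        256u
#define SKETCH_LOSS_THRESH (SKETCH_UNIT * 2 / 100) /* 5/256, about 2% */
#define SKETCH_ECN_THRESH  (SKETCH_UNIT / 2)       /* 128/256 = 50% */

/* Loss rate measured against the inflight level at transmit time,
 * which is what the transmit-time snapshot makes possible. */
bool loss_too_high(const struct sample *s)
{
	return s->tx_in_flight > 0 &&
	       (unsigned long long)s->lost * SKETCH_UNIT >
	       (unsigned long long)s->tx_in_flight * SKETCH_LOSS_THRESH;
}

/* CE mark rate over the interval; the delivered > 0 guard mirrors the
 * check quoted in the rs.delivered_ce commit message. */
bool ecn_too_high(const struct sample *s)
{
	return s->delivered_ce > 0 && s->delivered > 0 &&
	       (unsigned long long)s->delivered_ce * SKETCH_UNIT >
	       (unsigned long long)s->delivered * SKETCH_ECN_THRESH;
}
```

For instance, with tx_in_flight = 100, two lost packets exceed the roughly 2% threshold while one does not. When a sample crosses a threshold, its tx_in_flight is a natural candidate for an upper bound on safe inflight, which is roughly the role the BBR v2 commit describes for inflight_hi.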
Effort: net-tcp_bbr Origin-9xx-SHA1: b3eb4f2d20efab4ca001f32c9294739036c493ea Change-Id: I7314047d0ff14dd261a04b1969a46dc658c8836a commit 32b38b7a8bfb485e817b23a70afd3fcb5bd1f2c0 Author: Neal Cardwell Date: Sun Jun 24 21:55:59 2018 -0400 net-tcp_bbr: v2: shrink delivered_mstamp, first_tx_mstamp to u32 to free up 8 bytes Free up some space for tracking inflight and losses for each bw sample, in upcoming commits. These timestamps are in microseconds, and are now stored in 32 bits. So they can only hold time intervals up to 2^32 microseconds, roughly 4295 seconds (about 2^12 seconds, or 71 minutes). But Linux TCP RTT and RTO tracking has the same 32-bit microsecond implementation approach and resulting deployment limitations. So this is not introducing a new limit. And these should not be a limitation for the foreseeable future. Effort: net-tcp_bbr Origin-9xx-SHA1: 238a7e6b5d51625fef1ce7769826a7b21b02ae55 Change-Id: I3b779603797263b52a61ad57c565eb91fe42680c commit 2944f3f4d6207adbeae299412f5e05895b4817f9 Author: Yuchung Cheng Date: Tue Mar 27 18:01:46 2018 -0700 net-tcp_rate: account for CE marks in rate sample This patch counts the number of delivered packets that carry a CE mark in the rate sample, using an approach similar to delivery accounting. Effort: net-tcp_rate Origin-9xx-SHA1: 710644db434c3da335a7c8b72207a671ccbb5cf8 Change-Id: I0968fb33fe19b5c774e8c3afd2685558a6ec8710 commit 7222702d844534d0fd53c727f68a7f7a5f8c314a Author: Yuchung Cheng Date: Tue Mar 27 18:33:29 2018 -0700 net-tcp_rate: consolidate inflight tracking approaches in TCP In order to track CE marks per rate sample (one round trip), we'll need to snap the starting tcp delivered_ce count in the packet meta header (tcp_skb_cb). But there's not enough space. Good news is that the "last_in_flight" in the header, used by NV congestion control, is almost equivalent to "delivered". In fact, "delivered" is better because it additionally accounts for out-of-order packets. Therefore we can remove it to make room for the CE tracking.
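The CE accounting described above follows the same pattern as delivery accounting: snapshot a cumulative counter into per-packet metadata at send time, then take the difference when the packet is acknowledged. A minimal sketch of that pattern, assuming hypothetical names (`struct conn`, `struct pkt`, `on_send()`, `on_ack()`); in the kernel the snapshot lives in `tcp_skb_cb`, but this is not its code.

```c
#include <stdint.h>

/* Hypothetical connection-wide cumulative counters. */
struct conn {
	uint32_t delivered;    /* packets delivered so far */
	uint32_t delivered_ce; /* delivered packets that carried a CE mark */
};

/* Hypothetical per-packet metadata: counter snapshots taken at send time. */
struct pkt {
	uint32_t tx_delivered;
	uint32_t tx_delivered_ce;
};

/* Per-sample result: what was delivered (and CE-marked) since the send. */
struct ce_sample {
	uint32_t delivered;
	uint32_t delivered_ce;
};

/* Record the cumulative counters in the packet when it is sent. */
void on_send(const struct conn *c, struct pkt *p)
{
	p->tx_delivered = c->delivered;
	p->tx_delivered_ce = c->delivered_ce;
}

/* When an ACK covers the packet, the sample is the difference between
 * the counters now and at send time. Unsigned subtraction stays correct
 * across 32-bit wraparound. */
struct ce_sample on_ack(const struct conn *c, const struct pkt *p)
{
	struct ce_sample s = {
		.delivered = c->delivered - p->tx_delivered,
		.delivered_ce = c->delivered_ce - p->tx_delivered_ce,
	};
	return s;
}
```

Only two 32-bit snapshots per packet are needed, which is why freeing space in the per-packet header mattered for this change.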
This would make delayed ACK detection slightly less accurate but the impact is negligible since it's not used for any critical control. Effort: net-tcp_rate Origin-9xx-SHA1: ddcd46ec85d5f1c4454258af0c54b3254c0d64a7 Change-Id: I1a184aad6d101c981ac7f2f275aa9417ff856910 commit a1b8bfc95fd85052db8694f8a6eab83dda6208f5 Author: Neal Cardwell Date: Tue Jun 11 12:26:55 2019 -0400 net-tcp_bbr: broaden app-limited rate sample detection This commit is a bug fix for the Linux TCP app-limited (application-limited) logic that is used for collecting rate (bandwidth) samples. Previously the app-limited logic only looked for "bubbles" of silence in between application writes, by checking at the start of each sendmsg. But "bubbles" of silence can also happen before retransmits: e.g. bubbles can happen between an application write and a retransmit, or between two retransmits. Retransmits are triggered by ACKs or timers. So this commit checks for bubbles of app-limited silence upon ACKs or timers. Why does this commit check for app-limited state at the start of ACKs and timer handling? Because at that point we know whether inflight was fully using the cwnd. During processing the ACK or timer event we often change the cwnd; after changing the cwnd we can't know whether inflight was fully using the old cwnd. Origin-9xx-SHA1: 3fe9b53291e018407780fb8c356adb5666722cbc Change-Id: I37221506f5166877c2b110753d39bb0757985e68 commit 4afde7d1d43fd433105376e6c7cfd5d9c8e6672b Author: Alexey Avramov Date: Tue Nov 2 07:53:11 2021 +0900 mm/vmscan: add sysctl knobs for protecting the working set The kernel does not provide a way to protect the working set under memory pressure. A certain amount of anonymous and clean file pages is required by the userspace for normal operation. First of all, the userspace needs a cache of shared libraries and executable binaries. If the amount of the clean file pages falls below a certain level, then thrashing and even livelock can take place. 
The patch provides sysctl knobs for protecting the working set (anonymous and clean file pages) under memory pressure. The vm.anon_min_kbytes sysctl knob provides *hard* protection of anonymous pages. The anonymous pages on the current node won't be reclaimed under any conditions when their amount is below vm.anon_min_kbytes. This knob may be used to prevent excessive swap thrashing when anonymous memory is low (for example, when memory is going to be overfilled by compressed data of zram module). The default value is defined by CONFIG_ANON_MIN_KBYTES (suggested 0 in Kconfig). The vm.clean_low_kbytes sysctl knob provides *best-effort* protection of clean file pages. The file pages on the current node won't be reclaimed under memory pressure when the amount of clean file pages is below vm.clean_low_kbytes *unless* we threaten to OOM. Protection of clean file pages using this knob may be used when swapping is still possible to - prevent disk I/O thrashing under memory pressure; - improve performance in disk cache-bound tasks under memory pressure. The default value is defined by CONFIG_CLEAN_LOW_KBYTES (suggested 0 in Kconfig). The vm.clean_min_kbytes sysctl knob provides *hard* protection of clean file pages. The file pages on the current node won't be reclaimed under memory pressure when the amount of clean file pages is below vm.clean_min_kbytes. Hard protection of clean file pages using this knob may be used to - prevent disk I/O thrashing under memory pressure even with no free swap space; - improve performance in disk cache-bound tasks under memory pressure; - avoid high latency and prevent livelock in near-OOM conditions. The default value is defined by CONFIG_CLEAN_MIN_KBYTES (suggested 0 in Kconfig). 
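The reclaim rules for the three knobs can be summarized as a small decision model. This is a hypothetical sketch of the documented semantics only, not the patch's implementation: `struct wss_knobs`, `may_reclaim_anon()`, `may_reclaim_clean_file()` and the `near_oom` input are illustrative names, and the real patch decides "threaten to OOM" inside the reclaim path.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the three thresholds, in kbytes.
 * Models the documented semantics only; not the patch's code. */
struct wss_knobs {
	unsigned long anon_min_kbytes;  /* hard protection of anonymous pages */
	unsigned long clean_low_kbytes; /* best-effort protection of clean file pages */
	unsigned long clean_min_kbytes; /* hard protection of clean file pages */
};

/* Anonymous pages are never reclaimed while below the hard floor. */
bool may_reclaim_anon(const struct wss_knobs *k, unsigned long anon_kbytes)
{
	return anon_kbytes >= k->anon_min_kbytes;
}

/* Clean file pages: the hard floor always applies; the low floor is
 * honored unless the system is close to OOM (near_oom is an assumed input). */
bool may_reclaim_clean_file(const struct wss_knobs *k,
			    unsigned long clean_kbytes, bool near_oom)
{
	if (clean_kbytes < k->clean_min_kbytes)
		return false;
	if (clean_kbytes < k->clean_low_kbytes && !near_oom)
		return false;
	return true;
}
```

With all three knobs at their suggested default of 0, both functions always return true, i.e. no extra protection is applied.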
Signed-off-by: Alexey Avramov commit ad80c3b998a49c28df7f379c22cc3ea08025c508 Author: Zebediah Figura Date: Thu Nov 4 17:15:38 2021 +0000 winesync: Introduce the winesync driver and character device Rebased-by: Tk-Glitch Rebased-by: Alexandre Frade Signed-off-by: Alexandre Frade commit 304b10fb971da29f23e9fd4ece02d97d5b4621d1 Author: André Almeida Date: Mon Oct 25 09:49:42 2021 -0300 futex: Add entry point for FUTEX_WAIT_MULTIPLE (opcode 31) Add an option to wait on multiple futexes using the old interface, which uses opcode 31 through the futex() syscall. Do that by translating the old interface to use the new code. This allows old and stable versions of Proton to still use fsync in new kernel releases. Signed-off-by: André Almeida commit 6d1b96996072b05e2bad55d772fd0c55a715b2fa Author: André Almeida Date: Thu Sep 23 14:11:06 2021 -0300 futex,x86: Wire up sys_futex_waitv() Wire up the syscall entry point for the x86 arch, for both i386 and x86_64. Signed-off-by: André Almeida Signed-off-by: Peter Zijlstra (Intel) Link: https://lore.kernel.org/r/20210923171111.300673-18-andrealmeid@collabora.com commit 0552eed197f120e8bd593105dcdd551eb0a40b53 Author: André Almeida Date: Thu Sep 23 14:11:05 2021 -0300 futex: Implement sys_futex_waitv() Add support to wait on multiple futexes. This is the interface implemented by this syscall: futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes, unsigned int flags, struct timespec *timeout, clockid_t clockid) struct futex_waitv { __u64 val; __u64 uaddr; __u32 flags; __u32 __reserved; }; Given an array of struct futex_waitv, wait on each uaddr. The thread wakes if a futex_wake() is performed at any uaddr. The syscall returns immediately if any waiter has *uaddr != val. *timeout is an optional absolute timeout value for the operation. This syscall supports only 64-bit sized timeout structs. The flags argument of the syscall should be empty, but it can be used for future extensions. Flags for shared futexes, sizes, etc.
should be used on the individual flags of each waiter. __reserved is used for explicit padding and should be 0, but it might be used for future extensions. If userspace uses 32-bit pointers, it should make sure to explicitly cast them when assigning to waitv::uaddr. Returns the array index of one of the woken futexes. No information is given about how many were woken, or about any particular attribute of the returned one (whether it was the first woken, whether it has the smallest index, ...). Signed-off-by: André Almeida Signed-off-by: Peter Zijlstra (Intel) Link: https://lore.kernel.org/r/20210923171111.300673-17-andrealmeid@collabora.com commit 4fdad1cc6d6b339d2dce271adf185be61390ccf1 Author: Con Kolivas Date: Mon Dec 14 19:09:01 2020 +0000 clockevents, hrtimer: Make hrtimer granularity and minimum hrtimeout configurable in sysctl. Set default granularity to 100us and min timeout to 500us Rebased-by: Alexandre Frade commit 5e67c25bb8e69cdd476fc400fbc509bb795485ad Author: Con Kolivas Date: Mon Feb 20 13:32:58 2017 +1100 time: Don't use hrtimer overlay when pm_freezing since some drivers still don't correctly use freezable timeouts. Rebased-by: Alexandre Frade commit 36e2473399cf3648266f530966384967af51470e Author: Con Kolivas Date: Mon Feb 20 13:30:32 2017 +1100 hrtimer: Replace all calls to schedule_timeout_uninterruptible of potentially under 50ms to use schedule_msec_hrtimeout_uninterruptible Rebased-by: Alexandre Frade commit c94909f9892ee558852fd674c448ec55dc22c6d0 Author: Con Kolivas Date: Mon Feb 20 13:30:07 2017 +1100 hrtimer: Replace all calls to schedule_timeout_interruptible of potentially under 50ms to use schedule_msec_hrtimeout_interruptible.
Rebased-by: Alexandre Frade commit 7a07c55b886bf67e7c1fcf497b6ce828e3a67c43 Author: Con Kolivas Date: Mon Feb 15 21:56:16 2021 +0000 hrtimer: Replace all schedule timeout(1) with schedule_min_hrtimeout() Rebased-by: Alexandre Frade commit b1119134900ae33d0a2876deb6841887b6af5465 Author: Con Kolivas Date: Fri Nov 4 09:25:54 2016 +1100 timer: Convert msleep to use hrtimers when active. Rebased-by: Alexandre Frade commit de3d10622c993d4ee9be1dc7c8af2df97123f3d9 Author: Con Kolivas Date: Sat Nov 5 09:27:36 2016 +1100 time: Special case calls of schedule_timeout(1) to use the min hrtimeout of 1ms, working around low Hz resolutions. Rebased-by: Alexandre Frade commit 8e6b86a0ed4a09a220af0464e6e1e1aafc91ffe4 Author: Con Kolivas Date: Sat Aug 12 11:53:39 2017 +1000 hrtimer: Create highres timeout variants of schedule_timeout functions. Rebased-by: Alexandre Frade commit 1c0343c72b752fa2fbb552e628be4e41403b9b7c Author: Alexandre Frade Date: Sun Aug 29 23:58:33 2021 +0000 XANMOD: fair: Remove all energy efficiency functions Signed-off-by: Alexandre Frade commit 9232ff82bbc50f364b315a9e2953f3b1026454ee Author: Alexandre Frade Date: Fri Jun 18 19:10:55 2021 +0000 XANMOD: Makefile: Turn off loop vectorization for GCC -O3 optimization level Signed-off-by: Alexandre Frade commit acecd1e82b0ad6d07314067b51053ef6ff7996ce Author: Alexandre Frade Date: Thu Sep 3 20:36:13 2020 +0000 XANMOD: init/Kconfig: Enable -O3 KBUILD_CFLAGS optimization for all architectures Signed-off-by: Alexandre Frade commit 5973084b3c95f2b02905f79a86ea48d96125ebce Author: Alexandre Frade Date: Thu Jun 25 16:40:43 2020 -0300 XANMOD: lib/kconfig.debug: disable default CONFIG_SYMBOLIC_ERRNAME and CONFIG_DEBUG_BUGVERBOSE Signed-off-by: Alexandre Frade commit 70dff0736d29f6813f3ce4af7a78489005e441ae Author: Alexandre Frade Date: Mon Jan 29 17:41:29 2018 +0000 XANMOD: scripts: disable the localversion "+" tag of a git repo Signed-off-by: Alexandre Frade commit 6bf377d46ef3d41e0a4f552bb4a9bee3ba601656 Author: 
Alexandre Frade Date: Tue Mar 31 13:32:08 2020 -0300 XANMOD: cpufreq: tunes ondemand and conservative governor for performance Signed-off-by: Alexandre Frade commit 9879bb447b5a706d36a54c1e7f8334c9cf430ffd Author: Alexandre Frade Date: Mon Jan 29 17:31:25 2018 +0000 XANMOD: mm/vmscan: vm_swappiness = 30 decreases the amount of swapping Signed-off-by: Alexandre Frade commit 51fb6e90d1861e8a8f7f65f51441479ffb959837 Author: Alexandre Frade Date: Thu Aug 13 14:57:06 2020 +0000 XANMOD: sched/autogroup: Add kernel parameter and config option to enable/disable autogroup feature by default Signed-off-by: Alexandre Frade commit d009ad96d15674c1fac6a4fd2798ed814dd4f09b Author: Alexandre Frade Date: Mon Jan 29 16:59:22 2018 +0000 XANMOD: dcache: cache_pressure = 50 decreases the rate at which VFS caches are reclaimed Signed-off-by: Alexandre Frade commit fb0e76989bb3dc6b1eaa9968ecd8c817b00f3ca7 Author: Alexandre Frade Date: Mon Jan 29 17:26:15 2018 +0000 XANMOD: kconfig: add 500Hz timer interrupt kernel config option Signed-off-by: Alexandre Frade commit d0bc09370b4ad746f867a4d8e8cfd689859bbd6e Author: Alexandre Frade Date: Mon Dec 14 16:24:26 2020 +0000 XANMOD: block: set rq_affinity to force full multithreading I/O requests Signed-off-by: Alexandre Frade commit 7c013f8d5c9ed1d12530b43ec15e5939955e13af Author: Alexandre Frade Date: Mon Jun 1 18:23:51 2020 -0300 XANMOD: block, bfq: change BLK_DEV_ZONED depends to IOSCHED_BFQ Signed-off-by: Alexandre Frade commit af2a86239ff91c60a1cdae7aff49eedeeb10edd4 Author: Alexandre Frade Date: Mon Nov 25 15:13:06 2019 -0300 XANMOD: elevator: set default scheduler to bfq for blk-mq Signed-off-by: Alexandre Frade commit 8bb7eca972ad531c9b149c0a51ab43a417385813 Author: Linus Torvalds Date: Sun Oct 31 13:53:10 2021 -0700 Linux 5.15