commit ec366c457ef3bffdd0db968d1077099536d27963
Author: Alexandre Frade
Date:   Sat Aug 6 20:57:00 2022 +0000

    Linux 5.19.0-xanmod2

    Signed-off-by: Alexandre Frade

commit ab65c0b200d5625d9ecf8b238fa490cef03e3997
Author: Alexandre Frade
Date:   Sat Aug 6 20:51:58 2022 +0000

    Revert "mm/vmscan: add sysctl knobs for protecting the working set"

    This reverts commit 69231fa48bdf03d9c56a3fae6d8746cbf01bbea1.

commit f2995731db1f23ace5df54d03376c96d65fb0fea
Author: Alexandre Frade
Date:   Tue Aug 2 04:56:37 2022 +0000

    Linux 5.19.0-xanmod1

    Signed-off-by: Alexandre Frade

commit 275cd7d8c8b7e90e22bf30e6153bac47bbd9369c
Author: Alexandre Frade
Date:   Mon Mar 21 21:37:19 2022 +0000

    i2c: busses: Add SMBus capability to work with OpenRGB driver control

    Signed-off-by: Alexandre Frade

commit 051dd941f97e40e8ef22a4c47628402005efee6a
Author: Mark Weiman
Date:   Sun Aug 12 11:36:21 2018 -0400

    pci: Enable overrides for missing ACS capabilities

    This is an updated version of Alex Williamson's patch from:
    https://lkml.org/lkml/2013/5/30/513

    Original commit message follows:

    PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that allows us to control whether transactions are allowed to be redirected in various subnodes of a PCIe topology. For instance, if two endpoints are below a root port or downstream switch port, the downstream port may optionally redirect transactions between the devices, bypassing upstream devices. The same can happen internally on multifunction devices. The transaction may never be visible to the upstream devices.

    One upstream device that we particularly care about is the IOMMU. If a redirection occurs in the topology below the IOMMU, then the IOMMU cannot provide isolation between devices. This is why the PCIe spec encourages topologies to include ACS support. Without it, we have to assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.

    Unfortunately, far too many topologies do not support ACS to make this a steadfast requirement. Even the latest chipsets from Intel support ACS only sporadically. We have trouble getting interconnect vendors to include the PCIe-spec-required ACS capability, let alone suggested features.

    Therefore, we need to add some flexibility. The pcie_acs_override= boot option lets users opt in specific devices or sets of devices to assume ACS support. The "downstream" option assumes full ACS support on root ports and downstream switch ports. The "multifunction" option assumes the subset of ACS features available on multifunction endpoints and upstream switch ports are supported. The "id:nnnn:nnnn" option enables ACS support on devices matching the provided vendor and device IDs, allowing more strategic ACS overrides. These options may be combined in any order. A maximum of 16 id-specific overrides are available. It's suggested to use the most limited set of options necessary to avoid completely disabling ACS across the topology.

    Note to hardware vendors: we have facilities to permanently quirk specific devices which enforce isolation but do not provide an ACS capability. Please contact me to have your devices added and save your customers the hassle of this boot option.

    Rebased-by: Alexandre Frade
    Signed-off-by: Mark Weiman
    Signed-off-by: Alexandre Frade
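An illustrative usage sketch (not from the patch itself): booting with pcie_acs_override=downstream,multifunction would apply both of the blanket options described above, and an id:nnnn:nnnn entry with a concrete vendor:device pair can be combined with them to target one specific device; the comma-separated syntax here is an assumption consistent with the commit's note that the options "may be combined in any order".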
commit 01bf246b3bf92022e45834162def26cc579a2faa
Author: graysky
Date:   Tue Mar 15 05:58:43 2022 -0400

    x86/kconfig: more uarches for kernel 5.17+

    FEATURES
    This patch adds additional CPU options to the Linux kernel accessible under:
      Processor type and features --->
        Processor family --->

    With the release of gcc 11.1 and clang 12.0, several generic 64-bit levels are offered which are good for supported Intel or AMD CPUs:
    • x86-64-v2
    • x86-64-v3
    • x86-64-v4

    Users of glibc 2.33 and above can see which level is supported by current hardware by running:
      /lib/ld-linux-x86-64.so.2 --help | grep supported

    Alternatively, compare the flags from /proc/cpuinfo to this list.[1]

    CPU-specific microarchitectures include:
    • AMD Improved K8-family
    • AMD K10-family
    • AMD Family 10h (Barcelona)
    • AMD Family 14h (Bobcat)
    • AMD Family 16h (Jaguar)
    • AMD Family 15h (Bulldozer)
    • AMD Family 15h (Piledriver)
    • AMD Family 15h (Steamroller)
    • AMD Family 15h (Excavator)
    • AMD Family 17h (Zen)
    • AMD Family 17h (Zen 2)
    • AMD Family 19h (Zen 3)†
    • Intel Silvermont low-power processors
    • Intel Goldmont low-power processors (Apollo Lake and Denverton)
    • Intel Goldmont Plus low-power processors (Gemini Lake)
    • Intel 1st Gen Core i3/i5/i7 (Nehalem)
    • Intel 1.5 Gen Core i3/i5/i7 (Westmere)
    • Intel 2nd Gen Core i3/i5/i7 (Sandybridge)
    • Intel 3rd Gen Core i3/i5/i7 (Ivybridge)
    • Intel 4th Gen Core i3/i5/i7 (Haswell)
    • Intel 5th Gen Core i3/i5/i7 (Broadwell)
    • Intel 6th Gen Core i3/i5/i7 (Skylake)
    • Intel 6th Gen Core i7/i9 (Skylake X)
    • Intel 8th Gen Core i3/i5/i7 (Cannon Lake)
    • Intel 10th Gen Core i7/i9 (Ice Lake)
    • Intel Xeon (Cascade Lake)
    • Intel Xeon (Cooper Lake)*
    • Intel 3rd Gen 10nm++ i3/i5/i7/i9-family (Tiger Lake)*
    • Intel 3rd Gen 10nm++ Xeon (Sapphire Rapids)‡
    • Intel 11th Gen i3/i5/i7/i9-family (Rocket Lake)‡
    • Intel 12th Gen i3/i5/i7/i9-family (Alder Lake)‡

    Notes: If not otherwise noted, gcc >=9.1 is required for support.
    *Requires gcc >=10.1 or clang >=10.0
    †Requires gcc >=10.3 or clang >=12.0
    ‡Requires gcc >=11.1 or clang >=12.0

    It also offers the 'native' option which, "selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine. Using -march=native enables all instruction subsets supported by the local machine and will produce code optimized for the local machine under the constraints of the selected instruction set."[2]

    Users of Intel CPUs should select the 'Intel-Native' option and users of AMD CPUs should select the 'AMD-Native' option.

    MINOR NOTES RELATING TO INTEL ATOM PROCESSORS
    This patch also changes -march=atom to -march=bonnell in accordance with the gcc v4.9 changes. Upstream is using the deprecated -march=atom flag when I believe it should use the newer -march=bonnell flag for Atom processors.[3]

    It is not recommended to compile on Atom CPUs with the 'native' option.[4] The recommendation is to use the 'atom' option instead.

    BENEFITS
    Small but real speed increases are measurable using a make endpoint comparing a generic kernel to one built with one of the respective microarchs. See the following experimental evidence supporting this statement: https://github.com/graysky2/kernel_gcc_patch

    REQUIREMENTS
    linux version 5.17+
    gcc version >=9.0 or clang version >=9.0

    ACKNOWLEDGMENTS
    This patch builds on the seminal work by Jeroen.[5]

    REFERENCES
    1. https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/77566eb03bc6a326811cb7e9
    2. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-x86-Options
    3. https://bugzilla.kernel.org/show_bug.cgi?id=77461
    4. https://github.com/graysky2/kernel_gcc_patch/issues/15
    5. http://www.linuxforge.net/docs/linux/linux-gcc.php

    Signed-off-by: graysky
    Signed-off-by: Alexandre Frade
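For orientation (the exact Kconfig symbols are defined by graysky's patch and can differ between patch versions, so the names below are assumptions): selecting the generic x86-64-v3 level is expected to land in the final .config as something like CONFIG_GENERIC_CPU3=y, while an AMD Zen 3 build would instead select an option like CONFIG_MZEN3=y; the ld-linux command quoted above verifies which x86-64-v* level the build host actually supports.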
commit 092dc52d24d4dc5b2ae00a4f2941d0a3e90a4330
Author: Christian Brauner
Date:   Wed Jan 23 21:54:23 2019 +0100

    SAUCE: binder: give binder_alloc its own debug mask file

    Currently binder.c and binder_alloc.c both register the /sys/module/binder_linux/parameters/debug_mask file, which leads to conflicts in sysfs. This commit gives binder_alloc.c its own /sys/module/binder_linux/parameters/alloc_debug_mask file.

    Signed-off-by: Christian Brauner
    Signed-off-by: Seth Forshee
    Signed-off-by: Alexandre Frade

commit cfcf7e81ed1f822e96b2006f0b53699cb3c2e79f
Author: Christian Brauner
Date:   Wed Jan 16 23:13:25 2019 +0100

    SAUCE: binder: turn into module

    The Android binder driver needs to become a module for the sake of shipping Anbox. To do this we need to export the following functions since binder is currently still using them:

    - security_binder_set_context_mgr()
    - security_binder_transaction()
    - security_binder_transfer_binder()
    - security_binder_transfer_file()
    - can_nice()
    - __wake_up_pollfree()
    - __close_fd_get_file()
    - mmput_async()
    - task_work_add()
    - map_kernel_range_noflush()
    - get_vm_area()
    - zap_page_range()
    - put_ipc_ns()
    - get_ipc_ns_exported()
    - show_init_ipc_ns()

    Rebased-by: Alexandre Frade
    Signed-off-by: Christian Brauner
    [ saf: fix additional reference to init_ipc_ns from 5.0-rc6 ]
    Signed-off-by: Seth Forshee
    Signed-off-by: Alexandre Frade

commit 1f0fdc2366f8f372e23bc7dddf8474fd0b376cbd
Author: Serge Hallyn
Date:   Fri May 31 19:12:12 2013 +0100

    sysctl: add sysctl to disallow unprivileged CLONE_NEWUSER by default

    This is a short-term patch. Unprivileged use of CLONE_NEWUSER is certainly an intended feature of user namespaces. However, for at least saucy we want to make sure that, if any security issues are found, we have a fail-safe.

    Signed-off-by: Serge Hallyn
    [bwh: Remove unneeded binary sysctl bits]
    [bwh: Keep this sysctl, but change the default to enabled]
    Signed-off-by: Alexandre Frade
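In the Ubuntu/Debian lineage this patch descends from, the knob is exposed as kernel.unprivileged_userns_clone (the name is assumed from that lineage; it is not stated above): sysctl kernel.unprivileged_userns_clone=0 disables unprivileged user namespace creation at runtime, and =1 restores the enabled default noted in the bracketed remark.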
commit 80443721385a1ba095c87cc2e3b5a2f1f200d53c
Author: Zebediah Figura
Date:   Wed Apr 20 18:58:17 2022 -0500

    docs: winesync: Document alertable waits.

    Signed-off-by: Alexandre Frade

commit ea4499b4623879156350fbe842f527589b250418
Author: Zebediah Figura
Date:   Wed Apr 20 18:24:43 2022 -0500

    selftests: winesync: Add some tests for wakeup signaling via alerts.

    Signed-off-by: Alexandre Frade

commit 895b83e167476231c347977460bc53a04dc571ec
Author: Zebediah Figura
Date:   Wed Apr 20 18:08:37 2022 -0500

    selftests: winesync: Add tests for alertable waits.

    Signed-off-by: Alexandre Frade

commit 33fee2141b21869c2f9a181a7bf4d87ebb4a7a4e
Author: Zebediah Figura
Date:   Wed Apr 13 20:02:39 2022 -0500

    winesync: Introduce alertable waits.

    Signed-off-by: Alexandre Frade

commit 21aac1af10e6cd4bf456fe2c458ab35312b86c7b
Author: Zebediah Figura
Date:   Wed Jan 19 22:01:46 2022 -0600

    docs: winesync: Document event APIs.

    Signed-off-by: Alexandre Frade

commit bb3d53ec8cb668781486fce9e09f6ab8410ff0ce
Author: Zebediah Figura
Date:   Wed Jan 19 21:06:22 2022 -0600

    selftests: winesync: Add some tests for invalid object handling with events.

    Signed-off-by: Alexandre Frade

commit b10edcf7bc710c9a63b0a7b709874fd65d689091
Author: Zebediah Figura
Date:   Wed Jan 19 21:00:50 2022 -0600

    selftests: winesync: Add some tests for wakeup signaling with events.

    Signed-off-by: Alexandre Frade

commit 9c0a898674419307054e353a4c8a0a38ef471217
Author: Zebediah Figura
Date:   Wed Jan 19 19:45:39 2022 -0600

    selftests: winesync: Add some tests for auto-reset event state.

    Signed-off-by: Alexandre Frade

commit 2fda591ea641e546d99249afd47ff9569b40aa3b
Author: Zebediah Figura
Date:   Wed Jan 19 19:34:47 2022 -0600

    selftests: winesync: Add some tests for manual-reset event state.

    Signed-off-by: Alexandre Frade

commit 69dbb25d1d229e14aaa5616718d47e3cfaf02e6f
Author: Zebediah Figura
Date:   Wed Jan 19 19:14:00 2022 -0600

    winesync: Introduce WINESYNC_IOC_READ_EVENT.

    Signed-off-by: Alexandre Frade

commit 4f3ff01258ff3e3c1442606bc86e51851b562555
Author: Zebediah Figura
Date:   Wed Jan 19 19:10:12 2022 -0600

    winesync: Introduce WINESYNC_IOC_PULSE_EVENT.

    Signed-off-by: Alexandre Frade

commit 2770e079a3d6ac17a81935e900eefe1733a32f6e
Author: Zebediah Figura
Date:   Wed Jan 19 19:00:25 2022 -0600

    winesync: Introduce WINESYNC_IOC_RESET_EVENT.

    Signed-off-by: Alexandre Frade

commit 3ab668e9b57e80c4e702aabd3cff07b0c747187f
Author: Zebediah Figura
Date:   Wed Jan 19 18:43:30 2022 -0600

    winesync: Introduce WINESYNC_IOC_SET_EVENT.

    Signed-off-by: Alexandre Frade

commit 289fe79e455ec046dfc6e56647063ed7c7cdc463
Author: Zebediah Figura
Date:   Wed Jan 19 18:21:03 2022 -0600

    winesync: Introduce WINESYNC_IOC_CREATE_EVENT.

    Signed-off-by: Alexandre Frade

commit 249f4907bf2aa743937a3c9c713faccb5ed8a25c
Author: Zebediah Figura
Date:   Fri Mar 5 12:22:55 2021 -0600

    maintainers: Add an entry for winesync.

    Signed-off-by: Alexandre Frade

commit 18a26c17e817d9650f9a05f0e391a7d42495f476
Author: Zebediah Figura
Date:   Fri Mar 5 12:09:36 2021 -0600

    selftests: winesync: Add some tests for wakeup signaling with WINESYNC_IOC_WAIT_ALL.

    Signed-off-by: Alexandre Frade

commit f0381462fc94e03d2fe856f4aca10e1332d53671
Author: Zebediah Figura
Date:   Fri Mar 5 12:09:32 2021 -0600

    selftests: winesync: Add some tests for wakeup signaling with WINESYNC_IOC_WAIT_ANY.

    Signed-off-by: Alexandre Frade

commit 0ca9426a40fc6fc00929676583609d278cb5fbfa
Author: Zebediah Figura
Date:   Fri Mar 5 12:08:54 2021 -0600

    selftests: winesync: Add some tests for invalid object handling.

    Signed-off-by: Alexandre Frade

commit 8d5be9e41727715fe6a4b626bce37efeee1aa51f
Author: Zebediah Figura
Date:   Fri Mar 5 12:08:25 2021 -0600

    selftests: winesync: Add some tests for WINESYNC_IOC_WAIT_ALL.

    Signed-off-by: Alexandre Frade

commit 92d025e3d17a7bda396c06275ea39cf7c3b0dbd8
Author: Zebediah Figura
Date:   Fri Mar 5 12:07:45 2021 -0600

    selftests: winesync: Add some tests for WINESYNC_IOC_WAIT_ANY.

    Signed-off-by: Alexandre Frade

commit 457d6678490ec743203d21049eaeaa7074dcb830
Author: Zebediah Figura
Date:   Fri Mar 5 12:07:04 2021 -0600

    selftests: winesync: Add some tests for mutex state.

    Signed-off-by: Alexandre Frade

commit 169fa95e32a7d4b17b2380043e5a0b46758119f8
Author: Zebediah Figura
Date:   Fri Mar 5 12:06:23 2021 -0600

    selftests: winesync: Add some tests for semaphore state.

    Signed-off-by: Alexandre Frade

commit b8c7da1fea3f64c7c41c2779896c5984e0e52360
Author: Zebediah Figura
Date:   Fri Mar 5 11:50:49 2021 -0600

    docs: winesync: Add documentation for the winesync uAPI.

    Signed-off-by: Alexandre Frade

commit 3c5b959a24675a05a5b16256c6d0c16f33b163fe
Author: Zebediah Figura
Date:   Fri Mar 5 11:48:10 2021 -0600

    winesync: Introduce WINESYNC_IOC_READ_MUTEX.

    Signed-off-by: Alexandre Frade
commit bd60bba54990640f8a98f1f78a737cd33661d094
Author: Zebediah Figura
Date:   Fri Mar 5 11:47:55 2021 -0600

    winesync: Introduce WINESYNC_IOC_READ_SEM.

    Signed-off-by: Alexandre Frade

commit 004c2611f2c8054ab8a694bc1ab14fe15a1dc322
Author: Zebediah Figura
Date:   Fri Mar 5 11:46:46 2021 -0600

    winesync: Introduce WINESYNC_IOC_KILL_OWNER.

    Signed-off-by: Alexandre Frade

commit 717517060484810b712220274f40024cd7b06a86
Author: Zebediah Figura
Date:   Fri Mar 5 11:44:41 2021 -0600

    winesync: Introduce WINESYNC_IOC_PUT_MUTEX.

    Signed-off-by: Alexandre Frade

commit 7935daadc580757987449b823d5447c0991b3d2b
Author: Zebediah Figura
Date:   Fri Mar 5 11:41:10 2021 -0600

    winesync: Introduce WINESYNC_IOC_CREATE_MUTEX.

    Signed-off-by: Alexandre Frade

commit 51ec0831c8dbf0129e8873a0a429d031d46ff723
Author: Zebediah Figura
Date:   Fri Mar 5 11:36:09 2021 -0600

    winesync: Introduce WINESYNC_IOC_WAIT_ALL.

    Signed-off-by: Alexandre Frade

commit cefb1569f722684592089790cceb56e088321074
Author: Zebediah Figura
Date:   Fri Mar 5 11:31:44 2021 -0600

    winesync: Introduce WINESYNC_IOC_WAIT_ANY.

    Signed-off-by: Alexandre Frade

commit 1959164d8f25481687c023cc643e84fdd24b0e18
Author: Zebediah Figura
Date:   Fri Mar 5 11:22:42 2021 -0600

    winesync: Introduce WINESYNC_IOC_PUT_SEM.

    Signed-off-by: Alexandre Frade

commit 7aca173d140cc7818064fa510ff75f55be675eeb
Author: Zebediah Figura
Date:   Fri Mar 5 11:15:39 2021 -0600

    winesync: Introduce WINESYNC_IOC_CREATE_SEM and WINESYNC_IOC_DELETE.

    Signed-off-by: Alexandre Frade

commit 6ed53d3f45453bd53b706e01b19493eca089bb3f
Author: Zebediah Figura
Date:   Fri Mar 5 10:57:06 2021 -0600

    winesync: Reserve a minor device number and ioctl range.

    Signed-off-by: Alexandre Frade

commit 2779f66c32cce38808017464c61f399039fdc1f7
Author: Zebediah Figura
Date:   Fri Mar 5 10:50:45 2021 -0600

    winesync: Introduce the winesync driver and character device.

    Signed-off-by: Alexandre Frade
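Taken together, the winesync series above defines an ioctl-based uAPI on a character device. The C sketch below shows only the general calling pattern; the device node path, the struct layout, and the request codes are assumptions inferred from the commit subjects (the driver never landed upstream, so the real uapi header may differ):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Illustrative definitions -- in a real build these come from the
     * driver's uapi header; field names and request codes are guesses. */
    struct winesync_sem_args {
        unsigned int sem;   /* out: identifier of the new semaphore */
        unsigned int count; /* current count */
        unsigned int max;   /* maximum count */
    };
    #define WINESYNC_IOC_CREATE_SEM 0 /* placeholder request code */
    #define WINESYNC_IOC_PUT_SEM    1 /* placeholder request code */

    int main(void)
    {
        struct winesync_sem_args sem = { .count = 0, .max = 2 };
        int fd = open("/dev/winesync", O_RDWR); /* assumed device node */

        if (fd < 0 || ioctl(fd, WINESYNC_IOC_CREATE_SEM, &sem) < 0) {
            perror("winesync");
            return 1;
        }
        /* Release (post) the semaphore once; a waiter blocked in
         * WINESYNC_IOC_WAIT_ANY/WINESYNC_IOC_WAIT_ALL on sem.sem would
         * be woken, per the commits above. */
        struct winesync_sem_args put = { .sem = sem.sem, .count = 1 };
        ioctl(fd, WINESYNC_IOC_PUT_SEM, &put);
        close(fd);
        return 0;
    }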
commit 5023239d477c6743fe1bc21b4ffe4427174b50d0
Author: Felix Fietkau
Date:   Sat Dec 5 15:07:03 2015 +0100

    mac80211: ignore AP power level when tx power type is "fixed"

    In some cases a user might want to connect to a far-away access point, which announces a low tx power limit. Using the AP's power limit can make the connection significantly more unstable or even impossible, and mac80211 currently provides no way to disable this behavior.

    To fix this, use the currently unused distinction between limited and fixed tx power to decide whether a remote AP's power limit should be accepted.

    Signed-off-by: Felix Fietkau
    Signed-off-by: Alexandre Frade

commit 46ec667a001aac5d8af372b8f7b78159d1cf7b57
Author: Alexandre Frade
Date:   Wed Dec 8 11:55:28 2021 +0000

    netfilter: Add full cone NAT support

    Link: https://github.com/llccd/netfilter-full-cone-nat
    Signed-off-by: Alexandre Frade

commit 8dab367e68b98c43771b725f13ce0f01a03b0de6
Author: Konstantin Demin
Date:   Tue May 17 10:10:40 2022 +0300

    net-tcp_bbr: v2: Use correct 64-bit division

    Signed-off-by: Konstantin Demin
    Signed-off-by: Alexandre Frade

commit 7f599abe426b34b32f48a8cb8bacb6eeb2884a88
Author: Adithya Abraham Philip
Date:   Fri Jun 11 21:56:10 2021 +0000

    net-tcp_bbr: v2: Fix missing ECT markings on retransmits for BBRv2

    Adds a new flag TCP_ECN_ECT_PERMANENT that is used by CCAs to indicate that retransmitted packets and pure ACKs must have the ECT bit set. This is a necessary fix for BBRv2, which when using ECN expects ECT to be set even on retransmitted packets and ACKs.

    Currently, CCAs like BBRv2 which can use ECN but don't "need" it have no way to indicate that ECT should be set on retransmissions/ACKs.

    Signed-off-by: Adithya Abraham Philip
    Signed-off-by: Neal Cardwell
    Signed-off-by: Alexandre Frade

commit 7aae2dad139c30d99a847d4ed15ec57c5a5e442a
Author: Neal Cardwell
Date:   Mon Dec 28 19:23:09 2020 -0500

    net-tcp_bbr: v2: don't assume prior_cwnd was set entering CA_Loss

    Fix WARN_ON_ONCE() warnings that were firing and pointing to a bbr->prior_cwnd of 0 when exiting CA_Loss and transitioning to CA_Open.

    The issue was that tcp_simple_retransmit() calls:

      tcp_set_ca_state(sk, TCP_CA_Loss);

    without first calling icsk_ca_ops->ssthresh(sk) (because tcp_simple_retransmit() is dealing with losses due to MTU issues and not congestion). The lack of this callback means that BBR did not get a chance to set bbr->prior_cwnd, and thus upon exiting CA_Loss in such cases the WARN_ON_ONCE() would fire due to a zero bbr->prior_cwnd.

    This commit removes that warning, since a bbr->prior_cwnd of 0 is a valid situation in this state transition. For setting inflight_lo upon entering CA_Loss, to avoid setting an inflight_lo of 0 in this case, this commit switches to taking the max of cwnd and prior_cwnd. We plan to remove that line of code when we switch to cautious (PRR-style) recovery, so that awkwardness will go away.

    Change-Id: I575dce871c2f20e91e3e9449e1706f42a07b8118
    Signed-off-by: Alexandre Frade

commit bb0c9f9489116c807129825211ede9f7dffbe733
Author: Neal Cardwell
Date:   Mon Aug 17 19:10:21 2020 -0400

    net-tcp_bbr: v2: remove cycle_rand parameter that is unused in BBRv2

    Change-Id: Iee1df7e41e42de199068d7c89131ed3d228327c0
    Signed-off-by: Alexandre Frade

commit a6362e6167f36979abee2bb9b1e68d5bb90f4544
Author: Neal Cardwell
Date:   Mon Aug 17 19:08:41 2020 -0400

    net-tcp_bbr: v2: remove field bw_rtts that is unused in BBRv2

    Change-Id: I58e3346c707748a6f316f3ed060d2da84c32a79b
    Signed-off-by: Alexandre Frade

commit ab97a6429a701fba34bbccb4214e30ed55365d0c
Author: Neal Cardwell
Date:   Thu Nov 21 15:28:01 2019 -0500

    net-tcp_bbr: v2: remove unnecessary rs.delivered_ce logic upon loss

    There is no reason to compute rs.delivered_ce upon loss. In fact, we specifically do not want to compute rs.delivered_ce upon loss. Two issues:

    (1) This would be the wrong thing to do, in behavior terms. With RACK's dynamic reordering window, losses can be marked long after the sequence hole appears in the ACK/SACK stream. We want to catch the ECN mark rate rising too high as quickly as possible, which means we want to check for high ECN mark rates at ACK time (as BBRv2 currently does) and not loss marking time.

    (2) This is dead code. The ECN mark rate cannot be detected as too high because the check needs rs->delivered to be > 0 as well:

      if (rs->delivered_ce > 0 && rs->delivered > 0 &&

    Since we are not setting rs->delivered upon loss, this check cannot succeed, so setting delivered_ce is pointless.

    This dead and wrong line was discovered by Randall Stewart at Netflix as he was reading the BBRv2 code.

    Change-Id: I37f83f418a259ec31d8f82de986db071b364b76a
    Signed-off-by: Alexandre Frade

commit 966903938e79ca45daeacc25ef531db00e4f2693
Author: Neal Cardwell
Date:   Tue Jun 11 12:54:22 2019 -0400

    net-tcp_bbr: v2: BBRv2 ("bbr2") congestion control for Linux TCP

    BBR v2 is an enhancement to the BBR v1 algorithm. It's designed to aim for lower queues, lower loss, and better Reno/CUBIC coexistence than BBR v1.
    BBR v2 maintains the core of BBR v1: an explicit model of the network path that is two-dimensional, adapting to estimate the (a) maximum available bandwidth and (b) maximum safe volume of data a flow can keep in-flight in the network. It maintains the estimated BDP as a core guide for estimating an appropriate level of in-flight data.

    BBR v2 makes several key enhancements:

    o Its bandwidth-probing time scale is adapted, within bounds, to allow improved coexistence with Reno and CUBIC. The bandwidth-probing time scale is (a) extended dynamically based on estimated BDP to improve coexistence with Reno/CUBIC; (b) bounded by an interactive wall-clock time-scale to be more scalable and responsive than Reno and CUBIC.

    o Rather than being largely agnostic to loss and ECN marks, it explicitly uses loss and (DCTCP-style) ECN signals to maintain its model.

    o It aims for lower losses than v1 by adjusting its model to attempt to stay within loss rate and ECN mark rate bounds (loss_thresh and ecn_thresh, respectively).

    o It adapts to loss/ECN signals even when the application is running out of data ("application-limited"), in case the "application-limited" flow is also "network-limited" (the bw and/or inflight available to this flow is lower than previously estimated when the flow ran out of data).

    o It has a three-part model: the model explicitly tracks three operating points, where an operating point is a tuple: (bandwidth, inflight). The three operating points are:

      o latest:      the latest measurement from the current round trip
      o upper bound: robust, optimistic, long-term upper bound
      o lower bound: robust, conservative, short-term lower bound

    These are stored in the following state variables:

      o latest: bw_latest, inflight_latest
      o lo:     bw_lo, inflight_lo
      o hi:     bw_hi[2], inflight_hi

    To gain intuition about the meaning of the three operating points, it may help to consider the analogs in CUBIC, which has a somewhat analogous three-part model used by its probing state machine:

      BBR param   CUBIC param
      ---------   -----------
      latest  ~   cwnd
      lo      ~   ssthresh
      hi      ~   last_max_cwnd

    The analogy is only a loose one, though, since the BBR operating points are calculated differently, and are 2-dimensional (bw, inflight) rather than CUBIC's one-dimensional notion of operating point (inflight).

    o It uses the three-part model to adapt the magnitude of its bandwidth probing to match the estimated space available in the buffer, rather than (as in BBR v1) assuming that it was always acceptable to place 0.25*BDP in the bottleneck buffer when probing (commodity datacenter switches commonly do not have that much buffer for WAN flows). When BBR v2 estimates it hit a buffer limit during probing, its bandwidth probing then starts gently in case little space is still available in the buffer, and then accelerates, slowly at first and then rapidly if it can grow inflight without seeing congestion signals. In such cases, probing is bounded by inflight_hi + inflight_probe, where inflight_probe grows as: [0, 1, 2, 4, 8, 16, ...]. This allows BBR to keep losses low and bounded if a bottleneck remains congested, while rapidly/scalably utilizing free bandwidth when it becomes available.

    o It has a slightly revised state machine, to achieve the goals above.
      BBR_BW_PROBE_UP:     pushes up inflight to probe for bw/vol
      BBR_BW_PROBE_DOWN:   drain excess inflight from the queue
      BBR_BW_PROBE_CRUISE: use pipe, w/ headroom in queue/pipe
      BBR_BW_PROBE_REFILL: try to refill the pipe again to 100%, leaving queue empty

    o The estimated BDP: BBR v2 continues to maintain an estimate of the path's two-way propagation delay, by tracking a windowed min_rtt, and coordinating (on an as-needed basis) to try to expose the two-way propagation delay by draining the bottleneck queue. BBR v2 continues to use its min_rtt and (currently-applicable) bandwidth estimate to estimate the current bandwidth-delay product. The estimated BDP still provides one important guideline for bounding inflight data. However, because any min-filtered RTT and max-filtered bw inherently tend to both overestimate, the estimated BDP is often too high; in this case loss or ECN marks can ensue, in which case BBR v2 adjusts inflight_hi and inflight_lo to adapt its sending rate and inflight down to match the available capacity of the path.

    o Space: Note that ICSK_CA_PRIV_SIZE increased. This is because BBR v2 requires more space. Note that much of the space is due to support for per-socket parameterization and debugging in this release, for research purposes. With that state removed, the full "struct bbr" is 140 bytes, or 144 with padding. This is an increase of 40 bytes over the existing ca_priv space.

    o Code: BBR v2 reuses many pieces from BBR v1. But it omits the following significant pieces:

      o "packet conservation" (bbr_set_cwnd_to_recover_or_restore(), bbr_can_grow_inflight())
      o long-term bandwidth estimator ("policer mode")

    The code layout tries to keep BBR v2 code near the bottom of the file, so that v1-applicable code in the top does not accidentally refer to v2 code.

    o Docs: See the following docs for more details and diagrams describing the BBR v2 algorithm:
      https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00
      https://datatracker.ietf.org/meeting/102/materials/slides-102-iccrg-an-update-on-bbr-work-at-google-00

    o Internal notes: For this upstream rebase, Neal started from:
      git show fed518041ac6:net/ipv4/tcp_bbr.c > net/ipv4/tcp_bbr.c
    then removed dev instrumentation (dynamic get/set for parameters) and code that was only used by BBRv1.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 2c84098e60bed6d67dde23cd7538c51dee273102
    Change-Id: I125cf26ba2a7a686f2fa5e87f4c2afceb65f7a05
    Signed-off-by: Alexandre Frade
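The three operating points above map directly onto per-connection state. A condensed C sketch of just that state, using the field names quoted in the commit message (the struct name and field types are assumptions; the real "struct bbr" carries far more, per the Space note):

    #include <linux/types.h>

    /* Sketch of BBRv2's three-part model state; not the actual
     * "struct bbr" layout from the patch. */
    struct bbr2_model_sketch {
        /* latest: measurements from the current round trip */
        u32 bw_latest;
        u32 inflight_latest;
        /* lo: robust, conservative, short-term lower bound */
        u32 bw_lo;
        u32 inflight_lo;
        /* hi: robust, optimistic, long-term upper bound */
        u32 bw_hi[2];
        u32 inflight_hi;
    };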
commit 7d1f5b089a4fa9fbfeda28df4450f9afe4633512
Author: Neal Cardwell
Date:   Sat Nov 16 13:16:25 2019 -0500

    net-tcp: add fast_ack_mode=1: skip rwin check in tcp_fast_ack_mode__tcp_ack_snd_check()

    Add logic for an experimental TCP connection behavior, enabled with tp->fast_ack_mode = 1, which disables checking the receive window before sending an ack in __tcp_ack_snd_check(). If this behavior is enabled, the data receiver sends an ACK if the amount of data is > RCV.MSS.

    Change-Id: Iaa0a0fd7108221f883137a79d5bfa724f1b096d4
    Signed-off-by: Alexandre Frade

commit c40b1d29a9c22c5d7095c0caea01ee3d89fd4170
Author: Neal Cardwell
Date:   Fri Sep 27 17:10:26 2019 -0400

    net-tcp: re-generalize TSO sizing in TCP CC module API

    Reorganize the API for CC modules so that the CC module once again gets complete control of the TSO sizing decision. This is how the API was set up around 2016, at the time of the initial BBRv1 upstreaming. Later Eric Dumazet simplified it. But with wider testing it now seems that, to avoid CPU regressions, BBR needs to have a different TSO sizing function.

    This is necessary to handle cases where there are many flows bottlenecked on the sender host's NIC, in which case BBR's pacing rate is much lower than CUBIC/Reno/DCTCP's. Why does this happen? Because BBR's pacing rate adapts to the low bandwidth share each flow sees. By contrast, CUBIC/Reno/DCTCP see no loss or ECN, so they grow a very large cwnd, and thus a large pacing rate and large TSO burst size.

    Change-Id: Ic8ccfdbe4010ee8d4bf6a6334c48a2fceb2171ea
    Signed-off-by: Alexandre Frade

commit 86ba33c6a9b1937a487282ba387941d96ec3a7c3
Author: Yousuk Seung
Date:   Wed May 23 17:55:54 2018 -0700

    net-tcp: add new ca opts flag TCP_CONG_WANTS_CE_EVENTS

    Add a new ca opts flag, TCP_CONG_WANTS_CE_EVENTS, that allows a congestion control module to receive CE events.

    Currently congestion control modules have to set the TCP_CONG_NEEDS_ECN bit in the opts flag to receive CE events, but this may incur changes in ECN behavior elsewhere. This patch adds a new bit, TCP_CONG_WANTS_CE_EVENTS, that allows congestion control modules to receive CE events independently of TCP_CONG_NEEDS_ECN.

    Effort: net-tcp
    Origin-9xx-SHA1: 9f7e14716cde760bc6c67ef8ef7e1ee48501d95b
    Change-Id: I2255506985242f376d910c6fd37daabaf4744f24
    Signed-off-by: Alexandre Frade
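A minimal sketch of how a CC module would opt in, assuming the new bit is set in tcp_congestion_ops.flags the same way TCP_CONG_NEEDS_ECN is (the module name and skeleton here are illustrative, not from the patch):

    #include <linux/module.h>
    #include <net/tcp.h>

    /* Illustrative CC module registration: it asks for CE events via
     * TCP_CONG_WANTS_CE_EVENTS without requesting the ECN negotiation
     * changes that TCP_CONG_NEEDS_ECN implies. */
    static struct tcp_congestion_ops example_cc __read_mostly = {
        .flags = TCP_CONG_WANTS_CE_EVENTS,
        .name  = "example_cc",
        .owner = THIS_MODULE,
        /* .ssthresh, .cong_avoid, .undo_cwnd, etc. as usual */
    };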
commit 577443a6262498dfde75384f27cbe0db53a07082
Author: Neal Cardwell
Date:   Tue May 7 22:37:19 2019 -0400

    net-tcp_bbr: v2: set tx.in_flight for skbs in repair write queue

    Syzkaller was able to use TCP_REPAIR to reproduce the new warning added in tcp_fragment():

      WARNING: CPU: 0 PID: 118174 at net/ipv4/tcp_output.c:1487 tcp_fragment+0xdcc/0x10a0 net/ipv4/tcp_output.c:1487()
      inconsistent: tx.in_flight: 0 old_factor: 53

    The warning happens because skbs inserted into the tcp_rtx_queue during the repair process go through a sort of "fake send" process, and that process was setting pcount but not tx.in_flight, and thus the warnings (where old_factor is the old pcount).

    The fix of setting tx.in_flight in the TCP_REPAIR code path seems simple enough, and indeed makes the repro code from syzkaller stop producing warnings. Running through kokonut tests, and will send out for review when all tests pass.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 330f825a08a6fe92cef74d799cc468864c479f63
    Change-Id: I0bc4a790f040fd4239620e1eedd5dc64666c6f05
    Signed-off-by: Alexandre Frade

commit b0ab3a8b019ce27ccc6d88f4b24bea3d362eb8a5
Author: Neal Cardwell
Date:   Wed May 1 20:16:25 2019 -0400

    net-tcp_bbr: v2: adjust skb tx.in_flight upon split in tcp_fragment()

    When we fragment an skb that has already been sent, we need to update the tx.in_flight for the first skb in the resulting pair ("buff"). Because we were not updating the tx.in_flight, the tx.in_flight value was inconsistent with the pcount of the "buff" skb (tx.in_flight would be too high). That meant that if the "buff" skb was lost, then bbr2_inflight_hi_from_lost_skb() would calculate an inflight_hi value that is too high. This could result in longer queues and higher packet loss.

    Packetdrill testing verified that without this commit, when the second half of an skb is SACKed and then later the first half of that skb is marked lost, the calculated inflight_hi was incorrect.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 385f1ddc610798fab2837f9f372857438b25f874
    Change-Id: I617f8cab4e9be7a0b8e8d30b047bf8645393354d
    Signed-off-by: Alexandre Frade

commit eee9f623770149eb0c4a308c4cdf313d9bc76d01
Author: Neal Cardwell
Date:   Wed May 1 20:16:33 2019 -0400

    net-tcp_bbr: v2: adjust skb tx.in_flight upon merge in tcp_shifted_skb()

    When tcp_shifted_skb() updates state as adjacent SACKed skbs are coalesced, previously the tx.in_flight was not adjusted, so we could get contradictory state where the skb's recorded pcount was bigger than the tx.in_flight (the number of segments that were in_flight after sending the skb).

    Normally, having a SACKed skb with contradictory pcount/tx.in_flight would not matter. However, with SACK reneging, the SACKed bit is removed, and an skb once again becomes eligible for retransmitting, fragmenting, SACKing, etc.

    Packetdrill testing verified the following sequence is possible in a kernel that does not have this commit:

    - skb N is SACKed
    - skb N+1 is SACKed and combined with skb N using tcp_shifted_skb()
    - tcp_shifted_skb() will increase the pcount of prev, but leave tx.in_flight as-is
    - so prev skb can have pcount > tx.in_flight
    - RTO, tcp_timeout_mark_lost(), detect reneg, remove "SACKed" bit, mark skb N as lost
    - find pcount of skb N is greater than its tx.in_flight

    I suspect this issue is what caused the bbr2_inflight_hi_from_lost_skb():

      WARN_ON_ONCE(inflight_prev < 0)

    to fire in production machines using bbr2.

    Tested: See last commit in series for sponge link.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 1a3e997e613d2dcf32b947992882854ebe873715
    Change-Id: I1b0b75c27519953430c7db51c6f358f104c7af55
    Signed-off-by: Alexandre Frade

commit 522e2019ff703096be134432d75924ea3f6c1280
Author: Neal Cardwell
Date:   Tue May 7 22:36:36 2019 -0400

    net-tcp_bbr: v2: factor out tx.in_flight setting into tcp_set_tx_in_flight()

    Factor out the code to set an skb's tx.in_flight field into its own function, so that this code can be used for the TCP_REPAIR "fake send" code path that inserts skbs into the rtx queue without sending them. This is in preparation for the following patch, which fixes an issue with TCP_REPAIR and tx.in_flight.

    Tested: See last patch in series for sponge link.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: e880fc907d06ea7354333f60f712748ebce9497b
    Change-Id: I4fbd4a6e18a51ab06d50ab1c9ad820ce5bea89af
    Signed-off-by: Alexandre Frade

commit b684386ddfa6b2bc8e9851df5cf8737352cea0bd
Author: Neal Cardwell
Date:   Tue Aug 7 21:52:06 2018 -0400

    net-tcp_bbr: v2: introduce ca_ops->skb_marked_lost() CC module callback API

    For connections experiencing reordering, RACK can mark packets lost long after we receive the SACKs/ACKs hinting that the packets were actually lost.

    This means that CC modules cannot easily learn the volume of inflight data at which packet loss happens by looking at the current inflight or even the packets in flight when the most recently SACKed packet was sent. To learn this, CC modules need to know how many packets were in flight at the time lost packets were sent. This new callback, combined with TCP_SKB_CB(skb)->tx.in_flight, allows them to learn this.

    This also provides a consistent callback that is invoked whether packets are marked lost upon ACK processing, using the RACK reordering timer, or at RTO time.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: afcbebe3374e4632ac6714d39e4dc8a8455956f4
    Change-Id: I54826ab53df636be537e5d3c618a46145d12d51a
    Signed-off-by: Alexandre Frade
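A sketch of the callback's shape under the conventions above — the callback name and TCP_SKB_CB(skb)->tx.in_flight come from these commits, while the function body and signature details are assumptions:

    #include <net/tcp.h>

    /* Illustrative skb_marked_lost() implementation: record how much data
     * was in flight when the newly-lost skb was transmitted, so the CC
     * module can bound inflight (as BBRv2 does for inflight_hi). */
    static void example_skb_marked_lost(struct sock *sk, const struct sk_buff *skb)
    {
        u32 inflight_at_send = TCP_SKB_CB(skb)->tx.in_flight;

        /* ...feed inflight_at_send into the module's loss model... */
        (void)inflight_at_send;
    }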
commit ecabbcd8e1d27a478a6831a8caa1388ea1f0b3f0
Author: Neal Cardwell
Date:   Mon Nov 19 13:48:36 2018 -0500

    net-tcp_bbr: v2: export FLAG_ECE in rate_sample.is_ece

    For understanding the relationship between inflight and ECN signals, to try to find the highest inflight value that has acceptable levels of ECN marking.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 3eba998f2898541406c2666781182200934965a8
    Change-Id: I3a964e04cee83e11649a54507043d2dfe769a3b3
    Signed-off-by: Alexandre Frade

commit 1b2cdda7be7d8744832715b12f3c09c582dcd213
Author: Neal Cardwell
Date:   Thu Oct 12 23:44:27 2017 -0400

    net-tcp_bbr: v2: count packets lost over TCP rate sampling interval

    For understanding the relationship between inflight and packet loss signals, to try to find the highest inflight value that has acceptable levels of packet losses.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 4527e26b2bd7756a88b5b9ef1ada3da33dd609ab
    Change-Id: I594c2500868d9c530770e7ddd68ffc87c57f4fd5
    Signed-off-by: Alexandre Frade

commit 254a220ad1503f0f8a06c35af182ac5acf721c21
Author: Neal Cardwell
Date:   Sat Aug 5 11:49:50 2017 -0400

    net-tcp_bbr: v2: snapshot packets in flight at transmit time and pass in rate_sample

    For understanding the relationship between inflight and losses or ECN signals, to try to find the highest inflight value that has acceptable levels of loss/ECN marking.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: b3eb4f2d20efab4ca001f32c9294739036c493ea
    Change-Id: I7314047d0ff14dd261a04b1969a46dc658c8836a
    Signed-off-by: Alexandre Frade

commit 12bc5407a8a8723c9e64838c10a0edcc182909ee
Author: Neal Cardwell
Date:   Sun Jun 24 21:55:59 2018 -0400

    net-tcp_bbr: v2: shrink delivered_mstamp, first_tx_mstamp to u32 to free up 8 bytes

    Free up some space for tracking inflight and losses for each bw sample, in upcoming commits.

    These timestamps are in microseconds, and are now stored in 32 bits. So they can only hold time intervals up to roughly 2^12 = 4096 seconds. But Linux TCP RTT and RTO tracking has the same 32-bit microsecond implementation approach and resulting deployment limitations. So this is not introducing a new limit. And these should not be a limitation for the foreseeable future.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 238a7e6b5d51625fef1ce7769826a7b21b02ae55
    Change-Id: I3b779603797263b52a61ad57c565eb91fe42680c
    Signed-off-by: Alexandre Frade

commit d813d9ac071cc01426879729d3348aecb0058dfc
Author: Neal Cardwell
Date:   Tue Jun 11 12:26:55 2019 -0400

    net-tcp_bbr: broaden app-limited rate sample detection

    This commit is a bug fix for the Linux TCP app-limited (application-limited) logic that is used for collecting rate (bandwidth) samples.

    Previously the app-limited logic only looked for "bubbles" of silence in between application writes, by checking at the start of each sendmsg. But "bubbles" of silence can also happen before retransmits: e.g. bubbles can happen between an application write and a retransmit, or between two retransmits.

    Retransmits are triggered by ACKs or timers. So this commit checks for bubbles of app-limited silence upon ACKs or timers.

    Why does this commit check for app-limited state at the start of ACKs and timer handling? Because at that point we know whether inflight was fully using the cwnd. During processing the ACK or timer event we often change the cwnd; after changing the cwnd we can't know whether inflight was fully using the old cwnd.
    Origin-9xx-SHA1: 3fe9b53291e018407780fb8c356adb5666722cbc
    Change-Id: I37221506f5166877c2b110753d39bb0757985e68
    Signed-off-by: Alexandre Frade

commit 3df0e4f02a54ae7b9a91ce653e1ff7206aa234e3
Author: Arjan van de Ven
Date:   Wed May 17 01:52:11 2017 +0000

    init: wait for partition and retry scan

    Because Clear Linux boots so fast, the device may not be ready when the mounting code is reached, so retry the device scan every 0.5 sec for at least 40 sec and synchronize with the async task.

    Signed-off-by: Miguel Bernal Marin
    Signed-off-by: Alexandre Frade

commit 4ff6159b1cbd2f511a05ce60b82e1f0f23601ca0
Author: Arjan van de Ven
Date:   Thu Jun 2 23:36:32 2016 -0500

    drivers: initialize ata before graphics

    ATA init is the long pole in the boot process, and it's asynchronous. Move the graphics init after it so that ATA and graphics initialize in parallel.

    Signed-off-by: Alexandre Frade

commit 9e8670f285093d5d5914051f5258f7dadfb5c219
Author: Arjan van de Ven
Date:   Sun Feb 18 23:35:41 2018 +0000

    locking: rwsem: spin faster

    Tweak rwsem owner spinning a bit.

    Signed-off-by: Alexandre Frade

commit b38f3ed0964cc70bbcdf3b54aac5907a6a070be8
Author: William Douglas
Date:   Wed Jun 20 17:23:21 2018 +0000

    firmware: Enable stateless firmware loading

    Prefer the order of specific version before generic and /etc before /lib to enable the user to give specific overrides for generic firmware and distribution firmware.

    Signed-off-by: Alexandre Frade

commit fe712029d32a39131af354e91c9c752cdb0cc670
Author: Arjan van de Ven
Date:   Sun Sep 22 11:12:35 2019 -0300

    intel_rapl: Silence rapl trace debug

    Signed-off-by: Alexandre Frade

commit a9c4ca63ec480bac8812ae4ab6e5aa6ab7e286be
Author: huangjie.albert
Date:   Mon Jul 25 16:38:56 2022 +0800

    x86: boot: avoid memory copy if kernel is uncompressed

    1. If the kernel is uncompressed, we do not need to relocate the kernel image for decompression.
    2. If KASLR is disabled, we do not need to do a memory copy before parse_elf.

    Two memory copies can be skipped with this patch; this can save about 20 ms during booting.

    Signed-off-by: huangjie.albert
    Signed-off-by: Alexandre Frade

commit 5983811809023a855398632b3366f51b365967d6
Author: huangjie.albert
Date:   Mon Jul 25 16:38:55 2022 +0800

    x86: Support the uncompressed kernel to speed up booting

    Although the compressed kernel saves the time of loading the kernel into memory and saves the disk space for storing the kernel, in some time-sensitive scenarios the time spent decompressing the kernel is intolerable. Therefore, it is necessary to support uncompressed kernel images, so that the kernel decompression time can be saved when the kernel is started. This part of the time on my machine is approximately:

      image type         image size   time
      compressed (gzip)  8.5M         159ms
      uncompressed       53M          8.5ms

    Signed-off-by: huangjie.albert
    Signed-off-by: Alexandre Frade

commit ae46c0d6346df2084e0b07982fbed59bfdd8989f
Author: huangjie.albert
Date:   Mon Jul 25 16:38:54 2022 +0800

    kexec: add CONFIG_KEXEC_PURGATORY_SKIP_SIG

    verify_sha256_digest may cost 300+ ms in my test environment:
      bzImage: 53M
      initramfs: 28M

    We can add a macro to control whether to enable this check. If we can confirm that the data in this region will not change, we can turn off the check and get a faster startup.
    Signed-off-by: huangjie.albert
    Signed-off-by: Alexandre Frade

commit e3f81f3ef5195bc434b52c5ffd399db278195249
Author: huangjie.albert
Date:   Mon Jul 25 16:38:53 2022 +0800

    kexec: reuse crash kernel reserved memory for normal kexec

    Normally, for a kexec reboot, each segment of the second OS (kernel, initrd, etc.) is copied to discontiguous physical memory during kexec load, and then a memory copy is performed when "kexec -e" is executed to move each segment of the second OS into contiguous physical memory, which affects how long the kexec switch to the new OS takes.

    Therefore, reuse the crash kernel reserved memory for kexec: when kexec loads the second OS, each segment of the second OS is copied directly into contiguous physical memory, so there is no need to make a second copy when "kexec -e" is executed later. The kexec userspace tool also needs a new parameter option (-r) to support the use of reserved memory (see the companion kexec patch).

    Examples:
      bzImage: 53M
      initramfs: 28M

    This can save about 40 ms; the larger the image size, the greater the time savings.

    Signed-off-by: huangjie.albert
    Signed-off-by: Alexandre Frade

commit 69231fa48bdf03d9c56a3fae6d8746cbf01bbea1
Author: Alexey Avramov
Date:   Sat Nov 13 10:42:27 2021 +0900

    mm/vmscan: add sysctl knobs for protecting the working set

    The kernel does not provide a way to protect the working set under memory pressure. A certain amount of anonymous and clean file pages is required by the userspace for normal operation. First of all, the userspace needs a cache of shared libraries and executable binaries. If the amount of the clean file pages falls below a certain level, then thrashing and even livelock can take place.

    The patch provides sysctl knobs for protecting the working set (anonymous and clean file pages) under memory pressure.

    The vm.anon_min_kbytes sysctl knob provides *hard* protection of anonymous pages. The anonymous pages on the current node won't be reclaimed under any conditions when their amount is below vm.anon_min_kbytes. This knob may be used to prevent excessive swap thrashing when anonymous memory is low (for example, when memory is going to be overfilled by compressed data of the zram module). The default value is defined by CONFIG_ANON_MIN_KBYTES (suggested 0 in Kconfig).

    The vm.clean_low_kbytes sysctl knob provides *best-effort* protection of clean file pages. The file pages on the current node won't be reclaimed under memory pressure when the amount of clean file pages is below vm.clean_low_kbytes *unless* we threaten to OOM. Protection of clean file pages using this knob may be used when swapping is still possible to
      - prevent disk I/O thrashing under memory pressure;
      - improve performance in disk cache-bound tasks under memory pressure.
    The default value is defined by CONFIG_CLEAN_LOW_KBYTES (suggested 0 in Kconfig).

    The vm.clean_min_kbytes sysctl knob provides *hard* protection of clean file pages. The file pages on the current node won't be reclaimed under memory pressure when the amount of clean file pages is below vm.clean_min_kbytes. Hard protection of clean file pages using this knob may be used to
      - prevent disk I/O thrashing under memory pressure even with no free swap space;
      - improve performance in disk cache-bound tasks under memory pressure;
      - avoid high latency and prevent livelock in near-OOM conditions.
    The default value is defined by CONFIG_CLEAN_MIN_KBYTES (suggested 0 in Kconfig).

    Signed-off-by: Alexey Avramov
    Signed-off-by: Alexandre Frade
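At runtime the knobs read like any other VM sysctl; for example (the value is arbitrary, only the knob names come from the commit), sysctl vm.clean_low_kbytes=153600 would ask the kernel to keep roughly 150 MiB of clean file pages resident on a best-effort basis, while vm.anon_min_kbytes and vm.clean_min_kbytes set the corresponding hard floors described above.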
commit 092837029b12b8f368f94b723850e2b385593f99
Author: Alexandre Frade
Date:   Wed Jun 15 21:31:23 2022 +0000

    XANMOD: mm: multi-gen LRU: Add Hz min_ttl_ms for thrashing prevention

    Signed-off-by: Alexandre Frade

commit ea7c46d4599da71cf75ad0ef125b4e2e384baafa
Author: Yu Zhao
Date:   Wed Jul 6 16:00:23 2022 -0600

    mm: multi-gen LRU: design doc

    Add a design doc.

    Signed-off-by: Yu Zhao
    Acked-by: Brian Geffon
    Acked-by: Jan Alexander Steffens (heftig)
    Acked-by: Oleksandr Natalenko
    Acked-by: Steven Barrett
    Acked-by: Suleiman Souhlal
    Tested-by: Daniel Byrne
    Tested-by: Donald Carr
    Tested-by: Holger Hoffstätte
    Tested-by: Konstantin Kharlamov
    Tested-by: Shuang Zhai
    Tested-by: Sofia Trinh
    Tested-by: Vaibhav Jain
    Reviewed-by: Bagas Sanjaya
    Signed-off-by: Alexandre Frade

commit 3f5bba8241ebbaaf3e2703adc5f18a98fa1b5cbd
Author: Yu Zhao
Date:   Wed Jul 6 16:00:22 2022 -0600

    mm: multi-gen LRU: admin guide

    Add an admin guide.

    Signed-off-by: Yu Zhao
    Acked-by: Brian Geffon
    Acked-by: Jan Alexander Steffens (heftig)
    Acked-by: Oleksandr Natalenko
    Acked-by: Steven Barrett
    Acked-by: Suleiman Souhlal
    Tested-by: Daniel Byrne
    Tested-by: Donald Carr
    Tested-by: Holger Hoffstätte
    Tested-by: Konstantin Kharlamov
    Tested-by: Shuang Zhai
    Tested-by: Sofia Trinh
    Tested-by: Vaibhav Jain
    Reviewed-by: Bagas Sanjaya
    Signed-off-by: Alexandre Frade

commit d4a88dd54c7e4006282bcddceaf4035a31330d97
Author: Yu Zhao
Date:   Wed Jul 6 16:00:21 2022 -0600

    mm: multi-gen LRU: debugfs interface

    Add /sys/kernel/debug/lru_gen for working set estimation and proactive reclaim. These techniques are commonly used to optimize job scheduling (bin packing) in data centers [1][2].

    Compared with the page table-based approach and the PFN-based approach, this lruvec-based approach has the following advantages:
    1. It offers better choices because it is aware of memcgs, NUMA nodes, shared mappings and unmapped page cache.
    2. It is more scalable because it is O(nr_hot_pages), whereas the PFN-based approach is O(nr_total_pages).

    Add /sys/kernel/debug/lru_gen_full for debugging.

    [1] https://dl.acm.org/doi/10.1145/3297858.3304053
    [2] https://dl.acm.org/doi/10.1145/3503222.3507731

    Signed-off-by: Yu Zhao
    Reviewed-by: Qi Zheng
    Acked-by: Brian Geffon
    Acked-by: Jan Alexander Steffens (heftig)
    Acked-by: Oleksandr Natalenko
    Acked-by: Steven Barrett
    Acked-by: Suleiman Souhlal
    Tested-by: Daniel Byrne
    Tested-by: Donald Carr
    Tested-by: Holger Hoffstätte
    Tested-by: Konstantin Kharlamov
    Tested-by: Shuang Zhai
    Tested-by: Sofia Trinh
    Tested-by: Vaibhav Jain
    Signed-off-by: Alexandre Frade

commit 4cf3f4a60369dc6a332f08a5a0db36a75f0b9911
Author: Yu Zhao
Date:   Wed Jul 6 16:00:20 2022 -0600

    mm: multi-gen LRU: thrashing prevention

    Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as requested by many desktop users [1].

    When set to value N, it prevents the working set of N milliseconds from getting evicted. The OOM killer is triggered if this working set cannot be kept in memory. Based on the average human detectable lag (~100ms), N=1000 usually eliminates intolerable lags due to thrashing. Larger values like N=3000 make lags less noticeable at the risk of premature OOM kills.

    Compared with the size-based approach [2], this time-based approach has the following advantages:
    1. It is easier to configure because it is agnostic to applications and memory sizes.
    2. It is more reliable because it is directly wired to the OOM killer.

    [1] https://lore.kernel.org/r/Ydza%2FzXKY9ATRoh6@google.com/
    [2] https://lore.kernel.org/r/20101028191523.GA14972@google.com/

    Signed-off-by: Yu Zhao
    Acked-by: Brian Geffon
    Acked-by: Jan Alexander Steffens (heftig)
    Acked-by: Oleksandr Natalenko
    Acked-by: Steven Barrett
    Acked-by: Suleiman Souhlal
    Tested-by: Daniel Byrne
    Tested-by: Donald Carr
    Tested-by: Holger Hoffstätte
    Tested-by: Konstantin Kharlamov
    Tested-by: Shuang Zhai
    Tested-by: Sofia Trinh
    Tested-by: Vaibhav Jain
    Signed-off-by: Alexandre Frade
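Following the commit's own guidance, a desktop user wanting roughly one second of working-set protection would write, e.g., echo 1000 >/sys/kernel/mm/lru_gen/min_ttl_ms, accepting that the OOM killer fires if that working set cannot be kept in memory; the value 1000 is the commit's suggested starting point, not a stated default.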
commit 77863e4a3550a6832d6a2466e4793a1ac1863cdb
Author: Yu Zhao
Date:   Wed Jul 6 16:00:19 2022 -0600

    mm: multi-gen LRU: kill switch

    Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that can be disabled include:
      0x0001: the multi-gen LRU core
      0x0002: walking page table, when arch_has_hw_pte_young() returns true
      0x0004: clearing the accessed bit in non-leaf PMD entries, when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
      [yYnN]: apply to all the components above

    E.g.,
      echo y >/sys/kernel/mm/lru_gen/enabled
      cat /sys/kernel/mm/lru_gen/enabled
      0x0007
      echo 5 >/sys/kernel/mm/lru_gen/enabled
      cat /sys/kernel/mm/lru_gen/enabled
      0x0005

    NB: the page table walks happen on the scale of seconds under heavy memory pressure, in which case the mmap_lock contention is a lesser concern, compared with the LRU lock contention and the I/O congestion. So far the only well-known case of the mmap_lock contention happens on Android, due to Scudo [1] which allocates several thousand VMAs for merely a few hundred MBs. The SPF and the Maple Tree also have provided their own assessments [2][3]. However, if walking page tables does worsen the mmap_lock contention, the kill switch can be used to disable it. In this case the multi-gen LRU will suffer a minor performance degradation, as shown previously.

    Clearing the accessed bit in non-leaf PMD entries can also be disabled, since this behavior was not tested on x86 varieties other than Intel and AMD.

    [1] https://source.android.com/devices/tech/debug/scudo
    [2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/
    [3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/

    Signed-off-by: Yu Zhao
    Acked-by: Brian Geffon
    Acked-by: Jan Alexander Steffens (heftig)
    Acked-by: Oleksandr Natalenko
    Acked-by: Steven Barrett
    Acked-by: Suleiman Souhlal
    Tested-by: Daniel Byrne
    Tested-by: Donald Carr
    Tested-by: Holger Hoffstätte
    Tested-by: Konstantin Kharlamov
    Tested-by: Shuang Zhai
    Tested-by: Sofia Trinh
    Tested-by: Vaibhav Jain
    Signed-off-by: Alexandre Frade

commit f4afeebb59f66f3ff4969dec0197046292e4a214
Author: Yu Zhao
Date:   Wed Jul 6 16:00:18 2022 -0600

    mm: multi-gen LRU: optimize multiple memcgs

    When multiple memcgs are available, it is possible to make better choices based on generations and tiers and therefore improve the overall performance under global memory pressure. This patch adds a rudimentary optimization to select memcgs that can drop single-use unmapped clean pages first. Doing so reduces the chance of going into the aging path or swapping. These two decisions can be costly.

    A typical example that benefits from this optimization is a server running mixed types of workloads, e.g., heavy anon workload in one memcg and heavy buffered I/O workload in the other.

    Though this optimization can be applied to both kswapd and direct reclaim, it is only added to kswapd to keep the patchset manageable. Later improvements will cover the direct reclaim path.
    Server benchmark results:
      Mixed workloads:
        fio (buffered I/O): +[19, 21]%
                      IOPS         BW
          patch1-8:   1880k        7343MiB/s
          patch1-9:   2252k        8796MiB/s

        memcached (anon): +[119, 123]%
                      Ops/sec      KB/sec
          patch1-8:   862768.65    33514.68
          patch1-9:   1911022.12   74234.54

      Mixed workloads:
        fio (buffered I/O): +[75, 77]%
                      IOPS         BW
          5.19-rc1:   1279k        4996MiB/s
          patch1-9:   2252k        8796MiB/s

        memcached (anon): +[13, 15]%
                      Ops/sec      KB/sec
          5.19-rc1:   1673524.04   65008.87
          patch1-9:   1911022.12   74234.54

      Configurations:
        (changes since patch 6)

        cat mixed.sh
        modprobe brd rd_nr=2 rd_size=56623104

        swapoff -a
        mkswap /dev/ram0
        swapon /dev/ram0

        mkfs.ext4 /dev/ram1
        mount -t ext4 /dev/ram1 /mnt

        memtier_benchmark -S /var/run/memcached/memcached.sock \
          -P memcache_binary -n allkeys --key-minimum=1 \
          --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
          --ratio 1:0 --pipeline 8 -d 2000

        fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
          --buffered=1 --ioengine=io_uring --iodepth=128 \
          --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
          --rw=randread --random_distribution=random --norandommap \
          --time_based --ramp_time=10m --runtime=90m --group_reporting &
        pid=$!

        sleep 200
        memtier_benchmark -S /var/run/memcached/memcached.sock \
          -P memcache_binary -n allkeys --key-minimum=1 \
          --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
          --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

        kill -INT $pid
        wait

    Client benchmark results:
      no change (CONFIG_MEMCG=n)

    Signed-off-by: Yu Zhao
    Acked-by: Brian Geffon
    Acked-by: Jan Alexander Steffens (heftig)
    Acked-by: Oleksandr Natalenko
    Acked-by: Steven Barrett
    Acked-by: Suleiman Souhlal
    Tested-by: Daniel Byrne
    Tested-by: Donald Carr
    Tested-by: Holger Hoffstätte
    Tested-by: Konstantin Kharlamov
    Tested-by: Shuang Zhai
    Tested-by: Sofia Trinh
    Tested-by: Vaibhav Jain
    Signed-off-by: Alexandre Frade

commit 79b705a26df77bcf5f524c91bd84448f5a700f5f
Author: Yu Zhao
Date:   Wed Jul 6 16:00:17 2022 -0600

    mm: multi-gen LRU: support page table walks

    To further exploit spatial locality, the aging prefers to walk page tables to search for young PTEs and promote hot pages. A kill switch will be added in the next patch to disable this behavior. When disabled, the aging relies on the rmap only.

    NB: this behavior has nothing similar with the page table scanning in the 2.4 kernel [1], which searches page tables for old PTEs, adds cold pages to swapcache and unmaps them.

    To avoid confusion, the term "iteration" specifically means the traversal of an entire mm_struct list; the term "walk" will be applied to page tables and the rmap, as usual.

    An mm_struct list is maintained for each memcg, and an mm_struct follows its owner task to the new memcg when this task is migrated. Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls walk_page_range() with each mm_struct on this list to promote hot pages before it increments max_seq.

    When multiple page table walkers iterate the same list, each of them gets a unique mm_struct; therefore they can run concurrently. Page table walkers ignore any misplaced pages, e.g., if an mm_struct was migrated, pages it left in the previous memcg will not be promoted when its current memcg is under reclaim. Similarly, page table walkers will not promote pages from nodes other than the one under reclaim.

    This patch uses the following optimizations when walking page tables:
    1. It tracks the usage of mm_struct's between context switches so that page table walkers can skip processes that have been sleeping since the last iteration.
    2. It uses generational Bloom filters to record populated branches so that page table walkers can reduce their search space based on the query results, e.g., to skip page tables containing mostly holes or misplaced pages.
    3. It takes advantage of the accessed bit in non-leaf PMD entries when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
    4. It does not zigzag between a PGD table and the same PMD table spanning multiple VMAs. IOW, it finishes all the VMAs within the range of the same PMD table before it returns to a PGD table. This improves the cache performance for workloads that have large numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.

    Server benchmark results:
      Single workload:
        fio (buffered I/O): no change

      Single workload:
        memcached (anon): +[8, 10]%
                      Ops/sec      KB/sec
          patch1-7:   1147696.57   44640.29
          patch1-8:   1245274.91   48435.66

      Configurations:
        no change

    Client benchmark results:
      kswapd profiles:
        patch1-7
          48.16%  lzo1x_1_do_compress (real work)
           8.20%  page_vma_mapped_walk (overhead)
           7.06%  _raw_spin_unlock_irq
           2.92%  ptep_clear_flush
           2.53%  __zram_bvec_write
           2.11%  do_raw_spin_lock
           2.02%  memmove
           1.93%  lru_gen_look_around
           1.56%  free_unref_page_list
           1.40%  memset

        patch1-8
          49.44%  lzo1x_1_do_compress (real work)
           6.19%  page_vma_mapped_walk (overhead)
           5.97%  _raw_spin_unlock_irq
           3.13%  get_pfn_folio
           2.85%  ptep_clear_flush
           2.42%  __zram_bvec_write
           2.08%  do_raw_spin_lock
           1.92%  memmove
           1.44%  alloc_zspage
           1.36%  memset

      Configurations:
        no change

    Thanks to the following developers for their efforts [3]:
      kernel test robot

    [1] https://lwn.net/Articles/23732/
    [2] https://llvm.org/docs/ScudoHardenedAllocator.html
    [3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/

    Signed-off-by: Yu Zhao
    Acked-by: Brian Geffon
    Acked-by: Jan Alexander Steffens (heftig)
    Acked-by: Oleksandr Natalenko
    Acked-by: Steven Barrett
    Acked-by: Suleiman Souhlal
    Tested-by: Daniel Byrne
    Tested-by: Donald Carr
    Tested-by: Holger Hoffstätte
    Tested-by: Konstantin Kharlamov
    Tested-by: Shuang Zhai
    Tested-by: Sofia Trinh
    Tested-by: Vaibhav Jain
    Signed-off-by: Alexandre Frade

commit 0d4941a1ee8738024fe0702c659cb8ecd605c5b0
Author: Yu Zhao
Date:   Wed Jul 6 16:00:16 2022 -0600

    mm: multi-gen LRU: exploit locality in rmap

    Searching the rmap for PTEs mapping each page on an LRU list (to test and clear the accessed bit) can be expensive because pages from different VMAs (PA space) are not cache friendly to the rmap (VA space). For workloads mostly using mapped pages, searching the rmap can incur the highest CPU cost in the reclaim path.

    This patch exploits spatial locality to reduce the trips into the rmap. When shrink_page_list() walks the rmap and finds a young PTE, a new function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent PTEs. On finding another young PTE, it clears the accessed bit and updates the gen counter of the page mapped by this PTE to (max_seq%MAX_NR_GENS)+1.
Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[3, 5]%
              Ops/sec      KB/sec
    patch1-6: 1106168.46   43025.04
    patch1-7: 1147696.57   44640.29

  Configurations: no change

Client benchmark results:
  kswapd profiles:
    patch1-6
      39.03%  lzo1x_1_do_compress (real work)
      18.47%  page_vma_mapped_walk (overhead)
       6.74%  _raw_spin_unlock_irq
       3.97%  do_raw_spin_lock
       2.49%  ptep_clear_flush
       2.48%  anon_vma_interval_tree_iter_first
       1.92%  folio_referenced_one
       1.88%  __zram_bvec_write
       1.48%  memmove
       1.31%  vma_interval_tree_iter_next

    patch1-7
      48.16%  lzo1x_1_do_compress (real work)
       8.20%  page_vma_mapped_walk (overhead)
       7.06%  _raw_spin_unlock_irq
       2.92%  ptep_clear_flush
       2.53%  __zram_bvec_write
       2.11%  do_raw_spin_lock
       2.02%  memmove
       1.93%  lru_gen_look_around
       1.56%  free_unref_page_list
       1.40%  memset

  Configurations: no change

Signed-off-by: Yu Zhao
Acked-by: Barry Song
Acked-by: Brian Geffon
Acked-by: Jan Alexander Steffens (heftig)
Acked-by: Oleksandr Natalenko
Acked-by: Steven Barrett
Acked-by: Suleiman Souhlal
Tested-by: Daniel Byrne
Tested-by: Donald Carr
Tested-by: Holger Hoffstätte
Tested-by: Konstantin Kharlamov
Tested-by: Shuang Zhai
Tested-by: Sofia Trinh
Tested-by: Vaibhav Jain
Signed-off-by: Alexandre Frade

commit f29fe9c7a127f3a2d8c59bffffc488463dcc7ad4
Author: Yu Zhao
Date: Wed Jul 6 16:00:15 2022 -0600

mm: multi-gen LRU: minimal implementation

To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual.

The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless it is the result of incrementing max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages.

The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation.

The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages (a sketch of the tier arithmetic follows this list):
1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path.
2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.)
3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads.
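The tier arithmetic is compact enough to sketch. The helper below mirrors the stated rule that a page accessed N times through file descriptors sits in tier order_base_2(N), with N=0 and N=1 both falling into the baseline tier; MAX_NR_TIERS and the helper names are illustrative assumptions.

    #include <stdio.h>

    #define MAX_NR_TIERS 4        /* assumed cap for the sketch */

    /* order_base_2(n) = ceil(log2(n)); by this convention both 0 and 1
     * map to tier 0, the baseline tier (N = 0 or 1). */
    static unsigned int order_base_2(unsigned int n)
    {
            unsigned int order = 0;

            while ((1u << order) < n)
                    order++;
            return order;
    }

    static unsigned int tier_of(unsigned int refs)
    {
            unsigned int tier = order_base_2(refs);

            return tier < MAX_NR_TIERS - 1 ? tier : MAX_NR_TIERS - 1;
    }

    int main(void)
    {
            for (unsigned int n = 0; n <= 8; n++)
                    printf("accessed %u time(s) -> tier %u\n", n, tier_of(n));
            return 0;
    }

Because tiers grow exponentially with access counts, a handful of bits in folio->flags suffices, which is why tiers need no dedicated lrugen->lists[].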
Server benchmark results:
  Single workload:
    fio (buffered I/O): +[30, 32]%
              IOPS         BW
    5.19-rc1: 2673k        10.2GiB/s
    patch1-6: 3491k        13.3GiB/s

  Single workload:
    memcached (anon): -[4, 6]%
              Ops/sec      KB/sec
    5.19-rc1: 1161501.04   45177.25
    patch1-6: 1106168.46   43025.04

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the results.

    patch drivers/block/brd.c <<EOF
    ...
    > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    ...
    EOF

    cat >>/etc/memcached.conf <<EOF
    ...
    EOF

    cat fio.sh
    ...
    echo ... >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208

    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.19-rc1
      40.33%  page_vma_mapped_walk (overhead)
      21.80%  lzo1x_1_do_compress (real work)
       7.53%  do_raw_spin_lock
       3.95%  _raw_spin_unlock_irq
       2.52%  vma_interval_tree_iter_next
       2.37%  folio_referenced_one
       2.28%  vma_interval_tree_subtree_search
       1.97%  anon_vma_interval_tree_iter_first
       1.60%  ptep_clear_flush
       1.06%  __zram_bvec_write

    patch1-6
      39.03%  lzo1x_1_do_compress (real work)
      18.47%  page_vma_mapped_walk (overhead)
       6.74%  _raw_spin_unlock_irq
       3.97%  do_raw_spin_lock
       2.49%  ptep_clear_flush
       2.48%  anon_vma_interval_tree_iter_first
       1.92%  folio_referenced_one
       1.88%  __zram_bvec_write
       1.48%  memmove
       1.31%  vma_interval_tree_iter_next

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    Chrome OS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Signed-off-by: Yu Zhao
Acked-by: Brian Geffon
Acked-by: Jan Alexander Steffens (heftig)
Acked-by: Oleksandr Natalenko
Acked-by: Steven Barrett
Acked-by: Suleiman Souhlal
Tested-by: Daniel Byrne
Tested-by: Donald Carr
Tested-by: Holger Hoffstätte
Tested-by: Konstantin Kharlamov
Tested-by: Shuang Zhai
Tested-by: Sofia Trinh
Tested-by: Vaibhav Jain
Signed-off-by: Alexandre Frade

commit 0378a3b31a08c24195cb8a6719da726a246b6566
Author: Yu Zhao
Date: Wed Jul 6 16:00:14 2022 -0600

mm: multi-gen LRU: groundwork

Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both anon and file types as they are aged on an equal footing. The oldest generation numbers are stored in lrugen->min_seq[] separately for anon and file types as clean file pages can be evicted regardless of swap constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into the gen counter in folio->flags. Each truncated generation number is an index to lrugen->lists[]. The sliding window technique is used to track at least MIN_NR_GENS and at most MAX_NR_GENS generations. The gen counter stores a value within [1, MAX_NR_GENS] while a page is on one of lrugen->lists[]; otherwise it stores 0.

There are two conceptually independent procedures: "the aging", which produces young generations, and "the eviction", which consumes old generations. They form a closed-loop system, i.e., "the page reclaim". Both procedures can be invoked from userspace for the purposes of working set estimation and proactive reclaim. These techniques are commonly used to optimize job scheduling (bin packing) in data centers [1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the multi-gen LRU, as a new convention; the terms "active" and "inactive" will be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels: one through page tables and the other through file descriptors. The protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because applications usually do not prepare themselves for major page faults like they do for blocked I/O. E.g., GUI applications commonly use dedicated I/O threads to avoid blocking rendering threads.

There are also two access patterns: one with temporal locality and the other without. For the reasons listed above, the former channel is assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is present; the latter channel is assumed to follow the latter pattern unless outlying refaults have been observed [3][4]. The next patch will address the "outlying refaults".

Three macros, i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting. The aging needs to check the accessed bit at least twice before handing this page over to the eviction. The first check takes care of the accessed bit set on the initial fault; the second check makes sure this page has not been used since then. This protocol, AKA second chance, requires a minimum of two generations, hence MIN_NR_GENS.
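A small C sketch of this generation bookkeeping, under assumed macro names and an arbitrary bit offset (the real field layout in folio->flags differs); only the arithmetic itself follows the description above.

    #include <assert.h>
    #include <stdio.h>

    #define MAX_NR_GENS   4      /* assumed value for the sketch */
    #define LRU_GEN_WIDTH 3      /* order_base_2(MAX_NR_GENS + 1) */
    #define LRU_GEN_MASK  ((1ul << LRU_GEN_WIDTH) - 1)
    #define LRU_GEN_PGOFF 8      /* arbitrary bit offset for the demo */

    /* Store gen = seq % MAX_NR_GENS + 1 in the flags word; the stored
     * value stays within [1, MAX_NR_GENS] while a page is on a list,
     * and 0 means "not on any lrugen list". */
    static unsigned long set_gen(unsigned long flags, unsigned long seq)
    {
            unsigned long gen = seq % MAX_NR_GENS + 1;

            return (flags & ~(LRU_GEN_MASK << LRU_GEN_PGOFF)) |
                   (gen << LRU_GEN_PGOFF);
    }

    static unsigned long get_gen(unsigned long flags)
    {
            return (flags >> LRU_GEN_PGOFF) & LRU_GEN_MASK;
    }

    int main(void)
    {
            unsigned long flags = 0;

            flags = set_gen(flags, 42);      /* seq 42 -> gen 42%4+1 = 3 */
            assert(get_gen(flags) == 3);
            /* the truncated number indexes lrugen->lists[] */
            printf("list index: %lu\n", (get_gen(flags) - 1) % MAX_NR_GENS);
            return 0;
    }

Because seq values only grow and at most MAX_NR_GENS generations exist at once, the truncated counter is unambiguous despite its few bits, which is what lets it live inside folio->flags at all.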
[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Signed-off-by: Yu Zhao
Acked-by: Brian Geffon
Acked-by: Jan Alexander Steffens (heftig)
Acked-by: Oleksandr Natalenko
Acked-by: Steven Barrett
Acked-by: Suleiman Souhlal
Tested-by: Daniel Byrne
Tested-by: Donald Carr
Tested-by: Holger Hoffstätte
Tested-by: Konstantin Kharlamov
Tested-by: Shuang Zhai
Tested-by: Sofia Trinh
Tested-by: Vaibhav Jain
Signed-off-by: Alexandre Frade

commit a84676f82f8bf0e3365bd378105cd0513ccf32eb
Author: Yu Zhao
Date: Wed Jul 6 16:00:13 2022 -0600

Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller"

This patch undoes the following refactor: commit 289ccba18af4 ("include/linux/mm_inline.h: fold __update_lru_size() into its sole caller")

The upcoming changes to include/linux/mm_inline.h will reuse __update_lru_size().

Signed-off-by: Yu Zhao
Reviewed-by: Miaohe Lin
Acked-by: Brian Geffon
Acked-by: Jan Alexander Steffens (heftig)
Acked-by: Oleksandr Natalenko
Acked-by: Steven Barrett
Acked-by: Suleiman Souhlal
Tested-by: Daniel Byrne
Tested-by: Donald Carr
Tested-by: Holger Hoffstätte
Tested-by: Konstantin Kharlamov
Tested-by: Shuang Zhai
Tested-by: Sofia Trinh
Tested-by: Vaibhav Jain
Signed-off-by: Alexandre Frade

commit c21ae4767935fef088b550291c0fa1716931fa1d
Author: Yu Zhao
Date: Wed Jul 6 16:00:12 2022 -0600

mm/vmscan.c: refactor shrink_node()

This patch refactors shrink_node() to improve readability for the upcoming changes to mm/vmscan.c.

Signed-off-by: Yu Zhao
Reviewed-by: Barry Song
Reviewed-by: Miaohe Lin
Acked-by: Brian Geffon
Acked-by: Jan Alexander Steffens (heftig)
Acked-by: Oleksandr Natalenko
Acked-by: Steven Barrett
Acked-by: Suleiman Souhlal
Tested-by: Daniel Byrne
Tested-by: Donald Carr
Tested-by: Holger Hoffstätte
Tested-by: Konstantin Kharlamov
Tested-by: Shuang Zhai
Tested-by: Sofia Trinh
Tested-by: Vaibhav Jain
Signed-off-by: Alexandre Frade

commit 963f13b7dc3ecc0d91978cb816e29f27a3175116
Author: Yu Zhao
Date: Wed Jul 6 16:00:11 2022 -0600

mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG

Some architectures support the accessed bit in non-leaf PMD entries, e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it as part of linear address translation [1]. Page table walkers that clear the accessed bit may use this capability to reduce their search space.

Note that:
1. Although an inline function is preferable, this capability is added as a configuration option for consistency with the existing macros.
2. Due to the limited interest in other varieties, this capability was only tested on Intel and AMD CPUs.

Thanks to the following developers for their efforts [2][3].
  Randy Dunlap
  Stephen Rothwell

[1] Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3 (June 2021), section 4.8
[2] https://lore.kernel.org/r/bfdcc7c8-922f-61a9-aa15-7e7250f04af7@infradead.org/
[3] https://lore.kernel.org/r/20220413151513.5a0d7a7e@canb.auug.org.au/

Signed-off-by: Yu Zhao
Reviewed-by: Barry Song
Acked-by: Brian Geffon
Acked-by: Jan Alexander Steffens (heftig)
Acked-by: Oleksandr Natalenko
Acked-by: Steven Barrett
Acked-by: Suleiman Souhlal
Tested-by: Daniel Byrne
Tested-by: Donald Carr
Tested-by: Holger Hoffstätte
Tested-by: Konstantin Kharlamov
Tested-by: Shuang Zhai
Tested-by: Sofia Trinh
Tested-by: Vaibhav Jain
Signed-off-by: Alexandre Frade

commit bd8b2802f62a88742f1221c3e2fa90754309cd1a
Author: Yu Zhao
Date: Wed Jul 6 16:00:10 2022 -0600

mm: x86, arm64: add arch_has_hw_pte_young()

Some architectures automatically set the accessed bit in PTEs, e.g., x86 and arm64 v8.2. On architectures that do not have this capability, clearing the accessed bit in a PTE usually triggers a page fault following the TLB miss of this PTE (to emulate the accessed bit).

Being aware of this capability can help make better decisions, e.g., whether to spread the work out over a period of time to reduce bursty page faults when trying to clear the accessed bit in many PTEs.

Note that theoretically this capability can be unreliable, e.g., hotplugged CPUs might be different from builtin ones. Therefore it should not be used in architecture-independent code that involves correctness, e.g., to determine whether TLB flushes are required (in combination with the accessed bit).

Signed-off-by: Yu Zhao
Reviewed-by: Barry Song
Acked-by: Brian Geffon
Acked-by: Jan Alexander Steffens (heftig)
Acked-by: Oleksandr Natalenko
Acked-by: Steven Barrett
Acked-by: Suleiman Souhlal
Acked-by: Will Deacon
Tested-by: Daniel Byrne
Tested-by: Donald Carr
Tested-by: Holger Hoffstätte
Tested-by: Konstantin Kharlamov
Tested-by: Shuang Zhai
Tested-by: Sofia Trinh
Tested-by: Vaibhav Jain
Signed-off-by: Alexandre Frade

commit 3f4641ec67d3ddd8c18a1505a487d0581e46fdc4
Author: André Almeida
Date: Mon Oct 25 09:49:42 2021 -0300

futex: Add entry point for FUTEX_WAIT_MULTIPLE (opcode 31)

Add an option to wait on multiple futexes using the old interface, which uses opcode 31 through the futex() syscall. Do that by simply translating the old interface to use the new code. This allows old and stable versions of Proton to still use fsync in new kernel releases.
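A hedged userspace sketch of how this entry point is meant to be called. The futex_wait_block layout and return convention are assumptions based on the out-of-tree Proton/fsync patches, not a mainline uapi header, and the call fails with ENOSYS on kernels without this patch.

    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    #define FUTEX_WAIT_MULTIPLE 31        /* opcode from the commit message */

    struct futex_wait_block {             /* assumed ABI of the old interface */
            uint32_t *uaddr;
            uint32_t val;
            uint32_t bitset;
    };

    int main(void)
    {
            static uint32_t f1 = 0, f2 = 0;
            struct futex_wait_block blocks[2] = {
                    { &f1, 0, ~0u },      /* wait while *uaddr == val */
                    { &f2, 0, ~0u },
            };
            /* timeout semantics assumed relative for this sketch */
            struct timespec timeout = { .tv_sec = 0, .tv_nsec = 1000000 };

            /* Per the out-of-tree ABI, success reports which futex woke
             * us; here nobody wakes us, so expect -1 with ETIMEDOUT
             * (or ENOSYS on an unpatched kernel). */
            long ret = syscall(SYS_futex, blocks, FUTEX_WAIT_MULTIPLE, 2,
                               &timeout, NULL, 0);
            printf("ret=%ld errno=%d\n", ret, errno);
            return 0;
    }

The translation the commit describes means userspace built against this old opcode keeps working even though the kernel internally services it with the newer futex_waitv machinery.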
Signed-off-by: André Almeida
Signed-off-by: Alexandre Frade

commit 4804100d494b73f2b5ca30d5be97cf52ec2cc2cc
Author: Alexandre Frade
Date: Fri Jun 18 19:10:55 2021 +0000

XANMOD: Makefile: Turn off loop vectorization for GCC -O3 optimization level

Signed-off-by: Alexandre Frade

commit c14234e95f00f2fc588c8cdeb73848b0f90cf64a
Author: Alexandre Frade
Date: Thu Sep 3 20:36:13 2020 +0000

XANMOD: init/Kconfig: Enable -O3 KBUILD_CFLAGS optimization for all architectures

Signed-off-by: Alexandre Frade

commit a1cee4d274ba5343f4a5343f69db050d3c37cd0d
Author: Alexandre Frade
Date: Thu Jun 25 16:40:43 2020 -0300

XANMOD: lib/kconfig.debug: disable default CONFIG_SYMBOLIC_ERRNAME and CONFIG_DEBUG_BUGVERBOSE

Signed-off-by: Alexandre Frade
Signed-off-by: Alexandre Frade

commit b43ed9148357a00835924be5edb9e1eaa7ab2759
Author: Alexandre Frade
Date: Sun May 29 00:57:40 2022 +0000

XANMOD: scripts/setlocalversion: remove "+" tag for git repo short version

Signed-off-by: Alexandre Frade

commit 6a188ead202e981f6bd25080716f5295dd0b8954
Author: Alexandre Frade
Date: Tue Mar 31 13:32:08 2020 -0300

XANMOD: cpufreq: tunes ondemand and conservative governor for performance

Signed-off-by: Alexandre Frade
Signed-off-by: Alexandre Frade

commit b8ee537cbf903d156384d1c9f3885f280066015f
Author: Alexandre Frade
Date: Mon Jan 29 17:31:25 2018 +0000

XANMOD: mm/vmscan: vm_swappiness = 30 decreases the amount of swapping

Signed-off-by: Alexandre Frade
Signed-off-by: Alexandre Frade

commit 480d175e559adf01a323b7531c13fe7adb7086e9
Author: Alexandre Frade
Date: Wed Jun 15 17:07:29 2022 +0000

XANMOD: sched/autogroup: Add kernel parameter and config option to enable/disable autogroup feature by default

Signed-off-by: Alexandre Frade

commit 63718c397cc3a2629ef01c8803068818c71b9861
Author: Alexandre Frade
Date: Mon Jan 29 16:59:22 2018 +0000

XANMOD: dcache: cache_pressure = 50 decreases the rate at which VFS caches are reclaimed

Signed-off-by: Alexandre Frade
Signed-off-by: Alexandre Frade

commit af7a3064dcc0013730b1d14d43b851620194c776
Author: Alexandre Frade
Date: Mon Jan 29 17:26:15 2018 +0000

XANMOD: kconfig: add 500Hz timer interrupt kernel config option

Signed-off-by: Alexandre Frade
Signed-off-by: Alexandre Frade

commit 788d6416cd78d30d14856962fb1724711c162d0d
Author: Alexandre Frade
Date: Mon Dec 14 16:24:26 2020 +0000

XANMOD: block: set rq_affinity to force full multithreading I/O requests

Signed-off-by: Alexandre Frade

commit a83f6b7152ba845ff5ad7f13dd15180aef8d4b3f
Author: Alexandre Frade
Date: Wed May 11 18:56:51 2022 +0000

XANMOD: block/mq-deadline: Increase write priority to improve responsiveness

Signed-off-by: Alexandre Frade

commit e421ad5f74bd3393c52594c09f3d037806b1f25b
Author: Alexandre Frade
Date: Thu Jan 6 16:59:01 2022 +0000

XANMOD: block/mq-deadline: Disable front_merges by default

Signed-off-by: Alexandre Frade

commit a0a63c0737cd033e9ee7dda705206d4176fbebd6
Author: Alexandre Frade
Date: Fri Mar 25 22:36:34 2022 +0000

XANMOD: Change rcutree.kthread_prio to SCHED_RR policy

Signed-off-by: Alexandre Frade

commit a034480f55234573e600c0796d11f52f5031bb0e
Author: Alexandre Frade
Date: Mon Aug 1 01:49:22 2022 +0000

XANMOD: fair: Remove all energy efficiency functions

Signed-off-by: Alexandre Frade
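The two sysctl-default commits above (vm_swappiness = 30, vfs_cache_pressure = 50) change built-in kernel defaults; on a stock kernel the same values can be applied at runtime. A minimal C sketch, assuming only the standard procfs paths and root privileges:

    #include <stdio.h>

    /* Write a value to a procfs sysctl file; returns 0 on success. */
    static int write_sysctl(const char *path, const char *value)
    {
            FILE *f = fopen(path, "w");

            if (!f)
                    return -1;
            fprintf(f, "%s\n", value);
            return fclose(f);
    }

    int main(void)
    {
            /* Values taken from the commit subjects above; needs root. */
            write_sysctl("/proc/sys/vm/swappiness", "30");
            write_sysctl("/proc/sys/vm/vfs_cache_pressure", "50");
            return 0;
    }

Equivalently, the values could be placed in /etc/sysctl.conf; the point of carrying them as patches is only that the XanMod kernel ships them as defaults.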