commit 47fe430223e36182da7857d089cb30ceb04e07b6
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Mon May 30 11:47:36 2022 +0000

    Linux 5.18.1-xanmod1
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit b9a1ebb5bb2f8843a1268552c0d8fa56da2f9496
Author: graysky <graysky@archlinux.us>
Date:   Tue Mar 15 05:58:43 2022 -0400

    x86/kconfig: more uarches for kernel 5.17+
    
    FEATURES
    This patch adds additional CPU options to the Linux kernel accessible under:
     Processor type and features  --->
      Processor family --->
    
    With the release of gcc 11.1 and clang 12.0, several generic 64-bit levels are
    offered which are good for supported Intel or AMD CPUs:
    • x86-64-v2
    • x86-64-v3
    • x86-64-v4
    
    Users of glibc 2.33 and above can see which level is supported by current
    hardware by running:
      /lib/ld-linux-x86-64.so.2 --help | grep supported
    
    Alternatively, compare the flags from /proc/cpuinfo to this list.[1]
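
    The flag comparison can be sketched in Python. This is an illustration
    only: the flag sets below are an assumption based on the usual
    /proc/cpuinfo spellings (for example 'pni' for SSE3 and 'abm' for LZCNT)
    and should be checked against the list in [1]. The function names are
    hypothetical.

```python
# Flags as spelled in /proc/cpuinfo. Assumed mapping; verify against [1].
LEVEL_FLAGS = [
    ("x86-64-v2", {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}),
    ("x86-64-v3", {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"}),
    ("x86-64-v4", {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}),
]


def highest_level(flags):
    """Return the highest level whose flags, and those of every lower level, are present."""
    level = "x86-64"  # baseline level, assumed for any 64-bit CPU
    for name, required in LEVEL_FLAGS:
        if not required <= flags:
            break  # a missing flag rules out this level and all higher ones
        level = name
    return level


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the flag set of the first CPU listed in /proc/cpuinfo."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()
```

    On a Linux machine, highest_level(read_cpu_flags()) gives a rough answer
    that can be compared with the ld-linux output above.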
    
    CPU-specific microarchitectures include:
    • AMD Improved K8-family
    • AMD K10-family
    • AMD Family 10h (Barcelona)
    • AMD Family 14h (Bobcat)
    • AMD Family 16h (Jaguar)
    • AMD Family 15h (Bulldozer)
    • AMD Family 15h (Piledriver)
    • AMD Family 15h (Steamroller)
    • AMD Family 15h (Excavator)
    • AMD Family 17h (Zen)
    • AMD Family 17h (Zen 2)
    • AMD Family 19h (Zen 3)†
    • Intel Silvermont low-power processors
    • Intel Goldmont low-power processors (Apollo Lake and Denverton)
    • Intel Goldmont Plus low-power processors (Gemini Lake)
    • Intel 1st Gen Core i3/i5/i7 (Nehalem)
    • Intel 1.5 Gen Core i3/i5/i7 (Westmere)
    • Intel 2nd Gen Core i3/i5/i7 (Sandybridge)
    • Intel 3rd Gen Core i3/i5/i7 (Ivybridge)
    • Intel 4th Gen Core i3/i5/i7 (Haswell)
    • Intel 5th Gen Core i3/i5/i7 (Broadwell)
    • Intel 6th Gen Core i3/i5/i7 (Skylake)
    • Intel 6th Gen Core i7/i9 (Skylake X)
    • Intel 8th Gen Core i3/i5/i7 (Cannon Lake)
    • Intel 10th Gen Core i7/i9 (Ice Lake)
    • Intel Xeon (Cascade Lake)
    • Intel Xeon (Cooper Lake)*
    • Intel 3rd Gen 10nm++ i3/i5/i7/i9-family (Tiger Lake)*
    • Intel 3rd Gen 10nm++ Xeon (Sapphire Rapids)‡
    • Intel 11th Gen i3/i5/i7/i9-family (Rocket Lake)‡
    • Intel 12th Gen i3/i5/i7/i9-family (Alder Lake)‡
    
    Notes: If not otherwise noted, gcc >=9.1 is required for support.
           *Requires gcc >=10.1 or clang >=10.0
           †Requires gcc >=10.3 or clang >=12.0
           ‡Requires gcc >=11.1 or clang >=12.0
    
    It also offers to compile passing the 'native' option which, "selects the CPU
    to generate code for at compilation time by determining the processor type of
    the compiling machine. Using -march=native enables all instruction subsets
    supported by the local machine and will produce code optimized for the local
    machine under the constraints of the selected instruction set."[2]
    
    Users of Intel CPUs should select the 'Intel-Native' option and users of AMD
    CPUs should select the 'AMD-Native' option.
    
    MINOR NOTES RELATING TO INTEL ATOM PROCESSORS
    This patch also changes -march=atom to -march=bonnell in accordance with the
    gcc v4.9 changes. Upstream is using the deprecated -march=atom flag when I
    believe it should use the newer -march=bonnell flag for atom processors.[3]
    
    It is not recommended to compile on Atom-CPUs with the 'native' option.[4] The
    recommendation is to use the 'atom' option instead.
    
    BENEFITS
    Small but real speed increases are measurable using a make endpoint comparing
    a generic kernel to one built with one of the respective microarchs.
    
    See the following experimental evidence supporting this statement:
    https://github.com/graysky2/kernel_gcc_patch
    
    REQUIREMENTS
    linux version 5.17+
    gcc version >=9.0 or clang version >=9.0
    
    ACKNOWLEDGMENTS
    This patch builds on the seminal work by Jeroen.[5]
    
    REFERENCES
    1.  https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/77566eb03bc6a326811cb7e9
    2.  https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-x86-Options
    3.  https://bugzilla.kernel.org/show_bug.cgi?id=77461
    4.  https://github.com/graysky2/kernel_gcc_patch/issues/15
    5.  http://www.linuxforge.net/docs/linux/linux-gcc.php
    
    Signed-off-by: graysky <graysky@archlinux.us>

commit 943a0b2a5208fb5806c65c404481ce665622cda9
Author: Mark Weiman <mark.weiman@markzz.com>
Date:   Sun Aug 12 11:36:21 2018 -0400

    pci: Enable overrides for missing ACS capabilities
    
    This is an updated version of Alex Williamson's patch from:
    https://lkml.org/lkml/2013/5/30/513
    
    Original commit message follows:
    
    PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
    allows us to control whether transactions are allowed to be redirected
    in various subnodes of a PCIe topology.  For instance, if two
    endpoints are below a root port or downstream switch port, the
    downstream port may optionally redirect transactions between the
    devices, bypassing upstream devices.  The same can happen internally
    on multifunction devices.  The transaction may never be visible to the
    upstream devices.
    
    One upstream device that we particularly care about is the IOMMU.  If
    a redirection occurs in the topology below the IOMMU, then the IOMMU
    cannot provide isolation between devices.  This is why the PCIe spec
    encourages topologies to include ACS support.  Without it, we have to
    assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.
    
    Unfortunately, far too many topologies do not support ACS to make this
    a steadfast requirement.  Even the latest chipsets from Intel are only
    sporadically supporting ACS.  We have trouble getting interconnect
    vendors to include the PCIe spec required PCIe capability, let alone
    suggested features.
    
    Therefore, we need to add some flexibility.  The pcie_acs_override=
    boot option lets users opt-in specific devices or sets of devices to
    assume ACS support.  The "downstream" option assumes full ACS support
    on root ports and downstream switch ports.  The "multifunction"
    option assumes the subset of ACS features available on multifunction
    endpoints and upstream switch ports are supported.  The "id:nnnn:nnnn"
    option enables ACS support on devices matching the provided vendor
    and device IDs, allowing more strategic ACS overrides.  These options
    may be combined in any order.  A maximum of 16 id specific overrides
    are available.  It's suggested to use the most limited set of options
    necessary to avoid completely disabling ACS across the topology.
    Note to hardware vendors, we have facilities to permanently quirk
    specific devices which enforce isolation but do not provide an ACS
    capability.  Please contact me to have your devices added and save
    your customers the hassle of this boot option.
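
    How such an option string decomposes can be sketched in Python. This is
    not the kernel parser: the comma separator, the hexadecimal reading of
    the IDs, and the function name are assumptions made for illustration.
    Only the three option forms and the limit of 16 come from the text above.

```python
MAX_ID_OVERRIDES = 16  # limit stated in the commit message


def parse_acs_override(value):
    """Parse a pcie_acs_override= value into (downstream, multifunction, ids)."""
    downstream = multifunction = False
    ids = []
    for token in value.split(","):  # comma separator is an assumption
        token = token.strip()
        if token == "downstream":
            downstream = True
        elif token == "multifunction":
            multifunction = True
        elif token.startswith("id:"):
            vendor, device = token[3:].split(":")
            if len(ids) >= MAX_ID_OVERRIDES:
                raise ValueError("at most 16 id overrides are supported")
            # PCI vendor and device IDs are read as hexadecimal (assumption).
            ids.append((int(vendor, 16), int(device, 16)))
        elif token:
            raise ValueError(f"unknown option: {token!r}")
    return downstream, multifunction, ids
```

    For example, parse_acs_override("downstream,id:8086:1c10") yields the
    downstream flag plus one vendor/device pair, which is the "most limited
    set" style of override the message recommends.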
    
    Rebased-by: Alexandre Frade <kernel@xanmod.org>
    Signed-off-by: Mark Weiman <mark.weiman@markzz.com>
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit cccdc3e3efb0490061486f83dff49e64d401612a
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Wed Apr 20 18:58:17 2022 -0500

    docs: winesync: Document alertable waits

commit f5aa0e85e66b0a5f0e8062ed41721a085df89c0e
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Wed Apr 20 18:24:43 2022 -0500

    selftests: winesync: Add some tests for wakeup signaling via alerts

commit e041f35f0014600032c6203361522c689dda16f8
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Wed Apr 20 18:08:37 2022 -0500

    selftests: winesync: Add tests for alertable waits

commit 3c6a9de4ccba0b22cc1a21446564aaef81312c74
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Wed Apr 13 20:02:39 2022 -0500

    winesync: Introduce alertable waits

commit 403a89df995d8b67fe15bd5a16dad383605ed0fd
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Wed Jan 19 22:01:46 2022 -0600

    docs: winesync: Document event APIs

commit 2e8dad37b0af7983d5495ac3323aeb97d56df5b0
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Wed Jan 19 21:06:22 2022 -0600

    selftests: winesync: Add some tests for invalid object handling with events

commit e1396f64fc33a31f2cf3d5d3d84715adca063bea
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Wed Jan 19 21:00:50 2022 -0600

    selftests: winesync: Add some tests for wakeup signaling with events

commit f9ec3bda3749af08de923a5bafd5a88f7ffb6560
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Wed Jan 19 19:45:39 2022 -0600

    selftests: winesync: Add some tests for auto-reset event state

commit 8ee9a533869e2a4adadc9ffdf902f863a5ec8659
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Wed Jan 19 19:34:47 2022 -0600

    selftests: winesync: Add some tests for manual-reset event state

commit 97effc15945cde601143b28207042cdbd37df6e6
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Wed Jan 19 19:14:00 2022 -0600

    winesync: Introduce WINESYNC_IOC_READ_EVENT

commit 68f8f65d81a7e01c77c64ef9c8f24c3b4d0eee1f
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Wed Jan 19 19:10:12 2022 -0600

    winesync: Introduce WINESYNC_IOC_PULSE_EVENT

commit 5b8c03156cd5080cd0ce7b24bee994b5710f0175
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Wed Jan 19 19:00:25 2022 -0600

    winesync: Introduce WINESYNC_IOC_RESET_EVENT

commit 86708349cbf836f6b33b3298397159006f8a8a95
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Wed Jan 19 18:43:30 2022 -0600

    winesync: Introduce WINESYNC_IOC_SET_EVENT

commit 18569c1bc9193b6ecdd5bde84d5a29bd7413ebd4
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Wed Jan 19 18:21:03 2022 -0600

    winesync: Introduce WINESYNC_IOC_CREATE_EVENT

commit 9484b5b4a7e409474e4fa1e3398b14e09e2f9ca4
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 12:22:55 2021 -0600

    maintainers: Add an entry for winesync

commit 929fffb76df5c20635ff59a397d33a95b6ca8389
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 12:09:36 2021 -0600

    selftests: winesync: Add some tests for wakeup signaling with WINESYNC_IOC_WAIT_ALL

commit f1d89dc2e26fb0210ecce3fc229b2982e48a9ff0
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 12:09:32 2021 -0600

    selftests: winesync: Add some tests for wakeup signaling with WINESYNC_IOC_WAIT_ANY

commit a55d91826c9ab0ffa13566c016d7b99ae2c9cb22
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 12:08:54 2021 -0600

    selftests: winesync: Add some tests for invalid object handling

commit 07a5448ca737e13258d79280c50645e8e86880eb
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 12:08:25 2021 -0600

    selftests: winesync: Add some tests for WINESYNC_IOC_WAIT_ALL

commit a4bc75455f44bb5d7f9823e2f9326fd9cf249117
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 12:07:45 2021 -0600

    selftests: winesync: Add some tests for WINESYNC_IOC_WAIT_ANY

commit c3359d53ab6a5ca0784c282f1bbfa2a217a215c6
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 12:07:04 2021 -0600

    selftests: winesync: Add some tests for mutex state

commit 75f0fc3da6ae6f1196e592aad52bffdb93636fe8
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 12:06:23 2021 -0600

    selftests: winesync: Add some tests for semaphore state

commit b7322625947328ad6f24fa1ac436e4dad273e603
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 11:50:49 2021 -0600

    docs: winesync: Add documentation for the winesync uAPI

commit 79ee408de0c907e4dc3de047a8433e3418ae46ce
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 11:48:10 2021 -0600

    winesync: Introduce WINESYNC_IOC_READ_MUTEX

commit f97d5bccc702c8992f089a14366d58dd0cb77346
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 11:47:55 2021 -0600

    winesync: Introduce WINESYNC_IOC_READ_SEM

commit 224d0c0a6c241cddee43a5029c77f105f3d93875
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 11:46:46 2021 -0600

    winesync: Introduce WINESYNC_IOC_KILL_OWNER

commit 1d25d18bc6a70daa5bc50021d4f8ec0cce0c72c7
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 11:44:41 2021 -0600

    winesync: Introduce WINESYNC_IOC_PUT_MUTEX

commit acb327aa5184bed693b2ec849b9572003a9747be
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 11:41:10 2021 -0600

    winesync: Introduce WINESYNC_IOC_CREATE_MUTEX

commit 584676da82565e7e6e69b9dddc9b1869d18d32f5
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 11:36:09 2021 -0600

    winesync: Introduce WINESYNC_IOC_WAIT_ALL

commit bb79698445c8e8b82ca6b2bfa6c985353deeaa75
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 11:31:44 2021 -0600

    winesync: Introduce WINESYNC_IOC_WAIT_ANY

commit d98d1621e3cad8d1530baf5689c4ae78e02e00b6
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 11:22:42 2021 -0600

    winesync: Introduce WINESYNC_IOC_PUT_SEM

commit a30ebf2e6a541bb02cabf6c8c915b03958ceb831
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 11:15:39 2021 -0600

    winesync: Introduce WINESYNC_IOC_CREATE_SEM and WINESYNC_IOC_DELETE

commit 456414a38e5877cf3daa885c2336852b95d171b7
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 10:57:06 2021 -0600

    winesync: Reserve a minor device number and ioctl range

commit 2946fd9a8dd5f613c6df7bf8a5bac499c6a5c5b2
Author: Zebediah Figura <zfigura@codeweavers.com>
Date:   Fri Mar 5 10:50:45 2021 -0600

    winesync: Introduce the winesync driver and character device

commit 54dcaa9d0157448dbf76900f70630ff8ba85fd00
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Mon Mar 21 21:37:19 2022 +0000

    i2c: busses: Add SMBus capability to work with OpenRGB driver control
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit abb68aaee66d5269bf24bf2aa4f8660522313363
Author: Serge Hallyn <serge.hallyn@canonical.com>
Date:   Fri May 31 19:12:12 2013 +0100

    sysctl: add sysctl to disallow unprivileged CLONE_NEWUSER by default
    
    This is a short-term patch.  Unprivileged use of CLONE_NEWUSER
    is certainly an intended feature of user namespaces.  However
    for at least saucy we want to make sure that, if any security
    issues are found, we have a fail-safe.
    
    Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
    [bwh: Remove unneeded binary sysctl bits]
    [bwh: Keep this sysctl, but change the default to enabled]

commit 28e45dbde6e825970edaafc8c9e6dc1f5f15a7fe
Author: Arjan van de Ven <arjan@linux.intel.com>
Date:   Wed May 17 01:52:11 2017 +0000

    init: wait for partition and retry scan
    
    Because Clear Linux boots so quickly, the device may not be ready when
    the mounting code is reached. Retry the device scan every 0.5 sec for
    at least 40 sec, and synchronize the async task.
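
    The retry policy can be sketched in Python as a bounded polling loop.
    This is an illustration of the timing only, not the init code: the
    function name, the probe callback, and the injectable sleep are
    hypothetical. The 0.5 sec interval and 40 sec budget come from the text
    above.

```python
import time


def wait_for_device(probe, interval=0.5, timeout=40.0, sleep=time.sleep):
    """Call probe() every `interval` seconds until it succeeds or `timeout` elapses."""
    attempts = int(timeout / interval)  # 80 attempts with the defaults
    for _ in range(attempts):
        if probe():
            return True
        sleep(interval)
    return probe()  # one final check after the last sleep
```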
    
    Signed-off-by: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>

commit e4d2308d72fd25bc1c80696011a24612f9b21614
Author: Arjan van de Ven <arjan@linux.intel.com>
Date:   Thu Jun 2 23:36:32 2016 -0500

    drivers: initialize ata before graphics
    
    ATA init is the long pole in the boot process, and it's asynchronous.
    Move the graphics init after it so that ATA and graphics initialize
    in parallel.

commit e0b33ffece7368a01dbef3f9c8f7da1c05b771eb
Author: Arjan van de Ven <arjan@linux.intel.com>
Date:   Sun Feb 18 23:35:41 2018 +0000

    locking: rwsem: spin faster
    
    tweak rwsem owner spinning a bit
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 70bb0b61b29d404bb00bf47fd399ced1804b0fc6
Author: William Douglas <william.douglas@intel.com>
Date:   Wed Jun 20 17:23:21 2018 +0000

    firmware: Enable stateless firmware loading
    
    Prefer the order of specific version before generic and /etc before
    /lib to enable the user to give specific overrides for generic
    firmware and distribution firmware.

commit 0845a6f99c481d3c2c139a26452a67c742644b20
Author: Arjan van de Ven <arjan@linux.intel.com>
Date:   Sun Sep 22 11:12:35 2019 -0300

    intel_rapl: Silence rapl trace debug

commit 10b5e6385414de002d333ce095e3588a84afd4c6
Author: Christian Brauner <brauner@kernel.org>
Date:   Wed Jan 23 21:54:23 2019 +0100

    SAUCE: binder: give binder_alloc its own debug mask file
    
    Currently binder.c and binder_alloc.c both register the
    /sys/module/binder_linux/parameters/debug_mask file, which leads to conflicts
    in sysfs. This commit gives binder_alloc.c its own
    /sys/module/binder_linux/parameters/alloc_debug_mask file.
    
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

commit 79ac08834f7c91d9c362915c9bc9d8bb88ff65e4
Author: Christian Brauner <brauner@kernel.org>
Date:   Wed Jan 16 23:13:25 2019 +0100

    SAUCE: binder: turn into module
    
    The Android binder driver needs to become a module for the sake of shipping
    Anbox. To do this we need to export the following functions since binder is
    currently still using them:
    
    - security_binder_set_context_mgr()
    - security_binder_transaction()
    - security_binder_transfer_binder()
    - security_binder_transfer_file()
    - can_nice()
    - __wake_up_pollfree()
    - __close_fd_get_file()
    - mmput_async()
    - task_work_add()
    - map_kernel_range_noflush()
    - get_vm_area()
    - zap_page_range()
    - put_ipc_ns()
    - get_ipc_ns_exported()
    - show_init_ipc_ns()
    
    Rebased-by: Alexandre Frade <kernel@xanmod.org>
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    [ saf: fix additional reference to init_ipc_ns from 5.0-rc6 ]
    Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 477b97190ab602855ae48170a14199058063946f
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Wed Dec 8 11:55:28 2021 +0000

    netfilter: Add full cone NAT support
    
    Link: https://github.com/llccd/netfilter-full-cone-nat
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 1a76a1270710108db06c6638213c750e041e6591
Author: Felix Fietkau <nbd@openwrt.org>
Date:   Sat Dec 5 15:07:03 2015 +0100

    mac80211: ignore AP power level when tx power type is "fixed"
    
    In some cases a user might want to connect to a far away access point,
    which announces a low tx power limit. Using the AP's power limit can
    make the connection significantly more unstable or even impossible, and
    mac80211 currently provides no way to disable this behavior.
    
    To fix this, use the currently unused distinction between limited and
    fixed tx power to decide whether a remote AP's power limit should be
    accepted.
    
    Signed-off-by: Felix Fietkau <nbd@openwrt.org>

commit 73c0874926b9908a88b6d20a24269f5e78571fe7
Author: Adithya Abraham Philip <abrahamphilip@google.com>
Date:   Fri Jun 11 21:56:10 2021 +0000

    net-tcp_bbr: v2: Fix missing ECT markings on retransmits for BBRv2
    
    Adds a new flag TCP_ECN_ECT_PERMANENT that is used by CCAs to
    indicate that retransmitted packets and pure ACKs must have the
    ECT bit set. This is a necessary fix for BBRv2, which when using
    ECN expects ECT to be set even on retransmitted packets and ACKs.
    Currently CCAs like BBRv2 which can use ECN but don't "need" it
    do not have a way to indicate that ECT should be set on
    retransmissions/ACKs.
    
    Signed-off-by: Adithya Abraham Philip <abrahamphilip@google.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>

commit 157d26999443785f5a33bcf7cb1e2219ceec3562
Author: Neal Cardwell <ncardwell@google.com>
Date:   Mon Dec 28 19:23:09 2020 -0500

    net-tcp_bbr: v2: don't assume prior_cwnd was set entering CA_Loss
    
    Fix WARN_ON_ONCE() warnings that were firing and pointing to a
    bbr->prior_cwnd of 0 when exiting CA_Loss and transitioning to
    CA_Open.
    
    The issue was that tcp_simple_retransmit() calls:
    
      tcp_set_ca_state(sk, TCP_CA_Loss);
    
    without first calling icsk_ca_ops->ssthresh(sk) (because
    tcp_simple_retransmit() is dealing with losses due to MTU issues and
    not congestion). The lack of this callback means that BBR did not get
    a chance to set bbr->prior_cwnd, and thus upon exiting CA_Loss in such
    cases the WARN_ON_ONCE() would fire due to a zero bbr->prior_cwnd.
    
    This commit removes that warning, since a bbr->prior_cwnd of 0 is a
    valid situation in this state transition.
    
    For setting inflight_lo upon entering CA_Loss, to avoid setting an
    inflight_lo of 0 in this case, this commit switches to taking the max
    of cwnd and prior_cwnd. We plan to remove that line of code when we
    switch to cautious (PRR-style) recovery, so that awkwardness will go
    away.
    
    Change-Id: I575dce871c2f20e91e3e9449e1706f42a07b8118

commit 00a66a403728e792e89b629c94825054fbe98376
Author: Neal Cardwell <ncardwell@google.com>
Date:   Mon Aug 17 19:10:21 2020 -0400

    net-tcp_bbr: v2: remove cycle_rand parameter that is unused in BBRv2
    
    Change-Id: Iee1df7e41e42de199068d7c89131ed3d228327c0

commit 3db8bedae1a7c44c70a2801d0add7f51defc9ebd
Author: Neal Cardwell <ncardwell@google.com>
Date:   Mon Aug 17 19:08:41 2020 -0400

    net-tcp_bbr: v2: remove field bw_rtts that is unused in BBRv2
    
    Change-Id: I58e3346c707748a6f316f3ed060d2da84c32a79b

commit 7c8c8307985427a0ee91a8514b1476f139ba9866
Author: Neal Cardwell <ncardwell@google.com>
Date:   Thu Nov 21 15:28:01 2019 -0500

    net-tcp_bbr: v2: remove unnecessary rs.delivered_ce logic upon loss
    
    There is no reason to compute rs.delivered_ce upon loss.
    
    In fact, we specifically do not want to compute rs.delivered_ce upon loss.
    
    Two issues:
    
    (1) This would be the wrong thing to do, in behavior terms.  With
        RACK's dynamic reordering window, losses can be marked long after
        the sequence hole appears in the ACK/SACK stream. We want to
        catch the ECN mark rate rising too high as quickly as possible,
        which means we want to check for high ECN mark rates at ACK time
        (as BBRv2 currently does) and not loss marking time.
    
    (2) This is dead code. The ECN mark rate cannot be detected as too
        high because the check needs rs->delivered to be > 0 as well:
    
           if (rs->delivered_ce > 0 && rs->delivered > 0 &&
    
        Since we are not setting rs->delivered upon loss, this check
        cannot succeed, so setting delivered_ce is pointless.
    
    This dead and wrong line was discovered by Randall Stewart at Netflix
    as he was reading the BBRv2 code.
    
    Change-Id: I37f83f418a259ec31d8f82de986db071b364b76a

commit b938c1c0d8364c89706ad3381077ec913ba1aafe
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue Jun 11 12:54:22 2019 -0400

    net-tcp_bbr: v2: BBRv2 ("bbr2") congestion control for Linux TCP
    
    BBR v2 is an enhancement to the BBR v1 algorithm. It's designed to aim for lower
    queues, lower loss, and better Reno/CUBIC coexistence than BBR v1.
    
    BBR v2 maintains the core of BBR v1: an explicit model of the network
    path that is two-dimensional, adapting to estimate the (a) maximum
    available bandwidth and (b) maximum safe volume of data a flow can
    keep in-flight in the network. It maintains the estimated BDP as a
    core guide for estimating an appropriate level of in-flight data.
    
    BBR v2 makes several key enhancements:
    
    o Its bandwidth-probing time scale is adapted, within bounds, to allow improved
    coexistence with Reno and CUBIC. The bandwidth-probing time scale is (a)
    extended dynamically based on estimated BDP to improve coexistence with
    Reno/CUBIC; (b) bounded by an interactive wall-clock time-scale to be more
    scalable and responsive than Reno and CUBIC.
    
    o Rather than being largely agnostic to loss and ECN marks, it explicitly uses
    loss and (DCTCP-style) ECN signals to maintain its model.
    
    o It aims for lower losses than v1 by adjusting its model to attempt to stay
    within loss rate and ECN mark rate bounds (loss_thresh and ecn_thresh,
    respectively).
    
    o It adapts to loss/ECN signals even when the application is running out of
    data ("application-limited"), in case the "application-limited" flow is also
    "network-limited" (the bw and/or inflight available to this flow is lower than
    previously estimated when the flow ran out of data).
    
    o It has a three-part model: the model explicitly tracks three operating points,
    where an operating point is a tuple: (bandwidth, inflight). The three operating
    points are:
    
      o latest:        the latest measurement from the current round trip
      o upper bound:   robust, optimistic, long-term upper bound
      o lower bound:   robust, conservative, short-term lower bound
    
    These are stored in the following state variables:
    
      o latest:  bw_latest, inflight_latest
      o lo:      bw_lo,     inflight_lo
      o hi:      bw_hi[2],  inflight_hi
    
    To gain intuition about the meaning of the three operating points, it
    may help to consider the analogs in CUBIC, which has a somewhat
    analogous three-part model used by its probing state machine:
    
      BBR param     CUBIC param
      -----------   -------------
      latest     ~  cwnd
      lo         ~  ssthresh
      hi         ~  last_max_cwnd
    
    The analogy is only a loose one, though, since the BBR operating
    points are calculated differently, and are 2-dimensional (bw,inflight)
    rather than CUBIC's one-dimensional notion of operating point
    (inflight).
    
    o It uses the three-part model to adapt the magnitude of its bandwidth
    to match the estimated space available in the buffer, rather than (as
    in BBR v1) assuming that it was always acceptable to place 0.25*BDP in
    the bottleneck buffer when probing (commodity datacenter switches
    commonly do not have that much buffer for WAN flows). When BBR v2
    estimates it hit a buffer limit during probing, its bandwidth probing
    then starts gently in case little space is still available in the
    buffer, and then accelerates, slowly at first and then rapidly if it
    can grow inflight without seeing congestion signals. In such cases,
    probing is bounded by inflight_hi + inflight_probe, where
    inflight_probe grows as: [0, 1, 2, 4, 8, 16,...]. This allows BBR to
    keep losses low and bounded if a bottleneck remains congested, while
    rapidly/scalably utilizing free bandwidth when it becomes available.
    
    o It has a slightly revised state machine, to achieve the goals above.
        BBR_BW_PROBE_UP:    pushes up inflight to probe for bw/vol
        BBR_BW_PROBE_DOWN:  drain excess inflight from the queue
        BBR_BW_PROBE_CRUISE: use pipe, w/ headroom in queue/pipe
        BBR_BW_PROBE_REFILL: try refill the pipe again to 100%, leaving queue empty
    
    o The estimated BDP: BBR v2 continues to maintain an estimate of the
    path's two-way propagation delay, by tracking a windowed min_rtt, and
    coordinating (on an as-needed basis) to try to expose the two-way
    propagation delay by draining the bottleneck queue.
    
    BBR v2 continues to use its min_rtt and (currently-applicable) bandwidth
    estimate to estimate the current bandwidth-delay product. The estimated BDP
    still provides one important guideline for bounding inflight data. However,
    because any min-filtered RTT and max-filtered bw inherently tend to both
    overestimate, the estimated BDP is often too high; in this case loss or ECN
    marks can ensue, in which case BBR v2 adjusts inflight_hi and inflight_lo to
    adapt its sending rate and inflight down to match the available capacity of the
    path.
    
    o Space: Note that ICSK_CA_PRIV_SIZE increased. This is because BBR v2
    requires more space. Note that much of the space is due to support for
    per-socket parameterization and debugging in this release for research
    and debugging. With that state removed, the full "struct bbr" is 140
    bytes, or 144 with padding. This is an increase of 40 bytes over the
    existing ca_priv space.
    
    o Code: BBR v2 reuses many pieces from BBR v1. But it omits the following
      significant pieces:
    
      o "packet conservation" (bbr_set_cwnd_to_recover_or_restore(),
        bbr_can_grow_inflight())
      o long-term bandwidth estimator ("policer mode")
    
      The code layout tries to keep BBR v2 code near the bottom of the
      file, so that v1-applicable code in the top does not accidentally
      refer to v2 code.
    
    o Docs:
      See the following docs for more details and diagrams describing the BBR v2
      algorithm:
        https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00
        https://datatracker.ietf.org/meeting/102/materials/slides-102-iccrg-an-update-on-bbr-work-at-google-00
    
    o Internal notes:
      For this upstream rebase, Neal started from:
        git show fed518041ac6:net/ipv4/tcp_bbr.c > net/ipv4/tcp_bbr.c
      then removed dev instrumentation (dynamic get/set for parameters)
      and code that was only used by BBRv1
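
    Two of the quantities described above can be sketched in Python, as an
    illustration only. This is not the tcp_bbr.c code: the function names
    and the units (packets per second, seconds, packets) are assumptions.
    The sketch shows the estimated BDP as bandwidth times min_rtt, and the
    probing bound inflight_hi + inflight_probe with inflight_probe growing
    as 0, 1, 2, 4, 8, 16, ...

```python
from itertools import islice


def estimated_bdp(bw_pkts_per_sec, min_rtt_sec):
    """Bandwidth-delay product in packets: bandwidth estimate times min_rtt."""
    return bw_pkts_per_sec * min_rtt_sec


def inflight_probe_steps():
    """Yield 0, 1, 2, 4, 8, 16, ... as listed in the commit message."""
    yield 0
    step = 1
    while True:
        yield step
        step *= 2


def probe_bounds(inflight_hi, rounds):
    """Upper bound on inflight for each of the first `rounds` probing steps."""
    return [inflight_hi + p for p in islice(inflight_probe_steps(), rounds)]
```

    Starting from zero and doubling keeps the first few probes gentle when
    the buffer is nearly full, while still reaching large increments in a
    small number of steps when no congestion signal appears.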
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 2c84098e60bed6d67dde23cd7538c51dee273102
    Change-Id: I125cf26ba2a7a686f2fa5e87f4c2afceb65f7a05

commit d9e996927eb79ed02fc8781cc08a534783dab186
Author: Neal Cardwell <ncardwell@google.com>
Date:   Sat Nov 16 13:16:25 2019 -0500

    net-tcp: add fast_ack_mode=1: skip rwin check in tcp_fast_ack_mode__tcp_ack_snd_check()
    
    Add logic for an experimental TCP connection behavior, enabled with
    tp->fast_ack_mode = 1, which disables checking the receive window
    before sending an ack in __tcp_ack_snd_check(). If this behavior is
    enabled, the data receiver sends an ACK if the amount of data is >
    RCV.MSS.
    
    Change-Id: Iaa0a0fd7108221f883137a79d5bfa724f1b096d4

commit 049d978823abd65235ac3357538c9077288432b6
Author: Neal Cardwell <ncardwell@google.com>
Date:   Fri Sep 27 17:10:26 2019 -0400

    net-tcp: re-generalize TSO sizing in TCP CC module API
    
    Reorganize the API for CC modules so that the CC module once again
    gets complete control of the TSO sizing decision. This is how the API
    was set up around 2016 and the initial BBRv1 upstreaming. Later Eric
    Dumazet simplified it. But with wider testing it now seems that to
    avoid CPU regressions BBR needs to have a different TSO sizing
    function.
    
    This is necessary to handle cases where there are many flows
    bottlenecked on the sender host's NIC, in which case BBR's pacing rate
    is much lower than CUBIC/Reno/DCTCP's. Why does this happen? Because
    BBR's pacing rate adapts to the low bandwidth share each flow sees. By
    contrast, CUBIC/Reno/DCTCP see no loss or ECN, so they grow a very
    large cwnd, and thus large pacing rate and large TSO burst size.
    
    Change-Id: Ic8ccfdbe4010ee8d4bf6a6334c48a2fceb2171ea

commit 100034ae4d6a3155fa464016d19517a6a15a4ca0
Author: Yousuk Seung <ysseung@google.com>
Date:   Wed May 23 17:55:54 2018 -0700

    net-tcp: add new ca opts flag TCP_CONG_WANTS_CE_EVENTS
    
    Add a new ca opts flag TCP_CONG_WANTS_CE_EVENTS that allows a
    congestion control module to receive CE events.
    
    Currently congestion control modules have to set the TCP_CONG_NEEDS_ECN
    bit in opts flag to receive CE events but this may incur changes in ECN
    behavior elsewhere. This patch adds a new bit TCP_CONG_WANTS_CE_EVENTS
    that allows congestion control modules to receive CE events
    independently of TCP_CONG_NEEDS_ECN.
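
    The separation of the two bits can be sketched as follows. This is
    a Python model, not the kernel header: the numeric bit values and
    both helper functions are assumptions made for illustration.

```python
# Illustrative bit values (assumed, not copied from the kernel header).
TCP_CONG_NON_RESTRICTED = 0x1
TCP_CONG_NEEDS_ECN = 0x2
TCP_CONG_WANTS_CE_EVENTS = 0x4


def receives_ce_events(flags: int) -> bool:
    """CE events are delivered if either bit is set."""
    return bool(flags & (TCP_CONG_NEEDS_ECN | TCP_CONG_WANTS_CE_EVENTS))


def changes_ecn_behavior(flags: int) -> bool:
    """Only TCP_CONG_NEEDS_ECN alters ECN behavior elsewhere."""
    return bool(flags & TCP_CONG_NEEDS_ECN)
```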
    
    Effort: net-tcp
    Origin-9xx-SHA1: 9f7e14716cde760bc6c67ef8ef7e1ee48501d95b
    Change-Id: I2255506985242f376d910c6fd37daabaf4744f24

commit f103c4029878d8a71cf7a5d4f1c0d889db45f80b
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue May 7 22:37:19 2019 -0400

    net-tcp_bbr: v2: set tx.in_flight for skbs in repair write queue
    
    Syzkaller was able to use TCP_REPAIR to reproduce the new warning
    added in tcp_fragment():
    
      WARNING: CPU: 0 PID: 118174 at net/ipv4/tcp_output.c:1487
        tcp_fragment+0xdcc/0x10a0 net/ipv4/tcp_output.c:1487()
      inconsistent: tx.in_flight: 0 old_factor: 53
    
    The warning happens because skbs inserted into the tcp_rtx_queue
    during the repair process go through a sort of "fake send" process,
    and that process was setting pcount but not tx.in_flight, and thus the
    warnings (where old_factor is the old pcount).
    
    The fix of setting tx.in_flight in the TCP_REPAIR code path seems
    simple enough, and indeed makes the repro code from syzkaller stop
    producing warnings. Running through kokonut tests, and will send out
    for review when all tests pass.
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 330f825a08a6fe92cef74d799cc468864c479f63
    Change-Id: I0bc4a790f040fd4239620e1eedd5dc64666c6f05
    
    Rebased-by: Alexandre Frade <kernel@xanmod.org>

commit ca3b65fbaad3f0b9385bc1b46fc11ac5c8ef8409
Author: Neal Cardwell <ncardwell@google.com>
Date:   Wed May 1 20:16:25 2019 -0400

    net-tcp_bbr: v2: adjust skb tx.in_flight upon split in tcp_fragment()
    
    When we fragment an skb that has already been sent, we need to update
    the tx.in_flight for the first skb in the resulting pair ("buff").
    
    Because we were not updating the tx.in_flight, the tx.in_flight value
    was inconsistent with the pcount of the "buff" skb (tx.in_flight would
    be too high). That meant that if the "buff" skb was lost, then
    bbr2_inflight_hi_from_lost_skb() would calculate an inflight_hi value
    that is too high. This could result in longer queues and higher packet
    loss.
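
    The adjustment can be sketched in Python as below. This is a model
    with hypothetical names, assuming that tx.in_flight counts the
    segments in flight just after the skb was sent, including the skb's
    own pcount.

```python
def in_flight_after_split(tx_in_flight: int, old_pcount: int,
                          new_pcount: int) -> int:
    """Recompute tx.in_flight for an skb whose pcount shrank in a split.

    tx_in_flight -- value recorded when the unsplit skb was sent
    old_pcount   -- segment count of the skb before the split
    new_pcount   -- segment count of this skb after the split
    """
    # Segments already in flight before the original skb was sent.
    in_flight_before = max(tx_in_flight - old_pcount, 0)
    # Value as if the smaller skb had been sent by itself.
    return in_flight_before + new_pcount
```

    For example, an skb of 4 segments sent with tx.in_flight = 10 and
    then split so that 2 segments remain gets tx.in_flight = 8 instead
    of the stale, too-high value of 10.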
    
    Packetdrill testing verified that without this commit, when the second
    half of an skb is SACKed and then later the first half of that skb is
    marked lost, the calculated inflight_hi was incorrect.
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 385f1ddc610798fab2837f9f372857438b25f874
    Change-Id: I617f8cab4e9be7a0b8e8d30b047bf8645393354d

commit 69e37035c32cff51a53c8fb6f24332d33d694388
Author: Neal Cardwell <ncardwell@google.com>
Date:   Wed May 1 20:16:33 2019 -0400

    net-tcp_bbr: v2: adjust skb tx.in_flight upon merge in tcp_shifted_skb()
    
    When tcp_shifted_skb() updates state as adjacent SACKed skbs are
    coalesced, previously the tx.in_flight was not adjusted, so we could
    get contradictory state where the skb's recorded pcount was bigger
    than the tx.in_flight (the number of segments that were in_flight
    after sending the skb).
    
    Normally having a SACKed skb with contradictory pcount/tx.in_flight
    would not matter. However, with SACK reneging, the SACKed bit is
    removed, and an skb once again becomes eligible for retransmitting,
    fragmenting, SACKing, etc. Packetdrill testing verified the following
    sequence is possible in a kernel that does not have this commit:
    
     - skb N is SACKed
     - skb N+1 is SACKed and combined with skb N using tcp_shifted_skb()
       - tcp_shifted_skb() will increase the pcount of prev,
         but leave tx.in_flight as-is
       - so prev skb can have pcount > tx.in_flight
     - RTO, tcp_timeout_mark_lost(), detect reneg,
       remove "SACKed" bit, mark skb N as lost
       - find pcount of skb N is greater than its tx.in_flight
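
    The sequence above can be reproduced with a small Python model. The
    Skb class, the shift_segments() helper and the way tx.in_flight is
    adjusted here are illustrative assumptions, not the kernel code.

```python
from dataclasses import dataclass


@dataclass
class Skb:
    pcount: int        # number of segments in this skb
    tx_in_flight: int  # segments in flight right after it was sent


def shift_segments(prev: Skb, skb: Skb, pcount: int,
                   adjust_in_flight: bool) -> None:
    """Move pcount SACKed segments from skb into prev."""
    prev.pcount += pcount
    skb.pcount -= pcount
    if adjust_in_flight:
        # Keep prev's tx.in_flight consistent with its larger pcount.
        prev.tx_in_flight += pcount


# Without the adjustment: pcount (5) exceeds tx.in_flight (2).
old_prev, old_skb = Skb(2, 2), Skb(3, 5)
shift_segments(old_prev, old_skb, 3, adjust_in_flight=False)

# With the adjustment the invariant pcount <= tx.in_flight holds.
new_prev, new_skb = Skb(2, 2), Skb(3, 5)
shift_segments(new_prev, new_skb, 3, adjust_in_flight=True)
```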
    
    I suspect this issue is what caused the bbr2_inflight_hi_from_lost_skb():
      WARN_ON_ONCE(inflight_prev < 0)
    to fire in production machines using bbr2.
    
    Tested: See last commit in series for sponge link.
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 1a3e997e613d2dcf32b947992882854ebe873715
    Change-Id: I1b0b75c27519953430c7db51c6f358f104c7af55

commit 1668db7ceb3324ec4928b2b3209abe08d4de635c
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue May 7 22:36:36 2019 -0400

    net-tcp_bbr: v2: factor out tx.in_flight setting into tcp_set_tx_in_flight()
    
    Factor out the code to set an skb's tx.in_flight field into its own
    function, so that this code can be used for the TCP_REPAIR "fake send"
    code path that inserts skbs into the rtx queue without sending
    them. This is in preparation for the following patch, which fixes an
    issue with TCP_REPAIR and tx.in_flight.
    
    Tested: See last patch in series for sponge link.
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: e880fc907d06ea7354333f60f712748ebce9497b
    Change-Id: I4fbd4a6e18a51ab06d50ab1c9ad820ce5bea89af

commit 27430b0e5345afa82c5e0c170af4af896b61c11d
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue Aug 7 21:52:06 2018 -0400

    net-tcp_bbr: v2: introduce ca_ops->skb_marked_lost() CC module callback API
    
    For connections experiencing reordering, RACK can mark packets lost
    long after we receive the SACKs/ACKs hinting that the packets were
    actually lost.
    
    This means that CC modules cannot easily learn the volume of inflight
    data at which packet loss happens by looking at the current inflight
    or even the packets in flight when the most recently SACKed packet was
    sent. To learn this, CC modules need to know how many packets were in
    flight at the time lost packets were sent. This new callback, combined
    with TCP_SKB_CB(skb)->tx.in_flight, allows them to learn this.
    
    This also provides a consistent callback that is invoked whether
    packets are marked lost upon ACK processing, using the RACK reordering
    timer, or at RTO time.
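
    A rough Python sketch of how a CC module might use such a callback.
    The class, its policy of remembering the lowest inflight level at
    which a loss was seen, and the mark_lost() dispatcher are all
    hypothetical; only the idea of one callback for every loss-marking
    path comes from this commit.

```python
from typing import Optional


class LossAwareCC:
    """Hypothetical CC module that learns the inflight level at loss."""

    def __init__(self) -> None:
        self.inflight_at_loss: Optional[int] = None

    def skb_marked_lost(self, tx_in_flight: int) -> None:
        # tx_in_flight is the snapshot taken when the lost skb was sent,
        # not the current inflight at the time the loss is detected.
        if (self.inflight_at_loss is None
                or tx_in_flight < self.inflight_at_loss):
            self.inflight_at_loss = tx_in_flight


def mark_lost(cc: LossAwareCC, tx_in_flight: int, path: str) -> None:
    """The same callback fires for every loss-marking path."""
    assert path in ("ack", "rack_timer", "rto")
    cc.skb_marked_lost(tx_in_flight)
```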
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: afcbebe3374e4632ac6714d39e4dc8a8455956f4
    Change-Id: I54826ab53df636be537e5d3c618a46145d12d51a

commit bf24d725ba91dc2659948732130dfcd5c44ad5eb
Author: Neal Cardwell <ncardwell@google.com>
Date:   Mon Nov 19 13:48:36 2018 -0500

    net-tcp_bbr: v2: export FLAG_ECE in rate_sample.is_ece
    
    For understanding the relationship between inflight and ECN signals,
    to try to find the highest inflight value that has acceptable levels
    of ECN marking.
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 3eba998f2898541406c2666781182200934965a8
    Change-Id: I3a964e04cee83e11649a54507043d2dfe769a3b3

commit c69f3943e026768cf9cca3a8b4a3978980fa4b85
Author: Neal Cardwell <ncardwell@google.com>
Date:   Thu Oct 12 23:44:27 2017 -0400

    net-tcp_bbr: v2: count packets lost over TCP rate sampling interval
    
    For understanding the relationship between inflight and packet loss
    signals, to try to find the highest inflight value that has acceptable
    levels of packet losses.
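
    As an illustration of how a loss count per sampling interval can be
    used, the Python sketch below compares losses against the inflight
    level recorded at transmit time. The function name and the 2% loss
    threshold are assumptions for the example, not values taken from
    this commit.

```python
def inflight_too_high(lost: int, tx_in_flight: int,
                      loss_thresh: float = 0.02) -> bool:
    """Return True if the loss rate at this inflight level is excessive.

    lost         -- packets marked lost over the rate sampling interval
    tx_in_flight -- packets in flight when the sampled packet was sent
    loss_thresh  -- tolerated loss fraction (assumed value)
    """
    if tx_in_flight <= 0:
        return False  # no meaningful sample
    return lost > tx_in_flight * loss_thresh
```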
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 4527e26b2bd7756a88b5b9ef1ada3da33dd609ab
    Change-Id: I594c2500868d9c530770e7ddd68ffc87c57f4fd5
    
    Rebased-by: Alexandre Frade <kernel@xanmod.org>

commit e282d98bc64872020f9509f90b56e3c7a80c3c64
Author: Neal Cardwell <ncardwell@google.com>
Date:   Sat Aug 5 11:49:50 2017 -0400

    net-tcp_bbr: v2: snapshot packets in flight at transmit time and pass in rate_sample
    
    For understanding the relationship between inflight and losses or ECN
    signals, to try to find the highest inflight value that has acceptable
    levels of loss/ECN marking.
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: b3eb4f2d20efab4ca001f32c9294739036c493ea
    Change-Id: I7314047d0ff14dd261a04b1969a46dc658c8836a

commit a29bc9280a3cb67d61dc1ebe9b3708c1d37c8678
Author: Neal Cardwell <ncardwell@google.com>
Date:   Sun Jun 24 21:55:59 2018 -0400

    net-tcp_bbr: v2: shrink delivered_mstamp, first_tx_mstamp to u32 to free up 8 bytes
    
    Free up some space for tracking inflight and losses for each
    bw sample, in upcoming commits.
    
    These timestamps are in microseconds, and are now stored in 32
    bits. So they can only hold time intervals up to 2^32 usec, which is
    roughly 2^12 = 4096 seconds.  But Linux TCP RTT and RTO tracking has
    the same 32-bit
    microsecond implementation approach and resulting deployment
    limitations. So this is not introducing a new limit. And these should
    not be a limitation for the foreseeable future.
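
    The arithmetic behind this limit, and the wrap-safe subtraction
    that 32-bit timestamps rely on, can be checked with a short Python
    sketch (the helper names are illustrative):

```python
U32_MASK = 0xFFFFFFFF


def max_interval_seconds() -> float:
    """Longest interval a 32-bit microsecond counter can represent."""
    return (1 << 32) / 1_000_000


def elapsed_us(now_us: int, then_us: int) -> int:
    """Unsigned 32-bit difference; correct across one wraparound."""
    return (now_us - then_us) & U32_MASK


limit = max_interval_seconds()                # about 4295 seconds
wrapped = elapsed_us(5, 0xFFFFFFFB)           # counter wrapped in between
```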
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 238a7e6b5d51625fef1ce7769826a7b21b02ae55
    Change-Id: I3b779603797263b52a61ad57c565eb91fe42680c
    
    Rebased-by: Alexandre Frade <kernel@xanmod.org>

commit bbac81e25fe6374b5599913339a59627d41e1071
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue Jun 11 12:26:55 2019 -0400

    net-tcp_bbr: broaden app-limited rate sample detection
    
    This commit is a bug fix for the Linux TCP app-limited
    (application-limited) logic that is used for collecting rate
    (bandwidth) samples.
    
    Previously the app-limited logic only looked for "bubbles" of
    silence in between application writes, by checking at the start
    of each sendmsg. But "bubbles" of silence can also happen before
    retransmits: e.g. bubbles can happen between an application write
    and a retransmit, or between two retransmits.
    
    Retransmits are triggered by ACKs or timers. So this commit checks
    for bubbles of app-limited silence upon ACKs or timers.
    
    Why does this commit check for app-limited state at the start of
    ACKs and timer handling? Because at that point we know whether
    inflight was fully using the cwnd.  While processing the ACK or
    timer event we often change the cwnd; after changing the cwnd we
    can't know whether inflight was fully using the old cwnd.
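
    A Python sketch of the kind of check that can be run at the start
    of ACK or timer handling. It loosely follows the conditions of the
    kernel's tcp_rate_check_app_limited(), but the function and its
    parameters are a simplified model introduced here for illustration.

```python
def is_app_limited(unsent_bytes: int, mss: int, queued_in_host: bool,
                   packets_in_flight: int, cwnd: int,
                   lost_out: int, retrans_out: int) -> bool:
    """Return True if the sender is limited by the application."""
    return (
        unsent_bytes < mss                # less than one packet to send
        and not queued_in_host            # nothing waiting in qdisc/NIC
        and packets_in_flight < cwnd      # not limited by the old cwnd
        and lost_out <= retrans_out       # all losses retransmitted
    )
```

    The check has to use the cwnd as it was before the ACK or timer
    event is processed, which is why it runs first.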
    
    Origin-9xx-SHA1: 3fe9b53291e018407780fb8c356adb5666722cbc
    Change-Id: I37221506f5166877c2b110753d39bb0757985e68

commit 0336b1cfa0f9d3f68da990464b080a3f2367d996
Author: Alexey Avramov <hakavlad@inbox.lv>
Date:   Sat Nov 13 10:42:27 2021 +0900

    mm/vmscan: add sysctl knobs for protecting the working set
    
    The kernel does not provide a way to protect the working set under memory
    pressure. A certain amount of anonymous and clean file pages is required by
    the userspace for normal operation. First of all, the userspace needs a
    cache of shared libraries and executable binaries. If the amount of the
    clean file pages falls below a certain level, then thrashing and even
    livelock can take place.
    
    The patch provides sysctl knobs for protecting the working set (anonymous
    and clean file pages) under memory pressure.
    
    The vm.anon_min_kbytes sysctl knob provides *hard* protection of anonymous
    pages. The anonymous pages on the current node won't be reclaimed under any
    conditions when their amount is below vm.anon_min_kbytes. This knob may be
    used to prevent excessive swap thrashing when anonymous memory is low (for
    example, when memory is going to be overfilled by compressed data of zram
    module). The default value is defined by CONFIG_ANON_MIN_KBYTES (suggested
    0 in Kconfig).
    
    The vm.clean_low_kbytes sysctl knob provides *best-effort* protection of
    clean file pages. The file pages on the current node won't be reclaimed
    under memory pressure when the amount of clean file pages is below
    vm.clean_low_kbytes *unless* we threaten to OOM. Protection of clean file
    pages using this knob may be used when swapping is still possible to
      - prevent disk I/O thrashing under memory pressure;
      - improve performance in disk cache-bound tasks under memory pressure.
    The default value is defined by CONFIG_CLEAN_LOW_KBYTES (suggested 0 in
    Kconfig).
    
    The vm.clean_min_kbytes sysctl knob provides *hard* protection of clean
    file pages. The file pages on the current node won't be reclaimed under
    memory pressure when the amount of clean file pages is below
    vm.clean_min_kbytes. Hard protection of clean file pages using this knob
    may be used to
      - prevent disk I/O thrashing under memory pressure even with no free swap
        space;
      - improve performance in disk cache-bound tasks under memory pressure;
      - avoid high latency and prevent livelock in near-OOM conditions.
    The default value is defined by CONFIG_CLEAN_MIN_KBYTES (suggested 0 in
    Kconfig).
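
    The three knobs can be summarized with a small Python model of the
    policy as described in this message. The function names are
    hypothetical and the real reclaim logic is more involved; this only
    captures hard versus best-effort protection.

```python
def may_reclaim_anon(anon_kb: int, anon_min_kb: int) -> bool:
    """Hard protection: never reclaim anon pages below the minimum."""
    return anon_kb >= anon_min_kb


def may_reclaim_clean_file(clean_kb: int, clean_low_kb: int,
                           clean_min_kb: int, oom_threat: bool) -> bool:
    """Clean file pages: hard floor plus a best-effort low watermark."""
    if clean_kb < clean_min_kb:
        return False          # hard protection, even when near OOM
    if clean_kb < clean_low_kb:
        return oom_threat     # best effort: give way only to avoid OOM
    return True
```

    With all three values at the suggested default of 0, both functions
    always return True, so reclaim behaves as it does without the patch.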
    
    Signed-off-by: Alexey Avramov <hakavlad@inbox.lv>

commit b067ead25ed9dd602cc0ea8eded105c1931179f1
Author: Yu Zhao <yuzhao@google.com>
Date:   Tue May 17 19:46:33 2022 -0600

    mm: multi-gen LRU: design doc
    
    Add a design doc.
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

commit 9cd37351057623b89584e5430774321019287667
Author: Yu Zhao <yuzhao@google.com>
Date:   Tue May 17 19:46:32 2022 -0600

    mm: multi-gen LRU: admin guide
    
    Add an admin guide.
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

commit 221f7d8660c7b89223803cf53f734b711f50b07c
Author: Yu Zhao <yuzhao@google.com>
Date:   Tue May 17 19:46:31 2022 -0600

    mm: multi-gen LRU: debugfs interface
    
    Add /sys/kernel/debug/lru_gen for working set estimation and proactive
    reclaim. These techniques are commonly used to optimize job scheduling
    (bin packing) in data centers [1][2].
    
    Compared with the page table-based approach and the PFN-based
    approach, this lruvec-based approach has the following advantages:
    1. It offers better choices because it is aware of memcgs, NUMA nodes,
       shared mappings and unmapped page cache.
    2. It is more scalable because it is O(nr_hot_pages), whereas the
       PFN-based approach is O(nr_total_pages).
    
    Add /sys/kernel/debug/lru_gen_full for debugging.
    
    [1] https://dl.acm.org/doi/10.1145/3297858.3304053
    [2] https://dl.acm.org/doi/10.1145/3503222.3507731
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

commit 6f9b54dff962646004c979ad0dd7ed155da3748c
Author: Yu Zhao <yuzhao@google.com>
Date:   Tue May 17 19:46:30 2022 -0600

    mm: multi-gen LRU: thrashing prevention
    
    Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as
    requested by many desktop users [1].
    
    When set to value N, it prevents the working set of N milliseconds
    from getting evicted. The OOM killer is triggered if this working set
    cannot be kept in memory. Based on the average human detectable lag
    (~100ms), N=1000 usually eliminates intolerable lags due to thrashing.
    Larger values like N=3000 make lags less noticeable at the risk of
    premature OOM kills.
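
    A simplified Python model of this policy: generations younger than
    min_ttl_ms are treated as the protected working set, and the OOM
    killer is the fallback when nothing older remains. The function and
    its inputs are hypothetical and ignore how the kernel actually
    tracks generation birth times.

```python
from typing import Sequence


def should_trigger_oom(oldest_gen_ages_ms: Sequence[int],
                       min_ttl_ms: int) -> bool:
    """Return True if every candidate is inside the protected window.

    oldest_gen_ages_ms -- age of the oldest generation of each candidate
    min_ttl_ms         -- value written to min_ttl_ms (0 disables it)
    """
    if min_ttl_ms == 0 or not oldest_gen_ages_ms:
        return False
    return all(age < min_ttl_ms for age in oldest_gen_ages_ms)
```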
    
    Compared with the size-based approach, e.g., [2], this time-based
    approach has the following advantages:
    1. It is easier to configure because it is agnostic to applications
       and memory sizes.
    2. It is more reliable because it is directly wired to the OOM killer.
    
    [1] https://lore.kernel.org/r/Ydza%2FzXKY9ATRoh6@google.com/
    [2] https://lore.kernel.org/r/20101028191523.GA14972@google.com/
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

commit 93f31833fac44e79afb61523cbac4ed56a30d352
Author: Yu Zhao <yuzhao@google.com>
Date:   Tue May 17 19:46:29 2022 -0600

    mm: multi-gen LRU: kill switch
    
    Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
    can be disabled include:
      0x0001: the multi-gen LRU core
      0x0002: walking page table, when arch_has_hw_pte_young() returns
              true
      0x0004: clearing the accessed bit in non-leaf PMD entries, when
              CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
      [yYnN]: apply to all the components above
    E.g.,
      echo y >/sys/kernel/mm/lru_gen/enabled
      cat /sys/kernel/mm/lru_gen/enabled
      0x0007
      echo 5 >/sys/kernel/mm/lru_gen/enabled
      cat /sys/kernel/mm/lru_gen/enabled
      0x0005
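
    The bitmask in the example can be decoded with a short Python
    helper. The component labels and function names are illustrative,
    and treating y/n as 0x0007/0 follows the description above.

```python
COMPONENTS = {
    0x0001: "multi-gen LRU core",
    0x0002: "page table walks",
    0x0004: "non-leaf PMD accessed bit",
}
ALL_COMPONENTS = 0x0007


def parse_enabled(value: str) -> int:
    """Convert a value written to the kill switch into a bitmask."""
    text = value.strip()
    if text in ("y", "Y"):
        return ALL_COMPONENTS
    if text in ("n", "N"):
        return 0
    return int(text, 0) & ALL_COMPONENTS  # accepts "5" or "0x0005"


def describe(mask: int) -> list:
    """List the components enabled by a bitmask."""
    return [name for bit, name in COMPONENTS.items() if mask & bit]


def show(mask: int) -> str:
    """Format a bitmask the way the example output prints it."""
    return "0x%04x" % mask
```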
    
    NB: the page table walks happen on the scale of seconds under heavy
    memory pressure, in which case the mmap_lock contention is a lesser
    concern, compared with the LRU lock contention and the I/O congestion.
    So far the only well-known case of the mmap_lock contention happens on
    Android, due to Scudo [1] which allocates several thousand VMAs for
    merely a few hundred MBs. The SPF and the Maple Tree also have
    provided their own assessments [2][3]. However, if walking page tables
    does worsen the mmap_lock contention, the kill switch can be used to
    disable it. In this case the multi-gen LRU will suffer a minor
    performance degradation, as shown previously.
    
    Clearing the accessed bit in non-leaf PMD entries can also be
    disabled, since this behavior was not tested on x86 varieties other
    than Intel and AMD.
    
    [1] https://source.android.com/devices/tech/debug/scudo
    [2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/
    [3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

commit 5cc012677726ada4ce540791f8a7ec6f02292377
Author: Yu Zhao <yuzhao@google.com>
Date:   Tue May 17 19:46:28 2022 -0600

    mm: multi-gen LRU: optimize multiple memcgs
    
    When multiple memcgs are available, it is possible to make better
    choices based on generations and tiers and therefore improve the
    overall performance under global memory pressure. This patch adds a
    rudimentary optimization to select memcgs that can drop single-use
    unmapped clean pages first. Doing so reduces the chance of going into
    the aging path or swapping. These two operations can be costly.
    
    A typical example that benefits from this optimization is a server
    running mixed types of workloads, e.g., heavy anon workload in one
    memcg and heavy buffered I/O workload in the other.
    
    Though this optimization can be applied to both kswapd and direct
    reclaim, it is only added to kswapd to keep the patchset manageable.
    Later improvements will cover the direct reclaim path.
    
    Server benchmark results:
      Mixed workloads:
        fio (buffered I/O): +[1, 3]%
                    IOPS         BW
          patch1-8: 2154k        8415MiB/s
          patch1-9: 2205k        8613MiB/s
    
        memcached (anon): +[132, 136]%
                    Ops/sec      KB/sec
          patch1-8: 819618.49    31838.48
          patch1-9: 1916516.06   74447.92
    
      Mixed workloads:
        fio (buffered I/O): +[59, 61]%
                    IOPS         BW
          5.18-rc1: 1378k        5385MiB/s
          patch1-9: 2205k        8613MiB/s
    
        memcached (anon): +[229, 233]%
                    Ops/sec      KB/sec
          5.18-rc1: 578946.00    22489.44
          patch1-9: 1916516.06   74447.92
    
      Configurations:
        (changes since patch 6)
    
        cat mixed.sh
        modprobe brd rd_nr=2 rd_size=56623104
    
        swapoff -a
        mkswap /dev/ram0
        swapon /dev/ram0
    
        mkfs.ext4 /dev/ram1
        mount -t ext4 /dev/ram1 /mnt
    
        memtier_benchmark -S /var/run/memcached/memcached.sock \
          -P memcache_binary -n allkeys --key-minimum=1 \
          --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
          --ratio 1:0 --pipeline 8 -d 2000
    
        fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
          --buffered=1 --ioengine=io_uring --iodepth=128 \
          --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
          --rw=randread --random_distribution=random --norandommap \
          --time_based --ramp_time=10m --runtime=90m --group_reporting &
        pid=$!
    
        sleep 200
    
        memtier_benchmark -S /var/run/memcached/memcached.sock \
          -P memcache_binary -n allkeys --key-minimum=1 \
          --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
          --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
    
        kill -INT $pid
        wait
    
    Client benchmark results:
      no change (CONFIG_MEMCG=n)
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

commit c0c8228236fa3e1fed38d362fc5d37d0890a8987
Author: Yu Zhao <yuzhao@google.com>
Date:   Tue May 17 19:46:27 2022 -0600

    mm: multi-gen LRU: support page table walks
    
    To further exploit spatial locality, the aging prefers to walk page
    tables to search for young PTEs and promote hot pages. A kill switch
    will be added in the next patch to disable this behavior. When
    disabled, the aging relies on the rmap only.
    
    NB: this behavior has nothing in common with the page table scanning in
    the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
    pages to swapcache and unmaps them.
    
    To avoid confusion, the term "iteration" specifically means the
    traversal of an entire mm_struct list; the term "walk" will be applied
    to page tables and the rmap, as usual.
    
    An mm_struct list is maintained for each memcg, and an mm_struct
    follows its owner task to the new memcg when this task is migrated.
    Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
    walk_page_range() with each mm_struct on this list to promote hot
    pages before it increments max_seq.
    
    When multiple page table walkers iterate the same list, each of them
    gets a unique mm_struct; therefore they can run concurrently. Page
    table walkers ignore any misplaced pages, e.g., if an mm_struct was
    migrated, pages it left in the previous memcg will not be promoted
    when its current memcg is under reclaim. Similarly, page table walkers
    will not promote pages from nodes other than the one under reclaim.
    
    This patch uses the following optimizations when walking page tables:
    1. It tracks the usage of mm_struct's between context switches so that
       page table walkers can skip processes that have been sleeping since
       the last iteration.
    2. It uses generational Bloom filters to record populated branches so
       that page table walkers can reduce their search space based on the
       query results, e.g., to skip page tables containing mostly holes or
       misplaced pages.
    3. It takes advantage of the accessed bit in non-leaf PMD entries when
       CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
    4. It does not zigzag between a PGD table and the same PMD table
       spanning multiple VMAs. IOW, it finishes all the VMAs within the
       range of the same PMD table before it returns to a PGD table. This
       improves the cache performance for workloads that have large
       numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
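
    Optimization 2 can be sketched in Python as a pair of Bloom filters
    that swap roles after each iteration. This is a toy model: the
    class names, the filter size, and the use of SHA-256 to derive two
    bit positions are assumptions, not the kernel's implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, nbits: int = 1 << 15) -> None:
        self.nbits = nbits
        self.bits = bytearray(nbits // 8)

    def _keys(self, item: int):
        # Derive two bit positions from one digest of the item.
        digest = hashlib.sha256(item.to_bytes(8, "little")).digest()
        return (int.from_bytes(digest[:4], "little") % self.nbits,
                int.from_bytes(digest[4:8], "little") % self.nbits)

    def add(self, item: int) -> None:
        for k in self._keys(item):
            self.bits[k // 8] |= 1 << (k % 8)

    def might_contain(self, item: int) -> bool:
        return all(self.bits[k // 8] & (1 << (k % 8))
                   for k in self._keys(item))


class GenerationalBloom:
    """One filter is queried now; the other is filled for next time."""

    def __init__(self) -> None:
        self.current = BloomFilter()
        self.upcoming = BloomFilter()

    def record_populated(self, pmd_addr: int) -> None:
        self.upcoming.add(pmd_addr)

    def worth_walking(self, pmd_addr: int) -> bool:
        return self.current.might_contain(pmd_addr)

    def advance(self) -> None:
        # Start a new iteration: promote the filled filter, reset the other.
        self.current, self.upcoming = self.upcoming, BloomFilter()
```

    A Bloom filter never reports a recorded branch as absent, so a
    walker that trusts a negative answer skips only branches that were
    not recorded as populated in the previous iteration.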
    
    Server benchmark results:
      Single workload:
        fio (buffered I/O): no change
    
      Single workload:
        memcached (anon): +[8, 10]%
                    Ops/sec      KB/sec
          patch1-7: 1193918.93   46438.15
          patch1-8: 1301954.44   50640.27
    
      Configurations:
        no change
    
    Client benchmark results:
      kswapd profiles:
        patch1-7
          45.90%  lzo1x_1_do_compress (real work)
           9.14%  page_vma_mapped_walk
           6.81%  _raw_spin_unlock_irq
           2.80%  ptep_clear_flush
           2.34%  __zram_bvec_write
           2.29%  do_raw_spin_lock
           1.84%  lru_gen_look_around
           1.78%  memmove
           1.74%  obj_malloc
           1.50%  free_unref_page_list
    
        patch1-8
          46.96%  lzo1x_1_do_compress (real work)
           7.55%  page_vma_mapped_walk
           5.89%  _raw_spin_unlock_irq
           3.33%  walk_pte_range
           2.65%  ptep_clear_flush
           2.23%  __zram_bvec_write
           2.08%  do_raw_spin_lock
           1.83%  memmove
           1.65%  obj_malloc
           1.47%  free_unref_page_list
    
      Configurations:
        no change
    
    Thanks to the following developers for their efforts [3].
      kernel test robot <lkp@intel.com>
    
    [1] https://lwn.net/Articles/23732/
    [2] https://source.android.com/devices/tech/debug/scudo
    [3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

commit a961702eef7c9d09c979a9778eba4a0d1e5f90f3
Author: Yu Zhao <yuzhao@google.com>
Date:   Tue May 17 19:46:26 2022 -0600

    mm: multi-gen LRU: exploit locality in rmap
    
    Searching the rmap for PTEs mapping each page on an LRU list (to test
    and clear the accessed bit) can be expensive because pages from
    different VMAs (PA space) are not cache friendly to the rmap (VA
    space). For workloads mostly using mapped pages, the rmap has a high
    CPU cost in the reclaim path.
    
    This patch exploits spatial locality to reduce the trips into the
    rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
    new function lru_gen_look_around() scans at most BITS_PER_LONG-1
    adjacent PTEs. On finding another young PTE, it clears the accessed
    bit and updates the gen counter of the page mapped by this PTE to
    (max_seq%MAX_NR_GENS)+1.
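
    The two calculations in this paragraph can be sketched in Python.
    The window-centering policy, the helper names and the values
    BITS_PER_LONG = 64 and MAX_NR_GENS = 4 are assumptions for the
    example; only the BITS_PER_LONG-1 bound and the gen counter formula
    come from the text.

```python
BITS_PER_LONG = 64   # assumed 64-bit build
MAX_NR_GENS = 4      # assumed value for illustration


def look_around_window(index: int, nr_ptes: int) -> list:
    """Indices of the adjacent PTEs to scan around a young PTE.

    index   -- position of the young PTE within its page table
    nr_ptes -- number of PTEs available in that table
    """
    size = min(BITS_PER_LONG, nr_ptes)
    # Center the window on the young PTE, clamped to the table bounds.
    start = max(0, min(index - size // 2, nr_ptes - size))
    return [i for i in range(start, start + size) if i != index]


def gen_counter(max_seq: int) -> int:
    """Gen counter stored for a page promoted to the youngest generation."""
    return (max_seq % MAX_NR_GENS) + 1
```

    The +1 keeps the stored counter nonzero for every generation, so
    under these assumptions the values range from 1 to MAX_NR_GENS.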
    
    Server benchmark results:
      Single workload:
        fio (buffered I/O): no change
    
      Single workload:
        memcached (anon): +[5.5, 7.5]%
                    Ops/sec      KB/sec
          patch1-6: 1120643.70   43588.06
          patch1-7: 1193918.93   46438.15
    
      Configurations:
        no change
    
    Client benchmark results:
      kswapd profiles:
        patch1-6
          35.99%  lzo1x_1_do_compress (real work)
          19.40%  page_vma_mapped_walk
           6.31%  _raw_spin_unlock_irq
           3.95%  do_raw_spin_lock
           2.39%  anon_vma_interval_tree_iter_first
           2.25%  ptep_clear_flush
           1.92%  __anon_vma_interval_tree_subtree_search
           1.70%  folio_referenced_one
           1.68%  __zram_bvec_write
           1.43%  anon_vma_interval_tree_iter_next
    
        patch1-7
          45.90%  lzo1x_1_do_compress (real work)
           9.14%  page_vma_mapped_walk
           6.81%  _raw_spin_unlock_irq
           2.80%  ptep_clear_flush
           2.34%  __zram_bvec_write
           2.29%  do_raw_spin_lock
           1.84%  lru_gen_look_around
           1.78%  memmove
           1.74%  obj_malloc
           1.50%  free_unref_page_list
    
      Configurations:
        no change
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

commit 7169860055b65fa4812e4fa6b4ecc774a4d67f41
Author: Yu Zhao <yuzhao@google.com>
Date:   Tue May 17 19:46:25 2022 -0600

    mm: multi-gen LRU: minimal implementation
    
    To avoid confusion, the terms "promotion" and "demotion" will be
    applied to the multi-gen LRU, as a new convention; the terms
    "activation" and "deactivation" will be applied to the active/inactive
    LRU, as usual.
    
    The aging produces young generations. Given an lruvec, it increments
    max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
    promotes hot pages to the youngest generation when it finds them
    accessed through page tables; the demotion of cold pages happens
    consequently when it increments max_seq. The aging has the complexity
    O(nr_hot_pages), since it is only interested in hot pages. Promotion
    in the aging path does not involve any LRU list operations, only the
    updates of the gen counter and lrugen->nr_pages[]; demotion, unless as
    the result of the increment of max_seq, requires LRU list operations,
    e.g., lru_deactivate_fn().
    
    The eviction consumes old generations. Given an lruvec, it increments
    min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
    feedback loop modeled after the PID controller monitors refaults over
    anon and file types and decides which type to evict when both types
    are available from the same generation.
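
    As a rough illustration of how the aging and the eviction move the
    two ends of the generation window, here is a toy Python model. It is
    a sketch, not the kernel code: ToyLruvec and its methods are
    hypothetical names, it tracks a single page type, the values of
    MIN_NR_GENS and MAX_NR_GENS are assumed, and it reads "approaches"
    as "has shrunk to".

```python
from collections import deque

# Assumed values for illustration; this commit message does not state them.
MIN_NR_GENS = 2
MAX_NR_GENS = 4


class ToyLruvec:
    """Toy single-type lruvec: generation seq lives in lists[seq % MAX_NR_GENS]."""

    def __init__(self):
        self.min_seq = 0
        self.max_seq = MIN_NR_GENS - 1
        self.lists = [deque() for _ in range(MAX_NR_GENS)]

    def nr_gens(self):
        return self.max_seq - self.min_seq + 1

    def add_page(self, page):
        # New pages join the youngest generation.
        self.lists[self.max_seq % MAX_NR_GENS].append(page)

    def age(self):
        # The aging produces a young generation once the window has shrunk
        # to MIN_NR_GENS. Existing pages get older without being moved.
        if self.nr_gens() <= MIN_NR_GENS:
            self.max_seq += 1
            return True
        return False

    def evict(self):
        # The eviction advances min_seq past empty oldest lists (keeping at
        # least MIN_NR_GENS generations), then takes one page from the oldest.
        while (self.nr_gens() > MIN_NR_GENS
               and not self.lists[self.min_seq % MAX_NR_GENS]):
            self.min_seq += 1
        oldest = self.lists[self.min_seq % MAX_NR_GENS]
        return oldest.popleft() if oldest else None
```

    Note that age() touches no list at all: the pages become older only
    because max_seq moved, which is the point made above about demotion
    through the increment of max_seq.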
    
    Each generation is divided into multiple tiers. Tiers represent
    different ranges of numbers of accesses through file descriptors. A
    page accessed N times through file descriptors is in tier
    order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
    bits in folio->flags. In contrast to moving across generations, which
    requires the LRU lock, moving across tiers only involves operations on
    folio->flags. The feedback loop also monitors refaults over all tiers
    and decides when to protect pages in which tiers (N>1), using the
    first tier (N=0,1) as a baseline. The first tier contains single-use
    unmapped clean pages, which are most likely the best choices. The
    eviction moves a page to the next generation, i.e., min_seq+1, if the
    feedback loop decides so. This approach has the following advantages:
    1. It removes the cost of activation in the buffered access path by
       inferring whether pages accessed multiple times through file
       descriptors are statistically hot and thus worth protecting in the
       eviction path.
    2. It takes pages accessed through page tables into account and avoids
       overprotecting pages accessed multiple times through file
       descriptors. (Pages accessed through page tables are in the first
       tier, since N=0.)
    3. More tiers provide better protection for pages accessed more than
       twice through file descriptors, when under heavy buffered I/O
       workloads.
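
    The mapping from the number of accesses through file descriptors to
    a tier can be sketched in a few lines of Python. order_base_2() here
    is a re-implementation written for this example (rounded-up log2,
    with 0 and 1 both mapping to 0), tier_of() is a hypothetical helper
    name, and the sketch ignores any upper bound on the number of tiers.

```python
def order_base_2(n: int) -> int:
    """Rounded-up log2 of n; 0 for n = 0 or n = 1."""
    return (n - 1).bit_length() if n > 1 else 0


def tier_of(nr_fd_accesses: int) -> int:
    """Tier of a page accessed nr_fd_accesses times through file descriptors."""
    return order_base_2(nr_fd_accesses)


if __name__ == "__main__":
    for n in range(10):
        print(f"N={n} -> tier {tier_of(n)}")
```

    With this mapping, N=0 and N=1 share the first tier (the baseline),
    N=2 is alone in the second, and each further tier covers twice as
    many access counts as the one before it.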
    
    Server benchmark results:
      Single workload:
        fio (buffered I/O): +[40, 42]%
                    IOPS         BW
          5.18-rc1: 2463k        9621MiB/s
          patch1-6: 3484k        13.3GiB/s
    
      Single workload:
        memcached (anon): +[44, 46]%
                    Ops/sec      KB/sec
          5.18-rc1: 771403.27    30004.17
          patch1-6: 1120643.70   43588.06
    
      Configurations:
        CPU: two Xeon 6154
        Mem: total 256G
    
        Node 1 was only used as a ram disk to reduce the variance in the
        results.
    
        patch drivers/block/brd.c <<EOF
        99,100c99,100
        <   gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
        <   page = alloc_page(gfp_flags);
        ---
        >   gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
        >   page = alloc_pages_node(1, gfp_flags, 0);
        EOF
    
        cat >>/etc/systemd/system.conf <<EOF
        CPUAffinity=numa
        NUMAPolicy=bind
        NUMAMask=0
        EOF
    
        cat >>/etc/memcached.conf <<EOF
        -m 184320
        -s /var/run/memcached/memcached.sock
        -a 0766
        -t 36
        -B binary
        EOF
    
        cat fio.sh
        modprobe brd rd_nr=1 rd_size=113246208
        swapoff -a
        mkfs.ext4 /dev/ram0
        mount -t ext4 /dev/ram0 /mnt
    
        mkdir /sys/fs/cgroup/user.slice/test
        echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
        echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
        fio --name=mglru --numjobs=72 --directory=/mnt --size=1408m \
          --buffered=1 --ioengine=io_uring --iodepth=128 \
          --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
          --rw=randread --random_distribution=random --norandommap \
          --time_based --ramp_time=10m --runtime=5m --group_reporting
    
        cat memcached.sh
        modprobe brd rd_nr=1 rd_size=113246208
        swapoff -a
        mkswap /dev/ram0
        swapon /dev/ram0
    
        memtier_benchmark -S /var/run/memcached/memcached.sock \
          -P memcache_binary -n allkeys --key-minimum=1 \
          --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
          --ratio 1:0 --pipeline 8 -d 2000
    
        memtier_benchmark -S /var/run/memcached/memcached.sock \
          -P memcache_binary -n allkeys --key-minimum=1 \
          --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
          --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
    
    Client benchmark results:
      kswapd profiles:
        5.18-rc1
          40.53%  page_vma_mapped_walk
          20.37%  lzo1x_1_do_compress (real work)
           6.99%  do_raw_spin_lock
           3.93%  _raw_spin_unlock_irq
           2.08%  vma_interval_tree_subtree_search
           2.06%  vma_interval_tree_iter_next
           1.95%  folio_referenced_one
           1.93%  anon_vma_interval_tree_iter_first
           1.51%  ptep_clear_flush
           1.35%  __anon_vma_interval_tree_subtree_search
    
        patch1-6
          35.99%  lzo1x_1_do_compress (real work)
          19.40%  page_vma_mapped_walk
           6.31%  _raw_spin_unlock_irq
           3.95%  do_raw_spin_lock
           2.39%  anon_vma_interval_tree_iter_first
           2.25%  ptep_clear_flush
           1.92%  __anon_vma_interval_tree_subtree_search
           1.70%  folio_referenced_one
           1.68%  __zram_bvec_write
           1.43%  anon_vma_interval_tree_iter_next
    
      Configurations:
        CPU: single Snapdragon 7c
        Mem: total 4G
    
        Chrome OS MemoryPressure [1]
    
    [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

commit 82443b08a4037dde6a945935e578944bd4e8d509
Author: Yu Zhao <yuzhao@google.com>
Date:   Tue May 17 19:46:24 2022 -0600

    mm: multi-gen LRU: groundwork
    
    Evictable pages are divided into multiple generations for each lruvec.
    The youngest generation number is stored in lrugen->max_seq for both
    anon and file types as they are aged on an equal footing. The oldest
    generation numbers are stored in lrugen->min_seq[] separately for anon
    and file types as clean file pages can be evicted regardless of swap
    constraints. These three variables are monotonically increasing.
    
    Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
    in order to fit into the gen counter in folio->flags. Each truncated
    generation number is an index to lrugen->lists[]. The sliding window
    technique is used to track at least MIN_NR_GENS and at most
    MAX_NR_GENS generations. The gen counter stores a value within [1,
    MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
    stores 0.
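
    The truncation can be sketched in Python as follows. This is only an
    illustration: MAX_NR_GENS = 4 is an assumed value, the helper names
    are hypothetical, and storing "index + 1" in the gen counter is an
    assumption made here so that 0 can mean "not on any list", as the
    paragraph above requires.

```python
MAX_NR_GENS = 4  # assumed value, for illustration only


def order_base_2(n: int) -> int:
    """Rounded-up log2 of n; 0 for n = 0 or n = 1."""
    return (n - 1).bit_length() if n > 1 else 0


# Bits needed for a counter that holds 0..MAX_NR_GENS inclusive.
GEN_WIDTH = order_base_2(MAX_NR_GENS + 1)


def gen_from_seq(seq: int) -> int:
    """Truncate a monotonically increasing sequence number to a list index."""
    return seq % MAX_NR_GENS


def gen_counter(seq: int, on_list: bool) -> int:
    """Value kept in the flags: 1..MAX_NR_GENS while on a list, else 0."""
    return gen_from_seq(seq) + 1 if on_list else 0
```

    Under these assumptions GEN_WIDTH is 3 bits, and sequence numbers 5
    and 9 share list index 1, which is safe as long as the sliding
    window never spans more than MAX_NR_GENS generations.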
    
    There are two conceptually independent procedures: "the aging", which
    produces young generations, and "the eviction", which consumes old
    generations. They form a closed-loop system, i.e., "the page reclaim".
    Both procedures can be invoked from userspace for the purposes of
    working set estimation and proactive reclaim. These techniques are
    commonly used to optimize job scheduling (bin packing) in data
    centers [1][2].
    
    To avoid confusion, the terms "hot" and "cold" will be applied to the
    multi-gen LRU, as a new convention; the terms "active" and "inactive"
    will be applied to the active/inactive LRU, as usual.
    
    The protection of hot pages and the selection of cold pages are based
    on page access channels and patterns. There are two access channels:
    one through page tables and the other through file descriptors. The
    protection of the former channel is by design stronger because:
    1. The uncertainty in determining the access patterns of the former
       channel is higher due to the approximation of the accessed bit.
    2. The cost of evicting the former channel is higher due to the TLB
       flushes required and the likelihood of encountering the dirty bit.
    3. The penalty of underprotecting the former channel is higher because
       applications usually do not prepare themselves for major page
       faults like they do for blocked I/O. E.g., GUI applications
       commonly use dedicated I/O threads to avoid blocking the rendering
       threads.
    There are also two access patterns: one with temporal locality and the
    other without. For the reasons listed above, the former channel is
    assumed to follow the former pattern unless VM_SEQ_READ or
    VM_RAND_READ is present; the latter channel is assumed to follow the
    latter pattern unless outlying refaults have been observed [3][4].
    
    The next patch will address the "outlying refaults". Three macros,
    i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are
    added in this patch to make the entire patchset less diffy.
    
    A page is added to the youngest generation on faulting. The aging
    needs to check the accessed bit at least twice before handing this
    page over to the eviction. The first check takes care of the accessed
    bit set on the initial fault; the second check makes sure this page
    has not been used since then. This protocol, AKA second chance,
    requires a minimum of two generations, hence MIN_NR_GENS.
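
    A toy Python model of this second-chance protocol is given below. It
    is a simplification, not the kernel logic: ToyPage, aging_pass() and
    evictable() are hypothetical names, MIN_NR_GENS = 2 is assumed, and
    a page's position is reduced to "how many generations behind the
    youngest it is".

```python
MIN_NR_GENS = 2  # assumed value, for illustration only


class ToyPage:
    def __init__(self):
        # The initial fault sets the accessed bit and places the page in
        # the youngest generation.
        self.accessed = True
        self.gens_behind = 0


def aging_pass(page: ToyPage) -> None:
    """One check by the aging: test and clear the accessed bit."""
    if page.accessed:
        page.accessed = False
        page.gens_behind = 0   # still hot: stays with the youngest
    else:
        page.gens_behind += 1  # unused since the last check: gets older


def evictable(page: ToyPage) -> bool:
    """A page is handed to the eviction once it reaches the oldest generation."""
    return page.gens_behind >= MIN_NR_GENS - 1
```

    In this model a page that is never touched again survives the first
    pass (which only clears the bit set by the fault) and becomes
    evictable after the second; a page touched between the two passes
    is kept.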
    
    [1] https://dl.acm.org/doi/10.1145/3297858.3304053
    [2] https://dl.acm.org/doi/10.1145/3503222.3507731
    [3] https://lwn.net/Articles/495543/
    [4] https://lwn.net/Articles/815342/
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

commit 2af24d41b32e19e26a8f3ba45d3e336e7baa18cb
Author: Yu Zhao <yuzhao@google.com>
Date:   Tue May 17 19:46:23 2022 -0600

    Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller"
    
    This patch undoes the following refactor:
    commit 289ccba18af4 ("include/linux/mm_inline.h: fold __update_lru_size() into its sole caller")
    
    The upcoming changes to include/linux/mm_inline.h will reuse
    __update_lru_size().
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

commit b73df9f1ba9b06f8b919bcc28e435db18003efb9
Author: Yu Zhao <yuzhao@google.com>
Date:   Tue May 17 19:46:22 2022 -0600

    mm/vmscan.c: refactor shrink_node()
    
    This patch refactors shrink_node() to improve readability for the
    upcoming changes to mm/vmscan.c.
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Reviewed-by: Barry Song <baohua@kernel.org>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

commit e49f694ec47d06a32d7c15f0671814fce289987e
Author: Yu Zhao <yuzhao@google.com>
Date:   Tue May 17 19:46:21 2022 -0600

    mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
    
    Some architectures support the accessed bit in non-leaf PMD entries,
    e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it
    as part of linear address translation [1]. Page table walkers that
    clear the accessed bit may use this capability to reduce their search
    space.
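
    How a walker might use this to prune its search can be sketched with
    a toy two-level table in Python. The Pte and Pmd classes and
    clear_young() are hypothetical and model only the accessed bits; the
    sketch ignores TLB effects and everything else a real page table
    walker has to deal with.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pte:
    accessed: bool = False


@dataclass
class Pmd:
    accessed: bool = False  # accessed bit of the non-leaf PMD entry
    ptes: List[Pte] = field(default_factory=list)


def clear_young(pmds: List[Pmd]):
    """Clear accessed bits; return (young PTEs found, PTEs inspected)."""
    young = scanned = 0
    for pmd in pmds:
        if not pmd.accessed:
            # The entry was not used for translation since the last clear,
            # so none of the PTEs below it can be young: skip the table.
            continue
        pmd.accessed = False
        for pte in pmd.ptes:
            scanned += 1
            if pte.accessed:
                pte.accessed = False
                young += 1
    return young, scanned
```

    With one accessed PMD entry and one untouched PMD entry, each
    covering four PTEs, the walker inspects four PTEs instead of eight.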
    
    Note that:
    1. Although an inline function is preferable, this capability is added
       as a configuration option for consistency with the existing macros.
    2. Due to the little interest in other varieties, this capability was
       only tested on Intel and AMD CPUs.
    
    Thanks to the following developers for their efforts [2][3].
      Randy Dunlap <rdunlap@infradead.org>
      Stephen Rothwell <sfr@canb.auug.org.au>
    
    [1] Intel 64 and IA-32 Architectures Software Developer's Manual
        Volume 3 (June 2021), section 4.8
    [2] https://lore.kernel.org/r/bfdcc7c8-922f-61a9-aa15-7e7250f04af7@infradead.org/
    [3] https://lore.kernel.org/r/20220413151513.5a0d7a7e@canb.auug.org.au/
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Reviewed-by: Barry Song <baohua@kernel.org>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

commit a378cea5999d18745127fd655feb80ef0111d1e3
Author: Yu Zhao <yuzhao@google.com>
Date:   Tue May 17 19:46:20 2022 -0600

    mm: x86, arm64: add arch_has_hw_pte_young()
    
    Some architectures automatically set the accessed bit in PTEs, e.g.,
    x86 and arm64 v8.2. On architectures that do not have this capability,
    clearing the accessed bit in a PTE usually triggers a page fault
    following the TLB miss of this PTE (to emulate the accessed bit).
    
    Being aware of this capability can help make better decisions, e.g.,
    whether to spread the work out over a period of time to reduce bursty
    page faults when trying to clear the accessed bit in many PTEs.
    
    Note that theoretically this capability can be unreliable, e.g.,
    hotplugged CPUs might be different from builtin ones. Therefore it
    should not be used in architecture-independent code that involves
    correctness, e.g., to determine whether TLB flushes are required (in
    combination with the accessed bit).
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Reviewed-by: Barry Song <baohua@kernel.org>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Acked-by: Will Deacon <will@kernel.org>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

commit 9fcafe62e0472e46d2636cf24949dba3f6f5a2d5
Author: André Almeida <andrealmeid@igalia.com>
Date:   Mon Oct 25 09:49:42 2021 -0300

    futex: Add entry point for FUTEX_WAIT_MULTIPLE (opcode 31)
    
    Add an option to wait on multiple futexes using the old interface,
    which uses opcode 31 through the futex() syscall. Do that by just
    translating the old interface to use the new code. This allows old and
    stable versions of Proton to still use fsync in new kernel releases.
    
    Signed-off-by: André Almeida <andrealmeid@collabora.com>

commit 261b8a344d3f9e188e7da284c435c138eecd6652
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Fri Jun 18 19:10:55 2021 +0000

    XANMOD: Makefile: Turn off loop vectorization for GCC -O3 optimization level
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 81f8dfad768b108b1cbc9b0d806ea0888299fc52
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Thu Sep 3 20:36:13 2020 +0000

    XANMOD: init/Kconfig: Enable -O3 KBUILD_CFLAGS optimization for all architectures
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 599ee3f583b6ce8b1d8b3d8ade10e97a7153207d
Author: Alexandre Frade <admfrade@gmail.com>
Date:   Thu Jun 25 16:40:43 2020 -0300

    XANMOD: lib/kconfig.debug: disable default CONFIG_SYMBOLIC_ERRNAME and CONFIG_DEBUG_BUGVERBOSE
    
    Signed-off-by: Alexandre Frade <admfrade@gmail.com>

commit a155b102db89c3735b42a9719610dcce313fda55
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Sun May 29 00:57:40 2022 +0000

    XANMOD: scripts/setlocalversion: remove "+" tag for git repo short version
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit b66adf79cae4a7a640013d9ee1357780aca9418a
Author: Alexandre Frade <admfrade@gmail.com>
Date:   Tue Mar 31 13:32:08 2020 -0300

    XANMOD: cpufreq: tunes ondemand and conservative governor for performance
    
    Signed-off-by: Alexandre Frade <admfrade@gmail.com>

commit 0a58887cae8bd610cfe20f8e83244c24a4baeadd
Author: Alexandre Frade <admfrade@gmail.com>
Date:   Mon Jan 29 17:31:25 2018 +0000

    XANMOD: mm/vmscan: vm_swappiness = 30 decreases the amount of swapping
    
    Signed-off-by: Alexandre Frade <admfrade@gmail.com>

commit e3244bec01cece786b4ce54e2b58a06569bbd4af
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Fri May 27 19:39:37 2022 +0000

    XANMOD: sched/autogroup: Add kernel parameter and config option to enable/disable autogroup feature by default
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 4e329b61446ebf76416f0d0c75ad2cd6586b3907
Author: Alexandre Frade <admfrade@gmail.com>
Date:   Mon Jan 29 16:59:22 2018 +0000

    XANMOD: dcache: cache_pressure = 50 decreases the rate at which VFS caches are reclaimed
    
    Signed-off-by: Alexandre Frade <admfrade@gmail.com>

commit 26c74dc45cacfccbcd0caf7d41e9654347f677b4
Author: Alexandre Frade <admfrade@gmail.com>
Date:   Mon Jan 29 17:26:15 2018 +0000

    XANMOD: kconfig: add 500Hz timer interrupt kernel config option
    
    Signed-off-by: Alexandre Frade <admfrade@gmail.com>

commit b192d1955bf3d52866c41abea1bfed7d45b8a4d7
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Mon Dec 14 16:24:26 2020 +0000

    XANMOD: block: set rq_affinity to force full multithreading I/O requests
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit e7949eb9a8f5f7a9a21dba48edee67e83fe19ac7
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Wed May 11 18:56:51 2022 +0000

    XANMOD: block/mq-deadline: Increase write priority to improve responsiveness
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit ac2a7167ce708fd850e479fbe66cb08132050af0
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Thu Jan 6 16:59:01 2022 +0000

    XANMOD: block/mq-deadline: Disable front_merges by default
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit cd6156b3062d2d5b8c1dcc2e7122e31d14acf809
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Fri Mar 25 22:36:34 2022 +0000

    XANMOD: Change rcutree.kthread_prio to SCHED_RR policy
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit ca63da750b767c0b0b3d76b7071a0aae6800dfec
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Mon Mar 21 18:20:24 2022 +0000

    XANMOD: fair: Remove all energy efficiency functions
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 0047d57e6c91177bb731bed5ada6c211868bc27c
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Mon May 30 09:24:09 2022 +0200

    Linux 5.18.1
    
    Link: https://lore.kernel.org/r/20220527084801.223648383@linuxfoundation.org
    Tested-by: Ronald Warsow <rwarsow@gmx.de>
    Tested-by: Guenter Roeck <linux@roeck-us.net>
    Tested-by: Justin M. Forbes <jforbes@fedoraproject.org>
    Tested-by: Ron Economos <re@w6rz.net>
    Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
    Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
    Tested-by: Rudi Heitbaum <rudi@heitbaum.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 80213405895162c448b30d28623f1f95337f9b3e
Author: Edward Matijevic <motolav@gmail.com>
Date:   Fri May 20 23:45:15 2022 -0500

    ALSA: ctxfi: Add SB046x PCI ID
    
    commit 1b073ebb174d0c7109b438e0a5eb4495137803ec upstream.
    
    Adds the PCI ID for X-Fi cards sold under the Platinum and XtremeMusic names
    
    Before: snd_ctxfi 0000:05:05.0: chip 20K1 model Unknown (1102:0021) is found
    After: snd_ctxfi 0000:05:05.0: chip 20K1 model SB046x (1102:0021) is found
    
    [ This is only about defining the model name string, and the rest is
      handled just like before, as a default unknown device.
      Edward confirmed that the stuff has been working fine -- tiwai ]
    
    Signed-off-by: Edward Matijevic <motolav@gmail.com>
    Cc: <stable@vger.kernel.org>
    Link: https://lore.kernel.org/r/cae7d1a4-8bd9-7dfe-7427-db7e766f7272@gmail.com
    Signed-off-by: Takashi Iwai <tiwai@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 453f8156652106e335176db8bfff05333e248498
Author: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Date:   Thu Apr 7 11:51:20 2022 +0100

    ACPI: sysfs: Fix BERT error region memory mapping
    
    commit 1bbc21785b7336619fb6a67f1fff5afdaf229acc upstream.
    
    Currently the sysfs interface maps the BERT error region as "memory"
    (through acpi_os_map_memory()) in order to copy the error records into
    memory buffers through memory operations (e.g., memory_read_from_buffer()).
    
    The OS cannot detect whether the BERT error region is part of system
    RAM or is "device memory" (e.g., BMC memory), and therefore it cannot
    detect which memory attributes the bus to that memory supports (and
    hence the corresponding kernel mapping), unless firmware provides the
    required information.
    
    The acpi_os_map_memory() arch backend implementation determines the
    mapping attributes. On arm64, if the BERT error region is not present in
    the EFI memory map, the error region is mapped as device-nGnRnE; this
    triggers alignment faults since memcpy unaligned accesses are not
    allowed in device-nGnRnE regions.
    
    The ACPI sysfs code cannot therefore map by default the BERT error
    region with memory semantics but should use a safer default.
    
    Change the sysfs code to map the BERT error region as MMIO (through
    acpi_os_map_iomem()) and use the memcpy_fromio() interface to read the
    error region into the kernel buffer.
    
    Link: https://lore.kernel.org/linux-arm-kernel/31ffe8fc-f5ee-2858-26c5-0fd8bdd68702@arm.com
    Link: https://lore.kernel.org/linux-acpi/CAJZ5v0g+OVbhuUUDrLUCfX_mVqY_e8ubgLTU98=jfjTeb4t+Pw@mail.gmail.com
    Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
    Tested-by: Veronika Kabatova <vkabatov@redhat.com>
    Tested-by: Aristeu Rozanski <aris@redhat.com>
    Acked-by: Ard Biesheuvel <ardb@kernel.org>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Cc: dann frazier <dann.frazier@canonical.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 068108f53811d99b5a470eafe5b0ddd523ed2c79
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sun May 22 22:25:41 2022 +0200

    random: check for signals after page of pool writes
    
    commit 1ce6c8d68f8ac587f54d0a271ac594d3d51f3efb upstream.
    
    get_random_bytes_user() checks for signals after producing a PAGE_SIZE
    worth of output, just like /dev/zero does. write_pool() is doing
    basically the same work (actually, slightly more expensive), and so
    should stop to check for signals in the same way. Let's also name it
    write_pool_user() to match get_random_bytes_user(), so this won't be
    misused in the future.
    
    Before this patch, massive writes to /dev/urandom would tie up the
    process for an extremely long time and make it unterminatable. After, it
    can be successfully interrupted. The following test program can be used
    to see this works as intended:
    
      #include <unistd.h>
      #include <fcntl.h>
      #include <signal.h>
      #include <stdio.h>
    
      static unsigned char x[~0U];
    
      static void handle(int sig) { (void)sig; }
    
      int main(int argc, char *argv[])
      {
        pid_t pid = getpid(), child;
        int fd;
        signal(SIGUSR1, handle);
        if (!(child = fork())) {
          for (;;)
            kill(pid, SIGUSR1);
        }
        fd = open("/dev/urandom", O_WRONLY);
        pause();
        printf("interrupted after writing %zd bytes\n", write(fd, x, sizeof(x)));
        close(fd);
        kill(child, SIGTERM);
        return 0;
      }
    
    Result before: "interrupted after writing 2147479552 bytes"
    Result after: "interrupted after writing 4096 bytes"
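
    The control flow being described, checking for a pending signal each
    time a full page has been consumed, can be modeled in Python as
    below. This is a simplified sketch rather than the kernel function:
    mix and signal_pending are hypothetical callables standing in for
    the pool mixing and the signal check, and PAGE_SIZE is assumed to be
    4096.

```python
PAGE_SIZE = 4096  # assumed page size


def write_pool_user(data: bytes, mix, signal_pending) -> int:
    """Feed data to mix() page by page; stop early if a signal is pending.

    Returns the number of bytes consumed, like a short write.
    """
    written = 0
    view = memoryview(data)
    while written < len(data):
        chunk = view[written:written + PAGE_SIZE]
        mix(bytes(chunk))
        written += len(chunk)
        # Check only after a page's worth of work, so progress is always
        # made before the caller gets a chance to handle the signal.
        if signal_pending():
            break
    return written
```

    With a signal already pending, a large write returns after exactly
    one page, which matches the "4096 bytes" result reported above.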
    
    Cc: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 08467db994c0cf6c16e347a99e374a432e6c3509
Author: Jens Axboe <axboe@kernel.dk>
Date:   Thu May 19 17:31:37 2022 -0600

    random: wire up fops->splice_{read,write}_iter()
    
    commit 79025e727a846be6fd215ae9cdb654368ac3f9a6 upstream.
    
    Now that random/urandom is using {read,write}_iter, we can wire it up to
    using the generic splice handlers.
    
    Fixes: 36e2c7421f02 ("fs: don't allow splice read/write without explicit ops")
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    [Jason: added the splice_write path. Note that sendfile() and such still
     does not work for read, though it does for write, because of a file
     type restriction in splice_direct_to_actor(), which I'll address
     separately.]
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit fdb1b354e301dd2aa56df862b9d4894b79039c40
Author: Jens Axboe <axboe@kernel.dk>
Date:   Thu May 19 17:43:15 2022 -0600

    random: convert to using fops->write_iter()
    
    commit 22b0a222af4df8ee9bb8e07013ab44da9511b047 upstream.
    
    Now that the read side has been converted to fix a regression with
    splice, convert the write side as well to have some symmetry in the
    interface used (and help deprecate ->write()).
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    [Jason: cleaned up random_ioctl a bit, require full writes in
     RNDADDENTROPY since it's crediting entropy, simplify control flow of
     write_pool(), and incorporate suggestions from Al.]
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 09ea8bc7276aead310e5faa21ae29ceeecea87c9
Author: Jens Axboe <axboe@kernel.dk>
Date:   Thu May 19 17:31:36 2022 -0600

    random: convert to using fops->read_iter()
    
    commit 1b388e7765f2eaa137cf5d92b47ef5925ad83ced upstream.
    
    This is a prerequisite to wiring up splice() again for the random
    and urandom drivers. It also allows us to remove the INT_MAX check in
    getrandom(), because import_single_range() applies capping internally.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    [Jason: rewrote get_random_bytes_user() to simplify and also incorporate
     additional suggestions from Al.]
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 806afdc6d97fe17a9b96a8d6964f7d2da610ee58
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sun May 15 00:22:05 2022 +0200

    random: unify batched entropy implementations
    
    commit 3092adcef3ffd2ef59634998297ca8358461ebce upstream.
    
    There are currently two separate batched entropy implementations, for
    u32 and u64, with nearly identical code, with the goal of avoiding
    unaligned memory accesses and letting the buffers be used more
    efficiently. Having to maintain these two functions independently is a
    bit of a hassle though, considering that they always need to be kept in
    sync.
    
    This commit factors them out into a type-generic macro, so that the
    expansion produces the same code as before, such that diffing the
    assembly shows no differences. This will also make it easier in the
    future to add u16 and u8 batches.
    
    This was initially tested using an always_inline function and letting
    gcc constant fold the type size in, but the code gen was less efficient,
    and in general it was more verbose and harder to follow. So this patch
    goes with the boring macro solution, similar to what's already done for
    the _wait functions in random.h.
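
    The idea of stamping out one getter per integer width from a single
    definition can be sketched in Python, using a factory function in
    place of a C macro. Everything here is illustrative:
    define_batched_entropy() and its parameters are hypothetical,
    os.urandom stands in for the real entropy source, and the sketch
    leaves out locking and batch invalidation entirely.

```python
import os


def define_batched_entropy(width_bytes, batch_bytes=64, source=os.urandom):
    """Return a getter producing width_bytes-wide unsigned integers.

    Values are served from a batch that is refilled from source() when
    exhausted. batch_bytes must be a multiple of width_bytes.
    """
    if batch_bytes % width_bytes:
        raise ValueError("batch_bytes must be a multiple of width_bytes")
    batch = b""
    position = 0

    def get_random():
        nonlocal batch, position
        if position >= len(batch):
            batch = source(batch_bytes)  # refill the whole batch at once
            position = 0
        value = int.from_bytes(batch[position:position + width_bytes], "little")
        position += width_bytes
        return value

    return get_random


# One definition, two instantiations, each with its own batch.
get_random_u32 = define_batched_entropy(4)
get_random_u64 = define_batched_entropy(8)
```

    Adding a 16-bit or 8-bit variant under this scheme is one more call
    to the factory, which is the maintenance benefit argued for above.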
    
    Cc: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 463ebd6f8707e57847680541a05c910947a2518a
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sat May 14 13:59:30 2022 +0200

    random: move randomize_page() into mm where it belongs
    
    commit 5ad7dd882e45d7fe432c32e896e2aaa0b21746ea upstream.
    
    randomize_page is an mm function. It is documented like one. It contains
    the history of one. It has the naming convention of one. It looks
    just like another very similar function in mm, randomize_stack_top().
    And it has always been maintained and updated by mm people. There is no
    need for it to be in random.c. In the "which shape does not look like
    the other ones" test, pointing to randomize_page() is correct.
    
    So move randomize_page() into mm/util.c, right next to the similar
    randomize_stack_top() function.
    
    This commit contains no actual code changes.
    
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 4f276d40e6032bcc19b341185e335dee65a431aa
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Fri May 13 16:17:12 2022 +0200

    random: move initialization functions out of hot pages
    
    commit 560181c27b582557d633ecb608110075433383af upstream.
    
    Much of random.c is devoted to initializing the rng and accounting for
    when a sufficient amount of entropy has been added. In a perfect world,
    this would all happen during init, and so we could mark these functions
    as __init. But in reality, this isn't the case: sometimes the rng only
    finishes initializing some seconds after system init is finished.
    
    For this reason, at the moment, a whole host of functions that are only
    used relatively close to system init and then never again are intermixed
    with functions that are used in hot code all the time. This creates more
    cache misses than necessary.
    
    In order to pack the hot code closer together, this commit moves the
    initialization functions that can't be marked as __init into
    .text.unlikely by way of the __cold attribute.
    
    Of particular note is moving credit_init_bits() into a macro wrapper
    that inlines the crng_ready() static branch check. This avoids a
    function call to a nop+ret, and most notably prevents extra entropy
    arithmetic from being computed in mix_interrupt_randomness().
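    
    A simplified userspace sketch of the two ideas, under stated
    assumptions: credit_bits(), credit_bits_slow(), and the plain bool
    "ready" are hypothetical stand-ins (the real check is a static
    branch), and __cold is defined here by hand via the GCC/Clang
    attribute.
    
    ```c
    #include <assert.h>
    #include <stdbool.h>
    
    /* Userspace stand-in for the kernel's __cold annotation (GCC/Clang). */
    #ifndef __cold
    #define __cold __attribute__((__cold__))
    #endif
    
    static bool ready;             /* stand-in for crng_ready() */
    static unsigned int init_bits; /* stand-in for the init bit counter */
    
    /* Rarely executed, so the compiler may place it in .text.unlikely. */
    static __cold void credit_bits_slow(unsigned int bits)
    {
        assert(bits > 0);
        init_bits += bits;
        if (init_bits >= 256)
            ready = true;
    }
    
    /* Wrapper: once ready, neither the call nor its argument is evaluated. */
    #define credit_bits(bits) do {      \
        if (!ready)                     \
            credit_bits_slow(bits);     \
    } while (0)
    ```
    
    Because the readiness check sits in the macro, any arithmetic written
    in the argument position is skipped entirely after initialization.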
    
    Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 6fd3ff02dadcf6eacf6efd2e0840e75506d15905
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Fri May 13 13:18:46 2022 +0200

    random: make consistent use of buf and len
    
    commit a19402634c435a4eae226df53c141cdbb9922e7b upstream.
    
    The current code was a mix of "nbytes", "count", "size", "buffer", "in",
    and so forth. Instead, let's clean this up by naming input parameters
    "buf" (or "ubuf") and "len", so that you always understand that you're
    reading this variety of function argument.
    
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit da71481fefa21c7511244b66d000590704bd21ba
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Fri May 13 12:32:23 2022 +0200

    random: use proper return types on get_random_{int,long}_wait()
    
    commit 7c3a8a1db5e03d02cc0abb3357a84b8b326dfac3 upstream.
    
    Before, these were returning signed values, but the API is intended to be
    used with unsigned values.
    
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 4df97cbda270c1254fd866a9ec80bbd4bab868a8
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Fri May 13 12:29:38 2022 +0200

    random: remove extern from functions in header
    
    commit 7782cfeca7d420e8bb707613d4cfb0f7ff29bb3a upstream.
    
    According to the kernel style guide, having `extern` on functions in
    headers is old school and deprecated, and doesn't add anything. So remove
    them from random.h, and tidy up the file a little bit too.
    
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 05daffcd83dae217daedfa17577bbf8dc91f48fd
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Tue May 3 15:30:45 2022 +0200

    random: use static branch for crng_ready()
    
    commit f5bda35fba615ace70a656d4700423fa6c9bebee upstream.
    
    Since crng_ready() is only false briefly during initialization and then
    forever after becomes true, we don't need to evaluate it after, making
    it a prime candidate for a static branch.
    
    One complication, however, is that it changes state in a particular call
    to credit_init_bits(), which might be made from atomic context, which
    means we must kick off a workqueue to change the static key. Further
    complicating things, credit_init_bits() may be called sufficiently early
    on in system initialization such that system_wq is NULL.
    
    Fortunately, there exists the nice function execute_in_process_context(),
    which will immediately execute the function if !in_interrupt(), and
    otherwise defer it to a workqueue. During early init, before workqueues
    are available, in_interrupt() is always false, because interrupts
    haven't even been enabled yet, which means the function in that case
    executes immediately. Later on, after workqueues are available,
    in_interrupt() might be true, but in that case, the work is queued in
    system_wq and all goes well.
    
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: Sultan Alsawaf <sultan@kerneltoast.com>
    Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 91657c81fc3d0401189060acf54cddc201464645
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Thu May 12 15:32:26 2022 +0200

    random: credit architectural init the exact amount
    
    commit 12e45a2a6308105469968951e6d563e8f4fea187 upstream.
    
    RDRAND and RDSEED can fail sometimes, which is fine. We currently
    initialize the RNG with 512 bits of RDRAND/RDSEED. We only need 256 bits
    of those to succeed in order to initialize the RNG. Instead of the
    current "all or nothing" approach, actually credit these contributions
    the amount that is actually contributed.
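    
    The crediting rule can be sketched in userspace as follows. This is
    an illustration only: hw_random_u64() is a hypothetical stand-in for
    RDSEED/RDRAND that fails on every third call, and seed_from_hw() is
    not a function from the patch.
    
    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    
    /* Hypothetical stand-in for RDSEED/RDRAND: fails on every third call. */
    static unsigned int hw_calls;
    
    static bool hw_random_u64(uint64_t *out)
    {
        if (++hw_calls % 3 == 0)
            return false;
        *out = 0x9e3779b97f4a7c15ULL * hw_calls;
        return true;
    }
    
    /* Fill 512 bits of seed words; return how many bits the hardware supplied. */
    static size_t seed_from_hw(uint64_t words[8], uint64_t fallback)
    {
        size_t credited_bits = 0;
    
        assert(words != NULL);
        for (size_t i = 0; i < 8; i++) {
            if (hw_random_u64(&words[i]))
                credited_bits += 64;   /* credit only what succeeded */
            else
                words[i] = fallback;   /* still mixed in, but uncredited */
        }
        return credited_bits;
    }
    ```
    
    With two of the eight reads failing, 384 bits are credited, which is
    still above the 256 bits needed, whereas an all-or-nothing rule would
    credit zero.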
    
    Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit d9394ac04eeb2d60518566e78835e952ed395166
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Thu May 5 02:20:22 2022 +0200

    random: handle latent entropy and command line from random_init()
    
    commit 2f14062bb14b0fcfcc21e6dc7d5b5c0d25966164 upstream.
    
    Currently, start_kernel() adds latent entropy and the command line to
    the entropy pool *after* the RNG has been initialized, deferring when
    it's actually used by things like stack canaries until the next time
    the pool is seeded. This surely is not intended.
    
    Rather than splitting up which entropy gets added where and when between
    start_kernel() and random_init(), just do everything in random_init(),
    which should eliminate these kinds of bugs in the future.
    
    While we're at it, rename the awkwardly titled "rand_initialize()" to
    the more standard "random_init()" nomenclature.
    
    Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 68f784cbb1757f82061a74c3d22b6781348ba04d
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Tue May 10 15:20:42 2022 +0200

    random: use proper jiffies comparison macro
    
    commit 8a5b8a4a4ceb353b4dd5bafd09e2b15751bcdb51 upstream.
    
    This expands to exactly the same code that it replaces, but makes things
    consistent by using the same macro for jiffy comparisons throughout.
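    
    For readers unfamiliar with these macros, a userspace sketch of the
    wraparound-safe comparison follows. The two macros are modeled on the
    kernel's time_after() and time_is_before_jiffies(); the jiffies
    variable and reseed_due() are stand-ins invented for this example.
    
    ```c
    #include <assert.h>
    #include <limits.h>
    #include <stdbool.h>
    
    /* Userspace stand-in for the kernel tick counter; it wraps around. */
    static unsigned long jiffies;
    
    /* True if a is after b, even across a counter wraparound. */
    #define time_after(a, b)          ((long)((b) - (a)) < 0)
    /* True if a already lies in the past relative to jiffies. */
    #define time_is_before_jiffies(a) time_after(jiffies, a)
    
    static bool reseed_due(unsigned long birth, unsigned long interval)
    {
        assert(interval <= LONG_MAX);
        return time_is_before_jiffies(birth + interval);
    }
    ```
    
    The signed difference is what keeps the comparison correct when
    birth + interval overflows past ULONG_MAX.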
    
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 443a7b15c858ec799ae7ba5146316c4d30c778d6
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Mon May 9 16:13:18 2022 +0200

    random: remove ratelimiting for in-kernel unseeded randomness
    
    commit cc1e127bfa95b5fb2f9307e7168bf8b2b45b4c5e upstream.
    
    The CONFIG_WARN_ALL_UNSEEDED_RANDOM debug option controls whether the
    kernel warns about all unseeded randomness or just the first instance.
    There's some complicated rate limiting and comparison to the previous
    caller, such that even with CONFIG_WARN_ALL_UNSEEDED_RANDOM enabled,
    developers still don't see all the messages or even an accurate count of
    how many were missed. This is the result of basically parallel
    mechanisms aimed at accomplishing more or less the same thing, added at
    different points in random.c history, which sort of compete with the
    first-instance-only limiting we have now.
    
    It turns out, however, that nobody cares about the first unseeded
    randomness instance of in-kernel users. The same first user has been
    there for ages now, and nobody is doing anything about it. It isn't even
    clear that anybody _can_ do anything about it. Most places that can do
    something about it have switched over to using get_random_bytes_wait()
    or wait_for_random_bytes(), which is the right thing to do, but there is
    still much code that needs randomness sometimes during init, and as a
    general rule, if you're not using one of the _wait functions or the
    readiness notifier callback, you're bound to be doing it wrong just
    based on that fact alone.
    
    So warning about this same first user that can't easily change is simply
    not an effective mechanism for anything at all. Users can't do anything
    about it, as the Kconfig text points out -- the problem isn't in
    userspace code -- and kernel developers don't or more often can't react
    to it.
    
    Instead, show the warning for all instances when CONFIG_WARN_ALL_UNSEEDED_RANDOM
    is set, so that developers can debug things as need be, or if it isn't set,
    don't show a warning at all.
    
    At the same time, CONFIG_WARN_ALL_UNSEEDED_RANDOM now implies setting
    random.ratelimit_disable=1 on by default, since if you care about one
    you probably care about the other too. And we can clean up usage around
    the related urandom_warning ratelimiter as well (whose behavior isn't
    changing), so that it properly counts missed messages after the 10
    message threshold is reached.
    
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 4d77c5dc80264c0b9d9d4799d2afa8024e23a64f
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Mon May 9 13:53:24 2022 +0200

    random: move initialization out of reseeding hot path
    
    commit 68c9c8b192c6dae9be6278e98ee44029d5da2d31 upstream.
    
    Initialization happens once -- by way of credit_init_bits() -- and then
    it never happens again. Therefore, it doesn't need to be in
    crng_reseed(), which is a hot path that is called multiple times. It
    also doesn't make sense to have there, as initialization activity is
    better associated with initialization routines.
    
    After the prior commit, crng_reseed() now won't be called by multiple
    concurrent callers, which means that we can safely move the
    "finalize_init" logic into credit_init_bits() unconditionally.
    
    Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit e84123f90491618c6d0095495502e566bc5b41c4
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Mon May 9 13:40:55 2022 +0200

    random: avoid initializing twice in credit race
    
    commit fed7ef061686cc813b1f3d8d0edc6c35b4d3537b upstream.
    
    Since all changes of crng_init now go through credit_init_bits(), we can
    fix a long standing race in which two concurrent callers of
    credit_init_bits() have the new bit count >= some threshold, but are
    doing so with crng_init as a lower threshold, checked outside of a lock,
    resulting in crng_reseed() or similar being called twice.
    
    In order to fix this, we can use the original cmpxchg value of the bit
    count, and only change crng_init when the bit count transitions from
    below a threshold to meeting the threshold.
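    
    The transition check can be sketched with C11 atomics in userspace.
    This is a simplified illustration: INIT_BITS and
    credit_and_check_transition() are hypothetical, and the kernel code
    uses its own cmpxchg primitives rather than <stdatomic.h>.
    
    ```c
    #include <assert.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    
    #define INIT_BITS 256u /* hypothetical threshold for this sketch */
    
    static atomic_uint init_bits;
    
    /*
     * Add bits (saturating at INIT_BITS) and report whether this particular
     * call moved the counter from below the threshold to the threshold.
     */
    static bool credit_and_check_transition(unsigned int add)
    {
        unsigned int orig = atomic_load(&init_bits);
        unsigned int new;
    
        assert(add <= INIT_BITS);
        do {
            new = orig + add;
            if (new > INIT_BITS)
                new = INIT_BITS;
        } while (!atomic_compare_exchange_weak(&init_bits, &orig, new));
    
        /* orig is the value that the successful exchange replaced. */
        return orig < INIT_BITS && new >= INIT_BITS;
    }
    ```
    
    Only one caller can observe an orig below the threshold together
    with a new value at it, so the follow-up initialization work runs
    once even when several callers race.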
    
    Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit a199b1cf0d53b2da8994b084cd25a84f26cb462a
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sun May 8 13:20:30 2022 +0200

    random: use symbolic constants for crng_init states
    
    commit e3d2c5e79a999aa4e7d6f0127e16d3da5a4ff70d upstream.
    
    crng_init represents a state machine, with three states, and various
    rules for transitions. For the longest time, we've been managing these
    with "0", "1", and "2", and expecting people to figure it out. To make
    the code more obvious, replace these with proper enum values
    representing the transition, and then redocument what each of these
    states mean.
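    
    A sketch of what such an enum might look like. The state names and
    comments are illustrative and may not match the patch exactly;
    state_for_bits() is hypothetical, and its thresholds are an
    assumption based on the 256-bit and 128-bit figures quoted elsewhere
    in this log.
    
    ```c
    #include <assert.h>
    
    /* Illustrative state names; the values keep the old 0/1/2 encoding. */
    enum crng_state {
        CRNG_EMPTY = 0, /* little to no entropy collected */
        CRNG_EARLY = 1, /* partially seeded: early-boot quality only */
        CRNG_READY = 2  /* fully initialized */
    };
    
    #define POOL_INIT_BITS  256u
    #define POOL_EARLY_BITS (POOL_INIT_BITS / 2)
    
    /* Map a credited bit count to the state it implies. */
    static enum crng_state state_for_bits(unsigned int bits)
    {
        assert(bits <= POOL_INIT_BITS);
        if (bits >= POOL_INIT_BITS)
            return CRNG_READY;
        if (bits >= POOL_EARLY_BITS)
            return CRNG_EARLY;
        return CRNG_EMPTY;
    }
    ```
    
    Comparisons such as "state < CRNG_READY" then read as intent rather
    than as a bare "< 2".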
    
    Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
    Cc: Joe Perches <joe@perches.com>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 65419736ad67ae4d42647eecdde4f71770dbf50c
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sat May 7 14:03:46 2022 +0200

    siphash: use one source of truth for siphash permutations
    
    commit e73aaae2fa9024832e1f42e30c787c7baf61d014 upstream.
    
    The SipHash family of permutations is currently used in three places:
    
    - siphash.c itself, used in the ordinary way it was intended.
    - random32.c, in a construction from an anonymous contributor.
    - random.c, as part of its fast_mix function.
    
    Each one of these places reinvents the wheel with the same C code, same
    rotation constants, and same symmetry-breaking constants.
    
    This commit tidies things up a bit by placing macros for the
    permutations and constants into siphash.h, where each of the three .c
    users can access them. It also leaves a note dissuading more users of
    them from emerging.
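    
    For reference, one SipHash round written once as a macro might look
    like the userspace sketch below. The rotation constants are the
    standard SipHash ones; the macro spelling and the rol64() helper are
    written for this example and are not copied from siphash.h.
    
    ```c
    #include <assert.h>
    #include <stdint.h>
    
    /* Rotate left; callers below only pass 0 < r < 64. */
    static inline uint64_t rol64(uint64_t x, unsigned int r)
    {
        assert(r > 0 && r < 64);
        return (x << r) | (x >> (64 - r));
    }
    
    /* One SipHash round over four 64-bit lanes, defined in a single place. */
    #define SIPHASH_PERMUTATION(a, b, c, d) (                               \
        (a) += (b), (b) = rol64((b), 13), (b) ^= (a), (a) = rol64((a), 32), \
        (c) += (d), (d) = rol64((d), 16), (d) ^= (c),                       \
        (a) += (d), (d) = rol64((d), 21), (d) ^= (a),                       \
        (c) += (b), (b) = rol64((b), 17), (b) ^= (c), (c) = rol64((c), 32))
    ```
    
    An all-zero state maps to itself under this round, which is why the
    users pair it with symmetry-breaking constants.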
    
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 3cac6963de9f3014c9b980caede71f2344d47daa
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Fri May 6 23:19:43 2022 +0200

    random: help compiler out with fast_mix() by using simpler arguments
    
    commit 791332b3cbb080510954a4c152ce02af8832eac9 upstream.
    
    Now that fast_mix() has more than one caller, gcc no longer inlines it.
    That's fine. But it also doesn't handle the compound literal argument we
    pass it very efficiently, nor does it handle the loop as well as it
    could. So just expand the code to spell out this function so that it
    generates the same code as it did before. Performance-wise, this now
    behaves as it did before the last commit. The difference in actual code
    size on x86 is 45 bytes, which is less than a cache line.
    
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 1e34e244a3727c13c7f8858c42c3d82bc3d1d818
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Fri May 6 18:30:51 2022 +0200

    random: do not use input pool from hard IRQs
    
    commit e3e33fc2ea7fcefd0d761db9d6219f83b4248f5c upstream.
    
    Years ago, a separate fast pool was added for interrupts, so that the
    cost associated with taking the input pool spinlocks and mixing into it
    would be avoided in places where latency is critical. However, one
    oversight was that add_input_randomness() and add_disk_randomness()
    still sometimes are called directly from the interrupt handler, rather
    than being deferred to a thread. This means that some unlucky interrupts
    will be caught doing a blake2s_compress() call and potentially spinning
    on input_pool.lock, which can also be taken by unprivileged users by
    writing into /dev/urandom.
    
    In order to fix this, add_timer_randomness() now checks whether it is
    being called from a hard IRQ and if so, just mixes into the per-cpu IRQ
    fast pool using fast_mix(), which is much faster and can be done
    lock-free. A nice consequence of this, as well, is that it means hard
    IRQ context FPU support is likely no longer useful.
    
    The entropy estimation algorithm used by add_timer_randomness() is also
    somewhat different than the one used for add_interrupt_randomness(). The
    former looks at deltas of deltas of deltas, while the latter just waits
    for 64 interrupts for one bit or for one second since the last bit. In
    order to bridge these, and since add_interrupt_randomness() runs after
    an add_timer_randomness() that's called from hard IRQ, we add to the
    fast pool credit the related amount, and then subtract one to account
    for add_interrupt_randomness()'s contribution.
    
    A downside of this, however, is that the num argument is potentially
    attacker controlled, which puts a bit more pressure on the fast_mix()
    sponge to do more than it's really intended to do. As a mitigating
    factor, the first 96 bits of input aren't attacker controlled (a cycle
    counter followed by zeros), which means it's essentially two rounds of
    siphash rather than one, which is somewhat better. It's also not that
    much different from add_interrupt_randomness()'s use of the irq stack
    instruction pointer register.
    
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Filipe Manana <fdmanana@suse.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 157edd57d51f5945e8aa207dc64cc10560ffc582
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Fri May 6 18:27:38 2022 +0200

    random: order timer entropy functions below interrupt functions
    
    commit a4b5c26b79ffdfcfb816c198f2fc2b1e7b5b580f upstream.
    
    There are no code changes here; this is just a reordering of functions,
    so that in subsequent commits, the timer entropy functions can call into
    the interrupt ones.
    
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 0e4944c9c11ebb207355d6edab0ac5b5a958e9a6
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sat Apr 30 22:03:29 2022 +0200

    random: do not pretend to handle premature next security model
    
    commit e85c0fc1d94c52483a603651748d4c76d6aa1c6b upstream.
    
    Per the thread linked below, "premature next" is not considered to be a
    realistic threat model, and leads to more serious security problems.
    
    "Premature next" is the scenario in which:
    
    - Attacker compromises the current state of a fully initialized RNG via
      some kind of infoleak.
    - New bits of entropy are added directly to the key used to generate the
      /dev/urandom stream, without any buffering or pooling.
    - Attacker then, somehow having read access to /dev/urandom, samples RNG
      output and brute forces the individual new bits that were added.
    - Result: the RNG never "recovers" from the initial compromise, a
      so-called violation of what academics term "post-compromise security".
    
    The usual solutions to this involve some form of delaying when entropy
    gets mixed into the crng. With Fortuna, this involves multiple input
    buckets. With what the Linux RNG was trying to do prior, this involves
    entropy estimation.
    
    However, by delaying when entropy gets mixed in, it also means that RNG
    compromises are extremely dangerous during the window of time before
    the RNG has gathered enough entropy, during which time nonces may become
    predictable (or repeated), ephemeral keys may not be secret, and so
    forth. Moreover, it's unclear how realistic "premature next" is from an
    attack perspective, if these attacks even make sense in practice.
    
    Put together -- and discussed in more detail in the thread below --
    these constitute grounds for just doing away with the current code that
    pretends to handle premature next. I say "pretends" because it wasn't
    doing an especially great job at it either; should we change our mind
    about this direction, we would probably implement Fortuna to "fix" the
    "problem", in which case, removing the pretend solution still makes
    sense.
    
    This also reduces the crng reseed period from 5 minutes down to 1
    minute. The rationale from the thread might lead us toward reducing that
    even further in the future (or even eliminating it), but that remains a
    topic of a future commit.
    
    At a high level, this patch changes semantics from:
    
        Before: Seed for the first time after 256 "bits" of estimated
        entropy have been accumulated since the system booted. Thereafter,
        reseed once every five minutes, but only if 256 new "bits" have been
        accumulated since the last reseeding.
    
        After: Seed for the first time after 256 "bits" of estimated entropy
        have been accumulated since the system booted. Thereafter, reseed
        once every minute.
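    
    Expressed as predicates, the "thereafter" part of the two policies
    looks roughly like this. The function names are hypothetical and the
    sketch tracks time in whole seconds rather than jiffies.
    
    ```c
    #include <assert.h>
    #include <stdbool.h>
    
    #define SEED_BITS 256u
    
    /* Before: five-minute interval, and only with 256 fresh estimated bits. */
    static bool should_reseed_before(unsigned int secs_since_seed,
                                     unsigned int new_bits)
    {
        assert(new_bits <= SEED_BITS);
        return secs_since_seed >= 5 * 60 && new_bits >= SEED_BITS;
    }
    
    /* After: one-minute interval, regardless of the entropy estimate. */
    static bool should_reseed_after(unsigned int secs_since_seed)
    {
        return secs_since_seed >= 60;
    }
    ```
    
    Under the old rule a system whose estimate stayed below 256 new
    "bits" would never reseed; under the new rule it reseeds on schedule.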
    
    Most of this patch is renaming and removing: POOL_MIN_BITS becomes
    POOL_INIT_BITS, credit_entropy_bits() becomes credit_init_bits(),
    crng_reseed() loses its "force" parameter since it's now always true,
    the drain_entropy() function no longer has any use so it's removed,
    entropy estimation is skipped if we've already init'd, the various
    notifiers for "low on entropy" are now only active prior to init, and
    finally, some documentation comments are cleaned up here and there.
    
    Link: https://lore.kernel.org/lkml/YmlMGx6+uigkGiZ0@zx2c4.com/
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: Nadia Heninger <nadiah@cs.ucsd.edu>
    Cc: Tom Ristenpart <ristenpart@cornell.edu>
    Reviewed-by: Eric Biggers <ebiggers@google.com>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 7176e96d798807e536bf4e76b8767a5e94a301f6
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sat Apr 30 15:08:20 2022 +0200

    random: use first 128 bits of input as fast init
    
    commit 5c3b747ef54fa2a7318776777f6044540d99f721 upstream.
    
    Before, the first 64 bytes of input, regardless of how entropic it was,
    would be used to mutate the crng base key directly, and none of those
    bytes would be credited as having entropy. Then 256 bits of credited
    input would be accumulated, and only then would the rng transition from
    the earlier "fast init" phase into being actually initialized.
    
    The thinking was that by mixing and matching fast init and real init, an
    attacker who compromised the fast init state, considered easy to do
    given how little entropy might be in those first 64 bytes, would then be
    able to bruteforce bits from the actual initialization. By keeping these
    separate, bruteforcing became impossible.
    
    However, by not crediting potentially creditable bits from those first 64
    bytes of input, we delay initialization, and actually make the problem
    worse, because it means the user is drawing worse random numbers for a
    longer period of time.
    
    Instead, we can take the first 128 bits as fast init, and allow them to
    be credited, and then hold off on the next 128 bits until they've
    accumulated. This is still a wide enough margin to prevent bruteforcing
    the rng state, while still initializing much faster.
    
    Then, rather than trying to piecemeal inject into the base crng key at
    various points, instead just extract from the pool when we need it, for
    the crng_init==0 phase. Performance may even be better for the various
    inputs here, since there are likely more calls to mix_pool_bytes() than
    there are to get_random_bytes() during this phase of system execution.
    
    Since the preinit injection code is gone, bootloader randomness can then
    do something significantly more straightforward, removing the weird
    system_wq hack in hwgenerator randomness.
    
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit f0cb03957175452e8dfc18f3f7d8dd2698553bd4
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Tue May 3 14:14:32 2022 +0200

    random: do not use batches when !crng_ready()
    
    commit cbe89e5a375a51bbb952929b93fa973416fea74e upstream.
    
    It's too hard to keep the batches synchronized, and pointless anyway,
    since in !crng_ready(), we're updating the base_crng key really often,
    where batching only hurts. So instead, if the crng isn't ready, just
    call into get_random_bytes(). At this stage nothing is performance
    critical anyhow.
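    
    The resulting control flow can be sketched in userspace as below.
    Everything here is a stand-in: rng_ready replaces crng_ready(),
    fake_random_bytes() replaces get_random_bytes() and returns a fixed
    pattern, and get_u32() is a hypothetical name.
    
    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    
    static bool rng_ready;            /* stand-in for crng_ready() */
    static unsigned int direct_calls; /* counts trips to the generator */
    
    /* Hypothetical stand-in for get_random_bytes(); not real randomness. */
    static void fake_random_bytes(void *buf, size_t len)
    {
        assert(buf != NULL);
        direct_calls++;
        memset(buf, 0xA5, len);
    }
    
    static uint32_t batch[16];
    static size_t batch_pos = 16; /* empty until the first refill */
    
    static uint32_t get_u32(void)
    {
        uint32_t ret;
    
        if (!rng_ready) {
            /* The key changes often before init, so bypass the batch. */
            fake_random_bytes(&ret, sizeof(ret));
            return ret;
        }
        if (batch_pos >= 16) {
            fake_random_bytes(batch, sizeof(batch));
            batch_pos = 0;
        }
        return batch[batch_pos++];
    }
    ```
    
    Before readiness every request goes straight to the generator and the
    batch is left untouched, so there is no stale batch to invalidate.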
    
    Cc: Theodore Ts'o <tytso@mit.edu>
    Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 8f4d6e33b53405de6696a92caea2d71a5b7c15e0
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Tue Apr 12 19:59:57 2022 +0200

    random: insist on random_get_entropy() existing in order to simplify
    
    commit 4b758eda851eb9336ca86a0041a4d3da55f66511 upstream.
    
    All platforms are now guaranteed to provide some value for
    random_get_entropy(). In case some bug leads to this not being so, we
    print a warning, because that indicates that something is really very
    wrong (and likely other things are impacted too). This should never be
    hit, but it's a good and cheap way of finding out if something ever is
    problematic.
    
    Since we now have viable fallback code for random_get_entropy() on all
    platforms, which is, in the worst case, not worse than jiffies, we can
    count on getting the best possible value out of it. That means there's
    no longer a use for using jiffies as entropy input. It also means we no
    longer have a reason for doing the round-robin register flow in the IRQ
    handler, which was always of fairly dubious value.
    
    Instead we can greatly simplify the IRQ handler inputs and also unify
    the construction between 64-bits and 32-bits. We now collect the cycle
    counter and the return address, since those are the two things that
    matter. Because the return address and the irq number are likely
    related, to the extent we mix in the irq number, we can just xor it into
    the top unchanging bytes of the return address, rather than the bottom
    changing bytes of the cycle counter as before. Then, we can do a fixed 2
    rounds of SipHash/HSipHash. Finally, we use the same construction of
    hashing only half of the [H]SipHash state on 32-bit and 64-bit. We're
    not actually discarding any entropy, since that entropy is carried
    through until the next time. And more importantly, it lets us do the
    same sponge-like construction everywhere.
    
    Cc: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 7ebcfba533435b50a80096064e374a3cc08aabbb
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Fri Apr 8 18:03:13 2022 +0200

    xtensa: use fallback for random_get_entropy() instead of zero
    
    commit e10e2f58030c5c211d49042a8c2a1b93d40b2ffb upstream.
    
    In the event that random_get_entropy() can't access a cycle counter or
    similar, falling back to returning 0 is really not the best we can do.
    Instead, at least calling random_get_entropy_fallback() would be
    preferable, because that always needs to return _something_, even
    falling back to jiffies eventually. It's not as though
    random_get_entropy_fallback() is super high precision or guaranteed to
    be entropic, but basically anything that's not zero all the time is
    better than returning zero all the time.
    
    This is accomplished by just including the asm-generic code like on
    other architectures, which means we can get rid of the empty stub
    function here.
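    
    The shape of the change, as a userspace sketch with invented names:
    have_cycle_counter, read_cycle_counter(), entropy_fallback(), and
    get_entropy() are all hypothetical, and the fallback here goes
    straight to a jiffies stand-in.
    
    ```c
    #include <assert.h>
    #include <stdbool.h>
    
    /* Stand-ins for this sketch: a platform that may lack a cycle counter. */
    static unsigned long jiffies;
    static bool have_cycle_counter;
    
    static unsigned long read_cycle_counter(void)
    {
        assert(have_cycle_counter);
        return 12345; /* placeholder value */
    }
    
    /* Last-resort source: coarse, but it does change over time. */
    static unsigned long entropy_fallback(void)
    {
        return jiffies;
    }
    
    static unsigned long get_entropy(void)
    {
        if (have_cycle_counter)
            return read_cycle_counter();
        return entropy_fallback(); /* rather than a constant 0 */
    }
    ```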
    
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Acked-by: Max Filippov <jcmvbkbc@gmail.com>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 32a177ccb3884717251e2394750f283ef3b6c916
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Fri Apr 8 18:03:13 2022 +0200

    sparc: use fallback for random_get_entropy() instead of zero
    
    commit ac9756c79797bb98972736b13cfb239fd2cffb79 upstream.
    
    In the event that random_get_entropy() can't access a cycle counter or
    similar, falling back to returning 0 is really not the best we can do.
    Instead, at least calling random_get_entropy_fallback() would be
    preferable, because that always needs to return _something_, even
    falling back to jiffies eventually. It's not as though
    random_get_entropy_fallback() is super high precision or guaranteed to
    be entropic, but basically anything that's not zero all the time is
    better than returning zero all the time.
    
    This is accomplished by just including the asm-generic code like on
    other architectures, which means we can get rid of the empty stub
    function here.
    
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: David S. Miller <davem@davemloft.net>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit d161ede79be763eabaa2a617eae0c9892f3869b6
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Fri Apr 8 18:03:13 2022 +0200

    um: use fallback for random_get_entropy() instead of zero
    
    commit 9f13fb0cd11ed2327abff69f6501a2c124c88b5a upstream.
    
    In the event that random_get_entropy() can't access a cycle counter or
    similar, falling back to returning 0 is really not the best we can do.
    Instead, at least calling random_get_entropy_fallback() would be
    preferable, because that always needs to return _something_, even
    falling back to jiffies eventually. It's not as though
    random_get_entropy_fallback() is super high precision or guaranteed to
    be entropic, but basically anything that's not zero all the time is
    better than returning zero all the time.
    
    This is accomplished by just including the asm-generic code like on
    other architectures, which means we can get rid of the empty stub
    function here.
    
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Richard Weinberger <richard@nod.at>
    Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
    Acked-by: Johannes Berg <johannes@sipsolutions.net>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit c37981032472f827d920dd37d59b9c248fbe0b8c
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Fri Apr 8 18:03:13 2022 +0200

    x86/tsc: Use fallback for random_get_entropy() instead of zero
    
    commit 3bd4abc07a267e6a8b33d7f8717136e18f921c53 upstream.
    
    In the event that random_get_entropy() can't access a cycle counter or
    similar, falling back to returning 0 is suboptimal. Instead, fall back
    to calling random_get_entropy_fallback(), which isn't extremely high
    precision or guaranteed to be entropic, but is certainly better than
    returning zero all the time.
    
    If CONFIG_X86_TSC=n, then it's possible for the kernel to run on systems
    without RDTSC, such as 486 and certain 586, so the fallback code is only
    required for that case.
    
    As well, fix up both the new function and the get_cycles() function from
    which it was derived to use cpu_feature_enabled() rather than
    boot_cpu_has(), and use !IS_ENABLED() instead of #ifndef.
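
    As a rough user-space sketch of the check described above (not the
    kernel code itself), the fallback is only reached when TSC support is
    not compiled in and the CPU lacks the feature at run time. All
    sketch_* names and the constant return values are stand-ins assumed
    for illustration; only the shape of the condition follows the text.

    ```c
    #include <stdbool.h>

    /* Assumption: models building with CONFIG_X86_TSC=n. */
    #define SKETCH_CONFIG_X86_TSC 0

    /* Assumption: models the run-time CPU feature check for the TSC. */
    static bool sketch_cpu_has_tsc = false;

    /* Assumption: stands in for reading the cycle counter via RDTSC. */
    static unsigned long sketch_rdtsc(void)
    {
        return 123456UL;
    }

    /* Assumption: stands in for random_get_entropy_fallback(). */
    static unsigned long sketch_entropy_fallback(void)
    {
        return 42UL;
    }

    static unsigned long sketch_random_get_entropy(void)
    {
        /*
         * With TSC support compiled in, the first operand is constant
         * false and the whole check can be optimized away.
         */
        if (!SKETCH_CONFIG_X86_TSC && !sketch_cpu_has_tsc)
            return sketch_entropy_fallback();
        return sketch_rdtsc();
    }
    ```

    Writing the config test as an ordinary C expression rather than a
    preprocessor conditional keeps both branches visible to the compiler,
    so neither can silently stop building.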
    
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: x86@kernel.org
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit cca2a7ad060e5abae5d12f4b58de64679d59fa81
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Fri Apr 8 18:03:13 2022 +0200

    nios2: use fallback for random_get_entropy() instead of zero
    
    commit c04e72700f2293013dab40208e809369378f224c upstream.
    
    In the event that random_get_entropy() can't access a cycle counter or
    similar, falling back to returning 0 is really not the best we can do.
    Instead, at least calling random_get_entropy_fallback() would be
    preferable, because that always needs to return _something_, even
    falling back to jiffies eventually. It's not as though
    random_get_entropy_fallback() is super high precision or guaranteed to
    be entropic, but basically anything that's not zero all the time is
    better than returning zero all the time.
    
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Acked-by: Dinh Nguyen <dinguyen@kernel.org>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit a9ebafbe00f5086a8149d2b9eebe6033dce509d5
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Fri Apr 8 18:03:13 2022 +0200

    arm: use fallback for random_get_entropy() instead of zero
    
    commit ff8a8f59c99f6a7c656387addc4d9f2247d75077 upstream.
    
    In the event that random_get_entropy() can't access a cycle counter or
    similar, falling back to returning 0 is really not the best we can do.
    Instead, at least calling random_get_entropy_fallback() would be
    preferable, because that always needs to return _something_, even
    falling back to jiffies eventually. It's not as though
    random_get_entropy_fallback() is super high precision or guaranteed to
    be entropic, but basically anything that's not zero all the time is
    better than returning zero all the time.
    
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 9d3ee85be180b174fb3a28668e067d37c93c6ae5
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Fri Apr 8 18:03:13 2022 +0200

    mips: use fallback for random_get_entropy() instead of just c0 random
    
    commit 1c99c6a7c3c599a68321b01b9ec243215ede5a68 upstream.
    
    For situations in which we don't have a c0 counter register available,
    we've been falling back to reading the c0 "random" register, which is
    usually bounded by the number of TLB entries and changes every other
    cycle or so. This means it wraps extremely often. We can do better by
    combining this fast-changing counter with a potentially slower-changing
    counter from random_get_entropy_fallback() in the more significant bits.
    This commit combines the two, taking into account that the changing bits
    are in a different bit position depending on the CPU model. In addition,
    we previously were falling back to 0 for ancient CPUs that Linux does
    not support anyway; remove that dead path entirely.
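
    The combination can be pictured with a small user-space sketch. The
    helper name, the 6-bit width of the fast counter, and the two bit
    positions are assumptions made for illustration; the text only says
    that the slower counter goes in the more significant bits and that
    the position of the changing bits depends on the CPU model.

    ```c
    /*
     * Hypothetical helper: merge a slow counter with a fast 6-bit counter.
     * field_at_bit8 selects where the changing bits sit in the raw register
     * value (bits 8..13 versus bits 0..5); both layouts are assumptions.
     */
    static unsigned long sketch_mix_counters(unsigned long slow,
                                             unsigned int raw_fast,
                                             int field_at_bit8)
    {
        unsigned int fast;

        if (field_at_bit8)
            fast = (raw_fast >> 8) & 0x3f;
        else
            fast = raw_fast & 0x3f;

        /* Slow counter in the more significant bits, fast counter below. */
        return (slow << 6) | fast;
    }
    ```

    Because the fast field occupies only the low bits, it can wrap as
    often as it likes without disturbing the slower counter above it.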
    
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Tested-by: Maciej W. Rozycki <macro@orcam.me.uk>
    Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit f4e9fe58d4af645cf92a889d54a842ff79969a53
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Fri Apr 8 18:03:13 2022 +0200

    riscv: use fallback for random_get_entropy() instead of zero
    
    commit 6d01238623faa9425f820353d2066baf6c9dc872 upstream.
    
    In the event that random_get_entropy() can't access a cycle counter or
    similar, falling back to returning 0 is really not the best we can do.
    Instead, at least calling random_get_entropy_fallback() would be
    preferable, because that always needs to return _something_, even
    falling back to jiffies eventually. It's not as though
    random_get_entropy_fallback() is super high precision or guaranteed to
    be entropic, but basically anything that's not zero all the time is
    better than returning zero all the time.
    
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
    Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 955a870ebdf691c84baa09ea27fd0e80135d0ad7
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Fri Apr 8 18:03:13 2022 +0200

    m68k: use fallback for random_get_entropy() instead of zero
    
    commit 0f392c95391f2d708b12971a07edaa7973f9eece upstream.
    
    In the event that random_get_entropy() can't access a cycle counter or
    similar, falling back to returning 0 is really not the best we can do.
    Instead, at least calling random_get_entropy_fallback() would be
    preferable, because that always needs to return _something_, even
    falling back to jiffies eventually. It's not as though
    random_get_entropy_fallback() is super high precision or guaranteed to
    be entropic, but basically anything that's not zero all the time is
    better than returning zero all the time.
    
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 7a5a2e2ccfd9496dcc9290c2e8799da612270607
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sun Apr 10 16:49:50 2022 +0200

    timekeeping: Add raw clock fallback for random_get_entropy()
    
    commit 1366992e16bddd5e2d9a561687f367f9f802e2e4 upstream.
    
    The addition of random_get_entropy_fallback() provides access to
    whichever time source has the highest frequency, which is useful for
    gathering entropy on platforms without available cycle counters. It's
    not necessarily as good as being able to quickly access a cycle counter
    that the CPU has, but it's still something, even when it falls back to
    being jiffies-based.
    
    In the event that a given arch does not define get_cycles(), falling
    back to the get_cycles() default implementation that returns 0 is really
    not the best we can do. Instead, at least calling
    random_get_entropy_fallback() would be preferable, because that always
    needs to return _something_, even falling back to jiffies eventually.
    It's not as though random_get_entropy_fallback() is super high precision
    or guaranteed to be entropic, but basically anything that's not zero all
    the time is better than returning zero all the time.
    
    Finally, since random_get_entropy_fallback() is used during extremely
    early boot when randomizing freelists in mm_init(), it can be called
    before timekeeping has been initialized. In that case there really is
    nothing we can do; jiffies hasn't even started ticking yet. So just give
    up and return 0.
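
    A minimal user-space sketch of that early-boot guard follows. The
    struct, the global pointer, and all sketch_* names are hypothetical
    stand-ins; the point is only that a missing clock yields 0 and any
    registered clock, however coarse, yields its current reading.

    ```c
    #include <stddef.h>

    /* Hypothetical stand-in for a registered time source. */
    struct sketch_clock {
        unsigned long (*read)(const struct sketch_clock *clock);
        unsigned long ticks;
    };

    /* NULL models "timekeeping has not been initialized yet". */
    static const struct sketch_clock *sketch_current_clock = NULL;

    static unsigned long sketch_read_ticks(const struct sketch_clock *clock)
    {
        return clock->ticks;
    }

    static unsigned long sketch_get_entropy_fallback(void)
    {
        const struct sketch_clock *clock = sketch_current_clock;

        /* Nothing is ticking yet, so give up and return 0. */
        if (clock == NULL)
            return 0;
        return clock->read(clock);
    }
    ```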
    
    Suggested-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 62d1c104c7f6ee38d4f2717b1539a0dd54f71ff9
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sat Apr 23 21:11:41 2022 +0200

    powerpc: define get_cycles macro for arch-override
    
    commit 408835832158df0357e18e96da7f2d1ed6b80e7f upstream.
    
    PowerPC defines a get_cycles() function, but it does not do the usual
    `#define get_cycles get_cycles` dance, making it impossible for generic
    code to see if an arch-specific function was defined. While the
    get_cycles() ifdef is not currently used, the following timekeeping
    patch in this series will depend on the macro existing (or not existing)
    when defining random_get_entropy().
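
    For readers unfamiliar with the idiom, here is a small sketch of how
    a self-referential macro lets generic code detect an arch override.
    The sketch_* names and the constant return values are assumptions for
    illustration, and both halves are shown in one file for brevity.

    ```c
    /* "Arch" half: provide the function, then advertise it with a macro. */
    static inline unsigned long get_cycles(void)
    {
        return 1000UL; /* stand-in for reading a real cycle counter */
    }
    #define get_cycles get_cycles

    /* "Generic" half: stand-in for the timekeeping-based fallback. */
    unsigned long sketch_entropy_fallback(void)
    {
        return 7UL;
    }

    /* The preprocessor cannot see functions, but it can see the macro. */
    #ifdef get_cycles
    #define sketch_random_get_entropy() ((unsigned long)get_cycles())
    #else
    #define sketch_random_get_entropy() sketch_entropy_fallback()
    #endif
    ```

    A macro that expands to its own name is not expanded again, so calls
    to get_cycles() still reach the function while #ifdef now succeeds.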
    
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Benjamin Herrenschmidt <benh@ozlabs.org>
    Cc: Paul Mackerras <paulus@samba.org>
    Acked-by: Michael Ellerman <mpe@ellerman.id.au>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit d0d24c89d88505011b09eec0dec43bec1c617bef
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sat Apr 23 21:11:41 2022 +0200

    alpha: define get_cycles macro for arch-override
    
    commit 1097710bc9660e1e588cf2186a35db3d95c4d258 upstream.
    
    Alpha defines a get_cycles() function, but it does not do the usual
    `#define get_cycles get_cycles` dance, making it impossible for generic
    code to see if an arch-specific function was defined. While the
    get_cycles() ifdef is not currently used, the following timekeeping
    patch in this series will depend on the macro existing (or not existing)
    when defining random_get_entropy().
    
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Richard Henderson <rth@twiddle.net>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Acked-by: Matt Turner <mattst88@gmail.com>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 9a51867dc50cdc4df16200d8897bced7f243aaab
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sat Apr 23 21:11:41 2022 +0200

    parisc: define get_cycles macro for arch-override
    
    commit 8865bbe6ba1120e67f72201b7003a16202cd42be upstream.
    
    PA-RISC defines a get_cycles() function, but it does not do the usual
    `#define get_cycles get_cycles` dance, making it impossible for generic
    code to see if an arch-specific function was defined. While the
    get_cycles() ifdef is not currently used, the following timekeeping
    patch in this series will depend on the macro existing (or not existing)
    when defining random_get_entropy().
    
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Acked-by: Helge Deller <deller@gmx.de>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit f11c51290d0f854f83632d79a03af9605ca2c32a
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sat Apr 23 21:11:41 2022 +0200

    s390: define get_cycles macro for arch-override
    
    commit 2e3df523256cb9836de8441e9c791a796759bb3c upstream.
    
    S390x defines a get_cycles() function, but it does not do the usual
    `#define get_cycles get_cycles` dance, making it impossible for generic
    code to see if an arch-specific function was defined. While the
    get_cycles() ifdef is not currently used, the following timekeeping
    patch in this series will depend on the macro existing (or not existing)
    when defining random_get_entropy().
    
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Acked-by: Heiko Carstens <hca@linux.ibm.com>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 9d923d15fc4b99d55a4cf0f8b1336ed7b8af218e
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sat Apr 23 21:11:41 2022 +0200

    ia64: define get_cycles macro for arch-override
    
    commit 57c0900b91d8891ab43f0e6b464d059fda51d102 upstream.
    
    Itanium defines a get_cycles() function, but it does not do the usual
    `#define get_cycles get_cycles` dance, making it impossible for generic
    code to see if an arch-specific function was defined. While the
    get_cycles() ifdef is not currently used, the following timekeeping
    patch in this series will depend on the macro existing (or not existing)
    when defining random_get_entropy().
    
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit a27dffcb938836affc211846a44e6dfa306582f7
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Thu May 5 02:20:22 2022 +0200

    init: call time_init() before rand_initialize()
    
    commit fe222a6ca2d53c38433cba5d3be62a39099e708e upstream.
    
    Currently time_init() is called after rand_initialize(), but
    rand_initialize() makes use of the timer on various platforms, and
    sometimes this timer needs to be initialized by time_init() first. In
    order for random_get_entropy() to not return zero during early boot when
    it's potentially used as an entropy source, reverse the order of these
    two calls. The block doing random initialization previously sat right
    before time_init(), so changing the order shouldn't have any
    complicated effects.
    
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Reviewed-by: Stafford Horne <shorne@gmail.com>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 3162bd8ac0530d5488a6e861f91871a2002ae416
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Tue May 3 21:43:58 2022 +0200

    random: fix sysctl documentation nits
    
    commit 069c4ea6871c18bd368f27756e0f91ffb524a788 upstream.
    
    A semicolon was missing, and the almost-alphabetical-but-not ordering
    was confusing, so regroup these by category instead.
    
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 8c37b3bc825d4ce567d01533782224cdec8337a5
Author: Basavaraj Natikar <Basavaraj.Natikar@amd.com>
Date:   Mon May 9 18:50:20 2022 +0530

    HID: amd_sfh: Add support for sensor discovery
    
    commit b5d7f43e97dabfa04a4be5ff027ce7da119332be upstream.
    
    Sensor discovery fails when sensors are broken or the platform is
    not supported. Disable the driver when sensor discovery fails.
    
    Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
    Signed-off-by: Basavaraj Natikar <Basavaraj.Natikar@amd.com>
    Signed-off-by: Jiri Kosina <jkosina@suse.cz>
    Cc: Mario Limonciello <Mario.Limonciello@amd.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit eca56bf0066ef2f1e7be0e3fa7564b85a309872c
Author: Daniel Thompson <daniel.thompson@linaro.org>
Date:   Mon May 23 19:11:02 2022 +0100

    lockdown: also lock down previous kgdb use
    
    commit eadb2f47a3ced5c64b23b90fd2a3463f63726066 upstream.
    
    KGDB and KDB allow read and write access to kernel memory, and thus
    should be restricted during lockdown.  An attacker with access to a
    serial port (for example, via a hypervisor console, which some cloud
    vendors provide over the network) could trigger the debugger so it is
    important that the debugger respect the lockdown mode when/if it is
    triggered.
    
    Fix this by integrating lockdown into kdb's existing permissions
    mechanism.  Unfortunately kgdb does not have any permissions mechanism
    (although it certainly could be added later) so, for now, kgdb is simply
    and brutally disabled by immediately exiting the gdb stub without taking
    any action.
    
    For lockdowns established early in boot (i.e. the normal case) this
    should be fine, but on systems where kgdb has set breakpoints before
    the lockdown is enacted, "bad things" will happen.
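
    One way to picture the two behaviours is the user-space sketch below.
    Every name in it is hypothetical: the permission bits and the helper
    functions are not the real kdb identifiers, and the mapping of
    lockdown levels to removed permissions is an assumption. It only
    illustrates masking a permission set for kdb versus refusing entry
    outright for kgdb.

    ```c
    #include <stdbool.h>

    /* Hypothetical permission bits for debugger commands. */
    #define SKETCH_PERM_INSPECT   (1u << 0)
    #define SKETCH_PERM_MEM_READ  (1u << 1)
    #define SKETCH_PERM_MEM_WRITE (1u << 2)

    /* Hypothetical model of increasingly strict lockdown levels. */
    enum sketch_lockdown {
        SKETCH_LOCKDOWN_NONE,
        SKETCH_LOCKDOWN_INTEGRITY,
        SKETCH_LOCKDOWN_CONFIDENTIALITY,
    };

    /* kdb-style: keep the debugger usable but strip what lockdown forbids. */
    static unsigned int sketch_effective_perms(unsigned int enabled,
                                               enum sketch_lockdown level)
    {
        if (level >= SKETCH_LOCKDOWN_INTEGRITY)
            enabled &= ~SKETCH_PERM_MEM_WRITE;
        if (level >= SKETCH_LOCKDOWN_CONFIDENTIALITY)
            enabled &= ~SKETCH_PERM_MEM_READ;
        return enabled;
    }

    /* kgdb-style: no permission mechanism, so any lockdown means no entry. */
    static bool sketch_kgdb_may_enter(enum sketch_lockdown level)
    {
        return level == SKETCH_LOCKDOWN_NONE;
    }
    ```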
    
    CVE: CVE-2022-21499
    Co-developed-by: Stephen Brennan <stephen.s.brennan@oracle.com>
    Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
    Reviewed-by: Douglas Anderson <dianders@chromium.org>
    Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>