commit 33259059523a221e7db294f80753a3ca201f8d4c Author: Alexandre Frade Date: Mon Oct 3 00:48:35 2022 +0000 Linux 6.0.0-xanmod1 Signed-off-by: Alexandre Frade commit 4e2684ed25cd348b8b0bbc7ac9200a0ac7e5bb7f Author: Alexandre Frade Date: Mon Oct 3 21:09:27 2022 +0000 XANMOD: scripts/setlocalversion: Move localversion* files to the end Signed-off-by: Alexandre Frade commit b338392ab79747b0626ab8b36aaea81f6ac1eaf5 Author: Liam R. Howlett Date: Tue Sep 6 19:49:06 2022 +0000 mm/mmap.c: pass in mapping to __vma_link_file() __vma_link_file() resolves the mapping from the file, if there is one. Pass through the mapping and check the vm_file externally since most places already have the required information and check of vm_file. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 30cbc90e009ec25ed510f166e166aa78b3067505 Author: Liam R. Howlett Date: Tue Sep 6 19:49:06 2022 +0000 mm/mmap: drop range_has_overlap() function Since there is no longer a linked list, the range_has_overlap() function is identical to the find_vma_intersection() function. Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Signed-off-by: Alexandre Frade commit 7a1831ec686595c657adc03ec0c7af2d56df8b6d Author: Liam R. Howlett Date: Tue Sep 6 19:49:06 2022 +0000 mm: remove the vma linked list Replace any vm_next use with vma_find(). Update free_pgtables(), unmap_vmas(), and zap_page_range() to use the maple tree. Use the new free_pgtables() and unmap_vmas() in do_mas_align_munmap(). At the same time, alter the loop to be more compact. Now that free_pgtables() and unmap_vmas() take a maple tree as an argument, rearrange do_mas_align_munmap() to use the new tree to hold the vmas to remove. Remove __vma_link_list() and __vma_unlink_list() as they are exclusively used to update the linked list. Drop linked list update from __insert_vm_struct(). Rework validation of tree as it was depending on the linked list. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 24100db026c2cf9b21e7b22127620400a0bb97b6 Author: Liam Howlett Date: Tue Sep 6 19:49:05 2022 +0000 mm/vmscan: Use vma iterator instead of vm_next Use the vma iterator in get_next_vma() instead of the linked list. Suggested-by: Yu Zhao Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 7e973ae75992533288183eafda6e7d87d07510e2 Author: Liam R. Howlett Date: Tue Sep 6 19:49:05 2022 +0000 riscv: use vma iterator for vdso Remove the linked list use in favour of the vma iterator. Signed-off-by: Liam R. Howlett Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit 65eddffedc4a233561dfb116c7b9038256bc50c4 Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:49:05 2022 +0000 nommu: remove uses of VMA linked list Use the maple tree or VMA iterator instead. This is faster and will allow us to shrink the VMA. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Signed-off-by: Alexandre Frade commit 011fd661dececbb43c79e80be2dd4d0473121019 Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:49:04 2022 +0000 i915: use the VMA iterator Replace the linked list in probe_range() with the VMA iterator. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Signed-off-by: Alexandre Frade commit 07fa26753484f7af502f4413649bd53c81a61e78 Author: Liam R. Howlett Date: Tue Sep 6 19:49:04 2022 +0000 mm/swapfile: use vma iterator instead of vma linked list unuse_mm() no longer needs to reference the linked list. Signed-off-by: Liam R.
Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit 3f2dc51aa948d8623d5f15ae55bc841bfd68caba Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:49:04 2022 +0000 mm/pagewalk: use vma_find() instead of vma linked list walk_page_range() no longer uses the one vma linked list reference. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Signed-off-by: Alexandre Frade commit 11cdab3ca23bd7569b9a7faa2b5cb422e73542b3 Author: Liam R. Howlett Date: Tue Sep 6 19:49:03 2022 +0000 mm/oom_kill: use vma iterators instead of vma linked list Use vma iterator in preparation of removing the linked list. Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit 14cbb30eef4a7a97174013f686d504ba986dc296 Author: Liam R. Howlett Date: Tue Sep 6 19:49:03 2022 +0000 mm/msync: use vma_find() instead of vma linked list Remove a single use of the vma linked list in preparation for the removal of the linked list. Uses find_vma() to get the next element. Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit 619b9005d1c4885abecec857ec30987d4e7d550f Author: Liam R. Howlett Date: Tue Sep 6 19:49:03 2022 +0000 mm/mremap: use vma_find_intersection() instead of vma linked list Using the vma_find_intersection() call allows for cleaner code and removes linked list users in preparation of the linked list removal. Also remove one user of the linked list at the same time in favour of find_vma(). Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit 1f4c5c0de738a492b593f5bb98636d8e53dcc762 Author: Liam R. Howlett Date: Tue Sep 6 19:49:02 2022 +0000 mm/mprotect: use maple tree navigation instead of VMA linked list Switch to navigating the VMA list with the maple tree operators in preparation for removing the linked list. Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Signed-off-by: Alexandre Frade commit e6e56789f1a47364d754f223a650ebab4d948aa9 Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:49:02 2022 +0000 mm/mlock: use vma iterator and maple state instead of vma linked list Handle overflow checking in count_mm_mlocked_page_nr() differently. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 4f10986ccfa99c5964ebe9c153061e31ac602ca0 Author: Liam R. Howlett Date: Tue Sep 6 19:49:02 2022 +0000 mm/mempolicy: use vma iterator & maple state instead of vma linked list Reworked the way mbind_range() finds the first VMA to reuse the maple state and limit the number of tree walks needed. Note, this drops the VM_BUG_ON(!vma) call, which would catch a start address higher than the last VMA. The code was written in a way that allowed no VMA updates to occur and still return success. There should be no functional change to this scenario with the new code. Signed-off-by: Liam R. Howlett Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Alexandre Frade commit 19543af5b33ef7e1f536415a784c1774f44b055b Author: Liam R. Howlett Date: Tue Sep 6 19:49:01 2022 +0000 mm/memcontrol: stop using mm->highest_vm_end Pass through ULONG_MAX instead. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 56978fa6bcadbce91bed16b160e31833bee8dc9f Author: Liam R. 
Howlett Date: Tue Sep 6 19:49:01 2022 +0000 mm/madvise: use vma_find() instead of vma linked list madvise_walk_vmas() no longer uses the linked list. Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit 24272c4af966dd57746a32a287d445cfc2bb25e0 Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:49:01 2022 +0000 mm/ksm: use vma iterators instead of vma linked list Remove the use of the linked list for eventual removal. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit 46a4d13b9d260dbd7e65ffb7b8a878738171a76d Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:49:00 2022 +0000 mm/khugepaged: stop using vma linked list Use vma iterator & find_vma() instead of vma linked list. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit 4294269f11eddc27c7190d2c666d5fcf8136911d Author: Liam R. Howlett Date: Tue Sep 6 19:49:00 2022 +0000 mm/gup: use maple tree navigation instead of linked list Use find_vma_intersection() to locate the VMAs in __mm_populate() instead of using find_vma() and the linked list. Signed-off-by: Liam R. Howlett Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit 1efb0e09b05e6307b29e5dd67fdfef6898519c63 Author: Liam R. Howlett Date: Tue Sep 6 19:48:59 2022 +0000 bpf: remove VMA linked list Use vma_next() and remove the reference to the start of the linked list. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 2f225aa51edb9c3f3537fe9e31aa1e211dadbbc5 Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:59 2022 +0000 fork: use VMA iterator The VMA iterator is faster than the linked list and removing the linked list will shrink the vm_area_struct. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit 4bfb9c0367bf0350fe769d914edffea90f78935b Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:59 2022 +0000 sched: use maple tree iterator to walk VMAs The linked list is slower than walking the VMAs using the maple tree. We can't use the VMA iterator here because it doesn't support moving to an earlier position. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Signed-off-by: Alexandre Frade commit 655e8035191137fa5f130ea7613186a55a349744 Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:58 2022 +0000 perf: use VMA iterator The VMA iterator is faster than the linked list and removing the linked list will shrink the vm_area_struct. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit 46471806fa3eed8a2795837cb2f70bc0c50d98d5 Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:58 2022 +0000 acct: use VMA iterator instead of linked list The VMA iterator is faster than the linked list. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade
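Many of the conversions above and below follow the same pattern: replace a vm_next walk with the VMA iterator added later in this series. As a rough, editorial sketch (not taken from any of these patches, and assuming a tree with the VMA_ITERATOR()/for_each_vma() API from this series applied), such a walk looks like this:

  /* Illustrative only: walk every VMA of an mm with the VMA iterator
   * instead of following vma->vm_next. Assumes the maple tree / VMA
   * iterator patches from this series are applied.
   */
  #include <linux/mm.h>
  #include <linux/mm_types.h>

  static unsigned long count_exec_vmas(struct mm_struct *mm)
  {
          VMA_ITERATOR(vmi, mm, 0);       /* start the walk at address 0 */
          struct vm_area_struct *vma;
          unsigned long nr = 0;

          mmap_read_lock(mm);
          /* Replaces: for (vma = mm->mmap; vma; vma = vma->vm_next) */
          for_each_vma(vmi, vma) {
                  if (vma->vm_flags & VM_EXEC)
                          nr++;
          }
          mmap_read_unlock(mm);

          return nr;
  }

vma_find() and vma_prev() bound or reverse such a walk; because the iterator re-walks the tree from its saved position, there is no 'next' pointer to keep consistent while VMAs are being removed, which is the property the ipc/shm conversion below relies on.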
commit c281f4150b51b9e59ee39b379fa70c2285ab2003 Author: Liam R. Howlett Date: Tue Sep 6 19:48:58 2022 +0000 ipc/shm: use VMA iterator instead of linked list The VMA iterator is faster than the linked list, and it can be walked even when VMAs are being removed from the address space, so there's no need to keep track of 'next'. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 8e8a31497d8ad055cd238b170de7e3268de27d4f Author: Liam R. Howlett Date: Tue Sep 6 19:48:57 2022 +0000 userfaultfd: use maple tree iterator to iterate VMAs Don't use the mm_struct linked list or the vma->vm_next in prep for removal. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit bf18fab98b1521ae6de7f5b3bfa406281146f962 Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:57 2022 +0000 fs/proc/task_mmu: stop using linked list and highest_vm_end Remove references to the mm_struct linked list and highest_vm_end for when they are removed. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit d0759b74cc9aca89933a81f37aebfc90153ab710 Author: Liam R. Howlett Date: Tue Sep 6 19:48:56 2022 +0000 fs/proc/base: use the vma iterators in place of linked list Use the vma iterator instead of a for loop across the linked list. The linked list of vmas will be removed in this patch set. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 3425245c1d21bf11873a4ac8944ff5664ea175bb Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:56 2022 +0000 exec: use VMA iterator instead of linked list Remove a use of the vm_next list by doing the initial lookup with the VMA iterator and then using it to find the next entry. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit f61fbb149be8f66e007100e6bfb8d33bdf82b12b Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:56 2022 +0000 coredump: remove vma linked list walk Use the Maple Tree iterator instead. This is too complicated for the VMA iterator to handle, so let's open-code it for now. If this turns out to be a common pattern, we can migrate it to common code. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit a057c375eb33c836e5d80a6e3dff3aae2c9dd66a Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:56 2022 +0000 um: remove vma linked list walk Use the VMA iterator instead. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit 23e102ba82d3d44695bca2dca5b8d104f7c146e2 Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:55 2022 +0000 optee: remove vma linked list walk Use the VMA iterator instead. Change the calling convention of __check_mem_type() to pass in the mm instead of the first vma in the range. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit ae7b8b0d87dc6e6198b350dc1f6268f8e3612cb3 Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:55 2022 +0000 cxl: remove vma linked list walk Use the VMA iterator instead. This requires a little restructuring of the surrounding code to hoist the mm to the caller. That turns cxl_prefault_one() into a trivial function, so call cxl_fault_segment() directly. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Signed-off-by: Alexandre Frade commit 16f285ce32bc7e17f29273c9d9559f8cb389799a Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:55 2022 +0000 xtensa: remove vma linked list walks Use the VMA iterator instead. Since the VMA can no longer be NULL in the loop, deal with out-of-memory outside the loop.
This means a slightly longer run time in the failure case (-ENOMEM) - it will run to the end of the VMAs before erroring instead of in the middle of the loop. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit e0323853de17d14991e76fcfdd80fb0181f3241f Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:54 2022 +0000 x86: remove vma linked list walks Use the VMA iterator instead. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit ab8cba2e3bfdfce6966a66a4c0af42d3997f0137 Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:54 2022 +0000 s390: remove vma linked list walks Use the VMA iterator instead. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit 82a213a773dfed344828cb9ced28587262a66ca9 Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:53 2022 +0000 powerpc: remove mmap linked list walks Use the VMA iterator instead. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Reviewed-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit e4d76d7cd551094b6da8f5ca4a11067fe92c72ce Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:53 2022 +0000 parisc: remove mmap linked list from cache handling Use the VMA iterator instead. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit feb9d760fb22c2b09378c5571cbfa6b7a85897f2 Author: Liam R. Howlett Date: Tue Sep 6 19:48:53 2022 +0000 arm64: Change elfcore for_each_mte_vma() to use VMA iterator Rework for_each_mte_vma() to use a VMA iterator instead of an explicit linked-list. Signed-off-by: Liam R. Howlett Acked-by: Catalin Marinas Link: https://lore.kernel.org/r/20220218023650.672072-1-Liam.Howlett@oracle.com Signed-off-by: Will Deacon Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit b23a4e990fb0cd3779fe85e6ee1334b63b6be539 Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:53 2022 +0000 arm64: remove mmap linked list from vdso Use the VMA iterator instead. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit 32dffe9cf155350766f6620f35355b3fdce53b18 Author: Liam R. Howlett Date: Tue Sep 6 19:48:52 2022 +0000 mm/mmap: change do_brk_munmap() to use do_mas_align_munmap() do_brk_munmap() has already aligned the address and has a maple tree state to be used. Use the new do_mas_align_munmap() to avoid unnecessary alignment and error checks. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit e4e0fea58ea1ff200d7155f011356ad5f5f9bb31 Author: Liam R. Howlett Date: Tue Sep 6 19:48:52 2022 +0000 mm/mmap: reorganize munmap to use maple states Remove __do_munmap() in favour of do_munmap(), do_mas_munmap(), and do_mas_align_munmap(). do_munmap() is a wrapper to create a maple state for any callers that have not been converted to the maple tree. do_mas_munmap() takes a maple state to munmap a range. This is just a small function which checks for error conditions and aligns the end of the range. do_mas_align_munmap() uses the aligned range to munmap a range. do_mas_align_munmap() starts with the first VMA in the range, then finds the last VMA in the range.
Both start and end are split if necessary. Then the VMAs are removed from the linked list and the mm mlock count is updated at the same time. Followed by a single tree operation of overwriting the area with a NULL. Finally, the detached list is unmapped and freed. By reorganizing the munmap calls as outlined, it is now possible to avoid the extra work of aligning pre-aligned callers which are known to be safe, and to avoid extra VMA lookups or tree walks for modifications. detach_vmas_to_be_unmapped() is no longer used, so drop this code. vm_brk_flags() can just call do_mas_munmap() as it checks for intersecting VMAs directly. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 63744e3fce438bff6b1f0afb9114eb73876b7ef0 Author: Liam R. Howlett Date: Tue Sep 6 19:48:52 2022 +0000 mm/mmap: move mmap_region() below do_munmap() Relocation of code for the next commit. There should be no changes here. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit cff2b0d390d02ce23c14a12ab46e4f17c7c693e2 Author: Liam R. Howlett Date: Tue Sep 6 19:48:51 2022 +0000 mm: convert vma_lookup() to use mtree_load() Unlike the rbtree, the Maple Tree will return a NULL if there's nothing at a particular address. Since the previous commit dropped the vmacache, it is now possible to consult the tree directly. Signed-off-by: Liam R. Howlett Signed-off-by: Matthew Wilcox (Oracle) Acked-by: Vlastimil Babka Signed-off-by: Alexandre Frade commit 8b7fd055590b0624988f4ce816a29dcd58faf629 Author: Liam R. Howlett Date: Tue Sep 6 19:48:51 2022 +0000 mm: remove vmacache By using the maple tree and the maple tree state, the vmacache is no longer beneficial and is complicating the VMA code. Remove the vmacache to reduce the work in keeping it up to date and code complexity. Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Signed-off-by: Alexandre Frade commit 62a0f26210bc7c74bf72cb17ed49f67394346549 Author: Liam R. Howlett Date: Tue Sep 6 19:48:51 2022 +0000 mm/mmap: use advanced maple tree API for mmap_region() Changing mmap_region() to use the maple tree state and the advanced maple tree interface allows for a lot less tree walking. This change removes the last caller of munmap_vma_range(), so drop this unused function. Add vma_expand() to expand a VMA if possible by doing the necessary hugepage check, uprobe_munmap of files, dcache flush, modifications then undoing the detaches, etc. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 7976c7cd205c62ccc938a3ec9f5d8a120edebd19 Author: Liam R. Howlett Date: Tue Sep 6 19:48:50 2022 +0000 mm: use maple tree operations for find_vma_intersection() Move find_vma_intersection() to mmap.c and change implementation to maple tree. When searching for a vma within a range, it is easier to use the maple tree interface. Exported find_vma_intersection() for kvm module. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 1108eb09ce0247d7e76f9f634cf7da4dc08557c0 Author: Liam R. Howlett Date: Tue Sep 6 19:48:50 2022 +0000 mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap() Avoid allocating a new VMA when a vma modification can occur. When a brk() can expand or contract a VMA, the single store operation will only modify one index of the maple tree instead of causing a node to split or coalesce. This avoids unnecessary allocations/frees of maple tree nodes and VMAs.
Move some limit & flag verifications out of the do_brk_flags() function to use only relevant checks in the code path of brk() and vm_brk_flags(). Set the vma to check if it can expand in vm_brk_flags() if extra criteria are met. Drop userfaultfd from do_brk_flags() path and only use it in vm_brk_flags() path since that is the only place a munmap will happen. Add a wrapper for munmap for the brk case called do_brk_munmap(). Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 6cccf3f62c10c0d4bc03ec89373be55081be8f0c Author: Liam R. Howlett Date: Tue Sep 6 19:48:50 2022 +0000 mm/khugepaged: optimize collapse_pte_mapped_thp() by using vma_lookup() vma_lookup() will walk the vma tree once and not continue to look for the next vma. Since the exact vma is checked below, this is a more optimal way of searching. Signed-off-by: Liam R. Howlett Reviewed-by: Vlastimil Babka Signed-off-by: Alexandre Frade commit 0c073f9ca8f0e51350a8ae25634a4f614b5d3393 Author: Liam R. Howlett Date: Tue Sep 6 19:48:49 2022 +0000 mm: optimize find_exact_vma() to use vma_lookup() Use vma_lookup() to walk the tree to the start value requested. If the vma at the start does not match, then the answer is NULL and there is no need to look at the next vma the way that find_vma() would. Signed-off-by: Liam R. Howlett Reviewed-by: Vlastimil Babka Signed-off-by: Alexandre Frade commit 4193619d8b3020ee4c03fb8eb8b61dced194b3f3 Author: Liam R. Howlett Date: Tue Sep 6 19:48:49 2022 +0000 xen: use vma_lookup() in privcmd_ioctl_mmap() vma_lookup() walks the VMA tree for a specific value, whereas find_vma() will search the tree after walking to a specific value. It is more efficient to only walk to the requested value since privcmd_ioctl_mmap() will exit the loop if vm_start != msg->va. Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Signed-off-by: Alexandre Frade commit 8af043f9d8bf1ac829527d45180ebbdec75a06c9 Author: Liam R. Howlett Date: Tue Sep 6 19:48:49 2022 +0000 mmap: change zeroing of maple tree in __vma_adjust() Only write to the maple tree if we are not inserting or the insert isn't going to overwrite the area to clear. This avoids spanning writes and node coalescing when unnecessary. The change requires a custom search for the linked list addition to find the correct VMA for the prev link. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit c9a2202185fccf469c4fafbc21db5b6962986fbb Author: Liam R. Howlett Date: Tue Sep 6 19:48:48 2022 +0000 mm: remove rb tree. Remove the RB tree and start using the maple tree for vm_area_struct tracking. Drop validate_mm() calls in expand_upwards() and expand_downwards() as the lock is not held. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 0706d041b99e04c782673a5e0d0ca9891994939b Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:48 2022 +0000 proc: remove VMA rbtree use from nommu These users of the rbtree should probably have been walks of the linked list, but convert them to use walks of the maple tree. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit 3b1673be0812af1cfbd72938f7d549f14b838943 Author: Liam R. Howlett Date: Tue Sep 6 19:48:48 2022 +0000 damon: convert __damon_va_three_regions to use the VMA iterator This rather specialised walk can use the VMA iterator.
If this proves to be too slow, we can write a custom routine to find the two largest gaps, but it will be somewhat complicated, so let's see if we need it first. Update the kunit test case to use the maple tree. This also fixes an issue with the kunit testcase not adding the last VMA to the list. Fixes: 17ccae8bb5c9 (mm/damon: add kunit tests) Signed-off-by: Liam R. Howlett Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: SeongJae Park Reviewed-by: David Hildenbrand Signed-off-by: Alexandre Frade commit 17d2ccb7fd15bc861ad04e046454c946363756af Author: Liam R. Howlett Date: Tue Sep 6 19:48:47 2022 +0000 kernel/fork: use maple tree for dup_mmap() during forking The maple tree was already tracking VMAs in this function by an earlier commit, but the rbtree iterator was being used to iterate the list. Change the iterator to use a maple tree native iterator and switch to the maple tree advanced API to avoid multiple walks of the tree during insert operations. Unexport the now-unused vma_store() function. For performance reasons we bulk allocate the maple tree nodes. The node calculations are done internally to the tree and use the VMA count and assume the worst-case node requirements. The VM_DONT_COPY flag does not allow for the most efficient copy method of the tree and so a bulk loading algorithm is used. Signed-off-by: Liam R. Howlett Signed-off-by: Matthew Wilcox (Oracle) Acked-by: Vlastimil Babka Signed-off-by: Alexandre Frade commit 4a99e8ac694a3dc77c688552f799ef4027540173 Author: Liam R. Howlett Date: Tue Sep 6 19:48:47 2022 +0000 mm/mmap: use maple tree for unmapped_area{_topdown} The maple tree code was added to find the unmapped area in a previous commit and was checked against what the rbtree returned, but the actual result was never used. Start using the maple tree implementation and remove the rbtree code. Add kernel documentation comment for these functions. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 772d54e48cc6822c064e3ceda07becf39dc83915 Author: Liam R. Howlett Date: Tue Sep 6 19:48:47 2022 +0000 mm/mmap: use the maple tree for find_vma_prev() instead of the rbtree Use the maple tree's advanced API and a maple state to walk the tree for the entry at the address of the next vma, then use the maple state to walk back one entry to find the previous entry. Add kernel documentation comments for this API. Signed-off-by: Liam R. Howlett Signed-off-by: Matthew Wilcox (Oracle) Acked-by: Vlastimil Babka Reviewed-by: David Hildenbrand Signed-off-by: Alexandre Frade commit 76903aa3f27aac2ad19d605e9551546b785c0d37 Author: Liam R. Howlett Date: Tue Sep 6 19:48:46 2022 +0000 mm/mmap: use the maple tree in find_vma() instead of the rbtree. Using the maple tree interface mt_find() will handle the RCU locking and will start searching at the address up to the limit, ULONG_MAX in this case. Add kernel documentation to this API. Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: David Hildenbrand Signed-off-by: Alexandre Frade commit f399b7d8c4c7abd2aeab689e9afde5de737931ca Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:46 2022 +0000 mmap: use the VMA iterator in count_vma_pages_range() This simplifies the implementation and is faster than using the linked list. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. 
Howlett Acked-by: Vlastimil Babka Reviewed-by: David Hildenbrand Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit 054d834a8ff1b83c14fc4b771c7cbebd33b5e44b Author: Matthew Wilcox (Oracle) Date: Tue Sep 6 19:48:46 2022 +0000 mm: add VMA iterator This thin layer of abstraction over the maple tree state is for iterating over VMAs. You can go forwards, go backwards or ask where the iterator is. Rename the existing vma_next() to __vma_next() -- it will be removed by the end of this series. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: David Hildenbrand Reviewed-by: Davidlohr Bueso Signed-off-by: Alexandre Frade commit 9ad235cea1753570ef78fc59b0466f2422c11cb3 Author: Liam R. Howlett Date: Tue Sep 6 19:48:45 2022 +0000 mm: start tracking VMAs with maple tree Start tracking the VMAs with the new maple tree structure in parallel with the rb_tree. Add debug and trace events for maple tree operations and duplicate the rb_tree that is created on forks into the maple tree. The maple tree is added to the mm_struct including the mm_init struct, added support in required mm/mmap functions, added tracking in kernel/fork for process forking, and used to find the unmapped_area and checked against what the rbtree finds. This also moves the mmap_lock() in exit_mmap() since the oom reaper call does walk the VMAs. Otherwise lockdep will be unhappy if oom happens. When splitting a vma fails due to allocations of the maple tree nodes, the error path in __split_vma() calls new->vm_ops->close(new). The page accounting for hugetlb is actually in the close() operation, so it accounts for the removal of 1/2 of the VMA which was not adjusted. This results in a negative exit value. To avoid the negative charge, set vm_start = vm_end and vm_pgoff = 0. There is also a potential accounting issue in special mappings from insert_vm_struct() failing to allocate, so reverse the charge there in the failure scenario. Signed-off-by: Liam R. Howlett Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Alexandre Frade commit ebce1b587998b6f088897b1bf0946b5c7b5d993f Author: Liam R. Howlett Date: Tue Sep 6 19:48:45 2022 +0000 lib/test_maple_tree: add testing for maple tree This is a test suite that uses the radix test infrastructure. It has been split into its own commit to allow for easier review of the maple tree code. The testing includes: - Allocation of nodes - gfp flag allocation checks - Expansion & contraction of tree - preallocation checks - tree navigation by next/prev - tree navigation by iterators (mas_for_each, etc) - Number of nodes for a given number of entries - Generic tree construction tests - Addition and removal of entries in forward and reverse numerical indexes - gap searching both forward and reverse - Combining gaps by overwriting entries in different ways - splitting right-most node - splitting left-most node - overwriting multiple slots - overwriting across different levels of the tree - overwriting the middle of a tree - causing a 3-way split up to the root by overwriting the last slot and first slot of different nodes and spanning different levels - RCU stress testing of the tree with threads - Duplication of the tree by entry count - Tests which were generated by fuzzers have been added. - A large number of tests which come from recording crashing in a VM and reconstructing the tree (see check_erase2_set()) Signed-off-by: Liam R. 
Howlett Signed-off-by: Alexandre Frade commit a6503ef6e91dbf2eb5b7304d4d19f202a1f511a2 Author: Liam R. Howlett Date: Tue Sep 6 19:48:41 2022 +0000 radix tree test suite: add lockdep_is_held to header The maple tree uses lockdep_is_held, so define it as external in the header. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 08478061fc8a0e92cbe0eedd50a67e3049aab720 Author: Liam R. Howlett Date: Tue Sep 6 19:48:41 2022 +0000 radix tree test suite: add support for slab bulk APIs Add support for kmem_cache_free_bulk() and kmem_cache_alloc_bulk() to the radix tree test suite. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 2e772b6e31418ea5a30787bbbf07abf2e1b38445 Author: Liam R. Howlett Date: Tue Sep 6 19:48:40 2022 +0000 radix tree test suite: add allocation counts and size to kmem_cache Add functions to get the number of allocations, and total allocations from a kmem_cache. Also add a function to get the allocated size and a way to zero the total allocations. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 17baf643a6f195529a3d583652204a5ac6f08771 Author: Liam R. Howlett Date: Tue Sep 6 19:48:40 2022 +0000 radix tree test suite: add kmem_cache_set_non_kernel() kmem_cache_set_non_kernel() is a mechanism to allow a certain number of kmem_cache_alloc requests to succeed even when GFP_KERNEL is not set in the flags. This functionality allows for testing different paths through the code. Signed-off-by: Liam R. Howlett Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Alexandre Frade commit 311edb5666d30c7e72056e5af8240ac379c00d22 Author: Liam R. Howlett Date: Tue Sep 6 19:48:39 2022 +0000 radix tree test suite: add pr_err define Define pr_err to printk. Signed-off-by: Liam R. Howlett Signed-off-by: Alexandre Frade commit 239f3ec5d9aa4da25046f1076637cfe8b52f27e9 Author: Liam R. Howlett Date: Tue Sep 6 19:48:39 2022 +0000 Maple Tree: add new data structure Patch series "Introducing the Maple Tree" The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel where a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. Davidlohr said : Yes I like the maple tree, and at this stage I don't think we can ask for : more from this series wrt the MM - albeit there seems to still be some : folks reporting breakage.
Fundamentally I see Liam's work to (re)move : complexity out of the MM (not to say that the actual maple tree is not : complex) by consolidating the three complementary data structures very : much worth it considering performance does not take a hit. This was very : much a turn off with the range locking approach, which worst case scenario : incurred in prohibitive overhead. Also as Liam and Matthew have : mentioned, RCU opens up a lot of nice performance opportunities, and in : addition academia[1] has shown outstanding scalability of address spaces : with the foundation of replacing the locked rbtree with RCU aware trees. Similar work has been discovered in the academic press https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf Sheer coincidence. We designed our tree with the intention of solving the hardest problem first. Upon settling on a b-tree variant and a rough outline, we researched range-based b-trees and RCU b-trees and did find that article. So it was nice to find reassurances that we were on the right path, but our design choice of using ranges made that paper unusable for us. This patch (of 70): The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel where a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. There are additional BUG_ON() calls added within the tree, most of which are in debug code. These will be replaced with a WARN_ON() call in the future. There are also additional BUG_ON() calls within the code which will be reduced in number at a later date. These exist to catch things such as out-of-range accesses which would crash anyways. Signed-off-by: Liam R. Howlett Signed-off-by: Matthew Wilcox (Oracle) Tested-by: David Howells Tested-by: Sven Schnelle Signed-off-by: Alexandre Frade
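As an illustrative aside (not part of the patch above): the basic maple tree interface that the rest of this series builds on can be sketched as below. The exported names (DEFINE_MTREE(), mtree_store_range(), mtree_load(), mtree_erase()) come from the new data structure; the surrounding function is a minimal, hypothetical user, not code from the series.

  /* Minimal sketch of the maple tree as a non-overlapping range -> pointer
   * map. Everything outside the mtree_*() calls is illustrative only.
   */
  #include <linux/maple_tree.h>
  #include <linux/gfp.h>
  #include <linux/errno.h>

  static DEFINE_MTREE(mt);        /* empty tree with internal locking */

  static int maple_tree_example(void *entry)
  {
          int ret;

          /* Map the range [0x1000, 0x1fff] to 'entry'. */
          ret = mtree_store_range(&mt, 0x1000, 0x1fff, entry, GFP_KERNEL);
          if (ret)
                  return ret;

          /* Any index inside the range returns the same entry... */
          if (mtree_load(&mt, 0x1234) != entry)
                  return -EINVAL;

          /* ...and a hole returns NULL, which is the behaviour the
           * vma_lookup()-to-mtree_load() conversion above relies on. */
          if (mtree_load(&mt, 0x3000) != NULL)
                  return -EINVAL;

          mtree_erase(&mt, 0x1000);       /* drop the whole range */
          return 0;
  }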
commit 83f537f0ea9808c675aa4631b5ad7194b00a17fb Author: Yu Zhao Date: Sun Sep 18 02:00:11 2022 -0600 mm: multi-gen LRU: design doc Add a design doc. Signed-off-by: Yu Zhao Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Signed-off-by: Alexandre Frade commit a8ee626cf48f570618b6f3d564be877679ade34d Author: Yu Zhao Date: Sun Sep 18 02:00:10 2022 -0600 mm: multi-gen LRU: admin guide Add an admin guide. Signed-off-by: Yu Zhao Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Acked-by: Mike Rapoport Signed-off-by: Alexandre Frade commit a40c369b5bf900cc97e4712f8969963dee151ed4 Author: Yu Zhao Date: Sun Sep 18 02:00:09 2022 -0600 mm: multi-gen LRU: debugfs interface Add /sys/kernel/debug/lru_gen for working set estimation and proactive reclaim. These techniques are commonly used to optimize job scheduling (bin packing) in data centers [1][2]. Compared with the page table-based approach and the PFN-based approach, this lruvec-based approach has the following advantages: 1. It offers better choices because it is aware of memcgs, NUMA nodes, shared mappings and unmapped page cache. 2. It is more scalable because it is O(nr_hot_pages), whereas the PFN-based approach is O(nr_total_pages). Add /sys/kernel/debug/lru_gen_full for debugging. [1] https://dl.acm.org/doi/10.1145/3297858.3304053 [2] https://dl.acm.org/doi/10.1145/3503222.3507731 Signed-off-by: Yu Zhao Reviewed-by: Qi Zheng Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Signed-off-by: Alexandre Frade commit a91dd6b74270a59256c36fbd4d501bd2bb4e1d17 Author: Yu Zhao Date: Sun Sep 18 02:00:08 2022 -0600 mm: multi-gen LRU: thrashing prevention Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as requested by many desktop users [1]. When set to value N, it prevents the working set of N milliseconds from getting evicted. The OOM killer is triggered if this working set cannot be kept in memory. Based on the average human detectable lag (~100ms), N=1000 usually eliminates intolerable lags due to thrashing. Larger values like N=3000 make lags less noticeable at the risk of premature OOM kills. Compared with the size-based approach [2], this time-based approach has the following advantages: 1. It is easier to configure because it is agnostic to applications and memory sizes. 2. It is more reliable because it is directly wired to the OOM killer.
[1] https://lore.kernel.org/r/Ydza%2FzXKY9ATRoh6@google.com/ [2] https://lore.kernel.org/r/20101028191523.GA14972@google.com/ Signed-off-by: Yu Zhao Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Signed-off-by: Alexandre Frade commit ca93c00d18a01c0f331ea43bd455330852846d6e Author: Yu Zhao Date: Sun Sep 18 02:00:07 2022 -0600 mm: multi-gen LRU: kill switch Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that can be disabled include:
  0x0001: the multi-gen LRU core
  0x0002: walking page table, when arch_has_hw_pte_young() returns true
  0x0004: clearing the accessed bit in non-leaf PMD entries, when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
  [yYnN]: apply to all the components above
E.g.,
  echo y >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0007
  echo 5 >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0005
NB: the page table walks happen on the scale of seconds under heavy memory pressure, in which case the mmap_lock contention is a lesser concern, compared with the LRU lock contention and the I/O congestion. So far the only well-known case of the mmap_lock contention happens on Android, due to Scudo [1] which allocates several thousand VMAs for merely a few hundred MBs. The SPF and the Maple Tree also have provided their own assessments [2][3]. However, if walking page tables does worsen the mmap_lock contention, the kill switch can be used to disable it. In this case the multi-gen LRU will suffer a minor performance degradation, as shown previously. Clearing the accessed bit in non-leaf PMD entries can also be disabled, since this behavior was not tested on x86 varieties other than Intel and AMD. [1] https://source.android.com/devices/tech/debug/scudo [2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/ [3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/ Signed-off-by: Yu Zhao Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Signed-off-by: Alexandre Frade commit 040561f776b5db0af90ae911e7eea6c53539ca53 Author: Yu Zhao Date: Sun Sep 18 02:00:06 2022 -0600 mm: multi-gen LRU: optimize multiple memcgs When multiple memcgs are available, it is possible to use generations as a frame of reference to make better choices and improve overall performance under global memory pressure. This patch adds a basic optimization to select memcgs that can drop single-use unmapped clean pages first. Doing so reduces the chance of going into the aging path or swapping, which can be costly. A typical example that benefits from this optimization is a server running mixed types of workloads, e.g., heavy anon workload in one memcg and heavy buffered I/O workload in the other. Though this optimization can be applied to both kswapd and direct reclaim, it is only added to kswapd to keep the patchset manageable. Later improvements may cover the direct reclaim path.
While ensuring certain fairness to all eligible memcgs, proportional scans of individual memcgs also require proper backoff to avoid overshooting their aggregate reclaim target by too much. Otherwise it can cause high direct reclaim latency. The conditions for backoff are: 1. At low priorities, for direct reclaim, if aging fairness or direct reclaim latency is at risk, i.e., aging one memcg multiple times or swapping after the target is met. 2. At high priorities, for global reclaim, if per-zone free pages are above respective watermarks. Server benchmark results: Mixed workloads: fio (buffered I/O): +[19, 21]% IOPS BW patch1-8: 1880k 7343MiB/s patch1-9: 2252k 8796MiB/s memcached (anon): +[119, 123]% Ops/sec KB/sec patch1-8: 862768.65 33514.68 patch1-9: 1911022.12 74234.54 Mixed workloads: fio (buffered I/O): +[75, 77]% IOPS BW 5.19-rc1: 1279k 4996MiB/s patch1-9: 2252k 8796MiB/s memcached (anon): +[13, 15]% Ops/sec KB/sec 5.19-rc1: 1673524.04 65008.87 patch1-9: 1911022.12 74234.54 Configurations: (changes since patch 6) cat mixed.sh modprobe brd rd_nr=2 rd_size=56623104 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 mkfs.ext4 /dev/ram1 mount -t ext4 /dev/ram1 /mnt memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=90m --group_reporting & pid=$! sleep 200 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed kill -INT $pid wait Client benchmark results: no change (CONFIG_MEMCG=n) Signed-off-by: Yu Zhao Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Signed-off-by: Alexandre Frade commit d69a18903af8fe6f62d38178b69efc8dafba6ddf Author: Yu Zhao Date: Sun Sep 18 02:00:05 2022 -0600 mm: multi-gen LRU: support page table walks To further exploit spatial locality, the aging prefers to walk page tables to search for young PTEs and promote hot pages. A kill switch will be added in the next patch to disable this behavior. When disabled, the aging relies on the rmap only. NB: this behavior has nothing similar with the page table scanning in the 2.4 kernel [1], which searches page tables for old PTEs, adds cold pages to swapcache and unmaps them. To avoid confusion, the term "iteration" specifically means the traversal of an entire mm_struct list; the term "walk" will be applied to page tables and the rmap, as usual. An mm_struct list is maintained for each memcg, and an mm_struct follows its owner task to the new memcg when this task is migrated. Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls walk_page_range() with each mm_struct on this list to promote hot pages before it increments max_seq. When multiple page table walkers iterate the same list, each of them gets a unique mm_struct; therefore they can run concurrently. 
Page table walkers ignore any misplaced pages, e.g., if an mm_struct was migrated, pages it left in the previous memcg will not be promoted when its current memcg is under reclaim. Similarly, page table walkers will not promote pages from nodes other than the one under reclaim. This patch uses the following optimizations when walking page tables: 1. It tracks the usage of mm_struct's between context switches so that page table walkers can skip processes that have been sleeping since the last iteration. 2. It uses generational Bloom filters to record populated branches so that page table walkers can reduce their search space based on the query results, e.g., to skip page tables containing mostly holes or misplaced pages. 3. It takes advantage of the accessed bit in non-leaf PMD entries when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y. 4. It does not zigzag between a PGD table and the same PMD table spanning multiple VMAs. IOW, it finishes all the VMAs within the range of the same PMD table before it returns to a PGD table. This improves the cache performance for workloads that have large numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5. Server benchmark results: Single workload: fio (buffered I/O): no change Single workload: memcached (anon): +[8, 10]% Ops/sec KB/sec patch1-7: 1147696.57 44640.29 patch1-8: 1245274.91 48435.66 Configurations: no change Client benchmark results: kswapd profiles: patch1-7 48.16% lzo1x_1_do_compress (real work) 8.20% page_vma_mapped_walk (overhead) 7.06% _raw_spin_unlock_irq 2.92% ptep_clear_flush 2.53% __zram_bvec_write 2.11% do_raw_spin_lock 2.02% memmove 1.93% lru_gen_look_around 1.56% free_unref_page_list 1.40% memset patch1-8 49.44% lzo1x_1_do_compress (real work) 6.19% page_vma_mapped_walk (overhead) 5.97% _raw_spin_unlock_irq 3.13% get_pfn_folio 2.85% ptep_clear_flush 2.42% __zram_bvec_write 2.08% do_raw_spin_lock 1.92% memmove 1.44% alloc_zspage 1.36% memset Configurations: no change Thanks to the following developers for their efforts [3]. kernel test robot [1] https://lwn.net/Articles/23732/ [2] https://llvm.org/docs/ScudoHardenedAllocator.html [3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/ Signed-off-by: Yu Zhao Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Signed-off-by: Alexandre Frade commit ea5691aa840a1b015b6e3006bd89f2afcbc8ab50 Author: Yu Zhao Date: Sun Sep 18 02:00:04 2022 -0600 mm: multi-gen LRU: exploit locality in rmap Searching the rmap for PTEs mapping each page on an LRU list (to test and clear the accessed bit) can be expensive because pages from different VMAs (PA space) are not cache friendly to the rmap (VA space). For workloads mostly using mapped pages, searching the rmap can incur the highest CPU cost in the reclaim path. This patch exploits spatial locality to reduce the trips into the rmap. When shrink_page_list() walks the rmap and finds a young PTE, a new function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent PTEs. On finding another young PTE, it clears the accessed bit and updates the gen counter of the page mapped by this PTE to (max_seq%MAX_NR_GENS)+1. 
Server benchmark results: Single workload: fio (buffered I/O): no change Single workload: memcached (anon): +[3, 5]% Ops/sec KB/sec patch1-6: 1106168.46 43025.04 patch1-7: 1147696.57 44640.29 Configurations: no change Client benchmark results: kswapd profiles: patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next patch1-7 48.16% lzo1x_1_do_compress (real work) 8.20% page_vma_mapped_walk (overhead) 7.06% _raw_spin_unlock_irq 2.92% ptep_clear_flush 2.53% __zram_bvec_write 2.11% do_raw_spin_lock 2.02% memmove 1.93% lru_gen_look_around 1.56% free_unref_page_list 1.40% memset Configurations: no change Signed-off-by: Yu Zhao Acked-by: Barry Song Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Signed-off-by: Alexandre Frade commit 40829ecfbf9e4ab4c41d414f7e8ee9dff5d049ee Author: Yu Zhao Date: Sun Sep 18 02:00:03 2022 -0600 mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. 
It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <>/etc/memcached.conf </sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Signed-off-by: Yu Zhao Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Signed-off-by: Alexandre Frade commit 22dd51189c1f4c2713888c468e7dbfe2f8191248 Author: Yu Zhao Date: Sun Sep 18 02:00:02 2022 -0600 mm: multi-gen LRU: groundwork Evictable pages are divided into multiple generations for each lruvec. 
The youngest generation number is stored in lrugen->max_seq for both anon and file types as they are aged on an equal footing. The oldest generation numbers are stored in lrugen->min_seq[] separately for anon and file types as clean file pages can be evicted regardless of swap constraints. These three variables are monotonically increasing. Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into the gen counter in folio->flags. Each truncated generation number is an index to lrugen->lists[]. The sliding window technique is used to track at least MIN_NR_GENS and at most MAX_NR_GENS generations. The gen counter stores a value within [1, MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it stores 0. There are two conceptually independent procedures: "the aging", which produces young generations, and "the eviction", which consumes old generations. They form a closed-loop system, i.e., "the page reclaim". Both procedures can be invoked from userspace for the purposes of working set estimation and proactive reclaim. These techniques are commonly used to optimize job scheduling (bin packing) in data centers [1][2]. To avoid confusion, the terms "hot" and "cold" will be applied to the multi-gen LRU, as a new convention; the terms "active" and "inactive" will be applied to the active/inactive LRU, as usual. The protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels: one through page tables and the other through file descriptors. The protection of the former channel is by design stronger because: 1. The uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit. 2. The cost of evicting the former channel is higher due to the TLB flushes required and the likelihood of encountering the dirty bit. 3. The penalty of underprotecting the former channel is higher because applications usually do not prepare themselves for major page faults like they do for blocked I/O. E.g., GUI applications commonly use dedicated I/O threads to avoid blocking rendering threads. There are also two access patterns: one with temporal locality and the other without. For the reasons listed above, the former channel is assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is present; the latter channel is assumed to follow the latter pattern unless outlying refaults have been observed [3][4]. The next patch will address the "outlying refaults". Three macros, i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in this patch to make the entire patchset less diffy. A page is added to the youngest generation on faulting. The aging needs to check the accessed bit at least twice before handing this page over to the eviction. The first check takes care of the accessed bit set on the initial fault; the second check makes sure this page has not been used since then. This protocol, AKA second chance, requires a minimum of two generations, hence MIN_NR_GENS. 
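As an editorial aside before the references below: a minimal, self-contained C sketch of the bookkeeping just described, namely the order_base_2() truncation of generation numbers, the gen counter that stores 0 when a folio is off the lists, and the order_base_2(N) tier index for file-descriptor accesses. All constants, field widths and helper names are illustrative assumptions for a userspace model, not the kernel's actual MGLRU definitions.

    #include <assert.h>
    #include <stdio.h>

    #define MIN_NR_GENS  2U
    #define MAX_NR_GENS  4U
    #define GEN_BITS     3U   /* order_base_2(MAX_NR_GENS + 1) = order_base_2(5) */
    #define GEN_MASK     ((1U << GEN_BITS) - 1)

    /* Smallest order such that (1 << order) >= n; 0 for n <= 1. */
    static unsigned int order_base_2(unsigned int n)
    {
            unsigned int order = 0;

            while ((1U << order) < n)
                    order++;
            return order;
    }

    /* Truncate a monotonically increasing seq into an index for lrugen->lists[]. */
    static unsigned int lru_gen_from_seq(unsigned long seq)
    {
            return (unsigned int)(seq % MAX_NR_GENS);
    }

    /* The stored counter is gen + 1 so that 0 can mean "not on any list". */
    static unsigned int gen_counter(unsigned long seq, int on_list)
    {
            return on_list ? ((lru_gen_from_seq(seq) + 1) & GEN_MASK) : 0;
    }

    int main(void)
    {
            unsigned long min_seq = 4, max_seq = 7;   /* sliding window of generations */
            unsigned int n;

            /* At least MIN_NR_GENS and at most MAX_NR_GENS generations are kept. */
            assert(max_seq - min_seq + 1 >= MIN_NR_GENS);
            assert(max_seq - min_seq + 1 <= MAX_NR_GENS);

            printf("youngest list index: %u, stored counter: %u\n",
                   lru_gen_from_seq(max_seq), gen_counter(max_seq, 1));

            /* Tier for a page accessed N times through file descriptors. */
            for (n = 1; n <= 4; n++)
                    printf("N=%u -> tier %u\n", n, order_base_2(n));
            return 0;
    }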
[1] https://dl.acm.org/doi/10.1145/3297858.3304053 [2] https://dl.acm.org/doi/10.1145/3503222.3507731 [3] https://lwn.net/Articles/495543/ [4] https://lwn.net/Articles/815342/ Signed-off-by: Yu Zhao Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Signed-off-by: Alexandre Frade commit c4bf71f3815cff823f737a64ef0df57bcbf0dbe1 Author: Yu Zhao Date: Sun Sep 18 02:00:01 2022 -0600 Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller" This patch undoes the following refactor: commit 289ccba18af4 ("include/linux/mm_inline.h: fold __update_lru_size() into its sole caller") The upcoming changes to include/linux/mm_inline.h will reuse __update_lru_size(). Signed-off-by: Yu Zhao Reviewed-by: Miaohe Lin Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Signed-off-by: Alexandre Frade commit 64c694c4f4f94bc4bbd759c6bf35609ebf948690 Author: Yu Zhao Date: Sun Sep 18 02:00:00 2022 -0600 mm/vmscan.c: refactor shrink_node() This patch refactors shrink_node() to improve readability for the upcoming changes to mm/vmscan.c. Signed-off-by: Yu Zhao Reviewed-by: Barry Song Reviewed-by: Miaohe Lin Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Signed-off-by: Alexandre Frade commit 2cc2cfac3505a2a99cfe2ea44504d6bad55c8fff Author: Yu Zhao Date: Sun Sep 18 01:59:59 2022 -0600 mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Some architectures support the accessed bit in non-leaf PMD entries, e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it as part of linear address translation [1]. Page table walkers that clear the accessed bit may use this capability to reduce their search space. Note that: 1. Although an inline function is preferable, this capability is added as a configuration option for consistency with the existing macros. 2. Due to the little interest in other varieties, this capability was only tested on Intel and AMD CPUs. Thanks to the following developers for their efforts [2][3]. 
Randy Dunlap Stephen Rothwell [1]: Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3 (June 2021), section 4.8 [2] https://lore.kernel.org/r/bfdcc7c8-922f-61a9-aa15-7e7250f04af7@infradead.org/ [3] https://lore.kernel.org/r/20220413151513.5a0d7a7e@canb.auug.org.au/ Signed-off-by: Yu Zhao Reviewed-by: Barry Song Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Signed-off-by: Alexandre Frade commit 70410b381a4fe3499e4f3190ff95605603d3dc45 Author: Yu Zhao Date: Sun Sep 18 01:59:58 2022 -0600 mm: x86, arm64: add arch_has_hw_pte_young() Some architectures automatically set the accessed bit in PTEs, e.g., x86 and arm64 v8.2. On architectures that do not have this capability, clearing the accessed bit in a PTE usually triggers a page fault following the TLB miss of this PTE (to emulate the accessed bit). Being aware of this capability can help make better decisions, e.g., whether to spread the work out over a period of time to reduce bursty page faults when trying to clear the accessed bit in many PTEs. Note that theoretically this capability can be unreliable, e.g., hotplugged CPUs might be different from builtin ones. Therefore it should not be used in architecture-independent code that involves correctness, e.g., to determine whether TLB flushes are required (in combination with the accessed bit). Signed-off-by: Yu Zhao Reviewed-by: Barry Song Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Acked-by: Will Deacon Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Signed-off-by: Alexandre Frade commit 480fefb8f23ab4ba068efb0924a051387f678386 Author: Alexandre Frade Date: Mon Mar 21 21:37:19 2022 +0000 i2c: busses: Add SMBus capability to work with OpenRGB driver control Signed-off-by: Alexandre Frade commit 337e4ff50f3ac6d7aab1fc487b3720cf7ec2a6f0 Author: Mark Weiman Date: Sun Aug 12 11:36:21 2018 -0400 pci: Enable overrides for missing ACS capabilities This is an updated version of Alex Williamson's patch from: https://lkml.org/lkml/2013/5/30/513 Original commit message follows: PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that allows us to control whether transactions are allowed to be redirected in various subnodes of a PCIe topology. For instance, if two endpoints are below a root port or downstream switch port, the downstream port may optionally redirect transactions between the devices, bypassing upstream devices. The same can happen internally on multifunction devices. The transaction may never be visible to the upstream devices. One upstream device that we particularly care about is the IOMMU. If a redirection occurs in the topology below the IOMMU, then the IOMMU cannot provide isolation between devices. This is why the PCIe spec encourages topologies to include ACS support. Without it, we have to assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation. Unfortunately, far too many topologies do not support ACS to make this a steadfast requirement. Even the latest chipsets from Intel are only sporadically supporting ACS.
We have trouble getting interconnect vendors to include the PCIe spec required PCIe capability, let alone suggested features. Therefore, we need to add some flexibility. The pcie_acs_override= boot option lets users opt in specific devices or sets of devices to assume ACS support. The "downstream" option assumes full ACS support on root ports and downstream switch ports. The "multifunction" option assumes the subset of ACS features available on multifunction endpoints and upstream switch ports is supported. The "id:nnnn:nnnn" option enables ACS support on devices matching the provided vendor and device IDs, allowing more strategic ACS overrides. These options may be combined in any order. A maximum of 16 id-specific overrides are available. It's suggested to use the most limited set of options necessary to avoid completely disabling ACS across the topology. Note to hardware vendors: we have facilities to permanently quirk specific devices which enforce isolation but do not provide an ACS capability. Please contact me to have your devices added and save your customers the hassle of this boot option. Rebased-by: Alexandre Frade Signed-off-by: Mark Weiman Signed-off-by: Alexandre Frade commit 7214178c2707ebc42cbda691b8c9d5190d3e5b2e Author: Serge Hallyn Date: Fri May 31 19:12:12 2013 +0100 sysctl: add sysctl to disallow unprivileged CLONE_NEWUSER by default add sysctl to disallow unprivileged CLONE_NEWUSER by default This is a short-term patch. Unprivileged use of CLONE_NEWUSER is certainly an intended feature of user namespaces. However, for at least saucy we want to make sure that, if any security issues are found, we have a fail-safe. Signed-off-by: Serge Hallyn [bwh: Remove unneeded binary sysctl bits] [bwh: Keep this sysctl, but change the default to enabled] Signed-off-by: Alexandre Frade commit 7255ae89e43265a3d01c1156433e701e4f262ea8 Author: Zebediah Figura Date: Wed Sep 28 03:48:52 2022 +0000 winesync: Introduce the winesync driver and character device patchset Signed-off-by: Alexandre Frade commit 1da1676cd268cec87ebc3ca5138f81e51b23d2b8 Author: Christian Brauner Date: Wed Jan 23 21:54:23 2019 +0100 SAUCE: binder: give binder_alloc its own debug mask file Currently binder.c and binder_alloc.c both register the /sys/module/binder_linux/parameters/debug_mask file, which leads to conflicts in sysfs. This commit gives binder_alloc.c its own /sys/module/binder_linux/parameters/alloc_debug_mask file. Signed-off-by: Christian Brauner Signed-off-by: Seth Forshee Signed-off-by: Alexandre Frade commit d040142a298730620f7624ed54cf61c338c10562 Author: Christian Brauner Date: Wed Jan 16 23:13:25 2019 +0100 SAUCE: binder: turn into module The Android binder driver needs to become a module for the sake of shipping Anbox.
To do this we need to export the following functions since binder is currently still using them: - security_binder_set_context_mgr() - security_binder_transaction() - security_binder_transfer_binder() - security_binder_transfer_file() - can_nice() - __wake_up_pollfree() - __close_fd_get_file() - task_work_add() - map_kernel_range_noflush() - get_vm_area() - zap_page_range() - put_ipc_ns() - get_ipc_ns_exported() - show_init_ipc_ns() Rebased-by: Alexandre Frade Signed-off-by: Christian Brauner [ saf: fix additional reference to init_ipc_ns from 5.0-rc6 ] Signed-off-by: Seth Forshee Signed-off-by: Alexandre Frade commit ad732bef4221561528675f64d3ffb6a69669e572 Author: Arjan van de Ven Date: Wed May 17 01:52:11 2017 +0000 init: wait for partition and retry scan Because Clear Linux boots fast, the device is not ready when the mounting code is reached, so retry the device scan every 0.5 sec for at least 40 sec and synchronize with the async task. Signed-off-by: Miguel Bernal Marin Signed-off-by: Alexandre Frade commit 6ff1c2983c54662306a8835d6eb09d063e8b4ef1 Author: Arjan van de Ven Date: Thu Jun 2 23:36:32 2016 -0500 drivers: initialize ata before graphics ATA init is the long pole in the boot process, and it's asynchronous. Move the graphics init after it so that ATA and graphics initialize in parallel. Signed-off-by: Alexandre Frade commit 4a47a48ed1e6df19a62de3fe2b0a56685b49e33c Author: Arjan van de Ven Date: Sun Feb 18 23:35:41 2018 +0000 locking: rwsem: spin faster Tweak rwsem owner spinning a bit. Signed-off-by: Alexandre Frade commit b1ffe7a97035f5f0638e89411f8275306c29ec71 Author: William Douglas Date: Wed Jun 20 17:23:21 2018 +0000 firmware: Enable stateless firmware loading Prefer the order of specific version before generic and /etc before /lib to enable the user to give specific overrides for generic firmware and distribution firmware. Signed-off-by: Alexandre Frade commit 8408ab7f909873344ac0b12478fff8a0fff3dae4 Author: Arjan van de Ven Date: Sun Sep 22 11:12:35 2019 -0300 intel_rapl: Silence rapl trace debug Signed-off-by: Alexandre Frade commit 58db8824f8d7b598961275348e544bffbcc5a16d Author: Alexandre Frade Date: Wed Dec 8 11:55:28 2021 +0000 netfilter: Add full cone NAT support Link: https://github.com/llccd/netfilter-full-cone-nat Signed-off-by: Alexandre Frade commit c6d0bc235374d5581af4bb56b40f20496a42a651 Author: Felix Fietkau Date: Sat Dec 5 15:07:03 2015 +0100 mac80211: ignore AP power level when tx power type is "fixed" In some cases a user might want to connect to a far-away access point, which announces a low tx power limit. Using the AP's power limit can make the connection significantly more unstable or even impossible, and mac80211 currently provides no way to disable this behavior. To fix this, use the currently unused distinction between limited and fixed tx power to decide whether a remote AP's power limit should be accepted. Signed-off-by: Felix Fietkau Signed-off-by: Alexandre Frade commit fc0293a37f06f318bf11c1a999d5b19dbd41d4d6 Author: Konstantin Demin Date: Tue May 17 10:10:40 2022 +0300 net-tcp_bbr: v2: Use correct 64-bit division Signed-off-by: Konstantin Demin Signed-off-by: Alexandre Frade commit 5fdaa51bd85b915a661db7565a06b2d7189effa5 Author: Adithya Abraham Philip Date: Fri Jun 11 21:56:10 2021 +0000 net-tcp_bbr: v2: Fix missing ECT markings on retransmits for BBRv2 Adds a new flag TCP_ECN_ECT_PERMANENT that is used by CCAs to indicate that retransmitted packets and pure ACKs must have the ECT bit set.
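A hedged sketch of the idea behind TCP_ECN_ECT_PERMANENT: the CCA opts in once, and the transmit path keeps ECT on retransmits and pure ACKs only when the flag is present. The struct, the flag value and the function names below are invented stand-ins for illustration, not the kernel's ECN code.

    #include <stdbool.h>
    #include <stdio.h>

    /* Simplified stand-ins for kernel state; only the flag logic matters here. */
    #define TCP_ECN_OK             0x1
    #define TCP_ECN_ECT_PERMANENT  0x8   /* value is an assumption */

    struct conn {
            unsigned int ecn_flags;
    };

    /* A CCA that needs ECT on retransmits and pure ACKs sets the flag once,
     * when the connection is initialised. */
    static void cca_init(struct conn *c)
    {
            c->ecn_flags |= TCP_ECN_ECT_PERMANENT;
    }

    /* Decide whether an outgoing segment gets the ECT codepoint.  Without the
     * flag, retransmissions and pure ACKs are sent as not-ECT; with it, they
     * keep ECT as BBRv2 expects. */
    static bool send_with_ect(const struct conn *c, bool is_retransmit,
                              bool is_pure_ack)
    {
            if (!(c->ecn_flags & TCP_ECN_OK))
                    return false;
            if ((is_retransmit || is_pure_ack) &&
                !(c->ecn_flags & TCP_ECN_ECT_PERMANENT))
                    return false;
            return true;
    }

    int main(void)
    {
            struct conn c = { .ecn_flags = TCP_ECN_OK };

            printf("retransmit before flag: ECT=%d\n", send_with_ect(&c, true, false));
            cca_init(&c);
            printf("retransmit after flag:  ECT=%d\n", send_with_ect(&c, true, false));
            return 0;
    }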
The new flag is a necessary fix for BBRv2, which when using ECN expects ECT to be set even on retransmitted packets and ACKs. Currently CCAs like BBRv2 which can use ECN but don't "need" it do not have a way to indicate that ECT should be set on retransmissions/ACKs. Signed-off-by: Adithya Abraham Philip Signed-off-by: Neal Cardwell Signed-off-by: Alexandre Frade commit e878e3a184ea25e4a1ee0caaa9f9d3aff0ba60bc Author: Neal Cardwell Date: Mon Dec 28 19:23:09 2020 -0500 net-tcp_bbr: v2: don't assume prior_cwnd was set entering CA_Loss Fix WARN_ON_ONCE() warnings that were firing and pointing to a bbr->prior_cwnd of 0 when exiting CA_Loss and transitioning to CA_Open. The issue was that tcp_simple_retransmit() calls: tcp_set_ca_state(sk, TCP_CA_Loss); without first calling icsk_ca_ops->ssthresh(sk) (because tcp_simple_retransmit() is dealing with losses due to MTU issues and not congestion). The lack of this callback means that BBR did not get a chance to set bbr->prior_cwnd, and thus upon exiting CA_Loss in such cases the WARN_ON_ONCE() would fire due to a zero bbr->prior_cwnd. This commit removes that warning, since a bbr->prior_cwnd of 0 is a valid situation in this state transition. For setting inflight_lo upon entering CA_Loss, to avoid setting an inflight_lo of 0 in this case, this commit switches to taking the max of cwnd and prior_cwnd. We plan to remove that line of code when we switch to cautious (PRR-style) recovery, so that awkwardness will go away. Change-Id: I575dce871c2f20e91e3e9449e1706f42a07b8118 Signed-off-by: Alexandre Frade commit a920e713a1247e6dcac25d7aa46504c58ff7232d Author: Neal Cardwell Date: Mon Aug 17 19:10:21 2020 -0400 net-tcp_bbr: v2: remove cycle_rand parameter that is unused in BBRv2 Change-Id: Iee1df7e41e42de199068d7c89131ed3d228327c0 Signed-off-by: Alexandre Frade commit d5cb3035f239657a060848c05cc8c4e4ce55e4bf Author: Neal Cardwell Date: Mon Aug 17 19:08:41 2020 -0400 net-tcp_bbr: v2: remove field bw_rtts that is unused in BBRv2 Change-Id: I58e3346c707748a6f316f3ed060d2da84c32a79b Signed-off-by: Alexandre Frade commit c21f9f1d317d9ea7cbc02a2d6b5b9ef2df73832f Author: Neal Cardwell Date: Thu Nov 21 15:28:01 2019 -0500 net-tcp_bbr: v2: remove unnecessary rs.delivered_ce logic upon loss There is no reason to compute rs.delivered_ce upon loss. In fact, we specifically do not want to compute rs.delivered_ce upon loss. Two issues: (1) This would be the wrong thing to do, in behavior terms. With RACK's dynamic reordering window, losses can be marked long after the sequence hole appears in the ACK/SACK stream. We want to catch the ECN mark rate rising too high as quickly as possible, which means we want to check for high ECN mark rates at ACK time (as BBRv2 currently does) and not at loss marking time. (2) This is dead code. The ECN mark rate cannot be detected as too high because the check needs rs->delivered to be > 0 as well: if (rs->delivered_ce > 0 && rs->delivered > 0 && Since we are not setting rs->delivered upon loss, this check cannot succeed, so setting delivered_ce is pointless. This dead and wrong line was discovered by Randall Stewart at Netflix as he was reading the BBRv2 code. Change-Id: I37f83f418a259ec31d8f82de986db071b364b76a Signed-off-by: Alexandre Frade commit bb7e932939b6781aeb7ddd48c82f2e12c80a13ec Author: Neal Cardwell Date: Tue Jun 11 12:54:22 2019 -0400 net-tcp_bbr: v2: BBRv2 ("bbr2") congestion control for Linux TCP BBR v2 is an enhancement to the BBR v1 algorithm.
It's designed to aim for lower queues, lower loss, and better Reno/CUBIC coexistence than BBR v1. BBR v2 maintains the core of BBR v1: an explicit model of the network path that is two-dimensional, adapting to estimate the (a) maximum available bandwidth and (b) maximum safe volume of data a flow can keep in-flight in the network. It maintains the estimated BDP as a core guide for estimating an appropriate level of in-flight data. BBR v2 makes several key enhancements: o Its bandwidth-probing time scale is adapted, within bounds, to allow improved coexistence with Reno and CUBIC. The bandwidth-probing time scale is (a) extended dynamically based on estimated BDP to improve coexistence with Reno/CUBIC; (b) bounded by an interactive wall-clock time-scale to be more scalable and responsive than Reno and CUBIC. o Rather than being largely agnostic to loss and ECN marks, it explicitly uses loss and (DCTCP-style) ECN signals to maintain its model. o It aims for lower losses than v1 by adjusting its model to attempt to stay within loss rate and ECN mark rate bounds (loss_thresh and ecn_thresh, respectively). o It adapts to loss/ECN signals even when the application is running out of data ("application-limited"), in case the "application-limited" flow is also "network-limited" (the bw and/or inflight available to this flow is lower than previously estimated when the flow ran out of data). o It has a three-part model: the model explicitly tracks three operating points, where an operating point is a tuple: (bandwidth, inflight). The three operating points are: o latest: the latest measurement from the current round trip o upper bound: robust, optimistic, long-term upper bound o lower bound: robust, conservative, short-term lower bound These are stored in the following state variables: o latest: bw_latest, inflight_latest o lo: bw_lo, inflight_lo o hi: bw_hi[2], inflight_hi To gain intuition about the meaning of the three operating points, it may help to consider the analogs in CUBIC, which has a somewhat analogous three-part model used by its probing state machine (BBR param ~ CUBIC param): latest ~ cwnd, lo ~ ssthresh, hi ~ last_max_cwnd. The analogy is only a loose one, though, since the BBR operating points are calculated differently, and are 2-dimensional (bw,inflight) rather than CUBIC's one-dimensional notion of operating point (inflight). o It uses the three-part model to adapt the magnitude of its bandwidth probing to match the estimated space available in the buffer, rather than (as in BBR v1) assuming that it was always acceptable to place 0.25*BDP in the bottleneck buffer when probing (commodity datacenter switches commonly do not have that much buffer for WAN flows). When BBR v2 estimates it hit a buffer limit during probing, its bandwidth probing then starts gently in case little space is still available in the buffer, and then accelerates, slowly at first and then rapidly if it can grow inflight without seeing congestion signals. In such cases, probing is bounded by inflight_hi + inflight_probe, where inflight_probe grows as: [0, 1, 2, 4, 8, 16,...]. This allows BBR to keep losses low and bounded if a bottleneck remains congested, while rapidly/scalably utilizing free bandwidth when it becomes available. o It has a slightly revised state machine, to achieve the goals above.
BBR_BW_PROBE_UP: pushes up inflight to probe for bw/vol BBR_BW_PROBE_DOWN: drain excess inflight from the queue BBR_BW_PROBE_CRUISE: use pipe, w/ headroom in queue/pipe BBR_BW_PROBE_REFILL: try to refill the pipe again to 100%, leaving queue empty o The estimated BDP: BBR v2 continues to maintain an estimate of the path's two-way propagation delay, by tracking a windowed min_rtt, and coordinating (on an as-needed basis) to try to expose the two-way propagation delay by draining the bottleneck queue. BBR v2 continues to use its min_rtt and (currently-applicable) bandwidth estimate to estimate the current bandwidth-delay product. The estimated BDP still provides one important guideline for bounding inflight data. However, because any min-filtered RTT and max-filtered bw inherently tend to both overestimate, the estimated BDP is often too high; in this case loss or ECN marks can ensue, in which case BBR v2 adjusts inflight_hi and inflight_lo to adapt its sending rate and inflight down to match the available capacity of the path. o Space: Note that ICSK_CA_PRIV_SIZE increased. This is because BBR v2 requires more space. Note that much of the space is due to support for per-socket parameterization and debugging in this release, for research and debugging purposes. With that state removed, the full "struct bbr" is 140 bytes, or 144 with padding. This is an increase of 40 bytes over the existing ca_priv space. o Code: BBR v2 reuses many pieces from BBR v1. But it omits the following significant pieces: o "packet conservation" (bbr_set_cwnd_to_recover_or_restore(), bbr_can_grow_inflight()) o long-term bandwidth estimator ("policer mode") The code layout tries to keep BBR v2 code near the bottom of the file, so that v1-applicable code in the top does not accidentally refer to v2 code. o Docs: See the following docs for more details and diagrams describing the BBR v2 algorithm: https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00 https://datatracker.ietf.org/meeting/102/materials/slides-102-iccrg-an-update-on-bbr-work-at-google-00 o Internal notes: For this upstream rebase, Neal started from: git show fed518041ac6:net/ipv4/tcp_bbr.c > net/ipv4/tcp_bbr.c then removed dev instrumentation (dynamic get/set for parameters) and code that was only used by BBRv1. Effort: net-tcp_bbr Origin-9xx-SHA1: 2c84098e60bed6d67dde23cd7538c51dee273102 Change-Id: I125cf26ba2a7a686f2fa5e87f4c2afceb65f7a05 Signed-off-by: Alexandre Frade commit 91caa8f8a008dcd718330918c9be47457e3ca4e3 Author: Neal Cardwell Date: Sat Nov 16 13:16:25 2019 -0500 net-tcp: add fast_ack_mode=1: skip rwin check in tcp_fast_ack_mode__tcp_ack_snd_check() Add logic for an experimental TCP connection behavior, enabled with tp->fast_ack_mode = 1, which disables checking the receive window before sending an ACK in __tcp_ack_snd_check(). If this behavior is enabled, the data receiver sends an ACK if the amount of data is > RCV.MSS. Change-Id: Iaa0a0fd7108221f883137a79d5bfa724f1b096d4 Signed-off-by: Alexandre Frade commit fbb5ead9dfa8d968c07213963fce009145b1cc62 Author: Neal Cardwell Date: Fri Sep 27 17:10:26 2019 -0400 net-tcp: re-generalize TSO sizing in TCP CC module API Reorganize the API for CC modules so that the CC module once again gets complete control of the TSO sizing decision. This is how the API was set up around 2016, at the time of the initial BBRv1 upstreaming. Later Eric Dumazet simplified it. But with wider testing it now seems that to avoid CPU regressions BBR needs to have a different TSO sizing function.
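As an aside on the TSO sizing API change above: a minimal userspace sketch of a CC ops table that owns the burst-size decision, contrasting a fixed-burst CUBIC-like policy with a pacing-rate-scaled BBR-like policy. Names, units and the 1 ms burst target are assumptions, not the kernel's tso_segs interface.

    #include <stdio.h>

    /* Minimal model of a congestion-control ops table that owns TSO sizing.
     * Names and units are illustrative, not the kernel API. */
    struct cc_ops {
            const char *name;
            /* Return the number of MSS-sized segments to bundle per TSO burst. */
            unsigned int (*tso_segs)(unsigned long pacing_rate_bytes, unsigned int mss);
    };

    /* Reno/CUBIC-style policy: no pacing constraint, fixed large bursts. */
    static unsigned int cubic_tso_segs(unsigned long rate, unsigned int mss)
    {
            (void)rate;
            (void)mss;
            return 64;
    }

    /* BBR-style policy: size bursts to roughly 1 ms worth of the pacing rate,
     * so flows with a small bandwidth share do not send oversized bursts. */
    static unsigned int bbr_tso_segs(unsigned long rate, unsigned int mss)
    {
            unsigned long bytes_per_ms = rate / 1000;
            unsigned int segs = (unsigned int)(bytes_per_ms / mss);

            return segs < 2 ? 2 : segs;
    }

    int main(void)
    {
            const struct cc_ops table[] = {
                    { "cubic-like", cubic_tso_segs },
                    { "bbr-like",   bbr_tso_segs },
            };
            unsigned long rate = 5UL * 1000 * 1000;   /* 5 MB/s pacing rate */

            for (unsigned int i = 0; i < 2; i++)
                    printf("%s: %u segs per burst\n", table[i].name,
                           table[i].tso_segs(rate, 1448));
            return 0;
    }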
A different TSO sizing function is necessary to handle cases where there are many flows bottlenecked on the sender host's NIC, in which case BBR's pacing rate is much lower than CUBIC/Reno/DCTCP's. Why does this happen? Because BBR's pacing rate adapts to the low bandwidth share each flow sees. By contrast, CUBIC/Reno/DCTCP see no loss or ECN, so they grow a very large cwnd, and thus large pacing rate and large TSO burst size. Change-Id: Ic8ccfdbe4010ee8d4bf6a6334c48a2fceb2171ea Signed-off-by: Alexandre Frade commit d4d33365985a432653e41d8d64ed0746717c7bcf Author: Yousuk Seung Date: Wed May 23 17:55:54 2018 -0700 net-tcp: add new ca opts flag TCP_CONG_WANTS_CE_EVENTS Add a new ca opts flag TCP_CONG_WANTS_CE_EVENTS that allows a congestion control module to receive CE events. Currently congestion control modules have to set the TCP_CONG_NEEDS_ECN bit in the opts flag to receive CE events, but this may incur changes in ECN behavior elsewhere. This patch adds a new bit TCP_CONG_WANTS_CE_EVENTS that allows congestion control modules to receive CE events independently of TCP_CONG_NEEDS_ECN. Effort: net-tcp Origin-9xx-SHA1: 9f7e14716cde760bc6c67ef8ef7e1ee48501d95b Change-Id: I2255506985242f376d910c6fd37daabaf4744f24 Signed-off-by: Alexandre Frade commit deaeafb1fc9751591b039902f7223d0d790fd6ae Author: Neal Cardwell Date: Tue May 7 22:37:19 2019 -0400 net-tcp_bbr: v2: set tx.in_flight for skbs in repair write queue Syzkaller was able to use TCP_REPAIR to reproduce the new warning added in tcp_fragment(): WARNING: CPU: 0 PID: 118174 at net/ipv4/tcp_output.c:1487 tcp_fragment+0xdcc/0x10a0 net/ipv4/tcp_output.c:1487() inconsistent: tx.in_flight: 0 old_factor: 53 The warning happens because skbs inserted into the tcp_rtx_queue during the repair process go through a sort of "fake send" process, and that process was setting pcount but not tx.in_flight, and thus the warnings (where old_factor is the old pcount). The fix of setting tx.in_flight in the TCP_REPAIR code path seems simple enough, and indeed makes the repro code from syzkaller stop producing warnings. Running through kokonut tests, and will send out for review when all tests pass. Effort: net-tcp_bbr Origin-9xx-SHA1: 330f825a08a6fe92cef74d799cc468864c479f63 Change-Id: I0bc4a790f040fd4239620e1eedd5dc64666c6f05 Signed-off-by: Alexandre Frade commit 12b1c175804db31fa7254d19d70543a2f2c9b13e Author: Neal Cardwell Date: Wed May 1 20:16:25 2019 -0400 net-tcp_bbr: v2: adjust skb tx.in_flight upon split in tcp_fragment() When we fragment an skb that has already been sent, we need to update the tx.in_flight for the first skb in the resulting pair ("buff"). Because we were not updating the tx.in_flight, the tx.in_flight value was inconsistent with the pcount of the "buff" skb (tx.in_flight would be too high). That meant that if the "buff" skb was lost, then bbr2_inflight_hi_from_lost_skb() would calculate an inflight_hi value that is too high. This could result in longer queues and higher packet loss. Packetdrill testing verified that without this commit, when the second half of an skb is SACKed and then later the first half of that skb is marked lost, the calculated inflight_hi was incorrect.
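Before the trailers, a hedged model of the invariant behind this fix: tx.in_flight is a per-skb snapshot of the segments in flight when that skb was sent, so splitting a sent skb must shrink the earlier fragment's snapshot by the later fragment's pcount. The struct and helpers below are illustrative; they model the arithmetic, not the actual tcp_fragment() patch.

    #include <assert.h>
    #include <stdio.h>

    /* Model of an skb's transmit-time snapshot: pcount segments, and the number
     * of segments that were in flight after this skb was sent (tx.in_flight). */
    struct skb_model {
            unsigned int pcount;
            unsigned int in_flight;
    };

    /* Split a sent skb into two fragments while keeping the snapshot
     * consistent: the earlier fragment logically finished sending before the
     * later fragment's segments, so its in_flight shrinks by the later
     * fragment's pcount; the later fragment inherits the original value. */
    static void split_sent_skb(const struct skb_model *orig,
                               unsigned int first_pcount,
                               struct skb_model *first, struct skb_model *second)
    {
            assert(first_pcount < orig->pcount);

            first->pcount     = first_pcount;
            second->pcount    = orig->pcount - first_pcount;
            second->in_flight = orig->in_flight;
            first->in_flight  = orig->in_flight - second->pcount;
    }

    /* What a BBRv2-style estimator derives when a fragment is marked lost:
     * the inflight level just before that fragment was sent. */
    static unsigned int inflight_prev(const struct skb_model *s)
    {
            return s->in_flight - s->pcount;
    }

    int main(void)
    {
            struct skb_model orig = { .pcount = 53, .in_flight = 80 };
            struct skb_model first, second;

            split_sent_skb(&orig, 20, &first, &second);
            printf("first:  pcount=%u in_flight=%u prev=%u\n",
                   first.pcount, first.in_flight, inflight_prev(&first));
            printf("second: pcount=%u in_flight=%u prev=%u\n",
                   second.pcount, second.in_flight, inflight_prev(&second));
            /* The earlier fragment's baseline matches the original skb's. */
            assert(inflight_prev(&first) == orig.in_flight - orig.pcount);
            return 0;
    }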
Effort: net-tcp_bbr Origin-9xx-SHA1: 385f1ddc610798fab2837f9f372857438b25f874 Change-Id: I617f8cab4e9be7a0b8e8d30b047bf8645393354d Signed-off-by: Alexandre Frade commit 39fc2887016214a4545104617df4443991c6e751 Author: Neal Cardwell Date: Wed May 1 20:16:33 2019 -0400 net-tcp_bbr: v2: adjust skb tx.in_flight upon merge in tcp_shifted_skb() When tcp_shifted_skb() updates state as adjacent SACKed skbs are coalesced, previously the tx.in_flight was not adjusted, so we could get contradictory state where the skb's recorded pcount was bigger than the tx.in_flight (the number of segments that were in_flight after sending the skb). Normally, having a SACKed skb with contradictory pcount/tx.in_flight would not matter. However, with SACK reneging, the SACKed bit is removed, and an skb once again becomes eligible for retransmitting, fragmenting, SACKing, etc. Packetdrill testing verified the following sequence is possible in a kernel that does not have this commit: - skb N is SACKed - skb N+1 is SACKed and combined with skb N using tcp_shifted_skb() - tcp_shifted_skb() will increase the pcount of prev, but leave tx.in_flight as-is - so prev skb can have pcount > tx.in_flight - RTO, tcp_timeout_mark_lost(), detect reneg, remove "SACKed" bit, mark skb N as lost - find pcount of skb N is greater than its tx.in_flight I suspect this issue is what caused the bbr2_inflight_hi_from_lost_skb(): WARN_ON_ONCE(inflight_prev < 0) to fire in production machines using bbr2. Tested: See last commit in series for sponge link. Effort: net-tcp_bbr Origin-9xx-SHA1: 1a3e997e613d2dcf32b947992882854ebe873715 Change-Id: I1b0b75c27519953430c7db51c6f358f104c7af55 Signed-off-by: Alexandre Frade commit 6dfefa381bc618d23e08322a8cd09a56a1326657 Author: Neal Cardwell Date: Tue May 7 22:36:36 2019 -0400 net-tcp_bbr: v2: factor out tx.in_flight setting into tcp_set_tx_in_flight() Factor out the code to set an skb's tx.in_flight field into its own function, so that this code can be used for the TCP_REPAIR "fake send" code path that inserts skbs into the rtx queue without sending them. This is in preparation for the following patch, which fixes an issue with TCP_REPAIR and tx.in_flight. Tested: See last patch in series for sponge link. Effort: net-tcp_bbr Origin-9xx-SHA1: e880fc907d06ea7354333f60f712748ebce9497b Change-Id: I4fbd4a6e18a51ab06d50ab1c9ad820ce5bea89af Signed-off-by: Alexandre Frade commit 2877a7d99b16f3c51e6c33d8a3edd9e0f6c49e8b Author: Neal Cardwell Date: Tue Aug 7 21:52:06 2018 -0400 net-tcp_bbr: v2: introduce ca_ops->skb_marked_lost() CC module callback API For connections experiencing reordering, RACK can mark packets lost long after we receive the SACKs/ACKs hinting that the packets were actually lost. This means that CC modules cannot easily learn the volume of inflight data at which packet loss happens by looking at the current inflight or even the packets in flight when the most recently SACKed packet was sent. To learn this, CC modules need to know how many packets were in flight at the time lost packets were sent. This new callback, combined with TCP_SKB_CB(skb)->tx.in_flight, allows them to learn this. This also provides a consistent callback that is invoked whether packets are marked lost upon ACK processing, using the RACK reordering timer, or at RTO time.
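A minimal sketch of what a ca_ops->skb_marked_lost()-style hook can look like from the CC module's side: one optional callback invoked on every loss-marking path, handed the skb's transmit-time snapshot. The struct layout, the signature and the inflight_hi update rule are assumptions for illustration, not the actual BBRv2 code.

    #include <stdio.h>

    /* Per-skb transmit state the callback is allowed to read. */
    struct skb_tx_state {
            unsigned int pcount;
            unsigned int in_flight;   /* segments in flight when this skb was sent */
    };

    /* Minimal stand-in for a CC ops table with the new optional hook. */
    struct cc_ops {
            void (*skb_marked_lost)(void *ca_priv, const struct skb_tx_state *tx);
    };

    struct bbr_like_state {
            unsigned int inflight_hi;
    };

    /* A BBRv2-style consumer: learn the inflight level at which loss occurred,
     * regardless of whether RACK, ACK processing, or an RTO marked the skb. */
    static void bbr_like_skb_marked_lost(void *ca_priv, const struct skb_tx_state *tx)
    {
            struct bbr_like_state *st = ca_priv;

            if (tx->in_flight && tx->in_flight < st->inflight_hi)
                    st->inflight_hi = tx->in_flight;
    }

    /* Loss-marking path: one consistent place to notify the CC module. */
    static void mark_skb_lost(const struct cc_ops *ops, void *ca_priv,
                              const struct skb_tx_state *tx)
    {
            if (ops->skb_marked_lost)
                    ops->skb_marked_lost(ca_priv, tx);
    }

    int main(void)
    {
            struct bbr_like_state st = { .inflight_hi = 1000 };
            struct cc_ops ops = { .skb_marked_lost = bbr_like_skb_marked_lost };
            struct skb_tx_state tx = { .pcount = 4, .in_flight = 120 };

            mark_skb_lost(&ops, &st, &tx);
            printf("inflight_hi after loss: %u\n", st.inflight_hi);
            return 0;
    }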
Effort: net-tcp_bbr Origin-9xx-SHA1: afcbebe3374e4632ac6714d39e4dc8a8455956f4 Change-Id: I54826ab53df636be537e5d3c618a46145d12d51a Signed-off-by: Alexandre Frade commit 75175c39041a718cc82b2c6809e7c50c8a122a2d Author: Neal Cardwell Date: Mon Nov 19 13:48:36 2018 -0500 net-tcp_bbr: v2: export FLAG_ECE in rate_sample.is_ece For understanding the relationship between inflight and ECN signals, to try to find the highest inflight value that has acceptable levels of ECN marking. Effort: net-tcp_bbr Origin-9xx-SHA1: 3eba998f2898541406c2666781182200934965a8 Change-Id: I3a964e04cee83e11649a54507043d2dfe769a3b3 Signed-off-by: Alexandre Frade commit 9daa252ceecaeaf4a84748a24522d52066019613 Author: Neal Cardwell Date: Thu Oct 12 23:44:27 2017 -0400 net-tcp_bbr: v2: count packets lost over TCP rate sampling interval For understanding the relationship between inflight and packet loss signals, to try to find the highest inflight value that has acceptable levels of packet losses. Effort: net-tcp_bbr Origin-9xx-SHA1: 4527e26b2bd7756a88b5b9ef1ada3da33dd609ab Change-Id: I594c2500868d9c530770e7ddd68ffc87c57f4fd5 Signed-off-by: Alexandre Frade commit 067a40adfd89f42c0ed883bc22c258b42f585c86 Author: Neal Cardwell Date: Sat Aug 5 11:49:50 2017 -0400 net-tcp_bbr: v2: snapshot packets in flight at transmit time and pass in rate_sample For understanding the relationship between inflight and losses or ECN signals, to try to find the highest inflight value that has acceptable levels of loss/ECN marking. Effort: net-tcp_bbr Origin-9xx-SHA1: b3eb4f2d20efab4ca001f32c9294739036c493ea Change-Id: I7314047d0ff14dd261a04b1969a46dc658c8836a Signed-off-by: Alexandre Frade commit d4ff8ef8f37a41b26812436e595f57d94feaed44 Author: Neal Cardwell Date: Sun Jun 24 21:55:59 2018 -0400 net-tcp_bbr: v2: shrink delivered_mstamp, first_tx_mstamp to u32 to free up 8 bytes Free up some space for tracking inflight and losses for each bw sample, in upcoming commits. These timestamps are in microseconds, and are now stored in 32 bits. So they can only hold time intervals up to roughly 2^12 = 4096 seconds. But Linux TCP RTT and RTO tracking has the same 32-bit microsecond implementation approach and resulting deployment limitations. So this is not introducing a new limit. And these should not be a limitation for the foreseeable future. Effort: net-tcp_bbr Origin-9xx-SHA1: 238a7e6b5d51625fef1ce7769826a7b21b02ae55 Change-Id: I3b779603797263b52a61ad57c565eb91fe42680c Signed-off-by: Alexandre Frade commit 9f28645f1b74c855e9e21d9220ce9ed06e42092d Author: Neal Cardwell Date: Tue Jun 11 12:26:55 2019 -0400 net-tcp_bbr: broaden app-limited rate sample detection This commit is a bug fix for the Linux TCP app-limited (application-limited) logic that is used for collecting rate (bandwidth) samples. Previously the app-limited logic only looked for "bubbles" of silence in between application writes, by checking at the start of each sendmsg. But "bubbles" of silence can also happen before retransmits: e.g. bubbles can happen between an application write and a retransmit, or between two retransmits. Retransmits are triggered by ACKs or timers. So this commit checks for bubbles of app-limited silence upon ACKs or timers. Why does this commit check for app-limited state at the start of ACKs and timer handling? Because at that point we know whether inflight was fully using the cwnd. While processing the ACK or timer event we often change the cwnd; after changing the cwnd we can't know whether inflight was fully using the old cwnd.
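A hedged sketch of the app-limited check this commit broadens: run it before cwnd is changed (at the start of ACK or timer processing) and mark the connection app-limited when there is nothing left to send and inflight is below cwnd. Field names, conditions and the marker value are simplified assumptions modeled loosely on tcp_rate_check_app_limited(), not the kernel function itself.

    #include <stdio.h>

    /* Minimal connection model for the app-limited check. */
    struct conn {
            unsigned int packets_in_flight;
            unsigned int cwnd;
            unsigned int unsent_bytes;   /* data queued but not yet sent */
            unsigned int mss;
            unsigned long delivered;     /* cumulative delivered count */
            unsigned long app_limited;   /* 0 means not application-limited */
    };

    /* Run at the start of ACK or timer processing, before cwnd is changed:
     * if the flow has (almost) nothing left to send and is not using its full
     * cwnd, mark subsequent delivery samples as app-limited so a model-based
     * CC does not mistake them for the path's capacity. */
    static void check_app_limited(struct conn *c)
    {
            if (c->unsent_bytes < c->mss && c->packets_in_flight < c->cwnd) {
                    unsigned long mark = c->delivered + c->packets_in_flight;

                    c->app_limited = mark ? mark : 1;
            }
    }

    int main(void)
    {
            struct conn c = {
                    .packets_in_flight = 3, .cwnd = 10,
                    .unsent_bytes = 0, .mss = 1448,
                    .delivered = 500, .app_limited = 0,
            };

            check_app_limited(&c);
            /* Delivery samples taken until 'delivered' passes this marker are
             * treated as app-limited. */
            printf("app_limited marker: %lu\n", c.app_limited);
            return 0;
    }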
Origin-9xx-SHA1: 3fe9b53291e018407780fb8c356adb5666722cbc Change-Id: I37221506f5166877c2b110753d39bb0757985e68 Signed-off-by: Alexandre Frade commit 0b4908df8f7da168774ec74960938c8dabd9d5ef Author: André Almeida Date: Mon Oct 25 09:49:42 2021 -0300 futex: Add entry point for FUTEX_WAIT_MULTIPLE (opcode 31) Add an option to wait on multiple futexes using the old interface, which uses opcode 31 through the futex() syscall. Do that by just translating the old interface to use the new code. This allows old and stable versions of Proton to still use fsync in new kernel releases. Signed-off-by: André Almeida Signed-off-by: Alexandre Frade commit 6cbaa601f6074e14c6d8aeb0bf163f793f314f58 Author: Alexandre Frade Date: Tue Aug 30 02:26:20 2022 +0000 XANMOD: Makefile: Move ARM and x86 instruction set selection to kernel-wide build Signed-off-by: Alexandre Frade commit 8972dc123250bdd9ec35a496246b4d05e203349a Author: graysky Date: Tue Mar 15 05:58:43 2022 -0400 x86/kconfig: more uarches for kernel 5.17+ FEATURES This patch adds additional CPU options to the Linux kernel accessible under: Processor type and features ---> Processor family ---> With the release of gcc 11.1 and clang 12.0, several generic 64-bit levels are offered which are good for supported Intel or AMD CPUs: • x86-64-v2 • x86-64-v3 • x86-64-v4 Users of glibc 2.33 and above can see which level is supported by current hardware by running: /lib/ld-linux-x86-64.so.2 --help | grep supported Alternatively, compare the flags from /proc/cpuinfo to this list.[1] CPU-specific microarchitectures include: • AMD Improved K8-family • AMD K10-family • AMD Family 10h (Barcelona) • AMD Family 14h (Bobcat) • AMD Family 16h (Jaguar) • AMD Family 15h (Bulldozer) • AMD Family 15h (Piledriver) • AMD Family 15h (Steamroller) • AMD Family 15h (Excavator) • AMD Family 17h (Zen) • AMD Family 17h (Zen 2) • AMD Family 19h (Zen 3)† • Intel Silvermont low-power processors • Intel Goldmont low-power processors (Apollo Lake and Denverton) • Intel Goldmont Plus low-power processors (Gemini Lake) • Intel 1st Gen Core i3/i5/i7 (Nehalem) • Intel 1.5 Gen Core i3/i5/i7 (Westmere) • Intel 2nd Gen Core i3/i5/i7 (Sandybridge) • Intel 3rd Gen Core i3/i5/i7 (Ivybridge) • Intel 4th Gen Core i3/i5/i7 (Haswell) • Intel 5th Gen Core i3/i5/i7 (Broadwell) • Intel 6th Gen Core i3/i5/i7 (Skylake) • Intel 6th Gen Core i7/i9 (Skylake X) • Intel 8th Gen Core i3/i5/i7 (Cannon Lake) • Intel 10th Gen Core i7/i9 (Ice Lake) • Intel Xeon (Cascade Lake) • Intel Xeon (Cooper Lake)* • Intel 3rd Gen 10nm++ i3/i5/i7/i9-family (Tiger Lake)* • Intel 3rd Gen 10nm++ Xeon (Sapphire Rapids)‡ • Intel 11th Gen i3/i5/i7/i9-family (Rocket Lake)‡ • Intel 12th Gen i3/i5/i7/i9-family (Alder Lake)‡ Notes: If not otherwise noted, gcc >=9.1 is required for support. *Requires gcc >=10.1 or clang >=10.0 †Requires gcc >=10.3 or clang >=12.0 ‡Requires gcc >=11.1 or clang >=12.0 It also offers to compile passing the 'native' option which, "selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine. Using -march=native enables all instruction subsets supported by the local machine and will produce code optimized for the local machine under the constraints of the selected instruction set."[2] Users of Intel CPUs should select the 'Intel-Native' option and users of AMD CPUs should select the 'AMD-Native' option. MINOR NOTES RELATING TO INTEL ATOM PROCESSORS This patch also changes -march=atom to -march=bonnell in accordance with the gcc v4.9 changes.
Upstream is using the deprecated -march=atom flag when I believe it should use the newer -march=bonnell flag for atom processors.[3] It is not recommended to compile on Atom-CPUs with the 'native' option.[4] The recommendation is to use the 'atom' option instead. BENEFITS Small but real speed increases are measurable using a make endpoint comparing a generic kernel to one built with one of the respective microarchs. See the following experimental evidence supporting this statement: https://github.com/graysky2/kernel_gcc_patch REQUIREMENTS linux version 5.17+ gcc version >=9.0 or clang version >=9.0 ACKNOWLEDGMENTS This patch builds on the seminal work by Jeroen.[5] REFERENCES 1. https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/77566eb03bc6a326811cb7e9 2. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-x86-Options 3. https://bugzilla.kernel.org/show_bug.cgi?id=77461 4. https://github.com/graysky2/kernel_gcc_patch/issues/15 5. http://www.linuxforge.net/docs/linux/linux-gcc.php Signed-off-by: graysky Signed-off-by: Alexandre Frade commit bfe5811007122c3fdf16ccd1f28188d230a47940 Author: Alexandre Frade Date: Mon Aug 29 16:47:26 2022 +0000 XANMOD: Makefile: Disable GCC vectorization on trees Signed-off-by: Alexandre Frade commit 38286c07d1c1aac0f9fa9ca0c0db64f9ab515f38 Author: Alexandre Frade Date: Thu Jun 25 16:40:43 2020 -0300 XANMOD: lib/kconfig.debug: disable default CONFIG_SYMBOLIC_ERRNAME and CONFIG_DEBUG_BUGVERBOSE Signed-off-by: Alexandre Frade Signed-off-by: Alexandre Frade commit 4f60f999f3dd98ec4158e8d5bf9884222c09c315 Author: Alexandre Frade Date: Sun May 29 00:57:40 2022 +0000 XANMOD: scripts/setlocalversion: remove "+" tag for git repo short version Signed-off-by: Alexandre Frade commit f56b3ee1c3731bf19ad348c30511907846a3e812 Author: Alexandre Frade Date: Tue Mar 31 13:32:08 2020 -0300 XANMOD: cpufreq: tunes ondemand and conservative governor for performance Signed-off-by: Alexandre Frade Signed-off-by: Alexandre Frade commit 1bae2d9d2ab41c5c9534c5cab2b03f14ffc458ec Author: Alexandre Frade Date: Mon Jan 29 17:31:25 2018 +0000 XANMOD: mm/vmscan: vm_swappiness = 30 decreases the amount of swapping Signed-off-by: Alexandre Frade Signed-off-by: Alexandre Frade commit 00f5d3b2399701be23e4ad86d45571458a5a4edf Author: Alexandre Frade Date: Wed Jun 15 17:07:29 2022 +0000 XANMOD: sched/autogroup: Add kernel parameter and config option to enable/disable autogroup feature by default Signed-off-by: Alexandre Frade commit 2ddcdffce635b1e46b2fe2022f4c6b604d4d6eed Author: Alexandre Frade Date: Mon Jan 29 16:59:22 2018 +0000 XANMOD: dcache: cache_pressure = 50 decreases the rate at which VFS caches are reclaimed Signed-off-by: Alexandre Frade Signed-off-by: Alexandre Frade commit babc130da078cc40e9478a6952b1cc9930ade00b Author: Alexandre Frade Date: Mon Jan 29 17:26:15 2018 +0000 XANMOD: kconfig: add 500Hz timer interrupt kernel config option Signed-off-by: Alexandre Frade Signed-off-by: Alexandre Frade commit 8be2745bb660a5dd7aec0a809532979711e8f1a9 Author: Alexandre Frade Date: Mon Dec 14 16:24:26 2020 +0000 XANMOD: block: set rq_affinity to force full multithreading I/O requests Signed-off-by: Alexandre Frade commit 6771df72370f4fd4dfd8756f63164e7b7482116e Author: Alexandre Frade Date: Wed May 11 18:56:51 2022 +0000 XANMOD: block/mq-deadline: Increase write priority to improve responsiveness Signed-off-by: Alexandre Frade commit 0e6988b22730a1ea931143e1129e3b541746165a Author: Alexandre Frade Date: Thu Jan 6 16:59:01 2022 +0000 XANMOD: block/mq-deadline: Disable front_merges by
default Signed-off-by: Alexandre Frade commit be5d689b5789177a940f7718ce382632a1b93222 Author: Alexandre Frade Date: Fri Mar 25 22:36:34 2022 +0000 XANMOD: Change rcutree.kthread_prio to SCHED_RR policy Signed-off-by: Alexandre Frade commit 68ac4bebd5c79bfdfa6793d1019e6bad357e9d2d Author: Alexandre Frade Date: Mon Aug 1 01:49:22 2022 +0000 XANMOD: fair: Remove all energy efficiency functions Signed-off-by: Alexandre Frade commit 192c74f1b5a8e36e7d6428dd9ed4c38cfa7f5af3 Author: Alexandre Frade Date: Mon Aug 29 17:02:28 2022 +0000 XANMOD: x86/build: Add more x86_64 optimizations Signed-off-by: Alexandre Frade commit 4fe89d07dcc2804c8b562f6c7896a45643d34b2f Author: Linus Torvalds Date: Sun Oct 2 14:09:07 2022 -0700 Linux 6.0