commit 4beda543e40efb4a8521e0fa4ea92461e01437c8
Author: Alexandre Frade
Date: Thu Jul 8 22:50:13 2021 +0000

Linux 5.13.1-rt1-xanmod1

Signed-off-by: Alexandre Frade

commit e9d81aaaa078e4addf5cc855c30b877fafb43ff7
Merge: 3d13cb77d7e0 7e175e6b5997
Author: Alexandre Frade
Date: Thu Jul 8 20:01:09 2021 +0000

Merge tag 'v5.13-rt1' into 5.13

v5.13-rt1 ResurrexiT!

commit 7e175e6b59975c8901ad370f7818937f68de45c1
Author: Thomas Gleixner
Date: Fri Jul 8 20:25:16 2011 +0200

Add localversion for -RT release

Signed-off-by: Thomas Gleixner

commit 6f3f622fe87aee711020341478979d247ea605f6
Author: Sebastian Andrzej Siewior
Date: Fri Oct 11 13:14:41 2019 +0200

POWERPC: Allow to enable RT

Allow RT to be selected.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 646e1ac9bc99e20cf7a5374b6d877d2796373945
Author: Sebastian Andrzej Siewior
Date: Fri Jan 8 19:48:21 2021 +0100

powerpc: Avoid recursive header includes

- The include of bug.h leads to an include of printk.h, which gets back to spinlock.h and then complains about a missing xchg(). Remove bug.h and add bits.h, which is needed for BITS_PER_BYTE.
- Avoid the "please don't include this file directly" error from rwlock-rt. Allow an include from/with rtmutex.h.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit a1a11e9c8b5c222baeae02e027da0f1ae6669397
Author: Sebastian Andrzej Siewior
Date: Tue Mar 26 18:31:29 2019 +0100

powerpc/stackprotector: work around stack-guard init from atomic

This is invoked from the secondary CPU in atomic context. On x86 we use the TSC instead; on Power we XOR it against mftb(), so let's use the stack address as the initial value.

Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit f35ef21b2f18a5367912423e4e17546db005c405
Author: Bogdan Purcareata
Date: Fri Apr 24 15:53:13 2015 +0000

powerpc/kvm: Disable in-kernel MPIC emulation for PREEMPT_RT

While converting the openpic emulation code to use a raw_spinlock_t enables guests to run on RT, there's still a performance issue. For interrupts sent in directed delivery mode with a multiple-CPU mask, the emulated openpic will loop through all of the VCPUs, and for each VCPU it calls IRQ_check, which loops through all the pending interrupts for that VCPU. This is done while holding the raw_lock, meaning that for all this time interrupts and preemption are disabled on the host Linux. A malicious user app can max out both of these numbers and cause a DoS.

This temporary fix is sent for two reasons. First, so that users who want to use the in-kernel MPIC emulation are aware of the potential latencies, thus making sure that the hardware MPIC and their usage scenario do not involve interrupts sent in directed delivery mode and that the number of possible pending interrupts is kept small. Secondly, this should incentivize the development of a proper openpic emulation that would be better suited for RT.

Acked-by: Scott Wood
Signed-off-by: Bogdan Purcareata
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit d91a27cdd893b229345d8f27075baea4c866f365
Author: Sebastian Andrzej Siewior
Date: Tue Mar 26 18:31:54 2019 +0100

powerpc/pseries/iommu: Use a locallock instead local_irq_save()

The locallock protects the per-CPU variable tce_page. The function attempts to allocate memory while tce_page is protected (by disabling interrupts). Use local_irq_save() instead of local_irq_disable().

Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
commit 9d22762227ecb4ff0ba2f789d1129047fdef1c14
Author: Sebastian Andrzej Siewior
Date: Fri Jul 26 11:30:49 2019 +0200

powerpc: traps: Use PREEMPT_RT

Add PREEMPT_RT to the backtrace if enabled.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit dd5ac564bb5e626d0f783f1862656f01f0b151a5
Author: Sebastian Andrzej Siewior
Date: Fri Oct 11 13:14:35 2019 +0200

ARM64: Allow to enable RT

Allow RT to be selected.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit fb140d3f82d429d64ab0a7ef14751e3979ed3910
Author: Sebastian Andrzej Siewior
Date: Fri Oct 11 13:14:29 2019 +0200

ARM: Allow to enable RT

Allow RT to be selected.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 2b7f161a51bc94e090dfb6d50e6873d4378f20e1
Author: Sebastian Andrzej Siewior
Date: Wed Jul 25 14:02:38 2018 +0200

arm64: fpsimd: Delay freeing memory in fpsimd_flush_thread()

fpsimd_flush_thread() invokes kfree() via sve_free() within a preempt-disabled section, which does not work on -RT. Delay freeing of memory until preemption is enabled again.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit b7f7c9df58ce626e8f2883a329d1ab4d6b3514b4
Author: Josh Cartwright
Date: Thu Feb 11 11:54:01 2016 -0600

KVM: arm/arm64: downgrade preempt_disable()d region to migrate_disable()

kvm_arch_vcpu_ioctl_run() disables the use of preemption when updating the vgic and timer states to prevent the calling task from migrating to another CPU. It does so to prevent the task from writing to the incorrect per-CPU GIC distributor registers. On -rt kernels, it's possible to maintain the same guarantee with the use of migrate_{disable,enable}(), with the added benefit that the migrate-disabled region is preemptible. Update kvm_arch_vcpu_ioctl_run() to do so.

Cc: Christoffer Dall
Reported-by: Manish Jaggi
Signed-off-by: Josh Cartwright
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit ae21940445d60c64c9e1f4efa2d1fa74ec913d2a
Author: Yadi.hu
Date: Wed Dec 10 10:32:09 2014 +0800

ARM: enable irq in translation/section permission fault handlers

Probably happens on all ARM, with CONFIG_PREEMPT_RT and CONFIG_DEBUG_ATOMIC_SLEEP.

This simple program:

    int main() { *((char*)0xc0001000) = 0; };

[ 512.742724] BUG: sleeping function called from invalid context at kernel/rtmutex.c:658
[ 512.743000] in_atomic(): 0, irqs_disabled(): 128, pid: 994, name: a
[ 512.743217] INFO: lockdep is turned off.
[ 512.743360] irq event stamp: 0
[ 512.743482] hardirqs last enabled at (0): [< (null)>] (null)
[ 512.743714] hardirqs last disabled at (0): [] copy_process+0x3b0/0x11c0
[ 512.744013] softirqs last enabled at (0): [] copy_process+0x3b0/0x11c0
[ 512.744303] softirqs last disabled at (0): [< (null)>] (null)
[ 512.744631] [] (unwind_backtrace+0x0/0x104)
[ 512.745001] [] (dump_stack+0x20/0x24)
[ 512.745355] [] (__might_sleep+0x1dc/0x1e0)
[ 512.745717] [] (rt_spin_lock+0x34/0x6c)
[ 512.746073] [] (do_force_sig_info+0x34/0xf0)
[ 512.746457] [] (force_sig_info+0x18/0x1c)
[ 512.746829] [] (__do_user_fault+0x9c/0xd8)
[ 512.747185] [] (do_bad_area+0x7c/0x94)
[ 512.747536] [] (do_sect_fault+0x40/0x48)
[ 512.747898] [] (do_DataAbort+0x40/0xa0)
[ 512.748181] Exception stack(0xecaa1fb0 to 0xecaa1ff8)

0xc0000000 belongs to the kernel address space; a user task cannot be allowed to access it.

For the above condition, the correct result is that the test case should receive a "segmentation fault" and exit, not produce the splat above. The root cause is commit 02fe2845d6a8 ("avoid enabling interrupts in prefetch/data abort handlers"): it deletes the irq enable block in the data abort assembly code and moves it into the page/breakpoint/alignment fault handlers instead, but it does not enable irqs in the translation/section permission fault handlers. ARM disables irqs when it enters exception/interrupt mode; if the kernel doesn't enable them, they stay disabled during the translation/section permission fault.

We see the above splat because do_force_sig_info is still called with IRQs off, and that code eventually does a:

    spin_lock_irqsave(&t->sighand->siglock, flags);

As this is architecture-independent code, and we've not seen any other arch need the siglock converted to a raw lock, we can conclude that we should enable irqs for the ARM translation/section permission exception.

Signed-off-by: Yadi.hu
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
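As background for the migrate_disable() conversions above (e.g. the KVM vcpu run path), the general shape of such a change is shown below; a minimal sketch, not the actual KVM code:

    #include <linux/preempt.h>

    /* Sketch: keep the task on this CPU while it touches per-CPU hardware
     * state, but stay preemptible on PREEMPT_RT.
     */
    static void update_percpu_hw_state(void)
    {
            migrate_disable();      /* was: preempt_disable() */
            /* ... update this CPU's GIC/timer state; may sleep on RT locks ... */
            migrate_enable();       /* was: preempt_enable() */
    }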
commit 184bec33a00faa87b948d1b137e8937e9699afe2
Author: Anders Roxell
Date: Thu May 14 17:52:17 2015 +0200

arch/arm64: Add lazy preempt support

arm64 is missing support for PREEMPT_RT. The main feature which is lacking is support for lazy preemption. The arch-specific entry code, thread information structure definitions, and associated data tables have to be extended to provide this support. Then the Kconfig file has to be extended to indicate that the support is available, and also to indicate that support for full RT preemption is now available.

Signed-off-by: Anders Roxell
Signed-off-by: Thomas Gleixner

commit 2f9ba402f708cf7195f29a477901cacbe11089c0
Author: Thomas Gleixner
Date: Thu Nov 1 10:14:11 2012 +0100

powerpc: Add support for lazy preemption

Implement the powerpc pieces for lazy preempt.

Signed-off-by: Thomas Gleixner

commit 95e00217927cdd22e872c6f22b21a8bf1cd02709
Author: Thomas Gleixner
Date: Wed Oct 31 12:04:11 2012 +0100

arm: Add support for lazy preemption

Implement the arm pieces for lazy preempt.

Signed-off-by: Thomas Gleixner

commit f2f9e496208c584356e84e720a3dfd99970ee5e9
Author: Thomas Gleixner
Date: Thu Nov 1 11:03:47 2012 +0100

x86: Support for lazy preemption

Implement the x86 pieces for lazy preempt.

Signed-off-by: Thomas Gleixner

commit 5edd1691b69a1dff3109b39075eb05ed41534005
Author: Sebastian Andrzej Siewior
Date: Tue Jun 30 11:45:14 2020 +0200

x86/entry: Use should_resched() in idtentry_exit_cond_resched()

The TIF_NEED_RESCHED bit is inlined on x86 into the preemption counter. By using should_resched(0) instead of need_resched(), the same check can be performed using only the preempt_count() variable which was already read before. Use should_resched(0) instead of need_resched().

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 2d1c3636a727647cffb1185ccac6897e551a4071
Author: Thomas Gleixner
Date: Fri Oct 26 18:50:54 2012 +0100

sched: Add support for lazy preemption

It has become an obsession to mitigate the determinism vs. throughput loss of RT. Looking at the mainline semantics of preemption points gives a hint why RT sucks throughput-wise for ordinary SCHED_OTHER tasks. One major issue is the wakeup of tasks which right away preempt the waking task while the waking task holds a lock on which the woken task will block right after having preempted the wakee. In mainline this is prevented due to the implicit preemption disable of spin/rw_lock held regions. On RT this is not possible due to the fully preemptible nature of sleeping spinlocks.

Though for a SCHED_OTHER task preempting another SCHED_OTHER task this is really not a correctness issue. RT folks are concerned about SCHED_FIFO/RR task preemption and not about the purely fairness-driven SCHED_OTHER preemption latencies.

So I introduced a lazy preemption mechanism which only applies to SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside of the existing preempt_count, each task now sports a preempt_lazy_count which is manipulated on lock acquisition and release. This is slightly incorrect, as for laziness reasons I coupled this to migrate_disable/enable, so some other mechanisms get the same treatment (e.g. get_cpu_light).

Now on the scheduler side, instead of setting NEED_RESCHED this sets NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and therefore allows the waking task to exit the lock-held region before the woken task preempts it. That also works better for cross-CPU wakeups, as the other side can stay in the adaptive spinning loop.

For RT class preemption there is no change. This simply sets NEED_RESCHED and forgoes the lazy preemption counter.

Initial tests do not expose any observable latency increase, but history shows that I've been proven wrong before :)

The lazy preemption mode is on by default, but with CONFIG_SCHED_DEBUG enabled it can be disabled via:

    # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features

and reenabled via

    # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features

The test results so far are very machine- and workload-dependent, but there is a clear trend that it enhances the non-RT workload performance.

Signed-off-by: Thomas Gleixner
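To make the lazy-preemption scheme above concrete, here is a schematic sketch of the wakeup-side decision; TIF_NEED_RESCHED_LAZY comes from the RT patch, and the SCHED_OTHER test is simplified to the policy field (illustrative only, not the scheduler code):

    #include <linux/sched.h>

    static void request_preemption(struct task_struct *curr,
                                   struct task_struct *waking)
    {
            if (curr->policy == SCHED_NORMAL && waking->policy == SCHED_NORMAL)
                    /* Let the waker leave its lock-held region first. */
                    set_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY);
            else
                    /* RT class wakeups keep the immediate semantics. */
                    set_tsk_thread_flag(curr, TIF_NEED_RESCHED);
    }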
commit 4d507329b3b9318f8fddbc1508ce38d324b3325a
Author: Sebastian Andrzej Siewior
Date: Thu Nov 7 17:49:20 2019 +0100

x86: Enable RT also on 32bit

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 9e6ddccc4933cc2770eafd4974bfa32c32031c9e
Author: Sebastian Andrzej Siewior
Date: Wed Aug 7 18:15:38 2019 +0200

x86: Allow to enable RT

Allow RT to be selected.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 60e58fe04abe7b6cfca0edcb14949141f29f97dc
Author: Thomas Gleixner
Date: Sun Nov 6 12:26:18 2011 +0100

x86: kvm Require const tsc for RT

A non-constant TSC is a nightmare on bare metal already, but with virtualization it becomes a complete disaster because the workarounds are horrible latency-wise. That's also a prerequisite for running RT in a guest on top of an RT host.

Signed-off-by: Thomas Gleixner

commit 0cf7fc66884a8a410ff971fbe3c279feb24d7698
Author: Oleg Nesterov
Date: Tue Jul 14 14:26:34 2015 +0200

signal/x86: Delay calling signals in atomic

On x86_64 we must disable preemption before we enable interrupts for stack faults, int3 and debugging, because the current task is using a per-CPU debug stack defined by the IST. If we schedule out, another task can come in and use the same stack and cause the stack to be corrupted and crash the kernel on return.

When CONFIG_PREEMPT_RT is enabled, spin_locks become mutexes, and one of these is the spin lock used in signal handling. Some of the debug code (int3) causes do_trap() to send a signal. This function takes a spin lock that has been converted to a mutex and has the possibility to sleep. If this happens, the above issues with the corrupted stack are possible.

Instead of calling the signal right away, for PREEMPT_RT and x86_64, the signal information is stored in the task's task_struct and TIF_NOTIFY_RESUME is set. Then on exit of the trap, the signal resume code will send the signal when preemption is enabled.

[ rostedt: Switched from #ifdef CONFIG_PREEMPT_RT to ARCH_RT_DELAYS_SIGNAL_SEND and added comments to the code. ]

Signed-off-by: Oleg Nesterov
Signed-off-by: Steven Rostedt
Signed-off-by: Thomas Gleixner
[bigeasy: also needed on 32bit as per Yang Shi]
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
commit a60399bc57eb8413ec018041b5231eebcd8c6898
Author: Clark Williams
Date: Sat Jul 30 21:55:53 2011 -0500

sysfs: Add /sys/kernel/realtime entry

Add a /sys/kernel entry to indicate that the kernel is a realtime kernel.

Clark says that he needs this for udev rules: udev needs to evaluate whether it's a PREEMPT_RT kernel a few thousand times, and parsing uname output is too slow or so.

Are there better solutions? Should it exist and return 0 on !-rt?

Signed-off-by: Clark Williams
Signed-off-by: Peter Zijlstra
Signed-off-by: Thomas Gleixner

commit 2e2c38e82e2ed71f1c38c1bad9341be63396fa73
Author: Haris Okanovic
Date: Tue Aug 15 15:13:08 2017 -0500

tpm_tis: fix stall after iowrite*()s

ioread8() operations to TPM MMIO addresses can stall the CPU when immediately following a sequence of iowrite*()'s to the same region. For example, cyclictest measures ~400us latency spikes when a non-RT usermode application communicates with an SPI-based TPM chip (Intel Atom E3940 system, PREEMPT_RT kernel). The spikes are caused by a stalling ioread8() operation following a sequence of 30+ iowrite8()s to the same address. I believe this happens because the write sequence is buffered (in the CPU or somewhere along the bus), and gets flushed on the first LOAD instruction (ioread*()) that follows.

The enclosed change appears to fix this issue: read the TPM chip's access register (status code) after every iowrite*() operation to amortize the cost of flushing data to the chip across multiple instructions.

Signed-off-by: Haris Okanovic
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit cf646084521dc80cf95247db9bfd1299e802e498
Author: Thomas Gleixner
Date: Tue Jan 8 21:36:51 2013 +0100

tty/serial/pl011: Make the locking work on RT

The lock is a sleeping lock and local_irq_save() is not the optimisation we are looking for. Redo it to make it work on -RT and non-RT.

Signed-off-by: Thomas Gleixner

commit b42b6e6889e21b06cee7a633c0f1b562e5d56eef
Author: Thomas Gleixner
Date: Thu Jul 28 13:32:57 2011 +0200

tty/serial/omap: Make the locking RT aware

The lock is a sleeping lock and local_irq_save() is not the optimisation we are looking for. Redo it to make it work on -RT and non-RT.

Signed-off-by: Thomas Gleixner

commit f5faab2a86dcb04973c3736f8692db8a05dd8798
Author: Sebastian Andrzej Siewior
Date: Tue Jul 7 12:25:11 2020 +0200

drm/i915/gt: Only disable interrupts for the timeline lock on !force-threaded

According to commit d67739268cf0e ("drm/i915/gt: Mark up the nested engine-pm timeline lock as irqsafe") the interrupts are disabled because the code may be called from an interrupt handler and from preemptible context. With `force_irqthreads' set the timeline mutex is never observed in IRQ context, so it is not needed to disable interrupts. Only disable interrupts if not in `force_irqthreads' mode.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
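The i915 timeline-lock change above boils down to the following pattern; a sketch which assumes the pre-5.15 `force_irqthreads' boolean from <linux/interrupt.h> and is not the actual i915 code:

    #include <linux/interrupt.h>
    #include <linux/spinlock.h>

    static void timeline_lock_sketch(spinlock_t *lock)
    {
            unsigned long flags = 0;

            /* Only disable interrupts when the lock can really be taken from
             * hard-IRQ context, i.e. when interrupts are not force-threaded.
             */
            if (!force_irqthreads)
                    local_irq_save(flags);
            spin_lock(lock);
            /* ... critical section ... */
            spin_unlock(lock);
            if (!force_irqthreads)
                    local_irq_restore(flags);
    }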
commit 872271cf52c484d4f7ccc7fbf44f8861f402b18a
Author: Sebastian Andrzej Siewior
Date: Wed Dec 19 10:47:02 2018 +0100

drm/i915: skip DRM_I915_LOW_LEVEL_TRACEPOINTS with NOTRACE

The order of the header files is important. If this header file is included after tracepoint.h was included then the NOTRACE here becomes a nop. Currently this happens for two .c files which use the tracepoints behind DRM_I915_LOW_LEVEL_TRACEPOINTS.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 4cad3232d7f0c7e6a3116da68a1c44111b5c25f3
Author: Sebastian Andrzej Siewior
Date: Thu Dec 6 09:52:20 2018 +0100

drm/i915: disable tracing on -RT

Luca Abeni reported this:
| BUG: scheduling while atomic: kworker/u8:2/15203/0x00000003
| CPU: 1 PID: 15203 Comm: kworker/u8:2 Not tainted 4.19.1-rt3 #10
| Call Trace:
| rt_spin_lock+0x3f/0x50
| gen6_read32+0x45/0x1d0 [i915]
| g4x_get_vblank_counter+0x36/0x40 [i915]
| trace_event_raw_event_i915_pipe_update_start+0x7d/0xf0 [i915]

The tracing events like trace_i915_pipe_update_start(), among others, use functions which acquire spin locks. A few trace points use intel_get_crtc_scanline(), others use ->get_vblank_counter(), which also might acquire a sleeping lock. Based on this I don't see any other way than to disable trace points on RT.

Cc: stable-rt@vger.kernel.org
Reported-by: Luca Abeni
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit a2a18b7ffa33dc77da5a7b3a2cab414059a035e1
Author: Mike Galbraith
Date: Sat Feb 27 09:01:42 2016 +0100

drm/i915: Don't disable interrupts on PREEMPT_RT during atomic updates

Commit 8d7849db3eab7 ("drm/i915: Make sprite updates atomic") started disabling interrupts across atomic updates. This breaks on PREEMPT_RT because within this section the code attempts to acquire spinlock_t locks which are sleeping locks on PREEMPT_RT. According to the comment the interrupts are disabled to avoid random delays and are not required for protection or synchronisation. Don't disable interrupts on PREEMPT_RT during atomic updates.

[bigeasy: drop local locks, commit message]

Signed-off-by: Mike Galbraith
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 13e2d414eb6f1ecf942ed343c093842bc024b1fc
Author: Mike Galbraith
Date: Sat Feb 27 08:09:11 2016 +0100

drm,radeon,i915: Use preempt_disable/enable_rt() where recommended

DRM folks identified the spots, so use them.

Signed-off-by: Mike Galbraith
Signed-off-by: Thomas Gleixner
Cc: Sebastian Andrzej Siewior
Cc: linux-rt-users
Signed-off-by: Thomas Gleixner

commit 3ca5dbf1d1029d2379bf1cf2fd7cc1fb711e60b3
Author: Thomas Gleixner
Date: Tue Aug 21 20:38:50 2012 +0200

random: Make it work on rt

Delegate the random insertion to the forced threaded interrupt handler. Store the return IP of the hard interrupt handler in the irq descriptor and feed it into the random generator as a source of entropy.

Signed-off-by: Thomas Gleixner

commit 4d07cdf96442fca3858211cef9d4420832f899e3
Author: Thomas Gleixner
Date: Thu Dec 16 14:25:18 2010 +0100

x86: stackprotector: Avoid random pool on rt

CPU bringup calls into the random pool to initialize the stack canary. During boot that works nicely even on RT, as the might-sleep checks are disabled. During CPU hotplug the might-sleep checks trigger. Making the locks in random raw is a major PITA, so avoiding the call on RT is the only sensible solution. This is basically the same randomness which we get during boot where the random pool has no entropy and we rely on the TSC randomness.

Reported-by: Carsten Emde
Signed-off-by: Thomas Gleixner
commit 625193bb2705b64d53004ecb4fa3ffec68ddb525
Author: Thomas Gleixner
Date: Tue Jul 14 14:26:34 2015 +0200

panic: skip get_random_bytes for RT_FULL in init_oops_id

Disable on -RT. If this is invoked from irq context we will have problems acquiring the sleeping lock.

Signed-off-by: Thomas Gleixner

commit 435b065c36a9a9d2ee46e61b679f6024dc044ea0
Author: Sebastian Andrzej Siewior
Date: Thu Jul 26 18:52:00 2018 +0200

crypto: cryptd - add a lock instead preempt_disable/local_bh_disable

cryptd has a per-CPU lock which is protected with local_bh_disable() and preempt_disable(). Add an explicit spin_lock to make the locking context more obvious and visible to lockdep. Since it is a per-CPU lock, there should be no lock contention on the actual spinlock. There is a small race window where we could be migrated to another CPU after the cpu_queue has been obtained. This is not a problem because the actual resource is protected by the spinlock.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 63297909b77d84facb0c3341275dfb7ab9b74bd7
Author: Sebastian Andrzej Siewior
Date: Thu Nov 30 13:40:10 2017 +0100

crypto: limit more FPU-enabled sections

Those crypto drivers use SSE/AVX/… for their crypto work, and in order to do so in the kernel they need to enable the "FPU" in kernel mode, which disables preemption. There are two problems with the way they are used:
- the while loop which processes X bytes may create latency spikes and should be avoided or limited.
- the cipher-walk-next part may allocate/free memory and may use kmap_atomic().

The whole kernel_fpu_begin()/end() processing isn't probably that cheap. It most likely makes sense to process as much of those as possible in one go. The new *_fpu_sched_rt() schedules only if an RT task is pending. Probably we should measure the performance of those ciphers in pure SW mode and with this optimisation to see if it makes sense to keep them for RT. This kernel_fpu_resched() makes the code more preemptible, which might hurt performance.

Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 5190ea9f444791dfa6c60e1cc4fd1d561d79002d
Author: Thomas Gleixner
Date: Sat Nov 12 14:00:48 2011 +0100

scsi/fcoe: Make RT aware.

Do not disable preemption while taking sleeping locks. All users look safe for migrate_disable() only.

Signed-off-by: Thomas Gleixner

commit 8b6bd58088fde59e48b75a090c1baa928dc73faf
Author: Thomas Gleixner
Date: Tue Apr 6 16:51:31 2010 +0200

md: raid5: Make raid5_percpu handling RT aware

__raid_run_ops() disables preemption with get_cpu() around the access to the raid5_percpu variables. That causes scheduling-while-atomic spews on RT. Serialize the access to the percpu data with a lock and keep the code preemptible.

Reported-by: Udo van den Heuvel
Signed-off-by: Thomas Gleixner
Tested-by: Udo van den Heuvel

commit 35da1a4242a1a4559b803e7ad54117c55b42c8a7
Author: Mike Galbraith
Date: Thu Mar 31 04:08:28 2016 +0200

drivers/block/zram: Replace bit spinlocks with rtmutex for -rt

They're nondeterministic, and lead to ___might_sleep() splats in -rt. OTOH, they're a lot less wasteful than an rtmutex per page.

Signed-off-by: Mike Galbraith
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
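The "crypto: limit more FPU-enabled sections" change above bounds how long the preemption-disabling kernel-FPU region lasts; a minimal x86 sketch, where the 4 KiB chunk size is an arbitrary assumption rather than the value used in the patch:

    #include <asm/fpu/api.h>
    #include <linux/kernel.h>
    #include <linux/types.h>

    #define FPU_CHUNK 4096

    static void process_with_fpu(const u8 *buf, size_t len)
    {
            while (len) {
                    size_t n = min_t(size_t, len, FPU_CHUNK);

                    kernel_fpu_begin();     /* disables preemption */
                    /* ... process n bytes with SSE/AVX ... */
                    kernel_fpu_end();       /* preemption point between chunks */
                    buf += n;
                    len -= n;
            }
    }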
commit a377dcde718af1e665957dcb8129337af35161a0
Author: Sebastian Andrzej Siewior
Date: Tue Jul 14 14:26:34 2015 +0200

block/mq: do not invoke preempt_disable()

preempt_disable() and get_cpu() don't play well together with the sleeping locks it tries to allocate later. It seems to be enough to replace it with get_cpu_light() and migrate_disable().

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 8596053f3a473347c05ccc32d236326f9e8300e4
Author: Priyanka Jain
Date: Thu May 17 09:35:11 2012 +0530

net: Remove preemption disabling in netif_rx()

1) enqueue_to_backlog() (called from netif_rx) should be bound to a particular CPU. This can be achieved by disabling migration; no need to disable preemption.

2) Fixes crash "BUG: scheduling while atomic: ksoftirqd" in case of RT. If preemption is disabled, enqueue_to_backlog() is called in atomic context. And if the backlog exceeds its count, kfree_skb() is called. But in RT, kfree_skb() might get scheduled out, so it expects non-atomic context.

- Replace preempt_enable(), preempt_disable() with migrate_enable(), migrate_disable() respectively.
- Replace get_cpu(), put_cpu() with get_cpu_light(), put_cpu_light() respectively.

Signed-off-by: Priyanka Jain
Signed-off-by: Thomas Gleixner
Acked-by: Rajan Srivastava
Cc:
Link: http://lkml.kernel.org/r/1337227511-2271-1-git-send-email-Priyanka.Jain@freescale.com
Signed-off-by: Thomas Gleixner
[bigeasy: Remove assumption about migrate_disable() from the description.]
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 7bcfc0a2920ce68c056698007685ba74bd6578ab
Author: Sebastian Andrzej Siewior
Date: Wed Mar 30 13:36:29 2016 +0200

net: dev: always take qdisc's busylock in __dev_xmit_skb()

The root-lock is dropped before dev_hard_start_xmit() is invoked and after setting the __QDISC___STATE_RUNNING bit. If this task is now pushed away by a task with a higher priority, then the task with the higher priority won't be able to submit packets to the NIC directly; instead they will be enqueued into the Qdisc. The NIC will remain idle until the task(s) with higher priority leave the CPU and the task with lower priority gets back and finishes the job.

If we always take the busylock we ensure that the RT task can boost the low-prio task and submit the packet.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit b91f0be757d7d1e86f7ff94a79a5098e0df0d0c9
Author: Sebastian Andrzej Siewior
Date: Wed Sep 16 16:15:39 2020 +0200

net: Dequeue in dev_cpu_dead() without the lock

Upstream uses skb_dequeue() to acquire the lock of `input_pkt_queue'. The reason is to synchronize against a remote CPU which still thinks that the CPU is online and enqueues packets to this CPU. There are no guarantees that the packet is enqueued before the callback is run; it is just hoped. RT however complains about an uninitialized lock because it uses another lock for `input_pkt_queue' due to the IRQ-off nature of the context. Use the unlocked dequeue version for `input_pkt_queue'.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 08f72cc549122afe69a55b1367dffcc6e7d33016
Author: Thomas Gleixner
Date: Tue Jul 12 15:38:34 2011 +0200

net: Use skbufhead with raw lock

Use the rps lock as a raw lock so we can keep the irq-off regions. It looks low latency. However we can't kfree() from this context, therefore we defer this to the softirq and use the tofree_queue list for it (similar to process_queue).

Signed-off-by: Thomas Gleixner
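get_cpu_light()/put_cpu_light(), used by the netif_rx() and block/mq changes above and provided by the "kernel/sched: add {put|get}_cpu_light()" commit later in this log, keep the caller on one CPU without disabling preemption; an illustrative sketch assuming the RT-patch definitions:

    #include <linux/smp.h>

    static void enqueue_to_local_backlog_sketch(void)
    {
            int cpu = get_cpu_light();      /* migrate_disable() based on RT */

            /* ... add work to the per-CPU queue of 'cpu'; this section may
             *     sleep on RT spinlocks, unlike a get_cpu() section ...
             */
            (void)cpu;
            put_cpu_light();
    }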
commit 34d9adb7cc9441a2dc21bd4862aa54eebb2fb29d
Author: Mike Galbraith
Date: Wed Feb 18 16:05:28 2015 +0100

sunrpc: Make svc_xprt_do_enqueue() use get_cpu_light()

|BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:915
|in_atomic(): 1, irqs_disabled(): 0, pid: 3194, name: rpc.nfsd
|Preemption disabled at:[] svc_xprt_received+0x4b/0xc0 [sunrpc]
|CPU: 6 PID: 3194 Comm: rpc.nfsd Not tainted 3.18.7-rt1 #9
|Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.404 11/06/2014
| ffff880409630000 ffff8800d9a33c78 ffffffff815bdeb5 0000000000000002
| 0000000000000000 ffff8800d9a33c98 ffffffff81073c86 ffff880408dd6008
| ffff880408dd6000 ffff8800d9a33cb8 ffffffff815c3d84 ffff88040b3ac000
|Call Trace:
| [] dump_stack+0x4f/0x9e
| [] __might_sleep+0xe6/0x150
| [] rt_spin_lock+0x24/0x50
| [] svc_xprt_do_enqueue+0x80/0x230 [sunrpc]
| [] svc_xprt_received+0x4b/0xc0 [sunrpc]
| [] svc_add_new_perm_xprt+0x6d/0x80 [sunrpc]
| [] svc_addsock+0x143/0x200 [sunrpc]
| [] write_ports+0x28c/0x340 [nfsd]
| [] nfsctl_transaction_write+0x4c/0x80 [nfsd]
| [] vfs_write+0xb3/0x1d0
| [] SyS_write+0x49/0xb0
| [] system_call_fastpath+0x16/0x1b

Signed-off-by: Mike Galbraith
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit cd5a07cc5b4df4ca9a0dd206eca1dcff622933d5
Author: Sebastian Andrzej Siewior
Date: Fri Jun 16 19:03:16 2017 +0200

net/core: use local_bh_disable() in netif_rx_ni()

In 2004 netif_rx_ni() gained a preempt_disable() section around netif_rx() and its do_softirq() + testing for it. The do_softirq() part is required because netif_rx() raises the softirq but does not invoke it. The preempt_disable() is required to remain on the same CPU which added the skb to the per-CPU list. All this can be avoided by putting this into a local_bh_disable()ed section. The local_bh_enable() part will invoke do_softirq() if required.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 256ed200bb58572900c16418f20442ce61f7f771
Author: Sebastian Andrzej Siewior
Date: Tue Sep 8 16:57:11 2020 +0200

net: Properly annotate the try-lock for the seqlock

In patch ("net/Qdisc: use a seqlock instead seqcount") the seqcount has been replaced with a seqlock to allow the reader to boost the preempted writer. The try_write_seqlock() acquired the lock with a try-lock but the seqcount annotation was "lock". Opencode write_seqcount_t_begin() and use the try-lock annotation for lockdep.

Reported-by: Mike Galbraith
Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 13f621d9f778b2eb23d4c2104aff46245cf7ed69
Author: Sebastian Andrzej Siewior
Date: Wed Sep 14 17:36:35 2016 +0200

net/Qdisc: use a seqlock instead seqcount

The seqcount disables preemption on -RT while it is held, which we can't remove. Also we don't want the reader to spin for ages if the writer is scheduled out. The seqlock on the other hand will serialize / sleep on the lock while the writer is active.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 72d6f4f680bfea38795cc4afd165148cdd6461eb
Author: Scott Wood
Date: Wed Sep 11 17:57:29 2019 +0100

rcutorture: Avoid problematic critical section nesting on RT

rcutorture was generating some nesting scenarios that are not reasonable. Constrain the state selection to avoid them.

Example #1:
1. preempt_disable()
2. local_bh_disable()
3. preempt_enable()
4. local_bh_enable()

On PREEMPT_RT, BH disabling takes a local lock only when called in non-atomic context. Thus, atomic context must be retained until after BH is re-enabled. Likewise, if BH is initially disabled in non-atomic context, it cannot be re-enabled in atomic context.

Example #2:
1. rcu_read_lock()
2. local_irq_disable()
3. rcu_read_unlock()
4. local_irq_enable()

If the thread is preempted between steps 1 and 2, rcu_read_unlock_special.b.blocked will be set, but it won't be acted on in step 3 because IRQs are disabled. Thus, reporting of the quiescent state will be delayed beyond the local_irq_enable().

For now, these scenarios will continue to be tested on non-PREEMPT_RT kernels, until debug checks are added to ensure that they are not happening elsewhere.

Signed-off-by: Scott Wood
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
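The rcutorture constraint above amounts to "leave critical sections in the reverse order you entered them" on PREEMPT_RT; a small sketch of the allowed versus problematic nesting (real APIs, illustrative function):

    #include <linux/bottom_half.h>
    #include <linux/preempt.h>

    static void nesting_example(void)
    {
            /* Fine on PREEMPT_RT: properly nested. */
            preempt_disable();
            local_bh_disable();
            /* ... */
            local_bh_enable();
            preempt_enable();

            /* Problematic on PREEMPT_RT (Example #1 above):
             *   preempt_disable();
             *   local_bh_disable();
             *   preempt_enable();   <- leaves atomic context with BH still off
             *   local_bh_enable();
             */
    }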
commit 8abf1b2d2e8f61dc527e290e3cb83b0cf9fafef0
Author: Sebastian Andrzej Siewior
Date: Wed Mar 10 15:09:02 2021 +0100

rcu: Delay RCU-selftests

Delay RCU-selftests until ksoftirqd is up and running.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 0b9358cbbc0d2e9bd3e1a4788136ea5ddd7b05ee
Author: Thomas Gleixner
Date: Wed Mar 7 21:00:34 2012 +0100

fs: namespace: Use cpu_chill() in trylock loops

Retry loops on RT might loop forever when the modifying side was preempted. Use cpu_chill() instead of cpu_relax() to let the system make progress.

Signed-off-by: Thomas Gleixner

commit 3e649cdf24ddfabb01964ec59d1c2455f8273df1
Author: Thomas Gleixner
Date: Wed Mar 7 20:51:03 2012 +0100

rt: Introduce cpu_chill()

Retry loops on RT might loop forever when the modifying side was preempted. Add cpu_chill() to replace cpu_relax(). cpu_chill() defaults to cpu_relax() for non-RT. On RT it puts the looping task to sleep for a tick so the preempted task can make progress.

Steven Rostedt changed it to use a hrtimer instead of msleep():
|
|Ulrich Obergfell pointed out that cpu_chill() calls msleep() which is woken
|up by the ksoftirqd running the TIMER softirq. But as the cpu_chill() is
|called from softirq context, it may block the ksoftirqd() from running, in
|which case, it may never wake up the msleep() causing the deadlock.

+ bigeasy later changed to schedule_hrtimeout()
|If a task calls cpu_chill() and gets woken up by a regular or spurious
|wakeup and has a signal pending, then it exits the sleep loop in
|do_nanosleep() and sets up the restart block. If restart->nanosleep.type is
|not TI_NONE then this results in accessing a stale user pointer from a
|previously interrupted syscall and a copy to user based on the stale
|pointer or a BUG() when 'type' is not supported in nanosleep_copyout().

+ bigeasy: add PF_NOFREEZE:
| [....] Waiting for /dev to be fully populated...
| =====================================
| [ BUG: udevd/229 still has locks held! ]
| 3.12.11-rt17 #23 Not tainted
| -------------------------------------
| 1 lock held by udevd/229:
| #0: (&type->i_mutex_dir_key#2){+.+.+.}, at: lookup_slow+0x28/0x98
|
| stack backtrace:
| CPU: 0 PID: 229 Comm: udevd Not tainted 3.12.11-rt17 #23
| (unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14)
| (show_stack+0x10/0x14) from (dump_stack+0x74/0xbc)
| (dump_stack+0x74/0xbc) from (do_nanosleep+0x120/0x160)
| (do_nanosleep+0x120/0x160) from (hrtimer_nanosleep+0x90/0x110)
| (hrtimer_nanosleep+0x90/0x110) from (cpu_chill+0x30/0x38)
| (cpu_chill+0x30/0x38) from (dentry_kill+0x158/0x1ec)
| (dentry_kill+0x158/0x1ec) from (dput+0x74/0x15c)
| (dput+0x74/0x15c) from (lookup_real+0x4c/0x50)
| (lookup_real+0x4c/0x50) from (__lookup_hash+0x34/0x44)
| (__lookup_hash+0x34/0x44) from (lookup_slow+0x38/0x98)
| (lookup_slow+0x38/0x98) from (path_lookupat+0x208/0x7fc)
| (path_lookupat+0x208/0x7fc) from (filename_lookup+0x20/0x60)
| (filename_lookup+0x20/0x60) from (user_path_at_empty+0x50/0x7c)
| (user_path_at_empty+0x50/0x7c) from (user_path_at+0x14/0x1c)
| (user_path_at+0x14/0x1c) from (vfs_fstatat+0x48/0x94)
| (vfs_fstatat+0x48/0x94) from (SyS_stat64+0x14/0x30)
| (SyS_stat64+0x14/0x30) from (ret_fast_syscall+0x0/0x48)

Signed-off-by: Thomas Gleixner
Signed-off-by: Steven Rostedt
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
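cpu_chill(), introduced above, is dropped into retry loops in place of cpu_relax(); a minimal usage sketch, assuming the RT patch's declaration in <linux/delay.h> (the trylock loop is illustrative, not taken from fs/namespace.c):

    #include <linux/delay.h>
    #include <linux/spinlock.h>

    static void retry_with_chill(spinlock_t *lock)
    {
            while (!spin_trylock(lock)) {
                    /* On RT: sleep for a tick so a preempted lock holder can
                     * run; on !RT this falls back to cpu_relax().
                     */
                    cpu_chill();
            }
            /* ... critical section ... */
            spin_unlock(lock);
    }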
commit 6577676e1025d244dd51be507fa461f6aa5964d5
Author: Sebastian Andrzej Siewior
Date: Fri Oct 20 11:29:53 2017 +0200

fs/dcache: disable preemption on i_dir_seq's write side

i_dir_seq is an opencoded seqcounter. Based on the code it looks like we could have two writers in parallel despite the fact that the d_lock is held. The problem is that during the write process on RT the preemption is still enabled, and if this process is interrupted by a reader with RT priority then we lock up. To avoid that lockup I am disabling the preemption during the update. The rename of i_dir_seq is there to ensure that new write sides are caught in the future.

Cc: stable-rt@vger.kernel.org
Reported-by: Oleg.Karfich@wago.com
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit b6b25f302612ff245d0b5aaf450d7bd147f499ed
Author: Sebastian Andrzej Siewior
Date: Wed Sep 14 14:35:49 2016 +0200

fs/dcache: use swait_queue instead of waitqueue

__d_lookup_done() invokes wake_up_all() while holding a hlist_bl_lock() which disables preemption. As a workaround convert it to swait.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit bbd667d9ef68d0705b8553d9cbbca113940ea602
Author: Sebastian Andrzej Siewior
Date: Thu Aug 29 18:21:04 2013 +0200

ptrace: fix ptrace vs tasklist_lock race

As explained by Alexander Fyodorov:
|read_lock(&tasklist_lock) in ptrace_stop() is converted to mutex on RT kernel,
|and it can remove __TASK_TRACED from task->state (by moving it to
|task->saved_state). If parent does wait() on child followed by a sys_ptrace
|call, the following race can happen:
|
|- child sets __TASK_TRACED in ptrace_stop()
|- parent does wait() which eventually calls wait_task_stopped() and returns
|  child's pid
|- child blocks on read_lock(&tasklist_lock) in ptrace_stop() and moves
|  __TASK_TRACED flag to saved_state
|- parent calls sys_ptrace, which calls ptrace_check_attach() and wait_task_inactive()

The patch is based on his initial patch where an additional check is added in case the __TASK_TRACED moved to ->saved_state. The pi_lock is taken in case the caller is interrupted between looking into ->state and ->saved_state.

[ Fix for ptrace_unfreeze_traced() by Oleg Nesterov ]

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
commit c8d68449743241e79e6bf2dbfa08cedde307acd4
Author: Thomas Gleixner
Date: Wed Sep 21 19:57:12 2011 +0200

signal: Revert ptrace preempt magic

Upstream commit '53da1d9456fe7f8 fix ptrace slowness' is nothing more than a bandaid around the ptrace design trainwreck. It's not a correctness issue, it's merely a cosmetic bandaid.

Signed-off-by: Thomas Gleixner

commit c6d6f9cc7f5da22dbcd755d426dfba8105f9b1c3
Author: Thomas Gleixner
Date: Fri Jul 3 08:44:34 2009 -0500

mm/scatterlist: Do not disable irqs on RT

For -RT it is enough to keep pagefaults disabled (which is currently handled by kmap_atomic()).

Signed-off-by: Thomas Gleixner

commit d69a7f2680b5776e27bb2ef16c7070596deabd52
Author: Thomas Gleixner
Date: Tue Jul 12 11:39:36 2011 +0200

mm/vmalloc: Another preempt disable region which sucks

Avoid the preempt disable version of get_cpu_var(). The inner lock should provide enough serialisation.

Signed-off-by: Thomas Gleixner

commit 4e9fa61b3153bdf7bddba5e44ec0f243a0be37a9
Author: Mike Galbraith
Date: Tue Mar 22 11:16:09 2016 +0100

mm/zsmalloc: copy with get_cpu_var() and locking

get_cpu_var() disables preemption and triggers a might_sleep() splat later. This is replaced with get_locked_var(). The bit spinlocks are replaced with a proper mutex, which requires a slightly larger struct to allocate.

Signed-off-by: Mike Galbraith
Signed-off-by: Thomas Gleixner
[bigeasy: replace the bitspin_lock() with a mutex, get_locked_var(). Mike then fixed the size magic]
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 782bf4999ae66d23ce543f91399313d618908b0d
Author: Sebastian Andrzej Siewior
Date: Wed Jan 28 17:14:16 2015 +0100

mm/memcontrol: Replace local_irq_disable with local locks

There are a few local_irq_disable() which then take sleeping locks. This patch converts them to local locks.

[bigeasy: Move unlock after memcg_check_events() in mem_cgroup_swapout(), pointed out by Matt Fleming]

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 25de25aa3b89afacf840dfece4f7b423fae2b8c3
Author: Yang Shi
Date: Wed Oct 30 11:48:33 2013 -0700

mm/memcontrol: Don't call schedule_work_on in preemption disabled context

The following trace is triggered when running ltp oom test cases:

BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
in_atomic(): 1, irqs_disabled(): 0, pid: 17188, name: oom03
Preemption disabled at:[] mem_cgroup_reclaim+0x90/0xe0
CPU: 2 PID: 17188 Comm: oom03 Not tainted 3.10.10-rt3 #2
Hardware name: Intel Corporation Calpella platform/MATXM-CORE-411-B, BIOS 4.6.3 08/18/2010
ffff88007684d730 ffff880070df9b58 ffffffff8169918d ffff880070df9b70
ffffffff8106db31 ffff88007688b4a0 ffff880070df9b88 ffffffff8169d9c0
ffff88007688b4a0 ffff880070df9bc8 ffffffff81059da1 0000000170df9bb0
Call Trace:
[] dump_stack+0x19/0x1b
[] __might_sleep+0xf1/0x170
[] rt_spin_lock+0x20/0x50
[] queue_work_on+0x61/0x100
[] drain_all_stock+0xe1/0x1c0
[] mem_cgroup_reclaim+0x90/0xe0
[] __mem_cgroup_try_charge+0x41a/0xc40
[] ? release_pages+0x1b1/0x1f0
[] ? sched_exec+0x40/0xb0
[] mem_cgroup_charge_common+0x37/0x70
[] mem_cgroup_newpage_charge+0x26/0x30
[] handle_pte_fault+0x618/0x840
[] ? unpin_current_cpu+0x16/0x70
[] ? migrate_enable+0xd4/0x200
[] handle_mm_fault+0x145/0x1e0
[] __do_page_fault+0x1a1/0x4c0
[] ? preempt_schedule_irq+0x4b/0x70
[] ? retint_kernel+0x37/0x40
[] do_page_fault+0xe/0x10
[] page_fault+0x22/0x30

So, to prevent schedule_work_on from being called in preempt disabled context, replace the pair of get/put_cpu() with get/put_cpu_light().

Signed-off-by: Yang Shi
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
commit 6e041ac24c6256fe97ddbc81e7cec054d6c366da
Author: Sebastian Andrzej Siewior
Date: Thu May 20 16:00:41 2021 +0200

mm: memcontrol: Replace disable-IRQ locking with a local_lock

Access to the per-CPU variable memcg_stock is synchronized by disabling interrupts. Convert it to a local_lock which allows RT kernels to substitute it with a real per-CPU lock. On non-RT kernels this maps to local_irq_save() as before, but it also provides lockdep coverage of the critical region. No functional change.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 858043073a33d436f321bd46ad63898bb334cc3a
Author: Sebastian Andrzej Siewior
Date: Thu May 20 12:33:07 2021 +0200

mm: memcontrol: Add an argument to refill_stock() to indicate locking

The access to the per-CPU variable memcg_stock is protected by disabling interrupts. refill_stock() may change the ->caching member and updates the ->nr_pages member. refill_obj_stock() is also accessing memcg_stock (modifies ->nr_pages) and disables interrupts as part of the locking. Since refill_obj_stock() may invoke refill_stock() (via drain_obj_stock() -> obj_cgroup_uncharge_pages()) the "disable interrupts" lock is acquired recursively. Add an argument to refill_stock() to indicate whether it is required to disable interrupts as part of the locking for exclusive memcg_stock access.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit d0964c3449bab426321ffadc7cfd9cfa854a8fce
Author: Sebastian Andrzej Siewior
Date: Mon Aug 17 12:28:10 2020 +0200

u64_stats: Disable preemption on 32bit-UP/SMP with RT during updates

On RT the seqcount_t is required even on UP because the softirq can be preempted. The IRQ handler is threaded so it is also preemptible. Disable preemption on 32bit-RT during value updates. There is no need to disable interrupts on RT because the handler is run threaded, therefore disabling preemption is enough to guarantee that the update is not interrupted.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 62888474f4c7def1b4d7eef2beb101fd3db1c406
Author: Sebastian Andrzej Siewior
Date: Wed Oct 28 18:15:32 2020 +0100

mm/memcontrol: Disable preemption in __mod_memcg_lruvec_state()

The callers expect disabled preemption/interrupts while invoking __mod_memcg_lruvec_state(). This works in mainline because a lock of some kind is acquired. Use preempt_disable_rt() where per-CPU variables are accessed and a stable pointer is expected. This is also done in __mod_zone_page_state() for the same reason.

Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit fd1745bb1e6a2c236589e8fafd4a460f5b49844a
Author: Ingo Molnar
Date: Fri Jul 3 08:30:13 2009 -0500

mm/vmstat: Protect per cpu variables with preempt disable on RT

Disable preemption on -RT for the vmstat code. On vanilla the code runs in IRQ-off regions while on -RT it does not. "preempt_disable" ensures that the same resources are not updated in parallel due to preemption.

Signed-off-by: Ingo Molnar
Signed-off-by: Thomas Gleixner
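The memcg_stock conversion above follows the standard local_lock pattern; a self-contained sketch with a made-up per-CPU structure standing in for memcg_stock:

    #include <linux/local_lock.h>
    #include <linux/percpu.h>

    struct stock_pcp {
            local_lock_t lock;
            unsigned int nr_pages;
    };

    static DEFINE_PER_CPU(struct stock_pcp, example_stock) = {
            .lock = INIT_LOCAL_LOCK(lock),
    };

    static void refill_stock_sketch(unsigned int nr_pages)
    {
            unsigned long flags;

            /* local_irq_save() on !RT, a per-CPU sleeping lock on PREEMPT_RT,
             * with lockdep coverage in both cases.
             */
            local_lock_irqsave(&example_stock.lock, flags);
            this_cpu_add(example_stock.nr_pages, nr_pages);
            local_unlock_irqrestore(&example_stock.lock, flags);
    }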
commit 07be330e9aa8a8bfad04d6e34c85a0157d2caa5e
Author: Sebastian Andrzej Siewior
Date: Tue Mar 2 18:58:04 2021 +0100

mm: slub: Don't enable partial CPU caches on PREEMPT_RT by default

SLUB's partial CPU caches lead to higher latencies in a hackbench benchmark. Don't enable partial CPU caches by default on PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 1473115e3d4e4f673c726c6eb66d5bdff318d779
Author: Sebastian Andrzej Siewior
Date: Thu Jul 2 14:27:23 2020 +0200

mm: page_alloc: Use migrate_disable() in drain_local_pages_wq()

drain_local_pages_wq() disables preemption to avoid CPU migration during CPU hotplug and can't use cpus_read_lock(). Using migrate_disable() works here, too. The scheduler won't take the CPU offline until the task has left the migrate-disable section. Use migrate_disable() in drain_local_pages_wq().

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 7b51cfe9736bbabeb5bd46f35a6cd1f95771ee10
Author: Sebastian Andrzej Siewior
Date: Fri Jul 2 15:34:24 2021 +0200

mm, slub: Duct tape lockdep_assert_held(local_lock_t) on RT

The local_lock_t needs to be changed to make lockdep_assert_held() magically work.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit ef1919f3b424d7bca40012a9449534e960383663
Author: Sebastian Andrzej Siewior
Date: Tue Jun 23 15:32:51 2015 +0200

irqwork: push most work into softirq context

Initially we deferred all irqwork into softirq because we didn't want the latency spikes if perf or another user was busy and delayed the RT task. The NOHZ trigger (nohz_full_kick_work) was the first user that did not work as expected if it did not run in the original irqwork context, so we had to bring it back somehow for it. push_irq_work_func is the second one that requires this.

This patch adds the IRQ_WORK_HARD_IRQ flag which makes sure the callback runs in raw-irq context. Everything else is deferred into softirq context. Without -RT we have the original behavior.

This patch incorporates tglx's original work, reworked a little, bringing back arch_irq_work_raise() if possible, and a few fixes from Steven Rostedt and Mike Galbraith.

[bigeasy: melt tglx's irq_work_tick_soft() which splits irq_work_tick() into a hard and soft variant]

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 61f56f34c5313ab2d384daad1ed5a28e5ccdd228
Author: Thomas Gleixner
Date: Mon Jul 18 13:59:17 2011 +0200

softirq: Disable softirq stacks for RT

Disable extra stacks for softirqs. We want to preempt softirqs and having them on special IRQ stacks does not make this easier.

Signed-off-by: Thomas Gleixner

commit d98b56489a5c57e693bf38e2f40bd214c824edaf
Author: Thomas Gleixner
Date: Sun Nov 13 17:17:09 2011 +0100

softirq: Check preemption after reenabling interrupts

raise_softirq_irqoff() disables interrupts and wakes the softirq daemon, but after reenabling interrupts there is no preemption check, so the execution of the softirq thread might be delayed arbitrarily. In principle we could add that check to local_irq_enable/restore, but that's overkill as the raise_softirq_irqoff() sections are the only ones which show this behaviour.

Reported-by: Carsten Emde
Signed-off-by: Thomas Gleixner
commit 6d59282185364970687e479dd7278fc0fea1294f
Author: Mike Galbraith
Date: Sun Jan 8 09:32:25 2017 +0100

cpuset: Convert callback_lock to raw_spinlock_t

The two commits below add up to a cpuset might_sleep() splat for RT:

8447a0fee974 cpuset: convert callback_mutex to a spinlock
344736f29b35 cpuset: simplify cpuset_node_allowed API

BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:995
in_atomic(): 0, irqs_disabled(): 1, pid: 11718, name: cset
CPU: 135 PID: 11718 Comm: cset Tainted: G E 4.10.0-rt1-rt #4
Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0056.R01.1409242327 09/24/2014
Call Trace:
? dump_stack+0x5c/0x81
? ___might_sleep+0xf4/0x170
? rt_spin_lock+0x1c/0x50
? __cpuset_node_allowed+0x66/0xc0
? ___slab_alloc+0x390/0x570
? anon_vma_fork+0x8f/0x140
? copy_page_range+0x6cf/0xb00
? anon_vma_fork+0x8f/0x140
? __slab_alloc.isra.74+0x5a/0x81
? anon_vma_fork+0x8f/0x140
? kmem_cache_alloc+0x1b5/0x1f0
? anon_vma_fork+0x8f/0x140
? copy_process.part.35+0x1670/0x1ee0
? _do_fork+0xdd/0x3f0
? _do_fork+0xdd/0x3f0
? do_syscall_64+0x61/0x170
? entry_SYSCALL64_slow_path+0x25/0x25

The latter ensured that a NUMA box WILL take callback_lock in atomic context by removing the allocator and reclaim path __GFP_HARDWALL usage which prevented such contexts from taking callback_mutex.

One option would be to reinstate __GFP_HARDWALL protections for RT; however, as the 8447a0fee974 changelog states:

The callback_mutex is only used to synchronize reads/updates of cpusets' flags and cpu/node masks. These operations should always proceed fast so there's no reason why we can't use a spinlock instead of the mutex.

Cc: stable-rt@vger.kernel.org
Signed-off-by: Mike Galbraith
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit bf37891999a380f988e1dd8058b3da1c2653e1f5
Author: Thomas Gleixner
Date: Tue Sep 13 16:42:35 2011 +0200

sched: Disable TTWU_QUEUE on RT

The queued remote wakeup mechanism can introduce rather large latencies if the number of migrated tasks is high. Disable it for RT.

Signed-off-by: Thomas Gleixner

commit e1ca6ff2eb5a110d79057b8c2321a7dfc5f27356
Author: Thomas Gleixner
Date: Tue Jun 7 09:19:06 2011 +0200

sched: Do not account rcu_preempt_depth on RT in might_sleep()

RT changes the rcu_preempt_depth semantics, so we cannot check for it in might_sleep().

Signed-off-by: Thomas Gleixner

commit 62c341d62dba37f2f0e2cc43f738582c871e84fa
Author: Sebastian Andrzej Siewior
Date: Mon Nov 21 19:31:08 2016 +0100

kernel/sched: move stack + kprobe clean up to __put_task_struct()

There is no need to free the stack before the task struct (except for reasons mentioned in commit 68f24b08ee89 ("sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK")). This also comes in handy on -RT because we can't free memory in a preempt-disabled region. vfree_atomic() delays the memory cleanup to a worker. Since we move everything to the RCU callback, we can also free it immediately.

Cc: stable-rt@vger.kernel.org #for kprobe_flush_task()
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 7da87732639e954c37b4885e4dfdf6facf1f13c9
Author: Thomas Gleixner
Date: Mon Jun 6 12:20:33 2011 +0200

sched: Move mmdrop to RCU on RT

Takes sleeping locks and calls into the memory allocator, so nothing we want to do in task switch and other atomic contexts.

Signed-off-by: Thomas Gleixner
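The cpuset change above is the usual spinlock_t to raw_spinlock_t conversion for a lock that must stay non-sleeping on RT; a minimal sketch (lock and data are illustrative, not the cpuset code):

    #include <linux/spinlock.h>

    static DEFINE_RAW_SPINLOCK(example_callback_lock);
    static unsigned long example_flags_word;

    static void update_flags(unsigned long set)
    {
            unsigned long flags;

            /* raw_spinlock_t stays a spinning lock on PREEMPT_RT, so this is
             * legal in atomic context; keep the critical section short.
             */
            raw_spin_lock_irqsave(&example_callback_lock, flags);
            example_flags_word |= set;
            raw_spin_unlock_irqrestore(&example_callback_lock, flags);
    }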
commit 93fcb6de922b66624a3c79445f39274f781449cb
Author: Thomas Gleixner
Date: Mon Jun 6 12:12:51 2011 +0200

sched: Limit the number of task migrations per batch

Put an upper limit on the number of tasks which are migrated per batch to avoid large latencies.

Signed-off-by: Thomas Gleixner

commit efad5fed9c9bdb626789a766faac249882d6cf52
Author: Sebastian Andrzej Siewior
Date: Sat May 27 19:02:06 2017 +0200

kernel/sched: add {put|get}_cpu_light()

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 468c014d5e3733525eff26af855a6553736eaccb
Author: Thomas Gleixner
Date: Fri Jul 24 12:38:56 2009 +0200

preempt: Provide preempt_*_(no)rt variants

RT needs a few preempt_disable/enable points which are not necessary otherwise. Implement variants to avoid #ifdeffery.

Signed-off-by: Thomas Gleixner

commit 32f3c134a2774d83da33407cba1ae3cef57533ee
Author: Sebastian Andrzej Siewior
Date: Tue Oct 17 16:36:18 2017 +0200

lockdep: disable self-test

The self-test wasn't always 100% accurate for RT. We disabled a few tests which failed because they had a different semantic for RT. Some still reported false positives. Now the selftest locks up the system during boot and it needs to be investigated…

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit 63bf6d3149bbb052d6c49ccc65419815f5e79596
Author: Josh Cartwright
Date: Wed Jan 28 13:08:45 2015 -0600

lockdep: selftest: fix warnings due to missing PREEMPT_RT conditionals

"lockdep: Selftest: Only do hardirq context test for raw spinlock" disabled the execution of certain tests with PREEMPT_RT, but did not prevent the tests from still being defined. This leads to warnings like:

./linux/lib/locking-selftest.c:574:1: warning: 'irqsafe1_hard_rlock_12' defined but not used [-Wunused-function]
./linux/lib/locking-selftest.c:574:1: warning: 'irqsafe1_hard_rlock_21' defined but not used [-Wunused-function]
./linux/lib/locking-selftest.c:577:1: warning: 'irqsafe1_hard_wlock_12' defined but not used [-Wunused-function]
./linux/lib/locking-selftest.c:577:1: warning: 'irqsafe1_hard_wlock_21' defined but not used [-Wunused-function]
./linux/lib/locking-selftest.c:580:1: warning: 'irqsafe1_soft_spin_12' defined but not used [-Wunused-function]
...

Fixed by wrapping the test definitions in #ifndef CONFIG_PREEMPT_RT conditionals.

Signed-off-by: Josh Cartwright
Signed-off-by: Xander Huff
Signed-off-by: Thomas Gleixner
Acked-by: Gratian Crisan
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit d73448cc523ceaa187a8b8794ce511fdbf3a3836
Author: Yong Zhang
Date: Mon Apr 16 15:01:56 2012 +0800

lockdep: selftest: Only do hardirq context test for raw spinlock

On -rt there is no softirq context any more and rwlock is sleepable, so disable the softirq context test and the rwlock+irq test.

Signed-off-by: Yong Zhang
Signed-off-by: Thomas Gleixner
Cc: Yong Zhang
Link: http://lkml.kernel.org/r/1334559716-18447-3-git-send-email-yong.zhang0@gmail.com
Signed-off-by: Thomas Gleixner

commit a89aefc8729891931c25b9c47966254bc4c854f3
Author: Thomas Gleixner
Date: Sun Jul 17 18:51:23 2011 +0200

lockdep: Make it RT aware

Teach lockdep that we don't really do softirqs on -RT.

Signed-off-by: Thomas Gleixner

commit 5419cc6547b3f7a1d22f354e7ce1d4b936b0f556
Author: Sebastian Andrzej Siewior
Date: Fri Aug 4 17:40:42 2017 +0200

locking: don't check for __LINUX_SPINLOCK_TYPES_H on -RT archs

Upstream uses arch_spinlock_t within spinlock_t and requests that the spinlock_types.h header file is included first. On -RT we have the rt_mutex with its raw_lock wait_lock which needs the architecture's spinlock_types.h header file for its definition. However we need rt_mutex first because it is used to build the spinlock_t, so that check does not work for us. Therefore I am dropping that check.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
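The preempt_*_(no)rt variants referenced above (and used by e.g. the __mod_memcg_lruvec_state() change) avoid #ifdeffery at the call sites; a schematic of how such helpers are typically defined in the RT patch (the definitions shown here are an assumption, not copied from the tree):

    #include <linux/preempt.h>

    #ifdef CONFIG_PREEMPT_RT
    # define preempt_disable_rt()           preempt_disable()
    # define preempt_enable_rt()            preempt_enable()
    # define preempt_disable_nort()         barrier()
    # define preempt_enable_nort()          barrier()
    #else
    # define preempt_disable_rt()           barrier()
    # define preempt_enable_rt()            barrier()
    # define preempt_disable_nort()         preempt_disable()
    # define preempt_enable_nort()          preempt_enable()
    #endif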
commit d1184a4a2389a330df33161a3a28f4c974f876e6
Author: Sebastian Andrzej Siewior
Date: Thu May 20 18:09:38 2021 +0200

locking/RT: Add might sleeping annotation.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner

commit df4f4e03b7ca6f74a0acb53024b317c5e86ac9ae
Author: Thomas Gleixner
Date: Tue Apr 13 23:34:56 2021 +0200

locking/local_lock: Add RT support

On PREEMPT_RT enabled kernels local_lock has a real spinlock inside. Provide the necessary macros to substitute the non-RT variants.

Signed-off-by: Thomas Gleixner

commit 0fdc3cb86fec46049c4f54beb85bc82034c7c285
Author: Thomas Gleixner
Date: Tue Apr 13 23:26:09 2021 +0200

locking/local_lock: Prepare for RT support

PREEMPT_RT enabled kernels will add a real lock to local_lock and have to replace the preemption/interrupt disable/enable pairs by migrate_disable/enable pairs. To avoid duplicating the inline helpers for RT, provide defines which map the relevant invocations to the non-RT variants. No functional change.

Signed-off-by: Thomas Gleixner

commit ba5d7ea8cf4ce7d0e48760cf286d8ed7261783b3
Author: Steven Rostedt
Date: Tue Jul 6 16:36:57 2021 +0200

locking/rtmutex: Add adaptive spinwait mechanism

Going to sleep when a spinlock or rwlock is contended can be quite inefficient when the contention time is short and the lock owner is running on a different CPU. The MCS mechanism is not applicable to rtmutex based locks, so provide a simple adaptive spinwait mechanism for the RT specific spin/rwlock implementations.

[ tglx: Provide a contemporary changelog ]

Originally-by: Gregory Haskins
Signed-off-by: Steven Rostedt
Signed-off-by: Thomas Gleixner

commit 8e39f8f9c367fd009ddd27e2376539c0712f1fca
Author: Gregory Haskins
Date: Tue Jul 6 16:36:57 2021 +0200

locking/rtmutex: Implement equal priority lock stealing

The current logic only allows lock stealing to occur if the current task is of higher priority than the pending owner. Significant throughput improvements can be gained by allowing lock stealing to include tasks of equal priority when the contended lock is a spin_lock or a rw_lock and the tasks are not RT scheduling tasks. The assumption was that the system will make faster progress by allowing the task already on the CPU to take the lock rather than waiting for the system to wake up a different task.

This does add a degree of unfairness, but in reality no negative side effects have been observed in the many years that this has been used in the RT kernel.

[ tglx: Refactored and rewritten several times by Steve Rostedt, Sebastian Siewior and myself ]

Signed-off-by: Gregory Haskins
Signed-off-by: Thomas Gleixner

commit aa8c4cd5600d09576939f6093512260825adb046
Author: Thomas Gleixner
Date: Tue Jul 6 16:36:57 2021 +0200

preempt: Adjust PREEMPT_LOCK_OFFSET for RT

On PREEMPT_RT regular spinlocks and rwlocks are substituted with rtmutex based constructs. spin/rwlock held regions are preemptible on PREEMPT_RT, so PREEMPT_LOCK_OFFSET has to be 0 to make the various cond_resched_*lock() functions work correctly.

Signed-off-by: Thomas Gleixner
Signed-off-by: Thomas Gleixner commit eaaa5e87721d080f69479f80657c2256fb6d82ee Author: Thomas Gleixner Date: Tue Jul 6 16:36:57 2021 +0200 rtmutex: Prevent lockdep false positive with PI futexes On PREEMPT_RT the futex hashbucket spinlock becomes 'sleeping' and rtmutex based. That causes a lockdep false positive because some of the futex functions invoke spin_unlock(&hb->lock) with the wait_lock of the rtmutex associated to the pi_futex held. spin_unlock() in turn takes wait_lock of the rtmutex on which the spinlock is based which makes lockdep notice a lock recursion. Give the futex/rtmutex wait_lock a seperate key. Signed-off-by: Thomas Gleixner commit a796a11915bbe53f55c748a53f116f01424b73df Author: Thomas Gleixner Date: Tue Jul 6 16:36:57 2021 +0200 futex: Prevent requeue_pi() lock nesting issue on RT The requeue_pi() operation on RT kernels creates a problem versus the task::pi_blocked_on state when a waiter is woken early (signal, timeout) and that early wake up interleaves with the requeue_pi() operation. When the requeue manages to block the waiter on the rtmutex which is associated to the second futex, then a concurrent early wakeup of that waiter faces the problem that it has to acquire the hash bucket spinlock, which is not an issue on non-RT kernels, but on RT kernels spinlocks are substituted by 'sleeping' spinlocks based on rtmutex. If the hash bucket lock is contended then blocking on that spinlock would result in a impossible situation: blocking on two locks at the same time (the hash bucket lock and the rtmutex representing the PI futex). It was considered to make the hash bucket locks raw_spinlocks, but especially requeue operations with a large amount of waiters can introduce significant latencies, so that's not an option for RT. The RT tree carried a solution which (ab)used task::pi_blocked_on to store the information about an ongoing requeue and an early wakeup which worked, but required to add checks for these special states all over the place. The distangling of an early wakeup of a waiter for a requeue_pi() operation is already looking at quite some different states and the task::pi_blocked_on magic just expanded that to a hard to understand 'state machine'. This can be avoided by keeping track of the waiter/requeue state in the futex_q object itself. Add a requeue_state field to struct futex_q with the following possible states: Q_REQUEUE_PI_NONE Q_REQUEUE_PI_IGNORE Q_REQUEUE_PI_IN_PROGRESS Q_REQUEUE_PI_WAIT Q_REQUEUE_PI_DONE Q_REQUEUE_PI_LOCKED The waiter starts with state = NONE and the following state transitions are valid: On the waiter side: Q_REQUEUE_PI_NONE -> Q_REQUEUE_PI_IGNORE Q_REQUEUE_PI_IN_PROGRESS -> Q_REQUEUE_PI_WAIT On the requeue side: Q_REQUEUE_PI_NONE -> Q_REQUEUE_PI_INPROGRESS Q_REQUEUE_PI_IN_PROGRESS -> Q_REQUEUE_PI_DONE/LOCKED Q_REQUEUE_PI_IN_PROGRESS -> Q_REQUEUE_PI_NONE (requeue failed) Q_REQUEUE_PI_WAIT -> Q_REQUEUE_PI_DONE/LOCKED Q_REQUEUE_PI_WAIT -> Q_REQUEUE_PI_IGNORE (requeue failed) The requeue side ignores a waiter with state Q_REQUEUE_PI_IGNORE as this signals that the waiter is already on the way out. It also means that the waiter is still on the 'wait' futex, i.e. uaddr1. The waiter side signals early wakeup to the requeue side either through setting state to Q_REQUEUE_PI_IGNORE or to Q_REQUEUE_PI_WAIT depending on the current state. In case of Q_REQUEUE_PI_IGNORE it can immediately proceed to take the hash bucket lock of uaddr1. 
If it set state to WAIT, which means the wakeup is interleaving with a requeue in progress it has to wait for the requeue side to change the state. Either to DONE/LOCKED or to IGNORE. DONE/LOCKED means the waiter q is now on the uaddr2 futex and either blocked (DONE) or has acquired it (LOCKED). IGNORE is set by the requeue side when the requeue attempt failed via deadlock detection and therefore the waiter's futex_q is still on the uaddr1 futex. While this is not strictly required on !RT making this unconditional has the benefit of common code and it also allows the waiter to avoid taking the hash bucket lock on the way out in certain cases, which reduces contention. Add the required helpers required for the state transitions, invoke them at the right places and restructure the futex_wait_requeue_pi() code to handle the return from wait (early or not) based on the state machine values. On !RT enabled kernels the waiter spin waits for the state going from Q_REQUEUE_PI_WAIT to some other state, on RT enabled kernels this is handled by rcuwait_wait_event() and the corresponding wake up on the requeue side. Signed-off-by: Thomas Gleixner commit caf90d1280afb870b6e3d4798d828dabc90a4b29 Author: Thomas Gleixner Date: Tue Jul 6 16:36:56 2021 +0200 futex: Clarify comment in futex_requeue() The comment about the restriction of the number of waiters to wake for the REQUEUE_PI case is confusing at best. Rewrite it. Signed-off-by: Thomas Gleixner commit e67ddc796fc167bc36b776c3c163ccf1bc37994f Author: Thomas Gleixner Date: Tue Jul 6 16:36:56 2021 +0200 futex: Restructure futex_requeue() No point in taking two more 'requeue_pi' conditionals just to get to the requeue. Same for the requeue_pi case just the other way round. No functional change. Signed-off-by: Thomas Gleixner commit f3ffb1c6a2e8b84dd36557f3a2a30d9a2c82bd15 Author: Thomas Gleixner Date: Tue Jul 6 16:36:55 2021 +0200 futex: Correct the number of requeued waiters for PI The accounting is wrong when either the PI sanity check or the requeue PI operation fails. Adjust it in the failure path. Will be simplified in the next step. Signed-off-by: Thomas Gleixner commit d8411abc3a26fa8cd1fd0baac3447dc27c95eb81 Author: Thomas Gleixner Date: Tue Jul 6 16:36:54 2021 +0200 futex: Cleanup stale comments The futex key reference mechanism is long gone. Cleanup the stale comments which still mention it. Signed-off-by: Thomas Gleixner commit 1e1c70cfd49d70fd1b69622c6a797c539fd451c9 Author: Thomas Gleixner Date: Tue Jul 6 16:36:54 2021 +0200 futex: Validate waiter correctly in futex_proxy_trylock_atomic() The loop in futex_requeue() has a sanity check for the waiter which is missing in futex_proxy_trylock_atomic(). In theory the key2 check is sufficient, but futexes are cursed so add it for completness and paranoia sake. Signed-off-by: Thomas Gleixner commit fc6e6a83c2ab412e7058f6afd9867f9fbd874081 Author: Sebastian Andrzej Siewior Date: Thu Jul 1 17:50:20 2021 +0200 lib/test_lockup: Adapt to changed variables. The inner parts of certain locks (mutex, rwlocks) changed due to a rework for RT and non RT code. Most users remain unaffected, but those who fiddle around in the inner parts need to be updated. Match the struct names to the newer layout. 
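Returning to the requeue_pi state machine described a few commits above, the states and the legal transitions can be captured in a compact sketch. The enum tag is made up for illustration; the state names and transitions are taken from the changelog:

    /* Per-waiter requeue state as listed in the changelog above. */
    enum q_requeue_pi_state {
            Q_REQUEUE_PI_NONE,
            Q_REQUEUE_PI_IGNORE,
            Q_REQUEUE_PI_IN_PROGRESS,
            Q_REQUEUE_PI_WAIT,
            Q_REQUEUE_PI_DONE,
            Q_REQUEUE_PI_LOCKED,
    };

    /*
     * Waiter side:   NONE        -> IGNORE         (early wakeup, no requeue yet)
     *                IN_PROGRESS -> WAIT           (early wakeup races with requeue)
     * Requeue side:  NONE        -> IN_PROGRESS
     *                IN_PROGRESS -> DONE / LOCKED
     *                IN_PROGRESS -> NONE           (requeue failed)
     *                WAIT        -> DONE / LOCKED
     *                WAIT        -> IGNORE         (requeue failed)
     */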
Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit d64c4ab5d3e8d87647abc972f73b9f1dd7c9d495 Author: Thomas Gleixner Date: Tue Jul 6 16:36:52 2021 +0200 locking/rtmutex: Add mutex variant for RT Add the necessary defines, helpers and API functions for replacing mutex on a PREEMPT_RT enabled kernel with a rtmutex based variant. If PREEMPT_RT is enabled then the regular 'struct mutex' is renamed to 'struct __mutex', which is still typedeffed as '_mutex_t' to allow the standalone compilation and utilization of ww_mutex. No functional change when CONFIG_PREEMPT_RT=n Signed-off-by: Thomas Gleixner commit 8cbe9cb30f66789698965a0d27263fb7e76dee06 Author: Thomas Gleixner Date: Tue Jul 6 16:36:51 2021 +0200 locking/mutex: Exclude non-ww_mutex API for RT In order to build ww_mutex standalone on RT and to replace mutex with a RT specific rtmutex based variant, guard the non-ww_mutex API so it is only built when CONFIG_PREEMPT_RT is disabled. No functional change. Signed-off-by: Thomas Gleixner commit 9e1721dcf72bb6a1ad2de92e52216a3e280224a6 Author: Thomas Gleixner Date: Tue Jul 6 16:36:51 2021 +0200 locking/mutex: Rearrange items in mutex.h Move the lockdep map initializer to a different place so it can be shared with the upcoming RT variant of struct mutex. No functional change. Signed-off-by: Thomas Gleixner commit 61df59103205964c4d86134a153e51a4a931cfb9 Author: Thomas Gleixner Date: Tue Jul 6 16:36:51 2021 +0200 locking/mutex: Replace struct mutex in core code PREEMPT_RT replaces 'struct mutex' with a rtmutex based variant so all mutex operations are included into the priority inheritance scheme, but wants to utilize the ww_mutex specific part of the regular mutex implementation as is. As the regular mutex and ww_mutex implementation are tightly coupled (ww_mutex has a 'struct mutex' inside) and share a lot of code (ww_mutex is mostly an extension) a simple replacement of 'struct mutex' does not work. 'struct mutex' has a typedef '_mutex_t' associated. Replace all 'struct mutex' references in the mutex code code with '_mutex_t' which allows to have a RT specific 'struct mutex' in the final step. No functional change. Signed-off-by: Thomas Gleixner commit 9e38af086bbd692fcc07bd24b66b27ef6d775f07 Author: Thomas Gleixner Date: Tue Jul 6 16:36:51 2021 +0200 locking/ww_mutex: Switch to _mutex_t PREEMPT_RT replaces 'struct mutex' with a rtmutex based variant so all mutex operations are included into the priority inheritance scheme, but wants to utilize the ww_mutex specific part of the regular mutex implementation as is. As the regular mutex and ww_mutex implementation are tightly coupled (ww_mutex has a 'struct mutex' inside) and share a lot of code (ww_mutex is mostly an extension) a simple replacement of 'struct mutex' does not work. 'struct mutex' has a typedef '_mutex_t' associated. Replace all 'struct mutex' references in ww_mutex with '_mutex_t' which allows to have a RT specific 'struct mutex' in the final step. No functional change. Signed-off-by: Thomas Gleixner commit 0d00cb96f75047a2c69c59a49c65d86509122673 Author: Thomas Gleixner Date: Tue Jul 6 16:36:51 2021 +0200 locking/mutex: Rename the ww_mutex relevant functions In order to build ww_mutex standalone for PREEMPT_RT and to allow replacing the regular mutex with an RT specific rtmutex based variant, rename a few ww_mutex relevant functions, so the final RT build does not have namespace collisions. No functional change. 
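The _mutex_t indirection that the mutex commits above keep referring to can be pictured roughly as follows. The field contents are placeholders and the exact layout is not reproduced here; only the type plumbing matters:

    #ifdef CONFIG_PREEMPT_RT
    struct __mutex { unsigned long placeholder; }; /* regular mutex guts, kept for ww_mutex */
    typedef struct __mutex _mutex_t;
    struct mutex   { unsigned long placeholder; }; /* rtmutex based substitution            */
    #else
    struct mutex   { unsigned long placeholder; }; /* regular mutex, as before              */
    typedef struct mutex _mutex_t;
    #endif

    struct ww_mutex {
            _mutex_t base;  /* ww_mutex always embeds the regular implementation */
    };

This is why the preceding commits systematically replace 'struct mutex' with '_mutex_t' inside the mutex core and ww_mutex: once the indirection exists, the RT build can swap the definition behind 'struct mutex' without touching ww_mutex.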
Signed-off-by: Thomas Gleixner commit c0652713c31434fb54868cac4788398791489025 Author: Thomas Gleixner Date: Tue Jul 6 16:36:50 2021 +0200 locking/mutex: Introduce _mutex_t PREEMPT_RT replaces 'struct mutex' with a rtmutex based variant so all mutex operations are included into the priority inheritance scheme. But a complete replacement of the mutex implementation would require to reimplement ww_mutex on top of the rtmutex based variant. That has been tried, but the outcome is dubious if not outright wrong in some cases: 1) ww_mutex by it's semantics can never provide any realtime properties 2) The waiter ordering of ww_mutex depends on the associated context stamp, which is not possible with priority based ordering on a rtmutex based implementation So a rtmutex based ww_mutex would be semanticaly different and incomplete. Aside of that the ww_mutex specific helpers cannot be shared between the regular mutex and the RT variant, so they are likely to diverge further and grow different properties and bugs. The alternative solution is to make it possible to compile the ww_mutex specific part of the regular mutex implementation as is on RT and have a rtmutex based 'struct mutex' variant. As the regular mutex and ww_mutex implementation are tightly coupled (ww_mutex has a 'struct mutex' inside) and share a lot of code (ww_mutex is mostly an extension) a simple replacement of 'struct mutex' does not work. To solve this attach a typedef to 'struct mutex': _mutex_t This new type is then used to replace 'struct mutex' in 'struct ww_mutex', in a few helper functions and in the actual regular mutex code. None of the actual usage sites of mutexes are affected. That allows in the final step to have a RT specific 'struct mutex' and the regular _mutex_t type. Signed-off-by: Thomas Gleixner commit b0bb8d3b97f78809f0a1453105ad5494a39ffeff Author: Thomas Gleixner Date: Tue Jul 6 16:36:50 2021 +0200 locking/mutex: Make mutex::wait_lock raw PREEMPT_RT wants to utilize the existing ww_mutex implementation instead of trying to mangle ww_mutex functionality into the rtmutex based mutex implementation. The mutex internal wait_lock is a regular spinlock which would be converted to a sleeping spinlock on RT, but that's not really required because the wait_lock held times are short and limited. Convert it to a raw_spinlock like the wait_lock of rtmutex. Signed-off-by: Thomas Gleixner commit 52f869aee24db1e6415a4c6b8422f88b650008c1 Author: Thomas Gleixner Date: Tue Jul 6 16:36:50 2021 +0200 locking/ww_mutex: Move ww_mutex declarations into ww_mutex.h Move the ww_mutex declarations in the ww_mutex specific header where they belong. Preperatory change to allow compiling ww_mutex standalone. Signed-off-by: Thomas Gleixner commit 5e5f46d3edac9840b14f906dbe6b097cc5d8ce90 Author: Thomas Gleixner Date: Tue Jul 6 16:36:50 2021 +0200 locking/mutex: Move waiter to core header Move the mutex waiter declaration from the global to the core local header. There is no reason to expose it outside of the core code. Signed-off-by: Thomas Gleixner commit 72a31aa31eaeef8f1252d555856778269b9fc4e0 Author: Thomas Gleixner Date: Tue Jul 6 16:36:50 2021 +0200 locking/mutex: Consolidate core headers Having two header files which contain just the non-debug and debug variants is mostly waste of disc space and has no real value. Stick the debug variants into the common mutex.h file as counterpart to the stubs for the non-debug case. 
That allows to add helpers and defines to the common header for the upcoming handling of mutexes and ww_mutexes on PREEMPT_RT. Signed-off-by: Thomas Gleixner commit 387b12793ce0faedbfc4563aaf557f664c95f1a9 Author: Thomas Gleixner Date: Tue Jul 6 16:36:49 2021 +0200 locking/rwlock: Provide RT variant Similar to rw_semaphores on RT the rwlock substitution is not writer fair because it's not feasible to have a writer inherit it's priority to multiple readers. Readers blocked on a writer follow the normal rules of priority inheritance. Like RT spinlocks RT rwlocks are state preserving accross the slow lock operations (contended case). Signed-off-by: Thomas Gleixner commit 3779e68a306bca13763583d465132b6fba9c56fe Author: Thomas Gleixner Date: Tue Jul 6 16:36:49 2021 +0200 locking/spinlock: Provide RT variant Provide the actual locking functions which make use of the general and spinlock specific rtmutex code. Signed-off-by: Thomas Gleixner commit 052c362a00b3b3f3187dcc93c221c2e023d92dde Author: Thomas Gleixner Date: Tue Jul 6 16:36:49 2021 +0200 locking/rtmutex: Provide the spin/rwlock core lock function A simplified version of the rtmutex slowlock function which neither handles signals nor timeouts and is careful about preserving the state of the blocked task accross the lock operation. Signed-off-by: Thomas Gleixner commit a78d15145be2367460595d306aeda6609f23777b Author: Thomas Gleixner Date: Tue Jul 6 16:36:49 2021 +0200 locking/spinlock: Provide RT variant header Provide the necessary wrappers around the actual rtmutex based spinlock implementation. Signed-off-by: Thomas Gleixner commit 1d0de8e9dd1017dd85087f05719cfc50f87e80ef Author: Thomas Gleixner Date: Tue Jul 6 16:36:49 2021 +0200 locking/spinlock: Provide RT specific spinlock type RT replaces spinlocks with a simple wrapper around a rtmutex which turns spinlocks on RT into 'sleeping' spinlocks. The actual implementation of the spinlock API differs from a regular rtmutex as it does neither handle timeouts nor signals and it is state preserving accross the lock operation. Signed-off-by: Thomas Gleixner commit 1eabaec31e25dd11f2b915841755733c010bf151 Author: Sebastian Andrzej Siewior Date: Tue Jul 6 16:36:48 2021 +0200 locking/rtmutex: Include only rbtree types rtmutex.h needs the definition of struct rb_root_cached. rbtree.h includes kernel.h which includes spinlock.h. That works nicely for non-RT enabled kernels, but on RT enabled kernels spinlocks are based on rtmutexes which creates another circular header dependency as spinlocks.h will require rtmutex.h. Include rbtree_types.h instead. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 3f1005975f9d1e02c35e5c720d91f284fc5a6b1f Author: Sebastian Andrzej Siewior Date: Tue Jul 6 16:36:48 2021 +0200 rbtree: Split out the rbtree type definitions rtmutex.h needs the definition of struct rb_root_cached. rbtree.h includes kernel.h which includes spinlock.h. That works nicely for non-RT enabled kernels, but on RT enabled kernels spinlocks are based on rtmutexes which creates another circular header dependency as spinlocks.h will require rtmutex.h. Split out the type definitions and move them into their own header file so the rtmutex header can include just those. 
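The split described in the rbtree commit above roughly yields a types-only header along these lines (field names follow the rbtree types; the alignment attribute and comments are omitted):

    /* rbtree_types.h sketch: bare type definitions, no kernel.h/spinlock.h. */
    struct rb_node {
            unsigned long   __rb_parent_color;
            struct rb_node  *rb_right;
            struct rb_node  *rb_left;
    };

    struct rb_root {
            struct rb_node *rb_node;
    };

    struct rb_root_cached {
            struct rb_root  rb_root;
            struct rb_node  *rb_leftmost;
    };

With only these definitions in play, rtmutex.h can declare its waiter tree without pulling in the spinlock headers that would otherwise recurse on RT.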
Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 98c59a95a1469a0f66f4e4f2ded3f45b97705816 Author: Sebastian Andrzej Siewior Date: Tue Jul 6 16:36:48 2021 +0200 locking/lockdep: Reduce includes in debug_locks.h The inclusion of printk.h leads to a circular dependency if spinlock_t is based on rtmutexes on RT enabled kernels. Include only atomic.h (xchg()) and cache.h (__read_mostly) which is all what debug_locks.h requires. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit e0594a0d6b0fa58257aaccc593e49f908105287a Author: Sebastian Andrzej Siewior Date: Tue Jul 6 16:36:48 2021 +0200 locking/rtmutex: Prevent future include recursion hell rtmutex only needs raw_spinlock_t, but it includes spinlock_types.h which is not a problem on an non RT enabled kernel. RT kernels substitute regular spinlocks with 'sleeping' spinlocks which are based on rtmutexes and therefore must be able to include rtmutex.h. Include spinlock_types_raw.h instead. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit d1739ad7ce7c4d5120e3b58f06834fa848d3caa5 Author: Thomas Gleixner Date: Tue Jul 6 16:36:48 2021 +0200 locking/spinlock: Split the lock types header Move raw_spinlock into its own file. Prepare for RT 'sleeping spinlocks' to avoid header recursion as RT locks require rtmutex.h which in turn requires the raw spinlock types. No functional change. Signed-off-by: Thomas Gleixner commit a891c54d9b25fbaafb3f20fae54ad20e3229c150 Author: Thomas Gleixner Date: Tue Jul 6 16:36:47 2021 +0200 locking/rtmutex: Guard regular sleeping locks specific functions Guard the regular sleeping lock specific functionality which is used for rtmutex on non-RT enabled kernels and for mutex, rtmutex and semaphores on RT enabled kernels so the code can be reused for the RT specific implementation of spinlocks and rwlocks in a different compilation unit. No functional change. Signed-off-by: Thomas Gleixner commit a7844d974babf05694b7e13eb6507f63cad74bb1 Author: Thomas Gleixner Date: Tue Jul 6 16:36:47 2021 +0200 locking/rtmutex: Prepare RT rt_mutex_wake_q for RT locks Add a rtlock_task pointer to rt_mutex_wake_q which allows to handle the RT specific wakeup for spin/rwlock waiters. The pointer is just consuming 4/8 bytes on stack so it is provided unconditionaly to avoid #ifdeffery all over the place. No functional change for non-RT enabled kernels. Signed-off-by: Thomas Gleixner commit 908d2944c4eb56c1c7cadb98e1492c6462567811 Author: Thomas Gleixner Date: Tue Jul 6 16:36:47 2021 +0200 locking/rtmutex: Use rt_mutex_wake_q_head Prepare for the required state aware handling of waiter wakeups via wake_q and switch the rtmutex code over to the rtmutex specific wrapper. No functional change. Signed-off-by: Thomas Gleixner commit a0477bfbc5bb986f348c4b9ac7bf3f0893da0596 Author: Thomas Gleixner Date: Tue Jul 6 16:36:47 2021 +0200 locking/rtmutex: Provide rt_mutex_wake_q and helpers To handle the difference of wakeups for regular sleeping locks (mutex, rtmutex, rw_semaphore) and the wakeups for 'sleeping' spin/rwlocks on PREEMPT_RT enabled kernels correctly, it is required to provide a wake_q construct which allows to keep them seperate. Provide a wrapper around wake_q and the required helpers, which will be extended with the state handling later. No functional change. 
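A rough shape of the wake_q wrapper introduced in the two commits above; the wake_q_head fields are repeated only to keep the sketch self-contained, and the wrapper's exact name in the tree may differ from the changelog wording:

    struct wake_q_node { struct wake_q_node *next; };

    struct wake_q_head {
            struct wake_q_node  *first;
            struct wake_q_node **lastp;
    };

    struct task_struct;                      /* opaque here */

    struct rt_mutex_wake_q_head {
            struct wake_q_head  head;        /* regular sleeping lock waiters */
            struct task_struct *rtlock_task; /* RT spin/rwlock waiter, if any */
    };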
Signed-off-by: Thomas Gleixner commit 5328155b18de876d8749d7632bab41cca0919737 Author: Thomas Gleixner Date: Tue Jul 6 16:36:47 2021 +0200 locking/rtmutex: Add wake_state to rt_mutex_waiter Regular sleeping locks like mutexes, rtmutexes and rw_semaphores are always entering and leaving a blocking section with task state == TASK_RUNNING. On a non-RT kernel spinlocks and rwlocks never affect the task state, but on RT kernels these locks are converted to rtmutex based 'sleeping' locks. So in case of contention the task goes to block which requires to carefully preserve the task state and restore it after acquiring the lock taking regular wakeups for the task into account which happened while the task was blocked. This state preserving is achieved by having a seperate task state for blocking on a RT spin/rwlock and a saved_state field in task_struct along with careful handling of these wakeup scenarios in try_to_wake_up(). To avoid conditionals in the rtmutex code, store the wake state which has to be used for waking a lock waiter in rt_mutex_waiter which allows to handle the regular and RT spin/rwlocks by handing it to wake_up_state(). Signed-off-by: Thomas Gleixner commit 1a83a3b0bb6b4f584b3171ec9b6cf2747562132a Author: Thomas Gleixner Date: Tue Jul 6 16:36:47 2021 +0200 locking/rwsem: Add rtmutex based R/W semaphore implementation The RT specific R/W semaphore implementation used to restrict the number of readers to one because a writer cannot block on multiple readers and inherit its priority or budget. The single reader restricting was painful in various ways: - Performance bottleneck for multi-threaded applications in the page fault path (mmap sem) - Progress blocker for drivers which are carefully crafted to avoid the potential reader/writer deadlock in mainline. The analysis of the writer code paths shows, that properly written RT tasks should not take them. Syscalls like mmap(), file access which take mmap sem write locked have unbound latencies which are completely unrelated to mmap sem. Other R/W sem users like graphics drivers are not suitable for RT tasks either. So there is little risk to hurt RT tasks when the RT rwsem implementation is done in the following way: - Allow concurrent readers - Make writers block until the last reader left the critical section. This blocking is not subject to priority/budget inheritance. - Readers blocked on a writer inherit their priority/budget in the normal way. There is a drawback with this scheme. R/W semaphores become writer unfair though the applications which have triggered writer starvation (mostly on mmap_sem) in the past are not really the typical workloads running on a RT system. So while it's unlikely to hit writer starvation, it's possible. If there are unexpected workloads on RT systems triggering it, the problem has to be revisited. Signed-off-by: Thomas Gleixner commit 28b867762482d80cf05b7fd978dcb1b05836b197 Author: Thomas Gleixner Date: Tue Jul 6 16:36:46 2021 +0200 locking: Add base code for RT rw_semaphore and rwlock On PREEMPT_RT rw_semaphores and rwlocks are substituted with a rtmutex and a reader count. The implementation is writer unfair as it is not feasible to do priority inheritance on multiple readers, but experience has shown that realtime workloads are not the typical workloads which are sensitive to writer starvation. The inner workings of rw_semaphores and rwlocks on RT are almost indentical except for the task state and signal handling. 
rw_semaphores are not state preserving over a contention, they are expect to enter and leave with state == TASK_RUNNING. rwlocks have a mechanism to preserve the state of the task at entry and restore it after unblocking taking potential non-lock related wakeups into account. rw_semaphores can also be subject to signal handling interrupting a blocked state, while rwlocks ignore signals. To avoid code duplication, provide a shared implementation which takes the small difference vs. state and signals into account. The code is included into the relevant rw_semaphore/rwlock base code and compiled for each use case seperately. Signed-off-by: Thomas Gleixner commit e03cbdcf154eddbc014705b6e2ebacd755d6ab39 Author: Thomas Gleixner Date: Tue Jul 6 16:36:46 2021 +0200 locking/rtmutex: Provide lockdep less variants of rtmutex interfaces The existing rtmutex_() functions are used by code which uses rtmutex directly. These interfaces contain rtmutex specific lockdep operations. The inner code can be reused for lock implementations which build on top of rtmutexes, i.e. the lock substitutions for RT enabled kernels. But as these are different lock types they have their own lockdep operations. Calling the existing rtmutex interfaces for those would cause double lockdep checks and longer lock chains for no value. Provide rt_mutex_lock_state(), __rt_mutex_trylock() and __rt_mutex_unlock() which are not doing any lockdep operations on the rtmutex itself. The caller has to do them on the lock type which embeds the rtmutex. Signed-off-by: Thomas Gleixner commit fb5c624fe3c514db85b9f1dcbddfb4509644b068 Author: Thomas Gleixner Date: Tue Jul 6 16:36:46 2021 +0200 locking/rtmutex: Provide rt_mutex_slowlock_locked() Split the inner workings of rt_mutex_slowlock() out into a seperate function which can be reused by the upcoming RT lock substitutions, e.g. for rw_semaphores. Signed-off-by: Thomas Gleixner commit d6de1c1adf2da4f84dec14f1bc6e9fe1bb314d5b Author: Thomas Gleixner Date: Tue Jul 6 16:36:46 2021 +0200 rtmutex: Split API and implementation Prepare for reusing the inner functions of rtmutex for RT lock substitutions. Signed-off-by: Thomas Gleixner commit 9abf291893bdd655e4ad4d2802e03d91dbbaa182 Author: Sebastian Andrzej Siewior Date: Mon Apr 26 09:40:07 2021 +0200 rtmutex: Convert macros to inlines Inlines are typesafe... Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 6d3f059f50e4bb1739feaf9cbc3d1adde0ce330e Author: Thomas Gleixner Date: Tue Jul 6 16:36:45 2021 +0200 sched/wake_q: Provide WAKE_Q_HEAD_INITIALIZER The RT specific spin/rwlock implementation requires special handling of the to be woken waiters. Provide a WAKE_Q_HEAD_INITIALIZER which can be used by the rtmutex code to implement a RT aware wake_q derivative. Signed-off-by: Thomas Gleixner commit e5cc3ca9494a713e99978ea40839d73b2882c314 Author: Thomas Gleixner Date: Tue Jul 6 16:36:45 2021 +0200 sched: Provide schedule point for RT locks RT enabled kernels substitute spin/rwlocks with 'sleeping' variants based on rtmutex. Blocking on such a lock is similar to preemption versus: - I/O scheduling and worker handling because these functions might block on another substituted lock or come from a lock contention within these functions. - RCU considers this like a preemption because the task might be in a read side critical section. 
Add a seperate scheduling point for this and hand a new scheduling mode argument to __schedule() which allows along with seperate mode masks to handle this gracefully from within the scheduler without proliferating that to other subsystems like RCU. Signed-off-by: Thomas Gleixner commit bf9603274469cf3c06e46646351791c0bf68fea4 Author: Thomas Gleixner Date: Tue Jul 6 16:36:45 2021 +0200 sched: Rework the __schedule() preempt argument PREEMPT_RT needs to hand a special state into __schedule() when a task blocks on a 'sleeping' spin/rwlock. This is required to handle rcu_note_context_switch() correctly without having special casing in the RCU code. From an RCU point of view the blocking on the sleeping spinlock is equivalent to preemption because the task might be in a read side critical section. schedule_debug() also has a check which would trigger with the !preempt case, but that could be handled differently. To avoid adding another argument and extra checks which cannot be optimized out by the compiler the following solution has been chosen: - Replace the boolean 'preempt' argument with an unsigned integer 'sched_mode' argument and define constants to hand in: (0 == No preemption, 1 = preemption). - Add two masks to apply on that mode one for the debug/rcu invocations and one for the actual scheduling decision. For a non RT kernel these masks are UINT_MAX, i.e. all bits are set which allows the compiler to optimze the AND operation out because it is not masking out anything. IOW, it's not different from the boolean. RT enabled kernels will define these masks seperately. Signed-off-by: Thomas Gleixner commit 8b3163b2445598d211c2c035d43293f0ec3f3245 Author: Thomas Gleixner Date: Tue Jul 6 16:36:44 2021 +0200 sched: Prepare for RT sleeping spin/rwlocks Waiting for spinlocks and rwlocks on non RT enabled kernels is task::state preserving. Any wakeup which matches the state is valid. RT enabled kernels substitutes them with 'sleeping' spinlocks. This creates an issue vs. task::state. In order to block on the lock the task has to overwrite task::state and a consecutive wakeup issued by the unlocker sets the state back to TASK_RUNNING. As a consequence the task loses the state which was set before the lock acquire and also any regular wakeup targeted at the task while it is blocked on the lock. To handle this gracefully add a 'saved_state' member to task_struct which is used in the following way: 1) When a task blocks on a 'sleeping' spinlock, the current state is saved in task::saved_state before it is set to TASK_RTLOCK_WAIT. 2) When the task unblocks and after acquiring the lock, it restores the saved state. 3) When a regular wakeup happens for a task while it is blocked then the state change of that wakeup is redirected to operate on task::saved_state. This is also required when the task state is running because the task might have been woken up from the lock wait and has not yet restored the saved state. To make it complete provide the necessary helpers to save and restore the saved state along with the necessary documentation how the RT lock blocking is supposed to work. For non-RT kernels there is no functional change. Signed-off-by: Thomas Gleixner commit e5376079ce19f17db369e9dda7a1c9c49dc85469 Author: Thomas Gleixner Date: Tue Jul 6 16:36:43 2021 +0200 sched: Introduce TASK_RTLOCK_WAIT RT kernels have an extra quirk for try_to_wake_up() to handle task state preservation accross blocking on a 'sleeping' spin/rwlock. 
For this to function correctly and under all circumstances try_to_wake_up() must be able to identify whether the wakeup is lock related or not and whether the task is waiting for a lock or not. The original approach was to use a special wake_flag argument for try_to_wake_up() and just use TASK_UNINTERRUPTIBLE for the tasks wait state and the try_to_wake_up() state argument. This works in principle, but due to the fact that try_to_wake_up() cannot determine whether the task is waiting for a RT lock wakeup or for a regular wakeup it's suboptimal. RT kernels save the original task state when blocking on a RT lock and restore it when the lock has been acquired. Any non lock related wakeup is checked against the saved state and if it matches the saved state is set to running so that the wakeup is not lost when the state is restored. While the necessary logic for the wake_flag based solution is trivial the downside is that any regular wakeup with TASK_UNINTERRUPTIBLE in the state argument set will wake the task despite the fact that it is still blocked on the lock. That's not a fatal problem as the lock wait has do deal with spurious wakeups anyway, but it introduces unneccesary latencies. Introduce the TASK_RTLOCK_WAIT state bit which will be set when a task blocks on a RT lock. The lock wakeup will use wake_up_state(TASK_RTLOCK_WAIT) so both the waiting state and the wakeup state are distinguishable, which avoids spurious wakeups and allows better analysis. Signed-off-by: Thomas Gleixner commit 7b569a8fe497f0487b6fdc81f7ab8342c93faed9 Author: Thomas Gleixner Date: Tue Jul 6 16:36:43 2021 +0200 sched: Split out the wakeup state check RT kernels have a slightly more complicated handling of wakeups due to 'sleeping' spin/rwlocks. If a task is blocked on such a lock then the original state of the task is preserved over the blocking and any regular (non lock related) wakeup has to be targeted at the saved state to ensure that these wakeups are not lost. Once the task acquired the lock it restores the task state from the saved state. To avoid cluttering try_to_wake_up() with that logic, split the wake up state check out into an inline helper and use it at both places where task::state is checked against the state argument of try_to_wake_up(). No functional change. Signed-off-by: Thomas Gleixner commit bee357cbbb92002a51e170be5fd4a38b78ad08cd Author: Thomas Gleixner Date: Sun Jul 17 21:41:35 2011 +0200 debugobjects: Make RT aware Avoid filling the pool / allocating memory with irqs off(). Signed-off-by: Thomas Gleixner commit e1a2ed90bf8524c10521585661247dba44919c36 Author: Thomas Gleixner Date: Sun Jul 17 21:56:42 2011 +0200 trace: Add migrate-disabled counter to tracing output Signed-off-by: Thomas Gleixner commit f507f34db3d7dece4994287ee33d14fe4e0ea08d Author: Grygorii Strashko Date: Tue Jul 21 19:43:56 2015 +0300 pid.h: include atomic.h This patch fixes build error: CC kernel/pid_namespace.o In file included from kernel/pid_namespace.c:11:0: include/linux/pid.h: In function 'get_pid': include/linux/pid.h:78:3: error: implicit declaration of function 'atomic_inc' [-Werror=implicit-function-declaration] atomic_inc(&pid->count); ^ which happens when CONFIG_PROVE_LOCKING=n CONFIG_DEBUG_SPINLOCK=n CONFIG_DEBUG_MUTEXES=n CONFIG_DEBUG_LOCK_ALLOC=n CONFIG_PID_NS=y Vanilla gets this via spinlock.h. 
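The saved_state handling from the 'Prepare for RT sleeping spin/rwlocks' and TASK_RTLOCK_WAIT commits above can be modelled with a small standalone sketch. The struct, the helper names and the state values are invented for illustration; only the numbered steps mirror the changelog:

    #define TASK_RUNNING            0x0000
    #define TASK_INTERRUPTIBLE      0x0001
    #define TASK_RTLOCK_WAIT        0x1000  /* assumed to be a distinct state bit */

    struct task_model {
            unsigned int state;
            unsigned int saved_state;
    };

    /* 1) Blocking on a 'sleeping' spinlock preserves the current state. */
    static void rtlock_wait_begin(struct task_model *t)
    {
            t->saved_state = t->state;
            t->state = TASK_RTLOCK_WAIT;
    }

    /* 3) A regular (non lock related) wakeup arriving while blocked is
     *    redirected to saved_state so it is not lost by the restore. */
    static void regular_wakeup(struct task_model *t, unsigned int wake_state)
    {
            if (t->state == TASK_RTLOCK_WAIT) {
                    if (t->saved_state & wake_state)
                            t->saved_state = TASK_RUNNING;
            } else if (t->state & wake_state) {
                    t->state = TASK_RUNNING;
            }
    }

    /* 2) After the lock is acquired the saved state is restored. */
    static void rtlock_wait_end(struct task_model *t)
    {
            t->state = t->saved_state;
            t->saved_state = TASK_RUNNING;
    }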
Signed-off-by: Grygorii Strashko Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 930fe8dd94675d20a7c6619bebf0f83ade04852d Author: Sebastian Andrzej Siewior Date: Mon Oct 28 12:19:57 2013 +0100 wait.h: include atomic.h | CC init/main.o |In file included from include/linux/mmzone.h:9:0, | from include/linux/gfp.h:4, | from include/linux/kmod.h:22, | from include/linux/module.h:13, | from init/main.c:15: |include/linux/wait.h: In function ‘wait_on_atomic_t’: |include/linux/wait.h:982:2: error: implicit declaration of function ‘atomic_read’ [-Werror=implicit-function-declaration] | if (atomic_read(val) == 0) | ^ This pops up on ARM. Non-RT gets its atomic.h include from spinlock.h Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 276abf48d000079da14ea183b11d9db7be82d8ce Author: Sebastian Andrzej Siewior Date: Thu Jul 26 15:06:10 2018 +0200 efi: Allow efi=runtime In case the command line option "efi=noruntime" is default at built-time, the user could overwrite its state by `efi=runtime' and allow it again. Acked-by: Ard Biesheuvel Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit cdfc123edc5e774f29e2e1502eebe4b929b88be0 Author: Sebastian Andrzej Siewior Date: Thu Jul 26 15:03:16 2018 +0200 efi: Disable runtime services on RT Based on meassurements the EFI functions get_variable / get_next_variable take up to 2us which looks okay. The functions get_time, set_time take around 10ms. Those 10ms are too much. Even one ms would be too much. Ard mentioned that SetVariable might even trigger larger latencies if the firware will erase flash blocks on NOR. The time-functions are used by efi-rtc and can be triggered during runtimed (either via explicit read/write or ntp sync). The variable write could be used by pstore. These functions can be disabled without much of a loss. The poweroff / reboot hooks may be provided by PSCI. Disable EFI's runtime wrappers. This was observed on "EFI v2.60 by SoftIron Overdrive 1000". Acked-by: Ard Biesheuvel Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 6107960d6569a17a622230eb01092f3b7e68eb2f Author: Sebastian Andrzej Siewior Date: Sat May 27 19:02:06 2017 +0200 net/core: disable NET_RX_BUSY_POLL on RT napi_busy_loop() disables preemption and performs a NAPI poll. We can't acquire sleeping locks with disabled preemption so we would have to work around this and add explicit locking for synchronisation against ksoftirqd. Without explicit synchronisation a low priority process would "own" the NAPI state (by setting NAPIF_STATE_SCHED) and could be scheduled out (no preempt_disable() and BH is preemptible on RT). In case a network packages arrives then the interrupt handler would set NAPIF_STATE_MISSED and the system would wait until the task owning the NAPI would be scheduled in again. Should a task with RT priority busy poll then it would consume the CPU instead allowing tasks with lower priority to run. The NET_RX_BUSY_POLL is disabled by default (the system wide sysctls for poll/read are set to zero) so disable NET_RX_BUSY_POLL on RT to avoid wrong locking context on RT. Should this feature be considered useful on RT systems then it could be enabled again with proper locking and synchronisation. 
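The net effect of the two EFI commits above (runtime services default to off on RT, "efi=runtime" turns them back on) can be shown with a few lines of standalone C. The variable and parser here are stand-ins, not the in-tree option handling:

    #include <stdbool.h>
    #include <string.h>

    #ifdef CONFIG_PREEMPT_RT
    static bool efi_runtime_off = true;   /* get_time/set_time latencies are too large */
    #else
    static bool efi_runtime_off = false;  /* unchanged default on !RT */
    #endif

    static void parse_efi_option(const char *opt)
    {
            if (strcmp(opt, "noruntime") == 0)
                    efi_runtime_off = true;
            else if (strcmp(opt, "runtime") == 0)   /* user override allowed again */
                    efi_runtime_off = false;
    }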
Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit fdfbb25ebdd3422ec3ee5dcc80dee9431b80700e Author: Thomas Gleixner Date: Mon Jul 18 17:03:52 2011 +0200 sched: Disable CONFIG_RT_GROUP_SCHED on RT Carsten reported problems when running: taskset 01 chrt -f 1 sleep 1 from within rc.local on a F15 machine. The task stays running and never gets on the run queue because some of the run queues have rt_throttled=1 which does not go away. Works nice from a ssh login shell. Disabling CONFIG_RT_GROUP_SCHED solves that as well. Signed-off-by: Thomas Gleixner commit 8ec5e355c7e2d39d1f915c2b5b690d95ca0ef471 Author: Ingo Molnar Date: Fri Jul 3 08:44:03 2009 -0500 mm: Allow only SLUB on RT Memory allocation disables interrupts as part of the allocation and freeing process. For -RT it is important that this section remain short and don't depend on the size of the request or an internal state of the memory allocator. At the beginning the SLAB memory allocator was adopted for RT's needs and it required substantial changes. Later, with the addition of the SLUB memory allocator we adopted this one as well and the changes were smaller. More important, due to the design of the SLUB allocator it performs better and its worst case latency was smaller. In the end only SLUB remained supported. Disable SLAB and SLOB on -RT. Only SLUB is adopted to -RT needs. Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 853484ed910413f59a6f4260593ba088ebf41a15 Author: Thomas Gleixner Date: Sun Jul 24 12:11:43 2011 +0200 kconfig: Disable config options which are not RT compatible Disable stuff which is known to have issues on RT Signed-off-by: Thomas Gleixner commit b0873a0cd9c408b8a302010378ad8dce7e7f6679 Author: Sebastian Andrzej Siewior Date: Thu Jan 23 14:45:59 2014 +0100 leds: trigger: disable CPU trigger on -RT as it triggers: |CPU: 0 PID: 0 Comm: swapper Not tainted 3.12.8-rt10 #141 |[] (unwind_backtrace+0x0/0xf8) from [] (show_stack+0x1c/0x20) |[] (show_stack+0x1c/0x20) from [] (dump_stack+0x20/0x2c) |[] (dump_stack+0x20/0x2c) from [] (__might_sleep+0x13c/0x170) |[] (__might_sleep+0x13c/0x170) from [] (__rt_spin_lock+0x28/0x38) |[] (__rt_spin_lock+0x28/0x38) from [] (rt_read_lock+0x68/0x7c) |[] (rt_read_lock+0x68/0x7c) from [] (led_trigger_event+0x2c/0x5c) |[] (led_trigger_event+0x2c/0x5c) from [] (ledtrig_cpu+0x54/0x5c) |[] (ledtrig_cpu+0x54/0x5c) from [] (arch_cpu_idle_exit+0x18/0x1c) |[] (arch_cpu_idle_exit+0x18/0x1c) from [] (cpu_startup_entry+0xa8/0x234) |[] (cpu_startup_entry+0xa8/0x234) from [] (rest_init+0xb8/0xe0) |[] (rest_init+0xb8/0xe0) from [] (start_kernel+0x2c4/0x380) Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 34809848f72d84c77c39ad68190ed1a11f573316 Author: Thomas Gleixner Date: Wed Jul 8 17:14:48 2015 +0200 jump-label: disable if stop_machine() is used Some architectures are using stop_machine() while switching the opcode which leads to latency spikes. 
The architectures which use stop_machine() atm: - ARM stop machine - s390 stop machine The architecures which use other sorcery: - MIPS - X86 - powerpc - sparc - arm64 Signed-off-by: Thomas Gleixner [bigeasy: only ARM for now] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit f9bffbde69457558991cbfc1b9f795d948a0a658 Author: Ingo Molnar Date: Fri Jul 3 08:29:57 2009 -0500 genirq: Disable irqpoll on -rt Creates long latencies for no value Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner commit bae73e9b444c5380dcebdebf5723581d0dae5b84 Author: Josh Cartwright Date: Thu Feb 11 11:54:00 2016 -0600 genirq: update irq_set_irqchip_state documentation On -rt kernels, the use of migrate_disable()/migrate_enable() is sufficient to guarantee a task isn't moved to another CPU. Update the irq_set_irqchip_state() documentation to reflect this. Signed-off-by: Josh Cartwright Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit abe17fca7cc1e2916140d51d42e1825d9a3cd053 Author: Sebastian Andrzej Siewior Date: Mon Feb 15 18:44:12 2021 +0100 smp: Wake ksoftirqd on PREEMPT_RT instead do_softirq(). The softirq implementation on PREEMPT_RT does not provide do_softirq(). The other user of do_softirq() is replaced with a local_bh_disable() + enable() around the possible raise-softirq invocation. This can not be done here because migration_cpu_stop() is invoked with disabled preemption. Wake the softirq thread on PREEMPT_RT if there are any pending softirqs. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit d7a13455f7f22dbe293e6d99eb8e8df4fb86a89f Author: Sebastian Andrzej Siewior Date: Thu Jul 1 17:43:16 2021 +0200 samples/kfifo: Rename read_lock/write_lock The variables names read_lock and write_lock can clash with functions used for read/writer locks. Rename read_lock to read_access and write_lock to write_access to avoid a name collision. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 554e55b4eee80c084864d7e799aa5ba505916e7f Author: Sebastian Andrzej Siewior Date: Mon Oct 12 17:33:54 2020 +0200 tcp: Remove superfluous BH-disable around listening_hash Commit 9652dc2eb9e40 ("tcp: relax listening_hash operations") removed the need to disable bottom half while acquiring listening_hash.lock. There are still two callers left which disable bottom half before the lock is acquired. Drop local_bh_disable() around __inet_hash() which acquires listening_hash->lock, invoke inet_ehash_nolisten() with disabled BH. inet_unhash() conditionally acquires listening_hash->lock. Reported-by: Mike Galbraith Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/linux-rt-users/12d6f9879a97cd56c09fb53dee343cbb14f7f1f7.camel@gmx.de/ Link: https://lkml.kernel.org/r/X9CheYjuXWc75Spa@hirez.programming.kicks-ass.net commit 2ba152ee3e8d7c312f73ad397b6bc134f62cc766 Author: Thomas Gleixner Date: Tue Sep 8 07:32:20 2020 +0200 net: Move lockdep where it belongs Signed-off-by: Thomas Gleixner commit ba75f58059e3c47602253bd9937ed39bfc35e5eb Author: Sebastian Andrzej Siewior Date: Fri Aug 14 18:53:34 2020 +0200 shmem: Use raw_spinlock_t for ->stat_lock Each CPU has SHMEM_INO_BATCH inodes available in `->ino_batch' which is per-CPU. Access here is serialized by disabling preemption. If the pool is empty, it gets reloaded from `->next_ino'. Access here is serialized by ->stat_lock which is a spinlock_t and can not be acquired with disabled preemption. 
One way around it would be to make a per-CPU ino_batch struct containing the inode number and a local_lock_t. Another solution is to promote ->stat_lock to a raw_spinlock_t. The critical sections are short. The mpol_put() should be moved outside of the critical section to avoid invoking the destructor with disabled preemption. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 109e285c64c2309882df143208263d195170d263 Author: Sebastian Andrzej Siewior Date: Mon Feb 11 10:40:46 2019 +0100 mm: workingset: replace IRQ-off check with a lockdep assert. Commit 68d48e6a2df57 ("mm: workingset: add vmstat counter for shadow nodes") introduced an IRQ-off check to ensure that a lock is held which also disables interrupts. This does not work the same way on -RT because none of the locks that are held disable interrupts. Replace this check with a lockdep assert which ensures that the lock is held. Cc: Peter Zijlstra Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 64d8a21476dee47df55dd3aafc0fd0c0a5d2b98e Author: Sebastian Andrzej Siewior Date: Tue Jul 3 18:19:48 2018 +0200 cgroup: use irqsave in cgroup_rstat_flush_locked() All callers of cgroup_rstat_flush_locked() acquire cgroup_rstat_lock either with spin_lock_irq() or spin_lock_irqsave(). cgroup_rstat_flush_locked() itself acquires cgroup_rstat_cpu_lock which is a raw_spin_lock. This lock is also acquired in cgroup_rstat_updated() in IRQ context and therefore requires the _irqsave() locking suffix in cgroup_rstat_flush_locked(). Since there is no difference between spinlock_t and raw_spinlock_t on !RT, lockdep does not complain here. On RT lockdep complains because the interrupts were not disabled here and a deadlock is possible. Acquire the raw_spinlock_t with disabled interrupts. 
Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 8ba34ad86e2432ce7de361d73bb7a54f9c879558 Author: Valentin Schneider Date: Sun Nov 22 20:19:04 2020 +0000 notifier: Make atomic_notifiers use raw_spinlock Booting a recent PREEMPT_RT kernel (v5.10-rc3-rt7-rebase) on my arm64 Juno leads to the idle task blocking on an RT sleeping spinlock down some notifier path: [ 1.809101] BUG: scheduling while atomic: swapper/5/0/0x00000002 [ 1.809116] Modules linked in: [ 1.809123] Preemption disabled at: [ 1.809125] secondary_start_kernel (arch/arm64/kernel/smp.c:227) [ 1.809146] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G W 5.10.0-rc3-rt7 #168 [ 1.809153] Hardware name: ARM Juno development board (r0) (DT) [ 1.809158] Call trace: [ 1.809160] dump_backtrace (arch/arm64/kernel/stacktrace.c:100 (discriminator 1)) [ 1.809170] show_stack (arch/arm64/kernel/stacktrace.c:198) [ 1.809178] dump_stack (lib/dump_stack.c:122) [ 1.809188] __schedule_bug (kernel/sched/core.c:4886) [ 1.809197] __schedule (./arch/arm64/include/asm/preempt.h:18 kernel/sched/core.c:4913 kernel/sched/core.c:5040) [ 1.809204] preempt_schedule_lock (kernel/sched/core.c:5365 (discriminator 1)) [ 1.809210] rt_spin_lock_slowlock_locked (kernel/locking/rtmutex.c:1072) [ 1.809217] rt_spin_lock_slowlock (kernel/locking/rtmutex.c:1110) [ 1.809224] rt_spin_lock (./include/linux/rcupdate.h:647 kernel/locking/rtmutex.c:1139) [ 1.809231] atomic_notifier_call_chain_robust (kernel/notifier.c:71 kernel/notifier.c:118 kernel/notifier.c:186) [ 1.809240] cpu_pm_enter (kernel/cpu_pm.c:39 kernel/cpu_pm.c:93) [ 1.809249] psci_enter_idle_state (drivers/cpuidle/cpuidle-psci.c:52 drivers/cpuidle/cpuidle-psci.c:129) [ 1.809258] cpuidle_enter_state (drivers/cpuidle/cpuidle.c:238) [ 1.809267] cpuidle_enter (drivers/cpuidle/cpuidle.c:353) [ 1.809275] do_idle (kernel/sched/idle.c:132 kernel/sched/idle.c:213 kernel/sched/idle.c:273) [ 1.809282] cpu_startup_entry (kernel/sched/idle.c:368 (discriminator 1)) [ 1.809288] secondary_start_kernel (arch/arm64/kernel/smp.c:273) Two points worth noting: 1) That this is conceptually the same issue as pointed out in: 313c8c16ee62 ("PM / CPU: replace raw_notifier with atomic_notifier") 2) Only the _robust() variant of atomic_notifier callchains suffer from this AFAICT only the cpu_pm_notifier_chain really needs to be changed, but singling it out would mean introducing a new (truly) non-blocking API. At the same time, callers that are fine with any blocking within the call chain should use blocking notifiers, so patching up all atomic_notifier's doesn't seem *too* crazy to me. Fixes: 70d932985757 ("notifier: Fix broken error handling pattern") Signed-off-by: Valentin Schneider Signed-off-by: Thomas Gleixner Reviewed-by: Daniel Bristot de Oliveira Link: https://lkml.kernel.org/r/20201122201904.30940-1-valentin.schneider@arm.com Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 10bc7873d8ce4ce530bea31f00160d6bcb2f7b04 Author: Thomas Gleixner Date: Mon Nov 9 23:32:39 2020 +0100 genirq: Move prio assignment into the newly created thread With enabled threaded interrupts the nouveau driver reported the following: | Chain exists of: | &mm->mmap_lock#2 --> &device->mutex --> &cpuset_rwsem | | Possible unsafe locking scenario: | | CPU0 CPU1 | ---- ---- | lock(&cpuset_rwsem); | lock(&device->mutex); | lock(&cpuset_rwsem); | lock(&mm->mmap_lock#2); The device->mutex is nvkm_device::mutex. Unblocking the lockchain at `cpuset_rwsem' is probably the easiest thing to do. 
Move the priority assignment to the start of the newly created thread. Fixes: 710da3c8ea7df ("sched/core: Prevent race condition between cpuset and __sched_setscheduler()") Reported-by: Mike Galbraith Signed-off-by: Thomas Gleixner [bigeasy: Patch description] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/a23a826af7c108ea5651e73b8fbae5e653f16e86.camel@gmx.de commit 35811571bf3a761645f138e5459c2e69b97fc8b6 Author: Sebastian Andrzej Siewior Date: Mon Nov 9 21:30:41 2020 +0100 kthread: Move prio/affinite change into the newly created thread With enabled threaded interrupts the nouveau driver reported the following: | Chain exists of: | &mm->mmap_lock#2 --> &device->mutex --> &cpuset_rwsem | | Possible unsafe locking scenario: | | CPU0 CPU1 | ---- ---- | lock(&cpuset_rwsem); | lock(&device->mutex); | lock(&cpuset_rwsem); | lock(&mm->mmap_lock#2); The device->mutex is nvkm_device::mutex. Unblocking the lockchain at `cpuset_rwsem' is probably the easiest thing to do. Move the priority reset to the start of the newly created thread. Fixes: 710da3c8ea7df ("sched/core: Prevent race condition between cpuset and __sched_setscheduler()") Reported-by: Mike Galbraith Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/a23a826af7c108ea5651e73b8fbae5e653f16e86.camel@gmx.de commit 0585cfb8fbe84f633c0395a1d108500c4618c501 Author: Sebastian Andrzej Siewior Date: Fri Jul 2 15:33:20 2021 +0200 mm, slub: Correct ordering in slab_unlock() Fold into mm, slub: optionally save/restore irqs in slab_[un]lock()/ Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 340e7c4136c3712d5c931a57e91100fb73305452 Author: Vlastimil Babka Date: Sat May 22 01:59:38 2021 +0200 mm, slub: convert kmem_cpu_slab protection to local_lock Embed local_lock into struct kmem_cpu_slab and use the irq-safe versions of local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT that's equivalent, with better lockdep visibility. On PREEMPT_RT that means better preemption. However, the cost on PREEMPT_RT is the loss of lockless fast paths which only work with cpu freelist. Those are designed to detect and recover from being preempted by other conflicting operations (both fast or slow path), but the slow path operations assume they cannot be preempted by a fast path operation, which is guaranteed naturally with disabled irqs. With local locks on PREEMPT_RT, the fast paths now also need to take the local lock to avoid races. In the allocation fastpath slab_alloc_node() we can just defer to the slowpath __slab_alloc() which also works with cpu freelist, but under the local lock. In the free fastpath do_slab_free() we have to add a new local lock protected version of freeing to the cpu freelist, as the existing slowpath only works with the page freelist. Also update the comment about locking scheme in SLUB to reflect changes done by this series. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit 2180da7ea70a0fa7c6cc9fd5350805f87bd2d5a9 Author: Vlastimil Babka Date: Fri May 21 14:03:23 2021 +0200 mm, slub: use migrate_disable() on PREEMPT_RT We currently use preempt_disable() (directly or via get_cpu_ptr()) to stabilize the pointer to kmem_cache_cpu. On PREEMPT_RT this would be incompatible with the list_lock spinlock. We can use migrate_disable() instead, but that increases overhead on !PREEMPT_RT as it's an unconditional function call even though it's ultimately a migrate_disable() there. 
In order to get the best available mechanism on both PREEMPT_RT and !PREEMPT_RT, introduce private slub_get_cpu_ptr() and slub_put_cpu_ptr() wrappers and use them. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit 98ac7c83f7611f324cfff622371adb51e6b5ebbe Author: Vlastimil Babka Date: Fri Jun 4 12:03:23 2021 +0200 mm, slub: make slab_lock() disable irqs with PREEMPT_RT We need to disable irqs around slab_lock() (a bit spinlock) to make it irq-safe. The calls to slab_lock() are nested under spin_lock_irqsave() which doesn't disable irqs on PREEMPT_RT, so add explicit disabling with PREEMPT_RT. We also distinguish cmpxchg_double_slab() where we do the disabling explicitly and __cmpxchg_double_slab() for contexts with already disabled irqs. However these context are also typically spin_lock_irqsave() thus insufficient on PREEMPT_RT. Thus, change __cmpxchg_double_slab() to be same as cmpxchg_double_slab() on PREEMPT_RT. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit dde8c73f2bd04af94cef72c96424d776537170af Author: Vlastimil Babka Date: Fri Jun 4 12:55:55 2021 +0200 mm, slub: optionally save/restore irqs in slab_[un]lock()/ For PREEMPT_RT we will need to disable irqs for this bit spinlock. As a preparation, add a flags parameter, and an internal version that takes additional bool parameter to control irq saving/restoring (the flags parameter is compile-time unused if the bool is a constant false). Convert ___cmpxchg_double_slab(), which also comes with the same bool parameter, to use the internal version. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit de1f2497acfb0635558cb8cf610592adc7e8c83c Author: Sebastian Andrzej Siewior Date: Thu Jul 16 18:47:50 2020 +0200 mm: slub: Make object_map_lock a raw_spinlock_t The variable object_map is protected by object_map_lock. The lock is always acquired in debug code and within already atomic context Make object_map_lock a raw_spinlock_t. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit 12a3a78defce0f8ba270c624479d130eb571775b Author: Sebastian Andrzej Siewior Date: Fri Feb 26 17:11:55 2021 +0100 mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context flush_all() flushes a specific SLAB cache on each CPU (where the cache is present). The deactivate_slab()/__free_slab() invocation happens within IPI handler and is problematic for PREEMPT_RT. The flush operation is not a frequent operation or a hot path. The per-CPU flush operation can be moved to within a workqueue. [vbabka@suse.cz: adapt to new SLUB changes] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit 6e256a70bacc64981c554f3462f0f1fb22f7dd70 Author: Vlastimil Babka Date: Thu Jun 3 19:17:42 2021 +0200 mm, slab: make flush_slab() possible to call with irqs enabled Currently flush_slab() is always called with disabled IRQs if it's needed, but the following patches will change that, so add a parameter to control IRQ disabling within the function, which only protects the kmem_cache_cpu manipulation and not the call to deactivate_slab() which doesn't need it. 
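The slub_get_cpu_ptr()/slub_put_cpu_ptr() wrappers mentioned in the migrate_disable() commit above look roughly like this; treat the bodies as a sketch of the pattern rather than a verbatim quote of mm/slub.c:

    #ifdef CONFIG_PREEMPT_RT
    #define slub_get_cpu_ptr(var)           \
    ({                                      \
            migrate_disable();              \
            this_cpu_ptr(var);              \
    })
    #define slub_put_cpu_ptr(var)           \
    do {                                    \
            (void)(var);                    \
            migrate_enable();               \
    } while (0)
    #else
    #define slub_get_cpu_ptr(var)   get_cpu_ptr(var)
    #define slub_put_cpu_ptr(var)   put_cpu_ptr(var)
    #endif

On !RT this keeps the cheap preempt_disable() based get_cpu_ptr() path, while RT pays for migrate_disable() only where the list_lock spinlock rules require it.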
Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit f66b34cdc3a0b4c01b59d8597bc0cbe36adb2196 Author: Vlastimil Babka Date: Fri May 21 01:48:56 2021 +0200 mm, slub: don't disable irqs in slub_cpu_dead() slub_cpu_dead() cleans up for an offlined cpu from another cpu and calls only functions that are now irq safe, so we don't need to disable irqs anymore. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit 15742266f9d30e85fc14035d9491ac1803427546 Author: Vlastimil Babka Date: Fri May 21 01:16:54 2021 +0200 mm, slub: only disable irq with spin_lock in __unfreeze_partials() __unfreeze_partials() no longer needs to have irqs disabled, except for making the spin_lock operations irq-safe, so convert the spin_locks operations and remove the separate irq handling. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit 02194b557292dbfd34cd029d93f926f2de7abb6c Author: Vlastimil Babka Date: Thu May 20 16:39:51 2021 +0200 mm, slub: detach percpu partial list in unfreeze_partials() using this_cpu_cmpxchg() Instead of relying on disabled irqs for atomicity when detaching the percpu partial list, we can use this_cpu_cmpxchg() and detach without irqs disabled. However, unfreeze_partials() can be also called from another cpu on behalf of a cpu that is being offlined, so we need to restructure the code accordingly: - __unfreeze_partials() is the bulk of unfreeze_partials() that processes the detached percpu partial list - unfreeze_partials() uses this_cpu_cmpxchg() to detach list from current cpu - unfreeze_partials_cpu() is to be called for the offlined cpu so it needs no protection, and is called from __flush_cpu_slab() - flush_cpu_slab() is for the local cpu thus it needs to call unfreeze_partials(). So it can't simply call __flush_cpu_slab(smp_processor_id()) anymore and we have to open-code it Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit 62047a84e9eb8f87b95635a312f17cd1e3b86454 Author: Vlastimil Babka Date: Thu May 20 14:18:12 2021 +0200 mm, slub: detach whole partial list at once in unfreeze_partials() Instead of iterating through the live percpu partial list, detach it from the kmem_cache_cpu at once. This is simpler and will allow further optimization. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit 85fd98f06f02b01196a056d84209d475c10f3c7f Author: Vlastimil Babka Date: Thu May 20 14:01:57 2021 +0200 mm, slub: discard slabs in unfreeze_partials() without irqs disabled No need for disabled irqs when discarding slabs, so restore them before discarding. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit bfcb75f81573c966b314d7dc56f612313dbe22c6 Author: Vlastimil Babka Date: Thu May 20 14:00:03 2021 +0200 mm, slub: move irq control into unfreeze_partials() unfreeze_partials() can be optimized so that it doesn't need irqs disabled for the whole time. As the first step, move irq control into the function and remove it from the put_cpu_partial() caller. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit e6acdc5f5fc6ef0b23e83eb0f73064d854a7e164 Author: Vlastimil Babka Date: Wed May 12 14:04:43 2021 +0200 mm, slub: call deactivate_slab() without disabling irqs The function is now safe to be called with irqs enabled, so move the calls outside of irq disabled sections. 
When called from ___slab_alloc() -> flush_slab() we have irqs disabled, so to re-enable them before deactivate_slab() we need to open-code flush_slab() in ___slab_alloc() and re-enable irqs after modifying the kmem_cache_cpu fields. But that means an IRQ handler might meanwhile have assigned a new page to kmem_cache_cpu.page, so we have to retry the whole check. The remaining callers of flush_slab() are the IPI handler, which has disabled irqs anyway, and slub_cpu_dead(), which will be dealt with in the following patch. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit fc54ebb41d5d25040b3492426224b52aaa43194d Author: Vlastimil Babka Date: Wed May 12 13:59:58 2021 +0200 mm, slub: make locking in deactivate_slab() irq-safe deactivate_slab() now no longer touches the kmem_cache_cpu structure, so it will be possible to call it with irqs enabled. Just convert the spin_lock calls to their irq saving/restoring variants to make it irq-safe. Note we now have to use cmpxchg_double_slab() for irq-safe slab_lock(), because in some situations we don't take the list_lock, which would disable irqs. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit 378a8597511dc865bdd387e2045673eecf850e32 Author: Vlastimil Babka Date: Wed May 12 13:53:34 2021 +0200 mm, slub: move reset of c->page and freelist out of deactivate_slab() deactivate_slab() removes the cpu slab by merging the cpu freelist with the slab's freelist and putting the slab on the proper node's list. It also sets the respective kmem_cache_cpu pointers to NULL. By extracting the kmem_cache_cpu operations from the function, we can make it not dependent on disabled irqs. Also if we return a single free pointer from ___slab_alloc, we no longer have to assign kmem_cache_cpu.page before deactivation or care if somebody preempted us and assigned a different page to our kmem_cache_cpu in the process. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit a1bedf14b4ef5bd9fb633ee89841cfb7b8f2c432 Author: Vlastimil Babka Date: Tue May 11 17:45:26 2021 +0200 mm, slub: stop disabling irqs around get_partial() The function get_partial() does not need to have irqs disabled as a whole. It's sufficient to convert spin_lock operations to their irq saving/restoring versions. As a result, it's now possible to reach the page allocator from the slab allocator without disabling and re-enabling interrupts on the way. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit e7fa6bb0ab58c69fb6cf31846fdf0129caf7e7ca Author: Vlastimil Babka Date: Tue May 11 16:56:09 2021 +0200 mm, slub: check new pages with restored irqs Building on top of the previous patch, re-enable irqs before checking new pages. alloc_debug_processing() is now called with enabled irqs so we need to remove VM_BUG_ON(!irqs_disabled()); in check_slab() - there doesn't seem to be a need for it anyway. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit 843f16905f6683346e6c962c4ed39f6fbdf313e8 Author: Vlastimil Babka Date: Tue May 11 16:37:51 2021 +0200 mm, slub: validate slab from partial list or page allocator before making it cpu slab When we obtain a new slab page from the node partial list or the page allocator, we assign it to kmem_cache_cpu, perform some checks, and if they fail, we undo the assignment. In order to allow doing the checks without irqs disabled, restructure the code so that the checks are done first, and the kmem_cache_cpu.page assignment only after they pass.
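The reordering described in that last change could be sketched as follows (helper names as used elsewhere in slub; the real control flow in ___slab_alloc() is more involved):

    /* checks run with irqs enabled, before the page is published as cpu slab */
    if (kmem_cache_debug(s) &&
        !alloc_debug_processing(s, page, freelist, addr))
            goto new_slab;                  /* checks failed: discard and retry */

    if (unlikely(!pfmemalloc_match(page, gfpflags))) {
            /* return the object but do not keep the page as the cpu slab */
            deactivate_slab(s, page, get_freepointer(s, freelist));
            return freelist;
    }

    /* all checks passed: only now install the page, with irqs disabled */
    local_irq_save(flags);
    c->page = page;
    c->freelist = get_freepointer(s, freelist);
    local_irq_restore(flags);
    return freelist;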
Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit 033708e4faa880c775131def8c6242815bd23e4d Author: Vlastimil Babka Date: Mon May 10 16:30:01 2021 +0200 mm, slub: restore irqs around calling new_slab() allocate_slab() currently re-enables irqs before calling into the page allocator. It depends on gfpflags_allow_blocking() to determine if it's safe to do so. Now we can instead simply restore irqs before calling it through new_slab(). The other caller early_kmem_cache_node_alloc() is unaffected by this. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit aa890ddaca1d27ceba0db33e31c829c006cff07e Author: Vlastimil Babka Date: Mon May 10 13:56:17 2021 +0200 mm, slub: move disabling irqs closer to get_partial() in ___slab_alloc() Continue reducing the irq disabled scope. Check for per-cpu partial slabs first with irqs enabled and then recheck with irqs disabled before grabbing the slab page. Mostly preparatory for the following patches. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit 8c1d368c71c0287d81afbe7a8b1baf28d8f72b1b Author: Vlastimil Babka Date: Sat May 8 02:28:02 2021 +0200 mm, slub: do initial checks in ___slab_alloc() with irqs enabled As another step of shortening irq disabled sections in ___slab_alloc(), delay disabling irqs until we pass the initial checks if there is a cached percpu slab and it's suitable for our allocation. Now we have to recheck c->page after actually disabling irqs as an allocation in an irq handler might have replaced it. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner Acked-by: Mel Gorman commit 12c69bab1ece4076e397f2fc740078da0c3b1238 Author: Vlastimil Babka Date: Fri May 7 19:32:31 2021 +0200 mm, slub: move disabling/enabling irqs to ___slab_alloc() Currently __slab_alloc() disables irqs around the whole ___slab_alloc(). This includes cases where this is not needed, such as when the allocation ends up in the page allocator and has to awkwardly enable irqs back based on gfp flags. Also the whole kmem_cache_alloc_bulk() is executed with irqs disabled even when it hits the __slab_alloc() slow path, and long periods with disabled interrupts are undesirable. As a first step towards reducing irq disabled periods, move irq handling into ___slab_alloc(). Callers will instead prevent the s->cpu_slab percpu pointer from becoming invalid via get_cpu_ptr(), thus preempt_disable(). This does not protect against modification by an irq handler, which is still prevented by disabled irqs for most of ___slab_alloc(). As a small immediate benefit, slab_out_of_memory() from ___slab_alloc() is now called with irqs enabled. kmem_cache_alloc_bulk() disables irqs for its fastpath and then re-enables them before calling ___slab_alloc(), which then disables them at its discretion. The whole kmem_cache_alloc_bulk() operation also disables preemption. When ___slab_alloc() calls new_slab() to allocate a new page, re-enable preemption, because new_slab() will re-enable interrupts in contexts that allow blocking (this will be improved by later patches). The patch itself will thus increase overhead a bit due to disabled preemption (on configs where it matters) and increased disabling/enabling irqs in kmem_cache_alloc_bulk(), but that will be gradually improved in the following patches. Note in __slab_alloc() we need to change the #ifdef CONFIG_PREEMPT guard to CONFIG_PREEMPT_COUNT to make sure preempt disable/enable is properly paired in all configurations.
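Taken together with the slub_get_cpu_ptr()/slub_put_cpu_ptr() wrappers introduced further up, the resulting __slab_alloc() wrapper could end up looking roughly like this sketch (the RT variant of the wrappers is assumed to use migrate_disable()):

    #ifdef CONFIG_PREEMPT_RT
    /* keep the task on this cpu but stay preemptible */
    #define slub_get_cpu_ptr(var)   ({ migrate_disable(); this_cpu_ptr(var); })
    #define slub_put_cpu_ptr(var)   do { (void)(var); migrate_enable(); } while (0)
    #else
    #define slub_get_cpu_ptr(var)   get_cpu_ptr(var)
    #define slub_put_cpu_ptr(var)   put_cpu_ptr(var)
    #endif

    static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
                              unsigned long addr, struct kmem_cache_cpu *c)
    {
            void *p;

    #ifdef CONFIG_PREEMPT_COUNT
            /*
             * We may have been preempted and rescheduled on a different cpu
             * before disabling preemption; re-read the per-cpu pointer.
             */
            c = slub_get_cpu_ptr(s->cpu_slab);
    #endif
            p = ___slab_alloc(s, gfpflags, node, addr, c);
    #ifdef CONFIG_PREEMPT_COUNT
            slub_put_cpu_ptr(s->cpu_slab);
    #endif
            return p;
    }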
On configs without involuntary preemption and debugging, the re-read of the kmem_cache_cpu pointer is still compiled out as it was before. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit 78ed20c1e4aab33e8ec0f7422442765d2804a278 Author: Vlastimil Babka Date: Tue May 18 02:01:39 2021 +0200 mm, slub: simplify kmem_cache_cpu and tid setup In slab_alloc_node() and do_slab_free() fastpaths we need to guarantee that our kmem_cache_cpu pointer is from the same cpu as the tid value. Currently that's done by reading the tid first using this_cpu_read(), then the kmem_cache_cpu pointer and verifying we read the same tid using the pointer and plain READ_ONCE(). This can be simplified to just fetching the kmem_cache_cpu pointer and then reading the tid using that pointer. That guarantees they are from the same cpu. We don't need to read the tid using this_cpu_read() because the value will be validated by this_cpu_cmpxchg_double(), making sure we are on the correct cpu and that the freelist wasn't changed by anyone preempting us since reading the tid. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner Acked-by: Mel Gorman commit ae3b1f17e84ed59a61b35f12d0fb6d8650a629c1 Author: Vlastimil Babka Date: Tue May 11 18:25:09 2021 +0200 mm, slub: restructure new page checks in ___slab_alloc() When we allocate a slab object from a newly acquired page (from node's partial list or page allocator), we usually also retain the page as a new percpu slab. There are two exceptions - when pfmemalloc status of the page doesn't match our gfp flags, or when the cache has debugging enabled. The current code for these decisions is not easy to follow, so restructure it and add comments. The new structure will also help with the following changes. No functional change. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner Acked-by: Mel Gorman commit d071b8eed93b9a884b268668f1eb75c06804b6a8 Author: Vlastimil Babka Date: Tue May 11 14:05:22 2021 +0200 mm, slub: return slab page from get_partial() and set c->page afterwards The function get_partial() finds a suitable page on a partial list, acquires and returns its freelist and assigns the page pointer to kmem_cache_cpu. In a later patch we will need more control over the kmem_cache_cpu.page assignment, so instead of passing a kmem_cache_cpu pointer, pass a pointer to a page pointer that get_partial() can fill, so that the caller can assign the kmem_cache_cpu.page pointer. No functional change as all of this still happens with disabled IRQs. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit 1b92ed69ba9349f97defd6f62788fa8949bcfa2f Author: Vlastimil Babka Date: Tue May 11 13:01:34 2021 +0200 mm, slub: dissolve new_slab_objects() into ___slab_alloc() The later patches will need more fine-grained control over individual actions in ___slab_alloc(), the only caller of new_slab_objects(), so dissolve it there. This is a preparatory step with no functional change. The only minor change is moving WARN_ON_ONCE() for using a constructor together with __GFP_ZERO to new_slab(), which makes it somewhat less frequent, but still able to catch a development change introducing a systematic misuse.
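The kmem_cache_cpu/tid simplification described above boils down to something like the following fragment (a sketch of the fastpath setup only):

    /* slab_alloc_node() fastpath, simplified */
    struct kmem_cache_cpu *c;
    unsigned long tid;
    void *object;

    c = raw_cpu_ptr(s->cpu_slab);
    tid = READ_ONCE(c->tid);        /* tid and c now come from the same cpu */

    /*
     * If we migrate to another cpu between these reads and the final
     * this_cpu_cmpxchg_double() on (freelist, tid), that cmpxchg simply
     * fails and the fastpath is retried.
     */
    object = c->freelist;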
Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner Acked-by: Christoph Lameter Acked-by: Mel Gorman commit 3c7b04f4ae2506ce43d7a6a0add926578616a583 Author: Vlastimil Babka Date: Tue May 11 12:45:48 2021 +0200 mm, slub: extract get_partial() from new_slab_objects() The later patches will need more fine-grained control over individual actions in ___slab_alloc(), the only caller of new_slab_objects(), so this is a first preparatory step with no functional change. This adds a goto label that appears unnecessary at this point, but will be useful for later changes. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner Acked-by: Christoph Lameter commit 49dde93b06d1b735f768dbe9c8a1209adfed6964 Author: Vlastimil Babka Date: Fri Jun 4 12:16:14 2021 +0200 mm, slub: unify cmpxchg_double_slab() and __cmpxchg_double_slab() These functions differ only in irq disabling in the slow path. We can create a common function with an extra bool parameter to control the irq disabling. As the functions are inline and the parameter compile-time constant, there will be no runtime overhead due to this change. Also change the DEBUG_VM based irqs disable assert to the more standard lockdep_assert based one. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit 26d89006c6617445dfabec3ea1ae441f3a1e09d6 Author: Vlastimil Babka Date: Tue Jun 8 01:19:03 2021 +0200 mm, slub: remove redundant unfreeze_partials() from put_cpu_partial() Commit d6e0b7fa1186 ("slub: make dead caches discard free slabs immediately") introduced cpu partial flushing for kmemcg caches, based on setting the target cpu_partial to 0 and adding a flushing check in put_cpu_partial(). This code that sets cpu_partial to 0 was later moved by c9fc586403e7 ("slab: introduce __kmemcg_cache_deactivate()") and ultimately removed by 9855609bde03 ("mm: memcg/slab: use a single set of kmem_caches for all accounted allocations"). However, the check and flush in put_cpu_partial() were never removed, although they are effectively dead code, so this patch removes them. Note that d6e0b7fa1186 also added preempt_disable()/enable() to unfreeze_partials(), which could thus also be considered unnecessary. But further patches will rely on it, so keep it. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit 8f9b6e2f13011d74a46290b0ebc3326e72066441 Author: Vlastimil Babka Date: Fri May 21 01:25:06 2021 +0200 mm, slub: don't disable irq for debug_check_no_locks_freed() In slab_free_hook() we disable irqs around the debug_check_no_locks_freed() call, which is unnecessary, as irqs are already being disabled inside the call. This seems to be a leftover from the past, when there were more calls inside the irq disabled sections. Remove the irq disable/enable operations. Mel noted: > Looks like it was needed for kmemcheck which went away back in 4.15 Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner Acked-by: Mel Gorman commit bec329bbe49dbbc27fd4eccfdf837f5fe81a3ece Author: Vlastimil Babka Date: Sun May 23 01:37:07 2021 +0200 mm, slub: allocate private object map for validate_slab_cache() validate_slab_cache() is called either to handle a sysfs write, or from a self-test context. In both situations it's straightforward to preallocate a private object bitmap instead of grabbing the shared static one meant for critical sections, so let's do that.
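A sketch of such a preallocation in validate_slab_cache() (bitmap_alloc() takes the number of bits, here one per object in the largest slab order; the obj_map parameter of validate_slab_node() is assumed to be added by this change):

    static long validate_slab_cache(struct kmem_cache *s)
    {
            int node;
            unsigned long count = 0;
            struct kmem_cache_node *n;
            unsigned long *obj_map;

            /* private map: no need for the shared object_map / object_map_lock */
            obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
            if (!obj_map)
                    return -ENOMEM;

            flush_all(s);
            for_each_kmem_cache_node(s, node, n)
                    count += validate_slab_node(s, n, obj_map);

            bitmap_free(obj_map);
            return count;
    }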
Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner Acked-by: Christoph Lameter Acked-by: Mel Gorman commit cea6298794478c2574ffc948de060c5a835f6563 Author: Vlastimil Babka Date: Sun May 23 01:28:37 2021 +0200 mm, slub: allocate private object map for sysfs listings Slub has a static spinlock protected bitmap for marking which objects are on freelist when it wants to list them, for situations where dynamically allocating such map can lead to recursion or locking issues, and on-stack bitmap would be too large. The handlers of sysfs files alloc_calls and free_calls also currently use this shared bitmap, but their syscall context makes it straightforward to allocate a private map before entering locked sections, so switch these processing paths to use a private bitmap. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner Acked-by: Christoph Lameter Acked-by: Mel Gorman commit 0492d5407c2e9a3d6b2fc85e19b91f6089185b89 Author: Vlastimil Babka Date: Fri May 28 14:32:10 2021 +0200 mm, slub: don't call flush_all() from list_locations() list_locations() can only be called on caches with SLAB_STORE_USER flag and as with all slub debugging flags, such caches avoid cpu or percpu partial slabs altogether, so there's nothing to flush. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner commit 72f1ab0179ac5cc017511f9e0322fc30df403d24 Author: Mel Gorman Date: Fri May 14 15:46:22 2021 +0100 mm/page_alloc: Split per cpu page lists and zone stats -fix mm/ is not W=1 clean for allnoconfig but the patch "mm/page_alloc: Split per cpu page lists and zone stats" makes it worse with the following warning mm/vmstat.c: In function ‘zoneinfo_show_print’: mm/vmstat.c:1698:28: warning: variable ‘pzstats’ set but not used [-Wunused-but-set-variable] struct per_cpu_zonestat *pzstats; ^~~~~~~ This is a fix to the mmotm patch mm-page_alloc-split-per-cpu-page-lists-and-zone-stats.patch. Signed-off-by: Mel Gorman Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 7d4d69cd8788af7f2dfea018a3407c88c2416a82 Author: Mel Gorman Date: Wed May 12 10:54:58 2021 +0100 mm/page_alloc: Update PGFREE outside the zone lock in __free_pages_ok VM events do not need explicit protection by disabling IRQs so update the counter with IRQs enabled in __free_pages_ok. Signed-off-by: Mel Gorman Signed-off-by: Thomas Gleixner Acked-by: Vlastimil Babka Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 8ec908f71c8d16d0bf9cde514c762c83e5ff3f41 Author: Mel Gorman Date: Wed May 12 10:54:57 2021 +0100 mm/page_alloc: Avoid conflating IRQs disabled with zone->lock Historically when freeing pages, free_one_page() assumed that callers had IRQs disabled and the zone->lock could be acquired with spin_lock(). This confuses the scope of what local_lock_irq is protecting and what zone->lock is protecting in free_unref_page_list in particular. This patch uses spin_lock_irqsave() for the zone->lock in free_one_page() instead of relying on callers to have disabled IRQs. free_unref_page_commit() is changed to only deal with PCP pages protected by the local lock. free_unref_page_list() then first frees isolated pages to the buddy lists with free_one_page() and frees the rest of the pages to the PCP via free_unref_page_commit(). The end result is that free_one_page() is no longer depending on side-effects of local_lock to be correct. Note that this may incur a performance penalty while memory hot-remove is running but that is not a common operation. 
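Sketched, free_one_page() then takes the zone lock itself instead of assuming the caller disabled IRQs (a rough outline of the described change):

    static void free_one_page(struct zone *zone, struct page *page,
                              unsigned long pfn, unsigned int order,
                              int migratetype, fpi_t fpi_flags)
    {
            unsigned long flags;

            /* no longer relies on the caller having disabled IRQs */
            spin_lock_irqsave(&zone->lock, flags);
            if (unlikely(has_isolate_pageblock(zone) ||
                         is_migrate_isolate(migratetype)))
                    migratetype = get_pfnblock_migratetype(page, pfn);
            __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
            spin_unlock_irqrestore(&zone->lock, flags);
    }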
[lkp@intel.com: Ensure CMA pages get added to correct pcp list] Signed-off-by: Mel Gorman Signed-off-by: Thomas Gleixner Acked-by: Vlastimil Babka Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit b6ff3966dcbfa74c9633e1a8c470a244414986f2 Author: Mel Gorman Date: Wed May 12 10:54:56 2021 +0100 mm/page_alloc: Explicitly acquire the zone lock in __free_pages_ok __free_pages_ok() disables IRQs before calling a common helper free_one_page() that acquires the zone lock. This is not safe according to Documentation/locking/locktypes.rst and in this context, IRQ disabling is not protecting a per_cpu_pages structure either or a local_lock would be used. This patch explicitly acquires the lock with spin_lock_irqsave instead of relying on a helper. This removes the last instance of local_irq_save() in page_alloc.c. Signed-off-by: Mel Gorman Signed-off-by: Thomas Gleixner Acked-by: Vlastimil Babka Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 16e165bea08e613e5385ce3b5c84ce01c0c809ea Author: Mel Gorman Date: Wed May 12 10:54:55 2021 +0100 mm/page_alloc: Reduce duration that IRQs are disabled for VM counters IRQs are left disabled for the zone and node VM event counters. This is unnecessary as the affected counters are allowed to race with preemption and IRQs. This patch reduces the scope of IRQs being disabled via local_[lock|unlock]_irq on !PREEMPT_RT kernels. One __mod_zone_freepage_state call is still made with IRQs disabled. While this could be moved out, it's not free on all architectures as some require IRQs to be disabled for mod_zone_page_state on !PREEMPT_RT kernels. Signed-off-by: Mel Gorman Signed-off-by: Thomas Gleixner Acked-by: Vlastimil Babka Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit c7285ff94096c60cf03f31f6057b4af38dda92cf Author: Mel Gorman Date: Wed May 12 10:54:54 2021 +0100 mm/page_alloc: Batch the accounting updates in the bulk allocator Now that the zone_statistics are simple counters that do not require special protection, the bulk allocator accounting updates can be batch updated without adding too much complexity with protected RMW updates or using xchg. Signed-off-by: Mel Gorman Signed-off-by: Thomas Gleixner Acked-by: Vlastimil Babka Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 069f3cf439ee2eeddd18ec134201c111de95f31e Author: Mel Gorman Date: Wed May 12 10:54:53 2021 +0100 mm/vmstat: Inline NUMA event counter updates __count_numa_event is small enough to be treated similarly to __count_vm_event so inline it. Signed-off-by: Mel Gorman Signed-off-by: Thomas Gleixner Acked-by: Vlastimil Babka Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 39642efb7daa3a5f640bbb852bb981a709d66917 Author: Mel Gorman Date: Wed May 12 10:54:52 2021 +0100 mm/vmstat: Convert NUMA statistics to basic NUMA counters NUMA statistics are maintained on the zone level for hits, misses, foreign etc., but nothing relies on them being perfectly accurate for functional correctness. The counters are used by userspace to get a general overview of a workload's NUMA behaviour but the page allocator incurs a high cost to maintain perfect accuracy similar to what is required for a vmstat counter like NR_FREE_PAGES. There is even a sysctl vm.numa_stat to allow userspace to turn off the collection of NUMA statistics like NUMA_HIT. This patch converts NUMA_HIT and friends to be NUMA events with similar accuracy to VM events.
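With the relaxed accuracy, a NUMA event update can become a plain per-cpu increment, roughly like this sketch (field names follow the split per-cpu zone statistics described further down in this log):

    static inline void __count_numa_event(struct zone *zone,
                                          enum numa_stat_item item)
    {
            struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;

            /* no irq disabling, no atomics: occasional lost updates are acceptable */
            raw_cpu_inc(pzstats->vm_numa_event[item]);
    }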
There is a possibility that slight errors will be introduced but the overall trend as seen by userspace will be similar. The counters are no longer updated from vmstat_refresh context as it is unnecessary overhead for counters that may never be read by userspace. Note that counters could be maintained at the node level to save space but it would have a user-visible impact due to /proc/zoneinfo. [lkp@intel.com: Fix misplaced closing brace for !CONFIG_NUMA] Signed-off-by: Mel Gorman Signed-off-by: Thomas Gleixner Acked-by: Vlastimil Babka Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 7e057409dc6848aa4ca5be8d922fba948742ad7b Author: Mel Gorman Date: Wed May 12 10:54:51 2021 +0100 mm/page_alloc: Convert per-cpu list protection to local_lock There is a lack of clarity of what exactly local_irq_save/local_irq_restore protects in page_alloc.c . It conflates the protection of per-cpu page allocation structures with per-cpu vmstat deltas. This patch protects the PCP structure using local_lock which for most configurations is identical to IRQ enabling/disabling. The scope of the lock is still wider than it should be but this is decreased later. It is possible for the local_lock to be embedded safely within struct per_cpu_pages but it adds complexity to free_unref_page_list. [lkp@intel.com: Make pagesets static] Signed-off-by: Mel Gorman Signed-off-by: Thomas Gleixner Acked-by: Vlastimil Babka Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit b40b27ff9be924922be1a2ae6ed885e9412d22d7 Author: Mel Gorman Date: Wed May 12 10:54:50 2021 +0100 mm/page_alloc: Split per cpu page lists and zone stats The per-cpu page allocator lists and the per-cpu vmstat deltas are stored in the same struct per_cpu_pages even though vmstats have no direct impact on the per-cpu page lists. This is inconsistent because the vmstats for a node are stored on a dedicated structure. The bigger issue is that the per_cpu_pages structure is not cache-aligned and stat updates either cache conflict with adjacent per-cpu lists incurring a runtime cost or padding is required incurring a memory cost. This patch splits the per-cpu pagelists and the vmstat deltas into separate structures. It's mostly a mechanical conversion but some variable renaming is done to clearly distinguish the per-cpu pages structure (pcp) from the vmstats (pzstats). Superficially, this appears to increase the size of the per_cpu_pages structure but the movement of expire fills a structure hole so there is no impact overall. [lkp@intel.com: Check struct per_cpu_zonestat has a non-zero size] [vbabka@suse.cz: Init zone->per_cpu_zonestats properly] Signed-off-by: Mel Gorman Signed-off-by: Thomas Gleixner Acked-by: Vlastimil Babka Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit d36c3ebdd16546d7467e82f780ae02b0e605af84 Author: Thomas Gleixner Date: Sun Dec 6 22:40:07 2020 +0100 timers: Move clearing of base::timer_running under base::lock syzbot reported KCSAN data races vs. timer_base::timer_running being set to NULL without holding base::lock in expire_timers(). This looks innocent and most reads are clearly not problematic but for a non-RT kernel it's completely irrelevant whether the store happens before or after taking the lock. For an RT kernel moving the store under the lock requires an extra unlock/lock pair in the case that there is a waiter for the timer. 
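In expire_timers(), the change amounts to roughly the following ordering (a sketch; on PREEMPT_RT an additional unlock/lock pair is needed at this point to wake a possible del_timer_sync() waiter):

    raw_spin_unlock_irq(&base->lock);
    call_timer_fn(timer, fn, baseclk);
    raw_spin_lock_irq(&base->lock);
    /* clear the running marker while holding base->lock, not after dropping it */
    base->running_timer = NULL;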
But that's not the end of the world and definitely not worth the trouble of adding boatloads of comments and annotations to the code. Famous last words... Reported-by: syzbot+aa7c2385d46c5eba0b89@syzkaller.appspotmail.com Reported-by: syzbot+abea4558531bae1ba9fe@syzkaller.appspotmail.com Link: https://lkml.kernel.org/r/87lfea7gw8.fsf@nanos.tec.linutronix.de Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Cc: stable-rt@vger.kernel.org commit 1d1164a4e1927868029b10ce5e854ca133d9766a Author: Sebastian Andrzej Siewior Date: Fri Oct 30 13:59:06 2020 +0100 highmem: Don't disable preemption on RT in kmap_atomic() Disabling preemption makes it impossible to acquire sleeping locks within a kmap_atomic() section. For PREEMPT_RT it is sufficient to disable migration. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 63cf1e4b564a46ec7bee5571cff518c70355dbdf Author: John Ogness Date: Mon Nov 30 01:42:10 2020 +0106 printk: add pr_flush() Provide a function to allow waiting for console printers to catch up to the latest logged message. Use pr_flush() to give console printers a chance to finish in critical situations if no atomic console is available. For now pr_flush() is only used in the most common error paths: panic(), print_oops_end_marker(), report_bug(), kmsg_dump(). Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit b41f91f573cd9c671efe8698efcaac2b99af0ea0 Author: John Ogness Date: Mon Nov 30 01:42:09 2020 +0106 printk: add console handover If earlyprintk is used, a boot console will print directly to the console immediately. The boot console will unregister itself as soon as a non-boot console registers. However, the non-boot console does not begin printing until its kthread has started. Since this happens much later, there is a long pause in the console output. If the ringbuffer is small, messages could even be dropped during the pause. Add a new CON_HANDOVER console flag to be used internally by printk in order to track which non-boot console took over from a boot console. If handover consoles have implemented write_atomic(), they are allowed to print directly to the console until their kthread can take over. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 4a181ae05c92b849ee813966829073ea189f8749 Author: John Ogness Date: Mon Nov 30 01:42:08 2020 +0106 printk: remove deferred printing Since printing occurs either atomically or from the printing kthread, there is no need for any deferring or for tracking possible recursion paths. Remove all printk context tracking. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit c4049cfc8a0327c22d56855882d8e2cffd6d20fa Author: John Ogness Date: Mon Nov 30 01:42:07 2020 +0106 printk: move console printing to kthreads Create a kthread for each console to perform console printing. Now all console printing is fully asynchronous except for the boot console and when the kernel enters sync mode (and there are atomic consoles available). The console_lock() and console_unlock() functions now only do what their names say... locking and unlocking of the console.
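A per-console printing thread in the spirit of the above might be structured like this sketch (printk ringbuffer read helpers as in kernel/printk/; the thread member and the emit_one_record() helper are illustrative placeholders, not the actual names):

    static int console_printer_thread(void *data)
    {
            struct console *con = data;
            u64 seq = 0;                    /* next record for this console */

            while (!kthread_should_stop()) {
                    /* sleep until the ringbuffer has a record we have not printed */
                    wait_event_interruptible(log_wait,
                                             prb_read_valid(prb, seq, NULL) ||
                                             kthread_should_stop());

                    while (prb_read_valid(prb, seq, NULL)) {
                            emit_one_record(con, seq);  /* format + con->write() */
                            seq++;
                    }
            }
            return 0;
    }

    /* spawned when the console is registered, e.g.: */
    con->thread = kthread_run(console_printer_thread, con, "pr/%s%d",
                              con->name, con->index);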
Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 4b788a578cc2be37f1761f07407a9d920ecb0671 Author: John Ogness Date: Mon Nov 30 01:42:06 2020 +0106 printk: introduce kernel sync mode When the kernel performs an OOPS, enter into "sync mode": - only atomic consoles (write_atomic() callback) will print - printing occurs within vprintk_store() instead of console_unlock() CONSOLE_LOG_MAX is moved to printk.h to support the per-console buffer used in sync mode. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 7995ace9ab04969b9d5577e5fd74d77765c7d917 Author: John Ogness Date: Mon Nov 30 01:42:05 2020 +0106 printk: use seqcount_latch for console_seq In preparation for atomic printing, change @console_seq to use seqcount_latch so that it can be read without requiring @console_sem. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 19aa624068300cc91cb3c4e342def5e0ab0e40d4 Author: John Ogness Date: Mon Nov 30 01:42:04 2020 +0106 printk: combine boot_delay_msec() into printk_delay() boot_delay_msec() is always called immediately before printk_delay() so just combine the two. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 109255d49bc349bf222a231045ca6464e9dfe248 Author: John Ogness Date: Mon Nov 30 01:42:03 2020 +0106 printk: relocate printk_delay() and vprintk_default() Move printk_delay() and vprintk_default() "as is" further up so that they can be used by new functions in an upcoming commit. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit b94b12794bb4dc84d23705ea20d1435ce72db5da Author: John Ogness Date: Mon Nov 30 01:42:02 2020 +0106 serial: 8250: implement write_atomic Implement a non-sleeping NMI-safe write_atomic() console function in order to support emergency console printing. Since interrupts need to be disabled during transmit, all usage of the IER register is wrapped with access functions that use the console_atomic_lock() function to synchronize register access while tracking the state of the interrupts. This is necessary because write_atomic() can be called from an NMI context that has preempted write_atomic(). Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 8d4fe695dbb0dfdf8b0699a8ef7c1c3402fdd823 Author: John Ogness Date: Fri Mar 19 14:57:31 2021 +0100 kdb: only use atomic consoles for output mirroring Currently kdb uses the @oops_in_progress hack to mirror kdb output to all active consoles from NMI context. Ignoring locks is unsafe. Now that an NMI-safe atomic interface is available for consoles, use that interface to mirror kdb output. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 735eda8e5ceb2b6716285c005e876bff2df979cc Author: John Ogness Date: Mon Nov 30 01:42:01 2020 +0106 console: add write_atomic interface Add a write_atomic() callback to the console. This is an optional function for console drivers. The function must be atomic (including NMI-safe) for writing to the console. Console drivers must still implement the write() callback. The write_atomic() callback will only be used in special situations, such as when the kernel panics. Creating an NMI-safe write_atomic() that must synchronize with write() requires a careful implementation of the console driver.
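For the 8250 changes above, the IER wrapping could look roughly like this sketch (the real patch also tracks the saved IER value; that bookkeeping is omitted here):

    static inline void serial8250_clear_IER(struct uart_8250_port *up)
    {
            unsigned int flags;

            /* serialize against write_atomic() from any context, including NMI */
            console_atomic_lock(&flags);
            serial_out(up, UART_IER, 0);    /* mask UART interrupts while transmitting */
            console_atomic_unlock(flags);
    }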
To aid with the implementation, a set of console_atomic_*() functions are provided: void console_atomic_lock(unsigned int *flags); void console_atomic_unlock(unsigned int flags); These functions synchronize using a processor-reentrant spinlock (called a cpulock). kgdb makes use of its own cpulock (@dbg_master_lock, @kgdb_active). This will conflict with the printk cpulock. Therefore, a CPU must ensure that it is not holding the printk cpulock when calling kgdb_cpu_enter(). If it is, it must allow its printk context to complete first. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 8c1c98157afc93269bd28db78663e40fdf5ed138 Author: John Ogness Date: Thu Feb 18 17:37:41 2021 +0100 printk: convert @syslog_lock to spin_lock @syslog_lock was a raw_spin_lock to simplify the transition of removing @logbuf_lock and the safe buffers. With that transition complete, @syslog_lock can become a spin_lock. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 114233f8744cb4911d536f9f118f66433f97b2fe Author: John Ogness Date: Mon Nov 30 01:42:00 2020 +0106 printk: remove safe buffers With @logbuf_lock removed, the high level printk functions for storing messages are lockless. Messages can be stored from any context, so there is no need for the NMI and safe buffers anymore. Remove the NMI and safe buffers. Although the safe buffers are removed, the NMI and safe context tracking is still in place. In these contexts, store the message immediately but still use irq_work to defer the console printing. Since printk recursion tracking is in place, safe context tracking for most of printk is not needed. Remove it. Only safe context tracking relating to the console lock is left in place. This is because the console lock is needed for the actual printing. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 2baa48303c4513ec77b8429035a4b0d5c7701408 Author: John Ogness Date: Fri Dec 11 00:55:25 2020 +0106 printk: track/limit recursion Track printk() recursion and limit it to 3 levels per-CPU and per-context. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner
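The recursion limit can be pictured with this simplified sketch (the real code keeps separate per-CPU counters for task, irq and NMI context; only a single per-CPU counter is shown here):

    #define PRINTK_MAX_RECURSION 3

    static DEFINE_PER_CPU(char, printk_count);

    static bool printk_enter_irqsave(unsigned long *flags)
    {
            char *count;

            local_irq_save(*flags);
            count = this_cpu_ptr(&printk_count);
            if (*count > PRINTK_MAX_RECURSION) {
                    /* too deep: drop the message instead of recursing further */
                    local_irq_restore(*flags);
                    return false;
            }
            (*count)++;
            return true;
    }

    static void printk_exit_irqrestore(unsigned long flags)
    {
            (*this_cpu_ptr(&printk_count))--;
            local_irq_restore(flags);
    }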