commit f3d91c892c1ffba159bce64fbce194b50f87aff6 Author: Alexandre Frade Date: Tue Dec 15 03:51:15 2020 +0000 Linux 5.10.1-rt17-xanmod1 Signed-off-by: Alexandre Frade commit b81fa230b84cb65929af0e3fa3ff7e4e05d8cf81 Author: Térence Clastres Date: Mon Oct 12 18:58:11 2020 +0200 futex: Restore futex_key Required for FUTEX_WAIT_MULTIPLE implementation on Linux v5.9+. Signed-off-by: Alexandre Frade commit 2f5dd9aecd90f81da60ec4edec0f0a54e2f7a2b8 Author: Alexandre Frade Date: Sun Oct 13 03:10:39 2019 -0300 kconfig: set PREEMPT_RT and RCU_BOOST without delay by default Signed-off-by: Alexandre Frade commit 014ed860c9e3076c1535ad400f80a0020a596468 Author: Alexandre Frade Date: Sat Oct 17 15:44:32 2020 +0000 sched/core: Set nr_migrate to increase the number of tasks to iterate in a single balance run Signed-off-by: Alexandre Frade commit d84895e60d5b49ba9bc622cf41cef2d93d551681 Merge: 4c0cfad1c866 c9db032f6d86 Author: Alexandre Frade Date: Tue Dec 15 03:42:27 2020 +0000 Merge tag 'v5.10-rt17' into 5.10 v5.10-rt17 commit 4c0cfad1c866aee688eb4d09d5184b9a97a5aa45 Author: Alexandre Frade Date: Tue Dec 15 03:35:13 2020 +0000 Revert "futex: Restore futex_key" This reverts commit 78ffcf5d8641d79c183107e235eca43daf445292. commit a79df2515ed8908b245d2771fd4c1cbe6aa1ecf7 Author: Alexandre Frade Date: Tue Dec 15 03:32:39 2020 +0000 Revert "kconfig: set PREEMPT and RCU_BOOST without delay by default" This reverts commit 0e96d54d9a30f7d9c9fa0479307368a4710f0183. commit 29ecb5035b14b34fa8f804101183c26b2d2966a9 Author: Alexandre Frade Date: Tue Dec 15 03:31:16 2020 +0000 Revert "sched/core: nr_migrate = 256 increases number of tasks to iterate in a single balance run." This reverts commit a4fe3d5253fcaa996a37c389b17bdee493fe8158. commit f0217e1d6bb355da7c5f16bb50ae1527523965e7 Author: Alexandre Frade Date: Mon Dec 14 22:09:57 2020 +0000 Linux 5.10.1-xanmod1 Signed-off-by: Alexandre Frade commit 01328dbc6743fadcd7938745855449a92ebd442a Merge: 2c00287937f6 841fca5a32cc Author: Alexandre Frade Date: Mon Dec 14 22:08:51 2020 +0000 Merge tag 'v5.10.1' into 5.10 This is the 5.10.1 stable release commit 2c00287937f6f5d253a9b32b7fc04e33b1686127 Author: Alexandre Frade Date: Mon Dec 14 19:56:08 2020 +0000 Linux 5.10.0-xanmod1 Signed-off-by: Alexandre Frade commit 77467adf2a308098f00f6fc05ab7023300a5985a Author: Gabriel Krisman Bertazi Date: Thu Feb 13 18:45:25 2020 -0300 selftests: futex: Add FUTEX_WAIT_MULTIPLE wake up test Add test for wait at multiple futexes mechanism. Skip the test if it's an x32 application and the kernel returned the appropriate error, since this ABI is not supported for this operation. Signed-off-by: Gabriel Krisman Bertazi Co-developed-by: André Almeida Signed-off-by: André Almeida Signed-off-by: Alexandre Frade commit d45f43260a2d02eee32241ece3a6a6090eedbdb1 Author: Gabriel Krisman Bertazi Date: Thu Feb 13 18:45:24 2020 -0300 selftests: futex: Add FUTEX_WAIT_MULTIPLE wouldblock test Add test for wouldblock return when waiting for multiple futexes. Skip the test if it's an x32 application and the kernel returned the appropriate error, since this ABI is not supported for this operation. Signed-off-by: Gabriel Krisman Bertazi Co-developed-by: André Almeida Signed-off-by: André Almeida Signed-off-by: Alexandre Frade commit 9e5c157ec1805a6c92667bc49bb9cadcdd8a203e Author: Gabriel Krisman Bertazi Date: Thu Feb 13 18:45:23 2020 -0300 selftests: futex: Add FUTEX_WAIT_MULTIPLE timeout test Add test for timeout when waiting for multiple futexes.
Skip the test if it's an x32 application and the kernel returned the appropriate error, since this ABI is not supported for this operation. Signed-off-by: Gabriel Krisman Bertazi Co-developed-by: André Almeida Signed-off-by: André Almeida Signed-off-by: Alexandre Frade commit 78ffcf5d8641d79c183107e235eca43daf445292 Author: Térence Clastres Date: Mon Oct 12 18:58:11 2020 +0200 futex: Restore futex_key Required for FUTEX_WAIT_MULTIPLE implementation on Linux v5.9+. Signed-off-by: Alexandre Frade commit 701d7904b95d8ca30d5413ba7a3527365f03d8a5 Author: Gabriel Krisman Bertazi Date: Thu Feb 13 18:45:22 2020 -0300 futex: Implement mechanism to wait on any of several futexes This is a new futex operation, called FUTEX_WAIT_MULTIPLE, which allows a thread to wait on several futexes at the same time, and be awoken by any of them. In a sense, it implements one of the features that was supported by polling on the old FUTEX_FD interface. The use case lies in the Wine implementation of the Windows NT interface WaitForMultipleObjects. This Windows API function allows a thread to sleep waiting on the first of a set of event sources (mutexes, timers, signal, console input, etc) to signal. Considering this is a primitive synchronization operation for Windows applications, being able to quickly signal events on the producer side, and quickly go to sleep on the consumer side is essential for good performance of those running over Wine. Wine developers have an implementation that uses eventfd, but it suffers from FD exhaustion (there are applications that go to the order of multi-million FDs), and higher CPU utilization than this new operation. The futex list is passed as an array of `struct futex_wait_block` (pointer, value, bitset) to the kernel, which will enqueue all of them and sleep if none was already triggered. It returns a hint of which futex caused the wake up event to userspace, but the hint doesn't guarantee that is the only futex triggered. Before calling the syscall again, userspace should traverse the list, trying to re-acquire any of the other futexes, to prevent an immediate -EWOULDBLOCK return code from the kernel. This was tested using three mechanisms: 1) By reimplementing FUTEX_WAIT in terms of FUTEX_WAIT_MULTIPLE and running the unmodified tools/testing/selftests/futex and a full linux distro on top of this kernel. 2) By an example code that exercises the FUTEX_WAIT_MULTIPLE path on a multi-threaded, event-handling setup. 3) By running the Wine fsync with Valve's Proton compatibility code implementation and executing multi-threaded applications, in particular modern games, on top of this implementation. Changes were tested for the following ABIs: x86_64, i386 and x32. Support for x32 applications is not implemented since it would take a major rework adding a new entry point and splitting the current futex 64 entry point in two and we can't change the current x32 syscall number without breaking user space compatibility. CC: Steven Rostedt Cc: Richard Yao Cc: Thomas Gleixner Cc: Peter Zijlstra Co-developed-by: Zebediah Figura Signed-off-by: Zebediah Figura Co-developed-by: Steven Noonan Signed-off-by: Steven Noonan Co-developed-by: Pierre-Loup A. Griffais Signed-off-by: Pierre-Loup A. Griffais Signed-off-by: Gabriel Krisman Bertazi [Added compatibility code] Co-developed-by: André Almeida Signed-off-by: André Almeida Signed-off-by: Alexandre Frade
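To make the interface concrete, here is a minimal, hedged userspace sketch of the FUTEX_WAIT_MULTIPLE operation described above. The struct layout (pointer, value, bitset) comes from the commit text; the opcode value (31) and the exact argument order are assumptions based on the out-of-tree patch series and are not a stable kernel ABI, so treat this purely as an illustration:

/* Sketch only: FUTEX_WAIT_MULTIPLE is out-of-tree; the opcode and argument
 * layout below are assumptions, not a stable ABI. */
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef FUTEX_WAIT_MULTIPLE
#define FUTEX_WAIT_MULTIPLE 31          /* assumed opcode from the patch */
#endif

struct futex_wait_block {               /* (pointer, value, bitset) */
    uint32_t *uaddr;
    uint32_t val;
    uint32_t bitset;
};

static uint32_t futex_a = 0, futex_b = 1;

int main(void)
{
    struct futex_wait_block blocks[2] = {
        { &futex_a, 0, ~0u },   /* value matches: this entry would sleep */
        { &futex_b, 0, ~0u },   /* value mismatch: forces -EWOULDBLOCK   */
    };

    /* uaddr points at the array, val carries the number of entries; the
     * return value is only a hint of which futex fired, so the caller must
     * re-check the whole list before sleeping again. */
    long ret = syscall(SYS_futex, blocks, FUTEX_WAIT_MULTIPLE,
                       2, NULL, NULL, 0);
    if (ret < 0)
        perror("FUTEX_WAIT_MULTIPLE");   /* EWOULDBLOCK expected here */
    else
        printf("woken, hint index %ld\n", ret);
    return 0;
}

This mirrors the wouldblock selftest above: because futex_b does not hold the expected value, the call returns immediately instead of sleeping.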
commit f315e6f977cb487f37be86593579f058cdab1a4f Author: Con Kolivas Date: Mon Dec 14 19:09:01 2020 +0000 clockevents, hrtimer: Make hrtimer granularity and minimum hrtimeout configurable in sysctl. Set default granularity to 100us and min timeout to 500us Signed-off-by: Alexandre Frade commit 841fca5a32cccd7d0123c0271f4350161ada5507 Author: Greg Kroah-Hartman Date: Mon Dec 14 19:33:01 2020 +0100 Linux 5.10.1 Link: https://lore.kernel.org/r/20201214170452.563016590@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman commit 26934c83005e75eab2b8d54d0fa5adbee4f27535 Author: Greg Kroah-Hartman Date: Mon Dec 14 17:51:18 2020 +0100 Revert "dm raid: fix discard limits for raid1 and raid10" This reverts commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512. It causes problems :( Reported-by: Dave Jones Reported-by: Mike Snitzer Cc: Zdenek Kabelac Cc: Mikulas Patocka Cc: Linus Torvalds commit 859f70354379ce53be23bca3580cb7f77978c7a2 Author: Greg Kroah-Hartman Date: Mon Dec 14 17:48:11 2020 +0100 Revert "md: change mddev 'chunk_sectors' from int to unsigned" This reverts commit 6ffeb1c3f8226244c08105bcdbeecc04bad6b89a. It causes problems :( Reported-by: Dave Jones Reported-by: Mike Snitzer Cc: Song Liu Cc: Jens Axboe Cc: Linus Torvalds commit 2ac82cd4a1d2430c796abe17d7b277aa8464b171 Author: Con Kolivas Date: Mon Feb 20 13:32:58 2017 +1100 time: Don't use hrtimer overlay when pm_freezing since some drivers still don't correctly use freezable timeouts. commit 73704132d9467d273cd168cc7cbb9f47871c4dda Author: Con Kolivas Date: Mon Feb 20 13:30:32 2017 +1100 hrtimer: Replace all calls to schedule_timeout_uninterruptible of potentially under 50ms to use schedule_msec_hrtimeout_uninterruptible commit 09df1663efffecd7759218f2c0f787a214288c71 Author: Con Kolivas Date: Mon Feb 20 13:30:07 2017 +1100 hrtimer: Replace all calls to schedule_timeout_interruptible of potentially under 50ms to use schedule_msec_hrtimeout_interruptible. commit 400ccdbeca218a93859e6dafb25d72db3a39d54b Author: Con Kolivas Date: Mon Feb 20 13:28:30 2017 +1100 hrtimer: Replace all schedule_timeout(1) with schedule_min_hrtimeout() commit 53ee597e1a5894d5f8c572683c537c672ab55217 Author: Con Kolivas Date: Fri Nov 4 09:25:54 2016 +1100 timer: Convert msleep to use hrtimers when active. commit a8fc5c95ed86e1c87f23127b8af7e88149cf0110 Author: Con Kolivas Date: Sat Nov 5 09:27:36 2016 +1100 time: Special case calls of schedule_timeout(1) to use the min hrtimeout of 1ms, working around low Hz resolutions. commit 2f2cb68fff0bc87e1900f8d7e2e8e69245de939a Author: Con Kolivas Date: Sat Aug 12 11:53:39 2017 +1000 hrtimer: Create highres timeout variants of schedule_timeout functions. commit dc1c2a060000af3b5b313bdc35bb5aaeba77ae9e Author: Mark Weiman Date: Sun Aug 12 11:36:21 2018 -0400 pci: Enable overrides for missing ACS capabilities This is an updated version of Alex Williamson's patch from: https://lkml.org/lkml/2013/5/30/513 Original commit message follows: PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that allows us to control whether transactions are allowed to be redirected in various subnodes of a PCIe topology. For instance, if two endpoints are below a root port or downstream switch port, the downstream port may optionally redirect transactions between the devices, bypassing upstream devices. The same can happen internally on multifunction devices.
The transaction may never be visible to the upstream devices. One upstream device that we particularly care about is the IOMMU. If a redirection occurs in the topology below the IOMMU, then the IOMMU cannot provide isolation between devices. This is why the PCIe spec encourages topologies to include ACS support. Without it, we have to assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation. Unfortunately, far too many topologies do not support ACS to make this a steadfast requirement. Even the latest chipsets from Intel are only sporadically supporting ACS. We have trouble getting interconnect vendors to include the PCIe spec required PCIe capability, let alone suggested features. Therefore, we need to add some flexibility. The pcie_acs_override= boot option lets users opt-in specific devices or sets of devices to assume ACS support. The "downstream" option assumes full ACS support on root ports and downstream switch ports. The "multifunction" option assumes the subset of ACS features available on multifunction endpoints and upstream switch ports are supported. The "id:nnnn:nnnn" option enables ACS support on devices matching the provided vendor and device IDs, allowing more strategic ACS overrides. These options may be combined in any order. A maximum of 16 id specific overrides are available. It's suggested to use the most limited set of options necessary to avoid completely disabling ACS across the topology. Note to hardware vendors, we have facilities to permanently quirk specific devices which enforce isolation but not provide an ACS capability. Please contact me to have your devices added and save your customers the hassle of this boot option. Signed-off-by: Mark Weiman commit 2f8656f5c0c21b59b871ef486ea306865f870d9e Author: graysky Date: Fri Nov 13 15:45:08 2020 -0500 x86/kconfig: Enable additional cpu optimizations for gcc v10.1+ kernel v5.8+ WARNING This patch works with gcc versions 10.1+ and with kernel version 5.8+ and should NOT be applied when compiling on older versions of gcc due to key name changes of the march flags introduced with the version 4.9 release of gcc.[1] Use the older version of this patch hosted on the same github for older versions of gcc. 
FEATURES This patch adds additional CPU options to the Linux kernel accessible under: Processor type and features ---> Processor family ---> The expanded microarchitectures include: * AMD Improved K8-family * AMD K10-family * AMD Family 10h (Barcelona) * AMD Family 14h (Bobcat) * AMD Family 16h (Jaguar) * AMD Family 15h (Bulldozer) * AMD Family 15h (Piledriver) * AMD Family 15h (Steamroller) * AMD Family 15h (Excavator) * AMD Family 17h (Zen) * AMD Family 17h (Zen 2) * Intel Silvermont low-power processors * Intel Goldmont low-power processors (Apollo Lake and Denverton) * Intel Goldmont Plus low-power processors (Gemini Lake) * Intel 1st Gen Core i3/i5/i7 (Nehalem) * Intel 1.5 Gen Core i3/i5/i7 (Westmere) * Intel 2nd Gen Core i3/i5/i7 (Sandybridge) * Intel 3rd Gen Core i3/i5/i7 (Ivybridge) * Intel 4th Gen Core i3/i5/i7 (Haswell) * Intel 5th Gen Core i3/i5/i7 (Broadwell) * Intel 6th Gen Core i3/i5/i7 (Skylake) * Intel 6th Gen Core i7/i9 (Skylake X) * Intel 8th Gen Core i3/i5/i7 (Cannon Lake) * Intel 10th Gen Core i7/i9 (Ice Lake) * Intel Xeon (Cascade Lake) * Intel Xeon (Cooper Lake) * Intel 3rd Gen 10nm++ i3/i5/i7/i9-family (Tiger Lake) It also offers to compile passing the 'native' option which, "selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine. Using -march=native enables all instruction subsets supported by the local machine and will produce code optimized for the local machine under the constraints of the selected instruction set."[2] Do NOT try using the 'native' option on AMD Piledriver, Steamroller, or Excavator CPUs (-march=bdver{2,3,4} flag). The build will error out due the kernel's objtool issue with these.[3a,b] MINOR NOTES This patch also changes 'atom' to 'bonnell' in accordance with the gcc v4.9 changes. Note that upstream is using the deprecated 'match=atom' flags when I believe it should use the newer 'march=bonnell' flag for atom processors.[4] It is not recommended to compile on Atom-CPUs with the 'native' option.[5] The recommendation is to use the 'atom' option instead. BENEFITS Small but real speed increases are measurable using a make endpoint comparing a generic kernel to one built with one of the respective microarchs. See the following experimental evidence supporting this statement: https://github.com/graysky2/kernel_gcc_patch REQUIREMENTS linux version >=5.8 gcc version >=10.1 ACKNOWLEDGMENTS This patch builds on the seminal work by Jeroen.[6] REFERENCES 1. https://gcc.gnu.org/gcc-4.9/changes.html 2. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html 3a. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95671#c11 3b. https://github.com/graysky2/kernel_gcc_patch/issues/55 4. https://bugzilla.kernel.org/show_bug.cgi?id=77461 5. https://github.com/graysky2/kernel_gcc_patch/issues/15 6. http://www.linuxforge.net/docs/linux/linux-gcc.php commit cdc1cedf140cb10eed81d3689bdcd351efc9a41a Author: Ben Hutchings Date: Mon Sep 7 02:51:53 2020 +0100 android: Export symbols needed by Android drivers We want to enable use of the Android ashmem and binder drivers to support Anbox, but they should not be built-in as that would waste resources and increase security attack surface on systems that don't need them. Export the currently un-exported symbols they depend on. 
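The symbol-export change described above is mechanical; as a sketch of the pattern, assuming an existing in-tree helper that the modular ashmem/binder builds need to resolve (the function and symbol names here are placeholders, not the actual symbols touched by the commit):

/* Illustrative only: the real commit appends exports to existing helpers;
 * this placeholder just shows the shape of such a change. */
#include <linux/export.h>

int example_helper_needed_by_binder(void)   /* hypothetical existing helper */
{
    return 0;
}
EXPORT_SYMBOL_GPL(example_helper_needed_by_binder);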
commit 6665150f4172c8cdda2c1a25817cdf3264fd8e79 Author: Ben Hutchings Date: Fri Jun 22 17:27:00 2018 +0100 android: Enable building ashmem and binder as modules We want to enable use of the Android ashmem and binder drivers to support Anbox, but they should not be built-in as that would waste resources and increase security attack surface on systems that don't need them. - Add a MODULE_LICENSE declaration to ashmem - Change the Makefiles to build each driver as an object with the "_linux" suffix (which is what Anbox expects) - Change config symbol types to tristate commit fba3fdf0f4c9c850b1dc32ca69e34b63b0c0aaab Author: Arjan van de Ven Date: Sun Feb 18 23:35:41 2018 +0000 locking: rwsem: spin faster tweak rwsem owner spinning a bit commit 9be657a1c35b26478768c889cd745ee5870df5c4 Author: William Douglas Date: Wed Jun 20 17:23:21 2018 +0000 firmware: Enable stateless firmware loading Prefer the order of specific version before generic and /etc before /lib to enable the user to give specific overrides for generic firmware and distribution firmware. commit a188c00ee2a8330301a4c4deda16ce9a533a0763 Author: Arjan van de Ven Date: Sun Sep 22 11:12:35 2019 -0300 intel_rapl: Silence rapl trace debug commit ef3a9310bd6e1309ea631ccfa45eda9c9699cf2f Author: Piotr Gorski Date: Thu Jul 30 16:53:12 2020 -0800 init: add support for zstd compressed modules Signed-off-by: Piotr Gorski commit cda59e1485431ee8c6da4029b87e4b31875e19bd Author: Alexandre Frade Date: Thu Oct 15 21:28:33 2020 +0000 modules: disinherit TAINT_PROPRIETARY_MODULE Signed-off-by: Alexandre Frade commit 0d7e1b0c69b223093090984c301b922e93dda1fc Author: Scott James Remnant Date: Tue Oct 27 10:05:32 2009 +0000 trace: add trace events for open(), exec() and uselib() (for v3.7+) BugLink: http://bugs.launchpad.net/bugs/462111 This patch uses TRACE_EVENT to add tracepoints for the open(), exec() and uselib() syscalls so that ureadahead can cheaply trace the boot sequence to determine what to read to speed up the next. It's not upstream because it will need to be rebased onto the syscall trace events whenever that gets merged, and is a stop-gap. [apw@canonical.com: updated for v3.7 and later.] [apw@canonical.com: updated for v3.19 and later.] 
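The trace-events commit above relies on the kernel's TRACE_EVENT() machinery; as a rough, illustrative sketch of what such a tracepoint definition looks like (the event name, fields, and header guards below are made up, not the patch's actual definitions):

/* Illustrative TRACE_EVENT definition; the actual patch defines its own
 * events for open(), exec() and uselib(); this only shows the general shape. */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM fs

#if !defined(_TRACE_FS_EXAMPLE_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_FS_EXAMPLE_H

#include <linux/tracepoint.h>

TRACE_EVENT(example_do_sys_open,
    TP_PROTO(const char *filename, int flags, int mode),
    TP_ARGS(filename, flags, mode),
    TP_STRUCT__entry(
        __string(filename, filename)
        __field(int, flags)
        __field(int, mode)
    ),
    TP_fast_assign(
        __assign_str(filename, filename);
        __entry->flags = flags;
        __entry->mode = mode;
    ),
    TP_printk("\"%s\" flags=%x mode=%o",
              __get_str(filename), __entry->flags, __entry->mode)
);

#endif /* _TRACE_FS_EXAMPLE_H */

/* This part must be outside the header guard. */
#include <trace/define_trace.h>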
BugLink: http://bugs.launchpad.net/bugs/1085766 Signed-off-by: Scott James Remnant Acked-by: Stefan Bader Acked-by: Andy Whitcroft Signed-off-by: Stefan Bader Conflicts: fs/open.c Signed-off-by: Tim Gardner commit 7530d3213e30b6406f4183a206ca5c312028cc71 Author: Alexandre Frade Date: Thu Sep 3 20:36:13 2020 +0000 init/Kconfig: Enable -O3 KBUILD_CFLAGS optimization for all architectures Signed-off-by: Alexandre Frade commit c7ae89bfea6bc055adb38ed17abd40fd8aa908ed Author: Alexandre Frade Date: Thu Jun 25 16:40:43 2020 -0300 lib/kconfig.debug: disable default CONFIG_SYMBOLIC_ERRNAME and CONFIG_DEBUG_BUGVERBOSE Signed-off-by: Alexandre Frade commit 8930564c99efe479d5c2aaca7a9760c50ca3691a Author: Alexandre Frade Date: Mon Jan 29 17:41:29 2018 +0000 scripts: disable the localversion "+" tag of a git repo Signed-off-by: Alexandre Frade commit c36df79def5f70ed2118cfee7a37e0019ec7b486 Author: Alexandre Frade Date: Tue Mar 31 13:32:08 2020 -0300 cpufreq: tunes ondemand and conservative governor for performance Signed-off-by: Alexandre Frade commit 2e97c147186bda7c18697227c8262d1fcabcbe00 Author: Alexandre Frade Date: Mon Jan 29 17:31:25 2018 +0000 mm/vmscan: vm_swappiness = 30 decreases the amount of swapping Signed-off-by: Alexandre Frade commit ecae96c998a0a3c402822c9b07c0e1f799adb72f Author: Alexandre Frade Date: Thu Aug 13 14:57:06 2020 +0000 sched/autogroup: Add kernel parameter and config option to enable/disable autogroup feature by default Signed-off-by: Alexandre Frade commit 05348849965f50fdcdb1813bff849e46f197f14f Author: Alexandre Frade Date: Mon Jan 29 16:59:22 2018 +0000 dcache: cache_pressure = 50 decreases the rate at which VFS caches are reclaimed Signed-off-by: Alexandre Frade commit 0e96d54d9a30f7d9c9fa0479307368a4710f0183 Author: Alexandre Frade Date: Sun Oct 13 03:10:39 2019 -0300 kconfig: set PREEMPT and RCU_BOOST without delay by default Signed-off-by: Alexandre Frade commit b027c4060b0e19560e97a835a7cf1e0f04860580 Author: Alexandre Frade Date: Mon Jan 29 17:26:15 2018 +0000 kconfig: add 500Hz timer interrupt kernel config option Signed-off-by: Alexandre Frade commit a4fe3d5253fcaa996a37c389b17bdee493fe8158 Author: Alexandre Frade Date: Mon Jan 29 18:29:13 2018 +0000 sched/core: nr_migrate = 256 increases number of tasks to iterate in a single balance run. 
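Several of the tuning commits above (vm_swappiness = 30, cache_pressure = 50, the autogroup default, nr_migrate = 256) change compile-time defaults for knobs that are also plain sysctls at runtime; a small illustrative snippet, assuming root and the standard procfs paths (sched_nr_migrate is only exposed with CONFIG_SCHED_DEBUG):

/* Illustrative only: sets the same tunables whose defaults the commits
 * above change at build time. Error handling kept minimal. */
#include <stdio.h>

static void set_sysctl(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return;
    }
    fputs(val, f);
    fclose(f);
}

int main(void)
{
    set_sysctl("/proc/sys/vm/swappiness", "30");
    set_sysctl("/proc/sys/vm/vfs_cache_pressure", "50");
    set_sysctl("/proc/sys/kernel/sched_autogroup_enabled", "1");
    set_sysctl("/proc/sys/kernel/sched_nr_migrate", "256");
    return 0;
}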
Signed-off-by: Alexandre Frade commit fff3507c469d76d826c994a7473353f899c64515 Author: Alexandre Frade Date: Mon Dec 14 16:24:26 2020 +0000 block: set rq_affinity to force full multithreading I/O requests Signed-off-by: Alexandre Frade commit fd8376cc13b309493652ac320abb62812cda788e Author: Alexandre Frade Date: Mon Jun 1 18:23:51 2020 -0300 block, bfq: change BLK_DEV_ZONED depends to IOSCHED_BFQ Signed-off-by: Alexandre Frade commit b20e12746c6ff4760877cb65f449d741a99ae010 Author: Alexandre Frade Date: Mon Nov 25 15:13:06 2019 -0300 elevator: set default scheduler to bfq for blk-mq Signed-off-by: Alexandre Frade commit c9db032f6d86fe8f8b772c2c10a8b58a3fb0ac1d Author: Sebastian Andrzej Siewior Date: Mon Dec 14 11:37:06 2020 +0100 v5.10-rt17 Signed-off-by: Sebastian Andrzej Siewior commit a4806241793ae99ab145c6dc7c7dc92198b160ac Merge: f5a89f8910a8 2c85ebc57b3e Author: Sebastian Andrzej Siewior Date: Mon Dec 14 11:36:54 2020 +0100 Merge tag 'v5.10' into linux-5.10.y-rt Linux 5.10 commit f5a89f8910a89355009b133180fd4d2afc4958d0 Author: Sebastian Andrzej Siewior Date: Fri Dec 11 17:40:58 2020 +0100 v5.10-rc7-rt16 Signed-off-by: Sebastian Andrzej Siewior commit 3439af203ffd416842b645d4e3caed154f1d1418 Author: Thomas Gleixner Date: Sun Dec 6 22:40:07 2020 +0100 timers: Move clearing of base::timer_running under base::lock syzbot reported KCSAN data races vs. timer_base::timer_running being set to NULL without holding base::lock in expire_timers(). This looks innocent and most reads are clearly not problematic but for a non-RT kernel it's completely irrelevant whether the store happens before or after taking the lock. For an RT kernel moving the store under the lock requires an extra unlock/lock pair in the case that there is a waiter for the timer. But that's not the end of the world and definitely not worth the trouble of adding boatloads of comments and annotations to the code. Famous last words... Reported-by: syzbot+aa7c2385d46c5eba0b89@syzkaller.appspotmail.com Reported-by: syzbot+abea4558531bae1ba9fe@syzkaller.appspotmail.com Link: https://lkml.kernel.org/r/87lfea7gw8.fsf@nanos.tec.linutronix.de Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior Cc: stable-rt@vger.kernel.org commit aab781636482a5ddaf8c714d11adaab12df6b323 Author: Sebastian Andrzej Siewior Date: Fri Dec 11 17:39:54 2020 +0100 Revert "hrtimer: Allow raw wakeups during boot" This change is no longer needed since commit 26c7295be0c5e ("kthread: Do not preempt current task if it is going to call schedule()") Signed-off-by: Sebastian Andrzej Siewior commit db683a136ca64f32965dfd8531e2a3037173f60b Author: Sebastian Andrzej Siewior Date: Mon Dec 7 12:25:03 2020 +0100 v5.10-rc7-rt15 Signed-off-by: Sebastian Andrzej Siewior commit c56e0d4667765dfba5f2eaf7fa77e988ead92f4b Merge: e49c68f2da07 0477e9288185 Author: Sebastian Andrzej Siewior Date: Mon Dec 7 12:23:56 2020 +0100 Merge tag 'v5.10-rc7' into linux-5.10.y-rt Linux 5.10-rc7 commit e49c68f2da0729bd56a5c5e6e64f8a7309ad28f3 Author: Sebastian Andrzej Siewior Date: Fri Dec 4 18:27:12 2020 +0100 v5.10-rc6-rt14 Signed-off-by: Sebastian Andrzej Siewior commit 77ba6b34dc4b2e6a63983ade6a37ae54c0335f29 Author: Sebastian Andrzej Siewior Date: Fri Dec 4 18:19:11 2020 +0100 softirq: Update to v2 This is an all-on-one commit updating the softirq series by Thomas Gleixner. 
It includes: Update already existing patches to what has been merged into the TIP tree as of commit 9f112156f8da016df2dcbe77108e5b070aa58992 and later: parisc: Remove bogus __IRQ_STAT macro sh: Get rid of nmi_count() irqstat: Get rid of nmi_count() and __IRQ_STAT() um/irqstat: Get rid of the duplicated declarations ARM: irqstat: Get rid of duplicated declaration arm64: irqstat: Get rid of duplicated declaration asm-generic/irqstat: Add optional __nmi_count member sh: irqstat: Use the generic irq_cpustat_t irqstat: Move declaration into asm-generic/hardirq.h preempt: Cleanup the macro maze a bit softirq: Move related code into one section sh/irq: Add missing closing parentheses in Cherry pick patches from Frederic Weisbecker from the TIP tree as of commit 7197688b2006357da75a014e0a76be89ca9c2d46 and later: sched/cputime: Remove symbol exports from IRQ time s390/vtime: Use the generic IRQ entry accounting sched/vtime: Consolidate IRQ time accounting irqtime: Move irqtime entry accounting after irq offset irq: Call tick_irq_enter() inside HARDIRQ_OFFSET Finally, apply the series "softirq: Make it RT aware" by Thomas Gleixner posted 2020-12-04 18:01. Signed-off-by: Sebastian Andrzej Siewior commit 8b7f7e411be72aca1b7074014a9537f90b9dd399 Author: Sebastian Andrzej Siewior Date: Mon Nov 30 18:32:43 2020 +0100 v5.10-rc6-rt13 Signed-off-by: Sebastian Andrzej Siewior commit acf5b481a0d7ce0e9c1e9a06c9ce4195a759bcdd Author: Sebastian Andrzej Siewior Date: Mon Nov 30 18:27:22 2020 +0100 printk: Update the printk code This is all-on-one patch replacing the current RT related printk patches: [PATCH 01/15] printk: refactor kmsg_dump_get_buffer() [PATCH 02/15] printk: use buffer pools for sprint buffers [PATCH 03/15] printk: change @clear_seq to atomic64_t [PATCH 04/15] printk: remove logbuf_lock, add syslog_lock [PATCH 05/15] printk: remove safe buffers [PATCH 06/15] console: add write_atomic interface [PATCH 07/15] serial: 8250: implement write_atomic [PATCH 08/15] printk: inline log_output(),log_store() in [PATCH 09/15] printk: relocate printk_delay() and vprintk_default() [PATCH 10/15] printk: combine boot_delay_msec() into printk_delay() [PATCH 11/15] printk: introduce kernel sync mode [PATCH 12/15] printk: move console printing to kthreads [PATCH 13/15] printk: remove deferred printing [PATCH 14/15] printk: add console handover [PATCH] printk: Tiny cleanup with an updated version by John Ogness: [PATCH 01/16] printk: refactor kmsg_dump_get_buffer() [PATCH 02/16] printk: inline log_output(),log_store() in [PATCH 03/16] printk: change @clear_seq to atomic64_t [PATCH 04/16] printk: remove logbuf_lock, add syslog_lock [PATCH 05/16] printk: remove safe buffers [PATCH 06/16] console: add write_atomic interface [PATCH 07/16] serial: 8250: implement write_atomic [PATCH 08/16] printk: relocate printk_delay() and vprintk_default() [PATCH 09/16] printk: combine boot_delay_msec() into printk_delay() [PATCH 10/16] printk: change @console_seq to atomic64_t [PATCH 11/16] printk: introduce kernel sync mode [PATCH 12/16] printk: move console printing to kthreads [PATCH 13/16] printk: remove deferred printing [PATCH 14/16] printk: add console handover [PATCH 15/16] printk: add pr_flush() [PATCH 16/16] printk: kmsg_dump,do_mounts wait for printers Signed-off-by: Sebastian Andrzej Siewior commit 028a936ab5d0ba05029a6387b3905cb1c570cdf7 Author: Sebastian Andrzej Siewior Date: Mon Nov 30 18:21:47 2020 +0100 mm/zswap: Initialize the local-lock Since the adaption to local-locks in v5.9 cycle, the initialisation of the lock 
was lost. Initialize zswap_comp::lock. Signed-off-by: Sebastian Andrzej Siewior commit 23d5e0d7db656ae130461094ea52e62b6e65d65e Author: Valentin Schneider Date: Sun Nov 22 20:19:04 2020 +0000 notifier: Make atomic_notifiers use raw_spinlock Booting a recent PREEMPT_RT kernel (v5.10-rc3-rt7-rebase) on my arm64 Juno leads to the idle task blocking on an RT sleeping spinlock down some notifier path: [ 1.809101] BUG: scheduling while atomic: swapper/5/0/0x00000002 [ 1.809116] Modules linked in: [ 1.809123] Preemption disabled at: [ 1.809125] secondary_start_kernel (arch/arm64/kernel/smp.c:227) [ 1.809146] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G W 5.10.0-rc3-rt7 #168 [ 1.809153] Hardware name: ARM Juno development board (r0) (DT) [ 1.809158] Call trace: [ 1.809160] dump_backtrace (arch/arm64/kernel/stacktrace.c:100 (discriminator 1)) [ 1.809170] show_stack (arch/arm64/kernel/stacktrace.c:198) [ 1.809178] dump_stack (lib/dump_stack.c:122) [ 1.809188] __schedule_bug (kernel/sched/core.c:4886) [ 1.809197] __schedule (./arch/arm64/include/asm/preempt.h:18 kernel/sched/core.c:4913 kernel/sched/core.c:5040) [ 1.809204] preempt_schedule_lock (kernel/sched/core.c:5365 (discriminator 1)) [ 1.809210] rt_spin_lock_slowlock_locked (kernel/locking/rtmutex.c:1072) [ 1.809217] rt_spin_lock_slowlock (kernel/locking/rtmutex.c:1110) [ 1.809224] rt_spin_lock (./include/linux/rcupdate.h:647 kernel/locking/rtmutex.c:1139) [ 1.809231] atomic_notifier_call_chain_robust (kernel/notifier.c:71 kernel/notifier.c:118 kernel/notifier.c:186) [ 1.809240] cpu_pm_enter (kernel/cpu_pm.c:39 kernel/cpu_pm.c:93) [ 1.809249] psci_enter_idle_state (drivers/cpuidle/cpuidle-psci.c:52 drivers/cpuidle/cpuidle-psci.c:129) [ 1.809258] cpuidle_enter_state (drivers/cpuidle/cpuidle.c:238) [ 1.809267] cpuidle_enter (drivers/cpuidle/cpuidle.c:353) [ 1.809275] do_idle (kernel/sched/idle.c:132 kernel/sched/idle.c:213 kernel/sched/idle.c:273) [ 1.809282] cpu_startup_entry (kernel/sched/idle.c:368 (discriminator 1)) [ 1.809288] secondary_start_kernel (arch/arm64/kernel/smp.c:273) Two points worth noting: 1) That this is conceptually the same issue as pointed out in: 313c8c16ee62 ("PM / CPU: replace raw_notifier with atomic_notifier") 2) Only the _robust() variant of atomic_notifier callchains suffer from this AFAICT only the cpu_pm_notifier_chain really needs to be changed, but singling it out would mean introducing a new (truly) non-blocking API. At the same time, callers that are fine with any blocking within the call chain should use blocking notifiers, so patching up all atomic_notifier's doesn't seem *too* crazy to me. 
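For readers unfamiliar with the API under discussion, a minimal sketch of an atomic notifier chain (chain, callback, and module names are made up for illustration):

/* Minimal sketch of the atomic notifier API discussed above. */
#include <linux/module.h>
#include <linux/notifier.h>

static ATOMIC_NOTIFIER_HEAD(example_chain);

static int example_event_cb(struct notifier_block *nb,
                            unsigned long action, void *data)
{
    /* Called in atomic context (under rcu_read_lock()), so it must not
     * sleep - the constraint the commit above restores for PREEMPT_RT
     * by switching the chain lock to a raw_spinlock. */
    return NOTIFY_OK;
}

static struct notifier_block example_nb = {
    .notifier_call = example_event_cb,
};

static int __init example_init(void)
{
    atomic_notifier_chain_register(&example_chain, &example_nb);
    atomic_notifier_call_chain(&example_chain, 0, NULL);
    return 0;
}

static void __exit example_exit(void)
{
    atomic_notifier_chain_unregister(&example_chain, &example_nb);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");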
Fixes: 70d932985757 ("notifier: Fix broken error handling pattern") Signed-off-by: Valentin Schneider Reviewed-by: Daniel Bristot de Oliveira Link: https://lkml.kernel.org/r/20201122201904.30940-1-valentin.schneider@arm.com Signed-off-by: Sebastian Andrzej Siewior commit fd4539f875ef12c44e24beccefe31061a46b5102 Author: Sebastian Andrzej Siewior Date: Mon Nov 30 16:22:28 2020 +0100 v5.10-rc6-rt12 Signed-off-by: Sebastian Andrzej Siewior commit fb2c1dbff5f36a0930392c3dfeaf0f8aa733d302 Merge: 2e8ac77091be b65054597872 Author: Sebastian Andrzej Siewior Date: Mon Nov 30 16:17:32 2020 +0100 Merge tag 'v5.10-rc6' into linux-5.10.y-rt Linux 5.10-rc6 commit 2e8ac77091be4c1111901395f6548b7d3facb9c8 Author: Sebastian Andrzej Siewior Date: Fri Nov 27 17:27:57 2020 +0100 v5.10-rc5-rt11 Signed-off-by: Sebastian Andrzej Siewior commit 010bf10735549cd373dbf94932b65ff5f5d96594 Author: Sebastian Andrzej Siewior Date: Fri Nov 27 17:25:16 2020 +0100 rtmutex: Allow allnoconfig with enabled RT This is an incremental update of the proposed rtmutex patches for upstream. The changes allows allnoconfig builds on x86 with enabled RT. It also updates the ifdef error to allow to include raw locks for the rtmutex header. Signed-off-by: Sebastian Andrzej Siewior commit 9ef126b1e593cb6db796b6f5c486627cf79bffd3 Author: Sebastian Andrzej Siewior Date: Fri Nov 27 17:22:41 2020 +0100 clk: imx8qxp: Replace the workaround for the build failure This is an all-in-one commit replacing the patch clk: imx8qxp: Unbreak auto module building for MXC_CLK_SCU with a patch by Dong Aisheng posted on Wed, 25 Nov 2020 18:50:37 +0800 with the subject clk: imx: scu: fix MXC_CLK_SCU module build break Signed-off-by: Sebastian Andrzej Siewior commit 398a5fcd0c95039f984f0baf0459ce70d393097d Author: Sebastian Andrzej Siewior Date: Tue Nov 24 13:38:45 2020 +0100 v5.10-rc5-rt10 Signed-off-by: Sebastian Andrzej Siewior commit 37e9245a4b19974533ca5d8ff026f563590a3cf9 Author: Sebastian Andrzej Siewior Date: Tue Nov 24 12:29:53 2020 +0100 clk: imx8qxp: Unbreak auto module building for MXC_CLK_SCU Automatic moudule building is broken by adding module support to i.MX8QXP clock driver. It can be tested by ARM defconfig + CONFIG_IMX_MBOX=m and CONFIG_MXC_CLK_SCU=m. The compile breaks because the modules and source files are mixed. After fixing that, the build breaks because the SCU driver has no license or symbols, which are required by the CLK_IMX8QXP driver, are not properly exported. Compile module clk-imx-scu.o which contains of clk-scu.o clk-lpcg-scu.o if CONFIG_MXC_CLK_SCU is enabled. Compile modules clk-imx8qxp.o and clk-imx8qxp-lpcg.o if CONFIG_CLK_IMX8QXP is enabled. Add EXPORT_SYMBOL_GPL() to functions which fail to resolve once CONFIG_CLK_IMX8QXP is enabled as module. Add License GPL to clk-scu.c. Fixes: e0d0d4d86c766 ("clk: imx8qxp: Support building i.MX8QXP clock driver as module") Signed-off-by: Sebastian Andrzej Siewior commit ebec003ccc8ba59420b9c21f6f27a35d61e6c82f Author: Valentin Schneider Date: Fri Nov 13 11:24:14 2020 +0000 sched/core: Add missing completion for affine_move_task() waiters Qian reported that some fuzzer issuing sched_setaffinity() ends up stuck on a wait_for_completion(). 
The problematic pattern seems to be: affine_move_task() // task_running() case stop_one_cpu(); wait_for_completion(&pending->done); Combined with, on the stopper side: migration_cpu_stop() // Task moved between unlocks and scheduling the stopper task_rq(p) != rq && // task_running() case dest_cpu >= 0 => no complete_all() This can happen with both PREEMPT and !PREEMPT, although !PREEMPT should be more likely to see this given the targeted task has a much bigger window to block and be woken up elsewhere before the stopper runs. Make migration_cpu_stop() always look at pending affinity requests; signal their completion if the stopper hits a rq mismatch but the task is still within its allowed mask. When Migrate-Disable isn't involved, this matches the previous set_cpus_allowed_ptr() vs migration_cpu_stop() behaviour. Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()") Reported-by: Qian Cai Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Link: https://lore.kernel.org/lkml/8b62fd1ad1b18def27f18e2ee2df3ff5b36d0762.camel@redhat.com Signed-off-by: Sebastian Andrzej Siewior commit 5480cd723532fa55cd6e40ba2f74ed3115b3428d Author: Peter Zijlstra Date: Tue Nov 17 12:14:51 2020 +0100 sched: Fix migration_cpu_stop() WARN Oleksandr reported hitting the WARN in the 'task_rq(p) != rq' branch of migration_cpu_stop(). Valentin noted that using cpu_of(rq) in that case is just plain wrong to begin with, since per the earlier branch that isn't the actual CPU of the task. Replace both instances of is_cpu_allowed() by a direct p->cpus_mask test using task_cpu(). Reported-by: Oleksandr Natalenko Debugged-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior commit f0f8de5791d68b1b760c1b934276e0668f156de9 Author: Sebastian Andrzej Siewior Date: Tue Nov 24 13:35:50 2020 +0100 v5.10-rc5-rt9 Signed-off-by: Sebastian Andrzej Siewior commit 1596395ca38dc859553a9c03a8e180e5be35c4bd Merge: 98adc1b769e3 418baf2c28f3 Author: Sebastian Andrzej Siewior Date: Tue Nov 24 13:35:32 2020 +0100 Merge tag 'v5.10-rc5' into linux-5.10.y-rt Linux 5.10-rc5 Signed-off-by: Sebastian Andrzej Siewior commit 98adc1b769e3270402a58ee7d7bcf03632b44a7b Author: Sebastian Andrzej Siewior Date: Mon Nov 16 11:19:37 2020 +0100 v5.10-rc4-rt8 Signed-off-by: Sebastian Andrzej Siewior commit 97ed13aad79e6102e688e7e0540af588975ce516 Merge: e32498c63d18 09162bc32c88 Author: Sebastian Andrzej Siewior Date: Mon Nov 16 11:19:14 2020 +0100 Merge tag 'v5.10-rc4' into linux-5.10.y-rt Linux 5.10-rc4 Signed-off-by: Sebastian Andrzej Siewior commit e32498c63d189427b445e858084f16a7c78610b6 Author: Sebastian Andrzej Siewior Date: Thu Nov 12 17:50:02 2020 +0100 v5.10-rc3-rt7 Signed-off-by: Sebastian Andrzej Siewior commit a0874158e3e1a9ce674bfe7a4ab3f0eccb70d7e1 Author: Sebastian Andrzej Siewior Date: Thu Nov 12 17:42:33 2020 +0100 softirq: Update to current developtment version Update the softirq WIP code to the current development branch by Thomas Gleixner. Also cherry-pick two patches from the -tip tree: 5f0c71278d684 ("x86/fpu: Simplify fpregs_[un]lock()") cba08c5dc6dc1 ("x86/fpu: Make kernel FPU protection RT friendly") which were splitted away from the softirq work. Signed-off-by: Sebastian Andrzej Siewior commit a0d1418b048b11240853746a5fe1de7c10232514 Author: Thomas Gleixner Date: Thu Nov 12 11:59:32 2020 +0100 mm/highmem: Take kmap_high_get() properly into account kunmap_local() warns when the virtual address to unmap is below PAGE_OFFSET. 
This is correct except for the case that the mapping was obtained via kmap_high_get() because the PKMAP addresses are right below PAGE_OFFSET. Cure it by skipping the WARN_ON() when the unmap was handled by kunmap_high(). Fixes: 298fa1ad5571 ("highmem: Provide generic variant of kmap_atomic*") Reported-by: vtolkm@googlemail.com Reported-by: Marek Szyprowski Signed-off-by: Thomas Gleixner Tested-by: Marek Szyprowski Tested-by: Sebastian Andrzej Siewior Cc: Andrew Morton Link: https://lore.kernel.org/r/87y2j6n8mj.fsf@nanos.tec.linutronix.de Signed-off-by: Sebastian Andrzej Siewior commit 06c5a1f9ca5ad3c90f3973d426685714e2adcf49 Author: Thomas Gleixner Date: Mon Nov 9 23:32:39 2020 +0100 genirq: Move prio assignment into the newly created thread With enabled threaded interrupts the nouveau driver reported the following: | Chain exists of: | &mm->mmap_lock#2 --> &device->mutex --> &cpuset_rwsem | | Possible unsafe locking scenario: | | CPU0 CPU1 | ---- ---- | lock(&cpuset_rwsem); | lock(&device->mutex); | lock(&cpuset_rwsem); | lock(&mm->mmap_lock#2); The device->mutex is nvkm_device::mutex. Unblocking the lockchain at `cpuset_rwsem' is probably the easiest thing to do. Move the priority assignment to the start of the newly created thread. Fixes: 710da3c8ea7df ("sched/core: Prevent race condition between cpuset and __sched_setscheduler()") Reported-by: Mike Galbraith Signed-off-by: Thomas Gleixner [bigeasy: Patch description] Signed-off-by: Sebastian Andrzej Siewior Link: https://lkml.kernel.org/r/a23a826af7c108ea5651e73b8fbae5e653f16e86.camel@gmx.de commit f9a9f009155bf3b2f5b32fcbf639eec461adb931 Author: Sebastian Andrzej Siewior Date: Mon Nov 9 21:30:41 2020 +0100 kthread: Move prio/affinite change into the newly created thread With enabled threaded interrupts the nouveau driver reported the following: | Chain exists of: | &mm->mmap_lock#2 --> &device->mutex --> &cpuset_rwsem | | Possible unsafe locking scenario: | | CPU0 CPU1 | ---- ---- | lock(&cpuset_rwsem); | lock(&device->mutex); | lock(&cpuset_rwsem); | lock(&mm->mmap_lock#2); The device->mutex is nvkm_device::mutex. Unblocking the lockchain at `cpuset_rwsem' is probably the easiest thing to do. Move the priority reset to the start of the newly created thread. Fixes: 710da3c8ea7df ("sched/core: Prevent race condition between cpuset and __sched_setscheduler()") Reported-by: Mike Galbraith Signed-off-by: Sebastian Andrzej Siewior Link: https://lkml.kernel.org/r/a23a826af7c108ea5651e73b8fbae5e653f16e86.camel@gmx.de commit 6fec97678348dc3efee32dc2e784a70171e171c1 Author: Sebastian Andrzej Siewior Date: Mon Nov 9 16:43:49 2020 +0100 v5.10-rc3-rt6 Signed-off-by: Sebastian Andrzej Siewior commit 4e4efad14c1f77c04087c6e79f47ae8d6656e496 Author: Sebastian Andrzej Siewior Date: Mon Nov 9 15:54:03 2020 +0100 sched: Unlock the rq in affine_move_task() error path Unlock the rq if returned early in the error path. 
Reported-by: Joe Korty Signed-off-by: Sebastian Andrzej Siewior Link: https://lkml.kernel.org/r/20201106203921.GA48461@zipoli.concurrent-rt.com commit a62e6793ff79ac4ef7a366303c98eb47e47458a1 Author: Sebastian Andrzej Siewior Date: Mon Nov 9 15:33:20 2020 +0100 v5.10-rc3-rt5 Signed-off-by: Sebastian Andrzej Siewior commit 32e65c08cb113e8955735baca827c34b3e0e622e Merge: 4978f1afd87c f8394f232b1e Author: Sebastian Andrzej Siewior Date: Mon Nov 9 15:32:54 2020 +0100 Merge tag 'v5.10-rc3' into linux-5.10.y-rt Linux 5.10-rc3 commit 4978f1afd87cd2d28a35cd3bc2e21f201558407e Author: Sebastian Andrzej Siewior Date: Tue Nov 3 20:20:23 2020 +0100 v5.10-rc2-rt4 Signed-off-by: Sebastian Andrzej Siewior commit e1d217e38e33becf5a45b35bfa20d5acca63569f Author: Sebastian Andrzej Siewior Date: Mon Nov 2 14:14:24 2020 +0100 timers: Don't block on ->expiry_lock for TIMER_IRQSAFE PREEMPT_RT does not spin and wait until a running timer completes its callback but instead it blocks on a sleeping lock to prevent a deadlock. This blocking can not be done for workqueue's IRQ_SAFE timer which will be canceled in an IRQ-off region. It has to happen to in IRQ-off region because changing the PENDING bit and clearing the timer must not be interrupted to avoid a busy-loop. The callback invocation of IRQSAFE timer is not preempted on PREEMPT_RT so there is no need to synchronize on timer_base::expiry_lock. Don't acquire the timer_base::expiry_lock for TIMER_IRQSAFE flagged timer. Add a lockdep annotation to ensure that this function is always invoked in preemptible context on PREEMPT_RT. Reported-by: Mike Galbraith Signed-off-by: Sebastian Andrzej Siewior Cc: stable-rt@vger.kernel.org commit 777db91e45570419e175bab32b5d1b5b3d9b7450 Author: Sebastian Andrzej Siewior Date: Tue Nov 3 18:43:35 2020 +0100 sched: Provide preempt-lazy stub for !PREEMPT_COUNT An empty stub for lazy-preempt is needed in the !PREEMPT_COUNT case since migrate_disable() is enabled for !RT configurations. Signed-off-by: Sebastian Andrzej Siewior commit 0fdc91971b34cf6857b4cfd8c322ae936cfc189b Author: Oleg Nesterov Date: Tue Nov 3 12:39:01 2020 +0100 ptrace: fix ptrace_unfreeze_traced() race with rt-lock The patch "ptrace: fix ptrace vs tasklist_lock race" changed ptrace_freeze_traced() to take task->saved_state into account, but ptrace_unfreeze_traced() has the same problem and needs a similar fix: it should check/update both ->state and ->saved_state. Reported-by: Luis Claudio R. 
Goncalves Fixes: "ptrace: fix ptrace vs tasklist_lock race" Signed-off-by: Oleg Nesterov Signed-off-by: Sebastian Andrzej Siewior Cc: stable-rt@vger.kernel.org commit 1815252e03a2a3e77a2b5d963651d903d20c06c2 Author: Sebastian Andrzej Siewior Date: Tue Nov 3 17:19:39 2020 +0100 mm/highmem: Preemptible variant of kmap_atomic & friends This is an all-in-one containing the incremental update from v2 to v3 of the series "mm/highmem: Preemptible variant of kmap_atomic & friends" as posted by Thomas Gleixner on 2020-11-03 10:27: mm/highmem: Un-EXPORT __kmap_atomic_idx() highmem: Remove unused functions fs: Remove asm/kmap_types.h includes sh/highmem: Remove all traces of unused cruft asm-generic: Provide kmap_size.h highmem: Provide generic variant of kmap_atomic highmem: Make DEBUG_HIGHMEM functional x86/mm/highmem: Use generic kmap atomic implementation arc/mm/highmem: Use generic kmap atomic implementation ARM: highmem: Switch to generic kmap atomic csky/mm/highmem: Switch to generic kmap atomic microblaze/mm/highmem: Switch to generic kmap atomic mips/mm/highmem: Switch to generic kmap atomic nds32/mm/highmem: Switch to generic kmap atomic powerpc/mm/highmem: Switch to generic kmap atomic sparc/mm/highmem: Switch to generic kmap atomic xtensa/mm/highmem: Switch to generic kmap atomic highmem: Get rid of kmap_types.h mm/highmem: Remove the old kmap_atomic cruft io-mapping: Cleanup atomic iomap Documentation/io-mapping: Remove outdated blurb highmem: High implementation details and document API sched: Make migrate_disable/enable() independent of RT sched: highmem: Store local kmaps in task struct mm/highmem: Provide kmap_local* io-mapping: Provide iomap_local variant x86/crashdump/32: Simplify copy_oldmem_page() mips/crashdump: Simplify copy_oldmem_page() ARM: mm: Replace kmap_atomic_pfn() highmem: Remove kmap_atomic_pfn() drm/ttm: Replace kmap_atomic() usage drm/vmgfx: Replace kmap_atomic() highmem: Remove kmap_atomic_prot() drm/qxl: Replace io_mapping_map_atomic_wc() drm/nouveau/device: Replace io_mapping_map_atomic_wc() drm/i915: Replace io_mapping_map_atomic_wc() io-mapping: Remove io_mapping_map_atomic_wc() This commit also includes fixes from the thread. Signed-off-by: Sebastian Andrzej Siewior commit fe82702ad8dd2918a86949902fc2d1c0bc375274 Author: Sebastian Andrzej Siewior Date: Tue Nov 3 12:50:26 2020 +0100 v5.10-rc2-rt3 Signed-off-by: Sebastian Andrzej Siewior commit 09e6f52ba24dd79a21c3f60a56a379f94fbe0290 Merge: 2f42a2fe3450 3cea11cd5e3b Author: Sebastian Andrzej Siewior Date: Tue Nov 3 12:49:52 2020 +0100 Merge tag 'v5.10-rc2' into linux-5.10.y-rt Linux 5.10-rc2 Signed-off-by: Sebastian Andrzej Siewior commit 2f42a2fe3450a6ef26ed1df893ed83e9d094079f Author: Sebastian Andrzej Siewior Date: Fri Oct 30 19:50:49 2020 +0100 v5.10-rc1-rt2 Signed-off-by: Sebastian Andrzej Siewior commit 8675726664ee6e6f6406ce6db2b3cb3b36ab40f5 Author: Sebastian Andrzej Siewior Date: Fri Oct 30 13:59:06 2020 +0100 highmem: Don't disable preemption on RT in kmap_atomic() Disabling preemption make it impossible to acquire sleeping locks within kmap_atomic() section. For PREEMPT_RT it is sufficient to disable migration. 
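A minimal sketch of the kmap_local_page()/kunmap_local() interface introduced by the series above (the wrapper function and its name are illustrative):

/* Sketch of the kmap_local_page()/kunmap_local() pattern from the series
 * above; unlike kmap_atomic(), the mapped section stays preemptible and
 * only migration is disabled. Function and variable names are made up. */
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/string.h>

static void example_clear_page(struct page *page)
{
    void *addr = kmap_local_page(page);

    memset(addr, 0, PAGE_SIZE);     /* may be preempted on PREEMPT_RT */
    kunmap_local(addr);
}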
Signed-off-by: Sebastian Andrzej Siewior commit 32ac92b5a39e2ef91308929bf5ed0804094b4183 Author: Sebastian Andrzej Siewior Date: Fri Oct 30 19:51:02 2020 +0100 mm/highmem: Preemptible variant of kmap_atomic & friends This is an all-in-one patch cotaining the patch series "mm/highmem: Preemptible variant of kmap_atomic & friends" as sent by Thomas Gleixner on 2020-10-29 23:18: sched: Make migrate_disable/enable() independent of RT mm/highmem: Un-EXPORT __kmap_atomic_idx() highmem: Provide generic variant of kmap_atomic* x86/mm/highmem: Use generic kmap atomic implementation arc/mm/highmem: Use generic kmap atomic implementation ARM: highmem: Switch to generic kmap atomic csky/mm/highmem: Switch to generic kmap atomic microblaze/mm/highmem: Switch to generic kmap atomic mips/mm/highmem: Switch to generic kmap atomic nds32/mm/highmem: Switch to generic kmap atomic powerpc/mm/highmem: Switch to generic kmap atomic sparc/mm/highmem: Switch to generic kmap atomic xtensa/mm/highmem: Switch to generic kmap atomic mm/highmem: Remove the old kmap_atomic cruft io-mapping: Cleanup atomic iomap sched: highmem: Store local kmaps in task struct mm/highmem: Provide kmap_local* io-mapping: Provide iomap_local variant plus fixes which were folded in. It also contains a revert of the old patches which were implementing the extra highmem bits for RT. Signed-off-by: Sebastian Andrzej Siewior commit f913ce319363b768f3a20ca2a0e2e1601ab95fb9 Author: Sebastian Andrzej Siewior Date: Fri Oct 30 19:23:14 2020 +0100 block-mq: Disable preemption in blk_mq_complete_request_remote() There callers of blk_mq_complete_request() which invoke it in preemptible context. Disable preemption while an item is added on the local-CPU to ensure that the softirq is fired on the same CPU. Signed-off-by: Sebastian Andrzej Siewior commit e27ef68731a139841c3cf7bd59ac990f488d32ee Author: Paul E. McKenney Date: Thu Sep 24 15:11:55 2020 -0700 rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled The try_invoke_on_locked_down_task() function requires that interrupts be enabled, but it is called with interrupts disabled from rcu_print_task_stall(), resulting in an "IRQs not enabled as expected" diagnostic. This commit therefore updates rcu_print_task_stall() to accumulate a list of the first few tasks while holding the current leaf rcu_node structure's ->lock, then releases that lock and only then uses try_invoke_on_locked_down_task() to attempt to obtain per-task detailed information. Of course, as soon as ->lock is released, the task might exit, so the get_task_struct() function is used to prevent the task structure from going away in the meantime. Link: https://lore.kernel.org/lkml/000000000000903d5805ab908fc4@google.com/ Reported-by: syzbot+cb3b69ae80afd6535b0e@syzkaller.appspotmail.com Reported-by: syzbot+f04854e1c5c9e913cc27@syzkaller.appspotmail.com Signed-off-by: Paul E. McKenney Signed-off-by: Sebastian Andrzej Siewior commit 16966f1ba0ebb5c8032e87e7e6d46f1bfd38c780 Author: Thomas Gleixner Date: Fri Jul 8 20:25:16 2011 +0200 Add localversion for -RT release Signed-off-by: Thomas Gleixner commit 8e79ef08134fa4096f778cf512e275060f84a89b Author: Clark Williams Date: Sat Jul 30 21:55:53 2011 -0500 sysfs: Add /sys/kernel/realtime entry Add a /sys/kernel entry to indicate that the kernel is a realtime kernel. Clark says that he needs this for udev rules, udev needs to evaluate if its a PREEMPT_RT kernel a few thousand times and parsing uname output is too slow or so. Are there better solutions? 
Should it exist and return 0 on !-rt? Signed-off-by: Clark Williams Signed-off-by: Peter Zijlstra commit 791d78ca7b3d8e53eb8f2802a5eefe8af78c3920 Author: Ingo Molnar Date: Fri Jul 3 08:29:57 2009 -0500 genirq: Disable irqpoll on -rt Creates long latencies for no value Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner commit 04cc05c8098c7ebb677fa7dc417f9be195975d90 Author: Matt Fleming Date: Tue Apr 7 10:54:13 2020 +0100 signal: Prevent double-free of user struct The way user struct reference counting works changed significantly with, fda31c50292a ("signal: avoid double atomic counter increments for user accounting") Now user structs are only freed once the last pending signal is dequeued. Make sigqueue_free_current() follow this new convention to avoid freeing the user struct multiple times and triggering this warning: refcount_t: underflow; use-after-free. WARNING: CPU: 0 PID: 6794 at lib/refcount.c:288 refcount_dec_not_one+0x45/0x50 Call Trace: refcount_dec_and_lock_irqsave+0x16/0x60 free_uid+0x31/0xa0 __dequeue_signal+0x17c/0x190 dequeue_signal+0x5a/0x1b0 do_sigtimedwait+0x208/0x250 __x64_sys_rt_sigtimedwait+0x6f/0xd0 do_syscall_64+0x72/0x200 entry_SYSCALL_64_after_hwframe+0x49/0xbe Signed-off-by: Matt Fleming Reported-by: Daniel Wagner Signed-off-by: Sebastian Andrzej Siewior commit 47d4b7f9079c910e78a45eb6ba7117a4ed3aa5ad Author: Thomas Gleixner Date: Fri Jul 3 08:44:56 2009 -0500 signals: Allow rt tasks to cache one sigqueue struct To avoid allocation allow rt tasks to cache one sigqueue struct in task struct. Signed-off-by: Thomas Gleixner commit 8b7057fe1fc92a02d692022e214fb9247fa56d66 Author: Haris Okanovic Date: Tue Aug 15 15:13:08 2017 -0500 tpm_tis: fix stall after iowrite*()s ioread8() operations to TPM MMIO addresses can stall the cpu when immediately following a sequence of iowrite*()'s to the same region. For example, cyclitest measures ~400us latency spikes when a non-RT usermode application communicates with an SPI-based TPM chip (Intel Atom E3940 system, PREEMPT_RT kernel). The spikes are caused by a stalling ioread8() operation following a sequence of 30+ iowrite8()s to the same address. I believe this happens because the write sequence is buffered (in cpu or somewhere along the bus), and gets flushed on the first LOAD instruction (ioread*()) that follows. The enclosed change appears to fix this issue: read the TPM chip's access register (status code) after every iowrite*() operation to amortize the cost of flushing data to chip across multiple instructions. Signed-off-by: Haris Okanovic Signed-off-by: Sebastian Andrzej Siewior commit 612499611949116bfd69a34b34e77bcbcc492ab8 Author: Mike Galbraith Date: Thu Mar 31 04:08:28 2016 +0200 drivers/block/zram: Replace bit spinlocks with rtmutex for -rt They're nondeterministic, and lead to ___might_sleep() splats in -rt. OTOH, they're a lot less wasteful than an rtmutex per page. Signed-off-by: Mike Galbraith Signed-off-by: Sebastian Andrzej Siewior commit 8e2f98a03c45f5feead10b45f31be1d741b2dc69 Author: Thomas Gleixner Date: Mon Jul 18 17:10:12 2011 +0200 mips: Disable highmem on RT The current highmem handling on -RT is not compatible and needs fixups. Signed-off-by: Thomas Gleixner commit e3254b691b03205ec875ccbb0425e7b219f1f6c2 Author: Sebastian Andrzej Siewior Date: Fri Oct 11 13:14:41 2019 +0200 POWERPC: Allow to enable RT Allow to select RT. 
Signed-off-by: Sebastian Andrzej Siewior commit a99cd452e8483245f23c40a736b89a83919dc5c7 Author: Sebastian Andrzej Siewior Date: Tue Mar 26 18:31:29 2019 +0100 powerpc/stackprotector: work around stack-guard init from atomic This is invoked from the secondary CPU in atomic context. On x86 we use tsc instead. On Power we XOR it against mftb() so lets use stack address as the initial value. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior commit d3b5e7bdad7c3f52a242237124458d7861fdb842 Author: Thomas Gleixner Date: Mon Jul 18 17:08:34 2011 +0200 powerpc: Disable highmem on RT The current highmem handling on -RT is not compatible and needs fixups. Signed-off-by: Thomas Gleixner commit d8647f0ffdbc3f2ba3bfc0ced37c5d793e7ee564 Author: Bogdan Purcareata Date: Fri Apr 24 15:53:13 2015 +0000 powerpc/kvm: Disable in-kernel MPIC emulation for PREEMPT_RT While converting the openpic emulation code to use a raw_spinlock_t enables guests to run on RT, there's still a performance issue. For interrupts sent in directed delivery mode with a multiple CPU mask, the emulated openpic will loop through all of the VCPUs, and for each VCPUs, it call IRQ_check, which will loop through all the pending interrupts for that VCPU. This is done while holding the raw_lock, meaning that in all this time the interrupts and preemption are disabled on the host Linux. A malicious user app can max both these number and cause a DoS. This temporary fix is sent for two reasons. First is so that users who want to use the in-kernel MPIC emulation are aware of the potential latencies, thus making sure that the hardware MPIC and their usage scenario does not involve interrupts sent in directed delivery mode, and the number of possible pending interrupts is kept small. Secondly, this should incentivize the development of a proper openpic emulation that would be better suited for RT. Acked-by: Scott Wood Signed-off-by: Bogdan Purcareata Signed-off-by: Sebastian Andrzej Siewior commit 5708951e6a4c10f5951412cbcd6b07a9ec3fe771 Author: Sebastian Andrzej Siewior Date: Tue Mar 26 18:31:54 2019 +0100 powerpc/pseries/iommu: Use a locallock instead local_irq_save() The locallock protects the per-CPU variable tce_page. The function attempts to allocate memory while tce_page is protected (by disabling interrupts). Use local_irq_save() instead of local_irq_disable(). Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior commit 3b2813f033b64998f1f5c0385acf220311cfe3bc Author: Sebastian Andrzej Siewior Date: Fri Oct 11 13:14:35 2019 +0200 ARM64: Allow to enable RT Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior commit d8f742a94a343c1e55119961ef3bc669d3aa3c00 Author: Sebastian Andrzej Siewior Date: Fri Oct 11 13:14:29 2019 +0200 ARM: Allow to enable RT Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior commit e721f755c363d2e70f9984c58a5e6cdf18b4b5bd Author: Sebastian Andrzej Siewior Date: Thu Nov 7 17:49:20 2019 +0100 x86: Enable RT also on 32bit Signed-off-by: Sebastian Andrzej Siewior commit f209c7d068e429f4385a020ee9bcbb6d018c1f4a Author: Sebastian Andrzej Siewior Date: Wed Jul 25 14:02:38 2018 +0200 arm64: fpsimd: Delay freeing memory in fpsimd_flush_thread() fpsimd_flush_thread() invokes kfree() via sve_free() within a preempt disabled section which is not working on -RT. Delay freeing of memory until preemption is enabled again. 
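The fpsimd fix above follows a common RT pattern: detach the pointer inside the preemption-disabled section and free it only once preemption is enabled again. A generic sketch with made-up names (the real patch applies this to the SVE state in fpsimd_flush_thread()):

/* Generic sketch of the "delay the kfree()" pattern described above;
 * names and structure are illustrative only. */
#include <linux/preempt.h>
#include <linux/slab.h>

struct example_state {
    void *buf;
};

static void example_flush(struct example_state *st)
{
    void *to_free;

    preempt_disable();
    to_free = st->buf;      /* detach while preemption is off ... */
    st->buf = NULL;
    preempt_enable();

    kfree(to_free);         /* ... free only once sleeping is allowed again */
}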
Signed-off-by: Sebastian Andrzej Siewior commit 8b0a1acc9171303872cc63961692ea4150d92468 Author: Josh Cartwright Date: Thu Feb 11 11:54:01 2016 -0600 KVM: arm/arm64: downgrade preempt_disable()d region to migrate_disable() kvm_arch_vcpu_ioctl_run() disables the use of preemption when updating the vgic and timer states to prevent the calling task from migrating to another CPU. It does so to prevent the task from writing to the incorrect per-CPU GIC distributor registers. On -rt kernels, it's possible to maintain the same guarantee with the use of migrate_{disable,enable}(), with the added benefit that the migrate-disabled region is preemptible. Update kvm_arch_vcpu_ioctl_run() to do so. Cc: Christoffer Dall Reported-by: Manish Jaggi Signed-off-by: Josh Cartwright Signed-off-by: Sebastian Andrzej Siewior commit 9d9be9466a71fbff79e6a59c3f1ade349211b4e2 Author: Josh Cartwright Date: Thu Feb 11 11:54:00 2016 -0600 genirq: update irq_set_irqchip_state documentation On -rt kernels, the use of migrate_disable()/migrate_enable() is sufficient to guarantee a task isn't moved to another CPU. Update the irq_set_irqchip_state() documentation to reflect this. Signed-off-by: Josh Cartwright Signed-off-by: Sebastian Andrzej Siewior commit f90f7abc931b3d9d47f11dd84fbededebe577341 Author: Yadi.hu Date: Wed Dec 10 10:32:09 2014 +0800 ARM: enable irq in translation/section permission fault handlers Probably happens on all ARM with CONFIG_PREEMPT_RT and CONFIG_DEBUG_ATOMIC_SLEEP. This simple program.... int main() { *((char*)0xc0001000) = 0; }; [ 512.742724] BUG: sleeping function called from invalid context at kernel/rtmutex.c:658 [ 512.743000] in_atomic(): 0, irqs_disabled(): 128, pid: 994, name: a [ 512.743217] INFO: lockdep is turned off. [ 512.743360] irq event stamp: 0 [ 512.743482] hardirqs last enabled at (0): [< (null)>] (null) [ 512.743714] hardirqs last disabled at (0): [] copy_process+0x3b0/0x11c0 [ 512.744013] softirqs last enabled at (0): [] copy_process+0x3b0/0x11c0 [ 512.744303] softirqs last disabled at (0): [< (null)>] (null) [ 512.744631] [] (unwind_backtrace+0x0/0x104) [ 512.745001] [] (dump_stack+0x20/0x24) [ 512.745355] [] (__might_sleep+0x1dc/0x1e0) [ 512.745717] [] (rt_spin_lock+0x34/0x6c) [ 512.746073] [] (do_force_sig_info+0x34/0xf0) [ 512.746457] [] (force_sig_info+0x18/0x1c) [ 512.746829] [] (__do_user_fault+0x9c/0xd8) [ 512.747185] [] (do_bad_area+0x7c/0x94) [ 512.747536] [] (do_sect_fault+0x40/0x48) [ 512.747898] [] (do_DataAbort+0x40/0xa0) [ 512.748181] Exception stack(0xecaa1fb0 to 0xecaa1ff8) 0xc0000000 belongs to kernel address space; a user task cannot be allowed to access it. For the above condition, the correct result is that the test case receives a "segmentation fault" and exits, rather than getting stuck. The root cause is commit 02fe2845d6a8 ("avoid enabling interrupts in prefetch/data abort handlers"), which deletes the irq enable block in the Data abort assembly code and moves it into the page/breakpoint/alignment fault handlers instead. But the author did not enable irq in the translation/section permission fault handlers. ARM disables irq when it enters exception/interrupt mode; if the kernel doesn't enable irq, it stays disabled during translation/section permission faults.
We see the above splat because do_force_sig_info is still called with IRQs off, and that code eventually does a: spin_lock_irqsave(&t->sighand->siglock, flags); As this is architecture independent code, and we've not seen any other need for other arch to have the siglock converted to raw lock, we can conclude that we should enable irq for ARM translation/section permission exception. Signed-off-by: Yadi.hu Signed-off-by: Sebastian Andrzej Siewior commit 008cc77aff249e830e5eb90b7ae3a6784597b8cf Author: Thomas Gleixner Date: Tue Jan 8 21:36:51 2013 +0100 tty/serial/pl011: Make the locking work on RT The lock is a sleeping lock and local_irq_save() is not the optimsation we are looking for. Redo it to make it work on -RT and non-RT. Signed-off-by: Thomas Gleixner commit 612ae174f781c9186ecf271a569b292bbe45e2e0 Author: Thomas Gleixner Date: Thu Jul 28 13:32:57 2011 +0200 tty/serial/omap: Make the locking RT aware The lock is a sleeping lock and local_irq_save() is not the optimsation we are looking for. Redo it to make it work on -RT and non-RT. Signed-off-by: Thomas Gleixner commit 68cad904bd4845f7f1099d7e9f4ba6befab8205d Author: Sebastian Andrzej Siewior Date: Thu Jan 23 14:45:59 2014 +0100 leds: trigger: disable CPU trigger on -RT as it triggers: |CPU: 0 PID: 0 Comm: swapper Not tainted 3.12.8-rt10 #141 |[] (unwind_backtrace+0x0/0xf8) from [] (show_stack+0x1c/0x20) |[] (show_stack+0x1c/0x20) from [] (dump_stack+0x20/0x2c) |[] (dump_stack+0x20/0x2c) from [] (__might_sleep+0x13c/0x170) |[] (__might_sleep+0x13c/0x170) from [] (__rt_spin_lock+0x28/0x38) |[] (__rt_spin_lock+0x28/0x38) from [] (rt_read_lock+0x68/0x7c) |[] (rt_read_lock+0x68/0x7c) from [] (led_trigger_event+0x2c/0x5c) |[] (led_trigger_event+0x2c/0x5c) from [] (ledtrig_cpu+0x54/0x5c) |[] (ledtrig_cpu+0x54/0x5c) from [] (arch_cpu_idle_exit+0x18/0x1c) |[] (arch_cpu_idle_exit+0x18/0x1c) from [] (cpu_startup_entry+0xa8/0x234) |[] (cpu_startup_entry+0xa8/0x234) from [] (rest_init+0xb8/0xe0) |[] (rest_init+0xb8/0xe0) from [] (start_kernel+0x2c4/0x380) Signed-off-by: Sebastian Andrzej Siewior commit 88cf8ba83e55e53529c9794102d6bed8a9e06bc7 Author: Thomas Gleixner Date: Wed Jul 8 17:14:48 2015 +0200 jump-label: disable if stop_machine() is used Some architectures are using stop_machine() while switching the opcode which leads to latency spikes. The architectures which use stop_machine() atm: - ARM stop machine - s390 stop machine The architecures which use other sorcery: - MIPS - X86 - powerpc - sparc - arm64 Signed-off-by: Thomas Gleixner [bigeasy: only ARM for now] Signed-off-by: Sebastian Andrzej Siewior commit c4e89dbaed5c1a80917fd3ede795d5027bd22361 Author: Anders Roxell Date: Thu May 14 17:52:17 2015 +0200 arch/arm64: Add lazy preempt support arm64 is missing support for PREEMPT_RT. The main feature which is lacking is support for lazy preemption. The arch-specific entry code, thread information structure definitions, and associated data tables have to be extended to provide this support. Then the Kconfig file has to be extended to indicate the support is available, and also to indicate that support for full RT preemption is now available. Signed-off-by: Anders Roxell commit 3e39f9464f749c2f0cc308ffc022a0e6152338f3 Author: Thomas Gleixner Date: Thu Nov 1 10:14:11 2012 +0100 powerpc: Add support for lazy preemption Implement the powerpc pieces for lazy preempt. 
Signed-off-by: Thomas Gleixner commit e21afa1ca7bbd698d45fa1dd13293284a3bd0697 Author: Thomas Gleixner Date: Wed Oct 31 12:04:11 2012 +0100 arm: Add support for lazy preemption Implement the arm pieces for lazy preempt. Signed-off-by: Thomas Gleixner commit 30fc8a257725f883702f5edd927befa8388b0cd9 Author: Thomas Gleixner Date: Thu Nov 1 11:03:47 2012 +0100 x86: Support for lazy preemption Implement the x86 pieces for lazy preempt. Signed-off-by: Thomas Gleixner commit 15444674c734aa8e3258bb3c4f6886785b56be7a Author: Sebastian Andrzej Siewior Date: Tue Jun 30 11:45:14 2020 +0200 x86/entry: Use should_resched() in idtentry_exit_cond_resched() The TIF_NEED_RESCHED bit is inlined on x86 into the preemption counter. By using should_resched(0) instead of need_resched() the same check can be performed against the same variable as `preempt_count()`, which was evaluated before. Use should_resched(0) instead of need_resched(). Signed-off-by: Sebastian Andrzej Siewior commit d414e1e3d51bf67ed765419477b3545e9d54b01b Author: Thomas Gleixner Date: Fri Oct 26 18:50:54 2012 +0100 sched: Add support for lazy preemption It has become an obsession to mitigate the determinism vs. throughput loss of RT. Looking at the mainline semantics of preemption points gives a hint why RT sucks throughput wise for ordinary SCHED_OTHER tasks. One major issue is the wakeup of tasks which are right away preempting the waking task while the waking task holds a lock on which the woken task will block right after having preempted the wakee. In mainline this is prevented due to the implicit preemption disable of spin/rw_lock held regions. On RT this is not possible due to the fully preemptible nature of sleeping spinlocks. Though for a SCHED_OTHER task preempting another SCHED_OTHER task this is really not a correctness issue. RT folks are concerned about SCHED_FIFO/RR tasks preemption and not about the purely fairness driven SCHED_OTHER preemption latencies. So I introduced a lazy preemption mechanism which only applies to SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside from the existing preempt_count, each task now sports a preempt_lazy_count which is manipulated on lock acquisition and release. This is slightly incorrect as for laziness reasons I coupled this to migrate_disable/enable so some other mechanisms get the same treatment (e.g. get_cpu_light). Now on the scheduler side, instead of setting NEED_RESCHED this sets NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and therefore allows the waking task to exit the lock-held region before the woken task preempts. That also works better for cross-CPU wakeups as the other side can stay in the adaptive spinning loop. For RT class preemption there is no change. This simply sets NEED_RESCHED and forgoes the lazy preemption counter. Initial tests do not expose any observable latency increase, but history shows that I've been proven wrong before :) The lazy preemption mode is on by default, but with CONFIG_SCHED_DEBUG enabled it can be disabled via: # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features and re-enabled via # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features The test results so far are very machine and workload dependent, but there is a clear trend that it enhances the non-RT workload performance.
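The decision described above can be summarised in a short sketch (simplified and hypothetical; the flag and helper names only approximate the -rt implementation, and TIF_NEED_RESCHED_LAZY exists only with the lazy-preempt patches): a fair task preempting another fair task only sets the lazy flag, everything else demands an immediate reschedule.

    #include <linux/sched.h>

    /* Sketch of the wakeup-preemption decision, not the actual -rt code. */
    static void request_resched_sketch(struct task_struct *curr, bool wakee_is_fair)
    {
            if (!wakee_is_fair || curr->policy != SCHED_NORMAL) {
                    /* RT/DL involved: behave as today, reschedule immediately. */
                    set_tsk_need_resched(curr);
                    return;
            }
            /*
             * SCHED_OTHER preempting SCHED_OTHER: only mark the lazy flag so
             * the waking task can leave its lock-held region first.
             */
            set_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY);
    }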
Signed-off-by: Thomas Gleixner commit 741e216a6ff58a5494e5c5a8f1f3613a3f52bc3d Author: Thomas Gleixner Date: Fri Jul 3 08:44:34 2009 -0500 mm/scatterlist: Do not disable irqs on RT For -RT it is enough to keep pagefault disabled (which is currently handled by kmap_atomic()). Signed-off-by: Thomas Gleixner commit c598d406fe44c4661103c5d9c55686068dfa79fb Author: Thomas Gleixner Date: Wed Feb 13 11:03:11 2013 +0100 arm: Enable highmem for rt fixup highmem for ARM. Signed-off-by: Thomas Gleixner commit b78a6e177b42046d84fd3db575755acfcd4b6742 Author: Sebastian Andrzej Siewior Date: Mon Mar 11 21:37:27 2013 +0100 arm/highmem: Flush tlb on unmap The tlb should be flushed on unmap and thus make the mapping entry invalid. This is only done in the non-debug case which does not look right. Signed-off-by: Sebastian Andrzej Siewior commit f76433035fe6323a9debde937e9320dceb5d03f7 Author: Sebastian Andrzej Siewior Date: Mon Mar 11 17:09:55 2013 +0100 x86/highmem: Add a "already used pte" check This is a copy from kmap_atomic_prot(). Signed-off-by: Sebastian Andrzej Siewior commit 86c7c576adf2ef89d17853ccb4468c53e3675f2d Author: Peter Zijlstra Date: Thu Jul 28 10:43:51 2011 +0200 mm, rt: kmap_atomic scheduling In fact, with migrate_disable() existing one could play games with kmap_atomic. You could save/restore the kmap_atomic slots on context switch (if there are any in use of course), this should be esp easy now that we have a kmap_atomic stack. Something like the below.. it wants replacing all the preempt_disable() stuff with pagefault_disable() && migrate_disable() of course, but then you can flip kmaps around like below. Signed-off-by: Peter Zijlstra [dvhart@linux.intel.com: build fix] Link: http://lkml.kernel.org/r/1311842631.5890.208.camel@twins [tglx@linutronix.de: Get rid of the per cpu variable and store the idx and the pte content right away in the task struct. Shortens the context switch code. ] commit bab62ed29a55d6e2b5fe0844ef8265d2cb29570b Author: Sebastian Andrzej Siewior Date: Wed Aug 7 18:15:38 2019 +0200 x86: Allow to enable RT Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior commit e3f4ff4edd53c510cc80d21e97990f206a9e9ea1 Author: Mike Galbraith Date: Sun Jan 8 09:32:25 2017 +0100 cpuset: Convert callback_lock to raw_spinlock_t The two commits below add up to a cpuset might_sleep() splat for RT: 8447a0fee974 cpuset: convert callback_mutex to a spinlock 344736f29b35 cpuset: simplify cpuset_node_allowed API BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:995 in_atomic(): 0, irqs_disabled(): 1, pid: 11718, name: cset CPU: 135 PID: 11718 Comm: cset Tainted: G E 4.10.0-rt1-rt #4 Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0056.R01.1409242327 09/24/2014 Call Trace: ? dump_stack+0x5c/0x81 ? ___might_sleep+0xf4/0x170 ? rt_spin_lock+0x1c/0x50 ? __cpuset_node_allowed+0x66/0xc0 ? ___slab_alloc+0x390/0x570 ? anon_vma_fork+0x8f/0x140 ? copy_page_range+0x6cf/0xb00 ? anon_vma_fork+0x8f/0x140 ? __slab_alloc.isra.74+0x5a/0x81 ? anon_vma_fork+0x8f/0x140 ? kmem_cache_alloc+0x1b5/0x1f0 ? anon_vma_fork+0x8f/0x140 ? copy_process.part.35+0x1670/0x1ee0 ? _do_fork+0xdd/0x3f0 ? _do_fork+0xdd/0x3f0 ? do_syscall_64+0x61/0x170 ? entry_SYSCALL64_slow_path+0x25/0x25 The later ensured that a NUMA box WILL take callback_lock in atomic context by removing the allocator and reclaim path __GFP_HARDWALL usage which prevented such contexts from taking callback_mutex. 
One option would be to reinstate __GFP_HARDWALL protections for RT, however, as the 8447a0fee974 changelog states: The callback_mutex is only used to synchronize reads/updates of cpusets' flags and cpu/node masks. These operations should always proceed fast so there's no reason why we can't use a spinlock instead of the mutex. Cc: stable-rt@vger.kernel.org Signed-off-by: Mike Galbraith Signed-off-by: Sebastian Andrzej Siewior commit 3327ab1dc7edd18bf1f633d1d96e08ca444d5d1d Author: Sebastian Andrzej Siewior Date: Tue Jul 7 12:25:11 2020 +0200 drm/i915/gt: Only disable interrupts for the timeline lock on !force-threaded According to commit d67739268cf0e ("drm/i915/gt: Mark up the nested engine-pm timeline lock as irqsafe") the interrupts are disabled because the code may be called from an interrupt handler and from preemptible context. With `force_irqthreads' set the timeline mutex is never observed in IRQ context, so there is no need to disable interrupts. Only disable interrupts if not in `force_irqthreads' mode. Signed-off-by: Sebastian Andrzej Siewior commit 2f1685039f1647da51e86a7e135a738c32fdb63f Author: Sebastian Andrzej Siewior Date: Wed Dec 19 10:47:02 2018 +0100 drm/i915: skip DRM_I915_LOW_LEVEL_TRACEPOINTS with NOTRACE The order of the header files is important. If this header file is included after tracepoint.h was included then the NOTRACE here becomes a nop. Currently this happens for two .c files which use the tracepoints behind DRM_I915_LOW_LEVEL_TRACEPOINTS. Signed-off-by: Sebastian Andrzej Siewior commit 9eb94f96251c8efe68fae34103f037038a6999e7 Author: Sebastian Andrzej Siewior Date: Thu Dec 6 09:52:20 2018 +0100 drm/i915: disable tracing on -RT Luca Abeni reported this: | BUG: scheduling while atomic: kworker/u8:2/15203/0x00000003 | CPU: 1 PID: 15203 Comm: kworker/u8:2 Not tainted 4.19.1-rt3 #10 | Call Trace: | rt_spin_lock+0x3f/0x50 | gen6_read32+0x45/0x1d0 [i915] | g4x_get_vblank_counter+0x36/0x40 [i915] | trace_event_raw_event_i915_pipe_update_start+0x7d/0xf0 [i915] Tracing events such as trace_i915_pipe_update_start() invoke functions which acquire spinlocks. A few trace points use intel_get_crtc_scanline(), others use ->get_vblank_counter() which also might acquire a sleeping lock. Based on this I don't see any other way than disabling trace points on RT. Cc: stable-rt@vger.kernel.org Reported-by: Luca Abeni Signed-off-by: Sebastian Andrzej Siewior commit 777476d6b47c9a4b17a8d1a7d4c6f4ddd9501d80 Author: Mike Galbraith Date: Sat Feb 27 09:01:42 2016 +0100 drm/i915: Don't disable interrupts on PREEMPT_RT during atomic updates Commit 8d7849db3eab7 ("drm/i915: Make sprite updates atomic") started disabling interrupts across atomic updates. This breaks on PREEMPT_RT because within this section the code attempts to acquire spinlock_t locks which are sleeping locks on PREEMPT_RT. According to the comment the interrupts are disabled to avoid random delays and are not required for protection or synchronisation. Don't disable interrupts on PREEMPT_RT during atomic updates. [bigeasy: drop local locks, commit message] Signed-off-by: Mike Galbraith Signed-off-by: Sebastian Andrzej Siewior commit d70a1c4fbb8fc82ead7b814cec5a976da2fff0fb Author: Mike Galbraith Date: Sat Feb 27 08:09:11 2016 +0100 drm,radeon,i915: Use preempt_disable/enable_rt() where recommended DRM folks identified the spots, so use them.
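The preempt_disable_rt()/preempt_enable_rt() helpers used here are described further down in this series ("preempt: Provide preempt_*_(no)rt variants"); conceptually they boil down to something like the following sketch (illustrative only, not the exact -rt definitions):

    #ifdef CONFIG_PREEMPT_RT
    /* Extra preemption points which only RT needs. */
    # define preempt_disable_rt()          preempt_disable()
    # define preempt_enable_rt()           preempt_enable()
    #else
    # define preempt_disable_rt()          do { } while (0)
    # define preempt_enable_rt()           do { } while (0)
    #endif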
Signed-off-by: Mike Galbraith Cc: Sebastian Andrzej Siewior Cc: linux-rt-users Signed-off-by: Thomas Gleixner commit 271bdf8bd388890eedef9d0edba63154ebcf6ea0 Author: Sebastian Andrzej Siewior Date: Tue Oct 17 16:36:18 2017 +0200 lockdep: disable self-test The self-test wasn't always 100% accurate for RT. We disabled a few tests which failed because they had a different semantic for RT. Some still reported false positives. Now the selftest locks up the system during boot and it needs to be investigated… Signed-off-by: Sebastian Andrzej Siewior commit b807b85b63d4c34b89b116f395b2154cf9d26e14 Author: Josh Cartwright Date: Wed Jan 28 13:08:45 2015 -0600 lockdep: selftest: fix warnings due to missing PREEMPT_RT conditionals "lockdep: Selftest: Only do hardirq context test for raw spinlock" disabled the execution of certain tests with PREEMPT_RT, but did not prevent the tests from still being defined. This leads to warnings like: ./linux/lib/locking-selftest.c:574:1: warning: 'irqsafe1_hard_rlock_12' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:574:1: warning: 'irqsafe1_hard_rlock_21' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:577:1: warning: 'irqsafe1_hard_wlock_12' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:577:1: warning: 'irqsafe1_hard_wlock_21' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:580:1: warning: 'irqsafe1_soft_spin_12' defined but not used [-Wunused-function] ... Fixed by wrapping the test definitions in #ifndef CONFIG_PREEMPT_RT conditionals. Signed-off-by: Josh Cartwright Signed-off-by: Xander Huff Acked-by: Gratian Crisan Signed-off-by: Sebastian Andrzej Siewior commit 85a86947f54db92dba0f81ca3d4e826c2c8ceae7 Author: Yong Zhang Date: Mon Apr 16 15:01:56 2012 +0800 lockdep: selftest: Only do hardirq context test for raw spinlock On -rt there is no softirq context any more and rwlock is sleepable, disable softirq context test and rwlock+irq test. Signed-off-by: Yong Zhang Cc: Yong Zhang Link: http://lkml.kernel.org/r/1334559716-18447-3-git-send-email-yong.zhang0@gmail.com Signed-off-by: Thomas Gleixner commit 579b56b42697211ca7fad2c11d4a56204daea586 Author: Thomas Gleixner Date: Sun Jul 17 18:51:23 2011 +0200 lockdep: Make it RT aware teach lockdep that we don't really do softirqs on -RT. Signed-off-by: Thomas Gleixner commit c3f48e8e0aa1da9cf1c920ed6c2d1823b0758cf9 Author: Priyanka Jain Date: Thu May 17 09:35:11 2012 +0530 net: Remove preemption disabling in netif_rx() 1)enqueue_to_backlog() (called from netif_rx) should be bind to a particluar CPU. This can be achieved by disabling migration. No need to disable preemption 2)Fixes crash "BUG: scheduling while atomic: ksoftirqd" in case of RT. If preemption is disabled, enqueue_to_backog() is called in atomic context. And if backlog exceeds its count, kfree_skb() is called. But in RT, kfree_skb() might gets scheduled out, so it expects non atomic context. 3)When CONFIG_PREEMPT_RT is not defined, migrate_enable(), migrate_disable() maps to preempt_enable() and preempt_disable(), so no change in functionality in case of non-RT. 
-Replace preempt_enable(), preempt_disable() with migrate_enable(), migrate_disable() respectively -Replace get_cpu(), put_cpu() with get_cpu_light(), put_cpu_light() respectively Signed-off-by: Priyanka Jain Acked-by: Rajan Srivastava Cc: Link: http://lkml.kernel.org/r/1337227511-2271-1-git-send-email-Priyanka.Jain@freescale.com Signed-off-by: Thomas Gleixner commit c19a1b9dff1aa63d381ded4ec015de21c6c7d91c Author: Thomas Gleixner Date: Tue Aug 21 20:38:50 2012 +0200 random: Make it work on rt Delegate the random insertion to the forced threaded interrupt handler. Store the return IP of the hard interrupt handler in the irq descriptor and feed it into the random generator as a source of entropy. Signed-off-by: Thomas Gleixner commit 4008be4fe48c3f9b95c5bd6f97dcd68e73461018 Author: Thomas Gleixner Date: Thu Dec 16 14:25:18 2010 +0100 x86: stackprotector: Avoid random pool on rt CPU bringup calls into the random pool to initialize the stack canary. During boot that works nicely even on RT as the might sleep checks are disabled. During CPU hotplug the might sleep checks trigger. Making the locks in random raw is a major PITA, so avoiding the call on RT is the only sensible solution. This is basically the same randomness which we get during boot where the random pool has no entropy and we rely on the TSC randomness. Reported-by: Carsten Emde Signed-off-by: Thomas Gleixner commit ee0842652e6a7ca5ec56060cc353fff0bbc4c765 Author: Thomas Gleixner Date: Tue Jul 14 14:26:34 2015 +0200 panic: skip get_random_bytes for RT_FULL in init_oops_id Disable on -RT. If this is invoked from irq-context we will have problems acquiring the sleeping lock. Signed-off-by: Thomas Gleixner commit 6a10e6f88786e74c6e2edc1197414f2a77703b22 Author: Sebastian Andrzej Siewior Date: Thu Jul 26 18:52:00 2018 +0200 crypto: cryptd - add a lock instead of preempt_disable/local_bh_disable cryptd has a per-CPU lock which is protected with local_bh_disable() and preempt_disable(). Add an explicit spin_lock to make the locking context more obvious and visible to lockdep. Since it is a per-CPU lock, there should be no lock contention on the actual spinlock. There is a small race window where we could be migrated to another CPU after the cpu_queue has been obtained. This is not a problem because the actual resource is protected by the spinlock. Signed-off-by: Sebastian Andrzej Siewior commit 66a8a808017a0c3e19a1709d6536e7a5b39f731c Author: Sebastian Andrzej Siewior Date: Thu Nov 30 13:40:10 2017 +0100 crypto: limit more FPU-enabled sections Those crypto drivers use SSE/AVX/… for their crypto work and in order to do so in kernel they need to enable the "FPU" in kernel mode which disables preemption. There are two problems with the way they are used: - the while loop which processes X bytes may create latency spikes and should be avoided or limited. - the cipher-walk-next part may allocate/free memory and may use kmap_atomic(). The whole kernel_fpu_begin()/end() processing probably isn't that cheap. It most likely makes sense to process as much of those as possible in one go. The new *_fpu_sched_rt() schedules only if a RT task is pending. We should probably measure the performance of those ciphers in pure SW mode and with these optimisations to see if it makes sense to keep them for RT. This kernel_fpu_resched() makes the code more preemptible, which might hurt performance.
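The "limit the FPU-enabled section" idea can be sketched as follows (the chunk size and the callback are assumptions, not the actual patch): split the work into bounded pieces and drop the FPU section, and with it the preempt-disabled region, between pieces.

    #include <asm/fpu/api.h>
    #include <linux/kernel.h>
    #include <linux/types.h>

    #define FPU_CHUNK (4 * 1024)    /* assumed upper bound per FPU section */

    static void crypt_in_chunks_sketch(u8 *buf, unsigned int nbytes,
                                       void (*do_crypt)(u8 *, unsigned int))
    {
            while (nbytes) {
                    unsigned int n = min_t(unsigned int, nbytes, FPU_CHUNK);

                    kernel_fpu_begin();     /* preemption off while the FPU is in use */
                    do_crypt(buf, n);
                    kernel_fpu_end();       /* preemptible again between chunks */

                    buf += n;
                    nbytes -= n;
            }
    }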
Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior commit a4416f11c4aa4d99ce6bbcdaca648f0a708e5778 Author: Sebastian Andrzej Siewior Date: Fri Feb 21 17:24:04 2014 +0100 crypto: Reduce preempt disabled regions, more algos Don Estabrook reported | kernel: WARNING: CPU: 2 PID: 858 at kernel/sched/core.c:2428 migrate_disable+0xed/0x100() | kernel: WARNING: CPU: 2 PID: 858 at kernel/sched/core.c:2462 migrate_enable+0x17b/0x200() | kernel: WARNING: CPU: 3 PID: 865 at kernel/sched/core.c:2428 migrate_disable+0xed/0x100() and his backtrace showed some crypto functions which looked fine. The problem is the following sequence: glue_xts_crypt_128bit() { blkcipher_walk_virt(); /* normal migrate_disable() */ glue_fpu_begin(); /* get atomic */ while (nbytes) { __glue_xts_crypt_128bit(); blkcipher_walk_done(); /* with nbytes = 0, migrate_enable() * while we are atomic */ }; glue_fpu_end() /* no longer atomic */ } and this is why the counters get out of sync and the warning is printed. The other problem is that we are non-preemptible between glue_fpu_begin() and glue_fpu_end() and the latency grows. To fix this, I shorten the FPU-off region and ensure blkcipher_walk_done() is called with preemption enabled. This might hurt the performance because we now enable/disable the FPU state more often, but we gain lower latency and the bug is gone. Reported-by: Don Estabrook Signed-off-by: Sebastian Andrzej Siewior commit 758105fb21d86e82d98deba9ff745b264e033c59 Author: Peter Zijlstra Date: Mon Nov 14 18:19:27 2011 +0100 x86: crypto: Reduce preempt disabled regions Restrict the preempt disabled regions to the actual floating point operations and enable preemption for the administrative actions. This is necessary on RT to avoid that kfree and other operations are called with preemption disabled. Reported-and-tested-by: Carsten Emde Signed-off-by: Peter Zijlstra Signed-off-by: Thomas Gleixner commit c1ecdc62c514c2d541490026c312ec614ebd35aa Author: Sebastian Andrzej Siewior Date: Tue Jun 23 15:32:51 2015 +0200 irqwork: push most work into softirq context Initially we deferred all irqwork into softirq because we didn't want the latency spikes if perf or another user was busy and delayed the RT task. The NOHZ trigger (nohz_full_kick_work) was the first user that did not work as expected if it did not run in the original irqwork context, so we had to bring it back somehow for it. push_irq_work_func is the second one that requires this. This patch adds IRQ_WORK_HARD_IRQ which makes sure the callback runs in raw-irq context. Everything else is deferred into softirq context. Without -RT we have the original behavior. This patch incorporates tglx's original work, reworked a little to bring back arch_irq_work_raise() where possible, and a few fixes from Steven Rostedt and Mike Galbraith. [bigeasy: melt tglx's irq_work_tick_soft() which splits irq_work_tick() into a hard and soft variant] Signed-off-by: Sebastian Andrzej Siewior commit 166c20bdc7b3d4e0c7f52874b3c3c9cfba258421 Author: Sebastian Andrzej Siewior Date: Wed Mar 30 13:36:29 2016 +0200 net: dev: always take qdisc's busylock in __dev_xmit_skb() The root-lock is dropped before dev_hard_start_xmit() is invoked and after setting the __QDISC___STATE_RUNNING bit. If this task is now pushed away by a task with a higher priority then the task with the higher priority won't be able to submit packets to the NIC directly; instead they will be enqueued into the Qdisc.
The NIC will remain idle until the task(s) with higher priority leave the CPU and the task with lower priority gets back and finishes the job. If we take always the busylock we ensure that the RT task can boost the low-prio task and submit the packet. Signed-off-by: Sebastian Andrzej Siewior commit 231ff163edfcb777505c0b6c954611151febe0c9 Author: Sebastian Andrzej Siewior Date: Wed Sep 16 16:15:39 2020 +0200 net: Dequeue in dev_cpu_dead() without the lock Upstream uses skb_dequeue() to acquire lock of `input_pkt_queue'. The reason is to synchronize against a remote CPU which still thinks that the CPU is online enqueues packets to this CPU. There are no guarantees that the packet is enqueued before the callback is run, it just hope. RT however complains about an not initialized lock because it uses another lock for `input_pkt_queue' due to the IRQ-off nature of the context. Use the unlocked dequeue version for `input_pkt_queue'. Signed-off-by: Sebastian Andrzej Siewior commit ee362de4b2677bfcd6f9f6ad262cdc2ee20f0ef2 Author: Thomas Gleixner Date: Tue Jul 12 15:38:34 2011 +0200 net: Use skbufhead with raw lock Use the rps lock as rawlock so we can keep irq-off regions. It looks low latency. However we can't kfree() from this context therefore we defer this to the softirq and use the tofree_queue list for it (similar to process_queue). Signed-off-by: Thomas Gleixner commit 089ffa737d8ca260bdbde79705ccbb938e89b2f2 Author: Thomas Gleixner Date: Sun Jul 17 21:41:35 2011 +0200 debugobjects: Make RT aware Avoid filling the pool / allocating memory with irqs off(). Signed-off-by: Thomas Gleixner commit 0eb766ad786f1123de13f888d70acfc9fdc4266f Author: Thomas Gleixner Date: Wed Mar 7 21:00:34 2012 +0100 fs: namespace: Use cpu_chill() in trylock loops Retry loops on RT might loop forever when the modifying side was preempted. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner commit ca2a97742c368981d127fcaf7699756da6233d97 Author: Thomas Gleixner Date: Wed Mar 7 20:51:03 2012 +0100 rt: Introduce cpu_chill() Retry loops on RT might loop forever when the modifying side was preempted. Add cpu_chill() to replace cpu_relax(). cpu_chill() defaults to cpu_relax() for non RT. On RT it puts the looping task to sleep for a tick so the preempted task can make progress. Steven Rostedt changed it to use a hrtimer instead of msleep(): | |Ulrich Obergfell pointed out that cpu_chill() calls msleep() which is woken |up by the ksoftirqd running the TIMER softirq. But as the cpu_chill() is |called from softirq context, it may block the ksoftirqd() from running, in |which case, it may never wake up the msleep() causing the deadlock. + bigeasy later changed to schedule_hrtimeout() |If a task calls cpu_chill() and gets woken up by a regular or spurious |wakeup and has a signal pending, then it exits the sleep loop in |do_nanosleep() and sets up the restart block. If restart->nanosleep.type is |not TI_NONE then this results in accessing a stale user pointer from a |previously interrupted syscall and a copy to user based on the stale |pointer or a BUG() when 'type' is not supported in nanosleep_copyout(). + bigeasy: add PF_NOFREEZE: | [....] Waiting for /dev to be fully populated... | ===================================== | [ BUG: udevd/229 still has locks held! 
] | 3.12.11-rt17 #23 Not tainted | ------------------------------------- | 1 lock held by udevd/229: | #0: (&type->i_mutex_dir_key#2){+.+.+.}, at: lookup_slow+0x28/0x98 | | stack backtrace: | CPU: 0 PID: 229 Comm: udevd Not tainted 3.12.11-rt17 #23 | (unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14) | (show_stack+0x10/0x14) from (dump_stack+0x74/0xbc) | (dump_stack+0x74/0xbc) from (do_nanosleep+0x120/0x160) | (do_nanosleep+0x120/0x160) from (hrtimer_nanosleep+0x90/0x110) | (hrtimer_nanosleep+0x90/0x110) from (cpu_chill+0x30/0x38) | (cpu_chill+0x30/0x38) from (dentry_kill+0x158/0x1ec) | (dentry_kill+0x158/0x1ec) from (dput+0x74/0x15c) | (dput+0x74/0x15c) from (lookup_real+0x4c/0x50) | (lookup_real+0x4c/0x50) from (__lookup_hash+0x34/0x44) | (__lookup_hash+0x34/0x44) from (lookup_slow+0x38/0x98) | (lookup_slow+0x38/0x98) from (path_lookupat+0x208/0x7fc) | (path_lookupat+0x208/0x7fc) from (filename_lookup+0x20/0x60) | (filename_lookup+0x20/0x60) from (user_path_at_empty+0x50/0x7c) | (user_path_at_empty+0x50/0x7c) from (user_path_at+0x14/0x1c) | (user_path_at+0x14/0x1c) from (vfs_fstatat+0x48/0x94) | (vfs_fstatat+0x48/0x94) from (SyS_stat64+0x14/0x30) | (SyS_stat64+0x14/0x30) from (ret_fast_syscall+0x0/0x48) Signed-off-by: Thomas Gleixner Signed-off-by: Steven Rostedt Signed-off-by: Sebastian Andrzej Siewior commit 5e31b00747cde2d68ba33e4a75672c44ac0940f5 Author: Mike Galbraith Date: Wed Feb 18 16:05:28 2015 +0100 sunrpc: Make svc_xprt_do_enqueue() use get_cpu_light() |BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:915 |in_atomic(): 1, irqs_disabled(): 0, pid: 3194, name: rpc.nfsd |Preemption disabled at:[] svc_xprt_received+0x4b/0xc0 [sunrpc] |CPU: 6 PID: 3194 Comm: rpc.nfsd Not tainted 3.18.7-rt1 #9 |Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.404 11/06/2014 | ffff880409630000 ffff8800d9a33c78 ffffffff815bdeb5 0000000000000002 | 0000000000000000 ffff8800d9a33c98 ffffffff81073c86 ffff880408dd6008 | ffff880408dd6000 ffff8800d9a33cb8 ffffffff815c3d84 ffff88040b3ac000 |Call Trace: | [] dump_stack+0x4f/0x9e | [] __might_sleep+0xe6/0x150 | [] rt_spin_lock+0x24/0x50 | [] svc_xprt_do_enqueue+0x80/0x230 [sunrpc] | [] svc_xprt_received+0x4b/0xc0 [sunrpc] | [] svc_add_new_perm_xprt+0x6d/0x80 [sunrpc] | [] svc_addsock+0x143/0x200 [sunrpc] | [] write_ports+0x28c/0x340 [nfsd] | [] nfsctl_transaction_write+0x4c/0x80 [nfsd] | [] vfs_write+0xb3/0x1d0 | [] SyS_write+0x49/0xb0 | [] system_call_fastpath+0x16/0x1b Signed-off-by: Mike Galbraith Signed-off-by: Sebastian Andrzej Siewior commit c2f899841d5d8e11bf706c2b51e4b63336d5a987 Author: Thomas Gleixner Date: Sat Nov 12 14:00:48 2011 +0100 scsi/fcoe: Make RT aware. Do not disable preemption while taking sleeping locks. All user look safe for migrate_diable() only. Signed-off-by: Thomas Gleixner commit 7c691ccfc4b5aeeaf572db1258e78b406d3836c3 Author: Thomas Gleixner Date: Tue Apr 6 16:51:31 2010 +0200 md: raid5: Make raid5_percpu handling RT aware __raid_run_ops() disables preemption with get_cpu() around the access to the raid5_percpu variables. That causes scheduling while atomic spews on RT. Serialize the access to the percpu data with a lock and keep the code preemptible. 
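The pattern applied here, serializing per-CPU scratch data with a lock instead of relying on get_cpu()'s preempt_disable(), looks roughly like the following sketch (field names are illustrative; get_cpu_light()/put_cpu_light() are the -rt helpers introduced elsewhere in this series, and the lock is assumed to be initialised at setup time):

    #include <linux/percpu.h>
    #include <linux/spinlock.h>

    struct percpu_scratch_sketch {
            spinlock_t lock;        /* sleeping lock on RT: section stays preemptible */
            void *buffer;
    };

    static DEFINE_PER_CPU(struct percpu_scratch_sketch, scratch_sketch);

    static void run_ops_sketch(void)
    {
            int cpu = get_cpu_light();      /* migrate_disable() on RT, get_cpu() otherwise */
            struct percpu_scratch_sketch *s = per_cpu_ptr(&scratch_sketch, cpu);

            spin_lock(&s->lock);
            /* ... use s->buffer ... */
            spin_unlock(&s->lock);

            put_cpu_light();
    }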
Reported-by: Udo van den Heuvel Signed-off-by: Thomas Gleixner Tested-by: Udo van den Heuvel commit 568d1d191e94631dea3e67fa8358f45cb72e113a Author: Sebastian Andrzej Siewior Date: Tue Jul 14 14:26:34 2015 +0200 block/mq: do not invoke preempt_disable() preempt_disable() and get_cpu() don't play well together with the sleeping locks it tries to allocate later. It seems to be enough to replace it with get_cpu_light() and migrate_disable(). Signed-off-by: Sebastian Andrzej Siewior commit b49a329a92dabf38f35228c4eead2dc478b12063 Author: Thomas Gleixner Date: Tue Jul 12 11:39:36 2011 +0200 mm/vmalloc: Another preempt disable region which sucks Avoid the preempt disable version of get_cpu_var(). The inner-lock should provide enough serialisation. Signed-off-by: Thomas Gleixner commit d8c5a7d75e08b849bc202c6664b0a269b11a3368 Author: Scott Wood Date: Wed Sep 11 17:57:29 2019 +0100 rcutorture: Avoid problematic critical section nesting on RT rcutorture was generating some nesting scenarios that are not reasonable. Constrain the state selection to avoid them. Example #1: 1. preempt_disable() 2. local_bh_disable() 3. preempt_enable() 4. local_bh_enable() On PREEMPT_RT, BH disabling takes a local lock only when called in non-atomic context. Thus, atomic context must be retained until after BH is re-enabled. Likewise, if BH is initially disabled in non-atomic context, it cannot be re-enabled in atomic context. Example #2: 1. rcu_read_lock() 2. local_irq_disable() 3. rcu_read_unlock() 4. local_irq_enable() If the thread is preempted between steps 1 and 2, rcu_read_unlock_special.b.blocked will be set, but it won't be acted on in step 3 because IRQs are disabled. Thus, reporting of the quiescent state will be delayed beyond the local_irq_enable(). For now, these scenarios will continue to be tested on non-PREEMPT_RT kernels, until debug checks are added to ensure that they are not happening elsewhere. Signed-off-by: Scott Wood Signed-off-by: Sebastian Andrzej Siewior commit e0b671bca2e747b0eccc1e03c559041d14816b4e Author: Julia Cartwright Date: Wed Oct 12 11:21:14 2016 -0500 rcu: enable rcu_normal_after_boot by default for RT The forcing of an expedited grace period is an expensive and very RT-application unfriendly operation, as it forcibly preempts all running tasks on CPUs which are preventing the gp from expiring. By default, as a policy decision, disable the expediting of grace periods (after boot) on configurations which enable PREEMPT_RT. Suggested-by: Luiz Capitulino Acked-by: Paul E. McKenney Signed-off-by: Julia Cartwright Signed-off-by: Sebastian Andrzej Siewior commit 5ffd75a968287d678868e29da70ba59a8ffc90be Author: Scott Wood Date: Wed Sep 11 17:57:28 2019 +0100 rcu: Use rcuc threads on PREEMPT_RT as we did While switching to the reworked RCU-thread code, it has been forgotten to enable the thread processing on -RT. Besides restoring behavior that used to be default on RT, this avoids a deadlock on scheduler locks. Signed-off-by: Scott Wood Signed-off-by: Sebastian Andrzej Siewior commit 5347e6783a35082e7d1d11d7254cbdfced2b4590 Author: Sebastian Andrzej Siewior Date: Tue Nov 19 09:25:04 2019 +0100 locking: Make spinlock_t and rwlock_t a RCU section on RT On !RT a locked spinlock_t and rwlock_t disables preemption which implies a RCU read section. There is code that relies on that behaviour. Add an explicit RCU read section on RT while a sleeping lock (a lock which would disables preemption on !RT) acquired. 
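A minimal sketch of that change (conceptual only, not the actual rtmutex code): the RT lock/unlock paths additionally enter and leave an RCU read-side section, so "spinlock held implies RCU read section" keeps holding on RT.

    #include <linux/rcupdate.h>
    #include <linux/spinlock.h>

    static void rt_spin_lock_sketch(spinlock_t *lock)
    {
            rcu_read_lock();        /* preserve the implicit RCU read section of !RT */
            /* ... acquire the underlying rtmutex-based lock ... */
    }

    static void rt_spin_unlock_sketch(spinlock_t *lock)
    {
            /* ... release the underlying rtmutex-based lock ... */
            rcu_read_unlock();
    }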
Signed-off-by: Sebastian Andrzej Siewior commit 6a4adaead978cf7f90bd67baaf8d54f00b187c0d Author: Sebastian Andrzej Siewior Date: Fri Aug 4 17:40:42 2017 +0200 locking: don't check for __LINUX_SPINLOCK_TYPES_H on -RT archs Upstream uses arch_spinlock_t within spinlock_t and requests that spinlock_types.h header file is included first. On -RT we have the rt_mutex with its raw_lock wait_lock which needs architectures' spinlock_types.h header file for its definition. However we need rt_mutex first because it is used to build the spinlock_t so that check does not work for us. Therefore I am dropping that check. Signed-off-by: Sebastian Andrzej Siewior commit db4ac702aad7bdd6fda8f09b1f6b0b078bdbce9b Author: Thomas Gleixner Date: Sun Jul 17 21:56:42 2011 +0200 trace: Add migrate-disabled counter to tracing output Signed-off-by: Thomas Gleixner commit 2a0f9307655268267c7b7419e15be53208bfd996 Author: Sebastian Andrzej Siewior Date: Sat May 27 19:02:06 2017 +0200 kernel/sched: add {put|get}_cpu_light() Signed-off-by: Sebastian Andrzej Siewior commit 6362e6476cd7b1df9339acdf75a3e5972ebdc768 Author: Sebastian Andrzej Siewior Date: Thu Aug 29 18:21:04 2013 +0200 ptrace: fix ptrace vs tasklist_lock race As explained by Alexander Fyodorov : |read_lock(&tasklist_lock) in ptrace_stop() is converted to mutex on RT kernel, |and it can remove __TASK_TRACED from task->state (by moving it to |task->saved_state). If parent does wait() on child followed by a sys_ptrace |call, the following race can happen: | |- child sets __TASK_TRACED in ptrace_stop() |- parent does wait() which eventually calls wait_task_stopped() and returns | child's pid |- child blocks on read_lock(&tasklist_lock) in ptrace_stop() and moves | __TASK_TRACED flag to saved_state |- parent calls sys_ptrace, which calls ptrace_check_attach() and wait_task_inactive() The patch is based on his initial patch where an additional check is added in case the __TASK_TRACED moved to ->saved_state. The pi_lock is taken in case the caller is interrupted between looking into ->state and ->saved_state. Signed-off-by: Sebastian Andrzej Siewior commit b48058089457cef4bba7ff8afa5de582165a6547 Author: Grygorii Strashko Date: Tue Jul 21 19:43:56 2015 +0300 pid.h: include atomic.h This patch fixes build error: CC kernel/pid_namespace.o In file included from kernel/pid_namespace.c:11:0: include/linux/pid.h: In function 'get_pid': include/linux/pid.h:78:3: error: implicit declaration of function 'atomic_inc' [-Werror=implicit-function-declaration] atomic_inc(&pid->count); ^ which happens when CONFIG_PROVE_LOCKING=n CONFIG_DEBUG_SPINLOCK=n CONFIG_DEBUG_MUTEXES=n CONFIG_DEBUG_LOCK_ALLOC=n CONFIG_PID_NS=y Vanilla gets this via spinlock.h. Signed-off-by: Grygorii Strashko Signed-off-by: Sebastian Andrzej Siewior commit c9f685e7b1d683aa6e4033119dbcd828c5de8928 Author: Sebastian Andrzej Siewior Date: Fri Jun 16 19:03:16 2017 +0200 net/core: use local_bh_disable() in netif_rx_ni() In 2004 netif_rx_ni() gained a preempt_disable() section around netif_rx() and its do_softirq() + testing for it. The do_softirq() part is required because netif_rx() raises the softirq but does not invoke it. The preempt_disable() is required to remain on the same CPU which added the skb to the per-CPU list. All this can be avoided be putting this into a local_bh_disable()ed section. The local_bh_enable() part will invoke do_softirq() if required. 
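The resulting structure can be sketched as follows (simplified, not the exact upstream function):

    #include <linux/bottom_half.h>
    #include <linux/netdevice.h>

    static int netif_rx_ni_sketch(struct sk_buff *skb)
    {
            int err;

            local_bh_disable();     /* stay on this CPU's backlog, BHs off */
            err = netif_rx(skb);
            local_bh_enable();      /* runs the raised softirq if needed */

            return err;
    }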
Signed-off-by: Sebastian Andrzej Siewior commit 19739343064e159e58daf4fc90f48c4370d979ca Author: Thomas Gleixner Date: Mon Jul 18 13:59:17 2011 +0200 softirq: Disable softirq stacks for RT Disable extra stacks for softirqs. We want to preempt softirqs and having them on special IRQ-stacks does not make this easier. Signed-off-by: Thomas Gleixner commit 84ee9b819f705f9eddff95b6a93bba626346675a Author: Thomas Gleixner Date: Sun Nov 13 17:17:09 2011 +0100 softirq: Check preemption after reenabling interrupts raise_softirq_irqoff() disables interrupts and wakes the softirq daemon, but after reenabling interrupts there is no preemption check, so the execution of the softirq thread might be delayed arbitrarily. In principle we could add that check to local_irq_enable/restore, but that's overkill as the raise_softirq_irqoff() sections are the only ones which show this behaviour. Reported-by: Carsten Emde Signed-off-by: Thomas Gleixner commit 9e466d9f3ec426eedc8921f34918e004fda3af80 Author: Thomas Gleixner Date: Tue Sep 13 16:42:35 2011 +0200 sched: Disable TTWU_QUEUE on RT The queued remote wakeup mechanism can introduce rather large latencies if the number of migrated tasks is high. Disable it for RT. Signed-off-by: Thomas Gleixner commit f3541b467fbb5e16c34cc637c3444018ebb05163 Author: Thomas Gleixner Date: Tue Jun 7 09:19:06 2011 +0200 sched: Do not account rcu_preempt_depth on RT in might_sleep() RT changes the rcu_preempt_depth semantics, so we cannot check for it in might_sleep(). Signed-off-by: Thomas Gleixner commit 16290122c38ce6cdda3f5aa27267ab789e150904 Author: Sebastian Andrzej Siewior Date: Mon Nov 21 19:31:08 2016 +0100 kernel/sched: move stack + kprobe clean up to __put_task_struct() There is no need to free the stack before the task struct (except for reasons mentioned in commit 68f24b08ee89 ("sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK")). This also comes in handy on -RT because we can't free memory in a preempt-disabled region. vfree_atomic() delays the memory cleanup to a worker. Since we move everything to the RCU callback, we can also free it immediately. Cc: stable-rt@vger.kernel.org #for kprobe_flush_task() Signed-off-by: Sebastian Andrzej Siewior commit e5adc6c5e5fc78974c14be55ce1fffe575028b0c Author: Thomas Gleixner Date: Mon Jun 6 12:20:33 2011 +0200 sched: Move mmdrop to RCU on RT Takes sleeping locks and calls into the memory allocator, so this is nothing we want to do in task switch and other atomic contexts. Signed-off-by: Thomas Gleixner commit cd1935646103aed0c4b42f23159401f3eb9d38da Author: Thomas Gleixner Date: Mon Jun 6 12:12:51 2011 +0200 sched: Limit the number of task migrations per batch Put an upper limit on the number of tasks which are migrated per batch to avoid large latencies. Signed-off-by: Thomas Gleixner commit 92b28e2ac6e5cea219f2a8be402dd1e7378f84b2 Author: Sebastian Andrzej Siewior Date: Fri Aug 9 15:25:21 2019 +0200 hrtimer: Allow raw wakeups during boot There are a few wake-up timers during early boot which are essential for the system to make progress. At this stage no softirq threads have been spawned for softirq processing, so there is no timer processing in softirq. The wakeup in question: smpboot_create_thread() -> kthread_create_on_cpu() -> kthread_bind() -> wait_task_inactive() -> schedule_hrtimeout() Let the timer fire in hardirq context during the system boot.
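The idea can be sketched like this (the check is an assumption; the real patch keys off the timer's expiry mode and softirq availability): as long as softirq processing is not up, expire such wakeup timers directly in hard interrupt context.

    #include <linux/hrtimer.h>
    #include <linux/kernel.h>

    static enum hrtimer_mode boot_wakeup_mode_sketch(void)
    {
            /* Assumption: before the scheduler/softirq threads are up, go hard. */
            if (system_state < SYSTEM_SCHEDULING)
                    return HRTIMER_MODE_REL_HARD;   /* expire in hardirq context */

            return HRTIMER_MODE_REL;                /* normal (softirq on RT) expiry */
    }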
Signed-off-by: Sebastian Andrzej Siewior commit a35f061c552ee5d142c90b5a88287c835c7bffeb Author: Sebastian Andrzej Siewior Date: Mon Oct 28 12:19:57 2013 +0100 wait.h: include atomic.h | CC init/main.o |In file included from include/linux/mmzone.h:9:0, | from include/linux/gfp.h:4, | from include/linux/kmod.h:22, | from include/linux/module.h:13, | from init/main.c:15: |include/linux/wait.h: In function ‘wait_on_atomic_t’: |include/linux/wait.h:982:2: error: implicit declaration of function ‘atomic_read’ [-Werror=implicit-function-declaration] | if (atomic_read(val) == 0) | ^ This pops up on ARM. Non-RT gets its atomic.h include from spinlock.h Signed-off-by: Sebastian Andrzej Siewior commit 8a7b7e76d837a168447181c57c40093318022235 Author: Thomas Gleixner Date: Sun Nov 6 12:26:18 2011 +0100 x86: kvm Require const tsc for RT Non constant TSC is a nightmare on bare metal already, but with virtualization it becomes a complete disaster because the workarounds are horrible latency wise. That's also a preliminary for running RT in a guest on top of a RT host. Signed-off-by: Thomas Gleixner commit 920a943a6db9cc7b4e44ac7795befacf22b24788 Author: Luis Claudio R. Goncalves Date: Tue Jun 25 11:28:04 2019 -0300 mm/zswap: Use local lock to protect per-CPU data zwap uses per-CPU compression. The per-CPU data pointer is acquired with get_cpu_ptr() which implicitly disables preemption. It allocates memory inside the preempt disabled region which conflicts with the PREEMPT_RT semantics. Replace the implicit preemption control with an explicit local lock. This allows RT kernels to substitute it with a real per CPU lock, which serializes the access but keeps the code section preemptible. On non RT kernels this maps to preempt_disable() as before, i.e. no functional change. [bigeasy: Use local_lock(), additional hunks, patch description] Cc: Seth Jennings Cc: Dan Streetman Cc: Vitaly Wool Cc: Andrew Morton Cc: linux-mm@kvack.org Signed-off-by: Luis Claudio R. Goncalves Signed-off-by: Sebastian Andrzej Siewior commit c44b2b03492ee20b395b48a6581f945fa6546818 Author: Mike Galbraith Date: Tue Mar 22 11:16:09 2016 +0100 mm/zsmalloc: copy with get_cpu_var() and locking get_cpu_var() disables preemption and triggers a might_sleep() splat later. This is replaced with get_locked_var(). This bitspinlocks are replaced with a proper mutex which requires a slightly larger struct to allocate. Signed-off-by: Mike Galbraith [bigeasy: replace the bitspin_lock() with a mutex, get_locked_var(). Mike then fixed the size magic] Signed-off-by: Sebastian Andrzej Siewior commit 3ba5f29c1f619a4ca0193f1dead73c192aa4d772 Author: Sebastian Andrzej Siewior Date: Wed Jan 28 17:14:16 2015 +0100 mm/memcontrol: Replace local_irq_disable with local locks There are a few local_irq_disable() which then take sleeping locks. This patch converts them local locks. 
[bigeasy: Move unlock after memcg_check_events() in mem_cgroup_swapout(), pointed out by Matt Fleming ] Signed-off-by: Sebastian Andrzej Siewior commit da4793fc15dfa7d5651028b903fbcd84efc4cf27 Author: Yang Shi Date: Wed Oct 30 11:48:33 2013 -0700 mm/memcontrol: Don't call schedule_work_on in preemption disabled context The following trace is triggered when running ltp oom test cases: BUG: sleeping function called from invalid context at kernel/rtmutex.c:659 in_atomic(): 1, irqs_disabled(): 0, pid: 17188, name: oom03 Preemption disabled at:[] mem_cgroup_reclaim+0x90/0xe0 CPU: 2 PID: 17188 Comm: oom03 Not tainted 3.10.10-rt3 #2 Hardware name: Intel Corporation Calpella platform/MATXM-CORE-411-B, BIOS 4.6.3 08/18/2010 ffff88007684d730 ffff880070df9b58 ffffffff8169918d ffff880070df9b70 ffffffff8106db31 ffff88007688b4a0 ffff880070df9b88 ffffffff8169d9c0 ffff88007688b4a0 ffff880070df9bc8 ffffffff81059da1 0000000170df9bb0 Call Trace: [] dump_stack+0x19/0x1b [] __might_sleep+0xf1/0x170 [] rt_spin_lock+0x20/0x50 [] queue_work_on+0x61/0x100 [] drain_all_stock+0xe1/0x1c0 [] mem_cgroup_reclaim+0x90/0xe0 [] __mem_cgroup_try_charge+0x41a/0xc40 [] ? release_pages+0x1b1/0x1f0 [] ? sched_exec+0x40/0xb0 [] mem_cgroup_charge_common+0x37/0x70 [] mem_cgroup_newpage_charge+0x26/0x30 [] handle_pte_fault+0x618/0x840 [] ? unpin_current_cpu+0x16/0x70 [] ? migrate_enable+0xd4/0x200 [] handle_mm_fault+0x145/0x1e0 [] __do_page_fault+0x1a1/0x4c0 [] ? preempt_schedule_irq+0x4b/0x70 [] ? retint_kernel+0x37/0x40 [] do_page_fault+0xe/0x10 [] page_fault+0x22/0x30 So, to prevent schedule_work_on from being called in preempt disabled context, replace the pair of get/put_cpu() to get/put_cpu_light(). Signed-off-by: Yang Shi Signed-off-by: Sebastian Andrzej Siewior commit 687efb4b848500c8b49b88c058c0fcfb4dc1b2f2 Author: Sebastian Andrzej Siewior Date: Tue Aug 18 10:30:00 2020 +0200 mm: memcontrol: Provide a local_lock for per-CPU memcg_stock The interrupts are disabled to ensure CPU-local access to the per-CPU variable `memcg_stock'. As the code inside the interrupt disabled section acquires regular spinlocks, which are converted to 'sleeping' spinlocks on a PREEMPT_RT kernel, this conflicts with the RT semantics. Convert it to a local_lock which allows RT kernels to substitute them with a real per CPU lock. On non RT kernels this maps to local_irq_save() as before, but provides also lockdep coverage of the critical region. No functional change. 
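The conversion follows the usual local_lock pattern; a minimal sketch with made-up names (not the actual memcg code):

    #include <linux/local_lock.h>
    #include <linux/percpu.h>

    struct stock_sketch {
            local_lock_t lock;
            unsigned int cached_pages;
    };

    static DEFINE_PER_CPU(struct stock_sketch, stock_sketch) = {
            .lock = INIT_LOCAL_LOCK(lock),
    };

    static bool consume_stock_sketch(unsigned int nr_pages)
    {
            struct stock_sketch *stock;
            unsigned long flags;
            bool ret = false;

            /* !RT: local_irq_save(); RT: per-CPU spinlock, region stays preemptible */
            local_lock_irqsave(&stock_sketch.lock, flags);
            stock = this_cpu_ptr(&stock_sketch);
            if (stock->cached_pages >= nr_pages) {
                    stock->cached_pages -= nr_pages;
                    ret = true;
            }
            local_unlock_irqrestore(&stock_sketch.lock, flags);

            return ret;
    }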
Signed-off-by: Sebastian Andrzej Siewior commit 9e95bf2dbc460cc24e8cdfe9948e1552e2400669 Author: Sebastian Andrzej Siewior Date: Wed Apr 15 19:00:47 2015 +0200 slub: Disable SLUB_CPU_PARTIAL |BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:915 |in_atomic(): 1, irqs_disabled(): 0, pid: 87, name: rcuop/7 |1 lock held by rcuop/7/87: | #0: (rcu_callback){......}, at: [] rcu_nocb_kthread+0x1ca/0x5d0 |Preemption disabled at:[] put_cpu_partial+0x29/0x220 | |CPU: 0 PID: 87 Comm: rcuop/7 Tainted: G W 4.0.0-rt0+ #477 |Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 | 000000000007a9fc ffff88013987baf8 ffffffff817441c7 0000000000000007 | 0000000000000000 ffff88013987bb18 ffffffff810eee51 0000000000000000 | ffff88013fc10200 ffff88013987bb48 ffffffff8174a1c4 000000000007a9fc |Call Trace: | [] dump_stack+0x4f/0x90 | [] ___might_sleep+0x121/0x1b0 | [] rt_spin_lock+0x24/0x60 | [] __free_pages_ok+0xaa/0x540 | [] __free_pages+0x1d/0x30 | [] __free_slab+0xc5/0x1e0 | [] free_delayed+0x56/0x70 | [] put_cpu_partial+0x14d/0x220 | [] __slab_free+0x158/0x2c0 | [] kmem_cache_free+0x221/0x2d0 | [] file_free_rcu+0x2c/0x40 | [] rcu_nocb_kthread+0x243/0x5d0 | [] kthread+0xfc/0x120 | [] ret_from_fork+0x58/0x90 Signed-off-by: Sebastian Andrzej Siewior commit 58d1e2cfe79ea14bff043a3cca6300c185a1be45 Author: Thomas Gleixner Date: Wed Jan 9 12:08:15 2013 +0100 slub: Enable irqs for __GFP_WAIT SYSTEM_RUNNING might be too late for enabling interrupts. Allocations with GFP_WAIT can happen before that. So use this as an indicator. [bigeasy: Add warning on RT for allocations in atomic context. Don't enable interrupts on allocations during SYSTEM_SUSPEND. This is done during suspend by ACPI, noticed by Liwei Song ] Signed-off-by: Thomas Gleixner commit 974425851bc00a528f6d8f742ec15fe9de406f25 Author: Sebastian Andrzej Siewior Date: Thu Jul 16 18:47:50 2020 +0200 mm/slub: Make object_map_lock a raw_spinlock_t The variable object_map is protected by object_map_lock. The lock is always acquired in debug code and within already atomic context Make object_map_lock a raw_spinlock_t. Signed-off-by: Sebastian Andrzej Siewior commit 0ed149458dff29dad90bfc97b2aaa3dab6e110e5 Author: Ingo Molnar Date: Fri Jul 3 08:29:37 2009 -0500 mm: page_alloc: rt-friendly per-cpu pages rt-friendly per-cpu pages: convert the irqs-off per-cpu locking method into a preemptible, explicit-per-cpu-locks method. Contains fixes from: Peter Zijlstra Thomas Gleixner Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner commit fc30ec071e281354debba3f7bb06a6b33d4d97cf Author: Sebastian Andrzej Siewior Date: Thu Jul 2 14:27:23 2020 +0200 mm/page_alloc: Use migrate_disable() in drain_local_pages_wq() drain_local_pages_wq() disables preemption to avoid CPU migration during CPU hotplug. Using migrate_disable() makes the function preemptible on PREEMPT_RT but still avoids CPU migrations during CPU-hotplug. On !PREEMPT_RT it behaves like preempt_disable(). Use migrate_disable() in drain_local_pages_wq(). Signed-off-by: Sebastian Andrzej Siewior commit 8bbd4557ab56972e23cc651df944b1b1a89bd490 Author: Kevin Hao Date: Mon May 4 11:34:07 2020 +0800 mm: slub: Always flush the delayed empty slubs in flush_all() After commit f0b231101c94 ("mm/SLUB: delay giving back empty slubs to IRQ enabled regions"), when the free_slab() is invoked with the IRQ disabled, the empty slubs are moved to a per-CPU list and will be freed after IRQ enabled later. 
But in the current code there is a check to see whether there really is a CPU slab on a specific CPU before flushing the delayed empty slubs; this may cause a reference to an already released kmem_cache in a scenario like below: cpu 0 cpu 1 kmem_cache_destroy() flush_all() --->IPI flush_cpu_slab() flush_slab() deactivate_slab() discard_slab() free_slab() c->page = NULL; for_each_online_cpu(cpu) if (!has_cpu_slab(1, s)) continue this skips flushing the delayed empty slub released by cpu1 kmem_cache_free(kmem_cache, s) kmalloc() __slab_alloc() free_delayed() __free_slab() reference to released kmem_cache Fixes: f0b231101c94 ("mm/SLUB: delay giving back empty slubs to IRQ enabled regions") Signed-off-by: Kevin Hao Signed-off-by: Sebastian Andrzej Siewior Cc: stable-rt@vger.kernel.org commit 44dbf26e91923ef20b61f64254e19a060d1fb220 Author: Thomas Gleixner Date: Thu Jun 21 17:29:19 2018 +0200 mm/SLUB: delay giving back empty slubs to IRQ enabled regions __free_slab() is invoked with disabled interrupts which increases the irq-off time while __free_pages() is doing the work. Allow __free_slab() to be invoked with enabled interrupts and move everything from interrupts-off invocations to a temporary per-CPU list so it can be processed later. Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 88793f1a041df3435adfb21048b24ba8f441df09 Author: Thomas Gleixner Date: Mon May 28 15:24:22 2018 +0200 mm/SLxB: change list_lock to raw_spinlock_t The list_lock is used with IRQs off on RT. Make it a raw_spinlock_t, otherwise the interrupts won't be disabled on -RT. The locking rules remain the same on !RT. This patch changes it for SLAB and SLUB since both share the same header file for the struct kmem_cache_node definition. Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 2f47b7b834d074a1dd73ae0392d001c3f46c0109 Author: Peter Zijlstra Date: Mon May 28 15:24:21 2018 +0200 Split IRQ-off and zone->lock while freeing pages from PCP list #2 Split the IRQ-off section while accessing the PCP list from zone->lock while freeing pages. Introduce isolate_pcp_pages() which separates the pages from the PCP list onto a temporary list and then free the temporary list via free_pcppages_bulk(). Signed-off-by: Peter Zijlstra Signed-off-by: Sebastian Andrzej Siewior commit c2ced7210b37bd450643ac49cb61a4d0433d75be Author: Peter Zijlstra Date: Mon May 28 15:24:20 2018 +0200 Split IRQ-off and zone->lock while freeing pages from PCP list #1 Split the IRQ-off section while accessing the PCP list from zone->lock while freeing pages. Introduce isolate_pcp_pages() which separates the pages from the PCP list onto a temporary list and then free the temporary list via free_pcppages_bulk(). Signed-off-by: Peter Zijlstra Signed-off-by: Sebastian Andrzej Siewior commit 6815f2ea136f388d2fce4e498d07faf43967546a Author: Oleg Nesterov Date: Tue Jul 14 14:26:34 2015 +0200 signal/x86: Delay calling signals in atomic On x86_64 we must disable preemption before we enable interrupts for stack faults, int3 and debugging, because the current task is using a per-CPU debug stack defined by the IST. If we schedule out, another task can come in and use the same stack and cause the stack to be corrupted and crash the kernel on return. When CONFIG_PREEMPT_RT is enabled, spin_locks become mutexes, and one of these is the spin lock used in signal handling. Some of the debug code (int3) causes do_trap() to send a signal.
This function calls a spin lock that has been converted to a mutex and has the possibility to sleep. If this happens, the above issues with the corrupted stack are possible. Instead of calling the signal right away, for PREEMPT_RT and x86_64, the signal information is stored in the task_struct and TIF_NOTIFY_RESUME is set. Then on exit of the trap, the signal resume code will send the signal when preemption is enabled. [ rostedt: Switched from #ifdef CONFIG_PREEMPT_RT to ARCH_RT_DELAYS_SIGNAL_SEND and added comments to the code. ] Signed-off-by: Oleg Nesterov Signed-off-by: Steven Rostedt Signed-off-by: Thomas Gleixner [bigeasy: also needed on 32bit as per Yang Shi ] Signed-off-by: Sebastian Andrzej Siewior commit 0520f2d79f9c1dad8f1549e32e042d8f94ca19c1 Author: Thomas Gleixner Date: Mon Jun 20 09:03:47 2011 +0200 rt: Add local irq locks Introduce locallock. For !RT this maps to preempt_disable()/ local_irq_disable() so there is not much that changes. For RT this will map to a spinlock. This makes preemption possible and the locked "resource" gets the lockdep annotation it wouldn't have otherwise. The locks are recursive for owner == current. Also, all locks use migrate_disable() which ensures that the task is not migrated to another CPU while the lock is held and the owner is preempted. Signed-off-by: Thomas Gleixner commit 2bb15c0c19c7eba226aaadc23b47d891a736f9d6 Author: Sebastian Andrzej Siewior Date: Thu Jul 26 15:06:10 2018 +0200 efi: Allow efi=runtime In case the command line option "efi=noruntime" is the default at build time, the user can override it with `efi=runtime' and allow runtime services again. Acked-by: Ard Biesheuvel Signed-off-by: Sebastian Andrzej Siewior commit 25596bb000c28ad09af9e04d024d14b8cfa7ba4d Author: Sebastian Andrzej Siewior Date: Thu Jul 26 15:03:16 2018 +0200 efi: Disable runtime services on RT Based on measurements the EFI functions get_variable / get_next_variable take up to 2us which looks okay. The functions get_time, set_time take around 10ms. Those 10ms are too much. Even one ms would be too much. Ard mentioned that SetVariable might even trigger larger latencies if the firmware erases flash blocks on NOR. The time functions are used by efi-rtc and can be triggered during runtime (either via explicit read/write or ntp sync). The variable write could be used by pstore. These functions can be disabled without much of a loss. The poweroff / reboot hooks may be provided by PSCI. Disable EFI's runtime wrappers. This was observed on "EFI v2.60 by SoftIron Overdrive 1000". Acked-by: Ard Biesheuvel Signed-off-by: Sebastian Andrzej Siewior commit ece18d281b978c6f10b1fef742252aa64fad53a7 Author: Sebastian Andrzej Siewior Date: Sat May 27 19:02:06 2017 +0200 net/core: disable NET_RX_BUSY_POLL on RT napi_busy_loop() disables preemption and performs a NAPI poll. We can't acquire sleeping locks with disabled preemption, so we would have to work around this and add explicit locking for synchronisation against ksoftirqd. Without explicit synchronisation a low-priority process would "own" the NAPI state (by setting NAPIF_STATE_SCHED) and could be scheduled out (no preempt_disable() and BH is preemptible on RT). In case a network packet arrives, the interrupt handler would set NAPIF_STATE_MISSED and the system would wait until the task owning the NAPI is scheduled in again. Should a task with RT priority busy poll, it would consume the CPU instead of allowing tasks with lower priority to run.
The NET_RX_BUSY_POLL is disabled by default (the system-wide sysctls for poll/read are set to zero), so disable NET_RX_BUSY_POLL on RT to avoid a wrong locking context on RT. Should this feature be considered useful on RT systems then it could be enabled again with proper locking and synchronisation. Signed-off-by: Sebastian Andrzej Siewior commit 7d375b2ac78e5ab399469f12137e6c91e1aa930c Author: Thomas Gleixner Date: Mon Jul 18 17:03:52 2011 +0200 sched: Disable CONFIG_RT_GROUP_SCHED on RT Carsten reported problems when running: taskset 01 chrt -f 1 sleep 1 from within rc.local on a F15 machine. The task stays running and never gets on the run queue because some of the run queues have rt_throttled=1 which does not go away. Works nice from a ssh login shell. Disabling CONFIG_RT_GROUP_SCHED solves that as well. Signed-off-by: Thomas Gleixner commit a163ef8687a1930559ee7c66d1d1c9c4f26bbfa2 Author: Sebastian Andrzej Siewior Date: Fri Mar 21 20:19:05 2014 +0100 rcu: make RCU_BOOST default on RT Since it is no longer invoked from the softirq, people run into OOM more often if the priority of the RCU thread is too low. Making boosting the default on RT should help in those cases and it can be switched off if someone knows better. Signed-off-by: Sebastian Andrzej Siewior commit 40fc20853f0a43009893723d0663eb585a3004b4 Author: Ingo Molnar Date: Fri Jul 3 08:44:03 2009 -0500 mm: Allow only SLUB on RT Memory allocation disables interrupts as part of the allocation and freeing process. For -RT it is important that this section remains short and doesn't depend on the size of the request or an internal state of the memory allocator. At the beginning the SLAB memory allocator was adopted for RT's needs and it required substantial changes. Later, with the addition of the SLUB memory allocator, we adopted this one as well and the changes were smaller. More importantly, due to the design of the SLUB allocator it performs better and its worst-case latency was smaller. In the end only SLUB remained supported. Disable SLAB and SLOB on -RT. Only SLUB is adapted to -RT needs. Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit aa9cfedbb8bcde0a6ef47fcbf758026e79a7d4e4 Author: Thomas Gleixner Date: Sun Jul 24 12:11:43 2011 +0200 kconfig: Disable config options which are not RT compatible Disable stuff which is known to have issues on RT. Signed-off-by: Thomas Gleixner commit 81ce24cd24a9ed540fd772d57f6e3ac5d3c80ae7 Author: Sebastian Andrzej Siewior Date: Tue Sep 8 16:57:11 2020 +0200 net: Properly annotate the try-lock for the seqlock In patch ("net/Qdisc: use a seqlock instead seqcount") the seqcount has been replaced with a seqlock to allow the reader to boost the preempted writer. The try_write_seqlock() acquired the lock with a try-lock but the seqcount annotation was "lock". Opencode write_seqcount_t_begin() and use the try-lock annotation for lockdep. Reported-by: Mike Galbraith Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior commit 70e9af7a74783f998c013b447e26abf76ef0118f Author: Sebastian Andrzej Siewior Date: Wed Sep 14 17:36:35 2016 +0200 net/Qdisc: use a seqlock instead seqcount The seqcount disables preemption on -RT while it is held, which we can't remove. Also we don't want the reader to spin for ages if the writer is scheduled out. The seqlock on the other hand will serialize / sleep on the lock while the writer is active.
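For reference, the seqlock_t usage pattern that replaces a plain seqcount looks like this generic sketch (not the Qdisc code itself): writers serialize on the embedded lock, readers retry if they raced with a writer.

    #include <linux/seqlock.h>
    #include <linux/types.h>

    static DEFINE_SEQLOCK(stats_lock);
    static u64 stats_bytes;

    static void stats_add_sketch(u64 bytes)
    {
            write_seqlock(&stats_lock);     /* writers serialize on a real lock */
            stats_bytes += bytes;
            write_sequnlock(&stats_lock);
    }

    static u64 stats_read_sketch(void)
    {
            unsigned int seq;
            u64 val;

            do {
                    seq = read_seqbegin(&stats_lock);
                    val = stats_bytes;
            } while (read_seqretry(&stats_lock, seq));

            return val;
    }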
Signed-off-by: Sebastian Andrzej Siewior commit 176bdfada1a5579df3193cc83dc591369397f939 Author: Sebastian Andrzej Siewior Date: Fri Oct 20 11:29:53 2017 +0200 fs/dcache: disable preemption on i_dir_seq's write side i_dir_seq is an open-coded seqcount. Based on the code it looks like we could have two writers in parallel despite the fact that the d_lock is held. The problem is that during the write process on RT the preemption is still enabled and if this process is interrupted by a reader with RT priority then we lock up. To avoid that lockup I am disabling preemption during the update. The rename of i_dir_seq is here to ensure we catch new write sides in the future. Cc: stable-rt@vger.kernel.org Reported-by: Oleg.Karfich@wago.com Signed-off-by: Sebastian Andrzej Siewior commit 622a49be7e18622870ee7cfafd682625f7beb6e9 Author: Sebastian Andrzej Siewior Date: Wed Sep 14 14:35:49 2016 +0200 fs/dcache: use swait_queue instead of waitqueue __d_lookup_done() invokes wake_up_all() while holding a hlist_bl_lock() which disables preemption. As a workaround convert it to swait. Signed-off-by: Sebastian Andrzej Siewior commit c395a9bb8cd278f2edb715ca29a41b4d10170d89 Author: Sebastian Andrzej Siewior Date: Mon Aug 17 12:28:10 2020 +0200 u64_stats: Disable preemption on 32bit-UP/SMP with RT during updates On RT the seqcount_t is required even on UP because the softirq can be preempted. The IRQ handler is threaded so it is also preemptible. Disable preemption on 32bit-RT during value updates. There is no need to disable interrupts on RT because the handler is run threaded. Therefore disabling preemption is enough to guarantee that the update is not interrupted. Signed-off-by: Sebastian Andrzej Siewior commit 672f02e28e7e2f0b1ba351e839f9b8aa8b67c618 Author: Ahmed S. Darwish Date: Wed Jun 10 12:53:22 2020 +0200 xfrm: Use sequence counter with associated spinlock A sequence counter write side critical section must be protected by some form of locking to serialize writers. A plain seqcount_t does not contain the information of which lock must be held when entering a write side critical section. Use the new seqcount_spinlock_t data type, which allows associating a spinlock with the sequence counter. This enables lockdep to verify that the spinlock used for writer serialization is held when the write side critical section is entered. If lockdep is disabled this lock association is compiled out and has neither storage size nor runtime overhead. Upstream-status: The xfrm locking used for seqcount writer serialization appears to be broken. If that's the case, a proper fix will need to be submitted upstream. (e.g. make the seqcount per network namespace?) Signed-off-by: Ahmed S. Darwish Signed-off-by: Sebastian Andrzej Siewior commit 74858f0d38a8d3c069a0745ff53ae084c8e7cabb Author: Sebastian Andrzej Siewior Date: Wed Oct 28 18:15:32 2020 +0100 mm/memcontrol: Disable preemption in __mod_memcg_lruvec_state() The callers expect disabled preemption/interrupts while invoking __mod_memcg_lruvec_state(). This works in mainline because a lock of some kind is acquired. Use preempt_disable_rt() where per-CPU variables are accessed and a stable pointer is expected. This is also done in __mod_zone_page_state() for the same reason. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior commit c4b17c1e25817bb99645a41f97b57b4294344a40 Author: Ingo Molnar Date: Fri Jul 3 08:30:13 2009 -0500 mm/vmstat: Protect per cpu variables with preempt disable on RT Disable preemption on -RT for the vmstat code.
On vanilla kernels the code runs in IRQ-off regions while on -RT it does not. "preempt_disable" ensures that the same resources are not updated in parallel due to preemption. Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner commit d9417089a1c641216a506ba0eeb7b8887cefc932 Author: Thomas Gleixner Date: Fri Jul 24 12:38:56 2009 +0200 preempt: Provide preempt_*_(no)rt variants RT needs a few preempt_disable/enable points which are not necessary otherwise. Implement variants to avoid #ifdeffery. Signed-off-by: Thomas Gleixner commit ba75980c4224a91f396a5e7af73f44fe6094d32d Author: Thomas Gleixner Date: Wed Sep 21 19:57:12 2011 +0200 signal: Revert ptrace preempt magic Upstream commit '53da1d9456fe7f8 fix ptrace slowness' is nothing more than a bandaid around the ptrace design trainwreck. It's not a correctness issue, it's merely a cosmetic bandaid. Signed-off-by: Thomas Gleixner commit 6ce7756440d30f54a1323ad7f767bd8405cd81a8 Author: Sebastian Andrzej Siewior Date: Tue Oct 6 13:07:17 2020 +0200 locking/rtmutex: Use custom scheduling function for spin-schedule() PREEMPT_RT builds the rwsem, mutex, spinlock and rwlock typed locks on top of an rtmutex lock. While blocked, task->pi_blocked_on is set (tsk_is_pi_blocked()) and the task needs to schedule away while waiting. The schedule process must distinguish between blocking on a regular sleeping lock (rwsem and mutex) and an RT-only sleeping lock (spinlock and rwlock): - rwsem and mutex must flush block requests (blk_schedule_flush_plug()) even if blocked on a lock. This cannot deadlock because this also happens for non-RT. There should be a warning if the scheduling point is within an RCU read section. - spinlock and rwlock must not flush block requests. This will deadlock if the callback attempts to acquire a lock which is already acquired. Similarly to being preempted, there should be no warning if the scheduling point is within an RCU read section. Add preempt_schedule_lock() which is invoked if scheduling is required while blocking on a PREEMPT_RT-only sleeping lock. Remove tsk_is_pi_blocked() from the scheduler path which is no longer needed with the additional scheduler entry point. Signed-off-by: Sebastian Andrzej Siewior commit 908e9fd01b2d011f2378d758005b4701bf398c9d Author: Sebastian Andrzej Siewior Date: Thu Oct 12 17:34:38 2017 +0200 locking/rtmutex: add ww_mutex addon for mutex-rt Signed-off-by: Sebastian Andrzej Siewior commit a7f50f593f4750e9ed8e8868a01c94f202045c8b Author: Thomas Gleixner Date: Thu Oct 12 17:31:14 2017 +0200 locking/rtmutex: wire up RT's locking Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 5d9f3e355ec8b3cf58151e53ba1511bfd2fbe029 Author: Thomas Gleixner Date: Thu Oct 12 17:18:06 2017 +0200 locking/rtmutex: add rwlock implementation based on rtmutex The implementation is bias-based, similar to the rwsem implementation. Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 36155acaad14aeec5f0b3b6e64a4fe9dd6d8b769 Author: Thomas Gleixner Date: Thu Oct 12 17:28:34 2017 +0200 locking/rtmutex: add rwsem implementation based on rtmutex The RT specific R/W semaphore implementation restricts the number of readers to one because a writer cannot block on multiple readers and inherit its priority or budget. The single reader restriction is painful in various ways: - Performance bottleneck for multi-threaded applications in the page fault path (mmap sem) - Progress blocker for drivers which are carefully crafted to avoid the potential reader/writer deadlock in mainline.
The analysis of the writer code paths shows, that properly written RT tasks should not take them. Syscalls like mmap(), file access which take mmap sem write locked have unbound latencies which are completely unrelated to mmap sem. Other R/W sem users like graphics drivers are not suitable for RT tasks either. So there is little risk to hurt RT tasks when the RT rwsem implementation is changed in the following way: - Allow concurrent readers - Make writers block until the last reader left the critical section. This blocking is not subject to priority/budget inheritance. - Readers blocked on a writer inherit their priority/budget in the normal way. There is a drawback with this scheme. R/W semaphores become writer unfair though the applications which have triggered writer starvation (mostly on mmap_sem) in the past are not really the typical workloads running on a RT system. So while it's unlikely to hit writer starvation, it's possible. If there are unexpected workloads on RT systems triggering it, we need to rethink the approach. Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit b7bba779f8b3971512ae9b78a04a53f3fd5031ae Author: Thomas Gleixner Date: Thu Oct 12 17:17:03 2017 +0200 locking/rtmutex: add mutex implementation based on rtmutex Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 7ca0fc1d6dffd3a4b5aaa132258b63c27d73a9ea Author: Sebastian Andrzej Siewior Date: Wed Dec 2 11:34:07 2015 +0100 locking/rtmutex: Allow rt_mutex_trylock() on PREEMPT_RT Non PREEMPT_RT kernel can deadlock on rt_mutex_trylock() in softirq context. On PREEMPT_RT the softirq context is handled in thread context. This avoids the deadlock in the slow path and PI-boosting will be done on the correct thread. Signed-off-by: Sebastian Andrzej Siewior commit 7fe3a1588075b7369d52b0403209356a721f6c07 Author: Thomas Gleixner Date: Thu Oct 12 17:11:19 2017 +0200 locking/rtmutex: add sleeping lock implementation Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 22f0779df9ad31f96c60f9ad2f1c0b42b9594dd3 Author: Thomas Gleixner Date: Sat Jun 25 09:21:04 2011 +0200 sched: Add saved_state for tasks blocked on sleeping locks Spinlocks are state preserving in !RT. RT changes the state when a task gets blocked on a lock. So we need to remember the state before the lock contention. If a regular wakeup (not a RTmutex related wakeup) happens, the saved_state is updated to running. When the lock sleep is done, the saved state is restored. Signed-off-by: Thomas Gleixner commit e2899e8be283e095cf56879802a9aab9e6d517e8 Author: Thomas Gleixner Date: Thu Oct 12 16:36:39 2017 +0200 locking/rtmutex: export lockdep-less version of rt_mutex's lock, trylock and unlock Required for lock implementation ontop of rtmutex. Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 920acb570f604b0b1cd65007085bc7d59ae89d1d Author: Thomas Gleixner Date: Thu Oct 12 16:14:22 2017 +0200 locking/rtmutex: Provide rt_mutex_slowlock_locked() This is the inner-part of rt_mutex_slowlock(), required for rwsem-rt. Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 5f8144be45b5d0cf55d09b33f97edfdd43210525 Author: Sebastian Andrzej Siewior Date: Fri Aug 14 17:08:41 2020 +0200 locking: split out the rbtree definition rtmutex.h needs the definition for rb_root_cached. By including kernel.h we will get to spinlock.h which requires rtmutex.h again. 
Split out the required struct definition and move it into its own header file which can be included by rtmutex.h. Signed-off-by: Sebastian Andrzej Siewior commit da606ab0eee4b286cb0e35ffeff66a74f69e02b0 Author: Sebastian Andrzej Siewior Date: Fri Aug 14 16:55:25 2020 +0200 lockdep: Reduce header files in debug_locks.h The inclusion of printk.h leads to a circular dependency if spinlock_t is based on rt_mutex. Include only atomic.h (xchg()) and cache.h (__read_mostly). Signed-off-by: Sebastian Andrzej Siewior commit 7365d12bb221f70f532fd38eb14a1fe6e22e430d Author: Thomas Gleixner Date: Wed Jun 29 20:06:39 2011 +0200 locking/rtmutex: Avoid include hell Include only the required raw types. This avoids pulling in the complete spinlock header which in turn requires rtmutex.h at some point. Signed-off-by: Thomas Gleixner commit 7b6791810fc3c89d37ac0a88004c4720a57b74d9 Author: Thomas Gleixner Date: Wed Jun 29 19:34:01 2011 +0200 locking/spinlock: Split the lock types header Split raw_spinlock into its own file and the remaining spinlock_t into its own non-RT header. The non-RT header will be replaced later by sleeping spinlocks. Signed-off-by: Thomas Gleixner commit 61381ea9bf80574b47038c30fc42d1dbf2db42d5 Author: Thomas Gleixner Date: Sat Apr 1 12:50:59 2017 +0200 locking/rtmutex: Make lock_killable work Locking an rt mutex killable does not work because signal handling is restricted to TASK_INTERRUPTIBLE. Use signal_pending_state() unconditionally. Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 6bd096c0b915eed9b7984340d8579bd5494668df Author: Steven Rostedt Date: Tue Jul 14 14:26:34 2015 +0200 futex: Fix bug on when a requeued RT task times out Requeue with timeout causes a bug with PREEMPT_RT. The bug comes from a timed out condition.

    TASK 1                                      TASK 2
    ------                                      ------
    futex_wait_requeue_pi()
        futex_wait_queue_me()
                                                double_lock_hb();
    raw_spin_lock(pi_lock);
    if (current->pi_blocked_on) {
    } else {
        current->pi_blocked_on = PI_WAKE_INPROGRESS;
        raw_spin_unlock(pi_lock);
        spin_lock(hb->lock); <-- blocked!
                                                plist_for_each_entry_safe(this) {
                                                    rt_mutex_start_proxy_lock();
                                                        task_blocks_on_rt_mutex();
                                                        BUG_ON(task->pi_blocked_on)!!!!

The BUG_ON() actually has a check for PI_WAKE_INPROGRESS, but the problem is that, after TASK 1 sets PI_WAKE_INPROGRESS, it then tries to grab the hb->lock, which it fails to do. As the hb->lock is a mutex, it will block and set the "pi_blocked_on" to the hb->lock. When TASK 2 goes to requeue it, the check for PI_WAKE_INPROGRESS fails because TASK 1's pi_blocked_on is no longer set to that, but instead, set to the hb->lock. The fix: When calling rt_mutex_start_proxy_lock() a check is made to see if the proxy task's pi_blocked_on is set. If so, exit out early. Otherwise set it to a new flag PI_REQUEUE_INPROGRESS, which notifies the proxy task that it is being requeued, and will handle things appropriately. Signed-off-by: Steven Rostedt Signed-off-by: Thomas Gleixner commit 801df8f631d1d6b200e894367c1838827f31410f Author: Thomas Gleixner Date: Fri Jun 10 11:04:15 2011 +0200 locking/rtmutex: Handle the various new futex race conditions RT opens a few new interesting race conditions in the rtmutex/futex combo due to the futex hash bucket lock being a 'sleeping' spinlock and therefore not disabling preemption.
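The lock_killable change above boils down to checking signal_pending_state() for whatever sleep state the waiter used, so TASK_KILLABLE works as well as TASK_INTERRUPTIBLE. A minimal sketch of such a slowpath wait loop (struct my_lock and try_to_take_lock() are hypothetical helpers, not the actual rtmutex code):

    /* state is e.g. TASK_KILLABLE or TASK_INTERRUPTIBLE */
    static int my_lock_slowpath(struct my_lock *lock, unsigned int state)
    {
            int ret = 0;

            set_current_state(state);
            for (;;) {
                    if (try_to_take_lock(lock))     /* hypothetical helper */
                            break;
                    /* Honours fatal signals for TASK_KILLABLE sleepers too. */
                    if (signal_pending_state(state, current)) {
                            ret = -EINTR;
                            break;
                    }
                    schedule();
                    set_current_state(state);
            }
            __set_current_state(TASK_RUNNING);
            return ret;
    }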
Signed-off-by: Thomas Gleixner commit be4a0918a7e3019e17c80c37e96f1c9d75d8a8ff Author: Sebastian Andrzej Siewior Date: Wed Oct 7 12:11:33 2020 +0200 locking/rtmutex: Remove rt_mutex_timed_lock() rt_mutex_timed_lock() has no callers since commit c051b21f71d1f ("rtmutex: Confine deadlock logic to futex"). Remove rt_mutex_timed_lock(). Signed-off-by: Sebastian Andrzej Siewior commit 4e04ecdc0a5b9a3223788598976ae9d16bea5a2a Author: Sebastian Andrzej Siewior Date: Tue Sep 29 16:32:49 2020 +0200 locking/rtmutex: Move rt_mutex_init() outside of CONFIG_DEBUG_RT_MUTEXES rt_mutex_init() only initializes lockdep if CONFIG_DEBUG_RT_MUTEXES is enabled. The static initializer (DEFINE_RT_MUTEX) does not have such a restriction. Move rt_mutex_init() outside of CONFIG_DEBUG_RT_MUTEXES. Move the remaining functions in this CONFIG_DEBUG_RT_MUTEXES block to the upper block. Signed-off-by: Sebastian Andrzej Siewior commit da954044eb3802e593da9553b884885d42701b05 Author: Sebastian Andrzej Siewior Date: Tue Sep 29 16:05:11 2020 +0200 locking/rtmutex: Remove output from deadlock detector. In commit f5694788ad8da ("rt_mutex: Add lockdep annotations") rtmutex gained lockdep annotation for rt_mutex_lock() and related functions. lockdep will see the locking order and may complain about a deadlock before rtmutex' own mechanism gets a chance to detect it. The rtmutex deadlock detector will only complain about locks with RT_MUTEX_MIN_CHAINWALK and a waiter must be pending. That means it works only for in-kernel locks because the futex interface always uses RT_MUTEX_FULL_CHAINWALK. The requirement for an active waiter limits the detector to actual deadlocks and makes it impossible to report potential deadlocks like lockdep does. It looks like lockdep is better suited for reporting deadlocks. Remove rtmutex' debug print on deadlock detection. Signed-off-by: Sebastian Andrzej Siewior commit a0f66074c3913ae7b9c319fb8259f8b32c10a82e Author: Sebastian Andrzej Siewior Date: Tue Sep 29 15:21:17 2020 +0200 locking/rtmutex: Remove cruft Most of this has been around since the very beginning. I'm not sure if this was used while the rtmutex-deadlock-tester was around but today it seems to only waste memory: - save_state: No users - name: Assigned and printed if a deadlock was detected. I'm keeping it but want to point out that lockdep has the same information. - file + line: Printed if ::name was NULL. This is only used for in-kernel locks, so ::name shouldn't be NULL, and then ::file and ::line aren't used. - magic: Assigned to NULL by rt_mutex_destroy(). Remove members of rt_mutex which are not used. Signed-off-by: Sebastian Andrzej Siewior commit 138035db9b89725b413603d8359e8c49e541a772 Author: Thomas Gleixner Date: Mon Sep 7 22:57:32 2020 +0200 tasklets: Use static inline for functions Inlines exist for a reason. Signed-off-by: Thomas Gleixner commit 2fad6e60d7777921dd2529189327a82b700d9659 Author: Thomas Gleixner Date: Mon Sep 21 17:47:34 2020 +0200 tasklets: Avoid cancel/kill deadlock on RT Signed-off-by: Thomas Gleixner commit 6a12c95ddb724592dc7c9f1949d16eb69000b08f Author: Thomas Gleixner Date: Mon Aug 31 15:12:38 2020 +0200 softirq: Replace barrier() with cpu_relax() in tasklet_unlock_wait() A barrier() in a tight loop which waits for something to happen on a remote CPU is a pointless exercise. Replace it with cpu_relax() which allows HT siblings to make progress.
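The cpu_relax() change above amounts to the usual pattern for busy-waiting on a remote CPU; roughly (a sketch modelled on the tasklet_unlock_wait() described in the commit, not the exact upstream code):

    static void my_tasklet_unlock_wait(struct tasklet_struct *t)
    {
            /* Spin until the tasklet has finished running on the other CPU;
             * cpu_relax() (instead of a plain barrier()) lets an HT sibling
             * make progress while we wait. */
            while (test_bit(TASKLET_STATE_RUN, &t->state))
                    cpu_relax();
    }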
Signed-off-by: Thomas Gleixner commit db93e2f1b4b062b4d3fb29ce71d07b5d9a3d556a Author: Thomas Gleixner Date: Mon Aug 31 17:26:08 2020 +0200 rcu: Prevent false positive softirq warning on RT Soft interrupt disabled sections can legitimately be preempted or scheduled out when blocking on a lock on RT enabled kernels, so the RCU preempt check warning has to be disabled for RT kernels. Signed-off-by: Thomas Gleixner commit be16faf253e0938d7e6927f0a7fce24f9f634ab1 Author: Thomas Gleixner Date: Mon Aug 31 17:02:36 2020 +0200 tick/sched: Prevent false positive softirq pending warnings on RT On RT a task which has soft interrupts disabled can block on a lock and schedule out to idle while soft interrupts are pending. This triggers the warning in the NOHZ idle code which complains about going idle with pending soft interrupts. But as the task is blocked, soft interrupt processing is temporarily blocked as well, which means that such a warning is a false positive. To prevent that, check the per-CPU state which indicates that a scheduled out task has soft interrupts disabled. Signed-off-by: Thomas Gleixner commit 8f90600c7fb4bc7215ea2f290694cc5d2344c2ee Author: Thomas Gleixner Date: Mon Sep 21 17:26:19 2020 +0200 softirq: Add RT variant Signed-off-by: Thomas Gleixner commit b0f95964088356b0b6972264ccae02e37e538bfa Author: Thomas Gleixner Date: Mon Sep 21 20:15:50 2020 +0200 x86/fpu: Do not disable BH on RT Signed-off-by: Thomas Gleixner commit 37b00d558dbc853165c0cbfbb7de4057058b2bf8 Author: Sebastian Andrzej Siewior Date: Mon Oct 12 17:33:54 2020 +0200 tcp: Remove superfluous BH-disable around listening_hash Commit 9652dc2eb9e40 ("tcp: relax listening_hash operations") removed the need to disable bottom half while acquiring listening_hash.lock. There are still two callers left which disable bottom half before the lock is acquired. Drop local_bh_disable() around __inet_hash() which acquires listening_hash->lock, invoke inet_ehash_nolisten() with disabled BH. inet_unhash() conditionally acquires listening_hash->lock. Signed-off-by: Sebastian Andrzej Siewior commit 9e70aa348fe8c19455d72bf8b9f2765e44b3722b Author: Thomas Gleixner Date: Tue Sep 8 07:32:20 2020 +0200 net: Move lockdep where it belongs Signed-off-by: Thomas Gleixner commit 8a66ccfd85e4bbf1cf771e48314ed773ea97e7a7 Author: Sebastian Andrzej Siewior Date: Fri Aug 14 18:53:34 2020 +0200 shmem: Use raw_spinlock_t for ->stat_lock Each CPU has SHMEM_INO_BATCH inodes available in `->ino_batch' which is per-CPU. Access here is serialized by disabling preemption. If the pool is empty, it gets reloaded from `->next_ino'. Access here is serialized by ->stat_lock which is a spinlock_t and cannot be acquired with disabled preemption. One way around it would be to make the per-CPU ino_batch struct containing the inode number a local_lock_t. Another solution is to promote ->stat_lock to a raw_spinlock_t. The critical sections are short. The mpol_put() should be moved outside of the critical section to avoid invoking the destructor with disabled preemption. Signed-off-by: Sebastian Andrzej Siewior commit d52eb543f10e44d90a073cd51011bf93e8f6c7a8 Author: Sebastian Andrzej Siewior Date: Mon Feb 11 11:33:11 2019 +0100 tpm: remove tpm_dev_wq_lock Added in commit 9e1b74a63f776 ("tpm: add support for nonblocking operation") but never actually used.
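The ->stat_lock pattern described above (a short raw-locked section, with the destructor invoked only after the unlock) can be sketched generically as follows; struct my_policy, current_policy and policy_put() are made-up names, not the shmem code:

    static DEFINE_RAW_SPINLOCK(obj_lock);
    static struct my_policy *current_policy;

    static void set_policy(struct my_policy *newpol)
    {
            struct my_policy *old;

            raw_spin_lock(&obj_lock);       /* short, non-sleeping critical section */
            old = current_policy;
            current_policy = newpol;
            raw_spin_unlock(&obj_lock);

            policy_put(old);                /* destructor runs preemptible, outside the lock */
    }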
Cc: Philip Tricca Cc: Tadeusz Struk Cc: Jarkko Sakkinen Signed-off-by: Sebastian Andrzej Siewior commit 609ec411708f1688201d16f3882b15f5dc11dad5 Author: Sebastian Andrzej Siewior Date: Mon Feb 11 10:40:46 2019 +0100 mm: workingset: replace IRQ-off check with a lockdep assert. Commit 68d48e6a2df57 ("mm: workingset: add vmstat counter for shadow nodes") introduced an IRQ-off check to ensure that a lock is held which also disabled interrupts. This does not work the same way on -RT because none of the locks, that are held, disable interrupts. Replace this check with a lockdep assert which ensures that the lock is held. Cc: Peter Zijlstra Signed-off-by: Sebastian Andrzej Siewior commit a82e77701098bef9a35c8835d45da19b076c772b Author: Sebastian Andrzej Siewior Date: Tue Jul 3 18:19:48 2018 +0200 cgroup: use irqsave in cgroup_rstat_flush_locked() All callers of cgroup_rstat_flush_locked() acquire cgroup_rstat_lock either with spin_lock_irq() or spin_lock_irqsave(). cgroup_rstat_flush_locked() itself acquires cgroup_rstat_cpu_lock which is a raw_spin_lock. This lock is also acquired in cgroup_rstat_updated() in IRQ context and therefore requires _irqsave() locking suffix in cgroup_rstat_flush_locked(). Since there is no difference between spin_lock_t and raw_spin_lock_t on !RT lockdep does not complain here. On RT lockdep complains because the interrupts were not disabled here and a deadlock is possible. Acquire the raw_spin_lock_t with disabled interrupts. Signed-off-by: Sebastian Andrzej Siewior commit 0a4fb06449e48b6d2623e4f2b1b7e29e9cc74c3e Author: Sebastian Andrzej Siewior Date: Tue Oct 20 18:48:16 2020 +0200 printk: Tiny cleanup - mark functions and variables static which are used only in this file. - add printf annotation where appropriate - remove static functions without caller - add kdb header file for kgdb builds. Signed-off-by: Sebastian Andrzej Siewior commit 4c78ecfedb60126b31f3030dfb9e810f2d737e4c Author: John Ogness Date: Mon Oct 19 23:03:44 2020 +0206 printk: add console handover If earlyprintk is used, a boot console will print directly to the console immediately. The boot console will unregister itself as soon as a non-boot console registers. However, the non-boot console does not begin printing until its kthread has started. Since this happens much later, there is a long pause in the console output. If the ringbuffer is small, messages could even be dropped during the pause. Add a new CON_HANDOVER console flag to be used internally by printk in order to track which non-boot console took over from a boot console. If handover consoles have implemented write_atomic(), they are allowed to print directly to the console until their kthread can take over. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 9153e3c5cb0c96242fa85ab151b9b29647634c9a Author: John Ogness Date: Mon Oct 19 22:53:30 2020 +0206 printk: remove deferred printing Since printing occurs either atomically or from the printing kthread, there is no need for any deferring or tracking possible recursion paths. Remove all printk context tracking. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 0097798fd99948d3ffea535005eee7eb3b14fd06 Author: John Ogness Date: Mon Oct 19 22:30:38 2020 +0206 printk: move console printing to kthreads Create a kthread for each console to perform console printing. Now all console printing is fully asynchronous except for the boot console and when the kernel enters sync mode (and there are atomic consoles available). 
The console_lock() and console_unlock() functions now only do what their name says... locking and unlocking of the console. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit b88736f1186c57134b7d48a94c7c9e17ae029f78 Author: John Ogness Date: Wed Oct 14 20:40:05 2020 +0200 printk: introduce kernel sync mode When the kernel performs an OOPS, enter into "sync mode": - only atomic consoles (write_atomic() callback) will print - printing occurs within vprintk_store() instead of console_unlock() Change @console_seq to atomic64_t for atomic access. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit faf7e6133364ce576646941a70a5c7da4b30b8f0 Author: John Ogness Date: Mon Oct 19 22:11:31 2020 +0206 printk: combine boot_delay_msec() into printk_delay() boot_delay_msec() is always called immediately before printk_delay() so just combine the two. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit e5ac22a3f07894b9614f26c861431729b39f56c9 Author: John Ogness Date: Mon Oct 19 21:02:40 2020 +0206 printk: relocate printk_delay() and vprintk_default() Move printk_delay() and vprintk_default() "as is" further up so that they can be used by new functions in an upcoming commit. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 14f0916f55e708febdda52396ed34239c042a3b9 Author: John Ogness Date: Mon Oct 19 16:40:26 2020 +0206 printk: inline log_output(),log_store() in vprintk_store() In preparation for supporting atomic printing, inline log_output() and log_store() into vprintk_store(). This allows these sub-functions to more easily communicate if they have performed a finalized commit as well as the sequence number of that commit. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 4fcfbc211ea9ca87a8d74234ff238da12ba34f74 Author: John Ogness Date: Wed Oct 14 20:31:46 2020 +0200 serial: 8250: implement write_atomic Implement a non-sleeping NMI-safe write_atomic() console function in order to support emergency console printing. Since interrupts need to be disabled during transmit, all usage of the IER register is wrapped with access functions that use the console_atomic_lock() function to synchronize register access while tracking the state of the interrupts. This is necessary because write_atomic() can be called from an NMI context that has preempted write_atomic(). Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 929999e09b583f5d048bd8a06b37fcfc1086cf69 Author: John Ogness Date: Wed Oct 14 20:26:35 2020 +0200 console: add write_atomic interface Add a write_atomic() callback to the console. This is an optional function for console drivers. The function must be atomic (including NMI safe) for writing to the console. Console drivers must still implement the write() callback. The write_atomic() callback will only be used in special situations, such as when the kernel panics. Creating an NMI safe write_atomic() that must synchronize with write() requires a careful implementation of the console driver. To aid with the implementation, a set of console_atomic_*() functions are provided: void console_atomic_lock(unsigned int *flags); void console_atomic_unlock(unsigned int flags); These functions synchronize using a processor-reentrant spinlock (called a cpulock). 
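Using the console_atomic_*() helpers quoted above, a driver's write_atomic() callback could be sketched like this; the callback signature and my_uart_poll_put_char() are assumptions for illustration, not a specific upstream driver:

    static void my_uart_write_atomic(struct console *con, const char *s,
                                     unsigned int count)
    {
            unsigned int flags;

            console_atomic_lock(&flags);    /* processor-reentrant cpulock, NMI safe */
            while (count--)
                    my_uart_poll_put_char(con, *s++);   /* hypothetical polled TX */
            console_atomic_unlock(flags);
    }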
Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit be9d7a177fcdc9ba2ab582a19fa02ea0907825f2 Author: John Ogness Date: Wed Oct 14 20:00:11 2020 +0200 printk: remove safe buffers With @logbuf_lock removed, the high level printk functions for storing messages are lockless. Messages can be stored from any context, so there is no need for the NMI and safe buffers anymore. Remove the NMI and safe buffers. In NMI or safe contexts, store the message immediately but still use irq_work to defer the console printing. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 8bde8e1691fad24ceaaf6de18930f25d740e6cd2 Author: John Ogness Date: Wed Oct 14 19:06:12 2020 +0200 printk: remove logbuf_lock, add syslog_lock Since the ringbuffer is lockless, there is no need for it to be protected by @logbuf_lock. Remove @logbuf_lock. This means that printk_nmi_direct and printk_safe_flush_on_panic() no longer need to acquire any lock to run. The global variables @syslog_seq, @syslog_partial, @syslog_time were also protected by @logbuf_lock. Introduce @syslog_lock to protect these. @console_seq, @exclusive_console_stop_seq, @console_dropped are protected by @console_lock. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 9a1b129567a1fb934ffd73a2b500464e2d0448e4 Author: John Ogness Date: Tue Oct 13 23:19:35 2020 +0200 printk: change @clear_seq to atomic64_t Currently @clear_seq access is protected by @logbuf_lock. Once @logbuf_lock is removed some other form of synchronization will be required. Change the type of @clear_seq to atomic64_t to provide the synchronization. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 452392de6e9f15ff0d6c37e64741b1534d94f630 Author: John Ogness Date: Tue Oct 13 22:57:55 2020 +0200 printk: use buffer pools for sprint buffers vprintk_store() is using a single static buffer as a temporary sprint buffer for the message text. This will not work once @logbuf_lock is removed. Replace the single static buffer with per-cpu and global pools. Each per-cpu pool is large enough to support a worse case of 2 contexts (non-NMI and NMI). To support printk() recursion and printk() calls before per-cpu variables are ready, an extra/fallback global pool of 2 contexts is available. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 03a93ff12b847955a264ec7b9e5b5d021db9bc49 Author: John Ogness Date: Wed Oct 14 19:09:15 2020 +0200 printk: refactor kmsg_dump_get_buffer() kmsg_dump_get_buffer() requires nearly the same logic as syslog_print_all(), but uses different variable names and does not make use of the ringbuffer loop macros. Modify kmsg_dump_get_buffer() so that the implementation is as similar to syslog_print_all() as possible. At some point it would be nice to have this code factored into a helper function. But until then, the code should at least look similar enough so that it is obvious there is logic duplication implemented. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit f99fe7d39bc2dcfc4de00926b7837be0028f985f Author: Sebastian Andrzej Siewior Date: Wed Oct 28 18:55:27 2020 +0100 lib/test_lockup: Minimum fix to get it compiled on PREEMPT_RT On PREEMPT_RT the locks are quite different so they can't be tested as it is done below. The alternative is test for the waitlock within rtmutex. This is the bare minim to get it compiled. 
Problems which exist on PREEMPT_RT: - none of the locks (spinlock_t, rwlock_t, mutex_t, rw_semaphore) may be acquired with disabled preemption or interrupts. If I read the code correctly, it is possible to acquire a mutex with disabled interrupts. I don't know how to obtain a lock pointer. Technically they are not exported to userland. - memory cannot be allocated with disabled preemption or interrupts, even with GFP_ATOMIC. Signed-off-by: Sebastian Andrzej Siewior commit 66e4dff55a71370535faa53f4f6fd80c3bc94718 Author: Sebastian Andrzej Siewior Date: Wed Oct 28 11:08:21 2020 +0100 blk-mq: Use llist_head for blk_cpu_done With llist_head it is possible to avoid the locking (the irq-off region) when items are added. This makes it possible to add items on a remote CPU. llist_add() returns true if the list was previously empty. This can be used to invoke the SMP function call / raise the softirq only if the first item was added (otherwise it is already pending). This simplifies the code a little and reduces the IRQ-off regions. With this change it is possible to reduce the SMP function call to a simple __raise_softirq_irqoff(); a sketch of this pattern follows below. Signed-off-by: Sebastian Andrzej Siewior commit f32ed2668327037de44fe0487be85fedf674f5b5 Author: Sebastian Andrzej Siewior Date: Wed Oct 28 11:07:09 2020 +0100 blk-mq: Always complete remote completion requests in softirq Controllers with multiple queues have their IRQ handlers pinned to a CPU. The core shouldn't need to complete the request on a remote CPU. Remove this case and always raise the softirq to complete the request. Signed-off-by: Sebastian Andrzej Siewior commit ec7c2eedc35a4594f229c2a78e84c98c2034aa32 Author: Sebastian Andrzej Siewior Date: Wed Oct 28 11:07:44 2020 +0100 blk-mq: Don't complete on a remote CPU in force threaded mode With force threaded interrupts enabled, raising softirq from an SMP function call will always result in waking the ksoftirqd thread. This is not optimal given that the thread runs at SCHED_OTHER priority. Completing the request in hard IRQ context on PREEMPT_RT (which enforces the force threaded mode) is bad because the completion handler may acquire sleeping locks which violate the locking context. Disable request completion on a remote CPU in force threaded mode. Signed-off-by: Sebastian Andrzej Siewior commit 79fc5beb60694b46b729eb6700f99f26d2c97e89 Author: Sebastian Andrzej Siewior Date: Fri Jul 26 11:30:49 2019 +0200 Use CONFIG_PREEMPTION This is an all-in-one patch of the current `PREEMPTION' branch.
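The llist_add() trick referenced above ("kick only for the first item") looks roughly like this; struct my_req, the field names and kick_completion() are illustrative, not the exact blk-mq code:

    struct my_req {
            struct llist_node llnode;
            /* ... */
    };

    static DEFINE_PER_CPU(struct llist_head, done_list);

    static void complete_on_cpu(struct my_req *rq, int cpu)
    {
            /* Lock-free add; returns true only if the list was empty before,
             * i.e. no completion work is pending on that CPU yet. */
            if (llist_add(&rq->llnode, &per_cpu(done_list, cpu)))
                    kick_completion(cpu);   /* hypothetical: IPI / raise the softirq */
    }

    static void done_softirq_handler(void)
    {
            /* Consumer: detach the whole list at once, still lock-free. */
            struct llist_node *entries = llist_del_all(this_cpu_ptr(&done_list));
            /* ... walk 'entries' and run the completion handlers ... */
    }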
Signed-off-by: Sebastian Andrzej Siewior commit 53e9b2b2bb0e5ff69cf163143d48ba26a3666fc0 Author: Valentin Schneider Date: Fri Oct 23 12:12:17 2020 +0200 sched: Comment affine_move_task() Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20201013140116.26651-2-valentin.schneider@arm.com Signed-off-by: Sebastian Andrzej Siewior commit f39ec431851ebc419ca9efe2153aabab942c2ca4 Author: Valentin Schneider Date: Fri Oct 23 12:12:16 2020 +0200 sched: Deny self-issued __set_cpus_allowed_ptr() when migrate_disable()

    migrate_disable();
    set_cpus_allowed_ptr(current, {something excluding task_cpu(current)});
    affine_move_task(); <-- never returns

Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20201013140116.26651-1-valentin.schneider@arm.com Signed-off-by: Sebastian Andrzej Siewior commit d851d4f23f615f6d995cadfa21387c70569b63bb Author: Peter Zijlstra Date: Fri Oct 23 12:12:15 2020 +0200 sched: Add migrate_disable() tracepoints XXX write a tracer: - 'migrate_disable() -> migrate_enable()' time in task_sched_runtime() - 'migrate_pull -> sched-in' time in task_sched_runtime() The first will give the worst case for the second, which is the actual interference experienced by the task due to the migration constraints of migrate_disable(). Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior commit 584d86a4e1ce1fb0cd2d33635be9f5f42b02731c Author: Peter Zijlstra Date: Fri Oct 23 12:12:14 2020 +0200 sched/proc: Print accurate cpumask vs migrate_disable() Ensure /proc/*/status doesn't print 'random' cpumasks due to migrate_disable(). Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior commit 1beec5b55060e8006c3c0a2d5443f8df8f45ef6d Author: Peter Zijlstra Date: Fri Oct 23 12:12:13 2020 +0200 sched: Fix migrate_disable() vs rt/dl balancing In order to minimize the interference of migrate_disable() on lower priority tasks, which can be deprived of runtime due to being stuck below a higher priority task, teach the RT/DL balancers to push away these higher priority tasks when a lower priority task gets selected to run on a freshly demoted CPU (pull). This adds migration interference to the higher priority task, but restores bandwidth to the system that would otherwise be irrevocably lost. Without this it would be possible to have all tasks on the system stuck on a single CPU, each task preempted in a migrate_disable() section with a single high priority task running. This way we can still approximate running the M highest priority tasks on the system. Migrating the top task away is (of course) still subject to migrate_disable() too, which means the lower task is subject to an interference equivalent to the worst case migrate_disable() section. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior commit e9ba2b29d0dea674a9dc56ec70cec83b48c89e3a Author: Peter Zijlstra Date: Fri Oct 23 12:12:12 2020 +0200 sched, lockdep: Annotate ->pi_lock recursion There's a valid ->pi_lock recursion issue where the actual PI code tries to wake up the stop task. Make lockdep aware so it doesn't complain about this. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior commit 0523fce6f66165a1474c60180d786b2bf1cbb705 Author: Peter Zijlstra Date: Fri Oct 23 12:12:11 2020 +0200 sched,rt: Use the full cpumask for balancing We want migrate_disable() tasks to get PULLs in order for them to PUSH away the higher priority task.
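As a usage sketch for the migrate_disable() machinery discussed above: it pins the current task to its CPU without disabling preemption, which is what allows taking sleeping locks (spinlock_t on RT) while a per-CPU pointer stays stable. The per-CPU structure here is made up for illustration:

    struct my_stats {
            spinlock_t lock;
            unsigned long events;
    };
    static DEFINE_PER_CPU(struct my_stats, cpu_stats);

    static void count_event(void)
    {
            struct my_stats *stats;

            migrate_disable();                      /* stay on this CPU, stay preemptible */
            stats = this_cpu_ptr(&cpu_stats);
            spin_lock(&stats->lock);                /* sleeping lock is fine on PREEMPT_RT */
            stats->events++;
            spin_unlock(&stats->lock);
            migrate_enable();
    }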
Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior commit 7584575653efaedd9e0a571fac47d0b178671af4 Author: Peter Zijlstra Date: Fri Oct 23 12:12:10 2020 +0200 sched,rt: Use cpumask_any*_distribute() Replace a bunch of cpumask_any*() instances with cpumask_any*_distribute(), by injecting this little bit of random in cpu selection, we reduce the chance two competing balance operations working off the same lowest_mask pick the same CPU. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior commit fe5382979500b572fa4f9d96bcb9c8cf44b983fb Author: Thomas Gleixner Date: Fri Oct 23 12:12:09 2020 +0200 sched/core: Make migrate disable and CPU hotplug cooperative On CPU unplug tasks which are in a migrate disabled region cannot be pushed to a different CPU until they returned to migrateable state. Account the number of tasks on a runqueue which are in a migrate disabled section and make the hotplug wait mechanism respect that. Signed-off-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior commit 5d7baa3f554d1791961726478f99f4d21a7b7373 Author: Peter Zijlstra Date: Fri Oct 23 12:12:08 2020 +0200 sched: Fix migrate_disable() vs set_cpus_allowed_ptr() Concurrent migrate_disable() and set_cpus_allowed_ptr() has interesting features. We rely on set_cpus_allowed_ptr() to not return until the task runs inside the provided mask. This expectation is exported to userspace. This means that any set_cpus_allowed_ptr() caller must wait until migrate_enable() allows migrations. At the same time, we don't want migrate_enable() to schedule, due to patterns like: preempt_disable(); migrate_disable(); ... migrate_enable(); preempt_enable(); And: raw_spin_lock(&B); spin_unlock(&A); this means that when migrate_enable() must restore the affinity mask, it cannot wait for completion thereof. Luck will have it that that is exactly the case where there is a pending set_cpus_allowed_ptr(), so let that provide storage for the async stop machine. Much thanks to Valentin who used TLA+ most effective and found lots of 'interesting' cases. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior commit 622665c60fecb9833f4db4ae91df0c60171dbc15 Author: Peter Zijlstra Date: Fri Oct 23 12:12:07 2020 +0200 sched: Add migrate_disable() Add the base migrate_disable() support (under protest). While migrate_disable() is (currently) required for PREEMPT_RT, it is also one of the biggest flaws in the system. Notably this is just the base implementation, it is broken vs sched_setaffinity() and hotplug, both solved in additional patches for ease of review. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior commit a25e1c75e88991bed561f8114ef38618024c2ef9 Author: Peter Zijlstra Date: Fri Oct 23 12:12:06 2020 +0200 sched: Massage set_cpus_allowed() Thread a u32 flags word through the *set_cpus_allowed*() callchain. This will allow adding behavioural tweaks for future users. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior commit 8570c166e9b9cfa7ee3637b957a9fffd0c5b9059 Author: Peter Zijlstra Date: Fri Oct 23 12:12:05 2020 +0200 sched: Fix hotplug vs CPU bandwidth control Since we now migrate tasks away before DYING, we should also move bandwidth unthrottle, otherwise we can gain tasks from unthrottle after we expect all tasks to be gone already. Also; it looks like the RT balancers don't respect cpu_active() and instead rely on rq->online in part, complete this. 
This too requires that we do set_rq_offline() earlier to match the cpu_active() semantics. (The bigger patch is to convert RT to cpu_active() entirely.) Since set_rq_online() is called from sched_cpu_activate(), place set_rq_offline() in sched_cpu_deactivate(). Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior commit 3dc80c278022ec43b137216ac51e25a9468bf2d7 Author: Thomas Gleixner Date: Fri Oct 23 12:12:04 2020 +0200 sched/hotplug: Consolidate task migration on CPU unplug With the new mechanism which kicks tasks off the outgoing CPU at the end of schedule(), the situation on an outgoing CPU right before the stopper thread brings it down completely is: - All user tasks and all unbound kernel threads have either been migrated away or are not running and the next wakeup will move them to an online CPU. - All per-CPU kernel threads, except the CPU hotplug thread and the stopper thread, have either been unbound or parked by the responsible CPU hotplug callback. That means that at the last step before the stopper thread is invoked, the CPU hotplug thread is the last legitimate running task on the outgoing CPU. Add a final wait step right before the stopper thread is kicked which ensures that any still running tasks on the way to park or on the way to kick themselves off the CPU are either sleeping or gone. This allows removing the migrate_tasks() crutch in sched_cpu_dying(). If sched_cpu_dying() detects that there is still another running task aside from the stopper thread then it will explode with the appropriate fireworks. Signed-off-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior commit 62d01396ee646db616b72dbf3ec78ca0b56014e9 Author: Peter Zijlstra Date: Fri Oct 23 12:12:03 2020 +0200 workqueue: Manually break affinity on hotplug Don't rely on the scheduler to force break affinity for us -- it will stop doing that for per-cpu-kthreads. Signed-off-by: Peter Zijlstra (Intel) Acked-by: Tejun Heo Signed-off-by: Sebastian Andrzej Siewior commit 728a7156f6378072c5378a0e4f2345fa3974126f Author: Thomas Gleixner Date: Fri Oct 23 12:12:02 2020 +0200 sched/core: Wait for tasks being pushed away on hotplug RT kernels need to ensure that all tasks which are not per-CPU kthreads have left the outgoing CPU to guarantee that no tasks are force migrated within a migrate disabled section. There is also some desire to (ab)use fine grained CPU hotplug control to clear a CPU from active state to force migrate tasks which are not per-CPU kthreads away for power control purposes. Add a mechanism which waits until all tasks which should leave the CPU after the CPU active flag is cleared have moved to a different online CPU. Signed-off-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior commit 4c1c6261e21f6d77e4d6feee54d4721165f57157 Author: Peter Zijlstra Date: Fri Oct 23 12:12:01 2020 +0200 sched/hotplug: Ensure only per-cpu kthreads run during hotplug In preparation for migrate_disable(), make sure only per-cpu kthreads are allowed to run on !active CPUs. This is run (as one of the very first steps) from the cpu-hotplug task which is a per-cpu kthread and completion of the hotplug operation only requires such tasks. This constraint enables the migrate_disable() implementation to wait for completion of all migrate_disable regions on this CPU at hotplug time without fear of any new ones starting.
This replaces the unlikely(rq->balance_callbacks) test at the tail of context_switch with an unlikely(rq->balance_work), the fast path is not affected. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior commit 0126b9d4153822b9acbd5d74e61316b74766fcf1 Author: Peter Zijlstra Date: Fri Oct 23 12:12:00 2020 +0200 sched: Fix balance_callback() The intent of balance_callback() has always been to delay executing balancing operations until the end of the current rq->lock section. This is because balance operations must often drop rq->lock, and that isn't safe in general. However, as noted by Scott, there were a few holes in that scheme; balance_callback() was called after rq->lock was dropped, which means another CPU can interleave and touch the callback list. Rework code to call the balance callbacks before dropping rq->lock where possible, and otherwise splice the balance list onto a local stack. This guarantees that the balance list must be empty when we take rq->lock. IOW, we'll only ever run our own balance callbacks. Reported-by: Scott Wood Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior commit 69f842d667363c43e908ea871709121375633b67 Author: Peter Zijlstra Date: Fri Oct 23 12:11:59 2020 +0200 stop_machine: Add function and caller debug info Crashes in stop-machine are hard to connect to the calling code, add a little something to help with that. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior
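A rough sketch of the "splice onto a local stack" idea from the balance_callback() fix above; splice_balance_callbacks() and run_balance_callbacks() are hypothetical helper names, not the exact scheduler code:

    static void finish_rq_work(struct rq *rq)
    {
            struct callback_head *head;

            raw_spin_lock(&rq->lock);
            /* ... scheduling decisions may have queued balance callbacks ... */
            head = splice_balance_callbacks(rq);    /* detach the list while rq->lock is held */
            raw_spin_unlock(&rq->lock);

            /* Only this CPU ever sees the spliced list, so we run only our own
             * callbacks; another CPU taking rq->lock now finds the list empty. */
            run_balance_callbacks(rq, head);
    }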