commit b5a6a8d1a681af5838ae13e7810eee3574a73b32
Author: Alexandre Frade
Date:   Wed Apr 18 22:00:37 2018 -0300

    4.16.2-xanmod4

    Signed-off-by: Alexandre Frade

commit 75db9efbf8611596378e24a85d41cb38b9270d70
Author: Alfred Chen
Date:   Fri Apr 13 21:57:27 2018 +0800

    Tag PDS 0.98n

commit 676486405914a5c621734e5ff878d75871c27cc1
Author: Alfred Chen
Date:   Thu Apr 12 12:47:31 2018 +0800

    pds: Optimize pds_load_balance().

commit 9e7976e095cb296cb54d14499a9ec1c2860fd445
Author: Alfred Chen
Date:   Wed Apr 11 19:45:11 2018 +0800

    pds: Code cleanup.

commit 11089e22101b7879d74ef3c04361d2375f6696a4
Author: Alfred Chen
Date:   Wed Apr 11 13:55:44 2018 +0800

    pds: Optimize scheduler_tick() and pds_sg_balance().

commit 1ae369e0f151a8149ce52a88ad7134fa74330535
Author: Alfred Chen
Date:   Sun Apr 8 16:04:50 2018 +0800

    pds: Migrate max SCHED_RQ_NR_MIGRATION tasks to empty rq at a time.

commit 65508afa7e2b51ad9dfb48f5abae45e5522c3357
Author: Alexandre Frade
Date:   Wed Apr 18 21:58:38 2018 -0300

    elevator: set bfq-mq instead of bfq (mainline)

commit 69a138d6eefe7b01cf2b230b5f563f104305d493
Author: Paolo Valente
Date:   Wed Apr 4 11:28:16 2018 +0200

    block, bfq-sq, bfq-mq: lower-bound the estimated peak rate to 1

    If a storage device handled by BFQ happens to be slower than 7.5 KB/s for a certain amount of time (in the order of a second), then the estimated peak rate of the device, maintained in BFQ, becomes equal to 0. The reason is the limited precision with which the rate is represented (details on the range of representable values in the comments introduced by this commit). This leads to a division-by-zero error where the estimated peak rate is used as divisor. Such a failure has been reported in [1].

    This commit addresses this issue by:
    1. Lower-bounding the estimated peak rate to 1
    2. Adding and improving comments on the range of representable rates

    [1] https://www.spinics.net/lists/kernel/msg2739205.html

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Paolo Valente

commit 89a4686aa8cea6229737e882b3a4d596294b671d
Author: Melzani Alessandro
Date:   Mon Feb 26 22:59:30 2018 +0100

    bfq-sq, bfq-mq: port of "block, bfq: fix error handle in bfq_init"

    If elv_register fails, bfq_pool should be freed.

    Signed-off-by: Alessandro Melzani

commit b53ac1db67088bfae10fa2a9287f0d7776895bd2
Author: Melzani Alessandro
Date:   Mon Feb 26 22:43:30 2018 +0100

    bfq-sq, bfq-mq: port of "bfq: Use icq_to_bic() consistently"

    Some code uses icq_to_bic() to convert an io_cq pointer to a bfq_io_cq pointer while other code uses a direct cast. Convert the code that uses a direct cast such that it uses icq_to_bic().

    Signed-off-by: Alessandro Melzani

commit 07b2c4ceb835692ff9466f5f39d879d0b130d2a2
Author: Melzani Alessandro
Date:   Mon Feb 26 22:21:59 2018 +0100

    bfq-mq: port of "block, bfq: remove batches of confusing ifdefs"

    Commit a33801e8b473 ("block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP") introduced two batches of confusing ifdefs: one reported in [1], plus a similar one in another function. This commit removes both batches, in the way suggested in [1].

    [1] https://www.spinics.net/lists/linux-block/msg20043.html

    Fixes: a33801e8b473 ("block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP")
    Signed-off-by: Alessandro Melzani

commit 489d6ba475244f8e1044d44acc14c3f4eff2cd83
Author: Davide Paganelli
Date:   Thu Feb 8 11:49:58 2018 +0100

    block, bfq-mq, bfq-sq: make bfq_bfqq_expire print expiration reason

    Improve readability of the log messages related to the expiration reasons of the function bfq_bfqq_expire. Replace the printing of the number that represents the reason for expiration with an actual textual description of the reason.
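The number-to-text mapping described above can be sketched with a lookup table. This is only an illustration of the idea, not the code of the commit: the enum constants, the strings and the helper name expire_reason_name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical expiration reasons; the real enum lives in the BFQ sources. */
enum expire_reason {
	EXPIRE_TOO_IDLE,
	EXPIRE_BUDGET_TIMEOUT,
	EXPIRE_BUDGET_EXHAUSTED,
	EXPIRE_NO_MORE_REQUESTS,
	EXPIRE_PREEMPTED,
	EXPIRE_NR_REASONS
};

/* One human-readable string per reason, indexed by the enum value. */
static const char *const expire_reason_names[EXPIRE_NR_REASONS] = {
	[EXPIRE_TOO_IDLE]         = "too idle",
	[EXPIRE_BUDGET_TIMEOUT]   = "budget timeout",
	[EXPIRE_BUDGET_EXHAUSTED] = "budget exhausted",
	[EXPIRE_NO_MORE_REQUESTS] = "no more requests",
	[EXPIRE_PREEMPTED]        = "preempted",
};

/* Return a printable name; out-of-range values map to "unknown". */
static const char *expire_reason_name(int reason)
{
	if (reason < 0 || reason >= EXPIRE_NR_REASONS)
		return "unknown";
	/* Catch a reason that was added to the enum but not to the table. */
	assert(expire_reason_names[reason] != NULL);
	return expire_reason_names[reason];
}
```

A log call would then print expire_reason_name(reason) with a %s conversion instead of printing the raw number with %d.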
    Signed-off-by: Davide Paganelli
    Signed-off-by: Paolo Valente

commit 3fbe84a9d5ba4f90538d68a7a8851c8d466ab660
Author: Davide Paganelli
Date:   Thu Feb 8 12:19:24 2018 +0100

    block, bfq-mq, bfq-sq: make log functions print names of calling functions

    Add the macro __func__ as a parameter to the invocations of the functions pr_crit, blk_add_trace_msg and blk_add_cgroup_trace_msg in bfq_log* functions, in order to include automatically in the log messages the names of the functions that call the log functions. The programmer can then avoid doing it.

    Signed-off-by: Davide Paganelli
    Signed-off-by: Paolo Valente

commit de7787b54ffd2c1996552ebba611974dd6f52ca8
Author: Paolo Valente
Date:   Mon Jan 15 15:07:05 2018 +0100

    block, bfq-mq: add requeue-request hook

    Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device be re-inserted into the active I/O scheduler for that device. As a consequence, I/O schedulers may get the same request inserted again, even several times, without a finish_request invoked on that request before each re-insertion.

    This fact is the cause of the failure reported in [1]. For an I/O scheduler, every re-insertion of the same re-prepared request is equivalent to the insertion of a new request. For schedulers like mq-deadline or kyber, this fact causes no harm. In contrast, it confuses a stateful scheduler like BFQ, which keeps state for an I/O request, until the finish_request hook is invoked on the request. In particular, BFQ may get stuck, waiting forever for the number of request dispatches, of the same request, to be balanced by an equal number of request completions (while there will be one completion for that request). In this state, BFQ may refuse to serve I/O requests from other bfq_queues. The hang reported in [1] then follows.
    However, the above re-prepared requests undergo a requeue, thus the requeue_request hook of the active elevator is invoked for these requests, if set. This commit then addresses the above issue by properly implementing the hook requeue_request in BFQ.

    [1] https://marc.info/?l=linux-block&m=151211117608676

    Reported-by: Ivan Kozik
    Reported-by: Alban Browaeys
    Signed-off-by: Paolo Valente
    Signed-off-by: Serena Ziviani

commit 10c366a958fad2b7ded2125b7bbf6a4fef19a43b
Author: Paolo Valente
Date:   Sat Jan 13 18:48:41 2018 +0100

    block, bfq-sq, bfq-mq: remove trace_printks

    Commit ("block, bfq-sq, bfq-mq: trace get and put of bfq groups") unwisely added some invocations of the function trace_printk, which is inappropriate in production kernels. This commit removes those invocations.

    Signed-off-by: Paolo Valente

commit f43ec49c0914122eea89618e948e7bf47179fb5a
Author: Paolo Valente
Date:   Wed Jan 10 09:08:22 2018 +0100

    bfq-sq, bfq-mq: compile group put for oom queue only if BFQ_GROUP_IOSCHED is set

    Commit ("bfq-sq, bfq-mq: release oom-queue ref to root group on exit") added a missing put of the root bfq group for the oom queue. That put has to be, and can be, performed only if CONFIG_BFQ_GROUP_IOSCHED is defined: the function doing the put is not even defined if CONFIG_BFQ_GROUP_IOSCHED is not defined. But that commit makes that put be invoked regardless of whether CONFIG_BFQ_GROUP_IOSCHED is defined. This commit fixes this mistake, by making that invocation be compiled only if CONFIG_BFQ_GROUP_IOSCHED is actually defined.
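The pattern behind this fix can be sketched as follows. This is a simplified illustration under stated assumptions, not the patch itself: struct group, group_put and scheduler_exit are hypothetical names, and only the configuration macro comes from the message above.

```c
#include <assert.h>

/* Hypothetical stand-in for a bfq group with a reference count. */
struct group {
	int refs;
};

#ifdef CONFIG_BFQ_GROUP_IOSCHED
/* Exists only when group scheduling is compiled in. */
static void group_put(struct group *g)
{
	assert(g->refs > 0);
	g->refs--;
}
#endif

/* Hypothetical exit path: the put is compiled only when its callee exists. */
static void scheduler_exit(struct group *root)
{
#ifdef CONFIG_BFQ_GROUP_IOSCHED
	group_put(root);
#else
	(void)root; /* nothing to release without group support */
#endif
}
```

Without the guard around the call, a build with CONFIG_BFQ_GROUP_IOSCHED unset would fail, because group_put would be referenced but never defined.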
    Fixes ("block, bfq: release oom-queue ref to root group on exit")
    Reported-by: Jan Alexander Steffens
    Signed-off-by: Paolo Valente

commit f080854651a27c4b7603cccf7a6180be760af92b
Author: Paolo Valente
Date:   Mon Jan 8 19:40:38 2018 +0100

    block, bfq-sq, bfq-mq: trace get and put of bfq groups

    Signed-off-by: Paolo Valente

commit 5217a18eff28d9a000336c94cbf274a6e791043d
Author: Paolo Valente
Date:   Mon Jan 8 19:38:45 2018 +0100

    bfq-sq, bfq-mq: release oom-queue ref to root group on exit

    On scheduler init, a reference to the root group, and a reference to its corresponding blkg are taken for the oom queue. Yet these references are not released on scheduler exit, which prevents these objects from being freed. This commit adds the missing reference releases.

    Reported-by: Davide Ferrari
    Signed-off-by: Paolo Valente

commit 30e6f63b4cf9a9b7facce884f969a93ed8e8f567
Author: Paolo Valente
Date:   Thu Jan 4 16:29:58 2018 +0100

    bfq-sq, bfq-mq: put async queues for root bfq groups too

    For each pair [device for which bfq is selected as I/O scheduler, group in blkio/io], bfq maintains a corresponding bfq group. Each such bfq group contains a set of async queues, with each async queue created on demand, i.e., when some I/O request arrives for it. On creation, an async queue gets an extra reference, to make sure that the queue is not freed as long as its bfq group exists. Accordingly, to allow the queue to be freed after the group has exited, this extra reference must be released on group exit.

    The above holds also for a bfq root group, i.e., for the bfq group corresponding to the root blkio/io group for a given device. Yet, by mistake, the references to the existing async queues of a root group are not released when the latter exits. This causes a memory leak when the instance of bfq for a given device exits. In a similar vein, bfqg_stats_xfer_dead is not executed for a root group.

    This commit fixes bfq_pd_offline so that the latter executes the above missing operations for a root group too.
    Reported-by: Holger Hoffstätte
    Reported-by: Guoqing Jiang
    Signed-off-by: Davide Ferrari
    Signed-off-by: Paolo Valente

commit 26a342ca170fc4353c6d1efc547769f97d52bd9f
Author: Paolo Valente
Date:   Thu Dec 21 15:51:39 2017 +0100

    bfq-sq, bfq-mq: limit sectors served with interactive weight raising

    To maximise responsiveness, BFQ raises the weight, and performs device idling, for bfq_queues associated with processes deemed as interactive. In particular, weight raising has a maximum duration, equal to the time needed to start a large application. If a weight-raised process goes on doing I/O beyond this maximum duration, it loses weight-raising.

    This mechanism is evidently vulnerable to the following false positives: I/O-bound applications that will go on doing I/O for much longer than the duration of weight-raising. These applications get basically no benefit from being weight-raised at the beginning of their I/O. On the opposite end, while being weight-raised, these applications
    a) unjustly steal throughput from applications that may truly need low latency;
    b) make BFQ uselessly perform device idling; device idling results in loss of device throughput with most flash-based storage, and may increase latencies when used purposelessly.

    This commit adds a countermeasure to reduce both the above problems. To introduce this countermeasure, we provide the following extra piece of information (full details in the comments added by this commit). During the start-up of the large application used as a reference to set the duration of weight-raising, involved processes transfer at most ~110K sectors each. Accordingly, a process initially deemed as interactive has no right to be weight-raised any longer, once it has transferred 110K sectors or more.

    Based on this consideration, this commit early-ends weight-raising for a bfq_queue if the latter happens to have received an amount of service at least equal to 110K sectors (actually, a little bit more, to keep a safety margin).
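The early-ending rule can be sketched as a per-queue service counter compared against a fixed cap. This is a minimal illustration, not the code of the commit: the struct, the function names and the cap of 120000 sectors are assumptions (the message only says "110K sectors" plus a safety margin).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed cap: ~110K sectors plus a margin; the exact value is a guess. */
#define MAX_SERVICE_FROM_WR 120000UL

/* Hypothetical per-queue state. */
struct queue {
	unsigned int wr_coeff;         /* 1 means "not weight-raised" */
	unsigned long service_from_wr; /* sectors served while weight-raised */
};

/* Account sectors served; only weight-raised service is counted. */
static void account_service(struct queue *q, unsigned long sectors)
{
	assert(q);
	if (q->wr_coeff > 1)
		q->service_from_wr += sectors;
}

/* End weight raising early once the cap is reached; return true if ended. */
static bool end_wr_if_overserved(struct queue *q)
{
	assert(q);
	if (q->wr_coeff > 1 && q->service_from_wr >= MAX_SERVICE_FROM_WR) {
		q->wr_coeff = 1;
		q->service_from_wr = 0;
		return true;
	}
	return false;
}
```

The time-based limit on weight raising stays in place in this model; the sector cap simply triggers first for queues that transfer data quickly.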
    I/O-bound applications that reach a high throughput, such as file copy, get to this threshold long before the allowed weight-raising period finishes. Thus this early ending of weight-raising reduces the amount of time during which these applications cause the problems described above.

    Signed-off-by: Paolo Valente

commit 835476c491abfb03ac10832d4e9aeb3fd143fb86
Author: Paolo Valente
Date:   Tue Dec 19 12:07:12 2017 +0100

    block, bfq-mq: limit tags for writes and async I/O

    Asynchronous I/O can easily starve synchronous I/O (both sync reads and sync writes), by consuming all request tags. Similarly, storms of synchronous writes, such as those that sync(2) may trigger, can starve synchronous reads. In their turn, these two problems may also cause BFQ to lose control of latency for interactive and soft real-time applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice Writer takes 0.6 seconds to start if the device is idle, but it takes more than 45 seconds (!) if there are sequential writes in the background.

    This commit addresses this issue by limiting the maximum percentage of tags that asynchronous I/O requests and synchronous write requests can consume. In particular, this commit grants a higher threshold to synchronous writes, to prevent the latter from being starved by asynchronous I/O.

    According to the above test, LibreOffice Writer now starts in about 1.2 seconds on average, regardless of the background workload, and apart from some rare outlier. To check this improvement, run, e.g.,

        sudo ./comm_startup_lat.sh bfq-mq 5 5 seq 10 "lowriter --terminate_after_init"

    for the comm_startup_lat benchmark in the S suite [1].
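The tag-limiting idea can be sketched as a per-class cap on the share of tags a request class may hold. The fractions below (50% for async I/O, 75% for sync writes) and all names are assumptions chosen for illustration; the message above only states that sync writes get a higher threshold than async I/O.

```c
#include <assert.h>

/* Hypothetical request classes. */
enum io_kind {
	IO_SYNC_READ,
	IO_SYNC_WRITE,
	IO_ASYNC
};

/* Never return a zero budget, so that every class can make progress. */
static unsigned int at_least_one(unsigned int v)
{
	return v > 0 ? v : 1;
}

/*
 * Maximum number of tags a class may hold out of total_tags.
 * Assumed shares: async I/O 50%, sync writes 75%, sync reads unlimited.
 */
static unsigned int max_tags_for(enum io_kind kind, unsigned int total_tags)
{
	assert(total_tags > 0);
	switch (kind) {
	case IO_ASYNC:
		return at_least_one(total_tags / 2);
	case IO_SYNC_WRITE:
		return at_least_one(total_tags * 3 / 4);
	default:
		return total_tags;
	}
}
```

With 64 tags, this model always leaves at least 16 tags that only sync reads can use, and at least 32 that async I/O cannot take.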
    [1] https://github.com/Algodev-github/S

    Signed-off-by: Paolo Valente

commit 381657acc3d7c4db12b1affadfa4ee5f4d89b2f0
Author: Chiara Bruschi
Date:   Mon Dec 11 18:55:26 2017 +0100

    block, bfq-sq, bfq-mq: specify usage condition of delta_us in bfq_log_bfqq call

    Inside the function bfq_completed_request the value of a variable called delta_us is computed as current request completion time. delta_us is used inside a call to the function bfq_log_bfqq as divisor in a division operation to compute a rate value, but no check makes sure that delta_us has a non-zero value. A divisor with value 0 leads to a division error that could result in a kernel oops (therefore unstable/unreliable system state) and consequently cause kernel panic if resources are unavailable after the system fault.

    This commit fixes this call to bfq_log_bfqq by specifying the condition that allows delta_us to be safely used as divisor.

    Signed-off-by: Paolo Valente
    Signed-off-by: Chiara Bruschi

commit f2ed7bab2114e6c4d81d1e440e231e947d43c328
Author: Paolo Valente
Date:   Thu Nov 30 17:48:28 2017 +0100

    block, bfq-sq, bfq-mq: increase threshold to deem I/O as random

    If two processes do I/O close to each other, i.e., are cooperating processes in BFQ (and CFQ's) nomenclature, then BFQ merges their associated bfq_queues, so as to get sequential I/O from the union of the I/O requests of the processes, and thus reach a higher throughput. A merged queue is then split if its I/O stops being sequential. In this respect, BFQ deems the I/O of a bfq_queue as (mostly) sequential only if fewer than 4 I/O requests are random, out of the last 32 requests inserted into the queue.

    Unfortunately, extensive testing (with the interleaved_io benchmark of the S suite [1], and with real applications spawning cooperating processes) has clearly shown that, with such a low threshold, only a rather low I/O throughput may be reached when several cooperating processes do I/O.
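The "fewer than N random requests out of the last 32" test can be sketched with a 32-bit shift register, one bit per request. This is an illustrative model only: the struct and function names are hypothetical, and the threshold is passed as a parameter so that the value quoted above (4) can be tried alongside larger ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-queue history: bit i is 1 if the i-th last request was random. */
struct seek_state {
	uint32_t history;
};

/* Shift in the newest request; the 33rd-oldest one falls off the top. */
static void record_request(struct seek_state *s, bool is_random)
{
	assert(s);
	s->history = (s->history << 1) | (is_random ? 1u : 0u);
}

/* Count the set bits, i.e., the random requests among the last 32. */
static unsigned int count_random(uint32_t history)
{
	unsigned int n = 0;

	while (history) {
		history &= history - 1; /* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Sequential if fewer than max_random of the last 32 requests were random. */
static bool mostly_sequential(const struct seek_state *s, unsigned int max_random)
{
	assert(s);
	return count_random(s->history) < max_random;
}
```

Under this model, a merged queue serving several interleaved processes can easily accumulate 4 random requests in a window of 32, which is why a low threshold causes frequent splits.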
    In particular, the outcome of each test run was bimodal: if queue merging occurred and was stable during the test, then the throughput was close to the peak rate of the storage device, otherwise the throughput was arbitrarily low (usually around 1/10 of the peak rate with a rotational device). The probability of getting the unlucky outcomes grew with the number of cooperating processes: it was already significant with 5 processes, and close to one with 7 or more processes.

    The cause of the low throughput in the unlucky runs was that the merged queues containing the I/O of these cooperating processes were soon split, because they contained more random I/O requests than those tolerated by the 4/32 threshold, but
    - that I/O would nevertheless have allowed the storage device to reach peak throughput or almost peak throughput;
    - in contrast, the I/O of these processes, if served individually (from separate queues), yielded a rather low throughput.

    So we repeated our tests with increasing values of the threshold, until we found the minimum value (19) for which we obtained maximum throughput, reliably, with at least up to 9 cooperating processes. Then we checked that the use of that higher threshold value did not cause any regression for any other benchmark in the suite [1]. This commit raises the threshold to such a higher value.

    [1] https://github.com/Algodev-github/S

    Signed-off-by: Angelo Ruocco
    Signed-off-by: Paolo Valente

commit 0ca16110f19e9ae0ee449e5ec82a40e9813075da
Author: Angelo Ruocco
Date:   Mon Dec 11 14:19:54 2017 +0100

    block, bfq-sq, bfq-mq: remove superfluous check in queue-merging setup

    When two or more processes do I/O in such a way that their requests are sequential with respect to one another, BFQ merges the bfq_queues associated with the processes. This way the overall I/O pattern becomes sequential, and thus there is a boost in throughput. These cooperating processes usually start or restart to do I/O shortly after each other.
    So, in order to avoid merging non-cooperating processes, BFQ ensures that none of these queues has been in weight raising for too long.

    In this respect, from commit "block, bfq-sq, bfq-mq: let a queue be merged only shortly after being created", BFQ checks whether any queue (and not only weight-raised ones) has been doing I/O continuously for too long to be merged.

    This new additional check makes the first one useless: a queue doing I/O for long enough, if being weight-raised, is also a queue in weight raising for too long to be merged. Accordingly, this commit removes the first check.

    Signed-off-by: Angelo Ruocco
    Signed-off-by: Paolo Valente

commit 5dc48d90c68e320fc3737cca6a3157538c539c1d
Author: Paolo Valente
Date:   Fri Oct 27 11:12:14 2017 +0200

    block, bfq-sq, bfq-mq: let a queue be merged only shortly after starting I/O

    In BFQ and CFQ, two processes are said to be cooperating if they do I/O in such a way that the union of their I/O requests yields a sequential I/O pattern. To get such a sequential I/O pattern out of the non-sequential pattern of each cooperating process, BFQ and CFQ merge the queues associated with these processes. In more detail, cooperating processes, and thus their associated queues, usually start, or restart, to do I/O shortly after each other. This is the case, e.g., for the I/O threads of KVM/QEMU and of the dump utility. Based on this assumption, this commit allows a bfq_queue to be merged only during a short time interval (100ms) after it starts, or re-starts, to do I/O.

    This filtering provides two important benefits. First, it greatly reduces the probability that two non-cooperating processes have their queues merged by mistake, if they just happen to do I/O close to each other for a short time interval. These spurious merges cause loss of service guarantees.
    A low-weight bfq_queue may unjustly get more than its expected share of the throughput: if such a low-weight queue is merged with a high-weight queue, then the I/O for the low-weight queue is served as if the queue had a high weight. This may damage other high-weight queues unexpectedly. For instance, because of this issue, lxterminal occasionally took 7.5 seconds to start, instead of 6.5 seconds, when some sequential readers and writers did I/O in the background on a FUJITSU MHX2300BT HDD. The reason is that the bfq_queues associated with some of the readers or the writers were merged with the high-weight queues of some processes that had to do some urgent but little I/O. The readers then exploited the inherited high weight for all or most of their I/O, during the start-up of the terminal. The filtering introduced by this commit eliminated any outlier caused by spurious queue merges in our start-up time tests.

    This filtering also provides a little boost of the throughput sustainable by BFQ: 3-4%, depending on the CPU. The reason is that, once a bfq_queue cannot be merged any longer, this commit makes BFQ stop updating the data needed to handle merging for the queue.

    Signed-off-by: Paolo Valente
    Signed-off-by: Angelo Ruocco

commit 46ce0cab1c4f42aea0da953d586629158a963011
Author: Angelo Ruocco
Date:   Mon Dec 18 08:28:08 2017 +0100

    block, bfq-sq, bfq-mq: check low_latency flag in bfq_bfqq_save_state()

    A just-created bfq_queue will certainly be deemed as interactive on the arrival of its first I/O request, if the low_latency flag is set. Yet, if the queue is merged with another queue on the arrival of its first I/O request, it will not have the chance to be flagged as interactive. Nevertheless, if the queue is then split soon enough, it has to be flagged as interactive after the split.

    To handle this early-merge scenario correctly, BFQ saves the state of the queue, on the merge, as if the latter had already been deemed interactive.
    So, if the queue is split soon, it will get weight-raised, because the previous state of the queue is resumed on the split.

    Unfortunately, in the act of saving the state of the newly-created queue, BFQ doesn't check whether the low_latency flag is set, and this causes early-merged queues to be then weight-raised, on queue splits, even if low_latency is off. This commit addresses this problem by adding the missing check.

    Signed-off-by: Angelo Ruocco
    Signed-off-by: Paolo Valente

commit ae29ae99e23b636901c3c063acbcc7762ccfa340
Author: Paolo Valente
Date:   Thu Nov 16 18:38:13 2017 +0100

    block, bfq-sq, bfq-mq: add missing rq_pos_tree update on rq removal

    If two processes do I/O close to each other, then BFQ merges the bfq_queues associated with these processes, to get a more sequential I/O, and thus a higher throughput. In this respect, to detect whether two processes are doing I/O close to each other, BFQ keeps a list of the head-of-line I/O requests of all active bfq_queues. The list is ordered by initial sectors, and implemented through a red-black tree (rq_pos_tree).

    Unfortunately, the update of the rq_pos_tree was incomplete, because the tree was not updated on the removal of the head-of-line I/O request of a bfq_queue, in case the queue did not remain empty. This commit adds the missing update.

    Signed-off-by: Paolo Valente
    Signed-off-by: Angelo Ruocco

commit 6c68a1eaa33ba2ea92e5145b085f85d6a2f77485
Author: Chiara Bruschi
Date:   Thu Dec 7 09:57:19 2017 +0100

    block, bfq-mq: fix occurrences of request prepare/finish methods' old names

    Commits 'b01f1fa3bb19' (Port of "blk-mq-sched: unify request prepare methods") and 'cc10d2d7d2c1' (Port of "blk-mq-sched: unify request finished methods") changed the old names of current bfq_prepare_request and bfq_finish_request methods, but left them unchanged elsewhere in the code (related comments, part of function name bfq_put_rq_priv_body).
    This commit fixes every occurrence of the old names of these methods by changing them into the current names.

    Fixes: b01f1fa3bb19 (Port of "blk-mq-sched: unify request prepare methods")
    Fixes: cc10d2d7d2c1 (Port of "blk-mq-sched: unify request finished methods")
    Reviewed-by: Paolo Valente
    Signed-off-by: Federico Motta
    Signed-off-by: Chiara Bruschi

commit cd9a1069d2b7ef26d4e413ec6573cc023196b234
Author: Paolo Valente
Date:   Sun Nov 12 22:43:46 2017 +0100

    block, bfq-sq, bfq-mq: consider also past I/O in soft real-time detection

    BFQ privileges the I/O of soft real-time applications, such as video players, to guarantee these applications a high bandwidth and a low latency. In this respect, it is not easy to correctly detect when an application is soft real-time. A particularly nasty false positive is that of an I/O-bound application that occasionally happens to meet all requirements to be deemed as soft real-time. After being detected as soft real-time, such an application monopolizes the device. Fortunately, BFQ will realize soon that the application is actually not soft real-time and suspend every privilege. Yet, the application may happen again to be wrongly detected as soft real-time, and so on.

    As highlighted by our tests, this problem causes BFQ to occasionally fail to guarantee a high responsiveness, in the presence of heavy background I/O workloads. The reason is that the background workload happens to be detected as soft real-time, more or less frequently, during the execution of the interactive task under test. To give an idea, because of this problem, LibreOffice Writer occasionally takes 8 seconds, instead of 3, to start up, if there are sequential reads and writes in the background, on a Kingston SSDNow V300.

    This commit addresses this issue by leveraging the following facts.
    The reason why some applications are detected as soft real-time despite all BFQ checks to avoid false positives, is simply that, during high CPU or storage-device load, I/O-bound applications may happen to do I/O slowly enough to meet all soft real-time requirements, and pass all BFQ extra checks. Yet, this happens only for limited time periods: slow-speed time intervals are usually interspersed between other time intervals during which these applications do I/O at a very high speed. To exploit these facts, this commit introduces a little change, in the detection of soft real-time behavior, to systematically consider also the recent past: the higher the speed was in the recent past, the later next I/O should arrive for the application to be considered as soft real-time. At the beginning of a slow-speed interval, the minimum arrival time allowed for the next I/O usually happens to still be so high as to fall *after* the end of the slow-speed period itself. As a consequence, the application does not risk being deemed as soft real-time during the slow-speed interval. Then, during the next high-speed interval, the application cannot, evidently, be deemed as soft real-time (exactly because of its speed), and so on.

    This extra filtering proved to be rather effective: in the above test, the frequency of false positives became so low that the start-up time was 3 seconds in all iterations (apart from occasional outliers, caused by page-cache-management issues, which are out of the scope of this commit, and cannot be solved by an I/O scheduler).

    Signed-off-by: Paolo Valente
    Signed-off-by: Angelo Ruocco

commit 47b8e103d248bdc959ed539a14eec1792d18029f
Author: Paolo Valente
Date:   Tue Nov 14 08:28:45 2017 +0100

    block, bfq-mq: turn BUG_ON on request-size into WARN_ON

    BFQ has many checks of internal and external consistency. One of them checks that an I/O request still has sectors to serve, if it happens to be retired without being served.
    If the request has no sector to serve, a BUG_ON signals the failure and causes the kernel to terminate. Yet, from a crash report by a user [1], this condition may happen to hold, in apparently correct functioning, for I/O with a CD/DVD.

    To address this issue, this commit turns the above BUG_ON into a WARN_ON. This commit also adds a companion WARN_ON on request insertion into the scheduler.

    [1] https://groups.google.com/d/msg/bfq-iosched/DDOTJBroBa4/VyU1zUFtCgAJ

    Reported-by: Alexandre Frade
    Signed-off-by: Paolo Valente

commit 5f6d39d6a993dd0f4e224961a593658b69a7b1ff
Author: Luca Miccio
Date:   Wed Nov 8 19:07:41 2017 +0100

    block, bfq-sq, bfq-mq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP

    BFQ (both bfq-mq and bfq-sq) currently creates, and updates, its own instance of the whole set of blkio statistics that cfq creates. Yet, from the comments of Tejun Heo in [1], it turned out that most of these statistics are meant/useful only for debugging. This commit makes BFQ create the latter, debugging statistics only if the option CONFIG_DEBUG_BLK_CGROUP is set.

    By doing so, this commit also enables BFQ to enjoy a high performance boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then BFQ has to update far fewer statistics, and, in particular, not the heaviest to update. To give an idea of the benefits, if CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on null_blk (configured with 0 latency), the throughput of bfq-mq grows from 310 to 400 KIOPS (+30%). We have measured similar or even much higher boosts with other CPUs: e.g., +45% with an ARM Cortex-A53 Octa-core. Our results have been obtained and can be reproduced very easily with the script in [1].
    [1] https://www.spinics.net/lists/linux-block/msg18943.html

    Suggested-by: Tejun Heo
    Suggested-by: Ulf Hansson
    Signed-off-by: Luca Miccio
    Signed-off-by: Paolo Valente

commit e2e9dc51e0279f9fc9c1f3e84a96d6629902cb7b
Author: Paolo Valente
Date:   Wed Nov 8 19:07:40 2017 +0100

    block, bfq-mq: update blkio stats outside the scheduler lock

    bfq-mq invokes various blkg_*stats_* functions to update the statistics contained in the special files blkio.bfq-mq.* in the blkio controller groups, i.e., the I/O accounting related to the proportional-share policy provided by bfq-mq. The execution of these functions takes a considerable percentage, about 40%, of the total per-request execution time of bfq-mq (i.e., of the sum of the execution time of all the bfq-mq functions that have to be executed to process an I/O request from its creation to its destruction). This reduces the request-processing rate sustainable by bfq-mq noticeably, even on a multicore CPU. In fact, the bfq-mq functions that invoke blkg_*stats_* functions cannot be executed in parallel with the rest of the code of bfq-mq, because both are executed under the same per-device scheduler lock.

    To reduce this slowdown, this commit moves, wherever possible, the invocation of these functions (more precisely, of the bfq-mq functions that invoke blkg_*stats_* functions) outside the critical sections protected by the scheduler lock.

    With this change, and with all blkio.bfq-mq.* statistics enabled, the throughput grows, e.g., from 250 to 310 KIOPS (+25%) on an Intel i7-4850HQ, in case of 8 threads doing random I/O in parallel on null_blk, with the latter configured with 0 latency. We obtained the same or higher throughput boosts, up to +30%, with other processors (some figures are reported in the documentation). For our tests, we used the script [1], with which our results can be easily reproduced.

    NOTE. This commit still protects the invocation of blkg_*stats_* functions with the request_queue lock, because the group these functions are invoked on may otherwise disappear before or while these functions are executed. Fortunately, tests without even this lock show, by difference, that the serialization caused by this lock has only a small impact (at most ~5% of throughput reduction).

    [1] https://github.com/Algodev-github/IOSpeed

    Signed-off-by: Paolo Valente
    Signed-off-by: Luca Miccio

commit dc032e32e9fb70957ca2633b65356a5de13e6116
Author: Luca Miccio
Date:   Tue Oct 31 09:50:11 2017 +0100

    block, bfq-mq: add missing invocations of bfqg_stats_update_io_add/remove

    bfqg_stats_update_io_add and bfqg_stats_update_io_remove are to be invoked, respectively, when an I/O request enters and when an I/O request exits the scheduler. Unfortunately, bfq-mq does not fully comply with this scheme, because it does not invoke these functions for requests that are inserted into or extracted from its priority dispatch list. This commit fixes this mistake.

    Signed-off-by: Paolo Valente
    Signed-off-by: Luca Miccio

commit 3789b09e981f1c21bc2514a0128cc650c0d19a8e
Author: Paolo Valente
Date:   Mon Oct 30 16:50:50 2017 +0100

    doc, block, bfq-mq: update max IOPS sustainable with BFQ

    We have investigated more deeply the performance of BFQ, in terms of number of IOPS that can be processed by the CPU when BFQ is used as I/O scheduler. In more detail, using the script [1], we have measured the number of IOPS reached on top of a null block device configured with zero latency, as a function of the workload (sequential read, sequential write, random read, random write) and of the system (we considered desktops, laptops and embedded systems).

    Based on the resulting figures, with this commit we update the current, conservative IOPS range reported in BFQ documentation.
    In particular, the documentation now reports, for each of three different systems, the lowest number of IOPS obtained for that system with the above test (namely, the value obtained with the workload leading to the lowest IOPS).

    [1] https://github.com/Algodev-github/IOSpeed

    Signed-off-by: Paolo Valente
    Signed-off-by: Luca Miccio

commit 6de745c4f70673d43c984f6bfc8b73ec8333e4ab
Author: Paolo Valente
Date:   Fri Oct 6 19:35:38 2017 +0200

    bfq-sq, bfq-mq: fix unbalanced decrements of burst size

    The commit "bfq-sq, bfq-mq: decrease burst size when queues in burst exit" introduced the decrement of burst_size on the removal of a bfq_queue from the burst list. Unfortunately, this decrement can happen to be performed even when burst size is already equal to 0, because of unbalanced decrements. A description follows of the cause of these unbalanced decrements, namely a wrong assumption, and of how this wrong assumption leads to unbalanced decrements.

    The wrong assumption is that a bfq_queue can exit only if the process associated with the bfq_queue has exited. This is false, because a bfq_queue, say Q, may exit also as a consequence of a merge with another bfq_queue. In this case, Q exits because the I/O of its associated process has been redirected to another bfq_queue.

    The decrement unbalance occurs because Q may then be re-created after a split, and added back to the current burst list, *without* incrementing burst_size. burst_size is not incremented because Q is not a new bfq_queue added to the burst list, but a bfq_queue only temporarily removed from the list, and, before the commit "bfq-sq, bfq-mq: decrease burst size when queues in burst exit", burst_size was not decremented when Q was removed.

    This commit addresses this issue by just checking whether the exiting bfq_queue is a merged bfq_queue, and, in that case, not decrementing burst_size.
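The merged-queue check can be sketched as below. This is a simplified model with hypothetical names (struct sched, struct queue, queue_exit), not the code of the commit. The sketch additionally refuses to decrement a counter that is already 0, purely as a defensive measure in this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical scheduler-wide state. */
struct sched {
	unsigned int burst_size; /* queues counted in the current burst */
};

/* Hypothetical per-queue state. */
struct queue {
	bool in_burst_list; /* queue is linked into the burst list */
	bool merged;        /* queue exits because its I/O was redirected */
};

/* Remove an exiting queue from the burst list and fix up the counter. */
static void queue_exit(struct sched *sd, struct queue *q)
{
	assert(sd && q);
	if (!q->in_burst_list)
		return;
	q->in_burst_list = false;

	/*
	 * A merged queue may come back after a split without being
	 * counted again, so its exit must not decrement the counter.
	 */
	if (!q->merged && sd->burst_size > 0)
		sd->burst_size--;
}
```

Only the exit of a queue whose process really terminated shrinks the burst in this model; a merge leaves the count untouched.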
Unfortunately, this still leaves room for unbalanced decrements, in the following rarer case: on a split, the bfq_queue happens to be inserted into a different burst list than the one it was removed from when merged. If this happens, the number of elements in the new burst list becomes higher than burst_size (by one). When the bfq_queue then exits, it is of course not in a merged state any longer, thus burst_size is decremented, which results in an unbalanced decrement. To handle this sporadic, unlucky case in a simple way, this commit also checks that burst_size is larger than 0 before decrementing it. Finally, this commit removes a useless, extra check: the check that the bfq_queue is sync, performed before checking whether the bfq_queue is in the burst list. This extra check is redundant, because only sync bfq_queues can be inserted into the burst list. Reported-by: Philip Müller Signed-off-by: Paolo Valente Signed-off-by: Angelo Ruocco Tested-by: Philip Müller Tested-by: Oleksandr Natalenko Tested-by: Lee Tibbert commit 6ce79f28f5a2bd610fcbaa962c037aaa0312193e Author: omcira Date: Mon Sep 18 10:49:48 2017 +0200 bfq-sq, bfq-mq: decrease burst size when queues in burst exit If many queues belonging to the same group happen to be created shortly after each other, then the concurrent processes associated with these queues have typically a common goal, and they get it done as soon as possible if not hampered by device idling. Examples are processes spawned by git grep, or by systemd during boot. As for device idling, this mechanism is currently necessary for weight raising to succeed in its goal: privileging I/O. In view of these facts, BFQ does not provide the above queues with either weight raising or device idling. On the other hand, a burst of queue creations may be caused also by the start-up of a complex application. In this case, these queues need usually to be served one after the other, and as quickly as possible, to maximise responsiveness.
Therefore, in this case the best strategy is to weight-raise all the queues created during the burst, i.e., the exact opposite of the strategy for the above case. To distinguish between the two cases, BFQ uses an empirical burst-size threshold, found through extensive tests and monitoring of daily usage. Only large bursts, i.e., bursts with a size above this threshold, are considered as generated by a high number of parallel processes. In this respect, upstart-based boot proved to be rather hard to detect as generating a large burst of queue creations, because with upstart most of the queues created in a burst exit *before* the next queues in the same burst are created. To address this issue, I changed the burst-detection mechanism so as to not decrease the size of the current burst even if one of the queues in the burst is eliminated. Unfortunately, this missing decrease causes false positives on very fast systems: on the start-up of a complex application, such as libreoffice writer, so many queues are created, served and exited shortly after each other, that a large burst of queue creations is wrongly detected as occurring. These false positives just disappear if the size of a burst is decreased when one of the queues in the burst exits. This commit restores the missing burst-size decrease, relying on the fact that upstart is apparently unlikely to be used on systems running this and future versions of the kernel. Signed-off-by: Paolo Valente Signed-off-by: Mauro Andreolini Signed-off-by: Angelo Ruocco Tested-by: Mirko Montanari commit 2423ae7522926e89cf3dd5a32a7ded8daa433264 Author: Paolo Valente Date: Fri Sep 15 04:58:33 2017 -0400 bfq-sq, bfq-mq: let early-merged queues be weight-raised on split too A just-created bfq_queue, say Q, may happen to be merged with another bfq_queue on the very first invocation of the function __bfq_insert_request.
In such a case, even if Q would clearly deserve interactive weight raising (as it has just been created), the function bfq_add_request does not get invoked for Q, and thus does not activate weight raising for Q. As a consequence, when the state of Q is saved for a possible future restore, after a split of Q from the other bfq_queue(s), such a state happens to be (unjustly) non-weight-raised. Then the bfq_queue will not enjoy any weight raising on the split, even if it should still be in an interactive weight-raising period when the split occurs. This commit solves this problem as follows, for a just-created bfq_queue that is being early-merged: it stores directly, in the saved state of the bfq_queue, the weight-raising state that would have been assigned to the bfq_queue if not early-merged. Signed-off-by: Paolo Valente Tested-by: Angelo Ruocco Tested-by: Mirko Montanari commit 6fa9ada0547c5bd2ec064b6de6f0c9b9990c5721 Author: Paolo Valente Date: Fri Sep 15 01:53:51 2017 -0400 bfq-sq, bfq-mq: check and switch back to interactive wr also on queue split As already explained in the message of commit "bfq-mq, bfq-sq: fix wrong init of saved start time for weight raising", if a soft real-time weight-raising period happens to be nested in a larger interactive weight-raising period, then BFQ restores the interactive weight raising at the end of the soft real-time weight raising. In particular, BFQ checks whether the latter has ended only on request dispatches. Unfortunately, the above scheme fails to restore interactive weight raising in the following corner case: if a bfq_queue, say Q, 1) Is merged with another bfq_queue while it is in a nested soft real-time weight-raising period. The weight-raising state of Q is then saved, and not considered any longer until a split occurs. 2) Is split from the other bfq_queue(s) at a time instant when its soft real-time weight raising is already finished.
On the split, while resuming the previous, soft real-time weight-raised state of the bfq_queue Q, BFQ checks whether the current soft real-time weight-raising period is actually over. If so, BFQ switches weight raising off for Q, *without* checking whether the soft real-time period was actually nested in a non-yet-finished interactive weight-raising period. This commit addresses this issue by adding the above missing check in bfq_queue splits, and restoring interactive weight raising if needed. Signed-off-by: Paolo Valente Tested-by: Angelo Ruocco Tested-by: Mirko Montanari commit 39e65b374475db1959c8d1becb3ede209d2e133d Author: Paolo Valente Date: Thu Sep 14 05:12:58 2017 -0400 Fix commit "Unnest request-queue and ioc locks from scheduler locks" The commit "Unnest request-queue and ioc locks from scheduler locks" mistakenly removed the setting of the split flag in function bfq_prepare_request. This commit puts this missing instruction back in its place. Signed-off-by: Paolo Valente commit 5c21add0b9850e35c30c07b663e6c9617a9f5772 Author: Paolo Valente Date: Tue Sep 12 16:45:53 2017 +0200 bfq-mq, bfq-sq: fix wrong init of saved start time for weight raising This commit fixes a bug that causes bfq to fail to guarantee a high responsiveness on some drives, if there is heavy random read+write I/O in the background. More precisely, such a failure allowed this bug to be found [1], but the bug may well cause other yet unreported anomalies. BFQ raises the weight of the bfq_queues associated with soft real-time applications, to privilege the I/O, and thus reduce latency, for these applications. This mechanism is named soft-real-time weight raising in BFQ. A soft real-time period may happen to be nested into an interactive weight raising period, i.e., it may happen that, when a bfq_queue switches to a soft real-time weight-raised state, the bfq_queue is already being weight-raised because deemed interactive too. 
In this case, BFQ saves in a special variable wr_start_at_switch_to_srt, the time instant when the interactive weight-raising period started for the bfq_queue, i.e., the time instant when BFQ started to deem the bfq_queue interactive. This value is then used to check whether the interactive weight-raising period would still be in progress when the soft real-time weight-raising period ends. If so, interactive weight raising is restored for the bfq_queue. This restore is useful, in particular, because it prevents bfq_queues from losing their interactive weight raising prematurely, as a consequence of spurious, short-lived soft real-time weight-raising periods caused by wrong detections as soft real-time. If, instead, a bfq_queue switches to soft-real-time weight raising while it *is not* already in an interactive weight-raising period, then the variable wr_start_at_switch_to_srt has no meaning during the following soft real-time weight-raising period. Unfortunately the handling of this case is wrong in BFQ: not only the variable is not flagged somehow as meaningless, but it is also set to the time when the switch to soft real-time weight-raising occurs. This may cause an interactive weight-raising period to be considered mistakenly as still in progress, and thus a spurious interactive weight-raising period to start for the bfq_queue, at the end of the soft-real-time weight-raising period. In particular the spurious interactive weight-raising period will be considered as still in progress, if the soft-real-time weight-raising period does not last very long. The bfq_queue will then be wrongly privileged and, if I/O bound, will unjustly steal bandwidth to truly interactive or soft real-time bfq_queues, harming responsiveness and low latency. 
This commit fixes this issue by just setting wr_start_at_switch_to_srt to minus infinity (farthest past time instant according to jiffies macros): when the soft-real-time weight-raising period ends, certainly no interactive weight-raising period will be considered as still in progress. [1] Background I/O Type: Random - Background I/O mix: Reads and writes - Application to start: LibreOffice Writer in http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-Laptop Signed-off-by: Paolo Valente Signed-off-by: Angelo Ruocco Tested-by: Oleksandr Natalenko Tested-by: Lee Tibbert Tested-by: Mirko Montanari commit 47bb6da80b79e7f1ca207c043201a5bb2526dff9 Author: Luca Miccio Date: Wed Sep 13 12:03:56 2017 +0200 bfq-mq, bfq-sq: Disable writeback throttling Similarly to CFQ, BFQ has its write-throttling heuristics, and it is better not to combine them with further write-throttling heuristics of a different nature. So this commit disables write-back throttling for a device if BFQ is used as I/O scheduler for that device. Signed-off-by: Luca Miccio Signed-off-by: Paolo Valente Tested-by: Oleksandr Natalenko commit 30f77b90f9ceb3c2a157e14bf1ebfb79c605ff6e Author: Paolo Valente Date: Thu Aug 31 19:24:26 2017 +0200 doc, block, bfq: fix some typos and stale sentences Signed-off-by: Paolo Valente Reviewed-by: Jeremy Hickman Reviewed-by: Laurentiu Nicola commit a8e82c7dce81509ce3919c51530527b8612e0367 Author: Paolo Valente Date: Thu Aug 10 08:15:50 2017 +0200 bfq-sq-mq: guarantee update_next_in_service always returns an eligible entity If the function bfq_update_next_in_service is invoked as a consequence of the activation or requeueing of an entity, say E, then it doesn't invoke bfq_lookup_next_entity to get the next-in-service entity. 
In contrast, it follows a shorter path: if E happens to be eligible (see commit "bfq-sq-mq: make lookup_next_entity push up vtime on expirations" for details on eligibility) and to have a lower virtual finish time than the current candidate as next-in-service entity, then E directly becomes the next-in-service entity. Unfortunately, there is a corner case for which this shorter path makes bfq_update_next_in_service choose a non eligible entity: it occurs if both E and the current next-in-service entity happen to be non eligible when bfq_update_next_in_service is invoked. In this case, E is not set as next-in-service, and, since bfq_lookup_next_entity is not invoked, the state of the parent entity is not updated so as to end up with an eligible entity as the proper next-in-service entity. In this respect, next-in-service is actually allowed to be non eligible while some queue is in service: since no system-virtual-time push-up can be performed in that case (see again commit "bfq-sq-mq: make lookup_next_entity push up vtime on expirations" for details), next-in-service is chosen, speculatively, as a function of the possible value that the system virtual time may get after a push up. But the correctness of the schedule breaks if next-in-service is still a non eligible entity when it is time to set in service the next entity. Unfortunately, this may happen in the above corner case. This commit fixes this problem by making bfq_update_next_in_service invoke bfq_lookup_next_entity not only if the above shorter path cannot be taken, but also if the shorter path is taken but fails to yield an eligible next-in-service entity. 
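The fallback described above can be sketched in a few lines of plain C. This is only an illustrative model, not the bfq code: struct entity, eligible(), full_lookup() and update_next() are hypothetical names, the active tree is reduced to an array, and the case in which a queue is still in service (where a non-eligible candidate is tolerated) is ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified entity: only what the sketch needs. */
struct entity {
	uint64_t start;   /* virtual start time */
	uint64_t finish;  /* virtual finish time */
};

/* An entity is eligible once the system virtual time reached its start. */
static bool eligible(const struct entity *e, uint64_t vtime)
{
	return e != NULL && e->start <= vtime;
}

/*
 * Stand-in for the full lookup: push the virtual time up to the minimum
 * start time if it is lower than all start times, then return the
 * eligible entity with the lowest finish time.
 */
static struct entity *full_lookup(struct entity **active, size_t n,
				  uint64_t *vtime)
{
	uint64_t min_start = UINT64_MAX;
	struct entity *best = NULL;
	size_t i;

	for (i = 0; i < n; i++)
		if (active[i]->start < min_start)
			min_start = active[i]->start;
	if (n > 0 && *vtime < min_start)
		*vtime = min_start;

	for (i = 0; i < n; i++)
		if (eligible(active[i], *vtime) &&
		    (best == NULL || active[i]->finish < best->finish))
			best = active[i];
	return best;
}

/* Shorter path first; full lookup if it yields no eligible entity. */
static struct entity *update_next(struct entity *next, struct entity *e,
				  struct entity **active, size_t n,
				  uint64_t *vtime)
{
	if (eligible(e, *vtime) && (next == NULL || e->finish < next->finish))
		next = e;
	if (!eligible(next, *vtime))
		next = full_lookup(active, n, vtime);
	return next;
}
```

With two non-eligible entities, the shorter path alone would leave the old, non-eligible candidate in place; with the fallback, the lookup pushes up the virtual time and returns an eligible entity.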
Signed-off-by: Paolo Valente commit 9c28595fce38ea2df71eb93652d50eef00fcbad9 Author: Paolo Valente Date: Wed Aug 9 22:53:00 2017 +0200 bfq-sq-mq: remove direct switch to an entity in higher class If the function bfq_update_next_in_service is invoked as a consequence of the activation or requeueing of an entity, say E, and finds out that E belongs to a higher-priority class than that of the current next-in-service entity, then it sets next_in_service directly to E. But this may lead to anomalous schedules, because E may happen not be eligible for service, because its virtual start time is higher than the system virtual time for its service tree. This commit addresses this issue by simply removing this direct switch. Signed-off-by: Paolo Valente commit 6fc422923e6dd7cc693f32f26dc9d94f5a52f09c Author: Paolo Valente Date: Wed Aug 9 22:29:01 2017 +0200 bfq-sq-mq: make lookup_next_entity push up vtime on expirations To provide a very smooth service, bfq starts to serve a bfq_queue only if the queue is 'eligible', i.e., if the same queue would have started to be served in the ideal, perfectly fair system that bfq simulates internally. This is obtained by associating each queue with a virtual start time, and by computing a special system virtual time quantity: a queue is eligible only if the system virtual time has reached the virtual start time of the queue. Finally, bfq guarantees that, when a new queue must be set in service, there is always at least one eligible entity for each active parent entity in the scheduler. To provide this guarantee, the function __bfq_lookup_next_entity pushes up, for each parent entity on which it is invoked, the system virtual time to the minimum among the virtual start times of the entities in the active tree for the parent entity (more precisely, the push up occurs if the system virtual time happens to be lower than all such virtual start times). 
There is however a circumstance in which __bfq_lookup_next_entity cannot push up the system virtual time for a parent entity, even if the system virtual time is lower than the virtual start times of all the child entities in the active tree. It happens if one of the child entities is in service. In fact, in such a case, there is already an eligible entity, the in-service one, even if it may not be present in the active tree (because in-service entities may be removed from the active tree). Unfortunately, in the last re-design of the hierarchical-scheduling engine, the reset of the pointer to the in-service entity for a given parent entity--reset to be done as a consequence of the expiration of the in-service entity--always happens after the function __bfq_lookup_next_entity has been invoked. This causes the function to think that there is still an entity in service for the parent entity, and then that the system virtual time cannot be pushed up, even if actually such a no-more-in-service entity has already been properly reinserted into the active tree (or in some other tree if no more active). Yet, the system virtual time *had* to be pushed up, to be ready to correctly choose the next queue to serve. Because of the lack of this push up, bfq may wrongly set in service a queue that had been speculatively pre-computed as the possible next-in-service queue, but that would no more be the one to serve after the expiration and the reinsertion into the active trees of the previously in-service entities. This commit addresses this issue by making __bfq_lookup_next_entity properly push up the system virtual time if an expiration is occurring. Signed-off-by: Paolo Valente commit 8b3c6923d27f1f5dd98d95b94ef7ae25d9758f61 Author: Paolo Valente Date: Wed Aug 9 16:40:39 2017 +0200 bfq-sq: fix commit "Remove all get and put of I/O contexts" in branch bfq-mq The commit "Remove all get and put of I/O contexts" erroneously removed the reset of the field in_service_bic for bfq-sq.
This commit re-adds that missing reset. Signed-off-by: Paolo Valente commit 1b1395aa094da4b2ad0a4526d0a076e4ca4f7d0c Author: Lee Tibbert Date: Wed Jul 19 10:28:32 2017 -0400 Improve most frequently used no-logging path This patch originated as a fix for compiler unused-variable warnings issued when compiling bfq-mq with logging disabled (both CONFIG_BLK_DEV_IO_TRACE and CONFIG_BFQ_REDIRECT_TO_CONSOLE undefined). It turns out to also have benefits for the bfq-sq path as well. In most performance sensitive production builds blktrace_api logging will probably be turned off, so it is worth making the no-logging path compile without warnings. Any performance benefit is a bonus. Thank you to T. B. on the bfq-iosched@googlegroups.com list for ((void) (bfqq)) simplification/suggestion/improvement. All bugs and unclear descriptions are my own doing. The discussion below is based on the gcc compiler with optimization level of at least 02. Lower optimization levels are unlikely to remove no-op instruction equivalents. Provide three improvements in this likely case. 1) Fix multiple occurrences of an unused-variable warning issued when compiling bfq-mq with no logging. The warning occurred each time the bfq_log_bfqg macro was expanded inside a code block such as the following snippet from block/bfq-sched.c, line 139 and few following, lightly edited for indentation in order to pass checkpatch.pl maximum line lengths. else { struct bfq_group *bfqg = container_of(next_in_service, struct bfq_group, entity); bfq_log_bfqg((struct bfq_data *)bfqg->bfqd, bfqg, "update_next_in_service: chosen this entity"); } Previously bfq-mq.h expanded bfq_log_bfqg to blk_add_trace_msg. When both bfq console logging and blktrace_api logging are disabled, include/linux/blktrace_api expands to do { } while (0), leaving the code block local variable unused. bfq_log_bfqq() had similar behavior but is never called with a potentially unused variable. This patch fixes that macro for consistency. 
bfq-sq.h (single queue) with blktrace_api enabled, and the bfq console logging macros have code paths which do not trigger this warning. kernel.org (4.12 & 4.13) bfq (bfq-iosched.h) could trigger the warning but no code does so now. This patch fixes bfq-iosched.h for consistency. The style above enables a software engineering approach where complex expressions are moved to a local variable before the bfq_log* call. This makes it easier to read the expression and use breakpoints to verify it. bfq-mq uses this approach in several places. New bfq_log* macros are provided for the no-logging case. I touch only the second argument, because current code never uses the local variable approach with the first or other arguments. I tried to balance consistency with simplicity. 2) For bfq-sq, reduce to zero the number of instructions executed when no logging is configured. No sense marshaling arguments which are never going to be used. On a trial V8R11 build, this reduced the size of bfq-iosched.o by 14.3 KiB. The size went from 70304 to 55664 bytes. bfq-mq and kernel.org bfq code size does not change because existing macros already optimize to zero bytes when not logging. The current changes maintain consistency with the bfq-sq path and make the bfq-mq & bfq no-logging paths resistant to future logging path macro changes which might cause generated code. 3) Slightly reduce compile time of all bfq variants by including blktrace_api.h only when it will be used.
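As a rough illustration of point 1), the following stand-alone C sketch shows how a no-logging macro can still reference its second argument through a void cast, so that a local variable used only for logging does not trigger an unused-variable warning. The names SKETCH_LOGGING, sketch_log_bfqg, struct sketch_group and choose() are hypothetical and are not the identifiers used in the bfq headers.

```c
#include <stdio.h>

/* Hypothetical switch standing in for the kernel logging config options. */
#ifdef SKETCH_LOGGING
#define sketch_log_bfqg(bfqd, bfqg, ...)			\
	do {							\
		printf("group %p: ", (void *)(bfqg));		\
		printf(__VA_ARGS__);				\
		printf("\n");					\
	} while (0)
#else
/*
 * No-logging case: generate no logging code, but keep a reference to the
 * second argument so the compiler does not see an unused variable.
 */
#define sketch_log_bfqg(bfqd, bfqg, ...) ((void) (bfqg))
#endif

struct sketch_group {
	int id;
};

static int choose(struct sketch_group *candidate)
{
	/* Local variable whose only consumer is the log call below. */
	struct sketch_group *g = candidate;

	sketch_log_bfqg(NULL, g, "chosen this entity");
	return candidate->id;
}
```

Built without SKETCH_LOGGING, the macro expands to a plain void cast of g, which an optimizing compiler is expected to drop entirely.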
Signed-off-by: Lee Tibbert commit 477ffebed78e9b226c8213b7a2c9756e5d6d1bff Author: Paolo Valente Date: Wed Jul 5 21:08:32 2017 +0200 Add to documentation that bfq-mq and bfq-sq contain last fixes too Signed-off-by: Paolo Valente commit ad625152bce044d08691623193de4e5393ab45f3 Author: Paolo Valente Date: Wed Jul 5 16:28:00 2017 +0200 bfq-sq: fix prefix of names of cgroups parameters Signed-off-by: Paolo Valente commit dbe510d28da67cc1eef1d89222cdc1511314f1d3 Author: Paolo Valente Date: Wed Jul 5 12:43:22 2017 +0200 Add list of bfq instances to documentation Signed-off-by: Paolo Valente commit 6a3308a405331ebeaccce0ff9e189e2282741745 Author: Paolo Valente Date: Wed Jul 5 12:02:16 2017 +0200 Port of "blk-mq-sched: unify request prepare methods" This patch makes sure we always allocate requests in the core blk-mq code and use a common prepare_request method to initialize them for both mq I/O schedulers. For Kyber an additional limit_depth method is added that is called before allocating the request. Also because none of the initializations can really fail the new method does not return an error - instead the bfq finish method is hardened to deal with the no-IOC case. Last but not least this removes the abuse of RQF_QUEUE by the blk-mq scheduling code as RQF_ELFPRIV is all that is needed now. Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe commit bef6a9869bfdc88b9b21123af64ee2e2690a5ff9 Author: Paolo Valente Date: Wed Jul 5 11:54:57 2017 +0200 Port of "bfq-iosched: fix NULL ioc check in bfq_get_rq_private" icq_to_bic is a container_of operation, so we need to check for NULL before it. Also move the check outside the spinlock while we're at it. Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe commit f8b456c930200d9d08e98b3aa5d24219edcb9f0c Author: Paolo Valente Date: Wed Jul 5 11:48:17 2017 +0200 Port of "blk-mq-sched: unify request finished methods" No need to have two different callouts of bfq vs kyber.
Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe commit a60da208da5e95899d0446d6a0d9c07d6a7a0dc6 Author: Paolo Valente Date: Sat Jun 17 11:18:11 2017 +0200 bfq-mq: fix macro name in conditional invocation of policy_unregister This commit fixes the name of the macro in the conditional group that invokes blkcg_policy_unregister in bfq_exit for bfq-mq. Because of this error, blkcg_policy_unregister was never invoked. Signed-off-by: Paolo Valente commit edcf696573fde194740b42dec6f2387803d95c02 Author: Paolo Valente Date: Mon May 15 22:25:03 2017 +0200 block, bfq-mq: access and cache blkg data only when safe In blk-cgroup, operations on blkg objects are protected with the request_queue lock. This is no longer the lock that protects I/O-scheduler operations in blk-mq. In fact, the latter are now protected with a finer-grained per-scheduler-instance lock. As a consequence, although blkg lookups are also rcu-protected, blk-mq I/O schedulers may see inconsistent data when they access blkg and blkg-related objects. BFQ does access these objects, and does incur this problem, in the following case. The blkg_lookup performed in bfq_get_queue, being protected (only) through rcu, may happen to return the address of a copy of the original blkg. If this is the case, then the blkg_get performed in bfq_get_queue, to pin down the blkg, is useless: it does not prevent blk-cgroup code from destroying both the original blkg and all objects directly or indirectly referred by the copy of the blkg. BFQ accesses these objects, which typically causes a crash for NULL-pointer dereference or memory-protection violation. Some additional protection mechanism should be added to blk-cgroup to address this issue. In the meantime, this commit provides a quick temporary fix for BFQ: cache (when safe) blkg data that might disappear right after a blkg_lookup. In particular, this commit exploits the following facts to achieve its goal without introducing further locks.
Destroy operations on a blkg invoke, as a first step, hooks of the scheduler associated with the blkg. And these hooks are executed with bfqd->lock held for BFQ. As a consequence, for any blkg associated with the request queue an instance of BFQ is attached to, we are guaranteed that such a blkg is not destroyed, and that all the pointers it contains are consistent, while that instance is holding its bfqd->lock. A blkg_lookup performed with bfqd->lock held then returns a fully consistent blkg, which remains consistent as long as this lock is held. In more detail, this holds even if the returned blkg is a copy of the original one. Finally, also the object describing a group inside BFQ needs to be protected from destruction on the blkg_free of the original blkg (which invokes bfq_pd_free). This commit adds private refcounting for this object, to let it disappear only after no bfq_queue refers to it any longer. This commit also removes or updates some stale comments on locking issues related to blk-cgroup operations.
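The private refcounting mentioned above can be pictured with the following minimal C sketch. It is an assumption-laden model, not the bfq implementation: struct sketch_group, group_alloc(), group_get(), group_put() and the groups_freed counter are hypothetical, and a plain integer counter is used on the assumption that every get and put runs with the scheduler lock held.

```c
#include <stdlib.h>

/* Illustration only: counts how many group objects have been freed. */
static int groups_freed;

/* Hypothetical stand-in for the object describing a group inside BFQ. */
struct sketch_group {
	unsigned int ref;
};

/* The creator (the blkg side in this model) holds the first reference. */
static struct sketch_group *group_alloc(void)
{
	struct sketch_group *g = malloc(sizeof(*g));

	if (g != NULL)
		g->ref = 1;
	return g;
}

/* Taken by each queue that starts referring to the group. */
static void group_get(struct sketch_group *g)
{
	g->ref++;
}

/* Dropped on the free of the blkg and when a queue stops using the group. */
static void group_put(struct sketch_group *g)
{
	if (--g->ref == 0) {
		free(g);
		groups_freed++;
	}
}
```

In this model, the put issued when the blkg is freed does not destroy the group while a queue still holds a reference; the object goes away only on the last put.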
Reported-by: Tomas Konir Reported-by: Lee Tibbert Reported-by: Marco Piazza Signed-off-by: Paolo Valente Tested-by: Tomas Konir Tested-by: Lee Tibbert Tested-by: Marco Piazza commit c0c5d1b59eb8cf950cca664bca9f8599856eda8a Author: Paolo Valente Date: Fri May 12 11:56:13 2017 +0200 Add tentative extra tests on groups, reqs and queues Signed-off-by: Paolo Valente commit 78fa4c2bf53d90cfb7ecaf02a1ffe64b0d984e75 Author: Paolo Valente Date: Fri May 12 09:51:18 2017 +0200 Change cgroup params prefix to bfq-mq for bfq-mq Signed-off-by: Paolo Valente commit 1a9eba362e36eb2210d108a694f0134e0357cb07 Author: Paolo Valente Date: Wed Mar 29 18:55:30 2017 +0200 Fix wrong unlikely Signed-off-by: Paolo Valente commit 0a062b83e7c9f0fce2e8b97ae5b82d8dd7933817 Author: Paolo Valente Date: Wed Mar 29 18:41:46 2017 +0200 BUGFIX: Remove unneeded and deadlock-causing lock in request_merged Signed-off-by: Paolo Valente commit 71f6e5028a42dba448723b7598498ce86338e2c9 Author: Paolo Valente Date: Fri Mar 17 06:15:18 2017 +0100 Remove all get and put of I/O contexts When a bfq queue is set in service and when it is merged, a reference to the I/O context associated with the queue is taken. This reference is then released when the queue is deselected from service or split. More precisely, the release of the reference is postponed to when the scheduler lock is released, to avoid nesting between the scheduler and the I/O-context lock. In fact, such nesting would lead to deadlocks, because of other code paths that take the same locks in the opposite order. This postponing of I/O-context releases does complicate code. This commit addresses this issue by modifying involved operations in such a way to not need to get the above I/O-context references any more. Then it also removes any get and release of these references. 
Signed-off-by: Paolo Valente commit 4f2fc648798a7014d15965f19137dcff8ca9194a Author: Paolo Valente Date: Sat Feb 25 17:38:05 2017 +0100 Complete support for cgroups This commit completes cgroups support for bfq-mq. In particular, it deals with a sort of circular dependency introduced in blk-mq: the function blkcg_activate_policy, invoked during scheduler initialization, triggers the invocation of the has_work scheduler hook (before the init function is finished). To address this issue, this commit moves the invocation of blkcg_activate_policy after the initialization of all the fields that could be initialized before invoking blkcg_activate_policy itself. This enables has_work to correctly return false, and thus to prevent the blk-mq stack from invoking further scheduler hooks before the init function is finished. Signed-off-by: Paolo Valente commit 0175d54c60c1fc2e97f906ba0a34932b9c3d26f5 Author: Paolo Valente Date: Fri Feb 17 14:28:02 2017 +0100 TESTING: Check wrong invocation of merge and put_rq_priv functions Check that merge functions are not invoked on requests queued in the dispatch queue, and that neither put_rq_private is invoked on these requests if, in addition, they have not passed through get_rq_private.
Signed-off-by: Paolo Valente commit b25f3dc28b9cd6fc917c8a109448c3037f8f298c Author: Paolo Valente Date: Fri Mar 3 09:39:35 2017 +0100 Add checks and extra log messages - Part III Signed-off-by: Paolo Valente commit e744879f394ba5363711438a175ab6421cdb5335 Author: Paolo Valente Date: Wed Feb 22 11:30:01 2017 +0100 Fix unbalanced increment of rq_in_driver Signed-off-by: Paolo Valente commit e51e6ed1f65d7ef3234b47f6b8bb759bcc50bbf7 Author: Paolo Valente Date: Fri Mar 3 09:31:14 2017 +0100 Add checks and extra log messages - Part II Signed-off-by: Paolo Valente commit 873fa39c0f444f5beca752bdb3923200ce6c8046 Author: Paolo Valente Date: Tue Feb 21 10:26:22 2017 +0100 Unnest request-queue and ioc locks from scheduler locks In some bio-merging functions, the request-queue lock needs to be taken, to lookup for the bic associated with the process that issued the bio that may need to be merged. In addition, put_io_context must be invoked in some other functions, and put_io_context may cause the lock of the involved ioc to be taken. In both cases, these extra request-queue or ioc locks are taken, or might be taken, while the scheduler lock is being held. In this respect, there are other code paths, in part external to bfq-mq, in which the same locks are taken (nested) in the opposite order, i.e., it is the scheduler lock to be taken while the request-queue or the ioc lock is being held. This leads to circular deadlocks. This commit addresses this issue by modifying the logic of the above functions, so as to let the lookup and put_io_context be performed, and thus the extra locks be taken, outside the critical sections protected by the scheduler lock. 
Signed-off-by: Paolo Valente commit 15b3fa6a425b67ada9773cc125a29aead24c41fc Author: Paolo Valente Date: Tue Feb 7 15:14:29 2017 +0100 bfq-mq: execute exit_icq operations immediately Exploiting Omar's patch that removes the taking of the queue lock in put_io_context_active, this patch moves back the operation of the bfq_exit_icq hook from a deferred work to the body of the function. Signed-off-by: Paolo Valente commit 6e412ac9dc583a0b2d7223c491d265ad4d402ee3 Author: Paolo Valente Date: Thu Feb 9 10:36:27 2017 +0100 Add lock check in bfq_allow_bio_merge Signed-off-by: Paolo Valente commit d75fdb7885df87e33f0adeb042c0c5a734bccca7 Author: Paolo Valente Date: Fri Mar 3 08:52:40 2017 +0100 Add checks and extra log messages - Part I Signed-off-by: Paolo Valente commit 6b1d3bf5dc65e5f927e57736aaab38056e78d011 Author: Paolo Valente Date: Tue Dec 20 09:07:19 2016 +0100 Modify interface and operation to comply with blk-mq-sched As for modifications of the operation, the major changes are the introduction of a scheduler lock, and the moving to deferred work of the body of the hook exit_icq. The latter change has been made to avoid deadlocks caused by the combination of the following facts: 1) such a body takes the scheduler lock, and, if not deferred, 2) it does so from inside the exit_icq hook, which is invoked with the queue lock held, and 3) there is at least one code path, namely that starting from bfq_bio_merge, which takes these locks in the opposite order. Signed-off-by: Paolo Valente commit 63cd762262acbc3057aa017a5f5b922274bdde21 Author: Paolo Valente Date: Wed Jan 18 11:42:22 2017 +0100 Embed bfq-ioc.c and add locking on request queue The version of bfq-ioc.c for bfq-iosched.c is not correct any more for bfq-mq, because, in bfq-mq, the request queue lock is not being held when bfq_bic_lookup is invoked. That function must then take that lock on its own.
This commit removes the inclusion of bfq-ioc.c, copies the content of bfq-ioc.c into bfq-mq-iosched.c, and adds the grabbing of the lock. Signed-off-by: Paolo Valente commit e3e819451e1f0cfcd45ed6c24358a6ddd48bd96e Author: Paolo Valente Date: Sat Jan 21 12:41:14 2017 +0100 Move thinktime from bic to bfqq Prep change to make it possible to protect this field with a scheduler lock. Signed-off-by: Paolo Valente commit f37f719fb13a296c5ddae79bf04d972ccbf17573 Author: Paolo Valente Date: Mon Dec 19 18:11:33 2016 +0100 Copy header file bfq.h as bfq-mq.h This commit introduces the header file bfq-mq.h, that will play for bfq-mq-iosched.c the same role that bfq.h plays for bfq-iosched.c. For the moment, the file bfq-mq.h is just a copy of bfq.h. Signed-off-by: Paolo Valente commit 474cfdc7778c84feeb2d12eb4833da5dcbdb37c3 Author: Paolo Valente Date: Fri Jan 20 09:18:25 2017 +0100 Increase max policies for io controller To let bfq-mq policy be plugged too (however cgroups support is not yet functional in bfq-mq). Signed-off-by: Paolo Valente commit ecd765f72cbec082292a03093aeb6af017c2d121 Author: Paolo Valente Date: Mon Dec 19 17:13:39 2016 +0100 Add config and build bits for bfq-mq-iosched Signed-off-by: Paolo Valente commit 047dc7340016f029160cdcf079bab8c6833dade3 Author: Paolo Valente Date: Mon Dec 19 16:59:33 2016 +0100 FIRST BFQ-MQ COMMIT: Copy bfq-sq-iosched.c as bfq-mq-iosched.c This commit introduces bfq-mq-iosched.c, the main source file that will contain the code of bfq for blk-mq. I name tentatively bfq-mq this version of bfq. For the moment, the file bfq-mq-iosched.c is just a copy of bfq-sq-iosched.c, i.e., of the main source file of bfq for blk.
Signed-off-by: Paolo Valente

commit 8d123e4d83928f8ff379ce4d779286dd666e84b3
Author: Paolo Valente
Date: Thu May 4 10:53:43 2017 +0200

block, bfq: improve and refactor throughput-boosting logic

When a queue associated with a process remains empty, there are cases where throughput gets boosted if the device is idled to await the arrival of a new I/O request for that queue. Currently, BFQ assumes that one of these cases is when the device has no internal queueing (regardless of the properties of the I/O being served). Unfortunately, this condition has proved to be too general. So, this commit refines it as "the device has no internal queueing and is rotational".

This refinement provides a significant throughput boost with random I/O, on flash-based storage without internal queueing. For example, on a HiKey board, throughput increases by up to 125%, growing, e.g., from 6.9MB/s to 15.6MB/s with two or three random readers in parallel.

This commit also refactors the code related to device idling, for the following reason. Finding the change that provides the above large improvement has been slightly more difficult than it had to be, because the logic that decides whether to idle the device is still scattered across three functions. Almost all of the logic is in the function bfq_bfqq_may_idle, but (1) part of the decision is made in bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may switch off idling regardless of the output of bfq_bfqq_may_idle. In addition, both bfq_update_idle_window and bfq_bfqq_must_idle make their decisions as a function of parameters that are used, for similar purposes, also in bfq_bfqq_may_idle. This commit addresses this issue by moving all the logic into bfq_bfqq_may_idle.
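The refined condition described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: struct dev_info and both function names are hypothetical stand-ins, not the kernel's data structures or the actual body of bfq_bfqq_may_idle, and the sketch covers only this one case among those in which idling is considered to boost throughput.

```c
#include <stdbool.h>

/* Hypothetical, simplified device description (not a kernel structure). */
struct dev_info {
    bool rotational;        /* true for spinning disks, false for flash */
    bool internal_queueing; /* true if the device queues commands internally */
};

/*
 * Condition before the change, as described in the commit message:
 * idling is assumed to boost throughput whenever the device has no
 * internal queueing, whatever kind of device it is.
 */
static bool idling_boosts_thr_old(const struct dev_info *d)
{
    return !d->internal_queueing;
}

/*
 * Refined condition: the device has no internal queueing AND is
 * rotational. Flash storage without internal queueing no longer
 * qualifies for idling on this ground.
 */
static bool idling_boosts_thr_new(const struct dev_info *d)
{
    return !d->internal_queueing && d->rotational;
}
```

For a flash device without internal queueing (the HiKey case mentioned above), the old predicate returns true and the refined one returns false; for a rotational device without internal queueing both return true.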
Signed-off-by: Paolo Valente
Signed-off-by: Luca Miccio

commit a503694a61373259bf2b8b71eea3a031db5e5207
Author: Paolo Valente
Date: Fri Jul 28 21:09:51 2017 +0200

block, bfq: consider also in_service_entity to state whether an entity is active

Groups of BFQ queues are represented by generic entities in BFQ. When a queue belonging to a parent entity is deactivated, the parent entity may need to be deactivated too, in case the deactivated queue was the only active queue for the parent entity. This deactivation may need to be propagated upwards if the entity belongs, in its turn, to a further higher-level entity, and so on. In particular, the upward propagation of deactivation stops at the first parent entity that remains active even if one of its child entities has been deactivated.

To decide whether the last non-deactivation condition holds for a parent entity, BFQ checks whether the field next_in_service is still not NULL for the parent entity, after the deactivation of one of its child entities. If it is not NULL, then there are certainly other active entities in the parent entity, and deactivations can stop.

Unfortunately, this check misses a corner case: if in_service_entity is not NULL, then next_in_service may happen to be NULL, although the parent entity is evidently active. This happens if:
1) the entity pointed to by in_service_entity is the only active entity in the parent entity, and
2) according to the definition of next_in_service, the in_service_entity cannot be considered as next_in_service. See the comments on the definition of next_in_service for details on this second point.

Hitting the above corner case causes crashes.
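The corner case can be illustrated with a minimal C sketch. The field names in_service_entity and next_in_service come from the text above; struct entity, struct sched_data and the two helper functions are hypothetical simplifications introduced for illustration, not the kernel's definitions or the actual patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a schedulable entity (queue or group). */
struct entity {
    int id;
};

/* Hypothetical, simplified per-parent scheduling state. */
struct sched_data {
    struct entity *in_service_entity; /* child currently in service, if any */
    struct entity *next_in_service;   /* next candidate for service, if any */
};

/*
 * Check as described for the old behaviour: the parent is considered
 * still active only if next_in_service is not NULL.
 */
static bool parent_active_old(const struct sched_data *sd)
{
    return sd->next_in_service != NULL;
}

/*
 * Check that also consults in_service_entity: the parent is considered
 * active if either pointer is not NULL.
 */
static bool parent_active_new(const struct sched_data *sd)
{
    return sd->next_in_service != NULL || sd->in_service_entity != NULL;
}
```

With in_service_entity set and next_in_service NULL, parent_active_old reports the parent as inactive, so deactivation would wrongly propagate upwards, while parent_active_new reports it as active.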
To address this issue, this commit:
1) Extends the above check on only next_in_service to checking both next_in_service and in_service_entity (if any of them is not NULL, then no further deactivation is performed)
2) Improves the (important) comments on how next_in_service is defined and updated; in particular it fixes a few rather obscure paragraphs

Reported-by: Eric Wheeler
Reported-by: Rick Yiu
Reported-by: Tom X Nguyen
Signed-off-by: Paolo Valente
Tested-by: Eric Wheeler
Tested-by: Rick Yiu
Tested-by: Laurentiu Nicola
Tested-by: Tom X Nguyen

commit 5f051579fde6443e33d87f3589340da9235af00e
Author: Paolo Valente
Date: Fri Jul 21 12:08:57 2017 +0200

block, bfq: reset in_service_entity if it becomes idle

BFQ implements hierarchical scheduling by representing each group of queues with a generic parent entity. For each parent entity, BFQ maintains an in_service_entity pointer: if one of the child entities happens to be in service, in_service_entity points to it. The resetting of these pointers happens only on queue expirations: when the in-service queue is expired, i.e., stops being the queue in service, BFQ resets all in_service_entity pointers along the parent-entity path from this queue to the root entity.

Functions handling the scheduling of entities assume, naturally, that in-service entities are active, i.e., have pending I/O requests (or, as a special case, even if they have no pending requests, they are expected to receive a new request very soon, with the scheduler idling the storage device while waiting for such an event). Unfortunately, the above resetting scheme of the in_service_entity pointers may cause this assumption to be violated. For example, the in-service queue may happen to remain without requests because of a request merge. In this case the queue does become idle, and all related data structures are updated accordingly. But in_service_entity still points to the queue in the parent entity.
This inconsistency may even propagate to higher-level parent entities, if they happen to become idle as well, as a consequence of the leaf queue becoming idle. For this queue and parent entities, scheduling functions have an undefined behaviour, and, as reported, may easily lead to kernel crashes or hangs.

This commit addresses this issue by simply resetting the in_service_entity field also when it is detected to point to an entity becoming idle (regardless of why the entity becomes idle).

Reported-by: Laurentiu Nicola
Signed-off-by: Paolo Valente
Tested-by: Laurentiu Nicola

commit ac20910651b59a92ed92a4080a6e9e6f9fa5e7ed
Author: Paolo Valente
Date: Thu Jul 20 10:46:39 2017 +0200

Add extra checks related to entity scheduling

- extra checks related to ioprio-class changes
- specific check on st->idle in __bfq_requeue_entity

Signed-off-by: Paolo Valente

commit 383e5b2ddbdb94d522e1a2af8c960f4dd6f486a9
Author: Paolo Valente
Date: Tue Apr 7 13:39:12 2015 +0200

Add BFQ-v8r12

This commit is the result of the following operations.

1. The squash of all the commits between "block: cgroups, kconfig, build bits for BFQ-v7r11-4.5.0" and BFQ-v8r12 in the branch bfq-mq-v8-v4.11

2. The renaming of two files (block/bfq-cgroup.c -> block/bfq-cgroup-included.c and block/bfq-iosched.c -> block/bfq-sq-iosched.c) and of one option (CONFIG_BFQ_GROUP_IOSCHED -> CONFIG_BFQ_SQ_GROUP_IOSCHED), to avoid name clashes. These name clashes are due to the presence of bfq in mainline from 4.12.

3. The modification of block/Makefile and block/Kconfig.iosched to comply with the above renaming.

Signed-off-by: Mauro Andreolini
Signed-off-by: Arianna Avanzini
Signed-off-by: Linus Walleij
Signed-off-by: Paolo Valente