Hierarchical RCU -- Annotated Excerpts

2011-10-12 22:54:34 | Category: Linux kernel

Original article:

http://lwn.net/Articles/305782/

Only selected portions are excerpted here, with annotations.

Brief Overview of Classic RCU Implementation

The key concept behind the Classic RCU implementation is that Classic RCU read-side critical sections are confined to kernel code and are not permitted to block. This means that any time a given CPU is seen either blocking, in the idle loop, or exiting the kernel, we know that all RCU read-side critical sections that were previously running on that CPU must have completed. Such states are called “quiescent states”, and after each CPU has passed through at least one quiescent state, the RCU grace period ends.

[Figure: Schematic of Classic RCU]

Annotation: "quiescent state" and "grace period" have no standard Chinese renderings, so the English terms are used untranslated throughout these notes.

Classic RCU's most important data structure is the rcu_ctrlblk structure, which contains the ->cpumask field, which contains one bit per CPU. Each CPU's bit is set to one at the beginning of each grace period, and each CPU must clear its bit after it passes through a quiescent state. Because multiple CPUs might want to clear their bits concurrently, which would corrupt the ->cpumask field, a ->lock spinlock is used to protect ->cpumask, preventing any such corruption. Unfortunately, this spinlock can also suffer extreme contention if there are more than a few hundred CPUs, which might soon become quite common if multicore trends continue. Worse yet, the fact that all CPUs must clear their own bit means that CPUs are not permitted to sleep through a grace period, which limits Linux's ability to conserve power.

Annotation: clearing the bits makes it possible to detect the end of a grace period simply by testing whether ->cpumask has reached zero. The power-management point is presumably this: if a CPU slept through a grace period, nobody would clear its bit, so residual bits would remain set in ->cpumask and the grace period could never complete; classic RCU must therefore keep waking such CPUs, which limits Linux's ability to conserve power.
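To make this concrete, here is a minimal sketch of the classic scheme: a global bitmask guarded by a single spinlock, with each CPU clearing its own bit. The names and details are illustrative assumptions (and it assumes at most BITS_PER_LONG CPUs), not the kernel's actual code.

    #include <linux/spinlock.h>

    /* Sketch of a classic-RCU-style control block. */
    struct rcu_ctrlblk_sketch {
        spinlock_t    lock;       /* protects ->cpumask below */
        unsigned long cpumask;    /* one bit per CPU, set at grace-period start */
        long          completed;  /* number of grace periods completed */
    };

    /* Each CPU calls this after passing through a quiescent state. */
    static void sketch_cpu_quiet(int cpu, struct rcu_ctrlblk_sketch *rcp)
    {
        spin_lock(&rcp->lock);            /* every CPU contends for this one lock */
        rcp->cpumask &= ~(1UL << cpu);
        if (rcp->cpumask == 0)
            rcp->completed++;             /* last CPU checked in: grace period ends */
        spin_unlock(&rcp->lock);
    }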

…………

Towards a More Scalable RCU Implementation

One effective way to reduce lock contention is to create a hierarchy, as shown in the following figure. Here, each of the four rcu_node structures has its own lock, so that only CPUs 0 and 1 will acquire the lower left rcu_node's lock, only CPUs 2 and 3 will acquire the lower middle rcu_node's lock, and only CPUs 4 and 5 will acquire the lower right rcu_node's lock. During any given grace period, only one of the CPUs accessing each of the lower rcu_node structures will access the upper rcu_node, namely, the last of each pair of CPUs to record a quiescent state for the corresponding grace period.

Annotation: within each pair of CPUs, the one that is last to record a quiescent state for the current grace period is the one that proceeds to the upper rcu_node and reports there.

[Figure: Schematic of Tree RCU]

This results in a significant reduction in lock contention: instead of six CPUs contending for a single lock each grace period, we have only three for the upper rcu_node's lock (a reduction of 50%) and only two for each of the lower rcu_nodes' locks (a reduction of 67%).

The tree of rcu_node structures is embedded into a linear array in the rcu_state structure, with the root of the tree in element zero, as shown below for an eight-CPU system with a three-level hierarchy. The arrows link a given rcu_node structure to its parent. Each rcu_node indicates the range of CPUs covered, so that the root node covers all of the CPUs, each node in the second level covers half of the CPUs, and each node in the leaf level covers a pair of CPUs. This array is allocated statically at compile time based on the value of NR_CPUS.

Annotation: in the figure, the two numbers inside each rcu_node give the range of CPUs it covers; X:Y denotes CPUs X through Y.

[Figure: Embedding of Tree RCU]
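To make the embedding concrete, here is a sketch using eight CPUs and an illustrative fanout of two. The field names follow the article's figures, but the definitions are simplified assumptions rather than the kernel's exact declarations.

    struct rcu_node {
        spinlock_t       lock;
        unsigned long    qsmask;       /* CPUs/groups still owing a quiescent state */
        unsigned long    qsmaskinit;   /* value ->qsmask is reset to at GP start */
        unsigned long    grpmask;      /* this node's bit in its parent's qsmask */
        struct rcu_node *parent;       /* NULL at the root */
        int              grplo, grphi; /* range of CPUs covered, as in the figure */
    };

    struct rcu_state {
        /* node[0] is the root, node[1..2] the second level, node[3..6]
         * the leaves; sized statically from NR_CPUS, as the text says. */
        struct rcu_node node[1 + 2 + 4];  /* eight CPUs, fanout of two */
        spinlock_t      onofflock;  /* GP initialization vs. CPU hotplug (see later) */
    };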

The following sequence of six figures shows how grace periods are detected. In the first figure, no CPU has yet passed through a quiescent state, as indicated by the red rectangles. Suppose that all six CPUs simultaneously try to tell RCU that they have passed through a quiescent state. Only one of each pair will be able to acquire the lock on the corresponding lower rcu_node, and so the second figure shows the result if the lucky CPUs are numbers 0, 3, and 5, as indicated by the green rectangles. Once these lucky CPUs have finished, then the other CPUs will acquire the lock, as shown in the third figure. Each of these CPUs will see that they are the last in their group, and therefore all three will attempt to move to the upper rcu_node. Only one at a time can acquire the upper rcu_node structure's lock, and the fourth, fifth, and sixth figures show the sequence of states assuming that CPU 1, CPU 2, and CPU 4 acquire the lock in that order. The sixth and final figure in the group shows that all CPUs have passed through a quiescent state, so that the grace period has ended.

Annotation: a summary of the six-figure sequence. In figure 1 every CPU is still red (no quiescent state yet). Suppose CPUs 0, 3, and 5 win their leaf locks first and check in; the remaining three CPUs then acquire the leaf locks in turn, at which point every leaf rcu_node has seen all of its CPUs report. Each of those three, being the last in its pair, must notify the root, so they contend for the root rcu_node's lock one at a time; after the third one finishes, the grace period has ended.

[Figure: Schematic of Tree RCU]
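The walk up the hierarchy that these figures illustrate can be sketched as follows. This is a simplified rendition (the real code also carries grace-period numbers and other state); note that at most one rcu_node lock is ever held at a time, which is how deadlock is avoided (see Quick Quiz 1).

    /* mask: the reporting CPU's (or child node's) bit within rnp->qsmask. */
    static void sketch_report_qs(struct rcu_node *rnp, unsigned long mask)
    {
        for (;;) {
            spin_lock(&rnp->lock);
            rnp->qsmask &= ~mask;
            if (rnp->qsmask != 0) {
                /* Others at this level still owe a quiescent state. */
                spin_unlock(&rnp->lock);
                return;
            }
            if (rnp->parent == NULL) {
                /* Root is clear: every CPU has checked in, GP ends. */
                spin_unlock(&rnp->lock);
                /* ... notify whoever is waiting for the grace period ... */
                return;
            }
            mask = rnp->grpmask;        /* this node's bit in its parent */
            spin_unlock(&rnp->lock);    /* never hold two rcu_node locks */
            rnp = rnp->parent;
        }
    }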

In the above sequence, there were never more than three CPUs contending for any one lock, in happy contrast to Classic RCU, where all six CPUs might contend. However, even more dramatic reductions in lock contention are possible with larger numbers of CPUs. Consider a hierarchy of rcu_node structures, with 64 lower structures and 64*64=4,096 CPUs, as shown in the following figure.

[Figure: Schematic of Tree RCU]

Here each of the lower rcu_node structures' locks are acquired by 64 CPUs, a 64-times reduction from the 4,096 CPUs that would acquire Classic RCU's single global lock. Similarly, during a given grace period, only one CPU from each of the lower rcu_node structures will acquire the upper rcu_node structure's lock, which is again a 64x reduction from the contention level that would be experienced by Classic RCU running on a 4,096-CPU system.

Quick Quiz 1: Wait a minute! With all those new locks, how do you avoid deadlock?

Quick Quiz 2: Why stop at a 64-times reduction? Why not go for a few orders of magnitude instead?

Quick Quiz 3: But I don't care about McKenney's lame excuses in the answer to Quick Quiz 2!!! I want to get the number of CPUs contending on a single lock down to something reasonable, like sixteen or so!!!

The implementation maintains some per-CPU data, such as lists of RCU callbacks, organized into rcu_data structures. In addition, rcu (as in call_rcu()) and rcu_bh (as in call_rcu_bh()) each maintain their own hierarchy, as shown in the following figure.

[Figure: Schematic of Tree RCU]
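A sketch of this per-CPU bookkeeping, with a deliberately simplified callback list (the real one is segmented by grace period) and the quiescent-state flags that the use cases below rely on; field names are assumptions in the spirit of the text.

    struct rcu_data {
        struct rcu_head *nxtlist;   /* RCU callbacks queued on this CPU */
        long  completed;            /* last grace period this CPU saw end */
        int   qs_pending;           /* core RCU still expects a QS from us */
        int   passed_quiesc;        /* we passed through a QS (see below) */
    };

    /* One instance per CPU for each flavor, matching the two hierarchies. */
    DEFINE_PER_CPU(struct rcu_data, rcu_data);
    DEFINE_PER_CPU(struct rcu_data, rcu_bh_data);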

Quick Quiz 4: OK, so what is the story with the colors?

The next section discusses energy conservation.

Towards a Greener RCU Implementation

As noted earlier, an important goal of this effort is to leave sleeping CPUs lie in order to promote energy conservation. In contrast, classic RCU will happily awaken each and every sleeping CPU at least once per grace period in some cases, which is suboptimal in the case where a small number of CPUs are busy doing RCU updates and the majority of the CPUs are mostly idle. This situation occurs frequently in systems sized for peak loads, and we need to be able to accommodate it gracefully. Furthermore, we need to fix a long-standing bug in Classic RCU where a dynticks-idle CPU servicing an interrupt containing a long-running RCU read-side critical section will fail to prevent an RCU grace period from ending.

Quick Quiz 5: Given such an egregious bug, why does Linux run at all?

This is accomplished by requiring that all CPUs manipulate counters located in a per-CPU rcu_dynticks structure. Loosely speaking, these counters have even-numbered values when the corresponding CPU is in dynticks idle mode, and have odd-numbered values otherwise. RCU thus needs to wait for quiescent states only for those CPUs whose rcu_dynticks counters are odd, and need not wake up sleeping CPUs, whose counters will be even. As shown in the following diagram, each per-CPU rcu_dynticks is shared by the “rcu” and “rcu_bh” implementations.

[Figure: Schematic of Tree RCU]
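A sketch of the per-CPU structure this implies; the three counters are exercised one by one in the use cases below.

    struct rcu_dynticks {
        int dynticks_nesting;  /* irq/process nesting; zero in dynticks idle */
        int dynticks;          /* even in dynticks-idle mode, odd otherwise */
        int dynticks_nmi;      /* even except inside an NMI from dynticks idle */
    };

    /* Shared by the rcu and rcu_bh hierarchies, as the figure shows. */
    DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks);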

The following section presents a high-level view of the RCU state machine.

State Machine

At a sufficiently high level, Linux-kernel RCU implementations can be thought of as high-level state machines as shown in the following schematic:

[Figure: Tree-RCU classic view of state machine]

The common-case path through this state machine on a busy system goes through the two uppermost loops, initializing at the beginning of each grace period (GP), waiting for quiescent states (QS), and noting when each CPU passes through its first quiescent state for a given grace period. On such a system, quiescent states will occur on each context switch, or, for CPUs that are either idle or executing user-mode code, each scheduling-clock interrupt. CPU-hotplug events will take the state machine through the “CPU Offline” box, while the presence of “holdout” CPUs that fail to pass through quiescent states quickly enough will exercise the path through the “Send resched IPIs to Holdout CPUs” box. RCU implementations that avoid unnecessarily awakening dyntick-idle CPUs will mark those CPUs as being in an extended quiescent state, taking the “Y” branch out of the “CPUs in dyntick-idle Mode?” decision diamond (but note that CPUs in dyntick-idle mode will not be sent resched IPIs). Finally, if CONFIG_RCU_CPU_STALL_DETECTOR is enabled, truly excessive delays in reaching quiescent states will exercise the “Complain About Holdout CPUs” path.

The events in the above state schematic interact with different data structures, as shown below:

[Figure: Tree-RCU data-structure-oriented view of state machine]

However, the state schematic does not directly translate into C code for any of the RCU implementations. Instead, these implementations are coded as an event-driven system within the kernel. Therefore, the following section describes some “use cases”, or ways in which the RCU algorithm traverses the above state schematic as well as the relevant data structures.

 

Use Cases

This section gives an overview of several “use cases” within the RCU implementation, listing the data structures touched and the functions invoked. The use cases are as follows:

  1. Start a new grace period.
  2. Pass through a quiescent state.
  3. Announce a quiescent state to RCU.
  4. Enter and leave dynticks idle mode.
  5. Interrupt from dynticks idle mode.
  6. NMI from dynticks idle mode.
  7. Note that a CPU is in dynticks idle mode.
  8. Offline a CPU.
  9. Online a CPU.
  10. Detect a too-long grace period.

Each of these use cases is described in the following sections.

Start a New Grace Period

The rcu_start_gp() function starts a new grace period. This function is invoked when a CPU having callbacks waiting for a grace period notices that no grace period is in progress.

The rcu_start_gp() function updates state in the rcu_state and rcu_data structures to note the newly started grace period, acquires the ->onofflock (and disables irqs) to exclude any concurrent CPU-hotplug operations, sets the bits in all of the rcu_node structures to indicate that all CPUs (including this one) must pass through a quiescent state, and finally releases the ->onofflock.

The bit-setting operation is carried out in two phases. First, the non-leaf rcu_node structures' bits are set without holding any additional locks, and then finally each leaf rcu_node structure's bits are set in turn while holding that structure's ->lock.
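A sketch of the two phases, reusing the earlier array layout; FIRST_LEAF and NUM_NODES are illustrative constants for the eight-CPU sketch, not kernel symbols.

    #define NUM_NODES  (1 + 2 + 4)  /* from the earlier eight-CPU sketch */
    #define FIRST_LEAF (1 + 2)      /* index of the first leaf rcu_node */

    static void sketch_start_gp(struct rcu_state *rsp)
    {
        struct rcu_node *rnp;
        int i;

        spin_lock_irq(&rsp->onofflock);  /* exclude CPU-hotplug operations */

        /* Phase 1: non-leaf nodes, no per-node lock needed, because CPUs
         * report quiescent states only through the leaves. */
        for (i = 0; i < FIRST_LEAF; i++)
            rsp->node[i].qsmask = rsp->node[i].qsmaskinit;

        /* Phase 2: each leaf in turn, under its own ->lock, so that
         * concurrently reporting CPUs see consistent state. */
        for (i = FIRST_LEAF; i < NUM_NODES; i++) {
            rnp = &rsp->node[i];
            spin_lock(&rnp->lock);
            rnp->qsmask = rnp->qsmaskinit;
            spin_unlock(&rnp->lock);
        }

        spin_unlock_irq(&rsp->onofflock);
    }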

Quick Quiz 6: But what happens if a CPU tries to report going through a quiescent state (by clearing its bit) before the bit-setting CPU has finished?

Quick Quiz 7: And what happens if all CPUs try to report going through a quiescent state before the bit-setting CPU has finished, thus ending the new grace period before it starts?

Pass Through a Quiescent State

The rcu and rcu_bh flavors of RCU have different sets of quiescent states. Quiescent states for rcu are context switch, idle (either dynticks or the idle loop), and user-mode execution, while quiescent states for rcu_bh are any code outside of softirq with interrupts enabled. Note that a quiescent state for rcu is also a quiescent state for rcu_bh. Quiescent states for rcu are recorded by invoking rcu_qsctr_inc(), while quiescent states for rcu_bh are recorded by invoking rcu_bh_qsctr_inc(). These two functions record their state in the current CPU's rcu_data structure.

These functions are invoked from the scheduler, from __do_softirq(), and from rcu_check_callbacks(). This latter function is invoked from the scheduling-clock interrupt, and analyzes state to determine whether this interrupt occurred within a quiescent state, invoking rcu_qsctr_inc() and/or rcu_bh_qsctr_inc(), as appropriate. It also raises RCU_SOFTIRQ, which results in rcu_process_callbacks() being invoked on the current CPU at some later time from softirq context.
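A sketch of this recording path, using the rcu_data fields sketched earlier; sketch_cpu_was_idle() is a purely hypothetical stand-in for the state analysis the text mentions.

    /* Record a quiescent state in this CPU's per-CPU rcu_data. */
    static void sketch_qsctr_inc(int cpu)
    {
        per_cpu(rcu_data, cpu).passed_quiesc = 1;
    }

    static void sketch_bh_qsctr_inc(int cpu)
    {
        per_cpu(rcu_bh_data, cpu).passed_quiesc = 1;
    }

    /* Called from the scheduling-clock interrupt. */
    void sketch_check_callbacks(int cpu, int user)
    {
        if (user || sketch_cpu_was_idle(cpu)) {  /* hypothetical idle test */
            /* User mode or the idle loop: a QS for both rcu and rcu_bh. */
            sketch_qsctr_inc(cpu);
            sketch_bh_qsctr_inc(cpu);
        } else if (!in_softirq()) {
            /* Kernel, but outside softirq: a QS for rcu_bh only. */
            sketch_bh_qsctr_inc(cpu);
        }
        raise_softirq(RCU_SOFTIRQ);  /* rcu_process_callbacks() runs later */
    }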

Announce a Quiescent State to RCU

The afore-mentioned rcu_process_callbacks() function has several duties:

  1. Determining when to take measures to end an over-long grace period (via force_quiescent_state()).
  2. Taking appropriate action when some other CPU detected the end of a grace period (via rcu_process_gp_end()). “Appropriate action“ includes advancing this CPU's callbacks and recording the new grace period. This same function updates state in response to some other CPU starting a new grace period.
  3. Reporting the current CPU's quiescent states to the core RCU mechanism (via rcu_check_quiescent_state(), which in turn invokes cpu_quiet()). This of course might mark the end of the current grace period.
  4. Starting a new grace period if there is no grace period in progress and this CPU has RCU callbacks still waiting for a grace period (via cpu_needs_another_gp() and rcu_start_gp()).
  5. Invoking any of this CPU's callbacks whose grace period has ended (via rcu_do_batch()).

These interactions are carefully orchestrated in order to avoid buggy behavior such as reporting a quiescent state from the previous grace period against the current grace period.
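Strung together, the five duties might look like the following sketch. Argument lists are simplified, the grace-period-age check is a hypothetical stand-in, and the careful cross-grace-period checks mentioned above are omitted.

    static void sketch_process_callbacks(struct rcu_state *rsp,
                                         struct rcu_data *rdp)
    {
        /* 1. End an over-long grace period if need be. */
        if (sketch_gp_too_long(rsp))          /* hypothetical check */
            force_quiescent_state(rsp, 0);

        /* 2. Notice a grace period that some other CPU ended; advance
         *    this CPU's callbacks and record the new grace period. */
        rcu_process_gp_end(rsp, rdp);

        /* 3. Report this CPU's quiescent states to core RCU. */
        rcu_check_quiescent_state(rsp, rdp);

        /* 4. Start a new grace period if callbacks are waiting for one. */
        if (cpu_needs_another_gp(rsp, rdp))
            rcu_start_gp(rsp);

        /* 5. Invoke callbacks whose grace period has ended. */
        rcu_do_batch(rdp);
    }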

Enter and Leave Dynticks Idle Mode

The scheduler invokes rcu_enter_nohz() to enter dynticks-idle mode, and invokes rcu_exit_nohz() to exit it. The rcu_enter_nohz() function decrements a per-CPU dynticks_nesting variable and increments a per-CPU dynticks counter, which must then have an even-numbered value. The rcu_exit_nohz() function increments this same per-CPU dynticks_nesting variable, and again increments the per-CPU dynticks counter, which must then have an odd-numbered value. Annotation: both functions increment the dynticks counter, so its parity always indicates whether the CPU is currently in dynticks-idle mode.
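A sketch of the pair, assuming the rcu_dynticks sketch above; the irq disabling and memory barriers the real code needs are omitted.

    void sketch_enter_nohz(void)
    {
        struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, smp_processor_id());

        rdtp->dynticks_nesting--;  /* now zero: nothing here needs RCU */
        rdtp->dynticks++;          /* must now be even: dynticks idle */
    }

    void sketch_exit_nohz(void)
    {
        struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, smp_processor_id());

        rdtp->dynticks_nesting++;
        rdtp->dynticks++;          /* must now be odd: non-idle again */
    }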

The dynticks counter can be sampled by other CPUs. If the value is even, the first CPU is in an extended quiescent state. Similarly, if the counter value changes during a given grace period, the first CPU must have been in an extended quiescent state at some point during the grace period. However, there is another dynticks_nmi per-CPU variable that must also be sampled, as will be discussed below.

Interrupt from Dynticks Idle Mode

Interrupts from dynticks idle mode are handled by rcu_irq_enter() and rcu_irq_exit(). The rcu_irq_enter() function increments the per-CPU dynticks_nesting variable, and, if the prior value was zero, also increments the dynticks per-CPU variable (which must then have an odd-numbered value).

The rcu_irq_exit() function decrements the per-CPU dynticks_nesting variable, and, if the new value is zero, also increments the dynticks per-CPU variable (which must then have an even-numbered value).

Note that entering an irq handler exits dynticks idle mode and vice versa. This enter/exit anti-correspondence can cause much confusion. You have been warned.
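The irq-path counterparts can be sketched the same way; note how they mirror the nohz pair, since taking an interrupt leaves dynticks-idle mode.

    void sketch_irq_enter(void)
    {
        struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, smp_processor_id());

        if (rdtp->dynticks_nesting++ == 0)
            rdtp->dynticks++;      /* leaving idle: counter becomes odd */
    }

    void sketch_irq_exit(void)
    {
        struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, smp_processor_id());

        if (--rdtp->dynticks_nesting == 0)
            rdtp->dynticks++;      /* back to idle: counter becomes even */
    }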

NMI from Dynticks Idle Mode

NMIs from dynticks idle mode are handled by rcu_nmi_enter() and rcu_nmi_exit(). These functions both increment the dynticks_nmi counter, but only if the aforementioned dynticks counter is even. In other words, NMIs refrain from manipulating the dynticks_nmi counter if the NMI occurred in non-dynticks-idle mode or within an interrupt handler.

The only difference between these two functions is the error checks, as rcu_nmi_enter() must leave the dynticks_nmi counter with an odd value, and rcu_nmi_exit() must leave this counter with an even value.
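A sketch of the NMI pair; both functions test the dynticks counter first, so NMIs from non-idle code leave dynticks_nmi alone.

    void sketch_nmi_enter(void)
    {
        struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, smp_processor_id());

        if (rdtp->dynticks & 0x1)
            return;                /* not in dynticks idle: nothing to do */
        rdtp->dynticks_nmi++;      /* must now be odd */
    }

    void sketch_nmi_exit(void)
    {
        struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, smp_processor_id());

        if (rdtp->dynticks & 0x1)
            return;
        rdtp->dynticks_nmi++;      /* must now be even */
    }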

Note That a CPU is in Dynticks Idle Mode

The force_quiescent_state() function implements a two-phase state machine. In the first phase (RCU_SAVE_DYNTICK), the dyntick_save_progress_counter() function scans the CPUs that have not yet reported a quiescent state, recording their per-CPU dynticks and dynticks_nmi counters. If these counters both have even-numbered values, then the corresponding CPU is in dynticks-idle state, which is therefore noted as an extended quiescent state (reported via cpu_quiet_msk()). In the second phase (RCU_FORCE_QS), the rcu_implicit_dynticks_qs() function again scans the CPUs that have not yet reported a quiescent state (either explicitly or implicitly during the RCU_SAVE_DYNTICK phase), again checking the per-CPU dynticks and dynticks_nmi counters. If each of these has either changed in value or is now even, then the corresponding CPU has either passed through or is now in dynticks idle, which as before is noted as an extended quiescent state.

If rcu_implicit_dynticks_qs() finds that a given CPU has neither been in dynticks idle mode nor reported a quiescent state, it invokes rcu_implicit_offline_qs(), which checks to see if that CPU is offline, which is also reported as an extended quiescent state. If the CPU is online, then rcu_implicit_offline_qs() sends it a reschedule IPI in an attempt to remind it of its duty to report a quiescent state to RCU.

Note that force_quiescent_state() does not directly invoke either dyntick_save_progress_counter() or rcu_implicit_dynticks_qs(), instead passing these functions to an intervening rcu_process_dyntick() function that abstracts out the common code involved in scanning the CPUs and reporting extended quiescent states.
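A sketch of the two per-CPU checks; the *_snap fields are assumed extra members of rcu_data for holding the phase-one snapshots.

    /* Phase RCU_SAVE_DYNTICK: snapshot the counters; both even means the
     * CPU is in dynticks idle right now, an extended quiescent state. */
    static int sketch_save_progress(struct rcu_data *rdp,
                                    struct rcu_dynticks *rdtp)
    {
        rdp->dynticks_snap     = rdtp->dynticks;
        rdp->dynticks_nmi_snap = rdtp->dynticks_nmi;
        return !(rdp->dynticks_snap & 0x1) && !(rdp->dynticks_nmi_snap & 0x1);
    }

    /* Phase RCU_FORCE_QS: changed since the snapshot, or currently even,
     * means the CPU passed through (or is in) dynticks idle meanwhile. */
    static int sketch_implicit_dynticks_qs(struct rcu_data *rdp,
                                           struct rcu_dynticks *rdtp)
    {
        int c  = rdtp->dynticks;
        int cn = rdtp->dynticks_nmi;

        return (c  != rdp->dynticks_snap     || !(c  & 0x1)) &&
               (cn != rdp->dynticks_nmi_snap || !(cn & 0x1));
    }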

Quick Quiz 8: And what happens if one CPU comes out of dyntick-idle mode and then passes through a quiescent state just as another CPU notices that the first CPU was in dyntick-idle mode? Couldn't they both attempt to report a quiescent state at the same time, resulting in confusion?

Quick Quiz 9: But what if all the CPUs end up in dyntick-idle mode? Wouldn't that prevent the current RCU grace period from ever ending?

Quick Quiz 10: Given that force_quiescent_state() is a two-phase state machine, don't we have double the scheduling latency due to scanning all the CPUs?

Offline a CPU

CPU-offline events cause rcu_cpu_notify() to invoke rcu_offline_cpu(), which in turn invokes __rcu_offline_cpu() on both the rcu and the rcu_bh instances of the data structures. This function clears the outgoing CPU's bits so that future grace periods will not expect this CPU to announce quiescent states, and further invokes cpu_quiet() in order to announce the offline-induced extended quiescent state. This work is performed with the global ->onofflock held in order to prevent interference with concurrent grace-period initialization.

Quick Quiz 11: But the other reason to hold ->onofflock is to prevent multiple concurrent online/offline operations, right?

Online a CPU

CPU-online events cause rcu_cpu_notify() to invoke rcu_online_cpu(), which initializes the incoming CPU's dynticks state, and then invokes rcu_init_percpu_data() to initialize the incoming CPU's rcu_data structure, and also to set this CPU's bits (again protected by the global ->onofflock) so that future grace periods will wait for a quiescent state from this CPU. Finally, rcu_online_cpu() sets up the RCU softirq vector for this CPU.

Quick Quiz 12: Given all these acquisitions of the global ->onofflock, won't there be horrible lock contention when running with thousands of CPUs?

Detect a Too-Long Grace Period

When the CONFIG_RCU_CPU_STALL_DETECTOR kernel parameter is specified, the record_gp_stall_check_time() function records the time and also a timestamp set three seconds into the future. If the current grace period still has not ended by that time, the check_cpu_stall() function will check for the culprit, invoking print_cpu_stall() if the current CPU is the holdout, or print_other_cpu_stall() if it is some other CPU. A two-jiffies offset helps ensure that CPUs report on themselves when possible, taking advantage of the fact that a CPU can normally do a better job of tracing its own stack than it can tracing some other CPU's stack.
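A sketch of the timing logic only, assuming gp_start and jiffies_stall fields in rcu_state; the holdout test is hypothetical.

    /* Three seconds expressed in jiffies (HZ jiffies per second). */
    static void sketch_record_stall_time(struct rcu_state *rsp)
    {
        rsp->gp_start      = jiffies;
        rsp->jiffies_stall = jiffies + 3 * HZ;
    }

    static void sketch_check_stall(struct rcu_state *rsp, int cpu_is_holdout)
    {
        long delta = (long)(jiffies - rsp->jiffies_stall);

        if (cpu_is_holdout && delta >= 0)
            print_cpu_stall(rsp);        /* this CPU traces its own stack */
        else if (delta >= 2)
            print_other_cpu_stall(rsp);  /* two-jiffies offset favors self-report */
    }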

 

Answers to Quick Quizzes

Quick Quiz 1: Wait a minute! With all those new locks, how do you avoid deadlock?

Answer: Deadlock is avoided by never holding more than one of the rcu_node structures' locks at a given time. This algorithm uses two more locks, one to prevent CPU hotplug operations from running concurrently with grace-period advancement (onofflock) and another to ensure that only one CPU at a time forces quiescent states in an attempt to end the grace period quickly (fqslock). These are subject to a locking hierarchy, so that fqslock must be acquired before onofflock, which in turn must be acquired before any of the rcu_node structures' locks.

Also, as a practical matter, refusing to ever hold more than one of the rcu_node locks means that it is unnecessary to track which ones are held. Such tracking would be painful as well as unnecessary.

Back to Quick Quiz 1.

Quick Quiz 2: Why stop at a 64-times reduction? Why not go for a few orders of magnitude instead?

Answer: RCU works with no problems on systems with a few hundred CPUs, so allowing 64 CPUs to contend on a single lock leaves plenty of headroom. Keep in mind that these locks are acquired quite rarely, as each CPU will check in about one time per grace period, and grace periods extend for milliseconds.

Back to Quick Quiz 2.

Quick Quiz 3: But I don't care about McKenney's lame excuses in the answer to Quick Quiz 2!!! I want to get the number of CPUs contending on a single lock down to something reasonable, like sixteen or so!!!

Answer: OK, have it your way, then!!! Set CONFIG_RCU_FANOUT=16 and (for NR_CPUS=4096) you will get a three-level hierarchy with 256 rcu_node structures at the lowest level, 16 rcu_node structures as intermediate nodes, and a single root-level rcu_node. The penalty you will pay is that more rcu_node structures will need to be scanned when checking to see which CPUs need help completing their quiescent states (256 instead of only 64).

Back to Quick Quiz 3.

Quick Quiz 4: OK, so what is the story with the colors?

Answer: Data structures analogous to rcu_state (including rcu_ctrlblk) are yellow, those containing the bitmaps used to determine when CPUs have checked in are pink, and the per-CPU rcu_data structures are blue. Later on, we will see that data structures used to conserve energy (such as rcu_dynticks) will be green.

Back to Quick Quiz 4.

Quick Quiz 5: Given such an egregious bug, why does Linux run at all?

Answer: Because the Linux kernel contains device drivers that are (relatively) well behaved. Few if any of them spin in RCU read-side critical sections for the many milliseconds that would be required to provoke this bug. The bug nevertheless does need to be fixed, and this variant of RCU does fix it.

Back to Quick Quiz 5.

Quick Quiz 6: But what happens if a CPU tries to report going through a quiescent state (by clearing its bit) before the bit-setting CPU has finished?

Answer: There are three cases to consider here:

  1. A CPU corresponding to a not-yet-initialized leaf rcu_node structure tries to report a quiescent state. This CPU will see its bit already cleared, so will give up on reporting its quiescent state. Some later quiescent state will serve for the new grace period. (Annotation: "not yet initialized" means the bit still holds zero from the previous grace period, since initialization is what sets it to one.)
  2. A CPU corresponding to a leaf rcu_node structure that is currently being initialized tries to report a quiescent state. This CPU will see that the rcu_node structure's ->lock is held, so will spin until it is released. But once the lock is released, the rcu_node structure will have been initialized, reducing to the following case.
  3. A CPU corresponding to a leaf rcu_node that has already been initialized tries to report a quiescent state. This CPU will find its bit set, and will therefore clear it. If it is the last CPU for that leaf node, it will move up to the next level of the hierarchy. However, this CPU cannot possibly be the last CPU in the system to report a quiescent state, given that the CPU doing the initialization cannot yet have checked in.

So, in all three cases, the potential race is resolved correctly.

Back to Quick Quiz 6.

Quick Quiz 7: And what happens if all CPUs try to report going through a quiescent state before the bit-setting CPU has finished, thus ending the new grace period before it starts?

Answer: The bit-setting CPU cannot pass through a quiescent state during initialization, as it has irqs disabled. Its bits therefore remain non-zero, preventing the grace period from ending until the data structure has been fully initialized.

Back to Quick Quiz 7.

Quick Quiz 8: And what happens if one CPU comes out of dyntick-idle mode and then passes through a quiescent state just as another CPU notices that the first CPU was in dyntick-idle mode? Couldn't they both attempt to report a quiescent state at the same time, resulting in confusion?

Answer: They will both attempt to acquire the lock on the same leaf rcu_node structure. The first one to acquire the lock will report the quiescent state and clear the appropriate bit, and the second one to acquire the lock will see that this bit has already been cleared.

Back to Quick Quiz 8.

Quick Quiz 9: But what if all the CPUs end up in dyntick-idle mode? Wouldn't that prevent the current RCU grace period from ever ending?

Answer: Indeed it will! However, CPUs that have RCU callbacks are not permitted to enter dyntick-idle mode, so the only way that all the CPUs could possibly end up in dyntick-idle mode would be if there were absolutely no RCU callbacks in the system. And if there are no RCU callbacks in the system, then there is no need for the RCU grace period to end. In fact, there is no need for the RCU grace period to even start.

RCU will restart if some irq handler does a call_rcu(), which will cause an RCU callback to appear on the corresponding CPU, which will force that CPU out of dyntick-idle mode, which will in turn permit the current RCU grace period to come to an end.

Back to Quick Quiz 9.

Quick Quiz 10: Given that force_quiescent_state() is a two-phase state machine, don't we have double the scheduling latency due to scanning all the CPUs?

Answer: Ah, but the two phases will not execute back-to-back on the same CPU. Therefore, the scheduling-latency hit of the two-phase algorithm is no different than that of a single-phase algorithm. If the scheduling latency becomes a problem, one approach would be to recode the state machine to scan the CPUs incrementally. But first show me a problem in the real world, then I will consider fixing it!

Back to Quick Quiz 10.

Quick Quiz 11: But the other reason to hold ->onofflock is to prevent multiple concurrent online/offline operations, right?

Answer: Actually, no! The CPU-hotplug code's synchronization design prevents multiple concurrent CPU online/offline operations, so only one CPU online/offline operation can be executing at any given time. Therefore, the only purpose of ->onofflock is to prevent a CPU online or offline operation from running concurrently with grace-period initialization.

Back to Quick Quiz 11.

Quick Quiz 12: Given all these acquisitions of the global ->onofflock, won't there be horrible lock contention when running with thousands of CPUs?

Answer: Actually, there can be only three acquisitions of this lock per grace period, and each grace period lasts many milliseconds. One of the acquisitions is by the CPU initializing for the current grace period, and the other two are for onlining and offlining some CPU. These latter two cannot run concurrently due to the CPU-hotplug locking, so at most two CPUs can be contending for this lock at any given time.

Lock contention on ->onofflock should therefore be no problem, even on systems with thousands of CPUs.
