Read-copy-update

In computer science, read-copy-update (RCU) is a synchronization mechanism that avoids the use of lock primitives while multiple threads concurrently read and update elements that are linked through pointers and that belong to shared data structures (e.g., linked lists, trees, hash tables).

Whenever a thread is inserting or deleting elements of data structures in shared memory, all readers are guaranteed to see and traverse either the older or the new structure, therefore avoiding inconsistencies (e.g., dereferencing null pointers).

RCU's fundamental guarantee may be used by splitting updates into removal and reclamation phases. The removal phase removes references to data items within a data structure (possibly by replacing them with references to new versions of these data items) and can run concurrently with RCU read-side critical sections. The reason that it is safe to run the removal phase concurrently with RCU readers is the semantics of modern CPUs guarantee that readers will see either the old or the new version of the data structure rather than a partially updated reference. Once a grace period has elapsed, there can no longer be any readers referencing the old version, so it is then safe for the reclamation phase to free (reclaim) the data items that made up that old version.

Splitting an update into removal and reclamation phases allows the updater to perform the removal phase immediately, and to defer the reclamation phase until all readers active during the removal phase have completed, in other words, until a grace period has elapsed.

So, the typical RCU update sequence goes something like the following:

Ensure that all readers accessing RCU-protected data structures carry out their references from within an RCU read-side critical section.
Remove pointers to a data structure, so that subsequent readers cannot gain a reference to it.
Wait for a grace period to elapse, so that all previous readers (which might still have pointers to the data structure removed in the prior step) will have completed their RCU read-side critical sections.
At this point, there cannot be any readers still holding references to the data structure, so it now may safely be reclaimed (e.g., freed).

In the above procedure (which matches the earlier diagram), the updater is performing both the removal and the reclamation step, but it is often helpful for an entirely different thread to do the reclamation. Reference counting can be used to let the reader perform removal so, even if the same thread performs both the update step (step (2) above) and the reclamation step (step (4) above), it is often helpful to think of them separately.

RCU is perhaps the most common non-blocking algorithm for a shared data structure. RCU is completely wait-free for any number of readers. Single-writer implementations RCU are also lock-free for the writer. Some multi-writer implementations of RCU are lock-free. Other multi-writer implementations of RCU serialize writers with a lock.

Uses

By early 2008, there were almost 2,000 uses of the RCU API within the Linux kernel including the networking protocol stacks and the memory-management system. , there were more than 9,000 uses. Since 2006, researchers have applied RCU and similar techniques to a number of problems, including management of metadata used in dynamic analysis, managing the lifetime of clustered objects, managing object lifetime in the K42 research operating system, and optimizing software transactional memory implementations. Dragonfly BSD uses a technique similar to RCU that most closely resembles Linux's Sleepable RCU (SRCU) implementation.

Advantages and disadvantages

The ability to wait until all readers are done allows RCU readers to use much lighter-weight synchronization—in some cases, no synchronization at all. In more conventional lock-based schemes, readers must use heavy-weight synchronization in order to prevent an updater from deleting the data structure out from under them. The reason is that lock-based updaters typically update data in place and must therefore exclude readers. In contrast, RCU-based updaters typically take advantage of the fact that writes to single aligned pointers are atomic on modern CPUs, allowing atomic insertion, removal, and replacement of data in a linked structure without disrupting readers. Concurrent RCU readers can then continue accessing the old versions and can dispense with the atomic read-modify-write instructions, memory barriers, and cache misses that are so expensive on modern SMP computer systems, even in absence of lock contention. The lightweight nature of RCU's read-side primitives provides additional advantages beyond excellent performance, scalability, and real-time response. For example, they provide immunity to most deadlock and livelock conditions.

RCU is a specialized technique that works best in situations with mostly reads and few updates but is often less applicable to update-only workloads. Although the fact that RCU readers and updaters may execute concurrently is what enables the lightweight nature of RCU's read-side primitives, some algorithms may not be amenable to read/update concurrency.

The applicability of RCU is a subject of ongoing research.

Patents

The technique is covered by U.S. software patent , issued August 15, 1995, and assigned to Sequent Computer Systems, as well as by (expired 2009-03-30), (expired 2010-04-05), (expired 2009-05-18), and (expired 2009-05-25). The now-expired US Patent covers a closely related technique. RCU is also the topic of one claim in the SCO v. IBM lawsuit.

Sample RCU interface

RCU is available in a number of operating systems and was added to the Linux kernel in October 2002. User-level implementations such as liburcu are also available.

The implementation of RCU in version 2.6 of the Linux kernel is among the better-known RCU implementations and will be used as an inspiration for the RCU API in the remainder of this article. The core API (Application Programming Interface) is quite small:

rcu_read_lock(): Marks an RCU-protected data structure so that it won't be reclaimed for the full duration of that critical section.
rcu_read_unlock(): Used by a reader to inform the reclaimer that the reader is exiting an RCU read-side critical section. Note that RCU read-side critical sections may be nested and/or overlapping.
synchronize_rcu(): Blocks until all pre-existing RCU read-side critical sections on all CPUs have completed. Note that <code>synchronize_rcu</code> will not necessarily wait for any subsequent RCU read-side critical sections to complete. For example, consider the following sequence of events:

<pre>

CPU 0 CPU 1 CPU 2

----------------- ------------------------- ---------------

1. rcu_read_lock()

2. enters synchronize_rcu()

3. rcu_read_lock()

4. rcu_read_unlock()

5. exits synchronize_rcu()

6. rcu_read_unlock()

</pre>

:Since <code>synchronize_rcu</code> is the API that must figure out when readers are done, its implementation is key to RCU. For RCU to be useful in all but the most read-intensive situations, <code>synchronize_rcu</code>'s overhead must also be quite small.

:Alternatively, instead of blocking, synchronize_rcu may register a callback to be invoked after all ongoing RCU read-side critical sections have completed. This callback variant is called <code>call_rcu</code> in the Linux kernel.

rcu_assign_pointer(): The updater uses this function to assign a new value to an RCU-protected pointer, in order to safely communicate the change in value from the updater to the reader. This function returns the new value, and also executes any memory barrier instructions required for a given CPU architecture. Perhaps more importantly, it serves to document which pointers are protected by RCU.
rcu_dereference(): The reader uses <code>rcu_dereference</code> to fetch an RCU-protected pointer, which returns a value that may then be safely dereferenced. It also executes any directives required by the compiler or the CPU, for example, a volatile cast for gcc, a memory_order_consume load for C/C++11 or the memory-barrier instruction required by the old DEC Alpha CPU. The value returned by <code>rcu_dereference</code> is valid only within the enclosing RCU read-side critical section. As with <code>rcu_assign_pointer</code>, an important function of <code>rcu_dereference</code> is to document which pointers are protected by RCU.

thumb|upright=2|RCU API communications between the reader, updater, and reclaimer

The diagram on the right shows how each API communicates among the reader, updater, and reclaimer.

The RCU infrastructure observes the time sequence of <code>rcu_read_lock</code>, <code>rcu_read_unlock</code>, <code>synchronize_rcu</code>, and <code>call_rcu</code> invocations in order to determine when (1) <code>synchronize_rcu</code> invocations may return to their callers and (2) <code>call_rcu</code> callbacks may be invoked. Efficient implementations of the RCU infrastructure make heavy use of batching in order to amortize their overhead over many uses of the corresponding APIs.

Simple implementation

RCU has extremely simple "toy" implementations that can aid understanding of RCU. This section presents one such "toy" implementation that works in a non-preemptive environment.

void rcu_read_lock(void) { }

void rcu_read_unlock(void) { }

void call_rcu(void (*callback) (void *), void *arg)

{

// add callback/arg pair to a list

}

void synchronize_rcu(void)

{

int cpu, ncpus = 0;

for each_cpu(cpu)

schedule_current_task_to(cpu);

for each entry in the call_rcu list

entry->callback (entry->arg);

}

</syntaxhighlight>

In the code sample, <code>rcu_assign_pointer</code> and <code>rcu_dereference</code> can be ignored without missing much. However, they are needed in order to suppress harmful compiler optimization and to prevent CPUs from reordering accesses.

define rcu_assign_pointer(p, v) ({ \

smp_wmb(); /* Order previous writes. */ \

ACCESS_ONCE(p) = (v); \

})

define rcu_dereference(p) ({ \

typeof(p) _value = ACCESS_ONCE(p); \

smp_read_barrier_depends(); /* nop on most architectures */ \

(_value); \

})

</syntaxhighlight>

Note that <code>rcu_read_lock</code> and <code>rcu_read_unlock</code> do nothing. This is the strength of classic RCU in a non-preemptive kernel: read-side overhead is precisely zero, as <code>smp_read_barrier_depends()</code> is an empty macro on all but DEC Alpha CPUs; such memory barriers are not needed on modern CPUs. The <code>ACCESS_ONCE()</code> macro is a volatile cast that generates no additional code in most cases. And there is no way that <code>rcu_read_lock</code> can participate in a deadlock cycle, cause a realtime process to miss its scheduling deadline, precipitate priority inversion, or result in high lock contention. However, in this toy RCU implementation, blocking within an RCU read-side critical section is illegal, just as is blocking while holding a pure spinlock.

The implementation of <code>synchronize_rcu</code> moves the caller of synchronize_cpu to each CPU, thus blocking until all CPUs have been able to perform the context switch. Recall that this is a non-preemptive environment and that blocking within an RCU read-side critical section is illegal, which imply that there can be no preemption points within an RCU read-side critical section. Therefore, if a given CPU executes a context switch (to schedule another process), we know that this CPU must have completed all preceding RCU read-side critical sections. Once all CPUs have executed a context switch, then all preceding RCU read-side critical sections will have completed.

Analogy with reader–writer locking

Although RCU can be used in many different ways, a very common use of RCU is analogous to reader–writer locking. The following side-by-side code display shows how closely related reader–writer locking and RCU can be.

/* reader-writer locking */ /* RCU */

1 struct el { 1 struct el {

2 struct list_head lp; 2 struct list_head lp;

3 long key; 3 long key;

4 spinlock_t mutex; 4 spinlock_t mutex;

5 int data; 5 int data;

6 /* Other data fields */ 6 /* Other data fields */

7 }; 7 };

8 DEFINE_RWLOCK(listmutex); 8 DEFINE_SPINLOCK(listmutex);

9 LIST_HEAD(head); 9 LIST_HEAD(head);

1 int search(long key, int *result) 1 int search(long key, int *result)

2 { 2 {

3 struct el *p; 3 struct el *p;

4 4

5 read_lock(&listmutex); 5 rcu_read_lock();

6 list_for_each_entry(p, &head, lp) { 6 list_for_each_entry_rcu(p, &head, lp) {

7 if (p->key == key) { 7 if (p->key == key) {

8 *result = p->data; 8 *result = p->data;

9 read_unlock(&listmutex); 9 rcu_read_unlock();

10 return 1; 10 return 1;

11 } 11 }

12 } 12 }

13 read_unlock(&listmutex); 13 rcu_read_unlock();

14 return 0; 14 return 0;

15 } 15 }

1 int delete(long key) 1 int delete(long key)

2 { 2 {

3 struct el *p; 3 struct el *p;

4 4

5 write_lock(&listmutex); 5 spin_lock(&listmutex);

6 list_for_each_entry(p, &head, lp) { 6 list_for_each_entry(p, &head, lp) {

7 if (p->key == key) { 7 if (p->key == key) {

8 list_del(&p->lp); 8 list_del_rcu(&p->lp);

9 write_unlock(&listmutex); 9 spin_unlock(&listmutex);

10 synchronize_rcu();

10 kfree(p); 11 kfree(p);

11 return 1; 12 return 1;

12 } 13 }

13 } 14 }

14 write_unlock(&listmutex); 15 spin_unlock(&listmutex);

15 return 0; 16 return 0;

16 } 17 }

</syntaxhighlight>

The differences between the two approaches are quite small. Read-side locking moves to <code>rcu_read_lock</code> and <code>rcu_read_unlock</code>, update-side locking moves from a reader-writer lock to a simple spinlock, and a <code>synchronize_rcu</code> precedes the <code>kfree</code>.

However, there is one potential catch: the read-side and update-side critical sections can now run concurrently. In many cases, this will not be a problem, but it is necessary to check carefully regardless. For example, if multiple independent list updates must be seen as a single atomic update, converting to RCU will require special care.

Also, the presence of <code>synchronize_rcu</code> means that the RCU version of <code>delete</code> can now block. If this is a problem, <code>call_rcu</code> could be used like <code>call_rcu (kfree, p)</code> in place of <code>synchronize_rcu</code>. This is especially useful in combination with reference counting.

History

Techniques and mechanisms resembling RCU have been independently invented multiple times:

H. T. Kung and Q. Lehman described use of garbage collectors to implement RCU-like access to a binary search tree.
Udi Manber and Richard Ladner extended Kung's and Lehman's work to non-garbage-collected environments by deferring reclamation until all threads running at removal time have terminated, which works in environments that do not have long-lived threads.
Richard Rashid et al. described a lazy translation lookaside buffer (TLB) implementation that deferred reclaiming virtual-address space until all CPUs flushed their TLB, which is similar in spirit to some RCU implementations.
James P. Hennessy, Damian L. Osisek, and Joseph W. Seigh, II were granted US Patent 4,809,168 in 1989 (since lapsed). This patent describes an RCU-like mechanism that was apparently used in VM/XA on IBM mainframes.
William Pugh described an RCU-like mechanism that relied on explicit flag-setting by readers.
Aju John proposed an RCU-like implementation where updaters simply wait for a fixed period of time, under the assumption that readers would all complete within that fixed time, as might be appropriate in a hard real-time system. Van Jacobson proposed a similar scheme in 1993 (verbal communication).
J. Slingwine and P. E. McKenney received US Patent 5,442,758 in August 1995, which describes RCU as implemented in DYNIX/ptx and later in the Linux kernel.
B. Gamsa, O. Krieger, J. Appavoo, and M. Stumm described an RCU-like mechanism used in the University of Toronto Tornado research operating system and the closely related IBM Research K42 research operating systems.
Rusty Russell and Phil Rumpf described RCU-like techniques for handling unloading of Linux kernel modules.
D. Sarma added RCU to version 2.5.43 of the Linux kernel in October 2002.
Robert Colvin et al. formally verified a lazy concurrent list-based set algorithm that resembles RCU.
M. Desnoyers et al. published a description of user-space RCU.
A. Gotsman et al. derived formal semantics for RCU based on separation logic.
Ilan Frenkel, Roman Geller, Yoram Ramberg, and Yoram Snir were granted US Patent 7,099,932 in 2006. This patent describes an RCU-like mechanism for retrieving and storing quality of service policy management information using a directory service in a manner that enforces read/write consistency and enables read/write concurrency.

Bauer, R.T., (June 2009), "Operational Verification of a Relativistic Program" PSU Tech Report TR-09-04 (http://www.pdx.edu/sites/www.pdx.edu.computer-science/files/tr0904.pdf )

External links

Paul E. McKenney, Mathieu Desnoyers, and Lai Jiangshan: User-space RCU. Linux Weekly News.
Paul E. McKenney and Jonathan Walpole: What is RCU, Fundamentally?, What is RCU? Part 2: Usage, and RCU part 3: the RCU API. Linux Weekly News.
Paul E. McKenney's RCU web page
Hart, McKenney, and Demke Brown (2006). Making Lockless Synchronization Fast: Performance Implications of Memory Reclamation An IPDPS 2006 Best Paper comparing RCU's performance to that of other lockless synchronization mechanisms. Journal version (including Walpole as author).
(1995) "Apparatus and method for achieving reduced overhead mutual exclusion and maintaining coherency in a multiprocessor system utilizing execution history and thread monitoring"
Paul McKenney: Sleepable RCU. Linux Weekly News.