:Author: Started by Paul Jackson <pj@sgi.com>
Robust_futexes provide a mechanism that is used in addition to normal
futexes, for kernel assist of cleanup of held locks on task exit.
The interesting data as to what futexes a thread is holding is kept on a
linked list in user space, where it can be updated efficiently as locks
are taken and dropped, without kernel intervention. The only additional
kernel intervention required for robust_futexes above and beyond what is
1) a one time call, per thread, to tell the kernel where its list of
held robust_futexes begins, and
2) internal kernel code at exit, to handle any listed locks held
The existing normal futexes already provide a "Fast Userspace Locking"
mechanism, which handles uncontested locking without needing a system
call, and handles contested locking by maintaining a list of waiting
threads in the kernel. Options on the sys_futex(2) system call support
waiting on a particular futex, and waking up the next waiter on a
For robust_futexes to work, the user code (typically in a library such
as glibc linked with the application) has to manage and place the
necessary list elements exactly as the kernel expects them. If it fails
to do so, then improperly listed locks will not be cleaned up on exit,
probably causing deadlock or other such failure of the other threads
waiting on the same locks.
A thread that anticipates possibly using robust_futexes should first
sys_set_robust_list(struct robust_list_head __user *head, size_t len);
The pointer 'head' points to a structure in the threads address space
consisting of three words. Each word is 32 bits on 32 bit arch's, or 64
bits on 64 bit arch's, and local byte order. Each thread should have
its own thread private 'head'.
If a thread is running in 32 bit compatibility mode on a 64 native arch
kernel, then it can actually have two such structures - one using 32 bit
words for 32 bit compatibility mode, and one using 64 bit words for 64
bit native mode. The kernel, if it is a 64 bit kernel supporting 32 bit
compatibility mode, will attempt to process both lists on each task
exit, if the corresponding sys_set_robust_list() call has been made to
The first word in the memory structure at 'head' contains a
pointer to a single linked list of 'lock entries', one per lock,
as described below. If the list is empty, the pointer will point
to itself, 'head'. The last 'lock entry' points back to the 'head'.
The second word, called 'offset', specifies the offset from the
address of the associated 'lock entry', plus or minus, of what will
be called the 'lock word', from that 'lock entry'. The 'lock word'
is always a 32 bit word, unlike the other words above. The 'lock