Kernel level exception handling in Linux
Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com>
When a process runs in kernel mode, it often has to access user
mode memory whose address has been passed by an untrusted program.
To protect itself the kernel has to verify this address.
In older versions of Linux this was done with the
int verify_area(int type, const void * addr, unsigned long size)
function (which has since been replaced by access_ok()).
This function verified that the memory area starting at address
'addr' and of size 'size' was accessible for the operation specified
in type (read or write). To do this, verify_read had to look up the
virtual memory area (vma) that contained the address addr. In the
normal case (correctly working program), this test was successful.
It only failed for a few buggy programs. In some kernel profiling
tests, this normally unneeded verification used up a considerable
To overcome this situation, Linus decided to let the virtual memory
hardware present in every Linux-capable CPU handle this test.
Whenever the kernel tries to access an address that is currently not
accessible, the CPU generates a page fault exception and calls the
void do_page_fault(struct pt_regs *regs, unsigned long error_code)
in arch/x86/mm/fault.c. The parameters on the stack are set up by
the low level assembly glue in arch/x86/kernel/entry_32.S. The parameter
regs is a pointer to the saved registers on the stack, error_code
contains a reason code for the exception.
do_page_fault first obtains the unaccessible address from the CPU
control register CR2. If the address is within the virtual address
space of the process, the fault probably occurred, because the page
was not swapped in, write protected or something similar. However,
we are interested in the other case: the address is not valid, there
is no vma that contains this address. In this case, the kernel jumps
There it uses the address of the instruction that caused the exception
(i.e. regs->eip) to find an address where the execution can continue
(fixup). If this search is successful, the fault handler modifies the
return address (again regs->eip) and returns. The execution will
continue at the address in fixup.
Where does fixup point to?
Since we jump to the contents of fixup, fixup obviously points
to executable code. This code is hidden inside the user access macros.
I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h
as an example. The definition is somewhat hard to follow, so let's peek at
the code generated by the preprocessor and the compiler. I selected
the get_user call in drivers/char/sysrq.c for a detailed examination.
The original code in sysrq.c line 587:
The preprocessor output (edited to become somewhat readable):
long __gu_err = - 14 , __gu_val = 0;
const __typeof__(*( ( buf ) )) *__gu_addr = ((buf));
if (((((0 + current_set[0])->tss.segment) == 0x18 ) ||
(((sizeof(*(buf))) <= 0xC0000000UL) &&
((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
switch ((sizeof(*(buf)))) {
"1: mov" "b" " %2,%" "b" "1\n"
".section .fixup,\"ax\"\n"
" xor" "b" " %" "b" "1,%" "b" "1\n"
".section __ex_table,\"a\"\n"
".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ;
"1: mov" "w" " %2,%" "w" "1\n"