     Kernel level exception handling in Linux

  Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com>

When a process runs in kernel mode, it often has to access user

mode memory whose address has been passed by an untrusted program.

To protect itself the kernel has to verify this address.

In older versions of Linux this was done with the

int verify_area(int type, const void * addr, unsigned long size)

function (which has since been replaced by access_ok()).

This function verified that the memory area starting at address

'addr' and of size 'size' was accessible for the operation specified

in type (read or write). To do this, verify_area had to look up the

virtual memory area (vma) that contained the address addr. In the

normal case (correctly working program), this test was successful.

It only failed for a few buggy programs. In some kernel profiling

tests, this normally unneeded verification used up a considerable

amount of time.
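
The cost of such an explicit check can be illustrated with a small user-space

model. This is only a sketch, not the kernel's implementation: the names

model_vma and model_verify_area and the MODEL_* constants are hypothetical,

the vma list is a plain array, and the area is assumed to lie within a

single vma. The value 14 matches the error number that appears in the

preprocessor output later in this text.

```c
#include <stddef.h>

#define MODEL_READ   0   /* hypothetical stand-in for a read check  */
#define MODEL_WRITE  1   /* hypothetical stand-in for a write check */
#define MODEL_EFAULT 14  /* "bad address" error number              */

/* Simplified model of a virtual memory area: [start, end) */
struct model_vma {
    unsigned long start;   /* first valid address        */
    unsigned long end;     /* one past the last address  */
    int writable;          /* non-zero if writes allowed */
};

/*
 * Walk the vma list on every call, as the old explicit check had to.
 * Returns 0 if the whole area is accessible, -MODEL_EFAULT otherwise.
 */
int model_verify_area(int type, unsigned long addr, unsigned long size,
                      const struct model_vma *vmas, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        const struct model_vma *v = &vmas[i];

        /* Does [addr, addr + size) fit entirely inside this vma? */
        if (addr >= v->start && addr < v->end && size <= v->end - addr) {
            if (type == MODEL_WRITE && !v->writable)
                return -MODEL_EFAULT;
            return 0;
        }
    }
    return -MODEL_EFAULT;  /* no vma contains the area */
}
```

The point of the model is that the lookup runs on every access, even though

it almost always succeeds.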

To overcome this situation, Linus decided to let the virtual memory

hardware present in every Linux-capable CPU handle this test.

How does this work?

Whenever the kernel tries to access an address that is currently not

accessible, the CPU generates a page fault exception and calls the

page fault handler

void do_page_fault(struct pt_regs *regs, unsigned long error_code)

in arch/x86/mm/fault.c. The parameters on the stack are set up by

the low level assembly glue in arch/x86/kernel/entry_32.S. The parameter

regs is a pointer to the saved registers on the stack, error_code

contains a reason code for the exception.
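
As an illustration of what the reason code holds, the following sketch

decodes the three low bits of the x86 page fault error code. The bit

meanings are those defined by the x86 architecture; the names PF_PROT,

PF_WRITE, PF_USER, pf_reason and decode_pf_error are chosen for this

example and are not taken from the text above.

```c
/* Low bits of the x86 page fault error code */
#define PF_PROT  0x1UL  /* 0: page not present, 1: protection violation */
#define PF_WRITE 0x2UL  /* 0: read access,      1: write access         */
#define PF_USER  0x4UL  /* 0: kernel mode,      1: user mode            */

/* Hypothetical helper type holding the decoded flags */
struct pf_reason {
    int protection;  /* fault was a protection violation */
    int write;       /* faulting access was a write      */
    int user_mode;   /* fault happened in user mode      */
};

/* Split the raw error code into its individual flags */
struct pf_reason decode_pf_error(unsigned long error_code)
{
    struct pf_reason r;

    r.protection = (error_code & PF_PROT) != 0;
    r.write      = (error_code & PF_WRITE) != 0;
    r.user_mode  = (error_code & PF_USER) != 0;
    return r;
}
```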

do_page_fault first obtains the inaccessible address from the CPU

control register CR2. If the address is within the virtual address

space of the process, the fault probably occurred because the page

was not swapped in, was write protected or something similar. However,

we are interested in the other case: the address is not valid because there

is no vma that contains this address. In this case, the kernel jumps

to the bad_area label.

There it uses the address of the instruction that caused the exception

(i.e. regs->eip) to find an address where the execution can continue

(fixup). If this search is successful, the fault handler modifies the

return address (again regs->eip) and returns. The execution will

continue at the address in fixup.
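
The search and the modification of the return address can be sketched in

user space as follows. This is a simplified model under stated assumptions,

not the kernel's code: ex_entry, fake_regs, find_fixup and handle_bad_area

are hypothetical names, and the table is searched linearly for clarity.

```c
#include <stddef.h>

/* Hypothetical model of one table entry: faulting insn -> fixup code */
struct ex_entry {
    unsigned long insn;   /* address of an instruction that may fault */
    unsigned long fixup;  /* address where execution can continue     */
};

/* Hypothetical stand-in for the saved registers */
struct fake_regs {
    unsigned long eip;    /* saved instruction pointer */
};

/* Return the fixup address for addr, or 0 if there is no entry */
unsigned long find_fixup(const struct ex_entry *table, size_t n,
                         unsigned long addr)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (table[i].insn == addr)
            return table[i].fixup;
    return 0;
}

/*
 * Model of the bad_area path: if the faulting instruction has a fixup,
 * redirect the return address to it and report success (1). Otherwise
 * leave the registers untouched and report failure (0).
 */
int handle_bad_area(struct fake_regs *regs,
                    const struct ex_entry *table, size_t n)
{
    unsigned long fixup = find_fixup(table, n, regs->eip);

    if (fixup == 0)
        return 0;
    regs->eip = fixup;
    return 1;
}
```

When handle_bad_area reports success, returning from the fault handler

resumes execution at the fixup address instead of the faulting instruction.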

Where does fixup point to?

Since we jump to the contents of fixup, fixup obviously points

to executable code. This code is hidden inside the user access macros.

I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h

as an example. The definition is somewhat hard to follow, so let's peek at

the code generated by the preprocessor and the compiler. I selected

the get_user call in drivers/char/sysrq.c for a detailed examination.

The original code in sysrq.c line 587:

        get_user(c, buf);

The preprocessor output (edited to become somewhat readable):

    long __gu_err = - 14 , __gu_val = 0;

    const __typeof__(*( (  buf ) )) *__gu_addr = ((buf));

    if (((((0 + current_set[0])->tss.segment) == 0x18 )  ||

       (((sizeof(*(buf))) <= 0xC0000000UL) &&

       ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))

      do {

        __gu_err  = 0;

        switch ((sizeof(*(buf)))) {

          case 1:

            __asm__ __volatile__(

              "1:      mov" "b" " %2,%" "b" "1\n"

              "2:\n"

              ".section .fixup,\"ax\"\n"

              "3:      movl %3,%0\n"

              "        xor" "b" " %" "b" "1,%" "b" "1\n"

              "        jmp 2b\n"

              ".section __ex_table,\"a\"\n"

              "        .align 4\n"

              "        .long 1b,3b\n"

              ".text"        : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)

                            (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  )) ;

              break;

          case 2:

            __asm__ __volatile__(

              "1:      mov" "w" " %2,%" "w" "1\n"

              "2:\n"