switch_to() does the heavy lifting of the context switch; the interrupted task is resumed by the iret after the scheduler returns, as if nothing had happened. Many of these context-switching chores drift between switch_to() and the scheduler in different versions of the kernel. The only guarantee is that every version performs a stack swap and FPU switching.

```c
/** include/linux/sched.h */
#define switch_to(n) {\
struct {long a,b;} __tmp; \
__asm__("cmpl %%ecx,_current\n\t" \
	"je 1f\n\t" \
	"xchgl %%ecx,_current\n\t" \
	"movw %%dx,%1\n\t" \
	"ljmp %0\n\t" \
	"cmpl %%ecx,%2\n\t" \
	"jne 1f\n\t" \
	"clts\n" \
	"1:" \
	::"m" (*&__tmp.a), "m" (*&__tmp.b), \
	"m" (last_task_used_math), "d" _TSS(n), "c" ((long) task[n])); \
}
```
A later revision of the same macro (its changes are discussed further below) reads:

```c
/** include/linux/sched.h */
#define switch_to(n) {\
struct {long a,b;} __tmp; \
__asm__("cmpl %%ecx,_current\n\t" \
	"je 1f\n\t" \
	"movw %%dx,%1\n\t" \
	"xchgl %%ecx,_current\n\t" \
	"ljmp %0\n\t" \
	"cmpl %%ecx,_last_task_used_math\n\t" \
	"jne 1f\n\t" \
	"clts\n" \
	"1:" \
	::"m" (*&__tmp.a), "m" (*&__tmp.b), \
	"d" (_TSS(n)), "c" ((long) task[n])); \
}
```
#define switch_to(n) {
switch_to()
is a macro. It appears in exactly one place: in the very last line of schedule()
. Therefore, after preprocessing, the macro body shares schedule()'s scope, and otherwise-unresolved references are found in the global scope, such as current
and last_task_used_math
. The input argument n
is the sequence number of the next task (from 0 to 63). struct {long a,b;} __tmp;
Reserves 8 bytes on the stack for two longs, a and b. Some of these bytes are filled in later to form the operand of the far jump. __asm__("cmpl %%ecx,_current\n\t"
Compares the global current (the running task) with ECX. Both hold pointers to a process's task_struct. ECX carries the pointer to the target task, supplied by the input operand "c" ((long) task[n])
. The result of the comparison sets the value of the state register EFLAGS: for example, ZF = 1, if both pointers are the same (x - x = 0). "je 1f\n\t"
je
instruction verifies that ZF = 1. If this is the case, it goes to the first label '1' after this point in the code, which is 8 lines ahead. "xchgl %%ecx,_current\n\t"
current is updated to reflect the new task: the pointer in ECX (task[n]) is exchanged with current, leaving ECX holding the previous task. Flags are not affected. "movw %%dx,%1\n\t" Copies DX into operand %1, which is __tmp.b, that is, bytes 5 through 8 of our reserved 8-byte structure. The DX value comes from the input operand "d" (_TSS(n))
. The multi-level _TSS macro expands into a valid TSS segment selector, which I will unpack below. The bottom line is that the low two bytes of __tmp.b (bytes 5 and 6 of the structure)
now contain a segment pointer to the next task. "ljmp %0\n\t"
ljmp
is here an indirect far jump. Three things deserve attention: first, it needs a 6-byte (48-bit) memory operand; second, operand %0 refers to the uninitialized variable __tmp.a; and finally, a far jump through a TSS selector in the GDT has special meaning on x86. Let's take these one at a time. The __tmp structure holds two 4-byte values and begins at element a. If we use a as the base address of a 6-byte operand, its top two bytes land inside __tmp.b. Those two bytes are the "segment selector" part of the far address, and when the processor sees that the selector names a TSS in the GDT, the 4-byte offset portion is ignored entirely. So the fact that __tmp.a is never initialized does not matter: __tmp.b already holds a valid selector thanks to the earlier movw. The selector itself always ends in four zero bits, which the _TSS(n) macro guarantees: the lowest two bits are the requested privilege level (00 = supervisor/kernel), the next zero selects the GDT (whose base was loaded into GDTR at boot), and the fourth zero is technically part of the segment index, forcing all TSS lookups onto even GDT entries. That selector picks out a TSS descriptor in the GDT, whose format is laid out in the 80386 programmer's manual.
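To make the operand layout concrete, here is a small user-space sketch (illustrative only, not kernel code; the struct, values, and names are assumptions mirroring the text above):

```c
#include <stdint.h>
#include <stdio.h>

/* Mirrors 'struct {long a,b;} __tmp;' on 32-bit x86 (long = 4 bytes). */
struct tmp {
	uint32_t a;	/* bytes 0-3: the (ignored) offset part of the far pointer */
	uint32_t b;	/* bytes 4-7: low 2 bytes hold the TSS selector written by movw */
};

int main(void)
{
	struct tmp t = { 0 };		/* 'a' is never initialized in the kernel either */
	uint16_t selector = 0x20;	/* _TSS(0): GDT index 4, TI=0, RPL=00 */

	*(uint16_t *)&t.b = selector;	/* roughly what 'movw %dx,__tmp.b' does */

	/* ljmp would read 6 bytes starting at &t.a: 4-byte offset + 2-byte selector. */
	printf("offset (ignored) = 0x%08x, selector = 0x%04x\n",
	       *(uint32_t *)&t.a, *(uint16_t *)((char *)&t.a + 4));
	return 0;
}
```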
"ljmp %0\n\t" This single instruction performs every step of the hardware context switch. All that remains is a bit of tidying up. "cmpl %%ecx,%2\n\t" Compares the task pointer in ECX against the last_task_used_math pointer. The hardware task switch does not manage the coprocessor, so the kernel tracks FPU ownership itself and uses the TS flag to decide whether the FPU still holds another task's context. "jne 1f\n\t"
"clts\n"
"1:"
::"m" (*&__tmp.a),
"m" (*&__tmp.b),
"m" (last_task_used_math),
A pointer to the task_struct of the task whose coprocessor state is currently loaded. "d" (_TSS(n)),
```c
/** include/linux/sched.h */
#define _TSS(n) ((((unsigned long) n)<<4)+(FIRST_TSS_ENTRY<<3))
#define FIRST_TSS_ENTRY 4
```
| Task # | 16-bit segment selector |
|---|---|
| 0 | 0000000000100 0 00 |
| 1 | 0000000000110 0 00 |
| 2 | 0000000001000 0 00 |
| 3 | 0000000001010 0 00 |
| 4 | 0000000001100 0 00 |
| 5 | 0000000001110 0 00 |
| 6 | 0000000010000 0 00 |
| 7 | 0000000010010 0 00 |
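As a quick sanity check, a few lines of user-space C (not kernel code) reproduce the table:

```c
#include <stdio.h>

#define FIRST_TSS_ENTRY 4
#define _TSS(n) ((((unsigned long) n)<<4)+(FIRST_TSS_ENTRY<<3))

int main(void)
{
	for (int n = 0; n < 8; n++) {
		unsigned long sel = _TSS(n);
		/* split into 13-bit index, 1-bit table indicator, 2-bit RPL */
		printf("task %d: index=%2lu TI=%lu RPL=%lu (selector 0x%02lx)\n",
		       n, sel >> 3, (sel >> 2) & 1, sel & 3, sel);
	}
	return 0;
}
```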
"c" ((long) task[n]));
_last_task_used_math
was removed as an input operand because the symbol is already visible in the global scope; the corresponding comparison now references it directly. The xchgl instruction also swapped places with movw so that it sits right next to the hardware context switch (ljmp). The problem is that these operations are not atomic: an interrupt can occur between xchgl and ljmp, which would leave current pointing at the wrong task while the real task's state is not yet saved. Putting the two instructions next to each other makes that window very small and the situation very unlikely. However, in a long-running system, "very unlikely" is a synonym for "inevitable".

```c
/** include/linux/sched.h */
#define switch_to(tsk) \
__asm__("cmpl %%ecx,_current\n\t" \
	"je 1f\n\t" \
	"cli\n\t" \
	"xchgl %%ecx,_current\n\t" \
	"ljmp %0\n\t" \
	"sti\n\t" \
	"cmpl %%ecx,_last_task_used_math\n\t" \
	"jne 1f\n\t" \
	"clts\n" \
	"1:" \
	: /* no output */ \
	:"m" (*(((char *)&tsk->tss.tr)-4)), \
	 "c" (tsk) \
	:"cx")
```
switch_to()
takes a pointer to a new task. So you can remove the __tmp
structure and instead reference the TSS directly. Let's go through each line. #define switch_to(tsk)
"__asm__("cmpl %%ecx,_current\n\t"
"je 1f\n\t"
"cli\n\t"
"xchgl %%ecx,_current\n\t" "ljmp %0\n\t"
"sti\n\t"
"cmpl %%ecx,_last_task_used_math\n\t" "jne 1f\n\t" "clts\n" "1:"
: /* no output */
:"m" (*(((char *)&tsk->tss.tr)-4)),
The tss.tr field holds the task's _TSS(task_number) selector for the GDT/TSS reference; this form was used in the kernel up to 1.0. We still back up 4 bytes and let ljmp read a 6-byte operand whose top two bytes are that selector. Fun! "c" (tsk)
:"cx")
switch_to() also moved: the architecture-dependent code was separated from the core kernel, so the x86 version now lives elsewhere, and only the version matching the build configuration is compiled in.

```c
/** include/asm-i386/system.h */
#define switch_to(tsk) do { \
__asm__("cli\n\t" \
	"xchgl %%ecx,_current\n\t" \
	"ljmp %0\n\t" \
	"sti\n\t" \
	"cmpl %%ecx,_last_task_used_math\n\t" \
	"jne 1f\n\t" \
	"clts\n" \
	"1:" \
	: /* no output */ \
	:"m" (*(((char *)&tsk->tss.tr)-4)), \
	 "c" (tsk) \
	:"cx"); \
	/* Now maybe reload the debug registers */ \
	if(current->debugreg[7]){ \
		loaddebug(0); \
		loaddebug(1); \
		loaddebug(2); \
		loaddebug(3); \
		loaddebug(6); \
	} \
} while (0)
```
The check for switching to the same task moved out of switch_to() into the scheduler's C code, while some debug-register handling moved the other way, from C into switch_to(), probably to keep it next to the context switch. Let's look at the changes. #define switch_to(tsk) do {
The body of switch_to() is now wrapped in a do { ... } while (0) block. This construct prevents errors when a multi-statement macro is expanded as the body of an if without braces. No such if surrounds switch_to() at the moment, but given how the scheduler code was being reshuffled, I suspect it was left in just in case. Compare the actual call site with the kind of usage it would protect:

```c
/* ...within schedule()... */
if (current == next)
	return;
kstat.context_swtch++;
switch_to(next);
```

```c
/* ...within schedule()... */
if (current != next)
	switch_to(next);	/* do-while(0) 'captures' the entire
				 * block to ensure a proper parse */
```
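As a generic illustration (plain user-space C, not kernel code), here is the classic pitfall that do { ... } while (0) guards against:

```c
/* A brace-only macro leaves a stray ';' after '}' when used under an if,
 * which terminates the if and orphans any following 'else'. */
#define SWAP_BAD(a, b)  { int t = (a); (a) = (b); (b) = t; }
#define SWAP_GOOD(a, b) do { int t = (a); (a) = (b); (b) = t; } while (0)

void example(int x, int y, int cond)
{
	if (cond)
		SWAP_GOOD(x, y);	/* expands to exactly one statement */
	else				/* with SWAP_BAD this 'else' would be a */
		x = y;			/* syntax error: no matching 'if' remains */
}
```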
__asm__("cli\n\t" "xchgl %%ecx,_current\n\t" "ljmp %0\n\t" "sti\n\t" "cmpl %%ecx,_last_task_used_math\n\t" "jne 1f\n\t" "clts\n" "1:" : /* no output */ :"m" (*(((char *)&tsk->tss.tr)-4)), "c" (tsk) :"cx");
Nothing new in the assembly itself: interrupts are disabled, current is swapped, the hardware context switch fires, and coprocessor usage is checked. /* Now maybe reload the debug registers */ if(current->debugreg[7]){ The debug-register reload now lives inside switch_to(). Exactly the same C sequence is used in 1.0. I guess the developers wanted to make sure that: 1) the debug registers are reloaded as close as possible to the context switch, and 2) switch_to() remains the last thing schedule() does
. loaddebug(0); loaddebug(1); loaddebug(2); loaddebug(3);
loaddebug(6);
} while (0)
This closes the do-while(0) wrapper around switch_to()'s body. Although the loop condition is always false, it guarantees that the parser treats the whole macro expansion as a single statement that cannot interact with neighbouring conditions in schedule(). Note the absence of a semicolon at the end: it is supplied where the macro is invoked, as switch_to(next);
There are now two versions of switch_to(): the single-processor (UP) version inherited from Linux 1.x and a new, improved one for symmetric multiprocessing (SMP). Let's consider the edits to the old code first, because some of them also appear in the SMP version.

```c
/** include/asm-i386/system.h */
#else  /* Single process only (not SMP) */
#define switch_to(prev,next) do { \
__asm__("movl %2,"SYMBOL_NAME_STR(current_set)"\n\t" \
	"ljmp %0\n\t" \
	"cmpl %1,"SYMBOL_NAME_STR(last_task_used_math)"\n\t" \
	"jne 1f\n\t" \
	"clts\n" \
	"1:" \
	: /* no outputs */ \
	:"m" (*(((char *)&next->tss.tr)-4)), \
	 "r" (prev), "r" (next)); \
	/* Now maybe reload the debug registers */ \
	if(prev->debugreg[7]){ \
		loaddebug(prev,0); \
		loaddebug(prev,1); \
		loaddebug(prev,2); \
		loaddebug(prev,3); \
		loaddebug(prev,6); \
	} \
} while (0)
#endif
```
switch_to() gains a new argument: the task_struct * of the process we are switching away from. #define switch_to(prev,next) do {
prev
names the task we are switching away from (a task_struct *). We still wrap the macro in do { ... } while (0) to protect it when it is used as the body of a single-line if. __asm__("movl %2,"SYMBOL_NAME_STR(current_set)"\n\t"
This does the same job as the old xchgl %%ecx,_current, except that current_set is now an array of task_struct pointers, and the symbol goes through a macro ( SYMBOL_NAME_STR ) that handles C identifiers in inline assembly. Why use the preprocessor for this? Some toolchains (a.out-style) require a leading underscore (_) prepended to C symbol names, while others (ELF) do not. Rather than hard-coding one convention, it can be configured at build time to match the toolchain; a short sketch of the macro follows this walkthrough. "ljmp %0\n\t" "cmpl %1,"SYMBOL_NAME_STR(last_task_used_math)"\n\t" "jne 1f\n\t" "clts\n" "1:" : /* no outputs */ :"m" (*(((char *)&next->tss.tr)-4)),
"r" (prev), "r" (next));
The prev and next pointers are passed in whatever registers the compiler picks ("r"); next was previously hard-coded into ECX. /* Now maybe reload the debug registers */ if(prev->debugreg[7]){ loaddebug(prev,0); loaddebug(prev,1); loaddebug(prev,2); loaddebug(prev,3); loaddebug(prev,6); } } while (0)
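As promised above, here is roughly how the symbol-name macro of that era was defined (reproduced from memory, so treat the exact form as approximate):

```c
/* include/linux/linkage.h (Linux 2.0 era, approximate) */
#ifdef __ELF__
#define SYMBOL_NAME_STR(X)	#X	/* ELF: C symbol 'current_set' stays "current_set" */
#else
#define SYMBOL_NAME_STR(X)	"_"#X	/* a.out: the assembler-level name gets a leading '_' */
#endif
```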
```c
/** include/asm-i386/system.h */
#ifdef __SMP__	/* Multiprocessing enabled */
#define switch_to(prev,next) do { \
	cli(); \
	if(prev->flags&PF_USEDFPU) \
	{ \
		__asm__ __volatile__("fnsave %0":"=m" (prev->tss.i387.hard)); \
		__asm__ __volatile__("fwait"); \
		prev->flags&=~PF_USEDFPU; \
	} \
	prev->lock_depth=syscall_count; \
	kernel_counter+=next->lock_depth-prev->lock_depth; \
	syscall_count=next->lock_depth; \
__asm__("pushl %%edx\n\t" \
	"movl "SYMBOL_NAME_STR(apic_reg)",%%edx\n\t" \
	"movl 0x20(%%edx), %%edx\n\t" \
	"shrl $22,%%edx\n\t" \
	"and $0x3C,%%edx\n\t" \
	"movl %%ecx,"SYMBOL_NAME_STR(current_set)"(,%%edx)\n\t" \
	"popl %%edx\n\t" \
	"ljmp %0\n\t" \
	"sti\n\t" \
	: /* no output */ \
	:"m" (*(((char *)&next->tss.tr)-4)), \
	 "c" (next)); \
	/* Now maybe reload the debug registers */ \
	if(prev->debugreg[7]){ \
		loaddebug(prev,0); \
		loaddebug(prev,1); \
		loaddebug(prev,2); \
		loaddebug(prev,3); \
		loaddebug(prev,6); \
	} \
} while (0)
```
if(prev->flags&PF_USEDFPU)
__asm__ __volatile__("fnsave %0":"=m" (prev->tss.i387.hard));
__volatile__
should protect this instruction from being modified by the optimizer. __asm__ __volatile__("fwait");
prev->flags&=~PF_USEDFPU;
prev->lock_depth=syscall_count;
kernel_counter+=next->lock_depth-prev->lock_depth;
syscall_count=next->lock_depth;
__asm__("pushl %%edx\n\t"
"movl "SYMBOL_NAME_STR(apic_reg)",%%edx\n\t"
apic_reg is the mapped base address of the local APIC registers, set up during OS initialization. "movl 0x20(%%edx), %%edx\n\t"
"shrl $22,%%edx\n\t"
"and $0x3C,%%edx\n\t"
"movl %%ecx,"SYMBOL_NAME_STR(current_set)"(,%%edx)\n\t"
"popl %%edx\n\t"
```c
/** include/asm-i386/system.h */
#define switch_to(prev,next) do {					\
	unsigned long eax, edx, ecx;					\
	asm volatile("pushl %%ebx\n\t"					\
		     "pushl %%esi\n\t"					\
		     "pushl %%edi\n\t"					\
		     "pushl %%ebp\n\t"					\
		     "movl %%esp,%0\n\t"	/* save ESP */		\
		     "movl %5,%%esp\n\t"	/* restore ESP */	\
		     "movl $1f,%1\n\t"		/* save EIP */		\
		     "pushl %6\n\t"		/* restore EIP */	\
		     "jmp __switch_to\n"				\
		     "1:\t"						\
		     "popl %%ebp\n\t"					\
		     "popl %%edi\n\t"					\
		     "popl %%esi\n\t"					\
		     "popl %%ebx"					\
		     :"=m" (prev->tss.esp),"=m" (prev->tss.eip),	\
		      "=a" (eax), "=d" (edx), "=c" (ecx)		\
		     :"m" (next->tss.esp),"m" (next->tss.eip),		\
		      "a" (prev), "d" (next));				\
} while (0)
```
switch_to()
is radically different from all previous versions: it is simple! The inline assembly only swaps the stack and instruction pointers (context-switching tasks 1 and 2); everything else happens after jumping into C code ( __switch_to()
). asm volatile("pushl %%ebx\n\t" "pushl %%esi\n\t" "pushl %%edi\n\t" "pushl %%ebp\n\t"
"movl %%esp,%0\n\t" /* save ESP */ "movl %5,%%esp\n\t" /* restore ESP */
The running task's stack pointer is saved into operand %0 ( prev->tss.esp ), and the new stack pointer is loaded from %5 ( next->tss.esp ). "movl $1f,%1\n\t" /* save EIP */
The address of local label 1 (further down) is stored as the saved instruction pointer, so this is where the task will resume when it is later switched back in. "pushl %6\n\t" /* restore EIP */ The next task's saved EIP ( next->tss.eip ) is pushed onto the already-switched stack, forging a return address. "jmp __switch_to\n" Jumping (rather than calling) into __switch_to() means its final ret pops that forged address and lands at the new task's saved EIP.
"popl %%ebp\n\t" "popl %%edi\n\t" "popl %%esi\n\t" "popl %%ebx"
/** arch/i386/kernel/process.c */ void __switch_to(struct task_struct *prev, struct task_struct *next) { /* Do the FPU save and set TS if it wasn't set before.. */ unlazy_fpu(prev); gdt_table[next->tss.tr >> 3].b &= 0xfffffdff; asm volatile("ltr %0": :"g" (*(unsigned short *)&next->tss.tr)); asm volatile("movl %%fs,%0":"=m" (*(int *)&prev->tss.fs)); asm volatile("movl %%gs,%0":"=m" (*(int *)&prev->tss.gs)); /* Re-load LDT if necessary */ if (next->mm->segments != prev->mm->segments) asm volatile("lldt %0": :"g" (*(unsigned short *)&next->tss.ldt)); /* Re-load page tables */ { unsigned long new_cr3 = next->tss.cr3; if (new_cr3 != prev->tss.cr3) asm volatile("movl %0,%%cr3": :"r" (new_cr3)); } /* Restore %fs and %gs. */ loadsegment(fs,next->tss.fs); loadsegment(gs,next->tss.gs); if (next->tss.debugreg[7]){ loaddebug(next,0); loaddebug(next,1); loaddebug(next,2); loaddebug(next,3); loaddebug(next,6); loaddebug(next,7); } }
__switch_to()
. This function is written in C and includes several familiar components, such as debug registers. Going to code C allows you to move them even closer to the context switch. unlazy_fpu(prev);
gdt_table[next->tss.tr >> 3].b &= 0xfffffdff;
The tss.tr field holds the task's TSS segment selector; its bottom three bits are the table indicator and privilege level, and since we only need the GDT index, they are shifted off. Then bit 9 (the 0x200 "busy" flag in the descriptor's second byte) is cleared so that the ltr below will not fault on a busy TSS. asm volatile("ltr %0": :"g" (*(unsigned short *)&next->tss.tr));
asm volatile("movl %%fs,%0":"=m" (*(int *)&prev->tss.fs)); asm volatile("movl %%gs,%0":"=m" (*(int *)&prev->tss.gs));
if (next->mm->segments != prev->mm->segments) asm volatile("lldt %0": :"g" (*(unsigned short *)&next->tss.ldt));
if (new_cr3 != prev->tss.cr3) asm volatile("movl %0,%%cr3": :"r" (new_cr3));
loadsegment(fs,next->tss.fs); loadsegment(gs,next->tss.gs);
loaddebug(prev,7);
```c
/** include/asm-i386/system.h */
#define switch_to(prev,next,last) do {					\
	asm volatile("pushl %%esi\n\t"					\
		     "pushl %%edi\n\t"					\
		     "pushl %%ebp\n\t"					\
		     "movl %%esp,%0\n\t"	/* save ESP */		\
		     "movl %3,%%esp\n\t"	/* restore ESP */	\
		     "movl $1f,%1\n\t"		/* save EIP */		\
		     "pushl %4\n\t"		/* restore EIP */	\
		     "jmp __switch_to\n"				\
		     "1:\t"						\
		     "popl %%ebp\n\t"					\
		     "popl %%edi\n\t"					\
		     "popl %%esi\n\t"					\
		     :"=m" (prev->thread.esp),"=m" (prev->thread.eip),	\
		      "=b" (last)					\
		     :"m" (next->thread.esp),"m" (next->thread.eip),	\
		      "a" (prev), "d" (next), "b" (prev));		\
} while (0)
```
The macro gains a third argument, last, which ends up holding the same value as prev. It is passed back through EBX but not otherwise used here. :"=m" (prev->thread.esp),"=m" (prev->thread.eip), :"m" (next->thread.esp),"m" (next->thread.eip),
/** arch/i386/kernel/process.c */ void __switch_to(struct task_struct *prev_p, struct task_struct *next_p) { struct thread_struct *prev = &prev_p->thread, *next = &next_p->thread; struct tss_struct *tss = init_tss + smp_processor_id(); unlazy_fpu(prev_p); tss->esp0 = next->esp0; asm volatile("movl %%fs,%0":"=m" (*(int *)&prev->fs)); asm volatile("movl %%gs,%0":"=m" (*(int *)&prev->gs)); /* Restore %fs and %gs. */ loadsegment(fs, next->fs); loadsegment(gs, next->gs); /* Now maybe reload the debug registers */ if (next->debugreg[7]){ loaddebug(next, 0); loaddebug(next, 1); loaddebug(next, 2); loaddebug(next, 3); /* no 4 and 5 */ loaddebug(next, 6); loaddebug(next, 7); } if (prev->ioperm || next->ioperm) { if (next->ioperm) { memcpy(tss->io_bitmap, next->io_bitmap, IO_BITMAP_SIZE*sizeof(unsigned long)); tss->bitmap = IO_BITMAP_OFFSET; } else tss->bitmap = INVALID_IO_BITMAP_OFFSET; } }
void __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
The task pointers now carry a _p suffix. That is a small but important detail, because the bare names prev and next are immediately reused for the thread_struct pointers: struct thread_struct *prev = &prev_p->thread, *next = &next_p->thread;
tss->esp0 = next->esp0;
asm volatile("movl %%fs,%0":"=m" (*(int *)&prev->fs)); asm volatile("movl %%gs,%0":"=m" (*(int *)&prev->gs));
if (prev->ioperm || next->ioperm) { if (next->ioperm) { memcpy(tss->io_bitmap, next->io_bitmap, IO_BITMAP_SIZE*sizeof(unsigned long)); tss->bitmap = IO_BITMAP_OFFSET;
} else tss->bitmap = INVALID_IO_BITMAP_OFFSET;
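These lines copy (or invalidate) the incoming task's I/O permission bitmap in the TSS. For context, this bitmap is what the ioperm(2) system call manipulates from user space; a small illustration (ordinary user-space C, not kernel code, requires root):

```c
/* ioperm() sets bits in the per-task I/O permission bitmap that the context
 * switch code above copies into the TSS for the running task. */
#include <stdio.h>
#include <sys/io.h>

int main(void)
{
	if (ioperm(0x378, 3, 1)) {	/* allow in/out on the legacy parallel port */
		perror("ioperm");
		return 1;
	}
	outb(0x00, 0x378);		/* now permitted without a #GP fault */
	return 0;
}
```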
```c
/** include/asm-i386/system.h */
#define switch_to(prev,next,last) do {					\
	unsigned long esi,edi;						\
	asm volatile("pushfl\n\t"					\
		     "pushl %%ebp\n\t"					\
		     "movl %%esp,%0\n\t"	/* save ESP */		\
		     "movl %5,%%esp\n\t"	/* restore ESP */	\
		     "movl $1f,%1\n\t"		/* save EIP */		\
		     "pushl %6\n\t"		/* restore EIP */	\
		     "jmp __switch_to\n"				\
		     "1:\t"						\
		     "popl %%ebp\n\t"					\
		     "popfl"						\
		     :"=m" (prev->thread.esp),"=m" (prev->thread.eip),	\
		      "=a" (last),"=S" (esi),"=D" (edi)			\
		     :"m" (next->thread.esp),"m" (next->thread.eip),	\
		      "2" (prev), "d" (next));				\
} while (0)
```
/** arch/i386/kernel/process.c */ struct task_struct * __switch_to(struct task_struct *prev_p, struct task_struct *next_p) { struct thread_struct *prev = &prev_p->thread, *next = &next_p->thread; int cpu = smp_processor_id(); struct tss_struct *tss = init_tss + cpu; __unlazy_fpu(prev_p); load_esp0(tss, next->esp0); /* Load the per-thread Thread-Local Storage descriptor. */ load_TLS(next, cpu); asm volatile("movl %%fs,%0":"=m" (*(int *)&prev->fs)); asm volatile("movl %%gs,%0":"=m" (*(int *)&prev->gs)); /* Restore %fs and %gs if needed. */ if (unlikely(prev->fs | prev->gs | next->fs | next->gs)) { loadsegment(fs, next->fs); loadsegment(gs, next->gs); } /* Now maybe reload the debug registers */ if (unlikely(next->debugreg[7])) { loaddebug(next, 0); loaddebug(next, 1); loaddebug(next, 2); loaddebug(next, 3); /* no 4 and 5 */ loaddebug(next, 6); loaddebug(next, 7); } if (unlikely(prev->io_bitmap_ptr || next->io_bitmap_ptr)) { if (next->io_bitmap_ptr) { memcpy(tss->io_bitmap, next->io_bitmap_ptr, IO_BITMAP_BYTES); tss->io_bitmap_base = IO_BITMAP_OFFSET; } else tss->io_bitmap_base = INVALID_IO_BITMAP_OFFSET; } return prev_p; }
unlikely(). I will not revisit existing code that merely got wrapped in likely()/unlikely(), to avoid re-explaining it. The macro simply tells the code generator which basic block to lay out first, to help branch prediction and pipelining (a sketch of how it is defined follows this walkthrough). struct task_struct *__switch_to(...)
load_TLS(next, cpu);
if (unlikely(prev->fs | prev->gs | next->fs | next->gs)) {
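As referenced above, here is roughly how these branch hints are defined (simplified; the real definitions in include/linux/compiler.h carry extra plumbing):

```c
/* Simplified sketch of the kernel's branch-prediction hints. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)
```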
```c
/** include/asm-x86_64/system.h */
#define SAVE_CONTEXT    "pushfq ; pushq %%rbp ; movq %%rsi,%%rbp\n\t"
#define RESTORE_CONTEXT "movq %%rbp,%%rsi ; popq %%rbp ; popfq\n\t"
#define __EXTRA_CLOBBER \
	,"rcx","rbx","rdx","r8","r9","r10","r11","r12","r13","r14","r15"

#define switch_to(prev,next,last)					  \
	asm volatile(SAVE_CONTEXT					  \
	     "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */	  \
	     "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */	  \
	     "call __switch_to\n\t"					  \
	     ".globl thread_return\n"					  \
	     "thread_return:\n\t"					  \
	     "movq %%gs:%P[pda_pcurrent],%%rsi\n\t"			  \
	     "movq %P[thread_info](%%rsi),%%r8\n\t"			  \
	     "btr %[tif_fork],%P[ti_flags](%%r8)\n\t"			  \
	     "movq %%rax,%%rdi\n\t"					  \
	     "jc ret_from_fork\n\t"					  \
	     RESTORE_CONTEXT						  \
	     : "=a" (last)						  \
	     : [next] "S" (next), [prev] "D" (prev),			  \
	       [threadrsp] "i" (offsetof(struct task_struct, thread.rsp)), \
	       [ti_flags] "i" (offsetof(struct thread_info, flags)),	  \
	       [tif_fork] "i" (TIF_FORK),				  \
	       [thread_info] "i" (offsetof(struct task_struct, thread_info)), \
	       [pda_pcurrent] "i" (offsetof(struct x8664_pda, pcurrent)) \
	     : "memory", "cc" __EXTRA_CLOBBER)
```
This is a new architecture with its own switch_to(), so we have to walk through its lines again. Many of the changes are just register names ( r..
instead of e..
). There are a few more helpers that I have mentioned above. asm volatile(SAVE_CONTEXT
"movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */
[threadrsp] is an immediate operand holding the offset of thread.rsp inside the task_struct. The %P modifier makes the compiler emit that constant bare (without the usual $ immediate prefix), so it can be used as a displacement off the %[prev] register; the net effect is that the updated stack pointer is stored into prev->thread.rsp. "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */
"call __switch_to\n\t"
".globl thread_return\n"
This declares a global symbol named thread_return. "thread_return:\n\t" And this is the label itself. Purely mechanically, execution simply falls through it to the next instruction. The symbol is not actually used in the kernel or in user libraries (glibc, for example); my guess was that pthreads might use it, but that does not appear to be the case. "movq %%gs:%P[pda_pcurrent],%%rsi\n\t"
"movq %P[thread_info](%%rsi),%%r8\n\t"
Loads the pointer to thread_info into R8. thread_info is new in Linux 2.6: in effect a lightweight companion to task_struct, small enough to live at the bottom of the task's kernel stack. "btr %[tif_fork],%P[ti_flags](%%r8)\n\t"
Tests the TIF_FORK bit in thread_info->flags and clears it, saving its old value in the carry flag. That bit is set by fork/clone on a freshly created task, and a couple of lines below it decides whether to branch to ret_from_fork. "movq %%rax,%%rdi\n\t"
Copies the previous task's task_struct pointer into RDI. The last instruction to touch RAX was the call to the C function __switch_to(), which returns prev in RAX. "jc ret_from_fork\n\t"
: "=a" (last)
: [next] "S" (next), [prev] "D" (prev), [threadrsp] "i" (offsetof(struct task_struct, thread.rsp)), [ti_flags] "i" (offsetof(struct thread_info, flags)), [tif_fork] "i" (TIF_FORK), [thread_info] "i" (offsetof(struct task_struct, thread_info)), [pda_pcurrent] "i" (offsetof(struct x8664_pda, pcurrent))
: "memory", "cc" __EXTRA_CLOBBER)
/** arch/x86_64/kernel/process.c */ struct task_struct *__switch_to(struct task_struct *prev_p, struct task_struct *next_p) { struct thread_struct *prev = &prev_p->thread, *next = &next_p->thread; int cpu = smp_processor_id(); struct tss_struct *tss = init_tss + cpu; unlazy_fpu(prev_p); tss->rsp0 = next->rsp0; asm volatile("movl %%es,%0" : "=m" (prev->es)); if (unlikely(next->es | prev->es)) loadsegment(es, next->es); asm volatile ("movl %%ds,%0" : "=m" (prev->ds)); if (unlikely(next->ds | prev->ds)) loadsegment(ds, next->ds); load_TLS(next, cpu); /* Switch FS and GS. */ { unsigned fsindex; asm volatile("movl %%fs,%0" : "=g" (fsindex)); if (unlikely(fsindex | next->fsindex | prev->fs)) { loadsegment(fs, next->fsindex); if (fsindex) prev->fs = 0; } /* when next process has a 64bit base use it */ if (next->fs) wrmsrl(MSR_FS_BASE, next->fs); prev->fsindex = fsindex; } { unsigned gsindex; asm volatile("movl %%gs,%0" : "=g" (gsindex)); if (unlikely(gsindex | next->gsindex | prev->gs)) { load_gs_index(next->gsindex); if (gsindex) prev->gs = 0; } if (next->gs) wrmsrl(MSR_KERNEL_GS_BASE, next->gs); prev->gsindex = gsindex; } /* Switch the PDA context. */ prev->userrsp = read_pda(oldrsp); write_pda(oldrsp, next->userrsp); write_pda(pcurrent, next_p); write_pda(kernelstack, (unsigned long)next_p->thread_info + THREAD_SIZE - PDA_STACKOFFSET); /* Now maybe reload the debug registers */ if (unlikely(next->debugreg7)) { loaddebug(next, 0); loaddebug(next, 1); loaddebug(next, 2); loaddebug(next, 3); /* no 4 and 5 */ loaddebug(next, 6); loaddebug(next, 7); } /* Handle the IO bitmap */ if (unlikely(prev->io_bitmap_ptr || next->io_bitmap_ptr)) { if (next->io_bitmap_ptr) { memcpy(tss->io_bitmap, next->io_bitmap_ptr, IO_BITMAP_BYTES); tss->io_bitmap_base = IO_BITMAP_OFFSET; } else { tss->io_bitmap_base = INVALID_IO_BITMAP_OFFSET; } } return prev_p; }
asm volatile("movl %%es,%0" : "=m" (prev->es)); if (unlikely(next->es | prev->es)) loadsegment(es, next->es);
asm volatile ("movl %%ds,%0" : "=m" (prev->ds)); if (unlikely(next->ds | prev->ds)) loadsegment(ds, next->ds);
unsigned fsindex; asm volatile("movl %%fs,%0" : "=g" (fsindex)); if (unlikely(fsindex | next->fsindex | prev->fs)) { loadsegment(fs, next->fsindex); if (fsindex) prev->fs = 0; }
The current FS selector is saved into fsindex, and FS is then reloaded for the new task if necessary. In essence, if either the old or the new task has a non-zero FS, something gets loaded in its place (possibly NULL). FS is typically used for thread-local storage, though there are other uses depending on where the context switch happens. Exactly the same code follows for GS, so there is no need to repeat it; GS usually holds a segment for thread-related data such as thread_info
. if (next->fs) wrmsrl(MSR_FS_BASE, next->fs);
prev->fsindex = fsindex;
prev->userrsp = read_pda(oldrsp); write_pda(oldrsp, next->userrsp); write_pda(pcurrent, next_p); write_pda(kernelstack, (unsigned long)next_p->thread_info + THREAD_SIZE - PDA_STACKOFFSET);
```c
/** arch/x86/include/asm/system.h */
#define SAVE_CONTEXT    "pushf ; pushq %%rbp ; movq %%rsi,%%rbp\n\t"
#define RESTORE_CONTEXT "movq %%rbp,%%rsi ; popq %%rbp ; popf\t"
#define __EXTRA_CLOBBER \
	,"rcx","rbx","rdx","r8","r9","r10","r11","r12","r13","r14","r15"

#define switch_to(prev, next, last)					  \
	asm volatile(SAVE_CONTEXT					  \
	     "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */	  \
	     "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */	  \
	     "call __switch_to\n\t"					  \
	     "movq "__percpu_arg([current_task])",%%rsi\n\t"		  \
	     __switch_canary						  \
	     "movq %P[thread_info](%%rsi),%%r8\n\t"			  \
	     "movq %%rax,%%rdi\n\t"					  \
	     "testl %[_tif_fork],%P[ti_flags](%%r8)\n\t"		  \
	     "jnz ret_from_fork\n\t"					  \
	     RESTORE_CONTEXT						  \
	     : "=a" (last)						  \
	       __switch_canary_oparam					  \
	     : [next] "S" (next), [prev] "D" (prev),			  \
	       [threadrsp] "i" (offsetof(struct task_struct, thread.sp)), \
	       [ti_flags] "i" (offsetof(struct thread_info, flags)),	  \
	       [_tif_fork] "i" (_TIF_FORK),				  \
	       [thread_info] "i" (offsetof(struct task_struct, stack)),  \
	       [current_task] "m" (current_task)			  \
	       __switch_canary_iparam					  \
	     : "memory", "cc" __EXTRA_CLOBBER)
```
switch_to()
has only four changes. Two of them are related to each other, and nothing is fundamentally new. movq "__percpu_arg([current_task])",%%rsi\n\t
Loads the pointer to the current task's task_struct into RSI. This is the "new" way to reach per-task information: each CPU has its own per-CPU current_task symbol, whereas previously it was reached through GS:[pda offset]. The subsequent RSI operations are the same as in 2.6. __switch_canary
testl %[_tif_fork],%P[ti_flags](%%r8)\n\t jnz ret_from_fork\n\t
Decides whether to branch to ret_from_fork(). Previously this used a btr instruction, but now clearing the bit is postponed until after the call completes. The jump became JNZ because the test changed: TEST performs an AND, which is non-zero when the bit is set. __switch_canary_oparam
This expands to an extra output operand for the stack canary when CONFIG_CC_STACKPROTECTOR is enabled, and to nothing otherwise. __switch_canary_iparam Likewise, the matching input operand, again only under CONFIG_CC_STACKPROTECTOR.
/** arch/x86/kernel/process_64.c */ __notrace_funcgraph struct task_struct * __switch_to(struct task_struct *prev_p, struct task_struct *next_p) { struct thread_struct *prev = &prev_p->thread; struct thread_struct *next = &next_p->thread; int cpu = smp_processor_id(); struct tss_struct *tss = &per_cpu(init_tss, cpu); unsigned fsindex, gsindex; bool preload_fpu; preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5; /* we're going to use this soon, after a few expensive things */ if (preload_fpu) prefetch(next->fpu.state); /* Reload esp0, LDT and the page table pointer: */ load_sp0(tss, next); savesegment(es, prev->es); if (unlikely(next->es | prev->es)) loadsegment(es, next->es); savesegment(ds, prev->ds); if (unlikely(next->ds | prev->ds)) loadsegment(ds, next->ds); savesegment(fs, fsindex); savesegment(gs, gsindex); load_TLS(next, cpu); __unlazy_fpu(prev_p); /* Make sure cpu is ready for new context */ if (preload_fpu) clts(); arch_end_context_switch(next_p); /* Switch FS and GS. */ if (unlikely(fsindex | next->fsindex | prev->fs)) { loadsegment(fs, next->fsindex); if (fsindex) prev->fs = 0; } /* when next process has a 64bit base use it */ if (next->fs) wrmsrl(MSR_FS_BASE, next->fs); prev->fsindex = fsindex; if (unlikely(gsindex | next->gsindex | prev->gs)) { load_gs_index(next->gsindex); if (gsindex) prev->gs = 0; } if (next->gs) wrmsrl(MSR_KERNEL_GS_BASE, next->gs); prev->gsindex = gsindex; /* Switch the PDA and FPU contexts. */ prev->usersp = percpu_read(old_rsp); percpu_write(old_rsp, next->usersp); percpu_write(current_task, next_p); percpu_write(kernel_stack, (unsigned long)task_stack_page(next_p) + THREAD_SIZE - KERNEL_STACK_OFFSET); /* Now maybe reload the debug registers and handle I/O bitmaps */ if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT || task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV)) __switch_to_xtra(prev_p, next_p, tss); /* Preload the FPU context - task is likely to be using it. */ if (preload_fpu) __math_state_restore(); return prev_p; }
__notrace_funcgraph struct task_struct * __switch_to(...)
__notrace_funcgraph keeps the ftrace function-graph tracer from tracing __switch_to(). preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5; if (preload_fpu) prefetch(next->fpu.state); The preload_fpu heuristic restores the FPU eagerly when the incoming task has been using it consistently (fpu_counter above 5), and prefetches its saved state early because it will be needed soon.
load_sp0(tss, next);
savesegment(es, prev->es); This is just a tidier wrapper around what used to be spelled out as asm volatile("movl %%es,%0" : "=m" (prev->es))
. if (preload_fpu) clts();
clts() here is the same idea we saw back in the very first version of Linux ( "cmpl %%ecx,%2 ... jne 1f ... clts" ): clear the TS flag when the incoming task is about to use the FPU. arch_end_context_switch(next_p);
if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT || task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV)) __switch_to_xtra(prev_p, next_p, tss);
This calls __switch_to_xtra() for the less common parts of switch_to, including the debug registers and I/O bitmap settings; more on it in the 4.14.67 code review below. if (preload_fpu) __math_state_restore();
switch_to()
now has its own header file, arch/x86/include/asm/switch_to.h. The macro is invoked exactly once, at the end of context_switch() in kernel/sched/core.c. switch_to() is split into two parts: the prepare_switch_to() macro, and the inline-assembly portion, which has moved into a real assembly file ( arch/x86/entry/entry_64.S ).

```c
/** arch/x86/include/asm/switch_to.h */
#define switch_to(prev, next, last)			\
do {							\
	prepare_switch_to(prev, next);			\
							\
	((last) = __switch_to_asm((prev), (next)));	\
} while (0)
```
prepare_switch_to(prev, next);
((last) = __switch_to_asm((prev), (next)));
Let's start with prepare_switch_to(), the part defined in the same header file.

```c
/** arch/x86/include/asm/switch_to.h */
static inline void prepare_switch_to(struct task_struct *prev,
				     struct task_struct *next)
{
#ifdef CONFIG_VMAP_STACK
	READ_ONCE(*(unsigned char *)next->thread.sp);
#endif
}
```
#ifdef CONFIG_VMAP_STACK This code is included only when the kernel is built with virtually mapped kernel stacks (CONFIG_VMAP_STACK=y). READ_ONCE(*(unsigned char *)next->thread.sp); Touching one byte of the next task's stack while we are still running on a valid stack faults its mapping in now, rather than risking a fault (and a double fault) right after the stack pointer has been switched to it.
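For reference, READ_ONCE() boils down to a volatile access the compiler cannot elide or reorder; a simplified sketch (the real macro in include/linux/compiler.h handles more cases):

```c
/* Simplified sketch of READ_ONCE() for scalar types; illustrative only. */
#define READ_ONCE_SKETCH(x)  (*(const volatile typeof(x) *)&(x))
```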
```asm
/** arch/x86/entry/entry_64.S */
ENTRY(__switch_to_asm)
	UNWIND_HINT_FUNC
	/* Save callee-saved registers */
	pushq	%rbp
	pushq	%rbx
	pushq	%r12
	pushq	%r13
	pushq	%r14
	pushq	%r15

	/* switch stack */
	movq	%rsp, TASK_threadsp(%rdi)
	movq	TASK_threadsp(%rsi), %rsp

#ifdef CONFIG_CC_STACKPROTECTOR
	movq	TASK_stack_canary(%rsi), %rbx
	movq	%rbx, PER_CPU_VAR(irq_stack_union)+stack_canary_offset
#endif

#ifdef CONFIG_RETPOLINE
	FILL_RETURN_BUFFER %r12, RSB_CLEAR_LOOPS, X86_FEATURE_RSB_CTXSW
#endif

	/* restore callee-saved registers */
	popq	%r15
	popq	%r14
	popq	%r13
	popq	%r12
	popq	%rbx
	popq	%rbp

	jmp	__switch_to
END(__switch_to_asm)
```
UNWIND_HINT_FUNC
pushq %rbp, %rbx, %r12, %r13, %r14, %r15
movq %rsp, TASK_threadsp(%rdi) movq TASK_threadsp(%rsi), %rsp
RDI and RSI hold the two task_struct * arguments, prev and next, in accordance with the System V AMD64 ABI: RDI carries the first argument (prev), RSI the second (next), RBX/RBP/R12-R15 are callee-saved (hence the pushes above), and RAX carries the return value. #ifdef CONFIG_CC_STACKPROTECTOR movq TASK_stack_canary(%rsi), %rbx movq %rbx, PER_CPU_VAR(irq_stack_union)+stack_canary_offset
#ifdef CONFIG_RETPOLINE FILL_RETURN_BUFFER %r12, RSB_CLEAR_LOOPS, X86_FEATURE_RSB_CTXSW
popq %r15, %r14, %r13, %r12, %rbx, %rbp
/** arch/x86/kernel/process_64.c */ __visible __notrace_funcgraph struct task_struct * __switch_to(struct task_struct *prev_p, struct task_struct *next_p) { struct thread_struct *prev = &prev_p->thread; struct thread_struct *next = &next_p->thread; struct fpu *prev_fpu = &prev->fpu; struct fpu *next_fpu = &next->fpu; int cpu = smp_processor_id(); struct tss_struct *tss = &per_cpu(cpu_tss_rw, cpu); WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) && this_cpu_read(irq_count) != -1); switch_fpu_prepare(prev_fpu, cpu); save_fsgs(prev_p); load_TLS(next, cpu); arch_end_context_switch(next_p); savesegment(es, prev->es); if (unlikely(next->es | prev->es)) loadsegment(es, next->es); savesegment(ds, prev->ds); if (unlikely(next->ds | prev->ds)) loadsegment(ds, next->ds); load_seg_legacy(prev->fsindex, prev->fsbase, next->fsindex, next->fsbase, FS); load_seg_legacy(prev->gsindex, prev->gsbase, next->gsindex, next->gsbase, GS); switch_fpu_finish(next_fpu, cpu); /* Switch the PDA and FPU contexts. */ this_cpu_write(current_task, next_p); this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p)); /* Reload sp0. */ update_sp0(next_p); /* Now maybe reload the debug registers and handle I/O bitmaps */ if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT || task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV)) __switch_to_xtra(prev_p, next_p, tss); #ifdef CONFIG_XEN_PV if (unlikely(static_cpu_has(X86_FEATURE_XENPV) && prev->iopl != next->iopl)) xen_set_iopl_mask(next->iopl); #endif if (static_cpu_has_bug(X86_BUG_SYSRET_SS_ATTRS)) { unsigned short ss_sel; savesegment(ss, ss_sel); if (ss_sel != __KERNEL_DS) loadsegment(ss, __KERNEL_DS); } /* Load the Intel cache allocation PQR MSR. */ intel_rdt_sched_in(); return prev_p; }
__visible __notrace_funcgraph struct task_struct * __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
The attributes on __switch_to() again protect it from the ftrace function-graph tracer; that annotation appeared around version 2.6.29 and has been applied here ever since. struct thread_struct *prev = &prev_p->thread; struct thread_struct *next = &next_p->thread; struct fpu *prev_fpu = &prev->fpu; struct fpu *next_fpu = &next->fpu;
These are convenience pointers into the two task_struct * arguments. The thread_struct holds the task's TSS-style data (registers and so on), while the fpu structure
contains FPU data, such as the last used CPU, initialization, and register values. int cpu = smp_processor_id();
struct tss_struct *tss = &per_cpu(cpu_tss_rw, cpu);
WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) && this_cpu_read(irq_count) != -1);
switch_fpu_prepare(prev_fpu, cpu);
save_fsgs(prev_p);
load_TLS(next, cpu);
arch_end_context_switch(next_p);
savesegment(es, prev->es); if (unlikely(next->es | prev->es)) loadsegment(es, next->es);
load_seg_legacy(prev->fsindex, prev->fsbase, next->fsindex, next->fsbase, FS);
switch_fpu_finish(next_fpu, cpu);
this_cpu_write(current_task, next_p);
This updates the per-CPU pointer to the current task ( task_struct * ), effectively the PDA/FPU context switch referred to in the comment (the per-processor data area). this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
update_sp0(next_p);
if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT || task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV)) __switch_to_xtra(prev_p, next_p, tss);
As before, the less common work (debug registers, I/O bitmaps, and friends) is deferred to __switch_to_xtra()
. #ifdef CONFIG_XEN_PV if (unlikely(static_cpu_has(X86_FEATURE_XENPV) && prev->iopl != next->iopl)) xen_set_iopl_mask(next->iopl);
if (static_cpu_has_bug(X86_BUG_SYSRET_SS_ATTRS)) { unsigned short ss_sel; savesegment(ss, ss_sel); if (ss_sel != __KERNEL_DS) loadsegment(ss, __KERNEL_DS);
intel_rdt_sched_in();
return prev_p;
Source: https://habr.com/ru/post/438042/