Diffstat (limited to 'docs/mini-porting.txt')
-rw-r--r-- | docs/mini-porting.txt | 451 |
1 files changed, 451 insertions, 0 deletions
diff --git a/docs/mini-porting.txt b/docs/mini-porting.txt
new file mode 100644
index 00000000000..7cf14775b1d
--- /dev/null
+++ b/docs/mini-porting.txt
@@ -0,0 +1,451 @@
+	Mono JIT porting guide.
+	Paolo Molaro (lupus@ximian.com)
+
+* Introduction
+
+	This document describes the process of porting the mono JIT
+	to a new CPU architecture. The new mono JIT has been designed
+	to make porting easier while still enabling the port
+	to take full advantage of the new architecture's features and
+	instructions. Knowledge of the mini architecture (described in
+	the mini-doc.txt file) is a requirement for understanding this
+	guide, as well as an earlier document about porting the mono
+	interpreter (available on the web site).
+
+	There are six main areas that a port needs to implement to
+	have a fully-functional JIT for a given architecture:
+
+		1) instruction selection
+		2) native code emission
+		3) call conventions and register allocation
+		4) method trampolines
+		5) exception handling
+		6) minor helper methods
+
+	To take advantage of some not-so-common processor features
+	(for example conditional execution of instructions as may be
+	found on ARM or ia64), it may be necessary to develop a
+	high-level optimization, but doing so is not a requirement for
+	getting the JIT to work.
+
+	We'll look at each of the required steps in more detail. Note,
+	though, that a new port may just as well start from a
+	cut&paste of an existing port to a similar architecture (for
+	example from x86 to amd64, or from powerpc to sparc).
+
+	The architecture-specific code is split from the rest of the
+	JIT; for example, the x86-specific code and data is all
+	included in the following files in the distribution:
+
+		mini-x86.h mini-x86.c
+		inssel-x86.brg
+		cpu-pentium.md
+		tramp-x86.c
+		exceptions-x86.c
+
+	I suggest a similar split for other architectures as well.
+
+	Note that this document is still incomplete: some sections are
+	only sketched and some are missing, but the important info to
+	get a port going is already described.
+
+
+* Architecture-specific instructions and instruction selection.
+
+	The JIT already provides a set of instructions that can be
+	easily mapped to a great variety of different processor
+	instructions. Sometimes it may be necessary or advisable to
+	add a new instruction that represents more closely an
+	instruction in the architecture. Note that a mini instruction
+	can also be used to represent a short sequence of low-level
+	CPU instructions, but each mini instruction
+	represents the minimum amount of code the instruction
+	scheduler will handle (i.e., the scheduler won't schedule the
+	instructions that compose the low-level sequence as individual
+	instructions, but just the whole sequence, as an indivisible
+	block).
+
+	New instructions are created by adding a line in the
+	mini-ops.h file, assigning an opcode and a name. To specify
+	the input and output for the instruction, there are two
+	different places, depending on the context in which the
+	instruction gets used.
+
+	If the instruction is used in the tree representation, the
+	input and output types are defined by the BURG rules in the
+	*.brg files (the usual non-terminals are 'reg' to represent a
+	normal register, 'lreg' to represent a register or two that
+	hold a 64 bit value, 'freg' for a floating point register).
+
+	If an instruction is used as a low-level CPU instruction, the
+	info is specified in a machine description file. The
+	description file is processed by the genmdesc program to
+	provide a data structure that can be easily used from C code
+	to query the needed info about the instruction.
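Annotation: to make the idea of "a data structure that can be queried from C" concrete, here is a minimal sketch. It is not the structure genmdesc actually generates. The type ExInsDesc, the tables and ex_find_desc () are hypothetical names invented for illustration; the field names and values mirror the dest, src1, src2, len and clob fields of the x86 and ppc "add" descriptions quoted in this guide. Storing clob as 0 when the field is absent is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-instruction record; the real structure generated
 * by genmdesc may look quite different. */
typedef struct {
	const char *name;  /* instruction name as written in the .md file */
	char dest;         /* 'i' = integer register, 0 = none */
	char src1;
	char src2;
	int len;           /* length in bytes of the native code */
	int clob;          /* clobber info; 0 = field absent (assumption) */
} ExInsDesc;

/* The two "add" descriptions quoted in this guide. */
static const ExInsDesc ex_x86_desc [] = {
	{ "add", 'i', 'i', 'i', 2, 1 },
};
static const ExInsDesc ex_ppc_desc [] = {
	{ "add", 'i', 'i', 'i', 4, 0 },
};

/* Linear lookup by name; returns NULL when the instruction is unknown. */
static const ExInsDesc *
ex_find_desc (const ExInsDesc *table, size_t count, const char *name)
{
	size_t i;

	assert (table != NULL && name != NULL);
	for (i = 0; i < count; i++) {
		if (strcmp (table [i].name, name) == 0)
			return &table [i];
	}
	return NULL;
}
```

A code emitter could call ex_find_desc (ex_x86_desc, 1, "add") and read the len field to reserve buffer space before emitting the instruction. A real port would more likely index the table by opcode than search it by name.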
+
+	As an example, let's consider the add instruction for both x86
+	and ppc:
+
+	x86 version:
+		add: dest:i src1:i src2:i len:2 clob:1
+	ppc version:
+		add: dest:i src1:i src2:i len:4
+
+	Note that the instruction takes two input integer registers on
+	both CPUs, but on x86 the first source register is clobbered
+	(clob:1) and the length in bytes of the instruction differs.
+
+	Note that integer adds and floating point adds use different
+	opcodes, unlike the IL language (64 bit add is done with two
+	instructions on 32 bit architectures, using an add that sets
+	the carry and an add with carry).
+
+	A specific CPU port may assign any meaning to the clob field
+	for an instruction since the value will be processed in an
+	arch-specific file anyway.
+
+	See the top of the existing cpu-pentium.md file for more info
+	on other fields: the info may or may not be applicable to a
+	different CPU; in the latter case the info can be ignored.
+
+	The code in mini.c together with the BURG rules in inssel.brg,
+	inssel-float.brg and inssel-long32.brg provides general
+	purpose mappings from the tree representation to a set of
+	instructions that should be easily implemented in any
+	architecture. To allow for additional arch-specific
+	functionality, an arch-specific BURG file can be used: in this
+	file arch-specific instructions can be selected that provide
+	better performance than the general instructions or that
+	provide functionality that is needed by the JIT but that
+	cannot be expressed in a general enough way.
+
+	As an example, x86 has the special instruction "push" to make
+	it easier to implement the default call convention (passing
+	arguments on the stack): almost all the other architectures
+	don't have such an instruction (and don't need it anyway), so
+	we added a special rule in the inssel-x86.brg file for it.
+
+	So, one of the first things needed in a port is to write a
+	cpu-$(arch).md machine description file and fill it with the
+	needed info. As a start, only a few instructions can be
+	specified, like the ones required to do simple integer
+	operations. The default rules of the instruction selector will
+	emit the common instructions and so we're ready to go for the
+	next step in porting the JIT.
+
+
+* Native code emission
+
+	Since the first step in porting mono to a new CPU is to port
+	the interpreter, there should already be a file that allows
+	the emission of binary native code in a buffer for the
+	architecture. This file should be placed in the
+
+		mono/arch/$(arch)/
+
+	directory.
+
+	The bulk of the code emission happens in the mini-$(arch).c
+	file, in a function called mono_arch_output_basic_block
+	(). This function takes a basic block, walks the list of
+	instructions in the block and emits the binary code for each.
+	Optionally a peephole optimization pass is done on the basic
+	block, but this can be left for later, when the port actually
+	works.
+
+	This function is very simple: there is just a big switch on
+	the instruction opcode and in the corresponding case the
+	functions or macros to emit the binary native code are
+	used. Note that in this function the lengths of the
+	instructions are used to determine if the buffer for the code
+	needs enlarging.
+
+	To complete the code emission for a method, a few other
+	functions need implementing as well:
+
+		mono_arch_emit_prolog ()
+		mono_arch_emit_epilog ()
+		mono_arch_patch_code ()
+
+	mono_arch_emit_prolog () will emit the code to set up the stack
+	frame for a method, optionally call the callbacks used in
+	profiling and tracing, and move the arguments to their home
+	location (in a callee-saved register if the variable was
+	allocated to one, or in a stack location if the argument was
+	passed in a volatile register and wasn't allocated a
+	non-volatile one). Callee-saved registers used by the function
+	are saved in the prolog as well.
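Annotation: the emission loop described for mono_arch_output_basic_block () (walk the instruction list, enlarge the buffer using the known instruction lengths, then switch on the opcode) can be sketched roughly as below. This is an illustration under stated assumptions, not the code from mini-x86.c. The types ExInst and ExCodeBuf, the EX_OP_* opcodes and the ex_* functions are all hypothetical. The byte values are the standard x86 encodings for nop (0x90), ret (0xC3) and "add r/m32, r32" (0x01 followed by a ModRM byte).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical opcodes and instruction node, not the real mini types. */
enum { EX_OP_NOP, EX_OP_ADD, EX_OP_RET, EX_OP_LAST };

typedef struct ExInst {
	int opcode;
	int dreg, sreg2;        /* x86 register numbers 0..7 */
	struct ExInst *next;
} ExInst;

typedef struct {
	uint8_t *code;
	size_t len;             /* bytes emitted so far */
	size_t size;            /* bytes allocated */
} ExCodeBuf;

/* Worst-case length per opcode: the role played by the len field of
 * the machine description. */
static const size_t ex_max_len [EX_OP_LAST] = { 1, 2, 1 };

/* Grow the buffer until `needed` more bytes fit. */
static void
ex_ensure_space (ExCodeBuf *buf, size_t needed)
{
	size_t new_size = buf->size ? buf->size : 16;
	uint8_t *p;

	if (buf->len + needed <= buf->size)
		return;
	while (buf->len + needed > new_size)
		new_size *= 2;
	p = realloc (buf->code, new_size);
	assert (p != NULL);
	buf->code = p;
	buf->size = new_size;
}

static void
ex_output_basic_block (ExCodeBuf *buf, const ExInst *first)
{
	const ExInst *ins;

	for (ins = first; ins != NULL; ins = ins->next) {
		assert (ins->opcode >= 0 && ins->opcode < EX_OP_LAST);
		ex_ensure_space (buf, ex_max_len [ins->opcode]);
		switch (ins->opcode) {
		case EX_OP_NOP:
			buf->code [buf->len++] = 0x90;
			break;
		case EX_OP_ADD:
			/* add dreg, sreg2: ModRM = 11 | src << 3 | dst */
			buf->code [buf->len++] = 0x01;
			buf->code [buf->len++] =
				(uint8_t) (0xC0 | (ins->sreg2 << 3) | ins->dreg);
			break;
		case EX_OP_RET:
			buf->code [buf->len++] = 0xC3;
			break;
		default:
			break;
		}
	}
}
```

Emitting a two-instruction block ("add eax, ecx" followed by "ret") into an empty ExCodeBuf yields the three bytes 01 C8 C3. A real port would use the emission macros from the mono/arch/$(arch)/ directory instead of writing bytes by hand.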
+
+	mono_arch_emit_epilog () will emit the code needed to return
+	from the function, optionally calling the profiling or tracing
+	callbacks. At this point the basic blocks or the code that was
+	moved out of the normal flow for the function can be emitted
+	as well (this is usually done to provide better info for the
+	static branch predictor). In the epilog, callee-saved
+	registers are restored if they were used.
+
+	Note that, to help exception handling and stack unwinding,
+	when there is a transition from managed to unmanaged code,
+	some special processing needs to be done (basically, saving
+	all the registers and setting up the links in the Last Managed
+	Frame structure).
+
+	When the epilog has been emitted, the upper level code
+	arranges for the buffer of memory that contains the native
+	code to be copied to an area of executable memory and at this
+	point, instructions that use relative addressing need to be
+	patched to have the right offsets: this work is done by
+	mono_arch_patch_code ().
+
+
+* Call conventions and register allocation
+
+	To account for the differences in the call conventions, a few functions need to
+	be implemented.
+
+	mono_arch_allocate_vars () assigns to both arguments and local
+	variables the offset relative to the frame register where they
+	are stored; dead variables are simply discarded. The total
+	amount of stack needed is calculated.
+
+	mono_arch_call_opcode () is the function that most closely
+	deals with the call convention on a given system. For each
+	argument to a function call, an instruction is created that
+	actually puts the argument where needed, be it the stack or a
+	specific register. This function can also re-arrange the order
+	of evaluation when multiple arguments are involved if needed
+	(for example, on x86 arguments are pushed on the stack in reverse
+	order). The function needs to carefully take into account
+	platform specific issues, like how structures are returned as
+	well as the differences in size and/or alignment of managed
+	and corresponding unmanaged structures.
+
+	The other chunk of code that needs to deal with the call
+	convention and other specifics of a CPU is the local register
+	allocator, implemented in a function named
+	mono_arch_local_regalloc (). The local allocator deals with a
+	basic block at a time and basically just allocates registers
+	for temporary values during expression evaluation, spilling
+	and unspilling as necessary.
+
+	The local allocator needs to take into account clobbering
+	information, both during simple instructions and during
+	function calls and it needs to deal with other
+	architecture-specific weirdnesses, like instructions that take
+	inputs only in specific registers or output only in some.
+
+	Some effort will be put later in moving most of the local
+	register allocator to a common file so that the code can be
+	shared more for similar, risc-like CPUs. The register
+	allocator does a first pass on the instructions in a block,
+	collecting liveness information and in a backward pass on the
+	same list performs the actual register allocation, inserting
+	the instructions needed to spill values, if necessary.
+
+	When this part of code is implemented, some testing can be
+	done with the generated code for the new architecture. Most
+	helpful is the use of the --regression command line switch to
+	run the regression tests (basic.cs, for example).
+
+	Note that the JIT will try to initialize the runtime, but it
+	may not be able yet to compile and execute complex code:
+	commenting out most of the code in the mini_init() function in
+	mini.c is needed to let the JIT just compile the regression
+	tests. Also, using multiple -v switches on the command line
+	makes the JIT dump an increasing amount of information during
+	compilation.
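Annotation: the backward pass of the local register allocator described in this section can be illustrated with a heavily simplified sketch. It is not mono_arch_local_regalloc (). ExIns, the ex_* functions and the EX_* constants are hypothetical names. The sketch assumes every virtual register is defined at most once in the block, it discovers last uses during the backward walk instead of in a separate liveness pass, it ignores clobbering and fixed-register constraints, and it only flags values that would need spilling rather than inserting spill code.

```c
#include <assert.h>
#include <stddef.h>

#define EX_NUM_HREGS 2     /* tiny register file so spilling shows up */
#define EX_NONE  (-1)      /* no register / not assigned yet */
#define EX_SPILL (-2)      /* value would have to live on the stack */

/* Hypothetical instruction: virtual register numbers, or EX_NONE. */
typedef struct {
	int dreg, sreg1, sreg2;
} ExIns;

/* Take the first free hard register for vreg, or report a spill. */
static int
ex_take_hreg (int *owner, int vreg)
{
	int h;

	for (h = 0; h < EX_NUM_HREGS; h++) {
		if (owner [h] == EX_NONE) {
			owner [h] = vreg;
			return h;
		}
	}
	return EX_SPILL;
}

/* assign [v] receives the hard register for vreg v, or EX_SPILL. */
static void
ex_local_regalloc (const ExIns *ins, int n_ins, int *assign, int n_vregs)
{
	int owner [EX_NUM_HREGS];
	int i, k, h;

	assert (ins != NULL && assign != NULL);
	for (h = 0; h < EX_NUM_HREGS; h++)
		owner [h] = EX_NONE;
	for (i = 0; i < n_vregs; i++)
		assign [i] = EX_NONE;

	/* Walk backward: the first time a vreg is seen is its last use. */
	for (i = n_ins - 1; i >= 0; i--) {
		int d = ins [i].dreg;
		int srcs [2];

		srcs [0] = ins [i].sreg1;
		srcs [1] = ins [i].sreg2;

		/* Above its definition the value is dead: free its register. */
		if (d != EX_NONE) {
			if (assign [d] == EX_NONE)
				assign [d] = ex_take_hreg (owner, d);
			if (assign [d] >= 0)
				owner [assign [d]] = EX_NONE;
		}
		for (k = 0; k < 2; k++) {
			int s = srcs [k];

			if (s != EX_NONE && assign [s] == EX_NONE)
				assign [s] = ex_take_hreg (owner, s);
		}
	}
}
```

With two hard registers, a block that keeps three values live at the same time ends up with one of them marked EX_SPILL. A real allocator would insert load and store instructions at that point.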
+
+
+* Method trampolines
+
+	To get better startup performance, the JIT actually compiles a
+	method only when needed. To achieve this, when a call to a
+	method is compiled, we actually emit a call to a magic
+	trampoline. The magic trampoline is a function written in
+	assembly that invokes the compiler to compile the given method
+	and jumps to the newly compiled code, ensuring the arguments
+	it received are passed correctly to the actual method.
+
+	Before jumping to the new code, though, the magic trampoline
+	takes care of patching the call site so that next time the
+	call will go directly to the method instead of the
+	trampoline. How does this all work?
+
+	mono_arch_create_jit_trampoline () creates a small function
+	that just preserves the arguments passed to it and adds an
+	additional argument (the method to compile) before calling the
+	generic trampoline. This small function is called the specific
+	trampoline, because it is method-specific (the method to
+	compile is hard-coded in the instruction stream).
+
+	The generic trampoline saves all the arguments that could get
+	clobbered and calls a C function that will do two things:
+
+	*) actually call the JIT to compile the method
+	*) identify the calling code so that it can be patched to call directly
+	the actual method
+
+	If the 'this' argument to a method is a boxed valuetype that
+	is passed to a method that expects just a pointer to the data,
+	an additional unboxing trampoline will need to be inserted as
+	well.
+
+
+* Exception handling
+
+	Exception handling is likely the most difficult part of the
+	port, as it needs to deal with unwinding (both managed and
+	unmanaged code) and calling catch and filter blocks. It also
+	needs to deal with signals, because mono takes advantage of
+	the MMU in the CPU and of the operating system to handle
+	dereferences of the NULL pointer. Some of the functions needed
+	to implement the mechanisms are:
+
+	mono_arch_get_throw_exception () returns a function that takes
+	an exception object and invokes an arch-specific function that
+	will enter the exception processing. To do so, all the
+	relevant registers need to be saved and passed on.
+
+	mono_arch_handle_exception () takes the
+	exception thrown and a context that describes the state of the
+	CPU at the time the exception was thrown. The function needs
+	to implement the exception handling mechanism, so it makes a
+	search for a handler for the exception and if none is found,
+	it follows the unhandled exception path (that can print a
+	trace and exit or just abort the current thread). The
+	difficulty here is to unwind the stack correctly, by restoring
+	the register state at each call site in the call chain,
+	calling finally, filter and handler blocks while doing so.
+
+	As part of exception handling a couple of internal calls need
+	to be implemented as well.
+
+	ves_icall_get_frame_info () returns info about a specific
+	frame.
+
+	mono_jit_walk_stack () walks the stack and calls a callback with info for
+	each frame found.
+
+	ves_icall_get_trace () returns an array of StackFrame objects.
+
+** Code generation for filter/finally handlers
+
+	Filter and finally handlers are called from 2 different locations:
+
+		1.) from within the method containing the exception clauses
+		2.) from the stack unwinding code
+
+	To make this possible we implement them like subroutines,
+	ending with a "return" statement. The subroutine does not save
+	the base pointer, because we need access to the local
+	variables of the enclosing method. It is possible that
+	instructions inside those handlers modify the stack pointer,
+	thus we save the stack pointer at the start of the handler,
+	and restore it at the end. We have to use a "call" instruction
+	to execute such finally handlers.
+
+	The MIR code for filter and finally handlers looks like:
+
+		OP_START_HANDLER
+		...
+		OP_END_FINALLY | OP_ENDFILTER(reg)
+
+	OP_START_HANDLER: should save the stack pointer somewhere
+	OP_END_FINALLY: restores the stack pointer and returns.
+	OP_ENDFILTER (reg): restores the stack pointer and returns the value in "reg".
+
+** Calling finally/filter handlers
+
+	There is a special opcode to call those handlers, called
+	OP_CALL_HANDLER. It simply emits a call instruction.
+
+	It's a bit more complex to call a handler from outside (in the
+	stack unwinding code), because we have to restore the whole
+	context of the method first. After that we simply emit a call
+	instruction to invoke the handler. It's usually possible to use
+	the same code to call filter and finally handlers (see
+	arch_get_call_filter).
+
+** Calling catch handlers
+
+	Catch handlers are always called from the stack unwinding
+	code. Unlike finally clauses or filters, catch handlers never
+	return. Instead we simply restore the whole context, and
+	restart execution at the catch handler.
+
+** Passing Exception objects to catch handlers and filters.
+
+	We use a local variable to store exception objects. The stack
+	unwinding code must store the exception object into this
+	variable before calling a catch handler or filter.
+
+* Minor helper methods
+
+	A few minor helper methods are referenced from the arch-independent code.
+	Some of them are:
+
+	*) mono_arch_cpu_optimizations ()
+		This function returns a mask of optimizations that
+		should be enabled for the current CPU and a mask of
+		optimizations that should be excluded, instead.
+
+	*) mono_arch_regname ()
+		Returns the name for a numeric register.
+
+	*) mono_arch_get_allocatable_int_vars ()
+		Returns a list of variables that can be allocated to
+		the integer registers in the current architecture.
+
+	*) mono_arch_get_global_int_regs ()
+		Returns a list of callee-saved registers that can be
+		used to allocate variables in the current method.
+
+	*) mono_arch_instrument_mem_needs ()
+	*) mono_arch_instrument_prolog ()
+	*) mono_arch_instrument_epilog ()
+		Functions needed to implement the profiling interface.
+
+
+* Writing regression tests
+
+	Regression tests for the JIT should be written for any bug
+	found in the JIT in one of the *.cs files in the mini
+	directory. Eventually all the operations of the JIT should be
+	tested (including the ones that get selected only when some
+	specific optimization is enabled).
+
+
+* Platform specific optimizations
+
+	An example of a platform-specific optimization is the peephole
+	optimization: we look at a small window of code at a time and
+	we replace one or more instructions with others that perform
+	better for the given architecture or CPU.
+
+* 64 bit support tips, by Zoltan Varga (vargaz@gmail.com)
+
+	For a 64-bit port of the Mono runtime, you will typically do
+	the following:
+
+	* use inssel-long.brg instead of
+	inssel-long32.brg.
+
+	* implement lots of new opcodes:
+		OP_I<OP> is a 32 bit op
+		OP_L<OP> and CEE_<OP> are 64 bit ops
+
+
+	The 64 bit version of an existing port might share the code
+	with the 32 bit port (for example SPARC/SPARCV9), or it might
+	be separate (x86/AMD64).
+
+	That will depend on the similarities of the two instruction
+	sets/ABIs etc.
+
+	The runtime and most parts of the JIT are 64 bit clean
+	at this point, so the only parts which require changing are
+	the arch dependent files.
+
+
+
+
\ No newline at end of file