1 files changed, 451 insertions, 0 deletions
diff --git a/docs/mini-porting.txt b/docs/mini-porting.txt
new file mode 100644
index 00000000000..7cf14775b1d
--- /dev/null
+++ b/docs/mini-porting.txt
@@ -0,0 +1,451 @@
+		       Mono JIT porting guide.
+		   Paolo Molaro (lupus@ximian.com)
+
+* Introduction
+
+	This document describes the process of porting the mono JIT
+	to a new CPU architecture. The new mono JIT has been designed
+	to make porting easier while still enabling the port to take
+	full advantage of the new architecture's features and
+	instructions. Knowledge of the mini architecture (described in
+	the mini-doc.txt file) is a requirement for understanding this
+	guide, as well as an earlier document about porting the mono
+	interpreter (available on the web site).
+	
+	There are six main areas that a port needs to implement to
+	have a fully-functional JIT for a given architecture:
+	
+		1) instruction selection
+		2) native code emission
+		3) call conventions and register allocation
+		4) method trampolines
+		5) exception handling
+		6) minor helper methods
+	
+	To take advantage of some not-so-common processor features
+	(for example conditional execution of instructions as may be
+	found on ARM or ia64), it may be necessary to develop a
+	high-level optimization, but doing so is not a requirement for
+	getting the JIT to work.
+	
+	We'll look at each of the required steps in more detail. Note,
+	though, that a new port may just as well start from a
+	cut&paste of an existing port to a similar architecture (for
+	example from x86 to amd64, or from powerpc to sparc).
+	
+	The architecture-specific code is split from the rest of the
+	JIT. For example, the x86-specific code and data are all
+	included in the following files in the distribution:
+	
+		mini-x86.h mini-x86.c
+		inssel-x86.brg
+		cpu-pentium.md
+		tramp-x86.c 
+		exceptions-x86.c 
+	
+	I suggest a similar split for other architectures as well.
+	
+	Note that this document is still incomplete: some sections are
+	only sketched and some are missing, but the important info to
+	get a port going is already described.
+
+
+* Architecture-specific instructions and instruction selection.
+
+	The JIT already provides a set of instructions that can be
+	easily mapped to a great variety of different processor
+	instructions.  Sometimes it may be necessary or advisable to
+	add a new instruction that represents more closely an
+	instruction in the architecture.  A mini instruction can also
+	be used to represent a short sequence of low-level CPU
+	instructions, but each mini instruction represents the
+	minimum amount of code the instruction scheduler will handle
+	(i.e., the scheduler won't schedule the instructions that
+	compose the low-level sequence as individual instructions,
+	but just the whole sequence, as an indivisible
+	block).
+
+	New instructions are created by adding a line in the
+	mini-ops.h file, assigning an opcode and a name. To specify
+	the input and output for the instruction, there are two
+	different places, depending on the context in which the
+	instruction gets used.
+
+	If the instruction is used in the tree representation, the
+	input and output types are defined by the BURG rules in the
+	*.brg files (the usual non-terminals are 'reg' to represent a
+	normal register, 'lreg' to represent a register or two that
+	hold a 64 bit value, 'freg' for a floating point register).
+
+	If an instruction is used as a low-level CPU instruction, the
+	info is specified in a machine description file. The
+	description file is processed by the genmdesc program to
+	provide a data structure that can be easily used from C code
+	to query the needed info about the instruction.
+
+	As an example, let's consider the add instruction for both x86
+	and ppc:
+	
+	x86 version:
+		add: dest:i src1:i src2:i len:2 clob:1
+	ppc version:
+		add: dest:i src1:i src2:i len:4
+	
+	Note that the instruction takes two input integer registers on
+	both CPUs, but on x86 the first source register is clobbered
+	(clob:1) and the length in bytes of the instruction differs.
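+
+	As a rough illustration of the kind of data structure C code
+	could query, the two entries above can be pictured as table
+	rows. The InsDesc type, the table names and ins_desc_find ()
+	are hypothetical and chosen for this sketch only; the layout
+	that genmdesc really generates may differ.
+
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of one machine description entry.
 * The field names mirror the keywords used in the .md file. */
typedef struct {
	const char *name;
	char dest;  /* 'i' = integer register, 0 = none */
	char src1;
	char src2;
	int len;    /* length in bytes, as given by len: */
	char clob;  /* arch-defined clobber code, 0 = none */
} InsDesc;

/* The x86 "add" entry from the text: len:2 clob:1 */
static const InsDesc x86_desc[] = {
	{ "add", 'i', 'i', 'i', 2, '1' },
};

/* The ppc "add" entry from the text: len:4, no clobber */
static const InsDesc ppc_desc[] = {
	{ "add", 'i', 'i', 'i', 4, 0 },
};

/* Linear lookup by name; returns NULL if there is no such entry. */
static const InsDesc *
ins_desc_find (const InsDesc *table, size_t count, const char *name)
{
	size_t i;

	assert (table != NULL && name != NULL);
	for (i = 0; i < count; i++) {
		if (strcmp (table [i].name, name) == 0)
			return &table [i];
	}
	return NULL;
}
```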
+
+	Note that integer adds and floating point adds use different
+	opcodes, unlike the IL language (64 bit add is done with two
+	instructions on 32 bit architectures, using an add that sets
+	the carry and an add with carry).
+
+	A specific CPU port may assign any meaning to the clob field
+	for an instruction since the value will be processed in an
+	arch-specific file anyway.
+
+	See the top of the existing cpu-pentium.md file for more info
+	on other fields. The info may or may not be applicable to a
+	different CPU; if it is not, it can be ignored.
+
+	The code in mini.c together with the BURG rules in inssel.brg,
+	inssel-float.brg and inssel-long32.brg provides general
+	purpose mappings from the tree representation to a set of
+	instructions that should be easily implemented in any
+	architecture.  To allow for additional arch-specific
+	functionality, an arch-specific BURG file can be used: in this
+	file arch-specific instructions can be selected that provide
+	better performance than the general instructions or that
+	provide functionality that is needed by the JIT but that
+	cannot be expressed in a general enough way.
+	
+	As an example, x86 has the special instruction "push" to make
+	it easier to implement the default call convention (passing
+	arguments on the stack): almost all the other architectures
+	don't have such an instruction (and don't need it anyway), so
+	we added a special rule in the inssel-x86.brg file for it.
+	
+	So, one of the first things needed in a port is to write a
+	cpu-$(arch).md machine description file and fill it with the
+	needed info. As a start, only a few instructions can be
+	specified, like the ones required to do simple integer
+	operations. The default rules of the instruction selector will
+	emit the common instructions and so we're ready to go for the
+	next step in porting the JIT.
+	
+
+* Native code emission
+
+	Since the first step in porting mono to a new CPU is to port
+	the interpreter, there should already be a file that allows
+	the emission of binary native code in a buffer for the
+	architecture. This file should be placed in the
+
+		mono/arch/$(arch)/
+
+	directory.
+
+	The bulk of the code emission happens in the mini-$(arch).c
+	file, in a function called mono_arch_output_basic_block ().
+	This function takes a basic block, walks the list of
+	instructions in the block and emits the binary code for each.
+	Optionally a peephole optimization pass is done on the basic
+	block, but this can be left for later, when the port actually
+	works.
+
+	This function is very simple: there is just a big switch on
+	the instruction opcode, and in the corresponding case the
+	functions or macros that emit the binary native code are
+	used. Note that in this function the lengths of the
+	instructions are used to determine if the buffer for the code
+	needs enlarging.
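+
+	The shape of such a function can be sketched as follows. This
+	is a toy, not the real mono_arch_output_basic_block (): the
+	Toy* types, the three opcodes and the toy_max_len table are
+	made up for the example, and the bytes written are x86
+	encodings (nop, ret, mov al, imm8) used only to have something
+	concrete to emit.
+
```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical opcodes and instruction node, for illustration only. */
enum { TOY_NOP, TOY_RET, TOY_LOAD_IMM8 };

typedef struct ToyIns {
	int opcode;
	uint8_t imm;
	struct ToyIns *next;
} ToyIns;

typedef struct {
	uint8_t *buf;
	size_t len;   /* bytes emitted so far */
	size_t size;  /* bytes allocated */
} ToyBuffer;

/* Worst-case encoded length per opcode, as a description file would give. */
static const size_t toy_max_len[] = { 1, 1, 2 };

/* Grow the buffer until 'extra' more bytes are guaranteed to fit. */
static void
toy_reserve (ToyBuffer *code, size_t extra)
{
	while (code->len + extra > code->size) {
		code->size = code->size ? code->size * 2 : 16;
		code->buf = realloc (code->buf, code->size);
		assert (code->buf != NULL);
	}
}

/* Walk the instruction list of one block and emit bytes for each one. */
static void
toy_output_basic_block (ToyBuffer *code, const ToyIns *first)
{
	const ToyIns *ins;

	for (ins = first; ins != NULL; ins = ins->next) {
		assert (ins->opcode >= TOY_NOP && ins->opcode <= TOY_LOAD_IMM8);
		toy_reserve (code, toy_max_len [ins->opcode]);
		switch (ins->opcode) {
		case TOY_NOP:
			code->buf [code->len++] = 0x90;
			break;
		case TOY_RET:
			code->buf [code->len++] = 0xc3;
			break;
		case TOY_LOAD_IMM8:
			/* x86: mov al, imm8 */
			code->buf [code->len++] = 0xb0;
			code->buf [code->len++] = ins->imm;
			break;
		}
	}
}
```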
+	
+	To complete the code emission for a method, a few other
+	functions need implementing as well:
+	
+		mono_arch_emit_prolog ()
+		mono_arch_emit_epilog ()
+		mono_arch_patch_code ()
+	
+	mono_arch_emit_prolog () will emit the code to set up the stack
+	frame for a method, optionally call the callbacks used in
+	profiling and tracing, and move the arguments to their home
+	location (in a callee-saved register if the variable was
+	allocated to one, or in a stack location if the argument was
+	passed in a volatile register and wasn't allocated a
+	non-volatile one). Callee-saved registers used by the function
+	are saved in the prolog as well.
+	
+	mono_arch_emit_epilog () will emit the code needed to return
+	from the function, optionally calling the profiling or tracing
+	callbacks. At this point the basic blocks or the code that was
+	moved out of the normal flow for the function can be emitted
+	as well (this is usually done to provide better info for the
+	static branch predictor).  In the epilog, callee-saved
+	registers are restored if they were used.
+
+	Note that, to help exception handling and stack unwinding,
+	when there is a transition from managed to unmanaged code,
+	some special processing needs to be done (basically, saving
+	all the registers and setting up the links in the Last Managed
+	Frame structure).
+	
+	When the epilog has been emitted, the upper level code
+	arranges for the buffer of memory that contains the native
+	code to be copied to an area of executable memory. At this
+	point, instructions that use relative addressing need to be
+	patched to have the right offsets: this work is done by
+	mono_arch_patch_code ().
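+
+	To make the patching step concrete, here is a minimal sketch
+	for an x86-style 32-bit relative displacement, which is
+	measured from the end of the displacement field. The ToyPatch
+	record and toy_patch_code () are hypothetical and do not
+	reflect the real patch-info structures; the sketch also
+	assumes that the target is within reach of a signed 32-bit
+	displacement and that the host byte order matches the target.
+
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical patch record. */
typedef struct {
	size_t field_offset;  /* offset of the rel32 field inside the method */
	uintptr_t target;     /* absolute address the call/branch must reach */
} ToyPatch;

/* 'code' is the buffer being patched; 'final_base' is the address the
 * code will execute from once it has been copied to its final place. */
static void
toy_patch_code (uint8_t *code, uintptr_t final_base,
		const ToyPatch *patches, size_t count)
{
	size_t i;

	assert (code != NULL);
	for (i = 0; i < count; i++) {
		/* address of the instruction following the 4-byte field */
		uintptr_t next_ip = final_base + patches [i].field_offset + 4;
		int32_t disp = (int32_t) (patches [i].target - next_ip);

		memcpy (code + patches [i].field_offset, &disp, sizeof (disp));
	}
}
```
+
+	Because the displacement depends on final_base, the same
+	buffer copied to a different address needs different values,
+	which is why the patching happens only after the final
+	location is known.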
+
+
+* Call conventions and register allocation
+
+	To account for the differences in the call conventions, a few functions need to
+	be implemented.
+	
+	mono_arch_allocate_vars () assigns to both arguments and local
+	variables the offset relative to the frame register where they
+	are stored; dead variables are simply discarded. The total
+	amount of stack needed is calculated.
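+
+	A minimal sketch of this kind of offset assignment, under the
+	simplifying assumptions that offsets grow upwards from the
+	frame register and that every alignment is a power of two.
+	ToyVar and toy_allocate_vars () are hypothetical names; a real
+	port also has to place arguments according to the call
+	convention.
+
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variable record. */
typedef struct {
	int size;     /* size in bytes */
	int align;    /* required alignment, a power of two */
	int is_dead;  /* non-zero if the variable is never used */
	int offset;   /* out: offset from the frame register, -1 if dead */
} ToyVar;

/* Assign an aligned frame offset to each live variable and return the
 * total number of bytes of stack needed. */
static int
toy_allocate_vars (ToyVar *vars, size_t count)
{
	int offset = 0;
	size_t i;

	assert (vars != NULL || count == 0);
	for (i = 0; i < count; i++) {
		if (vars [i].is_dead) {
			vars [i].offset = -1;  /* discarded: no slot */
			continue;
		}
		/* round up to the variable's alignment */
		offset = (offset + vars [i].align - 1) & ~(vars [i].align - 1);
		vars [i].offset = offset;
		offset += vars [i].size;
	}
	return offset;
}
```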
+	
+	mono_arch_call_opcode () is the function that most closely
+	deals with the call convention on a given system. For each
+	argument to a function call, an instruction is created that
+	actually puts the argument where needed, be it the stack or a
+	specific register. This function can also re-arrange the order
+	of evaluation when multiple arguments are involved if needed
+	(for example, on x86 arguments are pushed on the stack in
+	reverse order). The function needs to carefully take into
+	account platform-specific issues, like how structures are
+	returned as well as the differences in size and/or alignment
+	of managed and corresponding unmanaged structures.
+	
+	The other chunk of code that needs to deal with the call
+	convention and other specifics of a CPU is the local register
+	allocator, implemented in a function named
+	mono_arch_local_regalloc (). The local allocator deals with
+	one basic block at a time and basically just allocates
+	registers for temporary values during expression evaluation,
+	spilling and unspilling as necessary.
+
+	The local allocator needs to take into account clobbering
+	information, both during simple instructions and during
+	function calls, and it needs to deal with other
+	architecture-specific weirdnesses, like instructions that take
+	inputs only in specific registers or output only in some.
+
+	Some effort will be put later in moving most of the local
+	register allocator to a common file so that the code can be
+	shared more for similar, RISC-like CPUs.  The register
+	allocator does a first pass on the instructions in a block,
+	collecting liveness information, and in a backward pass on the
+	same list performs the actual register allocation, inserting
+	the instructions needed to spill values, if necessary.
+	
+	When this part of code is implemented, some testing can be
+	done with the generated code for the new architecture. Most
+	helpful is the use of the --regression command line switch to
+	run the regression tests (basic.cs, for example).
+
+	Note that the JIT will try to initialize the runtime, but it
+	may not be able yet to compile and execute complex code:
+	commenting out most of the code in the mini_init() function in
+	mini.c is needed to let the JIT just compile the regression
+	tests.  Also, using multiple -v switches on the command line
+	makes the JIT dump an increasing amount of information during
+	compilation.
+	
+	
+* Method trampolines
+
+	To get better startup performance, the JIT actually compiles a
+	method only when needed. To achieve this, when a call to a
+	method is compiled, we actually emit a call to a magic
+	trampoline. The magic trampoline is a function written in
+	assembly that invokes the compiler to compile the given method
+	and jumps to the newly compiled code, ensuring the arguments
+	it received are passed correctly to the actual method.
+
+	Before jumping to the new code, though, the magic trampoline
+	takes care of patching the call site so that next time the
+	call will go directly to the method instead of the
+	trampoline. How does this all work?
+
+	mono_arch_create_jit_trampoline () creates a small function
+	that just preserves the arguments passed to it and adds an
+	additional argument (the method to compile) before calling the
+	generic trampoline. This small function is called the specific
+	trampoline, because it is method-specific (the method to
+	compile is hard-coded in the instruction stream).
+
+	The generic trampoline saves all the arguments that could get
+	clobbered and calls a C function that will do two things:
+	
+	*) actually call the JIT to compile the method
+	*) identify the calling code so that it can be patched to call directly
+	the actual method
+	
+	If the 'this' argument to a method is a boxed valuetype that
+	is passed to a method that expects just a pointer to the data,
+	an additional unboxing trampoline will need to be inserted as
+	well.
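+
+	The compile-on-first-call and call-site patching idea can be
+	modelled in plain C with function pointers. This is only an
+	analogy: the real trampolines are generated machine code, and
+	every name below (ToyMethod, ToyCallSite, toy_trampoline and
+	so on) is invented for the sketch. The method stored in the
+	call site plays the role of the extra argument supplied by the
+	specific trampoline.
+
```c
#include <assert.h>
#include <stddef.h>

typedef int (*ToyFn) (int arg);

/* Hypothetical method record; 'compiled' stays NULL until first use. */
typedef struct {
	ToyFn body;         /* stands in for the code the JIT would produce */
	ToyFn compiled;     /* native code, once available */
	int compile_count;  /* how many times the "JIT" actually ran */
} ToyMethod;

/* A call site: the method it targets and the address it calls through. */
typedef struct {
	ToyMethod *method;
	ToyFn target;       /* NULL means "still goes to the trampoline" */
} ToyCallSite;

/* Stand-in for the JIT: "compiling" just publishes the body. */
static ToyFn
toy_compile (ToyMethod *method)
{
	if (method->compiled == NULL) {
		method->compiled = method->body;
		method->compile_count++;
	}
	return method->compiled;
}

/* Plays the role of the trampoline: compile the method, patch the
 * call site, then forward the original argument to the new code. */
static int
toy_trampoline (ToyCallSite *site, int arg)
{
	ToyFn code = toy_compile (site->method);

	site->target = code;  /* the next call bypasses the trampoline */
	return code (arg);
}

/* What the emitted call instruction does at run time. */
static int
toy_call (ToyCallSite *site, int arg)
{
	assert (site != NULL && site->method != NULL);
	if (site->target != NULL)
		return site->target (arg);
	return toy_trampoline (site, arg);
}

/* A sample "method" to call through the machinery above. */
static int
toy_double (int x)
{
	return 2 * x;
}
```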
+	
+
+* Exception handling
+
+	Exception handling is likely the most difficult part of the
+	port, as it needs to deal with unwinding (both managed and
+	unmanaged code) and calling catch and filter blocks. It also
+	needs to deal with signals, because mono takes advantage of
+	the MMU in the CPU and of the operating system to handle
+	dereferences of the NULL pointer. Some of the functions needed
+	to implement the mechanisms are:
+	
+	mono_arch_get_throw_exception () returns a function that takes
+	an exception object and invokes an arch-specific function that
+	will enter the exception processing.  To do so, all the
+	relevant registers need to be saved and passed on.
+	
+	mono_arch_handle_exception () takes the exception thrown and
+	a context that describes the state of the CPU at the time the
+	exception was thrown. The function needs to implement the
+	exception handling mechanism, so it searches for a handler
+	for the exception and, if none is found, it follows the
+	unhandled exception path (that can print a trace and exit or
+	just abort the current thread). The difficulty here is to
+	unwind the stack correctly, by restoring the register state
+	at each call site in the call chain, calling finally, filter
+	and handler blocks while doing so.
+	
+	As part of exception handling a couple of internal calls need
+	to be implemented as well.
+
+	ves_icall_get_frame_info () returns info about a specific
+	frame.
+
+	mono_jit_walk_stack () walks the stack and calls a callback with info for
+	each frame found.
+
+	ves_icall_get_trace () returns an array of StackFrame objects.
+	
+** Code generation for filter/finally handlers
+
+	Filter and finally handlers are called from 2 different locations:
+	
+	       1.) from within the method containing the exception clauses
+	       2.) from the stack unwinding code
+	
+	To make this possible we implement them like subroutines,
+	ending with a "return" statement. The subroutine does not save
+	the base pointer, because we need access to the local
+	variables of the enclosing method. It is possible that
+	instructions inside those handlers modify the stack pointer,
+	thus we save the stack pointer at the start of the handler,
+	and restore it at the end. We have to use a "call" instruction
+	to execute such finally handlers.
+	
+	The MIR code for filter and finally handlers looks like:
+	
+	    OP_START_HANDLER
+	    ...
+	    OP_END_FINALLY | OP_ENDFILTER(reg)
+	
+	OP_START_HANDLER: should save the stack pointer somewhere
+	OP_END_FINALLY: restores the stack pointer and returns.
+	OP_ENDFILTER (reg): restores the stack pointer and returns the value in "reg".
+	
+** Calling finally/filter handlers 
+
+	There is a special opcode to call those handlers, called
+	OP_CALL_HANDLER. It simply emits a call instruction.
+	
+	It's a bit more complex to call a handler from outside (in the
+	stack unwinding code), because we have to restore the whole
+	context of the method first. After that we simply emit a call
+	instruction to invoke the handler. It's usually possible to use
+	the same code to call filter and finally handlers (see
+	arch_get_call_filter).
+	
+** Calling catch handlers
+
+	Catch handlers are always called from the stack unwinding
+	code. Unlike finally clauses or filters, catch handlers never
+	return. Instead we simply restore the whole context, and
+	restart execution at the catch handler.
+	
+** Passing Exception objects to catch handlers and filters.
+
+	We use a local variable to store exception objects. The stack
+	unwinding code must store the exception object into this
+	variable before calling a catch handler or filter.
+	
+* Minor helper methods
+
+	A few minor helper methods are referenced from the arch-independent code.
+	Some of them are:
+	
+	*) mono_arch_cpu_optimizations ()
+		This function returns a mask of optimizations that
+		should be enabled for the current CPU and a mask of
+		optimizations that should be excluded, instead.
+	
+	*) mono_arch_regname ()
+		Returns the name for a numeric register.
+	
+	*) mono_arch_get_allocatable_int_vars ()
+		Returns a list of variables that can be allocated to
+		the integer registers in the current architecture.
+	
+	*) mono_arch_get_global_int_regs ()
+		Returns a list of callee-saved registers that can be
+		used to allocate variables in the current method.
+	
+	*) mono_arch_instrument_mem_needs ()
+	*) mono_arch_instrument_prolog ()
+	*) mono_arch_instrument_epilog ()
+		Functions needed to implement the profiling interface.
+	
+	
+* Writing regression tests
+
+	For any bug found in the JIT, a regression test should be
+	added to one of the *.cs files in the mini directory.
+	Eventually all the operations of the JIT should be
+	tested (including the ones that get selected only when some
+	specific optimization is enabled).
+	
+
+* Platform specific optimizations
+
+	An example of a platform-specific optimization is the peephole
+	optimization: we look at a small window of code at a time and
+	we replace one or more instructions with others that perform
+	better for the given architecture or CPU.
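+
+	A minimal sketch of a peephole pass over an array of toy
+	instructions. The ToyPeepIns type and the two rewrite rules
+	are assumptions made for the example (they presume that a
+	register move has no side effects such as setting flags); a
+	real port works on the JIT's own instruction lists and uses
+	rules suited to its CPU.
+
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes and instruction record. */
enum { TOY_OP_NOP, TOY_OP_MOVE, TOY_OP_ADD };

typedef struct {
	int opcode;
	int dreg;  /* destination register */
	int sreg;  /* source register */
} ToyPeepIns;

/* Rewrites in place; returns how many instructions were turned into nops. */
static size_t
toy_peephole (ToyPeepIns *ins, size_t count)
{
	size_t i, removed = 0;

	assert (ins != NULL || count == 0);
	for (i = 0; i < count; i++) {
		if (ins [i].opcode != TOY_OP_MOVE)
			continue;
		/* mov r, r does nothing */
		if (ins [i].dreg == ins [i].sreg) {
			ins [i].opcode = TOY_OP_NOP;
			removed++;
			continue;
		}
		/* mov a, b ; mov b, a: the second move is redundant */
		if (i + 1 < count &&
		    ins [i + 1].opcode == TOY_OP_MOVE &&
		    ins [i + 1].dreg == ins [i].sreg &&
		    ins [i + 1].sreg == ins [i].dreg) {
			ins [i + 1].opcode = TOY_OP_NOP;
			removed++;
		}
	}
	return removed;
}
```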
+	
+* 64 bit support tips, by Zoltan Varga (vargaz@gmail.com)
+
+	For a 64-bit port of the Mono runtime, you will typically
+	need to do the following:
+
+		* use inssel-long.brg instead of
+		  inssel-long32.brg.
+
+		* implement lots of new opcodes:
+		       OP_I<OP> is the 32 bit op
+		       OP_L<OP> and CEE_<OP> are 64 bit ops
+
+
+	The 64 bit version of an existing port might share the code
+	with the 32 bit port (for example SPARC/SPARCV9), or it might
+	be separate (x86/AMD64).
+
+	That will depend on the similarities of the two instruction
+	sets, ABIs, etc.
+
+	The runtime and most parts of the JIT are 64 bit clean
+	at this point, so the only parts which require changing are
+	the arch dependent files.
+
+
+
+	
+\ No newline at end of file